WPI Worcester Polytechnic Institute

Computer Science Department
------------------------------------------

CS 548 KNOWLEDGE DISCOVERY AND DATA MINING - Spring 2015  
Project 1: Data Integration, Data Warehousing, Data Pre-processing

PROF. CAROLINA RUIZ 

Due Date: Feb. 10th 2015.
------------------------------------------

Instructions


Problem I. Knowledge Discovery in Databases (20 points)

  1. (5 points) Define knowledge discovery in databases.

  2. (10 points) Briefly describe the steps of the knowledge discovery in databases process.

  3. (5 points) Define data mining.

Base your answers on the definitions presented in class, the textbook, and the following paper: Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. "From Data Mining to Knowledge Discovery in Databases". AI Magazine, pp. 37-54. Fall 1996.

Problem II. Data Preprocessing (65 points)

Consider the following dataset.
   DATE       OUTLOOK         TEMPERATURE   HUMIDITY    WIND    PLAYS 

   02/13/12   mostly sunny    47            25          strong  no 
   03/10/12   mostly cloudy   66            57          weak    yes
   06/28/12   cloudy          91            75          medium  yes
   07/12/12   sunny           82            27          strong  no
   08/30/12   rainy           76            80          weak    no
   09/23/12   drizzle         66            70          weak    yes
   11/24/12   sunny           52            60          medium  no
   12/19/12   mostly sunny    41            30          strong  no
   01/12/13   cloudy          36            40          ?       no
   04/13/13   mostly cloudy   57            40          weak    yes
   05/20/13   mostly sunny    68            50          medium  yes
   06/28/13   drizzle         73            20          weak    yes
   07/06/13   sunny           95            85          weak    yes
   08/20/13   rainy           91            60          weak    yes
   09/01/13   mostly sunny    80            10          medium  no
   10/23/13   mostly cloudy   52            44          weak    no 

  1. (5 points) Assuming that the missing value (marked with "?") for WIND cannot be ignored, discuss 3 different alternatives to fill in that missing value. In each case, state what the selected value would be and the advantages and disadvantages of the approach. You may assume that the attribute PLAYS is the target attribute.
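As a quick sanity check on two common alternatives (filling with the overall mode, or with the mode among rows of the same class), the candidate values can be computed directly from the table above. This is only a sketch of the bookkeeping; the discussion of advantages and disadvantages is still up to you.

```python
from collections import Counter

# WIND values transcribed from the table, in row order
# (None marks the missing 01/12/13 entry), with each row's PLAYS label.
wind = ["strong", "weak", "medium", "strong", "weak", "weak", "medium",
        "strong", None, "weak", "medium", "weak", "weak", "weak",
        "medium", "weak"]
plays = ["no", "yes", "yes", "no", "no", "yes", "no", "no", "no",
         "yes", "yes", "yes", "yes", "yes", "no", "no"]

# Alternative 1: fill with the overall mode of WIND.
overall_mode = Counter(w for w in wind if w is not None).most_common(1)[0][0]

# Alternative 2: fill with the mode among rows sharing the missing row's
# class (the 01/12/13 row has PLAYS = no).
same_class = Counter(w for w, p in zip(wind, plays)
                     if w is not None and p == "no").most_common(1)[0][0]

print(overall_mode, same_class)  # -> weak strong
```

Note that the two strategies disagree here, which is itself worth discussing. A third alternative, predicting WIND from the other attributes with a learned model, requires running a classifier and is not shown.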

  2. (5 points) Describe a reasonable transformation of the attribute OUTLOOK so that the number of different values for that attribute is reduced to just 3.
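One possible grouping (an assumption, not the only reasonable answer) collapses the six observed OUTLOOK values into three by merging each "mostly" variant with its base value and treating drizzle as rain:

```python
# Hypothetical 6 -> 3 value mapping for OUTLOOK.
outlook_map = {
    "sunny": "sunny",   "mostly sunny": "sunny",
    "cloudy": "cloudy", "mostly cloudy": "cloudy",
    "rainy": "rainy",   "drizzle": "rainy",
}
print(sorted(set(outlook_map.values())))  # -> ['cloudy', 'rainy', 'sunny']
```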

  3. (5 points) Discretize the attribute TEMPERATURE by binning it into 4 equi-width intervals using unsupervised discretization. Perform this discretization by hand (i.e., do not use Weka). Explain your answer.
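Equi-width binning divides the range [min, max] into intervals of equal width (max - min) / 4. A short sketch, using the TEMPERATURE values transcribed from the table, that you can use to check your hand calculation:

```python
temps = [47, 66, 91, 82, 76, 66, 52, 41, 36, 57, 68, 73, 95, 91, 80, 52]

k = 4
lo, hi = min(temps), max(temps)
width = (hi - lo) / k                      # (95 - 36) / 4 = 14.75
edges = [lo + i * width for i in range(k + 1)]

def bin_index(x):
    # values equal to the maximum fall into the last bin
    return min(int((x - lo) // width), k - 1)

counts = [0] * k
for t in temps:
    counts[bin_index(t)] += 1

print(edges)   # [36.0, 50.75, 65.5, 80.25, 95.0]
print(counts)  # [3, 3, 6, 4]
```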

  4. (5 points) Discretize the attribute HUMIDITY by binning it into 4 equi-depth (= equal-frequency) intervals using unsupervised discretization. Perform this discretization by hand (i.e., do not use Weka). Explain your answer.
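Equi-depth binning instead sorts the values and puts the same number of them (here 16 / 4 = 4) into each bin. A minimal sketch for checking your hand calculation, with the HUMIDITY values transcribed from the table:

```python
humidity = [25, 57, 75, 27, 80, 70, 60, 30, 40, 40, 50, 20, 85, 60, 10, 44]

k = 4
s = sorted(humidity)
depth = len(s) // k            # 16 values / 4 bins = 4 values per bin

bins = [s[i * depth:(i + 1) * depth] for i in range(k)]
print(bins)
# -> [[10, 20, 25, 27], [30, 40, 40, 44], [50, 57, 60, 60], [70, 75, 80, 85]]
```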

  5. (5 points) Would you keep the attribute DATE in your dataset when mining for patterns that predict the values of the PLAYS attribute? Explain your answer.

  6. (10 points) Consider the following new approach to discretizing a numeric attribute: Given the mean and the standard deviation (sd) of the attribute values, bin the attribute values into the following intervals:
     [mean - (k+1)*sd, mean - k*sd)   
     for all integer values k, i.e. k = ..., -4, -3, -2, -1, 0, 1, 2, ...
    
    Assume that the mean of the attribute HUMIDITY above is 48 and that the standard deviation sd of this attribute is 22.5. Discretize HUMIDITY by hand using this new approach. Show your work.
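Each value x lands in exactly one such interval, and the interval can be found directly as the integer part of the standardized value. A sketch of this bookkeeping (using j = floor((x - mean) / sd), which corresponds to the assignment's interval [mean - (k+1)*sd, mean - k*sd) with k = -(j+1)):

```python
import math
from collections import Counter

humidity = [25, 57, 75, 27, 80, 70, 60, 30, 40, 40, 50, 20, 85, 60, 10, 44]
mean, sd = 48.0, 22.5          # values given in the problem statement

# x falls in [mean + j*sd, mean + (j+1)*sd), i.e. the assignment's
# [mean - (k+1)*sd, mean - k*sd) with k = -(j+1).
def bin_of(x):
    return math.floor((x - mean) / sd)

counts = Counter(bin_of(x) for x in humidity)
print(dict(sorted(counts.items())))  # -> {-2: 3, -1: 5, 0: 5, 1: 3}
```

So the occupied intervals are [3, 25.5), [25.5, 48), [48, 70.5), and [70.5, 93).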

  7. (30 points) Use the supervised discretization filter in Weka (with useKononenko=False) to discretize the TEMPERATURE attribute. Describe the resulting intervals. Find the Java code that implements this filter in the directories that contain the Weka files. (See the instructions to find Weka's source code at the beginning of this project assignment.) Read the code carefully so that you can describe the algorithm followed by this code in your own words. Follow the code by hand to show precisely how the TEMPERATURE intervals were obtained. Is this the same as, or different from, the supervised discretization procedure described in Section 2.3.6 of the textbook, pp. 60-62? Explain.
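Weka's supervised filter implements an MDL-based recursive entropy discretization; the sketch below shows only how a first cut point would be chosen by the entropy criterion (pick the boundary minimizing the weighted class entropy of the two sides), without Weka's MDL stopping rule or recursion, so it is an illustration, not a substitute for tracing the Java code:

```python
import math

# (TEMPERATURE, PLAYS) pairs transcribed from the table in Problem II.
data = [(47, "no"), (66, "yes"), (91, "yes"), (82, "no"), (76, "no"),
        (66, "yes"), (52, "no"), (41, "no"), (36, "no"), (57, "yes"),
        (68, "yes"), (73, "yes"), (95, "yes"), (91, "yes"), (80, "no"),
        (52, "no")]

def entropy(labels):
    n = len(labels)
    ent = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        ent -= p * math.log2(p)
    return ent

def best_cut(pairs):
    """Midpoint between adjacent sorted values that minimizes the
    weighted class entropy of the resulting two-way split."""
    pairs = sorted(pairs)
    n = len(pairs)
    best = (float("inf"), None)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                       # no cut between equal values
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [c for _, c in pairs[:i]]
        right = [c for _, c in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        best = min(best, (score, cut))
    return best[1]

print(best_cut(data))   # -> 54.5 (all five values below it have PLAYS = no)
```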

Problem III. Feature Selection (60 points)

Consider the weather.nominal.arff dataset that comes with the Weka system. In this problem you will explain how Correlation-based Feature Selection (CFS) works on this dataset. (See Witten's and Frank's textbook slides - Chapter 7, Slides 5-6.)
  1. (5 points) Apply Weka's CfsSubsetEval (available under the Select attributes tab) to this dataset (using BestFirst as the search method, with default parameters) to determine what attributes are selected. Include the results in your project solutions.
  2. Looking at the code that implements CfsSubsetEval, as well as its description in the textbook and in class, describe in detail the process that it follows:
    1. (5 points) What's the initial (sub)set of attributes under consideration? Is forward or backward search used?
    2. (25 points) Using the lattice of attribute subsets below, show step by step the process that the algorithm follows (i.e., show the search process in detail). For this, you can add print statements to the Weka code so that it tells you the order in which it considers the subsets and the goodness value of each of these subsets. Explain your answer.
    3. (25 points) Use the CfsSubsetEval formulas to calculate the goodness of the "best" (sub)set of attributes considered. Show your work.

      [Figure: lattice of subsets of the weather dataset's attributes (weather_data_attribute_latice.gif)]

      Taken from Witten's and Frank's textbook slides - Chapter 7.
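For part 3, recall that the CFS "goodness" (merit) of a subset of k attributes combines the average attribute-class correlation rcf with the average attribute-attribute inter-correlation rff. A minimal sketch of the formula; the correlation values below are hypothetical, only to show how the pieces fit together (the real values come from the weather data):

```python
import math

def cfs_merit(rcf_mean, rff_mean, k):
    """CFS merit of a k-attribute subset:
    Merit = k * rcf / sqrt(k + k*(k-1)*rff)."""
    return k * rcf_mean / math.sqrt(k + k * (k - 1) * rff_mean)

# Hypothetical numbers: 3 attributes, average attribute-class
# correlation 0.6, average inter-correlation 0.2.
print(cfs_merit(0.6, 0.2, 3))
```

Note how a higher rff (redundant attributes) lowers the merit while a higher rcf raises it, which is exactly the trade-off the search over the lattice is exploring.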


Problem IV. Exploring Real Data (65 points)

Consider the Auto MPG Data Set available at the UCI Machine Learning Repository. Convert the "auto-mpg.data" dataset together with the "auto-mpg.names" to the arff format. Load this dataset into Weka by opening your arff dataset from the "Explorer" window in Weka. Load it into Python as well.

  1. Dataset Exploration. (40 points) Use Excel, Python, your own code, or Weka to complete the following parts. Please state in your report which tool from the above list you used for each part.

    1. (5 points) Start by familiarizing yourself with the dataset. Carefully look at the data directly (for this use Excel or a file editor, as well as Weka's and Python's functionality to explore and to visualize the data). Describe in your report your observations about what is good about this data (mention at least 2 different good things), and what is problematic about this data (mention at least 2 different bad things). If appropriate, include visualizations of those good/bad things.

    2. For the horsepower attribute:
      1. (5 points) Calculate the percentiles (in increments of 10, as in Table 3.2 of the textbook, page 101), mean, median, range, and variance of the attribute.
      2. (5 points) Plot a histogram of the attribute using 10 or 20 bins (you choose the best value for the attribute). For examples, see Figures 3.7 and 3.8 in the textbook, page 113.
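If you use Python for this part, NumPy covers all of the requested statistics; a sketch, with a few made-up horsepower values standing in for the real column you would load from auto-mpg.data (e.g. via pandas):

```python
import numpy as np

# Stand-in for the horsepower column of the Auto MPG dataset.
hp = np.array([130.0, 165.0, 150.0, 95.0, 88.0, 110.0, 97.0, 75.0])

percentiles = np.percentile(hp, range(0, 101, 10))   # 0%, 10%, ..., 100%
stats = {"mean": hp.mean(), "median": np.median(hp),
         "range": hp.max() - hp.min(), "variance": hp.var(ddof=1)}

# np.histogram gives bin counts and edges; matplotlib's plt.hist accepts
# the same bins= argument when you want the plot itself.
counts, edges = np.histogram(hp, bins=10)
print(stats, counts)
```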

    3. In this part, use the nominal attributes as if they were continuous. For the set of all attributes in the dataset except for car-name, calculate (1) (10 points) the covariance matrix and (2) (10 points) the correlation matrix of these attributes.
      See notes on using Matlab and Excel to calculate these matrices. Construct a visualization of each of these matrices (e.g., heatmap) to more easily understand them.
      (5 points) If you had to remove 2 of the attributes above from the dataset based on these two matrices, which attributes would you remove and why? Explain your answer.
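Besides Matlab and Excel, NumPy computes both matrices in one call each. A sketch, with a tiny made-up matrix standing in for the real data (rows = cars, columns = the attributes other than car-name):

```python
import numpy as np

# Stand-in for the numeric Auto MPG columns (mpg, cylinders, displacement).
X = np.array([[18.0, 8, 307.0],
              [15.0, 8, 350.0],
              [24.0, 4, 113.0],
              [27.0, 4,  97.0]])

cov = np.cov(X, rowvar=False)        # covariance matrix (columns = variables)
corr = np.corrcoef(X, rowvar=False)  # correlation matrix
print(cov.shape, corr.shape)

# A heatmap of either matrix is one matplotlib call:
# plt.imshow(corr); plt.colorbar()
```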

  2. Dimensionality Reduction. (10 points) Upload the entire dataset onto Weka and Python. Apply Principal Components Analysis in Weka and separately in Python to reduce the dimensionality of the full dataset. In Weka, use the PrincipalComponents option from the "Select attributes" tab. Use parameter values: centerData=True, varianceCovered=0.95. How many dimensions (= attributes) does the original dataset contain? How many dimensions are obtained after PCA? How much of the variance do they explain? Include in your report the linear combinations that define the first new attribute(= component) obtained. Look at the results and elaborate on any interesting observations you can make about the results.
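In Python, PCA with a variance-covered threshold can be done with scikit-learn or directly with NumPy. The NumPy sketch below mirrors Weka's settings (centerData=True, varianceCovered=0.95); random data stands in for the real dataset:

```python
import numpy as np

# Stand-in for the numeric Auto MPG data matrix (rows = cars).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))

# PCA via the covariance matrix's eigen-decomposition.
Xc = X - X.mean(axis=0)                        # center the data
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]              # largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = np.cumsum(eigvals) / eigvals.sum()
k = int(np.searchsorted(explained, 0.95)) + 1  # components covering 95%
components = eigvecs[:, :k]                    # each column = one linear combination
scores = Xc @ components                       # the reduced dataset
print(k, explained[k - 1])
```

The columns of `components` are the linear combinations the question asks you to report; the first column defines the first new attribute.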

  3. Feature Selection. (10 points) Use the origin attribute as the target classification attribute. Apply Correlation-based Feature Selection (CFS) (see Witten's and Frank's textbook slides - Chapter 7, Slides 5-6). For this, use Weka's CfsSubsetEval available under the Select attributes tab with default parameters. Separately, use Python for the same purpose. Look at the results to determine which attributes were selected by this method and elaborate on any interesting observations you can make about the results.

  4. Attribute Transformation. (5 points) Convert the car-name attribute into a nominal attribute by changing each car-name into just the car brand (e.g., toyota, ford, audi, ...). Using this modified dataset, run PCA and CFS again in Weka and separately in Python as you did above (keeping origin as the target attribute) and report any changes you observe in the results.
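Since each car-name starts with the brand, taking the first whitespace-separated token is a reasonable sketch of the transformation (be aware that the real data contains a few inconsistent brand spellings that may need manual cleanup afterwards):

```python
# Stand-in car names; the real ones come from the car-name column.
names = ["toyota corolla", "ford pinto", "audi 100ls", "chevrolet impala"]
brands = [n.split()[0] for n in names]
print(brands)  # -> ['toyota', 'ford', 'audi', 'chevrolet']
```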

Problem V. Data Integration, Data Warehousing and OLAP (50 points)

  1. (10 points) Describe the main differences between the mediation approach and the data warehousing approach for data integration.

  2. (Adapted from Han's and Kamber's textbook.) Suppose that a data warehouse consists of the three dimensions time, doctor, and patient, and the two measures count and charge, where charge is the fee that a doctor charges a patient for a visit.
    1. (5 points) Illustrate how this dataset would look as a multidimensional array (see for instance Fig. 3.30 p. 132 of the textbook).
    2. (5 points) Starting with the base cuboid [day, doctor, patient], what sequence of specific OLAP operations should be performed in order to list the total fee collected by each doctor in 2014?

  3. (30 points) Consider the following relational table:

    MODEL   YEAR   COLOR   SALES

    Chevy   2010   red        5
    Chevy   2010   white     87
    Chevy   2010   blue      62
    Chevy   2011   red       54
    Chevy   2011   white     95
    Chevy   2011   blue      49
    Chevy   2012   red       31
    Chevy   2012   white     54
    Chevy   2012   blue      71
    Ford    2010   red       64
    Ford    2010   white     62
    Ford    2010   blue      63
    Ford    2011   red       52
    Ford    2011   white      9
    Ford    2011   blue      55
    Ford    2012   red       27
    Ford    2012   white     62
    Ford    2012   blue      39

    1. (5 points) Depict the data in the relational table above as a multidimensional cuboid, where MODEL, YEAR, and COLOR are the dimensions and SALES is the measure.
    2. (5 points) Depict the result of rolling-up MODEL from individual models to all.
    3. (5 points) Depict the result of drilling-down time from YEAR to month. (Although month data is not provided above, make up a couple of values to illustrate the drill-down operation.)
    4. (5 points) Depict the result of slicing for MODEL=Chevy.
    5. (5 points) Depict the result of dicing for MODEL=Chevy and YEAR=2011.
    6. (5 points) Starting with the base cuboid [model, year, color] with measure sales, what specific OLAP operations should one perform in order to obtain the total number of red cars sold? Make your sequence of operations as efficient as possible.
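As a check on part 6, the result of slicing on COLOR = red and then rolling MODEL and YEAR up to all is a single number, which you can verify directly from the relational table (a sketch of the arithmetic only, not of cube operations in a real OLAP engine):

```python
# Rows of the relational table as (model, year, color, sales) tuples.
rows = [
    ("Chevy", 2010, "red", 5),  ("Chevy", 2010, "white", 87), ("Chevy", 2010, "blue", 62),
    ("Chevy", 2011, "red", 54), ("Chevy", 2011, "white", 95), ("Chevy", 2011, "blue", 49),
    ("Chevy", 2012, "red", 31), ("Chevy", 2012, "white", 54), ("Chevy", 2012, "blue", 71),
    ("Ford", 2010, "red", 64),  ("Ford", 2010, "white", 62),  ("Ford", 2010, "blue", 63),
    ("Ford", 2011, "red", 52),  ("Ford", 2011, "white", 9),   ("Ford", 2011, "blue", 55),
    ("Ford", 2012, "red", 27),  ("Ford", 2012, "white", 62),  ("Ford", 2012, "blue", 39),
]

# Slice on COLOR = red first (shrinking the cube before aggregating),
# then roll MODEL and YEAR up to 'all' by summing the measure.
total_red = sum(s for m, y, c, s in rows if c == "red")
print(total_red)  # -> 233
```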

ORAL AND WRITTEN REPORTS AND DUE DATE