WPI Worcester Polytechnic Institute

Computer Science Department
------------------------------------------

CS 548 KNOWLEDGE DISCOVERY AND DATA MINING - Spring 2014  
Project 1: Data Integration, Data Warehousing, Data Pre-processing

PROF. CAROLINA RUIZ 

No report needs to be submitted. Instead, an in-class test will be given on Friday, Feb. 7th, 2014 to evaluate your work and your understanding of the material. The test will cover all of the material presented in class, and in the corresponding chapters of the textbook, from the beginning of the semester. Solving the problems below will help you study for the test.
------------------------------------------

Instructions


Problem I. Knowledge Discovery in Databases (20 points)

  1. (5 points) Define knowledge discovery in databases.

  2. (10 points) Briefly describe the steps of the knowledge discovery in databases process.

  3. (5 points) Define data mining.

Base your answers on the definitions presented in class, the textbook, and the following paper: Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. "From Data Mining to Knowledge Discovery in Databases". AI Magazine, pp. 37-54. Fall 1996.

Problem II. Data Preprocessing (75 points)

Consider the following dataset.
   DATE       OUTLOOK         TEMPERATURE   HUMIDITY    WIND    PLAYS 

   02/13/06   mostly sunny    47            25          strong  no 
   03/10/06   mostly cloudy   66            57          weak    yes
   06/28/06   cloudy          91            75          medium  yes
   07/12/06   sunny           82            27          strong  no
   08/30/06   rainy           76            80          weak    no
   09/23/06   drizzle         66            70          weak    yes
   11/24/06   sunny           52            60          medium  no
   12/19/06   mostly sunny    41            30          strong  no
   01/12/07   cloudy          36            40          ?       no
   04/13/07   mostly cloudy   57            40          weak    yes
   05/20/07   mostly sunny    68            50          medium  yes
   06/28/07   drizzle         73            20          weak    yes
   07/06/07   sunny           95            85          weak    yes
   08/20/07   rainy           91            60          weak    yes
   09/01/07   mostly sunny    80            10          medium  no
   10/23/07   mostly cloudy   52            44          weak    no 

  1. (5 points) Assuming that the missing value (marked with "?") for WIND cannot be ignored, discuss 3 different alternatives to fill in that missing value. In each case, state what the selected value would be and the advantages and disadvantages of the approach. You may assume that the attribute PLAYS is the target attribute.
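
     To make the trade-offs concrete, here is a minimal Python/pandas sketch of three common choices: the global most-frequent value, the most-frequent value among rows with the same PLAYS class, and a value predicted from the other attributes. (Python is used only as an illustration; the data frame simply re-types the two relevant columns of the table above.)

        import pandas as pd

        # The WIND and PLAYS columns from the table above; the 01/12/07
        # value is the missing one (None).
        df = pd.DataFrame({
            "WIND":  ["strong", "weak", "medium", "strong", "weak", "weak",
                      "medium", "strong", None, "weak", "medium", "weak",
                      "weak", "weak", "medium", "weak"],
            "PLAYS": ["no", "yes", "yes", "no", "no", "yes", "no", "no",
                      "no", "yes", "yes", "yes", "yes", "yes", "no", "no"],
        })

        # (a) Global mode: cheap, but ignores any relationship with the class.
        print(df["WIND"].mode()[0])                           # weak

        # (b) Class-conditional mode: most frequent WIND among the rows with
        #     PLAYS = no, the class of the row with the missing value.
        print(df.loc[df["PLAYS"] == "no", "WIND"].mode()[0])  # strong

        # (c) Most principled: train a classifier on the complete rows to
        #     predict WIND from the other attributes (most work; the
        #     prediction can still be wrong).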

  2. (5 points) Describe a reasonable transformation of the attribute OUTLOOK so that the number of different values for that attribute is reduced to just 3.

  3. (10 points) Discretize the attribute TEMPERATURE by binning it into 4 equi-width intervals using unsupervised discretization. Perform this discretization by hand (i.e., do not use Weka). Explain your answer.
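
     As a sanity check for the hand computation: the equi-width boundaries follow mechanically from the minimum and maximum of the column. A small Python sketch (illustrative only; the problem asks for the computation by hand):

        # Equi-width binning: split [min, max] into 4 intervals of equal width.
        temps = [47, 66, 91, 82, 76, 66, 52, 41, 36, 57,
                 68, 73, 95, 91, 80, 52]

        lo, hi = min(temps), max(temps)            # 36 and 95
        width = (hi - lo) / 4                      # (95 - 36) / 4 = 14.75
        edges = [lo + i * width for i in range(5)]
        print(edges)                               # [36.0, 50.75, 65.5, 80.25, 95.0]

        def bin_of(v):
            # The last bin is closed on the right so that 95 lands in bin 3.
            return min(int((v - lo) // width), 3)

        for v in sorted(set(temps)):
            print(v, "-> bin", bin_of(v))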

  4. (10 points) Discretize the attribute HUMIDITY by binning it into 4 equi-depth intervals using unsupervised discretization. Perform this discretization by hand (i.e., do not use Weka). Explain your answer.
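
     Likewise, equi-depth binning is just "sort, then cut into equal-size groups" (here 16 values, so 4 per bin). A sketch of the same computation, with one common convention for placing the cut points midway between adjacent bins:

        # Equi-depth (equal-frequency) binning into 4 bins of 16/4 = 4 values.
        humidity = [25, 57, 75, 27, 80, 70, 60, 30,
                    40, 40, 50, 20, 85, 60, 10, 44]

        vals = sorted(humidity)
        depth = len(vals) // 4
        bins = [vals[i * depth:(i + 1) * depth] for i in range(4)]
        print(bins)   # [[10, 20, 25, 27], [30, 40, 40, 44],
                      #  [50, 57, 60, 60], [70, 75, 80, 85]]

        # Cut points midway between the last value of one bin and the
        # first value of the next:
        cuts = [(bins[i][-1] + bins[i + 1][0]) / 2 for i in range(3)]
        print(cuts)   # [28.5, 47.0, 65.0]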

  5. (5 points) Would you keep the attribute DATE in your dataset when mining for patterns that predict the values of the PLAYS attribute? Explain your answer.

  6. (10 points) Consider the following new approach to discretizing a numeric attribute: Given the mean and the standard deviation (sd) of the attribute values, bin the attribute values into the following intervals:
     [mean - (k+1)*sd, mean - k*sd)   
     for all integer values k, i.e. k = ..., -4, -3, -2, -1, 0, 1, 2, ...
    
    Assume that the mean of the attribute HUMIDITY above is 48 and that the standard deviation sd of this attribute is 22.5. Discretize HUMIDITY by hand using this new approach. Show your work.
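
    One way to check the hand computation: solving the interval condition mean - (k+1)*sd <= v < mean - k*sd for k gives a closed-form bin index, k = ceil((mean - v)/sd) - 1. A short sketch using the given mean = 48 and sd = 22.5:

        import math

        mean, sd = 48, 22.5
        humidity = [25, 57, 75, 27, 80, 70, 60, 30,
                    40, 40, 50, 20, 85, 60, 10, 44]

        def k_index(v):
            # v lies in [mean - (k+1)*sd, mean - k*sd) exactly when
            # k = ceil((mean - v) / sd) - 1; e.g. v = 48 gives k = -1,
            # i.e. the interval [48.0, 70.5).
            return math.ceil((mean - v) / sd) - 1

        for v in sorted(set(humidity)):
            k = k_index(v)
            print(f"{v:3d} -> k = {k:2d}, interval "
                  f"[{mean - (k + 1) * sd}, {mean - k * sd})")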

  7. (30 points) Use the supervised discretization filter in Weka (with useKononenko=False) to discretize the TEMPERATURE attribute. Describe the resulting intervals. Find the Java code that implements this filter in the directories that contain the Weka files. (See the instructions for finding Weka's source code at the beginning of this project assignment.) Read the code carefully so that you can describe, in your own words, the algorithm it follows. Then trace the code by hand to show precisely how the TEMPERATURE intervals are obtained.
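
     For orientation before reading the Weka source: with useKononenko=False, the filter implements Fayyad and Irani's entropy-based discretization with the MDL stopping rule. The Python sketch below is a condensed illustration of that algorithm, not a transcription of Weka's Java code; the assignment still requires tracing the actual source.

        import math
        from collections import Counter

        def entropy(labels):
            n = len(labels)
            return -sum(c / n * math.log2(c / n)
                        for c in Counter(labels).values())

        def best_split(pairs):
            # Try every boundary between two different values; keep the
            # cut with the lowest size-weighted class entropy.
            best = None
            for i in range(1, len(pairs)):
                if pairs[i - 1][0] == pairs[i][0]:
                    continue
                cut = (pairs[i - 1][0] + pairs[i][0]) / 2
                left = [c for _, c in pairs[:i]]
                right = [c for _, c in pairs[i:]]
                e = (len(left) * entropy(left)
                     + len(right) * entropy(right)) / len(pairs)
                if best is None or e < best[1]:
                    best = (cut, e, left, right)
            return best

        def mdl_accepts(pairs, left, right):
            # Fayyad-Irani MDL test: split only if the information gain
            # pays for the extra model complexity.
            n = len(pairs)
            labels = [c for _, c in pairs]
            k, k1, k2 = len(set(labels)), len(set(left)), len(set(right))
            e, e1, e2 = entropy(labels), entropy(left), entropy(right)
            gain = e - (len(left) * e1 + len(right) * e2) / n
            delta = math.log2(3 ** k - 2) - (k * e - k1 * e1 - k2 * e2)
            return gain > (math.log2(n - 1) + delta) / n

        def discretize(pairs, cuts):
            split = best_split(pairs)
            if split is None:
                return
            cut, _, left, right = split
            if not mdl_accepts(pairs, left, right):
                return
            cuts.append(cut)
            discretize(pairs[:len(left)], cuts)
            discretize(pairs[len(left):], cuts)

        # (TEMPERATURE, PLAYS) pairs from the table above, sorted by value.
        data = sorted([(47, "no"), (66, "yes"), (91, "yes"), (82, "no"),
                       (76, "no"), (66, "yes"), (52, "no"), (41, "no"),
                       (36, "no"), (57, "yes"), (68, "yes"), (73, "yes"),
                       (95, "yes"), (91, "yes"), (80, "no"), (52, "no")])
        cuts = []
        discretize(data, cuts)
        print(sorted(cuts))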

Problem III. Feature Selection (60 points)

Consider the weather.arff dataset that comes with the Weka system. In this problem you will explain how Correlation-based Feature Selection (CFS) works on this dataset. (See Witten and Frank's textbook slides, Chapter 7, Slides 5-6.)
  1. (5 points) Apply Weka's CfsSubsetEval (available under the Select attributes tab) to this dataset (using BestFirst as the search method, with default parameters) to determine what attributes are selected. Include the results in your project solutions.
  2. Looking at the code that implements CfsSubsetEval, as well as its description in the textbook and in class, describe in detail the process that it follows:
    1. (5 points) What's the initial (sub)set of attributes under consideration? Is forward or backward search used?
    2. (25 points) Using the lattice of attribute subsets below, show step by step the process that the algorithm follows (i.e., show the search process in detail). For this you can add print statements to the Weka code so that it reports the order in which the subsets are considered and the goodness value of each of these subsets. Explain your answer.
    3. (25 points) Use the CfsSubsetEval formulas to calculate the goodness of the "best" (sub)set of attributes considered. Show your work. (The merit formula is restated below the figure.)

      [Figure: lattice of attribute subsets for the weather data]

      Taken from Witten and Frank's textbook slides, Chapter 7.
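
     For part 3 above, the merit formula used by CfsSubsetEval is Merit_S = k * r_cf / sqrt(k + k*(k-1)*r_ff), where k is the number of attributes in the subset S, r_cf is the average attribute-class correlation, and r_ff the average attribute-attribute correlation within S. A minimal sketch (the correlation values below are placeholders, not the ones Weka computes):

        from math import sqrt

        def cfs_merit(r_cf, r_ff, k):
            # Merit of a k-attribute subset S:
            #   k * r_cf / sqrt(k + k*(k-1)*r_ff)
            return k * r_cf / sqrt(k + k * (k - 1) * r_ff)

        # Placeholder averages, just to show the shape of the computation:
        print(cfs_merit(r_cf=0.4, r_ff=0.2, k=3))   # about 0.586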


Problem IV. Exploring Real Data (55 points)

Consider the Communities and Crime Unnormalized Data Set available at the UCI Machine Learning Repository. Convert the dataset to the arff format; the arff header is provided on the dataset webpage. Load this dataset into Weka by opening your arff dataset from the "Explorer" window in Weka. Increase the memory available to Weka as needed.

  1. Use Excel, Matlab, your own code, Weka, or RapidMiner, to complete the following parts. Please state in your project solutions which of these tools you used for each part.

    1. For the murdPerPop attribute:
      1. (5 points) Calculate the percentiles (in increments of 10, as in Table 3.2 of the textbook, page 101), mean, median, range, and variance of the attribute.
      2. (5 points) Plot a histogram of the attribute using 10 or 20 bins (you choose the best value for the attribute). For examples, see Figures 3.7 and 3.8 in the textbook, page 113.
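
      A possible starting point if you choose "your own code": the statistics and the histogram are one-liners in NumPy/matplotlib. The values below are a synthetic stand-in; replace x with the actual murdPerPop column once the dataset is loaded.

        import numpy as np
        import matplotlib.pyplot as plt

        rng = np.random.default_rng(0)
        x = rng.gamma(2.0, 3.0, size=2000)   # stand-in for murdPerPop

        # Percentiles in increments of 10, as in Table 3.2 of the textbook:
        print(np.percentile(x, range(0, 101, 10)))
        print("mean", x.mean(), "median", np.median(x),
              "range", x.max() - x.min(), "variance", x.var(ddof=1))

        plt.hist(x, bins=20)                 # also try bins=10 and compare
        plt.xlabel("murdPerPop"); plt.ylabel("count")
        plt.show()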

    2. For the following set of 21 continuous attributes, calculate (1) (10 points) the covariance matrix and (2) (10 points) the correlation matrix of these attributes.
      See the notes on using Matlab and Excel to calculate these matrices; a sample Python sketch is also given after this part. Try to construct a visualization of each matrix (e.g., a heatmap) to understand it more easily.
      -- population
      -- householdsize
      -- racepctblack
      -- racePctWhite
      -- racePctAsian
      -- racePctHisp
      -- agePct12t21
      -- agePct12t29
      -- agePct16t24
      -- agePct65up
      -- numbUrban
      -- pctUrban
      -- medIncome
      -- pctWWage
      -- pctWFarmSelf
      -- pctWInvInc
      -- pctWSocSec
      -- pctWPubAsst
      -- pctWRetire
      -- medFamInc
      -- perCapInc
      
      (5 points) If you had to remove 4 of the continuous attributes above from the dataset based on these two matrices, which attributes would you remove and why? Explain your answer.
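
      If you work in Python rather than Matlab or Excel, the two matrices and a heatmap can be produced as follows. The file name crimedata.csv is a placeholder for however you exported the dataset; the "?" missing-value markers are coerced to NaN so that they do not distort the matrices.

        import pandas as pd
        import matplotlib.pyplot as plt

        cols = ["population", "householdsize", "racepctblack", "racePctWhite",
                "racePctAsian", "racePctHisp", "agePct12t21", "agePct12t29",
                "agePct16t24", "agePct65up", "numbUrban", "pctUrban",
                "medIncome", "pctWWage", "pctWFarmSelf", "pctWInvInc",
                "pctWSocSec", "pctWPubAsst", "pctWRetire", "medFamInc",
                "perCapInc"]

        # "crimedata.csv" is a placeholder name for the exported dataset.
        df = pd.read_csv("crimedata.csv", usecols=cols)
        df = df.apply(pd.to_numeric, errors="coerce")   # "?" -> NaN

        cov = df.cov()     # 21 x 21 covariance matrix
        corr = df.corr()   # 21 x 21 correlation matrix

        # Heatmap of the correlation matrix:
        plt.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
        plt.colorbar()
        plt.xticks(range(len(cols)), cols, rotation=90)
        plt.yticks(range(len(cols)), cols)
        plt.tight_layout()
        plt.show()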

  2. Dimensionality Reduction. (10 points) Load the entire dataset into Weka, Matlab, or RapidMiner (ideally try all 3 systems to learn each of them). Apply Principal Components Analysis to reduce the dimensionality of the full dataset. In Weka, use the PrincipalComponents option from the "Select attributes" tab, with parameter values centerData=True and varianceCovered=0.95. How many dimensions (= attributes) does the original dataset contain? How many dimensions are obtained after PCA? How much of the variance do they explain? Include in your project solutions the linear combinations that define the first two new attributes (= components) obtained. Look at the results and elaborate on any interesting observations you can make about them.
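
    As a cross-check outside Weka: scikit-learn's PCA accepts a fractional n_components, which plays the same role as varianceCovered (and it always centers the data, matching centerData=True). The matrix X below is a random stand-in; substitute the actual numeric attributes.

        import numpy as np
        from sklearn.decomposition import PCA

        rng = np.random.default_rng(0)
        X = rng.normal(size=(200, 20))        # stand-in for the real data

        pca = PCA(n_components=0.95)          # keep 95% of the variance
        Z = pca.fit_transform(X)

        print("original dims:", X.shape[1], "-> after PCA:", Z.shape[1])
        print("variance explained:", pca.explained_variance_ratio_.sum())
        print("first two components:")
        print(pca.components_[:2])            # the two linear combinations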

  3. Feature Selection. (10 points) Using the full original dataset in Weka, discretize the murdPerPop attribute into 10 equal-frequency bins using unsupervised discretization. Use this discretized attribute as the target classification attribute. Apply Correlation-based Feature Selection (see Witten and Frank's textbook slides, Chapter 7, Slides 5-6). For this, use Weka's CfsSubsetEval, available under the Select attributes tab, with default parameters. Look at the results to determine which attributes were selected by this method, and draw any interesting observations you can make about the results.

Problem V. Data Integration, Data Warehousing and OLAP (60 points)

  1. (10 points) Describe the main differences between the mediation approach and the data warehousing approach for data integration.

  2. (20 points) (Adapted from Han and Kamber's textbook.) Suppose that a data warehouse consists of the three dimensions time, doctor, and patient, and the two measures count and charge, where charge is the fee that a doctor charges a patient for a visit.
    1. Enumerate three classes of schemas that are popularly used for modeling data warehouses.
    2. Draw a schema diagram for the above data warehouse using one of the schema classes listed in your previous answer.
    3. Starting with the base cuboid [day, doctor, patient], what specific OLAP operations should be performed in order to list the total fee collected by each doctor in 2005?
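
      To make the operations concrete, here is a toy pandas sketch of the same steps: roll day up to year, slice on year = 2005, roll patient up to all, and aggregate charge per doctor. (The rows are invented purely for illustration.)

        import pandas as pd

        facts = pd.DataFrame({
            "day":     ["2005-03-01", "2005-03-01", "2005-04-02",
                        "2004-07-09"],
            "doctor":  ["Smith", "Jones", "Smith", "Jones"],
            "patient": ["P1", "P2", "P3", "P1"],
            "charge":  [100, 80, 120, 90],
        })
        facts["day"] = pd.to_datetime(facts["day"])

        # Roll up day -> year and slice year = 2005, then roll up patient
        # (i.e. drop it) and sum the charge per doctor:
        in_2005 = facts[facts["day"].dt.year == 2005]
        print(in_2005.groupby("doctor")["charge"].sum())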
  3. (30 points) Consider the following relational table:

     MODEL   YEAR   COLOR   SALES

     Chevy   1990   red         5
     Chevy   1990   white      87
     Chevy   1990   blue       62
     Chevy   1991   red        54
     Chevy   1991   white      95
     Chevy   1991   blue       49
     Chevy   1992   red        31
     Chevy   1992   white      54
     Chevy   1992   blue       71
     Ford    1990   red        64
     Ford    1990   white      62
     Ford    1990   blue       63
     Ford    1991   red        52
     Ford    1991   white       9
     Ford    1991   blue       55
     Ford    1992   red        27
     Ford    1992   white      62
     Ford    1992   blue       39

    1. (5 points) Depict the data in the relational table above as a multidimensional cuboid.
    2. (5 points) Illustrate the result of rolling-up MODEL from individual models to all.
    3. (5 points) Illustrate the result of drilling-down time from YEAR to month.
    4. (5 points) Illustrate the result of slicing for MODEL=Chevy.
    5. (5 points) Illustrate the result of dicing for MODEL=Chevy and YEAR=1991.
    6. (5 points) Starting with the base cuboid [model, year, color], what specific OLAP operations should one perform in order to obtain the total number of red cars sold?
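
      For reference, the same table and operations in a pandas sketch: the pivot gives the cuboid view of part 1, and part 6 is a slice on COLOR = red followed by rolling MODEL and YEAR up to all (i.e., summing SALES over both):

        import pandas as pd

        df = pd.DataFrame([
            ("Chevy", 1990, "red", 5),    ("Chevy", 1990, "white", 87),
            ("Chevy", 1990, "blue", 62),  ("Chevy", 1991, "red", 54),
            ("Chevy", 1991, "white", 95), ("Chevy", 1991, "blue", 49),
            ("Chevy", 1992, "red", 31),   ("Chevy", 1992, "white", 54),
            ("Chevy", 1992, "blue", 71),  ("Ford", 1990, "red", 64),
            ("Ford", 1990, "white", 62),  ("Ford", 1990, "blue", 63),
            ("Ford", 1991, "red", 52),    ("Ford", 1991, "white", 9),
            ("Ford", 1991, "blue", 55),   ("Ford", 1992, "red", 27),
            ("Ford", 1992, "white", 62),  ("Ford", 1992, "blue", 39),
        ], columns=["MODEL", "YEAR", "COLOR", "SALES"])

        # Cuboid view: MODEL x YEAR down the side, COLOR across the top.
        print(df.pivot_table("SALES", index=["MODEL", "YEAR"],
                             columns="COLOR"))

        # Part 6: slice COLOR = red, then roll MODEL and YEAR up to all:
        print(df.loc[df["COLOR"] == "red", "SALES"].sum())   # 233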