WPI Worcester Polytechnic Institute

Computer Science Department
------------------------------------------

CS 548 KNOWLEDGE DISCOVERY AND DATA MINING - Spring 2012  
Project 1: Data Integration, Data Warehousing, Data Pre-processing

PROF. CAROLINA RUIZ 

DUE DATE: Thursday Feb. 9th, 2012.
------------------------------------------

Instructions


Problem I. Knowledge Discovery in Databases (20 points)

  1. (5 points) Define knowledge discovery in databases.

  2. (10 points) Briefly describe the steps of the knowledge discovery in databases process.

  3. (5 points) Define data mining.

Base your answers on the class handouts and the paper: Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. "From Data Mining to Knowledge Discovery in Databases". AI Magazine, pp. 37-54. Fall 1996.

Problem II. Data Preprocessing (75 points)

Consider the following dataset.
   DATE       OUTLOOK         TEMPERATURE   HUMIDITY    WIND    PLAYS 

   02/13/06   mostly sunny    47            25          strong  no 
   03/10/06   mostly cloudy   66            57          weak    yes
   06/28/06   cloudy          91            75          medium  yes
   07/12/06   sunny           82            27          strong  no
   08/30/06   rainy           76            80          weak    no
   09/23/06   drizzle         66            70          weak    yes
   11/24/06   sunny           52            60          medium  no
   12/19/06   mostly sunny    41            30          strong  no
   01/12/07   cloudy          36            40          ?       no
   04/13/07   mostly cloudy   57            40          weak    yes
   05/20/07   mostly sunny    68            50          medium  yes
   06/28/07   drizzle         73            20          weak    yes
   07/06/07   sunny           95            85          weak    yes
   08/20/07   rainy           91            60          weak    yes
   09/01/07   mostly sunny    80            10          medium  no
   10/23/07   mostly cloudy   52            44          weak    no 

  1. (5 points) Assuming that the missing value (marked with "?") for WIND cannot be ignored, discuss 3 different alternatives to fill in that missing value. In each case, state what the selected value would be and the advantages and disadvantages of the approach. You may assume that the attribute PLAYS is the target attribute.

  2. (5 points) Describe a reasonable transformation of the attribute OUTLOOK so that the number of different values for that attribute is reduced to just 3.

  3. (10 points) Discretize the attribute TEMPERATURE by binning it into 4 equi-width intervals using unsupervised discretization. Perform this discretization by hand (i.e., do not use Weka). Explain your answer. (A code sketch illustrating the binning schemes of parts 3, 4, and 6 appears after this problem.)

  4. (10 points) Discretize the attribute HUMIDITY by binning it into 4 equi-depth intervals using unsupervised discretization. Perform this discretization by hand (i.e., do not use Weka). Explain your answer.

  5. (5 points) Would you keep the attribute DATE in your dataset when mining for patterns that predict the values for the PLAYS attribute? Explain your answer.

  6. (10 points) Consider the following new approach to discretizing a numeric attribute: Given the mean and the standard deviation (sd) of the attribute values, bin the attribute values into the following intervals:
     [mean - (k+1)*sd, mean - k*sd)   
     for all integer values k, i.e. k = ..., -4, -3, -2, -1, 0, 1, 2, ...
    
    Assume that the mean of the attribute HUMIDITY above is 48 and that the standard deviation sd of this attribute is 22.5. Discretize HUMIDITY by hand using this new approach. Show your work.

  7. (30 points) Use the supervised discretization filter in Weka (with useKononenko=False) to discretize the TEMPERATURE attribute. Describe the resulting intervals. Find the Java code that implements this filter in the directories that contain the Weka files. (See the instructions for finding Weka's source code at the beginning of this project assignment.) Include the code implementing this filter in your report, and describe the algorithm followed by this code in your own words. Follow the code by hand to show precisely how the TEMPERATURE intervals were obtained. Show your work. (A sketch of applying this filter through Weka's Java API appears after this problem.)
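
If you want to sanity-check your hand computations for parts 3, 4, and 6, here is a minimal Java sketch of the three binning schemes. The class and method names are made up for illustration, and exact cut-point conventions (for example, whether equal-frequency cuts fall on data values or on midpoints) vary between tools, so treat its output as a rough check rather than the expected answer.

    import java.util.Arrays;

    public class BinningSketch {

        // Part 3, equi-width: split [min, max] into numBins intervals of equal length.
        static double[] equiWidthCutPoints(double[] values, int numBins) {
            double min = Arrays.stream(values).min().getAsDouble();
            double max = Arrays.stream(values).max().getAsDouble();
            double width = (max - min) / numBins;
            double[] cuts = new double[numBins - 1];
            for (int i = 1; i < numBins; i++) {
                cuts[i - 1] = min + i * width;
            }
            return cuts;
        }

        // Part 4, equi-depth (equal frequency): sort the values and put roughly
        // n/numBins of them in each bin; each cut point is the midpoint between
        // the last value of one bin and the first value of the next.
        static double[] equiDepthCutPoints(double[] values, int numBins) {
            double[] sorted = values.clone();
            Arrays.sort(sorted);
            int n = sorted.length;
            double[] cuts = new double[numBins - 1];
            for (int i = 1; i < numBins; i++) {
                int idx = (int) Math.round((double) i * n / numBins);
                cuts[i - 1] = (sorted[idx - 1] + sorted[idx]) / 2.0;
            }
            return cuts;
        }

        // Part 6: bins of the form [mean - (k+1)*sd, mean - k*sd) for integer k.
        // Returns the k whose interval contains the given value.
        static int meanSdBin(double value, double mean, double sd) {
            return (int) Math.ceil((mean - value) / sd) - 1;
        }

        public static void main(String[] args) {
            // The TEMPERATURE column from the table above; swap in HUMIDITY for part 4.
            double[] temperature = {47, 66, 91, 82, 76, 66, 52, 41, 36, 57, 68, 73, 95, 91, 80, 52};
            System.out.println("Equi-width cuts: " + Arrays.toString(equiWidthCutPoints(temperature, 4)));
            System.out.println("Equi-depth cuts: " + Arrays.toString(equiDepthCutPoints(temperature, 4)));
            // One HUMIDITY value binned with the part-6 scheme (mean = 48, sd = 22.5).
            System.out.println("meanSdBin(25) = " + meanSdBin(25, 48, 22.5));
        }
    }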
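
The filter referred to in part 7 is weka.filters.supervised.attribute.Discretize; its source file sits under weka/filters/supervised/attribute/ in the Weka source tree. Below is a minimal sketch of applying it programmatically, which is a convenient place to add print statements while you trace the code. The file name weather-extended.arff is hypothetical (save the Problem II table in ARFF format first), and the sketch assumes PLAYS is the last attribute.

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.supervised.attribute.Discretize;

    public class SupervisedDiscretizeSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical file name: the Problem II table saved in ARFF format.
            Instances data = DataSource.read("weather-extended.arff");
            data.setClassIndex(data.numAttributes() - 1);   // PLAYS is the class attribute

            Discretize filter = new Discretize();
            filter.setUseKononenko(false);    // the useKononenko=False setting from the assignment
            filter.setInputFormat(data);

            Instances discretized = Filter.useFilter(data, filter);
            System.out.println(discretized);  // numeric attributes, TEMPERATURE included, are now interval-valued
        }
    }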

Problem III. Feature Selection (60 points)

Consider the weather.arff dataset that comes with the Weka system. In this problem you will explain how Correlation based Feature Selection (CFS) works on this dataset. (See Witten's and Frank's textbook slides - Chapter 7 Slides 5-6).
  1. (5 points) Apply Weka's CfsSubsetEval (available under the Select attributes tab) to this dataset (using BestFirst as the search method, with default parameters) to determine what attributes are selected. Include the results in your project solutions.
  2. Looking at the code that implements CfsSubsetEval, as well as its description in the textbook and in class, describe in detail the process that it follows:
    1. (5 points) What's the initial (sub)set of attributes under consideration? Is forward or backward search used?
    2. (25 points) Using the lattice of attribute subsets below, show step by step the process that the algorithm follows (i.e., show the search process in detail). For this you can add print statements to the Weka code so that it tells you the order in which it considers the subsets and the goodness value of each of these subsets. Explain your answer. (A sketch for running the evaluator and search programmatically appears after the figure below.)
    3. (25 points) Use the CfsSubsetEval formulas to calculate the goodness of the "best" (sub)set of attributes considered. Show your work.

      [Figure: weather_data_attribute_latice.gif - the lattice of attribute subsets for the weather data. Taken from Witten's and Frank's textbook slides - Chapter 7.]
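
The sketch below runs the CfsSubsetEval evaluator and BestFirst search from part 1 through Weka's Java API; it is also a convenient starting point for the print statements suggested in part 2. The path to weather.arff is an assumption (point it at the copy that ships with Weka), and the merit formula in the comment is the standard CFS subset merit from the textbook slides.

    import weka.attributeSelection.AttributeSelection;
    import weka.attributeSelection.BestFirst;
    import weka.attributeSelection.CfsSubsetEval;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CfsSketch {
        public static void main(String[] args) throws Exception {
            // Assumed path; use the weather.arff distributed with Weka.
            Instances data = DataSource.read("weather.arff");
            data.setClassIndex(data.numAttributes() - 1);   // play is the class attribute

            // CFS merit of a subset S with k attributes:
            //   Merit(S) = k * avg(attribute-class correlation)
            //              / sqrt(k + k*(k-1) * avg(attribute-attribute correlation))
            AttributeSelection selector = new AttributeSelection();
            selector.setEvaluator(new CfsSubsetEval());
            selector.setSearch(new BestFirst());            // default parameters, as in part 1
            selector.SelectAttributes(data);

            System.out.println(selector.toResultsString());
        }
    }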


Problem IV. Exploring Real Data (55 points)

Consider the Communities and Crime Unnormalized Data Set available at the UCI Machine Learning Repository. Convert the dataset to the arff format. The arff header is provided on the dataset webpage. Load this dataset into Weka by opening your arff file from the "Explorer" window. Increase the memory available to Weka as needed.

  1. Use Excel, Matlab, your own code, Weka, or other software to complete the following parts. Please state in your report which tool from the above list you used for each part.

    1. For the murdPerPop attribute:
      1. (5 points) Calculate the percentiles (in increments of 10, as in Table 3.2 of the textbook, page 101), mean, median, range, and variance of the attribute.
      2. (5 points) Plot a histogram of the attribute using 10 or 20 bins (you choose the best value for the attribute). For examples, see Figures 3.7 and 3.8 in the textbook, page 113.

    2. For the following set of 21 continuous attributes, calculate (1) (10 points) the covariance matrix and (2) (10 points) the correlation matrix of these attributes (a sketch for cross-checking these matrices appears after this problem).
      -- population
      -- householdsize
      -- racepctblack
      -- racePctWhite
      -- racePctAsian
      -- racePctHisp
      -- agePct12t21
      -- agePct12t29
      -- agePct16t24
      -- agePct65up
      -- numbUrban
      -- pctUrban
      -- medIncome
      -- pctWWage
      -- pctWFarmSelf
      -- pctWInvInc
      -- pctWSocSec
      -- pctWPubAsst
      -- pctWRetire
      -- medFamInc
      -- perCapInc
      
      (5 points) If you had to remove 4 of the continuous attributes above from the dataset based on these two matrices, which attributes would you remove and why? Explain your answer.

  2. Dimensionality Reduction. (10 points) Load the entire dataset into Weka. Apply Principal Components Analysis to reduce the dimensionality of the full dataset. For this, use Weka's PrincipalComponents option from the "Select attributes" tab. Use parameter values: centerData=True, varianceCovered=0.95. How many dimensions (= attributes) does the original dataset contain? How many dimensions are obtained after PCA? How much of the variance do they explain? Include in your report the linear combinations that define the first two new attributes (= components) obtained. Look at the results and elaborate on any interesting observations you can make about the results. (A Weka API sketch covering this part and part 3 appears after this problem.)

  3. Feature Selection. (10 points) Using the full original dataset, discretize the murdPerPop attribute into 10 equal frequency bins using unsupervised discretization. Use this discretized attribute as the target classification attribute. Apply Correlation Based Feature Selection (see Witten's and Frank's textbook slides - Chapter 7 Slides 5-6). For this, use Weka's CfsSubsetEval available under the Select attributes tab with default parameters. Include in your report which attributes were selected by this method. Look at the results and elaborate on any interesting observations you can make about the results.
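
For part 1.2, any of the tools listed above will do; as a cross-check, here is a minimal Java sketch of the sample covariance and correlation matrices over a plain numeric data matrix. The class name is made up for illustration, loading the dataset into the double[][] is left to you, and the toy matrix in main exists only to show the call.

    public class CovarianceSketch {

        // Sample covariance matrix of x (rows = records, columns = attributes).
        static double[][] covariance(double[][] x) {
            int n = x.length, d = x[0].length;
            double[] mean = new double[d];
            for (double[] row : x)
                for (int j = 0; j < d; j++) mean[j] += row[j] / n;
            double[][] cov = new double[d][d];
            for (double[] row : x)
                for (int j = 0; j < d; j++)
                    for (int k = 0; k < d; k++)
                        cov[j][k] += (row[j] - mean[j]) * (row[k] - mean[k]) / (n - 1);
            return cov;
        }

        // Correlation matrix: covariance rescaled by the standard deviations.
        static double[][] correlation(double[][] x) {
            double[][] cov = covariance(x);
            int d = cov.length;
            double[][] corr = new double[d][d];
            for (int j = 0; j < d; j++)
                for (int k = 0; k < d; k++)
                    corr[j][k] = cov[j][k] / Math.sqrt(cov[j][j] * cov[k][k]);
            return corr;
        }

        public static void main(String[] args) {
            // Toy 4x2 matrix; replace with the 21 crime attributes.
            double[][] toy = {{1, 2}, {2, 4}, {3, 6}, {4, 8}};
            System.out.println("corr(0,1) = " + correlation(toy)[0][1]);  // perfectly correlated columns give 1.0
        }
    }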
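
Parts 2 and 3 can be done entirely from the Explorer, but if you prefer to script them, here is a sketch using Weka's Java API. The file name crime.arff is an assumption (use your own ARFF conversion of the UCI data), setCenterData assumes a Weka version whose PrincipalComponents evaluator exposes the centerData option shown in the GUI, and you may need to remove identifier attributes (such as the community name) before running CFS.

    import weka.attributeSelection.AttributeSelection;
    import weka.attributeSelection.BestFirst;
    import weka.attributeSelection.CfsSubsetEval;
    import weka.attributeSelection.PrincipalComponents;
    import weka.attributeSelection.Ranker;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Discretize;

    public class CrimeSketch {
        public static void main(String[] args) throws Exception {
            // Assumed file name: your ARFF version of the Communities and Crime data.
            Instances data = DataSource.read("crime.arff");

            // Part 2: PCA keeping 95% of the variance.
            data.setClassIndex(data.attribute("murdPerPop").index());  // the Explorer always has a class selected
            PrincipalComponents pca = new PrincipalComponents();
            pca.setCenterData(true);           // assumes a Weka version with the centerData option
            pca.setVarianceCovered(0.95);
            AttributeSelection pcaSel = new AttributeSelection();
            pcaSel.setEvaluator(pca);
            pcaSel.setSearch(new Ranker());    // PCA is paired with the Ranker search method
            pcaSel.SelectAttributes(data);
            System.out.println(pcaSel.toResultsString());

            // Part 3: 10 equal-frequency bins for murdPerPop, then CFS with the binned
            // attribute as the class. The class index is unset while filtering so the
            // filter is allowed to discretize murdPerPop itself.
            data.setClassIndex(-1);
            Discretize disc = new Discretize();
            disc.setAttributeIndices(String.valueOf(data.attribute("murdPerPop").index() + 1));  // 1-based index
            disc.setBins(10);
            disc.setUseEqualFrequency(true);
            disc.setInputFormat(data);
            Instances binned = Filter.useFilter(data, disc);
            binned.setClassIndex(binned.attribute("murdPerPop").index());

            AttributeSelection cfs = new AttributeSelection();
            cfs.setEvaluator(new CfsSubsetEval());
            cfs.setSearch(new BestFirst());
            cfs.SelectAttributes(binned);
            System.out.println(cfs.toResultsString());
        }
    }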

Problem V. Data Integration, Data Warehousing and OLAP (60 points)

  1. (10 points) Describe the main differences between the mediation approach and the data warehousing approach for data integration.

  2. (20 points) (Adapted from Han's and Kamber's textbook.) Suppose that a data warehouse consists of the three dimensions time, doctor, and patient, and the two measures count and charge, where charge is the fee that a doctor charges a patient for a visit.
    1. Enumerate three classes of schemas that are popularly used for modeling data warehouses.
    2. Draw a schema diagram for the above data warehouse using one of the schema classes listed in your previous answer.
    3. Starting with the base cuboid [day, doctor, patient], what specific OLAP operations should be performed in order to list the total fee collected by each doctor in 2005?
  3. (30 points) Consider the following relational table:

     MODEL    YEAR    COLOR    SALES

     Chevy    1990    red          5
     Chevy    1990    white       87
     Chevy    1990    blue        62
     Chevy    1991    red         54
     Chevy    1991    white       95
     Chevy    1991    blue        49
     Chevy    1992    red         31
     Chevy    1992    white       54
     Chevy    1992    blue        71
     Ford     1990    red         64
     Ford     1990    white       62
     Ford     1990    blue        63
     Ford     1991    red         52
     Ford     1991    white        9
     Ford     1991    blue        55
     Ford     1992    red         27
     Ford     1992    white       62
     Ford     1992    blue        39

    1. (5 points) Depict the data in the relational table above as a multidimensional cuboid.
    2. (5 points) Illustrate the result of rolling up MODEL from individual models to all.
    3. (5 points) Illustrate the result of drilling down on the time dimension from YEAR to MONTH.
    4. (5 points) Illustrate the result of slicing for MODEL=Chevy.
    5. (5 points) Illustrate the result of dicing for MODEL=Chevy and YEAR=1991.
    6. (5 points) Starting with the base cuboid [model, year, color], what specific OLAP operations should one perform in order to obtain the total number of red cars sold? (A small aggregation sketch illustrating this appears after this list.)
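
As an aid to part 6, the sketch below expresses one possible operation sequence (roll MODEL and YEAR up to all, then slice on COLOR = red) as a plain aggregation over the table; it illustrates what those OLAP operations compute, not the notation expected in your answer.

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class RollUpSketch {
        public static void main(String[] args) {
            // The base cuboid from the table above: (model, year, color) -> sales.
            String[][] rows = {
                {"Chevy", "1990", "red", "5"},  {"Chevy", "1990", "white", "87"}, {"Chevy", "1990", "blue", "62"},
                {"Chevy", "1991", "red", "54"}, {"Chevy", "1991", "white", "95"}, {"Chevy", "1991", "blue", "49"},
                {"Chevy", "1992", "red", "31"}, {"Chevy", "1992", "white", "54"}, {"Chevy", "1992", "blue", "71"},
                {"Ford", "1990", "red", "64"},  {"Ford", "1990", "white", "62"},  {"Ford", "1990", "blue", "63"},
                {"Ford", "1991", "red", "52"},  {"Ford", "1991", "white", "9"},   {"Ford", "1991", "blue", "55"},
                {"Ford", "1992", "red", "27"},  {"Ford", "1992", "white", "62"},  {"Ford", "1992", "blue", "39"}
            };

            // Roll up MODEL and YEAR to "all": sum SALES per COLOR...
            Map<String, Integer> salesByColor = new LinkedHashMap<>();
            for (String[] r : rows) {
                salesByColor.merge(r[2], Integer.parseInt(r[3]), Integer::sum);
            }
            // ...then slice for COLOR = red.
            System.out.println("Total red cars sold: " + salesByColor.get("red"));
        }
    }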

REPORTS AND DUE DATE

  1. Slides, Class Presentation, and Class Participation (10 points)
    We will discuss the results from the project during class, so you should prepare slides summarizing your findings and be prepared to give an oral presentation.

    Submit the following file with your slides for your oral report by email to me before 10:00 am the day the project is due:

    [your-lastname]_proj1_slides.[ext]
    where: [ext] is pdf, ppt, or pptx. Please use only lower case letters in the file name. For instance, the file with my slides for this project would be named ruiz_proj1_slides.pptx

  2. Written Report
    Hand in a hardcopy of your written report at the beginning of class the day the project is due.