WPI Worcester Polytechnic Institute

Computer Science Department
------------------------------------------

CS539 Machine Learning. Fall 2012 
Project 1 - Data Preprocessing

PROF. CAROLINA RUIZ 

DUE DATE: Tuesday, Sept 11th 2012.
Slides are due at 12:00 noon (by email to Prof. Ruiz) and a hardcopy of the written report is due at 3:00 pm (beginning of class). 

------------------------------------------


PROJECT DESCRIPTION

The purpose of this project is two-fold:

PROJECT ASSIGNMENT

  1. Download and install Weka and Matlab:

  2. Dataset: Consider the following dataset: The Spambase dataset available from the Univ. of California Irvine (UCI) Data Repository.

    Convert this data to the arff format. For this you can either use the tools provided by Weka, or you can make the conversion outside the Weka system using other tools (e.g., a text editor, Excel, etc.). Create a spambase.arff file with the converted dataset.
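As an illustration of what the conversion involves, here is a minimal sketch in Python (not one of the tools named in this assignment). It assumes the raw data is comma-separated with the class label in the last column. The function name csv_to_arff, the relation name, and the attribute names are hypothetical placeholders, and the two sample rows are made up rather than taken from Spambase.

```python
import csv
import io

def csv_to_arff(csv_text, relation, attribute_names, class_values):
    """Convert comma-separated numeric rows (class label last) to ARFF text."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    lines = ["@relation " + relation, ""]
    for name in attribute_names:
        lines.append("@attribute " + name + " numeric")
    # The class attribute is nominal: list its allowed values in braces.
    lines.append("@attribute class {" + ",".join(class_values) + "}")
    lines.append("")
    lines.append("@data")
    for row in rows:
        if len(row) != len(attribute_names) + 1:
            raise ValueError("unexpected number of columns: %r" % (row,))
        lines.append(",".join(value.strip() for value in row))
    return "\n".join(lines) + "\n"

# Toy input: two made-up attributes plus a 0/1 class label (not real Spambase rows).
sample = "0.5,1.25,1\n0.0,3.0,0\n"
arff = csv_to_arff(sample, "toy", ["attr_a", "attr_b"], ["0", "1"])
print(arff)
```

For the real dataset you would pass the attribute names from the Spambase documentation in place of the placeholders, and write the result to spambase.arff.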

    Convert this data as needed so that it can be input into Matlab. Consider representing it using Matlab's dataset class. See my Matlab Notes. [Throughout this project, take advantage of Matlab's superb plotting capabilities to show your results graphically.]

  3. Experiments:

    1. Data Preprocessing. We will preprocess the dataset attributes using Weka's filters and Matlab functions as described below.

      1. Discretization. Explore different ways of discretizing continuous attributes. That is, convert numeric attributes into "nominal" ones by binning numeric values into intervals.
        • In Weka's Explorer:
          • Use the Discretize filter under the Supervised Filters. Play with the filter and its parameters. Read the Java code implementing it, and describe this code and the meaning of each parameter in your written report.
          • Use the Discretize filter under the Unsupervised Filters. Play with the filter and its parameters. Read the Java code implementing it, and describe this code and the meaning of each parameter in your written report.
        • In Matlab, find out what Matlab functions can be used to bin numeric attributes into nominal ones. Include a short description of these functions in your report, as well as a description of the results you obtained by using them.
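To make the idea of binning concrete, the sketch below shows two common unsupervised strategies, equal-width and equal-frequency binning, in plain Python. It is only an illustration under simple assumptions and does not reproduce the Weka filters or any Matlab function. The function names and the toy data are hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Assign each value a bin index 0..n_bins-1 using equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def equal_frequency_bins(values, n_bins):
    """Assign bin indices so each bin holds roughly the same number of values.

    Ties are broken by position, so equal values may end up in different bins.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

# Toy data with one large outlier.
data = [1.0, 1.5, 2.0, 2.5, 3.0, 50.0]
print(equal_width_bins(data, 3))      # [0, 0, 0, 0, 0, 2]
print(equal_frequency_bins(data, 3))  # [0, 0, 1, 1, 2, 2]
```

Note how the outlier pushes almost all values into the first equal-width bin, while equal-frequency binning spreads them evenly. This kind of difference is worth looking for when you compare discretization settings.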

      2. Missing Values. We will explore different ways of removing missing values. The given dataset does not contain missing values, so you are asked to do the following: (1) Consider the attribute capital_run_length_average. (2) Randomly select 5% of the data instances, and replace the value of the attribute capital_run_length_average in these instances with a missing value.
        • In Weka, missing values in arff files are represented with the character "?". See the weka.filter.ReplaceMissingValuesFilter in Weka. Play with the filter and read the Java code implementing it. Describe the Weka code implementing this filter in your report.
        • In Matlab, find out what Matlab functions can be used to replace missing values. Include a short description of these functions in your report, as well as a description of the results you obtained by using them.
        In each case, compare the distribution of the original capital_run_length_average (without missing values) against the distribution of this attribute that results when the introduced missing values are replaced using the filters/functions above.
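The procedure above (blank out a random 5% of one attribute, fill the gaps, then compare distributions) can be sketched as follows. This is a minimal Python illustration, assuming mean replacement as the fill strategy. The function names are hypothetical and the values are made-up stand-ins, not real capital_run_length_average data.

```python
import random
import statistics

def inject_missing(values, fraction, seed=0):
    """Return a copy with a random `fraction` of entries replaced by None."""
    rng = random.Random(seed)
    n_missing = round(len(values) * fraction)
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(values)]

def impute_mean(values):
    """Replace each None with the mean of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    return [mean if v is None else v for v in values]

# Made-up stand-in values, not real capital_run_length_average data.
original = [float(i % 7 + 1) for i in range(100)]
with_gaps = inject_missing(original, 0.05)
filled = impute_mean(with_gaps)

print("missing:", sum(v is None for v in with_gaps))
print("stdev before:", statistics.pstdev(original))
print("stdev after: ", statistics.pstdev(filled))
```

Mean replacement can never increase the spread of the attribute, because every filled value sits exactly at the mean of the observed values. Comparing the standard deviation or a histogram before and after is one simple way to see this effect.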

      3. Attribute/Feature Selection.
        1. Use only Matlab for this part.
          1. Using the original set of numeric/continuous attributes (without discretization), calculate the covariance matrix and the correlation matrix of these attributes. (See my miscellaneous notes on preprocessing for help calculating these matrices.)
          2. If you had to remove 2 continuous attributes from the dataset based on these two matrices, which attributes would you remove and why? Explain your answer.
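For reference, the two matrices can be computed by hand as in the Python sketch below (this part of the project itself must be done in Matlab). It assumes the sample covariance, which divides by n - 1, and Pearson correlation. The function names and the three toy columns are hypothetical.

```python
import math

def covariance_matrix(columns):
    """Sample covariance matrix (divides by n - 1) for a list of columns."""
    n = len(columns[0])
    p = len(columns)
    means = [sum(col) / n for col in columns]
    cov = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(p):
            cov[i][j] = sum(
                (columns[i][k] - means[i]) * (columns[j][k] - means[j])
                for k in range(n)
            ) / (n - 1)
    return cov

def correlation_matrix(cov):
    """Pearson correlation: cov[i][j] / (std_i * std_j)."""
    std = [math.sqrt(cov[i][i]) for i in range(len(cov))]
    return [[cov[i][j] / (std[i] * std[j]) for j in range(len(cov))]
            for i in range(len(cov))]

# Toy columns: b is exactly 2*a, and c moves in the opposite direction to a.
a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 4.0, 6.0, 8.0]
c = [4.0, 3.0, 2.0, 1.0]
cov = covariance_matrix([a, b, c])
corr = correlation_matrix(cov)
for row in corr:
    print([round(r, 3) for r in row])
```

In this toy example a and b are perfectly correlated, so one of them carries no extra information. A pair of attributes with correlation close to 1 or -1 is a natural candidate when deciding which attributes to remove.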
        2. Use only Weka for this part.
          1. Apply Correlation Based Feature Selection to the original dataset. For this, use Weka's CfsSubsetEval available under the Select attributes tab with default parameters. Include in your report which attributes were selected by this method.
            See Chapter 7 of Witten's, Frank's, and Hall's Weka textbook (available online from the WPI Library) and Witten's and Frank's textbook slides, Chapter 7 (Slides 5-6), for a description of this method.
          2. What can you observe about these selected attributes with respect to the covariance matrix and the correlation matrix you computed with Matlab above? Were the 2 attributes you chose to remove above kept or removed by CfsSubsetEval?
        3. In Matlab, find out what Matlab functions can be used for feature selection. Include a short description of these functions in your report, as well as a description of the results you obtained by using them.
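As background for the correlation-based feature selection step above, the sketch below evaluates the merit heuristic usually given for this method: a subset scores well when its features correlate strongly with the class but weakly with each other. It is a Python illustration only, not Weka's CfsSubsetEval code, and the correlation values passed in are assumed inputs (Weka computes its own association measure). The function name cfs_merit is hypothetical.

```python
import math

def cfs_merit(class_corrs, feature_corrs):
    """Merit of a feature subset.

    class_corrs:   one feature-class correlation per feature (k values).
    feature_corrs: one correlation per unordered feature pair (k*(k-1)/2 values).
    merit = k * mean|r_cf| / sqrt(k + k*(k-1) * mean|r_ff|)
    """
    k = len(class_corrs)
    mean_cf = sum(abs(r) for r in class_corrs) / k
    if feature_corrs:
        mean_ff = sum(abs(r) for r in feature_corrs) / len(feature_corrs)
    else:
        mean_ff = 0.0  # a single feature has no feature-feature pairs
    return k * mean_cf / math.sqrt(k + k * (k - 1) * mean_ff)

# Two features, each correlated 0.6 with the class.
merit_independent = cfs_merit([0.6, 0.6], [0.0])  # features unrelated to each other
merit_redundant = cfs_merit([0.6, 0.6], [1.0])    # features duplicate each other
print(merit_independent, merit_redundant)
```

The redundant pair scores no better than a single one of its features, which is why highly inter-correlated attributes tend to be dropped by this kind of method.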

    2. Model Construction. Use both the "Explorer" and the "Experimenter" options of the Weka system in turn to run the "ZeroR" and the "OneR" classifiers (under the "Classify" tab) over the (original) dataset. See if Matlab offers similar functionality. Use different ways of testing your results. That is, explore the following alternatives:
      • Testing your results over the training data.
      • Splitting your input file into two parts, one for training and one for testing.
      • Using n-fold crossvalidation. Play with different values for n.
      Analyze the results obtained (i.e., interpret the meaning of the output produced by Weka and/or Matlab). In particular, pay attention to the model constructed, the accuracy (percentage of correctly classified instances), the error (percentage of incorrectly classified instances), the confusion matrices, and any other part of the output that you find interesting. Read the Java code implementing the ZeroR and the OneR classifiers in Weka, and describe this code in your report.
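To clarify what the two classifiers and the n-fold testing scheme do, here is a simplified Python sketch. It is not the Weka code you are asked to read: it assumes nominal attributes only, uses a plain interleaved fold assignment, and all function names and the toy data are hypothetical.

```python
from collections import Counter

def zero_r(train_labels):
    """ZeroR: ignore all attributes and always predict the majority class."""
    return Counter(train_labels).most_common(1)[0][0]

def one_r(train_rows, train_labels):
    """OneR on nominal attributes: pick the single attribute whose
    value -> majority-class rule makes the fewest training errors."""
    best = None
    for a in range(len(train_rows[0])):
        by_value = {}
        for row, label in zip(train_rows, train_labels):
            by_value.setdefault(row[a], Counter())[label] += 1
        rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        errors = sum(rule[row[a]] != label
                     for row, label in zip(train_rows, train_labels))
        if best is None or errors < best[0]:
            best = (errors, a, rule)
    return best[1], best[2]

def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs; fold f tests on indices f, f+k, ..."""
    for f in range(k):
        test = [i for i in range(n) if i % k == f]
        train = [i for i in range(n) if i % k != f]
        yield train, test

def cross_validate_one_r(rows, labels, k):
    """Return OneR accuracy computed over all held-out instances."""
    correct = 0
    for train, test in k_fold_indices(len(rows), k):
        tr_rows = [rows[i] for i in train]
        tr_labels = [labels[i] for i in train]
        attr, rule = one_r(tr_rows, tr_labels)
        fallback = zero_r(tr_labels)  # for attribute values unseen in training
        for i in test:
            pred = rule.get(rows[i][attr], fallback)
            correct += pred == labels[i]
    return correct / len(rows)

# Toy data: attribute 0 determines the class, attribute 1 is noise.
rows = [("hi", "x"), ("hi", "y"), ("lo", "x"), ("lo", "y"), ("lo", "x")] * 2
labels = ["spam", "spam", "ham", "ham", "ham"] * 2

majority = zero_r(labels)
zero_r_train_acc = sum(l == majority for l in labels) / len(labels)
chosen_attr, chosen_rule = one_r(rows, labels)
print("ZeroR predicts:", majority, "training accuracy:", zero_r_train_acc)
print("OneR uses attribute", chosen_attr, "with rule", chosen_rule)
print("OneR 5-fold accuracy:", cross_validate_one_r(rows, labels, 5))
```

ZeroR gives the baseline that any useful classifier should beat, so it is worth reporting alongside every OneR result. Testing on the training data tends to give more optimistic numbers than a held-out split or crossvalidation, which is one of the things to look for in your experiments.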

    3. Run several experiments with your data and the systems, varying the parameters so that you gain familiarity with the systems. Take advantage of Matlab's superb plotting capabilities to show your results graphically.
      For each experiment you ran describe in your report:
      • Data Instances: What data did you use for the experiments? That is, did you use the entire dataset or just a subset of it?
      • Any additional pre-processing done to the data. That is, did you remove any attributes? Did you discretize any continuous attribute? If so, what strategy did you use to bin the values? Did you replace missing values? If so, what strategy did you use to select a replacement of the missing values?
      • Your system parameters.
      • For the ZeroR function, analysis of results of the experiments you ran using different ways of testing the classifier (crossvalidation, etc.).
      • For the OneR function, analysis of results of the experiments you ran using different ways of testing the classifier (crossvalidation, etc.).
      Also include a summary of your results, and discuss the strengths and the weaknesses of your project.

ORAL AND WRITTEN REPORTS AND DUE DATE