CS 539 Fall 2012 - Project 1

Computer Science Department

CS539 Machine Learning. Fall 2012
Project 1 - Data Preprocessing

PROF. CAROLINA RUIZ

DUE DATE: Tuesday, Sept 11th 2012.
Slides are due at 12:00 noon (by email to Prof. Ruiz) and a hardcopy of the written report is due at 3:00 pm (beginning of class).

Project Description
Project Assignment
Report Submission and Due Date

PROJECT DESCRIPTION

The purpose of this project is two-fold:

To gain experience "pre-processing" datasets to clean, normalize, and discretize data attributes, and, when needed, reduce the dimensionality of the data.
To gain familiarity with the Weka system, its GUI, its code, and its input data format (arff), and to gain familiarity with Matlab.

PROJECT ASSIGNMENT

Download and install Weka and Matlab:
- Weka: Use the latest developer version of Weka (currently weka-3-7-7) following the instructions on the course webpage.
  You can find the Weka code in a file called "weka-src.jar", which should be located in the directory where Weka was installed. This "weka-src.jar" file is a zip archive, so you need to extract its contents (e.g., with WinZip or unzip). Inside, you will find the .java files that implement Weka.
  Read in detail the "Explorer Guide" and the "Experimenter Tutorial" provided with the Weka system. Browse through the "Package Documentation" to become familiar with it.
  Use the following command to increase the amount of main memory available to Weka. Here the limit is set to 768 MB, but you can specify a different size instead of 768 if more memory is needed and available:
```
java -Xmx768m -jar weka.jar
```
- Matlab: Follow the instructions on the CCC Matlab Webpage.
Dataset: Use the Spambase dataset, available from the University of California, Irvine (UCI) Data Repository.

Convert this data to the arff format. For this you can either use any tools provided by Weka, or you can make the conversion outside the Weka system using other tools (e.g., a word editor, Excel, etc.). Create a spambase.arff file with the converted dataset.
Convert this data as needed so that it can be input into Matlab. Consider representing it using Matlab's dataset class. See my Matlab Notes. [Throughout this project, take advantage of Matlab's superb plotting capabilities to show your results graphically.]
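As a rough illustration of the arff conversion described above, the following Python sketch writes comma-separated rows as ARFF text. It is only one possible approach and not a required tool. The helper name `to_arff`, the relation name, the class attribute name `is_spam`, and the two sample rows are all made up for this example; the real dataset has many more attributes.

```python
def to_arff(relation, numeric_names, class_name, class_values, rows):
    """Return ARFF text for rows of numeric attributes followed by a nominal class."""
    lines = [f"@relation {relation}", ""]
    for name in numeric_names:
        lines.append(f"@attribute {name} numeric")
    # A nominal attribute lists its allowed values in braces, e.g. {0,1}.
    lines.append(f"@attribute {class_name} {{{','.join(class_values)}}}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(row))
    return "\n".join(lines) + "\n"

# Two made-up rows with only two of the attributes (illustration only).
sample_csv = "0.32,3.756,1\n0.0,1.0,0\n"
rows = [line.split(",") for line in sample_csv.splitlines() if line.strip()]

arff_text = to_arff(
    "spambase_sample",                                  # hypothetical relation name
    ["word_freq_make", "capital_run_length_average"],
    "is_spam",                                          # hypothetical class attribute name
    ["0", "1"],
    rows,
)
print(arff_text)
```

Declaring the class attribute as nominal ({0,1}) rather than numeric matters later, since classifiers such as ZeroR and OneR are being used here to predict a class label.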
Experiments:
1. Data Preprocessing. We will preprocess the dataset attributes using Weka's filters and Matlab functions as described below.
  1. Discretization. Explore different ways of discretizing continuous attributes. That is, convert numeric attributes into "nominal" ones by binning numeric values into intervals.
    - In Weka's Explorer:
      - Use the Discretize filter under the Supervised Filters. Play with the filter and its parameters. Read the Java code implementing it, and describe this code and the meaning of each parameter in your written report.
      - Use the Discretize filter under the Unsupervised Filters. Play with the filter and its parameters. Read the Java code implementing it, and describe this code and the meaning of each parameter in your written report.
    - In Matlab, find out what Matlab functions can be used to bin numeric attributes into nominal ones. Include a short description of these functions in your report, as well as a description of the results you obtained by using them.
  2. Missing Values. We will explore different ways of handling missing values. The given dataset does not contain missing values, so you are asked to do the following: (1) Consider the attribute capital_run_length_average. (2) Randomly select 5% of the data instances, and replace the value of capital_run_length_average in these instances with a missing value.
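    - In Weka, missing values in arff files are represented with the character "?". See the ReplaceMissingValues filter in Weka (weka.filters.unsupervised.attribute.ReplaceMissingValues in recent versions; older versions named it weka.filter.ReplaceMissingValuesFilter). Play with the filter and read the Java code implementing it. Describe the Weka code implementing this filter in your report.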
    - In Matlab, find out what Matlab functions can be used to replace missing values. Include a short description of these functions in your report, as well as a description of the results you obtained by using them.
    In each case, compare the distribution of the original capital_run_length_average (without missing values) against the distribution of this attribute that results when the introduced missing values are replaced using the filters/functions above.
  3. Attribute/Feature Selection.
    1. Use only Matlab for this part.
      1. Using the original set of numeric/continuous attributes (without discretization), calculate the covariance matrix and the correlation matrix of these attributes. (See my miscellaneous notes on preprocessing for help calculating these matrices.)
      2. If you had to remove 2 continuous attributes from the dataset based on these two matrices, which attributes would you remove and why? Explain your answer.
    2. Use only Weka for this part.
      1. Apply Correlation Based Feature Selection to the original dataset. For this, use Weka's CfsSubsetEval available under the Select attributes tab with default parameters. Include in your report which attributes were selected by this method.
        See Chapter 7 of Witten, Frank, and Hall's Weka textbook (available online from the WPI Library) and Witten and Frank's textbook slides, Chapter 7 (Slides 5-6), for a description of this method.
      2. What can you observe about these selected attributes with respect to the covariance matrix and the correlation matrix you computed with Matlab above? Were the 2 attributes you chose to remove above kept or removed by CfsSubsetEval?
    3. In Matlab, find out what Matlab functions can be used for feature selection. Include a short description of these functions in your report, as well as a description of the results you obtained by using them.
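To make the preprocessing steps above concrete, here is a small Python sketch of three of the ideas: equal-width binning, replacing randomly introduced missing values with the attribute mean, and the Pearson correlation between two attributes. It is not Weka's or Matlab's code. The function names and the data are invented for illustration, and mean replacement and equal-width bins are just one common choice among several.

```python
import random
import statistics

def equal_width_bins(values, n_bins):
    """Map each number to a bin index 0..n_bins-1 using equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def inject_missing(values, fraction, rng):
    """Return a copy with a random `fraction` of the entries replaced by None."""
    out = list(values)
    k = round(fraction * len(out))
    for i in rng.sample(range(len(out)), k):
        out[i] = None
    return out

def replace_missing_with_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    return [mean if v is None else v for v in values]

def pearson(xs, ys):
    """Pearson correlation: cov(x, y) / (std(x) * std(y)); assumes neither column is constant."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Made-up column of 100 values standing in for a numeric attribute.
rng = random.Random(0)
column = [float(i) for i in range(1, 101)]

bins = equal_width_bins(column, 10)
with_gaps = inject_missing(column, 0.05, rng)   # 5% of the instances become missing
filled = replace_missing_with_mean(with_gaps)

# Compare the spread before and after replacement.
print(statistics.pstdev(column), statistics.pstdev(filled))
print(pearson(column, [2 * v + 1 for v in column]))
```

Because every replaced entry sits exactly at the observed mean, mean replacement cannot increase the spread of the attribute, which is one thing to look for when comparing the two distributions.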
2. Model Construction. Use both the "Explorer" and the "Experimenter" options of the Weka system in turn to run the "ZeroR" and the "OneR" classifiers (under the "Classify" tab) over the (original) dataset. See if Matlab offers similar functionality. Use different ways of testing your results. That is, explore the following alternatives:
  - Testing your results over the training data.
  - Splitting your input file into two parts, one for training and one for testing.
  - Using n-fold crossvalidation. Play with different values for n.
  Analyze the results obtained (i.e., interpret the meaning of the output produced by Weka and/or Matlab). In particular, pay attention to the model constructed, the accuracy (percentage of correctly classified instances), the error (percentage of incorrectly classified instances), the confusion matrices, and any other part of the output that you find interesting. Read the Java code implementing the ZeroR and the OneR classifiers in Weka, and describe this code in your report.
3. Run several experiments with your data and the systems varying the parameters so that you gain familiarity with the systems. Take advantage of Matlab's superb plotting capabilities to show your results graphically.
  For each experiment you ran describe in your report:
  - Data Instances: What data did you use for the experiments? That is, did you use the entire dataset or just a subset of it?
  - Any additional pre-processing done to the data. That is, did you remove any attributes? Did you discretize any continuous attribute? If so, what strategy did you use to bin the values? Did you replace missing values? If so, what strategy did you use to select a replacement of the missing values?
  - Your system parameters.
  - For the ZeroR function, analysis of results of the experiments you ran using different ways of testing the classifier (crossvalidation, etc.).
  - For the OneR function, analysis of results of the experiments you ran using different ways of testing the classifier (crossvalidation, etc.).
  Also include a summary of your results, and discuss the strengths and the weaknesses of your project.
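For orientation, the ideas behind ZeroR, OneR, and n-fold crossvalidation can be sketched in a few lines of Python. This is a simplified illustration, not Weka's implementation: it assumes all attributes are already nominal, it assigns instances to folds by position instead of shuffling or stratifying, and the function names and toy data are invented for this example.

```python
from collections import Counter, defaultdict

def train_zero_r(rows, labels):
    """ZeroR ignores the attributes and always predicts the majority class."""
    majority = Counter(labels).most_common(1)[0][0]
    return lambda row: majority

def train_one_r(rows, labels):
    """OneR keeps the single attribute whose value -> class rule makes the fewest training errors."""
    default = Counter(labels).most_common(1)[0][0]
    best = None  # (training errors, attribute index, rule)
    for a in range(len(rows[0])):
        counts = defaultdict(Counter)
        for row, y in zip(rows, labels):
            counts[row[a]][y] += 1
        # For each attribute value, predict the most frequent class seen with it.
        rule = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(rule[row[a]] != y for row, y in zip(rows, labels))
        if best is None or errors < best[0]:
            best = (errors, a, rule)
    _, attr, rule = best
    # Attribute values never seen in training fall back to the majority class.
    return lambda row: rule.get(row[attr], default)

def accuracy(model, rows, labels):
    """Fraction of instances whose predicted class matches the true class."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)

def cross_validate(train, rows, labels, n_folds):
    """Average test accuracy over n_folds folds; instance i is placed in fold i % n_folds."""
    scores = []
    for f in range(n_folds):
        tr = [i for i in range(len(rows)) if i % n_folds != f]
        te = [i for i in range(len(rows)) if i % n_folds == f]
        model = train([rows[i] for i in tr], [labels[i] for i in tr])
        scores.append(accuracy(model, [rows[i] for i in te], [labels[i] for i in te]))
    return sum(scores) / n_folds

# Toy data: the second attribute determines the class, the first is noise.
rows = [("x" if i % 2 == 0 else "y", "high" if i < 4 else "low") for i in range(12)]
labels = ["spam" if r[1] == "high" else "ham" for r in rows]

train_acc_zero = accuracy(train_zero_r(rows, labels), rows, labels)
train_acc_one = accuracy(train_one_r(rows, labels), rows, labels)
cv_zero = cross_validate(train_zero_r, rows, labels, 3)
cv_one = cross_validate(train_one_r, rows, labels, 3)
print(train_acc_zero, train_acc_one, cv_zero, cv_one)
```

Evaluating on the training data and evaluating by crossvalidation use the same `accuracy` function here; only the choice of which instances the model sees during training changes, which is the distinction the testing alternatives above ask you to explore.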

ORAL AND WRITTEN REPORTS AND DUE DATE

Written Report. Please hand in a hardcopy of your report at the beginning of class when the project is due.
Oral Report. We will discuss the results from the individual projects during the class when the project is due. Each of you will have approximately 3 minutes to present your report. Prepare SLIDES with the results of your experiments. Your slides should be a good "preview" of your written report and should summarize the contents of the different sections of your written report as described above. Be ready to show your results and to discuss your project in class within the time allowed. Given the time limitations, focus your presentation on the most relevant, unique, or creative parts of your project.
Email your slides to the professor AT LEAST THREE HOURS BEFORE THE BEGINNING OF CLASS on the day the project is due. Submit a file named:
```
[your-lastname]_proj1_slides.[ext]
```
containing your slides for your oral report. This file should be either a PDF file (ext=pdf) or a PowerPoint file (ext=ppt). Please use only lowercase letters in the file name. For instance, the file with my slides for Project 1 would be named ruiz_proj1_slides.ppt.
