CS 539 Spring 2011

Computer Science Department

CS539 Machine Learning. Spring 2011
Project 1 - Data Preprocessing

PROF. CAROLINA RUIZ

DUE DATE: Tuesday, Feb 3rd 2011.
Slides are due at 10:00 am (by email to Prof. Ruiz) and a hardcopy of the written report is due at 1:00 pm (beginning of class).

Project Description
Project Assignment
Report Submission and Due Date

PROJECT DESCRIPTION

The purpose of this project is two-fold:

To gain experience "pre-processing" datasets to clean, normalize, and discretize data attributes, and, when needed, reduce the dimensionality of the data.
To gain familiarity with the Weka system, its GUI, its code, and its input data format (arff), and/or to gain familiarity with Matlab.

PROJECT ASSIGNMENT

Download and install Weka and/or Matlab:
- Weka: Use the latest developer version of Weka (currently weka-3-7-3) following the instructions on the course webpage.
  You can find the Weka code in a file called "weka-src.jar", which should be located in the directory where Weka was installed. This "weka-src.jar" file is a zip archive, so you need to unzip it (e.g., with WinZip or unzip) to extract its contents. Inside, you will find the .java files that implement Weka.
  Read in detail the "Explorer Guide" and the "Experimenter Tutorial" provided with the Weka system. Browse through the "Package Documentation" to become familiar with it.
- Matlab: Follow the instructions on the CCC Matlab Webpage.
Datasets: Consider the following datasets:
1. The iris data. This dataset is available in the data directory of the Weka system as the "iris.arff" file, and also from the Univ. of California Irvine (UCI) Data Repository.
2. The census-income (also called "adult") dataset from the US Census Bureau which is available at the Univ. of California Irvine (UCI) Data Repository.
  The census-income dataset contains census information for 48,842 people. It has 14 attributes for each person (age, workclass, fnlwgt, education, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, and native-country) and a binary class attribute classifying the income of the person as belonging to one of two categories: >50K or <=50K.
  Convert the census-income data to the arff format. For this you can either use the tools provided by Weka, or you can make the conversion outside the Weka system using other tools (e.g., a text editor, Excel, etc.). Create a census-income.arff file with the converted dataset.
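As an illustration of what the conversion involves, here is a minimal Python sketch that wraps comma-separated rows in an arff header. The function name csv_to_arff, the reduced attribute list, and the two sample rows are made up for this example; a real conversion would declare all 14 attributes plus the class, and would need to quote any values containing spaces or commas, which this sketch assumes do not occur.

```python
import csv
import io

def csv_to_arff(csv_text, relation, attributes):
    """Convert comma-separated rows into arff text.

    attributes is a list of (name, kind) pairs, where kind is either the
    string "numeric" or a list of the allowed nominal values.
    Assumes values contain no spaces, commas, or quotes.
    """
    lines = ["@relation " + relation, ""]
    for name, kind in attributes:
        if kind == "numeric":
            lines.append("@attribute %s numeric" % name)
        else:
            lines.append("@attribute %s {%s}" % (name, ",".join(kind)))
    lines += ["", "@data"]
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    for row in reader:
        if not row:  # skip blank lines in the raw file
            continue
        # Missing values are already written as "?" and pass through unchanged.
        lines.append(",".join(value.strip() for value in row))
    return "\n".join(lines) + "\n"

# Hypothetical rows using only a few of the census-income attributes.
raw = "39, State-gov, 40, <=50K\n50, ?, 13, >50K\n"
attrs = [
    ("age", "numeric"),
    ("workclass", ["State-gov", "Private"]),
    ("hours-per-week", "numeric"),
    ("class", [">50K", "<=50K"]),
]
print(csv_to_arff(raw, "census-income", attrs))
```

Whichever tool you use, open the resulting file in the Weka Explorer to check that every attribute was read with the type you intended.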
Experiments: The following description of the experiments is written in terms of Weka. If you choose to work with Matlab, you need to do the equivalent work with Matlab functions (some of which you may need to write yourself). For each of the above datasets:
1. Use the "Explorer" option of the Weka system to perform the following operations:
  - Open the dataset in Weka.
  - Preprocess the dataset attributes using Weka's filters. In particular,
    1. explore different ways of discretizing continuous attributes. That is, convert numeric attributes into "nominal" ones by binning numeric values into intervals.
      - Use the Discretize filter under the Supervised Filters. Play with the filter and its parameters. Read the Java code implementing it, and describe this code and the meaning of each parameter in your written report.
      - Use the Discretize filter under the Unsupervised Filters. Play with the filter and its parameters. Read the Java code implementing it, and describe this code and the meaning of each parameter in your written report.
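    2. explore different ways of handling missing values, such as replacing them. Missing values in arff files are represented with the character "?". See the ReplaceMissingValues filter in Weka (weka.filters.unsupervised.attribute.ReplaceMissingValues in the developer version; older releases call it ReplaceMissingValuesFilter). Play with the filter and read the Java code implementing it.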
    3. using the original set of numeric/continuous attributes (without discretization), calculate the covariance matrix and the correlation matrix of these attributes. (See my miscellaneous notes on preprocessing for help calculating these matrices.) If you had to remove 2 continuous attributes from the dataset based on these two matrices, which attributes would you remove and why? Explain your answer.
    4. apply Correlation Based Feature Selection (see Witten and Frank's textbook slides - Chapter 7 Slides 5-6) to the Census-Income dataset. For this, use Weka's CfsSubsetEval available under the Select attributes tab with default parameters. Include in your report which attributes were selected by this method. Also, what can you observe about these selected attributes with respect to the covariance matrix and the correlation matrix you computed in item 3 above? Were the 2 attributes you chose to remove in item 3 above kept or removed by CfsSubsetEval?
2. Use both the "Explorer" and the "Experimenter" options of the Weka system in turn to run the "ZeroR" and the "OneR" classifiers (under the "Classify" tab) over the above two (original) datasets. Use different ways of testing your results. That is, explore the following alternatives offered by the Weka system:
  - Testing your results over the training data.
  - Splitting your input file into two parts, one for training and one for testing.
  - Using n-fold crossvalidation. Play with different values for n.
  Analyze the results obtained (i.e., interpret the meaning of the output produced by Weka). In particular, pay attention to the model constructed, the accuracy (percentage of correctly classified instances), the error (percentage of incorrectly classified instances), the confusion matrices, and any other part of the output that you find interesting. Read the Java code implementing the ZeroR and the OneR classifiers.
3. Run several experiments with your data and the system varying the parameters so that you gain familiarity with the system.
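
To make the ideas behind the experiments above concrete, the following Python sketch bins a numeric attribute into equal-width intervals and then compares a ZeroR-style and a OneR-style classifier, both on the training data and with n-fold crossvalidation. It is a simplified illustration under stated assumptions, not Weka's Java implementation: Weka's Discretize and OneR filters and classifiers have additional parameters, and Weka randomizes and stratifies its folds, whereas this sketch assigns instance i to fold i mod n. All function names and the toy data are made up for this example.

```python
from collections import Counter, defaultdict

def equal_width_bins(values, k):
    """Map each numeric value to a bin index 0..k-1 using k equal-width intervals."""
    lo, hi = min(values), max(values)
    width = ((hi - lo) / k) or 1.0  # fall back to 1.0 when all values are equal
    return [min(int((v - lo) / width), k - 1) for v in values]

def zero_r(rows, labels):
    """ZeroR idea: ignore the attributes and always predict the majority class."""
    majority = Counter(labels).most_common(1)[0][0]
    return lambda row: majority

def one_r(rows, labels):
    """OneR idea: keep the single nominal attribute whose
    value -> majority-class rule makes the fewest training errors."""
    default = Counter(labels).most_common(1)[0][0]
    best = None
    for a in range(len(rows[0])):
        counts = defaultdict(Counter)
        for row, label in zip(rows, labels):
            counts[row[a]][label] += 1
        rule = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(rule[row[a]] != label for row, label in zip(rows, labels))
        if best is None or errors < best[0]:
            best = (errors, a, rule)
    _, attr, rule = best
    # Attribute values never seen in training fall back to the majority class.
    return lambda row: rule.get(row[attr], default)

def accuracy(model, rows, labels):
    """Fraction of correctly classified instances."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)

def cross_validate(learner, rows, labels, n):
    """n-fold crossvalidation accuracy; fold f tests on instances with i % n == f."""
    correct = 0
    for fold in range(n):
        train = [i for i in range(len(rows)) if i % n != fold]
        test = [i for i in range(len(rows)) if i % n == fold]
        model = learner([rows[i] for i in train], [labels[i] for i in train])
        correct += sum(model(rows[i]) == labels[i] for i in test)
    return correct / len(rows)

# Toy data (made up): one numeric attribute, one uninformative nominal attribute.
x = list(range(1, 11))
flag = ["u" if i % 2 == 0 else "v" for i in range(10)]
labels = ["low"] * 6 + ["high"] * 4
rows = list(zip(equal_width_bins(x, 2), flag))

for name, learner in [("ZeroR", zero_r), ("OneR", one_r)]:
    model = learner(rows, labels)
    print(name, "training accuracy:", accuracy(model, rows, labels))
    print(name, "5-fold CV accuracy:", cross_validate(learner, rows, labels, 5))
```

Comparing the training-set figure with the crossvalidation figure for each classifier is the same kind of analysis you are asked to carry out on Weka's output.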

ORAL AND WRITTEN REPORTS AND DUE DATE

Written Report. Please hand in a hardcopy of your report at the beginning of class when the project is due. Your report should contain the following sections with the corresponding discussions:

Data: Describe the datasets that you used in terms of the attributes present in the data, the number of instances, missing values, and other relevant characteristics.
Code Description: Describe the Weka or Matlab code implementing the filters you used and the ZeroR and OneR functions.
Experiments: For each experiment you ran describe:
- Instances: What data did you use for the experiments? That is, did you use the entire dataset or just a subset of it?
- Any pre-processing done to the data. That is, did you remove any attributes? Did you discretize any continuous attribute? If so, what strategy did you use to bin the values? Did you replace missing values? If so, what strategy did you use to select a replacement of the missing values?
- Your system parameters.
- For the ZeroR function, analysis of results of the experiments you ran using different ways of testing the classifier (crossvalidation, etc.).
- For the OneR function, analysis of results of the experiments you ran using different ways of testing the classifier (crossvalidation, etc.).
Summary of Results
- Discuss the strengths and the weaknesses of your project.

Oral Report. We will discuss the results from the individual projects during the class when the project is due. Each of you will have approximately 5 minutes to present your report. Prepare SLIDES with the results of your experiments. Your slides should be a good "preview" of your written report and should summarize the contents of the different sections of your written report as described above. Be ready to show your results and to discuss your project in class within the time allowed. Given the time limitations, focus your presentation on the most relevant, unique, or creative parts of your project.
