WPI Worcester Polytechnic Institute

Computer Science Department
------------------------------------------

CS539 Machine Learning. Spring 2007 
Project 1 - Using the Weka System to Preprocess Datasets

PROF. CAROLINA RUIZ 

DUE DATE: Thursday, Jan 25th 2007. Slides are due at 3:00 (by email) and a hardcopy of the written report is due at 4:00 pm (beginning of class).

------------------------------------------


PROJECT DESCRIPTION

The purpose of this project is two-fold:

PROJECT ASSIGNMENT

For this and other course projects, we will use the Weka system (http://www.cs.waikato.ac.nz/ml/weka/). Weka is an excellent machine-learning/data-mining environment. It provides a large collection of Java-based mining algorithms, data preprocessing filters, and experimentation capabilities. Weka is open source software issued under the GNU General Public License. For more information on the Weka system, to download it, and to get its documentation, see Weka's webpage (http://www.cs.waikato.ac.nz/ml/weka/).

  1. You can download and use the latest stable GUI version of the system, though I suggest downloading the developer version (currently weka-3-5-4), which offers added functionality.

  2. Read in detail the "Explorer Guide" and the "Experimenter Tutorial" provided with the Weka system. Browse through the "Package Documentation" to become familiar with it.

  3. Datasets: Consider the following sets of data:
    1. The weather data (available in the data directory of the Weka system as the "weather.arff" file).
    2. The iris data (available in the data directory of the Weka system as the "iris.arff" file).
    3. The Automobile Database taken from the UCI Machine Learning Repository.
    4. A dataset of your choice. This dataset can consist of data that you use for your own research or work, a dataset taken from a public data repository (e.g., the UCI Machine Learning Repository or the UCI KDD Archive), or data that you collect from public data sources. THIS DATASET CANNOT BE ONE OF THOSE INCLUDED IN THE WEKA SYSTEM.

  4. Experiments: For each of the above datasets:

    1. Translate the dataset into the arff format if needed.
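As an illustration of the target format, here is a minimal Python sketch that writes a small table as arff text: a `@relation` line, one `@attribute` declaration per column, and a `@data` section with comma-separated rows. The function name `to_arff` and the toy rows are hypothetical and are not part of Weka. The sketch does not quote names or values containing spaces or commas, which a real converter would need to handle.

```python
def to_arff(relation, attributes, rows):
    """Render rows as arff text.

    attributes: list of (name, spec) pairs, where spec is the string
    "numeric" or a list of nominal labels.
    A None entry in a row is written as the arff missing-value marker "?".
    """
    lines = ["@relation " + relation, ""]
    for name, spec in attributes:
        if spec == "numeric":
            lines.append("@attribute %s numeric" % name)
        else:
            lines.append("@attribute %s {%s}" % (name, ",".join(spec)))
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join("?" if v is None else str(v) for v in row))
    return "\n".join(lines) + "\n"


# Hypothetical toy data, loosely in the spirit of the weather dataset.
attributes = [
    ("outlook", ["sunny", "overcast", "rainy"]),
    ("temperature", "numeric"),
    ("play", ["yes", "no"]),
]
rows = [
    ("sunny", 85, "no"),
    ("overcast", None, "yes"),  # missing temperature
    ("rainy", 70, "yes"),
]
arff_text = to_arff("toy_weather", attributes, rows)
print(arff_text)
```

Opening the resulting file in the Explorer is a quick way to check that the header and data rows agree.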

    2. Use the "Explorer" option of the Weka system to perform the following operations:
      • Open the dataset in Weka.
      • Preprocess the dataset attributes using Weka's filters. In particular,
        • explore different ways of discretizing continuous attributes. That is, convert numeric attributes into "nominal" ones by binning numeric values into intervals. See the weka.filter.DiscretizeFilter in Weka (in recent releases such as weka-3-5-4, this filter is weka.filters.unsupervised.attribute.Discretize, with a supervised counterpart under weka.filters.supervised.attribute). Play with the filter and read the Java code implementing it.
        • explore different ways of removing missing values. Missing values in arff files are represented with the character "?". See the weka.filter.ReplaceMissingValuesFilter in Weka (weka.filters.unsupervised.attribute.ReplaceMissingValues in recent releases). Play with the filter and read the Java code implementing it.
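To make the two preprocessing ideas above concrete before reading the Java code, the following Python sketch shows two simple binning schemes (equal-width and equal-frequency) and one simple replacement strategy (mean for numeric attributes, most common label for nominal ones). These are illustrative simplifications with hypothetical function names, not Weka's implementation; missing values are represented here as `None`.

```python
from collections import Counter


def equal_width_bins(values, n_bins):
    """Map each number to a bin index 0..n_bins-1 using equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would fall just past the last interval, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def equal_frequency_bins(values, n_bins):
    """Map each number to a bin so that bins hold roughly equal counts.

    Simplification: tied values may end up in different bins.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


def replace_missing(values, numeric):
    """Replace None with the mean (numeric) or the most common label (nominal)."""
    present = [v for v in values if v is not None]
    if numeric:
        fill = sum(present) / len(present)
    else:
        fill = Counter(present).most_common(1)[0][0]
    return [fill if v is None else v for v in values]


# Hypothetical attribute values for illustration.
temperature = [64, 70, 75, 85]
width_bins = equal_width_bins(temperature, 3)
freq_bins = equal_frequency_bins(temperature, 2)
filled_numeric = replace_missing([85, None, 70, 65], numeric=True)
filled_nominal = replace_missing(["sunny", None, "sunny", "rainy"], numeric=False)
print(width_bins, freq_bins, filled_numeric, filled_nominal)
```

Comparing how the same attribute is split under each scheme, and with different numbers of bins, is a useful way to see how the choice of discretization changes the data a classifier receives.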

    3. Use both the "Explorer" and the "Experimenter" options of the Weka system in turn to run the "ZeroR" and the "OneR" classifiers (under the "Classify" tab) over the above datasets. Use different ways of testing your results. That is, explore the following alternatives offered by the Weka system:
      • Testing your results over the training data.
      • Splitting your input file into two parts, one for training and one for testing.
      • Using n-fold cross-validation. Play with different values for n.
      Analyze the results obtained (i.e., interpret the meaning of the output produced by Weka). Read the Java code implementing the ZeroR and the OneR classifiers.
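As a companion to reading the Java code, the Python sketch below outlines the basic idea behind each classifier and the three testing alternatives listed above. It assumes nominal attributes only, and all names and the toy data are hypothetical. ZeroR here predicts the majority class; OneR builds a value-to-class rule for each attribute and keeps the attribute with the fewest training errors. The fold construction is deliberately simple (no shuffling or stratification), so the numbers will not necessarily match Weka's output.

```python
from collections import Counter, defaultdict


def zero_r(train):
    """ZeroR: ignore all attributes and always predict the majority class."""
    majority = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda features: majority


def one_r(train):
    """OneR: keep the single attribute whose value -> class rule errs least."""
    default = Counter(label for _, label in train).most_common(1)[0][0]
    best = None
    for a in range(len(train[0][0])):
        counts = defaultdict(Counter)
        for features, label in train:
            counts[features[a]][label] += 1
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(label != rule[features[a]] for features, label in train)
        if best is None or errors < best[0]:
            best = (errors, a, rule)
    _, attr, rule = best
    # Attribute values never seen in training fall back to the majority class.
    return lambda features: rule.get(features[attr], default)


def accuracy(model, data):
    """Fraction of instances in data that the model labels correctly."""
    return sum(model(f) == y for f, y in data) / len(data)


def cross_validate(learner, data, n_folds):
    """Average test accuracy over n folds; fold i holds every n-th instance."""
    scores = []
    for i in range(n_folds):
        test = data[i::n_folds]
        train = [d for j, d in enumerate(data) if j % n_folds != i]
        scores.append(accuracy(learner(train), test))
    return sum(scores) / n_folds


# Hypothetical toy data: ((outlook, windy), play).
data = [
    (("sunny", "no"), "no"), (("sunny", "yes"), "no"),
    (("overcast", "no"), "yes"), (("overcast", "yes"), "yes"),
    (("rainy", "no"), "yes"), (("rainy", "yes"), "no"),
    (("sunny", "yes"), "yes"), (("rainy", "no"), "yes"),
]

# 1. Testing over the training data.
zero_train = accuracy(zero_r(data), data)
one_train = accuracy(one_r(data), data)
# 2. Splitting the data into a training part and a testing part.
one_split = accuracy(one_r(data[:6]), data[6:])
# 3. n-fold cross-validation.
one_cv = cross_validate(one_r, data, 4)
print(zero_train, one_train, one_split, one_cv)
```

Accuracy measured on the training data is typically optimistic compared with the split and cross-validation estimates, which is one of the things to look for when interpreting Weka's output.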

    4. Run several experiments with your data and the system, varying the parameters so that you gain familiarity with the system.

ORAL AND WRITTEN REPORTS AND DUE DATE