WPI Worcester Polytechnic Institute

Computer Science Department
------------------------------------------

CS539 Machine Learning - Spring 2007 
Project 2 - Decision Trees

PROF. CAROLINA RUIZ 

Due Date: Thursday, Feb. 1st 2007. Slides are due at 3:00 (by email) and Written Report is due at 4:00 pm (beginning of class). 
------------------------------------------


PROJECT DESCRIPTION

Construct the best decision tree you can (i.e., the most accurate, smallest, most readable, and/or most informative) for predicting the class attribute of each of the following datasets:

  1. The census-income dataset from the US Census Bureau which is available at the Univ. of California Irvine Repository.
    The census-income dataset contains census information for 48,842 people. It has 14 attributes for each person (age, workclass, fnlwgt, education, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, and native-country) and a class attribute that classifies each person's income into one of two categories: >50K or <=50K.

  2. A dataset of your choice. This dataset can consist of data that you use for your own research or work, a dataset taken from a public data repository (e.g., UCI Machine Learning Repository, or from the UCI KDD Archive), or data that you collect from public data sources. THIS DATASET CANNOT BE ONE OF THOSE INCLUDED IN THE WEKA SYSTEM.

PROJECT ASSIGNMENT

  1. Read Chapter 3 of the textbook about decision trees in great detail.

  2. Solve Exercise 3.2 of your textbook (page 77). Include your solution in your written report (and not in your oral report).

  3. The following are guidelines for the construction of your decision tree:

    • Code: You can use the decision tree methods implemented in the Weka system. Use ID3 and J4.8 for your experiments. Read the Weka code implementing ID3 and J4.8 in detail.
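      ID3 chooses the attribute with the highest information gain at each node. To check your understanding of the quantity Weka is computing as you read its code, here is a minimal pure-Python sketch of entropy and information gain over nominal attributes (an illustration only, not Weka's actual implementation):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, labels, attr_index):
    """Reduction in class entropy from splitting rows on the attribute
    at position attr_index (nominal attributes only, as in ID3)."""
    total = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(part) / total * entropy(part)
                    for part in partitions.values())
    return entropy(labels) - remainder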

    • Training and Testing Instances:

      You may restrict your experiments to a subset of the instances IF Weka cannot handle your whole dataset (this is unlikely). But remember that the more accurate your decision tree is, the better.

    • Objectives of the Learning Experiments: In order to make your experiments more focused, follow the guidelines below:
      • Before you start running experiments, look at the raw data in detail. Figure out 3 to 5 specific, interesting questions about the domain that you want to answer with your experiments. These questions may be phrased as conjectures that you want to confirm/refute with your experimental results.

        Note that the questions should be about the domain, not about specific details of the experiments or the machine learning technique you are using. An example of a good question about the census-income dataset would be "Is education a more important factor than gender in predicting salary?" An example of a bad question for this dataset would be "What accuracy will I obtain by running ID3 over the dataset?"

      • Design your preprocessing and experiments around answering these 3-5 questions.
      • Analyze your resulting trees in the light of your 3-5 questions.

    • Preprocessing of the Data: A main part of this project is the PREPROCESSING of your dataset.

      • For both ID3 and J4.8: You should apply relevant filters to your dataset before doing the mining and/or use the results of previous mining tasks. For instance, you may decide to remove apparently irrelevant attributes, replace missing values if any, discretize attributes in a different way, etc. Your report should contain a detailed description of the preprocessing of your dataset and justifications for the steps you followed. If Weka does not provide the functionality you need to preprocess your data to obtain useful patterns, preprocess the data yourself by writing the necessary filters (you can incorporate them into Weka if you wish).

        To the extent possible, modify the attribute names and the value names so that the resulting decision trees are easier to read.
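        As one example of writing your own filter: Weka's ReplaceMissingValues filter substitutes the mode for missing nominal values, and the core idea is easy to reproduce yourself. A hypothetical stand-alone sketch, assuming missing values are marked with "?" as in the UCI census files:

```python
from collections import Counter

MISSING = "?"  # the UCI census files mark missing values with "?"

def replace_missing_with_mode(column):
    """Replace missing nominal values in one attribute column
    with the most frequent observed value (the mode)."""
    observed = [v for v in column if v != MISSING]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v == MISSING else v for v in column]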

      • For J4.8: Read J4.8's code to determine how J4.8 handles numeric attributes, missing values, etc. if they are present in the dataset. Also compare the performance of J4.8 when you allow it to handle numeric attributes and missing values automatically vs. its performance when you pre-process the data to handle those cases.
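        One simple way to pre-process a numeric attribute yourself (so that ID3 can use it too) is equal-width discretization. A minimal sketch of one possible scheme (Weka's Discretize filter provides similar functionality):

```python
def equal_width_bins(values, n_bins):
    """Assign each numeric value to one of n_bins equal-width intervals,
    returning nominal labels 'bin0', 'bin1', ... usable by ID3."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1  # guard against a constant attribute
    labels = []
    for v in values:
        i = min(int((v - lo) / width), n_bins - 1)  # clamp the maximum into the last bin
        labels.append("bin%d" % i)
    return labels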

    • Evaluation and Testing: Experiment with different testing methods:

      1. Supply separate training and testing data to Weka.

      2. Supply training data to Weka and experiment with several split ratios.

      3. Use n-fold cross-validation to test your results. Experiment with different values for the number of folds.
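      The fold construction behind n-fold cross-validation can be sketched as follows (the generic scheme; note that Weka's own cross-validation additionally stratifies the folds by class):

```python
import random

def cross_validation_folds(n_instances, n_folds, seed=0):
    """Yield (train_indices, test_indices) pairs for n-fold cross-validation.
    Each instance appears in exactly one test fold."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    for k in range(n_folds):
        test = indices[k::n_folds]  # every n_folds-th shuffled index forms one fold
        test_set = set(test)
        train = [i for i in indices if i not in test_set]
        yield train, test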

    • Pruning of your decision tree:

      Read Weka's ID3 and J4.8 code to determine what types of post-processing techniques they offer to increase classification accuracy and/or to reduce the size of the decision tree. Describe that functionality in detail in your written report and experiment with it. Alter Weka's code if you want to tailor it to your needs.


REPORT AND DUE DATE