WPI Worcester Polytechnic Institute

Computer Science Department
------------------------------------------

CS539 Machine Learning - Spring 2007 
Project 2 - Decision Trees

PROF. CAROLINA RUIZ 

Due Date: Thursday, Feb. 1st 2007. Slides are due at 3:00 (by email) and Written Report is due at 4:00 pm (beginning of class). 
------------------------------------------


PROJECT DESCRIPTION

Construct the best decision tree you can (i.e., the most accurate, smallest, most readable, and/or most informative) for predicting the class attribute of each of the following datasets:

  1. The census-income dataset from the US Census Bureau which is available at the Univ. of California Irvine Repository.
    The census-income dataset contains census information for 48,842 people. It has 14 attributes for each person (age, workclass, fnlwgt, education, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, and native-country) and a class attribute that classifies each person's income into one of two categories: >50K or <=50K.

  2. A dataset of your choice. This dataset can consist of data that you use for your own research or work, a dataset taken from a public data repository (e.g., UCI Machine Learning Repository, or from the UCI KDD Archive), or data that you collect from public data sources. THIS DATASET CANNOT BE ONE OF THOSE INCLUDED IN THE WEKA SYSTEM.

PROJECT ASSIGNMENT

  1. Read Chapter 3 of the textbook about decision trees in great detail.

  2. Solve Exercise 3.2 of your textbook (page 77). Include your solution in your written report (and not in your oral report).

  3. The following are guidelines for the construction of your decision tree:

    • Code: You can use the decision tree methods implemented in the Weka system. Use ID3 and J4.8 for your experiments. Read the Weka code implementing ID3 and J4.8 in detail.
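      ID3 chooses the attribute with the highest information gain at each node. To check your understanding of the quantity Weka is computing as you read its code, here is a minimal pure-Python sketch of entropy and information gain over nominal attributes (an illustration only, not Weka's actual implementation):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, labels, attr_index):
    """Reduction in class entropy from splitting rows on the attribute
    at position attr_index (nominal attributes only, as in ID3)."""
    total = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(part) / total * entropy(part)
                    for part in partitions.values())
    return entropy(labels) - remainder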

    • Training and Testing Instances:

      You may restrict your experiments to a subset of the instances IF Weka cannot handle your whole dataset (this is unlikely). But remember that the more accurate your decision tree is, the better.

    • Objectives of the Learning Experiments: In order to make your experiments more focused, follow the guidelines below:
      • Before you start running experiments, look at the raw data in detail. Figure out 3 to 5 specific, interesting questions about the domain that you want to answer with your experiments. These questions may be phrased as conjectures that you want to confirm/refute with your experimental results.

        Note that the questions should be about the domain, not about specific details of the experiments or the machine learning technique you are using. An example of a good question about the census-income dataset would be "Is education a more important factor than gender in predicting salary?" An example of a bad question for this dataset would be "What accuracy will I obtain by running ID3 over the dataset?"

      • Design your preprocessing and experiments around answering these 3-5 questions.
      • Analyze your resulting trees in the light of your 3-5 questions.

    • Preprocessing of the Data: A main part of this project is the PREPROCESSING of your dataset.

      • For both ID3 and J4.8: You should apply relevant filters to your dataset before doing the mining and/or use the results of previous mining tasks. For instance, you may decide to remove apparently irrelevant attributes, replace missing values if any, discretize attributes in a different way, etc. Your report should contain a detailed description of the preprocessing of your dataset and justifications for the steps you followed. If Weka does not provide the functionality you need to preprocess your data to obtain useful patterns, preprocess the data yourself by writing the necessary filters (you can incorporate them into Weka if you wish).

        To the extent possible, modify the attribute names and the value names so that the resulting decision trees are easier to read.
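        As one example of writing your own filter: Weka's ReplaceMissingValues filter substitutes the mode for missing nominal values, and the core idea is easy to reproduce yourself. A hypothetical stand-alone sketch, assuming missing values are marked with "?" as in the UCI census files:

```python
from collections import Counter

MISSING = "?"  # the UCI census files mark missing values with "?"

def replace_missing_with_mode(column):
    """Replace missing nominal values in one attribute column
    with the most frequent observed value (the mode)."""
    observed = [v for v in column if v != MISSING]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v == MISSING else v for v in column]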

      • For J4.8: Read J4.8's code to determine how J4.8 handles numeric attributes, missing values, etc. if they are present in the dataset. Also compare the performance of J4.8 when you allow it to handle numeric attributes and missing values automatically vs. its performance when you pre-process the data to handle those cases.
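        One simple way to pre-process a numeric attribute yourself (so that ID3 can use it too) is equal-width discretization. A minimal sketch of one possible scheme (Weka's Discretize filter provides similar functionality):

```python
def equal_width_bins(values, n_bins):
    """Assign each numeric value to one of n_bins equal-width intervals,
    returning nominal labels 'bin0', 'bin1', ... usable by ID3."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1  # guard against a constant attribute
    labels = []
    for v in values:
        i = min(int((v - lo) / width), n_bins - 1)  # clamp the maximum into the last bin
        labels.append("bin%d" % i)
    return labels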

    • Evaluation and Testing: Experiment with different testing methods:

      1. Supply separate training and testing data to Weka.

      2. Supply training data to Weka and experiment with several split ratios.

      3. Use n-fold cross-validation to test your results. Experiment with different values for the number of folds.
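      The fold construction behind n-fold cross-validation can be sketched as follows (the generic scheme; note that Weka's own cross-validation additionally stratifies the folds by class):

```python
import random

def cross_validation_folds(n_instances, n_folds, seed=0):
    """Yield (train_indices, test_indices) pairs for n-fold cross-validation.
    Each instance appears in exactly one test fold."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    for k in range(n_folds):
        test = indices[k::n_folds]  # every n_folds-th shuffled index forms one fold
        test_set = set(test)
        train = [i for i in indices if i not in test_set]
        yield train, test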

    • Pruning of your decision tree:

      Read Weka's ID3 and J4.8 code to determine what types of post-processing techniques they offer to increase classification accuracy and/or to reduce the size of the decision tree. Describe that functionality in detail in your written report and experiment with it. Alter Weka's code if you want to tailor it to your needs.


REPORT AND DUE DATE