CS539 Machine Learning - Spring 2009
Project 2 - Decision Trees
Due Date:
Tuesday, Feb. 10th 2009. Slides are due at 2:00 (by email)
and Written Report is due at 3:30 pm (beginning of class).
- Read Chapter 3 of the textbook about decision trees in great detail.
- Homework Assignment:
- Calculate Gain(S,A1) and Gain(S,A2) for the dataset S and attributes A1 and A2
  on Slide 8 of the textbook slides (Chapter 3).
  Show each step of the calculation.
  Include your solution in your written report (not in your oral report).
- Consider the Gain(S,A) formula (Equation 3.4, p. 58 of your textbook).
  Is it the case that for any dataset S and for any attribute A in S,
  Gain(S,A) ≥ 0?
  If your answer is yes, provide a detailed proof. If your answer is no,
  provide a dataset S and an attribute A in that dataset such that Gain(S,A) < 0.
  Include your solution in your written report (not in your oral report).
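As a sanity check for the hand calculations above, the entropy and information-gain formulas can be sketched in Python. The toy dataset below is a hypothetical placeholder, not the dataset from Slide 8 — substitute your own examples to verify each step:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy(S) = -sum over classes c of p_c * log2(p_c)."""
    n = len(labels)
    return -sum((k / n) * log2(k / n) for k in Counter(labels).values())

def gain(examples, attr):
    """Gain(S,A) = Entropy(S) - sum over values v of (|S_v|/|S|) * Entropy(S_v),
    as in Equation 3.4."""
    labels = [label for _, label in examples]
    n = len(examples)
    # Partition S by the value of attribute A.
    by_value = {}
    for attrs, label in examples:
        by_value.setdefault(attrs[attr], []).append(label)
    remainder = sum(len(sub) / n * entropy(sub) for sub in by_value.values())
    return entropy(labels) - remainder

# Hypothetical toy dataset: each example is (attribute dict, class label).
S = [({"A1": True}, "+"), ({"A1": True}, "+"),
     ({"A1": False}, "+"), ({"A1": False}, "-")]
print(gain(S, "A1"))  # ≈ 0.311
```

The same `gain` function applied to many random datasets is also a quick empirical probe of the Gain(S,A) ≥ 0 question, though of course it is no substitute for a proof.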
- Project Assignment:
THOROUGHLY READ AND FOLLOW THE
PROJECT GUIDELINES.
These guidelines contain detailed information about how to structure your
project, and how to prepare your written and oral reports.
- Data Mining Technique(s):
We will run experiments using the following decision tree techniques:
- ID3, and
- J4.8 (given that J4.8 can handle numeric attributes and missing values
  directly, make sure to run some experiments with no pre-processing and
  some experiments with pre-processing, and compare your results).
- Dataset(s):
In this project, we will use two datasets:
- The 1995 Data Analysis Exposition.
  This dataset contains college data taken from the U.S. News & World Report's Guide to
  America's Best Colleges. The necessary files are:
  Let's make "private/public" the classification target. Note that even though the values
  of this attribute are 0s and 1s, this is a nominal (not a numeric!) attribute.
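One way to make sure Weka treats the target as nominal is to declare it that way in the ARFF header. The attribute name below is an assumption about how the file labels it — match it to the actual column name:

```
@attribute Private {0,1}    % nominal: two symbolic values
% not:  @attribute Private numeric
```

If the file arrives as CSV, Weka will guess a 0/1 column is numeric, so the declaration (or Weka's NumericToNominal filter) is what forces the nominal interpretation.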
- A dataset of your choice. This dataset can consist of
  data that you use for your own research or work,
  a dataset taken from a public data repository (e.g., the
  UCI Machine Learning Repository or the
  UCI KDD Archive),
  or data that you collect from public data sources.
THIS DATASET CANNOT BE ONE OF THOSE INCLUDED IN THE WEKA SYSTEM.
- Performance Metric(s):
- Use (1) classification accuracy, (2) size of the tree, and (3) readability
of the tree, as separate measures to evaluate the "goodness" of your models.
- Compare each accuracy you obtain against those of benchmark techniques
  such as ZeroR and OneR over the same (sub-)set of data instances you used in
  the corresponding experiment.
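For reference, the two baselines are simple enough to sketch in plain Python (a minimal sketch on hypothetical toy data; Weka's implementations additionally handle numeric attributes, ties, and missing values):

```python
from collections import Counter

def zero_r(train):
    """ZeroR: always predict the most frequent class in the training data."""
    majority = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda example: majority

def one_r(train, attributes):
    """OneR: pick the single attribute whose one-level rule makes the fewest
    training errors; predict the majority class for each attribute value."""
    best_rules, best_errors, best_attr = None, None, None
    for attr in attributes:
        by_value = {}
        for attrs, label in train:
            by_value.setdefault(attrs[attr], []).append(label)
        rules = {v: Counter(ls).most_common(1)[0][0] for v, ls in by_value.items()}
        errors = sum(1 for attrs, label in train if rules[attrs[attr]] != label)
        if best_errors is None or errors < best_errors:
            best_rules, best_errors, best_attr = rules, errors, attr
    return lambda example: best_rules.get(example[best_attr])

# Hypothetical toy data: (attribute dict, class label).
train = [({"x": 0}, "a"), ({"x": 0}, "a"),
         ({"x": 1}, "b"), ({"x": 1}, "b"), ({"x": 1}, "a")]
baseline = zero_r(train)   # predicts "a" for everything
rule = one_r(train, ["x"]) # predicts per-value majority: x=0 -> "a", x=1 -> "b"
```

Any decision tree worth reporting should beat ZeroR; beating OneR shows the tree is exploiting more than one attribute.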
- Remember to experiment with pruning of your J4.8 decision tree:
  Experiment with Weka's J4.8 classifier to see how it performs pre- and/or
  post-pruning of the decision tree in order to increase the classification
  accuracy and/or to reduce the size of the decision tree.
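In the Weka Explorer these experiments correspond to J48's options (flag names as in Weka 3; check your version's documentation):

```
-U          build an unpruned tree
-C <num>    confidence factor for post-pruning (default 0.25; smaller prunes more)
-M <num>    minimum number of instances per leaf (default 2; larger prunes more)
-R          use reduced-error pruning instead of C4.5's default pruning
```

Varying `-C` and `-M` and recording both accuracy and tree size gives exactly the accuracy-vs-size trade-off the performance metrics above ask you to report.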