CS539 Machine Learning - Spring 2009
Project 2 - Decision Trees
Due Date:
Tuesday, Feb. 10th 2009. Slides are due at 2:00 (by email)
and Written Report is due at 3:30 pm (beginning of class).
- Read Chapter 3 of the textbook about decision trees in great detail.
- Homework Assignment:
- Calculate Gain(S,A1) and Gain(S,A2) for the dataset S and attributes A1 and A2
  on Slide 8 of the textbook slides (Chapter 3).
  Show each step of the calculation.
  Include your solution in your written report (not in your oral report).
- Consider the Gain(S,A) formula (Equation 3.4, p. 58 of your textbook).
  Is it the case that for any dataset S and for any attribute A in S,
  Gain(S,A) ≥ 0?
  If your answer is yes, provide a detailed proof. If your answer is no,
  provide a dataset S and an attribute A in that dataset such that Gain(S,A) < 0.
  Include your solution in your written report (not in your oral report).
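As a sanity check for the hand calculations above, the entropy and information-gain formulas can be sketched in Python. The toy dataset below is a hypothetical placeholder, not the dataset from Slide 8 — substitute your own examples to verify each step:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy(S) = -sum over classes c of p_c * log2(p_c)."""
    n = len(labels)
    return -sum((k / n) * log2(k / n) for k in Counter(labels).values())

def gain(examples, attr):
    """Gain(S,A) = Entropy(S) - sum over values v of (|S_v|/|S|) * Entropy(S_v),
    as in Equation 3.4."""
    labels = [label for _, label in examples]
    n = len(examples)
    # Partition S by the value of attribute A.
    by_value = {}
    for attrs, label in examples:
        by_value.setdefault(attrs[attr], []).append(label)
    remainder = sum(len(sub) / n * entropy(sub) for sub in by_value.values())
    return entropy(labels) - remainder

# Hypothetical toy dataset: each example is (attribute dict, class label).
S = [({"A1": True}, "+"), ({"A1": True}, "+"),
     ({"A1": False}, "+"), ({"A1": False}, "-")]
print(gain(S, "A1"))  # ≈ 0.311
```

The same `gain` function applied to many random datasets is also a quick empirical probe of the Gain(S,A) ≥ 0 question, though of course it is no substitute for a proof.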
- Project Assignment:
THOROUGHLY READ AND FOLLOW THE
PROJECT GUIDELINES.
These guidelines contain detailed information about how to structure your
project, and how to prepare your written and oral reports.
- Data Mining Technique(s):
We will run experiments using the following decision tree techniques:
- ID3, and
- J4.8 (given that J4.8 can handle numeric attributes and missing values
  directly, make sure to run some experiments with no pre-processing and
  some experiments with pre-processing, and compare your results).
- Dataset(s):
In this project, we will use two datasets:
- The 1995 Data Analysis Exposition.
  This dataset contains college data taken from the U.S. News & World Report's Guide to
  America's Best Colleges. The necessary files are:
  Let's make "private/public" the classification target. Note that even though the values
  of this attribute are 0s and 1s, this is a nominal (not a numeric!) attribute.
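One way to make sure Weka treats the target as nominal is to declare it that way in the ARFF header. The attribute name below is an assumption about how the file labels it — match it to the actual column name:

```
@attribute Private {0,1}    % nominal: two symbolic values
% not:  @attribute Private numeric
```

If the file arrives as CSV, Weka will guess a 0/1 column is numeric, so the declaration (or Weka's NumericToNominal filter) is what forces the nominal interpretation.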
- A dataset of your choice. This dataset can consist of
  data that you use for your own research or work,
  a dataset taken from a public data repository (e.g., the
  UCI Machine Learning Repository or the
  UCI KDD Archive),
  or data that you collect from public data sources.
THIS DATASET CANNOT BE ONE OF THOSE INCLUDED IN THE WEKA SYSTEM.
- Performance Metric(s):
- Use (1) classification accuracy, (2) size of the tree, and (3) readability
of the tree, as separate measures to evaluate the "goodness" of your models.
- Compare each accuracy you obtain against those of benchmark techniques
  such as ZeroR and OneR over the same (sub-)set of data instances you used in
  the corresponding experiment.
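For reference, the two baselines are simple enough to sketch in plain Python (a minimal sketch on hypothetical toy data; Weka's implementations additionally handle numeric attributes, ties, and missing values):

```python
from collections import Counter

def zero_r(train):
    """ZeroR: always predict the most frequent class in the training data."""
    majority = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda example: majority

def one_r(train, attributes):
    """OneR: pick the single attribute whose one-level rule makes the fewest
    training errors; predict the majority class for each attribute value."""
    best_rules, best_errors, best_attr = None, None, None
    for attr in attributes:
        by_value = {}
        for attrs, label in train:
            by_value.setdefault(attrs[attr], []).append(label)
        rules = {v: Counter(ls).most_common(1)[0][0] for v, ls in by_value.items()}
        errors = sum(1 for attrs, label in train if rules[attrs[attr]] != label)
        if best_errors is None or errors < best_errors:
            best_rules, best_errors, best_attr = rules, errors, attr
    return lambda example: best_rules.get(example[best_attr])

# Hypothetical toy data: (attribute dict, class label).
train = [({"x": 0}, "a"), ({"x": 0}, "a"),
         ({"x": 1}, "b"), ({"x": 1}, "b"), ({"x": 1}, "a")]
baseline = zero_r(train)   # predicts "a" for everything
rule = one_r(train, ["x"]) # predicts per-value majority: x=0 -> "a", x=1 -> "b"
```

Any decision tree worth reporting should beat ZeroR; beating OneR shows the tree is exploiting more than one attribute.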
- Remember to experiment with pruning of your J4.8 decision tree:
  Experiment with Weka's J4.8 classifier to see how it performs pre- and/or
  post-pruning of the decision tree in order to increase the classification
  accuracy and/or to reduce the size of the decision tree.
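In the Weka Explorer these experiments correspond to J48's options (flag names as in Weka 3; check your version's documentation):

```
-U          build an unpruned tree
-C <num>    confidence factor for post-pruning (default 0.25; smaller prunes more)
-M <num>    minimum number of instances per leaf (default 2; larger prunes more)
-R          use reduced-error pruning instead of C4.5's default pruning
```

Varying `-C` and `-M` and recording both accuracy and tree size gives exactly the accuracy-vs-size trade-off the performance metrics above ask you to report.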