Due Date:
Thursday, Feb. 17th 2011. Slides are due at 11:00 am (by email)
and a hardcopy of the Written Report is due at 1:00 pm (beginning of class).
Project Assignment:
- Read Chapter 3 of the textbook about decision trees in great detail.
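Chapter 3 builds the ID3 algorithm around entropy and information gain; as a refresher, here is a minimal pure-Python sketch of both quantities (the dict-of-attributes example encoding is an assumption for illustration, not a format the project prescribes):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy(S) = -sum over classes of p * log2(p)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attr, label):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v),
    where each example is a dict mapping attribute names to values."""
    n = len(examples)
    total = entropy([ex[label] for ex in examples])
    partitions = {}
    for ex in examples:
        partitions.setdefault(ex[attr], []).append(ex[label])
    remainder = sum(len(part) / n * entropy(part)
                    for part in partitions.values())
    return total - remainder
```

ID3 greedily splits each node on the attribute with the highest gain; J4.8, as a C4.5 implementation, uses the closely related gain-ratio criterion instead.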
- THOROUGHLY READ AND FOLLOW THE PROJECT GUIDELINES.
These guidelines contain detailed information about how to structure your
project, and how to prepare your written and oral reports.
- Data Mining Technique(s):
We will run experiments using J4.8 in Weka (given that J4.8 is able to handle numeric attributes and missing values directly, make sure to run some experiments with no pre-processing and some experiments with pre-processing, and compare your results); or the decision tree functions in Matlab (see the Matlab decision tree demo); or both.
- Dataset(s):
In this project, we will use two datasets:
- The census-income (also called "adult") dataset from the US Census Bureau, which is available at the Univ. of California Irvine (UCI) Data Repository.
The census-income dataset contains census information for 48,842
people. It has 14 attributes for each person
(age,
workclass,
fnlwgt,
education,
education-num,
marital-status,
occupation,
relationship,
race,
sex,
capital-gain,
capital-loss,
hours-per-week, and
native-country)
and a binary class attribute classifying the income of the person as belonging to one of two categories: >50K, <=50K.
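For experiments outside Weka it can help to read the raw UCI file directly; the sketch below assumes the comma-separated adult.data layout (the 14 attribute fields above followed by the class label, with missing values written as "?"):

```python
# Attribute names in the order they appear in the UCI adult.data file.
ADULT_ATTRIBUTES = ["age", "workclass", "fnlwgt", "education", "education-num",
                    "marital-status", "occupation", "relationship", "race",
                    "sex", "capital-gain", "capital-loss", "hours-per-week",
                    "native-country"]

def parse_adult_record(line):
    """Split one comma-separated record into (attribute dict, class label).
    UCI encodes missing values as '?'; those are mapped to None."""
    fields = [f.strip() for f in line.split(",")]
    values, label = fields[:-1], fields[-1]
    record = {name: (None if v == "?" else v)
              for name, v in zip(ADULT_ATTRIBUTES, values)}
    return record, label
```

Records containing "?" fields are exactly the ones J4.8 can consume as-is, and the ones your pre-processing experiments might instead impute or drop for comparison.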
- A dataset of your choice. This dataset can consist of
data that you use for your own research or work,
a dataset taken from a public data repository (e.g., see Additional Suggested References on our course webpage for a listing of data repositories), or data that you collect from public data sources.
THIS DATASET CANNOT BE ONE OF THOSE INCLUDED IN THE WEKA SYSTEM.
- Performance Metric(s):
- Use (1) classification accuracy, (2) size of the tree, and (3) readability
of the tree, as separate measures to evaluate the "goodness" of your models.
- Compare each accuracy you obtained against those of benchmarking techniques such as ZeroR and OneR over the same (sub-)set of data instances you used in the corresponding experiment.
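Both baselines are simple enough to state in a few lines. This pure-Python sketch (dict-encoded examples are an assumed format for illustration) mirrors what Weka's weka.classifiers.rules.ZeroR and OneR compute, minus OneR's bucketing of numeric attributes:

```python
from collections import Counter, defaultdict

def zero_r(examples, label):
    """ZeroR: always predict the majority class of the training data."""
    majority = Counter(ex[label] for ex in examples).most_common(1)[0][0]
    return lambda ex: majority

def one_r(examples, attributes, label):
    """OneR: build a value -> majority-class rule for each attribute and
    keep the attribute whose rule makes the fewest training errors."""
    best_attr, best_rule, best_errors = None, None, len(examples) + 1
    for attr in attributes:
        by_value = defaultdict(Counter)
        for ex in examples:
            by_value[ex[attr]][ex[label]] += 1
        rule = {v: counts.most_common(1)[0][0]
                for v, counts in by_value.items()}
        errors = sum(sum(counts.values()) - counts[rule[v]]
                     for v, counts in by_value.items())
        if errors < best_errors:
            best_attr, best_rule, best_errors = attr, rule, errors
    # Unseen attribute values fall through to None in this simple sketch.
    return lambda ex: best_rule.get(ex[best_attr])
```

A J4.8 accuracy that does not clearly beat these baselines on the same instances is a sign the tree has learned little beyond the class distribution.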
- Remember to experiment with pruning of your (J4.8 or Matlab) decision tree:
Experiment with pre- and/or
post-pruning of the decision tree in order to increase the classification
accuracy and/or to reduce the size of the decision tree.
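Reduced-error pruning, one of the post-pruning strategies covered in Chapter 3, is easy to prototype against a held-out validation set. The dict encoding of tree nodes below is a hypothetical format chosen for illustration (in Weka, J4.8's -R and -N options enable this idea; its default is confidence-based subtree pruning controlled by -C):

```python
def classify(node, example):
    """Route an example down a dict-encoded tree to a leaf class.
    Nodes are either {'leaf': cls} or
    {'attr': name, 'children': {value: subtree}, 'majority': cls}."""
    while "leaf" not in node:
        child = node["children"].get(example.get(node["attr"]))
        if child is None:
            return node["majority"]  # unseen value: fall back to majority
        node = child
    return node["leaf"]

def accuracy(tree, examples, label):
    return sum(classify(tree, ex) == ex[label] for ex in examples) / len(examples)

def reduced_error_prune(node, validation, label):
    """Bottom-up: replace a subtree with its majority-class leaf whenever
    that does not hurt accuracy on the validation set."""
    if "leaf" in node or not validation:
        return node
    for v, sub in node["children"].items():
        subset = [ex for ex in validation if ex.get(node["attr"]) == v]
        node["children"][v] = reduced_error_prune(sub, subset, label)
    leaf = {"leaf": node["majority"]}
    if accuracy(leaf, validation, label) >= accuracy(node, validation, label):
        return leaf
    return node
```

Pruning this way trades a little training-set fit for a smaller, more readable tree, which is exactly the size/accuracy trade-off the performance metrics above ask you to report.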
- Advanced Topic(s) (30 points):
Investigate in more depth (experimentally, theoretically, or both) a topic of your
choice that is related to decision trees
and that is not covered already in this project.
This decision tree-related topic might be something that was described or mentioned
in the textbook or in class, or that comes from your own research, or that is related
to your interests. Just a few ideas are: The prune function in Matlab; C4.5;
C4.5 pruning methods (for trees or for rules); any of the
additional tree classifiers in Weka: DecisionStump, LMT, RandomForest, RandomTree,
REPTree; meta-learning applied to decision trees (see Classifier -> Choose -> meta);
an idea from a research paper that you find intriguing; ...
- Project 2 Grading Sheet