Due Date:
Thursday, Feb. 17th 2011. Slides are due at 11:00 am (by email)
and a hardcopy of the Written Report is due at 1:00 pm (beginning of class).
Project Assignment:
- Read Chapter 3 of the textbook about decision trees in great detail.
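Chapter 3 builds the ID3 algorithm around entropy and information gain; as a refresher, here is a minimal pure-Python sketch of both quantities (the dict-of-attributes example encoding is an assumption for illustration, not a format the project prescribes):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy(S) = -sum over classes of p * log2(p)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attr, label):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v),
    where each example is a dict mapping attribute names to values."""
    n = len(examples)
    total = entropy([ex[label] for ex in examples])
    partitions = {}
    for ex in examples:
        partitions.setdefault(ex[attr], []).append(ex[label])
    remainder = sum(len(part) / n * entropy(part)
                    for part in partitions.values())
    return total - remainder
```

ID3 greedily splits each node on the attribute with the highest gain; J4.8, as a C4.5 implementation, uses the closely related gain-ratio criterion instead.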
- THOROUGHLY READ AND FOLLOW THE PROJECT GUIDELINES.
These guidelines contain detailed information about how to structure your
project, and how to prepare your written and oral reports.
- Data Mining Technique(s):
We will run experiments using J4.8 in Weka (given that J4.8 is able to handle numeric attributes and missing values directly, make sure to run some experiments with no pre-processing and some experiments with pre-processing, and compare your results); or the decision tree functions in Matlab (see the Matlab decision tree demo); or both.
- Dataset(s):
In this project, we will use two datasets:
- The census-income (also called "adult") dataset from the US Census Bureau, which is available at the Univ. of California Irvine (UCI) Data Repository.
The census-income dataset contains census information for 48,842
people. It has 14 attributes for each person
(age,
workclass,
fnlwgt,
education,
education-num,
marital-status,
occupation,
relationship,
race,
sex,
capital-gain,
capital-loss,
hours-per-week, and
native-country)
and a binary class attribute classifying the income of the person as belonging to one of two categories: >50K, <=50K.
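For experiments outside Weka it can help to read the raw UCI file directly; the sketch below assumes the comma-separated adult.data layout (the 14 attribute fields above followed by the class label, with missing values written as "?"):

```python
# Attribute names in the order they appear in the UCI adult.data file.
ADULT_ATTRIBUTES = ["age", "workclass", "fnlwgt", "education", "education-num",
                    "marital-status", "occupation", "relationship", "race",
                    "sex", "capital-gain", "capital-loss", "hours-per-week",
                    "native-country"]

def parse_adult_record(line):
    """Split one comma-separated record into (attribute dict, class label).
    UCI encodes missing values as '?'; those are mapped to None."""
    fields = [f.strip() for f in line.split(",")]
    values, label = fields[:-1], fields[-1]
    record = {name: (None if v == "?" else v)
              for name, v in zip(ADULT_ATTRIBUTES, values)}
    return record, label
```

Records containing "?" fields are exactly the ones J4.8 can consume as-is, and the ones your pre-processing experiments might instead impute or drop for comparison.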
- A dataset of your choice. This dataset can consist of
data that you use for your own research or work,
a dataset taken from a public data repository (e.g., see Additional Suggested References on our course webpage for a listing of data repositories), or data that you collect from public data sources.
THIS DATASET CANNOT BE ONE OF THOSE INCLUDED IN THE WEKA SYSTEM.
- Performance Metric(s):
- Use (1) classification accuracy, (2) size of the tree, and (3) readability
of the tree, as separate measures to evaluate the "goodness" of your models.
- Compare each accuracy you obtained against those of benchmarking techniques such as ZeroR and OneR over the same (sub-)set of data instances you used in the corresponding experiment.
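Both baselines are simple enough to state in a few lines. This pure-Python sketch (dict-encoded examples are an assumed format for illustration) mirrors what Weka's weka.classifiers.rules.ZeroR and OneR compute, minus OneR's bucketing of numeric attributes:

```python
from collections import Counter, defaultdict

def zero_r(examples, label):
    """ZeroR: always predict the majority class of the training data."""
    majority = Counter(ex[label] for ex in examples).most_common(1)[0][0]
    return lambda ex: majority

def one_r(examples, attributes, label):
    """OneR: build a value -> majority-class rule for each attribute and
    keep the attribute whose rule makes the fewest training errors."""
    best_attr, best_rule, best_errors = None, None, len(examples) + 1
    for attr in attributes:
        by_value = defaultdict(Counter)
        for ex in examples:
            by_value[ex[attr]][ex[label]] += 1
        rule = {v: counts.most_common(1)[0][0]
                for v, counts in by_value.items()}
        errors = sum(sum(counts.values()) - counts[rule[v]]
                     for v, counts in by_value.items())
        if errors < best_errors:
            best_attr, best_rule, best_errors = attr, rule, errors
    # Unseen attribute values fall through to None in this simple sketch.
    return lambda ex: best_rule.get(ex[best_attr])
```

A J4.8 accuracy that does not clearly beat these baselines on the same instances is a sign the tree has learned little beyond the class distribution.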
- Remember to experiment with pruning of your (J4.8 or Matlab) decision tree:
Experiment with pre- and/or
post-pruning of the decision tree in order to increase the classification
accuracy and/or to reduce the size of the decision tree.
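Reduced-error pruning, one of the post-pruning strategies covered in Chapter 3, is easy to prototype against a held-out validation set. The dict encoding of tree nodes below is a hypothetical format chosen for illustration (in Weka, J4.8's -R and -N options enable this idea; its default is confidence-based subtree pruning controlled by -C):

```python
def classify(node, example):
    """Route an example down a dict-encoded tree to a leaf class.
    Nodes are either {'leaf': cls} or
    {'attr': name, 'children': {value: subtree}, 'majority': cls}."""
    while "leaf" not in node:
        child = node["children"].get(example.get(node["attr"]))
        if child is None:
            return node["majority"]  # unseen value: fall back to majority
        node = child
    return node["leaf"]

def accuracy(tree, examples, label):
    return sum(classify(tree, ex) == ex[label] for ex in examples) / len(examples)

def reduced_error_prune(node, validation, label):
    """Bottom-up: replace a subtree with its majority-class leaf whenever
    that does not hurt accuracy on the validation set."""
    if "leaf" in node or not validation:
        return node
    for v, sub in node["children"].items():
        subset = [ex for ex in validation if ex.get(node["attr"]) == v]
        node["children"][v] = reduced_error_prune(sub, subset, label)
    leaf = {"leaf": node["majority"]}
    if accuracy(leaf, validation, label) >= accuracy(node, validation, label):
        return leaf
    return node
```

Pruning this way trades a little training-set fit for a smaller, more readable tree, which is exactly the size/accuracy trade-off the performance metrics above ask you to report.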
- Advanced Topic(s) (30 points):
Investigate in more depth (experimentally, theoretically, or both) a topic of your
choice that is related to decision trees
and that is not covered already in this project.
This decision tree-related topic might be something that was described or mentioned
in the textbook or in class, or that comes from your own research, or that is related
to your interests. Just a few ideas are: The prune function in Matlab; C4.5;
C4.5 pruning methods (for trees or for rules); any of the
additional tree classifiers in Weka: DecisionStump, LMT, RandomForest, RandomTree,
REPTree; meta-learning applied to decision trees (see Classifier -> Choose -> meta);
an idea from a research paper that you find intriguing; ...
- Project 2 Grading Sheet