WPI Worcester Polytechnic Institute

Computer Science Department
------------------------------------------

CS539 Machine Learning - Spring 2009 
Project 8 - Rule Learning

PROF. CAROLINA RUIZ 

Due Date: Tuesday, April 21st, 2009. Slides are due at 2:30 pm and the written report is due at 3:30 pm. 
------------------------------------------

  1. Read in great detail Chapter 10 of the textbook on Rule Learning and the FOIL paper distributed in class (New Generation Computing '95).

  2. Homework Assignment:

  3. Project Assignment: THOROUGHLY READ AND FOLLOW THE PROJECT GUIDELINES. These guidelines contain detailed information about how to structure your project, and how to prepare your written and oral reports.

    1. Part I.

      • Data Mining Technique(s): Use sequential covering algorithms to construct sets of classification rules (a Python sketch of this covering loop follows the list of implementations below).

        • Code: Use both of the following implementations of sequential-covering methods (or write your own implementations of those methods if you prefer):

          • The Prism algorithm implemented in the Weka system.

            Read the code of the Prism classifier in Weka in great detail. A description of this algorithm can be found in: I.H. Witten and E. Frank, "Data Mining: Practical Machine Learning Tools and Techniques", Second Edition, Morgan Kaufmann Publishers, 2005.

          • The FOIL system (release 6) developed by Quinlan. This and other versions of FOIL are available online at: Quinlan's Webpage. Quinlan's webpage also contains copies of several of his papers describing the techniques employed in the system (the New Generation Computing '95 paper is the one distributed in class).
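
          To make the shared structure of these two systems concrete, below is a minimal Python sketch (not Weka's or Quinlan's actual code) of a Prism-style sequential-covering learner over nominal attributes, together with FOIL's information-gain heuristic for comparison. The data representation (instances as attribute-value dictionaries paired with a class label) and all names are assumptions made for this illustration.

          import math

          def covers(rule, instance):
              """A rule is a list of (attribute, value) tests, ANDed together."""
              return all(instance[attr] == val for attr, val in rule)

          def learn_one_rule(pool, target_class, attributes):
              """Grow one rule for target_class, greedily adding the test with
              the best accuracy p/t (Prism's heuristic): t instances match the
              rule so far plus the candidate test, and p of those t belong to
              target_class."""
              rule, covered, remaining = [], list(pool), set(attributes)
              while remaining and not all(c == target_class for _, c in covered):
                  best, best_ratio, best_p = None, -1.0, -1
                  for attr in remaining:
                      for val in {inst[attr] for inst, _ in covered}:
                          t = sum(1 for inst, _ in covered if inst[attr] == val)
                          p = sum(1 for inst, c in covered
                                  if inst[attr] == val and c == target_class)
                          # Prefer higher accuracy; break ties by larger coverage.
                          if p / t > best_ratio or (p / t == best_ratio and p > best_p):
                              best, best_ratio, best_p = (attr, val), p / t, p
                  rule.append(best)
                  remaining.discard(best[0])
                  covered = [ex for ex in covered if covers(rule, ex[0])]
              return rule

          def prism(instances, attributes):
              """Sequential covering: learn a rule, remove the instances it
              covers, and repeat until no instances of the class remain
              uncovered (Cendrowska's Prism restarts from the full training
              set for each class)."""
              rules = []
              for target_class in sorted({c for _, c in instances}):
                  pool = list(instances)
                  while any(c == target_class for _, c in pool):
                      rule = learn_one_rule(pool, target_class, attributes)
                      rules.append((rule, target_class))
                      pool = [ex for ex in pool if not covers(rule, ex[0])]
              return rules

          def foil_gain(p0, n0, p1, n1):
              """FOIL's heuristic for choosing the next literal (textbook ch. 10):
              t * (log2(p1/(p1+n1)) - log2(p0/(p0+n0))), where (p0, n0) count the
              positive/negative bindings covered before adding the literal,
              (p1, n1) after, and t (approximated here by p1) is the number of
              positive bindings still covered."""
              if p1 == 0:
                  return float("-inf")
              return p1 * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

          if __name__ == "__main__":
              # Tiny made-up dataset: predict "play" from outlook and windy.
              data = [({"outlook": "sunny", "windy": "no"},  "yes"),
                      ({"outlook": "sunny", "windy": "yes"}, "no"),
                      ({"outlook": "rainy", "windy": "no"},  "yes"),
                      ({"outlook": "rainy", "windy": "yes"}, "no")]
              for rule, cls in prism(data, ["outlook", "windy"]):
                  print(" AND ".join(f"{a}={v}" for a, v in rule) or "TRUE", "=>", cls)

          Both systems implement this same "learn one rule, remove what it covers, repeat" loop; they differ mainly in the heuristic used to pick the next condition (Prism's p/t versus FOIL's gain) and in FOIL's use of first-order literals rather than attribute-value tests.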

      • Dataset(s): In this project, we will use the following datasets:

        • The sample problems that come with the FOIL release 6 system:
          • ackermann.d
          • crx.d
          • member.d
          • ncm.d
          • qs44.d
          • sort.d

        • The labor.arff dataset that comes with the Weka system.

        • A dataset of your choice (ideally, but not necessarily, the same one that you pick for Part II below). This dataset can consist of data that you use for your own research or work, a dataset taken from a public data repository (e.g., the UCI Machine Learning Repository or the UCI KDD Archive), or data that you collect from public data sources. This dataset should contain a large enough number of instances, and a combination of nominal and numeric attributes.

    2. Part II. Complete the following table summarizing each and every one of your projects. Pick one dataset (ideally, but not necessarily, the same one you picked for Part I above) and re-run experiments as necessary so that you can report results using the same evaluation approach (if at all possible 10-fold cross-validation; if not, 4-fold cross-validation), the same training and testing datasets, etc. The Experimenter in the Weka system would be very helpful for this. Also, use the Experimenter to determine whether or not the accuracy differences between pairs of these methods are statistically significant with a p-value of 0.05 or less; a sketch of the idea behind this test appears at the end of this section. Please include this table in your report and in your slides.

      Criterion                       | Decision Trees | Neural   | Naive Bayes / | Instance-Based | Genetic    | Rule Learning |
                                      | ID3/J4.8       | Networks | Bayes Nets    | IB1/IBk/LR/LWR | Algorithms | Prism/FOIL    |
      --------------------------------+----------------+----------+---------------+----------------+------------+---------------+
      Code (Weka/mine/other/adapted)  |                |          |               |                |            |               |
      Dataset (name)                  |                |          |               |                |            |               |
      Accuracy (or error metrics;     |                |          |               |                |            |               |
        list metrics used)            |                |          |               |                |            |               |
      Stat. significantly better than |                |          |               |                |            |               |
        (list methods)                |                |          |               |                |            |               |
      Size of the model               |                |          |               |                |            |               |
      How readable is the model?      |                |          |               |                |            |               |
      Number of attributes used       |                |          |               |                |            |               |
      Num. of training instances      |                |          |               |                |            |               |
      Num. of test instances          |                |          |               |                |            |               |
      Missing values included? (y/n)  |                |          |               |                |            |               |
      What pre-processing was done?   |                |          |               |                |            |               |
      Evaluation method used          |                |          |               |                |            |               |
        (n-fold cross val., n = ?)    |                |          |               |                |            |               |
      Training time                   |                |          |               |                |            |               |
      Testing time                    |                |          |               |                |            |               |
      Strengths and weaknesses        |                |          |               |                |            |               |

      For those methods (e.g., decision trees) for which two or more particular algorithms are listed in the table (e.g., ID3 and J4.8), provide the required information for each algorithm, in the order listed, separated by "/"s. For example, report "78% / 81%" under accuracy if your best ID3 decision tree achieved 78% accuracy and your best J4.8 decision tree achieved 81% on the dataset analyzed.

      Include in your written report a detailed description and analysis of your table.
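
      For intuition about what the Experimenter's significance test is doing, here is a rough standalone sketch: a plain paired t-test over per-fold accuracies of two classifiers, using scipy. Note that Weka's Experimenter actually applies a corrected variant of this test (plain paired t-tests over cross-validation folds are known to be over-optimistic), and the fold accuracies below are made-up placeholders, so treat this only as an illustration of the idea.

      from scipy import stats

      # Hypothetical per-fold accuracies for two classifiers evaluated on the
      # SAME ten cross-validation folds (placeholder numbers, not real results).
      acc_j48   = [0.81, 0.79, 0.84, 0.80, 0.78, 0.83, 0.82, 0.80, 0.79, 0.81]
      acc_prism = [0.76, 0.75, 0.80, 0.77, 0.74, 0.79, 0.78, 0.75, 0.76, 0.77]

      # Paired t-test on the fold-by-fold differences.
      t_stat, p_value = stats.ttest_rel(acc_j48, acc_prism)
      print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
      print("significant at p <= 0.05" if p_value <= 0.05
            else "not significant at p <= 0.05")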