WPI Worcester Polytechnic Institute

Computer Science Department
------------------------------------------

CS539 Machine Learning - Fall 2012 
Project 8 - Rule Learning

PROF. CAROLINA RUIZ 

Due Date: Tuesday, Dec. 11th 2012. Slides are due at 11:00 am and the written report is due at 3:00 pm. 
------------------------------------------

  1. Read in great detail Chapter 10 of the textbook on Rule Learning and the FOIL paper from New Generation Computing (1995) discussed in class.

  2. Homework Assignment:

  3. Project Assignment: THOROUGHLY READ AND FOLLOW THE PROJECT GUIDELINES. These guidelines contain detailed information about how to structure your project, and how to prepare your written and oral reports.

    1. Part I.

      • Data Mining Technique(s): Use sequential covering algorithms to construct sets of classification rules.

        • Code: Use both of the following two implementations of sequential-covering methods (or write your own implementations of those methods if you prefer):

          • The JRip algorithm implemented in the Weka system. Read the code of the JRip classifier in Weka in great detail.

          • The FOIL system (release 6) developed by Quinlan. This and other versions of FOIL are available online at: Quinlan's Webpage.

      • Dataset(s): In this project, we will use the following datasets:

        • The sample problems that come with the FOIL release 6 system:
          • ackermann.d
          • crx.d
          • member.d
          • ncm.d
          • qs44.d
          • sort.d

        • The diabetes.arff dataset that comes with the Weka system.
          (Try to add basic relationships to this dataset that FOIL can use as building blocks.)

        • The Spambase Data Set available at the UCI Machine Learning Repository.
          (Try to add basic relationships to this dataset that FOIL can use as building blocks.)
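
      The sequential-covering idea shared by JRip and FOIL can be sketched as follows. This is a minimal illustration on toy nominal data, not the actual Weka or FOIL code; the data, function names, and the simple precision-based rule search are all made up for the example:

      ```python
      def covers(rule, example):
          """A rule is a dict {attribute: required value}; it covers an
          example when every test in the rule matches."""
          return all(example[a] == v for a, v in rule.items())

      def precision(rule, examples, target):
          """Fraction of covered examples whose label equals the target class."""
          covered = [ex for ex in examples if covers(rule, ex)]
          if not covered:
              return 0.0
          return sum(ex["label"] == target for ex in covered) / len(covered)

      def learn_one_rule(examples, attributes, target):
          """General-to-specific greedy search: repeatedly add the single
          attribute=value test that most improves precision."""
          rule = {}
          while True:
              best, best_p = None, precision(rule, examples, target)
              for a in attributes:
                  if a in rule:
                      continue
                  for v in {ex[a] for ex in examples}:
                      cand = dict(rule, **{a: v})
                      p = precision(cand, examples, target)
                      if p > best_p:
                          best, best_p = cand, p
              if best is None:          # no further test improves the rule
                  return rule or None
              rule = best
              if best_p == 1.0:         # rule is consistent; stop specializing
                  return rule

      def learn_rules(examples, attributes, target=True, max_rules=10):
          """Sequential covering: learn one rule at a time, then remove the
          examples that rule covers and repeat on what remains."""
          rules, remaining = [], list(examples)
          while any(ex["label"] == target for ex in remaining) and len(rules) < max_rules:
              rule = learn_one_rule(remaining, attributes, target)
              if rule is None:
                  break
              rules.append(rule)
              remaining = [ex for ex in remaining if not covers(rule, ex)]
          return rules

      # Toy data: play tennis only when the outlook is sunny.
      data = [
          {"outlook": "sunny", "wind": "weak",   "label": True},
          {"outlook": "sunny", "wind": "strong", "label": True},
          {"outlook": "rain",  "wind": "weak",   "label": False},
          {"outlook": "rain",  "wind": "strong", "label": False},
      ]
      print(learn_rules(data, ["outlook", "wind"]))   # [{'outlook': 'sunny'}]
      ```

      JRip (RIPPER) and FOIL differ from this sketch mainly in the rule-growing heuristic (information gain over bindings in FOIL, MDL-based pruning in RIPPER), but both follow this outer loop of learning a rule and removing the examples it covers.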

    2. Part II. Complete the following table summarizing each and every one of your projects. Use the Spambase Data Set available at the UCI Machine Learning Repository. Re-run experiments as necessary so that you can report results using the same evaluation approach (10-fold cross-validation if at all possible, otherwise 4-fold cross-validation), the same training and testing datasets, etc. The Experimenter in the Weka system would be very helpful for this. Also, use the Experimenter to determine whether or not the accuracy differences between pairs of these methods are statistically significant with a p-value of 0.05 or less. Please include this table in your report and in your slides.

      Technique                                  | DecisionTrees J4.8 | NeuralNetworks | NaiveBayes/BayesNets | Instance-Based IB1/IBk/LR/LWR | GeneticAlgorithms | RuleLearning JRip/FOIL
      -------------------------------------------+--------------------+----------------+----------------------+-------------------------------+-------------------+-----------------------
      Code (Weka/mine/other/adapted)             |                    |                |                      |                               |                   |
      Dataset (name)                             |                    |                |                      |                               |                   |
      Accuracy (or error metrics; list them)     |                    |                |                      |                               |                   |
      Stat. significantly better than (methods)  |                    |                |                      |                               |                   |
      Size of the model                          |                    |                |                      |                               |                   |
      How readable is the model?                 |                    |                |                      |                               |                   |
      Number of attributes used                  |                    |                |                      |                               |                   |
      Num. of training instances                 |                    |                |                      |                               |                   |
      Num. of test instances                     |                    |                |                      |                               |                   |
      Missing values included? (y/n)             |                    |                |                      |                               |                   |
      What pre-processing was done?              |                    |                |                      |                               |                   |
      Evaluation method (n-fold cross val, n=?)  |                    |                |                      |                               |                   |
      Training time                              |                    |                |                      |                               |                   |
      Testing time                               |                    |                |                      |                               |                   |
      Strengths and weaknesses of the method     |                    |                |                      |                               |                   |

      For those methods for which two or more alternatives are listed in the table (e.g., Naive Bayes / Bayesian Nets), provide the required information for each of the alternatives listed, in the order they are listed, separated by "/"s (e.g., under accuracy, "78% / 81%" if your best Naive Bayes model achieved 78% accuracy and your best Bayesian Net achieved 81% on the dataset analyzed).

      Include in your written report a detailed description and analysis of your table.
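
      The significance test behind the Experimenter can be illustrated with a plain paired t-test over per-fold accuracies. This is only a sketch: the fold accuracies below are made-up numbers, and Weka's Experimenter by default applies a corrected version of this test (the plain paired t-test is known to be optimistic on cross-validation folds because the folds share training data):

      ```python
      import math

      def paired_t_statistic(acc_a, acc_b):
          """t statistic of the per-fold accuracy differences between two
          classifiers evaluated on the same cross-validation folds."""
          d = [a - b for a, b in zip(acc_a, acc_b)]
          n = len(d)
          mean = sum(d) / n
          var = sum((x - mean) ** 2 for x in d) / (n - 1)
          if var == 0.0:
              return 0.0 if mean == 0.0 else math.inf
          return mean / math.sqrt(var / n)

      # Two-tailed critical value of Student's t for df = 9 (10 folds), p = 0.05.
      T_CRIT_10FOLD_P05 = 2.262

      def significantly_different(acc_a, acc_b, t_crit=T_CRIT_10FOLD_P05):
          """True when the accuracy difference is significant at p <= 0.05."""
          return abs(paired_t_statistic(acc_a, acc_b)) > t_crit

      # Hypothetical per-fold accuracies for two classifiers (made-up numbers).
      jrip = [0.90, 0.91, 0.89, 0.92, 0.90, 0.91, 0.90, 0.89, 0.92, 0.91]
      foil = [0.80, 0.82, 0.79, 0.81, 0.80, 0.83, 0.80, 0.79, 0.82, 0.81]
      print(significantly_different(jrip, foil))   # True: the ~10-point gap is significant
      ```

      When reporting "statistically significantly better than" in the table, rely on the Experimenter's corrected test rather than a hand-rolled one like this; the sketch is only meant to show what the p = 0.05 comparison is doing.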

  4. Project 8 Grading Sheet