WPI Worcester Polytechnic Institute

Computer Science Department
------------------------------------------

CS539 Machine Learning - Spring 2009 
Project 4 - Evaluating Hypotheses

PROF. CAROLINA RUIZ 

Due Date: Thursday, March 5th, 2009. The written report is due at 3:30 pm (beginning of class).
------------------------------------------

  1. Study Chapter 5 in detail.

  2. Solve each of the book exercises at the end of Chapter 5:

    5.1, 5.2, 5.3, 5.4, 5.5, and 5.6.

  3. Use stratified sampling to select two different subsets of 1000 data instances each from the census-income (also called "adult") dataset. Let's denote these subsets S1 and S2. Let income (>50K, <=50K) be the target attribute.

    1. Train a J4.8 decision tree t over S1 using a 75% split. That is, use 75% of the data to build the tree and the remaining 25% to calculate error_S1'(t). Use this error_S1'(t) to estimate with 95% probability error_D(t), i.e. the error of t over the entire distribution D of census-income instances (not just the 50,000 data instances available at the dataset website).

    2. Train a neural network nn with 1 hidden layer and other default parameters over the dataset S2 using a 75% split. That is, use 75% of the data in S2 to train the neural net and the remaining 25% to calculate error_S2'(nn). Compare the decision tree t from above and the neural network nn by estimating the difference d between the true errors of these two hypotheses with 95% probability using error_S1'(t) and error_S2'(nn).

    3. Compare J4.8 decision trees and neural networks (with default parameters) over the census-income dataset by running Weka's experimenter over S1, with 1 repetition of 10-fold cross-validation. Make sure to store the results that the experimenter outputs in an arff file.

      1. From the arff results file, extract the error (Percent_incorrect) of decision trees and of neural networks on each iteration of 10-fold cross-validation. Using those errors, follow by hand the paired t test procedure described in Section 5.6 of your textbook. Use an approximate 95% confidence interval, k=10, and data subset S1 as D0. Show each step of your calculations. (Note that you need to find the constant t_{95,9}, i.e. the t value for 95% confidence with k-1 = 9 degrees of freedom, to complete the calculations.)

      2. From the "Analyse" tab of the Experimenter, use the "Paired T-Tester" to perform a test comparing the "percent_incorrect" of both learning methods used in the experiment with Significance = 0.05 (which is the same as confidence = 95%). Include in your document the results of the paired t test reported by Weka. Compare with the results you obtained above when you ran the paired t test by hand.
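
The arithmetic of the by-hand paired t test in step 3.3.1 can be sketched in Python as a way to double-check hand calculations. This is only an illustrative sketch: the function name `paired_t_interval` and the per-fold error values are hypothetical placeholders, not real Weka output, and the constant 2.262 is the standard two-sided 95% t value for 9 degrees of freedom, which should be confirmed against the table in the textbook.

```python
import math


def paired_t_interval(errors_a, errors_b, t_value):
    """Return (mean difference, its estimated std. deviation, interval).

    errors_a, errors_b: per-fold errors of the two learners, paired by fold.
    t_value: the constant t_{N,k-1} for the chosen confidence level.
    """
    if len(errors_a) != len(errors_b) or len(errors_a) < 2:
        raise ValueError("need two equally long lists with at least 2 folds")
    k = len(errors_a)
    # Per-fold differences delta_i = error_i(A) - error_i(B)
    deltas = [a - b for a, b in zip(errors_a, errors_b)]
    mean_delta = sum(deltas) / k
    # Estimated standard deviation of the mean difference:
    # sqrt( sum (delta_i - mean)^2 / (k * (k - 1)) )
    s_mean = math.sqrt(
        sum((d - mean_delta) ** 2 for d in deltas) / (k * (k - 1))
    )
    half_width = t_value * s_mean
    return mean_delta, s_mean, (mean_delta - half_width, mean_delta + half_width)


if __name__ == "__main__":
    # HYPOTHETICAL Percent_incorrect values for 10 folds (placeholders only)
    tree_errors = [15.0, 17.0, 14.0, 16.0, 18.0, 15.0, 13.0, 17.0, 16.0, 14.0]
    net_errors = [17.0, 18.0, 15.0, 19.0, 18.0, 16.0, 15.0, 19.0, 17.0, 16.0]
    T_95_9 = 2.262  # two-sided 95% t value, 9 degrees of freedom

    mean_delta, s_mean, (low, high) = paired_t_interval(
        tree_errors, net_errors, T_95_9
    )
    print(f"mean difference = {mean_delta:.3f}")
    print(f"s of mean difference = {s_mean:.3f}")
    print(f"approx. 95% interval = [{low:.3f}, {high:.3f}]")
```

If the resulting interval excludes 0, the observed difference between the two learners is significant at the chosen level; that conclusion can then be compared with what Weka's Paired T-Tester reports.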

Please turn in written solutions to these problems at the beginning of class on Thursday, March 5th and be ready to discuss your solutions and the chapter in class.
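
For steps 3.1 and 3.2, the interval estimates from Chapter 5 can be sketched as below. This is a minimal sketch under the usual normal approximation, assuming a 250-instance test set (25% of 1000); the function names and the misclassification counts are hypothetical placeholders, and 1.96 is the standard two-sided 95% z value.

```python
import math

Z_95 = 1.96  # two-sided 95% constant for the normal distribution


def error_interval(num_errors, n, z=Z_95):
    """Sample error and approximate confidence interval for the true error.

    num_errors: misclassified test instances; n: test set size.
    """
    e = num_errors / n
    half_width = z * math.sqrt(e * (1 - e) / n)
    return e, (e - half_width, e + half_width)


def difference_interval(e1, n1, e2, n2, z=Z_95):
    """Estimate d = e1 - e2 and an approximate confidence interval for it.

    e1, e2: sample errors measured on independent test sets of size n1, n2.
    """
    d_hat = e1 - e2
    # The variance of the difference is approximated by the sum of the
    # two individual variances.
    sigma = math.sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)
    return d_hat, (d_hat - z * sigma, d_hat + z * sigma)


if __name__ == "__main__":
    # HYPOTHETICAL counts: 40 and 45 misclassified out of 250 test instances
    e_tree, tree_ci = error_interval(40, 250)
    e_net, net_ci = error_interval(45, 250)
    d_hat, d_ci = difference_interval(e_tree, 250, e_net, 250)
    print("tree error:", e_tree, "interval:", tree_ci)
    print("net error:", e_net, "interval:", net_ci)
    print("difference:", d_hat, "interval:", d_ci)
```

The approximation is only reasonable when the test sets are large enough, so check the textbook's conditions for applying it before relying on the numbers.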