Proj4 - CS 539 Fall 2014

Computer Science Department

CS539 Machine Learning - Fall 2014
Project 4 - Evaluating Hypotheses

PROF. CAROLINA RUIZ

Due Date: Tuesday, Oct. 14th, 2014.

Study Chapter 5 in detail.
Solve each of the book exercises at the end of Chapter 5: 5.1, 5.2, 5.3, 5.4, 5.5, and 5.6.
Remember to show your work and explain your answers.
You can use Weka or R for this part. Choose one of the two tools and use it for the whole project.
Use stratified sampling to select two different subsets of 500 data instances each from the diabetes.arff dataset that comes with the Weka system. Let's denote these subsets S1 and S2.
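The stratified sampling step above can be illustrated with a minimal sketch. It is written in Python only to show the idea; the project itself is to be done in Weka or R. The function name `stratified_sample`, the seeds, and the toy label list are all hypothetical, and the 2:1 class ratio is a stand-in, not the actual class distribution of diabetes.arff.

```python
import random
from collections import defaultdict

def stratified_sample(labels, n, seed):
    """Return sorted indices of an n-instance sample that keeps class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    total = len(labels)
    # Ideal (fractional) quota per class, rounded with the largest-remainder rule
    quotas = {c: n * len(ix) / total for c, ix in by_class.items()}
    counts = {c: int(q) for c, q in quotas.items()}
    leftover = n - sum(counts.values())
    by_remainder = sorted(quotas, key=lambda c: quotas[c] - counts[c], reverse=True)
    for c in by_remainder[:leftover]:
        counts[c] += 1
    # Draw each class quota without replacement
    sample = []
    for c, ix in by_class.items():
        sample.extend(rng.sample(ix, counts[c]))
    return sorted(sample)

# Toy labels standing in for the class column (hypothetical 2:1 class ratio)
labels = ["neg"] * 600 + ["pos"] * 300
s1 = stratified_sample(labels, 500, seed=1)
s2 = stratified_sample(labels, 500, seed=2)
```

Because each subset is drawn independently from the same dataset, S1 and S2 drawn this way will generally share some instances; they are different subsets, not disjoint ones.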
1. Train a J4.8 decision tree t over S1 using a 75% split. That is, use 75% of the data in S1 to build the tree and the remaining 25% (denoted S1') to calculate error_S1'(t). Use error_S1'(t) to estimate, with 95% probability, error_D(t), that is, the error of t over the entire distribution D of patients in this domain.
2. Train a neural network nn with 1 hidden layer and other default parameters over the dataset S2 using a 75% split. That is, use 75% of the data in S2 to train the neural net and the remaining 25% (denoted S2') to calculate error_S2'(nn). Compare the decision tree t from above and the neural network nn by estimating, with 95% probability, the difference d between the true errors of these two hypotheses using error_S1'(t) and error_S2'(nn). What can you conclude from this comparison?
3. Compare J4.8 decision trees and neural networks (with default parameters) over the diabetes.arff dataset by running Weka's Experimenter over S1, with 1 repetition of 10-fold cross-validation. Make sure to store the results that the Experimenter outputs in an arff file. You can use R instead of Weka, if you prefer.
  1. From the arff results file, extract the error (Percent_incorrect) of decision trees and of neural networks on each iteration of 10-fold cross-validation. Using those errors, follow by hand the paired t test procedure described in Section 5.6 of your textbook (see Table 5.5). Use an approximate 95% confidence interval, k=10, and data subset S1 as D₀. Show each step of your calculations. (Note that you need to find the value of the constant t_{95, 9} to complete the calculations.) What can you conclude from this comparison?
  2. From the "Analyse" tab of the Experimenter, use the "Paired T-Tester" to perform a test comparing the "percent_incorrect" of both learning methods used in the experiment with Significance = 0.05 (which is the same as confidence = 95%). Include in your document the results of the paired t test reported by Weka. What can you conclude from this comparison? Compare with the results you obtained above when you ran the paired t test by hand.
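The three interval estimates asked for in items 1 to 3 can be sketched as follows. This is a Python illustration of the arithmetic only, assuming the normal-approximation interval for a sample error, the corresponding interval for the difference of two errors measured on independent test sets, and the paired t interval over per-fold differences. All error values below are made-up placeholders, not results from diabetes.arff; the function names are hypothetical; 125 is the size of a 25% test split of 500 instances; and the t constant is passed in by the caller (2.262 is the standard two-sided 95% table value for 9 degrees of freedom).

```python
import math

Z_95 = 1.96  # two-sided 95% constant for the normal approximation

def error_interval(err, n, z=Z_95):
    """Approximate interval for the true error, given sample error err on n test instances."""
    half = z * math.sqrt(err * (1 - err) / n)
    return err - half, err + half

def difference_interval(err1, n1, err2, n2, z=Z_95):
    """Approximate interval for d = error_D(h1) - error_D(h2), independent test sets."""
    d_hat = err1 - err2
    sigma = math.sqrt(err1 * (1 - err1) / n1 + err2 * (1 - err2) / n2)
    return d_hat - z * sigma, d_hat + z * sigma

def paired_t_interval(errs_a, errs_b, t_value):
    """Interval for the mean per-fold error difference over k paired folds."""
    k = len(errs_a)
    deltas = [a - b for a, b in zip(errs_a, errs_b)]
    mean = sum(deltas) / k
    # Estimated standard deviation of the mean difference
    s = math.sqrt(sum((d - mean) ** 2 for d in deltas) / (k * (k - 1)))
    return mean - t_value * s, mean + t_value * s

# Hypothetical sample errors (fractions) on two 125-instance test sets
tree_ci = error_interval(0.24, 125)
diff_ci = difference_interval(0.24, 125, 0.20, 125)

# Hypothetical Percent_incorrect values for 10 cross-validation folds
tree_folds = [26, 24, 30, 22, 28, 26, 24, 30, 22, 28]
nn_folds = [24, 24, 26, 22, 24, 26, 22, 28, 20, 26]
paired_ci = paired_t_interval(tree_folds, nn_folds, t_value=2.262)

print(tree_ci, diff_ci, paired_ci)
```

In each comparison, an interval for the difference that contains 0 means the data do not support a difference between the two methods at the chosen confidence level; an interval that excludes 0 suggests one method has the lower error. Note that the paired interval is in percentage points because it is computed from Percent_incorrect values, while the first two intervals are in fractions.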

This is an individual project. Work on your solutions entirely by yourself. Bring your written solutions to these problems to class when the project is due, and be ready to discuss your solutions and the chapter in class. No need to submit a written report or slides. A quiz will be given to assess your knowledge of the topic.
