Due Date:
Tuesday, Oct. 9th, 2012.
Written Report is due at 3:00 pm (beginning of class).
-
Study Chapter 5 in detail.
-
Solve each of the book exercises at the end of Chapter 5:
5.1, 5.2, 5.3, 5.4, 5.5, and 5.6.
Remember to show your work and explain your answers.
-
Use stratified sampling to select two different subsets of 1000 data
instances each from the Spambase Dataset, available from the
Univ. of California Irvine (UCI) Data Repository.
Let's denote these subsets S1 and S2.
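A minimal sketch of the stratified sampling step, written in Python rather than Weka (an assumption; the assignment does not prescribe a tool for this step). The helper name `stratified_sample`, the seeds, and the synthetic rows standing in for the real data file are all hypothetical. The class counts used below (1813 spam, 2788 non-spam) are the counts usually reported for Spambase and should be checked against the file you download.

```python
import random
from collections import defaultdict


def stratified_sample(rows, label_of, n, rng):
    """Draw about n rows so each class keeps its share of the full data.

    Per-class quotas are rounded, so the result can be off by one or two
    rows from n for some class proportions.
    """
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)

    total = len(rows)
    sample = []
    for label in sorted(by_class):
        members = by_class[label]
        quota = round(n * len(members) / total)
        sample.extend(rng.sample(members, quota))  # without replacement
    rng.shuffle(sample)
    return sample


# Hypothetical stand-in for the dataset: (instance id, class label) pairs.
rows = [(i, 1) for i in range(1813)] + [(i, 0) for i in range(1813, 4601)]

# Different seeds give two different subsets.
S1 = stratified_sample(rows, lambda r: r[1], 1000, random.Random(1))
S2 = stratified_sample(rows, lambda r: r[1], 1000, random.Random(2))
print(len(S1), sum(r[1] for r in S1))
```

Sampling within each class separately is what makes the subsets stratified: the spam/non-spam ratio in S1 and S2 matches the ratio in the full dataset up to rounding.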
- Train a J4.8 decision tree t over S1 using a 75% split.
That is, use 75% of the data to build the tree and the remaining
25% (the test portion S1') to calculate the sample error errorS1'(t).
Use this errorS1'(t) to estimate, with 95% probability, the true error
errorD(t), that is, the error of t over the entire
distribution D of emails (not just the 4601
data instances available at the dataset website).
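The interval estimate for the true error can be sketched as follows, assuming the usual normal approximation for a sample error measured on n independent test instances (sample error plus or minus z times sqrt(error(1 - error)/n), with z = 1.96 for 95%). The function name and the error count of 20 out of 250 are hypothetical placeholders, not results from the dataset.

```python
import math


def error_confidence_interval(sample_error, n, z=1.96):
    """Approximate two-sided interval for the true error.

    sample_error: fraction of the n test instances that were misclassified.
    z: normal constant for the desired level (1.96 for about 95%).
    """
    half_width = z * math.sqrt(sample_error * (1.0 - sample_error) / n)
    return sample_error - half_width, sample_error + half_width


# Hypothetical numbers: 20 mistakes on the 250 held-out instances of S1.
n_test = 250
error_t = 20 / n_test
low, high = error_confidence_interval(error_t, n_test)
print(f"errorD(t) is in [{low:.4f}, {high:.4f}] with approx. 95% probability")
```

The approximation is generally considered reasonable only when n is at least about 30 and the sample error is not too close to 0 or 1.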
- Train a neural network nn with 1 hidden layer and other
default parameters over the dataset S2 using a 75% split.
That is, use 75% of the data in S2 to train the neural net and the remaining
25% (the test portion S2') to calculate the sample error errorS2'(nn).
Compare the decision tree t from above and the neural network nn
by estimating the difference d between the true errors of these
two hypotheses
with 95% probability
using errorS1'(t) and errorS2'(nn).
What can you conclude from this comparison?
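A sketch of the interval for the difference d between the two true errors, assuming the two test sets are independent so that the variances of the two sample errors add. The function name and the sample errors 0.08 and 0.10 are hypothetical placeholders.

```python
import math


def difference_confidence_interval(e1, n1, e2, n2, z=1.96):
    """Approximate interval for d = true error of h1 minus true error of h2.

    e1, e2: sample errors measured on independent test sets of size n1, n2.
    """
    d_hat = e1 - e2
    sigma = math.sqrt(e1 * (1.0 - e1) / n1 + e2 * (1.0 - e2) / n2)
    return d_hat - z * sigma, d_hat + z * sigma


# Hypothetical sample errors on the two 250-instance test portions.
low, high = difference_confidence_interval(0.08, 250, 0.10, 250)
print(f"d is in [{low:.4f}, {high:.4f}] with approx. 95% probability")
if low <= 0.0 <= high:
    print("The interval contains 0: no difference can be claimed at this level.")
```

If the interval contains 0, the data do not support the claim that one hypothesis has a lower true error than the other at the chosen level.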
- Compare J4.8 decision trees and
neural networks (with default parameters)
over the Spambase dataset by
running Weka's Experimenter over S1, with 1 repetition of 10-fold
cross-validation. Make sure to store the results that the Experimenter
outputs in an ARFF file. You can use MATLAB instead of Weka, if you prefer.
-
From the ARFF results file, extract the error (Percent_incorrect)
of decision trees and of neural networks on each iteration of 10-fold
cross-validation.
Using those errors, follow by hand the paired t test procedure
described in Section 5.6 of your textbook (see Table 5.5).
Use an approximate 95% confidence interval, k=10, and the data subset S1
as D0.
Show each step of your calculations.
(Note that you need to find out the value of the constant t_{95,9},
that is, the t value for 95% confidence with k-1 = 9 degrees of freedom,
so that you can complete the calculations.)
What can you conclude from this comparison?
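The hand calculation can be checked with a short sketch like the one below. It assumes the standard paired procedure: take the per-fold differences, average them, estimate the standard deviation of that average as sqrt(sum of squared deviations / (k(k-1))), and scale by the t constant. The function name and the ten error values per method are hypothetical, and the constant 2.262 is the two-sided 95% value for 9 degrees of freedom from a standard t table; verify it against the table in your textbook.

```python
import math


def paired_t_interval(errors_a, errors_b, t_const):
    """Interval for the mean per-fold difference in error (A minus B).

    errors_a, errors_b: errors of the two methods on the same k folds.
    t_const: t value for the chosen confidence and k-1 degrees of freedom.
    """
    k = len(errors_a)
    deltas = [a - b for a, b in zip(errors_a, errors_b)]
    mean_delta = sum(deltas) / k
    squared_dev = sum((d - mean_delta) ** 2 for d in deltas)
    s_mean = math.sqrt(squared_dev / (k * (k - 1)))
    half_width = t_const * s_mean
    return mean_delta, s_mean, (mean_delta - half_width, mean_delta + half_width)


# Hypothetical Percent_incorrect values for the 10 folds (not real results).
tree_err = [8.0, 10.0, 7.0, 9.0, 11.0, 8.0, 9.0, 10.0, 7.0, 9.0]
nn_err = [7.0, 9.0, 8.0, 8.0, 9.0, 7.0, 9.0, 8.0, 7.0, 8.0]

T_95_9 = 2.262  # assumed from a standard t table; check your textbook
mean_delta, s_mean, (low, high) = paired_t_interval(tree_err, nn_err, T_95_9)
print(f"mean difference = {mean_delta:.3f}, s = {s_mean:.4f}")
print(f"approx. 95% interval: [{low:.4f}, {high:.4f}]")
```

Pairing matters here because both methods are evaluated on the same folds, so fold-to-fold variation cancels in the differences and the interval is tighter than one built from two independent samples.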
-
From the "Analyse" tab of the Experimenter,
use the "Paired T-Tester" to perform a test comparing
the "percent_incorrect" of both learning methods used
in the experiment with
Significance = 0.05 (which is the same as confidence = 95%).
Include in your document the results of the paired t test
reported by Weka.
What can you conclude from this comparison?
Compare with the results you
obtained above when you ran the paired t test by hand.
Please turn in written solutions to these problems at the beginning
of class when the project is due and be ready to discuss your solutions
and the chapter in class.