Proj5 - CS 539 Fall 2014

Computer Science Department

CS539 Machine Learning - Fall 2014
Project 5 - Bayesian Networks

PROF. CAROLINA RUIZ

Due Date: Friday, November 7th, 2014. Slides are due by 11:00 am (by email) and the written report is due at 2:00 pm (beginning of class).

Read Sections 6.1, 6.2, 6.6, 6.7, 6.8, 6.9, 6.10, 6.11, 6.12, 6.13 of your textbook in great detail.
Read the NaiveBayes and the BayesNets code in the Weka system.
Project Assignment: THOROUGHLY READ AND FOLLOW THE PROJECT GUIDELINES. These guidelines contain detailed information about how to structure your project, and how to prepare your written and oral reports.
*** You must use the Project 5 Template provided for your written report. *** (If you prefer not to use Word, you can copy and paste this format into a different editor, as long as you respect the stated page structure and page limit.)
The font size should be no smaller than 11 pt. Do not exceed the page limit.
- Machine Learning Technique(s): Use the Naive Bayes and Bayesian Net classification methods implemented in Weka and in R.
- Dataset(s): In this project, we will use two datasets:
  - The Flags dataset available at the UCI Machine Learning Repository. Use religion as the target attribute.
  - The ReutersCorn dataset that comes with the Weka system. Combine the ReutersCorn-train.arff and ReutersCorn-test.arff files into a single ReutersCorn.arff dataset.
    This dataset is a collection of text documents. To transform it from a text (unstructured) format into a tabular (structured) format, you can write your own code, use Weka (see the StringToWordVector filter), use R, or use a good existing software package available to you. Describe in your report what code you used, and cite any resources used. Check the resulting list of words to make sure it is a good selection.
- Performance Metric(s):
  - Use classification accuracy, time to construct the model, dependency connections in the Bayesian graph, conditional probability tables (CPTs), readability of the net, and any other related information or metrics when you evaluate the "goodness" of your models (note that some of these evaluation criteria are quantitative and some are qualitative).
  - Compare the classification accuracies/errors you obtained against those of benchmarking or previously studied techniques, such as ZeroR, OneR, J4.8, and ANNs, over the same (sub-)set of data instances you used in each experiment. Use the Experimenter in Weka to compare the performance of these different techniques at a statistical significance level of 0.05.
- Algorithm Options:
  - Naive Bayes: Run experiments with and without the supervised discretization option. Contrast the results obtained. Also, experiment with and without Feature Selection.
  - Bayesian Nets: Run experiments with different net initializations, different search methods (including at least K2 and HillClimber), different upper bounds on the number of parents, and different orderings of the predicting attributes. Also, experiment with and without Feature Selection. For a description of the K2 algorithm, see G. F. Cooper & E. Herskovits, "A Bayesian Method for the Induction of Probabilistic Networks from Data". For an illustration of the K2 algorithm over the toy dataset in that paper, see Prof. Ruiz's K2 handout.
- Advanced Topic(s) (30 points): Investigate in more depth (experimentally, theoretically, or both) a topic of your and your teammate's choice that is related to Bayesian learning and that is not covered already in this project. This Bayesian learning related topic might be something that was described or mentioned in the textbook or in class, or that comes from your own research, or that is related to your and your teammate's interests. One advanced topic per team.
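As an aid for the K2 experiments listed under Algorithm Options, the following is a minimal sketch of a K2-style greedy parent search. It assumes the Cooper & Herskovits scoring metric g(i, pi_i) = prod_j [(r_i - 1)! / (N_ij + r_i - 1)!] prod_k N_ijk!, computed in log space. The function names (`log_k2_score`, `k2`), the data layout (tuples of integer-coded values), and the small ten-case binary dataset are illustrative assumptions only; this is not Weka's implementation. It is meant to show why the attribute ordering and the upper bound on the number of parents affect the learned structure.

```python
from collections import Counter
from math import lgamma


def log_k2_score(data, child, parents, arity):
    """Log of the K2 metric for `child` with the given parent list.

    Assumes every attribute value is an integer in 0..arity[attr]-1.
    """
    r = arity[child]
    # N_ijk: number of cases with parent configuration j and child value k
    n_ijk = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    # N_ij: number of cases with parent configuration j
    n_ij = Counter(tuple(row[p] for p in parents) for row in data)
    score = 0.0
    for config, total in n_ij.items():
        # log[(r - 1)! / (N_ij + r - 1)!], using lgamma(n + 1) = log(n!)
        score += lgamma(r) - lgamma(total + r)
        for k in range(r):
            score += lgamma(n_ijk[(config, k)] + 1)  # log(N_ijk!)
    return score


def k2(data, order, arity, max_parents):
    """Greedy K2-style search: each node may only take parents that
    precede it in `order`, and at most `max_parents` of them."""
    parents = {v: [] for v in order}
    for i, child in enumerate(order):
        best = log_k2_score(data, child, parents[child], arity)
        candidates = set(order[:i])
        while candidates and len(parents[child]) < max_parents:
            # Score the addition of each remaining candidate parent
            scored = [
                (log_k2_score(data, child, parents[child] + [c], arity), c)
                for c in sorted(candidates)
            ]
            top_score, top = max(scored)
            if top_score <= best:
                break  # no single addition improves the score
            best = top_score
            parents[child].append(top)
            candidates.remove(top)
    return parents


# Illustrative data: ten cases over three binary attributes (indices 0, 1, 2)
data = [
    (1, 0, 0), (1, 1, 1), (0, 0, 1), (1, 1, 1), (0, 0, 0),
    (0, 1, 1), (1, 1, 1), (0, 0, 0), (1, 1, 1), (0, 0, 0),
]
arity = {0: 2, 1: 2, 2: 2}
structure = k2(data, order=[0, 1, 2], arity=arity, max_parents=2)
print(structure)  # {0: [], 1: [0], 2: [1]}
```

On this data the search selects the chain 0 -> 1 -> 2. Rerunning with a different `order` or with `max_parents=1` gives a quick feel for how those two options constrain the search before trying them in Weka.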
Project 5 Grading Sheet
