CS 539 Spring 2007

Computer Science Department

CS539 Machine Learning - Spring 2007
Project 6 - Instance-Based Learning and Regression Methods

PROF. CAROLINA RUIZ

Due Date: Thursday, March 22 2007. Slides are due at 3:00 pm and the written report is due at 4:00 pm.

Project Description
Project Assignment
Report Submission and Due Date

PROJECT DESCRIPTION

Use Instance-based Learning and Regression techniques to construct classifiers for each of the following problems:

Predicting (1) the class attribute, and (2) a numeric attribute of your choice in the census-income dataset.
Predicting (1) a nominal attribute, and (2) a numeric attribute of your choice in a dataset selected by you. This dataset can consist of data that you use for your own research or work, a dataset taken from a public data repository (e.g., UCI Machine Learning Repository, or from the UCI KDD Archive), or data that you collect from public data sources. THIS DATASET SHOULD BE LARGE IN TERMS OF THE NUMBER OF INSTANCES AND ATTRIBUTES SO IT CANNOT BE ONE OF THOSE INCLUDED IN THE WEKA SYSTEM.

PROJECT ASSIGNMENT

Read Chapter 8 of the textbook about Instance-based Learning in great detail.
Solve Exercise 8.3 of your textbook (page 247). Include your solution in your written report (and not in your oral report).
Read the code of the Instance-based Learning and Regression techiques implemented in the Weka system. Some of those techniques are enumerated below:
- Instance-based Learning:
  - IB1: nearest neighbor classification
  - IBk: k-nearest neighbors classification. Experiment with several values of k.
- Other Lazy Learning:
  - LBR: Lazy Bayesian Rules Classifier
- Regression:
  - Linear Regression
  - LWR: Locally Weighted Regression [In order to run locally weighted linear regression using Weka, use LWL (locally weighted learning) from the Weka's lazy classifiers, and select "Linear Regression" for the LWL's classifier option]
The following are guidelines for the construction of your models:
- Code: Use the above listed techniques implemented in the Weka system. If you prefer, implement your own code.
- Objectives of the Learning Experiments: Before you start running experiments, look at the raw data in detail. Figure out 3 to 5 specific, interesting questions about the domain that you want to answer with your Instance-based learning experiments. These questions may be phrased as conjectures that you want to confirm/refute with your experimental results.
- Training and Testing Instances: You may restrict your experiments to a subset of the instances IF Weka cannot handle your whole dataset. But remember that usually the more training data that you can use, the better. FOR TESTING, you may restrict your test set to 100 data instances or less (not used for training) to reduce the time taken by the experiments using these lazy methods.

Preprocessing of the Data: You should apply relevant filters to your dataset as needed before doing the mining and/or using the results of previous mining tasks. For instance, you may decide to remove apparently irrelevant attributes, replace missing values if any, discretize attributes in a different way, etc. Your report should contained a detailed description of the preprocessing of your dataset and justifications of the steps you followed. If Weka does not provide the functionality you need to preprocess your data as you need to obtain useful patterns, preprocess the data yourself either by writing the necessary filters (you can incorporate them in Weka if you wish).

Evaluation and Testing: Use with n-fold cross-validation (or with percentage split if the execution time required by cross-validation is too high).

REPORT AND DUE DATE

Written Report.
Your report should contain the following sections with the corresponding discussions:
1. Code Description: Describe IN YOUR OWN WORDS the algorithms that you use. Explain each algorithm in terms of the input it receives, the output it produces, and the main steps it follows to produce this output from the input.
  Explain the differences among the approaches you use, as much as possible.
2. Data: Describe the dataset that you selected in terms of the attributes present in the data, the number of instances, missing values, and other relevant characteristics.
  Provide a detail description of the preprocessing of your data. Justify the preprocessing you apply and why the resulting data is the appropriate one for mining these classifications models from it.
  Describe your 3-5 guiding questions/conjectures.
3. Experiments: For each experiment you ran describe:
  - Which of your 3-5 specific questions/conjectures about the dataset domain you aim to answer/validate with your experiments.
  - Data: What data did you use to construct and test your models?
  - Any additional pre or post processing done to the data or the classifier's output in order to improve the accuracy of your classifier.
  - Accuracy of the resulting classifier.
  - Discuss how this accuracy compares with that of your most accurate ZeroR experiment, decision trees, neural nets, and Bayesian techniques from the previous assignments.
4. Summary of Results
  - For each dataset, what was the accuracy of the most accurate classifier constructed in your project?
  - strengths and the weaknesses of your project.
Oral Report. We will discuss the results from the individual projects during the class on March 22. Your oral report should summarize the different sections of your written report as described above. Each of you will have 6 minutes to explain your results and to discuss your project in class. Be prepared!
Submission and Due Date.
1. Please submit the following file by email by Thursday March 22th at 3:00 pm.
2. [your-lastname]_proj6_slides.[ext] containing your slides for your oral report. This file should be either a PDF file (ext=pdf) or a PowerPoint file (ext=ppt). Please use only lower case letters in the name file. For instance my file would be named ruiz_proj6_slides.ppt
3. Please bring a hardcopy of your report to class on March 22, 2007. This written report is due at 4:00 pm that day.

CS539 Machine Learning - Spring 2007 Project 6 - Instance-Based Learning and Regression Methods

PROF. CAROLINA RUIZ

PROJECT DESCRIPTION

PROJECT ASSIGNMENT

REPORT AND DUE DATE

CS539 Machine Learning - Spring 2007
Project 6 - Instance-Based Learning and Regression Methods