WPI Worcester Polytechnic Institute

Computer Science Department
------------------------------------------

CS539 Machine Learning 
Project 2 - Fall 2015

PROF. CAROLINA RUIZ 

Due Dates:
Phase I Written Report: Separate Online Submission Deadline per Section as specified in each Section below
Phase II Slides: myWPI Submission by Tuesday, December 15th, 2015 at 1:00 pm.  

------------------------------------------

Project Instructions


Project Sections and Deadlines


Section A: Clustering (70 points + 10 bonus points)

Page limit for this section: 4 pages.
Due date: Sunday, Nov. 1st at 11:59 pm.
myWPI submission name: Proj2SectionsAandB

Dataset: For this part of the project, you will use the OptDigit Dataset available at the UCI Machine Learning Repository.

  1. K-Means Clustering:
    1. (20 points) Use the k-means procedure implemented in your chosen programming language (Matlab/R) to cluster the data in optdigits.tra (removing the class attribute first). Use Euclidean distance. Experiment with different initial random seeds. Systematically experiment with different values for k (= number of clusters), say between 2 and 12. Use a table to summarize your results in your written report. In this table, include runtime, k, distance metric, and SSE (sum of squared errors, also called reconstruction error in the textbook) for each experiment. Provide a brief analysis of your results. (See the code sketch after this list.)
    2. (10 points) Pick the experiment that you think produced the best result. Justify your choice in your report. Use Matlab/R plotting functions to produce one or two visualizations of the resulting clusters of your chosen experiment (e.g., consider using Multidimensional Scaling (MDS) and silhouette plots).
      (5 bonus points) Find a good way to add the class attribute in the visualization to see if some clusters are associated with any particular class value(s) (see for example Fig. 6.5 (p. 126) and Fig. 6.12 (p. 144) of the textbook).
    3. (10 points) In this part, you will investigate methods to evaluate how well a clustering relates to values of the class attribute. Study the notions of purity, normalized mutual information (NMI), and Rand index (RI). Calculate the purity, the NMI, and the RI of the clusters in the experiment you ran with k=10. For calculating these measures you'll need the class attribute, but only after the clustering has been obtained without using it.
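
    A minimal Matlab sketch for this part, assuming the Statistics Toolbox and that optdigits.tra is a comma-separated file whose last column is the class attribute (as in the UCI distribution); the function names are real, but all parameter choices are illustrative:

      % Cluster optdigits.tra with k-means for k = 2..12, recording SSE.
      data   = csvread('optdigits.tra');
      X      = data(:, 1:end-1);     % drop the class attribute before clustering
      labels = data(:, end);         % kept aside only for purity/NMI/RI later
      for k = 2:12
          rng(k);                    % vary the random seed across experiments
          tic;
          [idx, ~, sumd] = kmeans(X, k, 'Distance', 'sqeuclidean', 'Replicates', 5);
          fprintf('k = %2d  SSE = %.1f  runtime = %.2fs\n', k, sum(sumd), toc);
      end

      % Visualize a chosen clustering via classical MDS and a silhouette plot.
      Y = cmdscale(squareform(pdist(X)));   % 2-D coordinates in Y(:,1:2)
      gscatter(Y(:,1), Y(:,2), idx);
      figure; silhouette(X, idx);

      % Purity of the k = 10 clustering (NMI and RI are computed similarly).
      idx10  = kmeans(X, 10, 'Replicates', 5);
      purity = 0;
      for c = 1:10
          purity = purity + max(histc(labels(idx10 == c), 0:9));
      end
      purity = purity / numel(labels)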

  2. EM Clustering:
    1. (20 points) Now cluster the data using Gaussian Mixture Models with the EM algorithm. Experiment with different numbers k of components (= clusters), with different initializations, with shared and non-shared covariance matrices among components, and with diagonal and non-diagonal ("full") covariance matrices. Use a table to summarize your results in your written report. In this table, include runtime, k, initialization, type of covariance matrix, whether the covariance matrix is shared or not, the Akaike Information Criterion (AIC), and the Bayesian Information Criterion (BIC) for each experiment. Provide a brief analysis of your results. (See the code sketch after this list.)
    2. (10 points) Pick the experiment that you think produced the best result. Justify your choice in your report. Use Matlab/R plotting functions to produce a visualization of the resulting clusters of your chosen experiment. This visualization should show the shape and orientation of each component.
      (5 bonus points) Find a good way to add the class attribute in the visualization to see if some clusters are associated with any particular class value(s).
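
    A minimal Matlab sketch (R2014b or later, Statistics Toolbox), with X as in the k-means sketch above: fitgmdist exposes exactly the options this part asks you to vary; the regularization value and the values of k are illustrative assumptions:

      % EM clustering with Gaussian mixtures, varying the requested options.
      covTypes = {'full', 'diagonal'};
      for k = [5 10]
          for c = 1:numel(covTypes)
              for shared = [true false]
                  tic;
                  gm = fitgmdist(X, k, ...
                      'CovarianceType', covTypes{c}, ...
                      'SharedCovariance', shared, ...
                      'RegularizationValue', 0.01, ... % optdigits has near-constant pixels
                      'Replicates', 3);
                  fprintf('k=%2d  %-8s shared=%d  AIC=%.0f  BIC=%.0f  (%.1fs)\n', ...
                      k, covTypes{c}, shared, gm.AIC, gm.BIC, toc);
              end
          end
      end
      idx = cluster(gm, X);   % hard assignment of instances to components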

  3. Homework Problems:
    These homework problems are for you to study this topic. You do NOT need to submit your solutions.
    Chapter 7 Exercises 2, 5, 6, 7, 10, 11 of the textbook (pp. 180-182)

Section B: Nonparametric Methods (90 points + 10 bonus points)

Page limit for this section: 5 pages.
Due date: Sunday, Nov. 1st at 11:59 pm.
myWPI submission name: Proj2SectionsAandB

Dataset: For this part of the project, you will use the OptDigit Dataset available at the UCI Machine Learning Repository.

  1. Univariate Density Estimation:
    Randomly generate a set of N=100 data points using a uniform distribution in the range from 0 to 50. Construct the following 6 plots, using in each case the specified density estimation function over the randomly generated dataset. (See the code sketch after this list.)
    1. (5 points) Using a naive estimator with bin width h = 1 and a separate plot with h = 4 (see Fig. 8.2 p. 189 of the textbook).
    2. (5 points) Using a Gaussian kernel estimator with bin width h = 1 and a separate plot with h = 4 (see Fig. 8.3 p. 190 of the textbook).
    3. (5 points) Using a k-nearest neighbor density estimator with k = 3 and a separate plot with k = 6 (see Fig. 8.4 p. 191 of the textbook).
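
    A minimal Matlab sketch implementing the three estimators directly from the textbook formulas (naive: p(x) = #{|x - x_i| < h/2} / (N h); Gaussian kernel: p(x) = (1/(N h)) * sum_i K((x - x_i)/h); k-NN: p(x) = k / (2 N d_k(x))); no toolbox density function is needed:

      rng(1);
      x   = 50 * rand(100, 1);        % N = 100 points, uniform on [0, 50]
      pts = linspace(0, 50, 500);     % grid on which to evaluate the estimates
      N   = numel(x);
      for h = [1 4]
          pNaive = zeros(size(pts));  pKern = zeros(size(pts));
          for i = 1:numel(pts)
              u = (pts(i) - x) / h;
              pNaive(i) = sum(abs(u) < 0.5) / (N * h);                % naive estimator
              pKern(i)  = sum(exp(-u.^2 / 2)) / (N * h * sqrt(2*pi)); % Gaussian kernel
          end
          figure; plot(pts, pNaive); title(sprintf('Naive estimator, h = %d', h));
          figure; plot(pts, pKern);  title(sprintf('Gaussian kernel, h = %d', h));
      end
      for k = [3 6]
          pKnn = zeros(size(pts));
          for i = 1:numel(pts)
              d = sort(abs(x - pts(i)));
              pKnn(i) = k / (2 * N * d(k));   % d(k) = distance to k-th neighbor
          end
          figure; plot(pts, pKnn); title(sprintf('k-NN estimator, k = %d', k));
      end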

  2. Nonparametric Classification:
    Use the OptDigit dataset for this part. Use k-nearest neighbor classification functions in Matlab/R to classify the data instances in the test set optdigits.tes using optdigits.tra as the training set. Run kNN with k = 1, 5, 9, 11, using 3 different distance metrics: Mahalanobis, Euclidean, and cosine. (See the code sketch after this list.)
    1. (20 points) Use a table to summarize your results in your written report. In this table, include runtime, k, distance metric, and classification accuracy for each experiment. Provide a brief analysis of your results.
    2. (5 points) Pick the experiment that you think produced the best result. Justify your choice in your report. Include the confusion matrix for this experiment. See what misclassifications are most common and elaborate on your observations.
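
    A minimal Matlab sketch (Statistics Toolbox): fitcknn supports all three requested distance metrics; note that the Mahalanobis metric may need a regularized covariance passed through its 'Cov' option, since some optdigits pixels are constant:

      tr = csvread('optdigits.tra');  te = csvread('optdigits.tes');
      Xtr = tr(:, 1:end-1);  ytr = tr(:, end);
      Xte = te(:, 1:end-1);  yte = te(:, end);
      for dist = {'euclidean', 'cosine', 'mahalanobis'}
          for k = [1 5 9 11]
              tic;
              mdl  = fitcknn(Xtr, ytr, 'NumNeighbors', k, 'Distance', dist{1});
              pred = predict(mdl, Xte);
              fprintf('%-12s k=%2d  accuracy=%.4f  (%.1fs)\n', ...
                  dist{1}, k, mean(pred == yte), toc);
          end
      end
      confusionmat(yte, pred)   % confusion matrix for the last experiment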

  3. Outlier Detection:
    Use the optdigits.tra dataset for this part.
    1. (20 points) Calculate the Local Outlier Factor (LOF) of each data instance in optdigits.tra. Describe in your report what code you used to do this calculation. (A code sketch appears after this list.)
    2. (5 points) Sort the data instances in increasing order according to their LOF. Plot a graph where the horizontal axis consists of the sorted data instances and the vertical axis denotes their LOF values. Is there an "elbow" in the plot that could be a good threshold to discern between non-outliers and outliers? Explain your answer.
    3. (5 bonus points) Take the 3 data instances with the highest LOF values. See if you can plot the image (digit) corresponding to each of these data instances, and see if you can tell whether or not they are abnormal/outliers.
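
    Matlab (R2015) has no built-in LOF function, so here is a minimal sketch computed directly from the LOF definition, using the training matrix Xtr from the earlier sketch; MinPts = 10 is an illustrative assumption:

      k = 10;                                    % MinPts
      n = size(Xtr, 1);
      [nbrs, d] = knnsearch(Xtr, Xtr, 'K', k + 1);
      nbrs = nbrs(:, 2:end);  d = d(:, 2:end);   % drop each point itself
      kdist = d(:, end);                         % distance to the k-th neighbor

      % Local reachability density: lrd(p) = 1 / mean_o reach-dist(p, o),
      % where reach-dist(p, o) = max(kdist(o), dist(p, o)).
      lrd = zeros(n, 1);
      for p = 1:n
          reach  = max(kdist(nbrs(p, :)), d(p, :)');
          lrd(p) = 1 / max(mean(reach), eps);
      end

      % LOF(p) = (mean lrd of p's neighbors) / lrd(p).
      lof = zeros(n, 1);
      for p = 1:n
          lof(p) = mean(lrd(nbrs(p, :))) / lrd(p);
      end

      plot(sort(lof));                           % look for an elbow
      xlabel('instances sorted by LOF'); ylabel('LOF');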

  4. Nonparametric Regression:
    You may find it useful to watch Matlab's nonparametric fitting video.
    Use the OptDigit dataset for this part, but instead of using the class attribute as discrete (or nominal), use it as continuous.
    1. (25 points) Use locally weighted regression functions in Matlab/R that implement techniques like loess or lowess (LOcally WEighted Scatterplot Smoothing) on the optdigits.tra dataset. Use cross-validation to determine a good value for k (= number of nearest neighbors used). Summarize the results of your experiments in a table. (See the code sketch after this list.)
      (5 bonus points) Find a good way to visualize the smoothed regression curves constructed.
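
    True lowess is available through smooth(x, y, span, 'lowess') in the Curve Fitting Toolbox; because the optdigits predictors are discrete (which complicates a single-predictor lowess), the sketch below instead uses a closely related k-nearest-neighbor running-mean smoother over all 64 attributes, with cross-validation over k. It is a stand-in for illustration, not the loess/lowess algorithm itself:

      data = csvread('optdigits.tra');
      X = data(:, 1:end-1);             % the 64 pixel attributes as predictors
      y = data(:, end);                 % class attribute treated as continuous
      n = numel(y);
      for k = [1 5 15 45]
          cv  = cvpartition(n, 'KFold', 5);
          sse = 0;
          for f = 1:cv.NumTestSets
              trI  = find(cv.training(f));  teI = find(cv.test(f));
              nbrs = knnsearch(X(trI, :), X(teI, :), 'K', k);
              yTr  = y(trI);
              yhat = mean(yTr(nbrs), 2);  % average the k nearest responses
              sse  = sse + sum((yhat - y(teI)).^2);
          end
          fprintf('k = %3d   CV-SSE = %.1f\n', k, sse);
      end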

  5. Homework Problems:
    These homework problems are for you to study this topic. You do NOT need to submit your solutions.
    Chapter 8 Exercises 1, 2, 3, 4, 5, 6, 7 (no need to write code), 8, 9, 11 of the textbook (pp. 208-210).

Section C: Trees (80 points)

Page limit for this section: 4 pages.
Due date: Thursday, Nov. 19th at 11:59 pm.
myWPI submission name: Proj2SectionsCDE

Dataset: For this part of the project, you will use the Statlog (Heart) Dataset available at the UCI Machine Learning Repository.
Carefully read the description provided for this dataset and familiarize yourself with the dataset as much as possible.

  1. Classification Trees:
    For classification, use the attribute "Absence (1) or presence (2) of heart disease" as the target.
    1. (20 points) Use Matlab/R functions to construct decision trees over the dataset using 4-fold cross-validation. Briefly describe in your report the functions that you use and their parameters. Run at least 5 different experiments varying parameter values. Repeat the same experiments but now using pruning. Show the results of all of your experiments neatly organized in a table showing parameter values, classification accuracy, size of the tree (number of nodes and/or number of leaves), and runtime. (See the code sketch after this list.)
    2. (10 points) Select the pruned tree with the smallest size. Use Matlab/R plotting functions to depict the tree. Include the plot in your report (or at least the top levels if the tree is too large). Briefly comment on any interesting aspect of this tree.
    3. (10 points) Research what the random forest technique does. Describe this technique briefly in your report, including what the inputs to this technique are, what it outputs, and how it constructs its output.
    4. (10 points) Identify the Matlab/R function that constructs random forests and describe it in your report. Run at least 5 different experiments varying parameter values. Show the results of your experiments neatly organized in a table showing parameter values, classification accuracy, size of the random forest, and runtime.
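
    A minimal Matlab sketch (Statistics Toolbox) covering cross-validated trees, pruning, and a random forest via TreeBagger; it assumes the Statlog heart data has been loaded into X (the 13 attributes) and y (1 = absence, 2 = presence), and the parameter values are illustrative:

      % Cross-validated classification trees, varying MinLeafSize.
      for minLeaf = [1 5 10 20 40]
          tic;
          cvTree = fitctree(X, y, 'MinLeafSize', minLeaf, 'KFold', 4);
          fprintf('MinLeafSize=%2d  CV accuracy=%.3f  (%.1fs)\n', ...
              minLeaf, 1 - kfoldLoss(cvTree), toc);
      end

      % Prune a tree to the best level found by cross-validation, then plot it.
      tree = fitctree(X, y);
      [~, ~, ~, best] = cvloss(tree, 'Subtrees', 'all', 'TreeSize', 'min');
      pruned = prune(tree, 'Level', best);
      view(pruned, 'Mode', 'graph');

      % Random forest (bagged trees); vary the number of trees across experiments.
      rf = TreeBagger(100, X, y, 'Method', 'classification');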

  2. Regression Trees:
    For regression, use the attribute "age" as the target.
    1. (20 points) Use Matlab/R functions to construct regression trees over the dataset using 4-fold cross-validation. Briefly describe in your report the functions that you use and their parameters. Run at least 5 different experiments varying parameter values. Repeat the same experiments but now using pruning. Show the results of all of your experiments neatly organized in a table showing parameter values, Sum of Squared Errors (SSE), Root Mean Square Error (RMSE), Relative Squared Error (RSE), Coefficient of Determination (R2), size of the tree (number of nodes and/or number of leaves), and runtime. (See the code sketch after this list.)
    2. (10 points) Select the pruned tree with the smallest size. Use Matlab/R plotting functions to depict the tree. Include the plot in your report (or at least the top levels if the tree is too large). Briefly comment on any interesting aspect of this tree.
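
    A minimal Matlab sketch for the regression tree and the requested error measures, computed from 4-fold cross-validated predictions, with X as above; ageCol marks the column holding "age" (an assumption; check the dataset description):

      yAge = X(:, ageCol);
      Xr   = X(:, setdiff(1:size(X, 2), ageCol));
      cvT  = fitrtree(Xr, yAge, 'KFold', 4);
      yhat = kfoldPredict(cvT);

      SSE  = sum((yAge - yhat).^2);
      RMSE = sqrt(SSE / numel(yAge));
      RSE  = SSE / sum((yAge - mean(yAge)).^2);  % relative squared error
      R2   = 1 - RSE;                            % coefficient of determination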

  3. Homework Problems:
    These homework problems are for you to study this topic. You do NOT need to submit your solutions.
    Chapter 9 Exercises 1, 2, 4, 6, 8, 9, 10 of the textbook (pp. 235-236).

Section D: Artificial Neural Networks and Deep Learning (50 points)

Page limit for this section: 3 pages.
Due date: Thursday, Nov. 19th at 11:59 pm.
myWPI submission name: Proj2SectionsCDE

Dataset: For this part of the project, you will use the OptDigit Dataset available at the UCI Machine Learning Repository.

  1. Classification using Artificial Neural Networks (ANNs):
    Use Matlab/R functions to construct and train ANNs over optdigits.tra and then test them over optdigits.tes. (See the code sketch after this list.)

    Topology of your Neural Net: [figure omitted]

    Experiments:
    1. (5 points) Briefly describe in your report the Matlab/R functions that you use and their parameters.
    2. (5 points) Explain also how many nodes you use in the output layer, and how you use the output from the output node(s) to assign a classification label to a test instance.
    3. (35 points) Run at least 10 different experiments varying parameter values. Show the results of all of your experiments neatly organized in a table showing parameter values, number of hidden nodes in each layer, classification accuracy, and runtime.
    4. (5 points) Pick the experiment that you think produced the best result. Justify your choice in your report. Include the confusion matrix for this experiment. See what misclassifications are most common and elaborate on your observations.
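
    A minimal Matlab sketch (Neural Network Toolbox): patternnet with 10 output nodes, one per digit, where the predicted class is the arg-max output node; the topologies and epoch count are illustrative:

      tr = csvread('optdigits.tra');  te = csvread('optdigits.tes');
      Xtr = tr(:, 1:end-1)';  Xte = te(:, 1:end-1)';   % nets expect column samples
      Ttr = full(ind2vec(tr(:, end)' + 1));            % 1-of-10 target coding

      for hidden = {10, 30, [40 20]}                   % vary the topology
          net = patternnet(hidden{1});
          net.trainParam.epochs = 200;
          tic;  net = train(net, Xtr, Ttr);  t = toc;
          pred = vec2ind(net(Xte))' - 1;               % arg-max node -> digit
          fprintf('hidden=%-8s  accuracy=%.4f  (%.1fs)\n', ...
              mat2str(hidden{1}), mean(pred == te(:, end)), t);
      end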

  2. Deep Learning: Watch one of the following videos about deep learning (if you have time, try to watch both). You're not expected to understand all the details, but try to take away from the videos some of the theoretical foundations of deep learning and some of its applications.
    1. "Deep Learning" by Ruslan Salakhutdinov from the collection of Deep Learning Summer School, Montreal 2015
    2. "Recent developments on Deep Learning" Geoffrey Hinton's GoogleTech Talk, March 2010.
    Links to both videos (and several others) are available at deeplearning.net tutorials

  3. Homework Problems:
    These homework problems are for you to study this topic. You do NOT need to submit your solutions.

Section E: Support Vector Machines (50 points)

Page limit for this section: 4 pages.
Due date: Thursday, Nov. 19th at 11:59 pm.
myWPI submission name: Proj2SectionsCDE

Dataset: For this part of the project, you will use the OptDigit Dataset available at the UCI Machine Learning Repository.

  1. Classification using Support Vector Machines (SVMs):
    Consider a two-class classification task that consists of predicting whether a given digit is an "8" or not.
    1. (4 points) Modify the input dataset to have 2 classes (instead of 10) by replacing the target value of each data instance as follows: if the original target value is "8", replace it with 1; if the original target value is different from "8", replace it with -1.
    2. (5 points) Use Matlab/R functions to construct support vector machines over optdigits.tra and test them over optdigits.tes. Briefly describe in your report the functions that you use and their parameters. (See the code sketch after this list.)
    3. (36 points) Run at least 12 different experiments varying parameter values for each of the following kernel functions (run at least 4 experiments for each one of the 3 kernel functions required):
      • polynomial (including linear, quadratic, ...)
      • radial-basis functions (Gaussian)
      • sigmoid (tanh)
      Show the results of all of your experiments neatly organized in a table showing kernel function used, parameter values, classification accuracy, and runtime.
    4. (5 points) Pick the experiment that you think produced the best result. Justify your choice in your report. Use Matlab/R functionality to plot a 2- or 3-dimensional depiction of data instances in each of the two classes, support vectors, and the decision boundary.
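
    A minimal Matlab sketch (R2014a or later, fitcsvm) for the polynomial and RBF kernels; a sigmoid (tanh) kernel is not built into fitcsvm and must be supplied as a user-written kernel function (see the custom-kernel example in the fitcsvm documentation). The parameter grids are illustrative:

      tr = csvread('optdigits.tra');  te = csvread('optdigits.tes');
      Xtr = tr(:, 1:end-1);  ytr = 2*(tr(:, end) == 8) - 1;   % "8" -> +1, else -1
      Xte = te(:, 1:end-1);  yte = 2*(te(:, end) == 8) - 1;

      for C = [0.1 1 10]
          for p = [1 2 3]                                     % polynomial kernels
              mdl = fitcsvm(Xtr, ytr, 'KernelFunction', 'polynomial', ...
                            'PolynomialOrder', p, 'BoxConstraint', C);
              fprintf('poly p=%d C=%g  accuracy=%.4f\n', ...
                  p, C, mean(predict(mdl, Xte) == yte));
          end
          mdl = fitcsvm(Xtr, ytr, 'KernelFunction', 'rbf', ...
                        'KernelScale', 'auto', 'BoxConstraint', C);
          fprintf('rbf  C=%g  accuracy=%.4f\n', C, mean(predict(mdl, Xte) == yte));
      end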

  2. Homework Problems:
    These homework problems are for you to study this topic. You do NOT need to submit your solutions.
    Chapter 13 Exercises 1 and 2 of the textbook (pp. 382-383).

Section F: Bayesian Networks (50 points)

Page limit for this section: 4 pages.
Due date: Thursday, Dec. 10th at 11:59 pm.
myWPI submission name: Proj2SectionsFG

Dataset: For this part of the project, you will use the Congressional Voting Records Dataset available at the UCI Machine Learning Repository.

  1. Naive Bayes Models:
    For this part, it would be useful to look at my Matlab Naive Bayes example: diabetes_no_attribute_names.dat and naive_bayes_example_diabetis.m.
    1. (10 points) Using Matlab/R functions, create a Naive Bayes model over the training dataset. Look at the conditional probability tables and select one that looks interesting. Include it in your report and explain why you think it is interesting. (See the code sketch after this list.)
    2. (5 points) Classify the data instances in the test dataset using this Naive Bayes model. Include in your report the accuracy, precision, and recall values obtained.
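
    A minimal Matlab sketch (R2014b or later, fitcnb), assuming the votes have been coded as small integers (e.g., y = 1, n = 2, ? = 3) in Xtr/Xte and the data has been split into training and test sets; 'mvmn' treats each attribute as categorical:

      nb = fitcnb(Xtr, ytr, 'DistributionNames', 'mvmn');
      nb.DistributionParameters{1, 3}   % conditional table: class 1, attribute 3
      pred = predict(nb, Xte);
      cm   = confusionmat(yte, pred);
      acc       = sum(diag(cm)) / sum(cm(:));
      precision = cm(2, 2) / sum(cm(:, 2));   % class 2 treated as "positive"
      recall    = cm(2, 2) / sum(cm(2, :));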

  2. Bayesian Network:
    1. (10 points) Investigate what functions exist in your choice of Matlab/R to construct (non-Naive) Bayesian Networks. Describe those functions in your report.
    2. (20 points) Using Matlab/R functions, create a (non-Naive) Bayesian network over the training dataset. For this, I suggest you modify function parameters until you obtain a "reasonable" graph of nodes and connections among them. Plot the graphical model obtained. Describe any interesting facts about this graphical model. Look at the conditional probability tables and select one that looks interesting. Include it in your report and explain why you think it is interesting.
    3. (5 points) Classify the data instances in the test dataset using this Bayesian Network. Include in your report the accuracy, precision, and recall values obtained.

  3. Homework Problems:
    These homework problems are for you to study this topic. You do NOT need to submit your solutions.
    Chapter 14 Exercises 2, 3, and 5 of the textbook (pp. 413-415).

Section G: Observable Markov Models and Hidden Markov Models (60 points)

Page limit for this section: 4 pages.
Due date: Thursday, Dec. 10th at 11:59 pm.
myWPI submission name: Proj2SectionsFG
For this part, it would be useful to look at my Matlab examples: hmmgenerate_fair_loaded_coins_HMMs_tutorial_example.m and mmgenerate_pepsi_coke_HMMs_tutorial_example.m.
  1. (5 points) Use Matlab or R to solve Exercise 1 of Chapter 15 (pp. 440-441).

  2. (10 points) Use Matlab or R to solve Exercise 2 of Chapter 15 (p. 441).

  3. Consider the Coke/Pepsi hidden Markov Model (HMM) used in Prof. Ruiz's example of the Viterbi, Forward, and Backward algorithms. Using the Matlab/R implementations of the Viterbi, Forward, and Backward algorithms as appropriate, answer the following questions. Include in your answers which algorithms and which Matlab/R commands you used, and how you used them to solve each problem. (See the code sketch after this list.)

    1. (10 points) Consider the following sequence of observables:
      PPPPCCPPPCCCPCCCCCPPPCPCP
      What is the probability that this sequence was generated by our HMM? Explain.
    2. (10 points) What is the most likely sequence of hidden states that generated this sequence? Explain.
    3. (10 points) Assume that the sequence is numbered starting at 1 (i.e., the first element of the sequence, "P", is at position 1). What is the most likely hidden state that generated the "C" in position 10 of the sequence? Explain.
    4. (15 points) Use the HMM to generate a sequence of observables of length 2000. Then, use this generated sequence to learn the transition probabilities and the emission probabilities of a new hidden Markov model with 3 hidden states (let's forget about the "Start" state to simplify things). Compare the transition and emission probabilities of this new hidden Markov model with those of our original HMM.
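
    A minimal Matlab sketch using the Statistics Toolbox HMM functions; the transition and emission matrices below are placeholders (substitute the actual Coke/Pepsi values from the class example), and the sequence is coded with P = 1 and C = 2:

      TRANS = [0.6 0.2 0.2; 0.3 0.4 0.3; 0.2 0.3 0.5];   % placeholder, 3 states
      EMIS  = [0.9 0.1; 0.5 0.5; 0.2 0.8];               % placeholder, P/C emissions
      seq   = [1 1 1 1 2 2 1 1 1 2 2 2 1 2 2 2 2 2 1 1 1 2 1 2 1];

      % (1) Probability of the sequence, via the Forward algorithm:
      [pstates, logpseq] = hmmdecode(seq, TRANS, EMIS);
      probSeq = exp(logpseq);

      % (2) Most likely hidden state sequence, via Viterbi:
      states = hmmviterbi(seq, TRANS, EMIS);

      % (3) Most likely state behind position 10, from the Forward-Backward posterior:
      [~, s10] = max(pstates(:, 10));

      % (4) Generate 2000 observables, then re-estimate a 3-state model (Baum-Welch):
      obs = hmmgenerate(2000, TRANS, EMIS);
      TR0 = rand(3);    TR0 = bsxfun(@rdivide, TR0, sum(TR0, 2));
      EM0 = rand(3, 2); EM0 = bsxfun(@rdivide, EM0, sum(EM0, 2));
      [estTR, estEM] = hmmtrain(obs, TR0, EM0);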

  4. Homework Problems:
    These homework problems are for you to study this topic. You do NOT need to submit your solutions.
    Chapter 15 Exercises 3, 4, 5, 8, and 9 of the textbook (pp. 440-442).

Phase II: Combining Multiple Models

Due date: Tuesday, Dec. 15th at 1:00 pm.
myWPI submission name: MetaLearningGroupSlides

The topic for Phase II is combining multiple models, also called meta-learning.

Your investigation must include Boosting, Bagging, and Stacking in particular, but you are encouraged to investigate other techniques that interest you in addition to these three. Once you have learned about these techniques, run experiments to see how they work. (A code sketch follows the list below.) Your presentation slides should describe the following:
  1. Chosen dataset(s): short description of the dataset, number of attributes, number of data instances, distribution of the target attribute (i.e., % of each target value).
  2. Chosen meta-learning methods: You don't need to describe Boosting, Bagging, and Stacking as everyone should know these techniques. But do provide short descriptions of other meta-learning methods you used, if any.
  3. Design of your experiments.
  4. Results of your experiments.
  5. In-depth analysis of your results.
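
A minimal Matlab sketch (Statistics Toolbox): fitensemble covers Boosting and Bagging; Stacking has no built-in and must be coded by hand (train several level-0 models, then fit a level-1 model on their predictions). Here X and y stand for your chosen dataset, and 'AdaBoostM1' assumes a binary target (use 'AdaBoostM2' for multiclass):

      boost = fitensemble(X, y, 'AdaBoostM1', 100, 'Tree');
      bag   = fitensemble(X, y, 'Bag', 100, 'Tree', 'Type', 'classification');
      cvB   = crossval(boost, 'KFold', 10);
      fprintf('AdaBoost 10-fold CV accuracy: %.3f\n', 1 - kfoldLoss(cvB));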