WPI Worcester Polytechnic Institute

Computer Science Department
------------------------------------------

CS539 Machine Learning 
Project 1 - Fall 2015

PROF. CAROLINA RUIZ 

Due Dates:
Phase I Written Report: Online Submission by Saturday, October 10th 2015 at 11:59 pm
Phase II Slides: Email Submission by Friday, October 16th 2015 at 12 noon

Phase II Written Report: Hardcopy Submission by Friday, October 16th 2015 2:59 pm  

------------------------------------------

Project Instructions

Bonus points: Students who contribute informative entries to the CS539inR Wiki will receive bonus points. Let me know when you contribute to the Wiki.

Section A: Univariate Data (175 points + bonus points) Page limit: 5 pages for this section.

Important: When you are asked to randomly generate data, make sure to record the random seed used for the generation so that you can reproduce your experiments later.

  1. Data Generation:
    (5 points) Randomly generate a dataset X with N=1000 consisting of one attribute normally distributed with mean=60 and standard deviation=8.
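The language is your choice (Matlab/R); purely as an illustrative sketch, the generation step could look like the following in Python with NumPy (the seed value 42 is an arbitrary choice, recorded so the run can be reproduced):

```python
import numpy as np

SEED = 42  # arbitrary, but record it so the experiment can be reproduced
rng = np.random.default_rng(SEED)

# N=1000 instances of a single attribute ~ N(mean=60, sd=8)
X = rng.normal(loc=60, scale=8, size=1000)
```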

  2. MLE:
    1. (10 points) Use the formulas in Eq. (4.8), p. 68, to find the Maximum Likelihood Estimation (MLE) of the sample distribution parameters (mean and standard deviation) directly from the sample. Show your work in the report.
    2. (10 points) Use the Maximum Likelihood Estimation (MLE) function provided by your choice of Matlab/R to calculate these parameter values from X. Do these parameter values coincide with the ones you found directly from the formulas above? Explain.
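As a sketch of what part 1 computes (shown here in Python/NumPy for illustration rather than Matlab/R), note that the MLE formulas divide by N, not N-1:

```python
import numpy as np

rng = np.random.default_rng(42)             # recorded seed (arbitrary choice)
X = rng.normal(loc=60, scale=8, size=1000)

N = len(X)
m = X.sum() / N                  # MLE of the mean: m = (1/N) * sum(x_t)
s2 = ((X - m) ** 2).sum() / N    # MLE of the variance: divide by N, not N-1
s = s2 ** 0.5                    # MLE of the standard deviation
```

For part 2, be aware that `np.std` uses `ddof=0` by default, i.e. it already computes the MLE, whereas many library functions (e.g. Matlab's `std` and R's `sd`) default to the unbiased N-1 estimator; that difference is exactly the kind of discrepancy you may need to explain.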

  3. MAP and Bayes' Estimator:
    In this part, you will look at the Maximum A Posteriori (MAP) estimate and the Bayes' estimator to estimate the parameter values of the sample X above. Assume that the collection of all these possible parameter value estimates is also normally distributed. That is, X ~ N(θ, σ²) and θ ~ N(μ0, σ0²). Assume that σ=8, μ0=60, σ0=3.
    1. (10 points) Calculate the MAP estimate and the Bayes' estimate of the mean value used to generate data sample X. Are the MAP estimate and the Bayes' estimate the same in this case? Why or why not?
    2. (5 points) Should the MAP estimate in this case be the same as the mean estimated by MLE? Why or why not?
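With a normal likelihood (known σ) and a normal prior on θ, the posterior over θ is itself normal, so its mode (the MAP estimate) and its mean (the Bayes' estimate) coincide. An illustrative Python/NumPy sketch of the computation:

```python
import numpy as np

rng = np.random.default_rng(42)             # recorded seed (arbitrary choice)
X = rng.normal(loc=60, scale=8, size=1000)

sigma, mu0, sigma0 = 8.0, 60.0, 3.0  # given: X ~ N(theta, sigma^2), theta ~ N(mu0, sigma0^2)
N, m = len(X), X.mean()

# Posterior mean is a precision-weighted average of the sample mean and the prior mean.
w = (N / sigma**2) / (N / sigma**2 + 1 / sigma0**2)
theta_bayes = w * m + (1 - w) * mu0
theta_map = theta_bayes              # identical, because the posterior is normal
```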

  4. Classification:
    1. (5 points) Randomly generate 3 normally distributed samples, each consisting of just one attribute as follows:
      • Sample 1: number of instances: 500, mean=60 and standard deviation=8.
      • Sample 2: number of instances: 300, mean=30 and standard deviation=12.
      • Sample 3: number of instances: 200, mean=80 and standard deviation=4.
      Create a dataset X that consists of these 3 samples, where data instances in Sample i above belong to class Ci, for i=1, 2, 3.
    2. (10 points) Following the material presented in Section 4.5 of the textbook, define a precise discriminant function gi for each class Ci. Remember to apply MLE to estimate the parameters of each of the classes. Show your work.
    3. (5 points) Based on these discriminant functions, what would be the chosen class for each of the following inputs: x = 10, 30, 50, 70, 90? Show your work.
    4. (15 points) Find analytically the "decision thresholds" (see Fig. 4.2 p. 75) for these 3 classes.
    5. (5 points) Implement each of these 3 discriminant functions gi as a function in your choice of Matlab/R.
    6. (5 points) Based on these 3 functions, implement a "decision" function that receives a number x as its input and outputs i, where i is the chosen class for input x. Test your function on inputs: x = 10, 30, 50, 70, 90. Show the results in your report.
    7. (5 points) Use your decision function on inputs: x = 0, 0.5, 1, 1.5, ..., 99, 99.5, 100. Do the "decision thresholds" you calculated analytically coincide with the results of this test? Explain.
    8. (10 points) Generate a pair of plots like those in Fig. 4.2 for this particular dataset.
    9. (10 points) Use stratified random sampling to split your dataset into 2 parts: a training set (with 60% of the data instances) and a validation set (with the remaining 40% of the data instances). Test the "decision" function that you implemented on part 6 above on the validation set. Report the accuracy and the confusion matrix of your decision function, as well as the precision and the recall of your decision function for each of the three classes.
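An illustrative Python/NumPy sketch of parts 1, 2, 5, and 6 (the assignment itself asks for Matlab/R): each discriminant combines the MLE parameters with the class prior estimated from the sample sizes.

```python
import numpy as np

rng = np.random.default_rng(42)  # recorded seed (arbitrary choice)
samples = {1: rng.normal(60, 8, 500),
           2: rng.normal(30, 12, 300),
           3: rng.normal(80, 4, 200)}
N_total = sum(len(s) for s in samples.values())

# MLE mean/sd and prior P(Ci) per class, estimated from the pooled dataset.
params = {i: (s.mean(), s.std(), len(s) / N_total) for i, s in samples.items()}

def g(i, x):
    """Discriminant g_i(x) = -log s_i - (x - m_i)^2 / (2 s_i^2) + log P(C_i)."""
    m, s, prior = params[i]
    return -np.log(s) - (x - m) ** 2 / (2 * s ** 2) + np.log(prior)

def decide(x):
    """Decision function: return the class i maximizing g_i(x)."""
    return max(params, key=lambda i: g(i, x))

choices = [decide(x) for x in (10, 30, 50, 70, 90)]
```

Sweeping `decide` over x = 0, 0.5, ..., 100 (part 7) reveals the empirical decision thresholds to compare against the analytical ones.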

  5. Regression:
    1. (10 points) Create a dataset consisting of one input and one output as follows. For the input, use the dataset X you generated in part 1 above with N=1000, mean=60 and standard deviation=8. For the output, use r = f(x) + ε, where f(x) = 2 sin(1.5x) and the noise ε ~ N(μ=0, σ²=1) (as in the example in Sections 4.6-4.8, pp. 77-87).
    2. (5 points) Use random sampling to split your dataset into 2 parts: a training set (with 60% of the data instances) and a validation set (with the remaining 40% of the data instances).
    3. (10 points) Create three 2-dimensional plots: one for the entire dataset X, one for the training set, and one for the validation set. In each of these plots, the x axis corresponds to the input variable x, and the y axis corresponds to the output (response) variable r.
    4. (15 points) Create 5 different regression models over the training set using the regression functionality provided by the programming language that you chose (Matlab/R):
      gk(x | wk, ..., w0) = wk x^k + ... + w1 x + w0, for k = 0, 1, 2, 3, 4. Report the obtained coefficients in your written report.
    5. (15 points) Create two 2-dimensional plots: one containing the training set and the 5 fitting curves, and one containing the validation set and the 5 fitting curves obtained over the training set. In each of these plots, the x axis corresponds to the input variable x, and the y axis corresponds to the output (response) variable r.
    6. (10 points) Evaluate each of the 5 regression models over the validation set. Report the Sum of Square Errors (SSE), the Root Mean Square Error (RMSE), the Relative Square Error (RSE), and the Coefficient of Determination (R²) of each regression model over the validation set. If the programming language you are using reports AIC, BIC, and/or log likelihood values, include these values in your report too. Based on these error measures, which model would you pick among the five regression models? Explain.
    7. (Bonus points) See if the regression functionality in your chosen language (Matlab/R) allows the use of the Akaike information criterion (AIC) and/or the Bayesian information criterion (BIC), instead of minimizing SSE, to guide the construction of the regression model. If so, repeat parts 4 and 6 above for AIC and then for BIC. Which of the three approaches produced better results? Explain.
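A compact Python/NumPy sketch of parts 1, 2, 4, and 6 (illustrative only; the assignment specifies Matlab/R):

```python
import numpy as np

rng = np.random.default_rng(42)                     # recorded seed (arbitrary choice)
x = rng.normal(60, 8, 1000)
r = 2 * np.sin(1.5 * x) + rng.normal(0, 1, 1000)    # r = f(x) + noise

# Random 60/40 train/validation split.
idx = rng.permutation(1000)
tr, va = idx[:600], idx[600:]

metrics = {}
for k in range(5):                                  # polynomial degrees k = 0..4
    coeffs = np.polyfit(x[tr], r[tr], k)            # least-squares fit on the training set
    pred = np.polyval(coeffs, x[va])
    err = r[va] - pred
    sse = float((err ** 2).sum())
    rmse = (sse / len(va)) ** 0.5
    rse = sse / float(((r[va] - r[va].mean()) ** 2).sum())
    metrics[k] = {"SSE": sse, "RMSE": rmse, "RSE": rse, "R2": 1 - rse}
```

Note that R² = 1 - RSE by definition, so the two measures always rank the models identically.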

Section B: Multivariate Data (155 points + bonus points) Page limit: 5 pages for this section.

Important: When you are asked to randomly generate data, make sure to record the random seed used for the generation so that you can reproduce your experiments later.

  1. Multivariate Normal Distribution:
    In this part, you will work with randomly generated datasets with N=1000 data instances and d=20 dimensions (attributes). Each dataset will be generated using a multivariate normal distribution with parameters μ (1-by-d vector of means, one for each attribute) and Σ (d-by-d covariance matrix). To simplify the notation, we'll denote μ by "trueMeans" and Σ by "trueSigma".

  2. Multivariate Classification:
    In this part, you will work with datasets that consist of 2 classes C1 and C2. These datasets will contain N=1800 data instances and d=20 attributes.

  3. Multivariate Regression:
    1. (10 points) Create a dataset consisting of d inputs and one output as follows. For the d inputs, use the multivariate dataset X1 you generated in part 1 above with N=1000, trueMeans and trueSigmaA. For the output, use r = f(x) + ε, where f(x) = 3*average(x) - min(x); that is, the output is three times the average of the d input values minus the minimum input value. The noise is ε ~ N(μ=0, σ²=1).
    2. (5 points) Use random sampling to split your dataset into 2 parts: a training set (with 60% of the data instances) and a validation set (with the remaining 40% of the data instances).
    3. (10 points) Create a multivariate linear regression model over the training set using the regression functionality provided by the programming language that you chose (Matlab/R). Report the obtained regression formula in your written report.
    4. (10 points) Evaluate the regression model over the validation set. Report the Sum of Square Errors (SSE), the Root Mean Square Error (RMSE), the Relative Square Error (RSE), and the Coefficient of Determination (R²) of the regression model over the validation set. If the programming language you are using reports AIC, BIC, and/or log likelihood values, include these values in your report too.
    5. (Bonus points) See if the regression functionality in your chosen language (Matlab/R) allows the use of the Akaike information criterion (AIC) and/or the Bayesian information criterion (BIC), instead of minimizing SSE, to guide the construction of the regression model. If so, repeat part 4 above for AIC and then for BIC. Which of the three approaches produced better results? Explain.
    6. Bias and Variance:
      1. (10 points) Construct 10 new different datasets D1, ..., D10, each consisting of 100 data instances randomly generated with trueMeans and trueSigmaA. For the output, use r = f(x) + ε, where f(x) = 3*average(x) - min(x) and the noise ε ~ N(μ=0, σ²=1), as before.
      2. (10 points) Fit a multivariate linear regression formula gi to each of these datasets.
      3. (10 points) Estimate the bias and the variance using the formulas on slide 24 of Chapter 4 slides (see also Section 4.7 of the textbook). Apply the formulas for bias and variance over the x's in the dataset X1 (together with the output value) that you constructed in part 1 above (hence N=1000 and M=10).
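A Python/NumPy sketch of the bias/variance estimation (illustrative only; `trueMeans` and `trueSigmaA` below are hypothetical placeholders for the values you generated in Section B part 1):

```python
import numpy as np

rng = np.random.default_rng(42)      # recorded seed (arbitrary choice)
d, M, N = 20, 10, 1000
trueMeans = np.full(d, 5.0)          # hypothetical placeholder values; substitute
trueSigmaA = np.eye(d)               # your own trueMeans / trueSigmaA from part 1

def f(X):
    """True function: 3 * average of the inputs minus the minimum input."""
    return 3 * X.mean(axis=1) - X.min(axis=1)

# Evaluation points: the dataset X1 from part 1 (N=1000).
X1 = rng.multivariate_normal(trueMeans, trueSigmaA, N)

# Fit a linear model g_j on each of the M=10 small datasets D_j.
preds = np.empty((M, N))
for j in range(M):
    Dx = rng.multivariate_normal(trueMeans, trueSigmaA, 100)
    Dr = f(Dx) + rng.normal(0, 1, 100)
    A = np.hstack([np.ones((100, 1)), Dx])       # design matrix with intercept
    w, *_ = np.linalg.lstsq(A, Dr, rcond=None)
    preds[j] = np.hstack([np.ones((N, 1)), X1]) @ w

g_bar = preds.mean(axis=0)                       # average model over the M fits
bias2 = float(((g_bar - f(X1)) ** 2).mean())     # (1/N) sum_i (gbar(x_i) - f(x_i))^2
variance = float(((preds - g_bar) ** 2).mean())  # (1/(N M)) sum_ij (g_j(x_i) - gbar(x_i))^2
```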

Section C: Dimensionality Reduction (115 points) Page limit: 5 pages for this section.

Dataset: For this part of the project, you will use the Communities and Crime Data Set available at the UCI Machine Learning Repository. Carefully read the description provided for this dataset and familiarize yourself with the dataset as much as possible.

(5 points) Make the following modifications to the dataset:

  1. Remove the "communityname" attribute (string).
  2. Replace each missing attribute value in the dataset (denoted by "?") with the attribute's mean.
  3. Use random sampling to split your dataset into 2 parts: a training set (with 60% of the data instances) and a validation set (with the remaining 40% of the data instances). Let's call this training set TS and this validation set VS.
Use this modified dataset in all the experiments below. Note: Remember that feature selection and feature extraction methods should be applied to the input attributes only, not to the output (target) attribute.
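As an illustration of the three preprocessing steps, sketched in Python/pandas on a tiny toy table (in practice you would load the actual communities.data file with `pd.read_csv(..., na_values="?")`):

```python
import pandas as pd

# Toy stand-in for the Communities and Crime table; None plays the role of "?".
df = pd.DataFrame({
    "communityname": ["a", "b", "c", "d", "e"],
    "x1": [1.0, None, 3.0, 4.0, 2.0],
    "x2": [0.2, 0.4, None, 0.8, 0.6],
    "target": [0.1, 0.2, 0.3, 0.4, 0.5],
})

df = df.drop(columns="communityname")   # 1. drop the string attribute
df = df.fillna(df.mean())               # 2. impute missing values with column means

# 3. 60/40 random split into TS and VS (record the random_state as the seed).
TS = df.sample(frac=0.6, random_state=42)
VS = df.drop(TS.index)
```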

  1. Baseline Regression Model:
    1. ** Fitting a linear model:
      (5 points) Create a multivariate linear regression model over the training set TS using the regression functionality provided by the programming language that you chose (Matlab/R). Report the obtained regression formula in your written report and also report the time taken to construct this regression model (for this use timing functionality provided in Matlab/R).
    2. ** Evaluating the linear model:
      (5 points) Evaluate the regression model over the validation set VS. Report the Sum of Square Errors (SSE), the Root Mean Square Error (RMSE), the Relative Square Error (RSE), and the Coefficient of Determination (R²) of the regression model over the validation set.
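The fit/time/evaluate pattern, sketched in Python with scikit-learn for illustration (synthetic data stands in here for your preprocessed TS/VS split):

```python
import time
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for TS/VS; replace with the Communities & Crime split.
rng = np.random.default_rng(42)
Xtr, Xva = rng.random((600, 10)), rng.random((400, 10))
w_true = rng.random(10)
ytr = Xtr @ w_true + rng.normal(0, 0.1, 600)
yva = Xva @ w_true + rng.normal(0, 0.1, 400)

t0 = time.perf_counter()
model = LinearRegression().fit(Xtr, ytr)   # fit on the training set only
fit_seconds = time.perf_counter() - t0     # report this timing

pred = model.predict(Xva)                  # evaluate on the validation set
sse = float(((yva - pred) ** 2).sum())
rmse = (sse / len(yva)) ** 0.5
rse = sse / float(((yva - yva.mean()) ** 2).sum())
r2 = 1 - rse
```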

  2. Feature Selection: Sequential Subset Selection
    Look for a function (or functions) provided by the language you chose (Matlab or R) for doing feature selection. Try to find a function similar to the sequential subset selection (either forward or backward) described in Section 6.2 of the textbook.
    1. (5 points) Include the name(s) of the function(s) in the report. Briefly explain what the function does.
    2. (5 points) Apply the function to the training data TS. Include in your report the names of the attributes selected by this function.
    3. (10 points) Repeat steps 1 (** Fitting a linear model) and 2 (** Evaluating the linear model) described above, but now using just the selected subset of attributes constructed above. Remember that you need to modify the validation dataset VS so that it includes just the same exact subset of attributes selected from the training set.
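In scikit-learn (a Python counterpart to the Matlab/R functions the assignment points to), `sklearn.feature_selection.SequentialFeatureSelector` implements this wrapper search. A sketch on synthetic data where only 3 of 8 inputs drive the output:

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for TS: 8 inputs, but only columns 0, 3, 5 matter.
rng = np.random.default_rng(42)
X = rng.random((300, 8))
y = 2 * X[:, 0] - 3 * X[:, 3] + X[:, 5] + rng.normal(0, 0.05, 300)

# Forward sequential selection wrapped around a linear model (cf. Section 6.2).
sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select=3,
                                direction="forward").fit(X, y)
selected = np.flatnonzero(sfs.get_support())   # indices of the chosen attributes
```

Remember to keep only the same columns in VS (e.g. `VS[:, selected]`) before evaluating.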

  3. Feature Selection: Ranking Attributes
    Look for a function (or functions) provided by the language you chose (Matlab or R) for ranking attributes following the "Relief" approach.
    1. (5 points) Include the name(s) of the function(s) in the report. Briefly explain what the function does.
    2. (5 points) Apply the function to the training data TS. Include in your report the names of the top 50 attributes selected by this function in order of importance.
    3. (10 points) Repeat steps 1 (** Fitting a linear model) and 2 (** Evaluating the linear model) described above, but now using just the selected 50 attributes above. Remember that you need to modify the validation dataset VS so that it includes just the same exact 50 attributes selected from the training set.

  4. Feature Extraction: Principal Components Analysis
    Look for a function (or functions) provided by the language you chose (Matlab or R) for performing PCA.
    1. (5 points) Include the name(s) of the function(s) in the report.
    2. (5 points) Apply the function to the training data TS. Describe in your report the results of PCA. How many components were constructed? 128 or fewer? What is the minimum number of components needed to capture at least 90% of the data variance? Explain.
    3. (10 points) Repeat steps 1 (** Fitting a linear model) and 2 (** Evaluating the linear model) described above, but now using just the principal components needed to explain at least 90% of the data variance. Remember that you need to transform the validation dataset VS using the same exact transformation obtained from the training set.
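A sketch of the "smallest k capturing 90% of the variance" computation using scikit-learn's `PCA` (illustrative; synthetic correlated data stands in for TS):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for TS: 10 inputs that are mixtures of 3 underlying signals.
rng = np.random.default_rng(42)
base = rng.random((500, 3))
X = np.hstack([base, base @ rng.random((3, 7))]) + rng.normal(0, 0.01, (500, 10))

pca = PCA().fit(X)                          # fit on the training set only
cum = np.cumsum(pca.explained_variance_ratio_)
k90 = int(np.searchsorted(cum, 0.90) + 1)   # smallest k with >= 90% variance captured

# Project with the transformation learned from TS; do the same to VS.
Z_tr = pca.transform(X)[:, :k90]
```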

  5. Feature Extraction: Factor Analysis (FA)
    Look for a function (or functions) provided by the language you chose (Matlab or R) for performing factor analysis.
    1. (5 points) Include the name(s) of the function(s) in the report.
    2. (5 points) Apply the function to the training data TS. Describe in your report the results you obtained from factor analysis.
    3. (10 points) Repeat steps 1 (** Fitting a linear model) and 2 (** Evaluating the linear model) described above, but now using the obtained factors. Remember that you need to transform the validation dataset VS using the same exact transformation obtained from the training set.
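Scikit-learn's `sklearn.decomposition.FactorAnalysis` offers comparable functionality in Python; an illustrative sketch on synthetic data generated from 2 latent factors:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Synthetic stand-in for TS: 8 observed variables driven by 2 latent factors.
rng = np.random.default_rng(42)
F = rng.normal(size=(500, 2))                 # latent factors
load = rng.normal(size=(2, 8))                # factor loadings
X = F @ load + rng.normal(0, 0.1, (500, 8))   # observed data = loadings + noise

fa = FactorAnalysis(n_components=2).fit(X)    # fit on the training set only
Z_tr = fa.transform(X)                        # factor scores for the regression step
```

Use the same fitted object to transform the validation inputs (`fa.transform(...)`) before evaluating.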

  6. Comparison of Results
    (10 points) Create a table summarizing the results of the dimensionality reduction experiments above. This table should contain a column for each of the five methods used (Baseline, Sequential subset selection, Relief, PCA, and FA), and rows for the evaluation results reported in the corresponding experiments. (10 points) Briefly analyze the results described in this table.