WPI Worcester Polytechnic Institute

Computer Science Department
------------------------------------------

CS539 Machine Learning 
Project 1 - Fall 2015

PROF. CAROLINA RUIZ 

Due Dates:
Phase I Written Report: Online Submission by Saturday, October 10th 2015 at 11:59 pm
Phase II Slides: Email Submission by Friday, October 16th 2015 at 12 noon

Phase II Written Report: Hardcopy Submission by Friday, October 16th 2015 2:59 pm  

------------------------------------------

Project Instructions

Bonus points: Students who contribute informative entries to the CS539inR Wiki will receive bonus points. Let me know when you contribute to the Wiki.

Section A: Univariate Data (175 points + bonus points) Page limit: 5 pages for this section.

Important: When you are asked to randomly generate data, make sure to record the random seed used for the generation so that you can reproduce your experiments later.

  1. Data Generation:
    (5 points) Randomly generate a dataset X with N=1000 consisting of one attribute normally distributed with mean=60 and standard deviation=8.
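The language is your choice (Matlab/R); purely as an illustrative sketch, the generation step could look like the following in Python with NumPy (the seed value 42 is an arbitrary choice, recorded so the run can be reproduced):

```python
import numpy as np

SEED = 42  # arbitrary, but record it so the experiment can be reproduced
rng = np.random.default_rng(SEED)

# N=1000 instances of a single attribute ~ N(mean=60, sd=8)
X = rng.normal(loc=60, scale=8, size=1000)
```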

  2. MLE:
    1. (10 points) Use the formulas in Eq. (4.8), p. 68, to find the Maximum Likelihood Estimation (MLE) of the sample distribution parameters (mean and standard deviation) directly from the sample. Show your work in the report.
    2. (10 points) Use the Maximum Likelihood Estimation (MLE) function provided by your choice of Matlab/R to calculate these parameter values from X. Do these parameter values coincide with the ones you found directly from the formulas above? Explain.
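As a sketch of what part 1 computes (shown here in Python/NumPy for illustration rather than Matlab/R), note that the MLE formulas divide by N, not N-1:

```python
import numpy as np

rng = np.random.default_rng(42)             # recorded seed (arbitrary choice)
X = rng.normal(loc=60, scale=8, size=1000)

N = len(X)
m = X.sum() / N                  # MLE of the mean: m = (1/N) * sum(x_t)
s2 = ((X - m) ** 2).sum() / N    # MLE of the variance: divide by N, not N-1
s = s2 ** 0.5                    # MLE of the standard deviation
```

For part 2, be aware that `np.std` uses `ddof=0` by default, i.e. it already computes the MLE, whereas many library functions (e.g. Matlab's `std` and R's `sd`) default to the unbiased N-1 estimator; that difference is exactly the kind of discrepancy you may need to explain.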

  3. MAP and Bayes' Estimator:
    In this part, you will look at the Maximum A Posteriori (MAP) estimate and the Bayes' estimator to estimate the parameter values of the sample X above. Assume that the collection of all these possible parameter value estimates is also normally distributed. That is, X ~ N(θ, σ²) and θ ~ N(μ0, σ0²). Assume that σ=8, μ0=60, σ0=3.
    1. (10 points) Calculate the MAP estimate and the Bayes' estimate of the mean value used to generate data sample X. Are the MAP estimate and the Bayes' estimate the same in this case? Why or why not?
    2. (5 points) Should the MAP estimate in this case be the same as the mean estimated by MLE? Why or why not?
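With a normal likelihood (known σ) and a normal prior on θ, the posterior over θ is itself normal, so its mode (the MAP estimate) and its mean (the Bayes' estimate) coincide. An illustrative Python/NumPy sketch of the computation:

```python
import numpy as np

rng = np.random.default_rng(42)             # recorded seed (arbitrary choice)
X = rng.normal(loc=60, scale=8, size=1000)

sigma, mu0, sigma0 = 8.0, 60.0, 3.0  # given: X ~ N(theta, sigma^2), theta ~ N(mu0, sigma0^2)
N, m = len(X), X.mean()

# Posterior mean is a precision-weighted average of the sample mean and the prior mean.
w = (N / sigma**2) / (N / sigma**2 + 1 / sigma0**2)
theta_bayes = w * m + (1 - w) * mu0
theta_map = theta_bayes              # identical, because the posterior is normal
```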

  4. Classification:
    1. (5 points) Randomly generate 3 normally distributed samples, each consisting of just one attribute as follows:
      • Sample 1: number of instances: 500, mean=60 and standard deviation=8.
      • Sample 2: number of instances: 300, mean=30 and standard deviation=12.
      • Sample 3: number of instances: 200, mean=80 and standard deviation=4.
      Create a dataset X that consists of these 3 samples, where data instances in Sample i above belong to class Ci, for i=1, 2, 3.
    2. (10 points) Following the material presented in Section 4.5 of the textbook, define a precise discriminant function gi for each class Ci. Remember to apply MLE to estimate the parameters of each of the classes. Show your work.
    3. (5 points) Based on these discriminant functions, what would be the chosen class for each of the following inputs: x = 10, 30, 50, 70, 90? Show your work.
    4. (15 points) Find analytically the "decision thresholds" (see Fig. 4.2 p. 75) for these 3 classes.
    5. (5 points) Implement each of these 3 discriminant functions gi as a function in your choice of Matlab/R.
    6. (5 points) Based on these 3 functions, implement a "decision" function that receives a number x as its input and outputs i, where i is the chosen class for input x. Test your function on inputs: x = 10, 30, 50, 70, 90. Show the results in your report.
    7. (5 points) Use your decision function on inputs: x = 0, 0.5, 1, 1.5, ..., 99, 99.5, 100. Do the "decision thresholds" you calculated analytically coincide with the results of this test? Explain.
    8. (10 points) Generate a pair of plots like those in Fig. 4.2 for this particular dataset.
    9. (10 points) Use stratified random sampling to split your dataset into 2 parts: a training set (with 60% of the data instances) and a validation set (with the remaining 40% of the data instances). Test the "decision" function that you implemented on part 6 above on the validation set. Report the accuracy and the confusion matrix of your decision function, as well as the precision and the recall of your decision function for each of the three classes.
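An illustrative Python/NumPy sketch of parts 1, 2, 5, and 6 (the assignment itself asks for Matlab/R): each discriminant combines the MLE parameters with the class prior estimated from the sample sizes.

```python
import numpy as np

rng = np.random.default_rng(42)  # recorded seed (arbitrary choice)
samples = {1: rng.normal(60, 8, 500),
           2: rng.normal(30, 12, 300),
           3: rng.normal(80, 4, 200)}
N_total = sum(len(s) for s in samples.values())

# MLE mean/sd and prior P(Ci) per class, estimated from the pooled dataset.
params = {i: (s.mean(), s.std(), len(s) / N_total) for i, s in samples.items()}

def g(i, x):
    """Discriminant g_i(x) = -log s_i - (x - m_i)^2 / (2 s_i^2) + log P(C_i)."""
    m, s, prior = params[i]
    return -np.log(s) - (x - m) ** 2 / (2 * s ** 2) + np.log(prior)

def decide(x):
    """Decision function: return the class i maximizing g_i(x)."""
    return max(params, key=lambda i: g(i, x))

choices = [decide(x) for x in (10, 30, 50, 70, 90)]
```

Sweeping `decide` over x = 0, 0.5, ..., 100 (part 7) reveals the empirical decision thresholds to compare against the analytical ones.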

  5. Regression:
    1. (10 points) Create a dataset consisting of one input and one output as follows. For the input, use the dataset X you generated in part 1 above with N=1000, mean=60 and standard deviation=8. For the output, use r = f(x) + ε, where f(x) = 2 sin(1.5x) and the noise ε ~ N(μ=0, σ²=1) (as in the example in Sections 4.6-4.8, pp. 77-87).
    2. (5 points) Use random sampling to split your dataset into 2 parts: a training set (with 60% of the data instances) and a validation set (with the remaining 40% of the data instances).
    3. (10 points) Create three 2-dimensional plots: one for the entire dataset X, one for the training set, and one for the validation set. In each of these plots, the x axis corresponds to the input variable x, and the y axis corresponds to the output (response) variable r.
    4. (15 points) Create 5 different regression models over the training set using the regression functionality provided by the programming language that you chose (Matlab/R):
      gk(x | wk, ..., w0) = wk x^k + ... + w1 x + w0, for k = 0, 1, 2, 3, 4. Report the obtained coefficients in your written report.
    5. (15 points) Create two 2-dimensional plots: one containing the training set and the 5 fitting curves, and one containing the validation set and the 5 fitting curves obtained over the training set. In each of these plots, the x axis corresponds to the input variable x, and the y axis corresponds to the output (response) variable r.
    6. (10 points) Evaluate each of the 5 regression models over the validation set. Report the Sum of Square Errors (SSE), the Root Mean Square Error (RMSE), the Relative Square Error (RSE), and the Coefficient of Determination (R²) of each regression model over the validation set. If the programming language you are using reports AIC, BIC, and/or log likelihood values, include these values in your report too. Based on these error measures, which model would you pick among the five regression models? Explain.
    7. (Bonus points) See if the regression functionality in your chosen language (Matlab/R) allows the use of the Akaike information criterion (AIC) and/or the Bayesian information criterion (BIC), instead of minimizing SSE, to guide the construction of the regression model. If so, repeat parts 4 and 6 above for AIC and then for BIC. Which of the three approaches produced better results? Explain.
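A compact Python/NumPy sketch of parts 1, 2, 4, and 6 (illustrative only; the assignment specifies Matlab/R):

```python
import numpy as np

rng = np.random.default_rng(42)                     # recorded seed (arbitrary choice)
x = rng.normal(60, 8, 1000)
r = 2 * np.sin(1.5 * x) + rng.normal(0, 1, 1000)    # r = f(x) + noise

# Random 60/40 train/validation split.
idx = rng.permutation(1000)
tr, va = idx[:600], idx[600:]

metrics = {}
for k in range(5):                                  # polynomial degrees k = 0..4
    coeffs = np.polyfit(x[tr], r[tr], k)            # least-squares fit on the training set
    pred = np.polyval(coeffs, x[va])
    err = r[va] - pred
    sse = float((err ** 2).sum())
    rmse = (sse / len(va)) ** 0.5
    rse = sse / float(((r[va] - r[va].mean()) ** 2).sum())
    metrics[k] = {"SSE": sse, "RMSE": rmse, "RSE": rse, "R2": 1 - rse}
```

Note that R² = 1 - RSE by definition, so the two measures always rank the models identically.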

Section B: Multivariate Data (155 points + bonus points) Page limit: 5 pages for this section.

Important: When you are asked to randomly generate data, make sure to record the random seed used for the generation so that you can reproduce your experiments later.

  1. Multivariate Normal Distribution:
    In this part, you will work with randomly generated datasets with N=1000 data instances and d=20 dimensions (attributes). Each dataset will be generated using a multivariate normal distribution with parameters μ (1-by-d vector of means, one for each attribute) and Σ (d-by-d covariance matrix). To simplify the notation, we'll denote μ by "trueMeans" and Σ by "trueSigma".

  2. Multivariate Classification:
    In this part, you will work with datasets that consist of 2 classes C1 and C2. These datasets will contain N=1800 data instances and d=20 attributes.

  3. Multivariate Regression:
    1. (10 points) Create a dataset consisting of d inputs and one output as follows. For the d inputs, use the multivariate dataset X1 you generated in part 1 above with N=1000, trueMeans and trueSigmaA. For the output, use r = f(x) + ε, where f(x) = 3*average(x) - min(x); that is, the output is three times the average of the d input values minus the minimum input value. The noise is ε ~ N(μ=0, σ²=1).
    2. (5 points) Use random sampling to split your dataset into 2 parts: a training set (with 60% of the data instances) and a validation set (with the remaining 40% of the data instances).
    3. (10 points) Create a multivariate linear regression model over the training set using the regression functionality provided by the programming language that you chose (Matlab/R). Report the obtained regression formula in your written report.
    4. (10 points) Evaluate the regression model over the validation set. Report the Sum of Square Errors (SSE), the Root Mean Square Error (RMSE), the Relative Square Error (RSE), and the Coefficient of Determination (R²) of the regression model over the validation set. If the programming language you are using reports AIC, BIC, and/or log likelihood values, include these values in your report too.
    5. (Bonus points) See if the regression functionality in your chosen language (Matlab/R) allows the use of the Akaike information criterion (AIC) and/or the Bayesian information criterion (BIC), instead of minimizing SSE, to guide the construction of the regression model. If so, repeat part 4 above for AIC and then for BIC. Which of the three approaches produced better results? Explain.
    6. Bias and Variance:
      1. (10 points) Construct 10 new different datasets D1, ..., D10, each consisting of 100 data instances randomly generated with trueMeans and trueSigmaA. For the output, use r = f(x) + ε, where f(x) = 3*average(x) - min(x) and the noise ε ~ N(μ=0, σ²=1), as before.
      2. (10 points) Fit a multivariate linear regression formula gi to each of these datasets.
      3. (10 points) Estimate the bias and the variance using the formulas on slide 24 of Chapter 4 slides (see also Section 4.7 of the textbook). Apply the formulas for bias and variance over the x's in the dataset X1 (together with the output value) that you constructed in part 1 above (hence N=1000 and M=10).
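A Python/NumPy sketch of the bias/variance estimation (illustrative only; `trueMeans` and `trueSigmaA` below are hypothetical placeholders for the values you generated in Section B part 1):

```python
import numpy as np

rng = np.random.default_rng(42)      # recorded seed (arbitrary choice)
d, M, N = 20, 10, 1000
trueMeans = np.full(d, 5.0)          # hypothetical placeholder values; substitute
trueSigmaA = np.eye(d)               # your own trueMeans / trueSigmaA from part 1

def f(X):
    """True function: 3 * average of the inputs minus the minimum input."""
    return 3 * X.mean(axis=1) - X.min(axis=1)

# Evaluation points: the dataset X1 from part 1 (N=1000).
X1 = rng.multivariate_normal(trueMeans, trueSigmaA, N)

# Fit a linear model g_j on each of the M=10 small datasets D_j.
preds = np.empty((M, N))
for j in range(M):
    Dx = rng.multivariate_normal(trueMeans, trueSigmaA, 100)
    Dr = f(Dx) + rng.normal(0, 1, 100)
    A = np.hstack([np.ones((100, 1)), Dx])       # design matrix with intercept
    w, *_ = np.linalg.lstsq(A, Dr, rcond=None)
    preds[j] = np.hstack([np.ones((N, 1)), X1]) @ w

g_bar = preds.mean(axis=0)                       # average model over the M fits
bias2 = float(((g_bar - f(X1)) ** 2).mean())     # (1/N) sum_i (gbar(x_i) - f(x_i))^2
variance = float(((preds - g_bar) ** 2).mean())  # (1/(N M)) sum_ij (g_j(x_i) - gbar(x_i))^2
```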

Section C: Dimensionality Reduction (115 points) Page limit: 5 pages for this section.

Dataset: For this part of the project, you will use the Communities and Crime Data Set available at the UCI Machine Learning Repository. Carefully read the description provided for this dataset and familiarize yourself with the dataset as much as possible.

(5 points) Make the following modifications to the dataset:

  1. Remove the "communityname" attribute (string).
  2. Replace each missing attribute value in the dataset (denoted by "?") with the attribute's mean.
  3. Use random sampling to split your dataset into 2 parts: a training set (with 60% of the data instances) and a validation set (with the remaining 40% of the data instances). Let's call this training set TS and this validation set VS.
Use this modified dataset in all the experiments below. Note: Remember that feature selection and feature extraction methods should be applied to the input attributes only, not to the output (target) attribute.
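As an illustration of the three preprocessing steps, sketched in Python/pandas on a tiny toy table (in practice you would load the actual communities.data file with `pd.read_csv(..., na_values="?")`):

```python
import pandas as pd

# Toy stand-in for the Communities and Crime table; None plays the role of "?".
df = pd.DataFrame({
    "communityname": ["a", "b", "c", "d", "e"],
    "x1": [1.0, None, 3.0, 4.0, 2.0],
    "x2": [0.2, 0.4, None, 0.8, 0.6],
    "target": [0.1, 0.2, 0.3, 0.4, 0.5],
})

df = df.drop(columns="communityname")   # 1. drop the string attribute
df = df.fillna(df.mean())               # 2. impute missing values with column means

# 3. 60/40 random split into TS and VS (record the random_state as the seed).
TS = df.sample(frac=0.6, random_state=42)
VS = df.drop(TS.index)
```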

  1. Baseline Regression Model:
    1. ** Fitting a linear model:
      (5 points) Create a multivariate linear regression model over the training set TS using the regression functionality provided by the programming language that you chose (Matlab/R). Report the obtained regression formula in your written report and also report the time taken to construct this regression model (for this use timing functionality provided in Matlab/R).
    2. ** Evaluating the linear model:
      (5 points) Evaluate the regression model over the validation set VS. Report the Sum of Square Errors (SSE), the Root Mean Square Error (RMSE), the Relative Square Error (RSE), and the Coefficient of Determination (R²) of the regression model over the validation set.
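The fit/time/evaluate pattern, sketched in Python with scikit-learn for illustration (synthetic data stands in here for your preprocessed TS/VS split):

```python
import time
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for TS/VS; replace with the Communities & Crime split.
rng = np.random.default_rng(42)
Xtr, Xva = rng.random((600, 10)), rng.random((400, 10))
w_true = rng.random(10)
ytr = Xtr @ w_true + rng.normal(0, 0.1, 600)
yva = Xva @ w_true + rng.normal(0, 0.1, 400)

t0 = time.perf_counter()
model = LinearRegression().fit(Xtr, ytr)   # fit on the training set only
fit_seconds = time.perf_counter() - t0     # report this timing

pred = model.predict(Xva)                  # evaluate on the validation set
sse = float(((yva - pred) ** 2).sum())
rmse = (sse / len(yva)) ** 0.5
rse = sse / float(((yva - yva.mean()) ** 2).sum())
r2 = 1 - rse
```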

  2. Feature Selection: Sequential Subset Selection
    Look for a function (or functions) provided by the language you chose (Matlab or R) for doing feature selection. Try to find a function similar to the sequential subset selection (either forward or backward) described in Section 6.2 of the textbook.
    1. (5 points) Include the name(s) of the function(s) in the report. Briefly explain what the function does.
    2. (5 points) Apply the function to the training data TS. Include in your report the names of the attributes selected by this function.
    3. (10 points) Repeat steps 1 (** Fitting a linear model) and 2 (** Evaluating the linear model) described above, but now using just the selected subset of attributes constructed above. Remember that you need to modify the validation dataset VS so that it includes just the same exact subset of attributes selected from the training set.
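In scikit-learn (a Python counterpart to the Matlab/R functions the assignment points to), `sklearn.feature_selection.SequentialFeatureSelector` implements this wrapper search. A sketch on synthetic data where only 3 of 8 inputs drive the output:

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for TS: 8 inputs, but only columns 0, 3, 5 matter.
rng = np.random.default_rng(42)
X = rng.random((300, 8))
y = 2 * X[:, 0] - 3 * X[:, 3] + X[:, 5] + rng.normal(0, 0.05, 300)

# Forward sequential selection wrapped around a linear model (cf. Section 6.2).
sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select=3,
                                direction="forward").fit(X, y)
selected = np.flatnonzero(sfs.get_support())   # indices of the chosen attributes
```

Remember to keep only the same columns in VS (e.g. `VS[:, selected]`) before evaluating.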

  3. Feature Selection: Ranking Attributes
    Look for a function (or functions) provided by the language you chose (Matlab or R) for ranking attributes following the "Relief" approach.
    1. (5 points) Include the name(s) of the function(s) in the report. Briefly explain what the function does.
    2. (5 points) Apply the function to the training data TS. Include in your report the names of the top 50 attributes selected by this function in order of importance.
    3. (10 points) Repeat steps 1 (** Fitting a linear model) and 2 (** Evaluating the linear model) described above, but now using just the selected 50 attributes above. Remember that you need to modify the validation dataset VS so that it includes just the same exact 50 attributes selected from the training set.

  4. Feature Extraction: Principal Components Analysis
    Look for a function (or functions) provided by the language you chose (Matlab or R) for performing PCA.
    1. (5 points) Include the name(s) of the function(s) in the report.
    2. (5 points) Apply the function to the training data TS. Describe in your report the results of PCA. How many components were constructed? 128 or fewer? What is the minimum number of components needed to capture at least 90% of the data variance? Explain.
    3. (10 points) Repeat steps 1 (** Fitting a linear model) and 2 (** Evaluating the linear model) described above, but now using just the principal components needed to explain at least 90% of the data variance. Remember that you need to transform the validation dataset VS using the same exact transformation obtained from the training set.
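A sketch of the "smallest k capturing 90% of the variance" computation using scikit-learn's `PCA` (illustrative; synthetic correlated data stands in for TS):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for TS: 10 inputs that are mixtures of 3 underlying signals.
rng = np.random.default_rng(42)
base = rng.random((500, 3))
X = np.hstack([base, base @ rng.random((3, 7))]) + rng.normal(0, 0.01, (500, 10))

pca = PCA().fit(X)                          # fit on the training set only
cum = np.cumsum(pca.explained_variance_ratio_)
k90 = int(np.searchsorted(cum, 0.90) + 1)   # smallest k with >= 90% variance captured

# Project with the transformation learned from TS; do the same to VS.
Z_tr = pca.transform(X)[:, :k90]
```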

  5. Feature Extraction: Factor Analysis (FA)
    Look for a function (or functions) provided by the language you chose (Matlab or R) for performing factor analysis.
    1. (5 points) Include the name(s) of the function(s) in the report.
    2. (5 points) Apply the function to the training data TS. Describe in your report the results you obtained from factor analysis.
    3. (10 points) Repeat steps 1 (** Fitting a linear model) and 2 (** Evaluating the linear model) described above, but now using the obtained factors. Remember that you need to transform the validation dataset VS using the same exact transformation obtained from the training set.
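Scikit-learn's `sklearn.decomposition.FactorAnalysis` offers comparable functionality in Python; an illustrative sketch on synthetic data generated from 2 latent factors:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Synthetic stand-in for TS: 8 observed variables driven by 2 latent factors.
rng = np.random.default_rng(42)
F = rng.normal(size=(500, 2))                 # latent factors
load = rng.normal(size=(2, 8))                # factor loadings
X = F @ load + rng.normal(0, 0.1, (500, 8))   # observed data = loadings + noise

fa = FactorAnalysis(n_components=2).fit(X)    # fit on the training set only
Z_tr = fa.transform(X)                        # factor scores for the regression step
```

Use the same fitted object to transform the validation inputs (`fa.transform(...)`) before evaluating.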

  6. Comparison of Results
    (10 points) Create a table summarizing the results of the dimensionality reduction experiments above. This table should contain a column for each of the five methods used (Baseline, Sequential subset selection, Relief, PCA, and FA), and rows for the evaluation results reported in the corresponding experiments. (10 points) Briefly analyze the results described in this table.