Due Dates:
Phase I Written Report: Online Submission by Saturday, October 10th 2015 at 11:59 pm
Phase II Slides: Email Submission by Friday, October 16th 2015 at 12 noon
Phase II Written Report: Hardcopy Submission by Friday, October 16th 2015 at 2:59 pm
Project Instructions
- Phase I:
Work on all the parts of the project by yourself without help from anyone or
outside sources.
Submit an individual written report
(in pdf, using the myWPI CS539 submission feature for Proj1)
covering all the aspects of the project
by the Phase I deadline specified above.
- Phase II:
Based on the individual work you did for Phase I, work on all the parts of
the project again with all members of your assigned group.
You cannot use help from anyone outside of your group, or outside sources.
Submit a group written report (just one submission per group, hardcopy
by the beginning of class) covering all
the aspects of the project by the Phase II deadline specified above.
Important instructions for Phase II:
- Your group project should be greater than the sum of the teammates' individual
parts. That is, your group project should reflect the work that you did as a team,
building upon your individual projects. It should NOT be just a mere combination
of the individual reports.
- Hence, it is expected that your group will meet for a significant amount of
time to discuss ideas, answer each other's questions, rerun experiments as needed,
and produce a solid group report and presentation slides.
- Phase II Written Report:
- The font size must be at least 11pts.
- Your written report
(including all graphs, figures, and appendices)
must fit within the space limits specified in the project description.
Note that 1 page is equal to one side of a sheet of paper;
if you print your report double-sided, each sheet is equivalent to 2 pages.
Only the required sections within the given space limits will
be read and graded.
Exceeding page limits will lower your project grade.
- Given the page limits included in the project description,
your group should discuss and summarize the results
of the project, including only the most relevant and significant results and
findings.
- Your group report should contain an authorship page
describing in detail the work that each of you did on Phase II.
- Project Presentations:
We will discuss the results of each project in class. Your oral report
should summarize the most important parts of your written report and
should elaborate only on the most significant or unique parts of
your work. Each group will have about 5 minutes to present their project
in class. Try to summarize results using tables, visualizations, and
graphical depictions when possible.
Given the time constraints, focus your presentation on the
most relevant, unique, or creative parts of your project.
Emphasize what you learned in the project.
Be prepared and use your presentation time wisely!
Slides Submission:
Please submit the following file containing your oral
presentation by email to the professor
at least THREE HOURS before the beginning of class on the day
the project is due:
[your-lastnames]_proj1_slides.[ext]
containing the slides for your oral report.
This file should be either a PDF file (ext=pdf)
or a PowerPoint file (ext=pptx).
Please use only lower case letters in the filename.
List the lastnames in alphabetical order separated by "_".
For instance, the file with my slides with teammates
Bayes and Gauss
would be named bayes_gauss_ruiz_proj1_slides.pptx
- Project Grade:
Phase I individual report:                  200 points
Phase II group report:                      455 points
Project presentation:                        50 points
Class participation during presentations:     5 points
Total:                                      700 points
- Course Discussion Forum:
You can use the myWPI course forums to discuss the project as needed.
- Topics Covered by the Project:
Carefully study Chapters 1, 2, 3, 4, 5, and 6.1-6.8 of the textbook.
- Programming Language: Matlab or R.
Do the project using
Matlab or R or both.
No other programming languages are allowed on this project.
You need to write your own code.
Make sure to consult online documentation for Matlab and/or for R.
Also, my miscellaneous notes on Matlab and R
may be useful for this project.
Bonus points: Students who contribute informative entries to the
CS539inR Wiki
will receive some bonus points. Let me know when you contribute to the Wiki.
Section A: Univariate Data (175 points + bonus points)
Page limit: 5 pages for this section.
Important: When you are asked to randomly generate data,
make sure to record the random seed used for the generation
so that you can reproduce your experiments later.
- Data Generation:
(5 points) Randomly generate a dataset X with N=1000 instances consisting of one attribute
normally distributed with mean=60 and standard deviation=8.
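The data-generation step can be sketched as follows. Note that the project itself must be done in Matlab or R; this numpy version only illustrates the steps, and the seed value 539 is an arbitrary choice (record whatever seed you actually use).

```python
import numpy as np

# Fix and record the random seed so the experiment is reproducible.
# The seed 539 is an arbitrary stand-in; use and record your own.
rng = np.random.default_rng(539)

# One attribute, N=1000 instances, normally distributed with
# mean 60 and standard deviation 8.
X = rng.normal(loc=60, scale=8, size=1000)
```

Re-running the script with the same seed reproduces the exact same sample.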
- MLE:
- (10 points)
Use the formulas (4.8) p. 68 to find the Maximum Likelihood Estimation (MLE) of
sample distribution parameters (mean and standard deviation) directly from the sample.
Show your work in the report.
- (10 points)
Use the Maximum Likelihood Estimation (MLE) function provided by
your choice of Matlab/R to calculate these parameter values from X.
Do these parameter values coincide with the ones you found directly from
the formulas above? Explain.
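A sketch of the formula-based computation, assuming numpy (the project must use Matlab or R, e.g. Matlab's mle function or MASS::fitdistr in R; this only mirrors the math):

```python
import numpy as np

rng = np.random.default_rng(539)       # arbitrary stand-in seed
X = rng.normal(60, 8, 1000)
N = len(X)

# Formulas (4.8): m = (1/N) * sum(x_t),  s^2 = (1/N) * sum((x_t - m)^2)
m = X.sum() / N
s = (((X - m) ** 2).sum() / N) ** 0.5  # note the divisor N, not N-1

# Built-in estimators often default to the unbiased divisor N-1, which
# is one reason the reported standard deviations may differ slightly.
s_unbiased = X.std(ddof=1)
```

Comparing s against the built-in value makes the N versus N-1 divisor difference visible.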
- MAP and Bayes' Estimator:
In this part, you will look at the Maximum A Posteriori (MAP) and
Bayes' estimator to estimate the parameter values of the sample X above.
Assume that the collection of all these possible parameter value estimates
is also normally distributed. That is,
X ~ N(θ, σ²) and
θ ~ N(μ₀, σ₀²).
Assume that σ=8, μ₀=60, σ₀=3.
- (10 points)
Calculate the MAP estimate and the Bayes' estimate of the mean value
used to generate data sample X.
Are the MAP estimate and the Bayes' estimate the same in this case?
Why or why not?
- (5 points)
Should the MAP estimate in this case be the same as the mean estimated by MLE?
Why or why not?
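The calculation above can be sketched as follows, assuming numpy and the sample X from part 1 (seed 539 is again an arbitrary stand-in). Because both the prior and the likelihood are normal, the posterior of θ is normal, so its mode (the MAP estimate) and its mean (the Bayes' estimate) are the same number:

```python
import numpy as np

rng = np.random.default_rng(539)
X = rng.normal(60, 8, 1000)

sigma, mu0, sigma0 = 8.0, 60.0, 3.0
N, m = len(X), X.mean()

# Posterior mean/mode: a precision-weighted combination of the sample
# mean m and the prior mean mu0. With N=1000 the prior weight is tiny,
# so theta lands very close to m.
theta = (N / sigma**2 * m + mu0 / sigma0**2) / (N / sigma**2 + 1 / sigma0**2)
```

The estimate always lies between m and μ₀, pulled toward whichever has higher precision.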
- Classification:
- (5 points)
Randomly generate 3 normally distributed samples,
each consisting of just one attribute as follows:
- Sample 1: number of instances: 500, mean=60 and standard deviation=8.
- Sample 2: number of instances: 300, mean=30 and standard deviation=12.
- Sample 3: number of instances: 200, mean=80 and standard deviation=4.
Create a dataset X that consists of these 3 samples, where data instances in
Sample i above belong to class Ci, for i=1, 2, 3.
- (10 points)
Following the material presented in Section 4.5 of the textbook,
define a precise discriminant function gi for each class Ci. Remember to apply MLE to estimate the parameters of each of the classes.
Show your work.
- (5 points)
Based on these discriminant functions, what would be the chosen class
for each of the following inputs: x = 10, 30, 50, 70, 90.
Show your work.
- (15 points)
Find analytically the "decision thresholds" (see Fig. 4.2 p. 75)
for these 3 classes.
- (5 points)
Implement each of these 3 discriminant functions gi
as a function in your choice of Matlab/R.
- (5 points)
Based on these 3 functions, implement a "decision" function that receives a
number x as its input and outputs i, where i is the chosen class for input x.
Test your function on inputs: x = 10, 30, 50, 70, 90.
Show the results in your report.
- (5 points)
Use your decision function on inputs: x = 0, 0.5, 1, 1.5, ..., 99, 99.5, 100.
Do the "decision thresholds" you calculated analytically coincide with the
results of this test? Explain.
- (10 points)
Generate a pair of plots like those in Fig. 4.2 for this particular
dataset.
- (10 points)
Use stratified random sampling to split your dataset into 2 parts:
a training set (with 60% of the data instances) and a validation set (with
the remaining 40% of the data instances).
Test the "decision" function that you implemented on part 6 above
on the validation set.
Report the accuracy and the confusion matrix of your decision function,
as well as the precision and the recall of your decision function
for each of the three classes.
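The classification steps above can be sketched as follows, assuming numpy (the project requires Matlab or R; seed 539 is an arbitrary stand-in). Each class gets a log-discriminant of the form gi(x) = -log(si) - (x - mi)² / (2 si²) + log P(Ci), with parameters estimated by MLE per class, and the decision function returns the argmax:

```python
import numpy as np

rng = np.random.default_rng(539)
spec = [(500, 60, 8), (300, 30, 12), (200, 80, 4)]  # (n, mean, sd) per class
X = np.concatenate([rng.normal(mu, sd, n) for n, mu, sd in spec])
y = np.concatenate([np.full(n, i + 1) for i, (n, _, _) in enumerate(spec)])
N = len(X)

# MLE parameters and prior P(Ci) = ni/N for each class
est = []
for i in (1, 2, 3):
    xi = X[y == i]
    est.append((xi.mean(), xi.std(ddof=0), len(xi) / N))

def g(i, x):
    # Log-discriminant of class i (1-based), Section 4.5 form
    m, s, p = est[i - 1]
    return -np.log(s) - (x - m) ** 2 / (2 * s ** 2) + np.log(p)

def decide(x):
    # Chosen class = argmax over the three discriminants
    return max((1, 2, 3), key=lambda i: g(i, x))

choices = [decide(x) for x in (10, 30, 50, 70, 90)]
```

Plotting g1, g2, g3 over a grid of x values then reveals the decision thresholds directly.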
- Regression:
- (10 points)
Create a dataset consisting of one input and one output as follows.
For the input, use the dataset X you generated in part I above with
N=1000, mean=60 and standard deviation=8.
For the output, use r = f(x) + ε where
f(x) = 2 sin(1.5x), and the noise
ε ~ N(μ=0, σ²=1).
(as in the example in Sections 4.6-4.8 pp. 77-87).
- (5 points)
Use random sampling to split your dataset into 2 parts:
a training set (with 60% of the data instances) and a validation set (with
the remaining 40% of the data instances).
- (10 points)
Create three 2-dimensional plots: one for the entire dataset X,
one for the training set, and one for the validation set.
In each of these plots, the x axis corresponds to the input variable x, and
the y axis corresponds to the output (response) variable r.
- (15 points)
Create 5 different regression models over the training set
using the regression functionality
provided by the programming language that you chose (Matlab/R):
gk(x | wk, ..., w0) = wk x^k + ... + w1 x + w0, for k = 0, 1, 2, 3, 4.
Report the obtained coefficients in your written report.
- (15 points)
Create two 2-dimensional plots:
one containing the training set and the 5 fitting curves, and
one containing the validation set and the 5 fitting curves obtained
over the training set.
In each of these plots, the x axis corresponds to the input variable x, and
the y axis corresponds to the output (response) variable r.
- (10 points)
Evaluate each of the 5 regression models over the validation set.
Report the Sum of Square Errors (SSE), the Root Mean Square Error (RMSE), the Relative Square Error (RSE), and the Coefficient of Determination (R²)
of each regression model over the validation set.
If the programming language you are using reports AIC, BIC, and/or log likelihood values, include these values in your report too.
Based on these error measures, which model would you pick among the five
regression models? Explain.
- (Bonus points)
See if the regression functionality in your chosen language (Matlab/R) allows the use of the Akaike information criterion (AIC)
and/or the Bayesian information criterion (BIC),
instead of minimizing SSE, to guide the construction of the regression model.
If so, repeat parts 4 and 6 above for AIC and then for BIC.
Which of the three approaches produced better results? Explain.
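The five fits and the validation-set error measures can be sketched as follows, assuming numpy (np.polyfit stands in for Matlab's polyfit or R's lm with poly(); the project itself must use Matlab or R, and seed 539 is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(539)
x = rng.normal(60, 8, 1000)
r = 2 * np.sin(1.5 * x) + rng.normal(0, 1, 1000)  # r = f(x) + noise

idx = rng.permutation(len(x))                     # random 60/40 split
tr, va = idx[:600], idx[600:]

results = {}
for k in range(5):                                # degrees k = 0..4
    w = np.polyfit(x[tr], r[tr], deg=k)           # coefficients w_k..w_0
    pred = np.polyval(w, x[va])
    sse = ((r[va] - pred) ** 2).sum()
    rmse = np.sqrt(sse / len(va))
    # RSE: error relative to predicting the validation mean
    rse = sse / ((r[va] - r[va].mean()) ** 2).sum()
    r2 = 1 - rse                                  # coefficient of determination
    results[k] = (sse, rmse, rse, r2)
```

Reporting the w vectors per degree covers the "obtained coefficients" requirement, and the results table feeds the model-selection discussion.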
Section B: Multivariate Data (155 points + bonus points)
Page limit: 5 pages for this section.
Important: When you are asked to randomly generate data,
make sure to record the random seed used for the generation
so that you can reproduce your experiments later.
- Multivariate Normal Distribution:
In this part, you will work with randomly generated datasets
with N=1000 data instances and d=20 dimensions (attributes).
Each dataset will be generated using a multivariate normal distribution with
parameters
μ (1-by-d vector of means, one for each attribute) and
Σ (d-by-d covariance matrix).
To simplify the notation, we'll denote μ by "trueMeans"
and Σ by "trueSigma".
- Multivariate Data Generation:
(10 points) Use functionality in the programming language you chose
(e.g., in Matlab, use the mvnrnd function "Multivariate normal random numbers")
to randomly generate three multivariate normally distributed datasets X1, X2, X3
as described below.
- trueMeans: For all three datasets use the same vector of means:
trueMeans.
- trueSigma:
The covariance (Sigma) matrix for each dataset is specified below:
- Dataset X1 (arbitrary covariance matrix):
Use trueSigmaA as covariance (Sigma) matrix.
- Dataset X2 (diagonal covariance matrix):
Use trueSigmaD as covariance (Sigma) matrix.
- Dataset X3 (identity covariance matrix):
Use the d-by-d identity matrix as covariance (Sigma) matrix.
- Parameter Estimation:
(10 points) For each of the datasets X1, X2, and X3 do the following:
- Estimate the parameters μ and Σ from the dataset.
Let's call these estimates "estimatedMeans" and "estimatedSigma".
Compare these estimates with the trueMeans and the trueSigma used to
generate the dataset and describe your observations.
- Devise a good way to plot the dataset in 2 or 3 dimensions.
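Generation and estimation can be sketched as follows, assuming numpy (np.random's multivariate_normal plays the role of Matlab's mvnrnd). The trueMeans vector and seed below are arbitrary stand-ins; the handout supplies its own trueMeans, trueSigmaA, and trueSigmaD:

```python
import numpy as np

rng = np.random.default_rng(539)
d, N = 20, 1000
trueMeans = np.full(d, 5.0)      # arbitrary stand-in means
trueSigma = np.eye(d)            # the identity case (dataset X3)

X = rng.multivariate_normal(trueMeans, trueSigma, size=N)

estimatedMeans = X.mean(axis=0)           # estimate of the mean vector
estimatedSigma = np.cov(X, rowvar=False)  # d-by-d sample covariance
```

With N=1000 both estimates should land close to the true parameters, which is exactly the observation the report should describe.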
- Multivariate Classification:
In this part, you will work with datasets that consist of 2 classes C1 and C2. These datasets will contain N=1800 data instances and d=20 attributes.
- Multivariate Data Generation:
(10 points) In all cases described below, you will use the multivariate datasets generated in part I above as class C1, and will generate class C2
using functionality in the programming language you chose
that generates multivariate normally distributed data.
- Dataset DX (classes have different arbitrary covariance matrices):
- The 1,000 data instances in C1 will be those in the dataset X1 generated above.
- The 800 data instances in C2 will be generated using parameters
trueMeans2 and
trueSigmaA2.
- Dataset SX1 (classes share the same arbitrary covariance matrix):
- The 1,000 data instances in C1 will be those in the dataset X1 generated above.
- The 800 data instances in C2 will be generated using parameters
trueMeans2 and
trueSigmaA.
- Dataset SX2 (classes share the same diagonal covariance matrix):
- The 1,000 data instances in C1 will be those in the dataset X2 generated above.
- The 800 data instances in C2 will be generated using parameters
trueMeans2 and
trueSigmaD.
- Dataset SX3 (classes share the identity covariance matrix):
- The 1,000 data instances in C1 will be those in the dataset X3 generated above.
- The 800 data instances in C2 will be generated using parameters
trueMeans2 and
the d-by-d identity matrix as covariance (Sigma) matrix.
- Multivariate Discriminant Functions:
For each of the 4 datasets under consideration (DX, SX1, SX2, and SX3) do the following:
- (8 points)
Determine which of the formulas in Section 5.5 of the textbook should be
used to define a precise discriminant function gi for each class
Ci of the dataset at hand. Explain your answer.
- (12 points)
Implement each of these 2 discriminant functions gi
as a function in your choice of Matlab/R.
- (4 points)
Based on these 2 functions, implement a "decision" function that receives a
data instance x (which is a 1-by-d vector) as its input and outputs i,
where i is the chosen class for input x.
- (16 points)
Use stratified random sampling to split your dataset into 2 parts:
a training set (with 60% of the data instances) and a validation set (with
the remaining 40% of the data instances).
Test your "decision" function on the validation set.
Report the accuracy and the confusion matrix of your decision function,
as well as the precision and the recall of your decision function
for each of the two classes.
- (20 points)
Devise a good way of plotting the dataset in 2 or 3 dimensions
to see and contrast the shapes of the two classes in the
dataset.
(Bonus points) Add to this plot the decision boundary between
the two classes (that is, the curve defined by P(C1|x) = 0.5).
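For the general case where each class has its own arbitrary covariance matrix (as in dataset DX), the discriminant and decision functions can be sketched as follows, assuming numpy. The class parameters and seed below are arbitrary stand-ins for trueMeans/trueSigmaA and trueMeans2/trueSigmaA2 from the handout:

```python
import numpy as np

rng = np.random.default_rng(539)
d = 20
X1 = rng.multivariate_normal(np.zeros(d), np.eye(d), 1000)         # class C1
X2 = rng.multivariate_normal(np.full(d, 2.0), 2 * np.eye(d), 800)  # class C2
X = np.vstack([X1, X2])
y = np.array([1] * 1000 + [2] * 800)

def fit(Xc, prior):
    # Per-class MLE parameters plus quantities reused by the discriminant
    mu = Xc.mean(axis=0)
    S = np.cov(Xc, rowvar=False)
    _, logdet = np.linalg.slogdet(S)
    return mu, np.linalg.inv(S), logdet, np.log(prior)

cls = [fit(X1, 1000 / 1800), fit(X2, 800 / 1800)]

def g(i, x):
    # Quadratic discriminant (Section 5.5, different covariances)
    mu, Sinv, logdet, logp = cls[i]
    diff = x - mu
    return -0.5 * logdet - 0.5 * diff @ Sinv @ diff + logp

def decide(x):
    # x is a 1-by-d instance; returns the chosen class label
    return 1 if g(0, x) >= g(1, x) else 2

pred = np.array([decide(x) for x in X])
accuracy = (pred == y).mean()
```

For the shared-covariance datasets (SX1, SX2, SX3) the corresponding simpler formulas from Section 5.5 apply instead.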
- Multivariate Regression:
- (10 points)
Create a dataset consisting of d inputs and one output as follows.
For the d inputs, use the multivariate dataset X1 you generated in part I above with
N=1000, trueMeans and trueSigmaA.
For the output, use r = f(x) + ε where
f(x) = 3*average(x) - min(x); that is, the output is three times the average of the d input values minus the minimum input value;
and the noise
ε ~ N(μ=0, σ²=1).
- (5 points)
Use random sampling to split your dataset into 2 parts:
a training set (with 60% of the data instances) and a validation set (with
the remaining 40% of the data instances).
- (10 points)
Create a multivariate linear regression model over the training set
using the regression functionality
provided by the programming language that you chose (Matlab/R).
Report the obtained regression formula in your written report.
- (10 points)
Evaluate the regression model over the validation set.
Report the Sum of Square Errors (SSE), the Root Mean Square Error (RMSE), the Relative Square Error (RSE), and the Coefficient of Determination (R²)
of each regression model over the validation set.
If the programming language you are using reports AIC, BIC, and/or log likelihood values, include these values in your report too.
- (Bonus points)
See if the regression functionality in your chosen language (Matlab/R) allows the use of the Akaike information criterion (AIC)
and/or the Bayesian information criterion (BIC),
instead of minimizing SSE, to guide the construction of the regression model.
If so, repeat part 4 above for AIC and then for BIC.
Which of the three approaches produced better results? Explain.
- Bias and Variance:
- (10 points)
Construct 10 new different datasets D1, ..., D10 each one
consisting of 100 data instances randomly generated with
trueMeans and trueSigmaA.
For the output, use r = f(x) + ε where
f(x) = 3*average(x) - min(x)
and the noise
ε ~ N(μ=0, σ²=1)
as before.
- (10 points)
Fit a multivariate linear regression formula gi to each of these datasets.
- (10 points)
Estimate the bias and the variance using
the formulas on slide 24 of
Chapter 4 slides (see also Section 4.7 of the textbook).
Apply the formulas for bias and variance over the x's in
the dataset X1 (together with the output value) that you constructed in part 1
above (hence N=1000 and M=10).
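The bias/variance estimation above can be sketched as follows, assuming numpy; the zeros vector and identity matrix below are arbitrary stand-ins for trueMeans and trueSigmaA, and seed 539 is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(539)
d, M = 20, 10

def f(X):
    # True regression function: 3 * average of inputs minus the minimum
    return 3 * X.mean(axis=1) - X.min(axis=1)

def sample(n):
    X = rng.multivariate_normal(np.zeros(d), np.eye(d), n)
    return X, f(X) + rng.normal(0, 1, n)   # add N(0,1) noise

# Fit a linear model g_i (least squares with intercept) to each of the
# M=10 small datasets D_1..D_10 of 100 instances each.
models = []
for _ in range(M):
    X, r = sample(100)
    A = np.column_stack([np.ones(len(X)), X])
    w, *_ = np.linalg.lstsq(A, r, rcond=None)
    models.append(w)

# Evaluate bias^2 and variance over the N=1000 points of dataset X1
# (regenerated here for self-containment).
X1, _ = sample(1000)
A1 = np.column_stack([np.ones(len(X1)), X1])
G = np.array([A1 @ w for w in models])   # M-by-N matrix of predictions
gbar = G.mean(axis=0)                    # the average model g-bar
bias2 = ((gbar - f(X1)) ** 2).mean()     # estimated squared bias
variance = ((G - gbar) ** 2).mean()      # estimated variance
```

Since f contains a min() term, the linear family cannot match it exactly, so a nonzero bias is expected alongside the sampling variance.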
Section C: Dimensionality Reduction (115 points)
Page limit: 5 pages for this section.
Dataset: For this part of the project, you will use the
Communities and Crime Data Set available at the
UCI Machine Learning Repository.
Carefully read the description provided for this dataset and
familiarize yourself with the dataset as much as possible.
(5 points) Make the following modifications to the dataset:
- Remove the "communityname" attribute (string).
- Replace each missing attribute value in the dataset (denoted by "?") with
the attribute's mean.
- Use random sampling to split your dataset into 2 parts:
a training set (with 60% of the data instances) and a validation set (with
the remaining 40% of the data instances).
Let's call this training set TS and this validation set VS.
Use this modified dataset in all the experiments below.
Note: Remember that feature selection and feature extraction
methods should be applied to the input attributes only, not to the
output (target) attribute.
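The "?"-to-mean imputation step can be sketched as follows, assuming numpy and a tiny hypothetical table standing in for the crime data (real code would read the UCI file and also drop the "communityname" column first):

```python
import numpy as np

# Toy stand-in for the dataset: "?" marks missing values, as in the UCI file.
raw = [["0.2", "?",   "0.5"],
       ["0.4", "0.1", "?"  ],
       ["?",   "0.3", "0.7"]]

# Parse, mapping "?" to NaN so numeric operations can skip it.
X = np.array([[np.nan if v == "?" else float(v) for v in row] for row in raw])

col_means = np.nanmean(X, axis=0)   # per-attribute mean, ignoring "?"
rows, cols = np.where(np.isnan(X))
X[rows, cols] = col_means[cols]     # replace each "?" by its column's mean
```

The same per-column means computed here should be reused consistently before the train/validation split.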
- Baseline Regression Model:
- ** Fitting a linear model:
(5 points) Create a multivariate linear regression model over the training set TS
using the regression functionality
provided by the programming language that you chose (Matlab/R).
Report the obtained regression formula in your written report
and also report the time taken to construct this regression model
(for this use timing functionality provided in Matlab/R).
- ** Evaluating the linear model:
(5 points) Evaluate the regression model over the validation set VS.
Report the Sum of Square Errors (SSE), the Root Mean Square Error (RMSE), the Relative Square Error (RSE), and the Coefficient of Determination (R²)
of each regression model over the validation set.
- Feature Selection: Sequential Subset Selection
Look for a function (or functions) provided by the language
you chose (Matlab or R) for doing feature selection.
Try to find a function similar to the sequential subset selection
(either forward or backward) described in Section 6.2 of the textbook.
- (5 points)
Include the name(s) of the function(s) in the report.
Briefly explain what the function does.
- (5 points)
Apply the function to the training data TS.
Include in your report the names of the attributes selected by this function.
- (10 points)
Repeat steps 1 (** Fitting a linear model) and
2 (** Evaluating the linear model) described above,
but now using just the selected subset of attributes
constructed above.
Remember that you need to modify the validation dataset VS so that
it includes just
the same exact subset of attributes selected from the training set.
- Feature Selection: Ranking Attributes
Look for a function (or functions) provided by the language
you chose (Matlab or R) for ranking attributes following the "Relief"
approach.
- (5 points)
Include the name(s) of the function(s) in the report.
Briefly explain what the function does.
- (5 points)
Apply the function to the training data TS.
Include in your report the names of the top 50 attributes
selected by this function in order of importance.
- (10 points)
Repeat steps 1 (** Fitting a linear model) and
2 (** Evaluating the linear model) described above,
but now using just the selected 50 attributes above.
Remember that you need to modify the validation dataset VS so that
it includes just the same exact 50 attributes selected from the training set.
- Feature Extraction: Principal Components Analysis
Look for a function (or functions) provided by the language
you chose (Matlab or R) for performing PCA.
- (5 points)
Include the name(s) of the function(s) in the report.
- (5 points)
Apply the function to the training data TS.
Describe in your report the results of PCA.
How many components were constructed? 128 or fewer?
What is the minimum number of components needed to capture at least 90%
of the data variance? Explain.
- (10 points)
Repeat steps 1 (** Fitting a linear model) and
2 (** Evaluating the linear model) described above,
but now using just the
principal components needed to explain at least 90% of the data variance.
Remember that you need to transform the validation dataset VS using
the same exact transformation obtained from the training set.
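The 90%-variance cutoff and the reuse of the training-set transformation can be sketched as follows, assuming numpy (PCA via SVD of the centred data). The synthetic data below, with 3 dominant directions, is an arbitrary stand-in for TS, and seed 539 is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(539)
n, d = 300, 10
# Rank-3 signal plus small noise, standing in for the real training set TS
X = rng.normal(size=(n, 3)) @ rng.normal(size=(3, d)) \
    + 0.1 * rng.normal(size=(n, d))

mean = X.mean(axis=0)
Xc = X - mean                                # centre before PCA
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s ** 2 / (s ** 2).sum()          # variance ratio per component

# Smallest k whose cumulative explained variance reaches 90%
k = int(np.searchsorted(np.cumsum(explained), 0.90) + 1)

# Project TS onto the first k components; VS must be transformed with
# the SAME mean and the SAME Vt, never refit on the validation data.
Z_train = Xc @ Vt[:k].T
```

Built-in routines (Matlab's pca, R's prcomp) return these same loadings and explained-variance ratios directly.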
- Feature Extraction: Factor Analysis (FA)
Look for a function (or functions) provided by the language
you chose (Matlab or R) for performing factor analysis.
- (5 points)
Include the name(s) of the function(s) in the report.
- (5 points)
Apply the function to the training data TS.
Describe in your report the results you obtained from factor analysis.
- (10 points)
Repeat steps 1 (** Fitting a linear model) and
2 (** Evaluating the linear model) described above,
but now using the obtained factors.
Remember that you need to transform the validation dataset VS using
the same exact transformation obtained from the training set.
- Comparison of Results
(10 points) Create a table summarizing the results of the dimensionality reduction
experiments above.
This table should contain a column for each of the five methods used
(Baseline, Sequential subset selection, Relief, PCA, and FA).
Rows in the table should include the following:
- number of attributes used to construct the linear regression model,
- number of attributes appearing in the linear regression model,
- time taken constructing the linear regression model,
- Sum of Square Errors (SSE),
- Root Mean Square Error (RMSE),
- Relative Square Error (RSE), and
- Coefficient of Determination (R²)
(10 points) Briefly analyze the results described in this table.