THOROUGHLY READ AND FOLLOW THE
PROJECT GUIDELINES.
These guidelines contain detailed information about how to structure your
project, how to prepare your written summary, and how to study for the test.
*** You must use the
Project 3 Template provided here for your written report.
Do NOT change the structure of the report, do NOT exceed the page limits stated in the template, and do NOT decrease the font size ***.
(If you prefer not to use Word, you can copy and paste this format in a
different editor as long as you respect the stated page structure and
page limit.)
- Data Mining Technique(s):
Run all project experiments in Python,
using the following techniques:
- Pre-processing Techniques:
Consider pre-processing techniques
(feature selection, feature creation, dimensionality reduction, noise reduction,
attribute transformations, ...)
discussed in class, the textbook and used in project 1.
- Determine which pre-processing techniques are necessary to pre-process the given dataset
before you can construct predictive (either classification or regression) models from this data.
The less pre-processing at first, the better.
List the necessary pre-processing in your report.
- Determine which pre-processing techniques would be useful (though not necessary)
for this dataset in order to construct better prediction models.
Do this by running experiments with and without applying these pre-processing
techniques and comparing how they affect the performance of the
prediction models.
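One way to run such a with/without comparison is to score the same network under cross-validation with and without a candidate pre-processing step. This is only a sketch: standardization is used as a stand-in pre-processing technique, and a built-in scikit-learn dataset stands in for the project dataset.

```python
# Sketch: compare an MLP with and without standardization
# (a stand-in pre-processing step; swap in your own technique and data).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # stand-in dataset

raw = MLPClassifier(max_iter=1000, random_state=0)
scaled = make_pipeline(StandardScaler(),
                       MLPClassifier(max_iter=1000, random_state=0))

score_raw = cross_val_score(raw, X, y, cv=5).mean()
score_scaled = cross_val_score(scaled, X, y, cv=5).mean()
print(f"without scaling: {score_raw:.3f}  with scaling: {score_scaled:.3f}")
```

Putting the pre-processing step inside a pipeline ensures it is re-fit on each training fold, so the comparison is not biased by information leaking from the test folds.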
- Classification Techniques:
- Artificial Neural Networks.
- Regression Techniques:
- Artificial Neural Networks.
- Python Packages for Artificial Neural Networks:
You can use any of the following Python libraries in this project.
If you want to use any other Python libraries in addition to the ones listed here,
please check with me in advance.
- Dataset:
- Students taking CS548:
Use the
Census-Income (KDD) Data Set
(use the census-income.data.gz data file with k-fold cross-validation,
so no need to use the census-income.test.gz data file).
This dataset is available at the
UCI Machine Learning Repository.
- For classification tasks:
Use income (<$50K or >$50K) as the target attribute.
- For regression tasks:
Use age as the regression target.
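A minimal k-fold setup for this kind of single-file dataset might look as follows. A tiny synthetic frame stands in for census-income.data.gz here (load the real file with something like pd.read_csv("census-income.data.gz", header=None)), and the target is assumed to sit in the last column; check the dataset's documentation for the actual layout.

```python
# K-fold sketch; the toy frame below is a stand-in for the real data file,
# and the "income" column stands in for the actual target attribute.
import pandas as pd
from sklearn.model_selection import StratifiedKFold

df = pd.DataFrame({"age": range(20),
                   "hours": [35 + i % 10 for i in range(20)],
                   "income": ["<50K", ">50K"] * 10})
X, y = df.drop(columns="income"), df["income"]

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = [len(test_idx) for _, test_idx in skf.split(X, y)]
print(fold_sizes)  # each fold holds out an equal share of the rows
```

Because every instance serves as test data in exactly one fold, no separate held-out test file is needed, which is why census-income.test.gz can be skipped.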
- Students taking BCB503/CS583:
Use the
Mice Protein Expression Data Set.
This dataset is available at the
UCI Machine Learning Repository.
- For classification tasks, use attribute Class
(that is, attribute #82) as the target attribute.
The large number of values of this attribute (8) may make
the classification task hard. Run preliminary experiments
to decide whether to use this attribute as the target or
to convert it into a binary attribute
(e.g., stimulated to learn vs. not stimulated to learn;
or injected with saline vs. injected with memantine)
or another reasonable option.
- For regression tasks:
Pick a numeric/continuous attribute of your choice
as the regression target.
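If you do decide to binarize the Class attribute, the conversion can be a one-liner. The sketch below assumes the 8 class labels follow the dataset's genotype-behavior-treatment naming pattern (e.g. "c-CS-m", with the middle token marking stimulated-to-learn vs. not); verify the actual label format against the dataset documentation before relying on it.

```python
# Hypothetical binarization of the 8-valued Class attribute into
# stimulated-to-learn vs. not, assuming labels like "c-CS-m" where the
# middle token is CS (stimulated) or SC (not stimulated).
import pandas as pd

classes = pd.Series(["c-CS-m", "c-CS-s", "c-SC-m", "c-SC-s",
                     "t-CS-m", "t-CS-s", "t-SC-m", "t-SC-s"])
binary = (classes.str.split("-").str[1]
          .map({"CS": "stimulated", "SC": "not-stimulated"}))
print(binary.value_counts().to_dict())
```

The same split/map pattern works for the saline-vs-memantine grouping by taking the last token instead of the middle one.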
- Note: If you prefer to pick another biological/biomedical dataset for this project,
please discuss your proposed dataset with me.
- Performance Metric(s):
- Use the following metrics or evaluation methods:
- For classification tasks:
use the loss metric provided by scikit-learn.
Explain in your report what this loss function is.
If possible use also:
classification accuracy, precision, recall, ROC Area, and confusion matrices.
- For regression tasks:
use the loss metric provided by scikit-learn.
Explain in your report what this loss function is.
If possible use also:
correlation coefficient AND any subset of the following error metrics
that you find appropriate: mean-squared error, root mean-squared error,
mean absolute error, relative squared error, root relative squared error,
and relative absolute error. An important part
of the data mining evaluation in this project is to try to make sense
of these performance metrics and to become familiar with them.
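For the classification metrics listed above, scikit-learn provides ready-made functions; a small sketch on toy predictions (the probabilities below are made up for illustration):

```python
# The classification metrics named above, computed on toy predictions.
from sklearn.metrics import (accuracy_score, confusion_matrix, log_loss,
                             precision_score, recall_score, roc_auc_score)

y_true = [0, 0, 1, 1, 1, 0]
y_prob = [0.2, 0.6, 0.8, 0.7, 0.4, 0.1]       # predicted P(class 1)
y_pred = [int(p >= 0.5) for p in y_prob]      # threshold at 0.5

print("log loss :", log_loss(y_true, y_prob))  # cross-entropy, the loss
                                               # MLPClassifier minimizes
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("ROC AUC  :", roc_auc_score(y_true, y_prob))
print(confusion_matrix(y_true, y_pred))
```

Note that log loss and ROC AUC take the predicted probabilities, while accuracy, precision, recall, and the confusion matrix take the thresholded class predictions; mixing these up is a common source of errors.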
- size of the network: number of hidden layers and number of hidden nodes in each layer.
- time it took to train the network.
- Compare each accuracy/error you obtained against those of benchmarking techniques
such as ZeroR, OneR, decision trees, and regression/model trees over the same (sub-)set of data instances
you used in the corresponding experiment.
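ZeroR has no class of its own in scikit-learn, but DummyClassifier(strategy="most_frequent") behaves the same way (and DummyRegressor(strategy="mean") is the regression analogue). A sketch of the baseline comparison, with a built-in dataset standing in for the project data:

```python
# ZeroR baseline via DummyClassifier vs. a decision tree, scored with the
# same cross-validation splits (cv=5) on the same data.
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # stand-in dataset
zero_r = cross_val_score(DummyClassifier(strategy="most_frequent"),
                         X, y, cv=5).mean()
tree = cross_val_score(DecisionTreeClassifier(random_state=0),
                       X, y, cv=5).mean()
print(f"ZeroR: {zero_r:.3f}  decision tree: {tree:.3f}")
```

Any model that cannot beat the ZeroR score is effectively learning nothing from the attributes, which makes it a useful sanity check before interpreting the network's results.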
- Remember to experiment with varying parameters and hyperparameters of your network:
That is, experiment with varying the number of hidden layers,
number of nodes in each of the hidden layers,
activation function,
"solver" (i.e., training procedure: you are required to use "sgd" in at least some experiments,
and you can experiment with other solvers too),
learning rate,
momentum,
early stopping, ...
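One way to organize this sweep is a small grid over the MLP settings listed above, scored with cross-validation. The particular grid values below are illustrative choices, not required settings, and a built-in dataset stands in for the project data:

```python
# Sketch: grid search over MLP hyperparameters (hidden layers, activation,
# solver, learning rate, momentum, early stopping); "sgd" is included since
# it is required in at least some experiments.
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)  # stand-in dataset
param_grid = {
    "mlpclassifier__hidden_layer_sizes": [(25,), (50, 25)],
    "mlpclassifier__activation": ["relu", "logistic"],
    "mlpclassifier__solver": ["sgd"],
    "mlpclassifier__learning_rate_init": [0.01, 0.1],
    "mlpclassifier__momentum": [0.9],
    "mlpclassifier__early_stopping": [True],
}
pipe = make_pipeline(StandardScaler(),
                     MLPClassifier(max_iter=300, random_state=0))
search = GridSearchCV(pipe, param_grid, cv=3, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

GridSearchCV also records fit times per configuration in search.cv_results_, which covers the required report on training time alongside network size.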
- Advanced Topic(s):
Investigate in depth (experimentally, theoretically, or both) a topic of your
choice that is related to deep learning
and that was not covered already in this project, class lectures,
or the textbook.
This deep learning topic might be something that was described or mentioned
briefly in the textbook or in class; comes from your own research;
or is related to your interests.
Remember that you need to investigate your advanced topic in depth,
at a "graduate level".