WPI Worcester Polytechnic Institute

Computer Science Department
------------------------------------------

CS 548 KNOWLEDGE DISCOVERY AND DATA MINING - Fall 2018  
Project 3: Artificial Neural Networks and Deep Learning

By Michael Sokolovsky, Ahmedul Kabir and Prof. Carolina Ruiz 

DUE DATE: November 1st, 2018.
------------------------------------------

Project Assignment:

  1. Study Section 4.7 Artificial Neural Networks and Section 4.8 Deep Learning of the textbook in great detail.

  2. Study all the materials posted on the course Lecture Notes page, especially those marked with "**".

  3. Work in groups of 3 students.

  4. Project Description and Dataset

  5. Project Requirements

    Projects include turning in a written report, code for training and visualization, a trained model, and presentation slides. Details of what to include are listed below:

    1. Written Report:

      Due time: Hand in by 3:59 pm right before the beginning of class.

      • Set of Experiments Performed: Page limit: 1.5 pages
        Include a section describing the set of experiments that you performed, what structures you experimented with (i.e., number of layers, number of neurons in each layer), what parameters you varied (e.g., number of epochs of training, batch size, weight initialization scheme, activation function, and any other parameter values), and what accuracies you obtained in each of these experiments.

      • Procedure Description: Page limit: 1 page
        Include a section describing in more detail the most accurate model you were able to obtain: the structure of your model, including number of layers, number of neurons in each layer, weight initialization scheme, activation function, number of epochs used for training, and batch size used for training.

      • Plot: Page limit: 0.5 pages
        Include a plot showing how training accuracy and validation accuracy change over time during the training of your best model. That is, the horizontal axis of your plot should be the number of training epochs and the vertical axis should be training and validation accuracy.

      • Model Performance and Confusion Matrix: Page limit: 1 page
        Include a confusion matrix showing results of your best model reported on the test set. The matrix should be a 10x10 grid showing which categories images were classified as. Use your confusion matrix to additionally report precision and recall for each of the 10 classes, as well as overall accuracy of your model.

      • Visualization: Page limit: 1 page
        Include visualizations of three images that were misclassified by your best model and any observations about why you think these images were misclassified. You will have to create or use a visualization program that takes a 28x28 matrix as input and translates it into a black-and-white image.

      • Advanced Topic: Page limit: 1 page
        Include the description of your advanced topic (see instructions in bullet 7 below). It should contain 3 parts:
        • List of sources/books/papers used for this topic (include URLs if available).
        • In your own words, provide an in-depth, yet concise, description of your chosen topic. Make sure to cover all relevant data mining aspects of your topic. Your description here should be in-depth and at the graduate level.
        • How does this topic relate to neural networks, deep learning, and the material covered in this course?

    2. Code:

      Due time: Submit your code files on Canvas by 2:00 pm.

      • Model Code:

        Turn in your preprocessing, model creation, model training, plotting and confusion matrix code.

    3. Model:

      Due time: Submit your trained model file on Canvas by 2:00 pm.

      • Copy of Trained Model:

        Turn in a copy of your best model saved as `trained_model.proj3'. Please use Keras' built-in methods for saving your model (e.g., model.save()).

    4. Slides:

      Due time: Submit your project slides on Canvas by 2:00 pm.
      Turn in slides summarizing your work on the project and what you learned. Each team will have 4 minutes to present. Make sure to cover your Advanced Topic during your presentation. Make sure that each team member has an equal chance to present.

  6. Project Preparatory Tasks and Guidelines:

    Below are important guidelines to follow when implementing the project. A model template is provided for you on this project webpage, and these guidelines follow the structure of the template.

    1. Installing Software and Dependencies:

      template.py is written with the Keras API in a Python3 script. You will use this template to build and train a model. To do so, you will need to implement the project in Python3 and install Keras and its dependencies. Please make sure you have a working version of Python3 and Keras as soon as possible, as these programs are necessary for completing the project.

    2. Downloading Data:

      Raw data is provided here:
      • Images are provided for you in the images.npy file, which contains 6500 images from the MNIST dataset.
      • The file labels.npy contains the 6500 corresponding labels for the image data.

    3. Preprocessing Data:

      All data is provided as NumPy .npy files. To load and preprocess data, use Python's NumPy package.

      Image data is provided as 28x28 matrices of integer pixel values. However, the input to the network will be a flat vector of length 28*28 = 784. You will have to flatten each matrix to be a vector, as illustrated by the toy example below:

      [Figure: flattening a 28x28 matrix into a length-784 vector]

      The label for each image is provided as an integer in the range of 0 to 9. However, the output of the network should be structured as a "one-hot vector" of length 10 encoded as follows:

      [Figure: one-hot vector encoding of the digit labels]

      To preprocess data, use NumPy functions like reshape for changing matrices into vectors. You can also use Keras' to_categorical function for converting label numbers into one-hot encodings.
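      For concreteness, a minimal preprocessing sketch is shown below. It assumes the standalone keras package (with tf.keras the import prefixes differ), and the variable names used here (images, labels, x, y) are placeholders rather than names from template.py:

        import numpy as np
        from keras.utils import to_categorical

        # Load the raw data files named in the project description.
        images = np.load('images.npy')   # shape: (6500, 28, 28)
        labels = np.load('labels.npy')   # shape: (6500,)

        # Flatten each 28x28 image into a length-784 vector.
        x = images.reshape(len(images), 28 * 28)

        # Convert each integer label (0-9) into a one-hot vector of length 10.
        y = to_categorical(labels, num_classes=10)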

      After preprocessing, you will need to take your data and randomly split it into Training, Validation, and Test Sets. In order to create the three sets of data, use stratified sampling, so that each set maintains the same relative frequency of the ten classes.

      You are given 6500 images and labels. The training set should contain ~60% of the data, the validation set should contain ~15% of the data, and the test set should contain ~25% of the data.

      Example Stratified Sampling Procedure:
      • Take data and separate it into 10 classes, one for each digit
      • From each class:
        • take 60% at random and put into the Training Set,
        • take 15% at random and put into the Validation Set,
        • take the remaining 25% and put into the Test Set
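
      One possible way to produce such a stratified split is sketched below. It reuses the placeholder arrays x, y, and labels from the preprocessing sketch above and assumes scikit-learn is installed; you may equally well implement the per-class sampling directly with NumPy:

        import numpy as np
        from sklearn.model_selection import train_test_split

        # Split off the ~25% test set, stratifying on the integer labels.
        x_rest, x_test, y_rest, y_test = train_test_split(
            x, y, test_size=0.25, stratify=labels, random_state=0)

        # Split the remaining 75% into ~60% train / ~15% validation:
        # 0.15 / 0.75 = 0.2 of the remaining data goes to validation.
        labels_rest = np.argmax(y_rest, axis=1)
        x_train, x_val, y_train, y_val = train_test_split(
            x_rest, y_rest, test_size=0.2, stratify=labels_rest, random_state=0)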

    4. Building a Model:

      [Figure: model definition code]

      In Keras, models are instances of the class Sequential. A Keras model template, template.py, written with the Sequential Model API, is provided and can be used as a starting point for building your model. The template includes a sample first input layer and output layer. You must limit yourself to "Dense" layers, which are Keras' version of traditional fully connected neural network layers. This portion of the project will involve experimentation.

      Good guidelines for model creation are:

      • Initialize weights randomly for every layer; try different initialization schemes.
      • Experiment with ReLU activation units, as well as SELU and tanh.
      • Experiment with number of layers and number of neurons in each layer, including the first layer.

      Leave the final layer as it appears in the template with a softmax activation unit.
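
      As an illustration only (the layer sizes, initializer, and activations below are arbitrary choices to experiment with, not recommendations), a model built this way might look like:

        from keras.models import Sequential
        from keras.layers import Dense

        model = Sequential()
        # Hidden layers: sizes, initializer, and activation are all up for experimentation.
        model.add(Dense(128, input_shape=(784,),
                        kernel_initializer='he_normal', activation='relu'))
        model.add(Dense(64, kernel_initializer='he_normal', activation='relu'))
        # Final layer: 10 outputs with softmax, as in the template.
        model.add(Dense(10, activation='softmax'))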

    5. Compiling a Model:

      [Figure: model compilation code]

      Prior to training a model, you must specify the model's loss function and gradient descent method. Please use standard categorical cross-entropy and stochastic gradient descent (`sgd') when compiling your model (as provided in the template).
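
      In code, this compilation step corresponds to something like the following (the metrics argument is optional but convenient for tracking accuracy during training):

        # Categorical cross-entropy loss with stochastic gradient descent.
        model.compile(loss='categorical_crossentropy',
                      optimizer='sgd',
                      metrics=['accuracy'])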

    6. Training a Model:

      [Figure: model training code]

      You have the option of changing how many epochs to train your model for and how large your mini-batch size is. Experiment to see what works best. Also remember to include your validation data in the fit() method.
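
      A training call might look like the sketch below; the epoch count and batch size here are placeholders to be tuned, not suggested values:

        # fit() returns a History object used later for plotting.
        history = model.fit(x_train, y_train,
                            validation_data=(x_val, y_val),
                            epochs=50,
                            batch_size=32)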

    7. Reporting your Results:

      [Figure: code for printing the training history]

      fit() returns data about your training experiment. In template.py, this is stored in the "history" variable. Use this information to construct your graph showing how validation and training accuracy change after every epoch of training.
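
      One way to turn the history into the required plot is sketched below (depending on your Keras version, the dictionary keys may be 'acc'/'val_acc' or 'accuracy'/'val_accuracy'):

        import matplotlib.pyplot as plt

        # Training and validation accuracy per epoch.
        plt.plot(history.history['acc'], label='training accuracy')
        plt.plot(history.history['val_acc'], label='validation accuracy')
        plt.xlabel('epoch')
        plt.ylabel('accuracy')
        plt.legend()
        plt.savefig('accuracy_plot.png')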

      [Figure: code calling predict() on the model]

      Use the predict() method on your model to determine which labels your model predicts on the test set. Use these and the true labels to construct your confusion matrix, like the toy example below, although you do not need to create a fancy visualization of the confusion matrix. Your matrix should have 10 rows and 10 columns.

      [Figure: toy example of a confusion matrix]
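
      For reference, one way to build the matrix and the per-class metrics with plain NumPy is sketched below (x_test and y_test are the placeholder names used in the splitting sketch above):

        import numpy as np

        # Predicted class = index of the largest softmax output;
        # true class = index of the 1 in the one-hot label.
        pred = np.argmax(model.predict(x_test), axis=1)
        true = np.argmax(y_test, axis=1)

        # Rows = true class, columns = predicted class.
        conf = np.zeros((10, 10), dtype=int)
        for t, p in zip(true, pred):
            conf[t, p] += 1

        # Per-class precision and recall, and overall accuracy.
        precision = np.diag(conf) / conf.sum(axis=0)
        recall = np.diag(conf) / conf.sum(axis=1)
        accuracy = np.trace(conf) / conf.sum()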

  7. Advanced Topic(s):

    Investigate in depth (experimentally, theoretically, or both) a topic of your choice that is related to deep learning and that was not covered already in this project, class lectures, or the textbook. This deep learning related topic might be something that was described or mentioned briefly in the textbook or in class; comes from your own research; is related to your interests; is an idea from a research paper that you find intriguing; or any other deep learning related topic.
    Remember that you need to investigate your advanced topic in depth, at a "graduate level".

  8. Grading Rubric

    1. Report:

      • Set of Experiments Performed: 20 pts
      • Model and Training Procedure Description: 10 pts
      • Plot: 10 pts
      • Model Performance and Confusion Matrix: 10 pts
      • Visualization: 10 pts

    2. Code:

        Model Code: 30 pts

    3. Model:

        Copy of Trained Model: 10 pts

    4. Advanced Topic:

        Selection and Description of your Advanced Topic: 20 pts

    Total Points: 120 pts