WPI Worcester Polytechnic Institute

Computer Science Department
------------------------------------------

CS 548 KNOWLEDGE DISCOVERY AND DATA MINING - Fall 2017  
Project 3: Association Rule Mining, Text Mining, and Sequence Mining

PROF. CAROLINA RUIZ 

DUE DATE: Thursday Nov. 2nd, 2017.
------------------------------------------

Project Assignment

  1. Project Topics: This is a text mining project using association rules. In this project you will gain experience with the following topics:

  2. THOROUGHLY READ AND FOLLOW THE PROJECT GUIDELINES. These guidelines contain detailed information about how to structure your project, how to prepare your written summary, and how to study for the test.

    *** You must use the Project 3 Template provided for your written report. (If you prefer not to use Word, you can copy and paste this format into a different editor as long as you respect the stated page structure and page limit.)

  3. Dataset: Each group can select its own dataset following all of the requirements below:
    1. The dataset must be a text dataset. It can come from either an existing text corpus, or text data that you collect yourselves from the web (e.g., Twitter). Python provides APIs to interface with Twitter and other text corpora.
    2. The dataset must contain at least 500 documents, with each document containing at least 100 words. Exceptions to this requirement must be approved by the professor in advance.
    3. The dataset must be related to your own interests and you must be familiar with the domain of the dataset. In particular, you must be able to state meaningful guiding questions and interpret the association rules that you will obtain from your dataset.
    4. BCB503 students: Your dataset must be related to bioinformatics, computational biology, and/or medicine. For example, you can download abstracts and/or articles from PubMed or any other text repository.
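
As a quick sanity check of the size requirement above, the following minimal Python sketch counts documents and words. The function name `check_corpus` and the whitespace-based word count are assumptions made for illustration; your own tokenization may count words differently.

```python
def check_corpus(documents, min_docs=500, min_words=100):
    """Return (ok, short_docs) for a list of document strings.

    ok is True when there are at least min_docs documents and every
    document has at least min_words words. short_docs lists the indices
    of documents that are too short.
    """
    # Assumption: a "word" is any whitespace-separated token.
    short_docs = [i for i, doc in enumerate(documents)
                  if len(doc.split()) < min_words]
    ok = len(documents) >= min_docs and not short_docs
    return ok, short_docs
```

Calling `check_corpus(my_documents)` on your list of document strings tells you which documents, if any, fall below the 100-word threshold before you start preprocessing.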

  4. Data Mining Technique(s): You will run experiments in Weka and in Python using the following techniques:

  5. Evaluation:

  6. General Comments: In contrast with our previous classification and regression projects, we won't use any evaluation protocol (e.g., 10-fold cross validation) for the association analysis of this project, as we're not using the rules for prediction. Focus instead on experimenting with different ways of preprocessing the data, varying the parameters of the Apriori algorithm, and providing your own method to evaluate the resulting collections of association rules.
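
To make the interplay between preprocessing and the Apriori parameters concrete, here is a minimal, self-contained Python sketch. It is not the Weka implementation and not a required approach: the tiny stopword list, the toy documents, and all function names are hypothetical, and each document is reduced to the set of terms it contains.

```python
import re
from itertools import combinations

# Hypothetical, deliberately tiny stopword list (a preprocessing choice).
STOPWORDS = {"the", "a", "of", "and", "is", "in", "to"}

def to_transaction(text):
    """Turn one document into a set of lowercase terms without stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return frozenset(t for t in tokens if t not in STOPWORDS)

def support(itemset, transactions):
    """Fraction of transactions that contain every item of itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def apriori(transactions, min_support):
    """Level-wise search; returns {frequent itemset: support}."""
    current = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    k = 1
    while current:
        level = {}
        for cand in current:
            s = support(cand, transactions)
            if s >= min_support:
                level[cand] = s
        frequent.update(level)
        k += 1
        # Join step: unions of frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        current = {c for c in candidates
                   if all(frozenset(s) in level
                          for s in combinations(c, k - 1))}
    return frequent

def rules(frequent, min_confidence):
    """Return (lhs, rhs, support, confidence) tuples from frequent itemsets."""
    out = []
    for itemset, supp in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(itemset, r):
                lhs = frozenset(lhs)
                conf = supp / frequent[lhs]  # subsets of frequent sets are frequent
                if conf >= min_confidence:
                    out.append((lhs, itemset - lhs, supp, conf))
    return out

# Toy corpus, for illustration only.
docs = ["Gene expression in cancer", "gene expression and mutation",
        "cancer mutation", "gene expression"]
transactions = [to_transaction(d) for d in docs]
frequent = apriori(transactions, min_support=0.5)
found = rules(frequent, min_confidence=0.9)
for lhs, rhs, supp, conf in found:
    print(sorted(lhs), "=>", sorted(rhs), supp, conf)
```

Changing the stopword list, `min_support`, or `min_confidence` changes which rules survive, which is the kind of variation the experiments are meant to explore.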

  7. Advanced Topic: Sequence Mining using Association Rules. Investigate in depth (experimentally, theoretically, or both) how to use an association-rules-like approach for sequence mining. For this, start by studying in detail Section 7.4 of the textbook. Then look for additional papers or references online. Provide in your report a summary of what you have learned in your investigation of this topic, and of your results if you run any experiments.
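
As a starting point for experiments on this topic, the sketch below shows one simplified, Apriori-style way to find frequent sequential patterns in Python. It is an illustration under strong assumptions, not the textbook's algorithm: each sequence element is a single item, a pattern matches when it occurs as a (not necessarily contiguous) subsequence, there are no timing constraints, and the function names and toy data are hypothetical.

```python
def is_subsequence(pattern, sequence):
    """True if the items of pattern appear in sequence in the same order."""
    it = iter(sequence)
    return all(item in it for item in pattern)

def seq_support(pattern, sequences):
    """Fraction of data sequences that contain pattern as a subsequence."""
    return sum(is_subsequence(pattern, s) for s in sequences) / len(sequences)

def frequent_sequences(sequences, min_support, max_len=3):
    """Level-wise search; returns {frequent pattern (tuple): support}."""
    items = sorted({x for s in sequences for x in s})
    level = {}
    for x in items:
        s = seq_support((x,), sequences)
        if s >= min_support:
            level[(x,)] = s
    frequent = dict(level)
    frequent_items = [p[0] for p in level]
    k = 1
    while level and k < max_len:
        k += 1
        next_level = {}
        for prefix in level:
            for x in frequent_items:
                cand = prefix + (x,)
                # Apriori-style pruning: the length-(k-1) suffix must be frequent.
                if cand[1:] not in level:
                    continue
                s = seq_support(cand, sequences)
                if s >= min_support:
                    next_level[cand] = s
        frequent.update(next_level)
        level = next_level
    return frequent

# Toy sequences, for illustration only.
sequences = [("login", "search", "buy"), ("login", "buy"),
             ("search", "login", "buy"), ("login", "search")]
patterns = frequent_sequences(sequences, min_support=0.5)
for pattern, supp in sorted(patterns.items()):
    print(pattern, supp)
```

Unlike itemsets, the patterns here are ordered, so `("login", "search")` and `("search", "login")` are counted separately; this is the main difference from the association-rule setting to discuss in your report.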