WPI Worcester Polytechnic Institute

Computer Science Department
------------------------------------------

CS 4445 Data Mining and Knowledge Discovery in Databases - A Term 2004 
Homework and Project 2: Data Pre-processing, Mining, and Evaluation of Rules

PROF. CAROLINA RUIZ 

DUE DATE: Part I (the individual homework assignment) is due on Tuesday, September 14th at 5:00 pm and Parts II.1 and II.2 (the individual+group project) are due on Sunday, September 26 2004 at 5 pm. 
------------------------------------------


HOMEWORK AND PROJECT DESCRIPTION

The purpose of this project is multi-fold:

Readings: Read in great detail Sections 4.1, 4.4, 4.5 and 6.2 from your textbook.

INDIVIDUAL HOMEWORK ASSIGNMENT

See solutions to the classification rules part and the association rules part of this HW by Min Song.

Consider the following subset of the Mushroom dataset.


@relation sample-mushroom

@attribute cap-surface {fibrous,grooves,scaly,smooth}
@attribute bruises? {bruises,no}
@attribute gill-size {broad,narrow}
@attribute habitat {grasses,leaves,meadows,paths,urban,waste,woods}
@attribute poisonousness {edible,poisonous}

@data

scaly,bruises,broad,waste,edible
smooth,no,narrow,woods,poisonous
fibrous,no,broad,grasses,edible
scaly,bruises,broad,woods,edible
scaly,no,narrow,leaves,poisonous
scaly,bruises,broad,paths,edible
smooth,no,broad,leaves,edible
scaly,no,broad,woods,poisonous
scaly,no,narrow,woods,poisonous
smooth,no,broad,leaves,edible
fibrous,no,broad,paths,poisonous
fibrous,bruises,broad,woods,edible
smooth,bruises,narrow,grasses,poisonous
fibrous,no,broad,paths,poisonous
smooth,bruises,narrow,grasses,poisonous
scaly,no,narrow,leaves,poisonous
scaly,no,narrow,woods,poisonous
fibrous,no,broad,grasses,edible
scaly,bruises,broad,woods,edible
fibrous,no,broad,grasses,edible

  1. (50 points) Construct "by hand" all the perfect classification rules that the Prism algorithm would output for this dataset, using the ratio p/t to rank the attribute-values that are candidates for inclusion in a rule. Your written solutions should show all your work: the list of all attribute-values that were candidates during each stage of the rule construction process, and which ones were selected.

  2. (50 points) Mine association rules by hand from this dataset by faithfully following the Apriori algorithm with minimal support = 25% and minimal confidence = 90%. That is, start by generating candidate itemsets and frequent itemsets level by level and, after all frequent itemsets have been generated, produce from them all the rules with confidence greater than or equal to the min. confidence. SHOW IN DETAIL ALL THE STEPS OF THE PROCESS.
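
The p/t ranking used in question 1 can be sketched in Python as a way to check hand computations. This is only an illustration of a single stage of Prism's rule growth (no conditions chosen yet, target class "edible"); it is not the full algorithm, and the names ROWS and prism_candidates are hypothetical, not part of Weka. Here t is the number of instances covered by an attribute-value and p is how many of those belong to the target class.

```python
from collections import Counter

ATTRS = ["cap-surface", "bruises?", "gill-size", "habitat", "poisonousness"]
DATA = """\
scaly,bruises,broad,waste,edible
smooth,no,narrow,woods,poisonous
fibrous,no,broad,grasses,edible
scaly,bruises,broad,woods,edible
scaly,no,narrow,leaves,poisonous
scaly,bruises,broad,paths,edible
smooth,no,broad,leaves,edible
scaly,no,broad,woods,poisonous
scaly,no,narrow,woods,poisonous
smooth,no,broad,leaves,edible
fibrous,no,broad,paths,poisonous
fibrous,bruises,broad,woods,edible
smooth,bruises,narrow,grasses,poisonous
fibrous,no,broad,paths,poisonous
smooth,bruises,narrow,grasses,poisonous
scaly,no,narrow,leaves,poisonous
scaly,no,narrow,woods,poisonous
fibrous,no,broad,grasses,edible
scaly,bruises,broad,woods,edible
fibrous,no,broad,grasses,edible
"""
# One dict per instance; repeated instances are kept on purpose.
ROWS = [dict(zip(ATTRS, line.split(","))) for line in DATA.split()]


def prism_candidates(rows, target_class, class_attr="poisonousness"):
    """Return {(attribute, value): (p, t)} for one stage of rule growth.

    rows should contain only the instances still covered by the rule
    under construction (all instances at the first stage).
    """
    p, t = Counter(), Counter()
    for row in rows:
        for attr, value in row.items():
            if attr == class_attr:
                continue
            t[(attr, value)] += 1
            if row[class_attr] == target_class:
                p[(attr, value)] += 1
    return {key: (p[key], t[key]) for key in t}


candidates = prism_candidates(ROWS, "edible")
# List candidates from highest to lowest p/t ratio.
for (attr, value), (p, t) in sorted(
    candidates.items(), key=lambda kv: kv[1][0] / kv[1][1], reverse=True
):
    print(f"{attr}={value}: p/t = {p}/{t}")
```

To continue a rule by hand, the same function would be applied to the subset of rows that satisfy the condition just selected. How ties in p/t are broken is left to the textbook's description.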

Note that this dataset contains repeated instances. Your resulting classification and association rules should be affected by this fact.
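
Because repeated instances count separately, support and confidence are computed over all 20 rows. The following Python sketch illustrates only the counting step of Apriori (level 1) and the confidence of a single rule, as a check on hand work; it does not generate higher-level candidates. The helper names support_count and confidence, and the "attribute=value" item encoding, are assumptions made for this example.

```python
ATTRS = ["cap-surface", "bruises?", "gill-size", "habitat", "poisonousness"]
DATA = """\
scaly,bruises,broad,waste,edible
smooth,no,narrow,woods,poisonous
fibrous,no,broad,grasses,edible
scaly,bruises,broad,woods,edible
scaly,no,narrow,leaves,poisonous
scaly,bruises,broad,paths,edible
smooth,no,broad,leaves,edible
scaly,no,broad,woods,poisonous
scaly,no,narrow,woods,poisonous
smooth,no,broad,leaves,edible
fibrous,no,broad,paths,poisonous
fibrous,bruises,broad,woods,edible
smooth,bruises,narrow,grasses,poisonous
fibrous,no,broad,paths,poisonous
smooth,bruises,narrow,grasses,poisonous
scaly,no,narrow,leaves,poisonous
scaly,no,narrow,woods,poisonous
fibrous,no,broad,grasses,edible
scaly,bruises,broad,woods,edible
fibrous,no,broad,grasses,edible
"""
# Each instance becomes a set of "attribute=value" items.
TRANSACTIONS = [
    frozenset(f"{a}={v}" for a, v in zip(ATTRS, line.split(",")))
    for line in DATA.split()
]


def support_count(itemset, transactions=TRANSACTIONS):
    """Number of instances that contain every item in itemset."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t)


def confidence(lhs, rhs, transactions=TRANSACTIONS):
    """support(lhs and rhs together) / support(lhs)."""
    both = support_count(set(lhs) | set(rhs), transactions)
    return both / support_count(lhs, transactions)


MIN_SUPPORT = 0.25
min_count = MIN_SUPPORT * len(TRANSACTIONS)  # 25% of 20 instances

# Level 1: every single item is a candidate; keep the frequent ones.
all_items = sorted(set().union(*TRANSACTIONS))
frequent_1 = [i for i in all_items if support_count([i]) >= min_count]

for item in frequent_1:
    print(f"{item}: {support_count([item])}/{len(TRANSACTIONS)}")
print(confidence(["gill-size=narrow"], ["poisonousness=poisonous"]))
```

Level 2 candidates would then be formed by joining pairs from frequent_1 and counted with the same support_count helper.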

Submission and Due Date.

Part I is due Tuesday, Sept. 14th at 5:00 pm. Bring a hardcopy of your homework to my office FL232 before the deadline. No submissions after 5:00 pm will be accepted.

PROJECT ASSIGNMENT

The following are general guidelines for the project.

Datasets:

Consider the following sets of data:

  1. The Titanic Dataset. Look at the dataset description and the Data instances.

    I suggest you use the following nominal values for the attributes rather than 0's and 1's to make the association rules easier to read:

    Class (0 = crew, 1 = first, 2 = second, 3 = third)
    Age   (1 = adult, 0 = child)
    Sex   (1 = male, 0 = female)
    Survived (1 = yes, 0 = no)
    

  2. 1995 Data Analysis Exposition. This dataset contains college data taken from the U.S. News & World Report's Guide to America's Best Colleges. The necessary files are:

    Let's make "private/public" the classification target. Note that even though the values of this attribute are 0s and 1s, this is a nominal (not a numeric!) attribute.

  3. The Microsoft Anonymous Web Data. This dataset is available at the UCI KDD Repository.

The first two of these datasets (1 and 2) will be used for the Classification Rules experiments, and the last two of these datasets (2 and 3) will be used for the Association Rules experiments.
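
The nominal recoding suggested above for the Titanic dataset can be sketched in Python as follows. The column order (Class, Age, Sex, Survived) and the whitespace-separated input format are assumptions for this example and should be checked against the actual data file; the names NOMINAL, COLUMNS and recode are hypothetical.

```python
# Code-to-label tables taken from the suggested nominal values.
NOMINAL = {
    "Class": {"0": "crew", "1": "first", "2": "second", "3": "third"},
    "Age": {"1": "adult", "0": "child"},
    "Sex": {"1": "male", "0": "female"},
    "Survived": {"1": "yes", "0": "no"},
}
# Assumed column order of each raw data line.
COLUMNS = ["Class", "Age", "Sex", "Survived"]


def recode(line):
    """Translate one whitespace-separated line of codes into labels."""
    codes = line.split()
    return [NOMINAL[col][code] for col, code in zip(COLUMNS, codes)]


# A first-class adult male who survived, written as an ARFF-style data row.
print(",".join(recode("1 1 1 1")))
```

The same result can also be obtained inside Weka or by editing the data file directly; the sketch only makes the mapping explicit.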

Experiments:

For each of the datasets, use the "Explorer" option of the Weka system to perform the following operations:

PROJECT SUBMISSION AND DUE DATE

Part II is due Sunday, Sept. 26 at 5:00 pm. Submissions received on Sunday, Sept 26 between 5:01 pm and 7:00 pm will be penalized with 30% off the grade and submissions after Sept 26 at 7:00 pm won't be accepted.

Please submit the following files using the myWpi digital drop box:

  1. [lastname]_proj2_report.[ext] containing your individual written reports. This file should be either a PDF file (ext=pdf), a Word file (ext=doc), or a PostScript file (ext=ps). For instance my file would be named (note the use of lower case letters only):

    If you are taking this course for grad. credit, state this fact at the beginning of your report. In this case you submit only an individual report containing both the "individual" and the "group" parts, as you are working all by yourself on the projects.

  2. [lastname1_lastname2]_proj2_report.[ext] containing your group written reports. This file should be either a PDF file (ext=pdf), a Word file (ext=doc), or a PostScript file (ext=ps). For instance my file would be named (note the use of lower case letters only):

  3. [lastname1_lastname2]_proj2_slides.[ext] (or [lastname]_proj2_slides.[ext] in the case of students taking this course for graduate credit) containing your slides for your oral reports. This file should be either a PDF file (ext=pdf) or a PowerPoint file (ext=ppt). Your group will have only 4 minutes in class to discuss the entire project (both individual and group parts, and classification and association rules).

GRADING CRITERIA

FOR THE CLASSIFICATION RULES PART OF THE PROJECT 
TOTAL: 200 POINTS + EXTRA POINTS DEPENDING ON EXCEPTIONAL QUALITY

(30 POINTS TOTAL: 15 points for individual and 15 for group work) 
PRE-PROCESSING OF THE DATASET:
(05 points) Discretizing attributes as needed
(05 points) Dealing with missing values appropriately
(05 points) Dealing with attributes appropriately
           (i.e. using nominal values instead of numeric
            when appropriate, using as many of them 
            as possible, etc.) 
(up to 5 extra credit points) 
           Trying to do "fancier" things with attributes
           (i.e. combining two attributes highly correlated
            into one, using background knowledge, etc.)
    
(TOTAL: 15 points for individual work) 
ALGORITHMIC DESCRIPTION OF THE CODE
(05 points) Description of the algorithm underlying the Weka filters used
(15 points) Description of the algorithm underlying the construction and
            pruning of classification rules in Weka's PRISM code
(up to 5 extra credit points for an outstanding job) 
(providing just a structural description of the code, i.e. a list of 
classes and methods, will receive 0 points)

(TOTAL: 30 points for group work) 
CODE MODIFICATION:
(10 points) Description of the algorithmic modification
(20 points) Description of the modifications made to the Prism code 
(up to 10 extra credit points for an outstanding job) 

(120 POINTS TOTAL: 60 points for individual and 60 points for group work) 
EXPERIMENTS
(TOTAL: 30 points each dataset) FOR EACH DATASET:
       (06 points) ran a good number of experiments
                   to get familiar with the PRISM classification method and
                   different evaluation methods (%split, cross-validation,...)
       (08 points) good description of the experiment setting and the results 
       (08 points) good analysis of the results of the experiments
       (08 points) comparison of the results obtained with Prism and the
                   classifiers from previous project (ZeroR, ID3, and J4.8)
                   and argumentation of weaknesses and/or strengths of each of the
                   methods on this dataset, and argumentation of which method
                   should be preferred for this dataset and why. 
       (up to 5 extra credit points) excellent analysis of the results and 
                                     comparisons
       (up to 10 extra credit points) running additional interesting experiments
                   selecting other classification attributes instead of the 
                   ones required in this project statement ("private/public", "Survived")

(TOTAL 5 points) SLIDES - how well do they summarize concisely
        the results of the project? We suggest you summarize the
        setting of your experiments and their results in a tabular manner.

---------------------------------------------------------------------------------

FOR THE ASSOCIATION RULES PART OF THE PROJECT 
TOTAL: 200 POINTS + EXTRA POINTS DEPENDING ON EXCEPTIONAL QUALITY


(TOTAL: 15 points) ALGORITHMIC DESCRIPTION OF THE CODE
(05 points) Description of the algorithm underlying the Weka filters used
(10 points) Description of the Apriori algorithm for the construction of
            frequent itemsets and association rules. 
(up to 5 extra credit points for an outstanding job) 
(providing just a structural description of the code, i.e. a list of 
classes and methods, will receive 0 points)

(TOTAL: 35 points for group work) 
CODE MODIFICATION:
(10 points) Description of the algorithmic modification
(20 points) Description of the modifications made to the Apriori code 
(up to 10 extra credit points for an outstanding job) 

(20 POINTS TOTAL: 10 points for individual and 10 points for group work) 
PRE-PROCESSING OF THE DATASET:
(05 points) Discretizing attributes as needed
(05 points) Dealing with missing values appropriately
(up to 5 extra credit points) 
           Trying to do "fancier" things with attributes
           (i.e. combining two attributes highly correlated
            into one, using background knowledge, etc.)
    
(110 POINTS TOTAL: 55 points for individual and 55 points for group work) 
EXPERIMENTS
(TOTAL: 28 points each dataset) FOR EACH DATASET:
       (05 points) ran a good number of experiments to get familiar with the 
                   Apriori algorithm varying the input parameters 
       (05 points) good description of the experiment setting and the results 
       (13 points) good analysis of the results of the experiments
                   INCLUDING discussion of particularly interesting association 
                   rules obtained.
       (05 points) comparison of the association rules obtained by Apriori and 
                   the classification rules obtained by Prism in project 2.
                   Argumentation of weaknesses and/or strengths of each of the
                   methods on this dataset, and argumentation of which method
                   should be preferred for this dataset and why. 
       (up to 5 extra credit points) excellent analysis of the results and 
                                     comparisons
       (up to 10 extra credit points) running additional interesting experiments

(TOTAL 5 points) SLIDES - how well do they summarize concisely
        the results of the project? We suggest you summarize the
        setting of your experiments and their results in a tabular manner.
   (up to 6 extra credit points) for excellent summary and presentation of results 
   in the slides.


(TOTAL 15 points) Class presentation - how well your oral presentation summarized 
        concisely the results of the project and how focused your presentation was
        on the more creative/interesting/useful of your experiments and results.
        This grade is given individually to each team member.