CS4445 Data Mining and Knowledge Discovery in Databases. A-2004
SOLUTIONS Exam 1 by Prof. Carolina Ruiz - September 17, 2004

Prof. Carolina Ruiz
Department of Computer Science
Worcester Polytechnic Institute


Problem I. Decision Trees (30 points)

Consider the following dataset. Each data instance corresponds to a patient. Assume that the classification target is the attribute diagnosis.

@relation disease-diagnosis

% This toy dataset is taken from the book "Data Mining: A Tutorial-Based Primer"
% by R.J. Roiger and M.W. Geatz. Addison-Wesley. 2003.

@attribute sore-throat {yes, no}
@attribute fever {yes, no}
@attribute swollen-glands {yes, no}
@attribute congestion {yes, no}
@attribute headache {yes, no}
@attribute diagnosis {strep-throat, allergy, cold}

@data

yes, yes, yes, yes, yes, strep-throat
yes,  no, yes,  no,  no, strep-throat
 no,  no, yes,  no,  no, strep-throat
 no,  no,  no, yes, yes, allergy
 no,  no,  no, yes,  no, allergy
yes,  no,  no, yes, yes, allergy
yes, yes,  no, yes,  no, cold
 no, yes,  no, yes,  no, cold
 no, yes,  no, yes, yes, cold
yes, yes,  no, yes, yes, cold

The entropies of the predicting attributes with respect to diagnosis are the following:

Entropy of sore-throat with respect to diagnosis = 1.5
Entropy of fever with respect to diagnosis = 0.82 
Entropy of swollen-glands with respect to diagnosis =  0.68
Entropy of congestion with respect to diagnosis = 1.12 
Entropy of headache with respect to diagnosis = 1.5 

  1. (5 points) Show the steps of the calculation of the entropy of sore-throat with respect to diagnosis (you already know that the result is 1.5, but show in detail what formula was used to produce that value).

    For your convenience, the base-2 logarithms of selected values are provided.

    x          1/2   1/3   2/3   1/4   3/4   1/5   2/5   3/5   1/6   5/6   1/7   2/7   3/7   4/7     1
    log2(x)     -1  -1.5  -0.6    -2  -0.4  -2.3  -1.3  -0.7  -2.5  -0.2  -2.8  -1.8  -1.2  -0.8     0

    SOLUTION

    Here is the computation of the SORE-THROAT entropy with respect to
    DIAGNOSIS. In each row, the three terms correspond to the diagnosis
    values strep-throat, allergy, and cold, in that order; by convention,
    a term of the form (0/n)*log2(0/n) is taken to be 0.

    SORE-THROAT
      yes (5/10)*[ - (2/5)*log2(2/5) - (1/5)*log2(1/5) - (2/5)*log2(2/5)] = 0.75
      no  (5/10)*[ - (1/5)*log2(1/5) - (2/5)*log2(2/5) - (2/5)*log2(2/5)] = 0.75
                                                                           ------
                                                                             1.5

    Although you didn't need to compute the entropy of the remaining
    predicting attributes, I include those calculations here for
    illustration purposes.

    FEVER
      yes (5/10)*[ - (1/5)*log2(1/5) - (0/5)*log2(0/5) - (4/5)*log2(4/5)] = 0.35
      no  (5/10)*[ - (2/5)*log2(2/5) - (3/5)*log2(3/5) - (0/5)*log2(0/5)] = 0.47
                                                                           ------
                                                                             0.82

    SWOLLEN-GLANDS
      yes (3/10)*[ - (3/3)*log2(3/3) - (0/3)*log2(0/3) - (0/3)*log2(0/3)] = 0
      no  (7/10)*[ - (0/7)*log2(0/7) - (3/7)*log2(3/7) - (4/7)*log2(4/7)] = 0.68
                                                                           ------
                                                                             0.68

    CONGESTION
      yes (8/10)*[ - (1/8)*log2(1/8) - (3/8)*log2(3/8) - (4/8)*log2(4/8)] = 1.12
      no  (2/10)*[ - (2/2)*log2(2/2) - (0/2)*log2(0/2) - (0/2)*log2(0/2)] = 0
                                                                           ------
                                                                             1.12

    HEADACHE
      yes (5/10)*[ - (1/5)*log2(1/5) - (2/5)*log2(2/5) - (2/5)*log2(2/5)] = 0.75
      no  (5/10)*[ - (2/5)*log2(2/5) - (1/5)*log2(1/5) - (2/5)*log2(2/5)] = 0.75
                                                                           ------
                                                                             1.5
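
    For checking arithmetic like the above, this entropy can be computed
    mechanically. Here is a minimal Python sketch (mine, not part of the
    original exam) that reproduces these values with exact logarithms;
    the dataset is hard-coded from Problem I:

      from collections import Counter
      from math import log2

      # Problem I dataset: (sore-throat, fever, swollen-glands, congestion,
      # headache, diagnosis); the class label is the last field.
      DATA = [
          ("yes","yes","yes","yes","yes","strep-throat"),
          ("yes","no","yes","no","no","strep-throat"),
          ("no","no","yes","no","no","strep-throat"),
          ("no","no","no","yes","yes","allergy"),
          ("no","no","no","yes","no","allergy"),
          ("yes","no","no","yes","yes","allergy"),
          ("yes","yes","no","yes","no","cold"),
          ("no","yes","no","yes","no","cold"),
          ("no","yes","no","yes","yes","cold"),
          ("yes","yes","no","yes","yes","cold"),
      ]
      ATTRS = ["sore-throat", "fever", "swollen-glands", "congestion", "headache"]

      def entropy_wrt_diagnosis(attr_index):
          """Weighted sum, over each value v of the attribute, of the
          entropy of the diagnosis distribution among the instances
          where the attribute takes value v."""
          total = len(DATA)
          by_value = {}
          for row in DATA:
              by_value.setdefault(row[attr_index], []).append(row[-1])
          result = 0.0
          for diagnoses in by_value.values():
              n = len(diagnoses)
              # Counter never reports zero counts, so 0*log2(0) terms
              # simply do not appear in the sum.
              h = -sum((c / n) * log2(c / n) for c in Counter(diagnoses).values())
              result += (n / total) * h
          return result

      for i, attr in enumerate(ATTRS):
          print(f"{attr}: {entropy_wrt_diagnosis(i):.2f}")
      # Prints 1.52, 0.85, 0.69, 1.12, 1.52; these differ slightly from the
      # exam's 1.5, 0.82, 0.68, 1.12, 1.5 only because the exam uses the
      # rounded log table above.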

  2. (5 points) According to the ID3 algorithm, which of the predicting attributes is chosen as the root of the tree? Explain your answer.
    
    
    

    SOLUTION

    Swollen-glands is chosen as the root node because it is the predicting attribute with the lowest entropy (0.68) with respect to diagnosis.

  3. (20 points) Construct the FULL decision tree for this dataset USING THE ID3 ALGORITHM. Show all the steps of the entropy calculations.
    
    
    

    SOLUTION

    When Swollen-glands is used as the root node of the tree, the tree
    looks like:

                       SWOLLEN-GLANDS
                        /           \
                  yes  /             \  no
                      /               \
       diagnosis=strep-throat     3 instances with diagnosis=allergy
                                  and 4 instances with diagnosis=cold

    All the dataset instances in the left leaf have the same
    classification, diagnosis=strep-throat, and hence this is the
    prediction made for that leaf of the tree. The right node contains
    the following set of heterogeneous instances:

       no,  no,  no, yes, yes, allergy
       no,  no,  no, yes,  no, allergy
      yes,  no,  no, yes, yes, allergy
      yes, yes,  no, yes,  no, cold
       no, yes,  no, yes,  no, cold
       no, yes,  no, yes, yes, cold
      yes, yes,  no, yes, yes, cold

    and hence we need to split that node. The attributes we can use to
    split it are SORE-THROAT, FEVER, CONGESTION, and HEADACHE. We need
    to find the attribute with the lowest entropy with respect to
    diagnosis over this smaller set of instances. Note that in this
    subset of the data, the only two values of the attribute DIAGNOSIS
    are "allergy" and "cold".

    Entropy of SORE-THROAT with respect to DIAGNOSIS over the smaller
    dataset (terms in the order allergy, cold):

    SORE-THROAT
      yes (3/7)*[ - (1/3)*log2(1/3) - (2/3)*log2(2/3)] = 0.39
      no  (4/7)*[ - (2/4)*log2(2/4) - (2/4)*log2(2/4)] = 0.57
                                                        ------
                                                         0.96

    Entropy of FEVER with respect to DIAGNOSIS over the smaller dataset:

    FEVER
      yes (4/7)*[ - (0/4)*log2(0/4) - (4/4)*log2(4/4)] = 0
      no  (3/7)*[ - (3/3)*log2(3/3) - (0/3)*log2(0/3)] = 0
                                                        ------
                                                           0

    Since the entropy of an attribute cannot be lower than 0, there is
    no need to keep computing the entropy of the remaining attributes,
    as none of them can be a better "splitter" than FEVER. Hence, the
    right branch of the tree above is split on FEVER:

                       SWOLLEN-GLANDS?
                        /           \
                  yes  /             \  no
                      /               \
       diagnosis=strep-throat       FEVER?
                                   /      \
                             yes  /        \  no
                                 /          \
                      diagnosis=cold    diagnosis=allergy

    Now all the tree branches end in homogeneous nodes, so the tree
    construction ends with the tree above as the result.
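
    As an illustration (my sketch, not part of the original solution),
    the ID3 loop used above can be written compactly in Python: at each
    node, pick the attribute with the lowest entropy with respect to the
    class and recurse until the node is homogeneous. The dataset is the
    one from Problem I; the sketch assumes the attributes always suffice
    to separate the classes, which holds here.

      from collections import Counter
      from math import log2

      DATA = [
          ("yes","yes","yes","yes","yes","strep-throat"), ("yes","no","yes","no","no","strep-throat"),
          ("no","no","yes","no","no","strep-throat"),     ("no","no","no","yes","yes","allergy"),
          ("no","no","no","yes","no","allergy"),          ("yes","no","no","yes","yes","allergy"),
          ("yes","yes","no","yes","no","cold"),           ("no","yes","no","yes","no","cold"),
          ("no","yes","no","yes","yes","cold"),           ("yes","yes","no","yes","yes","cold"),
      ]
      ATTRS = {"sore-throat": 0, "fever": 1, "swollen-glands": 2,
               "congestion": 3, "headache": 4}

      def entropy(rows, idx):
          """Entropy of the class (last field) conditioned on attribute idx."""
          groups = {}
          for r in rows:
              groups.setdefault(r[idx], []).append(r[-1])
          h = 0.0
          for labels in groups.values():
              n = len(labels)
              h += (n / len(rows)) * -sum(
                  (c / n) * log2(c / n) for c in Counter(labels).values())
          return h

      def id3(rows, attrs):
          labels = {r[-1] for r in rows}
          if len(labels) == 1:              # homogeneous node -> leaf
              return labels.pop()
          best = min(attrs, key=lambda a: entropy(rows, attrs[a]))
          rest = {a: i for a, i in attrs.items() if a != best}
          return {f"{best}={v}": id3([r for r in rows if r[attrs[best]] == v], rest)
                  for v in {r[attrs[best]] for r in rows}}

      print(id3(DATA, ATTRS))
      # -> {'swollen-glands=yes': 'strep-throat',
      #     'swollen-glands=no': {'fever=yes': 'cold', 'fever=no': 'allergy'}}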

Problem II. Classification Rules (35 points)

Consider the following subset of the disease-diagnosis dataset.
@relation 'disease-diagnosis-weka.filters.unsupervised.attribute.Remove-R4,5'

@attribute sore-throat {yes,no}
@attribute fever {yes,no}
@attribute swollen-glands {yes,no}
@attribute diagnosis {allergy,strep-throat,cold}

@data

yes, yes, yes, strep-throat
yes,  no, yes, strep-throat
 no,  no, yes, strep-throat
 no,  no,  no, allergy
yes,  no,  no, allergy
yes, yes,  no, cold
 no, yes,  no, cold

  1. (25 points) Follow the Prism sequential covering algorithm to construct classification rules for the target diagnosis=allergy. (NOTE: You don't need to construct rules for the other two values of diagnosis, just for diagnosis=allergy.) Use the p/t measure to choose the best conditions for the rules. SHOW ALL THE STEPS OF YOUR CALCULATIONS.
    
    

    SOLUTION

    IF ? THEN diagnosis=allergy

    Looking for the best condition to add to the left-hand side of the
    rule:

    CONDITION             p/t
    sore-throat=yes       1/4
    sore-throat=no        1/3
    fever=yes             0/3
    fever=no              2/4
    swollen-glands=yes    0/3
    swollen-glands=no     2/4

    Both fever=no and swollen-glands=no have the best p/t value.
    Arbitrarily, we choose fever=no as the first condition. The rule is
    still not perfect, as its accuracy is 50%.

    IF fever=no and ? THEN diagnosis=allergy

    Looking for the best condition to add to the left-hand side of the
    rule:

    CONDITION                            p/t
    fever=no and sore-throat=yes         1/2
    fever=no and sore-throat=no          1/2
    fever=no and swollen-glands=yes      0/2
    fever=no and swollen-glands=no       2/2

    The best condition to add to fever=no is swollen-glands=no, resulting
    in the rule:

    IF fever=no and swollen-glands=no THEN diagnosis=allergy

    The rule is now perfect, as its accuracy over the training data is
    100%. Hence, we are done with the construction of this rule. We now
    remove the dataset instances covered by this rule. Since no instances
    with diagnosis=allergy remain in the dataset, we are done with the
    construction of rules predicting diagnosis=allergy. The resulting
    rule set consists of the single rule:

    IF fever=no and swollen-glands=no THEN diagnosis=allergy
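
    A minimal Python sketch of this Prism loop for a single target class
    (my sketch, not the exam's), using the p/t measure; the dataset and
    attribute names are hard-coded from Problem II, and a perfect rule is
    assumed to always be reachable, which holds for this toy data:

      DATA = [
          ("yes","yes","yes","strep-throat"), ("yes","no","yes","strep-throat"),
          ("no","no","yes","strep-throat"),   ("no","no","no","allergy"),
          ("yes","no","no","allergy"),        ("yes","yes","no","cold"),
          ("no","yes","no","cold"),
      ]
      ATTRS = {"sore-throat": 0, "fever": 1, "swollen-glands": 2}

      def ratio(rows, cond, target):
          """p/t of a condition: fraction of matching rows in the target class."""
          a, v = cond
          matched = [r for r in rows if r[ATTRS[a]] == v]
          return sum(r[-1] == target for r in matched) / len(matched) if matched else -1.0

      def learn_rules(data, target):
          rules, remaining = [], list(data)
          while any(r[-1] == target for r in remaining):
              conds, covered = [], list(remaining)
              # Grow one rule: add conditions until it is perfect on `covered`.
              while any(r[-1] != target for r in covered):
                  used = {a for a, _ in conds}
                  best = max(((a, v) for a in ATTRS if a not in used
                              for v in ("yes", "no")),
                             key=lambda c: ratio(covered, c, target))
                  conds.append(best)
                  covered = [r for r in covered if r[ATTRS[best[0]]] == best[1]]
              rules.append(conds)
              # Remove the instances covered by the finished rule.
              remaining = [r for r in remaining
                           if not all(r[ATTRS[a]] == v for a, v in conds)]
          return rules

      print(learn_rules(DATA, "allergy"))
      # -> [[('fever', 'no'), ('swollen-glands', 'no')]]
      # (max() breaks the fever=no vs. swollen-glands=no tie by iteration
      # order; the exam solution breaks it arbitrarily too.)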

  2. (10 points) In general, when classification rules are mined from a dataset, shorter rules (that is, rules with few conditions in their left-hand-sides) are more desirable than longer rules. Propose a new measure (instead of the p/t ratio) to select conditions for a rule that takes into account the length of the rule. That is, your proposed measure should provide a tradeoff between the accuracy of the rule and the length of the rule. EXPLAIN YOUR ANSWER.
    Here are two sample alternative solutions, taken from students' exam answers:
    
    Taken from Amanda Bazner's exam solution:
    
    I have a method for creating a tradeoff between rule length and
    accuracy, but it is independent of the measure: p/t would work, as
    would any other measure. Provide to the algorithm a minimum level of
    confidence Cmin and a maximum rule length Lmax.
    
    The increment deltaC is (1 - Cmin)/(Lmax - 1).
    
    Then rules of length L must have accuracy at least Cmin + (L - 1)*deltaC
    to pass.
    
    If minimum confidence = 0.7 and max rule length = 4:
    
    Cmin = 0.7,  Lmax = 4,  deltaC = 0.3/3 = 0.1
    
    so rules of length 1 must be at least (0.7 + 0*deltaC) = 70% accurate to be kept,
       rules of length 2 must be at least (0.7 + 1*deltaC) = 80% accurate to be kept,
       rules of length 3 must be at least (0.7 + 2*deltaC) = 90% accurate to be kept,
       rules of length 4 must be          (0.7 + 3*deltaC) = 100% accurate to be kept.
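    
    A tiny Python rendering of this scheme (my sketch; the Cmin, Lmax and
    threshold values are straight from the example above):
    
      def accuracy_threshold(length, c_min=0.7, l_max=4):
          """Minimum accuracy a rule of the given length must reach to be kept."""
          delta_c = (1 - c_min) / (l_max - 1)
          return c_min + (length - 1) * delta_c
    
      for length in range(1, 5):
          print(length, round(accuracy_threshold(length), 2))
      # -> 1 0.7, 2 0.8, 3 0.9, 4 1.0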
    
    
    Taken from James Martineau's exam solution:
    
    An information-gain scheme would also work, since it compares the
    goodness of the rule before and after adding a condition. Choosing a
    minimum information gain necessary to justify lengthening the rule
    could keep rules shorter: only conditions that result in a
    significant improvement would be added.
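    
    One concrete instantiation of this idea (my choice; the student's
    answer does not commit to a formula) is a FOIL-style gain,
    p_new * (log2(p_new/t_new) - log2(p_old/t_old)), where p_old/t_old is
    the rule's p/t before adding the condition; a condition is accepted
    only if the gain exceeds a minimum threshold:
    
      from math import log2
    
      def foil_gain(p_old, t_old, p_new, t_new):
          """Positive coverage of the extended rule times its improvement
          in log-accuracy over the unextended rule."""
          if p_new == 0:
              return float("-inf")
          return p_new * (log2(p_new / t_new) - log2(p_old / t_old))
    
      MIN_GAIN = 0.5  # hypothetical threshold on the gain
    
      # Extending "IF fever=no THEN allergy" (p/t = 2/4) with
      # swollen-glands=no (p/t = 2/2) gains 2*(log2(1) - log2(0.5)) = 2 bits,
      # so the extension is accepted.
      print(foil_gain(2, 4, 2, 2) >= MIN_GAIN)   # True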
    
    

Problem III. Association Rules (35 points)

Consider the following dataset.
@relation 'disease-diagnosis-weka.filters.unsupervised.attribute.Remove-R4,5'

@attribute sore-throat {yes,no}
@attribute fever {yes,no}
@attribute swollen-glands {yes,no}
@attribute diagnosis {strep-throat,allergy,cold}

@data

yes, yes, yes, strep-throat
yes,  no, yes, strep-throat
 no,  no, yes, strep-throat
 no,  no,  no, allergy
yes,  no,  no, allergy
yes, yes,  no, cold
 no, yes,  no, cold

Assume that we want to mine association rules with minimum support 0.25 (that is, an itemset has to be present in at least 2 of the 7 data instances).

  1. (20 Points) Use the Apriori algorithm to construct all the frequent itemsets in this dataset. The first two levels of candidate itemsets, together with their supports, are provided below.
    LEVEL 1
    
    
    SUPPORT  ITEMSETS              
    
    2/7 {diagnosis=allergy}             
    2/7 {diagnosis=cold}                
    3/7 {diagnosis=strep-throat}
    4/7 {fever=no}               
    3/7 {fever=yes}                     
    3/7 {sore-throat=no}               
    4/7 {sore-throat=yes}               
    4/7 {swollen-glands=no}             
    3/7 {swollen-glands=yes}            
    
    LEVEL 2
    
    SUPPORT  ITEMSETS
    
    2/7 {diagnosis=allergy, fever=no}                   
    0/7 {diagnosis=allergy, fever=yes}                 
    1/7 {diagnosis=allergy, sore-throat=no}                  
    1/7 {diagnosis=allergy, sore-throat=yes}                 
    2/7 {diagnosis=allergy, swollen-glands=no}               
    0/7 {diagnosis=allergy, swollen-glands=yes}
    
    0/7 {diagnosis=cold, fever=no}
    2/7 {diagnosis=cold, fever=yes}
    1/7 {diagnosis=cold, sore-throat=no}
    1/7 {diagnosis=cold, sore-throat=yes}
    2/7 {diagnosis=cold, swollen-glands=no}
    0/7 {diagnosis=cold, swollen-glands=yes}
    
    2/7 {diagnosis=strep-throat, fever=no}
    1/7 {diagnosis=strep-throat, fever=yes}
    1/7 {diagnosis=strep-throat, sore-throat=no}
    2/7 {diagnosis=strep-throat, sore-throat=yes}
    0/7 {diagnosis=strep-throat, swollen-glands=no}
    3/7 {diagnosis=strep-throat, swollen-glands=yes}
    
    2/7 {fever=no, sore-throat=no}
    2/7 {fever=no, sore-throat=yes}
    2/7 {fever=no, swollen-glands=no}
    2/7 {fever=no, swollen-glands=yes}
    
    1/7 {fever=yes, sore-throat=no}
    2/7 {fever=yes, sore-throat=yes}
    2/7 {fever=yes, swollen-glands=no}
    1/7 {fever=yes, swollen-glands=yes}
    
    2/7 {sore-throat=no, swollen-glands=no}
    1/7 {sore-throat=no, swollen-glands=yes}
    
    2/7 {sore-throat=yes, swollen-glands=no}
    2/7 {sore-throat=yes, swollen-glands=yes}
    
    
    
    LEVEL 3 Compute all the candidate and frequent itemsets
    for level 3. Use both the join and the subset pruning
    criteria to make the process more efficient.
    
    

    SOLUTION:

    SUPPORT  ITEMSETS

    2/7  {diagnosis=allergy, fever=no, swollen-glands=no}
    2/7  {diagnosis=cold, fever=yes, swollen-glands=no}
    1/7  {diagnosis=strep-throat, fever=no, sore-throat=yes}
    2/7  {diagnosis=strep-throat, fever=no, swollen-glands=yes}
    2/7  {diagnosis=strep-throat, sore-throat=yes, swollen-glands=yes}
    1/7  {fever=no, sore-throat=no, swollen-glands=no}
    XXX  {fever=no, sore-throat=no, swollen-glands=yes}
         (no need to compute the support of this itemset, as its subset
         {sore-throat=no, swollen-glands=yes} is not frequent)
    1/7  {fever=no, sore-throat=yes, swollen-glands=no}
    1/7  {fever=no, sore-throat=yes, swollen-glands=yes}
    1/7  {fever=yes, sore-throat=yes, swollen-glands=no}

    Hence there are 10 candidate itemsets (one of which is eliminated by
    subset pruning) and only 4 frequent itemsets.

    LEVEL 4 Compute all the candidate and frequent itemsets
    for level 4. Use both the join and the subset pruning
    criteria to make the process more efficient.

    SOLUTION: There are NO candidate itemsets for level 4, as no pair of
    frequent itemsets from level 3 satisfies the join condition:

    2/7  {diagnosis=allergy, fever=no, swollen-glands=no}
    2/7  {diagnosis=cold, fever=yes, swollen-glands=no}
    2/7  {diagnosis=strep-throat, fever=no, swollen-glands=yes}
    2/7  {diagnosis=strep-throat, sore-throat=yes, swollen-glands=yes}

    That is, no two of these itemsets have exactly the same items, in the
    same (alphabetical) order, except for their last items.
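
    For reference, here is a compact Python sketch (mine, not the exam's)
    of the Apriori level-wise loop with the join and subset-pruning steps
    used above; items are "attribute=value" strings and the dataset is
    the one from Problem III:

      from itertools import combinations

      ROWS = [
          ("yes","yes","yes","strep-throat"), ("yes","no","yes","strep-throat"),
          ("no","no","yes","strep-throat"),   ("no","no","no","allergy"),
          ("yes","no","no","allergy"),        ("yes","yes","no","cold"),
          ("no","yes","no","cold"),
      ]
      NAMES = ("sore-throat", "fever", "swollen-glands", "diagnosis")
      # Each transaction is a sorted tuple of "attribute=value" items.
      TRANSACTIONS = [tuple(sorted(f"{n}={v}" for n, v in zip(NAMES, r)))
                      for r in ROWS]
      MIN_COUNT = 2   # minimum support 0.25 of 7 instances

      def support(itemset):
          return sum(set(itemset) <= set(t) for t in TRANSACTIONS)

      def apriori():
          items = sorted({i for t in TRANSACTIONS for i in t})
          level = [(i,) for i in items if support((i,)) >= MIN_COUNT]
          frequent = list(level)
          while level:
              candidates = []
              for a, b in combinations(level, 2):
                  # Join: same first k-1 items; skip joins that would put
                  # two values of the same attribute in one itemset.
                  if a[:-1] == b[:-1] and a[-1].split("=")[0] != b[-1].split("=")[0]:
                      c = a + (b[-1],)
                      # Subset pruning: every (k-1)-subset must be frequent.
                      if all(s in level for s in combinations(c, len(c) - 1)):
                          candidates.append(c)
              level = [c for c in candidates if support(c) >= MIN_COUNT]
              frequent += level
          return frequent

      for itemset in apriori():
          print(f"{support(itemset)}/7", itemset)
      # The level-3 output is exactly the 4 frequent itemsets listed above.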

  2. (5 points) Select one of your frequent itemsets from level 3 and construct an association rule from it.

    SOLUTION: I chose the level-3 itemset:

        2/7  {diagnosis=strep-throat, fever=no, swollen-glands=yes}

    Association rule: diagnosis=strep-throat => fever=no & swollen-glands=yes

    Compute the confidence of your rule. Show the steps of your
    calculations.

    CONFIDENCE(diagnosis=strep-throat => fever=no & swollen-glands=yes)

        SUPPORT(diagnosis=strep-throat, fever=no, swollen-glands=yes)
      = -------------------------------------------------------------
                    SUPPORT(diagnosis=strep-throat)

        2/7
      = ---  =  2/3  =  66%
        3/7
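
    The same computation as a one-off check (a sketch; the two support
    values are the ones just derived):

      def confidence(support_union, support_antecedent):
          """confidence(A => B) = support(A and B together) / support(A)."""
          return support_union / support_antecedent

      print(confidence(2/7, 3/7))   # 0.666... = 2/3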

  3. (10 points) Assume that we want to generate only association rules that have a particular item (or attribute-value) in their right-hand-sides. Suppose for instance that we want to generate only association rules that have diagnosis=allergy in their consequent. DESCRIBE the changes that you would make to the Apriori algorithm to generate just those rules in an EFFICIENT manner, given minimum support and minimum confidence thresholds. Use the dataset above to illustrate your ideas.

    SOLUTION:

    During frequent itemset generation: eliminate from level 1 all the
    1-itemsets of the form diagnosis=value with value different from
    allergy, as those itemsets can never be extended to contain
    diagnosis=allergy. (Note that eliminating every itemset that does not
    contain diagnosis=allergy would not work in general, as this would
    prevent rules that should appear in the output from appearing. More
    on this will be discussed in class and on Project 2.)

    During rule generation: use only frequent itemsets that contain
    diagnosis=allergy to construct rules. Also, consider only those
    splits of these frequent itemsets that put the required
    attribute-value diagnosis=allergy in the consequent, or
    right-hand side, of the rule.
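
    A sketch of these two changes on top of the earlier Apriori code (my
    code; apriori and support are the functions from the sketch after
    Problem III.1, and MIN_CONF is a hypothetical confidence threshold):

      from itertools import combinations

      TARGET = "diagnosis=allergy"
      MIN_CONF = 0.7   # hypothetical minimum confidence threshold

      # Change 1 -- during frequent itemset generation: filter level 1 so
      # that the only diagnosis 1-itemset kept is the target. This plugs
      # into the `level = [(i,) ...]` line of the earlier apriori().
      def filter_level1(level1):
          return [i for i in level1
                  if not i[0].startswith("diagnosis=") or i[0] == TARGET]

      # Change 2 -- during rule generation: use only frequent itemsets
      # containing the target, and only splits whose consequent includes it.
      def rules_for_target(frequent, support):
          rules = []
          for itemset in frequent:
              if TARGET not in itemset or len(itemset) < 2:
                  continue
              others = [i for i in itemset if i != TARGET]
              for k in range(1, len(others) + 1):
                  for antecedent in combinations(others, k):
                      consequent = tuple(i for i in itemset
                                         if i not in antecedent)
                      conf = support(itemset) / support(antecedent)
                      if conf >= MIN_CONF:
                          rules.append((antecedent, consequent, conf))
          return rules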