CS4445 Data Mining and Knowledge Discovery in Databases. A-2004
SOLUTIONS Exam 1 by Prof. Carolina Ruiz - September 17, 2004

Prof. Carolina Ruiz
Department of Computer Science
Worcester Polytechnic Institute


Problem I. Decision Trees (30 points)

Consider the following dataset. Each data instance corresponds to a patient. Assume that the classification target is the attribute diagnosis.

@relation disease-diagnosis

% This toy dataset is taken from the book "Data Mining: A Tutorial-Based Primer"
% by R.J. Roiger and M.W. Geatz. Addison-Wesley. 2003.

@attribute sore-throat {yes, no}
@attribute fever {yes, no}
@attribute swollen-glands {yes, no}
@attribute congestion {yes, no}
@attribute headache {yes, no}
@attribute diagnosis {strep-throat, allergy, cold}

@data

yes, yes, yes, yes, yes, strep-throat
yes,  no, yes,  no,  no, strep-throat
 no,  no, yes,  no,  no, strep-throat
 no,  no,  no, yes, yes, allergy
 no,  no,  no, yes,  no, allergy
yes,  no,  no, yes, yes, allergy
yes, yes,  no, yes,  no, cold
 no, yes,  no, yes,  no, cold
 no, yes,  no, yes, yes, cold
yes, yes,  no, yes, yes, cold

The entropies of the predicting attributes with respect to diagnosis are the following:

Entropy of sore-throat with respect to diagnosis = 1.5
Entropy of fever with respect to diagnosis = 0.82 
Entropy of swollen-glands with respect to diagnosis =  0.68
Entropy of congestion with respect to diagnosis = 1.12 
Entropy of headache with respect to diagnosis = 1.5 

  1. (5 points) Show the steps of the calculation of the entropy of sore-throat with respect to diagnosis (you already know that the result is 1.5, but show in detail what formula was used to produce that value).

    For your convenience, the base-2 logarithms of selected values are provided.

    x          1/2   1/3   2/3   1/4   3/4   1/5   2/5   3/5   1/6   5/6   1/7   2/7   3/7   4/7     1
    log2(x)     -1  -1.5  -0.6    -2  -0.4  -2.3  -1.3  -0.7  -2.5  -0.2  -2.8  -1.8  -1.2  -0.8     0

    SOLUTION

    Here is the computation of the SORE-THROAT entropy with respect to
    DIAGNOSIS. In each row, the three terms correspond to the diagnosis
    values strep-throat, allergy, and cold, in that order; by convention,
    a term of the form (0/n)*log2(0/n) is taken to be 0.

    SORE-THROAT
      yes (5/10)*[ - (2/5)*log2(2/5) - (1/5)*log2(1/5) - (2/5)*log2(2/5)] = 0.75
      no  (5/10)*[ - (1/5)*log2(1/5) - (2/5)*log2(2/5) - (2/5)*log2(2/5)] = 0.75
                                                                           ------
                                                                             1.5

    Although you didn't need to compute the entropy of the remaining
    predicting attributes, I include those calculations here for
    illustration purposes.

    FEVER
      yes (5/10)*[ - (1/5)*log2(1/5) - (0/5)*log2(0/5) - (4/5)*log2(4/5)] = 0.35
      no  (5/10)*[ - (2/5)*log2(2/5) - (3/5)*log2(3/5) - (0/5)*log2(0/5)] = 0.47
                                                                           ------
                                                                             0.82

    SWOLLEN-GLANDS
      yes (3/10)*[ - (3/3)*log2(3/3) - (0/3)*log2(0/3) - (0/3)*log2(0/3)] = 0
      no  (7/10)*[ - (0/7)*log2(0/7) - (3/7)*log2(3/7) - (4/7)*log2(4/7)] = 0.68
                                                                           ------
                                                                             0.68

    CONGESTION
      yes (8/10)*[ - (1/8)*log2(1/8) - (3/8)*log2(3/8) - (4/8)*log2(4/8)] = 1.12
      no  (2/10)*[ - (2/2)*log2(2/2) - (0/2)*log2(0/2) - (0/2)*log2(0/2)] = 0
                                                                           ------
                                                                             1.12

    HEADACHE
      yes (5/10)*[ - (1/5)*log2(1/5) - (2/5)*log2(2/5) - (2/5)*log2(2/5)] = 0.75
      no  (5/10)*[ - (2/5)*log2(2/5) - (1/5)*log2(1/5) - (2/5)*log2(2/5)] = 0.75
                                                                           ------
                                                                             1.5
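
    For checking arithmetic like the above, this entropy can be computed
    mechanically. Here is a minimal Python sketch (mine, not part of the
    original exam) that reproduces these values with exact logarithms;
    the dataset is hard-coded from Problem I:

      from collections import Counter
      from math import log2

      # Problem I dataset: (sore-throat, fever, swollen-glands, congestion,
      # headache, diagnosis); the class label is the last field.
      DATA = [
          ("yes","yes","yes","yes","yes","strep-throat"),
          ("yes","no","yes","no","no","strep-throat"),
          ("no","no","yes","no","no","strep-throat"),
          ("no","no","no","yes","yes","allergy"),
          ("no","no","no","yes","no","allergy"),
          ("yes","no","no","yes","yes","allergy"),
          ("yes","yes","no","yes","no","cold"),
          ("no","yes","no","yes","no","cold"),
          ("no","yes","no","yes","yes","cold"),
          ("yes","yes","no","yes","yes","cold"),
      ]
      ATTRS = ["sore-throat", "fever", "swollen-glands", "congestion", "headache"]

      def entropy_wrt_diagnosis(attr_index):
          """Weighted sum, over each value v of the attribute, of the
          entropy of the diagnosis distribution among the instances
          where the attribute takes value v."""
          total = len(DATA)
          by_value = {}
          for row in DATA:
              by_value.setdefault(row[attr_index], []).append(row[-1])
          result = 0.0
          for diagnoses in by_value.values():
              n = len(diagnoses)
              # Counter never reports zero counts, so 0*log2(0) terms
              # simply do not appear in the sum.
              h = -sum((c / n) * log2(c / n) for c in Counter(diagnoses).values())
              result += (n / total) * h
          return result

      for i, attr in enumerate(ATTRS):
          print(f"{attr}: {entropy_wrt_diagnosis(i):.2f}")
      # Prints 1.52, 0.85, 0.69, 1.12, 1.52; these differ slightly from the
      # exam's 1.5, 0.82, 0.68, 1.12, 1.5 only because the exam uses the
      # rounded log table above.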

  2. (5 points) According to the ID3 algorithm, which of the predicting attributes is chosen as the root of the tree? Explain your answer.
    
    
    

    SOLUTION

    Swollen-glands is chosen as the root node because it is the predicting attribute with the lowest entropy (0.68) with respect to diagnosis.

  3. (20 points) Construct the FULL decision tree for this dataset USING THE ID3 ALGORITHM. Show all the steps of the entropy calculations.
    
    
    

    SOLUTION

    When Swollen-glands is used as the root node of the tree, the tree
    looks like:

                       SWOLLEN-GLANDS
                        /           \
                  yes  /             \  no
                      /               \
       diagnosis=strep-throat     3 instances with diagnosis=allergy
                                  and 4 instances with diagnosis=cold

    All the dataset instances in the left leaf have the same
    classification, diagnosis=strep-throat, and hence this is the
    prediction made for that leaf of the tree. The right node contains
    the following set of heterogeneous instances:

       no,  no,  no, yes, yes, allergy
       no,  no,  no, yes,  no, allergy
      yes,  no,  no, yes, yes, allergy
      yes, yes,  no, yes,  no, cold
       no, yes,  no, yes,  no, cold
       no, yes,  no, yes, yes, cold
      yes, yes,  no, yes, yes, cold

    and hence we need to split that node. The attributes we can use to
    split it are SORE-THROAT, FEVER, CONGESTION, and HEADACHE. We need
    to find the attribute with the lowest entropy with respect to
    diagnosis over this smaller set of instances. Note that in this
    subset of the data, the only two values of the attribute DIAGNOSIS
    are "allergy" and "cold".

    Entropy of SORE-THROAT with respect to DIAGNOSIS over the smaller
    dataset (terms in the order allergy, cold):

    SORE-THROAT
      yes (3/7)*[ - (1/3)*log2(1/3) - (2/3)*log2(2/3)] = 0.39
      no  (4/7)*[ - (2/4)*log2(2/4) - (2/4)*log2(2/4)] = 0.57
                                                        ------
                                                         0.96

    Entropy of FEVER with respect to DIAGNOSIS over the smaller dataset:

    FEVER
      yes (4/7)*[ - (0/4)*log2(0/4) - (4/4)*log2(4/4)] = 0
      no  (3/7)*[ - (3/3)*log2(3/3) - (0/3)*log2(0/3)] = 0
                                                        ------
                                                           0

    Since the entropy of an attribute cannot be lower than 0, there is
    no need to keep computing the entropy of the remaining attributes,
    as none of them can be a better "splitter" than FEVER. Hence, the
    right branch of the tree above is split on FEVER:

                       SWOLLEN-GLANDS?
                        /           \
                  yes  /             \  no
                      /               \
       diagnosis=strep-throat       FEVER?
                                   /      \
                             yes  /        \  no
                                 /          \
                      diagnosis=cold    diagnosis=allergy

    Now all the tree branches end in homogeneous nodes, so the tree
    construction ends with the tree above as the result.
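
    As an illustration (my sketch, not part of the original solution),
    the ID3 loop used above can be written compactly in Python: at each
    node, pick the attribute with the lowest entropy with respect to the
    class and recurse until the node is homogeneous. The dataset is the
    one from Problem I; the sketch assumes the attributes always suffice
    to separate the classes, which holds here.

      from collections import Counter
      from math import log2

      DATA = [
          ("yes","yes","yes","yes","yes","strep-throat"), ("yes","no","yes","no","no","strep-throat"),
          ("no","no","yes","no","no","strep-throat"),     ("no","no","no","yes","yes","allergy"),
          ("no","no","no","yes","no","allergy"),          ("yes","no","no","yes","yes","allergy"),
          ("yes","yes","no","yes","no","cold"),           ("no","yes","no","yes","no","cold"),
          ("no","yes","no","yes","yes","cold"),           ("yes","yes","no","yes","yes","cold"),
      ]
      ATTRS = {"sore-throat": 0, "fever": 1, "swollen-glands": 2,
               "congestion": 3, "headache": 4}

      def entropy(rows, idx):
          """Entropy of the class (last field) conditioned on attribute idx."""
          groups = {}
          for r in rows:
              groups.setdefault(r[idx], []).append(r[-1])
          h = 0.0
          for labels in groups.values():
              n = len(labels)
              h += (n / len(rows)) * -sum(
                  (c / n) * log2(c / n) for c in Counter(labels).values())
          return h

      def id3(rows, attrs):
          labels = {r[-1] for r in rows}
          if len(labels) == 1:              # homogeneous node -> leaf
              return labels.pop()
          best = min(attrs, key=lambda a: entropy(rows, attrs[a]))
          rest = {a: i for a, i in attrs.items() if a != best}
          return {f"{best}={v}": id3([r for r in rows if r[attrs[best]] == v], rest)
                  for v in {r[attrs[best]] for r in rows}}

      print(id3(DATA, ATTRS))
      # -> {'swollen-glands=yes': 'strep-throat',
      #     'swollen-glands=no': {'fever=yes': 'cold', 'fever=no': 'allergy'}}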

Problem II. Classification Rules (35 points)

Consider the following subset of the disease-diagnosis dataset.
@relation 'disease-diagnosis-weka.filters.unsupervised.attribute.Remove-R4,5'

@attribute sore-throat {yes,no}
@attribute fever {yes,no}
@attribute swollen-glands {yes,no}
@attribute diagnosis {allergy,strep-throat,cold}

@data

yes, yes, yes, strep-throat
yes,  no, yes, strep-throat
 no,  no, yes, strep-throat
 no,  no,  no, allergy
yes,  no,  no, allergy
yes, yes,  no, cold
 no, yes,  no, cold

  1. (25 points) Follow the Prism sequential covering algorithm to construct classification rules for the target diagnosis=allergy. (NOTE: You don't need to construct rules for the other two values of diagnosis, just for diagnosis=allergy.) Use the p/t measure to choose the best conditions for the rules. SHOW ALL THE STEPS OF YOUR CALCULATIONS.
    
    

    SOLUTION

    IF ? THEN diagnosis=allergy

    Looking for the best condition to add to the left-hand side of the
    rule:

    CONDITION             p/t
    sore-throat=yes       1/4
    sore-throat=no        1/3
    fever=yes             0/3
    fever=no              2/4
    swollen-glands=yes    0/3
    swollen-glands=no     2/4

    Both fever=no and swollen-glands=no have the best p/t value.
    Arbitrarily, we choose fever=no as the first condition. The rule is
    still not perfect, as its accuracy is 50%.

    IF fever=no and ? THEN diagnosis=allergy

    Looking for the best condition to add to the left-hand side of the
    rule:

    CONDITION                            p/t
    fever=no and sore-throat=yes         1/2
    fever=no and sore-throat=no          1/2
    fever=no and swollen-glands=yes      0/2
    fever=no and swollen-glands=no       2/2

    The best condition to add to fever=no is swollen-glands=no, resulting
    in the rule:

    IF fever=no and swollen-glands=no THEN diagnosis=allergy

    The rule is now perfect, as its accuracy over the training data is
    100%. Hence, we are done with the construction of this rule. We now
    remove the dataset instances covered by this rule. Since no instances
    with diagnosis=allergy remain in the dataset, we are done with the
    construction of rules predicting diagnosis=allergy. The resulting
    rule set consists of the single rule:

    IF fever=no and swollen-glands=no THEN diagnosis=allergy
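
    A minimal Python sketch of this Prism loop for a single target class
    (my sketch, not the exam's), using the p/t measure; the dataset and
    attribute names are hard-coded from Problem II, and a perfect rule is
    assumed to always be reachable, which holds for this toy data:

      DATA = [
          ("yes","yes","yes","strep-throat"), ("yes","no","yes","strep-throat"),
          ("no","no","yes","strep-throat"),   ("no","no","no","allergy"),
          ("yes","no","no","allergy"),        ("yes","yes","no","cold"),
          ("no","yes","no","cold"),
      ]
      ATTRS = {"sore-throat": 0, "fever": 1, "swollen-glands": 2}

      def ratio(rows, cond, target):
          """p/t of a condition: fraction of matching rows in the target class."""
          a, v = cond
          matched = [r for r in rows if r[ATTRS[a]] == v]
          return sum(r[-1] == target for r in matched) / len(matched) if matched else -1.0

      def learn_rules(data, target):
          rules, remaining = [], list(data)
          while any(r[-1] == target for r in remaining):
              conds, covered = [], list(remaining)
              # Grow one rule: add conditions until it is perfect on `covered`.
              while any(r[-1] != target for r in covered):
                  used = {a for a, _ in conds}
                  best = max(((a, v) for a in ATTRS if a not in used
                              for v in ("yes", "no")),
                             key=lambda c: ratio(covered, c, target))
                  conds.append(best)
                  covered = [r for r in covered if r[ATTRS[best[0]]] == best[1]]
              rules.append(conds)
              # Remove the instances covered by the finished rule.
              remaining = [r for r in remaining
                           if not all(r[ATTRS[a]] == v for a, v in conds)]
          return rules

      print(learn_rules(DATA, "allergy"))
      # -> [[('fever', 'no'), ('swollen-glands', 'no')]]
      # (max() breaks the fever=no vs. swollen-glands=no tie by iteration
      # order; the exam solution breaks it arbitrarily too.)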

  2. (10 points) In general, when classification rules are mined from a dataset, shorter rules (that is, rules with few conditions in their left-hand-sides) are more desirable than longer rules. Propose a new measure (instead of the p/t ratio) to select conditions for a rule that takes into account the length of the rule. That is, your proposed measure should provide a tradeoff between the accuracy of the rule and the length of the rule. EXPLAIN YOUR ANSWER.
    Here are two sample alternative solutions, taken from students' exam answers:
    
    Taken from Amanda Bazner's exam solution:
    
    I have a method for creating a tradeoff between rule length and
    accuracy, but it is independent of the measure: p/t would work, as
    would any other measure. Provide to the algorithm a minimum level of
    confidence Cmin and a maximum rule length Lmax.
    
    The increment deltaC is (1 - Cmin)/(Lmax - 1).
    
    Then rules of length L must have accuracy at least Cmin + (L - 1)*deltaC
    to pass.
    
    If minimum confidence = 0.7 and max rule length = 4:
    
    Cmin = 0.7,  Lmax = 4,  deltaC = 0.3/3 = 0.1
    
    so rules of length 1 must be at least (0.7 + 0*deltaC) = 70% accurate to be kept,
       rules of length 2 must be at least (0.7 + 1*deltaC) = 80% accurate to be kept,
       rules of length 3 must be at least (0.7 + 2*deltaC) = 90% accurate to be kept,
       rules of length 4 must be          (0.7 + 3*deltaC) = 100% accurate to be kept.
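    
    A tiny Python rendering of this scheme (my sketch; the Cmin, Lmax and
    threshold values are straight from the example above):
    
      def accuracy_threshold(length, c_min=0.7, l_max=4):
          """Minimum accuracy a rule of the given length must reach to be kept."""
          delta_c = (1 - c_min) / (l_max - 1)
          return c_min + (length - 1) * delta_c
    
      for length in range(1, 5):
          print(length, round(accuracy_threshold(length), 2))
      # -> 1 0.7, 2 0.8, 3 0.9, 4 1.0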
    
    
    Taken from James Martineau's exam solution:
    
    An information-gain scheme would also work, since it compares the
    goodness of the rule before and after adding a condition. Choosing a
    minimum information gain necessary to justify lengthening the rule
    could keep rules shorter: only conditions that result in a
    significant improvement would be added.
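    
    One concrete instantiation of this idea (my choice; the student's
    answer does not commit to a formula) is a FOIL-style gain,
    p_new * (log2(p_new/t_new) - log2(p_old/t_old)), where p_old/t_old is
    the rule's p/t before adding the condition; a condition is accepted
    only if the gain exceeds a minimum threshold:
    
      from math import log2
    
      def foil_gain(p_old, t_old, p_new, t_new):
          """Positive coverage of the extended rule times its improvement
          in log-accuracy over the unextended rule."""
          if p_new == 0:
              return float("-inf")
          return p_new * (log2(p_new / t_new) - log2(p_old / t_old))
    
      MIN_GAIN = 0.5  # hypothetical threshold on the gain
    
      # Extending "IF fever=no THEN allergy" (p/t = 2/4) with
      # swollen-glands=no (p/t = 2/2) gains 2*(log2(1) - log2(0.5)) = 2 bits,
      # so the extension is accepted.
      print(foil_gain(2, 4, 2, 2) >= MIN_GAIN)   # True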
    
    

Problem III. Association Rules (35 points)

Consider the following dataset.
@relation 'disease-diagnosis-weka.filters.unsupervised.attribute.Remove-R4,5'

@attribute sore-throat {yes,no}
@attribute fever {yes,no}
@attribute swollen-glands {yes,no}
@attribute diagnosis {strep-throat,allergy,cold}

@data

yes, yes, yes, strep-throat
yes,  no, yes, strep-throat
 no,  no, yes, strep-throat
 no,  no,  no, allergy
yes,  no,  no, allergy
yes, yes,  no, cold
 no, yes,  no, cold

Assume that we want to mine association rules with minimum support 0.25 (that is, an itemset has to be present in at least 2 of the 7 data instances).

  1. (20 Points) Use the Apriori algorithm to construct all the frequent itemsets in this dataset. The first two levels of candidate itemsets, together with their supports, are provided below.
    LEVEL 1
    
    
    SUPPORT  ITEMSETS              
    
    2/7 {diagnosis=allergy}             
    2/7 {diagnosis=cold}                
    3/7 {diagnosis=strep-throat}
    4/7 {fever=no}               
    3/7 {fever=yes}                     
    3/7 {sore-throat=no}               
    4/7 {sore-throat=yes}               
    4/7 {swollen-glands=no}             
    3/7 {swollen-glands=yes}            
    
    LEVEL 2
    
    SUPPORT  ITEMSETS
    
    2/7 {diagnosis=allergy, fever=no}                   
    0/7 {diagnosis=allergy, fever=yes}                 
    1/7 {diagnosis=allergy, sore-throat=no}                  
    1/7 {diagnosis=allergy, sore-throat=yes}                 
    2/7 {diagnosis=allergy, swollen-glands=no}               
    0/7 {diagnosis=allergy, swollen-glands=yes}
    
    0/7 {diagnosis=cold, fever=no}
    2/7 {diagnosis=cold, fever=yes}
    1/7 {diagnosis=cold, sore-throat=no}
    1/7 {diagnosis=cold, sore-throat=yes}
    2/7 {diagnosis=cold, swollen-glands=no}
    0/7 {diagnosis=cold, swollen-glands=yes}
    
    2/7 {diagnosis=strep-throat, fever=no}
    1/7 {diagnosis=strep-throat, fever=yes}
    1/7 {diagnosis=strep-throat, sore-throat=no}
    2/7 {diagnosis=strep-throat, sore-throat=yes}
    0/7 {diagnosis=strep-throat, swollen-glands=no}
    3/7 {diagnosis=strep-throat, swollen-glands=yes}
    
    2/7 {fever=no, sore-throat=no}
    2/7 {fever=no, sore-throat=yes}
    2/7 {fever=no, swollen-glands=no}
    2/7 {fever=no, swollen-glands=yes}
    
    1/7 {fever=yes, sore-throat=no}
    2/7 {fever=yes, sore-throat=yes}
    2/7 {fever=yes, swollen-glands=no}
    1/7 {fever=yes, swollen-glands=yes}
    
    2/7 {sore-throat=no, swollen-glands=no}
    1/7 {sore-throat=no, swollen-glands=yes}
    
    2/7 {sore-throat=yes, swollen-glands=no}
    2/7 {sore-throat=yes, swollen-glands=yes}
    
    
    
    LEVEL 3 Compute all the candidate and frequent itemsets
    for level 3. Use both the join and the subset pruning
    criteria to make the process more efficient.
    
    

    SOLUTION:

    SUPPORT  ITEMSETS

    2/7  {diagnosis=allergy, fever=no, swollen-glands=no}
    2/7  {diagnosis=cold, fever=yes, swollen-glands=no}
    1/7  {diagnosis=strep-throat, fever=no, sore-throat=yes}
    2/7  {diagnosis=strep-throat, fever=no, swollen-glands=yes}
    2/7  {diagnosis=strep-throat, sore-throat=yes, swollen-glands=yes}
    1/7  {fever=no, sore-throat=no, swollen-glands=no}
    XXX  {fever=no, sore-throat=no, swollen-glands=yes}
         (no need to compute the support of this itemset, as its subset
         {sore-throat=no, swollen-glands=yes} is not frequent)
    1/7  {fever=no, sore-throat=yes, swollen-glands=no}
    1/7  {fever=no, sore-throat=yes, swollen-glands=yes}
    1/7  {fever=yes, sore-throat=yes, swollen-glands=no}

    Hence there are 10 candidate itemsets (one of which is eliminated by
    subset pruning) and only 4 frequent itemsets.

    LEVEL 4 Compute all the candidate and frequent itemsets
    for level 4. Use both the join and the subset pruning
    criteria to make the process more efficient.

    SOLUTION: There are NO candidate itemsets for level 4, as no pair of
    frequent itemsets from level 3 satisfies the join condition:

    2/7  {diagnosis=allergy, fever=no, swollen-glands=no}
    2/7  {diagnosis=cold, fever=yes, swollen-glands=no}
    2/7  {diagnosis=strep-throat, fever=no, swollen-glands=yes}
    2/7  {diagnosis=strep-throat, sore-throat=yes, swollen-glands=yes}

    That is, no two of these itemsets have exactly the same items, in the
    same (alphabetical) order, except for their last items.
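
    For reference, here is a compact Python sketch (mine, not the exam's)
    of the Apriori level-wise loop with the join and subset-pruning steps
    used above; items are "attribute=value" strings and the dataset is
    the one from Problem III:

      from itertools import combinations

      ROWS = [
          ("yes","yes","yes","strep-throat"), ("yes","no","yes","strep-throat"),
          ("no","no","yes","strep-throat"),   ("no","no","no","allergy"),
          ("yes","no","no","allergy"),        ("yes","yes","no","cold"),
          ("no","yes","no","cold"),
      ]
      NAMES = ("sore-throat", "fever", "swollen-glands", "diagnosis")
      # Each transaction is a sorted tuple of "attribute=value" items.
      TRANSACTIONS = [tuple(sorted(f"{n}={v}" for n, v in zip(NAMES, r)))
                      for r in ROWS]
      MIN_COUNT = 2   # minimum support 0.25 of 7 instances

      def support(itemset):
          return sum(set(itemset) <= set(t) for t in TRANSACTIONS)

      def apriori():
          items = sorted({i for t in TRANSACTIONS for i in t})
          level = [(i,) for i in items if support((i,)) >= MIN_COUNT]
          frequent = list(level)
          while level:
              candidates = []
              for a, b in combinations(level, 2):
                  # Join: same first k-1 items; skip joins that would put
                  # two values of the same attribute in one itemset.
                  if a[:-1] == b[:-1] and a[-1].split("=")[0] != b[-1].split("=")[0]:
                      c = a + (b[-1],)
                      # Subset pruning: every (k-1)-subset must be frequent.
                      if all(s in level for s in combinations(c, len(c) - 1)):
                          candidates.append(c)
              level = [c for c in candidates if support(c) >= MIN_COUNT]
              frequent += level
          return frequent

      for itemset in apriori():
          print(f"{support(itemset)}/7", itemset)
      # The level-3 output is exactly the 4 frequent itemsets listed above.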

  2. (5 points) Select one of your frequent itemsets from level 3 and construct an association rule from it.

    SOLUTION: I chose the level-3 itemset:

        2/7  {diagnosis=strep-throat, fever=no, swollen-glands=yes}

    Association rule: diagnosis=strep-throat => fever=no & swollen-glands=yes

    Compute the confidence of your rule. Show the steps of your
    calculations.

    CONFIDENCE(diagnosis=strep-throat => fever=no & swollen-glands=yes)

        SUPPORT(diagnosis=strep-throat, fever=no, swollen-glands=yes)
      = -------------------------------------------------------------
                    SUPPORT(diagnosis=strep-throat)

        2/7
      = ---  =  2/3  =  66%
        3/7
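
    The same computation as a one-off check (a sketch; the two support
    values are the ones just derived):

      def confidence(support_union, support_antecedent):
          """confidence(A => B) = support(A and B together) / support(A)."""
          return support_union / support_antecedent

      print(confidence(2/7, 3/7))   # 0.666... = 2/3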

  3. (10 points) Assume that we want to generate only association rules that have a particular item (or attribute-value) in their right-hand-sides. Suppose for instance that we want to generate only association rules that have diagnosis=allergy in their consequent. DESCRIBE the changes that you would make to the Apriori algorithm to generate just those rules in an EFFICIENT manner, given minimum support and minimum confidence thresholds. Use the dataset above to illustrate your ideas.

    SOLUTION:

    During frequent itemset generation: eliminate from level 1 all the
    1-itemsets of the form diagnosis=value with value different from
    allergy, as those itemsets can never be extended to contain
    diagnosis=allergy. (Note that eliminating every itemset that does not
    contain diagnosis=allergy would not work in general, as this would
    prevent rules that should appear in the output from appearing. More
    on this will be discussed in class and on Project 2.)

    During rule generation: use only frequent itemsets that contain
    diagnosis=allergy to construct rules. Also, consider only those
    splits of these frequent itemsets that put the required
    attribute-value diagnosis=allergy in the consequent, or
    right-hand side, of the rule.
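
    A sketch of these two changes on top of the earlier Apriori code (my
    code; apriori and support are the functions from the sketch after
    Problem III.1, and MIN_CONF is a hypothetical confidence threshold):

      from itertools import combinations

      TARGET = "diagnosis=allergy"
      MIN_CONF = 0.7   # hypothetical minimum confidence threshold

      # Change 1 -- during frequent itemset generation: filter level 1 so
      # that the only diagnosis 1-itemset kept is the target. This plugs
      # into the `level = [(i,) ...]` line of the earlier apriori().
      def filter_level1(level1):
          return [i for i in level1
                  if not i[0].startswith("diagnosis=") or i[0] == TARGET]

      # Change 2 -- during rule generation: use only frequent itemsets
      # containing the target, and only splits whose consequent includes it.
      def rules_for_target(frequent, support):
          rules = []
          for itemset in frequent:
              if TARGET not in itemset or len(itemset) < 2:
                  continue
              others = [i for i in itemset if i != TARGET]
              for k in range(1, len(others) + 1):
                  for antecedent in combinations(others, k):
                      consequent = tuple(i for i in itemset
                                         if i not in antecedent)
                      conf = support(itemset) / support(antecedent)
                      if conf >= MIN_CONF:
                          rules.append((antecedent, consequent, conf))
          return rules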