@relation disease-diagnosis
% This toy dataset is taken from the book "Data Mining: A Tutorial-Based Primer"
% by R.J. Roiger and M.W. Geatz. Addison-Wesley. 2003.
@attribute sore-throat {yes, no}
@attribute fever {yes, no}
@attribute swollen-glands {yes, no}
@attribute congestion {yes, no}
@attribute headache {yes, no}
@attribute diagnosis {strep-throat, allergy, cold}
@data
yes, yes, yes, yes, yes, strep-throat
yes, no, yes, no, no, strep-throat
no, no, yes, no, no, strep-throat
no, no, no, yes, yes, allergy
no, no, no, yes, no, allergy
yes, no, no, yes, yes, allergy
yes, yes, no, yes, no, cold
no, yes, no, yes, no, cold
no, yes, no, yes, yes, cold
yes, yes, no, yes, yes, cold
The entropies of the predicting attributes with respect to diagnosis are the following:
Entropy of sore-throat with respect to diagnosis = 1.5
Entropy of fever with respect to diagnosis = 0.82
Entropy of swollen-glands with respect to diagnosis = 0.68
Entropy of congestion with respect to diagnosis = 1.12
Entropy of headache with respect to diagnosis = 1.5
For your convenience, the base-2 logarithms of selected values are provided.
x        1/2   1/3   2/3   1/4   3/4   1/5   2/5   3/5   1/6   5/6   1/7   2/7   3/7   4/7    1
log2(x)   -1  -1.6  -0.6    -2  -0.4  -2.3  -1.3  -0.7  -2.6  -0.3  -2.8  -1.8  -1.2  -0.8    0

SOLUTION
Here is the computation of the SORE-THROAT entropy with respect to DIAGNOSIS.
(By convention, the 0*log2(0) terms below are taken to be 0.)

                          DIAGNOSIS: strep-throat / allergy / cold
SORE-THROAT
 yes  (5/10)*[ - (2/5)*log2(2/5) - (1/5)*log2(1/5) - (2/5)*log2(2/5) ] = 0.75
 no   (5/10)*[ - (1/5)*log2(1/5) - (2/5)*log2(2/5) - (2/5)*log2(2/5) ] = 0.75
                                                                        ------
                                                                          1.5

Although you didn't need to compute the entropy of the remaining predicting
attributes, I include those calculations here for illustration purposes.

FEVER
 yes  (5/10)*[ - (1/5)*log2(1/5) - (0/5)*log2(0/5) - (4/5)*log2(4/5) ] = 0.35
 no   (5/10)*[ - (2/5)*log2(2/5) - (3/5)*log2(3/5) - (0/5)*log2(0/5) ] = 0.47
                                                                        ------
                                                                          0.82

SWOLLEN-GLANDS
 yes  (3/10)*[ - (3/3)*log2(3/3) - (0/3)*log2(0/3) - (0/3)*log2(0/3) ] = 0
 no   (7/10)*[ - (0/7)*log2(0/7) - (3/7)*log2(3/7) - (4/7)*log2(4/7) ] = 0.68
                                                                        ------
                                                                          0.68

CONGESTION
 yes  (8/10)*[ - (1/8)*log2(1/8) - (3/8)*log2(3/8) - (4/8)*log2(4/8) ] = 1.12
 no   (2/10)*[ - (2/2)*log2(2/2) - (0/2)*log2(0/2) - (0/2)*log2(0/2) ] = 0
                                                                        ------
                                                                          1.12

HEADACHE
 yes  (5/10)*[ - (1/5)*log2(1/5) - (2/5)*log2(2/5) - (2/5)*log2(2/5) ] = 0.75
 no   (5/10)*[ - (2/5)*log2(2/5) - (1/5)*log2(1/5) - (2/5)*log2(2/5) ] = 0.75
                                                                        ------
                                                                          1.5
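The weighted-entropy computation above can be sketched in a few lines of Python (this code is not part of the exam; the `entropy_wrt_diagnosis` helper is a name chosen here for illustration). Exact values differ slightly from the hand calculations, which round log2 to one decimal place.

```python
# A minimal sketch of the weighted entropy of a predicting attribute with
# respect to the class attribute diagnosis, over the 10-instance dataset.
import math
from collections import Counter

# (sore-throat, fever, swollen-glands, congestion, headache, diagnosis)
DATA = [
    ("yes", "yes", "yes", "yes", "yes", "strep-throat"),
    ("yes", "no",  "yes", "no",  "no",  "strep-throat"),
    ("no",  "no",  "yes", "no",  "no",  "strep-throat"),
    ("no",  "no",  "no",  "yes", "yes", "allergy"),
    ("no",  "no",  "no",  "yes", "no",  "allergy"),
    ("yes", "no",  "no",  "yes", "yes", "allergy"),
    ("yes", "yes", "no",  "yes", "no",  "cold"),
    ("no",  "yes", "no",  "yes", "no",  "cold"),
    ("no",  "yes", "no",  "yes", "yes", "cold"),
    ("yes", "yes", "no",  "yes", "yes", "cold"),
]
ATTRS = ["sore-throat", "fever", "swollen-glands", "congestion", "headache"]

def entropy_wrt_diagnosis(data, attr):
    """Class entropy within each value of attr, weighted by the value's frequency."""
    i = ATTRS.index(attr)
    total = len(data)
    result = 0.0
    for v in {row[i] for row in data}:
        subset = [row for row in data if row[i] == v]
        counts = Counter(row[-1] for row in subset)
        h = -sum((c / len(subset)) * math.log2(c / len(subset))
                 for c in counts.values())  # 0*log2(0) terms are simply omitted
        result += (len(subset) / total) * h
    return result

for a in ATTRS:
    print(f"{a}: {entropy_wrt_diagnosis(DATA, a):.2f}")
```

Swollen-glands comes out lowest, matching the hand computation.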
SOLUTION
Swollen-glands is chosen as the root node as it is the predicting attribute with the lowest entropy.
SOLUTION
When SWOLLEN-GLANDS is used as the root node of the tree, the tree looks like:

                    SWOLLEN-GLANDS
                     /          \
               yes  /            \  no
                   /              \
   diagnosis=strep-throat     3 instances with diagnosis=allergy
                              and 4 instances with diagnosis=cold

All the dataset instances in the lowest left-most node have the same
classification, diagnosis=strep-throat, and hence this is the prediction made
for that leaf of the tree. The lowest right-most node contains the following
set of heterogeneous instances:

no,  no,  no, yes, yes, allergy
no,  no,  no, yes, no,  allergy
yes, no,  no, yes, yes, allergy
yes, yes, no, yes, no,  cold
no,  yes, no, yes, no,  cold
no,  yes, no, yes, yes, cold
yes, yes, no, yes, yes, cold

and hence we need to split that node. Possible attributes that we can use to
split it are SORE-THROAT, FEVER, CONGESTION, and HEADACHE. We need to find the
attribute with the lowest entropy with respect to diagnosis over this smaller
set of instances. Note that in this subset of the data, the only two values of
the attribute DIAGNOSIS are "allergy" and "cold".

Entropy of SORE-THROAT with respect to DIAGNOSIS over the smaller dataset:

                          DIAGNOSIS: allergy / cold
SORE-THROAT
 yes  (3/7)*[ - (1/3)*log2(1/3) - (2/3)*log2(2/3) ] = 0.40
 no   (4/7)*[ - (2/4)*log2(2/4) - (2/4)*log2(2/4) ] = 0.57
                                                     ------
                                                      0.97

Entropy of FEVER with respect to DIAGNOSIS over the smaller dataset:

                          DIAGNOSIS: allergy / cold
FEVER
 yes  (4/7)*[ - (0/4)*log2(0/4) - (4/4)*log2(4/4) ] = 0
 no   (3/7)*[ - (3/3)*log2(3/3) - (0/3)*log2(0/3) ] = 0
                                                     ------
                                                        0

Since the entropy of an attribute cannot be lower than 0, there is no need to
keep computing the entropy of the remaining attributes, as none of them can be
a better "splitter" than FEVER. Hence, the right-most branch of the above tree
is split by FEVER:

                    SWOLLEN-GLANDS?
                     /          \
               yes  /            \  no
                   /              \
   diagnosis=strep-throat       FEVER?
                                /    \
                          yes  /      \  no
                              /        \
                    diagnosis=cold   diagnosis=allergy

Now, all the tree branches end on homogeneous nodes and hence the tree
construction ends with the tree above as the result.
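The finished tree can be written as a plain function (a sketch, not part of the exam; `classify` is a name chosen here for illustration): SWOLLEN-GLANDS at the root, FEVER under its "no" branch.

```python
def classify(sore_throat, fever, swollen_glands, congestion, headache):
    """Predict the diagnosis for one instance using the learned tree.
    congestion and headache are accepted but unused: the tree never
    needed them to separate the classes."""
    if swollen_glands == "yes":
        return "strep-throat"
    return "cold" if fever == "yes" else "allergy"

# All 10 training instances are classified correctly, e.g.:
print(classify("yes", "no", "yes", "no", "no"))   # strep-throat
print(classify("no", "yes", "no", "yes", "yes"))  # cold
```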
@relation 'disease-diagnosis-weka.filters.unsupervised.attribute.Remove-R4,5'
@attribute sore-throat {yes,no}
@attribute fever {yes,no}
@attribute swollen-glands {yes,no}
@attribute diagnosis {allergy,strep-throat,cold}
@data
yes, yes, yes, strep-throat
yes, no, yes, strep-throat
no, no, yes, strep-throat
no, no, no, allergy
yes, no, no, allergy
yes, yes, no, cold
no, yes, no, cold
SOLUTIONS
IF ? THEN diagnosis=allergy

Looking for the best condition to add to the left-hand side of the rule:

  CONDITION             p/t
  sore-throat=yes       1/4
  sore-throat=no        1/3
  fever=yes             0/3
  fever=no              2/4
  swollen-glands=yes    0/3
  swollen-glands=no     2/4

Both fever=no and swollen-glands=no have the best p/t value. Arbitrarily, we
choose fever=no as the 1st condition. The rule is still not perfect, as its
accuracy is 50%.

IF fever=no and ? THEN diagnosis=allergy

Looking for the best condition to add to the left-hand side of the rule:

  CONDITION                          p/t
  fever=no and sore-throat=yes       1/2
  fever=no and sore-throat=no        1/2
  fever=no and swollen-glands=yes    0/2
  fever=no and swollen-glands=no     2/2

The best condition to add to fever=no is swollen-glands=no, resulting in the
rule:

  IF fever=no and swollen-glands=no THEN diagnosis=allergy

The rule is now perfect, as its accuracy over the training data is 100%.
Hence, we are done with the construction of this rule. We now remove the
dataset instances covered by this rule. Since no instances with
diagnosis=allergy remain in the dataset, we are done with the construction of
rules predicting diagnosis=allergy. The resulting set of rules consists of the
single rule:

  IF fever=no and swollen-glands=no THEN diagnosis=allergy
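The p/t bookkeeping above can be sketched as follows (not exam code; `p_over_t` is a helper name chosen here): t is the number of instances covered by the rule body, and p is the number of covered instances whose diagnosis is the target class.

```python
# The 7-instance dataset from the ARFF listing above.
# (sore-throat, fever, swollen-glands, diagnosis)
DATA7 = [
    ("yes", "yes", "yes", "strep-throat"),
    ("yes", "no",  "yes", "strep-throat"),
    ("no",  "no",  "yes", "strep-throat"),
    ("no",  "no",  "no",  "allergy"),
    ("yes", "no",  "no",  "allergy"),
    ("yes", "yes", "no",  "cold"),
    ("no",  "yes", "no",  "cold"),
]
ATTRS = ["sore-throat", "fever", "swollen-glands"]

def p_over_t(data, conditions, target):
    """conditions: list of (attribute, value) pairs forming the rule body."""
    covered = [row for row in data
               if all(row[ATTRS.index(a)] == v for a, v in conditions)]
    p = sum(1 for row in covered if row[-1] == target)
    return p, len(covered)

for a in ATTRS:
    for v in ("yes", "no"):
        p, t = p_over_t(DATA7, [(a, v)], "allergy")
        print(f"{a}={v}: {p}/{t}")
```

Extending the body with a second condition reuses the same helper, e.g. `p_over_t(DATA7, [("fever", "no"), ("swollen-glands", "no")], "allergy")` yields the perfect 2/2 rule found above.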
Here are two sample alternate solutions taken from the students' exam
solutions.

Taken from Amanda Bazner's exam solution:

  I have a method for creating a tradeoff between rule length and accuracy
  that is independent of the measure used; p/t would work, as would any other
  measure. Provide to the algorithm a minimum level of confidence Cmin and a
  maximum rule length Lmax. The step deltaC is (1 - Cmin)/(Lmax - 1). Then a
  rule of length L must have accuracy at least Cmin + (L - 1)*deltaC to be
  kept. For example, if the minimum confidence is 0.7 and the maximum rule
  length is 4:

    Cmin = 0.7, Lmax = 4, deltaC = 0.3/3 = 0.1

  so rules of length 1 must be at least (0.7 + 0*deltaC) = 70% accurate to be kept,
  rules of length 2 must be at least (0.7 + 1*deltaC) = 80% accurate to be kept,
  rules of length 3 must be at least (0.7 + 2*deltaC) = 90% accurate to be kept, and
  rules of length 4 must be (0.7 + 3*deltaC) = 100% accurate to be kept.

Taken from James Martineau's exam solution:

  An information-gain scheme would work, since it compares the goodness of
  the two rules before choosing. Choosing a minimum information gain
  necessary to justify lengthening the rule could keep rules shorter: only
  conditions that result in significant improvement would be added.
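The first scheme can be sketched in one small function (`min_accuracy` is a hypothetical helper name; Cmin, Lmax, and deltaC are the student's notation):

```python
# Minimum accuracy a rule of the given length must reach under the
# length/accuracy tradeoff scheme: Cmin + (L - 1) * deltaC.
def min_accuracy(length, c_min=0.7, l_max=4):
    delta_c = (1 - c_min) / (l_max - 1)
    return c_min + (length - 1) * delta_c

print([round(min_accuracy(l), 1) for l in range(1, 5)])  # [0.7, 0.8, 0.9, 1.0]
```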
@relation 'disease-diagnosis-weka.filters.unsupervised.attribute.Remove-R4,5'
@attribute sore-throat {yes,no}
@attribute fever {yes,no}
@attribute swollen-glands {yes,no}
@attribute diagnosis {strep-throat,allergy,cold}
@data
yes, yes, yes, strep-throat
yes, no, yes, strep-throat
no, no, yes, strep-throat
no, no, no, allergy
yes, no, no, allergy
yes, yes, no, cold
no, yes, no, cold

Assume that we want to mine association rules with minimum support 0.25
(that is, an itemset has to be present in at least 2 data instances).
LEVEL 1

SUPPORT  ITEMSETS
2/7  {diagnosis=allergy}
2/7  {diagnosis=cold}
3/7  {diagnosis=strep-throat}
4/7  {fever=no}
3/7  {fever=yes}
3/7  {sore-throat=no}
4/7  {sore-throat=yes}
4/7  {swollen-glands=no}
3/7  {swollen-glands=yes}

LEVEL 2

SUPPORT  ITEMSETS
2/7  {diagnosis=allergy, fever=no}
0/7  {diagnosis=allergy, fever=yes}
1/7  {diagnosis=allergy, sore-throat=no}
1/7  {diagnosis=allergy, sore-throat=yes}
2/7  {diagnosis=allergy, swollen-glands=no}
0/7  {diagnosis=allergy, swollen-glands=yes}
0/7  {diagnosis=cold, fever=no}
2/7  {diagnosis=cold, fever=yes}
1/7  {diagnosis=cold, sore-throat=no}
1/7  {diagnosis=cold, sore-throat=yes}
2/7  {diagnosis=cold, swollen-glands=no}
0/7  {diagnosis=cold, swollen-glands=yes}
2/7  {diagnosis=strep-throat, fever=no}
1/7  {diagnosis=strep-throat, fever=yes}
1/7  {diagnosis=strep-throat, sore-throat=no}
2/7  {diagnosis=strep-throat, sore-throat=yes}
0/7  {diagnosis=strep-throat, swollen-glands=no}
3/7  {diagnosis=strep-throat, swollen-glands=yes}
2/7  {fever=no, sore-throat=no}
2/7  {fever=no, sore-throat=yes}
2/7  {fever=no, swollen-glands=no}
2/7  {fever=no, swollen-glands=yes}
1/7  {fever=yes, sore-throat=no}
2/7  {fever=yes, sore-throat=yes}
2/7  {fever=yes, swollen-glands=no}
1/7  {fever=yes, swollen-glands=yes}
2/7  {sore-throat=no, swollen-glands=no}
1/7  {sore-throat=no, swollen-glands=yes}
2/7  {sore-throat=yes, swollen-glands=no}
2/7  {sore-throat=yes, swollen-glands=yes}

LEVEL 3

Compute all the candidate and frequent itemsets for level 3.
Use both the join and the subset pruning criteria to make the process more
efficient.

SOLUTION:

SUPPORT  ITEMSETS
2/7  {diagnosis=allergy, fever=no, swollen-glands=no}
2/7  {diagnosis=cold, fever=yes, swollen-glands=no}
1/7  {diagnosis=strep-throat, fever=no, sore-throat=yes}
2/7  {diagnosis=strep-throat, fever=no, swollen-glands=yes}
2/7  {diagnosis=strep-throat, sore-throat=yes, swollen-glands=yes}
1/7  {fever=no, sore-throat=no, swollen-glands=no}
XXX  {fever=no, sore-throat=no, swollen-glands=yes}
     (no need to compute the support of this itemset, as its subset
     {sore-throat=no, swollen-glands=yes} is not frequent)
1/7  {fever=no, sore-throat=yes, swollen-glands=no}
1/7  {fever=no, sore-throat=yes, swollen-glands=yes}
1/7  {fever=yes, sore-throat=yes, swollen-glands=no}

Hence there are 10 candidate itemsets and only 4 frequent itemsets.

LEVEL 4

Compute all the candidate and frequent itemsets for level 4. Use both the
join and the subset pruning criteria to make the process more efficient.

SUPPORT  ITEMSETS
SOLUTION:

There are NO candidate itemsets for level 4, as no pair of frequent itemsets
from level 3 satisfies the join condition:

2/7  {diagnosis=allergy, fever=no, swollen-glands=no}
2/7  {diagnosis=cold, fever=yes, swollen-glands=no}
2/7  {diagnosis=strep-throat, fever=no, swollen-glands=yes}
2/7  {diagnosis=strep-throat, sore-throat=yes, swollen-glands=yes}

That is, no two of them have exactly the same items, in the same
(alphabetical) order, except for the last item.
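The level-3 join and subset-pruning steps can be sketched as follows (not exam code; the helper names are chosen here for illustration). Items are "attribute=value" strings kept in alphabetical order; two frequent 2-itemsets join when they share their first item, and a joined candidate is discarded outright if it pairs two values of the same attribute, which matches the count of 10 candidates in the solution.

```python
from itertools import combinations

# The 7 instances, each a tuple of attribute=value items in alphabetical order.
ROWS = [
    ("diagnosis=strep-throat", "fever=yes", "sore-throat=yes", "swollen-glands=yes"),
    ("diagnosis=strep-throat", "fever=no",  "sore-throat=yes", "swollen-glands=yes"),
    ("diagnosis=strep-throat", "fever=no",  "sore-throat=no",  "swollen-glands=yes"),
    ("diagnosis=allergy",      "fever=no",  "sore-throat=no",  "swollen-glands=no"),
    ("diagnosis=allergy",      "fever=no",  "sore-throat=yes", "swollen-glands=no"),
    ("diagnosis=cold",         "fever=yes", "sore-throat=yes", "swollen-glands=no"),
    ("diagnosis=cold",         "fever=yes", "sore-throat=no",  "swollen-glands=no"),
]
MIN_COUNT = 2  # minimum support 0.25 over 7 instances

def support_count(itemset):
    return sum(1 for row in ROWS if set(itemset) <= set(row))

# Frequent 2-itemsets (all 1-itemsets are frequent in this dataset).
items = sorted({item for row in ROWS for item in row})
freq2 = [c for c in combinations(items, 2) if support_count(c) >= MIN_COUNT]

# Join step: same first item, different attributes in the last position.
candidates = [a + (b[-1],) for a, b in combinations(freq2, 2)
              if a[:-1] == b[:-1]
              and a[-1].split("=")[0] != b[-1].split("=")[0]]

# Subset pruning, then support counting for the survivors.
pruned = [c for c in candidates
          if all(support_count(s) >= MIN_COUNT for s in combinations(c, 2))]
freq3 = [c for c in pruned if support_count(c) >= MIN_COUNT]

print(len(candidates), len(freq3))  # 10 candidates, 4 frequent
```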
SOLUTION:

I chose the level-3 itemset:

  2/7  {diagnosis=strep-throat, fever=no, swollen-glands=yes}

Association rule: diagnosis=strep-throat => fever=no & swollen-glands=yes

Compute the confidence of your rule. Show the steps of your calculations.

CONFIDENCE(diagnosis=strep-throat => fever=no & swollen-glands=yes)

    SUPPORT(diagnosis=strep-throat, fever=no, swollen-glands=yes)
  = -------------------------------------------------------------
    SUPPORT(diagnosis=strep-throat)

    2/7
  = ---  = 2/3 ≈ 67%
    3/7
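The fraction above can be checked programmatically (a sketch, not exam code; `confidence` is a helper name chosen here): confidence(X => Y) = support(X ∪ Y) / support(X), over the 7 instances.

```python
# Each instance as a set of attribute=value items.
ROWS = [
    {"diagnosis=strep-throat", "fever=yes", "sore-throat=yes", "swollen-glands=yes"},
    {"diagnosis=strep-throat", "fever=no", "sore-throat=yes", "swollen-glands=yes"},
    {"diagnosis=strep-throat", "fever=no", "sore-throat=no", "swollen-glands=yes"},
    {"diagnosis=allergy", "fever=no", "sore-throat=no", "swollen-glands=no"},
    {"diagnosis=allergy", "fever=no", "sore-throat=yes", "swollen-glands=no"},
    {"diagnosis=cold", "fever=yes", "sore-throat=yes", "swollen-glands=no"},
    {"diagnosis=cold", "fever=yes", "sore-throat=no", "swollen-glands=no"},
]

def support_count(itemset):
    """Number of instances containing every item in the itemset."""
    return sum(1 for row in ROWS if itemset <= row)

def confidence(antecedent, consequent):
    return support_count(antecedent | consequent) / support_count(antecedent)

c = confidence({"diagnosis=strep-throat"}, {"fever=no", "swollen-glands=yes"})
print(round(c, 2))  # 0.67
```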
SOLUTION:

During frequent itemset generation: eliminate from Level 1 all the 1-itemsets
diagnosis=value with value different from allergy, as those itemsets will
never be extended to contain diagnosis=allergy. (Note that eliminating
itemsets that do not contain diagnosis=allergy would not work in general, as
this would prevent rules that should appear in the output from appearing.
More on this will be discussed in class and on Project 2.)

During rule generation: use only frequent itemsets that contain
diagnosis=allergy to construct rules. Also, consider only those splits of
these frequent itemsets that put the required attribute-value
diagnosis=allergy in the consequent, or right-hand side, of the rule.
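The Level-1 filtering step described above amounts to one list comprehension (a sketch, not exam code): keep diagnosis=allergy, drop the other diagnosis items, and keep all non-diagnosis items.

```python
# The nine frequent 1-itemsets from Level 1, as attribute=value strings.
level1 = ["diagnosis=allergy", "diagnosis=cold", "diagnosis=strep-throat",
          "fever=no", "fever=yes", "sore-throat=no", "sore-throat=yes",
          "swollen-glands=no", "swollen-glands=yes"]

# Drop diagnosis items other than the required diagnosis=allergy.
kept = [i for i in level1
        if not i.startswith("diagnosis=") or i == "diagnosis=allergy"]
print(kept)
```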