CS4445 Data Mining and Knowledge Discovery in Databases. A-2004
Solutions Exam 2 - October 14, 2004

By Prof. Carolina Ruiz
Department of Computer Science
Worcester Polytechnic Institute

Problem I. Instance Based Learning (25 points)

Consider a dataset with a nominal target attribute (i.e., a nominal CLASS) and several predicting attributes. Suppose that the dataset contains 1000 instances and that these instances have been clustered into 10 clusters, each one containing roughly 100 instances. Let c1, c2, c3, ..., c9, c10 be the cluster centroids. The clustering has been performed using Euclidean distance over the predicting attributes (without using the target attribute). Consider the following classification method:
Given a test instance t and an integer k (k is much smaller than 100):

  1. Find the closest centroid to the test instance using Euclidean distance over the predicting attributes.

  2. Use Euclidean distance to select the k-nearest neighbors of t among those instances that belong to the cluster represented by the closest centroid.

  3. Use those k selected data instances to classify the test instance.
Will this classification method always make the same prediction (only faster) for a test instance t as the k-nearest neighbor classifier that is based on the same Euclidean distance but computes the k-nearest neighbors over the entire dataset? Explain and, if at all possible, ILLUSTRATE your answer.
Solutions

No. This classification method and the k-nearest neighbor classifier may produce 
different classifications of a given test instance.
Consider, for example, the following dataset, which contains 6 instances and two
attributes. The first attribute, A, is numeric, and the second attribute is the
target CLASS, with two possible values: yes and no.
        
       A   Class  
       
     x1:  5    yes
     c1: 10    no
     x2: 15    no
     x3: 50    yes
     c2: 70    yes
     x4: 90    no


    x1  c1   x2                 t              x3                c2                 x4

_________________________________________________________________________________________________
|        |        |         |        |         |        |         |        |         |         |
1       10       20        30       40        50       60        70       80        90       100

Assume that the dataset instances have been clustered into two clusters:
{x1= 5, c1= 10, x2= 15} and {x3= 50, c2= 70, x4= 90} with centroids c1 and c2.

Let t=35 be a test instance and let k=1.
The classification method described in this problem will determine that c1 is the
closest centroid to t (distance |35 - 10| = 25, versus |35 - 70| = 35 for c2), and
that x2 = 15 is the nearest neighbor of t within c1's cluster {x1, c1, x2}.
Therefore, it will output "no", the CLASS value of x2.

On the other hand, the 1-nearest neighbor classifier over the entire dataset will 
determine that x3 is the nearest neighbor of t and therefore, it will output "yes",
the CLASS value of x3.
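
To make this concrete, here is a minimal Python sketch of both procedures on the
toy dataset above (the data structures and function names are illustrative only,
not part of the exam or of any particular library). Running it prints "no" for
the cluster-restricted 1-NN and "yes" for the plain 1-NN:

    # Illustrative sketch: cluster-restricted k-NN vs. plain k-NN on the toy 1-D data.

    # (value, class) pairs for x1, c1, x2, x3, c2, x4
    data = [(5, "yes"), (10, "no"), (15, "no"), (50, "yes"), (70, "yes"), (90, "no")]

    # the two clusters, keyed by their centroid value
    clusters = {10: [(5, "yes"), (10, "no"), (15, "no")],
                70: [(50, "yes"), (70, "yes"), (90, "no")]}

    def knn_predict(instances, t, k):
        """Majority class among the k instances closest to t (Euclidean distance = |a - t| in 1-D)."""
        neighbors = sorted(instances, key=lambda inst: abs(inst[0] - t))[:k]
        labels = [label for _, label in neighbors]
        return max(set(labels), key=labels.count)

    def cluster_knn_predict(clusters, t, k):
        """Step 1: pick the closest centroid; steps 2-3: k-NN restricted to that cluster."""
        closest_centroid = min(clusters, key=lambda c: abs(c - t))
        return knn_predict(clusters[closest_centroid], t, k)

    t, k = 35, 1
    print(cluster_knn_predict(clusters, t, k))  # "no"  (closest centroid c1=10, nearest member x2=15)
    print(knn_predict(data, t, k))              # "yes" (nearest instance in the whole dataset is x3=50)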


Problem II. Numeric Predictions (50 points)

Consider the following dataset. Note that instances have been labeled with a number in parentheses so that you can refer to them in your solutions.


@relation modified-balance-scale

% This is a small, modified sample of the "Balance Scale Weight & Distance Database"
% from the UCI Data Repository
% 
%  Relevant Information: 
%	This data set was generated to model psychological
%	experimental results.  Each example is classified as having the
%	balance scale tip to the right, tip to the left, or be
%	balanced.  The attributes are the left weight, the left
%	distance, the right weight, and the right distance.  


@attribute scale  {L, B, R}
@attribute left-weight    numeric
@attribute left-distance  numeric
@attribute right-weight   numeric
@attribute right-distance numeric

@data
%                                                            TARGET
%      scale  left-weight  left-distance  right-weight   right-distance

(1)      B,          1,           1,              1,             1

(2)      R,          1,           1,              1,             5

(3)      L,          1,           3,              1,             1

(4)      L,          1,           3,              5,             4

(5)      R,          1,           3,              5,             5


The purpose of this problem is to construct a tree to predict the attribute right-distance using the other four attributes (scale, left-weight, left-distance, and right-weight).

  1. (6 points) Transforming nominal attributes to numeric. Transform the nominal scale attribute into as many numeric/binary attributes as needed. Show your work.
    
         Solution:
    
           In order to convert the nominal attribute scale into binary ones, 
           we calculate the average right-distance for each  scale value:
    
    	B: average  right-distance = 1
    	R: average  right-distance = 5 = (5 + 5)/2
    	L: average  right-distance = 2.5 = (1 + 4)/2
    
          Now, we sort those nominal values in decreasing order of average value:
            R, L, B. 
    
          We introduce two binary attributes: 
          scale=R and scale=R-or-L
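
      As a quick check, a minimal Python sketch of this conversion (the variable
      names are illustrative only):

        # Illustrative sketch: average right-distance per scale value, then sort the
        # values in decreasing order of that average to define the binary attributes.
        rows = [("B", 1), ("R", 5), ("L", 1), ("L", 4), ("R", 5)]   # (scale, right-distance)

        averages = {}
        for value in ("B", "R", "L"):
            targets = [rd for s, rd in rows if s == value]
            averages[value] = sum(targets) / len(targets)

        print(averages)                                          # {'B': 1.0, 'R': 5.0, 'L': 2.5}
        order = sorted(averages, key=averages.get, reverse=True)
        print(order)                                             # ['R', 'L', 'B']
        # -> binary attributes: scale=R     ("is the value first in the order?")
        #    and scale=R-or-L               ("is the value among the first two?")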
    
    

  2. (4 points) Show the transformed dataset in the space provided below:
    Solutions
    @data
    %                                                                   TARGET
    %  scale-R scale-R-or-L  left-weight  left-distance  right-weight   right-distance
    
    (1)    0,        0,        1,           1,              1,             1
    
    (2)    1,        1,        1,           1,              1,             5
    
    (3)    0,        1,        1,           3,              1,             1
    
    (4)    0,        1,        1,           3,              5,             4
    
    (5)    1,        1,        1,           3,              5,             5
    
    

  3. (8 Points) Identifying candidate split points. List all the candidate split points for each of the predicting attributes that need to be considered to find the best split point for the root node of the regression/model tree.
    Solutions
    p1 :  scale-R  = 0.5
    p2 :  scale-R-or-L  = 0.5
    p3 :  left-distance  = 2 
    p4 :  right-weight = 3 
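
    (Note that left-weight takes a single value in this dataset, so it contributes no
    candidate split point.) A minimal Python sketch of how these candidates arise, as
    midpoints between consecutive distinct values of each transformed attribute (the
    list names are illustrative only):

      # Illustrative sketch: enumerate candidate split points as midpoints between
      # consecutive distinct values of each (transformed) predicting attribute.
      instances = [   # scale-R, scale-R-or-L, left-weight, left-distance, right-weight
          (0, 0, 1, 1, 1),
          (1, 1, 1, 1, 1),
          (0, 1, 1, 3, 1),
          (0, 1, 1, 3, 5),
          (1, 1, 1, 3, 5),
      ]
      names = ["scale-R", "scale-R-or-L", "left-weight", "left-distance", "right-weight"]

      for i, name in enumerate(names):
          values = sorted({row[i] for row in instances})
          midpoints = [(a + b) / 2 for a, b in zip(values, values[1:])]
          print(name, midpoints)
      # scale-R [0.5], scale-R-or-L [0.5], left-weight [] (constant attribute),
      # left-distance [2.0], right-weight [3.0]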
    

  4. (12 Points) Evaluating candidate split points. Compute the SDR (Standard Deviation Reduction) of each of the candidate split points that you listed above. For your convenience, the following standard deviations (std) are provided:
      std({1}) = 0
      std({4}) = 0
      std({5}) = 0
      std({1, 1}) = 0
      std({5, 5}) = 0
      std({1, 4}) = 2.12
      std({1, 5}) = 2.83
      std({4, 5}) = 0.7
      std({1, 1, 4}) = 1.73
      std({1, 1, 5}) = 2.31
      std({1, 4, 5}) = 2.08
      std({1, 5, 5}) = 2.31
      std({4, 5, 5}) = 0.58
      std({1, 1, 4, 5}) = 2.06
      std({1, 4, 5, 5}) = 1.89
      std({1, 1, 4, 5, 5}) = 2.05
    
    SHOW YOUR WORK.
    
    Solutions
    
         We select as the condition for the root node of our tree
         the split point that maximizes the value of the following formula:
    
           SDR = sd(right-distance over all instances)
                 - [ (k1/n)*sd(right-distance of instances with attribute value below the split point)
                   + (k2/n)*sd(right-distance of instances with attribute value above the split point) ]
    
                 where sd stands for standard deviation.
                 k1 is the number of instances with attribute value below split point.
                 k2 is the number of instances with attribute value above split point.
                 n is the number of instances.
    
    
    
    p1 :  scale-R = 0.5
    
      SDR(p1) = std({1, 1, 4, 5, 5}) - [ (3/5) * std({1,1,4}) + (2/5) * std({5,5}) ]
              = 2.05 - [ (3/5)*1.73 + (2/5)*0 ]
              = 1.012
    
    p2 :  scale-R-or-L = 0.5
    
      SDR(p2) = std({1, 1, 4, 5, 5}) - [ (1/5) * std({1}) + (4/5) * std({1,4,5,5}) ]
              = 2.05 - [ (1/5)*0 + (4/5)*1.89 ]
              = 0.538
    
    p3 :  left-distance = 2
    
      SDR(p3) = std({1, 1, 4, 5, 5}) - [ (2/5) * std({1,5}) + (3/5) * std({1,4,5}) ]
              = 2.05 - [ (2/5)*2.83 + (3/5)*2.08 ]
              = -0.33
    
    p4 :  right-weight = 3
    
      SDR(p4) = std({1, 1, 4, 5, 5}) - [ (3/5) * std({1,1,5}) + (2/5) * std({4,5}) ]
              = 2.05 - [ (3/5)*2.31 + (2/5)*0.7 ]
              = 0.384
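
    These four SDR values can be double-checked with a minimal Python sketch (it uses
    sample standard deviations, which match the std values provided above; the results
    agree with the hand computations up to rounding):

      # Illustrative sketch: recompute the four SDR values.
      from statistics import stdev

      def sd(values):
          return stdev(values) if len(values) > 1 else 0.0   # stdev needs >= 2 values

      def sdr(left, right):
          all_values = left + right
          n = len(all_values)
          return sd(all_values) - (len(left) / n) * sd(left) - (len(right) / n) * sd(right)

      print(round(sdr([1, 1, 4], [5, 5]), 3))      # p1: scale-R = 0.5       ->  1.01
      print(round(sdr([1], [1, 4, 5, 5]), 3))      # p2: scale-R-or-L = 0.5  ->  0.535
      print(round(sdr([1, 5], [1, 4, 5]), 3))      # p3: left-distance = 2   -> -0.331
      print(round(sdr([1, 1, 5], [4, 5]), 3))      # p4: right-weight = 3    ->  0.381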
    
    

  5. (3 Points) Choosing the best candidate split point. According to the SDR's that you computed above select the best split point.
    
    Solutions
    
    The best split point is p1: scale-R=0.5, since it is the split point
    with the highest standard deviation reduction SDR.
    
    
    
    

  6. (2 Points) Constructing the root of the tree. Construct the root node of the tree with the split point that you determined to be the right one. ALSO, WRITE DOWN WHICH INSTANCES BELONG TO each of the two leaves.
    
    Solutions
    
    		 scale-R <= 0.5
                      /           \
                 yes /             \  no
                    /               \
                 leaf 1            leaf 2
    	  (1),(3),(4)          (2),(5)
    
    
    

  7. Constructing the leaves of the tree. Assume that we decide not to split either of the two leaves above any further (note that each leaf contains 4 or fewer data instances).

    1. (7 Points) Regression Tree. Assume that we will use the tree above as a regression tree.

      DESCRIBE how the value that each of the leaf nodes will output as its prediction is computed.

      
      
      Solutions
      
       The average of the target attribute for all the instances at a given leaf
                node is used as the predicted value for an instance classified by that
                leaf node. 
      
      
      
      CALCULATE the precise value that each of the leaves in the tree above will output. Show your work.
      
          Leaf 1: right-distance = (1 + 1 + 4)/3 = 2
      
      
          Leaf 2: right-distance = (5 + 5)/2 = 5
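
      A minimal Python check of these two leaf values:

        # Illustrative check: a regression-tree leaf predicts the mean of the
        # target values of its training instances.
        leaf1_targets = [1, 1, 4]    # right-distance of instances (1), (3), (4)
        leaf2_targets = [5, 5]       # right-distance of instances (2), (5)
        print(sum(leaf1_targets) / len(leaf1_targets))   # 2.0
        print(sum(leaf2_targets) / len(leaf2_targets))   # 5.0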
      
      

    2. (8 Points) Model Tree. Assume that we will use the tree above as a model tree.

      DESCRIBE how the value that each of the leaf nodes will output as its prediction is computed.

      
      Solutions
      
      
              Each leaf node will output its predictions based on a linear equation.
              The linear equation at each leaf node is formed by a linear regression
              over the training instances found at that particular leaf node.
      
      
      
      ILLUSTRATE what the function/formula that each of the leaves in the tree above will use to produce its output looks like. (You don't have to produce the precise function; just illustrate what the function will be like.)
      Solutions
      
      Note that the value of the attribute left-weight is the same in all the
      instances in the dataset. Therefore, this attribute doesn't provide any
      information and can be removed from consideration.
      
        Leaf 1:  right-distance = w0 + w1*scale-R + w2*scale-R-or-L + w3*left-distance + w4*right-weight
      
              where the weights w0, w1, w2, w3, and w4 are found using linear regression
              over the data instances (1), (3), and (4).
      
        Leaf 2:  right-distance = v0 + v1*scale-R + v2*scale-R-or-L + v3*left-distance + v4*right-weight
      
              where the weights v0, v1, v2, v3, and v4 are found using linear regression
              over the data instances (2) and (5).
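
      To illustrate the mechanism, here is a minimal Python/NumPy sketch that fits one
      least-squares linear model per leaf. With only two or three training instances per
      leaf the system is underdetermined, so the weights it prints are just one possible
      solution; a real model-tree learner such as M5 would also prune attributes and
      smooth the predictions.

        # Illustrative sketch only: one ordinary-least-squares linear model per leaf.
        import numpy as np

        # columns: scale-R, scale-R-or-L, left-distance, right-weight ; target: right-distance
        leaf1_X = np.array([[0, 0, 1, 1],     # instance (1)
                            [0, 1, 3, 1],     # instance (3)
                            [0, 1, 3, 5]])    # instance (4)
        leaf1_y = np.array([1, 1, 4])

        leaf2_X = np.array([[1, 1, 1, 1],     # instance (2)
                            [1, 1, 3, 5]])    # instance (5)
        leaf2_y = np.array([5, 5])

        def fit_leaf(X, y):
            """Return the weights [w0, w1, ...] of y ~ w0 + w1*x1 + ... (least squares)."""
            X1 = np.hstack([np.ones((X.shape[0], 1)), X])     # prepend an intercept column
            weights, *_ = np.linalg.lstsq(X1, y, rcond=None)  # minimum-norm solution if underdetermined
            return weights

        print(fit_leaf(leaf1_X, leaf1_y))    # weights of Leaf 1's linear equation
        print(fit_leaf(leaf2_X, leaf2_y))    # weights of Leaf 2's linear equation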
      

Problem III. Clustering (25 Points)

(Adapted from "Data Mining: Introductory and Advanced Topics" by M.H. Dunham, Prentice Hall. 2003.)

Consider the following dataset with only one attribute:

@relation one-dimension

@attribute A numeric 

@data
   2
   5
  10
  12
   3
  20
  30
  11
  25
Suppose that we want to cluster these data instances into 2 clusters. Follow the Simple k-means clustering algorithm, with k = 2 and initial centroids 3 and 18 (the centroids shown in the first iteration of the table below), to cluster this dataset until it terminates. Show your work in the table provided below.
Solutions:

             centroid 1     centroid 2         cluster 1              cluster 2 
                                         (list instances that      (list instances that      
                                          belong to this cluster)   belong to this cluster)
______________________________________________________________________________________________

1st iteration:      3              18       {2,5,10,3}                 {12,20,30,11,25}
2nd iteration:      5              19.6     {2,5,10,3,12,11}           {20,30,25}
3rd iteration:      7.16           25       {2,5,10,3,12,11}           {20,30,25}


The algorithm stops after the 3rd iteration as the two clusters obtained on this iteration
are identical to the two clusters obtained on the previous iteration.
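
These iterations can be reproduced with a minimal Python sketch (it hard-codes the
initial centroids 3 and 18 shown in the first iteration of the table above):

    # Illustrative sketch: 1-D k-means with k = 2, starting from centroids 3 and 18.
    data = [2, 5, 10, 12, 3, 20, 30, 11, 25]
    centroids = [3, 18]

    while True:
        # assignment step: each instance goes to its closest centroid
        clusters = [[], []]
        for x in data:
            closest = min(range(2), key=lambda i: abs(x - centroids[i]))
            clusters[closest].append(x)
        print(centroids, clusters)
        # update step: recompute each centroid as the mean of its cluster;
        # stop when the centroids (and hence the clusters) no longer change
        new_centroids = [sum(c) / len(c) for c in clusters]
        if new_centroids == centroids:
            break
        centroids = new_centroids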