CS4445 Data Mining and Knowledge Discovery in Databases. A-2004
Solutions Exam 2 - October 14, 2004

By Prof. Carolina Ruiz
Department of Computer Science
Worcester Polytechnic Institute

Problem I. Instance Based Learning (25 points)

Consider a dataset with a nominal target attribute (i.e., a nominal CLASS) and several predicting attributes. Suppose that the dataset contains 1000 instances and that these instances have been clustered into 10 clusters, each one containing roughly 100 instances. Let c1, c2, c3, ..., c9, c10 be the cluster centroids. The clustering has been performed using Euclidean distance over the predicting attributes (without using the target attribute). Consider the following classification method:
Given a test instance t and an integer k (k is much smaller than 100):

  1. Find the closest centroid to the test instance using Euclidean distance over the predicting attributes.

  2. Use Euclidean distance to select the k-nearest neighbors of t among those instances that belong to the cluster represented by the closest centroid.

  3. Use those k selected data instances to classify the test instance.
Will this classification method always make the same prediction (only faster) for a test instance t as the k-nearest neighbor classifier that is based on the same Euclidean distance but computes the k-nearest neighbors over the entire dataset? Explain and, if at all possible, ILLUSTRATE your answer.
Solutions

No. This classification method and the k-nearest neighbor classifier may produce 
different classifications of a given test instance.
Consider, for example, the following dataset, which contains 6 instances and two
attributes. The first attribute, A, is numeric, and the second attribute is the
target CLASS, with two possible values: yes and no.
        
       A   Class  
       
     x1:  5    yes
     c1: 10    no
     x2: 15    no
     x3: 50    yes
     c2: 70    yes
     x4: 90    no


    x1  c1   x2                 t              x3                c2                 x4

_________________________________________________________________________________________________
|        |        |         |        |         |        |         |        |         |         |
1       10       20        30       40        50       60        70       80        90       100

Assume that the dataset instances have been clustered into two clusters:
{x1= 5, c1= 10, x2= 15} and {x3= 50, c2= 70, x4= 90} with centroids c1 and c2.

Let t=35 be a test instance and let k=1.
The classification method described in this problem will determine that c1 is the
closest centroid to t (distance |35 - 10| = 25, versus |35 - 70| = 35 for c2), and
that x2 = 15 is the nearest neighbor of t within c1's cluster {x1, c1, x2}.
Therefore, it will output "no", the CLASS value of x2.

On the other hand, the 1-nearest neighbor classifier over the entire dataset will 
determine that x3 is the nearest neighbor of t and therefore, it will output "yes",
the CLASS value of x3.
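
To make this concrete, here is a minimal Python sketch of both procedures on the
toy dataset above (the data structures and function names are illustrative only,
not part of the exam or of any particular library). Running it prints "no" for
the cluster-restricted 1-NN and "yes" for the plain 1-NN:

    # Illustrative sketch: cluster-restricted k-NN vs. plain k-NN on the toy 1-D data.

    # (value, class) pairs for x1, c1, x2, x3, c2, x4
    data = [(5, "yes"), (10, "no"), (15, "no"), (50, "yes"), (70, "yes"), (90, "no")]

    # the two clusters, keyed by their centroid value
    clusters = {10: [(5, "yes"), (10, "no"), (15, "no")],
                70: [(50, "yes"), (70, "yes"), (90, "no")]}

    def knn_predict(instances, t, k):
        """Majority class among the k instances closest to t (Euclidean distance = |a - t| in 1-D)."""
        neighbors = sorted(instances, key=lambda inst: abs(inst[0] - t))[:k]
        labels = [label for _, label in neighbors]
        return max(set(labels), key=labels.count)

    def cluster_knn_predict(clusters, t, k):
        """Step 1: pick the closest centroid; steps 2-3: k-NN restricted to that cluster."""
        closest_centroid = min(clusters, key=lambda c: abs(c - t))
        return knn_predict(clusters[closest_centroid], t, k)

    t, k = 35, 1
    print(cluster_knn_predict(clusters, t, k))  # "no"  (closest centroid c1=10, nearest member x2=15)
    print(knn_predict(data, t, k))              # "yes" (nearest instance in the whole dataset is x3=50)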


Problem II. Numeric Predictions (50 points)

Consider the following dataset. Note that instances have been labeled with a number in parentheses so that you can refer to them in your solutions.


@relation modified-balance-scale

% This is a small, modified sample of the "Balance Scale Weight & Distance Database"
% from the UCI Data Repository
% 
%  Relevant Information: 
%	This data set was generated to model psychological
%	experimental results.  Each example is classified as having the
%	balance scale tip to the right, tip to the left, or be
%	balanced.  The attributes are the left weight, the left
%	distance, the right weight, and the right distance.  


@attribute scale  {L, B, R}
@attribute left-weight    numeric
@attribute left-distance  numeric
@attribute right-weight   numeric
@attribute right-distance numeric

@data
%                                                            TARGET
%      scale  left-weight  left-distance  right-weight   right-distance

(1)      B,          1,           1,              1,             1

(2)      R,          1,           1,              1,             5

(3)      L,          1,           3,              1,             1

(4)      L,          1,           3,              5,             4

(5)      R,          1,           3,              5,             5


The purpose of this problem is to construct a tree to predict the attribute right-distance using the other four attributes (scale, left-weight, left-distance, and right-weight).

  1. (6 points) Transforming nominal attributes to numeric. Transform the nominal scale attribute into as many numeric/binary attributes as needed. Show your work.
    
         Solution:
    
           In order to convert the nominal attribute scale into binary ones, 
           we calculate the average right-distance for each  scale value:
    
    	B: average  right-distance = 1
    	R: average  right-distance = 5 = (5 + 5)/2
    	L: average  right-distance = 2.5 = (1 + 4)/2
    
          Now, we sort those nominal values in decreasing order of average value:
            R, L, B. 
    
          We introduce two binary attributes: 
          scale=R and scale=R-or-L
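
      As a quick check, a minimal Python sketch of this conversion (the variable
      names are illustrative only):

        # Illustrative sketch: average right-distance per scale value, then sort the
        # values in decreasing order of that average to define the binary attributes.
        rows = [("B", 1), ("R", 5), ("L", 1), ("L", 4), ("R", 5)]   # (scale, right-distance)

        averages = {}
        for value in ("B", "R", "L"):
            targets = [rd for s, rd in rows if s == value]
            averages[value] = sum(targets) / len(targets)

        print(averages)                                          # {'B': 1.0, 'R': 5.0, 'L': 2.5}
        order = sorted(averages, key=averages.get, reverse=True)
        print(order)                                             # ['R', 'L', 'B']
        # -> binary attributes: scale=R     ("is the value first in the order?")
        #    and scale=R-or-L               ("is the value among the first two?")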
    
    

  2. (4 points) Show the transformed dataset in the space provided below:
    Solutions
    @data
    %                                                                   TARGET
    %  scale-R scale-R-or-L  left-weight  left-distance  right-weight   right-distance
    
    (1)    0,        0,        1,           1,              1,             1
    
    (2)    1,        1,        1,           1,              1,             5
    
    (3)    0,        1,        1,           3,              1,             1
    
    (4)    0,        1,        1,           3,              5,             4
    
    (5)    1,        1,        1,           3,              5,             5
    
    

  3. (8 Points) Identifying candidate split points. List all the candidate split points for each of the predicting attributes that need to be considered to find the best split point for the root node of the regression/model tree.
    Solutions
    p1 :  scale-R  = 0.5
    p2 :  scale-R-or-L  = 0.5
    p3 :  left-distance  = 2 
    p4 :  right-weight = 3 
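
    (Note that left-weight takes a single value in this dataset, so it contributes no
    candidate split point.) A minimal Python sketch of how these candidates arise, as
    midpoints between consecutive distinct values of each transformed attribute (the
    list names are illustrative only):

      # Illustrative sketch: enumerate candidate split points as midpoints between
      # consecutive distinct values of each (transformed) predicting attribute.
      instances = [   # scale-R, scale-R-or-L, left-weight, left-distance, right-weight
          (0, 0, 1, 1, 1),
          (1, 1, 1, 1, 1),
          (0, 1, 1, 3, 1),
          (0, 1, 1, 3, 5),
          (1, 1, 1, 3, 5),
      ]
      names = ["scale-R", "scale-R-or-L", "left-weight", "left-distance", "right-weight"]

      for i, name in enumerate(names):
          values = sorted({row[i] for row in instances})
          midpoints = [(a + b) / 2 for a, b in zip(values, values[1:])]
          print(name, midpoints)
      # scale-R [0.5], scale-R-or-L [0.5], left-weight [] (constant attribute),
      # left-distance [2.0], right-weight [3.0]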
    

  4. (12 Points) Evaluating candidate split points. Compute the SDR (Standard Deviation Reduction) of each of the candidate split points that you listed above. For your convenience, the following standard deviations (std) are provided:
      std({1}) = 0
      std({4}) = 0
      std({5}) = 0
      std({1, 1}) = 0
      std({5, 5}) = 0
      std({1, 4}) = 2.12
      std({1, 5}) = 2.83
      std({4, 5}) = 0.7
      std({1, 1, 4}) = 1.73
      std({1, 1, 5}) = 2.31
      std({1, 4, 5}) = 2.08
      std({1, 5, 5}) = 2.31
      std({4, 5, 5}) = 0.58
      std({1, 1, 4, 5}) = 2.06
      std({1, 4, 5, 5}) = 1.89
      std({1, 1, 4, 5, 5}) = 2.05
    
    SHOW YOUR WORK.
    
    Solutions
    
         We select as the condition for the root node of our tree
         the split point that maximizes the value of the following formula:
    
           SDR = sd(right-distance over all instances)
                 - [ (k1/n)*sd(right-distance of instances with attribute value below the split point)
                   + (k2/n)*sd(right-distance of instances with attribute value above the split point) ]
    
                 where sd stands for standard deviation.
                 k1 is the number of instances with attribute value below split point.
                 k2 is the number of instances with attribute value above split point.
                 n is the number of instances.
    
    
    
    p1 :  scale-R = 0.5
    
      SDR(p1) = std({1, 1, 4, 5, 5}) - [ (3/5) * std({1,1,4}) + (2/5) * std({5,5}) ]
              = 2.05 - [ (3/5)*1.73 + (2/5)*0 ]
              = 1.012
    
    p2 :  scale-R-or-L = 0.5
    
      SDR(p2) = std({1, 1, 4, 5, 5}) - [ (1/5) * std({1}) + (4/5) * std({1,4,5,5}) ]
              = 2.05 - [ (1/5)*0 + (4/5)*1.89 ]
              = 0.538
    
    p3 :  left-distance = 2
    
      SDR(p3) = std({1, 1, 4, 5, 5}) - [ (2/5) * std({1,5}) + (3/5) * std({1,4,5}) ]
              = 2.05 - [ (2/5)*2.83 + (3/5)*2.08 ]
              = -0.33
    
    p4 :  right-weight = 3
    
      SDR(p4) = std({1, 1, 4, 5, 5}) - [ (3/5) * std({1,1,5}) + (2/5) * std({4,5}) ]
              = 2.05 - [ (3/5)*2.31 + (2/5)*0.7 ]
              = 0.384
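
    These four SDR values can be double-checked with a minimal Python sketch (it uses
    sample standard deviations, which match the std values provided above; the results
    agree with the hand computations up to rounding):

      # Illustrative sketch: recompute the four SDR values.
      from statistics import stdev

      def sd(values):
          return stdev(values) if len(values) > 1 else 0.0   # stdev needs >= 2 values

      def sdr(left, right):
          all_values = left + right
          n = len(all_values)
          return sd(all_values) - (len(left) / n) * sd(left) - (len(right) / n) * sd(right)

      print(round(sdr([1, 1, 4], [5, 5]), 3))      # p1: scale-R = 0.5       ->  1.01
      print(round(sdr([1], [1, 4, 5, 5]), 3))      # p2: scale-R-or-L = 0.5  ->  0.535
      print(round(sdr([1, 5], [1, 4, 5]), 3))      # p3: left-distance = 2   -> -0.331
      print(round(sdr([1, 1, 5], [4, 5]), 3))      # p4: right-weight = 3    ->  0.381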
    
    

  5. (3 Points) Choosing the best candidate split point. According to the SDR's that you computed above select the best split point.
    
    Solutions
    
    The best split point is p1: scale-R=0.5, since it is the split point
    with the highest standard deviation reduction SDR.
    
    
    
    

  6. (2 Points) Constructing the root of the tree. Construct the root node of the tree with the split point that you determined to be the right one. ALSO, WRITE DOWN WHICH INSTANCES BELONG TO each of the two leaves.
    
    Solutions
    
    		 scale-R <= 0.5
                      /           \
                 yes /             \  no
                    /               \
                 leaf 1            leaf 2
    	  (1),(3),(4)          (2),(5)
    
    
    

  7. Constructing the leaves of the tree. Assume that we decide not to split either of the two leaves above any further (note that each leaf contains 4 or fewer data instances).

    1. (7 Points) Regression Tree. Assume that we will use the tree above as a regression tree.

      DESCRIBE how the value that each of the leaf nodes will output as its prediction is computed.

      
      
      Solutions
      
       The average of the target attribute for all the instances at a given leaf
                node is used as the predicted value for an instance classified by that
                leaf node. 
      
      
      
      CALCULATE the precise value that each of the leaves in the tree above will output. Show your work.
      
          Leaf 1: right-distance = (1 + 1 + 4)/3 = 2
      
      
          Leaf 2: right-distance = (5 + 5)/2 = 5
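
      A minimal Python check of these two leaf values:

        # Illustrative check: a regression-tree leaf predicts the mean of the
        # target values of its training instances.
        leaf1_targets = [1, 1, 4]    # right-distance of instances (1), (3), (4)
        leaf2_targets = [5, 5]       # right-distance of instances (2), (5)
        print(sum(leaf1_targets) / len(leaf1_targets))   # 2.0
        print(sum(leaf2_targets) / len(leaf2_targets))   # 5.0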
      
      

    2. (8 Points) Model Tree. Assume that we will use the tree above as a model tree.

      DESCRIBE how the value that each of the leaf nodes will output as its prediction is computed.

      
      Solutions
      
      
              Each leaf node will output its predictions based on a linear equation.
              The linear equation at each leaf node is formed by a linear regression
              over the training instances found at that particular leaf node.
      
      
      
      ILLUSTRATE what the function/formula that each of the leaves in the tree above will use to produce its output looks like. (You don't have to produce the precise function; just illustrate what the function will be like.)
      Solutions
      
      Note that the value of the attribute left-weight is the same in all the
      instances in the dataset. Therefore, this attribute doesn't provide any
      information and can be removed from consideration.
      
        Leaf 1:  right-distance = w0 + w1*scale-R + w2*scale-R-or-L + w3*left-distance + w4*right-weight
      
              where the weights w0, w1, w2, w3, and w4 are found using linear regression
              over the data instances (1), (3), and (4).
      
        Leaf 2:  right-distance = v0 + v1*scale-R + v2*scale-R-or-L + v3*left-distance + v4*right-weight
      
              where the weights v0, v1, v2, v3, and v4 are found using linear regression
              over the data instances (2) and (5).
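
      To illustrate the mechanism, here is a minimal Python/NumPy sketch that fits one
      least-squares linear model per leaf. With only two or three training instances per
      leaf the system is underdetermined, so the weights it prints are just one possible
      solution; a real model-tree learner such as M5 would also prune attributes and
      smooth the predictions.

        # Illustrative sketch only: one ordinary-least-squares linear model per leaf.
        import numpy as np

        # columns: scale-R, scale-R-or-L, left-distance, right-weight ; target: right-distance
        leaf1_X = np.array([[0, 0, 1, 1],     # instance (1)
                            [0, 1, 3, 1],     # instance (3)
                            [0, 1, 3, 5]])    # instance (4)
        leaf1_y = np.array([1, 1, 4])

        leaf2_X = np.array([[1, 1, 1, 1],     # instance (2)
                            [1, 1, 3, 5]])    # instance (5)
        leaf2_y = np.array([5, 5])

        def fit_leaf(X, y):
            """Return the weights [w0, w1, ...] of y ~ w0 + w1*x1 + ... (least squares)."""
            X1 = np.hstack([np.ones((X.shape[0], 1)), X])     # prepend an intercept column
            weights, *_ = np.linalg.lstsq(X1, y, rcond=None)  # minimum-norm solution if underdetermined
            return weights

        print(fit_leaf(leaf1_X, leaf1_y))    # weights of Leaf 1's linear equation
        print(fit_leaf(leaf2_X, leaf2_y))    # weights of Leaf 2's linear equation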
      

Problem III. Clustering (25 Points)

(Adapted from "Data Mining: Introductory and Advanced Topics" by M.H. Dunham, Prentice Hall. 2003.)

Consider the following dataset with only one attribute:

@relation one-dimension

@attribute A numeric 

@data
   2
   5
  10
  12
   3
  20
  30
  11
  25
Suppose that we want to cluster these data instances into 2 clusters. Follow the Simple k-means clustering algorithm, with k = 2 and initial centroids 3 and 18 (the centroids shown in the first iteration of the table below), to cluster this dataset until it terminates. Show your work in the table provided below.
Solutions:

             centroid 1     centroid 2         cluster 1              cluster 2 
                                         (list instances that      (list instances that      
                                          belong to this cluster)   belong to this cluster)
______________________________________________________________________________________________

1st iteration:      3              18       {2,5,10,3}                 {12,20,30,11,25}
2nd iteration:      5              19.6     {2,5,10,3,12,11}           {20,30,25}
3rd iteration:      7.16           25       {2,5,10,3,12,11}           {20,30,25}


The algorithm stops after the 3rd iteration as the two clusters obtained on this iteration
are identical to the two clusters obtained on the previous iteration.
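
These iterations can be reproduced with a minimal Python sketch (it hard-codes the
initial centroids 3 and 18 shown in the first iteration of the table above):

    # Illustrative sketch: 1-D k-means with k = 2, starting from centroids 3 and 18.
    data = [2, 5, 10, 12, 3, 20, 30, 11, 25]
    centroids = [3, 18]

    while True:
        # assignment step: each instance goes to its closest centroid
        clusters = [[], []]
        for x in data:
            closest = min(range(2), key=lambda i: abs(x - centroids[i]))
            clusters[closest].append(x)
        print(centroids, clusters)
        # update step: recompute each centroid as the mean of its cluster;
        # stop when the centroids (and hence the clusters) no longer change
        new_centroids = [sum(c) / len(c) for c in clusters]
        if new_centroids == centroids:
            break
        centroids = new_centroids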