Given a test instance t and an integer k (k much smaller than 100), consider the following classification method:
- Find the centroid closest to the test instance, using Euclidean distance over the predicting attributes.
- Use Euclidean distance to select the k-nearest neighbors of t among the instances that belong to the cluster represented by that closest centroid.
- Use those k selected data instances to classify the test instance.
Will this classification method always make the same prediction (only faster) for a test instance t as the k-nearest neighbor classifier that uses the same Euclidean distance but computes the k-nearest neighbors over the entire dataset? Explain and, if at all possible, ILLUSTRATE your answer.
Solutions
No. This classification method and the k-nearest neighbor classifier may produce different classifications of a given test instance. Consider, for example, the following dataset, which contains 6 instances and two attributes. The first attribute, A, is numeric, and the second attribute is the target CLASS, with two possible values: yes and no.

         A    Class
  x1:    5    yes
  c1:   10    no
  x2:   15    no
  x3:   50    yes
  c2:   70    yes
  x4:   90    no

    x1   c1   x2        t        x3             c2             x4
_____|____|____|________|________|______________|______________|_____
     5   10   15       35       50             70             90

Assume that the dataset instances have been clustered into two clusters, {x1 = 5, c1 = 10, x2 = 15} and {x3 = 50, c2 = 70, x4 = 90}, with centroids c1 and c2 respectively. Let t = 35 be a test instance and let k = 1. The classification method described in this problem determines that the centroid closest to t is c1 (since |35 - 10| = 25 < |35 - 70| = 35), so t's nearest neighbor is selected from the first cluster: it is x2 = 15, and t is classified as no. The standard k-nearest neighbor classifier over the entire dataset, in contrast, selects x3 = 50 (at distance 15) as t's nearest neighbor and classifies t as yes.
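This counterexample can be checked programmatically. Below is a minimal Python sketch (the names data, clusters, knn_full, and knn_clustered are our own, not part of the problem statement) that reproduces both procedures for t = 35 and k = 1:

data = {5: "yes", 10: "no", 15: "no", 50: "yes", 70: "yes", 90: "no"}
clusters = [([5, 10, 15], 10), ([50, 70, 90], 70)]  # (member instances, centroid)

def knn_full(t, k=1):
    # Standard k-NN: search the entire dataset for the k nearest instances.
    nearest = sorted(data, key=lambda x: abs(x - t))[:k]
    return [data[x] for x in nearest]

def knn_clustered(t, k=1):
    # Step 1: find the cluster whose centroid is closest to t.
    members, _ = min(clusters, key=lambda c: abs(c[1] - t))
    # Steps 2-3: k-NN restricted to that cluster's members.
    nearest = sorted(members, key=lambda x: abs(x - t))[:k]
    return [data[x] for x in nearest]

print(knn_full(35))       # ['yes'] -- the nearest instance overall is x3 = 50
print(knn_clustered(35))  # ['no']  -- the nearest within cluster 1 is x2 = 15

The two procedures disagree because t falls between the clusters: its nearest centroid is c1, but its nearest instance, x3, lies in the other cluster.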
@relation modified-balance-scale
% This is a small, modified sample of the "Balance Scale Weight & Distance Database"
% from the UCI Data Repository
%
% Relevant Information:
%   This data set was generated to model psychological experimental results.
%   Each example is classified as having the balance scale tip to the right,
%   tip to the left, or be balanced. The attributes are the left weight, the
%   left distance, the right weight, and the right distance.

@attribute scale {L, B, R}
@attribute left-weight numeric
@attribute left-distance numeric
@attribute right-weight numeric
@attribute right-distance numeric

@data
%                                                     TARGET
% scale, left-weight, left-distance, right-weight, right-distance
(1) B, 1, 1, 1, 1
(2) R, 1, 1, 1, 5
(3) L, 1, 3, 1, 1
(4) L, 1, 3, 5, 4
(5) R, 1, 3, 5, 5

The purpose of this problem is to construct a tree to predict the attribute right-distance using the other four attributes (scale, left-weight, left-distance, and right-weight).
Solution: In order to convert the nominal attribute scale into binary attributes, we calculate the average right-distance for each scale value:
  B: average right-distance = 1
  R: average right-distance = (5 + 5)/2 = 5
  L: average right-distance = (1 + 4)/2 = 2.5
Now we sort those nominal values in decreasing order of average value: R, L, B. We introduce two binary attributes: scale=R and scale=R-or-L.
Solutions
@data
%                                                                   TARGET
% scale-R, scale-R-or-L, left-weight, left-distance, right-weight, right-distance
(1) 0, 0, 1, 1, 1, 1
(2) 1, 1, 1, 1, 1, 5
(3) 0, 1, 1, 3, 1, 1
(4) 0, 1, 1, 3, 5, 4
(5) 1, 1, 1, 3, 5, 5
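As a side note, this conversion is mechanical enough to script. A small Python sketch (variable names are ours) that derives both binary attributes from the sorted order R, L, B:

rows = [("B", 1), ("R", 5), ("L", 1), ("L", 4), ("R", 5)]  # (scale, right-distance)

# Average right-distance per scale value, sorted in decreasing order.
groups = {v: [rd for s, rd in rows if s == v] for v in {"B", "L", "R"}}
order = sorted(groups, key=lambda v: sum(groups[v]) / len(groups[v]), reverse=True)
print(order)  # ['R', 'L', 'B']

# One binary attribute per proper prefix of the sorted order.
for s, rd in rows:
    print(int(s in order[:1]), int(s in order[:2]), rd)  # scale-R, scale-R-or-L, target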
Solutions
  p1: scale-R = 0.5
  p2: scale-R-or-L = 0.5
  p3: left-distance = 2
  p4: right-weight = 3
std({1}) = 0
std({4}) = 0
std({5}) = 0
std({1, 1}) = 0
std({5, 5}) = 0
std({1, 4}) = 2.12
std({1, 5}) = 2.83
std({4, 5}) = 0.7
std({1, 1, 4}) = 1.73
std({1, 1, 5}) = 2.31
std({1, 4, 5}) = 2.08
std({1, 5, 5}) = 2.31
std({4, 5, 5}) = 0.58
std({1, 1, 4, 5}) = 2.06
std({1, 4, 5, 5}) = 1.89
std({1, 1, 4, 5, 5}) = 2.05

SHOW YOUR WORK.

Solutions
We select as the condition for the root node of our tree the split point that maximizes the value of the following formula:

SDR = sd(right-distance over all instances)
      - [ (k1/n)*sd(right-distance of instances with attribute value below the split point)
        + (k2/n)*sd(right-distance of instances with attribute value above the split point) ]

where sd stands for standard deviation, k1 is the number of instances with attribute value below the split point, k2 is the number of instances with attribute value above the split point, and n is the total number of instances.

p1: scale-R = 0.5
SDR(p1) = std({1, 1, 4, 5, 5}) - [ (3/5)*std({1, 1, 4}) + (2/5)*std({5, 5}) ]
        = 2.05 - [ (3/5)*1.73 + (2/5)*0 ] = 1.012

p2: scale-R-or-L = 0.5
SDR(p2) = std({1, 1, 4, 5, 5}) - [ (1/5)*std({1}) + (4/5)*std({1, 4, 5, 5}) ]
        = 2.05 - [ (1/5)*0 + (4/5)*1.89 ] = 0.538

p3: left-distance = 2
SDR(p3) = std({1, 1, 4, 5, 5}) - [ (2/5)*std({1, 5}) + (3/5)*std({1, 4, 5}) ]
        = 2.05 - [ (2/5)*2.83 + (3/5)*2.08 ] = -0.33

p4: right-weight = 3
SDR(p4) = std({1, 1, 4, 5, 5}) - [ (3/5)*std({1, 1, 5}) + (2/5)*std({4, 5}) ]
        = 2.05 - [ (3/5)*2.31 + (2/5)*0.7 ] = 0.384
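These SDR values can be double-checked with a few lines of Python. The sketch below uses the sample standard deviation (n - 1 in the denominator), which matches the std values listed above; the variable names are ours:

import statistics

# right-distance values on each side of every candidate split point.
splits = {
    "p1: scale-R = 0.5":      ([1, 1, 4], [5, 5]),
    "p2: scale-R-or-L = 0.5": ([1],       [1, 4, 5, 5]),
    "p3: left-distance = 2":  ([1, 5],    [1, 4, 5]),
    "p4: right-weight = 3":   ([1, 1, 5], [4, 5]),
}
all_values = [1, 1, 4, 5, 5]

def sd(xs):
    # Sample standard deviation; a singleton set has deviation 0.
    return statistics.stdev(xs) if len(xs) > 1 else 0.0

n = len(all_values)
for name, (below, above) in splits.items():
    sdr = sd(all_values) - (len(below)/n)*sd(below) - (len(above)/n)*sd(above)
    print(f"{name}: SDR = {sdr:.3f}")

# Prints SDR = 1.010, 0.535, -0.331, and 0.381; the small differences from the
# hand calculation above come from rounding the std values to two decimals.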
Solutions
The best split point is p1: scale-R = 0.5, since it is the split point with the highest standard deviation reduction (SDR).
Solutions
             scale-R <= 0.5
               /        \
         yes  /          \  no
             /            \
         leaf 1          leaf 2
      (1),(3),(4)       (2),(5)
DESCRIBE how the value that each of the leaf nodes will output as its prediction is computed.
Solutions
The average of the target attribute over all the instances at a given leaf node is used as the predicted value for any instance classified by that leaf node.

CALCULATE the precise value that each of the leaves in the tree above will output. Show your work.
Solutions
  Leaf 1: right-distance = (1 + 1 + 4)/3 = 2
  Leaf 2: right-distance = (5 + 5)/2 = 5
DESCRIBE how the value that each of the leaf nodes will output as its prediction is computed.
Solutions
Each leaf node outputs its predictions based on a linear equation. The linear equation at each leaf node is obtained by linear regression over the training instances found at that particular leaf node.

ILLUSTRATE what the function/formula that each of the leaves in the tree above will use to produce its output is like. (You don't have to produce the precise function; just illustrate what the function will be like.)
Solutions
Note that the value of the attribute left-weight is the same in all the instances in the dataset. Therefore, this attribute doesn't provide any information, and as such it can be removed from consideration.

Leaf 1: right-distance = w0 + w1*scale-R + w2*scale-R-or-L + w3*left-distance
where the weights w0, w1, w2, and w3 are found using linear regression over the data instances (1), (3), and (4).

Leaf 2: right-distance = v0 + v1*scale-R + v2*scale-R-or-L + v3*left-distance
where the weights v0, v1, v2, and v3 are found using linear regression over the data instances (2) and (5).
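Purely as an illustration of what fitting these leaf models involves (the problem does not require it), here is a numpy sketch that solves the least-squares problem for each leaf; the column layout follows the binarized table above, and the helper name fit is ours:

import numpy as np

# Columns: [scale-R, scale-R-or-L, left-distance]; target: right-distance.
leaf1_X = np.array([[0, 0, 1], [0, 1, 3], [0, 1, 3]], dtype=float)  # (1),(3),(4)
leaf1_y = np.array([1, 1, 4], dtype=float)
leaf2_X = np.array([[1, 1, 1], [1, 1, 3]], dtype=float)             # (2),(5)
leaf2_y = np.array([5, 5], dtype=float)

def fit(X, y):
    # Prepend a column of ones for the intercept, then solve least squares.
    A = np.hstack([np.ones((len(X), 1)), X])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w

print(fit(leaf1_X, leaf1_y))  # w0, w1, w2, w3 for leaf 1
print(fit(leaf2_X, leaf2_y))  # v0, v1, v2, v3 for leaf 2

Note that within each leaf some attributes are constant (e.g., scale-R is 0 for every instance at leaf 1), so in practice a model tree would drop those as well; the sketch keeps them only to mirror the formulas above.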
Consider the following dataset with only one attribute:
@relation one-dimension
@attribute A numeric
@data
2
5
10
12
3
20
30
11
25

Suppose that we want to cluster these data instances into 2 clusters. Follow the Simple k-means clustering algorithm, using 3 and 18 as the initial centroids.
Solutions:
                centroid 1   centroid 2   cluster 1                cluster 2
______________________________________________________________________________
1st iteration:      3           18        {2, 5, 10, 3}            {12, 20, 30, 11, 25}
2nd iteration:      5           19.6      {2, 5, 10, 3, 12, 11}    {20, 30, 25}
3rd iteration:      7.16        25        {2, 5, 10, 3, 12, 11}    {20, 30, 25}

The algorithm stops after the 3rd iteration, as the two clusters obtained on this iteration are identical to the two clusters obtained on the previous iteration.
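The iteration table can be reproduced by a direct one-dimensional implementation of Simple k-means. A Python sketch (variable names are ours; initial centroids 3 and 18 as above):

points = [2, 5, 10, 12, 3, 20, 30, 11, 25]
centroids = [3.0, 18.0]  # initial centroids

prev, iteration = None, 0
while True:
    iteration += 1
    # Assignment step: each point joins the cluster of its nearest centroid.
    clusters = [[], []]
    for p in points:
        clusters[min((0, 1), key=lambda i: abs(p - centroids[i]))].append(p)
    print(f"iteration {iteration}: centroids = {centroids}, clusters = {clusters}")
    if clusters == prev:  # same clusters as the previous iteration: stop
        break
    prev = clusters
    # Update step: each centroid becomes the mean of its cluster.
    centroids = [sum(c) / len(c) for c in clusters]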