CS539 MACHINE LEARNING. SPRING 99
SOLUTIONS - Practice Exam 1

PROF. CAROLINA RUIZ
Department of Computer Science
Worcester Polytechnic Institute


Problem 1 (20 Points): Concept Learning

Consider the following set of training examples (taken from Dean, Allen, and Aloimonos' book) to train a robot janitor to predict whether or not an office contains a recycling bin.

     STATUS    FLOOR   DEPT.   OFFICE SIZE   RECYCLING BIN?
  1. faculty   four    cs      medium        yes
  2. faculty   four    ee      medium        yes
  3. student   four    cs      small         no
  4. faculty   five    cs      medium        yes

Assume that each attribute takes on only the values that appear in the table.

  1. What is the size of the set of instances for this example?
    
    Attributes:  STATUS  FLOOR  DEPT.  OFFICE SIZE
    # values:      2    x  2   x  2   x    2       = 16
    
    
  2. What is the size of the hypothesis space?
    
    Attributes:  STATUS  FLOOR  DEPT.  OFFICE SIZE
    # values:      3    x  3   x  3   x    3        + 1 = 82
                   |                                  |
              (= 2 + "?")                   [null,null,null,null]
    
    
  3. Give a sequence of S and G boundary sets computed by the CANDIDATE-ELIMINATION algorithm if it is given the sequence of examples above in the order in which they appear on the table.
    
    S0 = {[null,null,null,null]}
    S1 = {[faculty,four,cs,medium]}
    S2 = {[faculty,four,?,medium]}
    S3 = S2
    S4 = {[faculty,?,?,medium]}
    
    G4 = G3
    G3 = {[faculty,?,?,?], [?,?,?,medium]}
    G2 = G1
    G1 = G0
    G0 = {[?,?,?,?]}
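
    (A short Python sketch that reproduces this trace is given at the end of
    this problem.)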
    
    
  4. Suggest an ordering of the examples that would minimize the sizes of the intermediate S and G sets. Explain your answer.
    Note that the sum of the sizes of the intermediate S and G sets
    is at least 2*(# of examples + 1) = 10, counting S0 and G0, since 
    each S and G has at least one element.
    
    For the trace shown above the sum of the sizes is 12 (each Si has one
    element; G0 through G2 have one element each, and G3 and G4 have two).
    Since the final G always contains 2 elements, the minimal sum of the
    sizes is obtained when the "last G" (i.e. G4) is the only one containing
    2 elements. That can be achieved by processing all the positive examples
    before the negative example; in that case the sum of the sizes is 11.
     
    
  5. Suggest a query guaranteed to reduce the size of the Version Space regardless of how the trainer classifies it.
    
    Query = [faculty,four,cs,small]
    
    The current version space contains three hypotheses: [faculty,?,?,medium],
    [faculty,?,?,?], and [?,?,?,medium]. The query satisfies only the second of
    them, so either answer from the trainer eliminates at least one hypothesis.
    
    If this is a positive instance, 
     then the boundaries can be updated to:
      S = [faculty,?,?,?]
      G = [faculty,?,?,?]
    
    If this is a negative instance, 
     then the boundaries can be updated to:
      S = [faculty,?,?,medium]
      G = [?,?,?,medium]
    
    
  6. (Extra Credit) Complete the proof of Theorem 2.1. from the textbook. That is, prove that every hypothesis that is consistent with the training data lies between the most specific and the most general boundaries S and G in the partially ordered hypothesis space.
    By way of contradiction, assume that there are hypotheses in H
    that are consistent with D but do not lie between S and G.
    Let K be the set of all such hypotheses. That is,
    K = { h in H | Consistent(h,D) and h does not lie between
                   S and G in the partially ordered hypothesis space}
    Below, VS denotes the version space, i.e. the set of consistent
    hypotheses that do lie between S and G.
    Let h be a maximally specific hypothesis in K (i.e. h has the
    property that there is no h' in K such that h' is strictly more
    specific than h).
    
    Case 1: If there is no hypothesis h' in H consistent with D that
    is more specific than h, then by definition h would belong to S,
    contradicting our assumption that h belongs to K.
    
    Case 2: If there is a hypothesis h' in H consistent with D that is
    more specific than h, then h' belongs to VS: since h is maximally
    specific in K, h' cannot belong to K, and a consistent hypothesis
    outside K lies between S and G. In particular, there is some s in S
    with s <=g h' <=g h. Now we just need to find some h'' in VS such
    that h'' is more general than or equal to h; then h <=g h'' <=g g
    for some g in G, so h would belong to VS, which would contradict
    our assumption that h belongs to K.
    
       Case 2.1: If there is no consistent hypothesis in H strictly more
       general than h, then by definition h belongs to G, and we can take
       h'' = h, yielding a contradiction.
    
       Case 2.2: If there is a consistent hypothesis q in H strictly more
       general than h, then:
    
          Case 2.2.1: If q belongs to VS, then we can take h'' = q and
          we are done.
          
          Case 2.2.2: If q belongs to K, then construct a chain
          h <=g q <=g h_1 <=g h_2 <=g ... <=g h_n where h_n is a maximally
          general hypothesis in K. Following the same arguments as in
          Case 1 and the beginning of Case 2 (with "general" and "specific"
          interchanged), one can show that either h'' = h_n belongs to G
          or there is some h'' in VS such that h_n <=g h''. In both cases
          h <=g h'', and hence h would belong to VS, yielding a
          contradiction.
    
    Since all the cases above yield a contradiction, K must be
    empty, and so every hypothesis that is consistent with the
    training data lies between the most specific and the most general
    boundaries S and G in the partially ordered hypothesis space.
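
    The S and G trace in part 3 can also be reproduced mechanically. Below is a
    minimal Python sketch of the CANDIDATE-ELIMINATION algorithm for this
    conjunctive hypothesis representation; the tuple encoding of hypotheses and
    the names DOMAINS, EXAMPLES, covers, etc. are illustrative assumptions, not
    part of the original solution.

    # Minimal sketch of CANDIDATE-ELIMINATION for conjunctive hypotheses over
    # discrete attributes. A hypothesis is a tuple whose entries are a specific
    # value, '?' (any value), or None (the constraint of the null hypothesis).

    DOMAINS = [("faculty", "student"),   # STATUS
               ("four", "five"),         # FLOOR
               ("cs", "ee"),             # DEPT.
               ("medium", "small")]      # OFFICE SIZE

    EXAMPLES = [(("faculty", "four", "cs", "medium"), True),
                (("faculty", "four", "ee", "medium"), True),
                (("student", "four", "cs", "small"), False),
                (("faculty", "five", "cs", "medium"), True)]

    def covers(h, x):
        """Does hypothesis h classify instance x as positive?"""
        return all(hv == '?' or hv == xv for hv, xv in zip(h, x))

    def more_general_or_equal(h1, h2):
        """h1 >=g h2: every instance covered by h2 is also covered by h1."""
        if any(v is None for v in h2):   # the null hypothesis covers nothing
            return True
        return all(a == '?' or a == b for a, b in zip(h1, h2))

    def min_generalization(h, x):
        """The (unique) minimal generalization of h that covers the positive x."""
        return tuple(xv if hv is None else (hv if hv == xv else '?')
                     for hv, xv in zip(h, x))

    def min_specializations(h, x):
        """All minimal specializations of h that exclude the negative x."""
        specs = []
        for i, (hv, xv) in enumerate(zip(h, x)):
            if hv == '?':
                specs += [h[:i] + (v,) + h[i + 1:] for v in DOMAINS[i] if v != xv]
        return specs

    def candidate_elimination(examples):
        S = [tuple([None] * len(DOMAINS))]   # S0: the null hypothesis
        G = [tuple(['?'] * len(DOMAINS))]    # G0: the all-'?' hypothesis
        for t, (x, positive) in enumerate(examples, start=1):
            if positive:
                # drop members of G that miss the positive example, then
                # generalize S minimally and keep only members below some g in G
                G = [g for g in G if covers(g, x)]
                S = [s if covers(s, x) else min_generalization(s, x) for s in S]
                S = [s for s in S if any(more_general_or_equal(g, s) for g in G)]
                # (S stays a singleton for conjunctive hypotheses, so no
                #  minimality pruning of S is needed in this sketch)
            else:
                # drop members of S that cover the negative example, then
                # specialize the members of G that cover it
                S = [s for s in S if not covers(s, x)]
                newG = []
                for g in G:
                    if not covers(g, x):
                        newG.append(g)
                    else:
                        newG += [sp for sp in min_specializations(g, x)
                                 if any(more_general_or_equal(sp, s) for s in S)]
                # keep only the maximally general members of G
                G = [g for g in newG if not any(
                        g2 != g and more_general_or_equal(g2, g) for g2 in newG)]
            print("S%d = %s" % (t, S))
            print("G%d = %s" % (t, G))
        return S, G

    candidate_elimination(EXAMPLES)

    Running this prints the same S1..S4 and G1..G4 listed in part 3.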
          
    

Problem 2 (20 Points): Decision Trees

The following table containing Student Exam Performance Data is taken from section 7.4 of A. Cawsey, "The Essence of Artificial Intelligence", Prentice Hall Europe, 1998.

No.  Student   First last year?   Male?   Works hard?   Drinks?   First this year?
1    Richard   yes                yes     no            yes       yes
2    Alan      yes                yes     yes           no        yes
3    Alison    no                 no      yes           no        yes
4    Jeff      no                 yes     no            yes       no
5    Gail      yes                no      yes           yes       yes
6    Simon     no                 yes     yes           yes       no

  1. What is the entropy of this collection of training examples with respect to the target function classification?
         entropy(S) = - (4/6)log_2(4/6) - (2/6)log_2(2/6) = 0.9183
         
         Here log_2 denotes the logarithm in base 2; S contains 4 positive and
         2 negative examples.
         
  2. Construct a minimal decision tree for this dataset. Show your work.
                                    
         entropyS(1st-last-year) 
         = (3/6)[-(3/3)log_2(3/3) - 0] + (3/6)[-(1/3)log_2(1/3) - (2/3)log_2(2/3)]
         = 0.459
    
         entropyS(male)
         = (4/6)[-(1/2)log_2(1/2) - (1/2)log_2(1/2)] + (2/6)[-(2/2)log_2(2/2) - 0]
         = 0.666
    
         entropyS(works-hard)
         = (4/6)[-(3/4)log_2(3/4) - (1/4)log_2(1/4)] + (2/6)[-(1/2)log_2(1/2) - (1/2)log_2(1/2)]
         = 0.8741
         
         entropyS(drinks)
         = (4/6)[-(1/2)log_2(1/2) - (1/2)log_2(1/2)] + (2/6)[-(2/2)log_2(2/2) - 0]
         = 0.666
    
         1st-last-year is the attribute with the smallest entropy and so we use it for the
         root of the tree:
    
    
                               1st-last-year?
                               /           \
                          yes /             \ no
                             /               \
                     1+, 2+, 5+          3+, 4-, 6-
                        YES
    
          We need to split the rightmost node which contains examples 3, 4, 6 only.
          Let S' be:
    
    
     No.  Student   First last year?   Male?   Works hard?   Drinks?   First this year?
     3    Alison    no                 no      yes           no        yes
     4    Jeff      no                 yes     no            yes       no
     6    Simon     no                 yes     yes           yes       no

     Now we need to select, from male?, works-hard?, and drinks?, the attribute
     with the smallest entropy w.r.t. S':

          entropyS'(drinks?) = (2/3)[-(2/2)log_2(2/2) - 0] + (1/3)[-(1/1)log_2(1/1) - 0] = 0

     Since this is the smallest possible entropy, drinks? is chosen, giving the tree:

                               1st-last-year?
                               /           \
                          yes /             \ no
                             /               \
                     1+, 2+, 5+            drinks?
                        YES                /      \
                                       no /        \ yes
                                         /          \
                                        3+         4-, 6-
                                       YES           NO

     (The entropy computations in parts 1 and 2 can be reproduced with the Python
     sketch at the end of this problem.)
  3. Use your decision tree from the previous part to classify the following instances:
    No.  Student   First last year?   Male?   Works hard?   Drinks?   First this year?
    7    Matthew   no                 yes     no            yes       ??
    8    Mary      no                 no      yes           yes       ??
    Both instances are classified as negative by the previous decision tree.
    
  4. Suppose that the trainer tells you that your classifications of the instances are incorrect (i.e. the right classification of each of the instances is "yes" if you said "no", and "no" if you said "yes"). Discuss a way to post-process your tree (without constructing it from scratch again) so that the resulting decision tree better represents the training data together with the two test instances.
    Since we classified both instances as "no", their correct classification is
    "yes". The drinks? node can then be pruned and replaced by a leaf labeled
    with the majority vote of the instances that reach it. That is,
    
                               1st-last-year?
                               /           \
                          yes /             \ no
                             /               \
                     1+, 2+, 5+          3+, 4-, 6-, 7+, 8+
                        YES              Majority vote = YES
    
    The accuracy of the tree remains the same (6/8), but the tree is now smaller.
    
    Another possibility is to replace drinks? by male? (which also has
    entropy 0 with respect to S'). In that case only one of the eight instances
    (no. 7, Matthew) is misclassified, and hence the accuracy of the decision
    tree increases to 7/8.
    
  5. (Extra Credit) Suppose that we are trying to learn a concept C and that A is one of the predicting attributes. Let n be the number of values that C can assume and k the number of values that A can assume. (For instance, if C=PlayTennis and A is Outlook, then n = 2 (yes,no) and k=3 (sunny, overcast, rainy).) What is the interval of real numbers in which the value of the entropy of A with respect to S lies? In other words, find real numbers a and b such that a <= entropy(A) <= b.
    0 <= entropy(A) <= log_2(k)
    
    The entropy of A's value distribution over S is 0 when every example in S
    has the same value of A, and it is largest, log_2(k), when the k values of
    A are equally represented in S.
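
    The entropy values in parts 1 and 2, and the classification of instances 7
    and 8 in part 3, can be double-checked with the short Python sketch below.
    The dictionary encoding of the table and the attribute keys (last, male,
    works, drinks, first) are assumptions made for this illustration.

    from math import log2

    # Student exam data from the table above; "first" is the target attribute
    # ("First this year?").
    DATA = [
        {"last": "yes", "male": "yes", "works": "no",  "drinks": "yes", "first": "yes"},
        {"last": "yes", "male": "yes", "works": "yes", "drinks": "no",  "first": "yes"},
        {"last": "no",  "male": "no",  "works": "yes", "drinks": "no",  "first": "yes"},
        {"last": "no",  "male": "yes", "works": "no",  "drinks": "yes", "first": "no"},
        {"last": "yes", "male": "no",  "works": "yes", "drinks": "yes", "first": "yes"},
        {"last": "no",  "male": "yes", "works": "yes", "drinks": "yes", "first": "no"},
    ]

    def entropy(rows, target="first"):
        """Entropy of the target-class distribution in rows."""
        n = len(rows)
        counts = {}
        for r in rows:
            counts[r[target]] = counts.get(r[target], 0) + 1
        return -sum((c / n) * log2(c / n) for c in counts.values())

    def split_entropy(rows, attr, target="first"):
        """Weighted average class entropy after splitting rows on attr
        (the quantity called entropyS(attr) in the solution above)."""
        n = len(rows)
        values = set(r[attr] for r in rows)
        return sum(len(part) / n * entropy(part, target)
                   for v in values
                   for part in [[r for r in rows if r[attr] == v]])

    print("entropy(S) = %.4f" % entropy(DATA))                     # 0.9183
    for a in ("last", "male", "works", "drinks"):
        print("entropyS(%s) = %.4f" % (a, split_entropy(DATA, a))) # 0.459, 0.666, 0.874, 0.666

    # The tree built in part 2, applied to instances 7 (Matthew) and 8 (Mary):
    def classify(x):
        if x["last"] == "yes":
            return "yes"
        return "yes" if x["drinks"] == "no" else "no"

    print(classify({"last": "no", "male": "yes", "works": "no",  "drinks": "yes"}))  # no
    print(classify({"last": "no", "male": "no",  "works": "yes", "drinks": "yes"}))  # no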
    

Problem 3 (15 Points): Neural Networks

Consider the following Boolean function

A  B   ~A v B
1  1     1
1  0     0
0  1     1
0  0     1

  1. Can this function be represented by a perceptron? Explain your answer.
    Yes, since the function is linearly separable: the only negative instance is
    (A,B) = (1,0), and a line such as B = A - 1/2 separates it from the three
    positive instances.
    
                 B  |
                    |
                   1+    +
                    |
                    +____-_____ A
                    0    1
    
  2. If yes, construct a perceptron that represents the function. Otherwise construct a multilayer neural network that represents it.
        Perceptron:
                      -----
        1 --- w0 --->|     |  
        A --- w1 --->|     |-----> output = 1,  if w0 + w1*A + w2*B > 0
        B --- w2 --->|     |                0,  otherwise
                      -----
    
        Take, for example, w0 = 1, w1 = -1, and w2 = 1: the weighted sum is 0 for
        (A,B) = (1,0) and strictly positive for the other three inputs (these
        weights are checked in the sketch at the end of this problem).
    
  3. Explain informally why the delta training rule in Equation (4.10) is only an approximation to the true gradient descent rule of Equation (4.7).
    Equation (4.10) corresponds to incremental (stochastic) gradient descent, in
    which the weights are updated after each individual training example, while
    Equation (4.7) corresponds to true gradient descent, in which the weights
    are updated only after the error has been summed over all of the training
    examples. The incremental rule therefore follows only an approximation to
    the true gradient, although the approximation can be made arbitrarily close
    by using a sufficiently small learning rate.
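
    A few lines of Python confirm that the weights suggested in part 2 realize
    ~A v B; the function name perceptron and the exhaustive loop over the four
    inputs are illustrative choices, while the threshold rule and weight values
    come from the answer above.

    def perceptron(A, B, w0=1, w1=-1, w2=1):
        """Threshold unit from part 2: output 1 iff w0 + w1*A + w2*B > 0."""
        return 1 if w0 + w1 * A + w2 * B > 0 else 0

    # Compare against the target function ~A v B on all four inputs.
    for A in (0, 1):
        for B in (0, 1):
            target = int((not A) or B)
            assert perceptron(A, B) == target
            print(A, B, perceptron(A, B))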
    

Problem 4 (15 Points): Evaluating Hypotheses

Suppose that hypothesis h commits r=10 errors over a sample of n=65 independently drawn examples.
  1. What is the standard deviation in ERRORs(h)?
         ERRORs(h) = r/n = 10/65 = 2/13
    
         SDs(h) = sqrt((2/13)*(1 - (2/13))/65) = sqrt(0.002) = 0.0447
         
  2. What is the 90% confidence interval (two-sided) for the true error rate?
         ERRORs(h) +- Z90*SDs(h) = 2/13 +- 1.64*(0.0447) = [0.0804531,0.227239]
         
  3. What is the 95% one-sided interval?
          The 95% one-sided upper bound uses the same z value (1.64) as the
          90% two-sided interval, since in both cases 5% of the probability
          mass lies above the bound:

          (-infinity, ERRORs(h) + Z90*SDs(h)] = (-infinity, 2/13 + 1.64*0.0447]
                                              = (-infinity, 0.227239]
         
  4. What is the 90% one-sided interval?
          Here Z80 = 1.28, the z value of an 80% two-sided interval, since
          10% of the probability mass lies above the bound:

          (-infinity, ERRORs(h) + Z80*SDs(h)] = (-infinity, 2/13 + 1.28*0.0447]
                                              = (-infinity, 0.211129]
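
    The numbers in parts 1-4 can be reproduced with the Python sketch below
    (this assumes Python 3.8+ for statistics.NormalDist; the exact z values it
    returns differ slightly from the rounded table values 1.64 and 1.28 used
    above).

    from math import sqrt
    from statistics import NormalDist

    r, n = 10, 65
    error = r / n                               # sample error rate, 2/13
    sd = sqrt(error * (1 - error) / n)          # its standard deviation, ~0.0447

    z90_two_sided = NormalDist().inv_cdf(0.95)  # ~1.645 (table value 1.64)
    z80_two_sided = NormalDist().inv_cdf(0.90)  # ~1.282 (table value 1.28)

    print("ERRORs(h) = %.4f, SD = %.4f" % (error, sd))
    print("90%% two-sided interval: [%.4f, %.4f]"
          % (error - z90_two_sided * sd, error + z90_two_sided * sd))
    print("95%% one-sided upper bound: %.4f" % (error + z90_two_sided * sd))
    print("90%% one-sided upper bound: %.4f" % (error + z80_two_sided * sd))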
         

Problem 5 (15 Points): Bayesian Learning

Consider the following training instances for the janitor example of Problem 1,

     STATUS    FLOOR   DEPT.   OFFICE SIZE   RECYCLING BIN?
  1. faculty   four    cs      medium        yes
  2. student   four    ee      large         yes
  3. staff     five    cs      medium        no
  4. student   three   ee      small         yes
  5. staff     four    cs      medium        no

  1. Construct a Bayesian network that represents all attributes in the janitor example, assuming that the predicting attributes are pairwise independent. Provide the probability table for each of the predicting attributes.
    Let S=status, F=floor, D=dept., OS=office-size, and RB=recycling-bin?.
    
            S | Prob      F | Prob    D | Prob      OS | Prob
      --------------  ------------   ---------  -------------
      faculty | 1/5   four  | 3/5    cs | 3/5   small  | 1/5
      staff   | 2/5   five  | 1/5    ee | 2/5   medium | 3/5
      student | 2/5   three | 1/5               large  | 1/5
    
            -----         -----       -----      ------
           |  S  |       |  F  |     |  D  |    |  OS  |
            -----         -----       -----      ------
                  \         |           |       /
                   \        |           |      /
                    \       |           |     /
                     v      v           v    v
                      -----------------------
                     |          RB           |
                      -----------------------
    
  2. Show how a naive Bayesian classifier would classify the following instance:

    STATUS    FLOOR   DEPT.   OFFICE SIZE   RECYCLING BIN?
    student   four    cs      small         ??

    A naive Bayes classifier would select the value v ("yes" or "no") for the 
    target concept that maximizes the following product:
    
       P(S=student|RB=v)*P(F=four|RB=v)*P(D=cs|RB=v)*P(OS=small|RB=v)*P(v)
    
    For v = "yes":
      P(S=student|RB=yes)*P(F=four|RB=yes)*P(D=cs|RB=yes)*P(OS=small|RB=yes)*P(yes)
      = 2/3                *  2/3           * 1/3          * 1/3            * 3/5
      = 4/135
    
    For v = "no":
      P(S=student|RB=no)*P(F=four|RB=no)*P(D=cs|RB=no)*P(OS=small|RB=no)*P(no)
      = 0               *  1/2          * 2/2         * 0               * 2/5
      = 0
    
    Hence, the naive Bayes classifier classifies the instance as positive, 
    i.e. RB=yes.
    
    
  3. Assume now that OFFICE SIZE depends on STATUS.
    1. Construct a Bayesian network that represents all attributes. Assume that the conditional probability table for OFFICE SIZE is the following:
                Office Size    faculty  student staff
                --------------------------------------
                small          1/8       1/2    1/4
                medium         1/4       1/4    2/4
                large          5/8       1/4    1/4
      
      
                   S | Prob      F | Prob    D | Prob   
             --------------  ------------   ---------  
             faculty | 1/5   four  | 3/5    cs | 3/5   
             staff   | 2/5   five  | 1/5    ee | 2/5   
             student | 2/5   three | 1/5               
      
             -----
            |  S  |
             -----
               |
               |
               v
             ------         -----       -----     
            |  OS  |       |  F  |     |  D  |    
             ------         -----       -----      
                    \         |           |       
                     \        |           |      
                      \       |           |     
                       v      v           v    
                        -----------------------
                       |          RB           |
                        -----------------------
      
      
                
    2. Compute P(Recycling-bin?=yes | Office-size = medium, Dept = cs, Floor = four)
       Counting frequencies from the table: only instances 1 and 5 have
       OS=medium, D=cs, and F=four, and exactly one of them (instance 1)
       is positive, so
       
            P(RB=yes|OS=medium,D=cs,F=four) = 1/2
      
      
    3. Compute P(OS = medium)
      P(OS=medium) =   P(OS=medium|S=faculty)*P(S=faculty)
                     + P(OS=medium|S=student)*P(S=student)
                     + P(OS=medium|S=staff)*P(S=staff)
                   = 1/4*1/5 + 1/4*2/5 + 2/4*2/5
                   = 7/20
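
    The naive Bayes products of part 2 and the marginal P(OS=medium) of part 3.3
    can be checked with the following Python sketch; the tuple encoding of the
    training table and the dictionary name P_OS_GIVEN_S are illustrative
    assumptions (the conditional probabilities themselves are taken from the
    table in part 3.1).

    from collections import Counter

    # Training data from the table above: (status, floor, dept, office_size, bin)
    DATA = [
        ("faculty", "four",  "cs", "medium", "yes"),
        ("student", "four",  "ee", "large",  "yes"),
        ("staff",   "five",  "cs", "medium", "no"),
        ("student", "three", "ee", "small",  "yes"),
        ("staff",   "four",  "cs", "medium", "no"),
    ]

    def naive_bayes_score(x, label):
        """P(a1|label)*...*P(a4|label)*P(label), with probabilities estimated
        by frequency counts (no smoothing, as in the solution above)."""
        rows = [r for r in DATA if r[-1] == label]
        score = len(rows) / len(DATA)            # the prior P(label)
        for i, value in enumerate(x):
            score *= sum(1 for r in rows if r[i] == value) / len(rows)
        return score

    query = ("student", "four", "cs", "small")
    for label in ("yes", "no"):
        print(label, naive_bayes_score(query, label))   # yes: 4/135 ~ 0.0296, no: 0.0

    # Part 3.3: P(OS=medium) = sum over s of P(OS=medium | S=s) * P(S=s)
    P_OS_GIVEN_S = {                                    # conditional table from part 3.1
        "faculty": {"small": 1/8, "medium": 1/4, "large": 5/8},
        "student": {"small": 1/2, "medium": 1/4, "large": 1/4},
        "staff":   {"small": 1/4, "medium": 2/4, "large": 1/4},
    }
    status_counts = Counter(r[0] for r in DATA)
    p_medium = sum(P_OS_GIVEN_S[s]["medium"] * c / len(DATA)
                   for s, c in status_counts.items())
    print(p_medium)                                     # 7/20 = 0.35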
      

Problem 6 (10 Points): Papers

  1. According to T. Dietterich, "Machine-Learning Research: Four Current Directions", AAAI Magazine, Winter 1997, what are the assumptions under which a majority vote in an ensemble of classifiers improves upon the individual classifications?
  2. What is the main idea of the procedure for training neural nets described in "Learning representations by back-propagating errors" by D. Rumelhart et al.?
  3. From the paper "Combining Neural Networks and Context-Driven Search for Online, Printed Handwriting Recognition in the NEWTON" by L. Yaeger et al., select one of the techniques for helping a neural network better encode class probabilities for underrepresented classes and writing styles, and explain the technique in your own words.