CS539 MACHINE LEARNING. SPRING 99
SOLUTIONS - Practice Exam 1

PROF. CAROLINA RUIZ
Department of Computer Science
Worcester Polytechnic Institute


Problem 1 (20 Points): Concept Learning

Consider the following set of training examples (taken from Dean, Allen, and Aloimonos' book) to train a robot janitor to predict whether or not an office contains a recycling bin.

     STATUS    FLOOR   DEPT.   OFFICE SIZE   RECYCLING BIN?
  1. faculty   four    cs      medium        yes
  2. faculty   four    ee      medium        yes
  3. student   four    cs      small         no
  4. faculty   five    cs      medium        yes

Assume that each attribute takes on only the values that appear in the table.

  1. What is the size of the set of instances for this example?
    
    Attributes:  STATUS  FLOOR  DEPT.  OFFICE SIZE
    # values:      2    x  2   x  2   x    2       = 16
    
    
  2. What is the size of the hypothesis space?
    
    Attributes:  STATUS  FLOOR  DEPT.  OFFICE SIZE
    # values:      3    x  3   x  3   x    3        + 1 = 82
                   |                                  |
              (= 2 + "?")                   [null,null,null,null]
    
    
  3. Give a sequence of S and G boundary sets computed by the CANDIDATE-ELIMINATION algorithm if it is given the sequence of examples above in the order in which they appear on the table.
    
    S0 = {[null,null,null,null]}
    S1 = {[faculty,four,cs,medium]}
    S2 = {[faculty,four,?,medium]}
    S3 = S2
    S4 = {[faculty,?,?,medium]}
    
    G4 = G3
    G3 = {[faculty,?,?,?], [?,?,?,medium]}
    G2 = G1
    G1 = G0
    G0 = {[?,?,?,?]}
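
    (A short Python sketch that reproduces this trace is given at the end of
    this problem.)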
    
    
  4. Suggest an ordering of the examples that would minimize the sizes of the intermediate S and G sets. Explain your answer.
    Note that the sum of the sizes of the intermediate S and G sets
    is at least 2*(# of examples + 1) = 10, counting S0 and G0, since 
    each S and G has at least one element.
    
    For the trace shown above the sum of the sizes is 12 (each Si has one
    element; G0 through G2 have one element each, and G3 and G4 have two).
    Since the final G always contains 2 elements, the minimal sum of the
    sizes is obtained when the "last G" (i.e. G4) is the only one containing
    2 elements. That can be achieved by processing all the positive examples
    before the negative example; in that case the sum of the sizes is 11.
     
    
  5. Suggest a query guaranteed to reduce the size of the Version Space regardless of how the trainer classifies it.
    
    Query = [faculty,four,cs,small]
    
    The current version space contains three hypotheses: [faculty,?,?,medium],
    [faculty,?,?,?], and [?,?,?,medium]. The query satisfies only the second of
    them, so either answer from the trainer eliminates at least one hypothesis.
    
    If this is a positive instance, 
     then the boundaries can be updated to:
      S = [faculty,?,?,?]
      G = [faculty,?,?,?]
    
    If this is a negative instance, 
     then the boundaries can be updated to:
      S = [faculty,?,?,medium]
      G = [?,?,?,medium]
    
    
  6. (Extra Credit) Complete the proof of Theorem 2.1. from the textbook. That is, prove that every hypothesis that is consistent with the training data lies between the most specific and the most general boundaries S and G in the partially ordered hypothesis space.
    By way of contradiction, assume that there are hypotheses in H
    that are consistent with D but do not lie between S and G.
    Let K be the set of all such hypotheses. That is,
    K = { h in H | Consistent(h,D) and h does not lie between
                   S and G in the partially ordered hypothesis space}
    Below, VS denotes the version space, i.e. the set of consistent
    hypotheses that do lie between S and G.
    Let h be a maximally specific hypothesis in K (i.e. h has the
    property that there is no h' in K such that h' is strictly more
    specific than h).
    
    Case 1: If there is no hypothesis h' in H consistent with D that
    is more specific than h, then by definition h would belong to S,
    contradicting our assumption that h belongs to K.
    
    Case 2: If there is a hypothesis h' in H consistent with D that is
    more specific than h, then h' belongs to VS: since h is maximally
    specific in K, h' cannot belong to K, and a consistent hypothesis
    outside K lies between S and G. In particular, there is some s in S
    with s <=g h' <=g h. Now we just need to find some h'' in VS such
    that h'' is more general than or equal to h; then h <=g h'' <=g g
    for some g in G, so h would belong to VS, which would contradict
    our assumption that h belongs to K.
    
       Case 2.1: If there is no consistent hypothesis in H strictly more
       general than h, then by definition h belongs to G, and we can take
       h'' = h, yielding a contradiction.
    
       Case 2.2: If there is a consistent hypothesis q in H strictly more
       general than h, then:
    
          Case 2.2.1: If q belongs to VS, then we can take h'' = q and
          we are done.
          
          Case 2.2.2: If q belongs to K, then construct a chain
          h <=g q <=g h_1 <=g h_2 <=g ... <=g h_n where h_n is a maximally
          general hypothesis in K. Following the same arguments as in
          Case 1 and the beginning of Case 2 (with "general" and "specific"
          interchanged), one can show that either h'' = h_n belongs to G
          or there is some h'' in VS such that h_n <=g h''. In both cases
          h <=g h'', and hence h would belong to VS, yielding a
          contradiction.
    
    Since all the cases above yield a contradiction, K must be
    empty, and so every hypothesis that is consistent with the
    training data lies between the most specific and the most general
    boundaries S and G in the partially ordered hypothesis space.
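
    The S and G trace in part 3 can also be reproduced mechanically. Below is a
    minimal Python sketch of the CANDIDATE-ELIMINATION algorithm for this
    conjunctive hypothesis representation; the tuple encoding of hypotheses and
    the names DOMAINS, EXAMPLES, covers, etc. are illustrative assumptions, not
    part of the original solution.

    # Minimal sketch of CANDIDATE-ELIMINATION for conjunctive hypotheses over
    # discrete attributes. A hypothesis is a tuple whose entries are a specific
    # value, '?' (any value), or None (the constraint of the null hypothesis).

    DOMAINS = [("faculty", "student"),   # STATUS
               ("four", "five"),         # FLOOR
               ("cs", "ee"),             # DEPT.
               ("medium", "small")]      # OFFICE SIZE

    EXAMPLES = [(("faculty", "four", "cs", "medium"), True),
                (("faculty", "four", "ee", "medium"), True),
                (("student", "four", "cs", "small"), False),
                (("faculty", "five", "cs", "medium"), True)]

    def covers(h, x):
        """Does hypothesis h classify instance x as positive?"""
        return all(hv == '?' or hv == xv for hv, xv in zip(h, x))

    def more_general_or_equal(h1, h2):
        """h1 >=g h2: every instance covered by h2 is also covered by h1."""
        if any(v is None for v in h2):   # the null hypothesis covers nothing
            return True
        return all(a == '?' or a == b for a, b in zip(h1, h2))

    def min_generalization(h, x):
        """The (unique) minimal generalization of h that covers the positive x."""
        return tuple(xv if hv is None else (hv if hv == xv else '?')
                     for hv, xv in zip(h, x))

    def min_specializations(h, x):
        """All minimal specializations of h that exclude the negative x."""
        specs = []
        for i, (hv, xv) in enumerate(zip(h, x)):
            if hv == '?':
                specs += [h[:i] + (v,) + h[i + 1:] for v in DOMAINS[i] if v != xv]
        return specs

    def candidate_elimination(examples):
        S = [tuple([None] * len(DOMAINS))]   # S0: the null hypothesis
        G = [tuple(['?'] * len(DOMAINS))]    # G0: the all-'?' hypothesis
        for t, (x, positive) in enumerate(examples, start=1):
            if positive:
                # drop members of G that miss the positive example, then
                # generalize S minimally and keep only members below some g in G
                G = [g for g in G if covers(g, x)]
                S = [s if covers(s, x) else min_generalization(s, x) for s in S]
                S = [s for s in S if any(more_general_or_equal(g, s) for g in G)]
                # (S stays a singleton for conjunctive hypotheses, so no
                #  minimality pruning of S is needed in this sketch)
            else:
                # drop members of S that cover the negative example, then
                # specialize the members of G that cover it
                S = [s for s in S if not covers(s, x)]
                newG = []
                for g in G:
                    if not covers(g, x):
                        newG.append(g)
                    else:
                        newG += [sp for sp in min_specializations(g, x)
                                 if any(more_general_or_equal(sp, s) for s in S)]
                # keep only the maximally general members of G
                G = [g for g in newG if not any(
                        g2 != g and more_general_or_equal(g2, g) for g2 in newG)]
            print("S%d = %s" % (t, S))
            print("G%d = %s" % (t, G))
        return S, G

    candidate_elimination(EXAMPLES)

    Running this prints the same S1..S4 and G1..G4 listed in part 3.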
          
    

Problem 2 (20 Points): Decision Trees

The following table containing Student Exam Performance Data is taken from section 7.4 of A. Cawsey, "The Essence of Artificial Intelligence", Prentice Hall Europe, 1998.

No.  Student   First last year?   Male?   Works hard?   Drinks?   First this year?
1    Richard   yes                yes     no            yes       yes
2    Alan      yes                yes     yes           no        yes
3    Alison    no                 no      yes           no        yes
4    Jeff      no                 yes     no            yes       no
5    Gail      yes                no      yes           yes       yes
6    Simon     no                 yes     yes           yes       no

  1. What is the entropy of this collection of training examples with respect to the target function classification?
         entropy(S) = - (4/6)log_2(4/6) - (2/6)log_2(2/6) = 0.9183
         
         Here log_2 denotes the logarithm in base 2; S contains 4 positive and
         2 negative examples.
         
  2. Construct a minimal decision tree for this dataset. Show your work.
                                    
         entropyS(1st-last-year) 
         = (3/6)[-(3/3)log_2(3/3) - 0] + (3/6)[-(1/3)log_2(1/3) - (2/3)log_2(2/3)]
         = 0.459
    
         entropyS(male)
         = (4/6)[-(1/2)log_2(1/2) - (1/2)log_2(1/2)] + (2/6)[-(2/2)log_2(2/2) - 0]
         = 0.666
    
         entropyS(works-hard)
         = (4/6)[-(3/4)log_2(3/4) - (1/4)log_2(1/4)] + (2/6)[-(1/2)log_2(1/2) - (1/2)log_2(1/2)]
         = 0.8741
         
         entropyS(drinks)
         = (4/6)[-(1/2)log_2(1/2) - (1/2)log_2(1/2)] + (2/6)[-(2/2)log_2(2/2) - 0]
         = 0.666
    
         1st-last-year is the attribute with the smallest entropy and so we use it for the
         root of the tree:
    
    
                               1st-last-year?
                               /           \
                          yes /             \ no
                             /               \
                     1+, 2+, 5+          3+, 4-, 6-
                        YES
    
          We need to split the rightmost node which contains examples 3, 4, 6 only.
          Let S' be:
    
    
     No.  Student   First last year?   Male?   Works hard?   Drinks?   First this year?
     3    Alison    no                 no      yes           no        yes
     4    Jeff      no                 yes     no            yes       no
     6    Simon     no                 yes     yes           yes       no

     Now we need to select, from male?, works-hard?, and drinks?, the attribute
     with the smallest entropy w.r.t. S':

          entropyS'(drinks?) = (2/3)[-(2/2)log_2(2/2) - 0] + (1/3)[-(1/1)log_2(1/1) - 0] = 0

     Since this is the smallest possible entropy, drinks? is chosen, giving the tree:

                               1st-last-year?
                               /           \
                          yes /             \ no
                             /               \
                     1+, 2+, 5+            drinks?
                        YES                /      \
                                       no /        \ yes
                                         /          \
                                        3+         4-, 6-
                                       YES           NO

     (The entropy computations in parts 1 and 2 can be reproduced with the Python
     sketch at the end of this problem.)
  3. Use your decision tree from the previous part to classify the following instances:
    No.  Student   First last year?   Male?   Works hard?   Drinks?   First this year?
    7    Matthew   no                 yes     no            yes       ??
    8    Mary      no                 no      yes           yes       ??
    Both instances are classified as negative by the previous decision tree.
    
  4. Suppose that the trainer tells you that your classifications of the instances are incorrect (i.e. the right classification of each of the instances is "yes" if you said "no", and "no" if you said "yes"). Discuss a way to post-process your tree (without constructing it from scratch again) so that the resulting decision tree better represents the training data together with the two test instances.
    Since we classified both instances as "no", their correct classification is
    "yes". The drinks? node can then be pruned and replaced by a leaf labeled
    with the majority vote of the instances that reach it. That is,
    
                               1st-last-year?
                               /           \
                          yes /             \ no
                             /               \
                     1+, 2+, 5+          3+, 4-, 6-, 7+, 8+
                        YES              Majority vote = YES
    
    The accuracy of the tree remains the same (6/8), but the tree is now smaller.
    
    Another possibility is to replace drinks? by male? (which also has
    entropy 0 with respect to S'). In that case only one of the eight instances
    (no. 7, Matthew) is misclassified, and hence the accuracy of the decision
    tree increases to 7/8.
    
  5. (Extra Credit) Suppose that we are trying to learn a concept C and that A is one of the predicting attributes. Let n be the number of values that C can assume and k the number of values that A can assume. (For instance, if C=PlayTennis and A is Outlook, then n = 2 (yes,no) and k=3 (sunny, overcast, rainy).) What is the interval of real numbers in which the value of the entropy of A with respect to S lies? In other words, find real numbers a and b such that a <= entropy(A) <= b.
    0 <= entropy(A) <= log_2(k)
    
    The entropy of A's value distribution over S is 0 when every example in S
    has the same value of A, and it is largest, log_2(k), when the k values of
    A are equally represented in S.
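
    The entropy values in parts 1 and 2, and the classification of instances 7
    and 8 in part 3, can be double-checked with the short Python sketch below.
    The dictionary encoding of the table and the attribute keys (last, male,
    works, drinks, first) are assumptions made for this illustration.

    from math import log2

    # Student exam data from the table above; "first" is the target attribute
    # ("First this year?").
    DATA = [
        {"last": "yes", "male": "yes", "works": "no",  "drinks": "yes", "first": "yes"},
        {"last": "yes", "male": "yes", "works": "yes", "drinks": "no",  "first": "yes"},
        {"last": "no",  "male": "no",  "works": "yes", "drinks": "no",  "first": "yes"},
        {"last": "no",  "male": "yes", "works": "no",  "drinks": "yes", "first": "no"},
        {"last": "yes", "male": "no",  "works": "yes", "drinks": "yes", "first": "yes"},
        {"last": "no",  "male": "yes", "works": "yes", "drinks": "yes", "first": "no"},
    ]

    def entropy(rows, target="first"):
        """Entropy of the target-class distribution in rows."""
        n = len(rows)
        counts = {}
        for r in rows:
            counts[r[target]] = counts.get(r[target], 0) + 1
        return -sum((c / n) * log2(c / n) for c in counts.values())

    def split_entropy(rows, attr, target="first"):
        """Weighted average class entropy after splitting rows on attr
        (the quantity called entropyS(attr) in the solution above)."""
        n = len(rows)
        values = set(r[attr] for r in rows)
        return sum(len(part) / n * entropy(part, target)
                   for v in values
                   for part in [[r for r in rows if r[attr] == v]])

    print("entropy(S) = %.4f" % entropy(DATA))                     # 0.9183
    for a in ("last", "male", "works", "drinks"):
        print("entropyS(%s) = %.4f" % (a, split_entropy(DATA, a))) # 0.459, 0.666, 0.874, 0.666

    # The tree built in part 2, applied to instances 7 (Matthew) and 8 (Mary):
    def classify(x):
        if x["last"] == "yes":
            return "yes"
        return "yes" if x["drinks"] == "no" else "no"

    print(classify({"last": "no", "male": "yes", "works": "no",  "drinks": "yes"}))  # no
    print(classify({"last": "no", "male": "no",  "works": "yes", "drinks": "yes"}))  # no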
    

Problem 3 (15 Points): Neural Networks

Consider the following Boolean function

A  B   ~A v B
1  1     1
1  0     0
0  1     1
0  0     1

  1. Can this function be represented by a perceptron? Explain your answer.
    Yes, since the function is linearly separable: the only negative instance is
    (A,B) = (1,0), and a line such as B = A - 1/2 separates it from the three
    positive instances.
    
                 B  |
                    |
                   1+    +
                    |
                    +____-_____ A
                    0    1
    
  2. If yes, construct a perceptron that represents the function. Otherwise construct a multilayer neural network that represents it.
        Perceptron:
                      -----
        1 --- w0 --->|     |  
        A --- w1 --->|     |-----> output = 1,  if w0 + w1*A + w2*B > 0
        B --- w2 --->|     |                0,  otherwise
                      -----
    
        Take, for example, w0 = 1, w1 = -1, and w2 = 1: the weighted sum is 0 for
        (A,B) = (1,0) and strictly positive for the other three inputs (these
        weights are checked in the sketch at the end of this problem).
    
  3. Explain informally why the delta training rule in Equation (4.10) is only an approximation to the true gradient descent rule of Equation (4.7).
    Equation (4.10) corresponds to incremental (stochastic) gradient descent, in
    which the weights are updated after each individual training example, while
    Equation (4.7) corresponds to true gradient descent, in which the weights
    are updated only after the error has been summed over all of the training
    examples. The incremental rule therefore follows only an approximation to
    the true gradient, although the approximation can be made arbitrarily close
    by using a sufficiently small learning rate.
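
    A few lines of Python confirm that the weights suggested in part 2 realize
    ~A v B; the function name perceptron and the exhaustive loop over the four
    inputs are illustrative choices, while the threshold rule and weight values
    come from the answer above.

    def perceptron(A, B, w0=1, w1=-1, w2=1):
        """Threshold unit from part 2: output 1 iff w0 + w1*A + w2*B > 0."""
        return 1 if w0 + w1 * A + w2 * B > 0 else 0

    # Compare against the target function ~A v B on all four inputs.
    for A in (0, 1):
        for B in (0, 1):
            target = int((not A) or B)
            assert perceptron(A, B) == target
            print(A, B, perceptron(A, B))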
    

Problem 4 (15 Points): Evaluating Hypotheses

Suppose that hypothesis h commits r=10 errors over a sample of n=65 independently drawn examples.
  1. What is the standard deviation in ERRORs(h)?
         ERRORs(h) = r/n = 10/65 = 2/13
    
         SDs(h) = sqrt((2/13)*(1 - (2/13))/65) = sqrt(0.002) = 0.0447
         
  2. What is the 90% confidence interval (two-sided) for the true error rate?
         ERRORs(h) +- Z90*SDs(h) = 2/13 +- 1.64*(0.0447) = [0.0804531,0.227239]
         
  3. What is the 95% one-sided interval?
          The 95% one-sided upper bound uses the same z value (1.64) as the
          90% two-sided interval, since in both cases 5% of the probability
          mass lies above the bound:

          (-infinity, ERRORs(h) + Z90*SDs(h)] = (-infinity, 2/13 + 1.64*0.0447]
                                              = (-infinity, 0.227239]
         
  4. What is the 90% one-sided interval?
          Here Z80 = 1.28, the z value of an 80% two-sided interval, since
          10% of the probability mass lies above the bound:

          (-infinity, ERRORs(h) + Z80*SDs(h)] = (-infinity, 2/13 + 1.28*0.0447]
                                              = (-infinity, 0.211129]
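
    The numbers in parts 1-4 can be reproduced with the Python sketch below
    (this assumes Python 3.8+ for statistics.NormalDist; the exact z values it
    returns differ slightly from the rounded table values 1.64 and 1.28 used
    above).

    from math import sqrt
    from statistics import NormalDist

    r, n = 10, 65
    error = r / n                               # sample error rate, 2/13
    sd = sqrt(error * (1 - error) / n)          # its standard deviation, ~0.0447

    z90_two_sided = NormalDist().inv_cdf(0.95)  # ~1.645 (table value 1.64)
    z80_two_sided = NormalDist().inv_cdf(0.90)  # ~1.282 (table value 1.28)

    print("ERRORs(h) = %.4f, SD = %.4f" % (error, sd))
    print("90%% two-sided interval: [%.4f, %.4f]"
          % (error - z90_two_sided * sd, error + z90_two_sided * sd))
    print("95%% one-sided upper bound: %.4f" % (error + z90_two_sided * sd))
    print("90%% one-sided upper bound: %.4f" % (error + z80_two_sided * sd))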
         

Problem 5 (15 Points): Bayesian Learning

Consider the following training instances for the janitor example of Problem 1,

     STATUS    FLOOR   DEPT.   OFFICE SIZE   RECYCLING BIN?
  1. faculty   four    cs      medium        yes
  2. student   four    ee      large         yes
  3. staff     five    cs      medium        no
  4. student   three   ee      small         yes
  5. staff     four    cs      medium        no

  1. Construct a Bayesian network that represents all attributes in the janitor example, assuming that the predicting attributes are pairwise independent. Provide the probability table for each of the predicting attributes.
    Let S=status, F=floor, D=dept., OS=office-size, and RB=recycling-bin?.
    
            S | Prob      F | Prob    D | Prob      OS | Prob
      --------------  ------------   ---------  -------------
      faculty | 1/5   four  | 3/5    cs | 3/5   small  | 1/5
      staff   | 2/5   five  | 1/5    ee | 2/5   medium | 3/5
      student | 2/5   three | 1/5               large  | 1/5
    
            -----         -----       -----      ------
           |  S  |       |  F  |     |  D  |    |  OS  |
            -----         -----       -----      ------
                  \         |           |       /
                   \        |           |      /
                    \       |           |     /
                     v      v           v    v
                      -----------------------
                     |          RB           |
                      -----------------------
    
  2. Show how a naive Bayesian classifier would classify the following instance:

    STATUS    FLOOR   DEPT.   OFFICE SIZE   RECYCLING BIN?
    student   four    cs      small         ??

    A naive Bayes classifier would select the value v ("yes" or "no") for the 
    target concept that maximizes the following product:
    
       P(S=student|RB=v)*P(F=four|RB=v)*P(D=cs|RB=v)*P(OS=small|RB=v)*P(v)
    
    For v = "yes":
      P(S=student|RB=yes)*P(F=four|RB=yes)*P(D=cs|RB=yes)*P(OS=small|RB=yes)*P(yes)
      = 2/3                *  2/3           * 1/3          * 1/3            * 3/5
      = 4/135
    
    For v = "no":
      P(S=student|RB=no)*P(F=four|RB=no)*P(D=cs|RB=no)*P(OS=small|RB=no)*P(no)
      = 0               *  1/2          * 2/2         * 0               * 2/5
      = 0
    
    Hence, the naive Bayes classifier classifies the instance as positive, 
    i.e. RB=yes.
    
    
  3. Assume now that OFFICE SIZE depends on STATUS.
    1. Construct a Bayesian network that represents all attributes. Assume that the conditional probability table for OFFICE SIZE is the following:
                Office Size    faculty  student staff
                --------------------------------------
                small          1/8       1/2    1/4
                medium         1/4       1/4    2/4
                large          5/8       1/4    1/4
      
      
                   S | Prob      F | Prob    D | Prob   
             --------------  ------------   ---------  
             faculty | 1/5   four  | 3/5    cs | 3/5   
             staff   | 2/5   five  | 1/5    ee | 2/5   
             student | 2/5   three | 1/5               
      
             -----
            |  S  |
             -----
               |
               |
               v
             ------         -----       -----     
            |  OS  |       |  F  |     |  D  |    
             ------         -----       -----      
                    \         |           |       
                     \        |           |      
                      \       |           |     
                       v      v           v    
                        -----------------------
                       |          RB           |
                        -----------------------
      
      
                
    2. Compute P(Recycling-bin?=yes | Office-size = medium, Dept = cs, Floor = four)
       Counting frequencies from the table: only instances 1 and 5 have
       OS=medium, D=cs, and F=four, and exactly one of them (instance 1)
       is positive, so
       
            P(RB=yes|OS=medium,D=cs,F=four) = 1/2
      
      
    3. Compute P(OS = medium)
      P(OS=medium) =   P(OS=medium|S=faculty)*P(S=faculty)
                     + P(OS=medium|S=student)*P(S=student)
                     + P(OS=medium|S=staff)*P(S=staff)
                   = 1/4*1/5 + 1/4*2/5 + 2/4*2/5
                   = 7/20
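
    The naive Bayes products of part 2 and the marginal P(OS=medium) of part 3.3
    can be checked with the following Python sketch; the tuple encoding of the
    training table and the dictionary name P_OS_GIVEN_S are illustrative
    assumptions (the conditional probabilities themselves are taken from the
    table in part 3.1).

    from collections import Counter

    # Training data from the table above: (status, floor, dept, office_size, bin)
    DATA = [
        ("faculty", "four",  "cs", "medium", "yes"),
        ("student", "four",  "ee", "large",  "yes"),
        ("staff",   "five",  "cs", "medium", "no"),
        ("student", "three", "ee", "small",  "yes"),
        ("staff",   "four",  "cs", "medium", "no"),
    ]

    def naive_bayes_score(x, label):
        """P(a1|label)*...*P(a4|label)*P(label), with probabilities estimated
        by frequency counts (no smoothing, as in the solution above)."""
        rows = [r for r in DATA if r[-1] == label]
        score = len(rows) / len(DATA)            # the prior P(label)
        for i, value in enumerate(x):
            score *= sum(1 for r in rows if r[i] == value) / len(rows)
        return score

    query = ("student", "four", "cs", "small")
    for label in ("yes", "no"):
        print(label, naive_bayes_score(query, label))   # yes: 4/135 ~ 0.0296, no: 0.0

    # Part 3.3: P(OS=medium) = sum over s of P(OS=medium | S=s) * P(S=s)
    P_OS_GIVEN_S = {                                    # conditional table from part 3.1
        "faculty": {"small": 1/8, "medium": 1/4, "large": 5/8},
        "student": {"small": 1/2, "medium": 1/4, "large": 1/4},
        "staff":   {"small": 1/4, "medium": 2/4, "large": 1/4},
    }
    status_counts = Counter(r[0] for r in DATA)
    p_medium = sum(P_OS_GIVEN_S[s]["medium"] * c / len(DATA)
                   for s, c in status_counts.items())
    print(p_medium)                                     # 7/20 = 0.35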
      

Problem 6 (10 Points): Papers

  1. According to T. Dietterich, "Machine-Learning Research: Four Current Directions", AAAI Magazine, Winter 1997, what are the assumptions under which a majority vote in an ensemble of classifiers improves upon the individual classifications?
  2. What is the main idea of the procedure for training neural nets described in "Learning representations by back-propagating errors" by D. Rumelhart et al.?
  3. From the paper "Combining Neural Networks and Context-Driven Search for Online, Printed Handwriting Recognition in the NEWTON" by L. Yaeger et al., select one of the techniques for helping a neural network better encode class probabilities for underrepresented classes and writing styles, and explain the technique in your own words.