Dustin Stevens-Baier
Comp 578
9-24-06


Section 4.8

2) Consider the training examples shown in Table 4.7 for a binary classification problem.

a)Compute the Gini index for the overall collection of training examples.
The Gini index is 1 minus the sum of the fraction of records belonging to i at a given node t.

Gini = 1 - p(C0|Class)2 - p(C1|Class)2 = 1 - (10/20)2 - (10/20)2 = 0.5


b) Compute the Gini index for the Customer ID attribute.

The customer ID makes it so that you have 20 Gini values that look like this.

 Gini(1) = 1 - (1/1)2 - (0/1)2 = 0

Add all of these up and you get 0.


c)Compute the Gini index for the Gender attribute.

Gini(Male) = 1 - p(C0|M)2 - p(C1|M)2 = 1 - (6/10)2 - (4/10)2 = 0.48

Gini(Female) = 1 - p(C0|F)2 - p(C1|F)2 = 1 - (6/10)2 - (4/10)2 = 0.48

Gini(Gender) = [(T(Male)/T(Maile + female)]*Gini(Male) + [(T(Female)/T(Male + Female)]*Gini(Female) = (10/20)*0.48 + (10/20)*0.48 = 0.48

d)Compute the Gini index for the Car Type attribute using multiway split.

Gini(Family) = 1 - p(C0|Family)2 - p(C1|Family)2 = 1 - (1/4)2 - (3/4)2 = 0.375

Gini(Sports) = 1 - p(C0|Sports)2 - p(C1|Sports)2 = 1 - (8/8)2 - (0/8)2 = 0

Gini(Luxury) = 1 - p(C0|Luxury)2 - p(C1|Luxury)2 = 1 - (1/8)2 - (7/8)2 = 0.2188

TGini(Car Type) = [(T(Family)/T(Car Type)]*Gini(Family) + [(T(Sports)/T(Car Type)]*Gini(Sports) + [(T(Luxury)/T(Car Type)]*Gini(Luxury) = (4/20)*0.375 + (8/20)*0 + (8/20)*0.2188 = 0.1625

e)Compute the Gini index for the Shirt Size attribute using multiway split

Gini(Small) = 1 - p(C0|Small)2 - p(C1|Small)2 = 1 - (3/5)2 - (2/5)2 = 0.48

Gini(Medium) = 1 - p(C0|Medium)2 - p(C1|Medium)2 = 1 - (3/7)2 - (4/7)2 = 0.4898

Gini(Large) = 1 - p(C0|Large)2 - p(C1|Large)2 = 1 - (2/4)2 - (2/4)2 = 0.5

Gini(Extra Large) = 1 - p(C0|Extra Large)2 - p(C1|Extra Large)2 = 1 - (2/4)2 - (2/4)2 = 0.5

TGini(Shirt Size) = [(T(Small)/T(Shirt Size)]*Gini(Small) + [(T(Medium)/T(Shirt Size)]*Gini(Medium) + [(T(Large)/T(Shirt Size)]*Gini(Large) + [(T(Extra Large)/T(Shirt Size)]*Gini(Extra Large) = (5/20)*0.48 + (7/20)*0.4898 + (4/20)*0.5+ (4/20)*0.5 = 0.4914

f)       Which attribute is better, Gender, Car Type, or Shirt Size?

We like the attribute with the lowest value for the Gini Index.  Thus it is the Car Type.
Gender = .48   Car Type = .1625, and Shirt Size = .4914.


g)    Explain why Customer ID should not be used as the attribute test condition even though it has the lowest Gini.
Becuase customer id is unique and thus can't be used as a predictive attribute.  It would be useless becuase it divides it into all the possible nodes with out needing any sort of predictive behavior.


3 Consider the training examples shown in Table 4.8 for a binary classification problem.

(a)  What is the entropy of this collection of training examples with respect to the positive class?

 Class+ =4
 Class-  = 5
  Total=9

Entropy = -(4/9)*log2(4/9) - (5/9)*log2(5/9) = .9911

(b)  What are the information gains of a1 and a2 relative to these training examples?

Delta = I(Parent) - SUM[N(vj)/N]Iv(j)

Using a1 for the split gives l(a|t) with 3 + and 1 - and l(a|f) with 1 + and 4 -

l(a|t) = -(3/4)*log2(3/4) - (1/4)*log2(1/4) = .8113

l(a|f) = -(1/5)*log2(1/5) - (4/5)*log2(4/5) = .7219
>
Delta(a1) = .9911 - (4/9)*.8113 - (5/9)*.7129 = 0.2345 

Using a2 for the split gives l(a|t) with 2 + and 3 -, and l(a|f) with 2 + and 2 -

l(a|t) = -(2/5)*log2(2/5) - (3/5)*log<2(3/5) = .971

l(a|f) = -(2/4)*log2(2/4) - (2/4)*log2(2/4) = 1

Delta(a2) = .9911 - (5/9)*.971 - (4/9)*1 = 0.007211

(c)  For a3, which is a continuous attribute, compute the information gain for every possible split.

If we split at every possible point we get table that looks like this
    1<=   1>   3<=   3>   4<= 4>  5<=    5> 6<=  6>  7<=    7>   8<=   8>
  Class +   1   3   3   2   2   2   3   4   4
  Class -   0   5   1   4   1   4   3   2   3   2   4   1   5   0
  Total   1   2   3   6   4   3   1   0

Total = 9
l(T) = .9911

l(<=1.0) = -(1/1)log2(1/1) - (0/1)log2(0/1) = 0
l(>1.0) = -(3/8)log2(3/8) - (5/8)log2(5/8) = .95444
Delta(1.0) = .991 - (1/9)*0 - (8/9)*.9544 = .1427


l(<=3.0) = -(1/2)log2(1/2) - (1/2)log2(1/2) = 1
l(>3.0) = -(3/7)log2(3/7) - (4/7)log2(4/7) = .9852
Delta(3.0) = .991 - (2/9)*1 - (7/9)*.9852 = .0026

ll(<=4.0) = -(2/3)log2(2/3) - (1/3)log2(1/3) = .9183
l(>4.0) = -(2/6)log2(2/6) - (2/6)log2(2/6) = .9183
Delta(4.0) = .991 - (3/9)*.9183 - (6/9)*.9183 = .0728


l(<=5.0) = -(2/5)log2(2/5) - (3/5)log2(3/5) = .971
l(>5.0) = -(2/4)log2(2/4) - (2/4)log2(2/4) = 1
Delta(5.0) = .991 - (5/9)*.971 - (4/9)*1 = .0072


l(<=6.0) = -(3/6)log2(3/6) - (3/6)log2(3/6) = 1
l(>6.0) = -(1/3)log2(1/3) - (2/3)log2(2/3) =  .9183
Delta(6.0) = .991 - (6/9)*1 - (3/9)*.39 = .0183

l(<=7.0) = -(4/8)log2(4/8) - (4/8)log2(4/8) = 1
>l(>7.0) = -(0/1)log2(0/1) - (1/1)log2(1/1) = 0
Delta(7.0) = .991 - (8/9)*1 - (1/9)*0 = .1021

l(<=8.0)= -(4/9)log2(4/9) - (5/9)log2(5/9) = .9911
l(>8.0)= -(0/0)log2(0/0) - (0/0)log2(0/0) = 0
Delta(8.0) = .991 - (9/9)*1 - (0/9)*0= 0



(d) What is the best split (among a1, a2, and a3) according to the information gain?

The best split is using attribute a1 becuase it has the largest delta difference in entropy with .11427.  

(e) What is the best split (between a1 and a2) according to the classification error rate?

A1:

CE(T) = 1 - max[p(T+|Total), p(T-|Total)] = 1 - max[3/4, 1/4] = 1 - 3/4 = 0.25 < BR >
CE(F) = 1 - max[p(F+|Total), p(F-|Total)] = 1 - max[2/4, 2/4] = 1 - 4/5 = 0.2 < BR >
Total CE(a1) = [Total CE(T)/Total(a1)]*CE(T) + [Total CE(F)/Total(a1)]*CE(F)= (4/9)*0.25 + (5/9)*0.48 = 0.2222
A2:

CE(T) = 1 - max[p(T+|Total), p(T-|Total)] = 1 - max[2/5, 3/5] = 1 - 3/5 = 0.4
CE(F) = 1 - max[p(F+|Total), p(F-|Total)] = 1 - max[1/5, 4/5] = 1 - 1/2= 0.5
Total CE(a2) = [Total CE(T)/Total(a2)]*CE(T) + [Total CE(F)/Total(a2)]*CE(F) = (5/9)*0.4 + (4/9)*0.5 = 0.4444
 

The best split is A1 since it has the lower classification error. < /P > < /P >



(f)What is the best split (between a1 and a2) according to the Gini index?

A1:

Gini(T) = 1 - p(+|T)2 - p(-|T)2 = 1 - (3/4)2 - (1/4)2 = 0.375
Gini(F) = 1 - p(+|F)2 - p(-|F)2 = 1 - (1/5)2 - (4/5)2 = 0.32
TGini(a1) = [(Total(T)/Total(a1)]*Gini(T) + [(Total(F)/Total(a1)]*Gini(F) =(4/9)*0.375 + (5/9)*0.32 = 0.3444

A2:

Gini(T) = 1 - p(+|T)2 - p(-|T)2 = 1 - (2/5)2 - (3/5)2 = 0.48
Gini(F) = 1 - p(+|F)2 - p(-|F)2 = 1 - (2/4)2 - (2/4)2 = 0.5
TGini(a2) = [(Total(T)/Total(a2)]*Gini(T) + [(Total(F)/Total(a2)]*Gini(F) =( 5/9)*0.48 + (4/9)*0.5 = 0.4889
 

The best split is A1 since the subsets for attribute a1 have a smaller Gini index.



6 Consider the following set of training tables

(a)Compute a two-level decision tree using the greedy approach described in this chapter. Use the classification error rate as the criterion for splitting. What is the overall error rate of the induced tree?

CE(X0) = 1 -max[60/120,60/120]=.5
CE(X1) = 1 -maax[40/80,40/80]=.5
WE = 80/200 * .5 + 120/200 * .5 = .5

CE(Y0) = 1 -max[40/100,60/100]=.4
CE(Y0) = 1 -max[60/100,40/100]=.4
WE = 100/200 * .4 + 100/200 * .4 = .4


CE(Z0) = 1 -max[30/100,70/100]=.3
CE(Z0) = 1 -max[70/100,30/100]=.3
WE = 100/200 ** .3 + 00/200 * .3 = .3

Attribute Z is choosen for the first split because it is the lowest.

Z=T

CE(XT) = 1 - max[25/40 , 15/40] = .375
CE(XF) = 1 - max[45/60 , 15/60] = .25
WE = 40/100 * .375  + 60/100 * .25 = .3

CE(YT) = 1 - max[45/60 , 15/60] = .25
CE(YF) = 1 - max[25/40 , 25/40] = .375
WE = 60/100 * .25  + 40/100 * .375 = .3

Z=F

CE(XT) = 1 - max[15/40 , 25/40] = .375
CE(XF) = 1 - max[15/60 , 45/60] = .25
WE = 40/100 * .375  + 60/100 * .25 = .3

CE(YT) = 1 - max[15/40 , 25/40] = .375
CE(YF) = 1 - max[15/60 , 45/60] = .25
WE = 40/100 * .375  + 60/100 * .25 = .3

Since they are the same weighted values the decision tree will start with Z and then have both y and x come off of Z. It doesn't matter which branch you put y and z off of.

For example lets put Z then off the F branch you have X and then off the T branch of that C1:25 C2:15 and off the F branch C1:45 C2:15  off the Z's T branch you have Y and off that T branch you have C1:15 C2:25 off the F branch C1:15 C2:45.




(b)  Repeat part (a) using X as the first splitting attribute and then choose the best remaining attribute for splitting at each successor nodes.What is the error rate of the induced tree?

If the X attribute is already choosen we starrt with the next choice.

X=T

CE(ZT) = 1 - max[25/40 , 15/40] = .375
CE(ZF) = 1 - max[15/40 , 25/40] = .375
WE = 40/80 * .375  + 40/80 * .375 = .375

CE(YT) = 1 - max[5/40 , 35/40] = .125
CE(YF) = 1 - max[35/40 , 5/40] = .125
WE = 60/100 * .25  + 40/100 * .375 = .125

X=F

CE(ZT) = 1 - max[45/60 , 15/60] = .25
CE(ZF) = 1 - max[15/60 , 45/60] = .25
WE = 40/100 * .375  + 60/100 * .25 = .25

CE(YT) = 1 - max[55/60 , 5/60] = .083
CE(YF) = 1 - max[15/60 , 45/60] = .083
WE = 60/120 * .083  + 60/120 * .083 = .083

Since YT and YF on the X=F branch becuase they are lowst value.  Then the ZT AND ZF are off the X=T branch.


(c) Compare the results of parts (a) and (b).  Comment on the suitability of the greedy heuristic used for splitting attribute selection.

The Error rate increased since we choose the wrong one to start with.  The greedy heuristic doesn't seem to always get the best solution as demonstrated in part b.  Part b is not the greedy algoithm and yet it appears to have the best solution.

Hosted by www.Geocities.ws

1