|         | <=1.0 | >1.0 | <=3.0 | >3.0 | <=4.0 | >4.0 | <=5.0 | >5.0 | <=6.0 | >6.0 | <=7.0 | >7.0 | <=8.0 | >8.0 |
| Class + |     1 |    3 |     1 |    3 |     2 |    2 |     2 |    2 |     3 |    1 |     4 |    0 |     4 |    0 |
| Class - |     0 |    5 |     1 |    4 |     1 |    4 |     3 |    2 |     3 |    2 |     4 |    1 |     5 |    0 |
| Total   |     1 |    8 |     2 |    7 |     3 |    6 |     5 |    4 |     6 |    3 |     8 |    1 |     9 |    0 |
Total = 9
l(T) = -(4/9)log2(4/9) - (5/9)log2(5/9) = .9911
l(<=1.0) = -(1/1)log2(1/1) - (0/1)log2(0/1) = 0
l(>1.0) = -(3/8)log2(3/8) - (5/8)log2(5/8) = .9544
Delta(1.0) = .991 - (1/9)*0 - (8/9)*.9544 = .1427
l(<=3.0) = -(1/2)log2(1/2) - (1/2)log2(1/2) = 1
l(>3.0) = -(3/7)log2(3/7) - (4/7)log2(4/7) = .9852
Delta(3.0) = .991 - (2/9)*1 - (7/9)*.9852 = .0026
l(<=4.0) = -(2/3)log2(2/3) - (1/3)log2(1/3) = .9183
l(>4.0) = -(2/6)log2(2/6) - (4/6)log2(4/6) = .9183
Delta(4.0) = .991 - (3/9)*.9183 - (6/9)*.9183 = .0728
l(<=5.0) = -(2/5)log2(2/5) - (3/5)log2(3/5) = .971
l(>5.0) = -(2/4)log2(2/4) - (2/4)log2(2/4) = 1
Delta(5.0) = .991 - (5/9)*.971 - (4/9)*1 = .0072
l(<=6.0) = -(3/6)log2(3/6) - (3/6)log2(3/6) = 1
l(>6.0) = -(1/3)log2(1/3) - (2/3)log2(2/3) = .9183
Delta(6.0) = .991 - (6/9)*1 - (3/9)*.9183 = .0183
l(<=7.0) = -(4/8)log2(4/8) - (4/8)log2(4/8) = 1
l(>7.0) = -(0/1)log2(0/1) - (1/1)log2(1/1) = 0
Delta(7.0) = .991 - (8/9)*1 - (1/9)*0 = .1021
l(<=8.0) = -(4/9)log2(4/9) - (5/9)log2(5/9) = .9911
l(>8.0) = 0 (no records fall in this partition, so its entropy is taken as 0)
Delta(8.0) = .9911 - (9/9)*.9911 - (0/9)*0 = 0
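The hand calculations for the a3 split points can be cross-checked with a short script. This is only a sketch: the dictionary `left_counts` and the helper names `entropy` and `gain` are made up for illustration, and the counts are copied from the table of a3 split points. Results may differ from the hand-rounded values in the last decimal place.

```python
from math import log2

def entropy(pos, neg):
    """Binary entropy in bits; an empty or pure partition has entropy 0."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:  # treat 0 * log2(0) as 0, and skip empty partitions
            p = count / total
            result -= p * log2(p)
    return result

# (Class +, Class -) counts on the "<= threshold" side of each a3 split.
left_counts = {1.0: (1, 0), 3.0: (1, 1), 4.0: (2, 1), 5.0: (2, 3),
               6.0: (3, 3), 7.0: (4, 4), 8.0: (4, 5)}
TOTAL_POS, TOTAL_NEG = 4, 5

def gain(threshold):
    """Information gain of splitting a3 at the given threshold."""
    lp, ln = left_counts[threshold]
    rp, rn = TOTAL_POS - lp, TOTAL_NEG - ln  # the "> threshold" side
    n = TOTAL_POS + TOTAL_NEG
    return (entropy(TOTAL_POS, TOTAL_NEG)
            - (lp + ln) / n * entropy(lp, ln)
            - (rp + rn) / n * entropy(rp, rn))

gains = {t: gain(t) for t in left_counts}
best = max(gains, key=gains.get)
for t, g in gains.items():
    print(f"Delta({t}) = {g:.4f}")
print("best a3 threshold:", best)
```

Among the a3 split points, the largest gain comes from the split at 1.0.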
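(d) What is the best split (among a1, a2, and a3) according to the information gain?
The best split uses attribute a1 because it has the largest information gain. With 3 positive and 1 negative record for a1 = T and 1 positive and 4 negative for a1 = F:
Delta(a1) = .9911 - (4/9)*.8113 - (5/9)*.7219 = .2294
which exceeds the best a3 split, Delta(1.0) = .1427.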
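As a check on part (d), the information gain of the two binary attributes can be recomputed from their class counts. The sketch below assumes the counts used in parts (e) and (f) of this solution (a1: 3+/1- for T and 1+/4- for F; a2: 2+/3- for T and 2+/2- for F); the function names are hypothetical.

```python
from math import log2

def entropy(counts):
    """Entropy in bits of a node given its per-class record counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def info_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = sum(parent)
    return entropy(parent) - sum(sum(ch) / n * entropy(ch) for ch in children)

parent = (4, 5)            # (Class +, Class -) over all 9 records
a1 = [(3, 1), (1, 4)]      # counts for a1 = T and a1 = F
a2 = [(2, 3), (2, 2)]      # counts for a2 = T and a2 = F

gain_a1 = info_gain(parent, a1)
gain_a2 = info_gain(parent, a2)
print(f"gain(a1) = {gain_a1:.4f}")
print(f"gain(a2) = {gain_a2:.4f}")
```

Both values can then be compared against the best a3 split, Delta(1.0) = .1427.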
(e) What is the best split (between a1 and a2) according to the classification error rate?
A1:
CE(T) = 1 - max[p(+|T), p(-|T)] = 1 - max[3/4, 1/4] = 1 - 3/4 = 0.25
CE(F) = 1 - max[p(+|F), p(-|F)] = 1 - max[1/5, 4/5] = 1 - 4/5 = 0.2
Total CE(a1) = [Total(T)/Total(a1)]*CE(T) + [Total(F)/Total(a1)]*CE(F) = (4/9)*0.25 + (5/9)*0.2 = 0.2222
A2:
CE(T) = 1 - max[p(+|T), p(-|T)] = 1 - max[2/5, 3/5] = 1 - 3/5 = 0.4
CE(F) = 1 - max[p(+|F), p(-|F)] = 1 - max[2/4, 2/4] = 1 - 1/2 = 0.5
Total CE(a2) = [Total(T)/Total(a2)]*CE(T) + [Total(F)/Total(a2)]*CE(F) = (5/9)*0.4 + (4/9)*0.5 = 0.4444
The best split is a1, since it has the lower weighted classification error rate (0.2222 versus 0.4444).
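The weighted classification error for each attribute can be reproduced in a few lines. This is a minimal sketch under the assumption that a1 splits the records into 3+/1- (T) and 1+/4- (F), and a2 into 2+/3- (T) and 2+/2- (F); `classification_error` and `weighted_error` are illustrative names, not part of the original solution.

```python
def classification_error(counts):
    """Error rate of a node that predicts its majority class."""
    return 1 - max(counts) / sum(counts)

def weighted_error(children):
    """Average the child error rates, weighted by child size."""
    n = sum(sum(ch) for ch in children)
    return sum(sum(ch) / n * classification_error(ch) for ch in children)

a1 = [(3, 1), (1, 4)]   # (Class +, Class -) for a1 = T and a1 = F
a2 = [(2, 3), (2, 2)]   # (Class +, Class -) for a2 = T and a2 = F

ce_a1 = weighted_error(a1)
ce_a2 = weighted_error(a2)
print(f"CE(a1) = {ce_a1:.4f}")
print(f"CE(a2) = {ce_a2:.4f}")
```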
(f) What is the best split (between a1 and a2) according to the Gini index?
A1:
Gini(T) = 1 - p(+|T)^2 - p(-|T)^2 = 1 - (3/4)^2 - (1/4)^2 = 0.375
Gini(F) = 1 - p(+|F)^2 - p(-|F)^2 = 1 - (1/5)^2 - (4/5)^2 = 0.32
TGini(a1) = [Total(T)/Total(a1)]*Gini(T) + [Total(F)/Total(a1)]*Gini(F) = (4/9)*0.375 + (5/9)*0.32 = 0.3444
A2:
Gini(T) = 1 - p(+|T)^2 - p(-|T)^2 = 1 - (2/5)^2 - (3/5)^2 = 0.48
Gini(F) = 1 - p(+|F)^2 - p(-|F)^2 = 1 - (2/4)^2 - (2/4)^2 = 0.5
TGini(a2) = [Total(T)/Total(a2)]*Gini(T) + [Total(F)/Total(a2)]*Gini(F) = (5/9)*0.48 + (4/9)*0.5 = 0.4889
The best split is a1, since its subsets have the smaller weighted Gini index (0.3444 versus 0.4889).
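The Gini comparison can be checked the same way. The sketch below assumes the class counts used in the calculations above (a1: 3+/1- and 1+/4-; a2: 2+/3- and 2+/2-); the names `gini` and `weighted_gini` are hypothetical helpers.

```python
def gini(counts):
    """Gini index of a node: 1 minus the sum of squared class proportions."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def weighted_gini(children):
    """Size-weighted average of the child Gini indices."""
    n = sum(sum(ch) for ch in children)
    return sum(sum(ch) / n * gini(ch) for ch in children)

a1 = [(3, 1), (1, 4)]   # (Class +, Class -) for a1 = T and a1 = F
a2 = [(2, 3), (2, 2)]   # (Class +, Class -) for a2 = T and a2 = F

gini_a1 = weighted_gini(a1)
gini_a2 = weighted_gini(a2)
print(f"TGini(a1) = {gini_a1:.4f}")
print(f"TGini(a2) = {gini_a2:.4f}")
```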