Dustin Stevens-Baier
Comp 578
10/22/06




Assignment #8


1. Consider the traffic accident data set shown in Table 7.10

  Weather Conditions   Driver's Condition   Traffic Violation   Seat Belt   Crash Severity
  Good   Alcohol-Impaired   Exceed Speed Limit   No   Major
  Bad   Sober   None   Yes   Minor
  Good   Sober   Disobey stop sign   Yes   Minor
  Good   Sober   Exceed Speed Limit   Yes   Major
  Bad   Sober   Disobey traffic signal   No   Major
  Good   Alcohol-Impaired   Disobey stop sign   Yes   Minor
  Bad   Alcohol-Impaired   None   Yes   Major
  Good   Sober   Disobey traffic signal   Yes   Major
  Good   Alcohol-Impaired   None   No   Major
  Bad   Sober   Disobey traffic signal   No   Major
  Good   Alcohol-Impaired   Exceed Speed Limit   Yes   Major
  Bad   Sober   Disobey stop sign   Yes   Minor

(a) Show a binarized version of that data set.

  WC=Good   WC=Bad   DC=Alcohol-Impaired   DC=Sober   TV=Exceed Speed Limit   TV=None   TV=Disobey stop sign   TV=Disobey traffic signal   SB=Yes   SB=No   CS=Major   CS=Minor
  1   0   1   0   1   0   0   0   0   1   1   0
  0   1   0   1   0   1   0   0   1   0   0   1
  1   0   0   1   0   0   1   0   1   0   0   1
  1   0   0   1   1   0   0   0   1   0   1   0
  0   1   0   1   0   0   0   1   0   1   1   0
  1   0   1   0   0   0   1   0   1   0   0   1
  0   1   1   0   0   1   0   0   1   0   1   0
  1   0   0   1   0   0   0   1   1   0   1   0
  1   0   1   0   0   1   0   0   0   1   1   0
  0   1   0   1   0   0   0   1   0   1   1   0
  1   0   1   0   1   0   0   0   1   0   1   0
  0   1   0   1   0   0   1   0   1   0   0   1
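
The binarization can be reproduced mechanically from Table 7.10. The following is a minimal sketch in Python, not part of the original answer; the names RECORDS, COLUMNS and binarize are illustrative. Each record sets exactly one column per original attribute, so every row sums to 5.

```python
# Records from Table 7.10 as (weather, driver, violation, seat belt, severity).
RECORDS = [
    ("Good", "Alcohol-Impaired", "Exceed Speed Limit", "No", "Major"),
    ("Bad", "Sober", "None", "Yes", "Minor"),
    ("Good", "Sober", "Disobey stop sign", "Yes", "Minor"),
    ("Good", "Sober", "Exceed Speed Limit", "Yes", "Major"),
    ("Bad", "Sober", "Disobey traffic signal", "No", "Major"),
    ("Good", "Alcohol-Impaired", "Disobey stop sign", "Yes", "Minor"),
    ("Bad", "Alcohol-Impaired", "None", "Yes", "Major"),
    ("Good", "Sober", "Disobey traffic signal", "Yes", "Major"),
    ("Good", "Alcohol-Impaired", "None", "No", "Major"),
    ("Bad", "Sober", "Disobey traffic signal", "No", "Major"),
    ("Good", "Alcohol-Impaired", "Exceed Speed Limit", "Yes", "Major"),
    ("Bad", "Sober", "Disobey stop sign", "Yes", "Minor"),
]

# Column order of the binarized table: (attribute position, attribute value).
COLUMNS = [
    (0, "Good"), (0, "Bad"),
    (1, "Alcohol-Impaired"), (1, "Sober"),
    (2, "Exceed Speed Limit"), (2, "None"),
    (2, "Disobey stop sign"), (2, "Disobey traffic signal"),
    (3, "Yes"), (3, "No"),
    (4, "Major"), (4, "Minor"),
]


def binarize(record):
    """Return the 12-element 0/1 row for one record."""
    return [1 if record[attr] == value else 0 for attr, value in COLUMNS]


binary_rows = [binarize(r) for r in RECORDS]
for row in binary_rows:
    print("   ".join(str(bit) for bit in row))
```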

(b) What is the maximum width of each transaction in the binarized data?
The maximum width is 5 items, because each of the five original attributes takes exactly one value per record, so exactly one of its binary columns is set to 1 in every transaction.

(c) Assuming that the support threshold is 30%, how many candidate and frequent itemsets will be generated?

0.3 * 12 = 3.6, rounded up to 4.  Thus the minimum support count is 4, which is an effective minsup of 4/12 = 33.3%.

Frequent 1-itemsets:

Good Weather = 7
Bad Weather = 5
DC Alcohol Impaired = 5
DC Sober = 7
TV Excess Speed = 3
TV None= 3
TV Disobey stop sign = 3
TV Disobey traffic signal = 3
SB Yes = 8
SB  No = 4
CS   Major=8
CS Minor = 4


8 of the 12 candidate 1-itemsets meet the minimum count of 4.

Frequent 2-itemsets

Good weather, Bad weather =0
Good weather, DC Alcohol Impaired = 4
Good weather, Sober Driver = 3
Good weather, SB Yes = 5
Good Weather, SB NO = 2
Good Weather, CS Major =5
Good Weather CS Minor = 2
Bad weather, DC Alcohol Impaired = 1
Bad weather, Sober Driver = 4
Bad weather, SB Yes =3
Bad Weather, SB NO = 2
Bad Weather, CS Major =3
Bad Weather CS Minor = 2
DC Alcohol Impaired, DC Sober = 0
DC Alcohol Impaired, SB Yes = 3
DC Alcohol Impaired, SB No = 2
DC Alcohol Impaired, CS Major = 4
DC Alcohol Impaired, CS Minor = 1
DC Sober, SB Yes=5
DC Sober, SB No=2
DC Sober, CS Major =4
DC Sober, CS Minor = 3
SB Yes, SB No = 0
SB Yes, CS Major = 4
SB Yes, CS Minor = 4
SB No, CS Major = 4
SB No, CS Minor = 0
CS Major, CS Minor = 0

Frequent 2-itemsets = 10

3-itemsets

Good Weather, DC Alcohol Impaired, SB Yes =2
Good Weather, DC Alcohol Impaired, CS Major = 3
Good Weather, SB Yes, CS Major = 3
DC Sober, SB Yes, CS Major = 2
SB Yes, Major Crash, Minor Crash =0

None of these 3-itemsets reaches the minimum count of 4.  Therefore there are 8 + 10 = 18 frequent itemsets when the threshold is 33.3%.
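
The tallies for part (c) can be checked with a small level-wise search. This is a sketch of the Apriori join-and-prune procedure under stated assumptions, not necessarily the exact procedure used above: the item names (WC=Good, SB=Yes, ...) and the helper apriori are illustrative. Under a strict Apriori prune, any 3-item candidate that contains an infrequent pair is dropped before counting, so fewer 3-item candidates are generated than are listed by hand.

```python
from itertools import combinations
from math import ceil

# Table 7.10, abbreviated values: (weather, driver, violation, belt, severity).
RAW = [
    ("Good", "Alcohol", "Speed", "No", "Major"),
    ("Bad", "Sober", "None", "Yes", "Minor"),
    ("Good", "Sober", "Stop", "Yes", "Minor"),
    ("Good", "Sober", "Speed", "Yes", "Major"),
    ("Bad", "Sober", "Signal", "No", "Major"),
    ("Good", "Alcohol", "Stop", "Yes", "Minor"),
    ("Bad", "Alcohol", "None", "Yes", "Major"),
    ("Good", "Sober", "Signal", "Yes", "Major"),
    ("Good", "Alcohol", "None", "No", "Major"),
    ("Bad", "Sober", "Signal", "No", "Major"),
    ("Good", "Alcohol", "Speed", "Yes", "Major"),
    ("Bad", "Sober", "Stop", "Yes", "Minor"),
]
PREFIXES = ("WC", "DC", "TV", "SB", "CS")
TRANSACTIONS = [
    frozenset(f"{p}={v}" for p, v in zip(PREFIXES, row)) for row in RAW
]


def apriori(transactions, minsup):
    """Return {k: (number of candidates, {frequent itemset: count})}."""
    # Round first to guard against floating-point noise in minsup * n.
    min_count = ceil(round(minsup * len(transactions), 9))
    items = sorted(set().union(*transactions))
    candidates = [frozenset([i]) for i in items]
    levels, k = {}, 1
    while candidates:
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= min_count}
        levels[k] = (len(candidates), frequent)
        # Join frequent k-itemsets, then prune every candidate that has
        # an infrequent k-item subset.
        joined = {a | b for a in frequent for b in frequent
                  if len(a | b) == k + 1}
        candidates = [
            c for c in joined
            if all(frozenset(s) in frequent for s in combinations(c, k))
        ]
        k += 1
    return levels


levels = apriori(TRANSACTIONS, 0.30)
for k, (n_cand, frequent) in levels.items():
    print(f"{k}-itemsets: {n_cand} candidates, {len(frequent)} frequent")
```

With this pruning the search stops after the third level, because no 3-item candidate reaches the minimum count.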

(d) Create a data set that contains only the following asymmetric binary attributes: (Weather = Bad, Driver's condition = Alcohol-impaired, Traffic violation = Yes, Seat Belt =No, Crash Severity =Major).  For Traffic violation only None has a value of 0.  The rest of the attribute values are assigned to 1.  Assuming that support threshold is 30%, how many candidate and frequent itemsets will be generated?

Using the counts from the previous parts, we know:

Bad Weather =5
DC Alcohol-Impaired =5
Traffic Violation = 9
SB No= 4
CS Major =  8

1 itemset = 5

Bad Weather, DC Alcohol-Impaired = 1
Bad Weather, Traffic Violation = 3
Bad Weather, SB No = 2
Bad Weather, CS Major = 3
DC Alcohol-Impaired, Traffic Violation = 3
DC Alcohol-Impaired, SB No = 2
DC Alcohol-Impaired, CS Major = 4
Traffic Violation, SB No =3
Traffic Violation, CS Major  =6
SB No, CS Major = 4

2 itemset = 3

No 3-itemsets are frequent: every 3-item combination contains at least one infrequent pair.

Therefore there are a total of 5 + 3 = 8 frequent itemsets.
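
Because part (d) has only five items, the counts can also be checked by enumerating every combination directly. The sketch below is illustrative only: the item names and the TESTS mapping are assumptions introduced here, and the minimum count of 4 is 30% of 12 records rounded up.

```python
from itertools import combinations

# Table 7.10, abbreviated values: (weather, driver, violation, belt, severity).
RAW = [
    ("Good", "Alcohol", "Speed", "No", "Major"),
    ("Bad", "Sober", "None", "Yes", "Minor"),
    ("Good", "Sober", "Stop", "Yes", "Minor"),
    ("Good", "Sober", "Speed", "Yes", "Major"),
    ("Bad", "Sober", "Signal", "No", "Major"),
    ("Good", "Alcohol", "Stop", "Yes", "Minor"),
    ("Bad", "Alcohol", "None", "Yes", "Major"),
    ("Good", "Sober", "Signal", "Yes", "Major"),
    ("Good", "Alcohol", "None", "No", "Major"),
    ("Bad", "Sober", "Signal", "No", "Major"),
    ("Good", "Alcohol", "Speed", "Yes", "Major"),
    ("Bad", "Sober", "Stop", "Yes", "Minor"),
]

# Each asymmetric attribute is 1 only for the value named in part (d).
TESTS = {
    "Weather=Bad": lambda r: r[0] == "Bad",
    "DC=Alcohol-Impaired": lambda r: r[1] == "Alcohol",
    "Violation=Yes": lambda r: r[2] != "None",
    "SB=No": lambda r: r[3] == "No",
    "CS=Major": lambda r: r[4] == "Major",
}
transactions = [{name for name, test in TESTS.items() if test(r)} for r in RAW]

MIN_COUNT = 4  # 30% of 12 records, rounded up

# Exhaustive enumeration: count every non-empty combination of the 5 items.
frequent = {}
for k in range(1, len(TESTS) + 1):
    for itemset in combinations(TESTS, k):
        count = sum(set(itemset) <= t for t in transactions)
        if count >= MIN_COUNT:
            frequent[itemset] = count

for itemset, count in frequent.items():
    print(", ".join(itemset), "=", count)
```

Exhaustive enumeration is only practical here because 5 items give just 31 combinations; with the 12 items of part (c) a level-wise search is the better fit.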




(e) Compare the number of candidate and frequent itemsets generated in parts (c) and (d)


Part (c) generates more candidates and more frequent itemsets than part (d): 18 frequent itemsets versus 8.  Binarizing every attribute value yields 12 items, while keeping only the asymmetric attributes yields 5, so far fewer combinations have to be examined in part (d).


2.  Consider the data set shown in Table 7.11.  Suppose we apply the following discretization strategies to the continuous attributes of the data set.

D1: Partition the range of each continuous attribute into 3 equal-sized bins.
D2: Partition the range of each continuous attribute into 3 bins, where each bin contains an equal number of transactions.

For each strategy answer the following questions:
  i. Construct a binarized version of the data set.
 ii. Derive all the frequent itemsets having support >= 30%.
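
As an illustration of strategy D1, equal-sized bins can be derived from the minimum and maximum of each attribute. The sketch below assumes the ranges 80 to 104 for temperature and 1025 to 1106 for pressure, which are read off the bin labels in the D1 table rather than from Table 7.11 itself. The helper names and the boundary rule (a value on an inner edge goes to the upper bin) are assumptions.

```python
def equal_width_edges(lo, hi, bins=3):
    """Return the bins + 1 boundaries that split [lo, hi] into equal widths."""
    width = (hi - lo) / bins
    return [lo + i * width for i in range(bins + 1)]


def bin_index(x, edges):
    """Index of the bin containing x; the top edge belongs to the last bin."""
    for i in range(len(edges) - 2):
        if x < edges[i + 1]:
            return i
    return len(edges) - 2


temp_edges = equal_width_edges(80, 104)
pressure_edges = equal_width_edges(1025, 1106)
print(temp_edges)      # [80.0, 88.0, 96.0, 104.0]
print(pressure_edges)  # [1025.0, 1052.0, 1079.0, 1106.0]
```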

For D1

  TID Temp (80-88)    Temp (88-96)   Temp (96-104)   Pressure (1025-1052)   Pressure (1052-1079)   Pressure (1079-1106)   Alarm 1   Alarm 2   Alarm 3
  1   0   1   0   0   0   1   0   0   1
  2   1   0   0   1   0   0   1   1   0
  3   0   0   1   0   0   1   1   1   1
  4   0   0   1   0   0   1   1   0   0
  5   1   0   0   1   0   0   0   1   1
  6   0   0   1   0   0   1   1   1   0
  7   1   0   0   1   0   0   1   0   1
  8   1   0   0   1   0   0   1   0   0
  9   0   0   1   0   0   1   1   1   1

For D2

 TID Temp (80-86)   Temp (86-98)   Temp (98-104)  Pressure (1025-1040)   Pressure (1040-1086)    Pressure (1086-1106)  Alarm 1   Alarm 2 Alarm 3
  1   0   1   0   0   0   1   0   0   1
  2   1   0   0   0   1   0   1   1   0
  3   0   0   1   0   0   1   1   1   1
  4   0   1   0   0   1   0   1   0   0
  5   1   0   0   1   0   0   0   1   1
  6   0   0   1   0   1   0   1   1   0
  7   1   0   0   1   0   0   1   0   1
  8   0   1   0   1   0   0   1   0   0
  9   0   0   1   0   0   1   1   1   1
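
The support count of every single item is just the column sum of the corresponding binarized table. The short sketch below copies the two matrices above (without the TID column) and sums each column; the variable names are illustrative. It also shows the difference between the strategies: the D2 bins each hold 3 transactions, while the D1 bins are uneven.

```python
# Columns: three temperature bins, three pressure bins, Alarm 1, 2, 3.
D1 = [
    [0, 1, 0, 0, 0, 1, 0, 0, 1],
    [1, 0, 0, 1, 0, 0, 1, 1, 0],
    [0, 0, 1, 0, 0, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 1, 1, 0, 0],
    [1, 0, 0, 1, 0, 0, 0, 1, 1],
    [0, 0, 1, 0, 0, 1, 1, 1, 0],
    [1, 0, 0, 1, 0, 0, 1, 0, 1],
    [1, 0, 0, 1, 0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0, 1, 1, 1, 1],
]
D2 = [
    [0, 1, 0, 0, 0, 1, 0, 0, 1],
    [1, 0, 0, 0, 1, 0, 1, 1, 0],
    [0, 0, 1, 0, 0, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 1, 0, 0],
    [1, 0, 0, 1, 0, 0, 0, 1, 1],
    [0, 0, 1, 0, 1, 0, 1, 1, 0],
    [1, 0, 0, 1, 0, 0, 1, 0, 1],
    [0, 1, 0, 1, 0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0, 1, 1, 1, 1],
]


def column_counts(matrix):
    """Support count of each single item = sum of its 0/1 column."""
    return [sum(col) for col in zip(*matrix)]


d1_counts = column_counts(D1)
d2_counts = column_counts(D2)
print(d1_counts)  # [4, 1, 4, 4, 0, 5, 7, 5, 5]
print(d2_counts)  # [3, 3, 3, 3, 3, 3, 7, 5, 5]
```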



For D1 :

0.3 * 9 = 2.7, rounded up to 3, so the minimum support count is 3.

Temp (80-88) = 4
Temp (88-96) =1
Temp (96-104) =4
Pressure (1025-1052) = 4
Pressure (1052-1079) = 0
Pressure (1079-1106) = 5
Alarm 1 = 7
Alarm 2 = 5
Alarm 3 = 5


1 itemsets =7


Temp (80-88), Temp (88-96) =0
Temp (80-88), Temp (96-104)  =0
Temp (80-88),  Pressure (1025-1052) = 4
Temp (80-88),  Pressure (1052-1079) = 0
Temp (80-88), Pressure (1079-1106) = 0
Temp (80-88),  Alarm 1 = 3
Temp (80-88), Alarm 2 = 2
Temp (80-88), Alarm 3 = 2
Temp (88-96), Temp (96-104)  =0
Temp (88-96),  Pressure (1025-1052) = 0
Temp (88-96),  Pressure (1052-1079) = 0
Temp (88-96), Pressure (1079-1106) = 1
Temp (88-96),  Alarm 1 = 0
Temp (88-96), Alarm 2 = 0
Temp (88-96), Alarm 3 = 1
Temp (96-104),Pressure (1025-1052) =0
Temp (96-104), Pressure (1052-1079) =0
Temp (96-104), Pressure (1079-1106) =4
Temp (96-104), Alarm 1 =4
Temp (96-104), Alarm 2 =3
Temp (96-104), Alarm 3 =2
Pressure (1025-1052), Pressure (1052-1079) = 0
Pressure (1025-1052), Pressure (1079-1106) = 0
Pressure (1025-1052), Alarm 1 = 3
Pressure (1025-1052), Alarm 2 = 2
Pressure (1025-1052), Alarm 3 = 2
Pressure (1052-1079), Pressure (1079-1106) = 0
Pressure (1052-1079), Alarm 1 = 0
Pressure (1052-1079), Alarm 2 =0
Pressure (1052-1079),  Alarm 3 =0
Pressure (1079-1106), Alarm 1 = 4
Pressure (1079-1106), Alarm 2 = 3
Pressure (1079-1106),   Alarm 3 = 3
Alarm 1, Alarm 2 = 4
Alarm 1, Alarm 3 = 3
Alarm 2, Alarm 3 = 3


2 itemsets= 12

Temp (80-88),  Pressure (1025-1052), Alarm 1 = 3
Temp (80-88),  Pressure (1025-1052), Alarm 2 = 2
Temp (80-88),  Pressure (1025-1052), Alarm 3 = 2
Temp (80-88),  Alarm 1, Alarm 2 = 1
Temp (80-88),  Alarm 1, Alarm 3 =  1
Temp (96-104), Pressure (1079-1106), Alarm 1 = 4
Temp (96-104), Pressure (1079-1106), Alarm 2 = 3
Temp (96-104), Pressure (1079-1106), Alarm 3 = 2
Temp (96-104), Alarm 1, Alarm 2 = 3
Temp (96-104), Alarm 1, Alarm 3 = 2
Pressure (1025-1052), Alarm 1, Alarm 2 = 1
Pressure (1025-1052), Alarm 1, Alarm 3 = 1
Pressure (1079-1106), Alarm 1, Alarm 2 = 3
Pressure (1079-1106), Alarm 1, Alarm 3 = 2
Pressure (1079-1106),   Alarm 2, Alarm 3 = 2
Alarm 1, Alarm 2, Alarm 3 = 2

3 itemsets = 5

Temp (80-88),  Pressure (1025-1052), Alarm 1, Alarm 2 = 1
Temp (80-88),  Pressure (1025-1052), Alarm 1, Alarm 3 = 1
Temp (96-104), Pressure (1079-1106), Alarm 1, Alarm 2 = 3
Temp (96-104), Pressure (1079-1106), Alarm 1, Alarm 3 = 2
Temp (96-104), Pressure (1079-1106), Alarm 2, Alarm 3 = 2

4 itemsets = 1

D1 has 7 + 12 + 5 + 1 = 25 frequent itemsets
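
The D1 tallies can be cross-checked by machine. The sketch below uses a vertical (TID-set) representation and a depth-first search, in the style of Eclat; this is a different technique from the hand enumeration above and is offered only as a check. The matrix is copied from the D1 table, and the names NAMES, tidsets and mine are illustrative.

```python
NAMES = [
    "Temp 80-88", "Temp 88-96", "Temp 96-104",
    "Pressure 1025-1052", "Pressure 1052-1079", "Pressure 1079-1106",
    "Alarm 1", "Alarm 2", "Alarm 3",
]
D1 = [
    [0, 1, 0, 0, 0, 1, 0, 0, 1],
    [1, 0, 0, 1, 0, 0, 1, 1, 0],
    [0, 0, 1, 0, 0, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 1, 1, 0, 0],
    [1, 0, 0, 1, 0, 0, 0, 1, 1],
    [0, 0, 1, 0, 0, 1, 1, 1, 0],
    [1, 0, 0, 1, 0, 0, 1, 0, 1],
    [1, 0, 0, 1, 0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0, 1, 1, 1, 1],
]
MIN_COUNT = 3  # 30% of 9 transactions, rounded up

# Vertical layout: item -> set of TIDs (1-based) whose column is 1.
tidsets = {
    name: {tid for tid, row in enumerate(D1, start=1) if row[j]}
    for j, name in enumerate(NAMES)
}


def mine(prefix, prefix_tids, remaining, out):
    """Depth-first search: extend prefix with each remaining item in turn."""
    for i, item in enumerate(remaining):
        tids = prefix_tids & tidsets[item]
        if len(tids) >= MIN_COUNT:
            itemset = prefix + (item,)
            out[itemset] = len(tids)
            # Only items later in the order are tried, so no set repeats.
            mine(itemset, tids, remaining[i + 1:], out)


frequent = {}
mine((), set(range(1, len(D1) + 1)), NAMES, frequent)

by_size = {}
for itemset in frequent:
    by_size[len(itemset)] = by_size.get(len(itemset), 0) + 1
print(sorted(by_size.items()))
```

The printed list gives the number of frequent itemsets of each size, which can be compared with the per-size tallies in the text.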


For D2:

Temp (80-86) = 3
Temp (86-98) =3
Temp (98-104) =3
Pressure (1025-1040) =3
Pressure (1040-1086) =3
Pressure (1086-1106) =3
Alarm 1 = 7
Alarm 2 = 5
Alarm 3 = 5


1 itemsets =9

Temp (80-86), Temp (86-98) = 0
Temp (80-86), Temp (98-104) = 0
Temp (80-86), Pressure (1025-1040) = 2
Temp (80-86), Pressure (1040-1086)  = 1
Temp (80-86), Pressure (1086-1106) = 0
Temp (80-86), Alarm 1 = 2
Temp (80-86), Alarm 2 = 2
Temp (80-86), Alarm 3 =  2
Temp (86-98), Temp (98-104)  = 0
Temp (86-98), Pressure (1025-1040) = 1
Temp (86-98), Pressure (1040-1086) = 1
Temp (86-98), Pressure (1086-1106) = 1
Temp (86-98), Alarm 1 = 2
Temp (86-98), Alarm 2 = 0
Temp (86-98), Alarm 3 = 1
Temp (98-104), Pressure (1025-1040) = 0
Temp (98-104), Pressure (1040-1086) = 1
Temp (98-104), Pressure (1086-1106) = 2
Temp (98-104), Alarm 1 = 3
Temp (98-104), Alarm 2 = 3
Temp (98-104), Alarm 3 = 2
Pressure (1025-1040), Pressure (1040-1086) = 0
Pressure (1025-1040), Pressure (1086-1106) = 0
Pressure (1025-1040),  Alarm 1 = 2
Pressure (1025-1040), Alarm 2 = 1
Pressure (1025-1040), Alarm 3 = 2
Pressure (1040-1086), Pressure (1086-1106) =0
Pressure (1040-1086), Alarm 1 = 3
Pressure (1040-1086), Alarm 2 = 2
Pressure (1040-1086), Alarm 3 = 0
Pressure (1086-1106), Alarm 1 = 2
Pressure (1086-1106), Alarm 2 = 2
Pressure (1086-1106), Alarm 3 = 3
Alarm 1, Alarm 2 = 4
Alarm 1, Alarm 3 = 3
Alarm 2, Alarm 3 = 3


2 itemsets = 7

Temp (98-104), Alarm 1, Pressure (1025-1040) = 0
Temp (98-104), Alarm 1, Pressure (1040-1086) = 1
Temp (98-104), Alarm 1, Pressure (1086-1106) = 2
Temp (98-104), Alarm 1, Alarm 2 = 3
Temp (98-104), Alarm 1, Alarm 3 = 2
Temp (98-104), Alarm 2, Pressure (1025-1040) = 0
Temp (98-104), Alarm 2, Pressure (1040-1086) = 1
Temp (98-104), Alarm 2, Pressure (1086-1106) = 2
Temp (98-104), Alarm 2, Alarm 3 = 2
Pressure (1040-1086), Alarm 1, Alarm 2 = 2
Pressure (1086-1106), Alarm 1, Alarm 3 = 2
Pressure (1086-1106), Alarm 2, Alarm 3 = 2
Alarm 1, Alarm 2, Alarm 3 = 2

3 itemsets = 1

D2 has 9 + 7 + 1 = 17 frequent itemsets

(b) The continuous attributes can also be discretized using a clustering approach.
   i. Plot a graph of temperature versus pressure for the data points shown in Table 7.11.
   ii. How many natural clusters do you observe from the graph? Assign a label (C1, C2, etc.) to each cluster in the graph.
   iii. What type of clustering algorithm do you think can be used to identify the clusters? State your reasons clearly.
   iv. Replace the temperature and pressure attributes in Table 7.11 with asymmetric binary attributes C1, C2, etc.  Construct a transaction matrix using the new attributes (along with attributes Alarm 1, Alarm 2, Alarm 3).
   v. Derive all the frequent itemsets having support >= 30% from the binarized data.


 


Two natural clusters appear in the plot.  The top cluster is C1 and the bottom cluster is C2.

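A clustering algorithm based on the Mahalanobis distance (for example, K-means using that distance) would be a good choice, since the distance standardizes the attributes.  Temperature spans 80-104 while pressure spans 1025-1106, so an unscaled distance would be dominated by pressure.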
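
For reference, the Mahalanobis distance is a distance measure that a clustering algorithm would use, not a clustering algorithm in itself. The sketch below computes it for two-dimensional points in plain Python. The function name mahalanobis_2d and the covariance values in the usage lines are hypothetical and are not taken from Table 7.11.

```python
from math import sqrt


def mahalanobis_2d(p, q, cov):
    """Mahalanobis distance between 2-D points p and q for covariance cov."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Inverse of the 2x2 covariance matrix.
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx, dy = p[0] - q[0], p[1] - q[1]
    # d^2 = (p - q)^T * inv(cov) * (p - q)
    squared = (dx * (inv[0][0] * dx + inv[0][1] * dy)
               + dy * (inv[1][0] * dx + inv[1][1] * dy))
    return sqrt(squared)


# Hypothetical covariance: variance 4 on the first axis, 1 on the second.
scaled = mahalanobis_2d((0, 0), (2, 0), ((4.0, 0.0), (0.0, 1.0)))
# With the identity covariance the result is the ordinary Euclidean distance.
plain = mahalanobis_2d((0, 0), (3, 4), ((1.0, 0.0), (0.0, 1.0)))
print(scaled, plain)
```

A gap of 2 units along an axis with variance 4 counts as one standard deviation, which is how the measure keeps a wide-ranging attribute from dominating.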

  TID   C1   C2   Alarm 1   Alarm 2   Alarm 3
  1   1   0   0   0   1
  2   0   1   1   1   0
  3   1   0   1   1   1
  4   1   0   1   0   0
  5   0   1   0   1   1
  6   1   0   1   1   0
  7   0   1   1   0   1
  8   0   1   1   0   0
  9   1   0   1   1   1

c1 = 5
c2 =4
Alarm 1 = 7
Alarm 2 =5
Alarm 3 =5


1 itemsets = 5

C1, C2 = 0
C1, Alarm 1 = 4
C1, Alarm 2 = 3
C1, Alarm 3 = 3
C2, Alarm 1 = 3
C2, Alarm 2 = 2
C2, Alarm 3 = 2
Alarm 1, Alarm 2 = 4
Alarm 1, Alarm 3 = 3
Alarm 2, Alarm 3 = 3

2 itemsets = 7

C1, Alarm 1, Alarm 2 = 3
C1, Alarm 1, Alarm 3 = 2
C1, Alarm 2, Alarm 3 = 2
C2, Alarm 1, Alarm 2 = 1
C2, Alarm 1, Alarm 3 = 1
Alarm 1, Alarm 2, Alarm 3 = 2

3 itemsets = 1

Total frequent itemsets = 5 + 7 + 1 = 13






