Dustin Stevens-Baier

Comp 578

10-13-06

Assignment #7

1. For each of the following questions, provide an example of an association rule from the market basket domain that satisfies the following conditions. Also describe whether such rules are subjectively interesting.

I used Table 6.22 on page 404 for the market transactions. I tried Table 6.1 first, but it gave very poor examples for a couple of the rules.

(a) A rule that has high support and high confidence.

Consider the rule {a} => {e}
s({a}=>{e}) = 6/10 = 0.60
c({a}=>{e}) = 6/7 = 0.86

The relationship could be subjectively interesting. Retailers may find it useful for increasing cross-selling and for advertising.

(b) A rule that has reasonably high support but low confidence.

Consider the rule {e} => {a}
s({e} => {a}) = 6/10 = 0.60
c({e} => {a}) = 6/8 = 0.75

Not subjectively interesting: because the confidence is comparatively low, the rule does not give a reliable inference.

(c) A rule that has low support and low confidence.

Consider the rule {b} => {c}
s({b} => {c}) = 3/10 = 0.30
c({b} => {c}) = 3/6 = 0.50

Not subjectively interesting: these items are seldom bought together.

(d) A rule that has low support and high confidence.

Consider the rule {d} => {e}
s({d} => {e}) = 5/10 = 0.50
c({d} => {e}) = 5/6 = 0.83

Subjectively interesting, since these items, although not bought as often, are almost always bought together. A company would want this information to make sure that it produces these items together.
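The support and confidence figures in question 1 all follow the same two formulas, s(X => Y) = count(X and Y) / N and c(X => Y) = count(X and Y) / count(X). A minimal Python sketch of that arithmetic is below. The baskets in it are hypothetical and are not the rows of Table 6.22, and the function names are chosen only for illustration.

```python
# Hypothetical baskets for illustration only; these are NOT the rows of Table 6.22.
TRANSACTIONS = [
    frozenset("ab"),
    frozenset("abc"),
    frozenset("ac"),
    frozenset("a"),
    frozenset("ab"),
]


def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t)


def support(itemset, transactions):
    """Fraction of transactions that contain the whole itemset."""
    return count(itemset, transactions) / len(transactions)


def confidence(lhs, rhs, transactions):
    """count(lhs and rhs) / count(lhs) for the rule lhs => rhs."""
    lhs_count = count(lhs, transactions)
    if lhs_count == 0:
        raise ValueError("left-hand side never occurs in the data")
    both = frozenset(lhs) | frozenset(rhs)
    return count(both, transactions) / lhs_count


print(support("ab", TRANSACTIONS))          # support of the rule {a} => {b}
print(confidence("a", "b", TRANSACTIONS))   # confidence of {a} => {b}
print(confidence("b", "a", TRANSACTIONS))   # confidence of {b} => {a}
```

Whether a given value counts as "high" or "low" still depends on the thresholds chosen for the data set, so the functions only report the numbers.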

2. Consider the data set shown in Table 6.22

(a) Compute the support for itemsets {e}, {b,d}, and {b,d,e} by treating each transaction ID as a market basket.

s{e} = 8/10 = 0.8
s{b, d} = 2/10 = 0.2
s{b, d, e} = 2/10 = 0.2

(b) Use the results in part (a) to compute the confidence for the association rules {b,d} -> {e} and {e} -> {b,d}. Is confidence a symmetric measure?

c({b,d} => {e}) = 2/2 = 1.0
c({e} => {b,d}) = 2/8 = 0.25
Confidence is not symmetric: the two values above are not equal.
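The asymmetry comes from the denominator: both rules share the numerator count({b,d,e}), but each divides by the count of its own left-hand side. A short Python sketch using the counts from part (a) (8, 2, and 2 out of 10 transactions) illustrates this. The variable names are only for illustration.

```python
# Counts taken from part (a): out of 10 transactions,
# {e} appears in 8, {b,d} in 2, and {b,d,e} in 2.
N = 10
count_e = 8
count_bd = 2
count_bde = 2

# Both rules share the same numerator; only the denominator changes.
c_bd_to_e = count_bde / count_bd   # {b,d} => {e}
c_e_to_bd = count_bde / count_e    # {e} => {b,d}

# Support, by contrast, is the same for both directions of the rule.
s_rule = count_bde / N

print(c_bd_to_e, c_e_to_bd, s_rule)
```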

(c) Repeat part (a) by treating each customer ID as a market basket. Each item should be treated as a binary variable.

s{e} = 4/5 = 0.8
s{b, d} = 4/5 = 0.8
s{b, d, e} = 4/5 = 0.8

(d) Use the results in part (c) to compute the confidence for the association rules {b,d} -> {e} and {e} -> {b,d}.

c({b,d}=>{e}) = 4 / 4 = 1.0
c({e}=>{b, d}) = 4 / 4 = 1.0

(e) Suppose s1 and c1 are the support and confidence values of an association rule r when treating each transaction ID as a market basket. Also, let s2 and c2 be the support and confidence values of r when treating each customer ID as a market basket. Discuss whether there are any relationships between s1 and s2 or c1 and c2.

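s1 <= s2 for the itemsets computed above (for example, 0.2 versus 0.8 for {b,d,e}). Each customer's basket is the union of that customer's transactions, so any itemset contained in a single transaction is also contained in the basket of the customer who made it, and merging transactions can only add co-occurrences. This ordering is not guaranteed in general, because s1 and s2 are fractions with different denominators (10 transactions versus 5 customers). Similarly, c1 <= c2 for both rules here (1.0 versus 1.0, and 0.25 versus 1.0), but confidence is a ratio of two counts that both change, so it has no fixed relationship in general.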
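The two ways of forming baskets in question 2 can be sketched in Python as follows. The rows are hypothetical and are not taken from Table 6.22. The helper names are also only illustrative. The sketch groups transactions by customer ID, takes the union of each customer's items, and computes support under both views.

```python
from collections import defaultdict

# Hypothetical (customer_id, transaction_id, items) rows; NOT Table 6.22.
ROWS = [
    (1, 101, {"a", "b"}),
    (1, 102, {"c"}),
    (2, 103, {"a"}),
    (2, 104, {"b", "c"}),
    (3, 105, {"a", "b"}),
]


def transaction_baskets(rows):
    """One basket per transaction ID."""
    return [frozenset(items) for _, _, items in rows]


def customer_baskets(rows):
    """One basket per customer ID: the union of that customer's transactions."""
    merged = defaultdict(set)
    for customer_id, _, items in rows:
        merged[customer_id] |= items
    return [frozenset(items) for items in merged.values()]


def support(itemset, baskets):
    """Fraction of baskets that contain the whole itemset."""
    items = frozenset(itemset)
    return sum(1 for b in baskets if items <= b) / len(baskets)


s1 = support({"a", "b"}, transaction_baskets(ROWS))  # per transaction
s2 = support({"a", "b"}, customer_baskets(ROWS))     # per customer
print(s1, s2)
```

In this toy data, customer 2 never buys a and b in the same transaction but does buy both overall, which is why the customer-level support is higher.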

14. Answer the following questions using the data sets shown in Figure 6.34. Note that each data set contains 1000 items and 10,000 transactions. Dark cells indicate the presence of items and white cells indicate the absence of items. We will apply the Apriori algorithm to extract frequent itemsets with minsup = 10%.

(a) Which data set will produce the most number of frequent itemsets?

Data set a will produce the most frequent itemsets. Every item it contains appears in more than 10% of the transactions. There are 200 items per block, which can generate frequent x-itemsets for x between 1 and 200.

(b) Which data set will produce the fewest number of frequent itemsets?

Data set d will produce the fewest frequent itemsets, since no item meets the 10% support threshold.

(c) Which data set will produce the longest frequent itemset?

Data set e will produce the longest frequent itemset, since the largest number of items is concentrated in roughly the 2000 to 4000 transaction range. Many of these items meet the 10% threshold.

(d) Which data set will produce frequent itemsets with the highest maximum support?

Data set b will produce frequent itemsets with the highest maximum support, since the items around item 100 appear in transactions ranging from roughly 500 to 7000.

(e) Which data set will produce frequent itemsets containing items with wide-varying support levels?

Data set e will produce frequent itemsets containing items with widely varying support levels: items with support above 40% are visible around item 500, and the other items that meet the 10% threshold have a range of support counts.
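The quantities compared in question 14 (number of frequent itemsets, longest frequent itemset, and maximum support) can be read off the output of a level-wise Apriori pass. The Python sketch below is a minimal version of that idea, run on a tiny hypothetical data set with minsup = 50%. It does not use the data in Figure 6.34, and the function and variable names are assumptions made for illustration.

```python
from itertools import combinations

# Hypothetical transactions for illustration only; NOT the data in Figure 6.34.
TRANSACTIONS = [
    frozenset("abc"),
    frozenset("ab"),
    frozenset("ac"),
    frozenset("b"),
]


def apriori(transactions, minsup):
    """Return {frequent itemset: support}, built level by level."""
    n = len(transactions)

    def sup(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: frequent single items.
    items = {i for t in transactions for i in t}
    current = [frozenset([i]) for i in items if sup(frozenset([i])) >= minsup]
    frequent = {c: sup(c) for c in current}

    k = 2
    while current:
        prev = set(current)
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # Count step: keep the candidates that meet minsup.
        current = [c for c in candidates if sup(c) >= minsup]
        frequent.update({c: sup(c) for c in current})
        k += 1
    return frequent


result = apriori(TRANSACTIONS, minsup=0.5)
num_frequent = len(result)                    # how many frequent itemsets
longest = max(len(s) for s in result)         # size of the longest one
max_support = max(result.values())            # highest support observed
print(num_frequent, longest, max_support)
```

Here {b,c} falls below minsup, so the candidate {a,b,c} is pruned without being counted, which is the property Apriori relies on.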
