Dustin Stevens-Baier
Comp 578
10-13-06
Assignment #7
1. For each of the following questions, provide an example of an association rule from the market basket domain that satisfies the following conditions. Also describe whether such rules are subjectively interesting.
I used Table 6.22 on page 404 for the market transactions. I tried Table 6.1 first, but it gave very poor examples for a couple of the rules.
(a) A rule that has high support and high confidence.
Consider the rule {a} => {e}
s({a} => {e}) = 6/10 = 0.60
c({a} => {e}) = 6/7 = 0.86
The relationship could be subjectively interesting. Retailers may find it useful for increasing cross-selling and targeting advertising.
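The support and confidence arithmetic used in these answers can be sketched as follows. This is a minimal illustration that works from the counts quoted above (6 transactions containing both a and e, 7 containing a, 10 in total). The helper names support and confidence are hypothetical and chosen only for this sketch.

```python
from fractions import Fraction

def support(count_both, n_transactions):
    """Fraction of all transactions that contain every item in the rule."""
    return Fraction(count_both, n_transactions)

def confidence(count_both, count_antecedent):
    """Fraction of transactions with the antecedent that also contain the consequent."""
    return Fraction(count_both, count_antecedent)

# Counts quoted in the answer above for the rule {a} => {e}
N = 10             # total number of transactions
count_a_and_e = 6  # transactions containing both a and e
count_a = 7        # transactions containing a

s_ae = support(count_a_and_e, N)
c_ae = confidence(count_a_and_e, count_a)
print(f"s = {float(s_ae):.2f}, c = {float(c_ae):.2f}")  # s = 0.60, c = 0.86
```

Exact fractions are used so that the ratios can be compared without floating-point rounding.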
(b) A rule that has reasonably high support but low confidence.
Consider the rule {e} => {a}
s({e} => {a}) = 6/10 = 0.60
c({e} => {a}) = 6/8 = 0.75
Not subjectively interesting: since the rule has low confidence, it does not give a reliable inference.
(c) A rule that has low support and low confidence.
Consider the rule {b} => {c}
s({b} => {c}) = 3/10 = 0.30
c({b} => {c}) = 3/6 = 0.50
Not subjectively interesting, since these items are seldom bought together.
(d) A rule that has low support and high confidence.
Consider the rule {d} => {e}
s({d} => {e}) = 5/10 = 0.50
c({d} => {e}) = 5/6 = 0.83
Subjectively interesting, since these items, although not bought as often, are almost always bought together. A company would want this information to make sure that it produces these items together.
2. Consider the data set shown in Table 6.22.
(a) Compute the support for itemsets {e}, {b,d}, and {b,d,e} by treating each transaction ID as a market basket.
s({e}) = 8/10 = 0.8
s({b,d}) = 2/10 = 0.2
s({b,d,e}) = 2/10 = 0.2
(b) Use the results in part (a) to compute the confidence for the association rules {b,d} => {e} and {e} => {b,d}. Is confidence a symmetric measure?
c({b,d} => {e}) = 2/2 = 1.0
c({e} => {b,d}) = 2/8 = 0.25
Confidence is not symmetric, as you can see from the two values above not being equal.
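The same two confidences can be reproduced directly from the supports in part (a), since the confidence of a rule is the support of all its items divided by the support of its antecedent. The sketch below uses only the support values reported above. The function name confidence_from_support is a hypothetical helper.

```python
from fractions import Fraction

# Supports reported in part (a), with each transaction as a basket
s_e = Fraction(8, 10)
s_bd = Fraction(2, 10)
s_bde = Fraction(2, 10)

def confidence_from_support(s_rule, s_antecedent):
    """c(X => Y) = s(X union Y) / s(X)."""
    return s_rule / s_antecedent

c_bd_to_e = confidence_from_support(s_bde, s_bd)  # {b,d} => {e}
c_e_to_bd = confidence_from_support(s_bde, s_e)   # {e} => {b,d}

print(float(c_bd_to_e), float(c_e_to_bd))  # 1.0 0.25
print("symmetric:", c_bd_to_e == c_e_to_bd)  # symmetric: False
```

Both rules share the same numerator, so the two values differ whenever the antecedent supports differ.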
(c) Repeat part (a) by treating each customer ID as a market basket. Each item should be treated as a binary variable.
s({e}) = 4/5 = 0.8
s({b,d}) = 4/5 = 0.8
s({b,d,e}) = 4/5 = 0.8
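Treating each customer ID as a basket amounts to taking the union of all items across that customer's transactions, with each item counted as present or absent. The sketch below shows one way to do that. The records in it are hypothetical toy data and are NOT Table 6.22, and the names support and customer_items are assumptions made only for this illustration.

```python
from collections import defaultdict

# Hypothetical toy records (NOT Table 6.22): (customer_id, transaction_id, items)
transactions = [
    (1, 101, {"x", "y"}),
    (1, 102, {"z"}),
    (2, 103, {"x"}),
    (2, 104, {"y", "z"}),
    (3, 105, {"z"}),
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

# View 1: every transaction is its own basket
transaction_baskets = [items for _, _, items in transactions]

# View 2: one basket per customer, the union of that customer's transactions
customer_items = defaultdict(set)
for customer_id, _, items in transactions:
    customer_items[customer_id] |= items
customer_baskets = list(customer_items.values())

print(support({"x", "y"}, transaction_baskets))  # 1 of 5 transactions
print(support({"x", "y"}, customer_baskets))     # 2 of 3 customers
```

In this toy data, customer 2 never buys x and y in the same transaction but still counts toward the customer-level support, which is why the two views can give different values.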
(d) Use the results in part (c) to compute the confidence for the association rules {b,d} => {e} and {e} => {b,d}.
c({b,d} => {e}) = 4/4 = 1.0
c({e} => {b,d}) = 4/4 = 1.0
(e) Suppose s1 and c1 are the support and confidence values of an association rule r when treating each transaction ID as a market basket. Also, let s2 and c2 be the support and confidence values of r when treating each customer ID as a market basket. Discuss whether there are any relationships between s1 and s2 or c1 and c2.
s1 <= s2, if s2 is a subset of s1. What this means is that if you have a frequent itemset, all of its subsets will be frequent also. A subset will always be at least as frequent as its set.
14) Answer the following questions using the data sets shown in Figure 6.34. Note that each data set contains 1000 items and 10,000 transactions. Dark cells indicate the presence of items and white cells indicate the absence of items. We will apply the Apriori algorithm to extract frequent itemsets with minsup = 10%.
(a) Which data set will produce the most number of frequent itemsets?
Data set (a) will produce the largest number of frequent itemsets. Each item appears in more than 10% of the transactions, so it is above the threshold. There are 200 items per block, which can generate itemsets of size x, where x is between 1 and 200.
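To give a sense of scale, the number of itemsets that can be formed from one block of 200 items can be counted with binomial coefficients. This is only a counting sketch: it assumes, as the answer above does, that any combination of the 200 items in a block can be frequent, and the name itemsets_of_size is a hypothetical helper. It uses math.comb, which is available in Python 3.8 and later.

```python
from math import comb

n_items = 200  # items per block, as stated in the answer above

def itemsets_of_size(x, n=n_items):
    """Number of distinct itemsets of size x drawn from n items."""
    return comb(n, x)

# Sizes 1 through 200 together give every non-empty subset of the block
total = sum(itemsets_of_size(x) for x in range(1, n_items + 1))

print(itemsets_of_size(1))  # 200
print(itemsets_of_size(2))  # 19900
print(total == 2**n_items - 1)  # True
```

The total is 2^200 - 1, which shows why a single dense block dominates the frequent itemset count.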
(b) Which data set will produce the fewest number of frequent itemsets?
Data set (d) will produce the fewest frequent itemsets, since no item appears in more than 10% of the transactions.
(c) Which data set will produce the longest frequent itemset?
Data set (e) will produce the longest frequent itemset, since the largest number of items is concentrated somewhere in the range of transactions 2000 to 4000. A lot of these items meet the 10% threshold.
(d) Which data set will produce frequent itemsets with the highest maximum support?
Data set (b) will produce frequent itemsets with the highest maximum support, since the items around item 100 appear in transactions ranging from roughly 500 to 7000.
(e) Which data set will produce frequent itemsets containing items with wide-varying support levels?
Data set (e) will produce frequent itemsets containing items with wide-varying support levels, since a support level above 40% is visible around item 500. The items that meet the 10% threshold have varying support counts.