Dustin Stevens-Baier
Comp
548
Assignment #10
9.8-1. For sparse
data, discuss why considering only the presence of non-zero values might give a
more accurate view of the objects than considering the actual magnitudes of
values.When would such an approach not be
desirable?
The best examples are probably
document simlarity and possibly number comparisons. Two documents are
similar if they contain some of the same words, the values can change depending
on how many of the same words they have in common. The same thing can be
said for sets of data having simlar values. One can find similarity by the
volume of the simlar words. This is not always reliable. We run into
problems if a small set of words dominate the comparison, or a small set of
data. The document or data set will be similar to ones with the same word
or number in higher frequency.
Such an approach might not be desirable when a set of data has
100 people 50 of which are American the other 50 are French. in
compariosn to two other data sets one set has
50 Americans and 50 Germans while the third data set has 49 Americans and 30 French
and 21 Germans One can see that the second data
set has 50 americans as compared to the third which only
had 49. If the similarity comparison was done for Americans it would choose the
wrong set. This problem only increases as
the one set starts to dominate even more.
9.8-2. Describe the change in the time
complexity of K-means as the number of clusters to be found
increases.
The number of attributes and the number of points
stays the same as K increases. K means algorithm has a time complexity of
O(m) as stated on page 571.
9.8-3.
Consider
a set of documents. Assume that all documents have
been normalized to have unit length of 1. What is the �shape� of a cluster
that consists of all documents whose cosine similarity to a centroid is greater
than some specified constant? In other words, cos (d, c) > =d,
where 0< d < 1.
Since the documents are normalized this
means that they ahve the unit length of 1 which makes the outer diameter
one. Becuase of the restraints put on it by the problem above the internal
diameter is d. The picture is basically two circles one internal diamter d
and the other diameter 1. C is in the middle and teh rest are all greater
than d so they are in the external circle but not in the internal circle.
9.8-4. Discuss
the advantages and disadvantages of treating clustering as an optimization
problem. Among other factors, consider efficiency, non-determinism, and
whether an optimization-based approach captures all types of clustering that are
of interest.
The advantages to treating clustering as an
optimization problem is that it can be customized if we know the data sets
information ahead of time thus allowing us to choose the best type of clustering
algorithm to suit are needs.
The disadavntages to treating clustering as an optimization
problem are the optimization techniques have a high time complexity.
We also need to know some of the data parameters ahead of the time which is not very
convenient. The objective functions they use tend to favor large clusters which can
cause problems.
9.8-12. Figure 9.28
shows a clustering of a two-dimensional point data set with two clusters:
The leftmost cluster, whose points are marked by asterisks, is somewhat diffuse,
while the rightmost cluster, whose points are makred by circles, is
compact. To the right of the compact cluster, there is a single point that
belongs to the diffuse cluster, whose center is farther away than that of the
compact cluster. Explain why this is possible with EM clustering, but not
K-mean clustering.
The
K-mean algorithm is not
capable of identifying clusters with varying density since it uses the center
based approach to identify its clusters. The EM clustering
algorithm computes the probability that a point belongs to the
cluster. Since the densities are varied, the EM algorithm can use the
differing probabilities taht a point belongs to a cluster..