Dustin Stevens-Baier
Comp 548

Assignment #10




9.8-1. For sparse data, discuss why considering only the presence of non-zero values might give a more accurate view of the objects than considering the actual magnitudes of values.When would such an approach not be desirable?

The best examples are probably document simlarity and possibly number comparisons.  Two documents are similar if they contain some of the same words, the values can change depending on how many of the same words they have in common.  The same thing can be said for sets of data having simlar values.  One can find similarity by the volume of the simlar words.  This is not always reliable. We run into problems if a small set of words dominate the comparison, or a small set of data. The document or data set will be similar to ones with the same word or number in higher frequency.


Such an approach might not be desirable when a set of data has 100 people 50 of which are American the other 50 are French.  in compariosn to two other data sets one set has 50 Americans  and 50 Germans   while the third data set has   49 Americans and 30 French  and 21 Germans  One can see that the second data set has 50 americans as compared to the third which only had 49.  If the similarity comparison was done for Americans it would choose the wrong set.  This problem only increases as the one set starts to dominate even more.

9.8-2. Describe the change in the time complexity of K-means as the number of clusters to be found increases.

The number of attributes and the number of points stays the same as K increases.  K means algorithm has a time complexity of O(m) as stated on page 571.

9.8-3. Consider a set of documents. Assume that all documents have been normalized to have unit length of 1. What is the �shape� of a cluster that consists of all documents whose cosine similarity to a centroid is greater than some specified constant? In other words, cos (d, c) > =d, where 0< d < 1.

Since the documents are normalized this means that they ahve the unit length of 1 which makes the outer diameter one.  Becuase of the restraints put on it by the problem above the internal diameter is d.  The picture is basically two circles one internal diamter d and the other diameter 1.  C is in the middle and teh rest are all greater than d so they are in the external circle but not in the internal circle.

9.8-4. Discuss the advantages and disadvantages of treating clustering as an optimization problem.  Among other factors, consider efficiency, non-determinism, and whether an optimization-based approach captures all types of clustering that are of interest.

The advantages to treating clustering as an optimization problem is that it can be customized if we know the data sets information ahead of time thus allowing us to choose the best type of clustering algorithm to suit are needs. 


The disadavntages to treating clustering as an optimization problem are the optimization techniques have a high time complexity.  We also need to know some of the data parameters ahead of the time which is not very convenient.  The objective functions they use tend to favor large clusters which can cause problems.

9.8-12. Figure 9.28 shows a clustering of a two-dimensional point data set with two clusters:  The leftmost cluster, whose points are marked by asterisks, is somewhat diffuse, while the rightmost cluster, whose points are makred by circles, is compact.  To the right of the compact cluster, there is a single point that belongs to the diffuse cluster, whose center is farther away than that of the compact cluster.  Explain why this is possible with EM clustering, but not K-mean clustering.

The K-mean algorithm is not capable of identifying clusters with varying density since it uses the center based approach to identify its clusters.   The EM clustering algorithm computes the probability that a point belongs to the cluster.  Since the densities are varied, the EM algorithm can use the differing probabilities taht a point belongs to a cluster..

Hosted by www.Geocities.ws

1