Dustin Stevens-Baier
Comp 578
10-29-06

1. Consider a data set consisting of 2^20 data vectors, where each vector has 32 components and each componenet is a 4-byte value.� Suppose that vector quantization is used for compression and that 2^16 prototype vectors are used.� How many bytes of storage does that data set take before and� after compression and what is the compression ratio?

before compression = 4x 32 x2^20 = 134,217,728

after compression = 4x 32 x2^16� + 2x2^20 = 8,388,608 + 2,097,152 = 10,485,760

2. Find all well seperated clusters in the set of points shown in Figure 8.35

3. Many partitional clustering algorithms that automatically deteremine the number of clusters claim that this is an advantage.� List two situations in which the is not the case.

1) If you are clustering and need to reduce to a specific size then you need to know how many clusters are created.�

2) It also presents a problem with hierarchal clustering as sub clusters are ignored.

4. Given K equally sized clusters, the probability that a randomly chosen initial centroid will come from any given cluster is 1?K, but the probability that each cluster will have exactly one initial centroid is much lower.� In general, if there are K clusters and each cluster has n points, then the probability, p, of selecting in a sample of size K one initial centroid from each cluster is given by Equation 8.20.� From this formula we can calculate, for example, that the chance of having one initial centroid from each of four clusters is 4!/$^4 = .0938.

(a) Plot the probability of obtaining one point from each cluster in a sample of size K for values of K between 2 and 100.

(b) For K clusters, K=10, 100, and 1000, find the probability that a sample of size 2K contains at least one point from each cluster.� You can use either mathematical methods or statistical simulation to determine the answer.�

K=10 p = (1-1/10)^20 =.122��K=100p=(1-1/100)^200=.134��K=1000p=(1-1/1000)^2000=.135

K=10p =1-.122 = .878��K =100p = 1-.134 = .866�� K=1000p = 1-.135 =.865

K=10p= .878^10=.27�K=100p=.866^100=5.65x10^-7�K=1000p=.865^1000=1.04x10^-63

5. Identify the cluster in Figure 8.36 using the center-, contiguity, and density based definitions.� Also indicate the number of clusters for each case and give a brief indication of your reasoning.� Note that darkness or the number of dots indicates density.� If it helps assume cenetr-based means K-means, contiguity based means single link, and density based DBSCAN.

Figure a has two clusters.� The points inside the circles are core points while the points outside are noise.� These clusters are density based.� Figure b has two clusters.� The points within the circle are closer to the points inside the circle while the points outside the circle are further away, these are contiguity based.� Figure c has three clusters� one for each triangle.� Since each point inside the triangle is closer to the center of it's triangle than any other center, thus making it center based.� Figure d is two clusters the points on the right are closer to all the points on the right when compared to the points on the left, thi is why it is contiguity based.

6)� For the following sets of two dimensional points, (1) provide a sketch of how they would be split into clusters by K-means for the given number of clusters and (2) indicate approximately where the resulting centroids would be.� Assume that we are using the squared error objective function.� If you think that there is more than one possible solution, then please indicate whether each solution is a global or local minimum.� Note that the label of each diagram in Figure 8.37� matches the corresponding part of this question, e.g. Figure 8.37(a) goes with part (a).

(a) K=2 Assuming that the points are uniformly distributed in the circle, how many possible ways are there to partition the points into two clusters? What can you say about the positions� of the two centroids?

There are a ton of solutions for the centroids but one simple one is to have two on opposite sides like the following example.

(b) K=3 The distance between the edges of the circles is slightly greater than the radii of the circles.

(c) K=3 The distance between the edges of the circles is much less than the radii of the circles.

(d) K=2.

The centroids are lcoated in the middle

(e) K=3 Hint: Use the symmetry of the situation and remember that we are looking for a rough sketch of what the result would be.�

There are three centroids each in the middle of the circle.� �

Hosted by www.Geocities.ws