Dustin Stevens-Baier
Comp
578
10-29-06
1. Consider a data set consisting of 2^20
data vectors, where each vector has 32 components and each componenet is a
4-byte value.� Suppose that vector quantization is used for compression and
that 2^16 prototype vectors are used.� How many bytes of storage does that
data set take before and� after compression and what is the compression
ratio?
before compression = 4x 32 x2^20 =
134,217,728
after compression = 4x 32 x2^16� + 2x2^20 = 8,388,608 +
2,097,152 = 10,485,760
2. Find all well seperated clusters in the set of points
shown in Figure 8.35
3. Many partitional clustering
algorithms that automatically deteremine the number of clusters claim that this
is an advantage.� List two situations in which the is not the
case.
1) If you are clustering and need to reduce to a specific
size then you need to know how many clusters are created.�
2) It
also presents a problem with hierarchal clustering as sub clusters are ignored.
4. Given K equally sized clusters, the probability that a
randomly chosen initial centroid will come from any given cluster is 1?K, but
the probability that each cluster will have exactly one initial centroid is much
lower.� In general, if there are K clusters and each cluster has n points,
then the probability, p, of selecting in a sample of size K one initial centroid
from each cluster is given by Equation 8.20.� From this formula we can
calculate, for example, that the chance of having one initial centroid from each
of four clusters is 4!/$^4 = .0938.
(a) Plot the
probability of obtaining one point from each cluster in a sample of size K for
values of K between 2 and 100.

(b) For K clusters, K=10, 100, and
1000, find the probability that a sample of size 2K contains at least one point
from each cluster.� You can use either mathematical methods or statistical
simulation to determine the answer.�
K=10 p = (1-1/10)^20
=.122���K=100p=(1-1/100)^200=.134�����K=1000p=(1-1/1000)^2000=.135
K=10p =1-.122 = .878��K =100p = 1-.134 =
.866�� K=1000p = 1-.135 =.865
K=10p=
.878^10=.27�K=100p=.866^100=5.65x10^-7�K=1000p=.865^1000=1.04x10^-63
5.
Identify the cluster in Figure 8.36 using the center-, contiguity, and density
based definitions.� Also indicate the number of clusters for each case and
give a brief indication of your reasoning.� Note that darkness or the
number of dots indicates density.� If it helps assume cenetr-based means
K-means, contiguity based means single link, and density based
DBSCAN.
Figure a
has two clusters.� The points inside the
circles are core points while the points outside are noise.� These clusters
are density based.� Figure b has two clusters.� The points within the
circle are closer to the points inside the circle while the points outside the
circle are further away, these are contiguity based.� Figure c has three
clusters� one for each triangle.� Since each point inside the triangle is closer to
the center of it's triangle than any other center, thus
making it center based.� Figure d is two clusters the points on the
right are closer to all the points on the right when compared to
the points on the left, thi is why it is
contiguity based.
6)� For the following sets of two dimensional
points, (1) provide a sketch of how they would be split into clusters by K-means
for the given number of clusters and (2) indicate approximately where the
resulting centroids would be.� Assume that we are using the squared error
objective function.� If you think that there is more than one possible
solution, then please indicate whether each solution is a global or local
minimum.� Note that the label of each diagram in Figure 8.37� matches
the corresponding part of this question, e.g. Figure 8.37(a) goes with part
(a).
(a)
K=2 Assuming that the points are uniformly distributed in the circle, how many
possible ways are there to partition the points into two clusters? What can you
say about the positions� of the two centroids?
There are a
ton of solutions for the centroids but one simple one is to have two on opposite
sides like the following example.

(b) K=3 The distance between the edges of the circles is
slightly greater than the radii of the circles.
(c) K=3 The distance between the edges of the circles is
much less than the radii of the circles.
(d)
K=2.
The
centroids are lcoated in the middle

(e) K=3 Hint: Use the symmetry of the situation and
remember that we are looking for a rough sketch of what the result would
be.�
There are three centroids each in the middle of the
circle.� �