Dustin Stevens-Baier
Comp 578
Assignment
#12
The K-means method is a prototype-based partitional clustering
technique that attempts to find a user-specified number of clusters K which are
represented by their centroids. The K-means method defines a prototype in terms
of a centroid which is usually the mean of a group of points and is typically
applied to objects in a continuous n-dimensional space. The basic
algorithm is as follows (gotten from the Text Book page 497).
1.
Select K points as initial centroids
2. repeat
3.
form K clusters by assigning each point to its closest
centroid.
4. Recompute the centroid of each
cluster.
5. until Centroids do not change.
An example
of the K-means algorithm using the iris data set and JMP 6 is as
follows:

The above picture is a snapshot of the
iris data set. This data set is used to show the k-means in what is a
fairly easy to cluster example.

This image shows the four
variables sepal length, sepal width, petal length, and petal width that are used
in the clustering process.

This is the first look at the k-means
before any steps are taken you can see two of the centroids that are clearly on
the blue and red areas. The third is very small and is labeled by the 2 in the
left hand corner on the green cluster.

After the next step you can see the
shapes begin to take place and start to move towards the middle of their
respective clusters. You can see that the green cluster is actually split
apart and that the centroid actually has none of the data points from he green
cluster in it.

The next step
shows a slight shift of the centroids settling into their places.

This is step seven which is the final step. Nothing changes after this step. You can see that it has
correctly labeled all of the data points in the clusters and that the centroids
are in the middle of their data points. One can see that not all of the
data does get encircled , outliers can also extremely
effect the clusters that are found. The
best solution is to eliminate the outliers before hand if possible. In
some situations this is not ideal because we may not know what the outliers are
or they might provide interesting info that we want to include in the
analysis.
In comparison the same data
set can be looked at in JMP from using self organizing maps. The data
looks as follows.

One can see that using Self Organizing
Maps has the three centroids placed in the middle of the data set.

The next step in using SOMs shows that
there are now three centroids in the correct cluster areas.

The next step shows
that the circles are becoming more uniform in size and that they are closers
towards the centers of their clusters.

On step 7 the SOMs have finished moving
one can see that they have similar locations to the basic K-means but that they
have a few different data points inside the circles.

Looking at the data that comes from the basic K-means one
can see that cluster one has 55 points in it which is slightly off since we know
that there are supposed to be 50 points in each cluster. We can see that
this is a fairly accurate but not perfect. Cluster two is 46 data points
and cluster three is 49 data points. Another thing one can look at is the standard deviation of
the different variables for each cluster. You can see that cluster 3 is the tightest
of the three and there are a couple of values that are high,
cluster 1 sepal length and petal length, and cluster 2 petal length.

This is is cluster summary of the SOM
approach. It shows that the cluster were again a little off but pretty
close to the correct 50 data points per cluster. When comparing SOMs
standard deviations to the basic k-means algorithm. we can see that the
standard deviations are a little better, none are over the .7. It appears
that the SOM method gives slightly tighter clusters.

Another set of data that has much more error than the iris
set is the cholestrol data set. This set has four different treatments
however one of them gets left out and the data set becomes three clusters with
the fourth just being one point. It appears as if the data sets have
two many similar points so they overlap. One can also see that the standard deviations are
very high.
When computing
the
error of the K-means algorithm we could also use SSE, which is the sum of the
squared error. Given two different sets of clusters that are produced by
two different runs of K-means the most preferred one is the one with the
smallest squared error, because this means that the prototypes of this
clustering are a better representation of the points in their cluster.
Error can change when random centroids are used to start with.
Possible problems can
happen when initial centroids are chosen poorly. If they aren't picked correctly you
can get clustering that is not optimal. A very good example of this is in
the text on page 502. The basic problem is if the three clusters
have two of their centroids in the
same data area then the one of the sets will be divided when
it shouldn't be and another will be combined when it shouldn't. One way of dealing with
this problem is to run multiple runs with random initialization and choose the
one with the lowest SSE. This isn't very efficient but can give
us the best answer for a small number of K, with a relatively even
number of points in each cluster. If there is a larger number of K
then we probably will never get all of the clusters, also if one of the
clusters is significantly larger than the others it means that we will have
a harder time getting a selection not in the large
cluster.
One solution to this is to use a
hierarchal clustering technique. This technique has K clusters
extracted from the hierarchical clustering and the centroids of those clusters
are used as the initial centroids. This can work if the sample is small and K is
small as compared to the sample size.
Another problem with the
K-means algorithm is handling empty clusters. If no points are allocated
to a cluster during the assignment step then an empty cluster can be
obtained. If this happens then a replacement strategy needs to be
implemented. Two options for this are to replace the centroid from the
cluster with the highest SSE. This will usually reduce the SSE.