Please show the differences between the k-means clustering algorithms using graphs: the Hartigan-Wong method, the MacQueen method, and the Lloyd-Forgy method.
Hartigan's method for k-means clustering holds several potential advantages compared to the classical and prevalent optimization heuristic known as Lloyd's algorithm. The algorithm is proven to converge in the sense that after a finite number of reassignment trials the assignments of the data items no longer change (Figure 2 of the paper linked at the end).
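The single-point moves that drive Hartigan's method are easy to state, and the convergence argument falls out of them. Below is a minimal R sketch of one improvement pass; the function and variable names are my own, the real Hartigan-Wong routine adds a "quick-transfer" stage and live-set bookkeeping that this sketch omits, and it assumes every cluster starts non-empty (it then stays non-empty):

```r
# One improvement pass of Hartigan-style k-means (illustrative sketch only).
# Removing point x from cluster i (size n_i, centre c_i) lowers the total
# within-cluster SSE by  n_i/(n_i - 1) * ||x - c_i||^2 ; adding it to
# cluster j raises it by  n_j/(n_j + 1) * ||x - c_j||^2 .
hartigan_pass <- function(X, assign, k) {
  moved <- FALSE
  for (p in seq_len(nrow(X))) {
    sizes <- tabulate(assign, nbins = k)
    i <- assign[p]
    if (sizes[i] <= 1) next                    # never empty a cluster
    centers <- t(sapply(seq_len(k),
                        function(j) colMeans(X[assign == j, , drop = FALSE])))
    x    <- X[p, ]
    gain <- sizes[i] / (sizes[i] - 1) * sum((x - centers[i, ])^2)
    cost <- sapply(seq_len(k),
                   function(j) sizes[j] / (sizes[j] + 1) * sum((x - centers[j, ])^2))
    cost[i] <- Inf                             # staying put is not a move
    if (min(cost) < gain) {                    # the move strictly lowers the SSE
      assign[p] <- which.min(cost)
      moved <- TRUE
    }
  }
  list(assign = assign, moved = moved)         # repeat passes until !moved
}
```

Because every accepted move strictly lowers the total sum of squares and there are only finitely many possible assignments, repeating such passes must eventually stop, which is exactly the convergence property mentioned above.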
K-means (MacQueen, 1967) is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centroids, one for each cluster. These centroids should be placed carefully, because different locations lead to different results; the better choice is therefore to place them as far away from each other as possible. The next step is to take each point of the data set and associate it with the nearest centroid. When no point is left, the first step is complete and an early grouping is done. At this point we re-calculate k new centroids as the barycenters of the clusters resulting from the previous step. After we have these k new centroids, a new binding is made between the same data set points and the nearest new centroid. A loop has been generated: as a result of this loop the k centroids change their location step by step until no more changes occur, i.e. the centroids no longer move.
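A minimal R transcription of the procedure just described might look as follows. This is a sketch under assumptions, not MacQueen's actual implementation: my_kmeans and all names are made up, squared Euclidean distance is assumed, and the sketch does not handle clusters that happen to become empty:

```r
my_kmeans <- function(X, k, max_iter = 100) {
  # Define k initial centroids: here, k distinct data points chosen at random.
  centers <- X[sample(nrow(X), k), , drop = FALSE]
  assign  <- integer(nrow(X))
  for (iter in seq_len(max_iter)) {
    # Associate each point with the nearest centroid.
    d <- sapply(seq_len(k), function(j) colSums((t(X) - centers[j, ])^2))
    new_assign <- max.col(-d)                  # column index of smallest distance
    # Stop when the binding no longer changes: the centroids have stopped moving.
    if (identical(new_assign, assign)) break
    assign <- new_assign
    # Re-calculate the centroids as barycenters of the resulting clusters.
    centers <- t(sapply(seq_len(k),
                        function(j) colMeans(X[assign == j, , drop = FALSE])))
  }
  list(assign = assign, centers = centers, iterations = iter)
}
```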
Finally, this algorithm aims at minimizing an objective function, in this case a squared-error function. The objective function

$$J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2,$$

where $\| x_i^{(j)} - c_j \|^2$ is a chosen distance measure between a data point $x_i^{(j)}$ and the cluster centre $c_j$, is an indicator of the distance of the n data points from their respective cluster centres.
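In code, the objective is just the total squared distance of every point to its assigned centre. A short R version (names are illustrative; Euclidean distance assumed):

```r
# X: n x p data matrix, assign: length-n vector of cluster labels (1..k),
# centers: k x p matrix of cluster centres.
sq_error <- function(X, assign, centers) {
  sum((X - centers[assign, , drop = FALSE])^2)
}
```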
The algorithm is composed of the following steps:
1. Place k points into the space represented by the objects being clustered. These points represent the initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the k centroids.
4. Repeat steps 2 and 3 until the centroids no longer move.
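As a quick check, the steps can be exercised on toy data with the my_kmeans sketch given earlier (a hypothetical helper, not a library function):

```r
set.seed(42)
X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),   # one blob around (0, 0)
           matrix(rnorm(100, mean = 4), ncol = 2))   # one blob around (4, 4)
res <- my_kmeans(X, k = 2)
res$centers        # final centroids ("barycenters")
table(res$assign)  # cluster sizes
```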
Although it can be proved that the procedure will always terminate, the k-means algorithm does not necessarily find the optimal configuration, corresponding to the global objective function minimum. The algorithm is also significantly sensitive to the initial, randomly selected cluster centres. The k-means algorithm can be run multiple times to reduce this effect.
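This sensitivity is easy to demonstrate with R's built-in kmeans(): the spread of the objective across random restarts is the symptom (toy data, illustrative only):

```r
set.seed(1)
X <- matrix(rnorm(400), ncol = 2)
# Ten runs from different random starts can end in different local minima.
fits <- replicate(10, kmeans(X, centers = 3, algorithm = "Lloyd",
                             iter.max = 100)$tot.withinss)
range(fits)   # a wide range signals dependence on the initial centres
```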
K-means is a simple algorithm that has been adapted to many problem domains, and it is a good candidate for extension to work with fuzzy feature vectors.
R provides Lloyd's algorithm as an option to kmeans(); the default algorithm, by Hartigan and Wong (1979), is much smarter. Like MacQueen's algorithm (MacQueen, 1967), it updates the centroids any time a point is moved; it also makes clever (time-saving) choices in checking for the closest cluster. Lloyd's k-means algorithm, on the other hand, is the first and simplest of all these clustering algorithms.
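For reference, the three variants discussed here can be selected directly via the algorithm argument of kmeans(). The comparison below is a toy illustration only; which method comes out ahead depends on the data and the starting points:

```r
set.seed(2)
X <- matrix(rnorm(400), ncol = 2)
for (alg in c("Hartigan-Wong", "Lloyd", "MacQueen")) {
  fit <- kmeans(X, centers = 3, algorithm = alg, iter.max = 100)
  cat(sprintf("%-14s total within-SS = %.2f  (iterations: %d)\n",
              alg, fit$tot.withinss, fit$iter))
}
```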
Lloyd's algorithm (Lloyd, 1957) takes a set of observations or cases (think: rows of an n x p matrix, or points in $\mathbb{R}^p$) and clusters them into k groups. It tries to minimize the within-cluster sum of squares

$$\arg\min_{S} \sum_{i=1}^{k} \sum_{x_j \in S_i} \left\| x_j - u_i \right\|^2,$$

where u_i is the mean of all the points in cluster S_i. The algorithm proceeds as follows (I'll spare you the formality of the exhaustive notation):
1. Choose k initial cluster centres, e.g. k observations picked at random.
2. Assign each observation to the cluster whose centre is closest.
3. Recompute each centre as the mean of the observations assigned to it.
4. Repeat steps 2 and 3 until the assignments no longer change.
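As a sanity check on the formula, the within-cluster sum of squares can be computed by hand from a fitted model and compared with what kmeans() reports:

```r
set.seed(7)
X   <- matrix(rnorm(200), ncol = 2)
fit <- kmeans(X, centers = 3, algorithm = "Lloyd", iter.max = 100)
# Direct evaluation of sum_i sum_{x in S_i} ||x - u_i||^2:
manual <- sum((X - fit$centers[fit$cluster, ])^2)
all.equal(manual, fit$tot.withinss)   # should be TRUE
```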
There is a problem with R's implementation, however, and the problem arises when considering multiple starting points. I should note that it's generally prudent to consider several different starting points, because the algorithm is guaranteed to converge, but is not guaranteed to converge to a global optimum. This is particularly true for large, high-dimensional problems.
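In practice R handles the restarts for you: the nstart argument of kmeans() runs the algorithm from that many random starting points and keeps the solution with the lowest total within-cluster sum of squares:

```r
set.seed(3)
X   <- matrix(rnorm(600), ncol = 3)
# 25 random starts; the best (lowest tot.withinss) result is returned.
fit <- kmeans(X, centers = 4, nstart = 25, iter.max = 100)
fit$tot.withinss
```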
The link for the diagrams: https://www.ijcai.org/Proceedings/13/Papers/249.pdf