In your own words, summarize the steps of K-means clustering. Make sure to give example(s). What are the advantages and disadvantages of K-means clustering? Are there any limitations?
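Step 1: Initialization. Choose the number of clusters k and pick k initial centroids, for example k randomly selected data points.
Step 2: Cluster Assignment. Assign each data point to its nearest centroid.
Step 3: Move the centroid. Recompute each centroid as the mean of the points assigned to it.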
Repeat steps 2 and 3 until the centroids stop moving; at that point the K-means algorithm has converged.
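The three steps can be sketched in plain Python. This is only a minimal illustration, not a reference implementation: the function names `kmeans` and `sq_dist`, the `seed` and `max_iter` parameters, the choice of random data points as starting centroids, and the toy points are all assumptions made for the example.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100, seed=0):
    """Minimal k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Step 1: initialization, pick k distinct data points as starting centroids.
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Step 2: cluster assignment, each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        # Step 3: move each centroid to the mean of its assigned points.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centroids.append(mean)
            else:
                # Assumption: an empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])
        # Converged: the centroids did not move in this pass.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


# Made-up toy data: three points near (1, 1) and three near (8.5, 9).
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

On this toy data the two groups are far apart, so the first three points end up in one cluster and the last three in the other.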
K-Means Advantages:
1) If the number of variables is large, K-Means is usually computationally faster than hierarchical clustering, provided k is kept small.
2) K-Means produces tighter clusters than hierarchical clustering, especially if the clusters are globular.
K-Means Disadvantages:
1) It is difficult to choose the value of k in advance.
2) It does not work well with non-globular clusters.
3) Different initial partitions can result in different final clusters.
4) It does not work well when the clusters in the original data have different sizes and different densities.
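Disadvantage 3 can be seen on a small made-up data set: the four corners of a wide rectangle. The sketch below is only an illustration under stated assumptions. The helper `lloyd` is a hypothetical name; it runs the assignment and update steps from explicitly given starting centroids and reports the within-cluster sum of squares (WCSS) of the result.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd(points, centroids, max_iter=100):
    """Run the assignment/update steps from the given starting centroids."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        # Move every centroid to the mean of its members.
        updated = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(old)  # assumption: empty clusters stay put
        if updated == centroids:
            break
        centroids = updated
    # Within-cluster sum of squares: lower means tighter clusters.
    wcss = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, wcss


# Corners of a rectangle that is 4 wide and 1 tall.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

# Start from one left and one right corner: left/right split.
print(lloyd(points, [(0, 0), (4, 0)]))  # ([0, 0, 1, 1], 1.0)

# Start from the two left corners: bottom/top split, much looser clusters.
print(lloyd(points, [(0, 0), (0, 1)]))  # ([0, 1, 0, 1], 16.0)
```

Both runs converge, but to different final clusters. Comparing the WCSS of several runs is one common way to decide which result to keep.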
Example
The K-means algorithm is very popular and is used in a variety of applications such as market segmentation, document clustering, image segmentation, and image compression. The goal of a cluster analysis is usually one of the following:
Get a meaningful intuition of the structure of the data we’re dealing with.
Cluster-then-predict, where a different model is built for each subgroup if we believe behavior varies widely between subgroups. An example is clustering patients into subgroups and building a model for each subgroup to predict the risk of a heart attack.
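A rough sketch of the cluster-then-predict idea follows. It assumes the clustering has already been done, so the centroids and labels are given. Everything here is hypothetical: the function names, the toy numbers, and the deliberately trivial per-cluster "model", which is just the mean outcome of the cluster's members. A real analysis would fit a proper model per subgroup.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fit_cluster_means(labels, outcomes):
    """One trivial model per cluster: the mean outcome of its members."""
    totals, counts = {}, {}
    for lab, y in zip(labels, outcomes):
        totals[lab] = totals.get(lab, 0.0) + y
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: totals[lab] / counts[lab] for lab in totals}


def predict(x, centroids, cluster_models):
    """Send a new point to its nearest centroid and use that cluster's model."""
    nearest = min(range(len(centroids)), key=lambda j: sq_dist(x, centroids[j]))
    return cluster_models[nearest]


# Made-up values: two clusters, four training points with 0/1 outcomes.
centroids = [(1.0, 1.0), (8.0, 8.0)]
labels = [0, 0, 1, 1]
outcomes = [0, 1, 1, 1]

models = fit_cluster_means(labels, outcomes)
print(models)
print(predict((2.0, 1.0), centroids, models))
```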
Limitations
If any one of these three assumptions is violated, k-means can produce poor clusters.