In: Computer Science
One way to cluster objects is called k-means clustering. The goal is to find k different clusters, each represented by a "prototype", defined as the centroid of cluster. The centroid is computed as follows: the jth value in the centroid is the mean (average) of the jth values of all the members of the cluster. Our goal is for every member a cluster to be closer to that cluster's prototype than to any of the other prototypes. Thus a prototype is a good representative for the items in a cluster. We start the algorithm with k initial clusters and centroids. We then alternate between two steps: For each sample, find the nearest prototype. This assigns members to each cluster. Set the prototype of each cluster as the centroid of its members. We stop when the cluster assignment doesn't change, or when we've reached a maximum number of iterations.
Can someone help me with a program in python that computes the Euclidean distance and computes cluster prototype (centroid)?
Thank you!
Hi, K-means clustering is a very standard clustering technique and there are many open source api's that support this. However I will recommend you to use the sklearn's implementation. You can also view the sklearn code on github.
The function is as follows:
class sklearn.cluster.KMeans(n_clusters=8, init=’kmeans++’, n_init=10, max_iter=300, tol=0.0001, precompute_distances=’auto’,verbose=0, random_state=None, copy_x=True, n_jobs=None, algorithm=’auto’)
now let me first explain you what this function is:
This function uses Kmeans++ that is an advanced version of Kmeans with the differences only in initialization.
Keep the rest of the parameters as default.
The code will be as follows:
from sklearn.cluster import KMeans import numpy as np X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]]) kmeans = KMeans(n_clusters=2, random_state=0).fit(X) print(kmeans.cluster_centers_)
Now let me please explain the code
Here X is our Data which is an array of data points in 2D space. Now in the second line of code we will fit that data and the centroids will be computed by the function automatically.
In the last line of code we will print the centroids.