Question

In: Computer Science

Question: In MATLAB, implement a hybrid clustering algorithm which combines hierarchical clustering and k-means clustering. The hybrid algorithm will use hierarchical clustering to produce stable clusters, and k-means clustering will initialize its seeds from the centroids of the produced stable clusters (instead of from randomly initialized seeds).

Background Information: Both hierarchical clustering and k-means clustering group similar data objects into clusters. However, the two algorithms have their own pros and cons. For example, hierarchical clustering produces stable clusters, while k-means clustering generates unstable clusters because of its randomly chosen initial centroids (seeds). On the other hand, in hierarchical clustering, once data objects are wrongly merged they cannot be moved to another group, whereas k-means clustering can re-assign data objects to a different group.

Solutions

Expert Solution

Hierarchical clustering identifies groups in a tree-like structure but suffers from high computational complexity on large datasets, while K-means clustering is efficient but designed to identify homogeneous, spherically shaped clusters. The solution below describes a hybrid non-parametric clustering approach (KmH) that amalgamates the two methods to identify general-shaped clusters and that can be applied to larger datasets. Specifically, we first partition the dataset into spherical groups using K-means; we next merge these groups using hierarchical methods, with a data-driven distance measure as a stopping criterion.

The algorithm has the following steps:

  1. Removing scatter from the dataset: The algorithm first removes scatter (outliers) from consideration.


    Outliers or scatter can greatly influence clustering performance. Although many methods exist, we adopt the following straightforward approach to eliminating scatter. We use K-means with the largest of our candidate group sizes ($G$) and multiple initializations to obtain a $G$-means partition. Observations in any of the $G$ groups that contain fewer than 0.1% of the observations in the dataset are labeled as scatter and eliminated from further consideration. This leaves us with $n^*$ observations $X_1, X_2, \ldots, X_{n^*}$ (say), which we proceed to cluster using KmH.
  2. Finding a partition: Our algorithm has two phases. The first focuses on finding a (potentially) large number ($K_0$) of homogeneous spherical groups, while the next merges these groups according to some criterion. We call these the K-means and hierarchical phases; a MATLAB sketch of both follows this list. The exact details of these phases are as follows:

    1. The K-means phase: For a given $K_0$ and initialization, the K-means phase uses its namesake algorithm with multiple ($m$) initializations to identify $K_0$ homogeneous, spherically-distributed groups. This phase yields $K_0$ groups $\{C_1, C_2, \ldots, C_{K_0}\}$ with means $\mu_1, \mu_2, \ldots, \mu_{K_0}$. Each obtained cluster $C_k$ is now considered to be one entity; therefore, we now have $K_0$ entities labeled $C_1, C_2, \ldots, C_{K_0}$ for consideration.

    2. Hierarchical phase: For given $K^*$ and distance $d(\cdot, \cdot)$, we successively merge the K-means groups as follows:

      1. Set $i^* = 1$ and $d^*_1 = 1$. Define $\tilde{C}_j^{(1)} = C_j$ for all $j$.

      2. For $j \in 1, \ldots, (K_0 - i^*)$, set $\tilde{C}_j^{(i^*+1)} = \tilde{C}_j^{(i^*)}$. Find $k, l$ such that $k < l$ and $d\big(\tilde{C}_k^{(i^*)}, \tilde{C}_l^{(i^*)}\big) = \min_{1 \le m < q \le (K_0 - i^* + 1)} d\big(\tilde{C}_m^{(i^*)}, \tilde{C}_q^{(i^*)}\big)$. Set $\tilde{C}_k^{(i^*+1)} = \tilde{C}_k^{(i^*)} \cup \tilde{C}_l^{(i^*)}$ and, if $l < K_0 - i^* + 1$, set $\tilde{C}_l^{(i^*+1)} = \tilde{C}_{K_0 - i^* + 1}^{(i^*)}$. Define $d^*_{i^*} = d\big(\tilde{C}_k^{(i^*)}, \tilde{C}_l^{(i^*)}\big)$. Set $i^* = i^* + 1$.

      3. If $i^* = K_0$ or $i^* = K_0 - K^* + 1$, terminate; else return to Step 2(b).
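The following is a minimal MATLAB sketch of Steps 1 and 2 above. It is an illustration, not the paper's reference implementation: the function name kmhPartition, the ten kmeans replicates, and the default distance (Euclidean distance between group means) are all assumptions; the data-driven measure $d(\cdot, \cdot)$ derived next can be passed in as distFun. kmeans requires the Statistics and Machine Learning Toolbox.

```matlab
% Sketch of KmH Steps 1-2: scatter removal, K-means phase, and
% hierarchical merging. Names and defaults are illustrative assumptions.
function [labels, mergeDists] = kmhPartition(X, G, K0, Kstar, distFun)
    if nargin < 5
        distFun = @(muA, muB) norm(muA - muB);  % stand-in for d(.,.)
    end
    n = size(X, 1);

    % Step 1: remove scatter using a G-means partition; groups holding
    % fewer than 0.1% of the observations are dropped.
    scatterIdx = kmeans(X, G, 'Replicates', 10);
    counts = accumarray(scatterIdx, 1);
    keep = counts(scatterIdx) >= 0.001 * n;
    X = X(keep, :);   % labels below refer to the retained rows only

    % Step 2(a): K-means phase with multiple initializations.
    labels = kmeans(X, K0, 'Replicates', 10);

    % Step 2(b)-(c): hierarchical phase; merge the closest pair of
    % groups until K* remain, recording each merge distance d*_{i*}.
    groups = num2cell(1:K0);   % each entity = a set of K-means labels
    mergeDists = [];
    while numel(groups) > Kstar
        best = inf; bk = 0; bl = 0;
        for k = 1:numel(groups)-1
            for l = k+1:numel(groups)
                muK = mean(X(ismember(labels, groups{k}), :), 1);
                muL = mean(X(ismember(labels, groups{l}), :), 1);
                dd = distFun(muK, muL);
                if dd < best, best = dd; bk = k; bl = l; end
            end
        end
        groups{bk} = [groups{bk}, groups{bl}];  % merge group l into k
        groups(bl) = [];
        mergeDists(end+1) = best;               %#ok<AGROW>
    end

    % Relabel observations by merged group.
    newLabels = zeros(size(labels));
    for g = 1:numel(groups)
        newLabels(ismember(labels, groups{g})) = g;
    end
    labels = newLabels;
end
```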

For the hierarchical phase of Step 2, we calculate the distance between two clusters obtained from the K-means step by assuming (non-homogeneous) spherically-dispersed Gaussian-distributed groups in the dataset. Specifically, we let $X_1, X_2, \ldots, X_{n^*}$ be independent $p$-variate observations with $X_i \sim N_p(\mu_{\zeta_i}, \sigma_{\zeta_i}^2 I)$, where $\zeta_i \in \{1, 2, \ldots, K\}$ for $i = 1, 2, \ldots, n^*$. Here we assume that the $\mu_k$'s are all distinct and that $n_k$ is the number of observations in cluster $k$. Then the density for the $X_i$'s is given by $f(X) = \sum_{k=1}^{K} I(X \in C_k)\,\phi(X; \mu_k, \sigma_k^2 I)$, where $C_k$ is a cluster indexed by the $N_p(\mu_k, \sigma_k^2 I)$ density and $I(X \in C_k)$ is an indicator function specifying whether observation $X$ belongs to the $k$th group having the $p$-dimensional multivariate normal density $\phi(X; \mu_k, \sigma_k^2 I) \propto \sigma_k^{-p} \exp\!\left[-\frac{1}{2\sigma_k^2}(X - \mu_k)'(X - \mu_k)\right]$, $k = 1, \ldots, K$. Define the distance measure

$$D_k(X_i) = \frac{(X_i - \mu_k)'(X_i - \mu_k)}{\sigma_k^2} \tag{1}$$

and the variable

$$Y_{j,l}(X) = D_j(X) - D_l(X), \quad \text{where } X \in C_l, \tag{2}$$

and $Y_{l,j}(X)$ similarly. Using the spherically-dispersed Gaussian models formulated above, $Y_{j,l}(X)$ is a random variable which represents the difference in squared distances of $X \in C_l$ to the center of $C_j$ and to the center of $C_l$. Then $p_{lj} = \Pr[Y_{j,l}(X) < 0]$ is the probability that an observation from $C_l$ is classified into $C_j$, and is calculated as follows.

Theorem 1

Let $X \sim N_p(\mu_l, \Sigma_l)$, with $\Sigma_l$ a positive-definite matrix. Further, let $Y_{j,l}(X) = D_j(X) - D_l(X)$, where $D_k(X) = (X - \mu_k)'\Sigma_k^{-1}(X - \mu_k)$ for $k \in \{j, l\}$. Let $\lambda_1, \lambda_2, \ldots, \lambda_p$ be the eigenvalues of $\Sigma_{j \mid l} \equiv \Sigma_l^{1/2}\Sigma_j^{-1}\Sigma_l^{1/2}$ with corresponding eigenvectors $\gamma_1, \gamma_2, \ldots, \gamma_p$. Then $Y_{j,l}(X)$ is distributed as $\sum_{i=1}^{p} I(\lambda_i \neq 1)\left[(\lambda_i - 1)U_i - \lambda_i \delta_i^2/(\lambda_i - 1)\right] + \sum_{i=1}^{p} I(\lambda_i = 1)\,\delta_i(2Z_i + \delta_i)$, where the $U_i$'s are independent non-central $\chi^2$ random variables with one degree of freedom and non-centrality parameter $\lambda_i^2 \delta_i^2/(\lambda_i - 1)^2$, with $\delta_i = \gamma_i'\Sigma_l^{-1/2}(\mu_l - \mu_j)$ for $i \in \{1, 2, \ldots, p\} \cap \{i : \lambda_i \neq 1\}$, independent of the $Z_i$'s, which are independent standard normal random variables, for $i \in \{1, 2, \ldots, p\} \cap \{i : \lambda_i = 1\}$.
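As a hedged illustration of Theorem 1, the sketch below estimates $p_{lj} = \Pr[Y_{j,l}(X) < 0]$ by Monte Carlo for the spherically-dispersed case used in this section ($\Sigma_k = \sigma_k^2 I$, so every eigenvalue of $\Sigma_{j \mid l}$ equals $\sigma_l^2/\sigma_j^2$ and the eigenvectors are the standard basis). The helper name estimatePlj and the default of $10^5$ draws are assumptions; ncx2rnd is from the Statistics and Machine Learning Toolbox.

```matlab
% Hypothetical helper: Monte Carlo estimate of p_lj = Pr[Y_{j,l}(X) < 0]
% via Theorem 1, specialized to spherical covariances Sigma_k = sigma_k^2*I.
function plj = estimatePlj(muJ, sigmaJ2, muL, sigmaL2, nSim)
    if nargin < 5, nSim = 1e5; end
    % All eigenvalues of Sigma_{j|l} coincide in the spherical case.
    lambda = sigmaL2 / sigmaJ2;
    % delta_i = gamma_i' * Sigma_l^{-1/2} * (mu_l - mu_j); with spherical
    % covariances the eigenvectors are the standard basis vectors.
    delta = (muL(:) - muJ(:)) / sqrt(sigmaL2);
    p = numel(delta);

    Y = zeros(nSim, 1);
    for i = 1:p
        if abs(lambda - 1) > eps
            % Non-central chi^2, 1 df, ncp = lambda^2*delta_i^2/(lambda-1)^2.
            ncp = lambda^2 * delta(i)^2 / (lambda - 1)^2;
            U = ncx2rnd(1, ncp, nSim, 1);
            Y = Y + (lambda - 1) * U - lambda * delta(i)^2 / (lambda - 1);
        else
            % lambda = 1 term: delta_i * (2*Z_i + delta_i).
            Z = randn(nSim, 1);
            Y = Y + delta(i) * (2 * Z + delta(i));
        end
    end
    plj = mean(Y < 0);
end
```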

  3. Forming multiple partitions and choosing the optimal $P^*$: Repeat Step 2 $N = ML$ times with $M$ different $K_0$'s and $L$ different $K^*$'s to form multiple partitions, then determine the optimal hierarchical partition $P^*$.
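Returning to the question as originally posed, which runs the stages in the reverse order of the KmH solution above: hierarchical clustering first, with k-means seeded from the centroids of the stable hierarchical clusters. The following is a minimal MATLAB sketch. It uses the built-ins pdist, linkage, cluster, and kmeans from the Statistics and Machine Learning Toolbox; the function name hybridCluster and the choice of Ward linkage are assumptions.

```matlab
% Minimal sketch of the question's hybrid: hierarchical clustering
% produces stable clusters, whose centroids seed k-means (instead of
% random seeds). Requires the Statistics and Machine Learning Toolbox.
function [idx, C] = hybridCluster(X, k)
    % Phase 1: hierarchical clustering (Ward linkage is an assumption).
    Z = linkage(pdist(X), 'ward');
    hIdx = cluster(Z, 'maxclust', k);

    % Compute the centroid of each stable hierarchical cluster.
    seeds = zeros(k, size(X, 2));
    for g = 1:k
        seeds(g, :) = mean(X(hIdx == g, :), 1);
    end

    % Phase 2: k-means initialized at the hierarchical centroids.
    [idx, C] = kmeans(X, k, 'Start', seeds);
end
```

For example, [idx, C] = hybridCluster(randn(200, 2), 3) partitions 200 random two-dimensional points into three groups, with C holding the final centroids. Because the seeds are deterministic given the linkage, repeated runs yield the same partition, which addresses the instability of randomly seeded k-means noted in the background.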

