Suppose you have been building a model using the k-means clustering algorithm and you keep finding that a certain variable is essentially ignored by the model (in other words, the variable is very similarly distributed across all clusters). Describe a method that can be used to exaggerate or minimize the impact of a variable when using k-means clustering. Why does this method work?
no additional info available, predictive analysis
Method?
One method that can be used to exaggerate or minimize the impact of a variable is projected clustering.
What is projected clustering, and why does it work?
Projected clustering is often used to cluster high-dimensional data in which the variables differ in how much they vary and are measured on different scales.
Projected clustering assigns each point in the dataset to exactly one cluster, but different clusters may exist in different subspaces of the variables. The general approach is to combine a special distance function (which may be user-designed) with a regular clustering algorithm such as k-means.
For example, the PreDeCon algorithm checks, for each point, which attributes seem to support a clustering, and then adjusts the distance function so that dimensions with low variance are amplified in it.
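As a minimal sketch of what a special distance function can look like, the snippet below uses a weighted squared Euclidean distance. This is not PreDeCon itself; the function names `weighted_sq_distance` and `rescale` and the weights are assumptions chosen for illustration. It also shows that weighting a variable by w gives the same distance as multiplying that variable by sqrt(w) and using the ordinary distance.

```python
import numpy as np

def weighted_sq_distance(x, y, weights):
    """Squared Euclidean distance with one non-negative weight per variable."""
    x, y, weights = (np.asarray(a, dtype=float) for a in (x, y, weights))
    return float(np.sum(weights * (x - y) ** 2))

def rescale(X, weights):
    """Multiply each column by sqrt(weight); plain distance on the result
    equals the weighted distance on the original data."""
    return np.asarray(X, dtype=float) * np.sqrt(np.asarray(weights, dtype=float))

a = np.array([1.0, 10.0])
b = np.array([2.0, 14.0])
w = np.array([4.0, 0.25])   # illustrative: amplify variable 0, damp variable 1

d_weighted = weighted_sq_distance(a, b, w)      # 4*1 + 0.25*16 = 8.0
a_s, b_s = rescale(np.vstack([a, b]), w)
d_rescaled = float(np.sum((a_s - b_s) ** 2))    # also 8.0
print(d_weighted, d_rescaled)
```

Because the two forms agree, a variable can be weighted simply by rescaling its column before running an unmodified k-means.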
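If the distance function weights the attributes differently but never assigns a weight of 0 (and hence never drops the irrelevant attributes), the algorithm is called a "soft"-projected clustering algorithm, signifying that the number of variables never decreases and only their relevance changes. This works because k-means assigns each point to the nearest cluster center: raising a variable's weight in the distance function exaggerates its influence on the clusters, and lowering the weight minimizes it.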
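The effect of soft weighting can be sketched on a toy dataset. Everything here is an assumption made for illustration: the eight data points, the weight of 400, and the small `kmeans` helper (plain Lloyd's iterations with a deterministic farthest-point start, written out so the example does not depend on any clustering library). Variable 1 is on a small scale and is ignored by unweighted k-means; after it is amplified, the clusters split along it instead.

```python
import numpy as np

def kmeans(X, k, n_iter=50):
    """Plain Lloyd's k-means with a deterministic farthest-point start."""
    X = np.asarray(X, dtype=float)
    centers = [X[0]]
    for _ in range(1, k):
        # next center: the point farthest from all centers chosen so far
        d = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Variable 0 spans 0..10; variable 1 is a 0/1 flag on a much smaller scale.
X = np.array([[0, 0], [10, 0], [0, 1], [10, 1],
              [1, 0], [9, 0], [1, 1], [9, 1]], dtype=float)

labels_plain, _ = kmeans(X, k=2)

weights = np.array([1.0, 400.0])   # illustrative weight, never 0 ("soft")
labels_weighted, _ = kmeans(X * np.sqrt(weights), k=2)

print(labels_plain)     # [0 1 0 1 0 1 0 1]  split follows variable 0
print(labels_weighted)  # [0 0 1 1 0 0 1 1]  split follows variable 1
```

In the unweighted run, variable 1 is distributed identically across both clusters, which is the situation described in the question. Choosing a weight below 1 for a variable has the opposite effect and pushes the clustering away from it.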