In: Computer Science
How to solve for overfitting or underfitting when selecting the number of components for a Gaussian mixture model?
Suppose you are in the situation depicted in Figure 1: you want to discern how many clusters you have (or, if you prefer, how many Gaussian components generated the data), and you have no information about the "ground truth". This is a real case, where the data don't have the nicety of behaving well the way simulated data do.
Figure 1: The vectors we are going to cluster. The transparency of the points reflects the density.
At a first look, one could scream "Three main clusters plus two minor ones!", and perhaps that's even correct, but here we would like to find an automatic method for locating the "right" number of clusters.
It's to be noted that this "right" number is somewhat fuzzy anyway, since the specifics of each problem can lead one to override the decision of the automated algorithm.
But let me introduce the GMM results for the configuration we just guessed (five clusters). In the plot below (Figure 2) we fitted a GMM with five components to our data. And we have our first issue: same data, same model, but different results.
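To make the setup concrete, here is a minimal sketch of such a fit with scikit-learn. The array X and the file name are placeholders for the actual data, not the original post's code:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical placeholder: X is a (n_samples, 2) array holding the vectors of Figure 1
X = np.loadtxt("data.txt")  # assumption: the data live in a plain-text file

# Two fits of the same five-component model, differing only in the random seed
gmm_a = GaussianMixture(n_components=5, random_state=1).fit(X)
gmm_b = GaussianMixture(n_components=5, random_state=2).fit(X)

# The recovered means (and hence the cluster shapes) can differ between the runs
print(gmm_a.means_)
print(gmm_b.means_)
```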
Figure 2: Cluster shapes: since the algorithm we rely on is not deterministic, we can get very different results!
The culprit for this behavior is the fitting procedure: the Expectation-Maximization (EM) algorithm. This algorithm only guarantees that we land on a locally optimal point; it does not guarantee that this local optimum is also the global one. So, if the algorithm starts from different initialization points, it generally lands in different configurations.
While there are other procedures to fit a GMM, we want to stick with this one: the other methods are more complex and require more hyperparameters to be set, and this exceeds the goals of this post.
We then have to take into account the non-deterministic nature of our model.
The easiest way of handling it is to run the fitting procedure several times, and consider the mean and the variance for every configuration. Put simply: we attach an error to each fit.
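As a sketch of this idea (the function name and the score used are my own illustrative choices, reusing the X array from above):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_scores(X, n_components, n_runs=20):
    """Fit the same GMM configuration several times (different seeds)
    and return the per-run mean log-likelihood of the data."""
    scores = []
    for seed in range(n_runs):
        gmm = GaussianMixture(n_components=n_components, random_state=seed).fit(X)
        scores.append(gmm.score(X))  # average log-likelihood per sample
    return np.array(scores)

scores = fit_scores(X, n_components=5)
print(scores.mean(), scores.std())  # the 'error' we attach to this configuration
```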
Cluster performance evaluation(s)
Since we don't know the ground truth of our cluster generators, i.e. we are not aware of the original distribution that generated the data, our choices for evaluating the performance of the clustering process are limited and quite noisy.
Nonetheless, we will explore three different techniques.
Silhouette score
This score, as clearly stated by the SKLearn developers, considers two measures:
The distance between a sample and all other points in the same cluster.
The distance between a sample and all other points in the next nearest cluster.
i.e. it checks how compact and well separated the clusters are. The closer the score is to one, the better the clustering.
Since we already know that the fitting procedure isn't deterministic, we run twenty fits for every number of clusters, then we consider the mean and the variance of the five best runs. The results are in Figure 3.
Figure 3: Silhouette scores for our dataset
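A sketch of the procedure behind Figure 3, under the same assumptions as above (the range of cluster numbers is illustrative):

```python
import numpy as np
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

def silhouette_stats(X, n_components, n_runs=20, n_best=5):
    """Fit n_runs GMMs, score each clustering with the silhouette
    (per sample: s = (b - a) / max(a, b), then averaged over samples),
    and return the mean and std of the n_best best runs."""
    scores = []
    for seed in range(n_runs):
        labels = GaussianMixture(n_components=n_components,
                                 random_state=seed).fit_predict(X)
        scores.append(silhouette_score(X, labels))
    best = np.sort(scores)[-n_best:]  # keep only the best runs
    return best.mean(), best.std()

for k in range(2, 9):
    print(k, silhouette_stats(X, k))
```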
Well, it seems that we get the best score with five clusters. We also have to consider that the four-cluster configuration is almost equally good, if we consider the standard deviation (the 'error') of both configurations. So, this score doesn't give us a clear result.
Distance between GMMs
Here we split the data into two sets, each with a randomly chosen half of the samples. We then check how similar the GMMs trained on the two sets are, for each configuration.
Since we are talking about distributions, the concept of similarity is embedded in the Jensen-Shannon (JS) metric. The smaller the JS distance between the two GMMs, the more the GMMs agree on how to fit the data.
All the credit for the (beautiful!) way the Jensen-Shannon metric is calculated goes to Dougal and his post on StackOverflow.
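In the spirit of that answer, here is a sketch of a Monte Carlo estimate of the JS distance between two fitted GaussianMixture objects; the function name and sample size are my own choices:

```python
import numpy as np

def gmm_js_distance(gmm_p, gmm_q, n_samples=10_000):
    """Monte Carlo estimate of the Jensen-Shannon distance between two
    fitted GMMs: JS divergence = 0.5*KL(P||M) + 0.5*KL(Q||M), with
    M = (P + Q) / 2; the metric is the square root of the divergence."""
    X_p, _ = gmm_p.sample(n_samples)
    X_q, _ = gmm_q.sample(n_samples)

    # Log densities of each sample set under both models
    log_p_xp, log_q_xp = gmm_p.score_samples(X_p), gmm_q.score_samples(X_p)
    log_p_xq, log_q_xq = gmm_p.score_samples(X_q), gmm_q.score_samples(X_q)

    # Log density of the mixture M = (P + Q) / 2
    log_m_xp = np.logaddexp(log_p_xp, log_q_xp) - np.log(2)
    log_m_xq = np.logaddexp(log_p_xq, log_q_xq) - np.log(2)

    # KL(P||M) is estimated with samples from P, KL(Q||M) with samples from Q
    js_div = 0.5 * np.mean(log_p_xp - log_m_xp) + 0.5 * np.mean(log_q_xq - log_m_xq)
    return np.sqrt(max(js_div, 0.0))  # clip tiny negative Monte Carlo noise
```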
Figure 4: Distance between the GMMs trained on the two data halves.
Following this technique, the configuration with three clusters is the most conservative; this does not necessarily mean that it's the right configuration, since we are only talking about the reproducibility of the results between the two sets.
The interesting part of this plot is the sudden change in the distance we see after passing the six-cluster configuration: both the mean distance and its associated error show a huge increase.
This means that starting from seven clusters, the GMMs of the two
sets are much more different (since the distance is much bigger),
and also less stable (since the error band is much bigger).
We can say that the best configuration, which accounts both for the amount of information included (= the largest possible number of clusters) and for the stability of the fitting procedure (= the lowest possible GMM distance), is the one with six clusters.
Bayesian information criterion (BIC)
This criterion gives us an estimate of how good the GMM is at predicting the data we actually have. The lower the BIC, the better the model is at predicting the data we have, and by extension, the true but unknown distribution. In order to avoid overfitting, this criterion penalizes models with a big number of clusters.
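Computing the BIC is a one-liner per fit with scikit-learn; here is a minimal sketch, again averaging over several seeded fits (the range of components is illustrative, X is the array from above):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Average BIC over twenty seeded fits, for each number of components
bic_means = []
for k in range(1, 11):
    bics = [GaussianMixture(n_components=k, random_state=seed).fit(X).bic(X)
            for seed in range(20)]
    bic_means.append(np.mean(bics))
```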
Figure 5: Bayesian information criterion scores as a function of the number of Gaussian components (clusters) forming the GMM.
Following this criterion, the bigger the number of clusters, the better the model. Which suggests that the penalty the BIC criterion gives to complex models does not save us from overfitting. Or, in more prosaic terms, this score sucks. At least in this basic form.
But before screaming and trashing this technique, we can notice two things. The first is that the curve is fairly smooth and monotonic. The second is that the curve follows different slopes in different parts of it. Starting from these two observations, the temptation to check where the BIC curve changes slope is big. So let's check it!
Technically, we have to calculate the gradient of the BIC score curve. Intuitively, the concept of the gradient is simple: if two consecutive points have the same value, their gradient is zero. If they have different values, their gradient can be either negative, if the second point has a lower value, or positive otherwise. The magnitude of the gradient tells us how different the two values are.
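A minimal sketch, reusing the bic_means list from the previous snippet; NumPy's gradient gives exactly this finite-difference slope:

```python
import numpy as np

# bic_means holds one averaged BIC value per number of components (from above)
grad = np.gradient(bic_means)  # finite-difference slope of the BIC curve
print(grad)  # the slope flattening out marks the point of diminishing returns
```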
Figure 6: Plot of the gradients of the curve in Figure 5.
As expected, all the gradients have negative values. But we see more clearly that starting from a cluster size of seven the gradient becomes almost constant, i.e. the original function decreases more gently, i.e. there is not much gain in increasing the number of clusters. In short, this technique suggests we use six clusters.
Final decision: change model
We have explored three different techniques to choose the right number of clusters that can be discerned in this dataset. The results are in the table below:
Technique: suggested number of clusters
Silhouette score: 5
Distance between GMMs: 6
Gradient of the BIC scores: 6
Even if the choices are similar, there is no single value on which all the techniques agree. In this specific case, this suggests that the GMM isn't a good model to cluster our data.
Another symptom you can catch is that both the Silhouette score and the gradient of the BIC scores show a second value which is almost as good as the chosen one: 4 is almost as good as 5 for the Silhouette, and 5 is almost as good as 6 for the gradient of the BIC scores.
One could just take 5 (in medio stat virtus) as the right number of clusters, but this might not be the optimal choice. And it turns out that it really was not for my specific problem.
This is because the clusters don't show a clear symmetric (oval-like) shape, so they cannot be well approximated by a model composed of 2-D Gaussians like the one you see in Figure 7, which are instead symmetric.
Figure 7: A bivariate Gaussian distribution.