Question

In: Computer Science

How do you address overfitting or underfitting when selecting the number of components for a Gaussian mixture model?

Solutions

Expert Solution

Suppose you are in the situation depicted in Figure 1: you want to discern how many clusters there are (or, if you prefer, how many Gaussian components generated the data), and you have no information about the "ground truth". A real case, where the data do not have the niceness of behaving well the way simulated data do.


Figure 1: The vectors we are going to cluster. The transparency of the points reflects the density.
At first glance, one could exclaim "Three main clusters plus two minor ones!", and perhaps that is exactly right, but here we are looking for an automatic method for finding the "right" number of clusters.
Note that this "right" number is somewhat fuzzy anyway, since the specifics of each problem may lead one to override the decision of the automated algorithm.
But let me show the GMM results for the configuration we just proposed (five clusters). In the plot below (Figure 2), we fitted a GMM with five components to our data. And we have our first issue: same data, same model, but different results.


Figure 2: Cluster shapes: since the algorithm we rely on is not deterministic, we can get very different results!
The culprit for this behavior is the fitting procedure: the Expectation-Maximization (EM) algorithm. This algorithm only guarantees that we land at a locally optimal point; it does not guarantee that this local optimum is also the global one. So, if the algorithm starts from different initialization points, it generally lands in different configurations.
While there are other procedures to fit a GMM, we want to stick with this one: the other methods are more complex and require more hyperparameters to be set, which exceeds the scope of this post.
We therefore have to take into account the non-deterministic nature of our algorithm.
The easiest way of handling it is to run the fitting procedure several times and consider the mean and the variance for each configuration. Put simply: we attach an error to each fit.
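A minimal sketch of this idea, assuming scikit-learn's GaussianMixture (which fits GMMs with EM; the helper name fit_several_times is ours), runs the same fit with different random seeds and summarizes the scores:

```python
# Sketch: account for EM's sensitivity to initialization by fitting the
# same GMM several times with different seeds and summarizing a score.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_several_times(X, n_components, n_runs=20):
    """Return mean and std of the per-sample log-likelihood over n_runs fits."""
    scores = []
    for seed in range(n_runs):
        gmm = GaussianMixture(n_components=n_components, random_state=seed)
        gmm.fit(X)
        scores.append(gmm.score(X))  # average log-likelihood per sample
    return np.mean(scores), np.std(scores)
```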
Cluster performance evaluation(s)
Since we don't know the ground truth of our cluster generators, i.e. we do not know the original distribution that generated the data, our options for evaluating the performance of the clustering process are limited and quite noisy.
Nonetheless, we will explore three different techniques.
Silhouette score
This score, as clearly stated by the SKLearn developers, considers two measures:
The distance between a sample and all other points in the same cluster.
The distance between a sample and all other points in the next nearest cluster.
That is, it checks how compact and well separated the clusters are. The closer the score is to one, the better the clustering.
Since we already know that the fitting procedure is not deterministic, we run twenty fits for each number of clusters, then take the mean and the standard deviation of the best five runs (see the sketch after Figure 3). The results are in Figure 3.


Figure 3: Silhouette scores for our dataset
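A minimal sketch of the evaluation loop just described, again assuming scikit-learn (the helper name silhouette_per_k is ours):

```python
# Sketch: for each candidate number of clusters, run several GMM fits and
# keep the mean and standard deviation of the silhouette score of the
# best five runs.
import numpy as np
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

def silhouette_per_k(X, k_range=range(2, 9), n_runs=20, keep_best=5):
    results = {}
    for k in k_range:
        scores = []
        for seed in range(n_runs):
            labels = GaussianMixture(n_components=k,
                                     random_state=seed).fit_predict(X)
            scores.append(silhouette_score(X, labels))
        best = sorted(scores)[-keep_best:]  # the five highest scores
        results[k] = (np.mean(best), np.std(best))
    return results
```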
Well, it seems that we get the best score with five clusters. We also have to consider that the four-cluster configuration is almost equally good once we account for the standard deviation (the 'error') of both configurations. So this score does not give us a clear-cut result.
Distance between GMMs
Here we form two datasets, each with half of the data chosen at random. We then check, for each configuration, how similar the GMMs trained on the two sets are.
Since we are talking about distributions, the concept of similarity is captured by the Jensen-Shannon (JS) metric. The smaller the JS distance between the two GMMs, the more the GMMs agree on how to fit the data.
All credit for the (beautiful!) way the Jensen-Shannon metric is calculated goes to Dougal and his post on StackOverflow.
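Since the JS divergence between two mixtures has no closed form, Dougal's answer estimates it by Monte Carlo sampling; a sketch of that idea with scikit-learn GMMs (the function name gmm_js_distance is ours) might look like:

```python
# Sketch: Monte Carlo estimate of the Jensen-Shannon distance between two
# fitted GMMs, JS(P, Q) = 0.5 * (KL(P||M) + KL(Q||M)) with M = (P + Q) / 2.
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_js_distance(gmm_p, gmm_q, n_samples=100_000):
    X_p, _ = gmm_p.sample(n_samples)  # draws from P
    X_q, _ = gmm_q.sample(n_samples)  # draws from Q

    log_p_Xp, log_q_Xp = gmm_p.score_samples(X_p), gmm_q.score_samples(X_p)
    log_p_Xq, log_q_Xq = gmm_p.score_samples(X_q), gmm_q.score_samples(X_q)

    # log density of the mixture M = (P + Q) / 2 at the sampled points
    log_m_Xp = np.logaddexp(log_p_Xp, log_q_Xp) - np.log(2)
    log_m_Xq = np.logaddexp(log_p_Xq, log_q_Xq) - np.log(2)

    js_divergence = 0.5 * (np.mean(log_p_Xp - log_m_Xp) +
                           np.mean(log_q_Xq - log_m_Xq))
    return np.sqrt(js_divergence)  # the JS *distance* is the square root
```

The two half datasets themselves can be drawn, for instance, by randomly permuting the row indices and splitting in two before fitting one GMM per half.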


Figure 4: Distance between the GMMs trained on the two halves of the data
Following this technique, the configuration with three clusters is the most conservative; this does not necessarily mean that it is the right configuration, since we are only talking about the reproducibility of the results across the two sets.
The interesting part of this plot is the sudden change in the distance we see after passing the six-cluster configuration: both the mean distance and its associated error show a huge increase.
This means that starting from seven clusters, the GMMs of the two sets are much more different (since the distance is much bigger) and also less stable (since the error band is much bigger).
We can say that the best configuration, which takes into account both the amount of information included (the largest possible number of clusters) and the stability of the fitting procedure (the lowest possible distance between the GMMs), is the one with six clusters.
Bayesian information criterion (BIC)
This criterion gives us an estimate of how good the GMM is at predicting the data we actually have. The lower the BIC, the better the model is at predicting the data we have and, by extension, the true, unknown distribution. To discourage overfitting, this criterion penalizes models with a large number of clusters.
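scikit-learn exposes the BIC directly via GaussianMixture.bic; a sketch of the per-configuration loop (bic_per_k is a hypothetical helper name) could be:

```python
# Sketch: mean and standard deviation of the BIC of the best five runs,
# for each candidate number of clusters (lower BIC is better).
import numpy as np
from sklearn.mixture import GaussianMixture

def bic_per_k(X, k_range=range(1, 11), n_runs=20, keep_best=5):
    results = {}
    for k in k_range:
        bics = []
        for seed in range(n_runs):
            gmm = GaussianMixture(n_components=k, random_state=seed).fit(X)
            bics.append(gmm.bic(X))
        best = sorted(bics)[:keep_best]  # the five lowest BICs
        results[k] = (np.mean(best), np.std(best))
    return results
```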


Figure 5: Bayesian information criterion scores as a function of the number of Gaussian components (clusters) forming the GMMs.
Following this criterion, the larger the number of clusters, the better the model should be, which suggests that the penalty the BIC gives to complex models does not save us from overfitting. Or, in more prosaic terms, this score sucks. At least in this basic form.
But before screaming and trashing this technique, we can notice two things. The first is that the curve is fairly smooth and monotonic. The second is that the curve follows different slopes in different parts. Starting from these two observations, the temptation to check where the BIC curve changes slope is big. So let's check it!
Technically, we have to calculate the gradient of the BIC-scores curve. Intuitively, the concept of a gradient is simple: if two consecutive points have the same value, their gradient is zero. If they have different values, their gradient can be either negative, if the second point has a lower value, or positive otherwise. The magnitude of the gradient tells us how different the two values are.
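With NumPy this is a one-liner; a minimal sketch, assuming mean_bic holds one mean BIC value per candidate count:

```python
# Sketch: discrete derivative of the BIC curve. A near-constant gradient
# after some k marks where adding clusters stops paying off.
import numpy as np

def bic_gradient(mean_bic, n_clusters):
    """np.gradient approximates the slope of the BIC curve at each count."""
    return np.gradient(np.asarray(mean_bic), np.asarray(n_clusters))
```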


Figure 6: plot of the gradients of the curve in Figure 5
As expected, all the gradients have negative values. But we see more clearly that starting from a cluster size of seven, the gradient becomes almost constant, i.e. the original function decreases more gently, i.e. there is not much gain in increasing the number of clusters. In short, this technique suggests we use six clusters.
Final decision: change model
We have explored three different techniques to choose the right number of clusters that can be discerned in this dataset. The results are in the table below:


Technique: suggested number of clusters
Silhouette score: 5
Distance between GMMs: 6
Gradient of the BIC scores: 6
Even if the choices are similar, there is no clear value on which all the techniques agree. In this specific case, this means that the GMM is not a good model for clustering our data.
Another symptom you can catch is that both the Silhouette and the gradient of the BIC show a second value which is almost as good as the chosen one: 4 is almost as good as 5 for the Silhouette, and 5 is almost as good as 6 for the gradient of the BIC scores.
One could just pick five (in medio stat virtus) as the right number of clusters, but this might not be the optimal choice. And it turned out that it was indeed not, for my specific problem.
This is because the clusters do not show a clear symmetric (oval-like) shape, so they cannot be well approximated by a model composed of 2-D Gaussians like the one you see in Figure 7, which are symmetric.


Figure 7: A bivariate Gaussian distribution.

