A Simple Explanation of Gini Impurity
What Gini Impurity is (with examples) and how it's used to train Decision Trees.
MARCH 29, 2019
If you look at the documentation for the DecisionTreeClassifier class in scikit-learn, you'll see something like this for the criterion parameter:
The RandomForestClassifier documentation says the same thing. Both mention that the default criterion is "gini" for the Gini Impurity. What is that?!
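As a minimal sketch of where that parameter lives (assuming scikit-learn is installed):

```python
from sklearn.tree import DecisionTreeClassifier

# criterion defaults to "gini"; "entropy" is the main alternative.
clf = DecisionTreeClassifier()
print(clf.criterion)  # → gini
```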
TLDR: Read the Recap.
Decision Trees
Training a decision tree consists of iteratively splitting the current data into two branches. Say we had the following datapoints:
The Dataset
Right now, we have 1 branch with 5 blues and 5 greens.
Let’s make a split at x = 2:
A Perfect Split
This is a perfect split! It breaks our dataset perfectly into two branches:
What if we’d made a split at x = 1.5 instead?
An Imperfect Split
This imperfect split breaks our dataset into these branches:
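Concretely, the two splits above can be sketched in a few lines of Python. The coordinates here are hypothetical stand-ins (the original figures aren't reproduced); what matters is that all 5 blues sit left of x = 2 and all 5 greens sit right of it:

```python
# Hypothetical datapoints: (x coordinate, color label).
points = [(0.5, "blue"), (1.0, "blue"), (1.3, "blue"), (1.7, "blue"), (1.9, "blue"),
          (2.2, "green"), (2.5, "green"), (2.8, "green"), (3.1, "green"), (3.4, "green")]

def split(points, x_threshold):
    """Split datapoints into a left and right branch at x_threshold."""
    left = [label for x, label in points if x < x_threshold]
    right = [label for x, label in points if x >= x_threshold]
    return left, right

# Perfect split: all blues go left, all greens go right.
print(split(points, 2))
# Imperfect split: the blues at x = 1.7 and x = 1.9 end up mixed in with the greens.
print(split(points, 1.5))
```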
It’s obvious that this split is worse, but how can we quantify that?
Being able to measure the quality of a split becomes even more important if we add a third class, reds. Imagine the following split:
Compare that against this split:
Which split is better? It’s no longer immediately obvious. We need a way to quantitatively evaluate how good a split is.
Gini Impurity
This is where the Gini Impurity metric comes in.
Suppose we

1. Randomly pick a datapoint in our dataset, then
2. Randomly classify it according to the class distribution in the dataset.
What’s the probability we classify the datapoint incorrectly? The answer to that question is the Gini Impurity.
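In general (stated here ahead of the worked examples), if a branch contains C classes and p_i is the fraction of datapoints in the branch with class i, this misclassification probability works out to

$$G = \sum_{i=1}^{C} p_i \, (1 - p_i) = 1 - \sum_{i=1}^{C} p_i^2$$

since we pick class i with probability p_i and then classify it as some other class with probability 1 - p_i.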
Example 1: The Whole Dataset
Let’s calculate the Gini Impurity of our entire dataset. If we randomly pick a datapoint, it’s either blue (50%) or green (50%).
Now, we randomly classify our datapoint according to the class distribution. Since we have 5 of each color, we classify it as blue 50% of the time and as green 50% of the time.
What’s the probability we classify our datapoint incorrectly?
| Event | Probability |
| --- | --- |
| Pick Blue, Classify Blue ✓ | 25% |
| Pick Blue, Classify Green ❌ | 25% |
| Pick Green, Classify Blue ❌ | 25% |
| Pick Green, Classify Green ✓ | 25% |
We only classify it incorrectly in 2 of the events above. Thus, our total probability is 25% + 25% = 50%, so the Gini Impurity is 0.5.
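The same 0.5 falls out of a small helper that takes per-class datapoint counts (a sketch, not scikit-learn's internal implementation):

```python
def gini_impurity(counts):
    """Gini Impurity of a branch, given the datapoint count for each class."""
    total = sum(counts)
    # Pick class i with probability c/total, then misclassify it
    # with probability 1 - c/total; sum over all classes.
    return sum((c / total) * (1 - c / total) for c in counts)

print(gini_impurity([5, 5]))  # whole dataset (5 blues, 5 greens): 0.5
print(gini_impurity([5, 0]))  # a pure branch: 0.0
```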