Question

Argue analytically that a completely impure node yields the highest Gini Impurity.


Solutions

Expert Solution

A Simple Explanation of Gini Impurity

What Gini Impurity is (with examples) and how it's used to train Decision Trees.

MARCH 29, 2019

If you look at the documentation for the DecisionTreeClassifier class in scikit-learn, you’ll see something like this for the criterion parameter:

The RandomForestClassifier documentation says the same thing. Both mention that the default criterion is “gini” for the Gini Impurity. What is that?!

TLDR: Read the Recap.

Decision Trees

Training a decision tree consists of iteratively splitting the current data into two branches. Say we had the following datapoints:

The Dataset

Right now, we have 1 branch with 5 blues and 5 greens.          

Let’s make a split at x = 2:

A Perfect Split

This is a perfect split! It breaks our dataset perfectly into two branches:

  • Left branch, with 5 blues.     
  • Right branch, with 5 greens.     

What if we’d made a split at x = 1.5 instead?

An Imperfect Split

This imperfect split breaks our dataset into these branches:

  • Left branch, with 4 blues.    
  • Right branch, with 1 blue and 5 greens.      
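The two splits above can be reproduced with a few lines of Python. This is only a sketch: the x-coordinates below are hypothetical values chosen to match the branch counts described in the text (the original figure's exact positions are not given), and `split_counts` is an illustrative helper, not a scikit-learn function.

```python
from collections import Counter

# Hypothetical x-coordinates; chosen only so the counts match the text.
points = [
    (0.2, "blue"), (0.6, "blue"), (0.9, "blue"), (1.3, "blue"), (1.7, "blue"),
    (2.2, "green"), (2.5, "green"), (2.9, "green"), (3.3, "green"), (3.8, "green"),
]

def split_counts(points, threshold):
    """Return per-class counts for the left (x < threshold) and right branches."""
    left = Counter(label for x, label in points if x < threshold)
    right = Counter(label for x, label in points if x >= threshold)
    return dict(left), dict(right)

print(split_counts(points, 2.0))  # perfect split
print(split_counts(points, 1.5))  # imperfect split
```

With these assumed coordinates, the split at x = 2 separates the colors completely, while the split at x = 1.5 leaves one blue point in the right branch.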

It’s obvious that this split is worse, but how can we quantify that?

Being able to measure the quality of a split becomes even more important if we add a third class, reds. Imagine the following split:

  • Branch 1, with 3 blues, 1 green, and 1 red.     
  • Branch 2, with 3 greens and 1 red.   

Compare that against this split:

  • Branch 1, with 3 blues, 1 green, and 2 reds.      
  • Branch 2, with 3 greens.   

Which split is better? It’s no longer immediately obvious. We need a way to quantitatively evaluate how good a split is.
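As a preview, here is one way the two three-class splits could be scored numerically. This sketch uses the Gini Impurity metric introduced in the next section, in its standard closed form 1 - sum(p_i^2). It also assumes the usual convention of weighting each branch by its size, which the text above does not state. The names `gini` and `weighted_gini` are illustrative.

```python
def gini(counts):
    """Gini Impurity of a branch given its per-class counts: 1 - sum(p_i^2)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def weighted_gini(branches):
    """Size-weighted average impurity across the branches of a split (assumed convention)."""
    n = sum(sum(b) for b in branches)
    return sum(sum(b) / n * gini(b) for b in branches)

# Counts are ordered (blue, green, red), taken from the two splits above.
split_a = [(3, 1, 1), (0, 3, 1)]
split_b = [(3, 1, 2), (0, 3, 0)]

print(round(weighted_gini(split_a), 4))  # 0.4778
print(round(weighted_gini(split_b), 4))  # 0.4074
```

Under these assumptions the second split scores lower (less impure), so it would be preferred, even though that is hard to see by eye.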

Gini Impurity

This is where the Gini Impurity metric comes in.

Suppose we

  1. Randomly pick a datapoint in our dataset, then
  2. Randomly classify it according to the class distribution in the dataset. For our dataset, we’d classify it as blue 5/10 of the time and as green 5/10 of the time, since we have 5 datapoints of each color.

What’s the probability we classify the datapoint incorrectly? The answer to that question is the Gini Impurity.
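The two-step procedure above translates directly into code: enumerate every (picked class, guessed class) pair and add up the probability of the mismatches. This is a minimal sketch, assuming the pick and the random classification are independent draws from the same class distribution. The function name `gini_by_enumeration` is made up for illustration.

```python
from itertools import product

def gini_by_enumeration(class_probs):
    """Probability of misclassification under the pick-then-classify procedure.

    class_probs maps each class to its share of the dataset.
    """
    wrong = 0.0
    for picked, guessed in product(class_probs, repeat=2):
        if picked != guessed:
            # Independent draws, so the joint probability is the product.
            wrong += class_probs[picked] * class_probs[guessed]
    return wrong

print(gini_by_enumeration({"blue": 0.5, "green": 0.5}))  # 0.5
print(gini_by_enumeration({"blue": 1.0}))                # 0.0
```

A branch containing a single class can never be misclassified this way, which is why its impurity is 0.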

Example 1: The Whole Dataset

Let’s calculate the Gini Impurity of our entire dataset. If we randomly pick a datapoint, it’s either blue (50%) or green (50%).

Now, we randomly classify our datapoint according to the class distribution. Since we have 5 of each color, we classify it as blue 50% of the time and as green 50% of the time.

What’s the probability we classify our datapoint incorrectly?

Event | Probability
Pick Blue, Classify Blue ✓ | 25%
Pick Blue, Classify Green ❌ | 25%
Pick Green, Classify Blue ❌ | 25%
Pick Green, Classify Green ✓ | 25%

We only classify it incorrectly in 2 of the events above. Thus, our total probability is 25% + 25% = 50%, so the Gini Impurity is 0.5.
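Returning to the question posed at the top, the same reasoning generalizes to any number of classes. The following is a sketch of the analytic argument, assuming the standard closed form of the Gini Impurity. The symbols C (number of classes), p_i (fraction of datapoints in class i), and G are notation introduced here, not taken from the text above. "Completely impure" is read as every class being equally represented.

```latex
G(p) = \sum_{i=1}^{C} p_i (1 - p_i) = 1 - \sum_{i=1}^{C} p_i^2,
\qquad p_i \ge 0, \quad \sum_{i=1}^{C} p_i = 1.

% By the Cauchy-Schwarz inequality applied to (p_1, ..., p_C) and (1, ..., 1):
1 = \Big( \sum_{i=1}^{C} p_i \cdot 1 \Big)^2 \le C \sum_{i=1}^{C} p_i^2
\quad \Longrightarrow \quad
\sum_{i=1}^{C} p_i^2 \ge \frac{1}{C}.

% Therefore
G(p) \le 1 - \frac{1}{C},
\quad \text{with equality if and only if } p_1 = \dots = p_C = \frac{1}{C}.
```

For C = 2 the bound is 1 - 1/2 = 0.5, which matches the 50% computed above for the evenly mixed dataset. At the other extreme, a pure node has a single p_i = 1 and therefore G = 0.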

