In: Economics
Write an essay: what is information gain?
Min: 750 words
You didn't specify the subject, so I'll assume you mean information gain in machine learning.
Information gain calculates the reduction in entropy or surprise from transforming a dataset in some way.
It is commonly used in the construction of decision trees from a training dataset, by evaluating the information gain for each variable, and selecting the variable that maximizes the information gain, which in turn minimizes the entropy and best splits the dataset into groups for effective classification.
Information gain can also be used for feature selection, by evaluating the gain of each variable in the context of the target variable. In this slightly different usage, the calculation is referred to as mutual information between the two random variables.
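As a quick, hedged illustration of the feature-selection use, the sketch below estimates the mutual information between each feature and the target with scikit-learn's mutual_info_classif; the synthetic dataset and its parameters are made up purely for illustration:

```python
# A minimal sketch of mutual-information-based feature selection,
# assuming scikit-learn is installed; the dataset here is synthetic.
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Synthetic binary-classification data: 5 features, only 2 informative.
X, y = make_classification(n_samples=500, n_features=5,
                           n_informative=2, n_redundant=0,
                           random_state=42)

# Estimate mutual information between each feature and the target.
mi = mutual_info_classif(X, y, random_state=42)
for i, score in enumerate(mi):
    print(f"feature {i}: mutual information = {score:.3f}")
```

Features that carry little information about the target receive scores near zero, so the scores can be used to rank or drop variables.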
What Is Information Gain?
Information Gain, or IG for short, measures the reduction in entropy or surprise by splitting a dataset according to a given value of a random variable.
A larger information gain suggests a lower entropy group or groups of samples, and hence less surprise.
You might recall that information quantifies how surprising an event is in bits. Lower probability events have more information, higher probability events have less information. Entropy quantifies how much information there is in a random variable, or more specifically its probability distribution. A skewed distribution has a low entropy, whereas a distribution where events have equal probability has a larger entropy.
In information theory, we like to describe the “surprise” of an event: low probability events are more surprising and therefore carry more information, whereas probability distributions in which the events are equally likely are more surprising overall and have larger entropy.
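A small sketch of these two ideas, using the standard formulas h(x) = -log2 p(x) for the information of an event and H = -sum p log2 p for the entropy of a distribution (the probabilities below are just illustrative):

```python
from math import log2

# Lower-probability events are more surprising and carry more information.
for p in (0.9, 0.5, 0.1):
    print(f"p = {p}: information = {-log2(p):.2f} bits")

def entropy(probs):
    """Shannon entropy of a discrete distribution, in bits."""
    return -sum(p * log2(p) for p in probs if p > 0)

# A skewed distribution has low entropy; a uniform one has the maximum.
print("skewed  [0.9, 0.1]:", round(entropy([0.9, 0.1]), 3), "bits")
print("uniform [0.5, 0.5]:", round(entropy([0.5, 0.5]), 3), "bits")
```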
Now, let’s consider the entropy of a dataset.
We can think about the entropy of a dataset in terms of the probability distribution of observations in the dataset belonging to one class or another, e.g. two classes in the case of a binary classification dataset.
One interpretation of entropy from information theory is that it specifies the minimum number of bits of information needed to encode the classification of an arbitrary member of the dataset S (i.e., a member of S drawn at random with uniform probability).
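As a minimal sketch, the entropy of a dataset can be computed directly from its class proportions; the 9-to-5 class split below is an assumed, illustrative example:

```python
from collections import Counter
from math import log2

# Hypothetical binary-classification labels: a 9/5 class split.
labels = ["yes"] * 9 + ["no"] * 5
counts = Counter(labels)
n = len(labels)

# Entropy of the dataset's class distribution, in bits.
dataset_entropy = -sum((c / n) * log2(c / n) for c in counts.values())
print(f"entropy of the dataset: {dataset_entropy:.3f} bits")  # about 0.940
```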
In machine learning, especially when using decision trees, the level of uncertainty decreases at each step as the algorithm becomes more accurate at each successive level of the tree. You can treat this reduction as the knowledge gained by the algorithm while solving a particular classification problem.
A textbook example of information gain is predicting the gender of an unborn baby.
At the first step, when the mother becomes pregnant, the gender of the foetus can be male or female, the uncertainty is at its highest, and at best you are 50% certain about the prediction.
However, at the time of the first ultrasound test in the first trimester, the prediction becomes more certain, say 75%. That drop in uncertainty is a loss of entropy, and it is also your information (or knowledge) gain, because the loss in entropy results in an equal gain in certainty.
After an ultrasound in the third trimester of pregnancy, there is almost no uncertainty left and the prediction can be close to 100% certain.
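A rough sketch of this example in code, using the 50%, 75% and near-100% certainties assumed above (illustrative figures, not medical facts); the entropy drop at each stage is the information gained:

```python
from math import log2

def binary_entropy(p):
    """Entropy (in bits) of a yes/no outcome predicted with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * log2(p) + (1 - p) * log2(1 - p))

# Certainty at each stage, as assumed in the example above.
stages = [("before any test", 0.50),
          ("first-trimester ultrasound", 0.75),
          ("third-trimester ultrasound", 0.999)]

start = binary_entropy(stages[0][1])
for name, certainty in stages:
    h = binary_entropy(certainty)
    print(f"{name}: entropy = {h:.3f} bits, gain so far = {start - h:.3f} bits")
```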
To calculate this mathematically, let's consider an example of a decision tree that predicts whether there will be a golf game today or not.
The result here depends on the outlook or weather forecast.
So the gain here, when predicting PlayGolf in the presence of Outlook, will be the difference between
“entropy when trying to predict PlayGolf alone”
and
“entropy when trying to predict PlayGolf with knowledge of Outlook” for the day.
From here you can dig deeper into the algorithm and the mathematics of gain and entropy.
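As a hedged sketch of that calculation, the snippet below computes Gain(PlayGolf, Outlook) = Entropy(PlayGolf) − Entropy(PlayGolf | Outlook) on the classic 14-row play-golf toy dataset (the rows are the commonly quoted textbook values, listed here only for illustration):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# (Outlook, PlayGolf) pairs from the standard toy dataset.
data = [("sunny", "no"), ("sunny", "no"), ("overcast", "yes"), ("rainy", "yes"),
        ("rainy", "yes"), ("rainy", "no"), ("overcast", "yes"), ("sunny", "no"),
        ("sunny", "yes"), ("rainy", "yes"), ("sunny", "yes"), ("overcast", "yes"),
        ("overcast", "yes"), ("rainy", "no")]

targets = [play for _, play in data]
base = entropy(targets)                      # entropy of PlayGolf alone

# Weighted entropy of PlayGolf once Outlook is known.
conditional = 0.0
for value in set(outlook for outlook, _ in data):
    subset = [play for outlook, play in data if outlook == value]
    conditional += (len(subset) / len(data)) * entropy(subset)

print(f"Gain(PlayGolf, Outlook) = {base - conditional:.3f} bits")  # about 0.247
```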
When we are trying to build a decision tree, we start with a root node and recursively add child nodes until we fit the training data perfectly. The problem we face is determining which attribute in a given set of training feature vectors is most useful for discriminating between the classes to be learned. To do this we use entropy and information gain. Entropy gives us a measure of the impurity of a group of examples: the lower the entropy, the purer the group. The entropy typically changes when we use a node in a decision tree to partition the training instances into smaller subsets, and information gain is a measure of this change in entropy. Information gain tells us how important a given attribute of the feature vectors is, and we use it to decide the ordering of attributes in the nodes of a decision tree.
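A minimal sketch of this in practice, assuming scikit-learn is available: setting criterion="entropy" makes the tree choose, at each node, the split that maximizes information gain, and the printed rules show which attributes it judged most informative:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# criterion="entropy" tells the tree to pick, at each node, the split
# that maximizes information gain (i.e. minimizes entropy).
tree = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# The printed rules show which features the tree judged most informative.
print(export_text(tree, feature_names=list(iris.feature_names)))
```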