Consider the following dataset where the target
feature is “Run”.
Weather | Mood | Breezy | Run |
Hot | Mixed feeling | No | No |
Hot | Happy | Yes | Yes |
Warm | Happy | Yes | Yes |
Warm | Sad | No | Yes |
Warm | Mixed feeling | No | No |
Hot | Happy | No | No |
Cold | Happy | No | Yes |
(i) Which feature should you split on first, using Information Gain?
(ii) Draw the decision tree at this stage with the selected feature as the root node.
(i)
The first split should be on the "Mood" feature.
Reason using information gain:
Let node N represent or hold the tuples of partition D.
The attribute with the highest information gain is chosen as the splitting attribute for node N.
This attribute minimizes the information needed to classify the tuples in the resulting partitions
and reflects the least randomness or “impurity” in these partitions.
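The expected information needed to classify a tuple in D is given by
$$\mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2 p_i$$
where p_i is the probability that a tuple in D belongs to class i.
The information still needed to classify a tuple after partitioning D on attribute A is given by
$$\mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|}\,\mathrm{Info}(D_j)$$
where D_j is the subset of D containing the tuples for which A takes its j-th value.
The information gain is the difference between the two:
$$\mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D)$$
Gain(A) tells us how much would be gained by branching on A.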
Using the above formulas:
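Since 4 tuples are classified as "Yes" and the remaining 3 as "No":
Info(D) = -(4/7) log2(4/7) - (3/7) log2(3/7) ≈ 0.985 bits
Now take the attribute Mood => Happy (4 tuples: 3 Yes, 1 No), Mixed_Feeling (2 tuples: 2 No), Sad (1 tuple: 1 Yes)
Info_Mood(D) = (4/7)(0.811) + (2/7)(0) + (1/7)(0) ≈ 0.463 bits
Now take the attribute Weather => Warm (3 tuples: 2 Yes, 1 No), Hot (3 tuples: 1 Yes, 2 No), Cold (1 tuple: 1 Yes)
Info_Weather(D) = (3/7)(0.918) + (3/7)(0.918) + (1/7)(0) ≈ 0.787 bits
Now take the attribute Breezy => No (5 tuples: 2 Yes, 3 No), Yes (2 tuples: 2 Yes)
Info_Breezy(D) = (5/7)(0.971) + (2/7)(0) ≈ 0.693 bits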
Hence Gain(Mood) = 0.985 - 0.463 = 0.522 bits
Gain(Weather) = 0.985 - 0.787 = 0.198 bits
Gain(Breezy) = 0.985 - 0.693 = 0.292 bits
=> The first split should be on Mood, since it gives the highest gain.
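The gains above can be checked numerically. The following is a minimal sketch, not part of the original answer: the rows are copied from the table in the question, and the helper names (info, info_after_split, gain) are chosen here for illustration only.

```python
from collections import Counter
from math import log2

# Rows copied from the table in the question: (Weather, Mood, Breezy, Run).
DATA = [
    ("Hot", "Mixed feeling", "No", "No"),
    ("Hot", "Happy", "Yes", "Yes"),
    ("Warm", "Happy", "Yes", "Yes"),
    ("Warm", "Sad", "No", "Yes"),
    ("Warm", "Mixed feeling", "No", "No"),
    ("Hot", "Happy", "No", "No"),
    ("Cold", "Happy", "No", "Yes"),
]
FEATURES = {"Weather": 0, "Mood": 1, "Breezy": 2}  # column index of each feature
TARGET = 3  # column index of "Run"


def info(rows):
    """Info(D): entropy of the target labels in rows, in bits."""
    counts = Counter(row[TARGET] for row in rows)
    total = len(rows)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def info_after_split(rows, col):
    """Info_A(D): weighted entropy of the partitions induced by column col."""
    partitions = {}
    for row in rows:
        partitions.setdefault(row[col], []).append(row)
    return sum(len(p) / len(rows) * info(p) for p in partitions.values())


def gain(rows, col):
    """Gain(A) = Info(D) - Info_A(D)."""
    return info(rows) - info_after_split(rows, col)


gains = {name: gain(DATA, col) for name, col in FEATURES.items()}
for name, g in gains.items():
    print(f"Gain({name}) = {g:.3f} bits")

best = max(gains, key=gains.get)
print("Split first on:", best)
# Prints:
# Gain(Weather) = 0.198 bits
# Gain(Mood) = 0.522 bits
# Gain(Breezy) = 0.292 bits
# Split first on: Mood
```

The printed gains agree with the hand calculation, so Mood is the root under this criterion.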
After splitting on Mood, the dataset is divided into 3 subsets:
Mood = Happy
Weather | Breezy | Run |
Hot | Yes | Yes |
Warm | Yes | Yes |
Hot | No | No |
Cold | No | Yes |
Mood = Mixed_Feeling
Weather | Breezy | Run |
Hot | No | No |
Warm | No | No |
Mood = Sad
Weather | Breezy | Run |
Warm | No | Yes |
When Mood is Mixed_Feeling the outcome is always No, and when Mood is Sad the outcome is Yes, so neither branch needs further splitting.
The Happy branch is still mixed (3 Yes, 1 No), so its tuples must be split again on the other two features.
Let the new dataset be DMH (Data on Mood = Happy).
We repeat the same process on DMH to complete the decision tree.
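The next step can be sketched in code as well. This is an illustrative sketch under the same assumptions as the question's table (rows copied from it, helper names invented here): it partitions the data on Mood, reports which branches are already pure, and computes the gain of the two remaining features on DMH. Under this sketch both remaining features come out with the same gain on DMH (about 0.311 bits), so the next split needs a tie-breaking rule, which the answer above does not specify.

```python
from collections import Counter, defaultdict
from math import log2

# Rows copied from the table in the question: (Weather, Mood, Breezy, Run).
DATA = [
    ("Hot", "Mixed feeling", "No", "No"),
    ("Hot", "Happy", "Yes", "Yes"),
    ("Warm", "Happy", "Yes", "Yes"),
    ("Warm", "Sad", "No", "Yes"),
    ("Warm", "Mixed feeling", "No", "No"),
    ("Hot", "Happy", "No", "No"),
    ("Cold", "Happy", "No", "Yes"),
]
RUN = 3  # column index of the target


def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def gain(rows, col):
    """Information gain of splitting rows on column col."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[col]].append(row[RUN])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[RUN] for row in rows]) - remainder


# Partition on Mood (column 1), the root chosen in part (i).
branches = defaultdict(list)
for row in DATA:
    branches[row[1]].append(row)

for mood, rows in branches.items():
    labels = {row[RUN] for row in rows}
    if len(labels) == 1:
        print(f"Mood = {mood}: leaf, Run = {next(iter(labels))}")
    else:
        print(f"Mood = {mood}: needs another split")

# DMH: the tuples with Mood = Happy, the only impure branch.
dmh = branches["Happy"]
dmh_gains = {"Weather": gain(dmh, 0), "Breezy": gain(dmh, 2)}
for name, g in dmh_gains.items():
    print(f"Gain({name}) on DMH = {g:.3f} bits")
```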
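(ii)
Decision tree at this stage, with Mood as the root node:

                    Mood
          /          |           \
       Happy   Mixed_Feeling     Sad
         |           |            |
   split further   Run = No    Run = Yes
   (3 Yes, 1 No)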