For this problem, use the e1-p1.csv dataset.
Using the decision tree algorithm that we discussed in the class, determine which attribute is the best attribute at the root level. You should not use Weka, JMP Pro, or any other data mining/machine learning software. You must show all intermediate results and calculations.
A1 | A2 | Class
hot | medium | N |
mild | large | N |
hot | small | Y |
cold | medium | N |
cold | small | N |
mild | medium | Y |
cold | large | N |
mild | medium | Y |
mild | large | Y |
mild | medium | N |
hot | medium | Y |
cold | large | N |
mild | small | N |
hot | medium | Y |
cold | medium | N |
Attribute selection can be performed with one of the following measures: information gain, Gini index, or gain ratio. Here we will use information gain to determine which attribute is the best attribute at the root level.
Information Gain
Let node N hold the tuples of partition D. The attribute with the highest information gain is chosen as the splitting attribute for node N. Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_i,D| / |D|.
Expected information (entropy) needed to classify a tuple in D (Equation 1):

Info(D) = -Σ_{i=1}^{m} p_i log2(p_i)

Information needed (after using A to split D into v partitions) to classify D (Equation 2):

Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) × Info(D_j)

Information gained by branching on attribute A (Equation 3):

Gain(A) = Info(D) - Info_A(D)
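Although the problem must be solved by hand, the three equations are simple enough to double-check with a few lines of plain Python, used only to verify arithmetic rather than as a data mining tool. This is a minimal sketch; the helper names entropy, info_a, and gain are illustrative choices, not from any library.

```python
# Minimal sketch of Equations 1-3 for sanity-checking hand calculations.
from math import log2

def entropy(counts):
    """Equation 1: Info(D) = -sum(p_i * log2(p_i)), skipping empty classes."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info_a(partitions):
    """Equation 2: weighted entropy after splitting D into v partitions.
    Each partition is a list of per-class counts, e.g. [num_Y, num_N]."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * entropy(p) for p in partitions)

def gain(class_counts, partitions):
    """Equation 3: Gain(A) = Info(D) - Info_A(D)."""
    return entropy(class_counts) - info_a(partitions)
```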
In the above data set there are two distinct classes, so m = 2. Let C1 correspond to "Y" and C2 correspond to "N". Info(D) can be calculated as follows using Equation 1.
Total number of tuples = 15
Number of Y = 6
Number of N = 9
Info(D) = -(6/15) log2(6/15) - (9/15) log2(9/15) = 0.971 bits
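A quick standalone check of this value:

```python
from math import log2
# Info(D) for 6 Y's and 9 N's out of 15 tuples (Equation 1)
info_d = -(6/15) * log2(6/15) - (9/15) * log2(9/15)
print(round(info_d, 3))  # 0.971 -- matches the hand calculation
```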
For attribute A1,
Number of hot = 4
Number of mild = 6
Number of cold = 5
Number of hot with Y = 3
Number of hot with N = 1
Number of mild with Y = 3
Number of mild with N = 3
Number of cold with Y = 0
Number of cold with N = 5
Using Equation 2 (taking 0 log2(0) = 0 by convention):
InfoA1(D) = (4/15) × (-(3/4) log2(3/4) - (1/4) log2(1/4)) + (6/15) × (-(3/6) log2(3/6) - (3/6) log2(3/6)) + (5/15) × (-(0/5) log2(0/5) - (5/5) log2(5/5)) = (4/15) × 0.811 + (6/15) × 1 + (5/15) × 0 = 0.616
Using Equation 3
Gain(A1) = Info(D) - InfoA1(D) = 0.971 - 0.616 = 0.355
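This arithmetic can be double-checked with a short standalone snippet (the entropy helper is redefined here so it runs on its own):

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# A1 partitions as [Y, N] counts: hot [3, 1], mild [3, 3], cold [0, 5]
info_a1 = (4/15) * entropy([3, 1]) + (6/15) * entropy([3, 3]) + (5/15) * entropy([0, 5])
print(round(info_a1, 3))                    # 0.616
print(round(entropy([6, 9]) - info_a1, 3))  # Gain(A1) = 0.355
```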
For attribute A2,
Number of small = 3
Number of medium = 8
Number of large = 4
Number of small with Y = 1
Number of small with N = 2
Number of medium with Y = 4
Number of medium with N = 4
Number of large with Y = 1
Number of large with N = 3
Using Equation 2
InfoA2(D) = (3/15) × (-(1/3) log2(1/3) - (2/3) log2(2/3)) + (8/15) × (-(4/8) log2(4/8) - (4/8) log2(4/8)) + (4/15) × (-(1/4) log2(1/4) - (3/4) log2(3/4)) = (3/15) × 0.918 + (8/15) × 1 + (4/15) × 0.811 = 0.933
Using Equation 3
Gain(A2) = Info(D) - InfoA2(D) = 0.971 - 0.933 = 0.038
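And the corresponding standalone check for A2:

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# A2 partitions as [Y, N] counts: small [1, 2], medium [4, 4], large [1, 3]
info_a2 = (3/15) * entropy([1, 2]) + (8/15) * entropy([4, 4]) + (4/15) * entropy([1, 3])
print(round(info_a2, 3))                    # 0.933
print(round(entropy([6, 9]) - info_a2, 3))  # Gain(A2) = 0.038
```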
The attribute with the highest information gain is chosen as the root. Here A1 has the highest information gain (Gain(A1) = 0.355 versus Gain(A2) = 0.038), so A1 should be chosen as the root attribute.
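For completeness, here is a standalone sketch that recomputes both gains directly from the tuples in the table above. The rows are hard-coded, and e1-p1.csv is assumed to contain the same 15 tuples; again, this only verifies the hand calculation rather than replacing it.

```python
# End-to-end check: recompute both gains from the raw tuples.
from collections import Counter
from math import log2

rows = [
    ("hot", "medium", "N"), ("mild", "large", "N"), ("hot", "small", "Y"),
    ("cold", "medium", "N"), ("cold", "small", "N"), ("mild", "medium", "Y"),
    ("cold", "large", "N"), ("mild", "medium", "Y"), ("mild", "large", "Y"),
    ("mild", "medium", "N"), ("hot", "medium", "Y"), ("cold", "large", "N"),
    ("mild", "small", "N"), ("hot", "medium", "Y"), ("cold", "medium", "N"),
]

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info_gain(attr_index):
    # Info(D) from the overall class distribution (column 2 holds the class)
    info_d = entropy(list(Counter(r[2] for r in rows).values()))
    # Info_A(D): weighted entropy over each distinct attribute value
    info_a = 0.0
    for value in {r[attr_index] for r in rows}:
        subset = [r for r in rows if r[attr_index] == value]
        info_a += len(subset) / len(rows) * entropy(
            list(Counter(r[2] for r in subset).values()))
    return info_d - info_a

print(round(info_gain(0), 3))  # Gain(A1) = 0.355
print(round(info_gain(1), 3))  # Gain(A2) = 0.038
```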