Question

For this problem, use the e1-p1.csv dataset.

Using the decision tree algorithm that we discussed in class, determine which attribute is the best attribute at the root level. You should not use Weka, JMP Pro, or any other data mining/machine learning software. You must show all intermediate results and calculations.

A1 A2 Class
hot medium N
mild large N
hot small Y
cold medium N
cold small N
mild medium Y
cold large N
mild medium Y
mild large Y
mild medium N
hot medium Y
cold large N
mild small N
hot medium Y
cold medium N

Solutions

Expert Solution

Attribute selection can use one of several measures: information gain, Gini index, or gain ratio. Here we use information gain to determine which attribute is the best attribute at the root level.

Information Gain

Let node N hold the tuples of partition D. The attribute with the highest information gain is chosen as the splitting attribute for node N. Let $p_i$ be the probability that an arbitrary tuple in D belongs to class $C_i$, estimated by $|C_{i,D}|/|D|$, where $C_{i,D}$ is the set of tuples in D that belong to class $C_i$.

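Expected information (entropy) needed to classify a tuple in D, where m is the number of classes (Equation 1):

$$\mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$

Information needed to classify D after using attribute A to split D into v partitions $D_1, \dots, D_v$ (Equation 2):

$$\mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \mathrm{Info}(D_j)$$

Information gained by branching on attribute A (Equation 3):

$$\mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D)$$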
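As a rough sketch, the three measures (entropy, expected information after a split, and gain) can be written as small Python functions that work on class counts. The function names `info`, `info_after_split`, and `gain` are illustrative choices, not part of the course material, and the code is meant only for checking hand calculations.

```python
from math import log2

def info(class_counts):
    """Entropy in bits of a list of per-class tuple counts."""
    total = sum(class_counts)
    # A class with zero tuples contributes nothing: 0 * log2(0) is taken as 0.
    return -sum(c / total * log2(c / total) for c in class_counts if c > 0)

def info_after_split(partitions):
    """Weighted entropy over partitions; each partition is a list of class counts."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * info(p) for p in partitions)

def gain(class_counts, partitions):
    """Information gained by splitting the data into the given partitions."""
    return info(class_counts) - info_after_split(partitions)
```

For example, `gain([2, 2], [[2, 0], [0, 2]])` describes a split that separates two classes perfectly, so the whole 1 bit of entropy is gained.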

The data set above has two distinct classes, so m = 2. Let C1 correspond to “Y” and C2 correspond to “N”. Info(D) can be calculated using equation 1 as follows.

Total number of tuples = 15

Number of Y = 6

Number of N = 9

$$\mathrm{Info}(D) = -\frac{6}{15}\log_2\frac{6}{15} - \frac{9}{15}\log_2\frac{9}{15} = 0.529 + 0.442 = 0.971 \text{ bits}$$
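The Info(D) arithmetic can be checked numerically with a few lines of Python. This is only a verification sketch; the variable names are illustrative.

```python
from math import log2

n_yes, n_no = 6, 9        # class counts taken from the dataset
total = n_yes + n_no      # 15 tuples in all
p_yes, p_no = n_yes / total, n_no / total

# Entropy of the class distribution, in bits
info_d = -p_yes * log2(p_yes) - p_no * log2(p_no)
print(f"Info(D) = {info_d:.3f} bits")  # Info(D) = 0.971 bits
```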

For attribute A1, the counts by value and class are:

A1 Y N Total
hot 3 1 4
mild 3 3 6
cold 0 5 5

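Using equation 2, with the convention that $0 \log_2 0 = 0$:

$$\mathrm{Info}_{A1}(D) = \frac{4}{15}\left(-\frac{3}{4}\log_2\frac{3}{4} - \frac{1}{4}\log_2\frac{1}{4}\right) + \frac{6}{15}\left(-\frac{3}{6}\log_2\frac{3}{6} - \frac{3}{6}\log_2\frac{3}{6}\right) + \frac{5}{15}\left(-\frac{0}{5}\log_2\frac{0}{5} - \frac{5}{5}\log_2\frac{5}{5}\right)$$

$$= \frac{4}{15}(0.811) + \frac{6}{15}(1) + \frac{5}{15}(0) = 0.216 + 0.400 + 0 = 0.616 \text{ bits}$$

Using equation 3

$$\mathrm{Gain}(A1) = \mathrm{Info}(D) - \mathrm{Info}_{A1}(D) = 0.971 - 0.616 = 0.355 \text{ bits}$$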
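A short Python check of the A1 calculation, printing each branch's weighted entropy term, might look like the following. The helper `entropy` and the dictionary `a1_counts` are illustrative names; the counts are the (Y, N) tallies for hot, mild, and cold.

```python
from math import log2

def entropy(counts):
    """Entropy in bits of a tuple of class counts; zero counts are skipped."""
    total = sum(counts)
    return sum(-c / total * log2(c / total) for c in counts if c > 0)

# (Y, N) counts for each value of A1
a1_counts = {"hot": (3, 1), "mild": (3, 3), "cold": (0, 5)}
n = sum(sum(c) for c in a1_counts.values())  # 15 tuples

info_a1 = 0.0
for value, counts in a1_counts.items():
    term = sum(counts) / n * entropy(counts)  # weight times branch entropy
    print(f"{value:5s} {sum(counts)}/{n} * {entropy(counts):.3f} = {term:.3f}")
    info_a1 += term

info_d = entropy((6, 9))      # entropy of the full dataset (6 Y, 9 N)
gain_a1 = info_d - info_a1
print(f"Info_A1(D) = {info_a1:.3f}")  # 0.616
print(f"Gain(A1)   = {gain_a1:.3f}")  # 0.355
```

The mild branch is split evenly between Y and N, so it contributes a full 6/15 × 1 = 0.4 bits, while the pure cold branch contributes nothing.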

For attribute A2, the counts by value and class are:

A2 Y N Total
small 1 2 3
medium 4 4 8
large 1 3 4

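Using equation 2

$$\mathrm{Info}_{A2}(D) = \frac{3}{15}\left(-\frac{1}{3}\log_2\frac{1}{3} - \frac{2}{3}\log_2\frac{2}{3}\right) + \frac{8}{15}\left(-\frac{4}{8}\log_2\frac{4}{8} - \frac{4}{8}\log_2\frac{4}{8}\right) + \frac{4}{15}\left(-\frac{1}{4}\log_2\frac{1}{4} - \frac{3}{4}\log_2\frac{3}{4}\right)$$

$$= \frac{3}{15}(0.918) + \frac{8}{15}(1) + \frac{4}{15}(0.811) = 0.184 + 0.533 + 0.216 = 0.933 \text{ bits}$$

Using equation 3

$$\mathrm{Gain}(A2) = \mathrm{Info}(D) - \mathrm{Info}_{A2}(D) = 0.971 - 0.933 = 0.038 \text{ bits}$$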

The attribute with the highest information gain is chosen as the root. A1 has the higher information gain of the two attributes, so A1 should be chosen as the root.
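The whole root-selection step can also be cross-checked from the raw rows. The sketch below embeds the 15 tuples directly rather than reading e1-p1.csv (the file itself is not available here), and the names `ROWS`, `entropy`, and `information_gain` are illustrative. It is a check on the hand calculation, not a replacement for it.

```python
from collections import Counter, defaultdict
from math import log2

# (A1, A2, Class) tuples copied from the dataset
ROWS = [
    ("hot", "medium", "N"), ("mild", "large", "N"), ("hot", "small", "Y"),
    ("cold", "medium", "N"), ("cold", "small", "N"), ("mild", "medium", "Y"),
    ("cold", "large", "N"), ("mild", "medium", "Y"), ("mild", "large", "Y"),
    ("mild", "medium", "N"), ("hot", "medium", "Y"), ("cold", "large", "N"),
    ("mild", "small", "N"), ("hot", "medium", "Y"), ("cold", "medium", "N"),
]

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return sum(-c / total * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attr_index):
    """Entropy of all labels minus the weighted entropy after splitting on one column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attr_index]].append(row[-1])  # class labels per attribute value
    expected = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[-1] for row in rows]) - expected

gains = {name: information_gain(ROWS, i) for i, name in enumerate(["A1", "A2"])}
best = max(gains, key=gains.get)
for name, g in gains.items():
    print(f"Gain({name}) = {g:.3f}")  # Gain(A1) = 0.355, Gain(A2) = 0.038
print("Root attribute:", best)        # Root attribute: A1
```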

