What kinds of problems do attributes cause in entropy-based decision trees, and how can we solve them?
Decision Tree Classification:
•During the early 1980s, J. Ross Quinlan, a researcher in machine learning, developed a decision tree algorithm known as ID3 (Iterative Dichotomiser).
•Later, he presented C4.5, which was the successor of ID3.
•ID3 and C4.5 adopt a greedy approach.
•In these algorithms, there is no backtracking; the trees are constructed in a top-down, recursive, divide-and-conquer manner.
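Below is a minimal Python sketch of this top-down, recursive, divide-and-conquer construction. The Node class and the best_attribute callback are illustrative placeholders (a candidate best_attribute based on information gain is sketched later in these notes), not part of any particular library.

```python
from collections import Counter

class Node:
    def __init__(self, label=None, attribute=None):
        self.label = label          # class label for leaf nodes
        self.attribute = attribute  # splitting attribute for internal nodes
        self.children = {}          # attribute outcome -> child Node

def build_tree(rows, labels, attributes, best_attribute):
    """Greedy top-down construction: pick the best attribute, split, recurse.
    There is no backtracking; a chosen split is never revisited."""
    # Base case 1: all tuples belong to the same class -> pure leaf
    if len(set(labels)) == 1:
        return Node(label=labels[0])
    # Base case 2: no attributes left -> majority-class leaf
    if not attributes:
        return Node(label=Counter(labels).most_common(1)[0][0])
    # Greedy step: choose the attribute that "best" separates the tuples
    a = best_attribute(rows, labels, attributes)
    node = Node(attribute=a)
    remaining = [x for x in attributes if x != a]
    # Partition the sample by each outcome of the chosen attribute and recurse
    for value in set(row[a] for row in rows):
        subset = [(row, lab) for row, lab in zip(rows, labels) if row[a] == value]
        sub_rows, sub_labels = zip(*subset)
        node.children[value] = build_tree(list(sub_rows), list(sub_labels),
                                          remaining, best_attribute)
    return node
```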
Decision Tree Classification – Approach:
•There are a couple of algorithms available to build a decision tree:
•CART (Classification and Regression Trees) → uses the Gini Index (classification) as its metric.
•ID3 (Iterative Dichotomiser 3) → uses the Entropy function and Information Gain as metrics.
•We can use either of them to build the classification tree.
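If scikit-learn happens to be available, switching between the two measures is just the criterion parameter of DecisionTreeClassifier. A small sketch follows; note that scikit-learn implements an optimized CART-style learner, so criterion="entropy" only approximates ID3/C4.5 behaviour, and the iris data is used purely as a convenient example.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Gini impurity, the measure used by CART
cart_like = DecisionTreeClassifier(criterion="gini").fit(X, y)

# Entropy / information gain, the measure used by ID3 and C4.5
id3_like = DecisionTreeClassifier(criterion="entropy").fit(X, y)

print(cart_like.get_depth(), id3_like.get_depth())
```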
Attribute Selection:
•An attribute selection measure is a heuristic for selecting the splitting criterion that “best” separates a given data sample, S, of class-labeled training tuples into individual classes.
•If we were to split S into smaller partitions according to the outcomes of the splitting criterion, ideally each partition would be pure (i.e., all the tuples that fall into a given partition would belong to the same class).
•Conceptually, the “best” splitting criterion is the one that most closely results in such a scenario.
•Attribute selection measures are also known as splitting rules because they determine how the tuples at a given node are to be split.
•We shall be using Information Gain and Gain Ratio as the attribute selection measures.
Entropy:
•In machine learning, entropy is a measure of the randomness in the information being processed.
•Also, it is a measure of the uncertainty associated with a random variable.
•The higher the entropy, the harder it is to draw any conclusions from that information.
•Entropy of S is defined as: Entropy(S) = Info(S) = – Σ (i = 1 to m) P(Ci) * log2(P(Ci))
•Where,
•S, the data sample, is a training set of class-labeled tuples.
•The class label attribute has m distinct values defining m distinct classes, Ci (for i = 1, …, m).
•P(Ci) is the probability that a tuple in S belongs to class Ci, estimated as |Ci,S| / |S|.
•Ci,S is the set of tuples of class Ci in S.
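A small Python sketch of this definition; the function name entropy and the representation of S as a plain list of class labels are illustrative choices.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = - sum over classes Ci of P(Ci) * log2 P(Ci)."""
    n = len(labels)
    counts = Counter(labels)  # |Ci,S| for each class Ci
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A pure sample has entropy 0; an even two-class split has entropy 1.
print(entropy(["yes", "yes", "yes"]))        # 0.0
print(entropy(["yes", "yes", "no", "no"]))   # 1.0
```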
Information Gain:
•Information gain can be defined as the amount of information gained about a random variable or signal from observing another random variable.
•It can be considered as the difference between the entropy of the parent node and the weighted average entropy of the child nodes.
Information Gain = Entropy(Parent) – [Weighted Average] * Entropy(Children)
Gain(A) = Entropy(S) – Σ (j = 1 to v) (|Sj| / |S|) * Entropy(Sj)
where |Sj| / |S| is the weight of the jth partition, and attribute A can be used to split S into v partitions or subsets, {S1, S2, …, Sv}, where Sj contains those tuples in S that have outcome aj of A.
•The attribute A with the highest information gain, Gain(A), is chosen as the splitting attribute at node N.
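Continuing the sketch above, Gain(A) can be computed by partitioning the labels on each outcome of A. Rows are assumed to be dicts keyed by attribute name (an illustrative representation), and entropy is the function defined earlier.

```python
def information_gain(rows, labels, attribute):
    """Gain(A) = Entropy(S) - sum_j (|Sj|/|S|) * Entropy(Sj)."""
    n = len(labels)
    parent_entropy = entropy(labels)
    weighted_child_entropy = 0.0
    # Partition S by each outcome aj of the attribute
    for value in set(row[attribute] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attribute] == value]
        weighted_child_entropy += (len(subset) / n) * entropy(subset)
    return parent_entropy - weighted_child_entropy

def best_attribute(rows, labels, attributes):
    # Greedy choice: the attribute with the highest information gain
    return max(attributes, key=lambda a: information_gain(rows, labels, a))
```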
Gain Ratio:
•Information gain is biased toward attributes with many outcomes.
•That is, it prefers attributes with a large number of distinct values.
•For instance, an attribute that acts as a unique identifier, such as customer_ID, produces one pure partition per tuple, so the partition entropy Info_A(S) is zero.
•This maximizes the information gain but yields a useless partitioning.
•C4.5 overcomes this bias with the gain ratio, which normalizes information gain by the split information:
SplitInfo_A(S) = – Σ (j = 1 to v) (|Sj| / |S|) * log2(|Sj| / |S|)
GainRatio(A) = Gain(A) / SplitInfo_A(S)
•The attribute with the highest gain ratio is chosen as the splitting attribute.
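A sketch of this gain-ratio correction, reusing the entropy and information_gain functions from the sketches above; the tiny four-tuple example (customer_ID versus a hypothetical student attribute) is made up purely to show how SplitInfo penalizes the unique identifier.

```python
import math
from collections import Counter

def split_info(rows, attribute):
    """SplitInfo_A(S) = - sum_j (|Sj|/|S|) * log2(|Sj|/|S|)."""
    n = len(rows)
    sizes = Counter(row[attribute] for row in rows).values()
    return -sum((s / n) * math.log2(s / n) for s in sizes)

def gain_ratio(rows, labels, attribute):
    si = split_info(rows, attribute)
    return 0.0 if si == 0 else information_gain(rows, labels, attribute) / si

rows = [{"customer_ID": i, "student": s} for i, s in enumerate("yynn")]
labels = ["buys", "buys", "no", "no"]

# customer_ID maximizes information gain (every partition is pure) ...
print(information_gain(rows, labels, "customer_ID"))  # 1.0
print(information_gain(rows, labels, "student"))      # 1.0
# ... but its large SplitInfo penalizes it under gain ratio
print(gain_ratio(rows, labels, "customer_ID"))        # 0.5
print(gain_ratio(rows, labels, "student"))            # 1.0
```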