Describe a real-world prediction problem using urban data for which accuracy is paramount and interpretability may be less important, and for which it might be preferable to use random forests rather than decision trees. Argue why this is the case.
RANDOM FOREST -
There is plenty of material in circulation explaining the concept of random forest. My attempt here is to explain the concept with a touch of simplification, by juxtaposing it with a few social science scenarios. Those who are new to data science can get intimidated by the technical deluge and lose interest.
DECISION TREE
There are many advantages to how decision trees operate, but an individual decision tree introduces problems of its own, and with these problems we rarely end up with a robust predictive model. The main reasons are listed below as disadvantages:
Disadvantages:
I. Decision trees are very unstable: a small variation in the dataset can lead to a very different tree taking shape, because the tree is highly dependent on the particular training data used.
II. Over-fitting is a common problem with decision trees. They follow the pattern of the training data too closely, a pattern that other datasets will not replicate, resulting in poor performance on unseen data. (Both problems are illustrated in the sketch after this list.)
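To make these two disadvantages concrete, here is a minimal sketch of my own (not from any source above) using scikit-learn on a synthetic dataset; the dataset and parameters are purely illustrative choices:

```python
# Illustrative sketch of the two disadvantages, on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# I. Instability: two trees fit on largely overlapping slices of the
# same training data can still disagree on a chunk of the test set.
tree_a = DecisionTreeClassifier(random_state=0).fit(X_train[:500], y_train[:500])
tree_b = DecisionTreeClassifier(random_state=0).fit(X_train[200:], y_train[200:])
disagree = (tree_a.predict(X_test) != tree_b.predict(X_test)).mean()
print(f"fraction of test points where the trees disagree: {disagree:.2f}")

# II. Over-fitting: an unconstrained tree memorises the training set
# (train accuracy 1.0) yet scores noticeably lower on unseen data.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("train accuracy:", tree.score(X_train, y_train))
print("test accuracy:", tree.score(X_test, y_test))
```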
COMPARISON BETWEEN RANDOM FOREST AND DECISION TREE-
The context having been set with a very generic understanding of random forest, we will now move to the technical intricacies of the algorithm. Building a forest means building many individual trees as a conglomeration; hence, it is imperative to understand the working of a decision tree in order to understand the modus operandi of random forest.
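Before turning to the tree itself, the "conglomeration" idea can be made concrete with a bare-bones sketch of my own, assuming the textbook recipe of bootstrap samples, a random feature subset at each split, and a majority vote (the binary 0/1 labels and the helper names are my illustrative choices, not a production implementation):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_toy_forest(X, y, n_trees=50, seed=0):
    """Fit many trees, each on its own bootstrap sample of the data."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))  # bootstrap: sample rows with replacement
        # max_features="sqrt" considers a random subset of features at each
        # split, the second ingredient that de-correlates the trees
        tree = DecisionTreeClassifier(max_features="sqrt",
                                      random_state=int(rng.integers(1_000_000)))
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_toy_forest(trees, X):
    """Majority vote over the individual trees (binary 0/1 labels assumed)."""
    votes = np.stack([t.predict(X) for t in trees])  # shape: (n_trees, n_samples)
    return (votes.mean(axis=0) >= 0.5).astype(int)
```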
In real life, the decision-making process is generally a subconscious, step-wise activity in which each option (the equivalent of a feature/predictor in data science parlance) is weighed for its pros and cons, and the option that gives the best outcome is chosen. In other words, it is a step-wise process where, at each step, the factors that give a better separation among all levels of the decision are chosen for further analysis, until the final decision is made. Consider a real-life example of a decision tree where the levels of decision are Yes (I should quit) and No (I should not): based on different situations, the answer is different for each individual.
In data science, decision tree algorithms work by splitting the data into homogeneous groups, which is achieved by selecting the predictor that yields the best split. The technique works on both classification and regression problems. The criteria generally used to identify the significant predictors, which ultimately help in splitting the data, are "Entropy", "Information Gain" and "Gini Index". The purpose of all three is to identify the best split, the one that moves us towards a clear separation between the various levels of the target variable. The moment we reach a split that cannot be homogenized further, we have our decision for that branch. A decision tree can have multiple branches, which together define the rules that give the information required for the prediction. For a more detailed technical explanation of how these concepts work, please refer to a very interesting blog I found on Medium itself, Decision Trees.
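To give these criteria some substance, here is a short sketch of my own for a binary target; the formulas are the standard ones, while the helper names and the toy labels are my illustrative choices:

```python
# Impurity measures used to score candidate splits (binary 0/1 labels assumed).
import numpy as np

def entropy(y):
    # H = -sum(p * log2(p)) over the class proportions p
    p = np.bincount(y) / len(y)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gini(y):
    # G = 1 - sum(p^2) over the class proportions p
    p = np.bincount(y) / len(y)
    return 1.0 - np.sum(p ** 2)

def information_gain(y_parent, y_left, y_right):
    # Reduction in entropy achieved by a split: the larger the gain,
    # the more homogeneous the two child groups are.
    n = len(y_parent)
    child = (len(y_left) / n) * entropy(y_left) + (len(y_right) / n) * entropy(y_right)
    return entropy(y_parent) - child

y = np.array([0, 0, 0, 1, 1, 1])
print(gini(y))                            # 0.5: a perfectly mixed node
print(information_gain(y, y[:3], y[3:]))  # 1.0: a perfect split
```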
CONCLUSION -
Suppose we need to make predictions based on a massive number of features. What algorithm comes to mind first? For me, it is almost always random forest. When faced with the problem of over-fitting, the machine learning technique that comes to the rescue (more often than not) is random forest again. And when we want an easy way around the problems that a single tree's greedy splitting causes, random forest again seems to be the answer.
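To see the over-fitting point in numbers, here is an illustrative comparison on synthetic data (again a sketch of my own, with arbitrary settings): the forest typically narrows the train/test gap that a single tree shows.

```python
# Single tree vs. forest on the same synthetic classification task.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=50, n_informative=10,
                           random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

for name, model in [("decision tree", DecisionTreeClassifier(random_state=1)),
                    ("random forest", RandomForestClassifier(n_estimators=200,
                                                             random_state=1))]:
    model.fit(X_tr, y_tr)
    print(f"{name}: train={model.score(X_tr, y_tr):.3f} "
          f"test={model.score(X_te, y_te):.3f}")
```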
Until not too long ago, random forest was one of the most widely used predictive modelling techniques in data science competitions. Of late, the boosting algorithm XGBoost has taken over, but random forest remains a very useful technique to know and use.