


Describe a real-world prediction problem using urban data for which accuracy is paramount and interpretability may be less important, and for which it might be preferable to use random forests rather than decision trees. Argue why this is the case.


Expert Solution


There is a great deal of material in circulation explaining the concept of random forest. My attempt here is to explain the concept with a touch of simplification, by juxtaposing it with a few social science scenarios. Those who are new to data science can be intimidated by the technical deluge and lose interest.


There are many advantages to how decision trees operate, but used individually they introduce several problems, and a single tree rarely gives a robust predictive model. The main disadvantages are given below:


I. Instability: a small variation in the data set can lead to a very different tree taking shape, because the tree depends heavily on the training data being used.

II. Over-fitting: decision trees tend to follow the pattern of the training data too closely. That pattern is not replicated in other data sets, resulting in poor performance on unseen data.
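The instability in point I can be seen on a toy example. The sketch below is only an illustration under stated assumptions: the data is made up, the helper names (`misclassified`, `best_root_split`) are hypothetical, and a plain misclassification count is used as the split score to keep the code short. It finds the best first split for two training sets that differ in a single row.

```python
from collections import Counter

def misclassified(labels):
    """Number of labels that differ from the majority label in a group."""
    if not labels:
        return 0
    return len(labels) - Counter(labels).most_common(1)[0][1]

def best_root_split(rows, labels):
    """Return (feature_index, threshold) of the split with the fewest errors."""
    best = None  # (errors, feature_index, threshold)
    for f in range(len(rows[0])):
        values = sorted({row[f] for row in rows})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[f] < t]
            right = [y for row, y in zip(rows, labels) if row[f] >= t]
            errors = misclassified(left) + misclassified(right)
            if best is None or errors < best[0]:
                best = (errors, f, t)
    return best[1], best[2]

# Six shared rows (hypothetical toy data): either feature separates the classes.
shared_rows = [(1, 1), (2, 2), (3, 3), (7, 7), (8, 8), (9, 9)]
shared_labels = [0, 0, 0, 1, 1, 1]

# The two training sets differ in a single class-0 row.
split_a = best_root_split(shared_rows + [(4, 8)], shared_labels + [0])
split_b = best_root_split(shared_rows + [(8, 4)], shared_labels + [0])

print("root split with row (4, 8):", split_a)
print("root split with row (8, 4):", split_b)
```

With the first extra row the root splits on feature 0, and with the second it splits on feature 1, so every branch grown below the root would differ as well.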


With the context set by a very general understanding of random forest, we will now move to the technical intricacies of the algorithm. Building a forest means building many individual trees and combining them. Hence, it is imperative to understand how a decision tree works in order to understand the modus operandi of random forest.
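The idea of building many trees and combining them can be sketched in a few lines. This is a simplified illustration, not a full random forest: each "tree" is a depth-one stump on a single feature, each stump is trained on a bootstrap sample (rows drawn with replacement), and the forest predicts by majority vote. A real random forest also grows deeper trees and considers a random subset of features at each split, which this sketch leaves out. The function names and the toy data are assumptions made for the example.

```python
import random

def fit_stump(xs, ys):
    """Pick the threshold t (from the sample values) with the fewest training
    errors for the rule: predict 1 if x >= t, else 0."""
    best_t, best_errors = None, None
    for t in sorted(set(xs)):
        errors = sum((x >= t) != bool(y) for x, y in zip(xs, ys))
        if best_errors is None or errors < best_errors:
            best_t, best_errors = t, errors
    return best_t

def fit_forest(xs, ys, n_trees=25, seed=0):
    """Fit one stump per bootstrap sample (sampling rows with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return stumps

def predict(stumps, x):
    """Majority vote over the stumps' individual predictions."""
    votes = sum(x >= t for t in stumps)
    return 1 if 2 * votes > len(stumps) else 0

# Hypothetical toy data: the label is 1 when the feature is 10 or more.
xs = list(range(20))
ys = [1 if x >= 10 else 0 for x in xs]

forest = fit_forest(xs, ys)
print("thresholds learned by the stumps:", sorted(forest))
print("prediction for x=0:", predict(forest, 0))
print("prediction for x=19:", predict(forest, 19))
```

The individual thresholds vary from one bootstrap sample to the next, but the vote smooths out that variation, which is the intuition behind using a forest instead of one tree.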

In real life, the decision-making process is generally a subconscious, step-wise activity in which each option (equivalent to a feature or predictor in data science parlance) is weighed for its pros and cons, and the option that gives the best output is chosen. In other words, at each step the factors that give the best separation amongst the levels of decision are chosen for further analysis, until the final decision is made. Consider a real-life example of a decision tree in which the levels of decision are Yes (I should quit) and No (I should not): based on different situations, the answer is different for each individual.

In data science, decision tree algorithms work towards splitting the data into homogeneous groups. This is achieved by selecting, at each step, the predictor that gives the best split. Decision trees work on both classification and regression problems. The criteria generally used to identify the significant predictors, which ultimately help in splitting the data, are “Entropy”, “Information Gain” and “Gini Index”. The purpose of all three is to identify the split that moves us closest to a clear separation between the levels of the target variable. The moment we have a split that cannot be made more homogeneous, we have our decision for that branch. A decision tree can have multiple branches, which together define the rules that give the information required for the prediction. For a more detailed technical explanation of how these concepts work, please refer to a very interesting blog I found on Medium itself, Decision Trees.
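As a rough illustration of the three criteria, the sketch below computes Gini index, entropy and information gain from their standard definitions for a small made-up set of labels. The function names and the example data are assumptions for the illustration.

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over class proportions."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical labels at one node of a tree, evenly mixed between two classes.
parent = ["yes", "yes", "no", "no"]

# A split that separates the classes completely, and one that does not help.
clean_split = [["yes", "yes"], ["no", "no"]]
useless_split = [["yes", "no"], ["yes", "no"]]

print("Gini of parent:", gini(parent))
print("Entropy of parent:", entropy(parent))
print("Gain of clean split:", information_gain(parent, clean_split))
print("Gain of useless split:", information_gain(parent, useless_split))
```

A tree-building algorithm evaluates candidate splits this way and keeps the one with the highest gain (or the lowest weighted impurity), which is why the clean split would be preferred here.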


Suppose we need to make predictions based on a massive number of features. What algorithm comes to mind first? For me, it is usually random forest. When faced with the problem of over-fitting, the machine learning technique that comes to the rescue, more often than not, is again random forest. When we want an easy solution to a problem caused by a greedy algorithm, random forest seems to be the answer.

Until not too long ago, random forest was one of the most widely used predictive modelling techniques in data science competitions. Of late, the boosting algorithm XGBoost has taken over, but random forest remains a very useful technique to know and use.
