Suppose we are interested in the effect of drinking on student achievement. We survey students at the University of Rhode Island about their drinking habits (the number of times they “binge” drank, meaning 5+ drinks in one sitting, in the previous semester) and their previous semester’s GPA. We also have data on their gender, race, parents’ education, and family income. Suppose we estimate the following regression:
$GPA_i = \beta_0 + \beta_1 BingeEvents_i + \gamma X_i + \varepsilon_i$
where $GPA_i$ is student $i$’s GPA last semester, $BingeEvents_i$ is the number of times they reported binge drinking last semester, $X_i$ is a set of controls for gender, race, and family variables, and $\varepsilon_i$ is the error term. Suppose we find a statistically significant negative correlation between binge drinking and GPA: increased binge drinking is associated with decreases in GPA ($\beta_1 < 0$). Why does this finding not imply a causal effect of binge drinking on GPA?
The question states that binge drinking is negatively correlated with GPA and that the relationship is statistically significant. The key point, however, is that a correlation between two variables does not imply causation. A simple example illustrates this:
There have been findings of a negative correlation between a student's anxiety before an exam and the student's score on that exam. But we cannot say that the anxiety causes the lower score; there could be other reasons. For example, the student may not have studied well, which could both raise anxiety and lower the score. So the correlation here does not imply causation.
There is also a positive correlation between the number of hours a student spends studying for a test and the grade the student gets on it. Here causation is plausible as well: spending more time studying tends to result in a higher grade. The correlation by itself, however, looks the same in both cases and cannot tell us which relationship is causal.
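So this is why we cannot infer a causal effect of binge drinking on GPA even though the two are negatively correlated. The regression controls only for gender, race, and family variables, so an unobserved factor such as motivation or study habits could increase binge drinking and lower GPA at the same time. The causation could also run the other way, with poor grades leading students to drink more.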
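A small simulation can make this concrete. The following is only a minimal sketch and is not part of the original question: it assumes a hypothetical unobserved variable called `motivation`, and every coefficient in it is a made-up value chosen for illustration. GPA is generated so that binge drinking has no causal effect at all, yet a regression that omits `motivation` still produces a negative slope on binge events.

```python
import numpy as np


def simulate_beta1(n=5000, seed=0):
    """Return the OLS slope on binge events without and with the confounder.

    All numbers below are illustrative assumptions, not estimates from data.
    """
    rng = np.random.default_rng(seed)

    # Hypothetical unobserved confounder (not collected in the survey).
    motivation = rng.normal(size=n)

    # Assumption: more motivated students binge less often.
    # Counts are rounded and cannot be negative.
    binge = np.clip(np.round(3.0 - 1.5 * motivation + rng.normal(size=n)), 0, None)

    # GPA depends on motivation only: the true causal effect of
    # binge drinking is zero by construction.
    gpa = 3.0 + 0.4 * motivation + rng.normal(scale=0.3, size=n)

    # "Short" regression: GPA on an intercept and binge events (confounder omitted).
    X_short = np.column_stack([np.ones(n), binge])
    beta_short = np.linalg.lstsq(X_short, gpa, rcond=None)[0]

    # "Long" regression: same, but also controlling for the confounder.
    X_long = np.column_stack([np.ones(n), binge, motivation])
    beta_long = np.linalg.lstsq(X_long, gpa, rcond=None)[0]

    return beta_short[1], beta_long[1]


if __name__ == "__main__":
    short, long_ = simulate_beta1()
    print(f"slope on binge events, confounder omitted:  {short:.3f}")
    print(f"slope on binge events, confounder included: {long_:.3f}")
```

Under these assumptions, the slope from the short regression comes out clearly negative, while the slope from the long regression is close to zero. A negative and statistically significant estimate of β1 is therefore what we would see even in a world where binge drinking did nothing to GPA, as long as a relevant variable is missing from the controls.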