Does linear regression estimate a cause and effect relationship? Why or why not?
Quick answer: No.
Any statistics text worth its salt will caution the reader not to confuse correlation with causation. Yet the mistake is very common. As a refresher, here's an example:
Consider elementary school students' shoe sizes and scores on a standard reading exam. They are correlated, but saying that larger shoe size causes higher reading scores is as absurd as saying that high reading scores cause larger shoe size.
In this example, there is a clear lurking variable, namely age. As children get older, both their shoe size and their reading ability increase.
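To make the lurking-variable idea concrete, here is a minimal simulation sketch. All numbers (age range, slopes, noise levels) are made up for illustration and are not taken from any real data; the helper names `pearson` and `residuals` are likewise hypothetical. Shoe size and reading score are each generated from age alone, yet they come out strongly correlated. Once the linear effect of age is removed from both, the leftover correlation is close to zero.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def residuals(ys, xs):
    """Residuals from the least-squares line of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [y - (my + slope * (x - mx)) for x, y in zip(xs, ys)]

rng = random.Random(0)  # fixed seed so the run is repeatable

# Assumed toy model: age drives both variables; neither affects the other.
ages = [rng.uniform(6, 11) for _ in range(2000)]
shoe = [0.8 * a + rng.gauss(0, 0.7) for a in ages]
reading = [10 * a + rng.gauss(0, 8) for a in ages]

# Raw correlation between shoe size and reading score.
raw_r = pearson(shoe, reading)

# Correlation after removing the linear effect of age from both variables.
partial_r = pearson(residuals(shoe, ages), residuals(reading, ages))

print(f"raw correlation:            {raw_r:.2f}")
print(f"correlation adjusted for age: {partial_r:.2f}")
```

In this toy setup the raw correlation is large while the age-adjusted one is near zero, which is what one would expect when age is the only link between the two variables.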
Elaborating on this situation:
If you agree that increasing age (for elementary school children) causes increasing foot size, and therefore increasing shoe size, then you expect a correlation between age and shoe size. Correlation is symmetric, so shoe size and age are correlated. But it would be absurd to say that shoe size causes age.
In other words, even when there is a causal relationship, the causality typically goes only one way. (Of course, it could go both ways, as in a feedback loop.)
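The symmetry point can be checked numerically. The sketch below uses made-up numbers and hypothetical helper names (`pearson`, `ols_slope`); it generates shoe size from age and then fits a least-squares line in both directions. Both slopes come out positive, and their product equals the squared correlation, so the fit itself cannot tell us which variable is the cause.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def ols_slope(ys, xs):
    """Slope of the least-squares line of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

rng = random.Random(1)  # fixed seed so the run is repeatable

# Assumed toy model: age causes shoe size, never the reverse.
ages = [rng.uniform(6, 11) for _ in range(1000)]
shoe = [0.8 * a + rng.gauss(0, 0.7) for a in ages]

b_shoe_on_age = ols_slope(shoe, ages)  # the "causal" direction
b_age_on_shoe = ols_slope(ages, shoe)  # the absurd direction fits just as well
r = pearson(ages, shoe)

print(f"slope, shoe on age: {b_shoe_on_age:.3f}")
print(f"slope, age on shoe: {b_age_on_shoe:.3f}")
print(f"product of slopes:  {b_shoe_on_age * b_age_on_shoe:.3f}")
print(f"r squared:          {r * r:.3f}")
```

The direction of causation here is known only because it was built into the simulation; nothing in the two fitted slopes reveals it.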
One situation where people slip into confusing correlation and causality is regression. For example, one might regress college GPA on SAT scores, obtaining a positive coefficient beta on SAT score in the regression equation. Consider the following two statements:
Statement 2 is correct (assuming, of course, that the regression has been carried out correctly). Statement 1 is incorrect: the regression equation gives no information about causality. Indeed, there is likely a lurking variable (or probably a bunch of lurking variables) that affects both GPA and SAT score; SAT score is considered to be a (perhaps crude) measure of this lurking variable.
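As a rough illustration of the lurking-variable explanation, consider the toy simulation below. It assumes a single hypothetical latent variable (called `prep` here) that drives both SAT score and GPA; every coefficient and noise level is invented for the sketch, and real data may have a quite different causal structure. The regression of GPA on SAT yields a positive beta, but artificially adding 100 points to every SAT score leaves GPA unchanged in this simulated world, because GPA depends only on the latent variable.

```python
import random

def ols_slope(ys, xs):
    """Slope of the least-squares line of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

def simulate(sat_boost=0.0, seed=2, n=2000):
    """Toy world: a latent 'prep' variable drives both SAT and GPA.

    sat_boost models an intervention that raises SAT scores directly
    without touching the latent variable.
    """
    rng = random.Random(seed)
    sat, gpa = [], []
    for _ in range(n):
        prep = rng.gauss(0, 1)  # hypothetical lurking variable
        sat.append(1050 + 150 * prep + rng.gauss(0, 100) + sat_boost)
        gpa.append(3.0 + 0.4 * prep + rng.gauss(0, 0.3))  # no SAT term
    return sat, gpa

# Observational data: fit GPA on SAT.
sat0, gpa0 = simulate()
beta = ols_slope(gpa0, sat0)

# Same students (same seed), but every SAT score raised by 100 points.
sat1, gpa1 = simulate(sat_boost=100)

predicted_gain = beta * 100  # what a causal reading of beta would claim
actual_gain = sum(gpa1) / len(gpa1) - sum(gpa0) / len(gpa0)

print(f"beta:                      {beta:.5f}")
print(f"GPA gain a causal reading predicts: {predicted_gain:.3f}")
print(f"GPA gain in the simulation:         {actual_gain:.3f}")
```

The fitted beta still describes the association in the observed data correctly; what fails is only the causal reading of it.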