Suppose that a relevant variable is omitted from a simple regression (i.e., there should be a second explanatory variable in the model, but there is not). Under what conditions would the estimated slope coefficient (i.e., the one usually called β1) be biased downward; under what conditions would the estimated slope coefficient be biased upwards?
Suppose the true cause-and-effect relationship is given by
y = a + bx + cz + u
with parameters a, b, c, dependent variable y, independent variables x and z, and error term u. We wish to know the effect of x itself upon y (that is, we wish to obtain an estimate of b).
Two conditions must hold for omitted-variable bias to exist in linear regression: the omitted variable z must be a determinant of the dependent variable y (c ≠ 0), and z must be correlated with the included explanatory variable x. The derivation below makes both conditions precise.
Suppose we omit z from the regression, and suppose the relation between x and z is given by
z = d + fx + e
with parameters d, f and error term e. Substituting the second equation into the first gives
y = (a + cd) + (b + cf)x + (u + ce).
If a regression of y is conducted upon x only, this last equation is what is estimated, and the regression coefficient on x is actually an estimate of (b + cf). It is therefore not simply an estimate of the desired direct effect of x upon y (which is b), but of its sum with the indirect effect (the effect f of x on z times the effect c of z on y). Thus by omitting the variable z from the regression, we have estimated the total derivative of y with respect to x rather than its partial derivative with respect to x. These differ whenever both c and f are non-zero.
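A small simulation can make this concrete. The sketch below is not part of the original answer and all parameter values in it are hypothetical: it draws data from the true model y = a + bx + cz + u with z = d + fx + e, fits y on x alone, and shows that the estimated slope lands near b + cf rather than b.

```python
# Simulation sketch of omitted-variable bias (hypothetical parameter values).
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

a, b, c = 1.0, 2.0, 1.5   # true intercept and the effects of x and z on y
d, f = 0.5, 0.8           # relation of the omitted variable z to x

x = rng.normal(size=n)
z = d + f * x + rng.normal(size=n)           # e: error in the x-z relation
y = a + b * x + c * z + rng.normal(size=n)   # u: error in the y equation

# OLS of y on x only (with an intercept): slope = cov(x, y) / var(x)
slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
print(f"estimated slope: {slope:.3f}")   # close to b + c*f = 2.0 + 1.5*0.8 = 3.2
print(f"b + c*f        : {b + c * f:.3f}")
print(f"true b         : {b:.3f}")
```

With these (made-up) numbers the slope on x comes out near 3.2, not the true direct effect of 2.0, because the regression absorbs the indirect path through z.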
The direction and extent of the bias are both contained in cf, since the effect sought is b but the regression estimates b + cf. The magnitude of the bias is the absolute value of cf, and the direction of bias is upward (toward a more positive or less negative value) if cf > 0, i.e., if the direction of correlation between z and y is the same as that between x and z, and downward if cf < 0.
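For example, with the hypothetical values used in the sketch above (b = 2, c = 1.5, f = 0.8), cf = 1.2 > 0 and the regression of y on x alone estimates roughly 3.2, an upward bias of 1.2. Flipping the sign of f to −0.8 makes cf = −1.2, so the same regression would be biased downward by 1.2, estimating roughly 0.8.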