Hello
What does it mean that the residuals in linear regression are normally distributed? Why is it only the residuals that are assumed normal, and not the "raw" data? And why do we want our residuals to be normal?
The assumption that the residuals in linear regression are normally distributed plays a crucial role in estimating the error variance and in the associated inferential procedures. This distributional assumption lets us derive the sampling distribution of the estimates of the model parameters, and it also provides a point estimate of the error variance itself. Consequently, the test statistics used to test the significance of individual parameters, or of the model as a whole, follow either a Normal distribution or a t-distribution with the appropriate degrees of freedom. Had we not assumed that the error terms are normal, all of these procedures would be far more complicated to deal with.
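As a minimal sketch (not part of the original answer, and with made-up simulated data), you can see this in practice by fitting a line with ordinary least squares and checking the residuals for normality, for instance with a Shapiro-Wilk test:

```python
import numpy as np
from scipy import stats

# Simulate data from a linear model with normal errors (by construction)
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 10, n)
eps = rng.normal(0, 1, n)
y = 2.0 + 3.0 * x + eps

# Fit y = b0 + b1*x by ordinary least squares
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

# Shapiro-Wilk test of normality on the residuals:
# a large p-value means no evidence against normality
stat, p = stats.shapiro(residuals)
```

Because the model includes an intercept, the residuals sum to (numerically) zero, and with truly normal errors the Shapiro-Wilk test will usually not reject normality.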
Conventionally the raw data, or more specifically the independent variable, is assumed to be non-stochastic and uncorrelated with the errors. This helps maintain homoscedasticity, the condition that the error terms have equal variances. If the independent variable and the errors turn out to be related, the errors become heteroscedastic, and this has to be dealt with separately using specially designed methods. Thus we do not assume normality of the raw data, only of the errors.
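To make the heteroscedasticity point concrete, here is a hedged sketch (again with invented simulated data) where the error spread grows with the independent variable; comparing the residual spread for small versus large x reveals the unequal variances:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.uniform(0, 10, n)

# Error standard deviation grows with x -> heteroscedastic errors
eps = rng.normal(0, 0.5 + 0.5 * x)
y = 1.0 + 2.0 * x + eps

# Ordinary least squares fit of y = b0 + b1*x
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Compare residual spread for small vs large x: a ratio well above 1
# signals that the error variance is not constant across x
low_spread = resid[x < 5].std()
high_spread = resid[x >= 5].std()
ratio = high_spread / low_spread
```

Under homoscedasticity this ratio would hover near 1; here it is clearly larger, which is the kind of pattern that formal tests (e.g. Breusch-Pagan) and remedies such as weighted least squares are designed to handle.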