What are some of the common errors or problems we need to check for before buying the results of a regression analysis and how would you avoid those types of errors? Be sure to mention how you would check your data to see if it meets the assumptions of using regression and then write the rest of your answer. Write this essentially as a note of warning to yourself for what to watch out for when performing regression analysis and what to watch out for when reading published regression results.
Regression is usually performed under the assumption that the errors are normally distributed, uncorrelated with one another, and of constant variance. Separately, the covariates should not be strongly correlated with each other, because multicollinearity inflates the standard errors of the coefficients. We don't really need a formal statistical test at this stage; an intuitive look at how the data were collected and which covariates are likely to move together should suffice. Next, after the fit has been performed, we should look at diagnostic plots of the residuals to check whether the fit is adequate. Such diagnostics mainly involve checking for autocorrelation of the residuals (lag plots can be used, and such plots may be provided with the results), checking for homoscedasticity (residuals plotted against fitted values should show no funnel shape) and checking for Gaussianity (for example with a normal Q-Q plot). Once this is done, we should make sure that the model is not overfitting. This can usually be checked by comparing candidate models using AIC or BIC, or by looking at cross-validation scores.
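As a rough illustration of the residual checks, the sketch below fits a one-covariate least-squares line in plain Python and computes two numeric summaries: the Durbin-Watson statistic (a standard numeric companion to a lag plot, not something the answer above prescribes) and a crude ratio of residual variance between the two halves of the data. The data and all function names (fit_simple_ols, durbin_watson, variance_ratio) are made up for this example; in practice a statistics package would provide these diagnostics.

```python
# Minimal sketch with hypothetical helper names and made-up data.

def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def durbin_watson(resid):
    """Durbin-Watson statistic, always between 0 and 4.

    Values near 2 suggest no lag-1 autocorrelation, values near 0
    suggest positive and values near 4 negative autocorrelation.
    """
    num = sum((resid[i] - resid[i - 1]) ** 2 for i in range(1, len(resid)))
    den = sum(e * e for e in resid)
    return num / den


def variance_ratio(resid):
    """Crude homoscedasticity check (data assumed sorted by x).

    Ratio of the mean squared residual in the second half to that in
    the first half; values far from 1 hint at non-constant variance.
    """
    half = len(resid) // 2
    first = sum(e * e for e in resid[:half]) / half
    second = sum(e * e for e in resid[half:]) / (len(resid) - half)
    return second / first


# Made-up data: a straight line plus noise that flips sign every step,
# i.e. strongly negatively autocorrelated errors.
x = list(range(10))
y = [2 + 3 * xi + (0.5 if xi % 2 == 0 else -0.5) for xi in x]

intercept, slope = fit_simple_ols(x, y)
resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

dw = durbin_watson(resid)
ratio = variance_ratio(resid)
print("slope:", slope, "Durbin-Watson:", dw, "variance ratio:", ratio)
```

Here the Durbin-Watson value comes out well above 2, which flags the alternating errors even though the fitted slope looks reasonable. This is exactly the situation where the coefficient estimate can look fine while its reported standard error cannot be trusted.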
Lastly, once we are convinced that the model fits well and is not overfitting, we should check that any regression coefficient the paper bases a claim on is statistically significant. Significance is generally tested with a t-test on the coefficient, and the result appears in the summary output of most statistical software packages. The paper should be expected to report these values as well.
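To make the coefficient t-test concrete, here is a minimal sketch for the slope of a one-covariate regression, again in plain Python with made-up data and a hypothetical function name (slope_t_statistic). Instead of computing a p-value, it compares the statistic with 2.306, the two-sided 5% critical value of the t distribution with 8 degrees of freedom, which matches this example's n - 2 = 8; other sample sizes need a different critical value.

```python
import math

# Minimal sketch with a hypothetical helper name and made-up data.

def slope_t_statistic(x, y):
    """Return (slope, standard_error, t) for the line y = a + b*x.

    The t statistic tests the null hypothesis that the slope is zero
    and has n - 2 degrees of freedom.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares and the usual unbiased error variance.
    ssr = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    sigma2 = ssr / (n - 2)
    std_err = math.sqrt(sigma2 / sxx)
    return slope, std_err, slope / std_err


# Made-up data: roughly y = 2x with small noise, n = 10.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9]

slope, std_err, t_stat = slope_t_statistic(x, y)

# Two-sided 5% critical value of the t distribution with 8 degrees
# of freedom (valid for n = 10 only).
T_CRITICAL_DF8 = 2.306
significant = abs(t_stat) > T_CRITICAL_DF8
print("slope:", slope, "std err:", std_err, "t:", t_stat, "significant:", significant)
```

The standard error used here is only valid if the error assumptions hold, so this test is meaningful only after the residual diagnostics have been checked.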
These are the general tests one considers before relying on the results. Of course, in a specific case we can also use our judgement to see whether the author has tried to manipulate the data to her or his advantage. This includes looking at the type of sampling the author has used, whether different groups have been represented properly, whether the categorical variables have been coded properly, and so on.