Throughout the course, you have studied and used tools and techniques that have underlying statistical theory and assumptions. Regression is no different. Haphazard application of regression analysis, as with any type of statistical technique, can lead to results that are inaccurate and that, even worse, can get you or your employer into trouble (whether that trouble involves product faults, legal issues, or simply wasted time and money). Thus, you must always be cognizant of the conditions of the problem as they relate to the assumptions and theory associated with your application of regression techniques.
Regression analysis is a statistical procedure, and it requires that certain assumptions be satisfied if you are to correctly interpret the results.
1. Which assumptions, if violated, can cause the greatest bias in the results of the regression analysis? Why?
1. Linear and Additive: If we fit a linear model to data whose underlying relationship is non-linear or non-additive, the regression cannot capture the trend mathematically. The result is a poorly fitting model that produces erroneous predictions on unseen data.
2. Autocorrelation: If the error terms are correlated with one another, the estimated standard errors tend to underestimate the true standard errors, so the model appears more precise than it really is. This is most commonly seen in time series models.
3. Multicollinearity: When the predictors are correlated with one another, it becomes difficult to determine the true relationship of each predictor with the response variable, that is, which variable is actually contributing to the prediction of the response.
Also, with correlated predictors, the standard errors of the coefficients tend to increase, so the confidence intervals are wider and the estimates of the slope parameters are less precise.
4. Heteroskedasticity (non-constant variance in the error terms):
Non-constant variance often arises in the presence of outliers, which disproportionately influence the model's performance. As a result, confidence intervals for out-of-sample predictions tend to be unrealistically wide or narrow.
5. Normal Distribution of error terms: If the error terms are not normally distributed, confidence intervals may become too wide or too narrow. This instability makes the coefficient estimates obtained by least squares difficult to interpret reliably.
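Two of the violations above, autocorrelation and multicollinearity, can be checked with simple diagnostics. The sketch below is only an illustration on synthetic data, not part of the original answer. It computes the Durbin-Watson statistic on the residuals of a simple regression and the variance inflation factor (VIF) for a pair of predictors. All function names, the sample size, the AR(1) coefficient of 0.9, and the noise levels are assumptions chosen for the example.

```python
import random


def ols_simple(x, y):
    """Return (slope, intercept) of the least-squares line for one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residuals(x, y):
    """Residuals of the least-squares fit of y on x."""
    slope, intercept = ols_simple(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def durbin_watson(resid):
    """Durbin-Watson statistic: values near 2 suggest no first-order
    autocorrelation; values well below 2 suggest positive autocorrelation."""
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    den = sum(e ** 2 for e in resid)
    return num / den


def pearson_r(a, b):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / (var_a * var_b) ** 0.5


def vif_two_predictors(x1, x2):
    """VIF when there are exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)


# Synthetic data (assumed values, for illustration only).
random.seed(0)
n = 500
time_index = list(range(n))

# Independent errors versus AR(1) errors with coefficient 0.9.
iid_errors = [random.gauss(0, 1) for _ in range(n)]
ar_errors = [0.0] * n
for t in range(1, n):
    ar_errors[t] = 0.9 * ar_errors[t - 1] + random.gauss(0, 1)

y_iid = [2 + 0.5 * t + e for t, e in zip(time_index, iid_errors)]
y_ar = [2 + 0.5 * t + e for t, e in zip(time_index, ar_errors)]

dw_iid = durbin_watson(residuals(time_index, y_iid))
dw_ar = durbin_watson(residuals(time_index, y_ar))

# An unrelated second predictor versus a near copy of the first.
x1 = [random.gauss(0, 1) for _ in range(n)]
x2_independent = [random.gauss(0, 1) for _ in range(n)]
x2_collinear = [a + random.gauss(0, 0.1) for a in x1]

vif_low = vif_two_predictors(x1, x2_independent)
vif_high = vif_two_predictors(x1, x2_collinear)

print("Durbin-Watson, independent errors:", round(dw_iid, 2))
print("Durbin-Watson, AR(1) errors:", round(dw_ar, 2))
print("VIF, unrelated predictors:", round(vif_low, 2))
print("VIF, near-duplicate predictors:", round(vif_high, 2))
```

A Durbin-Watson value far below 2 points to positively correlated errors, and a VIF well above 1 points to inflated standard errors for the slope estimates. A common rule of thumb flags VIF values above 5 or 10, but the cutoff is a convention, not a formal test. With more than two predictors, the VIF for each predictor is instead computed from the R-squared of regressing that predictor on all the others.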