What statistical information should one look for in order to determine that a given linear regression model is not a good fit? If you shouldn't use such a linear model then what would be a good estimate for a predicted output?
There are several criteria to check when judging whether a linear regression model is a good fit:
1. The residual plot should show no pattern, i.e. the residuals should be scattered randomly around zero when plotted against the fitted values or the predictors, with roughly constant spread (homoscedasticity). If there's a trend, then either the relation between the predictors and the dependent variable is not linear or you need more predictor variables in your linear model.
2. R-squared is low, meaning your predictors explain only a small share of the variation in the dependent variable and so have little predictive strength.
3. The p-value of the linear regression should be below .05 (or another small threshold, at the modeller's discretion). If it is not, the regression is not statistically significant, i.e. the data show no significant linear relation between the predictor variables and the dependent variable.
4. Multicollinearity between two or more predictor variables, indicated by a high VIF (variance inflation factor), means that certain variables should be dropped from the linear model.
5. You can also do out-of-sample validation, i.e. check the prediction error / accuracy on a dataset that was not used to build the model.
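A few of these checks can be sketched in plain Python. This is only an illustration, not a full diagnostic suite: the data are made up (a curved relation, y = x^2), and the helper names `fit_line`, `residuals`, `r_squared` and `rmse` are hypothetical, not taken from any library. It fits a one-predictor least-squares line, then looks at R-squared, the sign pattern of the residuals, and the error on held-out points.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares with one predictor: returns (slope, intercept)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean

def residuals(xs, ys, slope, intercept):
    """Observed value minus fitted value for each point."""
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]

def r_squared(ys, res):
    """Share of the variation in ys explained by the fit."""
    y_mean = sum(ys) / len(ys)
    ss_tot = sum((y - y_mean) ** 2 for y in ys)
    ss_res = sum(r ** 2 for r in res)
    return 1 - ss_res / ss_tot

def rmse(res):
    """Root mean squared error of a list of residuals."""
    return math.sqrt(sum(r ** 2 for r in res) / len(res))

# Made-up data with a curved relation, for illustration only.
x_train = list(range(10))
y_train = [x ** 2 for x in x_train]
# Held-out points that were not used to fit the line.
x_test = [10, 11, 12]
y_test = [x ** 2 for x in x_test]

slope, intercept = fit_line(x_train, y_train)
train_res = residuals(x_train, y_train, slope, intercept)
test_res = residuals(x_test, y_test, slope, intercept)

print("R-squared:", round(r_squared(y_train, train_res), 3))
print("residual signs:", "".join("+" if r > 0 else "-" for r in train_res))
print("train RMSE:", round(rmse(train_res), 2))
print("held-out RMSE:", round(rmse(test_res), 2))
```

On this toy data the residuals are positive at both ends and negative in the middle, which is exactly the kind of trend described in point 1, and the held-out error is much larger than the training error (point 5). R-squared still comes out fairly high here, so it is worth checking the residual plot and out-of-sample error rather than relying on R-squared alone.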
If a linear model is not the right fit, you can try any of the following to estimate the predicted output:
1. You can try fitting higher-degree polynomials, which may model the relation between your variables much better.
2. If your output is a binary (binomial) variable, you should use logistic regression instead of linear regression.
3. If you want to keep using a linear model, you can transform either the dependent or the independent variables to get a linear relation between your variables.
4. Outliers may be biasing the linear regression and hence making the model look worse on any of the checks above. Check each participating variable for outliers and treat them to remove that bias.
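As a rough sketch of the variable-transformation idea (option 3), here is a plain-Python example on made-up data that follow y = 2 * exp(0.5 * x). The helpers `fit_line`, `r_squared` and `predict` are hypothetical names used only for this illustration. A straight line is fitted once to the raw y values and once to log(y), and the two R-squared values are compared.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares with one predictor: returns (slope, intercept)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean

def r_squared(xs, ys, slope, intercept):
    """Share of the variation in ys explained by the fitted line."""
    y_mean = sum(ys) / len(ys)
    ss_tot = sum((y - y_mean) ** 2 for y in ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return 1 - ss_res / ss_tot

# Made-up exponential data, for illustration only.
xs = list(range(10))
ys = [2 * math.exp(0.5 * x) for x in xs]

# Straight line on the raw dependent variable.
raw_slope, raw_intercept = fit_line(xs, ys)
r2_raw = r_squared(xs, ys, raw_slope, raw_intercept)

# Straight line on the log-transformed dependent variable.
log_ys = [math.log(y) for y in ys]
log_slope, log_intercept = fit_line(xs, log_ys)
r2_log = r_squared(xs, log_ys, log_slope, log_intercept)

def predict(x):
    """Predict on the original scale by undoing the log transform."""
    return math.exp(log_intercept + log_slope * x)

print("R-squared, raw y:", round(r2_raw, 3))
print("R-squared, log y:", round(r2_log, 3))
print("prediction at x = 4:", round(predict(4), 3))
```

Because the toy data are exactly exponential, the line on log(y) fits essentially perfectly while the line on raw y does not. With real, noisy data the improvement will be less clean, and you should still rerun the residual and out-of-sample checks on the transformed model.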