1. When you think about it, putting a regression model together is really not that difficult. After all, if you are building one, I assume you have a feel for the process you are modeling and for the variables to include. But the real question is: "Is the model any good?" This is where measures of validity and reliability come into play. How do we measure validity in a regression model?
Measuring the validity of the regression model
Check the sign and value of the regression coefficients: Check that the sign of each regression coefficient makes practical sense. For example, sales and price have an inverse relationship, so we expect a negative coefficient on price. Similarly, check whether the magnitude of each coefficient makes intuitive sense.
P-value and significance of the variable: Check the p-value of each variable. If it is less than the chosen significance level (commonly 0.05), the variable is considered statistically significant.
Global hypothesis test: This is based on the ANOVA (F-test) conducted on the regression. It tests whether the model as a whole holds, that is, whether at least one independent variable is significant in predicting the dependent variable. Only if the p-value of the ANOVA is less than 0.05 can we conclude that the model is significant.
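As a rough illustration of the ANOVA behind the global test, the sketch below computes the overall F statistic from the regression and error sums of squares. The function name and the small data set are made up for this example, and the predicted values are assumed to come from an ordinary least squares fit with an intercept (here y = 2.2 + 0.6x). The p-value would then be read from an F distribution with k and n - k - 1 degrees of freedom, which this sketch does not do.

```python
def overall_f_statistic(actual, predicted, n_predictors):
    """F = mean square regression / mean square error (ANOVA form).

    Assumes `predicted` are fitted values from a least squares model
    with an intercept and `n_predictors` independent variables.
    """
    n = len(actual)
    mean_y = sum(actual) / n
    ss_regression = sum((p - mean_y) ** 2 for p in predicted)
    ss_error = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ms_regression = ss_regression / n_predictors
    ms_error = ss_error / (n - n_predictors - 1)
    return ms_regression / ms_error


# Hypothetical data: x = 1..5, fitted line y = 2.2 + 0.6x.
actual = [2.0, 4.0, 5.0, 4.0, 5.0]
predicted = [2.8, 3.4, 4.0, 4.6, 5.2]

f_stat = overall_f_statistic(actual, predicted, n_predictors=1)
print(f_stat)
```

A larger F statistic means the model explains more variation per variable than is left unexplained per residual degree of freedom.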
Variance inflation factor: Calculate the VIF for each variable. If the value exceeds 5 or 10 (both thresholds are in common use), the variable is highly correlated with the other independent variables, so its coefficient is poorly estimated and unstable.
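To make the VIF concrete, here is a minimal sketch that regresses each independent variable on the others and returns 1 / (1 - R2) for each one. It assumes NumPy is available. The function name `vif` and the simulated data are made up for illustration, with x3 deliberately built as a near copy of x1.

```python
import numpy as np


def vif(X):
    """Return the VIF of each column of X (rows = observations)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    result = []
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus an intercept.
        design = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        ss_res = resid @ resid
        ss_tot = ((target - target.mean()) ** 2).sum()
        r2 = 1 - ss_res / ss_tot
        result.append(float(1 / (1 - r2)))
    return result


# Simulated data (illustration only): x3 is almost a copy of x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + 0.1 * rng.normal(size=200)

vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs)
```

In this setup x1 and x3 should come out well above the usual thresholds, while x2, which is unrelated to the others, should stay close to 1.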
Coefficient of determination (R-squared)
It is the measure of the amount of variability in y explained by x. Its value lies between 0 and 1. The greater the value, the better the model fits the data.
Adjusted R2 is an improved version of R2 that increases only if an added variable improves the model more than would be expected by chance. It penalizes the model for every junk or non-significant variable that is added.
R2 is always at least as large as adjusted R2, because R2 never decreases when a variable is added, while adjusted R2 is penalized for the number of variables. If adjusted R2 is noticeably lower than R2, it indicates that some non-significant variables have been added to the model.
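The relationship between R2 and adjusted R2 can be sketched in a few lines. The function names and the data are made up for illustration, and the predicted values are assumed to come from a least squares fit with an intercept (here y = 2.2 + 0.6x), which is what keeps R2 between 0 and 1.

```python
def r_squared(actual, predicted):
    """Share of the variability in `actual` explained by the model."""
    mean_y = sum(actual) / len(actual)
    ss_error = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_total = sum((y - mean_y) ** 2 for y in actual)
    return 1 - ss_error / ss_total


def adjusted_r_squared(r2, n_obs, n_predictors):
    """Penalize R2 for the number of independent variables used."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical data: x = 1..5, fitted line y = 2.2 + 0.6x.
actual = [2.0, 4.0, 5.0, 4.0, 5.0]
predicted = [2.8, 3.4, 4.0, 4.6, 5.2]

r2 = r_squared(actual, predicted)
adj_r2 = adjusted_r_squared(r2, n_obs=len(actual), n_predictors=1)
print(r2, adj_r2)
```

With only five observations the penalty is large. For the same R2, the gap between the two measures shrinks as the number of observations grows and widens as more variables are added.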
Root MSE, or root mean square error, tells us the standard deviation of the residuals (actual minus predicted values). The residuals tell us how far the actual points are from the regression line, and RMSE summarizes their spread. A high RMSE indicates that the actual points lie far from the regression line, while a low RMSE indicates that they lie close to it. RMSE is expressed in the units of y, so whether a value is high or low depends on the scale of y.
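A minimal sketch of the calculation follows. The function name and data are made up for illustration, and it assumes the convention used in most regression output, where the sum of squared residuals is divided by the residual degrees of freedom (n - k - 1) rather than by n.

```python
import math


def root_mse(actual, predicted, n_predictors):
    """Standard deviation of the residuals, using n - k - 1 degrees of freedom."""
    residuals = [y - p for y, p in zip(actual, predicted)]
    ss_error = sum(r ** 2 for r in residuals)
    dof = len(actual) - n_predictors - 1
    return math.sqrt(ss_error / dof)


# Hypothetical data: x = 1..5, fitted line y = 2.2 + 0.6x.
actual = [2.0, 4.0, 5.0, 4.0, 5.0]
predicted = [2.8, 3.4, 4.0, 4.6, 5.2]

rmse = root_mse(actual, predicted, n_predictors=1)
print(rmse)
```

Some tools divide by n instead, which gives a slightly smaller number, so it is worth checking which convention a given report uses.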