4. Explain the difference between R-square and R-square (adj). Which one should you use? Why? Provide an example.
The difference between R2 and Adjusted R2, with a detailed explanation, is given below.
R2:
R2 is the square of the correlation coefficient (for simple linear regression with one X). It is the proportion of the variation in Y that is explained by the variation in X. R2 varies between zero (no linear relationship) and one (perfect linear relationship). For example, if the correlation between X and Y is r = 0.8, then R2 = 0.64, meaning 64% of the variation in Y is explained by X.
R2, officially known as the coefficient of determination, is defined as the sum of squares due to the regression divided by the total sum of squares of Y about its mean.
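In symbols (the sum-of-squares notation here is mine, not from the original answer), this definition can be written as

$$
R^2 = \frac{SS_{\mathrm{regression}}}{SS_{\mathrm{total}}} = 1 - \frac{SS_{\mathrm{error}}}{SS_{\mathrm{total}}}.
$$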
R2 is probably the most popular measure of how well a regression model fits the data. R2 may be reported either as a ratio or as a percentage. Since we use the ratio form, its values range from zero to one. A value of R2 near zero indicates no linear relationship, while a value near one indicates a perfect linear fit. Although popular, R2 should not be used indiscriminately or interpreted without the support of a scatter plot.
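As a small illustration, here is a minimal Python sketch (the data, seed, and variable names are made up for illustration, not taken from the original answer) that computes R2 from a simple least-squares fit and checks that, for simple linear regression, it equals the squared correlation between X and Y:

```python
# Minimal sketch: R^2 for a simple linear regression equals the squared
# Pearson correlation between X and Y (made-up data for illustration).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(size=100)      # noisy linear relationship

# Fit y = b0 + b1*x by least squares (polyfit returns slope first for deg=1)
b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x

ss_err = np.sum((y - y_hat) ** 2)       # error (residual) sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)    # total sum of squares about the mean
r2 = 1.0 - ss_err / ss_tot

r = np.corrcoef(x, y)[0, 1]             # Pearson correlation coefficient
print(f"R^2 from the fit:        {r2:.4f}")
print(f"Squared correlation r^2: {r ** 2:.4f}")
```

Plotting x against y alongside this calculation is exactly the scatter-plot check mentioned above.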
Adjusted R2:
R2 is strongly influenced by N, the sample size; in fact, when N = 2 a simple linear regression passes through both points exactly, so R2 = 1 regardless of the data. Because R2 is so closely tied to the sample size, an adjusted value, called Adjusted R2, was developed to minimize the impact of sample size. For simple linear regression it is computed as

Adjusted R2 = 1 − (1 − R2) × (N − 1) / (N − p),

where p is 2 if the intercept is included in the model and 1 if not.
There is one main difference between R2 and Adjusted R2: R2 implicitly assumes that every single variable in the model helps explain the variation in the dependent variable, whereas Adjusted R2 tells you the percentage of variation explained by only those independent variables that actually affect the dependent variable.
Why we use Adjusted R2:
We already know how R2 can help us in model evaluation. However, there is one major disadvantage of using R2: its value never decreases when variables are added. If you are wondering why that is a problem, there is a catch: adding new independent variables will always increase (or at best leave unchanged) the value of R2, irrespective of whether those variables are really significant or not. This is a major flaw, because R2 appears to reward adding variables even when they contribute nothing.
For a model with several predictors, Adjusted R2 is defined as

Adjusted R2 = 1 − (1 − R2) × (n − 1) / (n − k − 1),

where n is the sample size and k is the number of independent variables.
Adjusted R2 is better because:
The value of Adjusted R2 can decrease as k increases: the (n − 1)/(n − k − 1) term acts as a penalty for a bad variable and only a genuinely significant variable raises the adjusted value. Adjusted R2 is thus a better model evaluator and reflects the real contribution of the variables more honestly than R2.
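As a quick worked example of this penalty (the numbers are made up for illustration): suppose a model with k = 5 predictors fitted to n = 50 observations gives R2 = 0.80. Then

$$
\bar{R}^2 = 1 - (1 - R^2)\,\frac{n-1}{n-k-1} = 1 - 0.20 \times \frac{49}{44} \approx 0.777,
$$

so the adjusted value sits below the raw R2, and it would fall further if additional weak predictors were added while R2 barely moved.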
Another way to see how Adjusted R2 is related to R2:
R2 increases with every predictor added to a model. Because R2 always increases and never decreases, a model can appear to fit better simply because more terms have been added to it, which can be completely misleading.
Similarly, if your model has too many terms, such as many high-order polynomial terms, you can run into the problem of over-fitting the data. When you over-fit the data, a misleadingly high R2 value can lead to misleading projections.
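To make this concrete, here is a minimal Python sketch (the data, seed, and helper function are made up for illustration) that fits an ordinary least-squares model twice: once with the single predictor that actually drives y, and once padded with irrelevant noise columns. R2 can never decrease when columns are added, while Adjusted R2 rises only if the new columns improve the fit by more than chance would:

```python
import numpy as np

def r2_and_adj_r2(X, y):
    """Fit OLS with an intercept and return (R^2, Adjusted R^2)."""
    n, k = X.shape                              # k = number of predictors
    X1 = np.column_stack([np.ones(n), X])       # add intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    ss_err = resid @ resid                      # error sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)        # total sum of squares
    r2 = 1.0 - ss_err / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
    return r2, adj_r2

rng = np.random.default_rng(42)
n = 60
x1 = rng.normal(size=n)
y = 3.0 * x1 + rng.normal(size=n)               # y truly depends only on x1
junk = rng.normal(size=(n, 3))                  # three irrelevant noise predictors

print("x1 only:   R^2 = %.4f, Adj R^2 = %.4f" % r2_and_adj_r2(x1[:, None], y))
print("x1 + junk: R^2 = %.4f, Adj R^2 = %.4f"
      % r2_and_adj_r2(np.column_stack([x1, junk]), y))
```

The first number can only stay the same or grow between the two fits; whether the second grows depends on whether the extra columns genuinely reduce the unexplained variation, which is exactly the penalty described above.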
Hope this clarifies R2 and Adjusted R2 and which one should be used. In case of any queries, please feel free to ask in the comment box.