1. What is multicollinearity?
2. What sample correlation coefficient values between two x's "warn" of a potential problem due to multicollinearity, and what is that problem?
3. Can an independent variable in multiple linear regression be a categorical variable?
4. If not, why not, but if yes, how should the categorical variable be worked into the regression?
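1) Multicollinearity occurs in multiple linear regression when two or more independent variables are strongly correlated with each other. It is a problem because the regression can no longer cleanly separate the individual effect of each correlated variable on the response, so the coefficient estimates become unstable and their standard errors grow. The overall fit, as measured by the coefficient of determination, can still look good, which is misleading when the individual coefficients are unreliable.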
2) As a general rule of thumb, a sample correlation coefficient between two independent variables greater than 0.8 in absolute value warns of a potential multicollinearity problem. Some texts use a cutoff of 0.7 or 0.9 instead.
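The rule of thumb can be checked directly by computing the pairwise sample correlations. The following is only a minimal sketch: the helper names `pearson_r` and `flag_correlated_pairs` and the small data set are made up for illustration, and the 0.8 cutoff is the rule of thumb from this answer, not a strict test.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Sample Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def flag_correlated_pairs(predictors, threshold=0.8):
    """Return (name_a, name_b, r) for every pair of predictors with |r| > threshold."""
    flagged = []
    for (name_a, col_a), (name_b, col_b) in combinations(predictors.items(), 2):
        r = pearson_r(col_a, col_b)
        if abs(r) > threshold:
            flagged.append((name_a, name_b, r))
    return flagged


# Hypothetical data: height recorded in two units, plus an unrelated variable.
predictors = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [34, 21, 45, 28, 39],
}

flagged = flag_correlated_pairs(predictors)
for name_a, name_b, r in flagged:
    print(f"{name_a} and {name_b}: r = {r:.3f}")
```

Here only the two height columns are flagged, because they measure the same quantity in different units; keeping just one of them would be the usual remedy.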
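The following problems arise if two variables with a correlation above 0.8 (in absolute value) are both used in a multiple linear regression:
i) The coefficient estimates can swing wildly depending on which other independent variables are present in the model. The coefficients become very sensitive to small changes in the model or the data.
ii) Multicollinearity reduces the precision of the estimated coefficients: their standard errors are inflated, confidence intervals widen, and the tests on individual coefficients lose statistical power.
iii) The coefficient of determination can remain high even though few or none of the individual coefficients are statistically significant, which makes the model look more informative about the individual variables than it actually is.
iv) The variance inflation factor (VIF) of the affected variables becomes very high. This is a symptom that is used to diagnose the problem; values above about 5 or 10 are commonly treated as a warning.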
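To make the link between the correlation cutoff and the variance inflation factor concrete: the VIF of predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing that predictor on all the others. In the special case of a model with exactly two predictors, R_j^2 is simply the squared correlation r^2 between them. The sketch below covers only that two-predictor case, and the function name `vif_two_predictors` is made up for illustration.

```python
from math import sqrt


def vif_two_predictors(r):
    """VIF of either predictor in a model with exactly two predictors
    whose sample correlation is r. Uses VIF = 1 / (1 - r^2)."""
    if abs(r) >= 1:
        raise ValueError("perfectly correlated predictors: VIF is infinite")
    return 1.0 / (1.0 - r ** 2)


# The variance of a coefficient is multiplied by the VIF, so its
# standard error is multiplied by sqrt(VIF).
for r in (0.0, 0.7, 0.8, 0.9, 0.99):
    vif = vif_two_predictors(r)
    print(f"r = {r:4.2f}  VIF = {vif:6.2f}  std. error multiplier = {sqrt(vif):5.2f}")
```

With more than two predictors the pairwise correlation is no longer enough, because a variable can be nearly a combination of several others while no single pairwise correlation is large; in that case each R_j^2 has to come from an actual auxiliary regression.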
------------------------------------
3) Yes, a categorical variable can definitely be used as an independent variable in a multiple linear regression.
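4) We can use categorical variables by assigning coded values to them. For example, a categorical variable with two levels can be coded as '1' for one level and '0' for the other (a dummy or indicator variable) and then used like any other variable in the multiple linear regression. Its coefficient is the estimated difference in the mean response between the level coded '1' and the level coded '0', holding the other variables fixed.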
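If it has three levels, do not code them as a single variable with values such as '-1', '0' and '1' unless the levels are ordered and equally spaced, because that forces the effect of moving from the first level to the second to equal the effect of moving from the second to the third. Instead, use two dummy variables: pick one level as the reference (both dummies '0') and give each of the other two levels its own '0'/'1' variable. In general, a categorical variable with k levels needs k - 1 dummy variables; using all k together with an intercept would make the dummies perfectly collinear.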
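A common way to work a nominal variable with k levels into a regression is to create k - 1 indicator (dummy) columns, leaving one level out as the reference. The sketch below shows one way to build those columns by hand; the function name `dummy_code`, the `is_` column prefix, and the `region` data are all made up for illustration.

```python
def dummy_code(values, reference=None):
    """Turn a list of category labels into k - 1 indicator columns (0/1),
    omitting the reference level. Returns {column_name: list_of_0_or_1}."""
    levels = sorted(set(values))
    if reference is None:
        reference = levels[0]  # assumption: default to the first level alphabetically
    if reference not in levels:
        raise ValueError(f"reference level {reference!r} not found in data")
    columns = {}
    for level in levels:
        if level == reference:
            continue  # the reference level is represented by all dummies being 0
        columns[f"is_{level}"] = [1 if v == level else 0 for v in values]
    return columns


# Hypothetical three-level variable; "north" is the reference level.
region = ["north", "south", "west", "south", "north"]
dummies = dummy_code(region, reference="north")
for name, column in dummies.items():
    print(name, column)
```

Each resulting column enters the regression as an ordinary independent variable, and its coefficient is read as the difference in mean response between that level and the reference level.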