Data has been gathered to explain executive salaries in terms of a variety of factors.
Perform a Multiple Regression to predict SALARY from EXPerience, EDUCation, GENDER, NUMberSupported and ASSETS.
d. Test for violation of the Model Assumptions. How did you do this?
e. Test for multicollinearity. How did you do this?
SALARY | EXP | EDUC | GENDER | NUMSUP | ASSETS |
93300 | 12 | 15 | 1 | 240 | 170 |
130000 | 25 | 14 | 1 | 510 | 160 |
88200 | 20 | 14 | 0 | 370 | 170 |
74400 | 3 | 19 | 1 | 170 | 170 |
115300 | 19 | 12 | 1 | 520 | 150 |
70400 | 14 | 13 | 0 | 420 | 160 |
114200 | 18 | 18 | 1 | 290 | 170 |
72600 | 2 | 17 | 1 | 200 | 180 |
108600 | 14 | 13 | 1 | 560 | 180 |
68600 | 4 | 16 | 1 | 230 | 160 |
102000 | 8 | 18 | 1 | 540 | 150 |
101400 | 19 | 15 | 1 | 90 | 180 |
149400 | 23 | 16 | 1 | 560 | 180 |
57100 | 5 | 15 | 0 | 470 | 150 |
87400 | 3 | 16 | 1 | 340 | 190 |
131000 | 22 | 17 | 1 | 70 | 200 |
90300 | 24 | 14 | 0 | 160 | 180 |
115600 | 22 | 16 | 1 | 160 | 190 |
102800 | 13 | 18 | 1 | 110 | 180 |
141900 | 21 | 16 | 1 | 410 | 180 |
90900 | 10 | 13 | 1 | 370 | 190 |
73400 | 11 | 12 | 1 | 180 | 170 |
101000 | 12 | 19 | 1 | 60 | 200 |
85400 | 10 | 19 | 1 | 60 | 180 |
138300 | 26 | 17 | 1 | 110 | 200 |
82300 | 7 | 15 | 1 | 280 | 190 |
85500 | 7 | 19 | 1 | 110 | 180 |
75300 | 10 | 19 | 0 | 300 | 170 |
87500 | 23 | 14 | 0 | 220 | 170 |
127100 | 12 | 15 | 1 | 570 | 200 |
80100 | 6 | 16 | 1 | 240 | 180 |
90900 | 15 | 16 | 0 | 300 | 150 |
109600 | 15 | 18 | 1 | 260 | 170 |
70700 | 8 | 13 | 1 | 150 | 160 |
104400 | 18 | 19 | 0 | 350 | 160 |
71200 | 2 | 13 | 1 | 370 | 190 |
85400 | 13 | 14 | 1 | 150 | 160 |
89300 | 12 | 17 | 0 | 480 | 190 |
124800 | 21 | 15 | 1 | 310 | 180 |
42800 | 3 | 12 | 0 | 340 | 150 |
125000 | 20 | 16 | 1 | 520 | 160 |
122200 | 20 | 19 | 1 | 200 | 170 |
107100 | 20 | 17 | 0 | 490 | 160 |
61000 | 1 | 15 | 0 | 570 | 180 |
59800 | 2 | 17 | 1 | 70 | 160 |
95700 | 9 | 17 | 1 | 300 | 160 |
85600 | 11 | 17 | 0 | 190 | 160 |
88900 | 21 | 13 | 0 | 500 | 160 |
143000 | 20 | 20 | 1 | 390 | 170 |
109200 | 17 | 16 | 0 | 520 | 180 |
156700 | 24 | 12 | 1 | 530 | 200 |
65100 | 2 | 17 | 0 | 590 | 190 |
105900 | 9 | 13 | 1 | 560 | 170 |
74300 | 2 | 18 | 0 | 600 | 190 |
79300 | 13 | 12 | 0 | 390 | 170 |
106600 | 14 | 18 | 1 | 110 | 170 |
106400 | 18 | 13 | 1 | 190 | 190 |
77400 | 10 | 14 | 1 | 110 | 160 |
129400 | 21 | 13 | 1 | 430 | 190 |
82600 | 11 | 14 | 0 | 440 | 150 |
126100 | 26 | 15 | 1 | 210 | 190 |
121900 | 22 | 18 | 1 | 320 | 160 |
96200 | 3 | 16 | 1 | 560 | 180 |
128900 | 17 | 18 | 1 | 450 | 190 |
72200 | 2 | 16 | 1 | 410 | 180 |
58800 | 4 | 18 | 0 | 70 | 150 |
79300 | 8 | 17 | 1 | 90 | 190 |
96100 | 13 | 15 | 1 | 290 | 160 |
94900 | 3 | 18 | 1 | 530 | 180 |
89000 | 13 | 16 | 0 | 420 | 170 |
108800 | 25 | 19 | 0 | 150 | 200 |
95300 | 11 | 15 | 1 | 500 | 190 |
71200 | 2 | 17 | 0 | 430 | 190 |
173400 | 26 | 17 | 1 | 570 | 190 |
107000 | 20 | 20 | 1 | 90 | 150 |
100000 | 19 | 12 | 1 | 340 | 160 |
100700 | 12 | 13 | 1 | 440 | 170 |
152800 | 22 | 18 | 1 | 500 | 160 |
95300 | 13 | 13 | 0 | 570 | 180 |
77300 | 2 | 15 | 1 | 560 | 190 |
84600 | 15 | 14 | 1 | 160 | 170 |
92600 | 12 | 13 | 1 | 390 | 190 |
85900 | 13 | 19 | 0 | 370 | 200 |
79400 | 5 | 17 | 1 | 330 | 160 |
80100 | 8 | 17 | 0 | 560 | 170 |
114100 | 21 | 20 | 0 | 590 | 180 |
78500 | 5 | 16 | 1 | 290 | 200 |
87300 | 9 | 18 | 0 | 440 | 180 |
102900 | 19 | 15 | 0 | 480 | 190 |
116300 | 23 | 19 | 1 | 130 | 150 |
51500 | 3 | 12 | 0 | 440 | 190 |
106500 | 13 | 19 | 1 | 310 | 150 |
109000 | 22 | 17 | 0 | 370 | 200 |
66600 | 9 | 12 | 0 | 180 | 160 |
111100 | 7 | 19 | 1 | 520 | 200 |
83100 | 10 | 18 | 0 | 90 | 180 |
159500 | 25 | 18 | 1 | 590 | 160 |
122500 | 10 | 19 | 1 | 480 | 200 |
67300 | 3 | 19 | 1 | 80 | 160 |
97900 | 16 | 17 | 0 | 380 | 160 |
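(For context: assuming the table above has been read into a data frame named data — e.g., from a CSV file whose name exec_salaries.csv below is purely illustrative — the regression itself can be fit and summarized as follows. This is the model whose assumptions are checked in parts d and e.)
R code:
data <- read.csv("exec_salaries.csv")  # columns: SALARY, EXP, EDUC, GENDER, NUMSUP, ASSETS
fit <- lm(SALARY ~ EXP + EDUC + GENDER + NUMSUP + ASSETS, data = data)
summary(fit)  # coefficient estimates, t-tests, R-squared, overall F-test
The shorthand lm(SALARY ~ ., data = data) used in the answers below is equivalent when the data frame holds only these six columns.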
d. We need to check the following assumptions:
1. Homoscedasticity of residuals (equal variance)
R code:
par(mfrow=c(2,2))
mod_1 <- lm(SALARY ~ ., data=data)
plot(mod_1)
Output: (four diagnostic plots: Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage)
In the Residuals vs Fitted plot (top left) and the Scale-Location plot (bottom left), the residuals show no clear pattern or funnel shape. Hence we can say the assumption of homoscedasticity of the residuals is satisfied.
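A formal test can supplement the visual check. For example, the Breusch-Pagan test from the lmtest package (also used below for the Durbin-Watson test) tests the null hypothesis of constant error variance; a sketch, with output not shown here:
R code:
library(lmtest)
bptest(mod_1)  # H0: homoscedasticity; a large p-value is consistent with equal variance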
2. No autocorrelation of residuals
This can be tested using the dwtest() function (Durbin-Watson test) from the lmtest package in R. Here the test is run on a simple regression of SALARY on each predictor in turn; see the note on the full-model test after the output below.
R code:
library(lmtest)
dwtest(data[,1] ~ data[,2])
dwtest(data[,1] ~ data[,3])
dwtest(data[,1] ~ data[,4])
dwtest(data[,1] ~ data[,5])
Output:
> dwtest(data[,1] ~ data[,2])
Durbin-Watson test
data: data[, 1] ~ data[, 2]
DW = 2.2839, p-value = 0.9244
alternative hypothesis: true autocorrelation is greater than 0
> dwtest(data[,1] ~ data[,3])
Durbin-Watson test
data: data[, 1] ~ data[, 3]
DW = 2.3343, p-value = 0.9555
alternative hypothesis: true autocorrelation is greater than 0
> dwtest(data[,1] ~ data[,4])
Durbin-Watson test
data: data[, 1] ~ data[, 4]
DW = 2.1778, p-value = 0.819
alternative hypothesis: true autocorrelation is greater than 0
> dwtest(data[,1] ~ data[,5])
Durbin-Watson test
data: data[, 1] ~ data[, 5]
DW = 2.3297, p-value = 0.9515
alternative hypothesis: true autocorrelation is greater than 0
Since the p-value is greater than 0.05 in each case, we fail to reject the null hypothesis; there is no evidence of autocorrelation in the residuals.
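Note that the Durbin-Watson test is usually applied to the residuals of the full multiple-regression model rather than to each predictor separately; dwtest() also accepts a fitted lm object, so the full-model check is simply (output not shown here):
R code:
dwtest(mod_1)  # tests the full model's residuals for first-order autocorrelation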
3. Normality of residuals
This can be checked visually using the Normal Q-Q plot produced by plotting the fitted model.
R code:
plot(mod_1, which = 2)  # Normal Q-Q plot of the residuals
Output: (Normal Q-Q plot)
The Normal Q-Q plot evaluates this assumption: if the points lay exactly on the reference line, the residuals would be perfectly normally distributed. In our case the points lie approximately on the line, so we can say the residuals are normally distributed.
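The visual check can be backed up with a formal test; the Shapiro-Wilk test on the model residuals is one common choice (a sketch, output not shown here):
R code:
shapiro.test(residuals(mod_1))  # H0: residuals are normal; a large p-value supports normality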
e. Test for multicollinearity
Multicollinearity can be checked with the vif() function from the car package in R.
VIF (variance inflation factor) is a metric computed for every X variable that goes into a linear model. A high VIF means the information in that variable is largely already explained by the other X variables in the model, i.e., that variable is redundant. So the lower the VIF the better; values below 2 indicate essentially no multicollinearity.
R Code:
library(car)
mod2 <- lm(SALARY ~ ., data=data)
vif(mod2)
Output:
EXP EDUC GENDER NUMSUP ASSETS
1.002071 1.037777 1.063135 1.101590 1.029408
Since the VIF value for each variable is less than 2, we can say that no multicollinearity is present among the predictors.
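As a quick complementary check, the pairwise correlations among the predictors can be inspected directly; small off-diagonal values are consistent with the low VIFs (a sketch, output not shown here):
R code:
round(cor(data[, c("EXP", "EDUC", "GENDER", "NUMSUP", "ASSETS")]), 2)  # predictor correlation matrix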