The following data set was obtained from a randomly selected sample of 93 employees working at a bank.
| SALARY | EDUC | EXPER | TIME | 
| 39000 | 12 | 0 | 1 | 
| 40200 | 10 | 44 | 7 | 
| 42900 | 12 | 5 | 30 | 
| 43800 | 8 | 6 | 7 | 
| 43800 | 8 | 8 | 6 | 
| 43800 | 12 | 0 | 7 | 
| 43800 | 12 | 0 | 10 | 
| 43800 | 12 | 5 | 6 | 
| 44400 | 15 | 75 | 2 | 
| 45000 | 8 | 52 | 3 | 
| 45000 | 12 | 8 | 19 | 
| 46200 | 12 | 52 | 3 | 
| 48000 | 8 | 70 | 20 | 
| 48000 | 12 | 6 | 23 | 
| 48000 | 12 | 11 | 12 | 
| 48000 | 12 | 11 | 17 | 
| 48000 | 12 | 63 | 22 | 
| 48000 | 12 | 144 | 24 | 
| 48000 | 12 | 163 | 12 | 
| 48000 | 12 | 228 | 26 | 
| 48000 | 12 | 381 | 1 | 
| 48000 | 16 | 214 | 15 | 
| 49800 | 8 | 318 | 25 | 
| 51000 | 8 | 96 | 33 | 
| 51000 | 12 | 36 | 15 | 
| 51000 | 12 | 59 | 14 | 
| 51000 | 15 | 115 | 1 | 
| 51000 | 15 | 165 | 4 | 
| 51000 | 16 | 123 | 12 | 
| 51600 | 12 | 18 | 12 | 
| 52200 | 8 | 102 | 29 | 
| 52200 | 12 | 127 | 29 | 
| 52800 | 8 | 90 | 11 | 
| 52800 | 8 | 190 | 1 | 
| 52800 | 12 | 107 | 11 | 
| 54000 | 8 | 173 | 34 | 
| 54000 | 8 | 228 | 33 | 
| 54000 | 12 | 26 | 11 | 
| 54000 | 12 | 36 | 33 | 
| 54000 | 12 | 38 | 22 | 
| 54000 | 12 | 82 | 29 | 
| 54000 | 12 | 169 | 27 | 
| 54000 | 12 | 244 | 1 | 
| 54000 | 15 | 24 | 13 | 
| 54000 | 15 | 49 | 27 | 
| 54000 | 15 | 51 | 21 | 
| 54000 | 15 | 122 | 33 | 
| 55200 | 12 | 97 | 17 | 
| 55200 | 12 | 196 | 32 | 
| 55800 | 12 | 133 | 30 | 
| 56400 | 12 | 55 | 9 | 
| 57000 | 12 | 90 | 23 | 
| 57000 | 12 | 117 | 25 | 
| 57000 | 15 | 51 | 17 | 
| 57000 | 15 | 61 | 11 | 
| 57000 | 15 | 241 | 34 | 
| 60000 | 12 | 121 | 30 | 
| 60000 | 15 | 79 | 13 | 
| 61200 | 12 | 209 | 21 | 
| 63000 | 12 | 87 | 33 | 
| 63000 | 15 | 231 | 15 | 
| 46200 | 12 | 12 | 22 | 
| 50400 | 15 | 14 | 3 | 
| 51000 | 12 | 180 | 15 | 
| 51000 | 12 | 315 | 2 | 
| 52200 | 12 | 29 | 14 | 
| 54000 | 12 | 7 | 21 | 
| 54000 | 12 | 38 | 11 | 
| 54000 | 12 | 113 | 3 | 
| 54000 | 15 | 18 | 8 | 
| 54000 | 15 | 359 | 11 | 
| 57000 | 15 | 36 | 5 | 
| 60000 | 8 | 320 | 21 | 
| 60000 | 12 | 24 | 2 | 
| 60000 | 12 | 32 | 17 | 
| 60000 | 12 | 49 | 8 | 
| 60000 | 12 | 56 | 33 | 
| 60000 | 12 | 252 | 11 | 
| 60000 | 12 | 272 | 19 | 
| 60000 | 15 | 25 | 13 | 
| 60000 | 15 | 36 | 32 | 
| 60000 | 15 | 56 | 12 | 
| 60000 | 15 | 64 | 33 | 
| 60000 | 15 | 108 | 16 | 
| 60000 | 16 | 46 | 3 | 
| 63000 | 15 | 72 | 17 | 
| 66000 | 15 | 64 | 16 | 
| 66000 | 15 | 84 | 33 | 
| 66000 | 15 | 216 | 16 | 
| 68400 | 15 | 42 | 7 | 
| 69000 | 12 | 175 | 10 | 
| 69000 | 15 | 132 | 24 | 
| 81000 | 16 | 55 | 33 | 
SALARY - starting annual salary at the time of hire
EDUC - number of years of schooling at the time of hire
EXPER - number of months of previous work experience at the time of hire
TIME - number of months that the employee has been working at the bank until now
2. Use the least squares method to fit a simple linear model that relates the salary (dependent variable) to education (independent variable).
a) What is your model? State the hypothesis that is to be tested, the decision rule, the test statistic, and your decision, using a level of significance of 5%.
b) What percentage of the variation in salary has been explained by the regression?
c) Provide a 95% confidence interval estimate for the true slope value.
d) Based on your model, what is the expected salary of a new hire with 12 years of education?
e) What is the 95% prediction interval for the salary of a new hire with 12 years of education? Use the fact that the distance value = 0.011286.
Please explain clearly.
Sol:
Perform the analysis in RStudio:
Use the lm function in R to fit a linear model of SALARY on EDUC.
Use the summary function to get the coefficients and p-values.
Use the predict function to get the confidence and prediction intervals for newdata with EDUC = 12.
R code:
df1 =read.table(header = TRUE, text ="
SALARY   EDUC   EXPER   TIME
39000   12   0   1
40200   10   44   7
42900   12   5   30
43800   8   6   7
43800   8   8   6
43800   12   0   7
43800   12   0   10
43800   12   5   6
44400   15   75   2
45000   8   52   3
45000   12   8   19
46200   12   52   3
48000   8   70   20
48000   12   6   23
48000   12   11   12
48000   12   11   17
48000   12   63   22
48000   12   144   24
48000   12   163   12
48000   12   228   26
48000   12   381   1
48000   16   214   15
49800   8   318   25
51000   8   96   33
51000   12   36   15
51000   12   59   14
51000   15   115   1
51000   15   165   4
51000   16   123   12
51600   12   18   12
52200   8   102   29
52200   12   127   29
52800   8   90   11
52800   8   190   1
52800   12   107   11
54000   8   173   34
54000   8   228   33
54000   12   26   11
54000   12   36   33
54000   12   38   22
54000   12   82   29
54000   12   169   27
54000   12   244   1
54000   15   24   13
54000   15   49   27
54000   15   51   21
54000   15   122   33
55200   12   97   17
55200   12   196   32
55800   12   133   30
56400   12   55   9
57000   12   90   23
57000   12   117   25
57000   15   51   17
57000   15   61   11
57000   15   241   34
60000   12   121   30
60000   15   79   13
61200   12   209   21
63000   12   87   33
63000   15   231   15
46200   12   12   22
50400   15   14   3
51000   12   180   15
51000   12   315   2
52200   12   29   14
54000   12   7   21
54000   12   38   11
54000   12   113   3
54000   15   18   8
54000   15   359   11
57000   15   36   5
60000   8   320   21
60000   12   24   2
60000   12   32   17
60000   12   49   8
60000   12   56   33
60000   12   252   11
60000   12   272   19
60000   15   25   13
60000   15   36   32
60000   15   56   12
60000   15   64   33
60000   15   108   16
60000   16   46   3
63000   15   72   17
66000   15   64   16
66000   15   84   33
66000   15   216   16
68400   15   42   7
69000   12   175   10
69000   15   132   24
81000   16   55   33
"
)
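# Print the data to check that it was read correctly
df1
# Fit the simple linear regression of SALARY on EDUC by least squares
linmod=lm(SALARY ~ EDUC ,data=df1)
# Estimated intercept and slope
coefficients(linmod)
# Standard errors, t tests, R-squared and the overall F test
summary(linmod)
# New observation: a hire with 12 years of education
newdata=data.frame(EDUC=12)
attach(df1)
# 95% confidence interval for the mean salary at EDUC = 12
predict(linmod,newdata,level=0.95,interval="confidence")
# 95% prediction interval for an individual salary at EDUC = 12
predict(linmod,newdata,level=0.95,interval="predict")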
Output:
> coefficients(linmod)
(Intercept)        EDUC
  38185.598    1280.859
> summary(linmod)
Call:
lm(formula = SALARY ~ EDUC, data = df1)
Residuals:
     Min       1Q   Median       3Q      Max
-14555.9  -4632.5    444.1   3767.5  22320.7
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    38186       3774  10.117  < 2e-16 ***
EDUC            1281        297   4.313 4.08e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6501 on 91 degrees of freedom
Multiple R-squared: 0.1697,   Adjusted R-squared: 0.1606
F-statistic: 18.6 on 1 and 91 DF, p-value: 4.077e-05
> newdata=data.frame(EDUC=12)
> attach(df1)
> predict(linmod,newdata,level=0.95,interval="confidence")
       fit      lwr      upr
1 53555.91 52184.04 54927.78
> predict(linmod,newdata,level=0.95,interval="predict")
       fit      lwr      upr
1 53555.91 40569.57 66542.25
Answer (2a):
The fitted linear regression model is
SALARY = 38185.598 + 1280.859 * EDUC
slope = 1280.859
y-intercept = 38185.598
H0:
there is no linear relationship between SALARY and EDUC (slope = 0)
Ha:
there is a linear relationship between SALARY and EDUC (slope not equal to 0)
alpha = 0.05
Decision rule: reject H0 if the p-value is less than alpha = 0.05
Test statistic: F = 18.6, p-value = 4.077e-05
Since p < 0.05:
Reject H0
Conclude in favor of Ha
Conclusion:
There is sufficient statistical evidence at the 5% level of significance to conclude that there is a linear relationship between SALARY and EDUC.
The model is significant.
We can use this model to predict SALARY from EDUC.
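The same decision can also be reached with the t test for the slope, which makes the decision rule explicit in terms of a critical value. The snippet below is only a sketch: it re-uses the rounded values printed by summary(linmod) (estimate 1280.859, standard error 297, 91 degrees of freedom), and the variable names are chosen here for illustration and are not part of the original solution.

```r
# Values copied from the summary(linmod) output (the standard error is rounded)
b1    <- 1280.859   # estimated slope for EDUC
se_b1 <- 297        # standard error of the slope
df    <- 91         # residual degrees of freedom, n - 2 = 93 - 2
alpha <- 0.05       # level of significance

# Test statistic for H0: slope = 0
t_stat <- b1 / se_b1

# Two-sided critical value; decision rule: reject H0 if |t_stat| > t_crit
t_crit <- qt(1 - alpha / 2, df)

# Two-sided p-value from the t distribution
p_value <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)

reject_h0 <- abs(t_stat) > t_crit
print(c(t_stat = t_stat, t_crit = t_crit, p_value = p_value))
print(reject_h0)
```

In simple linear regression the F statistic is the square of this t statistic, so the t test and the F test lead to the same decision.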
Solution (2b):
R-squared = 0.1697
= 0.1697 * 100
= 16.97% of the variation in salary is explained by EDUC
Explained variation = 16.97%
Unexplained variation = 100 - 16.97 = 83.03%
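As a cross-check, in simple linear regression R-squared can be recovered from the F statistic and the residual degrees of freedom, because F = (R-squared / 1) / ((1 - R-squared) / (n - 2)). The snippet below is a sketch that uses the rounded values from the summary output; the variable names are illustrative.

```r
# Values copied from the summary(linmod) output
f_stat <- 18.6   # overall F statistic
df_res <- 91     # residual degrees of freedom, n - 2

# Solve F = R^2 / ((1 - R^2) / df_res) for R^2
r_sq <- f_stat / (f_stat + df_res)

# Percentage of variation in SALARY explained by EDUC
pct_explained   <- 100 * r_sq
pct_unexplained <- 100 - pct_explained
print(round(c(explained = pct_explained, unexplained = pct_unexplained), 2))
```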
c) Provide a 95% confidence interval estimate for the true slope value.
confint(linmod)
                 2.5 %    97.5 %
(Intercept) 30688.2625 45682.933
EDUC          690.9706  1870.748
The 95% confidence interval for the true slope is (690.9706, 1870.748). Each additional year of education is associated with an increase in mean starting salary of between about 691 and 1871.
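The interval reported by confint can be reproduced by hand as estimate plus or minus t critical value times standard error. The sketch below uses the rounded standard error (297) from the summary output, so its result differs slightly from the confint output in the decimals; the variable names are illustrative.

```r
# Values copied from the summary(linmod) output (the standard error is rounded)
b1    <- 1280.859   # estimated slope for EDUC
se_b1 <- 297        # standard error of the slope
df    <- 91         # residual degrees of freedom

# Critical value for a two-sided 95% interval
t_crit <- qt(0.975, df)

# 95% confidence interval for the true slope
margin   <- t_crit * se_b1
slope_ci <- c(lower = b1 - margin, upper = b1 + margin)
print(slope_ci)
```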
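d) Based on your model, what is the expected salary of a new hire with 12 years of education?
SALARY = 38185.598 + 1280.859 * EDUC
Substitute EDUC = 12 into the regression equation:
SALARY = 38185.598 + 1280.859 * 12
= 53555.91
The expected starting salary of a new hire with 12 years of education is about 53555.91.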
e) What is the 95% prediction interval for the salary of a new hire with 12 years of education? Use the fact that the distance value = 0.011286.
From the predict output, the 95% prediction interval is (40569.57, 66542.25).
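The question asks to use the distance value, so the interval can also be computed by hand as fitted value plus or minus t critical value times s times sqrt(1 + distance value), where s is the residual standard error. The sketch below assumes the rounded values from the output (s = 6501, 91 degrees of freedom), so it matches the predict output only up to rounding; the variable names are illustrative.

```r
# Values copied from the output and from the question
y_hat      <- 38185.598 + 1280.859 * 12   # fitted salary at EDUC = 12
s          <- 6501                        # residual standard error (rounded)
df         <- 91                          # residual degrees of freedom
dist_value <- 0.011286                    # distance value given in the question

t_crit <- qt(0.975, df)

# 95% prediction interval for an individual new hire
margin_pred <- t_crit * s * sqrt(1 + dist_value)
pred_int <- c(lower = y_hat - margin_pred, upper = y_hat + margin_pred)

# 95% confidence interval for the mean salary, shown for comparison
margin_conf <- t_crit * s * sqrt(dist_value)
conf_int <- c(lower = y_hat - margin_conf, upper = y_hat + margin_conf)

print(pred_int)
print(conf_int)
```

The prediction interval is much wider than the confidence interval because it accounts for the variability of an individual salary around the regression line, not only the uncertainty in the estimated mean.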