The following data set was obtained from a randomly selected sample of 93 employees working at a bank.
SALARY | EDUC | EXPER | TIME |
39000 | 12 | 0 | 1 |
40200 | 10 | 44 | 7 |
42900 | 12 | 5 | 30 |
43800 | 8 | 6 | 7 |
43800 | 8 | 8 | 6 |
43800 | 12 | 0 | 7 |
43800 | 12 | 0 | 10 |
43800 | 12 | 5 | 6 |
44400 | 15 | 75 | 2 |
45000 | 8 | 52 | 3 |
45000 | 12 | 8 | 19 |
46200 | 12 | 52 | 3 |
48000 | 8 | 70 | 20 |
48000 | 12 | 6 | 23 |
48000 | 12 | 11 | 12 |
48000 | 12 | 11 | 17 |
48000 | 12 | 63 | 22 |
48000 | 12 | 144 | 24 |
48000 | 12 | 163 | 12 |
48000 | 12 | 228 | 26 |
48000 | 12 | 381 | 1 |
48000 | 16 | 214 | 15 |
49800 | 8 | 318 | 25 |
51000 | 8 | 96 | 33 |
51000 | 12 | 36 | 15 |
51000 | 12 | 59 | 14 |
51000 | 15 | 115 | 1 |
51000 | 15 | 165 | 4 |
51000 | 16 | 123 | 12 |
51600 | 12 | 18 | 12 |
52200 | 8 | 102 | 29 |
52200 | 12 | 127 | 29 |
52800 | 8 | 90 | 11 |
52800 | 8 | 190 | 1 |
52800 | 12 | 107 | 11 |
54000 | 8 | 173 | 34 |
54000 | 8 | 228 | 33 |
54000 | 12 | 26 | 11 |
54000 | 12 | 36 | 33 |
54000 | 12 | 38 | 22 |
54000 | 12 | 82 | 29 |
54000 | 12 | 169 | 27 |
54000 | 12 | 244 | 1 |
54000 | 15 | 24 | 13 |
54000 | 15 | 49 | 27 |
54000 | 15 | 51 | 21 |
54000 | 15 | 122 | 33 |
55200 | 12 | 97 | 17 |
55200 | 12 | 196 | 32 |
55800 | 12 | 133 | 30 |
56400 | 12 | 55 | 9 |
57000 | 12 | 90 | 23 |
57000 | 12 | 117 | 25 |
57000 | 15 | 51 | 17 |
57000 | 15 | 61 | 11 |
57000 | 15 | 241 | 34 |
60000 | 12 | 121 | 30 |
60000 | 15 | 79 | 13 |
61200 | 12 | 209 | 21 |
63000 | 12 | 87 | 33 |
63000 | 15 | 231 | 15 |
46200 | 12 | 12 | 22 |
50400 | 15 | 14 | 3 |
51000 | 12 | 180 | 15 |
51000 | 12 | 315 | 2 |
52200 | 12 | 29 | 14 |
54000 | 12 | 7 | 21 |
54000 | 12 | 38 | 11 |
54000 | 12 | 113 | 3 |
54000 | 15 | 18 | 8 |
54000 | 15 | 359 | 11 |
57000 | 15 | 36 | 5 |
60000 | 8 | 320 | 21 |
60000 | 12 | 24 | 2 |
60000 | 12 | 32 | 17 |
60000 | 12 | 49 | 8 |
60000 | 12 | 56 | 33 |
60000 | 12 | 252 | 11 |
60000 | 12 | 272 | 19 |
60000 | 15 | 25 | 13 |
60000 | 15 | 36 | 32 |
60000 | 15 | 56 | 12 |
60000 | 15 | 64 | 33 |
60000 | 15 | 108 | 16 |
60000 | 16 | 46 | 3 |
63000 | 15 | 72 | 17 |
66000 | 15 | 64 | 16 |
66000 | 15 | 84 | 33 |
66000 | 15 | 216 | 16 |
68400 | 15 | 42 | 7 |
69000 | 12 | 175 | 10 |
69000 | 15 | 132 | 24 |
81000 | 16 | 55 | 33 |
This data set was obtained by collecting information on a randomly selected sample of 93 employees working at a bank.
SALARY - starting annual salary at the time of hire
EDUC - number of years of schooling at the time of hire
EXPER - number of months of previous work experience at the time of hire
TIME - number of months that the employee has been working at the bank until now
2. Use the least squares method to fit a simple linear model that relates the salary (dependent variable) to education (independent variable).
a) What is your model? State the hypothesis that is to be tested, the decision rule, the test statistic, and your decision, using a level of significance of 5%.
b) What percentage of the variation in salary has been explained by the regression?
c) Provide a 95% confidence interval estimate for the true slope value.
d) Based on your model, what is the expected salary of a new hire with 12 years of education?
e) What is the 95% prediction interval for the salary of a new hire with 12 years of education? Use the fact that the distance value = 0.011286.
Please explain clearly.
Solution:
The analysis is performed in RStudio.
Use the lm function in R to fit a linear model of SALARY on EDUC.
Use the summary function to get the coefficients and p-values.
Use the predict function to get the confidence and prediction intervals for newdata with EDUC = 12.
R code:
df1 <- read.table(header = TRUE, text = "
SALARY EDUC EXPER TIME
39000 12 0 1
40200 10 44 7
42900 12 5 30
43800 8 6 7
43800 8 8 6
43800 12 0 7
43800 12 0 10
43800 12 5 6
44400 15 75 2
45000 8 52 3
45000 12 8 19
46200 12 52 3
48000 8 70 20
48000 12 6 23
48000 12 11 12
48000 12 11 17
48000 12 63 22
48000 12 144 24
48000 12 163 12
48000 12 228 26
48000 12 381 1
48000 16 214 15
49800 8 318 25
51000 8 96 33
51000 12 36 15
51000 12 59 14
51000 15 115 1
51000 15 165 4
51000 16 123 12
51600 12 18 12
52200 8 102 29
52200 12 127 29
52800 8 90 11
52800 8 190 1
52800 12 107 11
54000 8 173 34
54000 8 228 33
54000 12 26 11
54000 12 36 33
54000 12 38 22
54000 12 82 29
54000 12 169 27
54000 12 244 1
54000 15 24 13
54000 15 49 27
54000 15 51 21
54000 15 122 33
55200 12 97 17
55200 12 196 32
55800 12 133 30
56400 12 55 9
57000 12 90 23
57000 12 117 25
57000 15 51 17
57000 15 61 11
57000 15 241 34
60000 12 121 30
60000 15 79 13
61200 12 209 21
63000 12 87 33
63000 15 231 15
46200 12 12 22
50400 15 14 3
51000 12 180 15
51000 12 315 2
52200 12 29 14
54000 12 7 21
54000 12 38 11
54000 12 113 3
54000 15 18 8
54000 15 359 11
57000 15 36 5
60000 8 320 21
60000 12 24 2
60000 12 32 17
60000 12 49 8
60000 12 56 33
60000 12 252 11
60000 12 272 19
60000 15 25 13
60000 15 36 32
60000 15 56 12
60000 15 64 33
60000 15 108 16
60000 16 46 3
63000 15 72 17
66000 15 64 16
66000 15 84 33
66000 15 216 16
68400 15 42 7
69000 12 175 10
69000 15 132 24
81000 16 55 33
"
)
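df1  # print the data to check that it was read correctly
# Fit the simple linear regression of SALARY on EDUC by least squares
linmod <- lm(SALARY ~ EDUC, data = df1)
coefficients(linmod)
summary(linmod)
# 95% confidence intervals for the intercept and the slope (used in part c)
confint(linmod, level = 0.95)
# Interval estimates for a new hire with 12 years of education
# (attach(df1) is not needed because the data are passed explicitly)
newdata <- data.frame(EDUC = 12)
predict(linmod, newdata, level = 0.95, interval = "confidence")
predict(linmod, newdata, level = 0.95, interval = "prediction")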
Output:
> coefficients(linmod)
(Intercept)        EDUC
  38185.598    1280.859
> summary(linmod)
Call:
lm(formula = SALARY ~ EDUC, data = df1)
Residuals:
     Min       1Q   Median       3Q      Max
-14555.9  -4632.5    444.1   3767.5  22320.7
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    38186       3774  10.117  < 2e-16 ***
EDUC            1281        297   4.313 4.08e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6501 on 91 degrees of freedom
Multiple R-squared: 0.1697, Adjusted R-squared: 0.1606
F-statistic: 18.6 on 1 and 91 DF, p-value: 4.077e-05
> newdata <- data.frame(EDUC = 12)
> predict(linmod, newdata, level = 0.95, interval = "confidence")
       fit      lwr      upr
1 53555.91 52184.04 54927.78
> predict(linmod, newdata, level = 0.95, interval = "prediction")
       fit      lwr      upr
1 53555.91 40569.57 66542.25
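ANSWER (2a):
The fitted linear regression model is
SALARY = 38185.598 + 1280.859 * EDUC
slope = 1280.859 (each additional year of schooling is associated with an estimated increase of 1280.859 in starting annual salary)
y-intercept = 38185.598
H0: slope = 0 (no linear relationship between SALARY and EDUC)
Ha: slope is not 0 (there is a linear relationship between SALARY and EDUC)
alpha = 0.05
Decision rule: reject H0 if the p-value is less than 0.05.
Test statistic: F = 18.6 on 1 and 91 degrees of freedom (equivalently, t = 4.313 for the slope), p-value = 4.077e-05
Decision: the p-value is less than 0.05, so reject H0 in favor of Ha.
Conclusion:
There is sufficient statistical evidence at the 5% level of significance to conclude that there is a linear relationship between SALARY and EDUC.
The model is significant, so it can be used to predict SALARY from EDUC.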
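The decision rule can also be stated with a critical value instead of a p-value. The following is a minimal sketch in R that copies the slope estimate, its standard error (rounded in the printed output), and the residual degrees of freedom from the summary output. The variable names b1, se_b1, and df_resid are illustrative and do not appear in the original solution.

```r
# Values copied from the summary(linmod) output (the standard error is rounded there)
b1       <- 1280.859   # estimated slope for EDUC
se_b1    <- 297        # standard error of the slope
df_resid <- 91         # residual degrees of freedom (n - 2 = 93 - 2)
alpha    <- 0.05       # level of significance

t_stat <- b1 / se_b1                      # test statistic for H0: slope = 0
t_crit <- qt(1 - alpha / 2, df_resid)     # two-sided critical value
p_val  <- 2 * pt(-abs(t_stat), df_resid)  # two-sided p-value

# Decision rule: reject H0 when |t| exceeds the critical value
reject_h0 <- abs(t_stat) > t_crit
print(c(t_stat = t_stat, t_crit = t_crit, p_value = p_val))
print(reject_h0)
```

In simple linear regression the F statistic is the square of this t statistic (4.313^2 is about 18.6), so the t test and the F test lead to the same decision.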
Solution (b):
R-squared = 0.1697
= 0.1697 * 100
= 16.97%, so 16.97% of the variation in salary is explained by education.
Explained variation = 16.97%
Unexplained variation = 100 - 16.97 = 83.03%
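As a cross-check, in a simple linear regression R-squared can be recovered from the F statistic and the residual degrees of freedom, because F = R^2 / ((1 - R^2) / df). The sketch below uses the rounded values printed in the summary output, so it should agree with 0.1697 only up to rounding; the variable names are illustrative.

```r
# Values copied from the summary(linmod) output
f_stat   <- 18.6   # F statistic (rounded in the output)
df_resid <- 91     # residual degrees of freedom

# Rearranging F = R^2 / ((1 - R^2) / df_resid) for R^2
r_sq <- f_stat / (f_stat + df_resid)

explained_pct   <- 100 * r_sq           # percent of variation explained
unexplained_pct <- 100 - explained_pct  # percent left unexplained
print(round(c(explained = explained_pct, unexplained = unexplained_pct), 2))
```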
c) Provide a 95% confidence interval estimate for the true slope value.
confint(linmod)
                 2.5 %    97.5 %
(Intercept) 30688.2625 45682.933
EDUC          690.9706  1870.748
The 95% confidence interval for the true slope is (690.9706, 1870.748). We are 95% confident that each additional year of education is associated with an increase of between about 691 and 1871 in mean starting salary.
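The same interval can be reproduced by hand as slope ± t * (standard error of the slope). This is a minimal sketch using the rounded standard error from the summary output, so the endpoints may differ slightly from the confint result; the variable names are illustrative.

```r
# Values copied from the summary(linmod) output
b1       <- 1280.859   # estimated slope for EDUC
se_b1    <- 297        # standard error of the slope (rounded in the output)
df_resid <- 91         # residual degrees of freedom

t_crit <- qt(0.975, df_resid)   # critical value for a 95% two-sided interval
margin <- t_crit * se_b1        # margin of error for the slope

ci_slope <- c(lower = b1 - margin, upper = b1 + margin)
print(ci_slope)
```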
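d) Based on your model, what is the expected salary of a new hire with 12 years of education?
SALARY = 38185.598 + 1280.859 * EDUC
Substitute EDUC = 12 into the regression equation:
SALARY = 38185.598 + 1280.859 * 12
= 53555.91
The expected starting salary of a new hire with 12 years of education is 53555.91.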
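e) What is the 95% prediction interval for the salary of a new hire with 12 years of education? Use the fact that the distance value = 0.011286.
The 95% prediction interval from the R output is (40569.57, 66542.25).
This agrees with the formula: predicted value ± t(0.025, 91) * s * sqrt(1 + distance value), with predicted value = 53555.91, s = 6501, and distance value = 0.011286.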
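Since the question asks for the distance value to be used, the interval can also be computed directly from it. The sketch below assumes the usual formulas, prediction interval = y_hat ± t * s * sqrt(1 + distance value) and confidence interval for the mean = y_hat ± t * s * sqrt(distance value). It uses the rounded residual standard error from the summary output, so small differences from the predict output are expected; the variable names are illustrative.

```r
# Values copied from the R output and from the question
b0       <- 38185.598   # estimated intercept
b1       <- 1280.859    # estimated slope for EDUC
s        <- 6501        # residual standard error (rounded in the output)
df_resid <- 91          # residual degrees of freedom
dist_val <- 0.011286    # distance value given in the question
x_new    <- 12          # years of education of the new hire

y_hat  <- b0 + b1 * x_new        # point prediction (part d)
t_crit <- qt(0.975, df_resid)    # critical value for 95% intervals

# Prediction interval for the salary of one new hire
pi_margin <- t_crit * s * sqrt(1 + dist_val)
pred_int  <- c(lower = y_hat - pi_margin, upper = y_hat + pi_margin)

# Confidence interval for the mean salary at EDUC = 12, for comparison
ci_margin <- t_crit * s * sqrt(dist_val)
conf_int  <- c(lower = y_hat - ci_margin, upper = y_hat + ci_margin)

print(pred_int)
print(conf_int)
```

The prediction interval is much wider than the confidence interval because it includes the variability of an individual salary around the regression line, not only the uncertainty in the estimated mean.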