This question requires RStudio. Use the following commands to install the ISLR package and load the data into R:
> install.packages("ISLR")
> library(ISLR)
> data(Wage)
With the package installed and the data loaded, here is a description of the data:
This dataset contains economic and demographic data for 3000 individuals living in the mid-Atlantic region. For each of the 3000 individuals, the following 11 variables are recorded:
year: Year that wage information was recorded
age: Age of worker
maritl: A factor with levels 1. Never Married 2. Married 3. Widowed 4. Divorced and 5. Separated indicating marital status
race: A factor with levels 1. White 2. Black 3. Asian and 4. Other indicating race
education: A factor with levels 1. < HS Grad 2. HS Grad 3. Some College 4. College Grad and 5. Advanced Degree indicating education level
region: Region of the country (mid-atlantic only)
jobclass: A factor with levels 1. Industrial and 2. Information indicating type of job
health: A factor with levels 1. <=Good and 2. >=Very Good indicating health level of worker
health_ins: A factor with levels 1. Yes and 2. No indicating whether worker has health insurance
logwage: Log of workers wage
wage: Workers raw wage
This question continues with the Wage dataset.
(a) Fit a multiple regression model to predict wage using year, age, and jobclass.
(b) What is the predicted wage for a 45 year old working in the industrial sector in the year 2009? What are the associated 95% confidence and prediction intervals?
(c) Create a binary variable, wage150, that contains a 1 if wage contains a value above 150, and a 0 if wage contains a value below 150.
Please provide all necessary code using RStudio. There is no need to provide screenshots.
(a)
Run the command below in RStudio to fit the linear regression model:
model = lm(wage ~ year + age + jobclass, data = Wage)
The summary of the model is:
> summary(model)
Call:
lm(formula = wage ~ year + age + jobclass, data = Wage)
Residuals:
     Min       1Q   Median       3Q      Max
-103.646  -24.525   -6.118   16.406  200.662
Coefficients:
                         Estimate Std. Error t value Pr(>|t|)
(Intercept)            -2.400e+03  7.252e+02  -3.309 0.000946 ***
year                    1.235e+00  3.616e-01   3.415 0.000646 ***
age                     6.362e-01  6.373e-02   9.982  < 2e-16 ***
jobclass2. Information  1.597e+01  1.471e+00  10.859  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 40.09 on 2996 degrees of freedom
Multiple R-squared: 0.07794, Adjusted R-squared: 0.07702
F-statistic: 84.41 on 3 and 2996 DF, p-value: < 2.2e-16
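When reading the coefficient table, it helps to know which jobclass level is the baseline absorbed into the intercept. The following is a minimal sketch, assuming the ISLR package is already installed; the names baseline, est and ci are illustrative and not part of the original answer.

```r
library(ISLR)
data(Wage)

# Refit the model from part (a) so this block runs on its own
model <- lm(wage ~ year + age + jobclass, data = Wage)

# The first factor level is the reference category; its effect is
# absorbed into the intercept, so only the other level gets a coefficient
baseline <- levels(Wage$jobclass)[1]

# Named vector of coefficient estimates
est <- coef(model)

# 95% confidence intervals for each coefficient
ci <- confint(model, level = 0.95)

print(baseline)
print(est)
print(ci)
```

The jobclass coefficient is then read as the estimated wage difference between the Information level and the baseline level, holding year and age fixed.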
(b)
The predicted wage for a 45 year old working in the industrial sector in the year 2009 is found with the commands below.
> newdata = data.frame(year = 2009, age = 45, jobclass = "1. Industrial")
> predict(model, newdata, interval = "confidence")
       fit     lwr      upr
1 109.5603 106.517 112.6036
> predict(model, newdata, interval = "prediction")
       fit      lwr     upr
1 109.5603 30.89567 188.225
The predicted wage for a 45 year old working in the industrial sector in the year 2009 is 109.5603.
The 95% confidence interval is (106.517, 112.6036).
The 95% prediction interval is (30.89567, 188.225).
The confidence interval covers the mean wage of all such workers, while the much wider prediction interval covers the wage of a single such worker.
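To see where the two intervals come from, the sketch below recomputes them by hand from the fitted model and compares them with predict(). It assumes the ISLR package is installed; the names x0, se_fit, conf_int and pred_int are illustrative choices, not part of the original answer.

```r
library(ISLR)
data(Wage)

# Refit the model from part (a) so this block runs on its own
model <- lm(wage ~ year + age + jobclass, data = Wage)

# Design row for the new worker: intercept, year, age, and a 0 for the
# "2. Information" dummy because "1. Industrial" is the baseline level
x0 <- c(1, 2009, 45, 0)

# Point prediction: x0' beta
fit <- sum(x0 * coef(model))

# Standard error of the fitted mean: sqrt(x0' Var(beta) x0)
se_fit <- sqrt(drop(t(x0) %*% vcov(model) %*% x0))

# Residual standard error and the t critical value for a 95% interval
s <- summary(model)$sigma
tcrit <- qt(0.975, df.residual(model))

# Confidence interval: uncertainty in the mean response only
conf_int <- fit + c(-1, 1) * tcrit * se_fit

# Prediction interval: adds the residual variance of a single observation
pred_int <- fit + c(-1, 1) * tcrit * sqrt(se_fit^2 + s^2)

print(fit)
print(conf_int)
print(pred_int)
```

The only difference between the two formulas is the extra s^2 term under the square root, which is why the prediction interval is so much wider.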
(c)
The binary variable wage150, which contains a 1 if wage is above 150 and a 0 otherwise, can be created as below. The comparison on its own returns TRUE/FALSE, so as.integer() is used to convert it to 1/0.
wage150 = as.integer(Wage$wage > 150)
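One way to build the indicator with an explicit 0/1 coding and check it is sketched below, assuming the ISLR package is installed. Note that the question does not say how to treat a wage of exactly 150; this sketch assumes such values are coded 0.

```r
library(ISLR)
data(Wage)

# The comparison yields TRUE/FALSE; as.integer() maps TRUE to 1 and FALSE to 0.
# Assumption: a wage of exactly 150 is coded 0 (the question leaves this open).
wage150 <- as.integer(Wage$wage > 150)

# Counts of 0s and 1s
print(table(wage150))

# Proportion of workers earning above 150
print(mean(wage150))
```

An equivalent form is ifelse(Wage$wage > 150, 1, 0), which returns a numeric rather than an integer vector.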