This question requires RStudio. Use the following commands to install the package and load the data into R:
> install.packages("ISLR")
> library(ISLR)
> data(Wage)
With the data installed and loaded, here is a description of the dataset:
This dataset contains economic and demographic data for 3000 individuals living in the mid-Atlantic region.
For each of the 3000 individuals, the following 11 variables are recorded:
year: Year that wage information was recorded
age: Age of worker
maritl: A factor with levels 1. Never Married 2. Married 3. Widowed 4. Divorced and 5. Separated indicating marital status
race: A factor with levels 1. White 2. Black 3. Asian and 4. Other indicating race
education: A factor with levels 1. < HS Grad 2. HS Grad 3. Some College 4. College Grad and 5. Advanced Degree indicating education level
region: Region of the country (mid-atlantic only)
jobclass: A factor with levels 1. Industrial and 2. Information indicating type of job
health: A factor with levels 1. <=Good and 2. >=Very Good indicating health level of worker
health_ins: A factor with levels 1. Yes and 2. No indicating whether worker has health insurance
logwage: Log of worker's wage
wage: Worker's raw wage
This question continues with the Wage dataset.
You wish to fit a multiple regression model to predict wage using year, age, and jobclass.
However, you are interested in whether the change in wage as a worker ages differs between
industrial workers and information workers. Fit the appropriate model and test the
hypothesis of interest. Include your results and your conclusion.
Please provide all necessary code using RStudio.
library(ISLR)
data(Wage)
head(Wage)
       year age     sex           maritl     race       education             region       jobclass         health health_ins
231655 2006  18 1. Male 1. Never Married 1. White    1. < HS Grad 2. Middle Atlantic  1. Industrial      1. <=Good      2. No
86582  2004  24 1. Male 1. Never Married 1. White 4. College Grad 2. Middle Atlantic 2. Information 2. >=Very Good      2. No
161300 2003  45 1. Male       2. Married 1. White 3. Some College 2. Middle Atlantic  1. Industrial      1. <=Good     1. Yes
155159 2003  43 1. Male       2. Married 3. Asian 4. College Grad 2. Middle Atlantic 2. Information 2. >=Very Good     1. Yes
11443  2005  50 1. Male      4. Divorced 1. White      2. HS Grad 2. Middle Atlantic 2. Information      1. <=Good     1. Yes
376662 2008  54 1. Male       2. Married 1. White 4. College Grad 2. Middle Atlantic 2. Information 2. >=Very Good     1. Yes
        logwage      wage
231655 4.318063  75.04315
86582  4.255273  70.47602
161300 4.875061 130.98218
155159 5.041393 154.68529
11443  4.318063  75.04315
376662 4.845098 127.11574
regmodel <- lm(wage ~ year+age+jobclass, data = Wage)
summary(regmodel)
Call:
lm(formula = wage ~ year + age + jobclass, data = Wage)

Residuals:
     Min       1Q   Median       3Q      Max
-103.646  -24.525   -6.118   16.406  200.662

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)
(Intercept)            -2.400e+03  7.252e+02  -3.309 0.000946 ***
year                    1.235e+00  3.616e-01   3.415 0.000646 ***
age                     6.362e-01  6.373e-02   9.982  < 2e-16 ***
jobclass2. Information  1.597e+01  1.471e+00  10.859  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 40.09 on 2996 degrees of freedom
Multiple R-squared:  0.07794, Adjusted R-squared:  0.07702
F-statistic: 84.41 on 3 and 2996 DF,  p-value: < 2.2e-16
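Regression Equation
Wage = -2399.90 + 1.23 * year + 0.64 * age + 15.97 * jobclass2. Information
Here jobclass2. Information is an indicator that equals 1 for information workers and 0 for industrial workers (the baseline level).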
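To make the regression equation concrete, the sketch below evaluates the fitted model for one worker, once with predict() and once by hand from the coefficients. The worker (year 2006, age 40, information job class) is a hypothetical example chosen only for illustration and is not part of the original question.

```r
library(ISLR)
data(Wage)

# Same additive model as above
regmodel <- lm(wage ~ year + age + jobclass, data = Wage)

# Hypothetical worker; the values are assumptions for illustration only
new_worker <- data.frame(
  year = 2006,
  age = 40,
  jobclass = factor("2. Information", levels = levels(Wage$jobclass))
)

# Predicted wage from the fitted model
pred <- predict(regmodel, newdata = new_worker)

# Same prediction computed directly from the regression equation
b <- coef(regmodel)
by_hand <- b["(Intercept)"] + b["year"] * 2006 + b["age"] * 40 +
  b["jobclass2. Information"] * 1

# Check that the two calculations agree
all.equal(unname(pred), unname(by_hand))
```

Setting the indicator to 0 instead of 1 gives the prediction for an industrial worker of the same age and year.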
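Null and Alternate Hypothesis
H0: All the slope coefficients of the linear model are zero
Ha: At least one slope coefficient of the linear model is not zero
From the F-statistic in the summary output (84.41 on 3 and 2996 DF), the p-value is less than 0.05, so we reject the null hypothesis, i.e. the model is significant and not all coefficients are zero.
Also, the p-value for each of year, age, and jobclass is less than 0.05, so each variable is significant, i.e. there is a relationship between wage and each predictor after adjusting for the others.
Note that this additive model assumes the same age slope for industrial and information workers, so it does not by itself test whether the change in wage with age differs between the two job classes. That hypothesis requires adding an age:jobclass interaction term (wage ~ year + age * jobclass) and testing H0: the interaction coefficient is zero against Ha: it is not zero.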
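The question asks whether the change in wage with age differs between industrial and information workers. One way to test this, sketched below, is to add an age-by-jobclass interaction and test whether its coefficient is zero. The object names fit_add and fit_int are introduced here for illustration, and no output is shown because the interaction model was not run in the original answer.

```r
library(ISLR)
data(Wage)

# Additive model: one common age slope for both job classes
fit_add <- lm(wage ~ year + age + jobclass, data = Wage)

# Interaction model: the age slope is allowed to differ by job class
# (age * jobclass expands to age + jobclass + age:jobclass)
fit_int <- lm(wage ~ year + age * jobclass, data = Wage)

# t-test of H0: coefficient of "age:jobclass2. Information" is zero
summary(fit_int)$coefficients

# Partial F-test comparing the two nested models
# (equivalent to the t-test, since a single term is added)
anova(fit_add, fit_int)
```

If the p-value for the interaction term is below 0.05, reject H0 and conclude that the age slope differs between the two job classes. Otherwise, there is no evidence of a difference, and the additive model is adequate.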