In: Statistics and Probability
The Book of R (Question 20.2) Please answer using R code.
Continue using the survey data frame from the package MASS for the next few exercises.
Now, turn back to the ready-to-use mtcars data set. One of the variables in this data frame is qsec , described as the time in seconds it takes to race a quarter mile; another is gear , the number of forward gears (cars in this data set have either 3, 4, or 5 gears).
> library(MASS)
> #a)
> library(plyr)
> count(survey,vars='Exer')
Exer freq
1 Freq 115
2 None 24
3 Some 98
> par(mfrow=c(1,3))
> attach(survey)
> s1=survey[which(Exer=='Freq'),]
> s2=survey[which(Exer=='None'),]
> s3=survey[which(Exer=='Some'),]
> boxplot(s1$Height,xlab="Freq")
> boxplot(s2$Height,xlab="None")
> boxplot(s3$Height,xlab="Some")
> #b)
> linmodel=lm(Height~Exer)
> summary(linmodel)
Call:
lm(formula = Height ~ Exer)
Residuals:
Min 1Q Median 3Q Max
-24.607 -6.397 -1.607 6.103 25.393
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 174.6067 0.9396 185.836 < 2e-16 ***
ExerNone -5.5787 2.3489 -2.375 0.01847 *
ExerSome -4.2098 1.4094 -2.987 0.00316 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’
1
Residual standard error: 9.628 on 206 degrees of
freedom
(28 observations deleted due to missingness)
Multiple R-squared: 0.05333, Adjusted R-squared: 0.04414
F-statistic: 5.802 on 2 and 206 DF, p-value: 0.003536
>
> #c)
> #Based on the p-value for the model (=0.003536), the exercise
frequency does have an impact on the height since the null
corresponding to equality of means in rejected at 5% level.
>
> #d) lwr and upr give the 95% prediction intervals and fit
gives the mean
> predict(linmodel, s1, interval="predict")
fit lwr upr
7 174.6067 155.5349 193.6784
> predict(linmodel, s2, interval="predict")
fit lwr upr
2 169.028 149.5777 188.4783
> predict(linmodel, s3, interval="predict")
fit lwr upr
1 170.3969 151.3027 189.4911
> #e)
> aov(Height~Exer)
Call:
aov(formula = Height ~ Exer)
Terms:
Exer Residuals
Sum of Squares 1075.657 19094.894
Deg. of Freedom 2 206
Residual standard error: 9.627755
Estimated effects may be unbalanced
28 observations deleted due to missingness
> summary(aov(Height~Exer))
Df Sum Sq Mean Sq F value Pr(>F)
Exer 2 1076 537.8 5.802 0.00354 **
Residuals 206 19095 92.7
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
28 observations deleted due to missingness
> #No change from the result obtained in the linear model.
p-value is also the same.
Question 2
> attach(mtcars)
#a)
> linmod2=lm(qsec~gear)
> linmod2
Call:
lm(formula = qsec ~ gear)
Coefficients:
(Intercept) gear
19.452 -0.449
> summary(linmod2)
Call:
lm(formula = qsec ~ gear)
Residuals:
Min 1Q Median 3Q Max
-2.7074 -1.0629 -0.2164 1.2436 2.3436
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 19.4523 1.5131 12.856 8.96e-13 ***
gear -0.4490 0.4039 -1.112 0.276
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’
1
Residual standard error: 1.517 on 26 degrees of
freedom
Multiple R-squared: 0.04537, Adjusted R-squared: 0.008657
F-statistic: 1.236 on 1 and 26 DF, p-value: 0.2765
#From the p-value, we see that the null is accepted and hence gear is not a significant variable when qsec is the response.
> #b)
> linmod3=lm(qsec~factor(gear))
> summary(linmod3)
Call:
lm(formula = qsec ~ factor(gear))
Residuals:
Min 1Q Median 3Q Max
-2.29308 -0.66058 -0.04727 0.81568 2.51692
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.7031 0.3549 49.880 <2e-16 ***
factor(gear)4 0.9042 0.5242 1.725 0.0969 .
factor(gear)5 -1.8031 0.7317 -2.464 0.0210 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’
1
Residual standard error: 1.28 on 25 degrees of
freedom
Multiple R-squared: 0.3468, Adjusted R-squared: 0.2945
F-statistic: 6.635 on 2 and 25 DF, p-value: 0.004881
#The difference arises because in the first case the gear column is treated as a continuous variable and in the second case it is a categorical variable with 3 levels.