In: Statistics and Probability
The data file glakes.txt on the textbook website provides the cargo volume (in tons) and the time (in hours) it takes to load and unload the cargo for each of the 31 ships that docked at a Canadian port on the Great Lakes. Note that the variable measuring the cargo volume is called “Tonnage” in the data file. We are interested in learning how Tonnage affects the Time.
(a). Fit a simple linear regression model ( Time ? Tonnage ) and provide the summary ˆ = ˆ 0 + ? ˆ 1 output of the model.
(b). Draw the four default diagnostic plots for the model in (a). Does this model seem to fit the data well? If not, list any problems you see.
(c). To improve the model, we consider transforming the variables. First, we consider only transforming the response variable by using the R function boxcox(y~x, data = dataName) from the MASS library.
(1) What transformation does this boxcox plot suggest? Is it a transformation of the response variable or predictor variable?
(2) What is the goal of applying the transformation in (1)? Please select one of the following answers.
A: Make the distribution of transformed Time as close to a normal distribution as possible.
B: Make the distribution of transformed Tonnage as close to a normal distribution as possible.
C: Make the conditional distribution of transformed Time given Tonnage as close to a normal distribution as possible.
(3) Fit a model with transformed variable(s) and provide the summary output of the model.
(4) Draw the four default diagnostic plots for the model in (3). Does this model seem to fit the data well? If not, list any problems you see.
(d) Now we consider transforming both the response variable and the predictor variable.
(1) Use R function boxcox(x~1, data = dataName) to identify an appropriate transformation for the predictor variable.
(2) What is the goal of applying the transformation in (1)? Please select one of the following answers. A: Make the distribution of transformed Time as close to a normal distribution as possible. B: Make the distribution of transformed Tonnage as close to a normal distribution as possible. C: Make the conditional distribution of transformed Time given Tonnage as close to a normal distribution as possible.
(3) Use the inverse response plot to identify an appropriate transformation for the response variable. Hint: First fit a new model with the transformed predictor variable. Then use the R function invResPlot(newModelName) or inverseResponsePlot(newModelName) from the car library to draw the inverse response plot. You may also load the alr4 library instead, since loading the alr4 library automatically loads the car library.
(e) Fit a model with transformed predictor and response variable and provide the summary output of the model. Use the power transformations identified in (d).
(f) Draw the four default diagnostic plots for the model in (e). Does this model seem to fit the data well? If not, list any problems you see.
(g) Interpret the slope of the model in (e) in terms of the percentage change of the variables.
dataset:
glakes.txt:
Case Tonnage Time 1 2213 17 2 3256 30 3 12203 68 4 7021 64 5 529 11 6 3192 55 7 547 20 8 4682 49 9 6112 69 10 5375 68 11 6666 49 12 3930 43 13 4263 31 14 1849 17 15 663 13 16 329 13 17 2790 43 18 353 15 19 2829 30 20 363 20 21 7084 41 22 1328 15 23 294 13 24 268 11 25 1732 24 26 507 11 27 1486 28 28 536 22 29 851 9 30 6760 43 31 15900 131
Part a,b,c and d are not needed. Not sure if this is too much for a single question so just as much as possible.
Following is the R-code to get the output for e,f,g. The codes are well documented and you will understand what lambda is taken to do the power transformation of both predictor and response variable given as hint in (d)
#######################
## Glakes Data ###
#######################
library(MASS)
library(car)
bc<-boxcox(Tonnage~1, data = glakes)
(lambda <- bc$x[which.max(bc$y)])
#Lambda Comes out to be 0.06
# Applying this lambda to make a transformed predictor
variable
glakes$Tonnage.tran<-(glakes$Tonnage)^lambda
# Fit lm with the transformed predictor and untransformed
response
Model<-lm(Time~Tonnage.tran,data=glakes)
# Inverse response plot
inv<-car::invResPlot(Model)
# Lambda=0.01 gives smallest RSS
# 0.01 is appropriate for response
# Applying this lambda to make a transformed response
variable
glakes$Time.tran<-(glakes$Time)^0.01
# fit the transformed model
tran.model<-lm(Time.tran~Tonnage.tran,data=glakes)
tran.model
#Diagnostic plot
par(mfrow=c(2,2)) # Change the panel layout to 2 x 2
plot(tran.model)
dev.off()
par(mfrow=c(1,1)) # Change back to 1 x 1
The model fit is given by
> tran.model
Call:
lm(formula = Time.tran ~ Tonnage.tran, data = glakes)
Coefficients:
(Intercept) Tonnage.tran
0.94627 0.05523
f) The diagnostic plot comes as
Which means all the diagnostic plots comes out to be promising. There is no particular pattern in residual vs fitted and scale-location plot, which means homescedasticity assumption appears valid here. The Q-Q plot appears almost linear which is much improvement over normal linear regression model. And lastly, the leverage plot suggests presence of no influential observations in the data.
g) The slope of the model is 0.055