In: Statistics and Probability
In this assignment, we will use the dataset collected in
Baystate Medical Center, Springfield, Mass (1986), featured in
Hosmer, D.W. and Lemeshow, S. (1989) Applied Logistic Regression.
New York: Wiley. To download this dataset, go into R and run the
following code to save the data as the object “data1”.
install.packages("MASS")
library(MASS)
data1 <- birthwt
Description of birthwt Data
low indicator of birth weight less than 2.5 kg.
age mother's weight in pounds at the last menstrual period.
lwt mother's weight in pounds at the last menstrual period.
race mother's race (1 = white, 2 = black, 3 = others).
smoke smoking status during pregnancy.
ptl the number of previous premature labours.
ht history of hypertension.
ui presence of uterine irritability.
ftv the number of physician visits during the first trimester.
bwt birth weight in grams.
Use R to help construct a final (best fit) model for the birth
weight data, with bwt as the Y variable.
(i) Write down the model equation.
(ii) Use R to plot studentized residuals against predicted values
and X variables. Discuss what you see.
(b) Verify that regression model assumptions are met. Do you think
there is a need to do any adjustments/transformations? If yes, what
do you suggest? (
a)
i)
Final Regression Equation:
Bwt = 3586.50 -1139.20 * low – 97.34 * race – 157.42* smoke -303.19 ui
ii)
Ideally, the residual plot will show no fitted pattern. That is, the red line should be approximately horizontal at zero. The presence of a pattern may indicate a problem with some aspect of the linear model.
In our example, there is some pattern in the residual plot. This suggests that we cannot assume linear relationship between the predictors and the outcome variables
b)
The diagnostic plots show residuals in four different ways:
Normal Q-Q
Above plot shows that the residuals are not normally distributed
Homogeneity of variance
This assumption can be checked by examining the scale-location plot, also known as the spread-location plot.
plot(model2, 3)
It can be seen that the variability (variances) of the residual points increases with the value of the fitted outcome variable, suggesting non-constant variances in the residuals errors (or heteroscedasticity).
A possible solution to reduce the heteroscedasticity problem is to use a log or square root transformation of the outcome variable (y).
R Code
library(MASS)
data("birthwt")
data1 <- birthwt
head(data1)
summary(data1)
model = lm(bwt~.,data=data1)
summary(model)
## To find best model for all the combination of regression independent variables
step(model, direction = "backward",trace = FALSE)
model2 = lm(bwt~low + race + smoke + ui,data=data1)
summary(model2)
plot(model2)
Output
> library(MASS)
> data("birthwt")
> data1 <- birthwt
> head(data1)
low age lwt race smoke ptl ht ui ftv bwt
85 0 19 182 2 0 0 0 1 0 2523
86 0 33 155 3 0 0 0 0 3 2551
87 0 20 105 1 1 0 0 0 1 2557
88 0 21 108 1 1 0 0 1 2 2594
89 0 18 107 1 1 0 0 1 0 2600
91 0 21 124 3 0 0 0 0 0 2622
> summary(data1)
low age lwt race smoke
Min. :0.0000 Min. :14.00 Min. : 80.0 Min. :1.000 Min. :0.0000
1st Qu.:0.0000 1st Qu.:19.00 1st Qu.:110.0 1st Qu.:1.000 1st Qu.:0.0000
Median :0.0000 Median :23.00 Median :121.0 Median :1.000 Median :0.0000
Mean :0.3122 Mean :23.24 Mean :129.8 Mean :1.847 Mean :0.3915
3rd Qu.:1.0000 3rd Qu.:26.00 3rd Qu.:140.0 3rd Qu.:3.000 3rd Qu.:1.0000
Max. :1.0000 Max. :45.00 Max. :250.0 Max. :3.000 Max. :1.0000
ptl ht ui ftv bwt
Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. : 709
1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:2414
Median :0.0000 Median :0.00000 Median :0.0000 Median :0.0000 Median :2977
Mean :0.1958 Mean :0.06349 Mean :0.1481 Mean :0.7937 Mean :2945
3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:3487
Max. :3.0000 Max. :1.00000 Max. :1.0000 Max. :6.0000 Max. :4990
>
> model = lm(bwt~.,data=data1)
> summary(model)
Call:
lm(formula = bwt ~ ., data = data1)
Residuals:
Min 1Q Median 3Q Max
-991.22 -300.96 -5.39 277.74 1637.80
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3612.508 229.457 15.744 < 2e-16 ***
low -1131.217 73.957 -15.296 < 2e-16 ***
age -6.245 6.347 -0.984 0.326416
lwt 1.051 1.133 0.927 0.355085
race -100.905 38.544 -2.618 0.009605 **
smoke -174.116 72.000 -2.418 0.016597 *
ptl 81.340 68.552 1.187 0.236980
ht -181.955 137.661 -1.322 0.187934
ui -336.776 93.314 -3.609 0.000399 ***
ftv -7.578 30.992 -0.245 0.807118
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 433.7 on 179 degrees of freedom
Multiple R-squared: 0.6632, Adjusted R-squared: 0.6462
F-statistic: 39.16 on 9 and 179 DF, p-value: < 2.2e-16
>
> ## To find best model for all the combination of regression independent variables
> step(model, direction = "backward",trace = FALSE)
Call:
lm(formula = bwt ~ low + race + smoke + ui, data = data1)
Coefficients:
(Intercept) low race smoke ui
3586.50 -1139.20 -97.34 -157.42 -303.19
>
> model2 = lm(bwt~low + race + smoke + ui,data=data1)
> summary(model2)
Call:
lm(formula = bwt ~ low + race + smoke + ui, data = data1)
Residuals:
Min 1Q Median 3Q Max
-1025.8 -351.0 30.8 285.8 1500.8
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3586.50 86.68 41.379 < 2e-16 ***
low -1139.20 71.12 -16.019 < 2e-16 ***
race -97.34 37.37 -2.605 0.009942 **
smoke -157.42 70.38 -2.237 0.026510 *
ui -303.19 90.02 -3.368 0.000922 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 432.5 on 184 degrees of freedom
Multiple R-squared: 0.6557, Adjusted R-squared: 0.6482
F-statistic: 87.59 on 4 and 184 DF, p-value: < 2.2e-16