In this problem, we will perform multiple regression on the Boston housing data. The data contains 506 records with 14 variables. The variable medv is the response variable.
Solve the following problems in R and print out the commands and outputs.
To access the data, use
library(MASS)
data(Boston)
(a) First perform a multiple regression with all the variables. What can you say about the significance of the variables based only on the p-values? Next, use the "step" function to perform backward selection using (1) the AIC criterion and (2) the BIC criterion, then compare the results. (By default, the step function in R performs variable selection based on the AIC criterion. Read the documentation to find out how to do the selection using the BIC criterion.)
(b) Now make a histogram of the response variable (use hist()) to see if it is skewed. Using log(medv) as the response variable, perform the stepwise selection as before using both the AIC and BIC criteria. Compare with the previous results in terms of the selected variables and adjusted R².
Answer:
The data contain 506 records with 14 variables (13 predictors plus the response medv).
The R code is as follows:
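library(MASS)   # provides the Boston data and stepAIC()
data(Boston)
str(Boston)     # 506 observations of 14 variables

# (a) Full model with all 13 predictors
fit <- lm(medv ~ ., data = Boston)
summary(fit)

# Backward selection with AIC (the default penalty, k = 2)
step.model <- stepAIC(fit, direction = "backward", trace = 1)
summary(step.model)

# Backward selection with BIC: set the penalty to k = log(n)
n <- nrow(Boston)
step(fit, direction = "backward", k = log(n), trace = 1)

# (b) Histograms of the response before and after the log transform
hist(Boston$medv, col = "brown")
hist(log(Boston$medv), col = "brown")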
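For the first part of (a), the p-values can be read from summary(fit), but it may be easier to pull them out of the coefficient table directly. The following is a minimal sketch, not part of the original answer; the 5% cutoff stored in alpha and the names coefs, pvals, and not_sig are assumptions chosen for illustration.

```r
library(MASS)
data(Boston)

fit <- lm(medv ~ ., data = Boston)

# Coefficient table: estimate, standard error, t value, p-value
coefs <- coef(summary(fit))

# p-values of the predictors (intercept row dropped), largest first
pvals <- sort(coefs[-1, "Pr(>|t|)"], decreasing = TRUE)
print(round(pvals, 4))

# Predictors that are not significant at the assumed 5% level
alpha <- 0.05
not_sig <- names(pvals)[pvals > alpha]
print(not_sig)
```

Predictors listed in not_sig are the natural candidates for removal, which can then be checked against the variables that the backward selection actually drops.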
The output of the backward selection with the AIC criterion is:
Start: AIC=1589.64
medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad +
    tax + ptratio + black + lstat
Df Sum of Sq RSS AIC
- age 1 0.06 11079 1587.7
- indus 1 2.52 11081 1587.8
<none> 11079 1589.6
- chas 1 218.97 11298 1597.5
- tax 1 242.26 11321 1598.6
- crim 1 243.22 11322 1598.6
- zn 1 257.49 11336 1599.3
- black 1 270.63 11349 1599.8
- rad 1 479.15 11558 1609.1
- nox 1 487.16 11566 1609.4
- ptratio 1 1194.23 12273 1639.4
- dis 1 1232.41 12311 1641.0
- rm 1 1871.32 12950 1666.6
- lstat 1 2410.84 13490 1687.3
Step: AIC=1587.65
medv ~ crim + zn + indus + chas + nox + rm + dis + rad + tax +
    ptratio + black + lstat
Df Sum of Sq RSS AIC
- indus 1 2.52 11081 1585.8
<none> 11079 1587.7
- chas 1 219.91 11299 1595.6
- tax 1 242.24 11321 1596.6
- crim 1 243.20 11322 1596.6
- zn 1 260.32 11339 1597.4
- black 1 272.26 11351 1597.9
- rad 1 481.09 11560 1607.2
- nox 1 520.87 11600 1608.9
- ptratio 1 1200.23 12279 1637.7
- dis 1 1352.26 12431 1643.9
- rm 1 1959.55 13038 1668.0
- lstat 1 2718.88 13798 1696.7
Step: AIC=1585.76
medv ~ crim + zn + chas + nox + rm + dis + rad + tax + ptratio +
    black + lstat
Df Sum of Sq RSS AIC
<none> 11081 1585.8
- chas 1 227.21 11309 1594.0
- crim 1 245.37 11327 1594.8
- zn 1 257.82 11339 1595.4
- black 1 270.82 11352 1596.0
- tax 1 273.62 11355 1596.1
- rad 1 500.92 11582 1606.1
- nox 1 541.91 11623 1607.9
- ptratio 1 1206.45 12288 1636.0
- dis 1 1448.94 12530 1645.9
- rm 1 1963.66 13045 1666.3
- lstat 1 2723.48 13805 1695.0
Call:
lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad +
tax + ptratio + black + lstat, data = Boston)
Coefficients:
(Intercept)         crim           zn         chas          nox           rm          dis
  36.341145    -0.108413     0.045845     2.718716   -17.376023     3.801579    -1.492711
        rad          tax      ptratio        black        lstat
   0.299608    -0.011778    -0.946525     0.009291    -0.522553
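The criterion that step() and stepAIC() print comes from extractAIC(), which for a linear model is n * log(RSS / n) + k * p, where p is the number of estimated coefficients. With k = 2 this is the AIC; with k = log(n) it is the BIC. The sketch below recomputes both values by hand for the full model. The names rss, p, aic_manual, and bic_manual are assumptions introduced for illustration.

```r
library(MASS)
data(Boston)

fit <- lm(medv ~ ., data = Boston)
n   <- nrow(Boston)            # number of records
rss <- sum(residuals(fit)^2)   # residual sum of squares
p   <- length(coef(fit))       # coefficients, including the intercept

# Criterion used by step() for a linear model: n * log(RSS / n) + k * p
aic_manual <- n * log(rss / n) + 2 * p
bic_manual <- n * log(rss / n) + log(n) * p

# extractAIC() returns c(edf, criterion); compare with the manual values
aic_step <- extractAIC(fit, k = 2)[2]
bic_step <- extractAIC(fit, k = log(n))[2]

print(c(aic_manual = aic_manual, aic_step = aic_step))
print(c(bic_manual = bic_manual, bic_step = bic_step))
```

These values should agree with the "Start: AIC=" lines of the two selection runs. They differ from AIC(fit) and BIC(fit) by an additive constant, which does not affect the comparison between models fitted to the same data.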
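The output of the backward selection with the BIC criterion (k = log(n)) is shown next. Note that step() still labels the criterion "AIC" even though the BIC penalty is used: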
Start: AIC=1648.81
medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad +
    tax + ptratio + black + lstat
Df Sum of Sq RSS AIC
- age 1 0.06 11079 1642.6
- indus 1 2.52 11081 1642.7
<none> 11079 1648.8
- chas 1 218.97 11298 1652.5
- tax 1 242.26 11321 1653.5
- crim 1 243.22 11322 1653.6
- zn 1 257.49 11336 1654.2
- black 1 270.63 11349 1654.8
- rad 1 479.15 11558 1664.0
- nox 1 487.16 11566 1664.4
- ptratio 1 1194.23 12273 1694.4
- dis 1 1232.41 12311 1696.0
- rm 1 1871.32 12950 1721.6
- lstat 1 2410.84 13490 1742.2
Step: AIC=1642.59
medv ~ crim + zn + indus + chas + nox + rm + dis + rad + tax +
    ptratio + black + lstat
Df Sum of Sq RSS AIC
- indus 1 2.52 11081 1636.5
<none> 11079 1642.6
- chas 1 219.91 11299 1646.3
- tax 1 242.24 11321 1647.3
- crim 1 243.20 11322 1647.3
- zn 1 260.32 11339 1648.1
- black 1 272.26 11351 1648.7
- rad 1 481.09 11560 1657.9
- nox 1 520.87 11600 1659.6
- ptratio 1 1200.23 12279 1688.4
- dis 1 1352.26 12431 1694.6
- rm 1 1959.55 13038 1718.8
- lstat 1 2718.88 13798 1747.4
Step: AIC=1636.48
medv ~ crim + zn + chas + nox + rm + dis + rad + tax + ptratio +
    black + lstat
Df Sum of Sq RSS AIC
<none> 11081 1636.5
- chas 1 227.21 11309 1640.5
- crim 1 245.37 11327 1641.3
- zn 1 257.82 11339 1641.9
- black 1 270.82 11352 1642.5
- tax 1 273.62 11355 1642.6
- rad 1 500.92 11582 1652.6
- nox 1 541.91 11623 1654.4
- ptratio 1 1206.45 12288 1682.5
- dis 1 1448.94 12530 1692.4
- rm 1 1963.66 13045 1712.8
- lstat 1 2723.48 13805 1741.5
Call:
lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad +
tax + ptratio + black + lstat, data = Boston)
Coefficients:
(Intercept)         crim           zn         chas          nox           rm          dis
  36.341145    -0.108413     0.045845     2.718716   -17.376023     3.801579    -1.492711
        rad          tax      ptratio        black        lstat
   0.299608    -0.011778    -0.946525     0.009291    -0.522553
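Comparing the two runs, both the AIC and the BIC selection remove age and then indus, and both stop at the same 11-predictor model. The selection on log(medv) asked for in part (b) is not shown in the output above. A minimal sketch of how it could be run and compared is given below; the object names (fit_log, log_aic, log_bic, raw_aic, raw_bic, models) are assumptions for illustration, and no results are claimed here.

```r
library(MASS)
data(Boston)

n <- nrow(Boston)

# Full model on the log scale; "." expands to all columns except medv
fit_log <- lm(log(medv) ~ ., data = Boston)

# Backward selection; trace = 0 suppresses the step-by-step tables
log_aic <- step(fit_log, direction = "backward", k = 2,      trace = 0)
log_bic <- step(fit_log, direction = "backward", k = log(n), trace = 0)

# Same selections on the original scale, for comparison
fit     <- lm(medv ~ ., data = Boston)
raw_aic <- step(fit, direction = "backward", k = 2,      trace = 0)
raw_bic <- step(fit, direction = "backward", k = log(n), trace = 0)

models <- list(raw_aic = raw_aic, raw_bic = raw_bic,
               log_aic = log_aic, log_bic = log_bic)

# Selected predictors and adjusted R-squared for each model
for (name in names(models)) {
  m <- models[[name]]
  cat(name, ":", paste(attr(terms(m), "term.labels"), collapse = " + "), "\n")
  cat("  adjusted R^2 =", round(summary(m)$adj.r.squared, 4), "\n")
}
```

When reading the comparison, keep in mind that the adjusted R² of the log models measures variance explained in log(medv), so it is not on exactly the same scale as the adjusted R² of the models for medv.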