Question

In this problem, we will perform multiple regression on the Boston housing data. The data contains 506 records with 14 variables. The variable medv is the response variable.

Solve the following problems in R and print out the commands and outputs:

To access the data, use

library(MASS)

data(Boston)

(a) First perform a multiple regression with all the variables. What can you say about the significance of the variables based only on the p-values? Next, use the "step" function to perform backward selection using (1) the AIC criterion and (2) the BIC criterion, then compare the results. (By default, the step function in R performs variable selection based on the AIC criterion. Read the documentation to find out how to do the selection using the BIC criterion.)

(b) Now make a histogram of the response variable (use hist()) to see if it is skewed. Using log(medv) as the response variable, perform the stepwise selection as before with both the AIC and BIC criteria. Compare with the previous results in terms of selected variables and adjusted R².

Solutions

Expert Solution

Answer:

The data contain 506 records with 14 variables.

The R snippet is as follows:

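library(MASS)   # provides the Boston data and stepAIC()
data(Boston)
head(Boston)    # first rows only; printing Boston lists all 506 records

# (a) Full model with all 13 predictors
fit <- lm(medv ~ ., data = Boston)
summary(fit)

# Backward selection by AIC (default penalty k = 2)
step.model <- stepAIC(fit, direction = "backward", trace = 1)
step.model            # prints the call and coefficients of the selected model
summary(step.model)   # includes adjusted R-squared

# Backward selection by BIC: set the penalty to k = log(n)
n <- nrow(Boston)
step(fit, direction = "backward", k = log(n), trace = 1)

# (b) Histograms of the response and of its log
hist(Boston$medv, col = "brown")
hist(log(Boston$medv), col = "brown")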

The AIC backward-selection results are:

Start:  AIC=1589.64
medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad +
    tax + ptratio + black + lstat

          Df Sum of Sq   RSS    AIC
- age      1      0.06 11079 1587.7
- indus    1      2.52 11081 1587.8
<none>                 11079 1589.6
- chas     1    218.97 11298 1597.5
- tax      1    242.26 11321 1598.6
- crim     1    243.22 11322 1598.6
- zn       1    257.49 11336 1599.3
- black    1    270.63 11349 1599.8
- rad      1    479.15 11558 1609.1
- nox      1    487.16 11566 1609.4
- ptratio  1   1194.23 12273 1639.4
- dis      1   1232.41 12311 1641.0
- rm       1   1871.32 12950 1666.6
- lstat    1   2410.84 13490 1687.3

Step:  AIC=1587.65
medv ~ crim + zn + indus + chas + nox + rm + dis + rad + tax +
    ptratio + black + lstat

          Df Sum of Sq   RSS    AIC
- indus    1      2.52 11081 1585.8
<none>                 11079 1587.7
- chas     1    219.91 11299 1595.6
- tax      1    242.24 11321 1596.6
- crim     1    243.20 11322 1596.6
- zn       1    260.32 11339 1597.4
- black    1    272.26 11351 1597.9
- rad      1    481.09 11560 1607.2
- nox      1    520.87 11600 1608.9
- ptratio  1   1200.23 12279 1637.7
- dis      1   1352.26 12431 1643.9
- rm       1   1959.55 13038 1668.0
- lstat    1   2718.88 13798 1696.7

Step:  AIC=1585.76
medv ~ crim + zn + chas + nox + rm + dis + rad + tax + ptratio +
    black + lstat

          Df Sum of Sq   RSS    AIC
<none>                 11081 1585.8
- chas     1    227.21 11309 1594.0
- crim     1    245.37 11327 1594.8
- zn       1    257.82 11339 1595.4
- black    1    270.82 11352 1596.0
- tax      1    273.62 11355 1596.1
- rad      1    500.92 11582 1606.1
- nox      1    541.91 11623 1607.9
- ptratio  1   1206.45 12288 1636.0
- dis      1   1448.94 12530 1645.9
- rm       1   1963.66 13045 1666.3
- lstat    1   2723.48 13805 1695.0

Call:
lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad +
    tax + ptratio + black + lstat, data = Boston)

Coefficients:
(Intercept)         crim           zn         chas          nox           rm          dis
  36.341145    -0.108413     0.045845     2.718716   -17.376023     3.801579    -1.492711
        rad          tax      ptratio        black        lstat
   0.299608    -0.011778    -0.946525     0.009291    -0.522553
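
For the p-value part of (a), the coefficient table returned by summary() can be filtered directly. The snippet below is a minimal sketch and not part of the original answer: the names full_fit, pvals, alpha and not_signif are illustrative, and the 0.05 cutoff is an assumed significance level that the question does not specify. It ties in with the step tables above, where dropping age or indus changes the RSS by only 0.06 and 2.52.

```r
library(MASS)   # provides the Boston data set
data(Boston)

# Full model with all 13 predictors (same model as `fit` in the answer)
full_fit <- lm(medv ~ ., data = Boston)

# Coefficient table: estimate, standard error, t value, p-value
coef_table <- coef(summary(full_fit))
pvals <- coef_table[, "Pr(>|t|)"]

alpha <- 0.05   # assumed significance level, not given in the question
not_signif <- names(pvals)[pvals > alpha]

print(not_signif)             # terms that are not significant at alpha
print(round(sort(pvals), 4))  # all p-values, smallest first
```

Terms listed in not_signif are the natural candidates for removal, which can then be compared against what backward selection actually drops.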

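The BIC results are as follows. With k = log(n), step() still labels the criterion "AIC" in its output, but the values shown are BIC. Both criteria drop age and then indus and stop at the same 11-variable model.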

Start:  AIC=1648.81
medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad +
    tax + ptratio + black + lstat

          Df Sum of Sq   RSS    AIC
- age      1      0.06 11079 1642.6
- indus    1      2.52 11081 1642.7
<none>                 11079 1648.8
- chas     1    218.97 11298 1652.5
- tax      1    242.26 11321 1653.5
- crim     1    243.22 11322 1653.6
- zn       1    257.49 11336 1654.2
- black    1    270.63 11349 1654.8
- rad      1    479.15 11558 1664.0
- nox      1    487.16 11566 1664.4
- ptratio  1   1194.23 12273 1694.4
- dis      1   1232.41 12311 1696.0
- rm       1   1871.32 12950 1721.6
- lstat    1   2410.84 13490 1742.2

Step:  AIC=1642.59
medv ~ crim + zn + indus + chas + nox + rm + dis + rad + tax +
    ptratio + black + lstat

          Df Sum of Sq   RSS    AIC
- indus    1      2.52 11081 1636.5
<none>                 11079 1642.6
- chas     1    219.91 11299 1646.3
- tax      1    242.24 11321 1647.3
- crim     1    243.20 11322 1647.3
- zn       1    260.32 11339 1648.1
- black    1    272.26 11351 1648.7
- rad      1    481.09 11560 1657.9
- nox      1    520.87 11600 1659.6
- ptratio  1   1200.23 12279 1688.4
- dis      1   1352.26 12431 1694.6
- rm       1   1959.55 13038 1718.8
- lstat    1   2718.88 13798 1747.4

Step:  AIC=1636.48
medv ~ crim + zn + chas + nox + rm + dis + rad + tax + ptratio +
    black + lstat

          Df Sum of Sq   RSS    AIC
<none>                 11081 1636.5
- chas     1    227.21 11309 1640.5
- crim     1    245.37 11327 1641.3
- zn       1    257.82 11339 1641.9
- black    1    270.82 11352 1642.5
- tax      1    273.62 11355 1642.6
- rad      1    500.92 11582 1652.6
- nox      1    541.91 11623 1654.4
- ptratio  1   1206.45 12288 1682.5
- dis      1   1448.94 12530 1692.4
- rm       1   1963.66 13045 1712.8
- lstat    1   2723.48 13805 1741.5

Call:
lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad +
    tax + ptratio + black + lstat, data = Boston)

Coefficients:
(Intercept)         crim           zn         chas          nox           rm          dis
  36.341145    -0.108413     0.045845     2.718716   -17.376023     3.801579    -1.492711
        rad          tax      ptratio        black        lstat
   0.299608    -0.011778    -0.946525     0.009291    -0.522553
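
The output above covers part (a) only; the stepwise selection on log(medv) requested in part (b) is not shown. One way it could be carried out is sketched below. This is an illustration under assumptions, not the original author's code: the names raw_fit, log_fit, describe, models and results are made up for the example, and no numeric results are quoted because none appear in the answer. Note that AIC and BIC values are not comparable between medv and log(medv) because the response differs, so the comparison uses the selected variables and adjusted R² as the question asks.

```r
library(MASS)   # provides the Boston data set
data(Boston)

n <- nrow(Boston)

# Full models for the raw and the log-transformed response
raw_fit <- lm(medv ~ ., data = Boston)
log_fit <- lm(log(medv) ~ ., data = Boston)

# Backward selection: default k = 2 is AIC, k = log(n) is BIC
raw_aic <- step(raw_fit, direction = "backward", trace = 0)
raw_bic <- step(raw_fit, direction = "backward", k = log(n), trace = 0)
log_aic <- step(log_fit, direction = "backward", trace = 0)
log_bic <- step(log_fit, direction = "backward", k = log(n), trace = 0)

# Helper: selected predictors and adjusted R-squared of a fitted model
describe <- function(model) {
  list(terms  = attr(terms(model), "term.labels"),
       adj_r2 = summary(model)$adj.r.squared)
}

models  <- list(raw_aic = raw_aic, raw_bic = raw_bic,
                log_aic = log_aic, log_bic = log_bic)
results <- lapply(models, describe)

str(results)   # selected terms and adjusted R-squared for each model
```

Comparing results$raw_aic$terms with results$log_aic$terms (and likewise for BIC) shows whether the log transform changes which predictors survive selection.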

