rent | rooms | baths | sqrfoot | house | campusclose | pets | new |
875 | 1 | 1 | 655 | 0 | 0 | 0 | 0 |
1130 | 1 | 1 | 800 | 0 | 0 | 1 | 0 |
785 | 1 | 1 | 650 | 0 | 1 | 0 | 0 |
895 | 1 | 1 | 566 | 0 | 1 | 0 | 0 |
690 | 1 | 1 | 600 | 0 | 1 | 0 | 0 |
800 | 1 | 1 | 435 | 0 | 1 | 0 | 0 |
595 | 1 | 1 | 500 | 0 | 0 | 0 | 0 |
850 | 1 | 1 | 655 | 0 | 1 | 0 | 0 |
775 | 1 | 1 | 612 | 0 | 0 | 1 | 1 |
795 | 1 | 1 | 688 | 0 | 1 | 0 | 0 |
1050 | 1 | 1 | 700 | 0 | 1 | 1 | 0 |
870 | 1 | 1 | 655 | 0 | 1 | 0 | 0 |
1070 | 1 | 1 | 710 | 0 | 0 | 1 | 0 |
850 | 1 | 1 | 670 | 0 | 1 | 0 | 0 |
825 | 1 | 1 | 488 | 0 | 1 | 0 | 0 |
1300 | 2 | 1 | 781 | 1 | 1 | 1 | 0 |
1225 | 2 | 1 | 764 | 0 | 1 | 0 | 0 |
1300 | 2 | 1 | 800 | 1 | 0 | 0 | 0 |
1200 | 2 | 1 | 922 | 0 | 1 | 0 | 0 |
1345 | 2 | 1 | 856 | 0 | 0 | 1 | 1 |
1100 | 2 | 2 | 866 | 0 | 0 | 0 | 1 |
1350 | 2 | 2 | 1300 | 0 | 0 | 0 | 0 |
1450 | 2 | 1 | 700 | 0 | 1 | 1 | 1 |
1200 | 2 | 1 | 800 | 0 | 1 | 0 | 0 |
1195 | 2 | 1 | 795 | 0 | 1 | 0 | 0 |
1185 | 2 | 1 | 864 | 0 | 1 | 0 | 0 |
1100 | 2 | 1 | 1050 | 0 | 1 | 0 | 0 |
1125 | 2 | 2 | 986 | 0 | 0 | 1 | 1 |
1075 | 2 | 1 | 800 | 0 | 0 | 1 | 1 |
1210 | 2 | 2 | 890 | 0 | 1 | 0 | 0 |
1150 | 2 | 1 | 1200 | 0 | 0 | 1 | 0 |
1215 | 2 | 1 | 988 | 0 | 1 | 0 | 0 |
1270 | 2 | 1.5 | 995 | 0 | 1 | 0 | 0 |
995 | 2 | 1 | 864 | 0 | 1 | 0 | 0 |
1095 | 2 | 1 | 1050 | 0 | 0 | 0 | 0 |
995 | 2 | 1 | 800 | 0 | 1 | 0 | 0 |
1205 | 2 | 1 | 900 | 1 | 1 | 1 | 0 |
1560 | 3 | 2 | 1200 | 1 | 1 | 0 | 0 |
1800 | 3 | 2.5 | 1309 | 1 | 0 | 0 | 1 |
1740 | 3 | 1 | 1200 | 1 | 1 | 0 | 0 |
1795 | 3 | 2 | 1300 | 0 | 0 | 0 | 0 |
2067 | 3 | 4 | 1700 | 0 | 1 | 0 | 1 |
2695 | 3 | 2.5 | 1551 | 0 | 0 | 1 | 1 |
1815 | 3 | 2 | 1467 | 0 | 0 | 1 | 0 |
1900 | 3 | 2.5 | 1600 | 1 | 0 | 0 | 0 |
1395 | 3 | 2 | 1611 | 1 | 0 | 1 | 0 |
1194 | 3 | 1 | 1705 | 1 | 1 | 0 | 0 |
1699 | 3 | 3 | 1646 | 1 | 1 | 1 | 0 |
1700 | 3 | 2 | 1550 | 1 | 0 | 1 | 0 |
2700 | 4 | 3 | 2100 | 1 | 0 | 1 | 1 |
2956 | 4 | 4 | 1659 | 0 | 1 | 1 | 1 |
2400 | 4 | 2 | 2300 | 1 | 1 | 0 | 0 |
2250 | 4 | 2 | 1900 | 0 | 1 | 0 | 0 |
2099 | 4 | 4 | 2200 | 1 | 1 | 1 | 0 |
2720 | 4 | 3 | 2400 | 0 | 1 | 0 | 1 |
1700 | 4 | 1.5 | 1980 | 1 | 1 | 0 | 0 |
2200 | 4 | 1.5 | 2100 | 1 | 1 | 0 | 0 |
2600 | 5 | 1.5 | 3500 | 1 | 1 | 0 | 0 |
2600 | 5 | 2 | 1607 | 1 | 0 | 0 | 0 |
2300 | 5 | 2 | 2600 | 1 | 0 | 0 | 0 |
1. Remove the variable with the highest p-value and re-fit the model. Only remove one variable at a time.
2. Continue removing variables one by one until all variables in the model have a p-value less than 0.05.
3. Consider whether any of the variables in your model are related to each other. Check this with the scatterplot matrix and/or by finding the correlation between the two explanatory variables. If r <= 0.80, keep both variables in the model; this is your final model. However, if r > 0.80, one of the variables should be removed from the model: re-fit two models, each without one of the correlated variables, and select the model with the higher adjusted R-squared value.
a. (2 points) Provide a narrative for how you settled upon the final model. Example: “I first fit the full model and noticed the p-value for ____ was very high. I dropped it from the model and refit the data, then I checked the correlation between ___ and ___ to see if the relationship was too strong between the explanatory variables.”
b. (2 points) Provide the R output of your final model.
c. (2 points) State the least squares regression equation of your model.
d. (2 points) Compare the adjusted R-squared values from the full model to your final model. Is there much of a difference? What does this comparison tell us about the fit of the two models?
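The answer below assumes the table is already in a data frame called rooms, but it does not show how the data were loaded. As a minimal sketch of one possible way to do it, the pipe-delimited rows can be parsed with read.table. Only the first three data rows are included here to keep the example short, and the intermediate names raw and clean are illustrative assumptions.

```r
# Header plus the first three data rows, copied from the table above.
# The full table has 60 data rows.
raw <- c(
  "rent | rooms | baths | sqrfoot | house | campusclose | pets | new |",
  "875 | 1 | 1 | 655 | 0 | 0 | 0 | 0 |",
  "1130 | 1 | 1 | 800 | 0 | 0 | 1 | 0 |",
  "785 | 1 | 1 | 650 | 0 | 1 | 0 | 0 |"
)

# Drop the trailing "|" so read.table does not create an empty ninth column.
clean <- sub("\\s*\\|\\s*$", "", raw)

# strip.white removes the spaces around each field and column name.
rooms <- read.table(text = clean, sep = "|", header = TRUE, strip.white = TRUE)
str(rooms)
```

Note that the data frame and one of its columns share the name rooms. R resolves this correctly inside lm(..., data = rooms), but a different data frame name might be easier to read.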
1.
I loaded the data into a data frame named rooms and fit the full model in R. The command and its output are below.
> model1 = lm(rent ~ ., data = rooms)
> summary(model1)
Call:
lm(formula = rent ~ ., data = rooms)
Residuals:
Min 1Q Median 3Q Max
-409.50556 -124.29555 12.24494 129.78501 639.47952
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 148.17524969 84.81491593 1.74704 0.0865341 .
rooms 408.69587824 66.75578899 6.12225 1.2337e-07 ***
baths 131.69500141 48.81992791 2.69757 0.0093944 **
sqrfoot 0.08225052 0.11247639 0.73127 0.4678973
house -146.45871377 82.30078617 -1.77955 0.0809932 .
campusclose 31.01998932 61.00682154 0.50847 0.6132760
pets 101.69598374 68.04734480 1.49449 0.1410920
new 122.75355017 87.42624629 1.40408 0.1662393
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 210.4259 on 52 degrees of freedom
Multiple R-squared: 0.8926609, Adjusted R-squared: 0.8782114
F-statistic: 61.77799 on 7 and 52 DF, p-value: < 2.2204e-16
The highest p-value is for campusclose.
2.
Removing the variable campusclose and running the regression again we get,
> model2 = lm(rent ~ rooms + baths + sqrfoot + house + pets + new, data = rooms)
> summary(model2)
Call:
lm(formula = rent ~ rooms + baths + sqrfoot + house + pets +
new, data = rooms)
Residuals:
Min 1Q Median 3Q Max
-417.16233 -116.79030 13.24609 115.20607 631.12915
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 170.69022050 71.83063524 2.37629 0.0211332 *
rooms 407.48249621 66.24482626 6.15116 1.0396e-07 ***
baths 133.39153878 48.36388388 2.75808 0.0079607 **
sqrfoot 0.08412433 0.11162689 0.75362 0.4544121
house -149.16521870 81.55197063 -1.82908 0.0730158 .
pets 92.87526516 65.33705783 1.42148 0.1610353
new 113.90218629 85.07422097 1.33886 0.1863319
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The highest p-value is for sqrfoot.
Removing the variable sqrfoot and running the regression again we get,
> model3 = lm(rent ~ rooms + baths + house + pets + new, data = rooms)
> summary(model3)
Call:
lm(formula = rent ~ rooms + baths + house + pets + new, data =
rooms)
Residuals:
Min 1Q Median 3Q Max
-416.14059 -114.07137 7.00428 123.27113 637.05859
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 169.58277 71.52773 2.37087 0.0213447 *
rooms 447.52688 39.40079 11.35832 6.2257e-16 ***
baths 138.54788 47.68554 2.90545 0.0053062 **
house -150.09609 81.21575 -1.84812 0.0700650 .
pets 95.33839 64.99368 1.46689 0.1482068
new 104.06992 83.73088 1.24291 0.2192710
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 208.1112 on 54 degrees of freedom
Multiple R-squared: 0.8909712, Adjusted R-squared: 0.880876
F-statistic: 88.25643 on 5 and 54 DF, p-value: < 2.2204e-16
The highest p-value is for new.
Removing the variable new and running the regression again we get,
> model4 = lm(rent ~ rooms + baths + house + pets, data = rooms)
> summary(model4)
Call:
lm(formula = rent ~ rooms + baths + house + pets, data = rooms)
Residuals:
Min 1Q Median 3Q Max
-432.02712 -121.34304 -5.14514 119.74665 668.79221
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 150.31946 70.17344 2.14211 0.0366277 *
rooms 452.39178 39.39960 11.48214 3.1402e-16 ***
baths 157.54211 45.39362 3.47058 0.0010179 **
house -183.88561 76.90869 -2.39096 0.0202537 *
pets 124.85771 60.79773 2.05366 0.0447754 *
Now all variables have p-values less than 0.05.
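The manual steps above (fit, drop the variable with the highest p-value, refit) can also be written as a loop. The following is only a sketch under stated assumptions: backward_eliminate is a hypothetical helper, not a base R function, and it assumes every predictor is a numeric column so that coefficient names match column names. It is demonstrated on simulated data rather than on the rent data.

```r
# Hypothetical helper, not part of the original answer. It assumes every
# predictor is a numeric column, so coefficient names match column names.
backward_eliminate <- function(data, response, alpha = 0.05) {
  predictors <- setdiff(names(data), response)
  repeat {
    fit <- lm(reformulate(predictors, response = response), data = data)
    coefs <- summary(fit)$coefficients
    # Column 4 is Pr(>|t|); row 1 is the intercept, which is never dropped.
    pvals <- setNames(coefs[-1, 4], rownames(coefs)[-1])
    # Stop when every predictor is significant, or when only one is left.
    if (all(pvals < alpha) || length(predictors) == 1) return(fit)
    # Otherwise remove only the single predictor with the highest p-value.
    predictors <- setdiff(predictors, names(which.max(pvals)))
  }
}

# Simulated data for illustration only: y depends on x1 and x2, not on noise.
set.seed(1)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
sim$y <- 3 * sim$x1 + 2 * sim$x2 + rnorm(n)

final_fit <- backward_eliminate(sim, "y")
kept <- attr(terms(final_fit), "term.labels")
print(kept)
```

Called as backward_eliminate(rooms, "rent") on the full rent data, the loop should follow the same sequence of removals as the manual steps, since it applies the same rule at each refit.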
3.
The variables in the model above are rooms, baths, house and pets.
Running the correlation on the rooms data frame, we get
> cor(rooms)
                       rent           rooms           baths         sqrfoot           house     campusclose
rent          1.00000000000  0.908503713359   0.73964557320   0.84917696737   0.46943266938  -0.07598738507
rooms         0.90850371336  1.000000000000   0.63383078116   0.91693149229   0.62979395469  -0.04712222866
baths         0.73964557320  0.633830781157   1.00000000000   0.61000906432   0.28883391554  -0.09561758715
sqrfoot       0.84917696737  0.916931492295   0.61000906432   1.00000000000   0.58366829910  -0.02261255078
house         0.46943266938  0.629793954692   0.28883391554   0.58366829910   1.00000000000  -0.05281228359
campusclose  -0.07598738507 -0.047122228656  -0.09561758715  -0.02261255078  -0.05281228359   1.00000000000
pets          0.13039491214  0.001047910074   0.20097569028   0.01015951996   0.07573812580  -0.34757851760
new           0.30682539073  0.131614960629   0.41380424027   0.08326810774  -0.16122923188  -0.29137624733
                       pets             new
rent         0.130394912142   0.30682539073
rooms        0.001047910074   0.13161496063
baths        0.200975690280   0.41380424027
sqrfoot      0.010159519955   0.08326810774
house        0.075738125802  -0.16122923188
campusclose -0.347578517598  -0.29137624733
pets         1.000000000000   0.37620154105
new          0.376201541048   1.00000000000
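We see that no pair among the remaining variables (rooms, baths, house and pets) has a correlation above 0.80; the largest is 0.634, between rooms and baths. The pair rooms and sqrfoot does exceed the cutoff (0.917), but sqrfoot was already removed in step 2, so no further variable needs to be dropped and model4 is the final model.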
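The check above can also be done programmatically instead of by reading the matrix. This is a minimal sketch that rebuilds the 4 x 4 block for the remaining predictors from the cor(rooms) output, with values rounded to four decimals, and flags any pair whose absolute correlation exceeds the cutoff. The names r, cutoff, flagged, n_flagged and largest are illustrative.

```r
# Pairwise correlations among the four remaining predictors, copied from the
# cor(rooms) output above and rounded to four decimals.
vars <- c("rooms", "baths", "house", "pets")
r <- diag(4)
dimnames(r) <- list(vars, vars)
r["rooms", "baths"] <- 0.6338
r["rooms", "house"] <- 0.6298
r["rooms", "pets"]  <- 0.0010
r["baths", "house"] <- 0.2888
r["baths", "pets"]  <- 0.2010
r["house", "pets"]  <- 0.0757
r[lower.tri(r)] <- t(r)[lower.tri(r)]  # mirror to make the matrix symmetric

# Flag each pair once (upper triangle only) if |r| exceeds the cutoff.
cutoff <- 0.80
flagged <- which(abs(r) > cutoff & upper.tri(r), arr.ind = TRUE)
n_flagged <- nrow(flagged)
largest <- max(abs(r[upper.tri(r)]))
print(n_flagged)  # 0
print(largest)    # 0.6338
```

With the data frame available, the same matrix could be obtained directly with cor(rooms[, vars]) rather than typed in by hand.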
a.
I first fit the full model and saw that campusclose had the highest p-value (0.613), so I dropped it and refit the model. In the second model sqrfoot had the highest p-value (0.454), so I dropped it and refit. In the third model new had the highest p-value (0.219), so I dropped it and refit. In the fourth model all remaining variables (rooms, baths, house and pets) had p-values below 0.05. I then checked the correlations among these four variables to see whether any relationship between explanatory variables was too strong. The largest was 0.634 (rooms and baths), which is below the 0.80 cutoff, so I kept all four variables.
b.
R output of the final model is,
> summary(model4)
Call:
lm(formula = rent ~ rooms + baths + house + pets, data = rooms)
Residuals:
Min 1Q Median 3Q Max
-432.02712 -121.34304 -5.14514 119.74665 668.79221
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 150.31946 70.17344 2.14211 0.0366277 *
rooms 452.39178 39.39960 11.48214 3.1402e-16 ***
baths 157.54211 45.39362 3.47058 0.0010179 **
house -183.88561 76.90869 -2.39096 0.0202537 *
pets 124.85771 60.79773 2.05366 0.0447754 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 209.1394 on 55 degrees of freedom
Multiple R-squared: 0.8878522, Adjusted R-squared: 0.879696
F-statistic: 108.856 on 4 and 55 DF, p-value: < 2.2204e-16
c.
The least squares regression equation of the final model is,
Rent = 150.31946 + 452.39178 rooms + 157.54211 baths - 183.88561 house + 124.85771 pets
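To illustrate how the equation is used, here is a small sketch that plugs in values for one listing. The listing itself (2 rooms, 1 bath, house = 0, pets = 0) is hypothetical and chosen only for the example; the coefficients are copied from the summary(model4) output.

```r
# Coefficients copied from the summary(model4) output.
b <- c(intercept = 150.31946, rooms = 452.39178, baths = 157.54211,
       house = -183.88561, pets = 124.85771)

# Hypothetical listing: 2 rooms, 1 bath, house = 0, pets = 0.
# The intercept term is always multiplied by 1.
x <- c(intercept = 1, rooms = 2, baths = 1, house = 0, pets = 0)

# Predicted rent = sum of coefficient * value over all terms.
predicted_rent <- sum(b * x[names(b)])
print(round(predicted_rent, 2))  # 1212.65
```

With the fitted model object available, predict(model4, newdata = data.frame(rooms = 2, baths = 1, house = 0, pets = 0)) should give the same value up to rounding of the coefficients.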
d.
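From the summary(model1) output, the adjusted R-squared of the full model is 0.8782114.
From the summary(model4) output, the adjusted R-squared of the final model is 0.879696.
The difference is only about 0.0015, so there is not much difference between the full model and the final model.
Both models fit the data about equally well, with an adjusted R-squared of roughly 0.88 in each case. The final model achieves this with four predictors instead of seven, so the three variables that were dropped (campusclose, sqrfoot and new) added essentially nothing to the fit.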
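As a check on the two values being compared, the adjusted R-squared can be recomputed from the multiple R-squared reported in each summary using the standard formula 1 - (1 - R^2)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. This sketch uses n = 60, the number of rows in the data table; adj_r2 is an illustrative helper name.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p.
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 60  # rows in the data table
full_adj  <- adj_r2(0.8926609, n, p = 7)  # full model (model1)
final_adj <- adj_r2(0.8878522, n, p = 4)  # final model (model4)

print(round(c(full = full_adj, final = final_adj), 4))  # 0.8782 and 0.8797
print(round(final_adj - full_adj, 4))                   # 0.0015
```

The multiple R-squared falls slightly when variables are removed (0.8927 to 0.8879), but the adjusted value rises slightly because the penalty for the number of predictors shrinks by more than the fit worsens.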