The dataset HomesForSaleCA contains a random sample of 30 houses for sale in California. Suppose that we are interested in predicting the Size (in thousands of square feet) for such homes.
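State | Price | Size | Beds | Baths
CA | 500 | 3.2 | 5 | 3.5
CA | 995 | 3.7 | 4 | 3.5
CA | 609 | 2.2 | 4 | 3
CA | 1199 | 2.8 | 3 | 2.5
CA | 949 | 1.4 | 3 | 2
CA | 415 | 1.7 | 3 | 2.5
CA | 895 | 2.1 | 3 | 2
CA | 775 | 1.6 | 3 | 3
CA | 109 | 0.6 | 1 | 1
CA | 5900 | 4.8 | 4 | 4.5
CA | 219 | 1.1 | 3 | 2
CA | 255 | 1.2 | 3 | 2
CA | 86 | 0.6 | 1 | 1
CA | 62 | 1.2 | 3 | 2
CA | 165 | 1.9 | 5 | 3.5
CA | 1695 | 6.9 | 5 | 5.5
CA | 499 | 1.4 | 3 | 2
CA | 47 | 1.5 | 3 | 2
CA | 195 | 2 | 3 | 2.5
CA | 775 | 1 | 2 | 2
CA | 199 | 1.4 | 3 | 2
CA | 480 | 3 | 5 | 3
CA | 173 | 0.9 | 3 | 1
CA | 189 | 2.5 | 2 | 2
CA | 230 | 1.7 | 3 | 2
CA | 380 | 2.1 | 5 | 3
CA | 110 | 0.8 | 2 | 1
CA | 499 | 1.3 | 3 | 2
CA | 399 | 1.4 | 3 | 2
CA | 2450 | 5 | 4 | 5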
1. What is the total variability in the sizes of the 30 homes in this sample? (Hint: Try a regression ANOVA with any of the other variables as a predictor.)
2. What other variable in the HomesForSaleCA dataset explains the greatest amount of the total variability in home sizes? Explain how you decide on the variable.
3. How much of the total variability in home sizes is explained by the "best" variable identified in question 2? Give the answer both as a raw number and as a percentage.
4. Which of the variables in the dataset is the weakest predictor of home sizes? How much of the variability does it explain?
5. Is the weakest predictor identified in question 4 still an effective predictor of home sizes? Include some justification for your answer.
thank you for your help!
Solution:
The regression with Price is:
r² | 0.430
r | 0.656
Std. Error | 1.096
n | 30
k | 1
Dep. Var. | Size
ANOVA table
Source | SS | df | MS | F | p-value
Regression | 25.36173499 | 1 | 25.36173499 | 21.10 | .0001
Residual | 33.65826501 | 28 | 1.20208089 | |
Total | 59.02000000 | 29 | | |
Regression output (with 95% confidence interval)
variables | coefficients | std. error | t (df=28) | p-value | 95% lower | 95% upper
Intercept | 1.4987 | | | | |
Price | 0.0008 | 0.00018305 | 4.593 | .0001 | 0.0005 | 0.0012
The regression with Beds is:
r² | 0.412
r | 0.642
Std. Error | 1.113
n | 30
k | 1
Dep. Var. | Size
ANOVA table
Source | SS | df | MS | F | p-value
Regression | 24.3432 | 1 | 24.3432 | 19.66 | .0001
Residual | 34.6768 | 28 | 1.2385 | |
Total | 59.0200 | 29 | | |
Regression output (with 95% confidence interval)
variables | coefficients | std. error | t (df=28) | p-value | 95% lower | 95% upper
Intercept | -0.6617 | | | | |
Beds | 0.8541 | 0.1927 | 4.434 | .0001 | 0.4595 | 1.2488
The regression with Baths is:
r² | 0.846
r | 0.920
Std. Error | 0.570
n | 30
k | 1
Dep. Var. | Size
ANOVA table
Source | SS | df | MS | F | p-value
Regression | 49.9270 | 1 | 49.9270 | 153.74 | 6.86E-13
Residual | 9.0930 | 28 | 0.3247 | |
Total | 59.0200 | 29 | | |
Regression output (with 95% confidence interval)
variables | coefficients | std. error | t (df=28) | p-value | 95% lower | 95% upper
Intercept | -0.8648 | | | | |
Baths | 1.1859 | 0.0956 | 12.399 | 6.86E-13 | 0.9900 | 1.3818
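If you want to check the three ANOVA tables without a stats package, here is a rough sketch in plain Python. The lists are typed in from the data table in the question, and the function name `anova_simple` is just something made up for this illustration. It uses the usual simple-regression identities (SS Regression = Sxy² / Sxx, SS Total = SS Regression + SS Residual), so treat it as a cross-check rather than the tool that produced the output above.

```python
# Data transcribed from the HomesForSaleCA table in the question.
price = [500, 995, 609, 1199, 949, 415, 895, 775, 109, 5900,
         219, 255, 86, 62, 165, 1695, 499, 47, 195, 775,
         199, 480, 173, 189, 230, 380, 110, 499, 399, 2450]
size = [3.2, 3.7, 2.2, 2.8, 1.4, 1.7, 2.1, 1.6, 0.6, 4.8,
        1.1, 1.2, 0.6, 1.2, 1.9, 6.9, 1.4, 1.5, 2.0, 1.0,
        1.4, 3.0, 0.9, 2.5, 1.7, 2.1, 0.8, 1.3, 1.4, 5.0]
beds = [5, 4, 4, 3, 3, 3, 3, 3, 1, 4,
        3, 3, 1, 3, 5, 5, 3, 3, 3, 2,
        3, 5, 3, 2, 3, 5, 2, 3, 3, 4]
baths = [3.5, 3.5, 3, 2.5, 2, 2.5, 2, 3, 1, 4.5,
         2, 2, 1, 2, 3.5, 5.5, 2, 2, 2.5, 2,
         2, 3, 1, 2, 2, 3, 1, 2, 2, 5]


def anova_simple(x, y):
    """Return (SS total, SS regression, SS residual, r squared) for y regressed on x."""
    n = len(y)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_total = sum((yi - y_bar) ** 2 for yi in y)  # does not depend on x
    ss_reg = sxy ** 2 / sxx
    return ss_total, ss_reg, ss_total - ss_reg, ss_reg / ss_total


results = {name: anova_simple(x, size)
           for name, x in [("Price", price), ("Beds", beds), ("Baths", baths)]}

# Print the predictors from strongest to weakest by r squared.
for name, (sst, ssr, sse, r2) in sorted(results.items(), key=lambda kv: -kv[1][3]):
    print(f"{name:6s} SS_total={sst:.4f} SS_reg={ssr:.4f} SS_res={sse:.4f} r2={r2:.3f}")
```

The SS total column should come out the same for every predictor, which is why the hint in question 1 says any predictor will do.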
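(1) Total variability = 59.0200. This is the Total sum of squares (SS Total), and it is the same in all three ANOVA tables because it depends only on Size, not on the predictor.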
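(2) Baths explains the most variability in Size. It has the largest regression sum of squares (49.9270) and the highest r² (0.846) of the three predictors.
(3) Baths explains 49.9270 of the total 59.0200, which is 84.6%.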
(4) Beds is the weakest predictor (r² = 0.412, just below Price at 0.430). It explains 24.3432 of the total 59.0200, or 41.2% of the variability.
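(5) Yes, Beds is still an effective predictor. Its p-value (.0001, with F = 19.66) is well below 0.05, and the 95% confidence interval for its slope (0.4595 to 1.2488) does not include zero, so there is a significant linear relationship between Beds and Size.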
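To see where the F statistic for Beds comes from, here is a small sketch that recomputes it from the sums of squares in the Beds ANOVA table. The 5% significance level is an assumption on my part (the question does not name one), and the critical value of about 4.20 for F with 1 and 28 degrees of freedom is taken from a standard F table.

```python
# Values copied from the Beds ANOVA table above.
ss_regression, df_regression = 24.3432, 1
ss_residual, df_residual = 34.6768, 28

# Mean squares are sums of squares divided by their degrees of freedom.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

# F is the ratio of explained to unexplained variance per degree of freedom.
f_stat = ms_regression / ms_residual

# Assumption: 5% significance level; tabulated F(0.05; 1, 28) is about 4.20.
F_CRITICAL = 4.20

print(f"F = {f_stat:.2f}")
print("Significant at the 5% level:", f_stat > F_CRITICAL)
```

Since F is far above the critical value, even the weakest of the three predictors is doing much better than chance.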