A Realtor is interested in modeling the selling price of houses based on the square footage and the age of the house. The data was collected in the two largest cities in Arkansas and is presented here.
Square footage X1 | Age in years X2 | Style | Selling price Y
775 | 37 | Traditional | 28,000
700 | 49 | Traditional | 34,000
720 | 54 | Traditional | 34,500
864 | 37 | Rambler | 39,900
650 | 35 | Traditional | 40,000
780 | 79 | Victorian | 41,500
900 | 48 | Traditional | 42,500
816 | 35 | Rambler | 53,500
1800 | 17 | Victorian | 57,000
1340 | 66 | Victorian | 59,000
1800 | 18 | Rambler | 59,500
1124 | 34 | Traditional | 62,000
2880 | 24 | Victorian | 68,500
1480 | 75 | Rambler | 72,500
1652 | 94 | Victorian | 70,000
2088 | 71 | Victorian | 73,112
1700 | 34 | Traditional | 76,780
1262 | 78 | Rambler | 77,350
1500 | 54 | Victorian | 85,590
1200 | 35 | Victorian | 79,900
650 | 45 | Traditional | 48,100
We need two indicator variables for the style of the house. I will choose Traditional as the base category.
Irambler = 1 if the house is a Rambler; 0 if not
Ivictor = 1 if the house is a Victorian; 0 if not
When entering the data, do not type the commas in the selling prices.
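The coding rules above can be sketched in a few lines of Python. This is only an illustration of the indicator scheme with Traditional as the base category; the function names `style_indicators` and `parse_price` are made up for this example and are not part of the original problem.

```python
def style_indicators(style):
    """Return (Irambler, Ivictor) for a style, with Traditional as the base."""
    s = style.strip().lower()
    if s not in {"traditional", "rambler", "victorian"}:
        raise ValueError(f"unknown style: {style!r}")
    # Traditional gets (0, 0) because it is the base category.
    return (int(s == "rambler"), int(s == "victorian"))


def parse_price(text):
    """Convert a price such as '73,112' to the integer 73112 (commas removed)."""
    return int(text.replace(",", ""))


print(style_indicators("Traditional"))  # base category
print(style_indicators("Rambler"))
print(style_indicators("Victorian"))
print(parse_price("73,112"))
```

Two indicators are enough for three styles: a house that is neither a Rambler nor a Victorian must be Traditional.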
1. Plot y vs. x1 and y vs. x2. Do you see any curvature in these two plots? If so, what does it suggest about the variables? What model do you think should be used?
2. Suppose someone wishes to use the regression model
Y = B0 + B1 X1 + B2 X2 + B3 X1^2 + B5 X1 X2 + B6 Irambler + B7 Ivictor + E
a) Write down the prediction equation.
b) Interpret R^2.
c) Test if the regression model is useful. (F-test)
1)
Both plots show a roughly linear trend.
The points are scattered without any obvious curvature, which suggests that first-order terms in X1 and X2 are adequate.
2) Using Excel:
SUMMARY OUTPUT
Regression Statistics
Multiple R | 0.820985
R Square | 0.674016
Adjusted R Square | 0.534308
Standard Error | 11801.35
Observations | 21
ANOVA
Source | df | SS | MS | F | Significance F
Regression | 6 | 4031480929 | 671913488.1 | 4.824476239 | 0.007212744
Residual | 14 | 1949805195 | 139271799.6
Total | 20 | 5981286124
Term | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95%
Intercept | 3439.088 | 30244.23111 | 0.11371054 | 0.911081829 | -61428.33623 | 68306.51192
X1 | 62.42002 | 32.17639197 | 1.939932185 | 0.072804956 | -6.591478554 | 131.4315154
X2 | -94.0395 | 479.2688606 | -0.196214598 | 0.84726185 | -1121.969016 | 933.8899223
X1^2 | -0.01555 | 0.007485261 | -2.077782585 | 0.056609328 | -0.031607032 | 0.000501543
X1*X2 | 0.118454 | 0.305604947 | 0.387604802 | 0.704137596 | -0.537003475 | 0.773911364
Rambler | 3081.234 | 7524.897006 | 0.409471859 | 0.688389479 | -13058.06531 | 19220.53244
Victorian | 3291.8 | 8165.320837 | 0.403144039 | 0.692931624 | -14221.07096 | 20804.6718
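For readers without Excel, the fit could be cross-checked with a short script. The sketch below is not how the output above was produced: it solves the normal equations (X'X)b = X'y with exact fractions from the Python standard library, and it assumes the data table in the question was transcribed faithfully. If so, the printed values should be close to the Coefficients column and to R Square. The helper names `design_row` and `solve_exact` are hypothetical.

```python
from fractions import Fraction

# (square footage X1, age X2, style, selling price Y), copied from the question
DATA = [
    (775, 37, "Traditional", 28000), (700, 49, "Traditional", 34000),
    (720, 54, "Traditional", 34500), (864, 37, "Rambler", 39900),
    (650, 35, "Traditional", 40000), (780, 79, "Victorian", 41500),
    (900, 48, "Traditional", 42500), (816, 35, "Rambler", 53500),
    (1800, 17, "Victorian", 57000), (1340, 66, "Victorian", 59000),
    (1800, 18, "Rambler", 59500), (1124, 34, "Traditional", 62000),
    (2880, 24, "Victorian", 68500), (1480, 75, "Rambler", 72500),
    (1652, 94, "Victorian", 70000), (2088, 71, "Victorian", 73112),
    (1700, 34, "Traditional", 76780), (1262, 78, "Rambler", 77350),
    (1500, 54, "Victorian", 85590), (1200, 35, "Victorian", 79900),
    (650, 45, "Traditional", 48100),
]
NAMES = ["Intercept", "X1", "X2", "X1^2", "X1*X2", "Rambler", "Victorian"]


def design_row(x1, x2, style):
    """Regressors of the proposed model for one house (Traditional is the base)."""
    return [1, x1, x2, x1 * x1, x1 * x2,
            int(style == "Rambler"), int(style == "Victorian")]


def solve_exact(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination on fractions."""
    n = len(b)
    m = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [row[n] for row in m]


X = [design_row(x1, x2, style) for x1, x2, style, _ in DATA]
y = [price for *_, price in DATA]
k = len(NAMES)

# Normal equations: every entry is an integer, so the solution is exact.
xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
beta = solve_exact(xtx, xty)

fitted = [sum(b * v for b, v in zip(beta, r)) for r in X]
y_bar = Fraction(sum(y), len(y))
sst = sum((yi - y_bar) ** 2 for yi in y)                # total sum of squares
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # residual sum of squares
r2 = 1 - sse / sst

for name, b in zip(NAMES, beta):
    print(f"{name:10s} {float(b):.6f}")
print(f"R Square   {float(r2):.6f}")
```

Exact fractions are used here only because the X1^2 and X1*X2 columns make the normal equations badly scaled for floating-point elimination; a statistics package would normally be the more practical choice.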
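a) Prediction equation, using the coefficients from the output:
Y-hat = 3439.088 + 62.42002 X1 - 94.0395 X2 - 0.01555 X1^2 + 0.118454 X1*X2 + 3081.234 Irambler + 3291.8 Ivictor
b) R-squared value: 0.674016
About 67.4% of the variation in selling price is explained by the regression model. The adjusted R-squared, 0.534308, is lower because it penalizes fitting six predictors to only 21 observations.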
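As a quick check on where these numbers come from, the summary statistics can be recomputed from the sums of squares in the ANOVA table. The sketch below only uses values printed in the Excel output; the variable names are chosen for this example.

```python
# Values taken from the ANOVA table in the Excel output
ss_regression = 4031480929
ss_residual = 1949805195
ss_total = 5981286124
n = 21  # observations
k = 6   # predictors, not counting the intercept

# R^2 is the share of total variation explained by the regression.
r2 = ss_regression / ss_total

# Adjusted R^2 compares mean squares, so it accounts for degrees of freedom.
adj_r2 = 1 - (ss_residual / (n - k - 1)) / (ss_total / (n - 1))

# The "Standard Error" line is the square root of the residual mean square.
std_error = (ss_residual / (n - k - 1)) ** 0.5

print(f"R Square          {r2:.6f}")
print(f"Adjusted R Square {adj_r2:.6f}")
print(f"Standard Error    {std_error:.2f}")
```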
c)
ANOVA
Source | df | SS | MS | F | Significance F
Regression | 6 | 4031480929 | 671913488.1 | 4.824476239 | 0.007212744
Residual | 14 | 1949805195 | 139271799.6
Total | 20 | 5981286124
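H0: all slope coefficients are zero; Ha: at least one slope coefficient is not zero.
The test statistic is F = 4.8245 with 6 and 14 degrees of freedom, and the p-value (Significance F) is 0.0072. Because the p-value is below the usual 0.05 level, we reject H0 and conclude that the model is useful: at least one of the predictors helps predict selling price.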
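The F statistic and its p-value can be reproduced from the ANOVA table with the standard library alone. This is a sketch under two assumptions: a 0.05 significance level, which the problem does not state, and a closed form for the upper tail of the F distribution that only works when both degrees of freedom are even (true here, 6 and 14). The function name `f_upper_tail` is hypothetical.

```python
from math import comb


def f_upper_tail(f_stat, df1, df2):
    """P(F > f_stat) for an F(df1, df2) variable; needs even df1 and df2."""
    if df1 % 2 or df2 % 2:
        raise ValueError("this closed form needs even degrees of freedom")
    a, b = df2 // 2, df1 // 2
    x = df2 / (df2 + df1 * f_stat)
    m = a + b - 1
    # The upper tail equals the regularized incomplete beta I_x(a, b), which for
    # integer a and b is the binomial tail P(Binomial(m, x) >= a).
    return sum(comb(m, j) * x**j * (1 - x) ** (m - j) for j in range(a, m + 1))


# Mean squares from the ANOVA table: SS divided by its degrees of freedom
ms_regression = 4031480929 / 6
ms_residual = 1949805195 / 14

f_stat = ms_regression / ms_residual
p_value = f_upper_tail(f_stat, 6, 14)
alpha = 0.05  # assumed significance level, not given in the problem

print(f"F = {f_stat:.4f}, p-value = {p_value:.6f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```

For general degrees of freedom, a statistics library's F distribution would be the usual route; the binomial identity is used here only to keep the example dependency-free.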