sales | sqft | adv_cost | inventory | distance | district_size | storecount |
231 | 1.47 | 7.62 | 897 | 10.9 | 79.48 | 40 |
232 | 1.53 | 9.57 | 892 | 9.4 | 51.154 | 12 |
156 | 1.68 | 8.37 | 542 | 7.9 | 60.358 | 41 |
157 | 1.355 | 6.73 | 552 | 6.8 | 55.561 | 68 |
10 | 1.33 | 1.66 | 242 | 3.5 | 89.624 | 14 |
10 | 1.33 | 1.17 | 235 | 3.6 | 86.898 | 62 |
519 | 1.89 | 12.96 | 3670 | 18.5 | 108.857 | 56 |
520 | 1.885 | 12.02 | 3657 | 19.1 | 100.685 | 75 |
437 | 1.7 | 12.29 | 3345 | 17.4 | 90.138 | 59 |
487 | 1.86 | 12.5 | 3322 | 16.5 | 111.284 | 22 |
299 | 1.4 | 9.86 | 1784 | 11.5 | 75.606 | 26 |
195 | 1.63 | 7.22 | 1230 | 9.8 | 64.245 | 27 |
20 | 1.24 | 5.23 | 483 | 2.4 | 55.929 | 11 |
68 | 1.51 | 3.93 | 114 | 4.5 | 73.187 | 33 |
428 | 1.78 | 11.04 | 2829 | 16.4 | 101.192 | 51 |
429 | 1.725 | 9.43 | 3410 | 15.7 | 80.694 | 16 |
464 | 1.72 | 12.19 | 2873 | 15.8 | 105.254 | 84 |
15 | 1.2 | 1.17 | 289 | 3.2 | 80.937 | 31 |
65 | 1.47 | 6.56 | 292 | 3.9 | 80.187 | 97 |
66 | 1.51 | 5.55 | 312 | 3.8 | 85.897 | 66 |
98 | 1.24 | 5.79 | 235 | 6.4 | 90.219 | 75 |
338 | 1.65 | 3.34 | 1160 | 12.1 | 121.988 | 84 |
249 | 1.513 | 2.23 | 1184 | 9.7 | 115.277 | 12 |
161 | 1.4 | 6.95 | 399 | 7.9 | 50.188 | 14 |
467 | 1.46 | 13.17 | 2062 | 16.1 | 101.211 | 89 |
398 | 1.84 | 11.68 | 2103 | 15.9 | 95.406 | 49 |
497 | 1.68 | 12.11 | 2743 | 18 | 80.195 | 14 |
528 | 1.94 | 10.98 | 3779 | 18 | 110.025 | 58 |
529 | 1.765 | 11.11 | 3916 | 18.9 | 103.26 | 52 |
99 | 1.31 | 4.35 | 782 | 4.8 | 111.732 | 52 |
100 | 1.525 | 3.79 | 804 | 4.7 | 99.7 | 41 |
1 | 1.45 | 4.68 | 1116 | 3.4 | 85.882 | 50 |
347 | 1.65 | 10.08 | 2223 | 13.4 | 94.181 | 49 |
348 | 1.811 | 7.87 | 2180 | 12.1 | 95.242 | 50 |
341 | 1.64 | 10.34 | 1494 | 14.3 | 70.693 | 28 |
557 | 1.66 | 13.55 | 3522 | 18.5 | 94.329 | 43 |
508 | 1.698 | 11.53 | 3521 | 16.7 | 99.917 | 50 |
In the “HomeSales” dataset, the response variable, sales, depends on six potential predictor variables: sqft, adv_cost, inventory, distance, district_size, and storecount. Fit four simple linear regression (SLR) models corresponding to the four predictors sqft, adv_cost, inventory, and distance. Then, for each model, create a normal probability plot and a histogram of the residuals, together with the two residual scatterplots: residuals vs. fitted values and residuals vs. observation order.
a) What do the residual plots for the model with sqft as the predictor indicate about the validity of this regression model and assumptions made about the errors?
b) What do the residual plots for the model with adv_cost as the predictor indicate about the validity of this regression model and assumptions made about the errors?
c) What do the residual plots for the model with inventory as the predictor indicate about the validity of this regression model and assumptions made about the errors?
d) What do the residual plots for the model with distance as the predictor indicate about the validity of this regression model and assumptions made about the errors?
e) One objective of this analysis is to obtain an appropriate simple linear regression model that can be used to estimate the average sales based on a single predictor. State your “best” choice based on your conclusions in parts (a)–(d).
f) Complete the table below, using the regression analysis results of the four simple linear regression models considered in parts (a)–(d). Based on the table entries, would you change your “best” choice from part (e)?
Model predictor | S | R2 | t-stat |
sqft | 110.75 | 66.44% | 8.32 |
adv_cost | | | |
inventory | | | |
distance | | | |
g) A model including the predictor variable adv_cost is of specific interest. Obtain appropriate residual plots and determine if adding either district_size or storecount as an additional predictor to the SLR model with predictor adv_cost is likely to improve its fit.
a) Model 1: sqft as the predictor of sales
Call:
lm(formula = sales ~ sqft, data = data)
Residuals:
Min 1Q Median 3Q Max
-200.740 -80.410 7.266 47.567 277.668
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -921.65 145.55 -6.332 2.82e-07 ***
sqft 760.94 91.41 8.324 8.16e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 110.8 on 35 degrees of freedom
Multiple R-squared: 0.6644, Adjusted R-squared: 0.6548
F-statistic: 69.29 on 1 and 35 DF, p-value: 8.161e-10
Residual plots
In general, the residuals should be scattered symmetrically around zero, with no visible pattern against the fitted values or the observation order.
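A minimal R sketch of how the four requested residual plots can be drawn for Model 1, assuming the sales and sqft columns are typed in from the data table in the question. The object names `fit` and `res` and the 2-by-2 plot layout are choices made for this illustration only.

```r
# Columns typed in from the data table in the question (37 stores).
sales <- c(231, 232, 156, 157, 10, 10, 519, 520, 437, 487, 299, 195, 20, 68,
           428, 429, 464, 15, 65, 66, 98, 338, 249, 161, 467, 398, 497, 528,
           529, 99, 100, 1, 347, 348, 341, 557, 508)
sqft <- c(1.47, 1.53, 1.68, 1.355, 1.33, 1.33, 1.89, 1.885, 1.7, 1.86,
          1.4, 1.63, 1.24, 1.51, 1.78, 1.725, 1.72, 1.2, 1.47, 1.51,
          1.24, 1.65, 1.513, 1.4, 1.46, 1.84, 1.68, 1.94, 1.765, 1.31,
          1.525, 1.45, 1.65, 1.811, 1.64, 1.66, 1.698)

# Fit the simple linear regression and extract its residuals.
fit <- lm(sales ~ sqft)
res <- resid(fit)

# Draw the four diagnostic plots on one page.
op <- par(mfrow = c(2, 2))

# 1. Normal probability plot: points near the line support normal errors.
qqnorm(res, main = "Normal probability plot")
qqline(res)

# 2. Histogram: should look roughly bell-shaped and centred at zero.
hist(res, main = "Histogram of residuals", xlab = "Residual")

# 3. Residuals vs. fitted: curvature or a funnel shape signals a problem
#    with linearity or constant variance.
plot(fitted(fit), res, xlab = "Fitted value", ylab = "Residual",
     main = "Residuals vs. fitted values")
abline(h = 0, lty = 2)

# 4. Residuals vs. observation order: a trend or cycle suggests the errors
#    are not independent.
plot(seq_along(res), res, type = "b", xlab = "Observation order",
     ylab = "Residual", main = "Residuals vs. observation order")
abline(h = 0, lty = 2)

par(op)
```

The same pattern applies to the other three models by swapping in adv_cost, inventory, or distance as the predictor.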
b) Model 2: adv_cost as the predictor of sales
Call:
lm(formula = sales ~ adv_cost, data = data)
Residuals:
Min 1Q Median 3Q Max
-147.37 -56.78 -17.33 40.86 265.56
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -72.712 37.047 -1.963 0.0577 .
adv_cost 43.458 4.145 10.486 2.42e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 93.94 on 35 degrees of freedom
Multiple R-squared: 0.7585, Adjusted R-squared: 0.7516
F-statistic: 109.9 on 1 and 35 DF, p-value: 2.417e-12
c) Model 3: inventory as the predictor of sales
Call:
lm(formula = sales ~ inventory, data = data)
Residuals:
Min 1Q Median 3Q Max
-195.594 -46.620 -0.477 37.107 142.349
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 45.524456 18.640588 2.442 0.0198 *
inventory 0.135367 0.008634 15.679 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 67.49 on 35 degrees of freedom
Multiple R-squared: 0.8754, Adjusted R-squared: 0.8718
F-statistic: 245.8 on 1 and 35 DF, p-value: < 2.2e-16
d) Model 4: distance as the predictor of sales
Call:
lm(formula = sales ~ distance, data = data)
Residuals:
Min 1Q Median 3Q Max
-48.421 -21.467 -0.902 24.457 45.440
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -82.838 9.862 -8.40 6.59e-10 ***
distance 32.659 0.791 41.29 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 27.12 on 35 degrees of freedom
Multiple R-squared: 0.9799, Adjusted R-squared: 0.9793
F-statistic: 1705 on 1 and 35 DF, p-value: < 2.2e-16
e) Based on the residual plots of the four models above, Model 4, i.e., the model with distance as the predictor, seems to be the best model.
f)
Model predictor | S | R-Square | t-stat |
sqft | 110.75 | 66.44% | 8.32 |
adv_cost | 93.94 | 75.85% | 10.486 |
inventory | 67.49 | 87.54% | 15.679 |
distance | 27.12 | 97.99% | 41.29 |
Based on the above table, the model with distance as the predictor is the best of the four: it has the smallest S (27.12), the highest R-squared (97.99%), and the largest t-statistic (41.29). The “best” choice from part (e) therefore does not change.
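As a check on the table entries, the t-statistic in each printout is the slope estimate divided by its standard error, and in a simple linear regression the F-statistic equals the square of that t. A short R sketch using only the numbers printed in the four summaries above (the data frame name `out` and its column names are made up for this illustration):

```r
# Values copied from the four summary() printouts above.
out <- data.frame(
  predictor = c("sqft", "adv_cost", "inventory", "distance"),
  estimate  = c(760.94, 43.458, 0.135367, 32.659),   # slope estimates
  std_error = c(91.41, 4.145, 0.008634, 0.791),      # slope standard errors
  S         = c(110.8, 93.94, 67.49, 27.12),         # residual standard errors
  R2        = c(0.6644, 0.7585, 0.8754, 0.9799)      # multiple R-squared
)

# t-statistic for the slope: estimate divided by its standard error.
out$t_stat <- out$estimate / out$std_error

# With a single predictor, the overall F-statistic is the square of t.
out$F_stat <- out$t_stat^2

# List the models from smallest to largest residual standard error.
print(out[order(out$S), ])
```

Because the printed estimates and standard errors are rounded, the recomputed t and F values agree with the printouts only to about two decimal places.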
g) The model with adv_cost and district_size as predictors of sales gives the following results:
Call:
lm(formula = sales ~ adv_cost + district_size, data = data)
Residuals:
Min 1Q Median 3Q Max
-130.830 -42.406 0.737 45.658 136.643
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -355.9997 57.8682 -6.152 5.47e-07 ***
adv_cost 40.9861 3.0788 13.312 4.83e-15 ***
district_size 3.4468 0.6213 5.548 3.33e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 69.05 on 34 degrees of freedom
Multiple R-squared: 0.8733, Adjusted R-squared: 0.8658
F-statistic: 117.1 on 2 and 34 DF, p-value: 5.613e-16
Residual plots
The model with adv_cost and storecount as predictors of sales gives the following results:
Call:
lm(formula = sales ~ adv_cost + storecount, data = data)
Residuals:
Min 1Q Median 3Q Max
-151.68 -55.02 -18.10 41.50 262.08
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -75.89629 45.75969 -1.659 0.106
adv_cost 43.38473 4.24679 10.216 6.73e-12 ***
storecount 0.08221 0.67404 0.122 0.904
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 95.29 on 34 degrees of freedom
Multiple R-squared: 0.7586, Adjusted R-squared: 0.7444
F-statistic: 53.43 on 2 and 34 DF, p-value: 3.202e-11
Residual plots
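One common way to judge a candidate predictor before adding it is to plot the residuals of the adv_cost-only model against that candidate. A minimal R sketch, assuming the four columns are typed in from the data table in the question (the object names `slr`, `res`, and `r_candidates` are made up for this illustration):

```r
# Columns typed in from the data table in the question (37 stores).
sales <- c(231, 232, 156, 157, 10, 10, 519, 520, 437, 487, 299, 195, 20, 68,
           428, 429, 464, 15, 65, 66, 98, 338, 249, 161, 467, 398, 497, 528,
           529, 99, 100, 1, 347, 348, 341, 557, 508)
adv_cost <- c(7.62, 9.57, 8.37, 6.73, 1.66, 1.17, 12.96, 12.02, 12.29, 12.5,
              9.86, 7.22, 5.23, 3.93, 11.04, 9.43, 12.19, 1.17, 6.56, 5.55,
              5.79, 3.34, 2.23, 6.95, 13.17, 11.68, 12.11, 10.98, 11.11, 4.35,
              3.79, 4.68, 10.08, 7.87, 10.34, 13.55, 11.53)
district_size <- c(79.48, 51.154, 60.358, 55.561, 89.624, 86.898, 108.857,
                   100.685, 90.138, 111.284, 75.606, 64.245, 55.929, 73.187,
                   101.192, 80.694, 105.254, 80.937, 80.187, 85.897, 90.219,
                   121.988, 115.277, 50.188, 101.211, 95.406, 80.195, 110.025,
                   103.26, 111.732, 99.7, 85.882, 94.181, 95.242, 70.693,
                   94.329, 99.917)
storecount <- c(40, 12, 41, 68, 14, 62, 56, 75, 59, 22, 26, 27, 11, 33,
                51, 16, 84, 31, 97, 66, 75, 84, 12, 14, 89, 49, 14, 58,
                52, 52, 41, 50, 49, 50, 28, 43, 50)

# SLR with adv_cost only; its residuals hold what adv_cost does not explain.
slr <- lm(sales ~ adv_cost)
res <- resid(slr)

# A visible trend in a panel suggests that predictor carries extra
# information about sales; a patternless cloud suggests it does not.
op <- par(mfrow = c(1, 2))
plot(district_size, res, xlab = "district_size",
     ylab = "Residual (sales ~ adv_cost)")
abline(h = 0, lty = 2)
plot(storecount, res, xlab = "storecount",
     ylab = "Residual (sales ~ adv_cost)")
abline(h = 0, lty = 2)
par(op)

# Numeric companion to the plots: correlation of the SLR residuals
# with each candidate predictor.
r_candidates <- c(district_size = cor(res, district_size),
                  storecount    = cor(res, storecount))
print(round(r_candidates, 3))
```

These plots are a screening step only; the two-predictor fits above remain the direct evidence of whether each variable helps.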
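Comparing the two models, adding storecount does not improve the fit: its coefficient is not significant (t = 0.122, p = 0.904), R-squared barely changes (75.85% to 75.86%), and the residual standard error rises from 93.94 to 95.29.
Adding district_size, on the other hand, improves the model substantially: its coefficient is highly significant (t = 5.548, p = 3.33e-06), R-squared rises from 75.85% to 87.33%, and the residual standard error falls from 93.94 to 69.05. The improvement can also be seen in the residual plots.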
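The comparison can also be put as a partial F-test for adding one predictor to the adv_cost model, computed from the R-squared values alone. A small R sketch using only numbers from the printouts above (the variable names are made up for this illustration, and n = 37 is inferred from the 35 residual degrees of freedom of the SLR):

```r
# Numbers copied from the three lm() printouts above.
n          <- 37        # 35 residual df in the SLR + 2 estimated coefficients
r2_reduced <- 0.7585    # sales ~ adv_cost
r2_full    <- c(district_size = 0.8733,   # sales ~ adv_cost + district_size
                storecount    = 0.7586)   # sales ~ adv_cost + storecount
df_resid   <- n - 3     # residual df of each two-predictor model (34)

# Partial F for adding one predictor:
# F = (R2_full - R2_reduced) / ((1 - R2_full) / df_resid)
partial_F <- (r2_full - r2_reduced) / ((1 - r2_full) / df_resid)

# Upper-tail p-value from the F distribution with 1 and df_resid df.
p_value <- pf(partial_F, df1 = 1, df2 = df_resid, lower.tail = FALSE)

print(round(rbind(partial_F, p_value), 4))
```

Each partial F should agree, up to rounding of the printed R-squared values, with the square of the t value reported for the added predictor.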