Use the data set named Store_Visits located in the folder Data Files for HW Assignment (outside of Minitab folder) in the K-drive. The response variable y is the number of visits of a customer to a particular food store in a large suburban area within the period of a month, and the independent variable x is the distance (in miles) of the customer’s home to the store.
Fit a simple linear regression model to the data, and answer the following questions.
a) Give the proportion of the variation in the number of visits per month of a customer explained by the distance of the customer’s home to the store.
b) Submit the residual plot. It appears from the plot that there is a problem with one of the model assumptions. Which one is it, and what would you suggest to remedy the problem?
c) Carry out your suggestion to fix the problem of part (b) and submit a new residual plot. Does your suggested remedy work?
d) Based on your new model, what is the proportion of the variation in the number of visits per month of a customer explained by the distance of the customer’s home to the store? How does it compare to that of the original model?
e) Based on your new model, construct a 95% prediction interval for y, the number of visits to the store for a customer who lives 2.5 miles from the store. Interpret the P.I.
K-drive data (entered in Minitab):
y x
12 0.8
5 1.2
6 2.3
8 1.5
3 3.2
2 6.3
1 7.9
2 5.3
6 1.5
3 1.9
10 1.7
5 2.6
3 2.9
6 4.2
2 3.9
4 3.1
3 5.8
6 1.7
7 2.2
2 4.5
1 6.1
1 5.8
1 7.4
3 6.4
2 4.7
2 3.9
3 4
4 4.6
The response variable y is the number of visits of a customer to a particular food store in a large suburban area within the period of a month, and the independent variable x is the distance (in miles) of the customer’s home to the store.
Using Minitab, fit a simple linear regression model to the data.
Steps
1) Enter the given data in Minitab columns.
2) Select Stat > Regression > Regression > Fit Regression Model.
3) In the dialog box, set Responses to the column of variable y and Continuous predictors to the column of variable x.
4) Under Graphs, select the plots you need (here, the residual plots), and under Storage, select anything you want stored (here, the residuals).
5) Click OK.
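For readers without Minitab, the same least-squares fit can be sketched in plain Python. This is only an illustration of the arithmetic behind the Minitab steps, not Minitab's own procedure; the function name fit_line and the variable names are assumptions made for this sketch, and the data are the 28 (y, x) pairs listed in the question.

```python
# Store_Visits data as listed in the question:
# y = visits per month, x = distance from home to store (miles).
y = [12, 5, 6, 8, 3, 2, 1, 2, 6, 3, 10, 5, 3, 6,
     2, 4, 3, 6, 7, 2, 1, 1, 1, 3, 2, 2, 3, 4]
x = [0.8, 1.2, 2.3, 1.5, 3.2, 6.3, 7.9, 5.3, 1.5, 1.9, 1.7, 2.6, 2.9, 4.2,
     3.9, 3.1, 5.8, 1.7, 2.2, 4.5, 6.1, 5.8, 7.4, 6.4, 4.7, 3.9, 4.0, 4.6]


def fit_line(xs, ys):
    """Least-squares fit ys = b0 + b1 * xs (hypothetical helper).

    Returns (b0, b1, r_sq, residuals).
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((xi - x_bar) ** 2 for xi in xs)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    b1 = sxy / sxx                      # slope
    b0 = y_bar - b1 * x_bar             # intercept
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(xs, ys)]
    sse = sum(e ** 2 for e in residuals)
    sst = sum((yi - y_bar) ** 2 for yi in ys)
    return b0, b1, 1 - sse / sst, residuals


b0, b1, r_sq, residuals = fit_line(x, y)
print(f"y = {b0:.3f} + ({b1:.3f}) x,  R-sq = {100 * r_sq:.2f}%")
```

The printed equation and R-sq can be compared with the Regression Equation and Model Summary in the Minitab output, and the residuals list with the RESI1 column.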
Here is the data, copied from Minitab together with the stored residuals (RESI1):
y | x | RESI1 |
12 | 0.8 | 4.721378 |
5 | 1.2 | -1.85132 |
6 | 2.3 | 0.323756 |
8 | 1.5 | 1.469155 |
3 | 3.2 | -1.71482 |
2 | 6.3 | 0.596763 |
1 | 7.9 | 1.305966 |
2 | 5.3 | -0.47149 |
6 | 1.5 | -0.53085 |
3 | 1.9 | -3.10354 |
10 | 1.7 | 3.682805 |
5 | 2.6 | -0.35577 |
3 | 2.9 | -2.03529 |
6 | 4.2 | 2.353435 |
2 | 3.9 | -1.96704 |
4 | 3.1 | -0.82164 |
3 | 5.8 | 1.062638 |
6 | 1.7 | -0.31719 |
7 | 2.2 | 1.216931 |
2 | 4.5 | -1.32609 |
1 | 6.1 | -0.61689 |
1 | 5.8 | -0.93736 |
1 | 7.4 | 0.77184 |
3 | 6.4 | 1.703589 |
2 | 4.7 | -1.11244 |
2 | 3.9 | -1.96704 |
3 | 4 | -0.86022 |
4 | 4.6 | 0.780735 |
And this is the Minitab output:
Regression Analysis: y versus x
Analysis of Variance
Source | DF | Seq SS | Contribution | Adj SS | Adj MS | F-Value | P-Value |
Regression | 1 | 122.25 | 58.50% | 122.25 | 122.246 | 36.65 | 0.000 |
x | 1 | 122.25 | 58.50% | 122.25 | 122.246 | 36.65 | 0.000 |
Error | 26 | 86.72 | 41.50% | 86.72 | 3.335 | | |
Lack-of-Fit | 22 | 74.72 | 35.76% | 74.72 | 3.396 | 1.13 | 0.510 |
Pure Error | 4 | 12.00 | 5.74% | 12.00 | 3.000 | | |
Total | 27 | 208.96 | 100.00% | | | | |
Model Summary
S | R-sq | R-sq(adj) | PRESS | R-sq(pred) |
1.82628 | 58.50% | 56.90% | 103.156 | 50.63% |
Coefficients
Term | Coef | SE Coef | 95% CI | T-Value | P-Value | VIF |
Constant | 8.133 | 0.760 | (6.572, 9.695) | 10.71 | 0.000 | |
x | -1.068 | 0.176 | (-1.431, -0.706) | -6.05 | 0.000 | 1.00 |
Regression Equation
y = 8.133 - 1.068 x
Fits and Diagnostics for Unusual Observations
Obs | y | Fit | SE Fit | 95% CI | Resid | Std Resid | Del Resid | HI | Cook’s D | DFITS | |
1 | 12.000 | 7.279 | 0.637 | (5.969, 8.588) | 4.721 | 2.76 | 3.22 | 0.121741 | 0.53 | 1.19750 | R |
11 | 10.000 | 6.317 | 0.511 | (5.267, 7.368) | 3.683 | 2.10 | 2.26 | 0.078294 | 0.19 | 0.65879 | R |
R Large residual
[Figure: Residual Plots for y]
a)
The proportion of the variation in the number of visits per month of a customer explained by the distance of the customer’s home to the store is R-sq = 58.50%, read from the Model Summary table:
Model Summary
S | R-sq | R-sq(adj) | PRESS | R-sq(pred) |
1.82628 | 58.50% | 56.90% | 103.156 | 50.63% |
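b) Residual plot: see Residual Plots for y above.
It appears from the plot that there is a problem with one of the model assumptions.
The plot shows a roughly U-shaped pattern, i.e. the residuals are not randomly scattered about zero. The assumption of a straight-line relationship with constant error variance is therefore not satisfied.
This suggests that a non-linear model, obtained by transforming the response, would fit better.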
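The U shape can also be checked numerically when the plot is not at hand. The sketch below refits the straight line and averages the residuals within three distance bands; the band cut-offs (2.5 and 5 miles) are arbitrary choices made for this illustration, not part of the assignment. Positive averages at both ends with a negative average in the middle are what a U-shaped residual plot looks like in numbers.

```python
from statistics import mean

# Store_Visits data as listed in the question.
y = [12, 5, 6, 8, 3, 2, 1, 2, 6, 3, 10, 5, 3, 6,
     2, 4, 3, 6, 7, 2, 1, 1, 1, 3, 2, 2, 3, 4]
x = [0.8, 1.2, 2.3, 1.5, 3.2, 6.3, 7.9, 5.3, 1.5, 1.9, 1.7, 2.6, 2.9, 4.2,
     3.9, 3.1, 5.8, 1.7, 2.2, 4.5, 6.1, 5.8, 7.4, 6.4, 4.7, 3.9, 4.0, 4.6]

# Straight-line least-squares fit of y on x.
x_bar, y_bar = mean(x), mean(y)
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Hypothetical distance bands (miles), chosen only for this illustration.
bands = {
    "under 2.5": (0.0, 2.5),
    "2.5 to 5": (2.5, 5.0),
    "5 and over": (5.0, float("inf")),
}

# Average residual within each band.
band_means = {}
for label, (lo, hi) in bands.items():
    in_band = [e for xi, e in zip(x, residuals) if lo <= xi < hi]
    band_means[label] = mean(in_band)
    print(f"{label:>10}: n = {len(in_band):2d}, mean residual = {band_means[label]:+.3f}")
```

This is a rough companion to the plot rather than a formal test; the residual plot remains the primary diagnostic.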
c)
The suggestion to fix the problem of part (b) is to apply a nonlinear transformation to the response.
A nonlinear transformation changes (increases or decreases) the linear relationship between variables and thus changes the correlation between them.
Steps in Minitab
1) Use the same data in the Minitab columns.
2) Select Stat > Regression > Regression > Fit Regression Model.
3) In the dialog box, set Responses to the column of variable y and Continuous predictors to the column of variable x.
4) Under Graphs, select the plots you need (here, the residual plots), and under Storage, select anything you want stored (here, the residuals).
5) Select Options; you will see a box showing "No transformation". Change it to "λ = 0.5 (square root)" for the square root transformation.
6) Click OK.
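The effect of step 5 can be sketched outside Minitab by regressing the square root of y on x. This assumes, as the Minitab output for this model indicates (regression equation in y^0.5), that the λ = 0.5 option simply replaces y by y^0.5; all variable names are illustrative.

```python
import math

# Store_Visits data as listed in the question.
y = [12, 5, 6, 8, 3, 2, 1, 2, 6, 3, 10, 5, 3, 6,
     2, 4, 3, 6, 7, 2, 1, 1, 1, 3, 2, 2, 3, 4]
x = [0.8, 1.2, 2.3, 1.5, 3.2, 6.3, 7.9, 5.3, 1.5, 1.9, 1.7, 2.6, 2.9, 4.2,
     3.9, 3.1, 5.8, 1.7, 2.2, 4.5, 6.1, 5.8, 7.4, 6.4, 4.7, 3.9, 4.0, 4.6]

# Transformed response y' = y^0.5 (square root transformation).
y_t = [math.sqrt(v) for v in y]

n = len(x)
x_bar = sum(x) / n
yt_bar = sum(y_t) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - yt_bar) for xi, yi in zip(x, y_t))

b1 = sxy / sxx                # slope on the square-root scale
b0 = yt_bar - b1 * x_bar      # intercept on the square-root scale

# Residuals on the transformed scale: e = y' - fitted y'.
resid_t = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y_t)]
sse = sum(e ** 2 for e in resid_t)
sst = sum((yi - yt_bar) ** 2 for yi in y_t)
r_sq = 1 - sse / sst

print(f"y^0.5 = {b0:.3f} + ({b1:.4f}) x,  R-sq = {100 * r_sq:.2f}%")
```

The resid_t list corresponds to the RESI2 column, so the new residual plot can be redrawn from it.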
Here RESI2 is the residual column on the transformed scale, $e = y' - \hat{y}'$, where $y' = y^{0.5}$:
y | x | RESI2 |
12 | 0.8 | 0.745803 |
5 | 1.2 | -0.37464 |
6 | 2.3 | 0.134675 |
8 | 1.5 | 0.298421 |
3 | 3.2 | -0.34067 |
2 | 6.3 | 0.175358 |
1 | 7.9 | 0.191528 |
2 | 5.3 | -0.09363 |
6 | 1.5 | -0.08052 |
3 | 1.9 | -0.69036 |
10 | 1.7 | 0.68607 |
5 | 2.6 | 0.001951 |
3 | 2.9 | -0.42137 |
6 | 4.2 | 0.645756 |
2 | 3.9 | -0.47022 |
4 | 3.1 | -0.09962 |
3 | 5.8 | 0.358701 |
6 | 1.7 | -0.02672 |
7 | 2.2 | 0.304038 |
2 | 4.5 | -0.30882 |
1 | 6.1 | -0.29265 |
1 | 5.8 | -0.37335 |
1 | 7.4 | 0.057034 |
3 | 6.4 | 0.520095 |
2 | 4.7 | -0.25503 |
2 | 3.9 | -0.47022 |
3 | 4 | -0.12548 |
4 | 4.6 | 0.303862 |
Minitab Output
Regression Analysis: y versus x
Method
Box-Cox transformation λ = 0.5
Analysis of Variance for Transformed Response
Source | DF | Seq SS | Contribution | Adj SS | Adj MS | F-Value | P-Value |
Regression | 1 | 7.7510 | 66.04% | 7.7510 | 7.7510 | 50.56 | 0.000 |
x | 1 | 7.7510 | 66.04% | 7.7510 | 7.7510 | 50.56 | 0.000 |
Error | 26 | 3.9856 | 33.96% | 3.9856 | 0.1533 | | |
Lack-of-Fit | 22 | 3.3918 | 28.90% | 3.3918 | 0.1542 | 1.04 | 0.553 |
Pure Error | 4 | 0.5938 | 5.06% | 0.5938 | 0.1484 | | |
Total | 27 | 11.7366 | 100.00% | | | | |
Model Summary for Transformed Response
S | R-sq | R-sq(adj) | PRESS | R-sq(pred) |
0.391525 | 66.04% | 64.74% | 4.64139 | 60.45% |
Coefficients for Transformed Response
Term | Coef | SE Coef | 95% CI | T-Value | P-Value | VIF |
Constant | 2.933 | 0.163 | (2.599, 3.268) | 18.01 | 0.000 | |
x | -0.2690 | 0.0378 | (-0.3467, -0.1912) | -7.11 | 0.000 | 1.00 |
Regression Equation
y^0.5 = 2.933 - 0.2690 x
Fits and Diagnostics for Unusual Observations
Original Response
Obs | y | Fit | 95% CI |
1 | 12.0000 | 7.3891 | (5.9414, 8.9946) |
Transformed Response
Obs | y' | Fit | SE Fit | 95% CI | Resid | Std Resid | Del Resid | HI | Cook’s D | DFITS | |
1 | 3.464 | 2.718 | 0.137 | (2.437, 2.999) | 0.746 | 2.03 | 2.17 | 0.121741 | 0.29 | 0.809136 | R |
y' = transformed response
R Large residual
New residual plot:
[Figure: Residual Plots for y]
The residuals in the new plot are scattered randomly with no clear pattern, so the variance now appears constant.
Yes, the suggested remedy (the square root transformation) works.
d) For the new model, the proportion of the variation in the number of visits per month of a customer explained by the distance of the customer’s home to the store is 66.04%. This R-sq measures variation in the transformed response y^0.5.
Model Summary for Transformed Response
S | R-sq | R-sq(adj) | PRESS | R-sq(pred) |
0.391525 | 66.04% | 64.74% | 4.64139 | 60.45% |
Compared with the original model, the variation explained has increased by about 7.5 percentage points (66.04% - 58.50% = 7.54%):
For the original model, R-sq = 58.50%
For the new model, R-sq = 66.04%
e) Based on your new model, construct a 95% prediction interval for y, the number of visits to the store for a customer who lives 2.5 miles from the store. Interpret the P.I.
For the new model, the 95% confidence interval for the coefficient of x is (-0.3467, -0.1912), and that for the constant (intercept) is (2.599, 3.268):
Coefficients for Transformed Response
Term | Coef | SE Coef | 95% CI | T-Value | P-Value | VIF |
Constant | 2.933 | 0.163 | (2.599, 3.268) | 18.01 | 0.000 | |
x | -0.2690 | 0.0378 | (-0.3467, -0.1912) | -7.11 | 0.000 | 1.00 |
Our new model is
Regression Equation
y^0.5 = 2.933 - 0.2690 x
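A prediction interval (P.I.) is an interval estimate in which a future observation will fall with a stated probability (here 0.95).
Construct a 95% prediction interval for y, the number of visits to the store for a customer who lives x' = 2.5 miles from the store.
The model is fitted on the square-root scale, so write $y' = y^{0.5}$:
$\hat{y}' = 2.933 - 0.2690\,x$
For x' = 2.5:
$\hat{y}' = 2.933 - 0.2690 \times 2.5 = 2.2605$, so $\hat{y} = (2.2605)^2 = 5.10986$
Thus the predicted number of visits is 5.10986 at x = 2.5.
The 95% prediction interval for y' is given by
$\hat{y}' \pm t_{\alpha/2,\,n-k-1} \sqrt{S^2 + SE(\hat{y}')^2}$
Here n = 28 is the number of observations and k = 1 is the number of independent variables, so the degrees of freedom are n - k - 1 = 26.
At the 5% level of significance,
$t_{\alpha/2,\,n-k-1} = t_{0.025,\,26} = 2.0555$
You can find it from software such as Minitab or R, or from statistical tables.
The standard error of the fit is
$SE(\hat{y}') = S \sqrt{\dfrac{1}{n} + \dfrac{(x' - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$
with
n = 28
x' = 2.5
$\bar{x}$ = mean(x) = 3.835714 (can be calculated manually)
$\sum (x_i - \bar{x})^2$ = 107.1243
S = 0.391525 (from the Model Summary for Transformed Response table), hence $S^2 = (0.391525)^2 = 0.1532918$
Then
$SE(\hat{y}') = 0.391525 \sqrt{\dfrac{1}{28} + \dfrac{(2.5 - 3.835714)^2}{107.1243}} = 0.391525 \sqrt{0.052369} = 0.08960$
and $SE(\hat{y}')^2 = 0.008028$.
Hence the 95% prediction interval for y' at x = 2.5 is
$2.2605 \pm 2.0555 \sqrt{0.1532918 + 0.008028} = 2.2605 \pm 2.0555 \times 0.40165 = 2.2605 \pm 0.8256 = (1.4349,\ 3.0861)$
The standard error is on the square-root scale, so the interval is formed for y' and its endpoints are then squared to return to the number of visits:
P.I. for y = $(1.4349^2,\ 3.0861^2) \approx (2.06,\ 9.52)$
Interpretation: we can be 95% confident that a customer who lives 2.5 miles from the store will make between about 2 and 10 visits to the store in a month.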
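The prediction-interval arithmetic can be sketched in plain Python as a cross-check. This sketch assumes the interval is formed on the square-root scale (where S and the standard error of the fit are defined) and that its endpoints are then squared to return to visits. The critical value t(0.025, 26) = 2.0555 is hard-coded from a t table to avoid extra dependencies, and all variable names are illustrative.

```python
import math

# Store_Visits data as listed in the question.
y = [12, 5, 6, 8, 3, 2, 1, 2, 6, 3, 10, 5, 3, 6,
     2, 4, 3, 6, 7, 2, 1, 1, 1, 3, 2, 2, 3, 4]
x = [0.8, 1.2, 2.3, 1.5, 3.2, 6.3, 7.9, 5.3, 1.5, 1.9, 1.7, 2.6, 2.9, 4.2,
     3.9, 3.1, 5.8, 1.7, 2.2, 4.5, 6.1, 5.8, 7.4, 6.4, 4.7, 3.9, 4.0, 4.6]

T_CRIT = 2.0555   # assumption: t(0.025, 26) taken from a t table
x_new = 2.5       # distance (miles) of the new customer

# Fit y' = y^0.5 on x by least squares.
y_t = [math.sqrt(v) for v in y]
n = len(x)
x_bar = sum(x) / n
yt_bar = sum(y_t) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - yt_bar) for xi, yi in zip(x, y_t)) / sxx
b0 = yt_bar - b1 * x_bar

# Residual standard deviation S on the transformed scale (n - 2 df).
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y_t))
s = math.sqrt(sse / (n - 2))

# Point prediction and standard errors at x_new, still on the sqrt scale.
fit_t = b0 + b1 * x_new
se_fit = s * math.sqrt(1 / n + (x_new - x_bar) ** 2 / sxx)
se_pred = math.sqrt(s ** 2 + se_fit ** 2)

lo_t = fit_t - T_CRIT * se_pred
hi_t = fit_t + T_CRIT * se_pred

# Back-transform by squaring; this is valid here because lo_t is positive.
lo, hi = lo_t ** 2, hi_t ** 2
print(f"sqrt scale: {fit_t:.4f} +/- {T_CRIT * se_pred:.4f}")
print(f"95% P.I. for y at x = {x_new}: ({lo:.2f}, {hi:.2f})")
```

Because the code uses unrounded coefficients, its endpoints may differ in the second decimal from a hand calculation based on the rounded equation y^0.5 = 2.933 - 0.2690 x.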