Use the data set named Store_Visits located in the folder Data Files for HW Assignment (outside of Minitab folder) in the K-drive. The response variable y is the number of visits of a customer to a particular food store in a large suburban area within the period of a month, and the independent variable x is the distance (in miles) of the customer’s home to the store.
Fit a simple linear regression model to the data, and answer the following questions.
a) Give the proportion of the variation in the number of visits per month of a customer explained by the distance of the customer’s home to the store.
b) Submit the residual plot. It appears from the plot that there is a problem with one of the model assumptions. Which one is it, and what would you suggest to remedy the problem?
c) Carry out your suggestion to fix the problem of part (b) and submit a new residual plot. Does your suggested remedy work?
d) Based on your new model, what is the proportion of the variation in the number of visits per month of a customer explained by the distance of the customer’s home to the store? How does it compare to that of the original model?
e) Based on your new model, construct a 95% prediction interval for y, the number of visits to the store for a customer who lives 2.5 miles from the store. Interpret the P.I.
K-drive data (Minitab):
y   x
12   0.8
5   1.2
6   2.3
8   1.5
3   3.2
2   6.3
1   7.9
2   5.3
6   1.5
3   1.9
10   1.7
5   2.6
3   2.9
6   4.2
2   3.9
4   3.1
3   5.8
6   1.7
7   2.2
2   4.5
1   6.1
1   5.8
1   7.4
3   6.4
2   4.7
2   3.9
3   4
4   4.6
The response variable y is the number of visits of a customer to a particular food store in a large suburban area within the period of a month, and the independent variable x is the distance (in miles) of the customer’s home to the store.
Using Minitab, fit a simple linear regression model to the data.
Steps
1) Enter the given data in Minitab columns.
2) Select Stat > Regression > Regression > Fit Regression Model.
3) In the dialog box, set Responses to the column of variable y and Continuous predictors to the column of variable x.
4) Select the graphs you need (here, the residual plots) and, if wanted, storage (for example, the residuals).
5) Click OK.
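For readers without Minitab, the same least-squares fit can be sketched in plain Python. This is only a minimal sketch, not Minitab's own procedure: the helper name `fit_line` is hypothetical, and the data are typed in from the listing in the question.

```python
# Store_Visits data as listed in the question: visits y and distance x (miles)
y = [12, 5, 6, 8, 3, 2, 1, 2, 6, 3, 10, 5, 3, 6,
     2, 4, 3, 6, 7, 2, 1, 1, 1, 3, 2, 2, 3, 4]
x = [0.8, 1.2, 2.3, 1.5, 3.2, 6.3, 7.9, 5.3, 1.5, 1.9, 1.7, 2.6, 2.9, 4.2,
     3.9, 3.1, 5.8, 1.7, 2.2, 4.5, 6.1, 5.8, 7.4, 6.4, 4.7, 3.9, 4.0, 4.6]


def fit_line(xs, ys):
    """Least-squares fit of ys = b0 + b1 * xs (hypothetical helper).

    Returns the intercept b0, the slope b1 and R-squared.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((xi - x_bar) ** 2 for xi in xs)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(xs, ys))
    sst = sum((yi - y_bar) ** 2 for yi in ys)
    return b0, b1, 1 - sse / sst


b0, b1, r2 = fit_line(x, y)
print(f"y = {b0:.3f} + ({b1:.3f}) x,  R-sq = {100 * r2:.2f}%")
```

The printed coefficients and R-sq can be compared with the Regression Equation and Model Summary in the Minitab output.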
These are our data (copied from Minitab, with the stored residuals in column RESI1):
| y | x | RESI1 | 
| 12 | 0.8 | 4.721378 | 
| 5 | 1.2 | -1.85132 | 
| 6 | 2.3 | 0.323756 | 
| 8 | 1.5 | 1.469155 | 
| 3 | 3.2 | -1.71482 | 
| 2 | 6.3 | 0.596763 | 
| 1 | 7.9 | 1.305966 | 
| 2 | 5.3 | -0.47149 | 
| 6 | 1.5 | -0.53085 | 
| 3 | 1.9 | -3.10354 | 
| 10 | 1.7 | 3.682805 | 
| 5 | 2.6 | -0.35577 | 
| 3 | 2.9 | -2.03529 | 
| 6 | 4.2 | 2.353435 | 
| 2 | 3.9 | -1.96704 | 
| 4 | 3.1 | -0.82164 | 
| 3 | 5.8 | 1.062638 | 
| 6 | 1.7 | -0.31719 | 
| 7 | 2.2 | 1.216931 | 
| 2 | 4.5 | -1.32609 | 
| 1 | 6.1 | -0.61689 | 
| 1 | 5.8 | -0.93736 | 
| 1 | 7.4 | 0.77184 | 
| 3 | 6.4 | 1.703589 | 
| 2 | 4.7 | -1.11244 | 
| 2 | 3.9 | -1.96704 | 
| 3 | 4 | -0.86022 | 
| 4 | 4.6 | 0.780735 | 
And this is the Minitab output
Regression Analysis: y versus x
Analysis of Variance
| Source | DF | Seq SS | Contribution | Adj SS | Adj MS | F-Value | P-Value |
| Regression | 1 | 122.25 | 58.50% | 122.25 | 122.246 | 36.65 | 0.000 |
| x | 1 | 122.25 | 58.50% | 122.25 | 122.246 | 36.65 | 0.000 |
| Error | 26 | 86.72 | 41.50% | 86.72 | 3.335 | | |
| Lack-of-Fit | 22 | 74.72 | 35.76% | 74.72 | 3.396 | 1.13 | 0.510 |
| Pure Error | 4 | 12.00 | 5.74% | 12.00 | 3.000 | | |
| Total | 27 | 208.96 | 100.00% | | | | |
Model Summary
| S | R-sq | R-sq(adj) | PRESS | R-sq(pred) |
| 1.82628 | 58.50% | 56.90% | 103.156 | 50.63% |
Coefficients
| Term | Coef | SE Coef | 95% CI | T-Value | P-Value | VIF |
| Constant | 8.133 | 0.760 | (6.572, 9.695) | 10.71 | 0.000 | |
| x | -1.068 | 0.176 | (-1.431, -0.706) | -6.05 | 0.000 | 1.00 |
Regression Equation
y = 8.133 - 1.068 x
Fits and Diagnostics for Unusual Observations
| Obs | y | Fit | SE Fit | 95% CI | Resid | Std Resid | Del Resid | HI | Cook’s D | DFITS |
| 1 | 12.000 | 7.279 | 0.637 | (5.969, 8.588) | 4.721 | 2.76 | 3.22 | 0.121741 | 0.53 | 1.19750 |
| 11 | 10.000 | 6.317 | 0.511 | (5.267, 7.368) | 3.683 | 2.10 | 2.26 | 0.078294 | 0.19 | 0.65879 |
Obs 1: R
Obs 11: R
R Large residual
Residual Plots for y

a)
The proportion of the variation in the number of visits per month of a customer explained by the distance of the customer’s home to the store is R-sq = 58.50%, read from the Model Summary table of the output:
Model Summary
| S | R-sq | R-sq(adj) | PRESS | R-sq(pred) |
| 1.82628 | 58.50% | 56.90% | 103.156 | 50.63% |
b) Submit the residual plot


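The plot suggests a problem with one of the model assumptions.
The residuals are not randomly scattered: they show a slight U-shaped pattern, and they are more spread out for customers who live close to the store than for those who live far away (compare the RESI1 values at small and large x). The constant-variance assumption therefore appears to be violated.
This suggests that a nonlinear model, obtained by transforming the response, would fit better.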
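The pattern in the residual plot can also be checked numerically by comparing the spread of the residuals for customers near the store and far from it. The sketch below is only an informal check, not a formal test; the 3-mile cut-off `CUTOFF` is an arbitrary, hypothetical split point chosen for illustration.

```python
import statistics

# Store_Visits data as listed in the question: visits y and distance x (miles)
y = [12, 5, 6, 8, 3, 2, 1, 2, 6, 3, 10, 5, 3, 6,
     2, 4, 3, 6, 7, 2, 1, 1, 1, 3, 2, 2, 3, 4]
x = [0.8, 1.2, 2.3, 1.5, 3.2, 6.3, 7.9, 5.3, 1.5, 1.9, 1.7, 2.6, 2.9, 4.2,
     3.9, 3.1, 5.8, 1.7, 2.2, 4.5, 6.1, 5.8, 7.4, 6.4, 4.7, 3.9, 4.0, 4.6]

# Least-squares fit of y on x (the original, untransformed model)
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Residuals e = y - fitted y (these correspond to the RESI1 column)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

CUTOFF = 3.0  # hypothetical split point in miles, chosen only for illustration
near = [e for xi, e in zip(x, residuals) if xi < CUTOFF]
far = [e for xi, e in zip(x, residuals) if xi >= CUTOFF]

sd_near = statistics.stdev(near)
sd_far = statistics.stdev(far)
print(f"x <  {CUTOFF}: n = {len(near)}, residual SD = {sd_near:.2f}")
print(f"x >= {CUTOFF}: n = {len(far)}, residual SD = {sd_far:.2f}")
```

A clearly larger standard deviation in the first group than in the second is consistent with non-constant variance.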
c)
The suggested fix for the problem of part (b) is to apply a nonlinear transformation to the response.
A nonlinear transformation changes (increases or decreases) the strength of the linear relationship between variables and thus changes the correlation between them.
Steps in Minitab
1) Use the same data in the Minitab columns.
2) Select Stat > Regression > Regression > Fit Regression Model.
3) In the dialog box, set Responses to the column of variable y and Continuous predictors to the column of variable x.
4) Select the graphs you need (here, the residual plots) and, if wanted, storage (for example, the residuals).
5) Click Options. Under the Box-Cox transformation you will see "No transformation"; change it to "λ = 0.5 (square root)" for the square root transformation.
6) Click OK.
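Outside Minitab, the square root transformation amounts to regressing the square root of y on x. The following is a minimal sketch under that assumption; the helper name `fit_line` and the variable `y_sqrt` are hypothetical, and the data are typed in from the listing in the question.

```python
import math

# Store_Visits data as listed in the question: visits y and distance x (miles)
y = [12, 5, 6, 8, 3, 2, 1, 2, 6, 3, 10, 5, 3, 6,
     2, 4, 3, 6, 7, 2, 1, 1, 1, 3, 2, 2, 3, 4]
x = [0.8, 1.2, 2.3, 1.5, 3.2, 6.3, 7.9, 5.3, 1.5, 1.9, 1.7, 2.6, 2.9, 4.2,
     3.9, 3.1, 5.8, 1.7, 2.2, 4.5, 6.1, 5.8, 7.4, 6.4, 4.7, 3.9, 4.0, 4.6]


def fit_line(xs, ys):
    """Least-squares fit of ys = b0 + b1 * xs (hypothetical helper).

    Returns the intercept b0, the slope b1 and R-squared.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((xi - x_bar) ** 2 for xi in xs)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(xs, ys))
    sst = sum((yi - y_bar) ** 2 for yi in ys)
    return b0, b1, 1 - sse / sst


# Transformed response y' = y^0.5 (Box-Cox with lambda = 0.5)
y_sqrt = [math.sqrt(yi) for yi in y]

b0, b1, r2 = fit_line(x, y_sqrt)
print(f"y^0.5 = {b0:.3f} + ({b1:.4f}) x,  R-sq = {100 * r2:.2f}%")
```

Note that the R-sq printed here measures variation in y^0.5, not in y.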
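Here RESI2 is the residual column for the transformed model, e = y' - ŷ', where y' = y^0.5 is the transformed response and ŷ' is its fitted value.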
| y | x | RESI2 | 
| 12 | 0.8 | 0.745803 | 
| 5 | 1.2 | -0.37464 | 
| 6 | 2.3 | 0.134675 | 
| 8 | 1.5 | 0.298421 | 
| 3 | 3.2 | -0.34067 | 
| 2 | 6.3 | 0.175358 | 
| 1 | 7.9 | 0.191528 | 
| 2 | 5.3 | -0.09363 | 
| 6 | 1.5 | -0.08052 | 
| 3 | 1.9 | -0.69036 | 
| 10 | 1.7 | 0.68607 | 
| 5 | 2.6 | 0.001951 | 
| 3 | 2.9 | -0.42137 | 
| 6 | 4.2 | 0.645756 | 
| 2 | 3.9 | -0.47022 | 
| 4 | 3.1 | -0.09962 | 
| 3 | 5.8 | 0.358701 | 
| 6 | 1.7 | -0.02672 | 
| 7 | 2.2 | 0.304038 | 
| 2 | 4.5 | -0.30882 | 
| 1 | 6.1 | -0.29265 | 
| 1 | 5.8 | -0.37335 | 
| 1 | 7.4 | 0.057034 | 
| 3 | 6.4 | 0.520095 | 
| 2 | 4.7 | -0.25503 | 
| 2 | 3.9 | -0.47022 | 
| 3 | 4 | -0.12548 | 
| 4 | 4.6 | 0.303862 | 
Minitab Output
Regression Analysis: y versus x
Method
Box-Cox transformation λ = 0.5
Analysis of Variance for Transformed Response
| Source | DF | Seq SS | Contribution | Adj SS | Adj MS | F-Value | P-Value |
| Regression | 1 | 7.7510 | 66.04% | 7.7510 | 7.7510 | 50.56 | 0.000 |
| x | 1 | 7.7510 | 66.04% | 7.7510 | 7.7510 | 50.56 | 0.000 |
| Error | 26 | 3.9856 | 33.96% | 3.9856 | 0.1533 | | |
| Lack-of-Fit | 22 | 3.3918 | 28.90% | 3.3918 | 0.1542 | 1.04 | 0.553 |
| Pure Error | 4 | 0.5938 | 5.06% | 0.5938 | 0.1484 | | |
| Total | 27 | 11.7366 | 100.00% | | | | |
Model Summary for Transformed Response
| S | R-sq | R-sq(adj) | PRESS | R-sq(pred) |
| 0.391525 | 66.04% | 64.74% | 4.64139 | 60.45% |
Coefficients for Transformed Response
| Term | Coef | SE Coef | 95% CI | T-Value | P-Value | VIF |
| Constant | 2.933 | 0.163 | (2.599, 3.268) | 18.01 | 0.000 | |
| x | -0.2690 | 0.0378 | (-0.3467, -0.1912) | -7.11 | 0.000 | 1.00 |
Regression Equation
y^0.5 = 2.933 - 0.2690 x
Fits and Diagnostics for Unusual Observations
Original Response
| Obs | y | Fit | 95% CI |
| 1 | 12.0000 | 7.3891 | (5.9414, 8.9946) |
Transformed Response
| Obs | y' | Fit | SE Fit | 95% CI | Resid | Std Resid | Del Resid | HI | Cook’s D | DFITS |
| 1 | 3.464 | 2.718 | 0.137 | (2.437, 2.999) | 0.746 | 2.03 | 2.17 | 0.121741 | 0.29 | 0.809136 |
Obs 1: R
y' = transformed response
R Large residual
Residual Plots for y

The new residual plots are:


The new residual plot shows a random pattern, so the constant-variance assumption now appears reasonable.
Yes, the suggested remedy (the square root transformation) works.
d) For the new model, the proportion of the variation in the number of visits per month of a customer explained by the distance of the customer’s home to the store is 66.04%.
Model Summary for Transformed Response
| S | R-sq | R-sq(adj) | PRESS | R-sq(pred) |
| 0.391525 | 66.04% | 64.74% | 4.64139 | 60.45% |
Compared with the original model, the proportion of variation explained increases by about 7.5 percentage points (note that the new R-sq refers to variation in y^0.5 rather than in y):
For the original model, R-sq = 58.50%
For the new model, R-sq = 66.04%
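As a check, each R-sq value can be recomputed from its Analysis of Variance table as the regression sum of squares divided by the total sum of squares. The symbols SSR and SST are notation assumed here; they do not appear in the Minitab output.

$$R^2_{\text{original}} = \frac{SSR}{SST} = \frac{122.25}{208.96} \approx 0.5850, \qquad R^2_{\text{new}} = \frac{SSR}{SST} = \frac{7.7510}{11.7366} \approx 0.6604$$

$$66.04\% - 58.50\% = 7.54 \text{ percentage points}$$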
e) Based on your new model, construct a 95% prediction interval for y, the number of visits to the store for a customer who lives 2.5 miles from the store. Interpret the P.I.
For the new model, the 95% confidence interval for the coefficient of x is (-0.3467, -0.1912), and that for the constant (intercept) is (2.599, 3.268).
Coefficients for Transformed Response
| Term | Coef | SE Coef | 95% CI | T-Value | P-Value | VIF |
| Constant | 2.933 | 0.163 | (2.599, 3.268) | 18.01 | 0.000 | |
| x | -0.2690 | 0.0378 | (-0.3467, -0.1912) | -7.11 | 0.000 | 1.00 |
Our new model is
Regression Equation
y^0.5 = 2.933 - 0.2690 x
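A prediction interval (P.I.) is an interval estimate in which a future observation will fall with a stated probability (here 0.95).
Construct a 95% prediction interval for y, the number of visits to the store for a customer who lives x' = 2.5 miles from the store.
Because the model was fitted to y' = y^0.5, the interval is built on the square root scale first, and its endpoints are then squared to return to the scale of y.
The fitted value on the square root scale at x' = 2.5 is
$$\hat{y}' = 2.933 - 0.2690\,(2.5) = 2.2605$$
which corresponds to $\hat{y} = (2.2605)^2 = 5.10986$ visits on the original scale.
The 95% prediction interval for y' is given by
$$\hat{y}' \pm t_{\alpha/2,\;n-k-1} \cdot s_{\text{pred}}$$
Here n = 28 is the number of observations and k = 1 is the number of independent variables.
At the 5% level of significance,
$$t_{\alpha/2,\;n-k-1} = t_{0.025,\;26} = 2.055$$
You can find it from software such as Minitab or R, or from statistical tables.
And
$$s_{\text{pred}} = S\sqrt{1 + \frac{1}{n} + \frac{(x' - \bar{x})^2}{S_{xx}}}$$
n = 28
x' = 2.5
$\bar{x}$ = mean(x) = 3.835714 (can be calculated manually)
$S_{xx} = \sum (x_i - \bar{x})^2$ = 107.1243
Here S = 0.391525 (from the Model Summary for Transformed Response table).
Hence $S^2 = (0.391525)^2 = 0.1532918$.
The standard error of the fit at x' = 2.5 is
$$SE_{\text{fit}} = S\sqrt{\frac{1}{n} + \frac{(x' - \bar{x})^2}{S_{xx}}} = 0.391525\sqrt{\frac{1}{28} + \frac{(2.5 - 3.835714)^2}{107.1243}} = 0.391525\sqrt{0.0523691} = 0.0895977$$
and $SE_{\text{fit}}^2 = 0.0080277$, so
$$s_{\text{pred}} = \sqrt{S^2 + SE_{\text{fit}}^2} = \sqrt{0.1532918 + 0.0080277} = 0.401646$$
Hence the 95% prediction interval for y' = y^0.5 at x' = 2.5 is
$$2.2605 \pm 2.055 \times 0.401646 = 2.2605 \pm 0.8254 = (1.4351,\; 3.0859)$$
Squaring the endpoints gives the 95% prediction interval for y (at x = 2.5):
$$\text{P.I.} = (1.4351^2,\; 3.0859^2) \approx (2.06,\; 9.52)$$
Interpretation: we can be 95% confident that a customer who lives 2.5 miles from the store will make between about 2 and 10 visits to the store in a month.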
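One way to carry out this calculation outside Minitab is to form the prediction interval on the square root scale, where the model was fitted, and then square the endpoints. The sketch below follows that approach; it is not Minitab's own routine. The critical value `T_CRIT` is the table value for 26 degrees of freedom quoted in the text and is hard-coded rather than computed, and all variable names are illustrative.

```python
import math

# Store_Visits data as listed in the question: visits y and distance x (miles)
y = [12, 5, 6, 8, 3, 2, 1, 2, 6, 3, 10, 5, 3, 6,
     2, 4, 3, 6, 7, 2, 1, 1, 1, 3, 2, 2, 3, 4]
x = [0.8, 1.2, 2.3, 1.5, 3.2, 6.3, 7.9, 5.3, 1.5, 1.9, 1.7, 2.6, 2.9, 4.2,
     3.9, 3.1, 5.8, 1.7, 2.2, 4.5, 6.1, 5.8, 7.4, 6.4, 4.7, 3.9, 4.0, 4.6]

X_NEW = 2.5     # distance (miles) at which to predict
T_CRIT = 2.055  # t(0.025, 26) as quoted in the text; assumed, not computed here

# Fit y^0.5 = b0 + b1 * x by least squares
n = len(x)
w = [math.sqrt(yi) for yi in y]  # transformed response y'
x_bar = sum(x) / n
w_bar = sum(w) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (wi - w_bar) for xi, wi in zip(x, w))
b1 = sxy / sxx
b0 = w_bar - b1 * x_bar

# Residual standard deviation S with n - 2 degrees of freedom
sse = sum((wi - (b0 + b1 * xi)) ** 2 for xi, wi in zip(x, w))
s = math.sqrt(sse / (n - 2))

# Prediction standard error and interval on the square root scale
fit = b0 + b1 * X_NEW
s_pred = s * math.sqrt(1 + 1 / n + (X_NEW - x_bar) ** 2 / sxx)
lo_t = fit - T_CRIT * s_pred
hi_t = fit + T_CRIT * s_pred

# Back-transform by squaring; the lower limit is clipped at 0 first
# because squaring is only order-preserving for non-negative values
lo = max(lo_t, 0.0) ** 2
hi = hi_t ** 2

print(f"sqrt scale: fit = {fit:.4f}, s_pred = {s_pred:.4f}")
print(f"95% P.I. for y at x = {X_NEW}: ({lo:.2f}, {hi:.2f})")
```

Because the code uses unrounded coefficients, its results may differ in the last digits from a hand calculation based on the rounded Minitab output.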