Structurally deficient highway bridges. Data on structurally deficient highway bridges are compiled by the Federal Highway Administration (FHWA) and reported in the National Bridge Inventory (NBI). For each state, the NBI lists the number of structurally deficient bridges and the total area (thousands of square feet) of the deficient bridges. The data for the 50 states (plus the District of Columbia and Puerto Rico) are given below. For future planning and budgeting, the FHWA wants to estimate the total area of structurally deficient bridges in a state based on the number of deficient bridges.
| NumberSD | SDArea |
|---|---|
| 1899 | 432.71 |
| 155 | 60.92 |
| 181 | 110.57 |
| 997 | 347.35 |
| 3140 | 5177.97 |
| 580 | 316.92 |
| 358 | 387.78 |
| 20 | 9.05 |
| 24 | 59.34 |
| 302 | 412.92 |
| 1028 | 344.86 |
| 142 | 39.8 |
| 349 | 135.43 |
| 2501 | 1192.43 |
| 2030 | 688.19 |
| 5153 | 1069.71 |
| 2991 | 527.47 |
| 1362 | 458.37 |
| 1780 | 1453.26 |
| 349 | 131.13 |
| 388 | 236.18 |
| 585 | 521.83 |
| 1584 | 804.15 |
| 1156 | 325.9 |
| 3002 | 692.75 |
| 4433 | 1187.42 |
| 473 | 90.94 |
| 2382 | 335.75 |
| 47 | 20.08 |
| 383 | 127.66 |
| 750 | 752.43 |
| 404 | 196.67 |
| 2128 | 1427.73 |
| 2272 | 1034.61 |
| 743 | 101.42 |
| 2862 | 965.16 |
| 5793 | 1423.25 |
| 514 | 393.96 |
| 5802 | 2404.61 |
| 164 | 237.96 |
| 1260 | 626.38 |
| 1216 | 209.33 |
| 1325 | 481.31 |
| 2186 | 1031.45 |
| 233 | 102.56 |
| 500 | 153.8 |
| 1208 | 483.68 |
| 400 | 502.03 |
| 1058 | 331.59 |
| 1302 | 399.8 |
| 389 | 143.46 |
| 241 | 195.43 |
a) Based on a scatterplot, can you use linear regression to predict SDArea based on NumberSD? Explain.
b) Develop a simple linear regression equation to predict SDArea based on NumberSD.
c) Is the model you found in (b) a good fit? Why or why not?
d) Predict the SDArea when the NumberSD is 1260 bridges. Find the corresponding residual.
e) Build a 90% confidence interval (CI) for the coefficient of NumberSD (b1).
f) Repeat (e) with a 95% CI. What is the difference between your answers in (e) and (f)?
We will use R to draw the scatterplot and to fit a regression model.
First, we import the data into R:
> NumberSD=scan("clipboard")
Read 52 items
> SDArea=scan("clipboard")
Read 52 items
> head(data.frame(NumberSD,SDArea),10)  # print the first 10 observations
   NumberSD  SDArea
1      1899  432.71
2       155   60.92
3       181  110.57
4       997  347.35
5      3140 5177.97
6       580  316.92
7       358  387.78
8        20    9.05
9        24   59.34
10      302  412.92
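Reading from the clipboard depends on what happens to be copied at the time, so the import above is hard to reproduce. As a minimal sketch of an alternative, the 52 observations from the problem statement can be typed directly into vectors. The data frame name `bridges` is an assumption introduced here for illustration; it does not appear in the original session.

```r
# Data typed in from the table in the problem statement (52 rows).
NumberSD <- c(1899, 155, 181, 997, 3140, 580, 358, 20, 24, 302, 1028, 142, 349,
              2501, 2030, 5153, 2991, 1362, 1780, 349, 388, 585, 1584, 1156,
              3002, 4433, 473, 2382, 47, 383, 750, 404, 2128, 2272, 743, 2862,
              5793, 514, 5802, 164, 1260, 1216, 1325, 2186, 233, 500, 1208,
              400, 1058, 1302, 389, 241)
SDArea <- c(432.71, 60.92, 110.57, 347.35, 5177.97, 316.92, 387.78, 9.05,
            59.34, 412.92, 344.86, 39.8, 135.43, 1192.43, 688.19, 1069.71,
            527.47, 458.37, 1453.26, 131.13, 236.18, 521.83, 804.15, 325.9,
            692.75, 1187.42, 90.94, 335.75, 20.08, 127.66, 752.43, 196.67,
            1427.73, 1034.61, 101.42, 965.16, 1423.25, 393.96, 2404.61,
            237.96, 626.38, 209.33, 481.31, 1031.45, 102.56, 153.8, 483.68,
            502.03, 331.59, 399.8, 143.46, 195.43)

# 'bridges' is a hypothetical name chosen for this sketch.
bridges <- data.frame(NumberSD, SDArea)

# Guard against typing errors: the problem states 52 observations.
stopifnot(nrow(bridges) == 52)

head(bridges, 10)  # first 10 observations, as in the session above
```

With the vectors defined this way, the remaining commands in the solution (`plot`, `lm`, `summary`, `anova`) can be run unchanged.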
a) Based on a scatterplot, can you use linear regression to predict SDArea based on NumberSD? Explain.
> plot(NumberSD,SDArea,main="Scatter Plot",col=12,pch=19)

The scatterplot shows a positive (increasing) trend, which implies a positive correlation between the two variables. Although there is one clear outlier (NumberSD = 3140, SDArea = 5177.97), we can use linear regression to predict SDArea based on NumberSD.
b) Develop a simple linear regression equation to predict SDArea based on NumberSD.
R code and output:


> fit=lm(SDArea~NumberSD)  # fit the regression model
> summary(fit)             # print the summary
Call:
lm(formula = SDArea ~ NumberSD)
Residuals:
   Min     1Q Median     3Q    Max
-831.0 -146.3 -107.2  104.7 3972.9
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 119.86392  123.02005   0.974    0.335
NumberSD      0.34560    0.06158   5.613 8.69e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 635.2 on 50 degrees of freedom
Multiple R-squared: 0.3865,    Adjusted R-squared: 0.3743
F-statistic: 31.5 on 1 and 50 DF, p-value: 8.695e-07
> anova(fit)               # print the ANOVA table
Analysis of Variance Table
Response: SDArea
          Df   Sum Sq  Mean Sq F value    Pr(>F)
NumberSD   1 12710096 12710096  31.503 8.695e-07 ***
Residuals 50 20173064   403461
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


The simple linear regression equation to predict SDArea based on NumberSD is
SDArea (y) = 119.8639 + 0.3456 * NumberSD
c) Is the model you found in (b) a good fit? Why or why not?
The coefficient of determination is
R-squared: 0.3865
This implies that only 38.65% of the variability in SDArea is explained by the independent variable NumberSD, which is too low for the fitted model to be considered a good fit.
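The R-squared value can be cross-checked against the ANOVA table. The sketch below uses only the sums of squares printed by `anova(fit)`; the variable names are illustrative and not part of the original session.

```r
# Sums of squares copied from the anova(fit) output above.
ss_regression <- 12710096   # Sum Sq for NumberSD (1 degree of freedom)
ss_residual   <- 20173064   # Sum Sq for Residuals
df_residual   <- 50

ss_total  <- ss_regression + ss_residual
r_squared <- ss_regression / ss_total    # proportion of variability explained
mse       <- ss_residual / df_residual   # mean squared error
f_stat    <- ss_regression / mse         # F = MSR / MSE, since MSR = SSR / 1
rse       <- sqrt(mse)                   # residual standard error

round(c(r_squared = r_squared, f_stat = f_stat, rse = rse), 4)
```

These reproduce, up to rounding, the Multiple R-squared, the F-statistic, and the residual standard error reported by `summary(fit)`.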
We now check the residual plot:
> Residual=resid(fit)  # residuals from the fitted model
> plot(NumberSD,Residual,main="Residual Plot",col=2,pch=19)
> abline(h=0)  # horizontal reference line at zero

The residual plot also shows an outlier, which suggests that a transformation may be needed. Hence the model found in (b) is not a good fit. Because the two variables are positively correlated, a simple regression model can still be used to predict SDArea, but a transformation such as a square root or log transformation should be applied first.
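As a rough sketch of what the suggested log transformation could look like, the code below fits a model on the log scale. It is an illustration only: it uses just the first ten observations printed by `head()` earlier, the names `d` and `fit_log` are hypothetical, and no claim is made here about how well the transformed model fits the full data set.

```r
# First ten observations, as printed by head() above (illustration only).
d <- data.frame(
  NumberSD = c(1899, 155, 181, 997, 3140, 580, 358, 20, 24, 302),
  SDArea   = c(432.71, 60.92, 110.57, 347.35, 5177.97,
               316.92, 387.78, 9.05, 59.34, 412.92)
)

# Both variables are strictly positive, so log() is defined for every row.
fit_log <- lm(log(SDArea) ~ log(NumberSD), data = d)
coef(fit_log)

# Residuals against fitted values: a patternless band around zero is the goal.
plot(fitted(fit_log), resid(fit_log),
     xlab = "Fitted log(SDArea)", ylab = "Residual", pch = 19)
abline(h = 0)
```

To apply the same idea to all 52 states, replace `d` with a data frame holding the full NumberSD and SDArea vectors and compare the residual plot and R-squared with those of the untransformed model.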
d) Predict the SDArea when the NumberSD is 1260 bridges. Find the corresponding residual.
Given NumberSD = 1260, the regression equation to predict SDArea based on NumberSD is
SDArea (y) = 119.8639 + 0.3456 * NumberSD
Thus SDArea (y) = 119.8639 + 0.3456 * 1260
SDArea (y) = 555.3199
Hence the predicted SDArea is 555.3199 (thousands of square feet).
The actual value of SDArea corresponding to NumberSD = 1260 is 626.38, as the data show:
> data.frame(NumberSD[40:45],SDArea[40:45])
  NumberSD.40.45. SDArea.40.45.
1             164        237.96
2            1260        626.38
3            1216        209.33
4            1325        481.31
5            2186       1031.45
6             233        102.56
Hence the predicted value of SDArea is 555.3199 (at NumberSD = 1260),
and the actual value of SDArea is 626.38 (at NumberSD = 1260).
Thus Residual = Actual value - Predicted value = 626.38 - 555.3199 = 71.0601
The corresponding residual is 71.0601. It is positive, so the model underpredicts SDArea for this observation.
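The same arithmetic can be checked in R. This sketch uses the rounded coefficients from the fitted equation rather than the `fit` object, so it stands on its own; the variable names are illustrative.

```r
b0 <- 119.8639   # intercept, rounded as in the fitted equation
b1 <- 0.3456     # slope, rounded as in the fitted equation

x_new    <- 1260     # number of structurally deficient bridges
y_actual <- 626.38   # observed SDArea for the observation with NumberSD = 1260

y_hat    <- b0 + b1 * x_new    # predicted SDArea
residual <- y_actual - y_hat   # actual minus predicted

c(predicted = y_hat, residual = residual)
```

If the fitted model is still in the workspace, `predict(fit, newdata = data.frame(NumberSD = 1260))` should give essentially the same prediction, differing only through the rounding of the coefficients used here.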
e) Build a 90% confidence interval (CI) for the coefficient of NumberSD (b1).
The 90% confidence interval for b1 is given by
$CI = \{\, \hat{b}_1 - t_{\alpha/2,\,n-2} \cdot SE(\hat{b}_1),\ \hat{b}_1 + t_{\alpha/2,\,n-2} \cdot SE(\hat{b}_1) \,\}$
Here $\hat{b}_1$ = 0.34560 and $SE(\hat{b}_1)$ = 0.06158.
$t_{\alpha/2,\,n-2}$ is the critical value of the t-distribution with n-2 = 52-2 = 50 degrees of freedom, and $\alpha$ = 0.10 for a 90% CI.
It can be looked up in a statistical table or computed more accurately with software such as R or Excel.
From R
> qt(1-.1/2,50)
[1] 1.675905
Thus $t_{\alpha/2,\,n-2}$ = 1.675905
Hence the 90% confidence interval for b1 is
$CI = \{\, \hat{b}_1 - t_{\alpha/2,\,n-2} \cdot SE(\hat{b}_1),\ \hat{b}_1 + t_{\alpha/2,\,n-2} \cdot SE(\hat{b}_1) \,\}$
= { 0.34560 - 1.675905 * 0.06158 , 0.34560 + 1.675905 * 0.06158 }
= { 0.2423978 , 0.4488022 }
The 90% confidence interval for the coefficient of NumberSD (b1) is { 0.24240 , 0.44880 }
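The interval can be reproduced from the summary output alone. The helper `slope_ci` below is a hypothetical function written for this sketch; it takes the confidence level and uses the slope estimate, its standard error, and the residual degrees of freedom reported by `summary(fit)`.

```r
b1    <- 0.34560   # slope estimate from summary(fit)
se_b1 <- 0.06158   # standard error of the slope
df    <- 50        # residual degrees of freedom, n - 2

# Hypothetical helper: two-sided t-based confidence interval for the slope.
slope_ci <- function(level) {
  alpha  <- 1 - level
  t_crit <- qt(1 - alpha / 2, df)   # upper alpha/2 critical value
  c(lower = b1 - t_crit * se_b1, upper = b1 + t_crit * se_b1)
}

ci90 <- slope_ci(0.90)
ci90
```

When the `fit` object is available, `confint(fit, "NumberSD", level = 0.90)` should return essentially the same interval; small differences can arise because the inputs here are rounded.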
f) Repeat (e) with a 95% CI. What is the difference between your answer in (e) and (f)?
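The 95% confidence interval for b1 is given by
$CI = \{\, \hat{b}_1 - t_{\alpha/2,\,n-2} \cdot SE(\hat{b}_1),\ \hat{b}_1 + t_{\alpha/2,\,n-2} \cdot SE(\hat{b}_1) \,\}$
As before, $t_{\alpha/2,\,n-2}$ is the critical value of the t-distribution with 50 degrees of freedom, but now $\alpha$ = 0.05 for a 95% CI.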
From R
> qt(1-.05/2,50)
[1] 2.008559
Thus $t_{\alpha/2,\,n-2}$ = 2.008559
Hence the 95% confidence interval for b1 is
$CI = \{\, \hat{b}_1 - t_{\alpha/2,\,n-2} \cdot SE(\hat{b}_1),\ \hat{b}_1 + t_{\alpha/2,\,n-2} \cdot SE(\hat{b}_1) \,\}$
= { 0.34560 - 2.008559 * 0.06158 , 0.34560 + 2.008559 * 0.06158 }
= { 0.2219129 , 0.4692871 }
The 95% confidence interval for the coefficient of NumberSD (b1) is { 0.22191 , 0.46929 }
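The difference between (e) and (f) is that the 95% confidence interval { 0.22191 , 0.46929 } is wider than the 90% confidence interval { 0.24240 , 0.44880 }; that is, the 90% confidence interval for b1 is narrower. A higher confidence level requires a larger critical value (2.008559 versus 1.675905), while the estimate and its standard error stay the same.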
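The comparison can be made explicit through the interval widths, using only the critical values and the standard error quoted in parts (e) and (f):

```latex
\[
\text{width} = 2\, t_{\alpha/2,\,50}\, SE(\hat{b}_1)
\]
\[
\text{width}_{90\%} = 2(1.675905)(0.06158) \approx 0.2064, \qquad
\text{width}_{95\%} = 2(2.008559)(0.06158) \approx 0.2474
\]
\[
\frac{\text{width}_{95\%}}{\text{width}_{90\%}} = \frac{2.008559}{1.675905} \approx 1.198
\]
```

So the 95% interval is roughly 20% wider than the 90% interval, and both are centered at the same estimate, 0.34560.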