Structurally deficient highway bridges. Data on structurally deficient highway bridges are compiled by the Federal Highway Administration (FHWA) and reported in the National Bridge Inventory (NBI). For each state, the NBI lists the number of structurally deficient bridges and the total area (thousands of square feet) of the deficient bridges. The data for the 50 states (plus the District of Columbia and Puerto Rico) are given below. For future planning and budgeting, the FHWA wants to estimate the total area of structurally deficient bridges in a state based on the number of deficient bridges.
NumberSD | SDArea
1899 | 432.71
155 | 60.92
181 | 110.57
997 | 347.35
3140 | 5177.97
580 | 316.92
358 | 387.78
20 | 9.05
24 | 59.34
302 | 412.92
1028 | 344.86
142 | 39.8
349 | 135.43
2501 | 1192.43
2030 | 688.19
5153 | 1069.71
2991 | 527.47
1362 | 458.37
1780 | 1453.26
349 | 131.13
388 | 236.18
585 | 521.83
1584 | 804.15
1156 | 325.9
3002 | 692.75
4433 | 1187.42
473 | 90.94
2382 | 335.75
47 | 20.08
383 | 127.66
750 | 752.43
404 | 196.67
2128 | 1427.73
2272 | 1034.61
743 | 101.42
2862 | 965.16
5793 | 1423.25
514 | 393.96
5802 | 2404.61
164 | 237.96
1260 | 626.38
1216 | 209.33
1325 | 481.31
2186 | 1031.45
233 | 102.56
500 | 153.8
1208 | 483.68
400 | 502.03
1058 | 331.59
1302 | 399.8
389 | 143.46
241 | 195.43
a) Based on a scatterplot, can you use linear regression to predict SDArea based on NumberSD? Explain.
b) Develop a simple linear regression equation to predict SDArea based on NumberSD.
c) Is the model you found in (b) a good fit? Why or why not?
d) Predict the SDArea when the NumberSD is 1260 bridges. Find the corresponding residual.
e) Build a 90% confidence interval (CI) for the coefficient of NumberSD (b1).
f) Repeat (e) with a 95% CI. What is the difference between your answers in (e) and (f)?
We will use R to draw the scatterplot and to fit the regression model.
First we import the data into R (each column is read from the clipboard).
> NumberSD=scan("clipboard")
Read 52 items
> SDArea=scan("clipboard")
Read 52 items
> head(data.frame(NumberSD,SDArea),10) # to print first 10 observations
   NumberSD  SDArea
1      1899  432.71
2       155   60.92
3       181  110.57
4       997  347.35
5      3140 5177.97
6       580  316.92
7       358  387.78
8        20    9.05
9        24   59.34
10      302  412.92
a) Based on a scatterplot, can you use linear regression to predict SDArea based on NumberSD? Explain.
> plot(NumberSD,SDArea,main="Scatter Plot",col=12,pch=19)
The scatterplot shows a positive (increasing) trend, which implies a positive correlation between the two variables. Although there is one outlier, we can use linear regression to predict SDArea based on NumberSD.
b) Develop a simple linear regression equation to predict SDArea based on NumberSD.
R code and output
> fit=lm(SDArea~NumberSD) # to find regression equation
> summary(fit) # to print summary
Call:
lm(formula = SDArea ~ NumberSD)
Residuals:
   Min     1Q Median     3Q    Max
-831.0 -146.3 -107.2  104.7 3972.9
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 119.86392  123.02005   0.974    0.335
NumberSD      0.34560    0.06158   5.613 8.69e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 635.2 on 50 degrees of freedom
Multiple R-squared: 0.3865, Adjusted R-squared: 0.3743
F-statistic: 31.5 on 1 and 50 DF, p-value: 8.695e-07
> anova(fit) # to print ANOVA
Analysis of Variance Table
Response: SDArea
          Df   Sum Sq  Mean Sq F value    Pr(>F)
NumberSD   1 12710096 12710096  31.503 8.695e-07 ***
Residuals 50 20173064   403461
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The simple linear regression equation to predict SDArea based on NumberSD is
SDArea (y) = 119.8639 + 0.3456 * NumberSD
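The fit can also be reproduced without the clipboard. The sketch below is a minimal, self-contained version that assumes the 52 pairs are typed in directly from the table in the problem statement; the data frame name `bridges` is an illustrative choice, not something from the original session.

```r
# Data transcribed from the problem statement (52 rows, same order as the table).
NumberSD <- c(1899, 155, 181, 997, 3140, 580, 358, 20, 24, 302, 1028, 142, 349,
              2501, 2030, 5153, 2991, 1362, 1780, 349, 388, 585, 1584, 1156,
              3002, 4433, 473, 2382, 47, 383, 750, 404, 2128, 2272, 743, 2862,
              5793, 514, 5802, 164, 1260, 1216, 1325, 2186, 233, 500, 1208,
              400, 1058, 1302, 389, 241)
SDArea <- c(432.71, 60.92, 110.57, 347.35, 5177.97, 316.92, 387.78, 9.05, 59.34,
            412.92, 344.86, 39.8, 135.43, 1192.43, 688.19, 1069.71, 527.47,
            458.37, 1453.26, 131.13, 236.18, 521.83, 804.15, 325.9, 692.75,
            1187.42, 90.94, 335.75, 20.08, 127.66, 752.43, 196.67, 1427.73,
            1034.61, 101.42, 965.16, 1423.25, 393.96, 2404.61, 237.96, 626.38,
            209.33, 481.31, 1031.45, 102.56, 153.8, 483.68, 502.03, 331.59,
            399.8, 143.46, 195.43)

# 'bridges' is a hypothetical name for the combined data set.
bridges <- data.frame(NumberSD, SDArea)

# Ordinary least squares fit of SDArea on NumberSD.
fit <- lm(SDArea ~ NumberSD, data = bridges)
b0 <- unname(coef(fit)["(Intercept)"])  # intercept estimate
b1 <- unname(coef(fit)["NumberSD"])     # slope estimate
print(round(c(intercept = b0, slope = b1), 4))
```

If the data are transcribed correctly, the printed coefficients should agree with the `summary(fit)` output shown above.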
c) Is the model you found in (b) a good fit? Why or why not?
The coefficient of determination is
R-squared: 0.3865
This implies that only 38.65% of the variability in SDArea is explained by the independent variable NumberSD, which is too low to consider the fitted model a good fit.
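As a quick check, R-squared can be recomputed by hand from the sums of squares in the ANOVA table. The sketch below only reuses the numbers printed by `anova(fit)`; the variable names are illustrative.

```r
# Sums of squares copied from the ANOVA table.
ss_regression <- 12710096  # Sum Sq for NumberSD
ss_residual   <- 20173064  # Sum Sq for Residuals
n <- 52                    # number of observations

# R-squared = explained variation / total variation.
ss_total  <- ss_regression + ss_residual
r_squared <- ss_regression / ss_total

# Adjusted R-squared penalises for the single predictor in the model.
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - 2)

print(round(c(r_squared = r_squared, adj_r_squared = adj_r_squared), 4))
```

Both values should match the "Multiple R-squared" and "Adjusted R-squared" lines of the summary output up to rounding.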
We will also check the residual plot.
> Residual=resid(fit) # residuals of the fitted model
> plot(NumberSD,Residual,main="Residual Plot",col=2,pch=19)
> abline(h=0) # horizontal reference line at zero
The residual plot also shows an outlier, which suggests that a transformation should be applied. Hence the model found in (b) is not a good fit. Because the two variables are positively correlated, a simple regression model can still be used to predict SDArea, but a transformation such as the square root or the logarithm should be applied first.
d) Predict the SDArea when the NumberSD is 1260 bridges. Find the corresponding residual.
Given NumberSD = 1260
The regression equation to predict SDArea based on NumberSD is
SDArea (y) = 119.8639 + 0.3456 * NumberSD
Thus SDArea (y) = 119.8639 + 0.3456 * 1260
SDArea (y) = 555.3199
Hence the predicted SDArea is 555.3199.
The actual value of SDArea corresponding to NumberSD = 1260 is 626.38, as the data show:
> data.frame(NumberSD[40:45],SDArea[40:45])
  NumberSD.40.45. SDArea.40.45.
1             164        237.96
2            1260        626.38
3            1216        209.33
4            1325        481.31
5            2186       1031.45
6             233        102.56
Hence the predicted value of SDArea is 555.3199 { at NumberSD = 1260 }
and the actual value of SDArea is 626.38 { at NumberSD = 1260 }
Thus Residual = Actual value - Predicted value = 626.38 - 555.3199 = 71.0601
The corresponding residual is 71.0601.
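The arithmetic in part (d) can be reproduced in a few lines. This is a minimal sketch that plugs in the rounded coefficients from the regression equation; the variable names are illustrative.

```r
# Rounded coefficients from the fitted equation.
b0 <- 119.8639  # intercept
b1 <- 0.3456    # slope for NumberSD

x_new <- 1260    # number of deficient bridges
y_obs <- 626.38  # observed SDArea for that row of the data

# Predicted value from the regression line and the corresponding residual.
y_hat    <- b0 + b1 * x_new
residual <- y_obs - y_hat

print(c(predicted = y_hat, residual = residual))
```

With a fitted model object available, `predict(fit, data.frame(NumberSD = 1260))` should give the same prediction up to the rounding of the coefficients.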
e) Build a 90% confidence interval (CI) for the coefficient of NumberSD (b1).
The 90% confidence interval for b1 is given by
CI = { b1 - t(α/2, n-2) * SE(b1) , b1 + t(α/2, n-2) * SE(b1) }
Here b1 = 0.34560 and SE(b1) = 0.06158.
t(α/2, n-2) is the upper α/2 critical value of the t distribution with n-2 = 52-2 = 50 degrees of freedom, and α = 0.10 for a 90% CI.
It can be read from a statistical table or computed more accurately with software such as R or Excel.
From R
> qt(1-.1/2,50)
[1] 1.675905
Thus t(0.05, 50) = 1.675905
Hence the 90% confidence interval for b1 is
CI = { b1 - t(0.05, 50) * SE(b1) , b1 + t(0.05, 50) * SE(b1) }
= { 0.34560 - 1.675905 * 0.06158 , 0.34560 + 1.675905 * 0.06158 }
= { 0.2423978 , 0.4488022 }
The 90% confidence interval for the coefficient of NumberSD (b1) is { 0.24240 , 0.44880 }.
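The interval in part (e) can be computed directly from the estimate and its standard error. The sketch below assumes only the two numbers from the coefficient table and the 50 residual degrees of freedom; the variable names are illustrative.

```r
# Slope estimate and standard error from the coefficient table.
estimate  <- 0.34560
std_error <- 0.06158
df        <- 50     # residual degrees of freedom, n - 2
level     <- 0.90   # confidence level

# Two-sided critical value of the t distribution.
t_crit <- qt(1 - (1 - level) / 2, df)

# Lower and upper limits: estimate -/+ t_crit * SE.
ci <- estimate + c(-1, 1) * t_crit * std_error
print(round(ci, 5))
```

With the fitted model object available, `confint(fit, "NumberSD", level = 0.90)` is the built-in route and should agree up to rounding of the inputs.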
f) Repeat (e) with a 95% CI. What is the difference between your answers in (e) and (f)?
The 95% confidence interval for b1 is given by
CI = { b1 - t(α/2, n-2) * SE(b1) , b1 + t(α/2, n-2) * SE(b1) }
The critical value again comes from the t distribution with 50 degrees of freedom, but now α = 0.05 for a 95% CI.
From R
> qt(1-.05/2,50)
[1] 2.008559
Thus t(0.025, 50) = 2.008559
Hence the 95% confidence interval for b1 is
CI = { b1 - t(0.025, 50) * SE(b1) , b1 + t(0.025, 50) * SE(b1) }
= { 0.34560 - 2.008559 * 0.06158 , 0.34560 + 2.008559 * 0.06158 }
= { 0.2219129 , 0.4692871 }
The 95% confidence interval for the coefficient of NumberSD (b1) is { 0.22191 , 0.46929 }.
The difference between (e) and (f) is that the 95% confidence interval { 0.22191 , 0.46929 } is wider than the 90% confidence interval { 0.24240 , 0.44880 }; that is, the 90% confidence interval for b1 is narrower. A higher confidence level requires a larger critical value (2.008559 versus 1.675905), which widens the interval around the same estimate.
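To see the comparison numerically, the widths of the two intervals can be computed side by side. This is a small sketch using the same standard error and degrees of freedom as above; the variable names are illustrative.

```r
# Standard error of the slope and residual degrees of freedom.
std_error <- 0.06158
df        <- 50

# Confidence levels being compared.
conf_levels <- c(0.90, 0.95)

# Critical values and interval widths (width = 2 * t * SE).
t_crit <- qt(1 - (1 - conf_levels) / 2, df)
width  <- 2 * t_crit * std_error

print(data.frame(level = conf_levels, t_crit = t_crit, width = width))
```

Both intervals are centred on the same estimate, so the only thing that changes with the confidence level is the critical value and hence the width.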