The data below contains the total costs and depths of 16 offshore oil wells. It is expected that cost is a linear function of the depth.
(a) Write out the equation of the regression line. Interpret the slope and intercept in the context of this problem. Do they make sense?
(b) Test the hypothesis that the slope parameter is zero in four different ways (ANOVA, t-test for β1, t-test for ρ, and a confidence interval for β1).
(c) What is the R² for the SLR you have obtained? What does it mean?
(d) Plot the standardized residuals against the independent variable. What can you say about the regression using this graph? (HINT: Are there outliers? Does it seem reasonable to claim the data has a linear fit?)
Depth(feet) vs Cost($1000)
Depth (feet) | Cost ($1000) |
5000 | 2596.800049 |
5200 | 3328 |
6000 | 3181.100098 |
6538 | 3198.399902 |
7109 | 4779.899902 |
7556 | 5905.600098 |
8005 | 5769.200195 |
8207 | 8089.5 |
8210 | 4813.100098 |
8600 | 5618.700195 |
9026 | 7736 |
9197 | 6788.299805 |
9926 | 7840.799805 |
10813 | 8882.5 |
13800 | 10489.5 |
14311 | 12506.59961 |
Depth (feet), X | Cost ($1000), Y | XY | X² | Y² |
5000 | 2596.800049 | 12984000.25 | 25000000 | 6743370.494 |
5200 | 3328 | 17305600 | 27040000 | 11075584 |
6000 | 3181.100098 | 19086600.59 | 36000000 | 10119397.83 |
6538 | 3198.399902 | 20911138.56 | 42745444 | 10229761.93 |
7109 | 4779.899902 | 33980308.4 | 50537881 | 22847443.07 |
7556 | 5905.600098 | 44622714.34 | 57093136 | 34876112.52 |
8005 | 5769.200195 | 46182447.56 | 64080025 | 33283670.89 |
8207 | 8089.5 | 66390526.5 | 67354849 | 65440010.25 |
8210 | 4813.100098 | 39515551.8 | 67404100 | 23165932.55 |
8600 | 5618.700195 | 48320821.68 | 73960000 | 31569791.88 |
9026 | 7736 | 69825136 | 81468676 | 59845696 |
9197 | 6788.299805 | 62431993.31 | 84584809 | 46081014.24 |
9926 | 7840.799805 | 77827778.86 | 98525476 | 61478141.58 |
10813 | 8882.5 | 96046472.5 | 116920969 | 78898806.25 |
13800 | 10489.5 | 144755100 | 190440000 | 110029610.3 |
14311 | 12506.59961 | 178981947 | 204804721 | 156415033.8 |
Ʃx = | Ʃy = | Ʃxy = | Ʃx² = | Ʃy² = |
137498 | 101523.9998 | 979168137.4 | 1287960086 | 762099377.6 |
Sample size, n = | 16 |
x̅ = Ʃx/n = 137498/16 = | 8593.625 |
y̅ = Ʃy/n = 101523.9998/16 = | 6345.249985 |
SSxx = Ʃx² - (Ʃx)²/n = 1287960086 - (137498)²/16 = | 106353835.8 |
SSyy = Ʃy² - (Ʃy)²/n = 762099377.55589 - (101523.99976)²/16 = | 117904219.6 |
SSxy = Ʃxy - (Ʃx)(Ʃy)/n = 979168137.36836 - (137498)(101523.99976)/16 = | 106708955 |
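The column totals and the three corrected sums above can be checked with a few lines of Python. This is only a sketch using the standard library; the variable names (`depth`, `cost`, `ss_xx`, and so on) are chosen here for illustration and are not part of the original solution.

```python
# Depth (feet) and cost ($1000) for the 16 wells, copied from the data table.
depth = [5000, 5200, 6000, 6538, 7109, 7556, 8005, 8207,
         8210, 8600, 9026, 9197, 9926, 10813, 13800, 14311]
cost = [2596.800049, 3328, 3181.100098, 3198.399902, 4779.899902,
        5905.600098, 5769.200195, 8089.5, 4813.100098, 5618.700195,
        7736, 6788.299805, 7840.799805, 8882.5, 10489.5, 12506.59961]

n = len(depth)
sum_x = sum(depth)
sum_y = sum(cost)
sum_xy = sum(x * y for x, y in zip(depth, cost))
sum_x2 = sum(x * x for x in depth)
sum_y2 = sum(y * y for y in cost)

# Shortcut formulas for the corrected sums of squares and cross-products.
ss_xx = sum_x2 - sum_x ** 2 / n
ss_yy = sum_y2 - sum_y ** 2 / n
ss_xy = sum_xy - sum_x * sum_y / n

print(f"n = {n}, sum x = {sum_x}, sum y = {sum_y:.4f}")
print(f"SSxx = {ss_xx:.2f}, SSyy = {ss_yy:.2f}, SSxy = {ss_xy:.2f}")
```

Small differences in the last digits are expected, because the cost values in the table are themselves rounded.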
a)
Slope, b = SSxy/SSxx = 106708954.95661/106353835.75 = 1.003339035
y-intercept, a = y̅ -b* x̅ = 6345.24998 - (1.00334)*8593.625 = -2277.069432
Regression equation :
ŷ = -2277.0694 + (1.0033) x
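Slope interpretation: each additional foot of depth is associated with an estimated increase of 1.0033 in cost, measured in $1000, that is, about $1,003 per foot. This makes sense, since deeper wells are expected to cost more.
Y-intercept: this is the predicted cost at x = 0 (a well of zero depth). The value, about -$2,277,000, is not reasonable because a cost cannot be negative. In addition, x = 0 lies far outside the observed depths (5000 to 14311 feet), so the intercept should not be interpreted literally.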
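As a quick check of part (a), the slope and intercept can be recomputed from the summary quantities already obtained. The sketch below assumes those values as given; `predict_cost` is a hypothetical helper introduced only for illustration.

```python
# Summary quantities taken from the worked solution.
ss_xy = 106708954.95661
ss_xx = 106353835.75
x_bar = 8593.625
y_bar = 6345.249985

slope = ss_xy / ss_xx                # b, in $1000 per foot of depth
intercept = y_bar - slope * x_bar    # a, in $1000


def predict_cost(depth_feet):
    """Predicted cost in $1000 for a given depth (hypothetical helper)."""
    return intercept + slope * depth_feet


print(f"slope = {slope:.6f}, intercept = {intercept:.4f}")
print(f"estimated cost per extra foot = ${slope * 1000:,.0f}")
print(f"predicted cost at 5000 ft = {predict_cost(5000):.4f} ($1000)")
```

Predictions are only meaningful for depths inside the observed range of roughly 5000 to 14311 feet.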
b)
ANOVA test:
Null and alternative hypothesis:
Ho: β₁ = 0
Ha: β₁ ≠ 0
SSE = SSyy - SSxy²/SSxx =10838959.72
SSR = SSxy²/SSxx = 107065259.9188
Test statistic:
F = SSR/(SSE/(n-2)) = 107065259.9188/(10838959.7209/14) = 138.2894
P-value = 0.0000 (to four decimal places)
Conclusion:
Since p-value < α = 0.05, reject the null hypothesis. The slope is significantly different from zero.
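The ANOVA decomposition can be laid out as a small table in Python. This is a sketch that starts from the sums of squares computed earlier; the critical value 4.60 for F(1, 14) at the 5% level is an assumed table value, used here in place of an exact p-value.

```python
n = 16
ss_xx = 106353835.75
ss_yy = 117904219.63968
ss_xy = 106708954.95661

ssr = ss_xy ** 2 / ss_xx    # regression sum of squares, 1 df
sse = ss_yy - ssr           # error sum of squares, n - 2 df
msr = ssr / 1
mse = sse / (n - 2)
f_stat = msr / mse

# Assumed table value: upper 5% point of the F(1, 14) distribution.
F_CRIT_05 = 4.60

print(f"{'Source':<12}{'df':>4}{'SS':>18}{'MS':>18}{'F':>10}")
print(f"{'Regression':<12}{1:>4}{ssr:>18.2f}{msr:>18.2f}{f_stat:>10.4f}")
print(f"{'Error':<12}{n - 2:>4}{sse:>18.2f}{mse:>18.2f}")
print(f"{'Total':<12}{n - 1:>4}{ss_yy:>18.2f}")
print("Reject H0" if f_stat > F_CRIT_05 else "Fail to reject H0")
```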
Slope Hypothesis test:
Null and alternative hypothesis:
Ho: β₁ = 0
Ha: β₁ ≠ 0
Slope, b = 1.003339035
Sum of Square error, SSE = SSyy -SSxy²/SSxx = 117904219.63968 - (106708954.95661)²/106353835.75 = 10838959.72
Standard error, se = √(SSE/(n-2)) = √(10838959.7209/(16-2)) = 879.89284
Test statistic:
t = b/(se/√SSxx) = 11.7597
df = n-2 = 14
p-value = T.DIST.2T(ABS(11.7597), 14) = 0.0000 (to four decimal places)
Conclusion:
Since p-value < α = 0.05, reject the null hypothesis. The slope is significantly different from zero.
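The t statistic for the slope can be reproduced step by step. The sketch below again starts from the sums of squares; the two-sided 5% critical value 2.1448 for 14 degrees of freedom is an assumed table value.

```python
import math

n = 16
ss_xx = 106353835.75
ss_yy = 117904219.63968
ss_xy = 106708954.95661

b = ss_xy / ss_xx                    # estimated slope
sse = ss_yy - ss_xy ** 2 / ss_xx     # sum of squared errors
s = math.sqrt(sse / (n - 2))         # standard error of the estimate
se_b = s / math.sqrt(ss_xx)          # standard error of the slope
t_stat = b / se_b

# Assumed table value: t(0.025, 14) for a two-sided test at alpha = 0.05.
T_CRIT = 2.1448

print(f"s = {s:.5f}, se(b) = {se_b:.6f}, t = {t_stat:.4f}")
print("Reject H0" if abs(t_stat) > T_CRIT else "Fail to reject H0")
```

Squaring this t statistic gives the ANOVA F statistic, which is why the two tests always agree in simple linear regression.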
Correlation Hypothesis test:
Null and alternative hypothesis:
Ho: ρ = 0
Ha: ρ ≠ 0
Correlation coefficient, r = SSxy/√(SSxx*SSyy) = 106708954.95661/√(106353835.75*117904219.63968) = 0.9529
Test statistic :
t = r*√(n-2)/√(1-r²) = 0.9529 *√(16 - 2)/√(1 - 0.9529²) = 11.7597
df = n-2 = 14
p-value = T.DIST.2T(ABS(11.7597), 14) = 0.0000 (to four decimal places)
Conclusion:
Since p-value < α = 0.05, reject the null hypothesis. There is a significant linear correlation between x and y.
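The slope test and the correlation test both give t = 11.7597, and the ANOVA statistic F = 138.2894 is its square. A short derivation shows why, writing s_e for the standard error se used above and t_b, t_r for the two t statistics (these labels are introduced here only for the derivation):

```latex
\begin{align*}
t_b &= \frac{b}{s_e/\sqrt{SS_{xx}}}
     = \frac{SS_{xy}/\sqrt{SS_{xx}}}{\sqrt{SSE/(n-2)}}, \\
SSE &= SS_{yy} - \frac{SS_{xy}^2}{SS_{xx}} = SS_{yy}\,(1 - r^2),
      \qquad r = \frac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}}, \\
t_b &= \frac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}}\cdot\frac{\sqrt{n-2}}{\sqrt{1-r^2}}
     = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = t_r, \\
F &= \frac{SSR}{SSE/(n-2)} = \frac{SS_{xy}^2/SS_{xx}}{s_e^2} = t_b^2 .
\end{align*}
```

So in simple linear regression the three tests are algebraically the same test and must lead to the same decision.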
95% Confidence interval for slope:
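Critical value, tc = t(0.025, 14) = 2.1448
Lower limit = b - tc*se/√SSxx = 1.00334 - 2.1448*0.08532 = 0.8203
Upper limit = b + tc*se/√SSxx = 1.00334 + 2.1448*0.08532 = 1.1863
Since the confidence interval does not contain 0, we reject the null hypothesis.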
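The interval limits can be verified numerically. This sketch takes the slope, the standard error and SSxx from the solution; the critical value 2.1448 for t(0.025, 14) is an assumed table value.

```python
import math

b = 1.003339035        # estimated slope
s = 879.89284          # standard error of the estimate
ss_xx = 106353835.75
T_CRIT = 2.1448        # assumed table value: t(0.025, 14)

se_b = s / math.sqrt(ss_xx)     # standard error of the slope
margin = T_CRIT * se_b
lower, upper = b - margin, b + margin
contains_zero = lower <= 0 <= upper

print(f"95% CI for the slope: ({lower:.4f}, {upper:.4f})")
print("0 inside interval:", contains_zero)
```

In cost terms, the interval says each extra foot of depth adds somewhere between about $820 and $1,186, since cost is measured in $1000.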
c)
Coefficient of determination, r² = (SSxy)²/(SSxx*SSyy)
= (106708954.95661)²/(106353835.75*117904219.63968) = 0.9081
About 90.81% of the variation in cost (y) is explained by its linear relationship with depth (x). The remaining 9.19% is unexplained by the model.
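A brief sketch confirms that the two usual ways of writing the coefficient of determination, SSxy²/(SSxx·SSyy) and 1 - SSE/SSyy, give the same number for these data. The variable names are illustrative.

```python
ss_xx = 106353835.75
ss_yy = 117904219.63968
ss_xy = 106708954.95661

r_squared = ss_xy ** 2 / (ss_xx * ss_yy)

# Equivalent form: one minus the unexplained share of the total variation.
sse = ss_yy - ss_xy ** 2 / ss_xx
r_squared_alt = 1 - sse / ss_yy

print(f"R^2 = {r_squared:.4f} (alternative form: {r_squared_alt:.4f})")
```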
d) Residuals:
X | Y | Predicted value, ŷ | Residual, y-ŷ |
5000 | 2596.8 | -2277.0694 + (1.0033) * 5000 = 2739.6257 | -142.8257 |
5200 | 3328 | -2277.0694 + (1.0033) * 5200 = 2940.2936 | 387.7064 |
6000 | 3181.1 | -2277.0694 + (1.0033) * 6000 = 3742.9648 | -561.8647 |
6538 | 3198.4 | -2277.0694 + (1.0033) * 6538 = 4282.7612 | -1084.3613 |
7109 | 4779.9 | -2277.0694 + (1.0033) * 7109 = 4855.6678 | -75.7679 |
7556 | 5905.6 | -2277.0694 + (1.0033) * 7556 = 5304.1603 | 601.4398 |
8005 | 5769.2 | -2277.0694 + (1.0033) * 8005 = 5754.6595 | 14.5406 |
8207 | 8089.5 | -2277.0694 + (1.0033) * 8207 = 5957.334 | 2132.1660 |
8210 | 4813.1 | -2277.0694 + (1.0033) * 8210 = 5960.344 | -1147.2439 |
8600 | 5618.7 | -2277.0694 + (1.0033) * 8600 = 6351.6463 | -732.9461 |
9026 | 7736 | -2277.0694 + (1.0033) * 9026 = 6779.0687 | 956.9313 |
9197 | 6788.3 | -2277.0694 + (1.0033) * 9197 = 6950.6397 | -162.3399 |
9926 | 7840.8 | -2277.0694 + (1.0033) * 9926 = 7682.0738 | 158.7260 |
10813 | 8882.5 | -2277.0694 + (1.0033) * 10813 = 8572.0356 | 310.4644 |
13800 | 10489.5 | -2277.0694 + (1.0033) * 13800 = 11569.0093 | -1079.5093 |
14311 | 12506.6 | -2277.0694 + (1.0033) * 14311 = 12081.7155 | 424.8841 |
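The table lists raw residuals, while part (d) asks for standardized ones. A minimal sketch is given below, assuming the simple definition "residual divided by se"; some textbooks also divide by √(1 - leverage), which gives slightly different values. The |z| > 2 rule for flagging a point is a common rule of thumb, not a formal test.

```python
import math

depth = [5000, 5200, 6000, 6538, 7109, 7556, 8005, 8207,
         8210, 8600, 9026, 9197, 9926, 10813, 13800, 14311]
cost = [2596.800049, 3328, 3181.100098, 3198.399902, 4779.899902,
        5905.600098, 5769.200195, 8089.5, 4813.100098, 5618.700195,
        7736, 6788.299805, 7840.799805, 8882.5, 10489.5, 12506.59961]

n = len(depth)
x_bar = sum(depth) / n
y_bar = sum(cost) / n
ss_xx = sum((x - x_bar) ** 2 for x in depth)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(depth, cost))
slope = ss_xy / ss_xx
intercept = y_bar - slope * x_bar

residuals = [y - (intercept + slope * x) for x, y in zip(depth, cost)]
s = math.sqrt(sum(e * e for e in residuals) / (n - 2))

# Simple standardization (assumption): divide each residual by s.
standardized = [e / s for e in residuals]
flagged = [x for x, z in zip(depth, standardized) if abs(z) > 2]

for x, z in zip(depth, standardized):
    note = "  <-- |z| > 2" if abs(z) > 2 else ""
    print(f"{x:>6} {z:>8.3f}{note}")
```

The (depth, standardized residual) pairs printed here are the points to plot. Under this definition only the well at 8207 feet exceeds 2 in absolute value, so it is the one candidate outlier; whether the remaining points scatter randomly around zero is what the plot should be used to judge.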