The cigarette data set (partially given below) presents data on tar, nicotine, weight (in grams) and carbon monoxide contents (in milligrams) for a sample of 25 (filter) brands of cigarettes tested in a recent year.
Tar (x1) | Nicotine (x2) | Weight (x3) | Carbon Monoxide (y)
14.1 | 0.86 | 0.9853 | 13.6
. | . | . | .
. | . | . | .
12.0 | 0.82 | 1.1184 | 14.9
Question 1
Answer the following for the variables Carbon Monoxide (response variable) and Tar (predictor variable).
a. Fit the regression line. Report the parameter estimates (the estimates of the intercept and slope).
b. Examine the residual plots and comment on the fit of the model. Are there any fit issues? Are there any outliers (use Cook’s D > 1 as a threshold)? If so, identify the observation numbers and delete the observation and repeat part a.
c. Is Tar useful (use α = 0.05) in predicting Carbon Monoxide? Why?
d. What percentage of the variation in Carbon Monoxide is explained by Tar? Is that high or low?
e. What is the predicted value for Carbon Monoxide when Tar is 10? Give a 95% prediction interval for this estimate.
SOLUTION
QUESTION 1
> B12 <- read.csv("C:/Users/pcc/Desktop/B12.csv")
> View(B12)
> B12
Carbon.Monoxide tar
1 13.6 14.1
2 16.6 16.0
3 23.5 29.8
4 10.2 8.0
5 5.4 4.1
6 15.0 15.0
7 9.0 8.8
8 12.3 12.4
9 16.3 16.6
10 15.4 14.9
11 13.0 13.7
12 14.4 15.1
13 10.0 7.8
14 10.2 11.4
15 9.5 9.0
16 1.5 1.0
17 18.5 17.0
18 12.6 12.8
19 17.5 15.8
20 4.9 4.5
21 15.9 14.5
22 8.5 7.3
23 10.6 8.6
24 13.9 15.2
25 14.9 12.0
> fit <- lm(Carbon.Monoxide ~ tar, data = B12)
> fit
Call:
lm(formula = Carbon.Monoxide ~ tar, data = B12)
Coefficients:
(Intercept) tar
2.743 0.801
> summary(fit)
Call:
lm(formula = Carbon.Monoxide ~ tar, data = B12)
Residuals:
Min 1Q Median 3Q Max
-3.1124 -0.7167 -0.3754 1.0091 2.5450
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.74328 0.67521 4.063 0.000481 ***
tar 0.80098 0.05032 15.918 6.55e-14 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.397 on 23 degrees of freedom
Multiple R-squared: 0.9168, Adjusted R-squared: 0.9132
F-statistic: 253.4 on 1 and 23 DF, p-value: 6.552e-14
> ## Estimated intercept = 2.74328 ##
> ## Estimated slope = 0.80098 ##
> ## Fitted line: Carbon.Monoxide = 2.74328 + 0.80098 * tar ##
>
> ## Residual plot ##
> plot(fit)
In the first plot (residuals vs. fitted) we observe a non-linear pattern.
In the normal Q-Q plot the points follow the reference line, so the normality assumption appears to hold.
In the scale-location plot the residual points are randomly spread, which means the assumption of equal variance is satisfied.
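Part b also asks for a Cook's D check, which the transcript above does not show. A minimal sketch of how that check could be run, assuming the 25 pairs from the B12 listing are typed in directly (the names `cig`, `fit_all`, `flagged` and `fit_reduced` are illustrative and not part of the original session):

```r
# Tar and carbon monoxide values copied from the B12 listing above.
tar <- c(14.1, 16.0, 29.8, 8.0, 4.1, 15.0, 8.8, 12.4, 16.6, 14.9, 13.7, 15.1,
         7.8, 11.4, 9.0, 1.0, 17.0, 12.8, 15.8, 4.5, 14.5, 7.3, 8.6, 15.2, 12.0)
co  <- c(13.6, 16.6, 23.5, 10.2, 5.4, 15.0, 9.0, 12.3, 16.3, 15.4, 13.0, 14.4,
         10.0, 10.2, 9.5, 1.5, 18.5, 12.6, 17.5, 4.9, 15.9, 8.5, 10.6, 13.9, 14.9)
cig <- data.frame(Carbon.Monoxide = co, tar = tar)

# Fit on all 25 observations (same model as `fit` in the transcript).
fit_all <- lm(Carbon.Monoxide ~ tar, data = cig)

# Cook's distance for every observation; flag those above the threshold of 1.
cooks_d <- cooks.distance(fit_all)
flagged <- which(cooks_d > 1)
print(round(cooks_d[flagged], 3))

# Repeat part a without the flagged observations (if any were flagged).
cig_reduced <- if (length(flagged) > 0) cig[-flagged, ] else cig
fit_reduced <- lm(Carbon.Monoxide ~ tar, data = cig_reduced)
print(coef(fit_reduced))
```

With these data only observation 3 (tar = 29.8) should exceed the threshold, with a Cook's D of roughly 3.5, and the refit should give an intercept near 1.41 and a slope near 0.93.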
## The p-value for tar is 6.55e-14, far below α = 0.05, so tar is highly significant for predicting carbon monoxide.
The t-test is used to check whether the independent variable has a significant effect on the dependent variable. ##
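The decision rule behind this conclusion can be sketched directly from the numbers in the `summary(fit)` output. The variable names below are illustrative only; the estimate, standard error and degrees of freedom are the ones printed in the transcript.

```r
# Values copied from the summary(fit) output above.
slope    <- 0.80098
se_slope <- 0.05032
df_resid <- 23
alpha    <- 0.05

# Test statistic for H0: slope = 0 against a two-sided alternative.
t_stat  <- slope / se_slope
# Two-sided critical value and p-value from the t distribution.
t_crit  <- qt(1 - alpha / 2, df = df_resid)
p_value <- 2 * pt(abs(t_stat), df = df_resid, lower.tail = FALSE)

cat("t =", round(t_stat, 3), " critical value =", round(t_crit, 3), "\n")
cat("reject H0 at alpha = 0.05:", abs(t_stat) > t_crit, "\n")
```

Because the t statistic (about 15.9) is far larger than the critical value (about 2.07), the null hypothesis of a zero slope is rejected.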
## The R-squared value is 0.9168, so tar explains 91.68% of the variation in carbon monoxide in our model, which is very high. ##
>
> ## prediction for Carbon monoxide when tar is 10 ##
> n=data.frame(tar=10)
> predict(fit,n,interval = "confidence")
fit lwr upr
1 10.75304 10.13083 11.37524
>
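> ## So the predicted value of carbon monoxide when tar = 10 is 10.75304 ##
> ## (10.13083, 11.37524) is the 95% confidence interval for the mean response; the prediction interval asked for in part e is wider and is obtained with interval = "prediction" ##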
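Since part e asks for a prediction interval rather than a confidence interval, the call can be repeated with `interval = "prediction"`. This is a sketch that re-enters the data from the B12 listing so that it runs on its own; the names `cig`, `fit_all`, `new_obs` and `pi_95` are illustrative.

```r
# Tar and carbon monoxide values copied from the B12 listing above.
tar <- c(14.1, 16.0, 29.8, 8.0, 4.1, 15.0, 8.8, 12.4, 16.6, 14.9, 13.7, 15.1,
         7.8, 11.4, 9.0, 1.0, 17.0, 12.8, 15.8, 4.5, 14.5, 7.3, 8.6, 15.2, 12.0)
co  <- c(13.6, 16.6, 23.5, 10.2, 5.4, 15.0, 9.0, 12.3, 16.3, 15.4, 13.0, 14.4,
         10.0, 10.2, 9.5, 1.5, 18.5, 12.6, 17.5, 4.9, 15.9, 8.5, 10.6, 13.9, 14.9)
cig <- data.frame(Carbon.Monoxide = co, tar = tar)

fit_all <- lm(Carbon.Monoxide ~ tar, data = cig)

# 95% prediction interval for a single new brand with tar = 10.
new_obs <- data.frame(tar = 10)
pi_95 <- predict(fit_all, new_obs, interval = "prediction", level = 0.95)
print(pi_95)
```

The point prediction stays at 10.75304, while the interval widens to approximately (7.80, 13.71) because it also accounts for the scatter of an individual observation around the regression line.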
a) Scatter plot:
Obs | Tar (T) | Carbon monoxide (C) | T*T | T*C | C*C
1 | 14.1 | 13.6 | 198.81 | 191.76 | 184.96
2 | 16 | 16.6 | 256 | 265.6 | 275.56
3 | 29.8 | 23.5 | 888.04 | 700.3 | 552.25
4 | 8 | 10.2 | 64 | 81.6 | 104.04
5 | 4.1 | 5.4 | 16.81 | 22.14 | 29.16
6 | 15 | 15 | 225 | 225 | 225
7 | 8.8 | 9 | 77.44 | 79.2 | 81
8 | 12.4 | 12.3 | 153.76 | 152.52 | 151.29
9 | 16.6 | 16.3 | 275.56 | 270.58 | 265.69
10 | 14.9 | 15.4 | 222.01 | 229.46 | 237.16
11 | 13.7 | 13 | 187.69 | 178.1 | 169
12 | 15.1 | 14.4 | 228.01 | 217.44 | 207.36
13 | 7.8 | 10 | 60.84 | 78 | 100
14 | 11.4 | 10.2 | 129.96 | 116.28 | 104.04
15 | 9 | 9.5 | 81 | 85.5 | 90.25
16 | 1 | 1.5 | 1 | 1.5 | 2.25
17 | 17 | 18.5 | 289 | 314.5 | 342.25
18 | 12.8 | 12.6 | 163.84 | 161.28 | 158.76
19 | 15.8 | 17.5 | 249.64 | 276.5 | 306.25
20 | 4.5 | 4.9 | 20.25 | 22.05 | 24.01
21 | 14.5 | 15.9 | 210.25 | 230.55 | 252.81
22 | 7.3 | 8.5 | 53.29 | 62.05 | 72.25
23 | 8.6 | 10.6 | 73.96 | 91.16 | 112.36
24 | 15.2 | 13.9 | 231.04 | 211.28 | 193.21
25 | 12 | 14.9 | 144 | 178.8 | 222.01
Mean | 12.216 | 12.528 | | |
Sum | 305.4 | 313.2 | 4501.2 | 4443.15 | 4462.92
n | 25 | | | |
Y=a+bX
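The formulas for a and b are not shown in the text. Assuming the usual least-squares formulas (an assumption, since the original derivation is missing), with T for Tar and C for Carbon Monoxide, the column sums from the table give:

```latex
\begin{aligned}
b &= \frac{\sum TC - \frac{(\sum T)(\sum C)}{n}}{\sum T^2 - \frac{(\sum T)^2}{n}}
   = \frac{4443.15 - \frac{(305.4)(313.2)}{25}}{4501.2 - \frac{(305.4)^2}{25}}
   = \frac{617.0988}{770.4336} \approx 0.80098 \\
a &= \bar{C} - b\,\bar{T} = 12.528 - 0.80098 \times 12.216 \approx 2.743
\end{aligned}
```

These values agree with the intercept and slope reported by `lm` in the R output.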
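c) Correlation coefficient to test prediction accuracy:
r = 0.9575 (95.75%), which indicates a strong linear relationship, so Tar is useful for prediction.
d) Percentage of variation:
R-squared is taken as the percentage of variation in Carbon Monoxide explained by Tar:
R-squared = 91.6778%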
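The calculation behind these two figures is not shown. A sketch of it, assuming the standard formula for the sample correlation coefficient and using the column sums for the full data set (n = 25), with T for Tar and C for Carbon Monoxide:

```latex
\begin{aligned}
S_{TT} &= \sum T^2 - \frac{(\sum T)^2}{n} = 4501.2 - \frac{(305.4)^2}{25} = 770.4336 \\
S_{CC} &= \sum C^2 - \frac{(\sum C)^2}{n} = 4462.92 - \frac{(313.2)^2}{25} = 539.1504 \\
S_{TC} &= \sum TC - \frac{(\sum T)(\sum C)}{n} = 4443.15 - \frac{(305.4)(313.2)}{25} = 617.0988 \\
r &= \frac{S_{TC}}{\sqrt{S_{TT}\,S_{CC}}} = \frac{617.0988}{\sqrt{770.4336 \times 539.1504}} \approx 0.9575 \\
r^2 &\approx 0.9168
\end{aligned}
```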
b)
Remove the outlier, the observation with Tar = 29.8 (observation 3), and refit:
Obs | Tar (T) | Carbon monoxide (C) | T*T | T*C | C*C
1 | 14.1 | 13.6 | 198.81 | 191.76 | 184.96
2 | 16 | 16.6 | 256 | 265.6 | 275.56
4 | 8 | 10.2 | 64 | 81.6 | 104.04
5 | 4.1 | 5.4 | 16.81 | 22.14 | 29.16
6 | 15 | 15 | 225 | 225 | 225
7 | 8.8 | 9 | 77.44 | 79.2 | 81
8 | 12.4 | 12.3 | 153.76 | 152.52 | 151.29
9 | 16.6 | 16.3 | 275.56 | 270.58 | 265.69
10 | 14.9 | 15.4 | 222.01 | 229.46 | 237.16
11 | 13.7 | 13 | 187.69 | 178.1 | 169
12 | 15.1 | 14.4 | 228.01 | 217.44 | 207.36
13 | 7.8 | 10 | 60.84 | 78 | 100
14 | 11.4 | 10.2 | 129.96 | 116.28 | 104.04
15 | 9 | 9.5 | 81 | 85.5 | 90.25
16 | 1 | 1.5 | 1 | 1.5 | 2.25
17 | 17 | 18.5 | 289 | 314.5 | 342.25
18 | 12.8 | 12.6 | 163.84 | 161.28 | 158.76
19 | 15.8 | 17.5 | 249.64 | 276.5 | 306.25
20 | 4.5 | 4.9 | 20.25 | 22.05 | 24.01
21 | 14.5 | 15.9 | 210.25 | 230.55 | 252.81
22 | 7.3 | 8.5 | 53.29 | 62.05 | 72.25
23 | 8.6 | 10.6 | 73.96 | 91.16 | 112.36
24 | 15.2 | 13.9 | 231.04 | 211.28 | 193.21
25 | 12 | 14.9 | 144 | 178.8 | 222.01
Mean | 11.4833333 | 12.07083 | | |
Sum | 275.6 | 289.7 | 3613.16 | 3742.85 | 3910.67
n | 24 | | | |
Y = a + bX
Using the above formulas with the sums from the reduced table:
Regression equation: Y = 1.4129 + 0.9281X
Correlation coefficient: r = 0.966158
R-squared value = 0.933462 = 93.3462%
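The hand calculation for the reduced data set can be checked from the column sums alone. The helper `slr_from_sums` below is hypothetical (it is not part of the original solution) and assumes the standard least-squares formulas for a simple linear regression.

```r
# Hypothetical helper: simple linear regression from column sums.
# sx, sy are the sums of x and y; sxx, sxy, syy are the sums of x*x, x*y, y*y.
slr_from_sums <- function(n, sx, sy, sxx, sxy, syy) {
  Sxx <- sxx - sx^2 / n          # corrected sum of squares of x
  Sxy <- sxy - sx * sy / n       # corrected sum of cross products
  Syy <- syy - sy^2 / n          # corrected sum of squares of y
  slope     <- Sxy / Sxx
  intercept <- sy / n - slope * sx / n
  r         <- Sxy / sqrt(Sxx * Syy)
  c(intercept = intercept, slope = slope, r = r, r_squared = r^2)
}

# Sums from the table with the Tar = 29.8 observation removed (n = 24).
reduced <- slr_from_sums(n = 24, sx = 275.6, sy = 289.7,
                         sxx = 3613.16, sxy = 3742.85, syy = 3910.67)
print(round(reduced, 4))
```

Calling the same helper with the sums of the full table (n = 25) should reproduce the fit from part a.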