"FATALS","CUTTING"
270,15692
183,16198
319,17235
103,18463
149,18959
124,19103
62,19618
298,20436
330,21229
486,18660
302,17551
373,17466
187,17388
347,15261
168,14731
234,14237
68,13216
162,12017
27,11845
40,11905
26,11881
41,11974
116,11892
84,11810
43,12076
292,12342
89,12608
148,13049
166,11656
32,13305
72,13390
27,13625
154,13865
44,14445
3,14424
3,14315
153,13761
11,12471
9,10960
17,9218
2,9054
5,9218
63,8817
41,7744
10,6907
3,6440
26,6021
52,5561
31,5309
3,5320
19,4784
10,4311
12,3663
88,3060
0,2779
41,2623
2,2058
5,1890
2,1535
0,1515
0,1595
23,1803
4,1495
0,1432
The above contains data on the following two variables
• FATALS: the annual number of fatalities from gas and dust explosions in coal mines for years 1915 to 1978.
• CUTTING: the number of cutting machines in use
(a) Fit the regression model using FATALS as the dependent variable and CUTTING as the independent variable.
(b) Using appropriate residual plots and formal tests, investigate the violation of any assumptions. Do any assumptions of the linear regression model appear to be violated? If so, which one (or ones)?
(Hint: Plot of residuals versus fitted values can be used for linearity, zero mean, and constant variance. Normal probability plot of the residuals can be used for normality. We also have formal tests for the constant variance and normality assumptions that you can do in R).
Hint:
data=read.table('hmw6_prob3.txt', header=T, sep=',')
y=data$FATALS
x=data$CUTTING
As I only have the data values, I entered them manually in R and carried out the analysis. The code and output are as follows:
t=c(270,15692,183,16198,319,17235,103,18463,149,18959,124,19103,62,19618,298,20436,330,21229,486,18660,302,17551,373,17466,187,17388,347,15261,168,14731,234,14237,68,13216,162,12017,27,11845,40,11905,26,11881,41,11974,116,11892,84,11810,43,12076,292,12342,89,12608,148,13049,166,11656,32,13305,72,13390,27,13625,154,13865,44,14445,3,14424,3,14315,153,13761,11,12471,9,10960,17,9218,2,9054,5,9218,63,8817,41,7744,10,6907,3,6440,26,6021,52,5561,31,5309,3,5320,19,4784,10,4311,12,3663,88,3060,0,2779,41,2623,2,2058,5,1890,2,1535,0,1515,0,1595,23,1803,4,1495,0,1432)
# odd positions of t hold FATALS, even positions hold CUTTING
cutting=t[seq(2,128,2)]
fatals=t[seq(1,127,2)]
model=lm(fatals~cutting)
model
# fitted values from the (rounded) printed coefficients
fatals_fitted=-47.70923+0.01343*cutting
# residual = observed - fitted
residual=fatals-fatals_fitted
# residuals on the y-axis against fitted values on the x-axis
plot(fatals_fitted,residual)
qqnorm(residual)
shapiro.test(residual)
The output is as follows:
model
Call:
lm(formula = fatals ~ cutting)
Coefficients:
(Intercept) cutting
-47.70923 0.01343
Hence the fitted model is
FATALS = -47.70923 + 0.01343*CUTTING
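The fitted values and residuals can also be taken straight from the fitted model object instead of retyping the rounded coefficients. The following is a minimal sketch under that assumption; the names `res` and `fit` and the axis labels are my own choices, and the data vectors are the 64 rows listed above.

```r
# Data typed in from the table above (64 years, 1915 to 1978)
fatals <- c(270, 183, 319, 103, 149, 124, 62, 298,
            330, 486, 302, 373, 187, 347, 168, 234,
            68, 162, 27, 40, 26, 41, 116, 84,
            43, 292, 89, 148, 166, 32, 72, 27,
            154, 44, 3, 3, 153, 11, 9, 17,
            2, 5, 63, 41, 10, 3, 26, 52,
            31, 3, 19, 10, 12, 88, 0, 41,
            2, 5, 2, 0, 0, 23, 4, 0)
cutting <- c(15692, 16198, 17235, 18463, 18959, 19103, 19618, 20436,
             21229, 18660, 17551, 17466, 17388, 15261, 14731, 14237,
             13216, 12017, 11845, 11905, 11881, 11974, 11892, 11810,
             12076, 12342, 12608, 13049, 11656, 13305, 13390, 13625,
             13865, 14445, 14424, 14315, 13761, 12471, 10960, 9218,
             9054, 9218, 8817, 7744, 6907, 6440, 6021, 5561,
             5309, 5320, 4784, 4311, 3663, 3060, 2779, 2623,
             2058, 1890, 1535, 1515, 1595, 1803, 1495, 1432)

model <- lm(fatals ~ cutting)

# Residuals (observed minus fitted) and fitted values at full precision
res <- resid(model)
fit <- fitted(model)

# Residuals vs fitted: checks linearity, zero mean and constant variance
plot(fit, res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal Q-Q plot with a reference line to judge straightness
qqnorm(res)
qqline(res)
```

Using `resid()` and `fitted()` avoids both rounding error in the coefficients and sign mistakes in the residual definition.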
The plot of residuals against fitted values shows that the spread of the residuals is not constant, hence the assumption of constant variance is violated.
To check whether the residuals are normally distributed, I used a normal Q-Q plot and the Shapiro-Wilk test.
This is the required Q-Q plot.
The points in the Q-Q plot do not fall perfectly on a straight line, but they are roughly linear, so the residuals can be regarded as approximately normal.
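The question also asks for a formal test of constant variance. One common choice is the Breusch-Pagan test; the sketch below computes its studentized (Koenker) form by hand in base R, so no extra package is assumed. The names `aux`, `bp_stat` and `bp_p` are illustrative only. If the `lmtest` package is installed, `bptest(model)` should give the same statistic with its default settings.

```r
# Data typed in from the table above (64 years, 1915 to 1978)
fatals <- c(270, 183, 319, 103, 149, 124, 62, 298,
            330, 486, 302, 373, 187, 347, 168, 234,
            68, 162, 27, 40, 26, 41, 116, 84,
            43, 292, 89, 148, 166, 32, 72, 27,
            154, 44, 3, 3, 153, 11, 9, 17,
            2, 5, 63, 41, 10, 3, 26, 52,
            31, 3, 19, 10, 12, 88, 0, 41,
            2, 5, 2, 0, 0, 23, 4, 0)
cutting <- c(15692, 16198, 17235, 18463, 18959, 19103, 19618, 20436,
             21229, 18660, 17551, 17466, 17388, 15261, 14731, 14237,
             13216, 12017, 11845, 11905, 11881, 11974, 11892, 11810,
             12076, 12342, 12608, 13049, 11656, 13305, 13390, 13625,
             13865, 14445, 14424, 14315, 13761, 12471, 10960, 9218,
             9054, 9218, 8817, 7744, 6907, 6440, 6021, 5561,
             5309, 5320, 4784, 4311, 3663, 3060, 2779, 2623,
             2058, 1890, 1535, 1515, 1595, 1803, 1495, 1432)

model <- lm(fatals ~ cutting)
res <- resid(model)
n <- length(res)

# Auxiliary regression: squared residuals on the predictor
aux <- lm(I(res^2) ~ cutting)

# Studentized Breusch-Pagan statistic: n * R^2 of the auxiliary regression,
# compared with a chi-square distribution with 1 df (one predictor)
bp_stat <- n * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

cat("BP statistic:", bp_stat, " p-value:", bp_p, "\n")
```

A small p-value here would count as formal evidence against constant variance, supporting what the residual plot suggests.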
Shapiro-Wilk normality test
data: residual
W = 0.95666, p-value = 0.02465
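This is the Shapiro-Wilk test output. The p-value is 0.02465, which is greater than 0.01, so at the 1% level of significance we fail to reject the null hypothesis that the residuals are normally distributed. Note that at the 5% level the p-value is below 0.05 and normality would be rejected, so the evidence for normality is weak.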