Regression analysis (R code with a step-by-step explanation needed; no handwritten answers):
A) Provide a 95% confidence interval for each of the estimated parameters.
B) Use hypothesis testing to test the significance of the regression, and perform a residual analysis.
C) Perform a lack-of-fit test. Apply a corresponding transformation to correct model inadequacies, if any.
D) Check for multicollinearity and validate your model accordingly.
We seek a model of the form
B = A0 * X0 + A1 * X1 + A2 * X2 + A3 * X3 + A4 * X4
The dataset is given in a format that can be run directly in RStudio:
Variables: a0 = 1 (constant term); a1 = petrol tax; a2 = per capita income; a3 = number of miles of paved highway; a4 = proportion of drivers; b = consumption of petrol.
b <- c(541, 524, 561, 414, 410, 457, 344, 467, 464, 498, 580, 471,
       525, 508, 566, 635, 603, 714, 865, 640, 649, 540, 464, 547,
       460, 566, 577, 631, 574, 534, 571, 554, 577, 628, 487, 644,
       640, 704, 648, 968, 587, 699, 632, 591, 782, 510, 610, 524)
a4 <- c(0.525, 0.572, 0.580, 0.529, 0.544, 0.571, 0.451, 0.553, 0.529, 0.552, 0.530, 0.525,
        0.574, 0.545, 0.608, 0.586, 0.572, 0.540, 0.724, 0.677, 0.663, 0.602, 0.511, 0.517,
        0.551, 0.544, 0.548, 0.579, 0.563, 0.493, 0.518, 0.513, 0.578, 0.547, 0.487, 0.629,
        0.566, 0.586, 0.663, 0.672, 0.626, 0.563, 0.603, 0.508, 0.672, 0.571, 0.623, 0.593)
a3 <- c(1976, 1250, 1586, 2351, 431, 1333, 11868, 2138, 8577, 8507, 5939, 14186,
        6930, 6580, 8159, 10340, 8508, 4725, 5915, 6010, 7834, 602, 2449, 4686,
        2619, 4746, 5399, 9061, 5975, 4650, 6905, 6594, 6524, 4121, 3495, 7834,
        17782, 6385, 3274, 3905, 4639, 3985, 3635, 2611, 2302, 3942, 4083, 9794)
a2 <- c(3571, 4092, 3865, 4870, 4399, 5342, 5319, 5126, 4447, 4512, 4391, 5126,
        4817, 4207, 4332, 4318, 4206, 3718, 4716, 4341, 4593, 4983, 4897, 4258,
        4574, 3721, 3448, 3846, 4188, 3601, 3640, 3333, 3063, 3357, 3528, 3802,
        4045, 3897, 3635, 4345, 4449, 3656, 4300, 3745, 5215, 4476, 4296, 5002)
a1 <- c(9.00, 9.00, 9.00, 7.50, 8.00, 10.00, 8.00, 8.00, 8.00, 7.00, 8.00, 7.50,
        7.00, 7.00, 7.00, 7.00, 7.00, 7.00, 7.00, 8.50, 7.00, 8.00, 9.00, 9.00,
        8.50, 9.00, 8.00, 7.50, 8.00, 9.00, 7.00, 7.00, 8.00, 7.50, 8.00, 6.58,
        5.00, 7.00, 8.50, 7.00, 7.00, 7.00, 7.00, 7.00, 6.00, 9.00, 7.00, 7.00)
Note: I assume X0 = 1.
m0 is the regression model; lm() includes the intercept (the X0 = 1 term) automatically.
m0 <- lm(b ~ a1 + a2 + a3 + a4)
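a. 95% confidence intervals
confint(m0)
# To access an individual confidence interval, e.g. a2 (per capita income), index by row name.
# Row 1 is the intercept, so the numeric index 2 would return a1 (petrol tax) instead.
confint(m0)["a2", ]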
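As a cross-check, each interval returned by confint() for a linear model is the estimate plus or minus a t quantile times the standard error. The sketch below rebuilds the intervals from the rounded estimates and standard errors printed by summary(m0), using the 43 residual degrees of freedom reported there. The names est, se, df_resid, t_crit and ci are illustrative, and because the inputs are rounded the limits should agree with confint(m0) only approximately.

```r
# Rounded coefficient estimates and standard errors, copied from summary(m0)
est <- c("(Intercept)" = 377.3, a1 = -34.79, a2 = -0.06659,
         a3 = -0.002426, a4 = 1336)
se  <- c("(Intercept)" = 185.5, a1 = 12.97, a2 = 0.01722,
         a3 = 0.003389, a4 = 192.3)

df_resid <- 43                      # n - p = 48 - 5
t_crit   <- qt(0.975, df_resid)     # two-sided 95% critical value

# estimate +/- t * SE, one row per parameter
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
print(ci)
```

An interval that contains 0 corresponds to a coefficient that is not significant at the 5% level; with these numbers that is the case only for a3.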
b. Significance testing
summary(m0)
# You will get these results:
#
#               Estimate Std. Error t value Pr(>|t|)
# (Intercept)  3.773e+02  1.855e+02   2.033 0.048207 *
# a1          -3.479e+01  1.297e+01  -2.682 0.010332 *
# a2          -6.659e-02  1.722e-02  -3.867 0.000368 ***
# a3          -2.426e-03  3.389e-03  -0.716 0.477999
# a4           1.336e+03  1.923e+02   6.950 1.52e-08 ***
#
# Residual standard error: 66.31 on 43 degrees of freedom
# Multiple R-squared: 0.6787, Adjusted R-squared: 0.6488
# F-statistic: 22.71 on 4 and 43 DF, p-value: 3.907e-10
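Since the p-value of the overall F-test is 3.907e-10, which is less than 0.05, we reject the hypothesis that all slope coefficients are zero: the regression is useful in explaining the variation in b (consumption of petrol). Individually, a1, a2 and a4 are significant at the 5% level, while a3 (p = 0.478) is not.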
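The overall F-statistic can be recovered from the reported R-squared, which shows where the p-value comes from. This is a minimal sketch using the rounded values printed by summary(m0); the variable names r2, k, n, f_stat and p_val are illustrative, and rounding means the result matches the printed output only to about two decimals.

```r
r2 <- 0.6787   # Multiple R-squared from summary(m0)
k  <- 4        # number of predictors (a1 to a4)
n  <- 48       # number of observations

# F = (explained variation per predictor) / (unexplained variation per residual df)
f_stat <- (r2 / k) / ((1 - r2) / (n - k - 1))

# Upper-tail probability of the F distribution on (k, n - k - 1) degrees of freedom
p_val <- pf(f_stat, k, n - k - 1, lower.tail = FALSE)

cat("F =", round(f_stat, 2), " p-value =", signif(p_val, 3), "\n")
```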
Residual Analysis.
par(mfrow = c(2, 2))
par(mar = c(1, 1, 1, 1))
plot(m0)
In the Residuals vs Fitted plot, most of the points (except points 18 and 40) lie around the centre line.
In the Normal Q-Q plot, the residuals follow a normal distribution reasonably well (except for point 40).
In the Residuals vs Leverage plot, only point 40 stands out as an outlier.
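The visual reading of the plots can be backed up numerically. The sketch below defines a hypothetical helper, flag_unusual(), that lists observations with a large standardized residual or a large Cook's distance. The cutoffs (|residual| > 2 and Cook's distance > 4/n) are common rules of thumb chosen here as assumptions, not part of the original answer, and the demonstration uses the built-in cars data only so that the block runs on its own.

```r
# Hypothetical helper: list observations with large standardized residuals
# or large Cook's distance. The cutoffs are rules of thumb, not fixed rules.
flag_unusual <- function(model, resid_cut = 2) {
  rs <- rstandard(model)          # standardized residuals
  cd <- cooks.distance(model)     # influence of each observation on the fit
  n  <- length(rs)
  flagged <- which(abs(rs) > resid_cut | cd > 4 / n)
  data.frame(obs = flagged,
             std_resid = rs[flagged],
             cooks_d = cd[flagged],
             row.names = NULL)
}

# Demonstration on a built-in dataset so the block runs on its own
demo_fit <- lm(dist ~ speed, data = cars)
print(flag_unusual(demo_fit))
```

For the petrol model, the call would be flag_unusual(m0); shapiro.test(residuals(m0)) gives a formal normality check to go with the Q-Q plot.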
Lack of fit test.
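# Test statistic: (n - p) * sigma_hat^2 / sigma^2, compared with a chi-square on n - p degrees of freedom
# sigma_hat = residual standard error = 66.31 on n - p = 48 - 5 = 43 degrees of freedom
# This test needs a known error variance sigma^2; none is given here, so sigma^2 = 1 is assumed
test.stat <- (48 - 5) * 66.31^2 / 1
1 - pchisq(test.stat, 43)
You will get a p-value of 0, which indicates lack of fit. This conclusion depends entirely on the assumed value of sigma^2; without a known error variance or replicated observations, the test is only indicative.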
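The chi-square statistic (n - p) * sigma_hat^2 / sigma^2 requires a known error variance sigma^2, so its verdict changes with the value plugged in. The sketch below makes that dependence explicit with a hypothetical helper, lof_chisq(). Both values of sigma^2 used are illustrative assumptions, since the problem does not supply one; sigma_hat and the degrees of freedom are taken from summary(m0).

```r
# Hypothetical helper: chi-square lack-of-fit test for a *known* error variance
lof_chisq <- function(sigma_hat, df_resid, sigma2) {
  stat <- df_resid * sigma_hat^2 / sigma2
  c(statistic = stat, p.value = pchisq(stat, df_resid, lower.tail = FALSE))
}

sigma_hat <- 66.31   # residual standard error from summary(m0)
df_resid  <- 43      # residual degrees of freedom from summary(m0)

# The assumed values of sigma^2 below are illustrative only
res_unit  <- lof_chisq(sigma_hat, df_resid, sigma2 = 1)
res_equal <- lof_chisq(sigma_hat, df_resid, sigma2 = sigma_hat^2)
print(rbind(sigma2_is_1 = res_unit, sigma2_is_sigma_hat_sq = res_equal))
```

With sigma^2 = 1 the p-value is essentially zero, while setting sigma^2 equal to the estimated variance makes the statistic equal its degrees of freedom and the test unremarkable. A lack-of-fit test that does not rely on an assumed variance would need replicated observations.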
Multicollinearity test
car::vif(m0)
#       a1       a2       a3       a4
# 1.625676 1.043274 1.496937 1.216355
Since all the VIFs are less than 5, we can safely say that multicollinearity is not a problem in this model.
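To see what the VIF measures, it can be computed by hand: for predictor j, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing that predictor on all the others. The sketch below implements this with a hypothetical helper, vif_manual(); the demonstration uses four columns of the built-in mtcars data purely so the block runs on its own.

```r
# Hypothetical helper: VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from
# regressing predictor j on all the other predictors
vif_manual <- function(X) {
  X <- as.data.frame(X)
  sapply(names(X), function(v) {
    others <- setdiff(names(X), v)
    r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
    1 / (1 - r2)
  })
}

# Demonstration on built-in data so the block runs on its own
demo_X <- mtcars[, c("disp", "hp", "wt", "qsec")]
demo_vif <- vif_manual(demo_X)
print(round(demo_vif, 3))
```

For the petrol data, vif_manual(data.frame(a1, a2, a3, a4)) should reproduce the values from car::vif(m0), since the model contains only numeric predictors.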