The electric power consumed each month by a chemical plant is thought to be related to the average ambient temperature (x1), the number of days in the month (x2), the average product purity (x3), and the tons of product produced (x4). The past year’s historical data are available and are presented in the following table:
y x1 x2 x3 x4
240 25 24 91 100
236 31 21 90 95
270 45 24 88 110
274 60 25 87 88
301 65 25 91 94
316 72 26 94 99
300 80 25 87 97
296 84 25 86 96
267 75 24 88 110
276 60 25 91 105
288 50 25 90 100
261 38 23 89 98
Fit a multiple linear regression model to this data. Determine whether or not the model you have assumed can be simplified by dropping any of the terms. Calculate the ANOVA table for the model, which you have determined. What is the estimate of the variance in the electric power measurement? Plot residual plots to determine if the assumptions underlying your analysis are met. Is this normal assumption a good one here? Calculate parameter covariance and parameter correlation matrix for your model. Do you have any concerns on the basis of this matrix?
I used RStudio to answer this question, and will explain each step with code.
First, copy the data into Excel and import it into RStudio via the Import Dataset tab in the Environment pane.
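Alternatively, the data from the table can be typed in directly as a data frame; the name regression matches the lm() calls used below:

```r
# Enter the data from the table directly; the data frame name
# "regression" is the one used in the lm() calls that follow.
regression <- data.frame(
  y  = c(240, 236, 270, 274, 301, 316, 300, 296, 267, 276, 288, 261),
  x1 = c(25, 31, 45, 60, 65, 72, 80, 84, 75, 60, 50, 38),
  x2 = c(24, 21, 24, 25, 25, 26, 25, 25, 24, 25, 25, 23),
  x3 = c(91, 90, 88, 87, 91, 94, 87, 86, 88, 91, 90, 89),
  x4 = c(100, 95, 110, 88, 94, 99, 97, 96, 110, 105, 100, 98)
)
```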
1. Fit a multiple linear regression model to this data
Code
fit <- lm(y ~ ., data = regression)
summary(fit)
R Studio Output
Call:
lm(formula = y ~ ., data = regression)
Residuals:
Min 1Q Median 3Q Max
-14.098 -9.778 1.767 6.798 13.016
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -123.1312 157.2561 -0.783 0.459
x1 0.7573 0.2791 2.713 0.030 *
x2 7.5188 4.0101 1.875 0.103
x3 2.4831 1.8094 1.372 0.212
x4 -0.4811 0.5552 -0.867 0.415
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 11.79 on 7 degrees of freedom
Multiple R-squared: 0.852, Adjusted R-squared: 0.7675
F-statistic: 10.08 on 4 and 7 DF, p-value: 0.00496
2. Determine whether or not the model you have assumed can be simplified by dropping any of the terms
First check for multicollinearity: a common rule of thumb is to consider dropping a variable whose VIF exceeds 5.
library(car)
vif(fit)
x1 x2 x3 x4
2.322836 2.160759 1.335410 1.008733
All variables have VIF well below 5, so no variable needs to be dropped on the basis of multicollinearity.
Next, use backward elimination based on p-values: at each step, drop the predictor with the largest p-value above 0.05 and refit. x4 has the largest p-value (0.415), so remove it and refit the model.
fit <- lm(y ~ . - x4, data = regression)
summary(fit)
Call:
lm(formula = y ~ . - x4, data = regression)
Residuals:
Min 1Q Median 3Q Max
-17.810 -6.770 3.257 7.978 9.531
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -162.1350 148.3154 -1.093 0.3061
x1 0.7487 0.2745 2.727 0.0260 *
x2 7.6906 3.9424 1.951 0.0869 .
x3 2.3434 1.7739 1.321 0.2230
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 11.6 on 8 degrees of freedom
Multiple R-squared: 0.8362, Adjusted R-squared: 0.7747
F-statistic: 13.61 on 3 and 8 DF, p-value: 0.001652
Now x3 has the largest p-value (0.2230), which is greater than 0.05, so remove it as well and refit the model.
fit <- lm(y ~ . - x4 - x3, data = regression)
summary(fit)
Call:
lm(formula = y ~ . - x4 - x3, data = regression)
Residuals:
Min 1Q Median 3Q Max
-17.051 -9.842 3.167 8.114 13.902
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.9164 81.9126 0.048 0.9629
x1 0.5727 0.2498 2.293 0.0475 *
x2 9.8824 3.7213 2.656 0.0262 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 12.07 on 9 degrees of freedom
Multiple R-squared: 0.8004, Adjusted R-squared: 0.7561
F-statistic: 18.05 on 2 and 9 DF, p-value: 0.0007085
After removing x4 and x3, the final model is built on x1 and x2 only:
y = 3.9164 + 0.5727*x1 + 9.8824*x2
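As a quick check of the fitted equation, a prediction can be computed by hand; the temperature and day values below are hypothetical, not from the data:

```r
# Predicted power for a hypothetical month with average ambient
# temperature 70 (x1) and 25 days (x2), from the fitted equation:
y_hat <- 3.9164 + 0.5727 * 70 + 9.8824 * 25
y_hat  # about 291.07
```

The same number would come from predict(fit, newdata = data.frame(x1 = 70, x2 = 25)).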
3. Calculate the ANOVA table for the model, which you have determined.
anova(fit)
Analysis of Variance Table
Response: y
Df Sum Sq Mean Sq F value Pr(>F)
x1 1 4233.4 4233.4 29.0470 0.000439 ***
x2 1 1027.8 1027.8 7.0523 0.026234 *
Residuals 9 1311.7 145.7
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
4. What is the estimate of the variance in the electric power measurement?
The residual standard error is 12.07 on 9 degrees of freedom, so the estimated variance is 12.07^2 ≈ 145.7, which matches the residual mean square in the ANOVA table above.
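The same number can be computed directly from the fitted model object (fit from step 2):

```r
# Residual variance estimate: residual sum of squares / residual df.
# Equivalent to sigma(fit)^2, about 145.7 here.
sigma2 <- deviance(fit) / df.residual(fit)
```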
5. Plot residual plots to determine if the assumptions underlying your analysis are met. Is this normal assumption a good one here?
plot(fit,which=1)
Ideally the red line in this residuals-vs-fitted plot should be horizontal or close to it. If it shows a clear pattern, the model may be missing non-linear components. In that case, examine the pairwise relationship between the response and each predictor, and consider transforming any predictor whose relationship with the response is non-linear.
plot(fit,which=2)
This normal Q-Q plot tells you whether your residuals are normal. Points lying close to the straight line indicate approximately normal residuals, as they do here, so the normality assumption is a reasonable one.
plot(fit,which=3)
This scale-location plot tells us whether the residuals are homoscedastic. A clear pattern would indicate heteroscedasticity, which is not the case here. If it were, you could log-transform the response and rebuild the model. Why a log transform? It compresses the scale of the response, and in turn of the residuals, so differences in spread become less pronounced.
6. Calculate the parameter covariance and parameter correlation matrix for your model. Do you have any concerns on the basis of this matrix?
data1 <- regression[, c("y", "x1", "x2")]
cor(data1)
y x1 x2
y 1.0000000 0.8025385 0.826963
x1 0.8025385 1.0000000 0.660456
x2 0.8269630 0.6604560 1.000000
cov(data1)
y x1 x2
y 597.53788 380.71970 26.33333
x1 380.71970 376.62879 16.69697
x2 26.33333 16.69697 1.69697
The correlation between x1 and x2 (0.66) is only moderate, consistent with the low VIFs found earlier, so there are no serious concerns on the basis of this matrix.
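Note that cor(data1) and cov(data1) describe the data. The covariance and correlation matrices of the estimated coefficients themselves can be obtained from the fitted model object (fit from step 2):

```r
# Parameter (coefficient) covariance matrix, sigma^2 * (X'X)^{-1},
# and the corresponding parameter correlation matrix.
vcov(fit)
cov2cor(vcov(fit))
```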