In: Statistics and Probability
data file:
"STATE" "MALE" "BIRTH" "DIVO" "BEDS" "EDUC" "INCO" "LIFE"
AK 119.1 24.8 5.6 603.3 14.1 4638 69.31
AL 93.3 19.4 4.4 840.9 7.8 2892 69.05
AR 94.1 18.5 4.8 569.6 6.7 2791 70.66
AZ 96.8 21.2 7.2 536.0 12.6 3614 70.55
CA 96.8 18.2 5.7 649.5 13.4 4423 71.71
CO 97.5 18.8 4.7 717.7 14.9 3838 72.06
CT 94.2 16.7 1.9 791.6 13.7 4871 72.48
DC 86.8 20.1 3.0 1859.4 17.8 4644 65.71
DE 95.2 19.2 3.2 926.8 13.1 4468 70.06
FL 93.2 16.9 5.5 668.2 10.3 3698 70.66
GA 94.6 21.1 4.1 705.4 9.2 3300 68.54
HW 108.1 21.3 3.4 794.3 14.0 4599 73.60
IA 94.6 17.1 2.5 773.9 9.1 3643 72.56
ID 99.7 20.3 5.1 541.5 10.0 3243 71.87
IL 94.2 18.5 3.3 871.0 10.3 4446 70.14
IN 95.1 19.1 2.9 736.1 8.3 3709 70.88
KS 96.2 17.0 3.9 854.6 11.4 3725 72.58
KY 96.3 18.7 3.3 661.9 7.2 3076 70.10
LA 94.7 20.4 1.4 724.0 9.0 3023 68.76
MA 91.6 16.6 1.9 1103.8 12.6 4276 71.83
MD 95.5 17.5 2.4 841.3 13.9 4267 70.22
ME 94.8 17.9 3.9 919.5 8.4 3250 70.93
MI 96.1 19.4 3.4 754.7 9.4 4041 70.63
MN 96.0 18.0 2.2 905.4 11.1 3819 72.96
MO 93.2 17.3 3.8 801.6 9.0 3654 70.69
MS 94.0 22.1 3.7 763.1 8.1 2547 68.09
MT 99.9 18.2 4.4 668.7 11.0 3395 70.56
NC 95.9 19.3 2.7 658.8 8.5 3200 69.21
ND 101.8 17.6 1.6 959.9 8.4 3077 72.79
NE 95.4 17.3 2.5 866.1 9.6 3657 72.60
NH 95.7 17.9 3.3 878.2 10.9 3720 71.23
NJ 93.7 16.8 1.5 713.1 11.8 4684 70.93
NM 97.2 21.7 4.3 560.9 12.7 3045 70.32
NV 102.8 19.6 18.7 560.7 10.8 4583 69.03
NY 91.5 17.4 1.4 1056.2 11.9 4605 70.55
OH 94.1 18.7 3.7 751.0 9.3 3949 70.82
OK 94.9 17.5 6.6 664.6 10.0 3341 71.42
OR 95.9 16.8 4.6 607.1 11.8 3677 72.13
PA 92.4 16.3 1.9 948.9 8.7 3879 70.43
RI 96.2 16.5 1.8 960.5 9.4 3878 71.90
SC 96.5 20.1 2.2 739.9 9.0 2951 67.96
SD 98.4 17.6 2.0 984.7 8.6 3108 72.08
TN 93.7 18.4 4.2 831.6 7.9 3079 70.11
TX 95.9 20.6 4.6 674.0 10.9 3507 70.90
UT 97.6 25.5 3.7 470.5 14.0 3169 72.90
VA 97.7 18.6 2.6 835.8 12.3 3677 70.08
VT 95.6 18.8 2.3 1026.1 11.5 3447 71.64
WA 98.7 17.8 5.2 556.4 12.7 3997 71.72
WI 96.3 17.6 2.0 814.7 9.8 3712 72.48
WV 93.9 17.8 3.2 950.4 6.8 3038 69.48
WY 100.7 19.6 5.4 925.9 11.8 3672 70.29
We consider a multiple linear regression model with LIFE (y) as the response variable, and MALE (x1), BIRTH (x2), DIVO (x3), BEDS (x4), EDUC (x5), and INCO (x6), as predictors. Answer the following questions using least square estimates in term of matrix formulas.
compute the following using matrix formulas,
(a) Compute and report the least-squares estimates. Write down the least-squares regression equation.
(b) Explain in context what the coefficients corresponding to MALE and BIRTH mean.
(c) Compute the biased and the unbiased estimates of the error variance σ2.
(d) Using the unbiased estimate of error variance, Compute the standard errors of the estimators of the regression coefficients.
(e) Compute the coefficient of determination. Give a practical interpretation of your result.
R commands and outputs:
> d=read.table("data.txt",header=TRUE)
> dim(d
) [1] 51 8
> 51*8
[1] 408
> head(d)
head(d) STATE MALE BIRTH DIVO BEDS EDUC INCO LIFE
1 AK 119.1 24.8 5.6 603.3 14.1 4638 69.31
2 AL 93.3 19.4 4.4 840.9 7.8 2892 69.05
3 AR 94.1 18.5 4.8 569.6 6.7 2791 70.66
4 AZ 96.8 21.2 7.2 536.0 12.6 3614 70.55
5 CA 96.8 18.2 5.7 649.5 13.4 4423 71.71
6 CO 97.5 18.8 4.7 717.7 14.9 3838 72.06
> y=d$LIFE
> x1=d$MALE
> x2=d$BIRTH
> x3=d$DIVO
> x4=d$BEDS
> x5=d$EDUC
> x6=d$INCO
> ###Matrix notation
> ###betahats
> n=dim(d)[1]
> n=51
> I=rep(1,n)
> X=cbind(I,x1,x2,x3,x4,x5,x6)
> betahat=solve(t(X)%*%X)%*%t(X)%*%y
> ## Coefficients corresponding to MALE and BIRTH
> ## MALE: x1= 0.1261018758. As value of variable MALE increases, value of y (LIFE) increases. If MALE decreases decreases, y decreases
. > ## BIRTH: x2= -0.5160557876. As value of variable MALE increases, value of y (LIFE) decreases and vice-versa
. > k=6
> p=k+1
> SSRes=t(y)%*%y-t(betahat)%*%t(X)%*%y
> SSRes=as.numeric(SSRes)
> SSRes
[1] 60.80295
> MSRes=SSRes/(n-p)
> MSRes
[1] 1.381885
> sigsqhat=as.numeric(MSRes) #Unbiased estimate of error variance
> sigsqhat
[1] 1.381885
> Cov_bhat=sigsqhat*solve(t(X)%*%X)
> Cov_bhat
I x1 x2 x3 x4 x5 x6
I 18.4019304518 -1.633385e-01
-6.622173e-02 2.057874e-02
-2.046046e-03 9.566568e-02
-2.334589e-04 x1 -0.1633385203 2.230839e-03
-2.356535e-03 -2.153210e-04
1.563106e-05 3.974575e-04 -6.254433e-06
x2 -0.0662217348 -2.356535e-03
1.375400e-02 -1.089876e-03 6.602875e-06
-6.480967e-03 2.774321e-05
x3 0.0205787372 -2.153210e-04
-1.089876e-03 5.469090e-03 2.367058e-05
4.033179e-04 -6.281835e-06
x4 -0.0020460458 1.563106e-05
6.602875e-06 2.367058e-05 9.594796e-07
-1.537143e-05 -7.391819e-08
x5 0.0956656811 3.974575e-04
-6.480967e-03 4.033179e-04
-1.537143e-05 1.232599e-02
-3.600185e-05
x6 -0.0002334589 -6.254433e-06
2.774321e-05 -6.281835e-06
-7.391819e-08 -3.600185e-05
2.114108e-07
> v_bhat=c()
> for(j in 1:p)
+ {
+ v_bhat[j]=Cov_bhat[j,j]
+ }
> round(v_bhat,6)
[1] 18.401930 0.002231 0.013754 0.005469 0.000001 0.012326
[7] 0.000000
> SE=sqrt(v_bhat)
> SE
[1] 4.2897471315 0.0472317551
0.1172774622 0.0739532971
0.0009795303 0.1110224836
0.0004597943
> std_error=round(SE,6)
> std_error
[1] 4.289747 0.047232 0.117277 0.073953
0.000980 0.111022 0.000460
> b=c("beta0","beta1","beta2","beta3","beta4","beta5","beta6")
> data.frame(b,std_error)
b std_error
1 beta0 4.289747
2 beta1 0.047232
3 beta2 0.117277
4 beta3 0.073953
5 beta4 0.000980
6 beta5 0.111022
7 beta6 0.000460
> ##Coefficient of Determination
> SSTotal=t(y)%*%y-(sum(y))^2/n
> SSTotal=as.numeric(SSTotal)
> SSTotal
[1] 114.3972
> Rsq=1-(SSRes/SSTotal)
> Rsq
[1] 0.4684927
> ##In statistics,in regression analysis, "R squared", denoted as R² ,also known as the coefficient of determination, is the proportion of the variance in the dependent (response) variable that is predictable from the independent (regressor) variable.
>##This means that 46.84927 % of variation in the model is explained by regressors.
> AdjRsq=1-((SSRes/(n-p))/(SSTotal/(n-1)))
> AdjRsq
[1] 0.3960144
> ##One can cross-check using following:
> fit=lm(y~x1+x2+x3+x4+x5+x6)
> s=summary(fit)
> s Call:
lm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6)
Residuals:
Min 1Q Median 3Q Max
-2.5563 -0.6629 0.0755 0.6983 3.3215
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 70.5577813 4.2897471 16.448 < 2e-16 ***
x1 0.1261019 0.0472318 2.670 0.01059 *
x2 -0.5160558 0.1172775 -4.400 6.78e-05 ***
x3 -0.1965375 0.0739533 -2.658 0.01093 *
x4 -0.0033392 0.0009795 -3.409 0.00141 **
x5 0.2368223 0.1110225 2.133 0.03853 *
x6 -0.0003612 0.0004598 -0.786 0.43633
--- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.176 on 44
degrees of freedom
Multiple R-squared: 0.4685, Adjusted R-squared: 0.396
F-statistic: 6.464 on 6 and 44 DF, p-value: 6.112e-05
> aov(fit)
Call:
aov(formula = fit)
Terms:
x1 x2 x3 x4 x5 x6 Residuals
Sum of Squares 4.56322 24.40986 4.67299 12.23977 6.85561 0.85279 60.80295
Deg. of Freedom 1 1 1 1 1 1 44
Residual standard error: 1.175536
Estimated effects may be unbalanced