Analyze and interpret the effect of the explanatory variables on milk intake (dl.milk) in the kfm data set (ISwR) using a multiple regression model (R programming).
1) Run a regression of dl.milk on all other variables. Is there significant evidence that milk intake can be explained by the other variables?
2) Find regression models that use fewer explanatory variables, i.e., select a subset of variables so that a better fit can be achieved.
Using R Code:
load the library ISwR
load the dataset kfm
There are 50 observations and 7 variables
The variables are
"no" "dl.milk" "sex" "weight" "ml.suppl" "mat.weight" "mat.height"
Code the variable sex as girls = 0 and boys = 1
dl.milk is dependent variable
R Code:
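# Load the package that contains the kfm data set, plus dplyr for mutate()
library(ISwR)
library(dplyr)

# Inspect the data: 50 observations, 7 variables
print(kfm)
dim(kfm)
names(kfm)

# Recode sex as numeric: girl = 0, boy = 1
kfm1 <- kfm %>%
  mutate(sex = ifelse(sex == "girl", 0, 1))
head(kfm1)

# Drop the first column ("no") so it is not used as a predictor
kfm1 <- kfm1[, -1]

# Regress dl.milk on all remaining variables
rgmod1 <- lm(dl.milk ~ ., data = kfm1)
summary(rgmod1)
coefficients(rgmod1)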
output :
> summary(rgmod1)
Call:
lm(formula = dl.milk ~ ., data = kfm1)
Residuals:
     Min       1Q   Median       3Q      Max
-1.74201 -0.81173 -0.00926  0.78326  2.52646
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -12.181372   4.322605  -2.818 0.007212 **
sex           0.499532   0.312672   1.598 0.117284
weight        1.349124   0.322450   4.184 0.000135 ***
ml.suppl     -0.002233   0.001241  -1.799 0.078829 .
mat.weight    0.006212   0.023708   0.262 0.794535
mat.height    0.072278   0.030169   2.396 0.020906 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.075 on 44 degrees of freedom
Multiple R-squared: 0.5459, Adjusted R-squared: 0.4943
F-statistic: 10.58 on 5 and 44 DF, p-value: 1.03e-06
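Interpretation:
From the summary, weight (p = 0.000135) and mat.height (p = 0.0209) are
significant at the 5% level (p < 0.05).
The other variables, sex (p = 0.117), ml.suppl (p = 0.079) and mat.weight
(p = 0.795), are not significant (p > 0.05).
Overall F-test: F(5, 44) = 10.58, p = 1.03e-06, which is < 0.05.
The model as a whole is significant.
R-squared = 0.5459, so the model explains 54.59% of the variation in dl.milk.
We can use the model for predicting dl.milk.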
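The p-value of the overall F-test can be recomputed from the F statistic and its degrees of freedom with pf(). This is a minimal sketch using the rounded values printed by summary(rgmod1), so it agrees with the reported p-value only approximately; the variable names are illustrative.

```r
# Values copied from the summary(rgmod1) output (F is rounded to 2 decimals)
f_stat   <- 10.58
df_model <- 5
df_resid <- 44

# Upper-tail probability of the F distribution = p-value of the overall F-test
p_overall <- pf(f_stat, df_model, df_resid, lower.tail = FALSE)
print(p_overall)

# Decision at the 5% level
print(p_overall < 0.05)
```

Reporting the p-value this way (or as p < 0.001) is more informative than "p = 0.0000", which only reflects rounding.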
Regression model:
dl.milk = -12.181371613 + 0.4995321988*sex + 1.349124010*weight
          - 0.002232952*ml.suppl + 0.006211857*mat.weight + 0.072278226*mat.height
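To illustrate how the fitted equation is used, the sketch below plugs one set of predictor values into it. The infant is hypothetical (not a row of kfm), and the units in the comments are assumed from the ISwR documentation of kfm (kg, ml per 24 h, cm); the coefficients are copied from coefficients(rgmod1).

```r
# Coefficients copied from coefficients(rgmod1)
b <- c(intercept  = -12.181371613,
       sex        =   0.4995321988,
       weight     =   1.349124010,
       ml.suppl   =  -0.002232952,
       mat.weight =   0.006211857,
       mat.height =   0.072278226)

# Hypothetical infant (illustrative values only):
# boy (sex = 1), weight 5.5 kg, 100 ml supplement, mother 60 kg and 167 cm
x <- c(1, 1, 5.5, 100, 60, 167)   # the leading 1 multiplies the intercept

# Predicted dl.milk = sum of coefficient * predictor value
pred_full <- sum(b * x)
print(pred_full)
```

With the fitted object available, predict(rgmod1, newdata = ...) on a data frame with the same column names should give the same value.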
Solution 2:
Now exclude the non-significant variables and refit the model with the
significant variables only:
R Code:
rgmod2 <- lm(dl.milk~weight+mat.height,data=kfm1)
summary(rgmod2)
output:
Residuals:
     Min       1Q   Median       3Q      Max
-2.19598 -0.82149  0.01822  0.75582  2.83375
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -11.92014    4.07325  -2.926  0.00527 **
weight        1.42862    0.31338   4.559 3.67e-05 ***
mat.height    0.07063    0.02636   2.680  0.01013 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.109 on 47 degrees of freedom
Multiple R-squared: 0.4835, Adjusted R-squared: 0.4615
F-statistic: 22 on 2 and 47 DF, p-value: 1.811e-0
Interpretation:
R-squared = 0.4835
48.35% of the variation in dl.milk is explained by the model.
The remaining 51.65% is unexplained variation.
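Because R-squared never decreases when predictors are added, adjusted R-squared is the fairer figure for comparing models of different sizes. The sketch below recomputes it from the reported R-squared values with the usual formula 1 - (1 - R^2)(n - 1)/(n - p - 1), where n = 50 observations and p is the number of predictors; the helper name adj_r2 is illustrative.

```r
# Adjusted R-squared from R-squared, sample size n and number of predictors p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 50
adj_full    <- adj_r2(0.5459, n, p = 5)   # full model: 5 predictors
adj_reduced <- adj_r2(0.4835, n, p = 2)   # reduced model: weight + mat.height

print(round(c(full = adj_full, reduced = adj_reduced), 4))
```

This reproduces the 0.4943 and 0.4615 shown in the two summaries. By this criterion the two-variable model is simpler but does not fit better than the full model.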
F(2, 47) = 22
p < 0.001
p < 0.05, so the model is significant.
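Whether dropping sex, ml.suppl and mat.weight together costs a significant amount of fit can be checked with a partial F-test. The sketch below rebuilds it from the residual standard errors and degrees of freedom printed in the two summaries, so the result is approximate because those inputs are rounded; variable names are illustrative.

```r
# Residual sums of squares = (residual standard error)^2 * residual df
rss_full    <- 1.075^2 * 44   # full model, 44 residual df
rss_reduced <- 1.109^2 * 47   # reduced model, 47 residual df

# Partial F statistic for the 3 dropped predictors
df_diff   <- 47 - 44
f_partial <- ((rss_reduced - rss_full) / df_diff) / (rss_full / 44)
p_partial <- pf(f_partial, df_diff, 44, lower.tail = FALSE)

print(c(F = f_partial, p = p_partial))
```

A p-value above 0.05 here means the three dropped variables are not jointly significant, which supports the smaller model. With the fitted objects available, anova(rgmod2, rgmod1) performs the same test without rounding.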
Final regression model:
(Intercept)        weight   mat.height
-11.92014253   1.42862096   0.07062876
dl.milk = -11.92014253 + 1.42862096*weight + 0.07062876*mat.height
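Dropping every variable with p > 0.05 in one step is only one way to choose a subset. As an alternative, step() from base R removes terms one at a time by AIC. This is a sketch rather than the method used above: it assumes the ISwR package is installed, repeats the same recoding of sex, and may keep a different set of variables than the two-variable model.

```r
# Sketch: backward selection by AIC with step(); needs the ISwR package for kfm
if (requireNamespace("ISwR", quietly = TRUE)) {
  data(kfm, package = "ISwR")
  kfm1 <- kfm[, -1]                               # drop the first column ("no")
  kfm1$sex <- ifelse(kfm1$sex == "girl", 0, 1)    # girl = 0, boy = 1

  full <- lm(dl.milk ~ ., data = kfm1)            # start from the full model
  sel  <- step(full, direction = "backward", trace = 0)

  print(formula(sel))      # variables retained by the AIC search
  print(AIC(full, sel))    # lower AIC indicates the preferred model
}
```

Comparing summary(sel) with summary(rgmod2) then shows whether the AIC-based choice and the significance-based choice agree.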