R Programming:

1.1 Load the {ISLR} and {GGally} libraries. Load and attach the College{ISLR} data set.

1.2 Inspect the data with the ggpairs(){GGally} function, but do not run the ggpairs plots on all variables because it will take a very long time. Only include these variables in your ggpairs plot: "Outstate", "S.F.Ratio", "Private", "PhD", "Grad.Rate".

1.3 Briefly answer: if we are interested in predicting out of state tuition (Outstate), can you tell from the plots if any of the other variables have a curvilinear relationship with Outstate? Briefly explain.

1.4 Regardless of your answer, plot Outstate (Y axis) against S.F.Ratio (X axis). Then, please answer: do you now see a more curvilinear pattern in the relationship?

1.5 Fit a linear model to predict Outstate as a function of the other 4 predictors in your ggpairs plot. Store your model results in an object named fit.linear. Display a summary of your results.

##
## Call:
## ""
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -7920.9 -1627.9   -79.9  1572.4 13040.4
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1240.515    736.104   1.685   0.0923 .
## S.F.Ratio    -244.339     25.963  -9.411   <2e-16 ***
## PrivateYes   3653.040    243.210  15.020   <2e-16 ***
## PhD            82.791      5.988  13.826   <2e-16 ***
## Grad.Rate      60.658      5.896  10.288   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2441 on 772 degrees of freedom
## Multiple R-squared:  0.6337, Adjusted R-squared:  0.6318
## F-statistic: 333.9 on 4 and 772 DF,  p-value: < 2.2e-16

1.6 Briefly answer: Does this seem like a good model fit? Why or why not?

1.7 Now add an interaction term for S.F.Ratio with Private. Store your model results in an object named fit.inter. Display a summary of your results. Then do an anova() test to evaluate if fit.inter has more predictive power than fit.linear.

##
## Call:
## ""
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -8026.2 -1588.7   -72.2  1504.9 12879.2
##
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)
## (Intercept)           -958.756   1007.011  -0.952  0.34135
## S.F.Ratio             -111.583     49.091  -2.273  0.02330 *
## PrivateYes            6565.770    947.547   6.929 8.94e-12 ***
## PhD                     82.521      5.954  13.861  < 2e-16 ***
## Grad.Rate               59.671      5.870  10.165  < 2e-16 ***
## S.F.Ratio:PrivateYes  -181.125     56.972  -3.179  0.00154 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2427 on 771 degrees of freedom
## Multiple R-squared:  0.6384, Adjusted R-squared:  0.6361
## F-statistic: 272.3 on 5 and 771 DF,  p-value: < 2.2e-16

## Analysis of Variance Table
##
## Model 1: Outstate ~ S.F.Ratio + Private + PhD + Grad.Rate
## Model 2: Outstate ~ S.F.Ratio * Private + PhD + Grad.Rate
##   Res.Df        RSS Df Sum of Sq      F   Pr(>F)
## 1    772 4600356791
## 2    771 4540828867  1  59527924 10.107 0.001536 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

1.8 Briefly interpret the coefficients of the interaction term and the ANOVA results.

1.9 Now use the poly() function to fit a polynomial of degree 4 for S.F.Ratio. Store your model results in an object named fit.poly. Display the summary results. Then conduct an anova() test to evaluate if fit.poly has more predictive power than fit.linear.

##
## Call:
## ""
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -8271.7 -1508.4   -18.3  1555.2 12598.0
##
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)
## (Intercept)          -1942.267    504.823  -3.847 0.000129 ***
## poly(S.F.Ratio, 4)1 -28332.540   2820.328 -10.046  < 2e-16 ***
## poly(S.F.Ratio, 4)2  12856.006   2394.114   5.370 1.04e-07 ***
## poly(S.F.Ratio, 4)3   2700.292   2472.274   1.092 0.275074
## poly(S.F.Ratio, 4)4  -7339.122   2403.789  -3.053 0.002343 **
## PrivateYes            3438.220    246.937  13.923  < 2e-16 ***
## PhD                     81.807      5.858  13.966  < 2e-16 ***
## Grad.Rate               60.167      5.782  10.406  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2386 on 769 degrees of freedom
## Multiple R-squared:  0.6513, Adjusted R-squared:  0.6481
## F-statistic: 205.2 on 7 and 769 DF,  p-value: < 2.2e-16

## Analysis of Variance Table
##
## Model 1: Outstate ~ S.F.Ratio + Private + PhD + Grad.Rate
## Model 2: Outstate ~ poly(S.F.Ratio, 4) + Private + PhD + Grad.Rate
##   Res.Df        RSS Df Sum of Sq      F    Pr(>F)
## 1    772 4600356791
## 2    769 4379475120  3 220881671 12.928 3.022e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

1.10 Briefly interpret your results.
R code for 1.1 and 1.2
#Load the {ISLR} and {GGally} libraries
library(ISLR)
library(GGally)
#1.1 Load and attach the College{ISLR} data set
data(College,package="ISLR")
attach(College)
names(College)
#1.2 Inspect the data
ggpairs(College[, c("Outstate", "S.F.Ratio", "Private", "PhD", "Grad.Rate")])
# (the ggpairs plot for these five variables is produced; plot not shown here)
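If preferred, the same plot can be produced by passing the full data frame and selecting the variables with the columns argument of ggpairs(); this is just an equivalent alternative, not part of the assignment:
ggpairs(College, columns = c("Outstate", "S.F.Ratio", "Private", "PhD", "Grad.Rate"))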
1.3 Looking at the scatter plots in the first column of the ggpairs output, it is hard to tell from the small panels whether any predictor has a clearly curvilinear relationship with Outstate. The S.F.Ratio panel hints at some curvature, but the larger plot in 1.4 makes this much easier to judge.
R Code
#1.4 plot Outstate (Y axis) against S.F.Ratio (X axis)
plot(S.F.Ratio,Outstate)
# (scatter plot of Outstate vs. S.F.Ratio; plot not shown here)
We can see that the relationship is roughly linear up to an S.F.Ratio of about 20 and curves beyond that point.
So yes, we now see a more curvilinear pattern in the relationship.
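A smoother is not required by the assignment, but overlaying one on the 1.4 scatter plot is a quick, informal way to check the curvature (a minimal sketch, assuming College is still attached):
# Overlay a lowess smoother on the Outstate vs. S.F.Ratio scatter plot
plot(S.F.Ratio, Outstate)
lines(lowess(S.F.Ratio, Outstate), col = "red", lwd = 2)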
1.5 Fit a linear model to predict Outstate as a function of the other 4 predictors in your ggpairs plot. Store your model results in an object named fit.linear. Display a summary of your results.
R code
#1.5 Fit a linear model to predict Outstate
fit.linear <- lm(Outstate ~ S.F.Ratio + Private + PhD + Grad.Rate, data = College)
#Display a summary
summary(fit.linear)
# (summary output as shown in the question above)
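Before answering 1.6, it can also help to glance at the standard residual diagnostic plots for fit.linear (optional, not required by the assignment):
# Residuals vs. fitted, Q-Q plot, scale-location, and leverage plots
par(mfrow = c(2, 2))
plot(fit.linear)
par(mfrow = c(1, 1))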
1.6 Briefly answer: Does this seem like a good model fit? Why or why not?
We can judge the fit using the overall F test and the R-squared from the summary.
The model we are estimating is
Outstate = b0 + b1*S.F.Ratio + b2*PrivateYes + b3*PhD + b4*Grad.Rate + error
The hypotheses for the overall F test are
H0: b1 = b2 = b3 = b4 = 0 (none of the predictors is useful)
Ha: at least one of b1, b2, b3, b4 is not 0
From the summary output, the test statistic is F = 333.9 on 4 and 772 degrees of freedom, with p-value < 2.2e-16.
We reject the null hypothesis if the p-value is less than the significance level of 0.05. Since the p-value is far below 0.05, we reject the null hypothesis: the model is statistically significant overall.
In addition, the adjusted R-squared is 0.6318, so the four predictors explain roughly 63% of the variation in Outstate. Overall this seems like a reasonably good, though not perfect, fit.
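As a check, the overall F statistic and its p-value can be pulled directly out of the summary object (a small sketch using base R only):
# summary(fit.linear)$fstatistic is a vector: value, numerator df, denominator df
fstat <- summary(fit.linear)$fstatistic
pf(fstat[1], fstat[2], fstat[3], lower.tail = FALSE)  # the overall F-test p-value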
1.7 Now add an interaction term for S.F.Ratio with Private
R code
#1.7 Now add an interaction term for S.F.Ratio with Private
fit.inter <- lm(Outstate ~ S.F.Ratio * Private + PhD + Grad.Rate, data = College)
#Display a summary of your results
summary(fit.inter)
#Then do an anova() test
anova(fit.linear,fit.inter)
# (summary and ANOVA output as shown in the question above)
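For reference, the * operator in an R formula expands to the main effects plus the interaction, so the explicit form below fits the same model as fit.inter (the name fit.inter2 is just illustrative):
# Equivalent explicit specification of the interaction model
fit.inter2 <- lm(Outstate ~ S.F.Ratio + Private + S.F.Ratio:Private + PhD + Grad.Rate,
                 data = College)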
1.8. Briefly interpret the coefficients of the interaction term and the ANOVA results
The coefficient of the interaction term S.F.Ratio:PrivateYes is -181.125, and the coefficient of S.F.Ratio is -111.583.
For a public college (Private = No), the slope of S.F.Ratio is -111.583.
For a private college (Private = Yes), the slope of S.F.Ratio is -111.583 + (-181.125) = -292.708.
In other words, at a public college a 1-point increase in the student/faculty ratio is associated with a decrease in out-of-state tuition of about $112, while at a private college the same increase is associated with a decrease of about $293.
This indicates that out-of-state tuition at private colleges is more sensitive to increases in the student/faculty ratio than at public colleges.
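The two slopes can be computed directly from the fitted coefficients, as a quick check of the arithmetic above (a minimal sketch):
# Slope of S.F.Ratio for public vs. private colleges
b <- coef(fit.inter)
b["S.F.Ratio"]                              # public slope: about -111.6
b["S.F.Ratio"] + b["S.F.Ratio:PrivateYes"]  # private slope: about -292.7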
Next we can compare the two models using the ANOVA results.
We will call Model 2 (the model with the interaction term) the full model, since it has more terms. The full model we are estimating is
Outstate = b0 + b1*S.F.Ratio + b2*PrivateYes + b3*PhD + b4*Grad.Rate + b5*(S.F.Ratio x PrivateYes) + error
We will call Model 1 (the linear fit) the restricted model; it has fewer terms and is nested within Model 2.
The hypotheses are
H0: b5 = 0 (the interaction adds no predictive power; the restricted model is adequate)
Ha: b5 is not 0
We get the test statistic and the p-value from the ANOVA table: F = 10.107 and p-value = 0.001536.
We reject the null hypothesis if the p-value is less than the significance level. Here the p-value, 0.001536, is less than 0.05, so we reject the null hypothesis.
We conclude that the interaction term adds significant predictive power: fit.inter fits better than fit.linear.
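To make the connection explicit, the F statistic in the ANOVA table can be reproduced by hand from the two residual sums of squares reported above (a small sketch using only the numbers in the table):
# Partial F test computed from the RSS values in the anova() output
rss.linear <- 4600356791   # RSS of fit.linear (772 residual df)
rss.inter  <- 4540828867   # RSS of fit.inter  (771 residual df)
F.stat <- ((rss.linear - rss.inter) / 1) / (rss.inter / 771)
F.stat                                   # about 10.107
pf(F.stat, 1, 771, lower.tail = FALSE)   # about 0.0015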
1.9 Now use the poly() function to fit a polynomial of degree 4 for S.F.Ratio
R Code
#1.9 Now use the poly() function
fit.poly <- lm(Outstate ~ poly(S.F.Ratio, 4) + Private + PhD + Grad.Rate, data = College)
summary(fit.poly)
#Then do an anova() test
anova(fit.linear,fit.poly)
# (summary and ANOVA output as shown in the question above)
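Not asked for in the assignment, but as an optional side-by-side comparison of all three models one could also look at AIC, where lower values indicate a better trade-off between fit and complexity (a minimal sketch):
# Compare fit.linear, fit.inter, and fit.poly by AIC
AIC(fit.linear, fit.inter, fit.poly)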
1.10 Briefly interpret your results.
We can compare the polynomial model to the linear fit using the ANOVA results.
We will call Model 2 (the model with the polynomial terms) the full model, since it has more terms. The full model we are estimating is
Outstate = b0 + b1*poly(S.F.Ratio,4)1 + b2*poly(S.F.Ratio,4)2 + b3*poly(S.F.Ratio,4)3 + b4*poly(S.F.Ratio,4)4 + b5*PrivateYes + b6*PhD + b7*Grad.Rate + error
We will call Model 1 (the linear fit) the restricted model; it has fewer terms and is nested within Model 2, because a linear term in S.F.Ratio is a special case of the degree-4 polynomial.
The hypotheses are
H0: the quadratic, cubic, and quartic terms in S.F.Ratio are all 0 (the linear term is adequate)
Ha: at least one of these higher-order terms is not 0
We get the test statistic and the p-value from the ANOVA table: F = 12.928 and p-value = 3.022e-08.
We reject the null hypothesis if the p-value is less than the significance level. Here the p-value is far below 0.05, so we reject the null hypothesis.
We conclude that the degree-4 polynomial in S.F.Ratio adds significant predictive power over the simple linear term, which is consistent with the curvature seen in the plot for 1.4. (From the summary of fit.poly, the quadratic and quartic terms are significant while the cubic term is not.)
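Finally, an optional sketch of what the fitted quartic looks like: predict fit.poly over a grid of S.F.Ratio values, holding the other predictors at illustrative values (a private school with median PhD and Grad.Rate; these choices are mine, not part of the assignment), and overlay the curve on the scatter plot from 1.4.
# Build a prediction grid over the observed range of S.F.Ratio
sf.grid <- data.frame(
  S.F.Ratio = seq(min(S.F.Ratio), max(S.F.Ratio), length.out = 100),
  Private   = factor("Yes", levels = levels(Private)),  # illustrative: private school
  PhD       = median(PhD),                               # illustrative: median PhD
  Grad.Rate = median(Grad.Rate)                          # illustrative: median Grad.Rate
)
# Scatter plot with the fitted degree-4 curve overlaid
plot(S.F.Ratio, Outstate)
lines(sf.grid$S.F.Ratio, predict(fit.poly, newdata = sf.grid), col = "red", lwd = 2)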