Question


R Programming: Load the {ISLR} and {GGally} libraries. Load and attach the College{ISLR} data set.

1.2 Inspect the data with the ggpairs(){GGally} function, but do not run the ggpairs plots on all variables because it will take a very long time. Only include these variables in your ggpairs plot: “Outstate”,“S.F.Ratio”,“Private”,“PhD”,“Grad.Rate”.

1.3 Briefly answer: if we are interested in predicting out of state tuition (Outstate), can you tell from the plots if any of the other variables have a curvilinear relationship with Outstate? Briefly explain.

1.4 Regardless of your answer, plot Outstate (Y axis) against S.F.Ratio (X axis). Then, please answer, do you now see a more curvilinear pattern in the relationship?

1.5 Fit a linear model to predict Outstate as a function of the other 4 predictors in your ggpairs plot. Store your model results in an object named fit.linear. Display a summary of your results.

##
## Call:
## ""
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -7920.9 -1627.9   -79.9  1572.4 13040.4
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1240.515    736.104   1.685   0.0923 .
## S.F.Ratio   -244.339     25.963  -9.411   <2e-16 ***
## PrivateYes  3653.040    243.210  15.020   <2e-16 ***
## PhD           82.791      5.988  13.826   <2e-16 ***
## Grad.Rate     60.658      5.896  10.288   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2441 on 772 degrees of freedom
## Multiple R-squared:  0.6337, Adjusted R-squared:  0.6318
## F-statistic: 333.9 on 4 and 772 DF,  p-value: < 2.2e-16

1.6 Briefly answer: Does this seem like a good model fit? Why or why not?

1.7 Now add an interaction term for S.F.Ratio with Private. Store your model results in an object named fit.inter. Display a summary of your results. Then do an anova() test to evaluate if fit.inter has more predictive power than fit.linear.

##
## Call:
## ""
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -8026.2 -1588.7   -72.2  1504.9 12879.2
##
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)          -958.756   1007.011  -0.952  0.34135
## S.F.Ratio            -111.583     49.091  -2.273  0.02330 *
## PrivateYes           6565.770    947.547   6.929 8.94e-12 ***
## PhD                    82.521      5.954  13.861  < 2e-16 ***
## Grad.Rate              59.671      5.870  10.165  < 2e-16 ***
## S.F.Ratio:PrivateYes -181.125     56.972  -3.179  0.00154 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2427 on 771 degrees of freedom
## Multiple R-squared:  0.6384, Adjusted R-squared:  0.6361
## F-statistic: 272.3 on 5 and 771 DF,  p-value: < 2.2e-16

## Analysis of Variance Table
##
## Model 1: Outstate ~ S.F.Ratio + Private + PhD + Grad.Rate
## Model 2: Outstate ~ S.F.Ratio * Private + PhD + Grad.Rate
##   Res.Df        RSS Df Sum of Sq      F   Pr(>F)
## 1    772 4600356791
## 2    771 4540828867  1  59527924 10.107 0.001536 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

1.8 Briefly interpret the coefficients of the interaction term and the ANOVA results

1.9 Now use the poly() function to fit a polynomial of degree 4 for S.F.Ratio. Store your model results in an object named fit.poly. Display the summary results. Then conduct an anova() test to evaluate if fit.poly has more predictive power than fit.linear.

##
## Call:
## ""
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -8271.7 -1508.4   -18.3  1555.2 12598.0
##
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)
## (Intercept)          -1942.267    504.823  -3.847 0.000129 ***
## poly(S.F.Ratio, 4)1 -28332.540   2820.328 -10.046  < 2e-16 ***
## poly(S.F.Ratio, 4)2  12856.006   2394.114   5.370 1.04e-07 ***
## poly(S.F.Ratio, 4)3   2700.292   2472.274   1.092 0.275074
## poly(S.F.Ratio, 4)4  -7339.122   2403.789  -3.053 0.002343 **
## PrivateYes            3438.220    246.937  13.923  < 2e-16 ***
## PhD                     81.807      5.858  13.966  < 2e-16 ***
## Grad.Rate               60.167      5.782  10.406  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2386 on 769 degrees of freedom
## Multiple R-squared:  0.6513, Adjusted R-squared:  0.6481
## F-statistic: 205.2 on 7 and 769 DF,  p-value: < 2.2e-16

## Analysis of Variance Table
##
## Model 1: Outstate ~ S.F.Ratio + Private + PhD + Grad.Rate
## Model 2: Outstate ~ poly(S.F.Ratio, 4) + Private + PhD + Grad.Rate
##   Res.Df        RSS Df Sum of Sq      F    Pr(>F)
## 1    772 4600356791
## 2    769 4379475120  3 220881671 12.928 3.022e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

1.10 Briefly interpret your results.

Solutions

Expert Solution

R code for 1.1 and 1.2

#Load the {ISLR} and {GGally} libraries
library(ISLR)
library(GGally)

#1.1 Load and attach the College{ISLR} data set
data(College,package="ISLR")
attach(College)
names(College)

#1.2 Inspect the data
ggpairs(data.frame(Outstate,S.F.Ratio,Private,PhD,Grad.Rate))

[Figure: ggpairs plot matrix for Outstate, S.F.Ratio, Private, PhD and Grad.Rate]

1.3 Looking at the scatter plots in the first column, we can say the following:

  • The variables S.F.Ratio, PhD and Grad.Rate each appear to have a roughly linear relationship with Outstate.
  • Private is a categorical variable, and the values of Outstate differ between its two levels (Yes, No).

R Code

#1.4 plot Outstate (Y axis) against S.F.Ratio (X axis)
plot(S.F.Ratio,Outstate)

[Figure: scatter plot of Outstate (Y axis) against S.F.Ratio (X axis)]

The relationship looks linear up to an S.F.Ratio of about 20 and curves beyond 20.

So yes, we now see a more curvilinear pattern in the relationship.
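One way to make the curvature easier to judge is to overlay a smoother on the scatter plot. The sketch below uses lowess() from base R. The choice of smoother and the object name sf_smooth are assumptions for illustration and are not part of the original solution.

```r
library(ISLR)                       # provides the College data set
data(College, package = "ISLR")

# Scatter plot of out-of-state tuition against student/faculty ratio
plot(College$S.F.Ratio, College$Outstate,
     xlab = "S.F.Ratio", ylab = "Outstate")

# Locally weighted smoother; a bend in this line suggests curvature
sf_smooth <- lowess(College$S.F.Ratio, College$Outstate)
lines(sf_smooth, col = "red", lwd = 2)
```

If the red line is close to straight over most of the range and bends only where the data are sparse, the evidence for curvature is weaker than the raw scatter may suggest.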

1.5 Fit a linear model to predict Outstate as a function of the other 4 predictors in your ggpairs plot. Store your model results in an object named fit.linear. Display a summary of your results.

R code

#1.5 Fit a linear model to predict Outstate
fit.linear<-lm(Outstate~S.F.Ratio+Private+PhD+Grad.Rate)
#Display a summary
summary(fit.linear)

This gives the summary output shown in the question under 1.5.

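1.6 Briefly answer: Does this seem like a good model fit? Why or why not?

We can test the overall significance of the model with an F test.

The model that we want to estimate is

Outstate = β0 + β1 S.F.Ratio + β2 PrivateYes + β3 PhD + β4 Grad.Rate + ε

The hypotheses are

H0: β1 = β2 = β3 = β4 = 0 (none of the predictors is related to Outstate)
Ha: at least one of β1, β2, β3, β4 is not 0

We get the test statistic and the p-value from the last line of the summary output.

The test statistic is F = 333.9 on 4 and 772 degrees of freedom, and the p-value is < 2.2e-16.

We will reject the null hypothesis if the p-value is less than the significance level 0.05.

Here, the p-value is far below the significance level 0.05. Hence we reject the null hypothesis.

We conclude that the model as a whole is significant. The multiple R-squared of 0.6337 means the model explains about 63% of the variation in Outstate, and all four predictors are individually significant, so this is a reasonably good fit, although about 37% of the variation remains unexplained.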
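As a rough cross-check, the overall F statistic can be recovered from the R-squared printed in the summary. This is a sketch that uses only the reported numbers and does not refit the model. The variable names are illustrative.

```r
# Values copied from the fit.linear summary in the question
r2   <- 0.6337   # Multiple R-squared
p    <- 4        # number of predictors
df_r <- 772      # residual degrees of freedom

# Overall F statistic: (R^2 / p) / ((1 - R^2) / residual df)
f_overall <- (r2 / p) / ((1 - r2) / df_r)
f_overall        # close to the reported 333.9 (R^2 is rounded)

# Upper-tail p-value of the overall F test
p_overall <- pf(f_overall, p, df_r, lower.tail = FALSE)
p_overall
```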

1.7 Now add an interaction term for S.F.Ratio with Private

R code

#1.7 Now add an interaction term for S.F.Ratio with Private
# S.F.Ratio*Private expands to both main effects plus their interaction
fit.inter<-lm(Outstate~S.F.Ratio*Private+PhD+Grad.Rate)
#Display a summary of your results
summary(fit.inter)
#Then do an anova() test
anova(fit.linear,fit.inter)

This gives the summary output and the ANOVA table shown in the question under 1.7.

1.8 Briefly interpret the coefficients of the interaction term and the ANOVA results

The coefficient of the interaction term is -181.125, and the coefficient of S.F.Ratio is -111.583.

For a private university (Private=Yes) the slope for S.F.Ratio is -111.583 + (-181.125) = -292.708

For a public university (Private=No) the slope for S.F.Ratio is -111.583

This means that, holding PhD and Grad.Rate fixed, a 1 point increase in the student/faculty ratio is associated with a decrease of about $292.71 in out-of-state tuition at a private university, compared with a decrease of about $111.58 at a public university.

This indicates that out-of-state tuition at private universities is more sensitive to an increase in the student/faculty ratio than at public universities.
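The two slopes can be checked with a few lines of arithmetic. This is a minimal sketch using the coefficients printed in the fit.inter summary. The variable names are made up for illustration.

```r
# Coefficients copied from the fit.inter summary in the question
b_sf      <- -111.583   # S.F.Ratio
b_sf_priv <- -181.125   # S.F.Ratio:PrivateYes

slope_public  <- b_sf               # Private = No (reference level)
slope_private <- b_sf + b_sf_priv   # Private = Yes
c(public = slope_public, private = slope_private)
```

With a fitted model object the same values would come from coef(fit.inter), assuming fit.inter was fitted as in 1.7.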

Next we can compare the 2 models using the ANOVA results.

We will call Model 2 (the model with the interaction term) the full model (it has more variables).

The model that we are estimating is

Outstate = β0 + β1 S.F.Ratio + β2 PrivateYes + β3 PhD + β4 Grad.Rate + β5 (S.F.Ratio × PrivateYes) + ε

We will call Model 1 (the linear fit) the restricted model (it has fewer variables and is a sub-model of Model 2).

The hypotheses are

H0: β5 = 0 (the interaction term adds nothing, so the restricted model is adequate)
Ha: β5 ≠ 0

We get the test statistic and the p-value from the ANOVA table.

The test statistic is F = 10.107 and the p-value is 0.001536.

We will reject the null hypothesis if the p-value is less than the significance level.

Here, the p-value is 0.001536, which is less than the significance level 0.05. Hence we reject the null hypothesis.

We conclude that the interaction term is significant: fit.inter fits the data significantly better than fit.linear.
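The F statistic in the ANOVA table is the drop in residual sum of squares per added parameter, divided by the residual mean square of the full model. The sketch below reproduces it from the RSS and Res.Df values printed in the table. The variable names are illustrative, and nothing is refitted.

```r
# Values copied from the ANOVA table (fit.linear vs fit.inter)
rss_linear <- 4600356791; df_linear <- 772   # restricted model
rss_inter  <- 4540828867; df_inter  <- 771   # full model

# Partial F test for nested models
df_num  <- df_linear - df_inter
f_inter <- ((rss_linear - rss_inter) / df_num) / (rss_inter / df_inter)
f_inter    # approximately 10.107, as in the table

# Upper-tail p-value on (df_num, df_inter) degrees of freedom
p_inter <- pf(f_inter, df_num, df_inter, lower.tail = FALSE)
p_inter
```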

1.9 Now use the poly() function to fit a polynomial of degree 4 for S.F.Ratio

R Code

#1.9 Now use the poly() function
fit.poly<-lm(Outstate~poly(S.F.Ratio,4)+Private+PhD+Grad.Rate)
summary(fit.poly)
#Then do an anova() test
anova(fit.linear,fit.poly)

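This gives the summary output and the ANOVA table shown in the question under 1.9.

1.10 Briefly interpret your results.

We can compare the polynomial model to the linear fit using the ANOVA results.

We will call Model 2 (the model with the polynomial terms) the full model (it has more variables).

The model that we are estimating is

Outstate = β0 + β1 P1(S.F.Ratio) + β2 P2(S.F.Ratio) + β3 P3(S.F.Ratio) + β4 P4(S.F.Ratio) + β5 PrivateYes + β6 PhD + β7 Grad.Rate + ε

where P1, ..., P4 are the orthogonal polynomials of degree 1 to 4 produced by poly(S.F.Ratio, 4).

We will call Model 1 (the linear fit) the restricted model (it has fewer variables and is a sub-model of Model 2).

The hypotheses are

H0: β2 = β3 = β4 = 0 (the higher-order terms add nothing, so the restricted model is adequate)
Ha: at least one of β2, β3, β4 is not 0

We get the test statistic and the p-value from the ANOVA table.

The test statistic is F = 12.928 and the p-value is 3.022e-08.

We will reject the null hypothesis if the p-value is less than the significance level.

Here, the p-value is 3.022e-08, which is less than the significance level 0.05. Hence we reject the null hypothesis.

We conclude that the degree 4 polynomial in S.F.Ratio fits the data significantly better than the linear term alone. In the summary, the quadratic and quartic terms are individually significant, while the cubic term is not (p = 0.275).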
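The same partial F calculation can be repeated for the polynomial comparison, now with 3 numerator degrees of freedom because the polynomial adds three terms. This is a sketch based only on the RSS and Res.Df values in the ANOVA table. The variable names are illustrative.

```r
# Values copied from the ANOVA table (fit.linear vs fit.poly)
rss_linear <- 4600356791; df_linear <- 772   # restricted model
rss_poly   <- 4379475120; df_poly   <- 769   # full model

# Partial F test for nested models
df_num <- df_linear - df_poly
f_poly <- ((rss_linear - rss_poly) / df_num) / (rss_poly / df_poly)
f_poly     # approximately 12.928, as in the table

# Upper-tail p-value on (df_num, df_poly) degrees of freedom
p_poly <- pf(f_poly, df_num, df_poly, lower.tail = FALSE)
p_poly
```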

