RPI would like to develop a multiple regression model for predicting graduate student Grade Point Averages. The initial data from 30 grad students are in the file GPA.sav. The file contains the following variables: GPA (graduate grade point average), GREQ (score on the quantitative section of the Graduate Record Exam, a commonly used entrance exam for graduate programs), GREV (score on the verbal section of the GRE), MAT (score on the Miller Analogies Test, another graduate entrance exam), and AR, the Average Rating that the student received from 3 professors who interviewed the student prior to making admission decisions. GPA can exceed 4.0 since the university attaches pluses and minuses to letter grades, including As. Conduct a multiple regression analysis in R using GPA as the dependent variable and the other variables as predictors. If the slopes for any of the variables are not significant, remove them from the MR and run the MR again. Briefly explain the R output. Be sure to check assumptions and summarize the results of your tests. Write down the final MR equation for GPA. Predict the GPA of an incoming student with GREQ=550, GREV=620, MAT=68, & AR=4.
This is the data:
GPA GRE_Q GRE_V MAT AR
1 3.2 625 540 65 2.7
2 4.1 575 680 75 4.5
3 3.0 520 480 65 2.5
4 2.6 545 520 55 3.1
5 3.7 520 490 75 3.6
6 4.0 655 535 65 4.3
7 4.3 630 720 75 4.6
8 2.7 500 500 75 3.0
9 3.6 605 575 65 4.7
10 4.1 555 690 75 3.4
11 2.7 505 545 55 3.7
12 2.9 540 515 55 2.6
13 2.5 520 520 55 3.1
14 3.0 585 710 65 2.7
15 3.3 600 610 85 5.0
16 3.2 625 540 65 2.7
17 4.1 575 680 75 4.5
18 3.0 520 480 65 2.5
19 2.6 545 520 55 3.1
20 3.7 520 490 75 3.6
21 4.0 655 535 65 4.3
22 4.3 630 720 75 4.6
23 2.7 500 500 75 3.0
24 3.6 605 575 65 4.7
25 4.1 555 690 75 3.4
26 2.7 505 545 55 3.7
27 2.9 540 515 55 2.6
28 2.5 520 520 55 3.1
29 3.0 585 710 65 2.7
30 3.3 600 610 85 5.0
Here is everything I have done in R, please help
## Conduct Multiple regression with variables
> ## How to select many variables in an lm formula???
> ## use y~. for selecting all variables in a data frame
>
> linear_mod6 <- lm(GPA ~ . , # regression formula
+ data=GPA_data) # data set
>
> ## use stepwise variable selection
> ## use direction parameter = forward,backward,both for choosing the type of direction to use
>
> step(linear_mod6, direction="forward")
Start: AIC=-52.37
GPA ~ GRE_Q + GRE_V + MAT + AR
Call:
lm(formula = GPA ~ GRE_Q + GRE_V + MAT + AR, data = GPA_data)
Coefficients:
(Intercept) GRE_Q GRE_V MAT AR
-1.738107 0.003998 0.001524 0.020896 0.144234
>
> # Summarize and print the results
> summary(linear_mod6) # show regression coefficients table
Call:
lm(formula = GPA ~ ., data = GPA_data)
Residuals:
Min 1Q Median 3Q Max
-0.7876 -0.2297 0.0069 0.2673 0.5260
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.738107 0.950740 -1.828 0.0795 .
GRE_Q 0.003998 0.001831 2.184 0.0385 *
GRE_V 0.001524 0.001050 1.451 0.1593
MAT 0.020896 0.009549 2.188 0.0382 *
AR 0.144234 0.113001 1.276 0.2135
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3874 on 25 degrees of freedom
Multiple R-squared: 0.6405, Adjusted R-squared: 0.5829
F-statistic: 11.13 on 4 and 25 DF, p-value: 2.519e-05
>
> # sequential (Type I) ANOVA table for the regression terms, not a one-way ANOVA
>
> anova(linear_mod6) # anova table
Analysis of Variance Table
Response: GPA
Df Sum Sq Mean Sq F value Pr(>F)
GRE_Q 1 3.8974 3.8974 25.9718 2.906e-05 ***
GRE_V 1 1.1660 1.1660 7.7699 0.009999 **
MAT 1 1.3753 1.3753 9.1651 0.005653 **
AR 1 0.2445 0.2445 1.6292 0.213548
Residuals 25 3.7515 0.1501
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
## To import the dataset, run the command below and select the file
data_26Feb <- read.csv(file.choose(), header = TRUE)
head(data_26Feb)
GPA GRE_Q GRE_V MAT AR
1 3.2 625 540 65 2.7
2 4.1 575 680 75 4.5
3 3.0 520 480 65 2.5
4 2.6 545 520 55 3.1
5 3.7 520 490 75 3.6
6 4.0 655 535 65 4.3
# Linear Regression Model
regmodel <- lm(GPA ~ ., data = data_26Feb)
summary(regmodel)
Call:
lm(formula = GPA ~ ., data = data_26Feb)
Residuals:
Min 1Q Median 3Q Max
-0.7876 -0.2297 0.0069 0.2673 0.5260
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.738107 0.950740 -1.828 0.0795 .
GRE_Q 0.003998 0.001831 2.184 0.0385 *
GRE_V 0.001524 0.001050 1.451 0.1593
MAT 0.020896 0.009549 2.188 0.0382 *
AR 0.144234 0.113001 1.276 0.2135
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3874 on 25 degrees of freedom
Multiple R-squared: 0.6405, Adjusted R-squared: 0.5829
F-statistic: 11.13 on 4 and 25 DF, p-value: 2.519e-05
Since the above model contains variables whose slopes are not significant at the 0.05 level (GRE_V with p = 0.1593 and AR with p = 0.2135), we fit a reduced model.
Which independent variables to keep is decided by backward selection, as below. Note that step() compares models by AIC rather than by p-values, so it can retain a predictor whose slope is not significant.
step(regmodel, direction = "backward",trace = FALSE)
Call:
lm(formula = GPA ~ GRE_Q + GRE_V + MAT, data = data_26Feb)
Coefficients:
(Intercept) GRE_Q GRE_V MAT
-2.148770 0.004926 0.001612 0.026119
Hence the selected model contains only GRE_Q, GRE_V, and MAT as independent variables; AR is dropped.
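Because step() selects by AIC, it can help to look at per-term F-tests alongside it. The following is a minimal sketch, not the original session: it rebuilds the 30 rows listed in the question as a data frame called `gpa` (a name chosen here for illustration; rows 16 to 30 of the listing repeat rows 1 to 15, so the block stacks the first 15 rows twice), then shows `drop1()` next to the backward `step()` call.

```r
# First 15 rows of the listing in the question; rows 16-30 repeat them.
half <- data.frame(
  GPA   = c(3.2, 4.1, 3.0, 2.6, 3.7, 4.0, 4.3, 2.7, 3.6, 4.1, 2.7, 2.9, 2.5, 3.0, 3.3),
  GRE_Q = c(625, 575, 520, 545, 520, 655, 630, 500, 605, 555, 505, 540, 520, 585, 600),
  GRE_V = c(540, 680, 480, 520, 490, 535, 720, 500, 575, 690, 545, 515, 520, 710, 610),
  MAT   = c(65, 75, 65, 55, 75, 65, 75, 75, 65, 75, 55, 55, 55, 65, 85),
  AR    = c(2.7, 4.5, 2.5, 3.1, 3.6, 4.3, 4.6, 3.0, 4.7, 3.4, 3.7, 2.6, 3.1, 2.7, 5.0)
)
gpa <- rbind(half, half)  # 30 rows, as in the question

# Full model with all four predictors.
full <- lm(GPA ~ ., data = gpa)

# F-test for removing each single term from the full model (p-value view).
print(drop1(full, test = "F"))

# AIC-based backward elimination, the same call as in the answer.
sel <- step(full, direction = "backward", trace = FALSE)
print(formula(sel))
```

If the assignment asks for removal by significance, the `drop1()` table is the more direct guide; `step()` answers a slightly different question (which model has the lowest AIC).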
## Regression Model 2
regmodel2 <- lm(GPA ~ GRE_Q + GRE_V + MAT, data = data_26Feb)
summary(regmodel2)
Call:
lm(formula = GPA ~ GRE_Q + GRE_V + MAT, data = data_26Feb)
Residuals:
Min 1Q Median 3Q Max
-0.7101 -0.2762 0.1159 0.3275 0.5386
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.148770 0.905406 -2.373 0.02531 *
GRE_Q 0.004926 0.001701 2.896 0.00756 **
GRE_V 0.001612 0.001060 1.520 0.14051
MAT 0.026119 0.008731 2.991 0.00601 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.392 on 26 degrees of freedom
Multiple R-squared: 0.617, Adjusted R-squared: 0.5729
F-statistic: 13.96 on 3 and 26 DF, p-value: 1.28e-05
In this model GRE_V is still not significant (p = 0.14051), so we drop it and fit another model.
## Final Regression Model
regmodel3 <- lm(GPA ~ GRE_Q + MAT, data = data_26Feb)
summary(regmodel3)
Call:
lm(formula = GPA ~ GRE_Q + MAT, data = data_26Feb)
Residuals:
Min 1Q Median 3Q Max
-0.7751 -0.3325 0.1078 0.3184 0.6020
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.129377 0.927038 -2.297 0.02960 *
GRE_Q 0.005976 0.001591 3.756 0.00084 ***
MAT 0.030807 0.008365 3.683 0.00102 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4014 on 27 degrees of freedom
Multiple R-squared: 0.583, Adjusted R-squared: 0.5521
F-statistic: 18.87 on 2 and 27 DF, p-value: 7.444e-06
Regression Equation:
GPA = -2.13 + 0.006* GRE_Q + 0.031 * MAT
The p-value of the overall model (F = 18.87 on 2 and 27 DF, p = 7.444e-06) is less than 0.05, hence the model is significant.
R-squared of the model is 0.583, i.e. 58.3% of the variation in GPA is explained by GRE_Q and MAT together.
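One way to check that dropping GRE_V and AR together did not cost much fit is a partial F-test between the two-predictor model and the full model. This is a sketch under the assumption that the data are exactly the 30 rows listed in the question; the names `gpa`, `full`, `reduced`, and `cmp` are chosen here for illustration.

```r
# First 15 rows of the listing in the question; rows 16-30 repeat them.
half <- data.frame(
  GPA   = c(3.2, 4.1, 3.0, 2.6, 3.7, 4.0, 4.3, 2.7, 3.6, 4.1, 2.7, 2.9, 2.5, 3.0, 3.3),
  GRE_Q = c(625, 575, 520, 545, 520, 655, 630, 500, 605, 555, 505, 540, 520, 585, 600),
  GRE_V = c(540, 680, 480, 520, 490, 535, 720, 500, 575, 690, 545, 515, 520, 710, 610),
  MAT   = c(65, 75, 65, 55, 75, 65, 75, 75, 65, 75, 55, 55, 55, 65, 85),
  AR    = c(2.7, 4.5, 2.5, 3.1, 3.6, 4.3, 4.6, 3.0, 4.7, 3.4, 3.7, 2.6, 3.1, 2.7, 5.0)
)
gpa <- rbind(half, half)  # 30 rows, as in the question

full    <- lm(GPA ~ GRE_Q + GRE_V + MAT + AR, data = gpa)
reduced <- lm(GPA ~ GRE_Q + MAT, data = gpa)

# Partial F-test: H0 is that the GRE_V and AR slopes are both zero.
cmp <- anova(reduced, full)
print(cmp)
```

A large p-value in the second row of the table means the two dropped predictors do not add significantly to the fit once GRE_Q and MAT are in the model.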
Assumptions of Linear Model
plot(regmodel3)
Residuals vs Fitted: there is no definite pattern in this plot, so the linearity assumption looks reasonable.
Normal Q-Q: the points fall on a straight line if the errors are normally distributed, but here many points deviate from the line, so normality of the residuals is doubtful.
Scale-Location (or Spread-Location): used to check the homogeneity of variance of the residuals (homoscedasticity). A horizontal line with equally spread points is a good indication of homoscedasticity. That is not the case here, which suggests a heteroscedasticity problem.
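Since the question asks for the results of tests, the plot readings can be backed up with formal checks. The sketch below uses only base R and assumes the data are the 30 rows listed in the question (the names `gpa`, `fit`, `sw`, `bp_stat`, `bp_p`, and `n_dup` are chosen here). The heteroscedasticity check is a hand-computed studentized Breusch-Pagan statistic, swapped in so that no extra package is needed; it is not what `plot()` itself does.

```r
# First 15 rows of the listing in the question; rows 16-30 repeat them.
half <- data.frame(
  GPA   = c(3.2, 4.1, 3.0, 2.6, 3.7, 4.0, 4.3, 2.7, 3.6, 4.1, 2.7, 2.9, 2.5, 3.0, 3.3),
  GRE_Q = c(625, 575, 520, 545, 520, 655, 630, 500, 605, 555, 505, 540, 520, 585, 600),
  GRE_V = c(540, 680, 480, 520, 490, 535, 720, 500, 575, 690, 545, 515, 520, 710, 610),
  MAT   = c(65, 75, 65, 55, 75, 65, 75, 75, 65, 75, 55, 55, 55, 65, 85),
  AR    = c(2.7, 4.5, 2.5, 3.1, 3.6, 4.3, 4.6, 3.0, 4.7, 3.4, 3.7, 2.6, 3.1, 2.7, 5.0)
)
gpa <- rbind(half, half)  # 30 rows, as in the question

fit <- lm(GPA ~ GRE_Q + MAT, data = gpa)
res <- residuals(fit)

# Normality of residuals: Shapiro-Wilk test (H0: residuals are normal).
sw <- shapiro.test(res)
print(sw)

# Constant variance: studentized Breusch-Pagan statistic computed by hand.
# Regress the squared residuals on the predictors; under H0 (homoscedasticity)
# n * R^2 is approximately chi-squared with df = number of predictors.
aux     <- lm(I(res^2) ~ GRE_Q + MAT, data = gpa)
bp_stat <- nrow(gpa) * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 2, lower.tail = FALSE)
cat("BP statistic:", bp_stat, " p-value:", bp_p, "\n")

# Independence: count rows that exactly repeat an earlier row.
n_dup <- sum(duplicated(gpa))
cat("Repeated rows:", n_dup, "\n")
```

Small p-values count against the corresponding assumption. The repeated-row count matters for independence: if the same 15 students appear twice, the standard errors and p-values in the summaries are likely to be optimistic.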
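Finally, for a student with GREQ=550, GREV=620, MAT=68, & AR=4 (GREV and AR are not in the final model, so they do not enter the prediction):
GPA = -2.13 + 0.006 * 550 + 0.031 * 68 = 3.28
With the unrounded coefficients from the summary (-2.129377, 0.005976, 0.030807) the prediction is about 3.25, so the rounded equation overstates it slightly.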
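The same prediction can be obtained from the fitted model with `predict()`, which avoids rounding the coefficients and also gives an interval. This is a minimal sketch assuming the data are the 30 rows listed in the question; `gpa`, `fit`, `new_student`, and `pred` are names chosen here.

```r
# First 15 rows of the listing in the question; rows 16-30 repeat them.
half <- data.frame(
  GPA   = c(3.2, 4.1, 3.0, 2.6, 3.7, 4.0, 4.3, 2.7, 3.6, 4.1, 2.7, 2.9, 2.5, 3.0, 3.3),
  GRE_Q = c(625, 575, 520, 545, 520, 655, 630, 500, 605, 555, 505, 540, 520, 585, 600),
  GRE_V = c(540, 680, 480, 520, 490, 535, 720, 500, 575, 690, 545, 515, 520, 710, 610),
  MAT   = c(65, 75, 65, 55, 75, 65, 75, 75, 65, 75, 55, 55, 55, 65, 85),
  AR    = c(2.7, 4.5, 2.5, 3.1, 3.6, 4.3, 4.6, 3.0, 4.7, 3.4, 3.7, 2.6, 3.1, 2.7, 5.0)
)
gpa <- rbind(half, half)  # 30 rows, as in the question

# Final model from the answer: GPA on GRE_Q and MAT only.
fit <- lm(GPA ~ GRE_Q + MAT, data = gpa)

# GRE_V and AR are included for completeness; predict() ignores
# columns that are not in the model formula.
new_student <- data.frame(GRE_Q = 550, GRE_V = 620, MAT = 68, AR = 4)

# Point prediction with a 95% prediction interval for one new student.
pred <- predict(fit, newdata = new_student, interval = "prediction", level = 0.95)
print(pred)
```

The `fit` column is the point prediction; `lwr` and `upr` bound the range in which a single new student's GPA would be expected to fall, provided the model assumptions hold.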