Question

RPI would like to develop a multiple regression model for predicting graduate student Grade Point Averages. The initial data from 30 grad students are in the file GPA.sav. The file contains the following variables: GPA (graduate grade point average), GREQ (score on the quantitative section of the Graduate Record Exam, a commonly used entrance exam for graduate programs), GREV (score on the verbal section of the GRE), MAT (score on the Miller Analogies Test, another graduate entrance exam), and AR (the Average Rating that the student received from 3 professors who interviewed the student prior to admission decisions). GPA can exceed 4.0 since the university attaches pluses and minuses to letter grades, including A's. Conduct a multiple regression analysis in R using GPA as the dependent variable and the other variables as predictors. If the slopes for any of the variables are not significant, remove them from the MR and run the MR again. Briefly explain the R output. Be sure to check assumptions and summarize the results of your tests. Write down the final MR equation for GPA. Predict the GPA of an incoming student with GREQ=550, GREV=620, MAT=68, and AR=4.

This is the data:

   GPA GRE_Q GRE_V MAT  AR
1  3.2   625   540  65 2.7
2  4.1   575   680  75 4.5
3  3.0   520   480  65 2.5
4  2.6   545   520  55 3.1
5  3.7   520   490  75 3.6
6  4.0   655   535  65 4.3
7  4.3   630   720  75 4.6
8  2.7   500   500  75 3.0
9  3.6   605   575  65 4.7
10 4.1   555   690  75 3.4
11 2.7   505   545  55 3.7
12 2.9   540   515  55 2.6
13 2.5   520   520  55 3.1
14 3.0   585   710  65 2.7
15 3.3   600   610  85 5.0
16 3.2   625   540  65 2.7
17 4.1   575   680  75 4.5
18 3.0   520   480  65 2.5
19 2.6   545   520  55 3.1
20 3.7   520   490  75 3.6
21 4.0   655   535  65 4.3
22 4.3   630   720  75 4.6
23 2.7   500   500  75 3.0
24 3.6   605   575  65 4.7
25 4.1   555   690  75 3.4
26 2.7   505   545  55 3.7
27 2.9   540   515  55 2.6
28 2.5   520   520  55 3.1
29 3.0   585   710  65 2.7
30 3.3   600   610  85 5.0
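
For readers without GPA.sav, here is a minimal sketch of entering the listed rows directly in R. The name GPA_data matches the session below; base15 is a hypothetical helper name. In the listing, rows 16 to 30 repeat rows 1 to 15 exactly, so the sketch stacks the first block twice and then counts the repeats.

```r
# First 15 rows exactly as listed above
base15 <- data.frame(
  GPA   = c(3.2, 4.1, 3.0, 2.6, 3.7, 4.0, 4.3, 2.7, 3.6, 4.1, 2.7, 2.9, 2.5, 3.0, 3.3),
  GRE_Q = c(625, 575, 520, 545, 520, 655, 630, 500, 605, 555, 505, 540, 520, 585, 600),
  GRE_V = c(540, 680, 480, 520, 490, 535, 720, 500, 575, 690, 545, 515, 520, 710, 610),
  MAT   = c(65, 75, 65, 55, 75, 65, 75, 75, 65, 75, 55, 55, 55, 65, 85),
  AR    = c(2.7, 4.5, 2.5, 3.1, 3.6, 4.3, 4.6, 3.0, 4.7, 3.4, 3.7, 2.6, 3.1, 2.7, 5.0)
)

# Rows 16-30 in the listing repeat rows 1-15, so stack the block twice
GPA_data <- rbind(base15, base15)
rownames(GPA_data) <- NULL

n_obs  <- nrow(GPA_data)              # number of observations
n_dups <- sum(duplicated(GPA_data))   # rows that repeat an earlier row
print(c(n_obs = n_obs, n_dups = n_dups))
```

If the repeated rows are genuine duplicates rather than distinct students with identical scores, the observations are not independent. Standard errors and p-values computed as if n = 30 may then be too optimistic.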

Here is everything I have done in R, please help

## Conduct Multiple regression with variables
> ## How to select many variables in an lm formula???
> ## use y~. for selecting all variables in a data frame
>
> linear_mod6 <- lm(GPA ~ . , # regression formula
+ data=GPA_data) # data set
>
> ## use stepwise variable selection
> ## use direction parameter = forward,backward,both for choosing the type of direction to use
>
> step(linear_mod6, direction="forward")
Start: AIC=-52.37
GPA ~ GRE_Q + GRE_V + MAT + AR


Call:
lm(formula = GPA ~ GRE_Q + GRE_V + MAT + AR, data = GPA_data)

Coefficients:
(Intercept)        GRE_Q        GRE_V          MAT           AR
  -1.738107     0.003998     0.001524     0.020896     0.144234

>
> # Summarize and print the results
> summary(linear_mod6) # show regression coefficients table

Call:
lm(formula = GPA ~ ., data = GPA_data)

Residuals:
    Min      1Q  Median      3Q     Max
-0.7876 -0.2297  0.0069  0.2673  0.5260

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.738107   0.950740  -1.828   0.0795 .
GRE_Q        0.003998   0.001831   2.184   0.0385 *
GRE_V        0.001524   0.001050   1.451   0.1593
MAT          0.020896   0.009549   2.188   0.0382 *
AR           0.144234   0.113001   1.276   0.2135
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3874 on 25 degrees of freedom
Multiple R-squared: 0.6405,   Adjusted R-squared: 0.5829
F-statistic: 11.13 on 4 and 25 DF, p-value: 2.519e-05

>
> # sequential (Type I) ANOVA table for the regression terms; this is not a one-way ANOVA
>
> anova(linear_mod6) # anova table
Analysis of Variance Table

Response: GPA
          Df Sum Sq Mean Sq F value    Pr(>F)
GRE_Q      1 3.8974  3.8974 25.9718 2.906e-05 ***
GRE_V      1 1.1660  1.1660  7.7699  0.009999 **
MAT        1 1.3753  1.3753  9.1651  0.005653 **
AR         1 0.2445  0.2445  1.6292  0.213548
Residuals 25 3.7515  0.1501
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Solutions

Expert Solution

## To import the dataset, run the command below and select the file when prompted
## (read.csv expects a CSV export of the data, not the original GPA.sav file)

data_26Feb <- read.csv(file.choose(), header = TRUE)

head(data_26Feb)

  GPA GRE_Q GRE_V MAT  AR
1 3.2   625   540  65 2.7
2 4.1   575   680  75 4.5
3 3.0   520   480  65 2.5
4 2.6   545   520  55 3.1
5 3.7   520   490  75 3.6
6 4.0   655   535  65 4.3

# Linear Regression Model

regmodel <- lm(GPA ~ ., data = data_26Feb)

summary(regmodel)

 
Call:
lm(formula = GPA ~ ., data = data_26Feb)
 
Residuals:
    Min      1Q  Median      3Q     Max 
-0.7876 -0.2297  0.0069  0.2673  0.5260 
 
Coefficients:
             Estimate Std. Error t value Pr(>|t|)  
(Intercept) -1.738107   0.950740  -1.828   0.0795 .
GRE_Q        0.003998   0.001831   2.184   0.0385 *
GRE_V        0.001524   0.001050   1.451   0.1593  
MAT          0.020896   0.009549   2.188   0.0382 *
AR           0.144234   0.113001   1.276   0.2135  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 0.3874 on 25 degrees of freedom
Multiple R-squared:  0.6405,  Adjusted R-squared:  0.5829 
F-statistic: 11.13 on 4 and 25 DF,  p-value: 2.519e-05

Since the above model contains predictors that are not significant at the 0.05 level (GRE_V, p = 0.1593, and AR, p = 0.2135), we fit a reduced model.

Which independent variables to keep is first decided by backward selection with step(), as below. Note that step() compares models by AIC rather than by the p-values of the individual slopes.

step(regmodel, direction = "backward",trace = FALSE)

 
Call:
lm(formula = GPA ~ GRE_Q + GRE_V + MAT, data = data_26Feb)
 
Coefficients:
(Intercept)        GRE_Q        GRE_V          MAT  
  -2.148770     0.004926     0.001612     0.026119  

Hence backward selection drops AR, and the model retains GRE_Q, GRE_V, and MAT as the independent variables.
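
Because step() selects by AIC, it can retain a predictor whose slope has p > 0.05. The sketch below shows one way to see both criteria side by side. It assumes the data are entered by hand as listed in the question; base15, full, and sel are names chosen here for illustration.

```r
# Data as listed in the question (rows 16-30 repeat rows 1-15)
base15 <- data.frame(
  GPA   = c(3.2, 4.1, 3.0, 2.6, 3.7, 4.0, 4.3, 2.7, 3.6, 4.1, 2.7, 2.9, 2.5, 3.0, 3.3),
  GRE_Q = c(625, 575, 520, 545, 520, 655, 630, 500, 605, 555, 505, 540, 520, 585, 600),
  GRE_V = c(540, 680, 480, 520, 490, 535, 720, 500, 575, 690, 545, 515, 520, 710, 610),
  MAT   = c(65, 75, 65, 55, 75, 65, 75, 75, 65, 75, 55, 55, 55, 65, 85),
  AR    = c(2.7, 4.5, 2.5, 3.1, 3.6, 4.3, 4.6, 3.0, 4.7, 3.4, 3.7, 2.6, 3.1, 2.7, 5.0)
)
GPA_data <- rbind(base15, base15)

full <- lm(GPA ~ GRE_Q + GRE_V + MAT + AR, data = GPA_data)

# Backward elimination by AIC; trace = 0 suppresses the step-by-step log
sel <- step(full, direction = "backward", trace = 0)
kept <- attr(terms(sel), "term.labels")
print(kept)

# F-tests for dropping each remaining term, to compare against the AIC choice
print(drop1(sel, test = "F"))
```

Setting trace = 1 instead prints the AIC of each candidate model at every step, which makes the reason for each removal visible.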

## Regression Model 2

regmodel2 <- lm(GPA ~ GRE_Q + GRE_V + MAT, data = data_26Feb)

summary(regmodel2)

 
Call:
lm(formula = GPA ~ GRE_Q + GRE_V + MAT, data = data_26Feb)
 
Residuals:
    Min      1Q  Median      3Q     Max 
-0.7101 -0.2762  0.1159  0.3275  0.5386 
 
Coefficients:
             Estimate Std. Error t value Pr(>|t|)   
(Intercept) -2.148770   0.905406  -2.373  0.02531 * 
GRE_Q        0.004926   0.001701   2.896  0.00756 **
GRE_V        0.001612   0.001060   1.520  0.14051   
MAT          0.026119   0.008731   2.991  0.00601 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 0.392 on 26 degrees of freedom
Multiple R-squared:  0.617,   Adjusted R-squared:  0.5729 
F-statistic: 13.96 on 3 and 26 DF,  p-value: 1.28e-05

Since GRE_V is still not significant in this model (p = 0.14051), we exclude it and fit another model.

## Final Regression Model

regmodel3 <- lm(GPA ~ GRE_Q + MAT, data = data_26Feb)

summary(regmodel3)

 
Call:
lm(formula = GPA ~ GRE_Q + MAT, data = data_26Feb)
 
Residuals:
    Min      1Q  Median      3Q     Max 
-0.7751 -0.3325  0.1078  0.3184  0.6020 
 
Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.129377   0.927038  -2.297  0.02960 *  
GRE_Q        0.005976   0.001591   3.756  0.00084 ***
MAT          0.030807   0.008365   3.683  0.00102 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 0.4014 on 27 degrees of freedom
Multiple R-squared:  0.583,   Adjusted R-squared:  0.5521 
F-statistic: 18.87 on 2 and 27 DF,  p-value: 7.444e-06
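
The manual elimination carried out above (drop the least significant predictor, refit, repeat) can be sketched as a loop. This is an illustrative sketch rather than a standard R function; alpha, fit, and worst are names introduced here, and the data are entered by hand as listed in the question.

```r
# Data as listed in the question (rows 16-30 repeat rows 1-15)
base15 <- data.frame(
  GPA   = c(3.2, 4.1, 3.0, 2.6, 3.7, 4.0, 4.3, 2.7, 3.6, 4.1, 2.7, 2.9, 2.5, 3.0, 3.3),
  GRE_Q = c(625, 575, 520, 545, 520, 655, 630, 500, 605, 555, 505, 540, 520, 585, 600),
  GRE_V = c(540, 680, 480, 520, 490, 535, 720, 500, 575, 690, 545, 515, 520, 710, 610),
  MAT   = c(65, 75, 65, 55, 75, 65, 75, 75, 65, 75, 55, 55, 55, 65, 85),
  AR    = c(2.7, 4.5, 2.5, 3.1, 3.6, 4.3, 4.6, 3.0, 4.7, 3.4, 3.7, 2.6, 3.1, 2.7, 5.0)
)
GPA_data <- rbind(base15, base15)

alpha <- 0.05   # significance threshold used in the answer
fit <- lm(GPA ~ GRE_Q + GRE_V + MAT + AR, data = GPA_data)

repeat {
  # p-values of the slopes only (first row is the intercept)
  pvals <- summary(fit)$coefficients[-1, "Pr(>|t|)", drop = FALSE]
  if (nrow(pvals) == 0 || max(pvals) <= alpha) break
  # remove the predictor with the largest p-value and refit
  worst <- rownames(pvals)[which.max(pvals)]
  fit <- update(fit, as.formula(paste(". ~ . -", worst)))
}

final_terms <- attr(terms(fit), "term.labels")
print(final_terms)
```

Removing one predictor at a time matters because p-values change after each refit, as the three summaries above show.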

Regression Equation:

GPA = -2.13 + 0.006 * GRE_Q + 0.031 * MAT

The p-value of the overall model (7.444e-06) is less than 0.05, hence the model is significant.

The R-squared of the model is 0.583, i.e. 58.3% of the variation in GPA can be explained by GRE_Q and MAT together.
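
One way to check that dropping GRE_V and AR together costs little explanatory power is a partial F-test between the final and the full model. This is a sketch under the assumption that the data are entered by hand as listed in the question; reduced, full, and cmp are names chosen here.

```r
# Data as listed in the question (rows 16-30 repeat rows 1-15)
base15 <- data.frame(
  GPA   = c(3.2, 4.1, 3.0, 2.6, 3.7, 4.0, 4.3, 2.7, 3.6, 4.1, 2.7, 2.9, 2.5, 3.0, 3.3),
  GRE_Q = c(625, 575, 520, 545, 520, 655, 630, 500, 605, 555, 505, 540, 520, 585, 600),
  GRE_V = c(540, 680, 480, 520, 490, 535, 720, 500, 575, 690, 545, 515, 520, 710, 610),
  MAT   = c(65, 75, 65, 55, 75, 65, 75, 75, 65, 75, 55, 55, 55, 65, 85),
  AR    = c(2.7, 4.5, 2.5, 3.1, 3.6, 4.3, 4.6, 3.0, 4.7, 3.4, 3.7, 2.6, 3.1, 2.7, 5.0)
)
GPA_data <- rbind(base15, base15)

reduced <- lm(GPA ~ GRE_Q + MAT, data = GPA_data)
full    <- lm(GPA ~ GRE_Q + GRE_V + MAT + AR, data = GPA_data)

# Partial F-test: H0 says the slopes of GRE_V and AR are both zero
cmp <- anova(reduced, full)
print(cmp)

p_drop <- cmp[["Pr(>F)"]][2]           # p-value of the joint test
r2_reduced <- summary(reduced)$r.squared
```

A p-value above 0.05 in the second row means the two dropped predictors do not jointly improve the fit significantly, which supports the smaller model.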

Assumptions of Linear Model

plot(regmodel3)

plot() on an lm object produces four diagnostic plots (Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage). The plots themselves are not reproduced here.

Residuals vs Fitted: since there is no definite pattern in this plot, the linearity assumption appears reasonable.

Normal Q-Q: the points fall on a straight line if the errors are normally distributed. Here many points deviate from the line, so the normality assumption is questionable.

Scale-Location (or Spread-Location): used to check the homogeneity of variance of the residuals (homoscedasticity). A horizontal line with equally spread points indicates homoscedasticity. That is not the case in our example, which points to heteroscedasticity.
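
The visual checks can be supplemented with formal tests. The sketch below uses only base R: shapiro.test() for normality of the residuals, and a hand-computed studentized Breusch-Pagan statistic (n times the R-squared of a regression of squared residuals on the predictors) for constant variance. The hand computation is an assumption made here to avoid extra packages, and the data are entered by hand as listed in the question.

```r
# Data as listed in the question (rows 16-30 repeat rows 1-15)
base15 <- data.frame(
  GPA   = c(3.2, 4.1, 3.0, 2.6, 3.7, 4.0, 4.3, 2.7, 3.6, 4.1, 2.7, 2.9, 2.5, 3.0, 3.3),
  GRE_Q = c(625, 575, 520, 545, 520, 655, 630, 500, 605, 555, 505, 540, 520, 585, 600),
  GRE_V = c(540, 680, 480, 520, 490, 535, 720, 500, 575, 690, 545, 515, 520, 710, 610),
  MAT   = c(65, 75, 65, 55, 75, 65, 75, 75, 65, 75, 55, 55, 55, 65, 85),
  AR    = c(2.7, 4.5, 2.5, 3.1, 3.6, 4.3, 4.6, 3.0, 4.7, 3.4, 3.7, 2.6, 3.1, 2.7, 5.0)
)
GPA_data <- rbind(base15, base15)

final <- lm(GPA ~ GRE_Q + MAT, data = GPA_data)

# Normality of residuals: H0 says the residuals are normally distributed
sw <- shapiro.test(residuals(final))
print(sw)

# Studentized Breusch-Pagan test by hand: H0 says the error variance is constant
GPA_data$sq_res <- residuals(final)^2
aux <- lm(sq_res ~ GRE_Q + MAT, data = GPA_data)
bp_stat <- nrow(GPA_data) * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 2, lower.tail = FALSE)   # df = number of predictors
print(c(BP = bp_stat, p_value = bp_p))
```

Small p-values would indicate a violation of the corresponding assumption. With a sample this small, the tests have limited power, so they are best read alongside the plots rather than instead of them.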

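Finally,

When GREQ=550, GREV=620, MAT=68, & AR=4, only GRE_Q and MAT enter the prediction, because GRE_V and AR were dropped from the final model:

GPA = -2.13 + 0.006 * 550 + 0.031 * 68 = 3.28

With the unrounded coefficients from the summary above (-2.129377, 0.005976, 0.030807), the prediction is about 3.25.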
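
The same prediction can be obtained with predict(), which uses the unrounded coefficients and can also return a prediction interval. This is a sketch assuming the data are entered by hand as listed in the question; new_student is a name introduced here.

```r
# Data as listed in the question (rows 16-30 repeat rows 1-15)
base15 <- data.frame(
  GPA   = c(3.2, 4.1, 3.0, 2.6, 3.7, 4.0, 4.3, 2.7, 3.6, 4.1, 2.7, 2.9, 2.5, 3.0, 3.3),
  GRE_Q = c(625, 575, 520, 545, 520, 655, 630, 500, 605, 555, 505, 540, 520, 585, 600),
  GRE_V = c(540, 680, 480, 520, 490, 535, 720, 500, 575, 690, 545, 515, 520, 710, 610),
  MAT   = c(65, 75, 65, 55, 75, 65, 75, 75, 65, 75, 55, 55, 55, 65, 85),
  AR    = c(2.7, 4.5, 2.5, 3.1, 3.6, 4.3, 4.6, 3.0, 4.7, 3.4, 3.7, 2.6, 3.1, 2.7, 5.0)
)
GPA_data <- rbind(base15, base15)

final <- lm(GPA ~ GRE_Q + MAT, data = GPA_data)

# Only the predictors in the final model are needed for the new student
new_student <- data.frame(GRE_Q = 550, MAT = 68)

# Point prediction with a 95% prediction interval for an individual student
pred <- predict(final, newdata = new_student, interval = "prediction", level = 0.95)
print(pred)
```

The columns lwr and upr give the range within which an individual student's GPA is expected to fall, which is wider than a confidence interval for the mean GPA at those scores.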

