Question


RPI would like to develop a multiple regression model for predicting graduate student Grade Point Averages. The initial data from 30 grad students are in the file GPA.sav. The file contains the following variables: GPA (graduate grade point averages), GREQ (score on the quantitative section of the Graduate Record Exam, a commonly used entrance exam for graduate programs), GREV (score on the verbal section of the GRE), MAT (score on the Miller Analogies Test, another graduate entrance exam), and AR, the Average Rating that the student received from 3 professors who interviewed the student prior to making admission decisions. GPA can exceed 4.0 since the university attaches pluses and minuses to letter grades, including As.

Conduct a multiple regression analysis in R using GPA as the dependent variable and the other variables as predictors. If the slopes for any of the variables are not significant, remove them from the MR and run the MR again. Briefly explain the R output. Be sure to check assumptions and summarize the results of your tests. Write down the final MR equation for GPA. Predict the GPA of an incoming student with GREQ=550, GREV=620, MAT=68, & AR=4.

This is the data:

   GPA GRE_Q GRE_V MAT  AR
1  3.2   625   540  65 2.7
2  4.1   575   680  75 4.5
3  3.0   520   480  65 2.5
4  2.6   545   520  55 3.1
5  3.7   520   490  75 3.6
6  4.0   655   535  65 4.3
7  4.3   630   720  75 4.6
8  2.7   500   500  75 3.0
9  3.6   605   575  65 4.7
10 4.1   555   690  75 3.4
11 2.7   505   545  55 3.7
12 2.9   540   515  55 2.6
13 2.5   520   520  55 3.1
14 3.0   585   710  65 2.7
15 3.3   600   610  85 5.0
16 3.2   625   540  65 2.7
17 4.1   575   680  75 4.5
18 3.0   520   480  65 2.5
19 2.6   545   520  55 3.1
20 3.7   520   490  75 3.6
21 4.0   655   535  65 4.3
22 4.3   630   720  75 4.6
23 2.7   500   500  75 3.0
24 3.6   605   575  65 4.7
25 4.1   555   690  75 3.4
26 2.7   505   545  55 3.7
27 2.9   540   515  55 2.6
28 2.5   520   520  55 3.1
29 3.0   585   710  65 2.7
30 3.3   600   610  85 5.0
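Before fitting anything, it can help to load the listing into a data frame and look for repeated rows, because rows 16-30 of the listing repeat rows 1-15 exactly. The following is a minimal sketch, assuming the listing above is the complete content of GPA.sav; the object name `GPA_data` follows the session below and `n_dup` is a hypothetical helper name.

```r
# Data exactly as listed in the question. The header has five names and each
# row has six fields, so read.table() uses the first field as the row name.
GPA_data <- read.table(header = TRUE, text = "
GPA GRE_Q GRE_V MAT AR
1 3.2 625 540 65 2.7
2 4.1 575 680 75 4.5
3 3.0 520 480 65 2.5
4 2.6 545 520 55 3.1
5 3.7 520 490 75 3.6
6 4.0 655 535 65 4.3
7 4.3 630 720 75 4.6
8 2.7 500 500 75 3.0
9 3.6 605 575 65 4.7
10 4.1 555 690 75 3.4
11 2.7 505 545 55 3.7
12 2.9 540 515 55 2.6
13 2.5 520 520 55 3.1
14 3.0 585 710 65 2.7
15 3.3 600 610 85 5.0
16 3.2 625 540 65 2.7
17 4.1 575 680 75 4.5
18 3.0 520 480 65 2.5
19 2.6 545 520 55 3.1
20 3.7 520 490 75 3.6
21 4.0 655 535 65 4.3
22 4.3 630 720 75 4.6
23 2.7 500 500 75 3.0
24 3.6 605 575 65 4.7
25 4.1 555 690 75 3.4
26 2.7 505 545 55 3.7
27 2.9 540 515 55 2.6
28 2.5 520 520 55 3.1
29 3.0 585 710 65 2.7
30 3.3 600 610 85 5.0
")

str(GPA_data)                       # structure: 30 rows, 5 numeric columns

# Count rows that are identical to an earlier row
n_dup <- sum(duplicated(GPA_data))
n_dup
```

If the repeats are a data-entry artifact rather than 30 distinct students, the observations are not independent, and the standard errors and p-values reported below are likely to be too optimistic.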

Here is everything I have done in R, please help

## Conduct Multiple regression with variables
> ## How to select many variables in an lm formula???
> ## use y~. for selecting all variables in a data frame
>
> linear_mod6 <- lm(GPA ~ . , # regression formula
+ data=GPA_data) # data set
>
> ## use stepwise variable selection
> ## use direction parameter = forward,backward,both for choosing the type of direction to use
>
> step(linear_mod6, direction="forward")
Start: AIC=-52.37
GPA ~ GRE_Q + GRE_V + MAT + AR

Call:
lm(formula = GPA ~ GRE_Q + GRE_V + MAT + AR, data = GPA_data)

Coefficients:
(Intercept)        GRE_Q        GRE_V          MAT           AR
  -1.738107     0.003998     0.001524     0.020896     0.144234

>
> # Summarize and print the results
> summary(linear_mod6) # show regression coefficients table

Call:
lm(formula = GPA ~ ., data = GPA_data)

Residuals:
    Min      1Q  Median      3Q     Max
-0.7876 -0.2297  0.0069  0.2673  0.5260

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.738107   0.950740  -1.828   0.0795 .
GRE_Q        0.003998   0.001831   2.184   0.0385 *
GRE_V        0.001524   0.001050   1.451   0.1593
MAT          0.020896   0.009549   2.188   0.0382 *
AR           0.144234   0.113001   1.276   0.2135
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3874 on 25 degrees of freedom
Multiple R-squared: 0.6405, Adjusted R-squared: 0.5829
F-statistic: 11.13 on 4 and 25 DF, p-value: 2.519e-05

>
> # sequential (Type I) ANOVA table for the fitted regression
>
> anova(linear_mod6) # anova table
Analysis of Variance Table

Response: GPA
          Df Sum Sq Mean Sq F value    Pr(>F)
GRE_Q      1 3.8974  3.8974 25.9718 2.906e-05 ***
GRE_V      1 1.1660  1.1660  7.7699  0.009999 **
MAT        1 1.3753  1.3753  9.1651  0.005653 **
AR         1 0.2445  0.2445  1.6292  0.213548
Residuals 25 3.7515  0.1501
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
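One detail in the session above: `step(linear_mod6, direction="forward")` starts from the full model and is given no `scope`, so there are no terms left to add and the full model comes back unchanged, which is what the output shows. A forward search normally starts from the intercept-only model and is told which terms it may add. The sketch below illustrates that, assuming the data are as listed in the question (rows 16-30 repeat rows 1-15, so the first 15 rows are entered once and stacked); `half`, `null_mod`, and `fwd` are hypothetical names.

```r
# Rows 1-15 of the posted listing; rows 16-30 repeat them exactly.
half <- data.frame(
  GPA   = c(3.2, 4.1, 3.0, 2.6, 3.7, 4.0, 4.3, 2.7, 3.6, 4.1, 2.7, 2.9, 2.5, 3.0, 3.3),
  GRE_Q = c(625, 575, 520, 545, 520, 655, 630, 500, 605, 555, 505, 540, 520, 585, 600),
  GRE_V = c(540, 680, 480, 520, 490, 535, 720, 500, 575, 690, 545, 515, 520, 710, 610),
  MAT   = c(65, 75, 65, 55, 75, 65, 75, 75, 65, 75, 55, 55, 55, 65, 85),
  AR    = c(2.7, 4.5, 2.5, 3.1, 3.6, 4.3, 4.6, 3.0, 4.7, 3.4, 3.7, 2.6, 3.1, 2.7, 5.0)
)
GPA_data <- rbind(half, half)

# Forward selection: start from the intercept-only model and let step()
# add terms from the upper scope while each addition lowers the AIC.
null_mod <- lm(GPA ~ 1, data = GPA_data)
fwd <- step(null_mod,
            scope = list(lower = ~ 1, upper = ~ GRE_Q + GRE_V + MAT + AR),
            direction = "forward",
            trace = 0)

formula(fwd)   # the model that forward selection settles on
```

Note that `step()` selects by AIC, not by the significance of individual slopes, so the model it returns can still contain predictors with p-values above 0.05.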

Expert Solution

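## Import the dataset: run the command below and select the file in the dialog.
## read.csv() cannot read an SPSS .sav file directly, so GPA.sav must first be exported to CSV.

data_26Feb <- read.csv(file.choose(), header = TRUE)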

head(data_26Feb)

  GPA GRE_Q GRE_V MAT  AR
1 3.2   625   540  65 2.7
2 4.1   575   680  75 4.5
3 3.0   520   480  65 2.5
4 2.6   545   520  55 3.1
5 3.7   520   490  75 3.6
6 4.0   655   535  65 4.3

# Linear Regression Model

regmodel <- lm(GPA ~ ., data = data_26Feb)

summary(regmodel)

Call:
lm(formula = GPA ~ ., data = data_26Feb)

Residuals:
    Min      1Q  Median      3Q     Max
-0.7876 -0.2297  0.0069  0.2673  0.5260

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.738107   0.950740  -1.828   0.0795 .
GRE_Q        0.003998   0.001831   2.184   0.0385 *
GRE_V        0.001524   0.001050   1.451   0.1593
MAT          0.020896   0.009549   2.188   0.0382 *
AR           0.144234   0.113001   1.276   0.2135
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3874 on 25 degrees of freedom
Multiple R-squared:  0.6405,  Adjusted R-squared:  0.5829
F-statistic: 11.13 on 4 and 25 DF,  p-value: 2.519e-05

The full model contains predictors whose slopes are not significant at the 0.05 level (GRE_V, p = 0.1593; AR, p = 0.2135), so we fit a reduced model.

Which independent variables to keep is decided by backward selection with step(). Note that step() drops a term only when doing so lowers the AIC; it does not test the individual p-values.

step(regmodel, direction = "backward",trace = FALSE)

Call:
lm(formula = GPA ~ GRE_Q + GRE_V + MAT, data = data_26Feb)

Coefficients:
(Intercept)        GRE_Q        GRE_V          MAT
  -2.148770     0.004926     0.001612     0.026119

Backward selection drops AR, so the next model contains only GRE_Q, GRE_V, and MAT as independent variables.
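Because `step()` works on AIC while the assignment asks for non-significant slopes to be removed, the two criteria can stop at different models. The sketch below shows one way to automate removal by p-value instead. It is an illustration, not the method used in this answer: `backward_by_p` is a hypothetical helper, the 0.05 cutoff is an assumption, and the data frame is rebuilt from the listing in the question (rows 16-30 repeat rows 1-15).

```r
# Rows 1-15 of the posted listing; rows 16-30 repeat them exactly.
half <- data.frame(
  GPA   = c(3.2, 4.1, 3.0, 2.6, 3.7, 4.0, 4.3, 2.7, 3.6, 4.1, 2.7, 2.9, 2.5, 3.0, 3.3),
  GRE_Q = c(625, 575, 520, 545, 520, 655, 630, 500, 605, 555, 505, 540, 520, 585, 600),
  GRE_V = c(540, 680, 480, 520, 490, 535, 720, 500, 575, 690, 545, 515, 520, 710, 610),
  MAT   = c(65, 75, 65, 55, 75, 65, 75, 75, 65, 75, 55, 55, 55, 65, 85),
  AR    = c(2.7, 4.5, 2.5, 3.1, 3.6, 4.3, 4.6, 3.0, 4.7, 3.4, 3.7, 2.6, 3.1, 2.7, 5.0)
)
data_26Feb <- rbind(half, half)

# AIC and partial F-test for deleting each predictor from the full model
regmodel <- lm(GPA ~ ., data = data_26Feb)
print(drop1(regmodel, test = "F"))

# Hypothetical helper: refit repeatedly, each time removing the predictor
# with the largest p-value, until every remaining slope has p <= alpha
# (or only one predictor is left).
backward_by_p <- function(data, response, alpha = 0.05) {
  keep <- setdiff(names(data), response)
  repeat {
    fit <- lm(reformulate(keep, response = response), data = data)
    p <- summary(fit)$coefficients[-1, "Pr(>|t|)", drop = FALSE]
    if (max(p) <= alpha || length(keep) == 1) return(fit)
    keep <- setdiff(keep, rownames(p)[which.max(p)])
  }
}

final <- backward_by_p(data_26Feb, response = "GPA")
formula(final)
```

Removing one predictor at a time matters because the p-values of the remaining slopes change after each refit.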

## Regression Model 2

regmodel2 <- lm(GPA ~ GRE_Q + GRE_V + MAT, data = data_26Feb)

summary(regmodel2)

Call:
lm(formula = GPA ~ GRE_Q + GRE_V + MAT, data = data_26Feb)

Residuals:
    Min      1Q  Median      3Q     Max
-0.7101 -0.2762  0.1159  0.3275  0.5386

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.148770   0.905406  -2.373  0.02531 *
GRE_Q        0.004926   0.001701   2.896  0.00756 **
GRE_V        0.001612   0.001060   1.520  0.14051
MAT          0.026119   0.008731   2.991  0.00601 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.392 on 26 degrees of freedom
Multiple R-squared:  0.617,   Adjusted R-squared:  0.5729
F-statistic: 13.96 on 3 and 26 DF,  p-value: 1.28e-05

GRE_V is still not significant in this model (p = 0.14051), so we drop it and fit the model again.

## Final Regression Model

regmodel3 <- lm(GPA ~ GRE_Q + MAT, data = data_26Feb)

summary(regmodel3)

Call:
lm(formula = GPA ~ GRE_Q + MAT, data = data_26Feb)

Residuals:
    Min      1Q  Median      3Q     Max
-0.7751 -0.3325  0.1078  0.3184  0.6020

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.129377   0.927038  -2.297  0.02960 *
GRE_Q        0.005976   0.001591   3.756  0.00084 ***
MAT          0.030807   0.008365   3.683  0.00102 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4014 on 27 degrees of freedom
Multiple R-squared:  0.583,   Adjusted R-squared:  0.5521
F-statistic: 18.87 on 2 and 27 DF,  p-value: 7.444e-06

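Regression Equation:

GPA = -2.129377 + 0.005976 * GRE_Q + 0.030807 * MAT

(rounded: GPA = -2.13 + 0.006 * GRE_Q + 0.031 * MAT)

Both slopes are significant at the 0.05 level (GRE_Q, p = 0.00084; MAT, p = 0.00102). The p-value of the overall F-test (7.444e-06) is also below 0.05, so the model as a whole is significant.

R-squared is 0.583, i.e. 58.3% of the variation in GPA is explained by GRE_Q and MAT together.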
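R-squared can only fall when a predictor is removed, so the adjusted R-squared is the fairer figure for comparing the three models. As a small worked check, the sketch below recomputes it from the reported R-squared values with the usual formula 1 - (1 - R²)(n - 1)/(n - p - 1), where n = 30 and p is the number of predictors. The function name `adj_r2` is hypothetical, and the inputs are the rounded values printed above, so the results agree with the reported ones only to about three decimals.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# R-squared values as printed in the three summary() outputs
models <- data.frame(
  model = c("regmodel", "regmodel2", "regmodel3"),
  r2    = c(0.6405, 0.617, 0.583),
  p     = c(4, 3, 2)
)
models$adj_r2 <- adj_r2(models$r2, n = 30, p = models$p)
print(models)
```

The adjusted value drops only slightly, from about 0.583 to about 0.552, when GRE_V and AR are removed, so the two-predictor model gives up little explanatory power.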

Assumptions of Linear Model

plot(regmodel3)

Residuals vs Fitted: the residuals show no definite pattern, so the linearity assumption looks reasonable.

Normal Q-Q: the points fall on a straight line if the errors are normally distributed. Here many points deviate from the line, so the normality assumption is doubtful.

Scale-Location (or Spread-Location): used to check the homogeneity of variance of the residuals (homoscedasticity). A horizontal line with equally spread points indicates homoscedasticity. That is not the case here, which suggests a heteroscedasticity problem.
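The readings of the diagnostic plots can be backed up with formal tests, which the question asks to summarize. The sketch below is one possible way to do that in base R, assuming the data are as listed in the question (rows 16-30 repeat rows 1-15). It uses `shapiro.test()` for normality and a hand-computed studentized Breusch-Pagan statistic for constant variance; the names `half`, `aux_data`, `bp_stat`, `bp_p`, and `n_dup` are hypothetical, and no results are asserted here.

```r
# Rows 1-15 of the posted listing; rows 16-30 repeat them exactly.
half <- data.frame(
  GPA   = c(3.2, 4.1, 3.0, 2.6, 3.7, 4.0, 4.3, 2.7, 3.6, 4.1, 2.7, 2.9, 2.5, 3.0, 3.3),
  GRE_Q = c(625, 575, 520, 545, 520, 655, 630, 500, 605, 555, 505, 540, 520, 585, 600),
  GRE_V = c(540, 680, 480, 520, 490, 535, 720, 500, 575, 690, 545, 515, 520, 710, 610),
  MAT   = c(65, 75, 65, 55, 75, 65, 75, 75, 65, 75, 55, 55, 55, 65, 85),
  AR    = c(2.7, 4.5, 2.5, 3.1, 3.6, 4.3, 4.6, 3.0, 4.7, 3.4, 3.7, 2.6, 3.1, 2.7, 5.0)
)
data_26Feb <- rbind(half, half)

regmodel3 <- lm(GPA ~ GRE_Q + MAT, data = data_26Feb)
res <- residuals(regmodel3)

# Normality: Shapiro-Wilk test (H0: residuals are normally distributed)
sw <- shapiro.test(res)

# Constant variance: studentized Breusch-Pagan test computed by hand.
# Regress squared residuals on the predictors; n * R^2 is approximately
# chi-squared with df = number of predictors (2) under homoscedasticity.
aux_data <- data_26Feb
aux_data$res_sq <- res^2
aux <- lm(res_sq ~ GRE_Q + MAT, data = aux_data)
bp_stat <- nrow(aux_data) * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 2, lower.tail = FALSE)

# Independence: number of rows that exactly repeat an earlier row
n_dup <- sum(duplicated(data_26Feb))

cat("Shapiro-Wilk p-value:  ", format.pval(sw$p.value), "\n")
cat("Breusch-Pagan p-value: ", format.pval(bp_p), "\n")
cat("Duplicated rows:       ", n_dup, "\n")
```

A p-value below 0.05 in either test would support the concern raised by the corresponding plot. The duplicated rows are a separate issue: they undermine the independence assumption whatever the tests say.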

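Finally, predict the GPA of a student with GREQ=550, GREV=620, MAT=68, & AR=4. GREV and AR are not in the final model, so only GREQ and MAT enter the prediction:

GPA = -2.129377 + 0.005976 * 550 + 0.030807 * 68 = 3.25

(With the coefficients rounded to -2.13, 0.006, and 0.031 the same calculation gives 3.28. The value from the unrounded coefficients is the more accurate one.)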
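The prediction is sensitive to how far the coefficients are rounded, because the GRE_Q slope is multiplied by 550. The sketch below repeats the arithmetic with the coefficients exactly as printed by `summary(regmodel3)` and with the rounded ones; the names `b`, `b_rounded`, `new_student`, and `gpa_hat` are hypothetical.

```r
# Coefficients as printed in summary(regmodel3)
b <- c(intercept = -2.129377, GRE_Q = 0.005976, MAT = 0.030807)

# Coefficients rounded as in the regression equation
b_rounded <- c(intercept = -2.13, GRE_Q = 0.006, MAT = 0.031)

# New student; GREV = 620 and AR = 4 are not used by the final model
new_student <- c(GRE_Q = 550, MAT = 68)

gpa_hat <- unname(b["intercept"] +
                  b["GRE_Q"] * new_student["GRE_Q"] +
                  b["MAT"]   * new_student["MAT"])
gpa_hat_rounded <- unname(b_rounded["intercept"] +
                          b_rounded["GRE_Q"] * new_student["GRE_Q"] +
                          b_rounded["MAT"]   * new_student["MAT"])

round(gpa_hat, 2)          # 3.25
round(gpa_hat_rounded, 2)  # 3.28
```

With the fitted object available, `predict(regmodel3, newdata = data.frame(GRE_Q = 550, MAT = 68), interval = "prediction")` avoids the rounding altogether and also returns a prediction interval.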

orchestra answered 2 years ago
