This is an exploratory problem intended to introduce the idea of
curvilinear regression. Personally, I was a bit shocked to discover
that multiple LINEAR regression is the main vehicle for fitting
regressions to data with nonlinear relationships, which sounds a bit
counter-intuitive. However, if we think of the higher-power terms
(quadratic, cubic, etc.) as distinct variables, the ideas work well
together, because the model is still linear in its coefficients.
Here is a data set for students in a gifted program. The first
score ($X_1 = \text{GPA}$) is the students’ math grade from last year,
and the second score ($Y = \text{SAT}$) is their SAT-M score. As this is
a non-representative group (when considering the population of all
students taking math classes in high school), it is not unexpected
to see range-restriction effects (generally all high performing,
few lower performing representatives) or ceiling effects (maximum
score on the SAT-M is 800). In data such as this, it is not
uncommon to see non-linear trends.
GPA | SAT |
---|---|
3.2 | 760 |
3.8 | 775 |
3 | 760 |
2.8 | 745 |
4 | 770 |
3.5 | 760 |
3.1 | 760 |
3.2 | 770 |
3.3 | 765 |
3.5 | 765 |
3.5 | 755 |
3.3 | 760 |
3.6 | 765 |
2.9 | 750 |
2.1 | 725 |
3.2 | 765 |
3.4 | 770 |
3.8 | 765 |
2.2 | 720 |
2.8 | 760 |
2.8 | 755 |
3.6 | 755 |
3.6 | 770 |
3.5 | 765 |
3.4 | 770 |
Step 1: Copy the data into your preferred
statistical software program. Change the variable names to GPA and
SAT if need be. Before doing any analysis, look at a scatterplot of
the data with GPA on the horizontal axis and SAT on the vertical
axis. Be sure to note any trends.
The following includes information for Excel users. If you are
not using Excel, please disregard.
Step 2: Run a regression (Data Analysis >
Regression) with SAT as the Y variable and GPA as the X variable.
Again, be sure to note what
evidence supports the assumptions for a regression analysis. Report
the regression equation and the requested statistics:
$\text{SAT} = \_\_\_\_ + \_\_\_\_ \times \text{GPA}$
(Report regression coefficients accurate to 3 decimal
places.)
$R^2_{\text{adj}} = \_\_\_\_$
(Report accurate to 3 decimal places.)
Step 3: Create a third variable called GPAsq (for
squared GPA). In Excel, use a formula, something like =B1^2 and
fill down the rest of the column.
Step 4: Run the quadratic regression by adding the
independent variable GPAsq to the model. Report the regression
equation and the requested statistics:
$\text{SAT} = \_\_\_\_ + \_\_\_\_ \times \text{GPA} + \_\_\_\_ \times \text{GPA}^2$
(Report regression coefficients accurate to 3 decimal
places.)
$R^2_{\text{adj}} = \_\_\_\_$
(Report accurate to 3 decimal places.)
Step 5: Notice how the adjusted coefficient of
multiple determination changed from the bivariate regression to the
quadratic (multiple) regression. The next step is to determine if
this more complicated model is statistically significantly better
than the more parsimonious linear model.
For the multiple regression model, what was the F-ratio
and the resulting P-value?
$F_{\text{model}} = \_\_\_\_$
(Report accurate to 2 decimal places.)
$P = \_\_\_\_$
(Report accurate to 3 decimal places.)
Here, y = SAT, x = GPA. We input the given data in MS Excel and,
as directed, use the "Regression" option under Data > Data
Analysis to compute the regression equation and answer the given
questions. The screenshot of the data and the output is given
below.
From the scatterplot, we can say that there is an increasing
trend.
The fitted regression equation is:
$\hat{y} = 682.352 + 23.689x$.
The adjusted $R^2$ value is 0.685.
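As a cross-check that does not depend on Excel, the bivariate fit can be sketched in plain Python using the closed-form least-squares formulas. The list names `gpa` and `sat` and the helper `fit_line` are illustrative choices, not part of the original problem; the data are transcribed from the table in the question.

```python
# Simple linear regression of SAT on GPA (illustrative sketch, stdlib only).
# Data transcribed from the table in the problem statement.
gpa = [3.2, 3.8, 3.0, 2.8, 4.0, 3.5, 3.1, 3.2, 3.3, 3.5, 3.5, 3.3, 3.6,
       2.9, 2.1, 3.2, 3.4, 3.8, 2.2, 2.8, 2.8, 3.6, 3.6, 3.5, 3.4]
sat = [760, 775, 760, 745, 770, 760, 760, 770, 765, 765, 755, 760, 765,
       750, 725, 765, 770, 765, 720, 760, 755, 755, 770, 765, 770]


def fit_line(x, y):
    """Return (intercept, slope, r2, adjusted r2) for y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r2 = sxy ** 2 / (sxx * syy)
    # One predictor, so the residual degrees of freedom are n - 2.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - 2)
    return intercept, slope, r2, adj_r2


intercept, slope, r2, adj_r2 = fit_line(gpa, sat)
print(f"SAT = {intercept:.3f} + {slope:.3f} * GPA")
print(f"adjusted R^2 = {adj_r2:.3f}")
```

With the data entered as in the table, the printed coefficients and adjusted $R^2$ should agree with the Excel values reported here to 3 decimal places.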
Now, we introduce a third variable named "GPAsq" (the square of GPA)
and rerun the regression with both GPA and GPAsq as predictors. The
fitted regression equation is:
$\hat{y} = 534.348 + 123.314x - 16.331x^2$.
The adjusted $R^2$ value is 0.801.
For the quadratic model, $F_{\text{model}} = 49.22$ and the P-value, reported to 3 decimal places, is 0.000 (that is, $P < 0.001$).
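The quadratic fit and its model F-ratio can be reproduced with a short stdlib-only sketch that treats GPA and GPAsq as two separate predictors and solves the normal equations. The helper `solve` and all variable names are illustrative assumptions. The P-value line uses a closed form for the upper tail of the F distribution that holds only when the numerator degrees of freedom equal 2, as they do here; for other models a statistics library would be needed.

```python
# Quadratic (multiple linear) regression of SAT on GPA and GPA^2.
# Illustrative sketch, stdlib only; data transcribed from the problem's table.
gpa = [3.2, 3.8, 3.0, 2.8, 4.0, 3.5, 3.1, 3.2, 3.3, 3.5, 3.5, 3.3, 3.6,
       2.9, 2.1, 3.2, 3.4, 3.8, 2.2, 2.8, 2.8, 3.6, 3.6, 3.5, 3.4]
sat = [760, 775, 760, 745, 770, 760, 760, 770, 765, 765, 755, 760, 765,
       750, 725, 765, 770, 765, 720, 760, 755, 755, 770, 765, 770]


def solve(a, b):
    """Solve the square system a @ x = b by Gaussian elimination with pivoting."""
    size = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, size):
            factor = m[r][col] / m[col][col]
            for c in range(col, size + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * size
    for r in range(size - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, size))
        x[r] = (m[r][size] - tail) / m[r][r]
    return x


n = len(sat)
k = 2  # number of predictors: GPA and GPAsq
design = [[1.0, g, g ** 2] for g in gpa]  # columns: intercept, GPA, GPAsq

# Normal equations (X'X) b = X'y
xtx = [[sum(row[i] * row[j] for row in design) for j in range(3)] for i in range(3)]
xty = [sum(row[i] * y for row, y in zip(design, sat)) for i in range(3)]
b0, b1, b2 = solve(xtx, xty)

mean_y = sum(sat) / n
fitted = [b0 + b1 * g + b2 * g ** 2 for g in gpa]
ss_res = sum((y - f) ** 2 for y, f in zip(sat, fitted))
ss_tot = sum((y - mean_y) ** 2 for y in sat)

r2 = 1 - ss_res / ss_tot
df_resid = n - k - 1
adj_r2 = 1 - (1 - r2) * (n - 1) / df_resid
f_model = (r2 / k) / ((1 - r2) / df_resid)
# Upper-tail probability of F(2, df_resid); valid only for numerator df = 2.
p_value = (1 + 2 * f_model / df_resid) ** (-df_resid / 2)

print(f"SAT = {b0:.3f} + {b1:.3f} * GPA + {b2:.3f} * GPA^2")
print(f"adjusted R^2 = {adj_r2:.3f}")
print(f"F = {f_model:.2f}, P = {p_value:.3f}")
```

The negative coefficient on the squared term is what bends the fitted curve downward at high GPAs, which is consistent with the ceiling effect described in the problem statement.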