8. Linear equations and the regression line
Suppose a graduate student does a survey of undergraduate study habits on his university campus. He collects data on students who are in different years in college by asking them how many hours of course work they do for each class in a typical week. A sample of four students provides the following data on year in college and hours of course work per class:
Student | Year in College | Course Work Hours per Class |
---|---|---|
1 | Freshman (1) | 7 |
2 | Sophomore (2) | 7 |
3 | Junior (3) | 2 |
4 | Senior (4) | 2 |
A scatter plot of the sample data is shown here (blue circle symbols). The line Y = –X + 6 is shown in orange.
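[Graph: scatter plot of HOURS (vertical axis, 0 to 10) against YEAR (horizontal axis, 0 to 5), with plotting tools labeled "(Mx, My)" and "0 Sum of Distances"]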
Think about how close the line Y = –X + 6 is to the sample points. Look at the graph and find each point’s vertical distance from the line. If the point sits above the line, the distance is positive; if the point sits below the line, the distance is negative.
The sum of the vertical distances between the sample points and the orange line is ____, and the sum of the squared vertical distances between the sample points and the orange line is ____.
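The two sums asked for here can be checked with a short Python sketch. It assumes the four (year, hours) pairs from the table above; the names `years`, `hours`, and `predict` are illustrative and not part of the original exercise.

```python
# Sample data from the table: year in college (X) and hours per class (Y).
years = [1, 2, 3, 4]
hours = [7, 7, 2, 2]


def predict(x, slope=-1.0, intercept=6.0):
    """Height of the line Y = slope * X + intercept at x (default: Y = -X + 6)."""
    return slope * x + intercept


# Signed vertical distance of each point from the line:
# positive when the point is above the line, negative when it is below.
distances = [y - predict(x) for x, y in zip(years, hours)]

sum_distances = sum(distances)
sum_squared = sum(d ** 2 for d in distances)

print("distances:", distances)
print("sum of distances:", sum_distances)
print("sum of squared distances:", sum_squared)
```

Squaring removes the signs, so the second sum cannot be negative even when positive and negative distances partly cancel in the first.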
Place the black point (X symbol) on the graph to plot the point (M_X, M_Y), where M_X is the mean year for the four students (1, 2, 3, and 4) in the sample and M_Y is the mean hours of course work per class for the four students (7, 7, 2, and 2) in the sample.
Then use the green line (triangle symbols) to plot the line that has the same slope as (is parallel to) the line Y = –X + 6, but with the additional property that the vertical distances between the points and the line sum to 0. To plot the line, drag the green line onto the graph. Move the green triangles to adjust the slope.
The line you just plotted ____ through the point (M_X, M_Y).
The sum of the squared vertical distances between the sample points and the line that you just plotted is ____.
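A line with slope –1 has signed distances that sum to 0 exactly when it passes through the point of means, because the distances sum to n times (mean hours minus the line's height at the mean year). The sketch below, again assuming the table's data and using illustrative variable names, picks the intercept that way and checks both sums.

```python
years = [1, 2, 3, 4]
hours = [7, 7, 2, 2]
n = len(years)

# Point of means: mean year and mean hours per class.
mean_x = sum(years) / n
mean_y = sum(hours) / n

slope = -1.0  # same slope as Y = -X + 6, so the two lines are parallel
# Force the line through (mean_x, mean_y); this makes the signed distances sum to 0.
intercept = mean_y - slope * mean_x

residuals = [y - (slope * x + intercept) for x, y in zip(years, hours)]
sum_residuals = sum(residuals)
sum_squared = sum(r ** 2 for r in residuals)

print("point of means:", (mean_x, mean_y))
print("parallel line: Y =", slope, "* X +", intercept)
print("sum of distances:", sum_residuals)
print("sum of squared distances:", sum_squared)
```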
Which of the following describes the plotted line with the smallest total squared error?
Y = –X + 6
The line you plotted that has a sum of the distances equal to 0
Neither—the two lines fit the data equally well
Suppose you fit the regression line to the four sample points on the graph. On the basis of your work so far, being as specific as you can be, you know that the total squared error is ____.
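The regression line is the line that minimizes the total squared error, so its total can be no larger than that of any other line through the same points. As a hedged check, the sketch below fits the least-squares line with the standard formulas (slope = SP / SS_X, intercept = mean of Y minus slope times mean of X) and compares it with the parallel line through the point of means. The helper `total_squared_error` and the variable names are illustrative assumptions.

```python
years = [1, 2, 3, 4]
hours = [7, 7, 2, 2]
n = len(years)

mean_x = sum(years) / n
mean_y = sum(hours) / n

# Sum of products of deviations (SP) and sum of squared deviations of X (SS_X).
sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, hours))
ss_x = sum((x - mean_x) ** 2 for x in years)

# Least-squares slope and intercept.
slope = sp / ss_x
intercept = mean_y - slope * mean_x


def total_squared_error(b, a):
    """Sum of squared vertical distances from the line Y = b * X + a."""
    return sum((y - (b * x + a)) ** 2 for x, y in zip(years, hours))


sse_regression = total_squared_error(slope, intercept)
# Parallel line with slope -1 through the point of means, for comparison.
sse_parallel = total_squared_error(-1.0, mean_y + mean_x)

print("regression line: Y =", slope, "* X +", intercept)
print("total squared error (regression):", sse_regression)
print("total squared error (parallel line):", sse_parallel)
```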
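Sum of distances = 2 + 3 + (–1) + 0 = 4
Sum of squared distances = 4 + 9 + 1 + 0 = 14
Line with the same slope and a sum of distances equal to 0: Ŷ = –X + 7
This line passes through the point of means, (2.5, 4.5)
Sum of squared distances for this line = 1 + 4 + 4 + 1 = 10
Smallest total squared error: the line you plotted that has a sum of the distances equal to 0
Total squared error of the regression line: less than or equal to 10, because the regression line minimizes the total squared error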