Let's focus on the relationship between the average debt in dollars at graduation (AveDebt) and the in-state cost per year after need-based aid (InCostAid).
a) Does a linear relationship between InCostAid and AveDebt seem reasonable? Explain.
b) Are there any unusual cases in this sample? If yes, state which ones they are and how they may be affecting the least-squares model fit.
InCostAid | AveDebt |
10359 | 20708 |
6541 | 17468 |
10433 | 21263 |
9821 | 19530 |
13323 | 25300 |
12103 | 26472 |
11806 | 23562 |
16265 | 32362 |
14699 | 20790 |
14465 | 20504 |
16306 | 9949 |
10854 | 28508 |
15466 | 24624 |
14389 | 25821 |
12271 | 24111 |
12778 | 17893 |
11421 | 17617 |
4735 | 23964 |
16461 | 28999 |
10669 | 22541 |
15089 | 23729 |
13251 | 23726 |
14758 | 25729 |
14466 | 26946 |
17093 | 33944 |
a) Our first job is to draw a scatterplot of the data.
As the scatterplot shows, there might be a weak linear relationship between the two observed variables.
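To get a better idea, we calculate the correlation coefficient for the two variables, using the formula:

$$r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \, \sum_{i=1}^{n} (Y_i - \bar{Y})^2}}$$

where $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$, $\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$; n = total number of observations.
Here X = InCostAid and Y = AveDebt.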
We find r = 0.306,
which is too low to indicate a clear linear relationship between the two variables.
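The correlation calculation described above can be checked with a short Python sketch. It uses only the standard library; the names `DATA` and `pearson_r` are chosen here for illustration and are not part of the original solution.

```python
from math import sqrt

# (InCostAid, AveDebt) pairs copied from the table in the question, in row order.
DATA = [
    (10359, 20708), (6541, 17468), (10433, 21263), (9821, 19530), (13323, 25300),
    (12103, 26472), (11806, 23562), (16265, 32362), (14699, 20790), (14465, 20504),
    (16306, 9949), (10854, 28508), (15466, 24624), (14389, 25821), (12271, 24111),
    (12778, 17893), (11421, 17617), (4735, 23964), (16461, 28999), (10669, 22541),
    (15089, 23729), (13251, 23726), (14758, 25729), (14466, 26946), (17093, 33944),
]


def pearson_r(pairs):
    """Pearson correlation coefficient for a list of (x, y) pairs."""
    n = len(pairs)
    x_bar = sum(x for x, _ in pairs) / n
    y_bar = sum(y for _, y in pairs) / n
    # Sums of cross-products and squared deviations about the means.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
    s_xx = sum((x - x_bar) ** 2 for x, _ in pairs)
    s_yy = sum((y - y_bar) ** 2 for _, y in pairs)
    return s_xy / sqrt(s_xx * s_yy)


r_all = pearson_r(DATA)
print(f"n = {len(DATA)}, r = {r_all:.3f}, r^2 = {r_all ** 2:.3f}")
```

With r = 0.306, r² = 0.306² ≈ 0.094, so a straight line fitted to the full sample would account for only about 9% of the variation in AveDebt.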
b) From the scatterplot, we observe that rows 2, 11 and 18 of the table, namely (6541, 17468), (16306, 9949) and (4735, 23964), lie somewhat outside the main cluster (this can be seen more easily with a boxplot). Looking at the values themselves, rows 2 and 18 have by far the lowest InCostAid in the sample, while row 11 pairs one of the highest InCostAid values with the lowest AveDebt.
These cases increase the SSE (sum of squared errors), and so decrease the accuracy of the linear model.
If we recalculate r excluding these unusual cases, we get r = 0.571, which suggests a moderate linear relationship between the two variables.
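A minimal sketch of this comparison is given below: it fits the least-squares line with and without the flagged cases and reports r and the SSE for each. It assumes the excluded cases are exactly table rows 2, 11 and 18; the answer does not state the exclusion set precisely, so the trimmed r this prints may differ from the figure quoted above. The names `FLAGGED_ROWS` and `fit_line` are illustrative only.

```python
from math import sqrt

# (InCostAid, AveDebt) pairs copied from the table in the question, in row order.
DATA = [
    (10359, 20708), (6541, 17468), (10433, 21263), (9821, 19530), (13323, 25300),
    (12103, 26472), (11806, 23562), (16265, 32362), (14699, 20790), (14465, 20504),
    (16306, 9949), (10854, 28508), (15466, 24624), (14389, 25821), (12271, 24111),
    (12778, 17893), (11421, 17617), (4735, 23964), (16461, 28999), (10669, 22541),
    (15089, 23729), (13251, 23726), (14758, 25729), (14466, 26946), (17093, 33944),
]

# Assumption: the unusual cases are table rows 2, 11 and 18 (1-based).
FLAGGED_ROWS = {2, 11, 18}


def fit_line(pairs):
    """Least-squares fit of y = intercept + slope * x.

    Returns (slope, intercept, r, sse).
    """
    n = len(pairs)
    x_bar = sum(x for x, _ in pairs) / n
    y_bar = sum(y for _, y in pairs) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
    s_xx = sum((x - x_bar) ** 2 for x, _ in pairs)
    s_yy = sum((y - y_bar) ** 2 for _, y in pairs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    # SSE: sum of squared vertical distances from the fitted line.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in pairs)
    r = s_xy / sqrt(s_xx * s_yy)
    return slope, intercept, r, sse


# Keep every row whose 1-based table position is not flagged.
kept = [pair for row, pair in enumerate(DATA, start=1) if row not in FLAGGED_ROWS]

full_fit = fit_line(DATA)
trimmed_fit = fit_line(kept)

for label, pairs, (slope, intercept, r, sse) in [
    ("all rows", DATA, full_fit),
    ("flagged rows removed", kept, trimmed_fit),
]:
    print(f"{label}: n = {len(pairs)}, slope = {slope:.3f}, "
          f"intercept = {intercept:.1f}, r = {r:.3f}, SSE = {sse:.3e}")
```

Comparing the two printed lines shows how much of the SSE is attributable to the flagged cases and how the slope and r change once they are set aside.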