In: Statistics and Probability
On this worksheet, make an XY scatter plot linked to the following data:1.01,2.8482, 1.48, 4.2772, 1.8, 4.788, 1.81, 5.3757, 1.07, 2.5252, 1.53, 3.0906, 1.46, 4.3362, 1.38, 3.2016, 1.77, 4.3542, 1.88, 4.8692, 1.32, 3.8676, 1.75, 3.9375, 1.94, 5.7424, 1.19, 2.4752, 1.31, 26.2, 1.56, 4.5708, 1.16, 2.842, 1.22, 2.44, 1.72, 5.1256, 1.45, 4.3355, 1.43, 4.2471, 1.19, 3.5343, 2, 5.46, 1.6, 3.84, 1.58, 3.8552 Add trendline, regression equation and r squared to the plot.Add this title. ("Scatterplot of X and Y Data"). The scatterplot reveals a point outside the point pattern. Copy the data to a new location in the worksheet. You now have 2 sets of data. Data that are more tha 1.5 IQR below Q1 or more than 1.5 IQR above Q3 are considered outliers and must be investigated. It was determined that the outlying point resulted from data entry error. Remove the outlier in the copy of the data. Make a new scatterplot linked to the cleaned data without the outlier, and add title ("Scatterplot without Outlier,") trendline, and regression equation label. Compare the regression equations of the two plots. How did removal of the outlier affect the slope and R2?
First of all select the whole data and then go to 'Insert' tab. Then look for 'Charts' section and from the dropdown select scatter plot as shown -
This would give you the scatter plot.
Then click on the plot and then find the '+' sign on the right side to add extra features to your graph. Mark the 'Axis Titles' and then look for 'Trendline'. You will find 'More Options' in trendline as shown -
When you click on this, a panel will open up on the right hand side titled 'Format Trendline'. Look at the bottom to find 'Display Equation on Chart' and ' Display R-squared value on chart' option. Mark both the options as shown -
Then edit the axis titles as 'X' and 'Y', add a title to the scatter plot by simply clicking on the text part of those parts. Finally you should get the following plot -
_________________________________
We can clearly see that there is a point lying far away from the cluster of points. That point is (1.31, 26.2).
So, we remove the data to get new data as -
X | Y |
1.01 | 2.8482 |
1.48 | 4.2772 |
1.8 | 4.788 |
1.81 | 5.3757 |
1.07 | 2.5252 |
1.53 | 3.0906 |
1.46 | 4.3362 |
1.38 | 3.2016 |
1.77 | 4.3542 |
1.88 | 4.8692 |
1.32 | 3.8676 |
1.75 | 3.9375 |
1.94 | 5.7424 |
1.19 | 2.4752 |
1.56 | 4.5708 |
1.16 | 2.842 |
1.22 | 2.44 |
1.72 | 5.1256 |
1.45 | 4.3355 |
1.43 | 4.2471 |
1.19 | 3.5343 |
2 | 5.46 |
1.6 | 3.84 |
1.58 | 3.8552 |
Then follow the same steps as earlier to get the following scatter plot -
We note that the equation of regression line was: Y = 0.6346(X) + 3.9309
And the new equation of regression line without outlier is: Y = 2.9907(X) - 0.526.
We see that the slope coefficient has changed drastically after removing the outlier. It has become approximately 5 times larger than the slope of data with outlier.
Also, the R-square value before was 0.0015 while the new R-square value is 0.7536 which is a very very huge change. As the R-square value tells us about the amount of variation included in the model as explained by the set of independent variables, so the model with outlier suggests that we were only able to include 0.15% of variation in Y in that model while if we remove the outlier, we can say that the new model includes about 75.36% of variation in Y as explained by X.
Thus, a single outlier can affect the whole model drastically.
_________________________________________________________