A 10-year study was conducted by the American Heart Association to analyze how age, blood pressure, and smoking are related to the risk of strokes. Risk is interpreted as the probability (times 100) that the patient will have a stroke over the next 10-year period. For the smoking variable, define a dummy variable with 1 indicating a smoker and 0 indicating a nonsmoker. A sample data set of 18 people was used to develop an estimated regression equation to predict risk (y) as a dependent variable using age (x1), blood pressure (x2), and smoking (x3) as independent variables.
The following results are part of the Regression Analysis for Risk, age, blood pressure, and smoking from Minitab. Below is a table with statistics necessary for analyzing the residuals.
Table 1: Statistics Necessary for Performing a Residual Analysis
| Obs Number | Standardized residual | Studentized deleted residual | Leverage value | Cook’s Distance |
|---|---|---|---|---|
| 1 | 0.6429087 | 0.6306934 | 0.1619209 | 0.0199644 |
| 2 | 0.531077 | 0.5188061 | 0.1093039 | 0.0086529 |
| 3 | 0.5370127 | 0.5247105 | 0.1450981 | 0.0122364 |
| 4 | 0.3258096 | 0.3165155 | 0.2983087 | 0.011282 |
| 5 | 1.5835402 | 1.6696679 | 0.1700605 | 0.1284562 |
| 6 | 0.8404364 | 0.8323283 | 0.77876441 | 0.9953109 |
| 7 | -0.09176 | -0.08887 | 0.353075 | 0.0011488 |
| 8 | 1.1437206 | 1.1556506 | 0.2076484 | 0.0857019 |
| 9 | -0.488835 | -0.476887 | 0.132997 | 0.009164 |
| 10 | -0.882775 | -0.876351 | 0.5218152 | 0.0619345 |
| 11 | -0.132161 | -0.128035 | 0.1246959 | 0.0006221 |
| 12 | -0.094143 | -0.091179 | 0.1124736 | 0.0002808 |
| 13 | -1.871716 | -2.050635 | 0.1712912 | 0.181031 |
| 14 | -0.467756 | -0.456031 | 0.2606083 | 0.0192793 |
| 15 | -0.828827 | -0.820311 | 0.1754352 | 0.0365392 |
| 16 | 2.0125329 | 2.2548097 | 0.7732424 | 0.0997011 |
| 17 | -1.937311 | -2.194042 | 0.1093148 | 0.1151579 |
| 18 | -0.675826 | -0.663911 | 0.1431895 | 0.0190825 |
a) Can any of the observations be classified as an outlier? Please use the Studentized Deleted Residual method to answer this question. Use α = 0.05.
b) Can any of the observations be classified as an influential observation? Please use both the Leverage value and Cook’s Distance methods for this question.
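a) As a rule of thumb, an observation whose studentized deleted residual exceeds 2 in absolute value is usually classified as an outlier. By that rule, observations 13 (-2.051), 16 (2.255), and 17 (-2.194) are outliers. The cutoff is somewhat subjective, and in some domains a cutoff of 3 is used instead. The formal test at α = 0.05 compares each studentized deleted residual with the t distribution on n - p - 2 = 18 - 3 - 2 = 13 degrees of freedom, where p = 3 is the number of independent variables. The two-tailed critical value is t(0.025, 13) = 2.160, which is why a cutoff of about 2 is reasonable. Under the exact cutoff, only observations 16 and 17 are classified as outliers; observation 13 falls just short.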
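The outlier check in part a) can be reproduced with a short script. This is a minimal sketch, assuming the studentized deleted residuals are typed in by hand from Table 1 and that the critical value 2.160 is read from a standard t table (0.025 upper tail, 13 degrees of freedom) rather than computed. The names `sdr`, `T_CRIT`, and `flag_outliers` are illustrative and are not part of the Minitab output.

```python
# Studentized deleted residuals copied from Table 1 (observation -> value).
sdr = {
    1: 0.6306934, 2: 0.5188061, 3: 0.5247105, 4: 0.3165155,
    5: 1.6696679, 6: 0.8323283, 7: -0.08887, 8: 1.1556506,
    9: -0.476887, 10: -0.876351, 11: -0.128035, 12: -0.091179,
    13: -2.050635, 14: -0.456031, 15: -0.820311, 16: 2.2548097,
    17: -2.194042, 18: -0.663911,
}

n = 18          # sample size
p = 3           # independent variables: age, blood pressure, smoking
df = n - p - 2  # degrees of freedom for the studentized deleted residual

# Assumption: value taken from a t table (alpha/2 = 0.025, 13 df), not computed here.
T_CRIT = 2.160


def flag_outliers(residuals, cutoff):
    """Return the observation numbers whose |residual| exceeds the cutoff."""
    return sorted(obs for obs, r in residuals.items() if abs(r) > cutoff)


rule_of_thumb = flag_outliers(sdr, 2.0)  # informal cutoff of 2
formal = flag_outliers(sdr, T_CRIT)      # t-based cutoff at alpha = 0.05

print("df:", df)
print("|t| > 2.000:", rule_of_thumb)
print("|t| > 2.160:", formal)
```

The two lists differ only in observation 13, whose value of -2.050635 lies between the two cutoffs.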
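b) Leverage: using a cutoff of 0.5, observations 6 (0.779), 10 (0.522), and 16 (0.773) have high leverage and are potentially influential. A common alternative cutoff is 3(p + 1)/n = 3(4)/18 ≈ 0.667, which flags only observations 6 and 16. Cook’s distance: a value greater than 1 is usually considered influential. The largest value is 0.995, for observation 6, which is very close to 1 but still below it. Strictly, then, no observation is influential according to Cook’s distance, although observation 6 deserves a closer look because it is flagged by leverage and is borderline on Cook’s distance.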
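The influence check in part b) can be sketched the same way. The values below are copied by hand from Table 1. The 0.5 leverage cutoff and the Cook’s distance cutoff of 1 come from the answer above; the 3(p + 1)/n cutoff is a commonly used alternative included here as an assumption for comparison. The names `leverage`, `cooks`, and `flag_above` are illustrative.

```python
# Leverage values and Cook's distances copied from Table 1 (observation -> value).
leverage = {
    1: 0.1619209, 2: 0.1093039, 3: 0.1450981, 4: 0.2983087,
    5: 0.1700605, 6: 0.77876441, 7: 0.353075, 8: 0.2076484,
    9: 0.132997, 10: 0.5218152, 11: 0.1246959, 12: 0.1124736,
    13: 0.1712912, 14: 0.2606083, 15: 0.1754352, 16: 0.7732424,
    17: 0.1093148, 18: 0.1431895,
}
cooks = {
    1: 0.0199644, 2: 0.0086529, 3: 0.0122364, 4: 0.011282,
    5: 0.1284562, 6: 0.9953109, 7: 0.0011488, 8: 0.0857019,
    9: 0.009164, 10: 0.0619345, 11: 0.0006221, 12: 0.0002808,
    13: 0.181031, 14: 0.0192793, 15: 0.0365392, 16: 0.0997011,
    17: 0.1151579, 18: 0.0190825,
}

n, p = 18, 3  # sample size and number of independent variables


def flag_above(values, cutoff):
    """Return the observation numbers whose value exceeds the cutoff."""
    return sorted(obs for obs, v in values.items() if v > cutoff)


# Cutoff used in the answer above.
high_leverage_fixed = flag_above(leverage, 0.5)

# Assumed alternative cutoff: 3(p + 1)/n.
alt_cutoff = 3 * (p + 1) / n
high_leverage_alt = flag_above(leverage, alt_cutoff)

# Cook's distance: values above 1 are treated as influential.
influential_cook = flag_above(cooks, 1.0)
largest_cook_obs = max(cooks, key=cooks.get)

print("leverage > 0.5:", high_leverage_fixed)
print(f"leverage > {alt_cutoff:.3f}:", high_leverage_alt)
print("Cook's D > 1:", influential_cook)
print("largest Cook's D:", largest_cook_obs, cooks[largest_cook_obs])
```

Observation 6 is the one to watch: it exceeds both leverage cutoffs and has the largest Cook’s distance, even though that distance stays below 1.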