Different people use different criteria for determining outliers. The late statistician John Tukey suggested the 1.5*IQR rule as one criterion (the singular form of criteria), and it has been widely accepted ever since. The rule says that an observed value is considered an outlier if it is either smaller than Q1-1.5*IQR or larger than Q3+1.5*IQR. People asked him why “1.5” was used, and Tukey simply answered, “Because 1 is too small and 2 is too large.” Suppose that the numerical variable of interest has a Normal distribution. Make use of parts (a) to (c) to answer the question in part (d).
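(a) What proportion of values would be considered outliers under the 1*IQR rule (i.e. either smaller than Q1-1*IQR or larger than Q3+1*IQR)? [4]
(b) What proportion of values would be considered outliers under the 1.5*IQR rule? [2]
(c) What proportion of values would be considered outliers under the 2*IQR rule? [2]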
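(a) The proportions do not depend on the mean or standard deviation, so we can work with z-scores. For Q1 (the 25th percentile), z = -0.6745 and for Q3 (the 75th percentile), z = 0.6745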
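The quartile z-scores quoted here can be reproduced from the inverse CDF of the standard Normal distribution. The following is a minimal sketch using Python's standard library; the variable names are illustrative and not part of the original solution.

```python
from statistics import NormalDist

# Standard Normal distribution (mean 0, standard deviation 1).
std_normal = NormalDist(mu=0.0, sigma=1.0)

# Q1 and Q3 are the 25th and 75th percentiles.
z_q1 = std_normal.inv_cdf(0.25)
z_q3 = std_normal.inv_cdf(0.75)

# IQR expressed in z-score units.
iqr = z_q3 - z_q1

print(round(z_q1, 4), round(z_q3, 4), round(iqr, 4))
# -0.6745 0.6745 1.349
```

In z-score units the IQR is therefore about 1.349, which is the quantity Q3 – Q1 used in the fence calculations.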
Q1 – 1 * IQR = Q1 – (Q3 – Q1) = 2Q1 – Q3 = 2(-0.6745) – 0.6745 = -2.0235
Q3 + 1 * IQR = Q3 + (Q3 – Q1) = 2Q3 – Q1 = 2(0.6745) – (-0.6745) = 2.0235
Area to the left of z = -2.0235 is 0.0215 and, by symmetry, the area to the right of z = 2.0235 is also 0.0215
So, total proportion that would be considered outliers = 2 * 0.0215 = 0.043 (4.3%)
(b) Q1 – 1.5 * IQR = Q1 – 1.5(Q3 – Q1) = 2.5Q1 – 1.5Q3 = 2.5(-0.6745) – 1.5(0.6745) = -2.698
Q3 + 1.5 * IQR = Q3 + 1.5(Q3 – Q1) = 2.5Q3 – 1.5Q1 = 2.5(0.6745) – 1.5(-0.6745) = 2.698
Area to the left of z = -2.698 is 0.0035 and, by symmetry, the area to the right of z = 2.698 is also 0.0035
So, total proportion that would be considered outliers = 2 * 0.0035 = 0.007 (0.7%)
(c) Q1 – 2 * IQR = Q1 – 2(Q3 – Q1) = 3Q1 – 2Q3 = 3(-0.6745) – 2(0.6745) = -3.3725
Q3 + 2 * IQR = Q3 + 2(Q3 – Q1) = 3Q3 – 2Q1 = 3(0.6745) – 2(-0.6745) = 3.3725
Area to the left of z = -3.3725 is about 0.0004 and, by symmetry, the area to the right of z = 3.3725 is also about 0.0004
So, total proportion that would be considered outliers = 2 * 0.0004 = 0.0008 (about 0.08%)
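The three calculations in parts (a) to (c) follow the same pattern, so they can be checked with one small function. This is a sketch under the problem's assumption of a Normal distribution; the function name outlier_proportion and the multiplier argument k are hypothetical names chosen for illustration.

```python
from statistics import NormalDist


def outlier_proportion(k: float) -> float:
    """Proportion of a Normal population outside [Q1 - k*IQR, Q3 + k*IQR]."""
    z = NormalDist()  # standard Normal; the result is the same for any mean and sd
    q1, q3 = z.inv_cdf(0.25), z.inv_cdf(0.75)
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    # Left tail below the lower fence plus right tail above the upper fence.
    return z.cdf(lower_fence) + (1.0 - z.cdf(upper_fence))


for k in (1.0, 1.5, 2.0):
    print(f"k = {k}: {outlier_proportion(k):.4%}")
```

The printed values should come out close to 4.3%, 0.70% and 0.07%. The last figure is slightly below the table-based 0.08% because the hand calculation rounds each tail area to 0.0004 before doubling.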
(d) As the calculations above show, the 1*IQR rule would flag a sizeable share of perfectly ordinary Normal data as outliers (4.3%), whereas the 2*IQR rule would flag hardly any (0.08%). That is probably what Tukey meant when he said “1 is too small and 2 is too large”. The 1.5*IQR rule, which flags about 0.7%, is a reasonable compromise between the two.
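The comparison in part (d) can also be checked empirically by applying each rule to simulated Normal data. The sketch below is only an illustration: the sample size, the seed and the helper name flagged_fraction are assumptions, and the quartiles are estimated from the sample, so the fractions will differ slightly from the theoretical values.

```python
import random
from statistics import quantiles


def flagged_fraction(sample, k):
    """Fraction of sample values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = quantiles(sample, n=4)  # sample quartiles
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    flagged = sum(1 for x in sample if x < lower_fence or x > upper_fence)
    return flagged / len(sample)


rng = random.Random(0)  # fixed seed so the run is repeatable
sample = [rng.gauss(0.0, 1.0) for _ in range(200_000)]

for k in (1.0, 1.5, 2.0):
    print(f"k = {k}: {flagged_fraction(sample, k):.3%} flagged")
```

With a sample this large, the flagged fractions should land near 4.3%, 0.7% and just under 0.1%, matching the ordering used in the argument above.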