What is an outlier? How would you scan for outliers in your dataset? What would you do with data points that are considered outliers? [6 Points]
An outlier is a data value that differs significantly from the other data points in the set. It may arise from natural variability in the measurements or from an experimental error.
For example, in the data set of 6 points 2, 5, 8, 3, 0, 378, one can easily say (without calculating) that 378 is an outlier, as it differs significantly from the other 5 points.
To scan for outliers in the data set, one must find the fences for the data set. To do that:
First, sort the data in ascending order and find the median of the complete data set, as it divides the data into two equal halves. The Lower (First) Quartile is the median of the lower half of the data set, and the Upper (Third) Quartile is the median of the upper half. When the number of data points is odd, one common convention is to leave the overall median out of both halves.
Then, find the Interquartile Range (IQR) of the data set by subtracting the Lower Quartile from the Upper Quartile, that is, IQR = Upper Quartile – Lower Quartile
Then, to find the fences, multiply the IQR by 1.5. Let p = 1.5 x IQR. Adding p to the Upper Quartile gives the upper fence, and subtracting p from the Lower Quartile gives the lower fence
That is,
Upper Fence = Upper Quartile + p
Lower Fence = Lower Quartile – p
If all the values of the data set lie inside the interval [Lower Fence, Upper Fence], there are no outliers in the data set. Any data point that lies outside this interval is an outlier.
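The steps above can be sketched in Python as follows. This is a minimal illustration rather than a definitive implementation: the function names (quartiles, iqr_fences, find_outliers) are made up for this example, the quartiles use the median-of-halves convention (for an odd number of points the overall median is left out of both halves), and the data set needs at least two values.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumes at least two values; for an odd count the overall median
    is left out of both halves (one convention among several).
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + (n % 2):]  # skip the middle element when n is odd
    return median(lower), median(upper)


def iqr_fences(values, k=1.5):
    """Return (lower fence, upper fence) = (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, q3 = quartiles(values)
    p = k * (q3 - q1)
    return q1 - p, q3 + p


def find_outliers(values, k=1.5):
    """Return the values that lie outside [lower fence, upper fence]."""
    low, high = iqr_fences(values, k)
    return [v for v in values if v < low or v > high]


data = [2, 5, 8, 3, 0, 378]
print(quartiles(data))      # (2, 8)
print(iqr_fences(data))     # (-7.0, 17.0)
print(find_outliers(data))  # [378]
```

For the example data set the sorted values are 0, 2, 3, 5, 8, 378, so the Lower Quartile is 2, the Upper Quartile is 8, IQR = 6 and p = 9. The fences are therefore -7 and 17, and 378 is the only value outside them.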
If there are outliers, one can:
· Winsorize the data set, that is, replace each outlier with the nearest value that is not an outlier
· Replace the outliers with the mean or median of the data set, whichever better suits the data
· Remove the outliers completely and work with the remaining values, which is appropriate when the outliers are errors rather than genuine observations
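The three options can be compared on the example data set with a short Python sketch. The function names are hypothetical, and the fences of -7 and 17 are the ones that the 1.5 x IQR rule gives for 2, 5, 8, 3, 0, 378. The replacement option here uses the median of the full data set, which is one reasonable choice and not the only one.

```python
from statistics import median


def replace_with_nearest(values, low, high):
    """Replace each outlier with the nearest value inside the fences."""
    inside = [v for v in values if low <= v <= high]
    lo_val, hi_val = min(inside), max(inside)
    return [min(max(v, lo_val), hi_val) for v in values]


def replace_with_median(values, low, high):
    """Replace each outlier with the median of the full data set."""
    m = median(values)
    return [v if low <= v <= high else m for v in values]


def remove_outliers(values, low, high):
    """Drop every value that lies outside the fences."""
    return [v for v in values if low <= v <= high]


data = [2, 5, 8, 3, 0, 378]
low, high = -7, 17  # fences for this data set from the 1.5 x IQR rule

print(replace_with_nearest(data, low, high))  # [2, 5, 8, 3, 0, 8]
print(replace_with_median(data, low, high))   # [2, 5, 8, 3, 0, 4.0]
print(remove_outliers(data, low, high))       # [2, 5, 8, 3, 0]
```

Which option is best depends on why the outlier occurred: removing or replacing a value that is a genuine observation discards real information.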