2. Many statistical tests for outliers were developed in an environment in which a few hundred observations was a large data set. With this question, we discuss the limitations of such approaches.
(a) For a set of 1,000,000 values, how many outliers would we have according to the test that says a value is an outlier if it is more than three standard deviations from the average? (Assume a normal distribution.)
(b) Does the approach that states an outlier is an object of unusually low probability need to be adjusted when dealing with large data sets? If so, how?
2) a) Choosing an N(0, 1) distribution for simplicity, $\alpha = P(|x| \ge c)$ gives $(c, \alpha) = (3, 0.0027)$. That is, each value has a probability of 0.0027 of falling outside the symmetric interval $(-c, c)$ and being flagged as an outlier. For 1,000,000 values, this corresponds to about 2,700 expected outliers.
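The arithmetic in (a) can be checked with a short Python sketch. It assumes a standard normal distribution, as the answer does, and uses the identity P(|x| >= c) = erfc(c / sqrt(2)); the function names `two_sided_tail` and `expected_outliers` are made up for this illustration.

```python
import math


def two_sided_tail(c: float) -> float:
    """Return P(|x| >= c) for x ~ N(0, 1), using the complementary error function."""
    return math.erfc(c / math.sqrt(2.0))


def expected_outliers(n: int, c: float) -> float:
    """Expected number of values, out of n, lying more than c standard deviations from the mean."""
    return n * two_sided_tail(c)


alpha = two_sided_tail(3.0)
count = expected_outliers(1_000_000, 3.0)
print(f"alpha = {alpha:.4f}, expected outliers = {count:.0f}")
```

This is an expected count, so an actual sample of 1,000,000 values would give a number close to, but not exactly, this figure.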
b) Yes. Increasing the threshold to 4 standard deviations gives $\alpha = 6.3 \times 10^{-5}$, corresponding to only about 63 outliers. In experimental particle physics, one typically sets a threshold of 5 standard deviations, with $\alpha = 5.7 \times 10^{-7}$, which should be halved when testing for the significance of an event/outlier (one-sided).

When dealing with a large data set, the analysis can also take a different approach to classifying outliers. An outlier is defined by its distance from the center of the data set in comparison to the other objects in the data set. Depending on the precision of the data set, an object is treated as an outlier when its distance from the center of the data set is greater than a defined radial distance.
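The threshold figures quoted in (b) can be reproduced with a small Python sketch. It again assumes an N(0, 1) distribution and 1,000,000 values; `threshold_table` and `N_VALUES` are illustrative names, not part of any library.

```python
import math

N_VALUES = 1_000_000  # data set size from the question


def two_sided_tail(c: float) -> float:
    """Return P(|x| >= c) for x ~ N(0, 1)."""
    return math.erfc(c / math.sqrt(2.0))


def threshold_table(thresholds, n=N_VALUES):
    """For each threshold c, return (c, two-sided alpha, one-sided alpha, expected outliers)."""
    rows = []
    for c in thresholds:
        alpha = two_sided_tail(c)
        # The one-sided probability is half the two-sided one, by symmetry.
        rows.append((c, alpha, alpha / 2.0, n * alpha))
    return rows


for c, two_sided, one_sided, count in threshold_table([3, 4, 5]):
    print(f"c = {c}: two-sided = {two_sided:.2e}, "
          f"one-sided = {one_sided:.2e}, expected outliers = {count:.1f}")
```

At 5 standard deviations the expected count for 1,000,000 values drops below one, which is why a stricter threshold suits large data sets better than the 3-sigma rule.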