You are given a data set containing the height, weight, age, and blood pressure of a representative sample of people from a major metropolitan area. Comment on the suitability of using a statistically-based versus a cluster-based outlier detection scheme to identify people with anomalous characteristics for this data set.
Several methods are available for finding outliers, including graphical, statistically based, and cluster-based approaches.
A statistically based approach assumes a parametric model that describes the distribution of the data, for example a normal distribution. Applying a statistical test depends on several things: the data distribution, the parameters of that distribution, and the expected number of outliers. The statistically based approach is a likelihood approach: points that are very unlikely under the assumed model are treated as outliers. The statistical method also has limitations.
First, most statistical tests are designed for a single attribute. Second, in many cases the data distribution is not known, and then this approach is hard to apply. Third, for high-dimensional data it can be difficult to estimate the true distribution.
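As a rough illustration of the single-attribute statistical approach, here is a minimal sketch that assumes the attribute is approximately normal and flags values more than a chosen number of standard deviations from the mean. The function name `zscore_outliers`, the threshold of 3, and the blood pressure readings are all made up for illustration and are not part of the question's data.

```python
from statistics import mean, stdev

def zscore_outliers(values, threshold=3.0):
    """Return indices of values whose z-score exceeds the threshold.

    Assumes the values are roughly normally distributed (an assumption,
    not something the data set guarantees).
    """
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        return []  # all values identical, nothing can be flagged
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]

# Hypothetical systolic blood pressure readings (mmHg), invented for this sketch.
systolic = [118, 122, 125, 119, 121, 130, 117, 124, 120, 123, 126, 210]

flagged = zscore_outliers(systolic, threshold=3.0)
print(flagged)  # index of the 210 mmHg reading
```

Note that this looks at one attribute at a time. A person whose height, weight, age, and blood pressure are each unremarkable alone but unusual in combination would not be flagged this way.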
Cluster-based analysis works as follows: cluster the data into groups of different density, choose the points in small clusters as candidate outliers, and then compute the distance between each candidate point and the non-candidate clusters. If a candidate point is far from all the non-candidate points, it is an outlier.
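The steps above can be sketched roughly as follows. This is only one possible reading of the procedure: the answer does not name a clustering algorithm, so a simple single-linkage grouping (points closer than `eps` share a cluster) is swapped in here because it is deterministic and short. The parameters `eps`, `min_size`, and `far`, and the sample records, are hypothetical. The attributes are standardized first because height, weight, age, and blood pressure are on different scales.

```python
from math import dist
from statistics import mean, stdev

def standardize(rows):
    """Rescale every column to zero mean and unit sample standard deviation."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [stdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [tuple((v - m) / s for v, m, s in zip(r, mus, sds)) for r in rows]

def group_by_proximity(points, eps):
    """Single-linkage grouping: points closer than eps end up in one cluster."""
    labels = [None] * len(points)
    current = 0
    for start in range(len(points)):
        if labels[start] is not None:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(len(points)):
                if labels[j] is None and dist(points[i], points[j]) < eps:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

def cluster_outliers(rows, eps=2.0, min_size=3, far=3.0):
    """Flag points in small clusters that lie far from every large cluster."""
    pts = standardize(rows)
    clusters = {}
    for i, label in enumerate(group_by_proximity(pts, eps)):
        clusters.setdefault(label, []).append(i)
    large = [m for m in clusters.values() if len(m) >= min_size]
    centroids = [
        tuple(mean(pts[i][d] for i in m) for d in range(len(pts[0])))
        for m in large
    ]
    outliers = []
    for members in clusters.values():
        if len(members) >= min_size:
            continue  # large clusters are not candidates
        for i in members:
            # Candidate is an outlier only if it is far from all large clusters.
            # (If there are no large clusters, every candidate is flagged.)
            if all(dist(pts[i], c) > far for c in centroids):
                outliers.append(i)
    return sorted(outliers)

# Hypothetical records: (height cm, weight kg, age years, systolic mmHg).
people = [
    (170, 70, 35, 120), (172, 72, 38, 122), (168, 68, 33, 118),
    (175, 75, 40, 125), (165, 65, 30, 119), (171, 71, 36, 121),
    (169, 69, 34, 123), (173, 73, 37, 124),
    (171, 150, 36, 200),  # unusual weight and blood pressure together
]

anomalous = cluster_outliers(people)
print(anomalous)  # index of the last record
```

Unlike the single-attribute test, this uses all four attributes at once and does not require the joint distribution to be known, although the result does depend on the chosen thresholds.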
Here we have the height, weight, age, and blood pressure of a sample from a metropolitan area, so there are several types of variables and the data are multivariate. Because the sample is drawn from a major metropolitan area, it is also likely to be a large and varied data set whose joint distribution is not known in advance. Based on these findings, I think the more suitable method for outlier detection here is the cluster-based outlier detection method.