A patient dataset from a hospital is used to identify whether a patient has heart disease or not. The dataset contains noisy data and some outliers. Choose suitable data preprocessing tasks for this dataset, and explain how the outliers or noisy data can be removed from it.
Outliers can be detected using the following methods:
Extreme Value Analysis:
This is the most basic form of outlier detection. The key to this method is to understand the underlying distribution of the variable and find the values at its extreme ends.
For a Gaussian distribution, the outliers lie outside the mean plus or minus 3 times the standard deviation of the variable.
If the variable is not normally distributed (not a Gaussian distribution), a good approach is to calculate the quartiles and then the inter-quartile range (IQR = Q3 - Q1). Values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are then commonly treated as outliers.
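The two rules above can be sketched in Python using only the standard library. This is a minimal illustration, not a definitive implementation: the function names, the `resting_bp` readings, and the 1.5 multiplier (the conventional Tukey fence) are assumptions made for the example.

```python
import statistics

def gaussian_bounds(values, k=3.0):
    """Return (low, high) as mean -/+ k standard deviations."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return mean - k * sd, mean + k * sd

def iqr_bounds(values, k=1.5):
    """Return (low, high) as Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical resting blood pressure readings with one suspicious value.
resting_bp = [118, 120, 122, 125, 128, 130, 132, 135, 138, 140, 300]

low, high = iqr_bounds(resting_bp)
iqr_outliers = [v for v in resting_bp if v < low or v > high]
print(low, high, iqr_outliers)  # 98.0 162.0 [300]
```

The IQR rule is used on the sample data here because quartiles are barely affected by the extreme value itself, whereas the mean and standard deviation are pulled towards it.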
Standard Score (Z Score):
A Z-score (or standard score) represents how many standard deviations a given measurement deviates from the mean. It merely rescales, or standardizes, the data. A Z-score specifies the precise location of each observation within a distribution. The sign of the Z-score shows whether the score is above (+) or below (-) the mean.
The intuition behind the Z-score method of outlier detection is that, once we’ve centered and rescaled the data, anything that is too far from zero (the threshold is usually a Z-score of 3 or -3) should be considered an outlier.
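As a rough sketch of the Z-score method, the snippet below standardizes a column and flags values whose absolute Z-score exceeds 3. The `cholesterol` values and the function name are made up for illustration.

```python
import statistics

def z_scores(values):
    """Standardize values: subtract the mean, divide by the standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [(v - mean) / sd for v in values]

# Hypothetical cholesterol readings: 19 ordinary values and one extreme value.
cholesterol = [190, 200, 210] * 6 + [200, 600]

scores = z_scores(cholesterol)
z_outliers = [v for v, z in zip(cholesterol, scores) if abs(z) > 3]
print(z_outliers)  # [600]
```

One caveat worth keeping in mind: in very small samples a single extreme value inflates the standard deviation so much that its own Z-score may never reach 3, so the method works best when the column has a reasonable number of observations.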
Clustering Method:
Clustering is a popular technique for grouping similar data points or objects into clusters. It can also be used as an important tool for outlier analysis. In this approach, similar objects are grouped together, and the outliers are the points that end up outside any cluster or in very small clusters of their own.
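One simple way to illustrate this idea is shown below. It is a sketch only, and a deliberately basic stand-in for a real clustering library: points closer than `eps` are linked into the same cluster (single-linkage clustering cut at distance `eps`), and members of clusters smaller than `min_size` are reported as outliers. The function name, the (age, maximum heart rate) points, and both parameter values are assumptions chosen for the example.

```python
import math

def small_cluster_outliers(points, eps, min_size):
    """Group points linked by distance <= eps; return members of small groups."""
    n = len(points)
    labels = [None] * n
    cluster_id = 0
    for start in range(n):
        if labels[start] is not None:
            continue
        # Grow a new cluster from this unlabelled point.
        labels[start] = cluster_id
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] is None and math.dist(points[i], points[j]) <= eps:
                    labels[j] = cluster_id
                    stack.append(j)
        cluster_id += 1
    # Count cluster sizes and report points that sit in undersized clusters.
    sizes = {}
    for lab in labels:
        sizes[lab] = sizes.get(lab, 0) + 1
    return [p for p, lab in zip(points, labels) if sizes[lab] < min_size]

# Hypothetical (age, maximum heart rate) pairs.
patients = [(45, 150), (47, 152), (50, 148), (52, 155), (48, 151),
            (61, 130), (63, 128), (60, 132), (62, 127), (30, 220)]

cluster_outliers = small_cluster_outliers(patients, eps=6, min_size=3)
print(cluster_outliers)  # [(30, 220)]
```

In practice the features would usually be scaled first so that no single unit dominates the distance, and a library algorithm such as DBSCAN or k-means would replace this hand-written grouping.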
Graphical Approach:
Plots such as box plots, histograms, and scatter plots are commonly used to identify outliers in the dataset visually.
Methods to Pre-Process Outliers:
Mean / Median / Random Sampling:
If we have reason to believe that the outliers are due to mechanical error or problems during measurement, they are similar in nature to missing data, and any method used for missing data imputation can be used to replace them. The number of outliers is small (otherwise they would not be called outliers), so it is reasonable to use mean, median, or random-sample imputation to replace them.
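A minimal sketch of median imputation follows. It assumes the lower and upper bounds have already been chosen by one of the detection methods; the bounds, the data, and the function name are hypothetical.

```python
import statistics

def impute_outliers_with_median(values, low, high):
    """Replace values outside [low, high] with the median of the remaining values."""
    inliers = [v for v in values if low <= v <= high]
    replacement = statistics.median(inliers)
    return [v if low <= v <= high else replacement for v in values]

# Hypothetical resting blood pressure readings; 300 is treated as a measurement error.
resting_bp = [118, 120, 122, 125, 128, 130, 132, 135, 138, 140, 300]

imputed = impute_outliers_with_median(resting_bp, low=98, high=162)
print(imputed[-1])  # 129.0
```

The median is computed from the non-outlying values only, so the replacement is not distorted by the very value it replaces.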
Trimming:
In this method, we discard the outliers completely, i.e. we eliminate the data points that are considered outliers. When only a small number of values would be removed from the dataset, trimming is a good and fast approach.
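Trimming can be sketched as a simple filter over the records. The patient records, the column name `chol`, and the bounds below are invented for the example.

```python
def trim_outliers(records, column, low, high):
    """Keep only the records whose value in `column` lies within [low, high]."""
    return [r for r in records if low <= r[column] <= high]

# Hypothetical patient records; the cholesterol value 600 is considered an outlier.
patients = [
    {"age": 54, "chol": 210},
    {"age": 61, "chol": 245},
    {"age": 47, "chol": 600},
    {"age": 58, "chol": 198},
]

trimmed = trim_outliers(patients, "chol", low=100, high=400)
print(len(trimmed))  # 3
```

Note that trimming drops the whole record, so the other fields of that patient (here the age) are lost as well.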
Top / bottom / zero Coding:
Top coding means capping the maximum of the distribution at an arbitrarily set value. A top-coded variable is one for which data points above an upper bound are censored. With top coding, the outlier is capped at a certain maximum value and looks like many other observations.
Bottom coding is analogous but on the left side of the distribution. That is, all values below a certain threshold are capped to that threshold. If the threshold is zero, it is known as zero-coding. For example, variables like “age” or “earnings” cannot have negative values, so it is reasonable to cap the lowest value at zero.
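Top, bottom, and zero coding all amount to clamping values to a range, which can be sketched as below. The `ages` list and the upper bound of 100 are arbitrary choices for illustration.

```python
def cap(values, lower=None, upper=None):
    """Clamp values to [lower, upper]; a bound of None leaves that side uncapped."""
    capped = []
    for v in values:
        if lower is not None and v < lower:
            v = lower  # bottom coding (zero-coding when lower == 0)
        if upper is not None and v > upper:
            v = upper  # top coding
        capped.append(v)
    return capped

# Hypothetical ages with an impossible negative value and an implausibly high one.
ages = [-3, 25, 47, 61, 130]

capped_ages = cap(ages, lower=0, upper=100)
print(capped_ages)  # [0, 25, 47, 61, 100]
```

Unlike trimming, capping keeps every record in the dataset and only changes the extreme values.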