I am going to give you a detailed answer to the question; in case you have doubts, please let me know. I have detailed how to handle both missing values and noise in data in the following ways:
1. Deleting the observations
If there are a large number of records in your dataset, and all the categories to be predicted are sufficiently represented in the training data, then delete (or simply do not include while model building, for example by setting na.action=na.omit) those observations (rows) that contain missing values. Make sure that after deleting the observations, you:
1. Have sufficient data points, so the model doesn’t lose power.
2. Do not introduce bias (meaning, disproportionate or non-representation of classes).
library(mlbench); data(BostonHousing)  # BostonHousing dataset ships with the mlbench package
lm(medv ~ ptratio + rad, data=BostonHousing, na.action=na.omit)
2. Deleting the variable
If a variable has more missing values than the rest of the variables in the dataset, and if by removing that one variable you can save many observations, then I suggest removing that particular variable, unless it is a really important predictor that makes a lot of business sense.
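A minimal sketch of this check in R (the data frame name df and the 60% cut-off below are illustrative, not part of the original answer):
miss_frac <- colMeans(is.na(df))  # proportion of missing values per column
df <- df[, miss_frac <= 0.6]      # drop columns with more than 60% of values missing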
3. Imputation with mean / median / mode
Replacing the missing values with the mean / median / mode is a crude way of treating missing values. Depending on the context, for example if the variation is low or if the variable has low leverage over the response, such a rough approximation is acceptable and could possibly give satisfactory results.
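A minimal sketch, assuming ptratio is a numeric column of BostonHousing into which some NAs have been introduced (the column choice is illustrative):
BostonHousing$ptratio[is.na(BostonHousing$ptratio)] <- mean(BostonHousing$ptratio, na.rm = TRUE)  # mean imputation
Median imputation works the same way, using median(..., na.rm = TRUE) instead of mean().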
4. Prediction
Prediction is the most advanced method to impute missing values and includes different approaches such as kNN imputation, rpart, and mice.
4.1. kNN Imputation
DMwR::knnImputation uses the k-Nearest Neighbours approach to impute missing values. What kNN imputation does, in simpler terms, is as follows: for every observation to be imputed, it identifies the ‘k’ closest observations based on the Euclidean distance and computes the weighted average (weighted by distance) of these ‘k’ observations.
Code for this:
library(DMwR)
knnOutput <- knnImputation(BostonHousing[, !names(BostonHousing) %in% "medv"])  # perform knn imputation
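A quick sanity check on the result (knnOutput as produced above):
anyNA(knnOutput)  # should return FALSE once all missing values have been imputed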
4.2 rpart
The limitation with DMwR::knnImputation is that it may not be appropriate when the missing value comes from a factor variable. Both rpart and mice have the flexibility to handle that scenario. The advantage with rpart is that you need only one of the predictor fields to be non-NA.
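A minimal sketch, assuming rad has been converted to a factor and contains some NAs (use method = "class" for factor columns and method = "anova" for numeric ones); the variable names follow the BostonHousing example above but are otherwise illustrative:
library(rpart)
# fit a tree on the rows where rad is known, using every column except the response medv
class_mod <- rpart(rad ~ . - medv, data = BostonHousing[!is.na(BostonHousing$rad), ], method = "class", na.action = na.rpart)
# predict the missing rad values from the fitted tree
rad_pred <- predict(class_mod, BostonHousing[is.na(BostonHousing$rad), ], type = "class")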
4.3 mice
mice, short for Multivariate Imputation by Chained Equations, is an R package that provides advanced features for missing value treatment. It uses a slightly uncommon way of implementing the imputation in two steps: mice() builds the imputation model and complete() generates the completed data.
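A minimal sketch, using mice’s defaults (the seed and printFlag values below are illustrative choices):
library(mice)
# build the imputation models, excluding the response medv, then extract one completed dataset
miceMod <- mice(BostonHousing[, !names(BostonHousing) %in% "medv"], seed = 1, printFlag = FALSE)
miceOutput <- complete(miceMod)  # by default, the first of the imputed datasets
anyNA(miceOutput)                # should be FALSE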
To remove noise (the examples below use the Wolfram Language / Mathematica):
1. Use BilateralFilter:
ListLinePlot[{data, BilateralFilter[data, 2, .5, MaxIterations -> 25]}, PlotStyle -> {Thin, Red}]
2. MeanShiftFilter can produce similar results:
ListLinePlot[{data, MeanShiftFilter[data, 5, .5, MaxIterations -> 10]}, PlotStyle -> {Thin, Red}]
3. Another way is to apply TrimmedMean over a sliding window, using ArrayFilter:
ListLinePlot[{data, ArrayFilter[TrimmedMean, data, 20]}, PlotStyle -> {Thin, Red}]
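If you want to stay in R instead, a rough equivalent of that last step is a rolling trimmed mean; a minimal sketch, assuming data is a numeric vector (the window half-width and trim fraction are illustrative):
rolling_trimmed_mean <- function(x, half_width = 10, trim = 0.2) {
  n <- length(x)
  sapply(seq_len(n), function(i) {
    window <- x[max(1, i - half_width):min(n, i + half_width)]  # sliding window around point i
    mean(window, trim = trim)                                   # trimmed mean of the window
  })
}
smoothed <- rolling_trimmed_mean(data)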