How do outliers affect PC scores? Perform a PCA on the board stiffness dataset with and without detected outliers.
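PCA (principal component analysis) reduces the dimensionality of a dataset by projecting it onto the directions of greatest variance. Because those directions are computed from variances and covariances, a few extreme observations (outliers) can dominate them, which distorts the PC scores and can lead to misleading conclusions. Inspecting the scores can in turn help identify data points that differ from the rest.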
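As a rough illustration of that sensitivity, the sketch below runs PCA on purely synthetic data (not the board stiffness data) before and after one row is replaced by an extreme point. The names `clean`, `contaminated`, and `prop_var` are hypothetical and exist only for this example.

```r
# Synthetic illustration only; these are NOT the board stiffness measurements.
set.seed(1)
n <- 30
clean <- data.frame(a = rnorm(n), b = rnorm(n), c = rnorm(n))

# Copy the data and turn one row into an extreme point on every variable.
contaminated <- clean
contaminated[1, ] <- c(15, 15, 15)

# Proportion of total variance captured by each principal component.
prop_var <- function(df) {
  p <- prcomp(df, center = TRUE, scale. = TRUE)
  p$sdev^2 / sum(p$sdev^2)
}

pv_clean <- prop_var(clean)
pv_contaminated <- prop_var(contaminated)

# One row per dataset, one column per component.
round(rbind(clean = pv_clean, contaminated = pv_contaminated), 3)
```

The single extreme row makes the three otherwise unrelated variables look strongly correlated, so the first component absorbs most of the variance.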
#Installing all the required libraries
install.packages("SMLoutliers")
library(SMLoutliers)
data("stiff")
View(stiff)
par(mfrow = c(2, 2))
hist(stiff$x1, breaks = 20) #the outlier shows up as an isolated bar at the far right
max(stiff$x1) #the largest x1 value belongs to the outlier
#[1] 2983
hist(stiff$x2, breaks = 20)
max(stiff$x2)
#[1] 2794
hist(stiff$x3, breaks = 20)
max(stiff$x3)
#[1] 2412
hist(stiff$x4, breaks = 20)
max(stiff$x4)
#[1] 2581
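Instead of reading the outlier off the histograms, one could flag it numerically. The sketch below is one possible approach, not the method used in this answer: it computes squared Mahalanobis distances with base R's `mahalanobis()` and compares them to a chi-squared quantile. The helper `flag_outliers`, the 0.999 cutoff, and the synthetic `toy` data are all assumptions made for illustration; on the real data the call would be `flag_outliers(stiff)`.

```r
# Flag rows whose squared Mahalanobis distance from the column means exceeds
# a chi-squared quantile. The 0.999 level is an assumed cutoff, not a rule.
flag_outliers <- function(df, level = 0.999) {
  x <- as.matrix(df)
  d2 <- mahalanobis(x, center = colMeans(x), cov = cov(x))
  which(d2 > qchisq(level, df = ncol(x)))
}

# Synthetic stand-in with four numeric columns (NOT the real stiff data).
set.seed(7)
toy <- as.data.frame(matrix(rnorm(30 * 4), ncol = 4))
names(toy) <- c("x1", "x2", "x3", "x4")
toy[9, ] <- c(10, 10, 10, 10)  # plant one extreme row

flagged <- flag_outliers(toy)
flagged
```

Because the mean and covariance here are estimated from all rows, including the outliers themselves, several outliers can partly hide one another; a robust covariance estimate would be a natural refinement.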
#PCA WITH OUTLIERS
stiff.pca <- prcomp(stiff, center = TRUE, scale. = TRUE) #the argument is named scale. (with a trailing dot)
print(stiff.pca)
plot(stiff.pca, type='l')
#Plotting
install.packages("ggfortify")
library(ggplot2)
library(ggfortify)
autoplot(stiff.pca) #PC1 alone accounts for about 90% of the total variance.
#PCA WITHOUT OUTLIER
#Drop the 9th row, which contains the outlier, and keep the rest of the data.
data <- stiff[-9, ]
data.pca <- prcomp(data, center = TRUE, scale. = TRUE)
print(data.pca)
plot(data.pca, type = 'l')
autoplot(data.pca)
#PC1 now accounts for only 85.4% of the variance, down from about 90% with the outlier.
#So a single outlier makes a visible difference: it inflates the share of variance attributed to PC1.
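The comparison above relies on reading percentages off the plots. A minimal sketch of a numeric comparison follows, including the effect on the PC scores themselves. The helper `compare_pca` and the synthetic `toy` data are hypothetical; with the objects from this answer the call would be `compare_pca(stiff, drop_rows = 9)`.

```r
# Fit PCA with and without the given rows and summarise the difference.
compare_pca <- function(df, drop_rows) {
  full <- prcomp(df, center = TRUE, scale. = TRUE)
  reduced <- prcomp(df[-drop_rows, , drop = FALSE], center = TRUE, scale. = TRUE)

  # "Proportion of Variance" row of summary()$importance, one entry per PC.
  pv_full <- summary(full)$importance["Proportion of Variance", ]
  pv_reduced <- summary(reduced)$importance["Proportion of Variance", ]

  # Scores of the retained rows under each fit.
  scores_full <- full$x[-drop_rows, , drop = FALSE]
  scores_reduced <- reduced$x

  list(
    prop_var = rbind(with_outlier = pv_full, without_outlier = pv_reduced),
    # The sign of a PC is arbitrary, so compare absolute correlations.
    score_cor = abs(diag(cor(scores_full, scores_reduced)))
  )
}

# Synthetic, correlated stand-in data (NOT the real stiff data).
set.seed(42)
n <- 30
base <- rnorm(n)
toy <- data.frame(
  x1 = base + rnorm(n, sd = 0.5),
  x2 = base + rnorm(n, sd = 0.5),
  x3 = base + rnorm(n, sd = 0.5),
  x4 = base + rnorm(n, sd = 0.5)
)
toy[9, ] <- c(8, 8, 8, 8)  # plant one extreme row

res <- compare_pca(toy, drop_rows = 9)
res$prop_var   # variance share per PC, with and without the planted row
res$score_cor  # agreement of each PC's scores between the two fits
```

A `score_cor` value near 1 means the retained rows are ordered almost identically on that component in both fits; lower values indicate that the outlier has rotated that component.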