Question

Data mining

  1. Please perform Principal Component Analysis and K-Means Clustering on the given dataset below. [50 Points]

    Dataset Link : https://dataminingcsc6740.s3-us-west-2.amazonaws.com/datasets/homework_2.csv

    10 Points for Data Preprocessing.

    15 Points for PCA Algorithm along with plots and Results Explanation.

    15 Points for K-Means Algorithm with plots and Results Explanation.

    10 Points for comparing the results between PCA and K-Means and what your inference is from the outputs of the algorithms.

    Hints:
    As the data preprocessing step, convert all the variables in the dataset into numerical values, as the algorithms only work with numerical values.
    Then apply both algorithms one after the other, plot the output clusters, and compare the output clusters from both steps.

Solutions

Expert Solution

I used R for this task.

The code below covers preprocessing, k-means clustering, and principal component analysis (PCA).

# load required libraries

library(ggplot2)

library(ggthemes)

library(GGally)

library(dplyr)

library(Metrics)

# load dataset (the file is semicolon-separated)

homework_2 <- read.csv("C:/Users/CST2019/Downloads/homework_2.csv", sep=";")

# structure of dataset

str(homework_2)

# preprocessing: convert categorical values to numeric values

# as.factor() is applied first so the conversion also works when read.csv() loads text columns as character (the default since R 4.0); as.numeric() then returns the integer level codes

cat_cols <- c("default", "housing", "loan", "poutcome", "y", "month", "contact", "job", "marital", "education")

homework_2[cat_cols] <- lapply(homework_2[cat_cols], function(col) as.numeric(as.factor(col)))

str(homework_2)
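
The integer codes produced above impose an arbitrary order on nominal columns such as job or marital, and both PCA and k-means treat that order as a real distance. One possible alternative is one-hot encoding with base R's model.matrix(). The sketch below is only an illustration on a small made-up data frame: `toy` and its values are assumptions, not taken from homework_2.csv.

```r
# Toy stand-in for a few columns of the dataset (values are made up)
toy <- data.frame(
  age     = c(30, 45, 52, 28),
  marital = factor(c("single", "married", "divorced", "married")),
  housing = factor(c("yes", "no", "yes", "yes"))
)

# model.matrix() expands each factor into 0/1 indicator columns;
# "~ . - 1" drops the intercept so the first factor keeps all of its levels
toy_num <- as.data.frame(model.matrix(~ . - 1, data = toy))

str(toy_num)
```

The same call could be applied to the full data frame in place of the integer coding, at the cost of more columns.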

# overall summary

summary(homework_2)

str(homework_2)

#plots

hist(homework_2$age)

hist(homework_2$marital)

hist(homework_2$education)

hist(homework_2$default)

hist(homework_2$balance)

hist(homework_2$housing)

hist(homework_2$loan)

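# k-means clustering with 3 clusters; the starting centers are random, so fix the seed and store the result

set.seed(123)

km <- kmeans(homework_2, 3)

km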
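
The kmeans() call above runs on the raw columns, so variables with large ranges (balance, for example) dominate the Euclidean distance, and the choice of 3 clusters is not justified. A minimal sketch of how standardization, an elbow plot, and a cluster plot might look follows. It uses made-up data (`toy`) as a stand-in for the preprocessed data frame, and k = 2 is chosen only because the toy data was generated with two groups.

```r
set.seed(42)  # k-means starts from random centers

# Made-up numeric data standing in for the preprocessed data frame
toy <- data.frame(
  age     = c(rnorm(50, 30, 3), rnorm(50, 55, 3)),
  balance = c(rnorm(50, 500, 200), rnorm(50, 5000, 800))
)

# Standardize so that no single column dominates the Euclidean distance
toy_scaled <- scale(toy)

# Total within-cluster sum of squares for k = 1..6 (elbow method)
wss <- sapply(1:6, function(k) kmeans(toy_scaled, centers = k, nstart = 10)$tot.withinss)
plot(1:6, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SS")

# Fit with the chosen k and plot the cluster assignments and centers
km_toy <- kmeans(toy_scaled, centers = 2, nstart = 10)
plot(toy_scaled, col = km_toy$cluster, pch = 19, main = "K-means clusters (k = 2)")
points(km_toy$centers, pch = 8, cex = 2)
```

For the real data, the elbow plot would be read to pick k before the final fit.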

# correlation matrix (stored as cor_mat so the name does not mask the cor() function)

cor_mat <- cor(homework_2)

library(corrplot)

corrplot(cor_mat, method="number")

library(caTools)

# sample.split() expects a vector with one entry per row, so pass a column (here y), not the whole data frame; passing the data frame yields one flag per column, which is then recycled across rows

set.seed(123)

split = sample.split(homework_2$y, SplitRatio = 0.70) # TRUE marks the 70% of rows assigned to training, FALSE the remaining 30%

train1 = subset(homework_2, split == TRUE) # training set: rows marked TRUE

test1 = subset(homework_2, split == FALSE) # test set: rows marked FALSE

summary(train1)

summary(test1)

# PCA on the training set; scale. = TRUE standardizes each variable to unit variance

prin_comp <- prcomp(train1, scale. = TRUE)

names(prin_comp)

prin_comp$center

prin_comp$scale

prin_comp$rotation

biplot(prin_comp, scale = 0)

std_dev <- prin_comp$sdev

std_dev

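# compute the variance of each component

pr_var <- std_dev^2

pr_var

# variance of the first 10 components

head(pr_var, 10)

# proportion of variance explained by each component

prop_varex <- pr_var/sum(pr_var)

# head() returns at most 20 values; indexing with [1:20] pads with NA when there are fewer components

head(prop_varex, 20)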
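
The proportions printed above are usually easier to interpret as a scree plot and a cumulative curve, which also give a way to decide how many components to keep. The sketch below shows one way this might be done. It uses the built-in USArrests data as a stand-in for train1, and the 90% threshold is an arbitrary illustrative choice.

```r
# USArrests ships with R and stands in for the training data here
pca <- prcomp(USArrests, scale. = TRUE)

pr_var     <- pca$sdev^2            # variance of each component
prop_varex <- pr_var / sum(pr_var)  # proportion of variance explained

# Scree plot: proportion of variance per component
plot(prop_varex, type = "b",
     xlab = "Principal component", ylab = "Proportion of variance explained")

# Cumulative proportion with an illustrative 90% threshold
plot(cumsum(prop_varex), type = "b", ylim = c(0, 1),
     xlab = "Principal component", ylab = "Cumulative proportion")
abline(h = 0.9, lty = 2)

# Smallest number of components whose cumulative share reaches the threshold
n_comp <- which(cumsum(prop_varex) >= 0.9)[1]
n_comp
```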

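# build a training set from the response (pdays) and the principal component scores

# note: pdays is also one of the variables that went into the PCA above

train.data <- data.frame(pdays = train1$pdays, prin_comp$x)

# keep the response plus the first 10 principal components (column 1 is pdays)

train.data <- train.data[,1:11]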

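# fit a regression tree on the principal components

# install rpart only if it is missing, instead of reinstalling on every run

if (!requireNamespace("rpart", quietly = TRUE)) install.packages("rpart")

library(rpart)

rpart.model <- rpart(pdays ~ . , data = train.data, method = "anova")

rpart.model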

#transform test into PCA

test.data <- predict(prin_comp, newdata = test1)

test.data <- as.data.frame(test.data)

#select the first 10 components

test.data <- test.data[,1:10]

#make prediction on test data

rpart.prediction <- predict(rpart.model, test.data)
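
The listing above ends with a regression tree and does not put the PCA and k-means results side by side, which the last part of the question asks for. One common way to compare them is to plot the observations on the first two principal components, colored by k-means cluster, and to check how much the clustering changes when it is run on the components alone. The sketch below is an illustration under assumptions: the numeric columns of the built-in iris data stand in for the preprocessed dataset, and k = 3 is carried over from the kmeans() call in the listing.

```r
set.seed(1)

# Numeric columns of iris stand in for the preprocessed dataset
x <- scale(iris[, 1:4])

pca <- prcomp(x)                            # x is already standardized
km  <- kmeans(x, centers = 3, nstart = 25)  # clusters found in the full space

# Project the observations onto PC1/PC2 and color them by k-means cluster
scores <- as.data.frame(pca$x[, 1:2])
scores$cluster <- factor(km$cluster)

plot(scores$PC1, scores$PC2, col = scores$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2", main = "K-means clusters in PCA space")

# Cluster again using only the first two components, then cross-tabulate
km_pc <- kmeans(scores[, c("PC1", "PC2")], centers = 3, nstart = 25)
table(full_space = km$cluster, pc_space = km_pc$cluster)
```

If most observations fall into one cell per row of the table, the first two components preserve the cluster structure; cluster labels themselves are arbitrary, so only the pattern matters.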

