Question

Using R Programming

Apply clustering to "Wholesale customers Data Set" and see if you can distinguish between regions. NOTE: the clustering should exclude the region column.

Solutions

Expert Solution

library(magrittr)    # provides the %>% pipe
library(factoextra)  # fviz_cluster()
library(gridExtra)   # grid.arrange()

# NOTE: `conti` and `wholesale_log` are not defined in this excerpt.
# `conti` is presumably the continuous spending columns (Region excluded).
feature_cols <- c('Fresh', 'Milk', 'Grocery', 'Frozen', 'Deter_Paper', 'Delicassen')

# Fit k-means for k = 2..7 and build one cluster plot per value of k.
plots <- lapply(2:7, function(k) {
    conti %>% kmeans(k, nstart=10) %>%
        fviz_cluster(geom = "point", data=wholesale_log[, feature_cols])
})

# Show the six plots in a 3-row grid.
grid.arrange(grobs = plots, nrow=3)

K-means for "Suitable" Ks

No two of the methods for finding a suitable k gave the same result. To me, 10+ clusters (customer segments) seems too many to make much marketing sense, especially if we assume that the 440 observations represent the business's entire customer base. So k-means was computed for k = 5 and k = 2 on the full data, and for k = 5 and k = 3 on the subset.
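One common way to look for a suitable k is the elbow method, which plots the total within-cluster sum of squares against k. The sketch below is only an illustration under stated assumptions: it uses synthetic data in place of the real spending columns, and the log transform plus standardization is my assumption, not necessarily what was done above.

```r
# Synthetic stand-in for the six spending columns (NOT the real data).
set.seed(888)
spend <- as.data.frame(matrix(rlnorm(440 * 6, meanlog = 8, sdlog = 1), ncol = 6))
names(spend) <- c("Fresh", "Milk", "Grocery", "Frozen",
                  "Detergents_Paper", "Delicassen")

# Assumption: log-transform and standardize so that no single
# category dominates the Euclidean distances.
spend_scaled <- scale(log(spend))

# Total within-cluster sum of squares for k = 1..10.
ks <- 1:10
wss <- sapply(ks, function(k) {
  kmeans(spend_scaled, centers = k, nstart = 10)$tot.withinss
})

# Look for the "elbow" where extra clusters stop reducing WSS by much.
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster SS")
```

On random data like this there is no real elbow. On the actual data set the bend, if there is one, would suggest a candidate k.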

All customers, k = 5

Clusters 1 and 2 both spend far more on Fresh products, on average, than on any other category, although cluster 2 has the higher average spend in every category except Detergents_Paper. Cluster 5 contains big spenders in Fresh, Milk, Grocery, and Detergents_Paper, while cluster 4 has relatively high average spend on Milk, Grocery, and Detergents_Paper. Cluster 3, the largest, appears to contain lower spenders across most categories.

set.seed(888)
model_k5 <- kmeans(whole_cust, centers = 5)
clusplot(whole_cust, model_k5$cluster, color = TRUE, shade = TRUE, lines = 0)

print(model_k5)
## K-means clustering with 5 clusters of sizes 113, 24, 227, 71, 5
## 
## Cluster means:
##       Fresh      Milk   Grocery   Frozen Detergents_Paper Delicassen
## 1 20600.283  3787.832  5089.841 3989.071         1130.142   1639.071
## 2 48777.375  6607.375  6197.792 9462.792          932.125   4435.333
## 3  5655.819  3567.793  4513.040 2386.529         1437.559   1005.031
## 4  5207.831 13191.028 20321.718 1674.028         9036.380   1937.944
## 5 25603.000 43460.600 61472.200 2636.000        29974.200   2708.800
## 
## Clustering vector:
##   [1] 3 3 3 1 1 3 3 3 3 4 3 3 1 1 1 3 3 3 1 3 1 3 1 4 1 1 3 1 4 2 1 3 1 1 3
##  [36] 3 1 1 4 2 1 1 4 4 3 4 4 5 3 4 3 3 2 4 1 3 4 4 1 3 3 5 3 4 3 4 3 1 3 3
##  [71] 1 1 3 1 3 1 3 4 3 3 3 4 3 1 3 5 5 2 3 1 3 1 4 1 4 3 3 3 3 3 4 4 3 2 1
## [106] 1 3 4 3 4 3 4 1 1 1 3 3 3 1 3 1 3 3 3 2 2 1 1 3 2 3 3 1 3 3 3 3 3 1 3
## [141] 1 1 2 3 1 4 3 3 3 1 1 3 1 3 3 4 4 1 3 4 3 3 1 4 3 4 3 3 3 3 4 4 3 4 3
## [176] 3 2 3 3 3 3 2 3 2 3 3 3 3 3 4 1 1 3 4 3 1 1 3 3 3 4 4 1 3 3 4 3 3 3 4
## [211] 1 4 3 3 3 4 4 1 4 3 1 3 3 3 3 3 1 3 3 3 3 3 1 3 1 3 3 1 3 2 1 1 1 3 3
## [246] 4 3 1 1 3 3 4 3 1 3 1 3 3 2 2 3 3 1 3 4 4 4 1 4 1 3 3 3 2 3 3 1 3 3 1
## [281] 3 3 2 1 2 2 3 1 1 2 3 3 3 4 1 3 1 3 3 3 1 4 3 4 4 3 4 1 3 4 3 1 4 3 3
## [316] 4 3 3 3 4 3 3 1 1 1 2 3 3 1 3 3 4 1 5 1 1 1 3 3 3 3 3 3 4 3 3 4 1 3 4
## [351] 3 4 3 4 1 3 1 4 3 3 1 3 3 3 3 3 3 3 1 3 2 1 3 1 3 3 4 2 3 3 1 1 1 3 4
## [386] 3 3 1 3 3 3 3 3 1 3 3 3 3 3 3 3 1 1 1 1 3 1 4 3 3 3 3 3 3 3 3 4 3 4 3
## [421] 3 1 1 1 1 3 4 1 3 3 3 3 1 3 1 1 2 4 3 3
## 
## Within cluster sum of squares by cluster:
## [1]  9394958498 16226867469 10804478229 11008166107  5682449098
##  (between_SS / total_SS =  66.3 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
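The between_SS / total_SS percentage in the printout can be recomputed from the components listed under "Available components". A minimal sketch on synthetic data (the matrix `x` is a made-up stand-in, not the customer data):

```r
# Synthetic data, only to show which components are involved.
set.seed(888)
x <- matrix(rnorm(200 * 3), ncol = 3)
fit <- kmeans(x, centers = 5)

# Share of the total variation explained by the cluster assignment.
ratio <- fit$betweenss / fit$totss
print(round(100 * ratio, 1))

# The total sum of squares splits into within- and between-cluster parts.
print(all.equal(fit$totss, fit$tot.withinss + fit$betweenss))
```

This share generally rises as k grows, so 66.3 % for k = 5 against 28.2 % for k = 2 does not by itself show that k = 5 is the better choice.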

All customers, k = 2

Cluster 2, the smaller cluster, appears to contain bigger spenders who buy a lot of Fresh products. Cluster 1 is everyone else.

set.seed(888)
model_k2 <- kmeans(whole_cust, centers = 2)
clusplot(whole_cust, model_k2$cluster, color = TRUE, shade = TRUE, lines = 0)

print(model_k2)
## K-means clustering with 2 clusters of sizes 375, 65
## 
## Cluster means:
##       Fresh     Milk   Grocery   Frozen Detergents_Paper Delicassen
## 1  7944.112 5151.819  7536.128 2484.131         2872.557   1214.261
## 2 35401.369 9514.231 10346.369 6463.092         2933.046   3316.846
## 
## Clustering vector:
##   [1] 1 1 1 1 2 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 2 2 2 1 1 1 1 2 1 1 1 2 1
##  [36] 1 2 1 1 2 2 1 1 1 1 1 1 2 1 1 1 1 2 1 2 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1
##  [71] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1
## [106] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1
## [141] 1 2 2 1 1 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [176] 1 2 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1
## [211] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 2 2 2 1 1 1
## [246] 1 1 1 1 1 1 1 1 2 1 2 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1 1
## [281] 1 1 2 2 2 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1
## [316] 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 2 1 1
## [351] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 2 1 1 2 1 2 1 1
## [386] 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2 2 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1
## [421] 1 1 2 1 1 1 1 2 1 1 1 1 1 1 1 2 2 1 1 1
## 
## Within cluster sum of squares by cluster:
## [1] 60341401922 52876126599
##  (between_SS / total_SS =  28.2 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

Subset excluding top 10 customers, k = 5

Again, cluster 1 spends more on Fresh products on average. Cluster 2 contains big Grocery and Detergents_Paper spenders. Cluster 3, again the largest, appears to contain lower spenders across most categories, and cluster 5 more moderate spenders. Cluster 4 spends more on average on Milk, Grocery, and Delicassen.

set.seed(888)
model_k5_2 <- kmeans(whole_cust_rm_top, centers = 5)
clusplot(whole_cust_rm_top, model_k5_2$cluster, color = TRUE, shade = TRUE, lines = 0)

print(model_k5_2)
## K-means clustering with 5 clusters of sizes 102, 23, 179, 27, 73
## 
## Cluster means:
##       Fresh      Milk   Grocery    Frozen Detergents_Paper Delicassen
## 1 23761.225  3828.647  4985.765 3625.6863         1109.578  1508.7157
## 2  1278.174  7734.957 18777.870  879.7826         8855.043  1218.0870
## 3  6908.112  2271.648  2818.939 2521.4302          604.838   894.6201
## 4  7282.074 15539.481 21090.852 2149.8889         8582.667  1637.1111
## 5  4703.890  7474.808  9551.699 1310.5616         3996.521  1232.0137
## 
## Clustering vector:
##   [1] 5 5 3 1 5 3 5 3 4 5 3 1 1 1 3 5 3 1 5 1 3 1 1 1 3 3 4 1 1 3 1 1 3 5 1
##  [36] 5 4 1 1 5 2 5 4 4 5 4 3 5 1 5 1 3 5 3 5 5 5 4 3 5 1 3 3 1 3 1 5 1 3 4
##  [71] 3 3 3 2 5 1 3 3 1 3 3 5 3 5 3 3 3 5 2 5 3 3 5 4 5 4 3 4 1 3 1 3 3 3 1
## [106] 3 1 3 3 5 1 1 1 5 1 3 3 3 3 3 3 5 5 3 3 1 1 1 3 1 3 3 3 1 1 3 1 3 3 2
## [141] 2 1 5 2 5 3 3 4 5 4 5 3 3 3 2 5 2 3 5 3 3 3 5 5 3 3 3 5 5 5 1 3 3 2 3
## [176] 1 5 3 3 4 4 3 3 2 3 5 5 4 1 3 5 5 2 1 3 3 5 3 3 3 3 1 3 3 3 3 5 1 3 1
## [211] 3 3 1 3 1 1 1 3 5 2 3 3 1 3 3 3 1 5 1 3 3 3 3 1 3 2 4 2 1 4 3 3 3 5 1
## [246] 3 3 1 3 1 3 3 1 1 3 1 1 1 3 3 3 5 1 3 1 3 5 3 1 4 5 2 2 5 4 1 3 4 3 1
## [281] 2 3 3 5 3 3 3 4 3 3 1 3 1 3 3 1 3 3 4 1 1 1 3 3 3 5 5 5 2 3 5 2 1 3 4
## [316] 3 2 3 2 3 3 1 2 5 3 1 3 3 3 3 5 3 3 1 3 1 1 3 1 3 3 5 1 3 5 1 1 1 3 4
## [351] 3 3 1 3 3 3 3 3 1 3 3 5 3 3 3 3 1 1 1 1 3 1 4 3 3 3 3 5 3 5 5 5 2 3 5
## [386] 1 1 1 1 3 5 1 3 3 5 3 1 3 1 1 1 4 3 3
## 
## Within cluster sum of squares by cluster:
## [1] 9359217997  814580736 5870158691 2116292111 2852650509
##  (between_SS / total_SS =  68.5 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
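The excerpt does not show how whole_cust_rm_top was built. The cluster sizes above sum to 404, so 36 of the 440 rows were dropped, which would be consistent with removing the top spenders in each category rather than ten rows overall. The sketch below shows one way such a subset might be made. The per-category rule, `top_n`, and the `_demo` names are my assumptions, and the data is synthetic.

```r
# Synthetic stand-in for the six spending columns (NOT the real data).
set.seed(888)
n <- 440
whole_cust_demo <- as.data.frame(matrix(rlnorm(n * 6, meanlog = 8, sdlog = 1),
                                        ncol = 6))
names(whole_cust_demo) <- c("Fresh", "Milk", "Grocery", "Frozen",
                            "Detergents_Paper", "Delicassen")

# Hypothetical rule: drop any customer who is among the top_n
# spenders in at least one category.
top_n <- 10
top_idx <- unique(unlist(lapply(whole_cust_demo, function(col) {
  order(col, decreasing = TRUE)[seq_len(top_n)]
})))

# Customers can be top spenders in several categories, so fewer
# than 6 * top_n rows are usually removed.
rm_top_demo <- whole_cust_demo[-top_idx, ]
print(length(top_idx))
print(nrow(rm_top_demo))
```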

Subset excluding top 10 customers, k = 3

Again we have cluster 1 with large average spending in the Fresh category. Cluster 2 spends more on average on Milk, Grocery, and Detergents_Paper. Cluster 3 is the largest, with lower spending across most categories.

set.seed(888)
model_k3 <- kmeans(whole_cust_rm_top, centers = 3)
clusplot(whole_cust_rm_top, model_k3$cluster, color = TRUE, shade = TRUE, lines = 0)

print(model_k3)
## K-means clustering with 3 clusters of sizes 103, 90, 211
## 
## Cluster means:
##       Fresh      Milk   Grocery   Frozen Detergents_Paper Delicassen
## 1 23681.864  3801.320  4975.903 3599.320        1100.1553   1507.447
## 2  4551.489 10192.667 16206.444 1409.222        7059.7556   1380.389
## 3  6543.739  2992.427  3510.227 2358.052         947.3602    932.128
## 
## Clustering vector:
##   [1] 3 2 3 1 3 3 3 3 2 2 3 1 1 1 3 2 3 1 3 1 3 1 1 1 3 3 2 1 1 3 1 1 3 2 1
##  [36] 2 2 1 1 2 2 2 2 2 2 2 3 3 1 2 1 3 2 3 3 3 3 2 3 3 1 3 3 1 3 1 3 1 3 2
##  [71] 3 3 3 2 2 1 3 3 1 3 3 2 3 3 3 3 3 2 2 3 3 1 2 2 3 2 3 2 1 3 1 3 3 3 1
## [106] 3 1 3 3 3 1 1 1 3 1 3 3 3 3 3 3 3 3 3 3 1 1 1 3 1 3 3 3 1 1 3 1 3 3 2
## [141] 2 1 2 2 2 3 3 2 3 2 2 3 3 3 2 3 2 3 2 3 3 3 3 2 3 3 3 2 2 2 1 3 3 2 3
## [176] 1 2 3 3 2 2 3 3 2 3 3 3 2 1 3 3 2 2 1 3 3 2 3 3 3 3 1 3 3 3 3 3 1 3 1
## [211] 3 3 1 3 1 1 1 3 2 2 3 3 1 3 3 3 1 3 1 3 3 3 3 1 3 2 2 2 1 2 3 3 3 3 1
## [246] 3 3 1 3 1 3 3 1 1 3 1 1 1 3 3 3 2 1 3 1 3 3 3 1 2 2 2 2 2 2 1 3 2 3 1
## [281] 2 3 3 2 3 3 3 2 3 3 1 3 1 3 3 1 3 3 2 1 1 1 3 3 3 2 2 3 2 3 3 2 1 3 2
## [316] 3 2 3 2 3 3 1 2 3 3 1 3 3 3 3 3 3 3 1 3 1 1 3 1 3 3 2 1 3 3 1 1 1 3 2
## [351] 3 3 1 3 3 3 3 3 1 3 3 2 3 3 3 3 1 1 1 1 3 1 2 3 3 3 3 2 3 3 2 2 2 3 2
## [386] 1 1 1 1 3 2 1 3 3 2 3 1 3 1 1 1 2 3 3
## 
## Within cluster sum of squares by cluster:
## [1] 9442507473 7725837886 8424471476
##  (between_SS / total_SS =  61.7 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

K-means Summary

While each k value produced different clusters for the full data and the subset, there was always one cluster with much higher average spend in the Fresh category. The largest cluster always contained customers with lower average spend in most categories.
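To check whether the clusters distinguish between regions, as the question asks, the cluster labels can be cross-tabulated against the Region column that was held out of the clustering. This is a minimal sketch on synthetic data. The `customers` data frame, its three Region codes, and the log-plus-scale preprocessing are all assumptions for illustration.

```r
# Synthetic stand-in with a hypothetical Region column (NOT the real data).
set.seed(888)
n <- 440
customers <- data.frame(
  Region           = sample(1:3, n, replace = TRUE),
  Fresh            = rlnorm(n, meanlog = 9, sdlog = 1),
  Milk             = rlnorm(n, meanlog = 8, sdlog = 1),
  Grocery          = rlnorm(n, meanlog = 8, sdlog = 1),
  Frozen           = rlnorm(n, meanlog = 7, sdlog = 1),
  Detergents_Paper = rlnorm(n, meanlog = 7, sdlog = 1),
  Delicassen       = rlnorm(n, meanlog = 7, sdlog = 1)
)

# Cluster on the spending columns only; Region is excluded.
features <- customers[, setdiff(names(customers), "Region")]
fit <- kmeans(scale(log(features)), centers = 3, nstart = 10)

# Rows are clusters, columns are regions.
tab <- table(Cluster = fit$cluster, Region = customers$Region)
print(tab)

# Row proportions: how each cluster is split across regions.
print(round(prop.table(tab, margin = 1), 2))
```

If the clusters recovered the regions, each row would be concentrated in one column. Here Region is random, so the rows should look roughly alike. On the real data the same table would show how much regional signal the spending patterns carry.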

