Using R Programming
Apply clustering to "Wholesale customers Data Set" and see if you can distinguish between regions. NOTE: the clustering should exclude the region column.
library(magrittr)    # provides the %>% pipe
library(factoextra)  # provides fviz_cluster()
library(gridExtra)   # provides grid.arrange()

# conti is assumed to hold the six spending columns; its definition is not shown.
# Fit k-means for k = 2 to 7 and plot each solution against the log-transformed spend columns.
conti %>% kmeans(2, nstart=10) %>%
  fviz_cluster(geom = "point",
    data=wholesale_log[,c('Fresh', 'Milk', 'Grocery', 'Frozen', 'Deter_Paper', 'Delicassen')]) -> k2
conti %>% kmeans(3, nstart=10) %>%
  fviz_cluster(geom = "point",
    data=wholesale_log[,c('Fresh', 'Milk', 'Grocery', 'Frozen', 'Deter_Paper', 'Delicassen')]) -> k3
conti %>% kmeans(4, nstart=10) %>%
  fviz_cluster(geom = "point",
    data=wholesale_log[,c('Fresh', 'Milk', 'Grocery', 'Frozen', 'Deter_Paper', 'Delicassen')]) -> k4
conti %>% kmeans(5, nstart=10) %>%
  fviz_cluster(geom = "point",
    data=wholesale_log[,c('Fresh', 'Milk', 'Grocery', 'Frozen', 'Deter_Paper', 'Delicassen')]) -> k5
conti %>% kmeans(6, nstart=10) %>%
  fviz_cluster(geom = "point",
    data=wholesale_log[,c('Fresh', 'Milk', 'Grocery', 'Frozen', 'Deter_Paper', 'Delicassen')]) -> k6
conti %>% kmeans(7, nstart=10) %>%
  fviz_cluster(geom = "point",
    data=wholesale_log[,c('Fresh', 'Milk', 'Grocery', 'Frozen', 'Deter_Paper', 'Delicassen')]) -> k7
# Arrange the six cluster plots in a 3-row grid.
grid.arrange(k2, k3, k4, k5, k6, k7, nrow=3)
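The calls above pass nstart = 10, which reruns k-means from ten random initialisations and keeps the solution with the lowest total within-cluster sum of squares. A minimal sketch of the effect, using simulated data as a stand-in (the `sim`, `fit_single` and `fit_multi` names and the data itself are assumptions, not the wholesale data):

```r
# Simulated stand-in data: three loose groups in two dimensions (assumption).
set.seed(1)
sim <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 4), ncol = 2),
  matrix(rnorm(100, mean = 8), ncol = 2)
)

# Use the same seed for both fits so the comparison is repeatable.
set.seed(888)
fit_single <- kmeans(sim, centers = 5, nstart = 1)
set.seed(888)
fit_multi <- kmeans(sim, centers = 5, nstart = 10)

# Lower is better; the multi-start fit keeps the best of its ten attempts.
c(single = fit_single$tot.withinss, multi = fit_multi$tot.withinss)
```

The later models in this answer call kmeans() with the default nstart = 1, so their results depend more heavily on the chosen seed.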
K-means for "Suitable" Ks
The methods for finding a suitable k did not agree with one another. To me, 10 or more clusters (customer segments) seems too many to make much marketing sense, especially if we assume that the 440 observations represent the business's entire customer base. K-means was therefore computed for k = 5 and k = 2 on the full data, and for k = 5 and k = 3 on the subset.
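One commonly used way to look for a suitable k is the elbow heuristic: fit k-means over a range of k and plot the total within-cluster sum of squares. The answer does not show which methods were actually tried, so the following is only a sketch on simulated data; `sim`, `ks` and `wss` are illustrative names.

```r
# Simulated stand-in data with three groups (assumption, not the wholesale data).
set.seed(1)
sim <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 5), ncol = 2),
  matrix(rnorm(100, mean = 10), ncol = 2)
)

ks <- 1:8
set.seed(888)
# Total within-cluster sum of squares for each candidate k.
wss <- sapply(ks, function(k) kmeans(sim, centers = k, nstart = 10)$tot.withinss)

# The "elbow" is where adding another cluster stops reducing wss by much.
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

Reading the elbow is a judgment call, which is one reason different selection methods can point to different values of k.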
All customers, k = 5
Clusters 1 and 2 both have much higher average spend on Fresh products than on the other categories, although cluster 2 has higher average spend than cluster 1 in every category except Detergents_Paper. Cluster 5 contains big spenders on Fresh, Milk, Grocery, and Detergents_Paper, while cluster 4 has relatively high average spend on Milk, Grocery, and Detergents_Paper. Cluster 3, the largest, appears to contain lower spenders across most categories.
library(cluster)  # provides clusplot()
set.seed(888)
model_k5 <- kmeans(whole_cust, centers = 5)
clusplot(whole_cust, model_k5$cluster, color = TRUE, shade = TRUE, lines = 0)
print(model_k5)
## K-means clustering with 5 clusters of sizes 113, 24, 227, 71, 5
##
## Cluster means:
##       Fresh      Milk   Grocery    Frozen Detergents_Paper Delicassen
## 1 20600.283  3787.832  5089.841  3989.071         1130.142   1639.071
## 2 48777.375  6607.375  6197.792  9462.792          932.125   4435.333
## 3  5655.819  3567.793  4513.040  2386.529         1437.559   1005.031
## 4  5207.831 13191.028 20321.718  1674.028         9036.380   1937.944
## 5 25603.000 43460.600 61472.200  2636.000        29974.200   2708.800
##
## Clustering vector:
## [1] 3 3 3 1 1 3 3 3 3 4 3 3 1 1 1 3 3 3 1 3 1 3 1 4 1 1 3 1 4 2 1 3 1 1 3
## [36] 3 1 1 4 2 1 1 4 4 3 4 4 5 3 4 3 3 2 4 1 3 4 4 1 3 3 5 3 4 3 4 3 1 3 3
## [71] 1 1 3 1 3 1 3 4 3 3 3 4 3 1 3 5 5 2 3 1 3 1 4 1 4 3 3 3 3 3 4 4 3 2 1
## [106] 1 3 4 3 4 3 4 1 1 1 3 3 3 1 3 1 3 3 3 2 2 1 1 3 2 3 3 1 3 3 3 3 3 1 3
## [141] 1 1 2 3 1 4 3 3 3 1 1 3 1 3 3 4 4 1 3 4 3 3 1 4 3 4 3 3 3 3 4 4 3 4 3
## [176] 3 2 3 3 3 3 2 3 2 3 3 3 3 3 4 1 1 3 4 3 1 1 3 3 3 4 4 1 3 3 4 3 3 3 4
## [211] 1 4 3 3 3 4 4 1 4 3 1 3 3 3 3 3 1 3 3 3 3 3 1 3 1 3 3 1 3 2 1 1 1 3 3
## [246] 4 3 1 1 3 3 4 3 1 3 1 3 3 2 2 3 3 1 3 4 4 4 1 4 1 3 3 3 2 3 3 1 3 3 1
## [281] 3 3 2 1 2 2 3 1 1 2 3 3 3 4 1 3 1 3 3 3 1 4 3 4 4 3 4 1 3 4 3 1 4 3 3
## [316] 4 3 3 3 4 3 3 1 1 1 2 3 3 1 3 3 4 1 5 1 1 1 3 3 3 3 3 3 4 3 3 4 1 3 4
## [351] 3 4 3 4 1 3 1 4 3 3 1 3 3 3 3 3 3 3 1 3 2 1 3 1 3 3 4 2 3 3 1 1 1 3 4
## [386] 3 3 1 3 3 3 3 3 1 3 3 3 3 3 3 3 1 1 1 1 3 1 4 3 3 3 3 3 3 3 3 4 3 4 3
## [421] 3 1 1 1 1 3 4 1 3 3 3 3 1 3 1 1 2 4 3 3
##
## Within cluster sum of squares by cluster:
## [1] 9394958498 16226867469 10804478229 11008166107 5682449098
## (between_SS / total_SS = 66.3 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
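Statements such as "relatively higher average spend" are easier to check when each cluster mean is expressed relative to the overall mean of that category. A minimal sketch, assuming a fitted kmeans object and its input data; the simulated `spend` data frame is a stand-in, not the wholesale data, and `relative_centers` is an illustrative name:

```r
# Simulated stand-in for spending data (assumption): skewed, positive values.
set.seed(1)
spend <- data.frame(
  Fresh   = rlnorm(200, meanlog = 9,   sdlog = 1),
  Milk    = rlnorm(200, meanlog = 8,   sdlog = 1),
  Grocery = rlnorm(200, meanlog = 8.5, sdlog = 1)
)

set.seed(888)
fit <- kmeans(spend, centers = 3, nstart = 10)

# Divide each cluster mean by the overall column mean:
# values above 1 mean the cluster spends more than the average customer.
relative_centers <- sweep(fit$centers, 2, colMeans(spend), "/")
round(relative_centers, 2)
```

With the real data, the same two lines could be applied to model_k5$centers and colMeans(whole_cust).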
All customers, k = 2
Cluster 2, the smaller cluster, appears to contain bigger spenders who buy a lot of Fresh products. Cluster 1 is everyone else.
set.seed(888)
model_k2 <- kmeans(whole_cust, centers = 2)
clusplot(whole_cust, model_k2$cluster, color = TRUE, shade = TRUE, lines = 0)
print(model_k2)
## K-means clustering with 2 clusters of sizes 375, 65
##
## Cluster means:
##       Fresh     Milk   Grocery   Frozen Detergents_Paper Delicassen
## 1  7944.112 5151.819  7536.128 2484.131         2872.557   1214.261
## 2 35401.369 9514.231 10346.369 6463.092         2933.046   3316.846
##
## Clustering vector:
## [1] 1 1 1 1 2 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 2 2 2 1 1 1 1 2 1 1 1 2 1
## [36] 1 2 1 1 2 2 1 1 1 1 1 1 2 1 1 1 1 2 1 2 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1
## [71] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1
## [106] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1
## [141] 1 2 2 1 1 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [176] 1 2 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1
## [211] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 2 2 2 1 1 1
## [246] 1 1 1 1 1 1 1 1 2 1 2 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1 1
## [281] 1 1 2 2 2 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1
## [316] 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 2 1 1
## [351] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 2 1 1 2 1 2 1 1
## [386] 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2 2 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1
## [421] 1 1 2 1 1 1 1 2 1 1 1 1 1 1 1 2 2 1 1 1
##
## Within cluster sum of squares by cluster:
## [1] 60341401922 52876126599
## (between_SS / total_SS = 28.2 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
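The percentage printed as between_SS / total_SS is the share of the total sum of squares explained by the cluster assignment, and it can be recomputed from the fitted object. Because it generally rises as k grows, the 28.2 % for k = 2 and the 66.3 % for k = 5 are not directly comparable as measures of quality. A sketch on simulated data (an assumption, not the wholesale data; `explained` is a hypothetical helper):

```r
set.seed(1)
sim <- matrix(rnorm(400), ncol = 2)  # simulated stand-in data (assumption)

set.seed(888)
fit2 <- kmeans(sim, centers = 2, nstart = 10)
fit5 <- kmeans(sim, centers = 5, nstart = 10)

# Share of the total sum of squares explained by the clustering.
explained <- function(fit) fit$betweenss / fit$totss

# totss = betweenss + tot.withinss, so this also equals 1 - tot.withinss / totss.
c(k2 = explained(fit2), k5 = explained(fit5))
```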
Subset excluding top 10 customers, k = 5
Again, cluster 1 spends more on average on Fresh products. Cluster 2 contains big Grocery and Detergents_Paper spenders. Cluster 3, again the largest, appears to contain lower spenders across most categories. Cluster 5 appears to contain more moderate spenders across most categories, and cluster 4 spends more on average on Milk, Grocery, and Delicassen.
set.seed(888)
model_k5_2 <- kmeans(whole_cust_rm_top, centers = 5)
clusplot(whole_cust_rm_top, model_k5_2$cluster, color = TRUE, shade = TRUE, lines = 0)
print(model_k5_2)
## K-means clustering with 5 clusters of sizes 102, 23, 179, 27, 73
##
## Cluster means:
##       Fresh      Milk   Grocery    Frozen Detergents_Paper Delicassen
## 1 23761.225  3828.647  4985.765 3625.6863         1109.578  1508.7157
## 2  1278.174  7734.957 18777.870  879.7826         8855.043  1218.0870
## 3  6908.112  2271.648  2818.939 2521.4302          604.838   894.6201
## 4  7282.074 15539.481 21090.852 2149.8889         8582.667  1637.1111
## 5  4703.890  7474.808  9551.699 1310.5616         3996.521  1232.0137
##
## Clustering vector:
## [1] 5 5 3 1 5 3 5 3 4 5 3 1 1 1 3 5 3 1 5 1 3 1 1 1 3 3 4 1 1 3 1 1 3 5 1
## [36] 5 4 1 1 5 2 5 4 4 5 4 3 5 1 5 1 3 5 3 5 5 5 4 3 5 1 3 3 1 3 1 5 1 3 4
## [71] 3 3 3 2 5 1 3 3 1 3 3 5 3 5 3 3 3 5 2 5 3 3 5 4 5 4 3 4 1 3 1 3 3 3 1
## [106] 3 1 3 3 5 1 1 1 5 1 3 3 3 3 3 3 5 5 3 3 1 1 1 3 1 3 3 3 1 1 3 1 3 3 2
## [141] 2 1 5 2 5 3 3 4 5 4 5 3 3 3 2 5 2 3 5 3 3 3 5 5 3 3 3 5 5 5 1 3 3 2 3
## [176] 1 5 3 3 4 4 3 3 2 3 5 5 4 1 3 5 5 2 1 3 3 5 3 3 3 3 1 3 3 3 3 5 1 3 1
## [211] 3 3 1 3 1 1 1 3 5 2 3 3 1 3 3 3 1 5 1 3 3 3 3 1 3 2 4 2 1 4 3 3 3 5 1
## [246] 3 3 1 3 1 3 3 1 1 3 1 1 1 3 3 3 5 1 3 1 3 5 3 1 4 5 2 2 5 4 1 3 4 3 1
## [281] 2 3 3 5 3 3 3 4 3 3 1 3 1 3 3 1 3 3 4 1 1 1 3 3 3 5 5 5 2 3 5 2 1 3 4
## [316] 3 2 3 2 3 3 1 2 5 3 1 3 3 3 3 5 3 3 1 3 1 1 3 1 3 3 5 1 3 5 1 1 1 3 4
## [351] 3 3 1 3 3 3 3 3 1 3 3 5 3 3 3 3 1 1 1 1 3 1 4 3 3 3 3 5 3 5 5 5 2 3 5
## [386] 1 1 1 1 3 5 1 3 3 5 3 1 3 1 1 1 4 3 3
##
## Within cluster sum of squares by cluster:
## [1] 9359217997 814580736 5870158691 2116292111 2852650509
## (between_SS / total_SS = 68.5 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
Subset excluding top 10 customers, k = 3
Again, cluster 1 has high average spending in the Fresh category. Cluster 2 spends more on average on Milk, Grocery, and Detergents_Paper. The last cluster is the largest, with lower spending across most categories.
set.seed(888)
model_k3 <- kmeans(whole_cust_rm_top, centers = 3)
clusplot(whole_cust_rm_top, model_k3$cluster, color = TRUE, shade = TRUE, lines = 0)
print(model_k3)
## K-means clustering with 3 clusters of sizes 103, 90, 211
##
## Cluster means:
##       Fresh      Milk   Grocery   Frozen Detergents_Paper Delicassen
## 1 23681.864  3801.320  4975.903 3599.320        1100.1553   1507.447
## 2  4551.489 10192.667 16206.444 1409.222        7059.7556   1380.389
## 3  6543.739  2992.427  3510.227 2358.052         947.3602    932.128
##
## Clustering vector:
## [1] 3 2 3 1 3 3 3 3 2 2 3 1 1 1 3 2 3 1 3 1 3 1 1 1 3 3 2 1 1 3 1 1 3 2 1
## [36] 2 2 1 1 2 2 2 2 2 2 2 3 3 1 2 1 3 2 3 3 3 3 2 3 3 1 3 3 1 3 1 3 1 3 2
## [71] 3 3 3 2 2 1 3 3 1 3 3 2 3 3 3 3 3 2 2 3 3 1 2 2 3 2 3 2 1 3 1 3 3 3 1
## [106] 3 1 3 3 3 1 1 1 3 1 3 3 3 3 3 3 3 3 3 3 1 1 1 3 1 3 3 3 1 1 3 1 3 3 2
## [141] 2 1 2 2 2 3 3 2 3 2 2 3 3 3 2 3 2 3 2 3 3 3 3 2 3 3 3 2 2 2 1 3 3 2 3
## [176] 1 2 3 3 2 2 3 3 2 3 3 3 2 1 3 3 2 2 1 3 3 2 3 3 3 3 1 3 3 3 3 3 1 3 1
## [211] 3 3 1 3 1 1 1 3 2 2 3 3 1 3 3 3 1 3 1 3 3 3 3 1 3 2 2 2 1 2 3 3 3 3 1
## [246] 3 3 1 3 1 3 3 1 1 3 1 1 1 3 3 3 2 1 3 1 3 3 3 1 2 2 2 2 2 2 1 3 2 3 1
## [281] 2 3 3 2 3 3 3 2 3 3 1 3 1 3 3 1 3 3 2 1 1 1 3 3 3 2 2 3 2 3 3 2 1 3 2
## [316] 3 2 3 2 3 3 1 2 3 3 1 3 3 3 3 3 3 3 1 3 1 1 3 1 3 3 2 1 3 3 1 1 1 3 2
## [351] 3 3 1 3 3 3 3 3 1 3 3 2 3 3 3 3 1 1 1 1 3 1 2 3 3 3 3 2 3 3 2 2 2 3 2
## [386] 1 1 1 1 3 2 1 3 3 2 3 1 3 1 1 1 2 3 3
##
## Within cluster sum of squares by cluster:
## [1] 9442507473 7725837886 8424471476
## (between_SS / total_SS = 61.7 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
K-means Summary
While each value of k produced different clusters for the full data and the subset, there was always one cluster with much higher average spend in the Fresh category. The largest cluster always contained, on average, lower spenders in most categories.
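To address the original question of whether the clusters distinguish regions, the cluster labels can be cross-tabulated against the held-out Region column. The sketch below uses a simulated data frame whose spending column names follow the output above, plus a Region column; the values and the `sim_cust` name are assumptions, so the printed table says nothing about the real customers.

```r
# Simulated stand-in with the data set's column names (values are made up).
set.seed(1)
n <- 300
sim_cust <- data.frame(
  Region           = sample(1:3, n, replace = TRUE),
  Fresh            = rlnorm(n, 9, 1),
  Milk             = rlnorm(n, 8, 1),
  Grocery          = rlnorm(n, 8.5, 1),
  Frozen           = rlnorm(n, 7.5, 1),
  Detergents_Paper = rlnorm(n, 7, 1.5),
  Delicassen       = rlnorm(n, 6.5, 1)
)

# Cluster on the spending columns only; Region is excluded as the task requires.
spend_cols <- setdiff(names(sim_cust), "Region")
set.seed(888)
fit <- kmeans(sim_cust[, spend_cols], centers = 3, nstart = 10)

# Rows are clusters, columns are regions. If clusters tracked regions,
# each row would be concentrated in a single column.
region_table <- table(cluster = fit$cluster, region = sim_cust$Region)
region_table

# Row-wise proportions make the comparison independent of cluster size.
round(prop.table(region_table, margin = 1), 2)
```

With the real data, the same table could be built from model_k3$cluster (or any of the fitted models) and the Region values of the rows that were clustered.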