Using R Programming
Apply clustering to "Wholesale customers Data Set" and see if you can distinguish between regions. NOTE: the clustering should exclude the region column.
library(magrittr)    # provides the %>% pipe
library(factoextra)  # provides fviz_cluster()
library(gridExtra)   # provides grid.arrange()

# conti is assumed to hold the spending columns used for clustering (Region excluded)
feature_cols <- c('Fresh', 'Milk', 'Grocery', 'Frozen', 'Deter_Paper', 'Delicassen')

# Fit k-means for k = 2..7 and build one cluster plot per k
plots <- lapply(2:7, function(k) {
    conti %>% kmeans(k, nstart=10) %>%
        fviz_cluster(geom = "point", data=wholesale_log[, feature_cols])
})
grid.arrange(grobs = plots, nrow=3)
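The question asks whether the clusters line up with regions. One way to check is to cross-tabulate the cluster labels against the region label that was held out of the clustering. The sketch below uses synthetic data, and every name in it (`toy`, `features`, `tab`) is an illustrative assumption rather than part of the original analysis.

```r
# Synthetic stand-in for the wholesale data; every name here is illustrative.
set.seed(1)
toy <- data.frame(
  Region  = rep(1:3, each = 20),          # label we would like to recover
  Fresh   = rnorm(60, mean = 8, sd = 1),  # made-up log-scale spending
  Grocery = rnorm(60, mean = 8, sd = 1)
)

# Cluster on the spending columns only; Region is deliberately left out.
features <- toy[, setdiff(names(toy), "Region")]
fit <- kmeans(features, centers = 3, nstart = 10)

# Rows are the true regions, columns are the cluster labels.
tab <- table(Region = toy$Region, Cluster = fit$cluster)
print(tab)
```

If the clusters tracked regions, each row of the table would pile up in a single column. Rows that spread across all columns suggest that spending patterns alone do not separate the regions.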

K-means for "Suitable" Ks
The methods for finding a suitable k did not agree with one another. To me, 10 or more clusters (customer segments) seems too many to make much marketing sense, especially if we assume that the 440 observations represent the business's entire customer base. So k-means was computed for k = 5 and k = 2 on the full data, and for k = 5 and k = 3 on the subset.
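The report does not say which methods were compared. As one common example, the elbow method looks at how the total within-cluster sum of squares falls as k grows. The sketch below runs it on synthetic data with three planted groups; `toy`, `ks`, and `wss` are hypothetical names, and this is not necessarily the procedure used in the report.

```r
set.seed(888)

# Synthetic data with three well-separated groups (not the wholesale data).
toy <- rbind(
  matrix(rnorm(100, mean = 0),  ncol = 2),
  matrix(rnorm(100, mean = 6),  ncol = 2),
  matrix(rnorm(100, mean = 12), ncol = 2)
)

# Total within-cluster sum of squares for each candidate k.
ks  <- 1:8
wss <- sapply(ks, function(k) kmeans(toy, centers = k, nstart = 10)$tot.withinss)

# The "elbow" is the k after which adding clusters stops paying off.
plot(ks, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SS")
```

On real data the bend is often less clear than on planted groups, which is one reason different selection methods can disagree.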
All customers, k = 5
Clusters 1 and 2 both have a much higher average spend on Fresh products than on the other categories, although cluster 2 has a higher average spend than cluster 1 in every category except Detergents_Paper. Cluster 5 contains big spenders in Fresh, Milk, Grocery, and Detergents_Paper, while cluster 4 has a relatively high average spend on Milk, Grocery, and Detergents_Paper. Cluster 3, the largest, appears to contain lower spenders across most categories.
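library(cluster)  # provides clusplot()
set.seed(888)     # fix the random starting centers so the run is reproducible
model_k5 <- kmeans(whole_cust, centers = 5)
clusplot(whole_cust, model_k5$cluster, color = TRUE, shade = TRUE, lines = 0)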

print(model_k5)
## K-means clustering with 5 clusters of sizes 113, 24, 227, 71, 5
## 
## Cluster means:
##       Fresh      Milk   Grocery   Frozen Detergents_Paper Delicassen
## 1 20600.283  3787.832  5089.841 3989.071         1130.142   1639.071
## 2 48777.375  6607.375  6197.792 9462.792          932.125   4435.333
## 3  5655.819  3567.793  4513.040 2386.529         1437.559   1005.031
## 4  5207.831 13191.028 20321.718 1674.028         9036.380   1937.944
## 5 25603.000 43460.600 61472.200 2636.000        29974.200   2708.800
## 
## Clustering vector:
##   [1] 3 3 3 1 1 3 3 3 3 4 3 3 1 1 1 3 3 3 1 3 1 3 1 4 1 1 3 1 4 2 1 3 1 1 3
##  [36] 3 1 1 4 2 1 1 4 4 3 4 4 5 3 4 3 3 2 4 1 3 4 4 1 3 3 5 3 4 3 4 3 1 3 3
##  [71] 1 1 3 1 3 1 3 4 3 3 3 4 3 1 3 5 5 2 3 1 3 1 4 1 4 3 3 3 3 3 4 4 3 2 1
## [106] 1 3 4 3 4 3 4 1 1 1 3 3 3 1 3 1 3 3 3 2 2 1 1 3 2 3 3 1 3 3 3 3 3 1 3
## [141] 1 1 2 3 1 4 3 3 3 1 1 3 1 3 3 4 4 1 3 4 3 3 1 4 3 4 3 3 3 3 4 4 3 4 3
## [176] 3 2 3 3 3 3 2 3 2 3 3 3 3 3 4 1 1 3 4 3 1 1 3 3 3 4 4 1 3 3 4 3 3 3 4
## [211] 1 4 3 3 3 4 4 1 4 3 1 3 3 3 3 3 1 3 3 3 3 3 1 3 1 3 3 1 3 2 1 1 1 3 3
## [246] 4 3 1 1 3 3 4 3 1 3 1 3 3 2 2 3 3 1 3 4 4 4 1 4 1 3 3 3 2 3 3 1 3 3 1
## [281] 3 3 2 1 2 2 3 1 1 2 3 3 3 4 1 3 1 3 3 3 1 4 3 4 4 3 4 1 3 4 3 1 4 3 3
## [316] 4 3 3 3 4 3 3 1 1 1 2 3 3 1 3 3 4 1 5 1 1 1 3 3 3 3 3 3 4 3 3 4 1 3 4
## [351] 3 4 3 4 1 3 1 4 3 3 1 3 3 3 3 3 3 3 1 3 2 1 3 1 3 3 4 2 3 3 1 1 1 3 4
## [386] 3 3 1 3 3 3 3 3 1 3 3 3 3 3 3 3 1 1 1 1 3 1 4 3 3 3 3 3 3 3 3 4 3 4 3
## [421] 3 1 1 1 1 3 4 1 3 3 3 3 1 3 1 1 2 4 3 3
## 
## Within cluster sum of squares by cluster:
## [1]  9394958498 16226867469 10804478229 11008166107  5682449098
##  (between_SS / total_SS =  66.3 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
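The `between_SS / total_SS` percentage in the output is the share of the total sum of squares that is accounted for by separation between clusters. It can be recomputed from the components listed under "Available components". A minimal sketch on synthetic data (`toy`, `fit`, and `ratio` are illustrative names):

```r
set.seed(888)

# Two synthetic groups, standing in for real customer data.
toy <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
             matrix(rnorm(60, mean = 5), ncol = 2))
fit <- kmeans(toy, centers = 2, nstart = 10)

# The total sum of squares splits into a within-cluster and a between-cluster part:
# totss = tot.withinss + betweenss
ratio <- fit$betweenss / fit$totss

# Same quantity that print() reports as a percentage.
round(100 * ratio, 1)
```

A higher percentage means tighter, better-separated clusters, but the figure always rises as k increases, so it is not a fair way to compare models with different k on its own.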
All customers, k = 2
Cluster 2, the smaller cluster, appears to contain bigger spenders who buy a lot of Fresh products. Cluster 1 is everyone else.
set.seed(888)
model_k2 <- kmeans(whole_cust, centers = 2)
clusplot(whole_cust, model_k2$cluster, color = TRUE, shade = TRUE, lines = 0)

print(model_k2)
## K-means clustering with 2 clusters of sizes 375, 65
## 
## Cluster means:
##       Fresh     Milk   Grocery   Frozen Detergents_Paper Delicassen
## 1  7944.112 5151.819  7536.128 2484.131         2872.557   1214.261
## 2 35401.369 9514.231 10346.369 6463.092         2933.046   3316.846
## 
## Clustering vector:
##   [1] 1 1 1 1 2 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 2 2 2 1 1 1 1 2 1 1 1 2 1
##  [36] 1 2 1 1 2 2 1 1 1 1 1 1 2 1 1 1 1 2 1 2 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1
##  [71] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1
## [106] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1
## [141] 1 2 2 1 1 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [176] 1 2 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1
## [211] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 2 2 2 1 1 1
## [246] 1 1 1 1 1 1 1 1 2 1 2 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1 1
## [281] 1 1 2 2 2 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1
## [316] 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 2 1 1
## [351] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 2 1 1 2 1 2 1 1
## [386] 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2 2 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1
## [421] 1 1 2 1 1 1 1 2 1 1 1 1 1 1 1 2 2 1 1 1
## 
## Within cluster sum of squares by cluster:
## [1] 60341401922 52876126599
##  (between_SS / total_SS =  28.2 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
Subset excluding top 10 customers, k = 5
Again, cluster 1 spends more on average on Fresh products. Cluster 2 contains big Grocery and Detergents_Paper spenders. Cluster 3, again the largest, appears to contain lower spenders across most categories, and cluster 5 more moderate spenders. Cluster 4 spends more on average on Milk, Grocery, and Delicassen.
set.seed(888)
model_k5_2 <- kmeans(whole_cust_rm_top, centers = 5)
clusplot(whole_cust_rm_top, model_k5_2$cluster, color = TRUE, shade = TRUE, lines = 0)

print(model_k5_2)
## K-means clustering with 5 clusters of sizes 102, 23, 179, 27, 73
## 
## Cluster means:
##       Fresh      Milk   Grocery    Frozen Detergents_Paper Delicassen
## 1 23761.225  3828.647  4985.765 3625.6863         1109.578  1508.7157
## 2  1278.174  7734.957 18777.870  879.7826         8855.043  1218.0870
## 3  6908.112  2271.648  2818.939 2521.4302          604.838   894.6201
## 4  7282.074 15539.481 21090.852 2149.8889         8582.667  1637.1111
## 5  4703.890  7474.808  9551.699 1310.5616         3996.521  1232.0137
## 
## Clustering vector:
##   [1] 5 5 3 1 5 3 5 3 4 5 3 1 1 1 3 5 3 1 5 1 3 1 1 1 3 3 4 1 1 3 1 1 3 5 1
##  [36] 5 4 1 1 5 2 5 4 4 5 4 3 5 1 5 1 3 5 3 5 5 5 4 3 5 1 3 3 1 3 1 5 1 3 4
##  [71] 3 3 3 2 5 1 3 3 1 3 3 5 3 5 3 3 3 5 2 5 3 3 5 4 5 4 3 4 1 3 1 3 3 3 1
## [106] 3 1 3 3 5 1 1 1 5 1 3 3 3 3 3 3 5 5 3 3 1 1 1 3 1 3 3 3 1 1 3 1 3 3 2
## [141] 2 1 5 2 5 3 3 4 5 4 5 3 3 3 2 5 2 3 5 3 3 3 5 5 3 3 3 5 5 5 1 3 3 2 3
## [176] 1 5 3 3 4 4 3 3 2 3 5 5 4 1 3 5 5 2 1 3 3 5 3 3 3 3 1 3 3 3 3 5 1 3 1
## [211] 3 3 1 3 1 1 1 3 5 2 3 3 1 3 3 3 1 5 1 3 3 3 3 1 3 2 4 2 1 4 3 3 3 5 1
## [246] 3 3 1 3 1 3 3 1 1 3 1 1 1 3 3 3 5 1 3 1 3 5 3 1 4 5 2 2 5 4 1 3 4 3 1
## [281] 2 3 3 5 3 3 3 4 3 3 1 3 1 3 3 1 3 3 4 1 1 1 3 3 3 5 5 5 2 3 5 2 1 3 4
## [316] 3 2 3 2 3 3 1 2 5 3 1 3 3 3 3 5 3 3 1 3 1 1 3 1 3 3 5 1 3 5 1 1 1 3 4
## [351] 3 3 1 3 3 3 3 3 1 3 3 5 3 3 3 3 1 1 1 1 3 1 4 3 3 3 3 5 3 5 5 5 2 3 5
## [386] 1 1 1 1 3 5 1 3 3 5 3 1 3 1 1 1 4 3 3
## 
## Within cluster sum of squares by cluster:
## [1] 9359217997  814580736 5870158691 2116292111 2852650509
##  (between_SS / total_SS =  68.5 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
Subset excluding top 10 customers, k = 3
Again, cluster 1 has a high average spend in the Fresh category. Cluster 2 spends more on average on Milk, Grocery, and Detergents_Paper. Cluster 3 is the largest and has lower spending across most categories.
set.seed(888)
model_k3 <- kmeans(whole_cust_rm_top, centers = 3)
clusplot(whole_cust_rm_top, model_k3$cluster, color = TRUE, shade = TRUE, lines = 0)

print(model_k3)
## K-means clustering with 3 clusters of sizes 103, 90, 211
## 
## Cluster means:
##       Fresh      Milk   Grocery   Frozen Detergents_Paper Delicassen
## 1 23681.864  3801.320  4975.903 3599.320        1100.1553   1507.447
## 2  4551.489 10192.667 16206.444 1409.222        7059.7556   1380.389
## 3  6543.739  2992.427  3510.227 2358.052         947.3602    932.128
## 
## Clustering vector:
##   [1] 3 2 3 1 3 3 3 3 2 2 3 1 1 1 3 2 3 1 3 1 3 1 1 1 3 3 2 1 1 3 1 1 3 2 1
##  [36] 2 2 1 1 2 2 2 2 2 2 2 3 3 1 2 1 3 2 3 3 3 3 2 3 3 1 3 3 1 3 1 3 1 3 2
##  [71] 3 3 3 2 2 1 3 3 1 3 3 2 3 3 3 3 3 2 2 3 3 1 2 2 3 2 3 2 1 3 1 3 3 3 1
## [106] 3 1 3 3 3 1 1 1 3 1 3 3 3 3 3 3 3 3 3 3 1 1 1 3 1 3 3 3 1 1 3 1 3 3 2
## [141] 2 1 2 2 2 3 3 2 3 2 2 3 3 3 2 3 2 3 2 3 3 3 3 2 3 3 3 2 2 2 1 3 3 2 3
## [176] 1 2 3 3 2 2 3 3 2 3 3 3 2 1 3 3 2 2 1 3 3 2 3 3 3 3 1 3 3 3 3 3 1 3 1
## [211] 3 3 1 3 1 1 1 3 2 2 3 3 1 3 3 3 1 3 1 3 3 3 3 1 3 2 2 2 1 2 3 3 3 3 1
## [246] 3 3 1 3 1 3 3 1 1 3 1 1 1 3 3 3 2 1 3 1 3 3 3 1 2 2 2 2 2 2 1 3 2 3 1
## [281] 2 3 3 2 3 3 3 2 3 3 1 3 1 3 3 1 3 3 2 1 1 1 3 3 3 2 2 3 2 3 3 2 1 3 2
## [316] 3 2 3 2 3 3 1 2 3 3 1 3 3 3 3 3 3 3 1 3 1 1 3 1 3 3 2 1 3 3 1 1 1 3 2
## [351] 3 3 1 3 3 3 3 3 1 3 3 2 3 3 3 3 1 1 1 1 3 1 2 3 3 3 3 2 3 3 2 2 2 3 2
## [386] 1 1 1 1 3 2 1 3 3 2 3 1 3 1 1 1 2 3 3
## 
## Within cluster sum of squares by cluster:
## [1] 9442507473 7725837886 8424471476
##  (between_SS / total_SS =  61.7 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
K-means Summary
Although each value of k produced different clusters for the full data and the subset, there was always one cluster with a much higher average spend in the Fresh category, and the largest cluster always contained customers who spent less on average in most categories.
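Both observations can be checked directly from a model's cluster means and sizes. As a sketch, the block below re-enters the means and sizes printed for the full-data k = 5 model and tests them. The variable names are illustrative, and with a fitted model the same values would come from its `centers` and `size` components.

```r
# Cluster means and sizes copied from the k = 5 output on the full data.
centers <- matrix(c(
  20600.283,  3787.832,  5089.841, 3989.071,  1130.142, 1639.071,
  48777.375,  6607.375,  6197.792, 9462.792,   932.125, 4435.333,
   5655.819,  3567.793,  4513.040, 2386.529,  1437.559, 1005.031,
   5207.831, 13191.028, 20321.718, 1674.028,  9036.380, 1937.944,
  25603.000, 43460.600, 61472.200, 2636.000, 29974.200, 2708.800
), nrow = 5, byrow = TRUE,
dimnames = list(as.character(1:5),
                c("Fresh", "Milk", "Grocery", "Frozen",
                  "Detergents_Paper", "Delicassen")))
sizes <- c(113, 24, 227, 71, 5)

fresh_heavy <- which.max(centers[, "Fresh"])  # cluster with the highest Fresh mean (2)
largest     <- which.max(sizes)               # most populous cluster (3)

# The size-weighted average of the cluster means recovers the overall column means.
overall <- colSums(centers * sizes) / sum(sizes)

# TRUE for each category where the largest cluster spends less than average.
below <- centers[largest, ] < overall
print(below)
```

For this model the largest cluster falls below the overall average in all six categories, which matches the summary.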