Using R Programming: Apply clustering to "Wholesale customers Data Set" and see if you can distinguish...

Using R Programming

Apply clustering to "Wholesale customers Data Set" and see if you can distinguish between regions. NOTE: the clustering should exclude the region column.

Solutions

Expert Solution

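# Packages used below: magrittr for %>%, factoextra for fviz_cluster(),
# gridExtra for grid.arrange().
library(magrittr)
library(factoextra)
library(gridExtra)

# Note: `conti` and `wholesale_log` are not defined in this excerpt.
# Each block below fits k-means for one k and stores the cluster plot.
conti %>% kmeans(2, nstart=10) %>%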
    fviz_cluster(geom = "point",
        data=wholesale_log[,c('Fresh', 'Milk', 'Grocery', 'Frozen', 'Deter_Paper', 'Delicassen')]) -> k2

conti %>% kmeans(3, nstart=10)  %>% 
    fviz_cluster(geom = "point",
        data=wholesale_log[,c('Fresh', 'Milk', 'Grocery', 'Frozen', 'Deter_Paper', 'Delicassen')]) -> k3

conti %>% kmeans(4, nstart=10)  %>% 
    fviz_cluster(geom = "point",
        data=wholesale_log[,c('Fresh', 'Milk', 'Grocery', 'Frozen', 'Deter_Paper', 'Delicassen')]) -> k4

conti %>% kmeans(5, nstart=10)  %>% 
    fviz_cluster(geom = "point",
        data=wholesale_log[,c('Fresh', 'Milk', 'Grocery', 'Frozen', 'Deter_Paper', 'Delicassen')]) -> k5

conti %>% kmeans(6, nstart=10)  %>% 
    fviz_cluster(geom = "point",
        data=wholesale_log[,c('Fresh', 'Milk', 'Grocery', 'Frozen', 'Deter_Paper', 'Delicassen')]) -> k6

conti %>% kmeans(7, nstart=10)  %>% 
    fviz_cluster(geom = "point",
        data=wholesale_log[,c('Fresh', 'Milk', 'Grocery', 'Frozen', 'Deter_Paper', 'Delicassen')]) -> k7

grid.arrange(k2, k3, k4, k5, k6, k7, nrow=3)

K-means for "Suitable" Ks

The methods tried for finding a suitable k did not agree with one another. To me, 10+ clusters (customer segments) seems too many to make much marketing sense, especially if we assume that the 440 observations represent the business's entire customer base. So k-means was computed for k = 5 and k = 2 on the full data, and for k = 5 and k = 3 on the subset.
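One common way to look for a suitable k is the elbow heuristic: fit k-means over a range of k and look at how the total within-cluster sum of squares falls. The sketch below is only an illustration and is not necessarily one of the methods used in this answer. It runs on simulated data; the `sim` data frame and its three artificial groups are assumptions made for the example, not the wholesale data.

```r
# Simulated stand-in for two spending columns (NOT the real wholesale data).
set.seed(888)
sim <- data.frame(
  Fresh   = c(rnorm(50, 5000, 800), rnorm(50, 20000, 2000), rnorm(50, 6000, 900)),
  Grocery = c(rnorm(50, 4000, 700), rnorm(50, 5000, 800),  rnorm(50, 18000, 2000))
)

# Total within-cluster sum of squares for k = 1..10.
ks  <- 1:10
wss <- sapply(ks, function(k) kmeans(sim, centers = k, nstart = 10)$tot.withinss)

# The "elbow" is the k after which extra clusters stop reducing wss by much.
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

Because the simulated data has three well-separated groups, the curve should flatten after k = 3. Real spending data rarely gives such a clean elbow, which may be one reason different methods disagree.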

All customers, k = 5

Clusters 1 and 2 both spend much more on Fresh products, on average, than on any other category, although cluster 2 has the higher average spend in every category except Detergents_Paper. Cluster 5 contains big spenders in Fresh, Milk, Grocery, and Detergents_Paper, while cluster 4 has relatively high average spend on Milk, Grocery, and Detergents_Paper. Cluster 3, the largest cluster, appears to contain lower spenders across most categories.

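library(cluster)  # provides clusplot()
set.seed(888)     # fix the random starting centers so the result is reproducible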
model_k5 <- kmeans(whole_cust, centers = 5)
clusplot(whole_cust, model_k5$cluster, color = TRUE, shade = TRUE, lines = 0)

print(model_k5)

## K-means clustering with 5 clusters of sizes 113, 24, 227, 71, 5
## 
## Cluster means:
##       Fresh      Milk   Grocery   Frozen Detergents_Paper Delicassen
## 1 20600.283  3787.832  5089.841 3989.071         1130.142   1639.071
## 2 48777.375  6607.375  6197.792 9462.792          932.125   4435.333
## 3  5655.819  3567.793  4513.040 2386.529         1437.559   1005.031
## 4  5207.831 13191.028 20321.718 1674.028         9036.380   1937.944
## 5 25603.000 43460.600 61472.200 2636.000        29974.200   2708.800
## 
## Clustering vector:
##   [1] 3 3 3 1 1 3 3 3 3 4 3 3 1 1 1 3 3 3 1 3 1 3 1 4 1 1 3 1 4 2 1 3 1 1 3
##  [36] 3 1 1 4 2 1 1 4 4 3 4 4 5 3 4 3 3 2 4 1 3 4 4 1 3 3 5 3 4 3 4 3 1 3 3
##  [71] 1 1 3 1 3 1 3 4 3 3 3 4 3 1 3 5 5 2 3 1 3 1 4 1 4 3 3 3 3 3 4 4 3 2 1
## [106] 1 3 4 3 4 3 4 1 1 1 3 3 3 1 3 1 3 3 3 2 2 1 1 3 2 3 3 1 3 3 3 3 3 1 3
## [141] 1 1 2 3 1 4 3 3 3 1 1 3 1 3 3 4 4 1 3 4 3 3 1 4 3 4 3 3 3 3 4 4 3 4 3
## [176] 3 2 3 3 3 3 2 3 2 3 3 3 3 3 4 1 1 3 4 3 1 1 3 3 3 4 4 1 3 3 4 3 3 3 4
## [211] 1 4 3 3 3 4 4 1 4 3 1 3 3 3 3 3 1 3 3 3 3 3 1 3 1 3 3 1 3 2 1 1 1 3 3
## [246] 4 3 1 1 3 3 4 3 1 3 1 3 3 2 2 3 3 1 3 4 4 4 1 4 1 3 3 3 2 3 3 1 3 3 1
## [281] 3 3 2 1 2 2 3 1 1 2 3 3 3 4 1 3 1 3 3 3 1 4 3 4 4 3 4 1 3 4 3 1 4 3 3
## [316] 4 3 3 3 4 3 3 1 1 1 2 3 3 1 3 3 4 1 5 1 1 1 3 3 3 3 3 3 4 3 3 4 1 3 4
## [351] 3 4 3 4 1 3 1 4 3 3 1 3 3 3 3 3 3 3 1 3 2 1 3 1 3 3 4 2 3 3 1 1 1 3 4
## [386] 3 3 1 3 3 3 3 3 1 3 3 3 3 3 3 3 1 1 1 1 3 1 4 3 3 3 3 3 3 3 3 4 3 4 3
## [421] 3 1 1 1 1 3 4 1 3 3 3 3 1 3 1 1 2 4 3 3
## 
## Within cluster sum of squares by cluster:
## [1]  9394958498 16226867469 10804478229 11008166107  5682449098
##  (between_SS / total_SS =  66.3 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
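The `between_SS / total_SS` percentage in the printout is the share of the total sum of squares that lies between clusters rather than within them. As a minimal sketch, it can be recomputed from the components listed under "Available components". The `sim` data frame here is a made-up stand-in, not the wholesale data.

```r
# Simulated stand-in data with two artificial groups (NOT the wholesale data).
set.seed(888)
sim <- data.frame(
  Fresh = c(rnorm(60, 5000, 800), rnorm(40, 20000, 2000)),
  Milk  = c(rnorm(60, 3000, 500), rnorm(40, 9000, 1000))
)
fit <- kmeans(sim, centers = 2, nstart = 10)

# Share of the total sum of squares that lies between clusters.
explained <- fit$betweenss / fit$totss

# totss = tot.withinss + betweenss, so this gives the same value.
explained_alt <- 1 - fit$tot.withinss / fit$totss

round(100 * explained, 1)
```

This ratio tends to rise as k increases, so a higher percentage at a larger k does not by itself show that the larger k is the better choice.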

All customers, k = 2

Cluster 2, the smaller cluster (65 customers), contains bigger spenders who buy a lot of Fresh products: their average Fresh spend is about 35,400, against about 7,900 in cluster 1. Cluster 1 is everyone else.

set.seed(888)
model_k2 <- kmeans(whole_cust, centers = 2)
clusplot(whole_cust, model_k2$cluster, color = TRUE, shade = TRUE, lines = 0)

print(model_k2)

## K-means clustering with 2 clusters of sizes 375, 65
## 
## Cluster means:
##       Fresh     Milk   Grocery   Frozen Detergents_Paper Delicassen
## 1  7944.112 5151.819  7536.128 2484.131         2872.557   1214.261
## 2 35401.369 9514.231 10346.369 6463.092         2933.046   3316.846
## 
## Clustering vector:
##   [1] 1 1 1 1 2 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 2 2 2 1 1 1 1 2 1 1 1 2 1
##  [36] 1 2 1 1 2 2 1 1 1 1 1 1 2 1 1 1 1 2 1 2 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1
##  [71] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1
## [106] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1
## [141] 1 2 2 1 1 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [176] 1 2 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1
## [211] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 2 2 2 1 1 1
## [246] 1 1 1 1 1 1 1 1 2 1 2 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1 1
## [281] 1 1 2 2 2 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1
## [316] 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 2 1 1
## [351] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 2 1 1 2 1 2 1 1
## [386] 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2 2 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1
## [421] 1 1 2 1 1 1 1 2 1 1 1 1 1 1 1 2 2 1 1 1
## 
## Within cluster sum of squares by cluster:
## [1] 60341401922 52876126599
##  (between_SS / total_SS =  28.2 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
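A description such as "bigger spenders" can be checked directly against the printed cluster means. The sketch below copies the two rows of means from the k = 2 printout above and divides cluster 2 by cluster 1; the object names `centers_k2` and `ratio` are chosen for this example only.

```r
# Cluster means copied from the k = 2 printout above.
centers_k2 <- rbind(
  "1" = c(Fresh = 7944.112, Milk = 5151.819, Grocery = 7536.128,
          Frozen = 2484.131, Detergents_Paper = 2872.557, Delicassen = 1214.261),
  "2" = c(Fresh = 35401.369, Milk = 9514.231, Grocery = 10346.369,
          Frozen = 6463.092, Detergents_Paper = 2933.046, Delicassen = 3316.846)
)

# Ratio of cluster 2's mean spend to cluster 1's, per category.
ratio <- centers_k2["2", ] / centers_k2["1", ]
round(ratio, 2)
```

The Fresh ratio comes out at roughly 4.5, while Detergents_Paper is close to 1, so the two clusters differ mainly in Fresh spending and hardly at all in Detergents_Paper. With a fitted model, the same comparison could be made on `model_k2$centers`.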

Subset excluding top 10 customers, k = 5

Again, cluster 1 spends more on Fresh products, on average, than the other clusters do. Cluster 2 contains big Grocery and Detergents_Paper spenders. Cluster 3, again the largest cluster, appears to contain lower spenders across most categories, and cluster 5 contains more moderate spenders across most categories. Cluster 4 has the highest average spend on Milk, Grocery, and Delicassen.

set.seed(888)
model_k5_2 <- kmeans(whole_cust_rm_top, centers = 5)
clusplot(whole_cust_rm_top, model_k5_2$cluster, color = TRUE, shade = TRUE, lines = 0)

print(model_k5_2)

## K-means clustering with 5 clusters of sizes 102, 23, 179, 27, 73
## 
## Cluster means:
##       Fresh      Milk   Grocery    Frozen Detergents_Paper Delicassen
## 1 23761.225  3828.647  4985.765 3625.6863         1109.578  1508.7157
## 2  1278.174  7734.957 18777.870  879.7826         8855.043  1218.0870
## 3  6908.112  2271.648  2818.939 2521.4302          604.838   894.6201
## 4  7282.074 15539.481 21090.852 2149.8889         8582.667  1637.1111
## 5  4703.890  7474.808  9551.699 1310.5616         3996.521  1232.0137
## 
## Clustering vector:
##   [1] 5 5 3 1 5 3 5 3 4 5 3 1 1 1 3 5 3 1 5 1 3 1 1 1 3 3 4 1 1 3 1 1 3 5 1
##  [36] 5 4 1 1 5 2 5 4 4 5 4 3 5 1 5 1 3 5 3 5 5 5 4 3 5 1 3 3 1 3 1 5 1 3 4
##  [71] 3 3 3 2 5 1 3 3 1 3 3 5 3 5 3 3 3 5 2 5 3 3 5 4 5 4 3 4 1 3 1 3 3 3 1
## [106] 3 1 3 3 5 1 1 1 5 1 3 3 3 3 3 3 5 5 3 3 1 1 1 3 1 3 3 3 1 1 3 1 3 3 2
## [141] 2 1 5 2 5 3 3 4 5 4 5 3 3 3 2 5 2 3 5 3 3 3 5 5 3 3 3 5 5 5 1 3 3 2 3
## [176] 1 5 3 3 4 4 3 3 2 3 5 5 4 1 3 5 5 2 1 3 3 5 3 3 3 3 1 3 3 3 3 5 1 3 1
## [211] 3 3 1 3 1 1 1 3 5 2 3 3 1 3 3 3 1 5 1 3 3 3 3 1 3 2 4 2 1 4 3 3 3 5 1
## [246] 3 3 1 3 1 3 3 1 1 3 1 1 1 3 3 3 5 1 3 1 3 5 3 1 4 5 2 2 5 4 1 3 4 3 1
## [281] 2 3 3 5 3 3 3 4 3 3 1 3 1 3 3 1 3 3 4 1 1 1 3 3 3 5 5 5 2 3 5 2 1 3 4
## [316] 3 2 3 2 3 3 1 2 5 3 1 3 3 3 3 5 3 3 1 3 1 1 3 1 3 3 5 1 3 5 1 1 1 3 4
## [351] 3 3 1 3 3 3 3 3 1 3 3 5 3 3 3 3 1 1 1 1 3 1 4 3 3 3 3 5 3 5 5 5 2 3 5
## [386] 1 1 1 1 3 5 1 3 3 5 3 1 3 1 1 1 4 3 3
## 
## Within cluster sum of squares by cluster:
## [1] 9359217997  814580736 5870158691 2116292111 2852650509
##  (between_SS / total_SS =  68.5 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

Subset excluding top 10 customers, k = 3

Again, cluster 1 has high average spending in the Fresh category. Cluster 2 spends more on average on Milk, Grocery, and Detergents_Paper. Cluster 3 is the largest (211 customers) and has lower spending across most categories.

set.seed(888)
model_k3 <- kmeans(whole_cust_rm_top, centers = 3)
clusplot(whole_cust_rm_top, model_k3$cluster, color = TRUE, shade = TRUE, lines = 0)

print(model_k3)

## K-means clustering with 3 clusters of sizes 103, 90, 211
## 
## Cluster means:
##       Fresh      Milk   Grocery   Frozen Detergents_Paper Delicassen
## 1 23681.864  3801.320  4975.903 3599.320        1100.1553   1507.447
## 2  4551.489 10192.667 16206.444 1409.222        7059.7556   1380.389
## 3  6543.739  2992.427  3510.227 2358.052         947.3602    932.128
## 
## Clustering vector:
##   [1] 3 2 3 1 3 3 3 3 2 2 3 1 1 1 3 2 3 1 3 1 3 1 1 1 3 3 2 1 1 3 1 1 3 2 1
##  [36] 2 2 1 1 2 2 2 2 2 2 2 3 3 1 2 1 3 2 3 3 3 3 2 3 3 1 3 3 1 3 1 3 1 3 2
##  [71] 3 3 3 2 2 1 3 3 1 3 3 2 3 3 3 3 3 2 2 3 3 1 2 2 3 2 3 2 1 3 1 3 3 3 1
## [106] 3 1 3 3 3 1 1 1 3 1 3 3 3 3 3 3 3 3 3 3 1 1 1 3 1 3 3 3 1 1 3 1 3 3 2
## [141] 2 1 2 2 2 3 3 2 3 2 2 3 3 3 2 3 2 3 2 3 3 3 3 2 3 3 3 2 2 2 1 3 3 2 3
## [176] 1 2 3 3 2 2 3 3 2 3 3 3 2 1 3 3 2 2 1 3 3 2 3 3 3 3 1 3 3 3 3 3 1 3 1
## [211] 3 3 1 3 1 1 1 3 2 2 3 3 1 3 3 3 1 3 1 3 3 3 3 1 3 2 2 2 1 2 3 3 3 3 1
## [246] 3 3 1 3 1 3 3 1 1 3 1 1 1 3 3 3 2 1 3 1 3 3 3 1 2 2 2 2 2 2 1 3 2 3 1
## [281] 2 3 3 2 3 3 3 2 3 3 1 3 1 3 3 1 3 3 2 1 1 1 3 3 3 2 2 3 2 3 3 2 1 3 2
## [316] 3 2 3 2 3 3 1 2 3 3 1 3 3 3 3 3 3 3 1 3 1 1 3 1 3 3 2 1 3 3 1 1 1 3 2
## [351] 3 3 1 3 3 3 3 3 1 3 3 2 3 3 3 3 1 1 1 1 3 1 2 3 3 3 3 2 3 3 2 2 2 3 2
## [386] 1 1 1 1 3 2 1 3 3 2 3 1 3 1 1 1 2 3 3
## 
## Within cluster sum of squares by cluster:
## [1] 9442507473 7725837886 8424471476
##  (between_SS / total_SS =  61.7 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

K-means Summary

Although each value of k produced different clusters for the full data and for the subset, there was always one cluster with much higher average spend in the Fresh category. The largest cluster always contained customers with lower average spend in most categories.
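The original question asks whether the clusters can distinguish between regions, a step this answer does not show. One way to check it is to cross-tabulate the cluster labels against the Region column that was held out of the clustering. The sketch below is only an illustration on simulated data: the `region` vector and the `sim` columns are hypothetical stand-ins, and the simulated spending is deliberately unrelated to region.

```r
set.seed(888)
n <- 150

# Hypothetical region labels, kept out of the clustering input.
region <- sample(1:3, n, replace = TRUE)

# Simulated spending that does not depend on region (NOT the wholesale data).
sim <- data.frame(Fresh = rexp(n, 1 / 12000), Grocery = rexp(n, 1 / 8000))

fit <- kmeans(sim, centers = 3, nstart = 10)

# Rows are clusters, columns are regions. If the clusters tracked regions,
# each row would be concentrated in a single column.
tab <- table(cluster = fit$cluster, region = region)
print(tab)
```

For the fitted models above, the equivalent would be a table of `model_k5$cluster` (or another model's labels) against the data set's Region column, assuming the rows are still in the same order as the clustered data.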

venereology answered 11 months ago
