Customer | Age | Female | Income | Married | Children | Loan | Mortgage |
--- | --- | --- | --- | --- | --- | --- | --- |
A | 29 | 0 | 12623.4 | 1 | 1 | 1 | 0 |
B | 25 | 0 | 23818.6 | 1 | 0 | 0 | 0 |
C | 40 | 1 | 31473.9 | 0 | 2 | 0 | 1 |
D | 48 | 0 | 20268 | 1 | 0 | 0 | 0 |
E | 65 | 0 | 51417 | 1 | 2 | 0 | 0 |
F | 59 | 1 | 30971.8 | 1 | 3 | 1 | 1 |
G | 61 | 1 | 47025 | 0 | 2 | 1 | 1 |
H | 30 | 1 | 9672.25 | 1 | 0 | 1 | 0 |
I | 31 | 1 | 15976.3 | 1 | 0 | 1 | 0 |
J | 29 | 0 | 14711.8 | 1 | 0 | 0 | 1 |
Know Thy Customer (KTC) is a financial consulting company that provides personalized financial advice to its clients. As a basis for developing this tailored advising, KTC would like to segment its customers into several representative groups based on key characteristics. Peyton Blake, the director of KTC’s fledgling analytics division, plans to establish the set of representative customer profiles based on 600 customer records in the file KnowThyCustomer. Each customer record contains data on age, gender, annual income, marital status, number of children, whether the customer has a car loan, and whether the customer has a home mortgage. KTC’s market research staff has determined that these seven characteristics should form the basis of the customer clustering.
The data contain both categorical variables (Female, Married, Loan, and Mortgage) and numerical variables (Age, Income, and Children).
Question:
Use the hierarchical clustering method to analyze the selected data set (hints: consider only the numerical variables; normalize the variables by calculating z-scores before conducting the cluster analysis).
Answer:
Cluster Analysis of Observations: Age, Income, Children
Standardized Variables, Euclidean Distance, Complete Linkage
Amalgamation Steps
At each step in the amalgamation process, view the clusters that are formed and examine their similarity and distance levels. The higher the similarity level, the more similar the observations are in each cluster. The lower the distance level, the closer the observations are in each cluster.
Ideally, the clusters should have a relatively high similarity level and a relatively low distance level. However, you must balance that goal with having a reasonable and practical number of clusters.
Key Results: Similarity level, Distance level
In these results, the data contain a total of 10 observations. In step 1, two clusters (observations 9 and 10 in the worksheet) are joined to form a new cluster. This step creates 9 clusters in the data, with a similarity level of 96.1429 and a distance level of 0.15756. Although the similarity level is high and the distance level is low, the number of clusters is too high to be useful. At each subsequent step, as new clusters are formed, the similarity level decreases and the distance level increases. At the final step, all the observations are joined into a single cluster.
Step | Number of clusters | Similarity level | Distance level | Clusters joined | | New cluster | Number of obs. in new cluster |
--- | --- | --- | --- | --- | --- | --- | --- |
1 | 9 | 96.1429 | 0.15756 | 9 | 10 | 9 | 2 |
2 | 8 | 90.1496 | 0.40238 | 5 | 7 | 5 | 2 |
3 | 7 | 89.1278 | 0.44413 | 8 | 9 | 8 | 3 |
4 | 6 | 77.8102 | 0.90645 | 1 | 8 | 1 | 4 |
5 | 5 | 70.7584 | 1.19451 | 1 | 2 | 1 | 5 |
6 | 4 | 62.9473 | 1.51359 | 3 | 6 | 3 | 2 |
7 | 3 | 60.7288 | 1.60422 | 1 | 4 | 1 | 6 |
8 | 2 | 47.5129 | 2.14409 | 3 | 5 | 3 | 4 |
9 | 1 | 0.0000 | 4.08498 | 1 | 3 | 1 | 10 |
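A sketch of how these amalgamation steps can be reproduced in Python with scipy, using the 10-row sample above. Minitab's similarity level is 100 × (1 − merge distance / largest merge distance); note that different packages' standardization and scaling conventions differ, so the printed distances may not line up exactly with the table above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# The 10-row sample from the table above (Age, Income, Children)
X = np.array([
    [29, 12623.40, 1], [25, 23818.60, 0], [40, 31473.90, 2],
    [48, 20268.00, 0], [65, 51417.00, 2], [59, 30971.80, 3],
    [61, 47025.00, 2], [30,  9672.25, 0], [31, 15976.30, 0],
    [29, 14711.80, 1],
])

# Standardize each variable to z-scores (sample standard deviation, ddof=1)
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Complete-linkage hierarchical clustering on Euclidean distances
merges = linkage(Z, method="complete", metric="euclidean")

# Each row of `merges`: (cluster i, cluster j, merge distance, new cluster size)
d_max = merges[-1, 2]  # distance at the final merge
for step, (i, j, dist, size) in enumerate(merges, start=1):
    similarity = 100 * (1 - dist / d_max)  # Minitab-style similarity level
    print(f"Step {step}: join {int(i)} and {int(j)} at distance {dist:.5f} "
          f"(similarity {similarity:.4f}, {int(size)} observations)")

# Cut the tree to obtain a 2-cluster final partition
labels = fcluster(merges, t=2, criterion="maxclust")
print(labels)
```

Because complete linkage is monotone, the merge distances are non-decreasing from step to step, which is why the similarity column always decreases.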
Final Partition
Cluster | Number of observations | Within cluster sum of squares | Average distance from centroid | Maximum distance from centroid |
--- | --- | --- | --- | --- |
Cluster1 | 6 | 2.66623 | 0.585548 | 1.09268 |
Cluster2 | 4 | 3.76429 | 0.943228 | 1.24289 |
Cluster Centroids
Variable | Cluster1 | Cluster2 | Grand centroid |
--- | --- | --- | --- |
Age | -0.633492 | 0.95024 | 0.0000000 |
Income | -0.670189 | 1.00528 | 0.0000000 |
Children | -0.721688 | 1.08253 | 0.0000000 |
Distances Between Cluster Centroids
| | Cluster1 | Cluster2 |
| --- | --- | --- |
Cluster1 | 0.00000 | 2.92756 |
Cluster2 | 2.92756 | 0.00000 |
Dendrogram
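The dendrogram figure itself is not reproduced in this text; one way to draw it from the same standardized sample is sketched below with scipy and matplotlib (the output file name `dendrogram.png` is an arbitrary choice).

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to display interactively
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Same 10-row sample (Age, Income, Children), standardized to z-scores
X = np.array([
    [29, 12623.40, 1], [25, 23818.60, 0], [40, 31473.90, 2],
    [48, 20268.00, 0], [65, 51417.00, 2], [59, 30971.80, 3],
    [61, 47025.00, 2], [30,  9672.25, 0], [31, 15976.30, 0],
    [29, 14711.80, 1],
])
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

fig, ax = plt.subplots(figsize=(7, 4))
info = dendrogram(linkage(Z, method="complete"),
                  labels=list("ABCDEFGHIJ"), ax=ax)
ax.set_xlabel("Customer")
ax.set_ylabel("Merge distance")
fig.savefig("dendrogram.png")  # arbitrary output file name
```

Cutting this tree at a given height corresponds to choosing a final partition, as discussed for the three-cluster solution below.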
Question:
For the selected data set, apply k-means clustering using Age, Income, and Children as the input variables, normalizing their values. This will generate a total of two clusters. Describe these two clusters of clients according to their “average” characteristics.
Answer:
K-means Cluster Analysis: Age, Income, Children
Method
Number of clusters | 2 |
Standardized variables | Yes |
Final Partition
Cluster | Number of observations | Within cluster sum of squares | Average distance from centroid | Maximum distance from centroid |
--- | --- | --- | --- | --- |
Cluster1 | 4 | 3.764 | 0.943 | 1.243 |
Cluster2 | 6 | 2.666 | 0.586 | 1.093 |
Cluster Centroids
Variable | Cluster1 | Cluster2 | Grand centroid |
--- | --- | --- | --- |
Age | 0.9502 | -0.6335 | 0.0000 |
Income | 1.0053 | -0.6702 | 0.0000 |
Children | 1.0825 | -0.7217 | 0.0000 |
Distances Between Cluster Centroids
| | Cluster1 | Cluster2 |
| --- | --- | --- |
Cluster1 | 0.0000 | 2.9276 |
Cluster2 | 2.9276 | 0.0000 |
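A sketch of the same k = 2 analysis with scikit-learn on the 10-row sample. Cluster numbering in k-means is arbitrary, so sklearn's clusters 0 and 1 may correspond to either Cluster1 or Cluster2 in the tables above.

```python
import numpy as np
from sklearn.cluster import KMeans

# 10-row sample (Age, Income, Children), standardized to z-scores
X = np.array([
    [29, 12623.40, 1], [25, 23818.60, 0], [40, 31473.90, 2],
    [48, 20268.00, 0], [65, 51417.00, 2], [59, 30971.80, 3],
    [61, 47025.00, 2], [30,  9672.25, 0], [31, 15976.30, 0],
    [29, 14711.80, 1],
])
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(Z)

# Standardized centroids, comparable to the "Cluster Centroids" table above
print(km.cluster_centers_)

# Describe each cluster by its average characteristics in the original units
for k in range(2):
    mean_age, mean_income, mean_children = X[km.labels_ == k].mean(axis=0)
    print(f"Cluster {k}: mean age {mean_age:.1f}, "
          f"mean income {mean_income:.0f}, mean children {mean_children:.1f}")
```

Reading the centroid table above: one cluster (4 customers) sits above the grand centroid on all three variables, i.e., older, higher-income customers with more children; the other (6 customers) sits below it on all three, i.e., younger, lower-income customers with fewer children.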
Question:
For Question 3, set k = 3 (i.e., three clusters) and rerun the cluster analysis. Compare the clustering results (k = 2 vs. k = 3) and discuss which, in your opinion, is better for describing customers’ average characteristics and designing marketing segmentation.
Answer:
K-means Cluster Analysis: Age, Income, Children
Method
Number of clusters | 3 |
Standardized variables | Yes |
Final Partition
Cluster | Number of observations | Within cluster sum of squares | Average distance from centroid | Maximum distance from centroid |
--- | --- | --- | --- | --- |
Cluster1 | 2 | 0.398 | 0.446 | 0.446 |
Cluster2 | 4 | 1.569 | 0.562 | 0.970 |
Cluster3 | 4 | 3.764 | 0.943 | 1.243 |
Cluster Centroids
Variable | Cluster1 | Cluster2 | Cluster3 | Grand centroid |
--- | --- | --- | --- | --- |
Age | -0.7968 | -0.5519 | 0.9502 | 0.0000 |
Income | -1.0207 | -0.4949 | 1.0053 | 0.0000 |
Children | -0.4330 | -0.8660 | 1.0825 | 0.0000 |
Distances Between Cluster Centroids
| | Cluster1 | Cluster2 | Cluster3 |
| --- | --- | --- | --- |
Cluster1 | 0.0000 | 0.7239 | 3.0747 |
Cluster2 | 0.7239 | 0.0000 | 2.8816 |
Cluster3 | 3.0747 | 2.8816 | 0.0000 |
Amalgamation Steps
Step | Number of clusters | Similarity level | Distance level | Clusters joined | | New cluster | Number of obs. in new cluster |
--- | --- | --- | --- | --- | --- | --- | --- |
1 | 9 | 96.1429 | 0.15756 | 9 | 10 | 9 | 2 |
2 | 8 | 90.1496 | 0.40238 | 5 | 7 | 5 | 2 |
3 | 7 | 89.1278 | 0.44413 | 8 | 9 | 8 | 3 |
4 | 6 | 77.8102 | 0.90645 | 1 | 8 | 1 | 4 |
5 | 5 | 70.7584 | 1.19451 | 1 | 2 | 1 | 5 |
6 | 4 | 62.9473 | 1.51359 | 3 | 6 | 3 | 2 |
7 | 3 | 60.7288 | 1.60422 | 1 | 4 | 1 | 6 |
8 | 2 | 47.5129 | 2.14409 | 3 | 5 | 3 | 4 |
9 | 1 | 0.0000 | 4.08498 | 1 | 3 | 1 | 10 |
The similarity level decreases by more than 13 (from 60.7288 to 47.5129) between steps 7 and 8, when the number of clusters changes from 3 to 2. These results indicate that 3 clusters may be sufficient for the final partition. If this grouping makes intuitive sense, it is probably a good choice.
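The same comparison can be made numerically for k-means by checking how the total within-cluster sum of squares drops as k increases; the point where the drop levels off (the "elbow") suggests a reasonable number of clusters. A sketch with scikit-learn on the 10-row sample:

```python
import numpy as np
from sklearn.cluster import KMeans

# 10-row sample (Age, Income, Children), standardized to z-scores
X = np.array([
    [29, 12623.40, 1], [25, 23818.60, 0], [40, 31473.90, 2],
    [48, 20268.00, 0], [65, 51417.00, 2], [59, 30971.80, 3],
    [61, 47025.00, 2], [30,  9672.25, 0], [31, 15976.30, 0],
    [29, 14711.80, 1],
])
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

inertias = []
for k in range(1, 5):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(Z)
    inertias.append(km.inertia_)  # total within-cluster sum of squares for this k
    print(f"k = {k}: total within-cluster SS = {km.inertia_:.3f}")
```

With z-scored data and ddof = 1, the k = 1 value equals (n − 1) × number of variables = 27; the sizes of the subsequent drops show how much each additional cluster buys.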
This dendrogram was created using a final partition of 3 clusters, which occurs at a similarity level of approximately 60. The first cluster (far left) is composed of two observations (rows 5 and 7 of the worksheet). The second cluster, directly to the right, is composed of six observations (rows 1, 2, 4, 8, 9, and 10). The third cluster is composed of two observations (rows 3 and 6). If you cut the dendrogram higher, there would be fewer final clusters, but their similarity level would be lower. If you cut the dendrogram lower, the similarity level would be higher, but there would be more final clusters.
The k = 3 solution is better for describing customers’ average characteristics and for designing marketing segmentation: it splits the younger, lower-income group into two finer profiles (one with very low income, one with almost no children) while preserving the older, higher-income, larger-family segment intact.