Know Thy Customer (KTC) is a financial consulting company that provides personalized financial advice to its clients. As a basis for developing this tailored advising, KTC would like to segment its customers into several representative groups based on key characteristics. Peyton Avery, the director of KTC's fledgling analytics division, plans to establish the set of representative customer profiles based on 600 customer records in the file KnowThyCustomer. Each customer record contains data on age, gender, annual income, marital status, number of children, whether the customer has a car loan, and whether the customer has a home mortgage. KTC's market research staff has determined that these seven characteristics should form the basis of the customer clustering. Peyton has invited a summer intern, Danny Riles, into her office so they can discuss how to proceed. As they review the data on the computer screen, Peyton's brow furrows as she realizes that this task may not be trivial. The data contains both categorical variables (Female, Married, Car, Mortgage) and interval variables (Age, Income, and Children).

Managerial Report

Playing the role of Peyton, you must write a report documenting the construction of the representative customer profiles. Because Peyton would like to use this report as a training reference for interns such as Danny, your report should experiment with several approaches and explain the strengths and weaknesses of each. In particular, your report should include the following analyses:

1. Using k-means clustering on all seven variables, experiment with different values of k. Recommend a value of k and describe these k clusters according to their "average" characteristics. Why might k-means clustering not be a good method to use for these seven variables?

2. Using hierarchical clustering on all seven variables, experiment with using complete linkage and group average linkage as the clustering method. Recommend a set of customer profiles (clusters). Describe these clusters according to their "average" characteristics. Why might hierarchical clustering not be a good method to use for these seven variables?
1.1 In k-means clustering, the convention is to use the Euclidean distance between feature vectors. Here, however, each feature vector contains both qualitative variables (such as gender) and quantitative variables (such as annual income), so Euclidean distance is not meaningful. Instead, one can use the Bray-Curtis dissimilarity measure to compute the distance between feature vectors, since it can take both the qualitative and the quantitative variables into consideration.
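As a minimal sketch of this idea, the snippet below computes the Bray-Curtis dissimilarity between two hypothetical customer records. The records and their values are assumptions for illustration only: the four categorical fields are 0/1-encoded and the three numeric fields are pre-scaled to [0, 1], since Bray-Curtis is defined for non-negative data and unscaled incomes would otherwise dominate.

```python
import numpy as np
from scipy.spatial.distance import braycurtis

# Two hypothetical customer records (values are made up for illustration).
# Order: Female, Married, Car, Mortgage (0/1), then Age, Income, Children
# scaled to [0, 1] so no single variable dominates the dissimilarity.
customer_a = np.array([1, 0, 1, 1, 0.45, 0.62, 0.33])
customer_b = np.array([0, 1, 1, 0, 0.50, 0.30, 0.00])

# Bray-Curtis: sum(|u - v|) / sum(u + v), for non-negative vectors.
d = braycurtis(customer_a, customer_b)
print(round(d, 3))  # a value in [0, 1]; 0 = identical, 1 = maximally dissimilar
```

The encode-and-scale step is doing real work here: without it, mixing raw dollar incomes with 0/1 indicators would make the dissimilarity effectively a function of income alone.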
1.2 The next step is to choose the number of clusters. There is no rule of thumb that gives the exact number of centroids, so a default starting point is needed: with n data points available, one can begin with n/30 centroids. To validate the choice, split the data in a 70-30 ratio and train the k-means clustering on 70% of the data. Then assign the remaining 30% of the points to the resulting clusters. If the original centroids change drastically after these points are assigned, the number of clusters needs to change; it may be that more clusters are required.
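The holdout check described above can be sketched as follows. This is an illustration under stated assumptions, not the report's definitive procedure: the data is a synthetic stand-in for the 600 scaled customer records, and scikit-learn's KMeans uses Euclidean distance internally, so it demonstrates the 70-30 stability idea rather than the mixed-data dissimilarity discussed in 1.1.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in for the 600 customer records (7 scaled features).
X = rng.random((600, 7))

# Start with roughly n/30 centroids, as suggested above: 600 / 30 = 20.
k = len(X) // 30

X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)

# Train k-means on the 70% split.
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_train)
centroids_before = km.cluster_centers_.copy()

# Assign the held-out 30% to the trained clusters.
test_labels = km.predict(X_test)

# Refit on all points, warm-started from the old centroids, and measure
# how far each centroid moved. Large shifts suggest k should be revised.
km_full = KMeans(n_clusters=k, n_init=1, init=centroids_before).fit(X)
shift = np.linalg.norm(km_full.cluster_centers_ - centroids_before, axis=1)
print(shift.max())
```

The warm start (`init=centroids_before`) keeps the centroid indices comparable between the two fits, so the per-centroid shift is meaningful.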
1.3 The main disadvantage of k-means is that we do not know in advance how many clusters would be optimal. Also, as indicated earlier, the distance measure should be changed after examining the nature of the data, since the default Euclidean distance is unsuitable for this mix of categorical and interval variables.
2. For hierarchical clustering, group average linkage is a rather crude method and is not very helpful here. Complete linkage, on the other hand, measures the distance between two clusters as the maximum distance between their members, which is more revealing.
2.1 Hierarchical clustering can be useful because at every stage we can see the full picture, and we can stop at whichever stage gives the number of clusters we need.
However, the data may be noisy, and some of the features may add unnecessary noise, so hierarchical clustering might not be a good method either.
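The two linkage methods from section 2 can be compared in a short sketch. Again the data is a synthetic stand-in for the scaled customer matrix, the Bray-Curtis metric matches the suggestion in 1.1, and the cut into 4 clusters is an arbitrary choice for illustration; the point is that the same tree can be cut at any stage, which is the "full picture" advantage noted above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
# Synthetic stand-in for the 600 scaled customer records (7 features).
X = rng.random((600, 7))

# Pairwise Bray-Curtis dissimilarities, condensed form expected by linkage().
D = pdist(X, metric="braycurtis")

# Build one tree per linkage method, then cut each into 4 flat clusters.
for method in ("complete", "average"):
    Z = linkage(D, method=method)
    labels = fcluster(Z, t=4, criterion="maxclust")
    sizes = np.bincount(labels)[1:]  # cluster sizes (labels start at 1)
    print(method, sizes)
```

Comparing the resulting cluster sizes makes the difference concrete: average linkage often produces very unbalanced clusters on data like this, while complete linkage tends toward more compact, comparable groups.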