Know Thy Customer (KTC) is a financial consulting company that provides personalized financial advice to its clients. As a basis for developing this tailored advising, KTC would like to segment its customers into several representative groups based on key characteristics. Peyton Avery, the director of KTC’s fledgling analytics division, plans to establish the set of representative customer profiles based on 600 customer records in the file KnowThyCustomer. Each customer record contains data on age, gender, annual income, marital status, number of children, whether the customer has a car loan, and whether the customer has a home mortgage. KTC’s market research staff has determined that these seven characteristics should form the basis of the customer clustering.
Peyton has invited a summer intern, Danny Riles, into her office so they can discuss how to proceed. As they review the data on the computer screen, Peyton’s brow furrows as she realizes that this task may not be trivial. The data contains both categorical variables (Female, Married, Car, Mortgage) and interval variables (Age, Income, and Children).
Managerial Report
Playing the role of Peyton, you must write a report documenting the construction of the representative customer profiles. Because Peyton would like to use this report as a training reference for interns such as Danny, your report should experiment with several approaches and explain the strengths and weaknesses of each. In particular, your report should include the following analyses:
1. Using k-means clustering on all seven variables, experiment with different values of k. Recommend a value of k and describe these k clusters according to their “average” characteristics. Why might k-means clustering not be a good method to use for these seven variables?
2. Using hierarchical clustering on all seven variables, experiment with using complete linkage and group average linkage as the clustering method. Recommend a set of customer profiles (clusters). Describe these clusters according to their “average” characteristics. Why might hierarchical clustering not be a good method to use for these seven variables?
3. Apply a two-step clustering method:
a. Apply hierarchical clustering on the binary variables Female, Married, Car, and Mortgage to recommend a set of clusters. Use matching coefficients as the similarity measure and group average linkage as the clustering method.
b. Based on the clusters from part (a), split the 600 observations into m separate data sets, where m is the number of clusters recommended from part (a). For each of these m data sets, apply 2-means clustering using Age, Income, and Children as variables. This will generate a total of 2m clusters. Describe these 2m clusters according to their “average” characteristics. What benefit does this two-step clustering approach have over the approaches in parts (1) and (2)? What weakness does it have?
Solution:
1.1 In k-means clustering, the convention is to use the Euclidean distance between feature vectors. Here, however, the feature vectors consist of both qualitative and quantitative variables, so plain Euclidean distance is not well suited: a distance computed across 0/1 codes and raw dollar amounts has little meaning. Instead, one can use a dissimilarity measure that takes both kinds of variable into account, for example gender (qualitative) and annual income (quantitative). The Bray-Curtis dissimilarity is one option suggested here; Gower's distance is another common choice for mixed data.
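The point about Euclidean distance on mixed variables can be illustrated with a small sketch. The three customers and the variable ranges below are hypothetical, not taken from the KnowThyCustomer file, and the mixed-data measure shown is a simple Gower-style distance (range-scaled differences for numeric variables, mismatch counts for binary ones), used here as a stand-in rather than the Bray-Curtis measure mentioned above.

```python
import math

# Hypothetical customers, NOT from the KnowThyCustomer file:
# (age, income, children, female, married, car, mortgage)
a = (30, 40000.0, 1, 1, 0, 1, 0)
b = (30, 40000.0, 1, 0, 1, 0, 1)  # differs from a on all four binary variables
c = (30, 42000.0, 1, 1, 0, 1, 0)  # differs from a only by $2,000 of income

NUMERIC = (0, 1, 2)       # positions of Age, Income, Children
BINARY = (3, 4, 5, 6)     # positions of Female, Married, Car, Mortgage
# Assumed value ranges used for scaling (illustrative only).
RANGES = {0: 50.0, 1: 60000.0, 2: 3.0}


def euclidean(u, v):
    """Plain Euclidean distance on the raw, unscaled values."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def gower(u, v):
    """Gower-style distance: average of per-variable dissimilarities in [0, 1]."""
    parts = [abs(u[i] - v[i]) / RANGES[i] for i in NUMERIC]
    parts += [0.0 if u[i] == v[i] else 1.0 for i in BINARY]
    return sum(parts) / len(parts)


print("Euclidean a-b:", euclidean(a, b), " a-c:", euclidean(a, c))
print("Gower     a-b:", gower(a, b), " a-c:", gower(a, c))
```

On raw values, Euclidean distance treats customer b (opposite on every categorical variable) as far closer to a than customer c, purely because income is measured in dollars. The range-scaled measure reverses that ordering.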
1.2 The next step is to choose the number of clusters. Initially, any number of clusters can be chosen; as a default, if there are n data points available, we can start with n/30 cluster centroids (20 for the 600 records here). There is no rule of thumb or exact method for choosing the right number of centroids, but one approach is to split the data in a 70-30 ratio. Fit k-means on 70% of the data, then assign the remaining 30% of the points to the resulting clusters. If the original cluster centroids change drastically after these points are assigned, the number of clusters needs to be changed; more clusters may be required.
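The hold-out check described above could be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: the function names (`starting_k`, `kmeans`, `holdout_shift`) are hypothetical, the k-means routine is a basic Lloyd-style loop on numeric tuples, and "how much the centroids change" is measured as the largest centroid movement after the held-out points are added. What counts as a drastic change is left to the analyst.

```python
import math
import random


def starting_k(n, per_cluster=30):
    """Starting number of centroids from the n/30 heuristic."""
    return max(1, n // per_cluster)


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    return tuple(sum(col) / len(points) for col in zip(*points))


def assign(points, centroids):
    """Group points by their nearest centroid."""
    groups = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
        groups[nearest].append(p)
    return groups


def kmeans(points, k, iters=50, seed=0):
    """Basic Lloyd-style k-means; returns the final centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = assign(points, centroids)
        # Keep the old centroid if a cluster ends up empty.
        new = [mean(g) if g else c for g, c in zip(groups, centroids)]
        if new == centroids:
            break
        centroids = new
    return centroids


def holdout_shift(points, k, train_frac=0.7, seed=0):
    """Fit on 70%, assign the other 30%, return the largest centroid movement."""
    rng = random.Random(seed)
    shuffled = points[:]
    rng.shuffle(shuffled)
    cut = round(train_frac * len(shuffled))
    train, test = shuffled[:cut], shuffled[cut:]
    centroids = kmeans(train, k, seed=seed)
    train_groups = assign(train, centroids)
    test_groups = assign(test, centroids)
    updated = [
        mean(a + b) if (a + b) else c
        for a, b, c in zip(train_groups, test_groups, centroids)
    ]
    return max(dist(c, u) for c, u in zip(centroids, updated))
```

A large value from `holdout_shift` relative to the spread of the data would suggest trying a different k; repeating the check over several values of k gives a rough comparison.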
1.3. The main disadvantage is that we do not know in advance how many clusters would be optimal. In addition, as noted above, the distance measure should be changed to suit the nature of the data.
2. For hierarchical clustering, group average linkage, which uses the average pairwise distance between the members of two clusters, can be a rather crude method and is not very informative. Complete linkage, on the other hand, uses the maximum distance between the members of two clusters.
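The two linkage rules can be written down directly. The sketch below uses Euclidean distance between points and two small made-up clusters purely for illustration; the function names are hypothetical.

```python
import math


def complete_linkage(cluster_a, cluster_b):
    """Distance between clusters = largest pairwise distance."""
    return max(math.dist(p, q) for p in cluster_a for q in cluster_b)


def group_average_linkage(cluster_a, cluster_b):
    """Distance between clusters = mean of all pairwise distances."""
    pairs = [math.dist(p, q) for p in cluster_a for q in cluster_b]
    return sum(pairs) / len(pairs)


# Two small illustrative clusters (made-up points).
left = [(0.0, 0.0), (0.0, 1.0)]
right = [(3.0, 0.0), (4.0, 0.0)]

print("complete:", complete_linkage(left, right))
print("group average:", group_average_linkage(left, right))
```

Complete linkage is driven entirely by the two most distant members, while group average linkage blends all pairs, so the two rules can merge clusters in a different order on the same data.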
2.1 Hierarchical clustering can be useful because the full picture is visible at every stage of merging, and we can stop at whichever stage gives the number of clusters we need.
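The idea of stopping at any stage can be sketched with a naive agglomerative procedure that records the clustering at every level. This is an illustrative, unoptimized sketch on made-up one-dimensional points, using complete linkage and Euclidean distance; the function names are hypothetical.

```python
import math


def complete_linkage(cluster_a, cluster_b):
    """Largest pairwise distance between members of two clusters."""
    return max(math.dist(p, q) for p in cluster_a for q in cluster_b)


def agglomerate(points, linkage=complete_linkage):
    """Merge the two closest clusters repeatedly.

    Returns a dict mapping each number of clusters (from len(points)
    down to 1) to the clustering obtained at that stage.
    """
    clusters = [[p] for p in points]
    levels = {len(clusters): [list(c) for c in clusters]}
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
        levels[len(clusters)] = [list(c) for c in clusters]
    return levels


# Made-up points for illustration.
levels = agglomerate([(0.0,), (1.0,), (10.0,), (11.0,), (50.0,)])
for k in sorted(levels, reverse=True):
    print(k, levels[k])
```

Reading `levels[k]` for different k corresponds to stopping the merging at different stages, without rerunning the procedure.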
However, the data may be noisy, and some of the features may add unnecessary noise. In that case hierarchical clustering may not be a good method to use.