Scenario
Universal Bank is a relatively young bank growing rapidly in terms of overall customer acquisition. The majority of these customers are liability customers (depositors) with varying sizes of relationship with the bank. The customer base of asset customers (borrowers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business. In particular, it wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).
A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9% success. This has encouraged the retail marketing department to devise smarter campaigns with better target marketing. The goal is to use k-NN to predict whether a new customer will accept a loan offer. This will serve as the basis for the design of a new campaign.
The dataset UniversalBank.csv below contains data on 5000 customers. The data include customer demographic information (age, income, etc.), the customer’s relationship with the bank (mortgage, securities account, etc.), and the customer response to the last personal loan campaign (Personal Loan). Among these 5000 customers, only 480 (= 9.6%) accepted the personal loan that was offered to them in the earlier campaign.
With all this information in mind and the use of R, your job is to:
Partition the dataset into 60% training and 40% validation sets, considering the information on the following customer:
Age = 40, Experience = 10, Income = 84, Family = 2, CCAvg = 2, Education_1 = 0, Education_2 = 1, Education_3 = 0, Mortgage = 0, Securities Account = 0, CD Account = 0, Online = 1, and Credit Card = 1.
Perform a k-NN classification with all predictors except ID and ZIP code using k = 1. Remember to transform categorical predictors with more than two categories into dummy variables first.
Specify the success class as 1 (loan acceptance), and use the default cutoff value of 0.5. How would this customer be classified?
Tell me, what is a choice of k that balances between overfitting and ignoring the predictor information?
Show the confusion matrix for the validation data that results from using the best k. Then, consider the following customer:
Age = 40, Experience = 10, Income = 84, Family = 2, CCAvg = 2, Education_1 = 0, Education_2 = 1, Education_3 = 0, Mortgage = 0, Securities Account = 0, CD Account = 0, Online = 1, and Credit Card = 1.
Classify the above customer using the best k.
Repartition the data, this time into training, validation, and test sets (50% : 30% : 20%).
Apply the k-NN method with the k chosen above.
Compare the confusion matrix of the test set with that of the training and validation sets.
Comment on the differences and their reason.
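The data preparation steps named in the task (dummy variables for Education, then a random partition) can be sketched as follows. The task asks for R; Python is used here only to illustrate the logic. The helper names `education_dummies` and `partition`, the seed, and the coding of Education as 1, 2 or 3 are assumptions made for this sketch, and the rows are stand-in IDs rather than the actual contents of UniversalBank.csv.

```python
import random

def education_dummies(level):
    """Turn Education (assumed to be coded 1, 2 or 3) into three 0/1 dummies."""
    return {f"Education_{i}": int(level == i) for i in (1, 2, 3)}

def partition(rows, fractions, seed=1):
    """Shuffle the rows and split them according to fractions (summing to 1)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    parts, start = [], 0
    for i, frac in enumerate(fractions):
        # The last part takes whatever remains, so no row is lost to rounding.
        is_last = i == len(fractions) - 1
        end = len(shuffled) if is_last else start + round(frac * len(shuffled))
        parts.append(shuffled[start:end])
        start = end
    return parts

# Stand-in for the 5000 customer records (IDs only, for illustration).
rows = list(range(5000))
train, valid = partition(rows, [0.6, 0.4])
train3, valid3, test3 = partition(rows, [0.5, 0.3, 0.2])

print(len(train), len(valid))
print(len(train3), len(valid3), len(test3))
print(education_dummies(2))
```

Fixing the seed keeps the partition reproducible, which matters when the error rates of different k values are compared on the same validation set.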
Market basket analysis covers two methods: association analysis and sequence analysis. Both are useful for finding frequent patterns among the variables. The association method identifies which variables occur together and creates a rule accordingly.
The rule is developed by counting how often a variable emerges alone and in combination with others in the data. In addition to the connection between variables and their probability, sequencing also considers the order in which the relationships occur. Thus, it includes a timing element in the analysis.
Overall, market basket analysis is useful for finding the probability that variables appear together. Unfortunately, this analysis does not give us any results, for two reasons. First, no significant association could be found at a confidence level of 5%, for unknown reasons. Second, the data contain no time element, which is necessary for performing a sequence discovery.
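The counting behind an association rule can be sketched with the two standard measures, support and confidence. The function names and the four records below are made up for illustration and are not taken from the Universal Bank data; Python is used only to show the arithmetic.

```python
def support(transactions, items):
    """Share of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent): support(both) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Made-up records: each set lists the products one customer holds.
transactions = [
    {"CD Account", "Online"},
    {"CD Account", "Online", "Credit Card"},
    {"Online"},
    {"CD Account"},
]

sup_cd = support(transactions, {"CD Account"})
conf_rule = confidence(transactions, {"CD Account"}, {"Online"})
print(sup_cd)      # 3 of the 4 records contain "CD Account"
print(conf_rule)   # 2 of those 3 records also contain "Online"
```

A rule is usually kept only if both measures exceed chosen thresholds, which is one way an analysis can end with no rules at all.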
Memory-based reasoning uses the K-nearest-neighbour method to make predictions for new data. For binary target variables, this method finds the K training records closest to the new object and assigns the new object to the class held by the majority of those neighbours; with K = 1, that is simply the class of the single closest neighbour.
In terms of our target, the disadvantage is again that this is a predictive method and does not help to identify the variables that explain the characteristics of loan takers. Text mining is used to detect patterns in articles or other written documents and is therefore of no use for the Universal Bank data set.
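A minimal sketch of K-nearest-neighbour classification for a binary target: the share of class-1 records among the k nearest training records is compared with a cutoff (0.5, as in the task). The function name `knn_classify`, the Euclidean distance, the `>=` comparison at the cutoff, and the five records are assumptions for illustration; the features are assumed to be already normalized. Python is used only to show the logic.

```python
import math

def knn_classify(train, new_point, k, cutoff=0.5):
    """Return (predicted class, share of class-1 neighbours) for new_point.

    train is a list of (features, label) pairs with label 0 or 1.
    """
    # Sort the training records by Euclidean distance to the new point.
    by_distance = sorted(train, key=lambda rec: math.dist(rec[0], new_point))
    neighbours = by_distance[:k]
    prob = sum(label for _, label in neighbours) / k
    # Assumption: a share equal to the cutoff counts as class 1.
    return (1 if prob >= cutoff else 0), prob

# Made-up, already-normalized records: (features, label).
train_points = [
    ((0.1, 0.2), 0),
    ((0.2, 0.1), 0),
    ((0.3, 0.3), 0),
    ((0.9, 0.8), 1),
    ((0.8, 0.9), 1),
]

result_k1 = knn_classify(train_points, (0.25, 0.25), k=1)
result_k3 = knn_classify(train_points, (0.7, 0.7), k=3)
print(result_k1)
print(result_k3)
```

With k = 1 the estimated probability can only be 0 or 1, because a single neighbour decides the class.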
class | prob | neighbour | age | experience | income | family | CCAvg |
0 | 0 | 1 | 40 | 10 | 84 | 2 | 2 |
edu_1 | edu_2 | Mortgage | securities acct | cd account | online | credit card |
0 | 1 | 0 | 0 | 0 | 1 | 1 |
From the output we conclude that the above customer is classified as belonging to the loan-not-accepted group (class 0).
The choice of k that balances between overfitting and ignoring the predictor information is the one that minimizes the validation error without fitting noise in the training data. In the validation error log below, k = 1 has the lowest validation error (10%), but its training error of 0% is a sign of overfitting. Among the larger values tested, the lowest validation error is 11.25%, reached at k = 3 and k = 9.
Validation error log for different values of k
k | Training error (%) | Validation error (%) |
1 | 0 | 10 |
2 | 5.83 | 13.75 |
3 | 6.67 | 11.25 |
4 | 7.5 | 18.75 |
5 | 6.67 | 12.5 |
6 | 7.5 | 16.5 |
7 | 10 | 12.5 |
8 | 9.17 | 12.5 |
9 | 8.33 | 11.25 |
The value of k that balances between overfitting and ignoring the predictor information is 9.
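The selection can be reproduced from the error log above with a short sketch. Two choices here are assumptions rather than part of the original analysis: k = 1 is excluded because its 0% training error only reflects each training record being its own nearest neighbour, and ties in validation error are broken in favour of the larger k. Python is used only to show the logic.

```python
# (k, training error %, validation error %) copied from the error log above.
error_log = [
    (1, 0.0, 10.0), (2, 5.83, 13.75), (3, 6.67, 11.25),
    (4, 7.5, 18.75), (5, 6.67, 12.5), (6, 7.5, 16.5),
    (7, 10.0, 12.5), (8, 9.17, 12.5), (9, 8.33, 11.25),
]

# Assumption: skip k = 1, whose 0% training error signals overfitting.
candidates = [row for row in error_log if row[0] > 1]
lowest = min(valid for _, _, valid in candidates)
tied = [k for k, _, valid in candidates if valid == lowest]

# Assumed tie-break: prefer the larger k, which smooths the vote more.
best_k = max(tied)
print(tied, best_k)
```

Under these assumptions the tied candidates are k = 3 and k = 9, and the tie-break selects k = 9.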
Validation Data scoring - Summary Report (for k=1)
cut-off prob value -> 0.5
class | prob | neighbour | age | experience | income | family | CCAvg |
0 | 0 | 1 | 40 | 10 | 84 | 2 | 2 |
edu_1 | edu_2 | Mortgage | securities acct | cd account | online | credit card |
0 | 1 | 0 | 0 | 0 | 1 | 1 |
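The confusion matrix requested in the task can be tallied from actual and predicted classes as sketched below, with class 1 (loan acceptance) as the success class. The function names and the eight labels are made up for illustration and are not results from the Universal Bank data; Python is used only to show the logic.

```python
def confusion_matrix(actual, predicted):
    """Count outcomes with class 1 (loan accepted) as the success class."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["TP"] += 1   # accepted, predicted accepted
        elif a == 0 and p == 1:
            counts["FP"] += 1   # not accepted, predicted accepted
        elif a == 1 and p == 0:
            counts["FN"] += 1   # accepted, predicted not accepted
        else:
            counts["TN"] += 1   # not accepted, predicted not accepted
    return counts

def error_rate(counts):
    """Share of misclassified records."""
    return (counts["FP"] + counts["FN"]) / sum(counts.values())

# Made-up labels for eight validation records.
actual = [1, 0, 0, 1, 0, 0, 0, 1]
predicted = [1, 0, 0, 0, 0, 1, 0, 1]

cm = confusion_matrix(actual, predicted)
print(cm)
print(error_rate(cm))
```

Computing the same counts on the training, validation, and test partitions gives the comparison the task asks for; training error is typically the lowest because those records were used to build the classifier.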