The two most critical questions in the lending industry are: 1)
How risky is the borrower? 2) Given the borrower’s risk, should we
lend to them? The answer to the first question determines the
interest rate the borrower is offered. Among other things (such as
the time value of money), the interest rate measures the riskiness
of the borrower: the riskier the borrower, the higher the interest
rate. With the interest rate in mind, we can then determine whether
the borrower is eligible for the loan.
Investors (lenders) provide loans to borrowers in exchange for
the promise of repayment with interest. That means the lender only
makes a profit (the interest) if the borrower pays off the loan. If
the borrower doesn’t repay the loan, the lender loses money.
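To make the lender's payoff concrete, here is a minimal sketch of the arithmetic. It assumes a fixed-rate, fully amortizing loan with equal monthly payments, and it ignores recoveries, fees, and the time value of money. The function names and the example loan (amount, rate, and 36-month term) are hypothetical and not taken from the data.

```python
def monthly_installment(principal, annual_rate, n_months):
    """Fixed monthly payment on a fully amortizing loan."""
    r = annual_rate / 12  # monthly interest rate
    if r == 0:
        return principal / n_months
    return principal * r / (1 - (1 + r) ** -n_months)


def lender_net(principal, annual_rate, n_months, payments_made):
    """Cash received minus cash lent (no recoveries, fees, or discounting)."""
    payment = monthly_installment(principal, annual_rate, n_months)
    return payment * payments_made - principal


# Hypothetical loan: $10,000 at 12% a year over 36 months.
repaid = lender_net(10_000, 0.12, 36, payments_made=36)
defaulted = lender_net(10_000, 0.12, 36, payments_made=12)

print(f"fully repaid:              {repaid:+.2f}")
print(f"default after 12 payments: {defaulted:+.2f}")
```

Under these assumptions the lender's gain on a repaid loan is the interest alone, while an early default costs most of the principal. That asymmetry is why predicting repayment matters.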
We’ll be using publicly available data from LendingClub.com. The
data covers the 9,578 loans funded by the platform between May 2007
and February 2010. The interest rate is provided to us for each
borrower, so we’ll address the second question indirectly by trying
to predict whether the borrower will repay the loan by its maturity
date. Through this exercise we’ll illustrate three modeling
concepts:
- What to do with missing values.
- Techniques used with imbalanced classification problems.
- How to build an ensemble model using two methods: blending and
stacking, which will most likely give us a boost in
performance.
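As a rough preview of how these three ideas can fit together, here is a minimal sketch using scikit-learn. It runs on synthetic data, not the LendingClub data. The class ratio, the share of missing values, and the choice of median imputation, balanced class weights, and the two base learners are all assumptions made for illustration. Only stacking is shown; blending is not.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced data (assumed ratio: about 16% positives).
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.84, 0.16], random_state=0,
)

# Blank out about 5% of the entries at random to mimic missing values.
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Imbalance: weight each class inversely to its frequency.
base_learners = [
    ("logreg", make_pipeline(
        StandardScaler(),
        LogisticRegression(class_weight="balanced", max_iter=1000),
    )),
    ("forest", RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=0,
    )),
]

# Stacking: a meta-model learns from the base learners'
# out-of-fold predictions.
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)

# Missing values: fill each feature with its training-set median.
model = make_pipeline(SimpleImputer(strategy="median"), stack)
model.fit(X_train, y_train)

auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"test AUC: {auc:.3f}")
```

The score is reported as AUC rather than accuracy because, with classes this imbalanced, a model that always predicts the majority class would already look accurate.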
Below is a short description of each feature in the data
set:
- credit_policy: 1 if the customer meets the
credit underwriting criteria of LendingClub.com, and 0
otherwise.
- purpose: The purpose of the loan such as:
credit_card, debt_consolidation, etc.
- int_rate: The interest rate of the loan
(proportion).
- installment: The monthly installments ($) owed
by the borrower if the loan is funded.
- log_annual_inc: The natural log of the annual
income of the borrower.
- dti: The debt-to-income ratio of the
borrower.
- fico: The FICO credit score of the
borrower.
- days_with_cr_line: The number of days the
borrower has had a credit line.
- revol_bal: The borrower’s revolving
balance.
- revol_util: The borrower’s revolving line
utilization rate.
- inq_last_6mths: The borrower’s number of
inquiries by creditors in the last 6 months.
- delinq_2yrs: The number of times the borrower
had been 30+ days past due on a payment in the past 2 years.
- pub_rec: The borrower’s number of derogatory
public records.
- not_fully_paid: Indicates whether the loan was
not paid back in full (the borrower either defaulted or was
deemed unlikely to pay it back).
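Before modeling, it can help to confirm that a loaded table matches the description above. The following is a minimal sketch using pandas. The two rows are made up so that the example runs on its own, and `summarize` is a hypothetical helper. With the real file you would pass its path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Column names as listed in the feature description.
EXPECTED_COLUMNS = [
    "credit_policy", "purpose", "int_rate", "installment",
    "log_annual_inc", "dti", "fico", "days_with_cr_line",
    "revol_bal", "revol_util", "inq_last_6mths", "delinq_2yrs",
    "pub_rec", "not_fully_paid",
]

# Two made-up rows (NOT real LendingClub records); the second one
# has a missing installment value.
SAMPLE_CSV = "\n".join([
    ",".join(EXPECTED_COLUMNS),
    "1,debt_consolidation,0.12,332.14,11.0,15.0,720,4000,20000,45.0,0,0,0,0",
    "0,credit_card,0.15,,10.5,20.0,660,2500,12000,80.0,2,1,0,1",
])


def summarize(df):
    """Check the schema, then count missing values and the target share."""
    absent = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if absent:
        raise ValueError(f"missing expected columns: {absent}")
    return {
        "n_rows": len(df),
        "missing_per_column": df[EXPECTED_COLUMNS].isna().sum().to_dict(),
        "not_fully_paid_share": float(df["not_fully_paid"].mean()),
    }


df = pd.read_csv(io.StringIO(SAMPLE_CSV))
summary = summarize(df)
print(summary)
```

The missing-value counts and the share of `not_fully_paid` loans are the two figures that matter most for the imputation and class-imbalance steps.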