Students are now armed with R tools to examine and manipulate datasets. The objective of this Tutorial and Lab Test 3 is to demonstrate these skills on a real-world dataset.
To Do:
The dataset used for this case study is the Lending Club loan data. It contains the full LendingClub data available from their site, with separate files for accepted and rejected loans. The accepted loans also include FICO scores.
The data set can be downloaded via this link: https://www.kaggle.com/wordsforthewise/lending-club/download
Dataset Exploration in R
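library(data.table)  # provides fread()
# accepted_fn and rejected_fn are the paths to the downloaded accepted and rejected loan files
# The files are gigabytes in size, so loading can take a minute or two
acc_dt <- fread(accepted_fn)
rej_dt <- fread(rejected_fn)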
Discovering the dimensions:
# that's a lot of observations
dim(acc_dt)
=>> 2260701 samples x 151 features
dim(rej_dt)
=>> 27648741 samples x 9 features
Column Names
names(acc_dt)
ANALYSIS
Looking at the rejected data, much of the information present in the accepted data is missing: it has only 9 features against 151 for the accepted loans.
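One way to see which information is missing is to compare the column names of the two tables. The sketch below uses two small made-up data frames (acc_toy and rej_toy, with illustrative column names) in place of acc_dt and rej_dt. It assumes that matching columns carry the same name in both files, which may not hold for the real data, so treat a name-based comparison as a first pass only.

```r
# Hypothetical stand-ins for acc_dt and rej_dt; column names are illustrative only
acc_toy <- data.frame(loan_amnt = c(5000, 12000),
                      grade = c("A", "C"),
                      loan_status = c("Fully Paid", "Charged Off"),
                      addr_state = c("CA", "NY"))
rej_toy <- data.frame(amount_requested = c(3000, 8000),
                      addr_state = c("TX", "FL"))

# Columns present in the accepted table but absent from the rejected one
only_in_accepted <- setdiff(names(acc_toy), names(rej_toy))
# Columns the two tables have in common
shared <- intersect(names(acc_toy), names(rej_toy))

print(only_in_accepted)
print(shared)
```

The same two calls should work unchanged on names(acc_dt) and names(rej_dt), since a data.table exposes its column names the same way a data frame does.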
A few variables in the accepted data are good candidate targets for predictive modelling, notably grade and loan_status.
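A quick way to judge whether a variable such as grade or loan_status is usable as a target is to tabulate its values and check how many observations fall in each class. The following is a minimal sketch on a made-up data frame (toy) standing in for acc_dt. The values are invented for illustration and do not reflect the real counts.

```r
# Hypothetical sample standing in for acc_dt; values are made up for illustration
toy <- data.frame(
  grade = c("A", "B", "B", "C", "A", "D"),
  loan_status = c("Fully Paid", "Charged Off", "Current",
                  "Fully Paid", "Default", "Charged Off")
)

# Number of loans per grade
grade_counts <- table(toy$grade)
# Number of loans per status, largest first
status_counts <- sort(table(toy$loan_status), decreasing = TRUE)

print(grade_counts)
print(status_counts)
# Proportions are often easier to compare than raw counts
print(round(prop.table(status_counts), 2))
```

On the full data, the equivalent calls would be table(acc_dt$grade) and table(acc_dt$loan_status).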
Observation
The data for grade looks good, so it can be used as a target in a predictive model. On the other hand, the count of "Default" in loan_status looks very small.
The number of loans with a status of "Default" is low, but according to the Lending Club website definition it is in fact a transitional status that then becomes "Charged Off", the final state for a loan whose funds are unlikely to be recovered.
There are plenty of loans that fall under "Charged Off", so loan_status can be used as a target for binary predictions.
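A binary target could be derived from loan_status along the following lines. This is only a sketch under stated assumptions: "Fully Paid" is taken as the opposite class to "Charged Off", and loans in any other status are dropped because their final outcome is not yet known. Neither choice comes from the text above. The data frame toy and the column name charged_off are hypothetical.

```r
# Hypothetical sample standing in for acc_dt
toy <- data.frame(
  loan_status = c("Fully Paid", "Charged Off", "Current",
                  "Default", "Fully Paid", "Charged Off"),
  stringsAsFactors = FALSE
)

# Assumption: keep only loans that have reached a final state
final_states <- c("Fully Paid", "Charged Off")
finished <- toy[toy$loan_status %in% final_states, , drop = FALSE]

# Binary target: 1 = charged off, 0 = fully paid
finished$charged_off <- as.integer(finished$loan_status == "Charged Off")

print(table(finished$charged_off))
```

Dropping unfinished loans avoids labelling a loan as good when it may still be charged off later. Whether that trade-off is right depends on the modelling goal.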