Question

Students are now armed with R tools to examine and manipulate datasets. The object of this Tutorial and Lab Test 3 is to demonstrate these skills in the real world.

To Do:

  • There are thousands of Big Datasets available on the web
  • Find a Big Dataset that contains information in which you are particularly interested - for example Tropical Fish, or F1 racing. (NB: please provide a link to the dataset)
  • A simple search such as 'big data soccer csv' will generally produce good results or will lead you to metasites that have lists of datasets
  • Look for csv files
  • Access the dataset
  • Using the R language
    • Discover the dimensions of the dataset (number of rows, number of columns, column names, etc.)
    • Find out some interesting information from the dataset
    • Prepare 3-5 slides
      • Give the source of the dataset
      • Describe the dataset
      • Tell us 3 interesting things you found in the dataset

Solutions

Expert Solution

The dataset I have chosen for this case study is the Lending Club loan data. It contains the full LendingClub data available from their site, with separate files for accepted and rejected loans. The accepted loans also include FICO scores.

The data set can be downloaded via this link: https://www.kaggle.com/wordsforthewise/lending-club/download

Dataset Exploration in R

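library(data.table)  # provides fread()
# accepted_fn and rejected_fn hold the paths to the downloaded CSV files.
# The files are gigabytes in size, so loading can take a minute or two.
acc_dt <- fread(accepted_fn)
rej_dt <- fread(rejected_fn)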

Discovering the dimensions:

# that's a lot of observations
dim(acc_dt)

=>>     2260701 samples x 151 features

dim(rej_dt)

=>>     27648741 samples x 9 features

Column Names

names(acc_dt)
  1. 'id'
  2. 'member_id'
  3. 'loan_amnt'
  4. 'funded_amnt'
  5. 'funded_amnt_inv'
  6. 'term'
  7. 'int_rate'
  8. 'installment'
  9. 'grade'
  10. 'sub_grade'
  11. 'emp_title'
  12. 'emp_length'
  13. 'home_ownership'
  14. 'annual_inc'
  15. 'verification_status'
  16. 'issue_d'
  17. 'loan_status'
  18. 'pymnt_plan'
  19. 'url'
  20. 'desc'
  21. 'purpose'
  22. 'title'
  23. 'zip_code'
  24. 'addr_state'
  25. 'dti'
  26. 'delinq_2yrs'
  27. 'earliest_cr_line'
  28. 'fico_range_low'
  29. 'fico_range_high'
  30. 'inq_last_6mths'
  31. 'mths_since_last_delinq'
  32. 'mths_since_last_record'
  33. 'open_acc'
  34. 'pub_rec'
  35. 'revol_bal'
  36. 'revol_util'
  37. 'total_acc'
  38. 'initial_list_status'
  39. 'out_prncp'
  40. 'out_prncp_inv'
  41. 'total_pymnt'
  42. 'total_pymnt_inv'
  43. 'total_rec_prncp'
  44. 'total_rec_int'
  45. 'total_rec_late_fee'
  46. 'recoveries'
  47. 'collection_recovery_fee'
  48. 'last_pymnt_d'
  49. 'last_pymnt_amnt'
  50. 'next_pymnt_d'
  51. 'last_credit_pull_d'
  52. 'last_fico_range_high'
  53. 'last_fico_range_low'
  54. 'collections_12_mths_ex_med'
  55. 'mths_since_last_major_derog'
  56. 'policy_code'
  57. 'application_type'
  58. 'annual_inc_joint'
  59. 'dti_joint'
  60. 'verification_status_joint'
  61. 'acc_now_delinq'
  62. 'tot_coll_amt'
  63. 'tot_cur_bal'
  64. 'open_acc_6m'
  65. 'open_act_il'
  66. 'open_il_12m'
  67. 'open_il_24m'
  68. 'mths_since_rcnt_il'
  69. 'total_bal_il'
  70. 'il_util'
  71. 'open_rv_12m'
  72. 'open_rv_24m'
  73. 'max_bal_bc'
  74. 'all_util'
  75. 'total_rev_hi_lim'
  76. 'inq_fi'
  77. 'total_cu_tl'
  78. 'inq_last_12m'
  79. 'acc_open_past_24mths'
  80. 'avg_cur_bal'
  81. 'bc_open_to_buy'
  82. 'bc_util'
  83. 'chargeoff_within_12_mths'
  84. 'delinq_amnt'
  85. 'mo_sin_old_il_acct'
  86. 'mo_sin_old_rev_tl_op'
  87. 'mo_sin_rcnt_rev_tl_op'
  88. 'mo_sin_rcnt_tl'
  89. 'mort_acc'
  90. 'mths_since_recent_bc'
  91. 'mths_since_recent_bc_dlq'
  92. 'mths_since_recent_inq'
  93. 'mths_since_recent_revol_delinq'
  94. 'num_accts_ever_120_pd'
  95. 'num_actv_bc_tl'
  96. 'num_actv_rev_tl'
  97. 'num_bc_sats'
  98. 'num_bc_tl'
  99. 'num_il_tl'
  100. 'num_op_rev_tl'
  101. 'num_rev_accts'
  102. 'num_rev_tl_bal_gt_0'
  103. 'num_sats'
  104. 'num_tl_120dpd_2m'
  105. 'num_tl_30dpd'
  106. 'num_tl_90g_dpd_24m'
  107. 'num_tl_op_past_12m'
  108. 'pct_tl_nvr_dlq'
  109. 'percent_bc_gt_75'
  110. 'pub_rec_bankruptcies'
  111. 'tax_liens'
  112. 'tot_hi_cred_lim'
  113. 'total_bal_ex_mort'
  114. 'total_bc_limit'
  115. 'total_il_high_credit_limit'
  116. 'revol_bal_joint'
  117. 'sec_app_fico_range_low'
  118. 'sec_app_fico_range_high'
  119. 'sec_app_earliest_cr_line'
  120. 'sec_app_inq_last_6mths'
  121. 'sec_app_mort_acc'
  122. 'sec_app_open_acc'
  123. 'sec_app_revol_util'
  124. 'sec_app_open_act_il'
  125. 'sec_app_num_rev_accts'
  126. 'sec_app_chargeoff_within_12_mths'
  127. 'sec_app_collections_12_mths_ex_med'
  128. 'sec_app_mths_since_last_major_derog'
  129. 'hardship_flag'
  130. 'hardship_type'
  131. 'hardship_reason'
  132. 'hardship_status'
  133. 'deferral_term'
  134. 'hardship_amount'
  135. 'hardship_start_date'
  136. 'hardship_end_date'
  137. 'payment_plan_start_date'
  138. 'hardship_length'
  139. 'hardship_dpd'
  140. 'hardship_loan_status'
  141. 'orig_projected_additional_accrued_interest'
  142. 'hardship_payoff_balance_amount'
  143. 'hardship_last_payment_amount'
  144. 'disbursement_method'
  145. 'debt_settlement_flag'
  146. 'debt_settlement_flag_date'
  147. 'settlement_status'
  148. 'settlement_date'
  149. 'settlement_amount'
  150. 'settlement_percentage'
  151. 'settlement_term'
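
The same inspection steps can be tried on a small table before loading the full files. The sketch below is illustrative only: the data frame `loans` and all of its values are invented, and only the column names come from the list above.

```r
# Toy stand-in for acc_dt; every value here is made up for illustration.
loans <- data.frame(
  loan_amnt   = c(5000, 12000, 20000, 8000),
  int_rate    = c(7.9, 13.5, 18.2, 10.1),
  grade       = c("A", "C", "D", "B"),
  loan_status = c("Fully Paid", "Charged Off", "Current", "Fully Paid"),
  stringsAsFactors = FALSE
)

n_rows    <- nrow(loans)           # number of observations
n_cols    <- ncol(loans)           # number of features
col_types <- sapply(loans, class)  # storage type of each column

print(dim(loans))    # rows and columns together
print(names(loans))  # column names
print(col_types)
```

On the real data, `str(acc_dt)` or `summary(acc_dt)` would give a fuller picture, though with 151 columns the output is long.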

ANALYSIS

Looking at the rejected data, much of the information available in the accepted data is missing.
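
One way to see which fields are absent is to compare the column names of the two tables. The sketch below uses hypothetical column sets, not the real ones, and assumes both files name shared fields identically, which may not hold for the actual LendingClub files.

```r
# Hypothetical column sets for illustration; with the real data these
# would be names(acc_dt) and names(rej_dt).
acc_cols <- c("loan_amnt", "int_rate", "grade", "loan_status", "dti")
rej_cols <- c("loan_amnt", "dti")

# Columns present for accepted loans but absent for rejected ones
missing_in_rej <- setdiff(acc_cols, rej_cols)

# Columns available in both tables
shared <- intersect(acc_cols, rej_cols)

print(missing_in_rej)
print(shared)
```

If the two files use different names for the same field, the names would need to be mapped to a common scheme before this comparison is meaningful.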

There are a few variables that are good targets for predictive modelling:

  1. application outcome (accepted or rejected)
  2. interest rate (int_rate)
  3. grade of the loan (grade)
  4. the loan status, if paid or defaulted (loan_status)

Observation

The data for grade looks good, so it can be used as a target in a predictive model. On the other hand, the count of "Default" in loan_status is very small.
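
Counts like these can be obtained with `table()`. The following is a minimal sketch on invented values; the vectors `grade` and `loan_status` stand in for the columns of `acc_dt`, and the status labels other than "Default" and "Charged Off" are assumptions.

```r
# Invented sample values standing in for acc_dt$grade and acc_dt$loan_status
grade <- c("A", "B", "B", "C", "C", "C", "D")
loan_status <- c("Fully Paid", "Fully Paid", "Charged Off", "Current",
                 "Fully Paid", "Default", "Charged Off")

grade_counts  <- table(grade)               # loans per grade
status_counts <- table(loan_status)         # loans per status
status_share  <- prop.table(status_counts)  # same counts as proportions

print(grade_counts)
print(status_counts)
print(round(status_share, 2))
```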

The number of loans with a status of "Default" is low, but according to the Lending Club website's definition it is a transitional status. It later turns into "Charged Off", the final state for a loan whose funds are unlikely to be recovered.

There are plenty of loans that fall under "Charged Off", so loan_status can be used as a target for binary predictions.
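
A binary target could be derived from loan_status roughly as follows. This is a sketch under assumptions rather than the method used above: it treats "Fully Paid" as the negative class and drops loans that have not reached a final state, and the sample values are invented.

```r
# Invented sample values standing in for acc_dt$loan_status
loan_status <- c("Fully Paid", "Charged Off", "Current",
                 "Default", "Fully Paid", "Charged Off")

# Assumption: only these two statuses count as final outcomes
final_states <- c("Fully Paid", "Charged Off")
final <- loan_status[loan_status %in% final_states]

# 1 = charged off, 0 = fully paid
is_charged_off <- as.integer(final == "Charged Off")

print(is_charged_off)
print(mean(is_charged_off))  # share of charged-off loans among finished loans
```

Whether to fold "Default" into the positive class is a modelling choice. Since it is a transitional state, leaving it out keeps the labels unambiguous.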

