5 Credit scoring application
In this section, we present the real data used for the empirical analysis, the predictive performance of the model, and the insights binary quantile regression provides into the relationship between the explanatory variables and credit risk. This section also introduces a segmentation framework for the credit applicants. The model parameters were estimated using the bayesQR R package (Benoit et al., 2011). Since no external or historical information about the parameters was available, vague prior distributions were placed on the model parameters, i.e.
π(β) ∼ Normal(0, 100).
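The two building blocks of such a model can be sketched as follows: the check (pinball) loss that defines the τ-th quantile, and the log-density of the vague prior. This is a minimal illustrative sketch in Python rather than R, the function names are our own (not part of bayesQR), and we assume the 100 in Normal(0, 100) denotes the prior variance:

```python
import math

def check_loss(u, tau):
    """Quantile check (pinball) loss: rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (1.0 if u < 0 else 0.0))

def log_normal_prior(beta, sd=10.0):
    """Log-density of the vague Normal(0, 100) prior (variance 100, sd 10)."""
    return -0.5 * math.log(2 * math.pi * sd ** 2) - beta ** 2 / (2 * sd ** 2)

# Residuals above the tau-th quantile are weighted by tau,
# residuals below by (1 - tau).
taus = [round(0.05 * k, 2) for k in range(1, 20)]  # tau = 0.05, ..., 0.95 by 0.05
```

Under this loss, τ = 0.5 penalizes positive and negative residuals equally (the median), while other τ levels weight the two sides asymmetrically.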
5.1 Data
The data used in this study come from the German credit dataset, publicly available from Asuncion and Newman (2007), which has been used in other studies such as Huang et al. (2007) and West et al. (2005). The dataset consists of 700 examples of creditworthy applicants and 300 examples of applicants who defaulted. For each instance, the dataset includes 24 input variables derived from 19 attributes that characterize the applicants (4 nominal attributes were transformed into dummy variables). These explanatory variables, shown in Table A1 of the Appendix, include demographic characteristics of customers (e.g., Personal status and gender, Age in years), credit details (e.g., Duration of credit (months), Credit amount), customers' financial standing (e.g., Average balance in savings account, Credit history) and employment (e.g., Nature of job, Present employment since). The dependent variable indicates default (value = 1) or non-default (value = 0).
5.2 Prediction and evaluation
The performance of credit scoring models is of utmost importance in the financial and banking industries. To evaluate the performance of the proposed binary quantile regression model, we first estimate the regression at nineteen quantile levels (τ = 0.05, ..., 0.95 by 0.05). Subsequently, we compute the probability of default according to the procedure explained in Section 3. Using the values obtained, we compute the AUC. The average AUC obtained through 10-fold cross-validation is 0.77, with a standard deviation of 0.06. This means that the proposed model has good discriminatory power, since it clearly exceeds the null-model benchmark of 0.5. The average percentage of correctly classified instances (PCC) is 76.2%, with a standard deviation of 5.1%, when using a threshold of 0.5. This is higher than the 70% accuracy rate of the naive model that classifies all customers as non-defaulters.
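The two evaluation measures used above can be illustrated with a small sketch (illustrative Python with made-up scores, not the German credit data): the AUC is the probability that a randomly chosen defaulter receives a higher predicted default probability than a randomly chosen non-defaulter, and the PCC is the accuracy at a 0.5 threshold.

```python
def auc(scores, labels):
    """Rank-based AUC: fraction of (positive, negative) pairs ranked
    correctly, counting ties as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def pcc(scores, labels, threshold=0.5):
    """Percentage of correctly classified instances at the given threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Hypothetical predicted default probabilities and true outcomes (1 = default).
scores = [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
```

In a 10-fold cross-validation, both measures would be computed on each held-out fold and then averaged, which is where the reported standard deviations come from.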
As stated before, the German credit data have been used by several authors to build models in a credit scoring context. However, the results obtained in those studies should be compared with the results reported in this paper with care, because of differences in model assumptions and validation methods. For example, Baesens et al. (2003a) used various methods, e.g., neural networks and decision trees, to build the credit scoring model. The most accurate model turned out to be a pruned neural network, with a PCC of 77.8%. However, that study does not use a cross-validation method to evaluate the performance of the models. Xiao et al. (2006) also apply several methods to classify credit applicants using the German credit data, and that study does use cross-validation. The most accurate model, i.e., support vector machines with a sigmoid kernel, presents a PCC of 77.2%. Although the evaluation method is the same as the one used in this paper, we do not have information on the threshold that was used, a choice that can significantly influence the resulting PCC. Nevertheless, given the results of the current study and those reported in previous studies, we conclude that the proposed model can compete with the state-of-the-art models suggested in the literature.
5.3 Explanatory variables effects
Frequently used parametric models, such as logit or probit, give insight into the effect of the explanatory variables on the mean of the response distribution. With binary quantile regression, however, it is possible to obtain a more thorough view of these effects. To do this, we analyzed the regression parameters for the quantile levels τ = 0.05, ..., 0.95 by 0.05.
It is important to note that, before estimating the regression model, we considered possible multicollinearity issues. There are a number of approaches to deal with multicollinearity; one popular check is the variance inflation factor (VIF) (see Bajpai (2009)). A VIF value greater than 10 is an indication that multicollinearity may be causing problems in the estimation (Neter et al., 1996; Myers, 2000). The VIF statistics for the independent variables considered in this study range from 1.1 to 3.3. This is within the acceptable range, and thus multicollinearity is not an important issue for the current dataset.
To analyze the regression parameters, we computed 90% pointwise Bayesian credible intervals from the marginal posterior distributions of each parameter. The results showed that, for 16 variables, the credible intervals of the regression parameters overlap zero at practically all quantile levels. Therefore, we conclude that these variables are not important for the analysis. Most of these are demographic and employment-related variables. Moreover, variables related to credit details, such as Credit amount and Application has other debtors or guarantors: Co-applicant, and to customers' financial standing, such as Number of existing credits at this bank and Other installment plans, also seem not to influence credit score estimation.
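The relevance check described above can be sketched as follows (illustrative Python with stand-in posterior draws; a simple empirical-quantile rule is assumed for the interval endpoints):

```python
def credible_interval(draws, level=0.90):
    """Equal-tailed credible interval from posterior draws, using
    empirical quantiles of the sorted sample."""
    s = sorted(draws)
    n = len(s)
    alpha = (1.0 - level) / 2.0
    lo = s[int(alpha * (n - 1))]
    hi = s[int((1.0 - alpha) * (n - 1))]
    return lo, hi

def overlaps_zero(interval):
    """A coefficient is treated as not relevant at a quantile level
    when its credible interval covers zero."""
    lo, hi = interval
    return lo <= 0.0 <= hi

draws = list(range(1, 101))  # stand-in posterior draws for one coefficient
```

In the actual analysis, this check is repeated for every coefficient at every quantile level τ = 0.05, ..., 0.95, and a variable is kept only if its interval excludes zero at most levels.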
Figure 1 depicts a summary of the quantile regression parameters obtained for the 8 variables that are relevant at most quantile levels. The solid line with filled dots represents the point estimates of the regression coefficients for the different quantile levels (τ = 0.05, ..., 0.95 by 0.05). The shaded area represents the 90% pointwise credible intervals obtained from the marginal posterior distributions of the different regression parameters.
Analyzing Figure 1, we can conclude that some variables have a negative impact at every quantile level, while others have a positive impact. None of the relevant variables have opposite effects at lower versus higher quantiles. However, the impact of each variable on credit risk appears to differ across quantile levels. This underscores the advantage of quantile regression over other techniques used in this context, which assume a constant effect of the explanatory variables at different points of the response distribution. The variable Duration of credit (months) has a positive impact on the risk of default. This impact is higher at the extreme high and low quantiles than at the middle quantiles, which means that analyzing only the mean effect would give the researcher an overestimate of the typical effect of this variable. The variable Property also has a positive relationship with credit risk, suggesting that customers with less valuable properties (see Table A1) are those with a higher propensity to default. This tendency is, once again, more evident at both the low and high extreme quantiles.
For most quantiles, the variable Application has other debtors or guarantors: Co-applicant: None positively influences credit risk. This suggests that people who apply alone for a credit have a higher probability of failing to repay it than people who have support from others. This does not hold at the low quantile levels and at one high quantile level, where the relationship does not seem to be relevant. The variable Credit purpose: New car also has a positive impact on the dependent variable. However, this impact is not relevant for creditworthy applicants (i.e., at low quantiles). The effect of this variable is practically stable over quantiles, which means that for this variable the additional insights are limited compared to those from logit or probit models. In contrast, the variable Credit purpose: Used car has a negative impact, which is more pronounced at high quantiles. As expected, both Customer account status and Average balance in savings account present a negative relationship with credit risk, revealing that people with more money in their accounts are creditworthy applicants. This tendency is more pronounced at low quantile levels, i.e., for customers with better credit quality given the set of covariates. Concerning the variable Credit history, it is interesting to observe that it also has a negative impact, suggesting that people with a compromised credit history are less prone to default. This may reveal that the difficulties arising from past credit processes made them averse to default.