Question


Suppose you are a data scientist. You are building a classifier that can predict whether a person is likely to default or not based on certain parameters/attribute values. Assume the class variable is “Default” and has two outcomes, {“yes”, “no”}. The attributes are:

• Own_House = Yes, No
• Marital Status = Single, Married, Divorced
• Annual Income = Low, Medium, High
• Currently Employed = Yes, No

Suppose a rule-based classifier produces the following rules:

1. Own_House = Yes → Default = Yes
2. Marital Status = Single → Default = Yes
3. Annual Income = Low → Default = Yes
4. Annual Income = High, Currently Employed = No → Default = Yes
5. Annual Income = Medium, Currently Employed = Yes → Default = No
6. Own_House = No, Marital Status = Married → Default = No
7. Own_House = No, Marital Status = Single → Default = Yes

Answer the following questions. Make sure to provide a brief explanation or examples to illustrate the answer.

a) Are the rules mutually exclusive?
b) Is the rule set exhaustive?
c) Is ordering needed for this set of rules?
d) Do you need a default class for the rule set?
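One way to check the four questions is to enumerate every possible combination of attribute values (2 × 3 × 3 × 2 = 36 records) and see which rules each record triggers. The sketch below does this in Python; the names ATTRIBUTES, RULES and fired_rules are illustrative choices made here, not part of the question.

```python
from itertools import product

# Attribute domains as listed in the question.
ATTRIBUTES = {
    "Own_House": ["Yes", "No"],
    "Marital Status": ["Single", "Married", "Divorced"],
    "Annual Income": ["Low", "Medium", "High"],
    "Currently Employed": ["Yes", "No"],
}

# Each rule is (conditions, predicted value of Default), in the order given.
RULES = [
    ({"Own_House": "Yes"}, "Yes"),
    ({"Marital Status": "Single"}, "Yes"),
    ({"Annual Income": "Low"}, "Yes"),
    ({"Annual Income": "High", "Currently Employed": "No"}, "Yes"),
    ({"Annual Income": "Medium", "Currently Employed": "Yes"}, "No"),
    ({"Own_House": "No", "Marital Status": "Married"}, "No"),
    ({"Own_House": "No", "Marital Status": "Single"}, "Yes"),
]


def fired_rules(record):
    """Return the 1-based numbers of all rules whose conditions match the record."""
    return [
        number
        for number, (conditions, _) in enumerate(RULES, start=1)
        if all(record[name] == value for name, value in conditions.items())
    ]


# Build every possible record from the attribute domains.
names = list(ATTRIBUTES)
records = [dict(zip(names, values)) for values in product(*ATTRIBUTES.values())]

# Records no rule covers, records covered by several rules, and records
# whose triggered rules disagree on the predicted class.
uncovered = [r for r in records if not fired_rules(r)]
overlapping = [r for r in records if len(fired_rules(r)) > 1]
conflicting = [
    r for r in records
    if len({RULES[n - 1][1] for n in fired_rules(r)}) > 1
]

print("a) mutually exclusive:", not overlapping)
print("b) exhaustive:", not uncovered)
print("c) ordering needed:", bool(conflicting))
print("d) default class needed:", bool(uncovered))
for r in uncovered:
    print("   uncovered record:", r)
```

For example, a record with Own_House = Yes, Marital Status = Married, Annual Income = Medium and Currently Employed = Yes triggers rules 1 and 5, which predict opposite classes, so the rules are not mutually exclusive and some ordering (or voting) scheme is needed. A record with Own_House = No, Marital Status = Divorced, Annual Income = High and Currently Employed = Yes triggers no rule, so the set is not exhaustive and a default class is required.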

Solutions

Expert Solution

5 Credit scoring application
In this section, we present the real data used for the empirical analysis, the predictive performance of the model and the insights binary quantile regression provides about the relationship between the explanatory variables. This section also introduces a segmentation framework for the credit applicants. The estimation of the model parameters was done using the
bayesQR R-package (Benoit et al., 2011). Since no external or historical information about the
parameters was present, vague prior distributions were placed on the model parameters, i.e.
π(β) ∼ Normal(0, 100).
5.1 Data
The data used in this study is the German credit dataset, publicly available from
Asuncion and Newman (2007), which was used in other studies such as Huang et al. (2007) and
West et al. (2005). This dataset consists of 700 examples of creditworthy applicants and 300
examples of applicants who defaulted. For each instance, the dataset includes 24 input variables that describe 19 attributes that characterize the applicants (with 4 nominal attributes
transformed to dummy variables). These explanatory variables, shown in Table A1 of the Appendix, include demographic characteristics of customers (e.g., Personal status and gender, Age
in years), credit details (e.g., Duration of credit (months) and Credit amount), customers’ financial standing (e.g., Average balance in savings account, Credit history) and employment (e.g.,
Nature of job, Present employment since). The dependent variable reveals default (value=1) or
non-default (value=0).
5.2 Prediction and evaluation
The performance of credit scoring models is obviously of utmost importance in financial and
banking industries. To evaluate the performance of the proposed binary quantile regression model, we first estimate the regression for nineteen different quantile levels (τ = 0.05, ..., 0.95
by 0.05). We then compute the probability of default, according to the procedure explained in Section 3. Using the values obtained, we compute the AUC. The average AUC
obtained through 10-fold cross-validation is 0.77, with a standard deviation of 0.06. This means
that the proposed model has good discriminatory power, since it significantly exceeds the null-model benchmark of 0.5. The average percentage of correctly classified instances (PCC) is 76.2%, with
a standard deviation of 5.1%, when using a threshold of 0.5. This is higher than the 70% accuracy obtained with a naive model that classifies all customers as non-defaulters.
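The two evaluation measures used above can be illustrated with a small sketch. It computes the AUC as the share of (defaulter, non-defaulter) pairs that are ranked correctly, and the PCC at a given threshold. The labels and scores are hypothetical toy values, not the German credit data, and the functions auc and pcc are written here for illustration only; only the 700/300 class split comes from the text.

```python
def auc(labels, scores):
    """Probability that a random defaulter gets a higher score than a random
    non-defaulter; tied scores count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def pcc(labels, scores, threshold=0.5):
    """Share of instances classified correctly when predicting default
    for every score at or above the threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Naive benchmark from the text: label all 1000 applicants as non-defaulters.
n_good, n_bad = 700, 300
naive_pcc = n_good / (n_good + n_bad)

# Hypothetical toy example: 1 = default, 0 = non-default.
labels = [0, 0, 0, 1, 0, 1, 1, 0]
scores = [0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.80, 0.30]

print("naive PCC:", naive_pcc)
print("AUC:", auc(labels, scores))
print("PCC at threshold 0.5:", pcc(labels, scores, 0.5))
print("PCC at threshold 0.4:", pcc(labels, scores, 0.4))
```

A model that gives every applicant the same score has an AUC of 0.5 under this definition, which is the null-model benchmark. The PCC, unlike the AUC, changes with the chosen threshold.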
As stated before, the German credit data is used by several authors to build models in a credit
scoring context. However, the results obtained in those studies should be compared with care
with the results reported in this paper because of differences in model assumptions and validation methods.
For example, Baesens et al. (2003a) used various methods, e.g. neural networks and decision
trees, for building the credit scoring model. The most accurate model turned out to be the
pruned neural network and had a PCC of 77.8%. However, this study does not use a cross-validation method to evaluate the performance of the models. Xiao et al. (2006) also apply
several methods to classify credit applicants using the German credit data. This study does use
cross-validation. The most accurate model, i.e. support vector machines with sigmoid kernel,
presents a PCC of 77.2%. Although the evaluation method is the same as the one used in
this paper, we do not have information regarding the threshold that was used. This choice can
significantly influence the resulting PCC. However, given the results of the current study and
the results reported in the previous studies, we conclude that the proposed model can compete
with the state-of-the-art models suggested in the literature.
5.3 Explanatory variables effects
Commonly used parametric models, such as logit or probit, give insight into the effect on the
mean of the response distribution. However, with binary quantile regression, it is possible to
get a more thorough view of the effect of the explanatory variables. To do this, we analyzed
the regression parameters for the quantile levels τ = 0.05, ..., 0.95 by 0.05.
It is important to note that before estimating the regression model, we considered possible multicollinearity issues. There are a number of approaches to detect multicollinearity.
One popular check is the variance inflation factor (VIF) (see Bajpai (2009)). A VIF value
greater than 10 is an indication that multicollinearity may be causing problems in estimation
(Neter et al., 1996; Myers, 2000). The VIF values for the independent variables considered in
this study range from 1.1 to 3.3. This is within the acceptable range and thus
multicollinearity is not an important issue for the current dataset.
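As an illustration of the check described above, the VIF of a variable can be computed as 1 / (1 − R²), where R² comes from regressing that variable on all the others. The sketch below uses NumPy and synthetic data generated only for illustration; it is not the computation performed in the study, and the function name vif is an assumption of this example.

```python
import numpy as np


def vif(X):
    """Variance inflation factor of each column of X, obtained by regressing
    that column on all other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    factors = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        residuals = y - design @ coef
        r_squared = 1.0 - residuals.var() / y.var()
        factors[j] = 1.0 / (1.0 - r_squared)
    return factors


# Synthetic data: x3 is almost a copy of x1, x2 is independent of both.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + 0.1 * rng.normal(size=500)
vifs = vif(np.column_stack([x1, x2, x3]))

print("VIF per column:", np.round(vifs, 2))
print("columns above the threshold of 10:", np.where(vifs > 10)[0])
```

In this toy setting the first and third columns exceed the threshold of 10 because they are nearly collinear, while the independent second column stays close to 1.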
To analyze the regression parameters we computed a 90% pointwise Bayesian credible interval
from the marginal posterior distribution of each parameter. The results showed that, for 16
variables, the credible intervals of the regression parameters contain zero at
practically all quantile levels. Therefore, we conclude that these variables are not important
for the analysis. Most of these variables are demographic variables and variables concerned
with the employment. Moreover, variables related to credit details, such as Credit amount and
Application has other debtors or guarantors:Co-applicant, and customers’ financial standing,
such as Number of existing credits at this bank and Other installment plans, also seem not to
influence credit score estimation.
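The screening step described above can be sketched as follows: given posterior draws for each parameter, take the 5% and 95% quantiles as a 90% interval and flag the parameters whose interval excludes zero. The draws below are simulated for illustration and are not output from bayesQR. The equal-tailed interval is an assumption of this sketch, since the text does not say which type of credible interval was used.

```python
import numpy as np


def credible_interval(draws, level=0.90):
    """Equal-tailed credible interval per column of posterior draws
    (rows = draws, columns = parameters)."""
    alpha = (1.0 - level) / 2.0
    lower, upper = np.quantile(draws, [alpha, 1.0 - alpha], axis=0)
    return lower, upper


def excludes_zero(draws, level=0.90):
    """True for each parameter whose credible interval does not contain zero."""
    lower, upper = credible_interval(draws, level)
    return (lower > 0) | (upper < 0)


# Simulated posterior draws for three hypothetical parameters.
rng = np.random.default_rng(1)
draws = np.column_stack([
    rng.normal(0.0, 1.0, size=5000),   # centred on zero
    rng.normal(2.0, 0.5, size=5000),   # clearly positive
    rng.normal(-1.5, 0.3, size=5000),  # clearly negative
])

lower, upper = credible_interval(draws)
relevant = excludes_zero(draws)
for j in range(draws.shape[1]):
    print(f"parameter {j}: [{lower[j]:.2f}, {upper[j]:.2f}] relevant={relevant[j]}")
```

Repeating this check at each quantile level τ would show whether a variable is relevant at practically all levels or only at some of them.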
Figure 1 depicts a summary of the quantile regression parameters obtained for the 8 variables
that are relevant on most quantile levels. The solid line with filled dots represents the point
estimates of the regression coefficients for the different quantile levels (τ = 0.05, ..., 0.95 by 0.05).
The shaded area represents the 90% pointwise credible intervals obtained from the marginal
posterior distribution of the different regression parameters.
By analyzing Figure 1 we can conclude that there are some variables whose impact is negative at each
quantile level, while there are others whose impact is positive. None of the relevant variables
have opposite effects for lower versus higher quantiles. However, the impact of each variable on
credit risk seems to differ across quantile levels. This reinforces the advantage of quantile
regression over other techniques used in this context, which assume a constant effect of the explanatory variables at different points of the dependent variable distribution. The variable
Duration of credit (months) has a positive impact on the risk of failure. This impact is higher
in the extreme high and low quantiles than in the middle quantiles. This means that analyzing
only the mean effect would give the researcher an overestimation of the typical effect of this
variable. The variable Property also has a positive relationship with credit risk, suggesting that
customers with less valuable properties (see Table A1) are those with a higher propensity to
default. This tendency is, once again, more evident for both low and high extreme quantiles.
For most quantiles, the variable Application has other debtors or guarantors:Co-applicant:None
positively influences the credit risk. This suggests that people who apply for credit alone
are more likely to fail to repay it than people who have support from
other people. This does not hold for low quantile levels and one high quantile level, where
the relationship does not seem to be relevant. The variable Credit purpose: New car also presents a
positive impact on the dependent variable analyzed. However, this impact is not relevant for
creditworthy applicants (i.e., low quantiles). The effect of this variable is practically stable over
quantiles. This means that for this variable, the additional insights are limited compared to
the insights from logit or probit models. In contrast, the variable Credit purpose: Used car has
a negative impact. This is more pronounced for high quantiles. As expected, both Customer
account status and Average balance in savings account present a negative relationship with
credit risk, revealing that people with more money in their accounts tend to be creditworthy applicants.
This tendency is more pronounced for low quantile levels, i.e. customers with better credit quality
given the set of covariates. Concerning the variable Credit history, it is interesting to observe
that it also has a negative impact, suggesting that people with a compromised credit history are
less prone to default. This may indicate that the inconveniences arising from past credit
processes made them averse to failure.

