Classifying credit defaulters

10 minute read

Published:

Multi-Linear Logistic Regression

Logistic regression is a useful tool for classification. The target variable is binary (yes or no, 1 or 0), and the model estimates the probability of the positive outcome. By fitting the regression on a subset of the data and using the fitted model to predict the rest, we can build a machine learning classifier. In this example, I’ll build a model estimating which consumers are most likely to default on their credit loans from a set of predictor variables.
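As a minimal sketch of that train-then-predict idea (using a tiny synthetic dataset, not the credit data analyzed below), the workflow could look like this:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Synthetic illustration only: 200 rows, 2 predictors, binary target.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Fit on 70% of the rows, then check the fit on the held-out 30%.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)
print(clf.score(X_test, y_test))  # fraction of held-out rows classified correctly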

Example: Credit Card default risk

Consider the prediction of a credit card default event. The dataset contains records with a default indicator along with the card owner’s wage, mortgage payment amount, vacation spending, and cost of living. A model trained on these observations can be used to predict whether a default is expected for a new, unlabeled instance. More precisely, the prediction model is expected to tell how likely a default is, so the card issuer knows the risk.

In these problems we want the probability of the event with the positive outcome; it is called ‘positive’ simply because it is the event we are most interested in detecting. Ironically, these events are often of an undesirable nature (a default or a spam message). In the credit card default problem, the goal is to find how likely it is that a default will happen. The target values in the training dataset take two values: fully paid or charged off. Likewise, in loan approval the target values are approved or not approved. Analysis of historical data aims to estimate the probability of the positive outcome (balance fully paid or loan approved); a well-trained model should be able to predict which class a new instance belongs to, which makes it a binary classifier.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

# Load the credit risk data into a dataframe (the file name here is assumed)
# and look at the first few rows.
Credit_risk = pd.read_csv('credit_risk.csv')
Credit_risk.head()
   CC_Payments   Wage  Cost_Living    Mtg  Vacations Default
0        11712  89925        44004  34237      13236      No
1         4114  82327        33111  35928      10788      No
2        15941  68972        12901  37791       8326      No
3        13962  67582        21491  29385       7647      No
4         6150  88983        43084  29614      10131      No
#Plot pairwise relationships using the seaborn library.
#Notice the parameter 'hue' set to the values in the 'Default' column.

import seaborn as sns

sns.pairplot(Credit_risk, hue='Default')

[Pair plot of the Credit_risk features, colored by the Default column]

In the pair plot, the observations labeled as No are shown in blue, and Yes in orange. Inspection of the graph suggests that observations can be classified by several predictors: Mtg, Vacations, and Cost_Living.

Before building the model, the count of each class in the dataset should be inspected.

# Count the occurrences of each value in the column "Default".

Credit_risk['Default'].value_counts()

Unbalanced data

For the most accurate results with logistic regression modeling, the data should be balanced, i.e. the number of positive instances should be comparable to the number of negative instances. In the Credit_risk dataframe there are roughly twice as many observations with the negative target value as with the positive one (209 No versus 91 Yes). This imbalance is not too large, so we can proceed with the unmodified dataset.

NOTE: Classification algorithms are very sensitive to unbalanced data, so for strongly unbalanced datasets the issue has to be properly addressed before training. The easiest solutions are:

  • to reduce the size of the oversampled class, or
  • to inflate the undersampled class.

To reduce an oversampled class, select that class into a separate dataframe, shuffle the rows, and keep every n-th row; n is chosen so that the two datasets - positive and negative classes - end up with comparable sizes.

To inflate the undersampled class, duplicate its rows so that the final dataset has an approximately equal number of positive and negative instances. Again, the recipe is to split into two datasets by class, duplicate the rows of the undersampled class as many times as required, and then combine the datasets.

Another approach is to stratify the dataset, that is, to fetch randomly selected observations from each subclass and compose a new, smaller dataset in which both classes are equally represented.
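As a rough sketch of these three recipes applied to the Credit_risk dataframe (pandas’ sample is used here as a shortcut for the shuffle-and-select step, and the stratified size of 50 rows per class is arbitrary):

# Split by class.
majority = Credit_risk[Credit_risk['Default'] == 'No']
minority = Credit_risk[Credit_risk['Default'] == 'Yes']

# Reduce the oversampled class: randomly keep only as many 'No' rows as there are 'Yes' rows.
balanced_down = pd.concat([majority.sample(n=len(minority), random_state=0), minority])

# Inflate the undersampled class: duplicate 'Yes' rows (sampling with replacement) up to the 'No' count.
balanced_up = pd.concat([majority, minority.sample(n=len(majority), replace=True, random_state=0)])

# Stratify: draw an equal number of rows (here 50, for illustration) from each class.
balanced_strat = pd.concat([majority.sample(n=50, random_state=0), minority.sample(n=50, random_state=0)])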

#There is one categorical column in the dataset ('Default').
#Convert it to 1 and 0 before training the model.

Credit_risk['default_enum'] = Credit_risk['Default'].map({'Yes': 1, 'No': 0})

Credit_risk.describe()
        CC_Payments           Wage    Cost_Living            Mtg     Vacations   default_enum
count    300.000000     300.000000     300.000000     300.000000    300.000000     300.000000
mean   15085.173333   85327.020000   35959.893333   34264.260000   9789.563333       0.303333
std    24142.424244   12002.180144   12550.754681   15048.593736   3133.039806       0.460466
min   -40164.000000   59734.000000    3654.000000   10583.000000     87.000000       0.000000
25%    -3234.750000   76953.750000   27630.750000   19554.500000   7727.000000       0.000000
50%    11422.500000   84288.000000   35481.500000   34198.000000  10004.000000       0.000000
75%    38863.500000   93335.750000   43542.750000   45641.250000  11960.250000       1.000000
max    63210.000000  119703.000000   75355.000000   82760.000000  17421.000000       1.000000

This example has only 300 observations, so it is not obvious that the observations should be interpreted statistically, i.e. in terms of probabilities.

To illustrate this point, we will split the rows into 13 wage bins and then calculate the number of observations and the number of defaults in each group. The description of the Credit_risk dataset shows the minimum and maximum wage values; based on them, we define the wage bins as:

wage_bins = [55000, 60000, 65000, 70000, 75000, 80000, 85000, 90000, 95000, 100000, 105000, 110000, 115000, 120000]
wage_bin_labels= ['60', '65', '70', '75', '80', '85', '90', '95', '100', '105', '110', '115', '120']

Credit_risk['wage_group'] = pd.cut(Credit_risk.Wage, wage_bins, right=False, labels=wage_bin_labels)
#Group the rows by the wage group, count number of rows and defaults in each group.

credit_risk_by_wage= Credit_risk[['default_enum', 'wage_group']].groupby(['wage_group']).agg({'default_enum': 'sum', 'wage_group': 'count'})


#Rename columns
credit_risk_by_wage.columns = ['default', 'Count']

#and calculate the frequency of the default in each group.
credit_risk_by_wage['Proportion'] = credit_risk_by_wage.default / credit_risk_by_wage.Count

credit_risk_by_wage.reset_index()
   wage_group  default  Count  Proportion
0          60        1      1    1.000000
1          65        2      5    0.400000
2          70       10     27    0.370370
3          75       16     27    0.592593
4          80       18     45    0.400000
5          85       13     54    0.240741
6          90       13     43    0.302326
7          95       10     34    0.294118
8         100        4     27    0.148148
9         105        3     19    0.157895
10        110        1      8    0.125000
11        115        0      8    0.000000
12        120        0      2    0.000000
plt.figure(figsize=(11, 8))
plt.xlabel('wage group', fontsize=18)
plt.xticks(fontsize=18)
plt.ylabel('Proportion', fontsize=18)
plt.yticks(fontsize=18)

plt.scatter(credit_risk_by_wage.index , credit_risk_by_wage.Proportion)

[Scatter plot: proportion of defaults by wage group]

After grouping by wage group, the number of instances in each group was counted, and the sum of the enumerated default values gives the number of defaults in that group. The calculated proportion gives a rough estimate of the probability of default in the group. We can see that the probability is zero or close to zero for the high wage groups (wage > 105,000) and takes larger values for the lower wage groups.

This was just an illustration of how a larger dataset can be interpreted probabilistically.

The Generalized Linear Model applies a statistical model to each observation: the response variable is modeled with a probability distribution, and a parameter of that distribution is then fitted as a linear function of the predictors’ values.

As stated above, the goal of logistic regression is to find the probability that an instance belongs to the positive class. This is a two-stage process:

  • first, the probabilities have to be estimated;
  • then, the probabilities are linked to a linear model of the predictors via a transformation (link) function.
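For logistic regression the transformation is the logit link: the probability p of the positive class is mapped to log(p / (1 - p)), and that quantity is modeled as a linear function of the predictors. A minimal numeric sketch (the coefficients are made up for illustration):

import numpy as np

def sigmoid(z):
    # Inverse of the logit link: maps a real-valued linear score to a probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

b0, b1 = -2.0, 0.8          # hypothetical intercept and slope
x = np.array([0.0, 1.0, 2.5, 5.0])

p = sigmoid(b0 + b1 * x)    # estimated probability of the positive class
print(p)
print(np.log(p / (1 - p)))  # logit(p) recovers the linear score b0 + b1 * x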

Note: the statsmodels library provides several built-in probability distributions, called families: Binomial, Gamma, Gaussian, Negative Binomial, and Poisson. The appropriate distribution can be picked if the statistics of the event are known. These distributions are considered in more detail in the next course of this program. For now, we will use the default distribution: Gaussian, often referred to as the Normal distribution.
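For reference, a non-default family can be passed explicitly when the model is constructed; for a binary target, the Binomial family (with its logit link) is the usual choice. A self-contained sketch on synthetic data, not the credit set:

import numpy as np
import statsmodels.api as sm

# Synthetic data: one predictor, binary outcome.
rng = np.random.RandomState(1)
x = rng.normal(size=100)
y = (x + rng.normal(scale=0.5, size=100) > 0).astype(int)

# GLM with an explicit Binomial family, i.e. an ordinary logistic regression.
res = sm.GLM(y, sm.add_constant(x), family=sm.families.Binomial()).fit()
print(res.params)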

Below, let’s look at the prediction model for Credit Card default using two predictors - vacation spending (Vacations) and mortgage payments (Mtg).

plt.figure(figsize=(11, 8))
plt.xlabel('Vacations', fontsize=18)
plt.xticks(fontsize=18)
plt.ylabel('Mtg', fontsize=18)
plt.yticks(fontsize=16)

plt.scatter(Credit_risk.Vacations, Credit_risk.Mtg, c=Credit_risk.default_enum)

[Scatter plot of Mtg versus Vacations, colored by default_enum]

#Set Vacations and Mtg as predictors and add a constant to the predictors for the intercept.
#The target, the value which we are trying to predict, is default_enum.

import statsmodels.api as sm

exod_list = Credit_risk[['Vacations', 'Mtg']]
exod_list = sm.add_constant(exod_list)
exod_list

#Train the model
glm_vacations_mtg = sm.GLM(Credit_risk.default_enum, exod_list)

res_multi = glm_vacations_mtg.fit()


#Get the summary of the regression
res_multi.summary()
Generalized Linear Model Regression Results
Dep. Variable:   default_enum       No. Observations:   300
Model:           GLM                Df Residuals:       297
Model Family:    Gaussian           Df Model:           2
Link Function:   identity           Scale:              0.10310
Method:          IRLS               Log-Likelihood:     -83.365
Date:            Sat, 05 Jan 2019   Deviance:           30.621
Time:            20:15:29           Pearson chi2:       30.6
No. Iterations:  3                  Covariance Type:    nonrobust

               coef    std err        z      P>|z|     [0.025     0.975]
const       -0.5388      0.075   -7.157      0.000     -0.686     -0.391
Vacations  9.082e-06   5.93e-06    1.531      0.126  -2.54e-06   2.07e-05
Mtg        2.198e-05   1.23e-06   17.806      0.000   1.96e-05   2.44e-05
#To get parameters of the fit

res_multi.params
const       -0.538818
Vacations    0.000009
Mtg          0.000022
dtype: float64
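With the identity link used here, the fitted value const + b_Vacations * Vacations + b_Mtg * Mtg is the model’s score for an observation, so a 0.5 cut-off turns it into a class label; setting the score equal to 0.5 and solving for Mtg also gives the straight decision boundary drawn in the next plot. A quick check of the thresholding idea, reusing res_multi and exod_list from above:

# Fitted scores for every observation, thresholded at 0.5 to get hard class predictions.
scores = res_multi.predict(exod_list)
predicted_class = (scores >= 0.5).astype(int)

# Rough in-sample accuracy of the thresholded GLM fit.
print((predicted_class == Credit_risk.default_enum).mean())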
#Plot the two classes and the decision boundary where the fitted score equals 0.5:
#Mtg = (0.5 - const - b_Vacations * Vacations) / b_Mtg
plt.figure(figsize=(11, 8))
plt.scatter(Credit_risk.Vacations[Credit_risk.default_enum == 0], Credit_risk.Mtg[Credit_risk.default_enum == 0], label='No Default')
plt.scatter(Credit_risk.Vacations[Credit_risk.default_enum == 1], Credit_risk.Mtg[Credit_risk.default_enum == 1], label='Default')
plt.plot(Credit_risk.Vacations, (0.5 - res_multi.params['const'] - res_multi.params['Vacations'] * Credit_risk.Vacations) / res_multi.params['Mtg'], 'r', label='Decision Boundary')
plt.legend()

[Scatter plot of Mtg versus Vacations with the fitted decision boundary]

#Set Vacations and Mtg as predictors. scikit-learn's LogisticRegression fits the intercept itself,
#so no constant column is needed. The target is default_enum.

X_cr = Credit_risk[['Vacations', 'Mtg']]
Y_cr = Credit_risk['default_enum'].values

from sklearn.linear_model import LogisticRegression

#Instantiate  the model
log_reg = LogisticRegression(C=1e10)

#Train the model
log_reg.fit(X_cr, Y_cr)
LogisticRegression(C=10000000000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
#Predict the targets

Y_log_reg_pred = log_reg.predict(X_cr)
#To estimate the goodness of the fit, we evaluate the accuracy score,
#which gives the fraction of correctly predicted observations.

from sklearn.metrics import accuracy_score

accuracy_score(Y_cr, Y_log_reg_pred)
0.75
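Accuracy alone can be misleading on unbalanced data, so it may also help to look at the confusion matrix for the same in-sample predictions (a brief sketch):

from sklearn.metrics import confusion_matrix

# Rows correspond to the true classes (0 = no default, 1 = default),
# columns to the predicted classes.
print(confusion_matrix(Y_cr, Y_log_reg_pred))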

Linear models rely on a number of assumptions:

  • a linear dependence between the predictors and the target;
  • independence of the observations in the dataset.

Linear models are robust and easy to implement and interpret. However, if the listed assumptions do not hold, these models will not produce accurate predictions and more complex methods are required.
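One quick, informal check of the first assumption is to look at how strongly each predictor correlates with the target (a sketch; correlation only captures linear association):

# Pearson correlation of each numeric predictor with the enumerated default indicator.
num_cols = ['CC_Payments', 'Wage', 'Cost_Living', 'Mtg', 'Vacations', 'default_enum']
print(Credit_risk[num_cols].corr()['default_enum'].sort_values())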

