Classifying credit defaulters
Multi-Linear Logistic Regression
Logistic regression is a useful tool for classifying objects. The target variable is binary, either a yes or no, 1 or 0, that we try to estimate. By training the regression on a subset of the data and using the fitted model to predict the rest, we can build a machine learning model. In this example, I’ll build a model that estimates which consumers are most likely to default on their credit loans from a set of variables.
Example: Credit Card default risk
Consider the prediction of a credit card default event. The dataset contains records with a default indicator along with the card owner's wage, mortgage payment amount, vacation spending amount, and cost of living. A model trained on these observations can be used to predict whether a default is expected for a new, unlabeled instance. More precisely, the prediction model is expected to tell how likely a default is, so the card issuer knows the risk.
In these problems, we need the probability of the event with the positive outcome. It is called 'positive' simply because it is the event we are most interested in detecting; ironically, these events are often of an ill nature (a default or a spam message). In the credit card default problem, the goal is to find how likely it is that a default will happen. The target in the training dataset takes two values: fully paid or charged off. Likewise, in a loan approval problem, the target values are loan approved or not approved. Analysis of historical data aims to estimate the probability of the positive outcome (balance fully paid or loan approved); a well-trained model should be able to predict which class a new instance belongs to, which makes it a binary classifier.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm   #needed below for sm.add_constant and sm.GLM
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
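The Credit_risk dataframe is used throughout the rest of the notebook, but the loading step is not shown above. Here is a minimal sketch, assuming the data sits in a CSV file; the file name credit_risk.csv is hypothetical and should be adjusted to the actual location:

#Load the dataset (hypothetical file name) and preview the first rows.
Credit_risk = pd.read_csv('credit_risk.csv')
Credit_risk.head()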
|  | CC_Payments | Wage | Cost_Living | Mtg | Vacations | Default |
|---|---|---|---|---|---|---|
| 0 | 11712 | 89925 | 44004 | 34237 | 13236 | No |
| 1 | 4114 | 82327 | 33111 | 35928 | 10788 | No |
| 2 | 15941 | 68972 | 12901 | 37791 | 8326 | No |
| 3 | 13962 | 67582 | 21491 | 29385 | 7647 | No |
| 4 | 6150 | 88983 | 43084 | 29614 | 10131 | No |
#Create the pairwise plot using the seaborn library.
#Notice the parameter 'hue' set to the values in 'Default' column
import seaborn as sns
sns.pairplot(Credit_risk, hue='Default')
In the pair plot, the observations labeled as No are shown in blue, and Yes in orange. Inspection of the graph suggests that observations can be classified by several predictors: Mtg, Vacations, and Cost_Living.
Before building the model, the count of each class in the dataset should be inspected.
# Count the number of unique values in column "Default".
Credit_risk['Default'].value_counts()
Unbalanced data
For the most accurate results from logistic regression modeling, the data should be balanced (i.e., the number of positive instances should be comparable to the number of negative instances). In the Credit_risk dataframe, there are approximately twice as many observations with a negative target value as with a positive one. This difference is not too big, so we can proceed with the unmodified dataset.
NOTE: Classification algorithms are very sensitive to unbalanced data and for strongly unbalanced datasets, the issue has to be properly addressed before training. The easiest solutions would be:
- to reduce size of oversampled class or
- to inflate the undersampled class.
To reduce an oversampled class, it is recommended to select that class into a separate dataframe, shuffle the rows, and keep every n-th row; the number n is chosen so that the two datasets - positive and negative classes - end up with comparable sizes.
To inflate the undersampled class, duplicate its rows so that the final dataset has an approximately equal number of positive and negative instances. Again, the recipe is to split the data into two datasets by class, duplicate the rows of the undersampled class as many times as required, and then combine the datasets. A short sketch of both options follows below.
Another approach is to stratify the dataset, that is to fetch randomly selected observations from each subclass and compose a new smaller dataset where both classes are equally represented.
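As a rough illustration of the down- and up-sampling options above, here is a minimal sketch using pandas. It assumes the Credit_risk dataframe and the Default column from this example; the sample sizes and random_state are arbitrary choices:

#Split the rows by class.
majority = Credit_risk[Credit_risk['Default'] == 'No']
minority = Credit_risk[Credit_risk['Default'] == 'Yes']
#Option 1: reduce the oversampled class by sampling it down to the size of the minority class.
majority_down = majority.sample(n=len(minority), random_state=1)
balanced_down = pd.concat([majority_down, minority])
#Option 2: inflate the undersampled class by sampling it with replacement.
minority_up = minority.sample(n=len(majority), replace=True, random_state=1)
balanced_up = pd.concat([majority, minority_up])
#Check the class counts in both balanced versions.
balanced_down['Default'].value_counts(), balanced_up['Default'].value_counts()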
#There is one categorical column in the dataset ('Default').
#Convert it to 1 and 0 before training the model.
Credit_risk['default_enum'] = Credit_risk['Default'].map({'Yes': 1, 'No': 0})
Credit_risk.describe()
|  | CC_Payments | Wage | Cost_Living | Mtg | Vacations | default_enum |
|---|---|---|---|---|---|---|
| count | 300.000000 | 300.000000 | 300.000000 | 300.000000 | 300.000000 | 300.000000 |
| mean | 15085.173333 | 85327.020000 | 35959.893333 | 34264.260000 | 9789.563333 | 0.303333 |
| std | 24142.424244 | 12002.180144 | 12550.754681 | 15048.593736 | 3133.039806 | 0.460466 |
| min | -40164.000000 | 59734.000000 | 3654.000000 | 10583.000000 | 87.000000 | 0.000000 |
| 25% | -3234.750000 | 76953.750000 | 27630.750000 | 19554.500000 | 7727.000000 | 0.000000 |
| 50% | 11422.500000 | 84288.000000 | 35481.500000 | 34198.000000 | 10004.000000 | 0.000000 |
| 75% | 38863.500000 | 93335.750000 | 43542.750000 | 45641.250000 | 11960.250000 | 1.000000 |
| max | 63210.000000 | 119703.000000 | 75355.000000 | 82760.000000 | 17421.000000 | 1.000000 |
This example has only 300 observations, so it is not obvious that the observations should be interpreted statistically. To illustrate the point, we will split the rows into 13 wage bins and then calculate the number of observations and the number of defaults in each bin. The description of the Credit_risk dataset shows the wage minimum and maximum values. Based on these values, we’ll define the wage bins as:
wage_bins = [55000, 60000, 65000, 70000, 75000, 80000, 85000, 90000, 95000, 100000, 105000, 110000, 115000, 120000]
wage_bin_labels= ['60', '65', '70', '75', '80', '85', '90', '95', '100', '105', '110', '115', '120']
Credit_risk['wage_group'] = pd.cut(Credit_risk.Wage, wage_bins, right=False, labels=wage_bin_labels)
#Group the rows by the wage group, count number of rows and defaults in each group.
credit_risk_by_wage= Credit_risk[['default_enum', 'wage_group']].groupby(['wage_group']).agg({'default_enum': 'sum', 'wage_group': 'count'})
#Rename columns
credit_risk_by_wage.columns = ['default', 'Count']
#and calculate the frequency of the default in each group.
credit_risk_by_wage['Proportion'] = credit_risk_by_wage.default / credit_risk_by_wage.Count
credit_risk_by_wage.reset_index()
|  | wage_group | default | Count | Proportion |
|---|---|---|---|---|
| 0 | 60 | 1 | 1 | 1.000000 |
| 1 | 65 | 2 | 5 | 0.400000 |
| 2 | 70 | 10 | 27 | 0.370370 |
| 3 | 75 | 16 | 27 | 0.592593 |
| 4 | 80 | 18 | 45 | 0.400000 |
| 5 | 85 | 13 | 54 | 0.240741 |
| 6 | 90 | 13 | 43 | 0.302326 |
| 7 | 95 | 10 | 34 | 0.294118 |
| 8 | 100 | 4 | 27 | 0.148148 |
| 9 | 105 | 3 | 19 | 0.157895 |
| 10 | 110 | 1 | 8 | 0.125000 |
| 11 | 115 | 0 | 8 | 0.000000 |
| 12 | 120 | 0 | 2 | 0.000000 |
plt.figure(figsize=(11, 8))
plt.xlabel('wage group', fontsize=18)
plt.xticks(fontsize=18)
plt.ylabel('Proportion', fontsize=18)
plt.yticks(fontsize=18)
plt.scatter(credit_risk_by_wage.index , credit_risk_by_wage.Proportion)
After grouping by wage group, the number of instances in each group was counted, and the sum of the enumerated default values gives the number of defaults in that group. The calculated proportion gives a rough estimate of the probability of default in each group. We can see that the probability is zero or close to zero for the high wage groups (wage > 105,000) and larger for the lower wage groups.
This was just an illustration of the probabilistic approach to larger datasets.
The Generalized Linear Model applies a statistical model to each observation: the response variable is modeled using a probability distribution, and the parameters of that distribution are then fitted to the predictors' values.
As was stated above, the goal of logistic regression is to find the probability that an instance belongs to the positive class. In logistic regression, this is a two-stage process (a short sketch follows the list):
- first, the probabilities have to be estimated;
- then, probabilities are fitted to a linear model via a transformation function.
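To make the transformation concrete, here is a small sketch, not part of the original notebook, of the logistic (sigmoid) function that turns a linear combination of predictors into a probability, and of its inverse, the logit (log-odds) link:

#Logistic function: maps any real-valued linear score to a probability in (0, 1).
def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))
#Logit function: maps a probability back to the real-valued scale of the linear model.
def logit(p):
    return np.log(p / (1.0 - p))
#A linear score of 0 corresponds to a probability of 0.5.
logistic(0.0), logit(0.5)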
Note: the statsmodels library provides several built-in probability distributions, called families: Binomial, Gamma, Gaussian, Negative Binomial, Poisson. The appropriate distribution can be picked if the statistics of the event are known. These distributions will be considered in more detail in the next course of this program. For now, we will use the default distribution: Gaussian, often referred to as the Normal distribution.
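For reference, a non-default family can be passed to sm.GLM explicitly. The sketch below is not part of the original analysis: it fits the same predictors with the Binomial family (logit link), the usual choice for a 0/1 target; X_example, glm_binomial and res_binomial are names introduced here for illustration only.

#Build the design matrix and fit a GLM with an explicit Binomial family.
X_example = sm.add_constant(Credit_risk[['Vacations', 'Mtg']])
glm_binomial = sm.GLM(Credit_risk.default_enum, X_example, family=sm.families.Binomial())
res_binomial = glm_binomial.fit()
res_binomial.params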
Below, let’s look at the prediction model for credit card default using two predictors: vacation spending (Vacations) and mortgage payments (Mtg).
plt.figure(figsize=(11, 8))
plt.xlabel('Vacations', fontsize=18)
plt.xticks(fontsize=18)
plt.ylabel('Mtg', fontsize=18)
plt.yticks(fontsize=16)
plt.scatter(Credit_risk.Vacations, Credit_risk.Mtg, c=Credit_risk.default_enum)
#Set Vacations and Mtg as predictors and add a constant to the predictors for the intercept.
#The target, the value we are trying to predict, is default_enum.
exod_list = Credit_risk[['Vacations', 'Mtg']]
exod_list = sm.add_constant(exod_list)
exod_list
#Train the model
glm_vacations_mtg = sm.GLM(Credit_risk.default_enum, exod_list)
res_multi = glm_vacations_mtg.fit()
#Get the summary of the regression
res_multi.summary()
| Dep. Variable: | default_enum | No. Observations: | 300 |
|---|---|---|---|
| Model: | GLM | Df Residuals: | 297 |
| Model Family: | Gaussian | Df Model: | 2 |
| Link Function: | identity | Scale: | 0.10310 |
| Method: | IRLS | Log-Likelihood: | -83.365 |
| Date: | Sat, 05 Jan 2019 | Deviance: | 30.621 |
| Time: | 20:15:29 | Pearson chi2: | 30.6 |
| No. Iterations: | 3 | Covariance Type: | nonrobust |
|  | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -0.5388 | 0.075 | -7.157 | 0.000 | -0.686 | -0.391 |
| Vacations | 9.082e-06 | 5.93e-06 | 1.531 | 0.126 | -2.54e-06 | 2.07e-05 |
| Mtg | 2.198e-05 | 1.23e-06 | 17.806 | 0.000 | 1.96e-05 | 2.44e-05 |
#To get parameters of the fit
res_multi.params
const -0.538818
Vacations 0.000009
Mtg 0.000022
dtype: float64
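The decision boundary drawn in the next plot comes from setting the fitted linear predictor const + coef_Vacations*Vacations + coef_Mtg*Mtg equal to a threshold and solving for Mtg. Here is a minimal sketch of that derivation; boundary_mtg, coef_vac and coef_mtg are helper names introduced here, and the default threshold of 0 matches the line plotted below (with the identity link, a threshold of 0.5 on the predicted value is another common choice):

#Unpack the fitted coefficients in the order const, Vacations, Mtg.
const, coef_vac, coef_mtg = res_multi.params
#Solve const + coef_vac*Vacations + coef_mtg*Mtg = threshold for Mtg.
def boundary_mtg(vacations, threshold=0.0):
    return (threshold - const - coef_vac * vacations) / coef_mtg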
plt.scatter(Credit_risk.Vacations[Credit_risk.default_enum == 1], Credit_risk.Mtg[Credit_risk.default_enum == 1], label = 'Default')
plt.scatter(Credit_risk.Vacations[Credit_risk.default_enum == 0], Credit_risk.Mtg[Credit_risk.default_enum == 0], label = 'No Default')
plt.plot(Credit_risk.Vacations, (0.5388/0.00002198) - (0.000009082/0.00002198) * Credit_risk.Vacations, 'r', label='Decision Boundary')
plt.legend()
#Set Vacations and Mtg as predictors; scikit-learn's LogisticRegression fits the intercept itself, so no constant column is needed.
#The target is default_enum.
X_cr = Credit_risk[['Vacations', 'Mtg']]
Y_cr = Credit_risk['default_enum'].values
from sklearn.linear_model import LogisticRegression
#Instantiate the model
log_reg = LogisticRegression(C=1e10)
#Train the model
log_reg.fit(X_cr, Y_cr)
LogisticRegression(C=10000000000.0, class_weight=None, dual=False,
fit_intercept=True, intercept_scaling=1, max_iter=100,
multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
#Predict the targets
Y_log_reg_pred = log_reg.predict(X_cr)
#To estimate goodness of the fit, we evaluate the accuracy score
#which counts the number of correctly predicted observations.
from sklearn.metrics import accuracy_score
accuracy_score(Y_cr, Y_log_reg_pred)
0.75
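The accuracy above is computed on the same data the model was trained on, which tends to be optimistic. Here is a minimal sketch of a hold-out evaluation with a confusion matrix; the 30% test size and random_state are arbitrary choices, not part of the original notebook:

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
#Hold out 30% of the observations for testing, preserving the class proportions.
X_train, X_test, y_train, y_test = train_test_split(X_cr, Y_cr, test_size=0.3, random_state=0, stratify=Y_cr)
#Refit the model on the training part only.
log_reg_holdout = LogisticRegression(C=1e10)
log_reg_holdout.fit(X_train, y_train)
#Accuracy on unseen data and the breakdown of errors by class.
y_test_pred = log_reg_holdout.predict(X_test)
print(accuracy_score(y_test, y_test_pred))
print(confusion_matrix(y_test, y_test_pred))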
Linear models rely on a number of assumptions:
- a linear dependence between the predictors and the target;
- independence of the observations in the dataset.
Linear models are robust, easy to implement and easy to interpret. However, if the listed assumptions do not hold, these models will not produce accurate predictions and more complex methods are required.