Classifying credit defaulters
Multi-Linear Logistic Regression
Logistic regression is a useful tool for classifying objects. The target variable is binary, either a yes or no, 1 or 0, that we try to estimate. By training the regression on a subset of the data and using the fitted model to predict the rest, we can build a machine learning model. In this example, I’ll build a model that estimates which consumers are most likely to default on their credit loans from a set of variables.
Example: Credit Card default risk
Consider the prediction of a credit card default event. The dataset contains records with a default indicator along with the card owner's wage, mortgage payment amount, vacation spending amount, and cost of living. A model trained on these observations can be used to predict whether a default is expected for a new, unlabeled instance. More precisely, the prediction model is expected to tell how likely a default is, so the card issuer knows the risk.
In these problems, we need the probability of the event with the positive outcome. It is called 'positive' simply because it is the event we are most interested in detecting; ironically, these events are often of an ill nature (a default or a spam message). In the credit card default problem, the goal is to find how likely it is that a default will happen. The target in the training dataset takes two values: fully paid or charged off. Likewise, in a loan approval problem, the target values are loan approved or not approved. Analysis of historical data aims to estimate the probability of the positive outcome (balance fully paid or loan approved); a well-trained model should be able to predict which class a new instance belongs to, which makes it a binary classifier.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm   #needed below for sm.add_constant and sm.GLM
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
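The Credit_risk dataframe is used throughout the rest of the notebook, but the loading step is not shown above. Here is a minimal sketch, assuming the data sits in a CSV file; the file name credit_risk.csv is hypothetical and should be adjusted to the actual location:

#Load the dataset (hypothetical file name) and preview the first rows.
Credit_risk = pd.read_csv('credit_risk.csv')
Credit_risk.head()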
|  | CC_Payments | Wage | Cost_Living | Mtg | Vacations | Default |
|---|---|---|---|---|---|---|
| 0 | 11712 | 89925 | 44004 | 34237 | 13236 | No |
| 1 | 4114 | 82327 | 33111 | 35928 | 10788 | No |
| 2 | 15941 | 68972 | 12901 | 37791 | 8326 | No |
| 3 | 13962 | 67582 | 21491 | 29385 | 7647 | No |
| 4 | 6150 | 88983 | 43084 | 29614 | 10131 | No |
#Create the pairwise plot using the seaborn library.
#Notice the parameter 'hue' set to the values in 'Default' column
import seaborn as sns
sns.pairplot(Credit_risk, hue='Default')
In the pair plot, the observations labeled as No are shown in blue, and Yes in orange. Inspection of the graph suggests that observations can be classified by several predictors: Mtg, Vacations, and Cost_Living.
Before building the model, the count of each class in the dataset should be inspected.
# Count the number of unique values in column "Default".
Credit_risk['Default'].value_counts()
Unbalanced data
For the most accurate results from logistic regression modeling, the data should be balanced (i.e., the number of positive instances should be comparable to the number of negative instances). In the Credit_risk dataframe, there are approximately twice as many observations with a negative target value as with a positive one. This difference is not too big, so we can proceed with the unmodified dataset.
NOTE: Classification algorithms are very sensitive to unbalanced data and for strongly unbalanced datasets, the issue has to be properly addressed before training. The easiest solutions would be:
- to reduce size of oversampled class or
- to inflate the undersampled class.
To reduce an oversampled class, it is recommended to select that class into a separate dataframe, shuffle the rows, and keep every n-th row; the number n is chosen so that the two datasets - positive and negative classes - end up with comparable sizes.
To inflate the undersampled class, duplicate its rows so that the final dataset has an approximately equal number of positive and negative instances. Again, the recipe is to split the data into two datasets by class, duplicate the rows of the undersampled class as many times as required, and then combine the datasets. A short sketch of both options follows below.
Another approach is to stratify the dataset, that is to fetch randomly selected observations from each subclass and compose a new smaller dataset where both classes are equally represented.
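As a rough illustration of the down- and up-sampling options above, here is a minimal sketch using pandas. It assumes the Credit_risk dataframe and the Default column from this example; the sample sizes and random_state are arbitrary choices:

#Split the rows by class.
majority = Credit_risk[Credit_risk['Default'] == 'No']
minority = Credit_risk[Credit_risk['Default'] == 'Yes']
#Option 1: reduce the oversampled class by sampling it down to the size of the minority class.
majority_down = majority.sample(n=len(minority), random_state=1)
balanced_down = pd.concat([majority_down, minority])
#Option 2: inflate the undersampled class by sampling it with replacement.
minority_up = minority.sample(n=len(majority), replace=True, random_state=1)
balanced_up = pd.concat([majority, minority_up])
#Check the class counts in both balanced versions.
balanced_down['Default'].value_counts(), balanced_up['Default'].value_counts()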
#There is one categorical column in the dataset ('Default').
#Convert it to 1 and 0 before training the model.
Credit_risk['default_enum'] = Credit_risk['Default'].map({'Yes': 1, 'No': 0})
Credit_risk.describe()
|  | CC_Payments | Wage | Cost_Living | Mtg | Vacations | default_enum |
|---|---|---|---|---|---|---|
| count | 300.000000 | 300.000000 | 300.000000 | 300.000000 | 300.000000 | 300.000000 |
| mean | 15085.173333 | 85327.020000 | 35959.893333 | 34264.260000 | 9789.563333 | 0.303333 |
| std | 24142.424244 | 12002.180144 | 12550.754681 | 15048.593736 | 3133.039806 | 0.460466 |
| min | -40164.000000 | 59734.000000 | 3654.000000 | 10583.000000 | 87.000000 | 0.000000 |
| 25% | -3234.750000 | 76953.750000 | 27630.750000 | 19554.500000 | 7727.000000 | 0.000000 |
| 50% | 11422.500000 | 84288.000000 | 35481.500000 | 34198.000000 | 10004.000000 | 0.000000 |
| 75% | 38863.500000 | 93335.750000 | 43542.750000 | 45641.250000 | 11960.250000 | 1.000000 |
| max | 63210.000000 | 119703.000000 | 75355.000000 | 82760.000000 | 17421.000000 | 1.000000 |
This example has only 300 observations, so it is not obvious that the observations should be interpreted statistically. To illustrate the point, we will split the rows into 13 wage bins and then calculate the number of observations and the number of defaults in each bin. The description of the Credit_risk dataset shows the wage minimum and maximum values. Based on these values, we’ll define the wage bins as:
wage_bins = [55000, 60000, 65000, 70000, 75000, 80000, 85000, 90000, 95000, 100000, 105000, 110000, 115000, 120000]
wage_bin_labels= ['60', '65', '70', '75', '80', '85', '90', '95', '100', '105', '110', '115', '120']
Credit_risk['wage_group'] = pd.cut(Credit_risk.Wage, wage_bins, right=False, labels=wage_bin_labels)
#Group the rows by the wage group, count number of rows and defaults in each group.
credit_risk_by_wage= Credit_risk[['default_enum', 'wage_group']].groupby(['wage_group']).agg({'default_enum': 'sum', 'wage_group': 'count'})
#Rename columns
credit_risk_by_wage.columns = ['default', 'Count']
#and calculate the frequency of the default in each group.
credit_risk_by_wage['Proportion'] = credit_risk_by_wage.default / credit_risk_by_wage.Count
credit_risk_by_wage.reset_index()
|  | wage_group | default | Count | Proportion |
|---|---|---|---|---|
| 0 | 60 | 1 | 1 | 1.000000 |
| 1 | 65 | 2 | 5 | 0.400000 |
| 2 | 70 | 10 | 27 | 0.370370 |
| 3 | 75 | 16 | 27 | 0.592593 |
| 4 | 80 | 18 | 45 | 0.400000 |
| 5 | 85 | 13 | 54 | 0.240741 |
| 6 | 90 | 13 | 43 | 0.302326 |
| 7 | 95 | 10 | 34 | 0.294118 |
| 8 | 100 | 4 | 27 | 0.148148 |
| 9 | 105 | 3 | 19 | 0.157895 |
| 10 | 110 | 1 | 8 | 0.125000 |
| 11 | 115 | 0 | 8 | 0.000000 |
| 12 | 120 | 0 | 2 | 0.000000 |
plt.figure(figsize=(11, 8))
plt.xlabel('wage group', fontsize=18)
plt.xticks(fontsize=18)
plt.ylabel('Proportion', fontsize=18)
plt.yticks(fontsize=18)
plt.scatter(credit_risk_by_wage.index , credit_risk_by_wage.Proportion)
After grouping by wage group, the number of instances in each group was counted, and the sum of the enumerated default values gives the number of defaults in that group. The calculated proportion gives a rough estimate of the probability of default in each group. We can see that the probability is zero or close to zero for the high wage groups (wage > 105,000) and larger for the lower wage groups.
This was just an illustration of the probabilistic approach to larger datasets.
The Generalized Linear Model applies a statistical model to each observation: the response variable is modeled using a probability distribution, and the parameters of that distribution are then fitted to the predictors' values.
As was stated above, the goal of logistic regression is to find the probability that an instance belongs to the positive class. In logistic regression, this is a two-stage process (a short sketch follows the list):
- first, the probabilities have to be estimated;
- then, probabilities are fitted to a linear model via a transformation function.
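To make the transformation concrete, here is a small sketch, not part of the original notebook, of the logistic (sigmoid) function that turns a linear combination of predictors into a probability, and of its inverse, the logit (log-odds) link:

#Logistic function: maps any real-valued linear score to a probability in (0, 1).
def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))
#Logit function: maps a probability back to the real-valued scale of the linear model.
def logit(p):
    return np.log(p / (1.0 - p))
#A linear score of 0 corresponds to a probability of 0.5.
logistic(0.0), logit(0.5)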
Note: the statsmodels library provides several built-in probability distributions, called families: Binomial, Gamma, Gaussian, Negative Binomial, Poisson. The appropriate distribution can be picked if the statistics of the event are known. These distributions will be considered in more detail in the next course of this program. For now, we will use the default distribution: Gaussian, often referred to as the Normal distribution.
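For reference, a non-default family can be passed to sm.GLM explicitly. The sketch below is not part of the original analysis: it fits the same predictors with the Binomial family (logit link), the usual choice for a 0/1 target; X_example, glm_binomial and res_binomial are names introduced here for illustration only.

#Build the design matrix and fit a GLM with an explicit Binomial family.
X_example = sm.add_constant(Credit_risk[['Vacations', 'Mtg']])
glm_binomial = sm.GLM(Credit_risk.default_enum, X_example, family=sm.families.Binomial())
res_binomial = glm_binomial.fit()
res_binomial.params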
Below, let’s look at the prediction model for credit card default using two predictors: vacation spending (Vacations) and mortgage payments (Mtg).
plt.figure(figsize=(11, 8))
plt.xlabel('Vacations', fontsize=18)
plt.xticks(fontsize=18)
plt.ylabel('Mtg', fontsize=18)
plt.yticks(fontsize=16)
plt.scatter(Credit_risk.Vacations, Credit_risk.Mtg, c=Credit_risk.default_enum)
#Set Vacations and Mtg as predictors and add a constant to the predictors for the intercept.
#The target, the value we are trying to predict, is default_enum.
exod_list = Credit_risk[['Vacations', 'Mtg']]
exod_list = sm.add_constant(exod_list)
exod_list
#Train the model
glm_vacations_mtg = sm.GLM(Credit_risk.default_enum, exod_list)
res_multi = glm_vacations_mtg.fit()
#Get the summary of the regression
res_multi.summary()
| Dep. Variable: | default_enum | No. Observations: | 300 |
|---|---|---|---|
| Model: | GLM | Df Residuals: | 297 |
| Model Family: | Gaussian | Df Model: | 2 |
| Link Function: | identity | Scale: | 0.10310 |
| Method: | IRLS | Log-Likelihood: | -83.365 |
| Date: | Sat, 05 Jan 2019 | Deviance: | 30.621 |
| Time: | 20:15:29 | Pearson chi2: | 30.6 |
| No. Iterations: | 3 | Covariance Type: | nonrobust |
|  | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -0.5388 | 0.075 | -7.157 | 0.000 | -0.686 | -0.391 |
| Vacations | 9.082e-06 | 5.93e-06 | 1.531 | 0.126 | -2.54e-06 | 2.07e-05 |
| Mtg | 2.198e-05 | 1.23e-06 | 17.806 | 0.000 | 1.96e-05 | 2.44e-05 |
#To get parameters of the fit
res_multi.params
const -0.538818
Vacations 0.000009
Mtg 0.000022
dtype: float64
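The decision boundary drawn in the next plot comes from setting the fitted linear predictor const + coef_Vacations*Vacations + coef_Mtg*Mtg equal to a threshold and solving for Mtg. Here is a minimal sketch of that derivation; boundary_mtg, coef_vac and coef_mtg are helper names introduced here, and the default threshold of 0 matches the line plotted below (with the identity link, a threshold of 0.5 on the predicted value is another common choice):

#Unpack the fitted coefficients in the order const, Vacations, Mtg.
const, coef_vac, coef_mtg = res_multi.params
#Solve const + coef_vac*Vacations + coef_mtg*Mtg = threshold for Mtg.
def boundary_mtg(vacations, threshold=0.0):
    return (threshold - const - coef_vac * vacations) / coef_mtg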
plt.scatter(Credit_risk.Vacations[Credit_risk.default_enum == 1], Credit_risk.Mtg[Credit_risk.default_enum == 1], label = 'Default')
plt.scatter(Credit_risk.Vacations[Credit_risk.default_enum == 0], Credit_risk.Mtg[Credit_risk.default_enum == 0], label = 'No Default')
plt.plot(Credit_risk.Vacations, (0.5388/0.00002198) - (0.000009082/0.00002198) * Credit_risk.Vacations, 'r', label='Decision Boundary')
plt.legend()
#Set Vacations and Mtg as predictors; scikit-learn's LogisticRegression fits the intercept itself, so no constant column is needed.
#The target is default_enum.
X_cr = Credit_risk[['Vacations', 'Mtg']]
Y_cr = Credit_risk['default_enum'].values
from sklearn.linear_model import LogisticRegression
#Instantiate the model
log_reg = LogisticRegression(C=1e10)
#Train the model
log_reg.fit(X_cr, Y_cr)
LogisticRegression(C=10000000000.0, class_weight=None, dual=False,
fit_intercept=True, intercept_scaling=1, max_iter=100,
multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
#Predict the targets
Y_log_reg_pred = log_reg.predict(X_cr)
#To estimate goodness of the fit, we evaluate the accuracy score
#which counts the number of correctly predicted observations.
from sklearn.metrics import accuracy_score
accuracy_score(Y_cr, Y_log_reg_pred)
0.75
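The accuracy above is computed on the same data the model was trained on, which tends to be optimistic. Here is a minimal sketch of a hold-out evaluation with a confusion matrix; the 30% test size and random_state are arbitrary choices, not part of the original notebook:

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
#Hold out 30% of the observations for testing, preserving the class proportions.
X_train, X_test, y_train, y_test = train_test_split(X_cr, Y_cr, test_size=0.3, random_state=0, stratify=Y_cr)
#Refit the model on the training part only.
log_reg_holdout = LogisticRegression(C=1e10)
log_reg_holdout.fit(X_train, y_train)
#Accuracy on unseen data and the breakdown of errors by class.
y_test_pred = log_reg_holdout.predict(X_test)
print(accuracy_score(y_test, y_test_pred))
print(confusion_matrix(y_test, y_test_pred))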
Linear models rely on a number of assumptions:
- a linear dependence between the predictors and the target;
- independence of the observations in the dataset.
Linear models are robust, easy to implement and easy to interpret. However, if the listed assumptions do not hold, these models will not produce accurate predictions and more complex methods are required.