import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
2 Credit analysis
2.1 Credit risk and the Basel Accords
Credit risk refers to the possibility that a borrower will fail to meet their debt obligations according to the agreed terms, leading to a financial loss for the lender or creditor. It is a fundamental component of risk management in banking and finance, affecting a wide range of financial products, from personal loans and credit cards to corporate bonds and syndicated loans.
The Basel Accords (Basel I, Basel II, and Basel III) are foundational to how banks and financial institutions manage and mitigate credit risk. They are a series of international regulatory frameworks developed by the Basel Committee on Banking Supervision (BCBS) to ensure that banks hold sufficient capital to cover the risks they face, including credit risk.
The three accords and their shared aims can be summarized as follows:
- Basel I (1988) introduced minimum capital requirements for banks, focusing on credit risk.
- Basel II (2004) expanded on Basel I by including capital requirements for operational risk and enhancing risk management and disclosure practices.
- Basel III (2010) was introduced in response to the 2008 financial crisis, bringing stricter capital requirements, leverage ratios, and liquidity standards.
- The Accords aim to strengthen regulation, supervision, and risk management within the banking sector globally.
- Compliance with Basel standards is voluntary, but adoption is strongly encouraged by international financial systems.
To apply the Basel Accords to credit risk modeling:
- Gather historical data on borrowers, including default rates, credit scores, and financial statements. High-quality, comprehensive data are crucial for accurate modeling.
- Follow one of the two approaches to credit risk modeling: the Standardized Approach (SA) or the Internal Ratings-Based (IRB) approach.
For IRB, we need to develop internal models to estimate Probability of Default (PD), Loss Given Default (LGD), and Exposure at Default (EAD).
Loss Given Default (LGD) is a key parameter in credit risk management and represents the amount of loss a lender incurs when a borrower defaults, after accounting for recoveries.
The Probability of Default (PD) is a measure used in credit risk management to quantify the likelihood that a borrower will default on their debt obligations within a specified time period, typically one year.
Exposure at Default (EAD) is a key parameter in credit risk management that represents the total value a bank or lender is exposed to at the time a borrower defaults. EAD is an essential component in the calculation of expected loss and regulatory capital requirements.
- Model Validation
Backtesting: compare model predictions with actual historical outcomes to validate accuracy.
In this session, we will focus only on the IRB approach and estimate only the Probability of Default (PD).
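For context, the three IRB parameters defined above are commonly combined into an expected loss figure, EL = PD × LGD × EAD. The sketch below only illustrates that arithmetic. The function name `expected_loss` and the loan figures are hypothetical, LGD is assumed to be expressed as a fraction of the exposure, and nothing here is taken from the credit data set.

```python
def expected_loss(pd_, lgd, ead):
    """Expected loss = PD * LGD * EAD.

    pd_ and lgd are fractions between 0 and 1 (an assumption of this sketch);
    ead is an amount in currency units.
    """
    if not (0.0 <= pd_ <= 1.0 and 0.0 <= lgd <= 1.0):
        raise ValueError("PD and LGD must lie between 0 and 1")
    if ead < 0:
        raise ValueError("EAD must be non-negative")
    return pd_ * lgd * ead


# Hypothetical loan: 2% one-year PD, 45% LGD, 10,000 outstanding at default.
el = expected_loss(pd_=0.02, lgd=0.45, ead=10_000)
print(round(el, 2))  # 90.0
```

The rest of the chapter estimates only the first ingredient, PD.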
2.2 Logistic regression and discriminant analysis, credit allocation analysis and Probability of Default.
To cover these topics, we will estimate the Probability of Default (PD) using logistic regression and discriminant analysis models, in the context of a credit allocation analysis.
2.2.1 Gather historical data on borrowers
In this chapter, we will cover credit allocation analysis (loan origination). The database credit.xlsx contains historical information from Lendingclub (https://www.lendingclub.com/), a fintech marketplace bank. The spreadsheets include the variable descriptions. The original data set has at least 2 million observations and 150 variables. The file “credit.xlsx” contains only 873 observations (rows) and 71 columns. Each row represents a Lendingclub client. The data have already been cleaned (missing values, correlated variables, and zero- and near-zero-variance predictors).
The next output shows the variables recorded for Lendingclub’s customers when the loan was granted. For example, “term” is the term of the loan, in years, and “annual_inc” is the customer’s annual income at the time of the loan.
credit=pd.read_csv("https://raw.githubusercontent.com/abernal30/ml_book/main/credit.csv")
credit.head()
|  | Default | term | installment | grade | emp_title | emp_length | home_ownership | annual_inc | verification_status | purpose | ... | num_il_tl | num_op_rev_tl | num_rev_accts | num_rev_tl_bal_gt_0 | num_sats | num_tl_op_past_12m | pct_tl_nvr_dlq | percent_bc_gt_75 | pub_rec_bankruptcies | total_bc_limit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Fully Paid | 1 | 123.03 | 3 | 299.0 | 3 | 1 | 55000.0 | 1 | 3 | ... | 3 | 4 | 9 | 4 | 7 | 3 | 76.9 | 0.0 | 0 | 2400 | 
| 1 | Fully Paid | 1 | 820.28 | 3 | 209.0 | 3 | 1 | 65000.0 | 1 | 10 | ... | 6 | 20 | 27 | 5 | 22 | 2 | 97.4 | 7.7 | 0 | 79300 | 
| 2 | Fully Paid | 2 | 432.66 | 2 | 623.0 | 3 | 1 | 63000.0 | 1 | 4 | ... | 6 | 4 | 7 | 3 | 6 | 0 | 100.0 | 50.0 | 0 | 6200 | 
| 3 | Fully Paid | 2 | 289.91 | 6 | 126.0 | 5 | 1 | 104433.0 | 2 | 6 | ... | 10 | 7 | 19 | 6 | 12 | 4 | 96.6 | 60.0 | 0 | 20300 | 
| 4 | Fully Paid | 1 | 405.18 | 3 | 633.0 | 6 | 3 | 34000.0 | 2 | 3 | ... | 2 | 4 | 4 | 3 | 5 | 0 | 100.0 | 100.0 | 0 | 9400 | 
5 rows × 71 columns
The variable “Default”, which was originally named “loan_status”, has two labels: “Fully Paid” and “Charged Off”.
“Charged Off” means that the credit grantor wrote the account off its receivables as a loss and closed it to future charges. When an account displays a status of “Charged Off,” it is closed to future use, although the customer still owes the debt. For this example, we will consider Charged Off equivalent to default and Fully Paid as no default.
The “Default” variable is stored as text. The scikit-learn classifiers we apply below accept text class labels directly, so no conversion is needed; here we only count how many observations fall in each class.
credit["Default"].value_counts()
Default
Fully Paid     728
Charged Off    145
Name: count, dtype: int64
2.2.2 Defining the model (Follow a standardized approach)
We use the Internal Ratings-Based (IRB) approach and, within it, estimate the Probability of Default (PD).
First, we define our dependent and independent variables:
\[Default=\alpha_{0}\ +\beta_{1}\ term_{1}+\beta_{2}\ grade_{2}+...+\beta_{n}\ variable_{n}+e\]
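The equation above is written as a linear index. As a point of reference, a logistic regression in its standard form does not fit Default directly as that linear combination; it passes the index through the logistic function, so the fitted value is a probability between 0 and 1. The expression below reuses the symbols of the equation above, and writing the event as Default = 1 is a notational assumption made here for illustration:

```latex
\[
P(Default=1 \mid X)=\frac{1}{1+\exp\!\left(-\left(\alpha_{0}+\beta_{1}\ term_{1}+\beta_{2}\ grade_{2}+...+\beta_{n}\ variable_{n}\right)\right)}
\]
```

Equivalently, the log-odds of the event are linear in the predictors, which is why the linear index is still a useful way to describe the model.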
y = credit["Default"]  # select the dependent variable, Default
X = credit.drop(columns=["Default"])  # drop the dependent variable to keep only the predictors
2.2.3 Training the model (Model validation)
It is common practice to divide the data set into training (80% of the observations) and testing (the other 20%). However, other authors suggest splitting it into a training set, a validation set, and a test set (Lantz 2019). In the latter case, the proposal is to train the model on the training set (for example, 60% of the data), do the cross-validation (a procedure we will cover below) on the validation set (for example, 20% of the data), and, once the model has a good prediction performance, test it on the test set (20% of the data).
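The 60/20/20 variant described above can be sketched by calling train_test_split twice. This is only an illustration on synthetic data: the arrays `X_demo` and `y_demo` and the other `_demo` names are hypothetical and stand in for the credit data, and the chapter itself continues with the simpler 80/20 split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): 100 observations, 3 predictors.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.integers(0, 2, size=100)

# Step 1: hold out 20% of the data as the final test set.
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0
)

# Step 2: split the remaining 80% into training and validation sets.
# 0.25 of the remaining 80% equals 20% of the full data set.
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

print(len(X_train_demo), len(X_val_demo), len(X_test_demo))  # 60 20 20
```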
We randomly split the data into a training set (80%) and a test set (20%).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=43)
The “train_test_split” function performs the partition. The “test_size” argument is the share of observations assigned to the test set. Because the partition is random, the “random_state” argument fixes the seed so that everyone obtains the same split and the same results.
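To make the role of random_state concrete, the sketch below checks that two calls with the same seed return the same partition. It also shows the optional stratify argument of train_test_split, which keeps the class shares equal in both subsets. The stratify option is not used in this chapter, and the `_demo` arrays are hypothetical placeholders with a 20% default share.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 1 = default (20 cases), 0 = no default (80 cases).
y_demo = np.array([1] * 20 + [0] * 80)
X_demo = np.arange(100).reshape(-1, 1)  # placeholder predictor

# Same random_state -> identical partitions on every call.
a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=43)
b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=43)
same_partition = np.array_equal(a[1], b[1])  # compare the two test sets

# stratify=y_demo keeps the default share equal in the train and test sets.
_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=43, stratify=y_demo
)
print(same_partition, y_tr.mean(), y_te.mean())
```

Stratifying can matter when one class is much rarer than the other, as is the case for Charged Off loans in this data set.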
We define the machine learning model that we want to run. In this case, it is a logistic regression model, which is designed for binary dependent variables.
import warnings
warnings.filterwarnings('ignore')
model = LogisticRegression()
Then we estimate the parameters. The first argument is X and the second is y.
model.fit(X_train, y_train)
LogisticRegression()
Unlike econometric models, where we do causality analysis, here we are not concerned with printing the resulting regression parameters but with making predictions. In this case, the prediction is whether the loan will be “Fully Paid” (no default) or “Charged Off” (default). The previous output only tells us that we have run a logit model.
The next step is to make the prediction on the test data set.
predict_logit=model.predict(X_test)
predict_logit[:5]
array(['Fully Paid', 'Fully Paid', 'Charged Off', 'Fully Paid',
       'Charged Off'], dtype=object)
Measuring model performance
To measure the performance of our prediction, we will use accuracy_score. The first argument is the observed labels from the test data set (y_test) and the second is the prediction.
accuracy_score(y_test, predict_logit)
0.9314285714285714
To explain the accuracy briefly: the function compares each prediction with the real observation (stored in y_test). The resulting 0.9314, or 93.14%, is the share of predictions that equal the real data. The higher this percentage, the better our prediction.
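The accuracy calculation can be reproduced by hand, which also makes it easy to compare against a simple benchmark: always predicting the most frequent class. The labels below are hypothetical and are not the chapter's test set. As a rough guide, 728 of the 873 loans in the full data set are Fully Paid, so a rule that always predicts “Fully Paid” would already be right about 83% of the time on the full sample.

```python
import numpy as np

# Hypothetical observed labels and predictions for ten loans.
y_true = np.array(["Fully Paid"] * 8 + ["Charged Off"] * 2)
y_pred = np.array(["Fully Paid"] * 8 + ["Charged Off"] + ["Fully Paid"])

# Accuracy is the share of predictions that match the observed label.
accuracy = np.mean(y_true == y_pred)

# Benchmark: always predict the most frequent observed class.
labels, counts = np.unique(y_true, return_counts=True)
baseline = counts.max() / counts.sum()

print(accuracy, baseline)  # 0.9 0.8
```

An accuracy figure is most informative when read against this kind of benchmark, especially when the classes are unbalanced.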
2.2.4 Linear Discriminant Analysis
Now we train another machine learning model, Linear Discriminant Analysis (LDA). The idea is to verify which of the two models, logistic regression or LDA, has the better accuracy. The only thing we do differently from the logistic model is change this code:
model_LDA=LinearDiscriminantAnalysis()
The rest of the procedure is the same.
model_LDA.fit(X_train,y_train)
predict_LDA=model_LDA.predict(X_test)
accuracy_score(y_test, predict_LDA)
0.9428571428571428
In this case, LDA has the higher accuracy (94.29% versus 93.14% for the logistic model), so we use it to make the prediction.
We can make other adjustments to improve the predictive power and “Accuracy,” such as looking for combinations of the independent variables, X, or running other machine learning models. But for this course, this is all that will be covered.
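One such adjustment, shown here only as a sketch outside the scope of the course, is to compare the two models with k-fold cross-validation instead of a single split. A single 80/20 split gives one accuracy number per model, and the gap between 0.9314 and 0.9429 is small enough that it may depend on which loans happened to land in the test set. The example uses synthetic data from make_classification as a hypothetical stand-in for the credit data; the names `X_demo`, `y_demo`, `candidates`, and `cv_scores` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the credit data (hypothetical), with unbalanced classes.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5,
    weights=[0.83, 0.17], random_state=0,
)

candidates = {
    "logit": LogisticRegression(max_iter=1000),
    "lda": LinearDiscriminantAnalysis(),
}

# 5-fold cross-validated accuracy: five scores per model.
cv_scores = {
    name: cross_val_score(est, X_demo, y_demo, cv=5, scoring="accuracy")
    for name, est in candidates.items()
}
for name, scores in cv_scores.items():
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

Comparing the means together with the spread across folds gives a sense of whether one model is consistently better or the difference is within noise.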
2.2.5 Probability of Default
We estimate the Probabilities of Default (PD) from the best model between Logit and Linear Discriminant.
pd.DataFrame(model_LDA.predict_proba(X_test)).head()
|  | 0 | 1 |
|---|---|---|
| 0 | 0.000036 | 0.999964 | 
| 1 | 0.999554 | 0.000446 | 
| 2 | 0.489148 | 0.510852 | 
| 3 | 0.000002 | 0.999998 | 
| 4 | 0.999953 | 0.000047 | 
The table above shows the estimated class probabilities for the first observations in X_test. The columns of predict_proba follow the order of model_LDA.classes_, which scikit-learn sorts; with the text labels used here, column 0 corresponds to “Charged Off” (default) and column 1 to “Fully Paid” (no default), and each row sums to one. In this case, 0.000036 is the estimated probability that a person with the characteristics of the first observation in X_test will not pay the loan, and 0.999964 is the probability that the loan is fully paid.
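Which column holds the probability of default depends on how scikit-learn orders the class labels, and that order is exposed in the fitted model's classes_ attribute. A minimal sketch, on synthetic data with hypothetical names (`X_demo`, `y_demo`, `lda_demo`, `pd_hat`), of labelling the predict_proba columns explicitly so that the PD column is selected by name instead of by position:

```python
import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic stand-in data (hypothetical): one predictor, two text labels.
rng = np.random.default_rng(0)
x_paid = rng.normal(loc=0.0, scale=1.0, size=(50, 1))
x_default = rng.normal(loc=3.0, scale=1.0, size=(50, 1))
X_demo = np.vstack([x_paid, x_default])
y_demo = np.array(["Fully Paid"] * 50 + ["Charged Off"] * 50)

lda_demo = LinearDiscriminantAnalysis().fit(X_demo, y_demo)

# classes_ gives the label behind each predict_proba column (sorted order).
proba = pd.DataFrame(lda_demo.predict_proba(X_demo), columns=lda_demo.classes_)

# Probability of default = the column for the "Charged Off" label.
pd_hat = proba["Charged Off"]
print(list(lda_demo.classes_))
print(pd_hat.head())
```

The same pattern would apply to model_LDA and X_test from this chapter, replacing the synthetic objects.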