Predicting Credit Risk

Jan 21, 2021

U.S. banks will set aside up to $320 billion to cover potential credit losses in 2020 due to the financial strain of the pandemic, according to a new report from Accenture (NYSE: ACN). Accenture estimates that banks will need to hold an additional US$210 billion to US$265 billion to cover potential write-offs for bad loans in 2020, depending on the severity of the public health aspect of COVID-19.

Dataset & Investigative Questions

This report presents an analysis of the behavior of German borrowers. The data classify each individual as a good or bad credit risk (the target variable) based on a set of attributes. The attributes include both integer and categorical variables that describe the individual. The analysis is guided by questions such as: What is the duration of the account with the bank? How many dependents does the individual have? What credit amount does the borrower owe? Answering such questions helps reveal the patterns that separate good credit risk from bad. In the exploratory data analysis, we examine various features against the target variable, Credit Risk (V21).
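As a rough illustration of examining a feature against the target, the sketch below computes the share of bad-risk borrowers within each category of one attribute. The records and the category names are hypothetical toy values, not rows from the dataset; only the target coding (1 = good, 2 = bad) follows the text.

```python
from collections import defaultdict

# Hypothetical toy records: (checking_account_status, credit_risk).
# credit_risk uses the dataset's coding: 1 = good, 2 = bad.
records = [
    ("no_account", 1), ("no_account", 1), ("no_account", 2),
    ("negative", 2), ("negative", 2), ("negative", 1),
    ("positive", 1), ("positive", 1), ("positive", 1), ("positive", 2),
]

def bad_rate_by_category(rows):
    """Return {category: share of rows labelled bad (2)}."""
    totals = defaultdict(int)
    bads = defaultdict(int)
    for category, risk in rows:
        totals[category] += 1
        if risk == 2:
            bads[category] += 1
    return {c: bads[c] / totals[c] for c in totals}

rates = bad_rate_by_category(records)
for category, rate in sorted(rates.items()):
    print(f"{category}: {rate:.2f}")
```

A category whose bad rate differs clearly from the overall rate is a candidate predictor worth a closer look.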

a) Logistic regression (analysis performed using R): I selected logistic regression as the base model over decision tree models (C5.0, CHAID, QUEST, C&R Tree) because we are dealing with high-dimensional data. With many dimensions, the curse of dimensionality makes flexible models prone to overfitting, so the simple hyperplane learned by a logistic (logit) model tends to generalize better.
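To make the "simple hyperplane" idea concrete, here is a minimal Python sketch of logistic regression fitted by batch gradient descent. It is an illustration only, not the R analysis behind this post: the two features, their values, and the 0/1 recoding of the target are all assumptions made for the example.

```python
import math

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this row: p(bad) minus the true label.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(w, b, x):
    """Estimated probability that x belongs to class 1 (bad risk)."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Hypothetical, already-scaled features: (duration, credit_amount).
# Labels are recoded for the sketch: 1 = bad risk, 0 = good risk.
X = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.8, 0.9], [0.9, 0.7], [0.7, 0.8]]
y = [0, 0, 0, 1, 1, 1]

w, b = fit_logistic(X, y)
preds = [1 if predict_proba(w, b, x) >= 0.5 else 0 for x in X]
print(preds)
```

The fitted model is the hyperplane w·x + b = 0. It has one weight per feature plus a bias, which is why the parameter count grows only linearly with the number of dimensions.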

b) XGBoost Tree (analysis performed using SPSS): I selected XGBoost over the Random Forest model because the target variable is imbalanced, with 70% positive and 30% negative cases. XGBoost belongs to the gradient boosting family of algorithms, which performs well on many classification problems, including those with imbalanced classes.

Data Engineering & Limitation

The target variable is not evenly distributed: 700 records are labeled 1 (good risk) and 300 are labeled 2 (bad risk). Class imbalance is a common issue in anomaly-detection scenarios, such as detecting fraudulent bank transactions or identifying disease in patients. If we ignore this issue, the resulting predictive model will be biased toward the majority class and its accuracy will be misleading.
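The effect of the 700/300 split can be quantified with a few lines of Python. The sketch below uses only the class counts stated above. Inverse-frequency weighting is shown as one common remedy and is an assumption for illustration, not necessarily the adjustment applied in this project.

```python
from collections import Counter

# Target counts stated in the text: 700 good (1) and 300 bad (2).
labels = [1] * 700 + [2] * 300

counts = Counter(labels)
n, k = len(labels), len(counts)

# Accuracy obtained by always predicting the majority class.
majority_baseline = max(counts.values()) / n

# Inverse-frequency ("balanced") weights: n / (k * count_of_class).
class_weights = {c: n / (k * cnt) for c, cnt in counts.items()}

print(majority_baseline)
print(class_weights)
```

A model that labels every borrower as good risk already reaches the majority baseline without learning anything, so accuracy should be read against that figure. With the weights above, each class contributes the same total weight to the training loss.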