
Comparison of Machine Learning Techniques when Estimating Probability of Impairment

Estimating Probability of Impairment through Identification of Defaulting Customers One Year Ahead of Time

Authors: Alexander Eriksson, Jacob Långström
Supervisors: Prof. Oleg Seleznjev, Xun Su

June 13, 2019

Master Thesis, 30 hp
Degree Project in Industrial Engineering and Management, Spring 2019





Abstract

Probability of Impairment, or Probability of Default, is the ratio of how many customers within a segment are expected to not fulfil their debt obligations and instead go into Default. This is a key metric within banking to estimate the level of credit risk, where the current standard is to estimate Probability of Impairment using Linear Regression. In this paper we show how this metric can instead be estimated through a classification approach with machine learning. By using models trained to find which specific customers will go into Default within the upcoming year, based on Neural Networks and Gradient Boosting, the Probability of Impairment is shown to be more accurately estimated than when using Linear Regression. Additionally, these models provide numerous real-life applications internally within the banking sector. The new features of importance we found can be used to strengthen the models currently in use, and the ability to identify customers about to go into Default lets banks take necessary actions ahead of time to cover otherwise unexpected risks.

Key Words

Classification, Imbalanced Data, Machine Learning, Probability of Impairment, Risk Management

Sammanfattning (translated from Swedish)

The title of this report is A Comparison of Machine Learning Techniques for Estimating Probability of Impairment. The Probability of Impairment is estimated by identifying borrowers who will not fulfil their repayment obligations within one year. Probability of Impairment, or Probability of Default, is the share of customers who are expected not to fulfil their obligations as borrowers, so that repayment does not occur. This is a key metric within the banking sector for estimating the level of credit risk, which under the current regulatory standard is estimated using Linear Regression. In this thesis we show how this metric can instead be estimated through classification with machine learning. By using models designed to find which specific customers will not fulfil their repayment obligations within the coming year, based on Neural Networks and Gradient Boosting, we show that the Probability of Impairment is estimated better than with Linear Regression. These models also offer a large number of internal applications within the banking sector. The new variables of interest we found can be used to strengthen the models in use today, and the ability to identify customers at risk of not fulfilling their obligations lets banks take necessary actions in good time to handle otherwise unexpected risks.

Nyckelord (Keywords, translated from Swedish)

Classification, Imbalanced Data, Machine Learning, Probability of Impairment, Risk Management



Acknowledgements

We want to thank Professor Oleg Seleznjev at Umeå University for his mentoring and input, leading us to achieve the knowledge necessary to write this thesis; Xun Su at Nordea for her administrative work and for challenging us to explore issues we otherwise would not have considered; Nordea for providing data and letting us write our thesis at their Swedish headquarters in Stockholm; Alexander Ramström at Nordea for his administrative work in giving us access to both rooms and data; and finally the remaining people from the IFRS 9 team and their team leader Andreas Wirenhammar for welcoming us and answering any questions we had.



Contents

1 Introduction
  1.1 Background
  1.2 Problem Definition
  1.3 Purpose and Aim
  1.4 Delimitations
  1.5 Data
  1.6 Approach and Outline

2 Theory
  2.1 Binary Classification
  2.2 Multiple Linear Regression
  2.3 Multiple Imputation by Chained Equations
  2.4 Principal Component Analysis
  2.5 Imbalanced Data
  2.6 Tree-Based Methods
  2.7 Artificial Neural Networks
  2.8 Model Selection
  2.9 Evaluation

3 Method
  3.1 Pre-Processing Data
    3.1.1 Creating the Target Variable: YearDefault
    3.1.2 Macro Data
    3.1.3 Initial Data Cleaning
    3.1.4 Missing Data
    3.1.5 Imputing Missing Values
    3.1.6 Grouping of Minority Categories
    3.1.7 Historical Customer Data
    3.1.8 Splitting the Data
    3.1.9 Oversampling, One-Hot Encoding and Standardization
    3.1.10 Data for Linear Regression
  3.2 Models
    3.2.1 ANN - Artificial Neural Network
    3.2.2 RF - Random Forest
    3.2.3 XGBoost - Extreme Gradient Boosting
    3.2.4 Ensemble of ANN and XGBoost
    3.2.5 Linear Regression

4 Results
  4.1 Linear Regression
  4.2 Classifiers
  4.3 Comparing all Models

5 Discussion
  5.1 Conclusion
  5.2 Classification Problem
  5.3 Complexity
  5.4 Important Features
  5.5 Non-Linear Risk Grade
  5.6 Development Opportunities

6 Reference List



Abbreviations

ACC Accuracy

ACCE Expected Accuracy

ANN Artificial Neural Network

AP Average Precision

AUC Area Under Curve

AUPRC Area Under Precision-Recall Curve

Default Borrowers failing to fully meet their obligations to clear their debt

ECL Expected Credit Loss

ENN Wilson's Edited Nearest Neighbours rule

Ensemble Ensemble model of Artificial Neural Network and Extreme Gradient Boosting

FN False Negative

FNR False Negative Rate

FP False Positive

FPR False Positive Rate

G-mean Geometric Mean

IFRS 9 International Financial Reporting Standard

Kappa Cohen’s Kappa

LR Linear Regression

Macro Refers to data containing macroeconomic features

MCC Matthews Correlation Coefficient

MICE Multiple Imputation by Chained Equations

MLR Multiple Linear Regression

NPV Negative Predictive Value

PC Principal Components

PCA Principal Component Analysis

PD Probability of Default

PI Probability of Impairment

Precision Positive Predictive Value

PRC Precision-Recall Curve

Recall True Positive Rate

RF Random Forest

ROC Receiver Operating Characteristic

Shipping Refers to data containing features for customers whose business activities are related to shipping

SMOTE Synthetic Minority Oversampling Technique

SMOTEENN Synthetic Minority Oversampling Technique, followed by cleaning using Edited Nearest Neighbours

Specificity True Negative Rate

TN True Negative

TP True Positive

XGBoost Extreme Gradient Boosting

*model*_cla Classifier version of *model*

*model*_reg Regression version of *model*



1 Introduction

The purpose of this chapter is to provide the necessary background to the problem we examine throughout the report. This includes a description of the problem, how we approach it, and what data we have access to. The data are provided by Nordea, and this report was written at their Swedish headquarters in Stockholm. We have not been awarded any monetary compensation or been made any promises of future gain, and are thus unbiased when writing this report. The PI presented in this report is based on coded variables we have created to avoid leaking any classified information such as customer identities; whenever a customer appears an additional time within the data, we treat that additional observation as an entirely new customer. This also encodes Nordea's true realised PI without losing any properties valuable to us.

1.1 Background

Probability of Impairment [PI] has the same meaning as Probability of Default [PD] or Risk of Default, and is defined as a theoretical percentage explaining how many out of a group of borrowers, for any reason, are expected to not fulfil their obligated debt payments (Nordea, 2017, p. 45). This is a key parameter within banking when estimating Expected Credit Loss [ECL], which is the expected loss due to borrowers failing to fully meet their obligations to clear their debts. Estimating ECL is not only important from a business standpoint, to determine capital buffers and interest rates, but is also a legal requirement. The capital of banks within the European Union [EU] is subject to many legal frameworks such as CRD IV and CRR. These are based on Basel III, with the purpose of lowering the risks of banking activities, and do so by requiring banks to maintain a higher quality and level of capital based on the amount of Risk Weighted Assets, or assets subject to credit risk (European Commission, 2019). To estimate the level of credit risk, the International Financial Reporting Standard [IFRS 9] has been adopted by the EU, and within this standard it is required to estimate ECL by PI. We will use the term PI rather than PD, because PI is associated with IFRS 9 while PD is associated with the Internal Ratings-Based approach.

1.2 Problem Definition

Since PI is a central value within ECL, it is important to estimate it as well as possible. If the estimated PI for a certain group is lower than in reality, ECL will also be underestimated, and the expected losses for the upcoming year risk not being covered by income. If the opposite occurs and the estimated PI is larger than in reality, showing too large an ECL, perfectly sound opportunities to give loans and earn income will be passed over. A larger ECL also leads to a larger capital requirement under the European legal frameworks, which means more capital than needed is locked away as a buffer instead of being invested. Nordea currently estimates PI mainly using Linear Regression [LR].



1.3 Purpose and Aim

The purpose of this report is to estimate PI for a segment using various machine learning techniques through the two approaches illustrated in Figure 1. The first approach is to classify each observation within the segment as either Default or Non-default, and then estimate PI as the ratio of Defaults among all observations. The second approach uses regression-based techniques, where for each observation we estimate the individual probability of going into Default within one year, and then estimate PI as the average of these regression values.

Figure 1: Approaches

The models we build will be compared to the LR model currently used by Nordea, where we evaluate the models based on their predictive power and complexity. The aim of this project is for our presented models to be better suited for estimating PI than the current LR model, given the data collected by Nordea. This would ultimately let Nordea better predict its credit risk and therefore make more informed decisions.

1.4 Delimitations

There are many types and variants of machine learning techniques that could be used to estimate PI, but we have limited ourselves to comparing the following methods:

• Random Forest [RF].

• Extreme Gradient Boosting [XGBoost].

• Artificial Neural Network [ANN].

• Ensemble of ANN and XGBoost [Ensemble].

All methods except the Ensemble are found in these studies (Chen et al., 2016; Carmona et al., 2017). RF is a strong model relative to its low complexity, and although XGBoost and ANN are much more complex, a large majority of all Kaggle competitions are won by models based on these two. By limiting ourselves to only certain models, we risk missing models which would end up having stronger predictive power than the ones presented. The existence of such a model is more or less guaranteed through the use of ensemble methods, but we deem going through more models too time consuming. We therefore make this limitation to increase the time we can spend processing the data and building, tuning and comparing our pre-decided models.



1.5 Data

The data used in the report are provided by Nordea; they are confidential and thus cannot be displayed. Hence, we will only show the structure of the data and explain the features included. The data consist of two data sets, which we call the Shipping data and the Macro data. The Shipping data include information about customers whose business activities are related to shipping, while the Macro data include macroeconomic factors. Both the Shipping and Macro data are presented on a monthly basis, but the Macro data are gathered on a quarterly basis.

1.5.1 Shipping Data

The Shipping data consist of variables explaining which segments customers belong to and how high a risk they are rated to carry, together with their utilization amount and credit limit. Table 1 lists the available variables within the Shipping data. Every observation is collected on a monthly basis and has the variables data_period and B1 as keys. B1 is the customer ID for the observation and data_period is the year and month when the data were collected. A customer with a specific B1 can thus appear in many rows, having been a customer for many months or even years, and many different customers can exist during the same data_period; but since a customer can appear at most once for a given data_period, these two variables together work as a key identifier for the whole Shipping data set. The Shipping data we have access to were collected between January 2008 and 2017; earlier data are not considered by Nordea to reflect today's economic environment, due to the economic crisis and the fallout from it that did not exist beforehand.
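The composite key (data_period, B1) described above can be checked mechanically. A small pandas sketch with made-up values (the real Shipping data are confidential; only the column names come from the thesis):

```python
import pandas as pd

# Toy frame mimicking the Shipping data layout; all values are invented.
shipping = pd.DataFrame({
    "data_period": ["2008-01", "2008-01", "2008-02"],
    "B1":          ["C001",    "C002",    "C001"],
    "b403":        [0,         0,         1],      # Default flag
})

# (data_period, B1) should uniquely identify each row.
is_unique_key = not shipping.duplicated(subset=["data_period", "B1"]).any()
print(is_unique_key)  # True for this toy data
```

Running such a check before joining the Shipping and Macro data guards against accidental row duplication in the merge.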

Three pairs of variables within the Shipping data either are, or should be, mirrors of one another. The variables B1 and B1cure both show the customer ID and are identical; the variables b403 and B403cure both show, in binary code, whether the customer is in Default during a given data_period and are identical; and RAT and riskGrade both show the same Risk Grade, except coded differently. Additionally, a pair of variables that happen to be identical without being directly related are ActiveExp and ship. Both contain only the value 1 for every observation, since all customers within the Shipping data are both active customers and customers existing within the Shipping segment at each given data_period.

The remaining variables show either further segmentation, the Utilization Amount or the Credit Limit. The segments are the Country B40 in which a customer exists, the Industry B45 the customer operates within, the Reason BP92 for the loan (or Reason for Default in the cases when a customer is flagged as Default), the Exposure Classification B409 which the customer is defined as (i.e. Sovereign, Institution, Corporate, etc.), and finally their credit rating according to the two different systems Risk Grade riskGrade and FICO Score Card sco. The credit risk ratings of the different customers are accompanied by the specific models used for deciding the ratings, within the variables brs8_rat and brs8_sco, and the Aligned Scoring between different score cards within the variable BP78. Utilization Amount B416 shows the amount each customer has on-balance, while their Credit Limit B419 is the total amount on- and off-balance.



Table 1: Shipping Variables

Shipping Variable Description

ActiveExp Active Exposure

B1 Customer ID

B1cure Customer ID

B40 Country

b403 Default Flag

B403cure Default Flag

B409 Exposure Class

B416 Utilization Amount

B419 Credit Limit

B45 Industry

BP78 Aligned Score

BP92 Reason

brs8_rat Rating Model

brs8_sco Score Card Model

DA Delivery Agreement

data_period Year and Month for Observation

RAT Rating

riskGrade Risk Grade

sco Scoring

ship Shipping Customer

1.5.2 Macro Data

The Macro data consist of macroeconomic variables, i.e. variables explaining trends in how the economy as a whole behaves for a larger geographical area. A few of these variables are, for example, the Gross Domestic Product [GDP] for each of the Scandinavian countries and for the EU. The Macro data are divided into three Scenarios: Baseline, Better and Worse, where Baseline consists of the observed macroeconomic values. The values are presented on a monthly basis but gathered on a quarterly basis, meaning that the months January to March have values identical to one another, April to June have identical values, and so on for every year. The scenarios Better and Worse both consist of simulated values, where the values in Scenario Better simulate a stronger and more well-off economy and the values in Scenario Worse simulate a weaker economy. Most of the Macro data are gathered between January 1980 and January 2019, a period of 39 years. A few macro variables started to be gathered later, the latest from January 1999. Table 2 shows the variables within the Macro data gathered for most European countries and for the EU.
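The quarterly-to-monthly layout described above (each month inheriting its quarter's value) can be reproduced with a short pandas sketch; the GDP figures here are made up, and only the layout follows the thesis:

```python
import pandas as pd

# Hypothetical quarterly GDP series; real Macro data are confidential.
quarterly = pd.Series(
    [100.0, 102.0, 101.5],
    index=pd.PeriodIndex(["2018Q1", "2018Q2", "2018Q3"], freq="Q"),
    name="GDP",
)

# Present quarterly values on a monthly basis: forward-fill within each
# quarter, so Jan-Mar are identical, Apr-Jun are identical, and so on.
monthly = quarterly.to_timestamp().resample("MS").ffill()
print(monthly)
```

Note that the last quarter only contributes its first month here, since forward-filling stops at the final observation.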



Table 2: Macro Variables

Macro Variable Description

C Household Consumption

COLIABNF Liabilities for Non-financial Corporations

CPI Consumer Price Index

DATA_PERIOD Observation Year/Month

ER Quarterly Earnings

GDP Gross Domestic Product

IMFWGDP World GDP

LNPE Loans and Liabilities for the Household Sector

PEWFP Wages and Salaries

PH House Price Index

UP Unemployment Rate

WPO World Oil Price

X Export of Goods and Services

1.6 Approach and Outline

The report starts by going through the necessary theory regarding the methods we use when processing the data, building the models, and evaluating the performance of each model. We recommend that even the most experienced readers study Section 2.9, to be aware of what metrics we use for evaluation and the definitions we use throughout this report.

The second part of this report covers the method step by step and explains how we approach each problem, as well as the process of coding the data so as not to leak classified information, such as Nordea's actual PI for each segment. This part is divided into subsections covering how we first treat the data, followed by how each model is built upon it. The method is best read in order from start to finish, with emphasis on understanding how we build the ANN and XGBoost models before reaching the section on the Ensemble model, because the Ensemble is built from the same ANN and XGBoost models described in their respective sections. Worth noting is that the data processing for LR differs from how we process the data for the tree-based methods and for the ANN. This follows Nordea's directive and their usage of variables such as lagged Norway GDP within the LR model.

The third part covers the results from each model in accordance with the evaluation metrics described in Section 2.9. The results section first covers the LR model and the classification-based models independently of one another, with both sections including explanations and interpretations of the evaluation metrics. The results section finishes with a comparison between all models when estimating PI for each Risk Grade, PI for each year, PI for each month, and PI for the whole segment of customers.

We conclude the report with a section discussing what conclusions we can draw, the methods we used, what the results say, and what can be further improved upon. The section starts with our conclusions, followed by a discussion of them in the remaining sections. Thus, readers who are only interested in specific conclusions have the option to study only the sections related to those conclusions.

On the very last pages, the reader finds the reference list covering the literature and sources from which we have gathered inspiration or knowledge when writing this report.



2 Theory

In this report we explore both regression and classification problems, although with emphasis on classification. Variables can be characterized as quantitative or qualitative, where qualitative variables are categorical while quantitative variables take numerical values. We will refer to problems with a quantitative response as regression and problems with a qualitative response (labels) as classification. Keep in mind that this way of separating the two types of problems is not always clear-cut. There are cases like Logistic Regression: even when Logistic Regression is used for two-class classification, the problem can still be viewed as regression, since the model estimates the class probabilities.

2.1 Binary Classification

When we want to create a model to assign new observations into groups based on similarity, we are describing a classification problem. In our case, we want to classify new observations as either Default or Non-default, a binary outcome since there are two possible classes. To make a classification, we have a set of observed values which we can analyse using models fit for binary classification (James et al., 2013, p. 129).

2.2 Multiple Linear Regression

Regression analysis is the statistical method for investigating the relationship between one or more response variables and a collection of predictors, often referred to as explanatory variables. The regression type we will look at is Multiple Linear Regression [MLR]. The MLR has one response variable and several explanatory variables, and requires that the response is a linear function of the unknown parameters (Yan, 2009, p. 3). Let $y = [y_1, y_2, \ldots, y_n]^T$, $\varepsilon = [\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n]^T$, $\beta = [\beta_0, \beta_1, \ldots, \beta_p]^T$ and

$$
X = \begin{pmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1p} \\
1 & x_{21} & x_{22} & \cdots & x_{2p} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{n1} & x_{n2} & \cdots & x_{np}
\end{pmatrix}
$$

We can then define the MLR model as

$$y = X\beta + \varepsilon$$

where the $\varepsilon_i$ are independent with $E[\varepsilon] = 0$ (zero expected value) and $\mathrm{Var}[\varepsilon_i] = \sigma^2$ (constant variance), and $\varepsilon$ follows the normal distribution (Alm & Britton, 2008, p. 442). In Linear Regression we want to find the estimate $\hat{\beta}$ of the unknown regression parameters that minimizes the difference between the observed values $y$ and the predicted values $\hat{y}$. To find the estimate of $\beta$, we can use Least Squares estimation. We find the estimate by minimizing the residual sum of squares $\varepsilon^T \varepsilon$, and using $\varepsilon = y - X\beta$ we get the estimate by solving

$$\hat{\beta} = \operatorname*{arg\,min}_{\beta} \, (y - X\beta)^T (y - X\beta) \qquad (1)$$

By differentiating $\varepsilon^T \varepsilon$ with respect to $\beta$ and setting the equations to zero, assuming $X^T X$ is positive definite, we obtain the unique solution $\hat{\beta} = (X^T X)^{-1} X^T y$ for equation (1). If the columns of $X$ are linearly dependent, we cannot find a unique solution (Hastie et al., 2009, pp. 45-46). If we have variables that are almost perfectly correlated, we will be able to produce unique solutions, but the estimates will have large standard errors (Abramowicz, 2017a, p. 95).

To tell how good our model is, we use $R^2$, which measures the proportion of variation in the data explained by the model and is defined as

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

where $R^2 = 0$ means the model does not explain anything and $R^2 = 1$ is a perfect fit (Abramowicz, 2017b, p. 9).
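The least-squares estimate and $R^2$ above can be sketched in a few lines of numpy; the data here are synthetic (the thesis data are confidential):

```python
import numpy as np

# Simulate an MLR data set: design matrix with intercept column,
# known coefficients, and small Gaussian noise.
rng = np.random.default_rng(0)
n, p = 100, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p))])
beta_true = np.array([1.0, 2.0, -0.5, 0.3])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# beta_hat = (X^T X)^{-1} X^T y, computed via a linear solve for stability
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# R^2 = 1 - RSS / TSS, exactly as defined above
y_hat = X @ beta_hat
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(beta_hat.round(2), round(r2, 3))
```

With this little noise, the recovered coefficients sit close to the true ones and $R^2$ is near 1.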



Outliers, leverage points and influential observations are three types of unusual observations that can be found in a data set. Outliers are observations that do not fit the model well, leverage points are observations that are extreme in explanatory-variable space, and influential observations change the model fit greatly (Abramowicz, 2017c, p. 41). Cook's distance will be used to find possible influential observations. For observation $i$ it is defined as

$$D_i = \frac{(\hat{y} - \hat{y}_{(i)})^T (\hat{y} - \hat{y}_{(i)})}{p \, \hat{\sigma}^2}$$

where $\hat{y}_{(i)}$ denotes the fitted values when observation $i$ is left out of the fit (Abramowicz, 2017c, p. 63).
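Cook's distance need not be computed by refitting the model $n$ times: an equivalent closed form uses the residuals $e_i$ and the diagonal $h_{ii}$ of the hat matrix $H = X(X^TX)^{-1}X^T$, namely $D_i = e_i^2 h_{ii} / (p \hat{\sigma}^2 (1 - h_{ii})^2)$. A minimal numpy sketch on synthetic data (none of these values come from the thesis):

```python
import numpy as np

# Simulate a small regression data set and plant one influential outlier.
rng = np.random.default_rng(1)
n = 50
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(scale=0.2, size=n)
y[0] += 5.0                                # planted outlier

p = X.shape[1]                             # number of parameters
H = X @ np.linalg.solve(X.T @ X, X.T)      # hat matrix
h = np.diag(H)
resid = y - H @ y                          # residuals e_i
sigma2 = resid @ resid / (n - p)           # estimate of sigma^2
cooks_d = resid**2 * h / (p * sigma2 * (1 - h) ** 2)

print(int(np.argmax(cooks_d)))             # index of the most influential point
```

The planted observation dominates the distances, which is exactly the behaviour used to flag influential observations.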

To check the structure of the model, we use Partial Regression plots and Partial Residual plots. In a Partial Regression plot, we isolate the effect of one explanatory variable X_i on the response and remove the effect of the remaining explanatory variables. This is done by building a regression model with all explanatory variables except X_i and calculating its residuals. We then create a second regression model with X_i as the response and all other variables as explanatory variables, and calculate its residuals. Plotting the residuals from the two models against each other lets us investigate the relationship between the explanatory variable X_i and the response. In a Partial Residual plot, we look at the effect of an explanatory variable X_i on the response by removing the predicting effect of the other explanatory variables. These plots will help us find non-linearity and unusual observations (Abramowicz, 2017c, pp. 72-78).

2.3 Multiple Imputation by Chained Equations

Multiple Imputation by Chained Equations [MICE] is a way to impute missing values in a data set, where each variable containing missing values is handled as the response variable in a regression model, with every other variable available as an explanatory variable, and with an added stochastic term to add variance. This creates a series, or chain, of regression models where each variable with missing values can be modeled according to its distribution and relation to the other variables. Continuous variables are modeled using Linear Regression, binary variables using Logistic Regression, categorical variables using a Multinomial Logit model, and count variables using Poisson regression. This process repeats for a chosen number of iterations, where the imputed values keep updating based on the imputed values from previous steps, marking the key difference between single and multiple imputation. The algorithm can be described as follows:

1. Set the number of cycles C.

2. Out of all M variables {X_m}_1^M, find the N variables {X_n}_1^N containing missing values.

3. For each variable with missing values, X_1 to X_N:

   a) Find all K missing values {x_nk}_1^K for variable X_n. Use a placeholder to remember their places.

   b) Conduct a simple imputation for all missing values {x_nk}_1^K, for example mean imputation, or backward or forward imputation.

4. For each variable X_1 to X_N, update the imputation:

   a) Set the original missing values {x_nk}_1^K, now placeholder observations, back to missing.

   b) Choose a regression model based on the variable type of X_n.

   c) Choose a subset {V_n} of the variables {X_m}_1^M as explanatory variables, possibly {V_n} = {X_m}_1^M.

   d) Use regression with {V_n} and a stochastic term to impute the missing values {x_nk}_1^K, thus updating the values.

5. Repeat the cycle in step 4 for C cycles.

The number of cycles C should ideally be chosen as the point of convergence, that is, when the coefficients in the regression models have become stable and the variables no longer depend on the order of imputation (Azur et al., 2011).
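One widely available implementation of chained-equations imputation is scikit-learn's IterativeImputer (the thesis does not state which tool was used, so this is only an illustrative sketch on synthetic data; by default it imputes all variables with Bayesian ridge regression rather than the per-type models listed above):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Synthetic data where column 2 depends almost linearly on columns 0 and 1,
# so a chained regression imputer should recover its missing entries well.
rng = np.random.default_rng(0)
X_true = rng.normal(size=(200, 3))
X_true[:, 2] = 2 * X_true[:, 0] - X_true[:, 1] + rng.normal(scale=0.1, size=200)

X_missing = X_true.copy()
mask = rng.random(200) < 0.2          # knock out ~20% of column 2
X_missing[mask, 2] = np.nan

imputer = IterativeImputer(max_iter=10, random_state=0)  # C = 10 cycles
X_imputed = imputer.fit_transform(X_missing)

print(np.isnan(X_imputed).any())      # False: all gaps filled
```

Because the missing column is nearly a linear function of the others, the imputed values land close to the true ones.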



2.4 Principal Component Analysis

With Principal Component Analysis [PCA], we want to find a smaller set of new variables, the Principal Components [PC], created as particular linear combinations of the original p random variables X_1, X_2, ..., X_p. The main goals of PCA are data reduction and interpretation. Through a few linear combinations of the variables, the PC, we are interested in explaining the variance and covariance of the original data. In a data set of p variables, p PC are needed to explain the total population variance. But in many cases, most of the variance can be explained by a small set of k PC. If that is the case, we can reduce the data from n × p to n × k without a large loss of information. Moreover, PCA often reveals relationships that were not apparent beforehand (Johnson & Wichern, 2014, p. 430).

The calculation of the PC can be summarized as follows

1. Calculate the covariance matrix of X, S_X.

2. Perform a spectral decomposition of S_X.

3. Sort the eigenvalue-eigenvector pairs (λ1, e1), (λ2, e2), ..., (λp, ep) so that λ1 ≥ λ2 ≥ ... ≥ λp ≥ 0.

4. Obtain the PC as Yi = ei^T X for i = 1, 2, ..., p, with the properties Var[Yi] = ei^T S_X ei = λi for i = 1, 2, ..., p and Cov[Yi, Yk] = ei^T S_X ek = 0 for i ≠ k

(Johnson & Wichern, 2014, p.432).

In cases where variables are measured on scales with widely different ranges, standardization is appropriate. If we do not standardize in these situations, the majority of the total variation will be due to the variables with the greatest ranges, and we would expect the only important PC to be those with heavy weightings of these variables (Johnson & Wichern, 2014, p.439).

When deciding how many PC to use, there are no definite rules (Johnson & Wichern, 2014, p.444). We will consider the amount of total variance explained and the relative size of the eigenvalues. The proportion of total variance explained by the kth PC without standardization of the random variables is

λk / (λ1 + λ2 + ... + λp),   k = 1, 2, ..., p

and with standardized random variables

λk / p,   k = 1, 2, ..., p

A rule of thumb is that about 90% of the total variance should be explained by the PC. To visualize the size of the eigenvalues, a Scree plot can be used. Figure 2 is an illustration of a Scree plot, and this plot further helps us determine an appropriate number of PC. We look for the "elbow" in the plot: the point where the elbow occurs is the number of PC chosen. In Figure 2 we find the elbow at i = 3. The eigenvalues after i = 2 are relatively small and about the same size, hence two or maybe three PC seem reasonable.

Figure 2: Illustration of a Scree Plot
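As an illustrative sketch (scikit-learn and the synthetic data are my assumptions), the explained-variance proportions plotted in a Scree plot can be computed as:

```python
# PCA sketch: standardize, fit, and inspect the proportion of total
# variance explained by each PC (the heights in a Scree plot).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=200)  # make two variables correlated

# Standardize first, so variables with large ranges do not dominate.
X_std = StandardScaler().fit_transform(X)

pca = PCA().fit(X_std)
ratios = pca.explained_variance_ratio_  # lambda_k / p for standardized data

print(round(ratios.sum(), 6))  # 1.0: all p PC together explain everything
```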



2.5 Imbalanced Data

2.5.1 Definition and Associated Problems

Data where the observed classes are unequally represented to the point that it causes problems for model building are known as imbalanced data. Classification algorithms often assume balanced data and handle small imbalances well, but on heavily imbalanced data they become biased towards predicting the majority class. This is a problem because the important or interesting class we want to predict is generally the minority class, not the majority class. Additionally, the consequences of misclassifying the minority class are often more severe than those of misclassifying the majority class, for example when examining Fraud data, Cancer data, or, in our case, data regarding loan Defaults. In these cases it is often better to be safe than sorry, and, when in doubt, rather mistakenly classify a Non-cancer as Cancer than the other way around. Even though there are many cases of Default within the data set, the number of Defaults is still low relative to the number of Non-defaults, and this relative rarity makes patterns harder to detect.

Because of the bias towards the less interesting majority class, commonly used metrics such as classification accuracy must be reconsidered when evaluating models. This is due both to the bias towards predicting the majority class and to the cost of errors varying greatly. When building a model using classification accuracy on a data set where only 1% of the observations are Defaults, the model risks defaulting to classifying every single observation as the majority class Non-default, resulting in an accuracy as high as 99% even though it misses every Default.

There are generally two ways to address the issue of imbalanced data. The first way is to handle the difference in error cost by assigning larger costs to incorrectly classified observations from the minority class than to incorrectly classified observations from the majority class. This makes misclassifying an observation from the minority class as bad as misclassifying numerous observations from the majority class. The second way is to handle the unequal representation by resampling the Training data set using over- and undersampling methods. Oversampling means adding fictional observations of the minority class, simulated from the original minority class, to the Training set, thus balancing it. Undersampling instead tries to achieve the balance between classes by removing observations from the majority class (Brownlee, 2015; Bowyer et al., 2002; Maalouf & Trafalis, 2011).
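Both strategies can be sketched briefly; the scikit-learn API and the synthetic data below are assumptions for illustration:

```python
# Cost weighting and undersampling on a synthetic imbalanced set.
from collections import Counter
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)

# First way: larger cost for minority-class errors via class weights.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# Second way: undersample the majority class down to the minority size.
rng = np.random.default_rng(0)
minority = np.flatnonzero(y == 1)
majority = rng.choice(np.flatnonzero(y == 0), size=len(minority), replace=False)
idx = np.concatenate([minority, majority])
X_bal, y_bal = X[idx], y[idx]

print(Counter(y_bal)[0] == Counter(y_bal)[1])  # True: balanced after undersampling
```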



2.5.2 Synthetic Minority Over-sampling Technique

Synthetic Minority Over-sampling Technique [SMOTE] is a way of performing oversampling of the minority class. The oversampling is achieved by first locating all minority class samples and then adding new synthetic samples along the line segments joining each minority class sample to its nearest minority class neighbours, where the position on the line segment between two neighbours is determined by a random number between 0 and 1. The authors have summarized SMOTE with the following algorithm

1. Input the number of minority class samples T, the amount of SMOTE N% and the number of nearest neighbours k. N should either satisfy 0 < N < 100 or be an integer multiple of 100.

2. If N is less than 100, randomize the minority class samples, as only a random percentage of them will be SMOTE:d. Thus, randomize and set T = (N/100) ∗ T and N = 100.

3. Set N = N/100.

4. Set numattrs = the number of attributes; Sample[][]: array for the original minority class samples; newindex: keeps a count of the number of synthetic samples generated (initialized to 0); Synthetic[][]: array for the synthetic samples.

5. For sample t = 1 to T:

a) Compute the k nearest neighbours for minority class sample t and save the indices in nnarray.

b) For amount of SMOTE n = 1 to N:

1. Choose a random number between 1 and k, call it nn. This chooses one of the k nearest neighbours of t.

2. For attr = 1 to numattrs:

a) Compute dif = Sample[nnarray[nn]][attr] − Sample[t][attr].

b) Compute gap = a random number between 0 and 1.

c) Set Synthetic[newindex][attr] = Sample[t][attr] + gap ∗ dif.

3. Increment newindex.

6. Return the synthetic samples.

This means that if the sample (6, 4) is considered with (4, 3) being one of its nearest neighbours, then new samples will be generated as (f1′, f2′) = (6, 4) + rand(0, 1) ∗ (−2, −1), where (−2, −1) = (4, 3) − (6, 4) (Bowyer et al., 2002).
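A compact NumPy sketch of this interpolation (illustrative only; production implementations such as imbalanced-learn's SMOTE exist, and the variable names follow the pseudocode above):

```python
# Minimal SMOTE: for each minority sample, create synthetic points on
# the line segments towards its k nearest minority neighbours.
import numpy as np

def smote(sample, n_per_point, k, rng):
    T, numattrs = sample.shape
    synthetic = []
    for t in range(T):
        dists = np.linalg.norm(sample - sample[t], axis=1)
        nnarray = np.argsort(dists)[1:k + 1]   # k nearest neighbours (self excluded)
        for _ in range(n_per_point):
            nn = rng.integers(k)               # pick one neighbour at random
            dif = sample[nnarray[nn]] - sample[t]
            gap = rng.random()                 # position on the line segment
            synthetic.append(sample[t] + gap * dif)
    return np.array(synthetic)

rng = np.random.default_rng(0)
minority = np.array([[6.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
new = smote(minority, n_per_point=2, k=2, rng=rng)
print(new.shape)  # (6, 2): two synthetic samples per minority sample
```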

2.5.3 Cleaning using Edited Nearest Neighbours

Oversampling minority class samples using SMOTE helps with balancing class distributions, but it does not take class clusters into consideration. This means the algorithm neither prevents new synthetic sample points from being created within the majority class space, nor cleans out majority examples lying within the minority class space. A certain level of intrusion helps prevent overfitting by causing the decision boundaries for the minority and majority class to spread further into one another, but too many rogue intruders deep within the wrong class space may instead cause overfitting when a classifier tries to fit those observations as well. In these cases, Wilson's Edited Nearest Neighbour rule [ENN] can be applied to clean out noise by removing rogue intruders and creating clearer borders (Wilson, 1972). In a binary case, ENN works by examining the three nearest neighbours of a certain example point, and if at least two out of those three points belong to the other class, the example point is removed:

1. Input T samples {Et}, t = 1, ..., T.

2. For sample E1 to sample ET, investigate intrusion by:

a) Find the K = 3 nearest neighbours {NNk(Et)}, k = 1, ..., K, of the sample Et.

b) If the class of Et differs from the majority vote over {NNk(Et)}, remove sample Et

(Batista et al., 2004).
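The rule can be sketched in a few lines of NumPy (illustrative; imbalanced-learn's EditedNearestNeighbours provides a production version):

```python
# Wilson's ENN for the binary case: drop samples whose class disagrees
# with the majority vote of their k = 3 nearest neighbours.
import numpy as np

def enn(X, y, k=3):
    keep = []
    for t in range(len(X)):
        dists = np.linalg.norm(X - X[t], axis=1)
        nn = np.argsort(dists)[1:k + 1]         # k nearest neighbours, self excluded
        majority = np.bincount(y[nn]).argmax()  # majority vote among neighbours
        if y[t] == majority:                    # keep only agreeing samples
            keep.append(t)
    return X[keep], y[keep]

X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [0.15]])
y = np.array([0, 0, 0, 1, 1, 1])  # last point intrudes into class-0 space

X_clean, y_clean = enn(X, y)
print(len(X_clean))  # 5: the intruder at 0.15 is removed
```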



2.6 Tree-Based Methods

In this section we go through tree-based methods for classification. Tree-based methods are simple and easy to interpret, but at the expense of prediction accuracy. The performance can be improved by producing multiple trees and then combining them to yield a better prediction. Random Forest [RF] and Boosting are two methods that use multiple trees. Unfortunately, the model becomes more difficult to interpret when we use multiple trees (James et al., 2013, p.303).

2.6.1 Classification Trees

The basis of tree-based approaches is to stepwise split data into binary subsets based on similarity, where each final node, called a leaf node, is fitted a simple function. This function could, for example, be as simple as a constant equal to the mean value of the binary observations left in that specific leaf node, or a constant in the form of a score meant to be used later when adding many trees together. The tree function f(x) is therefore defined as

f(x) = Σ_{m=1}^{M} Cm I(x ∈ Rm)

where R1, R2, ..., RM are the leaf nodes, or regions, depicted in Figure 3 and C1, C2, ..., CM are simple functions or constants, for example the average of the observations within the region. Classification trees use these simple functions to choose an appropriate classification for observations reaching different leaf nodes within the tree. Thus, if eight out of ten binary observations within a leaf node are positives, the estimated function Cm will be

Cm = avg(yi | xi ∈ Rm) = 0.8

In a classification tree, thresholds are set for which values Cm must reach for certain classifications. Traditionally, in an unbiased binary case, Cm > 0.5 for a leaf node means new observations reaching that leaf node will be classified as positive.

Figure 3: Binary Classification Tree

For each split about to be made, the feature which at a certain split point makes the most accurate divide is used to split the data in that specific node, as illustrated in Figure 3. Even though a feature could divide the data into more than two splits, binary splits are almost exclusively used, both because they carry less risk of dividing the data too quickly and because a three-part split can still be achieved over two iterations. The model is considered powerful relative to its conceptual simplicity but is prone to both over- and underfitting. A large tree can fit the Training data with perfect accuracy, meanwhile a small tree will miss valuable patterns, and therefore strategies to tackle this problem must be applied. Some of these strategies are setting a maximum tree size, only splitting data if it leads to enough improvement of the model, or only growing the tree until the size of the leaf nodes reaches a set minimum (Hastie et al., 2009, p.305).
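A short sketch with scikit-learn (an assumed library choice), where max_depth and min_samples_leaf implement the growth-limiting strategies mentioned above:

```python
# A small classification tree; each leaf stores the class proportions,
# playing the role of the constants C_m.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=10,
                              random_state=0).fit(X, y)

# predict_proba returns the leaf estimate, e.g. 0.8 for the positive
# class if eight of ten training observations in the leaf are positive.
proba = tree.predict_proba(X[:1])
print(proba.shape)  # (1, 2)
```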



2.6.2 Random Forest

Due to trees being sensitive to the Training data, they benefit greatly from bagging techniques. Bagging means to bootstrap a subset sample from the Training data to fit a tree. When this is done over the course of numerous iterations, where for every iteration a new tree is built for each piece of bagged data, it is possible to create an ensemble model made up of all these created trees. This method is called a Random Forest [RF], where a new observation to be classified goes through every tree and is collectively classified by a majority vote. If Cb(x) is the prediction for observation x made by tree b within the RF, then the final classification of the RF will be

C_RF(x) = majority vote {Cb(x)}, b = 1, ..., B

This is illustrated in Figure 4 with B = 12 trees, where the observation is classified as negative by four trees and positive by the remaining eight, thus making the majority vote over {Cb(x)}, b = 1, ..., 12, positive.

Figure 4: Random Forest Classification

Even though trees are sensitive to the Training data, bagging alone does not create enough variance between them. Feature subsampling is therefore applied within RF to increase the variation between the trees. This means that not all features are available to choose the best split point from; instead, only k randomly selected features are available at each split for choosing the optimal split between. This leads the trees within the RF to grow differently, as can be observed in Figure 4. Because the trees grow differently, this type of model handles variation in the data much better than single trees and is not as prone to overfitting (Hastie et al., 2009, p.587).
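A minimal sketch with scikit-learn's RandomForestClassifier (an assumption; note that scikit-learn averages tree probabilities, a soft variant of the majority vote described above). max_features controls the feature subsampling at each split:

```python
# Random Forest: bagged trees with per-split feature subsampling.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)

rf = RandomForestClassifier(n_estimators=12,      # B = 12 trees
                            max_features="sqrt",  # k features per split
                            random_state=0).fit(X, y)

# Collect each tree's individual vote for the first observation.
votes = [int(t.predict(X[:1])[0]) for t in rf.estimators_]
print(len(votes))  # 12: one vote per tree in the forest
```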

2.6.3 Boosting

The idea behind boosting is to create a strong classifier by adding together many weak classifiers, where each new weak classifier tries to correct the errors made by earlier learners. A weak classifier, or weak predictor, is a simple model making predictions with an accuracy not much above chance. Boosting is achieved by adding layers of weak classifiers, like decision stumps or decision trees, onto each other in such a way that each new classifier picks up the slack from previous layers. By doing this, each added weak classifier tries to improve the layered model by focusing on the learning and mistakes of the previous model. In the case of adaptive boosting, this is done by giving each incorrectly classified data point a larger weight and each correctly classified point a smaller weight when fitting the next classifier. The size of the weight is determined by the so-called learning rate, where a large learning rate gives a large weight to misclassified points and a small weight to correctly classified ones, which can increase the speed of convergence but also lead to a less accurate model (Hastie et al., 2009, p.353).
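As a sketch, adaptive boosting with decision stumps is available in scikit-learn (an assumed library; its default base learner is a depth-one tree, i.e. a decision stump):

```python
# AdaBoost: each new stump focuses on the points the previous
# ensemble misclassified, via reweighted training data.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, random_state=0)

ada = AdaBoostClassifier(n_estimators=50, learning_rate=1.0,
                         random_state=0).fit(X, y)
print(len(ada.estimators_))  # number of weak classifiers actually fitted
```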



2.6.4 Extreme Gradient Boosting

Gradient Boosting is a boosting method made up of a mixture of a decision tree approach and incremental learning inspired by gradient descent. It works by adding decision trees in sequence, where instead of giving new weights to correctly and incorrectly classified data points, each added tree is fitted to the residual errors made by the previous model. That means each new tree is fitted to the negative gradient of the previous model, thus minimizing the pseudo residuals. The Gradient Tree Boosting algorithm is as follows

1. Set f0(x) = argmin_γ Σ_{i=1}^{N} L(yi, γ), where γ is the mean of the residuals.

2. For tree m = 1 to M:

a) For observation i = 1, 2, ..., N compute the pseudo residuals r_im given by the negative gradient:

r_im = −[∂L(yi, f_{m−1}(xi)) / ∂f_{m−1}(xi)]

b) Use the pseudo residuals r_im to fit a classification tree, giving the regions R_jm for j = 1, 2, ..., J_m.

c) For each region R_jm compute the mean of residuals γ_jm by:

γ_jm = argmin_γ Σ_{xi ∈ R_jm} L(yi, f_{m−1}(xi) + γ)

d) Set f_m(x) = f_{m−1}(x) + Σ_{j=1}^{J_m} γ_jm I(x ∈ R_jm).

3. Return the model f(x) = f_M(x).

Given that we are minimizing the squared error loss function (1/2)[yi − f(xi)]^2 for the model, every iteration we thus fit the new tree to the current negative gradient −g_im = yi − f_{m−1}(xi), where f_{m−1} is the current model of sequential trees and f(x) = f_M(x) as m → M. The Gradient Boosted Tree algorithm with M = 2 trees is illustrated on a simple set in Figure 5 (Hastie et al., 2009, p.359).

Figure 5: Gradient Boosting
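The steps above can be sketched directly in NumPy for the squared error loss, where the residuals y − f_{m−1}(x) are exactly the negative gradient (scikit-learn's regression tree stands in for the fitted tree; all names and settings are illustrative):

```python
# Gradient boosting by hand: fit each tree to the current residuals.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

f = np.full_like(y, y.mean())    # step 1: f_0 minimizes squared loss
for m in range(50):              # step 2: M = 50 trees in sequence
    residuals = y - f            # negative gradient for squared error
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    f += 0.1 * tree.predict(X)   # update with shrinkage 0.1

print(np.mean((y - f) ** 2) < np.var(y))  # True: loss improved over f_0
```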



Extreme Gradient Boosting [XGBoost] was proposed as a sparsity-aware option for performing gradient boosting on large-scale problems while using a minimal amount of resources. First of all, it utilizes shrinkage, feature subsampling and regularization to prevent overfitting. Shrinkage is when the contributions of newly added trees are scaled back using a shrinkage parameter, similar to the learning rate. Feature subsampling is the method used in Random Forest to create larger variation between trees, by making only k randomly selected features available for choosing the optimal split at each split in the tree. Regularization is a way to reward simplicity and punish complexity, by adding a penalizing term for the number of leaf nodes and the leaf scores in each tree.

Finding the optimal split can, however, be very computationally demanding when the data include continuous features. What tree models usually do is first sort the continuous feature considered for the split and then use a greedy algorithm to find the optimal split point. When using the greedy algorithm to create the model, every possible split point has to be tried, which is not sustainable for a data set too large to fit into memory. XGBoost instead uses an approximate algorithm that suggests splitting points based on the percentile distribution and starts by trying these points. Based on which variant of the algorithm is chosen, the most effective split point out of the selected percentile points is then looked into further by bucketing its surrounding points into a smaller data set. A new iteration then begins, and this continues until the optimal split is found.

To avoid having to sort the data each time a feature is considered, sorted data are instead stored in multiple blocks, where each block contains a subset of sorted data. This block structure means each feature only needs to be sorted once to be used in later iterations, and allows data to be spread over multiple cores for parallel learning (Chen & Guestrin, 2016).



2.7 Artificial Neural Networks

Artificial Neural Networks [ANN] are models inspired by the human brain. Simon Haykin gave in his book "Neural Networks and Learning Machines" the following definition of ANN:

"A neural network is a massively parallel distributed processor made up of simple processing units that has a natural propensity for storing experiential knowledge and making it available for use. It resembles the brain in two respects:

1. Knowledge is acquired by the network from its environment through a learning process.

2. Interneuron connection strengths, known as synaptic weights, are used to store the acquiredknowledge.”

The term neural network covers many different models and learning methods. We will cover feed-forward neural networks, which means that the information is always fed forward, i.e. there are no loops in the network.

2.7.1 Neurons

An ANN consists of a set of neurons and their connections to other neurons. A neuron takes one or more inputs and produces an output. For each neuron, every input is associated with a weight that defines the importance of that input. The neuron sums all the weighted inputs, and the weighted sum is then modified by a so-called activation function before the output is sent forward to another neuron. This process, where data are sent in one direction from input to output, is a feed forward system known as a perceptron (Flores, 2011, pp.1-3).

In Figure 6 we can see the model of a neuron k with a bias bk, which increases or lowers the net input to the activation function φ(·). The purpose of the activation function is to limit the range of possible outputs from the neuron. xj, j = 1, ..., m, are the input signals for the neuron, which are multiplied by their respective weights wkj ∈ R. The model in Figure 6 can be described as

uk = Σ_{j=1}^{m} wkj xj (1)

vk = uk + bk (2)

ak = φ(vk) (3)

where uk is the linear combiner output due to the input signals and ak, the activation value, is the output signal from the neuron k (Haykin, 2008, pp.10-11).

Figure 6: Model of a Neuron

There are various activation functions. Since we have a binary classification problem, we use sigmoid activation in the output layer, which gives us a probability, i.e. a score between zero and one. In all other layers tanh, the hyperbolic tangent, activation is used. There are similarities between tanh and sigmoid, but tanh maps the input between -1 and 1. The sigmoid function is defined as

φ(v)_sigmoid = 1 / (1 + e^(−v))

and tanh as

φ(v)_tanh = (e^(2v) − 1) / (e^(2v) + 1)
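The neuron model above (linear combiner, bias, activation) and the two activation functions can be written out directly; the weights and inputs below are made up for illustration:

```python
# A single neuron with sigmoid or tanh activation, in NumPy.
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))                   # maps to (0, 1)

def tanh(v):
    return (np.exp(2 * v) - 1) / (np.exp(2 * v) + 1)  # maps to (-1, 1)

def neuron(x, w, b, phi):
    u = np.dot(w, x)   # linear combiner of weighted inputs
    v = u + b          # add the bias
    return phi(v)      # activation value

x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.2])
b = 0.05

a = neuron(x, w, b, sigmoid)
print(0.0 < a < 1.0)  # True: a probability-like score
```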



2.7.2 Multilayer Perceptron

Network architecture is how the neurons are organized into a structure. A Multilayer Perceptron [MLP] is a neural network with neurons organized in layers, where the layers are stacked in a sequence. The output from the neurons in one layer becomes the input for the next layer, and so forth. The layers in an MLP consist of an input layer, one or more hidden layers and a final output layer (Flores, 2011, p.2). Figure 7 is an illustration of an MLP with one hidden layer. We can now generalize the neuron model to encompass all the neurons in the hidden layers. Let w^l be the weight matrix, b^l the bias vector and a^l the activation vector of layer l. Equations 2 and 3 can then be rewritten as

v^l = w^l a^(l−1) + b^l

a^l = φ(v^l)

where v^l is the weighted input to the neurons in layer l.

Figure 7: A Feed Forward Neural Network with one Hidden Layer

Let x denote the training input and y(x) the desired output. In our illustration in Figure 7 we have only one neuron in the output layer and assume this is a binary classification problem; in this example y(x) will take the values zero or one. In ANN we want to find the values of the weights and biases so that the output from the network approximates y(x). To quantify how well the network approximates y(x), we need to define a loss function L (Nielsen, 2018, ch.1). Since we are dealing with a binary classification problem, the binary cross entropy loss function is appropriate. It is defined as

L = −(1/n) Σ_x [y ln(a) + (1 − y) ln(1 − a)]

where n is the number of observations in the Training set, y is the desired output, and the sum is over all training inputs x. The loss function returns high values for bad predictions and low values for good predictions (Nielsen, 2018, ch.3). Hence, we want to minimize the loss function to find a good classifier. This step is often referred to as back-propagation and is the core algorithm by which ANN learn. Back-propagation feeds the loss backwards through the network to learn how much every neuron contributed to the loss, so the weights can thereafter be adjusted (Al-Masri, 2019). To tune the weights, we use a stochastic optimization algorithm called AdaMax. AdaMax is based on stochastic gradient descent and computes individual adaptive learning rates for different parameters. This will hopefully lead to a smaller loss in the next iteration.
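The loss above can be written out in a few lines; the scores below are made up to show that confident, correct predictions yield a lower loss than hesitant, partly wrong ones:

```python
# Binary cross entropy over n observations, in NumPy.
import numpy as np

def binary_cross_entropy(y, a):
    # Mean over the training inputs of y*ln(a) + (1-y)*ln(1-a), negated.
    return -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))

y = np.array([1.0, 0.0, 1.0, 0.0])
good = np.array([0.9, 0.1, 0.8, 0.2])   # confident, correct scores
bad = np.array([0.4, 0.6, 0.5, 0.5])    # hesitant, partly wrong scores

print(binary_cross_entropy(y, good) < binary_cross_entropy(y, bad))  # True
```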



The algorithm for AdaMax is presented below

1. Require: α (step size).

2. Require: β1, β2 ∈ [0, 1) (exponential decay rates).

3. Require: L(Θ) (stochastic loss function).

4. Require: Θ0 (initial parameter vector).

a) m0 ← 0 (initialize the first moment vector).
b) u0 ← 0 (initialize the exponentially weighted infinity norm).
c) t ← 0 (initialize the timestep).

5. While Θt not converged do:

a) t ← t + 1.
b) gt ← ∇Θ Lt(Θt−1) (get gradients at timestep t).
c) mt ← β1 ∗ mt−1 + (1 − β1) ∗ gt (update the biased first moment estimate).
d) ut ← max(β2 ∗ ut−1, |gt|) (update the exponentially weighted infinity norm).
e) Θt ← Θt−1 − (α / (1 − β1^t)) ∗ mt / ut (update the parameters).

6. End while.

7. Return Θt (resulting parameters).

where α / (1 − β1^t) is the learning rate with the bias-correction term from the first moment (Ba & Kingma, 2015, pp.1,9).
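The update rule transcribes almost line for line into NumPy; below it is run on a simple quadratic loss (the loss, step count and tolerance are illustrative assumptions, not from the thesis):

```python
# AdaMax in NumPy, following the update rule step by step.
import numpy as np

def adamax(grad, theta, alpha=0.1, beta1=0.9, beta2=0.999, steps=200):
    m = np.zeros_like(theta)   # first moment vector
    u = np.zeros_like(theta)   # exponentially weighted infinity norm
    for t in range(1, steps + 1):
        g = grad(theta)                       # gradient at timestep t
        m = beta1 * m + (1 - beta1) * g       # biased first moment estimate
        u = np.maximum(beta2 * u, np.abs(g))  # weighted infinity norm
        theta = theta - (alpha / (1 - beta1 ** t)) * m / (u + 1e-12)
    return theta

# Minimize L(theta) = ||theta||^2, whose gradient is 2*theta.
theta = adamax(lambda th: 2 * th, np.array([3.0, -2.0]))
print(np.all(np.abs(theta) < 0.1))  # True: parameters near the minimum
```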

When the Training data are large, the learning can become slow. To speed up the process we can divide the Training data into batches by randomly selecting a predefined number of samples s. The randomly chosen s samples are referred to as a mini batch, which means the weights will be updated after each mini batch of s samples. If we have a Training data set with 300 samples and a mini batch size of 10, we get 300/10 = 30 mini batches. The weights will be updated 30 times until we have exhausted the training samples, at which point we have completed what is known as an epoch of training. We can repeat the process as many times as we like, i.e. run as many epochs as we want (Brownlee, 2018a).

Typically, we do not want to find the global minimum of the loss function, since it is likely that the solution would be overfitted (Hastie et al., 2009, p.395). Therefore, some regularization is needed. Batch Normalization can act as regularization and in many cases replace Dropout. Batch Normalization makes standardization part of the model architecture by performing standardization for each training mini batch (Ioffe & Szegedy, 2015). Dropout means randomly ignoring neurons in the network, which simulates many different networks. The idea of Dropout is to break situations where the network layers co-adapt to correct mistakes from previous layers, and thus produce a more robust model (Brownlee, 2018b).

The scaling of the inputs determines the scaling of the weights in the input layer. Therefore, the scale of the input variables can have a large effect on the quality of the final results. If the input variables are of highly different scales, standardization might be an appropriate solution. This ensures that all input variables are treated equally (Hastie et al., 2009, pp.398-400).



2.8 Model Selection

2.8.1 K-Fold Cross-Validation

Cross-Validation is a simple but strong technique for performing validation given sparse data. The idea is to split the Training data set into k folds of about equal size, fit the model on the Training set with one of the folds excluded, and use the excluded fold to calculate the prediction error of the fitted model. Repeating this for each fold gives k estimates of the prediction error (Hastie et al., 2009, p.243).
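A short sketch using scikit-learn's cross_val_score (an assumed helper that performs exactly this loop):

```python
# 5-fold cross-validation: fit five times, score on the held-out fold.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, random_state=0)

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(len(scores))  # 5: one prediction-error estimate per fold
```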

2.8.2 Grid Search and Random Grid Search

Suppose we have two parameters we need to fix to build a model. We have two sets of N candidate values, {Sn} and {Tn}, one set for each of the two parameters. If we want to figure out which configuration of these values produces the most effective model, we would have to build models using every combination. For models using two parameters with N possible values each, this would mean building N^2 models. As the number of parameters, or dimensions, increases, so does the number of models that have to be built. Adding a third parameter with the set of candidate values {Un} would mean building N^3 models. This approach is known as a Grid Search and is considered powerful when examining few and small grids, although as the number of dimensions rises to d, the training time rises exponentially with it to N^d.

The case is often that similar parameter configurations give similar output, meaning that if configuration [1, 1, 1] produces a much worse model than configuration [7, 8, 9], it is likely that configuration [1, 1, 2] also produces a worse model, making large numbers of different configurations a waste of time to examine further. The idea of a Random Grid Search is to speed up the process by approximating the most effective model. First, X models are built using randomly sampled parameter configurations; then, based on which configurations give the most effective models, additional models are created using configurations sampled in the general area of the previous best ones (Bhat et al., 2018).
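A sketch of the random variant with scikit-learn's RandomizedSearchCV (the library and the parameter grid are assumptions), which samples n_iter configurations instead of exhausting all N^d combinations:

```python
# Random search over three hyperparameters of a Random Forest.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=300, random_state=0)

param_distributions = {
    "n_estimators": range(10, 200),    # candidate values {S_n}
    "max_depth": range(2, 10),         # candidate values {T_n}
    "max_features": ["sqrt", "log2"],  # candidate values {U_n}
}

search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_distributions, n_iter=10, cv=3,
                            random_state=0).fit(X, y)
print(sorted(search.best_params_))  # the sampled configuration that won
```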

2.9 Evaluation

When we have built a model we reckon should make robust predictions, the model needs to be evaluated to decide its usefulness for the problem at hand. Evaluation metrics are also important when different models are to be compared. This section presents various tools to evaluate a binary classifier.

2.9.1 Performance Tools

The confusion matrix is a widely used metric in the field of finance (Boujelbene et al., 2018). An illustration of the confusion matrix can be seen in Figure 8. True Positives [TP] are correctly classified positive values and True Negatives [TN] are correctly classified negative values. False Positives [FP], also known as Type I errors, are negative values incorrectly classified as positive, and False Negatives [FN], i.e. Type II errors, are positive values incorrectly classified as negative. When dealing with imbalanced data, we consider the minority class as positive and the majority as negative.

Figure 8: Confusion Matrix



A measure that is defined for an individual score threshold of a classifier is called a single-threshold measure. To clarify, the binary classifier's prediction scores range between 0 and 1. A threshold of 0.5 means that all prediction scores greater than 0.5 will be classified as 1, while all prediction scores below 0.5 will be classified as 0. A single-threshold measure is defined only for a predefined threshold value, and thus cannot give an overview of performance over a range of thresholds. It is not clear how to choose the threshold; hence threshold-free measures like the Receiver Operating Characteristic [ROC] curve and the Precision-Recall Curve [PRC] plot can be useful (Rehmsmeier & Saito, 2015).

2.9.2 Single-Threshold Metrics

Table 3 is a summary of single-threshold measures derived from the confusion matrix.

Table 3: Single-Threshold Metrics

Metric       Formula
ACC          (TP + TN) / TOT
ACCE         ((TP + FN)/TOT) ∗ ((TP + FP)/TOT) + ((FP + TN)/TOT) ∗ ((FN + TN)/TOT)
Kappa        (ACC − ACCE) / (1 − ACCE)
Recall       TP / (TP + FN)
Specificity  TN / (TN + FP)
G-mean       √(Recall ∗ Specificity)
FNR          FN / (TP + FN)
FPR          FP / (TN + FP)
Precision    TP / (TP + FP)
NPV          TN / (TN + FN)
MCC          (TP ∗ TN − FP ∗ FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

TOT = TP + FP + TN + FN

ACC: Accuracy; ACCE: Expected Accuracy; Kappa: Cohen's Kappa; Recall: True Positive Rate; Specificity: True Negative Rate; G-mean: Geometric Mean; FNR: False Negative Rate; FPR: False Positive Rate; Precision: Positive Predictive Value; NPV: Negative Predictive Value; MCC: Matthews Correlation Coefficient
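The formulas in Table 3 follow directly from the four confusion-matrix counts; the sketch below (with made-up counts) computes a subset of them:

```python
# Single-threshold metrics from confusion-matrix counts.
import math

def metrics(tp, fp, tn, fn):
    tot = tp + fp + tn + fn
    acc = (tp + tn) / tot
    acc_e = ((tp + fn) / tot) * ((tp + fp) / tot) \
          + ((fp + tn) / tot) * ((fn + tn) / tot)
    recall = tp / (tp + fn)          # True Positive Rate
    specificity = tn / (tn + fp)     # True Negative Rate
    return {
        "ACC": acc,
        "Kappa": (acc - acc_e) / (1 - acc_e),
        "Recall": recall,
        "Specificity": specificity,
        "G-mean": math.sqrt(recall * specificity),
        "Precision": tp / (tp + fp),
        "MCC": (tp * tn - fp * fn)
               / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }

m = metrics(tp=80, fp=20, tn=880, fn=20)
print(m["Recall"])  # 0.8
```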

Accuracy maximization is a commonly used way to evaluate a classifier's performance. Accuracy is the proportion of correctly classified observations out of all observations in the experiment. It makes the assumption of equal FP and FN costs, which in most practical problems is not the case, i.e. accuracy is unable to differentiate between Type I and Type II error costs. Additionally, accuracy cannot capture the true performance of the classifier when we have an imbalanced data set, and therefore might not be sufficient (Boujelbene et al., 2018; Pokorný, 2010). If we consider a binary classification problem where 90% of our data belong to one class, we would get a high accuracy if the classifier predicted all observations as the majority class, but the model would not be useful. This is called the Accuracy Paradox (Akosa, 2017). In a situation like this, metrics like Recall and Precision are better measurements (Maalouf & Trafalis, 2011).

Recall, or True Positive Rate, is the ratio between correctly classified positive values and all positive values, TP / (TP + FN). Specificity, or True Negative Rate, is the ratio between correctly classified negative values and all negative values, TN / (TN + FP) (Maalouf & Trafalis, 2011). The Geometric Mean [G-mean] gives the classification performance balance between the minority and majority classes, √(Recall ∗ Specificity). A low value for the G-mean indicates that the classification of the positive class is poor even though the negative class is correctly classified, or the other way around. The G-mean helps us to avoid overfitting one class and underfitting the other. G-mean is thus an important metric when working

19

Page 25: STAT ERIKSSON LÅNGSTRÖM1324129/FULLTEXT01.pdfAbstract Probability of Impairment, or Probability of Default, is the ratio of how many customers within a

with imbalanced data, where we want to avoid overfitting the negative class and underfitting thepositive class (Akosa, 2017).

False Negative Rate [FNR] is the ratio of positive values incorrectly classified as negative and all positive values, FN/(TP + FN). False Positive Rate [FPR] is the ratio of negative values incorrectly classified as positive and all negative values, FP/(TN + FP) (Maalouf & Trafalis, 2011).

Precision, or Positive Predictive Value, is the share of correct positive predictions among all positive predictions, TP/(TP + FP), while Negative Predictive Value [NPV] is the share of correct negative predictions among all negative predictions, TN/(TN + FN).

The Matthews Correlation Coefficient [MCC] takes values between -1 and 1. An MCC of 1 indicates a perfect classifier, a perfectly imperfect model has an MCC of -1, and an MCC of 0 corresponds to a model no better than chance. Since MCC incorporates TP, TN, FP and FN, the classifier has to predict both the negative and the positive class well to achieve a high MCC. This is what makes MCC useful when we have imbalanced data (Chicco, 2017).

Cohen's Kappa [Kappa] compares the observed accuracy [ACC] with the expected accuracy [ACCE]; in a nutshell, it measures how well the model corresponds to reality. Kappa is particularly useful when we have imbalanced data and ACC is not sufficient. A classifier with a Kappa below 0.4 is considered poor, while a Kappa above 0.75 indicates an excellent classifier (Liu, 2018, p.25). Just like MCC, Kappa takes values between -1 and 1, where a Kappa of 1 means perfect agreement between the predictions and the actual values, a Kappa of 0 means no agreement, and a completely wrong model has a Kappa of -1 (Akosa, 2017).
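All of the confusion-matrix metrics above can be computed directly from the four counts TP, FP, TN and FN. A minimal sketch (the counts in the example call are made up for illustration):

```python
from math import sqrt

def confusion_metrics(tp, fp, tn, fn):
    """Evaluation metrics derived from a binary confusion matrix."""
    tot = tp + fp + tn + fn
    recall = tp / (tp + fn)              # True Positive Rate
    specificity = tn / (tn + fp)         # True Negative Rate
    precision = tp / (tp + fp)           # Positive Predictive Value
    npv = tn / (tn + fn)                 # Negative Predictive Value
    acc = (tp + tn) / tot                # Accuracy
    g_mean = sqrt(recall * specificity)  # Geometric Mean
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )                                    # Matthews Correlation Coefficient
    # Expected accuracy under chance agreement, then Cohen's Kappa
    acc_e = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / tot ** 2
    kappa = (acc - acc_e) / (1 - acc_e)
    return {"ACC": acc, "Recall": recall, "Specificity": specificity,
            "Precision": precision, "NPV": npv, "G-mean": g_mean,
            "MCC": mcc, "Kappa": kappa}

m = confusion_metrics(tp=90, fp=10, tn=890, fn=10)
```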

2.9.3 Area Under Curves

The Receiver Operating Characteristics [ROC] curve displays both error types simultaneously for all possible thresholds. We obtain the ROC by plotting Recall against the FPR, and the Area Under the ROC Curve [AUC] summarizes the classifier's overall performance across all possible thresholds. An AUC close to 1 indicates a good classifier, while an AUC of 0.5 is no better than chance (James et al., 2013, p.147). Since both Recall and FPR are unaffected by the prior class distribution, so is the AUC (Maucort-Boulch et al., 2015). When dealing with imbalanced data sets, we must take caution when interpreting the ROC curve: it tends to reward maximizing both TP and TN, whereas in imbalanced data sets we are mainly interested in the TP. This can make the ROC misleading.

A more robust alternative for imbalanced data sets is to examine the Precision-Recall Curve [PRC], because the TN is absent from the PRC (Rehmsmeier & Saito, 2015). In an imbalanced data set with few positive observations, which also happen to be the class of interest, it is preferable not to include TN in the prediction score (Chicco, 2017).

As the name suggests, the PRC shows the tradeoff between Precision and Recall for different thresholds. A high Recall tells us that the classifier returns the majority of the positive class, while a high Precision states that the results are accurate. For illustration, imagine a binary classification problem where we want to separate apples from other fruits, and the apples are a minority. If we classify all fruits as apples, we get the highest possible Recall, i.e. one, but this leads to a low Precision, since we falsely accuse innocent fruits of being apples. On the other hand, if we classify only one fruit as an apple and do so correctly, the Precision is one but the Recall is low, and the majority of the apples go undetected. This is the tradeoff to be considered. To summarize the model's overall ability to classify the minority class we calculate the Area Under the PRC [AUPRC], also known as Average Precision [AP]. The AP is defined as

AP = ∑_n (Recall_n − Recall_{n−1}) ∗ Precision_n

where n indexes the thresholds (SciKit Learn, 2019). We want the AP to be as close to one as possible.


3 Method

This chapter goes through how we process the data, followed by how each model is created. The pre-processing includes, but is not limited to, steps such as how we create our response variable, how we join Macro variables, handle missing values, group minority categories and oversample our minority class. Worth noting is that the data processing differs slightly between the tree-based methods and the ANN, because the tree-based models require oversampling and the ANN requires standardization. The data are also processed somewhat differently for the LR, according to Nordea's directive and due to its usage of the variable Norway GDP with lag.

3.1 Pre-Processing Data

3.1.1 Creating the Target Variable: YearDefault

The first step is the creation of the target variable YearDefault. YearDefault is created with help of the Shipping data variables b403 and B1, where b403 states whether a specific customer B1 is in Default. Our goal is to find customers that will Default within 12 months. Before we can create YearDefault, we must make sure that there are no data errors in b403 and B1. We mainly want to make sure there are no missing values in B1, and there are not, nor can we see anything else out of the ordinary in B1. When we look closer at b403, we observe that there are 245 rows in the Shipping data with a Default rating, RAT ϵ [0+, 0, 0−], that are not correctly classified as a Default by b403. We update the variable b403 with the correct values. At this point we are confident that B1 and b403 have correct values and we can create our target YearDefault. For each customer we look for a Default, and when a customer in Default is found we trace back 12 months and give YearDefault the value 1. Table 4 illustrates how we build the variable YearDefault.

Table 4: Illustration of YearDefault

Month  B1         b403  YearDefault
1      Alexander  0     0
2      Alexander  0     0
3      Alexander  0     1
4      Alexander  0     1
5      Alexander  0     1
6      Alexander  0     1
7      Alexander  0     1
8      Alexander  0     1
9      Alexander  0     1
10     Alexander  0     1
11     Alexander  0     1
12     Alexander  0     1
13     Alexander  0     1
14     Alexander  1     1
1      Jacob      0     0
2      Jacob      0     0

The pseudocode for the creation of YearDefault is

1. Input the Shipping data set with B1: customer ID, b403: Default status, and set R = number of rows.

2. Set YearDefault[ ]: array for the new variable, showing whether a customer Defaulted within M = 12 months.

3. For row-loop row = 1 to R:

   a) For month-loop month = 1 to M:
      1. Set index = row + month − 1.
      2. If index ≤ R, proceed; else break month-loop.
      3. If B1[row] = B1[index], proceed; else break month-loop.
      4. If b403[index] = 1, set YearDefault[row] = 1 and break month-loop.
      5. If month = M, set YearDefault[row] = 0.

4. Return the array YearDefault.
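The pseudocode above can be sketched directly in Python (0-indexed; the function name and the toy data mirroring Table 4 are ours):

```python
def build_year_default(b1, b403, m=12):
    """For each row, flag 1 if the same customer (B1) has a Default
    flag (b403 = 1) within the next m months, current month included."""
    r = len(b1)
    year_default = [0] * r
    for row in range(r):
        for month in range(1, m + 1):
            index = row + month - 1
            if index >= r or b1[index] != b1[row]:
                break  # end of data or a different customer
            if b403[index] == 1:
                year_default[row] = 1
                break
    return year_default

# toy data mirroring Table 4: Alexander defaults in month 14
b1 = ["Alexander"] * 14 + ["Jacob"] * 2
flags = build_year_default(b1, [0] * 13 + [1] + [0, 0])
```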


3.1.2 Macro Data

We want to add features from the Macro data, a data set showing general macroeconomic trends. First, we remove all rows with Scenario set to either Better or Worse, because the values of these observations are simulated for scenario analysis, and we only want to keep the true values from the Baseline Scenario. All remaining rows in the data set are at this stage part of the same Baseline Scenario and extracted on the same date, so we remove the features scenario and extractionmonth because they do not carry any information.

Because most of the values in the Macro data steadily increase, resembling a Wiener process with drift, as illustrated for the macro variable NOR_GDP_org in Figure 9, the models risk interpreting the values as a substitute for which year and month the observation is from and training on that. This is not something we want to risk implementing, so we instead turn each Macro variable into its quarterly percentage change, capturing macroeconomic changes rather than levels. After this we remove all observations with dates in DATA_PERIOD that are not found in the variable data_period in the Shipping data and thus are not of use.

(a) Norway GDP for Months in Macro Data (b) Norway GDP Change for Months in Shipping Data

Figure 9: Norway GDP
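The level-to-change transformation can be sketched with pandas (the series values here are made up; NOR_GDP_org is the only column name taken from the text):

```python
import pandas as pd

# monthly macro observations with steadily increasing levels
macro = pd.DataFrame(
    {"NOR_GDP_org": [100.0, 101.0, 102.0, 103.0, 104.0, 105.0, 106.0]},
    index=pd.period_range("2015-01", periods=7, freq="M"),
)
# quarterly change on monthly data = percentage change over 3 months;
# this removes the trend so values no longer encode the calendar date
macro_chg = macro.pct_change(periods=3).dropna()
```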

This new Macro data set contains 75 variables, excluding the new index and key variable DATA_PERIOD showing the month and year of each observed set of Macro features. To reduce this number of features we conduct a Principal Component Analysis [PCA]. First, we standardize the Macro data using the StandardScaler function from the sklearn module, followed by the PCA function from the same module to transform the Macro data into Principal Components [PC]. Figure 10 shows Scree plots of, first, all 75 PC, and then the chosen cut-off, according to the elbow rule, at 15 PC. This corresponds to 90% explained variance. Furthermore, because a large majority of the observations come from Norway, we add the quarterly and yearly change in Norway GDP to the set of Macro PC, making it a set of 17 features plus index. These features are then joined with the Shipping data at the correct dates.

(a) All 75 Principal Components (b) A Selection of 15 Principal Components

Figure 10: Scree Plot of Macro Principal Components
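The standardize-then-PCA step can be sketched as follows (random numbers stand in for the 75 macro variables, so the explained-variance share will not match the 90% reported in the text):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
macro = rng.normal(size=(120, 75))               # stand-in for the macro table

scaled = StandardScaler().fit_transform(macro)   # zero mean, unit variance
pca = PCA(n_components=15).fit(scaled)           # keep 15 components
pcs = pca.transform(scaled)                      # the 15 Macro PC
explained = pca.explained_variance_ratio_.sum()  # share of variance retained
```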


3.1.3 Initial Data Cleaning

When the variable YearDefault has been created and the Macro data has been added, we remove all rows where b403 indicates a customer in Default. The reason is that we are not interested in when a customer is in Default, but rather whether the customer will Default within a year. At this stage b403 no longer carries any information and can be dropped. Other features dropped at this point are B409, ActivExp, ship, B1cure and b403cure. B409 is dropped since all observations have the same exposure class, ActivExp because all observations have the value 1, i.e. all are active customers, and ship because all customers are shipping customers. B1cure is dropped because it mirrors the customer ID B1, which we are not interested in using for predicting PI. b403cure indicates whether a customer has been cured from a Default, but this is not relevant to us because we only want to predict whether a customer will Default, not whether they will stay in Default. After a closer look at the Shipping data we observe that missing values are represented in various ways. Table 5 shows the features and their missing value representations.

Table 5: Missing Value Representation

Feature     Missing Value Representation(s)
brs8_rat    nan, RAT_UNKNOW
BP92        nan, U, U1, UG, 3N
riskGrade   nan, 100
RAT         nan, U, N, .

To make it easier to handle missing values we change the representation to nan for all of them. The scoring values sco ϵ [0+, 0, 0−] should indicate a customer in Default, but these scoring values can still be found even though the Shipping data no longer contain any customers in Default. Thus, these values are not correct, and we remove them completely from sco and instead treat them as missing values.

We observe that riskGrade has missing values, and that it can be directly mapped from RAT. Therefore, we use the information from RAT to fill in some gaps in riskGrade. After this, all information found in RAT is also found in riskGrade, and thus we drop RAT from the data set.


3.1.4 Missing Data

Since we do not have any missing values in the Macro data, we only consider the remainder of the Shipping data in this section. There is no obvious way to impute missing values for riskGrade: the available features cannot explain it, so we remove the rows with missing values for riskGrade. Handling the missing values is a necessity for later, when we use the resampling method SMOTE together with ENN.

After dropping the missing values for riskGrade, Table 6 presents the remaining missing values in the Shipping data set.

Table 6: Missing Values

Feature       Total Number   Percentage (%)
B40           0              0.0
DA            0              0.0
B1            0              0.0
B45           0              0.0
data_period   0              0.0
sco           46258          42.0
brs8_rat      23580          21.4
brs8_sco      46058          41.8
riskGrade     0              0.0
B419          64             0.1
B416          35535          32.3
BP78          51059          46.4
BP92          56701          51.5
YearDefault   0              0.0

There are still many missing values in the data, so a closer look is of interest. Figure 11 gives us a better view of the missing values in the data set.

Figure 11: Null-Matrix

The white spots are the missing values in the data set. The positions of the missing values would change if the data set were sorted differently, but the correlation between the variables would remain. The missing values for sco, brs8_sco, BP78 and BP92 seem to be highly correlated, with the majority of the missing data at the beginning and the end; BP92 also has a fair amount of missing values in the middle. brs8_rat follows the mentioned features as well, but with fewer missing values, while the missing values for B416 are more evenly spread over the entire data set. The correlation heatmap of missing data in Figure 12 confirms these observations.


Figure 12: Correlation Heatmap over Missing Values

If we remove all rows with any missing values, only approximately 21% of the original data remains (after Defaulted customers are removed). Hence, we impute the missing values using MICE.

3.1.5 Imputing Missing Values

For sensitivity analysis purposes, we generate five different imputed data sets using MICE, without the response variable YearDefault, through the function MICEData from the statsmodels module. The five MICE runs produce similar imputed values, which gives us some confidence that the results are stable. We choose one of the generated data sets at random.
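The chained-equations idea behind MICE can be sketched without statsmodels: each incomplete column is repeatedly regressed on the others and refilled with its fitted values. This is a simplified stand-in for MICEData (no random draws, OLS only), and the column names and values below are toy:

```python
import numpy as np
import pandas as pd

def mice_like_impute(df, n_iter=5):
    """Iteratively regress each incomplete column on the others (OLS)
    and replace its missing entries with the fitted values."""
    masks = {c: df[c].isna().to_numpy() for c in df.columns if df[c].isna().any()}
    data = df.fillna(df.mean())                       # start from column means
    for _ in range(n_iter):
        for col, mask in masks.items():
            x = data.drop(columns=col).to_numpy()
            x = np.column_stack([np.ones(len(x)), x])  # add intercept
            y = data[col].to_numpy()
            beta, *_ = np.linalg.lstsq(x[~mask], y[~mask], rcond=None)
            data.loc[mask, col] = x[mask] @ beta       # refill missing entries
    return data

df = pd.DataFrame({
    "BP78": [17.0, 20.0, np.nan, 15.0, 18.0, np.nan],
    "B416": [1.0, 2.0, 3.0, np.nan, 5.0, 6.0],
    "B419": [2.0, 4.0, 6.0, 8.0, 10.0, 12.0],
})
imputed = mice_like_impute(df)
```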

Table 7 shows a summary of the quantitative variables with missing values before and after imputation, while Figure 13 presents the appearances of each class of the qualitative variables with missing values before and after imputation.

Table 7: Summary of Features before and after Imputation

            Before                              After
            BP78       B416       B419         BP78        B416       B419
count       58998.000  7.452e+04  1.100e+05    110057.000  1.101e+05  1.101e+05
mean        344.577    1.384e+07  1.795e+07    418.090     1.051e+07  1.795e+07
std         662.297    2.808e+07  4.630e+07    709.602     2.429e+07  4.631e+07
min         -40.000    1.100e-03  0.000e+00    -40.000     1.100e-03  0.000e+00
25%         17.000     3.073e+05  1.604e+02    17.000      2.402e+05  1.603e+02
50%         17.000     3.801e+06  9.476e+05    17.000      1.894e+06  9.476e+05
75%         17.000     1.604e+07  1.647e+07    498.000     1.012e+07  1.647e+07
max         1783.000   1.583e+09  1.165e+09    1783.000    1.583e+09  1.165e+09


(a) Appearances by brs8_sco (b) Appearances by brs8_rat

(c) Appearances by sco (d) Appearances by BP92

Figure 13: The Appearances of Features before and after Imputation

Inspecting Table 7, we see that B419 looks almost the same before and after the imputation, which we expected because it had only 0.1% missing values. BP78 changes in the 75th percentile, mean and standard deviation; all these measures are higher after imputation. B416 has lower mean, standard deviation and 25th, 50th and 75th percentiles after imputation. However, all variables stay within the same range before and after imputation, and from Figure 13 we observe that the distribution between the classes is preserved. Considering the large amount of missing values, the result of the imputation seems reasonable.

3.1.6 Grouping of Minority Categories

Because we will later One-Hot encode all categorical features, as many new features will be created as there are distinct categories across the categorical features. To reduce both the number of dimensions later created and the noise, we first want to group together minority categories within the categorical features and, if needed, remove observations belonging to minority categories.

When examining how many times each country appears in the data set and sorting them cumulatively, we observe in Figure 14 that a large share of all observations belongs to only relatively few countries, with a large portion belonging to NO.

Figure 14: Cumulated Appearances by Countries

26

Page 32: STAT ERIKSSON LÅNGSTRÖM1324129/FULLTEXT01.pdfAbstract Probability of Impairment, or Probability of Default, is the ratio of how many customers within a

Based on visual inspection, we choose the Country coded IL to determine our cut-off point. Because the Country code IL appears 725 times in the data set, we use 700 as our cut-off value, changing the Country code to Other for all observations with a Country code appearing fewer than 700 times. This reduces the number of Country categories from 73 down to 22. The same process is repeated for Industry and Reason. For Industry we set the cut-off at 300 and change the Industry code of observations appearing less often than that to 1, reducing the number of Industry categories from 18 down to 13. For Reason we set the cut-off at 500 and change the Reason code of observations appearing less often than that to 1, reducing the number of Reason categories from 44 down to 10. The remaining categories of Country, Industry and Reason are shown in Figure 15. Even though there are still clear minority categories, we consider there to be enough observations within each remaining category to justify keeping them as is.

Figure 15: Category Appearances
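The cut-off grouping can be sketched with pandas value counts (the helper function and the toy counts are ours; only the Country code names and the cut-off of 700 come from the text):

```python
import pandas as pd

def group_minorities(s, cutoff, other):
    """Replace categories appearing fewer than `cutoff` times with `other`."""
    counts = s.value_counts()
    keep = counts[counts >= cutoff].index
    return s.where(s.isin(keep), other)

# toy frequencies: IL sits just above the 700 cut-off, XX far below
country = pd.Series(["NO"] * 900 + ["SE"] * 750 + ["IL"] * 725 + ["XX"] * 5)
grouped = group_minorities(country, cutoff=700, other="Other")
```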

There are only six categories within the variable DA, with one clear minority category, as shown in Table 8. Because there are only 5 observations within the category 1792, we remove this category and its observations entirely.

Table 8: Appearances by DA

DA     Appearances
1785   47805
1791   30740
1780   16459
1770   9768
1775   5280
1792   5

3.1.7 Historical Customer Data

The variables B416 and B419 describe the Utilization Amount and Credit Limit of customers. A higher value indicates a larger and potentially more important customer and a lower value a smaller one, where larger customers must also provide larger amounts of securities than smaller ones. These features will prove important for the models when predicting Defaults, and we were informed by Nordea about cases where customers either get into tougher financial situations, or even know themselves that they are about to Default, and therefore accumulate a larger amount of debt. We were therefore asked to create new variables able to capture these changes, but this can only be done for customers who are in the system for multiple months. There are 183 customers appearing as Shipping customers for only one month, 938 for only two months, and 91 for only three months, a total of 1212 customers or 2332 observations. Because of the request to implement historical data from the variables B416 and B419, we remove these 2332 rows from the Shipping data, since they do not contain enough history.


For each remaining observation we start by adding three new variables for each of B416 and B419: Percentage Change past month, Percentage Change past 2 months, and Percentage Change past 3 months. To capture the monthly changes during the past six months for every observation, we also add Percentage Change past month with monthly lags of up to five months. These new variables are illustrated in Figure 16. Because some customers appear in the Shipping data for fewer months than that, they will still have some missing values. We treat these in two imputation steps: first, we group the historical data by customer ID B1 and impute the missing values of each historical variable with the customer's mean value; second, we impute each value still missing after the first step with 0, chosen because it equals no percentage change. In total this creates 16 new variables.

Figure 16: How Historical Changes in B416 and B419 are Captured with 8 Variables each
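The 8 historical variables per feature can be built with grouped percentage changes and lags; a sketch for B416 alone (toy values, and the new column names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "B1":   ["A"] * 6 + ["B"] * 4,   # customer IDs
    "B416": [10.0, 11.0, 12.0, 11.0, 13.0, 14.0, 5.0, 5.0, 6.0, 7.0],
})
g = df.groupby("B1")["B416"]
for k in (1, 2, 3):                       # change past 1, 2, 3 months
    df[f"B416_chg{k}"] = g.pct_change(k)
chg1 = df["B416_chg1"]
for lag in range(1, 6):                   # 1-month change, lagged 1..5 months
    df[f"B416_chg1_lag{lag}"] = chg1.groupby(df["B1"]).shift(lag)

hist = [c for c in df.columns if c.startswith("B416_chg")]
# two-step imputation: per-customer mean first, then 0 (= no change)
df[hist] = df.groupby("B1")[hist].transform(lambda s: s.fillna(s.mean()))
df[hist] = df[hist].fillna(0)
```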

3.1.8 Splitting the Data

We split the data into the following three groups

• Training data: 60%

• Validation data: 20%

• Testing data: 20%

We do not, however, split the data randomly by observation, but randomly by Customer ID using the GroupShuffleSplit function from the sklearn module, because we want to fully hedge against information leaking between the Training, Validation and Testing data. This does not mirror reality perfectly: even though new customers arrive next year, there will still be many returning customers, so a certain level of information leakage would be permissible. We decide not to take this into account and instead hedge completely against any information leakage. Because we split on Customer ID B1, the different sets will not contain exactly equal numbers of observations. The distribution of observations within each set is illustrated in Figure 17.

Figure 17: Pie Chart showing Data Split
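A customer-level 60/20/20 split can be sketched with two applications of GroupShuffleSplit (toy data; the synthetic group IDs stand in for B1):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 4))
y = rng.integers(0, 2, size=n)
groups = rng.integers(0, 80, size=n)      # hypothetical customer IDs (B1)

# first split off 20% of the customers as Testing data ...
outer = GroupShuffleSplit(n_splits=1, test_size=0.20, random_state=0)
rest_idx, test_idx = next(outer.split(X, y, groups))
# ... then 25% of the remaining customers as Validation (0.25 * 0.80 = 0.20)
inner = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
tr, va = next(inner.split(X[rest_idx], y[rest_idx], groups[rest_idx]))
train_idx, val_idx = rest_idx[tr], rest_idx[va]
```

Because the split is by group, no customer ends up in more than one of the three sets, which is exactly the leakage hedge described above.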

3.1.9 Oversampling, One-Hot Encoding and Standardization

Due to the data set being heavily imbalanced, we use the oversampling technique SMOTE on the Training data followed by cleaning with ENN [SMOTEENN] to create a new Training set used when training the tree-based models RF and XGBoost. Using the SMOTEENN function from the imblearn module, we first oversample the Default observations until there are as many Default observations as Non-default, and then remove observations based on the majority class of their three nearest neighbours: an observation is removed if the majority class of its neighbours differs from its own. Doing this increases the number of observations in our Training set from 65396 to 102609, with about as many customers tagged as Default as tagged as Non-default. The distribution of the data is illustrated in Figure 18. We do not conduct any oversampling or cleaning on the Training data used for the ANN.

Figure 18: Pie Chart showing Data Split after SMOTEENN

After this step we use the get_dummies function from the pandas module to one-hot encode the categorical features in every data set so they can be used by the RF, XGBoost and the ANN, turning each data set from 47 features into 155 features. If a category happens to be missing from a feature in either the Training, Validation or Testing data, for example if no observation with riskGrade 4 exists in the Validation data, the corresponding one-hot encoded variable is added but contains only zeroes. Additionally, we create standardized versions of the Training, Validation and Testing data which we use for the ANN during its training, validation and testing phases.
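Adding the all-zero columns for categories absent from a set can be done by reindexing against the Training columns; a sketch with toy frames in which riskGrade 4 is deliberately missing from the Validation data:

```python
import pandas as pd

train = pd.DataFrame({"riskGrade": ["3", "4", "5"], "B419": [1.0, 2.0, 3.0]})
valid = pd.DataFrame({"riskGrade": ["3", "5", "5"], "B419": [4.0, 5.0, 6.0]})

train_oh = pd.get_dummies(train, columns=["riskGrade"])
valid_oh = pd.get_dummies(valid, columns=["riskGrade"])
# align on the Training columns; absent categories become all-zero columns
valid_oh = valid_oh.reindex(columns=train_oh.columns, fill_value=0)
```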

3.1.10 Data for Linear Regression

We want to build a LR model that resembles Nordea's existing model. Therefore, we handle the data for the LR further according to Nordea's guidance. First we group riskGrade ϵ [3, 4, ..., 8] and riskGrade ϵ [18, 19, 20] together. The next step is creating the response variable Realised_PI, which is built with help of data_period, riskGrade and YearDefault: for each year and Risk Grade we calculate the mean value of YearDefault. At this point we add the yearly percentage change of the Macro variable NOR_GDP_org. As before, we are only interested in the true values from the Baseline scenario. Moreover, we add the yearly percentage change with monthly lags of [3, 6, 9, 12, 15, 18, 21, 24], i.e. to capture the yearly percentage change three months ago, six months ago and so on. Next we remove all rows in the Training data set where Realised_PI is zero. We take the natural logarithm of Realised_PI and create the final Training and Testing data sets with ln(Realised_PI), riskGrade, the yearly percentage change of NOR_GDP_org and the lags. Hence, we have data sets presented on a yearly basis; Table 9 shows the structure of the LR data sets.

Table 9: Illustration of LR Data

Year ln(Realised_PI) riskGrade Nor_GDP_org Nor_GDP_lag03 … Nor_GDP_lag24

2008   -5.0   8    -0.006   -0.010   ...   0.031
2008   -4.0   9    -0.006   -0.010   ...   0.031
2008   -3.8   10   -0.006   -0.010   ...   0.031
...
2017   -2.6   16    0.026    0.014   ...   0.018
2017   -2.0   17    0.026    0.014   ...   0.018
2017   -1.7   18    0.026    0.014   ...   0.018
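The construction of the LR response can be sketched with a grouped mean and a log transform (toy rows; the column names follow the text):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "data_period": ["2008-01", "2008-05", "2008-07", "2009-02", "2009-03"],
    "riskGrade":   [9, 9, 9, 10, 10],
    "YearDefault": [0, 1, 0, 0, 1],
})
df["Year"] = df["data_period"].str[:4].astype(int)

# Realised_PI = mean YearDefault per year and Risk Grade
pi = (df.groupby(["Year", "riskGrade"])["YearDefault"]
        .mean().rename("Realised_PI").reset_index())
pi = pi[pi["Realised_PI"] > 0]                 # drop rows where Realised_PI = 0
pi["ln_Realised_PI"] = np.log(pi["Realised_PI"])
```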


3.2 Models

3.2.1 ANN - Artificial Neural Network

For training the ANN model we use the standardized Training data, for tuning the hyperparameters associated with the ANN we use the standardized Validation data, and later for testing the model we use the standardized Testing data.

We build the ANN using the module Keras. There are many hyperparameters associated with an ANN and, due to time and computational constraints, we are not able to build a large number of different models. So, to help us select a good enough ANN, we initially use Randomized Grid Search, from the module sklearn, with 3-fold cross-validation. We restrict the number of hyperparameters included in the grid as well as the possible values each hyperparameter can take. We build the grid using the parameters and corresponding values presented in Table 10.

Table 10: ANN Parameter Grid

Hyperparameter Values

Learning Rate [0.0001,0.001,0.01,0.1]

Hidden Layers [2,4,6,8,10]

Hidden Units [8,16,32,64,128,256]

Hidden Activation Function [’relu’, ’tanh’]

Optimization Algorithm [’Adam’, ’RMSprop’,’Nadam’,’Adamax’]

Mini Batch Size [16,32,64,128,256]

We randomly select 600 out of the 4800 possible models from the grid. The selection criterion we use during the Randomized Grid Search is average precision, and we tune the hyperparameters on the Training data set. Hidden Activation Function refers to the activation functions used in all layers except the output layer; as mentioned in Section 2.7, the activation function we use in the output layer is the sigmoid, since we have a binary classification problem.
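The 600-of-4800 random selection amounts to sampling from the Cartesian product of Table 10; a sketch in plain Python (the dictionary keys are our shorthand for the hyperparameter names):

```python
import itertools
import random

grid = {
    "learning_rate": [0.0001, 0.001, 0.01, 0.1],
    "hidden_layers": [2, 4, 6, 8, 10],
    "hidden_units":  [8, 16, 32, 64, 128, 256],
    "activation":    ["relu", "tanh"],
    "optimizer":     ["Adam", "RMSprop", "Nadam", "Adamax"],
    "batch_size":    [16, 32, 64, 128, 256],
}
all_configs = list(itertools.product(*grid.values()))  # 4*5*6*2*4*5 = 4800
random.seed(0)
sampled = [dict(zip(grid, cfg)) for cfg in random.sample(all_configs, 600)]
```

Each sampled configuration would then be trained and scored with 3-fold cross-validation on average precision, keeping the best.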

The Randomized Grid Search results in a model with two hidden layers, 128 hidden units (neurons) in each layer (except the output layer), tanh as hidden activation function, a mini batch size of 128, the optimization algorithm AdaMax and a learning rate of 0.01. Since we use the same number of hidden units for all layers in the Randomized Grid Search, we run another tuning process: a Grid Search with 3-fold cross-validation over the number of hidden units in each layer (excluding the output layer). Each layer can take [16, 64, 128, 256] units in this grid, which means that we build 64 different models. From the Grid Search, the best model, i.e. the one with the highest average precision, has 256 hidden units in the input layer and the first hidden layer, and 128 hidden units in the second hidden layer. We continue to tune hyperparameters manually with help of the Validation data set. Keras includes a parameter called class_weight, which weighs the loss function during training. This is useful for us since we have very imbalanced data: class_weight helps us pay more attention to the minority class. The final ANN architecture is presented in Table 11.

Table 11: The Architecture of ANN

Hidden Units Activation

Input layer 256 tanh

Hidden Layer 1 256 tanh

Batch Normalization

Hidden Layer 2 128 tanh

Output Layer 1 sigmoid


We use batch normalization for the purpose of regularization, together with a learning rate decay of 0.02. The manual tuning also results in a change of the learning rate to 0.02. Table 12 presents the final hyperparameter settings.

Table 12: Final Hyperparameter Settings for ANN

Hyperparameter Settings

Hidden layers 2

Hidden Units [256,256,128,1]

Learning Decay 0.02

Learning Rate 0.02

Activation Function [tanh, tanh, tanh, sigmoid]

Optimization Algorithm AdaMax

mini batch size 128

class_weight 1:7 mapping

Epochs 9

Figure 19 shows the history from training the final ANN. As one can see, there is evidence of some overfitting. The reason we nevertheless choose this specific ANN model is that it generalizes far worse with heavier regularization, i.e. such a model predicts poorly on unseen data.

(a) Loss (b) Matthews Correlation Coefficient [MCC]

(c) Precision (d) Recall

Figure 19: History from Training the ANN


3.2.2 RF - Random Forest

For training the RF model we use the Training data oversampled with SMOTEENN, for tuning the weight associated with the Default class we use the Validation data, and later for testing the model we use the Testing data. We build the model using the RandomForestClassifier function from the module sklearn with 200 trees, where, when creating each tree, we apply both bagging and feature subsampling to reduce overfitting during training. For bagging we sample 50% of the Training data prior to building each tree, and for feature subsampling we randomly make 10% of the features available at each node for the model to decide the best split. To limit the training time for the RF we set a depth limit of 15 levels. For this model we examine its confusion matrix on the Validation data to check the balance of FP and FN: if the model predicts an uneven amount of FN and FP, or fewer FN than FP, we change the weight of the minority class to better control the balance and make the model slightly more biased towards predicting Defaults. The weight we settle on for our final RF model is 1.5. The hyperparameter settings for RF that are not left at their default options are presented in Table 13.

Table 13: Hyperparameter Settings for RF

Hyperparameter   Setting
n_estimators     200
max_depth        15
class_weight     {1: 1.5}
max_features     0.1
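For illustration, the RF setup described above can be sketched in sklearn as follows. This is a hedged sketch on synthetic data, not the exact implementation; the data set is a stand-in for the SMOTEENN-resampled Training data.

```python
# Sketch of the RF configuration described above (assumed mapping to sklearn;
# synthetic data stand in for the resampled Training data).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_train, y_train = make_classification(n_samples=1000, n_features=50,
                                       weights=[0.9, 0.1], random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,               # 200 trees
    max_depth=15,                   # depth limit of 15 levels
    max_features=0.1,               # 10% of features available at each split
    max_samples=0.5,                # bagging: 50% of the Training data per tree
    class_weight={0: 1.0, 1: 1.5},  # extra weight 1.5 on the Default class
    random_state=0,
)
rf.fit(X_train, y_train)
probs = rf.predict_proba(X_train)[:, 1]  # per-customer probability of Default
```

Note that `max_samples` here expresses the 50% bagging fraction and requires bootstrapping to be enabled, which is sklearn's default.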

Figure 20 shows the feature importance in relative appearances for the 20 most important features in the RF. These mainly consist of the Macro PC.

Figure 20: The 20 Highest Relative Appearances for RF


3.2.3 XGBoost - Extreme Gradient Boosting

For training the XGBoost model we use the Training data oversampled with SMOTEENN, for tuning hyperparameters we use the Validation data, and for testing the model we use the Testing data. We build the model using the XGBClassifier function from the XGBoost module with binary:logistic as our objective, because we want the output to be probabilities for binary classification. When training each tree, we apply both bagging and feature subsampling to reduce overfitting. For bagging we sample 80% of the Training data prior to building each tree, followed by feature subsampling where we randomly make 10% of the features available at each node for deciding the best split. To further decrease the likelihood of overfitting, we only let each tree be grown 5 levels deep.

For every tree we add when training the XGBoost model, we evaluate the result on the Validation data and measure its AUPRC score, stopping once the AUPRC score has not improved in 100 rounds. For this model we examine the confusion matrix for its predictions on the Validation data to check the balance of FP and FN. If the model predicts an uneven amount of FN and FP, or fewer FN than FP, we change the weight of the minority class to better control the balance and make the model slightly more biased towards predicting Defaults. The weight we settle on for our final XGBoost model is 2.31, for which the best AUPRC score is achieved using 763 trees at learning rate 0.1. The hyperparameter settings for XGBoost that are not set to their default values are presented in Table 14.

Table 14: Hyperparameter Settings for XGBoost

Hyperparameter     Setting
objective          binary:logistic
n_estimators       762
learning_rate      0.1
max_depth          5
subsample          0.8
colsample_bynode   0.1
scale_pos_weight   2.31

Figure 21 shows the feature importance for the XGBoost model based on the metrics Gain and Appearances. For each metric the 20 most important features are shown, where 21a shows importance based on Total Gain, 21b importance based on Average Gain and 21c importance based on Appearances. Because the data contain categorical variables which have been one-hot encoded and thus converted into many binary variables, these can appear at most once in every tree. This makes Gain the more reliable metric to use, since it shows the contribution to the model. The features considered most important according to Total Gain are mostly Macro PC, with features such as Aligned Score BP78, Delivery Agreement DA, Credit Limit B416 and Utilization Amount B419 showing the highest importance from the Shipping data. The Risk Grades showing the highest importance are Risk Grades 9, 10 and 15.


(a) Total Gain for Features in XGBoost (b) Average Gain for Features in XGBoost

(c) Feature Appearances for XGBoost

Figure 21: Feature Importance for XGBoost

3.2.4 Ensemble of ANN and XGBoost

The Ensemble is built by training the ANN, RF and XGBoost in parallel, and then weighing their predictions using another, simpler XGBoost model. The Training data used in the Ensemble method thus consist of the predicted classifications and regression values from the ANN, RF and XGBoost on the Validation set as features, and the true classifications as labels. As it later turned out, only the regression features from the ANN and XGBoost are used by the final XGBoost wrapper, which lets us accurately illustrate the data as in Figure 22. Figure 22a is a scatter plot showing the Training data used in the Ensemble method, plotted against the regression outputs from the ANN and XGBoost. Observations to the right of the vertical line are predicted as Defaults by the XGBoost model and observations above the horizontal line are predicted as Defaults by the ANN. That means observations in the top right corner are predicted as Defaults by both models, while observations in the top left corner are predicted as Defaults by the ANN but not by XGBoost.


(a) Scatter Plot over the used Features

(b) Pie Chart showing the Distributions of Labels

Figure 22: Training Data before Resampling

This data set is heavily imbalanced, as illustrated in Figure 22b, and we therefore use the oversampling technique SMOTE to create synthetic observations, followed by cleaning with ENN through the SMOTEENN function from the imblearn module. We oversample the minority class Defaults up to the point where the ratio between the minority class and the majority class is 0.2, to still keep some imbalance, followed by removing observations based on the majority class of their three nearest neighbours. This means we remove an observation if the majority class of its neighbours differs from the observation itself. Our new Training set is illustrated in Figure 23.

(a) Scatter Plot over the used Features

(b) Pie Chart showing the Distributions of Labels

Figure 23: Training Data after Resampling with SMOTEENN


Because we this time do not have a validation set for tuning hyperparameters, we aim to build a simple XGBoost model using 125 trees and learning rate 0.05. We do, however, have two plausible options for the weight, which we can try against the Training data, choosing the most fitting one. The two candidate weights are 0.2 and 12/88 ≈ 0.14, because these are the ratios Defaults/Non-defaults before and after cleaning with ENN. We use these candidates to train two models and let each make predictions on the same data they were trained on. This creates the confusion matrices illustrated in Figure 24, where we see that even though Figure 24a is slightly better balanced, it underestimates the amount of Defaults. We therefore choose the weight 0.2 instead of the weight 0.14.

(a) scale_pos_weight = 0.14 (b) scale_pos_weight = 0.2

Figure 24: Confusion Matrices for Training Set using Different Weights

In addition to the hyperparameter settings presented in Table 12 and Table 14 for the first-layer ANN and XGBoost, the hyperparameter settings presented in Table 15 are the ones not set to their default values for the second-layer Ensemble XGBoost.

Table 15: Hyperparameter Settings for Ensemble (XGBoost)

Hyperparameter     Setting
objective          binary:logistic
n_estimators       125
learning_rate      0.05
scale_pos_weight   0.2

Figure 25 shows the feature importance from the Ensemble model trained using weight 0.2. Even though the model can use the regression values from the RF and the classifications given by the ANN, RF and XGBoost, none of these features prove to be useful in the Ensemble.

(a) Average Gain for Features in Ensemble (b) Feature Appearances for Ensemble

Figure 25: Feature Importance for Ensemble


3.2.5 Linear Regression

We build nine different Linear Regression models [LR] according to Nordea's preference. The nine LR models comprise all possible combinations of the explanatory variable riskGrade and one of the lagged NOR_GDP_org variables presented in Section 3.1.10. Figure 26 shows the model selection criteria for all the models.

(a) R2 (b) Residual Sum of Squares

Figure 26: Linear Regression Model Selection

From Figure 26, model nine, which includes the explanatory variables riskGrade and NOR_GDP_lag24, performs the best. This conclusion is based on model nine having the highest R2 and the lowest RSS. We continue the analysis with this model. The summary statistics for the regression model can be seen in Table 16.

Table 16: Summary Statistics for Linear Regression

Dep. Variable:     Realised_PI        R-squared:           0.491
Model:             OLS                Adj. R-squared:      0.476
Method:            Least Squares      F-statistic:         32.79
Date:              Thu, 09 May 2019   Prob (F-statistic):  1.07e-10
Time:              22:22:22           Log-Likelihood:      -110.39
No. Observations:  71                 AIC:                 226.8
Df Residuals:      68                 BIC:                 233.6
Df Model:          2

                   coef      std err   t        P>|t|   [0.025    0.975]
const             -8.1288    0.605   -13.426    0.000   -9.337    -6.921
riskGrade          0.3562    0.045     7.993    0.000    0.267     0.445
NOR_GDP_lag24    -25.2275   10.510    -2.400    0.019  -46.199    -4.256

Durbin-Watson: 1.585

All explanatory variables are significant at the 5% significance level. R2 = 0.491 is quite low, which indicates that the model does not explain the data very well. The estimated regression coefficients say that the higher the customer's Risk Grade, the higher the Realised_PI. The opposite holds for the yearly percentage change NOR_GDP_lag24, i.e. the larger the positive change, the lower the Realised_PI. Various assumptions are made when we build an LR, so model diagnostics must be conducted. We divide the model diagnostics into three parts, namely

• Check the error assumptions: constant variance, normality and uncorrelated errors.

• Find unusual observations.

• Check the structure of the model.


First we take a look at the Training data. Figure 27 is a pairwise scatter plot matrix over the Training data.

Figure 27: Pairwise Scatter Plots

From the pairwise scatter plots in Figure 27, we can see a linear relationship between our response variable Realised_PI and riskGrade, while the same cannot be said about NOR_GDP_lag24 and Realised_PI. From a first inspection, however, there do not seem to be any extreme points in any of the plots that could potentially be problematic.

One assumption made when building an LR is that the errors are independent. The Durbin-Watson test checks for autocorrelation. The Durbin-Watson statistic, d, lies between zero and four; a value of two indicates no autocorrelation. We have d = 1.59, see Table 16, which indicates some positive serial correlation, but a rule of thumb is to be alarmed only if d < 1. Even though there is some evidence of autocorrelation, the assumption of independent errors seems reasonable.

We check the assumption of constant variance by plotting the residuals versus the fitted values. In Figure 28 we look for evidence of nonlinearity or heteroscedasticity (non-constant variance). Heteroscedasticity can be easier to spot by plotting the square root of the absolute residuals against the fitted values, see Figure 28b.

(a) Residuals Versus Fitted Values (b) Scale-Location Plot

Figure 28: Residuals versus Fitted Values


The observations in Figure 28 look quite randomly scattered, so the assumption of constant variance seems reasonable. A Quantile-Quantile plot [QQ-plot] can be used to check the normality assumption. A QQ-plot compares the empirical quantiles of the data with the theoretical quantiles of the normal distribution. Figure 29 displays our QQ-plot.

Figure 29: Quantile-Quantile Plot

The sample quantiles follow the theoretical quantiles fairly well, but with a slight S-shape. Hence, we also conduct a formal normality test, the Anderson-Darling test. The Anderson-Darling test resulted in a p-value = 0.64, so the assumption that the residuals are normally distributed is not rejected.
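As a hedged example, an Anderson-Darling normality check on residuals can be run with statsmodels; the residuals below are synthetic stand-ins.

```python
# Anderson-Darling normality test on residuals (illustrative residuals assumed).
import numpy as np
from statsmodels.stats.diagnostic import normal_ad

rng = np.random.default_rng(2)
residuals = rng.normal(0.0, 1.0, 71)  # stand-in for the LR residuals
stat, pvalue = normal_ad(residuals)   # H0: residuals are normally distributed
# A p-value above 0.05 means normality is not rejected.
```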

We now continue to look for unusual observations. Outliers are observations that do not fit the model well. Observations with studentized residuals outside the interval [−2, 2] should be paid attention to. We have eight observations outside the interval, all of which are potential outliers. We use n t-tests with Bonferroni correction to keep a total probability of error α = 0.1. After performing the test, we can reject the hypothesis that we have any outliers. Leverage points are observations that are extreme in the explanatory variable space. A rule of thumb is that observations with leverage greater than 2p/n, where p is the regression dimension and n is the number of observations, should be looked at more closely. We find two observations that are potential leverage points. Influential points are observations that change the model fit greatly. We examine Cook's distance with the cut-off 4/n, and observe six observations that are potentially influential points. In Figure 30 we plot residuals versus leverage and mark the six observations with the highest Cook's distance.

Figure 30: Residuals versus Leverage

The marked observations in Figure 30 do not stand out greatly. We take a closer look at all the above-mentioned unusual observations but cannot find a reason to exclude any observations from the data set. Therefore, we keep all of them.


We also want to see if the model is correct, i.e. check the structure of the model. Figure 31 shows Partial Regression plots and Partial Residual plots for the two explanatory variables.

(a) Partial Regression Plot for riskGrade (b) Partial Residual Plot for riskGrade

(c) Partial Regression Plot for NOR_GDP_lag24 (d) Partial Residual Plot for NOR_GDP_lag24

Figure 31: Partial Regression- and Partial Residual Plots

Based on the plots in Figure 31, the linearity assumption for riskGrade seems reasonable, while NOR_GDP_lag24 might behave more like a second-degree polynomial function.

The last thing we check is collinearity, which is problematic if the explanatory variables are dependent. Figure 32 shows the correlation matrix of the LR Training data. We can conclude from Figure 32 that there is no evidence of collinearity between our explanatory variables.

Figure 32: Correlation Matrix


4 Results

4.1 Linear Regression

The best model of the nine LR models investigated, see Section 3.2.5, includes the explanatory variables riskGrade and the 24-month lag of the Norwegian GDP yearly change. The regression equation for the model is

ln(Realised_PI) = −8.1288 + 0.3562 · riskGrade − 25.2275 · NOR_GDP_lag24

and thus

Realised_PI = exp(ln(Realised_PI))

Figure 33 shows the predicted PI from the LR against the realised PI.

Figure 33: Fitted versus True Values for LR


4.2 Classifiers

The predictions made by each classifier are illustrated in Figure 34 as confusion matrices. Out of all observations, the ANN classified 310 Defaults correctly, 288 Defaults as Non-defaults and 349 Non-defaults as Defaults; the remaining were correctly classified Non-defaults. The RF classified 218 Defaults correctly, 380 Defaults as Non-defaults and 256 Non-defaults as Defaults; the remaining were correctly classified Non-defaults. The XGBoost classified 302 Defaults correctly, 296 Defaults as Non-defaults and 327 Non-defaults as Defaults; the remaining were correctly classified Non-defaults. The Ensemble classified 325 Defaults correctly, 273 Defaults as Non-defaults and 325 Non-defaults as Defaults; the remaining were correctly classified Non-defaults.

Figure 34: Confusion Matrix for each Classifier

The results from the single-threshold measurements for all classifiers are presented in Table 17, where green marks the best performing model, yellow the second-best, and red the worst performing model for each metric. The comparison shows that the Ensemble model consistently outperforms the other models, with the ANN second and XGBoost third, and the RF the worst performing model. We are more interested in a conservative model, i.e. we would rather have fewer FN even if this leads to more FP. This means we would rather overestimate than underestimate the PI for the segment. Therefore, we accept lower scores for precision and specificity if this leads to higher recall and NPV. For the same reason we value FNR over FPR, but we still want a classifier that does well on both negative and positive predictions, and therefore we want scores as high as possible for MCC, Cohen's Kappa and G-Mean. According to these metrics, the Ensemble is not only the best performing model but also the only model reaching a value > 0.5 for both Cohen's Kappa and MCC. All measurements are defined in Section 2.9.1.

Table 17: Single-Threshold Measurements for all Classifiers

Metric (%)      ANN      RF       XGBoost  Ensemble
Accuracy        97.018   97.023   97.084   97.201
Cohen's Kappa   47.791   39.166   47.726   50.644
Recall          51.840   36.455   50.502   54.348
Specificity     98.319   98.767   98.425   98.435
G-Mean          71.392   60.005   70.503   73.142
FNR             48.161   63.545   49.498   45.652
FPR             1.681    1.233    1.575    1.565
Precision       47.041   45.992   48.013   50.000
NPV             98.609   98.181   98.572   98.682
MCC             47.851   39.444   47.742   50.691
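The single-threshold metrics in Table 17 can all be computed from one confusion matrix, as in this small sketch with toy labels (not the thesis data):

```python
# Computing the Table 17 metrics for one classifier (toy predictions assumed).
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix, matthews_corrcoef, recall_score

y_true = np.array([1, 1, 0, 0, 1, 0, 0, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 0, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall = recall_score(y_true, y_pred)   # TP / (TP + FN)
specificity = tn / (tn + fp)            # TN / (TN + FP)
g_mean = np.sqrt(recall * specificity)
precision = tp / (tp + fp)
npv = tn / (tn + fn)
fnr = fn / (fn + tp)
fpr = fp / (fp + tn)
mcc = matthews_corrcoef(y_true, y_pred)
kappa = cohen_kappa_score(y_true, y_pred)
```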


The ROC curves and Precision-Recall curves for all classifiers are illustrated in Figure 35. These also show the Ensemble as the best performing model, having the largest AUC and AP scores compared to the other models.

(a) ROC Curves for all Classifiers (b) PRC for all Classifiers

Figure 35: Plots Over all Possible Thresholds, ROC-Curve and PRC for all Classifiers
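The curves and their summary scores come from standard sklearn calls, sketched here on toy classifier scores:

```python
# ROC curve / PRC with AUC and AP scores (toy scores assumed).
import numpy as np
from sklearn.metrics import auc, average_precision_score, precision_recall_curve, roc_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.05, 0.3])

fpr, tpr, _ = roc_curve(y_true, scores)
roc_auc = auc(fpr, tpr)                       # area under the ROC curve
precision, recall, _ = precision_recall_curve(y_true, scores)
ap = average_precision_score(y_true, scores)  # AP summarizes the PRC
```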


4.3 Comparing all Models

When comparing all models, we examine how accurately they predict PI for the following segments: each Risk Grade, each year, each month, and the whole Testing data.
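The segment-level comparison rests on one idea: PI for a segment is the share of its customers that are (predicted) Defaults. A hedged pandas sketch with toy data:

```python
# Segment PI as the share of Defaults per Risk Grade (toy data assumed).
import pandas as pd

df = pd.DataFrame({
    "risk_grade": [5, 5, 9, 9, 9, 15, 15, 15],
    "y_true":     [0, 0, 0, 1, 0, 1, 1, 0],  # realised Defaults
    "y_pred":     [0, 0, 1, 1, 0, 1, 0, 0],  # classifier predictions
})
pi = df.groupby("risk_grade").agg(
    realised_PI=("y_true", "mean"),
    predicted_PI=("y_pred", "mean"),
) * 100  # in percent
```

The same groupby key swapped for year or month gives the yearly and monthly tables below.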

4.3.1 Risk Grade

Figure 36 illustrates the distribution of observations for each Risk Grade, including the amounts of Defaults and Non-Defaults. Worth noting is that there are considerably fewer observations within Risk Grade 5 and within Risk Grades larger than 15.

Figure 36: Distribution of Observations per Risk Grade

Table 18 shows the predicted PI for each Risk Grade from each model compared with the realised PI. The most accurate predictions are marked green, the second most accurate yellow, and the worst red. Multiple columns can share a color on the same row because they share the same value or are close to each other. These results hint that classifiers generally give better predicted PI than regression. They also hint that the Ensemble and XGBoost classifiers predict PI better than the other models, except for higher Risk Grades where the RF regressor at first glance shows potential. Although the RF regressor shows strong performance for these customers, this is likely a fluke, since the RF regressor generally predicts the same PI for customers with Risk Grade 20 as for Risk Grade 14, even though their realised PI differ significantly. It is also the model making the largest errors overall.

Table 18: Comparison of Predicted PI for each Risk Grade (%)

RG   PI (%)   XGBoost_reg  XGBoost_cla  RF_reg   RF_cla   ANN_reg  ANN_cla  Ensemble_reg  Ensemble_cla  LR_reg
5    0.000    13.661       0.000        13.648   0.000    0.319    0.000    0.803         0.000         0.132
6    0.000    3.498        0.000        10.564   0.145    2.079    0.290    0.831         0.000         0.202
7    0.000    1.257        0.000        7.677    0.000    0.946    0.087    0.294         0.000         0.257
8    2.601    5.324        2.244        13.170   3.213    4.116    2.499    3.079         2.448         0.404
9    0.703    2.966        0.642        8.711    0.978    1.855    0.794    1.070         0.642         0.550
10   0.229    1.954        0.510        6.788    0.408    0.881    0.382    0.601         0.331         0.804
11   0.984    5.310        1.192        11.994   0.686    3.843    1.610    2.291         1.490         1.170
12   2.968    8.710        3.163        14.370   3.710    7.857    3.670    4.852         2.929         1.660
13   3.478    7.886        2.231        12.872   3.018    4.311    1.837    3.242         1.969         2.213
14   4.947    15.438       9.002        16.137   5.596    13.639   8.435    10.867        9.976         3.300
15   16.824   17.373       11.844       18.301   10.094   17.855   14.805   14.565        12.786        4.185
16   15.918   18.653       13.061       13.916   4.082    14.165   10.612   13.014        12.653        5.805
17   10.204   11.676       5.782        12.954   2.381    6.683    1.701    5.196         3.061         9.553
18   17.178   32.585       26.994       16.463   1.841    41.162   37.423   34.299        34.969        12.557
19   34.756   51.457       51.829       28.197   17.683   43.140   42.073   45.118        50.610        22.735
20   17.568   20.873       16.216       16.236   6.757    24.079   20.270   20.988        20.270        22.487


The comparison of predicted PI-error for each Risk Grade is visualized in Figure 37. From Figure 37a we can see that the higher Risk Grades stand out: the models overall seem to find it more troublesome to predict the PI correctly for observations with greater Risk Grades. The models have been divided into two subplots based on their movements, see Figure 37b and Figure 37c. We see that the ANN, XGBoost and the Ensemble generally make the same errors as one another, while the RF and LR generally make the same errors.

(a) All models (b) Models That Moves Alike (c) Models That Moves Alike

Figure 37: Comparison of Predicted PI-Error for each Risk Grade, for each Model


A more detailed view of the comparison of predicted PI-error for each Risk Grade can be seen in Figure 38. Here the RF stands out for each Risk Grade, with the RF regressor making the largest errors for smaller Risk Grades and the RF classifier making the largest errors for larger Risk Grades. The XGBoost regressor and ANN regressor also stand out, as they generally make significantly larger errors than their classifier versions.

Figure 38: Comparison of Predicted PI-Error for each Risk Grade


4.3.2 Yearly Predictions

Figure 39 illustrates the distribution of observations for each year, including the amounts of Defaults and Non-Defaults. The distribution of observations is somewhat equal across all years, with fewer customers during 2008 and 2009, although the relative amount of Defaults is higher for 2011 and 2012 than for other years.

Figure 39: Distribution of Observations per Year

Table 19 shows the predicted PI for each year from each model compared with the realised PI. The most accurate predictions are marked green, the second most accurate yellow, and the worst red. Multiple columns can share a color on the same row because they share the same value or are close to each other. These results hint that classifiers generally give better predicted PI than regression, with the exception of the Ensemble, which performs about equally well as both regressor and classifier. The results also hint that the XGBoost classifier predicts PI better than the other models, although neither the ANN classifier nor the Ensemble is lagging far behind. The RF regressor makes the largest errors of all the models at every instance.

Table 19: Comparison of Predicted PI for each Year

Year   PI (%)  XGBoost_reg  XGBoost_cla  RF_reg   RF_cla   ANN_reg  ANN_cla  Ensemble_reg  Ensemble_cla  LR_reg
2008   3.807   4.884        1.484        9.585    0.000    4.011    0.710    1.968         0.968         1.742
2009   2.279   7.826        2.344        12.133   0.912    8.252    3.516    4.938         3.385         1.859
2010   3.941   7.724        3.846        10.840   1.187    7.176    3.656    4.696         3.561         2.949
2011   8.216   19.924       13.339       33.704   15.415   15.596   12.853   14.498        13.560        1.563
2012   4.721   8.670        5.039        14.255   3.858    6.900    5.538    5.925         5.356         2.406
2013   0.699   1.757        0.419        4.650    0.000    0.617    0.140    0.427         0.280         1.101
2014   0.723   2.627        0.316        6.132    0.000    1.258    0.226    0.517         0.045         2.276
2015   1.571   4.160        0.868        8.194    0.000    2.753    0.785    1.461         0.910         1.041
2016   1.632   5.213        1.154        9.802    0.040    3.521    1.552    2.172         1.273         1.146
2017   0.870   2.642        0.414        4.586    0.000    2.761    1.575    1.616         0.912         1.317


Figure 40 visualizes the PI-error for each model grouped by year. From Figure 40a we can see that the year 2011 stands out: the models overall seem to find it more troublesome to predict the PI correctly for observations from this year. All the models seem to move in a similar fashion, with the exception of the LR, which goes against the trend and instead underestimates the PI for 2011. Figure 40b shows the PI-error for the observations in 2011, where we can see that the observations with higher Risk Grades contribute more to the larger error.

(a) PI-Error for Each Year (b) PI-Error for Year 2011

Figure 40: The PI-Error grouped by Year

A more detailed view of the comparison of PI-error for each year is visualized in Figure 41. The RF regressor clearly stands out by making the largest errors, with the XGBoost regressor seemingly making the second most, followed by either the other regressors or the RF classifier.

Figure 41: Comparison of Predicted PI-Error for each Year


4.3.3 Monthly Predictions

Figure 42 illustrates the distribution of observations for each month, including the amounts of Defaults and Non-Defaults. The distribution of observations is equal across all months, although the relative amount of Defaults is lower for December. This indicates that more customers Default during this month, due to the way we have created YearDefault. The variable shows whether an observation is scheduled to Default within one year, meaning that a sudden decrease of Defaults equals a decrease of observations that are scheduled to Default within one year. Many of the observations that in November are scheduled to Default within one year are gone in December, meaning that they went into Default during December.

Figure 42: Distribution of Observations per Month

Table 20 shows the predicted PI for each month from each model compared with the realised PI. The most accurate predictions are marked green, the second most accurate yellow, and the worst red. Multiple columns can share a color on the same row because they share the same value or are close to each other. In addition to hinting that the XGBoost, ANN and Ensemble classifiers once again predict PI most accurately while the RF regressor performs worst, we also observe the realised PI going significantly below its mean during December. Because the realised PI shows the percentage of observations heading into Default within one year, a drop in PI from November to December means that many observations which in November were classified to go into Default within one year actually did go into Default during December, thus lowering the PI. Such a drop is also observed for September, although not as large as for December. This is also reflected in the remaining months, where a rising PI is observed up until about August, with slightly higher realised PI for June, indicating an accumulating percentage of customers that will go into Default.

Table 20: Comparison of Predicted PI for each Month

Month  PI (%)  XGBoost_reg  XGBoost_cla  RF_reg   RF_cla   ANN_reg  ANN_cla  Ensemble_reg  Ensemble_cla  LR_reg
Jan    2.821   6.888        3.282        13.194   2.821    5.257    2.936    3.967         3.339         1.738
Feb    2.971   7.241        3.371        13.652   3.486    5.459    3.257    4.263         3.429         1.762
Mar    2.911   6.691        3.139        13.275   2.740    5.719    3.368    4.160         3.425         1.747
Apr    2.935   6.903        2.822        12.593   2.427    5.376    3.386    4.008         2.935         1.725
May    2.715   6.419        2.216        12.405   2.715    5.283    3.158    3.683         2.826         1.704
Jun    2.944   6.781        2.944        12.282   2.500    5.447    3.444    4.060         3.167         1.734
Jul    3.074   6.916        3.458        11.536   2.141    5.464    3.568    4.178         3.732         1.708
Aug    3.030   6.844        3.648        11.388   2.020    5.403    3.423    4.036         3.423         1.703
Sep    2.592   6.333        2.758        11.077   1.710    4.801    2.482    3.419         2.648         1.702
Oct    2.838   5.729        2.616        8.454    1.336    4.786    2.838    3.363         2.504         1.683
Nov    2.756   6.012        2.925        8.667    1.462    4.824    2.643    3.559         2.868         1.701
Dec    1.994   5.436        2.165        8.103    1.311    4.422    2.507    3.010         2.222         1.643


As 2011 stands out as the year when the models made the largest errors, Table 21 shows the predicted PI for each month of 2011 from each model compared with the realised PI. The most accurate predictions are marked green, the second most accurate yellow, and the worst red. Multiple columns can share a color on the same row because they share the same value or are close to each other. Comparing with the mean realised PI for each month in Table 20, we observe in Table 21 that the realised PI is significantly higher than the mean for every month. This indicates many customers going into Default during both 2011 and 2012, where during August 2011 about 10% of all customers are scheduled to Default within one year. This deviation from the mean is also reflected in the predictions made by the different models, where the XGBoost, ANN and Ensemble classifiers all generally predict the risk as too high, while the LR instead predicts the risk as too low, which is not preferred.

Table 21: Comparison of Predicted PI for each Month during Year 2011

Month  PI (%)  XGBoost_reg  XGBoost_cla  RF_reg   RF_cla   ANN_reg  ANN_cla  Ensemble_reg  Ensemble_cla  LR_reg
Jan    8.205   16.325       9.744        33.831   12.308   13.121   7.180    10.999        9.231         1.757
Feb    8.122   18.665       10.152       36.037   17.767   13.374   7.614    12.313        10.152        1.757
Mar    8.630   16.153       9.137        34.185   12.183   15.005   10.660   12.674        11.675        1.853
Apr    7.772   19.775       11.399       38.938   18.653   16.857   15.026   14.512        13.472        1.667
May    7.853   20.397       10.995       39.531   23.560   16.636   14.136   15.040        14.136        1.608
Jun    9.948   20.701       14.660       38.472   20.419   15.890   15.183   15.123        14.136        1.552
Jul    9.948   25.568       19.372       35.526   18.848   19.988   18.848   19.687        19.895        1.426
Aug    10.053  26.235       22.751       35.655   16.931   19.606   17.460   19.409        19.577        1.451
Sep    7.143   21.260       13.736       32.944   14.286   16.016   12.637   15.262        13.736        1.340
Oct    7.650   17.750       12.568       25.724   9.290    14.018   12.022   13.472        12.568        1.435
Nov    7.650   18.952       13.661       27.000   10.383   14.305   12.568   13.918        13.115        1.560
Dec    5.233   17.280       12.209       24.961   9.302    12.057   11.047   11.482        11.047        1.288

Figure 43 visualizes the PI-error for each model grouped by month. The RF regressor clearly stands out in both subfigures, heavily overestimating the PI. In Figure 43a we can see that both the RF classifier and the LR underestimate the PI. The most accurate models are the ANN, XGBoost and Ensemble classifiers. The same is seen in Figure 43b, but there the LR performs quite well in absolute terms. Another difference is that the RF classifier does not underestimate the PI during the months of 2011. Even though the LR performs relatively well during 2011, it does underestimate the PI, which is not preferred.

(a) PI-Error, All Years Included (b) PI-Error for Year 2011

Figure 43: The PI-Error grouped by Month


4.3.4 Testing Data and Summary

Table 22 shows each model's predicted PI for the whole Testing data. The XGBoost classifier is the most accurate, though both the ANN and Ensemble classifiers make predictions of about equal accuracy. The RF regressor is the least accurate.

Table 22: Comparison of Predicted PI for the Whole Segment

True PI (%): 2.799

Model          Predicted PI (%)
XGBoost_reg         6.515
XGBoost_cla         2.944
RF_reg             11.379
RF_cla              2.219
ANN_reg             5.186
ANN_cla             3.085
Ensemble_reg        3.808
Ensemble_cla        3.043
LR_reg              1.712

The comparison of predicted PI-error for the whole segment is visualized in Figure 44. The XGBoost, ANN, and Ensemble classifiers make the smallest errors, while the RF regressor makes the largest. The Ensemble regressor makes slightly smaller errors than LR, but LR underestimates the PI. XGBoost, ANN, and Ensemble all overestimate the PI, with the classifier versions overestimating it only slightly. This is desirable, since the models are tuned to rather overestimate than underestimate.

(a) PI-Error (b) Absolute Values of the PI-Errors

Figure 44: Comparison of Predicted PI-Error for the Whole Segment

To summarize, of the models we have examined, those best suited for accurately predicting PI are the XGBoost classifier, the ANN classifier, and the Ensemble classifier. Of these three, the Ensemble classifier is the best choice due to its higher single-threshold measurements, AUC, and AP scores.


5 Discussion

5.1 Conclusion

With this report we can conclude the following:

▶ Machine learning techniques such as XGBoost and ANN can accurately estimate PI for a segment when treating it as a classification problem.

▷ Actions leading to Default can be captured through data mining, even with limited access to features containing personalized data.

▷ To significantly improve the models further, more personalized data are required.

▶ Machine learning models based on XGBoost and ANN should replace the LR model currently used by Nordea.

▷ Machine learning techniques such as XGBoost and ANN significantly outperform the LR model currently used by Nordea.

▷ More complex models than LR achieve better performance, for reasons such as PI not following a strictly linear trend against features such as Risk Grade, and the existence of complex relationships between features.

▷ Even better predictions are achieved using an Ensemble of XGBoost and ANN.

▶ There are numerous features showing higher importance than Risk Grade when estimating PI.

▷ Features such as BP78, DA, B419 and B416 all show higher importance than the most important Risk Grade when predicting customer Defaults.

▷ Macro variables are important when predicting customer Defaults, given that their dimensionality is reduced using PCA.

▷ Under Nordea’s current procedure of judging a customer’s Risk Grade and using it to estimate PI, a large number of customers are assigned the wrong Risk Grade, making the PI estimation flawed.

5.2 Classification Problem

By treating the estimation of PI for a segment as a classification problem of finding which customers will go into Default and which will not, machine learning techniques such as XGBoost and ANN can be used within credit risk to accurately estimate PI; the best performing model we have examined is an Ensemble of ANN and XGBoost. The classification models, or classifiers, work by predicting whether each customer will go into Default within the upcoming year, and based on these predictions then estimate the ratio of customers Defaulting within different segments. The alternative is to treat the problem as a regression problem, based on the assumption that every customer has an inherent probability of going into Default, with some customers having a higher probability than others. Because the classifiers show stronger performance than their regression counterparts, including the LR model used by Nordea, this indicates that viewing customers as having an inherent probability to Default does not reflect reality. This may be because customers do not roll dice or go through a similar random process to determine whether they will go into Default; rather, Default is decided through the actions customers take and through environmental effects, which can in theory be captured through data mining.
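The classification approach described above can be sketched as follows. This is a minimal illustration on synthetic data, with scikit-learn's LogisticRegression standing in for the XGBoost/ANN classifiers; all names and numbers are ours, not Nordea's.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Toy customer features; label 1 means the customer Defaults within a year.
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=1000) > 1.2).astype(int)

# Classification approach: predict each customer's Default individually,
# then take the segment's PI as the share of customers predicted to Default.
clf = LogisticRegression().fit(X, y)
pi_hat = clf.predict(X).mean()
pi_true = y.mean()
```

The point of the sketch is only the last step: the segment-level PI falls out of per-customer classifications, rather than being modelled directly as a rate.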

5.3 Complexity

One drawback of these types of models is their complexity: it is not as easy to see the inner workings of the model as with LR. Yet even though the inner workings of LR are very visible, Figure 33 and Table 22 indicate that being easy to illustrate does not equal higher performance. In the case of XGBoost, the increased illustratory complexity comes both from classification through a majority vote over a large number of classification trees and from the high dimensionality of the data used by the model. It is possible to illustrate each tree and show how different observations are classified, but this would not be easy to interpret either. We had hoped to show at least a few trees from our XGBoost model within this report, but this proved impossible due to technical issues where we were blocked from downloading the necessary modules.
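Inspecting individual trees of a boosted ensemble can be sketched like this, using scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost (whose own plotting utilities we could not use) and synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import export_text

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
model = GradientBoostingClassifier(n_estimators=50, max_depth=3,
                                   random_state=0).fit(X, y)

# estimators_ holds one regression tree per boosting round (for binary
# classification, a single column); print the first tree as text.
first_tree = model.estimators_[0, 0]
print(export_text(first_tree, feature_names=[f"f{i}" for i in range(8)]))
```

Even this small example produces dozens of split rules per tree, times fifty trees, which illustrates why tree-by-tree inspection does not scale as an interpretability tool.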


5.4 Important Features

Even though Risk Grade is considered an important feature when predicting Default using XGBoost, as seen in Figure 21, with Risk Grades 10, 15 and 9 being amongst the 20 most important features according to total gain, there are features the model considers more important. Features such as BP78, DA, B419 and B416 all show higher importance than the most important Risk Grade. Additionally, many of the PC from the Macro data show very high importance when making predictions, especially compared with the feature Norway_GDP_yearly, which according to Nordea is the most important Macro variable for Shipping customers, since the large majority of those customers are in Norway. We therefore conclude that Macro variables are important when predicting customer Defaults, given that their dimensionality is reduced using PCA.
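The dimensionality reduction of the Macro data referred to above can be sketched as follows; the data here is random and the shapes are illustrative, not Nordea's actual macro series.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
macro = rng.normal(size=(120, 30))           # 120 months x 30 macro series
macro_std = StandardScaler().fit_transform(macro)

# Keep the smallest number of principal components explaining 95% of the
# variance; the resulting Macro PC are then joined to the customer features.
pca = PCA(n_components=0.95)
macro_pc = pca.fit_transform(macro_std)
```

Standardizing before PCA matters here, since macro series on different scales would otherwise dominate the components.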

5.5 Non-Linear Risk Grade

In addition to being able to handle high-dimensional data containing a large number of features, these types of machine learning models have another advantage over LR: the ability to make non-linear predictions. The LR model is built to linearly increase the PI of observations as their Risk Grade grows, i.e., on the assumption that observations with a higher Risk Grade also have a higher PI, but as shown in Table 18 this is not true. PI does not follow a strictly linear trend against the Risk Grade. Observations with Risk Grade 8, for example, have over ten times higher realised PI than Risk Grade 10, even though customers assigned Risk Grade 10 are considered to carry less risk. If PI should indeed steadily increase with Risk Grade, we can conclude that under Nordea’s current procedure of judging a customer’s Risk Grade, a large number of customers are assigned the wrong Risk Grade. Non-linearity is not a limitation for the machine learning models we have examined, since they are not forced to predict according to these assumptions; instead we handle the Risk Grades as categorical variables.
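Handling Risk Grade as a categorical variable amounts to one-hot encoding it, so the model learns a free PI level per grade instead of assuming PI rises linearly with the grade number. A minimal sketch with illustrative grades:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Illustrative Risk Grades for five observations.
grades = np.array([[8], [9], [10], [15], [8]])

# One column per distinct grade; unseen grades at prediction time are
# encoded as all-zero rows rather than raising an error.
enc = OneHotEncoder(handle_unknown="ignore")
onehot = enc.fit_transform(grades).toarray()
```

After encoding there is no ordering between the grade columns, so nothing forces the fitted PI of grade 10 to exceed that of grade 8.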

5.6 Development Opportunities

5.6.1 Hardware

Despite lacking access to better hardware for more advanced model tuning, we show with this report that it is still possible to build models which not only outperform the LR model currently used by Nordea, but also perform well and accurately predict PI. We had planned to use Randomized Grid Search to find candidate hyperparameter configurations for the XGBoost, followed by a grid search between the candidates, but without better hardware we could not create grids small enough for the computers to handle while still significantly improving the model.
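The two-stage tuning we had planned can be sketched as follows: a randomized search samples candidate configurations, around whose best result a finer grid search could then be run. GradientBoostingClassifier stands in for XGBoost, and the parameter ranges are illustrative.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Stage 1: sample random configurations from broad distributions.
param_dist = {
    "n_estimators": randint(50, 300),
    "max_depth": randint(2, 6),
    "learning_rate": uniform(0.01, 0.3),
}
search = RandomizedSearchCV(GradientBoostingClassifier(random_state=0),
                            param_dist, n_iter=5, cv=3, random_state=0)
search.fit(X, y)
# Stage 2 (not shown): a GridSearchCV over a small grid centred on
# search.best_params_.
```

The advantage over a full grid search is that the number of fits is fixed by `n_iter` regardless of how many hyperparameters are searched, which is precisely what constrained hardware requires.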

5.6.2 Personalized Data

Even though we believe improvements could be made by tuning the hyperparameters, we are convinced that more personalized data are needed for a substantial improvement. The most important features are currently the Macro PC, which reflect the effects the economic environment has on the customer and thus are not very personalized or unique for each observation. We tried numerous feature engineering techniques which are not presented in the report because they gave no improvement or slightly worse results, leading us to conclude that we need better features. For instance, we used the feature importance obtained from the XGBoost to conduct feature selection, hoping to reduce the noise, but no improvements were made. We also tried a feature extraction method called FAMD, which can be seen as a combination of PCA and MCA, where MCA handles the qualitative variables that PCA cannot; again no improvements could be seen. Further feature engineering that could be considered, but was not executed in our work, is removing or grouping together some classes from sco, brs8_sco and brs_8rat, which could potentially reduce the noise.


5.6.3 Introduce Leakage and Historical Data

When splitting the data into Training, Validation and Testing, we split entirely on customers, so that customers appearing in the Training set do not appear in either the Validation or the Testing set. Even though this creates the least possible bias, it does not reflect reality perfectly: although entirely new customers arrive next year, there will still be many returning customers, and thus a certain level of information leakage could be justified. Alternatively, Nordea could use another split point and actively create the Testing set from the last occurring observations within the data set, since the results from such a test are not likely to be published publicly.
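The customer-level split described above can be sketched with scikit-learn's GroupShuffleSplit, using customer IDs as groups; the IDs and data here are synthetic.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(2)
customer_id = rng.integers(0, 100, size=1000)   # 100 customers, repeated rows
X = rng.normal(size=(1000, 4))

# Split on customers, not rows: all rows of a customer land on one side.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(gss.split(X, groups=customer_id))

# No customer appears on both sides of the split.
overlap = set(customer_id[train_idx]) & set(customer_id[test_idx])
```

Relaxing this, as suggested above, would mean splitting on rows (or on time) instead, accepting that returning customers leak information from Training into Testing.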

There is also an opportunity to use more historical data for many customers. For example, the customer appearing most often in the data occurs on 120 rows, corresponding to ten years. Such a customer should in theory have a large amount of historical data which can be used when making predictions about the coming year. More generally, using more historical data, for example by introducing lagged Macro features within the Macro PC, could prove useful given the high importance the Macro data showed in its current state.
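Introducing lagged Macro features can be sketched as follows; the column names and lag lengths are illustrative.

```python
import pandas as pd

# Two illustrative macro principal components over 12 months.
macro_pc = pd.DataFrame({"PC1": range(12), "PC2": range(0, 24, 2)})

# Add lagged copies of each PC; the first `lag` rows become NaN and would
# need imputation or trimming before model training.
for lag in (1, 3):
    for col in ("PC1", "PC2"):
        macro_pc[f"{col}_lag{lag}"] = macro_pc[col].shift(lag)
```

Each lagged column lets the model see the macro environment as it was one or three months earlier, rather than only its current state.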

5.6.4 Real-Life Implementations

Even though it is possible to implement models based on ANN and XGBoost to predict PI, this would not happen overnight. We have predicted PI to illustrate the usefulness of these kinds of models, but the ability to identify customers who are about to Default before it happens also has numerous internal applications within the banking sector beyond predicting PI. One example is the ability to take necessary actions towards these customers ahead of time to cover otherwise unexpected risks, or simply to judge their risk more empirically. Furthermore, the features these models show to be of importance can relatively easily be added to the models already in use, either to further improve them or to find potential flaws within the current estimations.

