advanced backtesting probability of default predictions

1

MSc STOCHASTICS AND FINANCIAL MATHEMATICS MASTER THESIS

Advanced Backtesting Probability of

Default Predictions

Author: Supervisor:

Congyi Dong dr. Arnoud den Boer Examination Date: Daily Supervisor: 27 November dr. Sjoerd C. de Vries Second Reader: dr. ir. Eric Winands

2

Abstract Measuring the performance of Probability of Default (PD) models is always a major task for banks. The predictions of PD models are regularly tested against actual observations. This activity is called backtesting. In the real world, banks won’t directly use the PDs to describe the credit quality of clients but map these clients to a certain bucket of an internal rating system according to their PDs. However, these PD ratings are not produced on a very evenly spaced schedule, and the backtesting reliability suffers from an incorrect assumption that the most recent credit rating predictions, which may be generated 12 months ago or longer ago, are still valid at the backtest starting date. This problem could be solved by establishing a migration matrix of credit grades to estimate the rating at the start of a backtest period.

This thesis investigates whether or not we can use Hidden Markov Model (HMM) to obtain a good estimate of this migration matrix. In our research, the ‘true’ credit grades are considered as the Hidden Markov states, while the credit grades predicted by banks are set to be observation states. This leads to large size of both observation and hidden state space. To reduce to data size required to fit this high-dimensional HMM, we propose a technique to estimate the migration matrix block by block. Then, we estimate the migration matrix in two scenarios: monthly or irregularly rating the clients. In the former ideal case, the bank won’t lose a lot of information about credit quality transitions of clients, so the estimated migration matrix is in line with the given ‘true’ matrix; in the latter more realistic case, by introducing a new observation state ‘non-rated', the credit quality migration sequences of clients are also put on a monthly grid. Due to a lack of information, in the latter case, we can only estimate the transition probabilities of clients in low credit rating grade blocks. Thus, we conclude that when banks rerate clients irregularly, HMM can be applied to some specific portfolios whose clients are with low PDs, such as a mortgage portfolio.

Title: Advanced Backtesting Probability of Default Predictions Keywords: Credit Risk Management, Credit Model Validation, Backtesting,

Probability of Default, Hidden Markov Model Author: Congyi Dong, 12101788 Email: [email protected] Supervisor: dr. Arnoud den Boer Daily supervisor: dr. Sjoerd C. de Vries Second reader: dr.ir. Erik Winands Examination date: 27 November Korteweg-de Vries Institute for Mathematics University of Amsterdam Science Park 105-107, 1098 XG Amsterdam http://kdvi.uva.nl

3

Preface After eight months, I am finishing this thesis and then graduating from University of Amsterdam. During these two years, I grew rapidly and found my career goal. My master's project starts on 8th March. A few weeks later, the Netherlands was put on lockdown because of the COVID-19 outbreak. It was a tough time for us interns since we could only work from home. I am very grateful to Rabobank not only for offering me the chance to do a thesis internship but also for arranging some online sessions to make me have access to the bank industry even if I was not allowed to go to the office.

I would like to thank my supervisors, Sjoerd de Vries and Arnoud den Boer, for guiding me and supporting me in my master's project. I learned from them about how to solve problems and how to write a well-organized thesis. I would also like to thank my other colleagues in the Credit Model Validation team. Last, many thanks to my roommate Ziyu Zhou for her company during this quarantine and her kind suggestions.

Hope you enjoy your reading.

Congyi Dong,

Amsterdam, 27 November 2020

4

Table of Content

1 Introduction .................................................................................................................6

2 Regulatory Background ................................................................................................9

2.1 A Brief History of Basel Accords ..................................................................................... 9

2.2 Some Definitions and Regulations from CRR ............................................................... 10

3 Literature Review ....................................................................................................... 12

3.1 Literature Related to Model Validation ....................................................................... 12

3.2 Literature Related to HMM .......................................................................................... 13

4 Model Validation Methodology .................................................................................. 15

4.1 Calibration Quality ....................................................................................................... 15

4.1.1 Binomial Test ......................................................................................................... 15

4.1.2 Poisson Binomial Test ........................................................................................... 17

4.1.3 Traffic Light Approach ........................................................................................... 18

4.2 Discriminatory Power ................................................................................................... 18

4.2.1 Cumulative Accuracy Profile (CAP) ....................................................................... 19

4.2.2 Accuracy Ratio (AR) ............................................................................................... 22

4.2.3 Receiver Operating Characteristic (ROC) .............................................................. 23

4.3 Chapter Summary ........................................................................................................ 23

5 Dataset Simulation and PD model Validation .............................................................. 24

5.1 Dataset Simulation ....................................................................................................... 24

5.1.1 Factor Values and Credit Rating System Setup ..................................................... 25

5.1.2 ‘True’ Credit Rating Model .................................................................................... 26

5.1.3 Drift Functions ....................................................................................................... 28

5.1.4 Simulation of Credit Rating Migration .................................................................. 30

5.2 Validation Methods Implementation ........................................................................... 34

5.2.1 Logistic Regression Model with Full Information ................................................. 34

5.2.2 Logistic Regression Model with Partial Information ............................................. 38

5.3 Chapter Summary ........................................................................................................ 40

6 Hidden Markov Model Methodology (HMM) .............................................................. 41

6.1 Setup of Hidden Markov Model ................................................................................... 41

6.1.1 One unit delay Hidden Markov Model.................................................................. 42

6.1.2 Zero delay Hidden Markov Model ........................................................................ 44

6.2 Handling HMM ............................................................................................................. 45

5

6.2.1 Probability of obtaining a certain observation sequence ..................................... 46

6.2.2 Estimation of HMM Parameters ........................................................................... 49

6.2.3 Decoding the observation sequence .................................................................... 52

6.3 A Simple Example of HMM Application ....................................................................... 53

6.3.1 When the Bank Rerates Clients on Evenly Spaced Schedule ................................ 53

6.3.2 When the Bank Rerates Clients on Unevenly Spaced Schedule ........................... 54

6.4 Estimation of 15-dimensional HMM Transition Matrix ............................................... 56

6.5 Chapter Summary ........................................................................................................ 57

7 The Implementation of the Hidden Markov Model ..................................................... 59

7.1 Data Pre-processing ..................................................................................................... 59

7.1.1 Data Pre-processing for Full Information.............................................................. 60

7.1.2 Data Pre-processing for Partial Information HMM ............................................... 60

7.2 When Clients Are Rated Monthly ................................................................................ 61

7.3 When Clients Are Not Rated Monthly ......................................................................... 66

7.4 Attempts to Improve Obtained Results ....................................................................... 69

7.4.1 Modifying the Original Settings of Simulated Artificial Bank................................ 69

7.4.2 Modifying the Data Pre-processing technique ..................................................... 72

7.5 Chapter Summary ........................................................................................................ 73

8 Conclusions and Discussions ....................................................................................... 75

8.1 Conclusions .................................................................................................................. 75

8.2 Discussions ................................................................................................................... 76

9 Further Research ........................................................................................................ 79

Popular Summary ........................................................................................................... 82

Reference ....................................................................................................................... 84

Appendix I. The bucket plotting based on a declining number of factors .......................... 86

Appendix II. The estimation of blocks containing two credit grades ................................. 88

6

1 Introduction

In the real world, banks are required to calculate Probability of Default (PD) as a part of risk management. PD is defined as ‘the probability of default of a counterparty over a one-year period’ [1]. These are several reasons why banks estimate PD: 1) Regulatory Capital (RC) calculation. Banks are required to hold capital for unexpected losses. PDs are used as part of this calculation. They are used to calculate Risk Weights [1]; 2) RAROC. Calculating the Risk-Adjusted Return ON Capital. This is used to check whether a loan would make sufficient returns to compensate for the risk that a bank runs on the loans and the cost of capital. This may be part of the next point: 3) Client acceptance. The bank can choose to reject clients that create a risk that is deemed to be too high and that doesn’t have a RAROC that is higher than the minimum hurdle rate; 4) Provisioning. PDs are used to calculated expected losses. Currently, the IFRS9 standard is used in most banks to do provisioning. 5) Pricing. With a good PD model, it is easier to accept bad clients if banks can calculate the correct pricing for such clients. Bad clients may have to pay more for their loan, so that even if some of the default, on average banks still get a profit. This is related to the RAROC above. 6) Client monitoring. A rapid changing PD can be a signal to put a client on the close watch by account management or the Special Assets Management (SAM) department.

Instead of directly applying the exact PD of each client, banks usually assign the clients to one of the buckets in the internal rating system which is defined on a set of PD intervals. In this internal rating system, the clients in the same credit grade are assumed to have the same bucket PD. The PD prediction is regularly tested against the actual Observed Default Rate (ODR) to check the predictive ability of a target PD model. This activity is called backtesting. The ideal backtesting procedure, where all the clients are rated at the starting date, will yield the most reliable assessment results, as shown in Figure 1. This is because the time horizon of PD (can be found in Section 2.2) is one year and the backtest period is also a year.

Figure 1. The ideal backtest when all clients are rated at the starting date

7

In Figure 1, ‘R1’ to ‘R10’ represents buckets of the internal credit rating system; ‘P’ and ‘D’ represents performing state and defaulting state, respectively. By comparing observed default frequency between the vertical lines in Figure 1 with the PD predicted 1 year ago, we could properly assess the performance of tested PD prediction models.

However, the real-world backtest is not optimal, due to the incorrect assumption that the most recent ratings are still valid at the start of a backtest period. Banks won’t rerate all clients at the starting date but directly compare the ODR with the most recent credit rating grades which could be very old, up to 12 months if regulatory requirements are fully complied with, but in practice sometimes even older, as shown in Figure 2. During the interval of time between the starting date of a backtest period and the most recent rating date, the credit quality of the clients will invisibly change to be better or worse, reducing the backtest reliability.

Figure 2. The real-world backtesting procedure when banks rerate all clients irregularly and assume that the latest credit grades are still valid, which might not be the case

Building a good migration matrix of client ratings would help to solve the problem mentioned above. Based on the knowledge of transition probabilities, banks would be able to predict the possible credit quality migration during this time interval so that the backtest reliability (but also the predictions themselves) will be improved.

However, the migration matrix estimation that banks are using now is slightly wrong. Banks build the migration matrices using the assumption that they are one-year transitions. This can be illustrated in Figure 3. A, B, C, and D represent the time point of the vertical lines, with a one-year time interval in between. The one-year transition matrices are computed based on the latest credit rating grades around the time point of the vertical lines. If during a year a client is not rated, for example, the fourth client from A to B in Figure 3, then his or her credit grade is seen to remain at 𝑅2. However, this is not really reasonable, since we don’t know whether the migration from 𝑅2 to 𝑅5 of the fourth client happens during the AB period or BC period.

8

Figure 3. How banks compute the credit rating grades migration matrix in the real world

The assumption of the backtest implies an identity migration matrix, as the most recent ratings are considered as the current ratings. Although a slightly inaccurate migration matrix, as shown in Figure 3, will at least provide some useful information about possible transitions, a better estimate of the migration matrix would be more helpful.

Thus, this research aims to check whether a Hidden Markov Model can help to obtain a good estimate of the credit grades migration matrix, enabling the prediction of credit quality migration at the starting date of a backtest period.

In Chapter 2, the related regulations and definitions regarding model validation will be discussed, and Chapter 3 covers the literature relating to both model validation and previous researches of predicting the credit rating grades migration. In Chapter 4, the model validation methodologies in use will be explained in detail. Then, after simulating an artificial bank in Chapter 5, these validation methods will be put into practice to see whether they can distinguish the good model from the bad ones. In Chapter 6 the zero delay Hidden Markov Model will be given theoretically, and then in Chapter 7, we will check whether a Hidden Markov Model can be used to estimate the migration matrix of the credit quality of clients. Chapter 8 will make the conclusions resulting from this research and discusses their implications, for this research, and Chapter 9 provides four possible directions for further research.

R5 R4

R5 R5

R9 R9

R2 R5

R3

R8

R3

R4

1 YEAR

Original ratings.

Irregularly rerating schedule

A B C D

R4

1 YEAR 1 YEAR

9

2 Regulatory Background

In this chapter the history of Basel Accords and its EU translation Capital Requirements Regulation (CRR). Some definitions and regulations related to our research will be introduced. This chapter is based on Basel Committee documents BCBS (2004) [3], BCBS (2005a) [4], BCBS (2005b) [5], BCBS (2010) [6], the BIS (Bank for international settlements) official website (www.bis.org), and BIC (2014) [7].

2.1 A Brief History of Basel Accords

As recorded in the history document published by BIS, after the breakdown of the Bretton Woods system of managed exchange rates [7] in 1973, many banks suffered from large foreign currency losses. On 26 June 1974, because the foreign exchange exposures of Bankhaus Herstatt were three times its capital, the Federal Banking Supervisory Office of West Germany withdrew its banking license. Banks outside Germany took heavy losses on their unsettled trades with Herstatt, adding an international dimension to the turmoil. In October the same year, the Franklin National Bank of New York also closed its doors after incurring large foreign exchange losses [7]. Following bank failures in both Germany and the United States in 1974, the central bank governors of the G10 countries set up a committee on Banking Regulations and Supervision. This committee was renamed the Basel Committee on Banking Supervision. It provides a forum for regular cooperation on banking supervisory matters, and its objective is to enhance understanding of key supervisory issues and improve the quality of banking supervision worldwide [7].

As mentioned in the official history document [7], in the 1980s, the rate of bank failures in the United States was increasing at an appalling rate. As a result, the external debt of a lot of countries had been growing at an unsustainable rate and the probability of major international banks going bankrupt was alarmingly high. Backed by the G10 Governors, the Basel Committee on Banking Supervision met in 1987 in Basel, Switzerland to discuss the possible solutions of preventing things spinning out of control [7]. This meeting reached the agreement to use a weighted approach to measure the risk banks run on their exposure. According to an advisory paper published in December 1987, a capital measurement system, commonly known as the Basel Capital Accord (Basel I), was approved by the G10 Governors and issued to the Banks in July 1988 [7]. Basel I called for a minimum capital ratio of capital to risk-weighted assets of 8% to be implemented by the end of 1992. This was the beginning of the Basel Accords.

In June 1999, the committee issued a proposal for a new capital adequacy framework to replace the 1988 Accord. This led to the release of the Revised Capital Framework in June 2004, which is generally known as ‘Basel II’ [7]. In Basel II, the BCBS recommends to ‘take rating and scoring as the basis for determining risk-sensitive regulatory capital requirements for credit risks’ [3]. Compared to Basel I, where capital requirements are uniformly at 8% in particular for corporate borrowers irrespective of their creditworthiness, Tasche states that this is a major progress [12]. Basel II also gives two approaches for capital calculation: Standardized Approach (SA) and Internal Rating-based Approach (IRB). Credit institutions that apply Basel II Standardized Approach (SA) can base the calculation of capital requirements on agency ratings [3]. In the Standardized Approach (SA), Basel II also gives fixed PD percentages for certain business types (retail, residential real estate, commercial real estate, overdue

10

loans). Credit institutions that are allowed to apply the internal rating-based (IRB) approach will have to derive PDs from ratings they have determined themselves [12]. Note that in the IRB approach, capital requirements depend not only on PD estimates but also on estimates of loss given default (LGD) and exposure at default (EAD) parameters [12]. Rabobank is now using the IRB approach to calculate the capital for most, but not all, of its portfolios.

As stated in the official history document [7], the need for a fundamental strengthening of the Basel II framework became apparent even before Lehman Brothers collapsed in September 2008. The banking sector had entered the financial crisis with too much leverage and inadequate liquidity buffers [7]. In July 2009, the Committee issued a further package of documents to strengthen the Basel II capital framework. These documents strengthen the regulation and supervision of internally active banks. In September 2010, the Group of Governors and Heads of Supervision announced higher global minimum capital standards for a commercial bank. This followed an agreement reached in July regarding the overall design of the capital and liquidity reform package, now referred to as ‘Basel III’.

However, this Basel III regulation could not be directly applied to the EU. From the official documents [8], it follows this is because Basel III itself is not a law for the worldwide banks, but a set of internationally accepted standards set by regulators and Central Banks. Thus, Basel III regulations had to be translated into an EU adapted version, which could be put under democratic control. The High-Level Group on Financial Supervision in the EU chaired by Jacques de Larosière invited the Union to develop a more harmonized set of financial regulations. In the context of the future European supervisory architecture, the European Council of 18 and 19 June 2009 also stressed the need to establish a ‘European Single Rule Book’ applicable to all credit institutions and investment firms in the internal market [1]. The Capital Requirements Regulation (CRR) was designed for this purpose, and it is now the EU law that aims to decrease the likelihood that banks go insolvent [8].

2.2 Some Definitions and Regulations from CRR

This section is based on the Capital Requirements Regulation (CRR) document [1]. As stated in Chapter 1, this research aims to check whether Hidden Markov Model (HMM) can help to estimate a better credit rating grade migration matrix, so that banks would be able to predict the credit grades at the start of a backtest period. The definition of obligor grade, the definition of Probability of Default (PD) and Observed Default Rate (ODR) are introduced as follows.

In CRR art.3 (54), the Probability of Default (PD) is defined as ‘the probability of default of a counterparty over a one-year period’;

In CRR art.3 (78), 'one-year default rate' means the ratio between the number of defaults occurred during a period that starts from one year prior to a date T and the number of obligors assigned to this grade or pool one year prior to that date;

In CRR art.143 (6), 'obligor grade' means a risk category within the obligor rating scale of a rating system, to which obligors are assigned on the basis of a specified and distinct set of rating criteria, from which estimates of probability of default (PD) are derived;

11

The following five regulations are related to the requirement of an internal rating system and PD model validation. The third regulation states that Observed Default Rate (ODR) is seen as an estimate of PD, while the fourth regulation points out the way of estimating the PDs of obligors in a given grade. These give us a clue to determine the bucket PDs of the simulated bucketing system in Chapter 5. The last regulation state that the model validation must be conducted on both model level and grade level.

According to CRR art.170 (3c), the process of assigning exposures to grades or pools shall provide for a meaningful differentiation of risk, for a grouping of sufficiently homogenous exposures and shall allow for accurate and consistent estimation of loss characteristics at grade or pool level;

According to CRR art.170 (2), an institution shall take all relevant information into account in assigning obligors and facilities to grades or pools. Information shall be current and shall enable the institution to forecast the future performance of the exposure;

According to CRR art.180 (1a), institutions shall estimate PDs by obligor grade from long run averages of one-year default rates;

According to CRR art.180 (1g), to the extent that an institution uses statistical default prediction models it is allowed to estimate PDs as the simple average of default-probability estimates for individual obligors in a given grade;

According to CRR art.185 (b), institutions shall regularly compare realized default rates with estimated PDs for each grade;

As mentioned above, from a model validation perspective, when estimating the credit grade migration matrix, we are not allowed to reduce the dimension of the internal rating system by directly folding the credit grades, as Malgorzata Wiktoris did in her research [27]. This is because the model validation is required to be conducted on both model level and bucket level. If we fold the grades, we are not able to backtest PDs on bucket level.

12

3 Literature Review

The literature about how to apply a Hidden Markov Model on credit quality is sparse. This section will discuss three articles related to model validation [9][12][13], one article related to HMM applied on credit quality [27], and two books about the theory of HMM [28][29].

3.1 Literature Related to Model Validation

Gerd Castermans and David Martens [9] give a structured introduction of commonly used quantitative validation methods and mainly focus on backtesting and benchmarking that are key quantitative tools. They state that generally there are three parts of model validation: Calibration, discrimination, and stability. Calibration refers to the mapping of a rating to a quantitative risk measure. A rating system is considered well-calibrated if the estimated risk measures deviate only marginally from what has been observed ex-post. Discrimination measures how well the rating system provides an ordinal ranking of the risk measure considered. Stability measures to what extent the population that was used to construct the rating system is similar to the population on which it is currently used.

The authors analyze both advantages and disadvantages of methods that are discussed in their article. In terms of the calibration, the well-known binomial test is mentioned. Its estimations are TTC (through-the-circle) but the outcomes are PIT (point-in-time). The TTC estimation is supposed to be a long-term average ODFs, that is, unconditional of the business cycle we are in at any time. By contrast, the PIT estimation is representative of the current business cycle. This means that the binomial test won’t take the economic situation into account. In terms of discriminatory power, the ROC test and the DeLong test are introduced. Confidence intervals and tests of ROC are available for the AUC measures. However, it is hard for a researcher to use ROC to define a minimum value that determines acceptable discriminatory power. The DeLong test has the sample variability but it is complex to calculate.

Tasche [13] elaborates on the validation requirements for rating systems and probabilities of default that were introduced in Basel II [12]. He puts the main emphasis on the issues with quantitative validation. The techniques discussed in his article could be used to meet the quantitative regulatory requirements. However, their appropriateness will depend on the specific conditions under which they are applied. He introduces a theoretical framework by defining two random variables, 𝑆 and 𝑍. The former one denotes a score on a continuous scale that the institution has assigned to the borrower, while the latter one shows the state the borrower will have at the end of a fixed period, default or non-default. Then, the institution’s intention with the score variable 𝑆 is to forecast the borrower’s future state 𝑍, by relying on the information on the borrower’s credit worthiness that is summarized in 𝑆. He mentioned that in this sense, scoring and rating are related to binary classification and that scoring can be called binary classification with a one-dimensional co-variate.

Intuitively, a good rating system should be able to distinguish the creditworthy obligors and the potential defaulters, by assigning good obligors to low credit rating grades and clients with a higher PD to high credit rating grades. Therefore, Dirk Tasche also discusses when and how this monotonicity can be guaranteed under this theoretical framework. This problem is considered in the context of a hypothetical decision problem. He introduces some techniques to find a reasonable threshold of scores, under which the borrower would be predicted to get

13

defaulted. After that, he studies the question of how discriminatory power can be measured and tested. He elaborates on the Cumulative Accuracy Profile (CAP) and its summary statistics Accuracy Ratio (AR), Receiver Operating Characteristic (ROC), and its summary measure Area Under the Curve (AUC) and the error rates as measures of discriminatory power. Various calibration techniques are also included with similar content as in the article of Castermans and Martens [9].

Dirk Tasche concludes that AR and AUC seem promising tools to check the discriminatory powers as their statistical properties are well investigated and they are available together with many auxiliary features in most of the more popular statistical software packages. With regards to testing calibration, for conditional PD estimates powerful tests, such as the binomial and the Hosmer-Lemeshow test, are available. However, their appropriateness strongly depends on an independence assumption that default events are all independent. This independent assumption needs to be justified on a case-by-case basis.

Similarly, Engelmann [13] also introduces the CAP and ROC as commonly used techniques to test the discriminatory power of a PD prediction model. He gives the relationship between AR and AUC, that is,

𝐴𝑅 = 2𝐴𝑈𝐶 − 1.

3.2 Literature Related to HMM

Elliott’s book [28] is mainly about the theory of HMM. The book includes theorems about both discrete and continuous states and observations. Chapter 2, which describes the discrete HMM, is related to our research and will be our main focus. HMM assumes that there is a Markov process which is unobservable, and that there is another process whose behavior depends on the hidden Markov process. Elliott’s book is based on a one-unit delay discrete HMM, which means the observed value at time 𝑡 only depends on the value of the hidden state at time 𝑡 − 1. The details of the one unit delay HMM can be found in Chapter 6.

Elliott’s book assumes that the noises of these two processes are independent. If the transition matrix and emission matrix are denoted by 𝐴 and 𝐶, for example, the HMM can be written as

𝑋𝑘+1 = 𝐴𝑋𝑘 + 𝑉𝑘+1 (hidden states),

𝑌𝑘+1 = 𝐶𝑋𝑘 + 𝑊𝑘+1 (observation states),

where 𝑉𝑘+1 and 𝑊𝑘+1 are the noises at time 𝑡 + 1. In his book, these two noise processes are independent of each other. He also proposes another form of HMM where these two noise processes are dependent. In this case, the observation state 𝑌𝑘+1 will depend on both 𝑋𝑘+1 and 𝑋𝑘. In Chapter 6, we will explain the theoretical HMM model in detail, and 𝑌𝑡 denotes the ‘true’ credit quality for clients at time 𝑡, while 𝑋𝑡 denotes the credit ratings calculated by banks.

Malgorzata Wiktoria’s research [27] is based on Elliott’s book [28]. She gives a brief introduction about theorems of both general HMM and dependent HMM, and states that in the dependent HMM the hidden true credit quality state 𝑋𝑘+1 and observation 𝑌𝑘+1 are jointly depend on 𝑋𝑘 , which means that in addition to previous period’s credit quality, knowledge of current credit rating carries information about current credit quality. She conducts a numerical experiment to test whether the HMM can be used to estimate the

14

transition matrix of hidden states which are the true credit quality in her research. Instead of estimating the transition matrix for all credit ratings, she roughly divided all the credit grades into two groups: investment grade (IG) and speculative grade (SG), which reduces the dimensions of both hidden state space and signal state space. However, due to the requirements from CRR, from a validation perspective, folding credit rating grades will make us unable to backtest models on bucket level. Thus, in our research, one of the challenges would be how to reduce the dimension of state space without folding credit ratings.

Different from Malgorzata Wiktoria’s research and Elliott’s book, Rogemar presents a zero delay HMM, which is slightly different from one unit delay version. The zero delay HMM assumes that the observation signal at time 𝑡 depends on the hidden state at time 𝑡 instead of the hidden state at the previous step.

𝑋𝑘+1 = 𝐴𝑋𝑘 + 𝑉𝑘+1;

𝑌𝑘 = 𝐶∗𝑋𝑘 + 𝑊𝑘∗,

where 𝐴 = (𝑎𝑗𝑖) represents the transition matrix and 𝐶∗ = (𝑐𝑗𝑖∗ ) represents the emission

probability matrix; 𝑉𝑘 and 𝑊𝑘∗ are both the noise terms.

Our research is based on zero delay HMM. We apply the zero delay HMM because intuitively the observation signals will depend on the current hidden true credit quality instead of the previous hidden state. The parameter estimation methods are also different. Malgorzata Wiktoria and Elliott apply the filter-based cohort approach [28] to estimate the migration matrix, while we will use the Baum-Welch algorithm to obtain the estimation of transition probability. The specific steps of the Baum-Welch algorithm can be found in Zheng Rong [30] and Jeff Bilmes [31], and the steps are covered in Chapter 6.

15

4 Model Validation Methodology

In the previous chapter, the regulatory background and previous researches are introduced. In this chapter, the model validation methodologies will be demonstrated, and in Chapter 5 these validation methodologies will be tested to see whether they can help in distinguishing the good PD prediction models from the bad ones.

The focus of the following validation activities is to check whether or not models are fit for purpose and are conceptually sound by effectively challenging the owner, modeling teams, and users of the developed model, model documentation, and test results. Moreover, it is emphasized by the Basel Committee that both quantitative and qualitative components should be considered during the validation process. For more specific qualitative validation, please consult the Basel document [4]. This research deals with quantitative validation only.

As BCBS (2005a) states [4], “validation is fundamentally about assessing the predictive ability of a bank’s risk estimated and the use of ratings in credit processes”. Here the term “predictive ability” is not a statistical term that has a specific mathematical meaning, but that, in the financial industry, could be understood as the correctness of calibration of PD models and the discriminatory power of the entire internal rating system [12]. The testing methods of these two parts will be introduced in the following sections, respectively.

4.1 Calibration Quality

To check the correctness of the calibration quality of PD models is to test whether or not the observed PD is in line with the predicted PD. For calibration quality of PD models, BCBS (2004) [3] states that “banks must regularly compare realized default rates with estimated PDs for each grade”. Therefore, the credit institutions need to test the accuracy of prediction models on both grade level and model level. The binomial test can be used for the bucket level testing, while the Poisson-binomial test is applied to describe the prediction ability on the model level. Based on the traffic light approach, which is proposed in [4], the calibration quality of the target model can be monitored.

4.1.1 Binomial Test

In some processes observed values can only be divided into two categories, such as qualified/unqualified, yes/no, life/death, etc. The binomial distribution is a probability distribution that describes 𝑛 independent processes that each yields events with only two mutually exclusive results with a fixed probability. The binomial test is a method used to test whether the samples followed the binomial distribution with parameters (𝑛, 𝑝), where 𝑛 is the number of samples and 𝑝 is the probability of obtaining a ‘success’ event instead of ‘no success’ event. Note that the binomial test can only be used for models that predict a dichotomous variable (in this case, it’s default or performing).

In the binomial test, the observation events are all assumed to be independent, which means that the observed results for the clients in the same observation window are parallel and won’t interact. Suppose that in 𝑛 samples, there are 𝑘 samples present success. The probability mass function of binomial distributed random variable 𝑋 can be written as[9]

ℙ(𝑋 = 𝑘) = (𝑛𝑘

) 𝑝𝑘 ∙ (1 − 𝑝)𝑛−𝑘 (4.1)

16

If 𝑛 is large enough, for instance, 𝑛 > 1000, and 𝑛 ∙ 𝑝 ∙ (1 − 𝑝) ≥ 9, we can apply a normal approximation, that is, the binomially distributed random variable 𝑋 is approximately normally distributed, which can be expressed as

𝑋 ∼ 𝑁(𝑛𝑝, 𝑛𝑝(1 − 𝑝)). (4.2)

As a hypothesis test, the binomial test can be either one-sided or two-sided, which are shown in Table 1.

Table 1. The null and alternative hypotheses of the binomial test

One-sided binomial test Two-sided binomial test

𝑯𝟎 The Observed Default Rates (ODR) are in line with the predicted PDs, which means the PD model can be considered accurate.

𝑯𝟏

1) The ODRs are lower than the predicted PDs, or 2) the ODR is larger than the predicted PDs, which means the PD prediction model is not accurate.

The ODRs are not equal to the predicted PD, which means the PD prediction model is not accurate.

Comparing the output p-value of the binomial test above with the chosen significance level 𝛼, one can reject or not reject the null hypothesis. How to choose a significant level depends on how conservative one would like to be. Theoretically, in terms of the right-sided test, one would reject the null hypothesis if the following inequality holds[10].

ℙ(𝑋 ≥ 𝑘) = 1 − 𝐹(𝑘 − 1) = 1 − ∑ (𝑛𝑖

) 𝑝𝑖 ∙ (1 − 𝑝)𝑛−𝑖

𝑘−1

𝑖=0

≤ 𝛼 (4.3)

, where 𝑛 shows the amount of all observations; 𝑘 represents the number of success events, which in this case would be understood as the number of default events. Similarly, in terms of the left-sided test, the null hypothesis would be rejected when

ℙ(𝑋 ≤ 𝑘) = 𝐹(𝑘) = ∑ (𝑛𝑖

) 𝑝𝑖 ∙ (1 − 𝑝)𝑛−𝑖 ≤ 𝛼

𝑘

𝑖=0

(4.4)

In both equations (4.3) and (4.4), the 𝐹(∙) represents the cumulative distribution function of binomially distributed random variable 𝑋, where 𝑋 ∼ 𝐵𝑖𝑛(𝑛, 𝑝). For the two-sided binomial test, both probabilities of equations (4.3) and (4.4) should be conducted but comparing with 𝛼

2 instead. As long as one of the one-sided tests leads to a rejection, the null hypothesis of the

two-sided test would be rejected at the significance level 𝛼 , otherwise, it would not be rejected.

Also, based on the assumptions mentioned above, the normal approximation can be considered. In this case, the null hypothesis 𝐻0 is rejected when

ℙ(𝑍 ≥ 𝑧) = 1 − Φ(𝑧) ≤ 𝛼, (4.5)

for the right-sided test and

17

ℙ(𝑍 ≤ 𝑧) = Φ(𝑧) ≤ 𝛼, (4.6)

for the left-sided test, where

𝑧 =𝑘 − 𝑛𝑝

√𝑛𝑝(1 − 𝑝).

and Φ(∙) represents the cumulative distribution function of standard normal distribution. Similarly, the null hypothesis 𝐻0 of a two-sided test would be rejected at significance level 𝛼 as long as one of the one-sided tests leads to the rejection.

In the real world, a one-sided test and two-sided test would be conducted under different circumstances. In the real world, a bank tends to use the right-side test to see whether or not the model is too optimistic to avoid the risk of having more defaulters than expected. In contrast, the two-sided binomial test would be applied when one wants the model not to be too conservative or too optimistic. From a current risk management perspective, monitoring conservative and optimistic predicted PDs are both important. Therefore, in this research, a two-sided test is used, which means both conservatism and optimism are unacceptable.

4.1.2 Poisson Binomial Test

Different from the binomial test, the Poisson binomial test is conducted to test the calibration of the whole rating system rather than on grade level. It is an exact test since no approximations and shortcuts are necessary. The Poisson binomial test describes the distribution of the sum of independent and non-identically distributed random indicators, where each indicator is a Bernoulli random variable and where each probability of default may vary. The Poisson binomial test would reduce to the binomial test if the probabilities of default are equal on the bucket level. The null and alternative hypotheses are shown in Table 2.

Table 2. The null land alternative hypotheses of the Poisson binomial test

One-sided Poisson binomial test Two-sided Poisson binomial test

𝑯𝟎 The ODR is in line with the predicted PDs, which means the PD model can be considered accurate.

𝑯𝟏

1) The ODR is lower than the predicted PDs or 2) the ODR is larger than the predicted PDs, which means the PD model is not accurate

The ODR is not in line with the predicted PDs, which means the PD model is not accurate.

Similarly, using the Poisson binomial test, the results of p-values is compared with the set significance level 𝛼. The 𝛼 is depending on how conservative one wants to be. For the Poisson binomial test, the null hypothesis would be rejected when

ℙ(𝑋 ≥ 𝑘) = 1 − 𝐹(𝑘 − 1) ≤ 𝛼, (4.7)

for the right-sided test, and

ℙ(𝑋 ≤ 𝑘) = 𝐹(𝑘) ≤ 𝛼 (4.8)

18

for the left-sided test, where 𝐹(∙) is the cumulative distribution of Poisson binomial test and are written as

𝐹(𝑘) = ∑ ∑ ∏ 𝑝𝑗 ∏(1 − 𝑝𝑗)

𝑗∈𝐴𝑐𝑗∈𝐴𝐴∈ℱ𝑚

𝑘

𝑚=0

, (4.9)

where ℱ𝑚 is the 𝜎 -algebra of 𝑚 integers which is the subset of {1,2,3, … , 𝑛} ; 𝐴𝑐 is the complement of set 𝐴. Same as above, the null hypothesis two-sided test of Poison binomial will be rejected at significance level 𝛼 as long as one of the one-sided tests reaches the

rejection at the significance level 𝛼

2.

However, the computation of the CDF of the Poisson Binomial distribution is not as straightforward as directly applying equation (4.9) above [18]. Instead of using the approximation approaches, Rabobank applies a simple derivation for an exact formula with a closed-form expression of Poisson binomial CDF, which involves the Fourier transform of the characteristic function of the distribution. Since how to improve computational efficiency is not the focus of this thesis, we won’t explain the details further. Further details of the algorithms can be found in [18][19].

4.1.3 Traffic Light Approach

The traffic light approach is applied while reporting the quality of the PD model[2]. There are three colors, red yellow, and green, with thresholds set in advance. The model validator could see the potential issues of a model and then would need to perform a further investigation according to the output warning signals. Note that the results of this traffic light approach should not be seen as the specific conclusion of the calibration of the internal rating system, but as the direction of further research. The traffic light indicators are shown in Table 3.

Table 3. The traffic light indicator of both the binomial test and Poisson binomial test

Traffic

light

approach

PD

PD predictions and ODF for

the bucket are not

significantly different at an

alpha of 10%.

PD predictions and ODF for the

bucket are significantly different

at an alpha of 10%, but not at an

alpha of 0.2%.

PD predictions and ODF

for the bucket are

significantly different at

an alpha of 0.2%.

An alpha value of 0.05 is commonly used for a hypothesis test. In the previous version of the binomial test in the backtest, two one-sided tests were conducted both with a significance level 0.05, one for optimism, and the other for conservatism. Thus, the combined alpha value is 0.1, which is considered as the threshold of the orange traffic light. Sometimes getting a warning signal when the predicted PDs locates out of the 90% confidence interval would be early. Thus, the red traffic light is introduced with an alpha value of 0.1% for one-sided tests, yielding the combined alpha value of 0.2%, which is very conservative. The red traffic light would indicate serious errors while validating.

4.2 Discriminatory Power

The discriminatory power of a PD model determines to verify whether or not this model can distinguish defaulters from non-defaulters. In BCBS (2005a) [4], the discriminatory power is defined as the ‘ability to discriminate ex-ante between defaulting and non-defaulting

19

borrowers’, where the term ‘ex-ante’ means ‘in advance’. The discriminatory power can indicate the performance of the whole internal rating system, while the correctness of calibration can be used to see if a PD prediction model has correctly assigned clients to credit rating grades.

There are multiple methods for discriminatory power testing. The most commonly used validation method is the Cumulative Accuracy Profile (CAP) and its summary statistic Accuracy Ratio (AR) which is also known as Gini Index or Powerstat. Another method with a similar idea is the Receiver Operating Characteristic (ROC). This section is mainly based on Dirk Tasche [12] and Bernd and Evelyn [13].

4.2.1 Cumulative Accuracy Profile (CAP)

In the form of figures, Cumulative Accuracy Profile (CAP) intuitively and concisely describes the quality of an internal rating system. A perfect internal rating system can properly assign defaulters to lower credit quality rating classes. Suppose that there are 10 performing credit

buckets in an internal rating system {𝑅0, 𝑅1, … , 𝑅9}1 with decreasing credit quality. The performance of all clients in the next period (one year) is monitored. At the end of the following period, the defaulted clients are named as defaulters(D), and the clients that stay

performing are seen as non-defaulters (ND). Let 𝑝𝐷𝑖 , 𝑖 ∈ ℕ, 0 ≤ 𝑖 ≤ 9, represents the

proportion of defaulters in 𝑅𝑖 credit rating class. These proportions sum up to 1, that is ∑ 𝑝𝐷

𝑖9𝑖=0 = 1. Similarly, we can define the other two proportions 𝑝𝑁𝐷

𝑖 and 𝑝𝑇𝑖 , 𝑖 ∈ ℕ, 0 ≤ 𝑖 ≤

9 by the same way, where 𝑝𝑇𝑖 means the proportion of the defaulters in all clients. Note that

here the proportion 𝑝𝐷𝑖 , 𝑝𝑇

𝑖 and 𝑝𝑁𝐷𝑖 do not mean the ‘true’ proportions. They are all

computed based on the predictions of the chosen PD prediction model.

Provided the average observed default rate 𝜋, the fraction between the total number of defaulters and the total number of clients, we can easily deduce [13]

𝑝𝑇𝑖 = 𝜋𝑝𝐷

𝑖 + (1 − 𝜋)𝑝𝑁𝐷𝑖 . (4.10)

The discrete empirical cumulative distribution function 𝐹𝑇(∙), 𝐹𝐷(∙) and 𝐹𝑁𝐷(∙) can be defined by summing up the portion values above, that is,

𝐹𝑇(𝑘) = ∑ 𝑝𝑇𝑗

𝑘

𝑗=0

, 𝑘 = 0,1, … ,9 (4.11)

𝐹𝐷(𝑘) = ∑ 𝑝𝐷𝑗

𝑘

𝑗=0

, 𝑘 = 0,1, … ,9 (4.12)

𝐹𝑁𝐷(𝑘) = ∑ 𝑝𝑁𝐷𝑗

𝑘

𝑗=0

, 𝑘 = 0,1, … ,9 (4.13)

, where 𝐹𝑇(𝑘) = ℙ(𝑆𝑇 ≤ 𝑘), meaning the probability of a client with credit grade no greater than 𝑅𝑘.

Then, the CAP function[12] is

1 Note that this is not the internal rating system of Rabobank.

20

𝐶𝐴𝑃(𝑢) = 𝐹𝐷(𝐹𝑇−1(𝑢)), 𝑢 ∈ (0,1) (4.11)

The Cumulative Accuracy Profile (CAP) is defined as the curve connecting all points of

(𝐹𝑇(𝑘), 𝐹𝐷(𝑘)), 𝑘 ∈ ℕ, 0 ≤ 𝑘 ≤ 9 or (𝑢, 𝐶𝐴𝑃(𝑢)), 𝑢 ∈ (0,1) by linear interpolation[13]. The

former expression of CAP points can still be used when the cumulative distribution function 𝐹𝑇(∙) is invertible. An example of CAP is shown in Figure 4.

Figure 4. An example of Cumulative Accuracy Profile

As shown in Figure 4, the curve describes the performance of a perfect model that can correctly predict whether those performing clients default or not; the dashed diagonal line indicates the performance of a developed model which got no forecasting power; the middle orange curve means the performance of the model that we wants to evaluate. On common sense, the predicted model applied by banks cannot do the perfect prediction but will more or less provide helpful advice, so its performance line is normally located between the worse and the best.

The process of plotting the CAP as shown in Figure 4 can be elaborated by a simple case. Suppose that there are 5 clients with 3 of them defaulted (red) and the rest 2 clients performing at the end of the next period.

NO. 1 2 3 4 5

Default or not

We apply three PD prediction models with different quality and then obtain the forecasted default probabilities in Table 4.

Table 4. The predicted PDs in the example of CAP plotting

NO. 1 2 3 4 5

Default or not

Predicted PD from perfect models (best quality) 0.9 0.4 0.6 0.8 0.1

𝑎𝑝

𝑎𝑟

21

Predicted PD from random models (worst) 0.9 0.7 0.6 0.1 0.2

Predicted PD from the developed models 0.9 0.6 0.8 0.3 0.2

After sorted these default possibilities, we can obtain the ordered sequences in Table 5.

Table 5. The sorted predicted PDs in the example of CAP plotting

Predicted PD from perfect models (best quality) 1 4 3 2 5

Predicted PD from random models (worst) 1 2 3 5 4

Predicted PD from the developed models 1 3 2 4 5

As shown in Table 5, perfect models will assign the actual defaulters to the lower rating class, putting the red cells on the left side and green cells on the right side, while the model with zeros information cannot distinguish the actual defaulter from non-defaulters, with the green cells and red cells randomly spreading. The prediction PD models apply in the financial industry only give limited predictions, with errors of placing cells on the wrong side as the third row in Table 5.

In order to draw the CAP lines, we should set a threshold of probabilities in Table 4, above which clients would be considered as positive. Then the value of the X-axis can be defined as the fraction of the number of positive samples and the number of total clients. The value of the Y-axis can be correspondingly defined as the fraction of the number of defaulters who are also seen as positive and the number of all positive samples. In this case, the points can be calculated with decreasing thresholds.

Table 6. Computing the coordinate points of CAP.

Perfect model

Threshold 0.85 0.70 0.50 0.30 0

x-axis 0.20 0.40 0.6 0.8 1

y-axis 0.33 0.67 1 1 1

Developed model

Threshold 0.85 0.70 0.50 0.25 0

x-axis 0.20 0.40 0.60 0.80 1

y-axis 0.33 0.67 0.67 1 1

Random model

Threshold 0.80 0.65 0.50 0.15 0

x-axis 0.20 0.40 0.6 0.8 1

y-axis 0.33 0.33 0.67 0.67 1

Connecting all the (𝑥, 𝑦) points in Table 6, we can see that the line for a perfect model is rising to 1 fast, like the blue line in Figure 4. Since the worst model randomly assigns the clients to

22

credit buckets, its CAP line is almost diagonal as the dashed line in Figure 4. By contrast, the CAP line of the real-life model shows a tendency of approaching 1, but at a relatively lower speed than that of a perfect model, making the developed CAP curve located between two extremes.

4.2.2 Accuracy Ratio (AR)

The information given by the CAP figure can be simply summarized by the Accuracy Ratio (AR), which is also known as the Gini coefficient or Powerstat. In Figure 4, the notation 𝑎𝑝

represents the area between the CAP of the perfect model and CAP of the random model; 𝑎𝑟 shows the area between the CAP of the model under evaluation and the CAP of the random model. The ratio of 𝑎𝑟 and 𝑎𝑝 are defined as the Accuracy Ratio[13], that is,

𝐴𝑅 =𝑎𝑟

𝑎𝑝. (4.15)

Also, it can be analytically written in terms of CAP function[12]

𝐴𝑅 =2 ∫ 𝐶𝐴𝑃(𝑢)𝑑𝑢

1

0− 1

1 − 𝑝, (4.16)

where 𝑝 represents the fraction of defaulters.

The AR of a random model is 0, while the AR of the perfect model is 1.Since the developed model has the discriminatory power in between these two extremes, so its AR is a fraction, in the range of 0 and 1. We can conclude according to Figure 4 that the larger the AR score is, the closer the CAP of the developed model is to the CAP of the perfect model, which means the discriminatory power of the developed internal rating system is higher.

In Table 4, we can see that some of the predicted PDs are very large, such as 0.9 and 0.8. However, in the real world, by doing risk management, the bank can somewhat avoid the loss produced by default. For instance, the bank would assess the credit risk of obligors before signing the contract and refuse those assigned to higher credit grades. As a result, the predicted PDs of clients who are already in the portfolios would be less than a given threshold, such as 0.4. Moreover, the default of an obligor is assumed to be a random variable with Bernoulli distribution, since credit quality that the banks are trying to monitor is in form of a probability instead of a specific outcome. Therefore, the AR concerning a PD prediction model is stochastic. We can do a Monte Carlo experiment, based on the modeled PDs and repeatedly draw defaults and calculate ARs from that. In this way, we can obtain a predicted AR and its confidence intervals.

In addition, AR is portfolio dependent[14][15][16]. Due to the stochastic factors, we can hardly find an actual model with AR extremely close to 1. Thus, the ‘perfect’ AR (Gini)2, which is the Gini when the fraction of defaulters is exactly the expectation PD in terms of every bucket in the internal rating system, could be seen as the benchmark for the perfect model considering stochastic factors. Based on the bucket PDs of the internal rating system, we can repeat the Monte Carlo experiments and thus obtain the bootstrapped ARs whose average will be close, but not always, to the ‘perfect’ AR. This is because the AR function is non-linear.

2 Note that this is not the model quality indicator for Rabobank. In Rabobank, a traffic light approach with a orange light threshold of 40% is used to determine the discriminatory power of a model. However, for some of the portfolios, even their perfect AR cannot reach 40%, so setting a ‘perfect’ AR will be more reasonable.

23

In our simulation, we apply the bootstrapped average AR as the benchmark, because it considers the non-linearity of Gini functions.

4.2.3 Receiver Operating Characteristic (ROC)

Receiver Operating Characteristic (ROC) is another technique to investigate the discriminatory power of an internal rating system. It is with a concept similar to the CAP. Given the assumptions in section 4.2.1, the ROC function[13] is defined as

𝑅𝑂𝐶(𝑢) = 𝐹𝐷(𝐹𝑁𝐷−1(𝑢)), 𝑢 ∈ (0,1). (4.17)

By connecting all points of (𝑢, 𝑅𝑂𝐶(𝑢)), 𝑢 ∈ (0,1), we can obtain the ROC plotting. If the

𝐹𝑁𝐷−1(∙) is invertible, one can plotting all points of (𝐹𝑁𝐷(𝑘), 𝐹𝐷(𝑘)), 𝑘 ∈ ℕ, 0 ≤ 𝑘 ≤ 9 to

obtain ROC. By contrast, plotting ROC does not require the estimation of PD for all clients.

Area Under the Curve (AUC) is the associated summary measure for Receiver Operating Characteristic, to describe the discriminatory power of the rating system by a simple number. The AUC has a strong connection with the AR, which can be technically demonstrated by

𝐴𝑈𝐶 = ∫ 𝑅𝑂𝐶(𝑢)𝑑𝑢1

0

=𝐴𝑅 + 1

2(4.18)

Equation (4.18) is proved by Engelmann[17].

Since ROC is more well known in the artificial intelligence industry and plays no role in this thesis, it won’t be explained any further in detail.

4.3 Chapter Summary

In this chapter, some quantitative validation methods were introduced. For the calibration quality test, we applied the binomial test on the bucket level and the Poisson binomial test on the model level. For the discriminatory power test, the Cumulative accuracy Profile (CAP) and its summary measure AR were elaborated with an example. Then, the traffic light approach was introduced as a standard benchmark to determine good models for these three techniques. In addition, the concepts of ‘perfect’ AR and bootstrapped AR average were introduced to set a benchmark to determine the discriminatory power of a PD prediction model. In the next chapter, a simulated artificial bank will be set up, on which these validation methods will be implemented to see whether they are effective and efficient in distinguishing the good model from the bad ones.

24

5 Dataset Simulation and PD model Validation

In the previous chapters, we discussed the credit modeling regulations and introduced some validation methodologies. Now, in this chapter, an artificial bank will be generated to check whether the validation approaches stated in Chapter 4 works out or not. The procedure of dataset simulation and validation methods implementation is as follows.

Step 1: The artificial internal rating system setup. First, we will build up an internal rating system that consists of 10 performing credit rating buckets. Each client can be mapped to one of the buckets in this internal rating system according to his PD. This bucketing system is defined by stating the PD boundaries. Then, a ‘true’ PD model is set up with 5 factors whose initial values are generated from normal distributions with different parameters. We also introduce 5 states which describe the situation of the defaulted clients: first year in default (DY1), second year in default (DY2), third year in default (DY3), Cured (C), and Liquidated (Liq). Then the whole state space is of 15 states. Section 5.1.1 to section 5.1.2 will explain this step in detail.

Step 2: Credit grade migration rules setup. by adding the drifts to the factors, the credit grades will change depending on where the ‘true’ PDs are in the bucketing system. The ‘true’ credit rating grades will be updated monthly, so the migration matrices also describe the credit rating transitions between two months. After 250 consecutive months, we can obtain 249 migration matrices, and then the convergence of ‘true’ migration matrices will be tested by the standard deviations of each cell. Then, the ‘true’ information dataset of the artificial bank is established. Section 5.1.3 to section 5.1.4 will explain this in detail.

Step 3: PD prediction models setup. Since in the real world it is hard for a financial institute to figure out all the factors that somehow affect the credit quality, the predicted credit grades will inevitably have errors. Now, in order to mimic the real-world situation, we assume that all the ‘true’ information is hidden, such as the ‘true’ PD model, the ‘true’ migration matrix, and the ‘true’ credit quality migration rules. Then, we build up 5 PD models in the form of sigmoid functions based on the different quantity of factors. These 5 PD models are with different levels of predictive ability. Section 5.2 will explain this in detail.

Step 4: Validation methods implementation. Intuitively, the less the factors, the worse the quality of the PD prediction model. By rerating clients using the PD models built up in step 3, we can obtain 5 groups of forecasted PDs. Applying the validation methods introduced in Chapter 4 to test the prediction performance of these models, we can know whether these validation methods can recognize the decreasing prediction power of these PD prediction models. Section 5.2 will explain this in detail.

5.1 Dataset Simulation

In this section, we aim to build up the artificial bank with ‘true’ information, such as ‘true’ PDs and ‘true’ credit grade migration matrix.

In the real world, clients are assigned to credit buckets according to the forecasted PDs which are computed based on various factors, for instance, liquidity ratio, solvency, debt service converge ratio, quality of management, business segment, history of defaults, etc. All the factor values are drifting over time. Before considering the drifts, we need to simulate the initial factor values at the beginning of the first month. The initial factor values are generated

25

from normal distributions with different parameters. By building the ‘true’ PD model and defining the PD boundaries of bucketing systems, we can label the credit grades for all clients. From a risk modeling and validation perspective, it is effective and efficient to assume all client who are in the same credit grade to have the same PD which is called the bucket PD. Then the drift functions are introduced to make these factor values to change over time, making the credit rating grades drift as well. In this case, the ‘true’ PD model is in form of a sigmoid function of a score based on these factors. This process is illustrated in Figure 5.

Figure 5. Simulate migrating PD and credit grades

In Figure 5, 𝑡 represents the time, where 𝑡 = 𝑖 and 𝑡 = 𝑖 + 1 are with actual time interval a month.

In section 5.1.1, the defined bucketing system and the approaches used to generate initial factor values are introduced. In section 5.1.2, the ‘true’ PD model for the artificial bank and the data pre-processing techniques for the ‘true’ PD model are determined. In section 5.1.3, the idea of selecting drift functions is elaborated and in section 5.1.4, the simulation of credit quality migration is introduced, and a ‘true’ migration matrix is shown at the end of this section.

5.1.1 Factor Values and Credit Rating System Setup

As a part of the basic setting of this artificial bank, the performing buckets in this internal rating system are based on the PD. There are 10 performing buckets3 in the system. Suppose that there is a total of 5 factors that would affect PD, namely ‘Factor 1’ to ‘Factor 5’. The initial values of these factors are assumed to follow normal distributions with different parameters, which are shown in Table 7. Note that these distributions do not apply to the drifted factor values that are calculated by given drift functions. Every factor value will follow the normal distribution at the beginning. Once it starts drifting, its distribution will change. At the beginning of the first month, all clients are performing. the first default will be witnessed at the end of the first month.

3 Note that Rabobank has its own proprietary bucketing system which differs from the bucketing system in this thesis for reasons of confidentiality

26

Table 7. The distributions of initial values (𝑡 = 0) of all factors

Factors Distribution

1 N(1,1)

2 N(2,2.5)

3 N(3,4)

4 N(4,5.5)

5 N(5,7)

Then, a ‘true’ PD model is applied to compute the PDs for every client. Here the outcome PDs would range from 0 to 1. To control the risk of defaulting, this artificial bank would refuse the clients whose PD are higher than 0.4 to join a portfolio. The bucketing system is defined based on a set of PD intervals. From a risk management perspective, it would be more concise to assume all clients in the same bucket to have the same bucket PD, which can be simply set to be the median of each PD interval. These bucket PDs would be used to simulate the defaulters and non-defaulters at the end of every month. Note that bucketing system is only used to denote the credit quality of performing clients instead of the defaulters. The credit state of defaulters will be added into the state space in section 5.1.4. The bucketing system is shown in Table 8.

Table 8. The bucketing system of simulated artificial bank

Credit bucket 𝑅0 𝑅1 𝑅2 𝑅3 𝑅4

PD interval [0,0.04) [0.04,0.08) [0.08,0.12) [0.12,0.16) [0.16,0.20)

Bucket PD 0.02 0.06 0.10 0.14 0.18

Credit bucket 𝑅5 𝑅6 𝑅7 𝑅8 𝑅9

PD interval [0.20,0.24) [0.24,0.28) [0.28,0.32) [0.32,0.36) [0.36,0.40)

Bucket PD 0.22 0.26 0.30 0.34 0.38

In this bucketing system, the higher a credit rating grade is, the lower the credit quality is. The numbering of buckets in the bank system is consistent with the value of PDs instead of the intuitive credit quality, for instance, the highest credit grade 𝑅9 contains the clients with PD around 0.38, which is the biggest acceptable PD grade, in terms of the simulated artificial bank. In the Rabobank credit bucketing system, there exists a riskless bucket, ‘𝑅0’ containing clients who will not get defaulted in the next period after joining the portfolio. However, this is not the case in this artificial bank. In Table 8, it is still possible for clients in 𝑅0 to default. The increase increment of bucket PD in our system is 0.04, which is not very small and will enable the HMM to distinguish these states.

5.1.2 ‘True’ Credit Rating Model

This ‘true’ PD model describes the real probability of default in this artificial world. It is invisible in the real world since there are many factors that banks don’t know or haven’t taken

27

into account. As stated in Figure 5, for this artificial bank, the ‘true’ PD model is set to be a sigmoid model:

𝑃𝐷 =1

1 + 𝑒−(𝛽∙𝑋+𝛽0), (5.1)

where 𝑋 is the matrix of all factor values; 𝛽 is the sigmoid parameter vector and 𝛽0 is the intercept. If the artificial bank will refuse clients with PD over 0.4,

𝛽 ∙ 𝑋 + 𝛽0 ∈ (−∞, −𝑙𝑜𝑔1.5]. (5.2) Set 𝛽0 = −𝑙𝑜𝑔1.5 and 𝛽 = [0.1,0.5,1,0.5,0.25] . To scale all factor values, we apply the equation

Transform(factor i) =(𝑢𝑏 − 𝑙𝑏)

1 + 𝑒−𝑠(factor i+𝑠ℎ)+ 𝑙𝑏 ∈ (𝑙𝑏, 𝑢𝑏), (5.3)

where ‘𝑢𝑏’ and ‘𝑙𝑏 ’ are the upper and lower bounds of the target interval in which the transformed factor values are located; ‘𝑠’ and ‘𝑠ℎ’ are the steepness and shift, which are both parameters set by ourselves. Those parameters exploited in this case are in Table 9.

Table 9. The parameters chosen for scaling factor values

Factor 1 Factor 2 Factor 3 Factor 4 Factor 5

ub 0.13

lb -3

s 4 0.5 0.4 0.2 1.5

sh -1 2 3 6 -1.5

These parameter values in Table 9 are selected to make the distribution of output ‘true’ PD fit the reality: most clients would be more likely to be in the middle credit rating class, such as ‘𝑅3’ to ‘𝑅5’. Then, the number of clients would decrease as the credit rating goes both up and down. Based on the initial factor values, we now have the histogram of initial credit rating grades in Figure 6, which looks similar to a normal distribution.

Figure 6. The distribution of credit quality grades at the first month

28

Whether a client is default or not is described by a random variable that is Bernoulli distributed with corresponding bucket PD (Table 8) as the parameter. Now, we can obtain the dataset of the artificial bank for the first month and part of the dataset is shown in Table 10.

Table 10. Part of the dataset of the artificial bank for the first month

In the dataset(Table 10), we can see the difference between the ‘true’ PDs and the bucket PDs; all the defaulters are denoted by 1, while all the non-defaulters are denoted by 0, as shown in column named ‘𝑑𝑒𝑓𝑎𝑢𝑙𝑡_𝑜𝑟_𝑛𝑜𝑡’; ‘𝑜𝑙𝑑_𝑐𝑟𝑒𝑑𝑖𝑡_𝑟𝑎𝑡𝑖𝑛𝑔’ describes the clients’ credit grades in last month. For the dataset of the first month, the number 1000 means that this client is new to the artificial bank; all clients are numbered by ‘𝑐𝑙𝑖𝑒𝑛𝑡𝑠_𝑖𝑑’ so as to trace the migration histories of the credit quality.

Note that based on factor values, we can only give a possibility of default rather than giving an exact outcome prediction of default or not. During the first month, defaulters will be existing. The credit transitions of defaulters will be further explained in section 5.1.4.

5.1.3 Drift Functions

After generating all the factor values for the very first month, some drift functions will be introduced here to create movements of those factor values, based on which clients will start to migrate on their credit grades. We can interpret the factor values as 5 different stochastic

process {𝑋𝑡(𝑖)

}, 𝑖 = 1,2,3,4,5, where 𝑖 represents the factors’ number and 𝑡 represents time.

The increments of factor values with respect to time are determined by the stochastic

differential equations shown in Table 11. 𝜃(i), 𝑖 = 2,3,4,5 are self-defined constant that

differs for every factor. They are the means of the respective stochastic processes; 𝐾(𝑖), 𝑖 =

1,2,3,4,5 is a scaling parameter controlling the centralization speed; 𝜎(𝑖), 𝑖 = 1,2,3,4,5 is the volatility of each factor; 𝑊 is a Brownian motion, and 𝑑𝑊𝑡 ∼ 𝑁(0, 𝑑𝑡), where 𝑑𝑡 represents the increment of time. in this case, 𝑑𝑡 = 1, meaning that clients are rerated every month.

Table 11. The stochastic differential equation in terms of every factor

Factor NO.

SDE Equations

Factor 1 Dothan 𝑑𝑋𝑡(1)

= 𝜎(1)𝑋𝑡(1)

𝑑𝑊𝑡

29

Factor 2 CIR 𝑑𝑋𝑡(2)

= 𝐾(2)(𝜃(2) − 𝑋𝑡(2)

)𝑑𝑡 + 𝜎(2)√|𝑋𝑡(2)

|𝑑𝑊𝑡

Factor 3 Vasicek 𝑑𝑋𝑡(3)

= 𝐾(3)(𝜃(3) − 𝑋𝑡(3)

)𝑑𝑡 + 𝜎(3)𝑑𝑊𝑡

Factor 4 Longstaff 𝑑𝑋𝑡(4)

= 𝐾(4) (𝜃(4) − √|𝑋𝑡(4)

| ) 𝑑𝑡 + 𝜎(4)√|𝑋𝑡(4)

|𝑑𝑊𝑡

Factor 5 Geometric

Brownian Motion 𝑑𝑋𝑡(5)

= 𝜃(5) 𝑋𝑡(5)

𝑑𝑡 + 𝜎(5)𝑋𝑡(5)

𝑑𝑊𝑡

According to the credit quality migration matrix in real life, the rating grade of most of the clients in portfolios is stable, showing the maximum probability for remaining in a grade rather than migration. In terms of those migrations, clients with relatively high grades or low grades, such as R0 and 𝑅9, are more likely to move towards the center, such as 𝑅5. Given the clue above, we can determine the parameters listed in Table 11 by treating factors of low risky clients and factors of high risky clients separately. All the numbers are included in Table 12.

Table 12. The parameters chosen for stochastic differential equations of every factor

𝑹𝟎 − 𝑹𝟒 (Factor 1 to 5) 𝑹𝟓 − 𝑹𝟗(Factor 1 to 5)

𝑲(𝒊) [𝑁𝐴𝑁 0.01 0.05 0.5 0.05] [𝑁𝐴𝑁 0.1 0.1 0.15 0.05]

𝜽(𝒊) [𝑁𝐴𝑁 3.35 1.6 3 2] [𝑁𝐴𝑁 3.35 1.2 2.5 2]

𝝈(𝒊) [0.05 0.3 0.3 0.2 0.1] [0.05 0.3 0.3 0.2 0.1]

From Table 12, the speed of centralization is restricted by keeping the 𝐾(𝑖), 𝑖 = 1,2,3,4,5 small, and the mean of a factor stochastic process is tailored to have clients moved towards the center. Figure 7 Figure 8 Figure 9 Figure 10 and Figure 11 illustrate some examples of stochastic processes in months, whose increments follow the stochastic differential equations (Table 11), respectively.

Figure 7. The stochastic process based on Dothan SDE

Figure 8. The stochastic process based on Vasicek SDE

30

Figure 9. The stochastic process based on CIR SDE

Figure 10. The stochastic process based on Longstaff SDE

Figure 11. The stochastic process based on Geometric Brownian Motion

Note that the parameter vector β of true PD model (5.1) is set to be [0.1,0.5,1,0.5,0.25], which means the value of factor 3 will show a relatively bigger impact on the migration of credit rating while the movements of factor 5 will not affect much. From Figure 8, we can

have an intuitive understanding of the convergence rate of factor 3 approaching θ(3). After 50 months, the value of factor 3 will be stable and will stay mean-reverting.

5.1.4 Simulation of Credit Rating Migration

For the first month dataset of this artificial bank, the clients are all in the 10 performing buckets at the beginning, and some of them will be migrated to the defaulted state at the end of the first month. Defaulting clients may return to a performing state. This is called a ‘cure’. In our model, we introduce an intermediate cure state before clients are returned to a normal performing state. Clients can remain in a defaulted state for some time. During this time clients can either cure or be liquidated, i.e., collateral or guarantees are used to recover (part of) the exposure and the relationship with the client is ended. In our model we assume three successive default states after which a client is either cured or liquidated.

Three defaulting states are added into the internal rating system of the simulated artificial bank: default in 1 year (𝐷𝑌1) default in 2 year (𝐷𝑌2) and default in 3 year (𝐷𝑌3). Once clients default, they will be relabeled by 𝐷𝑌1. Some 𝐷𝑌1 customers will be cured, and some will go directly to liquidation. The rest who remain in default for the whole year will automatically migrate to 𝐷𝑌2. The longer a client is in default, the less likely he or she is to be cured. The

31

state space is now completed with 10 performing states, 3 defaulting states, a cured state (𝐶), and a liquidation state (𝐿𝑖𝑞). To make a brief illustration of all possible transitions among the states in the internal rating system, the 10 performing buckets are reduced to 3, as shown in Figure 11.

Figure 12. All possible transitions among all states in the state space (with 10 performing buckets reduced to 3)

From Figure 12, we can see that all the performing states are commuting. However, it is not always the case. For some portfolios, the performing clients can migrate to any state in the internal rating system, while for some other portfolios, they can only migrate to an adjacent performing state.

Clients in all performing states have a certain probability to default, and all clients in defaulting states are possible to get cured or to liquidate. For those who are already cured, it is possible for them to default again. In this case, these clients will be put into 𝐷𝑌1 instead of the defaulting states where they came from.

In order to determine the transition probabilities and thus obtain the migration matrix, the states of the internal rating system are put into three categories: performing states (𝑅0 to 𝑅9), defaulting states (𝐷𝑌1 to 𝐷𝑌3), the cured state (𝐶), and the liquidation state (𝐿𝑖𝑞), and these groups will be treated differently. Here is an example of how the credit quality migrate from the 𝑖-th month ( 𝑡 = 𝑖) to the 𝑖 + 1-th month. In this 𝑖-th dataset, the clients’ credit grades and whether they are defaulters at the end of that month are recorded.

For the clients remaining in the performing states at the end of the 𝑖-th month, all their the factor values drift according to the drift functions demonstrated in section 5.1.3 at the beginning of 𝑖 + 1-th month, based on which the updated credit grades can be calculated by

32

‘true’ PD model as described in section 5.1.2. After that, part of them will be migrated to 𝐷𝑌1 according to the Bernoulli simulation results with associated bucket PDs as parameters. Once a client defaults, the factor values will immediately stop drifting, and hold the values where he or she gets defaulted. Only when this client is cured and moving back to the performing states will the factor values drift again.

For the clients in defaulting states at the end of the 𝑖-th month, they are with probabilities [0.7 , 0.7] respectively to be automatically moved to the next defaulting state (𝐷𝑌3 are not commuting to 𝐷𝑌1 and 𝐷𝑌2), with decreasing probabilities [0.25 , 0.15 , 0.05] to get cured, and with increasing probabilities [0.05 , 0.15 , 0.95] to move to liquidate.

For the cured clients, there would be a probability of 0.45 for them to default again to return to 𝐷𝑌1, and a probability of 0.55 to return to performing states. The clients who are already in liquidation at the end of 𝑖 -th month, will be removed from the portfolio. In order to maintain a stable total amount of clients who are in the portfolio, new clients are accepted at the beginning of 𝑖 + 1-th month, to keep the total number of clients 60,000 in this case. The initial factor values of new clients are generated from normal distributions in Table 7.

The migration probabilities are estimated based on the transition frequencies among every pair in the internal rating system, and can be written as

𝑝𝑚𝑛 =𝐴𝑚𝑛

𝑖+1

𝐴𝑚𝑖

, (5.4)

where 𝑝𝑚𝑛 means the transition probability from state ‘𝑚’ to state ‘𝑛’; 𝐴𝑚𝑖 represents the

amount of clients in state ‘𝑚’ at end of 𝑖-th month; 𝐴𝑚𝑛𝑖+1 represents the number of clients

who migrate from state ‘𝑚’, at the end of the 𝑖-th month, to state ‘𝑛’, at the end of the 𝑖 + 1-th month. The liquidation state is an absorbing state, which means all the clients in 𝐿𝑖𝑞 state have a probability 1 to stay at the liquidation state. The datasets for 250 months are generated in order to check the convergence of those migration matrices. The average migration matrix is shown in Figure 13. In order to check the convergence of all migration matrices obtained in the recursive algorithm above, the variance of probabilities in every cell is computed in Figure 14. since the variances of all cells are less than 0.02, one can conclude that the migration matrix of the artificial bank will converge after 250 months. The stationary distribution of numbers of clients in every state in the internal rating system is shown in Figure 15.

From Figure 13, we can observe the centralization tendency for all clients in performing credit buckets. Meanwhile, the credit quality of performing clients is relatively stable, which means clients will tend to stay at their state rather than migrate very often. As mentioned above, the frozen factor values of default clients will drift again as long as they are cured and move back to the performing states, so these clients are returning to the very performing state in which they defaulted rather than being randomly assigned to a credit grade. Logically, clients in 𝑅9 are more likely to default, so the transition probability from 𝐶 to 𝑅9 larger than that from 𝐶 to other performing states. However, the biggest probability in terms of the row for cured state 𝐶 are located at the middle and the values are declining to both sides. This can be explained by Figure 15 as most of the performing clients are located in the middle of the bucketing system, and that the number of clients drops as credit grades spread from the center to both sides.

33

Figure 13. The 'true' migration matrix for the artificial bank

Figure 14. The variances of all cells in migrations matrixes for 250 months

34

Figure 15. The stationary distribution of all states in the state space

5.2 Validation Methods Implementation

In section 5.1, we simulate a ‘true’ information dataset with ‘true’ credit grades and a ‘true’ migration matrix, which can be a benchmark for the validation implementing results. Now, assume that all the ‘true’ information is hidden except the factor values. in this section, 5 PD prediction models are built up based on different subsets of factors and we aim to investigate whether the validation methods elaborated in Chapter 4 can distinguish the good PD prediction model from the bad ones.

All the PD prediction models in this section are in the form of a sigmoid function, with the perfect model based on full information of all factors and the low-quality model based on partial information of fewer factors. In this section, the calibration quality of all models is tested on both bucket level and model level. Then, after computing the ‘perfect’ AR as the benchmark for this artificial bank, the AR (Gini) of each PD model is calculated to check the discriminatory power of all models. Section 5.2.1 and 5.2.2 gives the results of a good model and bad models separately.

5.2.1 Logistic Regression Model with Full Information

5.2.1.1 The calibration quality of the full information model

The dataset for a month, as shown in Table 10, consists of the factor values and the credit grades at the beginning of that month, and the default results during that month. In terms of the PD prediction model fitting, the transitions from a performing state to a default state during a month are more interesting than the transitions starting from a default state, cured state, or liquidated state. Thus, we only consider the transitions start from a performing state within a month. Suppose that we are predicting the credit grades for clients who didn’t default during the 189-th month. After data-processing, the estimated parameters of the sigmoid function are

[𝛽, 𝛽0] = [0.098 , 0.512 , 1.037 , 0.506 , 0.248 , −0.390]. (5.5)

The values of (5.5) are in line with the values [0.1,0.5,1,0.5, 0.25, − log(1.5)] , which is determined in section 5.1.2. On the bucket level, the binomial test is applied with the

35

combination of the traffic light approach, introduced in Table 3, and the results are given in Figure 16.

Figure 16. The bucket plotting of perfect PD mode based on the full information dataset of the 189-th month. It compares the predicted results and the reality for the 190-th month. ‘N’ represents the number of clients in a grade; ‘D’ represents the actual amount of defaulters for the 190-th month; ‘E(D)’ represents the theoretical expectation of the number of defaulters in terms of bucket PDs; ‘Result’ represents the traffic light indicator whose principles are introduced in Chapter 4.

In Figure 16, rather than giving the p-values of the binomial test, we can use the traffic light approach to make the test results more intuitive to understand. There is only one ‘optimistic’ light, which can be understood as the acceptable bias given the significance level 𝛼 = 0.1 for orange light, because the possibility of observing at least one orange light is

𝑃(at least one orange light) = 1 − 0.910 = 0.65,

which is not small. For the rest of the buckets, the actual total number of defaulters are in line with their theoretical expectation. Also, the whole internal rating system can be regarded as accurate.

In order to reduce the stochastic effects which are brought by the SDE chosen for drifting factor values, we can repeat the binomial test above given significance level 0.1 for the last 50 months, to see whether the actual number of defaulters is significantly different from the predicted expectations or not. By averaging these 50 groups of binomial test p-values for each bucket, as shown in Table 13, according to Table 1, we can conclude that the perfect PD prediction model based on full information can be considered accurate on all buckets.

36

Table 13. The average p-values of the binomial test on bucket level for the last 50 months dataset (200-th month to 250-th month)

Credit buckets 𝑹𝟎 𝑹𝟏 𝑹𝟐 𝑹𝟑 𝑹𝟒

Average p-values 0.53 0.45 0.48 0.49 0.43

Credit buckets 𝑅5 𝑅6 𝑅7 𝑅8 𝑅9

Average p-values 0.45 0.45 0.46 0.47 0.50

Figure 17. The bucket plotting of a combination of the last 50 months prediction datasets. ‘N’ represents the number of clients in a grade; ‘D’ represents the actual amount of defaulters for the 190-th month; ‘E(D)’ represents the theoretical expectation of the number of defaulters in terms of bucket PDs; ‘Result’ represents the traffic light indicator whose principles are introduced in Chapter 4.

If we combine all the predictions of credit grades for the last 50 month datasets and apply the binomial test on this bigger dataset, the bucket plotting with the traffic light approach is shown in Figure 17. The internal rating system is considered accurate in Figure 17. It is also not surprising to witness the biases on several buckets, as the probability of observing at least one orange warning light is 0.65, which is already computed above.

After conducting the calibration test on the bucket level, the Poisson binomial test given a 0.1 significance level is applied as the measure of the performance of the whole internal rating system. Based on the known credit grades at the end of the previous month, the credit rating migrations are predicted and then the p-value of the Poisson binomial test can be computed. The average of the Poisson binomial p-values for the last 50 months is

Average Poisson binomial p − value = 0.4399, (5.6)

37

meaning that the internal rating system can be considered accurate with significance level 0.1. The traffic light indicator for the p-value of the Poisson binomial test is also added in the bucket plotting as the bar at the bottom (Figure 19, Figure 20).

5.2.1.2 The discriminatory power of the full information PD model

Since AR(Gini) is portfolio dependent [14][15][16], a benchmark of perfect AR(Gini) is needed in order to compare the discriminatory power of a good-quality model with full information and that of the low-quality model with partial information. The ‘perfect’ AR(Gini) can be calculated when the actual number of defaults in each bucket is around the theoretical expectation of defaulter amount, that is,

Expected Number of Defaulters = 𝑁(𝑖) × 𝑃𝐷(𝑖), (5.7)

where 𝑁(𝑖) represents the total number of clients in bucket 𝑖; 𝑃𝐷(𝑖) represents the bucket PD for bucket 𝑖.

In this case, all the datasets are defined by the ‘true’ PD model, so in terms of ‘true’ PD buckets assignment of all clients, the total number of defaulters is exactly in line with the theoretical expectation based on bucket PDs. Moreover, no estimated PD model can perform better than the ‘true’ PD model (5.1), so the AR (Gini) of ‘true’ PD is selected as the ‘perfect’ Gini benchmark. The PD prediction model only yields a probability instead of a deterministic classification label, the stochastic effects can be reduced by taking the average of several runs of ARs (Ginis). Again, provided the datasets of the last 50 months, we can obtain the average AR(Gini):

𝐴𝑅𝑏𝑒𝑛𝑐ℎ𝑚𝑎𝑟𝑘 = 0.3939. (5.8)

The AR(Gini) of the full information PD prediction model, which is trained based on the dataset of non-defaulters at 189-th month, is shown in Figure 18.

Figure 18. The CAP and AR (Powerstat) based on the dataset of 189-th month

By repeating the AR computation for the datasets of the last 50 months, the average AR is

𝐴𝑅𝑎𝑣𝑒𝑟𝑎𝑔𝑒 50 = 0.3851. (5.9)

It can be concluded that the discriminatory power of the full information model is slightly smaller than that of the benchmark ‘true’ PD prediction model, which is in line with the actual situation.

38

5.2.2 Logistic Regression Model with Partial Information

Here, the number of factors, that are used to estimate the parameters of the PD prediction model, declines step by step, from 5 to 1, finally producing 5 PD models with different prediction quality. The Poisson binomial test results and discriminatory power test results for all the models are recorded in Table 14.

Table 14. The average Poisson binomial p-value, the average AR(Gini), and the std of Gini based on last 50 month datasets

Factor values used to train the model

Factor 1 Factor 1 to 2 Factor 1 to 3 Factor 1 to 4 Full info

Average Poisson binomial test p-value

0.0095 0.2253 0.3029 0.4310 0.4399

Average GINI scores for last 50 iterations

0.0640 0.1655 0.2905 0.3413 0.3851

Standard deviation of GINI for last 50 iterations

0.0066 0.0055 0.0076 0.0083 0.0078

From Table 14, we can see that the Poisson binomial test p-values, Average AR (Gini) are decreasing with a declining number of factors. The internal rating system is suggested to be accurate when based on more than two types of factor values, while the null hypothesis (Table 2) is rejected when only factor 1 is available for the parameter estimation of a PD prediction model. The standard deviation of Gini for the last 50 iterations is stable.

Figure 19. The bucket plotting for the PD prediction model trained based on factor 1 to factor 3 in the 189-th month dataset. ‘N’ represents the number of clients in a grade; ‘D’ represents

39

the actual amount of defaulters for the 190-th month; ‘E(D)’ represents the theoretical expectation of the amount of defaulters in terms of bucket PDs; ‘Result’ represents the traffic light indicator whose principles are introduced in Chapter 4.

Here, the bucket plotting in Figure 19 for predicted credit grades based on the 189-th month dataset is given as an example of the backtest on the bucket level. The PD prediction model, in this case, is trained only based on factor 1 to factor 3. The complete set of bucket plots with a declining number of factors are shown in the appendix.

Compared with the bucket plot of the calibration quality of the full information PD prediction model (Figure 16), this plot shows more orange and red warning traffic lights, while the Poisson binomial test for the whole group is suggesting this internal rating system to be unbiased. This is because, for some bucket, the actual total number of clients are above the predicted number, while some of buckets are showing the opposite. Also, the clients are forecasted to only spread in 9 credit grades instead of all 10 credit grades, with no clients witnessed in 𝑅9.

However, as shown in Figure 20, when we only consider factor 1 to build up the PD prediction model, the bias will be also observed even in terms of the whole internal rating system. By repeating the binomial test on PD prediction models that are built based on different datasets, we can see that these PD models tend to put clients into one of the two buckets, with the first bucket presenting an optimistic warning and the second one being accurate sometimes. Moreover, the overall binomial test also gives an optimistic red flag warning. This is a systematic problem, which can be explained by the simulation rules set in section 5.1. The values of factor 1 are generated from the 𝑁(1,1) normal distribution, whose variance is relatively smaller than that of the rest of the factors. After averaging every factor value in terms of buckets, we have Table 15, where the means of factor 1 increase slowly. This non-monotonous increase produces the error while fitting the PD prediction model, which can be considered as the cause of this problem. Besides, the coefficient of factor 1 in the ‘true’ PD model is 0.1, making factor 1 less important while bucketing clients.

According to the stationary ‘true’ credit grades distribution in Figure 15, more clients are seen to be in middle buckets rather than the low risky buckets. Thus, there would be more clients, whose PDs are underestimated rather than overestimated, making the calibration of the first bucket, 𝑅3 in Figure 20, and the calibration of all buckets optimistic.

Table 15. The average of factor values in terms of buckets with index meaning 𝑅0 to 𝑅9

40

Figure 20. The bucket plotting for the PD prediction model trained only with factor 1 in the 159-th month dataset. ‘N’ represents the number of clients in a grade; ‘D’ represents the actual amount of defaulters for the 190-th month; ‘E(D)’ represents the theoretical expectation of the number of defaulters in terms of bucket PDs; ‘Result’ represents the traffic light indicator whose principles are introduced in Chapter 4.

5.3 Chapter Summary

In Chapter 5, after setting a ‘true’ PD prediction model, we built up a bucketing system. Using the drift functions, the credit grades of performing clients were migrated for 250 continuous months, yielding the 249 migration matrices. The stability of those migration matrices was confirmed since the standard deviation of transition probabilities of each cell was less than 0.02.

By building up the PD prediction models based on different subsets of full information, the validation methodologies elaborated in Chapter 4 were applied to see whether these methods can distinguish the good prediction models from the bad ones. The less the information, the smaller average Gini and average Poisson binomial p-values of the last 50 months dataset were obtained, and the model was considered with no calibration correctness and no discriminatory power when the factor 1 was the only available factor. Thus, we can conclude that the validation methodologies in Chapter 4 can properly monitor the calibration quality and discriminatory power of a target PD prediction model.

41

6 Hidden Markov Model Methodology (HMM)

In the previous chapter, we simulated the dataset of an artificial bank on which the validation methodologies elaborated in Chapter 4 were implemented. From the figures we obtain at the end of Chapter 5, we can conclude that the validation methods work out and can provide helpful suggestions for the quality of PD prediction models. in this chapter, the theoretical dynamics of Hidden Markov Models will be explained. The setup of HMM is based on Malgorzata [27] and Elliot [28]. After that, the applicability of the hidden Markov model in the field of backtesting will be discussed in Chapter 7.

The Hidden Markov Model (HMM) is originally a parameter estimation and pattern recognition technology proposed by Baum and Egon [20]. The modeling process consists of two stochastic processes: a hidden stochastic process in the form of a general Markov chain, which is hidden, and an observable stochastic process of output signals. We describe the transition probability among hidden states by transition matrix and the correspondence between states and observed variables by emission matrix. The goal of the Hidden Markov Model is to estimate the number of hidden states and the transition matrix given the observed data sequences and then to predict the signals in the future.

Hidden Markov Models are widely used in various fields: aerospace, weather forecasting, underground survey, medical CT imaging, and so on. The best-known application of HMM models in engineering is speech recognition (Jelinek [21]; Rabiner [22]) and Biological sequence analysis (Durbin [23]). Besides, in term of the financial industry, it is also found useful in volatility analysis of financial markets (Schaller and Van Norden [24]); the short - and long-term effects of stock market returns (Bhar andHamori [25]); analysis of interest rate term structure (Gray [26]), etc.

6.1 Setup of Hidden Markov Model

This section is based on the research of Malgorzata Wiktoria [27].

First, we define a probability space (Ω, ℱ, ℙ), where Ω is ….; ℱ represents the filtration that contains a sequence of 𝜎-algebra {ℱ𝑘}, 𝑘 ∈ ℕ; ℙ represents a certain probability measure in the real world. The hidden states are described by a finite-state, homogeneous, discrete-time Markov chain {𝑋𝑘}, 𝑘 ∈ ℕ, with given initial value 𝑋0 or the distribution of 𝑋0. The state space of {𝑋𝑘} can be written as

𝑆𝑋 = {1,2,3, … , 𝑁},

where 𝑁 is the number of all the hidden states. Without the loss of generality, the state space 𝑆𝑋 can be identified with a series of unit vectors

𝑆𝑋 = {𝑒1, 𝑒2, … , 𝑒𝑁},

where 𝑒𝑖 = {0, … ,0,1,0, … ,0}′ ∈ ℝ𝑁.

Write the 𝜎-algebra ℱk = 𝜎(𝑋0, 𝑋1, … , 𝑋𝑘), a record of all possible history information from 𝑋0 to 𝑋𝑘. The following Markov property

42

ℙ(𝑋𝑘+1 = 𝑒𝑗|ℱ𝑘 ) = ℙ(𝑋𝑘+1 = 𝑒𝑗|𝑋𝑘) (6.1)

holds. Also, the transition matrix can be represented by

𝐴 = (𝑎𝑗𝑖)1≤𝑗,𝑗≤𝑁∈ ℝN×N,

where

𝑎𝑗𝑖 = ℙ(𝑋𝑘+1 = 𝑒𝑗|𝑋𝑘 = 𝑒𝑖).

Then by Markov property shown in equation (6.1), we have[28]

𝔼[𝑋𝑘+1|ℱ𝑘] = 𝔼[𝑋𝑘+1|𝑋𝑘] = 𝐴𝑋𝑘 . (6.2)

Define

𝑉𝑘+1 ≔ 𝑋𝑘+1 − 𝐴𝑋𝑘 . (6.3)

We can obtain the state equation

𝑋𝑘+1 = 𝐴𝑋𝑘 + 𝑉𝑘+1. (6.4)

Thus, taking the conditional expectation and considering the equation (6.2), it holds that

𝔼[𝑉𝑘+1|ℱ𝑘] = 𝔼[𝑋𝑘+1 − 𝐴𝑋𝑘|𝑋𝑘] = 𝐴𝑋𝑘 − 𝐴𝑋𝑘 = 0,

yielding the sequence {𝑉𝑘}, 𝑘 ∈ ℕ is of martingale increments [28].

After building the true credit quality equation (6.4), we are going to set up the equation for observation stochastic process {𝑌𝑘}, 𝑘 ∈ ℕ.

6.1.1 One unit delay Hidden Markov Model

This section is based on [28].

Suppose that Markov chain {𝑋𝑘}, 𝑘 ∈ ℕ is unobservable. We can treat the signal processes {𝑌𝑘}, 𝑘 ∈ ℕ as the function value of {𝑋𝑘}, that is

𝑌𝑘+1 = 𝑔(𝑋𝑘 , 𝑤𝑘+1) 𝑘 ∈ ℕ, (6.5)

where 𝑔(∙) is a function with finite range; {𝑤𝑘} in equation (6.5) is a sequence of independent, identically distributed random variables, which is seen as the noise in this case. In addition, sequences {𝑉𝑘} and {𝑤𝑘} are mutually independent. In this function, there will be one unit delay of 𝑌𝑘+1 in terms of the hidden state 𝑋𝑘.

Suppose that the function values of function 𝑐(∙) are integers of the range 0 to 𝑀. We can write the state space for {𝑌𝑘}, k ∈ ℕ as a series of unit vectors, that is

𝑆𝑌 = {𝑓1, 𝑓2, … , 𝑓𝑀},

where 𝑓𝑖 = (0, … ,0,1,0, … ,0)′ ∈ ℝ𝑀.

Define the new filtration

𝒴𝑘 = 𝜎{𝑌0, 𝑌1, … , 𝑌𝑘},

𝒢𝑘 = 𝜎{𝑋0, 𝑋1, … , 𝑋𝑘 , 𝑌0, 𝑌1, … , 𝑌𝑘},

where {𝒴𝑘}, 𝑘 ∈ ℕ contains all possible histories of {𝑌𝑘}, 𝑘 ∈ ℕ ; {𝒢𝑘}, 𝑘 ∈ ℕ contains all possible histories of both {𝑋𝑘} and {𝑌𝑘}, 𝑘 ∈ ℕ. 𝒴0 ⊂ 𝒴1 ⊂ ⋯ ⊂ 𝒴𝑘, and 𝒢0 ⊂ 𝒢1 ⊂ ⋯ ⊂ 𝒢𝑘.

43

Obviously, 𝑌𝑘+1 is ℱ𝑘-measurable. The state space of the hidden Markov chain {𝑋𝑘}, 𝑘 ∈ ℕ and that of signals observed states {𝑌𝑘}, 𝑘 ∈ ℕ are both finite and discrete.

According to equation (6.5), we have

ℙ(𝑌𝑘+1 = 𝑓𝑗|𝒢𝑘) = ℙ(𝑌𝑘+1 = 𝑓𝑗|𝑋𝑘),

so the emission matrix of HMM can be set as

𝐶 = (𝑐𝑗𝑖) ∈ ℝ𝑀×𝑁 ,

where

𝑐𝑗𝑖 = ℙ(𝑌𝑘+1 = 𝑓𝑗|𝑋𝑘 = 𝑒𝑖)

and

∑ 𝑐𝑗𝑖

𝑀

𝑗=1

= 1, 𝑐𝑗𝑖 ≥ 0, 0 ≤ 𝑗 ≤ 𝑀, 1 ≤ 𝑖 ≤ 𝑁.

Then

𝔼[𝑌𝑘|𝑋𝑘] = 𝐶𝑋𝑘 . (6.6)

Write

𝑊𝑘+1 ≔ 𝑌𝑘+1 − 𝐶𝑋𝑘 . (6.7)

Similarly, taking the expectation of 𝑊𝑘+1 and noting (6.7), we have

𝔼[𝑊𝑘+1|𝒢𝑘] = 𝔼[𝑌𝑘+1 − 𝐶𝑋𝑘|𝑋𝑘] = 𝐶𝑋𝑘 − 𝐶𝑋𝑘 = 0,

yielding that {𝑊𝑘}, 𝑘 ∈ ℕ is also with (ℙ, 𝒢𝑘) martingale increments.

𝑌𝑘 can be rewritten as

𝑌𝑘+1 = 𝐶𝑋𝑘 + 𝑊𝑘+1, (6.8)

where the term 𝑊𝑘+1 can be considered as the noise of the signal with known 𝒢𝑘. Note that the correlation of noise terms {𝑉𝑘} and {𝑊𝑘}, 𝑘 ∈ ℕ can be discussed in future research. Here we assume that these two noise random variables are independent of each other.

Before building the Hidden Markov Model, we need to introduce a notation and a lemma first.

Notation 6.1[28] Write 𝑌𝑘𝑖 = ⟨𝑌𝑘 , 𝑓𝑖⟩ so 𝑌𝑘 = (𝑌𝑘

1, … , 𝑌𝑘𝑀), 𝑘 ∈ ℕ. For each 𝑘 ∈ ℕ, exactly one

component is equal to 1, the remainder being 0.

It holds that

∑ 𝑌𝑘𝑖

𝑀

𝑖=1

= 1.

Write

𝑐𝑘+1𝑖 = 𝔼[𝑌𝑘

𝑖|𝒢𝑘] = ∑ 𝑐𝑗𝑖⟨𝑒𝑗 , 𝑋𝑘⟩

𝑁

𝑗=1

, 𝑐𝑘 = (𝑐𝑘1, … , 𝑐𝑘

𝑀)′,

then the following equation holds

44

𝑐𝑘+1 = 𝔼[𝑌𝑘+1|𝒢𝑘] = 𝐶𝑋𝑘 .

Here we should assume that [28]

𝑐𝑘𝑖 > 0, 1 ≤ 𝑖 ≤ 𝑀, 𝑘 ∈ ℕ.

Lemma 6.2[28] With 𝑑𝑖𝑎𝑔(𝑧) denoting the diagonal matrix with vector 𝑧 on its diagonal, we have

𝑉𝑘+1𝑉𝑘+1′ = 𝑑𝑖𝑎𝑔(𝐴𝑋𝑘) + 𝑑𝑖𝑎𝑔(𝑉𝑘+1) − 𝐴𝑑𝑖𝑎𝑔(𝑋𝑘𝐴′) − 𝐴𝑋𝑘𝑉𝑘+1

′ − 𝑉𝑘+1(𝐴𝑋𝑘)′

and

⟨𝑉𝑘+1⟩ ≔ 𝔼[𝑉𝑘+1𝑉𝑘+1′ |ℱ𝑘] = 𝑑𝑖𝑎𝑔(𝐴𝑋𝑘) − 𝐴𝑑𝑖𝑎𝑔𝑋𝑘𝐴′.

Proof of lemma 6.2 can be found in [28].

Now, the model is set up as follows.

One Unit Delay Discrete HMM Under the probability measure ℙ, the true credit quality equation (7.9) and the signal observation equation (6.10) are written as

𝑋𝑘+1 = 𝐴𝑋𝑘 + 𝑉𝑘+1, (6.9)

𝑌𝑘+1 = 𝐶𝑋𝑘 + 𝑊𝑘+1, (6.10)

where 𝐴 = (𝑎𝑗𝑖) represents the transition matrix and 𝐶 = (𝑐𝑗𝑖) represents the emission

probability matrix that satisfy

∑ 𝑎𝑗𝑖

𝑁

𝑗=1

= 1, 𝑎𝑗𝑖 ≥ 0;

∑ 𝑐𝑗𝑖

𝑀

𝑗=1

= 1, 𝑐𝑗𝑖 ≥ 0,

and 𝑉𝑘 and 𝑊𝑘 are both the martingale increments such that

𝔼[𝑉𝑘+1|ℱ𝑘] = 0, 𝔼[𝑊𝑘+1|𝒢𝑘] = 0,

⟨𝑉𝑘+1⟩ ≔ 𝔼[𝑉𝑘+1𝑉𝑘+1′ |𝑋𝑘] = 𝑑𝑖𝑎𝑔(𝐴𝑋𝑘) − 𝐴𝑑𝑖𝑎𝑔(𝑋𝑘𝐴′),

⟨𝑊𝑘+1⟩ ≔ 𝔼[𝑊𝑘+1𝑊𝑘+1′ |𝑋𝑘] = 𝑑𝑖𝑎𝑔(𝐶𝑋𝑘) − 𝐶𝑑𝑖𝑎𝑔(𝑋𝑘𝐶′).

6.1.2 Zero delay Hidden Markov Model

This section is mainly based on Chapter 5 of [29].

Again, we assume that {𝑋𝑘}, 𝑘 ∈ ℕ is not observed directly. We take the signal process {𝑌𝑘}, 𝑘 ∈ ℕ as a function with finite values with the respect to {𝑋𝑘}, 𝑘 ∈ ℕ, that is

𝑌𝑘 = 𝑔∗(𝑋𝑘 , 𝑤𝑘∗), 𝑘 ∈ ℕ, (6.11)

where 𝑔∗(∙) is a finite-valued function; {𝑤𝑘′ }, 𝑘 ∈ ℕ is an independent identically distributed

random variables that are independent of hidden process {𝑋𝑘}, 𝑘 ∈ ℕ. The state space of the observation process in this model is the same as that of one unit delay HMM set above.

Set the emission matrix as

𝐶∗ = (𝑐𝑗𝑖∗ ) ∈ ℝ𝑀×𝑁 ,

45

where

𝑐𝑗𝑖∗ = ℙ(𝑌𝑘 = 𝑓𝑗|𝑋𝑘 = 𝑒𝑖), 𝑐𝑗𝑖

∗ ≥ 0, 1 ≤ 𝑖 ≤ 𝑁, 1 ≤ 𝑗 ≤ 𝑀

and

∑ 𝑐𝑗𝑖∗

𝑀

𝑗=1

= 1.

Then,

𝔼[𝑌𝑘|𝑋𝑘] = 𝐶∗𝑋𝑘 .

Here we still define a series of filtrations {𝒢𝑘}, 𝑘 ∈ ℕ as all possible histories of {𝑋𝑘 , 𝑌𝑘}, 𝑘 ∈ℕ.

Write

𝑊𝑘∗ ≔ 𝑌𝑘 − 𝐶∗𝑋𝑘 . (6.12)

Similarly, taking the expectation of 𝑊𝑘+1 and noting (6.12), we have

𝔼[𝑊𝑘∗|𝒢𝑘−1 ∨ {𝑋𝑘}] = 𝔼[𝑌𝑘+1 − 𝐶∗𝑋𝑘|𝑋𝑘] = 𝐶∗𝑋𝑘 − 𝐶∗𝑋𝑘 = 0 ∈ ℝ𝑀 ,

yielding that {𝑊𝑘∗}, 𝑘 ∈ ℕ is also with (ℙ, 𝒢𝑘−1 ∨ {𝑋𝑘}) martingale increments.

𝑌𝑘 can be rewritten as

𝑌𝑘 = 𝐶∗𝑋𝑘 + 𝑊𝑘∗. (6.13)

Zero Delay Discrete HMM Under the probability measure ℙ, the true credit quality equation (7.9) and the signal observation equation (6.13) can be written as[29]

𝑋𝑘+1 = 𝐴𝑋𝑘 + 𝑉𝑘+1; (6.14)

𝑌𝑘 = 𝐶∗𝑋𝑘 + 𝑊𝑘∗, (6.15)

where 𝐴 = (𝑎𝑗𝑖) represents the transition matrix and 𝐶∗ = (𝑐𝑗𝑖∗ ) represents the emission

probability matrix that satisfy

∑ 𝑎𝑗𝑖

𝑁

𝑗=1

= 1, 𝑎𝑗𝑖 ≥ 0;

∑ 𝑐𝑗𝑖∗

𝑀

𝑗=1

= 1, 𝑐𝑗𝑖∗ ≥ 0,

and 𝑉𝑘 and 𝑊𝑘∗ are both the martingale increments such that

𝔼[𝑉𝑘+1|ℱ𝑘] = 0, 𝔼[𝑊𝑘∗|𝒢𝑘−1 ∨ {𝑋𝑘}] = 0.

6.2 Handling HMM

With these probability definitions in place, the three classical tasks for HMM are discussed, which cover two theoretical and practical issues: parameter estimation and model interpretation. The three tasks are the computation of the probability of obtaining a certain observation sequence, estimation of model parameters, decoding of observation sequence.

46

This section is mainly based on Zheng Rong[30], Jeff Bilmes[31], and Feldman[32]. Note that this section concerns the zero delay Hidden Markov Model.

We denote the observed sequence as 𝑌, and the corresponding hidden sequence as 𝑋. The parameters are represented as a group 𝜆 = (𝐴, 𝐶∗, Π), where 𝐴 is the transition matrix of hidden sequence, 𝐶∗ is the emission matrix of zero delay HMM, Π is the initial distribution of hidden states.

The first task is to compute the probability of obtaining a certain observation sequence ℙ(𝑌|𝜆) , given the parameters 𝜆 = (𝐴, 𝐶∗, Π) . It can be solved by Forward-Backward Algorithms. Then, the model parameter learning is the core of this research, which can be settled by the Baum-Welch Algorithm. Decoding an observation sequence is to predict the most likely hidden state sequence with parameters 𝜆 = (𝐴, 𝐶∗, Π) and observation signal sequence, which can be done by means of the Viterbi Algorithm.

6.2.1 Probability of obtaining a certain observation sequence

This section is based on Zheng Rong’s research [30] and Jeff Bilmes’ research [31]. The aim of this section is to compute the value of ℙ(𝑌|𝜆).

Generally, the probability of obtaining a given observation sequence are computed by a trivial way, that is, by listing all possible hidden state sequences of length 𝑇 + 1 and then deducing all the joint probability ℙ(𝑋, 𝑌|𝜆) . The, probability of obtaining a certain hidden state sequence 𝑋 = {𝑋0, 𝑋1, … , 𝑋𝑇} is

ℙ(𝑋|𝜆) = 𝜋𝑋0𝑎𝑋1𝑋0

𝑎𝑋2𝑋1… 𝑎𝑋𝑇𝑋𝑇−1

. (6.16)

Then, the probability of getting the corresponding observation sequence is

ℙ(𝑌|𝑋, 𝜆) = 𝑐𝑌0𝑋0

∗ 𝑐𝑌1𝑋1

∗ … . 𝑐𝑌𝑇𝑋𝑇

∗ . (6.17)

We can simply compute the joint probability of 𝑌 and 𝑋 as follows

ℙ(𝑋, 𝑌|𝜆) = ℙ(𝑋|𝜆)ℙ(𝑌|𝑋, 𝜆) = 𝜋𝑋0𝑎𝑋1𝑋0


𝑐𝑌0𝑋0

∗ 𝑐𝑌1𝑋1


∗ . (6.18)

The marginal probability of witnessing 𝑌 with given parameters 𝜆 will be

ℙ(𝑌|𝜆) = ∑ ℙ(𝑋, 𝑌|𝜆)𝑋 = ∑ 𝜋𝑋0𝑎𝑋1𝑋0


𝑐𝑌0𝑋0

∗ 𝑐𝑌1𝑋1


∗𝑋0,𝑋1 ,…,𝑋𝑇

. (6.19)

The method above is effective but inefficient when the number of hidden states 𝑁 is very large. Under this circumstance, we need to consider 𝑁𝑇+1 different sequences of hidden states, which means the time complexity of this trivial algorithm would be with the order of 𝑂((𝑇 + 1)𝑁𝑇+1). By contrast, a forward-backward algorithm is recommended when facing a big number of hidden states.

The forward-backward algorithm is the collective name of the forward algorithm and the backward algorithm, both of which can be used to obtain the probability of the HMM observation sequence. Let us first focus on how the forward algorithm solves this problem.

6.2.1.1 Forward Algorithm

The forward algorithm is essentially a dynamic programming algorithm. We need to find the formula of local state recursion so that we can expand from the optimal solution of the sub-problem to the optimal solution of the entire problem step by step.

47

In order to define the local state of this dynamic programming, we first introduce the ‘forward probability’.

Definition 6.3 The value of hidden state at time 𝑡 is 𝑋𝑡 = 𝑒𝑖 ∈ 𝑆𝑋 , then the probability of having the sequence {𝑌0, 𝑌1, … , 𝑌𝑡} are defined as the forward probability, that is

𝛼𝑡(𝑖) = ℙ(𝑌0, 𝑌1, … , 𝑌𝑡 , 𝑋𝑡 = 𝑒𝑖|𝜆), (6.20)

where 𝑒𝑖 is an element in state space 𝑆𝑋 .

Now we assume that we have found the forward probability of each hidden state at time 𝑡, and then we need to derive the forward probability of each hidden state at time 𝑡 + 1. The relationship between the forward probability 𝛼𝑡(𝑖) and the value of the next hidden state is illustrated in Figure 21.

Figure 21 The recursion plotting of the forward algorithm

From Figure 21, we can see that the value of 𝛼𝑡(𝑗)𝑎𝑖𝑗 represents the probability of having an

observation sequence {𝑌0, 𝑌1, … , 𝑌𝑡} until time 𝑡, and 𝑋𝑡 = 𝑒𝑖, that is

𝛼𝑡(𝑗)𝑎𝑖𝑗 = ℙ(𝑌0, 𝑌1, … , 𝑌𝑡 , 𝑋𝑡 = 𝑒𝑗 , 𝑋𝑡+1 = 𝑒𝑖|𝜆),

which is represented as a line in Figure 21.

Sum up the probabilities represented by all lines Figure 21, we can have

∑ 𝛼𝑡(𝑗)𝑎𝑖𝑗

𝑁

𝑗=1

= ℙ(𝑌0, 𝑌1, … , 𝑌𝑡 , 𝑋𝑡+1 = 𝑒𝑖|𝜆). (6.21)

Then, the probability of obtaining the observation sequence of {𝑌0, 𝑌1, … , 𝑌𝑡+1}

ℙ(𝑌0, 𝑌1, … , 𝑌𝑡 , 𝑋𝑡+1 = 𝑒𝑖|𝜆) = [∑ 𝛼𝑡(𝑗)𝑎𝑖𝑗

𝑁

𝑗=1

] 𝑐𝑌(𝑡+1)𝑒𝑖

∗ , (6.22)

where 𝑌𝑡+1 is the random variable of the observation state at time 𝑡 + 1; 𝑒𝑖 is the known value of state 𝑋𝑡.

48

This dynamic programming starts at time 0 and ends at time 𝑇. Since 𝛼𝑇(𝑖) represents the probability of witnessing the sequence {𝑌0, 𝑌1, … , 𝑌𝑇} with 𝑋𝑇 = 𝑒𝑖 at time 𝑇. The probability of having the sequence {𝑌0, 𝑌1, … , 𝑌𝑇} is

ℙ(𝑌0, 𝑌1, … , 𝑌𝑇|𝜆) = ℙ(𝑌|𝜆) = ∑ 𝛼𝑇(𝑖)

𝑁

𝑖=1

, (6.23)

where 𝑁 is the number of the hidden states.

6.2.1.2 Backward Algorithm

The backward algorithm is very similar to the forward algorithm. Both of these algorithms apply the idea of dynamic programming. The differences between them are that the backward algorithm is based on another selection of local states and a ‘backward probability’.

Definition 6.4 The value of hidden state at time 𝑡 is 𝑋𝑡 = 𝑒𝑖 ∈ 𝑆𝑋 , then the probability of having the sequence {𝑌𝑡+1, 𝑌𝑡+2, … , 𝑌𝑇} are defined as the backward probability, that is

𝛽𝑡(𝑖) = ℙ(𝑌𝑡+1, 𝑌𝑡+2, … , 𝑌𝑇|𝑋𝑡 = 𝑒𝑖 , 𝜆), (6.24)

where 𝑒𝑖 is an element in state space 𝑆𝑋 .

The dynamic programming recursive formula of backward probability is different from that of forward probability, which is shown in Figure 22.

Figure 22. The recursion plotting of backward algorithm

Assume that we have found the backward probabilities 𝛽𝑡+1(𝑗) of all hidden states at time 𝑡 + 1, we first compute

ℙ(𝑌𝑡+2, 𝑌𝑡+3, … , 𝑌𝑇 , 𝑋𝑡 = 𝑒𝑖 , 𝑋𝑡+1 = 𝑒𝑗|𝜆) = 𝑎𝑗𝑖𝛽𝑡+1(𝑗). (6.25)

Then,

ℙ(𝑌𝑡+1, 𝑌𝑡+2, … , 𝑌𝑇 , 𝑋𝑡 = 𝑒𝑖|𝜆) = 𝑎𝑗𝑖𝑐𝑌𝑡+1𝑒𝑗

∗ 𝛽𝑡+1(𝑗), (6.26)

which corresponds to a line in Figure 22. Sum up all the probabilities, we obtain

ℙ(𝑌𝑡+1, 𝑌𝑡+2, … , 𝑌𝑇|𝜆) = ∑ 𝑎𝑗𝑖𝑐𝑒𝑗𝑌𝑡+1

∗ 𝛽𝑡+1(𝑗)

𝑁

𝑗=1

, 𝑖 = 1, … , 𝑁. (6.27)

49

Thus,

ℙ(𝑌|𝜆) = ℙ(𝑌0, 𝑌1, … , 𝑌𝑇|𝜆) = ∑ 𝜋𝑖𝑐𝑌0𝑒𝑖

∗ 𝛽0(𝑖)

𝑁

𝑖=1

, (6.28)

where 𝑁 is the number of the hidden states.

6.2.1.3 Some commonly used probabilities of HMM

According to forward probability and backward probability, we can derive the probability formula of a single state and joint states in HMM.

First, we can have

𝛾t(𝑖) = ℙ(𝑋𝑡 = 𝑒𝑖|𝑌, 𝜆) =ℙ(𝑋𝑡 = 𝑒𝑖 , 𝑌|𝜆)

ℙ(𝑌|𝜆)=

𝛼𝑡(𝑖)𝛽𝑡(𝑖)

∑ 𝛼𝑡(𝑗)𝛽𝑡(𝑗)𝑁𝑗=1

. (6.29)

Second,

𝜉𝑡(𝑖, 𝑗) = ℙ(𝑋𝑡 = 𝑒𝑖 , 𝑋𝑡+1 = 𝑒𝑗|𝑌, 𝜆) =ℙ(𝑋𝑡 = 𝑒𝑖 , 𝑋𝑡+1 = 𝑒𝑗 , 𝑌|𝜆)

ℙ(𝑌|𝜆), (6.30)

where

ℙ(𝑋𝑡 = 𝑒𝑖 , 𝑋𝑡+1 = 𝑒𝑗 , 𝑌|𝜆) = 𝛼𝑡(𝑖)𝑎𝑗𝑖𝑐𝑌𝑡+1𝑒𝑗

∗ 𝛽𝑡+1(𝑗).

Thus, equation (6.30) can be rewritten as

𝜉𝑡(𝑖, 𝑗) =𝛼𝑡(𝑖)𝑎𝑗𝑖𝑐𝑌𝑡+1𝑒𝑗

∗ 𝛽𝑡+1(𝑗)

∑ ∑ 𝛼𝑡(𝑟)𝑎𝑠𝑟𝑐𝑌𝑡+1𝑒𝑠

∗ 𝛽𝑡+1(𝑠)𝑁𝑠=1

𝑁𝑟=1

. (6.31)

6.2.2 Estimation of HMM Parameters

In this section, we are going to discuss the estimation of HMM parameters 𝜆 = (𝐴, 𝐶∗, Π), which is considered the most complicated among the three HMM tasks. It can be divided into two cases with given conditions. This section is based on Zheng Rong’s research [30] and Jeff Bilmes’ research[31].

The first case is relatively simple, that is, we already know 𝐷 observation sequences of length

𝑇 + 1 and corresponding hidden state sequences {(𝑋(1), 𝑌(1)), (𝑋(2), 𝑌(2)), … , (𝑋(𝐷), 𝑌(𝐷))}.

In this case, we can obtain the estimated matrixes by means of the maximum likelihood method. The matrixes are estimated according to the transformation frequency of each sequence.

In the real world, however, hidden state sequences are hard to come by. The most common way to solve this problem is the Baum-Welch algorithm, which is based on the EM algorithm. In the era when the Baum-Welch algorithm was proposed, the EM algorithm has not been abstracted, so we still talk about the Baum-Welch algorithm in this section.

6.2.2.1 Pseudocode of Baum-Welch Algorithm

Since the Baum-Welch algorithm has the idea of EM algorithm, there would be two steps of it: E-step and M-step. Here is the pseudocode of the Baum-Welch algorithm.

50

Step 1. Set initial model parameters �̅�;

Step 2. (E-step) Given current parameters �̅�, compute

𝐿(𝜆, �̅�) = ∑ ℙ(𝑋|𝑌, �̅�)log ℙ(𝑋, 𝑌|𝜆)

𝑋

. (6.32)

Step 3. (M-step) Maximize the equation above with the respect to 𝜆 and update the

value of �̅� by

�̅� = argmax𝜆

∑ ℙ(𝑋|𝑌, �̅�)𝑙𝑜𝑔ℙ(𝑋, 𝑌|𝜆)

X

. (6.33)

Step 4. Repeat step 2 to step 3, until a stopping criterion is satisfied.

In the following section, we will deduce the theories of how to specifically update each parameter.

6.2.2.2 Derivation of Baum-Welch Algorithm

We aim to estimate the transition matrix of hidden states 𝐴 and the emission matrix 𝐶, given

the history dataset {(𝑋(1), 𝑌(1)), (𝑋(2), 𝑌(2)), … , (𝑋(𝐷), 𝑌(𝐷))}, where 𝑋(𝑑) = {𝑋0(𝑑)

, … , 𝑋𝑇(𝑑)

},

and 𝑌(𝑑) = {𝑌0(𝑑)

, 𝑌1(𝑑)

, … , 𝑌𝑇(𝑑)

}. In E-step, according to the expression of the joint probability ℙ(𝑋, 𝑌|𝜆) in equation (7.18), we have

ℙ(𝑋, 𝑌|𝜆) = ∏ ℙ(𝑋(𝑑), 𝑌(𝑑)|𝜆)

𝐷

𝑑=1

= ∏ 𝜋𝑋0

(𝑑)𝑎𝑋1

(𝑑)𝑋0

(𝑑) … . 𝑎𝑋𝑇

(𝑑)𝑋𝑇−1

(𝑑) 𝑐𝑌0𝑋0

∗(𝑑)𝑐𝑌𝑇𝑋𝑇

∗(𝑑)

𝐷

𝑑=1

,

where 𝑋, 𝑌 represents {𝑋(1), 𝑋(2), … , 𝑋(𝐷)} and {𝑌(1), 𝑌(2), … , 𝑌(𝐷)} respectively.

In M-step, equation (7.32) would be maximized with the respect to λ. Since

ℙ(𝑌|𝑋, �̅�) =ℙ(𝑋, 𝑌|�̅�)

ℙ(𝑌|𝜆),

where ℙ(𝑌|�̅�) is constant; �̅� is the known current parameters. Then, equation (6.33) can be

rewritten as

�̅� = argmax𝜆

∑ ℙ(𝑋, 𝑌|�̅�)𝑙𝑜𝑔ℙ(𝑋, 𝑌|𝜆)

𝑋

= argmax𝜆

∑ ∑ ℙ(𝑋, 𝑌|�̅�)(𝑙𝑜𝑔𝜋𝑋0

(𝑑) + ∑ 𝑙𝑜𝑔𝑎𝑋𝑡+1

(𝑑)𝑋𝑡

(𝑑)

𝑇−1

𝑡=0

+ ∑ 𝑙𝑜𝑔𝑐𝑋𝑡

∗(𝑑)

𝑇

𝑡=0

)

𝑋

𝐷

𝑑=1

. (6.34)

Therefore, by taking the derivatives of equation (6.34) with respect to 𝐴, 𝐵, Π respectively,

we can get the updated parameters �̅�.

a). Updating 𝚷.

Since Π only appears in the first term of equation (6.34), we can simplify it to equation (6.35) while only considering the derivative with respect to Π. The simplified version is

�̅� = argmax𝜋

∑ ∑ ℙ(𝑋, 𝑌|�̅�)𝑙𝑜𝑔𝜋𝑋0

(𝑑)

𝑋

𝐷

𝑑=1

= argmax𝜋

∑ ∑ ℙ(𝑌, 𝑋0(𝑑)

= 𝑒𝑖|�̅�)𝑙𝑜𝑔𝜋𝑖

𝑁

𝑖=1

𝐷

𝑑=1

, (6.35)

51

where π𝑖 represents the probability ℙ(𝑋0(𝑑)

= 𝑒𝑖), 𝑑 = 1,2, … , 𝐷.

Consider

∑ 𝜋𝑖

𝑁

𝑖=1

= 1.

According to the method of Lagrange multipliers, the Lagrange function is

𝑓�̅� = argmax𝜋

∑ ∑ ℙ(𝑌, 𝑋0(𝑑)

= 𝑒𝑖|�̅�)𝑙𝑜𝑔𝜋𝑖

𝑁

𝑖=1

𝑁

𝑑=1

+ 𝛿 (∑ 𝜋𝑖

𝑁

𝑖=1

− 1) , (6.36)

where 𝛿 is the Lagrangian coefficient. By taking the partial derivative of equation (6.36) with respect to 𝜋𝑖 , yielding

𝜕𝑓�̅�

𝜕𝜋𝑖= ∑ ℙ(𝑌, 𝑋0

(𝑑)= 𝑒𝑖|�̅�)

𝐷

𝑑=1

+ 𝛿𝜋𝑖 = 0, 𝑖 = 1, … , 𝑁, (6.37)

and then the solution of (6.35) can be the recursion formula of Π, that is,

𝜋𝑖 =∑ ℙ(𝑌, 𝑋0

(𝑑)= 𝑒𝑖|�̅�)𝐷

𝑑=1

∑ ℙ(𝑌, �̅�)𝐷𝑑=1

=∑ ℙ(𝑌, 𝑋0

(𝑑)= 𝑒𝑖|�̅�)𝐷

𝑑=1

𝐷ℙ(𝑌, �̅�)

=∑ ℙ(𝑋0

(𝑑)= 𝑒𝑖|𝑌, �̅�)𝐷

𝑑=1

𝐷

=∑ ℙ(𝑋0

(𝑑)= 𝑒𝑖|𝑌

(𝑑), �̅�)𝐷𝑑=1

𝐷

=∑ 𝛾0

(𝑑)(𝑖)𝐷𝑑=1

𝐷. (6.38)

b). Updating 𝑨.

Similar to section a), we only consider the second term of equation (6.34), which we write as

�̅� = ∑ ∑ ∑ ℙ(𝑋, 𝑌|�̅�)𝑙𝑜𝑔𝑎𝑋𝑡,𝑋𝑡+1

𝑇−1

𝑡=1

𝑋

𝐷

𝑑=1

= ∑ ∑ ∑ ∑ ℙ(𝑌, 𝑋𝑡(𝑑)

= 𝑒𝑖 , 𝑋𝑡+1(𝑑)

= 𝑒𝑗|�̅�)𝑙𝑜𝑔𝑎𝑗𝑖

𝑇−1

𝑡=1

𝑁

𝑗=1

𝑁

𝑖=1

𝐷

𝑑=1

, (6.39)

where

𝑎𝑗𝑖 = ∑ 𝑎𝑗𝑖

𝑁

𝑗=1

= 1.

By the method of Lagrange multipliers, we can obtain the recursion formula of 𝑎𝑗𝑖

52

𝑎𝑖𝑗 =∑ ∑ ℙ(𝑌(𝑑), 𝑋𝑡

(𝑑)= 𝑒𝑖 , 𝑋𝑡+1

(𝑑)= 𝑒𝑗|�̅� )𝑇−1

𝑡=1𝐷𝑑=1

∑ ∑ ℙ(𝑌(𝑑), 𝑋𝑡(𝑑)

= 𝑒𝑖|�̅�)𝑇−1𝑡=1

𝐷𝑑=1

=∑ ∑ 𝜉𝑡

(𝑑)(𝑖, 𝑗)𝑇−1𝑡=1

𝐷𝑑=1

∑ ∑ 𝛾𝑡(𝑑)(𝑖)𝑇−1

𝑡=1𝐷𝑑=1

. (6.40)

c). Updating 𝑪∗

Similar to section a), we only consider the third term of equation (6.34), which we write as

�̅�∗ = ∑ ∑ ∑ ℙ(𝑋, 𝑌|�̅�)𝑙𝑜𝑔𝑐𝑌𝑡𝑋𝑡

∗

𝑇

𝑡=1𝑋

𝐷

𝑑=1

= ∑ ∑ ∑ ℙ(𝑌, 𝑋𝑡(𝑑)

= 𝑒𝑗|�̅�)𝑙𝑜𝑔𝑐𝑌𝑡𝑒𝑗

∗

𝑇

𝑡=1

𝑁

𝑗=1

𝐷

𝑑=1

, (6.41)

where

∑ 𝑐𝑓𝑘𝑒𝑗= 1

𝑀

𝑘=1

, 𝑓𝑘 ∈ 𝑆𝑌.

By the method of Lagrange multipliers, we can obtain the recursion formula of 𝑐𝑓𝑘𝑒𝑗

∗ as follows

𝑐𝑓𝑘𝑒𝑗

∗ =

∑ ∑ ℙ(𝑌, 𝑋𝑡(𝑑)

= 𝑒𝑗|�̅�)1{𝑌𝑡

(𝑑)=𝑓𝑘}

𝑇𝑡=1 𝐷

𝑑=1

∑ ∑ ℙ(𝑌, 𝑋𝑡(𝑑)

= 𝑒𝑗|�̅� )𝑇𝑡=1

𝐷𝑑=1

=∑ ∑ 𝛾𝑡

(𝑑)(𝑗)𝑇

𝑡=1;𝑌𝑡(𝑑)

=𝑓𝑘

𝐷𝑑=1

∑ ∑ 𝛾𝑡(𝑑)(𝑗)𝑇

𝑡=1𝐷𝑑=1

. (6.42)

6.2.3 Decoding the observation sequence

Here we will exploit the Viterbi algorithm. Viterbi algorithm[32] is a dynamic programming algorithm used to find the most likely sequence of hidden states (called Viterbi paths), resulting in a series of observed events. Proposed by Andrew Viterbi in 1967, the Viterbi algorithm is used to deconvolution in digital communication links to eliminate noise. This algorithm is widely used in CDMA and GSM digital cellular networks, dial-up modems, satellites, deep space communications, and 802.11 wireless networks for de-concentration codes. It is also used today in speech recognition, keyword recognition, computational linguistics, and bioinformatics. For example, in speech recognition, the sound signal is seen as a sequence of observed events, while the text string is seen as the underlying cause of the sound signal, so the Viterbi algorithm can be applied to the sound signal to find the most likely text string.

Note that here we denote a single sequence {𝑋0, 𝑋1, … , 𝑋𝑇} by 𝑋;{𝑌0, 𝑌1, … , 𝑌𝑇} by 𝑌, which is different from the denotations in section 6.2.2. First, we define

𝛿𝑡(𝑖) = max𝑋

ℙ(𝑋𝑡 = 𝑒𝑖 , 𝑋0, 𝑋1, . . , 𝑋𝑡−1, 𝑌𝑡 , 𝑌𝑡−1, … , 𝑌0|𝜆) , 𝑖 = 1,2, … , 𝑁 (6.43)

Then the recursion formula of 𝛿𝑡(𝑖) is

53

𝛿𝑡+1(𝑖) = max𝑋

ℙ(𝑋𝑡+1 = 𝑒𝑖 , 𝑋0, 𝑋1, . . , 𝑋𝑡 , 𝑌𝑡+1𝑌𝑡 , , … , 𝑌0|𝜆)

= max1≤𝑗≤𝑁

[𝛿𝑡(𝑗)𝑎𝑗𝑖]𝑐𝑋𝑡+1𝑒𝑖

∗ . (6.44)

Second, define

𝜓𝑡(𝑖) = argmax1≤𝑗≤𝑁

[𝛿𝑡−1(𝑗)𝑎𝑖𝑗] , (6.45)

where 𝜓𝑡(𝑖) represents the hidden state value at time 𝑡 − 1 of the sequence {𝑋0, 𝑋1, … , 𝑋𝑡−1, 𝑋𝑡 = 𝑒𝑖} with maximum probability 𝛿𝑡−1(𝑗)𝑎𝑖𝑗 . Then the decoded hidden

state sequence {𝜓𝑡(∙)}, t = 0,1,2, … , T, can be obtained at the end of iteration.

6.3 A Simple Example of HMM Application

This section briefly illustrates how the hidden Markov model is applied to credit quality forecasting with two simple examples. For a start, it is assumed that customers' credit quality changes over time and that there exist hidden states representing the ‘true’ credit quality. The bank has to build PD prediction models to approach those ‘true’ credit grades. The PDs forecasted by these models are considered as the ‘observed’ credit grades. The hidden state and the observed signal share the same state space, i.e. they use the same set of state labels although their interpretation is slightly different. The transitions in the observed sequences enable the estimation of parameters of the HMM. The quality of the PD prediction models will influence the values in the emission matrix of HMM, meaning that if the PD model would perfectly reflect the ‘true’ credit quality of clients the emission matrix will be an identity matrix.

The first example below is based on the ideal assumption that the bank will rerate the credit quality of all clients at a regular schedule, which is set to be done every month. The second example describes a more realistic situation where the bank rerates the clients at an irregular schedule, leading to less rerating activities compared with the ideal situation.

6.3.1 When the Bank Rerates Clients on Evenly Spaced Schedule

Under the ideal circumstances, the bank will rerate customers at a regular schedule, for instance doing rerating activities each month. In that case, the movement of the credit quality of each client can be recorded periodically. These sequences are then processed as observed signals for the HMM fitting. Provided that the rating system consists of three 2 different performing state(𝑅0, 𝑅1) and 1 default state(𝐷), a part of the credit grade movements is shown in Figure 23.

54

Figure 23. The example of the movement of regularly rerated credit quality grades

From Figure 23, the observation credit grade sequence is

[𝑅0, 𝑅0, 𝑅1].

The transition matrix of ‘true’ credit quality can be estimated using the Baum-Welch algorithm discussed in Section 6.2.2 above. If the hidden state and the observed signal share the same state space and that there is a one-to-one mapping between the observed signal and the hidden state, which means the prediction model can correctly reflect the ‘true’ credit rating grades of clients. A simple example of an emission matrix between hidden states and observed states looks like

𝐶 = [1 0 00 1 00 0 1

],

with the columns’ and rows’ index of 𝑅0, 𝑅1, 𝐷 . If the PD prediction model is not always correct, the emission matrix will look like

𝐶 = [0.8 0.1 0.10.1 0.8 0.10.1 0.1 0.8

].

with the columns’ and rows’ index 𝑅0, 𝑅1, 𝐷. Also, a simple example transition matrix among hidden states looks like

𝐴 = [0.8 0.1 0.10.1 0.7 0.20 0 1

],

with the columns’ and rows’ index 𝑅0, 𝑅1, 𝐷. Here the ‘true’ credit grades tend to stay at the original state rather than migrate to another state; clients won’t move back to a performing state from the default state.

6.3.2 When the Bank Rerates Clients on Unevenly Spaced Schedule

In this case, the rating system is also set to contain 2 performing states (𝑅0, 𝑅1) and a default state (𝐷). The time interval between the two predicted credit grades of a certain client might be different, which is shown in Figure 24.

Observed signals

𝑅0

𝑅0

𝑅0

𝑅0

𝑅1

𝑅1

1 month 1 month

PD prediction

models

‘True’ credit quality

55

Figure 24. The example of the movement of unevenly rerated credit quality grades.

From Figure 24, we can observe that there are two different sizes of the time step, which will make the reconstructed transition matrix have a bad interpretation that it no longer means transition probability per time step anymore. In order to solve this problem, a new observed signal named 𝑁𝑅 (non-rating) can be added in, which means at the end of some months there are no updates for the credit quality of certain clients. This example is briefly shown in Figure 25.

Figure 25. The example of the movement of unevenly rerated credit quality grades (𝑁𝑅 added).

In Figure 25, the observation sequence is

[𝑅0, 𝑁𝑅, 𝑅0, 𝑅1],

based on which the transition matrix and emission matrix of HMM can be estimated. If the bank rerates clients less frequently, for the perfect PD prediction model, the emission matrix looks like

𝐶 = [0.8 0.2 0 00.8 0 0.2 00.8 0 0 0.2

],

With the columns’ index 𝑁𝑅, 𝑅0, 𝑅1, 𝐷 and rows’ index 𝑅0, 𝑅1, 𝐷. The transition matrix might look like

𝐴 = [0.8 0.1 0.10.1 0.7 0.20 0 1

],

with the columns’ and rows’ index 𝑅0, 𝑅1, 𝐷.

𝑅0 𝑅0 𝑅1

2 month 1 month

Credit rating grades

‘True’ credit quality

Observed signals

𝑅0

𝑅0

𝑅0 𝑅1

𝑅1

1 month 1 month PD prediction models

1 month

𝑅1

NR R0

56

The mathematical aim of this research is to compute the transition matrix and emission matrix based on the dataset after taking 𝑁𝑅 state into consideration. This estimated transition matrix of hidden ‘true’ credit quality enables us to predict the possible change in credit quality so as to solve two problems brought by the incorrect assumptions when conducting the backtest. The first assumption is that the most recent rating is still valid, but it might be several months old and already changed. The second assumption is that in the real world the transitions, based on which the transition matrices are built, are seen to be one-year, which often won’t be the case.

6.4 Estimation of 15-dimensional HMM Transition Matrix

Hidden Markov Models are widely used in various fields. Despite the benefits of simplicity, HMM still has problems in terms of learning efficiency and convergence speed. In general, HMMs face two major problems. The first is how to choose the appropriate number of hidden states; the second is how to estimate model parameters effectively [34]. In our study, the hidden state represents the ‘true’ credit quality of customers, which is what the PD models try to predict. Thus, the hidden state space has the same dimension as the observed state space. For the simulated dataset above, HMM in this case has 15 dimensions. Since the Baum-Welch algorithm, which is applied for parameter estimation, is an iterative optimization approach, using it to estimate a high dimensional transition matrix will lead to a significant drawback of time inefficiency and a high computational cost [34][35]. In our simulations, the computation time and the size of the data required were too large to be practical.

In previous studies, empirical researchers proposed alternative methods. One can use a non-iterative approach with spectral-based algorithms to avoid the cost of iteration. However, they are not suitable for parameter estimation when the available observation sequences are of large length. Therefore, despite the limitations of the Baum-Welch algorithm, it’s still commonly used in the HMM parameter estimation [34][35].

In the research of Korolkiewicz [27], the performing states were folded into only two categories: investment grade (IG) and speculative grade (SG). This reduced the dimensions of the HMM to four: investment grade, speculative grade, default, and not rated. However, due to the regulations of calibration, “Banks must regularly compare realized default rates with estimated PDs for each grade”[3]. Therefore, from a validation perspective, it is not allowed to assign all the performing states into two groups and give a rough migration probability between two groups.

The solution we choose for this computational problem is based on the observation that transitions far from the diagonal are usually close to zero (clients migrate to better or worse grades gradually).Now, the trick is to regard sub-matrices of the full matrix one at a time. These matrices are chosen to be around the diagonal (but will also include the transition to default). All observations that do not consider the sub-matrix will be folded in a single ‘other state’-𝑂𝑆. For instance, if now one focuses on the transition probability of performing state 𝑅4 , 𝑅5 and 𝑅6 . Then, except for 𝑅4, 𝑅5, 𝑅6, and 𝐷𝑌1, all other states are folding into one same group ‘𝑂𝑆’ (other states), which reduces the dimension of HMM from 15 to 6, which can be written as

[𝑅0, … , 𝑅9, 𝑁𝑅, 𝐷𝑌1, 𝐷𝑌2, 𝐷𝑌3, 𝐿𝑖𝑞, 𝐶] ⟹ [𝑅𝑥, 𝑅𝑥+1, 𝑅𝑥+2, 𝑁𝑅, 𝐷𝑌1, 𝑂𝑆],

𝑥 = 0,1, … ,7.

57

With 𝑥 moving from 0 to 7, the location of the blocks will move along the diagonal of the migration matrix. When 𝑥 = 0,2, an example of the location of the performing states included in the blocks is shown in Figure 26.

𝑅0 𝑅1 𝑅2 𝑅3 𝑅4 𝑅5

𝑅0

𝑅1

𝑅2

𝑅3

𝑅4

𝑅5

Figure 26. An example of cutting the big migration matrix into mini blocks.

As shown in Figure 26, there would be an overlapping part of these blocks, which would help in improving the stability and accuracy of the estimated transition probabilities.

Note that in principle, the transition probabilities among states after default can also be computed by HMM, but it is not necessary and probably detrimental to do that. Since the clients who are not in the performing states (𝑅0 to 𝑅9), are always monitored by the bank, the transition probabilities among [𝐷𝑌1, 𝐷𝑌2, 𝐷𝑌3, 𝐿𝑖𝑞, 𝐶] can be simply computed though the real transition migration frequency rather than through using HMM methodology, that is

𝑃(𝑋𝑡+1 = 𝑗|𝑋𝑡 = 𝑖) =𝐴𝑖𝑗

𝐴𝑖, 𝑖 = 𝐷𝑌1, 𝐷𝑌2, 𝐷𝑌3, 𝐿𝑖𝑞, 𝐶

where 𝐴𝑖 represents all transitions from state 𝑖; 𝐴𝑖𝑗 represents the transitions from state 𝑖 to state 𝑗. Thus, the blocks with reduced dimensions will only be used to obtain the transition probabilities of performing states. By moving the location of blocks, we can obtain the all probabilities that we are interested in.

Until now, we calculate the probabilities of transitions among the 5 post-default states by the transition frequency. All the default states can always emit that they are default states. This can be made even stronger. Since banks know for sure when clients default, cured, and liquidated states, there’s no need to define hidden states and signals for these 5 states in our case. We can directly observe that. Thus, the alternative is to apply the Partially observable Hidden Markov Model(POHMM)[36][37], an extension of HMM, instead of the classic HMM. We leave this methodology for further research.

6.5 Chapter Summary

In this chapter, we introduced some theories and the general equations of the HMM, including the one unit delay HMM and zero delay HMM. The parameter estimation approaches of these two types of HMM were different. In our research, we applied the zero delay HMM and the Baum-Welch algorithm to estimate the transition matrix of hidden states. This was because zero delay HMM could describe the situation better. We believed the

58

observation states depend on the current hidden states rather than the hidden states at previous step. After we elaborated on the mathematical derivation of Baum-Welch algorithm, two examples about how the HMM fits in this research were given. The main problem during the HMM application was that the dimension of state space was too large to estimate the migration matrix. In order to solve it, we proposed to first estimate the transition matrices of the blocks whose index was a subset of the original state space, and then to merge all the blocks so as to obtain the estimation of the whole transition matrix of the internal rating system.

59

7 The Implementation of the Hidden Markov Model

In the previous chapters, we discussed some validation methods that are currently used in banks and simulated an artificial bank dataset where all these validation methods were implemented. Also, we explained the theoretical Hidden Markov Model, and why and how it can be applied for predicting the possible credit quality transitions for clients. In this chapter, we will perform a numerical experiment based on the simulated dataset to see if the estimated migration matrix is in line with the ‘true’ migration matrix. The procedure is as follows.

Step 1: Extract the credit quality migration sequence for each of the clients and do the data pre-processing. Since we numbered all clients when they join the portfolio, we can easily obtain these sequences and put them in a Python dictionary with the client ID as the key. Section 7.1 will explain this step in detail.

Step 2: Train the HMM on the full information sequences. Assume all the clients will be rerated monthly. In this case, the HMM model will reduce to a normal Markov Model, as the emission matrix will be identity. Given full information, we can test if HMM can compute the transition matrix of hidden states and the identity emission matrix. After flattening the error matrix of the estimated matrix and the ‘true’ migration matrix into a vector, we can check the quality of HMM estimation by the 2-norm of the vector. Section 7.2 will explain this step in detail.

Step 3: Train the HMM on the partial information sequences. Assume the clients are not rated on an evenly schedule, as shown in Figure 24 and Figure 25. Here, we set that all the clients in performing states have the same probability to be rerated every month. Given partial information, the HMM will be trained. Section 7.3 will explain this step in detail.

Step 4: Try out new methods to improve the results obtained in step 3. Analyze the possible reasons that lead to the results in step 3 and modify techniques which are used in the previous steps or design new techniques so as to improve the accuracy of the estimated transition matrix in terms of each block. Section 7.4 will explain this step in detail.

Note that in this chapter we haven’t taken the quality of the PD prediction model into account, as stated in section 6.3. The dataset and the ‘true’ credit grades are migrating once a month, which means if the PD model is assumed to be perfect and the bank is using these models to rerate all the clients monthly, the theoretical emission matrix will be identity. By contrast, if the PD prediction model is not fully accurate, the observation state will probably be different from the hidden state. In this case, the emission probabilities at diagonal won’t be one.

7.1 Data Pre-processing

In this section, the data pre-processing techniques will be introduced. After shortening the credit quality migration sequences by removing less important information, we should balance the dataset so that the numbers of transitions starting from every performing state will be relatively similar. Balancing is necessary since the big difference among the numbers of transitions from each credit grade will reduce the accuracy of estimation. The data preprocessing procedures under two different situations, when the clients are rated regularly and irregularly, will be slightly different. There will be an extra step to process the data that is collected when clients are rated at an irregular schedule.

60

7.1.1 Data Pre-processing for Full Information

In this section, two steps of the data pre-processing are elaborated. The first step is to remove less interesting information from the original credit quality sequences; the second step is to balance the numbers of transitions for afterward model fitting.

Step 1: Removing Unimportant Information. As mentioned in section 6.4, the size of the state space, which is 15, is too large to train the HMM. In order to estimate a 15 × 15 migration matrix, a bank needs to collect the credit quality migration sequences containing more than ten million transitions. This can be easily checked by fitting the HMM with 15 hidden states on samples that are simulated from a known HMM model. Thus, the block technique in section 6.4 is introduced to reduce the size of the required data for HMM fitting.

Suppose that we consider three adjacent performing states and 𝐷𝑌1 in one block and fold all other states as 𝑂𝑆. Then, the block is with rows and columns index

[𝑅𝑥, 𝑅𝑥+1, 𝑅𝑥+2, 𝐷𝑌1, OS], 𝑥 = 0,1,2,3,4,5,6,7, (7.1)

where 𝑂𝑆 represents other states. If we have a credit migration sequence as

[𝑅𝑥, 𝑅𝑥+1, 𝑅𝑥+3, 𝑅𝑥+3, 𝑅𝑥+3, 𝐷𝑌1, 𝐷𝑌2, 𝐷𝑌3, 𝐿𝑖𝑞],

we can rewrite it by folding the less important states into 𝑂𝑆,

[𝑅𝑥, 𝑅𝑥+1, 𝑂𝑆, 𝑂𝑆, 𝑂𝑆, 𝐷𝑌1, 𝑂𝑆, 𝑂𝑆, 𝑂𝑆].

Then, we can merge the successive 𝑂𝑆 into one 𝑂𝑆, that is

[𝑅𝑥 , 𝑅𝑥+1, 𝑂𝑆, 𝑂𝑆, 𝑂𝑆, 𝐷𝑌1, 𝑂𝑆, 𝑂𝑆, 𝑂𝑆] ⟹ [𝑅𝑥, 𝑅𝑥+1, 𝑂𝑆, 𝐷𝑌1, 𝑂𝑆],

to shorten the sequence by removing less interesting information.

Step 2: Data Balancing. Since in Figure 15, there are more clients in the middle credit rating classes than those in 𝑅0 or 𝑅9, the dataset needs to be balanced before fitting. As a common sense, the clients tend to stay in a credit grade rather than to migrate frequently, so a credit quality transition sequence is supposed to contain more information about its first state. Thus, we can do a population scaling of all migration sequences depending on their first states. There will be 4 balancing probabilities corresponding to the credit sequences starting from 𝑅𝑥, 𝑅𝑥+1, 𝑅𝑥+2, 𝑂𝑆 to randomly select sequences for the afterward model fitting. For instance, if the balancing probabilities of block [𝑅𝑥, 𝑅𝑥+1, 𝑅𝑥+2, 𝑂𝑆] is [1, 0.5, 0.2, 0], all the sequences starting from 𝑅𝑥 and none of sequences starting from 𝑂𝑆 will be picked, while half of the sequences, that begin with 𝑅𝑥+1, and 1 out of 5 sequences, that begin with 𝑅𝑥+2, will be selected.

7.1.2 Data Pre-processing for Partial Information HMM

In this section, we first introduce how we add a ‘non-rated’ (𝑁𝑅) state(as shown in Figure 25) into the original dictionary of credit quality migration sequences. The methods to remove the less important information and to balance the dataset are the same as explained in the previous section.

Step 1: Consider 𝑵𝑹. In order to clearly explain how to add 𝑁𝑅 into the observation state space, we show an example of a block containing 𝑅0, 𝑅1, 𝑅2. In this case, the observation state space turns to be [𝑅0, 𝑅1, 𝑅2, 𝑁𝑅] . Suppose that we have a signal credit rating grade sequence where the bank rerates clients every month, that is,

61

[𝑅1, 𝑅1, 𝑅1, 𝑅1, 𝑅0, 𝑅0, 𝑅0, 𝑅2, 𝑅0, 𝑅2].

Now the bank re-schedule its credit rating activity. Every month, there is a 50% probability for the bank to rerate a client. We can rewrite the sequence above as

[𝑅1, 𝑁𝑅, 𝑅1, 𝑁𝑅, 𝑁𝑅, 𝑁𝑅, 𝑅0, 𝑅2, 𝑁𝑅, 𝑁𝑅].

Note that the first state of a credit quality migration sequence shouldn’t change to a 𝑁𝑅 state. Also, right after the clients are cured and return to the performing state, the artificial bank will always check their credit quality rather than leave it non-rated. For example, the credit quality grade in red should not be converted to ‘non-rated’:

[𝑅1 𝑅2 𝐷𝑌1 𝐶 𝑅1 𝑅2].

Step 2: Remove the unimportance information. This can be found in section 7.1.1.

Step 3: Balance Data. This can be found in section 7.1.1.

7.2 When Clients Are Rated Monthly

Assume that the ‘true’ credit quality of the clients will migrate monthly, which is the transition among the hidden state. This section discusses an ideal situation that the bank will apply the perfect PD prediction model to rate all clients monthly. We set that each block is of 5 dimensions with three performing states, 1 default state, and 1 other state. Since we are researching the potential credit grade migrations starts from performing states, the states, 𝐷𝑌2, 𝐷𝑌3, 𝐶, 𝐿𝑖𝑞, are less interesting to us. Thus, the hidden credit quality migration will only cover all the performing system and 𝐷𝑌1.

In the implementation of HMM, there is a problem with the graph isomorphism. What we intend to estimate is the transition probabilities among the internal rating system which is with an order. However, the HMM doesn’t know their labels nor their ordering, since it gets signals that do not necessarily have anything to do with the state labels themselves. This means the algorithm might correctly derive the transition probabilities, but they may be placed in another cell that we could also obtain if we randomly reorder the credit grades. This can be shown in Figure 27. We can see from Figure 27 that states with the same color are equivalent.

Figure 27. An example of graph isomorphism: Actual transition graph (left), estimated transition graph (right)

a b

c d

4

1

2 3

62

Assume a simple Markov chain diagram of hidden states as shown in Figure 28.

Figure 28. An example of a Markov Chain diagram of hidden states

If the transition matrix and emission matrix are

𝐴 = [0 1 00 0 11 0 0

] 𝐶 = [1 0 00 1 00 0 1

] ,

with a starting hidden position of 1, the emitted observation sequence will be

[1,2,3,1,2,3 … . . ].

However, if the transition matrix and the emission matrix are

𝐴 = [0 1 00 0 11 0 0

] 𝐶 = [0 1 00 0 11 0 0

] ,

with a starting position of 3, the emitted observation sequence will also be

[1,2,3,1,2,3, … . . ].

Thus, different transition and emission combinations, together with different starting positions all give the same observation sequences, which means with given signals, the algorithm cannot uniquely determine the underlying matrices.

In order to interpret the figure nicely, we need to relabel the hidden states according to the emission matrix, since in terms of a perfect PD model the emission matrix will be identity, which gives us a clue to relabel everything. Besides, we can also use the transitions from a performing state to default in 1 year to uniquely identify the performing states, because these probabilities are supposed to be monotonous.

Here shows an example in Figure 29, which are the results when we choose the block with two of the performing states as the interesting states, we can see the blue cells of the emission matrix are not located at the diagonal, based on which, we can easily deduce that the hidden state 0 represents the other buckets; hidden state 1 represents 𝑅0. Then after relabeling we can obtain the ordered transition matrix, as shown in Figure 29.

1 3

2

63

Figure 29. The transition matrix of hidden states(right) and the emission matrix(left) of the block containing the first two credit grades

In order to determine a suitable size of the blocks, we repeat the experiments on both blocks containing two performing states and blocks containing three performing states. The former ones tend to give the predictions of PD which are higher than the ‘true’ bucket PD (Figure 13). Besides, in terms of a block with 𝑅4 and 𝑅5 , which are with almost the same ‘true’ probabilities to migrate to the other and are with similar ‘true’ PDs, HMM cannot distinguish the migrations of clients in these two states. Therefore, using the blocks concentrating on three performing states will be better. Here, Table 16 shows some characteristics of the credit quality sequences that are used to fit the HMM, and the time cost to fit HMM for each block.

Table 16. The table of total transitions, the number of clients, and the time cost for model fitting in terms of different blocks when the bank is rating clients monthly (parameter for

hmmlearn: n_iter=1000, tol=0.000001)

Performing states [𝑹𝒙, 𝑹𝒙+𝟏, 𝑹𝒙+𝟐]

the value of 𝒙 0 1 2 3

Client ID range 0-154,704 0-154,704 0-154,704 0-154,704

Balancing probabilities

[1,0.3,0.3,0.5] [0.5,0.5,0.5,0.5] [0.5,0.5,0.5,0.5] [0.5,0.5,1,0.5]

Total transitions 788,186 1,102,877 1,159,585 1,113,504

The number of clients

40,514 61,927 77,870 84,914

Time cost 37min 51min 134min 134min

Convergence True True True True

Performing states [𝑹𝒙, 𝑹𝒙+𝟏, 𝑹𝒙+𝟐]

the value of 𝒙 4 5 6 7

OS

OS

OS

64

Client ID range 0-194,704 0-204,704 0-214,704 0-400,000

Balancing probabilities

[0.6,0.9,1,0.5] [0.7,0.8,1,0.5] [0.7,0.8,1,0.5] [0.5,1,1,0.5]

Total transitions 1,194,569 988,769 803,227 806,679

The number of clients

101,961 92,270 81,026 114,050



From Table 16, we can see that the number of clients increases when the value 𝑥 gets bigger. This is because compared with clients in low credit grades, clients in higher credit grades are more likely to default, making the length of the sequences generally shorter than that of clients with low PD. Thus, in order to obtain better results, we need the data of a larger number of clients for model fitting. The estimated transition matrices are shown in Figure 30. Note that in these matrices, we are only interested in the first three rows, which are the transition probabilities starting from a performing state. For the last two rows, the values won’t affect a lot in this research.

OS OS

OS OS

OS

OS

OS

OS

65

Figure 30. The transition matrices of hidden states in terms of each block. The index of each block shows the credit grades we were considering at that moment

Figure 31. The number of overlapping for each cell

OS OS

OS OS

OS

OS

OS

OS

66

Based on the estimated results in Figure 30, we can see there is some overlapping of these blocks. The numbers of overlapping with respect to the location of cells are shown in Figure 31. The choices of the number and the index of blocks are always flexible, and it is possible to estimate the transition probabilities of the block only containing 𝑅0 and 𝑅9. For the cells with only one estimated probability, we directly apply it, while for the cells with more than one estimation, we take their average. After putting all these probabilities into the cells of a bigger migration matrix whose index contains all internal rating grades and 𝐷𝑌1, we can have the final estimated hidden state migration matrix, as shown in Figure 32.

Figure 32. The estimated hidden state migration matrix (when the bank rerates the clients monthly)

The similarity of the estimated migration matrix (Figure 32) and the ‘true’ migration matrix (Figure 13) is assessed by the second norm of the difference matrix between these two. The value of l2-norm is 0.14, which is low enough to conclude that if the bank is rerating clients on an evenly spaced schedule, the HMM estimation of the migration matrix is accurate.

7.3 When Clients Are Not Rated Monthly

In this section, after converting a normal performing state into the 𝑁𝑅 state by a probability of 50%, which was explained in section 7.1, we can obtain the dataset when the bank rerates clients irregularly. Now we use the credit quality migration sequences of clients who are with the same range of client ID as in section 7.2 to estimate transition matrices. The total transitions, the range of client IDs, and the balancing probabilities are included in Table 17.

67

Here HMM failed to estimate the transition matrix of the blocks whose indexes starting from credit grades 𝑅3, so we can only obtain the matrices for the first 4 blocks. The estimated transition matrices of the block containing [𝑅0, 𝑅1, 𝑅2, 𝑂𝑆], [𝑅2, 𝑅3, 𝑅4, 𝑂𝑆], [𝑅3, 𝑅4, 𝑅5, 𝑂𝑆] and [𝑅4, 𝑅5, 𝑅6, 𝑂𝑆]are shown in Figure 33 , respectively.

Table 17. The table of total transitions, the number of clients, and the time cost for model fitting in terms of different blocks when the bank is not rating the clients monthly

(parameter for hmmlearn: n_iter=1000, tol=0.000001)

Performing states [𝑹𝒙, 𝑹𝒙+𝟏, 𝑹𝒙+𝟐] the value of 𝒙

0 1 2 3

Client ID range 0-154,704 0-154,704 0-154,704 0-154,704

Balancing probabilities [1,0.5,0.5,0.5] [0.5,0.5,0.5,0.5] [0.5,0.5,0.5,0.5] [0.5,0.5,1,0.5]

Total transitions 597,908 610,989 666,501 694,039

The number of clients 24,798 30,335 38,204 47,454



Figure 33. The estimated hidden states transition matrix for the first 4 blocks which are all of size 5 containing three performing states

68

From Figure 33, we can see that there are two main problems with the estimation results. HMM is unable to correctly estimate the transition matrix of hidden states, meanwhile, the estimated PDs tend to be larger than the ‘true’ bucket PDs which are shown in Table 8, and the highest credit grade in a block is showing even a bigger error.

There are three possible reasons why we failed to estimate the transition matrix of the ‘true’ credit quality with higher PDs. Firstly, clients with relatively high credit grades are more likely to default. Compared with the clients with low credit grades, they will leave the portfolio early, making the length of credit quality migration sequences of clients starting from high credit grades bigger than that of clients who start from low credit grades. The analytical length of the credit quality sequences in terms of each credit grade until the first default is followed by the geometric distribution whose parameter is the corresponding PD. The expectation of the length of credit quality sequences will be

𝔼[length] =1

𝑃𝐷,

as shown in Figure 34. After considering balancing, the length of the sequences in terms of blocks can be illustrated in Figure 35.

Figure 34. The analytical length of the credit quality sequences in terms of credit grades

Figure 35. The average length of the credit quality sequences in terms of blocks after balancing

From Figure 35, we can observe that the average length of the sequences for a block is monotonously decreasing as the credit grades of that block become higher. Compared with a long credit quality sequence, the shorter one only contains limited information about credit rating class transitions. If the bank rerates the clients monthly, this limited information is

69

enough for us to figure out the transition matrix of hidden states. However, after considering the state 𝑁𝑅, more data will be required in model fitting.

The second possible reason is that for clients in higher credit grades, the relative differences between successive the bucket PDs become small. From Table 8, we can see from the bucketing system that the bucket PD of 𝑅1 is triple that of 𝑅0; the bucket PD of 𝑅2 is almost double that of 𝑅1. However, the bucket PD of 𝑅9 is only 11.7% bigger than that of 𝑅8.

The third possible reason is that the data pre-processing technique available to the monthly rerated dataset does not fit the unevenly rerated dataset. We need to modify the technique used in the previous section or to propose a new method to process data for HMM fitting.

7.4 Attempts to Improve Obtained Results

In this section, we will report some attempts to modify the settings of the simulated artificial bank and the procedure of data pre-processing to see whether the problems in the obtained results can be solved.

7.4.1 Modifying the Original Settings of Simulated Artificial Bank

Here we reset the boundaries of the internal rating system to make the bucket PDs grow exponentially so that the difference of the PDs of two successive credit rating grades won’t be too small for the HMM to recognize. The exponentially increasing bucket PDs of this new bucketing system is shown in Table 18. Based on the factor values simulated in Chapter 5, the ‘true’ PDs are computed using equation (5.1), and the clients will be assigned to the credit bucket whose bucket PD is closest to their ‘true’ PDs. The parameters 𝛽0 and 𝛽 remain to be [0.1 , 0.5 , 1 , 0.5 , 0.25] and −log1.5 , whereas the parameters for the transformation function (5.3) should be modified to the values in Table 19.

Table 18. The new bucketing system with exponentially increasing bucket PDs

Figure 36. The bucket PDs with respect to the credit rating grades in the exponential system

Credit grades 𝑅0 𝑅1 𝑅2 𝑅3 𝑅4 𝑅5 𝑅6 𝑅7 𝑅8 𝑅9

Bucket PDs 0.0099 0.0148 0.0222 0.0334 0.0500 0.0751 0.1126 0.1689 0.2533 0.38

70

Table 19. The parameters chosen for scaling factor values for the exponential bucketing system

Factor 1 Factor 2 Factor 3 Factor 4 Factor 5

ub 0.3

lb -2.2

s 4 0.5 0.7 0.6 1.5

sh -1 -1.4 -4 -3 -1.5

The histogram of the credit rating grades in the first month and the stationary distribution of all states are shown in Figure 37 and Figure 38 respectively. The parameters in Table 19 makes the credit rating distribution of the first month looks similar to a normal distribution with the mean around 𝑅4, which is a true situation for banks in real life. By properly choosing the scaling parameter, the mean value and the volatility (shown in Table 20) of the stochastic differential equations in Table 11, we can make the stationary distribution of the whole state space close to a normal distribution, where most of the clients are in the middle credit rating classes.

Figure 37. The credit rating frequency of the first month of the new dataset

Figure 38. The stationary credit rating frequency of all states in the new dataset

Table 20. The parameters chosen for stochastic differential equations of every factor in the exponential bucketing system

𝑹𝟎 − 𝑹𝟒 (Factor 1 to 5) 𝑹𝟓 − 𝑹𝟗(Factor 1 to 5)

𝑲(𝒊) [𝑁𝐴𝑁 0.01 0.01 0.15 0.05] [𝑁𝐴𝑁 0.1 0.1 0.15 0.05]

𝜽(𝒊) [𝑁𝐴𝑁 3.35 2.8 6 5.5] [𝑁𝐴𝑁 3.35 2 4 4]

𝝈(𝒊) [0.05 0.3 0.3 0.2 0.1] [0.05 0.3 0.3 0.2 0.1]

71

Figure 39. The 'true' Migration matrix of the exponential bucketing system

The average ‘true’ migration matrix based on the bucketing system with exponentially increasing bucket PDs is shown in Figure 39, after updating the dataset for the successive 250 months. The standard errors in term of each cell are smaller than 0.015, except for the transition probabilities that starts from 𝑅9. For rows of 𝑅9, the std is 0.30. This is because the amount of 𝑅9 clients is relatively small as shown in Figure 38. Thus, we can conclude that the migration matrices converge.

However, HMM still doesn’t work on the 𝑅7 𝑅8 𝑅9 block. The estimated migration matrix is shown in Figure 40. We can see from Figure 40 that there is one hidden state which will possibly emit three credit grades, which is incorrect.

Figure 40. The estimated emission matrix of a block with 3 highest credit grades

72

However, it works on the 𝑅0 𝑅1 𝑅2 block but still with some errors in the estimation of emission matrix as shown in Figure 41. It works well on distinguishing the 𝑅0 clients and 𝑅1 clients, however, for 𝑅2 clients, whose credit quality migration sequences are of a relatively shorter length among these three, the PD estimation of clients in 𝑅2 is not fully accurate.

Figure 41. The emission matrix (right) whose rows index as the unlabeled hidden states and the relabeled transition matrix of hidden ’true’ credit quality (left) of the new system

From the estimated emission matrix in Figure 41, we observe that the hidden state 0 is the ‘true’ credit rating grade 𝑅2. The clients in this hidden state have a probability of 0.5 to be rated and have a probability of 0.5 not to be rated, which is in line with the assumption that every month the bank has a probability of 0.5 to rerate a certain client. However, in Figure 37, the estimated PD of 𝑅2 is not accurate, which might be caused by the data pre-processing technique. Thus, we need to modify the data pre-processing method in order to solve this problem.

7.4.2 Modifying the Data Pre-processing technique

Under this circumstance, the merging of successive 𝑂𝑆 will reduce the accuracy of the estimated emission matrix. This is because in our simulation the clients in 𝑂𝑆 are also rated with a probability of 0.5, but shortening the sequences makes the HMM yield the estimation meaning that all the clients in 𝑂𝑆 will be rated, as shown in Figure 41, which is wrong. Thus, in this case we should not remove successive 𝑂𝑆 but cut the sequences by setting 𝐷𝑌1 as boundary and then remove the subsequences with only 𝑂𝑆 and 𝑁𝑅 before defaulting. Here gives an example, where the red part should be removed.

[𝑅1 𝑅2 𝑁𝑅 𝐷𝑌1 𝑂𝑆 𝑂𝑆 𝑂𝑆 𝑁𝑅 𝐷𝑌1 𝑂𝑆 𝑅1 𝑅2 𝑁𝑅 𝑂𝑆 𝑂𝑆 𝐷𝑌1 𝑂𝑆]

⟹

[𝑅1 𝑅2 𝑁𝑅 𝐷𝑌1 𝑂𝑆 𝑅1 𝑅2 𝑁𝑅 𝑂𝑆 𝑂𝑆 𝐷𝑌1 𝑂𝑆]

By applying this data pre-processing method, the quality of HMM estimation is indeed improved, as shown in Figure 42.

73

Figure 42. The emission matrix (right two) whose rows index as the unlabeled hidden states and the relabeled transition matrix of hidden ’true’ credit quality (left two) of the first two

blocks of the new system

We are only interested in the transition probabilities of the first three rows in Figure 42. From Figure 42 we can see that the estimations of PDs are improved and are in line with the ‘true’ PDs in the ‘true’ migration matrix of the exponential bucketing system in Figure 39. HMM works well on the blocks with low bucket PDs, even though these bucket PDs are small, with the absolute value of difference around 0.003. However, HMM still cannot work out on the blocks containing 𝑅4 to 𝑅9. This problem will be left into the further research.

7.5 Chapter Summary

In this chapter, we implemented the HMM based on the dataset simulated in Chapter 5. There was an important assumption of this chapter that we didn’t discuss here: the effect brought by the quality of PD prediction models that the bank uses4. The PD prediction models were perfect, which means that if the bank rerated all the clients monthly, the ‘true’ emission matrices of HMM would be identity.

We first applied HMM to estimate the transition matrices of all blocks, given that the bank uses a perfect model to rate clients. Then we added a new state 𝑁𝑅 into the observation state space, while the hidden state space remained the same as before. The estimations of transition matrices were not good. Only part of the blocks could be successfully estimated if

4 In principle, incorrect PD models can be modelled in the emission matrix which apart from the ‘true’ PD and NR could also emit more or less incorrect estimates. This would be an interesting way to determine the calibration accuracy of the model

74

the bank rated clients irregularly. Some approaches were tried out in order to improve the quality of HMM estimation. They, to some extent, could help, but we still couldn’t reach our goal of obtaining the migration matrix of the whole internal rating system.

75

8 Conclusions and Discussions

In this chapter, the effectiveness of HMM and the techniques we applied will be discussed after presenting the numerical results and the conclusions. In the next chapter, some possible directions for further research will be provided.

8.1 Conclusions

The aim of this research was to check whether or not HMM methodology can be used to estimate the migration matrix describing credit quality migrations so as to estimate a correct migration matrix and thus fix the incorrect assumption that banks apply the most recent credit rating grades as if it is the current rating regardless of its age. A properly estimated migration matrix might be used for a better estimation of the credit rating grades at the start of a backtesting period and perhaps for more accurate ratings.

Using a simulation of an artificial bank, we investigated the efficacy of HMM in two scenarios: clients that are either regularly or irregularly rated. Due to the high dimensions of the state space of both hidden states and observation signals, we estimated the transition matrices block by block so as to reduce the size of the required data for Hidden Markov modeling for each estimation. Note that a discussion of the quality of PD prediction models themselves is not included in this HMM research, which means in our research the true emission matrix of our simulation bank was assumed to be identity.

In the artificial bank simulation, risk factors that drive default risk were described with several different random processes with a drift. With the chosen parameters of the drift functions, the factor values changed monthly, yielding a sequence of converged ‘true’ monthly migration matrices. After averaging the last 50 iterations of migration matrices, we obtained the average ‘true’ migration matrix. This matrix showed that the credit rating migration of clients has a centralization tendency as intended. This means that clients with relatively low and high credit grades tend to move to the middle credit classes of the bucketing system. In a typical time step, most of the simulated clients would remain at their credit rating classes, with a relatively small probability of migrating to different rating classes. Thus, the migration tendency of the simulated portfolio is reasonably in line with the real-world situation5, and the results that have been obtained based on this artificial bank in the next step of research are sufficiently representative that we can draw conclusions about the usability of HMM. We also applied the validation methods on this simulated dataset to check whether these validation methods can give a correct suggestion about the quantitative quality of PD prediction models.

In the first study, we assumed all the clients were regularly rated monthly. Based on the credit quality sequences of clients who have joined the simulated portfolio, the monthly migration

5 Note that there is one important difference between the simulated portfolio and portfolios in the real world. The migration rates of the simulated portfolio are relatively high for a monthly rate (the same holds for the default rates). The rates are actually closer to yearly rates. In the real world the threshold of PDs is 0.2, whereas in the simulated portfolio the threshold is 0.4. At the beginning of the research, it is assumed that with higher PD threshold, the increments of PDs between credit ratings will be larger, which might enable HMM to quickly distinguish the clients in different states. However, the results of the research shows that HMM works well on low credit ratings with small differences, and that with the default rates lower on a monthly basis the emission sequences would be longer and the results might be improved.

76

matrices of the block were estimated using HMM. Each block concerned a subset of the internal rating grades as the index. After combining the results of all blocks into a big migration matrix with all internal rating grades, the estimation error was checked by the l2-norm of the flattened difference matrix between the estimated monthly migration matrix and the average ‘true’ migration matrix for the artificial bank. The l2-norm here was 0.14, which is small, suggesting that in this case HMM can be used to estimate the migration matrix of the hidden credit rating grades.

In the second study, we assumed all the clients are irregularly rated. There was a probability of 0.5 for the artificial bank to rate a client each month, where the probability could be adjusted to be larger or smaller than 0.5 in further research. After adding the ‘non-rated’ state into the observation state space of HMM, the accuracy of the HMM estimated was reduced. Only blocks containing low credit grades, 𝑅0 to 𝑅4, were obtained with relatively accurate PD prediction values. We made some attempts to improve the estimation accuracy, but those did not succeed in improving the results. We believe that the core reason that led to the failure of the HMM application on blocks was the short length of credit rating migration sequences for high credit rating grades. The threshold of PD of clients to be allowed to join the portfolio was 0.4, making the credit grades migrate to default too often. As a result, rating sequences were too short to provide enough information when we fitted the HMM in this ‘non-rated’ condition. Although HMM was not able to distinguish the transition probabilities of the credit grades with PD over 0.2, in our simulation, it worked quite well on the blocks with very small PDs. The HMM also worked on the first two blocks from the new simulated dataset whose bucket PDs were exponentially increasing. The bucket PDs of these first two blocks were with very small differences, even less than 0.01. This finding enables the application of HMM on some specific portfolios whose clients have longer performing time and low PDs, such as a mortgage.

8.2 Discussions

Discussion 1: The number of clients in an artificial bank.

Given full transition information, we computed the transition probabilities among 15 states by the transition frequencies. For the whole research, the number of clients who were in the simulated portfolio for each month was set to be 60,000. This large number of simulated clients reduced the variance of estimated probabilities and thus enables the monthly migration matrices to converge fast to the average ‘true’ migration matrix. By contrast, if we simulated less amount of clients, we might still obtain a similar average ‘true’ migration matrix, but the estimations will have a higher variance.

Discussion 2: The chosen parameters of SDE.

In Chapter 5, all the parameters in the SDEs were chosen in order to make the credit ratings have a centralization tendency. The scaling parameter controlled the centralization speed of factor values, while the volatilities controlled the spread width of credit ratings. We can see from the average ‘true’ migration matrix that the credit ratings were set to migrate only to other ratings with distances of less than two steps. If the volatilities become larger, then it is possible for clients to migrate to some farther credit grades in the following month. If volatility becomes more severe, customers may move to a further credit rating in the next month. what’s more, if we increase the value of scaling parameters, it will promote centralization and

77

make customers more likely to migrate to other states rather than stay in the original state, which contradicts the actual situation of the bank.

Discussion 3: The Gini test.

The Accuracy Ratio (AR), also known as Gini, is a summary statistics of Cumulative Accuracy Profile (CAP). This profile is generated by first sorting obligors on their PD rating (highest first) and then observing their default status. A strongly discriminating model is assumed to have all defaults already in front, having given them the worst ratings. Since the default of an obligor is assumed to be a stochastic variable, the Gini, a function dependent on many of these stochastic variables, must be stochastic as well. The observed Gini is, therefore, just a sample from a distribution of realizable Ginis parametrized by the set of PDs belonging to the obligor group that is observed. As stated, the Gini is portfolio dependent, which means PD prediction models that have the same Gini might have different discriminatory power in distinguishing potential defaulters from the non-defaulters. In our research, we simulated a ‘perfect’ Gini of the simulated portfolio as a benchmark, but in the real world, banks are still using the traffic light approach as a benchmark for PD models for all portfolios, with fixed Gini thresholds. The red light threshold in many cases is higher than the ‘perfect’ Gini, 0.38, in our research, it is clear that this benchmark has somewhat limited power in properly describing the discriminatory power of PD prediction models. Its main purpose would be to compare the performance of different models on the same portfolio.

Discussion 4: Estimation of the migration matrix block by block.

In order to reduce the amount of required data, we estimated the whole migration matrix block by block. This technique enables us to obtain a well-estimated migration matrix when the artificial bank is assumed to rate all the clients monthly. Although we didn’t obtain the migration matrix when the bank rates clients irregularly, this method allows us to know which credit grades HMM cannot estimate. This gives us a clue that the HMM can be used on which kind of portfolios. The size of the blocks could be flexible, but in our simulations, the blocks with 3 credit grades showed more accurate PD estimations than the blocks with 2 credit grades. The effect brought by the size of the block can be further investigated.

Discussion 5: The length of credit rating migration sequences.

In order to properly predict the credit rating grades at the starting date of a backtesting period, we aimed to estimate the migration matrix of all performing states and default in 1 year states. Therefore, we were more interested in the credit rating migration sub-sequences containing migrations among internal rating grades and from performing states to DY1 states. As analyzed in Chapter 7, the analytical average length of migration sequence until clients default from performing states decreased exponentially from more than 50 to less than 10. The more frequent default for clients decreased the proportion of transitions among the performing states, the less useful information for the HMM fitting. From 𝑅2 the analytical length of migration sequences became less than 10, which didn’t affect a lot when the artificial bank rated clients monthly. However, when the artificial bank rated clients irregularly, half of the transition information was hidden. This led to an incorrect estimation of the block migration matrix. Here, we can reach a conclusion that HMM can be used when longer credit migration sequences are provided.

78

Discussion 6: The probability that the artificial bank rates a client each month.

In our research, the probability that the artificial bank rated clients each month was assumed to be 0.5 which was better than the actual situation. If the HMM cannot be applied under this condition, it cannot be used in the real world as well. Thus, a future researcher not only needs to find a better way to improve the estimation when the probability is 0.5 but also need to lower this probability to mimic the real-world situation. A probability of around 12.5% would be approximately correct as that would give an average of once every 12 months. The regulatory requirement for corporate clients is that they have to be rerated at least every 12 months.

Discussion 7: Necessary data preprocessing before fitting the HMM.

A big amount of required data for the HMM fitting led to a slow fitting speed. In order to reduce the HMM fitting time, we first needed to remove the less interesting information to increase the proportion of relatively important information and to control the size of data used in model fitting, for instance, the number of transitions between default states. In addition, balancing the training dataset was also necessary. In our simulation, the stationary distribution of credit rating grades showed that most of the clients would be in the middle credit classes instead of the least or most risky credit rating grades: 𝑅0 or 𝑅9. This technique also helped in reducing the size of the required data. For instance, if we estimated the block containing 𝑅0, 𝑅1 and 𝑅2, even more transitions of 𝑅2 would be added when we added some sequences containing 𝑅0 to ensure the estimation accuracy of the migration matrix.

Discussion 8: The exponentially increasing bucket PDs.

The goal of changing the linearly increasing bucket PDs to exponentially increasing bucket PDs was to check whether the value of the difference between bucket PDs affects the HMM estimation. In our original bucketing system, the increment of bucket PDs was 0.04, which accounted for a decreasing proportion with higher credit rating grades. By contrast, the absolute increments between the exponentially increasing bucket PDs would be bigger, which we assumed might enable HMM to distinguish different PDs better. However, as shown above, the HMM still can only apply on blocks containing low credit grades, which means the length of the migration sequences impact more on HMM estimation. A possible solution is to adjust the threshold of the bucketing system to reduce the highest acceptable PD from 0.4 to 0.2. This would enable us to obtain longer credit rating sequences.

79

9 Further Research

Four directions for further research are suggested below.

Direction 1: Modifying the artificial bank simulation.

In our conclusions and discussions, we mentioned that the HMM methodology is not working on the blocks containing high PD credit rating grades. Then, the next step could be modifying the threshold of the artificial bank simulation to mimic a specific portfolio with longer credit migration sequences for clients and lower PDs. A mortgage portfolio could be a good choice for further research because generally clients will be in a mortgage for more than ten years. Dutch mortgages also typically have low mortgage rates. This would enable us to obtain the credit migration sequences with sufficient information about the transitions among the internal rating grades.

Direction 2: The quality of the PD prediction model.

When fitting the HMM, we assumed that all the PD prediction models were perfect, which means that the ‘true’ emission matrix between the hidden states and the observation signals would be identity. However, this cannot always be the case in the real world. Since there would be some hidden factors that affect the PDs, we were unable to perfectly predict all the PDs based on the models we built. In our research, the emission matrix is identity, meaning that each hidden state emit a signal with a 1-to-1 correspondence to that hidden states,

which can be illustrated in Figure 43 (columns indexes show the observation signals, row index shows a hidden state).

𝑅0 𝑅1 𝑅2 𝑅3 𝑅4

𝑅2 0 0 1 0 0

Figure 43. A row of the example of an identity emission matrix

With the consideration of the quality of the PD prediction model, the emission probabilities might decrease from diagonal to both sides, as shown in Figure 44. This is an example of a possible emission matrix, the distribution of the observation signals given a hidden state depends on the PD prediction model that we use.

𝑅0 𝑅1 𝑅2 𝑅3 𝑅4

𝑅2 0.05 0.15 0.6 0.15 0.05

Figure 44. A row of the example of an emission matrix considering the quality of PD models

Also, we could build up a miscalibrated model by having the signals asymmetric and with an average that is not equal to the underlying state. It is possible to calibrate this model by determining the emission matrix by using HMM.

80

Direction 3: Economic fluctuations.

In our research, we implemented the HMM using a simulated artificial bank having clients with PDs based on stably changing factor values. However, in the real world, the drifts of factor values somehow depend on the economic situation. For example, COVID-19 has caused drastic fluctuations in the economy, leading to significant losses to the catering industry, the aviation industry, the tourism industry, and so on. In this case, it is possible to see a sharp increase in the PD value of previously creditworthy clients. Therefore, we need to test the robustness of the HMM to see whether the HMM will perform well or not under different economic conditions.

Direction 4: The Partially Observable Hidden Markov Model (POHMM)

In our simulation, we set the transition probabilities among the default states, cured state, and the liquidation states and assume that these hidden state will 100% emit to the same signal states so that we can obtain the credit rating migration sequences which are in line with the real-world sequences. However, once clients get defaulted, banks will know their states for sure unless these clients move back to performing states again. This means that there is no need for us to have signals for default states or to estimate the emission probability for default state and liquidation state. We can directly observe the ‘true’ state of defaulters. This could be explained by an example of a binary tree. A full binary tree with height 3 has 23 paths, but if we know the value of some nodes, these branches can be cut to reduce the number of paths. Compared with the classic HMM, POHMM will be a better choice for migration matrix estimation assuming that we can directly know part of the hidden states.

In the research of John V. Monaco [36], the POHMM is introduced as ‘an extension of the HMM in which the hidden state is conditioned on an independent Markov Chain. This structure is motivated by the presence of discrete metadata, such as an event type, that may partially reveal the hidden state but itself emanates from a separate process.’ The POHMM can be illustrated in Figure 45.

Figure 45. POHMM structure. Observed values (emission and event type) are shown in gray, hidden values are shown in white[36]

In Figure 45, the event type is shown by 𝜔; hidden state represents by 𝑍; observed signal represents by 𝑋. An independent Markov chain of the event type is given; hidden states

81

depends on not only hidden state one step before but also two known event type; Observed signals are still only depending on the corresponding hidden states. To illustrate this concept, John V. Monaco considers a POHMM with two hidden states and three event types. At each time step, the observed event type limits the system to hidden states that have been conditioned on that event type as demonstrated in Figure 46 [36].

Figure 46. POHMM example with two hidden states and three event types. Given observed event type 𝑏, the hidden state would be in one of {1𝑏, 2𝑏}; the 𝑎 observed at the next step

limits the possible transitions from {1𝑏, 2𝑏} to {1𝑎, 2𝑎}.[36].

In our research, based on the states that banks can know for sure, the event types can be defined as a set of 6 states:

{𝑃, 𝐷𝑌1, 𝐷𝑌2, 𝐷𝑌3, 𝐶, 𝐿𝑖𝑞},

where 𝑃 means ‘performing’. If the event type is not 𝑃, we can know for sure that the hidden states are the same as the event type. Thus, the corresponding hidden state space would be

{{𝑅𝑥}, {𝐷𝑌1}, {𝐷𝑌2}, {𝐷𝑌3}, {𝐶}, {𝐿𝑖𝑞}},

where 𝑥 = 0,1,2,3,4,5,6,7,8,9.

Since once clients default banks will know it and deal with it, for improving the backtest procedure, we are more interested in the migration matrix among the internal rating grades rather than the migration matrix of the whole state space. The estimation result might be improved by using POHMM, and it is an interesting topic for further research, as it describes the information known by banks better.

82

Popular Summary

Credit risk management is an important part of bank risk management. Once obligors cannot pay for their debt, they will go into default. The credit rating models are used to reflect the creditworthiness of an obligor. In terms of credit risk modeling, there are three main focuses: Probability of Default (PD), Loss Given Default (LGD), and Exposure at Default (EAD). We only consider PD in this research. According to the European Capital Requirements Regulation (CRR), PD is defined as the probability of default in the following one year. In the real world, banks won’t directly use the PDs to describe the credit quality of clients but map clients to a certain bucket of an internal rating system which is defined on a set of PD intervals. In our research, there are 10 buckets in the internal rating system.

The performance of the PD model needs to be checked by model validation techniques. Model validators have the responsibility to execute the initial validation of new or redeveloped models and to execute periodic reviews for models in use. Such validations and reviews must be independent of the interest of the business, the model owner, the model developer, and the model users, so as to assure adequate governance and compliance with regulations regarding model validation independence.

Basically, there are two main parts of model validation: model calibration backtesting and discriminatory power testing. PD model calibration backtest is to check the accuracy of PD predictions on both bucket level and model level. On the grade level, banks apply the Binomial test to check whether the Observed Default Rate (ODR) of a given grade is in line with its predicted PDs. On the model level, banks will use the Poisson binomial model to test accuracy of the whole internal rating system rather than only focusing on the prediction of one grade. Discriminatory power testing is to check whether a PD model can distinguish the creditworthy obligors from the potential defaulters, and thus assign the good clients to low credit grades. CAP and ROC are two main techniques in testing the discriminatory power of PD models.

There is a problem existing in the standard backtesting procedure. In the real world, banks irregularly update the credit rating grades of performing clients, which means the latest credit grades were probably generated 12 months ago or even longer ago. However, while conducting backtesting, banks use an incorrect assumption that the most recent ratings are still valid at the starting date of a backtest period. This incorrect assumption will reduce the backtest reliability since it is possible for clients to migrate to another credit quality grade within this invisible time interval before a backtest starts. Estimating a migration matrix of credit rating grades would enable banks to predict the possible credit quality migration of clients. The existing migration matrix estimation method is slightly incorrect because it assumes that the most recent credit ratings represent the current rating. This means that until the next credit update date, the clients would be considered to have no credit quality transitions. This is not reasonable, because banks are unable to know exactly when the credit quality changes between two rating dates. If it appears right after the most recent credit grades updating date, the migration matrix will be wrong. In addition, the periods of time between two rating activities for each client are different, which leads to the unstable quality of the migration matrix estimated by the existing method.

This thesis investigates whether or not a Hidden Markov Model (HMM) can help to obtain a good estimation of the migration matrix of credit ratings. First, we simulate an artificial bank whose ‘true’ migration matrix of ‘true’ credit ratings is known, which enables us to compare

83

the estimated migration matrix with the ‘true’ matrix to check whether HMM helps. Since in the real world, the number of credit ratings is not small (there are 10 buckets in our research), the size of data required to estimate this large migration matrix (15 × 15 in our research) is even larger than the data that banks have. Due to the CRR (Capital Requirements Regulation), we are not allowed to simply fold the credit ratings into a smaller number of buckets to solve this problem. We propose a technique to estimate the big migration matrix block by block, and this method effectively reduces the required data size.

In the real world, banks will rerate clients on an unevenly spaced schedule, while means the period between two rerating activities would be different. In our research, we will apply the HMM in two scenarios: when clients are rated monthly or irregularly. The former case is an ideal case. Since the credit ratings will be updated monthly, there would be no information loss. The estimated migration matrix is in line with the ‘true’ matrix. By contrast, the latter case is more realistic. By adding a new state, ‘non-rated’, into the observation state space, we are able to put the credit rating migration sequence of each client on a monthly grid. However, due to a lack of information, we cannot obtain the whole migration matrix but some of the transition probabilities of low credit grades. Thus, we can conclude that the HMM can be used to predict the credit rating migration at the starting date of backtesting for some specific portfolios. These portfolios should be with most of the clients in low credit grades and long performing time, such as a mortgage portfolio.

84

Reference

[1]. Regulation (EU) No 575/2013 of the European Parliament and of the Council of 26 June 2013 on prudential requirements for credit institutions and investment firms and amending Regulation (EU) No 648/2012 Text with EEA relevance

[2]. ECB, ECB Guide to internal models, General topics chapter, November 2018. [3]. Basel Committee on Banking Supervision (BCBS) (2004) Basel II: International

Convergence of Capital Measurement and Capital Standards: A Revised Framework. [4]. Basel Committee on Banking Supervision (2005a). Studies on the Validation of Internal

Rating Systems. BCBS Publications, Working Paper No. 14, Bank for International Settlements, May.

[5]. Basel Committee on Banking Supervision (2005b). An Explanatory Note on the Basel II IRB Risk Weight Functions. BCBS Publications, Bank for International Settlements, July.

[6]. Basel Committee on Banking Supervision (2010). Basel III: A global regulatory framework for more resilient banks and banking systems. BCBS Publications No. 189, Bank for International Settlements, December.

[7]. Bank for International Settlements (2014). A brief history of the Basel Committee. BIS website, www.bis.org.

[8]. Capital Requirements - CRD IV/CRR – Frequently Asked Questions". Brussels: European Commission. 12 July 2013. Retrieved 6 December 2015.

[9]. Baesens, B. “An Overview and Framework for PD Backtesting and Benchmarking.” Journal of the Operational Research Society 61.3 (2010): 359–373.

[10]. Ross, Sheldon M. Introduction to Probability Models . Twelfth edition. London: Academic Press is an imprint of Elsevier, 2019.

[11]. Wang, Y. H. “ON THE NUMBER OF SUCCESSES IN INDEPENDENT TRIALS.” Statistica Sinica 3.2 (1993): 295–312.

[12]. Tasche, Dirk. “11 - Validation of Internal Rating Systems and PD Estimates.” The Analytics of Risk Model Validation. Academic Press. 169–196.

[13]. Tasche, Dirk. Measuring the Discriminative Power of Rating Systems. Vol. 2003,01. Deutsche Bundesbank, Research Centre, 2003.

[14]. Hamerle, Alfred, Robert Rauhmeier, and Daniel Roesch. “Uses and Misuses of Measures for Credit Rating Accuracy.” SSRN Electronic Journal n. pag.

[15]. Lingo, Manuel, and Gerhard Winkler. “Discriminatory Power: An Obsolete Validation Criterion?” Journal of risk model validation 2.1 (2008): 45–71.

[16]. Blochwitz, Stefan, Marcus R. W Martin, and Carsten S Wehn. “Statistical Approaches to PD Validation.” The Basel II Risk Parameters. Berlin, Heidelberg: Springer Berlin Heidelberg, 2011. 293–309.

[17]. Engelmann, Bernd. "Testing Rating Accuracy." Credit Risk Models and Management 2004. [18]. Hong, Yili. “On Computing the Distribution Function for the Poisson Binomial Distribution.”

Computational statistics & data analysis 59 (2013): 41–51. [19]. Biscarri, William, Sihai Dave Zhao, and Robert J Brunner. “A Simple and Fast Method for

Computing the Poisson Binomial Distribution Function.” Computational statistics & data analysis 122.C (2018): 92–100.

[20]. Leonard E. Baum, and J. A. Eagon. “An Inequality with Applications to Statistical Estimation for Probabilistic Functions of Markov Processes and to a Model for Ecology.” Bulletin of the American Mathematical Society 73.3 (1967): 360–363.

85

[21]. Bahl, L et al. “Optimal Decoding of Linear Codes for Minimizing Symbol Error Rate (Corresp.).” IEEE Transactions on Information Theory 20.2 (1974): 284–287.

[22]. Rabiner, L.R. “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition.” Proceedings of the IEEE 77.2 (1989): 257–286.

[23]. Durbin, Richard. Biological Sequence Analysis Probabilistic Models of Proteins and Nucleic Acids. Cambridge, U.K: Cambridge University Press.

[24]. Schaller, Huntley, and Simon Van Norden. “Regime Switching in Stock Market Returns.” Applied Financial Economics 7.2 (1997): 177–191.

[25]. Bhar, Ramaprasad, and Shigeyuki Hamori. “Empirical Characteristics of the Permanent and Transitory Components of Stock Return: Analysis in a Markov Switching Heteroscedasticity Framework.” Economics Letters 82.2 (2004): 157–165.

[26]. Gray, Stephen F. “Modeling the Conditional Distribution of Interest Rates as a Regime-Switching Process.” Journal of Financial Economics 42.1 (1996): 27–62.

[27]. Korolkiewicz, Małgorzata Wiktoria. “A Dependent Hidden Markov Model of Credit Quality.” International Journal of Stochastic Analysis 2012.2012 (2012): 13.

[28]. Elliott, Robert J., Lakhdar. Aggoun, and John B. Moore. Hidden Markov Models Estimation and Control. 1st ed. 1995. New York, NY: Springer New York.

[29]. Hidden Markov Models in Finance. Vol. 104. Boston, MA: Springer US, 2007.. [30]. Yang, Zheng Rong. “Hidden Markov Model.” Machine Learning Approaches To

Bioinformatics. World Scientific Publishing Co. Pte. Ltd., 2010. 177–194. [31]. Bilmes, Jeff. (2000). A Gentle Tutorial of the EM Algorithm and its Application to

Parameter Estimation for Gaussian Mixture and Hidden Markov Models. Technical Report ICSI-TR-97-021, University of Berkeley. 4.

[32]. Feldman, J, I Abou-Faycal, and M Frigo. “A Fast Maximum-Likelihood Decoder for Convolutional Codes.” Vol. 1. IEEE, 2002. 371–375 vol.1.

[33]. Frontczak, Robert & Jaeger, Michael & Schumacher, Bernd. (2017). From Power Curves to Discriminative Power: Measuring Model Performance of LGD Models. Journal of Mathematical Finance. 07. 657-670. 10.4236/jmf.2017.73034.

[34]. Liu, Tingting, and Jan Lemeire. “Efficient and Effective Learning of HMMs Based on Identification of Hidden States.” Mathematical problems in engineering 2017 (2017): 1–26.

[35]. Matsuyama, Y, R Hayashi, and R Yokote. “Fast Estimation of Hidden Markov Models via Alpha-EM Algorithm.” IEEE, 2011. 89–92.

[36]. John V. Monaco and Charles C. Tappert. 2018. The partially observable hidden Markov model and its application to keystroke dynamics. Pattern Recogn. 76, C (April 2018), 449–462.

[37]. Forchhammer, S, and J Rissanen. “Partially Hidden Markov Models.” IEEE Transactions on Information Theory 42.4 (1996): 1253–1256.

86

Appendix I. The bucket plotting based on a declining number of factors

Figure 47. The bucket plotting for the PD prediction model trained based on factor 1 to factor 2 in the 189-th month dataset. ‘N’ represents the number of clients in a grade; ‘D’ represents the actual amount of defaulters for the 190-th month; ‘E(D)’ represents the theoretical expectation of the amount of defaulters in terms of bucket PDs; ‘Result’ represents the traffic light indicator whose principles are introduced in chapter 4.

87

Figure 48. The bucket plotting for the PD prediction model trained based on factor 1 to factor 4 in the 189-th month dataset. ‘N’ represents the number of clients in a grade; ‘D’ represents the actual amount of defaulters for the 190-th month; ‘E(D)’ represents the theoretical expectation of the amount of defaulters in terms of bucket PDs; ‘Result’ represents the traffic light indicator whose principles are introduced in chapter 4

88

Appendix II. The estimation of blocks containing two credit grades

We pick two adjacent performing states and DY1 and fold all other states, making the block

with rows and columns index [𝑅𝑥, 𝑅𝑥+1, 𝐷𝑌1, 𝑜𝑡ℎ𝑒𝑟 𝑏𝑢𝑐𝑘𝑒𝑡𝑠], 𝑥 = 0,1,2,3,4,5,6,7,8. After

using the balancing technique to balance the training dataset, we fit the HMMs with the

respect to different subsets of internal ratings. The results are shown as follows.

Table 21. The table of total transitions and the time cost for model fitting in terms of different blocks when the bank is rating clients monthly (parameter for hmmlearn:

n_iter=1000, tol=0.000001)

𝒙 0 1 2 3 4

Total transition

992,841 890,374 828,653 805,719 791,804

Time cost 123min 81min 256min 172min 236min

Converge True True True True True

𝒙 5 6 7 8 Total

transition 757,398 725,766 711,516 569,905


Converge True True True True

The estimations of blocks containing 2 credit grades are as follows. The left orange matrices

represent the migration matrices of blocks, while the right blue matrices are the emission

matrices. The emission matrices are the original version before relabeling the hidden states,

so the index of these matrices is [1,2,3,4].

90

Figure 49. The emission matrix (right) whose rows index as the unlabeled hidden states and the relabeled transition matrix of hidden ’true’ credit quality (left).

The PD estimations for 𝐷𝑌1 columns are not accurate. Also, the estimated migration matrices

with index [𝑅5, 𝑅6, 𝐷𝑌1 𝑜𝑡ℎ𝑒𝑟 𝑏𝑢𝑐𝑘𝑒𝑡𝑠] is not good. This is because the ‘true’ transition

probabilities for clients migrating from 𝑅5 to 𝑅6 or from 𝑅6 to 𝑅5 look too similar, so that the

HMM cannot distinguish these two hidden states. If we estimate blocks containing three

credit grades, the problem can be solved.

91

Figure 50. The estimated migration matrix of block containing 𝑅4, 𝑅5, 𝑅6

advanced backtesting probability of default predictions

Documents