fa competition (final)

Upload: junhao-ho

Post on 04-Jun-2018

  • 8/13/2019 FA Competition (Final)

    1/15

    NUS FINANCIAL ANALYTICS COMPETITION:

    BANKRUPTCY PREDICTION OF FIRMS IN

    CHINA, HONG KONG AND SINGAPORE

    USING MACHINE-LEARNING APPROACHES

    Submitted By:

    Choy Pui Yee (A0119666) | Ho Jun Hao (A0028383) | Lee Seng Yin, Daniel (A0040255) |

    Tay Yuzhong, Zeldon (A0119093)


    ABSTRACT

    For many corporations, assessing the creditworthiness of investment targets is vital to
    investment decisions. Data mining and machine learning techniques are known to be applicable
    to bankruptcy prediction and credit scoring problems. However, many default prediction models
    draw on studies of empirical data extracted from mature, Western markets. In this report, our
    team adopts an experimental approach to identify data mining techniques that might be more
    suitable for profiling Asian firms. The six classification approaches used are: ID3 Decision
    Tree (J48), Random Forest, Random Tree, Logistic Regression, Support Vector Machines and Neural
    Networks. From the experiments, the team found that Decision Tree models generate the highest
    expected returns for all countries, while Logistic Regression, SVM and Neural Network models
    produced lower Type II errors. In addition, the most-recent-year core ratios are sufficient to
    predict bankruptcy. With these findings, the team recommends that financial institutions assess
    the risk of extending loans to firms separately, based on their country. Our team recognises
    that these experiments are preliminary to developing a full default prediction system for the
    Asian markets. Future studies can extend this work towards a more complete model.


    CONTENTS

    Abstract
    1 Introduction
    2 Literature Review
        2.1 Bankruptcy Prediction
        2.2 Classification Models for Bankruptcy Prediction
            2.2.1 ID3 Decision Tree (J48)
            2.2.2 Random Forest
            2.2.3 Random Tree
            2.2.4 Logistic Regression
            2.2.5 Support Vector Machines
            2.2.6 Neural Networks (NN)
    3 Data Preparation
    4 Experimental Design
    5 Evaluation of Classification Models
        5.1 Accuracy
        5.2 Expected Returns
    6 Results and Discussion
        6.1 China
        6.2 Hong Kong
        6.3 Singapore
        6.4 Combination of China, Hong Kong and Singapore
        6.5 Key Observations from China, Hong Kong and Singapore
    7 Recommendations
    8 Conclusion
    9 References


    1 INTRODUCTION

    Understanding default likelihood is critical to credit risk management, macro policy making and
    financial regulation. Data mining techniques were applied very early on in the field of
    bankruptcy and default prediction. These machine learning techniques can process huge amounts
    of data while filtering out redundant or irrelevant information and identifying correlated
    attributes that describe the characteristics of firms likely to default [1]. These attributes
    are then used to build models that evaluate whether corporations face financial distress. For
    financial institutions, the models act as early warning systems as well as decision-making
    tools for evaluating candidate firms for collaboration or investment. Such decisions have to
    take into account the opportunity cost and the risk of failure [2]. However, many studies on
    default prediction models either examined or incorporated empirical data extracted from mature,
    Western markets as part of their sample data. In this century, with many emerging markets and
    developing firms centering their activities on the commodity-rich, emerging Asia region,
    defaulting Asian firms may share hidden combinations of characteristics. These factors may
    differ from those of their Western counterparts, and accounting for them could make default
    prediction more effective. The data mining algorithms commonly used for profiling firms on
    global samples might be more effective if a sample tailored solely to the region is used for
    modelling.

    The Credit Research Initiative (CRI) is a non-profit undertaking by the Risk Management
    Institute (RMI) at the National University of Singapore (NUS). It seeks to promote research and
    development in the critical area of credit risk. As such, it welcomes suggestions and
    improvements to this area of research and grants the public access to its database of
    firm-specific data, which covers over 60,400 listed firms around the world, for the purpose of
    such work. Its foundation is the probability of default (PD) model developed from its extensive
    database. It continually calibrates its model and has on-going work in identifying common
    company-specific attributes that are more indicative of defaults in emerging markets [3].

    Our team seeks to discover data mining techniques that may be more effective in classifying
    defaulting Asian firms, by applying six different data mining techniques to sample data of
    firms listed on three Asian bourses, taken from the RMI CRI's database. Prediction accuracy
    will be used to evaluate each technique's performance.

    2 LITERATURE REVIEW

    2.1 BANKRUPTCY PREDICTION

    Sometimes a distressed firm can continue to operate in that condition for a prolonged period of
    years. At other times, firms enter bankruptcy immediately after a highly distressing event,
    such as a major fraud. This seemingly disordered chance of default can be correlated to a
    combination of firm-specific and external economic factors. Lensberg et al. [4] reviewed much
    of the related work and categorized the various factors that potentially affect bankruptcy.


    In the past, Beaver [5, 6] used financial ratios as the input variables of linear regression
    models for firm bankruptcy classification. Altman [7] was one of the first to apply the
    classical multivariate discriminant analysis technique. On the other hand, many recent studies
    focus on using data mining techniques [8] for bankruptcy prediction. Other groups of
    researchers showed that data mining models that require less financial knowledge, such as
    neural networks, outperform statistical approaches such as logistic regression, linear
    discriminant analysis and multiple discriminant analysis, which rely on financial ratios and
    statistical rules [9-12, 30].

    Table 1 below compares related studies published from 2001 to 2008 and the models they built.
    Many of the studies emphasized designing more sophisticated classifiers.

    Table 1. Summary of related studies

    Author(s)                    Feature Selection   Prediction Models
    Atiya (2001)                 Yes                 Neural Networks
    Lee et al. (2002)            No                  Discriminant Analysis & Neural Network
    Malhotra and Malhotra        No                  Fuzzy Logic & Neural Networks
    McKee and Lensberg (2002)    No                  Genetic Algorithms
    Shin and Lee (2002)          Yes                 Genetic Algorithms
    Kim and Han (2003)           No                  Genetic Algorithms
    Huang et al. (2004)          Yes                 SVM
    Canbas et al. (2005)         Yes                 Discriminant Analysis & Logistic Regression
    Lee et al. (2005)            No                  Self-organizing Maps
    Min and Lee (2005)           Yes                 Support Vector Machines
    Ong et al. (2005)            No                  Neural Networks & Discriminant Analysis
    Shin et al. (2005)           Yes                 Support Vector Machines
    Gestel et al. (2006)         No                  Support Vector Machines
    Huysmans et al. (2006)       No                  Self-organizing Maps & Neural Networks
    Lensberg et al. (2006)       No                  Genetic Algorithms
    Min et al. (2006)            No                  Genetic Algorithms & Support Vector Machines
    Tsakonas et al. (2006)       No                  Neural Logic Networks & Genetic Algorithms
    Wu et al. (2007)             No                  Genetic Algorithms & Support Vector Machines
    Tsai and Wu (2008)           No                  Neural Networks

    2.2 CLASSIFICATION MODELS FOR BANKRUPTCY PREDICTION

    2.2.1 ID3 Decision Tree (J48)

    Instead of generating a decision rule in the form of a discriminant function, the ID3 algorithm
    produces a decision tree that classifies the training sample by using the entropy measure
    [13, 14]. This is an inductive machine learning method and has been applied to many business
    classification problems, including credit scoring [15], corporate failure prediction [16],
    stock portfolio construction [17], stock market behaviour prediction [18] and bankruptcy risk
    prediction [19]. For our project, our team used the open source machine learning program Weka
    3.6, which provides the J48 decision tree algorithm. J48 is Weka's implementation of C4.5, an
    extension of the ID3 algorithm developed by Ross Quinlan.
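    As a rough illustration of the entropy measure mentioned above, the sketch below computes
    Shannon entropy and the ID3-style information gain of a binary split. The function names and
    the toy labels are assumptions made for illustration and are not part of the team's Weka
    workflow; C4.5 (and hence J48) uses a gain-ratio variant of this quantity.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy reduction from splitting `parent` into `left` and `right`."""
    total = len(parent)
    weighted = ((len(left) / total) * entropy(left)
                + (len(right) / total) * entropy(right))
    return entropy(parent) - weighted

# Toy example: 4 defaulting (True) and 4 non-defaulting (False) firms,
# perfectly separated by a hypothetical ratio threshold.
parent = [True] * 4 + [False] * 4
gain = information_gain(parent, [True] * 4, [False] * 4)
```

    A split that separates the two classes completely removes all of the parent node's entropy,
    which is why the tree-growing step prefers it over a split that leaves both children mixed.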

    2.2.2 Random Forest

    Random Forests (RF) are classification algorithms developed by Breiman [20]. They use an
    ensemble of classification trees [8, 21] in the construction of the model. The concept of RF is
    to combine many binary decision trees, each learnt from a bootstrap sample drawn from the main
    sample, with a further random subset of explanatory variables chosen at each node. RFs rank
    variables by a variable importance index [19], and so can suggest the significance of a
    variable based on the classification accuracy, while considering the interaction among
    variables. The algorithm estimates the importance of a variable by looking at how much the
    prediction error increases when the values of that variable in the out-of-bag data (the data
    not present in the bootstrap sample) are permuted while all other variables are left unchanged.
    The necessary calculations are carried out tree by tree as the RF is constructed. Typically,
    the rank order of the importance score is reported [23].
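    The two sources of randomness described above can be sketched as follows. This is a minimal
    illustration under assumed names (`bootstrap_sample`, `random_feature_subset`) and a made-up
    row count, not the Weka implementation; it only shows how the in-bag rows, the out-of-bag rows
    and the per-node candidate variables are drawn.

```python
import random

def bootstrap_sample(n_rows, rng):
    """Draw n_rows indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    out_of_bag = sorted(set(range(n_rows)) - set(in_bag))
    return in_bag, out_of_bag

def random_feature_subset(features, k, rng):
    """Pick k candidate features (without replacement) for one node split."""
    return rng.sample(features, k)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
features = ["SALES_GROWTH", "OPER_MARGIN", "RETURN_ON_ASSET", "FNCL_LVRG"]
in_bag, out_of_bag = bootstrap_sample(10, rng)
candidates = random_feature_subset(features, 2, rng)
```

    The out-of-bag rows are the ones a tree never saw during training, so they are the rows whose
    values would be permuted when estimating variable importance.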

    2.2.3 Random Tree

    The Random Tree algorithm constructs a decision tree that considers K randomly chosen
    attributes at each node. It performs no pruning and can estimate class probabilities based on a
    hold-out set. The Random Tree is usually used as a 'building block' of RFs, with many Random
    Trees coming together to make an RF as mentioned above. Generally, a Random Tree on its own
    tends to be too weak a classifier and needs to be combined in an ensemble to become strong
    enough. Little research has used it as a standalone technique to evaluate bankruptcy data, but
    our team decided to test it because of its simplicity.

    2.2.4 Logistic Regression

    Logistic Regression has been the classic answer to credit default analysis problems for many
    years. Ohlson [24] was the first to apply Multiple Logistic Regression Analysis (Logit) to
    failure prediction, claiming that the model was superior to multiple discriminant analysis
    (MDA) because it is less constrained by assumptions of statistical normality. He successfully
    developed the model with nine predictors (7 financial ratios and 2 categorical variables), and
    many subsequent studies built on his work by using Logit analysis instead of MDA [24-27].
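    For reference, the standard form of the logit model is given below. The symbols are generic
    textbook notation rather than anything taken from the report: x is the vector of predictors
    (for example, financial ratios), and the coefficients are estimated from the training data.

```latex
P(\text{default} \mid x) \;=\; \frac{1}{1 + e^{-(\beta_0 + \beta^{\top} x)}},
\qquad
\log \frac{P(\text{default} \mid x)}{1 - P(\text{default} \mid x)} \;=\; \beta_0 + \beta^{\top} x
```

    The second expression shows that the log-odds of default are linear in the predictors, which
    is what makes the fitted coefficients straightforward to interpret.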

    2.2.5 Support Vector Machines

    Support Vector Machines (SVM) are derived from statistical learning theory and follow the
    structural risk minimization principle [28, 29]. The basic idea of SVM is to define a
    hyperplane that geometrically separates binary classes in a high-dimensional space. The optimal
    hyperplane is obtained by maximizing the margin between the data points of the two classes,
    whereby a structural risk minimum is achieved [31, 32].
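    The margin maximization described above is usually written as the following optimization
    problem. This is the textbook hard-margin form for linearly separable data, with class labels
    y_i in {-1, +1}; the notation is assumed here and does not come from the report.

```latex
\min_{w,\, b} \;\; \frac{1}{2}\,\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \,\bigl(w^{\top} x_i + b\bigr) \;\ge\; 1, \qquad i = 1, \dots, n
```

    The separating hyperplane is the set of points where w^T x + b = 0, and the margin between the
    two classes equals 2 / ||w||, so minimizing ||w|| maximizes the margin.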

    2.2.6 Neural Networks (NN)

    Studies on bankruptcy prediction using non-linear NNs have been published since 1990, and the
    area remains active. NNs have generally outperformed the other existing methods. Currently,
    several of the major commercial loan default prediction products are based on NNs. For example,
    Moody's Public Firm Risk Model [33] is based on NNs as the main technology. Many banks have
    also developed and are using proprietary NN default prediction models.


    3 DATA PREPARATION

    The team opted to use Core Ratios (CR) as the financial data to build the classifier models.
    With CR data, meaningful comparisons can be made between companies of different sizes. Of the
    available CR data, only the most up-to-date annual report data are selected. In addition, only
    CR attributes that are more than 90% complete in all three countries studied are selected. The
    45 CR attributes used are listed in Table 2. As classifier accuracy may be adversely affected
    by missing values, missing values are replaced with the country mean.
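    The country-mean replacement can be sketched as below. The row layout (a list of dictionaries
    with a "country" key and None marking a missing value), the function name and the sample
    figures are all assumptions for illustration; the report does not describe the tooling used
    for this step.

```python
from collections import defaultdict

def impute_country_mean(rows, attribute):
    """Return copies of rows with missing `attribute` values (None)
    replaced by the mean of that attribute within the same country."""
    sums, counts = defaultdict(float), defaultdict(int)
    for row in rows:
        value = row[attribute]
        if value is not None:
            sums[row["country"]] += value
            counts[row["country"]] += 1
    filled = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        if new_row[attribute] is None and counts[row["country"]] > 0:
            new_row[attribute] = sums[row["country"]] / counts[row["country"]]
        filled.append(new_row)
    return filled

# Made-up values for illustration only.
rows = [
    {"country": "SG", "OPER_MARGIN": 0.10},
    {"country": "SG", "OPER_MARGIN": None},
    {"country": "SG", "OPER_MARGIN": 0.30},
    {"country": "HK", "OPER_MARGIN": 0.50},
    {"country": "HK", "OPER_MARGIN": None},
]
filled = impute_country_mean(rows, "OPER_MARGIN")
```

    Computing the mean per country rather than over the pooled data keeps one market's typical
    ratio levels from leaking into another's records.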

    Table 2. List of selected CR attributes

    CR Attributes 1 to 15         CR Attributes 16 to 30       CR Attributes 31 to 45
    SALES_GROWTH                  PRETAX_MARGIN                TOT_DEBT_TO_COM_EQY
    ASSET_GROWTH                  TRAIL_12M_SALES_PER_SH       TOT_DEBT_TO_TOT_EQY
    ASSET_TURNOVER                TRAIL_12M_EPS_BEF_XO_ITEM    TOT_DEBT_TO_TOT_ASSET
    OPER_MARGIN                   CASH_AND_ST_INVESTMENTS      NET_DEBT_TO_SHRHLDR_EQTY
    PRETAX_INC_TO_NET_SALES       TRAIL_12M_NET_SALES          SHORT_AND_LONG_TERM_DEBT
    PROF_MARGIN                   TRAIL_12M_OPER_INC           NET_DEBT
    RETURN_ON_ASSET               TRAIL_12M_EPS                NET_CHNG_ST_DEBT
    TAX_BURDEN                    TRAIL_12M_CASH_FROM_OPER     NET_CHANGE_LIABILITIES
    FNCL_LVRG                     TRAIL_12M_NET_INC            INCR_IN_LIAB_PCT_OF_TOT
    REVENUE_PER_SH                TOT_COMMON_EQY               NET_CHANGE_TOTAL_EQUITY
    OPER_INC_PER_SH               TOT_DEBT_TO_TOT_CAP          INCR_IN_EQY_PCT_OF_TOT
    PRETAX_INC_PER_SH             ASSET_TO_EQY                 COM_EQY_TO_TOT_ASSET
    CONT_INC_PER_SH               LT_DEBT_TO_COM_EQY           CASH_TO_TOT_ASSET
    CASH_ST_INVESTMENTS_PER_SH    LT_DEBT_TO_TOT_CAP           ACCT_RCV_TO_TOT_ASSET
    BOOK_VAL_PER_SH               LT_DEBT_TO_TOT_ASSET         TRAIL_12M_INC_BEF_XO_ITEM

    For each instance of the CR data, five default indicators, numbered 1 to 5 (B1 to B5), are
    appended. The number represents the years of consideration counted back from the latest credit
    event. A CR instance is assigned a TRUE indicator if the company recorded a credit event listed
    in Table 3 and the instance falls within the period of consideration. For example, if a company
    recorded a credit event listed in Table 3 and has annual report filings from FY 2000 to 2004,
    the indicators appended to the CR data will be as follows:

    Figure 1. Illustration of class indicators

    FY YEAR   B1      B2      B3      B4      B5
    2000      FALSE   FALSE   FALSE   FALSE   TRUE
    2001      FALSE   FALSE   FALSE   TRUE    TRUE
    2002      FALSE   FALSE   TRUE    TRUE    TRUE
    2003      FALSE   TRUE    TRUE    TRUE    TRUE
    2004      TRUE    TRUE    TRUE    TRUE    TRUE
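    The pattern in Figure 1 can be reproduced with a short sketch. It assumes that the most recent
    filing year is the reference point for the credit event, so that indicator Bk is TRUE for the
    k most recent filings of a company with a recorded credit event; the function name and this
    reading of the figure are assumptions, not the team's documented procedure.

```python
def class_indicators(fiscal_years, has_credit_event, horizons=(1, 2, 3, 4, 5)):
    """Map each fiscal year to a dict {"B1": bool, ..., "B5": bool}."""
    last_year = max(fiscal_years)
    table = {}
    for year in fiscal_years:
        years_back = last_year - year  # 0 for the most recent filing
        table[year] = {
            f"B{k}": has_credit_event and years_back < k for k in horizons
        }
    return table

# Company from the example above: filings for FY 2000 to FY 2004.
indicators = class_indicators(range(2000, 2005), has_credit_event=True)
```

    Under this reading, a company with no recorded credit event receives FALSE for every indicator
    in every year.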


    Table 3. Credit events that are classified as default events

    Action Type Subcategory
    Delisting Filing Type: Administration
    Delisting Filing Type: Canadian CCAA
    Delisting Filing Type: Chapter 11
    Delisting Filing Type: Judicial Management
    Delisting Filing Type: Liquidation
    Delisting Filing Type: Protection
    Delisting Filing Type: Receivership
    Delisting Filing Type: Reorganization
    Delisting Filing Type: Restructuring
    Delisting Filing Type: Unknown
    Delisting Filing Type: Winding Up
    Bankruptcy Filing Reason: Bankruptcy
    Bankruptcy Filing Reason: Coupon & principal payment
    Bankruptcy Filing Reason: Coupon payment only
    Bankruptcy Filing Reason: Debt Restructuring
    Bankruptcy Filing Reason: Interest payment
    Bankruptcy Filing Reason: Loan payment
    Bankruptcy Filing Reason: Principal payment
    Bankruptcy Filing Reason: Unknown
    Bankruptcy Filing Reason for delisting: Bankruptcy

    Noting that some industries may inherently be riskier than others, an attribute to account for
    the industry is appended to the data. The BICS sector attribute from the company information,
    which indicates the industry that the company operates in, is selected. The structure of the
    final data is shown below in Table 4.

    Table 4. Structure of data for model building

    Data source             Company Information      Fundamentals CR                          Credit Events
    Selected attributes     BICS_SECTOR              45 CR attributes (see Table 2)           Action type, subcategory
                                                                                              and date
    Data preparation        -                        1. Selected most updated annual          Created 5 class indicators
    conducted                                           report data
                                                     2. Selected only attributes that are
                                                        more than 90% complete in all
                                                        three countries
                                                     3. Replaced missing values with
                                                        country mean
    Output                  BICS_SECTOR attribute    45 CR attributes that are 100%           5 class indicators
                                                     complete


    Different combinations of countries' data were also explored. The summary of the class
    distribution for each combination is shown below.

    Table 5. Class distribution of each combination
    (ND = Non-Default, D = Default; B1 to B5 are the default indicators)

    Combination       Number of    Class    B1        B2        B3        B4        B5
    of countries      instances
    Singapore (SG)    6828         ND       6795      6764      6738      6716      6695
                                   D        33        64        90        112       133
                                   D (%)    0.48%     0.94%     1.32%     1.64%     1.95%
    Hong Kong (HK)    12239        ND       12196     12157     12121     12090     12063
                                   D        43        82        118       149       176
                                   D (%)    0.35%     0.67%     0.96%     1.22%     1.44%
    China (CN)        21091        ND       20850     20624     20410     20205     20006
                                   D        241       467       681       886       1085
                                   D (%)    1.14%     2.21%     3.23%     4.20%     5.14%
    SG and HK         19067        ND       18991     18921     18859     18806     18758
                                   D        76        146       208       261       309
                                   D (%)    0.40%     0.77%     1.09%     1.37%     1.62%
    SG and CN         27919        ND       27645     27388     27148     26921     26701
                                   D        274       531       771       998       1218
                                   D (%)    0.98%     1.90%     2.76%     3.57%     4.36%
    HK and CN         33330        ND       33046     32781     32531     32295     32069
                                   D        284       549       799       1035      1261
                                   D (%)    0.85%     1.65%     2.40%     3.11%     3.78%
    SG, HK and CN     40158        ND       39841     39545     39269     39011     38764
                                   D        317       613       889       1147      1394
                                   D (%)    0.79%     1.53%     2.21%     2.86%     3.47%
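    The percentages in Table 5 follow directly from the counts. The sketch below recomputes the B1
    default rates for the three single-country datasets and for the pooled dataset; the counts are
    copied from Table 5, while the function and variable names are illustrative assumptions.

```python
def default_rate(defaults, instances):
    """Share of instances flagged as default, as a percentage."""
    return 100.0 * defaults / instances

# B1 counts taken from Table 5: (number of instances, number of defaults).
b1_counts = {"SG": (6828, 33), "HK": (12239, 43), "CN": (21091, 241)}

rates = {c: round(default_rate(d, n), 2) for c, (n, d) in b1_counts.items()}

# Pooling countries adds the counts; it does not average the rates.
pooled_n = sum(n for n, _ in b1_counts.values())
pooled_d = sum(d for _, d in b1_counts.values())
pooled_rate = round(default_rate(pooled_d, pooled_n), 2)
```

    Every dataset is heavily imbalanced: under the B1 indicator, fewer than 2 in 100 instances are
    defaults, which matters when comparing error proportions across models.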

    4 EXPERIMENTAL DESIGN

    The classification models used in our project experiments are: ID3 Decision Tree (J48), Random
    Forest, Random Tree, Logistic Regression, Support Vector Machines and Neural Networks. The
    three key questions that the team aims to address are:

    a) How many years of data from a firm are required for accurate prediction of default?
    b) Which is the most appropriate data mining algorithm for a particular country-specific
       dataset?
    c) How accurate and reliable is each data mining technique in predicting default for each
       dataset?

    We ran the six algorithms on each of the three datasets and tabulated their confusion matrix
    results to obtain prediction accuracy. Each experiment was repeated with different bankruptcy
    indicators from B1 to B5, to ascertain the number of years of data required for accurate
    default prediction. We further combined the datasets and re-ran the tests to see whether
    different combinations of datasets cause the algorithms to perform differently.
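    The planned set of runs for the three single-country datasets can be enumerated as in the
    sketch below. The short labels are assumptions chosen for illustration; the models themselves
    were built in Weka, and this snippet only lists the combinations.

```python
from itertools import product

algorithms = ["J48", "Random Forest", "Random Tree",
              "Logistic Regression", "SVM", "NN (MLP)"]
datasets = ["CN", "HK", "SG"]
indicators = ["B1", "B2", "B3", "B4", "B5"]

# One run per (dataset, algorithm, indicator) combination.
runs = list(product(datasets, algorithms, indicators))
runs_per_dataset = len(algorithms) * len(indicators)
```

    Six algorithms and five indicators give 30 combinations per dataset, before any of the
    combined datasets are added.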


    Lastly, in order to gain a better perspective of how the results can help in loan decisions, we
    applied a cost function to the confusion matrix of each result to simulate the expected gains
    of prediction. We assumed that good loans return an optimistic gain of 6% interest, while bad
    loans return the worst outcome of losing 100% of the investment along with the opportunity cost
    of 6%. Loans to healthy firms that were not made because the algorithm wrongly classified the
    firm as bankrupt incur an opportunity cost of 6% interest.

    5 EVALUATION OF CLASSIFICATION MODELS

    The team built more than 30 models for each financial market. To identify the most appropriate
    model, the team compared the models across the six classification approaches, using accuracy
    and expected returns as key performance measures.

    5.1 ACCURACY

    The team determined accuracy based on the proportion of incorrectly classified cases. Two types
    of errors may occur during classification: Type I and Type II errors. Type I errors occur when
    the model classifies a non-bankrupt firm into the bankruptcy group. These errors represent a
    potential loss of interest revenue for financial institutions. Type II errors, on the other
    hand, occur when the model classifies a bankrupt firm into the non-bankruptcy group. Type II
    errors are more costly to financial institutions, as they might be unable to recover the
    principal amount. Hence, we placed a higher emphasis on Type II errors. A model is considered
    to have high accuracy if it has a low proportion of Type II cases, so lower values of this
    measure are better.
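    The two error types can be read off a confusion matrix as sketched below. The report does not
    state the denominator it used, so expressing each error as a share of all classified cases is
    an assumption here, as are the function name and the made-up counts.

```python
def error_rates(tp, fp, tn, fn):
    """Type I and Type II errors as proportions of all classified cases.

    tp: bankrupt firms predicted bankrupt
    fp: non-bankrupt firms predicted bankrupt (Type I)
    tn: non-bankrupt firms predicted non-bankrupt
    fn: bankrupt firms predicted non-bankrupt (Type II)
    """
    total = tp + fp + tn + fn
    return {"type_1": fp / total, "type_2": fn / total}

# Made-up counts for illustration only (not results from the report).
rates = error_rates(tp=30, fp=50, tn=900, fn=20)
```

    With these counts, 5% of cases are Type I errors and 2% are Type II errors; the second figure
    is the one this report emphasises.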

    5.2 EXPECTED RETURNS

    The team calculated the expected returns for each model using the costs assigned to each of the
    four possible outcomes: correctly classified bankruptcy, incorrectly classified bankruptcy,
    correctly classified non-bankruptcy and incorrectly classified non-bankruptcy.

    Table 6: Cost Assignment

                              Bankruptcy   Non-Bankruptcy
    Correctly Classified      0%           6%
    Incorrectly Classified    -6%          -106%
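    A minimal sketch of how the costs in Table 6 could be applied to a confusion matrix follows.
    It reads the table's columns as the predicted class, so that a healthy firm wrongly classified
    as bankrupt costs 6% and a bankrupt firm wrongly classified as healthy costs 106%; that
    reading, the function name and the counts are assumptions for illustration.

```python
# Per-loan payoffs from Table 6, in percent of the loan amount.
PAYOFF = {
    "tp": 0.0,     # bankrupt firm correctly rejected: no loan, nothing lost
    "tn": 6.0,     # healthy firm correctly funded: earns 6% interest
    "fp": -6.0,    # healthy firm wrongly rejected: forgone 6% interest
    "fn": -106.0,  # bankrupt firm wrongly funded: principal plus forgone 6%
}

def expected_return(tp, fp, tn, fn):
    """Average payoff per firm (in percent) over a confusion matrix."""
    counts = {"tp": tp, "fp": fp, "tn": tn, "fn": fn}
    total = sum(counts.values())
    return sum(PAYOFF[k] * n for k, n in counts.items()) / total

# Made-up counts for illustration only (not results from the report).
perfect = expected_return(tp=10, fp=0, tn=990, fn=0)
with_errors = expected_return(tp=5, fp=10, tn=980, fn=5)
```

    Even a classifier with no errors earns slightly less than 6% per firm under this scheme,
    because correctly rejected firms contribute nothing, and a handful of Type II errors pulls the
    average down quickly.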

    6 RESULTS AND DISCUSSION

    6.1 CHINA

    Amongst all the classification models, the decision tree models (i.e. J48, Random Forest,
    Random Tree) yielded the highest accuracy, with no Type II errors. This implies that the models
    developed are highly accurate in bankruptcy prediction. The expected returns are also higher
    for decision tree models, with the Random Forest model generating an expected return of close
    to 6%. The financial institution can expect an average return of 5.8% using the decision tree
    models.

    Table 7: China

    Model                  B1      B2      B3      B4    B5
    Accuracy
    J48                    0.000   0.000   0.000   N/A   N/A
    Random Forest          0.000   0.000   0.000   N/A   N/A
    Random Tree            0.000   0.000   0.000   N/A   N/A
    Logistic Regression    0.005   0.010   0.016   N/A   N/A
    SVM                    0.007   0.014   0.021   N/A   N/A
    NN(MLP)                0.004   0.002   0.009   N/A   N/A
    Expected Returns (%)
    J48                    5.80    5.58    5.52    N/A   N/A
    Random Forest          5.91    5.79    5.71    N/A   N/A
    Random Tree            5.79    5.65    5.44    N/A   N/A
    Logistic Regression    2.16    1.56    1.03    N/A   N/A
    SVM                    1.96    0.79    -0.15   N/A   N/A
    NN(MLP)                4.26    3.99    3.17    N/A   N/A

    N/A: Not Available

    6.2 HONG KONG

    The Support Vector Machine (SVM) model produced the lowest proportion of Type II errors.
    However, the expected returns from this model are also the lowest compared to the other models.
    On the other hand, the decision tree models yielded higher expected returns, ranging from 5.44%
    to 5.57%.

    Table 8: Hong Kong

    Model                  B1       B2       B3       B4       B5
    Accuracy
    J48                    0.0036   0.0073   0.0103   0.0130   0.0147
    Random Forest          0.0038   0.0074   0.0106   0.0131   0.0147
    Random Tree            0.0039   0.0074   0.0110   0.0130   0.0140
    Logistic Regression    0.0038   0.0069   0.0102   0.0132   0.0159
    SVM                    0.0035   0.0071   0.0105   0.0144   0.0163
    NN(MLP)                0.0039   0.0075   0.0102   0.0122   0.0138
    Expected Returns (%)
    J48                    5.48     5.06     4.73     4.38     4.18
    Random Forest          5.57     5.16     4.80     4.51     4.32
    Random Tree            5.44     5.03     4.43     4.42     4.18
    Logistic Regression    2.22     1.98     2.15     2.27     1.56
    SVM                    0.24     0.06     1.33     1.07     1.01
    NN(MLP)                4.64     1.66     2.68     2.37     1.84

    6.3 SINGAPORE

    The SVM model has the lowest proportion of Type II errors, followed by the Logistic Regression
    model. These models, however, produced very low expected returns. Despite their lower accuracy,
    the decision tree models work best in generating a good return on loans.

    Table 9: Singapore

    Model                  B1       B2       B3       B4       B5
    Accuracy
    J48                    0.0055   0.0110   0.0121   0.0159   0.0160
    Random Forest          0.0049   0.0098   0.0132   0.0147   0.0171
    Random Tree            0.0050   0.0097   0.0091   0.0158   0.0162
    Logistic Regression    0.0034   0.0104   0.0142   0.0147   0.0153
    SVM                    0.0033   0.0090   0.0146   0.0166   0.0162
    NN(MLP)                0.0041   0.0107   0.0136   0.0134   0.0143
    Expected Returns (%)
    J48                    5.27     4.46     4.11     4.00     3.98
    Random Forest          5.44     4.90     4.51     4.34     4.08
    Random Tree            5.08     4.40     4.62     3.80     3.82
    Logistic Regression    2.38     1.62     1.21     1.58     1.12
    SVM                    0.92     1.19     0.63     0.31     0.47
    NN(MLP)                5.06     4.34     3.05     2.06     1.14

    6.4 COMBINATION OF CHINA, HONG KONG AND SINGAPORE

    The team combined the datasets for China, Hong Kong and Singapore to predict bankruptcy for
    Asia as a whole. The logistic regression model works best for Asia in minimising the number of
    Type II errors. However, the expected returns are better for decision tree models.

    Table 10: China, Hong Kong and Singapore

    Model                  B1       B2       B3       B4       B5
    Accuracy
    J48                    0.0041   0.0089   0.0128   0.0150   0.0173
    Random Forest          0.0046   0.0088   0.0125   0.0152   0.0180
    Random Tree            0.0046   0.0086   0.0125   0.0156   0.0178
    Logistic Regression    0.0040   0.0081   0.0102   0.0130   0.0165
    SVM                    0.0045   0.0102   0.0121   0.0153   0.0168
    NN(MLP)                0.0042   0.0080   0.0101   0.0123   0.0153
    Expected Returns (%)
    J48                    5.35     4.83     4.36     4.08     3.76
    Random Forest          5.49     5.01     4.59     4.26     3.97
    Random Tree            5.31     4.85     4.33     4.01     3.69
    Logistic Regression    1.49     1.18     1.51     1.03     0.68
    SVM                    1.28     -0.26    0.95     0.75     -0.64
    NN(MLP)                1.30     -0.25    0.88     1.06     -0.46

    6.5 KEY OBSERVATIONS FROM CHINA, HONG KONG AND SINGAPORE

    The team made the following observations when comparing the models of China, Hong Kong and
    Singapore:

    a) Decision Tree models generate high expected returns for all countries. Currently, many
       financial institutions use logistic regression and other statistical approaches in
       predicting bankruptcy. Our models seem to suggest that classification using the decision
       tree approaches yields better outcomes in terms of expected returns. In particular, the
       Random Tree models work well across financial markets.

    b) Logistic Regression, SVM and Neural Network models have lower Type II errors. These models
       are better at preventing bad loans.

    c) Bankruptcy of firms in China is predicted accurately. Of the three countries studied, China
       yielded the highest accuracy and expected returns. This could be due to a more mature loan
       market, where the financial institutions are relatively accurate in predicting bankruptcy
       of firms.

    d) It is better to build the models for different countries separately. The models built using
       the combined datasets for China, Hong Kong and Singapore did not improve accuracy and
       expected returns.

    e) The most-recent-year core ratios are sufficient to predict bankruptcy. Comparing the models
       built using core ratios of different years, the models built using the most-recent-year core
       ratios (i.e. B1) yielded the best outcomes.

    7 RECOMMENDATIONS

    Financial institutions should assess the credit risk of firms from different countries
    separately. The decision tree models work best for the China market, while the SVM model
    predicts bankruptcies better for Hong Kong and Singapore. Risk-seeking financial institutions
    can maximise their returns using the Random Forest model. This model yielded the highest
    expected returns for all countries.

    8 CONCLUSION

    Due to bankruptcy risk, numerous studies have attempted to develop credit risk or default
    prediction models by using statistical methods or machine learning algorithms. Much of the
    research literature is based on datasets that included company data from developed, Western
    markets. Through our team's experiments with six different data mining techniques and datasets
    of company core ratios obtained from three different Asian markets (Singapore, Hong Kong and
    China), we discovered that, with just one year of company data, decision trees are more
    effective in classifying default. Country-specific data for Asia should also be studied on a
    standalone basis. Our team recognises that these experiments are preliminary to developing a
    full default prediction system for the Asian markets. Future studies can extend this work
    towards a more complete model. For example, features such as externally driven events (interest
    rates, commodity prices, government restrictions) that cause changes in company policies or
    balance sheet values should also be studied, so that profiling and company assessment are not
    limited to firm-level data. Nevertheless, we hope that this work is sufficient to spur future
    studies.


    9 REFERENCES

    [1] J. Yang, S. Olafsson, Optimization-based feature selection with adaptive instance sampling,
        Computers & Operations Research 33 (11) (2006) 3088-3106.
    [2] A.I. Dimitras, S.H. Zanakis, C. Zopounidis, A survey of business failures with an emphasis
        on prediction methods and industrial applications, European Journal of Operational
        Research 90 (1996) 487-513.
    [3] RMI (2013), NUS-RMI Credit Research Initiative Technical Report Version: 2013 Update 2b,
        Global Credit Review 3, pp105
    [4] T. Lensberg, A. Eilifsen, T.E. McKee, Bankruptcy theory development and classification via
        genetic programming, European Journal of Operational Research 169 (2006) 677-697.
    [5] W.H. Beaver, Financial ratios as predictors of failure, Journal of Accounting Research 4
        (1966) 71-102.
    [6] W.H. Beaver, Alternative accounting measures as predictors of failure, Accounting Review
        43 (1) (1968) 113-122.
    [7] E.I. Altman, Financial ratios, discriminant analysis and the prediction of corporate
        bankruptcy, Journal of Finance 23 (1968) 589-609.
    [8] T. Hastie, R. Tibshirani, J.H. Friedman, The Elements of Statistical Learning: Data Mining,
        Inference, and Prediction, Springer, New York, 2001.
    [9] P.R. Kumar, V. Ravi, Bankruptcy prediction in banks and firms via statistical and
        intelligent techniques - a review, European Journal of Operational Research 180 (1) (2007)
        1-28.
    [10] J.H. Min, Y.-C. Lee, Bankruptcy prediction using support vector machine with optimal
        choice of kernel function parameters, Expert Systems with Applications 28 (2005) 603-614.
    [11] K.S. Shin, T.S. Lee, H.J. Kim, An application of support vector machines in bankruptcy
        prediction model, Expert Systems with Applications 28 (2005) 127-135.
    [12] G. Zhang, M.Y. Hu, B.E. Patuwo, D.C. Indro, Artificial neural networks in bankruptcy
        prediction: general framework and cross-validation analysis, European Journal of
        Operational Research 116 (1999) 16-32.
    [13] J.R. Quinlan, Discovering Rules by Induction from Large Collection of Examples, in: D.
        Michie, Ed., Expert Systems in the Micro Electronic Age (Edinburgh University Press, 1979).
    [14] J.R. Quinlan, Induction of Decision Trees, Machine Learning 1 (1986) 81-106.
    [15] C. Carter and J. Catlett, Assessing Credit Card Applications Using Machine Learning, IEEE
        Expert (Fall 1987) 71-79.

    [16] W.F. Messier and J.V. Hansen, Inducing Rules for Expert System Development: An

    Example Using Default and Bankruptcy data, Management Science 34, No. 12 (Dec 1988)

    1403-1415.

    [17] K.Y. Tam and R. Chi, Inducing Stock Screening Rules for Portfolio Construction, Journal of

    Operations Research Society ( 1991, to appear).

    [18] H. Braun and J.S. Chandler, Predicting Stock Market Behaviour through Rule Induction:An Application of the Learning-From-Example Approach, Decision Science 18 (1987) 415-

    429.

    [19] S.B. Lee and S.H. Oh, A Comparative Study of Recursive Partitioning Algorithm and

    Analog Concept Learning System, Expert Systems with Applications 1 (1990) 403-416.

    [20] L. Breiman, Random forest, Machine Learning 45 (2001) 532

    [21] L. Breiman, Bagging predictors, Machine Learning 26 (2) (1996) 123140.

    [22] A. Liaw, M. Wiener, Classification and regression by random forest, R News 2 (3) (2002) 1822

    [23] R. Diaz-Uriarte, S. Alvarez de Andres, Gene selection and classification of microarray data

    using random forest, BMC Bioinformatics 7 (3) (2006).

    [24] Ohlson, J. A. (1980), Financial Ratios and the Probabilistic Prediction of Bankruptcy,

    Journal of Accounting Research, 18 (1), 109-31.

  • 8/13/2019 FA Competition (Final)

    15/15

    12

    [25] Zavgren, C. V. (1985), Assessing the Vulnerability to Failure of American Industrial Firms:

    A Logistic Analysis, Journal of Business Finance and Accounting, 12 (1), 19-45.

    [26] Altman, E. I., and G. Sabato (2007), Modeling Credit Risk for SMEs: Evidence from the

    U.S. Market, Abacus, 43 (3), 332-57.

    [27] Altman, E. I., G. Sabato, and N. Wilson (2008), The Value of Non -Financial Information in

    SME Risk Management, Working Paper, New York University.[28] Boser BE, Guyon IM, Vapnik VN (1992) A traininig algorithm for optimal margin classifers. In:

    Haussler D (ed) Proceedings of the 5th annual ACM workshop on computational learning

    theory. ACM Press, New York, pp 144152

    [29] Cortes C, Vapnik VN (1995) Support-vector networks. Mach Learn 20(3):273297

    [30] Huang Z, Chen H, Hsu CJ, Chen WH, Wu S (2004) Credit rating analysis with support vector

    machines and neural networks: a market comparative study. Decis Support Syst 37:543558

    [31] Vapnik VN (2000) The nature of statistical learning theory, 2nd edn. Springer, Berlin

    [32] Scholkopf B, Smola AJ (2002) Learning with kernels. MIT Press, Cambridge

    [33] Moodys Quantitative Risks Public Firm Risk Model. *Online+. Available: www.moodysqra.com