a data mining approach to the prediction of corporate failure

Post on 29-Jan-2016


Intelligent Database Systems Lab

國立雲林科技大學National Yunlin University of Science and Technology

A data mining approach to the prediction of corporate failure

Advisor : Dr. Hsu

Presenter : Ching-Wen Hong

Authors : Feng Yu Lin, Sally McClean


N.Y.U.S.T.

I. M.

Outline

Motivation

Objective

Classifier technique

The five steps of Data Mining: the SAS SEMMA methodology

Sampling

Data exploration

Data manipulation

Modelling and results

Assessment

Conclusion

My opinion

Motivation

Due to recent changes in the world economy, and as more firms, large or small, seem to fail now more than ever, corporate failure prediction is of increasing importance.

Objective

This paper uses a data mining approach to predict corporate failure.

Classifier technique

The classifiers include numerous statistical methods and machine learning methods.

The statistical methods include discriminant analysis (DA) and logistic regression (LG).

The machine learning methods include artificial neural networks (NN) and the decision tree method (C5.0).

A further method, the hybrid method, is proposed by the authors.


The five steps of Data Mining: the SAS SEMMA methodology

Sampling: The data sample consists of company financial data from the UK.

Explore: Preprocessing of data.

Manipulate: Feature selection.

Model: The classification models used for the prediction of corporate failure.

Assess: Comparison of prediction accuracy.

Sampling

The data sample consists of company financial data from the UK. The financial data were accessed from Datastream/ICV.

The companies are divided into two groups: the failed companies and the nonfailed companies.

The training sample consists of 690 nonfailed companies and 106 failed companies (1980-1990).

The test dataset consists of 289 nonfailed companies and 48 failed companies (1991-1999).
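The sample sizes above imply a strong class imbalance, which is worth keeping in mind when reading accuracy figures. The following is a minimal sketch of that arithmetic. The helper names `failure_share` and `majority_baseline_accuracy`, and the comparison with a rule that labels every company nonfailed, are illustrative assumptions and not part of the original study.

```python
# Sample sizes as stated on the slide.
TRAIN = {"nonfailed": 690, "failed": 106}  # 1980-1990
TEST = {"nonfailed": 289, "failed": 48}    # 1991-1999


def failure_share(sample):
    """Fraction of failed companies in a sample (hypothetical helper)."""
    total = sample["nonfailed"] + sample["failed"]
    return sample["failed"] / total


def majority_baseline_accuracy(sample):
    """Accuracy of a trivial rule that labels every company nonfailed."""
    return 1.0 - failure_share(sample)


for name, sample in (("training", TRAIN), ("test", TEST)):
    print(f"{name}: {failure_share(sample):.1%} failed, "
          f"baseline accuracy {majority_baseline_accuracy(sample):.1%}")
```

Failed companies make up about 13.3% of the training sample and 14.2% of the test sample, so a classifier has to beat roughly 86% accuracy to improve on the trivial rule.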

Data exploration

Exploration or preprocessing of data is very important and is sometimes the most time-consuming part.

Data exploration includes: (1) putting the data into a ready-to-use state; (2) handling missing data; (3) filtering out redundant records such as duplicated data.

We decided to delete x719, x734, x735, x761, x766 and x792, since these variables contain too many missing values.
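The exploration steps above (removing duplicated records, then dropping variables with too many missing values) can be sketched roughly as follows. This is only an illustration: the cut-off `max_missing=0.5`, the variable name `x701` and the toy records are assumptions, since the slides do not state the threshold used or show the data.

```python
def missing_fraction(records, column):
    """Share of records in which `column` is missing (None)."""
    return sum(r.get(column) is None for r in records) / len(records)


def explore(records, max_missing=0.5):
    """Drop duplicated records, then drop columns with too many missing values.

    `max_missing` is an assumed cut-off; the source does not state one.
    """
    # Remove exact duplicates while keeping the original order.
    seen, unique = set(), []
    for r in records:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            unique.append(r)

    # Keep only the columns whose missing share is within the cut-off.
    columns = sorted({c for r in unique for c in r})
    kept = [c for c in columns if missing_fraction(unique, c) <= max_missing]
    return [{c: r.get(c) for c in kept} for r in unique]


# Toy data: x719 is mostly missing, and the first two records are duplicates.
toy = [
    {"x701": 0.12, "x719": None},
    {"x701": 0.12, "x719": None},
    {"x701": 0.30, "x719": None},
    {"x701": 0.05, "x719": 1.4},
]
cleaned = explore(toy)
print(cleaned)
```

On the toy data the duplicate record is removed first, after which x719 is missing in two of the three remaining records and is therefore dropped.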

Data manipulation

One main purpose of data manipulation is feature selection.

We will use two feature selection methods. The first is based on financial theory and human judgement (Feature selection I); the second is based on ANOVA (Feature selection II).


Feature selection I: human judgement based on financial theory

Laurent [21], Ezzamuel et al. [13] and Clarke [10] attempted to reduce the wide variety of financial ratios to several major groups (e.g. profitability, liquidity, etc.), the suggestion being that a researcher need only select one ratio from each group to obtain an indication of a company's overall performance.

The features selected are from these categories: return on capital employed (ROCE); turnover to total assets employed (turnover/TA); capital gearing (CG); working capital (WC) ratio.


Feature selection II: ANOVA statistical method

To classify failed companies and the nonfailed companies effectively, the predictors should be chosen to enable us to distinguish between these two groups.

We use ANOVA to select the variables that are significantly different between these two groups.

From Table 3, the ANOVA output for the training set, we obtain the variables marked with *, which differ significantly between the failed and nonfailed groups.
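The idea of ANOVA-based selection can be illustrated with a small sketch that computes the one-way ANOVA F statistic for each variable across the two groups and keeps the variables whose F value exceeds a critical value. The function names, the toy values and the choice to pass the critical value in by hand are assumptions made for illustration. The study's own output is in Table 3, which is not reproduced here.

```python
def anova_f(groups):
    """One-way ANOVA F statistic for a list of groups of numbers."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]

    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, means) for v in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


def select_features(failed, nonfailed, f_critical):
    """Keep the variables whose F statistic exceeds `f_critical`.

    `failed` and `nonfailed` map variable names to lists of values.
    `f_critical` is supplied by the caller (an assumption of this sketch).
    """
    return [name for name in failed
            if anova_f([failed[name], nonfailed[name]]) > f_critical]


# Toy values only, not data from the study.
failed = {"ROCE": [1.0, 2.0, 3.0], "WC": [1.0, 2.0, 3.0]}
nonfailed = {"ROCE": [4.0, 5.0, 6.0], "WC": [1.5, 2.0, 2.5]}
print(anova_f([failed["ROCE"], nonfailed["ROCE"]]))
print(select_features(failed, nonfailed, f_critical=7.71))
```

Here 7.71 is the tabulated 5% critical value of the F distribution with 1 and 4 degrees of freedom, which matches the toy sample sizes. With the real training sample the degrees of freedom, and hence the critical value, would differ.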

Modelling and results

The classification models used in this study are: statistical methods (DA, LG) and machine learning methods (decision trees, NN).


Hybrid models combining different classifiers

A hybrid method usually integrates two or more technologies. The purpose of integrating technologies is to strengthen the best features of each.

A hybrid method that combines the best features of several classification models is developed to increase the prediction performance.


Our purpose is to minimize the number of wrong predictions:

\[ \min \delta = \sum_{i=1}^{n} (T_i * O_i) \]

subject to

\[ O_i = f(w_1 V_{i1} + w_2 V_{i2} + w_3 V_{i3} + \dots + w_m V_{im} - \theta), \quad i = 1, \dots, n \]

where:

$n$ is the number of companies in the sample and $m$ is the number of classifiers.

$T_i$ is the actual outcome of the $i$th company in the test sample.

$O_i$ is the predicted outcome of the $i$th company in the test sample based on the hybrid method: $O_i = 1$ if $w_1 V_{i1} + w_2 V_{i2} + w_3 V_{i3} + \dots + w_m V_{im} > \theta$, and $O_i = 0$ otherwise.

$V_{ij}$ is the predicted output of the $i$th company in the test sample by classifier $j$.

$w_j$ is the weight of classifier $j$, and $\theta$ is the threshold.

The operator $*$ returns 1 if its two operands are different and 0 if they are the same.
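To make the objective concrete, the sketch below evaluates the hybrid prediction for each company and counts the misclassifications for one fixed choice of weights and threshold. The function names, the toy predictions and the weights are invented for illustration, and the coding of outcomes as 0 and 1 follows the definition above without assuming which value denotes failure.

```python
def hybrid_predict(votes, weights, theta):
    """O_i: 1 if the weighted sum of classifier outputs exceeds theta, else 0."""
    score = sum(w * v for w, v in zip(weights, votes))
    return 1 if score > theta else 0


def misclassifications(actual, predictions, weights, theta):
    """delta: number of companies whose hybrid prediction differs from T_i."""
    return sum(
        t != hybrid_predict(votes, weights, theta)
        for t, votes in zip(actual, predictions)
    )


# Toy example: 4 companies (rows) and 3 classifiers (columns); values invented.
V = [
    [1, 1, 0],
    [0, 1, 0],
    [1, 0, 1],
    [0, 0, 0],
]
T = [1, 0, 1, 0]
w = [0.9, 0.6, 0.8]

print(misclassifications(T, V, w, theta=1.0))
print(misclassifications(T, V, w, theta=0.5))
```

With the threshold at 1.0 every toy company is classified correctly. Lowering it to 0.5 lets the single vote of the second classifier (weight 0.6) flip the second company, which adds one misclassification.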


Finding such a combination of classifiers means finding (1) the combination of weights, (2) their respective classifiers and (3) the threshold $\theta$ such that the number of misclassifications $\delta$ between the actual outcome $T_i$ and the predicted outcome $O_i$ is as small as possible.

The hybrid algorithm

1. Compute the total hit ratio (total accuracy) on the training sample for each of the $m$ independent classifiers, giving the weights $(w_1, w_2, \dots, w_m)$.

2. Take the outcome prediction $V_{ij}$ of classifier $j$ for every company $i$ in the test sample, for $j = 1, \dots, m$ and $i = 1, \dots, n$.

3. Take the whole population of classifiers, or a subset of it. Compute $w_1 V_{i1} + w_2 V_{i2} + w_3 V_{i3} + \dots + w_m V_{im}$ for each company $i$ in the test sample.

4. If $w_1 V_{i1} + w_2 V_{i2} + w_3 V_{i3} + \dots + w_m V_{im} > \theta$, then $O_i = 1$; otherwise $O_i = 0$. Adjust the parameter $\theta$, where $0 < \theta < w_1 + w_2 + w_3 + \dots + w_m$, so that the misclassification count $\delta$ is as small as possible.
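The four steps can be sketched as a short routine. This is a hedged illustration rather than the authors' implementation: it assumes the weights are the training accuracies from step 1, it searches the threshold on an evenly spaced grid (the slides do not say how the threshold is adjusted), and, following the slides as written, it tunes the threshold on the test sample. All names and toy data are invented.

```python
def accuracy(actual, predicted):
    """Total hit ratio: share of companies classified correctly."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def count_errors(actual, votes, weights, theta):
    """delta for a given threshold; votes[i][j] is V_ij."""
    errors = 0
    for t, row in zip(actual, votes):
        o = 1 if sum(w * v for w, v in zip(weights, row)) > theta else 0
        errors += (t != o)
    return errors


def fit_hybrid(train_actual, train_votes, test_actual, test_votes, steps=100):
    """Steps 1-4: weights from training accuracy, then a grid search on theta."""
    m = len(train_votes[0])
    # Step 1: weight w_j = training accuracy of classifier j (assumed reading).
    weights = [accuracy(train_actual, [row[j] for row in train_votes])
               for j in range(m)]
    # Step 4: try thresholds strictly between 0 and w_1 + ... + w_m.
    upper = sum(weights)
    candidates = [upper * k / steps for k in range(1, steps)]
    best_theta = min(
        candidates,
        key=lambda th: count_errors(test_actual, test_votes, weights, th),
    )
    return weights, best_theta, count_errors(
        test_actual, test_votes, weights, best_theta)


# Toy data: rows are companies, columns are three hypothetical classifiers.
train_T = [1, 0, 1, 0, 1]
train_V = [[1, 1, 1], [0, 1, 0], [1, 0, 1], [0, 0, 1], [0, 0, 1]]
test_T = [1, 0, 1, 0]
test_V = [[1, 1, 0], [0, 1, 0], [1, 0, 1], [0, 0, 0]]

weights, theta, delta = fit_hybrid(train_T, train_V, test_T, test_V)
print(weights, theta, delta)
```

Restricting the search to a subset of classifiers, as in step 3, amounts to passing only the corresponding columns of the prediction matrices.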


We present three hybrid classifiers.

Hybrid1: DA + LG + NN + C5.0

Hybrid2: DA + NN + C5.0

Hybrid3: LG + C5.0
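The three hybrids are particular subsets of the four base classifiers. As a rough illustration of how many candidate combinations exist, the sketch below enumerates every subset containing at least two classifiers. The minimum size of two and the function name are assumptions, and the slides do not say how the three presented hybrids were chosen.

```python
from itertools import combinations

CLASSIFIERS = ("DA", "LG", "NN", "C5.0")


def candidate_hybrids(classifiers, min_size=2):
    """All subsets with at least `min_size` members (min_size=2 is assumed)."""
    return [subset
            for size in range(min_size, len(classifiers) + 1)
            for subset in combinations(classifiers, size)]


hybrids = candidate_hybrids(CLASSIFIERS)
print(len(hybrids))  # 11 subsets of size two or more

# Check that the three hybrids named on the slide are among the candidates.
for name, members in (("Hybrid1", ("DA", "LG", "NN", "C5.0")),
                      ("Hybrid2", ("DA", "NN", "C5.0")),
                      ("Hybrid3", ("LG", "C5.0"))):
    print(name, members in hybrids)
```

With four classifiers there are 6 pairs, 4 triples and 1 full set, so an exhaustive search has 11 candidate hybrids to evaluate.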


Assessment

Conclusions

For all the models (DA, LG, NN, C5.0 and the hybrid classifiers), we found that ANOVA feature selection performs better than human-judgement feature selection, except for DA.

The machine learning methods (NN and decision trees) show better performance than the statistical approach.

We present three hybrid classifiers: Hybrid1 (DA + LG + NN + C5.0), Hybrid2 (DA + NN + C5.0) and Hybrid3 (LG + C5.0). The empirical tests show that a hybrid classifier produces higher prediction accuracy than the individual classifiers.

My opinion

Advantage: A hybrid method that combines the best features of several classification models is developed to increase the prediction performance.

Disadvantages: (1) The hybrid algorithm is time-consuming, because $w_1 V_{i1} + w_2 V_{i2} + w_3 V_{i3} + \dots + w_m V_{im}$, $i = 1, \dots, n$, must be computed for each subset of classifiers. (2) The hybrid method is not well suited to practical application.
