On Cascading Small Decision Trees
Julià Minguillón, Combinatorics and Digital Communications Group (CCD), Autonomous University of Barcelona (UAB), Barcelona, Spain
http://www.tesisenxarxa.net/TESIS_UAB/AVAILABLE/TDX-1209102-150635/jma1de1.pdf
Table of contents
Introduction
Decision trees
Combining classifiers
Experimental results
Theoretical issues
Conclusions
Further research
References
Introduction
Main goal: to build simple and fast classifiers for data mining
Partial goals:
To reduce both training and exploitation costs
To increase classification accuracy
To permit partial classification
Several classification systems could be used: decision trees, neural networks, support vector machines, nearest neighbour classifier, etc.
Decision trees
Introduced by Quinlan in 1983 and developed by Breiman et al. in 1984
Decision trees reproduce the way humans take decisions: a path of questions is followed from the input sample to the output label
Decision trees are based on recursive partitioning of the input space, trying to separate elements from different classes
Supervised training: labelled data is used for training
Why decision trees?
Natural handling of data of mixed types
Handling of missing values
Robustness to outliers in input space
Insensitive to monotone transformations
Computational scalability
Ability to deal with irrelevant inputs
Interpretability
Growing decision trees (binary)
T=(data set) /* initially the tree is a single leaf */
while stoppingCriterion(T) is false
select t from T maximising selectionCriterion(t)
split t=(tL,tR) maximising splittingCriterion(t,tL,tR)
replace t in T with (tL,tR)
end
prune back T using the BFOS algorithm
choose T minimising classification error on (data set)
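A minimal sketch of this grow-then-prune procedure, assuming scikit-learn (whose cost-complexity pruning corresponds to the BFOS, i.e. minimal cost-complexity, algorithm); the data set, split and parameter values are illustrative, not those of the thesis:

# Sketch: grow a large tree, then prune it back by minimal cost-complexity
# (BFOS) and keep the subtree with the lowest held-out classification error.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Grow an (almost) perfect tree on the training set.
full_tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X_train, y_train)

# One candidate pruned subtree per cost-complexity parameter alpha.
alphas = full_tree.cost_complexity_pruning_path(X_train, y_train).ccp_alphas

# Choose the pruned subtree minimising classification error on held-out data.
best = max(
    (DecisionTreeClassifier(criterion="entropy", ccp_alpha=a, random_state=0)
         .fit(X_train, y_train) for a in alphas),
    key=lambda t: t.score(X_val, y_val),
)
print(best.get_n_leaves(), 1.0 - best.score(X_val, y_val))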
Growing algorithm parameters
The computed decision tree is determined by:
Stopping criterion
Node selection criterion
Splitting criterion
Labelling rule
If a perfect decision tree is built and then it is pruned back, both the stopping and the node selection criteria become irrelevant
Splitting criterion
Measures the gain of a split for a given criterion
Usually related to the concept of impurity
Classification performance may be very sensitive to such criterion
Entropy and R-norm criteria yield the best results on average; the Bayes error criterion yields the worst
Different kinds of splits:
Orthogonal hyperplanes: fast, interpretable, poor performance
General hyperplanes: expensive, partially interpretable
Distance based (spherical trees): expensive, allow clustering
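For illustration, a small sketch of how a candidate orthogonal split could be scored with the entropy impurity; the function and variable names are illustrative, not from the thesis:

import numpy as np

def entropy(y):
    # Impurity of a node: entropy of its class distribution.
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def split_gain(y, y_left, y_right):
    # Impurity decrease of a split t -> (tL, tR), weighted by node proportions.
    n, nl, nr = len(y), len(y_left), len(y_right)
    return entropy(y) - (nl / n) * entropy(y_left) - (nr / n) * entropy(y_right)

# Orthogonal split on feature j at threshold s: x[j] <= s goes left.
X = np.array([[0.1], [0.4], [0.6], [0.9]])
y = np.array([0, 0, 1, 1])
mask = X[:, 0] <= 0.5
print(split_gain(y, y[mask], y[~mask]))  # 1.0: this split separates the classes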
Labelling rule
Each leaf t is labelled in order to minimise misclassification error:
l(t) = argmin_j { r(t) = Σ_{k=0..K-1} C(j,k) p(k|t) }
Different classification costs C(j,k) are allowed
A priori class probabilities may be included
Margin is defined as 1 - 2 r(t), or also as
max_k { p(k|t) } - 2nd max_k { p(k|t) }
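A compact sketch of this labelling rule and margin for one leaf, assuming a 0/1 cost matrix; names and numbers are illustrative:

import numpy as np

def label_leaf(p_given_t, C):
    # Label a leaf t: choose the class j minimising sum_k C(j, k) p(k | t).
    risks = C @ p_given_t          # r_j(t) for every candidate label j
    return int(np.argmin(risks)), risks.min()

def leaf_margin(p_given_t):
    # Margin of a leaf: largest minus second largest class probability.
    top2 = np.sort(p_given_t)[-2:]
    return top2[1] - top2[0]

p = np.array([0.6, 0.3, 0.1])      # estimated p(k | t) in a leaf
C = 1.0 - np.eye(3)                # 0/1 costs: C(j, k) = 1 if j != k
print(label_leaf(p, C))            # (0, 0.4): label 0, misclassification risk 0.4
print(leaf_margin(p))              # 0.3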
Problems
Repetition, replication and fragmentation
Poor performance for large data dimensionality or large number of classes
Orthogonal splits may lead to poor classification performance due to poor internal decision functions
Overfitting may occur for large decision trees
Training is very expensive for large data sets
Decision trees are unstable classifiers
Progressive decision trees
Goal: to overcome some problems related to the use of classical decision trees
Basic idea: to break the classification problem into a sequence of partial classification problems, from easier to harder
Only small decision trees are used:
Avoid overfitting
Reduce both training and exploitation costs
Permit partial classification
Detect possible outliers
Decision trees become decision graphs
Growing progressive decision trees
Build a complete decision tree of depth d
Prune it using the BFOS algorithm
Relabel it using the new labelling rule: a leaf is labelled as mixed if its margin is not large enough (below a given threshold)
Join all regions labelled as mixed
Start again using only the mixed regions
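A rough sketch of this loop under simplifying assumptions (scikit-learn trees with cost-complexity pruning instead of the exact BFOS-pruned trees, and an arbitrary margin threshold delta standing in for the thesis' threshold):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def grow_progressive(X, y, depth=3, delta=0.6, stages=3):
    # Grow a cascade of small trees; at each stage, samples falling in
    # low-margin ("mixed") leaves form the training set of the next stage.
    cascade = []
    for _ in range(stages):
        if len(np.unique(y)) < 2:           # nothing left to separate
            break
        tree = DecisionTreeClassifier(max_depth=depth, ccp_alpha=1e-3).fit(X, y)
        cascade.append(tree)
        proba = tree.predict_proba(X)       # each sample gets its leaf's p(k|t)
        top2 = np.sort(proba, axis=1)[:, -2:]
        mixed = (top2[:, 1] - top2[:, 0]) < delta   # leaf margin below threshold
        if not mixed.any() or mixed.all():
            break
        X, y = X[mixed], y[mixed]           # start again on the mixed regions only
    return cascade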
Example (I), (II) and (III)
[Figures: three successive stages of a progressive decision tree on a two-class problem; each stage partitions the input space into regions labelled 0, 1 or M (mixed), and only the M regions are refined at the next stage]
Combining classifiers
Basic idea: instead of building a complex classifier, build several simple classifiers and combine them into a more complex one
Several paradigms:
Voting: bagging, boosting, randomising
Stacking
Cascading
Why do they work? Because different classifiers make different kinds of mistakes
Different classifiers are built by using different training sets
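As a point of comparison with cascading, a minimal sketch of the voting paradigm (bagging): each classifier is trained on a different bootstrap replicate of the training set and predictions are combined by majority vote; all names are illustrative and integer-coded class labels are assumed:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_trees(X, y, n_trees=11, seed=0):
    # Voting ensemble: each tree sees a different bootstrap replicate of (X, y).
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))   # sample with replacement
        trees.append(DecisionTreeClassifier(max_depth=4).fit(X[idx], y[idx]))
    return trees

def vote(trees, X):
    # Majority vote over the individual predictions (labels assumed 0..K-1).
    votes = np.stack([t.predict(X) for t in trees]).astype(int)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)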
Cascading generalization
Developed by Gama et al. in 2000
Basic idea: simple classifiers are combined sequentially, carrying information from one classifier to the next in the sequence
Three types of cascading ensembles:
Type A: no additional info, mixed class
Type B: additional info, no mixed class
Type C: additional info, mixed class
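A minimal sketch in the spirit of cascade generalization: the class-probability outputs of a first, simple classifier are appended to the input of the next one; the choice of naive Bayes as the first stage is an illustrative assumption, not the thesis' setting:

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

def cascade_fit(X, y):
    # Stage 1: a simple classifier; its class probabilities become extra features.
    first = GaussianNB().fit(X, y)
    X_ext = np.hstack([X, first.predict_proba(X)])   # carry info to the next stage
    # Stage 2: a decision tree trained on the extended input.
    second = DecisionTreeClassifier(max_depth=5).fit(X_ext, y)
    return first, second

def cascade_predict(first, second, X):
    X_ext = np.hstack([X, first.predict_proba(X)])
    return second.predict(X_ext)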
Type A progressive decision trees
No additional info is carried from one stage to the next; only samples labelled as mixed are passed down to the next classifier
Type B progressive decision trees
Additional info (estimated class probabilities and margin) is computed for each sample, and all samples are passed down to the next classifier
Type C progressive decision trees
Additional info is computed for each sample, and only samples labelled as mixed are passed down to the next classifier
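A sketch of how exploitation could look for a type A cascade (and, with the obvious changes, type C): a sample is passed down the stages until some stage labels it with a large enough margin, otherwise it stays mixed (partial classification, possible outlier). The cascade is assumed to be built as in the earlier sketch; the threshold delta and the returned "mixed" marker are illustrative assumptions:

import numpy as np

def classify_type_a(cascade, x, delta=0.6):
    # Pass the sample down the cascade; stop at the first confident stage.
    x = np.asarray(x).reshape(1, -1)
    for tree in cascade:
        proba = tree.predict_proba(x)[0]            # leaf class distribution p(k|t)
        order = np.sort(proba)
        margin = order[-1] - order[-2] if proba.size > 1 else 1.0
        if margin >= delta:                         # confident leaf: stop here
            return tree.classes_[int(np.argmax(proba))]
    return "mixed"                                  # no stage was confident enough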
Experimental results
Four different projects:
Document layout recognition
Hyperspectral imaging
Brain tumour classification
UCI collection evaluation
The first three are real projects
Basic tools for evaluation:
N-fold cross-validation
Bootstrapping
Bias-variance decomposition
Document layout recognition (I)
Goal: adaptive compression for an automated document storage system using lossy/lossless JPEG standard
Four classes: background (removed), text (OCR), line drawings (lossless) and images (lossy)
Documents are 8.5 x 11.7 inches at 150 dpi
Target block size: 8 x 8 pixels (JPEG standard)
Minguillón, J. et al., Progressive classification scheme for document layout recognition, Proc. of the SPIE, Denver, CO, USA, v. 3816:241-250, 1999
Document layout recognition (II)
Classical approach: a single decision tree with a block size of 8 x 8 pixels
Size     Num. Blocks        |T|    R       dmax   Error
8 x 8    211200 / 211200     21    8.567    8     0.0783
Document layout recognition (III)
Progressive approach: four block sizes (64 x 64, 32 x 32, 16 x 16 and 8 x 8)
Size       Num. Blocks        |T|    R      dmax   Error
64 x 64    3360 / 3360         6    2.77     4     0.089
32 x 32    7856 / 13440       14    4.17     6     0.047
16 x 16    21052 / 53760      11    3.72     6     0.042
8 x 8      27892 / 215040     18    4.73     8     0.065
Hyperspectral imaging (I)
Image size is 710 x 4558 pixels x 14 bands (available ground truth data is only 400 x 2400)
Ground truth data presents some artifacts due to low resolution: around 10% mislabelled
19 classes including asphalt, water, rocks, soil and several vegetation types
Goal: to build a classification system and to identify the most important bands for each class, but also to detect possible outliers in the training set
Minguillón, J. et al., Adaptive lossy compression and classification of hyperspectral images, Proc. of remote sensing VI, Barcelona, Spain, v. 4170:214-225, 2000
Hyperspectral imaging (II)
Classical approach:
Tree   |T|    R       PT      Error
T1      36    9.838   1.0     0.163
Using the new labeling rule:
Tree   |T|    R       PT      Error
T2      50    9.606   0.722   0.092
Hyperspectral imaging (III)
Progressive approach:
Tree   |T|    R      PT      Error
T3      44    4.84   0.706   0.094
T3A      9    3.02   0.523   0.056
T3B      8    2.14   0.383   0.199
Brain tumour classification (I)
Goal: to build a classification system for helping clinicians to identify brain tumour types
Too many classes and too few samples: a hierarchical structure partially reproducing the WHO tree has been created
Different classifiers (LDA, k-NN, decision trees) are combined using a mixture of cascading and voting schemes
Minguillón, J. et al., Classifier combination for in vivo magnetic resonance spectra of brain tumours, Proc. of Multiple Classifier Systems, Cagliari, Italy, LNCS 2364, 2002
Brain tumour classification (II)
Each classification stage combines three classifiers: LDA, k-NN and a decision tree (DT)
Decision trees use LDA class distances as additional information
A sample is labelled unknown when the classifiers disagree
Brain tumour classification (III)
[Figure: classification accuracy at each node of the hierarchical (WHO-like) tree. Readable values: Normal 100%, Tumour 99.5%; Benign 92.1%, Malignant 94.9%; Grade II 82.6%, Grade III 0%, Grade IV 94.7%; Astro 94.1%, Oligo 100%; Primary 81.8%, Secondary 91.4%; leaf groups include MN+SCH+HB, ASTII+ODG and GLB+LYM+PNET+MET]
UCI collection
Goal: exhaustive testing of progressive decision trees
20 data sets were chosen:
No categorical variables
No missing values
Large range of number of samples, data dimension and number of classes
Available at http://kdd.ics.uci.edu
Experiments setup
N-fold cross-validation with N=3
For each training set, 25 bootstrap replicates are generated (subsampling with replacement)
Each experiment is repeated 5 times and performance results are averaged
Bias-variance decomposition is computed for each repetition and then averaged
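A bare-bones sketch of this protocol, assuming scikit-learn and a plain decision tree as the classifier under test; the fold and replicate counts mirror the slide, everything else is illustrative:

import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

def run_protocol(X, y, n_folds=3, n_boot=25, seed=0):
    # N-fold CV; within each fold, 25 bootstrap replicates of the training set
    # each yield one classifier, and the test errors are averaged.
    rng = np.random.default_rng(seed)
    errors = []
    folds = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for train_idx, test_idx in folds.split(X):
        for _ in range(n_boot):
            boot = rng.choice(train_idx, size=len(train_idx), replace=True)
            tree = DecisionTreeClassifier(max_depth=5).fit(X[boot], y[boot])
            errors.append(1.0 - tree.score(X[test_idx], y[test_idx]))
    return float(np.mean(errors))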
Bias-variance decomposition
Several approaches, Domingos 2000
First classifiers in a cascading ensemble should have moderate bias and low variance: small (but not too small) decision trees
Last classifiers should have small bias and moderate variance: large (but not too large) decision trees
Only classifiers with different bias-variance behaviour should be combined, so the number of decision trees should be small
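A sketch of how the 0-1-loss decomposition of Domingos (2000) could be computed from the predictions of classifiers trained on different bootstrap replicates; the two-class, noise-free case is assumed for the final identity, and a recent SciPy is assumed for the mode computation:

import numpy as np
from scipy import stats

def bias_variance_01(preds, y_true):
    # preds: (n_models, n_samples) predictions from bootstrap-trained classifiers.
    main = stats.mode(preds, axis=0, keepdims=False).mode   # main prediction per sample
    bias = (main != y_true).astype(float)                   # B(x): 0 or 1
    variance = (preds != main).mean(axis=0)                 # V(x): P(prediction != main)
    # Two-class, noise-free identity: average loss = mean bias + variance on
    # unbiased points minus variance on biased points (variance helps there).
    net_variance = np.where(bias == 0, variance, -variance)
    return bias.mean(), variance.mean(), bias.mean() + net_variance.mean()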
Empirical evaluation summary (I)
Bias usually predominates over variance on most data sets, so decision trees outperform the k-NN classifier
Bias decreases fast when the decision tree has enough leaves
Variance shows an unpredictable behaviour, depending on data set intrinsic characteristics
Empirical evaluation summary (II)
Type B progressive decision trees usually outperform classical decision trees, mainly due to bias reduction. Two or three small decision trees are enough
Type A progressive decision trees do not outperform classical decision trees in general, but variance is reduced (classifiers are smaller and thus more stable)
Type C experiments are still running...
Theoretical issues
Decision trees are convex combinations of internal node decision functions:
T_j(x) = Σ_{i=1..|T_j|} p_ij α_ij h_ij(x)
Cascading is a convex combination of t decision trees:
T(x) = Σ_{j=1..t} q_j T_j(x)
Type A: the first decision tree is the most important
Type B: the last decision tree is the most important
Type C: not applicable
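A toy numeric illustration of this view, with made-up tree outputs and weights: under a type A weighting the first (smallest) tree dominates the convex combination, under a type B weighting the last one does:

import numpy as np

T_outputs = np.array([[0.7, 0.3],     # T_1(x): first (smallest) tree
                      [0.5, 0.5],     # T_2(x)
                      [0.2, 0.8]])    # T_3(x): last (largest) tree

q_type_a = np.array([0.6, 0.3, 0.1])  # type A: the first tree carries most weight
q_type_b = np.array([0.1, 0.3, 0.6])  # type B: the last tree carries most weight

for q in (q_type_a, q_type_b):
    combined = q @ T_outputs           # convex combination T(x) = sum_j q_j T_j(x)
    print(combined, combined.argmax())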
Error generalization bounds
Convex combinations may be studied under the margin paradigm defined by Schapire et al.
Generalization error depends on the tree structure and on the VC dimension of the internal node decision functions
Unbalanced trees are preferable
Unbalanced classifiers are preferable
Modest goal: to show that current theory on classifier combination does not rule out progressive decision trees
Conclusions
Progressive decision trees generalise classical decision trees and the cascading paradigm
Cascading is very useful for large data sets with a large number of classes, exploiting a hierarchical structure
Preliminary experiments with type C progressive decision trees look promising
Experiments with real data sets show that it is possible to improve classification accuracy and reduce both training and exploitation costs at the same time
Fine tuning is absolutely necessary!...
Further research
The R-norm splitting criterion may be used to build adaptive decision trees
Better error generalisation bounds are needed
A complete and specific theoretical framework for the cascading paradigm must be developed
Parameters (the margin threshold, the depth d and the number of trees t) are currently set empirically; better explanations are needed
New applications (huge data sets):
Web mining
DNA interpretation
Selected references
Breiman, L. et al., Classification and Regression Trees, Wadsworth International Group, 1984
Gama, J. et al., Cascade Generalization, Machine Learning 41(3):315-343, 2000
Domingos, P., A unified bias-variance decomposition and its applications, Proc. of the 17th Int. Conf. on Machine Learning, Stanford, CA, USA, 231-238, 2000
Schapire, R.E. et al., Boosting the margin: a new explanation for the effectiveness of voting methods, Annals of Statistics 26(5):1651-1686, 1998