On Cascading Small Decision Trees
Julià Minguillón, Combinatorics and Digital Communications Group (CCD), Autonomous University of Barcelona (UAB), Barcelona, Spain
http://www.tesisenxarxa.net/TESIS_UAB/AVAILABLE/TDX-1209102-150635/jma1de1.pdf
Table of contents
Introduction
Decision trees
Combining classifiers
Experimental results
Theoretical issues
Conclusions
Further research
References
Introduction
Main goal: to build simple and fast classifiers for data mining
Partial goals:
To reduce both training and exploitation costs
To increase classification accuracy
To permit partial classification
Several classification systems could be used: decision trees, neural networks, support vector machines, nearest neighbour classifier, etc.
Decision trees
Introduced by Quinlan in 1983 and developed by Breiman et al. in 1984
Decision trees reproduce the way humans take decisions: a path of questions is followed from the input sample to the output label
Decision trees are based on recursive partitioning of the input space, trying to separate elements from different classes
Supervised training: labelled data is used for training
Why decision trees?
Natural handling of data of mixed types
Handling of missing values
Robustness to outliers in input space
Insensitive to monotone transformations
Computational scalability
Ability to deal with irrelevant inputs
Interpretability
Growing decision trees (binary)
T=(data set) /* initially the tree is a single leaf */
while stoppingCriterion(T) is false
select t from T maximising selectionCriterion(t)
split t=(tL,tR) maximising splittingCriterion(t,tL,tR)
replace t in T with (tL,tR)
end
prune back T using the BFOS algorithm
choose T minimising classification error on (data set)
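A minimal sketch of this grow-then-prune procedure, assuming scikit-learn (whose cost-complexity pruning corresponds to the BFOS, i.e. minimal cost-complexity, algorithm); the data set, split and parameter values are illustrative, not those of the thesis:

# Sketch: grow a large tree, then prune it back by minimal cost-complexity
# (BFOS) and keep the subtree with the lowest held-out classification error.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Grow an (almost) perfect tree on the training set.
full_tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X_train, y_train)

# One candidate pruned subtree per cost-complexity parameter alpha.
alphas = full_tree.cost_complexity_pruning_path(X_train, y_train).ccp_alphas

# Choose the pruned subtree minimising classification error on held-out data.
best = max(
    (DecisionTreeClassifier(criterion="entropy", ccp_alpha=a, random_state=0)
         .fit(X_train, y_train) for a in alphas),
    key=lambda t: t.score(X_val, y_val),
)
print(best.get_n_leaves(), 1.0 - best.score(X_val, y_val))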
Growing algorithm parameters
The computed decision tree is determined by:
Stopping criterion
Node selection criterion
Splitting criterion
Labelling rule
If a perfect decision tree is built and then it is pruned back, both the stopping and the node selection criteria become irrelevant
Splitting criterion
Measures the gain of a split for a given criterion
Usually related to the concept of impurity
Classification performance may be very sensitive to such criterion
Entropy and R-norm criteria yield the best results on average; the Bayes error criterion yields the worst
Different kinds of splits:
Orthogonal hyperplanes: fast, interpretable, poor performance
General hyperplanes: expensive, partially interpretable
Distance based (spherical trees): expensive, allow clustering
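For illustration, a small sketch of how a candidate orthogonal split could be scored with the entropy impurity; the function and variable names are illustrative, not from the thesis:

import numpy as np

def entropy(y):
    # Impurity of a node: entropy of its class distribution.
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def split_gain(y, y_left, y_right):
    # Impurity decrease of a split t -> (tL, tR), weighted by node proportions.
    n, nl, nr = len(y), len(y_left), len(y_right)
    return entropy(y) - (nl / n) * entropy(y_left) - (nr / n) * entropy(y_right)

# Orthogonal split on feature j at threshold s: x[j] <= s goes left.
X = np.array([[0.1], [0.4], [0.6], [0.9]])
y = np.array([0, 0, 1, 1])
mask = X[:, 0] <= 0.5
print(split_gain(y, y[mask], y[~mask]))  # 1.0: this split separates the classes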
Labelling rule
Each leaf t is labelled in order to minimise misclassification error:
l(t) = argmin_j { r(t) = Σ_{k=0..K-1} C(j,k) p(k|t) }
Different classification costs C(j,k) are allowed
A priori class probabilities may be included
Margin is defined as 1 - 2 r(t), or also as
max_k { p(k|t) } - 2nd max_k { p(k|t) }
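A compact sketch of this labelling rule and margin for one leaf, assuming a 0/1 cost matrix; names and numbers are illustrative:

import numpy as np

def label_leaf(p_given_t, C):
    # Label a leaf t: choose the class j minimising sum_k C(j, k) p(k | t).
    risks = C @ p_given_t          # r_j(t) for every candidate label j
    return int(np.argmin(risks)), risks.min()

def leaf_margin(p_given_t):
    # Margin of a leaf: largest minus second largest class probability.
    top2 = np.sort(p_given_t)[-2:]
    return top2[1] - top2[0]

p = np.array([0.6, 0.3, 0.1])      # estimated p(k | t) in a leaf
C = 1.0 - np.eye(3)                # 0/1 costs: C(j, k) = 1 if j != k
print(label_leaf(p, C))            # (0, 0.4): label 0, misclassification risk 0.4
print(leaf_margin(p))              # 0.3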
Problems
Repetition, replication and fragmentation
Poor performance for large data dimensionality or large number of classes
Orthogonal splits may lead to poor classification performance due to poor internal decision functions
Overfitting may occur for large decision trees
Training is very expensive for large data sets
Decision trees are unstable classifiers
Progressive decision trees
Goal: to overcome some problems related to the use of classical decision trees
Basic idea: to break the classification problem into a sequence of partial classification problems, from easier to harder
Only small decision trees are used:
Avoid overfitting
Reduce both training and exploitation costs
Permit partial classification
Detect possible outliers
Decision trees become decision graphs
Growing progressive decision trees
Build a complete decision tree of depth d
Prune it using the BFOS algorithm
Relabel it using the new labelling rule: a leaf is labelled as mixed if its margin is not large enough (below a given threshold)
Join all regions labelled as mixed
Start again using only the mixed regions
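A rough sketch of this loop under simplifying assumptions (scikit-learn trees with cost-complexity pruning instead of the exact BFOS-pruned trees, and an arbitrary margin threshold delta standing in for the thesis' threshold):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def grow_progressive(X, y, depth=3, delta=0.6, stages=3):
    # Grow a cascade of small trees; at each stage, samples falling in
    # low-margin ("mixed") leaves form the training set of the next stage.
    cascade = []
    for _ in range(stages):
        if len(np.unique(y)) < 2:           # nothing left to separate
            break
        tree = DecisionTreeClassifier(max_depth=depth, ccp_alpha=1e-3).fit(X, y)
        cascade.append(tree)
        proba = tree.predict_proba(X)       # each sample gets its leaf's p(k|t)
        top2 = np.sort(proba, axis=1)[:, -2:]
        mixed = (top2[:, 1] - top2[:, 0]) < delta   # leaf margin below threshold
        if not mixed.any() or mixed.all():
            break
        X, y = X[mixed], y[mixed]           # start again on the mixed regions only
    return cascade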
Example (I), (II) and (III)
[Figures: three successive stages of a progressive decision tree on a two-class problem; each stage partitions the input space into regions labelled 0, 1 or M (mixed), and only the M regions are refined at the next stage]
Combining classifiers
Basic idea: instead of building a complex classifier, build several simple classifiers and combine them into a more complex one
Several paradigms:
Voting: bagging, boosting, randomising
Stacking
Cascading
Why do they work? Because different classifiers make different kinds of mistakes
Different classifiers are built by using different training sets
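As a point of comparison with cascading, a minimal sketch of the voting paradigm (bagging): each classifier is trained on a different bootstrap replicate of the training set and predictions are combined by majority vote; all names are illustrative and integer-coded class labels are assumed:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_trees(X, y, n_trees=11, seed=0):
    # Voting ensemble: each tree sees a different bootstrap replicate of (X, y).
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))   # sample with replacement
        trees.append(DecisionTreeClassifier(max_depth=4).fit(X[idx], y[idx]))
    return trees

def vote(trees, X):
    # Majority vote over the individual predictions (labels assumed 0..K-1).
    votes = np.stack([t.predict(X) for t in trees]).astype(int)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)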
Cascading generalization
Developed by Gama et al. in 2000
Basic idea: simple classifiers are combined sequentially, carrying information from one classifier to the next in the sequence
Three types of cascading ensembles:
Type A: no additional info, mixed class
Type B: additional info, no mixed class
Type C: additional info, mixed class
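A minimal sketch in the spirit of cascade generalization: the class-probability outputs of a first, simple classifier are appended to the input of the next one; the choice of naive Bayes as the first stage is an illustrative assumption, not the thesis' setting:

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

def cascade_fit(X, y):
    # Stage 1: a simple classifier; its class probabilities become extra features.
    first = GaussianNB().fit(X, y)
    X_ext = np.hstack([X, first.predict_proba(X)])   # carry info to the next stage
    # Stage 2: a decision tree trained on the extended input.
    second = DecisionTreeClassifier(max_depth=5).fit(X_ext, y)
    return first, second

def cascade_predict(first, second, X):
    X_ext = np.hstack([X, first.predict_proba(X)])
    return second.predict(X_ext)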
Type A progressive decision trees
No additional info is carried from one stage to the next; only samples labelled as mixed are passed down to the next classifier
Type B progressive decision trees
Additional info (estimated class probabilities and margin) is computed for each sample, and all samples are passed down to the next classifier
Type C progressive decision trees
Additional info is computed for each sample, and only samples labelled as mixed are passed down to the next classifier
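A sketch of how exploitation could look for a type A cascade (and, with the obvious changes, type C): a sample is passed down the stages until some stage labels it with a large enough margin, otherwise it stays mixed (partial classification, possible outlier). The cascade is assumed to be built as in the earlier sketch; the threshold delta and the returned "mixed" marker are illustrative assumptions:

import numpy as np

def classify_type_a(cascade, x, delta=0.6):
    # Pass the sample down the cascade; stop at the first confident stage.
    x = np.asarray(x).reshape(1, -1)
    for tree in cascade:
        proba = tree.predict_proba(x)[0]            # leaf class distribution p(k|t)
        order = np.sort(proba)
        margin = order[-1] - order[-2] if proba.size > 1 else 1.0
        if margin >= delta:                         # confident leaf: stop here
            return tree.classes_[int(np.argmax(proba))]
    return "mixed"                                  # no stage was confident enough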
Experimental results
Four different projects:
Document layout recognition
Hyperspectral imaging
Brain tumour classification
UCI collection evaluation
The first three are real projects
Basic tools for evaluation:
N-fold cross-validation
Bootstrapping
Bias-variance decomposition
Document layout recognition (I)
Goal: adaptive compression for an automated document storage system using lossy/lossless JPEG standard
Four classes: background (removed), text (OCR), line drawings (lossless) and images (lossy)
Documents are 8.5 x 11.7 inches at 150 dpi
Target block size: 8 x 8 pixels (JPEG standard)
Minguillón, J. et al., Progressive classification scheme for document layout recognition, Proc. of the SPIE, Denver, CO, USA, v. 3816:241-250, 1999
Document layout recognition (II)
Classical approach: a single decision tree with a block size of 8 x 8 pixels
Size     Num. Blocks        |T|    R       dmax   Error
8 x 8    211200 / 211200     21    8.567    8     0.0783
Document layout recognition (III)
Progressive approach: four block sizes (64 x 64, 32 x 32, 16 x 16 and 8 x 8)
Size       Num. Blocks        |T|    R      dmax   Error
64 x 64    3360 / 3360         6    2.77     4     0.089
32 x 32    7856 / 13440       14    4.17     6     0.047
16 x 16    21052 / 53760      11    3.72     6     0.042
8 x 8      27892 / 215040     18    4.73     8     0.065
Hyperspectral imaging (I)
Image size is 710 x 4558 pixels x 14 bands (available ground truth data is only 400 x 2400)
Ground truth data presents some artifacts due to low resolution: around 10% mislabelled
19 classes including asphalt, water, rocks, soil and several vegetation types
Goal: to build a classification system and to identify the most important bands for each class, but also to detect possible outliers in the training set
Minguillón, J. et al., Adaptive lossy compression and classification of hyperspectral images, Proc. of remote sensing VI, Barcelona, Spain, v. 4170:214-225, 2000
Hyperspectral imaging (II)
Classical approach:
Tree   |T|    R       PT      Error
T1      36    9.838   1.0     0.163
Using the new labeling rule:
Tree   |T|    R       PT      Error
T2      50    9.606   0.722   0.092
Hyperspectral imaging (III)
Progressive approach:
Tree   |T|    R      PT      Error
T3      44    4.84   0.706   0.094
T3A      9    3.02   0.523   0.056
T3B      8    2.14   0.383   0.199
Brain tumour classification (I)
Goal: to build a classification system for helping clinicians to identify brain tumour types
Too many classes and too few samples: a hierarchical structure partially reproducing the WHO tree has been created
Different classifiers (LDA, k-NN, decision trees) are combined using a mixture of cascading and voting schemes
Minguillón, J. et al., Classifier combination for in vivo magnetic resonance spectra of brain tumours, Proc. of Multiple Classifier Systems, Cagliari, Italy, LNCS 2364, 2002
Brain tumour classification (II)
Each classification stage combines three classifiers: LDA, k-NN and a decision tree (DT)
Decision trees use LDA class distances as additional information
A sample is labelled unknown when the classifiers disagree
Brain tumour classification (III)
[Figure: classification accuracy at each node of the hierarchical (WHO-like) tree. Readable values: Normal 100%, Tumour 99.5%; Benign 92.1%, Malignant 94.9%; Grade II 82.6%, Grade III 0%, Grade IV 94.7%; Astro 94.1%, Oligo 100%; Primary 81.8%, Secondary 91.4%; leaf groups include MN+SCH+HB, ASTII+ODG and GLB+LYM+PNET+MET]
UCI collection
Goal: exhaustive testing of progressive decision trees
20 data sets were chosen:
No categorical variables
No missing values
Large range of number of samples, data dimension and number of classes
Available at http://kdd.ics.uci.edu
Experiments setup
N-fold cross-validation with N=3
For each training set, 25 bootstrap replicates are generated (subsampling with replacement)
Each experiment is repeated 5 times and performance results are averaged
Bias-variance decomposition is computed for each repetition and then averaged
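A bare-bones sketch of this protocol, assuming scikit-learn and a plain decision tree as the classifier under test; the fold and replicate counts mirror the slide, everything else is illustrative:

import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

def run_protocol(X, y, n_folds=3, n_boot=25, seed=0):
    # N-fold CV; within each fold, 25 bootstrap replicates of the training set
    # each yield one classifier, and the test errors are averaged.
    rng = np.random.default_rng(seed)
    errors = []
    folds = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for train_idx, test_idx in folds.split(X):
        for _ in range(n_boot):
            boot = rng.choice(train_idx, size=len(train_idx), replace=True)
            tree = DecisionTreeClassifier(max_depth=5).fit(X[boot], y[boot])
            errors.append(1.0 - tree.score(X[test_idx], y[test_idx]))
    return float(np.mean(errors))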
Bias-variance decomposition
Several approaches, Domingos 2000
First classifiers in a cascading ensemble should have moderate bias and low variance: small (but not too small) decision trees
Last classifiers should have small bias and moderate variance: large (but not too large) decision trees
Only classifiers with different bias-variance behaviour should be combined, so the number of decision trees should be small
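A sketch of how the 0-1-loss decomposition of Domingos (2000) could be computed from the predictions of classifiers trained on different bootstrap replicates; the two-class, noise-free case is assumed for the final identity, and a recent SciPy is assumed for the mode computation:

import numpy as np
from scipy import stats

def bias_variance_01(preds, y_true):
    # preds: (n_models, n_samples) predictions from bootstrap-trained classifiers.
    main = stats.mode(preds, axis=0, keepdims=False).mode   # main prediction per sample
    bias = (main != y_true).astype(float)                   # B(x): 0 or 1
    variance = (preds != main).mean(axis=0)                 # V(x): P(prediction != main)
    # Two-class, noise-free identity: average loss = mean bias + variance on
    # unbiased points minus variance on biased points (variance helps there).
    net_variance = np.where(bias == 0, variance, -variance)
    return bias.mean(), variance.mean(), bias.mean() + net_variance.mean()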
Empirical evaluation summary (I)
Bias usually predominates over variance on most data sets, so decision trees outperform the k-NN classifier
Bias decreases fast when the decision tree has enough leaves
Variance shows an unpredictable behaviour, depending on data set intrinsic characteristics
Empirical evaluation summary (II)
Type B progressive decision trees usually outperform classical decision trees, mainly due to bias reduction. Two or three small decision trees are enough
Type A progressive decision trees do not outperform classical decision trees in general, but variance is reduced (classifiers are smaller and thus more stable)
Type C experiments are still running...
Theoretical issues
Decision trees are convex combinations of internal node decision functions:
T_j(x) = Σ_{i=1..|T_j|} p_ij α_ij h_ij(x)
Cascading is a convex combination of t decision trees:
T(x) = Σ_{j=1..t} q_j T_j(x)
Type A: the first decision tree is the most important
Type B: the last decision tree is the most important
Type C: not applicable
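A toy numeric illustration of this view, with made-up tree outputs and weights: under a type A weighting the first (smallest) tree dominates the convex combination, under a type B weighting the last one does:

import numpy as np

T_outputs = np.array([[0.7, 0.3],     # T_1(x): first (smallest) tree
                      [0.5, 0.5],     # T_2(x)
                      [0.2, 0.8]])    # T_3(x): last (largest) tree

q_type_a = np.array([0.6, 0.3, 0.1])  # type A: the first tree carries most weight
q_type_b = np.array([0.1, 0.3, 0.6])  # type B: the last tree carries most weight

for q in (q_type_a, q_type_b):
    combined = q @ T_outputs           # convex combination T(x) = sum_j q_j T_j(x)
    print(combined, combined.argmax())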
Error generalization bounds
Convex combinations may be studied under the margin paradigm defined by Schapire et al.
Generalization error depends on the tree structure and on the VC dimension of the internal node decision functions
Unbalanced trees are preferable
Unbalanced classifiers are preferable
Modest goal: to show that current theory on classifier combination does not rule out progressive decision trees
Conclusions
Progressive decision trees generalise classical decision trees and the cascading paradigm
Cascading is very useful for large data sets with a large number of classes, exploiting a hierarchical structure
Preliminary experiments with type C progressive decision trees look promising
Experiments with real data sets show that it is possible to improve classification accuracy and reduce both training and exploitation costs at the same time
Fine tuning is absolutely necessary!...
Further research
The R-norm splitting criterion may be used to build adaptive decision trees
Better error generalisation bounds are needed
A complete and specific theoretical framework for the cascading paradigm must be developed
Parameters (the margin threshold, the depth d and the number of trees t) are currently set empirically; better explanations are needed
New applications (huge data sets):
Web mining
DNA interpretation
Selected references
Breiman, L. et al., Classification and Regression Trees, Wadsworth International Group, 1984
Gama, J. et al., Cascade Generalization, Machine Learning 41(3):315-343, 2000
Domingos, P., A unified bias-variance decomposition and its applications, Proc. of the 17th Int. Conf. on Machine Learning, Stanford, CA, USA, 231-238, 2000
Schapire, R.E. et al., Boosting the margin: a new explanation for the effectiveness of voting methods, Annals of Statistics 26(5):1651-1686, 1998