L4. Ensembles of Decision Trees
TRANSCRIPT
Ensembles
Gonzalo Martínez Muñoz, Universidad Autónoma de Madrid
Outline
• What is an ensemble? How to build them?
• Bagging, boosting, random forests, class-switching
• Combiners
• Stacking
• Other techniques
• Why do they work? Success stories
Condorcet Jury Theorem
• The combination of opinions is rooted in human culture
• Formalized by the Condorcet Jury Theorem:
Given a jury of voters whose errors are independent, if the probability of each individual juror being correct is above 50%, then the probability of the jury (deciding by majority vote) being correct tends to 100% as the number of jurors increases.
Nicolas de Condorcet (1743–1794), French mathematician
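A quick numerical illustration of the theorem (my own sketch, not from the slides): the probability that a majority of T independent voters is correct, when each voter is correct with probability p = 0.6.

```python
# Condorcet Jury Theorem, numerically: P(majority of T independent voters is
# correct) when each voter is correct with probability p (T odd).
from math import comb

def majority_correct(p, T):
    """Sum the binomial probabilities of strictly more than T/2 correct votes."""
    return sum(comb(T, k) * p**k * (1 - p)**(T - k) for k in range(T // 2 + 1, T + 1))

for T in (1, 11, 101, 1001):
    print(T, round(majority_correct(0.6, T), 4))
# With p = 0.6 the majority is correct roughly 0.60, 0.75, 0.98 and ~1.00
# of the time, illustrating the convergence towards 100%.
```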
What is an ensemble?
• An ensemble is a combination of classifiers whose individual predictions are combined into a final classification.
[Figure: a new instance x is classified by T=7 classifiers; the majority of their votes (class 1 vs. class 2) gives the final class 1.]
General idea
• Generate many classifiers and combine them to get a final classification
• They perform very well, in general better than any of the single learners they are composed of
• The classifiers should be different from one another
• It is important to generate diverse classifiers from the available data
How to build them?
• There are several techniques to build diverse base learners in an ensemble:
• Use modified versions of the training set to train the base learners
• Introduce changes in the learning algorithms
• These strategies can also be used in combination.
• Generally, the greater the randomization, the better the results
How to build them?
• Modifications of the training set can be generated by
• Resampling the dataset, e.g. by bootstrap sampling (bagging) or by weighted sampling (boosting)
• Altering the attributes: the base learners are trained using different feature subsets (e.g. Random Subspaces)
• Altering the class labels: grouping the classes into two new class values at random (e.g. ECOC) or randomly modifying the class labels (e.g. class-switching)
How to build them?
• Randomizing the learning algorithms
• Introducing some randomness into the learning algorithm, so that two consecutive executions of the algorithm output different classifiers
• Running the base learner with different architectures, parameters, etc.
Bagging
Input: dataset L, ensemble size T
1. for t = 1 to T:
2.   sample = BootstrapSample(L)
3.   h_t = TrainClassifier(sample)
Output: H(x) = argmax_j Σ_{t=1}^{T} I(h_t(x) = j)
Bagging = Bootstrap Aggregation
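A minimal sketch of this pseudocode in Python, using scikit-learn decision trees as base learners (the dataset and the ensemble size are illustrative assumptions):

```python
# Bagging: bootstrap samples + unweighted majority vote over T trees.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)   # stand-in for dataset L
T = 25                                                       # ensemble size
rng = np.random.default_rng(0)

ensemble = []
for t in range(T):                                           # steps 1-3 of the pseudocode
    idx = rng.integers(0, len(X), size=len(X))               # bootstrap sample of L
    ensemble.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Output: H(x) = argmax_j sum_t I(h_t(x) = j), i.e. the majority vote
votes = np.stack([h.predict(X) for h in ensemble])           # shape (T, n_samples)
H = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
print("training accuracy of the vote:", (H == y).mean())
```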
[Figure: bagging builds bootstrap samples 1 … T from the original dataset; in each bootstrap sample some examples are repeated and others are removed.]
Considerations about bagging
• Each classifier is trained on about 63.2% of the distinct training examples, on average.
• It is very robust against label noise.
• In general, it improves the error of the single learner.
• Easily parallelizable
Boosting
Input: dataset L, ensemble size T
1. Initialize all example weights to 1/N
2. for t = 1 to T:
3.   h_t = BuildClassifier(L, weights)
4.   e_t = WeightedError(L, weights)
5.   if e_t == 0 or e_t ≥ 0.5: break
6.   Multiply the weights of the instances misclassified by h_t by (1-e_t)/e_t
7.   Normalize the weights
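A hedged sketch of this loop in the style of AdaBoost, using decision stumps on a synthetic dataset (the data, the stumps and the final weighted vote are illustrative assumptions, not part of the slide):

```python
# AdaBoost-style boosting: reweight the data after each weak learner.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
N, T = len(X), 50
w = np.full(N, 1.0 / N)                      # 1. example weights = 1/N
ensemble, alphas = [], []

for t in range(T):                           # 2. for t = 1..T
    h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)  # 3.
    miss = h.predict(X) != y
    e = w[miss].sum()                        # 4. weighted error
    if e == 0 or e >= 0.5:                   # 5. stop condition (simplified)
        break
    w[miss] *= (1 - e) / e                   # 6. up-weight the mistakes of h_t
    w /= w.sum()                             # 7. normalize
    ensemble.append(h)
    alphas.append(np.log((1 - e) / e))       # classifier weight for the final vote

# Weighted vote (binary case): sign of the alpha-weighted sum of {-1,+1} votes
pred = np.sign(sum(a * (2 * h.predict(X) - 1) for h, a in zip(ensemble, alphas)))
print("training accuracy:", ((pred > 0).astype(int) == y).mean())
```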
[Figure: boosting reweights the original dataset between iterations; examples misclassified in iteration 1 receive larger weights in iteration 2.]
Considerations about boosting
• Obtains very good generalization error on average
• It is not robust against class label noise
• It can increase the error of the base classifier
• Cannot be easily implemented in parallel
Random forest
• Breiman defined a Random forest as an ensemble that:
• Has decision trees as its base learner
• Introduces some randomness in the learning process.
• Under this definition, bagging of decision trees qualifies as a random forest, and in fact it is one. However…
Random forest
• In practice, it is often considered to be an ensemble in which:
• Each tree is generated, as in bagging, from a bootstrap sample
• Each tree is a special tree in which every split is computed using:
• A random subset of the features
• The best split within this subset is then selected
• Unpruned trees are used
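A minimal usage sketch matching this description with scikit-learn (the parameter values are illustrative, not prescribed by the slides):

```python
# Random forest: bootstrap samples, a random feature subset at every split,
# and unpruned trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
rf = RandomForestClassifier(
    n_estimators=200,        # as many trees as is practical
    bootstrap=True,          # bagging-style bootstrap samples
    max_features="sqrt",     # random subset of features tried at each split
    max_depth=None,          # unpruned trees
    n_jobs=-1,               # easily parallelizable
    random_state=0,
).fit(X, y)
print(rf.score(X, y))
```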
Considerations about random forests
• Its performance is better than boosting in most cases
• It is robust to noise (does not overfit)
• Random forest introduces an additional randomization mechanism with respect to bagging
• Easily parallelizable
• Random trees are very fast to train
Class switching
• Class switching is an ensemble method in which diversity is obtained by using different versions of the training data polluted with class label noise.
• Specifically, to train each base learner, the class label of each training point is changed to a different class label with probability p.
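A hedged sketch of this procedure (the dataset, T and the flipping scheme are illustrative assumptions):

```python
# Class switching: each base tree sees the full dataset with a fraction p of
# the labels switched at random to a different class; prediction is by vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_classes=3, n_informative=6, random_state=0)
classes = np.unique(y)
T, p = 200, 0.3
rng = np.random.default_rng(0)

ensemble = []
for t in range(T):
    y_noisy = y.copy()
    flip = rng.random(len(y)) < p                     # points whose label is switched
    for i in np.flatnonzero(flip):                    # pick a *different* class at random
        y_noisy[i] = rng.choice(classes[classes != y[i]])
    ensemble.append(DecisionTreeClassifier().fit(X, y_noisy))

votes = np.stack([h.predict(X) for h in ensemble])
H = np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, votes)
print("training accuracy of the vote:", (H == y).mean())
```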
[Figure: class switching generates T randomly perturbed versions of the original dataset (random noise 1 … random noise T), here with p = 30% of the labels switched.]
Example
• 2D example
• Boundary is x1=x2
• x1~U[0, 1] x2~U[0, 1]
• Not an easy task for a normal decision tree
• Let's try bagging, boosting and class-switching with p=0.2 and p=0.4 (a sketch of the data setup follows)
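A minimal sketch of how such a dataset can be generated (my own illustration of the setup above):

```python
# Two uniform features, class decided by which side of the diagonal x1 = x2
# the point falls on.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
X = rng.uniform(0.0, 1.0, size=(n, 2))      # x1, x2 ~ U[0, 1]
y = (X[:, 0] > X[:, 1]).astype(int)         # boundary x1 = x2
# A single axis-parallel decision tree must approximate the diagonal with a
# staircase of splits, which is why this is a hard task for one tree.
```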
[Figure: the unit square with the diagonal boundary x1 = x2 separating Class 1 from Class 2.]
Results
[Figure: decision boundaries for bagging, boosting, switching p=0.2 and switching p=0.4, using 1, 11, 101 and 1001 classifiers.]
Parametrization (generally used parameters)
• Bagging: unpruned decision trees; ensemble size T as large as possible; option: smaller (sub)samples
• Boosting: pruned decision trees (weak learners); ensemble size T in the hundreds
• Random forest: unpruned random decision trees; ensemble size T as large as possible; number of random features per split = log(#features) or sqrt(#features)
• Class-switching: unpruned decision trees; ensemble size T above a thousand; proportion of instances to modify p ≈ 30%
Combiners
• The combination techniques can be divided into two groups:
• Voting strategies: the ensemble prediction is the class label that is predicted most often by the base learners. The votes can be weighted.
• Non voting strategies: Some operations such as maximum, minimum, product, median and mean can be employed on the confidence levels that are the output of the individual base learners.
• There is no single winning strategy among the different combination techniques; it depends on many factors
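A small sketch contrasting the two families on a pool of bagged trees (the setup is illustrative and assumes scikit-learn-style predict/predict_proba):

```python
# Voting vs. non-voting combiners on the same pool of base learners.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
rng = np.random.default_rng(0)
learners = []
for _ in range(5):                                       # diverse trees via bootstrap
    idx = rng.integers(0, len(X), len(X))
    learners.append(DecisionTreeClassifier(max_depth=3).fit(X[idx], y[idx]))

# Voting strategy: the most frequently predicted class label wins
votes = np.stack([h.predict(X) for h in learners])
vote_pred = np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, votes)

# Non-voting strategy: combine confidence levels, here with the mean
probas = np.mean([h.predict_proba(X) for h in learners], axis=0)
mean_pred = probas.argmax(axis=1)

print("agreement between the two combiners:", (vote_pred == mean_pred).mean())
```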
Stacking
• In stacking, the combination phase is included in the learning process.
• First the base learners are trained on some version of the original training set
• After that, the predictions of the base learners are used as new feature vectors to train a second level learner (meta-learner).
• The key point in this strategy is to improve the guesses that are made by the base learners, by generalizing these guesses using a meta learner.
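One way this can be realized in practice, sketched with scikit-learn's StackingClassifier (the choice of base learners and meta-learner is an illustrative assumption):

```python
# Stacking: base learners' (cross-validated) predictions become the features
# of a second-level meta-learner.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
base_learners = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("svm", SVC(probability=True, random_state=0)),
]
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(),   # the meta-learner
    cv=5,                                   # base predictions obtained by cross-validation
).fit(X, y)
print(stack.score(X, y))
```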
[Figure: a random forest of trees h1 … hn produces evidence histograms; these form the stacking dataset on which the stacked (second-level) classifier is trained to produce the final output.]
Stacking example
• Extract descriptors.
1. A random forest is trained on the descriptors:
• Each leaf node stores the class histogram.
2. In a second phase, stacking is applied:
• The histograms of the leaf nodes are accumulated over all trees.
• The accumulated histograms are concatenated.
• Boosting is applied to the concatenated histograms.
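A rough reconstruction of this two-stage pipeline (synthetic features stand in for the real descriptors, and the exact accumulation scheme is an assumption): each tree's predict_proba returns the class histogram of the leaf a sample falls in, and concatenating these histograms gives the features of the boosted second stage.

```python
# Stage 1: random forest over the descriptors; Stage 2: boosting over the
# concatenated per-tree leaf class histograms.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=30, random_state=0)  # "descriptors"

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# For each sample, each tree's predict_proba is the class histogram of the
# leaf the sample reaches; concatenate them over all trees as new features.
leaf_histograms = np.hstack([tree.predict_proba(X) for tree in rf.estimators_])

booster = AdaBoostClassifier(n_estimators=100, random_state=0).fit(leaf_histograms, y)
print(booster.score(leaf_histograms, y))
```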
Ensemble pruning
1. Random ordering produced by bagging: h1, h2, h3, …, hT
2. New ordering: hs1, hs2, hs3, …, hsT (e.g. reduce-error ordering)
3. Pruning: keep only hs1, …, hsM (a percentage of the original ensemble)
• Size reduction
• Classification error reduction
[Figure: test error vs. number of classifiers for bagging, reduce-error ordering and a single CART tree; the reordered ensemble reaches a lower error with far fewer classifiers.]
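A hedged sketch of reduce-error ordering and pruning (my own illustration; the selection set, pool size and pruning percentage are assumptions):

```python
# Reduce-error pruning: greedily reorder the bagged trees so each prefix
# minimizes the subensemble error on a selection set, then keep the first M.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, random_state=0)
X_tr, X_sel, y_tr, y_sel = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
T = 50
pool = []
for _ in range(T):                                   # 1. bagging gives a random ordering
    idx = rng.integers(0, len(X_tr), len(X_tr))
    pool.append(DecisionTreeClassifier().fit(X_tr[idx], y_tr[idx]))

preds = np.stack([h.predict(X_sel) for h in pool])   # each tree's votes on the selection set

ordered, remaining, vote_sum = [], list(range(T)), np.zeros(len(y_sel))
for _ in range(T):                                   # 2. greedy reduce-error reordering
    best = min(
        remaining,
        key=lambda i: (((vote_sum + preds[i]) / (len(ordered) + 1) > 0.5).astype(int) != y_sel).mean(),
    )
    ordered.append(best)
    remaining.remove(best)
    vote_sum += preds[best]

M = T // 5                                           # 3. keep only the first ~20% of the trees
pruned = [pool[i] for i in ordered[:M]]

def vote_error(members):
    frac = np.mean([h.predict(X_sel) for h in members], axis=0)
    return ((frac > 0.5).astype(int) != y_sel).mean()

print("full ensemble error:", vote_error(pool), "pruned ensemble error:", vote_error(pruned))
```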
Dynamic ensemble pruning
[Figure: a new instance x is classified by the T=7 classifiers one at a time; the accumulated votes are tracked as each classifier is queried, and the final class is 1.]
• Do we really need to query all classifiers in the ensemble? NO
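A sketch of the underlying idea (a simplified stopping rule of my own; practical dynamic pruning methods use statistical bounds instead): stop querying as soon as the classifiers not yet consulted cannot change the majority class.

```python
# Dynamic (instance-based) pruning of the voting process: query the ensemble
# members one by one and halt once the leading class is unbeatable.
def dynamic_vote(classifiers, x_row):
    """x_row: a single sample shaped (1, n_features). Returns the predicted
    class and the number of classifiers actually queried."""
    T = len(classifiers)
    counts = {}
    for t, h in enumerate(classifiers, start=1):
        label = h.predict(x_row)[0]
        counts[label] = counts.get(label, 0) + 1
        leader, lead_votes = max(counts.items(), key=lambda kv: kv[1])
        runner_up = max([v for k, v in counts.items() if k != leader], default=0)
        if lead_votes - runner_up > T - t:   # remaining votes cannot change the winner
            return leader, t
    return leader, T

# Example (assuming `ensemble` is a list of fitted classifiers and X a 2-D array):
#   label, queried = dynamic_vote(ensemble, X[:1])
```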
Why do they work?
• Reasons for their good results:
• Statistical reasons: There is not enough data for the classification algorithm to find the optimal hypothesis.
• Computational reasons: A single run of the algorithm is not capable of reaching the optimal solution.
• Expressive reasons: The solution is outside the hypothesis space of the base learner.
Why do they work?
Thomas Dietterich
Why do they work?
A set of suboptimal solutions can be created that compensate for their individual limitations when combined in the ensemble.
Success story 1: Netflix prize challenge
• Dataset: ratings of 17,770 movies by 480,189 users
• The winning entry combined hundreds of models from three teams
• A variant of stacking was used to combine them
Success story 2: KDD cup
• KDD cup 2013: Predict which papers were written by a given author.
• The winning team used Random Forest and Boosting among other models combined with regularized linear regression.
• KDD cup 2014: Predict funding requests that deserve an A+ in donorschoose.org
• Multistage ensemble
• KDD cup 2015: Predict dropouts in MOOCs
• Multistage ensemble
Success story 3: Kinect
• Computer Vision
• Classify pixels into body parts (leg, head, etc.)
• Use Random Forests
Good things about ensembles
• A family of machine learning algorithms with some of the best overall performance, comparable to or better than SVMs
• Almost parameter-free learning algorithms
• If decision trees are the base learners, they are cheap (fast) both to train and to test
Bad things about ensembles
• None! Well, maybe something…
• Slower than a single classifier, since we create hundreds or thousands of classifiers
• This can be mitigated using ensemble pruning