L4. Ensembles of Decision Trees


Page 1: L4. Ensembles of Decision Trees

Ensembles

Gonzalo Martínez Muñoz Universidad Autónoma de Madrid

Page 2: L4. Ensembles of Decision Trees


•  What is an ensemble? How to build them?

•  Bagging, Boosting, Random forests, class-switching

•  Combiners

•  Stacking

•  Other techniques

•  Why do they work? Success stories

Outline

Page 3: L4. Ensembles of Decision Trees

•  The combination of opinions is deeply rooted in human culture

•  Formalized by the Condorcet Jury Theorem:

Given a jury of voters with independent errors: if the probability of each individual juror being correct is above 50%, then the probability of the jury (deciding by majority) being correct tends to 100% as the number of jurors increases.

Condorcet Jury Theorem

Nicolas de Condorcet (1743-1794), French mathematician
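A quick numerical check of the theorem (a minimal sketch; the juror accuracy p = 0.6 and the jury sizes are illustrative values, not from the slides):

from math import comb

def majority_correct(p, n):
    # Probability that more than half of n independent jurors are correct,
    # each juror being correct with probability p (n odd, so there are no ties)
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n // 2 + 1, n + 1))

for n in (1, 11, 101, 1001):
    print(n, round(majority_correct(0.6, n), 3))
# The majority accuracy grows towards 1: roughly 0.60, 0.75, 0.98, 1.00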

Page 4: L4. Ensembles of Decision Trees


•  An ensemble is a combination of classifiers that outputs a final classification.

What is an ensemble?

New instance: x

[Figure: the instance x is passed to T = 7 classifiers, whose votes are 1, 1, 2, 1, 2, 1, 1; the majority vote gives class 1]

Page 5: L4. Ensembles of Decision Trees

General idea

•  Generate many classifiers and combine them to get a final classification

•  They perform very well, in general better than any of the single learners they are composed of

•  The classifiers should be different from one another

•  It is important to generate diverse classifiers from the available data


Page 6: L4. Ensembles of Decision Trees

How to build them?

•  There are several techniques to build diverse base learners in an ensemble:

•  Use modified versions of the training set to train the base learners

•  Introduce changes in the learning algorithms

•  These strategies can also be used in combination.

•  Generally, the greater the randomization, the better the results

Page 7: L4. Ensembles of Decision Trees

How to build them?

•  Modifications of the training set can be generated by

•  Resampling the dataset: by bootstrap sampling (e.g. bagging) or weighted sampling (e.g. boosting).

•  Altering the attributes: the base learners are trained using different feature subsets (e.g. Random Subspaces)

•  Altering the class labels: grouping the classes into two new class values at random (e.g. ECOC) or modifying the class labels at random (e.g. class-switching)

Page 8: L4. Ensembles of Decision Trees

How to build them?

•  Randomizing the learning algorithms

•  Introducing some randomness into the learning algorithm, so that two consecutive executions of the algorithm output different classifiers

•  Running the base learner with different architectures, parameters, etc.

Page 9: L4. Ensembles of Decision Trees

Bagging

Input: dataset L, ensemble size T

1. for t = 1 to T:
2.   sample = BootstrapSample(L)
3.   h_t = TrainClassifier(sample)

Output: $H(\mathbf{x}) = \arg\max_j \sum_{t=1}^{T} I\left(h_t(\mathbf{x}) = j\right)$

Bagging = Bootstrap + Aggregation
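A minimal Python sketch of the procedure above (scikit-learn's DecisionTreeClassifier is used as the base learner; the function names bagging_fit and bagging_predict are illustrative):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, T=100, seed=0):
    # X and y are NumPy arrays
    rng = np.random.RandomState(seed)
    n = len(X)
    ensemble = []
    for t in range(T):
        idx = rng.randint(0, n, size=n)                   # bootstrap sample, drawn with replacement
        ensemble.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return ensemble

def bagging_predict(ensemble, X):
    votes = np.array([h.predict(X) for h in ensemble])    # shape (T, n_samples)
    # H(x) = argmax_j sum_t I(h_t(x) = j): majority vote per instance
    # (class labels are assumed to be non-negative integers)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)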

Page 10: L4. Ensembles of Decision Trees

Bagging

[Figure: the original dataset and bootstrap samples 1 to T; each bootstrap sample contains repeated examples and leaves out other examples]

Page 11: L4. Ensembles of Decision Trees

Considerations about bagging

•  Uses, on average, 63.2% of the distinct training examples to build each classifier (see the quick check after this list).

•  It is very robust against label noise.

•  In general, it improves the error of the single learner.

•  Easily parallelizable
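The 63.2% figure follows from the probability that a given example appears in a bootstrap sample of size N, which is 1 − (1 − 1/N)^N → 1 − 1/e; a one-line check:

N = 10_000
print(1 - (1 - 1/N)**N)   # ≈ 0.632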

Page 12: L4. Ensembles of Decision Trees

Boosting

Input: dataset L, ensemble size T

1. Initialize all example weights to 1/N

2. for t = 1 to T:

3.   h_t = BuildClassifier(L, weights)

4.   e_t = WeightedError(L, weights)

5.   if e_t == 0 or e_t ≥ 0.5: break

6.   Multiply the weights of the examples correctly classified by h_t by e_t / (1 − e_t)

7.   Normalize the weights
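A minimal sketch of this weighting scheme in the style of AdaBoost.M1 (decision stumps from scikit-learn are an illustrative choice of weak learner; boosting_fit is a hypothetical helper name):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boosting_fit(X, y, T=100):
    n = len(X)
    w = np.full(n, 1.0 / n)                       # step 1: weights initialized to 1/N
    ensemble = []
    for t in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        miss = h.predict(X) != y
        e = w[miss].sum()                         # step 4: weighted error
        if e == 0 or e >= 0.5:                    # step 5: stopping condition
            break
        beta = e / (1 - e)
        w[~miss] *= beta                          # step 6: down-weight correctly classified examples
        w /= w.sum()                              # step 7: normalize
        ensemble.append((h, np.log(1 / beta)))    # classifier and its voting weight
    return ensemble
# Prediction: weighted majority vote of the stored classifiers using log(1/beta) as weights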

Page 13: L4. Ensembles of Decision Trees

Boosting

[Figure: the original dataset and the reweighted datasets at iterations 1 and 2; the examples misclassified at each iteration receive larger weights in the next one]

Page 14: L4. Ensembles of Decision Trees

Considerations about boosting

•  Obtains very good generalization error on average

•  It is not robust against class label noise

•  It can increase the error with respect to the single base classifier

•  Cannot be easily implemented in parallel

Page 15: L4. Ensembles of Decision Trees

Random forest

•  Breiman defined a Random forest as an ensemble that:

•  Has decision trees as its base learner

•  Introduces some randomness in the learning process.

•  Under this definition, bagging of decision trees is a random forest, and in fact it is one. However…

Page 16: L4. Ensembles of Decision Trees

Random forest

•  In practice, it is often considered an ensemble in which:

•  Each tree is generated, as in bagging, using bootstrap samples

•  Each tree is a special tree in which every split is computed using:

•  A random subset of the features

•  The best split within this subset is then selected

•  Unpruned trees are used
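This is essentially what scikit-learn's RandomForestClassifier implements; a minimal usage sketch (the iris dataset and the parameter values are illustrative):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
# Bootstrap samples, unpruned trees, sqrt(#features) random candidate features per split
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt", bootstrap=True)
rf.fit(X, y)
print(rf.score(X, y))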

Page 17: L4. Ensembles of Decision Trees

Considerations about random forests

•  Its performance is better than boosting in most cases

•  It is robust to noise (does not overfit)

•  Random forest introduces an additional randomization mechanism with respect to bagging

•  Easily parallelizable

•  Random trees are very fast to train

Page 18: L4. Ensembles of Decision Trees

Class switching

•  Class switching is an ensemble method in which diversity is obtained by using different versions of the training data polluted with class label noise.

•  Specifically, to train each base learner, the class label of each training point is changed to a different class label with probability p.
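A minimal sketch of the label-switching step (the helper name switch_labels is illustrative):

import numpy as np

def switch_labels(y, p=0.3, seed=None):
    # With probability p, replace each label by a different class chosen uniformly at random
    rng = np.random.default_rng(seed)
    y_noisy = y.copy()
    classes = np.unique(y)
    for i in np.where(rng.random(len(y)) < p)[0]:
        y_noisy[i] = rng.choice(classes[classes != y[i]])
    return y_noisy

# Each base tree is then trained on (X, switch_labels(y, p)) instead of (X, y)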

Page 19: L4. Ensembles of Decision Trees

Class switching

[Figure: the original dataset and T randomly label-switched versions (random noise 1 to T), with p = 30% of the labels switched in each]

Page 20: L4. Ensembles of Decision Trees

Example

•  2D example

•  Boundary is x1=x2

•  x1~U[0, 1] x2~U[0, 1]

•  Not an easy task for a normal decision tree

•  Let’s try bagging, boosting and class-switching with p=0.2 and p=0.4

[Figure: the unit square with axes x1 and x2, divided by the boundary x1 = x2 into Class 1 and Class 2]

Page 21: L4. Ensembles of Decision Trees

[Figure: decision boundaries produced by bagging, boosting, switching p=0.2 and switching p=0.4, shown for ensembles of 1, 11, 101 and 1001 classifiers]

Results

Page 22: L4. Ensembles of Decision Trees


Parametrization

Generally used parameters:

•  Bagging: unpruned decision trees; ensemble size: as many as possible; other options: smaller bootstrap samples

•  Boosting: pruned decision trees (weak learners); ensemble size: hundreds

•  Random forest: unpruned random decision trees; ensemble size: as many as possible; number of random features per split = log(#features) or sqrt(#features)

•  Class-switching: unpruned decision trees; ensemble size: a thousand or more; fraction of instances to modify: p ≈ 30%

Page 23: L4. Ensembles of Decision Trees

Combiners

•  The combination techniques can be divided into two groups:

•  Voting strategies: the ensemble prediction is the class label that is predicted most often by the base learners. The votes can also be weighted

•  Non-voting strategies: operations such as the maximum, minimum, product, median or mean can be applied to the confidence levels output by the individual base learners.

•  There is no single winning strategy among the different combination techniques; which one works best depends on many factors
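A minimal sketch contrasting a majority-voting combiner with a non-voting (mean of confidences) combiner, assuming each base learner exposes predict and predict_proba as in scikit-learn (the function names are illustrative):

import numpy as np

def combine_by_vote(classifiers, X):
    votes = np.array([h.predict(X) for h in classifiers])     # shape (T, n_samples)
    # Majority vote per instance (class labels assumed to be non-negative integers)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)

def combine_by_mean(classifiers, X):
    # Average the class-confidence vectors and pick the most confident class
    mean_proba = np.mean([h.predict_proba(X) for h in classifiers], axis=0)
    return mean_proba.argmax(axis=1)                           # indices into classifiers[0].classes_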

Page 24: L4. Ensembles of Decision Trees

Stacking

•  In stacking, the combination phase is included in the learning process.

•  First the base learners are trained on some version of the original training set

•  After that, the predictions of the base learners are used as new feature vectors to train a second level learner (meta-learner).

•  The key point of this strategy is to improve on the guesses made by the base learners by generalizing over them with a meta-learner.
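A minimal sketch of this two-level scheme using scikit-learn's StackingClassifier (the choice of base learners and meta-learner is illustrative; the meta-features are the out-of-fold predictions of the base learners):

from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier()), ("rf", RandomForestClassifier())],
    final_estimator=LogisticRegression(),   # second-level meta-learner
    cv=5,                                   # out-of-fold base-learner predictions become the meta-features
)
# stack.fit(X_train, y_train); stack.predict(X_test)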

Page 25: L4. Ensembles of Decision Trees

[Figure: a Random forest (trees h1, h2, ..., hn) produces evidence histograms that form the stacking dataset; a stacked classifier trained on them produces the final output]

Stacking example

Extract descriptors

1.  A Random forest is trained on the descriptors:

•  Each leaf node stores the class histogram of the training examples that reach it

2.  In a second phase stacking is applied:

•  The histograms of the leaf nodes reached by an instance are accumulated over all trees

•  The accumulated histograms are concatenated

•  Boosting is applied to the concatenated histograms

Page 26: L4. Ensembles of Decision Trees

Ensemble pruning

1.  Random ordering produced by bagging: h1, h2, h3, ..., hT

2.  New ordering: hs1, hs2, hs3, ..., hsT

3.  Pruning: keep only the first classifiers of the new ordering, hs1, ..., hsM (a given % of the ensemble)

[Figure: error vs. number of classifiers (20 to 200) for bagging and the reduce-error ordering, with a single CART tree as reference]

Result: size reduction and classification error reduction
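A minimal sketch of the reduce-error reordering step (greedy selection on a held-out validation set; helper names are illustrative and class labels are assumed to be integers 0..K-1):

import numpy as np

def reduce_error_order(classifiers, X_val, y_val, n_classes):
    # Greedily build the new ordering: at each step add the classifier that
    # minimizes the majority-vote error of the current subensemble on the validation set
    preds = [h.predict(X_val) for h in classifiers]
    remaining = list(range(len(classifiers)))
    order = []
    votes = np.zeros((len(y_val), n_classes))
    while remaining:
        errors = []
        for i in remaining:
            v = votes.copy()
            v[np.arange(len(y_val)), preds[i]] += 1
            errors.append(np.mean(v.argmax(axis=1) != y_val))
        best = remaining[int(np.argmin(errors))]
        order.append(best)
        votes[np.arange(len(y_val)), preds[best]] += 1
        remaining.remove(best)
    return order   # pruning keeps only the first M classifiers of this ordering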

Page 27: L4. Ensembles of Decision Trees

Dynamic ensemble pruning

New instance: x

[Figure: the T = 7 classifiers are queried sequentially; the accumulated votes per class are tracked as t grows, and the final class is 1]

•  Do we really need to query all the classifiers in the ensemble?  NO
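A minimal sketch of the idea for majority voting: stop querying as soon as the remaining classifiers can no longer change the winning class (a simple deterministic rule; statistical stopping rules can prune even earlier, at a small risk of changing the prediction):

from collections import Counter

def dynamic_vote(classifiers, x):
    # Query the classifiers one at a time and stop early when the leader is unbeatable
    votes, T = Counter(), len(classifiers)
    for t, h in enumerate(classifiers, start=1):
        votes[h.predict([x])[0]] += 1
        top = votes.most_common(2)
        lead = top[0][1]
        runner_up = top[1][1] if len(top) > 1 else 0
        if lead - runner_up > T - t:              # remaining votes cannot overturn the leader
            return top[0][0], t                   # predicted class and number of classifiers queried
    return votes.most_common(1)[0][0], T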

Page 28: L4. Ensembles of Decision Trees

Why do they work?

•  Reasons for their good results:

•  Statistical reasons: There are not enough data for the classification algorithm to obtain an optimum hypothesis.

•  Computational reasons: The single algorithm is not capable of reaching the optimum solution.

•  Expressive reasons: The solution is outside the hypothesis space.


Page 29: L4. Ensembles of Decision Trees

Why do they work?

[Figure: Thomas Dietterich's illustration of the statistical, computational and representational reasons why ensembles work]

Page 30: L4. Ensembles of Decision Trees

Why do they work?


A set of suboptimal solutions can be created that compensate for each other's limitations when combined in the ensemble.

Page 31: L4. Ensembles of Decision Trees

Success story 1: Netflix prize challenge

•  Dataset: ratings of 17,770 movies by 480,189 users

•  The winning solution combines hundreds of models from three teams

•  A variant of stacking

Page 32: L4. Ensembles of Decision Trees

Success story 2: KDD cup

•  KDD cup 2013: predict which papers were written by a given author.

•  The winning team used Random Forests and boosting, among other models, combined with regularized linear regression.

•  KDD cup 2014: Predict funding requests that deserve an A+ in donorschoose.org

•  Multistage ensemble

•  KDD cup 2015: predict dropouts in MOOCs

•  Multistage ensemble

Page 33: L4. Ensembles of Decision Trees

Success story 3: Kinect

•  Computer Vision

•  Classify pixels into body parts (leg, head, etc)

•  Use Random Forests

Page 34: L4. Ensembles of Decision Trees


•  A family of machine learning algorithms with one of the best overall performances, comparable to or better than SVMs

•  Almost parameter-free learning algorithms.

•  If decision trees are the base learners, they are cheap (fast) both to train and to test.

Good things about ensembles

Page 35: L4. Ensembles of Decision Trees


•  None! Well maybe something…

•  Slower than a single classifier, since we create hundreds or thousands of classifiers.

•  Can be mitigated using ensemble pruning

Bad things about ensembles