
Ensemble Methods: Bagging and Boosting


Tufts COMP 135: Introduction to Machine Learning
https://www.cs.tufts.edu/comp/135/2019s/

Many slides attributable to: Liping Liu and Roni Khardon (Tufts); T. Q. Chen (UW); James, Witten, Hastie, Tibshirani (ISL/ESL books)

Prof. Mike Hughes


Unit Objectives

Big idea: We can improve performance by aggregating decisions from MANY predictors.

• V1: Predictors are Independently Trained
  • Using bootstrap subsamples of examples: “Bagging”
  • Using random subsets of features
  • Exemplary method: Random Forest / ExtraTrees

• V2: Predictors are Sequentially Trained
  • Each successive predictor “boosts” performance
  • Exemplary method: XGBoost


Motivating Example

3 binary classifiers. Model their predictions as independent random variables; each one is correct 70% of the time.

What is the chance that the majority vote is correct?


Answer: 0.784


Motivating Example

5 binary classifiers. Model their predictions as independent random variables; each one is correct 70% of the time.

What is the chance that the majority vote is correct?


Answer: 0.8369…


Motivating Example

101 binary classifiers. Model their predictions as independent random variables; each one is correct 70% of the time.

What is the chance that the majority vote is correct?


Answer: >0.99
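These answers follow from the binomial distribution. A quick sketch in Python to check them (the helper below is ours, not the slides'):

```python
# Probability that a majority vote of B independent classifiers is correct,
# when each classifier is independently correct with probability p.
from math import comb

def majority_vote_accuracy(B, p):
    # Majority vote is correct when more than half of the B votes are correct:
    # sum the binomial PMF over k = B//2 + 1, ..., B.
    return sum(comb(B, k) * p**k * (1 - p)**(B - k)
               for k in range(B // 2 + 1, B + 1))

for B in (3, 5, 101):
    print(B, majority_vote_accuracy(B, 0.7))
# prints ~0.784, ~0.8369, and >0.999, matching the slides
```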


Key Idea: Diversity

• Vary the training data


Bootstrap Sampling


Bootstrap Sampling in Python
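The slide's code was an image lost in extraction; here is a minimal sketch of the same idea with numpy (variable names are ours, not the slide's):

```python
import numpy as np

x = np.asarray([1.1, 2.2, 3.3, 4.4, 5.5])   # toy training set of N examples
rng = np.random.RandomState(42)             # seeded generator, for repeatability

# Draw N indices uniformly at random WITH replacement, then index the data:
# some examples appear multiple times, others not at all.
N = x.shape[0]
boot_ids = rng.choice(N, size=N, replace=True)
x_boot = x[boot_ids]
print(boot_ids, x_boot)
```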


Bootstrap Aggregation: BAgg-ing

• Draw B “replicas” of the training set
  • Use bootstrap sampling with replacement

• Make predictions by averaging the B models’ outputs
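As a concrete, hedged sketch: bagging a decision tree in scikit-learn (all parameter values and array names here are illustrative, not from the slides):

```python
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

bagged = BaggingRegressor(
    DecisionTreeRegressor(),  # any base regressor/classifier works
    n_estimators=10,          # B = 10 bootstrap replicas
    bootstrap=True,           # sample training examples with replacement
    random_state=0)
# bagged.fit(x_train, y_train)   # assumed (N, F) and (N,) arrays
# bagged.predict(x_test)         # averages the 10 trees' predictions
```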


Regression Example: 1 tree

Images are taken from Adele Cutler’s slides


Regression Example: 10 trees

The solid black line is the ground truth; the red lines are the predictions of individual regression trees.

Images are taken from Adele Cutler’s slides


Regression Example: Average of 10 trees

The solid black line is the ground truth; the blue line is the average of the 10 regression trees’ predictions.

Images are taken from Adele Cutler’s slides


Binary Classification

Images are taken from Adele Cutler’s slides


Decision Boundary: 1 tree

Images are taken from Adele Cutler’s slides


Decision Boundary: 25 trees

Images are taken from Adele Cutler’s slides


Average over 25 trees

Images are taken from Adele Cutler’s slides


Variance of averages


• Given B independent observations

• Each one has variance v

• Compute the mean of the B observations

• What is the variance of this estimator?
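The answer, filling in the step the slide leaves open: for independent observations, variances add, so the mean of B observations with variance v each has variance

```latex
\mathrm{Var}\!\Big(\frac{1}{B}\sum_{b=1}^{B} x_b\Big)
  \;=\; \frac{1}{B^2}\sum_{b=1}^{B}\mathrm{Var}(x_b)
  \;=\; \frac{v}{B}
```

This is why averaging many (nearly) independent predictors reduces variance.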


Why Bagging Works: Reduce Variance!


• Flexible learners applied to small datasets have high variance w.r.t. the data distribution
  • A small change in the training set leads to a big change in predictions on the heldout set

• Bagging decreases heldout error by decreasing the variance of predictions

• Bagging can be applied to any base classifiers/regressors


Another Idea for Diversity

• Vary the features


Random Forest


Combine example diversity AND feature diversity

For t = 1 to T (number of trees):
  Create a bootstrap sample of the training set.
  Greedily train a tree on it, using random subsamples of features:
    For each node (up to a maximum depth):
      Randomly select m of the F features.
      Find the best split among those m features.

Average the T trees’ outputs to get predictions for new data.
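A hedged scikit-learn rendering of this pseudocode (parameter values are illustrative; max_features plays the role of m):

```python
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=100,     # T trees, each grown on a bootstrap sample
    max_features='sqrt',  # m = sqrt(F) features considered at each split
    random_state=0)
# forest.fit(x_train, y_train); forest.predict(x_test) aggregates the trees.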


Extremely Randomized Trees, aka “ExtraTrees” in sklearn


Speed and feature diversity

For t = 1 to T (number of trees):
  Create a bootstrap sample of the training set.
  Greedily train a tree on it, using random subsamples of features:
    For each node (up to a maximum depth):
      Randomly select m of the F features.
      Instead of searching for each feature’s best split, try ONE random split at each of the m features, then keep the best of these m candidate splits.

Average the T trees’ outputs to get predictions for new data.
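The corresponding (hedged) sklearn call; note that sklearn's ExtraTrees defaults to training each tree on the full dataset rather than a bootstrap sample:

```python
from sklearn.ensemble import ExtraTreesClassifier

extra = ExtraTreesClassifier(
    n_estimators=100,
    max_features='sqrt',  # m features per split, as above
    bootstrap=False,      # sklearn default: no bootstrap resampling
    random_state=0)
# Split thresholds are drawn at random per candidate feature, then the
# best of those m random splits is kept -- hence the speedup.
```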



[Figure from the ISL textbook (image not preserved); panel label: “Single tree”]



Random Forest in Industry


Example: the Microsoft Kinect RGB-D camera, which uses random forests to label human body parts in depth images in real time.



Summary of Ensemble v1: Independent Predictors

• Average over independent base predictors

• Why it works: Reduce variance

• PRO
  • Often better heldout performance than the base model

• CON
  • Training B separate models is expensive


Vocabulary: Residual
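The slide's body did not survive extraction; the standard definition, consistent with the notation on the next slide, is that the residual is the part of the target the current model has not yet explained:

```latex
r_i \;=\; y_i - f(x_i)
```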


Ensemble Method v2: Sequentially Predict Residual

• Model f1: trained on (x, y) pairs
  • Capture the residual: r1 = y − f1(x)

• Model f2: trained on (x, r1) pairs
  • Capture the residual: r2 = r1 − f2(x)

• Repeat!

Combine weak learners into a powerful committee
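A minimal sketch of this loop with shallow sklearn trees. The shrinkage factor lr is an addition from ISL's presentation of boosting, not from this slide, and all names are ours:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_boosted_trees(x, y, n_rounds=100, lr=0.1, max_depth=2):
    # x: (N, F) array, y: (N,) array
    trees = []
    residual = y.astype(float)          # start with r0 = y
    for _ in range(n_rounds):
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(x, residual)
        trees.append(tree)
        residual -= lr * tree.predict(x)   # capture what remains unexplained
    return trees

def predict_boosted(trees, x, lr=0.1):
    # The committee's prediction is the (shrunken) sum of all weak learners.
    return lr * sum(tree.predict(x) for tree in trees)
```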


AdaBoost (Adaptive Boosting): reweight misclassified examples


ESL textbook


Boosting with depth-1 tree “stump”


ESL textbook


Boosting for Regression Trees


ISL textbook


Boosted Tree: Optimization
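The slide's equations were lost in extraction; the standard setup, per the XGBoost materials cited on the following slides, is an additive model of T trees, fit stagewise one tree at a time:

```latex
\hat{y}_i \;=\; \sum_{t=1}^{T} f_t(x_i),
\qquad
\hat{y}_i^{(t)} \;=\; \hat{y}_i^{(t-1)} + f_t(x_i)
```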


Gradient Boosting Algorithm


At each round:
• Compute the gradient of the loss at each training example
• Decide the tree structure by fitting to those gradients
• Decide the leaf values by progressively minimizing the loss

Add up the trees to get the final model.

Page 38: Ensemble Methods - Department of Computer Science · 2019. 5. 3. · Extremely Randomized Trees aka “ExtraTrees” in sklearn Mike Hughes - Tufts COMP 135 - Spring 2019 25 Speed

What about regularization?


Minimization objective when adding tree t: a loss function term plus a regularization term that limits the complexity of tree t.
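Written out (this is the form given in the XGBoost documentation linked below):

```latex
\mathrm{obj}^{(t)}
  \;=\; \sum_{i=1}^{n} \ell\big(y_i,\; \hat{y}_i^{(t-1)} + f_t(x_i)\big)
  \;+\; \Omega(f_t)
```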

https://xgboost.readthedocs.io/

Page 39: Ensemble Methods - Department of Computer Science · 2019. 5. 3. · Extremely Randomized Trees aka “ExtraTrees” in sklearn Mike Hughes - Tufts COMP 135 - Spring 2019 25 Speed

Regularization


We can penalize:
• Number of nodes in the tree
• Depth of the tree
• Scalar value predicted in each region (L2 penalty)

Credit: T. Chen, https://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf

Page 40: Ensemble Methods - Department of Computer Science · 2019. 5. 3. · Extremely Randomized Trees aka “ExtraTrees” in sklearn Mike Hughes - Tufts COMP 135 - Spring 2019 25 Speed

Example Regularization Term
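The slide's equation was an image; the term used in T. Chen's deck (and the XGBoost docs) charges a cost γ per leaf plus an L2 penalty on each leaf's predicted value. Here T counts the leaves of this one tree (not the trees in the ensemble) and w_j is leaf j's value:

```latex
\Omega(f) \;=\; \gamma\, T \;+\; \tfrac{1}{2}\,\lambda \sum_{j=1}^{T} w_j^2
```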


Credit: T. Chen, https://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf

Page 41: Ensemble Methods - Department of Computer Science · 2019. 5. 3. · Extremely Randomized Trees aka “ExtraTrees” in sklearn Mike Hughes - Tufts COMP 135 - Spring 2019 25 Speed

To Improve Gradient Boosting

Can extend gradient boosting with:
• 2nd-order gradient information (a Newton step)
• Penalties on tree complexity
• A very smart practical implementation

Result: Extreme Gradient Boosting, aka XGBoost (T. Chen & C. Guestrin)
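A hedged sketch of the xgboost Python package's scikit-learn wrapper; the parameter values are illustrative choices, not recommendations from these slides:

```python
import xgboost as xgb

model = xgb.XGBClassifier(
    n_estimators=200,    # number of boosting rounds (trees)
    max_depth=3,         # cap tree complexity
    learning_rate=0.1,   # shrink each tree's contribution
    reg_lambda=1.0)      # L2 penalty on leaf values, as in the slides above
# model.fit(x_train, y_train); model.predict_proba(x_test)
```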


XGBoost: Extreme Gradient Boosting

• Kaggle competitions in 2015
  • 29 total winning solutions to challenges published
  • 17 / 29 (58%) used XGBoost
  • 11 / 29 (37%) used deep neural networks


More details (beyond this class)

ESL textbook, Section 10.10

Good slide deck by T. Q. Chen (first author of XGBoost):
https://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf


Summary of Boosting

PRO
• Like all tree methods, invariant to scaling of inputs (no need for careful feature normalization)
• Can be scalable

CON
• Greedy sequential fit may not be globally optimal

IN PRACTICE
• XGBoost
  • Very popular in many benchmark competitions and industrial applications


Unit Objectives

Big idea: We can improve performance by aggregating decisions from MANY predictors.

• V1: Predictors are Independently Trained
  • Using bootstrap subsamples of examples: “Bagging”
  • Using random subsets of features
  • Exemplary method: Random Forest / ExtraTrees

• V2: Predictors are Sequentially Trained
  • Each successive predictor “boosts” performance
  • Exemplary method: XGBoost
