Lecture 21: Bagging and Boosting

Hao Helen Zhang
MATH 574M, University of Arizona

Outline

Bagging Methods
- Bagging Trees
- Random Forest

AdaBoost
- Strong Learners vs. Weak Learners
- Motivations
- Algorithms

Additive Models
- AdaBoost and Additive Logistic Regression

Ensemble Methods (Model Averaging)

A machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms:

widely used in statistical classification and regression;

can reduce variance and help to avoid overfitting;

usually applied to decision tree methods, but can be used with any type of method.

Bootstrap Aggregating (Bagging)

Bagging is a special case of the model averaging approach.

Bootstrap aggregation = Bagging.

Bagging leads to "improvements for unstable procedures" (Breiman, 1996), e.g. neural nets, classification and regression trees, and subset selection in linear regression (Breiman, 1994).

On the other hand, it can mildly degrade the performance of stable methods such as K-nearest neighbors (Breiman, 1996).

Basic Idea

Given a standard training set D of size n, bagging

generates B new training sets D_i, each of size n', by sampling from D uniformly and with replacement.

The B models are fitted using the above B bootstrap samples and combined by averaging the output (for regression) or voting (for classification).

This kind of sample is known as a bootstrap sample.

By sampling with replacement, some observations may be repeated in each D_i.

If n' = n, then for large n, the set D_i is expected to contain the fraction (1 - 1/e) ≈ 63.2% of the unique examples of D, the rest being duplicates.
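A quick numerical check of this 63.2% figure (an illustrative sketch, not part of the original slides):

import numpy as np

# Sketch: a bootstrap sample of size n' = n should contain roughly
# 1 - 1/e (about 63.2%) of the distinct original observations.
rng = np.random.default_rng(0)
n = 100_000
boot_idx = rng.integers(0, n, size=n)        # sample n indices with replacement
frac_unique = np.unique(boot_idx).size / n
print(frac_unique, 1 - 1 / np.e)             # both are close to 0.632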

Bagging Procedures

Bagging uses the bootstrap to improve the estimate or prediction of a fit.

Given data Z = {(x_1, y_1), ..., (x_n, y_n)}, we generate B bootstrap samples Z^{*b}:

Empirical distribution \hat{P}: put equal probability 1/n on each (x_i, y_i) (discrete).

Generate Z^{*b} = {(x_1^*, y_1^*), ..., (x_n^*, y_n^*)} ~ \hat{P}, for b = 1, ..., B.

Obtain \hat{f}^{*b}(x), b = 1, ..., B.

The Monte Carlo estimate of the bagging estimate is

\hat{f}_{bag}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x).
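A minimal sketch of this procedure for regression, assuming scikit-learn regression trees as the base fit (the function names below are illustrative, not from the lecture):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagging_fit(X, y, B=200, seed=0):
    """Fit B regression trees, one per bootstrap sample Z^{*b}."""
    rng = np.random.default_rng(seed)
    n = len(y)
    fits = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)                 # bootstrap sample
        fits.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return fits

def bagging_predict(X, fits):
    """Monte Carlo bagging estimate: average of the B fitted predictions."""
    return np.mean([f.predict(X) for f in fits], axis=0)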

Properties of Bagging Estimates

Advantages:

Note \hat{f}_{bag}(x) \to E_{\hat{P}} \hat{f}^{*}(x) as B \to \infty.

\hat{f}_{bag}(x) typically has smaller variance than \hat{f}(x).

\hat{f}_{bag}(x) differs from \hat{f}(x) only when the latter is a nonlinear or adaptive function of the data.

Bagging Classification Trees

In multiclass (K ≥ 3) problems, there are two scenarios:

(1) \hat{f}^{b}(x) is an indicator vector, with one 1 and K-1 0's (hard classification);

(2) \hat{f}^{b}(x) = (\hat{p}_1, ..., \hat{p}_K), the estimates of the class probabilities (soft classification).

The bagged estimate is the average prediction at x from the B trees:

\hat{f}_{bag}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x).

Bagging Classification Trees (cont.)

There are two types of averaging:

(1) use the majority vote;

(2) use the averaged probabilities.

The bagged classifier is

\hat{G}_{bag}(x) = \arg\max_{k=1,...,K} \hat{f}_{bag,k}(x),

where \hat{f}_{bag,k}(x) is the k-th component of \hat{f}_{bag}(x).
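A sketch of the two aggregation schemes for bagged classification trees, again assuming scikit-learn trees and integer class labels 0, ..., K-1 (illustrative code, not from the lecture):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bag_trees(X, y, B=200, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)                 # bootstrap sample
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def predict_majority_vote(X, trees):
    """(1) majority vote over the B hard class predictions."""
    votes = np.stack([t.predict(X) for t in trees]).astype(int)   # B x n_test
    return np.array([np.bincount(col).argmax() for col in votes.T])

def predict_averaged_proba(X, trees):
    """(2) argmax of the averaged class probabilities.
    Assumes every bootstrap sample contains all K classes,
    so the column order of predict_proba matches across trees."""
    probs = np.mean([t.predict_proba(X) for t in trees], axis=0)
    return trees[0].classes_[np.argmax(probs, axis=1)]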

Example

Sample size n = 30, two classes

p = 5 features, each having a standard Gaussian distribution with pairwise correlation Corr(X_j, X_k) = 0.95.

The response Y was generated according to

Pr(Y = 1|x1 ≤ 0.5) = 0.2, Pr(Y = 1|x1 > 0.5) = 0.8.

The Bayes error is 0.2.

A test sample of size 2,000 was generated from the same population.

We fit

classification trees to the training sample

classification trees to each of 200 bootstrap samples
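A sketch of this data-generating mechanism; the shared-factor construction of the equicorrelated Gaussians is my own choice, not specified in the lecture:

import numpy as np

rng = np.random.default_rng(0)

def simulate(n, p=5, rho=0.95):
    """p standard Gaussian features with pairwise correlation rho,
    and P(Y=1) = 0.2 if x1 <= 0.5, 0.8 if x1 > 0.5."""
    z0 = rng.standard_normal((n, 1))                                    # shared factor
    X = np.sqrt(rho) * z0 + np.sqrt(1 - rho) * rng.standard_normal((n, p))
    p1 = np.where(X[:, 0] <= 0.5, 0.2, 0.8)
    y = (rng.random(n) < p1).astype(int)
    return X, y

X_train, y_train = simulate(30)      # training sample of size 30
X_test, y_test = simulate(2000)      # test sample of size 2,000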

[Figure 8.9, The Elements of Statistical Learning (Hastie, Tibshirani & Friedman, 2001), Chapter 8: Bagging trees on the simulated dataset. The top left panel shows the original tree; five trees grown on bootstrap samples are also shown. The bootstrap trees split on different variables (x.2, x.3, x.1, x.4) and at different cutpoints.]

About Bagging Trees

The original tree and five bootstrap trees are all different:

with different splitting features

with different splitting cutpoints

The trees have high variance due to the correlation in the predictors.

Under squared-error loss, averaging reduces variance and leaves bias unchanged.

Therefore, bagging will often decrease MSE.

[Figure 8.10, The Elements of Statistical Learning (Hastie, Tibshirani & Friedman, 2001), Chapter 8: Error curves for the bagging example of Figure 8.9. Shown is the test error of the original tree and of the bagged trees as a function of the number of bootstrap samples (0 to 200), together with the Bayes error. The green points correspond to the majority vote, while the purple points average the probabilities.]

About Bagging

Bagging can dramatically reduce the variance of unstable procedures like trees, leading to improved prediction.

Bagging smooths out this variance and hence reduces the test error; it can stabilize unstable procedures.

The simple structure in the model can be lost due to bagging:

a bagged tree is no longer a tree, and the bagged estimate is not easy to interpret.

Under 0-1 loss for classification, bagging may not help, due to the nonadditivity of bias and variance.

Random Forest

Random forest is an ensemble classifier that consists of many decision trees and outputs the class that is the mode of the classes output by the individual trees.

The algorithm was developed by Breiman (2001) and Cutler.

The method combines Breiman's "bagging" idea and the random selection of features, in order to construct a collection of decision trees with controlled variation.

The selection of a random subset of features is an example of the random subspace method, a way to implement stochastic discrimination.

Learning Algorithm for Building A Tree

Denote the training size by n and the number of variables by p. Assume m < p is the number of input variables to be used to determine the decision at a node of the tree.

Randomly choose n samples with replacement (i.e. take a bootstrap sample).

Use the rest of the samples to estimate the error of the tree, by predicting their classes.

For each node of the tree, randomly choose m variables on which to base the decision at that node. Calculate the best split based on these m variables in the training set.

Each tree is fully grown and not pruned.
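This recipe is what standard random-forest implementations follow; below is a short usage sketch with scikit-learn (the parameter names are that library's, not the lecture's), illustrating the per-node feature subsampling and the out-of-bag error estimate described above:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data standing in for a real training set.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_estimators=500,        # number of bootstrap trees
    max_features="sqrt",     # m variables considered at each node
    oob_score=True,          # error estimated from the left-out (out-of-bag) samples
    random_state=0,
)
rf.fit(X, y)
print("out-of-bag error estimate:", 1 - rf.oob_score_)
print("impurity-based importances:", rf.feature_importances_)
# Permutation-based importances are available separately via
# sklearn.inspection.permutation_importance.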

Advantages of Random Forest

highly accurate in many real examples; fast; handles a very large number of input variables

can estimate the importance of variables for classification through permutation

generates an internal unbiased estimate of the generalization error as the forest building progresses

can impute missing data and maintains accuracy when a large proportion of the data are missing

provides an experimental way to detect variable interactions

can balance error in unbalanced data sets

computes proximities between cases, useful for clustering, detecting outliers, and (by scaling) visualizing the data

can be extended to unlabeled data, leading to unsupervised clustering, outlier detection and data views

Disadvantages of Random Forest

Random forests are prone to overfitting for some data sets; this is even more pronounced in noisy classification/regression tasks.

Random forests do not handle large numbers of irrelevant features.

Motivation: Model Averaging

Boosting methods aim to improve the performance of weak learners.

Strong learners: given a large enough dataset, the classifier can learn the target function arbitrarily accurately with probability 1 - τ (where τ > 0 can be arbitrarily small).

Weak learners: given a large enough dataset, the classifier can barely learn the target function, with probability 1/2 + τ.

The error rate is only slightly better than random guessing.

Can we construct a strong learner from weak learners, and how?

Boosting

Motivation: combines the outputs of many weak classifiers toproduce a powerful “committee”.

Similar to bagging and other committee-based approaches

Originally designed for classification problems, but can also be extended to regression problems.

Consider the two-class problem with Y ∈ {-1, 1}. The classifier G(x) has training error

\overline{err} = \frac{1}{n} \sum_{i=1}^{n} I(y_i \ne G(x_i)).

The expected error rate on future predictions is E_{X,Y} I(Y \ne G(X)).

Classification Trees

Classification trees can be simple, but they often produce noisy or weak classifiers.

Bagging (Breiman 1996): Fit many large trees to bootstrap-resampled versions of the training data, and classify by a majority vote.

Boosting (Freund & Schapire 1996): Fit many large or small trees to re-weighted versions of the training data. Classify by a weighted majority vote.

In general, Boosting > Bagging > Single Tree.

Breiman's comment: "AdaBoost ... best off-the-shelf classifier in the world" (1996, NIPS workshop).

AdaBoost (Discrete Boost)

Adaptively resampling the data (Freund & Schapire 1997; winners of the 2003 Gödel Prize):

1. Sequentially apply the weak classification algorithm to repeatedly modified (re-weighted) versions of the data.

2. This produces a sequence of weak classifiers

G_m(x), m = 1, 2, ..., M.

3. The predictions from the G_m's are then combined through a weighted majority vote to produce the final prediction

G(x) = \mathrm{sign}\left( \sum_{m=1}^{M} \alpha_m G_m(x) \right).

Here α_1, ..., α_M ≥ 0 are computed by the boosting algorithm.

AdaBoost Algorithm

1. Initialize the observation weights w_i = 1/n, i = 1, ..., n.

2. For m = 1 to M:

(a) Fit a classifier G_m(x) to the training data using weights w_i.

(b) Compute the weighted error

err_m = \frac{\sum_{i=1}^{n} w_i I(y_i \ne G_m(x_i))}{\sum_{i=1}^{n} w_i}.

(c) Compute the importance of G_m as

\alpha_m = \log\left( \frac{1 - err_m}{err_m} \right).

(d) Update w_i \leftarrow w_i \cdot \exp[\alpha_m \cdot I(y_i \ne G_m(x_i))], i = 1, ..., n.

3. Output G(x) = \mathrm{sign}\left( \sum_{m=1}^{M} \alpha_m G_m(x) \right).
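A compact implementation sketch of the algorithm above, using scikit-learn decision stumps as the weak learner (the function names and the numerical clipping of err_m are my additions):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=100):
    """y coded as -1/+1. Returns the weak learners G_m and their weights alpha_m."""
    n = len(y)
    w = np.full(n, 1.0 / n)                            # step 1: uniform weights
    learners, alphas = [], []
    for _ in range(M):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)               # step 2(a): weighted fit
        miss = (stump.predict(X) != y).astype(float)
        err = np.sum(w * miss) / np.sum(w)             # step 2(b): weighted error
        err = np.clip(err, 1e-10, 1 - 1e-10)           # numerical guard (my addition)
        alpha = np.log((1 - err) / err)                # step 2(c): importance
        w = w * np.exp(alpha * miss)                   # step 2(d): re-weight
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(X, learners, alphas):
    """Step 3: weighted majority vote."""
    agg = sum(a * g.predict(X) for a, g in zip(alphas, learners))
    return np.sign(agg)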

Weights of Individual Weak Learners

In the final rule, the weight of G_m is

\alpha_m = \log\left( \frac{1 - err_m}{err_m} \right),

where err_m is the weighted error of G_m.

The weights \alpha_m weigh the contribution of each G_m.

The higher (lower) err_m, the smaller (larger) \alpha_m.

The principle is to give higher influence (larger weights) to the more accurate classifiers in the sequence.

Data Re-weighting (Modification) Scheme

At each boosting step, we apply the updated weights w_1, ..., w_n to the samples (x_i, y_i), i = 1, ..., n.

Initially, all weights are set to w_i = 1/n, so the first step simply fits the usual classifier.

At step m = 2, ..., M, we modify the weights individually: increasing the weights of observations misclassified by G_{m-1}(x) and decreasing the weights of observations classified correctly by G_{m-1}(x).

Samples that are difficult to classify correctly receive ever-increasing influence. Each successive classifier is thereby forced to concentrate on the training observations missed by the previous ones in the sequence.

The classification algorithm is re-applied to the weighted observations.

[Figure 10.1, The Elements of Statistical Learning (Hastie, Tibshirani & Friedman, 2001), Chapter 10: Schematic of AdaBoost. Classifiers G_1(x), G_2(x), G_3(x), ..., G_M(x) are trained on successively re-weighted versions of the training sample, and then combined to produce the final prediction G(x) = sign( \sum_{m=1}^{M} \alpha_m G_m(x) ).]

Power of Boosting

Ten features X1, ...,X10 ∼ N(0, 1)

two classes: Y = 2 \cdot I\left( \sum_{j=1}^{10} X_j^2 > \chi^2_{10}(0.5) \right) - 1, where \chi^2_{10}(0.5) is the median of the \chi^2_{10} distribution

training sample size n = 2000; test sample size 10,000

weak classifier: stump (a two-terminal-node classification tree)

performance of boosting and comparison with other methods:

a stump has a 46% misclassification rate

a 400-node tree has a 26% misclassification rate

boosting has a 12.2% error rate
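A sketch reproducing this simulation set-up with off-the-shelf tools (scikit-learn's AdaBoostClassifier, whose default weak learner is a depth-1 stump); the exact error rates you obtain will vary:

import numpy as np
from scipy.stats import chi2
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def make_data(n, p=10):
    """Ten independent N(0,1) features; class = sign of squared radius vs. chi-square median."""
    X = rng.standard_normal((n, p))
    y = np.where((X**2).sum(axis=1) > chi2.ppf(0.5, df=p), 1, -1)
    return X, y

X_train, y_train = make_data(2000)
X_test, y_test = make_data(10000)

stump = DecisionTreeClassifier(max_depth=1).fit(X_train, y_train)
boost = AdaBoostClassifier(n_estimators=400).fit(X_train, y_train)   # boosted stumps

print("stump test error:  ", np.mean(stump.predict(X_test) != y_test))
print("boosted test error:", np.mean(boost.predict(X_test) != y_test))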

[Figure 10.2, The Elements of Statistical Learning (Hastie, Tibshirani & Friedman, 2001), Chapter 10: Test error rate for boosting with stumps on the simulated data, as a function of the number of boosting iterations (0 to 400). Also shown are the test error rates of a single stump and of a 400-node classification tree.]

Boosting And Additive Models

The success of boosting is not very mysterious. The key lies in

G(x) = \mathrm{sign}\left( \sum_{m=1}^{M} \alpha_m G_m(x) \right).

AdaBoost is equivalent to fitting an additive model using the exponential loss function (a very recent discovery by Friedman et al. (2000)).

AdaBoost was originally motivated from a very different perspective.

Introduction to Additive Models

An additive model typically assumes the functional form

f(x) = \sum_{m=1}^{M} \beta_m b(x; \gamma_m),

where the \beta_m are coefficients and the b(x; \gamma_m) are basis functions of x characterized by parameters \gamma_m.

The model f is obtained by minimizing a loss averaged over the training data:

\min_{\{\beta_m, \gamma_m\}_1^M} \sum_{i=1}^{n} L\left( y_i, \sum_{m=1}^{M} \beta_m b(x_i; \gamma_m) \right).    (1)

While (1) is difficult to solve jointly, it is feasible to rapidly solve the sub-problem of fitting just a single basis function.

Forward Stagewise Fitting for Additive Models

Forward stagewise modeling approximates the solution to (1) by

sequentially adding new basis functions to the expansion without adjusting the parameters and coefficients of those that have already been added.

At iteration m, one solves for the optimal basis function b(x; \gamma_m) and the corresponding coefficient \beta_m, which is added to the current expansion f_{m-1}(x).

Previously added terms are not modified.

This process is repeated.

Squared-Error Loss Example

Consider the squared-error loss

L(y, f(x)) = [y - f(x)]^2.

At the m-th step, given the current fit f_{m-1}(x), we solve

\min_{\beta, \gamma} \sum_i L(y_i, f_{m-1}(x_i) + \beta b(x_i; \gamma))
\iff \min_{\beta, \gamma} \sum_i [y_i - f_{m-1}(x_i) - \beta b(x_i; \gamma)]^2 = \min_{\beta, \gamma} \sum_i [r_{im} - \beta b(x_i; \gamma)]^2,

where r_{im} = y_i - f_{m-1}(x_i) is the current residual.

The term \beta_m b(x; \gamma_m) is therefore the best fit to the current residuals. This produces the updated fit

f_m(x) = f_{m-1}(x) + \beta_m b(x; \gamma_m).

Forward Stagewise Additive Modeling

1. Initialize f_0(x) = 0.

2. For m = 1 to M:

(a) Compute

(\beta_m, \gamma_m) = \arg\min_{\beta, \gamma} \sum_{i=1}^{n} L\left( y_i, f_{m-1}(x_i) + \beta b(x_i; \gamma) \right).

(b) Set f_m(x) = f_{m-1}(x) + \beta_m b(x; \gamma_m).
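A minimal sketch of this algorithm under squared-error loss, using depth-1 regression trees as the basis functions b(x; γ); since a regression tree's fitted leaf values are already least-squares optimal, β_m is absorbed into the tree fit (an illustration, not code from the lecture):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def forward_stagewise_fit(X, y, M=200):
    stages = []
    f = np.zeros(len(y))                      # f_0(x) = 0
    for _ in range(M):
        r = y - f                             # current residuals r_im
        stump = DecisionTreeRegressor(max_depth=1).fit(X, r)   # step (a)
        stages.append(stump)
        f += stump.predict(X)                 # step (b): f_m = f_{m-1} + b(x; gamma_m)
    return stages

def forward_stagewise_predict(X, stages):
    return sum(s.predict(X) for s in stages)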

Additive Logistic Models and AdaBoost

Friedman et al. (2000) showed that AdaBoost is equivalent to forward stagewise additive modeling

using the exponential loss function

L(y, f(x)) = \exp\{ -y f(x) \},

using individual classifiers G_m(x) ∈ {-1, 1} as the basis functions.

The (population) minimizer of the exponential loss function is

f^*(x) = \arg\min_{f} E_{Y|x}\left[ e^{-Y f(x)} \right] = \frac{1}{2} \log \frac{\Pr(Y = 1 | x)}{\Pr(Y = -1 | x)},

which is equal to one half of the log-odds. So AdaBoost can be regarded as an additive logistic regression model.
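A short derivation sketch of this minimizer (not spelled out on the slide): minimize the conditional expectation pointwise in f(x).

% Pointwise minimization of the conditional expected exponential loss.
\begin{aligned}
E_{Y|x}\bigl[e^{-Yf(x)}\bigr]
  &= \Pr(Y=1\mid x)\,e^{-f(x)} + \Pr(Y=-1\mid x)\,e^{f(x)},\\
\frac{\partial}{\partial f(x)} E_{Y|x}\bigl[e^{-Yf(x)}\bigr]
  &= -\Pr(Y=1\mid x)\,e^{-f(x)} + \Pr(Y=-1\mid x)\,e^{f(x)} = 0\\
\Longrightarrow\quad f^*(x) &= \frac{1}{2}\,\log\frac{\Pr(Y=1\mid x)}{\Pr(Y=-1\mid x)}.
\end{aligned}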

Forward Stagewise Additive Modeling with Exponential Loss

For the exponential loss, the minimization at the m-th step of forward stagewise modeling becomes

(\beta_m, \gamma_m) = \arg\min_{\beta, \gamma} \sum_{i=1}^{n} \exp\{ -y_i [f_{m-1}(x_i) + \beta b(x_i; \gamma)] \}.

In the context of a weak learner G, this is

(\beta_m, G_m) = \arg\min_{\beta, G} \sum_{i=1}^{n} \exp\{ -y_i [f_{m-1}(x_i) + \beta G(x_i)] \},

or equivalently,

(\beta_m, G_m) = \arg\min_{\beta, G} \sum_{i=1}^{n} w_i^{(m)} \exp\{ -y_i \beta G(x_i) \},

where w_i^{(m)} = \exp\{ -y_i f_{m-1}(x_i) \}.

Iterative Optimization

In order to solve

\min_{\beta, G} \sum_{i=1}^{n} w_i^{(m)} \exp\{ -y_i \beta G(x_i) \},

we take a two-step (profile) approach:

first, fix \beta > 0 and solve for G, obtaining \hat{G};

second, solve for \beta with G = \hat{G}.

Recall that both Y and G(x) take only the two values +1 and -1, so

y_i G(x_i) = +1 \iff y_i = G(x_i),

y_i G(x_i) = -1 \iff y_i \ne G(x_i).

Solving for Gm

(\beta_m, G_m) = \arg\min_{\beta, G} \sum_{i=1}^{n} w_i^{(m)} \exp\{ -y_i \beta G(x_i) \}

= \arg\min_{\beta, G} \left[ e^{\beta} \sum_{y_i \ne G(x_i)} w_i^{(m)} + e^{-\beta} \sum_{y_i = G(x_i)} w_i^{(m)} \right]

= \arg\min_{\beta, G} \left[ (e^{\beta} - e^{-\beta}) \sum_{i=1}^{n} w_i^{(m)} I(y_i \ne G(x_i)) + e^{-\beta} \sum_{i=1}^{n} w_i^{(m)} \right].

For any \beta > 0, the minimizer G_m is the \{-1, 1\}-valued function

G_m = \arg\min_{G} \sum_{i=1}^{n} w_i^{(m)} I(y_i \ne G(x_i)),

i.e., the classifier that minimizes the training error for the weighted data.

Solving for βm

Define

err_m = \frac{\sum_{i=1}^{n} w_i^{(m)} I(y_i \ne G_m(x_i))}{\sum_{i=1}^{n} w_i^{(m)}}.

Plugging G_m into the objective gives

\beta_m = \arg\min_{\beta} \left[ e^{-\beta} + (e^{\beta} - e^{-\beta})\, err_m \right] \sum_{i=1}^{n} w_i^{(m)}.

The solution is

\beta_m = \frac{1}{2} \log\left( \frac{1 - err_m}{err_m} \right).
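A derivation sketch for this solution (not on the slide): since the weight sum does not depend on β, differentiate the bracketed term and set it to zero.

% Derivation sketch: differentiate the profiled objective in beta.
\begin{aligned}
\frac{d}{d\beta}\Bigl[e^{-\beta} + (e^{\beta}-e^{-\beta})\,err_m\Bigr]
  &= -e^{-\beta} + (e^{\beta}+e^{-\beta})\,err_m = 0\\
\Longrightarrow\quad e^{2\beta} = \frac{1-err_m}{err_m}
\quad&\Longrightarrow\quad
\beta_m = \frac{1}{2}\log\frac{1-err_m}{err_m}.
\end{aligned}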

Connection to AdaBoost

Since

-y G_m(x) = 2 I(y \ne G_m(x)) - 1,

we have

w_i^{(m+1)} = w_i^{(m)} \exp\{ -\beta_m y_i G_m(x_i) \}
            = w_i^{(m)} \exp\{ \alpha_m I(y_i \ne G_m(x_i)) \} \exp\{ -\beta_m \},

where \alpha_m = 2\beta_m and the factor \exp\{-\beta_m\} is constant across the data points. Therefore:

the weight update is equivalent to line 2(d) of the AdaBoost algorithm;

line 2(a) of AdaBoost is equivalent to solving the weighted minimization problem that defines G_m.

Weights and Their Update

At the m-th step, the weight for the i-th observation is

w_i^{(m)} = \exp\{ -y_i f_{m-1}(x_i) \},

which depends only on f_{m-1}(x_i), not on \beta or G(x).

At the (m+1)-th step, we update the weight using the fact that

f_m(x) = f_{m-1}(x) + \beta_m G_m(x).

This leads to the update formula

w_i^{(m+1)} = \exp\{ -y_i f_m(x_i) \}
            = \exp\{ -y_i (f_{m-1}(x_i) + \beta_m G_m(x_i)) \}
            = w_i^{(m)} \exp\{ -\beta_m y_i G_m(x_i) \}.

Why Exponential Loss?

Its principal virtue is computational.

Exponential loss concentrates much more influence on observations with large negative margins y f(x). It is especially sensitive to misspecification of the class labels.

For more robust losses, the computation is not as easy (use gradient boosting).

Exponential Loss and Cross Entropy

Define Y' = (Y + 1)/2 ∈ {0, 1}. The binomial negative log-likelihood loss function is

-l(Y, p(x)) = -[ Y' \log p(x) + (1 - Y') \log(1 - p(x)) ]
            = \log\left( 1 + \exp\{ -2 Y f(x) \} \right),

where p(x) = [1 + e^{-2 f(x)}]^{-1}.

The population minimizers of the deviance E_{Y|x}[-l(Y, f(x))] and of E_{Y|x}[e^{-Y f(x)}] are the same.
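A quick case-by-case verification of the displayed identity (not on the slide), using p(x) = [1 + e^{-2f(x)}]^{-1}:

% Verification of -l(Y, p(x)) = log(1 + exp(-2Yf(x))), case by case.
\begin{aligned}
Y = +1\ (Y'=1):&\quad -\log p(x) = \log\bigl(1 + e^{-2f(x)}\bigr) = \log\bigl(1 + e^{-2Yf(x)}\bigr),\\
Y = -1\ (Y'=0):&\quad -\log(1 - p(x)) = \log\bigl(1 + e^{2f(x)}\bigr) = \log\bigl(1 + e^{-2Yf(x)}\bigr).
\end{aligned}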

Loss Functions for Classification

Choice of loss function matters for finite data sets.

The margin of f(x) is defined as y f(x).

The classification rule is G(x) = sign[f(x)].

The decision boundary is f(x) = 0.

Observations with positive margin y_i f(x_i) > 0 are correctly classified.

Observations with negative margin y_i f(x_i) < 0 are incorrectly classified.

The goal of a classification algorithm is to produce positive margins as frequently as possible.

Any loss function should penalize negative margins more heavily than positive margins, since observations with positive margins are already correctly classified.

Various Loss Functions

Monotone decreasing loss functions of the margin:

Misclassification loss: I(sign(f(x)) ≠ y)

Exponential loss: exp(-yf) (not robust against mislabeled samples)

Binomial deviance: log{1 + exp(-2yf)} (more robust against influential points)

SVM (hinge) loss: [1 - yf]_+

Other loss functions:

Squared error: (y - f)^2 = (1 - yf)^2 (also penalizes large positive margins heavily)
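A small sketch evaluating these margin-based losses on a grid, which can be used to reproduce a comparison plot like the figure below (illustrative code; the rescaling used in the figure is noted in a comment):

import numpy as np

m = np.linspace(-2, 2, 401)                  # margin values y*f

misclassification = (m < 0).astype(float)    # I(sign(f) != y), as a function of the margin
exponential       = np.exp(-m)               # exp(-yf)
binomial_deviance = np.log1p(np.exp(-2 * m)) # log(1 + exp(-2yf))
svm_hinge         = np.maximum(0, 1 - m)     # [1 - yf]_+
squared_error     = (1 - m)**2               # (y - f)^2 = (1 - yf)^2 when y in {-1, +1}

# e.g. binomial_deviance / np.log(2) rescales the deviance to pass through (0, 1),
# as done for the curves in the figure.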

[Figure 10.4, The Elements of Statistical Learning (Hastie, Tibshirani & Friedman, 2001), Chapter 10: Loss functions for two-class classification, plotted against the margin y·f. The response is y = ±1; the prediction is f, with class prediction sign(f). The losses are misclassification I(sign(f) ≠ y); exponential exp(-yf); binomial deviance log(1 + exp(-2yf)); squared error (y - f)^2; and support vector [1 - yf]_+. Each function has been scaled so that it passes through the point (0, 1).]

Brief Summary on AdaBoost

AdaBoost fits an additive model whose basis functions G_m(x) are fit stagewise to optimize the exponential loss.

The population minimizer of the exponential loss is one half of the log-odds.

There are loss functions more robust than squared-error loss (for regression problems) or exponential loss (for classification problems).
