Lecture 21: Bagging and Boosting

Hao Helen Zhang
MATH 574M, University of Arizona

Outline

Bagging Methods
- Bagging Trees
- Random Forest

AdaBoost
- Strong Learners vs. Weak Learners
- Motivations
- Algorithms

Additive Models
- AdaBoost and Additive Logistic Regression

Ensemble Methods (Model Averaging)

A machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms:

widely used in statistical classification and regression;

can reduce variance and help to avoid overfitting;

usually applied to decision tree methods, but can be used with any type of method.

Bootstrap Aggregating (Bagging)

Bagging is a special case of the model averaging approach.

Bootstrap aggregation = Bagging.

Bagging leads to "improvements for unstable procedures" (Breiman, 1996), e.g. neural nets, classification and regression trees, and subset selection in linear regression (Breiman, 1994).

On the other hand, it can mildly degrade the performance of stable methods such as K-nearest neighbors (Breiman, 1996).

Basic Idea

Given a standard training set D of size n, bagging

generates B new training sets D_i, each of size n', by sampling from D uniformly and with replacement.

The B models are fitted using the above B bootstrap samples and combined by averaging the output (for regression) or voting (for classification).

This kind of sample is known as a bootstrap sample.

By sampling with replacement, some observations may be repeated in each D_i.

If n' = n, then for large n, the set D_i is expected to contain the fraction (1 - 1/e) ≈ 63.2% of the unique examples of D, the rest being duplicates.
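A quick numerical check of this 63.2% figure (an illustrative sketch, not part of the original slides):

import numpy as np

# Sketch: a bootstrap sample of size n' = n should contain roughly
# 1 - 1/e (about 63.2%) of the distinct original observations.
rng = np.random.default_rng(0)
n = 100_000
boot_idx = rng.integers(0, n, size=n)        # sample n indices with replacement
frac_unique = np.unique(boot_idx).size / n
print(frac_unique, 1 - 1 / np.e)             # both are close to 0.632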

Bagging Procedures

Bagging uses the bootstrap to improve the estimate or prediction of a fit.

Given data Z = {(x_1, y_1), ..., (x_n, y_n)}, we generate B bootstrap samples Z^{*b}:

Empirical distribution \hat{P}: put equal probability 1/n on each (x_i, y_i) (discrete).

Generate Z^{*b} = {(x_1^*, y_1^*), ..., (x_n^*, y_n^*)} ~ \hat{P}, for b = 1, ..., B.

Obtain \hat{f}^{*b}(x), b = 1, ..., B.

The Monte Carlo estimate of the bagging estimate is

\hat{f}_{bag}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x).
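A minimal sketch of this procedure for regression, assuming scikit-learn regression trees as the base fit (the function names below are illustrative, not from the lecture):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagging_fit(X, y, B=200, seed=0):
    """Fit B regression trees, one per bootstrap sample Z^{*b}."""
    rng = np.random.default_rng(seed)
    n = len(y)
    fits = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)                 # bootstrap sample
        fits.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return fits

def bagging_predict(X, fits):
    """Monte Carlo bagging estimate: average of the B fitted predictions."""
    return np.mean([f.predict(X) for f in fits], axis=0)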

Properties of Bagging Estimates

Advantages:

Note \hat{f}_{bag}(x) \to E_{\hat{P}} \hat{f}^{*}(x) as B \to \infty.

\hat{f}_{bag}(x) typically has smaller variance than \hat{f}(x).

\hat{f}_{bag}(x) differs from \hat{f}(x) only when the latter is a nonlinear or adaptive function of the data.

Bagging Classification Trees

In multiclass (K ≥ 3) problems, there are two scenarios:

(1) \hat{f}^{b}(x) is an indicator vector, with one 1 and K-1 0's (hard classification);

(2) \hat{f}^{b}(x) = (\hat{p}_1, ..., \hat{p}_K), the estimates of the class probabilities (soft classification).

The bagged estimate is the average prediction at x from the B trees:

\hat{f}_{bag}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x).

Bagging Classification Trees (cont.)

There are two types of averaging:

(1) use the majority vote;

(2) use the averaged probabilities.

The bagged classifier is

\hat{G}_{bag}(x) = \arg\max_{k=1,...,K} \hat{f}_{bag,k}(x),

where \hat{f}_{bag,k}(x) is the k-th component of \hat{f}_{bag}(x).
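A sketch of the two aggregation schemes for bagged classification trees, again assuming scikit-learn trees and integer class labels 0, ..., K-1 (illustrative code, not from the lecture):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bag_trees(X, y, B=200, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)                 # bootstrap sample
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def predict_majority_vote(X, trees):
    """(1) majority vote over the B hard class predictions."""
    votes = np.stack([t.predict(X) for t in trees]).astype(int)   # B x n_test
    return np.array([np.bincount(col).argmax() for col in votes.T])

def predict_averaged_proba(X, trees):
    """(2) argmax of the averaged class probabilities.
    Assumes every bootstrap sample contains all K classes,
    so the column order of predict_proba matches across trees."""
    probs = np.mean([t.predict_proba(X) for t in trees], axis=0)
    return trees[0].classes_[np.argmax(probs, axis=1)]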

Example

Sample size n = 30, two classes

p = 5 features, each having a standard Gaussian distribution with pairwise correlation Corr(X_j, X_k) = 0.95.

The response Y was generated according to

Pr(Y = 1|x1 ≤ 0.5) = 0.2, Pr(Y = 1|x1 > 0.5) = 0.8.

The Bayes error is 0.2.

A test sample of size 2,000 was generated from the same population.

We fit

classification trees to the training sample

classification trees to each of 200 bootstrap samples
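A sketch of this data-generating mechanism; the shared-factor construction of the equicorrelated Gaussians is my own choice, not specified in the lecture:

import numpy as np

rng = np.random.default_rng(0)

def simulate(n, p=5, rho=0.95):
    """p standard Gaussian features with pairwise correlation rho,
    and P(Y=1) = 0.2 if x1 <= 0.5, 0.8 if x1 > 0.5."""
    z0 = rng.standard_normal((n, 1))                                    # shared factor
    X = np.sqrt(rho) * z0 + np.sqrt(1 - rho) * rng.standard_normal((n, p))
    p1 = np.where(X[:, 0] <= 0.5, 0.2, 0.8)
    y = (rng.random(n) < p1).astype(int)
    return X, y

X_train, y_train = simulate(30)      # training sample of size 30
X_test, y_test = simulate(2000)      # test sample of size 2,000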

[Figure 8.9, The Elements of Statistical Learning (Hastie, Tibshirani & Friedman, 2001), Chapter 8: Bagging trees on the simulated dataset. The top left panel shows the original tree; five trees grown on bootstrap samples are also shown. The bootstrap trees split on different variables (x.2, x.3, x.1, x.4) and at different cutpoints.]

About Bagging Trees

The original tree and five bootstrap trees are all different:

with different splitting features

with different splitting cutpoints

The trees have high variance due to the correlation in the predictors.

Under squared-error loss, averaging reduces variance and leaves bias unchanged.

Therefore, bagging will often decrease MSE.

[Figure 8.10, The Elements of Statistical Learning (Hastie, Tibshirani & Friedman, 2001), Chapter 8: Error curves for the bagging example of Figure 8.9. Shown is the test error of the original tree and of the bagged trees as a function of the number of bootstrap samples (0 to 200), together with the Bayes error. The green points correspond to the majority vote, while the purple points average the probabilities.]

About Bagging

Bagging can dramatically reduce the variance of unstable procedures like trees, leading to improved prediction.

Bagging smooths out this variance and hence reduces the test error; it can stabilize unstable procedures.

The simple structure in the model can be lost due to bagging:

a bagged tree is no longer a tree, and the bagged estimate is not easy to interpret.

Under 0-1 loss for classification, bagging may not help, due to the nonadditivity of bias and variance.

Random Forest

Random forest is an ensemble classifier that consists of many decision trees and outputs the class that is the mode of the classes output by the individual trees.

The algorithm was developed by Breiman (2001) and Cutler.

The method combines Breiman's "bagging" idea and the random selection of features, in order to construct a collection of decision trees with controlled variation.

The selection of a random subset of features is an example of the random subspace method, a way to implement stochastic discrimination.

Learning Algorithm for Building A Tree

Denote the training size by n and the number of variables by p. Assume m < p is the number of input variables to be used to determine the decision at a node of the tree.

Randomly choose n samples with replacement (i.e. take a bootstrap sample).

Use the rest of the samples to estimate the error of the tree, by predicting their classes.

For each node of the tree, randomly choose m variables on which to base the decision at that node. Calculate the best split based on these m variables in the training set.

Each tree is fully grown and not pruned.
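This recipe is what standard random-forest implementations follow; below is a short usage sketch with scikit-learn (the parameter names are that library's, not the lecture's), illustrating the per-node feature subsampling and the out-of-bag error estimate described above:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data standing in for a real training set.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_estimators=500,        # number of bootstrap trees
    max_features="sqrt",     # m variables considered at each node
    oob_score=True,          # error estimated from the left-out (out-of-bag) samples
    random_state=0,
)
rf.fit(X, y)
print("out-of-bag error estimate:", 1 - rf.oob_score_)
print("impurity-based importances:", rf.feature_importances_)
# Permutation-based importances are available separately via
# sklearn.inspection.permutation_importance.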

Advantages of Random Forest

highly accurate in many real examples; fast; handles a very large number of input variables

can estimate the importance of variables for classification through permutation

generates an internal unbiased estimate of the generalization error as the forest building progresses

can impute missing data and maintains accuracy when a large proportion of the data are missing

provides an experimental way to detect variable interactions

can balance error in unbalanced data sets

computes proximities between cases, useful for clustering, detecting outliers, and (by scaling) visualizing the data

can be extended to unlabeled data, leading to unsupervised clustering, outlier detection and data views

Disadvantages of Random Forest

Random forests are prone to overfitting for some data sets; this is even more pronounced in noisy classification/regression tasks.

Random forests do not handle large numbers of irrelevant features.

Motivation: Model Averaging

Boosting methods aim to improve the performance of weak learners.

Strong learners: given a large enough dataset, the classifier can learn the target function arbitrarily accurately with probability 1 - τ (where τ > 0 can be arbitrarily small).

Weak learners: given a large enough dataset, the classifier can barely learn the target function, with probability 1/2 + τ.

The error rate is only slightly better than random guessing.

Can we construct a strong learner from weak learners, and how?

Boosting

Motivation: combines the outputs of many weak classifiers toproduce a powerful “committee”.

Similar to bagging and other committee-based approaches

Originally designed for classification problems, but can also be extended to regression problems.

Consider the two-class problem with Y ∈ {-1, 1}. The classifier G(x) has training error

\overline{err} = \frac{1}{n} \sum_{i=1}^{n} I(y_i \ne G(x_i)).

The expected error rate on future predictions is E_{X,Y} I(Y \ne G(X)).

Classification Trees

Classification trees can be simple, but they often produce noisy or weak classifiers.

Bagging (Breiman 1996): Fit many large trees to bootstrap-resampled versions of the training data, and classify by a majority vote.

Boosting (Freund & Schapire 1996): Fit many large or small trees to re-weighted versions of the training data. Classify by a weighted majority vote.

In general, Boosting > Bagging > Single Tree.

Breiman's comment: "AdaBoost ... best off-the-shelf classifier in the world" (1996, NIPS workshop).

AdaBoost (Discrete Boost)

Adaptively resampling the data (Freund & Schapire 1997; winners of the 2003 Gödel Prize):

1. Sequentially apply the weak classification algorithm to repeatedly modified (re-weighted) versions of the data.

2. This produces a sequence of weak classifiers

G_m(x), m = 1, 2, ..., M.

3. The predictions from the G_m's are then combined through a weighted majority vote to produce the final prediction

G(x) = \mathrm{sign}\left( \sum_{m=1}^{M} \alpha_m G_m(x) \right).

Here α_1, ..., α_M ≥ 0 are computed by the boosting algorithm.

AdaBoost Algorithm

1. Initialize the observation weights w_i = 1/n, i = 1, ..., n.

2. For m = 1 to M:

(a) Fit a classifier G_m(x) to the training data using weights w_i.

(b) Compute the weighted error

err_m = \frac{\sum_{i=1}^{n} w_i I(y_i \ne G_m(x_i))}{\sum_{i=1}^{n} w_i}.

(c) Compute the importance of G_m as

\alpha_m = \log\left( \frac{1 - err_m}{err_m} \right).

(d) Update w_i \leftarrow w_i \cdot \exp[\alpha_m \cdot I(y_i \ne G_m(x_i))], i = 1, ..., n.

3. Output G(x) = \mathrm{sign}\left( \sum_{m=1}^{M} \alpha_m G_m(x) \right).
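A compact implementation sketch of the algorithm above, using scikit-learn decision stumps as the weak learner (the function names and the numerical clipping of err_m are my additions):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=100):
    """y coded as -1/+1. Returns the weak learners G_m and their weights alpha_m."""
    n = len(y)
    w = np.full(n, 1.0 / n)                            # step 1: uniform weights
    learners, alphas = [], []
    for _ in range(M):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)               # step 2(a): weighted fit
        miss = (stump.predict(X) != y).astype(float)
        err = np.sum(w * miss) / np.sum(w)             # step 2(b): weighted error
        err = np.clip(err, 1e-10, 1 - 1e-10)           # numerical guard (my addition)
        alpha = np.log((1 - err) / err)                # step 2(c): importance
        w = w * np.exp(alpha * miss)                   # step 2(d): re-weight
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(X, learners, alphas):
    """Step 3: weighted majority vote."""
    agg = sum(a * g.predict(X) for a, g in zip(alphas, learners))
    return np.sign(agg)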

Weights of Individual Weak Learners

In the final rule, the weight of G_m is

\alpha_m = \log\left( \frac{1 - err_m}{err_m} \right),

where err_m is the weighted error of G_m.

The weights \alpha_m weigh the contribution of each G_m.

The higher (lower) err_m, the smaller (larger) \alpha_m.

The principle is to give higher influence (larger weights) to the more accurate classifiers in the sequence.

Data Re-weighting (Modification) Scheme

At each boosting step, we apply the updated weights w_1, ..., w_n to the samples (x_i, y_i), i = 1, ..., n.

Initially, all weights are set to w_i = 1/n, so the first step simply fits the usual classifier.

At step m = 2, ..., M, we modify the weights individually: increasing the weights of observations misclassified by G_{m-1}(x) and decreasing the weights of observations classified correctly by G_{m-1}(x).

Samples that are difficult to classify correctly receive ever-increasing influence. Each successive classifier is thereby forced to concentrate on the training observations missed by the previous ones in the sequence.

The classification algorithm is re-applied to the weighted observations.

[Figure 10.1, The Elements of Statistical Learning (Hastie, Tibshirani & Friedman, 2001), Chapter 10: Schematic of AdaBoost. Classifiers G_1(x), G_2(x), G_3(x), ..., G_M(x) are trained on successively re-weighted versions of the training sample, and then combined to produce the final prediction G(x) = sign( \sum_{m=1}^{M} \alpha_m G_m(x) ).]

Power of Boosting

Ten features X1, ...,X10 ∼ N(0, 1)

two classes: Y = 2 \cdot I\left( \sum_{j=1}^{10} X_j^2 > \chi^2_{10}(0.5) \right) - 1, where \chi^2_{10}(0.5) is the median of the \chi^2_{10} distribution

training sample size n = 2000; test sample size 10,000

weak classifier: stump (a two-terminal-node classification tree)

performance of boosting and comparison with other methods:

a stump has a 46% misclassification rate

a 400-node tree has a 26% misclassification rate

boosting has a 12.2% error rate
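A sketch reproducing this simulation set-up with off-the-shelf tools (scikit-learn's AdaBoostClassifier, whose default weak learner is a depth-1 stump); the exact error rates you obtain will vary:

import numpy as np
from scipy.stats import chi2
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def make_data(n, p=10):
    """Ten independent N(0,1) features; class = sign of squared radius vs. chi-square median."""
    X = rng.standard_normal((n, p))
    y = np.where((X**2).sum(axis=1) > chi2.ppf(0.5, df=p), 1, -1)
    return X, y

X_train, y_train = make_data(2000)
X_test, y_test = make_data(10000)

stump = DecisionTreeClassifier(max_depth=1).fit(X_train, y_train)
boost = AdaBoostClassifier(n_estimators=400).fit(X_train, y_train)   # boosted stumps

print("stump test error:  ", np.mean(stump.predict(X_test) != y_test))
print("boosted test error:", np.mean(boost.predict(X_test) != y_test))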

[Figure 10.2, The Elements of Statistical Learning (Hastie, Tibshirani & Friedman, 2001), Chapter 10: Test error rate for boosting with stumps on the simulated data, as a function of the number of boosting iterations (0 to 400). Also shown are the test error rates of a single stump and of a 400-node classification tree.]

Boosting And Additive Models

The success of boosting is not very mysterious. The key lies in

G(x) = \mathrm{sign}\left( \sum_{m=1}^{M} \alpha_m G_m(x) \right).

AdaBoost is equivalent to fitting an additive model using the exponential loss function (a very recent discovery by Friedman et al. (2000)).

AdaBoost was originally motivated from a very different perspective.

Introduction to Additive Models

An additive model typically assumes the functional form

f(x) = \sum_{m=1}^{M} \beta_m b(x; \gamma_m),

where the \beta_m are coefficients and the b(x; \gamma_m) are basis functions of x characterized by parameters \gamma_m.

The model f is obtained by minimizing a loss averaged over the training data:

\min_{\{\beta_m, \gamma_m\}_1^M} \sum_{i=1}^{n} L\left( y_i, \sum_{m=1}^{M} \beta_m b(x_i; \gamma_m) \right).    (1)

While (1) is difficult to solve jointly, it is feasible to rapidly solve the sub-problem of fitting just a single basis function.

Forward Stagewise Fitting for Additive Models

Forward stagewise modeling approximates the solution to (1) by

sequentially adding new basis functions to the expansion without adjusting the parameters and coefficients of those that have already been added.

At iteration m, one solves for the optimal basis function b(x; \gamma_m) and the corresponding coefficient \beta_m, which is added to the current expansion f_{m-1}(x).

Previously added terms are not modified.

This process is repeated.

Squared-Error Loss Example

Consider the squared-error loss

L(y, f(x)) = [y - f(x)]^2.

At the m-th step, given the current fit f_{m-1}(x), we solve

\min_{\beta, \gamma} \sum_i L(y_i, f_{m-1}(x_i) + \beta b(x_i; \gamma))
\iff \min_{\beta, \gamma} \sum_i [y_i - f_{m-1}(x_i) - \beta b(x_i; \gamma)]^2 = \min_{\beta, \gamma} \sum_i [r_{im} - \beta b(x_i; \gamma)]^2,

where r_{im} = y_i - f_{m-1}(x_i) is the current residual.

The term \beta_m b(x; \gamma_m) is therefore the best fit to the current residuals. This produces the updated fit

f_m(x) = f_{m-1}(x) + \beta_m b(x; \gamma_m).

Forward Stagewise Additive Modeling

1. Initialize f_0(x) = 0.

2. For m = 1 to M:

(a) Compute

(\beta_m, \gamma_m) = \arg\min_{\beta, \gamma} \sum_{i=1}^{n} L\left( y_i, f_{m-1}(x_i) + \beta b(x_i; \gamma) \right).

(b) Set f_m(x) = f_{m-1}(x) + \beta_m b(x; \gamma_m).
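A minimal sketch of this algorithm under squared-error loss, using depth-1 regression trees as the basis functions b(x; γ); since a regression tree's fitted leaf values are already least-squares optimal, β_m is absorbed into the tree fit (an illustration, not code from the lecture):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def forward_stagewise_fit(X, y, M=200):
    stages = []
    f = np.zeros(len(y))                      # f_0(x) = 0
    for _ in range(M):
        r = y - f                             # current residuals r_im
        stump = DecisionTreeRegressor(max_depth=1).fit(X, r)   # step (a)
        stages.append(stump)
        f += stump.predict(X)                 # step (b): f_m = f_{m-1} + b(x; gamma_m)
    return stages

def forward_stagewise_predict(X, stages):
    return sum(s.predict(X) for s in stages)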

Additive Logistic Models and AdaBoost

Friedman et al. (2000) showed that AdaBoost is equivalent to forward stagewise additive modeling

using the exponential loss function

L(y, f(x)) = \exp\{ -y f(x) \},

using individual classifiers G_m(x) ∈ {-1, 1} as the basis functions.

The (population) minimizer of the exponential loss function is

f^*(x) = \arg\min_{f} E_{Y|x}\left[ e^{-Y f(x)} \right] = \frac{1}{2} \log \frac{\Pr(Y = 1 | x)}{\Pr(Y = -1 | x)},

which is equal to one half of the log-odds. So AdaBoost can be regarded as an additive logistic regression model.
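A short derivation sketch of this minimizer (not spelled out on the slide): minimize the conditional expectation pointwise in f(x).

% Pointwise minimization of the conditional expected exponential loss.
\begin{aligned}
E_{Y|x}\bigl[e^{-Yf(x)}\bigr]
  &= \Pr(Y=1\mid x)\,e^{-f(x)} + \Pr(Y=-1\mid x)\,e^{f(x)},\\
\frac{\partial}{\partial f(x)} E_{Y|x}\bigl[e^{-Yf(x)}\bigr]
  &= -\Pr(Y=1\mid x)\,e^{-f(x)} + \Pr(Y=-1\mid x)\,e^{f(x)} = 0\\
\Longrightarrow\quad f^*(x) &= \frac{1}{2}\,\log\frac{\Pr(Y=1\mid x)}{\Pr(Y=-1\mid x)}.
\end{aligned}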

Forward Stagewise Additive Modeling with Exponential Loss

For the exponential loss, the minimization at the m-th step of forward stagewise modeling becomes

(\beta_m, \gamma_m) = \arg\min_{\beta, \gamma} \sum_{i=1}^{n} \exp\{ -y_i [f_{m-1}(x_i) + \beta b(x_i; \gamma)] \}.

In the context of a weak learner G, this is

(\beta_m, G_m) = \arg\min_{\beta, G} \sum_{i=1}^{n} \exp\{ -y_i [f_{m-1}(x_i) + \beta G(x_i)] \},

or equivalently,

(\beta_m, G_m) = \arg\min_{\beta, G} \sum_{i=1}^{n} w_i^{(m)} \exp\{ -y_i \beta G(x_i) \},

where w_i^{(m)} = \exp\{ -y_i f_{m-1}(x_i) \}.

Iterative Optimization

In order to solve

\min_{\beta, G} \sum_{i=1}^{n} w_i^{(m)} \exp\{ -y_i \beta G(x_i) \},

we take a two-step (profile) approach:

first, fix \beta > 0 and solve for G, obtaining \hat{G};

second, solve for \beta with G = \hat{G}.

Recall that both Y and G(x) take only the two values +1 and -1, so

y_i G(x_i) = +1 \iff y_i = G(x_i),

y_i G(x_i) = -1 \iff y_i \ne G(x_i).

Solving for Gm

(\beta_m, G_m) = \arg\min_{\beta, G} \sum_{i=1}^{n} w_i^{(m)} \exp\{ -y_i \beta G(x_i) \}

= \arg\min_{\beta, G} \left[ e^{\beta} \sum_{y_i \ne G(x_i)} w_i^{(m)} + e^{-\beta} \sum_{y_i = G(x_i)} w_i^{(m)} \right]

= \arg\min_{\beta, G} \left[ (e^{\beta} - e^{-\beta}) \sum_{i=1}^{n} w_i^{(m)} I(y_i \ne G(x_i)) + e^{-\beta} \sum_{i=1}^{n} w_i^{(m)} \right].

For any \beta > 0, the minimizer G_m is the \{-1, 1\}-valued function

G_m = \arg\min_{G} \sum_{i=1}^{n} w_i^{(m)} I(y_i \ne G(x_i)),

i.e., the classifier that minimizes the training error for the weighted data.

Solving for βm

Define

err_m = \frac{\sum_{i=1}^{n} w_i^{(m)} I(y_i \ne G_m(x_i))}{\sum_{i=1}^{n} w_i^{(m)}}.

Plugging G_m into the objective gives

\beta_m = \arg\min_{\beta} \left[ e^{-\beta} + (e^{\beta} - e^{-\beta})\, err_m \right] \sum_{i=1}^{n} w_i^{(m)}.

The solution is

\beta_m = \frac{1}{2} \log\left( \frac{1 - err_m}{err_m} \right).
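A derivation sketch for this solution (not on the slide): since the weight sum does not depend on β, differentiate the bracketed term and set it to zero.

% Derivation sketch: differentiate the profiled objective in beta.
\begin{aligned}
\frac{d}{d\beta}\Bigl[e^{-\beta} + (e^{\beta}-e^{-\beta})\,err_m\Bigr]
  &= -e^{-\beta} + (e^{\beta}+e^{-\beta})\,err_m = 0\\
\Longrightarrow\quad e^{2\beta} = \frac{1-err_m}{err_m}
\quad&\Longrightarrow\quad
\beta_m = \frac{1}{2}\log\frac{1-err_m}{err_m}.
\end{aligned}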

Connection to AdaBoost

Since

-y G_m(x) = 2 I(y \ne G_m(x)) - 1,

we have

w_i^{(m+1)} = w_i^{(m)} \exp\{ -\beta_m y_i G_m(x_i) \}
            = w_i^{(m)} \exp\{ \alpha_m I(y_i \ne G_m(x_i)) \} \exp\{ -\beta_m \},

where \alpha_m = 2\beta_m and the factor \exp\{-\beta_m\} is constant across the data points. Therefore:

the weight update is equivalent to line 2(d) of the AdaBoost algorithm;

line 2(a) of AdaBoost is equivalent to solving the weighted minimization problem that defines G_m.

Weights and Their Update

At the m-th step, the weight for the i-th observation is

w_i^{(m)} = \exp\{ -y_i f_{m-1}(x_i) \},

which depends only on f_{m-1}(x_i), not on \beta or G(x).

At the (m+1)-th step, we update the weight using the fact that

f_m(x) = f_{m-1}(x) + \beta_m G_m(x).

This leads to the update formula

w_i^{(m+1)} = \exp\{ -y_i f_m(x_i) \}
            = \exp\{ -y_i (f_{m-1}(x_i) + \beta_m G_m(x_i)) \}
            = w_i^{(m)} \exp\{ -\beta_m y_i G_m(x_i) \}.

Why Exponential Loss?

Its principal virtue is computational.

Exponential loss concentrates much more influence on observations with large negative margins y f(x). It is especially sensitive to misspecification of the class labels.

For more robust losses, the computation is not as easy (use gradient boosting).

Exponential Loss and Cross Entropy

Define Y' = (Y + 1)/2 ∈ {0, 1}. The binomial negative log-likelihood loss function is

-l(Y, p(x)) = -[ Y' \log p(x) + (1 - Y') \log(1 - p(x)) ]
            = \log\left( 1 + \exp\{ -2 Y f(x) \} \right),

where p(x) = [1 + e^{-2 f(x)}]^{-1}.

The population minimizers of the deviance E_{Y|x}[-l(Y, f(x))] and of E_{Y|x}[e^{-Y f(x)}] are the same.
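A quick case-by-case verification of the displayed identity (not on the slide), using p(x) = [1 + e^{-2f(x)}]^{-1}:

% Verification of -l(Y, p(x)) = log(1 + exp(-2Yf(x))), case by case.
\begin{aligned}
Y = +1\ (Y'=1):&\quad -\log p(x) = \log\bigl(1 + e^{-2f(x)}\bigr) = \log\bigl(1 + e^{-2Yf(x)}\bigr),\\
Y = -1\ (Y'=0):&\quad -\log(1 - p(x)) = \log\bigl(1 + e^{2f(x)}\bigr) = \log\bigl(1 + e^{-2Yf(x)}\bigr).
\end{aligned}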

Loss Functions for Classification

Choice of loss function matters for finite data sets.

The margin of f(x) is defined as y f(x).

The classification rule is G(x) = sign[f(x)].

The decision boundary is f(x) = 0.

Observations with positive margin y_i f(x_i) > 0 are correctly classified.

Observations with negative margin y_i f(x_i) < 0 are incorrectly classified.

The goal of a classification algorithm is to produce positive margins as frequently as possible.

Any loss function should penalize negative margins more heavily than positive margins, since observations with positive margins are already correctly classified.

Various Loss Functions

Monotone decreasing loss functions of the margin:

Misclassification loss: I(sign(f(x)) ≠ y)

Exponential loss: exp(-yf) (not robust against mislabeled samples)

Binomial deviance: log{1 + exp(-2yf)} (more robust against influential points)

SVM (hinge) loss: [1 - yf]_+

Other loss functions:

Squared error: (y - f)^2 = (1 - yf)^2 (also penalizes large positive margins heavily)
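A small sketch evaluating these margin-based losses on a grid, which can be used to reproduce a comparison plot like the figure below (illustrative code; the rescaling used in the figure is noted in a comment):

import numpy as np

m = np.linspace(-2, 2, 401)                  # margin values y*f

misclassification = (m < 0).astype(float)    # I(sign(f) != y), as a function of the margin
exponential       = np.exp(-m)               # exp(-yf)
binomial_deviance = np.log1p(np.exp(-2 * m)) # log(1 + exp(-2yf))
svm_hinge         = np.maximum(0, 1 - m)     # [1 - yf]_+
squared_error     = (1 - m)**2               # (y - f)^2 = (1 - yf)^2 when y in {-1, +1}

# e.g. binomial_deviance / np.log(2) rescales the deviance to pass through (0, 1),
# as done for the curves in the figure.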

[Figure 10.4, The Elements of Statistical Learning (Hastie, Tibshirani & Friedman, 2001), Chapter 10: Loss functions for two-class classification, plotted against the margin y·f. The response is y = ±1; the prediction is f, with class prediction sign(f). The losses are misclassification I(sign(f) ≠ y); exponential exp(-yf); binomial deviance log(1 + exp(-2yf)); squared error (y - f)^2; and support vector [1 - yf]_+. Each function has been scaled so that it passes through the point (0, 1).]

Brief Summary on AdaBoost

AdaBoost fits an additive model whose basis functions G_m(x) are fit stagewise to optimize the exponential loss.

The population minimizer of the exponential loss is one half of the log-odds.

There are loss functions more robust than squared-error loss (for regression problems) or exponential loss (for classification problems).
