

January 2003 Trevor Hastie, Stanford University 1

Boosting

Trevor Hastie
Statistics Department
Stanford University

Collaborators: Brad Efron, Jerome Friedman, Saharon Rosset, Rob Tibshirani, Ji Zhu

http://www-stat.stanford.edu/∼hastie/TALKS/boost.pdf


January 2003 Trevor Hastie, Stanford University 2

Outline

• Model Averaging

• Bagging

• Boosting

• History of Boosting

• Stagewise Additive Modeling

• Boosting and Logistic Regression

• MART

• Boosting and Overfitting

• Summary of Boosting, and its place in the toolbox.

Methods for improving the performance of weak learners. Classification trees are adaptive and robust, but do not generalize well. The techniques discussed here enhance their performance considerably.

Reference: Chapter 10.


January 2003 Trevor Hastie, Stanford University 3

Classification Problem

[Figure: simulated two-class training data plotted in (X1, X2); the classes (labeled 0 and 1) overlap. Bayes Error Rate: 0.25.]

Data (X, Y) ∈ R^p × {0, 1}. X is the predictor, or feature; Y is the class label, or response.

(X, Y) have joint probability distribution D.

Goal: Based on N training pairs (X_i, Y_i) drawn from D, produce a classifier C(X) ∈ {0, 1}.
Goal: choose C to have low generalization error

R(C) = P_D(C(X) ≠ Y) = E_D[1(C(X) ≠ Y)]


January 2003 Trevor Hastie, Stanford University 4

Deterministic Concepts

[Figure: two-class data in (X1, X2) generated by a deterministic concept; the classes (0 and 1) do not overlap. Bayes Error Rate: 0.]

X ∈ R^p has distribution D.

C(X) is a deterministic function in a concept class.

Goal: Based on N training pairs (X_i, Y_i = C(X_i)) drawn from D, produce a classifier Ĉ(X) ∈ {0, 1}.
Goal: choose Ĉ to have low generalization error

R(Ĉ) = P_D(Ĉ(X) ≠ C(X)) = E_D[1(Ĉ(X) ≠ C(X))]


January 2003 Trevor Hastie, Stanford University 5

Classification Tree

Sample of size 200

[Figure: the fitted classification tree. Splits are on x.1 and x.2 (root split at x.2 < -1.067, misclassifying 94/200 at the root); each node shows its misclassification count and predicted class.]


January 2003 Trevor Hastie, Stanford University 6

Decision Boundary: Tree

[Figure: decision boundary of the single tree in (X1, X2), overlaid on the two classes.]

Error Rate: 0.073

When the nested spheres are in R^10, CART produces a rather noisy and inaccurate rule G(X), with error rates around 30%.


January 2003 Trevor Hastie, Stanford University 7

Model Averaging

Classification trees can be simple, but often produce noisy (bushy) or weak (stunted) classifiers.

• Bagging (Breiman, 1996): Fit many large trees to bootstrap-resampled versions of the training data, and classify by majority vote.

• Boosting (Freund & Schapire, 1996): Fit many large or small trees to reweighted versions of the training data. Classify by weighted majority vote.

In general Boosting ≻ Bagging ≻ Single Tree.

“AdaBoost · · · best off-the-shelf classifier in the world” — Leo Breiman, NIPS workshop, 1996.


January 2003 Trevor Hastie, Stanford University 8

Bagging

Bagging or bootstrap aggregation averages a given procedure over many bootstrap samples, to reduce its variance — a poor man’s Bayes. See p. 246.

Suppose G(X, x) is a classifier, such as a tree, producing a predicted class label at input point x (here X denotes the training data).

To bag G, we draw bootstrap samples X*1, . . . , X*B, each of size N, with replacement from the training data. Then

G_bag(x) = Majority Vote { G(X*b, x) : b = 1, . . . , B }.

Bagging can dramatically reduce the variance of unstable procedures (like trees), leading to improved prediction. However, any simple structure in G (e.g. a tree) is lost.
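A minimal sketch of this recipe in Python, assuming scikit-learn and NumPy are available; the function names and the choice of B = 100 bootstrap samples are illustrative, not from the talk.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bag_trees(X, y, B=100, random_state=0):
    """Fit one large tree per bootstrap sample and return the ensemble."""
    rng = np.random.default_rng(random_state)
    N = X.shape[0]
    trees = []
    for _ in range(B):
        idx = rng.integers(0, N, size=N)             # N draws with replacement
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def bag_predict(trees, X_new):
    """G_bag(x): majority vote over the bagged trees (labels assumed 0/1)."""
    votes = np.mean([t.predict(X_new) for t in trees], axis=0)
    return (votes > 0.5).astype(int)
```

In practice one would reach for sklearn.ensemble.BaggingClassifier (or RandomForestClassifier), which implement the same idea.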


January 2003 Trevor Hastie, Stanford University 9

[Figure: the Original Tree and five trees grown on bootstrap samples (Bootstrap Trees 1–5). The bootstrap trees split on different variables and split points (x.1, x.2, x.3, x.4), illustrating the instability of a single tree.]


January 2003 Trevor Hastie, Stanford University 10

Decision Boundary: Bagging

[Figure: decision boundary of the bagged trees in (X1, X2), overlaid on the two classes; the boundary is smoother than that of a single tree.]

Error Rate: 0.032

Bagging averages many trees, and produces smoother decision boundaries.


January 2003 Trevor Hastie, Stanford University 11

Boosting

[Figure: schematic of boosting. Classifiers G_1(x), G_2(x), G_3(x), . . . , G_M(x) are fit in sequence, each to a reweighted version of the training sample, and combined into the final classifier

G(x) = sign[ Σ_{m=1}^M α_m G_m(x) ]. ]


January 2003 Trevor Hastie, Stanford University 12

Bagging and Boosting

[Figure: test error versus number of terms (0–400) for Bagging and AdaBoost with 100-node trees.]

2000 points from Nested Spheres in R^10; Bayes error rate is 0%.

Trees are grown Best First without pruning.

Leftmost iteration is a single tree.


January 2003 Trevor Hastie, Stanford University 13

AdaBoost (Freund & Schapire, 1996)

1. Initialize the observation weights w_i = 1/N, i = 1, 2, . . . , N.

2. For m = 1 to M repeat steps (a)–(d):

(a) Fit a classifier G_m(x) to the training data using weights w_i.

(b) Compute the weighted error

err_m = Σ_{i=1}^N w_i I(y_i ≠ G_m(x_i)) / Σ_{i=1}^N w_i.

(c) Compute α_m = log((1 − err_m)/err_m).

(d) Update the weights for i = 1, . . . , N:
w_i ← w_i · exp[α_m · I(y_i ≠ G_m(x_i))],
and renormalize the w_i to sum to 1.

3. Output G(x) = sign[ Σ_{m=1}^M α_m G_m(x) ].
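The algorithm above transcribes almost line for line into Python. The sketch below uses stumps as the base classifier G_m, assumes labels y ∈ {−1, +1}, and uses scikit-learn for the weighted tree fits; it does not guard against the degenerate case err_m = 0.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, M=100):
    N = X.shape[0]
    w = np.full(N, 1.0 / N)                     # step 1: w_i = 1/N
    trees, alphas = [], []
    for _ in range(M):                          # step 2
        G = DecisionTreeClassifier(max_depth=1)
        G.fit(X, y, sample_weight=w)            # (a) fit G_m using weights w_i
        miss = (G.predict(X) != y)
        err = np.sum(w * miss) / np.sum(w)      # (b) weighted error (assumed in (0, 1))
        alpha = np.log((1.0 - err) / err)       # (c) alpha_m
        w = w * np.exp(alpha * miss)            # (d) upweight the misclassified points
        w = w / w.sum()                         #     and renormalize
        trees.append(G)
        alphas.append(alpha)
    return trees, alphas

def adaboost_predict(trees, alphas, X):
    f = sum(a * t.predict(X) for a, t in zip(alphas, trees))
    return np.sign(f)                           # step 3: sign of the weighted vote
```

scikit-learn's AdaBoostClassifier provides a production implementation along these lines.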


January 2003 Trevor Hastie, Stanford University 14

History of Boosting

[Flowchart, reconstructed here as a rough timeline:]

• L.G. Valiant (1984): the PAC Learning Model.
• R.E. Schapire (1990), Y. Freund (1995): the concept of boosting a weak learner appears.
• Schapire & Freund (1995, 1997): AdaBoost is born.
• L. Breiman (1996): Bagging.
• Freund & Schapire (1996), Breiman (1996, 1997), R. Quinlan (1996): experiments with AdaBoost.
• Schapire, Freund, P. Bartlett, Lee (1997), Schapire & Y. Singer (1998): attempts to explain why AdaBoost works.
• Friedman, Hastie, Tibshirani (1998), Schapire, Singer, Freund, Iyer (1998): improvements.


January 2003 Trevor Hastie, Stanford University 15

PAC Learning Model

X ∼ D: Instance Space
C : X → {0, 1}: Concept, C ∈ C
h : X → {0, 1}: Hypothesis, h ∈ H
error(h) = P_D[C(X) ≠ h(X)]

[Figure: the instance space X, with the regions where the concept C and the hypothesis h predict 1.]

Definition: Consider a concept class C defined over a set X of length N. L is a learner (algorithm) using hypothesis space H. C is PAC learn-able by L using H if for all C ∈ C, all distributions D over X and all ε, δ ∈ (0, 1/2), learner L will, with Pr ≥ (1 − δ), output an h ∈ H s.t. error_D(h) ≤ ε in time polynomial in 1/ε, 1/δ, N and size(C).

Such an L is called a strong Learner.


January 2003 Trevor Hastie, Stanford University 16

Boosting a Weak Learner

Weak learner L produces an h with error rate β = (1/2 − ε) < 1/2, with Pr ≥ (1 − δ), for any D.

L has access to a continuous stream of training data and a class oracle.

1. L learns h1 on the first N training points.

2. L randomly filters the next batch of training points, extracting N/2 points correctly classified by h1 and N/2 incorrectly classified, and produces h2.

3. L builds a third training set of N points for which h1 and h2 disagree, and produces h3.

4. L outputs h = Majority Vote(h1, h2, h3).

THEOREM (Schapire, 1990): “The Strength of Weak Learnability”

error_D(h) ≤ 3β² − 2β³ < β
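A quick numerical check of the theorem (illustration only): for any weak-learner error β < 1/2, the guaranteed error 3β² − 2β³ of the majority vote of h1, h2, h3 is strictly smaller than β.

```python
import numpy as np

beta = np.linspace(0.05, 0.45, 9)          # weak-learner error rates below 1/2
bound = 3 * beta**2 - 2 * beta**3          # error bound for Majority Vote(h1, h2, h3)
assert np.all(bound < beta)
for b, bd in zip(beta, bound):
    print(f"beta = {b:.2f}  ->  bound = {bd:.3f}")   # e.g. beta = 0.45 -> 0.425
```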


January 2003 Trevor Hastie, Stanford University 17

Boosting Stumps

[Figure: test error of boosted stumps versus boosting iterations (0–400), with reference lines for a Single Stump and a single 400-node tree.]

A stump is a two-node tree, after a single split. Boosting stumps works remarkably well on the nested-spheres problem.


January 2003 Trevor Hastie, Stanford University 18

Boosting & Training Error

Nested spheres in R^10 — Bayes error is 0%. Stumps are used as the base learner.

[Figure: training and test error versus number of terms (0–600) for boosted stumps.]

Boosting drives the training error to zero. Further iterations continue to improve test error in many examples.


January 2003 Trevor Hastie, Stanford University 19

Boosting Noisy Problems

Nested Gaussians in R^10 — Bayes error is 25%. Stumps are used as the base learner.

[Figure: training and test error versus number of terms (0–600) for boosted stumps, with the Bayes error marked.]

Here the test error does increase, but quite slowly.


January 2003 Trevor Hastie, Stanford University 20

Stagewise Additive Modeling

Boosting builds an additive model

f(x) = Σ_{m=1}^M β_m b(x; γ_m).

Here b(x; γ_m) is a tree, and γ_m parametrizes the splits.

We do things like that in statistics all the time!

• GAMs: f(x) = Σ_j f_j(x_j)

• Basis expansions: f(x) = Σ_{m=1}^M θ_m h_m(x)

Traditionally the parameters f_j, θ_m are fit jointly (i.e. by least squares or maximum likelihood).

With boosting, the parameters (β_m, γ_m) are fit in a stagewise fashion. This slows the process down, and tends to overfit less quickly.


January 2003 Trevor Hastie, Stanford University 21

Stagewise Least Squares

Suppose we have available a basis family b(x; γ) parametrized by γ. For example, a simple family is b(x; γ_j) = x_j.

• After m − 1 steps, suppose we have the model f_{m−1}(x) = Σ_{j=1}^{m−1} β_j b(x; γ_j).

• At the mth step we solve

min_{β,γ} Σ_{i=1}^N ( y_i − f_{m−1}(x_i) − β b(x_i; γ) )².

• Denoting the residuals at the mth stage by r_im = y_i − f_{m−1}(x_i), the previous step amounts to

min_{β,γ} Σ_{i=1}^N ( r_im − β b(x_i; γ) )².

• Thus the term β_m b(x; γ_m) that best fits the current residuals is added to the expansion at each step.
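A minimal Python sketch of this procedure for the simple basis b(x; γ_j) = x_j mentioned above: at each step the single coordinate (and coefficient) that best fits the current residuals is added to the fit. The function name and the fixed number of steps M are illustrative assumptions.

```python
import numpy as np

def stagewise_ls(X, y, M=50):
    N, p = X.shape
    f = np.zeros(N)                            # f_0(x) = 0
    terms = []
    for _ in range(M):
        r = y - f                              # residuals r_im at the mth stage
        best = None
        for j in range(p):                     # gamma ranges over the p coordinates
            beta = (X[:, j] @ r) / (X[:, j] @ X[:, j])
            sse = np.sum((r - beta * X[:, j]) ** 2)
            if best is None or sse < best[0]:
                best = (sse, j, beta)
        _, j, beta = best
        f = f + beta * X[:, j]                 # add the best-fitting term beta_m * b(x; gamma_m)
        terms.append((j, beta))
    return terms
```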


January 2003 Trevor Hastie, Stanford University 22

Adaboost: Stagewise Modeling

• AdaBoost builds an additive logistic regression model

f(x) = log [ Pr(Y = 1|x) / Pr(Y = −1|x) ] = Σ_{m=1}^M α_m G_m(x)

by stagewise fitting using the loss function

L(y, f(x)) = exp(−y f(x)).

• Given the current fit f_{m−1}(x), our solution for (β_m, G_m) is

arg min_{β,G} Σ_{i=1}^N exp[ −y_i ( f_{m−1}(x_i) + β G(x_i) ) ],

where G_m(x) ∈ {−1, 1} is a tree classifier and β_m is a coefficient.


January 2003 Trevor Hastie, Stanford University 23

• With w_i^{(m)} = exp(−y_i f_{m−1}(x_i)), this can be re-expressed as

arg min_{β,G} Σ_{i=1}^N w_i^{(m)} exp(−β y_i G(x_i)).

• We can show that this leads to the AdaBoost algorithm; see p. 305.
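A sketch of the step being referenced: split the weighted sum over correctly and incorrectly classified points,

```latex
\sum_{i=1}^N w_i^{(m)} e^{-\beta y_i G(x_i)}
  = e^{-\beta}\sum_{y_i = G(x_i)} w_i^{(m)} + e^{\beta}\sum_{y_i \ne G(x_i)} w_i^{(m)}
  = \bigl(e^{\beta}-e^{-\beta}\bigr)\sum_{i=1}^N w_i^{(m)} I\bigl(y_i \ne G(x_i)\bigr)
    + e^{-\beta}\sum_{i=1}^N w_i^{(m)} .
```

For any β > 0 the minimizing G_m is therefore the classifier with smallest weighted error err_m; minimizing over β then gives β_m = (1/2) log((1 − err_m)/err_m), so the AdaBoost weight α_m = log((1 − err_m)/err_m) is just 2β_m, and the exponential weight update of the earlier algorithm follows.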


January 2003 Trevor Hastie, Stanford University 24

Why Exponential Loss?

[Figure: loss as a function of the margin y · f for Misclassification, Exponential, Binomial Deviance, Squared Error, and Support Vector (hinge) losses.]

• e^{−y f(x)} is a monotone, smooth upper bound on misclassification loss at x.

• Leads to a simple reweighting scheme.

• Has the logit transform as population minimizer:

f*(x) = (1/2) log [ Pr(Y = 1|x) / Pr(Y = −1|x) ]

• There are other, more robust loss functions, like binomial deviance.
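A small numerical illustration of the losses in the figure, written as functions of the margin y·f. The scaling of the binomial deviance (so that each curve passes through the point (0, 1)) is an assumption about the figure, not a statement from the talk.

```python
import numpy as np

yf = np.linspace(-2, 2, 9)                                  # margin values y*f
losses = {
    "misclassification": (yf < 0).astype(float),
    "exponential":       np.exp(-yf),
    "binomial deviance": np.log1p(np.exp(-2 * yf)) / np.log(2),  # assumed scaling through (0, 1)
    "squared error":     (1 - yf) ** 2,
    "support vector":    np.maximum(0.0, 1 - yf),            # hinge loss
}
for name, vals in losses.items():
    print(f"{name:18s}", np.round(vals, 2))
```

Every curve except misclassification is convex, and the exponential and deviance losses both upper-bound the 0–1 loss, which is the point of the first bullet above.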


January 2003 Trevor Hastie, Stanford University 25

General Stagewise Algorithm

We can do the same for more general loss functions, not only least squares.

1. Initialize f_0(x) = 0.

2. For m = 1 to M:

(a) Compute (β_m, γ_m) = arg min_{β,γ} Σ_{i=1}^N L( y_i, f_{m−1}(x_i) + β b(x_i; γ) ).

(b) Set f_m(x) = f_{m−1}(x) + β_m b(x; γ_m).

Sometimes we replace step (b) in item 2 by

(b*) Set f_m(x) = f_{m−1}(x) + ν β_m b(x; γ_m).

Here ν is a shrinkage factor, and often ν < 0.1. See p. 326. Shrinkage slows the stagewise model-building even more, and typically leads to better performance.


January 2003 Trevor Hastie, Stanford University 26

MART

• General boosting algorithm that works with a variety of different loss functions. Models include regression, outlier-resistant regression, K-class classification and risk modeling.

• MART uses gradient boosting to build additive tree models, for example, for representing the logits in logistic regression.

• Tree size is a parameter that determines the order of interaction (next slide).

• MART inherits all the good features of trees (variable selection, missing data, mixed predictors), and improves on the weak features, such as prediction performance.

• MART is described in detail in Section 10.10.


January 2003 Trevor Hastie, Stanford University 27

MART in detail

Model

f_M(x) = Σ_{m=1}^M T(x; Θ_m),

where T(x; Θ) is a tree:

T(x; Θ) = Σ_{j=1}^J γ_j I(x ∈ R_j),

with parameters Θ = {R_j, γ_j}_1^J.

Fitting Criterion

Given Θ_j, j = 1, . . . , m − 1, we obtain Θ_m via stagewise optimization:

arg min_{Θ_m} Σ_{i=1}^N L( y_i, f_{m−1}(x_i) + T(x_i; Θ_m) ).

For general loss functions this is a very difficult optimization problem. Gradient boosting is an approximate gradient descent method.


January 2003 Trevor Hastie, Stanford University 28

Gradient Boosting

1. Initialize f_0(x) = arg min_γ Σ_{i=1}^N L(y_i, γ).

2. For m = 1 to M:

(a) For i = 1, 2, . . . , N compute the negative gradient

r_im = −[ ∂L(y_i, f(x_i)) / ∂f(x_i) ]_{f = f_{m−1}}.

(b) Fit a regression tree to the targets r_im, giving terminal regions R_jm, j = 1, 2, . . . , J_m.

(c) For j = 1, 2, . . . , J_m compute

γ_jm = arg min_γ Σ_{x_i ∈ R_jm} L( y_i, f_{m−1}(x_i) + γ ).

(d) Update f_m(x) = f_{m−1}(x) + Σ_{j=1}^{J_m} γ_jm I(x ∈ R_jm).

3. Output f̂(x) = f_M(x).
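A line-by-line Python sketch of the algorithm above for squared-error loss, where the negative gradient r_im is simply the residual y_i − f_{m−1}(x_i), and the terminal-node constants γ_jm coincide with the leaf means the regression tree already computes; the shrinkage factor ν from the general stagewise slide is included. Parameter values are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, M=200, nu=0.1, max_depth=3):
    f0 = y.mean()                               # step 1: argmin_gamma sum_i L(y_i, gamma)
    f = np.full(len(y), f0)
    trees = []
    for _ in range(M):                          # step 2
        r = y - f                               # (a) negative gradient of squared-error loss
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)  # (b)+(c): leaf means = gamma_jm
        f = f + nu * tree.predict(X)            # (d) update, with shrinkage nu
        trees.append(tree)
    return f0, trees

def gb_predict(f0, trees, X, nu=0.1):
    return f0 + nu * sum(t.predict(X) for t in trees)
```

For other losses (binomial deviance, Huber, and so on) step (c) requires its own optimization in each terminal region, typically a one-step Newton or line-search update, which is what MART-style implementations do internally.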


January 2003 Trevor Hastie, Stanford University 29

Tree Size

The tree size J determines the interaction order of the model:

η(X) = Σ_j η_j(X_j) + Σ_{jk} η_{jk}(X_j, X_k) + Σ_{jkl} η_{jkl}(X_j, X_k, X_l) + · · ·

[Figure: test error versus number of terms (0–400) for boosting with Stumps, 10-Node trees, 100-Node trees, and AdaBoost.]


January 2003 Trevor Hastie, Stanford University 30

Stumps win!

Since the true decision boundary is the surface of a sphere, the function that describes it has the form

f(X) = X_1² + X_2² + . . . + X_p² − c = 0.

Boosted stumps via MART returns reasonable approximations to these quadratic functions.

[Figure: Coordinate Functions for Additive Logistic Trees: the ten fitted coordinate functions f_1(x_1), . . . , f_10(x_10).]


January 2003 Trevor Hastie, Stanford University 31

Example: Predicting e-mail spam

• data from 4601 email messages

• goal: predict whether an email message is spam (junk email) or good.

• input features: relative frequencies in a message of 57 of the most commonly occurring words and punctuation marks in all the training email messages.

• for this problem not all errors are equal; we want to avoid filtering out good email, while letting spam get through is not desirable but less serious in its consequences.

• we coded spam as 1 and email as 0.

• A system like this would be trained for each user separately (e.g. their word lists would be different).


January 2003 Trevor Hastie, Stanford University 32

Predictors

• 48 quantitative predictors — the percentage of words in the email that match a given word. Examples include business, address, internet, free, and george. The idea was that these could be customized for individual users.

• 6 quantitative predictors — the percentage of characters in the email that match a given character. The characters are ch;, ch(, ch[, ch!, ch$, and ch#.

• The average length of uninterrupted sequences of capital letters: CAPAVE.

• The length of the longest uninterrupted sequence of capital letters: CAPMAX.

• The sum of the length of uninterrupted sequences of capital letters: CAPTOT.


January 2003 Trevor Hastie, Stanford University 33

Some important features

39% of the training data were spam.

Average percentage of words or characters in an email message equal to the indicated word or character. We have chosen the words and characters showing the largest difference between spam and email.

        george   you   your   hp    free   hpl
spam     0.00   2.26   1.38   0.02  0.52   0.01
email    1.27   1.27   0.44   0.90  0.07   0.43

          !      our    re    edu   remove
spam     0.51   0.51   0.13   0.01  0.28
email    0.11   0.18   0.42   0.29  0.01


January 2003 Trevor Hastie, Stanford University 34

Spam Example Results

With 3000 training and 1500 test observations, MART fits an additive logistic model

f(x) = log [ Pr(spam|x) / Pr(email|x) ]

using trees with J = 6 terminal nodes.

MART achieves a test error of 4%, compared to 5.3% for an additive GAM, 5.5% for MARS, and 8.7% for CART.
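A hedged sketch of reproducing this fit with scikit-learn's gradient boosting, a MART-style implementation. The file name spambase.data (a local copy of the UCI spam data) and the particular hyperparameter values are assumptions, not from the talk, so the exact error rate will differ somewhat from the 4% quoted above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

data = np.loadtxt("spambase.data", delimiter=",")        # assumed local copy of the UCI data
X, y = data[:, :-1], data[:, -1]                          # last column: 1 = spam, 0 = email
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=3000, random_state=0)

# max_leaf_nodes=6 plays the role of J = 6 terminal nodes per tree
mart = GradientBoostingClassifier(max_leaf_nodes=6, n_estimators=500, learning_rate=0.1)
mart.fit(X_tr, y_tr)
print("test error:", 1 - mart.score(X_te, y_te))
```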


January 2003 Trevor Hastie, Stanford University 35

Spam: Variable Importance

[Figure: relative variable importance (0–100) for the spam model. The most important predictors include !, $, hp, remove, free, CAPAVE, your, CAPMAX, george, CAPTOT, edu, you, and our.]


January 2003 Trevor Hastie, Stanford University 36

Spam: Partial Dependence

[Figure: partial dependence of the fitted log-odds on the predictors !, remove, edu, and hp.]

f_S(X_S) = (1/N) Σ_{i=1}^N f(X_S, x_iC)
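A minimal sketch of the average defined above: for each value of the chosen feature X_S, fix that column at the value in every training row and average the model's predictions over the remaining coordinates x_iC. The function and argument names are illustrative.

```python
import numpy as np

def partial_dependence(model, X_train, feature, grid):
    """f_S(v) = (1/N) * sum_i f(v, x_iC) for each v in grid."""
    pd_vals = []
    for v in grid:
        Xv = X_train.copy()
        Xv[:, feature] = v                     # set X_S = v for every training row
        pd_vals.append(model.predict(Xv).mean())
    return np.array(pd_vals)
```

scikit-learn exposes the same computation as sklearn.inspection.partial_dependence.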


January 2003 Trevor Hastie, Stanford University 37

Spam: Partial Dependence

[Figure: two-way partial dependence of the fitted log-odds on hp and ! jointly.]

f_S(X_S) = (1/N) Σ_{i=1}^N f(X_S, x_iC)


January 2003 Trevor Hastie, Stanford University 38

Margins

[Figure: cumulative distribution of the training margins after 20, 50, and 200 boosting iterations.]

margin(X) = M(X) = 2 P_C(X) − 1,

where P_C(X) is the weighted proportion of votes for the correct class.

Freund & Schapire (1997): Boosting generalizes because it pushes the training margins well above zero, while keeping the VC dimension under control (also Vapnik, 1996). With Pr ≥ (1 − δ),

P_Test( M(X) ≤ 0 ) ≤ P_Train( M(X) ≤ θ ) + O( (1/√N) [ (log N log |H|)/θ² + log(1/δ) ]^{1/2} )


January 2003 Trevor Hastie, Stanford University 39

[Figure: training and test error versus number of terms (0–600) for Real AdaBoost and Discrete AdaBoost, with the Bayes error marked.]

How does Boosting avoid overfitting?

• As iterations proceed, the impact of change is localized.

• Parameters are not jointly optimized — stagewise estimation slows down the learning process.

• Classifiers are hurt less by overfitting (Cover and Hart, 1967).

• Margin theory of Schapire and Freund, Vapnik? Disputed by Breiman (1997).

• Jury is still out!


January 2003 Trevor Hastie, Stanford University 40

Boosting and L1 Penalized Fitting

[Figure: coefficient profiles for the eight prostate-cancer predictors (lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45); left panel versus the L1 bound (0–2.5), right panel versus forward-stagewise iteration (0–250). The two sets of paths are essentially the same.]

• ε forward stagewise (idealized boosting with shrinkage). Given a family of basis functions h_1(x), . . . , h_M(x), and a loss function L:

• Model at the kth step is F_k(x) = Σ_m β_m^k h_m(x).

• At step k + 1, identify the coordinate m with largest |∂L/∂β_m|, and update β_m^{k+1} ← β_m^k + ε.

• Equivalent to the lasso: min L(β) + λ_k ||β||_1.
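A sketch of ε forward stagewise for squared-error loss on a linear model, matching the description above; the update is taken as a step of size ε in the direction that decreases the loss, and the names and step count are illustrative.

```python
import numpy as np

def forward_stagewise(X, y, eps=0.01, n_steps=5000):
    n, p = X.shape
    beta = np.zeros(p)
    path = [beta.copy()]
    for _ in range(n_steps):
        grad = -X.T @ (y - X @ beta)           # dL/dbeta for L = 0.5 * ||y - X beta||^2
        m = np.argmax(np.abs(grad))            # coordinate with largest |dL/dbeta_m|
        beta[m] -= eps * np.sign(grad[m])      # nudge beta_m by eps against the gradient
        path.append(beta.copy())
    return np.array(path)                      # coefficient profiles over iterations
```

Tracing the rows of the returned path reproduces coefficient profiles of the kind shown in the figure, which closely track the lasso solution paths.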


January 2003 Trevor Hastie, Stanford University 41

Summary and Closing Comments

• The introduction of Boosting by Schapire, Freund, and colleagues has brought us an exciting and important set of new ideas.

• Boosting fits additive logistic models, where each component (base learner) is simple. The complexity needed for the base learner depends on the target function.

• There is little connection between weighted boosting and bagging; boosting is primarily a bias reduction procedure, while the goal of bagging is variance reduction.

• Boosting can of course overfit; stopping early is a good way to regularize boosting.

• Our view of boosting is stagewise and slow optimization of a loss function. For example, gradient boosting approximates the gradient updates by small trees fit to the empirical gradients.


January 2003 Trevor Hastie, Stanford University 42

• Modern versions of boosting use different loss functions, and incorporate shrinkage. They can handle a wide variety of regression modeling scenarios.

• Later on we will compare boosting with support vector machines.


January 2003 Trevor Hastie, Stanford University 43

Comparison of Learning Methods

Some characteristics of different learning methods.

Key: in the original slide each cell is a colored dot meaning good, fair, or poor; the colors do not survive this text transcription, so the ratings are omitted here.

Methods compared (columns): Neural Nets, SVM (kernels), CART, GAM, KNN/kernels, MART.

Characteristics compared (rows):
• Natural handling of data of “mixed” type
• Handling of missing values
• Robustness to outliers in input space
• Insensitivity to monotone transformations of inputs
• Computational scalability (large N)
• Ability to deal with irrelevant inputs
• Ability to extract linear combinations of features
• Interpretability
• Predictive power