
Page 1: Ensemble Learning Reading: R. Schapire, A brief introduction to boosting

Ensemble Learning

Reading:

R. Schapire, A brief introduction to boosting

Page 2: Ensemble Learning Reading: R. Schapire, A brief introduction to boosting

Ensemble learning

[Diagram: each training set Si is used to learn a hypothesis hi; the hypotheses h1, ..., hN are then combined into a single ensemble hypothesis H.]

Page 3: Ensemble Learning Reading: R. Schapire, A brief introduction to boosting

Advantages of ensemble learning

• Can be very effective at reducing generalization error!

(E.g., by voting.)

• Ideal case: the hi have independent errors

Page 4: Ensemble Learning Reading: R. Schapire, A brief introduction to boosting

Example

Given three hypotheses, h1, h2, h3, with hi(x) ∈ {−1, 1}.

Suppose each hi has 60% generalization accuracy, and assume errors are independent.

Now suppose H(x) is the majority vote of h1, h2, and h3. What is the probability that H is correct?

Page 5: Ensemble Learning Reading: R. Schapire, A brief introduction to boosting

h1   h2   h3   H    probability
C    C    C    C    .216
C    C    I    C    .144
C    I    I    I    .096
C    I    C    C    .144
I    C    C    C    .144
I    I    C    I    .096
I    C    I    I    .096
I    I    I    I    .064

Total probability correct: .648
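Equivalently, H is correct exactly when at least two of the three hypotheses are correct, so with A = 0.6:

    P(H correct) = A^3 + 3 A^2 (1 − A) = 0.216 + 3 × 0.144 = 0.648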

Page 6: Ensemble Learning Reading: R. Schapire, A brief introduction to boosting

Another Example

Again, given three hypotheses, h1, h2, h3.

Suppose each hi has 40% generalization accuracy, and assume errors are independent.

Now suppose we classify x as the majority vote of h1, h2, and h3. What is the probability that the classification is correct?

Page 7: Ensemble Learning Reading: R. Schapire, A brief introduction to boosting

h1   h2   h3   H    probability
C    C    C    C    .064
C    C    I    C    .096
C    I    I    I    .144
C    I    C    C    .096
I    C    C    C    .096
I    I    C    I    .144
I    C    I    I    .144
I    I    I    I    .216

Total probability correct: .352

Page 8: Ensemble Learning Reading: R. Schapire, A brief introduction to boosting

General case

In general, if hypotheses h1, ..., hM all have generalization accuracy A, what is the probability that a majority vote will be correct?
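A minimal sketch of this calculation in Python, assuming the errors are independent and M is odd so that there are no ties (the function name majority_vote_accuracy is just illustrative):

    from math import comb

    def majority_vote_accuracy(M, A):
        """Probability that a majority vote of M hypotheses, each independently
        correct with probability A, is correct (M assumed odd, so no ties)."""
        # The vote is correct whenever more than half of the hypotheses are correct.
        return sum(comb(M, k) * A**k * (1 - A)**(M - k)
                   for k in range(M // 2 + 1, M + 1))

    print(majority_vote_accuracy(3, 0.6))   # ≈ 0.648  (the 60% accuracy example)
    print(majority_vote_accuracy(3, 0.4))   # ≈ 0.352  (the 40% accuracy example)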

Page 9: Ensemble Learning Reading: R. Schapire, A brief introduction to boosting

Possible problems with ensemble learning

• Errors are typically not independent

• Training time and classification time are increased by a factor of M.

• Hard to explain how ensemble hypothesis does classification.

• How do we get enough data to create M separate data sets, S1, ..., SM?

Page 10: Ensemble Learning Reading: R. Schapire, A brief introduction to boosting

• Three popular methods:

– Voting:

• Train a classifier on M different training sets Si to obtain M different classifiers hi.

• For a new instance x, define H(x) as the weighted vote

H(x) = sgn( Σ_{i=1}^{M} αi hi(x) )

where αi is a confidence measure for classifier hi.

Page 11: Ensemble Learning Reading: R. Schapire, A brief introduction to boosting

– Bagging (Breiman, 1990s):

• To create Si, create “bootstrap replicates” of the original training set S (see the sketch after this list).

– Boosting (Schapire & Freund, 1990s):

• To create Si, reweight the examples in the original training set S as a function of whether or not they were misclassified on the previous round.
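A minimal sketch of the bagging step in Python, assuming the training set is held in a NumPy array; one bootstrap replicate is just len(S) rows drawn from S uniformly at random, with replacement:

    import numpy as np

    def bootstrap_replicate(S, rng):
        """Return one bootstrap replicate of the training set S:
        len(S) rows drawn uniformly at random from S, with replacement."""
        idx = rng.integers(0, len(S), size=len(S))   # indices may repeat
        return S[idx]

    # Example: a replicate of an 8-example training set reuses some rows and omits others.
    rng = np.random.default_rng(0)
    S = np.arange(8).reshape(8, 1)                   # stand-in for the real training set
    print(bootstrap_replicate(S, rng).ravel())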

Page 12: Ensemble Learning Reading: R. Schapire, A brief introduction to boosting

Adaptive Boosting (Adaboost)

A method for combining different weak hypotheses (training error close to but less than 50%) to produce a strong hypothesis (training error close to 0%)

Page 13: Ensemble Learning Reading: R. Schapire, A brief introduction to boosting

Sketch of algorithm

Given examples S and learning algorithm L, with | S | = N

• Initialize probability distribution over examples w1(i) = 1/N .

• Repeatedly run L on training sets St ⊆ S to produce h1, h2, ..., hK.

– At each step, derive St from S by choosing examples probabilistically according to probability distribution wt . Use St to learn ht.

• At each step, derive wt+1 by giving more probability to examples that were misclassified at step t.

• The final ensemble classifier H is a weighted sum of the ht’s, with each weight being a function of the corresponding ht’s error on its training set.

Page 14: Ensemble Learning Reading: R. Schapire, A brief introduction to boosting

Adaboost algorithm

• Given S = {(x1, y1), ..., (xN, yN)} where xi ∈ X, yi ∈ {+1, −1}

• Initialize w1(i) = 1/N. (Uniform distribution over data)

Page 15: Ensemble Learning Reading: R. Schapire, A brief introduction to boosting

• For t = 1, ..., K:

– Select new training set St from S with replacement, according to wt

– Train L on St to obtain hypothesis ht

– Compute the training error εt of ht on S:

εt = Σ_{i : ht(xi) ≠ yi} wt(i)

– Compute coefficient:

αt = (1/2) ln( (1 − εt) / εt )

Page 16: Ensemble Learning Reading: R. Schapire, A brief introduction to boosting

– Compute new weights on the data. For i = 1 to N:

wt+1(i) = wt(i) exp(−αt yi ht(xi)) / Zt

where Zt is a normalization factor chosen so that wt+1 will be a probability distribution:

Zt = Σ_{i=1}^{N} wt(i) exp(−αt yi ht(xi))

Page 17: Ensemble Learning Reading: R. Schapire, A brief introduction to boosting

• At the end of K iterations of this algorithm, we have

h1, h2, . . . , hK

We also have

α1, α2, ..., αK, where

αt = (1/2) ln( (1 − εt) / εt )

• Ensemble classifier:

H(x) = sgn( Σ_{t=1}^{K} αt ht(x) )

• Note that hypotheses with higher accuracy on their training sets are weighted more strongly.
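Putting pages 13–17 together: a minimal sketch of the boosting loop in Python. The homework on the later pages uses svm_learn as the learner L; here a depth-1 decision tree from scikit-learn stands in for it, and it is assumed that X is an (N, d) NumPy array and y an array of labels in {−1, +1} (names like adaboost and ensemble_predict are just illustrative):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost(X, y, K, seed=0):
        """Boosting loop from the slides.  X: (N, d) array of examples,
        y: array of labels in {-1, +1}, K: number of rounds."""
        rng = np.random.default_rng(seed)
        N = len(X)
        w = np.full(N, 1.0 / N)                      # w_1(i) = 1/N
        hypotheses, alphas = [], []
        for t in range(K):
            # Select S_t from S with replacement, according to w_t, and train on it.
            idx = rng.choice(N, size=N, replace=True, p=w)
            h = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])
            pred = h.predict(X)                      # run h_t on all of S
            eps = w[pred != y].sum()                 # epsilon_t: weight of misclassified examples
            alpha = 0.5 * np.log((1.0 - eps) / eps)  # assumes 0 < eps < 1
            # New weights: misclassified examples gain weight; divide by Z_t to normalize.
            w = w * np.exp(-alpha * y * pred)
            w /= w.sum()
            hypotheses.append(h)
            alphas.append(alpha)
        return hypotheses, alphas

    def ensemble_predict(hypotheses, alphas, X):
        """H(x) = sgn( sum_t alpha_t * h_t(x) )."""
        return np.sign(sum(a * h.predict(X) for h, a in zip(hypotheses, alphas)))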

Page 18: Ensemble Learning Reading: R. Schapire, A brief introduction to boosting

A Hypothetical Example

The training set is S = {x1, ..., x8}, where {x1, x2, x3, x4} are class +1 and {x5, x6, x7, x8} are class −1.

t = 1 :

w1 = {1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8}

S1 = {x1, x2, x2, x5, x5, x6, x7, x8} (notice some repeats)

Train classifier on S1 to get h1

Run h1 on S. Suppose classifications are: {1, −1, −1, −1, −1, −1, −1, −1}

• Calculate error:

Page 19: Ensemble Learning Reading: R. Schapire, A brief introduction to boosting

Calculate α's:

Calculate new w’s:
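One way to fill in these blanks for round 1, using the formulas above: h1 misclassifies x2, x3, and x4, so

    ε1 = w1(2) + w1(3) + w1(4) = 3/8 = 0.375

    α1 = (1/2) ln( (1 − 0.375) / 0.375 ) = (1/2) ln(5/3) ≈ 0.255

The new weights w2 are then obtained by multiplying the weights of the three misclassified examples by e^α1, multiplying the weights of the five correctly classified examples by e^−α1, and dividing everything by Z1 so the weights again sum to 1.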

Page 20: Ensemble Learning Reading: R. Schapire, A brief introduction to boosting

t = 2

w2 = {0.102, 0.163, 0.163, 0.163, 0.102, 0.102, 0.102, 0.102}

S2 = {x1, x2, x2, x3, x4, x4, x7, x8}

Train classifier on S2 to get h2

Run h2 on S. Suppose classifications are: {1, 1, 1, 1, 1, 1, 1, 1}

Calculate error:

Page 21: Ensemble Learning Reading: R. Schapire, A brief introduction to boosting

Calculate α's:

Calculate w’s:
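Filling in these blanks with the (rounded) weights above: h2 misclassifies x5, x6, x7, and x8, so

    ε2 = 4 × 0.102 ≈ 0.41

    α2 = (1/2) ln( (1 − 0.41) / 0.41 ) ≈ 0.19

and w3 follows from the same exponential reweighting and normalization as in round 1.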

Page 22: Ensemble Learning Reading: R. Schapire, A brief introduction to boosting

t = 3

w3 = {0.082, 0.139, 0.139, 0.139, 0.125, 0.125, 0.125, 0.125}

S3 = {x2, x3, x3, x3, x5, x6, x7, x8}

Train classifier on S3 to get h3

Run h3 on S. Suppose classifications are: {1, 1, −1, 1, −1, −1, 1, −1}

Calculate error:
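Using the weights w3: h3 misclassifies x3 and x7, so

    ε3 = w3(3) + w3(7) = 0.139 + 0.125 = 0.264

    α3 = (1/2) ln( (1 − 0.264) / 0.264 ) ≈ 0.51

(values rounded).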

Page 23: Ensemble Learning Reading: R. Schapire, A brief introduction to boosting

• Calculate α's:

• Ensemble classifier: H(x) = sgn( α1 h1(x) + α2 h2(x) + α3 h3(x) )

Page 24: Ensemble Learning Reading: R. Schapire, A brief introduction to boosting

What is the accuracy of H on the training data?

Recall the training set: {x1, x2, x3, x4} are class +1; {x5, x6, x7, x8} are class −1.

Example   Actual class   h1   h2   h3
x1         1              1    1    1
x2         1             −1    1    1
x3         1             −1    1   −1
x4         1              1    1    1
x5        −1             −1    1   −1
x6        −1             −1    1   −1
x7        −1              1    1    1
x8        −1             −1    1   −1

Page 25: Ensemble Learning Reading: R. Schapire, A brief introduction to boosting

HW 2: SVMs and Boosting

Caveat: I don’t know if boosting will help. But worth a try!

Page 26: Ensemble Learning Reading: R. Schapire, A brief introduction to boosting

Boosting implementation in HW 2

Use arrays and vectors.

Let S be the training set. You can set S to DogsVsCats.train or some subset of examples.

Let M = |S| be the number of training examples. Let K be the number of boosting iterations.

; Initialize:

Read in S as an array of M rows and 65 columns.

Create the weight vector w(1) = (1/M, 1/M, 1/M, ..., 1/M)

Page 27: Ensemble Learning Reading: R. Schapire, A brief introduction to boosting

; Run Boosting Iterations:

For T = 1 to K {

; Create training set T:

For i = 1 to M {

Select (with replacement) an example from S using the weights in w.

(Use the “roulette wheel” selection algorithm; see the sketch after this page.)

Print that example to S<T> (i.e., “S1”)

}

; Learn classifier T

svm_learn <kernel parameters> S<T> h<T>

; Calculate Error<T> :

svm_classify S h<T> <T>.predictions

(e.g., “svm_classify S h1 1.predictions”)

Use <T>.predictions to determine which examples in S were misclassified.

Error<T> = Sum of weights of examples that were misclassified.
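A minimal sketch of the “roulette wheel” selection step referenced above, written in Python; the function name roulette_select is just illustrative:

    import random

    def roulette_select(weights):
        """Return the index of one training example, chosen with probability
        proportional to its weight ("roulette wheel" selection)."""
        r = random.random() * sum(weights)           # spin the wheel
        cumulative = 0.0
        for i, w in enumerate(weights):
            cumulative += w
            if r <= cumulative:
                return i
        return len(weights) - 1                      # guard against floating-point round-off

    # Building S<T>: draw M examples from S with replacement according to w.
    # rows = [roulette_select(w) for _ in range(M)]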

Page 28: Ensemble Learning Reading: R. Schapire, A brief introduction to boosting

; Calculate Alpha<T>

; Calculate new weight vector, using Alpha<T> and <T>.predictions file

} ; T ← T + 1

At this point, you have

h1, h2, ..., hK [the K different classifiers]

Alpha1, Alpha2, ..., AlphaK

Page 29: Ensemble Learning Reading: R. Schapire, A brief introduction to boosting

; Run Ensemble Classifier on test data (DogsVsCats.test):

For T = 1 to K {

svm_classify DogsVsCats.test h<T>.model Test<T>.predictions

}

At this point you have: Test1.predictions, Test2.predictions,...,TestK.predictions

To classify test example x1:

The sign of row 1 of Test1.predictions gives h1(x1).
The sign of row 1 of Test2.predictions gives h2(x1).
etc.

H(x1) = sgn [Alpha1 * h1(x1) + Alpha2 * h2(x1) + ... + AlphaK * hK(x1)]

Compute H(x) for every x in the test set.
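A minimal sketch of this combination step in Python, assuming each Test<T>.predictions file holds one real number per line whose sign is h<T>'s prediction for the corresponding test example; the alpha values below are placeholders for the Alpha<T> computed during boosting:

    import numpy as np

    K = 3                                  # number of boosting rounds (illustrative)
    alphas = [0.5, 0.4, 0.6]               # placeholders: use the Alpha<T> from the boosting loop

    # h_T(x) for every test example, as the sign of svm_classify's output.
    preds = [np.sign(np.loadtxt(f"Test{T}.predictions")) for T in range(1, K + 1)]

    # H(x) = sgn( Alpha1*h1(x) + ... + AlphaK*hK(x) ), computed for all test examples at once.
    H = np.sign(sum(a * p for a, p in zip(alphas, preds)))
    print(H)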

Page 30: Ensemble Learning Reading: R. Schapire, A brief introduction to boosting

Sources of Classifier Error: Bias, Variance, and Noise

• Bias:

– The classifier cannot learn the correct hypothesis no matter what training data it is given, so an incorrect hypothesis h is learned. The bias is the average error of h over all possible training sets.

• Variance:

– The training data is not representative enough of all data, so the learned classifier h varies from one training set to another.

• Noise:

– The training data contains errors, so an incorrect hypothesis h is learned.

Give examples of these.

Page 31: Ensemble Learning Reading: R. Schapire, A brief introduction to boosting

Adaboost seems to reduce both bias and variance.

Adaboost does not seem to overfit as K increases.

Optional: Read about the “margin theory” explanation of the success of boosting.