Lecture 8: Ensemble Methods
Ensemble Methods
NLP ML Web
Fall 2013
Andrew Rosenberg
TA/Grader: David Guy Brizan
How do you make a decision?
• What do you want for lunch today?
• What did you have last night?
• What are your favorite foods?
• Have you been to this restaurant before?
• How expensive is it?
• Are you on a diet?
• Moral/ethical/religious concerns.
Decision Making
• Collaboration.
• Weighing disparate sources of evidence.
Ensemble Methods
• Ensemble Methods are based on the hypothesis that an aggregated decision from multiple experts can be superior to a decision from a single system.
Ensemble averaging
• Assume your prediction has noise in it:

$$y = f(x) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2)$$

$$\bar{y} = \frac{1}{k} \sum_{i=1}^{k} \left( f(x) + \epsilon^{(i)} \right)$$

• Over multiple i.i.d. observations, the epsilons cancel out, leaving a better estimate of the signal (see the sketch below).

http://terpconnect.umd.edu/~toh/spectrum/SignalsAndNoise.html
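A minimal NumPy sketch of this effect; the signal value, noise level, and number of observations below are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 50          # number of iid noisy observations (illustrative choice)
f_x = 3.0       # the true, noise-free signal f(x)
sigma = 1.0     # noise standard deviation

# Each observation is y_i = f(x) + eps_i, with eps_i ~ N(0, sigma^2)
y = f_x + rng.normal(0.0, sigma, size=k)

# Averaging k observations shrinks the noise variance to sigma^2 / k
print(f"one observation: {y[0]:.3f}")
print(f"average of {k}:  {y.mean():.3f}")   # much closer to 3.0
```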
Combining disparate evidence
• Early Fusion - combine features

[Diagram: Acoustic Features, Video Features, and Lexical Features are concatenated into a single feature vector and fed to one Classifier]
Combining disparate evidence
• Late Fusion - combine predictions

[Diagram: Acoustic, Video, and Lexical Features each feed their own Classifier; the three predictions are then merged]
Classifier Fusion
• Construct an answer from k predictions

[Diagram: a test instance is passed to classifiers C1, C2, C3, C4]
Classifier Fusion
• Construct an answer from k predictions

[Diagram: the same Features feed several Classifiers, whose predictions are merged]
Majority Voting
• Each classifier generates a prediction and a confidence score.
• Choose the prediction that receives the most "votes" from the ensemble (a minimal sketch follows).

[Diagram: Features → Classifiers → Sum]
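A sketch of plain (unweighted) majority voting over hard label predictions; the labels and array shapes are illustrative assumptions, not from the slides:

```python
import numpy as np

def majority_vote(predictions):
    """predictions: (k, n) array of class labels, one row per classifier.
    Returns the most frequent label for each of the n test instances."""
    predictions = np.asarray(predictions)
    n_classes = predictions.max() + 1
    # Count the votes per class for each instance, then take the argmax
    votes = np.apply_along_axis(
        lambda col: np.bincount(col, minlength=n_classes), 0, predictions)
    return votes.argmax(axis=0)

# Three classifiers, four test instances
preds = [[0, 1, 1, 2],
         [0, 1, 0, 2],
         [1, 1, 0, 2]]
print(majority_vote(preds))   # -> [0 1 0 2]
```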
Weighted Majority Voting
• Most classifiers can be interpreted as delivering a distribution over predictions.
• Rather than sum the number of votes, generate an average distribution from the sum.
• This is the same as taking a vote where each prediction contributes its confidence (sketched below).

[Diagram: Features → Classifiers → Weighted Sum]
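A sketch of the distribution-averaging view, assuming each classifier outputs a probability distribution over classes; the example numbers are invented:

```python
import numpy as np

def weighted_vote(prob_dists):
    """prob_dists: (k, n, c) array — k classifiers, n instances, c classes.
    Average the k distributions; each prediction contributes its confidence."""
    avg = np.mean(prob_dists, axis=0)    # (n, c) averaged distribution
    return avg.argmax(axis=1)            # predicted class per instance

# Two classifiers, one instance, three classes: the confident one dominates
p = np.array([[[0.9, 0.05, 0.05]],      # classifier 1: very sure of class 0
              [[0.2, 0.45, 0.35]]])     # classifier 2: mildly favors class 1
print(weighted_vote(p))                  # -> [0]
```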
Sum, Max, Min
• Majority Voting can be viewed as summing the scores from each ensemble member.
• Other aggregation functions can be used, including:
  • maximum
  • minimum
• What are the implications of these?

[Diagram: Features → Classifiers → Aggregator]
Second-tier classifier
• Classifier predictions are used as input features for a second classifier.
• How should the second-tier classifier be trained? (See the sketch below.)

[Diagram: Features → Classifiers → second-tier Classifier]
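One common answer, offered here as an illustration rather than the lecture's prescription, is to train the second tier on out-of-fold predictions, so it never sees outputs from classifiers fit on the same points. A sketch using scikit-learn; the dataset and base models are arbitrary choices:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
base_models = [DecisionTreeClassifier(max_depth=3, random_state=0), GaussianNB()]

# First-tier predictions, produced out-of-fold to avoid leakage
meta_features = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models])

# The second-tier classifier learns how to combine the base predictions
meta = LogisticRegression().fit(meta_features, y)

# At test time the base models are refit on all the data, then merged
for m in base_models:
    m.fit(X, y)
```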
Classifier Fusion
• Advantages
  • Experts can be trained separately on specialized data
  • Training can be quicker, due to smaller data sets and lower feature-space dimensionality.
• Disadvantages
  • Interactions across feature sets may be missed
  • Explanation of how and why it works can be limited.
Bagging
• Bootstrap Aggregating
• Train k models on different samples of the training data
• Predict by averaging the results of the k models
• A simple instance of majority voting (sketched below).
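A minimal bagging sketch, assuming scikit-learn decision trees as the base models and a synthetic dataset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
rng = np.random.default_rng(0)
k = 25

# Train k models, each on a bootstrap sample (drawn with replacement)
models = []
for _ in range(k):
    idx = rng.integers(0, len(X), size=len(X))
    models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Predict by majority vote over the k models (binary labels here)
votes = np.stack([m.predict(X) for m in models])        # (k, n)
y_hat = (votes.mean(axis=0) > 0.5).astype(int)
print("training accuracy:", (y_hat == y).mean())
```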
Model averaging
• Seen in Language Modeling
• Take, for example, a linear classifier:

$$y = f(W^T x + b)$$

• Average the model parameters:

$$W^* = \frac{1}{k} \sum_i W_i \qquad b^* = \frac{1}{k} \sum_i b_i$$

• With noisy per-model estimates:

$$W^* = \frac{1}{k} \sum_i W_i + \epsilon \qquad b^* = \frac{1}{k} \sum_i b_i + \epsilon$$
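A toy sketch of parameter averaging for the linear classifier above; the three parameter sets are made-up stand-ins for models trained on different samples:

```python
import numpy as np

# Suppose k linear classifiers y = f(W^T x + b) were trained on
# different samples; average their parameters into a single model.
Ws = [np.array([[0.9, -1.1]]), np.array([[1.1, -0.9]]), np.array([[1.0, -1.0]])]
bs = [0.1, -0.1, 0.0]

W_star = np.mean(Ws, axis=0)    # W* = (1/k) sum_i W_i
b_star = np.mean(bs)            # b* = (1/k) sum_i b_i

def predict(x):
    # f is a step nonlinearity here; any monotone f works the same way
    return (W_star @ x + b_star > 0).astype(int)

print(predict(np.array([2.0, 1.0])))   # -> [1]
```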
Mixture of Experts
• Can we do better than averaging over all training points?
• Look at the input data to see which points are better classified by which classifiers
• Allow each expert to focus on those cases where it's already doing better than average
Mixture of Experts
• The array of p's is called a "gating network".
• Optimize the $p_i$ as part of the loss function

[Diagram: Features → Classifiers → second-tier Classifier, with a gating weight p on each classifier's output]
Probability Correct under a mixture of experts

$$p(d^c \mid \mathrm{MoE}) = \sum_i p_i^c \, e^{-\frac{1}{2} \| d^c - o_i^c \|^2}$$

• $p(d^c \mid \mathrm{MoE})$: probability of the desired output on case $c$
• $p_i^c$: mixing coefficient for expert $i$ on case $c$
• $e^{-\frac{1}{2} \| d^c - o_i^c \|^2}$: Gaussian loss between the desired and observed output

From Hinton Lecture
Gating network Gradient

$$E = -\log p(d^c \mid \mathrm{MoE}) = -\log \sum_i p_i^c \, e^{-\frac{1}{2} \| d^c - o_i^c \|^2}$$

$$\frac{\partial E}{\partial o_i^c} = -\frac{p_i^c \, e^{-\frac{1}{2} \| d^c - o_i^c \|^2}}{\sum_j p_j^c \, e^{-\frac{1}{2} \| d^c - o_j^c \|^2}} \left( d^c - o_i^c \right)$$

• The fraction is the posterior probability of expert $i$.

From Hinton Lecture
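A NumPy sketch of the loss and gradient above, assuming a single training case, a vector-valued desired output, and fixed gating coefficients; the example values are invented:

```python
import numpy as np

def moe_loss_and_grad(d, o, p):
    """d: (dim,) desired output; o: (k, dim) expert outputs;
    p: (k,) gating (mixing) coefficients."""
    g = np.exp(-0.5 * np.sum((d - o) ** 2, axis=1))   # Gaussian terms
    lik = np.sum(p * g)                                # p(d | MoE)
    E = -np.log(lik)                                   # negative log-likelihood
    post = p * g / lik                                 # posterior over experts
    grad_o = -post[:, None] * (d - o)                  # dE/do_i, as above
    return E, grad_o

d = np.array([1.0, 0.0])
o = np.array([[0.9, 0.1], [0.0, 1.0]])                 # two experts' outputs
p = np.array([0.5, 0.5])
E, g = moe_loss_and_grad(d, o, p)
print(E, g)    # the expert closer to d gets most of the blame/credit
```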
AdaBoost
• Adaptive Boosting
• Construct an ensemble of "weak" classifiers.
• Typically single-split decision trees ("decision stumps")
• Identify weights for each classifier.
Weak Classifiers
• Weak classifiers:
  • low performance (slightly better than chance)
  • high variance
  • (for AdaBoost) should have uncorrelated errors.
Boosting Hypothesis
• The existence of a weak learner implies the existence of a strong learner.
AdaBoost Decision Function
• AdaBoost generates a prediction from a weighted sum of the predictions of each classifier.
• The AdaBoost algorithm determines the weights.
• Similar to systems that use a second-tier classifier to learn a combination function.

$$C(x) = \alpha_1 C_1(x) + \alpha_2 C_2(x) + \dots + \alpha_k C_k(x)$$

The weight training is different from any loss function we've used.
AdaBoost training algorithm
• Repeat:
  • Identify the best unused classifier $C_i$.
  • Assign it a weight based on its performance
  • Update the weight of each data point based on whether or not it is classified correctly
• Until performance converges or all classifiers are included.
Identify the best classifier
• Generate hypotheses using each unused classifier.
• Calculate the weighted error using the current data point weights.
• Data point weights are initialized to one.

$$W_e = \sum_{y_i \neq k_m(x_i)} w_i^{(m)} \qquad \text{(how many errors were made)}$$
Generate a weight for the classifier
• The larger the reduction in error, the larger the classifier weight

$$e_m = \frac{W_e}{W} \qquad \text{(weighted error rate at iteration } m\text{)}$$

$$\alpha_m = \frac{1}{2} \ln \left( \frac{1 - e_m}{e_m} \right) \qquad \text{(new weight)}$$
Data Point weighting
• If data point $i$ was not correctly classified:

$$w_i^{(m+1)} = w_i^{(m)} e^{\alpha_m} = w_i^{(m)} \sqrt{\frac{1 - e_m}{e_m}} \qquad (> 1)$$

• If data point $i$ was correctly classified:

$$w_i^{(m+1)} = w_i^{(m)} e^{-\alpha_m} = w_i^{(m)} \sqrt{\frac{e_m}{1 - e_m}} \qquad (< 1)$$
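Putting the three steps together, a minimal AdaBoost sketch with decision stumps as the weak classifiers; the labels are in {-1, +1} and the data and thresholds are invented for illustration:

```python
import numpy as np

def adaboost(X, y, weak_learners, rounds):
    """Labels y in {-1, +1}; weak_learners are callables h(X) -> {-1, +1}."""
    w = np.ones(len(y))                  # data point weights start at one
    alphas, chosen = [], []
    available = list(weak_learners)
    for _ in range(rounds):
        if not available:
            break
        # Identify the best unused classifier: smallest weighted error
        errs = [(w * (h(X) != y)).sum() / w.sum() for h in available]
        i = int(np.argmin(errs))
        e_m = float(np.clip(errs[i], 1e-10, 1 - 1e-10))  # avoid log(0)
        h = available.pop(i)
        # Weight the classifier by its performance
        alpha = 0.5 * np.log((1.0 - e_m) / e_m)
        # Upweight misclassified points (y*h = -1), downweight correct ones
        w = w * np.exp(-alpha * y * h(X))
        alphas.append(alpha)
        chosen.append(h)
    return lambda X: np.sign(sum(a * h(X) for a, h in zip(alphas, chosen)))

# Single-threshold decision stumps as the weak classifiers
X = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([-1, -1, -1, 1, 1, 1])
stumps = [lambda X, t=t: np.where(X > t, 1, -1) for t in (-1.5, 0.0, 1.5)]
C = adaboost(X, y, stumps, rounds=3)
print(C(X))                              # -> [-1. -1. -1.  1.  1.  1.]
```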
AdaBoost training algorithm
• Repeat:
  • Identify the best unused classifier $C_i$.
  • Assign it a weight based on its performance
  • Update the weight of each data point based on whether or not it is classified correctly
• Until performance converges or all classifiers are included.
Random Forests
• Random Forests are similar to boosted decision trees, but without the adaptive training.
• An ensemble of classifiers is trained, each on a different subset of features and a different set of data points
• Random subspace projection
Decision Tree

[Diagram: a decision tree over the world state — "Is it raining?" yes → P(wet) = 0.95; no → "Is the sprinkler on?" yes → P(wet) = 0.9; no → P(wet) = 0.1]
Construct a Forest of Trees

[Diagram: trees $t_1, \dots, t_T$ each produce a vote for category $c$]
Training Algorithm
• Divide the training data into K subsets of data points and M variables.
  • Improved generalization
  • Reduced memory requirements
• Train a unique decision tree on each of the K sets (sketched below)
• Simple multi-threading
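A sketch of this training loop, assuming scikit-learn trees, bootstrapped rows, and a random M-feature subset per tree; K and M are illustrative choices:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
rng = np.random.default_rng(0)
K, M = 10, 5                       # number of trees, features per tree

forest = []
for _ in range(K):
    rows = rng.integers(0, len(X), size=len(X))            # bootstrap the points
    cols = rng.choice(X.shape[1], size=M, replace=False)   # random feature subset
    tree = DecisionTreeClassifier().fit(X[rows][:, cols], y[rows])
    forest.append((tree, cols))

# Each tree votes; the forest predicts the majority category
votes = np.stack([t.predict(X[:, cols]) for t, cols in forest])
y_hat = (votes.mean(axis=0) > 0.5).astype(int)
print("training accuracy:", (y_hat == y).mean())
```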
Handling Class Imbalances
• Class imbalance, or a skewed class distribution, occurs when there are not equal numbers of each label.
• Class imbalance presents a number of challenges:
• Density Estimation
  • Low priors can lead to poor estimation of minority classes
• Loss Functions
  • Since the loss of each point is equal, getting a lot of majority-class points correct is important.
• Evaluation
  • Accuracy is less informative.
Impact on Accuracy
• Example from Information Retrieval
• Find 10 relevant documents from a set of 100.

|               | True: Positive | True: Negative |
|---------------|----------------|----------------|
| Hyp: Positive | 0              | 0              |
| Hyp: Negative | 10             | 90             |

Accuracy = 90%
Contingency Table

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$

|               | True: Positive  | True: Negative  |
|---------------|-----------------|-----------------|
| Hyp: Positive | True Positive   | False Positive  |
| Hyp: Negative | False Negative  | True Negative   |
F-Measure
• Precision: how many hypothesized events were true events

$$P = \frac{TP}{TP + FP}$$

• Recall: how many of the true events were identified

$$R = \frac{TP}{TP + FN}$$

• F-Measure: harmonic mean of precision and recall

$$F = \frac{2PR}{P + R}$$

|               | True: Positive | True: Negative |
|---------------|----------------|----------------|
| Hyp: Positive | 0              | 0              |
| Hyp: Negative | 10             | 90             |
F-Measure
• F-measure can be weighted to favor Precision or Recall
• β > 1 favors recall; β < 1 favors precision

$$F_\beta = \frac{(1 + \beta^2) P R}{\beta^2 P + R}$$
F-Measure

|               | True: Positive | True: Negative |
|---------------|----------------|----------------|
| Hyp: Positive | 0              | 0              |
| Hyp: Negative | 10             | 90             |

$$P = 0 \qquad R = 0 \qquad F_1 = 0$$
F-Measure

|               | True: Positive | True: Negative |
|---------------|----------------|----------------|
| Hyp: Positive | 10             | 50             |
| Hyp: Negative | 0              | 40             |

$$P = \frac{10}{60} \qquad R = 1 \qquad F_1 = 0.29$$
F-Measure

|               | True: Positive | True: Negative |
|---------------|----------------|----------------|
| Hyp: Positive | 9              | 1              |
| Hyp: Negative | 1              | 89             |

$$P = 0.9 \qquad R = 0.9 \qquad F_1 = 0.9$$
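A small helper that reproduces the three tables above from raw contingency counts, including the $F_\beta$ generalization from the earlier slide:

```python
def prf(tp, fp, fn, beta=1.0):
    """Precision, recall, and F_beta from contingency-table counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = (1 + beta**2) * p * r / (beta**2 * p + r) if p + r else 0.0
    return p, r, f

print(prf(tp=0,  fp=0,  fn=10))   # -> (0.0, 0.0, 0.0)
print(prf(tp=10, fp=50, fn=0))    # -> P = 10/60, R = 1, F1 ~ 0.29
print(prf(tp=9,  fp=1,  fn=1))    # -> (0.9, 0.9, 0.9)
```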
ROC and AUC
• It is common to plot classifier performance at a variety of settings or thresholds
• Receiver Operating Characteristic (ROC) curves plot the true positive rate against the false positive rate.
• Overall performance is summarized by the Area Under the Curve (AUC)
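A sketch using scikit-learn's ROC utilities on synthetic scores; the score model below is invented purely to have something to plot:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)             # synthetic binary labels
scores = y_true * 0.8 + rng.normal(0, 0.5, 200)   # noisy, somewhat informative scores

fpr, tpr, thresholds = roc_curve(y_true, scores)  # one (FPR, TPR) point per threshold
print(f"AUC = {roc_auc_score(y_true, scores):.3f}")  # 0.5 = chance, 1.0 = perfect
```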
Skew in Classifier Training
• Most classifiers train better with balanced training data.
• Bayesian methods:
  • Reliance on a prior to weight classes.
  • Estimation of the class-conditioned density is impacted by skew in the number of samples
• Loss functions:
  • There is more pressure to set the decision boundary for the majority classes
Skew in Classifier Training

[Figure, annotated: "Twice as many errors" and "Same distance from optimal decision boundary"]
Sampling
• Artificial manipulation of the number of training samples can help reduce the impact of class imbalance
• Undersampling
  • Randomly select $N_m$ data points from the majority class for training
Sampling
• Oversampling
  • Reproduce the minority-class points until the class sizes are balanced
Ensemble Sampling
• Repeat undersampling $N_M / N_m$ times with different samples of the majority-class data points.
• Train $N_M / N_m$ classifiers and combine them with majority voting (a sketch follows).

[Diagram: classifiers C1, C2, C3 → Merge]
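A sketch of ensemble undersampling, assuming binary labels with class 1 as the minority and scikit-learn trees as the ensemble members; the skewed dataset is synthetic:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def ensemble_undersample(X, y, minority=1, seed=0):
    """Train roughly N_M / N_m classifiers, each on all minority points
    plus a different (disjoint) sample of the majority class."""
    min_idx = np.flatnonzero(y == minority)
    maj_idx = np.flatnonzero(y != minority)
    rng = np.random.default_rng(seed)
    rng.shuffle(maj_idx)
    k = max(1, len(maj_idx) // len(min_idx))    # ~ N_M / N_m members
    models = []
    for chunk in np.array_split(maj_idx, k):
        idx = np.concatenate([min_idx, chunk])  # balanced training subset
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def merged_predict(models, X):
    votes = np.stack([m.predict(X) for m in models])
    return (votes.mean(axis=0) > 0.5).astype(int)   # majority vote

# Skewed synthetic data: ~90% negative, ~10% positive
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
models = ensemble_undersample(X, y)
print(len(models), "classifiers; acc:", (merged_predict(models, X) == y).mean())
```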
Ensemble Methods
• A very simple and effective technique to improve classification performance
• Netflix Prize, Watson, etc.
• Mathematical justification
• Intuitive appeal: mirrors how decisions are made by people and organizations
• Can allow for modular training