Lecture 8: Ensemble Methods
Ensemble Methods
NLP ML Web
Fall 2013
Andrew Rosenberg
TA/Grader: David Guy Brizan
How do you make a decision?
• What do you want for lunch today?
• What did you have last night?
• What are your favorite foods?
• Have you been to this restaurant before?
• How expensive is it?
• Are you on a diet?
• Moral/ethical/religious concerns.
Decision Making
• Collaboration.
• Weighing disparate sources of evidence.
Ensemble Methods
• Ensemble Methods are based on the hypothesis that an aggregated decision from multiple experts can be superior to a decision from a single system.
Ensemble averaging
• Assume your prediction has noise in it:

$$y = f(x) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2)$$

$$\bar{y} = \frac{1}{k} \sum_{i=1}^{k} \left( f(x) + \epsilon^{(i)} \right)$$

• Over multiple i.i.d. observations, the epsilons cancel out, leaving a better estimate of the signal (see the sketch below).

http://terpconnect.umd.edu/~toh/spectrum/SignalsAndNoise.html
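A minimal NumPy sketch of this effect; the signal value, noise level, and number of observations below are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 50          # number of iid noisy observations (illustrative choice)
f_x = 3.0       # the true, noise-free signal f(x)
sigma = 1.0     # noise standard deviation

# Each observation is y_i = f(x) + eps_i, with eps_i ~ N(0, sigma^2)
y = f_x + rng.normal(0.0, sigma, size=k)

# Averaging k observations shrinks the noise variance to sigma^2 / k
print(f"one observation: {y[0]:.3f}")
print(f"average of {k}:  {y.mean():.3f}")   # much closer to 3.0
```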
Combining disparate evidence
• Early Fusion - combine features

[Diagram: Acoustic Features, Video Features, and Lexical Features are concatenated into a single feature vector and fed to one Classifier]
Combining disparate evidence
• Late Fusion - combine predictions

[Diagram: Acoustic, Video, and Lexical Features each feed their own Classifier; the three predictions are then merged]
Classifier Fusion
• Construct an answer from k predictions

[Diagram: a test instance is passed to classifiers C1, C2, C3, C4]
Classifier Fusion
• Construct an answer from k predictions

[Diagram: the same Features feed several Classifiers, whose predictions are merged]
Majority Voting
• Each classifier generates a prediction and a confidence score.
• Choose the prediction that receives the most "votes" from the ensemble (a minimal sketch follows).

[Diagram: Features → Classifiers → Sum]
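A sketch of plain (unweighted) majority voting over hard label predictions; the labels and array shapes are illustrative assumptions, not from the slides:

```python
import numpy as np

def majority_vote(predictions):
    """predictions: (k, n) array of class labels, one row per classifier.
    Returns the most frequent label for each of the n test instances."""
    predictions = np.asarray(predictions)
    n_classes = predictions.max() + 1
    # Count the votes per class for each instance, then take the argmax
    votes = np.apply_along_axis(
        lambda col: np.bincount(col, minlength=n_classes), 0, predictions)
    return votes.argmax(axis=0)

# Three classifiers, four test instances
preds = [[0, 1, 1, 2],
         [0, 1, 0, 2],
         [1, 1, 0, 2]]
print(majority_vote(preds))   # -> [0 1 0 2]
```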
Weighted Majority Voting
• Most classifiers can be interpreted as delivering a distribution over predictions.
• Rather than sum the number of votes, generate an average distribution from the sum.
• This is the same as taking a vote where each prediction contributes its confidence (sketched below).

[Diagram: Features → Classifiers → Weighted Sum]
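A sketch of the distribution-averaging view, assuming each classifier outputs a probability distribution over classes; the example numbers are invented:

```python
import numpy as np

def weighted_vote(prob_dists):
    """prob_dists: (k, n, c) array — k classifiers, n instances, c classes.
    Average the k distributions; each prediction contributes its confidence."""
    avg = np.mean(prob_dists, axis=0)    # (n, c) averaged distribution
    return avg.argmax(axis=1)            # predicted class per instance

# Two classifiers, one instance, three classes: the confident one dominates
p = np.array([[[0.9, 0.05, 0.05]],      # classifier 1: very sure of class 0
              [[0.2, 0.45, 0.35]]])     # classifier 2: mildly favors class 1
print(weighted_vote(p))                  # -> [0]
```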
Sum, Max, Min
• Majority Voting can be viewed as summing the scores from each ensemble member.
• Other aggregation functions can be used, including:
  • maximum
  • minimum
• What are the implications of these?

[Diagram: Features → Classifiers → Aggregator]
Second-tier classifier
• Classifier predictions are used as input features for a second classifier.
• How should the second-tier classifier be trained? (See the sketch below.)

[Diagram: Features → Classifiers → second-tier Classifier]
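One common answer, offered here as an illustration rather than the lecture's prescription, is to train the second tier on out-of-fold predictions, so it never sees outputs from classifiers fit on the same points. A sketch using scikit-learn; the dataset and base models are arbitrary choices:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
base_models = [DecisionTreeClassifier(max_depth=3, random_state=0), GaussianNB()]

# First-tier predictions, produced out-of-fold to avoid leakage
meta_features = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models])

# The second-tier classifier learns how to combine the base predictions
meta = LogisticRegression().fit(meta_features, y)

# At test time the base models are refit on all the data, then merged
for m in base_models:
    m.fit(X, y)
```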
Classifier Fusion
• Advantages
  • Experts can be trained separately on specialized data
  • Training can be quicker, due to smaller data sets and lower feature-space dimensionality.
• Disadvantages
  • Interactions across feature sets may be missed
  • Explanation of how and why it works can be limited.
Bagging
• Bootstrap Aggregating
• Train k models on different samples of the training data
• Predict by averaging the results of the k models
• A simple instance of majority voting (sketched below).
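A minimal bagging sketch, assuming scikit-learn decision trees as the base models and a synthetic dataset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
rng = np.random.default_rng(0)
k = 25

# Train k models, each on a bootstrap sample (drawn with replacement)
models = []
for _ in range(k):
    idx = rng.integers(0, len(X), size=len(X))
    models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Predict by majority vote over the k models (binary labels here)
votes = np.stack([m.predict(X) for m in models])        # (k, n)
y_hat = (votes.mean(axis=0) > 0.5).astype(int)
print("training accuracy:", (y_hat == y).mean())
```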
Model averaging
• Seen in Language Modeling
• Take, for example, a linear classifier:

$$y = f(W^T x + b)$$

• Average the model parameters:

$$W^* = \frac{1}{k} \sum_i W_i \qquad b^* = \frac{1}{k} \sum_i b_i$$

• With noisy per-model estimates:

$$W^* = \frac{1}{k} \sum_i W_i + \epsilon \qquad b^* = \frac{1}{k} \sum_i b_i + \epsilon$$
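A toy sketch of parameter averaging for the linear classifier above; the three parameter sets are made-up stand-ins for models trained on different samples:

```python
import numpy as np

# Suppose k linear classifiers y = f(W^T x + b) were trained on
# different samples; average their parameters into a single model.
Ws = [np.array([[0.9, -1.1]]), np.array([[1.1, -0.9]]), np.array([[1.0, -1.0]])]
bs = [0.1, -0.1, 0.0]

W_star = np.mean(Ws, axis=0)    # W* = (1/k) sum_i W_i
b_star = np.mean(bs)            # b* = (1/k) sum_i b_i

def predict(x):
    # f is a step nonlinearity here; any monotone f works the same way
    return (W_star @ x + b_star > 0).astype(int)

print(predict(np.array([2.0, 1.0])))   # -> [1]
```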
Mixture of Experts
• Can we do better than averaging over all training points?
• Look at the input data to see which points are better classified by which classifiers
• Allow each expert to focus on those cases where it's already doing better than average
Mixture of Experts
• The array of p's is called a "gating network".
• Optimize the $p_i$ as part of the loss function

[Diagram: Features → Classifiers → second-tier Classifier, with a gating weight p on each classifier's output]
Probability Correct under a mixture of experts

$$p(d^c \mid \mathrm{MoE}) = \sum_i p_i^c \, e^{-\frac{1}{2} \| d^c - o_i^c \|^2}$$

• $p(d^c \mid \mathrm{MoE})$: probability of the desired output on case $c$
• $p_i^c$: mixing coefficient for expert $i$ on case $c$
• $e^{-\frac{1}{2} \| d^c - o_i^c \|^2}$: Gaussian loss between the desired and observed output

From Hinton Lecture
Gating network Gradient

$$E = -\log p(d^c \mid \mathrm{MoE}) = -\log \sum_i p_i^c \, e^{-\frac{1}{2} \| d^c - o_i^c \|^2}$$

$$\frac{\partial E}{\partial o_i^c} = -\frac{p_i^c \, e^{-\frac{1}{2} \| d^c - o_i^c \|^2}}{\sum_j p_j^c \, e^{-\frac{1}{2} \| d^c - o_j^c \|^2}} \left( d^c - o_i^c \right)$$

• The fraction is the posterior probability of expert $i$.

From Hinton Lecture
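A NumPy sketch of the loss and gradient above, assuming a single training case, a vector-valued desired output, and fixed gating coefficients; the example values are invented:

```python
import numpy as np

def moe_loss_and_grad(d, o, p):
    """d: (dim,) desired output; o: (k, dim) expert outputs;
    p: (k,) gating (mixing) coefficients."""
    g = np.exp(-0.5 * np.sum((d - o) ** 2, axis=1))   # Gaussian terms
    lik = np.sum(p * g)                                # p(d | MoE)
    E = -np.log(lik)                                   # negative log-likelihood
    post = p * g / lik                                 # posterior over experts
    grad_o = -post[:, None] * (d - o)                  # dE/do_i, as above
    return E, grad_o

d = np.array([1.0, 0.0])
o = np.array([[0.9, 0.1], [0.0, 1.0]])                 # two experts' outputs
p = np.array([0.5, 0.5])
E, g = moe_loss_and_grad(d, o, p)
print(E, g)    # the expert closer to d gets most of the blame/credit
```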
AdaBoost
• Adaptive Boosting
• Construct an ensemble of "weak" classifiers.
• Typically single-split decision trees ("decision stumps")
• Identify weights for each classifier.
Weak Classifiers
• Weak classifiers:
  • low performance (slightly better than chance)
  • high variance
  • (for AdaBoost) should have uncorrelated errors.
Boosting Hypothesis
• The existence of a weak learner implies the existence of a strong learner.
AdaBoost Decision Function
• AdaBoost generates a prediction from a weighted sum of the predictions of each classifier.
• The AdaBoost algorithm determines the weights.
• Similar to systems that use a second-tier classifier to learn a combination function.

$$C(x) = \alpha_1 C_1(x) + \alpha_2 C_2(x) + \dots + \alpha_k C_k(x)$$

The weight training is different from any loss function we've used.
AdaBoost training algorithm
• Repeat:
  • Identify the best unused classifier $C_i$.
  • Assign it a weight based on its performance
  • Update the weight of each data point based on whether or not it is classified correctly
• Until performance converges or all classifiers are included.
Identify the best classifier
• Generate hypotheses using each unused classifier.
• Calculate the weighted error using the current data point weights.
• Data point weights are initialized to one.

$$W_e = \sum_{y_i \neq k_m(x_i)} w_i^{(m)} \qquad \text{(how many errors were made)}$$
Generate a weight for the classifier
• The larger the reduction in error, the larger the classifier weight

$$e_m = \frac{W_e}{W} \qquad \text{(weighted error rate at iteration } m\text{)}$$

$$\alpha_m = \frac{1}{2} \ln \left( \frac{1 - e_m}{e_m} \right) \qquad \text{(new weight)}$$
Data Point weighting
• If data point $i$ was not correctly classified:

$$w_i^{(m+1)} = w_i^{(m)} e^{\alpha_m} = w_i^{(m)} \sqrt{\frac{1 - e_m}{e_m}} \qquad (> 1)$$

• If data point $i$ was correctly classified:

$$w_i^{(m+1)} = w_i^{(m)} e^{-\alpha_m} = w_i^{(m)} \sqrt{\frac{e_m}{1 - e_m}} \qquad (< 1)$$
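Putting the three steps together, a minimal AdaBoost sketch with decision stumps as the weak classifiers; the labels are in {-1, +1} and the data and thresholds are invented for illustration:

```python
import numpy as np

def adaboost(X, y, weak_learners, rounds):
    """Labels y in {-1, +1}; weak_learners are callables h(X) -> {-1, +1}."""
    w = np.ones(len(y))                  # data point weights start at one
    alphas, chosen = [], []
    available = list(weak_learners)
    for _ in range(rounds):
        if not available:
            break
        # Identify the best unused classifier: smallest weighted error
        errs = [(w * (h(X) != y)).sum() / w.sum() for h in available]
        i = int(np.argmin(errs))
        e_m = float(np.clip(errs[i], 1e-10, 1 - 1e-10))  # avoid log(0)
        h = available.pop(i)
        # Weight the classifier by its performance
        alpha = 0.5 * np.log((1.0 - e_m) / e_m)
        # Upweight misclassified points (y*h = -1), downweight correct ones
        w = w * np.exp(-alpha * y * h(X))
        alphas.append(alpha)
        chosen.append(h)
    return lambda X: np.sign(sum(a * h(X) for a, h in zip(alphas, chosen)))

# Single-threshold decision stumps as the weak classifiers
X = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([-1, -1, -1, 1, 1, 1])
stumps = [lambda X, t=t: np.where(X > t, 1, -1) for t in (-1.5, 0.0, 1.5)]
C = adaboost(X, y, stumps, rounds=3)
print(C(X))                              # -> [-1. -1. -1.  1.  1.  1.]
```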
AdaBoost training algorithm
• Repeat:
  • Identify the best unused classifier $C_i$.
  • Assign it a weight based on its performance
  • Update the weight of each data point based on whether or not it is classified correctly
• Until performance converges or all classifiers are included.
Random Forests
• Random Forests are similar to boosted decision trees, but without the adaptive training.
• An ensemble of classifiers is trained, each on a different subset of features and a different set of data points
• Random subspace projection
Decision Tree

[Diagram: a decision tree over the world state — "Is it raining?" yes → P(wet) = 0.95; no → "Is the sprinkler on?" yes → P(wet) = 0.9; no → P(wet) = 0.1]
Construct a Forest of Trees

[Diagram: trees $t_1, \dots, t_T$ each produce a vote for category $c$]
Training Algorithm
• Divide the training data into K subsets of data points and M variables.
  • Improved generalization
  • Reduced memory requirements
• Train a unique decision tree on each of the K sets (sketched below)
• Simple multi-threading
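A sketch of this training loop, assuming scikit-learn trees, bootstrapped rows, and a random M-feature subset per tree; K and M are illustrative choices:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
rng = np.random.default_rng(0)
K, M = 10, 5                       # number of trees, features per tree

forest = []
for _ in range(K):
    rows = rng.integers(0, len(X), size=len(X))            # bootstrap the points
    cols = rng.choice(X.shape[1], size=M, replace=False)   # random feature subset
    tree = DecisionTreeClassifier().fit(X[rows][:, cols], y[rows])
    forest.append((tree, cols))

# Each tree votes; the forest predicts the majority category
votes = np.stack([t.predict(X[:, cols]) for t, cols in forest])
y_hat = (votes.mean(axis=0) > 0.5).astype(int)
print("training accuracy:", (y_hat == y).mean())
```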
Handling Class Imbalances
• Class imbalance, or a skewed class distribution, occurs when there are not equal numbers of each label.
• Class imbalance presents a number of challenges:
• Density Estimation
  • Low priors can lead to poor estimation of minority classes
• Loss Functions
  • Since the loss of each point is equal, getting a lot of majority-class points correct is important.
• Evaluation
  • Accuracy is less informative.
Impact on Accuracy
• Example from Information Retrieval
• Find 10 relevant documents from a set of 100.

|               | True: Positive | True: Negative |
|---------------|----------------|----------------|
| Hyp: Positive | 0              | 0              |
| Hyp: Negative | 10             | 90             |

Accuracy = 90%
Contingency Table

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$

|               | True: Positive  | True: Negative  |
|---------------|-----------------|-----------------|
| Hyp: Positive | True Positive   | False Positive  |
| Hyp: Negative | False Negative  | True Negative   |
F-Measure
• Precision: how many hypothesized events were true events

$$P = \frac{TP}{TP + FP}$$

• Recall: how many of the true events were identified

$$R = \frac{TP}{TP + FN}$$

• F-Measure: harmonic mean of precision and recall

$$F = \frac{2PR}{P + R}$$

|               | True: Positive | True: Negative |
|---------------|----------------|----------------|
| Hyp: Positive | 0              | 0              |
| Hyp: Negative | 10             | 90             |
F-Measure
• F-measure can be weighted to favor Precision or Recall
• β > 1 favors recall; β < 1 favors precision

$$F_\beta = \frac{(1 + \beta^2) P R}{\beta^2 P + R}$$
F-Measure

|               | True: Positive | True: Negative |
|---------------|----------------|----------------|
| Hyp: Positive | 0              | 0              |
| Hyp: Negative | 10             | 90             |

$$P = 0 \qquad R = 0 \qquad F_1 = 0$$
F-Measure

|               | True: Positive | True: Negative |
|---------------|----------------|----------------|
| Hyp: Positive | 10             | 50             |
| Hyp: Negative | 0              | 40             |

$$P = \frac{10}{60} \qquad R = 1 \qquad F_1 = 0.29$$
F-Measure

|               | True: Positive | True: Negative |
|---------------|----------------|----------------|
| Hyp: Positive | 9              | 1              |
| Hyp: Negative | 1              | 89             |

$$P = 0.9 \qquad R = 0.9 \qquad F_1 = 0.9$$
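A small helper that reproduces the three tables above from raw contingency counts, including the $F_\beta$ generalization from the earlier slide:

```python
def prf(tp, fp, fn, beta=1.0):
    """Precision, recall, and F_beta from contingency-table counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = (1 + beta**2) * p * r / (beta**2 * p + r) if p + r else 0.0
    return p, r, f

print(prf(tp=0,  fp=0,  fn=10))   # -> (0.0, 0.0, 0.0)
print(prf(tp=10, fp=50, fn=0))    # -> P = 10/60, R = 1, F1 ~ 0.29
print(prf(tp=9,  fp=1,  fn=1))    # -> (0.9, 0.9, 0.9)
```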
ROC and AUC
• It is common to plot classifier performance at a variety of settings or thresholds
• Receiver Operating Characteristic (ROC) curves plot the true positive rate against the false positive rate.
• Overall performance is summarized by the Area Under the Curve (AUC)
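A sketch using scikit-learn's ROC utilities on synthetic scores; the score model below is invented purely to have something to plot:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)             # synthetic binary labels
scores = y_true * 0.8 + rng.normal(0, 0.5, 200)   # noisy, somewhat informative scores

fpr, tpr, thresholds = roc_curve(y_true, scores)  # one (FPR, TPR) point per threshold
print(f"AUC = {roc_auc_score(y_true, scores):.3f}")  # 0.5 = chance, 1.0 = perfect
```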
Skew in Classifier Training
• Most classifiers train better with balanced training data.
• Bayesian methods:
  • Reliance on a prior to weight classes.
  • Estimation of the class-conditioned density is impacted by skew in the number of samples
• Loss functions:
  • There is more pressure to set the decision boundary for the majority classes
Skew in Classifier Training

[Figure, annotated: "Twice as many errors" and "Same distance from optimal decision boundary"]
Sampling
• Artificial manipulation of the number of training samples can help reduce the impact of class imbalance
• Undersampling
  • Randomly select $N_m$ data points from the majority class for training
Sampling
• Oversampling
  • Reproduce the minority-class points until the class sizes are balanced
Ensemble Sampling
• Repeat undersampling $N_M / N_m$ times with different samples of the majority-class data points.
• Train $N_M / N_m$ classifiers and combine them with majority voting (a sketch follows).

[Diagram: classifiers C1, C2, C3 → Merge]
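A sketch of ensemble undersampling, assuming binary labels with class 1 as the minority and scikit-learn trees as the ensemble members; the skewed dataset is synthetic:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def ensemble_undersample(X, y, minority=1, seed=0):
    """Train roughly N_M / N_m classifiers, each on all minority points
    plus a different (disjoint) sample of the majority class."""
    min_idx = np.flatnonzero(y == minority)
    maj_idx = np.flatnonzero(y != minority)
    rng = np.random.default_rng(seed)
    rng.shuffle(maj_idx)
    k = max(1, len(maj_idx) // len(min_idx))    # ~ N_M / N_m members
    models = []
    for chunk in np.array_split(maj_idx, k):
        idx = np.concatenate([min_idx, chunk])  # balanced training subset
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def merged_predict(models, X):
    votes = np.stack([m.predict(X) for m in models])
    return (votes.mean(axis=0) > 0.5).astype(int)   # majority vote

# Skewed synthetic data: ~90% negative, ~10% positive
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
models = ensemble_undersample(X, y)
print(len(models), "classifiers; acc:", (merged_predict(models, X) == y).mean())
```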
Ensemble Methods
• A very simple and effective technique to improve classification performance
• Netflix Prize, Watson, etc.
• Mathematical justification
• Intuitive appeal: mirrors how decisions are made by people and organizations
• Can allow for modular training