Lecture 13 - web.engr.oregonstate.edu
TRANSCRIPT
Lecture 13
Oct-27-2007
Bagging
• Generate T random samples from the training set by bootstrapping.
• Learn a sequence of classifiers h1, h2, …, hT from each of them, using base learner L.
• To classify an unknown sample X, let each classifier predict.
• Take a simple majority vote to make the final prediction.

Simple scheme, works well in many situations!
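As a concrete illustration (not from the slides), here is a minimal sketch of this scheme in Python, assuming NumPy arrays and a scikit-learn-style base learner with fit/predict; the function names are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # stand-in for the base learner L

def bagging_fit(X, y, T=25, base_learner=DecisionTreeClassifier):
    """Learn T classifiers, each from a bootstrap sample of the training set."""
    n = len(X)
    classifiers = []
    for _ in range(T):
        idx = np.random.choice(n, size=n, replace=True)   # bootstrap: sample with replacement
        classifiers.append(base_learner().fit(X[idx], y[idx]))
    return classifiers

def bagging_predict(classifiers, X):
    """Let each classifier predict, then take a simple majority vote per sample.
    Assumes integer class labels 0..K-1."""
    votes = np.array([h.predict(X) for h in classifiers])  # shape (T, n_samples)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```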
Bias/Variance for Classifiers
• Bias arises when the classifier cannot represent the true function – that is, the classifier underfits the data.
• Variance arises when the classifier overfits the data – minor variations in the training set cause the classifier to fit differently.
• Clearly you would like to have a low-bias and low-variance classifier!
– Typically, low-bias classifiers (overfitting) have high variance.
– High-bias classifiers (underfitting) have low variance.
– We have a trade-off.
Effect of Algorithm Parameters on Bias and Variance
• k-nearest neighbor: increasing k typically increases bias and reduces variance
• decision trees of depth D: increasing D typically increases variance and reduces bias
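For reference, these knobs look like this with scikit-learn estimators (the specific values are arbitrary examples, not recommendations from the lecture):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Larger k averages over more neighbors: smoother decision boundary,
# typically higher bias but lower variance.
knn_smooth   = KNeighborsClassifier(n_neighbors=25)
knn_flexible = KNeighborsClassifier(n_neighbors=1)

# Deeper trees fit finer structure: typically lower bias but higher variance.
tree_shallow = DecisionTreeClassifier(max_depth=2)
tree_deep    = DecisionTreeClassifier(max_depth=None)  # grow until leaves are pure
```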
Why does bagging work?
• Bagging takes the average of multiple models – this reduces the variance.
• This suggests that bagging works best with low-bias, high-variance classifiers.
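A back-of-the-envelope way to see the variance reduction, under the idealized assumption that the T predictions are independent with common variance $\sigma^2$ (bootstrap replicates are in fact correlated, so the real reduction is smaller):

```latex
\operatorname{Var}\!\left(\frac{1}{T}\sum_{t=1}^{T} h_t(x)\right)
  = \frac{1}{T^{2}}\sum_{t=1}^{T}\operatorname{Var}\big(h_t(x)\big)
  = \frac{\sigma^{2}}{T}
```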
Boosting
• Also an ensemble method: the final prediction is a combination of the predictions of multiple classifiers.
• What is different?
– It is iterative. Boosting: successive classifiers depend upon their predecessors – they look at the errors of previous classifiers to decide what to focus on in the next iteration over the data. Bagging: individual classifiers were independent.
– All training examples are used in each iteration, but with different weights – more weight on difficult examples (the ones on which we made mistakes in previous iterations).
AdaBoost: Illustration
[Figure: the original data is uniformly weighted; each round learns a classifier h1(x), h2(x), h3(x), …, hM(x) and then updates the example weights; the final prediction H(X) combines all of them.]
The AdaBoost Algorithm
CS434 Fall 2007
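The algorithm itself appears only in the slide images. For reference, a standard statement of (discrete) AdaBoost, sketched in Python with decision stumps as the base learner; labels are assumed to be in {-1, +1} and the helper names are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    """Discrete AdaBoost with decision stumps; y must be in {-1, +1}."""
    n = len(X)
    D = np.full(n, 1.0 / n)                       # start with uniform weights
    stumps, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1)   # decision stump
        h.fit(X, y, sample_weight=D)              # base learner takes the weights directly
        pred = h.predict(X)
        eps = D[pred != y].sum()                  # weighted training error
        if eps == 0 or eps >= 0.5:                # perfect, or no better than chance
            break
        alpha = 0.5 * np.log((1 - eps) / eps)     # weight of this classifier
        D *= np.exp(-alpha * y * pred)            # up-weight mistakes, down-weight correct ones
        D /= D.sum()                              # renormalize to a distribution
        stumps.append(h)
        alphas.append(alpha)
    return stumps, np.array(alphas)

def adaboost_predict(stumps, alphas, X):
    """Final prediction H(X) = sign( sum_t alpha_t * h_t(X) )."""
    score = sum(a * h.predict(X) for a, h in zip(alphas, stumps))
    return np.sign(score)
```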
AdaBoost (Example)
Original training set: equal weights on all training samples.
Taken from “A Tutorial on Boosting” by Yoav Freund and Rob Schapire
AdaBoost (Example)
ROUND 1
AdaBoost (Example)
ROUND 2
AdaBoost (Example)
ROUND 3
AdaBoost (Example)
Weighted Error
• AdaBoost calls L with a set of prespecified weights.
• It is often straightforward to convert a base learner L to take into account an input distribution D.
– Decision trees?
– K Nearest Neighbor?
– Naïve Bayes?
• When it is not straightforward, we can resample the training data S according to D and then feed the new data set into the learner.
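A minimal sketch of that resampling step, assuming D is a normalized weight vector over the n training examples:

```python
import numpy as np

def resample_by_weights(X, y, D, rng=None):
    """Draw a new training set of the same size, picking example i with
    probability D[i]; the base learner then sees ordinary unweighted data."""
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(X), size=len(X), replace=True, p=D)
    return X[idx], y[idx]
```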
Boosting Decision Stumps
Decision stumps: very simple rules of thumb that test a condition on a single attribute.
Among the most commonly used base classifiers – truly weak!
Boosting with decision stumps has been shown to achieve better performance compared to unbounded decision trees.
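For concreteness, a weighted decision stump can be written out in a few lines; this sketch brute-forces one threshold on one attribute (labels in {-1, +1}, D a weight vector over the examples):

```python
import numpy as np

def fit_stump(X, y, D):
    """Return (attribute j, threshold t, sign s) minimizing the weighted error
    of the rule: predict s if X[:, j] <= t, else -s."""
    best_err, best_rule = np.inf, None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (+1, -1):
                pred = np.where(X[:, j] <= t, s, -s)
                err = D[pred != y].sum()
                if err < best_err:
                    best_err, best_rule = err, (j, t, s)
    return best_rule
```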
Boosting Performance
• Comparing C4.5, boosting decision stumps, and boosting C4.5 on 27 UCI data sets.
– C4.5 is a popular decision tree learner.
Boosting vs Bagging of Decision Trees
Overfitting?
• Boosting drives training error to zero – will it overfit?
• Curious phenomenon:
• Boosting is often robust to overfitting (not always).
• Test error continues to decrease even after training error goes to zero.
Explanation with Margins

$$f(x) = \sum_{l=1}^{L} w_l \, h_l(x)$$

$$\text{Margin} = y \cdot f(x)$$
Histogram of functional margin for ensemble just after achieving zero training error
Effect of Boosting: Maximizing Margin
No examples with small margins!
Even after zero training error, the margin of examples increases. This is one reason that the generalization error may continue decreasing.
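If one wanted to reproduce this kind of margin histogram with scikit-learn's AdaBoost, a sketch on toy data (not the lecture's data set) could look as follows; labels are relabeled to {-1, +1} so that y·f(x) is the functional margin:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y01 = make_classification(n_samples=500, random_state=0)
y = 2 * y01 - 1                                   # relabel classes as {-1, +1}

clf = AdaBoostClassifier(n_estimators=200).fit(X, y)

# decision_function gives a signed score f(x) whose sign is the predicted class,
# so y * f(x) is the functional margin of each training example.
margins = y * clf.decision_function(X)

plt.hist(margins, bins=30)
plt.xlabel("margin = y * f(x)")
plt.title("Functional margins on the training set")
plt.show()
```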
Bias/Variance Analysis of Boosting
• In the early iterations, boosting is primarily a bias-reducing method.
• In later iterations, it appears to be primarily a variance-reducing method.
What you need to know about ensemble methods
• Bagging: a randomized algorithm based on bootstrapping
– What is bootstrapping?
– Variance reduction
– What learning algorithms will be good for bagging?
• Boosting:
– Combine weak classifiers (i.e., slightly better than random)
– Training using the same data set but different weights
– How to update weights?
– How to incorporate weights in learning (DT, KNN, Naïve Bayes)
– One explanation for not overfitting: maximizing the margin