Boosting
Instructor: Pranjal Awasthi
Motivating Example
Want to predict the winner of the 2016 general election.
• Step 1: Collect training data
  – 2012, 2008, 2004, 2000, 1996, 1992, …
• Step 2: Find a function that gets high accuracy (~99%) on the training set.
What if instead, we have good "rules of thumb" that often work well?
Motivating Example
h_1 = If the sitting president is running, go with his/her party; otherwise, predict randomly.
• h_1 will do well on a subset of the training set, and will do better than half overall, say 51% accuracy.
• But if we focus on only a subset of the examples (say 2008, 2000, …), h_1 will do poorly.
h_2 = If a CNN poll gives a 20-point lead to someone, go with his/her party; otherwise, predict randomly.
• On that subset, h_2 will do well, say 51% accuracy: where h_1 does poorly, h_2 does well.
Motivating Example
• Suppose we can consistently produce rules h_1, h_2, h_3, … in a class H that do slightly better than random guessing.
• Goal: Output a single rule that gets high accuracy on the training set by combining h_1, h_2, h_3, … from H.
Boosting
Weak Learning Algorithm → Strong Learning Algorithm
Boosting: A general method for converting "rules of thumb" into highly accurate predictions.
History of Boosting
⢠Originally studied from a purely theoretical perspective
â Is weak learning equivalent to strong learning?
â Answered positively by Rob Schapire in 1989 via the boosting algorithm.
⢠AdaBoost -- A highly practical version developed in 1995
â One of the most popular ML (meta)algorithms
History of Boosting
⢠Has connections to many different areas
â Empirical risk minimization
â Game theory
â Convex optimization
â Computational complexity theory
Strong Learning
⢠To get strong learning, need to
â Design an algorithm that can get any arbitrary error ð > 0,on the training set ðð
â Do VC dimension analysis to see how large ð needs to be for generalization
Theory of Boosting
⢠Weak Learning Assumption:
â There exists algorithm ðŽ that can consistently produce weak classifiers on the training set
ðð = ð1, ðŠ1 , ð2, ðŠ2 âŠ
â Weak Classifier:
⢠For every weighting W of ðð, ðŽ outputs âð such that
ððððð,ðâð â€
1
2â ðŸ, ðŸ > 0
⢠ððððð,ð âð = Ïð=1ð ð€ððŒ âð ðð â ðŠð
Ïð=1ð ð€ð
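As a concrete reference for the weighted error just defined, here is a minimal sketch (mine, not from the lecture), assuming labels in {−1, +1}, NumPy arrays, and a classifier given as a callable; the name weighted_error is just an illustrative choice.

```python
import numpy as np

def weighted_error(h, X, y, w):
    """Weighted training error: sum_i w_i * I[h(x_i) != y_i] / sum_i w_i."""
    mistakes = (h(X) != y).astype(float)    # indicator I[h(x_i) != y_i] for each example
    return np.dot(w, mistakes) / np.sum(w)  # normalize by the total weight
```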
Theory of Boosting
• Weak Learning Assumption: there exists an algorithm A that can consistently produce weak classifiers on the training set S_n.
• Boosting Guarantee: Use A to get an algorithm B such that
  – For every ε > 0, B outputs f with error_S(f) ≤ ε
AdaBoost
• Run A on S_n to get h_1.
• Use h_1 to get weights W_2.
• Run A on S_n with weights W_2 to get h_2. Repeat T times.
• Combine h_1, h_2, …, h_T to get the final classifier.
Q1: How to choose the weights? Q2: How to combine the classifiers?
Choosing weights
Idea 1: Let W_t be uniform over the examples that h_{t−1} got incorrect.
Idea 2: Let W_t be such that the error of h_{t−1} becomes exactly 1/2.
Choosing weights
• If error_{S,W_{t−1}}(h_{t−1}) = 1/2 − γ, then set
  – W_{t,i} = W_{t−1,i}, if h_{t−1} is incorrect on x_i
  – W_{t,i} = W_{t−1,i} · (1/2 − γ) / (1/2 + γ), if h_{t−1} is correct on x_i
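A small self-contained sketch of this reweighting rule (my own illustration, assuming NumPy arrays and labels in {−1, +1}): the weights of examples that h_{t−1} got wrong are kept, while the weights of correctly classified examples are scaled by (1/2 − γ)/(1/2 + γ), which makes h_{t−1}'s weighted error exactly 1/2 under W_t.

```python
import numpy as np

def reweight(h_prev, X, y, w_prev):
    """Compute W_t from W_{t-1} so that h_{t-1}'s weighted error becomes 1/2."""
    mistakes = (h_prev(X) != y)
    err = np.dot(w_prev, mistakes.astype(float)) / np.sum(w_prev)  # = 1/2 - gamma by assumption
    gamma = 0.5 - err
    w_new = w_prev.copy()
    w_new[~mistakes] *= (0.5 - gamma) / (0.5 + gamma)  # shrink weights of correct examples
    return w_new
```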
Combining Classifiers
Take the majority vote: f(x) = MAJ(h_1(x), h_2(x), …, h_T(x))
Theorem: If the error of each h_t is 1/2 − γ, then error_S(f) ≤ e^{−2Tγ²}.
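Putting the loop, the reweighting, and the majority vote together, here is a hedged end-to-end sketch of the scheme described on these slides. It reuses the reweight helper sketched above; weak_learner stands in for the algorithm A, and its interface (weak_learner(X, y, w) -> h) is my assumption, not something the slides specify.

```python
import numpy as np

def boost(weak_learner, X, y, T):
    """Run T rounds of the weak learner and combine the outputs by an unweighted majority vote."""
    n = len(y)
    w = np.ones(n) / n                     # start with uniform weights over S_n
    hs = []
    for _ in range(T):
        h = weak_learner(X, y, w)          # run A on S_n with the current weights
        hs.append(h)
        w = reweight(h, X, y, w)           # choose W_{t+1} so h's error becomes exactly 1/2
    def f(X_new):                          # f(x) = MAJ(h_1(x), ..., h_T(x))
        votes = np.sum([h(X_new) for h in hs], axis=0)
        return np.where(votes >= 0, 1, -1)
    return f
```

(AdaBoost proper weights each h_t by α_t = ½ ln((1 − ε_t)/ε_t) in the final vote; the unweighted vote above matches the simpler scheme on this slide.)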
Analysis
Theorem: If the error of each h_t is at most 1/2 − γ, then error_S(f) ≤ e^{−2Tγ²}.
Key ingredients:
• Let W_t be such that the error of h_{t−1} becomes exactly 1/2.
• Take a (weighted) majority vote.
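For intuition on where a bound of this form comes from, here is a sketch along the lines of the standard AdaBoost analysis (weighted majority vote with weights α_t = ½ ln((1 − ε_t)/ε_t)); this is my reconstruction of the usual argument, not a verbatim transcription of the lecture's proof.

```latex
% Training error of the weighted majority vote, with \epsilon_t = 1/2 - \gamma_t \le 1/2 - \gamma:
\mathrm{error}_S(f)
  \;\le\; \prod_{t=1}^{T} 2\sqrt{\epsilon_t(1-\epsilon_t)}
  \;=\;   \prod_{t=1}^{T} \sqrt{1 - 4\gamma_t^2}
  \;\le\; \exp\Bigl(-2\sum_{t=1}^{T}\gamma_t^2\Bigr)
  \;\le\; e^{-2T\gamma^2}
% The last two inequalities use 1 - x \le e^{-x} and \gamma_t \ge \gamma.
```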
Strong Learning
⢠To get strong learning, need to
â Design an algorithm that can get any arbitrary error ð > 0,on the training set ðð
â Do VC dimension analysis to see how large ð needs to be for generalization
Generalization Bound
h_1, h_2, h_3, … ∈ H
Applications: Decision Stumps
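The slide content here is a figure, but the idea is standard: a decision stump is a one-level decision tree that thresholds a single feature, and it is a classic weak learner to plug into boosting. Below is a minimal, brute-force sketch of fitting a weighted stump (my own illustration; fit_stump and its interface are hypothetical, chosen to match the weak_learner(X, y, w) -> h signature assumed earlier).

```python
import numpy as np

def fit_stump(X, y, w):
    """Return the decision stump (feature, threshold, sign) with the lowest weighted error."""
    best = None
    for j in range(X.shape[1]):                    # try every feature
        for thr in np.unique(X[:, j]):             # and every observed value as a threshold
            for sign in (+1, -1):                  # predict sign above the threshold, -sign below
                pred = np.where(X[:, j] >= thr, sign, -sign)
                err = np.dot(w, (pred != y).astype(float)) / np.sum(w)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    _, j, thr, sign = best
    return lambda Xn: np.where(Xn[:, j] >= thr, sign, -sign)
```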
Advantages of AdaBoost
⢠Helps learn complex models from simple ones without overfitting
⢠A general paradigm and easy to implement
⢠Parameter free, no tuning
⢠Often not very sensitive to the number of rounds ð
⢠Cons:
â Does not help much if weak learners are already complex.
â Sensitive to noise in the data