Boosting

Instructor: Pranjal Awasthi


Page 1:

Boosting

Instructor: Pranjal Awasthi

Page 2:

Motivating Example

Want to predict winner of 2016 general election

Page 3:

Motivating Example

• Step 1: Collect training data

– 2012,

– 2008,

– 2004,

– 2000,

– 1996,

– 1992,

– ......

• Step 2: Find a function that gets high accuracy (~99%) on the training set.

What if instead, we have good “rules of thumb” that often work well?

Page 4:

Motivating Example

• Step 1: Collect training data

– 2012,

– 2008,

– 2004,

– 2000,

– 1996,

– 1992,

– ......

• ℎ1 will do well on a subset of the training set, and will do better than half overall, say 51% accuracy.

ℎ1 = If the sitting president is running, go with his/her party. Otherwise, predict randomly.

Page 5:

Motivating Example

• Step 1: Collect training data

– ,

– 2008

– ,

– 2000,

– ,

– ,

– ......

ℎ1 = If the sitting president is running, go with his/her party. Otherwise, predict randomly.

What if we focus only on a subset of examples? ℎ1 will do poorly.

Page 6:

Motivating Example

• Step 1: Collect training data

– ,

– 2008

– ,

– 2000,

– ,

– ,

– ......

• ℎ2 will do well, say 51% accuracy.

ℎ2 = If a CNN poll gives a 20-point lead to someone, go with his/her party. Otherwise, predict randomly.

What if we focus only on a subset of examples? ℎ1 will do poorly, but ℎ2 will do well.

Page 7:

Motivating Example

• Step 1: Collect training data

– 2012,

– 2008,

– 2004,

– 2000,

– 1996,

– 1992,

– ......

• Suppose we can consistently produce rules in 𝐻 that do slightly better than random guessing.

[Figure: rules ℎ1, ℎ2, ℎ3, 
 in the hypothesis class 𝐻]

Page 8:

Motivating Example

• Step 1: Collect training data

– 2012,

– 2008,

– 2004,

– 2000,

– 1996,

– 1992,

– ......

• Output a single rule that gets high accuracy on the training set.

[Figure: rules ℎ1, ℎ2, ℎ3, 
 in the hypothesis class 𝐻]

Page 9:

Boosting

[Figure: Weak Learning Algorithm → Strong Learning Algorithm]

Boosting: A general method for converting “rules of thumb” into highly accurate predictions.

Page 10:

History of Boosting

• Originally studied from a purely theoretical perspective

– Is weak learning equivalent to strong learning?

– Answered positively by Rob Schapire in 1989 via the boosting algorithm.

• AdaBoost -- a highly practical version, developed by Freund and Schapire in 1995

– One of the most popular ML (meta)algorithms

Page 11:

History of Boosting

• Has connections to many different areas

– Empirical risk minimization

– Game theory

– Convex optimization

– Computational complexity theory

Page 12:

Strong Learning

• To get strong learning, need to

– Design an algorithm that can achieve any arbitrary error 𝜖 > 0 on the training set 𝑆𝑚

– Do VC dimension analysis to see how large 𝑚 needs to be for generalization

Page 13:

Theory of Boosting

• Weak Learning Assumption:

– There exists an algorithm 𝐎 that can consistently produce weak classifiers on the training set

𝑆𝑚 = {(𝑋1, 𝑌1), (𝑋2, 𝑌2), 
, (𝑋𝑚, 𝑌𝑚)}


– Weak Classifier:

• For every weighting 𝑊 of 𝑆𝑚, 𝐎 outputs ℎ𝑊 such that err_{𝑆𝑚,𝑊}(ℎ𝑊) ≀ 1/2 − 𝛟, for some 𝛟 > 0

• err_{𝑆𝑚,𝑊}(ℎ𝑊) = Σ_{i=1}^{m} 𝑀𝑖 · 𝐌[ℎ𝑊(𝑋𝑖) ≠ 𝑌𝑖] / Σ_{i=1}^{m} 𝑀𝑖
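As a concrete illustration of the weighted error above, here is a minimal sketch (not from the slides), assuming NumPy arrays, labels in {−1, +1}, and a classifier given as a callable; the function names are hypothetical.

```python
import numpy as np

def weighted_error(h, X, Y, w):
    """err_{S_m,W}(h) = sum_i w_i * 1[h(X_i) != Y_i] / sum_i w_i."""
    mistakes = (h(X) != Y).astype(float)     # indicator of a mistake on each example
    return float(np.dot(w, mistakes) / np.sum(w))

def satisfies_weak_learning(h, X, Y, w, gamma):
    """Check the weak-learning condition err_{S_m,W}(h) <= 1/2 - gamma."""
    return weighted_error(h, X, Y, w) <= 0.5 - gamma
```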

Page 14:

Theory of Boosting

• Weak Learning Assumption:

– There exists an algorithm 𝐎 that can consistently produce weak classifiers on the training set 𝑆𝑚

• Boosting Guarantee: Use 𝐎 to get algorithm 𝐵 such that

– For every 𝜖 > 0, 𝐵 outputs 𝑓 with err_{𝑆𝑚}(𝑓) ≀ 𝜖

Page 15:

AdaBoost

• Run 𝐎 on 𝑆𝑚 to get ℎ1

• Use ℎ1 to get weights 𝑊2

• Run 𝐎 on 𝑆𝑚 with weights 𝑊2 to get ℎ2

• Repeat 𝑇 times

• Combine ℎ1, ℎ2, 
, ℎ𝑇 to get the final classifier.

Q1: How to choose weights?
Q2: How to combine classifiers?
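Before those two questions are answered, the loop itself can be written down. This is a hedged Python sketch of the outline above; boosting_loop, choose_weights, and combine are hypothetical names, with the latter two passed in by the caller and standing in for the answers to Q1 and Q2 on the following slides.

```python
import numpy as np

def boosting_loop(A, choose_weights, combine, X, Y, T):
    """Generic form of the outline above.

    A(X, Y, w)                 -> weak classifier h_t (a callable)
    choose_weights(w, h, X, Y) -> next weighting W_{t+1}  (Q1, placeholder)
    combine(hs)                -> final classifier f      (Q2, placeholder)
    """
    m = len(Y)
    w = np.ones(m) / m               # W_1: uniform weights to start
    hs = []
    for _ in range(T):
        h = A(X, Y, w)               # run the weak learner on the weighted sample
        hs.append(h)
        w = choose_weights(w, h, X, Y)
    return combine(hs)
```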

Page 16:

Choosing weights

• Run 𝐎 on 𝑆𝑚 to get ℎ1

• Use ℎ1 to get weights 𝑊2

• Run 𝐎 on 𝑆𝑚 with weights 𝑊2 to get ℎ2

• Repeat 𝑇 times

• Combine ℎ1, ℎ2, 
, ℎ𝑇 to get the final classifier.

Idea 1: Let 𝑊𝑡 be uniform over the examples that ℎ𝑡−1 classified incorrectly.
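A hedged sketch of Idea 1 (assuming NumPy arrays and ±1 labels; the function name is hypothetical): put uniform weight on the mistakes of ℎ𝑡−1 and zero weight elsewhere.

```python
import numpy as np

def idea1_weights(h_prev_preds, Y):
    """Uniform weighting over the examples the previous classifier got wrong."""
    mistakes = (h_prev_preds != Y).astype(float)
    return mistakes / mistakes.sum()   # assumes h_{t-1} makes at least one mistake
```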

Page 17:

Choosing weights

• Run 𝐎 on 𝑆𝑚 to get ℎ1

• Use ℎ1 to get weights 𝑊2

• Run 𝐎 on 𝑆𝑚 with weights 𝑊2 to get ℎ2

• Repeat 𝑇 times

• Combine ℎ1, ℎ2, 
, ℎ𝑇 to get the final classifier.

Idea 2: Let 𝑊𝑡 be such that the error of ℎ𝑡−1 becomes 1/2.

Page 18:

Choosing weights

• Run 𝐎 on 𝑆𝑚 to get ℎ1

• Use ℎ1 to get weights 𝑊2

• Run 𝐎 on 𝑆𝑚 with weights 𝑊2 to get ℎ2

• Repeat 𝑇 times

• Combine ℎ1, ℎ2, 
, ℎ𝑇 to get the final classifier.

• If err_{𝑆𝑚,𝑊𝑡−1}(ℎ𝑡−1) = 1/2 − 𝛟, then

– 𝑊𝑡,𝑖 = 𝑊𝑡−1,𝑖, if ℎ𝑡−1 is incorrect on 𝑋𝑖

– 𝑊𝑡,𝑖 = 𝑊𝑡−1,𝑖 · (1/2 − 𝛟) / (1/2 + 𝛟), if ℎ𝑡−1 is correct on 𝑋𝑖
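A minimal sketch of this update (assuming NumPy arrays, ±1 labels, and that ℎ𝑡−1's weighted error under 𝑊𝑡−1 is exactly 1/2 − 𝛟, as stated above; the function name is hypothetical):

```python
import numpy as np

def reweight(w_prev, h_prev_preds, Y, gamma):
    """W_{t,i} = W_{t-1,i} on mistakes; scale correct examples by (1/2 - gamma)/(1/2 + gamma)."""
    w_next = np.array(w_prev, dtype=float)              # copy of W_{t-1}
    correct = (h_prev_preds == Y)
    w_next[correct] *= (0.5 - gamma) / (0.5 + gamma)    # down-weight what h_{t-1} already gets right
    return w_next
```

The scaling factor is chosen so that the correctly classified examples, which carry weight 1/2 + 𝛟 before the update, carry (1/2 + 𝛟) · (1/2 − 𝛟)/(1/2 + 𝛟) = 1/2 − 𝛟 afterwards, exactly matching the weight on the mistakes; this is why ℎ𝑡−1's error under 𝑊𝑡 becomes 1/2.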

Page 19:

Combining Classifiers

• Run 𝐎 on 𝑆𝑚 to get ℎ1

• Use ℎ1 to get weights 𝑊2

• Run 𝐎 on 𝑆𝑚 with weights 𝑊2 to get ℎ2

• Repeat 𝑇 times

• Combine ℎ1, ℎ2, 
, ℎ𝑇 to get the final classifier.

Take majority vote: 𝑓(𝑋) = MAJ(ℎ1(𝑋), ℎ2(𝑋), 
, ℎ𝑇(𝑋))
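A hedged sketch of this combination step, assuming each ℎ𝑖 returns labels in {−1, +1} so that the majority vote is simply the sign of the summed votes (the function name is hypothetical):

```python
import numpy as np

def majority_vote(hs, X):
    """f(X) = MAJ(h_1(X), ..., h_T(X)) for +/-1 classifiers; ties broken toward +1."""
    votes = np.sum([h(X) for h in hs], axis=0)   # sum of +/-1 votes per example
    return np.where(votes >= 0, 1, -1)
```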

Page 20:

Combining Classifiers

• Run 𝐎 on 𝑆𝑚 to get ℎ1

• Use ℎ1 to get weights 𝑊2

• Run 𝐎 on 𝑆𝑚 with weights 𝑊2 to get ℎ2

• Repeat 𝑇 times

• Combine ℎ1, ℎ2, 
, ℎ𝑇 to get the final classifier.

Take majority vote: 𝑓(𝑋) = MAJ(ℎ1(𝑋), ℎ2(𝑋), 
, ℎ𝑇(𝑋))

Theorem:

If the error of each ℎ𝑖 is 1/2 − 𝛟, then err_{𝑆𝑚}(𝑓) ≀ e^(−2𝑇𝛟²)
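One way to read the bound: solving e^(−2𝑇𝛟²) ≀ 𝜖 for 𝑇 gives the number of rounds needed to reach any target training error 𝜖.

```latex
e^{-2T\gamma^{2}} \le \epsilon
\iff
-2T\gamma^{2} \le \ln \epsilon
\iff
T \ge \frac{\ln(1/\epsilon)}{2\gamma^{2}}
```

For example, with 𝛟 = 0.01 (the 51%-accurate rules of thumb from the motivating example) and 𝜖 = 0.01, this works out to roughly 23,000 rounds.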

Page 21:

Analysis

Theorem:

If the error of each ℎ𝑖 is 1/2 − 𝛟, then err_{𝑆𝑚}(𝑓) ≀ e^(−2𝑇𝛟²)

Page 22:

Analysis

Theorem:

If the error of each ℎ𝑖 is 1/2 − 𝛟, then err_{𝑆𝑚}(𝑓) ≀ e^(−2𝑇𝛟²)

Page 23:

Analysis

Theorem:

If the error of each ℎ𝑖 is 1/2 − 𝛟, then err_{𝑆𝑚}(𝑓) ≀ e^(−2𝑇𝛟²)

Page 24:

Analysis

Theorem:

If the error of each ℎ𝑖 is ≀ 1/2 − 𝛟, then err_{𝑆𝑚}(𝑓) ≀ e^(−2𝑇𝛟²)

Let 𝑊𝑡 be such that the error of ℎ𝑡−1 becomes 1/2.

Take (weighted) majority vote

Page 25:

Strong Learning

• To get strong learning, need to

– Design an algorithm that can achieve any arbitrary error 𝜖 > 0 on the training set 𝑆𝑚

– Do VC dimension analysis to see how large 𝑚 needs to be for generalization

Page 26:

Generalization Bound

[Figure: rules ℎ1, ℎ2, ℎ3, 
 in the hypothesis class 𝐻]

Page 27:

Applications: Decision Stumps
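The slide itself gives no details here. As a hedged illustration (not from the slides), a decision stump, i.e. a threshold test on a single feature, is a standard weak learner to plug into the boosting loop; a minimal weighted fit might look like this, with hypothetical helper names fit_stump and stump_predict.

```python
import numpy as np

def fit_stump(X, Y, w):
    """Pick (feature j, threshold theta, polarity s) minimizing weighted error of
    h(x) = s if x[j] > theta else -s, over all features and observed thresholds."""
    m, d = X.shape
    w = w / w.sum()
    best, best_err = (0, 0.0, 1), np.inf
    for j in range(d):
        for theta in np.unique(X[:, j]):
            pred = np.where(X[:, j] > theta, 1, -1)
            for s in (1, -1):
                err = np.sum(w * (s * pred != Y))   # weighted misclassification rate
                if err < best_err:
                    best_err, best = err, (j, theta, s)
    return best

def stump_predict(stump, X):
    j, theta, s = stump
    return s * np.where(X[:, j] > theta, 1, -1)
```

A stump fit this way can serve as the weak learner 𝐎 in the loop sketched earlier, with its predictions fed to the reweighting and majority-vote steps.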

Page 28:

Advantages of AdaBoost

• Helps learn complex models from simple ones without overfitting

• A general paradigm and easy to implement

• Parameter-free, no tuning required

• Often not very sensitive to the number of rounds 𝑇

• Cons:

– Does not help much if weak learners are already complex.

– Sensitive to noise in the data