Simple regret bandit algorithms for unstructured noisy optimization
TRANSCRIPT
Parameter tuning by bandit algorithms
Madeira, June 2010.
Outline
Parameter tuning
Bandits
Our testbed
Results
Parameter tuning is optimization
Optimization: I have a function f. I want its global minimizer x*, i.e. f(x*) <= f(x) for all x.

Expensive optimization
f is an expensive function: it takes hours of computation, or hours of cluster-based computation, or hours of human work. Or maybe f is not even a function.

Parameter tuning is great
f is the testing of a program under given coefficients. Parameter tuning is crucial in many applications ==> expensive optimization.
Outline
Parameter tuning
Bandits
Our testbed
Results
A ``bandit'' problem
p1, ..., pN: unknown probabilities in [0,1].
At each time step i in [1,n]: choose ui in {1,...,N} (as a function of the uj and rj, j < i); receive reward ri = 1 with probability p(ui), 0 otherwise.
==> a wide literature, centered on finite-time analysis.
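As an illustrative sketch (mine, not from the talk), this setting can be simulated as a Bernoulli bandit; the arm probabilities and the round-robin player below are made-up examples:

```python
import random

def play_bandit(p, n, choose, seed=0):
    """Simulate n steps of a Bernoulli bandit with arm probabilities p.

    choose(history) -> arm index; history is a list of (arm, reward) pairs.
    Returns the full history of (arm, reward) pairs.
    """
    rng = random.Random(seed)
    history = []
    for i in range(n):
        u = choose(history)
        r = 1 if rng.random() < p[u] else 0  # reward 1 with probability p[u]
        history.append((u, r))
    return history

# Uniform allocation: round-robin over the N arms.
def uniform_choose(history, N=3):
    return len(history) % N

hist = play_bandit([0.2, 0.5, 0.8], n=30, choose=uniform_choose)
```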
Simple regret
What is simple regret ?
Usual regret = cumulative regret. Simple regret = optimal expected value minus the expected value of the recommended arm.
==> optimization on average
==> no structure on the search space!
==> what are good algorithms for this criterion?
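A minimal sketch of the criterion (my own toy numbers, not the talk's): simple regret only scores the arm we finally recommend, not the rewards collected along the way.

```python
def simple_regret(p, recommended):
    """Simple regret = optimal expected reward minus the expected
    reward of the arm recommended after the budget is spent."""
    return max(p) - p[recommended]

p = [0.2, 0.5, 0.8]              # hypothetical arm means
sr_best = simple_regret(p, 2)    # recommending the true best arm
sr_mid = simple_regret(p, 1)     # recommending a suboptimal arm
```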
Bubeck et al: uniform is optimal!
Simple regret decreases exponentially as a function of n.
For a fixed simple regret, n is linear as a function of K.
Yet, non-asymptotically, UCB is much better (see also the ``successive rejects'' algorithm).
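For illustration, a sketch of the successive rejects idea (Audibert, Bubeck & Munos; this is my simplified rendering, with made-up Bernoulli arm means): split the budget into K-1 phases and drop the empirically worst arm after each phase.

```python
import math
import random

def successive_rejects(means, n, seed=0):
    """Sketch of successive rejects on a Bernoulli bandit with a
    total budget of n pulls; returns the recommended arm."""
    rng = random.Random(seed)
    K = len(means)
    logbar = 0.5 + sum(1.0 / i for i in range(2, K + 1))
    alive = list(range(K))
    pulls = [0] * K
    wins = [0] * K
    n_prev = 0
    for k in range(1, K):
        n_k = math.ceil((n - K) / (logbar * (K + 1 - k)))
        for arm in alive:
            for _ in range(n_k - n_prev):  # extra pulls this phase
                pulls[arm] += 1
                wins[arm] += rng.random() < means[arm]
        n_prev = n_k
        # reject the arm with the worst empirical mean
        alive.remove(min(alive, key=lambda a: wins[a] / max(pulls[a], 1)))
    return alive[0]

best = successive_rejects([0.1, 0.3, 0.9], n=300)
```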
Outline
Parameter tuning
Bandits
Our testbed
Results
Our testbed
Monte-Carlo Tree Search
Why should you care about Monte-Carlo Tree Search? It is a recent algorithm (2006).
Very impressive in computer Go.
Now invading discrete-time control, games, difficult planning (high-dimensional cases!).
A take-home message here: MCTS is not yet very well known, but it's a really great algorithm, in my humble opinion.
Go: from 29 to 6 stones
1998: loss against an amateur (6d), 19x19, H29
2008: win against a pro (8p), 19x19, H9 (MoGo)
2008: win against a pro (4p), 19x19, H8 (CrazyStone)
2008: win against a pro (4p), 19x19, H7 (CrazyStone)
2009: win against a pro (9p), 19x19, H7 (MoGo)
2009: win against a pro (1p), 19x19, H6 (MoGo)
2007: win against a pro (5p), 9x9 blitz (MoGo)
2008: win against a pro (5p), 9x9 as white (MoGo)
2009: win against a pro (5p), 9x9 as black (MoGo)
2009: win against a pro (9p), 9x9 as white (Fuego)
2009: win against a pro (9p), 9x9 as black (MoGoTW)
==> still 6 stones at least!
Go: from 29 to 6 stones
1998: loss against amateur (6d) 19x19 H29 2008: win against a pro (8p) 19x19, H9 MoGo2008: win against a pro (4p) 19x19, H8 CrazyStone2008: win against a pro (4p) 19x19, H7 CrazyStone2009: win against a pro (9p) 19x19, H7 MoGo2009: win against a pro (1p) 19x19, H6 MoGo
2007: win against a pro (5p) 9x9 (blitz) MoGo2008: win against a pro (5p) 9x9 white MoGo2009: win against a pro (5p) 9x9 black MoGo2009: win against a pro (9p) 9x9 white Fuego2009: win against a pro (9p) 9x9 black MoGoTW
==> still 6 stones at least!
All good results with MCTS
Coulom (06); Chaslot, Saito & Bouzy (06); Kocsis & Szepesvari (06): UCT (Upper Confidence Trees)
UCT
Kocsis & Szepesvari (06)
Exploitation ...
SCORE = 5/7 + k.sqrt( log(10)/7 )
... or exploration ?
SCORE = 0/2 + k.sqrt( log(10)/2 )
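Spelling out the score on the slides (a sketch; the exploration constant `k` is a tuning parameter, set arbitrarily to 1 here): the exploitation node has 5 wins in 7 visits, the exploration node 0 wins in 2 visits, and both have a parent with 10 visits.

```python
import math

def ucb_score(wins, visits, parent_visits, k):
    """UCB score of a child node: empirical mean plus an
    exploration bonus that grows when the child is rarely visited."""
    return wins / visits + k * math.sqrt(math.log(parent_visits) / visits)

k = 1.0                             # exploration constant (tunable)
exploit = ucb_score(5, 7, 10, k)    # the 5/7 node from the slide
explore = ucb_score(0, 2, 10, k)    # the 0/2 node from the slide
```

With k = 1 the well-visited 5/7 node still wins; shrinking k favors exploitation, growing it favors exploration.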
Monte-Carlo Tree Search
- parallelization: multi-core, message passing, tentative: GPGPU
- frugality + consistency
- automatic parameter tuning (expensive optimization - DPS)
- combination: expert rules + supervised learning
- learning-based Monte-Carlo
- patterns (non-regression GP)
- applications far from games (unstructured problems): active learning + non-linear optimization, Spiral, starting: energy + robotics
Outline
Parameter tuning
Bandits
Our testbed
Results
A bit of theory: Bonferroni statistical confidence bounds

A statistical test is as follows:
- I run n tests
- I average the results
- I choose a risk level delta
- Statistical test: a little calculus (Hoeffding's bound) says that I have precision sqrt(log(2/delta)/(2n)).

If there are K tests, my probability of being wrong is multiplied by K!

A statistical test with Bonferroni correction is as follows:
- I run n tests for each of K cases
- I average the results
- I choose a risk level delta/K
- Statistical test: a little calculus says that I have precision sqrt(log(2K/delta)/(2n)).
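The bound can be written out directly (my reconstruction via Hoeffding's inequality for results in [0,1]; the slide's exact formula was not in the transcript). The Bonferroni correction just replaces the risk delta by delta/K, which only widens the bound logarithmically in K:

```python
import math

def precision(n, delta):
    """Hoeffding half-width: with probability >= 1 - delta, the average
    of n results in [0,1] is within this distance of its expectation."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

def bonferroni_precision(n, delta, K):
    """Same bound, valid simultaneously for K tests: risk delta/K each."""
    return precision(n, delta / K)

eps1 = precision(1000, 0.05)                   # one test
epsK = bonferroni_precision(1000, 0.05, 100)   # 100 simultaneous tests
```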
Results

If the rule says yes, uniform sampling should be ok; otherwise try UCB (and be lucky).

We've been lucky: one of our test cases gives yes, the other gives no. We did not even have to cheat for having this.
First case: blitz games, nearly fast experiments. Second case: real games, very expensive optimization.

In the first case, uniform sampling is really convincing.
In the second case, uniform sampling is only moderately convincing. Yet, the empirical best arm was ok (a bit better with UCB than with uniform).
Conclusion
What do we propose?
- A simple mathematically derived rule for predicting whether we can trust uniform sampling.
  * If the answer is yes: easy parallelization, and statistical validation is naturally included (Bonferroni correction).
  * If the answer is no: uniform can't work. Maybe UCB or successive rejects. Be lucky.
- Experiments on MCTS (subliminal message: MCTS is great).
- Remark: take care of Bonferroni corrections. Much better for the regression testing of non-deterministic codes/data.