Online Learning and Regret - University of Edinburgh
TRANSCRIPT
Decision Making in Robots and Autonomous Agents
Online Learning and Regret
Subramanian Ramamoorthy, School of Informatics
3 March, 2015
Recap: Interpretation of MAB-Type Problems
Related to ‘rewards’
Recap: MAB as Special Case of Online Learning
Recap: How to Evaluate an Online Algorithm – Regret
• After you have played for T rounds, you experience a regret = [Reward sum of the optimal strategy] − [Sum of actually collected rewards]
• If the average regret per round goes to zero with probability 1, asymptotically, we say the strategy has the no-regret property ~ guaranteed to converge to an optimal strategy
• ε-greedy is sub-optimal (so has some regret).
$$\rho = T\mu^* - \sum_{t=1}^{T}\mathbb{E}\big[\hat{r}_t\big], \qquad \mu^* = \max_{k}\mu_k$$
(The expectation is over the randomness in the draw of rewards and in the player's strategy.)
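To make the regret metric concrete, here is a minimal simulation sketch (not from the slides; the arm means, ε, and horizon are illustrative choices) showing that ε-greedy's per-round regret stays bounded away from zero:

```python
import random

def eps_greedy_regret(means, eps=0.1, T=10000, seed=0):
    """Estimate per-round regret of epsilon-greedy on Bernoulli arms."""
    rng = random.Random(seed)
    K = len(means)
    counts = [0] * K        # pulls per arm
    values = [0.0] * K      # empirical mean reward per arm
    total_reward = 0.0
    for _ in range(T):
        # explore with probability eps, otherwise exploit the best estimate
        if rng.random() < eps:
            arm = rng.randrange(K)
        else:
            arm = max(range(K), key=lambda k: values[k])
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # running mean
        total_reward += reward
    # regret = T * mu_star - collected reward (cf. the formula above)
    return (T * max(means) - total_reward) / T

print(eps_greedy_regret([0.3, 0.5, 0.7]))  # stays bounded away from zero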
Solving MAB: Interval Estimation
• Attribute to each arm an "optimistic initial estimate" within a certain confidence interval
• Greedily choose the arm with the highest optimistic mean (upper bound of the confidence interval)
• An infrequently observed arm will have an over-valued reward mean, leading to exploration
• Frequent usage pushes the optimistic estimate towards the true value
Interval Estimation Procedure
• Associate to each arm a 100(1 − α)% upper bound on the reward mean
• Assume, e.g., that rewards are normally distributed
• An arm observed n times yields an empirical mean and standard deviation
• α-upper bound:
$$u_\alpha = \hat{\mu} + \frac{\hat{\sigma}}{\sqrt{n}}\, c^{-1}(1-\alpha), \qquad c(t) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{t}\exp\!\left(-\frac{x^2}{2}\right)dx$$
where c is the cumulative distribution function of the standard normal.
• If α is actively controlled, a zero-regret strategy is possible
– For general distributions, we don't know
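A minimal sketch of this rule under the slide's Gaussian assumption (function and variable names are mine; `NormalDist.inv_cdf` from the standard library supplies c⁻¹):

```python
import math
from statistics import NormalDist

def ie_choose_arm(rewards_per_arm, alpha=0.05):
    """Pick the arm with the highest alpha-upper bound on its mean reward,
    under the slide's assumption of normally distributed rewards."""
    z = NormalDist().inv_cdf(1 - alpha)      # z = c^{-1}(1 - alpha)
    best_arm, best_ucb = None, -math.inf
    for arm, rewards in enumerate(rewards_per_arm):
        n = len(rewards)
        if n < 2:
            return arm                       # unplayed arm: maximally optimistic
        mu = sum(rewards) / n                # empirical mean
        sigma = math.sqrt(sum((r - mu) ** 2 for r in rewards) / (n - 1))
        ucb = mu + z * sigma / math.sqrt(n)  # alpha-upper bound on the mean
        if ucb > best_ucb:
            best_arm, best_ucb = arm, ucb
    return best_arm

print(ie_choose_arm([[0.1, 0.4, 0.2], [0.5, 0.45, 0.55]]))
```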
Solving MAB: UCB Strategy
• Again based on the notion of an upper confidence bound, but more generally applicable
• Algorithm:
– Play each arm once
– At time t > K, play the arm i_t maximizing
$$\bar{r}_j(t) + \sqrt{\frac{2\ln t}{T_{j,t}}}$$
where $T_{j,t}$ is the number of times arm j has been played so far.
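A short runnable sketch of this algorithm (the bandit instance and names are illustrative):

```python
import math
import random

def ucb1(pull, K, T, seed=0):
    """UCB1: play each arm once, then play the arm maximizing
    empirical mean + sqrt(2 ln t / n_j)."""
    rng = random.Random(seed)
    counts = [0] * K
    means = [0.0] * K
    for t in range(1, T + 1):
        if t <= K:
            arm = t - 1       # initialisation: play each arm once
        else:
            arm = max(range(K), key=lambda j:
                      means[j] + math.sqrt(2 * math.log(t) / counts[j]))
        r = pull(arm, rng)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]
    return means, counts

# illustrative Bernoulli bandit: arm 2 is best and should get most pulls
true_means = [0.2, 0.5, 0.8]
pull = lambda a, rng: 1.0 if rng.random() < true_means[a] else 0.0
print(ucb1(pull, K=3, T=5000))
```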
UCB Strategy
Reminder: Chernoff-‐Hoeffding Bound
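The slide body is not reproduced in the transcript; for reference, the standard form of the bound (the one used in UCB analyses) is: if $X_1,\dots,X_n$ are i.i.d. random variables with values in $[0,1]$, mean $\mu$, and $\bar X = \frac{1}{n}\sum_i X_i$, then
$$\Pr\big(|\bar X - \mu| \ge \varepsilon\big) \le 2\,e^{-2n\varepsilon^2}.$$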
UCB Strategy – Behaviour
We will not try to prove the following result, but I quote (only FYI) the final statement to tell you why UCB may be a desirable strategy: its regret is bounded.
K = number of arms
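The quoted result is presumably the UCB1 guarantee of Auer, Cesa-Bianchi and Fischer (2002): after n plays, the expected regret is at most
$$8 \sum_{i:\,\mu_i<\mu^*} \frac{\ln n}{\Delta_i} \;+\; \left(1+\frac{\pi^2}{3}\right)\sum_{j=1}^{K} \Delta_j, \qquad \Delta_i = \mu^* - \mu_i,$$
i.e., regret grows only logarithmically in the number of plays.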
Variation on SoftMax:
• It is possible to drive regret down by annealing τ
• Exp3: Exponential-weight algorithm for exploration and exploitation
• The probability of choosing arm k (of K) at time t is
SoftMax:
$$P_t(a) = \frac{e^{Q_t(a)/\tau}}{\sum_{b=1}^{n} e^{Q_t(b)/\tau}}$$
Exp3:
$$P_k(t) = (1-\gamma)\,\frac{w_k(t)}{\sum_{j=1}^{K} w_j(t)} + \frac{\gamma}{K}$$
$$w_j(t+1) = \begin{cases} w_j(t)\,\exp\!\left(\dfrac{\gamma\, r_j(t)}{P_j(t)\,K}\right) & \text{if arm } j \text{ is pulled at } t \\[4pt] w_j(t) & \text{otherwise} \end{cases}$$
$$\text{Regret} \approx O\!\left(\sqrt{T K \log K}\right)$$
γ is a user-defined open parameter
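A runnable sketch of Exp3 as written on the slide (bandit instance and names are illustrative; rewards are assumed to lie in [0, 1]):

```python
import math
import random

def exp3(pull, K, T, gamma=0.1, seed=0):
    """Exp3 as on the slide: exponential weights mixed with uniform
    exploration; assumes rewards lie in [0, 1]."""
    rng = random.Random(seed)
    w = [1.0] * K
    total = 0.0
    for _ in range(T):
        sw = sum(w)
        probs = [(1 - gamma) * w[j] / sw + gamma / K for j in range(K)]
        arm = rng.choices(range(K), weights=probs)[0]
        r = pull(arm, rng)
        total += r
        # importance-weighted update, applied only to the pulled arm
        w[arm] *= math.exp(gamma * r / (probs[arm] * K))
    return total

true_means = [0.2, 0.5, 0.8]          # illustrative Bernoulli arms
pull = lambda a, rng: 1.0 if rng.random() < true_means[a] else 0.0
print(exp3(pull, K=3, T=5000))
```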
Solving MAB: Gittins Index
• Each arm delivers reward with a probability
• This probability may change through time, but only when the arm is pulled
• Goal is to maximize discounted rewards – the future is discounted by an exponential discount factor δ < 1
• The structure of the problem is such that all you need to do is compute an "index" for each arm and play the one with the highest index (there is a rich theory explaining why)
• The index is of the form:
$$\nu_i = \sup_{T>0} \frac{\sum_{t=0}^{T} \delta^t\, r(t)}{\sum_{t=0}^{T} \delta^t}$$
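The formula can be evaluated directly for a known reward stream; a toy sketch (my own illustration; the true Gittins index takes expectations over beliefs about the arm's reward process):

```python
def gittins_like_index(rewards, delta=0.9):
    """Evaluate sup over horizons T of (sum_t delta^t r(t)) / (sum_t delta^t)
    for a known reward sequence. (Toy version: the true Gittins index takes
    expectations over the arm's stochastic reward process.)"""
    best = float("-inf")
    num = den = 0.0
    for t, r in enumerate(rewards):
        num += delta ** t * r
        den += delta ** t
        best = max(best, num / den)
    return best

print(gittins_like_index([0.0, 1.0, 1.0, 0.0]))  # best horizon keeps the two 1's, drops the final 0
```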
Gittins Index – Intuition
• Proving optimality isn't within our scope; it is based on the notion of a stopping time: the point where you should terminate a bandit
• Nice property: the Gittins index for any given bandit is independent of the expected outcomes of all other bandits
– Once you have a good arm, keep playing until there is a better one
– If you add/remove machines, the computation doesn't really change
• BUT:
– hard to compute, even when you know the distributions
– exploration issues; arms aren't updated unless used (restless bandits?)
Numerous Applications!
Equilibria
• "… as central to the study of social systems as it is to the analysis of physical phenomena. In the physical world, equilibrium results from the balancing of forces. In societies it results from the balancing of intentions." (H. Peyton Young)
• Classical mechanics has both an equilibrium and a non-equilibrium description of motion
• What about a non-equilibrium study of strategic interactions?
Non-Equilibrium Study of Strategic Interactions
• Perhaps the nearest thing is Bayesian decision theory. If individuals can imagine:
– Future states of the world
– All possible changes in behaviour, by all individuals, over all possible sequences of states
• As conditions unfold, they update beliefs and optimize expected future payoffs
• If their beliefs put positive probability on the strategies their opponents are actually using, then beliefs and behaviours will gradually come into alignment, and equilibrium or something close to it will obtain
Points to Ponder
• The issue with this high-rationality viewpoint:
– Individuals need sophistication, i.e., reasoning power
– Can all possible futures actually be anticipated?
• A peculiarity of social systems (versus physical systems): individuals are learning about a process in which others are also learning (self-referential)
• When the observer is part of the system, the act of learning changes the thing to be learned
Example Application: Choosing Interfaces
Choose parameters that users can work with …
… a continual process!
Example Application: Adapting Interfaces
Some tasks permit incredible variety… which can be used to adapt online to individual users.
[Source: http://spyrestudios.com]
Example Application: Optimizing with a Moving Target
User performance is highly contingent on their experiences – on the paths they take in an interface landscape.
Simple(st) Example – Uncertain Game
• Soda Game
• Players know their own payoff but have no knowledge of the other player (not even, as in Bayesian games, distributions).
• Imagine you are the row player and you have observed:

Payoff:  0 0 0 1 1 0 0 0 1 0 0
Row:     L R L L R R L R R R R  ?
Column:  R L R L R L R L R L L  ?

What should you do in the next time period?
         L                  R
L   Coke, Coke         Sprite, Seven-up
R   Seven-up, Sprite   Pepsi, Pepsi
What is the Nature of Uncertainty Here?
• We do not know what kind of game we are facing
• If both of us prefer "dark" drinks to "light" drinks, or vice versa, it is a coordination game (three equilibria: two pure and one mixed)
• If one of us prefers dark and the other prefers light, it is like matching pennies: a unique mixed equilibrium
A Thorny Problem (Foster & Young, 2001)
Imagine a game constructed as follows: entries in the payoff matrix are determined by independent draws from a normal distribution – once, at the beginning
• With rational Bayesian players who have a prior over the opponent's strategy space guided by a commonly known payoff distribution
• It can be shown that, under any pair of priors, the players will fail to learn the Nash equilibrium with positive probability
• There may be no priors that satisfy the necessary condition of "absolute continuity" (i.e., that players' prior beliefs capture the set of actual play paths with positive probability)
– Need great care in analyzing learning procedures…
… and we have not even mentioned computational cost yet
Model: "Reinforcement" Learning
• Firstly, note that here the term is used slightly differently from what you may be used to!
• At each time period t, a subject chooses action x from a finite set X; Nature/an external subject chooses action y from a set Y
• Realized payoff is u(x_t, y_t); this is assumed time-independent
• We define another variable, θ, to model the subject's propensity to play action x at time t. So the probability of an action is as sketched below.
• Let q_t and θ_t represent k-dimensional vectors
• Learning: how do the propensities evolve over time?
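The slide's formula is not reproduced in the transcript; the standard propensity-matching form, consistent with the cumulative-payoff slides that follow, is presumably
$$q_t(x) = \Pr(x_t = x) = \frac{\theta_t(x)}{\sum_{x' \in X} \theta_t(x')},$$
i.e., actions are chosen with probability proportional to their current propensities.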
"Matching" Payoffs
• Define a random unit vector that acts as an indicator variable
• A linear updating model for propensities (u is the payoff)
• A simpler update rule
– the slide equations are reconstructed after this list
[Slide annotations on the update equation: discount factor, payoff, random perturbations]
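The update equations themselves are missing from the transcript; a plausible reconstruction, consistent with the annotations above (discount factor λ, payoff u, random perturbations ε_t) and with the Roth-Erev model described later, is: let $e_t$ be the random unit vector with $e_t(x) = 1$ if $x = x_t$ and $0$ otherwise; then
$$\theta_{t+1} = \lambda\,\theta_t + u(x_t, y_t)\,e_t + \varepsilon_t \quad \text{(linear updating model)},$$
$$\theta_{t+1} = \theta_t + u(x_t, y_t)\,e_t \quad \text{(simpler, undiscounted rule)}.$$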
Cumulative Payoff Matching
• Cumulative payoffs up to time t:
• Sum of initial propensities is:
• Define a new quantity:
• So the change in the probability of an action, per period, is:
• The denominator is unbounded, so eventually this curve flattens out – the power law of practice
(The missing quantities are reconstructed below.)
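A hedged reconstruction of the missing formulas: with cumulative payoff $U_t = \sum_{s \le t} u(x_s, y_s)$ and sum of initial propensities $v_0 = \sum_x \theta_0(x)$, the undiscounted update gives
$$q_{t+1}(x) = \frac{\theta_t(x)}{v_0 + U_t},$$
so the per-period change in $q(x)$ is of order $u_t/(v_0 + U_t)$; the denominator grows without bound, which is why the learning curve flattens out.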
Roth-Erev RL Model
• Past payoffs are discounted at a constant geometric rate λ < 1, and in each period there are random perturbations or "trembles"
• The marginal impact of period-by-period payoffs levels off eventually, as the denominator is bounded above
• Another interpretation is in terms of aspiration levels (reinforce an action if its current payoff exceeds the aspiration)
Empirical Plausibility
• Many predictions of such models are observed in practice
– Recency phenomena: recent payoffs tend to matter more than long-past ones
– Habit formation: cumulative payoffs matter in addition to the average payoff of an action
• However, real human behaviour may not be restricted to simple rules like this.
• On a hierarchy of learning rules, these "RL" rules fall at the lower end of the spectrum
– Behaviour depends solely on summary statistics of players' payoffs
What is Captured in this Type of RL
Despite their simplicity, these rules already capture some important qualitative features that are shared with other learning methods as well:
1. Probabilistic choice: the subject's choice depends on history and a random component, which could be due to
• Unmodeled behaviour
• Deliberate experimentation
• An intentional strategy to keep the opponent guessing
2. Sluggish adaptation: strong serial correlation between probability distributions in successive periods
What Other Ways Are There to Learn?
• Examples: no-regret learning, smoothed fictitious play, hypothesis testing with smoothed better responses
• Bayesian rational learning does not share all of the similarities on the previous slide:
– Unless perfectly indifferent between actions, a Bayesian should prefer pure over mixed strategies
– Optimum behaviour is sensitive to small changes in beliefs, so one can see frequent and radical changes in behaviour
Test: Learning in Stationary Environments
• RL presumes no mental model of the world and other agents
• Does it still lead to optimal behaviour against the subject's environment?
– Convergence to Nash equilibrium may be a tall order
– What happens in a stationary (stochastic) environment?
• History:
• Behaviour strategy or 'response rule':
• This gives the conditional probability of an action:
• Assume Nature plays according to a fixed rule
(The notation is reconstructed below.)
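The notation is not shown in the transcript; plausibly (following Young's treatment): a history is $h_t = (x_1, y_1, \dots, x_t, y_t)$; a response rule is a map $g$ from histories to mixed actions, $g: H \to \Delta(X)$; the conditional probability of an action is $\Pr(x_{t+1} = x \mid h_t) = g(x \mid h_t)$; and Nature plays i.i.d. draws from a fixed distribution $q^* \in \Delta(Y)$.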
Learning in a Stationary Environment
• The combination of g and q* leads to a stochastic process, with realizations from Ω
• Let B(q*) denote the subset of actions in X that maximize the player's expected payoff against q*
• We say that g is optimal against q* if:
• The rule g is optimal against a stationary distribution if the above holds for every q*
– similar to the equilibrium definition (but against a fixed distribution)
Result: Stationary Environment
Theorem: Given any finite action sets X and Y, cumulative payoff matching on X is optimal against every stationary distribution on Y
• In general games, this kind of statement is hard to make
– The proof of this seemingly simple statement relies on stochastic approximation theory
– Analysis under varying distributions is hard!
• In zero-sum games, CPM converges, with probability 1, to a Nash equilibrium
What Next?
• Simple reinforcement rules such as CPM omit any mention of the cognitive process
• What other kinds of criteria might subjects bring in?
1. Pattern of past play: predict the opponent's next action based on what has happened so far and choose actions to maximize expected payoffs
2. Past payoffs: could we have done better by playing differently in the past?
• No predictive behavioural model; subjects simply want to minimize ex post regret
Regret
• Consider the simple game of choosing soft drinks:

Payoff:  0 0 0 1 1 0 0 0 1 0 0
Row:     L R L L R R L R R R R  ?
Column:  R L R L R L R L R L L  ?

• Imagine you are allowed to replay the game, but you must do so by choosing the same action in every period (this is the hypothesis class against which you evaluate).
• We do not really know what the opponent would have done if we changed our play, but we do have the realized performance, so we ask with respect to this
– If you had just played R, the payoff would be 5 (for L, 6); the actual collected payoff was 3
– Average regret from not playing all L: 3/11; from not playing all R: 2/11
Regret
• Average payoff through to time t:
• For each action x, define the average regret from not having played x as:
• We have a vector of regrets,
• A given realization of play has no regret if:
• A behavioural rule g has no regret if, given a pre-specified infinite sequence of play by Nature, (y1, y2, …), almost all realizations ω generated by g satisfy the above condition
(The missing formulas are reconstructed below.)
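The missing formulas, reconstructed from the surrounding text: average payoff $\bar u_t = \frac{1}{t}\sum_{s=1}^t u(x_s, y_s)$; the average regret from not having played $x$ is
$$R_t(x) = \frac{1}{t}\sum_{s=1}^{t} u(x, y_s) - \bar u_t;$$
the regret vector is $R_t = (R_t(x))_{x \in X}$; and a realization of play has no regret if $\limsup_{t\to\infty} \max_x R_t(x) \le 0$.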
Regret Matching
• Many variations on learning using regret exist. A simple and appealing rule due to Hart and Mas-Colell is the following
• In each period t+1, the decision maker plays each action with probability proportional to the non-negative part of his regret up to that time
• If the regret for R is 2/11 and for L is 3/11, then under regret matching the Row player chooses R or L with probability 2/5 and 3/5 respectively (at t = 12 in our previous example)
– Under CPM, R would have been chosen more often than L
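A minimal sketch of this rule (not from the slides; the payoff function, opponent sequence, and names are illustrative) playing against an observed sequence of Nature's actions:

```python
import random

def regret_matching(u, X, nature_seq, seed=0):
    """Hart & Mas-Colell regret matching against an observed sequence of
    Nature's actions: play each action with probability proportional to
    the non-negative part of its cumulative regret."""
    rng = random.Random(seed)
    regret = {x: 0.0 for x in X}
    for y in nature_seq:
        pos = [max(regret[x], 0.0) for x in X]
        if sum(pos) > 0:
            action = rng.choices(list(X), weights=pos)[0]
        else:
            action = rng.choice(list(X))   # no positive regret: play uniformly
        payoff = u(action, y)
        for x in X:
            regret[x] += u(x, y) - payoff  # foregone minus realized payoff
    return regret

# toy 'matching' payoff against an opponent biased towards L
u = lambda x, y: 1.0 if x == y else 0.0
opponent = random.Random(1)
seq = [opponent.choice("LLR") for _ in range(300)]
print(regret_matching(u, "LR", seq))
```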
Regret Matching with ε-Experimentation
Can one do this without even knowing the opponent's actions?
– The subject experiments, randomly, with small probability ε
– When not experimenting, he employs regret matching with the following modification:
• "Estimated" regret for action x is its average payoff in previous periods when he experimented and chose x MINUS the average realized payoff over all actions in all previous periods
Theorem: In a finite game against Nature, given δ > 0, for all sufficiently small ε > 0, regret matching with ε-experimentation has at most δ-regret against every sequence of play by Nature.
Why Does Regret Matching Work?
• Player X has two actions, {1, 2}
• Average per-period payoff:
• If he had just played action 1:
• Regret:
• We want the non-negative part of the regret, denoted R_t^+, to vanish almost surely
(The formulas are reconstructed below.)
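A reconstruction of the missing expressions: $\bar u_t = \frac{1}{t}\sum_{s\le t} u(x_s, y_s)$; had he always played action 1, $\bar u_t^1 = \frac{1}{t}\sum_{s\le t} u(1, y_s)$; the regret vector is $R_t = (\bar u_t^1 - \bar u_t,\; \bar u_t^2 - \bar u_t)$; and we want $\max\big(R_t(1), R_t(2)\big)^+ \to 0$ almost surely, where $R_t^+$ denotes the non-negative part.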
How Does Regret Matching Work?
• In period t+1, the opponent takes an unforeseen action
– Irrespective of what that action will be, the next-period regret from playing action 1 is the negative of that corresponding to action 2
– The incremental regret is of the form (α_{t+1}, −α_{t+1}) for an unknown α_{t+1}
• Let us say one is following a mixed strategy,
• Expected incremental regret with respect to this strategy,
• Weighted over time,
(One way to reconstruct the algebra is sketched below.)
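A plausible reconstruction of the missing algebra: write $\alpha_{t+1} = u(1, y_{t+1}) - u(2, y_{t+1})$, unknown before the opponent moves. If the mixed strategy plays action 1 with probability $p$, the expected increment to the regret vector (relative to the player's own expected payoff) is
$$\mathbb{E}[\Delta R_{t+1}] = \big((1-p)\,\alpha_{t+1},\; -p\,\alpha_{t+1}\big),$$
and the regret vector over time is the weighted average of these increments.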
Regret Matching Procedure
• The goal is to choose the play probability so as to control the expected incremental regret
• This is the same as making sure that it is orthogonal to the (non-negative part of the) current regret vector
• This implies a choice of probability,
• which implies the regret matching rule (reconstructed below).
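Continuing the reconstruction: requiring the expected increment to be orthogonal to the positive part of the current regret vector gives
$$R_t^+(1)\,(1-p)\,\alpha - R_t^+(2)\,p\,\alpha = 0 \quad\Longrightarrow\quad p = \frac{R_t^+(1)}{R_t^+(1) + R_t^+(2)},$$
independently of the unknown $\alpha$: play each action with probability proportional to its positive regret, which is exactly regret matching. (Blackwell's approachability argument then drives the regret vector towards the non-positive orthant.)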
Conditional or Internal Regret
• There exists a pair of actions x, y such that playing x would have yielded a higher total payoff over all periods when the subject actually played y
– e.g., one may not have done better with an unconditional switch to an action such as 'wear blue'
– the conditional statement is that she could have done better by wearing blue whenever she had instead worn black
• Given a play path ω, the player's conditional regret matrix at time t is a matrix R_t(ω) such that (see the reconstruction below)
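The defining formula is missing from the transcript; a reconstruction that matches the worked example on the next slide is
$$R_t(\omega)[y, x] = \frac{1}{t}\sum_{s \le t:\; x_s = y} \big[u(x, y_s) - u(y, y_s)\big],$$
the average gain from having played x in every period where y was actually played.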
Shapley's Game
Consider the following game:

      R     Y     B
R   1,0   0,0   0,1
Y   0,1   1,0   0,0
B   0,0   0,1   1,0

Suppose we have a history of play over 10 periods:

Payoff:  1 0 0 0 1 0 0 0 0 0
Row:     R R B B B Y Y R Y R
Column:  R B Y Y B R R Y R Y
Shapley's Game – Conditional Regret
• Adopt the perspective of the Row player; at the end of ten periods his conditional regret matrix is

       R     Y     B
R      0   0.1     0
Y    0.3     0     0
B   -0.1   0.1     0

• If Row had played R in the three periods when he actually played Y, his total for those periods would have been 3 instead of 0. So the average conditional regret in cell (Y, R) is 3/10
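The matrix above can be checked mechanically; a small sketch (function and variable names are mine) computing the conditional regret matrix from the history:

```python
def conditional_regret(u, X, rows, cols):
    """Average conditional regret: entry (y, x) is the mean gain from having
    played x in every period where y was actually played."""
    t = len(rows)
    R = {(y, x): 0.0 for y in X for x in X}
    for played, opp in zip(rows, cols):
        for x in X:
            R[(played, x)] += (u(x, opp) - u(played, opp)) / t
    return R

# Row's payoff in Shapley's game above: 1 iff his action matches Column's
u = lambda x, y: 1.0 if x == y else 0.0
R = conditional_regret(u, "RYB", rows="RRBBBYYRYR", cols="RBYYBRRYRY")
print(R[("Y", "R")], R[("R", "Y")], R[("B", "R")])  # 0.3, 0.1, -0.1 (up to float rounding)
```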
Learning with Conditional Regret
Fact: There exist learning rules that eliminate conditional regrets no matter what Nature does (Foster & Vohra, 1997)
These rules are of the general form: reinforcement increments δ_t are computed (e.g., linear-algebraically) from the conditional regret matrix.
The proof is based on a celebrated result called Blackwell's Approachability Theorem.
Calibration
(P. Dawid) A sequence of binary forecasts is calibrated if, in all those periods when the forecaster predicts that event "1" will occur with probability p, the empirical frequency of 1's in those periods is in fact p
A similar definition applies to arbitrary symbols being forecast, e.g., real-valued predictions, but the definition is more intricate in its formulation…
Example: Bridge Contracts (Keren 1987)
Example: Physicians (Christensen-‐Szalanski et al.)
Why might this bias make sense?
Random Forecasting Rules
• The forecast equivalent of a randomized action choice
• A rule of the form (z is a random variable):
• F is calibrated if, for every ω, the following calibration score goes to zero almost surely along the player's sequence of forecasts
[Slide annotations on the score: n_t(p) = number of times p was forecast up to t; ρ_t(p) = empirical distribution of outcomes when p was forecast]
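A hedged reconstruction of the score from the annotations above:
$$C_t = \sum_{p} \frac{n_t(p)}{t}\,\big\|\rho_t(p) - p\big\|^2,$$
where $n_t(p)$ is the number of times $p$ was forecast up to time $t$ and $\rho_t(p)$ is the empirical distribution of outcomes in those periods; calibration requires $C_t \to 0$ almost surely.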
Calibrated Forecasters
Given any finite set Z and ε > 0, there exist random forecasting rules that are ε-calibrated for all sequences on Z
Theorem: Let G be a finite game. Suppose every player uses a calibrated forecasting rule and chooses a myopic best response to his forecast. Then the empirical frequency distribution of play converges with probability one to the set of correlated equilibria of G
Takeaway Messages
• Equilibrium is a nice concept, but in life a lot of the real action is off equilibrium
• How do people get to equilibria?
• What happens if everyone is learning, groping their way towards some notion of 'equilibrium'?
• This area has many counter-intuitive results
– 'Perfect' Bayesian learning is not always so
– Simple learning rules give surprisingly useful behaviour
– Notions such as regret enable learning despite limits to modelling of the underlying process
• Many algorithms, such as regret matching and calibrated forecasting, represent ways to get to equilibrium