Online Learning and Regret - University of Edinburgh

Decision Making in Robots and Autonomous Agents: Online Learning and Regret
Subramanian Ramamoorthy, School of Informatics
3 March, 2015


TRANSCRIPT

Page 1: Online Learning and Regret - University of Edinburgh

Decision Making in Robots and Autonomous Agents

Online Learning and Regret

Subramanian Ramamoorthy
School of Informatics

3 March, 2015

Page 2: Online Learning and Regret - University of Edinburgh

Recap: Interpretation of MAB Type Problems

[Figure not reproduced; its annotation reads "Related to 'rewards'".]

Page 3: Online Learning and Regret - University of Edinburgh

Recap: MAB as Special Case of Online Learning

Page 4: Online Learning and Regret - University of Edinburgh

Recap: How to Evaluate an Online Algorithm - Regret

•  After you have played for T rounds, you experience a regret = [reward sum of the optimal strategy] - [sum of the actually collected rewards]

•  If the average regret per round goes to zero with probability 1, asymptotically, we say the strategy has the no-regret property, i.e., it is guaranteed to converge to an optimal strategy

•  ε-greedy is sub-optimal (so has some regret).

$\rho = T\mu^{*} - \sum_{t=1}^{T} \hat{r}_t, \qquad \mu^{*} = \max_{k} \mu_{k}$

(Randomness enters through the draw of rewards and through the player's strategy.)

Page 5: Online Learning and Regret - University of Edinburgh

Solving MAB: Interval Estimation

•  Attribute to each arm an "optimistic initial estimate" within a certain confidence interval

•  Greedily choose the arm with the highest optimistic mean (the upper bound of the confidence interval)

•  An infrequently observed arm will have an over-valued reward mean, leading to exploration

•  Frequent usage pushes the optimistic estimate towards the true value

Page 6: Online Learning and Regret - University of Edinburgh

Interval Estimation Procedure

•  Associate with each arm a 100(1 - α)% upper band on its reward mean

•  Assume, e.g., that rewards are normally distributed
•  The arm is observed n times, yielding an empirical mean and standard deviation
•  The α-upper bound is:

$u_{\alpha} = \hat{\mu} + \frac{\hat{\sigma}}{\sqrt{n}}\, c^{-1}(1-\alpha)$, where $c(t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{t} \exp\!\left(-\frac{x^2}{2}\right) dx$ is the cumulative distribution function of the standard normal

•  If α is actively controlled, this can be a zero-regret strategy
   –  For general distributions, we don't know
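A minimal Python sketch of the procedure above, assuming normally distributed rewards and using the standard-normal inverse CDF for the α-upper bound; the two Gaussian arms at the bottom are illustrative assumptions, not from the slides.

```python
from statistics import NormalDist, mean, stdev
import random

def upper_bound(rewards, alpha):
    """100(1 - alpha)% upper band on the arm's mean under a normal assumption."""
    n = len(rewards)
    mu, sigma = mean(rewards), (stdev(rewards) if n > 1 else 1.0)
    return mu + sigma * NormalDist().inv_cdf(1 - alpha) / n ** 0.5

def interval_estimation(arms, horizon, alpha=0.05):
    history = [[arm()] for arm in arms]          # pull each arm once to initialise
    for _ in range(horizon - len(arms)):
        # greedily pull the arm with the highest optimistic upper bound
        j = max(range(len(arms)), key=lambda a: upper_bound(history[a], alpha))
        history[j].append(arms[j]())
    return [len(h) for h in history]             # pull counts per arm

# Illustrative arms: the second has the higher true mean and should dominate.
arms = [lambda: random.gauss(0.3, 1.0), lambda: random.gauss(0.6, 1.0)]
print(interval_estimation(arms, 2_000))
```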

Page 7: Online Learning and Regret - University of Edinburgh

Solving MAB: UCB Strategy

•  Again based on the notion of an upper confidence bound, but more generally applicable

•  Algorithm:
   –  Play each arm once
   –  At time t > K, play the arm i_t that maximizes, over arms j,

$\hat{r}_j(t) + \sqrt{\dfrac{2 \ln t}{T_{j,t}}}$, where $T_{j,t}$ is the number of times arm j has been played so far
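A minimal Python sketch of the UCB1-style rule above; the Bernoulli reward probabilities in the example are illustrative assumptions, not from the slides.

```python
import math, random

def ucb(true_means, horizon):
    K = len(true_means)
    counts = [0] * K          # T_{j,t}: times arm j has been played
    means = [0.0] * K         # empirical mean reward of arm j
    for t in range(1, horizon + 1):
        if t <= K:
            j = t - 1                                   # play each arm once
        else:                                           # then maximise the UCB score
            j = max(range(K),
                    key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]))
        r = 1.0 if random.random() < true_means[j] else 0.0
        counts[j] += 1
        means[j] += (r - means[j]) / counts[j]          # incremental mean update
    return counts

print(ucb([0.2, 0.5, 0.8], 10_000))   # most pulls should go to the 0.8 arm
```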

Page 8: Online Learning and Regret - University of Edinburgh

UCB Strategy

Page 9: Online Learning and Regret - University of Edinburgh

Reminder: Chernoff-Hoeffding Bound
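The statement on this slide did not survive extraction; for reference, a standard form of the bound (the one used in UCB analyses) is: for i.i.d. random variables $X_1, \dots, X_n \in [0,1]$ with mean $\mu$ and empirical mean $\bar{X}_n$,

$P\!\left(\bar{X}_n \ge \mu + \epsilon\right) \le e^{-2 n \epsilon^2}, \qquad P\!\left(\bar{X}_n \le \mu - \epsilon\right) \le e^{-2 n \epsilon^2}.$

The slide's exact statement may differ in form.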

Page 10: Online Learning and Regret - University of Edinburgh

UCB Strategy - Behaviour

We will not try to prove the following result, but I quote the final statement (FYI only) to tell you why UCB may be a desirable strategy: its regret is bounded.

K = number of arms

Page 11: Online Learning and Regret - University of Edinburgh

Variation on SoftMax: Exp3

•  It is possible to drive regret down by annealing τ in the SoftMax rule

$P(a) = \dfrac{e^{Q_t(a)/\tau}}{\sum_{b=1}^{n} e^{Q_t(b)/\tau}}$

•  Exp3: the Exponential-weight algorithm for exploration and exploitation
•  The probability of choosing arm k (of K) at time t is

$P_k(t) = (1-\gamma)\, \dfrac{w_k(t)}{\sum_{j} w_j(t)} + \dfrac{\gamma}{K}$

with weights updated as

$w_j(t+1) = w_j(t)\, \exp\!\left(\dfrac{\gamma\, r_j(t)}{P_j(t)\, K}\right)$ if arm j is pulled at t, and $w_j(t+1) = w_j(t)$ otherwise,

where γ is a user-defined free parameter. The resulting regret is $O\!\left(\sqrt{K T \log K}\right)$.
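A minimal Python sketch of the Exp3 update above; the Bernoulli reward probabilities, horizon, and γ value are illustrative assumptions.

```python
import math, random

def exp3(reward_probs, horizon, gamma=0.1):
    K = len(reward_probs)
    w = [1.0] * K                     # one weight per arm
    pulls = [0] * K
    for _ in range(horizon):
        total = sum(w)
        p = [(1 - gamma) * w[j] / total + gamma / K for j in range(K)]
        k = random.choices(range(K), weights=p)[0]
        r = 1.0 if random.random() < reward_probs[k] else 0.0
        w[k] *= math.exp(gamma * r / (p[k] * K))   # only the pulled arm's weight changes
        m = max(w)
        w = [x / m for x in w]                     # rescale to avoid overflow (ratios unchanged)
        pulls[k] += 1
    return pulls

print(exp3([0.2, 0.5, 0.8], 10_000))               # the 0.8 arm should dominate
```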

Page 12: Online Learning and Regret - University of Edinburgh

Solving MAB: Gittins Index

•  Each arm delivers a reward with some probability
•  This probability may change through time, but only when the arm is pulled
•  The goal is to maximize discounted rewards; the future is discounted by an exponential discount factor δ < 1

•  The structure of the problem is such that all you need to do is compute an "index" for each arm and play the one with the highest index (there is a rich theory explaining why)

•  The index is of the form:

$\nu_i = \sup_{T>0} \dfrac{\sum_{t=0}^{T} \delta^{t}\, r(t)}{\sum_{t=0}^{T} \delta^{t}}$
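A minimal Python sketch that simply evaluates the index expression above on a single, already-known reward sequence; the actual Gittins index takes expectations over stopping times under the arm's reward process, which is exactly what makes it hard to compute (as the next slide notes).

```python
def index_on_sequence(rewards, delta=0.9):
    """Sup over truncation times T of the discounted-reward ratio above."""
    best, num, den = float("-inf"), 0.0, 0.0
    for t, r in enumerate(rewards):
        num += delta ** t * r          # sum_t delta^t r(t)
        den += delta ** t              # sum_t delta^t
        best = max(best, num / den)
    return best

print(index_on_sequence([0, 1, 1, 0, 1]))   # illustrative reward sequence
```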

Page 13: Online Learning and Regret - University of Edinburgh

Gittins Index - Intuition

•  Proving optimality isn't within our scope; the proof is based on the notion of a stopping time: the point at which you should terminate a bandit

•  Nice property: the Gittins index of any given bandit is independent of the expected outcomes of all other bandits
   –  Once you have a good arm, keep playing until there is a better one
   –  If you add/remove machines, the computation doesn't really change

•  BUT:
   –  it is hard to compute, even when you know the distributions
   –  there are exploration issues; arms aren't updated unless used (restless bandits?)

Page 14: Online Learning and Regret - University of Edinburgh

Numerous Applications!

Page 15: Online Learning and Regret - University of Edinburgh

Equilibria

•  "… as central to the study of social systems as it is to the analysis of physical phenomena. In the physical world, equilibrium results from the balancing of forces. In societies it results from the balancing of intentions." (H. Peyton Young)

•  Classical mechanics has both an equilibrium and a non-equilibrium description of motion

•  What about a non-equilibrium study of strategic interactions?

Page 16: Online Learning and Regret - University of Edinburgh

Non-Equilibrium Study of Strategic Interactions

•  Perhaps the nearest thing is Bayesian decision theory. If individuals can imagine:
   –  Future states of the world
   –  All possible changes in behaviour, by all individuals, over all possible sequences of states

•  As conditions unfold, they update beliefs and optimize expected future payoffs

•  If their beliefs put positive probability on the strategies their opponents are actually using, then beliefs and behaviours will gradually come into alignment, and equilibrium (or something close to it) will obtain

Page 17: Online Learning and Regret - University of Edinburgh

Points to Ponder

•  The issues with this high-rationality viewpoint:
   –  Individuals need sophistication, i.e., reasoning power
   –  Can all possible futures actually be anticipated?

•  A peculiarity of social systems (versus physical systems): individuals are learning about a process in which others are also learning (self-referential)

•  When the observer is part of the system, the act of learning changes the thing to be learned

Page 18: Online Learning and Regret - University of Edinburgh

Example Application: Choosing Interfaces

Choose parameters that users can work with …

… a continual process!

Page 19: Online Learning and Regret - University of Edinburgh

Example Application: Adapting Interfaces

Some tasks permit incredible variety…

… which can be used to adapt online to individual users.

[Source: http://spyrestudios.com]

Page 20: Online Learning and Regret - University of Edinburgh

Example Application: Optimizing with a Moving Target

User performance is highly contingent on their experiences: on the paths they take in an interface landscape.

Page 21: Online Learning and Regret - University of Edinburgh

Simple(st) Example - Uncertain Game

•  The Soda Game, with the payoff matrix shown below

•  Players know their own payoffs but have no knowledge of the other player (not even, as in Bayesian games, a distribution).

•  Imagine you are the row player and you have observed:

   (Payoff)   0   0   0   1   1   0   0   0   1   0   0
   Row        L   R   L   L   R   R   L   R   R   R   R   ?
   Column     R   L   R   L   R   L   R   L   R   L   L   ?

What should you do in the next time period?

          L                     R
   L    Coke, Coke         Sprite, Seven-up
   R    Seven-up, Sprite   Pepsi, Pepsi

Page 22: Online Learning and Regret - University of Edinburgh

What is the Nature of Uncertainty Here?

•  We do not know what kind of game we are facing

•  If both of us prefer "dark" drinks to "light" drinks, or vice versa, it is a coordination game (three equilibria: two pure and one mixed)

•  If one of us prefers dark and the other prefers light, it is like matching pennies, with a unique mixed equilibrium

Page 23: Online Learning and Regret - University of Edinburgh

A Thorny Problem (Foster & Young, 2001)

Imagine a game constructed as follows: entries in the payoff matrix are determined by independent draws from a normal distribution, once at the beginning.

•  Consider rational Bayesian players who have a prior over the opponent's strategy space, guided by the commonly known payoff distribution

•  It can be shown that, under any pair of priors, the players will fail to learn the Nash equilibrium with positive probability

•  There may be no priors that satisfy the necessary condition of "absolute continuity" (i.e., that players' prior beliefs assign positive probability to the set of actual play paths)
   –  Need great care in analyzing learning procedures…

… and we have not even mentioned computational cost yet

Page 24: Online Learning and Regret - University of Edinburgh

Model: "Reinforcement" Learning

•  Firstly, note that here the term is used slightly differently from what you may be used to!

•  At each time period t, a subject chooses an action from a finite set X; Nature/an external subject chooses an action y from a set Y

•  The realized payoff is u(x_t, y_t), assumed time-independent
•  We define another variable, θ, to model the subject's propensity to play action x at time t. So, the probability of an action is:

   Let q_t and θ_t represent the corresponding k-dimensional vectors
•  Learning: how do the propensities evolve over time?

Page 25: Online Learning and Regret - University of Edinburgh

"Matching" Payoffs

•  Define a random unit vector that acts as an indicator variable,

•  A linear updating model for propensities is (u is the payoff),

•  A simpler update rule is,

[Equations not reproduced; their annotations read "Discount factor", "Random perturbations", and "Payoff".]

Page 26: Online Learning and Regret - University of Edinburgh

Cumulative Payoff Matching

•  Cumulative payoffs up to time t:

•  Sum of initial propensities is:
•  Define a new quantity:

•  So that the change in probability of an action, per period, is:

•  The denominator is unbounded, so eventually this curve flattens out: the power law of practice
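The update equations on this slide did not survive extraction; the following minimal Python sketch assumes the standard cumulative-payoff-matching form (each action's propensity is its initial propensity plus the payoff accumulated when that action was played, and actions are chosen with probability proportional to propensity). The payoff function and initial propensity are illustrative.

```python
import random

def cpm(actions, payoff, horizon, initial=1.0):
    theta = {x: initial for x in actions}              # propensities
    for _ in range(horizon):
        total = sum(theta.values())
        weights = [theta[x] / total for x in actions]  # assumed: q_t(x) = theta_t(x) / sum
        x = random.choices(actions, weights=weights)[0]
        theta[x] += payoff(x)                          # reinforce the chosen action
    return theta

# Illustrative stationary environment: action "b" pays more on average,
# so its propensity (and hence its choice probability) comes to dominate.
print(cpm(["a", "b"], lambda x: random.random() * (2 if x == "b" else 1), 5_000))
```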

Page 27: Online Learning and Regret - University of Edinburgh

Roth-Erev RL Model

•  Past payoffs are discounted at a constant geometric rate λ < 1, and in each period there are random perturbations or "trembles"

•  The marginal impact of period-by-period payoffs eventually levels off, as the denominator is bounded above
•  Another interpretation is in terms of aspiration levels (reinforce an action if its current payoff exceeds the aspiration)

Page 28: Online Learning and Regret - University of Edinburgh

Empirical Plausibility

•  Many predictions of such models are observed in practice
   –  Recency phenomena: recent payoffs tend to matter more than long-past ones
   –  Habit forming: cumulative payoffs matter in addition to the average payoff of an action

•  However, real human behaviour may not be restricted to simple rules like this.

•  On a hierarchy of learning rules, these "RL" rules fall at the lower end of the spectrum
   –  Behaviour depends solely on summary statistics of players' payoffs

Page 29: Online Learning and Regret - University of Edinburgh

What is Captured in this Type of RL

Despite their simplicity, these rules already capture some important qualitative features that are shared with other learning methods as well:

1.  Probabilistic choice: a subject's choice depends on history plus a random component, which could be due to
    •  Unmodelled behaviour
    •  Deliberate experimentation
    •  An intentional strategy to keep the opponent guessing

2.  Sluggish adaptation: strong serial correlation between the probability distributions in successive periods

Page 30: Online Learning and Regret - University of Edinburgh

What Other Ways Are There to Learn?

•  Examples: no-regret learning, smoothed fictitious play, hypothesis testing with smoothed better responses

•  Bayesian rational learning does not share all of the features listed on the previous slide:
   –  Unless perfectly indifferent between actions, a Bayesian should prefer pure over mixed strategies
   –  Optimal behaviour is sensitive to small changes in beliefs, so one can see frequent and radical changes in behaviour

Page 31: Online Learning and Regret - University of Edinburgh

Test: Learning in Stationary Environments

•  RL presumes no mental model of the world and the other agents
•  Does it still lead to optimal behaviour against the subject's environment?
   –  Convergence to a Nash equilibrium may be a tall order
   –  What happens in a stationary (stochastic) environment?

•  History:
•  Behaviour strategy or 'response rule':
•  This gives the conditional probability of an action:
•  Assume Nature plays according to a fixed rule,

Page 32: Online Learning and Regret - University of Edinburgh

Learning in a Stationary Environment

•  The combination of g and q* leads to a stochastic process, with realizations from Ω
•  Let B(q*) denote the subset of actions in X that maximize the player's expected payoff against q*

•  We say that g is optimal against q* if
•  Rule g is optimal against a stationary distribution if the above holds for every q*
   –  similar to the equilibrium definition (but against a fixed distribution)

Page 33: Online Learning and Regret - University of Edinburgh

Result: Stationary Environment

Theorem: Given any finite action sets X and Y, cumulative payoff matching on X is optimal against every stationary distribution on Y

•  In general games, this kind of statement is hard to make
   –  The proof of this seemingly simple statement relies on stochastic approximation theory
   –  Analysis under varying distributions is hard!

•  In zero-sum games, CPM converges, with probability 1, to a Nash equilibrium

Page 34: Online Learning and Regret - University of Edinburgh

What Next?

•  Simple reinforcement rules such as CPM omit any mention of the cognitive process

•  What other kinds of criteria might subjects bring in?
   1.  Pattern of past play: predict the opponent's next action based on what has happened so far, and choose actions to maximize expected payoffs
   2.  Past payoffs: could we have done better by playing differently in the past?
       •  No predictive behavioural model; subjects simply want to minimize ex post regret

Page 35: Online Learning and Regret - University of Edinburgh

Regret

•  Consider the simple game of choosing soft drinks:

   (Payoff)   0   0   0   1   1   0   0   0   1   0   0
   Row        L   R   L   L   R   R   L   R   R   R   R   ?
   Column     R   L   R   L   R   L   R   L   R   L   L   ?

•  Imagine you are allowed to replay the game, but you must do so by choosing the same action in every period (the hypothesis class against which you evaluate).

•  We do not really know what the opponent would have done if we had changed our play, but we do have the realized performance, so we ask with respect to this
   –  If you had always played R, your payoff would have been 5 (for all L, 6); your realized payoff was 3
   –  Average regret from not playing all L: 3/11; from not playing all R: 2/11
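A minimal Python sketch of the ex post regret computation above. Assumption (not stated on this slide): the row player earns 1 whenever both players choose the same action and 0 otherwise, which reproduces the payoffs 5, 6 and 3 quoted above.

```python
row    = list("LRLLRRLRRRR")           # row player's observed actions
column = list("RLRLRLRLRLL")           # column player's observed actions

def payoff(x, y):
    return 1 if x == y else 0          # assumed payoff, see note above

T = len(column)
realized = sum(payoff(x, y) for x, y in zip(row, column))            # 3
always   = {a: sum(payoff(a, y) for y in column) for a in "LR"}      # {'L': 6, 'R': 5}
avg_regret = {a: (always[a] - realized) / T for a in "LR"}
print(avg_regret)                      # {'L': 0.2727..., 'R': 0.1818...} = 3/11, 2/11
```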

Page 36: Online Learning and Regret - University of Edinburgh

Regret

•  Average payoff through to time t:
•  For each action x, define the average regret from not having played x as,

•  We have a vector of regrets,
•  A given realization of play has no regret if,

•  A behavioural rule g has no regret if, given a pre-specified infinite sequence of play by Nature, (y1, y2, …), almost all realizations ω generated by g satisfy the above condition
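The formulas referenced above did not survive extraction; a standard formulation consistent with the surrounding text (and with the numerical example two slides back) is the following sketch:

$\bar{u}_t = \frac{1}{t}\sum_{s=1}^{t} u(x_s, y_s), \qquad R_t(x) = \frac{1}{t}\sum_{s=1}^{t} u(x, y_s) \;-\; \bar{u}_t,$

with the regret vector $R_t = \big(R_t(x)\big)_{x \in X}$; a realization has no regret if $\limsup_{t\to\infty} \max_{x \in X} R_t(x) \le 0$.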

Page 37: Online Learning and Regret - University of Edinburgh

Regret Matching

•  Many variations on learning using regret exist. A simple and appealing rule, due to Hart and Mas-Colell, is the following

•  In each period t+1, the decision maker plays each action with probability proportional to the non-negative part of his regret up to that time,

•  If the regret for R is 2/11 and for L is 3/11, then under regret matching the Row player chooses R or L with probability 2/5 and 3/5 respectively (at t = 12 in our previous example)
   –  Under CPM, by contrast, R would have been chosen more often than L
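A minimal Python sketch of the choice rule above: probabilities proportional to the positive parts of the regrets. What to do when no regret is positive is not specified on the slide; the uniform fallback below is an assumption.

```python
def regret_matching_probs(regret):
    """Map a dict of average regrets to regret-matching choice probabilities."""
    pos = {a: max(r, 0.0) for a, r in regret.items()}   # non-negative parts
    total = sum(pos.values())
    if total == 0:                                      # assumed fallback: uniform
        return {a: 1 / len(regret) for a in regret}
    return {a: p / total for a, p in pos.items()}

print(regret_matching_probs({"L": 3/11, "R": 2/11}))    # {'L': 0.6, 'R': 0.4}
```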

Page 38: Online Learning and Regret - University of Edinburgh

Regret Matching with ε-Experimentation

Can one do this without even knowing the opponent's actions?
   –  The subject experiments, randomly, with small probability ε
   –  When not experimenting, he employs regret matching with the following modification:

•  The "estimated" regret for action x is its average payoff in previous periods when he experimented and chose x, MINUS the average realized payoff over all actions in all previous periods

Theorem: In a finite game against Nature, given δ > 0, for all sufficiently small ε > 0, regret matching with ε-experimentation has at most δ-regret against every sequence of play by Nature.

Page 39: Online Learning and Regret - University of Edinburgh

Why Does Regret Matching Work?

•  Player X has two actions, {1, 2}
•  Average per-period payoff,
•  If he had just played action 1,

•  Regret:

•  We want to have,              , almost surely, where the non-negative part of the regret is denoted as

Page 40: Online Learning and Regret - University of Edinburgh

How Does Regret Matching Work?

•  In period t+1, the opponent takes an unforeseen action
   –  Irrespective of what that action will be, the next-period regret from playing action 1 is the negative of that corresponding to action 2
   –  The incremental regret is of the form $(\alpha_{t+1}, -\alpha_{t+1})$ for an unknown $\alpha_{t+1}$

•  Let us say one is following a mixed strategy,

•  Expected incremental regret with respect to this strategy,

•  Weighted over time,

Page 41: Online Learning and Regret - University of Edinburgh

Regret Matching Procedure

•  The goal is to choose a probability to make

•  This is the same as making sure that                  is orthogonal to the current

•  This implies,

•  which implies,
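The formulas on this slide did not survive extraction. A sketch of the standard argument (following Hart and Mas-Colell), under the assumption that the two-action bookkeeping matches the previous slide, is:

Let $R_t^{+} = \big([R_t(1)]^{+}, [R_t(2)]^{+}\big)$ be the positive part of the current regret vector, and let action 1 be played with probability $p$. With payoff gap $\alpha_{t+1} = u(1, y_{t+1}) - u(2, y_{t+1})$, the expected incremental regret is $\big((1-p)\,\alpha_{t+1},\; -p\,\alpha_{t+1}\big)$. Requiring this to be orthogonal to $R_t^{+}$ gives

$(1-p)\,\alpha_{t+1}\,[R_t(1)]^{+} - p\,\alpha_{t+1}\,[R_t(2)]^{+} = 0 \;\Longrightarrow\; p = \frac{[R_t(1)]^{+}}{[R_t(1)]^{+} + [R_t(2)]^{+}},$

which is exactly the regret-matching probability; Blackwell's approachability theorem (mentioned on Page 45) then implies that the average regret vector approaches the non-positive orthant.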

Page 42: Online Learning and Regret - University of Edinburgh

Conditional or Internal Regret

•  There exist a pair of actions x, y such that playing x would have yielded a higher total payoff over all the periods in which the subject actually played y
   –  e.g., one may not have done better by always taking an action such as 'wear blue'
   –  the conditional statement is that she could have done better by wearing blue in every period in which she had instead worn black

•  Given a play path ω, the player's conditional regret matrix at time t is a matrix R_t(ω) such that

Page 43: Online Learning and Regret - University of Edinburgh

Shapley's Game

Consider the following game:

         R       Y       B
   R   1, 0    0, 0    0, 1
   Y   0, 1    1, 0    0, 0
   B   0, 0    0, 1    1, 0

Suppose we have a history of play over 10 periods:

   (Payoff)   1   0   0   0   1   0   0   0   0   0
   Row        R   R   B   B   B   Y   Y   R   Y   R
   Column     R   B   Y   Y   B   R   R   Y   R   Y

Page 44: Online Learning and Regret - University of Edinburgh

Shapley's Game - Conditional Regret

•  Adopt the perspective of the Row player; at the end of the ten periods his conditional regret matrix is

         R       Y       B
   R     0      0.1      0
   Y    0.3      0       0
   B   -0.1     0.1      0

•  If Row had played R in the three periods when he actually played Y, his total for those periods would have been 3 instead of 0. So the average conditional regret in cell (Y, R) is 3/10.
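A minimal Python sketch that recomputes the matrix above from the history on the previous slide; the Row payoff function is read directly off the Shapley game matrix (Row earns 1 on the diagonal).

```python
row     = list("RRBBBYYRYR")           # Row's observed actions
column  = list("RBYYBRRYRY")           # Column's observed actions
actions = ["R", "Y", "B"]
T = len(row)

def row_payoff(x, c):
    return 1 if x == c else 0          # Row earns 1 on the diagonal of Shapley's game

# Entry (y, x): average gain Row would have made by playing x in every period
# in which he actually played y.
regret = {(y, x): sum(row_payoff(x, c) - row_payoff(r, c)
                      for r, c in zip(row, column) if r == y) / T
          for y in actions for x in actions}

print(regret[("Y", "R")])              # 0.3, matching cell (Y, R) above
print(regret[("B", "R")])              # -0.1, matching cell (B, R) above
```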

Page 45: Online Learning and Regret - University of Edinburgh

Learning with Conditional Regret

Fact: There exist learning rules that eliminate conditional regrets no matter what Nature does (Foster & Vohra, 1997)

These rules are of the general form: reinforcement increments δ_t are computed (e.g., linear-algebraically) from the conditional regret matrix.

The proof is based on a celebrated result called Blackwell's Approachability Theorem.

Page 46: Online Learning and Regret - University of Edinburgh

Calibration

(P. Dawid) A sequence of binary forecasts is calibrated if, over all those periods in which the forecaster predicts that event "1" will occur with probability p, the empirical frequency of 1's is in fact p

A similar definition applies to arbitrary symbols being forecast, e.g., real-valued predictions, but the definition is more intricate in its formulation…

Page 47: Online Learning and Regret - University of Edinburgh

Example: Bridge Contracts (Keren, 1987)

Page 48: Online Learning and Regret - University of Edinburgh

Example: Physicians (Christensen-Szalanski et al.)

Why might this bias make sense?

Page 49: Online Learning and Regret - University of Edinburgh

Random Forecasting Rules

•  The forecast equivalent of a randomized action choice
•  A rule of the form (z is a random variable):

•  F is calibrated if, for every ω, the following calibration score goes to zero almost surely along the player's sequence of forecasts

[Formula not reproduced; its annotations read "Num times p was forecast up to t" and "Empirical distribution of outcomes when prediction p was forecast".]
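A minimal Python sketch of a calibration score in the spirit of the annotations above: for each forecast value p, compare p with the empirical frequency of the event over the periods in which p was forecast, weighted by how often p was used. The exact functional form of the slide's score may differ.

```python
from collections import defaultdict

def calibration_score(forecasts, outcomes):
    """Weighted squared gap between each forecast value and its empirical frequency."""
    count = defaultdict(int)     # number of times each p was forecast up to t
    hits = defaultdict(int)      # number of 1-outcomes on those periods
    for p, y in zip(forecasts, outcomes):
        count[p] += 1
        hits[p] += y
    t = len(forecasts)
    return sum((count[p] / t) * (hits[p] / count[p] - p) ** 2 for p in count)

print(calibration_score([0.5, 0.5, 0.8, 0.8, 0.8], [1, 0, 1, 1, 1]))   # 0.024
```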

Page 50: Online Learning and Regret - University of Edinburgh

Calibrated Forecasters

Given any finite set Z and ε > 0, there exist random forecasting rules that are ε-calibrated for all sequences on Z

Theorem: Let G be a finite game. Suppose every player uses a calibrated forecasting rule and chooses a myopic best response to his forecast. Then the empirical frequency distribution of play converges with probability one to the set of correlated equilibria of G

Page 51: Online Learning and Regret - University of Edinburgh

Takeaway Messages

•  Equilibrium is a nice concept, but in real life a lot of the action is off-equilibrium

•  How do people get to equilibria?
•  What happens when everyone is learning, groping their way towards some notion of 'equilibrium'?
•  This area has many counter-intuitive results
   –  'Perfect' Bayesian learning is not always so
   –  Simple learning rules give surprisingly useful behaviour
   –  Notions such as regret enable learning despite limits on modelling of the underlying process

•  Many algorithms, such as regret matching and calibrated forecasts, represent ways to get to equilibrium