
Page 1: Response Regret

Response Regret

Martin Zinkevich, AAAI Fall Symposium, November 5th, 2005

This work was supported by NSF Career Grant #IIS-0133689.

Page 2: Response Regret

Outline

Introduction
Repeated Prisoners’ Dilemma
Tit-for-Tat
Grim Trigger
Traditional Regret
Response Regret
Conclusion

Page 3: Response Regret

The Prisoner’s Dilemma

Two prisoners (Alice and Bob) are caught for a small crime. They make a deal not to squeal on each other about a larger crime they committed together.

Then the authorities meet with each prisoner separately and offer a pardon for the small crime if that prisoner turns his or her partner in for the large crime.

Each has two options: Cooperate with the other prisoner, or Defect from the deal.

Page 4: Response Regret

Bimatrix Game

                   Bob Cooperates                  Bob Defects
Alice Cooperates   Alice: 1 year, Bob: 1 year      Alice: 6 years, Bob: 0 years
Alice Defects      Alice: 0 years, Bob: 6 years    Alice: 5 years, Bob: 5 years

Page 5: Response Regret

Bimatrix Game

                   Bob Cooperates    Bob Defects
Alice Cooperates   -1, -1            -6, 0
Alice Defects       0, -6            -5, -5

Page 6: Response Regret

Nash Equilibrium

                   Bob Cooperates    Bob Defects
Alice Cooperates   -1, -1            -6, 0
Alice Defects       0, -6            -5, -5

The unique Nash equilibrium is mutual defection (-5, -5): whatever the other player does, each player is better off defecting.

Page 7: Response Regret

The Problem

Each player, acting to slightly improve his or her own outcome, hurts the other player; if both acted “irrationally” (cooperated), both would do better.

Page 8: Response Regret

A Better Model for Real Life

Consequences for misbehavior
These improve life
A better model: infinitely repeated games

Page 9: Response Regret

The Goal

Can we come up with algorithms whose performance guarantees hold in the presence of other intelligent agents and take the delayed consequences of actions into account?

Side effect: a goal for reinforcement learning in infinite POMDPs.

Page 10: Response Regret

Regret Versus Standard RL

Guarantees of performance during learning.

No guarantee for the “final” policy… for now.

Page 11: Response Regret

A New Measure of Regret

Traditional Regret measures immediate consequences

Response Regret measures delayed effects

Page 12: Response Regret

Outline

Introduction
Repeated Prisoners’ Dilemma
Tit-for-Tat
Grim Trigger
Traditional Regret
Response Regret
Conclusion

Page 13: Response Regret

Outline

Introduction
Repeated Prisoners’ Dilemma
Tit-for-Tat
Grim Trigger
Traditional Regret
Response Regret
Conclusion

Page 14: Response Regret

Repeated Bimatrix Game

                   Bob Cooperates    Bob Defects
Alice Cooperates   -1, -1            -6, 0
Alice Defects       0, -6            -5, -5

Page 15: Response Regret

Finite State Machine (for Bob)

[Diagram: a finite state machine for Bob. Each state is labeled with Bob's action (cooperate or defect), and each transition is labeled with the Alice action that triggers it.]

Page 16: Response Regret

Grim Trigger

[Diagram: two states, "Bob cooperates" and "Bob defects". The machine starts in the cooperate state; the first time Alice defects it moves to the defect state and stays there for every subsequent Alice action (Alice: *).]

Page 17: Response Regret

Always Cooperate

[Diagram: a single state, "Bob cooperates", with a self-loop on every Alice action (Alice: *).]

Page 18: Response Regret

Always Defect

[Diagram: a single state, "Bob defects", with a self-loop on every Alice action (Alice: *).]

Page 19: Response Regret

Tit-for-Tat

[Diagram: two states, "Bob cooperates" and "Bob defects". Bob repeats Alice's last action: when Alice cooperates the machine moves to (or stays in) the cooperate state, and when Alice defects it moves to (or stays in) the defect state.]
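These machines are small enough to write down directly. The sketch below is my own illustration (the dictionary encoding and state names are assumptions, not the talk's notation): it encodes tit-for-tat and grim trigger as state machines for Bob and runs one against a fixed sequence of Alice's actions.

```python
# Illustrative sketch: Bob's strategies as two-state machines.
# Each machine maps its current state to Bob's action and transitions
# on Alice's observed action ('C' or 'D').
TIT_FOR_TAT = {
    "start": "coop",
    "action": {"coop": "C", "defect": "D"},
    "next": {("coop", "C"): "coop",   ("coop", "D"): "defect",
             ("defect", "C"): "coop", ("defect", "D"): "defect"},
}

GRIM_TRIGGER = {
    "start": "coop",
    "action": {"coop": "C", "defect": "D"},
    "next": {("coop", "C"): "coop",     ("coop", "D"): "defect",
             ("defect", "C"): "defect", ("defect", "D"): "defect"},
}

def run_bob(machine, alice_actions):
    """Return Bob's actions when his machine faces a fixed Alice sequence."""
    state, bob_actions = machine["start"], []
    for a in alice_actions:
        bob_actions.append(machine["action"][state])
        state = machine["next"][(state, a)]
    return bob_actions

# Alice defects once in round 2: tit-for-tat punishes for one round,
# grim trigger punishes forever.
print(run_bob(TIT_FOR_TAT, list("CDCCC")))   # ['C', 'C', 'D', 'C', 'C']
print(run_bob(GRIM_TRIGGER, list("CDCCC")))  # ['C', 'C', 'D', 'D', 'D']
```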

Page 20: Response Regret

Discounted Utility

[Diagram: the tit-for-tat machine, with discounting read as a random stopping process. After every step the play continues (GO) with probability 2/3 and stops (STOP) with probability 1/3; the example run records Alice's per-step payoffs (e.g. C: -1, D: 0, C against a defection: -6) along the way.]

Page 21: Response Regret

Discounted Utility

The expected value of that process is the discounted utility $\sum_{t=1}^{\infty} \gamma^{t-1} u_t$.
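As a sanity check on the random-stopping reading of discounting, here is a small illustration (my own sketch, not the slide's): the Monte Carlo average of runs that continue with probability γ after each step approaches the discounted sum.

```python
import random

def discounted_utility(utilities, gamma):
    """sum over t >= 1 of gamma^(t-1) * u_t, over a finite prefix."""
    return sum(gamma ** t * u for t, u in enumerate(utilities))

def stopped_run_utility(utilities, gamma, rng):
    """One draw of the GO/STOP process: collect each step's reward,
    continuing with probability gamma and stopping otherwise."""
    total = 0.0
    for u in utilities:
        total += u
        if rng.random() >= gamma:  # STOP with probability 1 - gamma
            break
    return total

rng = random.Random(0)
stream = [-1, 0, -5, -5, -5, -5]   # an example payoff stream for Alice
exact = discounted_utility(stream, 2 / 3)
estimate = sum(stopped_run_utility(stream, 2 / 3, rng) for _ in range(100_000)) / 100_000
print(exact, estimate)             # the two numbers should be close
```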

Page 22: Response Regret

Optimal Value Functions for FSMs

$V_\gamma^*(s)$: discounted utility of the OPTIMAL policy from state $s$.

$V_0^*(s)$: maximum immediate utility at state $s$.

$V_\gamma^*(B)$: discounted utility of the OPTIMAL policy given a belief $B$ over states.

$V_0^*(B)$: maximum immediate utility given a belief $B$ over states.

(GO with probability $\gamma$, STOP with probability $1-\gamma$.)

Page 23: Response Regret

Best Responses, Discounted Utility

If $\gamma > 1/5$, a policy is a best response to grim trigger iff it always cooperates when playing grim trigger.

[Diagram: the grim trigger machine shown earlier.]
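A one-deviation check of that threshold, using the payoff matrix above (my own arithmetic, not on the slide): compare cooperating forever with defecting once and then facing mutual defection forever.

$$
\sum_{t \ge 1} \gamma^{t-1}(-1) \;=\; \frac{-1}{1-\gamma}
\qquad\text{versus}\qquad
0 + \sum_{t \ge 2} \gamma^{t-1}(-5) \;=\; \frac{-5\gamma}{1-\gamma}.
$$

Cooperating is strictly better exactly when $-1/(1-\gamma) > -5\gamma/(1-\gamma)$, that is, when $\gamma > 1/5$.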

Page 24: Response Regret

Best Responses, Discounted Utility

Similarly, if $\gamma > 1/5$, a policy is a best response to tit-for-tat iff it always cooperates when playing tit-for-tat.

[Diagram: the tit-for-tat machine shown earlier.]

Page 25: Response Regret

Knowing Versus Learning

Given a known FSM for the opponent, we can determine the optimal policy (for some $\gamma$) from an initial state.

However, if it is an unknown FSM, by the time we learn what it is, it will be too late to act optimally.

Page 26: Response Regret

Grim Trigger or Always Cooperate?

[Diagrams: the grim trigger machine and the always-cooperate machine, side by side. Both cooperate for as long as Alice cooperates, so Alice cannot tell them apart without defecting, and if Bob is in fact playing grim trigger that experiment can never be undone.]

For learning, optimality from the initial state is a bad goal.

Page 27: Response Regret

Deterministic Infinite SMs

Can represent any deterministic policy
De-randomization

[Diagram: a policy drawn as an infinite tree of histories, each node labeled with the action (C or D) played there.]

Page 28: Response Regret

New Goal

Can a measure of regret allow us to play like tit-for-tat in the Infinitely Repeated Prisoner’s Dilemma?

In addition, it should be possible for one algorithm to minimize regret against all possible opponents (finite and infinite SMs).

Page 29: Response Regret

Outline

Introduction
Repeated Prisoners’ Dilemma
Tit-for-Tat
Grim Trigger
Traditional Regret
Response Regret
Conclusion

Page 30: Response Regret

Traditional Regret: Rock-Paper-Scissors

                       Bob plays Rock    Bob plays Paper    Bob plays Scissors
Alice plays Rock       Tie               Bob wins $1        Alice wins $1
Alice plays Paper      Alice wins $1     Tie                Bob wins $1
Alice plays Scissors   Bob wins $1       Alice wins $1      Tie

Page 31: Response Regret

Traditional Regret: Rock-Paper-Scissors

                       Bob plays Rock    Bob plays Paper    Bob plays Scissors
Alice plays Rock        0, 0             -1, 1               1, -1
Alice plays Paper       1, -1             0, 0              -1, 1
Alice plays Scissors   -1, 1              1, -1              0, 0

Page 32: Response Regret

Rock-Paper-Scissors: Bob plays BR to Alice’s Last

Page 33: Response Regret

Rock-Paper-Scissors: Bob plays BR to Alice’s Last

Page 34: Response Regret

Rock-Paper-Scissors: Bob plays BR to Alice’s Last

Page 35: Response Regret

Utility of the Algorithm

Define $u_t$ to be the utility of ALG at time $t$.

Define $u_0^{ALG}$ to be:

$u_0^{ALG} = \frac{1}{T} \sum_{t=1}^{T} u_t$

Here: $u_0^{ALG} = \frac{1}{5}\,(0 + 1 + (-1) + 1 + 0) = 1/5$.

Page 36: Response Regret

Rock-Paper-Scissors: Visit Counts for Bob’s Internal States

[Diagram: Bob's internal states with visit counts 3, 1, and 1.]

$u_0^{ALG} = 1/5$

Page 37: Response Regret

Rock-Paper-Scissors: Frequencies

[Diagram: the same states with empirical frequencies 3/5, 1/5, and 1/5.]

$u_0^{ALG} = 1/5$

Page 38: Response Regret

Rock-Paper-Scissors: Dropped in according to the Frequencies

[Diagram: Alice is dropped into a state according to the frequencies 3/5, 1/5, 1/5; against this belief, the expected immediate utilities of her three actions are 0, 2/5, and -2/5.]

$u_0^{ALG} = 1/5$

Page 39: Response Regret

Traditional Regret

Consider $B$ to be the empirical frequency with which states were visited.

Define $u_0^{ALG}$ to be the average utility of the algorithm. The traditional regret of ALG is:

$R_0 = V_0^*(B) - u_0^{ALG}$

Here $V_0^*(B) = 2/5$ (the best action against $B$) and $u_0^{ALG} = 1/5$, so $R_0 = (2/5) - (1/5) = 1/5$.
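A minimal sketch of this computation in Python (my own illustration; the payoff dictionary and the assumption that each of Bob's internal states deterministically fixes his next play are mine):

```python
# Alice's rock-paper-scissors payoff: PAYOFF[alice_action][bob_action]
PAYOFF = {
    "R": {"R": 0, "P": -1, "S": 1},
    "P": {"R": 1, "P": 0, "S": -1},
    "S": {"R": -1, "P": 1, "S": 0},
}

def traditional_regret(alice_plays, bob_plays):
    """R_0 = V_0*(B) - u_0^ALG, where B is the empirical distribution of
    Bob's play (each of his states determines his action)."""
    T = len(alice_plays)
    u_alg = sum(PAYOFF[a][b] for a, b in zip(alice_plays, bob_plays)) / T
    v_star = max(sum(PAYOFF[a][b] for b in bob_plays) / T for a in "RPS")
    return v_star - u_alg
```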

Page 40: Response Regret

Traditional Regret

Goal: regret approaches zero almost surely.
There exists an algorithm that achieves this against all opponents.

Page 41: Response Regret

What Algorithm?

Gradient Ascent With Euclidean Projection (Zinkevich, 2003):

(when each $p_i$ is strictly positive)
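For reference, a sketch (my own Python, under the standard form of online gradient ascent with Euclidean projection onto the probability simplex; the step size eta is a free parameter, and this is not necessarily the slide's exact update):

```python
def project_to_simplex(v):
    """Euclidean projection of a real vector onto the probability simplex."""
    u = sorted(v, reverse=True)
    cumulative, tau = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        cumulative += uj
        threshold = (cumulative - 1.0) / j
        if uj - threshold > 0:     # holds for the leading coordinates only
            tau = threshold
    return [max(x - tau, 0.0) for x in v]

def gradient_ascent_step(p, utilities, eta):
    """Move the mixed strategy in the direction of higher utility,
    then project back onto the simplex."""
    return project_to_simplex([pi + eta * ui for pi, ui in zip(p, utilities)])
```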

Page 42: Response Regret

What Algorithm?

Exponential Weighted Experts (Littlestone + Warmuth, 1994):

And a close relative:
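For reference, a sketch of the standard exponentially weighted experts update in Python (an assumed textbook form, not necessarily the slide's exact variant):

```python
import math

def exp_weights_step(weights, utilities, eta):
    """w_i <- w_i * exp(eta * u_i); the mixed strategy plays action i
    with probability w_i / sum(w)."""
    new_weights = [w * math.exp(eta * u) for w, u in zip(weights, utilities)]
    total = sum(new_weights)
    return new_weights, [w / total for w in new_weights]
```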

Page 43: Response Regret

What Algorithm?

Regret Matching:
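A sketch of the usual regret-matching rule (Hart and Mas-Colell's formulation, written in Python as my own illustration): play each action with probability proportional to its positive cumulative regret.

```python
def regret_matching_probs(cumulative_regrets):
    """Each action's probability is its positive cumulative regret,
    normalized; if no regret is positive, fall back to uniform play."""
    positive = [max(r, 0.0) for r in cumulative_regrets]
    total = sum(positive)
    n = len(cumulative_regrets)
    return [p / total for p in positive] if total > 0 else [1.0 / n] * n
```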

Page 44: Response Regret

What Algorithm?

Lots of them!

Page 45: Response Regret

Extensions to Traditional Regret

(Foster and Vohra, 1997)

Into the past…
Have a short history.
Optimal against BR to Alice’s Last.

Page 46: Response Regret

Extensions to Traditional Regret

(Auer et al.) We only see $u_t$, not $u_{i,t}$:

Use an unbiased estimator of $u_{i,t}$:
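For reference, a sketch of the standard importance-weighted estimator used in this bandit setting (an assumed form; only the played action's utility is observed):

```python
def estimate_full_utility(played, observed_utility, probs):
    """Unbiased estimate of the whole utility vector from bandit feedback:
    u_hat[i] = u_t / p_i if i is the action actually played, else 0."""
    return [observed_utility / probs[i] if i == played else 0.0
            for i in range(len(probs))]
```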

Page 47: Response Regret

Outline

Introduction
Repeated Prisoners’ Dilemma
Tit-for-Tat
Grim Trigger
Traditional Regret
Response Regret
Conclusion

Page 48: Response Regret

This Talk

Do you want to? Even then, is it possible?

Page 49: Response Regret

Traditional Regret: Prisoner’s Dilemma

[Diagram: the tit-for-tat machine, with the example play unfolding as (C,C), (D,C), (D,D), (D,D), …: Alice defects in round 2, and from round 3 on both players defect.]

Page 50: Response Regret

Traditional Regret:Prisoner’s Dilemma

[Diagram: the tit-for-tat machine with the empirical state frequencies from the play above: Bob cooperates 0.2 of the time and defects 0.8 of the time.]

Against this belief, Alice's immediate expected utilities are: Alice defects: -4; Alice cooperates: -5.

Page 51: Response Regret

Traditional Regret

                   Bob Cooperates    Bob Defects
Alice Cooperates   -1, -1            -6, 0
Alice Defects       0, -6            -5, -5

Page 52: Response Regret

The New Dilemma

Traditional regret forces greedy, short-sighted behavior.

A new concept is needed.

Page 53: Response Regret

A New Measurement of Regret

[Diagram: the tit-for-tat machine with the same empirical frequencies (Bob cooperates 0.2, defects 0.8).]

Use $V_\gamma^*(B)$ instead of $V_0^*(B)$.

Page 54: Response Regret

Response Regret

Consider $B$ to be the empirical distribution over states visited.

Define $u_0^{ALG}$ to be the average utility of the algorithm. Traditional regret is:

$R_0 = V_0^*(B) - u_0^{ALG}$

Response regret is: $R_\gamma = V_\gamma^*(B) - \,?$

Page 55: Response Regret

Averaged Discounted Utility

Utility of the algorithm at time $t'$: $u_{t'}$

Discounted utility from time $t$: $\sum_{t'=t}^{\infty} \gamma^{t'-t} u_{t'}$

Averaged discounted utility from 1 to $T$: $u_\gamma^{ALG} = \frac{1}{T} \sum_{t=1}^{T} \sum_{t'=t}^{\infty} \gamma^{t'-t} u_{t'}$

Dropped in at random but playing optimally: $V_\gamma^*(B)$

Response regret: $R_\gamma = V_\gamma^*(B) - u_\gamma^{ALG}$
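A small sketch of the averaged discounted utility in Python (my own illustration; it truncates the inner sum at the observed horizon T rather than running to infinity):

```python
def averaged_discounted_utility(utilities, gamma):
    """(1/T) * sum over t of sum over t' >= t of gamma^(t'-t) * u_{t'},
    with the inner sum truncated at the observed horizon."""
    T = len(utilities)
    total = 0.0
    for t in range(T):
        total += sum(gamma ** (tp - t) * utilities[tp] for tp in range(t, T))
    return total / T
```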

Page 56: Response Regret

Response Regret

Consider $B$ to be the empirical distribution over states visited.

Traditional regret is: $R_0 = V_0^*(B) - u_0^{ALG}$

Response regret is: $R_\gamma = V_\gamma^*(B) - u_\gamma^{ALG}$

Page 57: Response Regret

Comparing Regret Measures: when Bob Plays Tit-for-Tat

[Diagram: the tit-for-tat machine with empirical frequencies Bob cooperates 0.2, Bob defects 0.8, under the play (C,C), (D,C), (D,D), (D,D), … (Alice defects in round 2).]

$R_0 = 1/10$ (best: defect)
$R_{1/5} = 0$ (any policy)
$R_{2/3} = 203/30 \approx 6.76$ (best: always cooperate)

Page 58: Response Regret

Comparing Regret Measures: when Bob Plays Tit-for-Tat

[Diagram: the tit-for-tat machine with empirical frequencies Bob cooperates 1.0, Bob defects 0.0, under the play (C,C), (C,C), (C,C), … (both always cooperate).]

$R_0 = 1$ (best: defect)
$R_{1/5} = 0$ (any policy)
$R_{2/3} = 0$ (best: always cooperate / tit-for-tat / grim trigger)

Page 59: Response Regret

Comparing Regret Measures: when Bob Plays Grim Trigger

[Diagram: the grim trigger machine with empirical frequencies Bob cooperates 0.2, Bob defects 0.8, under the play (C,C), (D,C), (D,D), (D,D), … (Alice defects in round 2).]

$R_0 = 1/10$ (best: defect)
$R_{1/5} = 0$ (grim trigger / tit-for-tat / always defect)
$R_{2/3} = 11/30$ (grim trigger / tit-for-tat)

Page 60: Response Regret

Comparing Regret Measures: when Bob Plays Grim Trigger

[Diagram: the grim trigger machine with empirical frequencies Bob cooperates 1.0, Bob defects 0.0, under the play (C,C), (C,C), (C,C), … (both always cooperate).]

$R_0 = 1$ (best: defect)
$R_{1/5} = 0$ (always cooperate / always defect / tit-for-tat / grim trigger)
$R_{2/3} = 0$ (always cooperate / tit-for-tat / grim trigger)

Page 61: Response Regret

Regrets

Play (Alice / Bob)                  vs Tit-for-Tat                        vs Grim Trigger
C D D D … / C C D D …               R_0 = 0.1, R_1/5 = 0, R_2/3 ≈ 6.76    R_0 = 0.1, R_1/5 = 0, R_2/3 ≈ 0.36
C C C C … / C C C C …               R_0 = 1, R_1/5 = 0, R_2/3 = 0         R_0 = 1, R_1/5 = 0, R_2/3 = 0

Page 62: Response Regret

What it Measures:

Constantly missed opportunities → high response regret
A few drastic mistakes → low response regret
Convergence implies a Nash equilibrium of the repeated game

Page 63: Response Regret

Philosophy

Response regret cannot be known without knowing the opponent.

Response regret can be estimated while playing the opponent, so that the estimate in the limit will be exact a.s.

Page 64: Response Regret

Determining Utility of a Policy in a State

If I want to know the discounted utility of using a policy P from the third state visited…

Use the policy P from the third time step ad infinitum, and take the discounted reward.

S1 S2 S3 S4 S5

Page 65: Response Regret

Determining Utility of a Policy in a State in Finite Time

Start using the policy P from the third time step; after each step, continue using P with probability $\gamma$. Take the total reward over the time steps P was used.

In EXPECTATION, the same as before.

S1 S2 S3 S4 S5

Page 66: Response Regret

Determining Utility of a Policy in a State in Finite Time Without ALWAYS Using It

With some fixed probability, start using the policy P from the third time step; after each step, continue using P with probability $\gamma$. Take the total reward over the time steps P was used and divide it by the starting probability.

In EXPECTATION, this is the same as before. Any finite number of policies can be estimated at the same time this way.

S1 S2 S3 S4 S5
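A sketch of this estimator (illustrative Python; the iterable-of-rewards interface and the name epsilon for the switch-in probability are my own assumptions):

```python
import random

def estimate_policy_value(rewards_under_P, epsilon, gamma, rng=None):
    """With probability epsilon, switch to policy P here; keep following it
    with probability gamma per step; sum the rewards collected while P ran
    and divide by epsilon. The estimate is 0 on the rounds we do not switch,
    so in expectation it equals P's discounted value from this state."""
    rng = rng or random.Random()
    if rng.random() >= epsilon:
        return 0.0                     # did not try P this time
    total = 0.0
    for r in rewards_under_P:          # one reward per step of following P
        total += r
        if rng.random() >= gamma:      # STOP with probability 1 - gamma
            break
    return total / epsilon
```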

Page 67: Response Regret

Traditional Regret

Goal: regret approaches zero almost surely.
There exists an algorithm for all opponents.

Page 68: Response Regret

Response Regret

Goal: regret approaches zero almost surely.
There exists an algorithm for all opponents.

Page 69: Response Regret

A Hard Environment: The Combination Lock Problem

[Diagram: a "combination lock" machine for Bob. Bob defects in every state but one; only the right sequence of Alice's actions (the combination) advances the machine to the single state where Bob cooperates, and a wrong action sends it back toward the start.]

Page 70: Response Regret

SPEED!

Response regret takes time to minimize (combination lock problem).

Current work: restricting the adversary’s choice of policies. In particular, if the number of policies is N, then the regret is linear in N and polynomial in $1/(1-\gamma)$.

Page 71: Response Regret

Related Work

Other work:
de Farias and Megiddo, 2004
Browning, Bowling, and Veloso, 2004
Bowling and McCracken, 2005

Episodic solutions face problems similar to the Finitely Repeated Prisoner’s Dilemma.

Page 72: Response Regret

What is in a Name?

Why not Consequence Regret?

Page 73: Response Regret

Questions?

Thanks to: Avrim Blum (CMU), Michael Bowling (U Alberta), Amy Greenwald (Brown), Michael Littman (Rutgers), Rich Sutton (U Alberta).

Page 74: Response Regret

Always Cooperate

[Diagram: the always-cooperate machine, with an example play in which Alice defects after the first round while Bob keeps cooperating.]

$R_0 = 1/10$
$R_{1/5} = 1/10$
$R_{2/3} = 1/10$

Page 75: Response Regret

Practice

Using these estimation techniques, it is possible to minimize response regret (make it approach zero almost surely in the limit in an ARBITRARY environment).

Similar to the Folk Theorems, it is also possible to converge to the socially optimal behavior if $\gamma$ is close enough to 1. (???)

Page 76: Response Regret

Traditional Regret: Prisoner’s Dilemma

[Diagram: the tit-for-tat machine.]

Page 77: Response Regret

Possible Outcomes

Alice cooperates, Bob cooperates: Alice 1 year, Bob 1 year.
Alice defects, Bob cooperates: Alice 0 years, Bob 6 years.
Alice cooperates, Bob defects: Alice 6 years, Bob 0 years.
Alice defects, Bob defects: Alice 5 years, Bob 5 years.

Page 78: Response Regret

Bimatrix Game

                   Bob Cooperates                  Bob Defects
Alice Cooperates   Alice: 1 year, Bob: 1 year      Alice: 6 years, Bob: 0 years
Alice Defects      Alice: 0 years, Bob: 6 years    Alice: 5 years, Bob: 5 years

Page 79: Response Regret

Repeated Bimatrix Game

The same one-shot game is played repeatedly.

Either average reward or discounted reward is considered.

Page 80: Response Regret

Rock-Paper-Scissors: Bob plays BR to Alice’s Last

Page 81: Response Regret

Rock-Paper-Scissors: Bob plays BR to Alice’s Last

Page 82: Response Regret

Rock-Paper-Scissors: Bob plays BR to Alice’s Last

Page 83: Response Regret

Rock-Paper-Scissors: Bob plays BR to Alice’s Last

Page 84: Response Regret

Rock-Paper-Scissors: Bob plays BR to Alice’s Last

Page 85: Response Regret

One Slide Summary

Problem: Prisoner’s Dilemma
Solution: Infinitely Repeated Prisoner’s Dilemma
Same Problem: Traditional Regret
Solution: Response Regret

Page 86: Response Regret

Formalism for FSMs: $(S, A, \Omega, O, u, T)$

States $S$
Finite actions $A$
Finite observations $\Omega$
Observation function $O : S \to \Omega$
Utility function $u : S \times A \to \mathbb{R}$ (or $u : S \times \Omega \to \mathbb{R}$)
Transition function $T : S \times A \to S$
$V_\gamma^*(s) = \max_{a \in A}\,[\,u(s,a) + \gamma V_\gamma^*(T(s,a))\,]$
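A compact sketch of solving that recursion by repeated backups (illustrative Python; the state set, action set, utility function and deterministic transition function are supplied by the caller):

```python
def value_iteration(states, actions, u, T, gamma, sweeps=1000):
    """Iterate V(s) <- max over a of [u(s, a) + gamma * V(T(s, a))] toward
    the fixed point V_gamma^*(s) of a deterministic FSM."""
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        V = {s: max(u(s, a) + gamma * V[T(s, a)] for a in actions)
             for s in states}
    return V
```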

Page 87: Response Regret

Beliefs

Suppose $S$ is a set of states:
$T(s,a)$: next state; $O(s)$: observation; $u(s,a)$: value
$V_\gamma^*(s) = \max_{a \in A}\,[\,u(s,a) + \gamma V_\gamma^*(T(s,a))\,]$

Suppose $B$ is a distribution over states:
$T(B,a,o)$: next belief; $O(B,o)$: probability of observation $o$; $u(B,a)$: expected value
$V_\gamma^*(B) = \max_{a \in A}\,[\,u(B,a) + \gamma \sum_{o} O(B,o)\, V_\gamma^*(T(B,a,o))\,]$