
Page 1: Response Regret

Response Regret

Martin Zinkevich, AAAI Fall Symposium, November 5th, 2005

This work was supported by NSF Career Grant #IIS-0133689.

Page 2: Response Regret

Outline

Introduction
Repeated Prisoners’ Dilemma
Tit-for-Tat
Grim Trigger
Traditional Regret
Response Regret
Conclusion

Page 3: Response Regret

The Prisoner’s Dilemma

Two prisoners (Alice and Bob) are caught for a small crime. They make a deal not to squeal on each other about a larger crime they committed together.

Then the authorities meet with each prisoner separately and offer a pardon for the small crime if that prisoner turns his or her partner in for the large crime.

Each has two options: Cooperate with the other prisoner, or Defect from the deal.

Page 4: Response Regret

Bimatrix Game

                   Bob Cooperates                  Bob Defects
Alice Cooperates   Alice: 1 year, Bob: 1 year      Alice: 6 years, Bob: 0 years
Alice Defects      Alice: 0 years, Bob: 6 years    Alice: 5 years, Bob: 5 years

Page 5: Response Regret

Bimatrix Game

                   Bob Cooperates    Bob Defects
Alice Cooperates   -1, -1            -6, 0
Alice Defects       0, -6            -5, -5

Page 6: Response Regret

Nash Equilibrium

                   Bob Cooperates    Bob Defects
Alice Cooperates   -1, -1            -6, 0
Alice Defects       0, -6            -5, -5

The unique Nash equilibrium is mutual defection (-5, -5): whatever the other player does, each player is better off defecting.

Page 7: Response Regret

The Problem

Each player, acting to slightly improve his or her own outcome, hurts the other player; if both acted “irrationally” (cooperated), both would do better.

Page 8: Response Regret

A Better Model for Real Life

Consequences for misbehavior
These improve life
A better model: infinitely repeated games

Page 9: Response Regret

The Goal

Can we come up with algorithms whose performance guarantees hold in the presence of other intelligent agents and take the delayed consequences of actions into account?

Side effect: a goal for reinforcement learning in infinite POMDPs.

Page 10: Response Regret

Regret Versus Standard RL

Guarantees of performance during learning.

No guarantee for the “final” policy… for now.

Page 11: Response Regret

A New Measure of Regret

Traditional Regret measures immediate consequences

Response Regret measures delayed effects

Page 12: Response Regret

Outline

Introduction
Repeated Prisoners’ Dilemma
Tit-for-Tat
Grim Trigger
Traditional Regret
Response Regret
Conclusion

Page 13: Response Regret

Outline

Introduction
Repeated Prisoners’ Dilemma
Tit-for-Tat
Grim Trigger
Traditional Regret
Response Regret
Conclusion

Page 14: Response Regret

Repeated Bimatrix Game

                   Bob Cooperates    Bob Defects
Alice Cooperates   -1, -1            -6, 0
Alice Defects       0, -6            -5, -5

Page 15: Response Regret

Finite State Machine (for Bob)

[Diagram: a finite state machine for Bob. Each state is labeled with Bob's action (cooperate or defect), and each transition is labeled with the Alice action that triggers it.]

Page 16: Response Regret

Grim Trigger

[Diagram: two states, "Bob cooperates" and "Bob defects". The machine starts in the cooperate state; the first time Alice defects it moves to the defect state and stays there for every subsequent Alice action (Alice: *).]

Page 17: Response Regret

Always Cooperate

[Diagram: a single state, "Bob cooperates", with a self-loop on every Alice action (Alice: *).]

Page 18: Response Regret

Always Defect

[Diagram: a single state, "Bob defects", with a self-loop on every Alice action (Alice: *).]

Page 19: Response Regret

Tit-for-Tat

[Diagram: two states, "Bob cooperates" and "Bob defects". Bob repeats Alice's last action: when Alice cooperates the machine moves to (or stays in) the cooperate state, and when Alice defects it moves to (or stays in) the defect state.]
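These machines are small enough to write down directly. The sketch below is my own illustration (the dictionary encoding and state names are assumptions, not the talk's notation): it encodes tit-for-tat and grim trigger as state machines for Bob and runs one against a fixed sequence of Alice's actions.

```python
# Illustrative sketch: Bob's strategies as two-state machines.
# Each machine maps its current state to Bob's action and transitions
# on Alice's observed action ('C' or 'D').
TIT_FOR_TAT = {
    "start": "coop",
    "action": {"coop": "C", "defect": "D"},
    "next": {("coop", "C"): "coop",   ("coop", "D"): "defect",
             ("defect", "C"): "coop", ("defect", "D"): "defect"},
}

GRIM_TRIGGER = {
    "start": "coop",
    "action": {"coop": "C", "defect": "D"},
    "next": {("coop", "C"): "coop",     ("coop", "D"): "defect",
             ("defect", "C"): "defect", ("defect", "D"): "defect"},
}

def run_bob(machine, alice_actions):
    """Return Bob's actions when his machine faces a fixed Alice sequence."""
    state, bob_actions = machine["start"], []
    for a in alice_actions:
        bob_actions.append(machine["action"][state])
        state = machine["next"][(state, a)]
    return bob_actions

# Alice defects once in round 2: tit-for-tat punishes for one round,
# grim trigger punishes forever.
print(run_bob(TIT_FOR_TAT, list("CDCCC")))   # ['C', 'C', 'D', 'C', 'C']
print(run_bob(GRIM_TRIGGER, list("CDCCC")))  # ['C', 'C', 'D', 'D', 'D']
```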

Page 20: Response Regret

Discounted Utility

[Diagram: the tit-for-tat machine, with discounting read as a random stopping process. After every step the play continues (GO) with probability 2/3 and stops (STOP) with probability 1/3; the example run records Alice's per-step payoffs (e.g. C: -1, D: 0, C against a defection: -6) along the way.]

Page 21: Response Regret

Discounted Utility

The expected value of that process is the discounted utility $\sum_{t=1}^{\infty} \gamma^{t-1} u_t$.
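As a sanity check on the random-stopping reading of discounting, here is a small illustration (my own sketch, not the slide's): the Monte Carlo average of runs that continue with probability γ after each step approaches the discounted sum.

```python
import random

def discounted_utility(utilities, gamma):
    """sum over t >= 1 of gamma^(t-1) * u_t, over a finite prefix."""
    return sum(gamma ** t * u for t, u in enumerate(utilities))

def stopped_run_utility(utilities, gamma, rng):
    """One draw of the GO/STOP process: collect each step's reward,
    continuing with probability gamma and stopping otherwise."""
    total = 0.0
    for u in utilities:
        total += u
        if rng.random() >= gamma:  # STOP with probability 1 - gamma
            break
    return total

rng = random.Random(0)
stream = [-1, 0, -5, -5, -5, -5]   # an example payoff stream for Alice
exact = discounted_utility(stream, 2 / 3)
estimate = sum(stopped_run_utility(stream, 2 / 3, rng) for _ in range(100_000)) / 100_000
print(exact, estimate)             # the two numbers should be close
```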

Page 22: Response Regret

Optimal Value Functions for FSMs

$V_\gamma^*(s)$: discounted utility of the OPTIMAL policy from state $s$.

$V_0^*(s)$: maximum immediate utility at state $s$.

$V_\gamma^*(B)$: discounted utility of the OPTIMAL policy given a belief $B$ over states.

$V_0^*(B)$: maximum immediate utility given a belief $B$ over states.

(GO with probability $\gamma$, STOP with probability $1-\gamma$.)

Page 23: Response Regret

Best Responses, Discounted Utility

If $\gamma > 1/5$, a policy is a best response to grim trigger iff it always cooperates when playing grim trigger.

[Diagram: the grim trigger machine shown earlier.]
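A one-deviation check of that threshold, using the payoff matrix above (my own arithmetic, not on the slide): compare cooperating forever with defecting once and then facing mutual defection forever.

$$
\sum_{t \ge 1} \gamma^{t-1}(-1) \;=\; \frac{-1}{1-\gamma}
\qquad\text{versus}\qquad
0 + \sum_{t \ge 2} \gamma^{t-1}(-5) \;=\; \frac{-5\gamma}{1-\gamma}.
$$

Cooperating is strictly better exactly when $-1/(1-\gamma) > -5\gamma/(1-\gamma)$, that is, when $\gamma > 1/5$.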

Page 24: Response Regret

Best Responses, Discounted Utility

Similarly, if $\gamma > 1/5$, a policy is a best response to tit-for-tat iff it always cooperates when playing tit-for-tat.

[Diagram: the tit-for-tat machine shown earlier.]

Page 25: Response Regret

Knowing Versus Learning

Given a known FSM for the opponent, we can determine the optimal policy (for some $\gamma$) from an initial state.

However, if it is an unknown FSM, by the time we learn what it is, it will be too late to act optimally.

Page 26: Response Regret

Grim Trigger or Always Cooperate?

[Diagrams: the grim trigger machine and the always-cooperate machine, side by side. Both cooperate for as long as Alice cooperates, so Alice cannot tell them apart without defecting, and if Bob is in fact playing grim trigger that experiment can never be undone.]

For learning, optimality from the initial state is a bad goal.

Page 27: Response Regret

Deterministic Infinite SMs

Can represent any deterministic policy
De-randomization

[Diagram: a policy drawn as an infinite tree of histories, each node labeled with the action (C or D) played there.]

Page 28: Response Regret

New Goal

Can a measure of regret allow us to play like tit-for-tat in the Infinitely Repeated Prisoner’s Dilemma?

In addition, it should be possible for one algorithm to minimize regret against all possible opponents (finite and infinite SMs).

Page 29: Response Regret

Outline

Introduction
Repeated Prisoners’ Dilemma
Tit-for-Tat
Grim Trigger
Traditional Regret
Response Regret
Conclusion

Page 30: Response Regret

Traditional Regret: Rock-Paper-Scissors

                       Bob plays Rock    Bob plays Paper    Bob plays Scissors
Alice plays Rock       Tie               Bob wins $1        Alice wins $1
Alice plays Paper      Alice wins $1     Tie                Bob wins $1
Alice plays Scissors   Bob wins $1       Alice wins $1      Tie

Page 31: Response Regret

Traditional Regret: Rock-Paper-Scissors

                       Bob plays Rock    Bob plays Paper    Bob plays Scissors
Alice plays Rock        0, 0             -1, 1               1, -1
Alice plays Paper       1, -1             0, 0              -1, 1
Alice plays Scissors   -1, 1              1, -1              0, 0

Page 32: Response Regret

Rock-Paper-Scissors: Bob plays BR to Alice’s Last

Page 33: Response Regret

Rock-Paper-Scissors: Bob plays BR to Alice’s Last

Page 34: Response Regret

Rock-Paper-Scissors: Bob plays BR to Alice’s Last

Page 35: Response Regret

Utility of the Algorithm

Define $u_t$ to be the utility of ALG at time $t$.

Define $u_0^{ALG}$ to be:

$u_0^{ALG} = \frac{1}{T} \sum_{t=1}^{T} u_t$

Here: $u_0^{ALG} = \frac{1}{5}\,(0 + 1 + (-1) + 1 + 0) = 1/5$.

Page 36: Response Regret

Rock-Paper-Scissors: Visit Counts for Bob’s Internal States

[Diagram: Bob's internal states with visit counts 3, 1, and 1.]

$u_0^{ALG} = 1/5$

Page 37: Response Regret

Rock-Paper-Scissors: Frequencies

[Diagram: the same states with empirical frequencies 3/5, 1/5, and 1/5.]

$u_0^{ALG} = 1/5$

Page 38: Response Regret

Rock-Paper-Scissors: Dropped in according to the Frequencies

[Diagram: Alice is dropped into a state according to the frequencies 3/5, 1/5, 1/5; against this belief, the expected immediate utilities of her three actions are 0, 2/5, and -2/5.]

$u_0^{ALG} = 1/5$

Page 39: Response Regret

Traditional Regret

Consider $B$ to be the empirical frequency with which states were visited.

Define $u_0^{ALG}$ to be the average utility of the algorithm. The traditional regret of ALG is:

$R_0 = V_0^*(B) - u_0^{ALG}$

Here $V_0^*(B) = 2/5$ (the best action against $B$) and $u_0^{ALG} = 1/5$, so $R_0 = (2/5) - (1/5) = 1/5$.
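A minimal sketch of this computation in Python (my own illustration; the payoff dictionary and the assumption that each of Bob's internal states deterministically fixes his next play are mine):

```python
# Alice's rock-paper-scissors payoff: PAYOFF[alice_action][bob_action]
PAYOFF = {
    "R": {"R": 0, "P": -1, "S": 1},
    "P": {"R": 1, "P": 0, "S": -1},
    "S": {"R": -1, "P": 1, "S": 0},
}

def traditional_regret(alice_plays, bob_plays):
    """R_0 = V_0*(B) - u_0^ALG, where B is the empirical distribution of
    Bob's play (each of his states determines his action)."""
    T = len(alice_plays)
    u_alg = sum(PAYOFF[a][b] for a, b in zip(alice_plays, bob_plays)) / T
    v_star = max(sum(PAYOFF[a][b] for b in bob_plays) / T for a in "RPS")
    return v_star - u_alg
```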

Page 40: Response Regret

Traditional Regret

Goal: regret approaches zero almost surely.
There exists an algorithm that achieves this against all opponents.

Page 41: Response Regret

What Algorithm?

Gradient Ascent With Euclidean Projection (Zinkevich, 2003):

(when each $p_i$ is strictly positive)
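For reference, a sketch (my own Python, under the standard form of online gradient ascent with Euclidean projection onto the probability simplex; the step size eta is a free parameter, and this is not necessarily the slide's exact update):

```python
def project_to_simplex(v):
    """Euclidean projection of a real vector onto the probability simplex."""
    u = sorted(v, reverse=True)
    cumulative, tau = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        cumulative += uj
        threshold = (cumulative - 1.0) / j
        if uj - threshold > 0:     # holds for the leading coordinates only
            tau = threshold
    return [max(x - tau, 0.0) for x in v]

def gradient_ascent_step(p, utilities, eta):
    """Move the mixed strategy in the direction of higher utility,
    then project back onto the simplex."""
    return project_to_simplex([pi + eta * ui for pi, ui in zip(p, utilities)])
```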

Page 42: Response Regret

What Algorithm?

Exponential Weighted Experts (Littlestone + Warmuth, 1994):

And a close relative:
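For reference, a sketch of the standard exponentially weighted experts update in Python (an assumed textbook form, not necessarily the slide's exact variant):

```python
import math

def exp_weights_step(weights, utilities, eta):
    """w_i <- w_i * exp(eta * u_i); the mixed strategy plays action i
    with probability w_i / sum(w)."""
    new_weights = [w * math.exp(eta * u) for w, u in zip(weights, utilities)]
    total = sum(new_weights)
    return new_weights, [w / total for w in new_weights]
```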

Page 43: Response Regret

What Algorithm?

Regret Matching:
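A sketch of the usual regret-matching rule (Hart and Mas-Colell's formulation, written in Python as my own illustration): play each action with probability proportional to its positive cumulative regret.

```python
def regret_matching_probs(cumulative_regrets):
    """Each action's probability is its positive cumulative regret,
    normalized; if no regret is positive, fall back to uniform play."""
    positive = [max(r, 0.0) for r in cumulative_regrets]
    total = sum(positive)
    n = len(cumulative_regrets)
    return [p / total for p in positive] if total > 0 else [1.0 / n] * n
```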

Page 44: Response Regret

What Algorithm?

Lots of them!

Page 45: Response Regret

Extensions to Traditional Regret

(Foster and Vohra, 1997)

Into the past…
Have a short history.
Optimal against BR to Alice’s Last.

Page 46: Response Regret

Extensions to Traditional Regret

(Auer et al.) We only see $u_t$, not $u_{i,t}$:

Use an unbiased estimator of $u_{i,t}$:
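For reference, a sketch of the standard importance-weighted estimator used in this bandit setting (an assumed form; only the played action's utility is observed):

```python
def estimate_full_utility(played, observed_utility, probs):
    """Unbiased estimate of the whole utility vector from bandit feedback:
    u_hat[i] = u_t / p_i if i is the action actually played, else 0."""
    return [observed_utility / probs[i] if i == played else 0.0
            for i in range(len(probs))]
```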

Page 47: Response Regret

Outline

Introduction
Repeated Prisoners’ Dilemma
Tit-for-Tat
Grim Trigger
Traditional Regret
Response Regret
Conclusion

Page 48: Response Regret

This Talk

Do you want to? Even then, is it possible?

Page 49: Response Regret

Traditional Regret: Prisoner’s Dilemma

[Diagram: the tit-for-tat machine, with the example play unfolding as (C,C), (D,C), (D,D), (D,D), …: Alice defects in round 2, and from round 3 on both players defect.]

Page 50: Response Regret

Traditional Regret:Prisoner’s Dilemma

[Diagram: the tit-for-tat machine with the empirical state frequencies from the play above: Bob cooperates 0.2 of the time and defects 0.8 of the time.]

Against this belief, Alice's immediate expected utilities are: Alice defects: -4; Alice cooperates: -5.

Page 51: Response Regret

Traditional Regret

                   Bob Cooperates    Bob Defects
Alice Cooperates   -1, -1            -6, 0
Alice Defects       0, -6            -5, -5

Page 52: Response Regret

The New Dilemma

Traditional regret forces greedy, short-sighted behavior.

A new concept is needed.

Page 53: Response Regret

A New Measurement of Regret

[Diagram: the tit-for-tat machine with the same empirical frequencies (Bob cooperates 0.2, defects 0.8).]

Use $V_\gamma^*(B)$ instead of $V_0^*(B)$.

Page 54: Response Regret

Response Regret

Consider $B$ to be the empirical distribution over states visited.

Define $u_0^{ALG}$ to be the average utility of the algorithm. Traditional regret is:

$R_0 = V_0^*(B) - u_0^{ALG}$

Response regret is: $R_\gamma = V_\gamma^*(B) - \,?$

Page 55: Response Regret

Averaged Discounted Utility

Utility of the algorithm at time $t'$: $u_{t'}$

Discounted utility from time $t$: $\sum_{t'=t}^{\infty} \gamma^{t'-t} u_{t'}$

Averaged discounted utility from 1 to $T$: $u_\gamma^{ALG} = \frac{1}{T} \sum_{t=1}^{T} \sum_{t'=t}^{\infty} \gamma^{t'-t} u_{t'}$

Dropped in at random but playing optimally: $V_\gamma^*(B)$

Response regret: $R_\gamma = V_\gamma^*(B) - u_\gamma^{ALG}$
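A small sketch of the averaged discounted utility in Python (my own illustration; it truncates the inner sum at the observed horizon T rather than running to infinity):

```python
def averaged_discounted_utility(utilities, gamma):
    """(1/T) * sum over t of sum over t' >= t of gamma^(t'-t) * u_{t'},
    with the inner sum truncated at the observed horizon."""
    T = len(utilities)
    total = 0.0
    for t in range(T):
        total += sum(gamma ** (tp - t) * utilities[tp] for tp in range(t, T))
    return total / T
```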

Page 56: Response Regret

Response Regret

Consider $B$ to be the empirical distribution over states visited.

Traditional regret is: $R_0 = V_0^*(B) - u_0^{ALG}$

Response regret is: $R_\gamma = V_\gamma^*(B) - u_\gamma^{ALG}$

Page 57: Response Regret

Comparing Regret Measures: when Bob Plays Tit-for-Tat

[Diagram: the tit-for-tat machine with empirical frequencies Bob cooperates 0.2, Bob defects 0.8, under the play (C,C), (D,C), (D,D), (D,D), … (Alice defects in round 2).]

$R_0 = 1/10$ (best: defect)
$R_{1/5} = 0$ (any policy)
$R_{2/3} = 203/30 \approx 6.76$ (best: always cooperate)

Page 58: Response Regret

Comparing Regret Measures: when Bob Plays Tit-for-Tat

[Diagram: the tit-for-tat machine with empirical frequencies Bob cooperates 1.0, Bob defects 0.0, under the play (C,C), (C,C), (C,C), … (both always cooperate).]

$R_0 = 1$ (best: defect)
$R_{1/5} = 0$ (any policy)
$R_{2/3} = 0$ (best: always cooperate / tit-for-tat / grim trigger)

Page 59: Response Regret

Comparing Regret Measures: when Bob Plays Grim Trigger

[Diagram: the grim trigger machine with empirical frequencies Bob cooperates 0.2, Bob defects 0.8, under the play (C,C), (D,C), (D,D), (D,D), … (Alice defects in round 2).]

$R_0 = 1/10$ (best: defect)
$R_{1/5} = 0$ (grim trigger / tit-for-tat / always defect)
$R_{2/3} = 11/30$ (grim trigger / tit-for-tat)

Page 60: Response Regret

Comparing Regret Measures: when Bob Plays Grim Trigger

[Diagram: the grim trigger machine with empirical frequencies Bob cooperates 1.0, Bob defects 0.0, under the play (C,C), (C,C), (C,C), … (both always cooperate).]

$R_0 = 1$ (best: defect)
$R_{1/5} = 0$ (always cooperate / always defect / tit-for-tat / grim trigger)
$R_{2/3} = 0$ (always cooperate / tit-for-tat / grim trigger)

Page 61: Response Regret

Regrets

Play (Alice / Bob)                  vs Tit-for-Tat                        vs Grim Trigger
C D D D … / C C D D …               R_0 = 0.1, R_1/5 = 0, R_2/3 ≈ 6.76    R_0 = 0.1, R_1/5 = 0, R_2/3 ≈ 0.36
C C C C … / C C C C …               R_0 = 1, R_1/5 = 0, R_2/3 = 0         R_0 = 1, R_1/5 = 0, R_2/3 = 0

Page 62: Response Regret

What it Measures:

Constantly missed opportunities → high response regret
A few drastic mistakes → low response regret
Convergence implies a Nash equilibrium of the repeated game

Page 63: Response Regret

Philosophy

Response regret cannot be known without knowing the opponent.

Response regret can be estimated while playing the opponent, so that the estimate in the limit will be exact a.s.

Page 64: Response Regret

Determining Utility of a Policy in a State

If I want to know the discounted utility of using a policy P from the third state visited…

Use the policy P from the third time step ad infinitum, and take the discounted reward.

S1 S2 S3 S4 S5

Page 65: Response Regret

Determining Utility of a Policy in a State in Finite Time

Start using the policy P from the third time step; after each step, continue using P with probability $\gamma$. Take the total reward over the time steps P was used.

In EXPECTATION, the same as before.

S1 S2 S3 S4 S5

Page 66: Response Regret

Determining Utility of a Policy in a State in Finite Time Without ALWAYS Using It

With some fixed probability, start using the policy P from the third time step; after each step, continue using P with probability $\gamma$. Take the total reward over the time steps P was used and divide it by the starting probability.

In EXPECTATION, this is the same as before. Any finite number of policies can be estimated at the same time this way.

S1 S2 S3 S4 S5
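A sketch of this estimator (illustrative Python; the iterable-of-rewards interface and the name epsilon for the switch-in probability are my own assumptions):

```python
import random

def estimate_policy_value(rewards_under_P, epsilon, gamma, rng=None):
    """With probability epsilon, switch to policy P here; keep following it
    with probability gamma per step; sum the rewards collected while P ran
    and divide by epsilon. The estimate is 0 on the rounds we do not switch,
    so in expectation it equals P's discounted value from this state."""
    rng = rng or random.Random()
    if rng.random() >= epsilon:
        return 0.0                     # did not try P this time
    total = 0.0
    for r in rewards_under_P:          # one reward per step of following P
        total += r
        if rng.random() >= gamma:      # STOP with probability 1 - gamma
            break
    return total / epsilon
```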

Page 67: Response Regret

Traditional Regret

Goal: regret approaches zero almost surely.
There exists an algorithm for all opponents.

Page 68: Response Regret

Response Regret

Goal: regret approaches zero almost surely.
There exists an algorithm for all opponents.

Page 69: Response Regret

A Hard Environment: The Combination Lock Problem

[Diagram: a "combination lock" machine for Bob. Bob defects in every state but one; only the right sequence of Alice's actions (the combination) advances the machine to the single state where Bob cooperates, and a wrong action sends it back toward the start.]

Page 70: Response Regret

SPEED!

Response regret takes time to minimize (combination lock problem).

Current work: restricting the adversary’s choice of policies. In particular, if the number of policies is N, then the regret is linear in N and polynomial in $1/(1-\gamma)$.

Page 71: Response Regret

Related Work

Other work:
de Farias and Megiddo, 2004
Browning, Bowling, and Veloso, 2004
Bowling and McCracken, 2005

Episodic solutions face problems similar to the Finitely Repeated Prisoner’s Dilemma.

Page 72: Response Regret

What is in a Name?

Why not Consequence Regret?

Page 73: Response Regret

Questions?

Thanks to: Avrim Blum (CMU), Michael Bowling (U Alberta), Amy Greenwald (Brown), Michael Littman (Rutgers), Rich Sutton (U Alberta).

Page 74: Response Regret

Always Cooperate

[Diagram: the always-cooperate machine, with an example play in which Alice defects after the first round while Bob keeps cooperating.]

$R_0 = 1/10$
$R_{1/5} = 1/10$
$R_{2/3} = 1/10$

Page 75: Response Regret

Practice

Using these estimation techniques, it is possible to minimize response regret (make it approach zero almost surely in the limit in an ARBITRARY environment).

Similar to the Folk Theorems, it is also possible to converge to the socially optimal behavior if $\gamma$ is close enough to 1. (???)

Page 76: Response Regret

Traditional Regret: Prisoner’s Dilemma

[Diagram: the tit-for-tat machine.]

Page 77: Response Regret

Possible Outcomes

Alice cooperates, Bob cooperates: Alice 1 year, Bob 1 year.
Alice defects, Bob cooperates: Alice 0 years, Bob 6 years.
Alice cooperates, Bob defects: Alice 6 years, Bob 0 years.
Alice defects, Bob defects: Alice 5 years, Bob 5 years.

Page 78: Response Regret

Bimatrix Game

                   Bob Cooperates                  Bob Defects
Alice Cooperates   Alice: 1 year, Bob: 1 year      Alice: 6 years, Bob: 0 years
Alice Defects      Alice: 0 years, Bob: 6 years    Alice: 5 years, Bob: 5 years

Page 79: Response Regret

Repeated Bimatrix Game

The same one-shot game is played repeatedly.

Either average reward or discounted reward is considered.

Page 80: Response Regret

Rock-Paper-Scissors: Bob plays BR to Alice’s Last

Page 81: Response Regret

Rock-Paper-Scissors: Bob plays BR to Alice’s Last

Page 82: Response Regret

Rock-Paper-Scissors: Bob plays BR to Alice’s Last

Page 83: Response Regret

Rock-Paper-Scissors: Bob plays BR to Alice’s Last

Page 84: Response Regret

Rock-Paper-Scissors: Bob plays BR to Alice’s Last

Page 85: Response Regret

One Slide Summary

Problem: Prisoner’s Dilemma
Solution: Infinitely Repeated Prisoner’s Dilemma
Same Problem: Traditional Regret
Solution: Response Regret

Page 86: Response Regret

Formalism for FSMs: $(S, A, \Omega, O, u, T)$

States $S$
Finite actions $A$
Finite observations $\Omega$
Observation function $O : S \to \Omega$
Utility function $u : S \times A \to \mathbb{R}$ (or $u : S \times \Omega \to \mathbb{R}$)
Transition function $T : S \times A \to S$
$V_\gamma^*(s) = \max_{a \in A}\,[\,u(s,a) + \gamma V_\gamma^*(T(s,a))\,]$
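A compact sketch of solving that recursion by repeated backups (illustrative Python; the state set, action set, utility function and deterministic transition function are supplied by the caller):

```python
def value_iteration(states, actions, u, T, gamma, sweeps=1000):
    """Iterate V(s) <- max over a of [u(s, a) + gamma * V(T(s, a))] toward
    the fixed point V_gamma^*(s) of a deterministic FSM."""
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        V = {s: max(u(s, a) + gamma * V[T(s, a)] for a in actions)
             for s in states}
    return V
```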

Page 87: Response Regret

Beliefs

Suppose $S$ is a set of states:
$T(s,a)$: next state; $O(s)$: observation; $u(s,a)$: value
$V_\gamma^*(s) = \max_{a \in A}\,[\,u(s,a) + \gamma V_\gamma^*(T(s,a))\,]$

Suppose $B$ is a distribution over states:
$T(B,a,o)$: next belief; $O(B,o)$: probability of observation $o$; $u(B,a)$: expected value
$V_\gamma^*(B) = \max_{a \in A}\,[\,u(B,a) + \gamma \sum_{o} O(B,o)\, V_\gamma^*(T(B,a,o))\,]$