
Page 1:

Dipartimento di Elettronica e Informazione

Multiagent rational decision making: searching and learning for “good” strategies

Enrique Munoz de Cote

What is “good” multiagent learning?

Page 2:

The prescriptive non-cooperative agenda [Shoham et al. 07]

We are interested in problems where an agent needs to interact in open environments populated by other agents.

What's a “good” strategy in this situation?

Can the monkey find a “good” strategy?

or does it need to learn?

View: a single-agent perspective of the multiagent problem.

What counts as "good" is environment dependent.

Page 3:

Multiagent Reinforcement Learning Framework

                           Single agent                  Multiple agents

known world: solving       Decision Theory, Planning     matrix games

unknown world: learning    MDPs                          stochastic games

Page 4:

Dipartimento di Elettronica ed Informazione

What is “good” multiagent learning?

Game theory and multiagent learning: brief backgrounds

• Game theory

  − Stochastic games

  − Solution concepts

• Multiagent learning

  − Solution concepts

  − Relation to game theory

Page 5:

Stochastic games (SG)

SGs are good examples of how agents' behaviours depend on each other.

Strategies represent the way agents behave.

Strategies might change as a function of other strategies.

Game theory mathematically captures behaviour in strategic situations.
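For reference, the formalism behind these slides (standard in the literature, though not spelled out here): a stochastic game is a tuple

```latex
% A discounted n-player stochastic game:
%   S                  : set of states
%   A_i                : action set of player i
%   T(s' | s, a_1..a_n): transition function over joint actions
%   R_i(s, a_1..a_n)   : reward function of player i
%   gamma              : discount factor
\[
G = \langle S,\; A_1, \dots, A_n,\; T,\; R_1, \dots, R_n,\; \gamma \rangle
\]
```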


Page 6:

A Computational Example: SG version of chicken

actions: U, D, R, L, X

coin flip on collision

semiwalls (passable with probability 0.5)

collision = -5;

step cost = -1;

goal = +100;

discount factor = 0.95;

Both agents can reach the goal. (SG of chicken [Hu & Wellman, 03])

[Grid-world figure: agents A and B; $$ marks the goal]
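To make the setup concrete, a minimal Python sketch of this SG. The 3x3 layout, start cells, goal position, and semiwall placement are assumptions for illustration; the action set, rewards, discount, and coin-flip collision rule come from the slide.

```python
import random

COLLISION, STEP, GOAL, GAMMA = -5.0, -1.0, +100.0, 0.95
ACTIONS = {"U": (-1, 0), "D": (1, 0), "R": (0, 1), "L": (0, -1), "X": (0, 0)}

class ChickenSG:
    def __init__(self):
        self.pos = {"A": (2, 0), "B": (2, 2)}   # assumed start corners
        self.goal = (0, 1)                      # assumed goal cell ($$)
        self.semiwalls = {((2, 0), (1, 0)), ((2, 2), (1, 2))}  # assumed

    def step(self, joint_action):               # e.g. {"A": "U", "B": "L"}
        targets = {}
        for agent, act in joint_action.items():
            dr, dc = ACTIONS[act]
            r, c = self.pos[agent]
            tgt = (min(max(r + dr, 0), 2), min(max(c + dc, 0), 2))
            # a semiwall lets the agent through only 50% of the time
            if (self.pos[agent], tgt) in self.semiwalls and random.random() < 0.5:
                tgt = (r, c)
            targets[agent] = tgt
        if targets["A"] == targets["B"]:        # collision: coin flip, loser bounces
            loser = random.choice(["A", "B"])
            targets[loser] = self.pos[loser]
            rewards = {a: COLLISION for a in targets}
        else:
            rewards = {a: STEP for a in targets}
        self.pos = targets
        for agent, p in self.pos.items():
            if p == self.goal:
                rewards[agent] = GOAL
        done = any(p == self.goal for p in self.pos.values())
        return self.pos, rewards, done
```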

Page 7:

Strategies on the SG of chicken

Average expected reward (A, B):

(88.3,43.7);

(43.7,88.3);

(66,66);

(43.7,43.7);

(38.7,38.7);

(83.6,83.6)


Page 8:

Equilibrium values

Average total reward in equilibrium:

Nash

• (88.3,43.7) very imbalanced, inefficient

• (43.7,88.3) very imbalanced, inefficient

• (53.6,53.6) ½ mix, still inefficient

Correlated

• ([43.7,88.3],[43.7,88.3]);

Minimax

• (43.7,43.7);

Friend

• (38.7,38.7)

Equilibria are computationally difficult to find in general.

Page 9:

Repeated Games

What if agents are allowed to play multiple times?

Strategies:

• Can be a function of history

• Can be randomized

A Nash equilibrium still exists.
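To make "function of history" and "randomized" concrete, a small illustration (mine, not from the slides) of two such strategies for a repeated two-action game:

```python
import random

def tit_for_tat(opp_history):
    """History-dependent: cooperate first, then mirror the opponent."""
    return "C" if not opp_history else opp_history[-1]

def generous_tit_for_tat(opp_history, forgive=0.1):
    """Randomized variant: occasionally forgive a defection."""
    if not opp_history or opp_history[-1] == "C":
        return "C"
    return "C" if random.random() < forgive else "D"
```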

Page 10:

Computing strategies for repeated SGs

Complete information: solve

• Exact or approximate solutions

Incomplete information: learn

• The environment (as perceived by the agent) is not Markovian

• Convergence is not guaranteed

− Exceptions: zero-sum and team games

• Unwanted cycles and unpredictable behaviour appear

There are algorithms for solving and learning that use the same successive approximations to the Bellman equations to derive solution policies.
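A minimal sketch of that shared machinery: plain value iteration for a known MDP, i.e. successive approximation to the Bellman optimality equation. The multiagent solvers on the next slide reuse this loop, replacing the max over actions with an equilibrium operator.

```python
def value_iteration(states, actions, T, R, gamma=0.95, tol=1e-6):
    # T[s][a] -> list of (prob, next_state); R[s][a] -> immediate reward
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(
                R[s][a] + gamma * sum(p * V[s2] for p, s2 in T[s][a])
                for a in actions
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V
```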

Page 11:

Learning equilibrium strategies in SGs

Multiagent RL updates are based on the Bellman equations (just as RL):

A value iteration (VI) algorithm solves for the optimal Q function.

Finding a solution via VI depends on the operator Eq{·}:

How can multiagent RL learn any of those strategies?

• Friend-Q: max{·}

• Foe-Q: maxmin{·}

• Nash-Q: Nash{·}

• CE-Q: CE{·}
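The shared update these variants instantiate, written in standard notation (the slide's equation images did not survive extraction), is the Bellman backup over joint actions with the chosen operator Eq{·} in place of the max:

```latex
% Generalized multiagent value-iteration / Q-learning backup:
% Eq{.} is the equilibrium operator (max, maxmin, Nash, or CE).
\[
Q(s, \vec{a}) \leftarrow R(s, \vec{a})
  + \gamma \sum_{s'} T(s' \mid s, \vec{a}) \, \mathrm{Eq}\{ Q(s', \cdot) \}
\]
```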

Page 12:

Defining optimality

What's A's optimal strategy?

• the safest

• the one that minimizes the opponent's reward

• the one that maximizes the opponent's reward

• the socially stable one

In an open environment, an optimal strategy is arguable and may be defined by several criteria.

[Grid-world figure: agents A and B; $$ marks the goal]

Page 13:

Defining optimality: our criteria

Optimality: should obtain close to maximum utility against other best response algorithms.

Security: should guarantee a minimum lower bound utility.

Simplicity: should be intuitive to understand and implement.

Adaptivity: should learn how to behave optimally, and remain optimal (even if the environment changes).

Page 14:

Observation: Reinforcement Learning updates

Q-learning converges to a BR strategy in MDPs

Definition [best response]. A best-response function BR(·) returns the set of all strategies that are optimal against the environment's joint strategy.

Example environment: only agents with fixed strategies.

Observation 1: a learner's BR is optimal against fixed strategies.

Observation 2: a learner's BR can be modified by a change in the environment's fixed strategy.
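A toy illustration of both observations (example mine; payoffs are the chicken matrix used later in the talk): a stateless Q-learner facing a fixed opponent mixed strategy converges to a best response to it, and changing the opponent's mix can change that best response.

```python
import random

R_ROW = [[-10, 1], [-1, 0]]     # row player's payoffs; 0 = center, 1 = wall
OPP_MIX = [0.2, 0.8]            # fixed opponent: P(center) = 0.2, P(wall) = 0.8

def q_learn_best_response(episodes=50000, alpha=0.05, eps=0.1):
    Q = [0.0, 0.0]
    for _ in range(episodes):
        a = random.randrange(2) if random.random() < eps else Q.index(max(Q))
        b = 0 if random.random() < OPP_MIX[0] else 1
        Q[a] += alpha * (R_ROW[a][b] - Q[a])  # one-shot game, no bootstrapping
    return Q.index(max(Q)), Q

print(q_learn_best_response())  # BR vs this mix is "wall" (action 1);
# with P(wall) > 0.9 the BR flips to "center" -- observation 2 in action
```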

Page 15:

Dipartimento di Elettronica ed Informazione

What is “good” multiagent learning?

Social Rewards

• Shaping rewards and intrinsic motivations

• Leader and follower strategies

• Open questions

Joint work with: Monica Babes, Michael L. Littman

Page 16:

Social rewards: hints from the brain

We’re smart, but evolution doesn’t trust us to plan all that far ahead.

Evolution programs us to want things likely to bring about what we need:

taste -> nutrition

pleasure -> procreation

eye contact -> care

generosity -> cooperation

Social Rewards→motivations

Page 17:

Is cooperation “pleasurable”?

An fMRI study during the repeated prisoner's dilemma showed "internal rewards" (activity in the brain's reward center):

• mutual cooperation → positive (+)

• defection → negative (−)

Page 18:

Social Rewards: their telescoping effect

Objective: change the behavior of the learner by influencing its early experience.

Shaping rewards[Ng et al., 99]

Intrinsic motivation[Singh et al., 04]

Social rewards


Page 19:

Social Rewards: guiding learners to better equilibria


Page 20:

Leader and follower reasoning [Littman and Stone, 01]

A leader strategy is able to guide a best response learner.

Assumption: the opponent will adapt to its decisions.

In the example, A is a leader and B is a follower.

A best response learner is a follower.

Assumption: its behaviour doesn't hurt anybody.


Page 21:

Leader strategies

Assumption: opponent is playing a best response.

Matrix game of chicken (rows: agent A, the leader; columns: agent B, the follower; payoffs listed as A, B):

                 B: center    B: wall

A: center        -10, -10     1, -1

A: wall          -1, 1        0, 0

Leader fixed strategies:

BR_B(wall) = center;  R_A(wall, center) = -1

BR_B(center) = wall;  R_A(center, wall) = 1
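A quick sketch of the computation behind those two lines (payoff coding as in the table; helper names are mine): the leader fixes an action, the follower best-responds, and the leader keeps whichever fixed action it prefers.

```python
PAYOFF = {  # (a_action, b_action) -> (reward_A, reward_B)
    ("center", "center"): (-10, -10), ("center", "wall"): (1, -1),
    ("wall", "center"): (-1, 1),      ("wall", "wall"): (0, 0),
}
ACTS = ["center", "wall"]

def br_B(a_fixed):
    """Follower B's best response to a fixed leader action."""
    return max(ACTS, key=lambda b: PAYOFF[(a_fixed, b)][1])

best = max(ACTS, key=lambda a: PAYOFF[(a, br_B(a))][0])
print(best, br_B(best), PAYOFF[(best, br_B(best))])
# -> center wall (1, -1): leading with "center" forces B to "wall"
```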

Social Rewards→introduction

Page 22:

Leader mutual advantage strategies

Easy-to-state way: compute the convex hull.

Easy-to-compute way:

• Compute attack and defence strategies.

• Compute mutual advantage strategy.

• Use attack strategy as threat to deviations.

[Figure: the SG version of the prisoner's dilemma [Munoz de Cote and Littman, 2008], marking the one-shot Nash and the mutual advantage Nash of the repeated game]


Page 23:

How can a learner also be a leader?

We influence the best response learner's early experience with special shaping rewards called "social rewards".

The learner starts as a leader

If opponent is not a BR follower, the social shaping is washed away.


Page 24:

Shaping Based on Potentials

Idea: each state is assigned a potential Φ(s) [Ng et al., 1999]. On each transition, the agent's reward is augmented with the difference in potential.
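Spelled out (the slide's formula did not survive extraction; this is the standard form from Ng et al., 1999):

```latex
% Potential-based shaping: the learner receives the augmented reward
%   R'(s, a, s') = R(s, a, s') + F(s, s'),
% where the shaping term is the discounted difference in potential:
\[
F(s, s') = \gamma \, \Phi(s') - \Phi(s)
\]
% Ng et al. (1999) show this leaves the optimal policies unchanged.
```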


Page 25:

The Q+shaping algorithm

Compute attack and defence strategies.

Compute the mutual advantage strategy:

• For repeated matrix games, use the [Littman and Stone, 2003] algorithm

• For repeated stochastic games, use the [Munoz de Cote and Littman, 2008] algorithm

Compute the state values (potentials) for the mutual advantage strategy.

Initialize the Q-table with the potential-based function F(s,s').

• The attack strategy, used as a threat against deviations, teaches BR learners better mutual advantage strategies.

Theorem [Wiewiora 03]: shaping based on potentials has the same effect as initializing the Q function with the potential values.

Q+shaping's main objective is to lead or follow, as appropriate
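A minimal sketch of the initialization step, relying on Wiewiora's equivalence above (function and variable names are mine):

```python
# Initializing Q with the potentials is equivalent to learning with the
# shaping rewards F(s,s') = gamma * Phi(s') - Phi(s)  [Wiewiora 03].
def init_q_with_potentials(states, actions, potential):
    """potential[s]: state value under the mutual advantage strategy."""
    return {s: {a: potential[s] for a in actions} for s in states}

# A Q-learner started from this table behaves like a leader early on; if
# the opponent is not a best-response follower, ordinary Q-updates wash
# the social shaping away, and the agent falls back to following.
```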


Page 26:

Dipartimento di Elettronica ed Informazione

What is “good” multiagent learning?

A Polynomial-time Nash Equilibrium Algorithm for Repeated Stochastic Games

Joint work with: Michael L. Littman

Page 27:

Main Result

Given a repeated stochastic game, return a strategy profile that is a Nash equilibrium (specifically, one whose payoffs match the egalitarian point) of the average-payoff repeated stochastic game, in polynomial time.

Concretely, we address the following computational problem:

[Figure: convex hull of the average payoffs in (v1, v2) space, with the egalitarian line]

Page 28:

How? (the short story version)

Compute minimax (security) strategies.

• Solve two linear programming problems.

The algorithm then searches for the point P with the highest egalitarian value.

[Figure: convex hull of a hypothetical SG in (v1, v2) space, with the egalitarian line and the point P]
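For the first step, a sketch of the standard linear program that yields a security (maximin) mixed strategy of a matrix game; the scipy formulation is mine:

```python
import numpy as np
from scipy.optimize import linprog

def security_strategy(R):
    """Maximin mixed strategy for the row player of payoff matrix R:
    maximize v s.t. sum_i x_i R[i, j] >= v for all j, sum_i x_i = 1."""
    n, m = R.shape
    c = np.r_[np.zeros(n), -1.0]                  # minimize -v
    A_ub = np.c_[-R.T, np.ones(m)]                # v - x^T R[:, j] <= 0
    b_ub = np.zeros(m)
    A_eq = np.r_[np.ones(n), 0.0].reshape(1, -1)  # probabilities sum to 1
    b_eq = np.array([1.0])
    bounds = [(0, None)] * n + [(None, None)]     # x >= 0, v free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:n], res.x[n]                    # mixed strategy, security value

# e.g. matching pennies: security_strategy(np.array([[1., -1.], [-1., 1.]]))
# -> (array([0.5, 0.5]), 0.0)
```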


Page 29:

How? (the search for point P)

folkEgal(U1, U2, ε):

• Compute R = friend1, L = friend2, and the attack1, attack2 strategies.

• Find the egalitarian point and its policy:

  − If R is left of the egalitarian line: P = R

  − Else if L is right of the egalitarian line: P = L

  − Else: egalSearch(R, L, T)

[Figure: convex hulls of hypothetical SGs in (v1, v2) space, illustrating the cases P = R and P = L relative to the egalitarian line]
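The geometric core of egalSearch, sketched on average-payoff points rather than SG policies (the actual algorithm bisects over time-shared policies; the numbers reuse the chicken payoffs from earlier, where alternating the two one-sided outcomes yields (66, 66)):

```python
def egal_search(R, L, T):
    """Bisect T times between payoff points R (favors player 1) and
    L (favors player 2) toward the egalitarian line v1 = v2. The mixing
    weight w corresponds to time-sharing the two underlying policies."""
    lo, hi = 0.0, 1.0                 # weight on L
    for _ in range(T):
        w = (lo + hi) / 2.0
        v1 = (1 - w) * R[0] + w * L[0]
        v2 = (1 - w) * R[1] + w * L[1]
        if v1 > v2:
            lo = w                    # still favors player 1: move toward L
        else:
            hi = w
    w = (lo + hi) / 2.0
    return ((1 - w) * R[0] + w * L[0], (1 - w) * R[1] + w * L[1])

print(egal_search((88.3, 43.7), (43.7, 88.3), T=20))  # -> approx (66.0, 66.0)
```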

Page 30:

Complexity

The algorithm involves solving MDPs (polynomial time) and other steps that also take polynomial time.

• The algorithm is polynomial iff T is bounded by a polynomial.

Result:

Running time: polynomial in the effective horizon 1 / (1 − γ) and the approximation factor 1 / ε.


Page 31:

SG version of the PD game

[Grid-world figure: agents A and B; $$A and $$B mark their goals]

Algorithm      Agent A   Agent B   Behaviour

security-VI    46.5      46.5      mutual defection

friend-VI      46        46        mutual defection

CE-VI          46.5      46.5      mutual defection

folkEgal       88.8      88.8      mutual cooperation with threat of defection

Page 32:

Compromise game

[Grid-world figure: agents A and B; $$A and $$B mark their goals]

Algorithm      Agent A   Agent B   Behaviour

security-VI    0         0         attacker blocking goal

friend-VI      -20       -20       mutual defection

CE-VI          68.2      70.1      suboptimal waiting strategy

folkEgal       78.7      78.7      mutual cooperation (w = 0.5) with threat of defection

Page 33:

Asymmetric game

[Grid-world figure: agents A and B; $$A and $$B mark their goals]

Algorithm      Agent A   Agent B   Behaviour

security-VI    0         0         attacker blocking goal

friend-VI      -200      -200      mutual defection

CE-VI          32.1      32.1      suboptimal mutual cooperation

folkEgal       37.2      37.2      mutual cooperation with threat of defection

Page 34:

Dipartimento di Elettronica e Informazione

Thanks for your attention!