Reinforcement Learning to Play an Optimal Nash Equilibrium in Coordination Markov Games XiaoFeng Wang and Tuomas Sandholm Carnegie Mellon University


Page 1

Reinforcement Learning to Play an Optimal Nash Equilibrium in Coordination Markov Games

XiaoFeng Wang and Tuomas Sandholm

Carnegie Mellon University

Page 2

Outline

Introduction

Settings

Coordination Difficulties

Optimal Adaptive Learning

Convergence Proof

Extension: Beyond Self-play

Extension: Beyond Team Games

Conclusion and Future Works

Page 3

Coordination Games

Coordination games:

– A coordination game typically possesses multiple Nash equilibria, some of which may be Pareto dominated by others.

– Assumption: Players (self-interested agents) prefer Nash equilibria to any other steady states (for example, a best-response loop).

Objective: to play a Nash equilibrium that is not Pareto dominated by any other Nash equilibrium.

Why are coordination games important?

– Whenever an individual agent cannot achieve its goal without interacting with others, coordination problems can arise.

– The study of coordination games helps us understand how to achieve win-win outcomes in interactions and avoid getting stuck in undesirable equilibria.

– Examples: team games, Battle-of-the-Sexes, and minimum-effort games.

Page 4

Team Games

Team games:

– In a team game, agents receive the same expected rewards.

– Team games are the simplest form of coordination games.

Why are team games important?

– A team game can have multiple Nash equilibria, only some of which are optimal. This captures the important properties of a general category of coordination games. Studying team games gives us an easy start without losing important generality.

Page 5

Coordination Markov Games

Markov decision process:

– The environment is modeled as a set of states S. A decision-maker (agent) drives the state transitions to maximize the sum of its discounted long-term payoffs.

A coordination Markov game:

– Combination of an MDP and coordination games: a set of self-interested agents choose a joint action a ∈ A to determine the state transition, each maximizing its own profit. Example: team Markov games.

Relation between Markov games and repeated stage games:

– A joint Q-function maps a state-joint-action pair (s, a) to the tuple of the sums of discounted long-term rewards the individual agents receive by taking joint action a at state s and then following a joint strategy π.

– Q(s, ·) can be viewed as a stage game in which agent i receives payoff Qi(s, a) (a component of the tuple Q(s, a)) when joint action a is taken by all agents at state s. We call such a game a state game.

– A subgame-perfect Nash equilibrium (SPNE) of a coordination Markov game is composed of Nash equilibria of a sequence of coordination state games.
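For concreteness, a minimal sketch of the joint Q-function under a fixed joint strategy π, using standard MDP notation (R, P, and γ denote the reward function, transition probabilities, and discount factor, which the slides leave implicit):

```latex
Q^{\pi}(s, a) \;=\; R(s, a) \;+\; \gamma \sum_{s' \in S} P(s' \mid s, a)\, Q^{\pi}\bigl(s', \pi(s')\bigr)
```

In a team game every agent shares this same Q-function; in a general coordination Markov game each agent i has its own component Qi.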

Page 6

Reinforcement Learning (RL)

Objective of reinforcement learning:

– Find a strategy π: S → A that maximizes an agent's discounted long-term payoffs without knowledge of the environment model (reward structure and transition probabilities).

Model-based reinforcement learning:

– Learn the reward structure and transition probabilities, then compute the Q-function.

Model-free reinforcement learning:

– Learn the Q-function directly.

Learning policy:

– Interleave learning with execution of the learnt policy.

– GLIE guarantees convergence to an optimal policy for a single-agent MDP.
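To illustrate the model-based flavor combined with a GLIE policy, here is a minimal single-agent sketch; the class name, the 1/t exploration schedule, and the value-iteration sweep count are illustrative choices, not the paper's (rng is assumed to be a numpy Generator, e.g. np.random.default_rng()):

```python
import numpy as np

class ModelBasedLearner:
    """Illustrative model-based RL: estimate R and P from counts, then back up Q."""

    def __init__(self, n_states, n_actions, gamma=0.95):
        self.gamma = gamma
        self.counts = np.zeros((n_states, n_actions, n_states))  # transition counts
        self.reward_sum = np.zeros((n_states, n_actions))        # accumulated rewards
        self.Q = np.zeros((n_states, n_actions))

    def observe(self, s, a, r, s_next):
        self.counts[s, a, s_next] += 1
        self.reward_sum[s, a] += r

    def update_Q(self, sweeps=50):
        n_sa = np.maximum(self.counts.sum(axis=2), 1)
        P = self.counts / n_sa[:, :, None]                       # empirical transition model
        R = self.reward_sum / n_sa                                # empirical mean reward
        for _ in range(sweeps):                                   # value iteration on the model
            self.Q = R + self.gamma * P @ self.Q.max(axis=1)

    def act(self, s, t, rng):
        """GLIE: explore with probability 1/t (decays to zero), otherwise exploit."""
        if rng.random() < 1.0 / max(t, 1):
            return int(rng.integers(self.Q.shape[1]))
        return int(np.argmax(self.Q[s]))
```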

Page 7

RL in a Coordination Markov Game

Objective:

– Without knowing the game structure, agent i tries to find an optimal individual strategy πi: S → Ai that maximizes the sum of its discounted long-term payoffs.

Difficulties:

– Two layers of learning (learning the game structure and learning a strategy) are interdependent when learning a general Markov game: on one hand, the strategy is determined from the Q-function; on the other hand, the Q-function is learnt with respect to the joint strategy the agents take.

RL in team Markov games:

– Team Markov games simplify the learning problem: off-policy learning of the game structure, and learning of coordination over the individual state games.

– In a team Markov game, the combination of the individual agents' optimal policies is an optimal Nash equilibrium of the game.

– Although simple, this is trickier than it appears.

Page 8

Research Issues

How to play an optimal Nash equilibrium in an unknown team Markov game?

How to extend the results to a more general category of coordination stage games and Markov games?

Page 9

Outline

Introduction

Settings

Coordination Difficulties

Optimal Adaptive Learning

Convergence Proof

Extension: Beyond Self-play

Extension: Beyond Team Games

Conclusion and Future Works

Page 10

Setting:

– Agents make decisions independently and concurrently.

– There is no communication between agents.

– Agents independently receive reward signals with the same expected values.

– The environment model is unknown.

– Agents' actions are fully observable.

Objective: find an optimal joint policy π*: S → A (where A = A1 × … × An) that maximizes the sum of discounted long-term rewards.

Page 11

Outline

Introduction

Settings

Coordination Difficulties

Optimal Adaptive Learning

Convergence Proof

Extension: Beyond Self-play

Extension: Beyond Team Games

Conclusion and Future Works

Page 12

Coordination over a known game

A team game may have multiple optimal NE. Without coordination, agents do not know which one to play.

        A0     A1     A2
B0      10      0   -100
B1       0      5      0
B2    -100      0     10

Claus and Boutilier's stage game

Solutions:

– Lexicographic conventions (Boutilier)

• Problem: Sometimes the mechanism designer is unable or unwilling to impose an ordering.

– Learning:

• Each agent treats the others as nonstrategic players and best responds to the empirical distribution of their previous plays, e.g., fictitious play, adaptive play.

• Problem: The learning process may converge to a sub-optimal NE, usually a risk-dominant NE.

Page 13

Coordination over an unknown game

An unknown game structure and noisy payoffs make coordination even more difficult.

– Receiving noisy rewards independently, agents may hold different views of the game at any particular moment. In this case, even a lexicographic convention does not work.

Agent A's estimate:

        A0     A1     A2
B0     9.9      0   -100
B1       0      5      0
B2    -100      0   10.1

Agent B's estimate:

        A0     A1     A2
B0    10.1      0   -100
B1       0      5      0
B2    -100      0    9.9

Page 14

Problems

Against a known game:

– By solving the game, agents can identify all the NE but still do not know how to play.

– By myopic play (learning), agents can learn to play a consistent NE, which however may not be optimal.

Against an unknown game:

– Agents might not be able to identify the optimal NE before the game structure has fully converged.

Page 15

Outline

Introduction

Settings

Coordination Difficulties

Optimal Adaptive Learning

Convergence Proof

Extension: Beyond Self-play

Extension: Beyond Team Games

Conclusion and Future Works

Page 16

Optimal Adaptive Learning

Basic ideas:

– Over a known game: eliminate the sub-optimal NE and then use myopic play (learning) to learn how to play.

– Over an unknown game: estimate the NE of the game before the game structure converges; interleave the learning of coordination with the learning of the game structure.

Learning layers:

– Learning of coordination: biased adaptive play against virtual games.

– Learning of game structure: construct virtual games with an ε-bound on top of a model-based RL algorithm.

Page 17

Virtual Games

A virtual game (VG) is derived from a team state game Q(s, ·) as follows:

– If a is an optimal NE in Q(s, ·), then VG(s, a) = 1. Otherwise, VG(s, a) = 0.

Virtual games eliminate all strictly sub-optimal NE of the original game. This is nontrivial when there are more than two players.

Original state game Q(s, ·):

        A0     A1     A2
B0      10      0   -100
B1       0      5      0
B2    -100      0     10

Virtual game VG(s, ·):

        A0   A1   A2
B0       1    0    0
B1       0    0    0
B2       0    0    1
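A minimal sketch of this construction for one known team state game (the helper name and the use of the payoff maximum are mine; in a team game the optimal NE are exactly the joint actions achieving the maximum team payoff):

```python
import numpy as np

def virtual_game(Q_s):
    """Virtual game for one state of a team game: joint actions achieving the
    optimal (maximum) team payoff get 1, every other joint action gets 0."""
    Q_s = np.asarray(Q_s, dtype=float)
    return (Q_s == Q_s.max()).astype(int)

# Claus and Boutilier's stage game from the slides
Q_s = [[ 10,   0, -100],
       [  0,   5,    0],
       [-100,  0,   10]]
print(virtual_game(Q_s))
# [[1 0 0]
#  [0 0 0]
#  [0 0 1]]
```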

Page 18

Adaptive Play

Adaptive play (AP):

– Each agent has a limited memory that holds the m most recent plays it has observed.

– To choose an action, agent i randomly draws k samples (without replacement) from its memory and builds an empirical model of the others' joint strategy.

• For example, if a reduced joint action profile a-i (all individual actions but i's) appears K(a-i) times in the samples, agent i treats its probability as K(a-i)/k.

– Agent i then chooses an action that best responds to this distribution.

Previous work (Peyton Young) shows that AP converges to a strict NE in any weakly acyclic game.
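A minimal sketch of one AP action-selection step for agent i (the `payoff` callback and the data layout are illustrative assumptions; in a team game the payoff is the common team payoff):

```python
import random
from collections import Counter

def _insert(a_minus_i, i, ai):
    """Rebuild a full joint action from a reduced profile a_{-i} and agent i's action."""
    ja = list(a_minus_i)
    ja.insert(i, ai)
    return tuple(ja)

def adaptive_play_step(i, memory, k, actions_i, payoff, rng=random):
    """One adaptive-play step for agent i.
    memory : the m most recent observed joint actions (tuples), oldest first
    payoff : payoff(i, joint_action) -> agent i's payoff for that joint action
    """
    samples = rng.sample(memory, k)                       # k draws without replacement
    counts = Counter(tuple(a for j, a in enumerate(ja) if j != i) for ja in samples)

    def expected(ai):
        # expected payoff of ai against the empirical distribution K(a_{-i}) / k
        return sum(n * payoff(i, _insert(a_mi, i, ai)) for a_mi, n in counts.items()) / k

    best = max(expected(ai) for ai in actions_i)
    return rng.choice([ai for ai in actions_i if expected(ai) == best])
```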

Page 19

Weakly Acyclic Games and Biased Set

Weakly acyclic games (WAG):

– In a weakly acyclic game, there exists a best-response path from any strategy profile to a strict NE.

– Many virtual games are WAGs. However, not all VGs are WAGs: some VGs have only weak NE, which do not constitute absorbing states.

Weakly acyclic games w.r.t. a biased set (WAGB):

– A game in which there exists a best-response path from any strategy profile to an NE in a set D (called the biased set).

A VG whose optimal NE are strict:

        A0   A1   A2
B0       1    0    0
B1       0    0    0
B2       0    0    1

A VG with only weak NE:

        A0   A1   A2
B0       1    1    0
B1       1    0    1
B2       0    1    1

Page 20

Biased Adaptive Play

Biased adaptive play (BAP):

– Similar to AP, except that an agent biases its action selection when it detects that it is playing an NE in the biased set.

Biased rule:

– If agent i's k samples all contain the same a-i, and that a-i is part of at least one NE in D, the agent chooses its most recent best response to that profile. For example, if B's samples show that A keeps playing A0 and B's most recent best response was B0, B sticks to B0.

Biased adaptive play guarantees convergence to an optimal NE for any VG constructed over a team game, with the biased set containing all the optimal NE.
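A minimal sketch of the biased rule layered on top of the AP step above (`biased_set`, `last_best_response`, and the reuse of `adaptive_play_step` are illustrative assumptions):

```python
import random

def biased_adaptive_play_step(i, memory, k, actions_i, payoff, biased_set,
                              last_best_response, rng=random):
    """BAP: if all k samples agree on the others' profile a_{-i} and that profile
    appears in some NE of the biased set D, stick to the most recent best response
    to it; otherwise fall back to plain adaptive play."""
    samples = rng.sample(memory, k)
    reduced = {tuple(a for j, a in enumerate(ja) if j != i) for ja in samples}
    if len(reduced) == 1:                                  # all k samples show the same a_{-i}
        a_mi = next(iter(reduced))
        part_of_biased_ne = any(
            tuple(a for j, a in enumerate(ne) if j != i) == a_mi for ne in biased_set)
        if part_of_biased_ne and a_mi in last_best_response:
            return last_best_response[a_mi]                # bias: repeat the recent best response
    return adaptive_play_step(i, memory, k, actions_i, payoff, rng)  # from the AP sketch above
```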

Page 21

Construct VG over an unknown game

Basic ideas:– Using a slowly decreasing bound (called -bound) to find all

optimal NE. Specifically,• At a state s and time t, an joint action a is -optimal for the state game

in if Qt(s,a)+tmaxa’Qt(s,a’).

• A virtual game VGt is constructed over these -optimal joint actions.

• If limtt=0 and t decreases slower than Q-function, VGt converges to VG.

– Construction of -bound depends on the RL algorithm used to learn the game structure. Over a model-based reinforcement learning algorithm, we prove that the following bound meets the condition: Nb-0.5 for all 0<b<0.5, where N is the minimal number of samples made up to time t.
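A minimal sketch of the ε-bound and the resulting VGt for one state (the function names and the particular choice b = 0.25 are illustrative):

```python
import numpy as np

def epsilon_bound(n_min, b=0.25):
    """eps_t = N_t^(b - 0.5) for some 0 < b < 0.5, where N_t is the minimal
    number of samples of any state-action pair made so far."""
    return max(n_min, 1) ** (b - 0.5)

def virtual_game_t(Q_s_t, n_min, b=0.25):
    """VG_t for one state: mark every joint action whose estimated value is
    within eps_t of the current best estimate as (tentatively) optimal."""
    Q_s_t = np.asarray(Q_s_t, dtype=float)
    eps_t = epsilon_bound(n_min, b)
    return (Q_s_t + eps_t >= Q_s_t.max()).astype(int)
```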

Page 22

The Algorithm

Learning of coordination:

– For each state, construct VGt from the ε-optimal actions.

– Following a GLIE learning policy, use BAP to choose best-response actions over VGt with the exploitation probability.

Learning of game structure:

– Use model-based RL to update the Q-function.

– Update the ε-bound from the minimal number of samples; find the ε-optimal actions with the bound.
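Putting the pieces together, a minimal sketch of one OAL episode, reusing the illustrative helpers above; the `env` and `learner` interfaces (`env.step`, `learner.Q_joint`, `learner.min_samples`), the 1/t GLIE schedule, and the crude recency update are assumptions for the sketch, not the paper's code:

```python
import random
import numpy as np

def oal_episode(env, learner, memories, last_best, k, m, t, rng=random):
    """One episode of an OAL sketch, reusing virtual_game_t and
    biased_adaptive_play_step from the earlier sketches.
    memories[s]     : list of the last m joint actions observed at state s
    last_best[i][s] : dict a_{-i} -> agent i's most recent response to it
    """
    s, done = env.reset(), False
    while not done:
        t += 1
        vg = virtual_game_t(learner.Q_joint[s], learner.min_samples())       # VG_t at s
        biased = [tuple(int(x) for x in idx) for idx in np.argwhere(vg == 1)]  # biased set D
        joint = []
        for i in range(env.n_agents):
            if rng.random() < 1.0 / t or len(memories[s]) < k:               # GLIE exploration
                joint.append(rng.choice(env.actions[i]))
            else:                                                            # BAP over VG_t
                joint.append(biased_adaptive_play_step(
                    i, memories[s], k, env.actions[i],
                    payoff=lambda _i, ja: vg[ja],                            # play the VG
                    biased_set=biased, last_best_response=last_best[i][s], rng=rng))
        joint = tuple(joint)
        s_next, reward, done = env.step(joint)
        learner.observe(s, joint, reward, s_next)                            # game structure
        learner.update_Q()
        memories[s].append(joint)                                            # BAP memory
        if len(memories[s]) > m:
            memories[s].pop(0)
        for i in range(env.n_agents):                                        # crude recency update
            last_best[i][s][joint[:i] + joint[i + 1:]] = joint[i]
        s = s_next
    return t
```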

Page 23

Outline

Introduction

Settings

Coordination Difficulties

Optimal Adaptive Learning

Convergence Proof

Extension: Beyond Self-play

Extension: Beyond Team Games

Conclusion and Future Works

Page 24

Flowchart of the Proof

Theorem 1: BAP converges over a WAGB.

Lemma 2: Nonstationary Markov chains.

Theorem 3: BAP with GLIE converges over a WAGB.

Lemma 4: Any VG is a WAGB.

Theorem 5: Convergence rate of the model-based RL.

Theorem 6: The VG can be learnt with the ε-bound w.p.1.

Main Theorem: OAL converges to an optimal NE w.p.1.

Page 25

Model BAP as a Markov chain

Stationary Markov chain model:

– States:

• An initial state is composed of the m initial joint actions the agents observed: h0 = (a1, a2, …, am).

• The other states are defined inductively: the successor state h' of a state h is obtained by deleting the oldest joint action and appending the newly observed joint action.

• Absorbing states: (a, a, …, a) is an individual absorbing state if a ∈ D or a is a strict NE. All individual absorbing states are clustered into a single absorbing state.

– Transitions:

• The probability p(h, h') that state h transits to h' is positive if and only if the newly added joint action a = (a1, a2, …, an) in h' is composed of individual actions ai, each of which best responds to some set of k samples drawn from h.

• Since the distribution an agent uses to sample its memory does not depend on time, the transition probability between any two states does not change with time. Therefore the Markov chain is stationary.
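A minimal sketch of the history-state update and the absorbing-state test used in this chain (the data layout is an illustrative assumption):

```python
from collections import deque

def successor(h, new_joint_action, m):
    """Successor history: drop the oldest joint action, append the newly observed one."""
    h2 = deque(h, maxlen=m)
    h2.append(new_joint_action)
    return tuple(h2)

def is_individual_absorbing(h, biased_set, strict_ne):
    """(a, a, ..., a) is individually absorbing if a is in D or a is a strict NE."""
    return len(set(h)) == 1 and (h[0] in set(biased_set) or h[0] in set(strict_ne))
```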

Page 26

Convergence over a known game

Theorem 1: Let L(a) be the shortest length of a best-response path from joint action a to an NE in D, and let LG = max_a L(a). If m ≥ k(LG + 2), BAP over a WAGB converges to either an NE in D or a strict NE w.p.1.

Nonstationary Markov chain model:

– With a GLIE learning policy, at any moment an agent has some probability of experimenting (exploring actions other than the estimated best response). This exploration probability diminishes with time. Therefore we can model BAP with GLIE over a WAGB as a nonstationary Markov chain with transition matrix Pt. Let P be the transition matrix of the stationary Markov chain for BAP over the same WAGB. Clearly, GLIE guarantees that Pt → P as t → ∞.

In the stationary Markov chain model, there is only one absorbing state (clustering several individual absorbing states). Theorem 1 says that such a Markov chain is ergodic, with a unique stationary distribution, given m ≥ k(LG + 2). With nonstationary Markov chain theory, we obtain the following theorem:

Theorem 3: With m ≥ k(LG + 2), BAP with GLIE converges to either an NE in D or a strict NE w.p.1.

Page 27

Determine the length of best-response path

In a team game, LG is no more than n (the number of agents). The figure below illustrates this: each box represents an individual action of an agent, and a marked box represents an individual action contained in an NE. Starting from a non-NE strategy whose first n' individual actions already agree with an NE, the remaining n − n' agents can move the joint action to the NE by switching their individual actions one after another. Each switch is a best response, given that the others stick to their individual actions.

Lemma 4: The VG of any team game is a WAGB w.r.t. the set of optimal NE, with LVG ≤ n.

[Figure: n agents' individual actions, of which a prefix of length n' already matches an NE; the remaining agents switch one after another along a best-response path from the non-NE strategy to the NE.]

Page 28

Learning the virtual games

First, we assess the convergence rate of the model-based RL algorithm.

Then, we derive the sufficient condition on the ε-bound from the convergence-rate lemma.

Page 29

Main Theorem

Theorem 7: In any team Markov game among n agents, if 1) m ≥ k(n + 1) and 2) the ε-bound satisfies Lemma 6, then the OAL algorithm converges to an optimal NE w.p.1.

General idea of the proof:

– By Lemma 6, the probability of the event E that VGt = VG for the rest of play after time t converges to 1 as t → ∞.

– Starting from some time t', conditioned on E, the agents play BAP with GLIE over a known game, which converges to an optimal NE w.p.1 by Theorem 3.

– Combining these two convergence processes, we obtain the convergence result.

Page 30

Example: 2-agent game

        A0     A1     A2
B0      10      0   -100
B1       0      5      0
B2    -100      0     10

Page 31

Example: 3-agent game

      B1C1   B1C2   B1C3   B2C1   B2C2   B2C3   B3C1   B3C2   B3C3
A1      10    -20    -20    -20    -20      5    -20      5    -20
A2     -20    -20      5    -20     10    -20      5    -20    -20
A3     -20      5    -20      5    -20    -20    -20    -20     10

Page 32

Example: Multiple stage games

Page 33

Outline

Introduction

Settings

Coordination Difficulties

Optimal Adaptive Learning

Convergence Proof

Extension: Beyond Self-play

Extension: Beyond Team Games

Conclusion and Future Works

Page 34

Extension: general ideas

Classical game theory tells us how to solve a game, i.e., how to identify the fixed points of introspection. However, it says much less about how to play a game.

Standard ways to play a game:

– Solve the game first and play an NE strategy (strategic play).

• Problems: 1) With multiple NE, agents may still not know how to play. 2) It can be computationally expensive.

– Assume that the others take stationary strategies and best respond to that belief (myopic play).

• Problem: Myopic strategies may lead agents to play a sub-optimal (Pareto-dominated) NE.

The idea generalized from OAL: Partially Myopic and Partially Strategic (PMPS) play.

– Biased action selection: strategically lead the others to play a stable strategy.

– Virtual games: compute the NE first and then eliminate the sub-optimal NE.

– Adaptive play: myopically adjust the best-response strategy w.r.t. the agent's observations.

Page 35

Extension: Beyond self-play

Problem:

– OAL only guarantees convergence to an optimal NE in self-play, i.e., when all players are OAL agents. Can agents find the optimal coordination when only some of them play OAL? Consider the simplest case: two agents, one a JAL or IL player (Claus and Boutilier 98) and the other an OAL player.

A straightforward way to enforce optimal coordination:

– Make one of the two players an "opinionated" leader who leads the play, and the other the learner.

– If the other player is either a JAL or an IL player, convergence to an optimal NE is guaranteed.

– What if the other player is also a leader? More seriously, how should the leader play if it does not know the other player's type?

        A0     A1     A2
B0      10      0   -100
B1       0      5      0
B2    -100      0     10

Page 36

New Biased Rules

Original biased rule:

– If agent i's k samples all contain the same a-i, and that a-i is part of at least one NE in D, the agent chooses its most recent best response to that profile. For example, if B's samples show that A keeps playing A0 and B's most recent best response was B0, B sticks to B0.

New biased rule:

– If agent i has multiple best-response actions w.r.t. its k samples, it chooses the one included in an optimal NE of the VG. If several such choices exist, it chooses the one it has played most recently.

Difference between the old and the new rules (see the sketch below):

– The old rule biases action selection only when the others' joint strategy is part of an optimal NE; otherwise it randomizes over the best-response actions.

– The new rule always biases the agent's action selection.
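A minimal sketch of the new tie-breaking rule (the `in_optimal_ne` predicate and the `recency` map are illustrative assumptions):

```python
import random

def choose_with_new_bias(best_responses, in_optimal_ne, recency, rng=random):
    """best_responses : agent i's best-response actions w.r.t. its k samples
    in_optimal_ne  : ai -> True if ai appears in some optimal NE of the VG
    recency        : ai -> last time step at which ai was played (-1 if never)"""
    preferred = [ai for ai in best_responses if in_optimal_ne(ai)]
    if preferred:
        return max(preferred, key=lambda ai: recency.get(ai, -1))  # most recently played
    return rng.choice(list(best_responses))
```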

Page 37

Example

The new rules preserve the convergence properties in n-agent team Markov games.

Page 38

Extension: Beyond Team Games

How can the ideas of PMPS play be extended to general coordination games?

– To simplify the setting, we now consider a category of coordination stage games with the following properties:

• The games have at least one pure-strategy NE.

• Agents have compatible preferences for some of these NE over any other steady states (such as mixed-strategy NE or best-response loops).

– We consider two situations: perfect monitoring and imperfect monitoring.

• Perfect monitoring: agents can observe the others' actions and payoffs.

• Imperfect monitoring: agents observe only the others' actions.

– The agents may have no information about the game structure.

Page 39

Perfect Monitoring

Follow the same idea as OAL. Algorithm:

– Learning of coordination

• Compute all the NE of the estimated game.

• Find the NE that are dominated. For example, a strategy profile (a, b) is dominated by (a', b') if QA(a, b) < QA(a', b') − ε and QB(a, b) ≤ QB(a', b') + ε, where QA and QB denote the two agents' estimated payoffs.

• Construct a VG containing all the NE that are not dominated, setting the other entries of the VG to zero (without loss of generality, assume the agents normalize their payoffs to values between zero and one).

• With GLIE exploration, play BAP over the VG.

– Learning of game structure

• Observe the others' payoffs and update the sample means of the agents' expected payoffs in the game matrix.

• Compute an ε-bound in the same way as in OAL.

The learning over the coordination stage games discussed here is conjectured to converge, w.p.1, to an NE that is not Pareto dominated.
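As one possible reading of the elimination step, a minimal sketch that filters the estimated pure NE by ε-tolerant Pareto dominance (the exact test on the slide may differ; the matrix layout and names here are illustrative):

```python
import numpy as np

def undominated_ne(pure_ne, Q_A, Q_B, eps):
    """Keep the pure-strategy NE that no other pure NE eps-Pareto-dominates.
    pure_ne : list of (row, col) pure NE of the estimated game
    Q_A, Q_B: the two agents' estimated payoff matrices
    """
    Q_A, Q_B = np.asarray(Q_A, float), np.asarray(Q_B, float)

    def dominates(p, q):
        """p eps-dominates q: no worse for either agent (up to eps), clearly better for one."""
        return (Q_A[p] >= Q_A[q] - eps and Q_B[p] >= Q_B[q] - eps and
                (Q_A[p] > Q_A[q] + eps or Q_B[p] > Q_B[q] + eps))

    return [q for q in pure_ne if not any(dominates(p, q) for p in pure_ne if p != q)]
```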

Page 40

Imperfect Monitoring

In general, it is difficult to eliminate sub-optimal NE without knowing the others' payoffs. Consider the simplest case: two learning agents with at least one common interest (a strategy profile that maximizes both agents' payoffs).

For this game, agents can learn to play an optimal NE with a modified version of OAL (with new biased rules).

– Biased rules: 1) Each agent randomizes its action selection whenever the payoff of its best-response actions is zero in the virtual game. 2) Each agent biases its action towards its recent best response if all its k samples contain the same individual action of the other agent, more than m − k recorded joint actions have this property, and the agent has multiple best responses that give it payoff 1 w.r.t. its k samples. Otherwise, it randomly chooses a best-response action.

In this type of coordination stage game, the learning process is conjectured to converge to an optimal NE. The result can be extended to Markov games.

Page 41

Example

          A0     A1     A2
B0      1, 0   0, 0   0, 1
B1      0, 0   1, 1   0, 0
B2      0, 1   0, 0   1, 0

Page 42

Conclusions and Future Works

In this research, we study RL techniques that let agents play an optimal NE (one not Pareto dominated by other NE) in coordination games when the environment model is unknown beforehand.

We start with team games and propose the OAL algorithm, the first algorithm that guarantees convergence to an optimal NE in any team Markov game.

We further generalize the basic ideas of OAL into a new approach to learning in games, called partially myopic and partially strategic (PMPS) play.

We extend PMPS play beyond self-play and beyond team games. Some of the results can be extended to Markov games.

In future research, we will further explore applications of PMPS play in coordination games. In particular, we will study how to eliminate sub-optimal NE in imperfect-monitoring environments.