
Nash Q-Learning for General-Sum Stochastic Games

Hu & Wellman

March 6th, 2006

CS286r

Presented by

Ilan Lobel

Outline

– Stochastic Games and Markov Perfect Equilibria
– Bellman’s Operator as a Contraction Mapping
– Stochastic Approximation of a Contraction Mapping
– Application to Zero-Sum Markov Games
– Minimax-Q Learning
– Theory of Nash-Q Learning
– Empirical Testing of Nash-Q Learning

How do we model games that evolve over time?

Stochastic Games! Current Game = State

Ingredients:
– Agents (N)
– States (S)
– Payoffs (R)
– Transition Probabilities (P)
– Discount Factor (δ)
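As a rough illustration, these ingredients might be bundled in code as follows (the field names and container choices are assumptions of this sketch, not taken from the paper):

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class StochasticGame:
    """Minimal container for a finite stochastic (Markov) game."""
    n_agents: int                                        # Agents (N)
    states: List[int]                                    # States (S)
    rewards: Dict[Tuple[int, tuple], Tuple[float, ...]]  # R: (state, joint action) -> payoff per agent
    transitions: Dict[Tuple[int, tuple, int], float]     # P: (state, joint action, next state) -> probability
    delta: float                                         # Discount factor (δ)
```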

Example of a Stochastic Game

[Figure: two stage games (states) with numeric payoff matrices. Player 1 chooses between A and B; player 2 chooses between C and D in the first game and among C, D, and E in the second. The state changes with 30% probability when (B,D) is played and with 50% probability when (A,C) or (A,D) is played. Discount factor δ = 0.9.]


Markov Game is a Generalization of…
– Repeated Games (add states)
– MDPs (add agents)

Markov Perfect Equilibrium (MPE)

Strategy maps states into randomized actions:
– π_i: S → Δ(A)

No agent has an incentive to unilaterally change her policy.

Cons & Pros of MPEs

Cons:
– Can’t implement everything described by the Folk Theorems (e.g., no trigger strategies)

Pros:
– MPEs always exist in finite Markov Games (Fink, 1964)
– Easier to “search for”

Learning in Stochastic Games

Learning is especially important in Markov Games because MPEs are hard to compute.

Do we know:
– Our own payoffs?
– Others’ rewards?
– Transition probabilities?
– Others’ strategies?

Learning in Stochastic Games

Adapted from Reinforcement Learning:
– Minimax-Q Learning (zero-sum games)
– Nash-Q Learning
– CE-Q Learning

Zero-Sum Stochastic Games

Nice properties:
– All equilibria have the same value.
– Any equilibrium strategy of player 1 against any equilibrium strategy of player 2 produces an MPE.
– It has a Bellman’s-type equation.

Bellman’s Equation in DP

J^*(s) = \max_a \Big\{ r(s, a) + \delta \sum_{s'} P(s' \mid s, a) \, J^*(s') \Big\}

Bellman Operator: T

(TJ)(s) = \max_a \Big\{ r(s, a) + \delta \sum_{s'} P(s' \mid s, a) \, J(s') \Big\}

Bellman’s Equation Rewritten:

TJ^* = J^*
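As a minimal sketch, the Bellman operator above could be implemented for a finite MDP stored in dictionaries (this data layout is an assumption of the sketch):

```python
def bellman_operator(J, states, actions, r, P, delta):
    """Apply (TJ)(s) = max_a { r(s, a) + delta * sum_s' P(s'|s, a) * J(s') }."""
    return {
        s: max(
            r[s, a] + delta * sum(P[s, a, s2] * J[s2] for s2 in states)
            for a in actions
        )
        for s in states
    }
```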

Contraction Mapping

The Bellman Operator has the contraction property:

\max_s |TJ(s) - TJ'(s)| \le \delta \max_s |J(s) - J'(s)|

Bellman’s Equation is a direct consequence of the contraction.

The Shapley Operator for Zero-Sum Stochastic Games

(TJ)(s) = \max_{a^1} \min_{a^2} \Big\{ r(s, a^1, a^2) + \delta \sum_{s'} P(s' \mid s, a^1, a^2) \, J(s') \Big\}

The Shapley Operator is a contraction mapping. (Shapley, 1953)

Hence, it also has a fixed point, which is an MPE:

TJ^* = J^*

Value Iteration for Zero-Sum Stochastic Games

Start with any J_0 and iterate:

J_{k+1} = T J_k

Converges to the fixed point of the operator, a direct consequence of the contraction.
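A sketch of value iteration with the Shapley operator. The `stage_value` helper stands in for the minimax value of a zero-sum matrix game (for instance, the linear program sketched near the end of this deck); it is an assumption of this sketch, not part of the original slides.

```python
def shapley_value_iteration(states, A1, A2, r, P, delta, stage_value, n_iters=1000):
    """Iterate J_{k+1} = T J_k, where T is the Shapley operator.

    stage_value(M) must return the minimax value of the zero-sum matrix game
    with payoffs M[a1][a2] to player 1.
    """
    J = {s: 0.0 for s in states}                    # start with any J_0
    for _ in range(n_iters):
        J = {
            s: stage_value([
                [r[s, a1, a2] + delta * sum(P[s, a1, a2, s2] * J[s2] for s2 in states)
                 for a2 in A2]
                for a1 in A1
            ])
            for s in states
        }
    return J
```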

Q-Learning

Another consequence of a contraction mapping:
– Q-Learning converges!

Q-Learning can be described as an approximation of value iteration:
– Value iteration with noise.

Q-Learning Convergence

Q-Learning is called a Stochastic Iterative Approximation of Bellman’s operator:
– Learning rate of 1/t.
– Noise is zero-mean and has bounded variance.

It converges if all state-action pairs are visited infinitely often.

(Neuro-Dynamic Programming – Bertsekas, Tsitsiklis)
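A sketch of the tabular Q-Learning update with a 1/t learning rate per state-action pair. The environment interface (`reset`, `step`) and the uniform exploration are assumptions of this sketch.

```python
import random
from collections import defaultdict

def q_learning(env, actions, delta, n_steps=100_000):
    """Tabular Q-Learning: a stochastic iterative approximation of the Bellman operator."""
    Q = defaultdict(float)
    visits = defaultdict(int)
    s = env.reset()                       # assumed: returns the initial state
    for _ in range(n_steps):
        a = random.choice(actions)        # explore so every (s, a) is visited infinitely often
        s2, reward = env.step(a)          # assumed: returns (next state, reward)
        visits[s, a] += 1
        alpha = 1.0 / visits[s, a]        # learning rate 1/t for this state-action pair
        target = reward + delta * max(Q[s2, a2] for a2 in actions)
        Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
        s = s2
    return Q
```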

Minimax-Q Learning Algorithm For Zero-Sum Stochastic Games

Initialize Q_0(s, a^1, a^2) for all states and actions. Update rule:

Q_{k+1}(s_k, a^1, a^2) = (1 - \alpha_k) \, Q_k(s_k, a^1, a^2) + \alpha_k \Big[ r(s_k, a^1, a^2) + \delta \max_{u^1} \min_{u^2} Q_k(s_{k+1}, u^1, u^2) \Big]

Player 1 then chooses its action u^1 in the next state s_{k+1}.
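A sketch of a single minimax-Q update for player 1. The `minimax_value` helper (e.g., the linear program sketched near the end) and the visit counters used for the 1/t learning rate are assumptions of this sketch.

```python
def minimax_q_update(Q, counts, s, a1, a2, reward, s_next, A1, A2, delta, minimax_value):
    """Q(s,a1,a2) <- (1-alpha) Q(s,a1,a2) + alpha [ r + delta * value of the stage game Q(s_next, ., .) ]."""
    counts[s, a1, a2] += 1
    alpha = 1.0 / counts[s, a1, a2]                    # 1/t learning rate
    stage = [[Q[s_next, u1, u2] for u2 in A2] for u1 in A1]
    target = reward + delta * minimax_value(stage)     # maxmin value of the next stage game
    Q[s, a1, a2] = (1 - alpha) * Q[s, a1, a2] + alpha * target
```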

Minimax-Q Learning

It’s a Stochastic Iterative Approximation of the Shapley Operator.

It converges to a Nash Equilibrium if all state-action-action triplets are visited infinitely often. (Littman, 1996)

Can we extend it to General-Sum Stochastic Games?

Yes & No. Nash-Q Learning is such an extension. However, it has much worse computational and theoretical properties.

Nash-Q Learning Algorithm

Initialize Q_0^j(s, a^1, a^2) for all states, all actions, and every agent.
– You must simulate everyone’s Q-factors.

Update rule:

Q_{k+1}^j(s_k, a^1, a^2) = (1 - \alpha_k) \, Q_k^j(s_k, a^1, a^2) + \alpha_k \Big[ r^j(s_k, a^1, a^2) + \delta \, \mathrm{Nash}\{ Q_k(s_{k+1}, u^1, u^2) \} \Big]

Choose the randomized action generated by the Nash operator.
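A sketch of one Nash-Q update. `stage_nash` is a hypothetical stage-game Nash solver that returns one equilibrium value per agent; it is exactly the expensive Nash operator discussed on the next slides. Each agent runs this on its own copies of everyone's Q-factors.

```python
def nash_q_update(Q, counts, s, a1, a2, rewards, s_next, A1, A2, delta, stage_nash):
    """Q^j(s,a1,a2) <- (1-alpha) Q^j(s,a1,a2) + alpha [ r^j + delta * Nash{ Q(s_next, ., .) } ].

    Q is a list of tables, one per agent (the learner simulates everyone's Q-factors);
    rewards is the list of stage payoffs, one per agent.
    """
    counts[s, a1, a2] += 1
    alpha = 1.0 / counts[s, a1, a2]                    # 1/t learning rate
    # Build the stage game at s_next: one payoff matrix per agent.
    stage = [[[Q[j][s_next, u1, u2] for u2 in A2] for u1 in A1] for j in range(len(Q))]
    nash_values = stage_nash(stage)                    # hypothetical solver: one value per agent
    for j in range(len(Q)):
        target = rewards[j] + delta * nash_values[j]
        Q[j][s, a1, a2] = (1 - alpha) * Q[j][s, a1, a2] + alpha * target
```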

The Nash Operator and The Principle of Optimality

The Nash Operator finds a Nash equilibrium of a stage game: find the Nash of the stage game with the Q-factors as your payoffs.

r^j(s, a^1, a^2) + \delta \, \mathrm{Nash}\{ Q_k(s', u^1, u^2) \}

The first term is the current reward; the Nash term gives the payoffs for the rest of the Markov Game.

The Nash Operator

Unknown complexity even for 2 players.

In comparison, the minimax operator can be solved in polynomial time (there is a linear programming formulation; a sketch follows this slide).

For convergence, all players must break ties in favor of the same Nash Equilibrium.

Why not go model-based if computation is so expensive?
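For reference, a sketch of the linear programming formulation of the minimax value of a zero-sum matrix game, written with scipy (an assumed dependency, not part of the original slides):

```python
import numpy as np
from scipy.optimize import linprog

def minimax_value(M):
    """Value of the zero-sum matrix game M, where the row player maximizes M[a1][a2].

    LP: maximize v subject to sum_a1 pi(a1) * M[a1, a2] >= v for every column a2,
    with pi a probability distribution over the row player's actions.
    """
    M = np.asarray(M, dtype=float)
    n_rows, n_cols = M.shape
    c = np.zeros(n_rows + 1)
    c[-1] = -1.0                                       # linprog minimizes, so minimize -v
    A_ub = np.hstack([-M.T, np.ones((n_cols, 1))])     # v - pi^T M[:, a2] <= 0 for each column a2
    b_ub = np.zeros(n_cols)
    A_eq = np.hstack([np.ones((1, n_rows)), np.zeros((1, 1))])   # sum(pi) = 1
    b_eq = np.array([1.0])
    bounds = [(0, None)] * n_rows + [(None, None)]     # pi >= 0, v free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1]                                   # the game value v
```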

Convergence Results

If every stage game encountered during learning has a global optimum, Nash-Q converges.

If every stage game encountered during learning has a saddle point, Nash-Q converges.

Both of these are VERY strong assumptions.

Convergence Result Analysis

The global optimum assumption implies full cooperation between agents.

The saddle point assumption implies no cooperation between agents.

Are these equivalent to DP Q-Learning and minimax-Q Learning, respectively?

Empirical Testing: The Grid-world

[Figure: World 1 grid game, showing some of its Nash equilibria.]

Empirical Testing: Nash Equilibria

[Figure: World 2 grid game, showing all of its Nash equilibria, annotated 97%, 3%, and 3%.]

Empirical Performance

In very small and simple games, Nash-Q Learning often converged even though the theory did not predict it.

In particular, when all Nash Equilibria have the same value, Nash-Q did better than expected.

Conclusions

Nash-Q is a nice step forward:
– It can be used for any Markov Game.
– It uses the Principle of Optimality in a smart way.

But there is still a long way to go:
– Convergence results are weak.
– There are no computational complexity results.