moshe tennenholtz, aviv zohar learning equilibria in repeated congestion games

Moshe Tennenholtz, Aviv Zohar

Learning Equilibria in Repeated Congestion Games

The Nash equilibrium is an important concept.

It exists for many reasonable games.

It provides a good recommendation for a group of players.

It assumes that all players are fully aware of the game. Given each other’s strategies, they can compute a best response.

Motivation

What about games with missing information?

We show the existence of an equilibrium even if players learn the game while playing it, for a broad family of games: repeated symmetric congestion games.

Our equilibrium will be in pure strategies (rare even for the regular Nash equilibrium).

It will be very efficient: the total cost of all players will be minimal.

The repeated game must be long enough.

Motivation

A congestion game [Rosenthal] is defined by:

A set of players N = {1, …, n}.

A set of resources R.

A set of bundles each player i can pick.

A cost function for each resource r that depends only on the number of players that have picked this resource. This cost is applied to all players that use this resource.

The strategy of a player is to pick a bundle.

The cost to a player is the sum of the costs of all resources in his bundle.
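The definition above can be sketched in a few lines of Python. This is only an illustration: the representation (bundles as frozensets, `cost[r][k - 1]` as the cost of resource `r` when `k` players use it) and the two-resource instance are assumptions made for the example, not part of the talk.

```python
from collections import Counter

def player_costs(profile, cost):
    """Return the cost paid by each player.

    profile: one bundle (a frozenset of resources) per player.
    cost:    dict resource -> list, cost[r][k - 1] = cost of r at load k.
    """
    # Load of a resource = number of players whose bundle contains it.
    load = Counter(r for bundle in profile for r in bundle)
    # A player pays the sum of the costs of the resources in his bundle.
    return [sum(cost[r][load[r] - 1] for r in bundle) for bundle in profile]

# Hypothetical instance: resources "a" and "b", two players.
cost = {"a": [1, 3], "b": [2, 2]}
profile = [frozenset({"a"}), frozenset({"a", "b"})]
costs = player_costs(profile, cost)  # both players share "a" at load 2
```

Here both players pay 3 for the shared resource "a", and the second player additionally pays 2 for "b".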

Congestion Games

Number of players: n.

Resources: the edges in a graph.

Allowed bundles: all simple paths connecting S to T.

Costs per edge:

Example of a Congestion Game

[Figure: a graph with source S and target T. Edge cost labels as extracted: 0/1/2, 0/3/2, 0/4/4, 2/2/2, 0/0/1, 2/2/0.]

Theorem: Every congestion game is a potential game and thus has a pure Nash equilibrium.

I.e., there exists a pure strategy profile (each player deterministically picks a bundle) such that no player wishes to change his bundle given the choices of the others.
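One standard way to see the theorem at work is best-response dynamics: each strictly improving move lowers Rosenthal's potential, so the process must stop at a pure Nash equilibrium. The sketch below assumes a symmetric game for simplicity (all players share the same list of bundles); the instance and the function names are hypothetical.

```python
from collections import Counter

def costs_of(profile, cost):
    """Per-player costs; cost[r][k - 1] is the cost of resource r at load k."""
    load = Counter(r for bundle in profile for r in bundle)
    return [sum(cost[r][load[r] - 1] for r in bundle) for bundle in profile]

def pure_nash(bundles, cost, n):
    """Best-response dynamics from an arbitrary starting profile."""
    profile = [bundles[0]] * n
    improved = True
    while improved:
        improved = False
        for i in range(n):
            current = costs_of(profile, cost)[i]
            for b in bundles:
                trial = profile[:i] + [b] + profile[i + 1:]
                c = costs_of(trial, cost)[i]
                if c < current:  # a strictly improving move lowers the potential
                    profile, current, improved = trial, c, True
    return profile

def is_pure_nash(profile, bundles, cost):
    """True if no player can lower his cost by a unilateral change of bundle."""
    base = costs_of(profile, cost)
    return all(
        costs_of(profile[:i] + [b] + profile[i + 1:], cost)[i] >= base[i]
        for i in range(len(profile))
        for b in bundles
    )

# Hypothetical instance: two resources, three players.
bundles = [frozenset({"x"}), frozenset({"y"})]
cost = {"x": [1, 2, 3], "y": [2, 3, 4]}
equilibrium = pure_nash(bundles, cost, 3)
```

In this instance the dynamics stop with one player on "y" and two on "x", each paying 2.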

Def: A symmetric congestion game is a game where all players can pick the same set of bundles.

Congestion Games

Def: A resource selection game is a symmetric congestion game where all possible bundles are of size 1.

Example: identical processes running on machines.

Every process chooses the machine to run on.

Running time depends on the number of processes on that machine.

Resource Selection Games

Assume that players repeatedly play some congestion game.

The number of rounds T is finite.

The cost of every resource is unknown initially.

At every round, everyone picks a bundle and observes:

Their own payment for each resource.

The actions of all others.

The total payment is the average over the rounds of the game.

Is there an equilibrium of some sort?

How efficient can we be?

Our Setting

Lots of work in learning tries to converge to a Nash equilibrium of the game while learning it.

The problem: the converging strategies themselves are not in equilibrium.

Work on equilibria in fully known repeated games.

Folk theorems guarantee a wide variety of equilibria.

However, it is less realistic to expect players to fully know the game.

Previous (slightly related) work

Due to Brafman and Tennenholtz.

A form of ex-post equilibrium.

There is an unknown state of the world S (that affects the payments in the game)

Def: Given an unknown repeated game, a strategy profile for the players is a learning equilibrium if no player wishes to change strategy, even if it knows the game.

The Learning Equilibrium

Repeated 2 player games have a (mixed) learning equilibrium (if you can see the payment of the other player) [Brafman, Tennenholtz]

All repeated symmetric 2 player games have a (mixed) learning equilibrium

All repeated monotonic resource selection games have a (mixed) learning equilibrium [Ashlagi, Monderer, Tennenholtz]

Our result: A pure equilibrium in all repeated symmetric congestion games (no limit on number of players).

Previous (more related) work:

For the remainder of the talk, I will assume agents can communicate (Cheap talk) to coordinate through some channel.

This assumption is not a must. Agents can communicate through the repeated game via the actions they take.

Such signaling makes the proofs a bit more complex, but has little effect on the game (provided that the game is long enough, and we communicate only a finite amount of data).

Communication between agents

The cooperative solution is the best we can hope for.

Denote its total cost by OPT.

The cooperative solution:

Players play all combinations of bundles, and learn the cost of each resource for any load.

Then compute the optimal joint action.

Play the joint action while taking turns playing the different roles in it.

Each player gets a cost of OPT/n if the game is long enough.
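The last two steps of the cooperative solution can be sketched as follows, assuming the costs have already been learned. The brute-force search and the simple rotation rule (player i takes role (i + t) mod n in round t) are illustrative choices for a tiny hypothetical instance, not the construction from the talk.

```python
from collections import Counter
from itertools import product

def costs_of(profile, cost):
    """Per-player costs; cost[r][k - 1] is the cost of resource r at load k."""
    load = Counter(r for bundle in profile for r in bundle)
    return [sum(cost[r][load[r] - 1] for r in bundle) for bundle in profile]

def optimal_profile(bundles, cost, n):
    """Brute-force search for a joint action of minimal total cost (OPT)."""
    return min(product(bundles, repeat=n), key=lambda p: sum(costs_of(p, cost)))

def rotated(profile, t):
    """Role assignment in round t: player i plays role (i + t) mod n."""
    n = len(profile)
    return [profile[(i + t) % n] for i in range(n)]

# Hypothetical instance: two resources, two players.
bundles = [frozenset({"x"}), frozenset({"y"})]
cost = {"x": [1, 4], "y": [3, 5]}
n = 2
best = optimal_profile(bundles, cost, n)
opt = sum(costs_of(best, cost))
# Average cost of each player over one full rotation of the roles.
avg = [sum(costs_of(rotated(best, t), cost)[i] for t in range(n)) / n
       for i in range(n)]
```

In the optimal joint action one player pays 1 and the other pays 3, so OPT is 4. After one full rotation each player averages OPT/n = 2.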

What can we hope to achieve?

For any symmetric congestion game G with n players, and for every ε>0 there exists a number (of rounds) T such that the repeated game that has T rounds in which we play G in every round, has an ε-equilibrium in which each player suffers a cost of at most (OPT/n)+ε.

An equivalent statement could be made about infinite games with discounted payments. (Discuss)

Our Main Result – Exact Statement

The equilibrium strategy will consist of 3 behaviors.

1. Cooperative learning: players play all combinations of bundles to learn the game. If some player deviates, start punishing; otherwise start playing optimally.

2. Playing optimally: players play optimally, taking turns in different roles. If someone deviates, start punishing.

3. Punishment of a deviator: this is the tricky part.
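The three behaviors above can be read as a small state machine. The sketch below captures only the transitions between phases; the phase names and the function `next_phase` are hypothetical labels, and the actions played inside each phase are left out.

```python
def next_phase(phase, deviation_seen, learning_done):
    """Phase transition of the equilibrium strategy (illustrative only).

    phase:          "learn", "optimal" or "punish"
    deviation_seen: True if some player deviated in the last round
    learning_done:  True once all bundle combinations have been played
    """
    if phase == "punish" or deviation_seen:
        return "punish"  # punishment is never abandoned once started
    if phase == "learn":
        return "optimal" if learning_done else "learn"
    return "optimal"

# A run with no deviation, followed by a deviation in the optimal phase.
trace = ["learn"]
for deviation, done in [(False, False), (False, True), (False, True), (True, True)]:
    trace.append(next_phase(trace[-1], deviation, done))
```

Treating punishment as an absorbing phase is an assumption of this sketch; the slides only say that a deviation triggers punishment.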

Proof

Let us start by assuming that the deviation occurs after the game has been learned.

Assume w.l.o.g. that player n has deviated.

A punishment strategy in this case:

All n-1 honest players compute a Nash equilibrium for the congestion game G, while ignoring the n’th player.

I.e., they consider the game as having only n-1 players.

Note that the equilibrium always exists.

How to Punish

Lemma: If all other players play a Nash equilibrium for n-1 players, the deviating player has a cost no lower than that of any other player.

Proof: Assume, for contradiction, that some honest player i pays more than the deviator. Fix the actions of all other players.

Effective Punishment

Bundle played by the deviator.

Bundle played by player i.

The total cost of player i was greater.

The difference must come from resources that are not shared.

Due to symmetry, player i could have picked the deviator’s bundle, and would gain by it (in the game of n-1 players in which player n does not exist).

This contradicts the fact that player i is playing a Nash equilibrium for n-1 players.
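The lemma can be checked numerically on a small instance. In the sketch below the honest players' equilibrium for n-1 players is found by best-response dynamics (one possible method, chosen here for brevity), and every bundle the deviator might pick is then tested. The instance and all function names are hypothetical.

```python
from collections import Counter

def costs_of(profile, cost):
    """Per-player costs; cost[r][k - 1] is the cost of resource r at load k."""
    load = Counter(r for bundle in profile for r in bundle)
    return [sum(cost[r][load[r] - 1] for r in bundle) for bundle in profile]

def pure_nash(bundles, cost, n):
    """Best-response dynamics; stops at a pure Nash equilibrium."""
    profile = [bundles[0]] * n
    improved = True
    while improved:
        improved = False
        for i in range(n):
            current = costs_of(profile, cost)[i]
            for b in bundles:
                trial = profile[:i] + [b] + profile[i + 1:]
                c = costs_of(trial, cost)[i]
                if c < current:
                    profile, current, improved = trial, c, True
    return profile

def lemma_holds(bundles, cost, n):
    """Check that the deviator never pays less than any honest player."""
    honest = pure_nash(bundles, cost, n - 1)  # equilibrium ignoring player n
    for b in bundles:  # every bundle the deviator might choose
        costs = costs_of(honest + [b], cost)
        if costs[-1] < max(costs[:-1]):
            return False
    return True

# Hypothetical instance with overlapping bundles, three players.
bundles = [frozenset({"x"}), frozenset({"y"}), frozenset({"x", "y"})]
cost = {"x": [1, 2, 3], "y": [2, 3, 4]}
result = lemma_holds(bundles, cost, 3)
```

The check covers a single stage game with fully known costs, which is the setting of the lemma above.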



Our proof so far relied on full knowledge of the game.

If the game is unknown, we cannot compute the Nash eq. of n-1 players.

In case of missing knowledge, we will optimistically under-evaluate the costs.

Punishing when information is missing

Now, players will use these under-evaluated costs when they try to punish a deviator.

They will compute the Nash equilibrium of the congestion game with n-1 players, that has resource costs defined by

And they will repeatedly play this equilibrium.
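As an illustration of optimistic under-evaluation, the sketch below fills every unobserved (resource, load) entry with a lower bound. The choice of 0 as that bound assumes non-negative costs and is an assumption of this example; the exact definition used in the talk is not reproduced here.

```python
def under_evaluated(observed, resources, n, lower_bound=0.0):
    """Build an optimistic cost table from partial observations.

    observed:    dict (resource, load) -> cost actually seen so far
    lower_bound: value assumed for unseen entries (ASSUMPTION: costs >= 0)
    Returns dict resource -> list, entry k - 1 is the estimate at load k.
    """
    return {
        r: [observed.get((r, k), lower_bound) for k in range(1, n + 1)]
        for r in resources
    }

# Hypothetical state of knowledge: only "x" at load 1 has been observed.
estimate = under_evaluated({("x", 1): 2.0}, ["x", "y"], 2)
```

The resulting table can be fed to any routine that computes a Nash equilibrium for n-1 players, and is updated whenever a new value is observed.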

Now, again, assume that during some round the deviator pays less than some player i.

The difference must come from resources they did not have in common.

But then, why didn’t player i switch to the bundle the deviator has? There can be only one reason:

He under-evaluated his own bundle.

Therefore, he must have observed something new.



At every round of punishment, at least one of two things must happen:

1. One of the players learns a previously unknown value, or

2. The deviator has a cost no lower than that of any other player.

Once a player learns a new value, he will broadcast it to the other honest players.

This way, they have common knowledge of the values found and can continue to compute the Nash eq. strategy.

The modified Lemma

If no one deviates, players spend a finite amount of time learning the game, and then play optimally.

If the game is long enough, each of them will incur a cost of at most OPT/n + ε/2.

If one player deviates, he can only do better than the other players for a finite number of rounds.

For the rest of the game he gets

So if the game is long enough, his gains in the finite number of rounds are dwarfed by this high cost. He gains at best some small ε.

Proving the Theorem

What players can observe during the game is critical.

The theorem also holds for weaker levels of monitoring.

E.g., let us now assume that players see the actions of other players only on the resources that they themselves have selected.

Can we still detect deviations, punish and coordinate?

One of the main problems is communication. Players can still signal, but no longer broadcast to all others at the same time (unless they are on the same resource).

Imperfect monitoring

Assume some honest player observes some other player deviating from the proposed strategy.

He has to call this to the attention of the other players.

He does so by deviating himself, and notifying some of the others.

They in turn deviate and notify others, etc.

After every player has seen some other player deviate, we have to find out whom to punish, and how.

Imperfect Monitoring

Each player will signal which other player he has seen deviating, and when this deviation occurred.

Everyone suspects the player who has been reported as the deviator in the earliest round.

Blaming others.

[Figure: players labeled with deviation rounds T, T+1, T+1 and T+2; one player is marked as the actual deviator.]

But the deviator may also lie.

To throw off the blame, he can try to say that he saw someone else deviate in an earlier round.

So the players must suspect:

The earliest reported deviator.

The player that reported him.
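The suspicion rule can be sketched as a function over the signaled reports. The tuple layout `(reporter, accused, round)` and the tie-breaking rule (the first report listed wins among equally early ones) are assumptions of this example.

```python
def suspects(reports):
    """Return the two suspects given all signaled reports.

    reports: list of (reporter, accused, round) tuples, meaning
             "reporter saw accused deviate in the given round".
    """
    # The earliest reported deviation decides who is blamed.
    reporter, accused, _ = min(reports, key=lambda rep: rep[2])
    # The accused may be innocent if the reporter lied, so suspect both.
    return {accused, reporter}

# Hypothetical run: player 0 really deviated in round 5, then falsely
# claims to have seen player 3 deviate in round 4.
reports = [(1, 0, 5), (2, 0, 5), (0, 3, 4)]
pair = suspects(reports)
```

In this run the lie shifts the earliest report to player 3, but the liar is the reporter of that claim, so the actual deviator still ends up among the two suspects.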

Blaming others.

[Figure: players labeled with deviation rounds T, T+1, T+1, T+2 and T-1; one player is marked as the actual deviator.]

So how can we punish in this case?

Note that the identity of the deviator is important.

All other players need to compute a Nash equilibrium for n-1 players, and play it.

Each bundle in the Nash equilibrium has to be picked by one of the players.

Solution: tell both suspect players to pick the same bundle (that is part of the Nash eq.). At least one of them is honest, and will play that strategy.

The other player must have a high cost.

How to Punish

Another way to restrict the level of monitoring, is to allow players to see only their total cost, without details regarding each resource.

If all players see enough combinations of profiles, they will be able to deduce all the needed information about the cost of resources.

The problem: There is no way to under-evaluate the costs of resources in a way that can be used to punish the deviator.

Non-Detailed monitoring

Let us look at a game with 2 players that has 3 bundles A,B,C.

Assume that player 1 has observed the costs in the table below

Example

Player 1’s action   Player 2’s action   Cost
A                   C                   C(A)=1
B                   A                   C(B)=1
C                   B                   C(C)=1

The scenario is completely symmetric.

A possible assignment of costs:

All 3 symmetric assignments are also possible

So in fact, any resource can be valued at cost 0 (when one player visits it)


There is in fact no sure way to punish the deviating player with a constant pure strategy.

If player 1 picks bundle α, the deviator can pick γ and gain without revealing new information.

A pure strategy that does punish (or learn): select α, β, γ in sequence.

Conjecture: There is a pure strategy learning equilibrium even in the non-detailed monitoring case.

Theorem: There is a mixed-strategy equilibrium in the case of non-detailed monitoring.

The equilibrium strategy punishes by playing the Nash equilibrium of the known part of the game, and with some small probability does a random exploratory action.

Theorem: There exists an asymmetric congestion game with no learning equilibrium (not even mixed).

In fact, it is even an asymmetric resource selection game.

Asymmetric Congestion Games

Bundles allowed for player 1

Bundles allowed for player 2

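Resource costs: player 1’s private resource costs 0.5 or 1000; the shared resource costs 0 / 1 (one player / two players); player 2’s private resource costs 0.5 or 1000.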

If some player has a cost of 1000 on his private resource, his best option is to select the shared resource all the time.

If the other player has 0.5 on his resource, his best choice is to play that resource all the time.

If both players have 0.5, at least one of them pays more than 0.25

He can pretend to have 1000 on his resource, and play the shared resource, to get a cost of 0.


This is a very interesting game.

It is quite unclear how to play it rationally.


There exists a (symmetric) resource selection game that has no strong equilibrium.

Assumption: the deviators can correlate their actions.

Observe the following game with 3 players:

Strong Equilibria in repeated congestion games

[Figure: two resources, each with cost 1 / 2 / 2 for one / two / three players.]

The total cost to all players is at least 5 in any profile.

Any pair of players have a cost of at least 3.

In any strategy profile there exist 2 players that each have a cost of 1.5 or more, and at least one pays strictly more.

These 2 players can deviate, play on different resources, and get a payment of 1.5 each.
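The counting claims above can be checked by enumerating all profiles of the stage game. The sketch assumes the reading of the example as two identical resources, each costing 1, 2 and 2 at loads one, two and three; the variable names are hypothetical.

```python
from itertools import combinations, product

COST = [1, 2, 2]  # cost of a resource at load 1, 2, 3 (assumed reading)

def costs_of(profile):
    """profile[i] is the resource (0 or 1) chosen by player i."""
    return [COST[profile.count(r) - 1] for r in profile]

profiles = list(product([0, 1], repeat=3))

# Minimal total cost over all profiles.
min_total = min(sum(costs_of(p)) for p in profiles)

# Minimal combined cost of any pair of players in any profile.
min_pair = min(
    costs_of(p)[i] + costs_of(p)[j]
    for p in profiles
    for i, j in combinations(range(3), 2)
)

# Two deviators on different resources: combined cost for either choice
# of the third player.
split_pair = {sum(costs_of((0, 1, c))[:2]) for c in (0, 1)}
```

A pair that splits across the two resources pays 3 in total whatever the third player does, so by alternating roles over the rounds each deviator averages 1.5.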

