
Learning with Opponent-Learning Awareness

Jakob N. Foerster 2,†  [email protected]
Richard Y. Chen 1,†  [email protected]
Maruan Al-Shedivat 4  [email protected]
Shimon Whiteson 2  [email protected]
Pieter Abbeel 1,3  [email protected]
Igor Mordatch 1  [email protected]

1 OpenAI   2 University of Oxford   3 UC Berkeley   4 CMU

† Equal Contribution

arXiv:1709.04326v1 [cs.AI] 13 Sep 2017

Abstract

Multi-agent settings are quickly gathering importance in machine learning. Beyond a plethora of recent work on deep multi-agent reinforcement learning, hierarchical reinforcement learning, generative adversarial networks, and decentralized optimization can all be seen as instances of this setting. However, the presence of multiple learning agents in these settings renders the training problem non-stationary and often leads to unstable training or undesired final results. We present Learning with Opponent-Learning Awareness (LOLA), a method that reasons about the anticipated learning of the other agents. The LOLA learning rule includes an additional term that accounts for the impact of the agent's policy on the anticipated parameter update of the other agents. We show that the LOLA update rule can be efficiently calculated using an extension of the likelihood ratio policy gradient update, making the method suitable for model-free reinforcement learning. This method thus scales to large parameter and input spaces and nonlinear function approximators. Preliminary results show that the encounter of two LOLA agents leads to the emergence of tit-for-tat, and therefore cooperation, in the infinitely iterated prisoners' dilemma, while independent learning does not. In this domain, a LOLA agent also receives higher payouts than a naive learner, and is robust against exploitation by higher-order gradient-based methods. Applied to infinitely repeated matching pennies, only LOLA agents converge to the Nash equilibrium. We also apply LOLA to a grid-world task with an embedded social dilemma using deep recurrent policies. Again, by considering the learning of the other agent, LOLA agents learn to cooperate out of selfish interests.

1 Introduction

Due to the advent of deep RL methods that allow the study of many agents in rich environments, multi-agent reinforcement learning has flourished in recent years. However, most of this work considers fully cooperative settings (Omidshafiei et al., 2017; Foerster et al., 2017a,b) and emergent communication in particular (Das et al., 2017; Mordatch and Abbeel, 2017; Lazaridou, Peysakhovich, and Baroni, 2016; Foerster et al., 2016; Sukhbaatar, Fergus, and others, 2016). Considering future applications of multi-agent RL, such as self-driving cars, many of these settings will be only partially cooperative and will contain elements of competition and selfish incentives.

The human ability to maintain cooperation in a variety of complex social settings has been vital for the success of human societies. Emergent reciprocity has been observed even in strongly adversarial settings such as wars (Axelrod, 2006), making it a quintessential and robust feature of human life.

In the future, artificial learning agents are likely to take an active part in human society, interacting both with other learning agents and with humans in complex, partially competitive settings. Failing to develop learning algorithms that lead to emergent reciprocity in these artificial agents would lead to disastrous outcomes.

How reciprocity can emerge among a group of learning, self-interested, reward-maximizing RL agents is thus a question of both theoretical interest and practical importance. Game theory has a long history of studying the learning outcomes in games that contain cooperative and competitive elements. In particular, the tension between cooperation and defection is commonly studied in the iterated prisoners' dilemma. In this game, selfish interests can lead to an outcome that is overall worse for all participants, while cooperation maximizes social welfare, one measure of which is the sum of rewards for all agents.

Interestingly, in the simple setting of an infinitely repeated prisoners' dilemma with discounting, randomly initialized RL agents pursuing gradient descent on the exact value function learn to defect with high probability. This shows that current state-of-the-art learning methods in deep multi-agent RL can lead to agents that fail to cooperate reliably even in simple social settings with explicit actions to cooperate and defect. One well-known shortcoming is that these methods fail to consider the learning process of the other agents and simply treat the other agent as a static part of the environment.

As a step towards reasoning over the learning behaviour of other agents in social settings, we propose Learning with Opponent-Learning Awareness (LOLA). The LOLA learning rule includes an additional term that accounts for the impact of one agent's parameter update on the learning step of the other agents. For convenience we use the word 'opponent' to describe the other agent, even though the method is not limited to zero-sum games and can be applied in the general-sum setting. We show that this additional term,



when applied by both agents, leads to emergent reciprocity and cooperation in the iterated prisoners' dilemma (IPD). Experimentally we also show that in the IPD, each agent is incentivized to switch from naive learning to LOLA, while there are no additional gains in attempting to exploit LOLA with higher-order gradient terms. This suggests that, within the space of local, gradient-based learning rules, both agents using LOLA is a stable equilibrium.

We also present a version of LOLA adapted to the deep RL setting using likelihood ratio policy gradients, making LOLA scalable to settings with high-dimensional input and parameter spaces.

We evaluate the policy gradient version of LOLA on the iterated prisoners' dilemma (IPD) and iterated matching pennies (IMP), a simplified version of rock-paper-scissors. We show that LOLA leads to cooperation with high social welfare, while policy gradients, a standard reinforcement learning approach, do not. The policy gradient finding is consistent with prior work, e.g., Sandholm and Crites (1996). We also extend LOLA to settings where the opponent policy is unknown and needs to be inferred from state-action trajectories of the opponent's behaviour.

Finally, we apply LOLA with and without opponent modelling to a grid-world task with an embedded social dilemma. This task has temporally extended actions and therefore requires high-dimensional recurrent policies for agents to learn to reciprocate. Again, cooperation emerges in this task when using LOLA, even when the opponent's policy is unknown and needs to be estimated.

2 Related Work

The study of general-sum games has a long history in game theory and evolution. Thousands of papers have been written on the iterated prisoners' dilemma (IPD), including the seminal work on the topic by Axelrod (2006). This work popularized tit-for-tat (TFT), a strategy in which an agent cooperates on the first move and then copies the opponent's most recent move, as a robust and simple strategy in the IPD.

Most work in deep multi-agent RL focuses on fully cooperative settings (Omidshafiei et al., 2017; Foerster et al., 2017a,b) and emergent communication in particular (Das et al., 2017; Mordatch and Abbeel, 2017; Lazaridou, Peysakhovich, and Baroni, 2016; Foerster et al., 2016; Sukhbaatar, Fergus, and others, 2016). As an exception, Leibo et al. (2017) consider mixed multi-agent environments and study the emergence of cooperation and competition as a function of the problem setup and the model parameters. Similarly, Lowe et al. (2017) propose a centralized actor-critic architecture for efficient training in these mixed environments. However, neither of these papers explicitly reasons about the learning behaviour of other agents and thus fails to discover interesting solutions in mixed-competitive settings.

The closest to our problem setting is the work of Lerer and Peysakhovich (2017), which directly generalizes tit-for-tat to complex environments using deep RL. The authors explicitly train a fully cooperative and a defecting policy for both agents and then construct a tit-for-tat policy that switches between these two in order to encourage the opponent to cooperate. Similar in spirit to this work, Munoz de Cote and Littman (2008) propose a Nash equilibrium algorithm for repeated stochastic games that explicitly attempts to find the egalitarian point by switching between competitive and zero-sum strategies.

Reciprocity and cooperation are not emergent properties of the learning rule in these settings but are directly coded into the algorithm. By contrast, LOLA makes no assumptions about cooperation and simply assumes that each agent is maximizing its own return.

Brafman and Tennenholtz (2003) introduce the concept of an 'efficient learning equilibrium' (ELE), in which neither side is encouraged to deviate from the learning rule. Their algorithm applies to settings where all Nash equilibria can be computed and enumerated. So far no proof exists that LOLA is an ELE, but our initial empirical results are encouraging. Furthermore, we do not assume that Nash equilibria are computable, which is in general difficult in high-dimensional complex settings. For example, listing all the Nash equilibria of the board game Go is clearly beyond the scope of current techniques.

Our work also relates to opponent modeling, such as fictitious play (Brown, 1951) and action prediction. Mealing and Shapiro (2013) propose a method that finds a policy based on predicting the opponent's future action. While these methods model the opponent strategy, they do not address the learning dynamics of the opponent.

By contrast, Zhang and Lesser (2010) carry out policy prediction under one-step learning dynamics. However, the opponents' policy updates are assumed to be fixed and are only used to learn a best response to the anticipated updated parameters. LOLA instead differentiates through the policy updates of all opponents, such that each agent actively drives its opponents' policy updates to maximize its own reward.

With LOLA, each agent differentiates its estimated reward through the opponents' policy update. Similar ideas were proposed by Metz et al. (2016), whose training method for generative adversarial networks differentiates through multiple update steps of the opponent. Their method relies on an end-to-end differentiable loss function and thus does not apply in the general RL setting. However, the overall results are similar: anticipating the opponent's update stabilises the training outcome.

3 Background

Our work assumes a multi-agent task that is commonly described as a stochastic game $G$, specified by the tuple $G = \langle S, U, P, r, Z, O, n, \gamma \rangle$. Here $n$ agents, $a \in A \equiv \{1, \dots, n\}$, choose actions $u^a \in U$, and $s \in S$ is the state of the environment. The joint action $\mathbf{u} \in \mathbf{U} \equiv U^n$ leads to a state transition based on the transition function $P(s' \mid s, \mathbf{u}) : S \times \mathbf{U} \times S \to [0, 1]$. The reward functions $r^a(s, \mathbf{u}) : S \times \mathbf{U} \to \mathbb{R}$ specify the reward for each agent, and $\gamma \in [0, 1)$ is the discount factor.

We further define the discounted future return from time $t$ onward as $R^a_t = \sum_{l=0}^{\infty} \gamma^l r^a_{t+l}$ for each agent $a$. As a naive learner, each agent separately maximizes its own total discounted return in expectation.


Figure 1: Shown is the probability of cooperation in the iterated prisoners' dilemma at the end of 50 training runs for both agents as a function of state, under naive learning, a), and LOLA, b), when using the exact gradients of the value function. Also shown is the average return per step for naive learning and LOLA under both the exact gradient, c), and the policy gradient approximation, d). We can see that NL leads to DD, resulting in an average reward of ca. -2. In contrast, the LOLA learning rule leads to the emergence of tit-for-tat, in b): when in the last move agent 1 defected and agent 2 cooperated (DC, green points), most likely in the next move agent 1 will cooperate and agent 2 will defect, indicated by a concentration of the green points in the bottom right corner. Similarly, the yellow points (CD) are concentrated in the top left corner. While the results for the policy gradient approximation are more noisy, they are qualitatively similar. Detailed plots are shown in the Supplementary Material. Best viewed in color.

This can be done with policy gradient methods (Sutton et al., 1999) such as REINFORCE (Williams, 1992). Policy gradient methods update an agent's policy, parameterized by $\theta^a$, by performing gradient ascent on an estimate of the expected discounted total reward $\mathbb{E}[R^a_0]$.
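For concreteness, the following minimal helper (our own addition, not from the paper) computes the discounted returns $R^a_t$ defined above from a single episode's reward sequence; its expectation at $t = 0$ is the quantity on which policy gradient methods perform gradient ascent.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """R_t = sum_{l >= t} gamma^(l - t) * r_l for every timestep t of one episode."""
    returns, running = [], 0.0
    for r in reversed(rewards):          # accumulate from the end of the episode
        running = r + gamma * running
        returns.append(running)
    return np.array(returns[::-1])

# Example: a constant per-step reward of -1 (mutual cooperation in the IPD), gamma = 0.96.
print(discounted_returns([-1.0] * 5, 0.96)[0])
```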

Initially we assume that all agents can observe all rewards, policy parameters, and learning rules. We remove the assumption of access to the parameters of the other agent by adding opponent modelling in Section 4.4. By convention, bold lowercase letters denote column vectors.

4 Methods

In this section, we review the naive learner's strategy and introduce the LOLA learning rule. We derive the update rules when agents have access to exact gradients and Hessians of their expected discounted future return in Sections 4.1 and 4.2. Section 4.3 derives the learning rules based purely on rollouts, using policy gradients; this renders LOLA suitable for deep RL. For simplicity, we assume the number of agents is $n = 2$ and display the update rules for agent 1 only. The same derivation holds for an arbitrary number of agents.

4.1 Naive Learner

Suppose each agent's policy $\pi^a$ is parameterized by $\theta^a$ and $V^a(\theta^1, \theta^2)$ is the expected total discounted return for agent $a$ as a function of both agents' policy parameters $(\theta^1, \theta^2)$. A naive learner iteratively optimizes its own expected total discounted return separately, such that at the $i$th iteration, $\theta^a_i$ is updated to $\theta^a_{i+1}$ according to

$$\theta^1_{i+1} = \operatorname*{argmax}_{\theta^1} V^1(\theta^1, \theta^2_i), \qquad \theta^2_{i+1} = \operatorname*{argmax}_{\theta^2} V^2(\theta^1_i, \theta^2).$$

In the reinforcement learning setting, agents do not have access to $\{V^1, V^2\}$ over all parameter values. Instead, we assume that agents only have access to the function values and gradients at $(\theta^1_i, \theta^2_i)$. Using this information, the naive learners apply the gradient ascent update rule $f^a_{\text{nl}}$:

$$\theta^1_{i+1} = \theta^1_i + f^1_{\text{nl}}(\theta^1_i, \theta^2_i), \qquad f^1_{\text{nl}} = \frac{\partial V^1(\theta^1_i, \theta^2_i)}{\partial \theta^1_i} \cdot \delta, \tag{4.1}$$

where $\delta$ is the step size.
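As a minimal sketch (under our own assumptions, not the authors' code), the naive update (4.1) can be written in a few lines of autodiff code. Here `V1_fn` is an assumed differentiable function returning the scalar $V^1(\theta^1, \theta^2)$, for example the closed-form IPD value of the Supplementary Material, and both parameter vectors are leaf tensors created with `requires_grad=True`.

```python
import torch

def naive_update(theta1, theta2, V1_fn, delta=0.1):
    """One naive-learner step for agent 1 (Eq. 4.1): gradient ascent on its own
    value, with the opponent's parameters theta2 treated as a fixed constant."""
    V1 = V1_fn(theta1, theta2)
    grad = torch.autograd.grad(V1, theta1)[0]
    return (theta1 + delta * grad).detach().requires_grad_(True)
```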

4.2 Learning with Opponent-Learning Awareness

A LOLA learner optimizes its policy by also driving the opponent's best-response policy update so as to maximize its own expected discounted future return, such that in every iteration,

$$\theta^a_{i+1} = \theta^a_i + \Delta\theta^a, \quad a \in \{1, 2\},$$

where

$$\Delta\theta^1 = \operatorname*{argmax}_{\Delta\theta^1 : \|\Delta\theta^1\| \le \delta} V^1\Big(\theta^1_i + \Delta\theta^1,\ \theta^2_i + \operatorname*{argmax}_{\Delta\theta^2 : \|\Delta\theta^2\| \le \eta} V^2(\theta^1_i + \Delta\theta^1, \theta^2_i + \Delta\theta^2)\Big),$$

$$\Delta\theta^2 = \operatorname*{argmax}_{\Delta\theta^2 : \|\Delta\theta^2\| \le \delta} V^2\Big(\theta^1_i + \operatorname*{argmax}_{\Delta\theta^1 : \|\Delta\theta^1\| \le \eta} V^1(\theta^1_i + \Delta\theta^1, \theta^2_i + \Delta\theta^2),\ \theta^2_i + \Delta\theta^2\Big).$$

In contrast to prior works, e.g., Zhang and Lesser (2010), that only predict the opponent's policy parameter update, LOLA learners actively influence the opponent's future policy update.

If the agents have access only to the gradients and Hessians of $\{V^1, V^2\}$ at each agent's current policy parameters


Figure 2: Shown is the probability of playing heads in the iterated matching pennies game at the end of 50 training runs for both agents as a function of state, under naive learning, a), and LOLA, b), when using the exact gradients of the value function. Also shown is the average return per step for NL and LOLA under both the exact gradient, c), and the policy gradient approximation, d). We can see that naive learning, a), results in near-deterministic strategies, indicated by the accumulation of points in the corners. These strategies are easily exploitable by other deterministic strategies, leading to unstable training and high variance in the reward per step in c). In contrast, LOLA agents learn to play the only Nash strategy, 50%/50%, leading to low variance in the reward per step. One interpretation is that LOLA agents anticipate that exploiting a deviation from Nash increases their immediate return, but also renders them more exploitable by the opponent's next learning step. Best viewed in color.

$(\theta^1_i, \theta^2_i)$, then the LOLA update rule $f_{\text{lola}}$ augments the gradient ascent update $f^1_{\text{nl}}$ with a second-order term, such that at the $i$th iteration agent 1 updates $\theta^1_i$ to $\theta^1_{i+1}$ as follows:

$$\theta^1_{i+1} = \theta^1_i + f^1_{\text{lola}}(\theta^1_i, \theta^2_i),$$

where

$$f^1_{\text{lola}}(\theta^1_i, \theta^2_i) = \frac{\partial V^1(\theta^1_i, \theta^2_i)}{\partial \theta^1_i} \cdot \delta + \left(\frac{\partial V^1(\theta^1_i, \theta^2_i)}{\partial \theta^2_i}\right)^{\!T} \frac{\partial^2 V^2(\theta^1_i, \theta^2_i)}{\partial \theta^1_i\, \partial \theta^2_i} \cdot \delta\eta, \tag{4.2}$$

and the step sizes $\delta, \eta$ are for the first- and second-order updates.
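A minimal sketch of the exact LOLA step (4.2) in an autodiff framework is shown below; it is our own illustration under stated assumptions, not the authors' implementation. It assumes user-supplied differentiable value functions `V1_fn(theta1, theta2)` and `V2_fn(theta1, theta2)` (e.g. the closed-form IPD values of Appendix A.2) and leaf parameter tensors created with `requires_grad=True`. The second-order term is obtained by differentiating, with respect to $\theta^1$, the inner product of $\partial V^2/\partial\theta^2$ with a detached copy of $\partial V^1/\partial\theta^2$.

```python
import torch

def lola_update(theta1, theta2, V1_fn, V2_fn, delta=0.1, eta=0.1):
    """One exact LOLA step for agent 1 (Eq. 4.2)."""
    V1, V2 = V1_fn(theta1, theta2), V2_fn(theta1, theta2)

    # First-order (naive) term dV1/dtheta1; dV1/dtheta2 is reused as a constant below.
    dV1_d1, dV1_d2 = torch.autograd.grad(V1, (theta1, theta2), retain_graph=True)

    # Second-order term: (dV1/dtheta2)^T d^2V2/(dtheta1 dtheta2), computed as the gradient
    # w.r.t. theta1 of the scalar <dV2/dtheta2, dV1/dtheta2>, holding dV1/dtheta2 fixed.
    dV2_d2 = torch.autograd.grad(V2, theta2, create_graph=True)[0]
    second_order = torch.autograd.grad((dV2_d2 * dV1_d2).sum(), theta1)[0]

    new_theta1 = theta1 + delta * dV1_d1 + delta * eta * second_order
    return new_theta1.detach().requires_grad_(True)
```

Agent 2's update is symmetric, with the roles of the two parameter vectors exchanged.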

4.3 Learning via Policy Gradient

When agents do not have access to exact gradients or Hessians, we derive the update rules $f_{\text{nl,pg}}$ and $f_{\text{lola,pg}}$ based on approximations of the derivatives in (4.1) and (4.2). Denote an episode of horizon $T$ as $\tau = (s_0, u^1_0, u^2_0, r^1_0, r^2_0, \dots, s_T, u^1_T, u^2_T, r^1_T, r^2_T)$ and its corresponding discounted return for agent $a$ at timestep $t$ as $R^a_t(\tau) = \sum_{l=t}^T \gamma^{l-t} r^a_l$. The expected episodic returns given the agents' policies $(\pi^1, \pi^2)$, $\mathbb{E}[R^1_0(\tau)]$ and $\mathbb{E}[R^2_0(\tau)]$, approximate $V^1$ and $V^2$ respectively, as do their gradients and Hessians.

The gradient of $\mathbb{E}[R^1_0(\tau)]$ follows from the policy gradient derivation:

$$\nabla_{\theta^1} \mathbb{E}[R^1_0(\tau)] = \mathbb{E}\big[R^1_0(\tau)\, \nabla_{\theta^1} \log \pi^1(\tau)\big]
= \mathbb{E}\Big[\sum_{t=0}^T \nabla_{\theta^1} \log \pi^1(u^1_t \mid s_t) \cdot \sum_{l=t}^T \gamma^l r^1_l\Big]
= \mathbb{E}\Big[\sum_{t=0}^T \nabla_{\theta^1} \log \pi^1(u^1_t \mid s_t)\, \gamma^t \big(R^1_t(\tau) - b(s_t)\big)\Big],$$

where $b(s_t)$ is a baseline for variance reduction. The policy-gradient-based update rule $f_{\text{nl,pg}}$ for the naive learner is then

$$f^1_{\text{nl,pg}} = \nabla_{\theta^1} \mathbb{E}[R^1_0(\tau)] \cdot \delta. \tag{4.3}$$

For the LOLA update, we derive the following estimator for the second-order term in (4.2) based on policy gradients (see the Supplementary Material for the detailed derivation):

$$\nabla_{\theta^1}\nabla_{\theta^2} \mathbb{E}[R^2_0(\tau)] = \mathbb{E}\Big[R^2_0(\tau)\, \nabla_{\theta^1} \log \pi^1(\tau) \big(\nabla_{\theta^2} \log \pi^2(\tau)\big)^{\!T}\Big]
= \mathbb{E}\Big[\sum_{t=0}^T \gamma^t r^2_t \cdot \Big(\sum_{l=0}^t \nabla_{\theta^1} \log \pi^1(u^1_l \mid s_l)\Big) \Big(\sum_{l=0}^t \nabla_{\theta^2} \log \pi^2(u^2_l \mid s_l)\Big)^{\!T}\Big]. \tag{4.4}$$

The complete LOLA update for agent 1 using policy gradients is

$$f^1_{\text{lola,pg}} = \nabla_{\theta^1} \mathbb{E}[R^1_0(\tau)] \cdot \delta + \big(\nabla_{\theta^2} \mathbb{E}[R^1_0(\tau)]\big)^{\!T}\, \nabla_{\theta^1}\nabla_{\theta^2} \mathbb{E}[R^2_0(\tau)] \cdot \delta\eta. \tag{4.5}$$
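As an illustration (our own sketch, not the authors' code), the second-order estimator (4.4) can be evaluated from rollouts given the per-step score vectors $\nabla_{\theta^a}\log\pi^a(u^a_t \mid s_t)$. The helper below computes the single-episode term; averaging it over a batch of episodes gives the Monte Carlo estimate used inside $f^1_{\text{lola,pg}}$. The array layout is our assumption.

```python
import numpy as np

def cross_second_order_term(glogp1, glogp2, r2, gamma):
    """Single-episode term of Eq. (4.4): an estimate of grad_theta1 grad_theta2 E[R^2_0]
    with shape (dim theta1, dim theta2).
    glogp1[t] = grad_theta1 log pi1(u1_t | s_t)   -- array of shape (T, d1)
    glogp2[t] = grad_theta2 log pi2(u2_t | s_t)   -- array of shape (T, d2)
    r2[t]     = reward of agent 2 at step t       -- array of shape (T,)"""
    T, d1 = glogp1.shape
    d2 = glogp2.shape[1]
    cum1, cum2 = np.zeros(d1), np.zeros(d2)
    total = np.zeros((d1, d2))
    for t in range(T):
        cum1 += glogp1[t]   # sum_{l <= t} grad_theta1 log pi1(u1_l | s_l)
        cum2 += glogp2[t]   # sum_{l <= t} grad_theta2 log pi2(u2_l | s_l)
        total += (gamma ** t) * r2[t] * np.outer(cum1, cum2)
    return total
```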

4.4 LOLA with Opponent Modeling

So far we have assumed that each agent has access to the exact parameters of the opponent. However, in adversarial settings these parameters are typically obscured and have to be inferred from the opponent's state-action trajectories. Formally, we replace $\theta^2$ with $\hat\theta^2$, where $\hat\theta^2$ is estimated from trajectories using maximum likelihood:

$$\hat\theta^2 = \operatorname*{argmax}_{\theta^2} \sum_t \log \pi_{\theta^2}(u^2_t \mid s_t). \tag{4.6}$$

$\hat\theta^2$ then replaces $\theta^2$ in the LOLA update rule, both for the exact version using the value function and for the gradient-based approximation.
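For the tabular policies of Sec. 5.1 the maximum-likelihood problem (4.6) has a closed-form solution: the estimated cooperation probability in each state is simply the empirical frequency with which the opponent was observed to cooperate there. The sketch below (our own helper; the smoothing constant is our choice) implements this special case.

```python
import numpy as np

def estimate_opponent_tabular(states, actions, n_states=5, eps=1e-3):
    """MLE (Eq. 4.6) for a tabular opponent policy such as the 5-parameter IPD policy:
    states are integers in {0, ..., n_states - 1}, actions are 1 for cooperate and
    0 for defect. `eps` lightly smooths states that were never visited."""
    states, actions = np.asarray(states), np.asarray(actions)
    theta_hat = np.zeros(n_states)
    for s in range(n_states):
        seen = states == s
        theta_hat[s] = (actions[seen].sum() + eps) / (seen.sum() + 2 * eps)
    return theta_hat
```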


4.5 Higher Order LOLA

The LOLA learning rule so far assumes that the opponent is a naive learner that carries out first-order learning using policy gradients. In this setting, which we call first-order LOLA, accounting for the learning of the other agent leads to a second-order correction term. However, we can also consider a higher-order LOLA agent that differentiates through the learning step of this first-order LOLA agent, including its second-order correction. This leads to a third-order derivative in the correction term. While this third-order term is typically difficult to compute using rollouts, it is tractable when the exact value function is available.

5 Experimental Setup

In this section, we summarize the settings in which we compare the learning behavior of NL and LOLA agents. The first setting (Sec. 5.1) consists of two classical infinitely iterated games, the iterated prisoners' dilemma (IPD) and iterated matching pennies (IMP). These two classical environments allow us to obtain the discounted future return of each player given both players' policies, which leads to exact policy updates for NL and LOLA agents. The second setting (Sec. 5.2) is the 'Coin Game', a more difficult two-player environment, where the exact discounted future reward cannot be calculated and each player is parameterized by a deep policy network.

5.1 Iterated Games

The per-step payoff matrix of the prisoners' dilemma is shown in Table 1.

              C            D
    C     (-1, -1)     (-3, 0)
    D     (0, -3)      (-2, -2)

Table 1: Payoff matrix of the prisoners' dilemma.

In a single-shot prisoners' dilemma, there is only one Nash equilibrium (Fudenberg and Tirole, 1991), in which both agents defect. In the infinitely repeated prisoners' dilemma, the folk theorem (Roger, 1991) shows that there are infinitely many Nash equilibria. Two notable ones are the always-defect strategy (DD) and tit-for-tat (TFT). In TFT each agent starts out with cooperation and then repeats the previous action of the opponent. The average returns per step in self-play are -1 and -2 for TFT and DD respectively.

IMP (Gibbons, 1992) is a zero-sum game, with per-step payouts shown in Table 2. This game has only a single mixed-strategy Nash equilibrium: both players playing heads and tails with probability 50%/50%.

               Head         Tail
    Head    (+1, -1)     (-1, +1)
    Tail    (-1, +1)     (+1, -1)

Table 2: Payoff matrix of matching pennies.

                    IPD                          IMP
              %TFT    R (std)            %Nash     R (std)
NL-Ex.        20.8    -1.98 (0.14)        0.0      0 (0.37)
LOLA-Ex.      81.0    -1.06 (0.19)       98.8      0 (0.02)
NL-PG         20.0    -1.98 (0.00)       13.2      0 (0.19)
LOLA-PG       66.4    -1.17 (0.34)       93.2      0 (0.06)

Table 3: Shown is the percentage of agents playing TFT and Nash for the IPD and IMP respectively, as well as the average reward per step, R, and its standard deviation (std) at the end of training, over 50 training runs.

We model the IPD and IMP as a two-agent MDP, where the state at time 0 is empty and at time $t \ge 1$ consists of both agents' actions from $t - 1$:

$$s_t = (u^1_{t-1}, u^2_{t-1}) \quad \text{for } t \ge 1.$$

Each agent's policy is fully parametrized by 5 probabilities. For agent $a$ in the case of the IPD, these are $\pi^a(C \mid s_0)$, $\pi^a(C \mid CC)$, $\pi^a(C \mid CD)$, $\pi^a(C \mid DC)$ and $\pi^a(C \mid DD)$. We can derive each agent's future discounted reward as an analytical function of the agents' policies (see Supplementary Material for details) and calculate the exact policy update for both NL and LOLA agents.

We further assume that agents can only update their policies between rollouts, not during the iterated game play. Conceptually, each agent submits its policy to the environment, which is then used to play a large number (the batch size) of infinitely iterated games. Both agents then receive the traces resulting from these games and can submit updated policies to the environment.

5.2 Coin Game

Next we study LOLA in a setting that requires recurrent policies and features sequential actions. The 'Coin Game' was first proposed by Lerer and Peysakhovich (2017) as a higher-dimensional expansion of the iterated prisoners' dilemma with multi-step actions. As shown in Figure 3, in this setting two agents, 'red' and 'blue', are tasked with collecting coins.

The coins are either blue or red and appear randomly on the grid-world, with a new coin appearing once the last one has been picked up. Agents pick up coins by moving onto the field where the coin is located. While every agent receives a point for picking up a coin of any colour, whenever the 'red agent' picks up a blue coin the 'blue agent' loses 2 points, and vice versa.

As a result, if both agents greedily pick up any available coin, they receive 0 points on average. In the 'Coin Game', agents' policies are parametrized by a recurrent neural network, and the future discounted reward cannot be obtained in closed form as a function of both agents' policies. Policy-gradient-based learning is therefore applied for both NL and LOLA agents in our experiments. We further apply LOLA with opponent modelling to this task.


Figure 3: In the coin game, two agents, 'red' and 'blue', get rewarded for picking up coins. However, the 'red agent' loses 2 points when the 'blue agent' picks up a red coin, and vice versa. Effectively this is a grid-world with an embedded social dilemma in which the actions to cooperate and defect are temporally extended.

5.3 Training Details

In all our PG experiments we use gradient descent with step size 0.005 for the actor, 1 for the critic, and a batch size of 4000. γ is set to 0.96 for the prisoners' dilemma and the coin game, and to 0.9 for matching pennies. The high value of γ for the 'Coin Game' and IPD was chosen in order to allow for long time horizons, which are known to be required for cooperation in the IPD. We found that a lower γ produced more stable learning on the IMP.

For the coin game the agent's policy architecture is a recurrent neural network with 32 hidden units and 2 convolutional layers with 3 × 3 filters, stride 1, and ReLU activation for input processing. The input is presented as a 4-channel grid, with 2 channels encoding the positions of the 2 agents and 2 channels for the red and blue coins respectively.
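The description above fixes the convolutional filters, the number of recurrent units, and the input encoding but not every detail. The sketch below is one plausible instantiation; the channel width of the conv layers, the use of a GRU cell, the grid size, and the separate policy and value heads are our assumptions rather than specifications from the paper.

```python
import torch
import torch.nn as nn

class CoinGamePolicy(nn.Module):
    """Recurrent policy for the Coin Game along the lines of Sec. 5.3 (sketch)."""
    def __init__(self, grid_size=3, n_actions=4, hidden=32, channels=16):
        super().__init__()
        # 2 conv layers, 3x3 filters, stride 1, ReLU, over the 4-channel grid input.
        self.conv = nn.Sequential(
            nn.Conv2d(4, channels, kernel_size=3, stride=1, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1), nn.ReLU(),
        )
        # Recurrent core with 32 hidden units; policy and value heads on top.
        self.rnn = nn.GRUCell(channels * grid_size * grid_size, hidden)
        self.policy_head = nn.Linear(hidden, n_actions)
        self.value_head = nn.Linear(hidden, 1)

    def forward(self, obs, h):
        # obs: (batch, 4, grid, grid); h: (batch, hidden) recurrent state.
        z = self.conv(obs).flatten(start_dim=1)
        h = self.rnn(z, h)
        return self.policy_head(h), self.value_head(h), h
```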

6 Results

In this section, we summarize the experimental results. We aim to answer the following questions:

1. With the exact policy update, how do LOLA agents behave in iterated games compared with NL agents?

2. Does replacing the exact policy update with policy gradient updates change the learned behaviors of LOLA and NL agents?

3. Does the learning of LOLA agents scale to high-dimensional settings where the agents' policies are parametrized by deep networks?

4. When access to the exact parameters of the opponent is replaced with opponent modeling, is the behavior of LOLA agents preserved?

5. Exploiting LOLA: Can LOLA agents be exploited by using higher-order gradients, i.e., does LOLA lead to an arms race of ever higher-order corrections, or is LOLA vs. LOLA stable?

We answer the first two questions in Sec. 6.1, the next two questions in Sec. 6.2, and the last one in Sec. 6.3.

6.1 Iterated Games

Figures 1a) and 1b) show the policies of both agents at the end of training under naive learning (NL-Ex) and LOLA (LOLA-Ex) when the agents have access to exact gradients and Hessians of $\{V^1, V^2\}$. Here LOLA and NL describe pairs of agents; we consider mixed learning of one LOLA agent vs. an NL agent in Section 6.3. Under NL, the agents learn to defect in all states, indicated by the accumulation of points in the bottom left corner of the plot. However, under LOLA, in most cases the agents learn TFT. In particular, agent 1 cooperates in s0, CC and DC, while agent 2 cooperates in s0, CC and CD. As a result, Figure 1c) shows that the average return per step is close to -1 for LOLA, corresponding to TFT, while NL results in an average reward of -2, corresponding to the fully defective (DD) equilibrium. Figure 1d) shows the average return per step for NL-PG and LOLA-PG, where agents learn via policy gradients. LOLA-PG also leads to cooperation, while agents defect under NL-PG. Further plots are provided in the Supplementary Material.

We conduct the same analysis for IMP. In this game, under naive learning the agents' strategies fail to converge. In contrast, under LOLA the agents' policies converge to the only Nash equilibrium, playing 50%/50% heads/tails. Table 3 summarizes the numerical results comparing LOLA with NL agents in both the exact and policy gradient settings. In the IPD, LOLA agents learn policies consistent with TFT with much higher probability and achieve a higher reward than NL (-1.06 vs. -1.98). In IMP, LOLA agents converge to the Nash equilibrium more stably, while NL agents do not. The difference in stability is illustrated by the high variance of the average returns per step for NL agents compared to the low variance under LOLA (0.37 vs. 0.02).

6.2 Coin Game

As shown in Figure 4, NL agents collect coins indiscriminately, corresponding to defection. In contrast, LOLA agents learn to predominantly (around 80% of the time) pick up coins of their own color, corresponding to cooperation. The same result holds when agents have to learn the policy of the opponent, using LOLA with opponent modelling. We emphasize that in this setting neither agent can recover the exact policy parameters of the opponent, since there is a large amount of redundancy in the neural network parameters: for example, each agent could permute the weights of its fully connected layers.


Figure 4: Shown is the percentage of all picked-up coins that match the picking agent's colour, in a), and the total points obtained, in b), for a pair of naive learners (NL), a pair of LOLA agents (LOLA), and a pair of LOLA agents with opponent modelling (LOLA-OM). Also shown is the standard deviation of the percentage and of the points obtained, to indicate the variability of the results, based on 5 training runs. We see that LOLA and LOLA-OM learn to cooperate, while NL does not. Best viewed in color.

6.3 Higher Order LOLA

In the exact value function setting the higher-order LOLA terms can be evaluated. We use this to address the question of whether there is an arms race leading to ever higher orders of correction terms between the two agents. Table 4 shows that in the IPD, a LOLA learner achieves higher payouts against a naive learner. Thus, there is an incentive for either agent to switch from naive learning to first-order LOLA. Furthermore, two LOLA agents playing against each other both receive higher rewards than a LOLA agent playing against a naive learner. This makes LOLA a dominant learning rule in the IPD compared to naive learning. However, we further find that higher-order LOLA provides no incremental gains when playing against a first-order LOLA agent, leading instead to a reduction in payouts for both agents. These experiments were carried out with a learning rate of 0.5. While it is beyond the scope of this work to prove that LOLA vs. LOLA is a dominant learning rule in the space of all possible gradient-based rules, these initial results are encouraging.

               NL                 1st order          2nd order
NL      (-1.99, -1.99)     (-1.54, -1.28)            -
1st     (-1.28, -1.54)     (-1.04, -1.04)      (-1.14, -1.17)

Table 4: Higher-order LOLA results on the IPD. A LOLA agent obtains higher rewards than an NL agent. However, in this setting there is no incremental gain from using higher-order LOLA in order to exploit another LOLA agent in the IPD; in fact, both agents do worse under 2nd-order corrections.

7 Conclusions & Future Work

We presented Learning with Opponent-Learning Awareness (LOLA), a learning method for multi-agent settings that considers the learning processes of other agents. We have shown that when both agents apply the LOLA learning rule, cooperation based on tit-for-tat emerges in the infinitely iterated prisoners' dilemma, while it does not under independent learning. Empirical results show that in the IPD both agents are incentivized to use LOLA, while higher-order exploits show no further gain. We also find that LOLA leads to stable learning of the Nash equilibrium in iterated matching pennies.

Furthermore, we apply a policy-gradient-based version of LOLA to the 'Coin Game', a multi-step game which requires recurrent policies. In this setting, LOLA agents learn to collaborate even when they do not have access to the policy of the other agent.

In the future we would like to address the exploitability of LOLA when adversarial agents explicitly aim to take advantage of a LOLA learner using global search methods rather than gradient-based methods. Just as LOLA is a way to exploit a naive learner, in principle there should be means of exploiting LOLA learners in turn, unless LOLA is itself an equilibrium learning strategy. We would also like to prove properties regarding the kinds of equilibria to which LOLA agents converge. An initial step is to understand the dynamics under infinitesimal gradient ascent, similar to what was done by Wunder, Littman, and Babes (2010). Another challenge is to apply LOLA in settings with many agents. Since each agent has to account for the learning of all other agents, the total computational requirements are quadratic in the number of agents. One solution is to apply LOLA to a learned subset of opponents that have the greatest influence on the reward.


Acknowledgements

We would like to thank Ilya Sutskever, Bob McGrew, Paul Cristiano and the rest of OpenAI for fruitful discussions. We would like to thank Michael Littman for providing feedback on an early version of the manuscript.

References

Axelrod, R. M. 2006. The evolution of cooperation: revised edition. Basic Books.

Brafman, R. I., and Tennenholtz, M. 2003. Efficient learning equilibrium. In Advances in Neural Information Processing Systems, volume 9, 1635-1643.

Brown, G. W. 1951. Iterative solution of games by fictitious play.

Das, A.; Kottur, S.; Moura, J. M.; Lee, S.; and Batra, D. 2017. Learning cooperative visual dialog agents with deep reinforcement learning. arXiv preprint arXiv:1703.06585.

Foerster, J.; Assael, Y. M.; de Freitas, N.; and Whiteson, S. 2016. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, 2137-2145.

Foerster, J.; Farquhar, G.; Afouras, T.; Nardelli, N.; and Whiteson, S. 2017a. Counterfactual multi-agent policy gradients. arXiv preprint arXiv:1705.08926.

Foerster, J.; Nardelli, N.; Farquhar, G.; Torr, P.; Kohli, P.; Whiteson, S.; et al. 2017b. Stabilising experience replay for deep multi-agent reinforcement learning. In 34th International Conference on Machine Learning.

Fudenberg, D., and Tirole, J. 1991. Game theory, 1991. Cambridge, Massachusetts 393:12.

Gibbons, R. 1992. Game theory for applied economists. Princeton University Press.

Lazaridou, A.; Peysakhovich, A.; and Baroni, M. 2016. Multi-agent cooperation and the emergence of (natural) language. arXiv preprint arXiv:1612.07182.

Leibo, J. Z.; Zambaldi, V.; Lanctot, M.; Marecki, J.; and Graepel, T. 2017. Multi-agent reinforcement learning in sequential social dilemmas. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, 464-473. International Foundation for Autonomous Agents and Multiagent Systems.

Lerer, A., and Peysakhovich, A. 2017. Maintaining cooperation in complex social dilemmas using deep reinforcement learning. arXiv preprint arXiv:1707.01068.

Lowe, R.; Wu, Y.; Tamar, A.; Harb, J.; Abbeel, P.; and Mordatch, I. 2017. Multi-agent actor-critic for mixed cooperative-competitive environments. arXiv preprint arXiv:1706.02275.

Mealing, R., and Shapiro, J. L. 2013. Opponent modelling by sequence prediction and lookahead in two-player games. In ICAISC (2), 385-396.

Metz, L.; Poole, B.; Pfau, D.; and Sohl-Dickstein, J. 2016. Unrolled generative adversarial networks. arXiv preprint arXiv:1611.02163.

Mordatch, I., and Abbeel, P. 2017. Emergence of grounded compositional language in multi-agent populations. arXiv preprint arXiv:1703.04908.

Munoz de Cote, E., and Littman, M. L. 2008. A polynomial-time Nash equilibrium algorithm for repeated stochastic games. In 24th Conference on Uncertainty in Artificial Intelligence (UAI'08).

Omidshafiei, S.; Pazis, J.; Amato, C.; How, J. P.; and Vian, J. 2017. Deep decentralized multi-task multi-agent RL under partial observability. arXiv preprint arXiv:1703.06182.

Roger, B. M. 1991. Game theory: analysis of conflict.

Sandholm, T. W., and Crites, R. H. 1996. Multiagent reinforcement learning in the iterated prisoner's dilemma. Biosystems 37(1-2):147-166.

Sukhbaatar, S.; Fergus, R.; et al. 2016. Learning multiagent communication with backpropagation. In Advances in Neural Information Processing Systems, 2244-2252.

Sutton, R. S.; McAllester, D. A.; Singh, S. P.; Mansour, Y.; et al. 1999. Policy gradient methods for reinforcement learning with function approximation. In NIPS, volume 99, 1057-1063.

Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8(3-4):229-256.

Wunder, M.; Littman, M.; and Babes, M. 2010. Classes of multiagent Q-learning dynamics with epsilon-greedy exploration. In Proceedings of the Twenty-Seventh International Conference on Machine Learning (ICML-10), 1167-1174.

Zhang, C., and Lesser, V. R. 2010. Multi-agent learning with policy prediction. In AAAI.


A Appendix

A.1 Derivation of the Second-Order Derivative

In this section, we derive the second-order derivatives of LOLA in the policy gradient setting. Recall that an episode of horizon $T$ is

$$\tau = (s_0, u^1_0, u^2_0, r^1_0, r^2_0, \dots, s_T, u^1_T, u^2_T, r^1_T, r^2_T)$$

and the corresponding discounted return for agent $a$ at timestep $t$ is $R^a_t(\tau) = \sum_{l=t}^T \gamma^{l-t} r^a_l$. We write $\mathbb{E}_{\pi^1,\pi^2,\tau}$ for the expectation taken over both agents' policies and the episode $\tau$. Then,

\begin{align*}
\nabla_{\theta^1}\nabla_{\theta^2}\, \mathbb{E}_{\pi^1,\pi^2,\tau}\, R^1_0(\tau)
&= \nabla_{\theta^1}\nabla_{\theta^2}\, \mathbb{E}_{\tau}\!\left[ R^1_0(\tau) \cdot \prod_{l=0}^T \pi^1(u^1_l \mid s_l, \theta^1) \cdot \prod_{l=0}^T \pi^2(u^2_l \mid s_l, \theta^2) \right] \\
&= \mathbb{E}_{\tau}\!\left[ R^1_0(\tau) \cdot \Big(\nabla_{\theta^1} \prod_{l=0}^T \pi^1(u^1_l \mid s_l, \theta^1)\Big) \Big(\nabla_{\theta^2} \prod_{l=0}^T \pi^2(u^2_l \mid s_l, \theta^2)\Big)^{\!T} \right] \\
&= \mathbb{E}_{\tau}\!\left[ R^1_0(\tau) \cdot \frac{\nabla_{\theta^1} \prod_{l=0}^T \pi^1(u^1_l \mid s_l, \theta^1)}{\prod_{l=0}^T \pi^1(u^1_l \mid s_l, \theta^1)} \left(\frac{\nabla_{\theta^2} \prod_{l=0}^T \pi^2(u^2_l \mid s_l, \theta^2)}{\prod_{l=0}^T \pi^2(u^2_l \mid s_l, \theta^2)}\right)^{\!T} \cdot \prod_{l=0}^T \pi^1(u^1_l \mid s_l, \theta^1) \cdot \prod_{l=0}^T \pi^2(u^2_l \mid s_l, \theta^2) \right] \\
&= \mathbb{E}_{\pi^1,\pi^2,\tau}\!\left[ R^1_0(\tau) \cdot \frac{\nabla_{\theta^1} \prod_{l=0}^T \pi^1(u^1_l \mid s_l, \theta^1)}{\prod_{l=0}^T \pi^1(u^1_l \mid s_l, \theta^1)} \left(\frac{\nabla_{\theta^2} \prod_{l=0}^T \pi^2(u^2_l \mid s_l, \theta^2)}{\prod_{l=0}^T \pi^2(u^2_l \mid s_l, \theta^2)}\right)^{\!T} \right] \\
&= \mathbb{E}_{\pi^1,\pi^2,\tau}\!\left[ R^1_0(\tau) \cdot \Big(\nabla_{\theta^1} \log \prod_{l=0}^T \pi^1(u^1_l \mid s_l, \theta^1)\Big) \Big(\nabla_{\theta^2} \log \prod_{l=0}^T \pi^2(u^2_l \mid s_l, \theta^2)\Big)^{\!T} \right] \\
&= \mathbb{E}_{\pi^1,\pi^2,\tau}\!\left[ R^1_0(\tau) \cdot \Big(\sum_{l=0}^T \nabla_{\theta^1} \log \pi^1(u^1_l \mid s_l, \theta^1)\Big) \Big(\sum_{l=0}^T \nabla_{\theta^2} \log \pi^2(u^2_l \mid s_l, \theta^2)\Big)^{\!T} \right].
\end{align*}

The second equality holds because $\pi^a$ is a function of $\theta^a$ only. The third equality multiplies and divides by the probability of the episode $\tau$. The fourth equality folds the probability of the episode $\tau$ back into the expectation $\mathbb{E}_{\pi^1,\pi^2,\tau}$. The fifth and sixth equalities are standard policy gradient operations.

A similar derivation leads to the following second-order cross-term gradient for a single reward of agent 1 at time $t$:

$$\nabla_{\theta^1}\nabla_{\theta^2}\, \mathbb{E}_{\pi^1,\pi^2,\tau}\, r^1_t = \mathbb{E}_{\pi^1,\pi^2,\tau}\!\left[ r^1_t \cdot \Big(\sum_{l=0}^t \nabla_{\theta^1} \log \pi^1(u^1_l \mid s_l, \theta^1)\Big) \Big(\sum_{l=0}^t \nabla_{\theta^2} \log \pi^2(u^2_l \mid s_l, \theta^2)\Big)^{\!T} \right].$$

Summing the discounted rewards over $t$,

$$\nabla_{\theta^1}\nabla_{\theta^2}\, \mathbb{E}_{\pi^1,\pi^2,\tau}\, R^1_0(\tau) = \mathbb{E}_{\pi^1,\pi^2,\tau}\!\left[ \sum_{t=0}^T \gamma^t r^1_t \cdot \Big(\sum_{l=0}^t \nabla_{\theta^1} \log \pi^1(u^1_l \mid s_l, \theta^1)\Big) \Big(\sum_{l=0}^t \nabla_{\theta^2} \log \pi^2(u^2_l \mid s_l, \theta^2)\Big)^{\!T} \right],$$

which is the second-order term (4.4) in the Methods section (stated there for agent 2's return).

A.2 Derivation of the Exact Value Function in the Iterated Prisoners' Dilemma and Iterated Matching Pennies

In both IPD and IMP the action space consists of 2 discrete actions. The state consists of the last actions of both agents. As such there are 5 possible states in total: the initial state $s_0$ and the $2 \times 2$ joint actions of the previous step.

As a consequence the policy of each agent can be represented by 5 parameters, $\theta^a$: the probabilities of taking action 0 in each of these 5 states. In the case of the IPD these parameters correspond to the probability of cooperation in $s_0$, CC, CD, DC and DD:

$$\pi^a(C \mid s_0) = \theta_{a,0}, \quad \pi^a(D \mid s_0) = 1 - \theta_{a,0},$$
$$\pi^a(C \mid CC) = \theta_{a,1}, \quad \pi^a(D \mid CC) = 1 - \theta_{a,1},$$
$$\pi^a(C \mid CD) = \theta_{a,2}, \quad \pi^a(D \mid CD) = 1 - \theta_{a,2},$$
$$\pi^a(C \mid DC) = \theta_{a,3}, \quad \pi^a(D \mid DC) = 1 - \theta_{a,3},$$
$$\pi^a(C \mid DD) = \theta_{a,4}, \quad \pi^a(D \mid DD) = 1 - \theta_{a,4}, \qquad a \in \{1, 2\}.$$

We denote $\theta_a = (\theta_{a,0}, \theta_{a,1}, \theta_{a,2}, \theta_{a,3}, \theta_{a,4})$. In these games the pair $(\pi^1, \pi^2)$ induces a state transition function $P(s' \mid s) = P(\mathbf{u} \mid s)$. Denote the distribution over the first joint action (the state following $s_0$) as

$$\mathbf{p}_0 = \big( \theta_{1,0}\theta_{2,0},\ \theta_{1,0}(1-\theta_{2,0}),\ (1-\theta_{1,0})\theta_{2,0},\ (1-\theta_{1,0})(1-\theta_{2,0}) \big)^T,$$

the payout vectors over the states (CC, CD, DC, DD) as

$$\mathbf{r}^1 = (-1, -3, 0, -2)^T \quad \text{and} \quad \mathbf{r}^2 = (-1, 0, -3, -2)^T,$$

and the transition matrix as

$$P = \big[\, \boldsymbol{\theta}^1 \circ \boldsymbol{\theta}^2,\ \boldsymbol{\theta}^1 \circ (1-\boldsymbol{\theta}^2),\ (1-\boldsymbol{\theta}^1) \circ \boldsymbol{\theta}^2,\ (1-\boldsymbol{\theta}^1) \circ (1-\boldsymbol{\theta}^2) \,\big],$$

where $\boldsymbol{\theta}^a$ here denotes the vector of cooperation probabilities in the four states CC, CD, DC, DD, the products are taken elementwise, and row $s$ of $P$ is the distribution over the next joint action given state $s$. Then $V^1, V^2$ can be represented as

$$V^1(\theta^1, \theta^2) = \mathbf{p}_0^T \Big( \mathbf{r}^1 + \sum_{t=1}^{\infty} \gamma^t P^t \mathbf{r}^1 \Big), \qquad
V^2(\theta^1, \theta^2) = \mathbf{p}_0^T \Big( \mathbf{r}^2 + \sum_{t=1}^{\infty} \gamma^t P^t \mathbf{r}^2 \Big).$$

Since $\gamma < 1$ and $P$ is a stochastic matrix, the infinite sum converges and

$$V^1(\theta^1, \theta^2) = \mathbf{p}_0^T (I - \gamma P)^{-1} \mathbf{r}^1, \qquad
V^2(\theta^1, \theta^2) = \mathbf{p}_0^T (I - \gamma P)^{-1} \mathbf{r}^2,$$

where $I$ is the identity matrix. An equivalent derivation holds for the Iterated Matching Pennies game with $\mathbf{r}^1 = (-1, 1, 1, -1)^T$ and $\mathbf{r}^2 = -\mathbf{r}^1$.
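The closed-form expressions above translate directly into code. The sketch below (our own helper, not the authors' implementation) evaluates $V^1, V^2$ for the IPD, following the appendix in indexing both agents' post-initial parameters by the joint state $(u^1, u^2) \in \{CC, CD, DC, DD\}$.

```python
import numpy as np

def ipd_exact_values(theta1, theta2, gamma=0.96):
    """V^1, V^2 of the infinitely iterated prisoners' dilemma (Appendix A.2).
    theta_a = (theta_a0, ..., theta_a4): cooperation probabilities in s0 and in
    the joint states CC, CD, DC, DD of the previous move."""
    theta1, theta2 = np.asarray(theta1, float), np.asarray(theta2, float)
    r1 = np.array([-1., -3., 0., -2.])
    r2 = np.array([-1., 0., -3., -2.])

    def joint(a, b):
        # Distribution over the next joint action (CC, CD, DC, DD).
        return np.array([a * b, a * (1 - b), (1 - a) * b, (1 - a) * (1 - b)])

    p0 = joint(theta1[0], theta2[0])                                   # distribution of the first joint action
    P = np.stack([joint(theta1[s], theta2[s]) for s in range(1, 5)])   # 4x4 transition matrix
    M = np.linalg.solve(np.eye(4) - gamma * P, np.column_stack([r1, r2]))  # (I - gamma P)^{-1} [r1 r2]
    return p0 @ M[:, 0], p0 @ M[:, 1]

# Mutual defection vs. mutual tit-for-tat; (1 - gamma) * V recovers the average
# per-step rewards of roughly -2 and -1 quoted in Sec. 5.1.
print([(1 - 0.96) * v for v in ipd_exact_values(np.zeros(5), np.zeros(5))])
print([(1 - 0.96) * v for v in ipd_exact_values([1, 1, 0, 1, 0], [1, 1, 1, 0, 0])])
```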


A.3 Figures


Figure 5: Shown is the probability of cooperation in the prisoners' dilemma (a) and the probability of heads in the matching pennies game (b) at the end of 50 training runs for both agents as a function of state, under naive learning (left) and LOLA (middle), when using the exact gradients of the value function. Also shown is the average return per step for naive learning and LOLA (right).


Figure 6: Same as Figure 5, but using the policy gradient approximation for all terms. The results are noisier but qualitatively follow those of the exact method.