
Multi-Agent Adversarial Inverse Reinforcement Learning

Lantao Yu 1 Jiaming Song 1 Stefano Ermon 1

Abstract

Reinforcement learning agents are prone to undesired behaviors due to reward mis-specification. Finding a set of reward functions to properly guide agent behaviors is particularly challenging in multi-agent scenarios. Inverse reinforcement learning provides a framework to automatically acquire suitable reward functions from expert demonstrations. Its extension to multi-agent settings, however, is difficult due to the more complex notions of rational behaviors. In this paper, we propose MA-AIRL, a new framework for multi-agent inverse reinforcement learning, which is effective and scalable for Markov games with high-dimensional state-action space and unknown dynamics. We derive our algorithm based on a new solution concept and maximum pseudolikelihood estimation within an adversarial reward learning framework. In the experiments, we demonstrate that MA-AIRL can recover reward functions that are highly correlated with ground truth ones, and significantly outperforms prior methods in terms of policy imitation.

1. Introduction

Reinforcement learning (RL) is a general and powerful framework for decision making under uncertainty. Recent advances in deep learning have enabled a variety of RL applications such as games (Silver et al., 2016; Mnih et al., 2015), robotics (Gu et al., 2017; Levine et al., 2016), automated machine learning (Zoph & Le, 2016) and generative modeling (Yu et al., 2017). RL algorithms are also showing promise in multi-agent systems, where multiple agents interact with each other, such as multi-player games (Peng et al., 2017), social interactions (Leibo et al., 2017) and multi-robot control systems (Matignon et al., 2012).

1 Department of Computer Science, Stanford University, Stanford, CA 94305 USA. Correspondence to: Lantao Yu <[email protected]>, Stefano Ermon <[email protected]>.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

However, the success of RL crucially depends on careful reward design (Amodei et al., 2016). As reinforcement learning agents are prone to undesired behaviors due to reward mis-specification (Amodei & Clark, 2016), designing suitable reward functions can be challenging in many real-world applications (Hadfield-Menell et al., 2017). In multi-agent systems, since different agents may have completely different goals and state-action representations, hand-tuning reward functions becomes increasingly more challenging as we take more agents into consideration.

Imitation learning presents a direct approach to programming agents with expert demonstrations, where agents learn to produce behaviors similar to the demonstrations. However, imitation learning algorithms, such as behavior cloning (Pomerleau, 1991) and generative adversarial imitation learning (Ho & Ermon, 2016; Ho et al., 2016), typically sidestep the problem of inferring an explicit representation for the underlying reward functions. Because the reward function is often considered as the most succinct, robust and transferable representation of a task (Abbeel & Ng, 2004; Fu et al., 2017), it is important to consider the problem of inferring reward functions from expert demonstrations, which we refer to as inverse reinforcement learning (IRL). IRL offers many advantages compared to direct policy imitation, such as analyzing and debugging an imitation learning algorithm, inferring agents' intentions, and re-optimizing rewards in new environments (Ng et al., 2000).

However, IRL is ill-defined, as many policies can be optimal for a given reward and many reward functions can explain a set of demonstrations. Maximum entropy inverse reinforcement learning (MaxEnt IRL) (Ziebart et al., 2008) provides a general probabilistic framework to resolve this ambiguity by finding the trajectory distribution with maximum entropy that matches the expected reward of the experts. As MaxEnt IRL requires solving an integral over all possible trajectories to compute the partition function, it is only suitable for small-scale problems with known dynamics. Adversarial IRL (Finn et al., 2016a; Fu et al., 2017) scales MaxEnt IRL to large and continuous problems by drawing an analogy between a sampling-based approximation of MaxEnt IRL and Generative Adversarial Networks (Goodfellow et al., 2014) with a particular discriminator structure. However, the approach is restricted to single-agent settings.



In this paper, we consider the IRL problem in multi-agent environments with high-dimensional continuous state-action space and unknown dynamics. Generalizing MaxEnt IRL and Adversarial IRL to multi-agent systems is challenging. Since each agent's optimal policy depends on other agents' policies, the notion of optimality, central to Markov decision processes, has to be replaced by an appropriate equilibrium solution concept. Nash equilibrium (Hu et al., 1998) is the most popular solution concept for multi-agent RL, where each agent's policy is the best response to others. However, Nash equilibrium is incompatible with MaxEnt RL in the sense that it assumes the agents never take sub-optimal actions. Thus imitation learning and inverse reinforcement learning methods based on Nash equilibrium or correlated equilibrium (Aumann, 1974) might lack the ability to handle irrational (or computationally bounded) experts.

In this paper, inspired by logistic quantal response equilibrium (McKelvey & Palfrey, 1995; 1998) and Gibbs sampling (Hastings, 1970), we propose a new solution concept termed logistic stochastic best response equilibrium (LSBRE), which allows us to characterize the trajectory distribution induced by parameterized reward functions and handle the bounded rationality of expert demonstrations in a principled manner. Specifically, by uncovering the close relationship between LSBRE and MaxEnt RL, and bridging the optimization of the joint likelihood and the conditional likelihoods with maximum pseudolikelihood estimation, we propose Multi-Agent Adversarial Inverse Reinforcement Learning (MA-AIRL), a novel MaxEnt IRL framework for Markov games. MA-AIRL is effective and scalable to large, high-dimensional Markov games with unknown dynamics, which are not amenable to previous methods relying on tabular representations and linear or quadratic programming (Natarajan et al., 2010; Waugh et al., 2013; Lin et al., 2014; 2018). We experimentally demonstrate that MA-AIRL is able to recover reward functions that are highly correlated with the ground truth rewards, while simultaneously learning policies that significantly outperform state-of-the-art multi-agent imitation learning algorithms (Song et al., 2018) in mixed cooperative and competitive tasks (Lowe et al., 2017).

2. Preliminaries

2.1. Markov Games

Markov games (Littman, 1994) are generalizations of Markov decision processes (MDPs) to the case of N interacting agents. A Markov game (S, A, P, η, r) is defined via a set of states S, and N sets of actions {A_i}_{i=1}^N. The function P : S × A_1 × · · · × A_N → P(S) describes the (stochastic) transition process between states, where P(S) denotes the set of probability distributions over the set S. Given that we are in state s^t at time t and the agents take actions (a_1, ..., a_N), the state transitions to s^{t+1} with probability P(s^{t+1} | s^t, a_1, ..., a_N). Each agent i obtains a (bounded) reward given by a function r_i : S × A_1 × · · · × A_N → R. The function η ∈ P(S) specifies the distribution of the initial state. We use bold variables without subscript i to denote the concatenation of all variables for all agents (e.g., π denotes the joint policy, r denotes all rewards and a denotes the actions of all agents in a multi-agent setting). We use subscript −i to denote all agents except for i. For example, (a_i, a_{−i}) represents (a_1, ..., a_N), the actions of all N agents. The objective of each agent i is to maximize its own expected return (i.e., the expected sum of discounted rewards)

$$\mathbb{E}_{\boldsymbol{\pi}}\left[\sum_{t=1}^T \gamma^t r_{i,t}\right]$$

where γ is the discount factor and r_{i,t} is the reward received t steps into the future. Each agent achieves its own objective by selecting actions through a stochastic policy π_i : S → P(A_i). Depending on the context, the policies can be Markovian (i.e., depend only on the state) or require additional coordination signals. For each agent i, we further define the expected return for a state-action pair as:

$$\mathrm{ExpRet}_i^{\pi_i, \pi_{-i}}(s^t, \boldsymbol{a}^t) = \mathbb{E}_{s^{t+1:T}, \boldsymbol{a}^{t+1:T}}\left[\sum_{l \ge t} \gamma^{l-t} r_i(s^l, \boldsymbol{a}^l) \,\middle|\, s^t, \boldsymbol{a}^t, \boldsymbol{\pi}\right]$$
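For readers who prefer code, the tuple (S, A, P, η, r) maps naturally onto a small simulation interface. The sketch below is our own minimal illustration (the class name, method names, and random dynamics are illustrative assumptions, not from the paper), together with a Monte Carlo estimate of one agent's expected return under a joint policy:

```python
import numpy as np

class TwoAgentMatrixGame:
    """Toy Markov game: 2 states, 2 actions per agent, random dynamics and rewards."""
    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)
        self.P = self.rng.dirichlet(np.ones(2), size=(2, 2, 2))  # P[s, a1, a2] -> dist over next state
        self.r = self.rng.normal(size=(2, 2, 2, 2))              # r[i, s, a1, a2]
        self.eta = np.array([0.5, 0.5])                          # initial state distribution

    def reset(self):
        return self.rng.choice(2, p=self.eta)

    def step(self, s, a):
        s_next = self.rng.choice(2, p=self.P[s, a[0], a[1]])
        return s_next, self.r[:, s, a[0], a[1]]                  # next state, per-agent rewards

def expected_return(env, policy, agent=0, gamma=0.95, T=20, episodes=500):
    """Monte Carlo estimate of E_pi[sum_t gamma^t r_{i,t}] under a joint policy."""
    total = 0.0
    for _ in range(episodes):
        s = env.reset()
        for t in range(T):
            s, rewards = env.step(s, policy(s))
            total += gamma ** t * rewards[agent]
    return total / episodes

env = TwoAgentMatrixGame()
print(expected_return(env, policy=lambda s: (np.random.randint(2), np.random.randint(2))))
```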

2.2. Solution Concepts for Markov Games

A correlated equilibrium (CE) for a Markov game (Ziebart et al., 2011) is a joint strategy profile where no agent can achieve a higher expected reward by unilaterally changing its own policy. CE, first introduced by Aumann (1974; 1987), is a more general solution concept than the well-known Nash equilibrium (NE) (Hu et al., 1998), which further requires agents' actions in each state to be independent, i.e., π(a|s) = ∏_{i=1}^N π_i(a_i|s). It has been shown that many decentralized, adaptive strategies will converge to CE instead of a more restrictive equilibrium such as NE (Gordon et al., 2008; Hart & Mas-Colell, 2000). To take bounded rationality into consideration, McKelvey & Palfrey (1995; 1998) further propose logistic quantal response equilibrium (LQRE) as a stochastic generalization of NE and CE.

Definition 1. A logistic quantal response equilibrium for a Markov game corresponds to any strategy profile satisfying a set of constraints, where for each state and action the constraint is given by:

$$\pi_i(a_i|s) = \frac{\exp\left(\lambda \, \mathrm{ExpRet}_i^{\pi}(s, a_i, \boldsymbol{a}_{-i})\right)}{\sum_{a_i'} \exp\left(\lambda \, \mathrm{ExpRet}_i^{\pi}(s, a_i', \boldsymbol{a}_{-i})\right)}$$

Intuitively, in an LQRE, agents choose actions with higher expected return with higher probability.
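Concretely, the constraint above is a temperature-scaled softmax over expected returns. The following minimal sketch (our illustration, assuming a tabular setting where agent i's expected returns in a fixed state are available as a vector; the function and variable names are not from the paper) computes the corresponding quantal response distribution:

```python
import numpy as np

def lqre_response(expected_returns: np.ndarray, lam: float) -> np.ndarray:
    """Softmax over expected returns with rationality parameter lambda.

    expected_returns[a_i] is ExpRet_i(s, a_i, a_{-i}) for each action a_i,
    holding the other agents' behavior fixed.
    """
    logits = lam * expected_returns
    logits -= logits.max()              # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Example: three actions; higher expected return -> higher probability.
print(lqre_response(np.array([1.0, 2.0, 0.5]), lam=1.0))
```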

2.3. Learning from Expert Demonstrations

Suppose we do not have access to the ground-truth reward signal r, but have demonstrations D provided by an expert (N expert agents in Markov games). D is a set of trajectories {τ_j}_{j=1}^M, where τ_j = {(s^t_j, a^t_j)}_{t=1}^T is an expert trajectory collected by sampling s^1 ∼ η(s), a^t ∼ π_E(a^t|s^t), s^{t+1} ∼ P(s^{t+1}|s^t, a^t). D contains the entire supervision provided to the learning algorithm, i.e., we assume we cannot ask for additional interactions with the experts during training. Given D, imitation learning (IL) aims to directly learn policies that behave similarly to the demonstrations, whereas inverse reinforcement learning (IRL) (Russell, 1998; Ng et al., 2000) seeks to infer the underlying reward functions which induce the expert policies.

The MaxEnt IRL framework (Ziebart et al., 2008) aims to recover a reward function that rationalizes the expert behaviors with the least commitment, denoted as IRL(π_E):

$$\mathrm{IRL}(\pi_E) = \arg\max_{r \in \mathbb{R}^{S \times A}} \; \mathbb{E}_{\pi_E}[r(s, a)] - \mathrm{RL}(r)$$

$$\mathrm{RL}(r) = \max_{\pi \in \Pi} \; H(\pi) + \mathbb{E}_{\pi}[r(s, a)]$$

where H(π) = E_π[− log π(a|s)] is the policy entropy. However, MaxEnt IRL is generally considered less efficient and scalable than direct imitation, as we need to solve a forward RL problem in the inner loop. In the context of imitation learning, Ho & Ermon (2016) proposed to use generative adversarial training (Goodfellow et al., 2014) to learn the policies characterized by RL ∘ IRL(π_E) directly, leading to the Generative Adversarial Imitation Learning (GAIL) algorithm:

$$\min_{\theta} \max_{\omega} \; \mathbb{E}_{\pi_E}[\log D_\omega(s, a)] + \mathbb{E}_{\pi_\theta}[\log(1 - D_\omega(s, a))]$$

where D_ω is a discriminator that classifies expert and policy trajectories, and π_θ is the parameterized policy that tries to maximize its score under D_ω. According to Goodfellow et al. (2014), with infinite data and infinite computation, at optimality the distribution of generated state-action pairs should exactly match the distribution of demonstrated state-action pairs under the GAIL objective. The downside of this approach, however, is that it bypasses the intermediate step of recovering rewards. Specifically, note that we cannot extract reward functions from the discriminator, as D_ω(s, a) will converge to 0.5 for all (s, a) pairs.
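For intuition, the inner maximization over ω is ordinary binary classification between expert and policy state-action pairs. The sketch below (a minimal single-agent illustration, not the paper's implementation; the network size, optimizer, and dimensions are arbitrary assumptions) shows one such discriminator update in PyTorch:

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2                       # assumed toy dimensions
disc = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(disc.parameters(), lr=3e-4)
bce = nn.BCEWithLogitsLoss()

def discriminator_step(expert_sa: torch.Tensor, policy_sa: torch.Tensor) -> float:
    """Ascend E_expert[log D] + E_policy[log(1 - D)] (equivalently, descend the BCE)."""
    expert_logits = disc(expert_sa)           # D = sigmoid(logits)
    policy_logits = disc(policy_sa)
    loss = bce(expert_logits, torch.ones_like(expert_logits)) + \
           bce(policy_logits, torch.zeros_like(policy_logits))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Example update on random batches of concatenated (state, action) vectors.
discriminator_step(torch.randn(32, obs_dim + act_dim), torch.randn(32, obs_dim + act_dim))
```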

2.4. Adversarial Inverse Reinforcement Learning

Besides resolving the ambiguity that many optimal rewards can explain a set of demonstrations, another advantage of MaxEnt IRL is that it can be interpreted as solving the following maximum likelihood estimation (MLE) problem:

$$p_\omega(\tau) \propto \left[\eta(s^1) \prod_{t=1}^T P(s^{t+1}|s^t, a^t)\right] \exp\left(\sum_{t=1}^T r_\omega(s^t, a^t)\right) \tag{1}$$

$$\max_\omega \; \mathbb{E}_{\pi_E}[\log p_\omega(\tau)] = \mathbb{E}_{\tau \sim \pi_E}\left[\sum_{t=1}^T r_\omega(s^t, a^t)\right] - \log Z_\omega$$

Here, ω are the parameters of the reward function and Z_ω is the partition function, i.e., an integral over all possible trajectories consistent with the environment dynamics. Z_ω is intractable to compute when the state-action spaces are large or continuous and the environment dynamics are unknown.

Combining Guided Cost Learning (GCL) (Finn et al., 2016b) and generative adversarial training, Finn et al. and Fu et al. proposed the adversarial IRL framework as an efficient sampling-based approximation to MaxEnt IRL, where the discriminator takes a particular form:

$$D_\omega(s, a) = \frac{\exp(f_\omega(s, a))}{\exp(f_\omega(s, a)) + q(a|s)}$$

where f_ω(s, a) is the learned function, q(a|s) is the probability of the adaptive sampler, pre-computed as an input to the discriminator, and the policy is trained to maximize log D − log(1 − D).

To alleviate the reward shaping ambiguity (Ng et al., 1999), where many reward functions can explain an optimal policy, Fu et al. (2017) further restricted f to a reward estimator g_ω and a potential shaping function h_φ:

$$f_{\omega,\phi}(s, a, s') = g_\omega(s, a) + \gamma h_\phi(s') - h_\phi(s)$$

It has been shown that, under suitable assumptions, g_ω and h_φ will recover the true reward and value function up to a constant.
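The two displays above are straightforward to evaluate numerically. The snippet below (a numerical sketch with made-up scalar inputs, not the paper's code) computes the AIRL-style discriminator and the shaped reward decomposition; note that D can equivalently be written as a sigmoid of f_ω(s, a) − log q(a|s):

```python
import numpy as np

def airl_discriminator(f_value: float, log_q: float) -> float:
    """D = exp(f) / (exp(f) + q) = sigmoid(f - log q)."""
    return 1.0 / (1.0 + np.exp(-(f_value - log_q)))

def shaped_f(g_sa: float, h_s: float, h_s_next: float, gamma: float = 0.99) -> float:
    """f_{w,phi}(s, a, s') = g_w(s, a) + gamma * h_phi(s') - h_phi(s)."""
    return g_sa + gamma * h_s_next - h_s

# Toy numbers: reward estimate 1.2, potentials h(s)=0.4 and h(s')=0.5,
# and an adaptive-sampler probability q(a|s)=0.1.
f = shaped_f(g_sa=1.2, h_s=0.4, h_s_next=0.5)
print(airl_discriminator(f, log_q=np.log(0.1)))
```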

3. Method

3.1. Logistic Stochastic Best Response Equilibrium

To extend MaxEnt IRL to Markov games, we need to be able to characterize the trajectory distribution induced by a set of (parameterized) reward functions {r_i(s, a)}_{i=1}^N (analogous to Equation (1)). However, the existing optimality notions introduced in Section 2.2 do not explicitly define a tractable joint strategy profile that we can use to maximize the likelihood of expert demonstrations (as a function of the rewards); they do so only implicitly, as the solution to a set of constraints.

Motivated by Gibbs sampling (Hastings, 1970), dependency networks (Heckerman et al., 2000), best response dynamics (Nisan et al., 2011; Gandhi, 2012) and LQRE, we propose a new solution concept that allows us to characterize rational (joint) policies induced from a set of reward functions. Intuitively, our solution concept corresponds to the result of repeatedly applying a stochastic (entropy-regularized) best response mechanism, where each agent in turn attempts to optimize its actions while keeping the other agents' actions fixed.

To begin with, let us first consider a stateless single-shot normal-form game with N players and a reward function r_i : A_1 × · · · × A_N → R for each player i. We consider the following Markov chain over (A_1 × · · · × A_N), where the state of the Markov chain at step k is denoted z^{(k)} = (z_1, · · · , z_N)^{(k)}, with each random variable z_i^{(k)} taking values in A_i. The transition kernel of the Markov chain is defined by the following equations:

$$z_i^{(k+1)} \sim P_i\left(a_i \,\middle|\, \boldsymbol{a}_{-i} = z_{-i}^{(k)}\right) = \frac{\exp\left(\lambda r_i(a_i, z_{-i}^{(k)})\right)}{\sum_{a_i'} \exp\left(\lambda r_i(a_i', z_{-i}^{(k)})\right)} \tag{2}$$

and each agent i is updated in scan order. Given all other players' actions z_{-i}^{(k)}, the i-th player picks an action proportionally to exp(λ r_i(a_i, z_{-i}^{(k)})), where λ > 0 is a parameter that controls the level of rationality of the agents. For λ → 0, the agent will select actions uniformly at random, while for λ → ∞, the agent will select actions greedily (best response). Because the Markov chain is ergodic, it admits a unique stationary distribution, which we denote π(a). Interpreting this stationary distribution over (A_1 × · · · × A_N) as a policy, we call this stationary joint policy a logistic stochastic best response equilibrium for normal-form games.
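The following sketch simulates this chain for a two-player normal-form game with tabular rewards and estimates the stationary joint policy by Monte Carlo (the payoff matrices, λ, and the burn-in length are illustrative assumptions, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy 2-player game: r[i][a1, a2] is player i's reward.
r = [np.array([[2.0, 0.0], [0.0, 1.0]]),      # player 1
     np.array([[1.0, 0.0], [0.0, 2.0]])]      # player 2
lam = 2.0

def softmax_response(values):
    logits = lam * values
    logits = logits - logits.max()
    p = np.exp(logits)
    return p / p.sum()

def stochastic_best_response_chain(steps=50_000, burn_in=1_000):
    """Run the scan-order stochastic best response kernel and count joint actions."""
    z = [0, 0]                                 # current joint action (z_1, z_2)
    counts = np.zeros((2, 2))
    for k in range(steps):
        # player 1 responds to z_2, then player 2 responds to the updated z_1
        z[0] = rng.choice(2, p=softmax_response(r[0][:, z[1]]))
        z[1] = rng.choice(2, p=softmax_response(r[1][z[0], :]))
        if k >= burn_in:
            counts[z[0], z[1]] += 1
    return counts / counts.sum()               # empirical stationary joint policy

print(stochastic_best_response_chain())
```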

Now let us generalize the solution concept to Markov games. For each agent i, let {π^t_i}_{t=1}^T denote a set of time-dependent policies. First we define the state-action value function for each agent i, starting from the base case:

$$Q_i^{\pi_\varnothing}(s^T, a_i^T, \boldsymbol{a}_{-i}^T) = r_i(s^T, a_i^T, \boldsymbol{a}_{-i}^T)$$

then we recursively define:

$$Q_i^{\pi^{t+1:T}}(s^t, a_i^t, \boldsymbol{a}_{-i}^t) = r_i(s^t, a_i^t, \boldsymbol{a}_{-i}^t) + \mathbb{E}_{s^{t+1} \sim P(\cdot|s^t, \boldsymbol{a}^t)}\left[\mathcal{H}\left(\pi_i^{t+1}(\cdot|s^{t+1})\right) + \mathbb{E}_{\boldsymbol{a}^{t+1} \sim \pi^{t+1}(\cdot|s^{t+1})}\left[Q_i^{\pi^{t+2:T}}(s^{t+1}, \boldsymbol{a}^{t+1})\right]\right]$$

which generalizes the standard state-action value function in single-agent RL (a_{-i} = ∅ when N = 1).

Definition 2. Given a Markov game with horizon T, the logistic stochastic best response equilibrium (LSBRE) is a sequence of T stochastic policies {π^t}_{t=1}^T constructed by the following process. Consider T Markov chains over (A_1 × · · · × A_N)^{|S|}, where the state of the t-th Markov chain at step k is {z_i^{t,(k)} : S → A_i}_{i=1}^N, with each random variable z_i^{t,(k)}(s) taking values in A_i. For t ∈ [T, . . . , 1], we recursively define the stationary joint distribution π^t(a|s) of the t-th Markov chain in terms of {π^ℓ}_{ℓ=t+1}^T as follows. For s^t ∈ S and i ∼ [1, · · · , N], we update the state of the Markov chain as:

$$z_i^{t,(k+1)}(s^t) \sim P_i^t\left(a_i^t \,\middle|\, \boldsymbol{a}_{-i}^t = z_{-i}^{t,(k)}(s^t), s^t\right) = \frac{\exp\left(\lambda Q_i^{\pi^{t+1:T}}(s^t, a_i^t, z_{-i}^{t,(k)}(s^t))\right)}{\sum_{a_i'} \exp\left(\lambda Q_i^{\pi^{t+1:T}}(s^t, a_i', z_{-i}^{t,(k)}(s^t))\right)} \tag{3}$$

where the parameter λ ∈ R_+ controls the level of rationality of the agents, and {P_i^t}_{i=1}^N specifies a set of conditional distributions. The LSBRE for a Markov game is the sequence of T joint stochastic policies {π^t}_{t=1}^T. Each joint policy π^t : S → P(A_1 × · · · × A_N) is given by:

$$\pi^t(a_1, \cdots, a_N | s^t) = \mathbb{P}\left(\bigcap_i \left\{z_i^{t,(\infty)}(s^t) = a_i\right\}\right) \tag{4}$$

where the probability is taken with respect to the unique stationary distribution of the t-th Markov chain.

When the set of conditionals in Equation (2) are compatible (in the sense that each conditional can be inferred from the same joint distribution (Arnold & Press, 1989)), the above process corresponds to a Gibbs sampler, which will converge to a stationary joint distribution π(a) whose conditional distributions are consistent with the ones used during sampling, namely Equation (2). This is the case, for example, if the agents are cooperative, i.e., they share the same reward function r_i. In general, π(a) is the distribution specified by the dependency network (Heckerman et al., 2000) defined via the conditionals in Equation (2). The same argument can be made for the Markov chains in Definition 2 with respect to the conditionals in Equation (3).

When the set of conditionals in Equations (2) and (3) are incompatible, the procedure is called a pseudo Gibbs sampler. As discussed in the literature on dependency networks (Heckerman et al., 2000; Chen et al., 2011; Chen & Ip, 2015), when the conditionals are learned from a sufficiently large dataset, the pseudo Gibbs sampler asymptotically works well in the sense that the conditionals of the stationary joint distribution are nearly consistent with the conditionals used during sampling. Under some conditions, theoretical bounds on the approximation can be obtained (Heckerman et al., 2000).

3.2. Trajectory Distributions Induced by LSBRE

Following (Fu et al., 2017; Levine, 2018), without loss of generality, in the remainder of this paper we consider the case where λ = 1. First, we note that there is a connection between the notion of LSBRE and maximum causal entropy reinforcement learning (Ziebart, 2010). Specifically, we can characterize the trajectory distribution induced by LSBRE policies with an energy-based formulation, where the probability of a trajectory increases exponentially as the sum of rewards increases. Formally, with LSBRE policies, the probability of generating a certain trajectory can be characterized by the following theorem:

Theorem 1. Given a joint policy {π^t(a^t|s^t)}_{t=1}^T specified by LSBRE, for each agent i, let {π^t_{-i}(a^t_{-i}|s^t)}_{t=1}^T denote the other agents' marginal distribution and {π^t_i(a^t_i|a^t_{-i}, s^t)}_{t=1}^T denote agent i's conditional distribution, both obtained from the LSBRE joint policies. Then the LSBRE conditional distributions are the optimal solution to the following optimization problem:

$$\min_{\hat{\pi}_i^{1:T}} \; D_{\mathrm{KL}}(\hat{p}(\tau) \,\|\, p(\tau)) \tag{5}$$

$$\hat{p}(\tau) = \left[\eta(s^1) \cdot \prod_{t=1}^T P(s^{t+1}|s^t, \boldsymbol{a}^t) \cdot \pi_{-i}^t(\boldsymbol{a}_{-i}^t|s^t)\right] \cdot \prod_{t=1}^T \hat{\pi}_i^t(a_i^t|\boldsymbol{a}_{-i}^t, s^t)$$

$$p(\tau) \propto \left[\eta(s^1) \cdot \prod_{t=1}^T P(s^{t+1}|s^t, \boldsymbol{a}^t) \cdot \pi_{-i}^t(\boldsymbol{a}_{-i}^t|s^t)\right] \cdot \exp\left(\sum_{t=1}^T r_i(s^t, a_i^t, \boldsymbol{a}_{-i}^t)\right) \tag{6}$$

Proof. See Appendix A.1.

Intuitively, for single-shot normal-form games, the above statement holds directly from the definition in Equation (2). For Markov games, similar to the process introduced in Definition 2, we can employ a dynamic programming algorithm to find the conditional policies which minimize Equation (5). Specifically, we first construct the base case of t = T as a normal-form game, then recursively construct the conditional policy for each time step t, based on the policies from t + 1 to T that have already been constructed. It can be shown that the constructed optimal policy, which minimizes the KL divergence between its trajectory distribution and the trajectory distribution defined in Equation (6), corresponds to the set of conditional policies in LSBRE.

3.3. Multi-Agent Adversarial IRL

In the remainder of this paper, we assume that the expert policies form a unique LSBRE under some unknown (parameterized) reward functions, according to Definition 2. By adopting LSBRE as the optimality notion, we are able to rationalize the demonstrations by maximizing the likelihood of the expert trajectories with respect to the LSBRE stationary distribution, which is in turn induced by the ω-parameterized reward functions {r_i(s, a; ω_i)}_{i=1}^N.

The probability of a trajectory τ = {s^t, a^t}_{t=1}^T generated by LSBRE policies in a Markov game is given by the following generative process:

$$p(\tau) = \eta(s^1) \cdot \prod_{t=1}^T \pi^t(\boldsymbol{a}^t|s^t; \omega) \cdot \prod_{t=1}^T P(s^{t+1}|s^t, \boldsymbol{a}^t) \tag{7}$$

where π^t(a^t|s^t; ω) are the unique stationary joint distributions for the LSBRE induced by {r_i(s, a; ω_i)}_{i=1}^N. The initial state distribution η(s^1) and the transition dynamics P(s^{t+1}|s^t, a^t) are specified by the Markov game.

As mentioned in Section 2.4, the MaxEnt IRL framework interprets finding suitable reward functions as maximum likelihood estimation over the expert trajectories under the distribution defined in Equation (7), which can be reduced to:

$$\max_\omega \; \mathbb{E}_{\tau \sim \pi_E}\left[\sum_{t=1}^T \log \pi^t(\boldsymbol{a}^t|s^t; \omega)\right] \tag{8}$$

since the initial state distribution and transition dynamics do not depend on the parameterized rewards.

Note that π^t(a^t|s^t) in Equation (8) is the joint policy defined in Equation (4), whose conditional distributions are given by Equation (3). From Section 3.1, we know that given a set of ω-parameterized reward functions, we are able to characterize the conditional policies {π^t_i(a^t_i|a^t_{-i}, s^t)}_{t=1}^T for each agent i. However, direct optimization of the joint MLE objective in Equation (8) is intractable, as we cannot obtain a closed form for the stationary joint policy. Fortunately, we are able to construct an asymptotically consistent estimator by approximating the joint likelihood π^t(a^t|s^t) with a product of the conditionals ∏_{i=1}^N π^t_i(a^t_i|a^t_{-i}, s^t), which is termed a pseudolikelihood (Besag, 1975).

With the asymptotic consistency property of maximum pseudolikelihood estimation (Besag, 1975; Lehmann & Casella, 2006), we have the following theorem:

Theorem 2. Let the demonstrations τ_1, . . . , τ_M be independent and identically distributed (sampled from the LSBRE induced by some unknown reward functions), and suppose that for all t ∈ [1, . . . , T], a^t_i ∈ A_i, π^t_i(a^t_i|a^t_{-i}, s^t; ω_i) is differentiable with respect to ω_i. Then, with probability tending to 1 as M → ∞, the equation

$$\frac{\partial}{\partial \omega} \sum_{m=1}^M \sum_{t=1}^T \sum_{i=1}^N \log \pi_i^t(a_i^{m,t} | \boldsymbol{a}_{-i}^{m,t}, s^{m,t}; \omega_i) = 0 \tag{9}$$

has a root ω̂_M such that ω̂_M tends to the maximizer of the joint likelihood in Equation (8).

Proof. See Appendix A.2.

Theorem 2 bridges the gap between optimizing the joint likelihood and each conditional likelihood. Now we are able to maximize the objective in Equation (8) as:

$$\mathbb{E}_{\pi_E}\left[\sum_{i=1}^N \sum_{t=1}^T \frac{\partial}{\partial \omega} \log \pi_i^t(a_i^t | \boldsymbol{a}_{-i}^t, s^t; \omega_i)\right] \tag{10}$$
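To make the pseudolikelihood idea concrete, the sketch below evaluates the per-agent conditional log-likelihoods for a two-player normal-form game with tabular rewards parameterized directly by their entries (this simplified tabular setup, and all variable names, are our own illustration rather than the paper's implementation; it assumes λ = 1 softmax responses):

```python
import numpy as np
from scipy.special import logsumexp

def log_softmax(x):
    return x - logsumexp(x)

def pseudolikelihood(rewards, demos):
    """Average over demos of sum_i log pi_i(a_i | a_{-i}) under softmax responses.

    rewards: list of payoff matrices, rewards[i][a1, a2] = r_i(a1, a2).
    demos:   array of joint actions with shape (M, 2).
    """
    total = 0.0
    for a1, a2 in demos:
        total += log_softmax(rewards[0][:, a2])[a1]   # agent 1's conditional given a2
        total += log_softmax(rewards[1][a1, :])[a2]   # agent 2's conditional given a1
    return total / len(demos)

rewards = [np.array([[2.0, 0.0], [0.0, 1.0]]),
           np.array([[1.0, 0.0], [0.0, 2.0]])]
demos = np.array([[0, 0], [1, 1], [0, 0]])
print(pseudolikelihood(rewards, demos))
```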

To optimize the maximum pseudolikelihood objective in Equation (10), we can instead optimize the following surrogate loss, which is a variational approximation to the pseudolikelihood objective (from Theorem 1):

$$\mathbb{E}_{\pi_E}\left[\sum_{i=1}^N \sum_{t=1}^T \frac{\partial}{\partial \omega} r_i(s^t, \boldsymbol{a}^t; \omega_i)\right] - \sum_{i=1}^N \frac{\partial}{\partial \omega} \log Z_{\omega_i}$$


where Z_{ω_i} is the partition function of the distribution in Equation (6). It is generally intractable to exactly compute and optimize the partition function Z_ω, which involves an integral over all trajectories. Similar to GCL (Finn et al., 2016b) and single-agent AIRL (Fu et al., 2017), we employ importance sampling to estimate the partition function with adaptive samplers q_θ. Now we are ready to introduce our practical Multi-Agent Adversarial IRL (MA-AIRL) framework, where we train the ω-parameterized discriminators as:

$$\max_\omega \; \mathbb{E}_{\pi_E}\left[\sum_{i=1}^N \log \frac{\exp(f_{\omega_i}(s, \boldsymbol{a}))}{\exp(f_{\omega_i}(s, \boldsymbol{a})) + q_{\theta_i}(a_i|s)}\right] + \mathbb{E}_{q_\theta}\left[\sum_{i=1}^N \log \frac{q_{\theta_i}(a_i|s)}{\exp(f_{\omega_i}(s, \boldsymbol{a})) + q_{\theta_i}(a_i|s)}\right] \tag{11}$$

and we train the θ-parameterized generators as:

$$\max_\theta \; \mathbb{E}_{q_\theta}\left[\sum_{i=1}^N \log D_{\omega_i}(s, \boldsymbol{a}) - \log(1 - D_{\omega_i}(s, \boldsymbol{a}))\right] = \mathbb{E}_{q_\theta}\left[\sum_{i=1}^N f_{\omega_i}(s, \boldsymbol{a}) - \log q_{\theta_i}(a_i|s)\right] \tag{12}$$

Specifically, for each agent i we have a discriminator with the particular structure exp(f_ω) / (exp(f_ω) + q_θ) for binary classification, and a generator that serves as an adaptive importance sampler for estimating the partition function. Intuitively, q_θ is trained to minimize the KL divergence between its trajectory distribution and that induced by the reward functions, which reduces the variance of importance sampling, while f_ω in the discriminator is trained to estimate the reward function. At optimality, f_ω will approximate the advantage function for the expert policy and q_θ will approximate the expert policy.
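The per-agent objective in Equation (11) is again binary logistic regression, now with logits f_{ω_i}(s, a) − log q_{θ_i}(a_i|s), and Equation (12) uses the same quantity as the policy's learning signal. The sketch below (our simplified single-agent, single-batch illustration, not the released implementation; the network size and dimensions are arbitrary assumptions) shows both computations:

```python
import torch
import torch.nn as nn

obs_dim, joint_act_dim = 8, 4                   # assumed toy dimensions (2 agents, 2-dim actions)
f_net = nn.Sequential(nn.Linear(obs_dim + joint_act_dim, 64), nn.Tanh(), nn.Linear(64, 1))
bce = nn.BCEWithLogitsLoss()

def discriminator_loss(expert_s_a, expert_log_q, policy_s_a, policy_log_q):
    """Negative of agent i's term in Eq. (11): D_i = sigmoid(f_i(s,a) - log q_i(a_i|s))."""
    expert_logits = f_net(expert_s_a).squeeze(-1) - expert_log_q
    policy_logits = f_net(policy_s_a).squeeze(-1) - policy_log_q
    return bce(expert_logits, torch.ones_like(expert_logits)) + \
           bce(policy_logits, torch.zeros_like(policy_logits))

def generator_reward(policy_s_a, policy_log_q):
    """Agent i's signal from Eq. (12): log D - log(1 - D) = f(s,a) - log q(a_i|s)."""
    with torch.no_grad():
        return f_net(policy_s_a).squeeze(-1) - policy_log_q

# Random batches of concatenated (state, joint action) features and log q_i(a_i|s) values.
batch = torch.randn(32, obs_dim + joint_act_dim)
log_q = torch.full((32,), -1.0)
print(discriminator_loss(batch, log_q, torch.randn(32, obs_dim + joint_act_dim), log_q))
print(generator_reward(batch, log_q).mean())
```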

3.4. Solving Reward Ambiguity in Multi-Agent IRL

For single-agent reinforcement learning, Ng et al. show that for any state-only potential function Φ : S → R, potential-based reward shaping, defined as

$$r'(s^t, a^t, s^{t+1}) = r(s^t, a^t, s^{t+1}) + \gamma \Phi(s^{t+1}) - \Phi(s^t)$$

is a necessary and sufficient condition to guarantee invariance of the optimal policy in both finite and infinite horizon MDPs. In other words, given a set of expert demonstrations, there is a class of reward functions, all of which can explain the demonstrated expert behaviors. Thus, without further assumptions, it would be impossible to identify the ground-truth reward that induces the expert policy within this class. Similar issues also exist when we consider multi-agent scenarios. Devlin & Kudenko show that in multi-agent systems, using the same reward shaping for one or more agents will not alter the set of Nash equilibria.

Algorithm 1  Multi-Agent Adversarial IRL

Input: Expert trajectories D_E = {τ^E_j}; Markov game as a black box with parameters (N, S, A, η, P, γ).
Initialize the parameters θ, ω, φ of the policies q, reward estimators g, and potential functions h.
repeat
    Sample trajectories D_π = {τ_j} from π: s^1 ∼ η(s), a^t ∼ π(a^t|s^t), s^{t+1} ∼ P(s^{t+1}|s^t, a^t).
    Sample state-action pairs X_π, X_E from D_π, D_E.
    for i = 1, . . . , N do
        Update ω_i, φ_i to increase the objective in Eq. (11):
            E_{X_E}[log D(s, a_i, s')] + E_{X_π}[log(1 − D(s, a_i, s'))].
    end for
    for i = 1, . . . , N do
        Update the reward estimate r̂_i(s, a_i, s') with g_{ω_i}(s, a_i) or (log D(s, a_i, s') − log(1 − D(s, a_i, s'))).
        Update θ_i with respect to r̂_i(s, a_i, s').
    end for
until convergence
Output: Learned policies π_θ and reward functions g_ω.

It is possible to extend this result to other solution concepts such as CE and LSBRE. For example, in the case of LSBRE, after specifying the level of rationality λ, for any π_i ≠ π_i^{LSBRE}, we have:

$$\mathbb{E}_{\pi_i^{\mathrm{LSBRE}}, \pi_{-i}^{\mathrm{LSBRE}}}[r_i(s, \boldsymbol{a})] \geq \mathbb{E}_{\pi_i, \pi_{-i}^{\mathrm{LSBRE}}}[r_i(s, \boldsymbol{a})] \tag{13}$$

since each individual LSBRE conditional policy is the optimal solution to the corresponding entropy-regularized RL problem (see Appendix A.1). It can also be shown that any policy that satisfies the inequality in Equation (13) will still satisfy the inequality after reward shaping (Devlin & Kudenko, 2011).

To mitigate the reward shaping effect and recover reward functions with higher linear correlation to the ground-truth reward, as in (Fu et al., 2017), we further assume that the functions f_{ω_i} in Equation (11) have a specific structure:

$$f_{\omega_i, \phi_i}(s^t, a^t, s^{t+1}) = g_{\omega_i}(s^t, a^t) + \gamma h_{\phi_i}(s^{t+1}) - h_{\phi_i}(s^t)$$

where g_ω is a reward estimator and h_φ is a potential function. We summarize the MA-AIRL training procedure in Algorithm 1.

4. Related Work

A vast number of methods and paradigms have been proposed for single-agent imitation learning and inverse reinforcement learning. However, multi-agent scenarios are less commonly investigated, and most existing works assume specific reward structures. These include fully cooperative games (Barrett et al., 2017; Le et al., 2017; Sosic et al., 2017; Bogert & Doshi, 2014), two-player zero-sum games (Lin et al., 2014), and rewards as linear combinations of pre-specified features (Reddy et al., 2012; Waugh et al., 2013). Recently, Song et al. proposed MA-GAIL, a multi-agent extension of GAIL which works on general Markov games.

While both MA-AIRL and MA-GAIL are based on adversarial training, the methods are inherently different. MA-GAIL is based on the notion of Nash equilibrium, and is motivated via a specific choice of Lagrange multipliers for a constrained optimization problem. MA-AIRL, on the other hand, is derived from MaxEnt RL and LSBRE, and aims to obtain an MLE solution for the joint trajectories; we connect this with a set of conditionals via pseudolikelihood, which are then solved with the adversarial reward learning framework. From a reward learning perspective, the discriminators' outputs in MA-GAIL will converge to an uninformative uniform distribution, whereas MA-AIRL allows us to recover reward functions from the optimal discriminators.

5. Experiments

We seek to answer the following questions via empirical evaluation: (1) Can MA-AIRL efficiently recover the expert policies for each individual agent from the expert demonstrations (policy imitation)? (2) Can MA-AIRL effectively recover the underlying reward functions, for which the expert policies form an LSBRE (reward recovery)?

Task Description. To answer these questions, we evaluate our MA-AIRL algorithm on a series of simulated particle environments (Lowe et al., 2017). Specifically, we consider the following scenarios: cooperative navigation, where three agents cooperate through physical actions to reach three landmarks; cooperative communication, where two agents, a speaker and a listener, cooperate to navigate to a particular landmark; and competitive keep-away, where one agent tries to reach a target landmark, while an adversary, without knowing the target a priori, tries to infer the target from the agent's behaviors and prevent it from reaching the goal through physical interactions.

In our experiments, for generality, the learning algorithms will not leverage any prior knowledge on the types of interactions (cooperative or competitive). Thus, for all the tasks described above, the learning algorithms take a decentralized form, and we do not utilize additional reward regularization beyond penalizing the ℓ2 norm of the reward parameters to mitigate overfitting (Ziebart, 2010; Kalakrishnan et al., 2013).

Training Procedure. In the simulated environments, we have access to the ground-truth reward functions, which enables us to accurately evaluate the quality of both the recovered policies and the recovered reward functions. We use a multi-agent version of ACKTR (Wu et al., 2017; Song et al., 2018), an efficient model-free policy gradient algorithm, for training the experts as well as the adaptive samplers in MA-AIRL. The supervision signals for the experts come from the ground-truth rewards, while the reward signals for the adaptive samplers come from the discriminators. Specifically, we first obtain expert policies induced by the ground-truth rewards, then we use them to generate demonstrations, from which the learning algorithms will try to recover the policies as well as the underlying reward functions. We compare MA-AIRL against the state-of-the-art multi-agent imitation learning algorithm, MA-GAIL (Song et al., 2018), which is a generalization of GAIL to Markov games. Following (Li et al., 2017; Song et al., 2018), we use behavior cloning to pretrain MA-AIRL and MA-GAIL to reduce the sample complexity of exploration, and we use 200 episodes of expert demonstrations, each with 50 time steps, which is close to the amount of time steps used in (Ho & Ermon, 2016).¹

5.1. Policy Imitation

Although MA-GAIL achieved superior performance compared with behavior cloning (Song et al., 2018), it only aims to recover policies via distribution matching. Moreover, the training signal for the policy will become less informative as training progresses; according to (Goodfellow et al., 2014), with infinite data and computational resources the discriminator outputs will converge to 0.5 for all state-action pairs, which could potentially hinder the robustness of the policy towards the end of training. To empirically verify our claims, we compare the quality of the learned policies in terms of the expected return received by each agent.

In the cooperative environments, we directly use the ground-truth rewards from the environment as the oracle metric, since all agents share the same reward. In the competitive environment, we follow the evaluation procedure in (Song et al., 2018), where we place the experts and the learned policies in the same environment. A learned policy is considered "better" if it receives a higher expected return while its opponent receives a lower expected return. The results for the cooperative and competitive environments are shown in Tables 1 and 2, respectively. MA-AIRL consistently performs better than MA-GAIL in terms of the received reward in all the considered environments, suggesting a superior ability to imitate the experts.

5.2. Reward Recovery

The second question we seek to answer concerns the reward recovery problem in inverse reinforcement learning: is the algorithm able to recover the ground-truth reward functions with expert demonstrations as the only source of supervision? To answer this question, we evaluate the statistical correlations between the ground-truth rewards (to which the learning algorithms have no access) and the inferred rewards for the same state-action pairs.

¹ The codebase for this work can be found at https://github.com/ermongroup/MA-AIRL.


Table 1. Expected returns in cooperative tasks. Mean and variance are taken across different random seeds used to train the policies.

Algorithm   Nav. ExpRet           Comm. ExpRet
Expert      -43.195 ± 2.659       -12.712 ± 1.613
Random      -391.314 ± 10.092     -125.825 ± 3.4906
MA-GAIL     -52.810 ± 2.981       -12.811 ± 1.604
MA-AIRL     -47.515 ± 2.549       -12.727 ± 1.557

Table 2. Expected returns of the agents in the competitive task. Agent #1 represents the agent trying to reach the target and Agent #2 represents the adversary. Mean and variance are taken across different random seeds.

Agent #1   Agent #2   Agent #1 ExpRet
Expert     Expert     -6.804 ± 0.316
MA-GAIL    Expert     -6.978 ± 0.305
MA-AIRL    Expert     -6.785 ± 0.312
Expert     MA-GAIL    -6.919 ± 0.298
Expert     MA-AIRL    -7.367 ± 0.311

Specifically, we consider two types of statistical correlations: Pearson's correlation coefficient (PCC), which measures the linear correlation between two random variables; and Spearman's rank correlation coefficient (SCC), which measures the statistical dependence between the rankings of two random variables. A higher SCC suggests that two reward functions have a stronger monotonic relationship, and a higher PCC suggests a stronger linear correlation. For each trajectory, we compare the ground-truth return from the environment with the supervision signals from the discriminators, which correspond to g_ω in MA-AIRL and log(D_ω) in MA-GAIL.
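Both statistics are standard; for reference, the snippet below computes them with SciPy on a pair of per-trajectory return vectors (the arrays here are placeholder data, not results from the paper):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Placeholder per-trajectory returns: ground truth vs. discriminator-derived rewards.
ground_truth = np.array([-43.2, -47.5, -52.8, -41.0, -60.3])
inferred     = np.array([-10.1, -11.9, -14.2, -9.8, -16.5])

pcc, _ = pearsonr(ground_truth, inferred)    # linear correlation
scc, _ = spearmanr(ground_truth, inferred)   # rank (monotonic) correlation
print(f"PCC = {pcc:.3f}, SCC = {scc:.3f}")
```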

Table 3. Statistical correlations between the learned reward functions and the ground-truth rewards in cooperative tasks. Mean and variance are taken across the N independently learned reward functions for the N agents.

Task     Metric   MA-GAIL          MA-AIRL
Nav.     SCC      0.792 ± 0.085    0.934 ± 0.015
Nav.     PCC      0.556 ± 0.081    0.882 ± 0.028
Comm.    SCC      0.879 ± 0.059    0.936 ± 0.080
Comm.    PCC      0.612 ± 0.093    0.848 ± 0.099

Table 4. Statistical correlations between the learned reward functions and the ground-truth rewards in the competitive task.

Metric        MA-GAIL   MA-AIRL
SCC #1        0.424     0.534
SCC #2        0.653     0.907
Average SCC   0.538     0.721
PCC #1        0.497     0.720
PCC #2        0.392     0.667
Average PCC   0.445     0.694

Tables 3 and 4 provide the SCC and PCC statistics for the cooperative and competitive environments, respectively. In the cooperative case, compared to MA-GAIL, MA-AIRL achieves a much higher PCC and SCC, which could facilitate policy learning. The statistical correlations between the reward signals gathered from the discriminators for each agent are also quite high, suggesting that, even though we do not reveal that the agents are cooperative, MA-AIRL is able to discover high correlations between the agents' reward functions. In the competitive case, the reward functions learned by MA-AIRL also significantly outperform MA-GAIL in terms of SCC and PCC statistics. In Figure 1, we further show the changes in the PCC statistics with respect to training time steps for MA-GAIL and MA-AIRL. The reward functions recovered by MA-GAIL initially have a high correlation with the ground truth, yet it decreases dramatically as training continues, whereas the functions learned by MA-AIRL maintain a high correlation throughout the course of training, which is in line with the theoretical analysis that in MA-GAIL, reward signals from the discriminators become less informative towards convergence.


Figure 1. PCC w.r.t. the training epochs in cooperative navigation, with MA-AIRL (blue) and MA-GAIL (orange).

6. Discussion and Future Work

We propose MA-AIRL, the first multi-agent MaxEnt IRL framework that is effective and scalable to Markov games with high-dimensional state-action space and unknown dynamics. We derive our algorithm based on a solution concept termed LSBRE and we employ maximum pseudolikelihood estimation to achieve tractability. Experimental results demonstrate that MA-AIRL is able to imitate expert behaviors in high-dimensional complex environments, as well as learn reward functions that are highly correlated with the ground truth rewards. An exciting avenue for future work is to include reward regularization to mitigate overfitting and leverage prior knowledge of the task structure.


Acknowledgments

This research was supported by Toyota Research Institute, NSF (#1651565, #1522054, #1733686), ONR (N00014-19-1-2145), AFOSR (FA9550-19-1-0024), Amazon AWS, and Qualcomm.

References

Abbeel, P. and Ng, A. Y. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 1, 2004.

Amodei, D. and Clark, J. Faulty reward functions in the wild, 2016.

Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., and Mane, D. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.

Arnold, B. C. and Press, S. J. Compatible conditional distributions. Journal of the American Statistical Association, 84(405):152–156, 1989.

Aumann, R. J. Subjectivity and correlation in randomized strategies. Journal of Mathematical Economics, 1(1):67–96, 1974.

Aumann, R. J. Correlated equilibrium as an expression of Bayesian rationality. Econometrica: Journal of the Econometric Society, pp. 1–18, 1987.

Barrett, S., Rosenfeld, A., Kraus, S., and Stone, P. Making friends on the fly: Cooperating with new teammates. Artificial Intelligence, 242:132–171, 2017.

Besag, J. Statistical analysis of non-lattice data. The Statistician, pp. 179–195, 1975.

Bogert, K. and Doshi, P. Multi-robot inverse reinforcement learning under occlusion with interactions. In Proceedings of the 2014 International Conference on Autonomous Agents and Multi-Agent Systems, pp. 173–180, 2014.

Chen, S.-H. and Ip, E. H. Behaviour of the Gibbs sampler when conditional distributions are potentially incompatible. Journal of Statistical Computation and Simulation, 85(16):3266–3275, 2015.

Chen, S.-H., Ip, E. H., and Wang, Y. J. Gibbs ensembles for nearly compatible and incompatible conditional models. Computational Statistics & Data Analysis, 55(4):1760–1769, 2011.

Dawid, A. P. and Musio, M. Theory and applications of proper scoring rules. Metron, 72(2):169–183, 2014.

Devlin, S. and Kudenko, D. Theoretical considerations of potential-based reward shaping for multi-agent systems. In The 10th International Conference on Autonomous Agents and Multiagent Systems - Volume 1, AAMAS '11, pp. 225–232, Richland, SC, 2011. International Foundation for Autonomous Agents and Multiagent Systems. ISBN 9780982657157.

Finn, C., Christiano, P., Abbeel, P., and Levine, S. A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models. arXiv preprint arXiv:1611.03852, 2016a.

Finn, C., Levine, S., and Abbeel, P. Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning, pp. 49–58, June 2016b.

Fu, J., Luo, K., and Levine, S. Learning robust rewards with adversarial inverse reinforcement learning. arXiv preprint arXiv:1710.11248, 2017.

Gandhi, A. The stochastic response dynamic: A new approach to learning and computing equilibrium in continuous games. Technical report, 2012.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.

Gordon, G. J., Greenwald, A., and Marks, C. No-regret learning in convex games. In Proceedings of the 25th International Conference on Machine Learning, pp. 360–367. ACM, 2008.

Gu, S., Holly, E., Lillicrap, T., and Levine, S. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pp. 3389–3396. IEEE, 2017.

Hadfield-Menell, D., Milli, S., Abbeel, P., Russell, S. J., and Dragan, A. Inverse reward design. In Advances in Neural Information Processing Systems, pp. 6765–6774, 2017.

Hart, S. and Mas-Colell, A. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127–1150, 2000.

Hastings, W. K. Monte Carlo sampling methods using Markov chains and their applications. 1970.

Heckerman, D., Chickering, D. M., Meek, C., Rounthwaite, R., and Kadie, C. Dependency networks for inference, collaborative filtering, and data visualization. Journal of Machine Learning Research, 1(Oct):49–75, 2000.

Ho, J. and Ermon, S. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pp. 4565–4573, 2016.

Ho, J., Gupta, J., and Ermon, S. Model-free imitation learning with policy optimization. In International Conference on Machine Learning, pp. 2760–2769, 2016.

Hu, J., Wellman, M. P., and others. Multiagent reinforcement learning: theoretical framework and an algorithm. In ICML, volume 98, pp. 242–250, 1998.

Kalakrishnan, M., Pastor, P., Righetti, L., and Schaal, S. Learning objective functions for manipulation. In Robotics and Automation (ICRA), 2013 IEEE International Conference on, pp. 1331–1336, 2013.

Le, H. M., Yue, Y., and Carr, P. Coordinated multi-agent imitation learning. arXiv preprint arXiv:1703.03121, March 2017.

Lehmann, E. L. and Casella, G. Theory of Point Estimation. Springer Science & Business Media, 2006.

Leibo, J. Z., Zambaldi, V., Lanctot, M., Marecki, J., and Graepel, T. Multi-agent reinforcement learning in sequential social dilemmas. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, pp. 464–473, 2017.

Levine, S. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909, 2018.

Levine, S., Finn, C., Darrell, T., and Abbeel, P. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.

Li, Y., Song, J., and Ermon, S. InfoGAIL: Interpretable imitation learning from visual demonstrations. arXiv preprint arXiv:1703.08840, 2017.

Lin, X., Beling, P. A., and Cogill, R. Multi-agent inverse reinforcement learning for zero-sum games. arXiv preprint arXiv:1403.6508, 2014.

Lin, X., Adams, S. C., and Beling, P. A. Multi-agent inverse reinforcement learning for general-sum stochastic games. arXiv preprint arXiv:1806.09795, 2018.

Littman, M. L. Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the Eleventh International Conference on Machine Learning, volume 157, pp. 157–163, 1994.

Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, P., and Mordatch, I. Multi-agent actor-critic for mixed cooperative-competitive environments. arXiv preprint arXiv:1706.02275, June 2017.

Matignon, L., Jeanpierre, L., Mouaddib, A.-I., and others. Coordinated multi-robot exploration under communication constraints using decentralized Markov decision processes. In AAAI, 2012.

McKelvey, R. D. and Palfrey, T. R. Quantal response equilibria for normal form games. Games and Economic Behavior, 10(1):6–38, 1995.

McKelvey, R. D. and Palfrey, T. R. Quantal response equilibria for extensive form games. Experimental Economics, 1(1):9–41, 1998.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Natarajan, S., Kunapuli, G., Judah, K., Tadepalli, P., Kersting, K., and Shavlik, J. Multi-agent inverse reinforcement learning. In Machine Learning and Applications (ICMLA), 2010 Ninth International Conference on, pp. 395–400, 2010.

Ng, A. Y., Harada, D., and Russell, S. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pp. 278–287, 1999.

Ng, A. Y., Russell, S. J., et al. Algorithms for inverse reinforcement learning. In ICML, pp. 663–670, 2000.

Nisan, N., Schapira, M., Valiant, G., and Zohar, A. Best-response mechanisms. In ICS, pp. 155–165, 2011.

Peng, P., Yuan, Q., Wen, Y., Yang, Y., Tang, Z., Long, H., and Wang, J. Multiagent bidirectionally-coordinated nets for learning to play StarCraft combat games. arXiv preprint arXiv:1703.10069, 2017.

Pomerleau, D. A. Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 3(1):88–97, 1991. ISSN 0899-7667.

Reddy, T. S., Gopikrishna, V., Zaruba, G., and Huber, M. Inverse reinforcement learning for decentralized non-cooperative multiagent systems. In Systems, Man, and Cybernetics (SMC), 2012 IEEE International Conference on, pp. 1930–1935, 2012.

Russell, S. Learning agents for uncertain environments. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pp. 101–103. ACM, 1998.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.

Song, J., Ren, H., Sadigh, D., and Ermon, S. Multi-agent generative adversarial imitation learning. 2018.

Sosic, A., KhudaBukhsh, W. R., Zoubir, A. M., and Koeppl, H. Inverse reinforcement learning in swarm systems. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, pp. 1413–1421. International Foundation for Autonomous Agents and Multiagent Systems, 2017.

Waugh, K., Ziebart, B. D., and Andrew Bagnell, J. Computational rationalization: The inverse equilibrium problem. arXiv preprint arXiv:1308.3506, August 2013.

Wu, Y., Mansimov, E., Liao, S., Grosse, R., and Ba, J. Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. arXiv preprint arXiv:1708.05144, August 2017.

Yu, L., Zhang, W., Wang, J., and Yu, Y. SeqGAN: Sequence generative adversarial nets with policy gradient. In AAAI, pp. 2852–2858, 2017.

Ziebart, B. D. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. 2010.

Ziebart, B. D., Maas, A. L., Bagnell, J. A., and Dey, A. K. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pp. 1433–1438, 2008.

Ziebart, B. D., Bagnell, J. A., and Dey, A. K. Maximum causal entropy correlated equilibria for Markov games. In The 10th International Conference on Autonomous Agents and Multiagent Systems - Volume 1, AAMAS '11, pp. 207–214, Richland, SC, 2011. International Foundation for Autonomous Agents and Multiagent Systems. ISBN 9780982657157.

Zoph, B. and Le, Q. V. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.


A. Appendix

A.1. Trajectory Distribution Induced by Logistic Stochastic Best Response Equilibrium

Let {π^t_{-i}(a^t_{-i}|s^t)}_{t=1}^T denote the other agents' marginal LSBRE policies, and {π^t_i(a^t_i|a^t_{-i}, s^t)}_{t=1}^T denote agent i's conditional policy. With the chain rule, the induced trajectory distribution is given by:

$$\hat{p}(\tau) = \left[\eta(s^1) \cdot \prod_{t=1}^T P(s^{t+1}|s^t, \boldsymbol{a}^t) \cdot \pi_{-i}^t(\boldsymbol{a}_{-i}^t|s^t)\right] \cdot \prod_{t=1}^T \hat{\pi}_i^t(a_i^t|\boldsymbol{a}_{-i}^t, s^t) \tag{14}$$

Suppose the desired distribution is given by:

$$p(\tau) \propto \left[\eta(s^1) \cdot \prod_{t=1}^T P(s^{t+1}|s^t, \boldsymbol{a}^t) \cdot \pi_{-i}^t(\boldsymbol{a}_{-i}^t|s^t)\right] \cdot \exp\left(\sum_{t=1}^T r_i(s^t, a_i^t, \boldsymbol{a}_{-i}^t)\right) \tag{15}$$

Now we will show that the optimal solution to the following optimization problem corresponds to the LSBRE conditional policies:

$$\min_{\hat{\pi}_i^{1:T}} \; D_{\mathrm{KL}}(\hat{p}(\tau) \,\|\, p(\tau)) \tag{16}$$

The optimization problem in Equation (16) is equivalent to (the partition function of the desired distribution is a constant with respect to the optimized policies):

$$\begin{aligned}
\max_{\hat{\pi}_i^{1:T}} \; & \mathbb{E}_{\tau \sim \hat{p}(\tau)}\Bigg[\log \eta(s^1) + \sum_{t=1}^T \left(\log P(s^{t+1}|s^t, \boldsymbol{a}^t) + \log \pi_{-i}^t(\boldsymbol{a}_{-i}^t|s^t) + r_i(s^t, \boldsymbol{a}^t)\right) \\
& \quad - \log \eta(s^1) - \sum_{t=1}^T \left(\log P(s^{t+1}|s^t, \boldsymbol{a}^t) + \log \pi_{-i}^t(\boldsymbol{a}_{-i}^t|s^t) + \log \hat{\pi}_i^t(a_i^t|\boldsymbol{a}_{-i}^t, s^t)\right)\Bigg] \\
= \; & \mathbb{E}_{\tau \sim \hat{p}(\tau)}\left[\sum_{t=1}^T r_i(s^t, \boldsymbol{a}^t) - \log \hat{\pi}_i^t(a_i^t|\boldsymbol{a}_{-i}^t, s^t)\right] = \sum_{t=1}^T \mathbb{E}_{(s^t, \boldsymbol{a}^t) \sim \hat{p}(s^t, \boldsymbol{a}^t)}\left[r_i(s^t, \boldsymbol{a}^t) - \log \hat{\pi}_i^t(a_i^t|\boldsymbol{a}_{-i}^t, s^t)\right]
\end{aligned} \tag{17}$$

To maximize this objective, we can use a dynamic programming procedure. Let us first consider the base case of optimizing π^T_i(a^T_i|a^T_{-i}, s^T):

$$\mathbb{E}_{(s^T, \boldsymbol{a}^T) \sim \hat{p}(s^T, \boldsymbol{a}^T)}\left[r_i(s^T, \boldsymbol{a}^T) - \log \hat{\pi}_i^T(a_i^T|\boldsymbol{a}_{-i}^T)\right] = \mathbb{E}_{s^T \sim \hat{p}(s^T),\, \boldsymbol{a}_{-i}^T \sim \pi_{-i}^T(\cdot|s^T)}\left[-D_{\mathrm{KL}}\left(\hat{\pi}_i^T(a_i^T|\boldsymbol{a}_{-i}^T, s^T) \,\middle\|\, \frac{\exp(r_i(s^T, a_i^T, \boldsymbol{a}_{-i}^T))}{\exp(V_i(s^T, \boldsymbol{a}_{-i}^T))}\right) + V_i(s^T, \boldsymbol{a}_{-i}^T)\right] \tag{18}$$

where exp(V_i(s^T, a^T_{-i})) is the partition function and V_i(s^T, a^T_{-i}) = log Σ_{a'_i} exp(r_i(s^T, a'_i, a^T_{-i})). The optimal policy is given by:

$$\pi_i^T(a_i^T|\boldsymbol{a}_{-i}^T, s^T) = \exp\left(r_i(s^T, a_i^T, \boldsymbol{a}_{-i}^T) - V_i(s^T, \boldsymbol{a}_{-i}^T)\right) \tag{19}$$

With the optimal policy in Equation (19), Equation (18) is equivalent to (with the KL divergence being zero):

$$\mathbb{E}_{(s^T, \boldsymbol{a}^T) \sim \hat{p}(s^T, \boldsymbol{a}^T)}\left[r_i(s^T, \boldsymbol{a}^T) - \log \pi_i^T(a_i^T|\boldsymbol{a}_{-i}^T)\right] = \mathbb{E}_{s^T \sim \hat{p}(s^T),\, \boldsymbol{a}_{-i}^T \sim \pi_{-i}^T(\cdot|s^T)}\left[V_i(s^T, \boldsymbol{a}_{-i}^T)\right] \tag{20}$$

Then recursively, for a given time step t, π^t_i(a^t_i|a^t_{-i}, s^t) must maximize:

$$\mathbb{E}_{(s^t, \boldsymbol{a}^t) \sim \hat{p}(s^t, \boldsymbol{a}^t)}\left[r_i(s^t, \boldsymbol{a}^t) - \log \hat{\pi}_i^t(a_i^t|\boldsymbol{a}_{-i}^t) + \mathbb{E}_{s^{t+1} \sim P(\cdot|s^t, \boldsymbol{a}^t),\, \boldsymbol{a}_{-i}^{t+1} \sim \pi_{-i}^{t+1}(\cdot|s^{t+1})}\left[V_i^{\pi^{t+2:T}}(s^{t+1}, \boldsymbol{a}_{-i}^{t+1})\right]\right] \tag{21}$$

$$= \mathbb{E}_{s^t \sim \hat{p}(s^t),\, \boldsymbol{a}_{-i}^t \sim \pi_{-i}^t(\cdot|s^t)}\left[-D_{\mathrm{KL}}\left(\hat{\pi}_i^t(a_i^t|\boldsymbol{a}_{-i}^t, s^t) \,\middle\|\, \frac{\exp(Q_i^{\pi^{t+1:T}}(s^t, a_i^t, \boldsymbol{a}_{-i}^t))}{\exp(V_i^{\pi^{t+1:T}}(s^t, \boldsymbol{a}_{-i}^t))}\right) + V_i^{\pi^{t+1:T}}(s^t, \boldsymbol{a}_{-i}^t)\right] \tag{22}$$

where we define:

$$Q_i^{\pi^{t+1:T}}(s^t, \boldsymbol{a}^t) = r_i(s^t, \boldsymbol{a}^t) + \mathbb{E}_{s^{t+1} \sim P(\cdot|s^t, \boldsymbol{a}^t)}\left[\mathcal{H}\left(\pi_i^{t+1}(\cdot|s^{t+1})\right) + \mathbb{E}_{\boldsymbol{a}_{-i}^{t+1} \sim \pi_{-i}^{t+1}(\cdot|s^{t+1})}\left[V_i(s^{t+1}, \boldsymbol{a}_{-i}^{t+1})\right]\right] \tag{23}$$

$$V_i^{\pi^{t+1:T}}(s^t, \boldsymbol{a}_{-i}^t) = \log \sum_{a_i'} \exp\left(Q_i^{\pi^{t+1:T}}(s^t, a_i', \boldsymbol{a}_{-i}^t)\right) \tag{24}$$

The optimal solution to Equation (22) is given by:

$$\pi_i^t(a_i^t|\boldsymbol{a}_{-i}^t, s^t) = \exp\left(Q_i^{\pi^{t+1:T}}(s^t, \boldsymbol{a}^t) - V_i^{\pi^{t+1:T}}(s^t, \boldsymbol{a}_{-i}^t)\right) \tag{25}$$

which is exactly the set of conditional distributions used to produce LSBRE (Definition 2).
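The base case of this recursion is just a log-sum-exp normalization, which can be checked numerically for a tabular terminal stage, as in the following sketch (the reward matrix and its dimensions are illustrative assumptions, not taken from the paper):

```python
import numpy as np
from scipy.special import logsumexp

# Assumed toy terminal stage: r[a_i, a_-i] = r_i(s^T, a_i, a_-i) for one fixed state s^T.
r = np.array([[1.0, 0.0, 0.5],
              [0.2, 2.0, 0.1]])        # 2 actions for agent i, 3 joint actions for the others

# V_i(s^T, a_-i) = log sum_{a_i'} exp(r_i(s^T, a_i', a_-i))   (Eq. 24 at t = T)
V = logsumexp(r, axis=0)

# pi_i^T(a_i | a_-i, s^T) = exp(r_i - V_i)                     (Eq. 19 / 25 at t = T)
pi = np.exp(r - V)
print(pi.sum(axis=0))                  # each column sums to 1, i.e. a valid conditional policy
```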

A.2. Maximum Pseudolikelihood Estimation for LSBRE

Theorem 2 follows directly from the asymptotic consistency property of maximum pseudolikelihood estimation (Lehmann & Casella, 2006; Dawid & Musio, 2014). For simplicity, we show the proof for normal-form games; similar to Appendix A.1, the extension to Markov games can be proved by induction.

Consider a normal-form game with N players and reward functions {r_i(a; ω_i)}_{i=1}^N. Suppose the expert demonstrations D = {(a_1, . . . , a_N)^m}_{m=1}^M are generated by π(a; ω*), where ω* denotes the true value of the parameters. The pseudolikelihood objective we want to maximize is given by:

$$\ell_{\mathrm{PL}}(\omega) = \frac{1}{M} \sum_{m=1}^M \sum_{i=1}^N \log \pi_i(a_i^m | \boldsymbol{a}_{-i}^m; \omega_i) = \frac{1}{M} \sum_{m=1}^M \sum_{i=1}^N \log \frac{\exp(r_i(a_i^m, \boldsymbol{a}_{-i}^m; \omega_i))}{\sum_{a_i'} \exp(r_i(a_i', \boldsymbol{a}_{-i}^m; \omega_i))} \tag{26}$$

$$= \frac{1}{M} \sum_{m=1}^M \sum_{i=1}^N r_i(a_i^m, \boldsymbol{a}_{-i}^m; \omega_i) - \frac{1}{M} \sum_{m=1}^M \sum_{i=1}^N \log Z(\boldsymbol{a}_{-i}^m; \omega_i) \tag{27}$$

$$= \sum_{i=1}^N \sum_{\boldsymbol{a}} p_{\mathcal{D}}(\boldsymbol{a}) \, r_i(a_i, \boldsymbol{a}_{-i}; \omega_i) - \sum_{i=1}^N \sum_{\boldsymbol{a}_{-i}} p_{\mathcal{D}}(\boldsymbol{a}_{-i}) \log Z(\boldsymbol{a}_{-i}; \omega_i) \tag{28}$$

where p_D is the empirical data distribution and Z(a_{-i}; ω_i) is the partition function.

Taking derivatives of ℓ_PL(ω):

$$\frac{\partial}{\partial \omega} \ell_{\mathrm{PL}}(\omega) = \sum_{i=1}^N \sum_{\boldsymbol{a}} p_{\mathcal{D}}(\boldsymbol{a}) \frac{\partial}{\partial \omega} r_i(a_i, \boldsymbol{a}_{-i}; \omega_i) - \sum_{i=1}^N \sum_{\boldsymbol{a}_{-i}} p_{\mathcal{D}}(\boldsymbol{a}_{-i}) \frac{1}{Z(\boldsymbol{a}_{-i}; \omega_i)} \frac{\partial}{\partial \omega} Z(\boldsymbol{a}_{-i}; \omega_i) \tag{29}$$

$$= \sum_{i=1}^N \sum_{\boldsymbol{a}} p_{\mathcal{D}}(\boldsymbol{a}) \frac{\partial}{\partial \omega} r_i(a_i, \boldsymbol{a}_{-i}; \omega_i) - \sum_{i=1}^N \sum_{\boldsymbol{a}_{-i}} p_{\mathcal{D}}(\boldsymbol{a}_{-i}) \sum_{a_i} \frac{\exp(r_i(a_i, \boldsymbol{a}_{-i}; \omega_i))}{Z(\boldsymbol{a}_{-i}; \omega_i)} \frac{\partial}{\partial \omega} r_i(a_i, \boldsymbol{a}_{-i}; \omega_i) \tag{30}$$

$$= \sum_{i=1}^N \sum_{\boldsymbol{a}} p_{\mathcal{D}}(\boldsymbol{a}) \frac{\partial}{\partial \omega} r_i(a_i, \boldsymbol{a}_{-i}; \omega_i) - \sum_{i=1}^N \sum_{\boldsymbol{a}_{-i}} p_{\mathcal{D}}(\boldsymbol{a}_{-i}) \sum_{a_i} \pi_i(a_i|\boldsymbol{a}_{-i}; \omega_i) \frac{\partial}{\partial \omega} r_i(a_i, \boldsymbol{a}_{-i}; \omega_i) \tag{31}$$

When the sample size M → ∞, Equation (31) is equivalent to:

$$\frac{\partial}{\partial \omega} \ell_{\mathrm{PL}}(\omega) = \sum_{i=1}^N \sum_{\boldsymbol{a}} p(\boldsymbol{a}; \omega^*) \frac{\partial}{\partial \omega} r_i(a_i, \boldsymbol{a}_{-i}; \omega_i) - \sum_{i=1}^N \sum_{\boldsymbol{a}_{-i}} p(\boldsymbol{a}_{-i}; \omega^*) \sum_{a_i} \pi_i(a_i|\boldsymbol{a}_{-i}; \omega_i) \frac{\partial}{\partial \omega} r_i(a_i, \boldsymbol{a}_{-i}; \omega_i) \tag{32}$$

$$= \sum_{i=1}^N \sum_{\boldsymbol{a}_{-i}} p(\boldsymbol{a}_{-i}; \omega^*) \sum_{a_i} \left(p(a_i|\boldsymbol{a}_{-i}; \omega^*) - \pi_i(a_i|\boldsymbol{a}_{-i}; \omega_i)\right) \frac{\partial}{\partial \omega} r_i(a_i, \boldsymbol{a}_{-i}; \omega_i) \tag{33}$$

When ω = ω∗, the gradients in Equation (33) will be zero.