
Backplay: ‘Man muss immer umkehren’

Cinjon Resnick∗
NYU
[email protected]

Roberta Raileanu∗
NYU
[email protected]

Sanyam Kapoor
NYU
[email protected]

Alex Peysakhovich
FAIR
[email protected]

Kyunghyun Cho
NYU, FAIR
[email protected]

Joan Bruna
NYU, FAIR
[email protected]

Abstract

A long-standing problem in model-free reinforcement learning (RL) is that it requires a large number of trials to learn a good policy, especially in environments with sparse rewards. We explore a method to increase the sample efficiency of RL when we have access to demonstrations. Our approach, which we call Backplay, uses a single demonstration to construct a curriculum for a given task. Rather than starting each training episode in the environment's fixed initial state, we start the agent near the end of the demonstration and move the starting point backwards during the course of training until we reach the initial state. We perform experiments in a competitive four player game (Pommerman) and a path-finding maze game. We find that this weak form of guidance provides significant gains in sample complexity with a stark advantage in sparse reward environments. In some cases, standard RL did not yield any improvement while Backplay reached success rates greater than 50% and generalized to unseen initial conditions in the same amount of training time. Additionally, we see that agents trained via Backplay can learn policies superior to those of the original demonstration.

1 Introduction

An important goal of AI research is to construct agents that can enter new environments and learn how to achieve desired goals (Levine et al., 2016). A key paradigm for this task is deep reinforcement learning (deep RL), which has succeeded in many environments, including complex zero-sum games such as Go, Poker, and Chess (Silver et al., 2016; Moravcík et al., 2017; Brown & Sandholm, 2017a; Silver et al., 2017). However, training an RL agent can be extremely time consuming, particularly in environments with sparse rewards. In these settings, the agent typically requires a large number of episodes to stumble upon positive reward and learn even a moderately effective policy that can then be refined. Reward sparsity is often resolved via humans 'densifying' the underlying reward function. Such reward shaping can be effective, but can also change the set of optimal policies towards having unintended side effects (Ng et al., 1999; Clark & Amodei, 2016).

We consider an alternative technique for accelerating RL in sparse reward settings without reward shaping. The essence of our proposed technique is to create a curriculum for the agent via reversing a single trajectory (i.e. state sequence) of reasonably good, but not necessarily optimal, behavior. That is, we start our agent at the end of a demonstration and let it learn a policy in this easier setup. We then move the starting point backward until the agent is training solely on the initial state of the corresponding task. We call this technique Backplay and include a visual reference in Figure 1.

∗These two authors contributed equally

Preprint. Work in progress.

arXiv:1807.06919v1 [cs.LG] 18 Jul 2018


Figure 1. Backplay graphic description. We first collect a demonstration, from which we build a curriculum over the states. We then sample a state according to that curriculum and initialize our agent accordingly.

We experiment in two scenarios: a maze environment where the agent must find a path to a goal and Pommerman (MultiAgentLearning, 2018), a complex four-player competitive game. We empirically show that Backplay vastly decreases the number of samples required to learn a good policy. In addition, we see that an agent trained with Backplay can outperform its corresponding demonstrator and even learn an optimal policy when using a sub-optimal demonstration. Notably, the Backplay agent is able to learn just as well in environments with a single sparse reward as in those with a dense shaped reward. This extends to a challenging setup where it successfully generalizes to unseen Pommerman scenarios while the baselines fail to learn at all.

2 Related Work

The most related work to ours is an independent and concurrent blog post on a method similar to Backplay to train policies on the challenging Atari game Montezuma's Revenge (Salimans & Chen, 2018). While both propose training an RL agent by starting each episode from a demonstration state based on a reverse curriculum, there are a number of key differences between the two reports.

First, our work considers the problem of learning effective policies in environments with various initial configurations, each corresponding to a different demonstration. Second, we analyze whether the technique can find optimal policies when trained using sub-optimal demonstrations and the extent to which those policies can generalize to unseen initial conditions. Third, we do not use an adaptive schedule based on the agent's success rate to advance its starting state because we did not find this to improve the overall performance or sample complexity. And finally, we include both empirical and theoretical analysis on when the efficacy of this approach is higher than that of standard RL.

Behavioral cloning aims at explicitly encouraging a policy to mimic an expert policy (Ross et al., 2011; Daumé et al., 2009; Zhang & Cho, 2016; Laskey et al., 2016). Many algorithms in this literature require access to the expert's policy or use supervision from the expert's actions and/or states (Hosu & Rebedea, 2016; Nair et al., 2017; Hester et al., 2017; Ho & Ermon, 2016; Aytar et al., 2018; Lerer & Peysakhovich, 2018; Peng et al., 2018). Backplay only requires access to a single trajectory of states visited by this policy and uses these states to initialize a type of curriculum (Bengio et al., 2009) rather than as a supervised objective.

Recent work by McAleer et al. (2018) frames the problem of solving the Rubik's cube as a sparse reward model-based RL task with inputs generated by starting at the goal state and taking random actions. Similarly, Hosu & Rebedea (2016) uses intermediate states of an expert demonstration as starting states from which a policy is executed. However, neither of these works explore imposing a curriculum over the demonstration.

The idea of using a reverse curriculum for sparse reward reinforcement learning problems has recently been proposed. Florensa et al. (2017) uses a sequence of short random walks to gradually initialize an RL agent from a set of start states increasingly far from a known goal location. Their method however only applies to tasks with a unique and fixed goal state, while our technique could be applied in settings with multiple non-stationary targets. In a similar fashion, Ivanovic et al. (2018) uses a known (approximate) dynamics model to create a backwards curriculum for continuous control tasks.


Their approach requires a physical prior, which is not always available and often not applicable in multi-agent scenarios. In contrast, Backplay automatically creates a curriculum and can be used in any resettable environment as long as demonstrations are available. An interesting direction for future research is using some of these previously proposed ideas to make Backplay more efficient.

3 Backplay

Consider the standard formalization of a single agent Markov Decision Process (MDP) defined by a set of states S, a set of actions A, and a transition function T : S × A → S which gives the probability distribution of the next state given a current state and action. If P(A) denotes the space of probability distributions over actions, the agent chooses actions by sampling from a stochastic policy π : S → P(A), and receives reward r : S × A → R at every time step. The agent's goal is to construct a policy which maximizes its discounted expected return

R_t = E[ ∑_{k=0}^{∞} γ^k r_{t+k+1} ],

where r_t is the reward at time t and γ ∈ [0, 1] is the discount factor, and the expectation is taken with respect to both the policy and the environment.

The final component of an MDP is the distribution of initial starting states s_0. The key idea in Backplay is that instead of starting the MDP from this s_0, we have access to a demonstration which reaches a sequence of states {s^d_0, s^d_1, ..., s^d_T}. For each training episode, we uniformly sample starting states from the sub-sequence {s^d_{T−k}, s^d_{T−k+1}, ..., s^d_{T−j}} for some chosen window [j, k], j ≤ k. As training continues, we 'advance' the window by increasing the values of j and k until we are training on the initial state in every episode. In this manner, our hyperparameters for Backplay are only the (j, k) pairs (windows) and the training epochs at which we advance them.
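Concretely, the start-state selection can be sketched as follows. This is a minimal illustration in Python rather than the authors' implementation; the reset-to-state hook is assumed, and we sample the window inclusively here even though the experiments in Section 4 describe half-open windows.

import random

def backplay_start_state(demo_states, window):
    # demo_states is the demonstration state sequence s^d_0, ..., s^d_T;
    # window is the current (j, k) pair, sampled over inclusively in this sketch.
    j, k = window
    n = random.randint(j, k)                 # steps from the end of the demonstration
    # If the window extends past the start of the demonstration, fall back to s^d_0
    # (this matches the rule used in the experiments when a game is shorter than N).
    idx = max(len(demo_states) - 1 - n, 0)
    return demo_states[idx]

# Example with a placeholder demonstration of 36 'states':
demo = list(range(36))
print(backplay_start_state(demo, window=(16, 31)))   # a state 16 to 31 steps from the end

An episode then proceeds by resetting the (resettable) environment to the sampled state and rolling out the learning policy as usual.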

In the next subsection, we explore more formally the conditions under which Backplay can improve the sample efficiency of RL training. First, we give intuition about when Backplay can increase training speed and when it can lead to a better policy than that of the demonstrator.

In Figure 2, we show three toy grid worlds. In each, the agent begins at s_0, movement is costly, and the agent receives a reward of 1 if and only if it navigates to the goal s∗. Thus, the optimal strategy is to reach the goal in as few steps as possible.

The left grid world shows a sparse reward environment where trial-and-error model-free RL will converge slowly. Backplay, using the sub-optimal expert policy (the line), will decrease the requisite training time because it will position the agent close to high value states. In addition, the agent will likely surpass the expert policy because, unlike in behavioral cloning approaches, Backplay does not encourage the agent to imitate expert actions. Rather, the curriculum forces the agent to first explore states with large associated value and consequently, estimating the value function suffers less from the curse of dimensionality.

The middle grid demonstrates a maze with bottlenecks. Backplay will vastly decrease the exploration time because the agent will be less prone to exploring the errant right side chamber that it can get lost in if it traverses to the right instead of going up to the goal. These sorts of bottlenecks exist in many real life scenarios, including Pommerman, where the bomb laying action itself is an example.

On the right side grid, Backplay will similarly not have difficulty surpassing the sub-optimal expert because it will learn to skip the large ingress into the middle of the maze. However, it will have trouble doing better than a model-free policy not using Backplay because that policy will spend all of its training initialized from s_0 and consequently will be more likely to stumble upon the optimal solution of going up and then right, whereas Backplay will spend the dominant amount of its early training starting from the sub-optimal states included in the demonstration.

3.1 Preliminary Analysis

We now derive some basic results for Backplay training with a Q-learning agent in a simple navigation environment. Because we are specifically interested in the case of sparse rewards, in our environment agents receive a positive reward of 1 if they reach a goal absorbing state and a reward of 0 otherwise.

Consider a generic map as a connected, undirected graph G = (V, E). An agent is placed at a fixed node v_0 ∈ V and can move to neighboring nodes at a negative reward of −1 for each move. Its goal is to reach a target node v∗ ∈ V, terminating the episode. This corresponds to an MDP M = (S, A, P, R) with state space S ≅ V, action space A ≅ E, deterministic transition kernel corresponding to

P(s_{t+1} = j | s_t = i, a_t = (l, k)) = δ(j = k)δ(l = i) + δ(j = i)δ(l ≠ i),

reward R(s_t, a_t) = −1 for s_{t+1} ≠ v∗, and an absorbing state at s = v∗.

Figure 2. Three environments that capture the intuition behind Backplay and its limitations ((a) Backplay: Good, (b) Backplay: Good, (c) Backplay: Bad). Backplay will likely be advantageous on the first and second mazes and surpass or match the expert's demonstration. However, it will perform worse on the third maze.

Assume now that π is a fixed policy on M, such that the underlying Markov chain

K_π(s′, s) = ∑_{a∈A} P(s_1 = s′ | s_0 = s, a_0 = a) π(a | s_0 = s)

is irreducible and aperiodic. K has a single absorbing state at s = s∗. Our goal is to estimate the value function associated with this policy, equivalent to the expected first-passage time of the Markov chain K from s_0 = s to s∗:

V_π(s) = E τ(s, s∗) = E min{ j ≥ 0 s.t. s_j = s∗, s_0 = s, s_{i+1} ∼ K(·, s_i) }.

Instead of analyzing this Markov chain, we make a simplifying assumption that the performance of the navigation task is captured by projecting this chain into the 1D process z_t = dist_G(s_t, s∗) ∈ {0, 1, ..., M}, with M = diam(G). That is, we consider a latent variable at each state indicating its distance to the target. The process z_t is in general not Markovian, since its transition probabilities are not only a function of z_t. Its Markov approximation z̄_t is defined as the Markov chain given by the expected transition probabilities

Pr(z̄_{t+1} = l + 1 | z̄_t = l) = α_l := Pr_µ(z_{t+1} = l + 1 | z_t = l),   (1)
Pr(z̄_{t+1} = l | z̄_t = l) = β_l := Pr_µ(z_{t+1} = l | z_t = l),   l = 0, ..., M,   (2)

and thus Pr(z̄_{t+1} = l − 1 | z̄_t = l) = 1 − α_l − β_l = Pr_µ(z_{t+1} = l − 1 | z_t = l), under the stationary distribution µ of K. In other words, we create an equivalence class given by the level sets of the optimal value function V_opt(s), the shortest-path distance in G.

The analysis of Backplay in the 1D chain z̄_t is now tractable. For a fixed Backplay step size m > 0, with m ≪ M, we initialize the chain at z̄_0 = M − m. Since Pr(z̄_m = M) = ∏_{j=M−m}^{M} α_j = γ_{M−m,M}, after O(γ_{M−m,M}^{−1}) trials of length ≤ M we will reach the absorbing state and finally have a signal-carrying update for the Q-function at z̄_{M−1}. We can consequently merge z̄_{M−1} into the absorbing state and reduce the length of the chain by one. By repeating the argument m times, we thus obtain that after O(∑_{j=0}^{m} γ_{M−m,M−j}^{−1}) = O(m γ_{M−m,M}^{−1}) trials, the Q-function is updated at z̄_0 = M − m. By repeating this argument at Backplay steps km, k = 1, ..., M/m, we reach a sample complexity of

T_m = ∑_{k=0}^{M/m−1} O( ∑_{j=0}^{m} γ_{km,(k+1)m−j}^{−1} ).

In the case where α_l = α for all l, we obtain γ_{km,(k+1)m−j}^{−1} = γ_{0,m−j}^{−1} = α^{−m+j} and therefore T_m = O( M (1 − α^{m+1}) / (m (1 − α)) · α^{−m} ), where the important term is the rate (M/m) α^{−m}.

On the other hand, from Hong & Zhou (2017), we verify that the first-passage time τ(0, M) in a skip-free finite Markov chain of M states with a single absorbing state is a random variable whose moment-generating function ϕ(s) = E[s^{τ(0,M)}] is given by

ϕ(s) = ∏_{j=1}^{M} ((1 − λ_j) s) / (1 − λ_j s),   (3)

where λ_1, ..., λ_M are the non-unit eigenvalues of the transition probability matrix. It follows that E τ(0, M) = ϕ′(1) = ∑_{j=1}^{M} 1 / (1 − λ_j) ≈ (1 − λ_1)^{−1}, which corresponds to the reciprocal spectral gap, and therefore the model without Backplay will on average take T_M = Θ(α^{−M}) trials to reach the absorbing state and thus receive information. We thus obtain sample complexity gains that are exponential in the diameter of the graph (O(M α^{−m}) versus O(α^{−M})) in this simple birth-death 1D scenario.
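The gap between these two rates is easy to see numerically. The following Python sketch (ours, not from the paper) simulates a simple instance of the 1D chain above with constant forward probability α, counting how many length-≤M trials pass before the first signal-carrying success; the specific constants are illustrative assumptions.

import random

def trials_until_first_success(M, alpha, start, max_trials=200_000):
    """Count trials until the chain first reaches the absorbing state at M."""
    for trial in range(1, max_trials + 1):
        z = start
        for _ in range(M):                       # each trial has length <= M
            z = z + 1 if random.random() < alpha else max(z - 1, 0)
            if z == M:                           # absorbing state: signal received
                return trial
    return max_trials                            # no signal within the budget

random.seed(0)
M, alpha, m = 12, 0.5, 3
# Backplay-style start m steps from the goal: on the order of alpha**(-m) trials.
print(trials_until_first_success(M, alpha, start=M - m))
# Standard start at distance M: on the order of alpha**(-M) trials, exponentially more.
print(trials_until_first_success(M, alpha, start=0))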

Of course, this analysis is preliminary on many levels. For example, the projection into the 1-dimensional skip-free process makes two dangerous assumptions: (i) the level sets of our estimated value functions correspond to the level sets of the true value function, and (ii) the projected process is well described by its Markovian approximation. The first assumption is related to the ability of our function approximation to generalize from visited states s to unvisited states s′ such that dist_G(s′, s∗) = dist_G(s, s∗). The second assumption has an impact on the bounds derived above, and could be worked around by replacing conditional probabilities with infima over the level sets.

As in the simple examples above, this analysis suggests that Backplay is not a 'universal' strategy to improve sample complexity. Indeed, even in the navigation setting, if the task now randomizes the initial state s_0, a single demonstration trajectory does not generally improve the coverage of the state-space outside an exponentially small region around said trajectory. As a simple example, imagine a binary tree graph and a navigation task that starts at a random leaf and needs to reach the root. A single expert trajectory will be disjoint from half of the state space (because the root is absorbing), thus providing no sample complexity gains on average.

The preceding analysis suggests that a full characterization of Backplay is a fruitful direction for reinforcement learning theory, albeit beyond the scope of this paper.

4 Experiments

We now move to evaluating Backplay empirically in two environments: a grid world maze and a four-player free-for-all game. The questions we study across both environments are the following:

• Is Backplay more efficient than training an RL agent from scratch?

• How does the quality of the given demonstration affect the effectiveness of Backplay?

• Can Backplay agents surpass the demonstrator when it is non-optimal?

• Can Backplay agents generalize?

4.1 Maze

For the Maze setup, we generated grid worlds of size 24 × 24 with 120 randomly placed walls, a random start position, and a random goal position. We then used the A* search algorithm to generate trajectories. These included both Optimal demonstrations (true shortest path) and N-Optimal demonstrations, paths that were N steps longer than the shortest path. For N-Optimal demonstrations, we used a noisy A* where at each step we follow A* with probability p and otherwise choose a random action. We considered N ∈ {5, 10}. In all scenarios, we filtered out any path that was less than 35 in length and stopped when we found a hundred valid training games. Note that we held the demonstration length invariant rather than the optimal length.

The observation space in maze games consists of four 24 × 24 binary maps. They contain ones at the positions of, respectively, the agent, the goal, passages, and walls. The action space has five discrete options: Pass, Up, Down, Left, or Right.
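As a concrete illustration (our own sketch, not the authors' code), the observation for a single maze can be assembled as a 4 × 24 × 24 array as follows; the function and argument names are ours.

import numpy as np

def encode_maze(agent_pos, goal_pos, walls, size=24):
    """agent_pos/goal_pos are (row, col) tuples; walls is a boolean (size, size) array."""
    obs = np.zeros((4, size, size), dtype=np.float32)
    obs[0][agent_pos] = 1.0                      # agent location
    obs[1][goal_pos] = 1.0                       # goal location
    obs[2] = (~walls).astype(np.float32)         # passages
    obs[3] = walls.astype(np.float32)            # walls
    return obs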

The game ends when the agent has reached its goal or after a maximum of 200 steps. The agent receives reward +1 for reaching the goal and a −0.03 penalty for each step taken in the environment. Thus optimal policies are those which reach the goal in the fastest possible time.

We use a standard deep RL setup for our agents. The agent's policy and value functions are parameterized by a convolutional neural network with 2 layers each of 32 output channels, followed by two linear layers with 128 dimensions. Each of the layers is followed by a ReLU activation. This body then feeds two heads, a scalar value function and a softmax policy function over the five actions. All of the CNN kernels are 3x3 with stride and padding of one.
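A PyTorch sketch of this architecture is given below. The layer sizes follow the description above; everything else (class and variable names, default initialization) is our assumption rather than the authors' code.

import torch
import torch.nn as nn

class MazeActorCritic(nn.Module):
    def __init__(self, in_channels=4, grid=24, n_actions=5):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, stride=1, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=1, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * grid * grid, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
        )
        self.policy_head = nn.Linear(128, n_actions)   # softmax over Pass/Up/Down/Left/Right
        self.value_head = nn.Linear(128, 1)            # scalar state value

    def forward(self, obs):                            # obs: (batch, 4, 24, 24)
        h = self.body(obs)
        return torch.softmax(self.policy_head(h), dim=-1), self.value_head(h)

pi, v = MazeActorCritic()(torch.zeros(1, 4, 24, 24))   # pi: (1, 5), v: (1, 1)

The Pommerman network described in Section 4.2 differs only in the body (two extra convolutional layers at the front, 256 output channels, and 1024- and 512-dimensional linear layers).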


Algorithm   Expert       % Optimal   % 0-5 Optimal   Avg Suboptimality   Std Suboptimality
Standard    None         0           0               N/A                 N/A
Uniform     Optimal      .27         .91             8.26                32.92
Uniform     5-Optimal    .51         .98             2.04                17.39
Uniform     10-Optimal   .49         .98             2.04                16.79
Backplay    Optimal      .31         1               0.64                4.96
Backplay    5-Optimal    .37         .94             7.94                33.35
Backplay    10-Optimal   .54         .99             0.37                3.49

Table 1: Results after 2000 epochs on 100 mazes. Standard has failed to learn, while both Backplay and Uniform succeed on almost all mazes and, importantly, outperform the experts' demonstrations. These results were consistent across each of three different seeds.

We train our agent using Proximal Policy Optimization (PPO, Schulman et al. (2017)) with γ = 0.99, learning rate 1 × 10−3, batch size 102400, 60 parallel workers, clipping parameter 0.2, generalized advantage estimation with τ = 0.95, entropy coefficient 0.01, value loss coefficient 0.5, mini-batch size 5120, horizon 1707, and 4 PPO updates at each iteration.
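For reference, the same settings can be collected into a single configuration object; this is merely a convenience sketch with our own key names, not tied to any particular PPO library.

MAZE_PPO_CONFIG = {
    "gamma": 0.99,                 # discount factor
    "learning_rate": 1e-3,
    "batch_size": 102_400,
    "num_workers": 60,             # parallel workers
    "clip_param": 0.2,             # PPO clipping parameter
    "gae_tau": 0.95,               # generalized advantage estimation
    "entropy_coef": 0.01,
    "value_loss_coef": 0.5,
    "minibatch_size": 5_120,
    "horizon": 1_707,
    "ppo_updates_per_iteration": 4,
}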

We compare several training regimes. The first is Backplay, which uses the Backplay algorithm corresponding to the sequence of windows [[0, 4), [4, 8), [8, 16), [16, 32), [32, 64), [64, 64]] and epochs [0, 350, 700, 1050, 1400, 1750]. When we reach a training epoch in the sequence of epochs, we advance to the corresponding value in the sequence of windows. For example, suppose the current window is [16, 32). Then we select a game at random, an N ∈ [16, 32), and start the agent in that game N steps from the end. Whenever a pair is chosen such that the game's length is smaller than N, we use the initial state. The second, Standard, is a baseline of standard RL training from s_0. The last, Uniform, is an ablation that considers how important the curriculum aspect of Backplay is by sampling states from the entire demonstration randomly rather than according to the Backplay distribution.

Table 1 shows that Standard has immense trouble learning in this sparse reward environment while Backplay and Uniform both find an optimal path approximately 30-50% of the time and a path within five of the optimal path almost always. Thus, in this environment, demonstrations of even sub-optimal experts are extremely useful, while the curriculum created by Backplay is not necessary. It does, however, aid convergence speed (learning curves in Appendix A.4). We will see in the next experiment that the curriculum becomes important as the environment increases in complexity.

Finally, we looked at whether Backplay generalizes in two ways. First, we had our agents explore ten new test mazes. Second, we tried generating new mazes for each episode of training. None of our training regimes were successfully able to complete either scenario, suggesting that in this environment Backplay does not aid in generalization.

4.2 Pommerman

The Pommerman environment is based on the classic console game Bomberman and will be a competition at the upcoming NIPS. It is played on an 11x11 grid where each of the four agents has six actions on every turn: move in a cardinal direction, pass, or lay a bomb. The agents begin fenced in their own area by two different types of walls: rigid and wooden (see Figure 3). The former are indestructible while bombs destroy the latter. Upon blowing up wooden walls, there is a uniform chance of yielding one of three power-ups: an extra bomb, an extra unit of range in the agent's bombs, and the ability to kick bombs. The maps are designed randomly, albeit there is always a guaranteed path between any two agents.

In our experiments, we use the purely adversarial Free-For-All (FFA) environment. This environment is introduced by MultiAgentLearning (2018) and we point the reader there for more details. The winner of the game is the last agent standing and receives a reward of 1. The game is played from the perspective of one agent picked uniformly among the four, where each agent is defined by which of the four starting positions it begins in. The three opponents are copies of the winner of the June 3rd 2018 FFA competition, a strong competitor using a Finite State Machine Tree-Search approach (FSMTS, Zhou et al. (2018)).


Figure 3. Pommerman start state. Each agent begins in one of four positions. Yellow squares are wood, brown are rigid, and the gray are passages.

The input to our model consists of 18 11x11 channels, which together are a function of the observation state the game provides. A detailed description of this mapping is given in Appendix A.1. We feed the policy network the last two observation states concatenated together for a total of a 36x11x11 input.

The game ends either when the learning agent wins, when it dies, or when 800 steps have passed. Winning gives a +1 reward and both drawing and losing give −1. Consequently, this is an environment with a very sparse reward, so we refer to the runs on this setup as Sparse. In some of our experiments, labeled Dense, we add another shaping reward of +0.1 every time the agent picks up an item.

We use a similar setup to that used in the Maze game. The architecture differences are that we have an additional two convolutional layers at the beginning, use 256 output channels, and have output dimensions of 1024 and 512, respectively, for the linear layers. This architecture was not tuned at all during the course of our experiments. Further hyperparameter differences are that we used a learning rate of 3 × 10−4 and a gamma of 1.0. These models trained for ∼50M frames.²

We demonstrate results on the same three training regimes of Standard, Backplay, and Uniform. We also explored using DAgger (Ross et al., 2011) but found that our agent achieved approximately the same win rate (∼20%) as what we would expect when four FSMTS agents played each other (given that there are also ties).

Pommerman can be difficult for reinforcement learning agents. The agent must learn to effectively wield the bomb action in order to win against competent opponents. However, bombs destroy agents indiscriminately, so placing one without knowing how to retreat is often a negative experience. Since agents begin in an isolated area, they are prone to converging to policies which do not use the bomb action (as seen in the histograms in Figure 4).

Figure 4. Typical histograms for how the Pommerman action selections change over time, for (a) Standard Sparse and (b) Backplay Sparse. From left to right are the concatenated counts of the actions (Pass, Up, Down, Left, Right, Bomb), delineated on the y-axis by the epoch. Note how the Standard agent learns to not use bombs.

Our Backplay hyperparameters differed based on whether we were training on four games or a hundred games. Both corresponded to the sequence of windows [[0, 32), [24, 64), [56, 128), [120, 256), [248, 512), [504, 800), [800, 800]], while the former changed windows at epochs [0, 50, 100, 150, 200, 250, 300] and the latter at epochs [0, 85, 170, 255, 340, 425, 510].

² In our early experiments, we also used a batch size of 5120, which meant that the sampled transitions were much more correlated. While Backplay would train without any issues in this setup, Standard would only marginally learn. In an effort to improve our baselines, we did not explore this further.

Figure 5. Results when training on four maps (four seeds). Panels (a) and (b) show Backplay; panels (c) and (d) show Standard & Uniform. Charts a and c are trained from the perspective of the winning agent; charts b and d are from that of the second place agent. Dense is with reward shaping and Sparse is without. The Backplay models first encounter the initial state at epoch 250 (vertical red line). From epoch 300 onward (vertical green line), they only ever see the starting state and success is measured strictly over that regime. While Standard achieves strong results only in the dense setting, Backplay achieves comparable performance in both sparse and dense setups, and Uniform fails to learn a worthwhile policy in either.

We begin by considering an easy situation where maps are drawn from one of four fixed Pommerman board configurations (maps) with demonstrations provided by the four FSMTS agents. We independently consider two Backplay trajectories, one which follows the winner and one which follows the second place finisher. We change the FFA game to an MDP by replacing one of the FSMTS agents with our RL-trained agent.

We see in Figure 5 that Backplay is capable of producing agents that can defeat FSMTS agents in both sparse and dense setups while Standard requires hand-tuned reward shaping to construct competitive agents.

We then consider a harder setup where the board is drawn from one of a hundred maps. Here, Backplay achieves high success rates while the Standard and Uniform models fail to learn anything of substance (Figure 6). Moreover, it learns to play half of the games at win rates higher than 80% (Table 4 in A.3).

Unlike in the maze experiment, we see that the Backplay agent can, to some extent, generalize to unseen maps. Backplay wins 416 / 1000 games on a held out test set of ten never before seen maps, with the following success rates on each of the ten boards: 85.3%, 84.1%, 81.6%, 79.5%, 52.4%, 47.4%, 38.1%, 22.6%, 20%, and 18.3%.


Figure 6. Results when training on a hundred maps from the winner's perspective (four seeds). Dense is with reward shaping and Sparse is without. We show the success rate only over the initial state. Backplay first encounters this state at epoch 340 and it becomes the default for half of the environments at 425. The Backplay agents do exceptionally well, while the Uniform and Standard approaches do not show any reasonable signs of progress even though they have trained on this state the entire time.

4.3 Experience

We trained a large number of models through our research into Backplay. Though these findings are tangential to our main points, we list some observations here that may be helpful for other researchers working with Backplay.

First, we found that Backplay does not perform well when the curriculum is advanced too quickly; however, it does not fail when the curriculum is advanced 'too slowly.' Thus, researchers interested in using Backplay should err on the side of advancing the window too slowly rather than too quickly.

Second, we found that Backplay does not need to hit a high success rate before advancing. Initially, we tried using adaptive approaches that advanced the window when the agent reached a certain success threshold. This worked but was too slow. Our hypothesis is that what matters more is that the agent gets sufficiently exposed to enough states to attain a reasonable barometer of their value, rather than that the agent attains a perfectly optimal policy.

Third, Backplay can still recover if its success rate goes to zero. This surprising and infrequent result occurred at the juncture where the window moved back to the initial state, even when the policy's entropy over actions became maximal. We are unsure what differentiates these models from the ones that did not recover.

5 Conclusion

We have introduced and analyzed Backplay, a technique which improves the sample efficiency of model-free RL by constructing a curriculum around a demonstration. We showed that Backplay agents can learn in complex environments where standard model-free RL fails, and that they can outperform the 'expert' whose trajectories they use for selecting the starting states while training.

We also presented preliminary results on the limits of Backplay, but future research is needed to better understand the kinds of environments where Backplay succeeds or fails as well as how it can be made more efficient.


An important future direction is combining Backplay with more complex and complementary methods such as Monte Carlo Tree Search (MCTS; Browne et al., 2012; Vodopivec et al., 2017). There are many potential ways to do so, for example by using Backplay to warm-start MCTS.

Another future direction is to consider Backplay in multi-agent problems. Backplay can likely be used to accelerate self-play learning in zero-sum games; however, special care needs to be taken to avoid policy correlation during training (Lanctot et al., 2017) and thus to make sure that learned strategies are safe and not exploitable (Brown & Sandholm, 2017b).

A third direction is towards non-zero sum games. It is well known that standard independent multi-agent learning does not produce agents that are able to cooperate in social dilemmas (Leibo et al., 2017; Lerer & Peysakhovich, 2017; Peysakhovich & Lerer, 2017; Foerster et al., 2017) or risky coordination games (Yoshida et al., 2008; Peysakhovich & Lerer, 2018). By contrast, humans are much better at finding these coordinating and cooperating equilibria (Bó, 2005; Kleiman-Weiner et al., 2016). Thus we conjecture that human demonstrations can be combined with Backplay to construct agents that can indeed perform well in these situations. We note that the use of humans as experts, in general, was also suggested by Salimans & Chen (2018), work concurrent to our own.

Other future priorities are to gain a better understanding of when Backplay works well, where it fails, and how we can make the procedure more efficient. Do the gains come from value estimation as our analysis suggests, from policy iteration, or from both? What about the curriculum itself: is there an ideal rate at which we advance the window, and is there a better approach than a hand-tuned schedule? Should we be using one policy that learns throughout training or would it be better to have a policy per window? The latter has some nice consequences and, if we could ascertain a confidence estimate of a state's value, that would likely speed up the rate at which we can advance Backplay's schedule.

Acknowledgments

We especially thank Yichen Gong, Henry Zhou, and Luvneesh Mugrai for letting us use their agent. We also thank Martin Arjovsky, Adam Bouhenguel, Tim Chklovski, Andrew Dai, Doug Eck, Jesse Engel, Jakob Foerster, Ilya Kostrikov, Mohammad Norouzi, Avital Oliver, Yann Ollivier, and Will Whitney for helpful contributions.

References

Aytar, Yusuf, Pfaff, Tobias, Budden, David, Paine, Tom Le, Wang, Ziyu, and de Freitas, Nando. Playing hard exploration games by watching YouTube. arXiv preprint arXiv:1805.11592, 2018.

Bengio, Yoshua, Louradour, Jérôme, Collobert, Ronan, and Weston, Jason. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41–48. ACM, 2009.

Bó, Pedro Dal. Cooperation under the shadow of the future: experimental evidence from infinitely repeated games. American Economic Review, 95(5):1591–1604, 2005.

Brown, Noam and Sandholm, Tuomas. Safe and nested subgame solving for imperfect-information games. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30, pp. 689–699. Curran Associates, Inc., 2017a. URL http://papers.nips.cc/paper/6671-safe-and-nested-subgame-solving-for-imperfect-information-games.pdf.

Brown, Noam and Sandholm, Tuomas. Safe and nested subgame solving for imperfect-information games. In Advances in Neural Information Processing Systems, pp. 689–699, 2017b.

Browne, Cameron B, Powley, Edward, Whitehouse, Daniel, Lucas, Simon M, Cowling, Peter I, Rohlfshagen, Philipp, Tavener, Stephen, Perez, Diego, Samothrakis, Spyridon, and Colton, Simon. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1–43, 2012.

Clark, J. and Amodei, D. Faulty reward functions in the wild. https://blog.openai.com/faulty-reward-functions/, 2016.


Daumé, Hal, Langford, John, and Marcu, Daniel. Search-based structured prediction. Machine Learning, 75(3):297–325, 2009.

Florensa, Carlos, Held, David, Wulfmeier, Markus, and Abbeel, Pieter. Reverse curriculum generation for reinforcement learning. arXiv preprint arXiv:1707.05300, 2017.

Foerster, Jakob N, Chen, Richard Y, Al-Shedivat, Maruan, Whiteson, Shimon, Abbeel, Pieter, and Mordatch, Igor. Learning with opponent-learning awareness. arXiv preprint arXiv:1709.04326, 2017.

Hester, Todd, Vecerik, Matej, Pietquin, Olivier, Lanctot, Marc, Schaul, Tom, Piot, Bilal, Horgan, Dan, Quan, John, Sendonaris, Andrew, Dulac-Arnold, Gabriel, et al. Deep Q-learning from demonstrations. arXiv preprint arXiv:1704.03732, 2017.

Ho, Jonathan and Ermon, Stefano. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pp. 4565–4573, 2016.

Hong, Wenming and Zhou, Ke. A note on the passage time of finite-state Markov chains. Communications in Statistics - Theory and Methods, 46(1):438–445, 2017.

Hosu, Ionel-Alexandru and Rebedea, Traian. Playing Atari games with deep reinforcement learning and human checkpoint replay. arXiv preprint arXiv:1607.05077, 2016.

Ivanovic, Boris, Harrison, James, Sharma, Apoorva, Chen, Mo, and Pavone, Marco. BaRC: Backward reachability curriculum for robotic reinforcement learning. arXiv preprint arXiv:1806.06161, 2018.

Kleiman-Weiner, Max, Ho, Mark K, Austerweil, Joseph L, Littman, Michael L, and Tenenbaum, Joshua B. Coordinate to cooperate or compete: abstract goals and joint intentions in social interaction. In COGSCI, 2016.

Lanctot, Marc, Zambaldi, Vinicius, Gruslys, Audrunas, Lazaridou, Angeliki, Perolat, Julien, Silver, David, Graepel, Thore, et al. A unified game-theoretic approach to multiagent reinforcement learning. In Advances in Neural Information Processing Systems, pp. 4190–4203, 2017.

Laskey, Michael, Staszak, Sam, Hsieh, Wesley Yu-Shu, Mahler, Jeffrey, Pokorny, Florian T, Dragan, Anca D, and Goldberg, Ken. SHIV: Reducing supervisor burden in DAgger using support vectors for efficient learning from demonstrations in high dimensional state spaces. In Robotics and Automation (ICRA), 2016 IEEE International Conference on, pp. 462–469. IEEE, 2016.

Leibo, Joel Z, Zambaldi, Vinicius, Lanctot, Marc, Marecki, Janusz, and Graepel, Thore. Multi-agent reinforcement learning in sequential social dilemmas. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, pp. 464–473. International Foundation for Autonomous Agents and Multiagent Systems, 2017.

Lerer, Adam and Peysakhovich, Alexander. Maintaining cooperation in complex social dilemmas using deep reinforcement learning. arXiv preprint arXiv:1707.01068, 2017.

Lerer, Adam and Peysakhovich, Alexander. Learning social conventions in Markov games. arXiv preprint arXiv:1806.10071, 2018.

Levine, Sergey, Finn, Chelsea, Darrell, Trevor, and Abbeel, Pieter. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17(39):1–40, 2016. URL http://jmlr.org/papers/v17/15-522.html.

McAleer, Stephen, Agostinelli, Forest, Shmakov, Alexander, and Baldi, Pierre. Solving the Rubik's cube without human knowledge. arXiv preprint arXiv:1805.07470, 2018.

Moravcík, Matej, Schmid, Martin, Burch, Neil, Lisý, Viliam, Morrill, Dustin, Bard, Nolan, Davis, Trevor, Waugh, Kevin, Johanson, Michael, and Bowling, Michael H. DeepStack: Expert-level artificial intelligence in no-limit poker. CoRR, abs/1701.01724, 2017. URL http://arxiv.org/abs/1701.01724.

MultiAgentLearning. Pommerman. https://github.com/MultiAgentLearning/playground, 2018.


Nair, Ashvin, McGrew, Bob, Andrychowicz, Marcin, Zaremba, Wojciech, and Abbeel, Pieter. Overcoming exploration in reinforcement learning with demonstrations. arXiv preprint arXiv:1709.10089, 2017.

Ng, Andrew Y, Harada, Daishi, and Russell, Stuart. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pp. 278–287, 1999.

Peng, Xue Bin, Abbeel, Pieter, Levine, Sergey, and van de Panne, Michiel. DeepMimic: Example-guided deep reinforcement learning of physics-based character skills. arXiv preprint arXiv:1804.02717, 2018.

Peysakhovich, Alexander and Lerer, Adam. Consequentialist conditional cooperation in social dilemmas with imperfect information. arXiv preprint arXiv:1710.06975, 2017.

Peysakhovich, Alexander and Lerer, Adam. Prosocial learning agents solve generalized stag hunts better than selfish ones. In Proceedings of the 17th Conference on Autonomous Agents and MultiAgent Systems, 2018.

Ross, Stéphane, Gordon, Geoffrey, and Bagnell, Drew. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635, 2011.

Salimans, Tim and Chen, Richard. Learning Montezuma's Revenge from a single demonstration. https://blog.openai.com/learning-montezumas-revenge-from-a-single-demonstration/, 2018. Accessed: 2018-07-12.

Schulman, John, Wolski, Filip, Dhariwal, Prafulla, Radford, Alec, and Klimov, Oleg. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Silver, David, Huang, Aja, Maddison, Chris J., Guez, Arthur, Sifre, Laurent, van den Driessche, George, Schrittwieser, Julian, Antonoglou, Ioannis, Panneershelvam, Veda, Lanctot, Marc, Dieleman, Sander, Grewe, Dominik, Nham, John, Kalchbrenner, Nal, Sutskever, Ilya, Lillicrap, Timothy, Leach, Madeleine, Kavukcuoglu, Koray, Graepel, Thore, and Hassabis, Demis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

Silver, David, Hubert, Thomas, Schrittwieser, Julian, Antonoglou, Ioannis, Lai, Matthew, Guez, Arthur, Lanctot, Marc, Sifre, Laurent, Kumaran, Dharshan, Graepel, Thore, Lillicrap, Timothy P., Simonyan, Karen, and Hassabis, Demis. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. CoRR, abs/1712.01815, 2017. URL http://arxiv.org/abs/1712.01815.

Vodopivec, Tom, Samothrakis, Spyridon, and Šter, Branko. On Monte Carlo tree search and reinforcement learning. Journal of Artificial Intelligence Research, 60(1):881–936, September 2017. ISSN 1076-9757. URL http://dl.acm.org/citation.cfm?id=3207692.3207712.

Yoshida, Wako, Dolan, Ray J, and Friston, Karl J. Game theory of mind. PLoS Computational Biology, 4(12):e1000254, 2008.

Zhang, Jiakai and Cho, Kyunghyun. Query-efficient imitation learning for end-to-end autonomous driving. arXiv preprint arXiv:1605.06450, 2016.

Zhou, Hongwei, Gong, Yichen, Mugrai, Luvneesh, Khalifa, Ahmed, Nealen, Andy, and Togelius, Julian. A hybrid search agent in Pommerman. In The International Conference on the Foundations of Digital Games (FDG), 2018.


A Appendix

A.1 Pommerman Observation State

There are 18 feature maps that encompass each observation. They consist of the following: the agents' identities and locations, the locations of the walls, power-ups, and bombs, the bombs' blast strengths and remaining life counts, and the current time step.

The first map contains the integer values of each bomb's blast strength at the location of that bomb. The second map is similar but the integer value is the bomb's remaining life. At all other positions, the first two maps are zero. The next map is binary and contains a single one at the agent's location. If the agent is dead, this map is zero everywhere.

The following two maps are similar. One is filled with the agent's integer current bomb count, the other with its blast radius. We then have a full binary map that is one if the agent can kick and zero otherwise.

The next maps deal with the other agents. The first contains only ones if the agent has a teammate and zeros otherwise. This is useful for building agents that can play both team and solo matches. If the agent has a teammate, the next map is binary with a one at the teammate's location (and zero if she is not alive). Otherwise, the agent has three enemies, so the next map contains the position of the enemy that started in the diagonally opposed corner from the agent. The following two maps contain the positions of the other two enemies, which are present in both solo and team games.

We then include seven feature maps representing the respective locations of passages, rigid walls, wooden walls, flames, extra-bomb power-ups, increase-blast-strength power-ups, and kicking-ability power-ups. All are binary with ones at the corresponding locations.

Finally, we include a full map with the float ratio of the current step to the total number of steps. This information is useful for distinguishing among observation states that are seemingly very similar, but in reality are very different because the game has a fixed ending step where the agent receives negative reward for not winning.
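For concreteness, the construction for the FFA setting can be sketched as follows. This is our own illustration, not the authors' code: the observation keys ('board', 'bomb_blast_strength', 'bomb_life', 'position', 'ammo', 'blast_strength', 'can_kick', 'step_count') and the board item codes are assumptions loosely based on the Pommerman playground and may differ from the actual environment.

import numpy as np

def make_feature_maps(obs, enemy_ids, max_steps=800):
    """Assemble the 18 feature maps for an FFA (no-teammate) observation."""
    board = np.asarray(obs['board'])                         # 11x11 integer item codes

    def full(v):                                             # constant-valued 11x11 map
        return np.full(board.shape, float(v), dtype=np.float32)

    maps = [np.asarray(obs['bomb_blast_strength'], dtype=np.float32),   # bomb blast strengths
            np.asarray(obs['bomb_life'], dtype=np.float32)]             # bomb life counts

    me = np.zeros(board.shape, dtype=np.float32)             # one at the agent's position
    me[tuple(obs['position'])] = 1.0
    maps += [me, full(obs['ammo']), full(obs['blast_strength']), full(obs['can_kick'])]

    maps.append(full(0.0))                                   # has-teammate flag: none in FFA
    for enemy in enemy_ids[:3]:                              # three enemy position maps
        maps.append((board == enemy).astype(np.float32))

    # Seven binary maps for board features; the item codes below are placeholders.
    PASSAGE, RIGID, WOOD, FLAMES, EXTRA_BOMB, INCR_RANGE, KICK = 0, 1, 2, 4, 6, 7, 8
    for code in (PASSAGE, RIGID, WOOD, FLAMES, EXTRA_BOMB, INCR_RANGE, KICK):
        maps.append((board == code).astype(np.float32))

    maps.append(full(obs['step_count'] / max_steps))         # time-step ratio
    return np.stack(maps)                                    # shape (18, 11, 11)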

A.2 Backplay Hyperparameters

As mentioned in Section 3, the Backplay hyperparameters are the window bounds and the frequency with which they are shifted. There is not much downside to having the model continue training in a window for too long, other than that the point of the method is to increase the speed of training. There is, however, a downside to advancing too quickly. A pattern common to effective training is improving success curves punctuated by step-like drops whenever the window advances.

Starting at training epoch    Uniform window
0       [0, 32)
85      [24, 64)
170     [56, 128)
255     [120, 256)
340     [248, 512)
425     [504, 800)
510     [800, 800]

Table 2: Backplay hyperparameters for Pommerman 100 maps.

Starting at training epoch    Uniform window
0       [0, 4)
350     [4, 8)
700     [8, 16)
1050    [16, 32)
1400    [32, 64)
1750    [64, 64]

Table 3: Backplay hyperparameters for Maze.
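Expressed in code, the two schedules and the window lookup could look like the following; this is a convenience sketch with names of our choosing, not the authors' implementation.

import bisect

# (starting epoch, (j, k)) pairs from Tables 2 and 3.
POMMERMAN_100_SCHEDULE = [(0, (0, 32)), (85, (24, 64)), (170, (56, 128)), (255, (120, 256)),
                          (340, (248, 512)), (425, (504, 800)), (510, (800, 800))]
MAZE_SCHEDULE = [(0, (0, 4)), (350, (4, 8)), (700, (8, 16)), (1050, (16, 32)),
                 (1400, (32, 64)), (1750, (64, 64))]

def active_window(schedule, epoch):
    """Return the (j, k) window whose starting epoch is the largest one <= epoch."""
    starts = [start for start, _ in schedule]
    return schedule[bisect.bisect_right(starts, epoch) - 1][1]

print(active_window(MAZE_SCHEDULE, 20))     # (0, 4): early training, near the demo's end
print(active_window(MAZE_SCHEDULE, 1800))   # (64, 64): effectively the initial state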


A.3 Win Rates

Here we show the per-map win rates obtained by the agent trained with Backplay on the 100 Pommerman maps.

Win    Maps
90%    24
80%    26
70%    6
60%    5
50%    5

Table 4: Aggregate per-map win rates of the model trained with Backplay on 100 Pommerman maps. The model was run over 5000 times in total, with at least 32 times on each of the 100 maps. The Maps column shows the number of maps on which the Backplay agent had a success rate of at least the percent in the Win column. Note that this model has a win rate of > 80% on more than half of the maps.

A.4 Maze Learning Curves

Figure 7. Maze results when training with demonstrations of length equal to the optimal path length (over three seeds). The Backplay models first encounter the initial state at epoch 1400 (vertical red line). It becomes the default starting state at epoch 1750 (vertical green line).


Figure 8. Maze results when training with demonstrations that are five steps longer than the optimal path (over three seeds). The Backplay models first encounter the initial state at epoch 1400 (vertical red line). It becomes the default starting state at epoch 1750 (vertical green line).

Figure 9. Maze results when training with demonstrations that are ten steps longer than the optimal path (over three seeds). The Backplay models first encounter the initial state at epoch 1400 (vertical red line). It becomes the default starting state at epoch 1750 (vertical green line).
