

Active Domain Randomization

Bhairav Mehta ¹ ², Manfred Diaz ¹ ², Florian Golemo ¹ ³, Christopher J. Pal ¹ ³ ⁴ ⁵, Liam Paull ¹ ² ⁵

Abstract

Domain randomization is a popular technique for improving domain transfer, often used in a zero-shot setting when the target domain is unknown or cannot easily be used for training. In this work, we empirically examine the effects of domain randomization on agent generalization. Our experiments show that domain randomization may lead to suboptimal, high-variance policies, which we attribute to the uniform sampling of environment parameters. We propose Active Domain Randomization, a novel algorithm that learns a parameter sampling strategy. Our method looks for the most informative environment variations within the given randomization ranges by leveraging the discrepancies of policy rollouts in randomized and reference environment instances. We find that training more frequently on these instances leads to better overall agent generalization. In addition, when domain randomization and policy transfer fail, Active Domain Randomization offers more insight into the deficiencies of both the chosen parameter ranges and the learned policy, allowing for more focused debugging. Our experiments across various physics-based simulated tasks and a real-robot task show that this enhancement leads to more robust, consistent policies.

1. Introduction

Recent trends in Deep Reinforcement Learning (DRL) exhibit a growing interest in zero-shot domain transfer, i.e. when a policy is learned in a source domain and is then tested without finetuning in an unseen target domain. Zero-shot transfer is particularly useful when the task in the target domain is inaccessible, complex, or expensive, such as gathering rollouts from a real-world robot. An ideal agent would learn to generalize across domains; it would accomplish the task without exploiting irrelevant features or deficiencies in the source domain (e.g., approximate physics in simulators), which may vary dramatically after transfer.

¹ Mila ² Université de Montréal ³ Element AI ⁴ Polytechnique Montréal ⁵ Canada CIFAR AI Chair. Correspondence to: Bhairav Mehta <[email protected]>.

Preprint. Work in Progress.

Figure 1. Agent generalization, expressed as performance across different engine strength settings in LunarLander. We compare the following approaches: Baseline, i.e. default environment dynamics; Uniform Domain Randomization (UDR); Active Domain Randomization (ADR, our approach, which actively searches for difficult MDP instances to train on); and Oracle, i.e. a handpicked randomization range. For evaluation, we take each sampling strategy's final policies and evaluate them across the full range of environment parameters (i.e. we vary the main engine strength (MES), which affects the responsiveness and landing speed of the simulated lander). ADR learns a sampling strategy that allows for near-expert levels of generalization, while both Baseline and UDR fail to solve lower-MES environments.

One promising approach for zero-shot transfer has been domain randomization (Tobin et al., 2017). In Domain Randomization (DR), we uniformly randomize environment parameters of the simulation (e.g. friction, motor torque) across predefined ranges after every training episode. By randomizing everything that might vary in the target environment, the hope is that the agent will view the target domain as just another variation. However, recent works suggest that the sample complexity grows exponentially with the number of randomization parameters, even when dealing only with transfer between simulations (e.g. Figure 8 in Andrychowicz et al. (2018)). In addition, when domain randomization is unsuccessful, policy transfer fails as a black box. After a failed transfer, randomization ranges are tweaked heuristically via trial-and-error. Repeating this process iteratively, researchers are often left with arbitrary ranges that do (or do not) lead to policy convergence, without any insight into how those settings may affect the learned behavior.

arXiv:1904.04762v1 [cs.LG] 9 Apr 2019


Figure 2. ADR shows benefits over UDR on a wide range of tasks, including in a sim2real reaching task. The 4 DoF simulated robot must learn an efficient policy to reach a virtual point (shown in pink), and the final policies are evaluated on the real robot. We show that ADR policies transfer more robustly during zero-shot transfer to the more difficult real-world robot environment.

In this work, we demonstrate that the strategy of uniformly sampling environment parameters is suboptimal and propose an alternative method, Active Domain Randomization. Active Domain Randomization (ADR) formulates domain randomization as a search for randomized environments that maximize utility for the agent policy. Concretely, we aim to find environments that currently pose difficulties for the agent policy and dedicate more training time to these troublesome parameter settings. We cast this active search as a Reinforcement Learning (RL) problem where the ADR sampling policy is parameterized with Stein Variational Policy Gradient (SVPG). ADR homes in on problematic regions of the randomization space by learning a discriminative reward computed from discrepancies in policy rollouts generated in randomized and reference environments.

We showcase ADR on a simple environment where the benefits of training on more challenging variations are apparent and interpretable (Figure 1), and demonstrate that ADR learns to preferentially select parameters from these regions while still adapting to the policy's current deficiencies. We then apply ADR to more complex environments and real robot settings (Figure 2) and show that even with high-dimensional search spaces and unmodeled dynamics, policies trained with ADR exhibit superior generalization and lower overall variance than their Uniform Domain Randomization (UDR) counterparts.

The key contributions of our work can be summarized as:

1. Our proposed ADR method learns an adaptive randomization strategy that finds problematic environments within the given randomization ranges. Across a wide variety of tasks, we find that training preferentially on these environments leads to better generalization.

2. ADR can provide insight into which dimensions and parameter ranges are most influential before transfer, which can aid the tuning of randomization ranges before expensive experiments are undertaken.

2. Background

In this section, we briefly cover the basics of RL (used to train both the agent policy and the ADR policy), domain randomization, and Stein Variational Policy Gradient (which parameterizes the ADR policy).

2.1. Reinforcement Learning

We consider an RL framework (Sutton & Barto, 2018) where some task $T$ is defined by a Markov Decision Process (MDP) consisting of a state space $S$, action space $A$, state transition function $P : S \times A \mapsto S$, reward function $R : S \times A \mapsto \mathbb{R}$, and discount factor $\gamma \in (0, 1)$. The goal for an agent trying to solve $T$ is to learn a policy $\pi$ with parameters $\theta$ that maximizes the expected total discounted reward. We define a rollout $\tau = (s_0, a_0, \ldots, s_T)$ to be the sequence of states $s_t$ and actions $a_t \sim \pi(a_t \mid s_t)$ executed by a policy $\pi$ in the environment.
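As a small illustration of the objective above, the total discounted reward of a single rollout can be computed as follows. This is a minimal sketch: the function name `discounted_return` and the plain-list reward representation are assumptions made for illustration and are not taken from the paper.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return sum_t gamma**t * r_t for the rewards of one rollout."""
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie in the open interval (0, 1)")
    total = 0.0
    # Accumulate from the last step backwards: G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total
```

The agent policy is trained to maximize the expectation of this quantity over rollouts.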

2.2. Domain Randomization

Domain randomization (DR) is a technique to train policies completely in simulation and transfer them in a zero-shot manner to the real world. DR requires a prescribed set of $N_{rand}$ simulation parameters to randomize, as well as corresponding ranges to sample them from. A set of parameters is sampled from randomization space $\Xi \subset \mathbb{R}^{N_{rand}}$, where each randomization parameter $\xi^{(i)}$ is bounded on a closed interval $\{[\xi^{(i)}_{low}, \xi^{(i)}_{high}]\}_{i=1}^{N_{rand}}$.

When a configuration $\xi \in \Xi$ is passed to a non-differentiable simulator $S$, it generates an environment $E$. At the start of each episode, the parameters are uniformly sampled from the ranges, and the environment generated from those values is used to train the agent policy $\pi$.

DR may perturb any or all elements of the task $T$'s underlying MDP¹, with the exception of keeping $R$ and $\gamma$ constant. DR therefore generates a set of MDPs that are superficially similar, but can vary greatly in difficulty depending on the character of the randomization. Upon transfer to the target domain, the expectation is that the agent policy has learned to generalize across MDPs, and sees the final domain as just another variation of parameters.

The most common instantiation of DR, UDR, is summarized in Algorithm 2 in Appendix B. UDR generates randomized environment instances $E_i$ by uniformly sampling $\Xi$. The agent policy $\pi$ is then trained on rollouts $\tau_i$ produced in randomized environments $E_i$.
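The UDR procedure described above can be sketched in a few lines. This is an illustrative sketch only, not the paper's Algorithm 2: `make_env` stands in for the simulator $S$ and `train_on_env` for one episode of agent training, and both are hypothetical callables supplied by the caller.

```python
import numpy as np


def sample_uniform_config(low, high, rng):
    """Draw one configuration xi uniformly from the box [low, high]."""
    low = np.asarray(low, dtype=float)
    high = np.asarray(high, dtype=float)
    if low.shape != high.shape or np.any(low > high):
        raise ValueError("low and high must have equal shape with low <= high")
    return rng.uniform(low, high)


def udr_loop(low, high, make_env, train_on_env, n_episodes, seed=0):
    """Resample parameters at the start of every episode, then train on the result."""
    rng = np.random.default_rng(seed)
    sampled = []
    for _ in range(n_episodes):
        xi = sample_uniform_config(low, high, rng)  # xi_i ~ U(Xi)
        env = make_env(xi)                          # E_i = S(xi_i), hypothetical simulator call
        train_on_env(env)                           # hypothetical agent update on E_i
        sampled.append(xi)
    return np.stack(sampled)
```

Returning the sampled configurations makes it easy to inspect which parts of the randomization space were actually visited during training.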

¹ The effects of DR on the action space $A$ are usually implicit or are carried out on the simulation side.

2.3. Stein Variational Policy Gradient

Sufficient exploration in high-dimensional state spaces has always been a difficult problem in RL. Recently, Liu et al. (2017) proposed SVPG, which learns an ensemble of policies $\mu_\phi$ in a maximum-entropy RL framework (Ziebart, 2010):

$$\max_{\mu} \; \mathbb{E}_{\mu}[J(\mu)] + \alpha \mathcal{H}(\mu) \tag{1}$$

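with entropy $\mathcal{H}$ being controlled by temperature parameter $\alpha$. SVPG uses Stein Variational Gradient Descent (Liu & Wang, 2016) to iteratively update an ensemble of $N$ policies or particles $\mu_\phi = \{\mu_{\phi_i}\}_{i=1}^{N}$ with an update rule:

$$\mu_{\phi_i} \leftarrow \mu_{\phi_i} + \epsilon \Delta\mu_{\phi_i}, \qquad \Delta\mu_{\phi_i} = \frac{1}{N} \sum_{j=1}^{N} \left[ \nabla_{\mu_{\phi_j}} J(\mu_{\phi_j}) \, k(\mu_{\phi_j}, \mu_{\phi_i}) + \alpha \nabla_{\mu_{\phi_j}} k(\mu_{\phi_j}, \mu_{\phi_i}) \right] \tag{2}$$

with step size $\epsilon$ and positive definite kernel $k$. This update rule balances exploitation (the first term moves particles towards high-reward regions) and exploration (the second term repulses similar policies).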
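To make the two terms of Eq. (2) concrete, here is a minimal NumPy sketch of one update over a set of particles. It makes several assumptions that are not taken from the paper: each particle is a flat parameter vector, the kernel is an RBF kernel with a fixed `bandwidth`, and the policy gradients `grad_j` are supplied by the caller.

```python
import numpy as np


def rbf_kernel(particles, bandwidth):
    """Return k[j, i] = k(x_j, x_i) and grad_k[j, i] = d k(x_j, x_i) / d x_j."""
    diffs = particles[:, None, :] - particles[None, :, :]  # diffs[j, i] = x_j - x_i
    sq_dists = np.sum(diffs ** 2, axis=-1)
    k = np.exp(-sq_dists / bandwidth)
    grad_k = (-2.0 / bandwidth) * diffs * k[:, :, None]
    return k, grad_k


def svpg_step(particles, grad_j, alpha, step_size, bandwidth=1.0):
    """Apply one update of Eq. (2) to an (N, d) array of particles."""
    n = particles.shape[0]
    k, grad_k = rbf_kernel(particles, bandwidth)
    drive = k.T @ grad_j           # sum_j grad J(x_j) k(x_j, x_i): exploitation
    repulse = grad_k.sum(axis=0)   # sum_j grad_{x_j} k(x_j, x_i): exploration
    delta = (drive + alpha * repulse) / n
    return particles + step_size * delta
```

With zero policy gradient and `alpha > 0`, nearby particles are pushed apart, which is the repulsion described above; with a single particle the repulsion term vanishes and the update reduces to plain gradient ascent.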

3. Method

Drawing analogies with the Bayesian Optimization (BO) literature, one can think of the randomization space as a search space. We aim to look for points (environment instances) that maximize utility, or provide the most improvement to our agent policy when used for training. Traditionally, in BO, the search for where to evaluate an objective is handled by acquisition functions, which trade off exploitation of the objective with exploration in the uncertain regions of the space (Brochu et al., 2010). However, unlike the stationary objectives seen in BO, training the agent policy makes our optimization nonstationary: the environment with highest utility at time $t$ is likely not the same as the maximum-utility environment at time $t+1$. With this dynamic objective, we need to actively search the space for the most fruitful training environments given the current state of the agent policy.

3.1. Motivating Experiment

However, as this nonstationary search adds its own complexity, it is important to investigate whether uniform sampling across the entire space is actually detrimental to agent performance. Concretely, we begin by investigating the validity of the following claim: uniform sampling of environment parameters does not generate equally useful MDPs. To test the hypothesis, we use LunarLander-v2, where the agent's task is to ground a lander in a designated zone and reward is based on the quality of landing (fuel used, impact velocity, etc.). Parameterized by an 8D state vector and actuated by a 2D continuous action space, LunarLander-v2 has one main axis of randomization that we vary: the main engine strength (MES).

We aim to determine whether certain environment instances (different values of the MES) are more informative, i.e. more efficient than others in terms of aiding generalization. We set the total range of variation for the MES to be [8, 20] (the default is 13, and values below 7.5 make the environment unsolvable when all other physics parameters are held constant) and find through empirical tests that lower engine strengths generate harder MDPs to solve. Under this assumption, we show the effects of focused domain randomization by editing the ranges that the MES is uniformly sampled from.

We train multiple agents, with the only difference between them being the randomization ranges for MES. The randomization ranges define what types of environments the agent is exposed to during training. Figure 1 shows the final generalization performance of each agent by sweeping across the entire randomization range of [8, 20] and rolling out the policy in the generated environments. We see that focusing on harder MDPs improves generalization over uniformly sampling the whole space, even when the evaluation environment is outside of the training distribution.
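The evaluation sweep behind a generalization curve like Figure 1 can be sketched as follows. This is only an illustration: `evaluate` is a hypothetical callable that rolls out a fixed, already-trained policy at a given MES value and returns a score, and the grid of 13 evaluation points is an assumption, not a value from the paper.

```python
import numpy as np


def sweep_generalization(evaluate, low=8.0, high=20.0, n_points=13):
    """Evaluate a fixed policy on an evenly spaced grid over [low, high]."""
    values = np.linspace(low, high, n_points)
    # One score per parameter value; `evaluate` hides the environment rollout.
    scores = np.array([evaluate(v) for v in values])
    return values, scores
```

Plotting `scores` against `values` for each training range gives one curve per agent.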

3.2. Active Domain Randomization

The experiment in the previous section shows that preferential training on more informative environments provides tangible benefits, but in general, finding these environments is difficult because:

1. It is rare that such intuitively hard MDP instances or parameter ranges are known beforehand.

2. DR is used mostly when the space of randomized parameters is high-dimensional or noninterpretable.

An ideal randomization scheme would find the most informative environment instances in the randomization space, rather than uniformly sampling from the entire space. While seemingly just an instantiation of the traditional BO problem, the nonstationarity of the objective (the environment utility) requires us to redefine the notion of an acquisition function while simultaneously dealing with BO's deficiencies with higher-dimensional inputs (Wang et al., 2013).

To this end, we propose ADR, summarized in Algorithm 1 and Figure 3. ADR provides a framework for manipulating a more general analog of an acquisition function, selecting the most informative MDPs for the agent within the randomization space. By formulating the search as an RL problem, ADR learns a policy $\mu_\phi$ where the states are proposed randomization configurations $\xi \in \Xi$ and actions are continuous changes to those parameters.

We learn a discriminator-based reward for $\mu_\phi$, similar to the one originally proposed in Eysenbach et al. (2018):

$$r_D = \log D_\psi(y \mid \tau_i \sim \pi(\cdot; E_i)) \tag{3}$$

where $y$ is a boolean variable denoting the discriminator's prediction of which type of environment (a randomized environment $E_i$ or reference environment $E_{ref}$) the trajectory $\tau_i$ was generated from. We assume that $E_{ref} = S(\xi_{ref})$ is provided with the original task definition.

Algorithm 1 Active Domain Randomization
 1: Input: Ξ: randomization space, S: simulator, ξ_ref: reference parameters
 2: Initialize π_θ: agent policy, µ_φ: SVPG particles, D_ψ: discriminator, E_ref ← S(ξ_ref): reference environment
 3: while not max timesteps do
 4:   for each particle do
 5:     rollout ξ_i ∼ µ_φ(·)
 6:   end for
 7:   for each ξ_i do
 8:     // Generate, rollout in randomized env.
 9:     E_i ← S(ξ_i)
10:     rollout τ_i ∼ π_θ(·; E_i), τ_ref ∼ π_θ(·; E_ref)
11:     T_rand ← T_rand ∪ τ_i
12:     T_ref ← T_ref ∪ τ_ref
13:     for each gradient step do
14:       // Agent policy update
15:       with T_rand update:
16:         θ ← θ + ν ∇_θ J(π_θ)
17:     end for
18:   end for
19:   // Calculate reward for each proposed environment
20:   for each τ_i ∈ T_rand do
21:     Calculate reward with associated ξ_i and E_i using Eq. (3)
22:   end for
23:   // Update randomization sampling strategy
24:   for each particle µ_φi do
25:     Update particles using Eq. (2)
26:   end for
27:   // Update discriminator
28:   for each gradient step do
29:     Update D_ψ with τ_i and τ_ref using SGD.
30:   end for
31: end while

Intuitively, we reward the policy $\mu_\phi$ for finding regions of the randomization space that produce environment instances where the same agent policy $\pi$ acts differently than in the reference environment. The agent policy $\pi$ sees and trains only on the randomized environments (as it would in traditional DR), using the environment's task-specific reward for updates. As the agent improves on the proposed, problematic environments, it becomes more difficult to differentiate whether any given state transition was generated from the reference or randomized environment. Thus, ADR can find what parts of the randomization space the agent is currently performing poorly on, and can actively update its sampling strategy throughout the training process.
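The discriminator-based reward can be illustrated with a deliberately small stand-in model. The sketch below is not the paper's architecture: it assumes each trajectory has already been summarized as a fixed-length feature vector, uses a logistic-regression discriminator, and takes the label $y = 1$ to mean "randomized", so the reward is the log-probability the discriminator assigns to that label. The class name and learning rate are likewise illustrative.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


class LogisticDiscriminator:
    """Toy D_psi: predicts whether trajectory features come from a randomized env."""

    def __init__(self, n_features, lr=0.1):
        self.w = np.zeros(n_features)
        self.b = 0.0
        self.lr = lr

    def prob_randomized(self, feats):
        return sigmoid(feats @ self.w + self.b)

    def update(self, rand_feats, ref_feats):
        """One gradient step on the binary cross-entropy (randomized=1, reference=0)."""
        x = np.vstack([rand_feats, ref_feats])
        y = np.concatenate([np.ones(len(rand_feats)), np.zeros(len(ref_feats))])
        err = self.prob_randomized(x) - y
        self.w -= self.lr * (x.T @ err) / len(y)
        self.b -= self.lr * err.mean()

    def reward(self, rand_feats, eps=1e-8):
        """Log-probability of the 'randomized' label, used as the sampler's reward."""
        return np.log(self.prob_randomized(rand_feats) + eps)
```

Under these assumptions, trajectories that are easy to tell apart from reference rollouts receive rewards near zero, while trajectories that look like reference rollouts receive strongly negative rewards, so the sampler is drawn toward parameter settings where the agent still behaves differently.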

3.3. Architecture Walkthrough

In this section, we walk through the diagram shown in Figure 3. All line references refer to Algorithm 1.

Figure 3. Overview of our proposed framework: ADR proposes randomized environments (c) or simulation instances from a simulator (b) and rolls out an agent policy (d) in those instances. The discriminator (e) learns a reward (f) as a proxy for environment difficulty by distinguishing between rollouts in the reference environment (a) and randomized instances, which is used to train Stein Variational Policy Gradient (SVPG) particles (g). Enforced through the SVPG formulation, the particles propose a diverse set of environment dynamics, and try to find the environment parameters (h) that are currently causing the agent the most difficulty.

3.3.1. SVPG SAMPLER

To encourage sufficient exploration in high-dimensional randomization spaces, we parameterize $\mu_\phi$ with SVPG. Since each particle proposes its own environment settings $\xi_i$ (lines 4-6, Fig. 3h), all of which are passed to the agent for training, the agent policy benefits from the same environment variety seen in UDR. However, unlike UDR, $\mu_\phi$ can use the learned reward to focus on problematic MDP instances while still being efficiently parallelizable.

3.3.2. SIMULATOR

After receiving each particle's proposed parameter settings $\xi_i$, we generate randomized environments $E_i = S(\xi_i)$ (line 9, Fig. 3b).

3.3.3. GENERATING TRAJECTORIES

We proceed to train the agent policy $\pi$ on the randomized instances $E_i$, just as in UDR. We roll out $\pi$ on each randomized instance $E_i$ and store each trajectory $\tau_i$. For every randomized trajectory generated, we use the same policy to collect and store a reference trajectory $\tau_{ref}$ by rolling out $\pi$ in the default environment $E_{ref}$ (lines 10-12, Fig. 3a, c). We store all trajectories (lines 11-12) as we will use them to score each parameter setting $\xi_i$ and update the discriminator.

The agent policy is a black box: although in our experiments we train $\pi$ with Deep Deterministic Policy Gradients (Lillicrap et al., 2015), the policy can be trained with any other on- or off-policy algorithm by introducing only minor changes to Algorithm 1 (lines 13-17, Fig. 3d).

3.3.4. SCORING ENVIRONMENTS

We now generate a score for each environment (lines 20-22) using each stored randomized trajectory $\tau_i$ by passing them through the discriminator $D_\psi$, which predicts the type of environment (reference or randomized) each trajectory was generated from. We use this score as a reward to update each SVPG particle using Equation 2 (lines 24-26, Fig. 3f).

After scoring each $\xi_i$ according to Equation 3, we use the randomized and reference trajectories to train the discriminator (lines 28-30, Fig. 3e).

4. Results

4.1. Experiment Details

To test ADR, we experiment on OpenAI Gym environments (Brockman et al., 2016) across various tasks, both simulated and real.

• LunarLander-v2², a 2 degrees of freedom (DoF) environment where the agent has to softly land a spacecraft, implemented in Box2D (detailed in Section 3.1),

• Pusher-3DOF-v0³, a 3 DoF arm that has to push a puck to a target, implemented in MuJoCo (Todorov et al., 2012), and

• ErgoReacher-v0⁴, a 4 DoF arm which has to touch a goal with its end effector, implemented in the Bullet Physics Engine (Coumans, 2015). For sim2real experiments, we recreate this environment setup on a real Poppy Ergo Jr. robot (Lapeyre, 2014) shown in Fig. 2.

In addition to the randomization of the main engine strength of LunarLander-v2, both other environments randomize various physics parameters that change the environments drastically. Pusher-3DOF-v0 has two axes of randomization that make the puck slide more or less when pushed. ErgoReacher-v0 randomizes the max torque and gain for each degree of freedom (joint), for a total of eight randomization parameters. We provide a detailed account of the randomizations used in Table 1 in Appendix C.

All simulated experiments are run with five seeds, each with five random resets, totaling 25 independent trials per evaluation point. All experimental results are plotted as the mean with one standard deviation shown. Detailed experiment information can be found in Appendix E.
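The aggregation described above (5 seeds × 5 resets = 25 trials per evaluation point) can be sketched as follows. The array layout, the function name, and the use of the population standard deviation (`ddof=0`, NumPy's default) are assumptions for illustration; the text does not specify which estimator is plotted.

```python
import numpy as np


def summarize_trials(returns):
    """Pool a (n_seeds, n_resets) array of results into count, mean, and std."""
    returns = np.asarray(returns, dtype=float)
    if returns.ndim != 2:
        raise ValueError("expected a 2D array of shape (n_seeds, n_resets)")
    flat = returns.reshape(-1)  # treat every (seed, reset) pair as one trial
    return flat.size, float(flat.mean()), float(flat.std())
```

For a 5 × 5 input this pools all 25 trials into a single mean and spread for one evaluation point.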

4.2. Toy Experiments

To investigate whether ADR's learned sampling strategy provides a tangible benefit for agent generalization, we start by comparing it against traditional DR (labeled as UDR) on LunarLander-v2 and vary only the main engine strength (MES). In Figure 1, we see that ADR approaches expert levels of generalization whereas UDR fails to generalize on lower MES ranges.

Figure 4. Learning curves over time in LunarLander. Higher is better. (a) Performance on the default environment settings; (b) Performance on particularly difficult settings: our approach outperforms both the policy trained on a single simulator instance ("baseline") and the UDR approach.

² https://gym.openai.com/envs/LunarLander-v2/

³ Originally developed for Haarnoja et al. (2018)

⁴ Originally developed for Golemo et al. (2018)

From Figure 4(a), we see that ADR solves the reference environment ($\xi_{MES} = 13$) more consistently than UDR, never dipping below the Solved line once that level of performance is reached. Figure 4(b) shows that, compared with the baseline (trained only on an MES of 13) and the UDR agent (trained on environments with $\xi_{MES} \sim U[8, 20]$), ADR is the only agent that makes significant progress on the hard environment instances ($\xi_{MES} \sim U[8, 11]$).

Figure 5 explains the adaptability of ADR by showing generalization and sampling distribution at various stages of training. ADR starts by sampling approximately uniformly for the first 650K steps, but then finds a deficiency in the policy on higher ranges of the MES. As those areas become more frequently sampled between 650K-800K steps, the agent learns to solve all of the higher-MES environments, as shown by the generalization curve for 800K steps. As a result, the discriminator is no longer able to differentiate reference and randomized trajectories from the higher-MES regions, and starts to reward environment instances generated in the lower end of the MES range, which improves generalization towards the completion of training.

4.3. Randomization in High Dimensions

If the intuitions that drive ADR are correct, we should see increased benefit of a learned sampling strategy with larger $N_{rand}$ due to the increasing sparsity of informative environments when sampling uniformly. We first explore ADR's performance on Pusher3DOF-v0, an environment where $N_{rand} = 2$. Both randomization dimensions (puck damping, puck friction loss) affect whether or not the puck retains momentum and continues to slide after making contact with the agent's end effector. Lowering the values of these parameters simultaneously creates an intuitively harder environment, where the puck continues to slide after being hit. In the reference environment, the puck retains no momentum and must be continuously pushed in order to move. We qualitatively visualize the effect of these parameters on puck sliding in Figure 6(a).

Figure 5. Agent generalization, expressed as performance across different engine strength settings in LunarLander. (a) Change in performance during training; (b) Change in dynamics sampling during training. As training proceeds, ADR begins preferentially sampling the more challenging environmental instances.

From Figure 6(b), we see ADR's improved robustness to extrapolation, i.e. when the target domain lies outside the training region. We train two agents, one using ADR and one using UDR, and show them only the training regions encapsulated by the dark, outlined box in the top-right of Figure 6(a). Qualitatively, only 25% of the environments have dynamics which cause the puck to slide; these are the hardest environments to solve in the training region. We see from the sampling histogram overlaid on Figure 6(a) that ADR prioritizes the single, harder purple region more than the light blue regions, allowing for better generalization to the unseen test domains, as shown in Figure 6(b). ADR outperforms UDR in all but one test region and produces policies with less variance than their UDR counterparts.

From both Figure 7(a) and Figure 7(b), which are learning curves for UDR and ADR on the reference environment and hard environment (the pink square in Figure 6) respectively, we observe an interesting phenomenon: not only does ADR solve both environments more consistently (i.e. it does not rise back above the Solved line), but UDR also unlearns the good behaviors it acquired in the beginning of training. When training neural networks in both supervised and reinforcement learning settings, this phenomenon has been dubbed catastrophic forgetting (Kirkpatrick et al., 2016). ADR seems to exhibit this slightly (leading to "hills" in the curve), but due to the adaptive nature of the algorithm, it is able to adjust quickly and retain better performance across all environments.

4.4. Randomization in Uninterpretable Dimensions

We further show the significance of ADR over UDR on ErgoReacher-v0, where $N_{rand} = 8$. It is now impossible to infer intuitively which environments are hard due to the complex interactions between the eight randomization parameters (gains and maximum torques for each joint). For demonstration purposes, we test extrapolation by creating a held-out target environment with extremely low values for torque and gain, which causes certain states in the environment to lead to catastrophic failure: gravity pulls the robot end effector down, and the robot is not strong enough to pull itself back up. We show an example of an agent getting trapped in a catastrophic failure state in Figure 12, Appendix C.1.

Figure 6. Sampling behavior of ADR in Pusher3Dof. The environment dynamics are characterized by friction and damping of the sliding puck. We have identified dynamics settings which exhibit easier- or harder-to-learn puck behavior (as highlighted by cyan, purple, and pink, from easy to hard). (a) During training, the algorithm only had access to a limited, easier range of dynamics (black outline). We observed that our approach will converge to the hardest settings within this limited range. (b) Performance measured by distance to target, lower is better. The results show the higher performance and lower variance of our approach, save for one exception.

To generalize effectively, the sampling policy should prioritize environments with lower torque and gain values in order for the agent to operate in such states precisely. However, since the hard evaluation environment is not seen during training, ADR must learn to prioritize the hardest environments that it can see, while still learning behaviors that can operate well across the entire training region.


Figure 7. Learning curves over time in Pusher3Dof. Lower is better. (a) Performance on the default environment settings; (b) performance on particularly difficult settings. Our approach outperforms the policy trained with UDR in terms of both performance and variance.


Figure 8. Learning curves over time in ErgoReacher. Lower is better. (a) Performance on the default environment settings; (b) performance on particularly difficult settings. Our approach outperforms the policy trained with UDR in terms of both performance and variance.

In Figure 8(a), we see that when evaluated on the reference environment, the policy learned using ADR has much lower variance than one learned using UDR and can solve the environment much more effectively. In addition, it generalizes better to the unseen target domain as shown in Figure 8(b), again with much less variance in the learned agent policy.

UDR’s high variance on ErgoReacher-v0 is indicative of some of its issues: by continuously training on a random mix of hard and easy MDP instances, both beneficial and detrimental agent behaviors can be learned and unlearned throughout training. As shown in ErgoReacher-v0, this mixing can lead to high-variance, inconsistent, and unpredictable behavior upon transfer. By focusing on those harder environments and allowing the definition of hard to adapt over time, ADR shows more consistent performance and better overall generalization than UDR in all environments tested.

4.5. Sim2Real Transfer Experiments

In this section, we present results of simulation-trained policies transferred zero-shot onto the real Poppy Ergo Jr. robot.

Figure 9. Results of the ErgoReacher policies evaluated on the physical robot over various torque settings, measured by final distance to the target. Lower is better. Our approach performs as well as or better than UDR, while the spread of ADR is consistently smaller than that of UDR. Smaller spread points to more consistent performance, which is important when considering the potentially dangerous transfer onto real-world robots.

In sim2real (simulation to reality) transfer, many policies fail due to unmodeled dynamics within the simulators, as policies may have overfit to or exploited simulation-specific details of their training environments. While the deficiencies and high variance of UDR are clear even in simulated environments, one of the most impressive results of domain randomization was zero-shot transfer out of simulation onto robots. However, we find that the same issues of unpredictable performance apply to UDR-trained policies in the real world as well.

We take each method’s (ADR and UDR) five independent simulation-trained policies from Section 4.4 and transfer them without fine-tuning onto the real robot. We roll out only the final policy on the robot, and show performance in Figure 9. To evaluate generalization, we alter the robot by changing the values of the torques (higher torque means the arm moves at higher speed and accelerates faster), and evaluate each of the policies with 25 random goals (5 × 25 = 125 independent evaluations per method and torque setting). As shown in Figure 9, ADR policies obtain performance better than or similar to that of UDR policies trained in the same conditions. More importantly, ADR policies are more consistent and display lower spread across all environments, which is crucial when safely evaluating reinforcement learning policies on real-world robots.
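The evaluation protocol above amounts to pooling, for each torque setting, the final distances from five policies and 25 goals each, and then comparing mean and spread between methods. The following is a minimal sketch of that bookkeeping only. The function `final_distance` is a hypothetical stand-in for a real rollout, and the values it returns are synthetic, not results from the paper.

```python
import random
import statistics

N_POLICIES = 5   # independent simulation-trained policies per method (from the text)
N_GOALS = 25     # random goals evaluated per policy (from the text)


def final_distance(policy_seed, goal_seed, torque):
    """Hypothetical stand-in for one rollout on the robot.

    Returns a synthetic final distance to the target. A real evaluation
    would run the policy and measure the distance instead.
    """
    rng = random.Random(f"{policy_seed}-{goal_seed}-{torque}")
    return abs(rng.gauss(0.1, 0.03))


def evaluate(torque):
    """Pool all policy/goal combinations for one torque setting."""
    distances = [
        final_distance(p, g, torque)
        for p in range(N_POLICIES)
        for g in range(N_GOALS)
    ]
    return {
        "n": len(distances),
        "mean": statistics.mean(distances),
        "stdev": statistics.stdev(distances),
    }


summary = evaluate(torque=1.0)
```

Reporting the standard deviation next to the mean is one simple way to express the "spread" that Figure 9 compares; the paper's plots may use a different summary statistic.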

4.6. Interpretability

One of the secondary benefits of ADR is its insight into incompatibilities between the task and randomization ranges. We demonstrate the simple effects of this phenomenon in a one-dimensional LunarLander-v2, where we only randomize the main engine strength. Our initial experiments varied this parameter between 6 and 20, which led ADR to learn degenerate agent policies by learning to propose the lopsided blue distribution in Figure 10. Upon inspection of the simulation, we see that when the parameter has a value of less than approximately 8, the task becomes almost impossible to solve due to the other environment factors (in this case the lander always hits the ground too fast, which it is penalized for).

Figure 10. Sampling frequency across engine strengths when varying the randomization ranges. The updated, red distribution shows a much milder unevenness in the distribution, while still learning to focus on the harder instances. This can be used for debugging the randomization ranges before transferring a learned policy onto a physical system.

After adjusting the parameter ranges to more sensible values, we see a better sampled distribution in pink, which still gives more preference to the hard environments in the lower engine strength range. Most importantly, ADR allows for analysis that is both focused (we know exactly what part of the simulation is causing trouble) and pre-transfer, i.e. done before a more expensive experiment such as real robot transfer has taken place. With UDR, the agents would be equally trained on these degenerate environments, leading to policies with potentially undefined behavior (or, as seen in Section 4.4, unlearning of good behaviors) in these truly out-of-distribution simulations.
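One way to carry out this kind of range debugging is to bin the parameter values that the sampler proposed during training and look for bins that absorb a disproportionate share of the samples. The sketch below assumes the proposed values were logged to a plain list. The function name, the bin count, and the example values are illustrative assumptions, not taken from the paper or its code.

```python
def sampling_frequencies(samples, low, high, n_bins=7):
    """Return the fraction of samples falling in each equal-width bin."""
    if not samples:
        raise ValueError("samples must be non-empty")
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for x in samples:
        # Clamp the index so values at or beyond the bounds land in an edge bin.
        idx = max(0, min(int((x - low) / width), n_bins - 1))
        counts[idx] += 1
    return [c / len(samples) for c in counts]


# Made-up log of proposed main engine strengths within [6, 20].
logged = [6.2, 6.5, 6.9, 7.1, 7.4, 7.8, 9.0, 12.5, 16.0, 19.5]
freqs = sampling_frequencies(logged, low=6.0, high=20.0)
```

A single edge bin holding most of the mass, as the first bin does in this toy log, is the kind of lopsided pattern that would suggest inspecting the simulation at that end of the range.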

5. Related Work

5.1. Dynamic and Adversarial Simulators

Simulators have played a crucial role in transferring learned policies onto real robots, and many different strategies have been proposed. Randomizing simulation parameters for better generalization or transfer performance is a well-established idea in evolutionary robotics (Zagal et al., 2004; Bongard & Lipson, 2004), but recently has emerged as an effective way to perform zero-shot transfer of deep reinforcement learning policies in difficult tasks (Andrychowicz et al., 2018; Tobin et al., 2017; Peng et al., 2018; Sadeghi & Levine, 2016).

Learnable simulations are also an effective way to adapt a simulation to a particular target environment. Chebotar et al. (2018) and Ruiz et al. (2018) use RL for effective transfer by learning parameters of a simulation that accurately describes the target domain, but require the target domain for reward calculation, which can lead to overfitting. In contrast, our approach requires no target domain, but rather only a reference domain (the default simulation parameters) and a general range for each parameter. ADR encourages diversity, and as a result gives the agent a wider variety of experience. In addition, unlike Chebotar et al. (2018), our method does not require carefully tuned (co-)variances or task-specific cost functions. Concurrently, Khirodkar et al. (2018) also showed the advantages of learning adversarial simulations and disadvantages of purely uniform randomization distributions in object detection tasks.

To improve policy robustness, Robust Adversarial Reinforcement Learning (RARL; Pinto et al., 2017) jointly trains both an agent and an adversary who applies environment forces that disrupt the agent’s task progress. ADR removes the zero-sum game dynamics, which have been known to decrease training stability (Mescheder et al., 2018). More importantly, our method’s final outputs (the SVPG-based sampling strategy and discriminator) are reusable and can be used to train new agents as shown in Appendix A, whereas a trained RARL adversary would overpower any new agent and impede learning progress.

5.2. Active Learning and Informative Samples

Active learning methods in supervised learning try to construct a representative, sometimes time-variant, dataset from a large pool of unlabelled data by proposing elements to be labelled. The chosen samples are labelled by an oracle and sent back to the model for use. Similarly, ADR searches for what environments may be most useful to the agent at any given time. Active learners, like the BO methods discussed in Section 3, often require an acquisition function (derived from a notion of model uncertainty) to choose the next sample. Since ADR handles this decision through the explore-exploit framework of RL and the α in SVPG, ADR sidesteps the well-known scalability issues of both active learning and BO (Tong, 2001).

Recently, Toneva et al. (2018) showed that certain examples in popular computer vision datasets are harder for networks to learn, and that some examples generalize (or are forgotten) much quicker than others. We explore the same phenomenon in the space of MDPs defined by our randomization ranges, and try to find the “examples” that cause our agent the most trouble. Unlike the active learning setting or Toneva et al. (2018), we have no oracle or supervisory loss signal in RL, and instead attempt to learn a proxy signal for ADR via a discriminator.


5.3. Generalization in Reinforcement Learning

Generalization in RL has long been one of the holy grails of the field, and recent work like Packer et al. (2018), Cobbe et al. (2018), and Farebrother et al. (2018) highlights the tendency of deep RL policies to overfit to details of the training environment. Our experiments exhibit the same phenomenon, but our method improves upon the state of the art by explicitly searching for and varying the environment aspects that our agent policy may have overfit to. We find that our agents, when trained more frequently on these problematic samples, show better generalization while also improving interpretability of both the randomization ranges’ and agent policy’s weaknesses.

6. Conclusion

In this work, we highlight failure cases of traditional domain randomization, and propose active domain randomization (ADR), a general method capable of finding the most informative parts of the randomization parameter space for a reinforcement learning agent to train on. ADR does this by posing the search as a reinforcement learning problem, and optimizes for the most informative environments using a learned reward and multiple policies. We show on a wide variety of simulated environments that this method efficiently trains agents with better generalization than traditional domain randomization, extends well to high-dimensional parameter spaces, and produces more robust policies when transferring to the real world.

Acknowledgements

The authors gratefully acknowledge the Natural Sciences and Engineering Research Council of Canada (NSERC), the Fonds de Recherche Nature et Technologies Quebec (FQRNT) and the Open Philanthropy Project for supporting this work. In addition, the authors would like to thank Kyle Kastner and members of the REAL Lab for their helpful comments.

References

Andrychowicz, M., Baker, B., Chociej, M., Jozefowicz, R., McGrew, B., Pachocki, J., Petron, A., Plappert, M., Powell, G., Ray, A., et al. Learning dexterous in-hand manipulation. arXiv preprint arXiv:1808.00177, 2018.

Bongard, J. and Lipson, H. Once more unto the breach: Co-evolving a robot and its simulator. In Proceedings of the Ninth International Conference on the Simulation and Synthesis of Living Systems (ALIFE9), 2004.

Brochu, E., Cora, V. M., and de Freitas, N. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. CoRR, abs/1012.2599, 2010. URL http://arxiv.org/abs/1012.2599.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym, 2016.

Chebotar, Y., Handa, A., Makoviychuk, V., Macklin, M., Issac, J., Ratliff, N., and Fox, D. Closing the sim-to-real loop: Adapting simulation randomization with real world experience. arXiv preprint arXiv:1810.05687, 2018.

Cobbe, K., Klimov, O., Hesse, C., Kim, T., and Schulman, J. Quantifying generalization in reinforcement learning. CoRR, abs/1812.02341, 2018.

Coumans, E. Bullet physics simulation. In ACM SIGGRAPH 2015 Courses, SIGGRAPH ’15, New York, NY, USA, 2015. ACM.

Eysenbach, B., Gupta, A., Ibarz, J., and Levine, S. Diversity is all you need: Learning skills without a reward function. arXiv preprint arXiv:1802.06070, 2018.

Farebrother, J., Machado, M. C., and Bowling, M. Generalization and regularization in DQN, 2018.

Fujimoto, S., van Hoof, H., and Meger, D. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, 2018.

Gangwani, T., Liu, Q., and Peng, J. Learning self-imitating diverse policies. In International Conference on Learning Representations, 2019.

Golemo, F., Taiga, A. A., Courville, A., and Oudeyer, P.-Y. Sim-to-real transfer with neural-augmented robot simulation. In Conference on Robot Learning, 2018.

Haarnoja, T., Pong, V., Zhou, A., Dalal, M., Abbeel, P., and Levine, S. Composable deep reinforcement learning for robotic manipulation. arXiv preprint arXiv:1803.06773, 2018.

Khirodkar, R., Yoo, D., and Kitani, K. M. Vadra: Visual adversarial domain randomization and augmentation. arXiv preprint arXiv:1812.00491, 2018.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization, 2014.

Kirkpatrick, J., Pascanu, R., Rabinowitz, N. C., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., Hassabis, D., Clopath, C., Kumaran, D., and Hadsell, R. Overcoming catastrophic forgetting in neural networks. CoRR, abs/1612.00796, 2016. URL http://arxiv.org/abs/1612.00796.


Lapeyre, M. Poppy: open-source, 3D printed and fully-modular robotic platform for science, art and education. Theses, Universite de Bordeaux, November 2014. URL https://hal.inria.fr/tel-01104641.

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

Liu, Q. and Wang, D. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 29. 2016.

Liu, Y., Ramachandran, P., Liu, Q., and Peng, J. Stein variational policy gradient, 2017.

Mescheder, L., Geiger, A., and Nowozin, S. Which training methods for GANs do actually converge? In International Conference on Machine Learning, 2018.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937, 2016.

Packer, C., Gao, K., Kos, J., Krähenbühl, P., Koltun, V., and Song, D. Assessing generalization in deep reinforcement learning, 2018.

Peng, X. B., Andrychowicz, M., Zaremba, W., and Abbeel, P. Sim-to-real transfer of robotic control with dynamics randomization. 2018 IEEE International Conference on Robotics and Automation (ICRA), May 2018.

Pinto, L., Davidson, J., Sukthankar, R., and Gupta, A. Robust adversarial reinforcement learning, 2017.

Ruiz, N., Schulter, S., and Chandraker, M. Learning to simulate, 2018.

Sadeghi, F. and Levine, S. (cad)$^2$rl: Real single-image flight without a single real image. CoRR, abs/1611.04201, 2016. URL http://arxiv.org/abs/1611.04201.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An introduction. MIT Press, 2018.

Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., and Abbeel, P. Domain randomization for transferring deep neural networks from simulation to the real world. In Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on. IEEE, 2017.

Todorov, E., Erez, T., and Tassa, Y. Mujoco: A physics engine for model-based control. In IROS. IEEE, 2012.

Toneva, M., Sordoni, A., Combes, R. T. d., Trischler, A., Bengio, Y., and Gordon, G. J. An empirical study of example forgetting during deep neural network learning. arXiv preprint arXiv:1812.05159, 2018.

Tong, S. Active Learning: Theory and Applications. PhD thesis, 2001. AAI3028187.

Wang, Z., Zoghi, M., Hutter, F., Matheson, D., and De Freitas, N. Bayesian optimization in high dimensions via random embeddings. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, IJCAI ’13, 2013.

Zagal, J. C., Ruiz-del Solar, J., and Vallejos, P. Back to reality: Crossing the reality gap in evolutionary robotics. IFAC Proceedings Volumes, 37(8), 2004.

Ziebart, B. D. Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy. PhD thesis, CMU, 2010.


A. Bootstrapping Training of New Agents

Unlike DR, ADR’s learned sampling strategy and discriminator can be reused to train new agents from scratch. To test the transferability of the sampling strategy, we first train an instance of ADR on LunarLander-v2, and then extract the SVPG particles and discriminator. We then replace the agent policy with a random network initialization, and once again train according to the details in Section 4.1. From Figure 11(a), it can be seen that the bootstrapped agent’s generalization is even better than that of the one learned with ADR from scratch. However, its training speed on the default environment ($\xi_{MES} = 13$) is relatively lower.


Figure 11. Generalization and default environment learning progression on LunarLander-v2 when using ADR to bootstrap a new policy. Higher is better.

B. Uniform Domain Randomization

Here we review the algorithm for Uniform Domain Randomization (UDR), first proposed by Tobin et al. (2017), shown in Algorithm 2.

Algorithm 2 Uniform Sampling Domain Randomization

Input: $\Xi$: randomization space, $S$: simulator
Initialize $\pi_\theta$: agent policy
for each episode do
    // Uniformly sample parameters
    for $i = 1$ to $N_{rand}$ do
        $\xi^{(i)} \sim U\left[\xi^{(i)}_{low}, \xi^{(i)}_{high}\right]$
    end for
    // Generate, rollout in randomized env.
    $E_i \leftarrow S(\xi_i)$
    rollout $\tau_i \sim \pi_\theta(\cdot; E_i)$
    $T_{rand} \leftarrow T_{rand} \cup \tau_i$
    for each gradient step do
        // Agent policy update
        with $T_{rand}$ update:
            $\theta \leftarrow \theta + \nu \nabla_\theta J(\pi_\theta)$
    end for
end for
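The loop structure of Algorithm 2 can be sketched in a few lines of Python. This is an illustration of the control flow only, under the assumption that the simulator, rollout, and policy update are supplied as callables. The names `make_env`, `rollout`, and `update` are hypothetical placeholders and do not correspond to functions in the authors' repository.

```python
import random


def sample_uniform_params(ranges, rng):
    """Draw one value per randomization parameter, uniformly within its range."""
    return [rng.uniform(low, high) for (low, high) in ranges]


def uniform_domain_randomization(ranges, make_env, rollout, update,
                                 n_episodes, n_grad_steps, seed=0):
    """Control flow of Algorithm 2 with hypothetical callables.

    make_env(xi) -> env     : builds a randomized environment (S in Algorithm 2)
    rollout(env) -> traj    : collects one trajectory with the agent policy
    update(buffer) -> None  : performs one policy gradient step on the buffer
    """
    rng = random.Random(seed)
    buffer = []  # plays the role of T_rand
    for _ in range(n_episodes):
        xi = sample_uniform_params(ranges, rng)   # uniformly sample parameters
        env = make_env(xi)                        # generate randomized env
        buffer.append(rollout(env))               # roll out and store trajectory
        for _ in range(n_grad_steps):
            update(buffer)                        # agent policy update
    return buffer


# Toy usage with stub callables (no real simulator or policy involved).
ranges = [(8.0, 20.0)]  # one parameter, e.g. the LunarLander-v2 train range in Table 1
trajectories = uniform_domain_randomization(
    ranges,
    make_env=lambda xi: {"params": xi},
    rollout=lambda env: env["params"],
    update=lambda buffer: None,
    n_episodes=3,
    n_grad_steps=1,
)
```

Passing the environment-specific pieces in as callables keeps the sketch independent of any particular simulator or RL library.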

C. Environment Details

Please see Table 1.

C.1. Catastrophic Failure States in ErgoReacher

In Figure 12, we show an example progression to a catastrophic failure state in the held-out, simulated target environment of ErgoReacher-v0, with extremely low torque and gain values.

Figure 12. An example progression (left to right) of an agent moving to a catastrophic failure state (Panel 4) in the hard ErgoReacher-v0 environment.

D. Untruncated Plots for Lunar Lander

Figure 13. Generalization on LunarLander-v2 for an expert interval selection, ADR, and UDR. Higher is better.

All policies on Lunar Lander described in our paper receive a Solved score when the engine strengths are above 12, which is why truncated plots are shown in the main document. For clarity, we show the full, untruncated plot in Figure 13.

E. Network Architectures and Experimental Hyperparameters

All experiments can be reproduced using our GitHub repository (https://github.com/montrealrobotics/active-domainrand).

All of our experiments use the same network architectures and experiment hyperparameters, except for the number of particles N. For any experiment with LunarLander-v2, we use N = 10. For both other environments, we use N = 15. All other hyperparameters and network architectures remain constant, which we detail below. All networks use the Adam optimizer (Kingma & Ba, 2014).

Table 1. We summarize the environments used, as well as characteristics about the randomizations performed in each environment.

| Environment | $N_{rand}$ | Types of Randomizations | Train Ranges | Test Ranges |
|---|---|---|---|---|
| LunarLander-v2 | 1 | Main Engine Strength | [8, 20] | [8, 11] |
| Pusher-3DOF-v0 | 2 | Puck Friction Loss & Puck Joint Damping | [0.67, 1.0] × default | [0.5, 0.67] × default |
| ErgoReacher-v0 | 8 | Joint Damping | [0.3, 2.0] × default | 0.2 × default |
| | | Joint Max Torque | [1.0, 4.0] × default | default |

We run Algorithm 1 until 1 million agent timesteps are reached, i.e. the agent policy takes 1M steps in the randomized environments. We also cap each episode at a particular number of timesteps according to the documentation associated with Brockman et al. (2016). In particular, LunarLander-v2 has an episode time limit of 1000 environment timesteps, whereas both Pusher-3DOF-v0 and ErgoReacher-v0 use an episode time limit of 100 timesteps.

For our agent policy, we use an implementation of DDPG (particularly, OurDDPG.py) from the GitHub repository associated with Fujimoto et al. (2018). The actor and critic both have two hidden layers of 400 and 300 neurons respectively, and use ReLU activations. Our discriminator-based rewarder is a two-layer neural network, both layers having 128 neurons. The hidden layers use tanh activation, and the network outputs a sigmoid for prediction.

The agent particles in SVPG are parameterized by a two-layer actor-critic architecture, both layers in both networks having 100 neurons. We use Advantage Actor-Critic (A2C) to calculate unbiased and low-variance gradient estimates. All of the hidden layers use tanh activation and are orthogonally initialized, with a learning rate of 0.0003 and discount factor γ = 0.99. They operate on a continuous space $\mathbb{R}^{N_{rand}}$, with each axis bounded to [0, 1]. We set the maximum step length to 0.05, and every 50 timesteps, we reset each particle and randomly initialize its state using an $N_{rand}$-dimensional uniform distribution. We use a temperature α = 10 with a Radial Basis Function (RBF) kernel with median baseline, as described in Liu et al. (2017), and an A2C policy gradient estimator (Mnih et al., 2016), although both the kernel and estimator could be substituted with alternative methods (Gangwani et al., 2019). To ensure diversity of environments throughout training, we always roll out the SVPG particles using a non-deterministic sample.
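The bookkeeping of a particle's state in the normalized parameter space can be sketched as follows. Only the constants (maximum step length 0.05, reset every 50 timesteps, unit bounds) come from the text. The per-axis clipping of the move, the clipping of the state back into [0, 1], and the linear rescaling to physical ranges in `to_env_params` are assumptions made for illustration, and the class and function names are hypothetical. The SVPG policy that chooses the actions is not shown.

```python
import random

MAX_STEP = 0.05      # maximum step length per timestep (from the text)
RESET_EVERY = 50     # particle reset interval in timesteps (from the text)


def clip(x, low, high):
    return max(low, min(high, x))


class Particle:
    """State of one sampling particle in the normalized space [0, 1]^n_rand."""

    def __init__(self, n_rand, rng):
        self.n_rand = n_rand
        self.rng = rng
        self.reset()

    def reset(self):
        # Uniform re-initialization over the unit hypercube.
        self.state = [self.rng.random() for _ in range(self.n_rand)]
        self.t = 0

    def step(self, action):
        # Assumption: each axis of the proposed move is clipped to MAX_STEP,
        # and the resulting state is clipped back into [0, 1].
        self.state = [
            clip(s + clip(a, -MAX_STEP, MAX_STEP), 0.0, 1.0)
            for s, a in zip(self.state, action)
        ]
        self.t += 1
        if self.t % RESET_EVERY == 0:
            self.reset()
        return list(self.state)


def to_env_params(state, ranges):
    """Assumed linear rescaling from [0, 1] to each parameter's range."""
    return [low + s * (high - low) for s, (low, high) in zip(state, ranges)]
```

Working in a normalized space means the same step length applies to every parameter regardless of its physical units, with the rescaling deferred until an environment is instantiated.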

For DDPG, we use a learning rate ν = 0.001, target update coefficient of 0.005, discount factor γ = 0.99, and batch size of 1000. We let the policy run for 1000 steps before any updates, and clip the actor’s actions to [−1, 1] as prescribed by each environment.

Our discriminator-based reward generator is a network with two 128-neuron layers, a learning rate of 0.0002, and a binary cross-entropy loss (i.e., is this a randomized or reference trajectory?). To calculate the reward for a trajectory in any environment, we split each trajectory into its $(s_t, a_t, s_{t+1})$ constituents, pass each tuple through the discriminator, and average the outputs; this average is then set as the reward for the trajectory. Our batch size is set to 128, and most importantly, as done in Eysenbach et al. (2018), we calculate the reward for examples before using those same examples to train the discriminator.
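The trajectory-level reward described above can be sketched as an average of per-transition discriminator outputs. In this minimal example, `toy_discriminator` is a hypothetical stand-in that simply returns a value in (0, 1); the discriminator in the paper is a trained neural network, and nothing here reproduces its behavior.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def toy_discriminator(s, a, s_next):
    """Hypothetical stand-in for the learned discriminator.

    Returns a value in (0, 1); the real model is a trained neural network.
    """
    return sigmoid(sum(s_next) - sum(s) + sum(a))


def trajectory_reward(states, actions, discriminator):
    """Average discriminator output over all (s_t, a_t, s_{t+1}) tuples."""
    if len(states) != len(actions) + 1:
        raise ValueError("expected one more state than actions")
    if not actions:
        raise ValueError("trajectory must contain at least one transition")
    scores = [
        discriminator(states[t], actions[t], states[t + 1])
        for t in range(len(actions))
    ]
    return sum(scores) / len(scores)


# Toy trajectory with two transitions in a one-dimensional state space.
states = [[0.0], [0.0], [0.0]]
actions = [[0.0], [0.0]]
reward = trajectory_reward(states, actions, toy_discriminator)
```

In a training loop following the ordering stated above, `trajectory_reward` would be called on a batch before that same batch is used for a discriminator update.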