
Self-Tuning Deep Reinforcement Learning

Tom Zahavy¹  Zhongwen Xu¹  Vivek Veeriah¹  Matteo Hessel¹  Junhyuk Oh¹  Hado van Hasselt¹  David Silver¹  Satinder Singh¹ ²

Abstract

Reinforcement learning (RL) algorithms often require expensive manual or automated hyperparameter searches in order to perform well on a new domain. This need is particularly acute in modern deep RL architectures, which often incorporate many modules and multiple loss functions. In this paper, we take a step towards addressing this issue by using metagradients (Xu et al., 2018) to tune these hyperparameters via differentiable cross validation, whilst the agent interacts with and learns from the environment. We present the Self-Tuning Actor Critic (STAC), which uses this process to tune the hyperparameters of the usual loss function of the IMPALA actor-critic agent (Espeholt et al., 2018), to learn the hyperparameters that define auxiliary loss functions, and to balance trade-offs in off-policy learning by introducing and adapting the hyperparameters of a novel leaky V-trace operator. The method is simple to use, sample efficient, and does not require a significant increase in compute. Ablative studies show that the overall performance of STAC improves as we adapt more hyperparameters. When applied to 57 games on the Atari 2600 environment over 200 million frames, our algorithm improves the median human-normalized score of the baseline from 243% to 364%.

1. Introduction

Reinforcement Learning (RL) algorithms often have multiple hyperparameters that require careful tuning; this is especially true for modern deep RL architectures, which often incorporate many modules and many loss functions. Training a deep RL agent is thus typically performed in two nested optimization loops. In the inner training loop, we fix a set of hyperparameters, and optimize the agent parameters with respect to these fixed hyperparameters. In the outer (manual or automated) tuning loop, we search for good hyperparameters, evaluating them either in terms of their inner-loop performance, or in terms of their performance on some special validation data. Inner-loop training is typically differentiable and can be efficiently performed using back propagation, while the optimization of the outer loop is typically performed via gradient-free optimization. Since only the hyperparameters (but not the agent policy itself) are transferred between outer-loop iterations, we refer to this as a "multiple lifetime" approach. For example, random search for hyperparameters (Bergstra & Bengio, 2012) falls under this category, and population based training (Jaderberg et al., 2017) also shares many of its properties. The cost of relying on multiple lifetimes to tune hyperparameters is often not accounted for when the performance of algorithms is reported. The impact of this is mild when users can rely on hyperparameters established in the literature, but the cost of hyperparameter tuning across multiple lifetimes manifests itself when algorithms are applied to new domains.

This motivates a significant body of work on tuning hyperparameters online, within a single agent lifetime. Previous work in this area has often focused on solutions tailored to specific hyperparameters. For instance, Schaul et al. (2019) proposed a non-stationary bandit algorithm to adapt the exploration-exploitation trade-off. Mann et al. (2016) and White & White (2016) proposed algorithms to adapt λ (the eligibility trace coefficient). Rowland et al. (2019) introduced an α coefficient into V-trace to account for the variance-contraction trade-off in off-policy learning. In another line of work, metagradients were used to adapt the optimiser's parameters (Sutton, 1992; Snoek et al., 2012; Maclaurin et al., 2015; Pedregosa, 2016; Franceschi et al., 2017; Young et al., 2018).

¹ Work done at DeepMind, London. ² University of Michigan. Correspondence to: Tom Zahavy <[email protected]>.

In this paper we build on the metagradient approach to tune hyperparameters. In previous work, online metagradients have been used to learn the discount factor or the λ coefficient (Xu et al., 2018), to discover intrinsic rewards (Zheng et al., 2018) and auxiliary tasks (Veeriah et al., 2019). This is achieved by representing an inner (training) loss function as a function of both the agent parameters and a set of hyperparameters. In the inner loop, the parameters of the agent are trained to minimize this inner loss function w.r.t the current values of the hyperparameters; in the outer loop, the hyperparameters are adapted via back propagation to minimize the outer (validation) loss.

arXiv:2002.12928v2 [stat.ML] 2 Mar 2020


It is perhaps surprising that we may choose to optimize a different loss function in the inner loop, instead of the outer loss we ultimately care about. However, this is not a new idea. Regularization, for example, is a technique that changes the objective in the inner loop to balance the bias-variance trade-off and avoid the risk of overfitting. In model-based RL, it was shown that the policy found using a smaller discount factor can actually be better than a policy learned with the true discount factor (Jiang et al., 2015). Auxiliary tasks (Jaderberg et al., 2016) are another example, where gradients are taken w.r.t unsupervised loss functions in order to improve the agent's representation. Finally, it is well known that, in order to maximize the long-term cumulative reward efficiently (the objective of the outer loop), RL agents must explore, i.e., act according to a different objective in the inner loop (accounting, for instance, for uncertainty).

This paper makes the following contributions. First, we show that it is feasible to use metagradients to simultaneously tune many critical hyperparameters (controlling important trade-offs in a reinforcement learning agent), as long as they are differentiable w.r.t a validation/outer loss. Importantly, we show that this can be done online, within a single lifetime, requiring only 30% additional compute.

We demonstrate this by introducing two novel deep RL architectures that extend IMPALA (Espeholt et al., 2018), a distributed actor-critic, by adding additional components with many more new hyperparameters to be tuned. The first agent, referred to as a Self-Tuning Actor Critic (STAC), introduces a leaky V-trace operator that mixes importance sampling (IS) weights with truncated IS weights. The mixing coefficient in leaky V-trace is differentiable (unlike the original V-trace) but similarly balances the variance-contraction trade-off in off-policy learning. The second architecture is STACX (STAC with auXiliary tasks). Inspired by Fedus et al. (2019), STACX augments STAC with parametric auxiliary loss functions (each with its own hyperparameters).

These agents allow us to show empirically, through extensive ablation studies, that performance consistently improves as we expose more hyperparameters to metagradients. In particular, when applied to 57 Atari games (Bellemare et al., 2013), STACX achieved a normalized median score of 364%, a new state-of-the-art for online model-free agents.

2. Background

In the following, we consider three types of parameters:

1. θ – the agent parameters.

2. ζ – the hyperparameters.

3. η ⊂ ζ – the metaparameters.

θ denotes the parameters of the agent and parameterises, for example, the value function and the policy; these parameters are randomly initialised at the beginning of an agent's lifetime, and updated using back-propagation on a suitable inner loss function. ζ denotes the hyperparameters, including, for example, the parameters of the optimizer (e.g. the learning rate) or the parameters of the loss function (e.g. the discount factor); these may be tuned over the course of many lifetimes (for instance via random search) to optimize an outer (validation) loss function. In a typical deep RL setup, only these first two types of parameters need to be considered. In metagradient algorithms a third set of parameters must be specified: the metaparameters, denoted η; these are a subset of the differentiable parameters in ζ that start with some initial value (itself a hyperparameter), but that are then adapted during the course of training.

2.1. The metagradient approach

Metagradient RL (Xu et al., 2018) is a general framework for adapting, online, within a single lifetime, the differentiable hyperparameters η. Consider an inner loss that is a function of both the parameters θ and the metaparameters η: Linner(θ; η). On each step of an inner loop, θ can be optimized with a fixed η to minimize the inner loss Linner(θ; η):

θ(ηt) ≐ θt+1 = θt − ∇θ Linner(θt; ηt). (1)

In an outer loop, η can then be optimized to minimize the outer loss by taking a metagradient step. As θ(η) is a function of η, this corresponds to updating the η parameters by differentiating the outer loss w.r.t η,

ηt+1 = ηt −∇ηLouter(θ(ηt)). (2)

The algorithm is general, as it implements a specific case of online cross-validation, and can be applied, in principle, to any differentiable metaparameter η used by the inner loss.
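To make the two nested updates of Eqs. (1) and (2) concrete, here is a minimal sketch in JAX (the language of our code base); the quadratic toy losses and all names (inner_loss, outer_loss, inner_step, meta_step) are illustrative stand-ins, not the agent's actual losses.

```python
import jax
import jax.numpy as jnp

# Toy stand-ins for the real losses: both are differentiable in theta,
# and the inner loss additionally depends on the metaparameters eta.
def inner_loss(theta, eta, batch):
    return jnp.mean(eta * (theta - batch) ** 2)

def outer_loss(theta, batch):
    return jnp.mean((theta - batch) ** 2)

def inner_step(theta, eta, batch, lr=0.1):
    # Eq. (1): one gradient step on the inner loss, holding eta fixed.
    return theta - lr * jax.grad(inner_loss)(theta, eta, batch)

def meta_step(theta, eta, batch, meta_lr=1e-3):
    # Eq. (2): differentiate the outer loss through theta(eta).
    def outer_of_eta(eta):
        return outer_loss(inner_step(theta, eta, batch), batch)
    return eta - meta_lr * jax.grad(outer_of_eta)(eta)

theta, eta = jnp.array(0.0), jnp.array(1.0)
batch = jnp.array([0.5, 1.5])
for _ in range(5):
    eta_next = meta_step(theta, eta, batch)   # outer loop (Eq. 2)
    theta = inner_step(theta, eta, batch)     # inner loop (Eq. 1)
    eta = eta_next
```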

2.2. IMPALA

Specific instantiations of the metagradient RL framework require specification of the inner and outer loss functions. Since our agent builds on the IMPALA actor-critic agent (Espeholt et al., 2018), we now provide a brief introduction.

IMPALA maintains a policy πθ(at|xt) and a value function vθ(xt) that are parameterized with parameters θ. The policy and the value function are trained via an actor-critic update with entropy regularization; such an update is often represented (with slight abuse of notation) as the gradient of the following pseudo-loss function

L(θ) = gv · (vs − Vθ(xs))²
     − gp · ρs log πθ(as|xs)(rs + γ vs+1 − Vθ(xs))
     − ge · ∑a πθ(a|xs) log πθ(a|xs), (3)


where gv, gp, ge are suitable loss coefficients. We refer to the policy that generates the data for these updates as the behaviour policy µ(at|xt). In the on-policy case, where µ(at|xt) = π(at|xt) and ρs = 1, vs is the n-step bootstrapped return vs = ∑_{t=s}^{s+n−1} γ^{t−s} rt + γ^n V(xs+n).
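As a reading aid, the sketch below evaluates the three terms of Eq. (3) for a length-T rollout, assuming the targets vs and the weights ρs have already been computed; the shapes and argument names are illustrative, not taken from the IMPALA implementation.

```python
import jax
import jax.numpy as jnp

def impala_pseudo_loss(v_targets, values, rewards, log_pi_a, rho,
                       policy_logits, gamma, g_v, g_p, g_e):
    """Literal transcription of Eq. (3) for a length-T rollout.

    v_targets:     bootstrapped targets v_s,  shape [T + 1]
    values:        V_theta(x_s),              shape [T]
    rewards:       r_s,                       shape [T]
    log_pi_a:      log pi_theta(a_s | x_s),   shape [T]
    rho:           importance weights rho_s,  shape [T]
    policy_logits: logits over the A actions, shape [T, A]
    """
    # Critic term: squared error to the targets.
    critic = jnp.mean((v_targets[:-1] - values) ** 2)
    # Policy-gradient term with advantage r_s + gamma * v_{s+1} - V(x_s).
    adv = rewards + gamma * v_targets[1:] - values
    pg = jnp.mean(rho * log_pi_a * adv)
    # Entropy-related term, written exactly as in Eq. (3): sum_a pi log pi.
    probs = jax.nn.softmax(policy_logits)
    pi_log_pi = jnp.mean(jnp.sum(probs * jnp.log(probs + 1e-8), axis=-1))
    return g_v * critic - g_p * pg - g_e * pi_log_pi
```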

IMPALA uses a distributed actor-critic architecture that assigns copies of the policy parameters to multiple actors on different machines to achieve higher sample throughput. As a result, the target policy π on the learner machine can be several updates ahead of the actor's policy µ that generated the data used in an update. Such off-policy discrepancy can lead to biased updates, requiring the updates to be weighted with IS weights for stable learning. Specifically, IMPALA (Espeholt et al., 2018) uses truncated IS weights to balance the variance-contraction trade-off on these off-policy updates. This corresponds to instantiating Eq. (3) with

vs = V(xs) + ∑_{t=s}^{s+n−1} γ^{t−s} (Π_{i=s}^{t−1} ci) δtV, (4)

where we define δtV = ρt(rt + γV(xt+1) − V(xt)) and we set ρt = min(ρ, π(at|xt)/µ(at|xt)), ci = λ min(c, π(ai|xi)/µ(ai|xi)) for suitable truncation levels ρ and c.
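The targets of Eq. (4) can be computed with a simple backward recursion, since vs − V(xs) = δsV + γ cs (vs+1 − V(xs+1)). The following sketch illustrates this; it is a minimal single-trajectory version, not the batched implementation used in IMPALA.

```python
import jax.numpy as jnp

def vtrace_targets(values, rewards, log_is, gamma, lam=1.0,
                   rho_bar=1.0, c_bar=1.0):
    """V-trace targets v_s of Eq. (4) for a single length-T rollout.

    values:  V(x_0), ..., V(x_T)  (T + 1 entries, last one is the bootstrap)
    rewards: r_0, ..., r_{T-1}
    log_is:  log(pi(a_t|x_t) / mu(a_t|x_t)) for t = 0, ..., T-1
    """
    ratios = jnp.exp(log_is)
    rhos = jnp.minimum(rho_bar, ratios)          # rho_t
    cs = lam * jnp.minimum(c_bar, ratios)        # c_t
    deltas = rhos * (rewards + gamma * values[1:] - values[:-1])   # delta_t V

    # Backward accumulation of v_s - V(x_s).
    acc, diffs = 0.0, []
    for t in reversed(range(len(rewards))):
        acc = deltas[t] + gamma * cs[t] * acc
        diffs.append(acc)
    return values[:-1] + jnp.stack(diffs[::-1])  # v_s
```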

2.3. Metagradient IMPALA

The metagradient agent in (Xu et al., 2018) uses the metagradient update rules from the previous section with the actor-critic loss function of Espeholt et al. (2018). More specifically, the inner loss is a parameterised version of the IMPALA loss with metaparameters η, based on Eq. (3),

L(θ; η) = 0.5 · (vs − Vθ(xs))²
        − ρs log πθ(as|xs)(rs + γ vs+1 − Vθ(xs))
        − 0.01 · ∑a πθ(a|xs) log πθ(a|xs),

where η = {γ, λ}. Notice that γ and λ also affect the inner loss through the definition of vs (Eq. (4)).

The outer loss is defined to be the policy gradient loss

Louter(η) = Louter(θ(η)) = −ρs log πθ(η)(as|xs)(rs + γ vs+1 − Vθ(η)(xs)).

3. Self-Tuning actor-critic agents

We first consider a slightly extended version of the metagradient IMPALA agent (Xu et al., 2018). Specifically, we allow the metagradient to adapt learning rates for individual components in the loss,

L(θ; η) = gv · (vs − Vθ(xs))²
        − gp · ρs log πθ(as|xs)(rs + γ vs+1 − Vθ(xs))
        − ge · ∑a πθ(a|xs) log πθ(a|xs). (5)

The outer loss is defined using Eq. (3),

Louter(η) = Louter(θ(η))
          = gv^outer · (vs − Vθ(η)(xs))²
          − gp^outer · ρs log πθ(η)(as|xs)(rs + γ vs+1 − Vθ(η)(xs))
          − ge^outer · ∑a πθ(η)(a|xs) log πθ(η)(a|xs)
          + gkl^outer · KL(πθ(η), πθ). (6)

Notice the new Kullback–Leibler (KL) term in Eq. (6), motivating the η-update not to change the policy too much.

Compared to the work by Xu et al. (2018) that only self-tuned the inner-loss γ and λ (hence η = {γ, λ}), also self-tuning the inner-loss coefficients corresponds to setting η = {γ, λ, gv, gp, ge}. These metaparameters allow for loss-specific learning rates and support dynamically balancing exploration with exploitation by adapting the entropy loss weight¹. The hyperparameters of STAC include the initialisations of the metaparameters, the hyperparameters of the outer loss {γ^outer, λ^outer, gv^outer, gp^outer, ge^outer}, the KL coefficient gkl^outer (set to 1), and the learning rate of the ADAM meta optimizer (set to 1e−3).

        IMPALA                 Self-Tuning IMPALA
θ       vθ, πθ                 vθ, πθ
ζ       {γ, λ, gv, gp, ge}     {γ^outer, λ^outer, gv^outer, gp^outer, ge^outer},
                               initialisations, ADAM parameter, gkl^outer
η       –                      {γ, λ, gv, gp, ge}

Table 1. Parameters in IMPALA and self-tuning IMPALA.

To set the initial values of the metaparameters of self-tuning IMPALA, we use a simple "rule of thumb" and set them to the values of the corresponding parameters in the outer loss (e.g., the initial value of the inner-loss λ is set equal to the outer-loss λ). For outer-loss hyperparameters that are common to IMPALA we default to the IMPALA settings.
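A minimal sketch of this rule of thumb (the values shown are illustrative placeholders; the actual defaults follow IMPALA and Section 4):

```python
# Outer-loss hyperparameters, fixed throughout training (illustrative values).
outer = dict(gamma=0.995, lam=1.0, g_v=1.0, g_p=1.0, g_e=1.0)

# Rule of thumb: each inner-loss metaparameter starts at the value of the
# corresponding outer-loss hyperparameter, and is then adapted online.
eta_init = dict(outer)
```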

In the next two sections, we show how embracing self-tuning via metagradients enables us to augment this agent with a parameterised Leaky V-trace operator and with self-tuned auxiliary loss functions. These ideas are examples of how the ability to self-tune metaparameters via metagradients can be used to introduce novel ideas into RL algorithms without requiring extensive tuning of the new hyperparameters.

¹There are a few additional subtle differences between this self-tuning IMPALA agent and the metagradient agent from Xu et al. (2018). For example, we do not use the γ embedding used in (Xu et al., 2018). These differences are further discussed in the supplementary, where we also reproduce the results of Xu et al. (2018) in our code base.


3.1. STAC

All the hyperparameters that we considered for self-tuning so far have the property that they are explicitly defined in the definition of the loss function and can be directly differentiated. The truncation levels in the V-trace operator within IMPALA, on the other hand, are equivalent to applying a ReLU activation and are non-differentiable.

Motivated by the study of non-linear activations in Deep Learning (Xu et al., 2015), we now introduce an agent based on a variant of the V-trace operator that we call leaky V-trace. We will refer to this agent as the Self-Tuning Actor Critic (STAC). Leaky V-trace uses a leaky rectifier (Maas et al., 2013) to truncate the importance sampling weights, which allows for a small non-zero gradient when the unit is saturated. We show that the degree of leakiness can control certain trade-offs in off-policy learning, similarly to V-trace, but in a manner that is differentiable.

Before we introduce Leaky V-trace, let us first recall how the off-policy trade-offs are represented in V-trace using the coefficients ρ, c. The weight ρt = min(ρ, π(at|xt)/µ(at|xt)) appears in the definition of the temporal difference δtV and defines the fixed point of this update rule. The fixed point of this update is the value function V^πρ of the policy πρ, which is somewhere between the behaviour policy µ and the target policy π, controlled by the hyperparameter ρ,

πρ(a|x) = min(ρ µ(a|x), π(a|x)) / ∑b min(ρ µ(b|x), π(b|x)). (7)

The product of the weights cs, ..., ct−1 in Eq. (4) measures how much a temporal difference δtV observed at time t impacts the update of the value function. The truncation level c is used to control the speed of convergence by trading off the update variance for a larger contraction rate, similar to Retrace(λ) (Munos et al., 2016). By clipping the importance weights, the variance associated with the update rule is reduced relative to importance-weighted returns. On the other hand, the clipping of the importance weights effectively cuts the traces in the update, resulting in the update placing less weight on later TD errors, and thus worsening the contraction rate of the corresponding operator.

Following this interpretation of the off-policy coefficients, we now propose a variation of V-trace, which we call leaky V-trace, with new parameters αρ ≥ αc:

ISt = π(at|xt)/µ(at|xt),
ρt = αρ min(ρ, ISt) + (1 − αρ) ISt,
ci = λ (αc min(c, ISi) + (1 − αc) ISi),
vs = V(xs) + ∑_{t=s}^{s+n−1} γ^{t−s} (Π_{i=s}^{t−1} ci) δtV,
δtV = ρt(rt + γV(xt+1) − V(xt)). (8)

We highlight that for αρ = 1, αc = 1, Leaky V-trace is exactly equivalent to V-trace, while for αρ = 0, αc = 0 it is equivalent to canonical importance sampling. For other values we get a mixture of the truncated and non-truncated importance sampling weights.
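In code, the only change relative to the V-trace sketch above is how the weights are formed; a minimal illustration of the mixing in Eq. (8), with αρ and αc passed in as ordinary differentiable scalars:

```python
import jax.numpy as jnp

def leaky_weights(log_is, alpha_rho, alpha_c, lam=1.0, rho_bar=1.0, c_bar=1.0):
    """Leaky V-trace weights of Eq. (8).

    alpha_rho = alpha_c = 1 recovers the truncated V-trace weights;
    alpha_rho = alpha_c = 0 recovers plain importance sampling.
    """
    ratios = jnp.exp(log_is)  # IS_t
    rhos = alpha_rho * jnp.minimum(rho_bar, ratios) + (1.0 - alpha_rho) * ratios
    cs = lam * (alpha_c * jnp.minimum(c_bar, ratios) + (1.0 - alpha_c) * ratios)
    return rhos, cs
```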

Theorem 1 below suggests that Leaky V-trace is a contraction mapping, and that the value function it converges to is given by V^{πρ,αρ}, where

πρ,αρ(a|x) = [αρ min(ρ µ(a|x), π(a|x)) + (1 − αρ) π(a|x)] / [αρ ∑b min(ρ µ(b|x), π(b|x)) + 1 − αρ], (9)

is a policy that mixes (and then re-normalizes) the target policy with the V-trace policy of Eq. (7)². A more formal statement of Theorem 1 and a detailed proof (which closely follows that of Espeholt et al. (2018) for the original V-trace operator) can be found in the supplementary material.

Theorem 1. The leaky V-trace operator defined by Eq. (8) is a contraction operator, and it converges to the value function of the policy defined by Eq. (9).

Similar to ρ, the new parameter αρ controls the fixed point of the update rule, and defines a value function that interpolates between the value function of the target policy π and that of the behaviour policy µ. Specifically, the parameter αc allows the importance weights to "leak back", creating the opposite effect to clipping.

Since Theorem 1 requires us to have αρ ≥ αc, our main STAC implementation parametrises the loss with a single parameter α = αρ = αc. In addition, we also experimented with a version of STAC that learns both αρ and αc. Quite interestingly, this variation of STAC learns the rule αρ ≥ αc on its own (see the experiments section for more details).

Note that low values of αc lead to importance sampling, which has high contraction but high variance. On the other hand, high values of αc lead to V-trace, which has lower contraction and lower variance than importance sampling. Thus, exposing αc to meta-learning enables STAC to directly control the contraction/variance trade-off.

In summary, the metaparameters for STAC are {γ, λ, gv, gp, ge, α}. To keep things simple, when using Leaky V-trace we make two simplifications w.r.t the hyperparameters. First, we use V-trace to initialise Leaky V-trace, i.e., we initialise α = 1. Second, we fix the outer loss to be V-trace, i.e., we set α^outer = 1.

²Note that α-trace (Rowland et al., 2019), another adaptive algorithm for off-policy learning, mixes the V-trace policy with the behaviour policy; Leaky V-trace mixes it with the target policy.


3.2. STAC with auxiliary tasks (STACX)

Next, we introduce a new agent that extends STAC with auxiliary policies, value functions, and respective auxiliary loss functions; this is new because the parameters that define the auxiliary tasks (the discount factors in this case) are self-tuned. As this agent has a new architecture in addition to an extended set of metaparameters, we give it a different acronym and denote it by STACX (STAC with auxiliary tasks). The auxiliary losses have the same parametric form as the main objective and can be used to regularize its objective and improve its representations.

Figure 1. Block diagrams of STAC and STACX: an observation is processed by a shared deep residual block, followed by MLP policy/value heads.

STACX's architecture has a shared representation layer θshared, from which it splits into n different heads (Fig. 1). For the shared representation layer we use the deep residual net from (Espeholt et al., 2018). Each head has a policy and a corresponding value function that are represented using a 2-layered MLP with parameters {θi}, i = 1, ..., n. Each one of these heads is trained in the inner loop to minimize a loss function L(θi; ηi) parametrised by its own set of metaparameters ηi.

The policy of the STACX agent is defined to be the policy of a specific head (i = 1). The hyperparameters {ηi}, i = 1, ..., n, are trained in the outer loop to improve the performance of this single head. Thus, the role of the auxiliary heads is to act as auxiliary tasks (Jaderberg et al., 2016) and improve the shared representation θshared. Finally, notice that each head has its own policy πi, but the behaviour policy is fixed to be π1. Thus, to optimize the auxiliary heads we use (Leaky) V-trace for off-policy corrections³.

³We also considered two extensions of this approach. (1) Random ensemble: the policy head is chosen at random from [1, ..., n], and the hyperparameters are differentiated w.r.t the performance of each one of the heads in the outer loop. (2) Average ensemble: the actor policy is defined to be the average logits of the heads, and we learn one additional head for the value function of this policy. The metagradient in the outer loop is taken with respect to the actor policy, and/or each one of the heads individually. While these extensions seem interesting, in all of our experiments they always led to a small decrease in performance when compared to our auxiliary task agent without these extensions. Similar findings were reported in (Fedus et al., 2019).

The metaparameters for STACX are {γ^i, λ^i, gv^i, gp^i, ge^i, α^i} for i = 1, 2, 3. Since the outer loss is defined only w.r.t head 1, introducing the auxiliary tasks into STACX does not require new hyperparameters for the outer loss. In addition, we use the same initialisation values for all the auxiliary tasks. Thus, STACX has exactly the same hyperparameters as STAC.
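To illustrate the head structure, here is a minimal functional sketch of a shared torso feeding n = 3 heads, each producing policy logits and a value; a toy MLP stands in for the deep residual torso, and all sizes and initialisations are arbitrary rather than the paper's.

```python
import jax
import jax.numpy as jnp

N_HEADS, OBS_DIM, HIDDEN, N_ACTIONS = 3, 16, 64, 6

def init_params(key):
    k_torso, *k_heads = jax.random.split(key, N_HEADS + 1)
    torso = 0.1 * jax.random.normal(k_torso, (OBS_DIM, HIDDEN))
    heads = [dict(pi=0.1 * jax.random.normal(k, (HIDDEN, N_ACTIONS)),
                  v=0.1 * jax.random.normal(k, (HIDDEN, 1)))
             for k in k_heads]
    return dict(torso=torso, heads=heads)

def forward(params, obs):
    # Shared representation theta_shared, then one (pi_i, v_i) pair per head.
    h = jax.nn.relu(obs @ params["torso"])
    return [(h @ head["pi"], (h @ head["v"])[..., 0]) for head in params["heads"]]

params = init_params(jax.random.PRNGKey(0))
outs = forward(params, jnp.zeros((OBS_DIM,)))
policy_logits, value = outs[0]   # head 1 acts; heads 2..n are auxiliary tasks
```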

Figure 2. Median human-normalized scores during training (over 200M frames) for IMPALA, STAC, STACX, the metagradient agent of Xu et al., and STACX with fixed auxiliary tasks.

4. Experiments

In all of our experiments (with the exception of the robustness experiments in Section 4.3) we use the IMPALA hyperparameters both for the IMPALA baseline and for the outer loss of the STAC agent, i.e., λ^outer = 1, α^outer = 1, gv^outer = gp^outer = ge^outer = 1. We use γ = 0.995 as it was found to improve the performance of IMPALA considerably (Xu et al., 2018).

4.1. Atari learning curves

We start by evaluating STAC and STACX in the Arcade Learning Environment (Bellemare et al., 2013, ALE). Fig. 2 presents the normalized median scores⁴ during training. We found STACX to learn faster and achieve higher final performance than STAC. We also compare these agents with versions of them without self-tuning (fixing the metaparameters). The version of STACX with fixed unsupervised auxiliary tasks achieved a normalized median score of 247%, similar to that of UNREAL (Jaderberg et al., 2016) but not much better than IMPALA. In Fig. 3 we report the relative improvement of STACX over IMPALA in the individual levels (an equivalent figure for STAC may be found in the supplementary material).


⁴Normalized median scores are computed as follows. For each Atari game, we compute the human-normalized score after 200M frames of training and average this over 3 different seeds; we then report the overall median score over the 57 Atari domains.
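A minimal sketch of this statistic (game names and scores are placeholders; the human and random reference scores are assumed to come from the standard Atari evaluation tables):

```python
import numpy as np

def human_normalized_median(agent, random, human):
    """agent, random, human: dicts mapping game name -> raw score.

    Per game, the agent score (already averaged over seeds) is normalized
    as (agent - random) / (human - random); the median is taken over games.
    """
    scores = [(agent[g] - random[g]) / (human[g] - random[g]) for g in agent]
    return float(np.median(scores))

# Toy usage with made-up numbers for two games.
print(human_normalized_median(agent={"pong": 20.0, "breakout": 400.0},
                              random={"pong": -20.7, "breakout": 1.7},
                              human={"pong": 14.6, "breakout": 30.5}))
```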


Figure 3. Mean human-normalized scores after 200M frames: relative improvement, in percent, of STACX over IMPALA on each of the 57 Atari games.

STACX achieved a median score of 364%, a new state-of-the-art result in the ALE benchmark for training online model-free agents for 200M frames. In fact, there are only two agents that reported better performance after 200M frames: LASER (Schmitt et al., 2019) achieved a normalized median score of 431% and MuZero (Schrittwieser et al., 2019) achieved 731%. These papers propose algorithmic modifications that are orthogonal to our approach and can be combined in future work; LASER combines IMPALA with a uniform large-scale experience replay; MuZero uses replay and a tree-based search with a learned model.

4.2. Ablative analysis

Next, we perform an ablative study of our approach by training different variations of STAC and STACX. The results are summarized in Fig. 4. In red, we can see different baselines⁵, and in green and blue, we can see different ablations of STAC and STACX respectively. In these ablations η corresponds to the subset of the hyperparameters that are being self-tuned, e.g., η = {γ} corresponds to self-tuning a single loss function where only γ is self-tuned (and the other hyperparameters are fixed).

Inspecting Fig. 4 we observe that the performance of STAC and STACX consistently improves as they self-tune more metaparameters. These metaparameters control different trade-offs in reinforcement learning: the discount factor controls the effective horizon, the loss coefficients affect learning rates, and the Leaky V-trace coefficient controls the variance-contraction-bias trade-off in off-policy RL.

We have also experimented with adding more auxiliary losses, i.e., having 5 or 8 auxiliary loss functions. These variations performed better than having a single loss function but slightly worse than having 3. This can be further explained by Fig. 9 (Section 4.4), which shows that the auxiliary heads are self-tuned to similar metaparameters.

Figure 4. Ablative studies of STAC (green) and STACX (blue), which self-tune progressively larger subsets of the metaparameters, compared to some baselines (red: Nature DQN, IMPALA, and Xu et al.). Median normalized scores across 57 Atari games after training for 200M frames.

Finally, we experimented with a few more versions of STACX. One variation allows STACX to self-tune both αρ and αc without enforcing the relation αρ ≥ αc. This version performed slightly worse than STACX (it achieved a median score of 353%) and we further discuss it in Section 4.4 (Fig. 10). In another variation we self-tune α together with a single truncation parameter ρ = c. This variation performed much worse, achieving a median score of 301%, which may be explained by ρ not being differentiable.

⁵We further discuss reproducing the results of Xu et al. (2018) in the supplementary material.


Figure 5. The number of games (y-axis) in which ablative variations of STAC and STACX improve over the IMPALA baseline by at least x percent (x-axis).

In Fig. 5, we further summarize the relative improvement of ablative variations of STAC (green, yellow, and blue; the bottom three lines) and STACX (light blue and red; the top two lines) over the IMPALA baseline. For each value of x ∈ [0, 100] (the x-axis), we measure the number of games in which an ablative version of the STAC(X) agent is better than the IMPALA agent by at least x percent and subtract from it the number of games in which the IMPALA agent is better than STAC(X). Clearly, we can see that STACX (light blue) improves the performance of IMPALA by a large margin. Moreover, we observe that allowing the STAC(X) agents to self-tune more metaparameters consistently improves their performance in more games.

4.3. Robustness

It is important to note that our algorithm is not hyperparameter free. For instance, we still need to choose hyperparameter settings for the outer loss. Additionally, each hyperparameter in the inner loss that we expose to metagradients still requires an initialization (itself a hyperparameter). Therefore, in this section, we investigate the robustness of STACX to its hyperparameters.

We begin with the hyperparameters of the outer loss. In these experiments we compare the robustness of STACX with that of IMPALA in the following manner. For each hyperparameter (γ, gv) we select 5 perturbations. For STACX we perturb the hyperparameter in the outer loss (γ^outer, gv^outer) and for IMPALA we perturb the corresponding hyperparameter (γ, gv). We randomly selected 5 Atari levels and present the mean and standard deviation across 3 random seeds after 200M frames of training.

Fig. 6 presents the results for the discount factor. We can see that overall (in 72% of the configurations measured by mean), STACX indeed performs better than IMPALA. Similarly, Fig. 7 shows the robustness of STACX to the critic weight (gv), where STACX improves over IMPALA in 80% of the configurations.

Figure 6. Robustness to the discount factor (values in {0.99, 0.992, 0.995, 0.997, 0.999}) on Chopper Command, Jamesbond, Defender, Beam Rider, and Crazy Climber. Mean and confidence intervals (over 3 seeds) after 200M frames of training. Left (blue) bars correspond to STACX and right (red) bars to IMPALA. STACX is better than IMPALA in 72% of the runs measured by mean.

Figure 7. Robustness to the critic weight gv (values in {0.25, 0.35, 0.5, 0.75, 1}) on Chopper Command, Jamesbond, Defender, Beam Rider, and Crazy Climber. Mean and confidence intervals (over 3 seeds) after 200M frames of training. Left (blue) bars correspond to STACX and right (red) bars to IMPALA. STACX is better than IMPALA in 80% of the runs measured by mean.

Next, we investigate the robustness of STACX to the initialisation of the metaparameters in Fig. 8. We selected values that are close to 1, as our design principle is to initialise the metaparameters to be similar to the hyperparameters in the outer loss. We observe that, overall, the method is quite robust to different initialisations.


Figure 8. Robustness to the initialisation of the metaparameters. Mean and confidence intervals (over 6 seeds) after 200M frames of training. Columns correspond to different games (Chopper Command, Jamesbond, Defender, Beam Rider, Crazy Climber). Bottom: perturbing the discount initialisation γ_init ∈ {0.99, 0.992, 0.995, 0.997, 0.999}. Top: perturbing all the metaparameter initialisations, i.e., setting all of {γ_init, λ_init, gv_init, ge_init, gp_init, α_init} for the three heads to a single fixed value in {0.99, 0.992, 0.995, 0.997, 0.999}.

4.4. Adaptivity

In Fig. 9 we visualize the metaparameters of STACX during training. As there are many metaparameters, seeds, and levels, we restrict ourselves to a single seed (arbitrarily chosen to be 1) and a single game (Jamesbond). More examples can be found in the supplementary material. For each hyperparameter we plot the value of the hyperparameters associated with the three different heads, where the policy head (head number 1) is presented in blue and the auxiliary heads (2 and 3) are presented in orange and magenta.

Inspecting Fig. 9 we can see that the two auxiliary heads self-tuned their metaparameters to have relatively similar values, but different than those of the main head. The discount factor of the main head, for example, converges to the value of the discount factor in the outer loss (0.995), while the discount factors of the auxiliary heads change quite a lot during training and learn about horizons that differ from that of the main head.

We also observe non-trivial behaviour in the self-tuning of the coefficients of the loss functions (ge, gp, gv), in the λ coefficient, and in the off-policy coefficient α. For instance, we found that at the beginning of training α is self-tuned to a high value (close to 1), so it is quite similar to V-trace; instead, towards the end of training STACX self-tunes it to lower values, which makes it closer to importance sampling.

Figure 9. Adaptivity in Jamesbond: the self-tuned metaparameters of heads 1–3 (including γ, λ, gp, gv, and ge) over 200M frames.

Finally, we also noticed an interesting behaviour in the version of STACX where we expose to self-tuning both the αρ and αc coefficients, without imposing αρ ≥ αc (Theorem 1). This variation of STACX achieved a median score of 353%. Quite interestingly, the metagradient discovered αρ ≥ αc on its own, i.e., it self-tunes αρ so that it is greater than or equal to αc 91.2% of the time (averaged over time, seeds, and levels), and so that αρ ≥ 0.99αc 99.2% of the time. Fig. 10 shows an example of this in Jamesbond.

Figure 10. Discovery of αρ ≥ αc: the self-tuned αρ and αc coefficients of the three heads in Jamesbond over 200M frames.

5. Summary

In this work we demonstrate that it is feasible to use metagradients to simultaneously tune many critical hyperparameters (controlling important trade-offs in a reinforcement learning agent), as long as they are differentiable; we show that this can be done online, within a single lifetime. We do so by presenting STAC and STACX, actor-critic algorithms that self-tune a large number of hyperparameters of very different nature. We showed that the performance of these agents improves as they self-tune more hyperparameters, and we demonstrated that STAC and STACX are computationally efficient and robust to their own hyperparameters.


References

Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.

Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. J. Mach. Learn. Res., 13(1):281–305, February 2012. ISSN 1532-4435.

Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., et al. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561, 2018.

Fedus, W., Gelada, C., Bengio, Y., Bellemare, M. G., and Larochelle, H. Hyperbolic discounting and learning over multiple horizons. arXiv preprint arXiv:1902.06865, 2019.

Franceschi, L., Donini, M., Frasconi, P., and Pontil, M. Forward and reverse gradient-based hyperparameter optimization. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 1165–1173. JMLR.org, 2017.

Jaderberg, M., Mnih, V., Czarnecki, W. M., Schaul, T., Leibo, J. Z., Silver, D., and Kavukcuoglu, K. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.

Jaderberg, M., Dalibard, V., Osindero, S., Czarnecki, W. M., Donahue, J., Razavi, A., Vinyals, O., Green, T., Dunning, I., Simonyan, K., et al. Population based training of neural networks. arXiv preprint arXiv:1711.09846, 2017.

Jiang, N., Kulesza, A., Singh, S., and Lewis, R. The dependence of effective planning horizon on model accuracy. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pp. 1181–1189, 2015.

Maas, A. L., Hannun, A. Y., and Ng, A. Y. Rectifier nonlinearities improve neural network acoustic models. In ICML Workshop on Deep Learning for Audio, Speech and Language Processing, 2013.

Maclaurin, D., Duvenaud, D., and Adams, R. Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning, pp. 2113–2122, 2015.

Mann, T. A., Penedones, H., Mannor, S., and Hester, T. Adaptive lambda least-squares temporal difference learning. arXiv preprint arXiv:1612.09465, 2016.

Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, pp. 1054–1062, 2016.

Pedregosa, F. Hyperparameter optimization with approximate gradient. In International Conference on Machine Learning, pp. 737–746, 2016.

Rowland, M., Dabney, W., and Munos, R. Adaptive trade-offs in off-policy learning. arXiv preprint arXiv:1910.07478, 2019.

Schaul, T., Borsa, D., Ding, D., Szepesvari, D., Ostrovski, G., Dabney, W., and Osindero, S. Adapting behaviour for learning progress. arXiv preprint arXiv:1912.06910, 2019.

Schmitt, S., Hessel, M., and Simonyan, K. Off-policy actor-critic with shared experience replay. arXiv preprint arXiv:1909.11583, 2019.

Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., et al. Mastering Atari, Go, chess and shogi by planning with a learned model. arXiv preprint arXiv:1911.08265, 2019.

Snoek, J., Larochelle, H., and Adams, R. P. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pp. 2951–2959, 2012.

Sutton, R. S. Adapting bias by gradient descent: An incremental version of delta-bar-delta. In AAAI, pp. 171–176, 1992.

Veeriah, V., Hessel, M., Xu, Z., Rajendran, J., Lewis, R. L., Oh, J., van Hasselt, H. P., Silver, D., and Singh, S. Discovery of useful questions as auxiliary tasks. In Advances in Neural Information Processing Systems, pp. 9306–9317, 2019.

White, M. and White, A. A greedy approach to adapting the trace parameter for temporal difference learning. In Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, pp. 557–565, 2016.

Xu, B., Wang, N., Chen, T., and Li, M. Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853, 2015.

Xu, Z., van Hasselt, H. P., and Silver, D. Meta-gradient reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2396–2407, 2018.

Young, K., Wang, B., and Taylor, M. E. Metatrace: Online step-size tuning by meta-gradient descent for reinforcement learning control. arXiv preprint arXiv:1805.04514, 2018.

Zheng, Z., Oh, J., and Singh, S. On learning intrinsic rewards for policy gradient methods. In Advances in Neural Information Processing Systems, pp. 4644–4654, 2018.


6. Additional results

Figure 11. Mean human-normalized scores after 200M frames: relative improvement, in percent, of STAC over the IMPALA baseline on each of the 57 Atari games.


7. Analysis of Leaky V-trace

Define the Leaky V-trace operator R:

RV(x) = V(x) + Eµ [ ∑_{t≥0} γ^t (Π_{i=0}^{t−1} ci) ρt (rt + γV(xt+1) − V(xt)) | x0 = x ], (10)

where the expectation Eµ is with respect to the behaviour policy µ which has generated the trajectory (xt)_{t≥0}, i.e., x0 = x, xt+1 ∼ P(·|xt, at), at ∼ µ(·|xt). Similar to (Espeholt et al., 2018), we consider the infinite-horizon operator, but very similar results hold for the n-step truncated operator.

Let

IS(xt) = π(at|xt)/µ(at|xt)

be the importance sampling weights, let

ρ̄t = min(ρ, IS(xt)), c̄t = min(c, IS(xt))

be the truncated importance sampling weights with ρ ≥ c, and let

ρt = αρ ρ̄t + (1 − αρ) IS(xt), ct = αc c̄t + (1 − αc) IS(xt)

be the Leaky importance sampling weights with leaky coefficients αρ ≥ αc.

Theorem 2 (Restatement of Theorem 1). Assume that there exists β ∈ (0, 1] such that Eµ ρ̄0 ≥ β. Then the operator R defined by Eq. (10) has a unique fixed point V^{πρ,αρ}, which is the value function of the policy πρ,αρ defined by

πρ,αρ(a|x) = [αρ min(ρ µ(a|x), π(a|x)) + (1 − αρ) π(a|x)] / [∑b αρ min(ρ µ(b|x), π(b|x)) + (1 − αρ) π(b|x)].

Furthermore, R is an η-contraction mapping in sup-norm, with

η = γ⁻¹ − (γ⁻¹ − 1) Eµ [ ∑_{t≥0} γ^t (Π_{i=0}^{t−2} ci) ρt−1 ] ≤ 1 − (1 − γ)(αρ β + 1 − αρ) < 1,

where c−1 = 1, ρ−1 = 1, and Π_{s=0}^{t−2} cs = 1 for t = 0, 1.

Proof. The proof follows the proof of V-trace from (Espeholt et al., 2018), with adaptations for the Leaky V-trace coefficients. We have that

RV1(x) − RV2(x) = Eµ ∑_{t≥0} γ^t (Π_{s=0}^{t−2} cs) [ρt−1 − ct−1 ρt] (V1(xt) − V2(xt)).

Denote by κt = ρt−1 − ct−1 ρt, and notice that

Eµ ρt = αρ Eµ ρ̄t + (1 − αρ) Eµ IS(xt) ≤ 1,

since Eµ IS(xt) = 1 and therefore Eµ ρ̄t ≤ 1. Furthermore, since ρ ≥ c and αρ ≥ αc, we have that ∀t, ρt ≥ ct. Thus, the coefficients κt are non-negative in expectation, since

Eµ κt = Eµ [ρt−1 − ct−1 ρt] ≥ Eµ [ct−1 (1 − ρt)] ≥ 0.

Thus, RV1(x) − RV2(x) is a linear combination of the values V1 − V2 at the other states, weighted by non-negative coefficients whose sum is


∑_{t≥0} γ^t Eµ (Π_{s=0}^{t−2} cs) [ρt−1 − ct−1 ρt]
= ∑_{t≥0} γ^t Eµ (Π_{s=0}^{t−2} cs) ρt−1 − ∑_{t≥0} γ^t Eµ (Π_{s=0}^{t−1} cs) ρt
= γ⁻¹ − (γ⁻¹ − 1) ∑_{t≥0} γ^t Eµ (Π_{s=0}^{t−2} cs) ρt−1
≤ γ⁻¹ − (γ⁻¹ − 1)(1 + γ Eµ ρ0)   (11)
= 1 − (1 − γ) Eµ ρ0
= 1 − (1 − γ) Eµ (αρ ρ̄0 + (1 − αρ) IS(x0))
≤ 1 − (1 − γ)(αρ β + 1 − αρ) < 1,   (12)

where Eq. (11) holds since we expanded only the first two elements in the sum and all the elements in this sum are positive, and Eq. (12) holds by the assumption.

We deduce that ‖RV1(x) − RV2(x)‖∞ ≤ η ‖V1(x) − V2(x)‖∞, with η = 1 − (1 − γ)(αρβ + 1 − αρ) < 1, so R is a contraction mapping. Furthermore, we can see that the parameter αρ controls the contraction rate: for αρ = 1 we get the contraction rate of V-trace, η = 1 − (1 − γ)β, and as αρ gets smaller we get better contraction, with αρ = 0 giving η = γ.

Thus R possesses a unique fixed point. Let us now prove that this fixed point is V^{πρ,αρ}, where

πρ,αρ(a|x) = [αρ min(ρ µ(a|x), π(a|x)) + (1 − αρ) π(a|x)] / [∑b αρ min(ρ µ(b|x), π(b|x)) + (1 − αρ) π(b|x)], (13)

is a policy that mixes the target policy with the V-trace policy.

We have:

Eµ [ρt (rt + γ V^{πρ,αρ}(xt+1) − V^{πρ,αρ}(xt)) | xt]
= Eµ [(αρ ρ̄t + (1 − αρ) IS(xt)) (rt + γ V^{πρ,αρ}(xt+1) − V^{πρ,αρ}(xt))]
= ∑a µ(a|xt) (αρ min(ρ, π(a|xt)/µ(a|xt)) + (1 − αρ) π(a|xt)/µ(a|xt)) (rt + γ V^{πρ,αρ}(xt+1) − V^{πρ,αρ}(xt))
= ∑a (αρ min(ρ µ(a|xt), π(a|xt)) + (1 − αρ) π(a|xt)) (rt + γ V^{πρ,αρ}(xt+1) − V^{πρ,αρ}(xt))
= ∑a πρ,αρ(a|xt) (rt + γ V^{πρ,αρ}(xt+1) − V^{πρ,αρ}(xt)) · ∑b (αρ min(ρ µ(b|xt), π(b|xt)) + (1 − αρ) π(b|xt)) = 0,

where we get that the left side (up to the summation over b) of the last equality equals zero since this is the Bellman equation for V^{πρ,αρ}. We deduce that R V^{πρ,αρ} = V^{πρ,αρ}, and thus V^{πρ,αρ} is the unique fixed point of R.


8. Reproducibility

Inspecting the results in Fig. 4, one may notice that there are small differences between the results of IMPALA and of using metagradients to tune only λ, γ, compared to the results that were reported in (Xu et al., 2018).

We investigated the possible reasons for these differences. First, our method was implemented in a different code base. Our code is written in JAX, compared to the implementation in (Xu et al., 2018) that was written in TensorFlow. This may explain the small difference in final performance between our IMPALA baseline (Fig. 4) and the result of Xu et al., which is slightly higher (257.1).

Second, Xu et al. observed that embedding the γ hyperparameter into the θ network improved their results significantly, reaching a final performance (when learning γ, λ) of 287.7 (see section 1.4 in (Xu et al., 2018) for more details). Our method, on the other hand, only achieved a score of 240 in this ablative study. We further investigated this difference by introducing the γ embedding into our architecture. With γ embedding, our method achieved a score of 280.6, which almost reproduces the results in (Xu et al., 2018). We then introduced the same embedding mechanism to our model with auxiliary loss functions. In this case, for auxiliary loss i we embed γi. We experimented with two variants, one that shares the embedding weights across the auxiliary tasks and one that learns a specific embedding for each auxiliary task. Both of these variants performed similarly (306.8 and 307.7, respectively), which is better than our previous result with embedding and without auxiliary losses, which achieved 280.6. Unfortunately, the auxiliary loss architecture actually performed better without the embedding (353.4), and we therefore ended up not using the γ embedding in our architecture. We leave it to future work to further investigate methods of combining the embedding mechanisms with the auxiliary loss functions.


9. Individual level learning curves

[Plots omitted: per-seed traces of the meta-parameters (panels labeled gp, gv, ge, among others) and episode reward over 200M frames for alien, amidar, assault, asterix, asteroids, and atlantis (seeds 1-3).]

Figure 12. Meta parameters and reward in each Atari game (and seed) during learning. Different colors correspond to different heads; blue is the main (policy) head.


[Plots omitted: per-seed traces of the meta-parameters (panels labeled gp, gv, ge, among others) and episode reward over 200M frames for bank_heist, battle_zone, beam_rider, berzerk, bowling, and boxing (seeds 1-3).]

Figure 13. Meta parameters and reward in each Atari game (and seed) during learning. Different colors correspond to different heads; blue is the main (policy) head.


[Plots omitted: per-seed traces of the meta-parameters (panels labeled gp, gv, ge, among others) and episode reward over 200M frames for breakout, centipede, chopper_command, crazy_climber, defender, and demon_attack (seeds 1-3).]

Figure 14. Meta parameters and reward in each Atari game (and seed) during learning. Different colors correspond to different heads; blue is the main (policy) head.


[Plots omitted: per-seed traces of the meta-parameters (panels labeled gp, gv, ge, among others) and episode reward over 200M frames for double_dunk, enduro, fishing_derby, freeway, frostbite, and gopher (seeds 1-3).]

Figure 15. Meta parameters and reward in each Atari game (and seed) during learning. Different colors correspond to different heads; blue is the main (policy) head.


[Plots omitted: per-seed traces of the meta-parameters (panels labeled gp, gv, ge, among others) and episode reward over 200M frames for gravitar, hero, ice_hockey, jamesbond, kangaroo, and krull (seeds 1-3).]

Figure 16. Meta parameters and reward in each Atari game (and seed) during learning. Different colors correspond to different heads; blue is the main (policy) head.


[Figure 17 plot panels omitted: per-seed traces of the meta-parameters (including gp, gv, ge) and episode reward over 200M environment frames for kung_fu_master, montezuma_revenge, ms_pacman, name_this_game, phoenix, and pitfall (seeds 1-3).]

Figure 17. Meta parameters and reward in each Atari game (and seed) during learning. Different colors correspond to different heads; blue is the main (policy) head.


[Figure 18 plot panels omitted: per-seed traces of the meta-parameters (including gp, gv, ge) and episode reward over 200M environment frames for pong, private_eye, qbert, riverraid, road_runner, and robotank (seeds 1-3).]

Figure 18. Meta parameters and reward in each Atari game (and seed) during learning. Different colors correspond to different heads; blue is the main (policy) head.


[Figure 19 plot panels omitted: per-seed traces of the meta-parameters (including gp, gv, ge) and episode reward over 200M environment frames for seaquest, skiing, solaris, space_invaders, star_gunner, and surround (seeds 1-3).]

Figure 19. Meta parameters and reward in each Atari game (and seed) during learning. Different colors correspond to different heads; blue is the main (policy) head.


[Figure 20 plot panels omitted: per-seed traces of the meta-parameters (including gp, gv, ge) and episode reward over 200M environment frames for tennis, time_pilot, tutankham, up_n_down, venture, and video_pinball (seeds 1-3).]

Figure 20. Meta parameters and reward in each Atari game (and seed) during learning. Different colors correspond to different heads; blue is the main (policy) head.


[Figure 21 plot panels omitted: per-seed traces of the meta-parameters (including gp, gv, ge) and episode reward over 200M environment frames for wizard_of_wor, yars_revenge, and zaxxon (seeds 1-3).]

Figure 21. Meta parameters and reward in each Atari game (and seed) during learning. Different colors correspond to different heads; blue is the main (policy) head.