TibGM: A Transferable and Information-Based Graphical Model Approach for Reinforcement Learning

Tameem Adel 1 Adrian Weller 1 2

Abstract

One of the challenges to reinforcement learning (RL) is scalable transferability among complex tasks. Incorporating a graphical model (GM), along with the rich family of related methods, as a basis for RL frameworks provides potential to address issues such as transferability, generalisation and exploration. Here we propose a flexible GM-based RL framework which leverages efficient inference procedures to enhance generalisation and transfer power. In our proposed transferable and information-based graphical model framework ‘TibGM’, we show the equivalence between our mutual information-based objective in the GM, and an RL consolidated objective consisting of a standard reward maximisation target and a generalisation/transfer objective. In settings where there is a sparse or deceptive reward signal, our TibGM framework is flexible enough to incorporate exploration bonuses depicting intrinsic rewards. We empirically verify improved performance and exploration power.

1. Introduction

Reinforcement learning (RL) is a powerful approach, yet standard RL algorithms do not scale well for complex or large tasks (Bakker & Schmidhuber, 2004b), and often suffer when facing problems with sparse rewards. Exceptions include a few hierarchical reinforcement learning (HRL) frameworks (Watkins, 1989; Kaelbling et al., 1996; Parr & Russell, 1998; Sutton et al., 1999; Dietterich, 2000; Bakker & Schmidhuber, 2004b), which have the potential to improve over standard RL in terms of scalability, and to handle non-Markovian environments.

1 Department of Engineering, University of Cambridge, UK. 2 The Alan Turing Institute, UK. Correspondence to: Tameem Adel <[email protected]>.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Other advantages of HRL include the potential to solve subtasks independently, the reuse of learnt representations among subtasks, improved efficiency due to the task decoupling and the reduced search space (Bakker & Schmidhuber, 2004a; Cao & Ray, 2012; Schmidhuber, 2015; Florensa et al., 2017), and efficient exploration at higher levels of the hierarchy (Levy et al., 2017).

Problems related to the notion of a hierarchy, such as how to learn or implement efficient inference queries, pave the way for paradigms like probabilistic graphical models (PGMs, Jordan, 1998; Jordan et al., 1999; Wainwright & Jordan, 2008; Koller & Friedman, 2009). The ability to cast a problem as a PGM facilitates a path between objectives written specifically to match our targets, and automated learning and inference techniques enabling efficient solutions to achieve the objectives.

Graphical models grant the possibility of beginning from precise objectives, expressing them in the interpretable form of a graph, and achieving the objectives via efficient inference. Compared to previous RL frameworks that have cast the problem in a PGM, we aim at taking full advantage of this cycle here. We propose a novel information theoretic objective that aims at both maximising ‘local’ reward, and facilitating transfer learning (transferability) and exploration, ultimately leading to improved ‘global’ reward maximisation. We take advantage of a latent space disentangled into components fit for our purpose, and develop a recognition model-based variational inference procedure to achieve our objectives.

Graphical models enable earlier work in approximate inference to be adopted. A seminal example is variational autoencoders (VAEs, Kingma & Welling, 2014; Kingma et al., 2014), which effectively merge two types of models, graphical models and deep learning, into a single comprehensive framework. In VAEs, a generative and a recognition model are integrated for the sake of developing a powerful probabilistic framework utilising the recent advances in both deep generative models and scalable variational inference (Kingma & Welling, 2014; Rezende et al., 2014). Here we use similar techniques and propose an information theoretic objective which expresses our joint goals of local reward maximisation, and high transferability and generalisation power. We construct a graphical model (GM) that captures our modeling assumptions, then show that inference on the GM to achieve our information theoretic objective corresponds to the quest for a two-fold RL objective targeting both reward maximisation and the enhancement of the generalisation capabilities of the GM-based framework. We show that this framework can also be used in environments with sparse or deceptive extrinsic reward signals by introducing a corresponding exploration strategy.

In our approach, the latent space representing the environment is disentangled into (i) ‘local’ components which aim at maximising the reward at each time step, and (ii) ‘global’ components corresponding to more generic, time-independent information about the environment. This disentanglement is analogous to functional theories of hierarchy in psychology (Parsons, 1940; Boker, 2002; Moody & White, 2003). To perform inference, we develop an efficient variational inference procedure, involving both a generative and a recognition model. We derive a novel analogy between our proposed information theoretic objective and an RL objective taking into account both reward maximisation and the optimisation for generalisation and transferability. We name our approach, based on a Transferable Information-Based Graphical Model, TibGM. Our proposed two-fold objective contains a term aiming at the maximisation of the extrinsic reward, together with another term which encourages the ‘global’ subset of the latent space not to depend heavily on the current reward (to aid generalisation).

If the reward signal is sparse, deceptive (Colas et al., 2018; Hong et al., 2018; Khadka & Tumer, 2018) or delayed (van der Pol & Oliehoek, 2016; Andreas et al., 2017), extrinsic rewards are hardly sufficient to convey the modeling objective, stimulating the need for intrinsic rewards and motivations such as exploration bonuses (Bellemare et al., 2016; Fu et al., 2017; Tang et al., 2017; Ostrovski et al., 2018; Burda et al., 2018; 2019). We show that it is possible to augment our TibGM framework with an unsupervised pretraining objective that does not count on extrinsic rewards.

We highlight the following contributions: 1) We propose a graphical model based on which an introduced information theoretic objective leads to the solution of RL problems (Section 3); 2) We derive a correspondence between the proposed mutual information-based objective and a two-fold RL objective, where both reward maximisation on the one hand, and generalisation and transferability on the other hand, are optimised (Section 3); 3) The introduced latent space is disentangled into ‘local’ components focused on maximising the reward at each time step, and ‘global’ latent components; 4) In cases with sparse, deceptive or very delayed extrinsic rewards, we propose an information theoretic unsupervised pretraining procedure that can further focus on exploration while still exploiting the learning and inference efficiency advantages of the graphical model (Section 4); 5) We verify our approach empirically on 16 benchmark tasks, outperforming recent state-of-the-art methods (Section 5).

2. Related work

Some works have explored links between RL and PGMs, e.g. (Dayan & Hinton, 1997; Kappen, 2005; Manfredi & Mahadevan, 2005; Neumann, 2011; Kappen et al., 2012; Levine, 2014; Abdolmaleki et al., 2018; Haarnoja et al., 2018a;b; Xu et al., 2018; Fellows et al., 2019). Out of these works, the most similar frameworks to ours are SAC (Haarnoja et al., 2018b) and LSP (Haarnoja et al., 2018a). The LSP framework (Haarnoja et al., 2018a) provides an ELBO that corresponds to reward maximisation and a maximum entropy objective. They then develop a hierarchical stochastic policy where learning done at any level can be undone at the higher level, in case the latter deems it more beneficial to do so. In addition to the exploration bonuses and the modeling and inference flexibility provided by TibGM, other differences to most of these models, and in particular to (Haarnoja et al., 2018a), include the following: The driving objective that leads to the equivalence with a two-fold RL objective in TibGM, i.e. our starting point, is the one based on mutual information, not the ELBO. In addition, the framework in (Haarnoja et al., 2018a) does not contain a recognition (inference) model whose amortisation leads to improvements in efficiency and further modeling flexibility. Another key difference is that, unlike several related frameworks, e.g. (Ziebart et al., 2008; Nachum et al., 2017; Schulman et al., 2017a; Haarnoja et al., 2017; 2018a;b; Levine, 2018; Grau-Moya et al., 2019), the objective proposed by TibGM does not depend on maximum entropy. The framework in (Haarnoja et al., 2018a) is focussed on the fact that higher hierarchy levels can undo lower level transformations. This is not what we pursue, as we believe that providing efficient inference in the first place would be better in terms of the computational run-time. Other RL algorithms that have links with probabilistic inference include (Todorov, 2007; Toussaint, 2009; Peters et al., 2010; Rawlik et al.; Hausman et al., 2018; Tang & Agrawal, 2018).

Regarding pretraining strategies, Florensa et al. (2017) established an information theoretic exploration procedure that maximises the mutual information between states and actions at the top level of their hierarchy. The work in (Goyal et al., 2019) also contains an information theoretic regularisation term. In addition to architectural differences, differences to the latter include the fact that there is no disentanglement in the learnt latent space to keep a part of it focussed on maximising the reward at each time step; there is a single focus on exploration. Also, our derived analogy provides further modeling flexibility, e.g. the unsupervised pretraining procedure.

In the experiments, we compare to several other state-of-the-art algorithms, like the sample-efficient deterministic off-policy algorithm DDPG (Lillicrap et al., 2015). We also compare to proximal policy optimization (PPO, Schulman et al., 2017b), which mostly converges to nearly deterministic policies as well. Other works involved in the comparisons include algorithms of an evolutionary nature like ERL (Khadka & Tumer, 2018) and GEP-PG (Colas et al., 2018), and ProMP (Rothfuss et al., 2019), whose target is to adapt well to new, similar environments.

3. Methodology

Our TibGM framework is based on designing a latent space that aims both at modeling the environment with its actions and states with high fidelity, and also at capturing the global, persistent aspects of such an environment that can be transferable. The proposed latent space must be capable of accomplishing two goals: i) achieving high rewards on the tasks the RL learner is facing; and ii) constructing a representation with a high generalisation and transfer potential.

Notation and Model

Consider a Markov decision process (MDP) defined as (S, A, P, r) with state space S, action space A, transition dynamics P, and transition reward r. Refer to the current state, next state and current action as s_t, s_{t+1} and a_t, respectively. The transition dynamics P(s_{t+1} | s_t, a_t) refers to the probability of the transition from state s_t to state s_{t+1} by taking action a_t. Without loss of generality, assume the reward r(s_t, a_t) is non-positive. Denote by s_0 the initial state, by τ = {s_0, a_0, s_1, a_1, . . . , s_T, a_T} a trajectory, and refer to the distribution of the trajectory under policy π(a|s) as ρ_π(τ).
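To make the notation concrete, the following minimal sketch (ours, not from the paper) rolls out a policy π(a|s) in a toy MDP and records a trajectory τ = {s_0, a_0, s_1, a_1, . . . , s_T, a_T}; the 1-D environment, its reward, and the random policy are illustrative placeholders only.

```python
# Minimal sketch (not from the paper): rolling out a policy pi(a|s) in a toy MDP
# (S, A, P, r) and recording a trajectory tau = {s_0, a_0, s_1, a_1, ..., s_T, a_T}.
import numpy as np

class ToyEnv:
    """A 1-D MDP with Gaussian transition noise and a non-positive reward."""
    def reset(self):
        self.s = np.zeros(1)
        return self.s

    def step(self, a):
        self.s = self.s + a + 0.01 * np.random.randn(1)   # transition dynamics P(s'|s,a)
        r = -float(np.abs(self.s - 1.0))                  # non-positive reward, as assumed above
        return self.s, r

def collect_trajectory(env, policy, T=50):
    states, actions, rewards = [env.reset()], [], []
    for _ in range(T):
        a = policy(states[-1])
        s_next, r = env.step(a)
        actions.append(a); rewards.append(r); states.append(s_next)
    return states, actions, rewards

env = ToyEnv()
policy = lambda s: np.clip(np.random.randn(1), -1, 1)     # stand-in for pi(a|s)
states, actions, rewards = collect_trajectory(env, policy)
print(len(actions), "actions, return", sum(rewards))
```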

We propose an objective motivated by information theory, then show how it corresponds to an RL objective which aims at both (i) maximising the reward of the resulting trajectory, and (ii) constructing a transferable representation with good generalisation power.

The proposed probabilistic model is displayed in Figure 1. Assume there is a binary variable b (for ‘best’) denoting whether it is ideal (when b_t = 1) at time step t to choose action a_t. The probability that b_t = 1 is proportional to the reward value, as we will see later. We show that the quest for the aforementioned goals in our graphical model via the introduced information theoretic objective is equivalent to a solution to the learning problem that maximises the reward along with the optimisation for an intrinsic goal of generalisation. The latent variables h_{t=1...T} and z are responsible for the generation of the sequence of actions a_t.

Figure 1. The graphical model depicting the proposed TibGM model. Actions at each time step depend on the states s_t, as well as the disentangled latent space consisting of time-variant ‘local’ components h_t and ‘global’ components z, encouraging generalisation. The variable b_t conveys information about the reward, see the text. Dotted lines (shown only at the first time step for improved readability) indicate the dependencies of the recognition model, when different from their generative counterparts. Achieving our information theoretic objective by performing variational inference on this graphical model is equivalent to optimising for a two-fold RL objective aiming at maximising the reward as well as achieving an additional generalisation/transfer goal.

Note that there is one ‘local’ h_t per time step t, unlike z, which is ‘global’.

In the following, we present our mutual information-based objective. Afterwards, we derive a correspondence between our objective and an RL objective which considers both a typical reward maximisation objective and an introduced goal of estimating a latent space with high transferability and generalisation power. We then give details about the inference procedure. We aim at achieving a maximum reward, which we initially convey via maximising the mutual information between the actions a_t and the optimality variable b_t, I(a_t, b_t). The time-dependent latent variables h_t are responsible for maximising this mutual information term, which denotes the aim to choose optimal actions at each time step. Both h_t and the global latent components, referred to as z, affect the actions a at each time step, but z is constrained via the other modeling objective (generalisation) expressed by the second term. This second term, I(z, b_t | s_t), is minimised to encourage a latent global representation with high generalisation and transfer power. The intuition here is that by minimising the mutual information between z and the reward-directed variable b_t, the former is induced to be free to model the global aspects that can potentially lead to a highly generalisable representation. Our overall objective can be described as follows:

\[
\max_{\theta, \phi} \; I(a_t, b_t) - \beta\, I(z, b_t \mid s_t), \qquad (1)
\]

where θ, φ are the parameters of the generative and recognition model, respectively. The parameter β controls the degree to which the second part of the objective is imposed, i.e. how much we want to enforce our latent space to model general concepts about the environment that are not necessarily related to the extrinsic reward of the current task. A high positive value of β places great emphasis on transferability and generalisation, whereas β = 0 leads to the standard reward-maximising RL objective. For our experiments, we determine β by cross-validation. The first term in (1) is equivalent to the following:

\[
\begin{aligned}
I(a_t, b_t) &= \int_{a_t, b_t} p_\theta(a_t, b_t) \log \frac{p_\theta(a_t, b_t)}{p_\theta(a_t)\, p(b_t)} \, da_t\, db_t \\
&= \int_{a_t, b_t} p_\theta(a_t, b_t) \log \frac{p_\theta(a_t)\, p_\theta(b_t \mid a_t)}{p_\theta(a_t)\, p(b_t)} \, da_t\, db_t \\
&= \mathbb{E}_{p_\theta(a_t, b_t)}\big[\log p_\theta(b_t \mid a_t) - \log p(b_t)\big] \\
&= \mathbb{E}_{p_\theta(a_t, b_t)}\big[\log p_\theta(b_t \mid a_t)\big] + H(b_t) \qquad (2)
\end{aligned}
\]

Since the entropy term H(b_t) in (2) does not have an impact on our optimization objective, we will not take it into account in the remaining part.

Meanwhile, regarding the second term in (1):

\[
\begin{aligned}
I(z, b_t \mid s_t) &= \int_{s_t} p(s_t) \int_{z, b_t} p_\theta(z, b_t \mid s_t) \log \frac{p_\theta(z, b_t \mid s_t)}{p_\theta(z \mid s_t)\, p_\theta(b_t \mid s_t)} \, ds_t\, dz\, db_t \\
&= \int_{s_t, z, b_t} p_\theta(z, b_t, s_t) \log \frac{p_\theta(z \mid b_t, s_t)\, p_\theta(b_t \mid s_t)}{p_\theta(z \mid s_t)\, p_\theta(b_t \mid s_t)} \, ds_t\, dz\, db_t \\
&= \int_{s_t, z, b_t} p_\theta(z, b_t, s_t)\, \big[\log p_\theta(z \mid b_t, s_t) - \log p_\theta(z \mid s_t)\big] \, ds_t\, dz\, db_t \qquad (3)
\end{aligned}
\]

Let q_φ(z | s_t) be the estimated approximation to the ground truth p(z | s_t). The KL-divergence between both is:

\[
\mathrm{KL}\big[p_\theta(z \mid s_t) \,\|\, q_\phi(z \mid s_t)\big] \geq 0 \;\;\Rightarrow\;\; \int p_\theta(z \mid s_t) \log p_\theta(z \mid s_t)\, dz \;\geq\; \int p_\theta(z \mid s_t) \log q_\phi(z \mid s_t)\, dz \qquad (4)
\]

Using (4) back in (3), we obtain:

\[
\begin{aligned}
I(z, b_t \mid s_t) &\leq \int_{s_t, z, b_t} p_\theta(z, b_t, s_t)\, \big[\log p_\theta(z \mid b_t, s_t) - \log q_\phi(z \mid s_t)\big] \, ds_t\, dz\, db_t \\
&= \int_{s_t, z, b_t} p_\theta(b_t, s_t)\, p_\theta(z \mid b_t, s_t) \log \frac{p_\theta(z \mid b_t, s_t)}{q_\phi(z \mid s_t)} \, ds_t\, dz\, db_t \qquad (5) \\
&= \mathbb{E}_{p_\theta(b_t, s_t)}\, \mathrm{KL}\big[p_\theta(z \mid b_t, s_t) \,\|\, q_\phi(z \mid s_t)\big] \qquad (6)
\end{aligned}
\]

By combining the bounds in (2) and (6), our information theoretic objective is equivalent to:

\[
\max_{\theta, \phi} \; I(a_t, b_t) - \beta\, I(z, b_t \mid s_t) \;\equiv\; \max_{\theta, \phi} \; \mathbb{E}_{p_\theta(a_t, b_t)} \log p_\theta(b_t \mid a_t) \;-\; \beta\, \mathbb{E}_{p_\theta(b_t, s_t)} \mathrm{KL}\big[p_\theta(z \mid b_t, s_t) \,\|\, q_\phi(z \mid s_t)\big] \qquad (7)
\]

For specific values of s_t and a_t, the probability that b_t = 1 is equal to p(b_t | a_t, s_t) = e^{r(a_t, s_t)}, in order to keep it as a probability value between 0 and 1, similar to (Haarnoja et al., 2018a). The rewards r are assumed to have non-positive values as in (Haarnoja et al., 2018a).
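As a quick numerical illustration (ours, not from the paper): a transition with reward r(a_t, s_t) = −2 is mapped to p(b_t = 1 | a_t, s_t) = e^{−2} ≈ 0.135, while the maximal reward r(a_t, s_t) = 0 is mapped to probability e^0 = 1; the non-positivity assumption is exactly what keeps e^{r(a_t, s_t)} within [0, 1].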

The first term in (7) is equivalent to the following:

\[
\mathbb{E}_{p_\theta(a_t, b_t)} \log p_\theta(b_t \mid a_t) = \int_{a_t, b_t, s_t} p_\theta(a_t, b_t, s_t) \log p_\theta(b_t \mid a_t, s_t) \, da_t\, db_t\, ds_t, \qquad (8)
\]

where p_θ(b_t | a_t, s_t) can be used since b_t and s_t are independent given a_t. From (7) and (8), we can then see the equivalence between the objective introduced here by TibGM and an RL objective that typically maximises the reward (the first term in (7)) along with an additional goal. The second term in (7) aims at guaranteeing that some components z of the latent space aim at capturing generalised concepts of the environment, i.e. concepts and aspects not particularly related to the specific optimality conditions and targets of the current task.
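To make the two-term structure of (7) concrete, here is a minimal sketch (ours, not the paper's implementation) that evaluates a Monte Carlo estimate of both terms on a batch of sampled transitions: the reward term uses log p(b_t = 1 | a_t, s_t) = r(a_t, s_t) for non-positive rewards, and the generalisation term is a KL between two diagonal Gaussians standing in for p_θ(z | b_t, s_t) and q_φ(z | s_t) (the paper instead uses Sylvester flows in the recognition model). Propagating gradients through the reward term relies on the reparameterised recognition model described in the Inference subsection and is omitted here.

```python
# Hedged sketch: Monte Carlo evaluation of the two terms in objective (7),
#   E[log p_theta(b_t | a_t)] - beta * E[ KL( p_theta(z | b_t, s_t) || q_phi(z | s_t) ) ],
# with diagonal Gaussians standing in for the paper's flow-based densities.
import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence

state_dim, z_dim, beta = 8, 4, 0.3

prior_net = nn.Linear(state_dim + 1, 2 * z_dim)   # stands in for p_theta(z | b_t, s_t)
recog_net = nn.Linear(state_dim, 2 * z_dim)       # stands in for q_phi(z | s_t)

def diag_gaussian(params):
    mu, log_sigma = params.chunk(2, dim=-1)
    return Normal(mu, log_sigma.exp())

def tibgm_objective(s, r):
    # First term of (7): for b_t = 1, log p_theta(b_t | a_t, s_t) = r(a_t, s_t).
    reward_term = r.mean()
    # Second term of (7): KL between the posterior over z and the recognition model.
    b = torch.ones(s.shape[0], 1)
    p_z = diag_gaussian(prior_net(torch.cat([s, b], dim=-1)))
    q_z = diag_gaussian(recog_net(s))
    kl_term = kl_divergence(p_z, q_z).sum(-1).mean()
    return reward_term - beta * kl_term            # the quantity to be maximised

s = torch.randn(32, state_dim)                      # sampled states
r = -torch.rand(32)                                 # sampled non-positive rewards r(a_t, s_t)
print(float(tibgm_objective(s, r)))
```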

Inference

Recall that the observed variables at each time step are the current state s_t and the optimality variable b_t. To perform inference on the model proposed and described above, we construct a recognition model (Stuhlmuller et al., 2013; Kingma & Welling, 2014; Rezende et al., 2014) whose parameters φ approximate the true posterior, along with the generative model with parameters θ. Our inference procedure aims at approximately inferring an optimal policy with high fidelity. As such, similar to other works portraying the reinforcement learning problem in a graphical model setting (Strehl et al., 2006; 2009; Fu et al., 2018; Haarnoja et al., 2018a), we constrain the dynamics, which will be estimated by sampling, to be equal to the true dynamics. The generative and recognition models developed here are the components of the parameterised distribution estimation of the policy.

In the recognition model, the dependency between the potentially generic (and rather constant throughout time steps) z and a_t, q_φ(z | a_t), is modeled via orthogonal Sylvester normalizing flows (van den Berg et al., 2018). On the other hand, the dependency between each h_t and a_t, q_φ(h_t | a_t), is modeled via a Gaussian transformation whose parameters are computed via a neural network. Sylvester normalizing flows (SNFs, van den Berg et al., 2018) are a state-of-the-art generalisation of planar flows and they lead to much more flexible density transformations in addition to improved overall performance. The intuition (which is empirically supported) behind using a Gaussian, whose parameters are estimated via a neural network, to estimate q_φ(h_t | a_t) is to try to reserve the modeling power of h_t for estimating solely time-dependent aspects, while keeping the rest of the features in the more powerful, SNF-based q_φ(z | a_t). A rather similar trick has been successfully employed in (Li & Mandt, 2018), where they assigned a low-dimensional representation to model the time-dependent latent component.
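A minimal sketch (our own simplification) of the recognition-model split just described: a Gaussian head, parameterised by a small network, for the time-dependent q_φ(h_t | a_t), and a flow-based head for the global q_φ(z | a_t). The single planar-style flow step below is only a placeholder for the orthogonal Sylvester flows used in the paper (see the next sketch for the Sylvester step itself); all dimensions and layer sizes are illustrative.

```python
# Hedged sketch of the recognition-model factorisation: a Gaussian q_phi(h_t | a_t)
# for the time-dependent part and a flow-transformed density q_phi(z | a_t) for the
# global part. The planar-style step is a placeholder for orthogonal Sylvester flows.
import torch
import torch.nn as nn
from torch.distributions import Normal

action_dim, h_dim, z_dim = 2, 4, 4

class RecognitionModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.h_head = nn.Sequential(nn.Linear(action_dim, 32), nn.Tanh(),
                                    nn.Linear(32, 2 * h_dim))      # mu, log_sigma of h_t
        self.z_base = nn.Linear(action_dim, 2 * z_dim)              # base density of the flow
        self.w = nn.Parameter(torch.randn(z_dim, 1) * 0.1)          # single planar-style step
        self.b = nn.Parameter(torch.zeros(1))
        self.u = nn.Parameter(torch.randn(z_dim, 1) * 0.1)

    def h_dist(self, a):
        mu, log_sigma = self.h_head(a).chunk(2, dim=-1)
        return Normal(mu, log_sigma.exp())

    def sample_z(self, a):
        mu, log_sigma = self.z_base(a).chunk(2, dim=-1)
        z0 = mu + log_sigma.exp() * torch.randn_like(mu)            # reparameterised base sample
        z1 = z0 + (self.u.T * torch.tanh(z0 @ self.w + self.b))     # one invertible flow step
        return z1

model = RecognitionModel()
a = torch.randn(16, action_dim)
h = model.h_dist(a).rsample()
z = model.sample_z(a)
print(h.shape, z.shape)
```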

SNFs (van den Berg et al., 2018), like other normalizing flows (NFs) (Rezende & Mohamed, 2015; Kingma et al., 2016), apply a chain of invertible parameterized transformations, f_t, t = 1, . . . , T, to their input a such that the outcome of the last iteration, z = z_T, has a more flexible distribution, which in our algorithm is optimized for high fidelity in representing the environment aspects preserved through time. Each transformation step is indexed by t, where z_0 is an initial random variable with (Gaussian) density q_0(z_0) that is successively transformed through a chain of transformations f_1, . . . , f_T (Rezende & Mohamed, 2015):

\[
z_t = f_t(z_{t-1}, a), \quad \forall t = 1, \ldots, T \qquad (9)
\]

The probability density function of the ultimate latent representation, z = z_T, can be computed provided that the determinant of the Jacobian of each of the transformations, det(f_t), can be computed. The probability density q_φ(z | a) = q_T(z_T | a) can be expressed as follows:

\[
\log q_T(z_T \mid a) = \log q_0(z_0 \mid a) - \sum_{t=1}^{T} \log \left| \det \frac{dz_t}{dz_{t-1}} \right|, \quad z = z_T. \qquad (10)
\]

SNFs have the following form:

\[
f_t(z_{t-1}) = z_{t-1} + A\, u(B z_{t-1} + c), \qquad (11)
\]

where A ∈ R^{D×M}, B ∈ R^{M×D}, c ∈ R^M, and M < D, and u is a nonlinearity, for a single-layer MLP with M hidden units (van den Berg et al., 2018). In addition to their aforementioned advantages, SNFs provide an efficient solution to the single-unit bottleneck problem of planar flows, which, as argued by Kingma et al. (2016), is due to the fact that (without an SNF) the impact of the second term in (11) would saturate to be similar to that of a single-neuron MLP. SNFs lead to a flexible bottleneck since the Jacobian determinant of the transformation can now be obtained using M < D dimensions.¹ Each map from a up to z has the form given in (11). Thus:

\[
z = f_T \circ f_{T-1} \circ \cdots \circ f_1(a). \qquad (12)
\]

Other versions of normalizing flows have been used before in similar contexts, such as (Haarnoja et al., 2018a). However, unlike that work, which was fully based on a generative model with no recognition model, we use a different type of normalizing flows (SNFs), and we use them in our recognition model.

¹ For the proof and full details, see van den Berg et al. (2018).
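The following sketch (our own, simplified) implements one transformation of the form (11) and accumulates the log-determinant terms of (10) in the M < D dimensional space via Sylvester's determinant identity, det(I_D + A diag(u'(Bz+c)) B) = det(I_M + diag(u'(Bz+c)) B A). Here A and B are free matrices and the same parameters are reused across steps for brevity, whereas van den Berg et al. (2018) constrain them (e.g. via orthogonal and triangular factors).

```python
# Hedged sketch of one Sylvester-style transformation f(z) = z + A u(B z + c) (eq. 11)
# and its log|det Jacobian| term from (10), computed in the M-dimensional space
# via Sylvester's determinant identity. A and B are unconstrained here, unlike the
# orthogonal Sylvester flows of van den Berg et al. (2018).
import torch

def sylvester_step(z, A, B, c):
    """One flow step; returns the transformed z and log|det df/dz| per batch element."""
    pre = z @ B.T + c                        # (batch, M)
    h = torch.tanh(pre)
    z_new = z + h @ A.T                      # f(z) = z + A u(B z + c)
    # Jacobian: I_D + A diag(u'(pre)) B; Sylvester's identity moves the det to M x M:
    #   det(I_D + A diag B) = det(I_M + diag(u'(pre)) B A)
    u_prime = 1.0 - h ** 2                   # derivative of tanh
    BA = B @ A                               # (M, M)
    eye = torch.eye(BA.shape[0])
    jac_small = eye + u_prime.unsqueeze(-1) * BA     # (batch, M, M)
    logdet = torch.slogdet(jac_small).logabsdet
    return z_new, logdet

D, M, batch = 6, 3, 8
A = 0.1 * torch.randn(D, M)
B = 0.1 * torch.randn(M, D)
c = torch.zeros(M)
z = torch.randn(batch, D)

log_q = -0.5 * (z ** 2).sum(-1)              # up to a constant: log q_0(z_0) for a standard normal base
for _ in range(3):                            # compose several steps, as in (12)
    z, logdet = sylvester_step(z, A, B, c)
    log_q = log_q - logdet                    # accumulate log q_T(z_T) per (10)
print(z.shape, log_q.shape)
```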

We now describe how to implement the objective. At time step t, the first term in (7) is:

\[
\begin{aligned}
\mathbb{E}_{p_\theta(a_t, b_t)} \log p_\theta(b_t \mid a_t) &= \int_{a_t, b_t} p_\theta(a_t, b_t) \log p_\theta(b_t \mid a_t) \, da_t\, db_t \\
&= \int_{z, h_t, s_t, a_t, b_t} p_\theta(z, h_t, s_t, a_t, b_t) \log p_\theta(b_t \mid a_t) \, dz\, dh_t\, ds_t\, da_t\, db_t \qquad (13)
\end{aligned}
\]

The joint probability in (13) can be obtained through samples from the joint distribution of the generative model at the current time step. Regarding p_θ(b_t | a_t), it is expressed using the sampled s_t and the reward, e^{r(a_t, s_t)}.

Now consider the second term in (7). From (5) it is equivalent to:

\[
\begin{aligned}
I(z, b_t \mid s_t) &= \int_{z, s_t, b_t} p_\theta(z, s_t, b_t) \log \frac{p_\theta(z \mid b_t, s_t)}{q_\phi(z \mid s_t)} \, dz\, ds_t\, db_t \\
&= \int_{z, s_t, b_t} p_\theta(z, s_t, b_t)\, \big[\log p_\theta(z \mid b_t, s_t) - \log q_\phi(z \mid s_t)\big] \, dz\, ds_t\, db_t \\
&= \int_{z, h_t, s_t, a_t, b_t} p_\theta(z, h_t, s_t, a_t, b_t)\, \big[\log p_\theta(z, b_t, s_t) - \log p_\theta(b_t, s_t) - \log q_\phi(z \mid s_t)\big] \, dz\, dh_t\, ds_t\, da_t\, db_t \qquad (14)
\end{aligned}
\]

The generative terms (all except q_φ(z | s_t)) are obtained by sampling from the joint distribution of the generative model. Regarding the recognition model, the reparameterisation trick (Kingma & Welling, 2014) is used in estimating a_t. The action variable a_t is expressed as: a_t = f_φ(s_t, b_t, ε). The function f_φ is a deterministic function of s_t, b_t and the Gaussian random variable ε. One of the advantages of the reparameterisation trick is that the noise term ε becomes independent of the model parameters, which facilitates taking gradients (Kingma & Welling, 2014; Alemi et al., 2017). Afterwards, we can proceed with computing q_φ(z | s_t), where, as mentioned above, z | a_t is modeled via an SNF.

\[
q_\phi(z \mid s_t) \propto q_\phi\big(z \mid \mu_\phi(f_\phi(s_t, b_t, \epsilon)),\, \sigma_\phi(f_\phi(s_t, b_t, \epsilon))\big)\, q_\phi(a_t \mid s_t, b_t) = q_\phi\big(z \mid \mu_\phi(f_\phi(s_t, b_t, \epsilon)),\, \sigma_\phi(f_\phi(s_t, b_t, \epsilon))\big)\, \mathcal{N}(\epsilon \mid 0, I) \qquad (15)
\]
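A compact sketch of the reparameterisation just described, under our own simplifying assumptions: f_φ(s_t, b_t, ε) is a small network acting on the concatenation [s_t, b_t, ε] with ε ~ N(0, I), and the base sample of z is drawn from a Gaussian conditioned on the resulting action (the SNF steps of the previous sketches would then transform this base sample). Network shapes and dimensions are illustrative.

```python
# Hedged sketch of the reparameterisation trick for the recognition model:
# a_t = f_phi(s_t, b_t, eps), eps ~ N(0, I), so gradients can flow through a_t;
# z is then drawn from a Gaussian base conditioned on a_t (flow steps omitted).
import torch
import torch.nn as nn

state_dim, action_dim, z_dim, noise_dim = 8, 2, 4, 2

f_phi = nn.Sequential(nn.Linear(state_dim + 1 + noise_dim, 64), nn.Tanh(),
                      nn.Linear(64, action_dim))         # deterministic f_phi(s_t, b_t, eps)
mu_phi = nn.Linear(action_dim, z_dim)
log_sigma_phi = nn.Linear(action_dim, z_dim)

s = torch.randn(16, state_dim)
b = torch.ones(16, 1)                                     # optimality variable b_t
eps = torch.randn(16, noise_dim)                          # noise independent of the parameters

a = f_phi(torch.cat([s, b, eps], dim=-1))                 # reparameterised action sample
z = mu_phi(a) + log_sigma_phi(a).exp() * torch.randn(16, z_dim)  # base sample for q_phi(z|s_t)

z.sum().backward()                                        # gradients reach f_phi's parameters
print(a.shape, z.shape)
```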

4. Sparse or Deceptive Rewards

There are cases where there is hardly any reward signal, and therefore the variable b is no longer prevalent, e.g. in environments with rare or sparse rewards. In addition, there are also environments with deceptive reward signals.

In such cases, the dependence on b is either not feasible, misleading or both. The flexibility provided by the graphical modeling nature of TibGM enables us to model these cases via the following diversification procedure. This procedure aims to enhance exploration power, encouraging the learner to visit new states, which becomes more important when it is not possible to count on the reward signal, and also still to provide a level of generalisation as a foundation of TibGM. As we will show in the experiments, this procedure can be used for unsupervised pretraining prior to the supervised stage involving rewards, in cases when such rewards are available.

In line with the mutual information-based objectives above, we propose to handle the sparse-reward case with an information bottleneck-based objective which encourages exploration when there is no b (hence the name ExTibGM). The intuition is that, with no extrinsic reward available, the latent z shall now be further focused on exploration, and it should hence be made maximally informative about the states s at the expense of being compressive about a. Being maximally informative about the states s is key to the emphasis on exploring new states. On the other hand, being compressive about the actions a provides a level of generalisation as promised by the latent space z. This is similar, but not identical, to the DIAYN algorithm in (Eysenbach et al., 2019), where they also base their reasoning on an information theoretic objective. However, in addition to other differences in the objectives, they base their reasoning in DIAYN on diversifying policies via maximum entropy, while we base ours on generalisation via a different objective.

With no reward available for our approach, the latent space, now consisting only of z, aims at enhancing the exploration power by maximising the opportunity of visiting new states.² Other information bottleneck objectives have been introduced before within VAE-based frameworks, e.g. (Alemi et al., 2017; Adel et al., 2018; Alemi et al., 2018). Our objective at time step t is thus defined as:

\[
IB(z, s_t, a_t) = I(z, s_t) - \alpha\, I(z, a_t), \qquad (16)
\]

where α is a parameter that controls the level of generalisation provided by the objective. Similar to β, we set α by cross-validation in our experiments.

² We suggest that there is not enough supervision in such a case to disentangle our latent space into time-dependent and time-invariant subsets.

We begin by analyzing the first term in (16), I(z, st):

\[
\begin{aligned}
I(z, s_t) &= \int_{z, s_t} p_\theta(z, s_t) \log \frac{p_\theta(s_t)\, p_\theta(z \mid s_t)}{p(s_t)\, p_\theta(z)} \, dz\, ds_t \\
&\geq \int_{z, s_t} p_\theta(z, s_t) \log \frac{q_\phi(z \mid s_t)}{p_\theta(z)} \, dz\, ds_t \qquad (17)
\end{aligned}
\]

due to the non-negativity of the KL-divergence between p_θ(z | s_t) and q_φ(z | s_t), similar to step (4). Therefore, I(z, s_t) is equivalent to:

\[
I(z, s_t) = \mathbb{E}_{p_\theta(z, s_t)}\big[\log q_\phi(z \mid s_t) - \log p_\theta(z)\big] \qquad (18)
\]

Regarding the second term I(z, a_t), and using the non-negativity of KL[p_θ(z) ‖ q_φ(z)]:

\[
\begin{aligned}
I(z, a_t) &\leq \int_{z, a_t} p(a_t)\, p_\theta(z \mid a_t) \log \frac{p_\theta(z \mid a_t)}{q_\phi(z)} \, dz\, da_t \\
&= \int_{a_t} p(a_t)\, \mathrm{KL}\big[p_\theta(z \mid a_t) \,\|\, q_\phi(z)\big] \, da_t \qquad (19)
\end{aligned}
\]

Using the bounds in (18) and (19) in the objective in (16):

\[
IB(z, s_t, a_t) \geq \mathbb{E}_{p_\theta(z, s_t)}\big[\log q_\phi(z \mid s_t) - \log p_\theta(z)\big] - \alpha\, \mathbb{E}_{p(a_t)} \mathrm{KL}\big[p_\theta(z \mid a_t) \,\|\, q_\phi(z)\big] \qquad (20)
\]

The first term in (18) (which can also be found in DIAYN (Eysenbach et al., 2019), though with a different implementation using no normalizing flows) is enforcing exploration, since maximising this term leads to gaining further reward from visiting states that are easy to discriminate. A sample is taken from the Gaussian p_θ(z), whereas the corresponding computation of q_φ(z | s_t) proceeds using SNFs in the recognition model as described in Section 3, but with no b in the model anymore. The second term, which to the best of our knowledge is a novel way of enforcing generalisation in environments with sparse rewards, induces z to be general by making it less dependent on a. Here we use another SNF in the generative model to allow for an analytical computation of the KL-divergence.
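To make (20) concrete, the sketch below evaluates a Monte Carlo surrogate of the bound with diagonal Gaussians standing in for p_θ(z | s_t), p_θ(z | a_t), p_θ(z) and q_φ(·), and with q_φ(z) approximated by a standard normal; the actual framework uses Sylvester flows for these densities and computes the second KL analytically, so this is only an illustration of the two-term structure (all networks and dimensions are assumptions).

```python
# Hedged sketch of a sample-based surrogate for the pretraining bound (20):
#   E_{p(z,s)}[log q_phi(z|s) - log p_theta(z)] - alpha * E_{p(a)} KL(p_theta(z|a) || q_phi(z)).
# Diagonal Gaussians replace the flow-based densities used in the paper.
import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence

state_dim, action_dim, z_dim, alpha = 8, 2, 4, 0.2

enc_s = nn.Linear(state_dim, 2 * z_dim)     # sampling path standing in for p_theta(z | s_t)
recog_s = nn.Linear(state_dim, 2 * z_dim)   # q_phi(z | s_t)
enc_a = nn.Linear(action_dim, 2 * z_dim)    # p_theta(z | a_t)

def gauss(params):
    mu, log_sigma = params.chunk(2, dim=-1)
    return Normal(mu, log_sigma.exp())

def ib_surrogate(s, a):
    p_z_s = gauss(enc_s(s))
    z = p_z_s.rsample()                                         # z sampled jointly with s_t
    prior = Normal(torch.zeros_like(z), torch.ones_like(z))     # p_theta(z); also approximates q_phi(z)
    info_term = (gauss(recog_s(s)).log_prob(z) - prior.log_prob(z)).sum(-1).mean()
    kl_term = kl_divergence(gauss(enc_a(a)), prior).sum(-1).mean()
    return info_term - alpha * kl_term                          # maximise this lower bound

s, a = torch.randn(32, state_dim), torch.randn(32, action_dim)
loss = -ib_surrogate(s, a)
loss.backward()
print(float(loss))
```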

5. Experiments

We evaluate empirically the effectiveness of our TibGM and ExTibGM frameworks, comparing to several state-of-the-art algorithms: DDPG (Lillicrap et al., 2015), LSP (Haarnoja et al., 2018a), SAC³ (Haarnoja et al., 2018b), PPO (Schulman et al., 2017b), ERL (Khadka & Tumer, 2018), DIAYN (Eysenbach et al., 2019), VIREL (Fellows et al., 2019), GEP-PG (Colas et al., 2018) and ProMP (Rothfuss et al., 2019).

We conduct our experiments on six standard benchmark tasks from across OpenAI Gym (Brockman et al., 2016) and rllab (Duan et al., 2016): Swimmer (rllab), Hopper (v1), Walker2d (v1), HalfCheetah (v1), Ant (rllab) and Humanoid (rllab). We also consider the Continuous Mountain Car (CMC) environment, a control benchmark with deceptive reward properties (Lehman et al., 2017; Colas et al., 2018).

³ We build our code on top of SAC, https://github.com/haarnoja/sac.

We evaluate the following aspects: 1) Performance resulting from our disentangled latent space-based approach TibGM and its inference model; 2) Impact of the exploration-focused version of our framework, ExTibGM, on a deceptive-reward problem; 3) Impact of using the pretraining of ExTibGM and the resulting diversity on computational run-time; and 4) How TibGM fares in a policy transfer-learning setup.

Confidence intervals are shown in all the plots. Unless noted otherwise, each experiment was repeated 50 times and significance has been tested via a paired t-test with a significance level of 5%. Cross-validation was used to estimate the optimal values of β = 0.3 and α = 0.2. In all plots, we run a sufficient number of time steps until we observe that all prior methods appear to reach saturation (Haarnoja et al. (2018a) used a similar number of steps). Other experimental details are given in the Appendix.
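As a concrete illustration of the significance protocol just described (50 repetitions, paired t-test at the 5% level), the snippet below compares per-seed final returns of two methods; the return arrays are synthetic placeholders, not results from the paper.

```python
# Hedged sketch of the significance test described above: a paired t-test over
# per-seed returns at the 5% level. The return arrays here are synthetic.
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)
returns_a = 3000 + 150 * rng.standard_normal(50)   # e.g. one method, one value per seed
returns_b = 2850 + 150 * rng.standard_normal(50)   # e.g. a baseline, same 50 seeds

t_stat, p_value = ttest_rel(returns_a, returns_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, significant at 5%: {p_value < 0.05}")
```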

5.1. Average Reward on Benchmark Tasks

We compare our two algorithms TibGM and ExTibGM, where the latter refers to adopting the introduced exploration strategy as an unsupervised pretraining procedure. The results of the total expected return are displayed in Figure 2. Most of the algorithms we compare to here, e.g. DDPG and PPO, are focused solely on the expected return, so this comparison is favourable to them against algorithms like TibGM, ExTibGM and maximum entropy-based algorithms that inherently optimise for a two-fold objective.

As can be seen in Figure 2, the proposed ExTibGM, which corresponds to TibGM preceded by the introduced pretraining procedure, achieves significantly better results on all the 6 tasks. On the other hand, TibGM, i.e. with no pretraining, achieves better results than the rest of the competitors in 5 out of the 6 tasks. TibGM also achieves a significantly higher return than ExTibGM on the Walker2d task; in this case, it may be that too much exploration has confused the algorithm, and the vanilla TibGM is more suitable to the task. As such, the proposed ExTibGM and TibGM, especially the former, achieve state-of-the-art performance on the 6 tasks.

[Figure 2: return versus training steps (×10³) on the six benchmark tasks — (a) Swimmer (rllab), (b) Hopper (v1), (c) Walker2d (v1), (d) HalfCheetah (v1), (e) Ant (rllab), (f) Humanoid (rllab) — comparing DDPG, LSP, SAC, PPO, ERL, DIAYN, VIREL, TibGM and ExTibGM.]

Figure 2. Total expected return on 6 benchmark tasks. Thick lines in the middle of each curve indicate the average performance, while the standard deviations over 50 random seeds are shown by the shaded regions. The proposed algorithm ExTibGM, which corresponds to TibGM preceded by the introduced pretraining procedure, achieves significantly better results on all the 6 tasks. On the other hand, TibGM, i.e. with no pretraining, achieves better results than the rest of the competitors in 5 out of the 6 tasks, and achieves a significantly higher return than ExTibGM on the Walker2d task. The proposed framework therefore achieves state-of-the-art performance on the 6 tasks.

5.2. Exploration Efficiency on CMC

We evaluate the exploration potential of ExTibGM on a task with deceptive rewards. The standard CMC benchmark, which has been used in (Lehman et al., 2017; Colas et al., 2018), has particularly deceptive reward characteristics. CMC is a control benchmark where an object (a vehicle) whose action set is {−1, 0, 1} must reach its goal at the top of a hill by gaining momentum and accelerating from another hill. One of the interesting exploration issues raised by CMC is that large accelerations are necessary to reach the goal, but larger accelerations cause larger penalties too (Colas et al., 2018). As such, the agent should figure out how to perform the smallest sufficient accelerations (Lehman et al., 2017). Also, the CMC environment may mislead the agent since it provides a deceptive reward signal (Colas et al., 2018). Thus, the ability to wisely explore the environment in CMC is of paramount importance.

As described in (Colas et al., 2018), one of the fundamental exploration issues raised by CMC is the time at which the goal is first reached, which is very critical. In Figure 3, we display the histograms resulting from performing 1000 trials and showing the number of steps required before the goal is reached for the first time. We compare to two variants of GEP (Colas et al., 2018). The proposed ExTibGM reaches the goal much earlier. In 632 out of the 1000 trials, the goal was reached in the first 5,000 steps by ExTibGM, compared to 394 by the best GEP version.

[Figure 3: histograms of the number of time steps (×10³) before the goal is first reached in CMC, for ExTibGM, TibGM, GEP (linear policy) and GEP (complex policy).]

Figure 3. Histograms showing the number of steps required to reach the goal for the first time in CMC. On average, the proposed ExTibGM reaches the goal much earlier. In 632 out of the 1000 trials, the goal was reached in the first 5,000 steps by ExTibGM, compared to 394 by the best GEP version.

[Figure 4: return versus wall-clock minutes on (a) Swimmer (rllab), (b) Hopper (v1) and (c) Ant (rllab), comparing random initialisation, DIAYN pretraining and ExTibGM pretraining.]

Figure 4. Impact of the pretraining procedure of ExTibGM and DIAYN on the run-time of 3 benchmarks. ExTibGM significantly accelerates the learning procedure in 2 of the 3.

5.3. Accelerating Learning with Pretraining

We evaluate the impact of the pretraining performed in ExTibGM and its exploration on the computational run-time. Figure 4 displays the results showing the impact of pretraining performed via DIAYN (Eysenbach et al., 2019) and by ExTibGM, along with a random initialisation baseline, on the run-time of the Hopper, HalfCheetah and Ant benchmarks. Similar to the setting of an identical experiment in (Eysenbach et al., 2019), the times spent during our amortised unsupervised pretraining are omitted from the plot, since the assumption is that the bottleneck is in the supervised training. We are doing the same for both DIAYN and ExTibGM anyway, so this does not favour the proposed framework. As can be seen in Figure 4, the pretraining in ExTibGM (significantly) accelerates the learning procedure in (two of) the three benchmarks.

5.4. Transfer Learning

We evaluate TibGM on 6 transfer learning tasks that require adaptation (Rothfuss et al., 2019). In two of the tasks, the HalfCheetah and the Walker agents need to keep switching between walking forward and backward. The Ant and Humanoid should adapt to run in various directions. In the final two tasks, the Hopper and Walker have to adapt to different dynamic configurations (Rothfuss et al., 2019). In Table 1, we compare to two state-of-the-art transfer learning algorithms, ProMP (Rothfuss et al., 2019) and InfoBot (Goyal et al., 2019), in terms of the total expected return after 2 × 10⁷ steps. The proposed TibGM achieves significant improvements and clearly leads to state-of-the-art results on the 6 adaptation tasks. This demonstrates that the latent components z seem to have managed to capture the common aspects, which stand across the changes in the domain. Moreover, the variance resulting from TibGM is lower than that of the competitors in 5 out of the 6 tasks.

Table 1. Total expected return on 6 adaptation tasks. TibGM achieves significant improvements and clearly leads to state-of-the-art results on the 6 adaptation tasks. Moreover, the variance resulting from TibGM is the lowest among competitors in 5 out of the 6 tasks. Bold refers to an expected return value that is significantly better than its competitors. Significance is tested using the same paired t-test described above.

Task                   TibGM        ProMP        InfoBot
HopperRandParams       912 ± 36     438 ± 60     679 ± 54
WalkerRandParams       848 ± 49     525 ± 72     458 ± 44
HalfCheetahFwdBack     1187 ± 77    735 ± 93     714 ± 84
WalkerFwdBack          953 ± 26     542 ± 51     410 ± 58
AntRandDir             684 ± 44     218 ± 48     311 ± 89
HumanoidRandDir        1186 ± 82    527 ± 113    334 ± 89

6. Conclusion

We introduced an RL framework which leverages the expressiveness and power of graphical models. We defined a novel information theoretic objective, and showed its correspondence to an RL objective aiming at both maximising the reward, and facilitating transfer learning and exploration. We developed an inference procedure based on state-of-the-art advances in variational inference. The latent space representing the policy is disentangled into local components focused on reward maximisation, and global components capturing information which is useful across different environment settings. We also introduced an unsupervised information theoretic pretraining strategy, demonstrating the flexibility of our framework, which further focuses on exploration, and performs well in environments with sparse or deceptive reward signals.

Acknowledgements

Both authors thank the anonymous reviewers for helpful comments, and acknowledge support from the Leverhulme Trust via the CFI. AW acknowledges support from the David MacKay Newton research fellowship at Darwin College and The Alan Turing Institute under EPSRC grant EP/N510129/1 & TU/B/000074.

References

Abdolmaleki, A., Springenberg, J., Tassa, Y., Munos, R., Heess, N., and Riedmiller, M. Maximum a posteriori policy optimisation. International Conference on Learning Representations (ICLR), 2018.

Adel, T., Ghahramani, Z., and Weller, A. Discovering interpretable representations for both deep generative and discriminative models. International Conference on Machine Learning (ICML), 2018.

Alemi, A., Fischer, I., Dillon, J., and Murphy, K. Deep variational information bottleneck. International Conference on Learning Representations (ICLR), 2017.

Alemi, A., Poole, B., Fischer, I., Dillon, J., Saurous, R., and Murphy, K. Fixing a broken ELBO. International Conference on Machine Learning (ICML), 2018.

Andreas, J., Klein, D., and Levine, S. Modular multitask reinforcement learning with policy sketches. International Conference on Machine Learning (ICML), 2017.

Bakker, B. and Schmidhuber, J. Hierarchical reinforcement learning with subpolicies specializing for learned subgoals. Neural Networks and Computational Intelligence, 2004a.

Bakker, B. and Schmidhuber, J. Hierarchical reinforcement learning based on subgoal discovery and subpolicy specialization. Conf. on Intelligent Autonomous Systems, 8:438–445, 2004b.

Bellemare, M., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., and Munos, R. Unifying count-based exploration and intrinsic motivation. Advances in neural information processing systems (NIPS), 2016.

Boker, S. Consequences of continuity: The hunt for intrinsic properties within parameters of dynamics in psychological processes. Multivariate Behavioral Research, 37:405–422, 2002.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

Burda, Y., Edwards, H., Pathak, D., Storkey, A., Darrell, T., and Efros, A. Large-scale study of curiosity-driven learning. arXiv preprint arXiv:1808.04355, 2018.

Burda, Y., Edwards, H., Storkey, A., and Klimov, O. Exploration by random network distillation. International Conference on Learning Representations (ICLR), 2019.

Cao, F. and Ray, S. Bayesian hierarchical reinforcement learning. Advances in neural information processing systems (NIPS), 2012.

Colas, C., Sigaud, O., and Oudeyer, P. GEP-PG: Decoupling exploration and exploitation in deep reinforcement learning algorithms. International Conference on Machine Learning (ICML), 2018.

Dayan, P. and Hinton, G. Using expectation-maximization for reinforcement learning. Neural Computation, 9:271–278, 1997.

Dietterich, T. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13:227–303, 2000.

Duan, Y., Chen, X., Houthooft, R., Schulman, J., and Abbeel, P. Benchmarking deep reinforcement learning for continuous control. International Conference on Machine Learning (ICML), pp. 1329–1338, 2016.

Eysenbach, B., Gupta, A., Ibarz, J., and Levine, S. Diversity is all you need: Learning skills without a reward function. International Conference on Learning Representations (ICLR), 2019.

Fellows, M., Mahajan, A., Rudner, T., and Whiteson, S. VIREL: A variational inference framework for reinforcement learning. Artificial Intelligence and Statistics (AISTATS), 2019.

Florensa, C., Duan, Y., and Abbeel, P. Stochastic neural networks for hierarchical reinforcement learning. International Conference on Learning Representations (ICLR), 2017.

Fu, J., Co-Reyes, J., and Levine, S. Exploration with exemplar models for deep reinforcement learning. Advances in neural information processing systems (NIPS), 2017.

Fu, J., Singh, A., Ghosh, D., Yang, L., and Levine, S. Variational inverse control with events: A general framework for data-driven reward definition. Advances in neural information processing systems (NIPS), 2018.

Goyal, A., Islam, R., Strouse, D., Ahmed, Z., Larochelle, H., Botvinick, M., Levine, S., and Bengio, Y. Transfer and exploration via the information bottleneck. International Conference on Learning Representations (ICLR), 2019.

Grau-Moya, J., Leibfried, F., and Vrancx, P. Soft Q-learning with mutual-information regularization. International Conference on Learning Representations (ICLR), 2019.

Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. Reinforcement learning with deep energy-based policies. International Conference on Machine Learning (ICML), 2017.

Haarnoja, T., Hartikainen, K., Abbeel, P., and Levine, S. Latent space policies for hierarchical reinforcement learning. International Conference on Machine Learning (ICML), 2018a.

Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. International Conference on Machine Learning (ICML), 2018b.

Hausman, K., Springenberg, J., Wang, Z., Heess, N., and Riedmiller, M. Learning an embedding space for transferable robot skills. International Conference on Learning Representations (ICLR), 2018.

Hong, Z., Shann, T., Su, S., Chang, Y., Fu, T., and Lee, C. Diversity-driven exploration strategy for deep reinforcement learning. Advances in neural information processing systems (NIPS), 2018.

Jordan, M. Learning in graphical models. Springer Science and Business Media, 1998.

Jordan, M., Ghahramani, Z., Jaakkola, T., and Saul, L. An introduction to variational methods for graphical models. Machine Learning, pp. 183–233, 1999.

Kaelbling, L., Littman, M., and Moore, A. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.

Kappen, H. Path integrals and symmetry breaking for optimal control theory. Journal of Statistical Mechanics: Theory and Experiment, 11, 2005.

Kappen, H., Gomez, V., and Opper, M. Optimal control as a graphical model inference problem. Machine Learning, 87:159–182, 2012.

Khadka, S. and Tumer, K. Evolution-guided policy gradient in reinforcement learning. Advances in neural information processing systems (NIPS), 2018.

Kingma, D. and Welling, M. Auto-encoding variational Bayes. International Conference on Learning Representations (ICLR), 2014.

Kingma, D., Rezende, D., Mohamed, S., and Welling, M. Semi-supervised learning with deep generative models. Advances in neural information processing systems (NIPS), 28:3581–3589, 2014.

Kingma, D., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. Improved variational inference with inverse autoregressive flow. Advances in neural information processing systems (NIPS), 30, 2016.

Koller, D. and Friedman, N. Probabilistic graphical models: Principles and techniques. MIT Press, 2009.

Lehman, J., Chen, J., Clune, J., and Stanley, K. ES is more than just a traditional finite-difference approximator. arXiv preprint arXiv:1712.06568, 2017.

Levine, S. Motor skill learning with local trajectory methods. PhD thesis, Stanford University, 2014.

Levine, S. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909, 2018.

Levy, A., Platt, R., and Saenko, K. Hierarchical actor-critic. arXiv preprint arXiv:1712.00948, 2017.

Li, Y. and Mandt, S. Disentangled sequential autoencoder. International Conference on Machine Learning (ICML), 35, 2018.

Lillicrap, T., Hunt, J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

Manfredi, V. and Mahadevan, S. Hierarchical reinforcement learning using graphical models. ICML workshop on Rich Representations for Reinforcement Learning, 2005.

Moody, J. and White, D. Structural cohesion and embeddedness: A hierarchical concept of social groups. American Sociological Review, pp. 103–127, 2003.

Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. Bridging the gap between value and policy based reinforcement learning. Advances in neural information processing systems (NIPS), 2017.

Neumann, G. Variational inference for policy search in changing situations. International Conference on Machine Learning (ICML), 2011.

Ostrovski, G., Bellemare, M., van den Oord, A., and Munos, R. Count-based exploration with neural density models. International Conference on Machine Learning (ICML), 2018.

Parr, R. and Russell, S. Reinforcement learning with hierarchies of machines. Advances in neural information processing systems (NIPS), pp. 1043–1049, 1998.

Parsons, T. An analytical approach to the theory of social stratification. The American Journal of Sociology, 45:841–862, 1940.

Peters, J., Mulling, K., and Altun, Y. Relative entropy policy search. AAAI Conference on Artificial Intelligence, pp. 1607–1612, 2010.

Rawlik, K., Toussaint, M., and Vijayakumar, S. On stochastic optimal control and reinforcement learning by approximate inference. Robotics: Science and Systems (RSS), 2012.

Rezende, D. and Mohamed, S. Variational inference with normalizing flows. International Conference on Machine Learning (ICML), 32:1530–1538, 2015.

Rezende, D., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. International Conference on Machine Learning (ICML), 31, 2014.

Rothfuss, J., Lee, D., Clavera, I., Asfour, T., and Abbeel, P. ProMP: Proximal meta-policy search. International Conference on Learning Representations (ICLR), 2019.

Schmidhuber, J. On learning to think: Algorithmic information theory for novel combinations of reinforcement learning controllers and recurrent neural world models. arXiv preprint arXiv:1511.09249, 2015.

Schulman, J., Abbeel, P., and Chen, X. Equivalence between policy gradients and soft Q-learning. arXiv preprint arXiv:1704.06440, 2017a.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017b.

Strehl, A., Li, L., and Littman, M. Incremental model-based learners with formal learning-time guarantees. Uncertainty in Artificial Intelligence (UAI), 2006.

Strehl, A., Li, L., and Littman, M. Reinforcement learning in finite MDPs: PAC analysis. Journal of Machine Learning Research (JMLR), pp. 2413–2444, 2009.

Stuhlmuller, A., Taylor, J., and Goodman, N. Learning stochastic inverses. Advances in neural information processing systems (NIPS), 27:3048–3056, 2013.

Sutton, R., Precup, D., and Singh, S. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181–211, 1999.

Tang, H., Houthooft, R., Foote, D., Stooke, A., Chen, X., Duan, Y., Schulman, J., Turck, F. D., and Abbeel, P. #Exploration: A study of count-based exploration for deep reinforcement learning. Advances in neural information processing systems (NIPS), 2017.

Tang, Y. and Agrawal, S. Implicit policy for reinforcement learning. arXiv preprint arXiv:1806.06798, 2018.

Todorov, E. Linearly-solvable Markov decision problems. Advances in neural information processing systems (NIPS), pp. 1369–1376, 2007.

Toussaint, M. Robot trajectory optimization using approximate inference. International Conference on Machine Learning (ICML), 2009.

van den Berg, R., Hasenclever, L., Tomczak, J., and Welling, M. Sylvester normalizing flows for variational inference. Uncertainty in Artificial Intelligence (UAI), 2018.

van der Pol, E. and Oliehoek, F. Coordinated deep reinforcement learners for traffic light control. NIPS 2016 Workshop on Learning, Inference and Control of Multi-agent Systems, 2016.

Wainwright, M. and Jordan, M. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1:1–305, 2008.

Watkins, C. Learning from delayed rewards. PhD thesis, Cambridge University, 1989.

Xu, K., Ratner, E., Dragan, A., Levine, S., and Finn, C. Learning a prior over intent via meta-inverse reinforcement learning. arXiv preprint arXiv:1805.12573, 2018.

Ziebart, B., Maas, A., Bagnell, J., and Dey, A. Maximum entropy inverse reinforcement learning. AAAI Conference on Artificial Intelligence, 2008.