
Hill Climbing on Value Estimates for Search-control in Dyna

Yangchen Pan1, Hengshuai Yao2, Amir-massoud Farahmand3,4 and Martha White1
1Department of Computing Science, University of Alberta, Canada
2Huawei HiSilicon, Canada
3Vector Institute, Canada
4Department of Computer Science, University of Toronto, Canada
[email protected], [email protected], [email protected], [email protected]

Abstract

Dyna is an architecture for model-based reinforcement learning (RL), where simulated experience from a model is used to update policies or value functions. A key component of Dyna is search-control, the mechanism to generate the state and action from which the agent queries the model, which remains largely unexplored. In this work, we propose to generate such states by using the trajectory obtained from Hill Climbing (HC) on the current estimate of the value function. This has the effect of propagating value from high-value regions and of preemptively updating value estimates of the regions that the agent is likely to visit next. We derive a noisy projected natural gradient algorithm for hill climbing, and highlight a connection to Langevin dynamics. We provide an empirical demonstration on four classical domains that our algorithm, HC-Dyna, can obtain significant sample efficiency improvements. We study the properties of different sampling distributions for search-control, and find that there appears to be a benefit specifically from using the samples generated by climbing on current value estimates from low-value to high-value regions.

1 Introduction
Experience replay (ER) [Lin, 1992] is currently the most common way to train value functions approximated as neural networks (NNs) in an online RL setting [Adam et al., 2012; Wawrzynski and Tanwani, 2013]. The buffer in ER is typically a recency buffer, storing the most recent transitions, composed of state, action, next state and reward. At each environment time step, the NN gets updated by using a mini-batch of samples from the ER buffer, that is, the agent replays those transitions. ER enables the agent to be more sample efficient, and in fact can be seen as a simple form of model-based RL [van Seijen and Sutton, 2015]. This connection is specific to the Dyna architecture [Sutton, 1990; Sutton, 1991], where the agent maintains a search-control (SC) queue of pairs of states and actions and uses a model to generate next states and rewards. These simulated transitions are used to update values. ER, then, can be seen as a variant of Dyna with a nonparametric model, where search-control is determined by the observed states and actions.

By moving beyond ER to Dyna with a learned model, we can potentially benefit from increased flexibility in obtaining simulated transitions. Having access to a model allows us to generate unobserved transitions from a given state-action pair. For example, a model allows the agent to obtain on-policy or exploratory samples from a given state, which has been reported to have advantages [Gu et al., 2016; Pan et al., 2018; Santos et al., 2012; Peng et al., 2018]. More generally, models allow for a variety of choices for search-control, which is critical as it emphasizes different states during the planning phase. Prioritized sweeping [Moore and Atkeson, 1993] uses the model to obtain predecessor states, with states sampled according to the absolute value of the temporal difference error. This early work, and more recent work [Sutton et al., 2008; Pan et al., 2018; Corneil et al., 2018], showed this addition significantly outperformed Dyna with states uniformly sampled from observed states. Most of the work on search-control, however, has been limited to sampling visited or predecessor states. Predecessor states require a reverse model, which can be limiting. The range of possibilities for search-control has yet to be explored, and there is room for many more ideas.

In this work, we investigate using trajectories sampled by hill climbing on our learned value function to generate states for search-control. Updating along such trajectories has the effect of propagating value from regions the agent currently believes to be high-value. This strategy enables the agent to preemptively update regions it is likely to visit next. Further, it focuses updates in areas where approximate values are high, and so important to the agent. To obtain such states for search-control, we propose a noisy natural projected gradient algorithm. We show this has a connection to Langevin dynamics, whose distribution converges to the Gibbs distribution, where the density is proportional to the exponential of the state values. We empirically study different sampling distributions for populating the search-control queue, and verify the effectiveness of hill climbing based on estimated values. We conduct experiments showing improved performance in four benchmark domains, as compared to DQN¹, and illustrate the usage of our architecture for continuous control.

¹We use DQN to refer to the algorithm by [Mnih et al., 2015] that uses ER and a target network, but not the exact original architecture.



2 Background
We formalize the environment as a Markov Decision Process (MDP) (S, A, P, R, γ), where S is the state space, A is the action space, P : S × A × S → [0, 1] gives the transition probabilities, R : S × A × S → ℝ is the reward function, and γ ∈ [0, 1] is the discount factor. At each time step t = 1, 2, . . ., the agent observes a state s_t ∈ S and takes an action a_t ∈ A, transitions to s_{t+1} ∼ P(·|s_t, a_t), and receives a scalar reward r_{t+1} ∈ ℝ according to the reward function R.

Typically, the goal is to learn a policy to maximize the expected return starting from some fixed initial state. One popular algorithm is Q-learning, by which we can obtain approximate action-values Q_θ : S × A → ℝ for parameters θ. The policy corresponds to acting greedily according to these action-values: for each state, select the action arg max_{a ∈ A} Q_θ(s, a). The Q-learning update for a sampled transition (s_t, a_t, r_{t+1}, s_{t+1}) is
$$\theta \leftarrow \theta + \alpha \delta_t \nabla_\theta Q_\theta(s_t, a_t), \qquad \delta_t \doteq r_{t+1} + \gamma \max_{a' \in \mathcal{A}} Q_\theta(s_{t+1}, a') - Q_\theta(s_t, a_t).$$

Though frequently used, such an update may not be sound with function approximation. Target networks [Mnih et al., 2015] are typically used to improve stability when training NNs, where the bootstrap target on the next step is a fixed, older estimate of the action-values.
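For concreteness, the following is a minimal sketch of this update with a target network, assuming PyTorch Q-networks; the names (`q_net`, `target_net`) and the batch layout are illustrative rather than the exact implementation used here.

```python
# Minimal sketch of the Q-learning update with a target network.
# `q_net` and `target_net` are assumed torch.nn.Module Q-networks mapping a
# batch of states to per-action values; `done` is a float 0/1 tensor.
import torch

def q_learning_update(q_net, target_net, optimizer, batch, gamma=0.99):
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)   # Q_theta(s_t, a_t)
    with torch.no_grad():
        # Bootstrap target computed with the older, fixed target network.
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    delta = target - q_sa                  # TD error
    loss = 0.5 * (delta ** 2).mean()       # semi-gradient update on the TD error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```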

ER and Dyna can both be used to improve the sample efficiency of DQN. Dyna is a model-based method that simulates (or replays) transitions, to reduce the number of required interactions with the environment. A model is sometimes available a priori (e.g., from physical equations of the dynamics) or is learned using data collected through interacting with the environment. The generic Dyna architecture, with explicit pseudo-code given by [Sutton and Barto, 2018, Chapter 8], can be summarized as follows. When the agent interacts with the real world, it updates both the action-value function and the learned model using the real transition. The agent then performs n planning steps. In each planning step, the agent samples (s, a) from the search-control queue, generates next state s' and reward r from (s, a) using the model, and updates the action-values using Q-learning with the tuple (s, a, r, s').
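A sketch of this generic Dyna loop is given below; `env`, `model`, `search_control` and `q_update` are placeholders for the environment, the (given or learned) model, the search-control queue, and a Q-learning update such as the one sketched above.

```python
# Sketch of one environment step of the generic Dyna loop described above.
def dyna_step(env, s, policy, model, search_control, q_update, n_planning=10):
    a = policy(s)
    s_next, r, done = env.step(a)
    q_update(s, a, r, s_next, done)       # learn from the real transition
    model.update(s, a, r, s_next)         # update the (learned) model
    search_control.add(s, a)              # record the state-action pair for planning
    for _ in range(n_planning):           # planning: update from simulated experience
        s_p, a_p = search_control.sample()
        r_p, s_p_next = model.predict(s_p, a_p)
        q_update(s_p, a_p, r_p, s_p_next, False)
    return env.reset() if done else s_next
```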

3 A Motivating Example
In this section we provide an example of how the value function surface changes during learning on a simple continuous-state GridWorld domain. This provides intuition on why it is useful to populate the search-control queue with states obtained by hill climbing on the estimated value function, as proposed in the next section.

Consider the GridWorld in Figure 1(a), which is a variant of the one introduced by [Peng and Williams, 1993]. In each episode, the agent starts from a uniformly sampled point in the area [0, 0.05]^2 and terminates when it reaches the goal area [0.95, 1.0]^2. There are four actions {UP, DOWN, LEFT, RIGHT}; each leads to a 0.05 unit move in the corresponding direction. As a cost-to-goal problem, the reward is -1 per step.
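A rough sketch of such a domain (start region, goal region, four 0.05-unit moves, reward -1 per step) is shown below; it illustrates the setup rather than reproducing the exact domain used in the experiments (obstacle handling is omitted).

```python
# Illustrative continuous-state GridWorld on [0, 1]^2.
import numpy as np

class GridWorld:
    ACTIONS = {0: (0.0, 0.05), 1: (0.0, -0.05), 2: (-0.05, 0.0), 3: (0.05, 0.0)}  # UP, DOWN, LEFT, RIGHT

    def reset(self):
        self.s = np.random.uniform(0.0, 0.05, size=2)   # start area [0, 0.05]^2
        return self.s.copy()

    def step(self, a):
        self.s = np.clip(self.s + np.array(self.ACTIONS[a]), 0.0, 1.0)
        done = bool(np.all(self.s >= 0.95))             # goal area [0.95, 1.0]^2
        return self.s.copy(), -1.0, done                # cost-to-goal: -1 per step
```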

In Figure 1, we plot the value function surface after 0, 14k, and 20k mini-batch updates to DQN. We visualize the gradient ascent trajectories with 100 gradient steps starting from five states: (0.1, 0.1), (0.9, 0.9), (0.1, 0.9), (0.9, 0.1), and (0.3, 0.4). The gradient of the value function used in the gradient ascent is
$$\nabla_s V(s) = \nabla_s \max_a Q_\theta(s, a). \quad (1)$$
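This gradient is taken with respect to the state, not the parameters, and can be obtained with automatic differentiation. A minimal sketch, assuming a PyTorch Q-network `q_net` that maps a batch of states to per-state action-values:

```python
# Sketch of computing the value gradient of Eq. (1), nabla_s max_a Q_theta(s, a).
import torch

def value_gradient(q_net, s):
    s = s.clone().detach().requires_grad_(True)   # differentiate w.r.t. the state, not theta
    v = q_net(s).max(dim=1).values.sum()          # V(s) = max_a Q_theta(s, a), summed over the batch
    (grad_s,) = torch.autograd.grad(v, s)
    return grad_s                                  # same shape as the state batch
```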

At the beginning, with a randomly initialized NN, the gradient with respect to the state is almost zero, as seen in Figure 1(b). As the DQN agent updates its parameters, the gradient ascent generates trajectories directed towards the goal, though after only 14k steps these are not yet contiguous, as seen in Figure 1(c). After 20k steps, as in Figure 1(d), even though the value function is still inaccurate, the gradient ascent trajectories take all initial states to the goal area. This suggests that as long as the estimated value function roughly reflects the shape of the optimal value function, the trajectories provide a demonstration of how to reach the goal, or high-value regions, and speed up learning by focusing updates on these relevant regions.

More generally, by focusing planning on regions the agent thinks are high-value, it can quickly correct value function estimates before visiting those regions, and so avoid unnecessary interaction. We demonstrate this in Figure 1(e), where the agent obtains gains in performance by updating from high-value states, even when its value estimates have the wrong shape. After 20k learning steps, the values are flipped by negating the sign of the parameters in the output layer of the NN. HC-Dyna, introduced in Section 5, quickly recovers compared to DQN and OnPolicy updates from the ER buffer. The planning steps help push down these erroneously high values, and the agent recovers much more quickly.

4 Effective Hill Climbing
To generate states for search-control, we need an algorithm that can climb on the estimated value function surface. For general value function approximators, such as NNs, this can be difficult. The value function surface can be very flat or very rugged, causing the gradient ascent to get stuck in local optima and hence interrupt the gradient traveling process. Further, the state variables may have very different numerical scales. When using a regular gradient ascent method, it is likely for the state variables with a smaller numerical scale to immediately go out of the state space. Lastly, gradient ascent is unconstrained, potentially generating unrealizable states.

In this section, we propose solutions for all these issues. We provide a noisy invariant projected gradient ascent strategy to generate meaningful trajectories of states for search-control. We then discuss connections to Langevin dynamics, a model for heat diffusion, which provides insight into the sampling distribution of our search-control queue.

4.1 Noisy Natural Projected Gradient Ascent
To address the first issue, of flat or rugged function surfaces, we propose to add Gaussian noise on each gradient ascent step. Intuitively, this provides robustness to flat regions and avoids getting stuck in local maxima on the function surface, by diffusing across the surface to high-value regions.

To address the second issue of vastly different numerical scales among state variables, we use a standard strategy to be invariant to scale: natural gradient ascent.


[Figure 1: (a) The GridWorld domain. (b)-(d) The value function on the GridWorld domain with gradient ascent trajectories: (b) before updating, (c) after 14k updates, (d) after 20k updates. (e) Learning curves (sum of rewards per episode vs. time steps, 30 runs) where each algorithm needs to recover from a bad NN initialization obtained by negating the weights of the output layer (i.e., the value function looks like the reverse of (d)).]

[Figure 2: The search-control queue (position vs. velocity) filled by using (+) or not using (o) the natural gradient on MountainCar-v0.]

A popular choice of natural gradient is derived by defining the metric tensor as the Fisher information matrix [Amari and Douglas, 1998; Amari, 1998; Thomas et al., 2016]. We introduce a simple and computationally efficient metric tensor: the inverse of the covariance matrix of the states, Σ_s^{-1}. This choice is simple because the covariance matrix can easily be estimated online. We can define the following inner product:
$$\langle s, s' \rangle = s^\top \Sigma_s^{-1} s', \quad \forall s, s' \in \mathcal{S},$$
which induces a vector space (a Riemannian manifold) where we can compute the distance between two points s and s + Δ that are close to each other by d(s, s+Δ) ≐ Δ^⊤ Σ_s^{-1} Δ. The steepest ascent update based on this distance metric becomes s ← s + α Σ_s g, where g is the gradient vector.

We demonstrate the utility of using the natural gradient scaling. Figure 2 shows the states from the search-control queue filled by hill climbing in the early stages of learning (after 8000 steps) on MountainCar. The domain has two state variables with very different numerical scales: position ∈ [-1.2, 0.6] and velocity ∈ [-0.07, 0.07]. Using a regular gradient update, the queue shows a state distribution with many states concentrated near the top, since it is very easy for the velocity variable to go out of bounds. In contrast, the one with the natural gradient shows clear trajectories with an obvious tendency towards the top right area (position ≥ 0.5), which is the goal area.

We use projected gradient updates to address the third issue regarding unrealizable states. We explain the issue and solution using the Acrobot domain. The first two state variables are cos θ and sin θ, where θ is the angle between the first robot arm's link and the vector pointing downwards. This induces the restriction that cos²θ + sin²θ = 1. The hill climbing process could generate many states that do not satisfy this restriction. This could potentially degrade performance, since the NN needs to generalize to these states unnecessarily. We can use a projection operator Π to enforce such restrictions, whenever known, after each gradient ascent step. In Acrobot, Π is a simple normalization. In many settings, the constraints are simple box constraints, with projection just inside the boundary. Now we are ready to introduce our final hill climbing rule:

$$s \leftarrow \Pi\left(s + \alpha \Sigma_s g + N\right), \quad (2)$$
where N is Gaussian noise and α a step size. For simplicity, we set the step size to α = 0.1/||Σ_s g|| across all results in this work, though of course there could be better choices.
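A minimal sketch of one step of this rule, with the step size α = 0.1/||Σ_s g|| used above, is given below; `value_grad` and `project` stand in for the value gradient of Eq. (1) and a domain-specific projection, and the small constant added to the norm is only for numerical safety.

```python
# One noisy natural projected gradient ascent step, Eq. (2).
import numpy as np

def hill_climb_step(s, value_grad, Sigma_s, project, eta=0.1, rng=np.random):
    s = np.asarray(s, dtype=float)
    g = value_grad(s)                               # nabla_s max_a Q_theta(s, a)
    scaled = Sigma_s @ g                            # natural-gradient scaling by the state covariance
    alpha = 0.1 / (np.linalg.norm(scaled) + 1e-8)   # step size as above, plus a small safety constant
    noise = rng.multivariate_normal(np.zeros_like(s), eta * Sigma_s)
    return project(s + alpha * scaled + noise)      # projection keeps the state realizable
```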

4.2 Connection to Langevin Dynamics
The proposed hill climbing procedure is similar to Langevin dynamics, which is frequently used as a tool to analyze optimization algorithms or to acquire an estimate of the expected parameter values w.r.t. some posterior distribution in Bayesian learning [Welling and Teh, 2011]. The overdamped Langevin dynamics can be described by a stochastic differential equation (SDE)
$$dW_t = \nabla U(W_t)\, dt + \sqrt{2}\, dB_t,$$
where B_t ∈ ℝ^d is a d-dimensional Brownian motion and U is a continuously differentiable function. Under some conditions, it turns out that the Langevin diffusion (W_t)_{t≥0} converges to a unique invariant distribution p(x) ∝ exp(U(x)) [Chiang et al., 1987]. By applying the Euler-Maruyama discretization scheme to the SDE, we acquire the discretized version
$$Y_{k+1} = Y_k + \alpha_{k+1} \nabla U(Y_k) + \sqrt{2\alpha_{k+1}}\, Z_{k+1},$$
where (Z_k)_{k≥1} is an i.i.d. sequence of standard d-dimensional Gaussian random vectors and (α_k)_{k≥1} is a sequence of step sizes. This discretization scheme was used to acquire samples from the original invariant distribution p(x) ∝ exp(U(x)) through the Markov chain (Y_k)_{k≥1} when it converges to the chain's stationary distribution [Roberts and Tweedie, 1996]. The distance between the limiting distribution of (Y_k)_{k≥1} and the invariant distribution of the underlying SDE has been characterized through various bounds [Durmus and Moulines, 2017]. When we perform hill climbing, the parameter θ is constant at each time step t. By choosing the function U in the SDE above to be equal to V_θ, we see that the state distribution p(s) in our search-control queue is approximately²
$$p(s) \propto \exp(V_\theta(s)).$$

²Different assumptions on (α_k)_{k≥1} and properties of U can give convergence claims with different strengths. Also refer to [Welling and Teh, 2011] for the discussion on the use of a preconditioner.
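As a toy illustration of this limiting behavior (separate from the search-control procedure itself), the discretization above can be run for a simple quadratic U, for which exp(U) is a standard Gaussian density:

```python
# Euler-Maruyama iterates Y_{k+1} = Y_k + alpha * grad U(Y_k) + sqrt(2*alpha) * Z_{k+1}.
# After a burn-in period, the iterates are approximately distributed as p(y) ~ exp(U(y)).
import numpy as np

def langevin_samples(grad_U, y0, alpha=0.01, n_steps=20000, rng=np.random.default_rng(0)):
    y, out = np.asarray(y0, dtype=float), []
    for _ in range(n_steps):
        y = y + alpha * grad_U(y) + np.sqrt(2 * alpha) * rng.standard_normal(y.shape)
        out.append(y.copy())
    return np.array(out)

# Example: U(y) = -0.5 * ||y||^2, so exp(U) is a standard Gaussian density.
samples = langevin_samples(lambda y: -y, y0=np.zeros(2))
print(samples[5000:].mean(axis=0), samples[5000:].std(axis=0))  # roughly 0 and 1
```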


Algorithm 1 HC-Dyna
Input: budget k for the number of gradient ascent steps (e.g., k = 100), stochasticity η for gradient ascent (e.g., η = 0.1), percentage ρ of updates from the SC queue (e.g., ρ = 0.5), d the number of state variables, i.e., S ⊂ ℝ^d
Initialize empty SC queue B_sc and ER buffer B_er
Σ_s ← I (empirical covariance matrix)
μ_ss ← 0 ∈ ℝ^{d×d}, μ_s ← 0 ∈ ℝ^d (auxiliary variables for computing the empirical covariance matrix; sample averages will be maintained for μ_ss, μ_s)
ε_a ← 0 (threshold for accepting a state)
for t = 1, 2, . . . do
    Observe (s, a, s', r) and add it to B_er
    μ_ss ← (μ_ss (t - 1) + s s^⊤) / t,  μ_s ← (μ_s (t - 1) + s) / t
    Σ_s ← μ_ss - μ_s μ_s^⊤
    ε_a ← (1 - β) ε_a + β · distance(s, s') for β = 0.001
    Sample s_0 from B_er, s̃ ← ∞
    for i = 0, . . . , k do
        g_{s_i} ← ∇_s V(s_i) = ∇_s max_a Q_θ(s_i, a)
        s_{i+1} ← Π(s_i + (0.1 / ||Σ_s g_{s_i}||) Σ_s g_{s_i} + X_i),  X_i ∼ N(0, η Σ_s)
        if distance(s̃, s_{i+1}) ≥ ε_a then
            Add s_{i+1} to B_sc, s̃ ← s_{i+1}
    for n times do
        Sample a mixed mini-batch b, with proportion ρ from B_sc and 1 - ρ from B_er
        Update parameters θ (i.e., a DQN update) with b

An important difference between the theoretical limiting distribution and the actual distribution acquired by our hill climbing method is that our trajectories also include the states visited during the burn-in or transient period, which refers to the period before the stationary behavior is reached. We point out that those states play an essential role in improving learning efficiency, as we will demonstrate in Section 6.2.

5 Hill Climbing Dyna
In this section, we provide the full algorithm, called Hill Climbing Dyna, summarized in Algorithm 1. The key component is to use the Hill Climbing procedure developed in the previous section to generate states for search-control (SC). To ensure some separation between states in the search-control queue, we use a threshold ε_a to decide whether or not to add a state to the queue. We use a simple heuristic to set this threshold on each step, as the following sample average:
$$\epsilon_a \approx \epsilon_a^{(T)} = \frac{1}{T} \sum_{t=1}^{T} \frac{\|s_{t+1} - s_t\|_2}{\sqrt{d}}.$$
The start state for the gradient ascent is randomly sampled from the ER buffer.

In addition to using this new method for search-control, we also found it beneficial to include updates on the experience generated in the real world. The mini-batch sampled for training has a proportion ρ of transitions generated by states from the SC queue, and 1 - ρ from the ER buffer. For example, for ρ = 0.75 with a mini-batch size of 32, the update consists of 24 (= 32 × 0.75) transitions generated from states in the SC queue and 8 transitions from the ER buffer. Previous work using Dyna for learning NN value functions also used such mixed mini-batches [Holland et al., 2018].
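A sketch of assembling such a mixed mini-batch is given below; `model`, `sc_queue`, `er_buffer` and `policy` are placeholders, the action used to roll out a search-control state is left as a design choice (e.g., on-policy or exploratory), and simulated transitions are treated as non-terminal here for simplicity.

```python
# Assemble a mixed mini-batch: a rho fraction of simulated transitions from
# search-control states, the rest from the ER buffer. With rho = 0.75 and a
# batch size of 32, this yields 24 simulated transitions and 8 real ones.
def mixed_minibatch(sc_queue, er_buffer, model, policy, rho=0.5, batch_size=32):
    n_sc = int(round(rho * batch_size))
    batch = []
    for s in sc_queue.sample(n_sc):            # states from the search-control queue
        a = policy(s)                          # action chosen for the queried state
        r, s_next = model.predict(s, a)        # simulated transition from the model
        batch.append((s, a, r, s_next, False))
    batch.extend(er_buffer.sample(batch_size - n_sc))   # real transitions
    return batch
```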

[Figure 3: The effect of mixing rate on learning performance: (a) (Continuous-state) GridWorld (sum of rewards per episode, 30 runs); (b) Tabular GridWorld (number of steps per episode, 50 runs). Each numerical label is HC-Dyna with that mixing rate.]

One potential reason this addition is beneficial is that it alleviates issues with heavily skewing the sampling distribution to be off-policy. Tabular Q-learning is an off-policy learning algorithm, which has strong convergence guarantees under mild assumptions [Tsitsiklis, 1994]. When moving to function approximation, however, convergence of Q-learning is much less well understood. The change in sampling distribution for the states could significantly impact convergence rates, and potentially even cause divergence. Empirically, previous prioritized ER work pointed out that skewing the sampling distribution from the ER buffer can lead to a biased solution [Schaul et al., 2016]. Though the ER buffer is not on-policy, because the policy is continually changing, the distribution of states is closer to the states that would be sampled by the current policy than those in SC. Using mixed states from the ER buffer, and those generated by Hill Climbing, could alleviate some of the issues with this skewness. Another possible reason that such mixed sampling could be necessary is model error. The use of real experience could mitigate issues with such error. We found, however, that this mixing has an effect even when using the true model. This suggests that this phenomenon is indeed related to the distribution over states.

We provide a small experiment in the GridWorld depicted in Figure 1, using both a continuous-state and a discrete-state version. We include a discrete-state version so we can demonstrate that the effect persists even in a tabular setting, where Q-learning is known to be stable. The continuous-state setting uses NNs, as described more fully in Section 6, with a mini-batch size of 32. For the tabular setting, the mini-batch size is 1; updates are randomly selected to be from the SC queue or ER buffer proportional to ρ. Figure 3 shows the performance of HC-Dyna as the mixing proportion increases from 0 (ER only) to 1.0 (SC only). In both cases, a mixing rate around ρ = 0.5 provides the best results. Generally, using too few search-control samples does not improve performance; focusing too many updates on search-control samples seems to slightly speed up early learning, but then later learning suffers. In all further experiments in this paper, we set ρ = 0.5.

6 Experiments
In this section, we demonstrate the utility of (DQN-)HC-Dyna in several benchmark domains, and then analyze the learning effect of different sampling distributions used to generate states for the search-control queue.


[Figure 4: Evaluation curves (sum of episodic reward vs. environment time steps) of (DQN-)HC-Dyna, (DQN-)OnPolicy-Dyna, and DQN on GridWorld (a-c), MountainCar-v0 (d-f), CartPole-v1 (g-i), and Acrobot-v1 (j-l), with 1, 10, and 30 planning steps per column. Curves plotted with dotted lines use online learned models. Results are averaged over 30 random seeds.]

6.1 Results in Benchmark Domains
In this section, we present empirical results on four classic domains: the GridWorld (Figure 1(a)), MountainCar, CartPole, and Acrobot. We present both discrete- and continuous-action results in the GridWorld, and compare to DQN for the discrete control and to Deep Deterministic Policy Gradient (DDPG) for the continuous control [Lillicrap et al., 2016]. The agents all use a two-layer NN, with ReLU activations and 32 nodes in each layer. We include results using both the true model and the learned model on the same plots. We further include multiple planning steps n, where for each real environment step, the agent does n updates with a mini-batch of size 32. In addition to ER, we add an on-policy baseline called OnPolicy-Dyna. This algorithm samples a mini-batch of states (not the full transitions) from the ER buffer, but then generates the next state and reward using an on-policy action. This baseline distinguishes whether the gain of the HC-Dyna algorithm is due to on-policy sampled actions, rather than to the states in our search-control queue.

Discrete Action
The results in Figure 4 show that (a) HC-Dyna never harms performance over ER and OnPolicy-Dyna, and in some cases significantly improves performance, (b) these gains persist even under learned models, and (c) there are clear gains from HC-Dyna even with a small number of planning steps. Interestingly, using multiple mini-batch updates per time step can significantly improve the performance of all the algorithms. DQN, however, has very limited gains when moving from 10 to 30 planning steps on all domains except GridWorld, whereas HC-Dyna seems to improve more noticeably with more planning steps. This implies a possible limit on the usefulness of only using samples in the ER buffer. We observe that the on-policy actions do not always help. The GridWorld domain is in fact the only one where on-policy actions (OnPolicy-Dyna) show an advantage as the number of planning steps increases. This result provides evidence that the gain of our algorithm is due to the states in our search-control queue, rather than on-policy sampled actions. We also see that even though both model-based methods perform worse when the model has to be learned compared to when the true model is available, HC-Dyna is consistently better than OnPolicy-Dyna across all domains and settings.

To gain intuition for why our algorithm achieves superior performance, we visualize the states in the search-control queue for HC-Dyna in the GridWorld domain (Figure 5). We also show the states in the ER buffer at the same time step, for both HC-Dyna and DQN, to contrast. There are two interesting outcomes from this visualization. First, the modification to search-control significantly changes where the agent explores, as evidenced by the ER buffer distribution. Second, HC-Dyna has many states in the SC queue that are near the goal region, even when its ER buffer samples concentrate on the left part of the square. The agent can still update around the goal region even when it is physically in the left part of the domain.

[Figure 5: (a) and (b) show the buffer (red ·) / queue (black +) distribution on the GridWorld (s ∈ [0, 1]^2) by uniformly sampling 2k states, for (a) DQN and (b) HC-Dyna. (a) shows the ER buffer when running DQN, hence there is no "+" in it. (b) shows that 0.2% of the ER samples fall in the green shaded high-value region, while 27.8% of the samples from the SC queue are there.]

Continuous Control
Our architecture can easily be used with continuous actions, as long as the algorithm estimates values. We use DDPG [Lillicrap et al., 2016] as an example for use inside HC-Dyna. DDPG is an actor-critic algorithm that uses the deterministic policy gradient theorem [Silver et al., 2014]. Let π_ψ(·) : S → A be the actor network parameterized by ψ, and Q_θ(·, ·) : S × A → ℝ be the critic. Given an initial state value s, the gradient ascent direction can be computed as ∇_s Q_θ(s, π_ψ(s)). In fact, because each gradient step causes only small changes, we can further approximate this gradient more efficiently using ∇_s Q_θ(s, a*) with a* ≐ π_ψ(s), without backpropagating the gradient through the actor network.


[Figure 6: HC-Dyna for continuous control with DDPG on the modified GridWorld (sum of rewards per episode, 100 runs), comparing DDPG, DDPG-OnPolicy-Dyna, and DDPG-HC-Dyna. We used 5 planning steps in this experiment.]

We modified the GridWorld in Figure 1(a) to have action space A = [-1, 1]^2, where an action a_t ∈ A is executed as s_{t+1} ← s_t + 0.05 a_t. Figure 6 shows the learning curves of DDPG, and of DDPG with OnPolicy-Dyna and with HC-Dyna. As before, HC-Dyna shows significant early learning benefits and also reaches a better solution. This highlights that improved search-control could be particularly effective for algorithms that are known to be prone to local minima, like actor-critic algorithms.
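A minimal sketch of the hill-climbing direction used with DDPG, with the action fixed so that no gradient flows through the actor, assuming PyTorch actor and critic networks operating on a batch of states:

```python
# nabla_s Q_theta(s, a*) with a* = pi_psi(s) held fixed (no gradient through the actor).
import torch

def ddpg_value_gradient(critic, actor, s):
    s = s.clone().detach().requires_grad_(True)
    a_star = actor(s).detach()              # fix a* = pi_psi(s)
    q = critic(s, a_star).sum()             # sum over the batch of states
    (grad_s,) = torch.autograd.grad(q, s)
    return grad_s
```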

6.2 Investigating Sampling Distributions for Search-control
We next investigate the importance of two choices in HC-Dyna: (a) using trajectories to high-value regions and (b) using the agent's value estimates to identify these important regions. To test this, we include the following sampling methods for comparison: (a) HC-Dyna: hill climbing on V_θ (our algorithm); (b) Gibbs: sampling ∝ exp(V_θ); (c) HC-Dyna-Vstar: hill climbing on V*; and (d) Gibbs-Vstar: sampling ∝ exp(V*), where V* is a pre-learned optimal value function. We also include the baselines OnPolicy-Dyna, ER, and Uniform-Dyna, which uniformly samples states from the whole state space. All strategies mix with ER, using ρ = 0.5, to better give insight into performance differences. To facilitate sampling from the Gibbs distribution and computing the optimal value function, we test on a simplified Tabular GridWorld domain of size 20 × 20, without any obstacles. Each state is represented by an integer i ∈ {1, ..., 400}, assigned from bottom to top, left to right, on the square of 20 × 20 grid cells. HC-Dyna and HC-Dyna-Vstar assume that the state space is continuous on the square [0, 1]^2, and each grid cell is represented by its center's (x, y) coordinates. We use the finite difference method for hill climbing.

Comparing to the Gibbs Distribution
As we pointed out in the connection to Langevin dynamics in Section 4.2, the limiting behavior of our hill climbing strategy is approximately a Gibbs distribution. Figure 7(a) shows that HC-Dyna performs the best among all sampling distributions, including Gibbs and the other baselines. This result suggests that the states visited during the burn-in period matter. Figure 7(b) shows the state counts obtained by randomly sampling the same number of states from HC-Dyna's search-control queue and from that filled by the Gibbs distribution. We can see that the Gibbs one concentrates its distribution only on very high-value states.
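For reference, the Gibbs baseline on this tabular domain amounts to a softmax over the 400 value estimates; a minimal sketch (with V a length-400 array of value estimates) is:

```python
# Sample state indices with probability proportional to exp(V(s)).
import numpy as np

def gibbs_sample_states(V, n, rng=np.random.default_rng(0)):
    logits = V - V.max()                    # subtract the max for numerical stability
    p = np.exp(logits) / np.exp(logits).sum()
    return rng.choice(len(V), size=n, p=p)  # sampled state indices in {0, ..., 399}
```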

[Figure 7: (a) Evaluation curve on the Tabular GridWorld (number of steps per episode, 50 runs) for ER, OnPolicy-Dyna, Gibbs, HC-Dyna, and Uniform-Dyna. (b) Search-control distribution (state count over states 1 to 400) for HC-Dyna vs. Gibbs.]

[Figure 8: (a) Evaluation curve on the Tabular GridWorld (number of steps per episode, 50 runs) for Gibbs, Gibbs-Vstar, HC-Dyna, and HC-Dyna-Vstar. (b) Search-control distribution (state count over states 1 to 400) for HC-Dyna-Vstar vs. Gibbs-Vstar.]

Comparing to True Values
One hypothesis is that the value estimates guide the agent to the goal. A natural comparison, then, is to use the optimal values, which should point the agent directly to the goal. Figure 8(a) indicates that using the estimates, rather than the true values, is more beneficial for planning. This result highlights that there does seem to be some additional value in focusing updates based on the agent's current value estimates. Comparing the state distributions of Gibbs-Vstar and HC-Dyna-Vstar in Figure 8(b) to Gibbs and HC-Dyna in Figure 7(b), one can see that both distributions are even more concentrated, which seems to negatively impact performance.

7 Conclusion
We presented a new Dyna algorithm, called HC-Dyna, which generates states for search-control by hill climbing on value estimates. We proposed a noisy natural projected gradient ascent strategy for the hill climbing process. We demonstrated that using states from hill climbing can significantly improve sample efficiency in several benchmark domains. We empirically investigated, and validated, several choices in our algorithm, including the use of natural gradients, the utility of mixing with ER samples, and the benefits of using estimated values for search-control. A natural next step is to further investigate other criteria for assigning importance to states. Our HC strategy is generic for any smooth function, not only value estimates. A possible alternative is to investigate importance based on error in a region, or based more explicitly on optimism or uncertainty, to encourage systematic exploration.

Acknowledgments
We would like to acknowledge funding from the Canada CIFAR AI Chairs Program, Amii, and NSERC.


References
[Adam et al., 2012] Sander Adam, Lucian Busoniu, and Robert Babuska. Experience replay for real-time reinforcement learning control. IEEE Transactions on Systems, Man, and Cybernetics, pages 201–212, 2012.
[Amari and Douglas, 1998] Shun-Ichi Amari and Scott C. Douglas. Why natural gradient? IEEE International Conference on Acoustics, Speech and Signal Processing, pages 1213–1216, 1998.
[Amari, 1998] Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
[Chiang et al., 1987] Tzuu-Shuh Chiang, Chii-Ruey Hwang, and Shuenn Jyi Sheu. Diffusion for global optimization in R^n. SIAM Journal on Control and Optimization, pages 737–753, 1987.
[Corneil et al., 2018] Dane S. Corneil, Wulfram Gerstner, and Johanni Brea. Efficient model-based deep reinforcement learning with variational state tabulation. In ICML, pages 1049–1058, 2018.
[Durmus and Moulines, 2017] Alain Durmus and Eric Moulines. Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. The Annals of Applied Probability, pages 1551–1587, 2017.
[Gu et al., 2016] Shixiang Gu, Timothy P. Lillicrap, Ilya Sutskever, and Sergey Levine. Continuous Deep Q-Learning with Model-based Acceleration. In ICML, pages 2829–2838, 2016.
[Holland et al., 2018] G. Zacharias Holland, Erik Talvitie, and Michael Bowling. The effect of planning shape on Dyna-style planning in high-dimensional state spaces. CoRR, abs/1806.01825, 2018.
[Lillicrap et al., 2016] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In ICLR, 2016.
[Lin, 1992] Long-Ji Lin. Self-Improving Reactive Agents Based On Reinforcement Learning, Planning and Teaching. Machine Learning, 1992.
[Mnih et al., 2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, et al. Human-level control through deep reinforcement learning. Nature, 2015.
[Moore and Atkeson, 1993] Andrew W. Moore and Christopher G. Atkeson. Prioritized sweeping: Reinforcement learning with less data and less time. Machine Learning, pages 103–130, 1993.
[Pan et al., 2018] Yangchen Pan, Muhammad Zaheer, Adam White, Andrew Patterson, and Martha White. Organizing experience: a deeper look at replay mechanisms for sample-based planning in continuous state domains. In IJCAI, pages 4794–4800, 2018.
[Peng and Williams, 1993] Jing Peng and Ronald J. Williams. Efficient learning and planning within the Dyna framework. Adaptive Behavior, 1993.
[Peng et al., 2018] Baolin Peng, Xiujun Li, Jianfeng Gao, Jingjing Liu, Kam-Fai Wong, and Shang-Yu Su. Deep Dyna-Q: Integrating planning for task-completion dialogue policy learning. In Annual Meeting of the Association for Computational Linguistics, pages 2182–2192, 2018.

[Roberts and Tweedie, 1996] Gareth O. Roberts and Richard L. Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, pages 341–363, 1996.

[Santos et al., 2012] Matilde Santos, Jose Antonio Martın H., Victoria Lopez, and Guillermo Botella. Dyna-H: A heuristic planning reinforcement learning algorithm applied to role-playing game strategy decision systems. Knowledge-Based Systems, 32:28–36, 2012.
[Schaul et al., 2016] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized Experience Replay. In ICLR, 2016.
[Silver et al., 2014] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In ICML, pages I-387–I-395, 2014.
[Sutton and Barto, 2018] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, second edition, 2018.
[Sutton et al., 2008] Richard S. Sutton, Csaba Szepesvari, Alborz Geramifard, and Michael Bowling. Dyna-style planning with linear function approximation and prioritized sweeping. In UAI, pages 528–536, 2008.
[Sutton, 1990] Richard S. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In ML, 1990.
[Sutton, 1991] Richard S. Sutton. Dyna, an integrated architecture for learning, planning, and reacting. SIGART Bulletin, 2(4):160–163, 1991.
[Thomas et al., 2016] Philip Thomas, Bruno Castro Silva, Christoph Dann, and Emma Brunskill. Energetic natural gradient descent. In ICML, pages 2887–2895, 2016.
[Tsitsiklis, 1994] John N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, pages 185–202, 1994.
[van Seijen and Sutton, 2015] Harm van Seijen and Richard S. Sutton. A deeper look at planning as learning from replay. In ICML, pages 2314–2322, 2015.
[Wawrzynski and Tanwani, 2013] Paweł Wawrzynski and Ajay Kumar Tanwani. Autonomous reinforcement learning with experience replay. Neural Networks, pages 156–167, 2013.
[Welling and Teh, 2011] Max Welling and Yee Whye Teh. Bayesian learning via stochastic gradient Langevin dynamics. In ICML, pages 681–688, 2011.
