

Research Article
Ensemble Network Architecture for Deep Reinforcement Learning

Xi-liang Chen, Lei Cao, Chen-xi Li, Zhi-xiong Xu, and Jun Lai

Institute of Command Information System, PLA University of Science and Technology, No. 1, Hai Fu Road, Guang Hua Road, Qin Huai District, Nanjing City, Jiangsu Province 210007, China

Correspondence should be addressed to Xi-liang Chen; 383618393@qq.com

Received 8 September 2017; Revised 10 February 2018; Accepted 20 February 2018; Published 5 April 2018

Academic Editor: Jian G. Zhou

Copyright © 2018 Xi-liang Chen et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Hindawi, Mathematical Problems in Engineering, Volume 2018, Article ID 2129393, 6 pages, https://doi.org/10.1155/2018/2129393

The popular deep Q-learning algorithm is known to be unstable because of the Q-value's shake and overestimation of action values under certain conditions. These issues tend to adversely affect performance. In this paper, we develop an ensemble network architecture for deep reinforcement learning which is based on value function approximation. The temporal ensemble stabilizes the training process by reducing the variance of the target approximation error, and the ensemble of target values reduces the overestimate and yields better performance by estimating a more accurate Q-value. Our results show that this architecture leads to statistically significantly better value evaluation and more stable and better performance on several classical control tasks in the OpenAI Gym environment.

1. Introduction

Reinforcement learning (RL) algorithms [1, 2] are very suitable for learning to control an agent by letting it interact with an environment. In recent years, deep neural networks (DNN) have been introduced into reinforcement learning, and they have achieved great success on value function approximation. The first deep Q-network (DQN) algorithm, which successfully combines a powerful nonlinear function approximation technique known as DNN together with the Q-learning algorithm, was proposed by Mnih et al. [3]; in that paper, the experience replay mechanism was also proposed. Following the DQN work, a variety of solutions have been proposed to stabilize the algorithms [3-9]. The deep Q-network classes have achieved unprecedented success in challenging domains such as Atari 2600 and some other games.

Although DQN algorithms have been successful in solving many problems because of their powerful function approximation ability and strong generalization between similar state inputs, they are still poor at solving some issues. Two reasons for this are as follows: (a) the randomness of the sampling is likely to lead to severe oscillations, and (b) these systematic errors might cause instability, poor performance, and sometimes divergence of learning. In order to address these issues, the averaged target DQN (ADQN) [10] algorithm was implemented to construct target values by combining target Q-networks continuously with a single learning network, and the Bootstrapped DQN [11] algorithm was proposed to obtain more efficient exploration and better performance with the use of several Q-networks learning in parallel. Although these algorithms do reduce the overestimate, they do not evaluate the importance of the past learned networks. Besides, high variance in target values combined with the max operator still exists.

There are some ensemble algorithms [4, 12] solving this issue in reinforcement learning, but these existing algorithms are not compatible with nonlinearly parameterized value functions.

In this paper, we propose the ensemble algorithm as a solution to this problem. In order to enhance learning speed and final performance, we combine multiple reinforcement learning algorithms in a single agent with several ensemble algorithms to determine the actions or action probabilities. In supervised learning, ensemble algorithms such as bagging, boosting, and mixtures of experts [13] are often used for learning and combining multiple classifiers. But in RL, ensemble algorithms are used for representing and learning the value function.



Based on an agent integrated with multiple reinforcement learning algorithms, multiple value functions are learned at the same time. The ensembles combine the policies derived from the value functions into a final policy for the agent. Majority voting (MV), rank voting (RV), Boltzmann multiplication (BM), and Boltzmann addition (BA) are used to combine RL algorithms. Since these methods are costly in deep reinforcement learning (DRL), we instead combine different DRL algorithms that learn separate value functions and policies. Therefore, in our ensemble approaches, we combine the different policies derived from the update targets learned by deep Q-networks, deep Sarsa networks, double deep Q-networks, and other DRL algorithms. As a consequence, this leads to reduced overestimations, a more stable learning process, and improved performance.

2. Related Work

2.1. Reinforcement Learning. Reinforcement learning is a machine learning method that allows the system to interact with and learn from the environment to maximize cumulative return rewards. Assume the standard reinforcement learning setting, where an agent interacts with the environment $\varepsilon$. We can describe this process with Markov Decision Processes (MDP) [2, 9]. It can be specified as a tuple $(S, A, \pi, R, \gamma)$. At each step $t$, the agent receives a state $s_t$ and selects an action $a_t$ from the set of legal actions $A$ according to the policy $\pi$, where $\pi$ is a policy mapping sequences to actions. The action is passed to the environment $E$. In addition, the agent receives the next state $s_{t+1}$ and a reward signal $r_t$. This process continues until the agent reaches a terminal state.

The agent seeks to maximize the expected discounted return, where we define the future discounted return at time $t$ as $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$, with discount factor $\gamma \in (0, 1]$. The goal of the RL agent is to learn a policy which maximizes the future discounted return. For an agent behaving according to a stochastic policy $\pi$, the value of the state-action pair can be defined as follows: $Q^{\pi}(s, a) = E[R_t \mid s_t = s, a_t = a, \pi]$. The optimal action-value function $Q^*$ satisfies the Bellman equation $Q^*(s, a) = E_{s' \sim \varepsilon}[r + \gamma \max_{a'} Q^*(s', a') \mid s, a]$.

The reinforcement learning algorithms estimate the action value function by iteratively applying the Bellman equation $Q^*(s, a) = E_{s' \sim \varepsilon}[r + \gamma \max_{a'} Q^*(s', a') \mid s, a]$. When $t \rightarrow \infty$, the algorithm makes the Q-value function converge to the optimal action value function [1]. If the optimal Q-function $Q^*$ is known, the agent can select optimal actions by selecting the action with the maximal value in a state: $\pi^*(s) = \arg\max_a Q^*(s, a)$.

2.2. Target Deep Q-Learning. RL agents update their model parameters while they observe a stream of transitions like $(s_t, a_t, r_{t+1}, s_{t+1})$, and they discard the incoming data after a single update. There are two issues with this method. The first is that there are strong correlations among the incoming data, which may break the assumption of many popular stochastic gradient-based algorithms. The second is that minor changes in the Q function may result in a huge change in the policy, which makes the algorithm difficult to converge [7, 9, 14, 15].

The deep Q-network algorithm proposed in (Mnih et al., 2013) improves on two aspects. On the one hand, the action value function is approximated by the DNN: DQN uses the DNN with parameters $\theta$ to approximate the value function, $Q(s, a; \theta) \approx Q^*(s, a)$. On the other hand, the experience replay mechanism is adopted: the algorithm learns from transitions sampled from an experience buffer rather than learning fully online. This mechanism makes it possible to break the temporal correlations by mixing more and less recent experience for updating and training. This model-free reinforcement learning algorithm avoids the problem of "model disaster" and uses the generalized approximation method of the value function to address the problem of "dimension disaster".
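To make the replay mechanism concrete, here is a minimal Python sketch of a uniform experience replay buffer; the class name, method names, and default capacity are illustrative choices, not details taken from the paper.

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal FIFO experience replay buffer (illustrative sketch)."""

    def __init__(self, capacity=10000):
        # Oldest transitions are dropped automatically once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform sampling breaks the temporal correlation of consecutive steps.
        batch = random.sample(list(self.buffer), batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones
```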

The convergence issue was mentioned in 2015 by Schaul et al. [14]. The above Q-learning update rule can be directly implemented in a neural network. DQN uses the DNN with parameters $\theta$ to approximate the value function. The parameters $\theta$ are updated from a transition $(s_t, a_t, r_{t+1}, s_{t+1})$ as follows [11]:

$\theta_{t+1} \leftarrow \theta_t + \alpha \left( y_t^{Q} - Q(s_t, a_t; \theta_t) \right) \nabla_{\theta} Q(s_t, a_t; \theta_t)$,  (1)

with $y_i^{\mathrm{DQN}} = r_t + \gamma \max_{a} Q(s_{t+1}, a; \theta_t^-)$.

The update target for Sarsa can be described as follows:

$y_i^{\mathrm{Sarsa}} = r_t + \gamma Q(s_{t+1}, a_{t+1}; \theta_t^-)$,  (2)

where $\alpha$ is the scalar learning rate and $\theta^-$ are the target network parameters, which are fixed to $\theta^- = \theta_t$. The squared error is taken as the loss function: $L_i(\theta_i) = E_{s' \sim \varepsilon}[(y_i - Q(s, a; \theta_i))^2]$.

In general, experience replay can reduce the amount of experience required to learn and replace it with more computation and more memory, which are often cheaper resources than the RL agent's interactions with its environment [14].
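As a concrete illustration of update targets (1) and (2), the sketch below computes the DQN and Sarsa targets from a vector of target-network Q-values for the next state; the function names, the discount value, and the array-based interface are assumptions made for illustration.

```python
import numpy as np

def dqn_target(r_t, q_next_target, gamma=0.99, done=False):
    # y_DQN = r_t + gamma * max_a Q(s_{t+1}, a; theta^-); no bootstrap on terminal states.
    return r_t if done else r_t + gamma * np.max(q_next_target)

def sarsa_target(r_t, q_next_target, a_next, gamma=0.99, done=False):
    # y_Sarsa = r_t + gamma * Q(s_{t+1}, a_{t+1}; theta^-)
    return r_t if done else r_t + gamma * q_next_target[a_next]

def td_loss(y, q_sa):
    # Squared-error loss (y - Q(s, a; theta))^2 from Section 2.2; in practice the
    # gradient is taken with respect to the parameters of q_sa only.
    return (y - q_sa) ** 2
```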

2.3. Double Deep Q-Learning. In Q-learning and DQN, the max operator uses the same values both to select and to evaluate an action. This can therefore lead to overoptimistic value estimates (van Hasselt, 2010). To mitigate this problem, the update target of double Q-learning can be written as follows:

$y_i^{\mathrm{DDQN}} = r_t + \gamma Q(s_{t+1}, \arg\max_{a'} Q(s_{t+1}, a'; \theta_t); \theta_t^-)$  (3)

DDQN is the same as DQN [8], but with the target $y_i^{\mathrm{DQN}}$ replaced by $y_i^{\mathrm{DDQN}}$.
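A small sketch of the double-DQN target in equation (3), assuming the online-network and target-network Q-values for the next state are given as arrays (the interface is ours, not the paper's):

```python
import numpy as np

def ddqn_target(r_t, q_next_online, q_next_target, gamma=0.99, done=False):
    # Select the greedy action with the online network, evaluate it with the target network:
    # y_DDQN = r_t + gamma * Q(s_{t+1}, argmax_a' Q(s_{t+1}, a'; theta_t); theta_t^-)
    if done:
        return r_t
    a_star = np.argmax(q_next_online)
    return r_t + gamma * q_next_target[a_star]
```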

3. Ensemble Methods for Deep Reinforcement Learning

As DQN classes use DNNs to approximate the value function, they have strong generalization ability between similar state inputs. This generalization can cause divergence in the case of repeated bootstrapped temporal difference updates. So we can address this issue by integrating different versions of the target network.

In contrast to a single classifier, ensemble algorithms in a system have been shown to be more effective; they can lead to a higher accuracy. Bagging, boosting, and AdaBoost are methods to train multiple classifiers. But in RL, ensemble algorithms are used for representing and learning the value function. They are combined by majority voting, rank voting, Boltzmann multiplication, mixture models, and other ensemble methods. If the errors of the single classifiers are not strongly correlated, this can significantly improve the classification accuracy.

3.1. Temporal Ensemble. As described in Section 2.2, the DQN classes of deep reinforcement learning algorithms use a target network with parameters $\theta^-$ copied from $\theta_t$ every $C$ steps. The temporal ensemble method is suitable for algorithms which use a target network for updating and training. The temporal ensemble uses the previous $K$ learned networks to produce the value estimate and builds up $K \in \mathbb{N}$ complete networks with $K$ distinct memory buffers. The most recent Q-value function is trained according to its own target network $Q(s, a; \theta_t)$. So each one of the Q-value functions $Q_1, Q_2, \ldots, Q_K$ represents a temporally extended estimate of the Q-value function.

Note that the more recent target network is likely to be more accurate at the beginning of the training, and the accuracy of the target networks increases as the training goes on. So we introduce a parameter $\lambda \in (0, 1]$ here for the target networks. The weight of the $i$th target network is $w_i = \lambda^{i-1} / \sum_{i=1}^{N} \lambda^{i-1}$.

So the Q-value function learned by the temporal ensemble can be described as follows:

$Q^{T}(s, a; \theta) = \sum_{i=1}^{N} \frac{\lambda^{i-1} Q_i(s, a; \theta_i)}{\sum_{i=1}^{N} \lambda^{i-1}}$  (4)

As $\lim_{\lambda \to 1} \lambda^{i-1} / \sum_{i=1}^{N} \lambda^{i-1} = 1/N$, we can see that the target networks have the same weights when $\lambda$ equals 1. This formula indicates that the more recent a target network is, the greater its weight is; as the target networks become more accurate, their weights become equal. The loss function remains the same as in DQN, and so does the parameter update equation:

$y_i^{T} = r_t + \gamma \max_{a'} \sum_{i=1}^{N} w_i Q^{T}(s', a'; \theta^{T})$,
$\theta_{t+1} \leftarrow \theta_t + \alpha \left( y_t^{T} - Q(s_t, a_t; \theta_t) \right) \nabla_{\theta} Q(s_t, a_t; \theta_t)$  (5)

In every iteration, the parameters of the oldest network are removed from the target network buffer and the newest ones are added to the buffer. Note that the Q-value functions are inaccurate at the beginning of training, so the parameter $\lambda$ may be a function of time and even of the state space.
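The following is a minimal sketch of the temporal-ensemble weighting of equation (4), assuming the Q-values of the $N$ stored target networks for a given state are available as a 2-D array ordered from most recent to oldest; the names and example numbers are illustrative.

```python
import numpy as np

def temporal_ensemble_q(q_per_network, lam=0.9):
    """Weighted combination of Q-value estimates from the N most recent target networks.

    q_per_network: array of shape (N, num_actions), row 0 being the most recent network.
    """
    n = q_per_network.shape[0]
    weights = lam ** np.arange(n)      # lambda^(i-1) for i = 1..N
    weights = weights / weights.sum()  # normalize: w_i = lambda^(i-1) / sum_j lambda^(j-1)
    return weights @ q_per_network     # weighted average Q^T(s, a) per action

# With lam = 1 every network gets weight 1/N, i.e. a plain average of the target networks.
q_T = temporal_ensemble_q(np.array([[1.0, 2.0], [0.8, 1.6], [0.5, 1.0]]), lam=0.9)
```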

3.2. Ensemble of Target Values. Traditional ensemble reinforcement learning algorithms maintain multiple tabular algorithms in memory space [4, 16], and majority voting, rank voting, Boltzmann addition, and so forth are used to combine these tabular algorithms. But deep reinforcement learning uses neural networks as function approximators, and the use of multiple neural networks is very computationally expensive and inefficient. In contrast to previous research, we combine different DRL algorithms that learn separate value functions and policies. Therefore, in our ensemble approaches, we combine the different policies derived from the update targets learned by deep Q-networks, deep Sarsa networks, double deep Q-networks, and other DRL algorithms, as follows:

$y_i^{\mathrm{DQN}} = r_t + \gamma \max_{a} Q(s_{t+1}, a; \theta_t^-)$,
$y_i^{\mathrm{Sarsa}} = r_t + \gamma Q(s_{t+1}, a_{t+1}; \theta_t^-)$,
$y_i^{\mathrm{DDQN}} = r_t + \gamma Q(s_{t+1}, \arg\max_{a'} Q(s_{t+1}, a'; \theta_t); \theta_t^-)$  (6)

Besides these update target formulas, other algorithms based on value function approximators can also be combined. The combined update target over the $k$ algorithms at time $t$ is denoted by $y_t = \sum_{i=1}^{k} \beta_i y_t^{i}$.

The loss function remains the same as in DQN, and so does the parameter update equation:

$\theta_{t+1} \leftarrow \theta_t + \alpha \left( y_t^{E} - Q(s_t, a_t; \theta_t) \right) \nabla_{\theta} Q(s_t, a_t; \theta_t)$  (7)
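As a sketch of how the targets in (6) can be mixed into the single target $y_t = \sum_i \beta_i y_t^{i}$, the function below combines the DQN, Sarsa, and DDQN targets; equal $\beta_i$ weights are an assumption made here for illustration, since the paper does not state specific values at this point.

```python
import numpy as np

def ensemble_target(r_t, a_next, q_next_online, q_next_target,
                    gamma=0.99, betas=(1/3, 1/3, 1/3), done=False):
    # The three update targets of equation (6), mixed as y = sum_i beta_i * y_i.
    if done:
        return r_t
    y_dqn = r_t + gamma * np.max(q_next_target)
    y_sarsa = r_t + gamma * q_next_target[a_next]
    y_ddqn = r_t + gamma * q_next_target[np.argmax(q_next_online)]
    return float(np.dot(betas, [y_dqn, y_sarsa, y_ddqn]))
```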

3.3. The Ensemble Network Architecture. The temporal and target values ensemble algorithm (TE DQN) is an integrated architecture of the value-based DRL algorithms. As shown in Sections 3.1 and 3.2, the ensemble network architecture has two parts, to avoid divergence and improve performance.

The architecture of our ensemble algorithm is shown in Figure 1; these two parts are combined together by the evaluated network.

The temporal ensemble stabilizes the training process by reducing the variance of the target approximation error [10]. Besides, the ensemble of target values reduces the overestimate and yields better performance by estimating a more accurate Q-value. The temporal and target values ensemble algorithm is given in Algorithm 1.

As the ensemble network architecture shares the same input-output interface with standard Q-networks and target networks, we can recycle all learning algorithms with Q-networks to train the ensemble architecture.

4. Experiments

4.1. Experimental Setup. So far, we have carried out our experiments on several classical control and Box2D environments from OpenAI Gym: CartPole-v0, MountainCar-v0, and LunarLander-v2 [15]. We use the same network architecture, learning algorithms, and hyperparameters for all these environments.

We trained the algorithms for 10,000 episodes and used the Adaptive Moment Estimation (Adam) algorithm to minimize the loss, with learning rate $\mu = 0.00001$.


(1) Initialize the action-value network $Q$ with random weights $\theta$
(2) Initialize the target neural network buffer $(Q_i)_{i=1}^{L}$
(3) For episode = 1, M do
(4)   For t = 1, T do
(5)     With probability $\varepsilon$ select a random action $a_t$; otherwise $a_t = \arg\max_a Q(s_t, a; \theta)$
(6)     Execute action $a_t$ in the environment, observe reward $r_t$ and next state $s_{t+1}$, and store transition $(s_t, a_t, r_t, s_{t+1})$ in $D$
(7)     Sample a random minibatch of transitions $(s_t, a_t, r_t, s_{t+1})$ from $D$
(8)     Set $w_i = \lambda^{i-1} / \sum_{i=1}^{N} \lambda^{i-1}$
(9)     Ensemble Q-learner: $Q(s, a; \theta) = \sum_{i=1}^{N} w_i Q_i(s, a; \theta_i)$
(10)    Set $y_i^{\mathrm{DQN}} = r_t + \gamma \max_a Q(s_{t+1}, a; \theta_t^-)$
(11)    Set $y_i^{\mathrm{Sarsa}} = r_t + \gamma Q(s_{t+1}, a_{t+1}; \theta_t^-)$
(12)    Set $y_i^{\mathrm{DDQN}} = r_t + \gamma Q(s_{t+1}, \arg\max_{a'} Q(s_{t+1}, a'; \theta_t); \theta_t^-)$
(13)    Set $y_i = r_j$ if the episode terminates at step $j + 1$; otherwise $y_i = \sum_{i=1}^{k} \beta_i y_t^{i}$
(14)    $\theta_i = \arg\min_{\theta} E[(y_{i(s,a)} - Q(s, a; \theta))^2]$
(15)    Every $C$ steps reset the target network $\hat{Q} = Q$
(16)  End for
(17) End for

Algorithm 1: The temporal and target values ensemble algorithm.

[Figure 1: The architecture of the ensemble algorithm. Components: shared network; evaluated network; target networks Target Q1, Target Q2, ..., Target Qn; approximators 1-3; target value.]

We set the batch size to 32. A summary of the configuration is provided below. The target network was updated every 300 steps. The behavior policy during training was $\varepsilon$-greedy, with $\varepsilon$ annealed linearly from 1 to 0.01 over the first five thousand steps and fixed at 0.01 thereafter. We used a replay memory of the ten thousand most recent transitions.
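For reference, the reported settings can be summarized as in the sketch below; the dictionary layout and the epsilon-schedule helper are our own illustrative framing of the stated configuration.

```python
# Experimental settings as reported above (the variable names are ours).
config = {
    "episodes": 10000,
    "optimizer": "Adam",
    "learning_rate": 1e-5,
    "batch_size": 32,
    "target_update_steps": 300,
    "replay_capacity": 10000,
}

def epsilon(step, eps_start=1.0, eps_end=0.01, anneal_steps=5000):
    # Linear annealing from 1 to 0.01 over the first 5000 steps, then fixed at 0.01.
    if step >= anneal_steps:
        return eps_end
    return eps_start + (eps_end - eps_start) * step / anneal_steps
```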

We independently executed each method 10 times on every task. For each run, the learned policy was tested 100 times, without exploration noise or prior knowledge, every 100 training episodes to calculate the average scores. We report the mean and standard deviation of the convergence episodes and the scores of the best policy.

4.2. Results and Analysis. We consider three baseline algorithms that use a target network and value function approximation, namely, the version of the DQN algorithm from the Nature paper [8], DSN, which reduces overestimation [17], and DDQN, which substantially improved the state of the art by reducing the overestimation bias with double Q-learning [9].

Using this performance measure, it is clear that the ensemble network does substantially better than a single network. For comparison, we also show results for DQN, DSN, and DDQN. Figure 2 shows the improvement of the ensemble network over the baseline single networks of DQN, DSN, and DDQN. Again, we see that the improvements are often very dramatic.

The results in Table 1 show that the algorithms we presented can successfully train neural network controllers on the classical control domain of OpenAI Gym. A detailed comparison shows that there are several tasks on which TE DQN greatly improves upon DQN, DSN, and DDQN. Noteworthy examples include CartPole-v0 (performance improved by 13.6%, 79.5%, and 7.8%, and variance reduced by 100%, 100%, and 100%), MountainCar-v0 (performance improved by 26.7%, 21.2%, and 24.8%, and variance reduced by 31.6%, 77.9%, and 8.4%), and LunarLander-v2 (performance improved by 28.3%, 32.8%, and 50.5%, and variance reduced by 19.2%, 46.4%, and 50.5%).


[Figure 2: Training curves tracking the agent's average score and average predicted action-value on CartPole-v0, MountainCar-v0, and LunarLander-v2. (a) Performance comparison of all algorithms (DQN, DSN, DDQN, TE DQN k = 3, TE DQN k = 6) in terms of the average reward on each task. (b) Average predicted action-value on a held-out set of states on each task; each point on the curve is the average of the action-value Q computed over the held-out set of states. (c) The performance of DQN and TE DQN (k = 6) on each task; the darker line shows the average scores of each algorithm, the orange shaded area shows the two extreme values of DQN, and the green shaded area shows those of TE DQN.]


Table 1: The columns present the average performance of DQN, DSN, DDQN, EDQN, and TE-DQN after 10,000 episodes, using an $\varepsilon$-greedy policy with $\varepsilon$ = 0.0001 after 10,000 steps. The standard deviation represents the variability over seven independent trials. Average performance improved with the number of averaged networks.

Task (AVG score, Std)    CartPole-v0      MountainCar-v0    LunarLander-v2
DQN                      (264.9, 21.7)    (-148.2, 17.4)    (159.3, 16.7)
DSN                      (167.1, 61.6)    (-137.7, 53.9)    (153.9, 25.2)
Double DQN               (278.2, 31.8)    (-144.2, 16.8)    (135.8, 11.8)
TE DQN, K = 3            (299.1, 1.3)     (-115.6, 21.4)    (186.9, 19.1)
TE DQN, K = 6            (300.0, 0)       (-108.4, 11.9)    (204.4, 13.5)

5. Conclusion

We introduced a new learning architecture, making a temporal extension and an ensemble of target values for deep Q-learning algorithms while sharing a common learning module. The new ensemble architecture, in combination with some algorithmic improvements, leads to dramatic improvements over existing approaches for deep RL on challenging classical control problems. In practice, this ensemble architecture makes it very convenient to integrate RL methods based on the approximate value function.

Although the ensemble algorithms are superior to a single reinforcement learning algorithm, it should be noted that their computational complexity is higher. The experiments also show that the temporal ensemble makes the training process more stable, and the ensemble of a variety of algorithms makes the estimation of the Q-value more accurate. The combination of the two enables the training to achieve a stable convergence. This is due to the fact that ensembles improve independent algorithms most if the algorithms' predictions are less correlated, so that the output of the Q-network, on which the choice of action is based, can achieve a balance between exploration and exploitation.

In fact, the independence of the ensemble algorithms and their elements is very important for the performance of ensemble algorithms. In further work, we want to analyze the role of each algorithm and each Q-network at different stages, so as to further enhance the performance of the ensemble algorithm.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

References

[1] S. Mozer and M. Hasselmo, "Reinforcement learning: an introduction," IEEE Transactions on Neural Networks and Learning Systems, vol. 16, no. 1, pp. 285-286, 2005.
[2] L. P. Kaelbling, M. L. Littman, and A. W. Moore, "Reinforcement learning: a survey," Journal of Artificial Intelligence Research, vol. 4, pp. 237-285, 1996.
[3] V. Mnih, K. Kavukcuoglu, D. Silver, et al., "Playing Atari with deep reinforcement learning," https://arxiv.org/abs/1312.5602.
[4] M. A. Wiering and H. van Hasselt, "Ensemble algorithms in reinforcement learning," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 38, no. 4, pp. 930-936, 2008.
[5] S. Whiteson and P. Stone, "Evolutionary function approximation for reinforcement learning," Journal of Machine Learning Research (JMLR), vol. 7, pp. 877-917, 2006.
[6] P. Preux, S. Girgin, and M. Loth, "Feature discovery in approximate dynamic programming," in Proceedings of the 2009 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL 2009), pp. 109-116, April 2009.
[7] T. Degris, P. M. Pilarski, and R. S. Sutton, "Model-free reinforcement learning with continuous action in practice," in Proceedings of the 2012 American Control Conference (ACC 2012), pp. 2177-2182, June 2012.
[8] V. Mnih, K. Kavukcuoglu, D. Silver, et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529-533, 2015.
[9] H. van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double Q-learning," in Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI 2016), pp. 2094-2100, February 2016.
[10] O. Anschel, N. Baram, and N. Shimkin, "Averaged-DQN: variance reduction and stabilization for deep reinforcement learning," https://arxiv.org/abs/1611.01929.
[11] I. Osband, C. Blundell, A. Pritzel, et al., "Deep exploration via bootstrapped DQN," https://arxiv.org/abs/1602.04621.
[12] S. Faußer and F. Schwenker, "Ensemble methods for reinforcement learning with function approximation," in Multiple Classifier Systems, pp. 56-65, Springer, Berlin, Germany, 2011.
[13] A. K. Jain, R. P. W. Duin, and J. Mao, "Statistical pattern recognition: a review," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 1, pp. 4-37, 2000.
[14] T. Schaul, J. Quan, I. Antonoglou, et al., "Prioritized experience replay," https://arxiv.org/abs/1511.05952.
[15] I. Zamora, N. G. Lopez, V. M. Vilches, et al., "Extending the OpenAI Gym for robotics: a toolkit for reinforcement learning using ROS and Gazebo," https://arxiv.org/abs/1608.05742.
[16] D. Ernst, P. Geurts, and L. Wehenkel, "Tree-based batch mode reinforcement learning," Journal of Machine Learning Research (JMLR), vol. 6, no. 2, pp. 503-556, 2005.
[17] M. Ganger, E. Duryea, and W. Hu, "Double Sarsa and double expected Sarsa with shallow and deep learning," Journal of Data Analysis and Information Processing, vol. 4, no. 4, pp. 159-176, 2016.


Page 2: Ensemble Network Architecture for Deep Reinforcement Learningdownloads.hindawi.com/journals/mpe/2018/2129393.pdf · DQN Q-aue DQN Q-ae t 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 −40

2 Mathematical Problems in Engineering

Based on an agent integrated withmultiple reinforcementlearning algorithms multiple value functions are learned atthe same time The ensembles combine the policies derivedfrom the value functions in a final policy for the agent Themajority voting (MV) the rank voting (RV) the Boltzmannmultiplication (BM) and the Boltzmann addition (BA) areused to combine RL algorithms While these methods arecostly in deep reinforcement learning (DRL) algorithms wecombine different DRL algorithms that learn separate valuefunctions and policiesTherefore in our ensemble approacheswe combine the different policies derived from the updatetargets learned by deep 119876-networks deep Sarsa networksdouble deep 119876-networks and other DRL algorithms As aconsequence this leads to reduced overestimations morestable learning process and improved performance

2 Related Work

21 Reinforcement Learning Reinforcement learning is amachine learning method that allows the system to interactwith and learn from the environment to maximize cumu-lative return rewards Assume that the standard reinforce-ment learning setting where an agent interacts with theenvironment 120576 We can describe this process with MarkovDecision Processes (MDP) [2 9] It can be specified as a tuple(119878 119860 120587 119877 120574) At each step 119905 the agent receives a state 119904119905 andselect an action 119886119905 from the set of legal actions 119860 accordingto the policy 120587 where 120587 is a policy mapping sequencesto actions The action is passed to the environment 119864 Inaddition the agent receives the next state 119904119905+1 and a rewardsignal 119903119905 This process continues until the agent reaches aterminal state

The agent seeks to maximize the expected discountedreturn where we define the future discounted return at time119905 as 119877119905 = suminfin119896=0 120574119896119903119905+119896 with discount factor 120574 isin (0 1] Thegoal of the RL agent is to learn a policy which makes thefuture discounted return maximize For an agent behavingaccording to a stochastic policy 120587 the value of the state-action pair can be defined as follows 119876120587(119904 119886) = 119864119877119905 | 119904119905 =119904 119886119905 = 119886 120587 The optimal action-value function 119876 satisfiesthe Bellman equation 119876lowast(119904 119886) = 1198641199041015840sim120576[119903 + 120574max1198861015840119876lowast(1199041015840 1198861015840) |119904 119886]

The reinforcement learning algorithms estimate theaction value function by iteratively updating the Bellmanequation 119876lowast(119904 119886) = 1198641199041015840sim120576[119903 + 120574max1198861015840119876lowast(1199041015840 1198861015840) | 119904 119886] When119905 rarr infin the algorithm makes 119876-value function convergeto the optimal action value function [1] If the optimal 119876-function 119876lowast is known the agent can select optimal actionsby selecting the action with the maximal value in a state120587lowast = argmax119886119876lowast(119904 119886)22 Target Deep 119876 Learning RL agents update their modelparameters while they observe a stream of transitions like(119904119905 119886119905 119903119905+1 119904119905+1)They discard the incoming data after a singleupdateThere are two issues with this methodThe first one isthat there are strong correlations among the incoming datawhich may break the assumption of many popular stochasticgradient-based algorithms Secondly the minor changes in

the 119876 function may result in a huge change in the policywhich makes the algorithm difficult to converge [7 9 14 15]

As for the deep 119876-networks algorithms proposed in(Mnih et al 2013) two aspects are improved On the onehand the action value function is approximated by the DNNDQN uses the DNN with a parameter 120579 to approximate thevalue function 119876(119904 119886 120579) asymp 119876lowast(119904 119886 120579) on the other handthe experience replay mechanism is adopted The algorithmlearns from sampled transitions from an experience bufferrather than learning fully online This mechanism makesit possible to break the temporal correlations by mixingmore and less recent experience for updating and trainingThis model free reinforcement learning algorithm solvesthe problem of ldquomodel disasterrdquo and uses the generalizedapproximation method of the value function to solve theproblem of ldquodimension disasterrdquo

The convergence issue was mentioned in 2015 by Schaulet al [14] The above 119876-learning update rules can be directlyimplemented in a neural network DQN uses the DNNwith parameters 120579 to approximate the value function∘ Theparameter 120579 updates from transition (119904119905 119886119905 119903119905+1 119904119905+1) aregiven by the following [11]

120579119905+1 larr997888 120579119905 + 120572 (119910119876119905 minus 119876 (119904119905 119886119905 120579119905)) nabla120579119876 (119904119905 119886119905 120579119905) (1)

with 119910DQN119894 = 119903119905 + 120574max119886119876(119904119905+1 119886 120579minus119905 )

The update targets for Sarsa can be described as follows

119910Sarsa119894 = 119903119905 + 120574119876 (119904119905+1 119886119905+1 120579minus119905 ) (2)

where 120572 is the scalar learning rate 120579minus are target net-work parameters which are fixed to 120579minus = 120579119905 In casethe squared error is taken as a loss function 119871 119894(120579119894) =1198641199041015840sim120576(119910119894 minus 119876(119904 119886 120579119894))2

In general experience replay can reduce the amount ofexperience required to learn and replace it withmore compu-tation and more memory which are often cheaper resourcesthan the RL agentrsquos interactions with its environment [14]

23 Double Deep 119876 Learning In 119876-learning and DQN themax operator uses the same values to both select and evaluatean action This can therefore lead to overoptimistic valueestimates (van Hasselt 2010) To mitigate this problem theupdate targets value of double 119876-learning error can then bewritten as follows

119910DDQN119894 = 119903119905 + 120574119876(119904119905+1 argmax

1198861015840119876(119904119905+1 1198861015840 120579119905) 120579minus119905 ) (3)

DDQN is the same as for DQN [8] but with the target 119910DQN119894

replaced with 119910DDQN119894

3 Ensemble Methods for DeepReinforcement Learning

AsDQNclasses useDNNs to approximate the value functionit has strong generalization ability between similar stateinputs The generalization can cause divergence in the case

Mathematical Problems in Engineering 3

of repeated bootstrapped temporal difference updates So wecan solve this issue by integrating different versions of thetarget network

In contrast to a single classifier ensemble algorithmsin a system have been shown to be more effective Theycan lead to a higher accuracy Bagging boosting and AdaBoosting are methods to train multiple classifiers But in RLensemble algorithms are used for representing and learningthe value functionThey are combined bymajor voting RankVoting BoltzmannMultiplication mixture model and otherensemble methods If the errors of the single classifiers arenot strongly correlated this can significantly improve theclassification accuracy

31 Temporal Ensemble As described in Section 22 theDQN classes of deep reinforcement learning algorithms usea target network with parameters 120579minus copied from 120579119905 every119862 steps Temporal Ensemble method is suitable for thealgorithms which use a target network for updating andtraining Temporal ensemble uses the previous 119870 learnednetworks to produce the value estimate and builds up 119870 isin119873 complete networks with 119870 distinct memory buffers Therecent 119876-value function is trained according to its owntarget network 119876(119904 119886 120579119905) So each one of 119876-value functions1198761 1198762 119876119896 represents temporally extended estimate of119876-value function

Note that the more recent target network is likely tobe more accurate at the beginning of the training and theaccuracy of the target networks is increasing as the traininggoes on So we denote a learning rate parameter 120582 isin (0 1]here for target network The weight of 119894th target network is119908119894 = 120582119894minus1sum119873119894=1 120582119894minus1

So the learned 119876-value function by temporal ensemblecan be described as follows

119876119879 (119904 119886 120579) = 119873sum119894=1

(120582119894minus1119876119894 (119904 119886 120579119894)sum119873119894=1 120582119894minus1 ) (4)

As lim120582997888rarr1120582119894minus1sum119873119894=1 120582119894minus1 = 1119873 we can see that the targetnetworks have the same weights when 120582 equals 1 Thisformula indicates that the closer the target networks are thegreater the target networksrsquo weight is As target networksbecome more accurate their weights become equal The lossfunction remains the same as in DQN and so does theparameter update equation

119910119879119894 = 119903119905 + 120574max1198861015840

119873sum119894=1

119908119894119876119879 (1199041015840 1198861015840 120579119879) 120579119905+1 larr997888 120579119905 + 120572 (119910119879119905 minus 119876 (119904119905 119886119905 120579119905)) nabla120579119876 (119904119905 119886119905 120579119905)

(5)

In every iteration the parameters of the oldest ones areremoved from the target network buffer and the newest onesare added to the buffer Note that the 119876-value functions areinaccurate at the beginning of training So the parameter 120582may be a function of time and even the state space

32 Ensemble of Target Values The traditional ensemblereinforcement learning algorithms maintain multiple tabular

algorithms in memory space [4 16] and majority votingrank voting Boltzmann addition and so forth are used tocombine these tabular algorithms But deep reinforcementlearning uses neutral networks as function approximatorsThe use of multiple neural networks is very computationallyexpensive and inefficient In contrast to previous researcheswe combine different DRL algorithms that learn separatevalue functions and policies Therefore in our ensembleapproaches we combine the different policies derived fromthe update targets learned by deep 119876-networks deep Sarsanetworks double deep 119876-networks and other DRL algo-rithms as follows

119910DQN119894 = 119903119905 + 120574max

119886119876 (119904119905+1 119886 120579minus119905 )

119910Sarsa119894 = 119903119905 + 120574119876 (119904119905+1 119886119905+1 120579minus119905 ) 119910DDQN119894 = 119903119905 + 120574119876(119904119905+1 argmax

1198861015840119876 (119904119905+1 119886119905+1 120579119905) 120579minus119905 )

(6)

Besides these update targets formula other algorithms basedon value function approximators can be also used to combineThe update targets according to the algorithm 119896 at time 119905 willbe denoted by 119910119905 = sum119896119894=1 120573119894119910119894119905

The loss function remains the same as in DQN and sodoes the parameter update equation

120579119905+1 larr997888 120579119905 + 120572 (119910119864119905 minus 119876 (119904119905 119886119905 120579119905)) nabla120579119876 (119904119905 119886119905 120579119905) (7)

33 The Ensemble Network Architecture The temporal andtarget values ensemble algorithm (TEDQN) is an integratedarchitecture of the value-based DRL algorithms As shown inSections 31 and 32 the ensemble network architecture hastwo parts to avoid divergence and improve performance

The architecture of our ensemble algorithm is shown inFigure 1 these two parts are combined together by evaluatednetwork

The temporal ensemble stabilizes the training processby reducing the variance of target approximation error [10]Besides the ensemble of target values reduces the overes-timate and makes better performance by estimating moreaccurate 119876-value The temporal and target values ensemblealgorithm are given by Algorithm 1

As the ensemble network architecture shares the sameinput-output interface with standard 119876-networks and tar-get networks we can recycle all learning algorithms with119876-networks to train the ensemble architecture

4 Experiments

41 Experimental Setup So far we have carried out ourexperiments on several classical control and Box2D environ-ments on OpenAI Gym CartPole-v0 MountainCar-v0 andLunarLander-v2 [15] We use the same network architecturelearning algorithms and hyperparameters for all these envi-ronments

We trained the algorithms using 10000 episodes andused the Adaptive Moment Estimation (Adam) algorithm tominimize the loss with learning rate 120583 = 000001 and set

4 Mathematical Problems in Engineering

(1) Initialize action-value network 119876 with random weights 120579(2) Initialize the target neural network buffer (119876119894)119871119894=1(3) For episode 1119872 do(4) For 119905 = 1 119879 do(5) With probability 120576 select a random action 119886119905 otherwise119886119905 = argmax119886119876(119904119905 119886 120579)(6) Execute action 119886119905 in environment and observe reward 119903119905and next state 119904119905+1 and store transition (119904119905 119886119905 119903119905 119904119905+1) in119863(7) Sample randomminibatch of transition (119904119905 119886119905 119903119905 119904119905+1) from119863(8) set 119908119894 = 120582119894minus1sum119873119894=1 120582119894minus1(9) Ensemble 119876-learner 119876(119904 119886 120579) = sum119873119894=1 119908119894119876119894(119904 119886 120579119894)(10) set 119910DQN

119894 = 119903119905 + 120574max119886119876(119904119905+1 119886 120579minus119905 )(11) set 119910Sarsa119894 = 119903119905 + 120574119876(119904119905+1 119886119905+1 120579minus119905 )(12) set 119910DDQN119894 = 119903119905 + 120574119876(119904119905+1 argmax1198861015840119876(119904119905+1 119886119905+1 120579119905) 120579minus119905 )(13) Set 119910119894 = 119903119895 if episode terminates at step 119895 + 1 sum119896119894=1 120573119894119910119894119905 otherwise(14) 120579119894 = argmin

120579

119864 [(119910119894(119904119886) minus 119876 (119904 119886 120579))2](15) Every 119862 steps reset 119876 = 119876(16) End for(17) End for

Algorithm 1 The temporal and target values ensemble algorithm

Shared network

Evaluated network

Target value

Target Q1

Target Q2

Target Qn

Approximator1

Approximator2

Approximator3

Figure 1 The architecture of the ensemble algorithm

the batch size to 32 The summary of the configuration isprovided below The target network updated each 300 stepsThe behavior policy during training was 120576-greedy with 120576annealed linearly from 1 to 001 over the first five thousandssteps and fixed at 001 thereafterWe used a replay memory often thousands most recent transitions

We independently executed each method 10 timesrespectively on every task For each running time thelearned policy will be tested 100 times without explorationnoise or prior knowledge by every 100 training episodes tocalculate the average scoresWe report themean and standarddeviation of the convergence episodes and the scores of thebest policy

42 Results and Analysis We consider three baseline algo-rithms that use target network and value function approxi-mation namely the version of the DQN algorithm from theNature paper [8] DSN that reduce over estimation [17] andDDQN that substantially improved the state-of-the-art byreducing the overestimation bias with double119876-learning [9]

Using this 10 no-ops performance measure it is clear thatthe ensemble network does substantially better than a singlenetwork For comparison we also show results for DQNDSN and DDQN Figure 2 shows the improvement of the

ensemble network over the baseline single network of DQNDSN and DDQN Again we see that the improvements areoften very dramatic

The results in Table 1 show that algorithms we pre-sented can successfully train neural network controllers onthe classical control domain on OpenAI Gym A detailedcomparison shows that there are several games in whichTE DQN greatly improves upon DQN DSN and DDQNNoteworthy examples include CartPole-v0 (performance hasbeen improved by 136 795 and 78 and variance hasbeen reduced by 100 100 and 100) MountainCar-v0(performance has been improved by 267 212 and 248and variance has been reduced by 316 779 and 84)and LunarLander-v2 (performance has been improved by283 328 and 505 and variance has been reduced by192 464 and 505)

5 Conclusion

We introduced a new learning architecture making temporalextension and the ensemble of target values for deep119876 learn-ing algorithms while sharing a common learning moduleThe new ensemble architecture in combination with somealgorithmic improvements leads to dramatic improvements

Mathematical Problems in Engineering 5

010

020

030

040

050

060

070

080

090

010

00minus50

050

100150200250300350

CartPole-v0

Episode

Aver

age a

ctio

n va

lue (

Q)

DQNDSNDDQN

TE DQN k = 3TE DQN k = 6

010

0020

0030

0040

0050

0060

0070

0080

0090

0010

000minus700

minus600

minus500

minus400

minus300

minus200

minus100

0Mountain-v0

Episode

Aver

age S

core

DQNDSNDDQN

TE DQN k = 3TE DQN k = 6

0 02 04 06 08 1 12 14 16 18 2minus250minus200minus150minus100minus50

050

100150200

LunarLander-v2

Episode

(a) (b) (c)

Aver

age S

core

DQNDSNDDQN

TE DQN k = 3TE DQN k = 6

times104

010

020

030

040

050

060

070

080

090

010

00

minus100

1020304050607080

CartPole-v0 (TE DQN k = 6)

EpisodeAv

erag

e act

ion

valu

e (Q

)TE Q-valueTE Q-value fitDQN Q-valueDQN Q-value fit

010

0020

0030

0040

0050

0060

0070

0080

0090

0010

000minus40

minus20

0

20

40

60

80

100Mountain-v0 (TE DQN k = 6)

Episode

Aver

age a

ctio

n va

lue (

Q)

TE Q-valueTE Q-value fitDQN Q-valueDQN Q-value fit

0 02 04 06 08 1 12 14 16 18 2minus40

minus20

0

20

40

60

80

100LunarLander-v2 (TE DQN k = 6)

Episode

Aver

age a

ctio

n va

lue (

Q)

TE Q-valueTE Q-value fitDQN Q-valueDQN Q-value fit

times104

0 5000 10000minus600

minus400

minus200

0

rew

ard

Episode

DQN TE DQN k = 6

0 5000 10000 15000 20000Episode

DQN TE DQN k = 6

minus600

minus400

minus200

0

200

400

600

rew

ard

DQN TE DQN k = 6

200 400 600 800 10000Episode

0

100

200

300

400

rew

ard

Figure 2 Training curves tracking the agentrsquos average score and average predicted action-value (a) Performance comparison of all algorithmsin terms of the average reward on each task (b) Average predicted action-value on a held-out set of states on each task Each point on thecurve is the average of the action-value 119876 computed over the held-out set of states (c) The performance of DQN and TEDQN on each taskThe darker line shows the average scores of each algorithm and the orange shaded area shows the two extreme values of DQN and the greenshaded area shows TE DQN

6 Mathematical Problems in Engineering

Table 1 The columns present the average performance of DQN DSN DDQN EDQN and TE-DQN after 10000 episodes using 120576-greedypolicy with 120576 = 00001 after 10000 stepsThe standard variation represents the variability over seven independent trials Average performanceimproved with the number of averaged networks

Task(AVG score Std) CartPole-v0 MountainCar-v0 LunarLander-v2

DQN (2649 217) (minus1482 174) (1593 167)DSN (1671 616) (minus1377 539) (1539 252)Double DQN (2782 318) (minus1442 168) (1358 118)TE DQN 119870 = 3 (2991 13) (minus1156 214) (1869 191)TE DQN 119870 = 6 (300 0) (minus1084 119) (2044 135)

over existing approaches for deep RL in the challengingclassical control issues In practice this ensemble architecturecan be very convenient to integrate the RL methods based onthe approximate value function

Although the ensemble algorithms are superior to asingle reinforcement learning algorithm it is noted that thecomputational complexity is higher The experiments alsoshow that the temporal ensemble makes the training processmore stable and the ensemble of a variety of algorithmsmakes the estimation of the 119876-value more accurate Thecombination of the two ways enables the training to achievea stable convergence This is due to the fact that ensemblesimprove independent algorithms most if the algorithmspredictions are less correlated So that the output of the119876-network based on the choice of action can achieve balancebetween exploration and exploitation

In fact the independence of the ensemble algorithmsand their elements is very important on the performance forensemble algorithms In further works we want to analyzethe role of each algorithm and each 119876-network in differentstages so as to further enhance the performance of theensemble algorithm

Conflicts of Interest

The authors declare that there are no conflicts of interestregarding the publication of this paper

References

[1] S. Mozer and M. Hasselmo, "Reinforcement learning: an introduction," IEEE Transactions on Neural Networks and Learning Systems, vol. 16, no. 1, pp. 285–286, 2005.
[2] L. P. Kaelbling, M. L. Littman, and A. W. Moore, "Reinforcement learning: a survey," Journal of Artificial Intelligence Research, vol. 4, pp. 237–285, 1996.
[3] V. Mnih, K. Kavukcuoglu, D. Silver et al., "Playing Atari with deep reinforcement learning," https://arxiv.org/abs/1312.5602.
[4] M. A. Wiering and H. van Hasselt, "Ensemble algorithms in reinforcement learning," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 38, no. 4, pp. 930–936, 2008.
[5] S. Whiteson and P. Stone, "Evolutionary function approximation for reinforcement learning," Journal of Machine Learning Research (JMLR), vol. 7, pp. 877–917, 2006.
[6] P. Preux, S. Girgin, and M. Loth, "Feature discovery in approximate dynamic programming," in Proceedings of the 2009 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL 2009), pp. 109–116, April 2009.
[7] T. Degris, P. M. Pilarski, and R. S. Sutton, "Model-free reinforcement learning with continuous action in practice," in Proceedings of the 2012 American Control Conference (ACC 2012), pp. 2177–2182, June 2012.
[8] V. Mnih, K. Kavukcuoglu, D. Silver et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[9] H. van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double Q-learning," in Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI 2016), pp. 2094–2100, February 2016.
[10] O. Anschel, N. Baram, N. Shimkin et al., "Averaged-DQN: variance reduction and stabilization for deep reinforcement learning," https://arxiv.org/abs/1611.01929.
[11] I. Osband, C. Blundell, A. Pritzel et al., "Deep exploration via bootstrapped DQN," https://arxiv.org/abs/1602.04621.
[12] S. Faußer and F. Schwenker, "Ensemble methods for reinforcement learning with function approximation," in Multiple Classifier Systems, pp. 56–65, Springer, Berlin, Germany, 2011.
[13] A. K. Jain, R. P. W. Duin, and J. Mao, "Statistical pattern recognition: a review," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 1, pp. 4–37, 2000.
[14] T. Schaul, J. Quan, I. Antonoglou et al., "Prioritized experience replay," https://arxiv.org/abs/1511.05952.
[15] I. Zamora, N. G. Lopez, V. M. Vilches et al., "Extending the OpenAI Gym for robotics: a toolkit for reinforcement learning using ROS and Gazebo," https://arxiv.org/abs/1608.05742.
[16] D. Ernst, P. Geurts, and L. Wehenkel, "Tree-based batch mode reinforcement learning," Journal of Machine Learning Research (JMLR), vol. 6, no. 2, pp. 503–556, 2005.
[17] M. Ganger, E. Duryea, and W. Hu, "Double Sarsa and double expected Sarsa with shallow and deep learning," Journal of Data Analysis and Information Processing, vol. 4, no. 4, pp. 159–176, 2016.



