Research Article

Efficient Actor-Critic Algorithm with Hierarchical Model Learning and Planning

Shan Zhong,1,2 Quan Liu,1,3,4 and QiMing Fu5

1 School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215000, China
2 School of Computer Science and Engineering, Changshu Institute of Technology, Changshu, Jiangsu 215500, China
3 Collaborative Innovation Center of Novel Software Technology and Industrialization, Jiangsu 210000, China
4 Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China
5 College of Electronic & Information Engineering, Suzhou University of Science and Technology, Suzhou, Jiangsu 215000, China

Correspondence should be addressed to Quan Liu; quanliu@suda.edu.cn

Received 29 May 2016; Revised 28 July 2016; Accepted 16 August 2016

Academic Editor: Leonardo Franco

Copyright © 2016 Shan Zhong et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

To improve the convergence rate and the sample efficiency, two efficient learning methods, AC-HMLP and RAC-HMLP (AC-HMLP with $\ell_2$-regularization), are proposed by combining the actor-critic algorithm with hierarchical model learning and planning. The hierarchical models, consisting of a local and a global model, are learned at the same time as the value function and the policy; they are approximated by local linear regression (LLR) and linear function approximation (LFA), respectively. Both the local model and the global model are applied to generate samples for planning: the former is used at each time step, but only if the state-prediction error does not surpass the threshold, while the latter is utilized at the end of each episode. The purpose of taking both models is to improve the sample efficiency and accelerate the convergence rate of the whole algorithm by fully utilizing the local and global information. Experimentally, AC-HMLP and RAC-HMLP are compared with three representative algorithms on two Reinforcement Learning (RL) benchmark problems. The results demonstrate that they perform best in terms of convergence rate and sample efficiency.

1. Introduction and Related Work

Reinforcement Learning (RL) [1–4], a framework for solving the Markov Decision Process (MDP) problem, targets generating the optimal policy by maximizing the expected accumulated rewards. The agent interacts with its environment and receives information about the current state at each time step. After the agent chooses an action according to the policy, the environment transitions to a new state while emitting a reward. RL can be divided into two classes, online and offline. Online methods learn by interacting with the environment, which easily incurs inefficient use of data and stability issues. Offline or batch RL [5], as a subfield of dynamic programming (DP) [6, 7], can avoid the stability issue and achieve high sample efficiency.

DP aims at solving optimal control problems, but it is implemented backward in time, making it offline and computationally expensive for complex or real-time problems. To avoid the curse of dimensionality in DP, approximate dynamic programming (ADP) has received much attention; it obtains approximate solutions of the Hamilton-Jacobi-Bellman (HJB) equation by combining DP, RL, and function approximation [8]. Werbos [9] introduced an approach for ADP which was also called adaptive critic designs (ACDs). ACDs consist of two neural networks (NNs), one for approximating the critic and the other for approximating the actor, so that DP can be solved approximately forward in time. Several synonyms for ADP and ACDs include approximate dynamic programming, asymptotic dynamic programming, heuristic dynamic programming, and neurodynamic programming [10, 11].

The iterative nature of the ADP formulation makes it natural to design optimal discrete-time controllers. Al-Tamimi et al. [12] established a heuristic dynamic programming algorithm based on value iteration, where the convergence is proved in the context of general nonlinear discrete systems. Dierks et al. [13] solved the optimal control of nonlinear discrete-time systems by using two processes, online system identification and offline optimal control training, without requiring partial knowledge about the system dynamics. Wang et al. [14] focused on applying an iterative ADP algorithm with an error boundary to obtain the optimal control law, in which NNs are adopted to approximate the performance index function, compute the optimal control policy, and model the nonlinear system.

Extensions of ADP to continuous-time systems face the challenges of proving stability and convergence while keeping the algorithm online and model-free. To approximate the value function and improve the policy for continuous-time systems, Doya [15] derived a temporal difference (TD) error-based algorithm in the framework of HJB. Under a measure of input quadratic performance, Murray et al. [16] developed a stepwise ADP algorithm in the context of HJB. Hanselmann et al. [17] put forward a continuous-time ADP formulation where Newton's method is used in the second-order actor adaption to achieve convergence of the critic. Recently, Bhasin et al. [18] built an actor-critic-identifier (ACI), an architecture that represents the actor, critic, and model by taking NNs as nonlinearly parameterized approximators, while the parameters of the NNs are updated by a least-squares method.

All the aforementioned ADP variants utilize NNs as the function approximator; however, linearly parameterized approximators are usually preferred in RL because they make it easier to understand and analyze the theoretical properties of the resulting RL algorithms [19]. Moreover, most of the above works did not learn a model online to accelerate the convergence rate and improve the sample efficiency. The actor-critic (AC) algorithm was introduced in [20] for the first time; many variants which approximate the value function and the policy by linear function approximation have been widely used in continuous-time systems since then [21–23]. By combining model learning and AC, Grondman et al. [24] proposed an improved learning method called Model Learning Actor-Critic (MLAC), which approximates the value function, the policy, and the process model by LLR. In MLAC, the gradient of the next state with respect to the current action is computed for updating the policy gradient, with the goal of improving the convergence rate of the whole algorithm. In their later work [25], LFA takes the place of LLR as the approximation method for the value function, the policy, and the process model. Enormous numbers of samples are still required when only using such a process model to update the policy gradient. Afterward, Costa et al. [26] derived an AC algorithm by introducing the Dyna structure, called Dyna-MLAC, which approximates the value function, the policy, and the model by LLR as MLAC does. The difference is that Dyna-MLAC applies the model not only in updating the policy gradient but also in planning [27]. Though planning can improve the sample efficiency to a large extent, the model learned by LLR is just a local model, so the global information of the samples is neglected.

Though the above works learn a model during learning of the value function and the policy, only the local information of the samples is utilized. If the global information of the samples can also be utilized reasonably, the convergence performance can be improved further. Inspired by this idea, we establish two novel AC algorithms called AC-HMLP and RAC-HMLP (AC-HMLP with $\ell_2$-regularization). AC-HMLP and RAC-HMLP consist of two models, the global model and the local model. Both models incorporate the state transition function and the reward function for planning. The global model is approximated by LFA, while the local model is represented by LLR. The local and the global models are learned simultaneously at each time step. The local model is used for planning only if the error does not surpass the threshold, while the global planning process is started at the end of an episode, so that the local and the global information can be kept and utilized uniformly.

The main contributions of our work on AC-HMLP and RAC-HMLP are as follows:

(1) Develop two novel AC algorithms based on hierarchical models. Distinguishing them from previous works, AC-HMLP and RAC-HMLP learn a global model where the reward function and the state transition function are approximated by LFA. Meanwhile, unlike the existing model learning methods [28–30], which represent a feature-based model, we directly establish a state-based model to avoid the error brought by inaccurate features.

(2) As MLAC and Dyna-MLAC do, AC-HMLP and RAC-HMLP also learn a local model by LLR. The difference is that we design a useful error threshold to decide whether to start the local planning process. At each time step, the real next state is computed according to the system dynamics, whereas the predicted next state is obtained from the LLR. The error between them is defined as the state-prediction error. If this error does not surpass the error threshold, the local planning process is started.

(3) The local model and the global model are used for planning uniformly. The local and the global models produce local and global samples to update the same value function and policy; as a result, the number of real samples required decreases dramatically.

(4) Experimentally, the convergence performance and the sample efficiency are thoroughly analyzed, where the sample efficiency is defined as the number of samples required for convergence. RAC-HMLP and AC-HMLP are compared with S-AC, MLAC, and Dyna-MLAC in convergence performance and sample efficiency. The results demonstrate that RAC-HMLP performs best, AC-HMLP performs second best, and both of them outperform the other three methods.

This paper is organized as follows. Section 2 reviews some background knowledge concerning MDP and the AC algorithm. Section 3 describes the hierarchical model learning and planning. Section 4 specifies our algorithms, AC-HMLP and RAC-HMLP. The empirical results of the comparisons with the other three representative algorithms are analyzed in Section 5. Section 6 concludes our work and presents possible future work.

2. Preliminaries

2.1. MDP. RL can solve problems modeled by an MDP. An MDP can be represented as a four-tuple $(X, U, \rho, f)$:

(1) $X$ is the state space; $x_t \in X$ denotes the state of the agent at time step $t$.

(2) $U$ represents the action space; $u_t \in U$ is the action which the agent takes at time step $t$.

(3) $\rho: X \times U \rightarrow \mathbb{R}$ denotes the reward function. At time step $t$, the agent located at state $x_t$ takes an action $u_t$, resulting in the next state $x_{t+1}$ while receiving a reward $r_t = \rho(x_t, u_t)$.

(4) $f: X \times U \rightarrow X$ is defined as the transition function; $f(x_t, u_t, x_{t+1})$ is the probability of reaching the next state $x_{t+1}$ after executing $u_t$ at state $x_t$.

The policy $h: X \rightarrow U$ is a mapping from the state space $X$ to the action space $U$, where the mathematical form of $h$ depends on the specific domain. The goal of the agent is to find the optimal policy $h^*$ that maximizes the cumulative rewards. The cumulative rewards are the sum or discounted sum of the received rewards; here we use the latter.

Under the policy $h$, the value function $V^h: X \rightarrow \mathbb{R}$ denotes the expected cumulative rewards, which is shown as
$$V^h(x) = E\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \;\Big|\; x = x_t\right], \tag{1}$$
where $\gamma \in [0, 1]$ represents the discount factor and $x_t$ is the current state.

The optimal state-value function $V^*(x)$ is computed as
$$V^*(x) = \max_h V^h(x), \quad \forall x \in X. \tag{2}$$

Therefore, the optimal policy $h^*$ at state $x$ can be obtained by
$$h^*(x) = \arg\max_h V^h(x), \quad \forall x \in X. \tag{3}$$
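To make the quantity inside the expectation of (1) concrete, the following short Python snippet (an illustrative sketch that is not part of the original paper) evaluates the discounted return $\sum_k \gamma^k r_{t+k+1}$ for one recorded list of rewards.

```python
def discounted_return(rewards, gamma=0.9):
    """Discounted sum of rewards, i.e., the quantity inside the expectation of Eq. (1)."""
    g = 0.0
    for r in reversed(rewards):   # accumulate backwards: g_t = r_{t+1} + gamma * g_{t+1}
        g = r + gamma * g
    return g

# Example: three steps of reward 1 with gamma = 0.9 give 1 + 0.9 + 0.81 = 2.71.
print(discounted_return([1.0, 1.0, 1.0]))
```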

2.2. AC Algorithm. The AC algorithm mainly contains two parts, the actor and the critic, which are stored separately. The actor and the critic are also called the policy and the value function, respectively. Actor-only methods approximate the policy and then update its parameter along the direction of performance improvement, with the possible drawback of large variance resulting from policy estimation. Critic-only methods estimate the value function by approximating a solution to the Bellman equation; the optimal policy is then found by maximizing the value function. Unlike actor-only methods, critic-only methods do not search for the optimal policy in policy space. They just estimate the critic for evaluating the performance of the actor; as a result, the near-optimality of the resulting policy cannot be guaranteed. By combining the merits of the actor and the critic, AC algorithms were proposed, where the value function is approximated and used to update the policy.

The value function and the policy are parameterized by $V(x, \theta)$ and $h(x, \beta)$, where $\theta$ and $\beta$ are the parameters of the value function and the policy, respectively. At each time step $t$, the parameter $\theta$ is updated as
$$\theta_{t+1} = \theta_t + \alpha_c \delta_t \left.\frac{\partial V(x, \theta)}{\partial \theta}\right|_{x = x_t,\, \theta = \theta_t}, \tag{4}$$
where
$$\delta_t = r_t + \gamma V(x_{t+1}, \theta_t) - V(x_t, \theta_t) \tag{5}$$
denotes the TD error of the value function, and $\partial V(x, \theta)/\partial \theta$ represents the feature of the value function. The parameter $\alpha_c \in [0, 1]$ is the learning rate of the value function.

Eligibility is a trick to improve convergence by assigning credit to previously visited states. At each time step $t$, the eligibility can be represented as
$$e_t(x) = \begin{cases} \dfrac{\partial V(x, \theta)}{\partial \theta}, & x_t = x, \\ \lambda \gamma e_{t-1}(x), & x_t \neq x, \end{cases} \tag{6}$$
where $\lambda \in [0, 1]$ denotes the trace-decay rate.

By introducing the eligibility, the update for $\theta$ in (4) can be transformed into
$$\theta_{t+1} = \theta_t + \alpha_c \delta_t e_t(x). \tag{7}$$

The policy parameter $\beta_t$ can be updated by
$$\beta_{t+1} = \beta_t + \alpha_a \delta_t \Delta u_t \frac{\partial h(x, \beta)}{\partial \beta}, \tag{8}$$
where $\partial h(x, \beta)/\partial \beta$ is the feature of the policy, $\Delta u_t$ is a random exploration term drawn from a zero-mean normal distribution, and $\alpha_a \in [0, 1]$ is the learning rate of the policy.

S-AC (Standard AC algorithm) serves as a baseline for comparison with our methods and is shown in Algorithm 1. The value function and the policy are approximated linearly in Algorithm 1, where TD learning is used as the learning algorithm.
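To make the update cycle of Algorithm 1 concrete, the sketch below implements a linear actor-critic with an eligibility trace following (4)-(8). It is a minimal illustration rather than the authors' code: the environment interface (`reset`/`step`), the state-feature map `phi`, and the accumulating form of the trace are assumptions.

```python
import numpy as np

def s_ac(env, phi, n_features, gamma=0.9, lam=0.9, alpha_a=0.5,
         alpha_c=0.4, sigma=1.0, episodes=300, max_steps=3000):
    """Standard actor-critic (S-AC) with linear value function and policy.
    `phi(x)` is assumed to return a feature vector of length `n_features`;
    `env.step(u)` is assumed to return (next_state, reward, done)."""
    theta = np.zeros(n_features)                 # critic (value function) parameters
    beta = np.zeros(n_features)                  # actor (policy) parameters
    for _ in range(episodes):
        e = np.zeros(n_features)                 # eligibility trace
        x = env.reset()
        for _ in range(max_steps):
            du = np.random.normal(0.0, sigma)    # zero-mean Gaussian exploration
            u = float(phi(x) @ beta) + du        # u_t = h(x, beta) + Delta u_t
            x_next, r, done = env.step(u)
            e = lam * gamma * e + phi(x)         # decay the trace, add the visited state
            delta = r + gamma * (phi(x_next) @ theta) - (phi(x) @ theta)  # TD error, Eq. (5)
            theta += alpha_c * delta * e         # critic update, Eq. (7)
            beta += alpha_a * delta * du * phi(x)  # actor update, Eq. (8)
            x = x_next
            if done:
                break
    return theta, beta
```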

3. Hierarchical Model Learning and Planning

3.1. Why Use Hierarchical Model Learning and Planning? The model in RL refers to the state transition function and the reward function. When a model is established, we can use any model-based RL method to find the optimal policy, for example, DP. Model-based methods can significantly decrease the number of required samples and improve the convergence performance. Inspired by this idea, we introduce hierarchical model learning into the AC algorithm so as to make it more sample-efficient. Establishing a relatively accurate model for continuous state and action spaces is still an open issue.


Input: $\gamma$, $\lambda$, $\alpha_a$, $\alpha_c$, $\sigma^2$
(1) Initialize $\theta$, $\beta$
(2) Loop
(3)   $e \leftarrow \vec{1}$, $\theta \leftarrow \vec{0}$, $\beta \leftarrow \vec{0}$, $x \leftarrow x_0$, $t \leftarrow 1$
(4)   Repeat (for all episodes)
(5)     Choose $\Delta u_t$ according to $N(0, \sigma^2)$
(6)     $u_t \leftarrow h(x, \beta) + \Delta u_t$
(7)     Execute $u_t$ and observe $r_{t+1}$ and $x_{t+1}$
(8)     Update the eligibility of the value function: $e_t(x) = \partial V(x, \theta)/\partial \theta$ if $x_t = x$; $e_t(x) = \lambda\gamma e_{t-1}(x)$ otherwise
(9)     Compute the TD error: $\delta_t = r_{t+1} + \gamma V(x_{t+1}, \theta_t) - V(x_t, \theta_t)$
(10)    Update the parameter of the value function: $\theta_{t+1} = \theta_t + \alpha_c \delta_t e_t(x)$
(11)    Update the parameter of the policy: $\beta_{t+1} = \beta_t + \alpha_a \delta_t \Delta u_t\, \partial h(x, \beta)/\partial \beta$
(12)    $t \leftarrow t + 1$
(13)  End Repeat
(14) End Loop
Output: $\theta$, $\beta$

Algorithm 1: S-AC.

Preexisting works are mainly aimed at problems with continuous states but discrete actions. They approximate the transition function in the form of a probability matrix, which specifies the transition probability from the current feature to the next feature. The indirectly observed features result in an inaccurate feature-based model. The convergence rate will be slowed significantly by using such an inaccurate model for planning, especially at each time step in the initial phase.

To solve these problems, we approximate a state-based model instead of the inaccurate feature-based model. Moreover, we introduce an additional global model for planning. The global model is applied only at the end of each episode, so that the global information can be utilized as much as possible. Using only such a global model would lose valuable local information. Thus, like Dyna-MLAC, we also approximate a local model by LLR and use it for planning at each time step. The difference is that a useful error threshold is designed for the local planning in our method: if the state-prediction error between the real next state and the predicted one does not surpass the error threshold, the local planning process is started at the current time step. Therefore, the convergence rate and the sample efficiency can be improved dramatically by combining local and global model learning and planning.

3.2. Learning and Planning of the Global Model. The global model establishes separate equations for the reward function and for the state transition function of every state component by linear function approximation. Assume the agent is at state $x_t = (x_{t,1}, \ldots, x_{t,K})$, where $K$ is the dimensionality of the state $x_t$, and the action $u_t$ is selected according to the policy $h$; then all the components $x'_{t+1,1}, x'_{t+1,2}, \ldots, x'_{t+1,K}$ of the next state $x'_{t+1}$ can be predicted as
$$\begin{aligned}
x'_{t+1,1} &= \eta_{t,1}^{\mathrm{T}} \phi(x_{t,1}, x_{t,2}, \ldots, x_{t,K}, u_t), \\
x'_{t+1,2} &= \eta_{t,2}^{\mathrm{T}} \phi(x_{t,1}, x_{t,2}, \ldots, x_{t,K}, u_t), \\
&\;\;\vdots \\
x'_{t+1,K} &= \eta_{t,K}^{\mathrm{T}} \phi(x_{t,1}, x_{t,2}, \ldots, x_{t,K}, u_t),
\end{aligned} \tag{9}$$
where $\eta_{t,i} = (\eta_{t,i,1}, \eta_{t,i,2}, \ldots, \eta_{t,i,D})^{\mathrm{T}}$, $1 \le i \le K$, is the parameter of the state transition function corresponding to the $i$th component of the current state at time step $t$, with $D$ being the dimensionality of the feature $\phi(x_{t,1}, x_{t,2}, \ldots, x_{t,K}, u_t)$.

Likewise, the reward $r'_{t+1}$ can be predicted as
$$r'_{t+1} = \varsigma_t^{\mathrm{T}} \phi(x_{t,1}, x_{t,2}, \ldots, x_{t,K}, u_t), \tag{10}$$
where $\varsigma_t = (\varsigma_{t,1}, \varsigma_{t,2}, \ldots, \varsigma_{t,D})$ is the parameter of the reward function at time step $t$.

After $\eta_{t,1}, \eta_{t,2}, \ldots, \eta_{t,K}$ and $\varsigma$ are updated, the model can be applied to generate samples. Let the current state be $x_t = (x_{t,1}, x_{t,2}, \ldots, x_{t,K})$; after the action $u_t$ is executed, the parameters $\eta_{t+1,i}$ $(1 \le i \le K)$ can be estimated by gradient descent, shown as
$$\begin{aligned}
\eta_{t+1,1} &= \eta_{t,1} + \alpha_m \left(x_{t+1,1} - x'_{t+1,1}\right) \phi(x_{t,1}, \ldots, x_{t,K}, u_t), \\
\eta_{t+1,2} &= \eta_{t,2} + \alpha_m \left(x_{t+1,2} - x'_{t+1,2}\right) \phi(x_{t,1}, \ldots, x_{t,K}, u_t), \\
&\;\;\vdots \\
\eta_{t+1,K} &= \eta_{t,K} + \alpha_m \left(x_{t+1,K} - x'_{t+1,K}\right) \phi(x_{t,1}, \ldots, x_{t,K}, u_t),
\end{aligned} \tag{11}$$
where $\alpha_m \in [0, 1]$ is the learning rate of the model, $x_{t+1}$ is the real next state obtained according to the system dynamics, with $x_{t+1,i}$ $(1 \le i \le K)$ being its $i$th component, and $(x'_{t+1,1}, x'_{t+1,2}, \ldots, x'_{t+1,K})$ is the predicted next state according to (9).

The parameter $\varsigma_{t+1}$ is estimated as
$$\varsigma_{t+1} = \varsigma_t + \alpha_m \left(r_{t+1} - r'_{t+1}\right) \phi(x_{t,1}, \ldots, x_{t,K}, u_t), \tag{12}$$
where $r_{t+1}$ is the real reward given by the system dynamics while $r'_{t+1}$ is the predicted reward obtained according to (10).
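The following Python sketch mirrors the global-model updates (9)-(12): one linear predictor per state component plus one for the reward, all over a shared state-action feature and trained by gradient descent. It is a hedged illustration; the feature map `phi_sa` and the class interface are assumptions, not the authors' implementation.

```python
import numpy as np

class GlobalModel:
    """Linear (LFA) global model: one linear predictor per state component
    plus one for the reward, over the same state-action feature of length D."""
    def __init__(self, phi_sa, D, K, alpha_m=0.5):
        self.phi_sa = phi_sa           # assumed map (state, action) -> feature vector
        self.eta = np.zeros((K, D))    # eta[i] predicts component i of the next state
        self.zeta = np.zeros(D)        # reward parameters
        self.alpha_m = alpha_m

    def predict(self, x, u):
        f = self.phi_sa(x, u)
        x_next_pred = self.eta @ f          # Eq. (9), all K components at once
        r_pred = float(self.zeta @ f)       # Eq. (10)
        return x_next_pred, r_pred

    def update(self, x, u, x_next, r):
        f = self.phi_sa(x, u)
        x_next_pred, r_pred = self.predict(x, u)
        # Eq. (11): per-component gradient step on the prediction error
        self.eta += self.alpha_m * np.outer(x_next - x_next_pred, f)
        # Eq. (12): same gradient step for the reward model
        self.zeta += self.alpha_m * (r - r_pred) * f
```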


3.3. Learning and Planning of the Local Model. Though the local model also approximates the state transition function and the reward function as the global model does, LLR serves as the function approximator instead of LFA. In the local model, a memory storing samples of the form $(x_t, u_t, x_{t+1}, r_{t+1})$ is maintained. At each time step, a new sample is generated from the interaction and takes the place of the oldest one in the memory. Not all but only the $L$ nearest samples in the memory are selected for computing the parameter matrix $\Gamma \in \mathbb{R}^{(K+1) \times (K+2)}$ of the local model. Before achieving this, the input matrix $X \in \mathbb{R}^{(K+2) \times L}$ and the output matrix $Y \in \mathbb{R}^{(K+1) \times L}$ should be prepared as follows:
$$X = \begin{bmatrix}
x_{1,1} & x_{2,1} & \cdots & x_{L,1} \\
x_{1,2} & x_{2,2} & \cdots & x_{L,2} \\
\vdots & \vdots & & \vdots \\
x_{1,K} & x_{2,K} & \cdots & x_{L,K} \\
u_1 & u_2 & \cdots & u_L \\
1 & 1 & \cdots & 1
\end{bmatrix},
\qquad
Y = \begin{bmatrix}
x_{2,1} & x_{3,1} & \cdots & x_{L+1,1} \\
x_{2,2} & x_{3,2} & \cdots & x_{L+1,2} \\
\vdots & \vdots & & \vdots \\
x_{2,K} & x_{3,K} & \cdots & x_{L+1,K} \\
r_2 & r_3 & \cdots & r_{L+1}
\end{bmatrix}. \tag{13}$$

The last row of $X$, consisting of ones, adds a bias on the output. Every column in the first $K + 1$ rows of $X$ corresponds to a state-action pair; for example, the $i$th column is the state-action pair $(x_i, u_i)^{\mathrm{T}}$, $1 \le i \le L$. $Y$ is composed of the $L$ next states and rewards corresponding to $X$.

$\Gamma$ can be obtained by solving $Y = \Gamma X$ as
$$\Gamma = Y X^{\mathrm{T}} \left(X X^{\mathrm{T}}\right)^{-1}. \tag{14}$$

Let the current input vector be $[x_{t,1}, \ldots, x_{t,K}, u_t, 1]^{\mathrm{T}}$; the output vector $[x'_{t+1,1}, \ldots, x'_{t+1,K}, r'_{t+1}]^{\mathrm{T}}$ can then be predicted by
$$\left[x'_{t+1,1}, \ldots, x'_{t+1,K}, r'_{t+1}\right]^{\mathrm{T}} = \Gamma \left[x_{t,1}, \ldots, x_{t,K}, u_t, 1\right]^{\mathrm{T}}, \tag{15}$$
where $[x_{t,1}, \ldots, x_{t,K}]$ and $[x'_{t+1,1}, \ldots, x'_{t+1,K}]$ are the current state and the predicted next state, respectively, and $r'_{t+1}$ is the predicted reward.

$\Gamma$ is estimated according to (14) at each time step; thereafter, the predicted next state and the predicted reward can be obtained by (15). Moreover, we design an error threshold to decide whether local planning is required. We compute the state-prediction error between the real next state and the predicted next state at each time step. If this error does not surpass the error threshold, the local planning process is launched. The state-prediction error is formulated as
$$Er_t = \max\left\{\left|\frac{x'_{t+1,1} - x_{t+1,1}}{x_{t+1,1}}\right|, \left|\frac{x'_{t+1,2} - x_{t+1,2}}{x_{t+1,2}}\right|, \ldots, \left|\frac{x'_{t+1,K} - x_{t+1,K}}{x_{t+1,K}}\right|, \left|\frac{r'_{t+1} - r_{t+1}}{r_{t+1}}\right|\right\}. \tag{16}$$

Let the error threshold be $\xi$; then the local model is used for planning at time step $t$ only if $Er_t \le \xi$. In the local planning process, a sequence of locally simulated samples of the form $(x_{t,1}, \ldots, x_{t,K}, u_t, x'_{t+1,1}, \ldots, x'_{t+1,K}, r')$ is generated to improve the convergence of the same value function and policy, as the global planning process does.
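A compact sketch of the local model described above is given next: a bounded memory of real samples, an LLR fit on the $L$ nearest neighbours of the current state following (13)-(15), and the relative state-prediction error of (16). The class layout and the small ridge term added before the matrix inversion are implementation assumptions, not part of the paper.

```python
import numpy as np
from collections import deque

class LocalModel:
    """Local model by local linear regression (LLR) over a bounded sample memory."""
    def __init__(self, K, L=9, m_size=100):
        self.K, self.L = K, L
        self.memory = deque(maxlen=m_size)    # oldest sample dropped automatically

    def add(self, x, u, x_next, r):
        self.memory.append((np.asarray(x, float), float(u),
                            np.asarray(x_next, float), float(r)))

    def fit_and_predict(self, x, u):
        # select the L samples whose states are nearest to the current state x
        samples = sorted(self.memory,
                         key=lambda s: np.linalg.norm(s[0] - x))[:self.L]
        X = np.array([np.r_[s[0], s[1], 1.0] for s in samples]).T   # (K+2) x L, Eq. (13)
        Y = np.array([np.r_[s[2], s[3]] for s in samples]).T        # (K+1) x L
        G = X @ X.T + 1e-6 * np.eye(self.K + 2)                     # small ridge for safety
        Gamma = Y @ X.T @ np.linalg.inv(G)                          # Eq. (14)
        out = Gamma @ np.r_[x, u, 1.0]                              # Eq. (15)
        return out[:self.K], float(out[self.K]), Gamma              # x'_{t+1}, r'_{t+1}, Gamma

    @staticmethod
    def prediction_error(x_pred, r_pred, x_next, r_next):
        # Eq. (16): largest relative error over state components and reward
        rel = np.abs(np.r_[x_pred - x_next, r_pred - r_next]
                     / np.r_[x_next, r_next])
        return float(np.max(rel))
```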

4. Algorithm Specification

4.1. AC-HMLP. The AC-HMLP algorithm consists of a main algorithm and two subalgorithms. The main algorithm is the learning algorithm (see Algorithm 2), whereas the two subalgorithms are the local model planning procedure (see Algorithm 3) and the global model planning procedure (see Algorithm 4). At each time step, the main algorithm learns the value function (see line (26) in Algorithm 2), the policy (see line (27) in Algorithm 2), the local model (see line (19) in Algorithm 2), and the global model (see lines (10)-(11) in Algorithm 2).

There are several parameters which have to be determined in the three algorithms. $\gamma$ and $\lambda$ are the discount factor and the trace-decay rate, respectively. $\alpha_a$, $\alpha_c$, and $\alpha_m$ are the learning rates of the policy, the value function, and the global model. $M\_size$ is the capacity of the memory. $\xi$ denotes the error threshold. $L$ determines the number of selected samples used to fit the LLR. $\sigma^2$ is the variance that determines the region of exploration. $P_l$ and $P_g$ are the planning times for the local model and the global model. Some of these parameters have empirical values, for example, $\gamma$ and $\lambda$; the others have to be determined by observing the empirical results.

Notice that Algorithm 3 starts planning at the state $x_t$, which is passed from the current state in Algorithm 2, while Algorithm 4 uses $x_0$ as the initial state. The reason for using different initializations is that the local model is learned from the $L$ nearest samples of the current state $x_t$, whereas the global model is learned from all the samples. Thus, it is reasonable and natural to start the local and global planning processes at the states $x_t$ and $x_0$, respectively.
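The sketch below shows how the pieces of Algorithms 2-4 fit together for one episode: learn from the real transition, update both models, run threshold-gated local planning from the current state, and run global planning from the start state once the episode ends. The `actor_critic` object (with `act`/`learn` methods) and the model classes from the earlier sketches are assumed interfaces; this is a structural illustration, not the authors' code.

```python
import numpy as np

def plan(actor_critic, predict, start, steps):
    """Dyna-style planning: roll a learned model forward from `start` and apply
    the same TD updates used for real samples to the simulated transitions."""
    x = start
    for _ in range(steps):
        u = actor_critic.act(x)
        x_next, r = predict(x, u)                    # one-step model prediction
        actor_critic.learn(x, u, r, x_next)
        x = x_next

def ac_hmlp_episode(env, actor_critic, global_model, local_model,
                    xi=0.15, P_l=30, P_g=300, max_steps=3000):
    """One episode of the AC-HMLP scheme (a structural sketch of Algorithm 2)."""
    x0 = x = env.reset()
    for _ in range(max_steps):
        u = actor_critic.act(x)                      # h(x, beta) + exploration
        x_next, r, done = env.step(u)
        actor_critic.learn(x, u, r, x_next)          # real-sample TD update
        global_model.update(x, u, x_next, r)         # gradient steps of Eqs. (11)-(12)
        local_model.add(x, u, x_next, r)
        x_pred, r_pred, Gamma = local_model.fit_and_predict(x, u)
        if local_model.prediction_error(x_pred, r_pred, x_next, r) <= xi:
            def local_predict(s, a, G=Gamma):
                out = G @ np.r_[s, a, 1.0]           # Eq. (15) with the frozen Gamma
                return out[:-1], float(out[-1])
            plan(actor_critic, local_predict, start=x, steps=P_l)   # Algorithm 3
        x = x_next
        if done:
            break
    # Algorithm 4: global planning from the start state, once per episode
    plan(actor_critic, global_model.predict, start=x0, steps=P_g)
```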

4.2. AC-HMLP with $\ell_2$-Regularization. Regression approaches in machine learning are generally represented as the minimization of a square loss term plus a regularization term. $\ell_2$-regularization, also called ridge regression, is a widely used regularization method in statistics and machine learning which can effectively prevent overfitting. Therefore, we introduce $\ell_2$-regularization into AC-HMLP in the learning of the value function, the policy, and the model. We term this new algorithm RAC-HMLP.


Input: $\gamma$, $\lambda$, $\alpha_a$, $\alpha_c$, $\alpha_m$, $\xi$, $M\_size$, $L$, $\sigma^2$, $K$, $P_l$, $P_g$
(1) Initialize $\theta \leftarrow \vec{0}$, $\beta \leftarrow \vec{0}$, $\Gamma \leftarrow \vec{0}$, $\eta_1, \eta_2, \ldots, \eta_K \leftarrow \vec{0}$, $\varsigma \leftarrow \vec{0}$
(2) Loop
(3)   $e_1 \leftarrow \vec{1}$, $(x_{1,1}, \ldots, x_{1,K}) \leftarrow (x_{0,1}, \ldots, x_{0,K})$, $t \leftarrow 1$, $number \leftarrow 0$
(4)   Repeat (for all episodes)
(5)     Choose $\Delta u_t$ according to $N(0, \sigma^2)$
(6)     Execute the action $u_t = h(x, \beta) + \Delta u_t$
(7)     Observe the reward $r_{t+1}$ and the next state $x_{t+1} = (x_{t+1,1}, \ldots, x_{t+1,K})$
        Update the global model:
(8)     Predict the next state: $x'_{t+1,i} = \eta_{t,i}^{\mathrm{T}} \phi(x_{t,1}, x_{t,2}, \ldots, x_{t,K}, u_t)$ for $i = 1, \ldots, K$
(9)     Predict the reward: $r'_{t+1} = \varsigma_t^{\mathrm{T}} \phi(x_{t,1}, x_{t,2}, \ldots, x_{t,K}, u_t)$
(10)    Update the parameters $\eta_1, \eta_2, \ldots, \eta_K$: $\eta_{t+1,i} = \eta_{t,i} + \alpha_m (x_{t+1,i} - x'_{t+1,i})\, \phi(x_{t,1}, \ldots, x_{t,K}, u_t)$ for $i = 1, \ldots, K$
(11)    Update the parameter $\varsigma$: $\varsigma_{t+1} = \varsigma_t + \alpha_m (r_{t+1} - r'_{t+1})\, \phi(x_{t,1}, \ldots, x_{t,K}, u_t)$
        Update the local model:
(12)    If $t \le M\_size$
(13)      Insert the real sample $(x_t, u_t, x_{t+1}, r_{t+1})$ into the memory $M$
(14)    Else
(15)      Replace the oldest sample in $M$ with the real sample $(x_t, u_t, x_{t+1}, r_{t+1})$
(16)    End If
(17)    Select the $L$ nearest neighbors of the current state from $M$ to construct $X$ and $Y$
(18)    Predict the next state and the reward: $[x'_{t+1}, r'_{t+1}]^{\mathrm{T}} = \Gamma [x_t, u_t, 1]^{\mathrm{T}}$
(19)    Update the parameter $\Gamma$: $\Gamma = Y X^{\mathrm{T}} (X X^{\mathrm{T}})^{-1}$
(20)    Compute the local error: $Er_t = \max\{|(x'_{t+1,1} - x_{t+1,1})/x_{t+1,1}|, \ldots, |(x'_{t+1,K} - x_{t+1,K})/x_{t+1,K}|, |(r'_{t+1} - r_{t+1})/r_{t+1}|\}$
(21)    If $Er_t \le \xi$
(22)      Call Local model planning($\theta$, $\beta$, $\Gamma$, $e$, $number$, $P_l$) (Algorithm 3)
(23)    End If
        Update the value function:
(24)    Update the eligibility: $e_t(x) = \phi(x_t)$ if $x_t = x$; $e_t(x) = \lambda\gamma e_{t-1}(x)$ otherwise
(25)    Estimate the TD error: $\delta_t = r_{t+1} + \gamma V(x_{t+1}, \theta_t) - V(x_t, \theta_t)$
(26)    Update the value-function parameter: $\theta_{t+1} = \theta_t + \alpha_c \delta_t e_t(x)$
        Update the policy:
(27)    Update the policy parameter: $\beta_{t+1} = \beta_t + \alpha_a \delta_t \Delta u_t \phi(x_t)$
(28)    $t = t + 1$
(29)    Update the number of samples: $number = number + 1$
(30)  Until the ending condition is satisfied
(31)  Call Global model planning($\theta$, $\beta$, $\eta_1, \ldots, \eta_K$, $\varsigma$, $e$, $number$, $P_g$) (Algorithm 4)
(32) End Loop
Output: $\beta$, $\theta$

Algorithm 2: AC-HMLP algorithm.


(1) Loop for $P_l$ time steps
(2)   $x_1 \leftarrow x_t$, $t \leftarrow 1$
(3)   Choose $\Delta u_t$ according to $N(0, \sigma^2)$
(4)   $u_t = h(x_t, \beta_t) + \Delta u_t$
(5)   Predict the next state and the reward: $[x_{t+1,1}, \ldots, x_{t+1,K}, r_{t+1}] = \Gamma_t [x_{t,1}, \ldots, x_{t,K}, u_t, 1]^{\mathrm{T}}$
(6)   Update the eligibility: $e_t(x) = \phi(x_t)$ if $x_t = x$; $e_t(x) = \lambda\gamma e_{t-1}(x)$ otherwise
(7)   Compute the TD error: $\delta_t = r_{t+1} + \gamma V(x_{t+1}, \theta_t) - V(x_t, \theta_t)$
(8)   Update the value function parameter: $\theta_{t+1} = \theta_t + \alpha_c \delta_t e_t(x)$
(9)   Update the policy parameter: $\beta_{t+1} = \beta_t + \alpha_a \delta_t \Delta u_t \phi(x_t)$
(10)  If $t \le S$
(11)    $t = t + 1$
(12)  End If
(13)  Update the number of samples: $number = number + 1$
(14) End Loop
Output: $\beta$, $\theta$

Algorithm 3: Local model planning($\theta$, $\beta$, $\Gamma$, $e$, $number$, $P_l$).

(1) Loop for $P_g$ times
(2)   $x_\tau \leftarrow x_0$, $\tau \leftarrow 1$
(3)   Repeat (for all episodes)
(4)     Choose $\Delta u_\tau$ according to $N(0, \sigma^2)$
(5)     Compute the exploratory action $u_\tau = h(x_\tau, \beta) + \Delta u_\tau$
(6)     Predict the next state: $x'_{\tau+1,i} = \eta_i^{\mathrm{T}} \phi(x_{\tau,1}, x_{\tau,2}, \ldots, x_{\tau,K}, u_\tau)$ for $i = 1, \ldots, K$
(7)     Predict the reward: $r'_{\tau+1} = \varsigma^{\mathrm{T}} \phi(x_{\tau,1}, \ldots, x_{\tau,K}, u_\tau)$
(8)     Update the eligibility: $e_\tau(x) = \phi(x_\tau)$ if $x_\tau = x$; $e_\tau(x) = \lambda\gamma e_{\tau-1}(x)$ otherwise
(9)     Compute the TD error: $\delta_\tau = r'_{\tau+1} + \gamma V(x'_{\tau+1}, \theta_\tau) - V(x_\tau, \theta_\tau)$
(10)    Update the value function parameter: $\theta_{\tau+1} = \theta_\tau + \alpha_c \delta_\tau e_\tau(x)$
(11)    Update the policy parameter: $\beta_{\tau+1} = \beta_\tau + \alpha_a \delta_\tau \Delta u_\tau \phi(x_\tau)$
(12)    If $\tau \le T$
(13)      $\tau = \tau + 1$
(14)    End If
(15)    Update the number of samples: $number = number + 1$
(16)  End Repeat
(17) End Loop
Output: $\beta$, $\theta$

Algorithm 4: Global model planning($\theta$, $\beta$, $\eta_1, \ldots, \eta_K$, $\varsigma$, $e$, $number$, $P_g$).

The goal of learning the value function is to minimize the square of the TD error, which is shown as
$$\min_{\theta} \sum_{t=0}^{\text{number}} \left\| r_{t+1} + \gamma V(x'_{t+1}, \theta_t) - V(x_t, \theta_t) \right\|^2 + \ell_c \left\| \theta_t \right\|^2, \tag{17}$$
where $number$ represents the number of samples and $\ell_c \ge 0$ is the regularization parameter of the critic. $\|\theta_t\|^2$ is the $\ell_2$-regularization term, which penalizes the growth of the parameter vector $\theta_t$, so that overfitting to noisy samples can be avoided.

The update for the parameter $\theta$ of the value function in RAC-HMLP is represented as
$$\theta_{t+1} = \theta_t \left(1 - \frac{\alpha_c}{\text{number}}\,\ell_c\right) + \frac{\alpha_c}{\text{number}} \sum_{s=0}^{\text{number}} e_s(x)\, \delta_s. \tag{18}$$

The update for the parameter $\beta$ of the policy is shown as
$$\beta_{t+1} = \beta_t \left(1 - \frac{\alpha_a}{\text{number}}\,\ell_a\right) + \frac{\alpha_a}{\text{number}} \sum_{s=0}^{\text{number}} \phi(x_s)\, \delta_s\, \Delta u_s, \tag{19}$$
where $\ell_a \ge 0$ is the regularization parameter of the actor.


Figure 1: Pole balancing problem.

The update for the global model can be denoted as
$$\eta_{t+1,i} = \eta_{t,i} \left(1 - \frac{\alpha_m}{\text{number}}\,\ell_m\right) + \frac{\alpha_m}{\text{number}} \sum_{s=1}^{\text{number}} \phi(x_{s,1}, \ldots, x_{s,K}, u_s) \left(x_{s+1,i} - x'_{s+1,i}\right), \quad i = 1, \ldots, K, \tag{20}$$
$$\varsigma_{t+1} = \varsigma_t \left(1 - \frac{\alpha_m}{\text{number}}\,\ell_m\right) + \frac{\alpha_m}{\text{number}} \sum_{s=1}^{\text{number}} \phi(x_{s,1}, \ldots, x_{s,K}, u_s) \left(r_{s+1} - r'_{s+1}\right), \tag{21}$$
where $\ell_m \ge 0$ is the regularization parameter for the model, namely, the state transition function and the reward function.

After we replace the update equations of the parameters in Algorithms 2, 3, and 4 with (18), (19), (20), and (21), we obtain the resulting algorithm, RAC-HMLP. Except for the above update equations, the other parts of RAC-HMLP are the same as in AC-HMLP, so we do not specify them here.
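As a concrete illustration of how the regularized updates change the plain TD step, the snippet below applies (18) to the critic parameters; the actor and model updates (19)-(21) follow the same shrink-then-correct pattern with their own features and errors. This is a hedged sketch, not the authors' implementation.

```python
import numpy as np

def regularized_critic_update(theta, traces, deltas, alpha_c=0.4, l_c=0.01):
    """RAC-HMLP critic update of Eq. (18): shrink the parameter vector by the
    l2 penalty and add the trace-weighted TD errors averaged over the `number`
    samples seen so far. `traces` is a list of eligibility vectors e_s(x) and
    `deltas` the matching TD errors delta_s."""
    number = len(deltas)
    shrink = 1.0 - (alpha_c / number) * l_c
    correction = (alpha_c / number) * sum(e * d for e, d in zip(traces, deltas))
    return theta * shrink + correction
```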

5. Empirical Results and Analysis

AC-HMLP and RAC-HMLP are compared with S-AC, MLAC, and Dyna-MLAC on two continuous state and action space problems: the pole balancing problem [31] and the continuous maze problem [32].

5.1. Pole Balancing Problem. The pole balancing problem is a low-dimensional but challenging benchmark problem widely used in the RL literature, shown in Figure 1.

There is a cart moving along the track with a hinged pole on its top. The goal is to find a policy which can guide the force to keep the pole balanced. The system dynamics is modeled by the following equation:
$$\ddot{w} = \frac{g \sin(w) - vml\,\dot{w}^2 \sin(2w)/2 - v\cos(w)\,F}{4l/3 - vml\cos^2(w)}, \tag{22}$$
where $w$ is the angle of the pole with the vertical line, and $\dot{w}$ and $\ddot{w}$ are the angular velocity and the angular acceleration of the pole. $F$ is the force exerted on the cart; a negative value means a force to the right, and otherwise the force is to the left. $g$ is the gravity constant, with the value $g = 9.81\,\mathrm{m/s^2}$. $m$ and $l$ are the mass and the length of the pole, which are set to $m = 2.0\,\mathrm{kg}$ and $l = 0.5\,\mathrm{m}$, respectively. $v$ is a constant with the value $1/(m + m_c)$, where $m_c = 8.0\,\mathrm{kg}$ is the mass of the cart.
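A minimal simulation step for the dynamics in (22) is sketched below. The explicit Euler integration and the use of the Table 1 time step of 0.1 s are assumptions on top of the paper, which only specifies the continuous-time equation; the physical constants follow the values given above.

```python
import numpy as np

def pole_step(w, w_dot, F, dt=0.1, g=9.81, m=2.0, l=0.5, m_c=8.0):
    """One Euler step of the pole-balancing dynamics in Eq. (22).
    Angles are in radians, the force F in newtons."""
    v = 1.0 / (m + m_c)
    w_ddot = (g * np.sin(w) - v * m * l * (w_dot ** 2) * np.sin(2 * w) / 2.0
              - v * np.cos(w) * F) / (4.0 * l / 3.0 - v * m * l * np.cos(w) ** 2)
    w_dot_new = w_dot + dt * w_ddot      # integrate angular acceleration
    w_new = w + dt * w_dot_new           # integrate angular velocity
    return w_new, w_dot_new
```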

The state $x = (w, \dot{w})$ is composed of $w$ and $\dot{w}$, which are bounded by $[-\pi, \pi]$ and $[-2, 2]$, respectively. The action is the force $F$, bounded by $[-50\,\mathrm{N}, 50\,\mathrm{N}]$. The maximal number of episodes is 300, and the maximal number of time steps for each episode is 3000. If the angle of the pole with the vertical line does not exceed $\pi/4$ at a time step, the agent receives a reward of 1; otherwise the pole falls down and the agent receives a reward of $-1$. There is also a random noise applied to the force, bounded by $[-10\,\mathrm{N}, 10\,\mathrm{N}]$. An episode ends when the pole has kept balance for 3000 time steps or the pole falls down. The pole falls down if $|w| \ge \pi/4$. To approximate the value function and the reward function, the state-based feature is coded by radial basis functions (RBFs), shown as
$$\phi_i(x) = e^{-(1/2)(x - c_i)^{\mathrm{T}} B^{-1} (x - c_i)}, \tag{23}$$
where $c_i$ denotes the $i$th center point, located over the grid points $\{-0.6, -0.4, -0.2, 0, 0.2, 0.4, 0.6\} \times \{-2, 0, 2\}$, and $B$ is a diagonal matrix containing the widths of the RBFs, with $\sigma_w^2 = 0.2$ and $\sigma_{\dot{w}}^2 = 2$. The dimensionality $z$ of $\phi(x)$ is 21. Notice that the approximations of the value function and the policy only require the state-based feature; however, the approximation of the model requires the state-action-based feature. Letting $x$ be $(w, \dot{w}, u)$, we can also compute its feature by (23). In this case, $c_i$ is located over the grid points $\{-0.6, -0.4, -0.2, 0, 0.2, 0.4, 0.6\} \times \{-2, 0, 2\} \times \{-50, -25, 0, 25, 50\}$. The diagonal elements of $B$ are $\sigma_w^2 = 0.2$, $\sigma_{\dot{w}}^2 = 2$, and $\sigma_F^2 = 10$. The dimensionality $z$ of $\phi(w, \dot{w}, u)$ is 105. Either the state-based or the state-action-based feature has to be normalized so as to make its value smaller than 1, shown as
$$\bar{\phi}_i(x) = \frac{\phi_i(x)}{\sum_{i=1}^{z} \phi_i(x)}, \tag{24}$$
where $z > 0$ is the dimensionality of the feature.

RAC-HMLP and AC-HMLP are compared with S-AC, MLAC, and Dyna-MLAC on this experiment. The parameters of S-AC, MLAC, and Dyna-MLAC are set according to the values mentioned in their papers. The parameters of AC-HMLP and RAC-HMLP are set as shown in Table 1.
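The normalized RBF coding of (23)-(24) can be written compactly as below; the grid of 21 state-based centers and the diagonal widths match the description above, while the helper itself is only an illustrative sketch.

```python
import numpy as np
from itertools import product

def rbf_features(x, centers, widths):
    """Normalized RBF features of Eqs. (23)-(24): Gaussian bumps around fixed
    grid centers with a diagonal width matrix B, rescaled to sum to one."""
    x = np.asarray(x, float)
    diffs = centers - x                                   # one row per center
    phi = np.exp(-0.5 * np.sum(diffs ** 2 / widths, axis=1))
    return phi / np.sum(phi)

# Example: the 21 state-based centers for the pole problem,
# {-0.6, -0.4, ..., 0.6} x {-2, 0, 2}, with widths sigma_w^2 = 0.2, sigma_wdot^2 = 2.
centers = np.array(list(product(np.arange(-0.6, 0.61, 0.2), [-2.0, 0.0, 2.0])))
widths = np.array([0.2, 2.0])
features = rbf_features([0.1, 0.5], centers, widths)      # feature vector of length 21
```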

The planning times may have a significant effect on the convergence performance. Therefore, we have to determine the local planning times $P_l$ and the global planning times $P_g$ first. The candidate settings for the two parameters are $(30, 300)$, $(50, 300)$, $(30, 600)$, and $(50, 600)$.


Table 1: Parameter settings of RAC-HMLP and AC-HMLP.

Parameter | Symbol | Value
Time step | $T_s$ | 0.1
Discount factor | $\gamma$ | 0.9
Trace-decay rate | $\lambda$ | 0.9
Exploration variance | $\sigma^2$ | 1
Learning rate of the actor | $\alpha_a$ | 0.5
Learning rate of the critic | $\alpha_c$ | 0.4
Learning rate of the model | $\alpha_m$ | 0.5
Error threshold | $\xi$ | 0.15
Capacity of the memory | $M\_size$ | 100
Number of the nearest samples | $L$ | 9
Local planning times | $P_l$ | 30
Global planning times | $P_g$ | 300
Number of components of the state | $K$ | 2
Regularization parameter of the model | $\ell_m$ | 0.2
Regularization parameter of the critic | $\ell_c$ | 0.01
Regularization parameter of the actor | $\ell_a$ | 0.001

From Figure 2(a), it is easy to find out that $(P_l = 30, P_g = 300)$ behaves best among these four settings, with 41 episodes needed for convergence. $(P_l = 30, P_g = 600)$ learns fastest in the early 29 episodes, but it does not converge until the 52nd episode. $(P_l = 30, P_g = 600)$ and $(P_l = 30, P_g = 300)$ have identical local planning times but different global planning times. Generally, the more planning, the faster the convergence rate, but $(P_l = 30, P_g = 300)$ performs better instead: the global model is not accurate enough at the initial time, and planning through such an inaccurate global model leads to unstable performance. Notice that $(P_l = 50, P_g = 300)$ and $(P_l = 50, P_g = 600)$ behave more poorly than $(P_l = 30, P_g = 600)$ and $(P_l = 30, P_g = 300)$, which demonstrates that planning too much via the local model does not perform better. $(P_l = 50, P_g = 300)$ seems to converge at the 51st episode, but its learning curve fluctuates heavily after 365 episodes and does not converge any more until the end. Like $(P_l = 50, P_g = 300)$, $(P_l = 50, P_g = 600)$ also has heavy fluctuation but converges at the 373rd episode. Evidently, $(P_l = 50, P_g = 600)$ performs slightly better than $(P_l = 50, P_g = 300)$; planning through the global model might resolve the instability caused by planning with the local model.

The convergence performances of the five methods, RAC-HMLP, AC-HMLP, S-AC, MLAC, and Dyna-MLAC, are shown in Figure 2(b). It is evident that our methods, RAC-HMLP and AC-HMLP, have the best convergence performances. RAC-HMLP and AC-HMLP converge at the 39th and 41st episodes, respectively. Both of them learn quickly in the primary phase, but the learning curve of RAC-HMLP seems to be steeper than that of AC-HMLP. Dyna-MLAC learns faster than MLAC and S-AC in the first 74 episodes, but it does not converge until the 99th episode. Though MLAC behaves more poorly than Dyna-MLAC, it requires just 82 episodes to converge. The method with the slowest learning rate is S-AC, where the pole keeps balance for 3000 time steps for the first time at the 251st episode; unfortunately, it does not converge until the 333rd episode. RAC-HMLP converges fastest, which might be caused by introducing the $\ell_2$-regularization: because the $\ell_2$-regularization does not allow the parameters to grow rapidly, overfitting of the learning process can be avoided effectively. Dyna-MLAC converges faster than MLAC in the early 74 episodes, but its performance is not stable enough, embodied in the heavy fluctuation from the 75th to the 99th episode. If the approximate local model differs largely from the real local model, then planning through such an inaccurate local model might lead to unstable performance. S-AC, as the only method without model learning, behaves poorest among the five. These results show that the convergence performance can be improved largely by introducing model learning.

Figure 2: Comparisons of different planning times and different algorithms. (a) Determination of different planning times. (b) Comparisons of different algorithms in convergence performance.

The comparisons of the five different algorithms in sample efficiency are shown in Figure 3. It is clear that the numbers of samples required for S-AC, MLAC, Dyna-MLAC, AC-HMLP, and RAC-HMLP to converge are 56003, 35535, 32043, 14426, and 12830, respectively. Evidently, RAC-HMLP converges fastest and behaves stably all the time, thus requiring the fewest samples. Though AC-HMLP requires more samples to converge than RAC-HMLP, it still requires far fewer samples than the other three methods. Because S-AC does not utilize a model, it requires the most samples among the five. Unlike MLAC, Dyna-MLAC applies the model not only in updating the policy but also in planning, resulting in a smaller requirement for samples.

Figure 3: Comparisons of sample efficiency.

The optimal policy and optimal value function learned by AC-HMLP after the training ends are shown in Figures 4(a) and 4(b), respectively, while the ones learned by RAC-HMLP are shown in Figures 4(c) and 4(d). Evidently, the optimal policy and the optimal value function learned by AC-HMLP and RAC-HMLP are quite similar, but RAC-HMLP seems to have a more fine-grained optimal policy and value function.

As for the optimal policy (see Figures 4(a) and 4(c)), the force becomes smaller and smaller from the two sides (left side and right side) to the middle. The top-right is the region requiring a force close to 50 N, where the direction of the angle is the same as that of the angular velocity. The values of the angle and the angular velocity are nearing the maximum values $\pi/4$ and 2; therefore, the largest force to the left is required so as to guarantee that the angle between the upright line and the pole is no bigger than $\pi/4$. It is the opposite in the bottom-left region, where a force in $[-50\,\mathrm{N}, -40\,\mathrm{N}]$ is required to keep the angle from surpassing $-\pi/4$. The pole can keep balance with a gentle force close to 0 in the middle region. The direction of the angle is different from that of the angular velocity in the top-left and bottom-right regions; thus a force whose absolute value is relatively large but smaller than 50 N is required.

In terms of the optimal value function (see Figures 4(b) and 4(d)), the value function reaches its maximum in the region satisfying $-2 \le w \le 2$. The pole is more prone to keep balance, even without applying any force, in this region, resulting in a relatively larger value function. The pole becomes more and more difficult to balance from the middle to the two sides, with the value function also decaying gradually. The finer-grained values on the left side compared with the right side might be caused by more frequent visitation of the left side; more visitation leads to a more accurate estimation of the value function.

After the training ends, the prediction of the next state and the reward for every state-action pair can be obtained through the learned model. The predictions of the next angle $w$, the next angular velocity, and the reward for any possible state are shown in Figures 5(a), 5(b), and 5(c). It is noticeable that the predicted next state is always near the current state in Figures 5(a) and 5(b). The received reward is always larger than 0 in Figure 5(c), which illustrates that the pole can always keep balance under the optimal policy.

5.2. Continuous Maze Problem. The continuous maze problem is shown in Figure 6, where the blue lines with coordinates represent the barriers. The state $(x, y)$ consists of the horizontal coordinate $x \in [0, 1]$ and the vertical coordinate $y \in [0, 1]$. Starting from the position "Start", the goal of the agent is to reach the position "Goal", which satisfies $x + y > 1.8$. The action is to choose an angle bounded by $[-\pi, \pi]$ as the new direction and then walk along this direction with a step of 0.1. The agent receives a reward of $-1$ at each time step. The agent is punished with a reward of $-400$ multiplied by the distance if the distance between its position and the barrier exceeds 0.1. The parameters are set the same as in the former problem except for the planning times; the local and global planning times are determined as 10 and 50 in the same way as in the former experiment. The state-based feature is computed according to (23) and (24) with dimensionality $z = 16$. The center point $c_i$ is located over the grid points $\{0.2, 0.4, 0.6, 0.8\} \times \{0.2, 0.4, 0.6, 0.8\}$. The diagonal elements of $B$ are $\sigma_x^2 = 0.2$ and $\sigma_y^2 = 0.2$. The state-action-based feature is also computed according to (23) and (24). The center point $c_i$ is located over the grid points $\{0.2, 0.4, 0.6, 0.8\} \times \{0.2, 0.4, 0.6, 0.8\} \times \{-3, -1.5, 0, 1.5, 3\}$, with dimensionality $z = 80$. The diagonal elements of $B$ are $\sigma_x^2 = 0.2$, $\sigma_y^2 = 0.2$, and $\sigma_u^2 = 1.5$.

RAC-HMLP and AC-HMLP are compared with S-AC, MLAC, and Dyna-MLAC in cumulative rewards. The results are shown in Figure 7(a). RAC-HMLP learns fastest in the early 15 episodes, MLAC behaves second best, and AC-HMLP performs poorest. RAC-HMLP tends to converge at the 24th episode, but it really converges at the 39th episode. AC-HMLP behaves steadily from the 23rd episode to the end, and it converges at the 43rd episode. As in the former experiment, RAC-HMLP performs slightly better than AC-HMLP, embodied in the cumulative rewards of $-44$ compared with $-46$. At the 53rd episode, the cumulative rewards of S-AC fall quickly to about $-545$ without ascending thereafter. Though MLAC and Dyna-MLAC perform well in the primary phase, their curves start to descend at the 88th episode and the 93rd episode, respectively. MLAC and Dyna-MLAC learn a local model to update the policy gradient, resulting in a fast learning rate in the primary phase.

However, they do not behave stably enough near the end of the training. Too much visitation to the barrier region might cause the fast descent of the cumulative rewards.

Figure 4: Optimal policy and value function learned by AC-HMLP and RAC-HMLP. (a) Optimal policy of AC-HMLP learned after training. (b) Optimal value function of AC-HMLP learned after training. (c) Optimal policy of RAC-HMLP learned after training. (d) Optimal value function of RAC-HMLP learned after training.

The comparisons of the time steps for reaching the goal are shown in Figure 7(b). Obviously, RAC-HMLP and AC-HMLP perform better than the other three methods. RAC-HMLP converges at the 39th episode while AC-HMLP converges at the 43rd episode, with the time steps for reaching the goal being 45 and 47, so RAC-HMLP still performs slightly better than AC-HMLP. The time steps for reaching the goal are 548, 69, and 201 for S-AC, MLAC, and Dyna-MLAC. Among the five methods, RAC-HMLP not only converges fastest but also has the best solution, 45. The poorest performance of S-AC illustrates that model learning can definitely improve the convergence performance. As in the former experiment, Dyna-MLAC behaves worse than MLAC during training: if the model is inaccurate, planning via such a model might distort the estimates of the value function and the policy, thus leading to a poor performance in Dyna-MLAC.

The comparisons of sample efficiency are shown in Figure 8. The required samples for S-AC, MLAC, Dyna-MLAC, AC-HMLP, and RAC-HMLP to converge are 10595, 6588, 7694, 4388, and 4062, respectively. As in the pole balancing problem, RAC-HMLP requires the fewest samples while S-AC needs the most to converge. The difference is that Dyna-MLAC requires slightly more samples than MLAC. The ascending curve of Dyna-MLAC at the end of the training demonstrates that it has not converged. The frequent visitation to the barrier in Dyna-MLAC leads to a quick descent of the value function; as a result, enormous samples are required to bring the value function estimates back toward the true ones.

After the training ends, the approximate optimal policy and value function are obtained, as shown in Figures 9(a) and 9(b). It is noticeable that the lower part of the figure is explored thoroughly, so that the policy and the value function are very distinctive in this region. For example, in most of the region in Figure 9(a), a larger angle is needed for the agent so as to leave the current state. Clearly, the nearer the current state is to the lower part of Figure 9(b), the smaller the corresponding value function. The top part of the figures may be visited infrequently or not at all by the agent, resulting in similar value functions; the agent is able to reach the goal with only a gentle angle in these areas.

6. Conclusion and Discussion

This paper proposes two novel actor-critic algorithms, AC-HMLP and RAC-HMLP. Both of them take LLR and LFA to represent the local model and the global model, respectively. It has been shown that our new methods are able to learn the optimal value function and the optimal policy. In the pole balancing and continuous maze problems, RAC-HMLP and AC-HMLP are compared with three representative methods. The results show that RAC-HMLP and AC-HMLP not only converge fastest but also have the best sample efficiency.

Figure 5: Prediction of the next state and reward according to the global model. (a) Prediction of the angle at the next state; (b) prediction of the angular velocity at the next state; (c) prediction of the reward. Axes: angle (rad) versus angular velocity (rad/s).

Figure 6: Continuous maze problem. The coordinate points marked in the figure are (0.3, 0.7), (0.55, 0.7), (0.55, 0.32), and (0.55, 0.85); the agent travels from "Start" to "Goal" inside the unit square.

Figure 7: Comparisons of cumulative rewards and time steps for reaching the goal. (a) Comparisons of cumulative rewards; (b) comparisons of time steps for reaching the goal (S-AC, MLAC, Dyna-MLAC, AC-HMLP, and RAC-HMLP; x-axis: training episodes).

Figure 8: Comparisons of the different algorithms in sample efficiency (number of samples to converge versus training episodes).

RAC-HMLP performs slightly better than AC-HMLP in convergence rate and sample efficiency. By introducing ℓ2-regularization, the parameters learned by RAC-HMLP are smaller and more uniform than those of AC-HMLP, so that overfitting can be avoided effectively. Though RAC-HMLP behaves better, its improvement over AC-HMLP is not significant, because AC-HMLP normalizes all the features to [0, 1]; the parameters therefore do not change heavily, and overfitting is also prohibited to a certain extent.
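For concreteness, the regularized updates of RAC-HMLP in (18) and (19) can be sketched as follows: the shrinkage factor (1 − αℓ/number) is what keeps the parameters small and uniform, and setting ℓ_c = ℓ_a = 0 recovers the plain AC-HMLP updates. This is a minimal sketch with our own variable names; the default values follow Table 1.

```python
import numpy as np

def rac_hmlp_updates(theta, beta, traces, deltas, feats, dus,
                     alpha_c=0.4, alpha_a=0.5, l_c=0.01, l_a=0.001):
    """l2-regularized critic and actor updates, cf. (18) and (19).

    traces, feats : per-sample eligibility vectors e_s(x) and policy features phi(x_s)
    deltas, dus   : per-sample TD errors delta_s and exploration terms Delta u_s
    """
    n = len(deltas)                                  # "number" of samples collected so far
    theta = theta * (1 - alpha_c * l_c / n) + (alpha_c / n) * sum(
        e * d for e, d in zip(traces, deltas))       # shrink, then move along the TD direction
    beta = beta * (1 - alpha_a * l_a / n) + (alpha_a / n) * sum(
        f * d * du for f, d, du in zip(feats, deltas, dus))
    return theta, beta
```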

S-AC is the only algorithm without model learning, and its poorest performance demonstrates that combining model learning with AC can really improve the performance. Dyna-MLAC learns a model via LLR for local planning and policy updating. However, Dyna-MLAC directly utilizes the model without making sure whether it is accurate; additionally, it does not utilize the global information about the samples. Therefore, it behaves worse than AC-HMLP and RAC-HMLP. MLAC also approximates a local model via LLR as Dyna-MLAC does, but it only takes the model to update the policy gradient, surprisingly with a slightly better performance than Dyna-MLAC.

Dyna-MLAC and MLAC approximate the value function, the policy, and the model through LLR. In the LLR approach, the samples collected in the interaction with the environment have to be stored in the memory, and through KNN or a K-d tree only the L nearest samples in the memory are selected to learn the LLR. Such a learning process incurs considerable computation and memory costs. In AC-HMLP and RAC-HMLP, there is only a parameter vector to be stored and learned for each of the value function, the policy, and the global model. Therefore, AC-HMLP and RAC-HMLP also outperform Dyna-MLAC and MLAC in computation and memory costs.
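To make this comparison tangible, the sketch below illustrates the LLR local model shared by MLAC, Dyna-MLAC, and the local planner of AC-HMLP: the L nearest stored samples form the matrices X and Y, the parameter matrix is the least-squares solution Γ = Y X^T (X X^T)^(-1) of (14), and AC-HMLP additionally tests the relative state-prediction error of (16) against the threshold ξ before planning with the model. The memory layout and the function and variable names are our own illustration; L = 9 and ξ = 0.15 follow Table 1.

```python
import numpy as np

def fit_local_model(states, actions, next_states, rewards, x, u, L=9):
    """Fit an LLR model on the L nearest neighbours of the query (x, u) in the memory."""
    query = np.concatenate([x, [u]])
    dists = np.linalg.norm(np.column_stack([states, actions]) - query, axis=1)
    idx = np.argsort(dists)[:L]                                        # stand-in for KNN / K-d tree search
    X = np.column_stack([states[idx], actions[idx], np.ones(L)]).T     # (K+2, L); the row of ones adds a bias
    Y = np.column_stack([next_states[idx], rewards[idx]]).T            # (K+1, L)
    return Y @ X.T @ np.linalg.inv(X @ X.T)                            # Gamma = Y X^T (X X^T)^(-1), cf. (14)

def local_planning_allowed(gamma, x, u, x_next, r_next, xi=0.15):
    """Relative state-prediction error test of (16): plan locally only if the error stays below xi."""
    pred = gamma @ np.concatenate([x, [u, 1.0]])                       # predicted next state and reward, cf. (15)
    truth = np.concatenate([x_next, [r_next]])
    return np.max(np.abs((pred - truth) / truth)) <= xi
```

In contrast, the global model keeps only the fixed-size parameter vectors η_1, ..., η_K and ς, which is exactly what saves memory and computation.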

The planning times for the local model and the global model have to be determined according to the experimental performance; thus we have to set the planning times according to the different domains. To address this problem, our future work will consider how to determine the planning times adaptively for different domains. Moreover, with the development of deep learning [33, 34], deep RL has succeeded in many applications, such as AlphaGo and Atari 2600, by taking visual images or videos as input. However, how to accelerate the learning process in deep RL through model learning is still an open question. Future work will endeavor to combine deep RL with hierarchical model learning and planning to solve real-world problems.

Competing Interests

The authors declare that there are no competing interests regarding the publication of this article.

Figure 9: Final optimal policy and value function after training. (a) Optimal policy learned by AC-HMLP; (b) optimal value function learned by AC-HMLP. Axes: coordinate x versus coordinate y.

Acknowledgments

This research was partially supported by the Innovation Center of Novel Software Technology and Industrialization, the National Natural Science Foundation of China (61502323, 61502329, 61272005, 61303108, 61373094, and 61472262), the Natural Science Foundation of Jiangsu (BK2012616), the High School Natural Foundation of Jiangsu (13KJB520020), the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172014K04), and the Suzhou Industrial Application of Basic Research Program Part (SYG201422).

References

[1] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, Mass, USA, 1998.

[2] M. L. Littman, "Reinforcement learning improves behaviour from evaluative feedback," Nature, vol. 521, no. 7553, pp. 445–451, 2015.

[3] D. Silver, R. S. Sutton, and M. Muller, "Temporal-difference search in computer Go," Machine Learning, vol. 87, no. 2, pp. 183–219, 2012.

[4] J. Kober and J. Peters, "Reinforcement learning in robotics: a survey," in Reinforcement Learning, pp. 579–610, Springer, Berlin, Germany, 2012.

[5] D.-R. Liu, H.-L. Li, and D. Wang, "Feature selection and feature learning for high-dimensional batch reinforcement learning: a survey," International Journal of Automation and Computing, vol. 12, no. 3, pp. 229–242, 2015.

[6] R. E. Bellman, Dynamic Programming, Princeton University Press, Princeton, NJ, USA, 1957.

[7] D. P. Bertsekas, Dynamic Programming and Optimal Control, Athena Scientific, Belmont, Mass, USA, 1995.

[8] F. L. Lewis and D. R. Liu, Reinforcement Learning and Approximate Dynamic Programming for Feedback Control, John Wiley & Sons, Hoboken, NJ, USA, 2012.

[9] P. J. Werbos, "Advanced forecasting methods for global crisis warning and models of intelligence," General Systems, vol. 22, pp. 25–38, 1977.

[10] D. R. Liu, H. L. Li, and D. Wang, "Data-based self-learning optimal control: research progress and prospects," Acta Automatica Sinica, vol. 39, no. 11, pp. 1858–1870, 2013.

[11] H. G. Zhang, D. R. Liu, Y. H. Luo, and D. Wang, Adaptive Dynamic Programming for Control: Algorithms and Stability, Communications and Control Engineering Series, Springer, London, UK, 2013.

[12] A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, "Discrete-time nonlinear HJB solution using approximate dynamic programming: convergence proof," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 38, no. 4, pp. 943–949, 2008.

[13] T. Dierks, B. T. Thumati, and S. Jagannathan, "Optimal control of unknown affine nonlinear discrete-time systems using offline-trained neural networks with proof of convergence," Neural Networks, vol. 22, no. 5-6, pp. 851–860, 2009.

[14] F.-Y. Wang, N. Jin, D. Liu, and Q. Wei, "Adaptive dynamic programming for finite-horizon optimal control of discrete-time nonlinear systems with ε-error bound," IEEE Transactions on Neural Networks, vol. 22, no. 1, pp. 24–36, 2011.

[15] K. Doya, "Reinforcement learning in continuous time and space," Neural Computation, vol. 12, no. 1, pp. 219–245, 2000.

[16] J. J. Murray, C. J. Cox, G. G. Lendaris, and R. Saeks, "Adaptive dynamic programming," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 32, no. 2, pp. 140–153, 2002.

[17] T. Hanselmann, L. Noakes, and A. Zaknich, "Continuous-time adaptive critics," IEEE Transactions on Neural Networks, vol. 18, no. 3, pp. 631–647, 2007.

[18] S. Bhasin, R. Kamalapurkar, M. Johnson, K. G. Vamvoudakis, F. L. Lewis, and W. E. Dixon, "A novel actor-critic-identifier architecture for approximate optimal control of uncertain nonlinear systems," Automatica, vol. 49, no. 1, pp. 82–92, 2013.

[19] L. Busoniu, R. Babuska, B. De Schutter, et al., Reinforcement Learning and Dynamic Programming Using Function Approximators, CRC Press, New York, NY, USA, 2010.

[20] A. G. Barto, R. S. Sutton, and C. W. Anderson, "Neuronlike adaptive elements that can solve difficult learning control problems," IEEE Transactions on Systems, Man, and Cybernetics, vol. 13, no. 5, pp. 834–846, 1983.

[21] H. Van Hasselt and M. A. Wiering, "Reinforcement learning in continuous action spaces," in Proceedings of the IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning (ADPRL '07), pp. 272–279, Honolulu, Hawaii, USA, April 2007.

[22] S. Bhatnagar, R. S. Sutton, M. Ghavamzadeh, and M. Lee, "Natural actor-critic algorithms," Automatica, vol. 45, no. 11, pp. 2471–2482, 2009.

[23] I. Grondman, L. Busoniu, G. A. D. Lopes, and R. Babuska, "A survey of actor-critic reinforcement learning: standard and natural policy gradients," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 42, no. 6, pp. 1291–1307, 2012.

[24] I. Grondman, L. Busoniu, and R. Babuska, "Model learning actor-critic algorithms: performance evaluation in a motion control task," in Proceedings of the 51st IEEE Conference on Decision and Control (CDC '12), pp. 5272–5277, Maui, Hawaii, USA, December 2012.

[25] I. Grondman, M. Vaandrager, L. Busoniu, R. Babuska, and E. Schuitema, "Efficient model learning methods for actor-critic control," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 42, no. 3, pp. 591–602, 2012.

[26] B. Costa, W. Caarls, and D. S. Menasche, "Dyna-MLAC: trading computational and sample complexities in actor-critic reinforcement learning," in Proceedings of the Brazilian Conference on Intelligent Systems (BRACIS '15), pp. 37–42, Natal, Brazil, November 2015.

[27] O. Andersson, F. Heintz, and P. Doherty, "Model-based reinforcement learning in continuous environments using real-time constrained optimization," in Proceedings of the 29th AAAI Conference on Artificial Intelligence (AAAI '15), pp. 644–650, Texas, USA, January 2015.

[28] R. S. Sutton, C. Szepesvari, A. Geramifard, and M. Bowling, "Dyna-style planning with linear function approximation and prioritized sweeping," in Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence (UAI '08), pp. 528–536, July 2008.

[29] R. Faulkner and D. Precup, "Dyna planning using a feature based generative model," in Proceedings of the Advances in Neural Information Processing Systems (NIPS '10), pp. 1–9, Vancouver, Canada, December 2010.

[30] H. Yao and C. Szepesvari, "Approximate policy iteration with linear action models," in Proceedings of the 26th National Conference on Artificial Intelligence (AAAI '12), Toronto, Canada, July 2012.

[31] H. R. Berenji and P. Khedkar, "Learning and tuning fuzzy logic controllers through reinforcements," IEEE Transactions on Neural Networks, vol. 3, no. 5, pp. 724–740, 1992.

[32] R. S. Sutton, "Generalization in reinforcement learning: successful examples using sparse coarse coding," in Advances in Neural Information Processing Systems, pp. 1038–1044, MIT Press, New York, NY, USA, 1996.

[33] D. Silver, A. Huang, C. J. Maddison, et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.

[34] V. Mnih, K. Kavukcuoglu, D. Silver, et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.


Page 2: Research Article Efficient Actor-Critic Algorithm with ... · Research Article Efficient Actor-Critic Algorithm with Hierarchical Model Learning and Planning ShanZhong, 1,2 QuanLiu,

2 Computational Intelligence and Neuroscience

algorithm based on value iteration where the convergence isproved in the context of general nonlinear discrete systemsDierks et al [13] solved the optimal control of nonlineardiscrete-time systems by using two processes online systemidentification and offline optimal control training withoutthe requirement of partial knowledge about the systemdynamics Wang et al [14] focused on applying iterativeADP algorithm with error boundary to obtain the optimalcontrol law in which the NNs are adopted to approximatethe performance index function compute the optimal controlpolicy and model the nonlinear system

Extensions of ADP for continuous-time systems facethe challenges involved in proving stability and convergencemeanwhile ensuring the algorithm being online and model-free To approximate the value function and improve the pol-icy for continuous-time system Doya [15] derived a temporaldifference (TD) error-based algorithm in the frameworkof HJB Under a measure of input quadratic performanceMurray et al [16] developed a stepwise ADP algorithm inthe context of HJB Hanselmann et al [17] put forward acontinuous-time ADP formulation where Newtonrsquos methodis used in the second-order actor adaption to achieve theconvergence of the critic Recently Bhasin et al [18] built anactor-critic-identifier (ACI) an architecture that representsthe actor critic and model by taking NNs as nonlinearlyparameterized approximators while the parameters of NNsare updated by least-square method

All the aforementioned ADP variants utilized the NNas the function approximator however linear parameterizedapproximators are usually more preferred in RL becausethey make it easier to understand and analyze the theoreticalproperties of the resulting RL algorithms [19] Moreovermost of the above works did not learn a model online toaccelerate the convergence rate and improve the sample effi-ciency Actor-critic (AC) algorithmwas introduced in [20] forthe first time many variants which approximated the valuefunction and the policy by linear function approximationhave been widely used in continuous-time systems since then[21ndash23] By combining model learning and AC Grondmanet al [24] proposed an improved learning method calledModel Learning Actor-Critic (MLAC) which approximatesthe value function the policy and the process model by LLRIn MLAC the gradient of the next state with respect to thecurrent action is computed for updating the policy gradientwith the goal of improving the convergence rate of the wholealgorithm In their latter work [25] LFA takes the place ofLLR as the approximation method for value function thepolicy and the process model Enormous samples are stillrequired when only using such a process model to updatethe policy gradient Afterward Costa et al [26] derived anAC algorithm by introducing Dyna structure called Dyna-MLAC which approximated the value function the policyand the model by LLR as MLAC did The difference is thatDyna-MLAC applies the model not only in updating thepolicy gradient but also in planning [27] Though planningcan improve the sample efficiency to a large extent the modellearned by LLR is just a local model so that the globalinformation of samples is yet neglected

Though the above works learn a model during learningof the value function and the policy only the local infor-mation of the samples is utilized If the global informationof the samples can be utilized reasonably the convergenceperformance will be improved further Inspired by this ideawe establish two novel AC algorithms called AC-HMLP andRAC-HMLP (AC-HMLPwith ℓ

2-regularization) AC-HMLP

and RAC-HMLP consist of twomodels the global model andthe local model Both models incorporate the state transitionfunction and the reward function for planning The globalmodel is approximated by LFA while the local model isrepresented by LLR The local and the global models arelearned simultaneously at each time step The local modelis used for planning only if the error does not surpass thethreshold while the global planning process is started at theend of an episode so that the local and the global informationcan be kept and utilized uniformly

The main contributions of our work on AC-HMLP andRAC-HMLP are as follows

(1) Develop twonovel AC algorithms based on hierarchalmodels Distinguishing from the previous works AC-HMLP and RAC-HMLP learn a global model wherethe reward function and the state transition func-tion are approximated by LFA Meanwhile unlikethe existing model learning methods [28ndash30] whichrepresent a feature-based model we directly establisha state-based model to avoid the error brought byinaccurate features

(2) As MLAC and Dyna-MLAC did AC-HMLP andRAC-HMLP also learn a local model by LLR Thedifference is that we design a useful error threshold todecide whether to start the local planning process Ateach time step the real-next state is computed accord-ing to the system dynamics whereas the predicted-next state is obtained from LLR The error betweenthem is defined as the state-prediction error If thiserror does not surpass the error threshold the localplanning process is started

(3) The local model and the global model are used forplanning uniformly The local and the global modelsproduce local and global samples to update the samevalue function and the policy as a result the numberof the real samples will decrease dramatically

(4) Experimentally the convergence performance andthe sample efficiency are thoroughly analyzed Thesample efficiency which is defined as the numberof samples for convergence is analyzed RAC-HMLPandAC-HMLP are also compared with S-ACMLACand Dyna-MLAC in convergence performance andsample efficiency The results demonstrate that RAC-HMLP performs best and AC-HMLP performs sec-ondbest and both of themoutperform the other threemethods

This paper is organized as follows Section 2 reviewssome background knowledge concerning MDP and theAC algorithm Section 3 describes the hierarchical model

Computational Intelligence and Neuroscience 3

learning and planning Section 4 specifies our algorithmsmdashAC-HMLP and RAC-HMLP The empirical results of thecomparisons with the other three representative algorithmsare analyzed in Section 5 Section 6 concludes our work andthen presents the possible future work

2 Preliminaries

21 MDP RL can solve the problemmodeled by MDP MDPcan be represented as four-tuple (119883119880 120588 119891)

(1) 119883 is the state space 119909119905isin 119883 denotes the state of the

agent at time step 119905(2) 119880 represents the action space 119906

119905isin 119880 is the action

which the agent takes at the time step 119905(3) 120588 119883 times 119880 rarr R denotes the reward function At the

time step 119905 the agent locates at a state 119909119905and takes an

action 119906119905resulting in next state 119909

119905+1while receiving a

reward 119903119905= 120588(119909

119905 119906119905)

(4) 119891 119883 times 119880 rarr 119883 is defined as the transition function119891(119909119905 119906119905 119909119905+1) is the probability of reaching the next

state 119909119905+1

after executing 119906119905at the state 119909

119905

Policy ℎ 119883 rarr 119880 is the mapping from the state space119883 to the action space 119880 where the mathematical set of ℎdepends on specific domains The goal of the agent is tofind the optimal policy ℎlowast that can maximize the cumulativerewards The cumulative rewards are the sum or discountedsum of the received rewards and here we use the latter case

Under the policy ℎ the value function 119881ℎ 119883 rarr R

denotes the expected cumulative rewards which is shown as

119881ℎ(119909) = 119864

infin

sum

119896=0

120574119896119903119905+119896+1

| 119909 = 119909119905 (1)

where 120574 isin [0 1] represents the discount factor 119909119905is the

current stateThe optimal state-value function 119881lowast(119909) is computed as

119881lowast(119909) = max

119881 (119909) forall119909 isin 119883 (2)

Therefore the optimal policy ℎlowast at state 119909 can be obtainedby

ℎlowast(119909) = argmax

119881ℎ(119909) forall119909 isin 119883 (3)

22 AC Algorithm AC algorithm mainly contains two partsactor and critic which are stored separately Actor and criticare also called the policy and value function respectivelyTheactor-only methods approximate the policy and then updateits parameter along the direction of performance improvingwith the possible drawback being large variance resultingfrom policy estimationThe critic-only methods estimate thevalue function by approximating a solution to the Bellmanequation the optimal policy is found by maximizing thevalue functionOther than the actor-onlymethods the critic-only methods do not try to search the optimal policy inpolicy space They just estimate the critic for evaluating the

performance of the actor as a result the near-optimality ofthe resulting policy cannot be guaranteed By combiningthe merits of the actor and the critic AC algorithms wereproposed where the value function is approximated to updatethe policy

The value function and the policy are parameterized by119881(119909 120579) and ℎ(119909 120573) where 120579 and 120573 are the parameters of thevalue function and the policy respectively At each time step119905 the parameter 120579 is updated as

120579119905+1

= 120579119905+ 120572119888120575119905

120597119881 (119909 120579)

120597120579119909 = 119909

119905 120579 = 120579

119905 (4)

where

120575119905= 119903119905+ 120574119881 (119909

119905 120579119905) minus 119881 (119909

119905 120579119905) (5)

denoting the TD-error of the value function 120597119881(119909 120579)120597120579represents the feature of the value function The parameter120572119888isin [0 1] is the learning rate of the value functionEligibility is a trick to improve the convergence via

assigning the credits to the previously visited states At eachtime step 119905 the eligibility can be represented as

119890119905(119909) =

120597119881 (119909 120579)

120597120579119909119905= 119909

120582120574119890119905minus1

(119909) 119909119905

= 119909

(6)

where 120582 isin [0 1] denotes the trace-decay rateBy introducing the eligibility the update for 120579 in (4) can

be transformed as

120579119905+1

= 120579119905+ 120572119888120575119905119890119905(119909) (7)

The policy parameter 120573119905can be updated by

120573119905+1

= 120573119905+ 120572119886120575119905Δ119906119905

120597ℎ (119909 120573)

120597120573 (8)

where 120597ℎ(119909 120573)120597120573 is the feature of the policy Δ119906119905is a

random exploration term conforming to zero-mean normaldistribution 120572

119886isin [0 1] is the learning rate of the policy

S-AC (Standard AC algorithm) serves as a baseline tocompare with our method which is shown in Algorithm 1The value function and the policy are approximated linearlyin Algorithm 1 where TD is used as the learning algorithm

3 Hierarchical Model Learning and Planning

31 Why to Use Hierarchical Model Learning and PlanningThe model in RL refers to the state transition function andthe reward function When the model is established we canuse any model-based RL method to find the optimal policyfor example DP Model-based methods can significantlydecrease the number of the required samples and improve theconvergence performance Inspired by this idea we introducethe hierarchical model learning into AC algorithm so as tomake it becomemore sample-efficient Establishing a relativeaccurate model for the continuous state and action spaces isstill an open issue

4 Computational Intelligence and Neuroscience

Input 120574 120582 120572119886 120572119888 1205902

(1) Initialize 120579 120573(2) Loop(3) 119890 larr

9978881 120579 larr 997888

0 120573 larr9978880 119909 larr 119909

0 119905 larr 1

(4) Repeat all episodes(5) Choose Δ119906

119905according to119873(0 120590

2)

(6) 119906119905larr ℎ(119909 120573) + Δ119906

119905

(7) Execute 119906119905and observe 119903

119905+1and 119909

119905+1

(8) Update the eligibility of the value function 119890119905(119909) = 120597119881(119909 120579)120597120579 119909

119905= 119909 120582120574119890

119905minus1(119909) 119909

119905= 119909

(9) Compute the TD error 120575119905= 119903119905+1

+ 120574119881(119909119905+1 120579119905) minus 119881(119909

119905 120579119905)

(10) Update the parameter of the value function 120579119905+1

= 120579119905+ 120572119888120575119905119890119905(119909)

(11) Update the parameter of the policy 120573119905+1

= 120573119905+ 120572119886120575119905Δ119906119905(120597ℎ(119909 120573)120597120573)

(12) 119905 larr 119905 + 1

(13) End Repeat(14) End LoopOutput 120579 120573

Algorithm 1 S-AC

The preexisting works are mainly aimed at the problemswith continuous states but discrete actions They approx-imated the transition function in the form of probabilitymatrix which specifies the transition probability from thecurrent feature to the next feature The indirectly observedfeatures result in the inaccurate feature-based model Theconvergence rate will be slowed significantly by using such aninaccurate model for planning especially at each time step inthe initial phase

To solve these problems we will approximate a state-based model instead of the inaccurate feature-based modelMoreover we will introduce an additional global model forplanning The global model is applied only at the end of eachepisode so that the global information can be utilized asmuchas possible Using such a global model without others willlead to the loss of valuable local informationThus likeDyna-MLAC we also approximate a local model by LLR and useit for planning at each time step The difference is that auseful error threshold is designed for the local planning inour method If the state-prediction error between the real-next state and the predicted one does not surpass the errorthreshold the local planning process will be started at thecurrent time step Therefore the convergence rate and thesample efficiency can be improved dramatically by combiningthe local and global model learning and planning

32 Learning and Planning of the Global Model The globalmodel establishes separate equations for the reward functionand the state transition function of every state component bylinear function approximation Assume the agent is at state119909119905= 1199091199051 119909

119905119870 where119870 is the dimensionality of the state

119909119905 the action 119906

119905is selected according to the policy ℎ then

all the components 1199091015840119905+11

1199091015840

119905+12 119909

1015840

119905+1119870 of the next state

1199091015840

119905+1can be predicted as

1199091015840

119905+11= 1205781199051

T120601 (1199091199051 1199091199052 119909

119905119870 119906119905)

1199091015840

119905+12= 1205781199052

T120601 (1199091199051 1199091199052 119909

119905119870 119906119905)

1199091015840

119905+1119870= 120578119905119870

T120601 (1199091199051 1199091199052 119909

119905119870 119906119905)

(9)

where 120578119905119894= (1205781199051198941 1205781199051198942 120578

119905119894119863)T 1 le 119894 le 119870 is the param-

eter of the state transition function corresponding to the 119894thcomponent of the current state at time step 119905 with 119863 beingthe dimensionality of the feature 120601(119909

1199051 1199091199052 119909

119905119870 119906119905)

Likewise the reward 1199031015840119905+1

can be predicted as

1199031015840

119905+1= 120589119905

T120601 (1199091199051 1199091199052 119909

119905119870 119906119905) (10)

where 120589119905= (1205891199051 1205891199052 120589

119905119863) is the parameter of the reward

function at time step 119905After 120578

1199051 1205781199052 120578

119905119870and 120589 are updated the model can

be applied to generate samples Let the current state be119909119905= (1199091199051 1199091199052 119909

119905119896) after the action 119906

119905is executed the

parameters 120578119905+1119894

(1 le 119894 le 119870) can be estimated by the gradientdescent method shown as

120578119905+11

= 1205781199051

+ 120572119898(119909119905+11

minus 1199091015840

119905+11) 120601 (119909

1199051 1199091199052 119909

119905119870 119906119905)

120578119905+12

= 1205781199052

+ 120572119898(119909119905+12

minus 1199091015840

119905+12) 120601 (119909

1199051 1199091199052 119909

119905119870 119906119905)

120578119905+1119870

= 120578119905119870

+ 120572119898(119909119905+1119870

minus 1199091015840

119905+1119870) 120601 (119909

1199051 1199091199052 119909

119905119870 119906119905)

(11)

where 120572119898

isin [0 1] is the learning rate of the model119909119905+1

is the real-next state obtained according to the systemdynamics where 119909

119905+1119894(1 le 119894 le 119870) is its 119894th component

(1199091015840

119905+11 1199091015840

119905+12 119909

1015840

119905+1119870) is the predicted-next state according

to (9)The parameter 120589

119905+1is estimated as

120589119905+1

= 120589119905+ 120572119898(119903119905+1

minus 1199031015840

119905+1) 120601 (119909

1199051 1199091199052 119909

119905119870 119906119905) (12)

where 119903119905+1

is the real reward reflected by the system dynamicswhile 1199031015840

119905+1is the predicted reward obtained according to (10)

Computational Intelligence and Neuroscience 5

33 Learning and Planning of the Local Model Though thelocal model also approximates the state transition functionand the reward function as the global model does LLR isserved as the function approximator instead of LFA In thelocal model a memory storing the samples in the form of(119909119905 119906119905 119909119905+1 119903119905+1) is maintained At each time step a new

sample is generated from the interaction and it will take theplace of the oldest one in the memory Not all but only L-nearest samples in thememorywill be selected for computingthe parameter matrix Γ isin R(119870+1)times(119870+2) of the local modelBefore achieving this the input matrix119883 isin R(119870+2)times119871 and theoutput matrix 119884 isin R(119870+1)times119871 should be prepared as follows

119883 =

[[[[[[[[[[[[

[

11990911

11990921

sdot sdot sdot 1199091198711

11990912

11990922

sdot sdot sdot 1199091198712

1199091119870

1199092119870

sdot sdot sdot 119909119871119870

1199061

1199062

sdot sdot sdot 119906119871

1 1 sdot sdot sdot 1

]]]]]]]]]]]]

]

119884 =

[[[[[[[[[

[

11990921

11990931

sdot sdot sdot 119909119871+11

11990922

11990932

sdot sdot sdot 119909119871+12

1199092119870

1199093119870

sdot sdot sdot 119909119871+1119870

1199032

1199033

sdot sdot sdot 119903119871+1

]]]]]]]]]

]

(13)

The last row of 119883 consisting of ones is to add a biason the output Every column in the former 119870 + 1 lines of119883 corresponds to a state-action pair for example the 119894thcolumn is the state-action (119909

119894 119906119894)T 1 le 119894 le 119871 119884 is composed

of 119871 next states and rewards corresponding to119883Γ can be obtained via solving 119884 = Γ119883 as

Γ = 119884119883T(119883119883

T)minus1

(14)

Let the current input vector be [1199091199051 119909

119905119870 119906119905 1]

T theoutput vector [1199091015840

119905+11 119909

1015840

119905+1119870 1199031015840

119905+1]T can be predicted by

[1199091015840

119905+11 119909

1015840

119905+1119896 1199031015840

119905+1]T= Γ [119909

1199051 119909

119905119896 119906119905 1]

T (15)

where [1199091199051 119909

119905119870] and [119909

1015840

119905+11 119909

1015840

119905+1119870] are the current

state and the predicted-next state respectively 1199031015840119905+1

is thepredicted reward

Γ is estimated according to (14) at each time step there-after the predicted-next state and the predicted reward canbe obtained by (15) Moreover we design an error thresholdto decide whether local planning is required We computethe state-prediction error between the real-next state and thepredicted-next state at each time step If this error does not

surpass the error threshold the local planning process willbe launched The state-prediction error is formulated as

119864119903119905= max

1003816100381610038161003816100381610038161003816100381610038161003816

1199091015840

119905+11minus 119909119905+11

119909119905+11

1003816100381610038161003816100381610038161003816100381610038161003816

1003816100381610038161003816100381610038161003816100381610038161003816

1199091015840

119905+12minus 119909119905+12

119909119905+12

1003816100381610038161003816100381610038161003816100381610038161003816

1003816100381610038161003816100381610038161003816100381610038161003816

1199091015840

119905+1119870minus 119909119905+1119870

119909119905+1119870

1003816100381610038161003816100381610038161003816100381610038161003816

100381610038161003816100381610038161003816100381610038161003816

1199031015840

119905+1minus 119903119905+1

119903119905+1

100381610038161003816100381610038161003816100381610038161003816

(16)

Let the error threshold be 120585 then the local model willbe used for planning only if 119864119903

119905le 120585 at time step 119905 In

the local planning process a sequence of locally simulatedsamples in the form of (119909

1199051 119909

119905119870 119906119905 1199091015840

119905+11 119909

1015840

119905+1119870 1199031015840)

are generated to improve the convergence of the same valuefunction and the policy as the global planning process does

4 Algorithm Specification

41 AC-HMLP AC-HMLP algorithm consists of a mainalgorithm and two subalgorithms The main algorithm isthe learning algorithm (see Algorithm 2) whereas the twosubalgorithms are local model planning procedure (seeAlgorithm 3) and global model planning procedure (seeAlgorithm 4) respectively At each time step the main algo-rithm learns the value function (see line (26) in Algorithm 2)the policy (see line (27) in Algorithm 2) the local model (seeline (19) in Algorithm 2) and the global model (see lines(10)sim(11) in Algorithm 2)

There are several parameters which are required to bedetermined in the three algorithms 119903 and 120582 are discountfactor and trace-decay rate respectively 120572

119886120572119888 and120572

119898are the

corresponding learning rates of the value function the policyand the global model119872 size is the capacity of the memory120585 denotes the error threshold 119871 determines the number ofselected samples which is used to fit the LLR 1205902 is thevariance that determines the region of the exploration 119875

119897and

119875119892are the planning times for local model and global model

Some of these parameters have empirical values for example119903 and 120582 The others have to be determined by observing theempirical results

Notice that Algorithm 3 starts planning at the state 119909119905

which is passed from the current state in Algorithm 2 whileAlgorithm 4 uses 119909

0as the initial state The reason for using

different initializations is that the local model is learnedaccording to the L-nearest samples of the current state 119909

119905

whereas the global model is learned from all the samplesThus it is reasonable and natural to start the local and globalplanning process at the states 119909

119905and 119909

0 respectively

42 AC-HMLPwith ℓ2-Regularization Regression approach-

es inmachine learning are generally represented asminimiza-tion of a square loss term and a regularization term The ℓ

2-

regularization also called ridge regress is a widely used regu-larization method in statistics and machine learning whichcan effectively prohibit overfitting of learning Therefore weintroduce ℓ

2-regularization to AC-HMLP in the learning of

the value function the policy and the model We term thisnew algorithm as RAC-HMLP

6 Computational Intelligence and Neuroscience

Input 120574 120582 120572119886 120572119888 120572119898 120585119872 size 119871 1205902 119870 119875

119897 119875119892

(1) Initialize 120579 larr 9978880 120573 larr

9978880 Γ larr 997888

0 1205781 1205782 120578

119870larr

9978880 120589 larr 997888

0

(2) Loop(3) 119890

1larr

9978881 11990911 119909

1119896larr 11990901 119909

0119896 119905 larr 1 number larr 0

(4) Repeat all episodes(5) Choose Δ119906

119905according to119873(0 120590

2)

(6) Execute the action 119906119905= ℎ(119909 120573) + Δ119906

119905

(7) Observe the reward 119903119905+1

and the next state 119909119905+1

= 119909119905+11

119909119905+1119896

Update the global model(8) Predict the next state

1199091015840

119905+11= 1205781199051

T120601 (1199091199051 1199091199052 119909

119905119870 119906119905)

1199091015840

119905+12= 1205781199052

T120601 (1199091199051 1199091199052 119909

119905119870 119906119905)

1199091015840

119905+1119870= 120578119905119870

T120601 (1199091199051 1199091199052 119909

119905119870 119906119905)

(9) Predict the reward 1199031015840119905+1

= 120589119905

T120601(1199091199051 1199091199052 119909

119905119870 119906119905)

(10) Update the parameters 1205781 1205782 120578

119870

120578119905+11

= 1205781199051+ 120572119898(119909119905+11

minus 1199091015840

119905+11) 120601 (119909

1199051 1199091199052 119909

119905119870 119906119905)

120578119905+12

= 1205781199052+ 120572119898(119909119905+12

minus 1199091015840

119905+12) 120601 (119909

1199051 1199091199052 119909

119905119870 119906119905)

120578119905+1119870

= 120578119905119870

+ 120572119898(119909119905+1119870

minus 1199091015840

119905+1119870) 120601 (119909

1199051 1199091199052 119909

119905119870 119906119905)

(11) Update the parameter 120589 120589119905+1

= 120589119905+ 120572119898(119903119905+1

minus 1199031015840

119905+1)120601(1199091199051 1199091199052 119909119905119870 119906119905)

Update the local model(12) If 119905 le 119872 size(13) Insert the real sample (119909

119905 119906119905 119909119905+1 119903119905+1) into the memory119872

(14) Else If(15) Replace the oldest one in119872 with the real sample (119909

119905 119906119905 119909119905+1 119903119905+1)

(16) End if(17) Select L-nearest neighbors of the current state from119872 to construct119883 and 119884(18) Predict the next state and the reward [1199091015840

119905+1 1199031015840

119905+1]T= Γ[119909

119905 119906119905 1]

T

(19) Update the parameter Γ Γ = 119884119883T(119883119883

T)minus1

(20) Compute the local error119864119903119905= max|(1199091015840

119905+11minus 119909119905+11

)119909119905+11

| |(1199091015840

119905+12minus 119909119905+12

)119909119905+12

| |(1199091015840

119905+1119870minus 119909119905+1119870

)119909119905+1119870

| |(1199031015840

119905+1minus 119903119905+1)119903119905+1|

(21) If 119864119903119905le 120585

(22) Call Local-model planning (120579 120573 Γ 119890 number 119875119897) (Algorithm 3)

(23) End IfUpdate the value function

(24) Update the eligibility 119890119905(119909) = 120601(119909

119905) 119909119905= 119909 120582120574119890

119905minus1(119909) 119909

119905= 119909

(25) Estimate the TD error 120575119905= 119903119905+1

+ 120574119881(119909119905+1 120579119905) minus 119881(119909

119905 120579119905)

(26) Update the value-function parameter 120579119905+1

= 120579119905+ 120572119888120575119905119890119905(119909)

Update the policy(27) Update the policy parameter 120573

119905+1= 120573119905+ 120572119886120575119905Δ119906119905120601(119909119905)

(28) 119905 = 119905 + 1

(29) Update the number of samples number = number + 1(30) Until the ending condition is satisfied(31) Call Global-model planning (120579 120573 120578

1 120578

119896 120589 119890 number 119875

119892) (Algorithm 4)

(32) End LoopOutput 120573 120579

Algorithm 2 AC-HMLP algorithm

Computational Intelligence and Neuroscience 7

(1) Loop for 119875119897time steps

(2) 1199091larr 119909119905 119905 larr 1

(3) Choose Δ119906119905according to119873(0 120590

2)

(4) 119906119905= ℎ(119909

119905 120573119905) + Δ119906

119905

(5) Predict the next state and the reward [119909119905+11

119909119905+1119870

119903119905+1] = Γ119905[1199091199051 119909

119905119870 119906119905 1]

T

(6) Update the eligibility 119890119905(119909) = 120601(119909

119905) 119909119905= 119909 120582120574119890

119905minus1(119909) 119909

119905= 119909

(7) Compute the TD error 120575119905= 119903119905+1

+ 120574119881(119909119905+1 120579119905) minus 119881(119909

119905 120579119905)

(8) Update the value function parameter 120579119905+1

= 120579119905+ 120572119888120575119905119890119905(119909)

(9) Update the policy parameter 120573119905+1

= 120573119905+ 120572119886120575TDΔ119906119905120601(119909119905)

(10) If 119905 le 119878

(11) 119905 = 119905 + 1

(12) End If(13) Update the number of samples number = number + 1(14) End LoopOutput 120573 120579

Algorithm 3 Local model planning (120579 120573 Γ 119890 number 119875119897)

(1) Loop for 119875119892times

(2) 119909larr 1199090 larr 1

(3) Repeat all episodes(4) Choose Δ119906

according to119873(0 120590)

(5) Compute exploration term 119906= ℎ(119909

120573) + Δ119906

(6) Predict the next state1199091015840

+11= 1205781

119879120601 (1199091 1199092 119909

119870 119906)

1199091015840

+12= 1205782

119879120601 (1199091 1199092 119909

119870 119906)

1199091015840

+1119870= 120578119870

119879120601 (1199091 1199092 119909

119870 119906)

(7) Predict the reward 1199031015840+1

= 120589119905

T(1199091 1199092 119909

119870 119906119905)

(8) Update the eligibility 119890(119909) = 120601(119909

) 119909= 119909 120582120574119890

minus1(119909) 119909

= 119909

(9) Compute the TD error 120575= 119903119905+1

+ 120574119881(1199091015840

+1 120579) minus 119881(119909

120579)

(10) Update the value function parameter 120579+1

= 120579+ 120572119888120575119890(119909)

(11) Update the policy parameter 120573+1

= 120573+ 120572119886120575Δ119906120601(119909)

(12) If le 119879

(13) = + 1

(14) End If(15) Update the number of samples number = number + 1(16) End Repeat(17) End LoopOutput 120573 120579

Algorithm 4 Global model planning (120579 120573 1205781 120578

119896 120589 119890 number 119875

119892)

The goal of learning the value function is to minimize thesquare of the TD-error which is shown as

min120579

numbersum

119905=0

10038171003817100381710038171003817119903119905+1

+ 120574119881 (1199091015840

119905+1 120579119905) minus 119881 (119909

119905 120579119905)10038171003817100381710038171003817

+ ℓ119888

10038171003817100381710038171205791199051003817100381710038171003817

2

(17)

where number represents the number of the samples ℓ119888ge 0

is the regularization parameter of the critic 1205791199052 is the ℓ

2-

regularization which penalizes the growth of the parametervector 120579

119905 so that the overfitting to noise samples can be

avoided

The update for the parameter 120579 of the value function inRAC-HMLP is represented as

120579119905+1

= 120579119905(1 minus

120572119888

numberℓ119888) +

120572119888

number

numbersum

119904=0

119890119904(119909) 120575119904 (18)

The update for the parameter 120573 of the policy is shown as

120573119905+1

= 120573119905(1 minus

120572119886

numberℓ119886)

+120572119886

number

numbersum

119904=0

120601 (119909119904) 120575119904Δ119906119904

(19)

where ℓ119886ge 0 is the regularization parameter of the actor

8 Computational Intelligence and Neuroscience

F

w

Figure 1 Pole balancing problem

The update for the global model can be denoted as

120578119905+11

= 1205781199051(1 minus

120572119898

numberℓ119898) +

120572119898

number

sdot

numbersum

119904=1

120601 (1199091199041 119909

119904119870 119906119904) (119909119904+11

minus 1199091015840

119904+11)

120578119905+12

= 1205781199052(1 minus

120572119898

numberℓ119898) +

120572119898

number

sdot

numbersum

119904=1

120601 (1199091199041 119909

119904119870 119906119904) (119909119904+12

minus 1199091015840

119904+12)

120578119905+1119870

= 120578119905119870

(1 minus120572119898

numberℓ119898) +

120572119898

number

sdot

numbersum

119904=1

120601 (1199091199041 119909

119904119870 119906119904) (119909119904+1119870

minus 1199091015840

119904+11)

(20)

120589119905+1

= 120589119905(1 minus

120572119898

numberℓ119898) +

120572119898

number

sdot

numbersum

119904=1

120601 (1199091199041 119909

119904119870 119906119904) (119903119904+1

minus 1199031015840

119904+1)

(21)

where ℓ119898ge 0 is the regularization parameter for the model

namely the state transition function and the reward functionAfter we replace the update equations of the parameters

in Algorithms 2 3 and 4 with (18) (19) (20) and (21) we willget the resultant algorithm RAC-HMLP Except for the aboveupdate equations the other parts of RAC-HMLP are the samewith AC-HMLP so we will not specify here

5 Empirical Results and Analysis

AC-HMLP and RAC-HMLP are compared with S-ACMLAC andDyna-MLAC on two continuous state and actionspaces problems pole balancing problem [31] and continuousmaze problem [32]

51 Pole Balancing Problem Pole balancing problem is a low-dimension but challenging benchmark problem widely usedin RL literature shown in Figure 1

There is a car moving along the track with a hinged poleon its topThegoal is to find a policywhich can guide the force

to keep the pole balanceThe system dynamics is modeled bythe following equation

=119892 sin (119908) minus V119898119897 sin (2119908) 2 minus V cos (119908) 119865

41198973 minus V119898119897cos2 (119908) (22)

where 119908 is the angle of the pole with the vertical line and are the angular velocity and the angular acceleration of thepole 119865 is the force exerted on the cart The negative valuemeans the force to the right and otherwise means to the left119892 is the gravity constant with the value 119892 = 981ms2119898 and119897 are the length and the mass of the pole which are set to119898 =

20 kg and 119897 = 05m respectively V is a constantwith the value1(119898 + 119898

119888) where119898

119888= 80 kg is the mass of the car

The state 119909 = (119908 ) is composed of 119908 and which arebounded by [minus120587 120587] and [minus2 2] respectivelyThe action is theforce 119865 bounded by [minus50N 50N] The maximal episode is300 and the maximal time step for each episode is 3000 Ifthe angle of the pole with the vertical line is not exceeding1205874at each time step the agent will receive a reward 1 otherwisethe pole will fall down and receive a reward minus1 There is alsoa random noise applying on the force bounded by [minus10N10N] An episode ends when the pole has kept balance for3000 time steps or the pole falls downThe pole will fall downif |119908| ge 1205874 To approximate the value function and thereward function the state-based feature is coded by radialbasis functions (RBFs) shown as

120601119894(119909) = 119890

minus(12)(119909minus119888119894)T119861minus1(119909minus119888119894) (23)

where 119888119894denotes the 119894th center point locating over the grid

points minus06 minus04 minus02 0 02 04 06 times minus2 0 2 119861 is adiagonal matrix containing the widths of the RBFs with 1205902

119908=

02 and 1205902= 2The dimensionality 119911 of 120601(119909) is 21 Notice that

the approximations of the value function and the policy onlyrequire the state-based feature However the approximationof the model requires the state-action-based feature Let 119909 be(119908 119906) we can also compute its feature by (23) In this case119888119894locates over the points minus06 minus04 minus02 0 02 04 06 times

minus2 0 2 times minus50 minus25 0 25 50 The diagonal elements of 119861are 1205902119908= 02 1205902

= 2 and 120590

2

119865= 10 The dimensionality 119911

of 120601(119908 119906) is 105 Either the state-based or the state-action-based feature has to be normalized so as to make its value besmaller than 1 shown as

120601119894(119909) =

120601119894(119909)

sum119911

119894=1120601119894(119909)

(24)

where 119911 gt 0 is the dimensionality of the featureRAC-HMLP and AC-HMLP are compared with S-AC

MLAC and Dyna-MLAC on this experiment The parame-ters of S-AC MLAC and Dyna-MLAC are set according tothe values mentioned in their papers The parameters of AC-HMLP and RAC-HMLP are set as shown in Table 1

The planning times may have significant effect on theconvergence performance Therefore we have to determinethe local planning times 119875

119897and the global planning times 119875

119892

at first The selective settings for the two parameters are (30300) (50 300) (30 600) and (50 600) From Figure 2(a) it

Computational Intelligence and Neuroscience 9

Table 1 Parameters settings of RAC-HMLP and AC-HMLP

Parameter Symbol ValueTime step 119879

11990401

Discount factor 120574 09Trace-decay rate 120582 09Exploration variance 120590

2 1Learning rate of the actor 120572

11988605

Learning rate of the critic 120572119888

04Learning rate of the model 120572

11989805

Error threshold 120585 015Capacity of the memory 119872 size 100Number of the nearest samples 119871 9Local planning times 119875

11989730

Global planning times 119875119892 300

Number of components of the state K 2Regularization parameter of the model ℓ

11989802

Regularization parameter of the critic ℓ119888

001Regularization parameter of the actor ℓ

1198860001

is easy to find out that (119875119897= 30119875

119892= 300) behaves best in these

four settings with 41 episodes for convergence (119875119897= 30 119875

119892=

600) learns fastest in the early 29 episodes but it convergesuntil the 52nd episode (119875

119897= 30 119875

119892= 600) and (119875

119897= 30 119875

119892=

300) have identical local planning times but different globalplanning times Generally the more the planning times thefaster the convergence rate but (119875

119897= 30 119875

119892= 300) performs

better instead The global model is not accurate enough atthe initial time Planning through such an inaccurate globalmodel will lead to an unstable performance Notice that (119875

119897=

50 119875119892= 300) and (119875

119897= 50 119875

119892= 600) behave poorer than (119875

119897

= 30 119875119892= 600) and (119875

119897= 30 119875

119892= 300) which demonstrates

that planning too much via the local model will not performbetter (119875

119897= 50119875

119892= 300) seems to converge at the 51st episode

but its learning curve fluctuates heavily after 365 episodes notconverging any more until the end Like (119875

119897= 50 119875

119892= 300)

(119875119897= 50 119875

119892= 600) also has heavy fluctuation but converges

at the 373rd episode Evidently (119875119897= 50 119875

119892= 300) performs

slightly better than (119875119897= 50 119875

119892= 300) Planning through the

global model might solve the nonstability problem caused byplanning of the local model

The convergence performances of the fivemethods RAC-HMLP AC-HMLP S-AC MLAC and Dyna-MLAC areshown in Figure 2(b) It is evident that our methods RAC-HMLP and AC-HMLP have the best convergence perfor-mances RAC-HMLP and AC-HMLP converge at the 39thand 41st episodes Both of them learn quickly in the primaryphase but the learning curve of RAC-HMLP seems to besteeper than that of AC-HMLP Dyna-MLAC learns fasterthan MLAC and S-AC in the former 74 episodes but itconverges until the 99th episode Though MLAC behavespoorer than Dyna-MLAC it requires just 82 episodes toconverge The method with the slowest learning rate is S-AC where the pole can keep balance for 3000 time stepsfor the first time at the 251st episode Unfortunately it con-verges until the 333rd episode RAC-HMLP converges fastest

[Figure 2: Comparisons of different planning times and different algorithms (number of balanced time steps versus training episodes). (a) Determination of different planning times, for (P_l, P_g) = (30, 300), (50, 300), (30, 600), and (50, 600). (b) Comparisons of different algorithms in convergence performance: S-AC, MLAC, Dyna-MLAC, AC-HMLP, and RAC-HMLP.]

RAC-HMLP converges fastest, which might be attributed to the introduction of ℓ2-regularization: because the ℓ2-regularization does not allow the parameters to grow rapidly, overfitting during learning can be avoided effectively. Dyna-MLAC converges faster than MLAC in the early 74 episodes, but its performance is not stable enough, as reflected by the heavy fluctuation from the 75th to the 99th episode. If the approximate local model differs greatly from the real local model, planning through such an inaccurate local model can lead to unstable performance. S-AC, the only method without model learning, behaves worst among the five. These results show that the convergence performance can be improved greatly by introducing model learning.

The comparisons of the five algorithms in sample efficiency are shown in Figure 3.

[Figure 3: Comparisons of sample efficiency: number of samples to converge (×10^4) versus training episodes for S-AC, MLAC, Dyna-MLAC, AC-HMLP, and RAC-HMLP.]

The numbers of samples required for S-AC, MLAC, Dyna-MLAC, AC-HMLP, and RAC-HMLP to converge are 56003, 35535, 32043, 14426, and 12830, respectively. Evidently, RAC-HMLP converges fastest and behaves stably all the time, thus requiring the fewest samples. Though AC-HMLP requires more samples to converge than RAC-HMLP, it still requires far fewer samples than the other three methods. Because S-AC does not utilize a model, it requires the most samples among the five. Unlike MLAC, Dyna-MLAC applies the model not only to updating the policy but also to planning, resulting in a lower sample requirement.

The optimal policy and optimal value function learned by AC-HMLP after training are shown in Figures 4(a) and 4(b), respectively, while those learned by RAC-HMLP are shown in Figures 4(c) and 4(d). Evidently, the optimal policy and the optimal value function learned by AC-HMLP and RAC-HMLP are quite similar, but RAC-HMLP seems to yield a more fine-grained optimal policy and value function.

As for the optimal policy (see Figures 4(a) and 4(c)), the force becomes smaller and smaller from the two sides (left and right) toward the middle. The top-right is the region requiring a force close to 50 N, where the direction of the angle is the same as that of the angular velocity; the values of the angle and the angular velocity are near their maximum values π/4 and 2, so the largest force to the left is required to guarantee that the angle between the upright line and the pole does not exceed π/4. The situation is the opposite in the bottom-left region, where a force in [−50 N, −40 N] is required to keep the angle from surpassing −π/4. The pole can keep balance with a gentle force close to 0 in the middle region. The direction of the angle differs from that of the angular velocity in the top-left and bottom-right regions; thus a force whose absolute value is relatively large but smaller than 50 N is required.

In terms of the optimal value function (see Figures 4(b) and 4(d)), the value function reaches its maximum in the region satisfying −2 ≤ w ≤ 2. The pole is more prone to keep balance in this region even without applying any force, resulting in relatively larger values. From the middle toward the two sides the pole becomes more and more difficult to balance, with the value function decaying gradually. The finer-grained values on the left side compared with the right side might be caused by more frequent visits to the left side; more visits lead to a more accurate estimation of the value function.

After training ends, the prediction of the next state and the reward for every state-action pair can be obtained through the learned model. The predictions of the next angle w, the next angular velocity, and the reward for any possible state are shown in Figures 5(a), 5(b), and 5(c). It is noticeable that the predicted next state is always near the current state in Figures 5(a) and 5(b). The received reward is always larger than 0 in Figure 5(c), which illustrates that the pole can always keep balance under the optimal policy.

5.2. Continuous Maze Problem. The continuous maze problem is shown in Figure 6, where the blue lines with coordinates represent the barriers. The state (x, y) consists of the horizontal coordinate x ∈ [0, 1] and the vertical coordinate y ∈ [0, 1]. Starting from the position "Start", the goal of the agent is to reach the position "Goal", which satisfies x + y > 1.8. The action is to choose an angle bounded by [−π, π] as the new direction and then walk along this direction with a step of 0.1. The agent receives a reward of −1 at each time step and is punished with a reward of −400 multiplied by the distance if the distance between its position and the barrier exceeds 0.1. The parameters are set the same as in the former problem except for the planning times; the local and global planning times are determined as 10 and 50 in the same way as in the former experiment. The state-based feature is computed according to (23) and (24) with dimensionality z = 16. The center points c_i locate over the grid points {0.2, 0.4, 0.6, 0.8} × {0.2, 0.4, 0.6, 0.8}, and the diagonal elements of B are σ²_x = 0.2 and σ²_y = 0.2. The state-action-based feature is also computed according to (23) and (24); here the center points c_i locate over the grid points {0.2, 0.4, 0.6, 0.8} × {0.2, 0.4, 0.6, 0.8} × {−3, −1.5, 0, 1.5, 3} with dimensionality z = 80, and the diagonal elements of B are σ²_x = 0.2, σ²_y = 0.2, and σ²_u = 1.5.
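To make the feature construction concrete, the following is a minimal sketch (not the authors' code) of the normalized state-based RBF feature for the maze, assuming the Gaussian RBF of (23) with diagonal width matrix B and the normalization of (24); the function and variable names are illustrative, and the state-action feature (z = 80) would be built analogously with the extra action grid.

import numpy as np

# 16 RBF centers over the grid {0.2, 0.4, 0.6, 0.8} x {0.2, 0.4, 0.6, 0.8}.
CENTERS = np.array([(cx, cy) for cx in (0.2, 0.4, 0.6, 0.8)
                             for cy in (0.2, 0.4, 0.6, 0.8)])
B_DIAG = np.array([0.2, 0.2])   # diagonal of B: sigma_x^2, sigma_y^2

def state_feature(state):
    """Return the 16-dimensional normalized RBF feature of state (x, y)."""
    diff = CENTERS - np.asarray(state)                       # (16, 2)
    phi = np.exp(-0.5 * np.sum(diff ** 2 / B_DIAG, axis=1))  # Gaussian RBF of (23)
    return phi / phi.sum()                                   # normalization of (24)

print(state_feature((0.1, 0.9)).round(3))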

RAC-HMLP and AC-HMLP are compared with S-AC, MLAC, and Dyna-MLAC in terms of cumulative rewards; the results are shown in Figure 7(a). RAC-HMLP learns fastest in the early 15 episodes, MLAC behaves second best, and AC-HMLP performs worst. RAC-HMLP tends to converge at the 24th episode, but it does not truly converge until the 39th episode. AC-HMLP behaves steadily from the 23rd episode to the end and converges at the 43rd episode. As in the former experiment, RAC-HMLP performs slightly better than AC-HMLP, as reflected by cumulative rewards of −44 compared with −46. At the 53rd episode, the cumulative rewards of S-AC fall quickly to about −545 without ascending again thereafter. Though MLAC and Dyna-MLAC perform well in the primary phase, their curves start to descend at the 88th and 93rd episodes, respectively. MLAC and Dyna-MLAC learn the local model to update the policy gradient, resulting in a fast learning rate in the primary phase.

[Figure 4: Optimal policy and value function learned by AC-HMLP and RAC-HMLP (angle (rad) versus angular velocity (rad/s)). (a) Optimal policy of AC-HMLP learned after training. (b) Optimal value function of AC-HMLP learned after training. (c) Optimal policy of RAC-HMLP learned after training. (d) Optimal value function of RAC-HMLP learned after training.]

However, they do not behave stably enough near the end of training; too many visits to the barrier region might cause the fast descent of the cumulative rewards.

The comparisons of the time steps for reaching the goal are shown in Figure 7(b). Obviously, RAC-HMLP and AC-HMLP perform better than the other three methods. RAC-HMLP converges at the 39th episode and AC-HMLP at the 43rd episode, with the time steps for reaching the goal being 45 and 47, respectively; RAC-HMLP still performs slightly better than AC-HMLP. The time steps for reaching the goal are 548, 69, and 201 for S-AC, MLAC, and Dyna-MLAC, respectively. Among the five methods, RAC-HMLP not only converges fastest but also finds the best solution, 45. The poor performance of S-AC illustrates that model learning can definitely improve the convergence performance. As in the former experiment, Dyna-MLAC behaves worse than MLAC during training: if the model is inaccurate, planning via such a model can distort the estimates of the value function and the policy, leading to the poor performance of Dyna-MLAC.

The comparisons of sample efficiency are shown in Figure 8. The numbers of samples required for S-AC, MLAC, Dyna-MLAC, AC-HMLP, and RAC-HMLP to converge are 10595, 6588, 7694, 4388, and 4062, respectively. As in the pole balancing problem, RAC-HMLP requires the fewest samples while S-AC needs the most to converge. The difference is that Dyna-MLAC requires slightly more samples than MLAC. The ascending curve of Dyna-MLAC at the end of training demonstrates that it has not converged. The frequent visits to the barrier in Dyna-MLAC lead to a quick descent of the value function; as a result, a large number of additional samples are required to bring these value estimates closer to the true ones.

After training ends, the approximate optimal policy and value function are obtained, as shown in Figures 9(a) and 9(b). It is noticeable that the lower part of the figure is explored thoroughly, so that the policy and the value function are very distinctive in this region. For example, in most of the region in Figure 9(a) a larger angle is needed for the agent to leave the current state. Clearly, the nearer the current state is to the lower part of Figure 9(b), the smaller the corresponding value. The top part of the figures may be visited infrequently or not at all by the agent, resulting in similar values; the agent is able to reach the goal with only a gentle angle in these areas.

6. Conclusion and Discussion

This paper proposes two novel actor-critic algorithms, AC-HMLP and RAC-HMLP. Both of them take LLR and LFA to represent the local model and the global model, respectively.

[Figure 5: Prediction of the next state and reward according to the global model (angle (rad) versus angular velocity (rad/s)). (a) Prediction of the angle at the next state. (b) Prediction of the angular velocity at the next state. (c) Prediction of the reward.]

[Figure 6: Continuous maze problem. Coordinates marked in the original figure: (0.3, 0.7), (0.55, 0.7), (0.55, 0.32), and (0.55, 0.85); the "Start" and "Goal" positions are indicated within the unit square.]

It has been shown that our new methods are able to learn the optimal value function and the optimal policy. On the pole balancing and continuous maze problems, RAC-HMLP and AC-HMLP are compared with three representative methods. The results show that RAC-HMLP and AC-HMLP not only converge fastest but also have the best sample efficiency.

RAC-HMLP performs slightly better than AC-HMLP in convergence rate and sample efficiency. By introducing ℓ2-regularization, the parameters learned by RAC-HMLP become smaller and more uniform than those of AC-HMLP, so that overfitting can be avoided effectively. Though RAC-HMLP behaves better, its improvement over AC-HMLP is not significant.

[Figure 7: Comparisons of cumulative rewards and time steps for reaching the goal, versus training episodes, for S-AC, MLAC, Dyna-MLAC, AC-HMLP, and RAC-HMLP. (a) Comparisons of cumulative rewards. (b) Comparisons of time steps for reaching the goal.]

[Figure 8: Comparisons of different algorithms in sample efficiency: number of samples to converge versus training episodes for S-AC, MLAC, Dyna-MLAC, AC-HMLP, and RAC-HMLP.]

Because AC-HMLP normalizes all the features to [0, 1], its parameters do not change heavily either, so overfitting is also prevented to a certain extent.
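For concreteness, the following is a schematic sketch, under our own naming and not the authors' implementation, of the ℓ2-regularized critic update used in RAC-HMLP (Section 4.2): the decay factor (1 − α_c ℓ_c / N) shrinks the parameter vector at every update, which is what keeps the learned weights small and uniform.

import numpy as np

# Schematic l2-regularized critic update (shapes and names are illustrative).
def regularized_critic_update(theta, eligibilities, td_errors,
                              alpha_c=0.4, l_c=0.01):
    """theta: (D,) critic weights; eligibilities: (N, D); td_errors: (N,)."""
    N = len(td_errors)
    decay = 1.0 - alpha_c * l_c / N                              # l2 weight decay
    correction = (alpha_c / N) * eligibilities.T @ np.asarray(td_errors)
    return decay * theta + correction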

S-AC is the only algorithm without model learning; its poor performance demonstrates that combining model learning with AC can really improve performance. Dyna-MLAC learns a model via LLR for local planning and policy updating. However, Dyna-MLAC uses the model directly without first checking whether it is accurate, and it does not exploit the global information contained in the samples; therefore it behaves worse than AC-HMLP and RAC-HMLP. MLAC also approximates a local model via LLR as Dyna-MLAC does, but it only uses the model to update the policy gradient, surprisingly with slightly better performance than Dyna-MLAC.

Dyna-MLAC and MLAC approximate the value function, the policy, and the model through LLR. In the LLR approach, the samples collected in the interaction with the environment have to be stored in a memory, and through KNN or a K-d tree only the L nearest samples in the memory are selected to fit the LLR. Such a learning process incurs considerable computation and memory costs. In AC-HMLP and RAC-HMLP, only a parameter vector has to be stored and learned for each of the value function, the policy, and the global model; therefore AC-HMLP and RAC-HMLP also outperform Dyna-MLAC and MLAC in computation and memory costs.
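To illustrate where these costs come from, here is a minimal sketch, with an assumed memory layout and helper names of our own, of the LLR step used by the local models: select the L nearest stored transitions, stack them into the matrices X and Y, and solve Γ = Y Xᵀ(XXᵀ)⁻¹ as in (14). Every prediction repeats the neighbor search and the least-squares solve, whereas the global LFA model only maintains a fixed-size parameter vector.

import numpy as np

# Sketch of the LLR local model: memory is a list of (x, u, x_next, r),
# with x and x_next as 1-D NumPy arrays and u, r as scalars.
def fit_local_model(memory, query_state, query_action, L=9):
    key = np.append(query_state, query_action)
    dists = [np.linalg.norm(np.append(x, u) - key) for x, u, _, _ in memory]
    nearest = [memory[i] for i in np.argsort(dists)[:L]]            # L nearest samples
    X = np.stack([np.concatenate([x, [u, 1.0]]) for x, u, _, _ in nearest], axis=1)
    Y = np.stack([np.concatenate([xn, [r]]) for _, _, xn, r in nearest], axis=1)
    # Gamma = Y X^T (X X^T)^{-1}; pinv guards against a singular X X^T.
    return Y @ X.T @ np.linalg.pinv(X @ X.T)

def predict_local(Gamma, state, action):
    out = Gamma @ np.concatenate([state, [action, 1.0]])
    return out[:-1], out[-1]                                        # next state, reward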

The planning times for the local model and the global model have to be determined according to the experimental performance; thus we have to tune the planning times for each domain. To address this problem, our future work will consider how to determine the planning times adaptively for different domains. Moreover, with the development of deep learning [33, 34], deep RL has succeeded in many applications, such as AlphaGo and Atari 2600, by taking visual images or videos as input. However, how to accelerate the learning process in deep RL through model learning is still an open question. Future work will endeavor to combine deep RL with hierarchical model learning and planning to solve real-world problems.

Competing Interests

The authors declare that there are no competing interests regarding the publication of this article.

[Figure 9: Final optimal policy and value function after training, over the maze coordinates x and y. (a) Optimal policy learned by AC-HMLP. (b) Optimal value function learned by AC-HMLP.]

Acknowledgments

This research was partially supported by the Innovation Center of Novel Software Technology and Industrialization, the National Natural Science Foundation of China (61502323, 61502329, 61272005, 61303108, 61373094, and 61472262), the Natural Science Foundation of Jiangsu (BK2012616), the High School Natural Foundation of Jiangsu (13KJB520020), the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172014K04), and the Suzhou Industrial Application of Basic Research Program Part (SYG201422).

References

[1] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, Mass, USA, 1998.
[2] M. L. Littman, "Reinforcement learning improves behaviour from evaluative feedback," Nature, vol. 521, no. 7553, pp. 445–451, 2015.
[3] D. Silver, R. S. Sutton, and M. Muller, "Temporal-difference search in computer Go," Machine Learning, vol. 87, no. 2, pp. 183–219, 2012.
[4] J. Kober and J. Peters, "Reinforcement learning in robotics: a survey," in Reinforcement Learning, pp. 579–610, Springer, Berlin, Germany, 2012.
[5] D.-R. Liu, H.-L. Li, and D. Wang, "Feature selection and feature learning for high-dimensional batch reinforcement learning: a survey," International Journal of Automation and Computing, vol. 12, no. 3, pp. 229–242, 2015.
[6] R. E. Bellman, Dynamic Programming, Princeton University Press, Princeton, NJ, USA, 1957.
[7] D. P. Bertsekas, Dynamic Programming and Optimal Control, Athena Scientific, Belmont, Mass, USA, 1995.
[8] F. L. Lewis and D. R. Liu, Reinforcement Learning and Approximate Dynamic Programming for Feedback Control, John Wiley & Sons, Hoboken, NJ, USA, 2012.
[9] P. J. Werbos, "Advanced forecasting methods for global crisis warning and models of intelligence," General Systems, vol. 22, pp. 25–38, 1977.
[10] D. R. Liu, H. L. Li, and D. Wang, "Data-based self-learning optimal control: research progress and prospects," Acta Automatica Sinica, vol. 39, no. 11, pp. 1858–1870, 2013.
[11] H. G. Zhang, D. R. Liu, Y. H. Luo, and D. Wang, Adaptive Dynamic Programming for Control: Algorithms and Stability, Communications and Control Engineering Series, Springer, London, UK, 2013.
[12] A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, "Discrete-time nonlinear HJB solution using approximate dynamic programming: convergence proof," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 38, no. 4, pp. 943–949, 2008.
[13] T. Dierks, B. T. Thumati, and S. Jagannathan, "Optimal control of unknown affine nonlinear discrete-time systems using offline-trained neural networks with proof of convergence," Neural Networks, vol. 22, no. 5-6, pp. 851–860, 2009.
[14] F.-Y. Wang, N. Jin, D. Liu, and Q. Wei, "Adaptive dynamic programming for finite-horizon optimal control of discrete-time nonlinear systems with ε-error bound," IEEE Transactions on Neural Networks, vol. 22, no. 1, pp. 24–36, 2011.
[15] K. Doya, "Reinforcement learning in continuous time and space," Neural Computation, vol. 12, no. 1, pp. 219–245, 2000.
[16] J. J. Murray, C. J. Cox, G. G. Lendaris, and R. Saeks, "Adaptive dynamic programming," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 32, no. 2, pp. 140–153, 2002.
[17] T. Hanselmann, L. Noakes, and A. Zaknich, "Continuous-time adaptive critics," IEEE Transactions on Neural Networks, vol. 18, no. 3, pp. 631–647, 2007.
[18] S. Bhasin, R. Kamalapurkar, M. Johnson, K. G. Vamvoudakis, F. L. Lewis, and W. E. Dixon, "A novel actor-critic-identifier architecture for approximate optimal control of uncertain nonlinear systems," Automatica, vol. 49, no. 1, pp. 82–92, 2013.
[19] L. Busoniu, R. Babuska, B. De Schutter, et al., Reinforcement Learning and Dynamic Programming Using Function Approximators, CRC Press, New York, NY, USA, 2010.
[20] A. G. Barto, R. S. Sutton, and C. W. Anderson, "Neuronlike adaptive elements that can solve difficult learning control problems," IEEE Transactions on Systems, Man, and Cybernetics, vol. 13, no. 5, pp. 834–846, 1983.
[21] H. Van Hasselt and M. A. Wiering, "Reinforcement learning in continuous action spaces," in Proceedings of the IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning (ADPRL '07), pp. 272–279, Honolulu, Hawaii, USA, April 2007.
[22] S. Bhatnagar, R. S. Sutton, M. Ghavamzadeh, and M. Lee, "Natural actor-critic algorithms," Automatica, vol. 45, no. 11, pp. 2471–2482, 2009.
[23] I. Grondman, L. Busoniu, G. A. D. Lopes, and R. Babuska, "A survey of actor-critic reinforcement learning: standard and natural policy gradients," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 42, no. 6, pp. 1291–1307, 2012.
[24] I. Grondman, L. Busoniu, and R. Babuska, "Model learning actor-critic algorithms: performance evaluation in a motion control task," in Proceedings of the 51st IEEE Conference on Decision and Control (CDC '12), pp. 5272–5277, Maui, Hawaii, USA, December 2012.
[25] I. Grondman, M. Vaandrager, L. Busoniu, R. Babuska, and E. Schuitema, "Efficient model learning methods for actor-critic control," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 42, no. 3, pp. 591–602, 2012.
[26] B. Costa, W. Caarls, and D. S. Menasche, "Dyna-MLAC: trading computational and sample complexities in actor-critic reinforcement learning," in Proceedings of the Brazilian Conference on Intelligent Systems (BRACIS '15), pp. 37–42, Natal, Brazil, November 2015.
[27] O. Andersson, F. Heintz, and P. Doherty, "Model-based reinforcement learning in continuous environments using real-time constrained optimization," in Proceedings of the 29th AAAI Conference on Artificial Intelligence (AAAI '15), pp. 644–650, Texas, USA, January 2015.
[28] R. S. Sutton, C. Szepesvari, A. Geramifard, and M. Bowling, "Dyna-style planning with linear function approximation and prioritized sweeping," in Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence (UAI '08), pp. 528–536, July 2008.
[29] R. Faulkner and D. Precup, "Dyna planning using a feature based generative model," in Proceedings of the Advances in Neural Information Processing Systems (NIPS '10), pp. 1–9, Vancouver, Canada, December 2010.
[30] H. Yao and C. Szepesvari, "Approximate policy iteration with linear action models," in Proceedings of the 26th National Conference on Artificial Intelligence (AAAI '12), Toronto, Canada, July 2012.
[31] H. R. Berenji and P. Khedkar, "Learning and tuning fuzzy logic controllers through reinforcements," IEEE Transactions on Neural Networks, vol. 3, no. 5, pp. 724–740, 1992.
[32] R. S. Sutton, "Generalization in reinforcement learning: successful examples using sparse coarse coding," in Advances in Neural Information Processing Systems, pp. 1038–1044, MIT Press, New York, NY, USA, 1996.
[33] D. Silver, A. Huang, C. J. Maddison, et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[34] V. Mnih, K. Kavukcuoglu, D. Silver, et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.


MLAC and Dyna-MLAC on this experiment The parame-ters of S-AC MLAC and Dyna-MLAC are set according tothe values mentioned in their papers The parameters of AC-HMLP and RAC-HMLP are set as shown in Table 1

The planning times may have significant effect on theconvergence performance Therefore we have to determinethe local planning times 119875

119897and the global planning times 119875

119892

at first The selective settings for the two parameters are (30300) (50 300) (30 600) and (50 600) From Figure 2(a) it

Computational Intelligence and Neuroscience 9

Table 1 Parameters settings of RAC-HMLP and AC-HMLP

Parameter Symbol ValueTime step 119879

11990401

Discount factor 120574 09Trace-decay rate 120582 09Exploration variance 120590

2 1Learning rate of the actor 120572

11988605

Learning rate of the critic 120572119888

04Learning rate of the model 120572

11989805

Error threshold 120585 015Capacity of the memory 119872 size 100Number of the nearest samples 119871 9Local planning times 119875

11989730

Global planning times 119875119892 300

Number of components of the state K 2Regularization parameter of the model ℓ

11989802

Regularization parameter of the critic ℓ119888

001Regularization parameter of the actor ℓ

1198860001

From Figure 2(a) it is easy to find that (P_l = 30, P_g = 300) behaves best among these four settings, needing 41 episodes to converge. (P_l = 30, P_g = 600) learns fastest in the early 29 episodes but does not converge until the 52nd episode. (P_l = 30, P_g = 600) and (P_l = 30, P_g = 300) have identical local planning times but different global planning times. Generally, the more the planning times, the faster the convergence rate, but (P_l = 30, P_g = 300) performs better instead: the global model is not accurate enough in the initial phase, and planning through such an inaccurate global model leads to unstable performance. Notice that (P_l = 50, P_g = 300) and (P_l = 50, P_g = 600) behave poorer than (P_l = 30, P_g = 600) and (P_l = 30, P_g = 300), which demonstrates that planning too much via the local model does not perform better. (P_l = 50, P_g = 300) seems to converge at the 51st episode, but its learning curve fluctuates heavily after 365 episodes and does not converge any more until the end. Like (P_l = 50, P_g = 300), (P_l = 50, P_g = 600) also fluctuates heavily, but it converges at the 373rd episode. Evidently, (P_l = 50, P_g = 600) performs slightly better than (P_l = 50, P_g = 300). Planning through the global model might alleviate the instability caused by planning with the local model.

The convergence performances of the five methods, RAC-HMLP, AC-HMLP, S-AC, MLAC, and Dyna-MLAC, are shown in Figure 2(b). It is evident that our methods RAC-HMLP and AC-HMLP have the best convergence performances: RAC-HMLP and AC-HMLP converge at the 39th and 41st episodes, respectively. Both of them learn quickly in the primary phase, but the learning curve of RAC-HMLP is steeper than that of AC-HMLP. Dyna-MLAC learns faster than MLAC and S-AC in the first 74 episodes, but it does not converge until the 99th episode. Though MLAC behaves poorer than Dyna-MLAC at first, it requires just 82 episodes to converge. The method with the slowest learning rate is S-AC, where the pole keeps its balance for 3000 time steps for the first time only at the 251st episode; it does not converge until the 333rd episode. RAC-HMLP converges fastest, which might be caused by the introduction of the ℓ2-regularization.

Figure 2: Comparisons of different planning times and different algorithms. (a) Determination of different planning times; (b) comparisons of different algorithms in convergence performance. Both panels plot the number of balanced time steps against the training episodes.

Because the ℓ2-regularization does not allow the parameters to grow rapidly, overfitting during learning can be avoided effectively. Dyna-MLAC converges faster than MLAC in the early 74 episodes, but its performance is not stable enough, as embodied in the heavy fluctuation from the 75th to the 99th episode. If the approximate local model differs largely from the real local model, then planning through such an inaccurate local model might lead to unstable performance. S-AC, as the only method without model learning, behaves poorest among the five. These results show that the convergence performance can be improved largely by introducing model learning.

The comparisons of the five algorithms in sample efficiency are shown in Figure 3. The numbers of samples required for S-AC, MLAC, Dyna-MLAC, AC-HMLP, and RAC-HMLP to converge are 56003, 35535, 32043, 14426, and 12830, respectively.

Figure 3: Comparisons of sample efficiency. The number of samples to converge (×10^4) is plotted against the training episodes for the five algorithms.

Evidently, RAC-HMLP converges fastest and behaves stably all the time, thus requiring the fewest samples. Though AC-HMLP requires more samples to converge than RAC-HMLP, it still requires far fewer samples than the other three methods. Because S-AC does not utilize a model, it requires the most samples among the five. Unlike MLAC, Dyna-MLAC applies the model not only in updating the policy but also in planning, resulting in fewer required samples.

The optimal policy and optimal value function learned by AC-HMLP after the training ends are shown in Figures 4(a) and 4(b), respectively, while those learned by RAC-HMLP are shown in Figures 4(c) and 4(d). Evidently, the optimal policy and the optimal value function learned by AC-HMLP and RAC-HMLP are quite similar, but RAC-HMLP seems to produce a more fine-grained optimal policy and value function.

As for the optimal policy (see Figures 4(a) and 4(c)), the force becomes smaller and smaller from the two sides (left side and right side) to the middle. The top-right is the region requiring a force close to 50 N, where the direction of the angle is the same as that of the angular velocity. The values of the angle and the angular velocity are nearing their maxima, π/4 and 2; therefore, the largest force to the left is required so as to guarantee that the angle between the upright line and the pole does not exceed π/4. It is the opposite in the bottom-left region, where a force in [−50 N, −40 N] is required to keep the angle from surpassing −π/4. The pole can keep its balance with a gentle force close to 0 in the middle region. The direction of the angle is different from that of the angular velocity in the top-left and bottom-right regions; thus a force whose absolute value is relatively large but smaller than 50 N is required there.

In terms of the optimal value function (see Figures 4(b) and 4(d)), the value function reaches its maximum in the region satisfying −2 ≤ ẇ ≤ 2. The pole is more prone to keep its balance in this region, even without applying any force, resulting in the relatively larger value function. The pole becomes more and more difficult to balance from the middle to the two sides, with the value function also decaying gradually. The finer-grained values on the left side compared with the right might be caused by more frequent visitation of the left side; more visitation leads to a more accurate estimation of the value function.

After the training ends, the prediction of the next state and the reward for every state-action pair can be obtained through the learned model. The predictions for the next angle w, the next angular velocity ẇ, and the reward for any possible state are shown in Figures 5(a), 5(b), and 5(c). It is noticeable that the predicted next state is always near the current state in Figures 5(a) and 5(b). The predicted reward is always larger than 0 in Figure 5(c), which illustrates that the pole can always keep its balance under the optimal policy.

5.2. Continuous Maze Problem. The continuous maze problem is shown in Figure 6, where the blue lines with coordinates represent the barriers. The state (x, y) consists of the horizontal coordinate x ∈ [0, 1] and the vertical coordinate y ∈ [0, 1]. Starting from the position "Start", the goal of the agent is to reach the position "Goal", which satisfies x + y > 1.8. The action is to choose an angle bounded by [−π, π] as the new direction and then walk along this direction with a step of 0.1. The agent receives a reward of −1 at each time step, and it is punished with a reward of −400 multiplied by the distance if the distance between its position and the barrier exceeds 0.1. The parameters are set the same as in the former problem except for the planning times; the local and global planning times are determined as 10 and 50 in the same way as in the former experiment. The state-based feature is computed according to (23) and (24) with dimensionality z = 16; the center point c_i is located over the grid points {0.2, 0.4, 0.6, 0.8} × {0.2, 0.4, 0.6, 0.8}, and the diagonal elements of B are σ_x^2 = 0.2 and σ_y^2 = 0.2. The state-action-based feature is also computed according to (23) and (24); the center point c_i is located over the grid points {0.2, 0.4, 0.6, 0.8} × {0.2, 0.4, 0.6, 0.8} × {−3, −1.5, 0, 1.5, 3}, with dimensionality z = 80, and the diagonal elements of B are σ_x^2 = 0.2, σ_y^2 = 0.2, and σ_u^2 = 1.5.
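As an illustration of how the same feature construction extends to this task, the sketch below (again our own hypothetical helper, not the authors' code) assembles the 80-dimensional normalized state-action feature from the grid of centers listed above.

```python
import numpy as np
from itertools import product

# 4 x 4 x 5 = 80 RBF centers over (x, y, u) for the state-action feature.
sa_centers = np.array(list(product([0.2, 0.4, 0.6, 0.8],
                                   [0.2, 0.4, 0.6, 0.8],
                                   [-3.0, -1.5, 0.0, 1.5, 3.0])))
B_sa_inv = np.linalg.inv(np.diag([0.2, 0.2, 1.5]))   # sigma_x^2, sigma_y^2, sigma_u^2

def maze_sa_features(x, y, u):
    """Normalized state-action feature phi(x, y, u), following (23) and (24)."""
    diff = sa_centers - np.array([x, y, u])
    phi = np.exp(-0.5 * np.einsum('ij,jk,ik->i', diff, B_sa_inv, diff))
    return phi / phi.sum()

phi = maze_sa_features(0.15, 0.35, 0.5)   # example position and direction angle
assert phi.shape == (80,)
```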

RAC-HMLP and AC-HMLP are compared with S-AC, MLAC, and Dyna-MLAC in cumulative rewards. The results are shown in Figure 7(a). RAC-HMLP learns fastest in the early 15 episodes, MLAC behaves second best, and AC-HMLP performs poorest. RAC-HMLP tends to converge at the 24th episode, but it does not truly converge until the 39th episode. AC-HMLP behaves steadily from the 23rd episode to the end and converges at the 43rd episode. As in the former experiment, RAC-HMLP performs slightly better than AC-HMLP, as embodied in the cumulative rewards of −44 compared with −46. At the 53rd episode the cumulative rewards of S-AC fall quickly to about −545 without any ascent thereafter. Though MLAC and Dyna-MLAC perform well in the primary phase, their curves start to descend at the 88th and 93rd episodes, respectively. MLAC and Dyna-MLAC learn the local model to update the policy gradient, resulting in a fast learning rate in the primary phase.


Figure 4: Optimal policy and value function learned by AC-HMLP and RAC-HMLP. (a) Optimal policy of AC-HMLP learned after training; (b) optimal value function of AC-HMLP learned after training; (c) optimal policy of RAC-HMLP learned after training; (d) optimal value function of RAC-HMLP learned after training. Each panel plots the angular velocity (rad/s) against the angle (rad).

However, they do not behave stably enough near the end of the training. Too much visitation to the barrier region might cause the fast descent of the cumulative rewards.

The comparisons of the time steps for reaching the goal are shown in Figure 7(b). Obviously, RAC-HMLP and AC-HMLP perform better than the other three methods. RAC-HMLP converges at the 39th episode while AC-HMLP converges at the 43rd episode, with the time steps for reaching the goal being 45 and 47, respectively. It is clear that RAC-HMLP still performs slightly better than AC-HMLP. The time steps for reaching the goal are 548, 69, and 201 for S-AC, MLAC, and Dyna-MLAC, respectively. Among the five methods, RAC-HMLP not only converges fastest but also finds the best solution, 45. The poorest performance of S-AC illustrates that model learning can definitely improve the convergence performance. As in the former experiment, Dyna-MLAC behaves poorer than MLAC during training. If the model is inaccurate, planning via such a model might degrade the estimations of the value function and the policy, thus leading to the poor performance of Dyna-MLAC.

The comparisons of sample efficiency are shown in Figure 8. The numbers of samples required for S-AC, MLAC, Dyna-MLAC, AC-HMLP, and RAC-HMLP to converge are 10595, 6588, 7694, 4388, and 4062, respectively. As in the pole balancing problem, RAC-HMLP requires the fewest samples while S-AC needs the most to converge. The difference is that Dyna-MLAC requires slightly more samples than MLAC. The ascending curve of Dyna-MLAC at the end of the training demonstrates that it has not converged. The frequent visitation to the barrier in Dyna-MLAC leads to the quick descent of the value functions; as a result, enormous numbers of samples are required to drive these value functions toward the true ones.

After the training ends, the approximate optimal policy and value function are obtained, as shown in Figures 9(a) and 9(b). It is noticeable that the lower part of the figure is explored thoroughly, so that the policy and the value function are very distinctive in this region. For example, in most of the region in Figure 9(a) a larger angle is needed for the agent to leave the current state. Clearly, the nearer the current state is to the lower part of Figure 9(b), the smaller the corresponding value function. The top part of the figures may not be frequently visited, or even not visited at all, by the agent, resulting in similar value functions there; the agent is able to reach the goal with only a gentle angle in these areas.

6. Conclusion and Discussion

This paper proposes two novel actor-critic algorithms, AC-HMLP and RAC-HMLP. Both of them take LLR and LFA to represent the local model and the global model, respectively. It has been shown that our new methods are able to learn the optimal value function and the optimal policy.

Figure 5: Prediction of the next state and reward according to the global model. (a) Prediction of the angle at the next state; (b) prediction of the angular velocity at the next state; (c) prediction of the reward. Each panel plots the angular velocity (rad/s) against the angle (rad).

Figure 6: Continuous maze problem. The barriers are marked by the coordinates (0.3, 0.7), (0.55, 0.7), (0.55, 0.32), and (0.55, 0.85) within the unit square; the agent starts at "Start" and must reach "Goal".

In the pole balancing and continuous maze problems, RAC-HMLP and AC-HMLP are compared with three representative methods. The results show that RAC-HMLP and AC-HMLP not only converge fastest but also have the best sample efficiency.

RAC-HMLP performs slightly better than AC-HMLP in convergence rate and sample efficiency. By introducing ℓ2-regularization, the parameters learned by RAC-HMLP are smaller and more uniform than those of AC-HMLP, so that overfitting can be avoided effectively. Though RAC-HMLP behaves better, its improvement over AC-HMLP is not significant. Because AC-HMLP normalizes all the features to [0, 1], the parameters will not change heavily; as a result, overfitting can also be prevented to a certain extent.

Figure 7: Comparisons of cumulative rewards and time steps for reaching the goal. (a) Comparisons of cumulative rewards; (b) comparisons of time steps for reaching the goal. Both panels plot the corresponding quantity against the training episodes for the five algorithms.

Figure 8: Comparisons of different algorithms in sample efficiency. The number of samples to converge is plotted against the training episodes for the five algorithms.

S-AC is the only algorithm without model learning. The poorest performance of S-AC demonstrates that combining model learning with AC can really improve the performance. Dyna-MLAC learns a model via LLR for local planning and policy updating. However, Dyna-MLAC directly utilizes the model without making sure whether it is accurate; additionally, it does not utilize the global information contained in the samples. Therefore, it behaves poorer than AC-HMLP and RAC-HMLP. MLAC also approximates a local model via LLR as Dyna-MLAC does, but it only takes the model to update the policy gradient, surprisingly with a slightly better performance than Dyna-MLAC.

Dyna-MLAC and MLAC approximate the value function, the policy, and the model through LLR. In the LLR approach, the samples collected in the interaction with the environment have to be stored in the memory. Through KNN or a K-d tree, only the L nearest samples in the memory are selected to learn the LLR. Such a learning process incurs considerable computation and memory costs. In AC-HMLP and RAC-HMLP, there is only a parameter vector to be stored and learned for each of the value function, the policy, and the global model. Therefore, AC-HMLP and RAC-HMLP outperform Dyna-MLAC and MLAC also in computation and memory costs.
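To make this cost argument concrete, the following sketch (a simplified reconstruction under the notation of Section 3.3 with our own function names, not the authors' implementation) shows the per-step work of the LLR local model: selecting the L nearest stored samples and solving Γ = Y Xᵀ(X Xᵀ)⁻¹, which must be repeated at every time step, whereas a parametric global model only updates a fixed-length parameter vector.

```python
import numpy as np

def fit_local_model(memory, query, L=9):
    """Fit the local linear model Gamma = Y X^T (X X^T)^{-1} on the L nearest samples.

    memory: list of (state, action, next_state, reward) tuples, with K-dimensional states.
    query:  current (state, action) pair concatenated into one (K+1)-dimensional vector.
    """
    # Distance of every stored (state, action) pair to the query.
    sa = np.array([np.concatenate([s, [a]]) for s, a, _, _ in memory])
    idx = np.argsort(np.linalg.norm(sa - query, axis=1))[:L]

    # Input matrix X ((K+2) x L, the last row of ones adds a bias) and output matrix Y ((K+1) x L).
    X = np.vstack([sa[idx].T, np.ones(L)])
    Y = np.array([np.concatenate([memory[i][2], [memory[i][3]]]) for i in idx]).T

    Gamma = Y @ X.T @ np.linalg.pinv(X @ X.T)   # pseudo-inverse for numerical safety
    return Gamma

def predict(Gamma, state, action):
    """Predict [next_state, reward] for the current state-action pair."""
    return Gamma @ np.concatenate([state, [action], [1.0]])
```

Each call touches the whole memory for the nearest-neighbor search and solves a small least-squares problem, which is exactly the computational and memory overhead discussed above.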

The planning times for the local model and the global model have to be determined according to the experimental performance; thus we have to set the planning times separately for each domain. To address this problem, our future work will consider how to determine the planning times adaptively according to the domain. Moreover, with the development of deep learning [33, 34], deep RL has succeeded in many applications, such as AlphaGo and Atari 2600, by taking visual images or videos as input. However, how to accelerate the learning process in deep RL through model learning is still an open question. Future work will endeavor to combine deep RL with hierarchical model learning and planning to solve real-world problems.

Competing Interests

The authors declare that there are no competing interests regarding the publication of this article.

Figure 9: Final optimal policy and value function after training. (a) Optimal policy learned by AC-HMLP; (b) optimal value function learned by AC-HMLP. Both panels plot coordinate y against coordinate x.

Acknowledgments

This research was partially supported by the Innovation Center of Novel Software Technology and Industrialization, the National Natural Science Foundation of China (61502323, 61502329, 61272005, 61303108, 61373094, and 61472262), the Natural Science Foundation of Jiangsu (BK2012616), the High School Natural Foundation of Jiangsu (13KJB520020), the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172014K04), and the Suzhou Industrial Application of Basic Research Program Part (SYG201422).

References

[1] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, Mass, USA, 1998.
[2] M. L. Littman, "Reinforcement learning improves behaviour from evaluative feedback," Nature, vol. 521, no. 7553, pp. 445–451, 2015.
[3] D. Silver, R. S. Sutton, and M. Muller, "Temporal-difference search in computer Go," Machine Learning, vol. 87, no. 2, pp. 183–219, 2012.
[4] J. Kober and J. Peters, "Reinforcement learning in robotics: a survey," in Reinforcement Learning, pp. 579–610, Springer, Berlin, Germany, 2012.
[5] D.-R. Liu, H.-L. Li, and D. Wang, "Feature selection and feature learning for high-dimensional batch reinforcement learning: a survey," International Journal of Automation and Computing, vol. 12, no. 3, pp. 229–242, 2015.
[6] R. E. Bellman, Dynamic Programming, Princeton University Press, Princeton, NJ, USA, 1957.
[7] D. P. Bertsekas, Dynamic Programming and Optimal Control, Athena Scientific, Belmont, Mass, USA, 1995.
[8] F. L. Lewis and D. R. Liu, Reinforcement Learning and Approximate Dynamic Programming for Feedback Control, John Wiley & Sons, Hoboken, NJ, USA, 2012.
[9] P. J. Werbos, "Advanced forecasting methods for global crisis warning and models of intelligence," General Systems, vol. 22, pp. 25–38, 1977.
[10] D. R. Liu, H. L. Li, and D. Wang, "Data-based self-learning optimal control: research progress and prospects," Acta Automatica Sinica, vol. 39, no. 11, pp. 1858–1870, 2013.
[11] H. G. Zhang, D. R. Liu, Y. H. Luo, and D. Wang, Adaptive Dynamic Programming for Control: Algorithms and Stability, Communications and Control Engineering Series, Springer, London, UK, 2013.
[12] A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, "Discrete-time nonlinear HJB solution using approximate dynamic programming: convergence proof," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 38, no. 4, pp. 943–949, 2008.
[13] T. Dierks, B. T. Thumati, and S. Jagannathan, "Optimal control of unknown affine nonlinear discrete-time systems using offline-trained neural networks with proof of convergence," Neural Networks, vol. 22, no. 5-6, pp. 851–860, 2009.
[14] F.-Y. Wang, N. Jin, D. Liu, and Q. Wei, "Adaptive dynamic programming for finite-horizon optimal control of discrete-time nonlinear systems with ε-error bound," IEEE Transactions on Neural Networks, vol. 22, no. 1, pp. 24–36, 2011.
[15] K. Doya, "Reinforcement learning in continuous time and space," Neural Computation, vol. 12, no. 1, pp. 219–245, 2000.
[16] J. J. Murray, C. J. Cox, G. G. Lendaris, and R. Saeks, "Adaptive dynamic programming," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 32, no. 2, pp. 140–153, 2002.
[17] T. Hanselmann, L. Noakes, and A. Zaknich, "Continuous-time adaptive critics," IEEE Transactions on Neural Networks, vol. 18, no. 3, pp. 631–647, 2007.
[18] S. Bhasin, R. Kamalapurkar, M. Johnson, K. G. Vamvoudakis, F. L. Lewis, and W. E. Dixon, "A novel actor-critic-identifier architecture for approximate optimal control of uncertain nonlinear systems," Automatica, vol. 49, no. 1, pp. 82–92, 2013.
[19] L. Busoniu, R. Babuska, B. De Schutter, et al., Reinforcement Learning and Dynamic Programming Using Function Approximators, CRC Press, New York, NY, USA, 2010.
[20] A. G. Barto, R. S. Sutton, and C. W. Anderson, "Neuronlike adaptive elements that can solve difficult learning control problems," IEEE Transactions on Systems, Man, and Cybernetics, vol. 13, no. 5, pp. 834–846, 1983.
[21] H. Van Hasselt and M. A. Wiering, "Reinforcement learning in continuous action spaces," in Proceedings of the IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning (ADPRL '07), pp. 272–279, Honolulu, Hawaii, USA, April 2007.
[22] S. Bhatnagar, R. S. Sutton, M. Ghavamzadeh, and M. Lee, "Natural actor-critic algorithms," Automatica, vol. 45, no. 11, pp. 2471–2482, 2009.
[23] I. Grondman, L. Busoniu, G. A. D. Lopes, and R. Babuska, "A survey of actor-critic reinforcement learning: standard and natural policy gradients," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 42, no. 6, pp. 1291–1307, 2012.
[24] I. Grondman, L. Busoniu, and R. Babuska, "Model learning actor-critic algorithms: performance evaluation in a motion control task," in Proceedings of the 51st IEEE Conference on Decision and Control (CDC '12), pp. 5272–5277, Maui, Hawaii, USA, December 2012.
[25] I. Grondman, M. Vaandrager, L. Busoniu, R. Babuska, and E. Schuitema, "Efficient model learning methods for actor-critic control," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 42, no. 3, pp. 591–602, 2012.
[26] B. Costa, W. Caarls, and D. S. Menasche, "Dyna-MLAC: trading computational and sample complexities in actor-critic reinforcement learning," in Proceedings of the Brazilian Conference on Intelligent Systems (BRACIS '15), pp. 37–42, Natal, Brazil, November 2015.
[27] O. Andersson, F. Heintz, and P. Doherty, "Model-based reinforcement learning in continuous environments using real-time constrained optimization," in Proceedings of the 29th AAAI Conference on Artificial Intelligence (AAAI '15), pp. 644–650, Texas, USA, January 2015.
[28] R. S. Sutton, C. Szepesvari, A. Geramifard, and M. Bowling, "Dyna-style planning with linear function approximation and prioritized sweeping," in Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence (UAI '08), pp. 528–536, July 2008.
[29] R. Faulkner and D. Precup, "Dyna planning using a feature based generative model," in Proceedings of the Advances in Neural Information Processing Systems (NIPS '10), pp. 1–9, Vancouver, Canada, December 2010.
[30] H. Yao and C. Szepesvari, "Approximate policy iteration with linear action models," in Proceedings of the 26th National Conference on Artificial Intelligence (AAAI '12), Toronto, Canada, July 2012.
[31] H. R. Berenji and P. Khedkar, "Learning and tuning fuzzy logic controllers through reinforcements," IEEE Transactions on Neural Networks, vol. 3, no. 5, pp. 724–740, 1992.
[32] R. S. Sutton, "Generalization in reinforcement learning: successful examples using sparse coarse coding," in Advances in Neural Information Processing Systems, pp. 1038–1044, MIT Press, New York, NY, USA, 1996.
[33] D. Silver, A. Huang, C. J. Maddison, et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[34] V. Mnih, K. Kavukcuoglu, D. Silver, et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.


Page 4: Research Article Efficient Actor-Critic Algorithm with ... · Research Article Efficient Actor-Critic Algorithm with Hierarchical Model Learning and Planning ShanZhong, 1,2 QuanLiu,

4 Computational Intelligence and Neuroscience

Input 120574 120582 120572119886 120572119888 1205902

(1) Initialize 120579 120573(2) Loop(3) 119890 larr

9978881 120579 larr 997888

0 120573 larr9978880 119909 larr 119909

0 119905 larr 1

(4) Repeat all episodes(5) Choose Δ119906

119905according to119873(0 120590

2)

(6) 119906119905larr ℎ(119909 120573) + Δ119906

119905

(7) Execute 119906119905and observe 119903

119905+1and 119909

119905+1

(8) Update the eligibility of the value function 119890119905(119909) = 120597119881(119909 120579)120597120579 119909

119905= 119909 120582120574119890

119905minus1(119909) 119909

119905= 119909

(9) Compute the TD error 120575119905= 119903119905+1

+ 120574119881(119909119905+1 120579119905) minus 119881(119909

119905 120579119905)

(10) Update the parameter of the value function 120579119905+1

= 120579119905+ 120572119888120575119905119890119905(119909)

(11) Update the parameter of the policy 120573119905+1

= 120573119905+ 120572119886120575119905Δ119906119905(120597ℎ(119909 120573)120597120573)

(12) 119905 larr 119905 + 1

(13) End Repeat(14) End LoopOutput 120579 120573

Algorithm 1 S-AC

The preexisting works are mainly aimed at the problemswith continuous states but discrete actions They approx-imated the transition function in the form of probabilitymatrix which specifies the transition probability from thecurrent feature to the next feature The indirectly observedfeatures result in the inaccurate feature-based model Theconvergence rate will be slowed significantly by using such aninaccurate model for planning especially at each time step inthe initial phase

To solve these problems we will approximate a state-based model instead of the inaccurate feature-based modelMoreover we will introduce an additional global model forplanning The global model is applied only at the end of eachepisode so that the global information can be utilized asmuchas possible Using such a global model without others willlead to the loss of valuable local informationThus likeDyna-MLAC we also approximate a local model by LLR and useit for planning at each time step The difference is that auseful error threshold is designed for the local planning inour method If the state-prediction error between the real-next state and the predicted one does not surpass the errorthreshold the local planning process will be started at thecurrent time step Therefore the convergence rate and thesample efficiency can be improved dramatically by combiningthe local and global model learning and planning

32 Learning and Planning of the Global Model The globalmodel establishes separate equations for the reward functionand the state transition function of every state component bylinear function approximation Assume the agent is at state119909119905= 1199091199051 119909

119905119870 where119870 is the dimensionality of the state

119909119905 the action 119906

119905is selected according to the policy ℎ then

all the components 1199091015840119905+11

1199091015840

119905+12 119909

1015840

119905+1119870 of the next state

1199091015840

119905+1can be predicted as

1199091015840

119905+11= 1205781199051

T120601 (1199091199051 1199091199052 119909

119905119870 119906119905)

1199091015840

119905+12= 1205781199052

T120601 (1199091199051 1199091199052 119909

119905119870 119906119905)

1199091015840

119905+1119870= 120578119905119870

T120601 (1199091199051 1199091199052 119909

119905119870 119906119905)

(9)

where 120578119905119894= (1205781199051198941 1205781199051198942 120578

119905119894119863)T 1 le 119894 le 119870 is the param-

eter of the state transition function corresponding to the 119894thcomponent of the current state at time step 119905 with 119863 beingthe dimensionality of the feature 120601(119909

1199051 1199091199052 119909

119905119870 119906119905)

Likewise the reward 1199031015840119905+1

can be predicted as

1199031015840

119905+1= 120589119905

T120601 (1199091199051 1199091199052 119909

119905119870 119906119905) (10)

where 120589119905= (1205891199051 1205891199052 120589

119905119863) is the parameter of the reward

function at time step 119905After 120578

1199051 1205781199052 120578

119905119870and 120589 are updated the model can

be applied to generate samples Let the current state be119909119905= (1199091199051 1199091199052 119909

119905119896) after the action 119906

119905is executed the

parameters 120578119905+1119894

(1 le 119894 le 119870) can be estimated by the gradientdescent method shown as

120578119905+11

= 1205781199051

+ 120572119898(119909119905+11

minus 1199091015840

119905+11) 120601 (119909

1199051 1199091199052 119909

119905119870 119906119905)

120578119905+12

= 1205781199052

+ 120572119898(119909119905+12

minus 1199091015840

119905+12) 120601 (119909

1199051 1199091199052 119909

119905119870 119906119905)

120578119905+1119870

= 120578119905119870

+ 120572119898(119909119905+1119870

minus 1199091015840

119905+1119870) 120601 (119909

1199051 1199091199052 119909

119905119870 119906119905)

(11)

where 120572119898

isin [0 1] is the learning rate of the model119909119905+1

is the real-next state obtained according to the systemdynamics where 119909

119905+1119894(1 le 119894 le 119870) is its 119894th component

(1199091015840

119905+11 1199091015840

119905+12 119909

1015840

119905+1119870) is the predicted-next state according

to (9)The parameter 120589

119905+1is estimated as

120589119905+1

= 120589119905+ 120572119898(119903119905+1

minus 1199031015840

119905+1) 120601 (119909

1199051 1199091199052 119909

119905119870 119906119905) (12)

where 119903119905+1

is the real reward reflected by the system dynamicswhile 1199031015840

119905+1is the predicted reward obtained according to (10)

Computational Intelligence and Neuroscience 5

33 Learning and Planning of the Local Model Though thelocal model also approximates the state transition functionand the reward function as the global model does LLR isserved as the function approximator instead of LFA In thelocal model a memory storing the samples in the form of(119909119905 119906119905 119909119905+1 119903119905+1) is maintained At each time step a new

sample is generated from the interaction and it will take theplace of the oldest one in the memory Not all but only L-nearest samples in thememorywill be selected for computingthe parameter matrix Γ isin R(119870+1)times(119870+2) of the local modelBefore achieving this the input matrix119883 isin R(119870+2)times119871 and theoutput matrix 119884 isin R(119870+1)times119871 should be prepared as follows

119883 =

[[[[[[[[[[[[

[

11990911

11990921

sdot sdot sdot 1199091198711

11990912

11990922

sdot sdot sdot 1199091198712

1199091119870

1199092119870

sdot sdot sdot 119909119871119870

1199061

1199062

sdot sdot sdot 119906119871

1 1 sdot sdot sdot 1

]]]]]]]]]]]]

]

119884 =

[[[[[[[[[

[

11990921

11990931

sdot sdot sdot 119909119871+11

11990922

11990932

sdot sdot sdot 119909119871+12

1199092119870

1199093119870

sdot sdot sdot 119909119871+1119870

1199032

1199033

sdot sdot sdot 119903119871+1

]]]]]]]]]

]

(13)

The last row of 119883 consisting of ones is to add a biason the output Every column in the former 119870 + 1 lines of119883 corresponds to a state-action pair for example the 119894thcolumn is the state-action (119909

119894 119906119894)T 1 le 119894 le 119871 119884 is composed

of 119871 next states and rewards corresponding to119883Γ can be obtained via solving 119884 = Γ119883 as

Γ = 119884119883T(119883119883

T)minus1

(14)

Let the current input vector be [1199091199051 119909

119905119870 119906119905 1]

T theoutput vector [1199091015840

119905+11 119909

1015840

119905+1119870 1199031015840

119905+1]T can be predicted by

[1199091015840

119905+11 119909

1015840

119905+1119896 1199031015840

119905+1]T= Γ [119909

1199051 119909

119905119896 119906119905 1]

T (15)

where [1199091199051 119909

119905119870] and [119909

1015840

119905+11 119909

1015840

119905+1119870] are the current

state and the predicted-next state respectively 1199031015840119905+1

is thepredicted reward

Γ is estimated according to (14) at each time step there-after the predicted-next state and the predicted reward canbe obtained by (15) Moreover we design an error thresholdto decide whether local planning is required We computethe state-prediction error between the real-next state and thepredicted-next state at each time step If this error does not

surpass the error threshold the local planning process willbe launched The state-prediction error is formulated as

119864119903119905= max

1003816100381610038161003816100381610038161003816100381610038161003816

1199091015840

119905+11minus 119909119905+11

119909119905+11

1003816100381610038161003816100381610038161003816100381610038161003816

1003816100381610038161003816100381610038161003816100381610038161003816

1199091015840

119905+12minus 119909119905+12

119909119905+12

1003816100381610038161003816100381610038161003816100381610038161003816

1003816100381610038161003816100381610038161003816100381610038161003816

1199091015840

119905+1119870minus 119909119905+1119870

119909119905+1119870

1003816100381610038161003816100381610038161003816100381610038161003816

100381610038161003816100381610038161003816100381610038161003816

1199031015840

119905+1minus 119903119905+1

119903119905+1

100381610038161003816100381610038161003816100381610038161003816

(16)

Let the error threshold be 120585 then the local model willbe used for planning only if 119864119903

119905le 120585 at time step 119905 In

the local planning process a sequence of locally simulatedsamples in the form of (119909

1199051 119909

119905119870 119906119905 1199091015840

119905+11 119909

1015840

119905+1119870 1199031015840)

are generated to improve the convergence of the same valuefunction and the policy as the global planning process does

4 Algorithm Specification

41 AC-HMLP AC-HMLP algorithm consists of a mainalgorithm and two subalgorithms The main algorithm isthe learning algorithm (see Algorithm 2) whereas the twosubalgorithms are local model planning procedure (seeAlgorithm 3) and global model planning procedure (seeAlgorithm 4) respectively At each time step the main algo-rithm learns the value function (see line (26) in Algorithm 2)the policy (see line (27) in Algorithm 2) the local model (seeline (19) in Algorithm 2) and the global model (see lines(10)sim(11) in Algorithm 2)

There are several parameters which are required to bedetermined in the three algorithms 119903 and 120582 are discountfactor and trace-decay rate respectively 120572

119886120572119888 and120572

119898are the

corresponding learning rates of the value function the policyand the global model119872 size is the capacity of the memory120585 denotes the error threshold 119871 determines the number ofselected samples which is used to fit the LLR 1205902 is thevariance that determines the region of the exploration 119875

119897and

119875119892are the planning times for local model and global model

Some of these parameters have empirical values for example119903 and 120582 The others have to be determined by observing theempirical results

Notice that Algorithm 3 starts planning at the state 119909119905

which is passed from the current state in Algorithm 2 whileAlgorithm 4 uses 119909

0as the initial state The reason for using

different initializations is that the local model is learnedaccording to the L-nearest samples of the current state 119909

119905

whereas the global model is learned from all the samplesThus it is reasonable and natural to start the local and globalplanning process at the states 119909

119905and 119909

0 respectively

42 AC-HMLPwith ℓ2-Regularization Regression approach-

es inmachine learning are generally represented asminimiza-tion of a square loss term and a regularization term The ℓ

2-

regularization also called ridge regress is a widely used regu-larization method in statistics and machine learning whichcan effectively prohibit overfitting of learning Therefore weintroduce ℓ

2-regularization to AC-HMLP in the learning of

the value function the policy and the model We term thisnew algorithm as RAC-HMLP

6 Computational Intelligence and Neuroscience

Input 120574 120582 120572119886 120572119888 120572119898 120585119872 size 119871 1205902 119870 119875

119897 119875119892

(1) Initialize 120579 larr 9978880 120573 larr

9978880 Γ larr 997888

0 1205781 1205782 120578

119870larr

9978880 120589 larr 997888

0

(2) Loop(3) 119890

1larr

9978881 11990911 119909

1119896larr 11990901 119909

0119896 119905 larr 1 number larr 0

(4) Repeat all episodes(5) Choose Δ119906

119905according to119873(0 120590

2)

(6) Execute the action 119906119905= ℎ(119909 120573) + Δ119906

119905

(7) Observe the reward 119903119905+1

and the next state 119909119905+1

= 119909119905+11

119909119905+1119896

Update the global model(8) Predict the next state

1199091015840

119905+11= 1205781199051

T120601 (1199091199051 1199091199052 119909

119905119870 119906119905)

1199091015840

119905+12= 1205781199052

T120601 (1199091199051 1199091199052 119909

119905119870 119906119905)

1199091015840

119905+1119870= 120578119905119870

T120601 (1199091199051 1199091199052 119909

119905119870 119906119905)

(9) Predict the reward 1199031015840119905+1

= 120589119905

T120601(1199091199051 1199091199052 119909

119905119870 119906119905)

(10) Update the parameters 1205781 1205782 120578

119870

120578119905+11

= 1205781199051+ 120572119898(119909119905+11

minus 1199091015840

119905+11) 120601 (119909

1199051 1199091199052 119909

119905119870 119906119905)

120578119905+12

= 1205781199052+ 120572119898(119909119905+12

minus 1199091015840

119905+12) 120601 (119909

1199051 1199091199052 119909

119905119870 119906119905)

120578119905+1119870

= 120578119905119870

+ 120572119898(119909119905+1119870

minus 1199091015840

119905+1119870) 120601 (119909

1199051 1199091199052 119909

119905119870 119906119905)

(11) Update the parameter 120589 120589119905+1

= 120589119905+ 120572119898(119903119905+1

minus 1199031015840

119905+1)120601(1199091199051 1199091199052 119909119905119870 119906119905)

Update the local model(12) If 119905 le 119872 size(13) Insert the real sample (119909

119905 119906119905 119909119905+1 119903119905+1) into the memory119872

(14) Else If(15) Replace the oldest one in119872 with the real sample (119909

119905 119906119905 119909119905+1 119903119905+1)

(16) End if(17) Select L-nearest neighbors of the current state from119872 to construct119883 and 119884(18) Predict the next state and the reward [1199091015840

119905+1 1199031015840

119905+1]T= Γ[119909

119905 119906119905 1]

T

(19) Update the parameter Γ Γ = 119884119883T(119883119883

T)minus1

(20) Compute the local error119864119903119905= max|(1199091015840

119905+11minus 119909119905+11

)119909119905+11

| |(1199091015840

119905+12minus 119909119905+12

)119909119905+12

| |(1199091015840

119905+1119870minus 119909119905+1119870

)119909119905+1119870

| |(1199031015840

119905+1minus 119903119905+1)119903119905+1|

(21) If 119864119903119905le 120585

(22) Call Local-model planning (120579 120573 Γ 119890 number 119875119897) (Algorithm 3)

(23) End IfUpdate the value function

(24) Update the eligibility 119890119905(119909) = 120601(119909

119905) 119909119905= 119909 120582120574119890

119905minus1(119909) 119909

119905= 119909

(25) Estimate the TD error 120575119905= 119903119905+1

+ 120574119881(119909119905+1 120579119905) minus 119881(119909

119905 120579119905)

(26) Update the value-function parameter 120579119905+1

= 120579119905+ 120572119888120575119905119890119905(119909)

Update the policy(27) Update the policy parameter 120573

119905+1= 120573119905+ 120572119886120575119905Δ119906119905120601(119909119905)

(28) 119905 = 119905 + 1

(29) Update the number of samples number = number + 1(30) Until the ending condition is satisfied(31) Call Global-model planning (120579 120573 120578

1 120578

119896 120589 119890 number 119875

119892) (Algorithm 4)

(32) End LoopOutput 120573 120579

Algorithm 2 AC-HMLP algorithm

Computational Intelligence and Neuroscience 7

(1) Loop for 119875119897time steps

(2) 1199091larr 119909119905 119905 larr 1

(3) Choose Δ119906119905according to119873(0 120590

2)

(4) 119906119905= ℎ(119909

119905 120573119905) + Δ119906

119905

(5) Predict the next state and the reward [119909119905+11

119909119905+1119870

119903119905+1] = Γ119905[1199091199051 119909

119905119870 119906119905 1]

T

(6) Update the eligibility 119890119905(119909) = 120601(119909

119905) 119909119905= 119909 120582120574119890

119905minus1(119909) 119909

119905= 119909

(7) Compute the TD error 120575119905= 119903119905+1

+ 120574119881(119909119905+1 120579119905) minus 119881(119909

119905 120579119905)

(8) Update the value function parameter 120579119905+1

= 120579119905+ 120572119888120575119905119890119905(119909)

(9) Update the policy parameter 120573119905+1

= 120573119905+ 120572119886120575TDΔ119906119905120601(119909119905)

(10) If 119905 le 119878

(11) 119905 = 119905 + 1

(12) End If(13) Update the number of samples number = number + 1(14) End LoopOutput 120573 120579

Algorithm 3 Local model planning (120579 120573 Γ 119890 number 119875119897)

(1) Loop for 119875119892times

(2) 119909larr 1199090 larr 1

(3) Repeat all episodes(4) Choose Δ119906

according to119873(0 120590)

(5) Compute exploration term 119906= ℎ(119909

120573) + Δ119906

(6) Predict the next state1199091015840

+11= 1205781

119879120601 (1199091 1199092 119909

119870 119906)

1199091015840

+12= 1205782

119879120601 (1199091 1199092 119909

119870 119906)

1199091015840

+1119870= 120578119870

119879120601 (1199091 1199092 119909

119870 119906)

(7) Predict the reward 1199031015840+1

= 120589119905

T(1199091 1199092 119909

119870 119906119905)

(8) Update the eligibility 119890(119909) = 120601(119909

) 119909= 119909 120582120574119890

minus1(119909) 119909

= 119909

(9) Compute the TD error 120575= 119903119905+1

+ 120574119881(1199091015840

+1 120579) minus 119881(119909

120579)

(10) Update the value function parameter 120579+1

= 120579+ 120572119888120575119890(119909)

(11) Update the policy parameter 120573+1

= 120573+ 120572119886120575Δ119906120601(119909)

(12) If le 119879

(13) = + 1

(14) End If(15) Update the number of samples number = number + 1(16) End Repeat(17) End LoopOutput 120573 120579

Algorithm 4 Global model planning (120579 120573 1205781 120578

119896 120589 119890 number 119875

119892)

The goal of learning the value function is to minimize thesquare of the TD-error which is shown as

min120579

numbersum

119905=0

10038171003817100381710038171003817119903119905+1

+ 120574119881 (1199091015840

119905+1 120579119905) minus 119881 (119909

119905 120579119905)10038171003817100381710038171003817

+ ℓ119888

10038171003817100381710038171205791199051003817100381710038171003817

2

(17)

where number represents the number of the samples ℓ119888ge 0

is the regularization parameter of the critic 1205791199052 is the ℓ

2-

regularization which penalizes the growth of the parametervector 120579

119905 so that the overfitting to noise samples can be

avoided

The update for the parameter 120579 of the value function inRAC-HMLP is represented as

120579119905+1

= 120579119905(1 minus

120572119888

numberℓ119888) +

120572119888

number

numbersum

119904=0

119890119904(119909) 120575119904 (18)

The update for the parameter 120573 of the policy is shown as

120573119905+1

= 120573119905(1 minus

120572119886

numberℓ119886)

+120572119886

number

numbersum

119904=0

120601 (119909119904) 120575119904Δ119906119904

(19)

where ℓ119886ge 0 is the regularization parameter of the actor

8 Computational Intelligence and Neuroscience

F

w

Figure 1 Pole balancing problem

The update for the global model can be denoted as

120578119905+11

= 1205781199051(1 minus

120572119898

numberℓ119898) +

120572119898

number

sdot

numbersum

119904=1

120601 (1199091199041 119909

119904119870 119906119904) (119909119904+11

minus 1199091015840

119904+11)

120578119905+12

= 1205781199052(1 minus

120572119898

numberℓ119898) +

120572119898

number

sdot

numbersum

119904=1

120601 (1199091199041 119909

119904119870 119906119904) (119909119904+12

minus 1199091015840

119904+12)

120578119905+1119870

= 120578119905119870

(1 minus120572119898

numberℓ119898) +

120572119898

number

sdot

numbersum

119904=1

120601 (1199091199041 119909

119904119870 119906119904) (119909119904+1119870

minus 1199091015840

119904+11)

(20)

120589119905+1

= 120589119905(1 minus

120572119898

numberℓ119898) +

120572119898

number

sdot

numbersum

119904=1

120601 (1199091199041 119909

119904119870 119906119904) (119903119904+1

minus 1199031015840

119904+1)

(21)

where ℓ119898ge 0 is the regularization parameter for the model

namely the state transition function and the reward functionAfter we replace the update equations of the parameters

in Algorithms 2 3 and 4 with (18) (19) (20) and (21) we willget the resultant algorithm RAC-HMLP Except for the aboveupdate equations the other parts of RAC-HMLP are the samewith AC-HMLP so we will not specify here

5 Empirical Results and Analysis

AC-HMLP and RAC-HMLP are compared with S-ACMLAC andDyna-MLAC on two continuous state and actionspaces problems pole balancing problem [31] and continuousmaze problem [32]

51 Pole Balancing Problem Pole balancing problem is a low-dimension but challenging benchmark problem widely usedin RL literature shown in Figure 1

There is a car moving along the track with a hinged poleon its topThegoal is to find a policywhich can guide the force

to keep the pole balanceThe system dynamics is modeled bythe following equation

=119892 sin (119908) minus V119898119897 sin (2119908) 2 minus V cos (119908) 119865

41198973 minus V119898119897cos2 (119908) (22)

where 119908 is the angle of the pole with the vertical line and are the angular velocity and the angular acceleration of thepole 119865 is the force exerted on the cart The negative valuemeans the force to the right and otherwise means to the left119892 is the gravity constant with the value 119892 = 981ms2119898 and119897 are the length and the mass of the pole which are set to119898 =

20 kg and 119897 = 05m respectively V is a constantwith the value1(119898 + 119898

119888) where119898

119888= 80 kg is the mass of the car

The state 119909 = (119908 ) is composed of 119908 and which arebounded by [minus120587 120587] and [minus2 2] respectivelyThe action is theforce 119865 bounded by [minus50N 50N] The maximal episode is300 and the maximal time step for each episode is 3000 Ifthe angle of the pole with the vertical line is not exceeding1205874at each time step the agent will receive a reward 1 otherwisethe pole will fall down and receive a reward minus1 There is alsoa random noise applying on the force bounded by [minus10N10N] An episode ends when the pole has kept balance for3000 time steps or the pole falls downThe pole will fall downif |119908| ge 1205874 To approximate the value function and thereward function the state-based feature is coded by radialbasis functions (RBFs) shown as

120601119894(119909) = 119890

minus(12)(119909minus119888119894)T119861minus1(119909minus119888119894) (23)

where 119888119894denotes the 119894th center point locating over the grid

points minus06 minus04 minus02 0 02 04 06 times minus2 0 2 119861 is adiagonal matrix containing the widths of the RBFs with 1205902

119908=

02 and 1205902= 2The dimensionality 119911 of 120601(119909) is 21 Notice that

the approximations of the value function and the policy onlyrequire the state-based feature However the approximationof the model requires the state-action-based feature Let 119909 be(119908 119906) we can also compute its feature by (23) In this case119888119894locates over the points minus06 minus04 minus02 0 02 04 06 times

minus2 0 2 times minus50 minus25 0 25 50 The diagonal elements of 119861are 1205902119908= 02 1205902

= 2 and 120590

2

119865= 10 The dimensionality 119911

of 120601(119908 119906) is 105 Either the state-based or the state-action-based feature has to be normalized so as to make its value besmaller than 1 shown as

120601119894(119909) =

120601119894(119909)

sum119911

119894=1120601119894(119909)

(24)

where 119911 gt 0 is the dimensionality of the featureRAC-HMLP and AC-HMLP are compared with S-AC

MLAC and Dyna-MLAC on this experiment The parame-ters of S-AC MLAC and Dyna-MLAC are set according tothe values mentioned in their papers The parameters of AC-HMLP and RAC-HMLP are set as shown in Table 1

The planning times may have significant effect on theconvergence performance Therefore we have to determinethe local planning times 119875

119897and the global planning times 119875

119892

at first The selective settings for the two parameters are (30300) (50 300) (30 600) and (50 600) From Figure 2(a) it

Computational Intelligence and Neuroscience 9

Table 1 Parameters settings of RAC-HMLP and AC-HMLP

Parameter Symbol ValueTime step 119879

11990401

Discount factor 120574 09Trace-decay rate 120582 09Exploration variance 120590

2 1Learning rate of the actor 120572

11988605

Learning rate of the critic 120572119888

04Learning rate of the model 120572

11989805

Error threshold 120585 015Capacity of the memory 119872 size 100Number of the nearest samples 119871 9Local planning times 119875

11989730

Global planning times 119875119892 300

Number of components of the state K 2Regularization parameter of the model ℓ

11989802

Regularization parameter of the critic ℓ119888

001Regularization parameter of the actor ℓ

1198860001

is easy to find out that (119875119897= 30119875

119892= 300) behaves best in these

four settings with 41 episodes for convergence (119875119897= 30 119875

119892=

600) learns fastest in the early 29 episodes but it convergesuntil the 52nd episode (119875

119897= 30 119875

119892= 600) and (119875

119897= 30 119875

119892=

300) have identical local planning times but different globalplanning times Generally the more the planning times thefaster the convergence rate but (119875

119897= 30 119875

119892= 300) performs

better instead The global model is not accurate enough atthe initial time Planning through such an inaccurate globalmodel will lead to an unstable performance Notice that (119875

119897=

50 119875119892= 300) and (119875

119897= 50 119875

119892= 600) behave poorer than (119875

119897

= 30 119875119892= 600) and (119875

119897= 30 119875

119892= 300) which demonstrates

that planning too much via the local model will not performbetter (119875

119897= 50119875

119892= 300) seems to converge at the 51st episode

but its learning curve fluctuates heavily after 365 episodes notconverging any more until the end Like (119875

119897= 50 119875

119892= 300)

(119875119897= 50 119875

119892= 600) also has heavy fluctuation but converges

at the 373rd episode Evidently (119875119897= 50 119875

119892= 300) performs

slightly better than (119875119897= 50 119875

119892= 300) Planning through the

global model might solve the nonstability problem caused byplanning of the local model

The convergence performances of the fivemethods RAC-HMLP AC-HMLP S-AC MLAC and Dyna-MLAC areshown in Figure 2(b) It is evident that our methods RAC-HMLP and AC-HMLP have the best convergence perfor-mances RAC-HMLP and AC-HMLP converge at the 39thand 41st episodes Both of them learn quickly in the primaryphase but the learning curve of RAC-HMLP seems to besteeper than that of AC-HMLP Dyna-MLAC learns fasterthan MLAC and S-AC in the former 74 episodes but itconverges until the 99th episode Though MLAC behavespoorer than Dyna-MLAC it requires just 82 episodes toconverge The method with the slowest learning rate is S-AC where the pole can keep balance for 3000 time stepsfor the first time at the 251st episode Unfortunately it con-verges until the 333rd episode RAC-HMLP converges fastest

0

500

1000

1500

2000

2500

3000

Num

ber o

f bal

ance

d tim

e ste

ps

50 100 150 200 250 300 350 4000Training episodes

Pl = 30 Pg = 300

Pl = 50 Pg = 300

Pl = 30 Pg = 600

Pl = 50 Pg = 600

(a) Determination of different planning times

0

500

1000

1500

2000

2500

3000

Num

ber o

f bal

ance

d tim

e ste

ps

50 100 150 200 250 300 350 4000Training episodes

S-ACMLACDyna-MLAC

AC-HMLPRAC-HMLP

(b) Comparisons of different algorithms in convergence performance

Figure 2 Comparisons of different planning times and differentalgorithms

which might be caused by introducing the ℓ2-regularization

Because the ℓ2-regularization does not admit the parameter

to grow rapidly the overfitting of the learning process canbe avoided effectively Dyna-MLAC converges faster thanMLAC in the early 74 episodes but its performance is notstable enough embodied in the heavy fluctuation from the75th to 99th episode If the approximate-local model is dis-tinguished largely from the real-local model then planningthrough such an inaccurate local model might lead to anunstable performance S-AC as the only method withoutmodel learning behaves poorest among the fiveThese resultsshow that the convergence performance can be improvedlargely by introducing model learning

The comparisons of the five different algorithms insample efficiency are shown in Figure 3 It is clear that thenumbers of the required samples for S-AC MLAC Dyna-MLAC AC-HMLP and RAC-HMLP to converge are 56003

10 Computational Intelligence and Neuroscience

S-ACMLACDyna-MLAC

AC-HMLPRAC-HMLP

50 100 150 200 250 300 350 4000Training episodes

0

1

2

3

4

5

6

Num

ber o

f sam

ples

to co

nver

ge

times104

Figure 3 Comparisons of sample efficiency

35535 32043 14426 and 12830 respectively Evidently RAC-HMLP converges fastest and behaves stably all the timethus requiring the least samplesThough AC-HMLP requiresmore samples to converge than RAC-HMLP it still requiresfar fewer samples than the other three methods Because S-AC does not utilize the model it requires the most samplesamong the five Unlike MLAC Dyna-MLAC applies themodel not only in updating the policy but also in planningresulting in a fewer requirement for the samples

The optimal policy and optimal value function learned by AC-HMLP after the training ends are shown in Figures 4(a) and 4(b), respectively, while the ones learned by RAC-HMLP are shown in Figures 4(c) and 4(d). Evidently, the optimal policy and the optimal value function learned by AC-HMLP and RAC-HMLP are quite similar, but RAC-HMLP seems to have a more fine-grained optimal policy and value function.

As for the optimal policy (see Figures 4(a) and 4(c)), the force becomes smaller and smaller from the two sides (left side and right side) to the middle. The top-right is the region requiring a force close to 50 N, where the direction of the angle is the same as that of the angular velocity. There the values of the angle and the angular velocity approach the maximum values π/4 and 2; therefore, the largest force to the left is required so as to guarantee that the angle between the upright line and the pole is no bigger than π/4. It is the opposite in the bottom-left region, where a force in [−50 N, −40 N] is required to keep the angle from surpassing −π/4. The pole can keep balance with a gentle force close to 0 in the middle region. The direction of the angle is different from that of the angular velocity in the top-left and bottom-right regions; thus a force whose absolute value is relatively large but smaller than 50 N is required.

In terms of the optimal value function (see Figures 4(b) and 4(d)), the value function reaches its maximum in the region satisfying −2 ≤ w ≤ 2. The pole is more prone to keep balance in this region even without applying any force, resulting in the relatively larger value function. The pole is more and more difficult to keep in balance from the middle toward the two sides, with the value function also decaying gradually. The finer-grained values on the left side compared with the right side might be caused by more frequent visitation to the left side; more visitation leads to a more accurate estimation of the value function.

After the training ends, the prediction of the next state and the reward for every state-action pair can be obtained through the learned model. The predictions of the next angle w, the next angular velocity, and the reward for any possible state are shown in Figures 5(a), 5(b), and 5(c). It is noticeable that the predicted next state is always near the current state in Figures 5(a) and 5(b). The received reward is always larger than 0 in Figure 5(c), which illustrates that the pole can always keep balance under the optimal policy.
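As a quick illustration of how such predictions are read off the learned global model, the sketch below evaluates the linear model: each component of the next state is predicted as η_k᙭φ(x, u) and the reward as ζ᙭φ(x, u), where φ is the normalized state-action feature. The function and argument names are placeholders, and the parameter shapes are assumptions consistent with the text.

```python
import numpy as np

# Illustrative sketch of one-step prediction with the linear global model
# (one parameter vector eta_k per state component, zeta for the reward).
# `phi` is assumed to return the normalized state-action feature vector.

def predict_with_global_model(x, u, eta, zeta, phi):
    """eta: (K, z) state-model parameters; zeta: (z,) reward parameters."""
    features = phi(x, u)            # normalized RBF feature vector, shape (z,)
    x_next_pred = eta @ features    # predicted next state, one entry per component
    r_pred = zeta @ features        # predicted reward
    return x_next_pred, r_pred
```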

5.2. Continuous Maze Problem. The continuous maze problem is shown in Figure 6, where the blue lines with coordinates represent the barriers. The state (x, y) consists of the horizontal coordinate x ∈ [0, 1] and the vertical coordinate y ∈ [0, 1]. Starting from the position "Start", the goal of the agent is to reach the position "Goal", which satisfies x + y > 1.8. The action is to choose an angle bounded by [−π, π] as the new direction and then walk along this direction with a step of 0.1. The agent receives a reward of −1 at each time step. The agent is punished with a reward of −400 multiplied by the distance if the distance between its position and the barrier exceeds 0.1. The parameters are set the same as in the former problem except for the planning times; the local and global planning times are determined as 10 and 50 in the same way as in the former experiment. The state-based feature is computed according to (23) and (24) with dimensionality z = 16. The center points c_i are located over the grid points {0.2, 0.4, 0.6, 0.8} × {0.2, 0.4, 0.6, 0.8}, and the diagonal elements of B are σ²_x = 0.2 and σ²_y = 0.2. The state-action-based feature is also computed according to (23) and (24); the center points c_i are located over the grid points {0.2, 0.4, 0.6, 0.8} × {0.2, 0.4, 0.6, 0.8} × {−3, −1.5, 0, 1.5, 3}, with dimensionality z = 80, and the diagonal elements of B are σ²_x = 0.2, σ²_y = 0.2, and σ²_u = 1.5.
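The following short sketch shows how the normalized RBF state feature described above could be computed for the maze task, assuming the 4 × 4 grid of centers over {0.2, 0.4, 0.6, 0.8}² and widths σ²_x = σ²_y = 0.2; the function name is illustrative.

```python
import numpy as np
from itertools import product

# Hedged sketch of the normalized RBF features of (23) and (24) for the maze state:
# a 4 x 4 grid of centers with diagonal widths 0.2, normalized to sum to one.

centers = np.array(list(product([0.2, 0.4, 0.6, 0.8], repeat=2)))   # 16 centers
B_inv = np.diag([1.0 / 0.2, 1.0 / 0.2])                             # inverse widths

def state_feature(x):
    """Normalized RBF feature of a maze state x = (coordinate_x, coordinate_y)."""
    diff = centers - np.asarray(x)                                   # (16, 2)
    phi = np.exp(-0.5 * np.sum(diff @ B_inv * diff, axis=1))         # Gaussian bumps
    return phi / phi.sum()                                           # normalization (24)

print(state_feature([0.1, 0.9]).shape)   # (16,) -> z = 16, as stated in the text
```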

RAC-HMLP and AC-HMLP are compared with S-AC, MLAC, and Dyna-MLAC in terms of cumulative rewards. The results are shown in Figure 7(a). RAC-HMLP learns fastest in the early 15 episodes, MLAC behaves second best, and AC-HMLP performs poorest. RAC-HMLP tends to converge at the 24th episode, but it does not really converge until the 39th episode. AC-HMLP behaves steadily from the 23rd episode to the end, and it converges at the 43rd episode. As in the former experiment, RAC-HMLP performs slightly better than AC-HMLP, as reflected in the cumulative rewards: −44 compared to −46. At the 53rd episode, the cumulative rewards of S-AC fall quickly to about −545 without any subsequent ascent. Though MLAC and Dyna-MLAC perform well in the primary phase, their curves start to descend at the 88th episode and the 93rd episode, respectively. MLAC and Dyna-MLAC learn the local model to update the policy gradient, resulting in a fast learning rate in the primary phase.


Figure 4: Optimal policy and value function learned by AC-HMLP and RAC-HMLP. (a) Optimal policy of AC-HMLP learned after training. (b) Optimal value function of AC-HMLP learned after training. (c) Optimal policy of RAC-HMLP learned after training. (d) Optimal value function of RAC-HMLP learned after training. Each panel plots angular velocity (rad/s) against angle (rad).

However, they do not behave stably enough near the end of the training; too much visitation to the barrier region might cause the fast decline of the cumulative rewards.

The comparisons of the time steps for reaching the goal are shown in Figure 7(b). Obviously, RAC-HMLP and AC-HMLP perform better than the other three methods. RAC-HMLP converges at the 39th episode while AC-HMLP converges at the 43rd episode, with the time steps for reaching the goal being 45 and 47, respectively. It is clear that RAC-HMLP still performs slightly better than AC-HMLP. The time steps for reaching the goal are 548, 69, and 201 for S-AC, MLAC, and Dyna-MLAC. Among the five methods, RAC-HMLP not only converges fastest but also has the best solution, 45. The poorest performance of S-AC illustrates that model learning can definitely improve the convergence performance. As in the former experiment, Dyna-MLAC behaves more poorly than MLAC during training: if the model is inaccurate, planning via such a model might corrupt the estimations of the value function and the policy, thus leading to the poor performance of Dyna-MLAC.

The comparisons of sample efficiency are shown in Figure 8. The samples required for S-AC, MLAC, Dyna-MLAC, AC-HMLP, and RAC-HMLP to converge are 10,595, 6,588, 7,694, 4,388, and 4,062, respectively. As in the pole balancing problem, RAC-HMLP requires the fewest samples while S-AC needs the most to converge. The difference is that Dyna-MLAC requires slightly more samples than MLAC. The ascending curve of Dyna-MLAC at the end of the training demonstrates that it has not converged. The frequent visitation to the barrier in Dyna-MLAC leads to the quick descent of the value functions; as a result, enormous numbers of samples are required to bring these value functions closer to the true ones.

After the training ends, the approximate optimal policy and value function are obtained, as shown in Figures 9(a) and 9(b). It is noticeable that the lower part of the figure is explored thoroughly, so the policy and the value function are very distinctive in this region. For example, in most of the region in Figure 9(a), a larger angle is needed for the agent to leave the current state. Clearly, the nearer the current state is to the lower part of Figure 9(b), the smaller the corresponding value function. The top part of the figures may be visited infrequently or not at all by the agent, resulting in similar value functions; the agent is able to reach the goal with only a gentle angle in these areas.

6. Conclusion and Discussion

This paper proposes two novel actor-critic algorithms, AC-HMLP and RAC-HMLP. Both of them take LLR and LFA to represent the local model and the global model, respectively.


Figure 5: Prediction of the next state and reward according to the global model. (a) Prediction of the angle at the next state. (b) Prediction of the angular velocity at the next state. (c) Prediction of the reward. Each panel plots angular velocity (rad/s) against angle (rad).

Figure 6: Continuous maze problem. The barriers are drawn with endpoint coordinates (0.3, 0.7), (0.55, 0.7), (0.55, 0.32), and (0.55, 0.85) inside the unit square; the agent must travel from the position "Start" to the position "Goal" (the region x + y > 1.8).

It has been shown that our new methods are able to learn the optimal value function and the optimal policy. In the pole balancing and continuous maze problems, RAC-HMLP and AC-HMLP are compared with three representative methods. The results show that RAC-HMLP and AC-HMLP not only converge fastest but also have the best sample efficiency.

RAC-HMLP performs slightly better than AC-HMLP in convergence rate and sample efficiency. By introducing ℓ2-regularization, the parameters learned by RAC-HMLP are smaller and more uniform than those of AC-HMLP, so that overfitting can be avoided effectively. Though RAC-HMLP behaves better, its improvement over AC-HMLP is not significant.


Figure 7: Comparisons of cumulative rewards and time steps for reaching the goal. (a) Comparisons of cumulative rewards. (b) Comparisons of time steps for reaching the goal. Both panels plot their quantity against training episodes for S-AC, MLAC, Dyna-MLAC, AC-HMLP, and RAC-HMLP.

Figure 8: Comparisons of different algorithms in sample efficiency (number of samples to converge against training episodes for S-AC, MLAC, Dyna-MLAC, AC-HMLP, and RAC-HMLP).

Because AC-HMLP normalizes all the features to [0, 1], the parameters will not change heavily; as a result, overfitting can also be prevented to a certain extent.
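A rough sketch of the contrast being drawn here, under the paper's notation (α_c, ℓ_c, and the sample count `number`): the RAC-HMLP-style update shrinks the critic parameters by a factor depending on ℓ_c, while the plain AC-HMLP update leaves the parameter norm unconstrained and relies on feature normalization. The concrete function and array names are placeholders.

```python
import numpy as np

# Sketch of the difference discussed here: an l2-regularized critic update
# (RAC-HMLP style) shrinks the parameter vector at each update, while the plain
# TD update (AC-HMLP style) does not.

def critic_update_plain(theta, alpha_c, delta, eligibility):
    # Plain TD(lambda)-style step: theta <- theta + alpha_c * delta * e(x)
    return theta + alpha_c * delta * eligibility

def critic_update_regularized(theta, alpha_c, l_c, number, deltas, eligibilities):
    # l2 penalty keeps theta small; the data term averages over collected samples.
    shrink = 1.0 - (alpha_c / number) * l_c
    batch_term = (alpha_c / number) * np.sum(
        [d * e for d, e in zip(deltas, eligibilities)], axis=0)
    return theta * shrink + batch_term
```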

S-AC is the only algorithm without model learning. The poorest performance of S-AC demonstrates that combining model learning with AC can really improve performance. Dyna-MLAC learns a model via LLR for local planning and policy updating; however, Dyna-MLAC directly utilizes the model without first making sure that it is accurate, and it does not utilize the global information about the samples. Therefore, it behaves more poorly than AC-HMLP and RAC-HMLP. MLAC also approximates a local model via LLR, as Dyna-MLAC does, but it only takes the model to update the policy gradient, surprisingly with a slightly better performance than Dyna-MLAC.

Dyna-MLAC and MLAC approximate the value function, the policy, and the model through LLR. In the LLR approach, the samples collected in the interaction with the environment have to be stored in memory; through KNN or a K-d tree, only the L-nearest samples in the memory are selected to learn the LLR. Such a learning process incurs considerable computation and memory costs. In AC-HMLP and RAC-HMLP, there is only a parameter vector to be stored and learned for each of the value function, the policy, and the global model. Therefore, AC-HMLP and RAC-HMLP also outperform Dyna-MLAC and MLAC in computation and memory costs.
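To illustrate the cost difference, the sketch below fits a local linear model from the L nearest stored samples in the manner described for LLR (Γ = Y Xᵀ(XXᵀ)⁻¹); the brute-force nearest-neighbor search and the helper name are illustrative simplifications rather than the authors' implementation.

```python
import numpy as np

# Sketch of the per-query cost of an LLR local model: keep a sample memory,
# find the L nearest samples, and solve a small least-squares problem
# (Gamma = Y X^T (X X^T)^(-1) in the paper's notation). By contrast, the global
# LFA model only stores and incrementally updates fixed-size parameter vectors.

def fit_local_model(memory, query, L=9):
    """memory: list of (x, u, x_next, r) tuples; query: current state of length K."""
    dists = [np.linalg.norm(np.asarray(s[0]) - np.asarray(query)) for s in memory]
    nearest = [memory[i] for i in np.argsort(dists)[:L]]

    # Columns of X are [x; u; 1], columns of Y are [x_next; r].
    X = np.array([list(x) + [u, 1.0] for x, u, x_next, r in nearest]).T
    Y = np.array([list(x_next) + [r] for x, u, x_next, r in nearest]).T

    Gamma = Y @ X.T @ np.linalg.pinv(X @ X.T)   # pinv guards against singular X X^T
    return Gamma                                # predict via Gamma @ [x; u; 1]
```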

The planning times for the local model and the global model have to be determined according to the experimental performance; thus, we have to set the planning times separately for each domain. To address this problem, our future work will consider how to determine the planning times adaptively for different domains. Moreover, with the development of deep learning [33, 34], deep RL has succeeded in many applications, such as AlphaGo and Atari 2600, by taking visual images or videos as input. However, how to accelerate the learning process in deep RL through model learning is still an open question. Future work will endeavor to combine deep RL with hierarchical model learning and planning to solve real-world problems.

Competing Interests

The authors declare that there are no competing interests regarding the publication of this article.


Figure 9: Final optimal policy and value function after training. (a) Optimal policy learned by AC-HMLP. (b) Optimal value function learned by AC-HMLP. Both panels plot coordinate y against coordinate x over [0, 1] × [0, 1].

Acknowledgments

This research was partially supported by the Innovation Center of Novel Software Technology and Industrialization, the National Natural Science Foundation of China (61502323, 61502329, 61272005, 61303108, 61373094, and 61472262), the Natural Science Foundation of Jiangsu (BK2012616), the High School Natural Foundation of Jiangsu (13KJB520020), the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172014K04), and the Suzhou Industrial Application of Basic Research Program Part (SYG201422).

References

[1] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, Mass, USA, 1998.
[2] M. L. Littman, "Reinforcement learning improves behaviour from evaluative feedback," Nature, vol. 521, no. 7553, pp. 445–451, 2015.
[3] D. Silver, R. S. Sutton, and M. Muller, "Temporal-difference search in computer Go," Machine Learning, vol. 87, no. 2, pp. 183–219, 2012.
[4] J. Kober and J. Peters, "Reinforcement learning in robotics: a survey," in Reinforcement Learning, pp. 579–610, Springer, Berlin, Germany, 2012.
[5] D.-R. Liu, H.-L. Li, and D. Wang, "Feature selection and feature learning for high-dimensional batch reinforcement learning: a survey," International Journal of Automation and Computing, vol. 12, no. 3, pp. 229–242, 2015.
[6] R. E. Bellman, Dynamic Programming, Princeton University Press, Princeton, NJ, USA, 1957.
[7] D. P. Bertsekas, Dynamic Programming and Optimal Control, Athena Scientific, Belmont, Mass, USA, 1995.
[8] F. L. Lewis and D. R. Liu, Reinforcement Learning and Approximate Dynamic Programming for Feedback Control, John Wiley & Sons, Hoboken, NJ, USA, 2012.
[9] P. J. Werbos, "Advanced forecasting methods for global crisis warning and models of intelligence," General Systems, vol. 22, pp. 25–38, 1977.
[10] D. R. Liu, H. L. Li, and D. Wang, "Data-based self-learning optimal control: research progress and prospects," Acta Automatica Sinica, vol. 39, no. 11, pp. 1858–1870, 2013.
[11] H. G. Zhang, D. R. Liu, Y. H. Luo, and D. Wang, Adaptive Dynamic Programming for Control: Algorithms and Stability, Communications and Control Engineering Series, Springer, London, UK, 2013.
[12] A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, "Discrete-time nonlinear HJB solution using approximate dynamic programming: convergence proof," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 38, no. 4, pp. 943–949, 2008.
[13] T. Dierks, B. T. Thumati, and S. Jagannathan, "Optimal control of unknown affine nonlinear discrete-time systems using offline-trained neural networks with proof of convergence," Neural Networks, vol. 22, no. 5-6, pp. 851–860, 2009.
[14] F.-Y. Wang, N. Jin, D. Liu, and Q. Wei, "Adaptive dynamic programming for finite-horizon optimal control of discrete-time nonlinear systems with ε-error bound," IEEE Transactions on Neural Networks, vol. 22, no. 1, pp. 24–36, 2011.
[15] K. Doya, "Reinforcement learning in continuous time and space," Neural Computation, vol. 12, no. 1, pp. 219–245, 2000.
[16] J. J. Murray, C. J. Cox, G. G. Lendaris, and R. Saeks, "Adaptive dynamic programming," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 32, no. 2, pp. 140–153, 2002.
[17] T. Hanselmann, L. Noakes, and A. Zaknich, "Continuous-time adaptive critics," IEEE Transactions on Neural Networks, vol. 18, no. 3, pp. 631–647, 2007.
[18] S. Bhasin, R. Kamalapurkar, M. Johnson, K. G. Vamvoudakis, F. L. Lewis, and W. E. Dixon, "A novel actor-critic-identifier architecture for approximate optimal control of uncertain nonlinear systems," Automatica, vol. 49, no. 1, pp. 82–92, 2013.
[19] L. Busoniu, R. Babuska, B. De Schutter, et al., Reinforcement Learning and Dynamic Programming Using Function Approximators, CRC Press, New York, NY, USA, 2010.
[20] A. G. Barto, R. S. Sutton, and C. W. Anderson, "Neuronlike adaptive elements that can solve difficult learning control problems," IEEE Transactions on Systems, Man, and Cybernetics, vol. 13, no. 5, pp. 834–846, 1983.
[21] H. Van Hasselt and M. A. Wiering, "Reinforcement learning in continuous action spaces," in Proceedings of the IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning (ADPRL '07), pp. 272–279, Honolulu, Hawaii, USA, April 2007.
[22] S. Bhatnagar, R. S. Sutton, M. Ghavamzadeh, and M. Lee, "Natural actor-critic algorithms," Automatica, vol. 45, no. 11, pp. 2471–2482, 2009.
[23] I. Grondman, L. Busoniu, G. A. D. Lopes, and R. Babuska, "A survey of actor-critic reinforcement learning: standard and natural policy gradients," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 42, no. 6, pp. 1291–1307, 2012.
[24] I. Grondman, L. Busoniu, and R. Babuska, "Model learning actor-critic algorithms: performance evaluation in a motion control task," in Proceedings of the 51st IEEE Conference on Decision and Control (CDC '12), pp. 5272–5277, Maui, Hawaii, USA, December 2012.
[25] I. Grondman, M. Vaandrager, L. Busoniu, R. Babuska, and E. Schuitema, "Efficient model learning methods for actor-critic control," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 42, no. 3, pp. 591–602, 2012.
[26] B. Costa, W. Caarls, and D. S. Menasche, "Dyna-MLAC: trading computational and sample complexities in actor-critic reinforcement learning," in Proceedings of the Brazilian Conference on Intelligent Systems (BRACIS '15), pp. 37–42, Natal, Brazil, November 2015.
[27] O. Andersson, F. Heintz, and P. Doherty, "Model-based reinforcement learning in continuous environments using real-time constrained optimization," in Proceedings of the 29th AAAI Conference on Artificial Intelligence (AAAI '15), pp. 644–650, Texas, Tex, USA, January 2015.
[28] R. S. Sutton, C. Szepesvari, A. Geramifard, and M. Bowling, "Dyna-style planning with linear function approximation and prioritized sweeping," in Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence (UAI '08), pp. 528–536, July 2008.
[29] R. Faulkner and D. Precup, "Dyna planning using a feature based generative model," in Proceedings of the Advances in Neural Information Processing Systems (NIPS '10), pp. 1–9, Vancouver, Canada, December 2010.
[30] H. Yao and C. Szepesvari, "Approximate policy iteration with linear action models," in Proceedings of the 26th National Conference on Artificial Intelligence (AAAI '12), Toronto, Canada, July 2012.
[31] H. R. Berenji and P. Khedkar, "Learning and tuning fuzzy logic controllers through reinforcements," IEEE Transactions on Neural Networks, vol. 3, no. 5, pp. 724–740, 1992.
[32] R. S. Sutton, "Generalization in reinforcement learning: successful examples using sparse coarse coding," in Advances in Neural Information Processing Systems, pp. 1038–1044, MIT Press, New York, NY, USA, 1996.
[33] D. Silver, A. Huang, C. J. Maddison, et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[34] V. Mnih, K. Kavukcuoglu, D. Silver, et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.



Computational Intelligence and Neuroscience 9

Table 1 Parameters settings of RAC-HMLP and AC-HMLP

Parameter Symbol ValueTime step 119879

11990401

Discount factor 120574 09Trace-decay rate 120582 09Exploration variance 120590

2 1Learning rate of the actor 120572

11988605

Learning rate of the critic 120572119888

04Learning rate of the model 120572

11989805

Error threshold 120585 015Capacity of the memory 119872 size 100Number of the nearest samples 119871 9Local planning times 119875

11989730

Global planning times 119875119892 300

Number of components of the state K 2Regularization parameter of the model ℓ

11989802

Regularization parameter of the critic ℓ119888

001Regularization parameter of the actor ℓ

1198860001

is easy to find out that (119875119897= 30119875

119892= 300) behaves best in these

four settings with 41 episodes for convergence (119875119897= 30 119875

119892=

600) learns fastest in the early 29 episodes but it convergesuntil the 52nd episode (119875

119897= 30 119875

119892= 600) and (119875

119897= 30 119875

119892=

300) have identical local planning times but different globalplanning times Generally the more the planning times thefaster the convergence rate but (119875

119897= 30 119875

119892= 300) performs

better instead The global model is not accurate enough atthe initial time Planning through such an inaccurate globalmodel will lead to an unstable performance Notice that (119875

119897=

50 119875119892= 300) and (119875

119897= 50 119875

119892= 600) behave poorer than (119875

119897

= 30 119875119892= 600) and (119875

119897= 30 119875

119892= 300) which demonstrates

that planning too much via the local model will not performbetter (119875

119897= 50119875

119892= 300) seems to converge at the 51st episode

but its learning curve fluctuates heavily after 365 episodes notconverging any more until the end Like (119875

119897= 50 119875

119892= 300)

(119875119897= 50 119875

119892= 600) also has heavy fluctuation but converges

at the 373rd episode Evidently (119875119897= 50 119875

119892= 300) performs

slightly better than (119875119897= 50 119875

119892= 300) Planning through the

global model might solve the nonstability problem caused byplanning of the local model

The convergence performances of the fivemethods RAC-HMLP AC-HMLP S-AC MLAC and Dyna-MLAC areshown in Figure 2(b) It is evident that our methods RAC-HMLP and AC-HMLP have the best convergence perfor-mances RAC-HMLP and AC-HMLP converge at the 39thand 41st episodes Both of them learn quickly in the primaryphase but the learning curve of RAC-HMLP seems to besteeper than that of AC-HMLP Dyna-MLAC learns fasterthan MLAC and S-AC in the former 74 episodes but itconverges until the 99th episode Though MLAC behavespoorer than Dyna-MLAC it requires just 82 episodes toconverge The method with the slowest learning rate is S-AC where the pole can keep balance for 3000 time stepsfor the first time at the 251st episode Unfortunately it con-verges until the 333rd episode RAC-HMLP converges fastest

0

500

1000

1500

2000

2500

3000

Num

ber o

f bal

ance

d tim

e ste

ps

50 100 150 200 250 300 350 4000Training episodes

Pl = 30 Pg = 300

Pl = 50 Pg = 300

Pl = 30 Pg = 600

Pl = 50 Pg = 600

(a) Determination of different planning times

0

500

1000

1500

2000

2500

3000

Num

ber o

f bal

ance

d tim

e ste

ps

50 100 150 200 250 300 350 4000Training episodes

S-ACMLACDyna-MLAC

AC-HMLPRAC-HMLP

(b) Comparisons of different algorithms in convergence performance

Figure 2 Comparisons of different planning times and differentalgorithms

which might be caused by introducing the ℓ2-regularization

Because the ℓ2-regularization does not admit the parameter

to grow rapidly the overfitting of the learning process canbe avoided effectively Dyna-MLAC converges faster thanMLAC in the early 74 episodes but its performance is notstable enough embodied in the heavy fluctuation from the75th to 99th episode If the approximate-local model is dis-tinguished largely from the real-local model then planningthrough such an inaccurate local model might lead to anunstable performance S-AC as the only method withoutmodel learning behaves poorest among the fiveThese resultsshow that the convergence performance can be improvedlargely by introducing model learning

The comparisons of the five different algorithms insample efficiency are shown in Figure 3 It is clear that thenumbers of the required samples for S-AC MLAC Dyna-MLAC AC-HMLP and RAC-HMLP to converge are 56003

10 Computational Intelligence and Neuroscience

S-ACMLACDyna-MLAC

AC-HMLPRAC-HMLP

50 100 150 200 250 300 350 4000Training episodes

0

1

2

3

4

5

6

Num

ber o

f sam

ples

to co

nver

ge

times104

Figure 3 Comparisons of sample efficiency

35535 32043 14426 and 12830 respectively Evidently RAC-HMLP converges fastest and behaves stably all the timethus requiring the least samplesThough AC-HMLP requiresmore samples to converge than RAC-HMLP it still requiresfar fewer samples than the other three methods Because S-AC does not utilize the model it requires the most samplesamong the five Unlike MLAC Dyna-MLAC applies themodel not only in updating the policy but also in planningresulting in a fewer requirement for the samples

The optimal policy and optimal value function learned byAC-HMLP after the training ends are shown in Figures 4(a)and 4(b) respectively while the ones learned by RAC-HMLPare shown in Figures 4(c) and 4(d) Evidently the optimalpolicy and the optimal value function learned by AC-HMLPand RAC-HMLP are quite similar but RAC-HMLP seems tohave more fine-grained optimal policy and value function

As for the optimal policy (see Figures 4(a) and 4(c))the force becomes smaller and smaller from the two sides(left side and right side) to the middle The top-right is theregion requiring a force close to 50N where the directionof the angle is the same with that of angular velocity Thevalues of the angle and the angular velocity are nearing themaximum values 1205874 and 2Therefore the largest force to theleft is required so as to guarantee that the angle between theupright line and the pole is no bigger than 1205874 It is oppositein the bottom-left region where a force attributing to [minus50Nminus40N] is required to keep the angle from surpassing minus1205874The pole can keep balance with a gentle force close to 0 in themiddle regionThedirection of the angle is different from thatof angular velocity in the top-left and bottom-right regionsthus a force with the absolute value which is relatively largebut smaller than 50N is required

In terms of the optimal value function (see Figures 4(b)and 4(d)) the value function reaches a maximum in theregion satisfying minus2 le 119908 le 2 The pole is more prone tokeep balance even without applying any force in this region

resulting in the relatively larger value function The pole ismore and more difficult to keep balance from the middle tothe two sides with the value function also decaying graduallyThe fine-grained value of the left side compared to the rightonemight be caused by themore frequent visitation to the leftside More visitation will lead to a more accurate estimationabout the value function

After the training ends the prediction of the next stateand the reward for every state-action pair can be obtainedthrough the learnedmodelThe predictions for the next angle119908 the next angular velocity and the reward for any possiblestate are shown in Figures 5(a) 5(b) and 5(c) It is noticeablethat the predicted-next state is always near the current statein Figures 5(a) and 5(b) The received reward is always largerthan 0 in Figure 5(c) which illustrates that the pole canalways keep balance under the optimal policy

52 Continuous Maze Problem Continuous maze problemis shown in Figure 6 where the blue lines with coordinatesrepresent the barriersThe state (119909 119910) consists of the horizon-tal coordinate 119909 isin [0 1] and vertical coordinate 119910 isin [0 1]Starting from the position ldquoStartrdquo the goal of the agent isto reach the position ldquoGoalrdquo that satisfies 119909 + 119910 gt 18 Theaction is to choose an angle bounded by [minus120587 120587] as the newdirection and then walk along this direction with a step 01The agent will receive a reward minus1 at each time step Theagent will be punished with a reward minus400 multiplying thedistance if the distance between its position and the barrierexceeds 01 The parameters are set the same as the formerproblem except for the planning times The local and globalplanning times are determined as 10 and 50 in the sameway of the former experiment The state-based feature iscomputed according to (23) and (24) with the dimensionality119911 = 16 The center point 119888

119894locates over the grid points

02 04 06 08 times 02 04 06 08 The diagonal elementsof 119861 are 120590

2

119909= 02 and 120590

2

119910= 02 The state-action-based

feature is also computed according to (23) and (24) Thecenter point 119888

119894locates over the grid points 02 04 06 08 times

02 04 06 08timesminus3 minus15 0 15 3with the dimensionality119911 = 80The diagonal elements of119861 are 1205902

119909= 02 1205902

119910= 02 and

1205902

119906= 15RAC-HMLP and AC-HMLP are compared with S-AC

MLAC and Dyna-MLAC in cumulative rewards The resultsare shown in Figure 7(a) RAC-HMLP learns fastest in theearly 15 episodesMLACbehaves second best andAC-HMLPperforms poorest RAC-HMLP tends to converge at the 24thepisode but it really converges until the 39th episode AC-HMLP behaves steadily starting from the 23rd episode to theend and it converges at the 43rd episode Like the formerexperiment RAC-HMLP performs slightly better than AC-HMLP embodied in the cumulative rewards minus44 comparedto minus46 At the 53rd episode the cumulative rewards ofS-AC fall to about minus545 quickly without any ascendingthereafter Though MLAC and Dyna-MLAC perform wellin the primary phase their curves start to descend at the88th episode and the 93rd episode respectively MLAC andDyna-MLAC learn the local model to update the policygradient resulting in a fast learning rate in the primary phase

Computational Intelligence and Neuroscience 11

minus2

minus1

0

1

2

Ang

ular

velo

city

(rad

s)

0 05minus05Angle (rad)

minus50

0

50

(a) Optimal policy of AC-HMLP learned after training

minus2

minus1

0

1

2

Ang

ular

velo

city

(rad

s)

0 05minus05Angle (rad)

92

94

96

98

10

(b) Optimal value function of AC-HMLP learned after training

minus2

minus1

0

1

2

Ang

ular

velo

city

(rad

s)

0 05minus05Angle (rad)

minus40

minus20

0

20

40

(c) Optimal policy of RAC-HMLP learned after training

minus2

minus1

0

1

2

Ang

ular

velo

city

(rad

s)

0 05minus05Angle (rad)

9

92

94

96

98

10

(d) Optimal value function of RAC-HMLP learned after training

Figure 4 Optimal policy and value function learned by AC-HMLP and RAC-HMLP

However they do not behave stably enough near the end ofthe training Too much visitation to the barrier region mightcause the fast descending of the cumulative rewards

The comparisons of the time steps for reaching goal aresimulated which are shown in Figure 7(b) Obviously RAC-HMLP and AC-HMLP perform better than the other threemethods RAC-HMLP converges at 39th episode while AC-HMLP converges at 43rd episode with the time steps forreaching goal being 45 and 47 It is clear that RAC-HMLPstill performs slightly better than AC-HMLP The time stepsfor reaching goal are 548 69 and 201 for S-AC MLACandDyna-MLAC Among the fivemethods RAC-HMLP notonly converges fastest but also has the best solution 45 Thepoorest performance of S-AC illustrates that model learningcan definitely improve the convergence performance As inthe former experiment Dyna-MLAC behaves poorer thanMLAC during training If the model is inaccurate planningvia such model might influence the estimations of the valuefunction and the policy thus leading to a poor performancein Dyna-MLAC

The comparisons of sample efficiency are shown inFigure 8 The required samples for S-AC MLAC Dyna-MLAC AC-HMLP and RAC-HMLP to converge are 105956588 7694 4388 and 4062 respectively As in pole balancingproblem RAC-HMLP also requires the least samples whileS-AC needs the most to converge The difference is that

Dyna-MLAC requires samples slightly more than MLACThe ascending curve of Dyna-MLAC at the end of thetraining demonstrates that it has not convergedThe frequentvisitation to the barrier in Dyna-MLAC leads to the quickdescending of the value functions As a result enormoussamples are required to make these value functions moreprone to the true ones

After the training ends the approximate optimal policyand value function are obtained shown in Figures 9(a) and9(b) It is noticeable that the low part of the figure is exploredthoroughly so that the policy and the value function are verydistinctive in this region For example inmost of the region inFigure 9(a) a larger angle is needed for the agent so as to leavethe current state Clearly the nearer the current state and thelow part of Figure 9(b) the smaller the corresponding valuefunction The top part of the figures may be not frequentlyor even not visited by the agent resulting in the similar valuefunctionsThe agent is able to reach the goal onlywith a gentleangle in these areas

6 Conclusion and Discussion

This paper proposes two novel actor-critic algorithms AC-HMLP and RAC-HMLP Both of them take LLR and LFA torepresent the local model and global model respectively Ithas been shown that our new methods are able to learn the

12 Computational Intelligence and Neuroscience

minus2

minus15

minus1

minus05

0

05

1

15

2A

ngle

velo

city

(rad

s)

minus05

minus04

minus03

minus02

minus01

0

01

02

03

0 05minus05Angle (rad)

(a) Prediction of the angle at next state

minus2

minus15

minus1

minus05

0

05

1

15

2

Ang

le v

eloci

ty (r

ads

)

0 05minus05Angle (rad)

minus3

minus25

minus2

minus15

minus1

minus05

0

05

1

15

2

(b) Prediction of the angular velocity at next state

minus2

minus15

minus1

minus05

0

05

1

15

2

Ang

le v

eloci

ty (r

ads

)

0 05minus05Angle (rad)

055

06

065

07

075

08

085

09

095

1

(c) Prediction of the reward

Figure 5 Prediction of the next state and reward according to the global model

1

(03 07)(055 07)

(055 032)

(055 085)Goal

Start

1

0

Figure 6 Continuous maze problem

optimal value function and the optimal policy In the polebalancing and continuous maze problems RAC-HMLP andAC-HMLP are compared with three representative methodsThe results show that RAC-HMLP and AC-HMLP not onlyconverge fastest but also have the best sample efficiency

Figure 7: Comparisons of cumulative rewards and time steps for reaching the goal. (a) Comparisons of cumulative rewards; (b) comparisons of time steps for reaching the goal (both against training episodes, for S-AC, MLAC, Dyna-MLAC, AC-HMLP, and RAC-HMLP).

Figure 8: Comparisons of different algorithms in sample efficiency (number of samples to converge against training episodes, for S-AC, MLAC, Dyna-MLAC, AC-HMLP, and RAC-HMLP).

RAC-HMLP performs slightly better than AC-HMLP in both convergence rate and sample efficiency. By introducing ℓ2-regularization, the parameters learned by RAC-HMLP are smaller and more uniform than those of AC-HMLP, so that overfitting can be avoided effectively. Even so, the improvement of RAC-HMLP over AC-HMLP is not significant, because AC-HMLP normalizes all the features to [0, 1], which prevents the parameters from changing drastically; as a result, overfitting is also limited to a certain extent.
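To make this concrete, the following minimal sketch (in Python, with illustrative names) contrasts a plain TD update of the critic with an ℓ2-regularized update of the form used by RAC-HMLP; alpha_c and l_c play the roles of the critic learning rate and its regularization parameter, and n is the number of samples collected so far. The shrinkage factor (1 - alpha_c * l_c / n) is what keeps the parameters small and uniform.

import numpy as np

def td_update(theta, e_t, delta_t, alpha_c=0.4):
    # Plain (unregularized) critic update: theta <- theta + alpha_c * delta_t * e_t(x).
    return theta + alpha_c * delta_t * e_t

def regularized_td_update(theta, traces_and_errors, alpha_c=0.4, l_c=0.01):
    # l2-regularized critic update over the n samples seen so far:
    # theta <- theta * (1 - alpha_c * l_c / n) + (alpha_c / n) * sum_s e_s(x) * delta_s.
    n = len(traces_and_errors)
    accumulated = sum(e_s * delta_s for e_s, delta_s in traces_and_errors)
    return theta * (1.0 - alpha_c * l_c / n) + (alpha_c / n) * accumulated

Here theta and every eligibility trace e_s are vectors of the same length as the feature vector and delta_s is the TD error of sample s; the actor and model parameters of RAC-HMLP are regularized with the same shrink-then-correct pattern, using their own learning rates and regularization parameters.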

S-AC is the only algorithm without model learning. The poorest performance of S-AC demonstrates that combining model learning with the actor-critic framework can really improve performance. Dyna-MLAC learns a model via LLR for local planning and policy updating. However, Dyna-MLAC uses the model directly without first checking whether it is accurate; additionally, it does not exploit the global information contained in the samples. Therefore, it behaves worse than AC-HMLP and RAC-HMLP. MLAC also approximates a local model via LLR as Dyna-MLAC does, but it only uses the model to update the policy gradient, surprisingly with slightly better performance than Dyna-MLAC.
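In contrast, AC-HMLP and RAC-HMLP plan with the local model only after checking that it is sufficiently accurate. A minimal sketch of such a check is given below (Python, illustrative names), assuming the relative-error test applied before local planning: the local model is trusted at the current time step only if the largest relative error between its one-step predictions and the observed next state and reward does not exceed the threshold ξ (set to 0.15 in our experiments); the small eps guard against division by zero is an addition of this sketch.

import numpy as np

def local_model_reliable(x_pred, r_pred, x_real, r_real, xi=0.15, eps=1e-8):
    # x_pred, x_real: predicted and observed next states (1-D arrays);
    # r_pred, r_real: predicted and observed rewards.
    # Planning with the local LLR model is allowed only when this returns True.
    state_err = np.abs((x_pred - x_real) / (np.abs(x_real) + eps))
    reward_err = abs((r_pred - r_real) / (abs(r_real) + eps))
    return max(float(state_err.max()), float(reward_err)) <= xi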

Dyna-MLAC and MLAC approximate the value function, the policy, and the model through LLR. In the LLR approach, the samples collected in the interaction with the environment have to be stored in a memory, and for each query only the L nearest samples in the memory, found through KNN or a K-d tree, are selected to fit the local linear model. Such a learning process incurs considerable computation and memory costs. In AC-HMLP and RAC-HMLP, only a single parameter vector has to be stored and learned for each of the value function, the policy, and the global model. Therefore, AC-HMLP and RAC-HMLP also outperform Dyna-MLAC and MLAC in computation and memory costs.
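To make the cost comparison concrete, the sketch below (Python, illustrative names) shows the kind of memory-based LLR prediction on which MLAC and Dyna-MLAC rely: every prediction has to search the stored samples for the L nearest neighbours and solve a small least-squares problem, whereas the parametric global model of AC-HMLP and RAC-HMLP only multiplies a fixed-length feature vector by a single stored parameter vector.

import numpy as np

def llr_predict(memory_X, memory_Y, query, L=9):
    # memory_X: stored inputs of shape (n, d_in), e.g. [state, action, 1];
    # memory_Y: stored targets of shape (n, d_out), e.g. [next_state, reward];
    # query:    a single input of shape (d_in,); L: number of nearest neighbours.
    # Select the L nearest stored samples (a K-d tree would speed up this search).
    distances = np.linalg.norm(memory_X - query, axis=1)
    idx = np.argsort(distances)[:L]
    X, Y = memory_X[idx], memory_Y[idx]
    # Fit the local linear model Gamma by least squares so that X @ Gamma ~ Y.
    Gamma, _, _, _ = np.linalg.lstsq(X, Y, rcond=None)
    return query @ Gamma  # prediction of [next_state, reward] at the query point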

The planning times for the local model and the global model have to be determined according to the experimental performance; thus they currently have to be tuned separately for each domain. To address this problem, our future work will consider how to determine the planning times adaptively for different domains. Moreover, with the development of deep learning [33, 34], deep RL has succeeded in many applications, such as AlphaGo and Atari 2600, by taking visual images or videos as input. However, how to accelerate the learning process in deep RL through model learning is still an open question. Future work will endeavor to combine deep RL with hierarchical model learning and planning to solve real-world problems.

Competing Interests

The authors declare that there are no competing interests regarding the publication of this article.

Figure 9: Final optimal policy and value function after training on the continuous maze problem. (a) Optimal policy learned by AC-HMLP; (b) optimal value function learned by AC-HMLP (both plotted over coordinates x and y).

Acknowledgments

This research was partially supported by the Innovation Center of Novel Software Technology and Industrialization, the National Natural Science Foundation of China (61502323, 61502329, 61272005, 61303108, 61373094, and 61472262), the Natural Science Foundation of Jiangsu (BK2012616), the High School Natural Foundation of Jiangsu (13KJB520020), the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172014K04), and the Suzhou Industrial Application of Basic Research Program Part (SYG201422).

References

[1] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, Mass, USA, 1998.
[2] M. L. Littman, "Reinforcement learning improves behaviour from evaluative feedback," Nature, vol. 521, no. 7553, pp. 445–451, 2015.
[3] D. Silver, R. S. Sutton, and M. Muller, "Temporal-difference search in computer Go," Machine Learning, vol. 87, no. 2, pp. 183–219, 2012.
[4] J. Kober and J. Peters, "Reinforcement learning in robotics: a survey," in Reinforcement Learning, pp. 579–610, Springer, Berlin, Germany, 2012.
[5] D.-R. Liu, H.-L. Li, and D. Wang, "Feature selection and feature learning for high-dimensional batch reinforcement learning: a survey," International Journal of Automation and Computing, vol. 12, no. 3, pp. 229–242, 2015.
[6] R. E. Bellman, Dynamic Programming, Princeton University Press, Princeton, NJ, USA, 1957.
[7] D. P. Bertsekas, Dynamic Programming and Optimal Control, Athena Scientific, Belmont, Mass, USA, 1995.
[8] F. L. Lewis and D. R. Liu, Reinforcement Learning and Approximate Dynamic Programming for Feedback Control, John Wiley & Sons, Hoboken, NJ, USA, 2012.
[9] P. J. Werbos, "Advanced forecasting methods for global crisis warning and models of intelligence," General Systems, vol. 22, pp. 25–38, 1977.
[10] D. R. Liu, H. L. Li, and D. Wang, "Data-based self-learning optimal control: research progress and prospects," Acta Automatica Sinica, vol. 39, no. 11, pp. 1858–1870, 2013.
[11] H. G. Zhang, D. R. Liu, Y. H. Luo, and D. Wang, Adaptive Dynamic Programming for Control: Algorithms and Stability, Communications and Control Engineering Series, Springer, London, UK, 2013.
[12] A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, "Discrete-time nonlinear HJB solution using approximate dynamic programming: convergence proof," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 38, no. 4, pp. 943–949, 2008.
[13] T. Dierks, B. T. Thumati, and S. Jagannathan, "Optimal control of unknown affine nonlinear discrete-time systems using offline-trained neural networks with proof of convergence," Neural Networks, vol. 22, no. 5-6, pp. 851–860, 2009.
[14] F.-Y. Wang, N. Jin, D. Liu, and Q. Wei, "Adaptive dynamic programming for finite-horizon optimal control of discrete-time nonlinear systems with ε-error bound," IEEE Transactions on Neural Networks, vol. 22, no. 1, pp. 24–36, 2011.
[15] K. Doya, "Reinforcement learning in continuous time and space," Neural Computation, vol. 12, no. 1, pp. 219–245, 2000.
[16] J. J. Murray, C. J. Cox, G. G. Lendaris, and R. Saeks, "Adaptive dynamic programming," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 32, no. 2, pp. 140–153, 2002.
[17] T. Hanselmann, L. Noakes, and A. Zaknich, "Continuous-time adaptive critics," IEEE Transactions on Neural Networks, vol. 18, no. 3, pp. 631–647, 2007.
[18] S. Bhasin, R. Kamalapurkar, M. Johnson, K. G. Vamvoudakis, F. L. Lewis, and W. E. Dixon, "A novel actor-critic-identifier architecture for approximate optimal control of uncertain nonlinear systems," Automatica, vol. 49, no. 1, pp. 82–92, 2013.
[19] L. Busoniu, R. Babuska, B. De Schutter, et al., Reinforcement Learning and Dynamic Programming Using Function Approximators, CRC Press, New York, NY, USA, 2010.
[20] A. G. Barto, R. S. Sutton, and C. W. Anderson, "Neuronlike adaptive elements that can solve difficult learning control problems," IEEE Transactions on Systems, Man, and Cybernetics, vol. 13, no. 5, pp. 834–846, 1983.
[21] H. Van Hasselt and M. A. Wiering, "Reinforcement learning in continuous action spaces," in Proceedings of the IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning (ADPRL '07), pp. 272–279, Honolulu, Hawaii, USA, April 2007.
[22] S. Bhatnagar, R. S. Sutton, M. Ghavamzadeh, and M. Lee, "Natural actor-critic algorithms," Automatica, vol. 45, no. 11, pp. 2471–2482, 2009.
[23] I. Grondman, L. Busoniu, G. A. D. Lopes, and R. Babuska, "A survey of actor-critic reinforcement learning: standard and natural policy gradients," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 42, no. 6, pp. 1291–1307, 2012.
[24] I. Grondman, L. Busoniu, and R. Babuska, "Model learning actor-critic algorithms: performance evaluation in a motion control task," in Proceedings of the 51st IEEE Conference on Decision and Control (CDC '12), pp. 5272–5277, Maui, Hawaii, USA, December 2012.
[25] I. Grondman, M. Vaandrager, L. Busoniu, R. Babuska, and E. Schuitema, "Efficient model learning methods for actor-critic control," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 42, no. 3, pp. 591–602, 2012.
[26] B. Costa, W. Caarls, and D. S. Menasche, "Dyna-MLAC: trading computational and sample complexities in actor-critic reinforcement learning," in Proceedings of the Brazilian Conference on Intelligent Systems (BRACIS '15), pp. 37–42, Natal, Brazil, November 2015.
[27] O. Andersson, F. Heintz, and P. Doherty, "Model-based reinforcement learning in continuous environments using real-time constrained optimization," in Proceedings of the 29th AAAI Conference on Artificial Intelligence (AAAI '15), pp. 644–650, Texas, USA, January 2015.
[28] R. S. Sutton, C. Szepesvari, A. Geramifard, and M. Bowling, "Dyna-style planning with linear function approximation and prioritized sweeping," in Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence (UAI '08), pp. 528–536, July 2008.
[29] R. Faulkner and D. Precup, "Dyna planning using a feature based generative model," in Proceedings of the Advances in Neural Information Processing Systems (NIPS '10), pp. 1–9, Vancouver, Canada, December 2010.
[30] H. Yao and C. Szepesvari, "Approximate policy iteration with linear action models," in Proceedings of the 26th National Conference on Artificial Intelligence (AAAI '12), Toronto, Canada, July 2012.
[31] H. R. Berenji and P. Khedkar, "Learning and tuning fuzzy logic controllers through reinforcements," IEEE Transactions on Neural Networks, vol. 3, no. 5, pp. 724–740, 1992.
[32] R. S. Sutton, "Generalization in reinforcement learning: successful examples using sparse coarse coding," in Advances in Neural Information Processing Systems, pp. 1038–1044, MIT Press, New York, NY, USA, 1996.
[33] D. Silver, A. Huang, C. J. Maddison, et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[34] V. Mnih, K. Kavukcuoglu, D. Silver, et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.

Page 7: Research Article Efficient Actor-Critic Algorithm with ... · Research Article Efficient Actor-Critic Algorithm with Hierarchical Model Learning and Planning ShanZhong, 1,2 QuanLiu,

Computational Intelligence and Neuroscience 7

(1) Loop for 119875119897time steps

(2) 1199091larr 119909119905 119905 larr 1

(3) Choose Δ119906119905according to119873(0 120590

2)

(4) 119906119905= ℎ(119909

119905 120573119905) + Δ119906

119905

(5) Predict the next state and the reward [119909119905+11

119909119905+1119870

119903119905+1] = Γ119905[1199091199051 119909

119905119870 119906119905 1]

T

(6) Update the eligibility 119890119905(119909) = 120601(119909

119905) 119909119905= 119909 120582120574119890

119905minus1(119909) 119909

119905= 119909

(7) Compute the TD error 120575119905= 119903119905+1

+ 120574119881(119909119905+1 120579119905) minus 119881(119909

119905 120579119905)

(8) Update the value function parameter 120579119905+1

= 120579119905+ 120572119888120575119905119890119905(119909)

(9) Update the policy parameter 120573119905+1

= 120573119905+ 120572119886120575TDΔ119906119905120601(119909119905)

(10) If 119905 le 119878

(11) 119905 = 119905 + 1

(12) End If(13) Update the number of samples number = number + 1(14) End LoopOutput 120573 120579

Algorithm 3 Local model planning (120579 120573 Γ 119890 number 119875119897)

(1) Loop for 119875119892times

(2) 119909larr 1199090 larr 1

(3) Repeat all episodes(4) Choose Δ119906

according to119873(0 120590)

(5) Compute exploration term 119906= ℎ(119909

120573) + Δ119906

(6) Predict the next state1199091015840

+11= 1205781

119879120601 (1199091 1199092 119909

119870 119906)

1199091015840

+12= 1205782

119879120601 (1199091 1199092 119909

119870 119906)

1199091015840

+1119870= 120578119870

119879120601 (1199091 1199092 119909

119870 119906)

(7) Predict the reward 1199031015840+1

= 120589119905

T(1199091 1199092 119909

119870 119906119905)

(8) Update the eligibility 119890(119909) = 120601(119909

) 119909= 119909 120582120574119890

minus1(119909) 119909

= 119909

(9) Compute the TD error 120575= 119903119905+1

+ 120574119881(1199091015840

+1 120579) minus 119881(119909

120579)

(10) Update the value function parameter 120579+1

= 120579+ 120572119888120575119890(119909)

(11) Update the policy parameter 120573+1

= 120573+ 120572119886120575Δ119906120601(119909)

(12) If le 119879

(13) = + 1

(14) End If(15) Update the number of samples number = number + 1(16) End Repeat(17) End LoopOutput 120573 120579

Algorithm 4 Global model planning (120579 120573 1205781 120578

119896 120589 119890 number 119875

119892)

The goal of learning the value function is to minimize the square of the TD error, which is shown as

    min_θ Σ_{t=0}^{number} ‖r_{t+1} + γ V(x'_{t+1}, θ_t) − V(x_t, θ_t)‖² + ℓ_c ‖θ_t‖²,    (17)

where number represents the number of the samples, ℓ_c ≥ 0 is the regularization parameter of the critic, and ‖θ_t‖² is the ℓ2-regularization term which penalizes the growth of the parameter vector θ_t, so that overfitting to noisy samples can be avoided.

The update for the parameter θ of the value function in RAC-HMLP is represented as

    θ_{t+1} = θ_t (1 − (α_c/number) ℓ_c) + (α_c/number) Σ_{s=0}^{number} e_s(x) δ_s.    (18)

The update for the parameter β of the policy is shown as

    β_{t+1} = β_t (1 − (α_a/number) ℓ_a) + (α_a/number) Σ_{s=0}^{number} φ(x_s) δ_s Δu_s,    (19)

where ℓ_a ≥ 0 is the regularization parameter of the actor.
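The regularized batch updates (18) and (19) can be sketched as follows, assuming the TD errors, eligibility traces, state features, and exploration terms of the collected samples have been recorded; all variable names are illustrative.

```python
def rac_hmlp_updates(theta, beta, deltas, traces, feats, dus,
                     alpha_c=0.4, alpha_a=0.5, l_c=0.01, l_a=0.001):
    """Sketch of the l2-regularized updates (18) and (19).

    deltas[s], traces[s], feats[s], and dus[s] are the TD error, eligibility
    trace, state feature, and exploration term recorded for sample s.
    """
    number = len(deltas)
    # (18): shrink theta, then add the averaged trace-weighted TD errors.
    theta = theta * (1.0 - alpha_c * l_c / number) \
        + (alpha_c / number) * sum(d * e for d, e in zip(deltas, traces))
    # (19): shrink beta, then add the averaged policy-gradient terms.
    beta = beta * (1.0 - alpha_a * l_a / number) \
        + (alpha_a / number) * sum(d * du * f for d, du, f in zip(deltas, dus, feats))
    return theta, beta
```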


Figure 1: Pole balancing problem.

The update for the global model can be denoted as

    η_{t+1,1} = η_{t,1} (1 − (α_m/number) ℓ_m) + (α_m/number) Σ_{s=1}^{number} φ(x_{s,1}, ..., x_{s,K}, u_s) (x_{s+1,1} − x'_{s+1,1}),
    η_{t+1,2} = η_{t,2} (1 − (α_m/number) ℓ_m) + (α_m/number) Σ_{s=1}^{number} φ(x_{s,1}, ..., x_{s,K}, u_s) (x_{s+1,2} − x'_{s+1,2}),
    ...
    η_{t+1,K} = η_{t,K} (1 − (α_m/number) ℓ_m) + (α_m/number) Σ_{s=1}^{number} φ(x_{s,1}, ..., x_{s,K}, u_s) (x_{s+1,K} − x'_{s+1,K}),    (20)

    ζ_{t+1} = ζ_t (1 − (α_m/number) ℓ_m) + (α_m/number) Σ_{s=1}^{number} φ(x_{s,1}, ..., x_{s,K}, u_s) (r_{s+1} − r'_{s+1}),    (21)

where ℓ_m ≥ 0 is the regularization parameter for the model, namely, the state transition function and the reward function.

After we replace the update equations of the parameters in Algorithms 2, 3, and 4 with (18), (19), (20), and (21), we obtain the resulting algorithm RAC-HMLP. Except for the above update equations, the other parts of RAC-HMLP are the same as those of AC-HMLP, so we do not repeat them here.
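Similarly, a vectorized sketch of the regularized model updates (20) and (21) is shown below. It assumes η is stored as a K × d matrix and ζ as a d-dimensional vector, with the observed and predicted next states and rewards recorded per sample; the names are ours.

```python
import numpy as np

def model_updates(eta, zeta, sa_feats, next_states, pred_states,
                  rewards, pred_rewards, alpha_m=0.5, l_m=0.2):
    """Sketch of the l2-regularized model updates (20) and (21).

    sa_feats[s] is phi(x_s, u_s); next_states/rewards are observed values and
    pred_states/pred_rewards the model's predictions for the same samples.
    """
    Phi = np.asarray(sa_feats)                                       # (number, d)
    number = Phi.shape[0]
    shrink = 1.0 - alpha_m * l_m / number
    # (20): one correction per state component k, driven by the prediction error.
    state_err = np.asarray(next_states) - np.asarray(pred_states)    # (number, K)
    eta = shrink * eta + (alpha_m / number) * state_err.T @ Phi
    # (21): the analogous correction for the reward predictor.
    reward_err = np.asarray(rewards) - np.asarray(pred_rewards)      # (number,)
    zeta = shrink * zeta + (alpha_m / number) * Phi.T @ reward_err
    return eta, zeta
```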

5. Empirical Results and Analysis

AC-HMLP and RAC-HMLP are compared with S-AC, MLAC, and Dyna-MLAC on two continuous state and action spaces problems: the pole balancing problem [31] and the continuous maze problem [32].

5.1. Pole Balancing Problem. The pole balancing problem is a low-dimensional but challenging benchmark problem widely used in the RL literature, shown in Figure 1.

There is a cart moving along the track with a hinged pole on its top. The goal is to find a policy which can decide the force applied to the cart so as to keep the pole balanced. The system dynamics is modeled by the following equation:

    ẅ = (g sin(w) − v m l ẇ² sin(2w)/2 − v cos(w) F) / (4l/3 − v m l cos²(w)),    (22)

where w is the angle of the pole with the vertical line, ẇ and ẅ are the angular velocity and the angular acceleration of the pole, and F is the force exerted on the cart. A negative value means a force to the right and a positive value a force to the left. g is the gravity constant with the value g = 9.81 m/s². m and l are the mass and the length of the pole, which are set to m = 2.0 kg and l = 0.5 m, respectively. v is a constant with the value 1/(m + m_c), where m_c = 8.0 kg is the mass of the cart.
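For concreteness, a minimal simulation step of the dynamics (22) might look as follows. The Euler-style integration with the time step 0.1 from Table 1 is an assumption, since the paper does not state the integration scheme.

```python
import math

def pole_step(w, w_dot, F, dt=0.1, g=9.81, m=2.0, l=0.5, m_c=8.0):
    """One Euler-style step of the pole dynamics (22); dt matches T_s in Table 1."""
    v = 1.0 / (m + m_c)
    num = g * math.sin(w) - v * m * l * w_dot ** 2 * math.sin(2.0 * w) / 2.0 \
        - v * math.cos(w) * F
    den = 4.0 * l / 3.0 - v * m * l * math.cos(w) ** 2
    w_ddot = num / den                       # angular acceleration from (22)
    w_dot_next = w_dot + dt * w_ddot
    w_next = w + dt * w_dot_next             # semi-implicit Euler (an assumption)
    return w_next, w_dot_next
```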

The state x = (w, ẇ) is composed of w and ẇ, which are bounded by [−π, π] and [−2, 2], respectively. The action is the force F, bounded by [−50 N, 50 N]. The maximal number of episodes is 300, and the maximal number of time steps for each episode is 3000. If the angle of the pole with the vertical line does not exceed π/4 at a time step, the agent receives a reward 1; otherwise the pole falls down and the agent receives a reward −1. There is also a random noise applied to the force, bounded by [−10 N, 10 N]. An episode ends when the pole has kept balance for 3000 time steps or the pole falls down. The pole falls down if |w| ≥ π/4. To approximate the value function and the policy, the state-based feature is coded by radial basis functions (RBFs), shown as

    φ_i(x) = e^(−(1/2)(x − c_i)^T B^(−1) (x − c_i)),    (23)

where c_i denotes the i-th center point, located over the grid points {−0.6, −0.4, −0.2, 0, 0.2, 0.4, 0.6} × {−2, 0, 2}, and B is a diagonal matrix containing the widths of the RBFs, with σ²_w = 0.2 and σ²_ẇ = 2. The dimensionality z of φ(x) is 21. Notice that the approximations of the value function and the policy only require the state-based feature. However, the approximation of the model requires the state-action-based feature. Let x be (w, ẇ, u); its feature can then also be computed by (23). In this case, c_i is located over the points {−0.6, −0.4, −0.2, 0, 0.2, 0.4, 0.6} × {−2, 0, 2} × {−50, −25, 0, 25, 50}. The diagonal elements of B are σ²_w = 0.2, σ²_ẇ = 2, and σ²_F = 10. The dimensionality z of φ(w, ẇ, u) is 105. Either the state-based or the state-action-based feature has to be normalized so as to make its value smaller than 1, shown as

    φ̄_i(x) = φ_i(x) / Σ_{j=1}^{z} φ_j(x),    (24)

where z > 0 is the dimensionality of the feature.

RAC-HMLP and AC-HMLP are compared with S-AC, MLAC, and Dyna-MLAC on this experiment. The parameters of S-AC, MLAC, and Dyna-MLAC are set according to the values mentioned in their papers. The parameters of AC-HMLP and RAC-HMLP are set as shown in Table 1.
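A small sketch of the normalized state-based RBF feature of (23) and (24) for this task is given below, using the 7 × 3 grid of centers and the widths stated above; the vectorized implementation details are ours.

```python
import numpy as np
from itertools import product

# 7 x 3 grid of centers for the state-based feature (21 RBFs in total).
centers = np.array(list(product([-0.6, -0.4, -0.2, 0.0, 0.2, 0.4, 0.6],
                                [-2.0, 0.0, 2.0])))
B_inv = np.linalg.inv(np.diag([0.2, 2.0]))   # widths sigma_w^2 and sigma_wdot^2

def rbf_features(x):
    """Normalized RBF feature of (23)-(24) for the pole state x = (w, w_dot)."""
    diffs = centers - np.asarray(x, dtype=float)
    phi = np.exp(-0.5 * np.einsum('ij,jk,ik->i', diffs, B_inv, diffs))
    return phi / phi.sum()                   # normalization (24)

# Example: the feature of the upright, motionless state has 21 components.
print(rbf_features([0.0, 0.0]).shape)        # (21,)
```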

The planning times may have a significant effect on the convergence performance. Therefore, we have to determine the local planning times P_l and the global planning times P_g first. The candidate settings for the two parameters are (30, 300), (50, 300), (30, 600), and (50, 600).


Table 1: Parameter settings of RAC-HMLP and AC-HMLP.

Parameter                                  Symbol    Value
Time step                                  T_s       0.1
Discount factor                            γ         0.9
Trace-decay rate                           λ         0.9
Exploration variance                       σ²        1
Learning rate of the actor                 α_a       0.5
Learning rate of the critic                α_c       0.4
Learning rate of the model                 α_m       0.5
Error threshold                            ξ         0.15
Capacity of the memory                     M_size    100
Number of the nearest samples              L         9
Local planning times                       P_l       30
Global planning times                      P_g       300
Number of components of the state          K         2
Regularization parameter of the model      ℓ_m       0.2
Regularization parameter of the critic     ℓ_c       0.01
Regularization parameter of the actor      ℓ_a       0.001
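For reference, the settings in Table 1 can be collected in a single configuration object; this is merely a convenience representation of the table, not code from the paper.

```python
# The Table 1 settings as a plain dictionary (a convenience representation only).
HMLP_PARAMS = {
    "time_step": 0.1,          # T_s
    "discount": 0.9,           # gamma
    "trace_decay": 0.9,        # lambda
    "exploration_var": 1.0,    # sigma^2
    "alpha_actor": 0.5,        # alpha_a
    "alpha_critic": 0.4,       # alpha_c
    "alpha_model": 0.5,        # alpha_m
    "error_threshold": 0.15,   # xi
    "memory_size": 100,        # M_size
    "nearest_samples": 9,      # L
    "local_planning": 30,      # P_l
    "global_planning": 300,    # P_g
    "state_components": 2,     # K
    "l2_model": 0.2,           # l_m (RAC-HMLP only)
    "l2_critic": 0.01,         # l_c (RAC-HMLP only)
    "l2_actor": 0.001,         # l_a (RAC-HMLP only)
}
```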

From Figure 2(a), it is easy to find that (P_l = 30, P_g = 300) behaves best among these four settings, requiring 41 episodes to converge. (P_l = 30, P_g = 600) learns fastest in the early 29 episodes, but it does not converge until the 52nd episode. (P_l = 30, P_g = 600) and (P_l = 30, P_g = 300) have identical local planning times but different global planning times. Generally, the more the planning times, the faster the convergence rate, yet (P_l = 30, P_g = 300) performs better instead. The global model is not accurate enough at the initial time, and planning through such an inaccurate global model leads to unstable performance. Notice that (P_l = 50, P_g = 300) and (P_l = 50, P_g = 600) behave poorer than (P_l = 30, P_g = 600) and (P_l = 30, P_g = 300), which demonstrates that planning too much via the local model does not perform better. (P_l = 50, P_g = 300) seems to converge at the 51st episode, but its learning curve fluctuates heavily after 365 episodes, not converging any more until the end. Like (P_l = 50, P_g = 300), (P_l = 50, P_g = 600) also fluctuates heavily, but it converges at the 373rd episode. Evidently, (P_l = 50, P_g = 600) performs slightly better than (P_l = 50, P_g = 300). Planning through the global model might alleviate the instability caused by planning with the local model.

The convergence performances of the five methods, RAC-HMLP, AC-HMLP, S-AC, MLAC, and Dyna-MLAC, are shown in Figure 2(b). It is evident that our methods RAC-HMLP and AC-HMLP have the best convergence performance: RAC-HMLP and AC-HMLP converge at the 39th and 41st episodes, respectively. Both of them learn quickly in the primary phase, but the learning curve of RAC-HMLP is steeper than that of AC-HMLP. Dyna-MLAC learns faster than MLAC and S-AC in the first 74 episodes, but it does not converge until the 99th episode. Though MLAC behaves poorer than Dyna-MLAC early on, it requires just 82 episodes to converge. The method with the slowest learning rate is S-AC, where the pole keeps balance for 3000 time steps for the first time only at the 251st episode; it does not converge until the 333rd episode.

Figure 2: Comparisons of different planning times and different algorithms (number of balanced time steps versus training episodes). (a) Determination of different planning times for (P_l, P_g) = (30, 300), (50, 300), (30, 600), and (50, 600). (b) Comparisons of S-AC, MLAC, Dyna-MLAC, AC-HMLP, and RAC-HMLP in convergence performance.

RAC-HMLP converges fastest, which might be caused by the introduction of the ℓ2-regularization. Because the ℓ2-regularization does not allow the parameters to grow rapidly, overfitting during the learning process can be avoided effectively. Dyna-MLAC converges faster than MLAC in the early 74 episodes, but its performance is not stable enough, embodied in the heavy fluctuation from the 75th to the 99th episode. If the approximate local model differs largely from the real local model, then planning through such an inaccurate local model might lead to unstable performance. S-AC, as the only method without model learning, behaves poorest among the five. These results show that the convergence performance can be improved largely by introducing model learning.

The comparisons of the five different algorithms in sample efficiency are shown in Figure 3. It is clear that the numbers of samples required for S-AC, MLAC, Dyna-MLAC, AC-HMLP, and RAC-HMLP to converge are 56003, 35535, 32043, 14426, and 12830, respectively.


Figure 3: Comparisons of sample efficiency (number of samples to converge versus training episodes for S-AC, MLAC, Dyna-MLAC, AC-HMLP, and RAC-HMLP).

Evidently, RAC-HMLP converges fastest and behaves stably all the time, thus requiring the fewest samples. Though AC-HMLP requires more samples to converge than RAC-HMLP, it still requires far fewer samples than the other three methods. Because S-AC does not utilize a model, it requires the most samples among the five. Unlike MLAC, Dyna-MLAC applies the model not only to updating the policy but also to planning, resulting in a smaller sample requirement.

The optimal policy and the optimal value function learned by AC-HMLP after the training ends are shown in Figures 4(a) and 4(b), respectively, while the ones learned by RAC-HMLP are shown in Figures 4(c) and 4(d). Evidently, the optimal policy and the optimal value function learned by AC-HMLP and RAC-HMLP are quite similar, but RAC-HMLP seems to have a more fine-grained optimal policy and value function.

As for the optimal policy (see Figures 4(a) and 4(c)), the force becomes smaller and smaller from the two sides (left side and right side) to the middle. The top-right is the region requiring a force close to 50 N, where the direction of the angle is the same as that of the angular velocity. The values of the angle and the angular velocity approach the maximum values π/4 and 2; therefore, the largest force to the left is required so as to guarantee that the angle between the upright line and the pole is no bigger than π/4. It is the opposite in the bottom-left region, where a force in [−50 N, −40 N] is required to keep the angle from surpassing −π/4. The pole can keep balance with a gentle force close to 0 in the middle region. The direction of the angle differs from that of the angular velocity in the top-left and bottom-right regions; thus a force whose absolute value is relatively large but smaller than 50 N is required.

In terms of the optimal value function (see Figures 4(b) and 4(d)), the value function reaches a maximum in the region satisfying −2 ≤ w ≤ 2. The pole is more prone to keep balance even without applying any force in this region, resulting in a relatively larger value function. The pole becomes more and more difficult to balance from the middle to the two sides, with the value function also decaying gradually. The more fine-grained values on the left side compared to the right side might be caused by more frequent visitation to the left side; more visitation leads to a more accurate estimation of the value function.

After the training ends, the prediction of the next state and the reward for every state-action pair can be obtained through the learned model. The predictions of the next angle w, the next angular velocity, and the reward for any possible state are shown in Figures 5(a), 5(b), and 5(c). It is noticeable that the predicted next state is always near the current state in Figures 5(a) and 5(b). The received reward is always larger than 0 in Figure 5(c), which illustrates that the pole can always keep balance under the optimal policy.

5.2. Continuous Maze Problem. The continuous maze problem is shown in Figure 6, where the blue lines with coordinates represent the barriers. The state (x, y) consists of the horizontal coordinate x ∈ [0, 1] and the vertical coordinate y ∈ [0, 1]. Starting from the position "Start", the goal of the agent is to reach the position "Goal", which satisfies x + y > 1.8. The action is to choose an angle bounded by [−π, π] as the new direction and then walk along this direction with a step 0.1. The agent receives a reward −1 at each time step. The agent is punished with a reward of −400 multiplied by the distance if the distance between its position and the barrier exceeds 0.1. The parameters are set the same as in the former problem except for the planning times; the local and global planning times are determined as 10 and 50 in the same way as in the former experiment. The state-based feature is computed according to (23) and (24) with the dimensionality z = 16. The center point c_i is located over the grid points {0.2, 0.4, 0.6, 0.8} × {0.2, 0.4, 0.6, 0.8}. The diagonal elements of B are σ²_x = 0.2 and σ²_y = 0.2. The state-action-based feature is also computed according to (23) and (24). The center point c_i is located over the grid points {0.2, 0.4, 0.6, 0.8} × {0.2, 0.4, 0.6, 0.8} × {−3, −1.5, 0, 1.5, 3}, with the dimensionality z = 80. The diagonal elements of B are σ²_x = 0.2, σ²_y = 0.2, and σ²_u = 1.5.
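A minimal sketch of one maze transition under the description above is given below; the barrier penalty of −400 times the distance is deliberately omitted here because its exact trigger condition depends on how barrier crossings are detected, so this is only a partial model of the environment.

```python
import math

def maze_step(x, y, angle, step=0.1):
    """Sketch of one transition in the continuous maze (barrier penalty omitted).

    The agent walks a step of length 0.1 in the chosen direction in [-pi, pi],
    gets -1 per step, and the episode ends once x + y > 1.8.
    """
    nx = min(max(x + step * math.cos(angle), 0.0), 1.0)   # stay inside [0, 1]^2
    ny = min(max(y + step * math.sin(angle), 0.0), 1.0)
    reward = -1.0
    done = nx + ny > 1.8                                   # "Goal" region reached
    return (nx, ny), reward, done
```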

RAC-HMLP and AC-HMLP are compared with S-AC, MLAC, and Dyna-MLAC in cumulative rewards. The results are shown in Figure 7(a). RAC-HMLP learns fastest in the early 15 episodes, MLAC behaves second best, and AC-HMLP performs poorest. RAC-HMLP tends to converge at the 24th episode, but it does not truly converge until the 39th episode. AC-HMLP behaves steadily from the 23rd episode to the end, and it converges at the 43rd episode. As in the former experiment, RAC-HMLP performs slightly better than AC-HMLP, embodied in the cumulative rewards of −44 compared to −46. At the 53rd episode, the cumulative rewards of S-AC fall quickly to about −545 without ascending thereafter. Though MLAC and Dyna-MLAC perform well in the primary phase, their curves start to descend at the 88th episode and the 93rd episode, respectively. MLAC and Dyna-MLAC learn the local model to update the policy gradient, resulting in a fast learning rate in the primary phase.


Figure 4: Optimal policy and value function learned by AC-HMLP and RAC-HMLP (angular velocity (rad/s) versus angle (rad)). (a) Optimal policy of AC-HMLP learned after training. (b) Optimal value function of AC-HMLP learned after training. (c) Optimal policy of RAC-HMLP learned after training. (d) Optimal value function of RAC-HMLP learned after training.

However, they do not behave stably enough near the end of the training. Too much visitation to the barrier region might cause the fast descent of the cumulative rewards.

The comparisons of the time steps for reaching the goal are shown in Figure 7(b). Obviously, RAC-HMLP and AC-HMLP perform better than the other three methods. RAC-HMLP converges at the 39th episode while AC-HMLP converges at the 43rd episode, with the time steps for reaching the goal being 45 and 47, respectively. It is clear that RAC-HMLP still performs slightly better than AC-HMLP. The time steps for reaching the goal are 548, 69, and 201 for S-AC, MLAC, and Dyna-MLAC, respectively. Among the five methods, RAC-HMLP not only converges fastest but also has the best solution, 45. The poorest performance of S-AC illustrates that model learning can definitely improve the convergence performance. As in the former experiment, Dyna-MLAC behaves poorer than MLAC during training. If the model is inaccurate, planning via such a model might influence the estimations of the value function and the policy, thus leading to the poor performance of Dyna-MLAC.

The comparisons of sample efficiency are shown in Figure 8. The samples required for S-AC, MLAC, Dyna-MLAC, AC-HMLP, and RAC-HMLP to converge are 10595, 6588, 7694, 4388, and 4062, respectively. As in the pole balancing problem, RAC-HMLP requires the fewest samples while S-AC needs the most to converge. The difference is that Dyna-MLAC requires slightly more samples than MLAC. The ascending curve of Dyna-MLAC at the end of the training demonstrates that it has not converged. The frequent visitation to the barrier in Dyna-MLAC leads to the quick descent of the value functions; as a result, enormous numbers of samples are required to bring these value functions closer to the true ones.

After the training ends, the approximate optimal policy and value function are obtained, as shown in Figures 9(a) and 9(b). It is noticeable that the lower part of the figure is explored thoroughly, so that the policy and the value function are very distinctive in this region. For example, in most of the region in Figure 9(a), a larger angle is needed for the agent so as to leave the current state. Clearly, the nearer the current state is to the lower part of Figure 9(b), the smaller the corresponding value function. The top part of the figures may be visited infrequently or not at all by the agent, resulting in similar value functions; the agent is able to reach the goal with only a gentle angle in these areas.

6. Conclusion and Discussion

This paper proposes two novel actor-critic algorithms, AC-HMLP and RAC-HMLP. Both of them take LLR and LFA to represent the local model and the global model, respectively. It has been shown that our new methods are able to learn the optimal value function and the optimal policy.


Figure 5: Prediction of the next state and reward according to the global model (angular velocity (rad/s) versus angle (rad)). (a) Prediction of the angle at the next state. (b) Prediction of the angular velocity at the next state. (c) Prediction of the reward.

Figure 6: Continuous maze problem.

In the pole balancing and continuous maze problems, RAC-HMLP and AC-HMLP are compared with three representative methods. The results show that RAC-HMLP and AC-HMLP not only converge fastest but also have the best sample efficiency.

RAC-HMLP performs slightly better than AC-HMLP in convergence rate and sample efficiency. By introducing the ℓ2-regularization, the parameters learned by RAC-HMLP are smaller and more uniform than those of AC-HMLP, so that overfitting can be avoided effectively. Though RAC-HMLP behaves better, its improvement over AC-HMLP is not significant. Because AC-HMLP normalizes all the features to [0, 1], the parameters do not change heavily; as a result, overfitting can also be prevented to a certain extent.


Figure 7: Comparisons of cumulative rewards and time steps for reaching the goal (versus training episodes for S-AC, MLAC, Dyna-MLAC, AC-HMLP, and RAC-HMLP). (a) Comparisons of cumulative rewards. (b) Comparisons of time steps for reaching the goal.

Figure 8: Comparisons of different algorithms in sample efficiency (number of samples to converge versus training episodes).

S-AC is the only algorithm without model learning. The poorest performance of S-AC demonstrates that combining model learning with AC can really improve the performance. Dyna-MLAC learns a model via LLR for local planning and policy updating. However, Dyna-MLAC directly utilizes the model without making sure whether it is accurate; additionally, it does not utilize the global information about the samples. Therefore, it behaves poorer than AC-HMLP and RAC-HMLP. MLAC also approximates a local model via LLR as Dyna-MLAC does, but it only takes the model to update the policy gradient, surprisingly with a slightly better performance than Dyna-MLAC.

Dyna-MLAC and MLAC approximate the value function, the policy, and the model through LLR. In the LLR approach, the samples collected in the interaction with the environment have to be stored in memory. Through KNN or a K-d tree, only the L nearest samples in the memory are selected to learn the LLR. Such a learning process incurs considerable computation and memory costs. In AC-HMLP and RAC-HMLP, only a parameter vector has to be stored and learned for each of the value function, the policy, and the global model. Therefore, AC-HMLP and RAC-HMLP outperform Dyna-MLAC and MLAC also in computation and memory costs.
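As a rough illustration of the memory-based LLR scheme referred to here, the following sketch selects the L nearest stored samples and fits an affine least-squares map on them; it conveys the idea of the approach rather than the exact procedure used in MLAC or Dyna-MLAC.

```python
import numpy as np

def llr_predict(memory_in, memory_out, query, L=9):
    """Sketch of local linear regression over the L nearest stored samples.

    memory_in: (N, d) stored inputs (e.g., state-action pairs); memory_out:
    (N, m) stored targets (e.g., next state and reward). An affine least-squares
    map is fitted on the L nearest neighbours of `query` and evaluated there.
    """
    dists = np.linalg.norm(memory_in - query, axis=1)
    nearest = np.argsort(dists)[:L]                          # KNN selection
    X = np.hstack([memory_in[nearest],
                   np.ones((len(nearest), 1))])              # affine design matrix
    W, *_ = np.linalg.lstsq(X, memory_out[nearest], rcond=None)
    return np.append(query, 1.0) @ W                         # local prediction
```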

The planning times for the local model and the global model have to be determined according to the experimental performance. Thus, we have to set the planning times according to the different domains. To address this problem, our future work will consider how to determine the planning times adaptively for different domains. Moreover, with the development of deep learning [33, 34], deep RL has succeeded in many applications, such as AlphaGo and Atari 2600, by taking visual images or videos as input. However, how to accelerate the learning process in deep RL through model learning is still an open question. Future work will endeavor to combine deep RL with hierarchical model learning and planning to solve real-world problems.

Competing Interests

The authors declare that there are no competing interests regarding the publication of this article.


Figure 9: Final optimal policy and value function after training (coordinate y versus coordinate x). (a) Optimal policy learned by AC-HMLP. (b) Optimal value function learned by AC-HMLP.

Acknowledgments

This research was partially supported by the Innovation Center of Novel Software Technology and Industrialization, the National Natural Science Foundation of China (61502323, 61502329, 61272005, 61303108, 61373094, and 61472262), the Natural Science Foundation of Jiangsu (BK2012616), the High School Natural Foundation of Jiangsu (13KJB520020), the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172014K04), and the Suzhou Industrial Application of Basic Research Program Part (SYG201422).

References

[1] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, Mass, USA, 1998.
[2] M. L. Littman, "Reinforcement learning improves behaviour from evaluative feedback," Nature, vol. 521, no. 7553, pp. 445–451, 2015.
[3] D. Silver, R. S. Sutton, and M. Muller, "Temporal-difference search in computer Go," Machine Learning, vol. 87, no. 2, pp. 183–219, 2012.
[4] J. Kober and J. Peters, "Reinforcement learning in robotics: a survey," in Reinforcement Learning, pp. 579–610, Springer, Berlin, Germany, 2012.
[5] D.-R. Liu, H.-L. Li, and D. Wang, "Feature selection and feature learning for high-dimensional batch reinforcement learning: a survey," International Journal of Automation and Computing, vol. 12, no. 3, pp. 229–242, 2015.
[6] R. E. Bellman, Dynamic Programming, Princeton University Press, Princeton, NJ, USA, 1957.
[7] D. P. Bertsekas, Dynamic Programming and Optimal Control, Athena Scientific, Belmont, Mass, USA, 1995.
[8] F. L. Lewis and D. R. Liu, Reinforcement Learning and Approximate Dynamic Programming for Feedback Control, John Wiley & Sons, Hoboken, NJ, USA, 2012.
[9] P. J. Werbos, "Advanced forecasting methods for global crisis warning and models of intelligence," General Systems, vol. 22, pp. 25–38, 1977.
[10] D. R. Liu, H. L. Li, and D. Wang, "Data-based self-learning optimal control: research progress and prospects," Acta Automatica Sinica, vol. 39, no. 11, pp. 1858–1870, 2013.
[11] H. G. Zhang, D. R. Liu, Y. H. Luo, and D. Wang, Adaptive Dynamic Programming for Control: Algorithms and Stability, Communications and Control Engineering Series, Springer, London, UK, 2013.
[12] A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, "Discrete-time nonlinear HJB solution using approximate dynamic programming: convergence proof," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 38, no. 4, pp. 943–949, 2008.
[13] T. Dierks, B. T. Thumati, and S. Jagannathan, "Optimal control of unknown affine nonlinear discrete-time systems using offline-trained neural networks with proof of convergence," Neural Networks, vol. 22, no. 5-6, pp. 851–860, 2009.
[14] F.-Y. Wang, N. Jin, D. Liu, and Q. Wei, "Adaptive dynamic programming for finite-horizon optimal control of discrete-time nonlinear systems with ε-error bound," IEEE Transactions on Neural Networks, vol. 22, no. 1, pp. 24–36, 2011.
[15] K. Doya, "Reinforcement learning in continuous time and space," Neural Computation, vol. 12, no. 1, pp. 219–245, 2000.
[16] J. J. Murray, C. J. Cox, G. G. Lendaris, and R. Saeks, "Adaptive dynamic programming," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 32, no. 2, pp. 140–153, 2002.
[17] T. Hanselmann, L. Noakes, and A. Zaknich, "Continuous-time adaptive critics," IEEE Transactions on Neural Networks, vol. 18, no. 3, pp. 631–647, 2007.
[18] S. Bhasin, R. Kamalapurkar, M. Johnson, K. G. Vamvoudakis, F. L. Lewis, and W. E. Dixon, "A novel actor-critic-identifier architecture for approximate optimal control of uncertain nonlinear systems," Automatica, vol. 49, no. 1, pp. 82–92, 2013.
[19] L. Busoniu, R. Babuska, B. De Schutter, et al., Reinforcement Learning and Dynamic Programming Using Function Approximators, CRC Press, New York, NY, USA, 2010.
[20] A. G. Barto, R. S. Sutton, and C. W. Anderson, "Neuronlike adaptive elements that can solve difficult learning control problems," IEEE Transactions on Systems, Man, and Cybernetics, vol. 13, no. 5, pp. 834–846, 1983.
[21] H. Van Hasselt and M. A. Wiering, "Reinforcement learning in continuous action spaces," in Proceedings of the IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning (ADPRL '07), pp. 272–279, Honolulu, Hawaii, USA, April 2007.
[22] S. Bhatnagar, R. S. Sutton, M. Ghavamzadeh, and M. Lee, "Natural actor-critic algorithms," Automatica, vol. 45, no. 11, pp. 2471–2482, 2009.
[23] I. Grondman, L. Busoniu, G. A. D. Lopes, and R. Babuska, "A survey of actor-critic reinforcement learning: standard and natural policy gradients," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 42, no. 6, pp. 1291–1307, 2012.
[24] I. Grondman, L. Busoniu, and R. Babuska, "Model learning actor-critic algorithms: performance evaluation in a motion control task," in Proceedings of the 51st IEEE Conference on Decision and Control (CDC '12), pp. 5272–5277, Maui, Hawaii, USA, December 2012.
[25] I. Grondman, M. Vaandrager, L. Busoniu, R. Babuska, and E. Schuitema, "Efficient model learning methods for actor-critic control," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 42, no. 3, pp. 591–602, 2012.
[26] B. Costa, W. Caarls, and D. S. Menasche, "Dyna-MLAC: trading computational and sample complexities in actor-critic reinforcement learning," in Proceedings of the Brazilian Conference on Intelligent Systems (BRACIS '15), pp. 37–42, Natal, Brazil, November 2015.
[27] O. Andersson, F. Heintz, and P. Doherty, "Model-based reinforcement learning in continuous environments using real-time constrained optimization," in Proceedings of the 29th AAAI Conference on Artificial Intelligence (AAAI '15), pp. 644–650, Texas, USA, January 2015.
[28] R. S. Sutton, C. Szepesvari, A. Geramifard, and M. Bowling, "Dyna-style planning with linear function approximation and prioritized sweeping," in Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence (UAI '08), pp. 528–536, July 2008.
[29] R. Faulkner and D. Precup, "Dyna planning using a feature based generative model," in Proceedings of the Advances in Neural Information Processing Systems (NIPS '10), pp. 1–9, Vancouver, Canada, December 2010.
[30] H. Yao and C. Szepesvari, "Approximate policy iteration with linear action models," in Proceedings of the 26th National Conference on Artificial Intelligence (AAAI '12), Toronto, Canada, July 2012.
[31] H. R. Berenji and P. Khedkar, "Learning and tuning fuzzy logic controllers through reinforcements," IEEE Transactions on Neural Networks, vol. 3, no. 5, pp. 724–740, 1992.
[32] R. S. Sutton, "Generalization in reinforcement learning: successful examples using sparse coarse coding," in Advances in Neural Information Processing Systems, pp. 1038–1044, MIT Press, New York, NY, USA, 1996.
[33] D. Silver, A. Huang, C. J. Maddison, et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[34] V. Mnih, K. Kavukcuoglu, D. Silver, et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.


Page 8: Research Article Efficient Actor-Critic Algorithm with ... · Research Article Efficient Actor-Critic Algorithm with Hierarchical Model Learning and Planning ShanZhong, 1,2 QuanLiu,

8 Computational Intelligence and Neuroscience

F

w

Figure 1 Pole balancing problem

The update for the global model can be denoted as

120578119905+11

= 1205781199051(1 minus

120572119898

numberℓ119898) +

120572119898

number

sdot

numbersum

119904=1

120601 (1199091199041 119909

119904119870 119906119904) (119909119904+11

minus 1199091015840

119904+11)

120578119905+12

= 1205781199052(1 minus

120572119898

numberℓ119898) +

120572119898

number

sdot

numbersum

119904=1

120601 (1199091199041 119909

119904119870 119906119904) (119909119904+12

minus 1199091015840

119904+12)

120578119905+1119870

= 120578119905119870

(1 minus120572119898

numberℓ119898) +

120572119898

number

sdot

numbersum

119904=1

120601 (1199091199041 119909

119904119870 119906119904) (119909119904+1119870

minus 1199091015840

119904+11)

(20)

120589119905+1

= 120589119905(1 minus

120572119898

numberℓ119898) +

120572119898

number

sdot

numbersum

119904=1

120601 (1199091199041 119909

119904119870 119906119904) (119903119904+1

minus 1199031015840

119904+1)

(21)

where ℓ119898ge 0 is the regularization parameter for the model

namely the state transition function and the reward functionAfter we replace the update equations of the parameters

in Algorithms 2 3 and 4 with (18) (19) (20) and (21) we willget the resultant algorithm RAC-HMLP Except for the aboveupdate equations the other parts of RAC-HMLP are the samewith AC-HMLP so we will not specify here

5 Empirical Results and Analysis

AC-HMLP and RAC-HMLP are compared with S-ACMLAC andDyna-MLAC on two continuous state and actionspaces problems pole balancing problem [31] and continuousmaze problem [32]

51 Pole Balancing Problem Pole balancing problem is a low-dimension but challenging benchmark problem widely usedin RL literature shown in Figure 1

There is a car moving along the track with a hinged poleon its topThegoal is to find a policywhich can guide the force

to keep the pole balanceThe system dynamics is modeled bythe following equation

=119892 sin (119908) minus V119898119897 sin (2119908) 2 minus V cos (119908) 119865

41198973 minus V119898119897cos2 (119908) (22)

where 119908 is the angle of the pole with the vertical line and are the angular velocity and the angular acceleration of thepole 119865 is the force exerted on the cart The negative valuemeans the force to the right and otherwise means to the left119892 is the gravity constant with the value 119892 = 981ms2119898 and119897 are the length and the mass of the pole which are set to119898 =

20 kg and 119897 = 05m respectively V is a constantwith the value1(119898 + 119898

119888) where119898

119888= 80 kg is the mass of the car

The state 119909 = (119908 ) is composed of 119908 and which arebounded by [minus120587 120587] and [minus2 2] respectivelyThe action is theforce 119865 bounded by [minus50N 50N] The maximal episode is300 and the maximal time step for each episode is 3000 Ifthe angle of the pole with the vertical line is not exceeding1205874at each time step the agent will receive a reward 1 otherwisethe pole will fall down and receive a reward minus1 There is alsoa random noise applying on the force bounded by [minus10N10N] An episode ends when the pole has kept balance for3000 time steps or the pole falls downThe pole will fall downif |119908| ge 1205874 To approximate the value function and thereward function the state-based feature is coded by radialbasis functions (RBFs) shown as

120601119894(119909) = 119890

minus(12)(119909minus119888119894)T119861minus1(119909minus119888119894) (23)

where 119888119894denotes the 119894th center point locating over the grid

points minus06 minus04 minus02 0 02 04 06 times minus2 0 2 119861 is adiagonal matrix containing the widths of the RBFs with 1205902

119908=

02 and 1205902= 2The dimensionality 119911 of 120601(119909) is 21 Notice that

the approximations of the value function and the policy onlyrequire the state-based feature However the approximationof the model requires the state-action-based feature Let 119909 be(119908 119906) we can also compute its feature by (23) In this case119888119894locates over the points minus06 minus04 minus02 0 02 04 06 times

minus2 0 2 times minus50 minus25 0 25 50 The diagonal elements of 119861are 1205902119908= 02 1205902

= 2 and 120590

2

119865= 10 The dimensionality 119911

of 120601(119908 119906) is 105 Either the state-based or the state-action-based feature has to be normalized so as to make its value besmaller than 1 shown as

120601119894(119909) =

120601119894(119909)

sum119911

119894=1120601119894(119909)

(24)

where 119911 gt 0 is the dimensionality of the featureRAC-HMLP and AC-HMLP are compared with S-AC

MLAC and Dyna-MLAC on this experiment The parame-ters of S-AC MLAC and Dyna-MLAC are set according tothe values mentioned in their papers The parameters of AC-HMLP and RAC-HMLP are set as shown in Table 1

The planning times may have significant effect on theconvergence performance Therefore we have to determinethe local planning times 119875

119897and the global planning times 119875

119892

at first The selective settings for the two parameters are (30300) (50 300) (30 600) and (50 600) From Figure 2(a) it

Computational Intelligence and Neuroscience 9

Table 1 Parameters settings of RAC-HMLP and AC-HMLP

Parameter Symbol ValueTime step 119879

11990401

Discount factor 120574 09Trace-decay rate 120582 09Exploration variance 120590

2 1Learning rate of the actor 120572

11988605

Learning rate of the critic 120572119888

04Learning rate of the model 120572

11989805

Error threshold 120585 015Capacity of the memory 119872 size 100Number of the nearest samples 119871 9Local planning times 119875

11989730

Global planning times 119875119892 300

Number of components of the state K 2Regularization parameter of the model ℓ

11989802

Regularization parameter of the critic ℓ119888

001Regularization parameter of the actor ℓ

1198860001

is easy to find out that (119875119897= 30119875

119892= 300) behaves best in these

four settings with 41 episodes for convergence (119875119897= 30 119875

119892=

600) learns fastest in the early 29 episodes but it convergesuntil the 52nd episode (119875

119897= 30 119875

119892= 600) and (119875

119897= 30 119875

119892=

300) have identical local planning times but different globalplanning times Generally the more the planning times thefaster the convergence rate but (119875

119897= 30 119875

119892= 300) performs

better instead The global model is not accurate enough atthe initial time Planning through such an inaccurate globalmodel will lead to an unstable performance Notice that (119875

119897=

50 119875119892= 300) and (119875

119897= 50 119875

119892= 600) behave poorer than (119875

119897

= 30 119875119892= 600) and (119875

119897= 30 119875

119892= 300) which demonstrates

that planning too much via the local model will not performbetter (119875

119897= 50119875

119892= 300) seems to converge at the 51st episode

but its learning curve fluctuates heavily after 365 episodes notconverging any more until the end Like (119875

119897= 50 119875

119892= 300)

(119875119897= 50 119875

119892= 600) also has heavy fluctuation but converges

at the 373rd episode Evidently (119875119897= 50 119875

119892= 300) performs

slightly better than (119875119897= 50 119875

119892= 300) Planning through the

global model might solve the nonstability problem caused byplanning of the local model

The convergence performances of the fivemethods RAC-HMLP AC-HMLP S-AC MLAC and Dyna-MLAC areshown in Figure 2(b) It is evident that our methods RAC-HMLP and AC-HMLP have the best convergence perfor-mances RAC-HMLP and AC-HMLP converge at the 39thand 41st episodes Both of them learn quickly in the primaryphase but the learning curve of RAC-HMLP seems to besteeper than that of AC-HMLP Dyna-MLAC learns fasterthan MLAC and S-AC in the former 74 episodes but itconverges until the 99th episode Though MLAC behavespoorer than Dyna-MLAC it requires just 82 episodes toconverge The method with the slowest learning rate is S-AC where the pole can keep balance for 3000 time stepsfor the first time at the 251st episode Unfortunately it con-verges until the 333rd episode RAC-HMLP converges fastest

0

500

1000

1500

2000

2500

3000

Num

ber o

f bal

ance

d tim

e ste

ps

50 100 150 200 250 300 350 4000Training episodes

Pl = 30 Pg = 300

Pl = 50 Pg = 300

Pl = 30 Pg = 600

Pl = 50 Pg = 600

(a) Determination of different planning times

0

500

1000

1500

2000

2500

3000

Num

ber o

f bal

ance

d tim

e ste

ps

50 100 150 200 250 300 350 4000Training episodes

S-ACMLACDyna-MLAC

AC-HMLPRAC-HMLP

(b) Comparisons of different algorithms in convergence performance

Figure 2 Comparisons of different planning times and differentalgorithms

which might be caused by introducing the ℓ2-regularization

Because the ℓ2-regularization does not admit the parameter

to grow rapidly the overfitting of the learning process canbe avoided effectively Dyna-MLAC converges faster thanMLAC in the early 74 episodes but its performance is notstable enough embodied in the heavy fluctuation from the75th to 99th episode If the approximate-local model is dis-tinguished largely from the real-local model then planningthrough such an inaccurate local model might lead to anunstable performance S-AC as the only method withoutmodel learning behaves poorest among the fiveThese resultsshow that the convergence performance can be improvedlargely by introducing model learning

The comparisons of the five different algorithms insample efficiency are shown in Figure 3 It is clear that thenumbers of the required samples for S-AC MLAC Dyna-MLAC AC-HMLP and RAC-HMLP to converge are 56003

10 Computational Intelligence and Neuroscience

S-ACMLACDyna-MLAC

AC-HMLPRAC-HMLP

50 100 150 200 250 300 350 4000Training episodes

0

1

2

3

4

5

6

Num

ber o

f sam

ples

to co

nver

ge

times104

Figure 3 Comparisons of sample efficiency

35535 32043 14426 and 12830 respectively Evidently RAC-HMLP converges fastest and behaves stably all the timethus requiring the least samplesThough AC-HMLP requiresmore samples to converge than RAC-HMLP it still requiresfar fewer samples than the other three methods Because S-AC does not utilize the model it requires the most samplesamong the five Unlike MLAC Dyna-MLAC applies themodel not only in updating the policy but also in planningresulting in a fewer requirement for the samples

The optimal policy and optimal value function learned byAC-HMLP after the training ends are shown in Figures 4(a)and 4(b) respectively while the ones learned by RAC-HMLPare shown in Figures 4(c) and 4(d) Evidently the optimalpolicy and the optimal value function learned by AC-HMLPand RAC-HMLP are quite similar but RAC-HMLP seems tohave more fine-grained optimal policy and value function

As for the optimal policy (see Figures 4(a) and 4(c))the force becomes smaller and smaller from the two sides(left side and right side) to the middle The top-right is theregion requiring a force close to 50N where the directionof the angle is the same with that of angular velocity Thevalues of the angle and the angular velocity are nearing themaximum values 1205874 and 2Therefore the largest force to theleft is required so as to guarantee that the angle between theupright line and the pole is no bigger than 1205874 It is oppositein the bottom-left region where a force attributing to [minus50Nminus40N] is required to keep the angle from surpassing minus1205874The pole can keep balance with a gentle force close to 0 in themiddle regionThedirection of the angle is different from thatof angular velocity in the top-left and bottom-right regionsthus a force with the absolute value which is relatively largebut smaller than 50N is required

In terms of the optimal value function (see Figures 4(b)and 4(d)) the value function reaches a maximum in theregion satisfying minus2 le 119908 le 2 The pole is more prone tokeep balance even without applying any force in this region

resulting in the relatively larger value function The pole ismore and more difficult to keep balance from the middle tothe two sides with the value function also decaying graduallyThe fine-grained value of the left side compared to the rightonemight be caused by themore frequent visitation to the leftside More visitation will lead to a more accurate estimationabout the value function

After the training ends the prediction of the next stateand the reward for every state-action pair can be obtainedthrough the learnedmodelThe predictions for the next angle119908 the next angular velocity and the reward for any possiblestate are shown in Figures 5(a) 5(b) and 5(c) It is noticeablethat the predicted-next state is always near the current statein Figures 5(a) and 5(b) The received reward is always largerthan 0 in Figure 5(c) which illustrates that the pole canalways keep balance under the optimal policy

52 Continuous Maze Problem Continuous maze problemis shown in Figure 6 where the blue lines with coordinatesrepresent the barriersThe state (119909 119910) consists of the horizon-tal coordinate 119909 isin [0 1] and vertical coordinate 119910 isin [0 1]Starting from the position ldquoStartrdquo the goal of the agent isto reach the position ldquoGoalrdquo that satisfies 119909 + 119910 gt 18 Theaction is to choose an angle bounded by [minus120587 120587] as the newdirection and then walk along this direction with a step 01The agent will receive a reward minus1 at each time step Theagent will be punished with a reward minus400 multiplying thedistance if the distance between its position and the barrierexceeds 01 The parameters are set the same as the formerproblem except for the planning times The local and globalplanning times are determined as 10 and 50 in the sameway of the former experiment The state-based feature iscomputed according to (23) and (24) with the dimensionality119911 = 16 The center point 119888

119894locates over the grid points

02 04 06 08 times 02 04 06 08 The diagonal elementsof 119861 are 120590

2

119909= 02 and 120590

2

119910= 02 The state-action-based

feature is also computed according to (23) and (24) Thecenter point 119888

119894locates over the grid points 02 04 06 08 times

02 04 06 08timesminus3 minus15 0 15 3with the dimensionality119911 = 80The diagonal elements of119861 are 1205902

119909= 02 1205902

119910= 02 and

1205902

119906= 15RAC-HMLP and AC-HMLP are compared with S-AC

MLAC and Dyna-MLAC in cumulative rewards The resultsare shown in Figure 7(a) RAC-HMLP learns fastest in theearly 15 episodesMLACbehaves second best andAC-HMLPperforms poorest RAC-HMLP tends to converge at the 24thepisode but it really converges until the 39th episode AC-HMLP behaves steadily starting from the 23rd episode to theend and it converges at the 43rd episode Like the formerexperiment RAC-HMLP performs slightly better than AC-HMLP embodied in the cumulative rewards minus44 comparedto minus46 At the 53rd episode the cumulative rewards ofS-AC fall to about minus545 quickly without any ascendingthereafter Though MLAC and Dyna-MLAC perform wellin the primary phase their curves start to descend at the88th episode and the 93rd episode respectively MLAC andDyna-MLAC learn the local model to update the policygradient resulting in a fast learning rate in the primary phase

Computational Intelligence and Neuroscience 11

minus2

minus1

0

1

2

Ang

ular

velo

city

(rad

s)

0 05minus05Angle (rad)

minus50

0

50

(a) Optimal policy of AC-HMLP learned after training

minus2

minus1

0

1

2

Ang

ular

velo

city

(rad

s)

0 05minus05Angle (rad)

92

94

96

98

10

(b) Optimal value function of AC-HMLP learned after training

minus2

minus1

0

1

2

Ang

ular

velo

city

(rad

s)

0 05minus05Angle (rad)

minus40

minus20

0

20

40

(c) Optimal policy of RAC-HMLP learned after training

minus2

minus1

0

1

2

Ang

ular

velo

city

(rad

s)

0 05minus05Angle (rad)

9

92

94

96

98

10

(d) Optimal value function of RAC-HMLP learned after training

Figure 4 Optimal policy and value function learned by AC-HMLP and RAC-HMLP

However they do not behave stably enough near the end ofthe training Too much visitation to the barrier region mightcause the fast descending of the cumulative rewards

The comparisons of the time steps for reaching goal aresimulated which are shown in Figure 7(b) Obviously RAC-HMLP and AC-HMLP perform better than the other threemethods RAC-HMLP converges at 39th episode while AC-HMLP converges at 43rd episode with the time steps forreaching goal being 45 and 47 It is clear that RAC-HMLPstill performs slightly better than AC-HMLP The time stepsfor reaching goal are 548 69 and 201 for S-AC MLACandDyna-MLAC Among the fivemethods RAC-HMLP notonly converges fastest but also has the best solution 45 Thepoorest performance of S-AC illustrates that model learningcan definitely improve the convergence performance As inthe former experiment Dyna-MLAC behaves poorer thanMLAC during training If the model is inaccurate planningvia such model might influence the estimations of the valuefunction and the policy thus leading to a poor performancein Dyna-MLAC

The comparisons of sample efficiency are shown inFigure 8 The required samples for S-AC MLAC Dyna-MLAC AC-HMLP and RAC-HMLP to converge are 105956588 7694 4388 and 4062 respectively As in pole balancingproblem RAC-HMLP also requires the least samples whileS-AC needs the most to converge The difference is that

Dyna-MLAC requires samples slightly more than MLACThe ascending curve of Dyna-MLAC at the end of thetraining demonstrates that it has not convergedThe frequentvisitation to the barrier in Dyna-MLAC leads to the quickdescending of the value functions As a result enormoussamples are required to make these value functions moreprone to the true ones

After the training ends the approximate optimal policyand value function are obtained shown in Figures 9(a) and9(b) It is noticeable that the low part of the figure is exploredthoroughly so that the policy and the value function are verydistinctive in this region For example inmost of the region inFigure 9(a) a larger angle is needed for the agent so as to leavethe current state Clearly the nearer the current state and thelow part of Figure 9(b) the smaller the corresponding valuefunction The top part of the figures may be not frequentlyor even not visited by the agent resulting in the similar valuefunctionsThe agent is able to reach the goal onlywith a gentleangle in these areas

6 Conclusion and Discussion

This paper proposes two novel actor-critic algorithms AC-HMLP and RAC-HMLP Both of them take LLR and LFA torepresent the local model and global model respectively Ithas been shown that our new methods are able to learn the

12 Computational Intelligence and Neuroscience

minus2

minus15

minus1

minus05

0

05

1

15

2A

ngle

velo

city

(rad

s)

minus05

minus04

minus03

minus02

minus01

0

01

02

03

0 05minus05Angle (rad)

(a) Prediction of the angle at next state

minus2

minus15

minus1

minus05

0

05

1

15

2

Ang

le v

eloci

ty (r

ads

)

0 05minus05Angle (rad)

minus3

minus25

minus2

minus15

minus1

minus05

0

05

1

15

2

(b) Prediction of the angular velocity at next state

minus2

minus15

minus1

minus05

0

05

1

15

2

Ang

le v

eloci

ty (r

ads

)

0 05minus05Angle (rad)

055

06

065

07

075

08

085

09

095

1

(c) Prediction of the reward

Figure 5 Prediction of the next state and reward according to the global model

1

(03 07)(055 07)

(055 032)

(055 085)Goal

Start

1

0

Figure 6 Continuous maze problem

optimal value function and the optimal policy In the polebalancing and continuous maze problems RAC-HMLP andAC-HMLP are compared with three representative methodsThe results show that RAC-HMLP and AC-HMLP not onlyconverge fastest but also have the best sample efficiency

RAC-HMLP performs slightly better than AC-HMLP inconvergence rate and sample efficiency By introducing ℓ

2-

regularization the parameters learned by RAC-HMLP willbe smaller and more uniform than those of AC-HMLP sothat the overfitting can be avoided effectively Though RAC-HMLPbehaves better its improvement overAC-HMLP is notsignificant Because AC-HMLP normalizes all the features to

Computational Intelligence and Neuroscience 13

minus1000

minus900

minus800

minus700

minus600

minus500

minus400

minus300

minus200

minus100

0

Cum

ulat

ive r

ewar

ds

20 40 60 80 1000Training episodes

S-ACMLACDyna-MLAC

AC-HMLPRAC-HMLP

(a) Comparisons of cumulative rewards

0

100

200

300

400

500

600

700

800

900

1000

Tim

e ste

ps fo

r rea

chin

g go

al

20 40 60 80 1000Training episodes

S-ACMLACDyna-MLAC

AC-HMLPRAC-HMLP

(b) Comparisons of time steps for reaching goal

Figure 7 Comparisons of cumulative rewards and time steps for reaching goal

0

2000

4000

6000

8000

10000

12000

Num

ber o

f sam

ples

to co

nver

ge

20 40 60 80 1000Training episodes

S-ACMLACDyna-MLAC

AC-HMLPRAC-HMLP

Figure 8 Comparisons of different algorithms in sample efficiency

[0 1] the parameters will not change heavily As a result theoverfitting can also be prohibited to a certain extent

S-AC is the only algorithm without model learning Thepoorest performance of S-AC demonstrates that combiningmodel learning and AC can really improve the performanceDyna-MLAC learns a model via LLR for local planning andpolicy updating However Dyna-MLAC directly utilizes themodel before making sure whether it is accurate addition-ally it does not utilize the global information about thesamples Therefore it behaves poorer than AC-HMLP and

RAC-HMLP MLAC also approximates a local model viaLLR as Dyna-MLAC does but it only takes the model toupdate the policy gradient surprisingly with a slightly betterperformance than Dyna-MLAC

Dyna-MLAC andMLAC approximate the value functionthe policy and the model through LLR In the LLR approachthe samples collected in the interaction with the environmenthave to be stored in the memory Through KNN or119870-d treeonly 119871-nearest samples in the memory are selected to learnthe LLR Such a learning process takes a lot of computationand memory costs In AC-HMLP and RAC-HMLP there isonly a parameter vector to be stored and learned for any of thevalue function the policy and the global model ThereforeAC-HMLP and RAC-HMLP outperform Dyna-MLAC andMLAC also in computation and memory costs

The planning times for the local model and the globalmodel have to be determined according the experimentalperformanceThus we have to set the planning times accord-ing to the different domains To address this problem ourfuture work will consider how to determine the planningtimes adaptively according to the different domains More-over with the development of the deep learning [33 34]deepRL has succeeded inmany applications such asAlphaGoand Atari 2600 by taking the visual images or videos asinput However how to accelerate the learning process indeep RL through model learning is still an open questionThe future work will endeavor to combine deep RL withhierarchical model learning and planning to solve the real-world problems

Competing Interests

The authors declare that there are no competing interests regarding the publication of this article.

Figure 9: Final optimal policy (a) and optimal value function (b) learned by AC-HMLP after training, plotted over the coordinates x and y.

Acknowledgments

This research was partially supported by the Innovation Center of Novel Software Technology and Industrialization, the National Natural Science Foundation of China (61502323, 61502329, 61272005, 61303108, 61373094, and 61472262), the Natural Science Foundation of Jiangsu (BK2012616), the High School Natural Foundation of Jiangsu (13KJB520020), the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172014K04), and the Suzhou Industrial Application of Basic Research Program Part (SYG201422).

References

[1] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, Mass, USA, 1998.
[2] M. L. Littman, "Reinforcement learning improves behaviour from evaluative feedback," Nature, vol. 521, no. 7553, pp. 445–451, 2015.
[3] D. Silver, R. S. Sutton, and M. Muller, "Temporal-difference search in computer Go," Machine Learning, vol. 87, no. 2, pp. 183–219, 2012.
[4] J. Kober and J. Peters, "Reinforcement learning in robotics: a survey," in Reinforcement Learning, pp. 579–610, Springer, Berlin, Germany, 2012.
[5] D.-R. Liu, H.-L. Li, and D. Wang, "Feature selection and feature learning for high-dimensional batch reinforcement learning: a survey," International Journal of Automation and Computing, vol. 12, no. 3, pp. 229–242, 2015.
[6] R. E. Bellman, Dynamic Programming, Princeton University Press, Princeton, NJ, USA, 1957.
[7] D. P. Bertsekas, Dynamic Programming and Optimal Control, Athena Scientific, Belmont, Mass, USA, 1995.
[8] F. L. Lewis and D. R. Liu, Reinforcement Learning and Approximate Dynamic Programming for Feedback Control, John Wiley & Sons, Hoboken, NJ, USA, 2012.
[9] P. J. Werbos, "Advanced forecasting methods for global crisis warning and models of intelligence," General Systems, vol. 22, pp. 25–38, 1977.
[10] D. R. Liu, H. L. Li, and D. Wang, "Data-based self-learning optimal control: research progress and prospects," Acta Automatica Sinica, vol. 39, no. 11, pp. 1858–1870, 2013.
[11] H. G. Zhang, D. R. Liu, Y. H. Luo, and D. Wang, Adaptive Dynamic Programming for Control: Algorithms and Stability, Communications and Control Engineering Series, Springer, London, UK, 2013.
[12] A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, "Discrete-time nonlinear HJB solution using approximate dynamic programming: convergence proof," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 38, no. 4, pp. 943–949, 2008.
[13] T. Dierks, B. T. Thumati, and S. Jagannathan, "Optimal control of unknown affine nonlinear discrete-time systems using offline-trained neural networks with proof of convergence," Neural Networks, vol. 22, no. 5-6, pp. 851–860, 2009.
[14] F.-Y. Wang, N. Jin, D. Liu, and Q. Wei, "Adaptive dynamic programming for finite-horizon optimal control of discrete-time nonlinear systems with ε-error bound," IEEE Transactions on Neural Networks, vol. 22, no. 1, pp. 24–36, 2011.
[15] K. Doya, "Reinforcement learning in continuous time and space," Neural Computation, vol. 12, no. 1, pp. 219–245, 2000.
[16] J. J. Murray, C. J. Cox, G. G. Lendaris, and R. Saeks, "Adaptive dynamic programming," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 32, no. 2, pp. 140–153, 2002.
[17] T. Hanselmann, L. Noakes, and A. Zaknich, "Continuous-time adaptive critics," IEEE Transactions on Neural Networks, vol. 18, no. 3, pp. 631–647, 2007.
[18] S. Bhasin, R. Kamalapurkar, M. Johnson, K. G. Vamvoudakis, F. L. Lewis, and W. E. Dixon, "A novel actor-critic-identifier architecture for approximate optimal control of uncertain nonlinear systems," Automatica, vol. 49, no. 1, pp. 82–92, 2013.
[19] L. Busoniu, R. Babuska, B. De Schutter, et al., Reinforcement Learning and Dynamic Programming Using Function Approximators, CRC Press, New York, NY, USA, 2010.
[20] A. G. Barto, R. S. Sutton, and C. W. Anderson, "Neuronlike adaptive elements that can solve difficult learning control problems," IEEE Transactions on Systems, Man, and Cybernetics, vol. 13, no. 5, pp. 834–846, 1983.
[21] H. Van Hasselt and M. A. Wiering, "Reinforcement learning in continuous action spaces," in Proceedings of the IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning (ADPRL '07), pp. 272–279, Honolulu, Hawaii, USA, April 2007.
[22] S. Bhatnagar, R. S. Sutton, M. Ghavamzadeh, and M. Lee, "Natural actor-critic algorithms," Automatica, vol. 45, no. 11, pp. 2471–2482, 2009.
[23] I. Grondman, L. Busoniu, G. A. D. Lopes, and R. Babuska, "A survey of actor-critic reinforcement learning: standard and natural policy gradients," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 42, no. 6, pp. 1291–1307, 2012.
[24] I. Grondman, L. Busoniu, and R. Babuska, "Model learning actor-critic algorithms: performance evaluation in a motion control task," in Proceedings of the 51st IEEE Conference on Decision and Control (CDC '12), pp. 5272–5277, Maui, Hawaii, USA, December 2012.
[25] I. Grondman, M. Vaandrager, L. Busoniu, R. Babuska, and E. Schuitema, "Efficient model learning methods for actor-critic control," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 42, no. 3, pp. 591–602, 2012.
[26] B. Costa, W. Caarls, and D. S. Menasche, "Dyna-MLAC: trading computational and sample complexities in actor-critic reinforcement learning," in Proceedings of the Brazilian Conference on Intelligent Systems (BRACIS '15), pp. 37–42, Natal, Brazil, November 2015.
[27] O. Andersson, F. Heintz, and P. Doherty, "Model-based reinforcement learning in continuous environments using real-time constrained optimization," in Proceedings of the 29th AAAI Conference on Artificial Intelligence (AAAI '15), pp. 644–650, Texas, Tex, USA, January 2015.
[28] R. S. Sutton, C. Szepesvari, A. Geramifard, and M. Bowling, "Dyna-style planning with linear function approximation and prioritized sweeping," in Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence (UAI '08), pp. 528–536, July 2008.
[29] R. Faulkner and D. Precup, "Dyna planning using a feature based generative model," in Proceedings of the Advances in Neural Information Processing Systems (NIPS '10), pp. 1–9, Vancouver, Canada, December 2010.
[30] H. Yao and C. Szepesvari, "Approximate policy iteration with linear action models," in Proceedings of the 26th National Conference on Artificial Intelligence (AAAI '12), Toronto, Canada, July 2012.
[31] H. R. Berenji and P. Khedkar, "Learning and tuning fuzzy logic controllers through reinforcements," IEEE Transactions on Neural Networks, vol. 3, no. 5, pp. 724–740, 1992.
[32] R. S. Sutton, "Generalization in reinforcement learning: successful examples using sparse coarse coding," in Advances in Neural Information Processing Systems, pp. 1038–1044, MIT Press, New York, NY, USA, 1996.
[33] D. Silver, A. Huang, C. J. Maddison, et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[34] V. Mnih, K. Kavukcuoglu, D. Silver, et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.




Page 10: Research Article Efficient Actor-Critic Algorithm with ... · Research Article Efficient Actor-Critic Algorithm with Hierarchical Model Learning and Planning ShanZhong, 1,2 QuanLiu,

10 Computational Intelligence and Neuroscience

S-ACMLACDyna-MLAC

AC-HMLPRAC-HMLP

50 100 150 200 250 300 350 4000Training episodes

0

1

2

3

4

5

6

Num

ber o

f sam

ples

to co

nver

ge

times104

Figure 3 Comparisons of sample efficiency

35535 32043 14426 and 12830 respectively Evidently RAC-HMLP converges fastest and behaves stably all the timethus requiring the least samplesThough AC-HMLP requiresmore samples to converge than RAC-HMLP it still requiresfar fewer samples than the other three methods Because S-AC does not utilize the model it requires the most samplesamong the five Unlike MLAC Dyna-MLAC applies themodel not only in updating the policy but also in planningresulting in a fewer requirement for the samples

The optimal policy and optimal value function learned byAC-HMLP after the training ends are shown in Figures 4(a)and 4(b) respectively while the ones learned by RAC-HMLPare shown in Figures 4(c) and 4(d) Evidently the optimalpolicy and the optimal value function learned by AC-HMLPand RAC-HMLP are quite similar but RAC-HMLP seems tohave more fine-grained optimal policy and value function

As for the optimal policy (see Figures 4(a) and 4(c))the force becomes smaller and smaller from the two sides(left side and right side) to the middle The top-right is theregion requiring a force close to 50N where the directionof the angle is the same with that of angular velocity Thevalues of the angle and the angular velocity are nearing themaximum values 1205874 and 2Therefore the largest force to theleft is required so as to guarantee that the angle between theupright line and the pole is no bigger than 1205874 It is oppositein the bottom-left region where a force attributing to [minus50Nminus40N] is required to keep the angle from surpassing minus1205874The pole can keep balance with a gentle force close to 0 in themiddle regionThedirection of the angle is different from thatof angular velocity in the top-left and bottom-right regionsthus a force with the absolute value which is relatively largebut smaller than 50N is required

In terms of the optimal value function (see Figures 4(b)and 4(d)) the value function reaches a maximum in theregion satisfying minus2 le 119908 le 2 The pole is more prone tokeep balance even without applying any force in this region

resulting in the relatively larger value function The pole ismore and more difficult to keep balance from the middle tothe two sides with the value function also decaying graduallyThe fine-grained value of the left side compared to the rightonemight be caused by themore frequent visitation to the leftside More visitation will lead to a more accurate estimationabout the value function

After the training ends the prediction of the next stateand the reward for every state-action pair can be obtainedthrough the learnedmodelThe predictions for the next angle119908 the next angular velocity and the reward for any possiblestate are shown in Figures 5(a) 5(b) and 5(c) It is noticeablethat the predicted-next state is always near the current statein Figures 5(a) and 5(b) The received reward is always largerthan 0 in Figure 5(c) which illustrates that the pole canalways keep balance under the optimal policy

52 Continuous Maze Problem Continuous maze problemis shown in Figure 6 where the blue lines with coordinatesrepresent the barriersThe state (119909 119910) consists of the horizon-tal coordinate 119909 isin [0 1] and vertical coordinate 119910 isin [0 1]Starting from the position ldquoStartrdquo the goal of the agent isto reach the position ldquoGoalrdquo that satisfies 119909 + 119910 gt 18 Theaction is to choose an angle bounded by [minus120587 120587] as the newdirection and then walk along this direction with a step 01The agent will receive a reward minus1 at each time step Theagent will be punished with a reward minus400 multiplying thedistance if the distance between its position and the barrierexceeds 01 The parameters are set the same as the formerproblem except for the planning times The local and globalplanning times are determined as 10 and 50 in the sameway of the former experiment The state-based feature iscomputed according to (23) and (24) with the dimensionality119911 = 16 The center point 119888

119894locates over the grid points

02 04 06 08 times 02 04 06 08 The diagonal elementsof 119861 are 120590

2

119909= 02 and 120590

2

119910= 02 The state-action-based

feature is also computed according to (23) and (24) Thecenter point 119888

119894locates over the grid points 02 04 06 08 times

02 04 06 08timesminus3 minus15 0 15 3with the dimensionality119911 = 80The diagonal elements of119861 are 1205902

119909= 02 1205902

119910= 02 and

1205902

119906= 15RAC-HMLP and AC-HMLP are compared with S-AC

MLAC and Dyna-MLAC in cumulative rewards The resultsare shown in Figure 7(a) RAC-HMLP learns fastest in theearly 15 episodesMLACbehaves second best andAC-HMLPperforms poorest RAC-HMLP tends to converge at the 24thepisode but it really converges until the 39th episode AC-HMLP behaves steadily starting from the 23rd episode to theend and it converges at the 43rd episode Like the formerexperiment RAC-HMLP performs slightly better than AC-HMLP embodied in the cumulative rewards minus44 comparedto minus46 At the 53rd episode the cumulative rewards ofS-AC fall to about minus545 quickly without any ascendingthereafter Though MLAC and Dyna-MLAC perform wellin the primary phase their curves start to descend at the88th episode and the 93rd episode respectively MLAC andDyna-MLAC learn the local model to update the policygradient resulting in a fast learning rate in the primary phase

Computational Intelligence and Neuroscience 11

minus2

minus1

0

1

2

Ang

ular

velo

city

(rad

s)

0 05minus05Angle (rad)

minus50

0

50

(a) Optimal policy of AC-HMLP learned after training

minus2

minus1

0

1

2

Ang

ular

velo

city

(rad

s)

0 05minus05Angle (rad)

92

94

96

98

10

(b) Optimal value function of AC-HMLP learned after training

minus2

minus1

0

1

2

Ang

ular

velo

city

(rad

s)

0 05minus05Angle (rad)

minus40

minus20

0

20

40

(c) Optimal policy of RAC-HMLP learned after training

minus2

minus1

0

1

2

Ang

ular

velo

city

(rad

s)

0 05minus05Angle (rad)

9

92

94

96

98

10

(d) Optimal value function of RAC-HMLP learned after training

Figure 4 Optimal policy and value function learned by AC-HMLP and RAC-HMLP

However they do not behave stably enough near the end ofthe training Too much visitation to the barrier region mightcause the fast descending of the cumulative rewards

The comparisons of the time steps for reaching goal aresimulated which are shown in Figure 7(b) Obviously RAC-HMLP and AC-HMLP perform better than the other threemethods RAC-HMLP converges at 39th episode while AC-HMLP converges at 43rd episode with the time steps forreaching goal being 45 and 47 It is clear that RAC-HMLPstill performs slightly better than AC-HMLP The time stepsfor reaching goal are 548 69 and 201 for S-AC MLACandDyna-MLAC Among the fivemethods RAC-HMLP notonly converges fastest but also has the best solution 45 Thepoorest performance of S-AC illustrates that model learningcan definitely improve the convergence performance As inthe former experiment Dyna-MLAC behaves poorer thanMLAC during training If the model is inaccurate planningvia such model might influence the estimations of the valuefunction and the policy thus leading to a poor performancein Dyna-MLAC

The comparisons of sample efficiency are shown inFigure 8 The required samples for S-AC MLAC Dyna-MLAC AC-HMLP and RAC-HMLP to converge are 105956588 7694 4388 and 4062 respectively As in pole balancingproblem RAC-HMLP also requires the least samples whileS-AC needs the most to converge The difference is that

Dyna-MLAC requires samples slightly more than MLACThe ascending curve of Dyna-MLAC at the end of thetraining demonstrates that it has not convergedThe frequentvisitation to the barrier in Dyna-MLAC leads to the quickdescending of the value functions As a result enormoussamples are required to make these value functions moreprone to the true ones

After the training ends the approximate optimal policyand value function are obtained shown in Figures 9(a) and9(b) It is noticeable that the low part of the figure is exploredthoroughly so that the policy and the value function are verydistinctive in this region For example inmost of the region inFigure 9(a) a larger angle is needed for the agent so as to leavethe current state Clearly the nearer the current state and thelow part of Figure 9(b) the smaller the corresponding valuefunction The top part of the figures may be not frequentlyor even not visited by the agent resulting in the similar valuefunctionsThe agent is able to reach the goal onlywith a gentleangle in these areas

6 Conclusion and Discussion

This paper proposes two novel actor-critic algorithms AC-HMLP and RAC-HMLP Both of them take LLR and LFA torepresent the local model and global model respectively Ithas been shown that our new methods are able to learn the

12 Computational Intelligence and Neuroscience

minus2

minus15

minus1

minus05

0

05

1

15

2A

ngle

velo

city

(rad

s)

minus05

minus04

minus03

minus02

minus01

0

01

02

03

0 05minus05Angle (rad)

(a) Prediction of the angle at next state

minus2

minus15

minus1

minus05

0

05

1

15

2

Ang

le v

eloci

ty (r

ads

)

0 05minus05Angle (rad)

minus3

minus25

minus2

minus15

minus1

minus05

0

05

1

15

2

(b) Prediction of the angular velocity at next state

minus2

minus15

minus1

minus05

0

05

1

15

2

Ang

le v

eloci

ty (r

ads

)

0 05minus05Angle (rad)

055

06

065

07

075

08

085

09

095

1

(c) Prediction of the reward

Figure 5 Prediction of the next state and reward according to the global model

1

(03 07)(055 07)

(055 032)

(055 085)Goal

Start

1

0

Figure 6 Continuous maze problem

optimal value function and the optimal policy In the polebalancing and continuous maze problems RAC-HMLP andAC-HMLP are compared with three representative methodsThe results show that RAC-HMLP and AC-HMLP not onlyconverge fastest but also have the best sample efficiency

RAC-HMLP performs slightly better than AC-HMLP inconvergence rate and sample efficiency By introducing ℓ

2-

regularization the parameters learned by RAC-HMLP willbe smaller and more uniform than those of AC-HMLP sothat the overfitting can be avoided effectively Though RAC-HMLPbehaves better its improvement overAC-HMLP is notsignificant Because AC-HMLP normalizes all the features to

Computational Intelligence and Neuroscience 13

minus1000

minus900

minus800

minus700

minus600

minus500

minus400

minus300

minus200

minus100

0

Cum

ulat

ive r

ewar

ds

20 40 60 80 1000Training episodes

S-ACMLACDyna-MLAC

AC-HMLPRAC-HMLP

(a) Comparisons of cumulative rewards

0

100

200

300

400

500

600

700

800

900

1000

Tim

e ste

ps fo

r rea

chin

g go

al

20 40 60 80 1000Training episodes

S-ACMLACDyna-MLAC

AC-HMLPRAC-HMLP

(b) Comparisons of time steps for reaching goal

Figure 7 Comparisons of cumulative rewards and time steps for reaching goal

0

2000

4000

6000

8000

10000

12000

Num

ber o

f sam

ples

to co

nver

ge

20 40 60 80 1000Training episodes

S-ACMLACDyna-MLAC

AC-HMLPRAC-HMLP

Figure 8 Comparisons of different algorithms in sample efficiency

[0 1] the parameters will not change heavily As a result theoverfitting can also be prohibited to a certain extent

S-AC is the only algorithm without model learning Thepoorest performance of S-AC demonstrates that combiningmodel learning and AC can really improve the performanceDyna-MLAC learns a model via LLR for local planning andpolicy updating However Dyna-MLAC directly utilizes themodel before making sure whether it is accurate addition-ally it does not utilize the global information about thesamples Therefore it behaves poorer than AC-HMLP and

RAC-HMLP MLAC also approximates a local model viaLLR as Dyna-MLAC does but it only takes the model toupdate the policy gradient surprisingly with a slightly betterperformance than Dyna-MLAC

Dyna-MLAC andMLAC approximate the value functionthe policy and the model through LLR In the LLR approachthe samples collected in the interaction with the environmenthave to be stored in the memory Through KNN or119870-d treeonly 119871-nearest samples in the memory are selected to learnthe LLR Such a learning process takes a lot of computationand memory costs In AC-HMLP and RAC-HMLP there isonly a parameter vector to be stored and learned for any of thevalue function the policy and the global model ThereforeAC-HMLP and RAC-HMLP outperform Dyna-MLAC andMLAC also in computation and memory costs

The planning times for the local model and the globalmodel have to be determined according the experimentalperformanceThus we have to set the planning times accord-ing to the different domains To address this problem ourfuture work will consider how to determine the planningtimes adaptively according to the different domains More-over with the development of the deep learning [33 34]deepRL has succeeded inmany applications such asAlphaGoand Atari 2600 by taking the visual images or videos asinput However how to accelerate the learning process indeep RL through model learning is still an open questionThe future work will endeavor to combine deep RL withhierarchical model learning and planning to solve the real-world problems

Competing InterestsThe authors declare that there are no competing interestsregarding the publication of this article

14 Computational Intelligence and Neuroscience

0

01

02

03

04

05

06

07

08

09

1C

oord

inat

e y

02 04 06 08 10Coordinate x

minus3

minus2

minus1

0

1

2

3

(a) Optimal policy learned by AC-HMLP

0

01

02

03

04

05

06

07

08

09

1

Coo

rdin

ate y

02 04 06 08 10Coordinate x

minus22

minus20

minus18

minus16

minus14

minus12

minus10

minus8

minus6

minus4

minus2

(b) Optimal value function learned by AC-HMLP

Figure 9 Final optimal policy and value function after training

Acknowledgments

This research was partially supported by Innovation CenterofNovel Software Technology and Industrialization NationalNatural Science Foundation of China (61502323 6150232961272005 61303108 61373094 and 61472262) Natural Sci-ence Foundation of Jiangsu (BK2012616) High School Nat-ural Foundation of Jiangsu (13KJB520020) Key Laboratoryof Symbolic Computation and Knowledge Engineering ofMinistry of Education Jilin University (93K172014K04) andSuzhou Industrial Application of Basic Research ProgramPart (SYG201422)

References

[1] R S Sutton and A G Barto Reinforcement Learning AnIntroduction MIT Press Cambridge Mass USA 1998

[2] M L Littman ldquoReinforcement learning improves behaviourfrom evaluative feedbackrdquo Nature vol 521 no 7553 pp 445ndash451 2015

[3] D Silver R S Sutton and M Muller ldquoTemporal-differencesearch in computer Gordquo Machine Learning vol 87 no 2 pp183ndash219 2012

[4] J Kober and J Peters ldquoReinforcement learning in roboticsa surveyrdquo in Reinforcement Learning pp 579ndash610 SpringerBerlin Germany 2012

[5] D-R Liu H-L Li and DWang ldquoFeature selection and featurelearning for high-dimensional batch reinforcement learning asurveyrdquo International Journal of Automation and Computingvol 12 no 3 pp 229ndash242 2015

[6] R E Bellman Dynamic Programming Princeton UniversityPress Princeton NJ USA 1957

[7] D P Bertsekas Dynamic Programming and Optimal ControlAthena Scientific Belmont Mass USA 1995

[8] F. L. Lewis and D. R. Liu, Reinforcement Learning and Approximate Dynamic Programming for Feedback Control, John Wiley & Sons, Hoboken, NJ, USA, 2012.

[9] P. J. Werbos, "Advanced forecasting methods for global crisis warning and models of intelligence," General Systems, vol. 22, pp. 25–38, 1977.

[10] D. R. Liu, H. L. Li, and D. Wang, "Data-based self-learning optimal control: research progress and prospects," Acta Automatica Sinica, vol. 39, no. 11, pp. 1858–1870, 2013.

[11] H. G. Zhang, D. R. Liu, Y. H. Luo, and D. Wang, Adaptive Dynamic Programming for Control: Algorithms and Stability, Communications and Control Engineering Series, Springer, London, UK, 2013.

[12] A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, "Discrete-time nonlinear HJB solution using approximate dynamic programming: convergence proof," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 38, no. 4, pp. 943–949, 2008.

[13] T. Dierks, B. T. Thumati, and S. Jagannathan, "Optimal control of unknown affine nonlinear discrete-time systems using offline-trained neural networks with proof of convergence," Neural Networks, vol. 22, no. 5-6, pp. 851–860, 2009.

[14] F.-Y. Wang, N. Jin, D. Liu, and Q. Wei, "Adaptive dynamic programming for finite-horizon optimal control of discrete-time nonlinear systems with ε-error bound," IEEE Transactions on Neural Networks, vol. 22, no. 1, pp. 24–36, 2011.

[15] K. Doya, "Reinforcement learning in continuous time and space," Neural Computation, vol. 12, no. 1, pp. 219–245, 2000.

[16] J. J. Murray, C. J. Cox, G. G. Lendaris, and R. Saeks, "Adaptive dynamic programming," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 32, no. 2, pp. 140–153, 2002.

[17] T. Hanselmann, L. Noakes, and A. Zaknich, "Continuous-time adaptive critics," IEEE Transactions on Neural Networks, vol. 18, no. 3, pp. 631–647, 2007.

[18] S. Bhasin, R. Kamalapurkar, M. Johnson, K. G. Vamvoudakis, F. L. Lewis, and W. E. Dixon, "A novel actor-critic-identifier architecture for approximate optimal control of uncertain nonlinear systems," Automatica, vol. 49, no. 1, pp. 82–92, 2013.

[19] L. Busoniu, R. Babuska, B. De Schutter, et al., Reinforcement Learning and Dynamic Programming Using Function Approximators, CRC Press, New York, NY, USA, 2010.

[20] A. G. Barto, R. S. Sutton, and C. W. Anderson, "Neuronlike adaptive elements that can solve difficult learning control problems," IEEE Transactions on Systems, Man, and Cybernetics, vol. 13, no. 5, pp. 834–846, 1983.

[21] H. Van Hasselt and M. A. Wiering, "Reinforcement learning in continuous action spaces," in Proceedings of the IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning (ADPRL '07), pp. 272–279, Honolulu, Hawaii, USA, April 2007.

[22] S. Bhatnagar, R. S. Sutton, M. Ghavamzadeh, and M. Lee, "Natural actor-critic algorithms," Automatica, vol. 45, no. 11, pp. 2471–2482, 2009.

[23] I. Grondman, L. Busoniu, G. A. D. Lopes, and R. Babuska, "A survey of actor-critic reinforcement learning: standard and natural policy gradients," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 42, no. 6, pp. 1291–1307, 2012.

[24] I. Grondman, L. Busoniu, and R. Babuska, "Model learning actor-critic algorithms: performance evaluation in a motion control task," in Proceedings of the 51st IEEE Conference on Decision and Control (CDC '12), pp. 5272–5277, Maui, Hawaii, USA, December 2012.

[25] I. Grondman, M. Vaandrager, L. Busoniu, R. Babuska, and E. Schuitema, "Efficient model learning methods for actor-critic control," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 42, no. 3, pp. 591–602, 2012.

[26] B. Costa, W. Caarls, and D. S. Menasche, "Dyna-MLAC: trading computational and sample complexities in actor-critic reinforcement learning," in Proceedings of the Brazilian Conference on Intelligent Systems (BRACIS '15), pp. 37–42, Natal, Brazil, November 2015.

[27] O. Andersson, F. Heintz, and P. Doherty, "Model-based reinforcement learning in continuous environments using real-time constrained optimization," in Proceedings of the 29th AAAI Conference on Artificial Intelligence (AAAI '15), pp. 644–650, Austin, Texas, USA, January 2015.

[28] R. S. Sutton, C. Szepesvari, A. Geramifard, and M. Bowling, "Dyna-style planning with linear function approximation and prioritized sweeping," in Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence (UAI '08), pp. 528–536, July 2008.

[29] R. Faulkner and D. Precup, "Dyna planning using a feature based generative model," in Proceedings of the Advances in Neural Information Processing Systems (NIPS '10), pp. 1–9, Vancouver, Canada, December 2010.

[30] H. Yao and C. Szepesvari, "Approximate policy iteration with linear action models," in Proceedings of the 26th National Conference on Artificial Intelligence (AAAI '12), Toronto, Canada, July 2012.

[31] H. R. Berenji and P. Khedkar, "Learning and tuning fuzzy logic controllers through reinforcements," IEEE Transactions on Neural Networks, vol. 3, no. 5, pp. 724–740, 1992.

[32] R. S. Sutton, "Generalization in reinforcement learning: successful examples using sparse coarse coding," in Advances in Neural Information Processing Systems, pp. 1038–1044, MIT Press, New York, NY, USA, 1996.

[33] D. Silver, A. Huang, C. J. Maddison, et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.

[34] V. Mnih, K. Kavukcuoglu, D. Silver, et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.

Page 12: Research Article Efficient Actor-Critic Algorithm with ... · Research Article Efficient Actor-Critic Algorithm with Hierarchical Model Learning and Planning ShanZhong, 1,2 QuanLiu,

12 Computational Intelligence and Neuroscience

minus2

minus15

minus1

minus05

0

05

1

15

2A

ngle

velo

city

(rad

s)

minus05

minus04

minus03

minus02

minus01

0

01

02

03

0 05minus05Angle (rad)

(a) Prediction of the angle at next state

minus2

minus15

minus1

minus05

0

05

1

15

2

Ang

le v

eloci

ty (r

ads

)

0 05minus05Angle (rad)

minus3

minus25

minus2

minus15

minus1

minus05

0

05

1

15

2

(b) Prediction of the angular velocity at next state

minus2

minus15

minus1

minus05

0

05

1

15

2

Ang

le v

eloci

ty (r

ads

)

0 05minus05Angle (rad)

055

06

065

07

075

08

085

09

095

1

(c) Prediction of the reward

Figure 5 Prediction of the next state and reward according to the global model

1

(03 07)(055 07)

(055 032)

(055 085)Goal

Start

1

0

Figure 6 Continuous maze problem

optimal value function and the optimal policy In the polebalancing and continuous maze problems RAC-HMLP andAC-HMLP are compared with three representative methodsThe results show that RAC-HMLP and AC-HMLP not onlyconverge fastest but also have the best sample efficiency

RAC-HMLP performs slightly better than AC-HMLP inconvergence rate and sample efficiency By introducing ℓ

2-

regularization the parameters learned by RAC-HMLP willbe smaller and more uniform than those of AC-HMLP sothat the overfitting can be avoided effectively Though RAC-HMLPbehaves better its improvement overAC-HMLP is notsignificant Because AC-HMLP normalizes all the features to

Computational Intelligence and Neuroscience 13

minus1000

minus900

minus800

minus700

minus600

minus500

minus400

minus300

minus200

minus100

0

Cum

ulat

ive r

ewar

ds

20 40 60 80 1000Training episodes

S-ACMLACDyna-MLAC

AC-HMLPRAC-HMLP

(a) Comparisons of cumulative rewards

0

100

200

300

400

500

600

700

800

900

1000

Tim

e ste

ps fo

r rea

chin

g go

al

20 40 60 80 1000Training episodes

S-ACMLACDyna-MLAC

AC-HMLPRAC-HMLP

(b) Comparisons of time steps for reaching goal

Figure 7 Comparisons of cumulative rewards and time steps for reaching goal

0

2000

4000

6000

8000

10000

12000

Num

ber o

f sam

ples

to co

nver

ge

20 40 60 80 1000Training episodes

S-ACMLACDyna-MLAC

AC-HMLPRAC-HMLP

Figure 8 Comparisons of different algorithms in sample efficiency

[0 1] the parameters will not change heavily As a result theoverfitting can also be prohibited to a certain extent

S-AC is the only algorithm without model learning Thepoorest performance of S-AC demonstrates that combiningmodel learning and AC can really improve the performanceDyna-MLAC learns a model via LLR for local planning andpolicy updating However Dyna-MLAC directly utilizes themodel before making sure whether it is accurate addition-ally it does not utilize the global information about thesamples Therefore it behaves poorer than AC-HMLP and

RAC-HMLP MLAC also approximates a local model viaLLR as Dyna-MLAC does but it only takes the model toupdate the policy gradient surprisingly with a slightly betterperformance than Dyna-MLAC

Dyna-MLAC andMLAC approximate the value functionthe policy and the model through LLR In the LLR approachthe samples collected in the interaction with the environmenthave to be stored in the memory Through KNN or119870-d treeonly 119871-nearest samples in the memory are selected to learnthe LLR Such a learning process takes a lot of computationand memory costs In AC-HMLP and RAC-HMLP there isonly a parameter vector to be stored and learned for any of thevalue function the policy and the global model ThereforeAC-HMLP and RAC-HMLP outperform Dyna-MLAC andMLAC also in computation and memory costs

The planning times for the local model and the globalmodel have to be determined according the experimentalperformanceThus we have to set the planning times accord-ing to the different domains To address this problem ourfuture work will consider how to determine the planningtimes adaptively according to the different domains More-over with the development of the deep learning [33 34]deepRL has succeeded inmany applications such asAlphaGoand Atari 2600 by taking the visual images or videos asinput However how to accelerate the learning process indeep RL through model learning is still an open questionThe future work will endeavor to combine deep RL withhierarchical model learning and planning to solve the real-world problems

Competing InterestsThe authors declare that there are no competing interestsregarding the publication of this article

14 Computational Intelligence and Neuroscience

0

01

02

03

04

05

06

07

08

09

1C

oord

inat

e y

02 04 06 08 10Coordinate x

minus3

minus2

minus1

0

1

2

3

(a) Optimal policy learned by AC-HMLP

0

01

02

03

04

05

06

07

08

09

1

Coo

rdin

ate y

02 04 06 08 10Coordinate x

minus22

minus20

minus18

minus16

minus14

minus12

minus10

minus8

minus6

minus4

minus2

(b) Optimal value function learned by AC-HMLP

Figure 9 Final optimal policy and value function after training

Acknowledgments

This research was partially supported by Innovation CenterofNovel Software Technology and Industrialization NationalNatural Science Foundation of China (61502323 6150232961272005 61303108 61373094 and 61472262) Natural Sci-ence Foundation of Jiangsu (BK2012616) High School Nat-ural Foundation of Jiangsu (13KJB520020) Key Laboratoryof Symbolic Computation and Knowledge Engineering ofMinistry of Education Jilin University (93K172014K04) andSuzhou Industrial Application of Basic Research ProgramPart (SYG201422)

References

[1] R S Sutton and A G Barto Reinforcement Learning AnIntroduction MIT Press Cambridge Mass USA 1998

[2] M L Littman ldquoReinforcement learning improves behaviourfrom evaluative feedbackrdquo Nature vol 521 no 7553 pp 445ndash451 2015

[3] D Silver R S Sutton and M Muller ldquoTemporal-differencesearch in computer Gordquo Machine Learning vol 87 no 2 pp183ndash219 2012

[4] J Kober and J Peters ldquoReinforcement learning in roboticsa surveyrdquo in Reinforcement Learning pp 579ndash610 SpringerBerlin Germany 2012

[5] D-R Liu H-L Li and DWang ldquoFeature selection and featurelearning for high-dimensional batch reinforcement learning asurveyrdquo International Journal of Automation and Computingvol 12 no 3 pp 229ndash242 2015

[6] R E Bellman Dynamic Programming Princeton UniversityPress Princeton NJ USA 1957

[7] D P Bertsekas Dynamic Programming and Optimal ControlAthena Scientific Belmont Mass USA 1995

[8] F L Lewis and D R Liu Reinforcement Learning and Approx-imate Dynamic Programming for Feedback Control John Wileyamp Sons Hoboken NJ USA 2012

[9] P J Werbos ldquoAdvanced forecasting methods for global crisiswarning and models of intelligencerdquo General Systems vol 22pp 25ndash38 1977

[10] D R Liu H L Li andDWang ldquoData-based self-learning opti-mal control research progress and prospectsrdquo Acta AutomaticaSinica vol 39 no 11 pp 1858ndash1870 2013

[11] H G Zhang D R Liu Y H Luo and D Wang AdaptiveDynamic Programming for Control Algorithms and StabilityCommunications and Control Engineering Series SpringerLondon UK 2013

[12] A Al-Tamimi F L Lewis and M Abu-Khalaf ldquoDiscrete-timenonlinear HJB solution using approximate dynamic program-ming convergence proofrdquo IEEE Transactions on Systems Manand Cybernetics Part B Cybernetics vol 38 no 4 pp 943ndash9492008

[13] T Dierks B T Thumati and S Jagannathan ldquoOptimal con-trol of unknown affine nonlinear discrete-time systems usingoffline-trained neural networks with proof of convergencerdquoNeural Networks vol 22 no 5-6 pp 851ndash860 2009

[14] F-Y Wang N Jin D Liu and Q Wei ldquoAdaptive dynamicprogramming for finite-horizon optimal control of discrete-time nonlinear systems with 120576-error boundrdquo IEEE Transactionson Neural Networks vol 22 no 1 pp 24ndash36 2011

[15] K Doya ldquoReinforcement learning in continuous time andspacerdquo Neural Computation vol 12 no 1 pp 219ndash245 2000

[16] J J Murray C J Cox G G Lendaris and R Saeks ldquoAdaptivedynamic programmingrdquo IEEE Transactions on Systems Manand Cybernetics Part C Applications and Reviews vol 32 no2 pp 140ndash153 2002

[17] T Hanselmann L Noakes and A Zaknich ldquoContinuous-timeadaptive criticsrdquo IEEE Transactions on Neural Networks vol 18no 3 pp 631ndash647 2007

[18] S Bhasin R Kamalapurkar M Johnson K G VamvoudakisF L Lewis and W E Dixon ldquoA novel actor-critic-identifierarchitecture for approximate optimal control of uncertainnonlinear systemsrdquo Automatica vol 49 no 1 pp 82ndash92 2013

[19] L Busoniu R Babuka B De Schutter et al ReinforcementLearning and Dynamic Programming Using Function Approxi-mators CRC Press New York NY USA 2010

Computational Intelligence and Neuroscience 15

[20] A G Barto R S Sutton and C W Anderson ldquoNeuronlikeadaptive elements that can solve difficult learning controlproblemsrdquo IEEE Transactions on SystemsMan and Cyberneticsvol 13 no 5 pp 834ndash846 1983

[21] H Van Hasselt and M A Wiering ldquoReinforcement learning incontinuous action spacesrdquo in Proceedings of the IEEE Sympo-sium onApproximateDynamic Programming andReinforcementLearning (ADPRL rsquo07) pp 272ndash279 Honolulu Hawaii USAApril 2007

[22] S Bhatnagar R S Sutton M Ghavamzadeh and M LeeldquoNatural actor-critic algorithmsrdquoAutomatica vol 45 no 11 pp2471ndash2482 2009

[23] I Grondman L Busoniu G A D Lopes and R BabuskaldquoA survey of actor-critic reinforcement learning standard andnatural policy gradientsrdquo IEEE Transactions on Systems Manand Cybernetics Part C Applications and Reviews vol 42 no6 pp 1291ndash1307 2012

[24] I Grondman L Busoniu and R Babuska ldquoModel learningactor-critic algorithms performance evaluation in a motioncontrol taskrdquo in Proceedings of the 51st IEEE Conference onDecision and Control (CDC rsquo12) pp 5272ndash5277 Maui HawaiiUSA December 2012

[25] I Grondman M Vaandrager L Busoniu R Babuska and ESchuitema ldquoEfficient model learning methods for actor-criticcontrolrdquo IEEE Transactions on Systems Man and CyberneticsPart B Cybernetics vol 42 no 3 pp 591ndash602 2012

[26] B CostaW Caarls andD SMenasche ldquoDyna-MLAC tradingcomputational and sample complexities in actor-critic rein-forcement learningrdquo in Proceedings of the Brazilian Conferenceon Intelligent Systems (BRACIS rsquo15) pp 37ndash42 Natal BrazilNovember 2015

[27] O Andersson F Heintz and P Doherty ldquoModel-based rein-forcement learning in continuous environments using real-timeconstrained optimizationrdquo in Proceedings of the 29th AAAIConference on Artificial Intelligence (AAAI rsquo15) pp 644ndash650Texas Tex USA January 2015

[28] R S Sutton C Szepesvari A Geramifard and M BowlingldquoDyna-style planning with linear function approximation andprioritized sweepingrdquo in Proceedings of the 24th Conference onUncertainty in Artificial Intelligence (UAI rsquo08) pp 528ndash536 July2008

[29] R Faulkner and D Precup ldquoDyna planning using a featurebased generative modelrdquo in Proceedings of the Advances inNeural Information Processing Systems (NIPS rsquo10) pp 1ndash9Vancouver Canada December 2010

[30] H Yao and C Szepesvari ldquoApproximate policy iteration withlinear action modelsrdquo in Proceedings of the 26th National Con-ference on Artificial Intelligence (AAAI rsquo12) Toronto CanadaJuly 2012

[31] H R Berenji and P Khedkar ldquoLearning and tuning fuzzylogic controllers through reinforcementsrdquo IEEE Transactions onNeural Networks vol 3 no 5 pp 724ndash740 1992

[32] R S Sutton ldquoGeneralization in reinforcement learning suc-cessful examples using sparse coarse codingrdquo in Advances inNeural Information Processing Systems pp 1038ndash1044 MITPress New York NY USA 1996

[33] D Silver A Huang C J Maddison et al ldquoMastering the gameof Go with deep neural networks and tree searchrdquo Nature vol529 no 7587 pp 484ndash489 2016

[34] V Mnih K Kavukcuoglu D Silver et al ldquoHuman-level controlthrough deep reinforcement learningrdquo Nature vol 518 no7540 pp 529ndash533 2015

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 13: Research Article Efficient Actor-Critic Algorithm with ... · Research Article Efficient Actor-Critic Algorithm with Hierarchical Model Learning and Planning ShanZhong, 1,2 QuanLiu,

Computational Intelligence and Neuroscience 13

minus1000

minus900

minus800

minus700

minus600

minus500

minus400

minus300

minus200

minus100

0

Cum

ulat

ive r

ewar

ds

20 40 60 80 1000Training episodes

S-ACMLACDyna-MLAC

AC-HMLPRAC-HMLP

(a) Comparisons of cumulative rewards

0

100

200

300

400

500

600

700

800

900

1000

Tim

e ste

ps fo

r rea

chin

g go

al

20 40 60 80 1000Training episodes

S-ACMLACDyna-MLAC

AC-HMLPRAC-HMLP

(b) Comparisons of time steps for reaching goal

Figure 7 Comparisons of cumulative rewards and time steps for reaching goal

0

2000

4000

6000

8000

10000

12000

Num

ber o

f sam

ples

to co

nver

ge

20 40 60 80 1000Training episodes

S-ACMLACDyna-MLAC

AC-HMLPRAC-HMLP

Figure 8 Comparisons of different algorithms in sample efficiency

[0 1] the parameters will not change heavily As a result theoverfitting can also be prohibited to a certain extent

S-AC is the only algorithm without model learning Thepoorest performance of S-AC demonstrates that combiningmodel learning and AC can really improve the performanceDyna-MLAC learns a model via LLR for local planning andpolicy updating However Dyna-MLAC directly utilizes themodel before making sure whether it is accurate addition-ally it does not utilize the global information about thesamples Therefore it behaves poorer than AC-HMLP and

RAC-HMLP MLAC also approximates a local model viaLLR as Dyna-MLAC does but it only takes the model toupdate the policy gradient surprisingly with a slightly betterperformance than Dyna-MLAC

Dyna-MLAC andMLAC approximate the value functionthe policy and the model through LLR In the LLR approachthe samples collected in the interaction with the environmenthave to be stored in the memory Through KNN or119870-d treeonly 119871-nearest samples in the memory are selected to learnthe LLR Such a learning process takes a lot of computationand memory costs In AC-HMLP and RAC-HMLP there isonly a parameter vector to be stored and learned for any of thevalue function the policy and the global model ThereforeAC-HMLP and RAC-HMLP outperform Dyna-MLAC andMLAC also in computation and memory costs

The planning times for the local model and the globalmodel have to be determined according the experimentalperformanceThus we have to set the planning times accord-ing to the different domains To address this problem ourfuture work will consider how to determine the planningtimes adaptively according to the different domains More-over with the development of the deep learning [33 34]deepRL has succeeded inmany applications such asAlphaGoand Atari 2600 by taking the visual images or videos asinput However how to accelerate the learning process indeep RL through model learning is still an open questionThe future work will endeavor to combine deep RL withhierarchical model learning and planning to solve the real-world problems

Competing InterestsThe authors declare that there are no competing interestsregarding the publication of this article

14 Computational Intelligence and Neuroscience

0

01

02

03

04

05

06

07

08

09

1C

oord

inat

e y

02 04 06 08 10Coordinate x

minus3

minus2

minus1

0

1

2

3

(a) Optimal policy learned by AC-HMLP

0

01

02

03

04

05

06

07

08

09

1

Coo

rdin

ate y

02 04 06 08 10Coordinate x

minus22

minus20

minus18

minus16

minus14

minus12

minus10

minus8

minus6

minus4

minus2

(b) Optimal value function learned by AC-HMLP

Figure 9 Final optimal policy and value function after training

Acknowledgments

This research was partially supported by Innovation CenterofNovel Software Technology and Industrialization NationalNatural Science Foundation of China (61502323 6150232961272005 61303108 61373094 and 61472262) Natural Sci-ence Foundation of Jiangsu (BK2012616) High School Nat-ural Foundation of Jiangsu (13KJB520020) Key Laboratoryof Symbolic Computation and Knowledge Engineering ofMinistry of Education Jilin University (93K172014K04) andSuzhou Industrial Application of Basic Research ProgramPart (SYG201422)

References

[1] R S Sutton and A G Barto Reinforcement Learning AnIntroduction MIT Press Cambridge Mass USA 1998

[2] M L Littman ldquoReinforcement learning improves behaviourfrom evaluative feedbackrdquo Nature vol 521 no 7553 pp 445ndash451 2015

[3] D Silver R S Sutton and M Muller ldquoTemporal-differencesearch in computer Gordquo Machine Learning vol 87 no 2 pp183ndash219 2012

[4] J Kober and J Peters ldquoReinforcement learning in roboticsa surveyrdquo in Reinforcement Learning pp 579ndash610 SpringerBerlin Germany 2012

[5] D-R Liu H-L Li and DWang ldquoFeature selection and featurelearning for high-dimensional batch reinforcement learning asurveyrdquo International Journal of Automation and Computingvol 12 no 3 pp 229ndash242 2015

[6] R E Bellman Dynamic Programming Princeton UniversityPress Princeton NJ USA 1957

[7] D P Bertsekas Dynamic Programming and Optimal ControlAthena Scientific Belmont Mass USA 1995

[8] F L Lewis and D R Liu Reinforcement Learning and Approx-imate Dynamic Programming for Feedback Control John Wileyamp Sons Hoboken NJ USA 2012

[9] P J Werbos ldquoAdvanced forecasting methods for global crisiswarning and models of intelligencerdquo General Systems vol 22pp 25ndash38 1977

[10] D R Liu H L Li andDWang ldquoData-based self-learning opti-mal control research progress and prospectsrdquo Acta AutomaticaSinica vol 39 no 11 pp 1858ndash1870 2013

[11] H G Zhang D R Liu Y H Luo and D Wang AdaptiveDynamic Programming for Control Algorithms and StabilityCommunications and Control Engineering Series SpringerLondon UK 2013

[12] A Al-Tamimi F L Lewis and M Abu-Khalaf ldquoDiscrete-timenonlinear HJB solution using approximate dynamic program-ming convergence proofrdquo IEEE Transactions on Systems Manand Cybernetics Part B Cybernetics vol 38 no 4 pp 943ndash9492008

[13] T Dierks B T Thumati and S Jagannathan ldquoOptimal con-trol of unknown affine nonlinear discrete-time systems usingoffline-trained neural networks with proof of convergencerdquoNeural Networks vol 22 no 5-6 pp 851ndash860 2009

[14] F-Y Wang N Jin D Liu and Q Wei ldquoAdaptive dynamicprogramming for finite-horizon optimal control of discrete-time nonlinear systems with 120576-error boundrdquo IEEE Transactionson Neural Networks vol 22 no 1 pp 24ndash36 2011

[15] K Doya ldquoReinforcement learning in continuous time andspacerdquo Neural Computation vol 12 no 1 pp 219ndash245 2000

[16] J J Murray C J Cox G G Lendaris and R Saeks ldquoAdaptivedynamic programmingrdquo IEEE Transactions on Systems Manand Cybernetics Part C Applications and Reviews vol 32 no2 pp 140ndash153 2002

[17] T Hanselmann L Noakes and A Zaknich ldquoContinuous-timeadaptive criticsrdquo IEEE Transactions on Neural Networks vol 18no 3 pp 631ndash647 2007

[18] S Bhasin R Kamalapurkar M Johnson K G VamvoudakisF L Lewis and W E Dixon ldquoA novel actor-critic-identifierarchitecture for approximate optimal control of uncertainnonlinear systemsrdquo Automatica vol 49 no 1 pp 82ndash92 2013

[19] L Busoniu R Babuka B De Schutter et al ReinforcementLearning and Dynamic Programming Using Function Approxi-mators CRC Press New York NY USA 2010

Computational Intelligence and Neuroscience 15

[20] A G Barto R S Sutton and C W Anderson ldquoNeuronlikeadaptive elements that can solve difficult learning controlproblemsrdquo IEEE Transactions on SystemsMan and Cyberneticsvol 13 no 5 pp 834ndash846 1983

[21] H Van Hasselt and M A Wiering ldquoReinforcement learning incontinuous action spacesrdquo in Proceedings of the IEEE Sympo-sium onApproximateDynamic Programming andReinforcementLearning (ADPRL rsquo07) pp 272ndash279 Honolulu Hawaii USAApril 2007

[22] S Bhatnagar R S Sutton M Ghavamzadeh and M LeeldquoNatural actor-critic algorithmsrdquoAutomatica vol 45 no 11 pp2471ndash2482 2009

[23] I Grondman L Busoniu G A D Lopes and R BabuskaldquoA survey of actor-critic reinforcement learning standard andnatural policy gradientsrdquo IEEE Transactions on Systems Manand Cybernetics Part C Applications and Reviews vol 42 no6 pp 1291ndash1307 2012

[24] I Grondman L Busoniu and R Babuska ldquoModel learningactor-critic algorithms performance evaluation in a motioncontrol taskrdquo in Proceedings of the 51st IEEE Conference onDecision and Control (CDC rsquo12) pp 5272ndash5277 Maui HawaiiUSA December 2012

[25] I Grondman M Vaandrager L Busoniu R Babuska and ESchuitema ldquoEfficient model learning methods for actor-criticcontrolrdquo IEEE Transactions on Systems Man and CyberneticsPart B Cybernetics vol 42 no 3 pp 591ndash602 2012

[26] B CostaW Caarls andD SMenasche ldquoDyna-MLAC tradingcomputational and sample complexities in actor-critic rein-forcement learningrdquo in Proceedings of the Brazilian Conferenceon Intelligent Systems (BRACIS rsquo15) pp 37ndash42 Natal BrazilNovember 2015

[27] O Andersson F Heintz and P Doherty ldquoModel-based rein-forcement learning in continuous environments using real-timeconstrained optimizationrdquo in Proceedings of the 29th AAAIConference on Artificial Intelligence (AAAI rsquo15) pp 644ndash650Texas Tex USA January 2015

[28] R S Sutton C Szepesvari A Geramifard and M BowlingldquoDyna-style planning with linear function approximation andprioritized sweepingrdquo in Proceedings of the 24th Conference onUncertainty in Artificial Intelligence (UAI rsquo08) pp 528ndash536 July2008

[29] R Faulkner and D Precup ldquoDyna planning using a featurebased generative modelrdquo in Proceedings of the Advances inNeural Information Processing Systems (NIPS rsquo10) pp 1ndash9Vancouver Canada December 2010

[30] H Yao and C Szepesvari ldquoApproximate policy iteration withlinear action modelsrdquo in Proceedings of the 26th National Con-ference on Artificial Intelligence (AAAI rsquo12) Toronto CanadaJuly 2012

[31] H R Berenji and P Khedkar ldquoLearning and tuning fuzzylogic controllers through reinforcementsrdquo IEEE Transactions onNeural Networks vol 3 no 5 pp 724ndash740 1992

[32] R S Sutton ldquoGeneralization in reinforcement learning suc-cessful examples using sparse coarse codingrdquo in Advances inNeural Information Processing Systems pp 1038ndash1044 MITPress New York NY USA 1996

[33] D Silver A Huang C J Maddison et al ldquoMastering the gameof Go with deep neural networks and tree searchrdquo Nature vol529 no 7587 pp 484ndash489 2016

[34] V Mnih K Kavukcuoglu D Silver et al ldquoHuman-level controlthrough deep reinforcement learningrdquo Nature vol 518 no7540 pp 529ndash533 2015

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 14: Research Article Efficient Actor-Critic Algorithm with ... · Research Article Efficient Actor-Critic Algorithm with Hierarchical Model Learning and Planning ShanZhong, 1,2 QuanLiu,

14 Computational Intelligence and Neuroscience

0

01

02

03

04

05

06

07

08

09

1C

oord

inat

e y

02 04 06 08 10Coordinate x

minus3

minus2

minus1

0

1

2

3

(a) Optimal policy learned by AC-HMLP

0

01

02

03

04

05

06

07

08

09

1

Coo

rdin

ate y

02 04 06 08 10Coordinate x

minus22

minus20

minus18

minus16

minus14

minus12

minus10

minus8

minus6

minus4

minus2

(b) Optimal value function learned by AC-HMLP

Figure 9 Final optimal policy and value function after training

Acknowledgments

This research was partially supported by Innovation CenterofNovel Software Technology and Industrialization NationalNatural Science Foundation of China (61502323 6150232961272005 61303108 61373094 and 61472262) Natural Sci-ence Foundation of Jiangsu (BK2012616) High School Nat-ural Foundation of Jiangsu (13KJB520020) Key Laboratoryof Symbolic Computation and Knowledge Engineering ofMinistry of Education Jilin University (93K172014K04) andSuzhou Industrial Application of Basic Research ProgramPart (SYG201422)

References

[1] R S Sutton and A G Barto Reinforcement Learning AnIntroduction MIT Press Cambridge Mass USA 1998

[2] M L Littman ldquoReinforcement learning improves behaviourfrom evaluative feedbackrdquo Nature vol 521 no 7553 pp 445ndash451 2015

[3] D Silver R S Sutton and M Muller ldquoTemporal-differencesearch in computer Gordquo Machine Learning vol 87 no 2 pp183ndash219 2012

[4] J Kober and J Peters ldquoReinforcement learning in roboticsa surveyrdquo in Reinforcement Learning pp 579ndash610 SpringerBerlin Germany 2012

[5] D-R Liu H-L Li and DWang ldquoFeature selection and featurelearning for high-dimensional batch reinforcement learning asurveyrdquo International Journal of Automation and Computingvol 12 no 3 pp 229ndash242 2015

[6] R E Bellman Dynamic Programming Princeton UniversityPress Princeton NJ USA 1957

[7] D P Bertsekas Dynamic Programming and Optimal ControlAthena Scientific Belmont Mass USA 1995

[8] F L Lewis and D R Liu Reinforcement Learning and Approx-imate Dynamic Programming for Feedback Control John Wileyamp Sons Hoboken NJ USA 2012

[9] P J Werbos ldquoAdvanced forecasting methods for global crisiswarning and models of intelligencerdquo General Systems vol 22pp 25ndash38 1977

[10] D R Liu H L Li andDWang ldquoData-based self-learning opti-mal control research progress and prospectsrdquo Acta AutomaticaSinica vol 39 no 11 pp 1858ndash1870 2013

[11] H G Zhang D R Liu Y H Luo and D Wang AdaptiveDynamic Programming for Control Algorithms and StabilityCommunications and Control Engineering Series SpringerLondon UK 2013

[12] A Al-Tamimi F L Lewis and M Abu-Khalaf ldquoDiscrete-timenonlinear HJB solution using approximate dynamic program-ming convergence proofrdquo IEEE Transactions on Systems Manand Cybernetics Part B Cybernetics vol 38 no 4 pp 943ndash9492008

[13] T Dierks B T Thumati and S Jagannathan ldquoOptimal con-trol of unknown affine nonlinear discrete-time systems usingoffline-trained neural networks with proof of convergencerdquoNeural Networks vol 22 no 5-6 pp 851ndash860 2009

[14] F-Y Wang N Jin D Liu and Q Wei ldquoAdaptive dynamicprogramming for finite-horizon optimal control of discrete-time nonlinear systems with 120576-error boundrdquo IEEE Transactionson Neural Networks vol 22 no 1 pp 24ndash36 2011

[15] K Doya ldquoReinforcement learning in continuous time andspacerdquo Neural Computation vol 12 no 1 pp 219ndash245 2000

[16] J J Murray C J Cox G G Lendaris and R Saeks ldquoAdaptivedynamic programmingrdquo IEEE Transactions on Systems Manand Cybernetics Part C Applications and Reviews vol 32 no2 pp 140ndash153 2002

[17] T Hanselmann L Noakes and A Zaknich ldquoContinuous-timeadaptive criticsrdquo IEEE Transactions on Neural Networks vol 18no 3 pp 631ndash647 2007

[18] S Bhasin R Kamalapurkar M Johnson K G VamvoudakisF L Lewis and W E Dixon ldquoA novel actor-critic-identifierarchitecture for approximate optimal control of uncertainnonlinear systemsrdquo Automatica vol 49 no 1 pp 82ndash92 2013

[19] L Busoniu R Babuka B De Schutter et al ReinforcementLearning and Dynamic Programming Using Function Approxi-mators CRC Press New York NY USA 2010

Computational Intelligence and Neuroscience 15

[20] A G Barto R S Sutton and C W Anderson ldquoNeuronlikeadaptive elements that can solve difficult learning controlproblemsrdquo IEEE Transactions on SystemsMan and Cyberneticsvol 13 no 5 pp 834ndash846 1983

[21] H Van Hasselt and M A Wiering ldquoReinforcement learning incontinuous action spacesrdquo in Proceedings of the IEEE Sympo-sium onApproximateDynamic Programming andReinforcementLearning (ADPRL rsquo07) pp 272ndash279 Honolulu Hawaii USAApril 2007

[22] S Bhatnagar R S Sutton M Ghavamzadeh and M LeeldquoNatural actor-critic algorithmsrdquoAutomatica vol 45 no 11 pp2471ndash2482 2009

[23] I Grondman L Busoniu G A D Lopes and R BabuskaldquoA survey of actor-critic reinforcement learning standard andnatural policy gradientsrdquo IEEE Transactions on Systems Manand Cybernetics Part C Applications and Reviews vol 42 no6 pp 1291ndash1307 2012

[24] I Grondman L Busoniu and R Babuska ldquoModel learningactor-critic algorithms performance evaluation in a motioncontrol taskrdquo in Proceedings of the 51st IEEE Conference onDecision and Control (CDC rsquo12) pp 5272ndash5277 Maui HawaiiUSA December 2012

[25] I Grondman M Vaandrager L Busoniu R Babuska and ESchuitema ldquoEfficient model learning methods for actor-criticcontrolrdquo IEEE Transactions on Systems Man and CyberneticsPart B Cybernetics vol 42 no 3 pp 591ndash602 2012

[26] B CostaW Caarls andD SMenasche ldquoDyna-MLAC tradingcomputational and sample complexities in actor-critic rein-forcement learningrdquo in Proceedings of the Brazilian Conferenceon Intelligent Systems (BRACIS rsquo15) pp 37ndash42 Natal BrazilNovember 2015

[27] O Andersson F Heintz and P Doherty ldquoModel-based rein-forcement learning in continuous environments using real-timeconstrained optimizationrdquo in Proceedings of the 29th AAAIConference on Artificial Intelligence (AAAI rsquo15) pp 644ndash650Texas Tex USA January 2015

[28] R S Sutton C Szepesvari A Geramifard and M BowlingldquoDyna-style planning with linear function approximation andprioritized sweepingrdquo in Proceedings of the 24th Conference onUncertainty in Artificial Intelligence (UAI rsquo08) pp 528ndash536 July2008

[29] R Faulkner and D Precup ldquoDyna planning using a featurebased generative modelrdquo in Proceedings of the Advances inNeural Information Processing Systems (NIPS rsquo10) pp 1ndash9Vancouver Canada December 2010

[30] H Yao and C Szepesvari ldquoApproximate policy iteration withlinear action modelsrdquo in Proceedings of the 26th National Con-ference on Artificial Intelligence (AAAI rsquo12) Toronto CanadaJuly 2012

[31] H R Berenji and P Khedkar ldquoLearning and tuning fuzzylogic controllers through reinforcementsrdquo IEEE Transactions onNeural Networks vol 3 no 5 pp 724ndash740 1992

[32] R S Sutton ldquoGeneralization in reinforcement learning suc-cessful examples using sparse coarse codingrdquo in Advances inNeural Information Processing Systems pp 1038ndash1044 MITPress New York NY USA 1996

[33] D Silver A Huang C J Maddison et al ldquoMastering the gameof Go with deep neural networks and tree searchrdquo Nature vol529 no 7587 pp 484ndash489 2016

[34] V Mnih K Kavukcuoglu D Silver et al ldquoHuman-level controlthrough deep reinforcement learningrdquo Nature vol 518 no7540 pp 529ndash533 2015
