Research Article: Two-Phase Iteration for Value Function Approximation and Hyperparameter Optimization in Gaussian-Kernel-Based Adaptive Critic Design
Xin Chen,1 Penghuan Xie,2 Yonghua Xiong,1 Yong He,1 and Min Wu1
1School of Automation, China University of Geosciences, Wuhan, Hubei 430074, China
2School of Information Science and Engineering, Central South University, Changsha, Hunan 410083, China
Correspondence should be addressed to Yonghua Xiong; xiongyh@cug.edu.cn
Received 7 January 2015; Accepted 26 May 2015
Academic Editor: Simon X. Yang
Copyright © 2015 Xin Chen et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Adaptive Dynamic Programming (ADP) with critic-actor architecture is an effective way to perform online learning control. To avoid the subjectivity in the design of a neural network that serves as a critic network, kernel-based adaptive critic design (ACD) was developed recently. There are two essential issues for a static kernel-based model: how to determine proper hyperparameters in advance and how to select the right samples to describe the value function. They both rely on the assessment of sample values. Based on theoretical analysis, this paper presents a two-phase simultaneous learning method for a Gaussian-kernel-based critic network. It is able to estimate the values of samples without infinitely revisiting them, and the hyperparameters of the kernel model are optimized simultaneously. Based on the estimated sample values, the sample set can be refined by adding alternatives or deleting redundancies. Combining this critic design with an actor network, we present a Gaussian-kernel-based Adaptive Dynamic Programming (GK-ADP) approach. Simulations are used to verify its feasibility, particularly the necessity of two-phase learning, the convergence characteristics, and the improvement of system performance by using a varying sample set.
1. Introduction
Reinforcement learning (RL) is an interactive machine learning method for solving sequential decision problems. It is well known as an important learning method in unknown or dynamic environments. Different from supervised learning and unsupervised learning, RL interacts with the environment through a trial mechanism and modifies its action policies to maximize the payoffs [1]. From a theoretical point of view, it is strongly connected with direct and indirect adaptive optimal control methods [2].
Traditional RL research focused on discrete state/action systems, where the state/action only takes on a finite number of prescribed discrete values. The learning space grows exponentially as the number of states and the number of allowed actions increase. This leads to the so-called curse of dimensionality (CoD) [2]. In order to mitigate the CoD problem, function approximation and generalization methods [3] are introduced to store the optimal value and the optimal control as functions of the state vector. Generalization methods based on parametric models, such as neural networks [4–6], have become one of the popular means to solve RL problems in continuous environments.
Currently, research work on RL in continuous environments to construct learning systems for nonlinear optimal control has attracted the attention of researchers and scholars in control domains, for the reason that such systems can modify their policies only based on the value function, without knowing the model structure or the parameters in advance. A family of new RL techniques, known as Approximate or Adaptive Dynamic Programming (ADP) (also known as Neurodynamic Programming or Adaptive Critic Designs (ACDs)), has received more and more research interest [7, 8]. ADPs are based on the actor-critic structure, in which a critic assesses the value of the action or control policy applied by an actor, and an actor modifies its action based on the assessment
of values. In the literature, ADP approaches are categorized into the following main schemes: heuristic dynamic programming (HDP), dual heuristic programming (DHP), globalized dual heuristic programming (GDHP), and their action-dependent versions [9, 10].
ADP research has usually adopted multilayer perceptron neural networks (MLPNNs) as the critic design. Vrabie and Lewis proposed an online approach to continuous-time direct adaptive optimal control, which made use of neural networks to parametrically represent the control policy and the performance of the control system [11]. Liu et al. solved the constrained optimal control problem of unknown discrete-time nonlinear systems based on the iterative ADP algorithm via the GDHP technique with three neural networks [12]. In fact, different kinds of neural networks (NNs) play important roles in ADP algorithms, such as radial basis function NNs [3], wavelet basis function NNs [13], and echo state networks [14].
Besides the benefits brought by NNs, ADP methods always suffer from some problems concerning the design of NNs. On one hand, the learning control performance greatly depends on the empirical design of the critic network, especially the manual setting of the hidden layer or the basis functions. On the other hand, due to the local minima in neural network training, how to improve the quality of the final policies is still an open problem [15].
As we can see, it is difficult to evaluate the effectiveness of a parametric model when the knowledge of the model's order or the nonlinear characteristics of the system is insufficient. Compared with parametric modeling methods, nonparametric modeling methods, especially kernel methods [16, 17], do not need to set the model structure in advance. Hence kernel machines have been popularly studied to realize nonlinear and nonparametric modeling. Engel et al. proposed the kernel recursive least-squares algorithm to construct minimum mean-squared-error solutions to nonlinear least-squares problems [18]. As popular kernel machines, support vector machines (SVMs) have also been applied to nonparametric modeling problems. Dietterich and Wang combined linear programming with SVMs to find value function approximations [19]. Similar research was published in [20], in which least-squares SVMs were used. Nevertheless, both focused on discrete state/action spaces and more or less lacked theoretical results on the policies obtained.
In addition to SVMs, Gaussian processes (GPs) have become an alternative generalization method. GP models are powerful nonparametric tools for approximate Bayesian inference and learning. In comparison with other popular nonlinear architectures, such as multilayer perceptrons, their behavior is conceptually simpler to understand, and model fitting can be achieved without resorting to nonconvex optimization routines [21, 22]. In [23], Engel et al. first applied GPs in temporal-difference (TD) learning for MDPs with stochastic rewards and deterministic transitions. They derived a new GPTD algorithm in [24] that overcame the limitation of deterministic transitions and extended GPTD to the estimation of state-action values. The GPTD algorithm only addresses value approximation, so it has to be combined with actor-critic methods or other policy iteration methods to solve learning control problems.
An alternative approach employing GPs in RL is the model-based value iteration or policy iteration method, in which GP models are used to model the system dynamics and represent the value function [25]. In [26], an approximated value-function-based RL algorithm, named Gaussian process dynamic programming (GPDP), was presented, which built the dynamic transition model, the value function model, and the action policy model, respectively, using GPs. In this way, the sample set is adjusted to a reasonable shape, with high sample densities near the equilibria or the places where the value function changes dramatically. Thus it is good at controlling nonlinear systems with complex dynamics. A major shortcoming, even if the relatively high computational cost is endurable, is that the states in the sample set need to be revisited again and again in order to update their value functions. Since this condition is impractical in real implementations, it diminishes the appeal of this method.
Kernel-based methods have also been introduced to ADP. In [15], a novel framework of ACDs with sparse kernel machines was presented by integrating kernel methods into the critic network. A sparsification method based on approximately linear dependence (ALD) analysis was used to sparsify the kernel machines. Obviously, this method can overcome the difficulty of presetting the model structure in parametric models and realize actor-critic learning online [27]. However, the selection of samples based on ALD analysis is an offline procedure that does not consider the distribution of the value function. Therefore the data samples cannot be adjusted online, which makes the method more suitable for control systems with smooth dynamics, where the value function changes gently.
We think GPDP and ACDs with sparse kernel machines are complementary. As indicated in [28], the prediction of a GP is viewed as a linear combination of the covariances between the new point and the samples. Hence it seems reasonable to introduce a kernel machine with GPs to build the critic network in ACDs, provided the values of the samples are known or at least can be assessed numerically. Then the sample set can be adjusted online during critic-actor learning.
The major problem here is how to realize value function learning and GP model updating simultaneously, especially under the condition that the samples of the state-action space can hardly be revisited infinitely often in order to approximate their values. To tackle this problem, a two-phase iteration is developed in order to get the optimal control policy for a system whose dynamics are unknown a priori.
2. Description of the Problem
In general, ADP is an actor-critic method, which approximates the value functions and policies to encourage the realization of generalization in MDPs with large or continuous spaces. The critic design plays the most important role because it determines how the actor optimizes its action. Hence we give a brief introduction to both kernel-based ACDs and GPs in order to derive a clear description of the theoretical problem.
2.1. Kernel-Based ACDs. Kernel-based ACDs mainly consist of a critic network, a kernel-based feature learning module, a reward function, an actor network/controller, and a model of the plant. The critic, constructed by a kernel machine, is used to approximate the value functions or their derivatives. Then the output of the critic is used in the training process of the actor, so that policy gradients can be computed. As the actor finally converges, the optimal action policy mapping states to actions is described by this actor.
A traditional neural network based on a kernel machine and samples serves as the model of the value function, just as the following equation shows, and a recursive algorithm such as KLSTD [29] serves as value function approximation:
$$Q(x) = \sum_{i=1}^{L} \alpha_i k\left(x, x_i\right), \quad (1)$$
where $x$ and $x_i$ represent the state-action pairs $(s, a)$ and $(s_i, a_i)$, $\alpha_i$, $i = 1, 2, \ldots, L$, are the weights, $(s_i, a_i)$ are the selected state-action pairs in the sample set, and $k(x, x_i)$ is a kernel function. The key of critic learning is the update of the weight vector $\alpha$. The value function modeling in continuous state/action space is a regression problem. From the view of the model, the NN structure is a reproducing kernel space spanned by the samples, in which the value function is expressed as a linear regression function. Obviously, as the basis of the Hilbert space, how to select samples determines the VC dimension of identification as well as the performance of value function approximation.
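For concreteness, the following is a minimal sketch of the critic output (1), assuming the Gaussian covariance form that the paper adopts later in Section 2.2; the function names and parameter layout are illustrative only, not the authors' implementation.

```python
import numpy as np

def gaussian_kernel(x, xi, v1, lambda_diag):
    """Gaussian kernel: v1 * exp(-0.5 * (x - xi)^T Lambda^{-1} (x - xi))."""
    d = x - xi
    return v1 * np.exp(-0.5 * np.sum(d * d / lambda_diag))

def critic_q(x, samples, alpha, v1, lambda_diag):
    """Critic output of (1): Q(x) = sum_i alpha_i * k(x, x_i)."""
    return sum(a * gaussian_kernel(x, xi, v1, lambda_diag)
               for a, xi in zip(alpha, samples))
```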
If only the ALD-based kernel sparsification is used, it is only the independence of samples that is considered in sample selection. So it is hard to evaluate how good the sample set is, because the sample selection does not consider the distribution of the value function, and the performance of the ALD analysis is seriously affected by the hyperparameters of the kernel function, which are predetermined empirically and fixed during learning.
If the hyperparameters can be optimized online and the value function wrt the samples can be evaluated by iteration algorithms, the critic network will be optimized not only by value approximation but also by hyperparameter optimization. Moreover, with approximated sample values, there is a direct way to evaluate the validity of the sample set in order to regulate the set online. Thus, in this paper, we turn to Gaussian processes to construct the criterion for samples and hyperparameters.
2.2. GP-Based Value Function Model. For an MDP, the data samples $x_i$ and the corresponding $Q$ values can be collected by observing the MDP. Here $x_i$ is the state-action pair $(s_i, a_i)$, $s \in R^N$, $a \in R^M$, and $Q(x)$ is the value function defined as
$$Q^*(x) = R(x) + \beta \int p\left(s' \mid x\right) V\left(s'\right) ds', \quad (2)$$
where $V(s) = \max_a Q(x)$.
Given a sample set collected from a continuous dynamic system, $\{X_L, y\}$, where $y = Q(X_L) + \varepsilon$, $\varepsilon \sim N(0, v_0)$, Gaussian regression with the covariance function shown in the following equation is a well-known modeling technique to infer the $Q$ function:
$$k\left(x_p, x_q\right) = v_1 \exp\left[-\frac{1}{2}\left(x_p - x_q\right)^T \Lambda^{-1} \left(x_p - x_q\right)\right], \quad (3)$$
where $\Lambda = \mathrm{diag}([w_1, w_2, \ldots, w_{N+M}])$. Assuming additive independent identically distributed Gaussian noise $\varepsilon$ with variance $v_0$, the prior on the noisy observations becomes
$$K_L = K\left(X_L, X_L\right) + v_0 I_L, \quad (4)$$
where
$$K\left(X_L, X_L\right) = \begin{bmatrix} k\left(x_1, x_1\right) & k\left(x_1, x_2\right) & \cdots & k\left(x_1, x_L\right) \\ k\left(x_2, x_1\right) & \ddots & & \vdots \\ \vdots & & \ddots & \vdots \\ k\left(x_L, x_1\right) & \cdots & \cdots & k\left(x_L, x_L\right) \end{bmatrix}. \quad (5)$$
The parameters $w_d$, $v_0$, and $v_1$ are the hyperparameters of the Gaussian regression and are collected within the vector $\theta = [w_1 \cdots w_{N+M}\ v_1\ v_0]^T$.
For an arbitrary input $x^*$, the predictive distribution of the function value is Gaussian distributed with mean and variance given by
$$E\left[Q\left(x^*\right)\right] = K\left(x^*, X_L\right) K_L^{-1} y, \quad (6)$$
$$\mathrm{var}\left[Q\left(x^*\right)\right] = k\left(x^*, x^*\right) - K\left(x^*, X_L\right) K_L^{-1} K\left(X_L, x^*\right). \quad (7)$$
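A minimal sketch of the prediction equations (4)-(7) follows, assuming a NumPy environment; `kernel_matrix` and `gp_predict` are illustrative names, not from the original implementation.

```python
import numpy as np

def kernel_matrix(X, v1, lambda_diag):
    """K(X_L, X_L) of (5), built from the covariance function (3)."""
    diff = X[:, None, :] - X[None, :, :]          # shape (L, L, N+M)
    return v1 * np.exp(-0.5 * np.sum(diff**2 / lambda_diag, axis=2))

def gp_predict(x_star, X, y, v0, v1, lambda_diag):
    """Predictive mean (6) and variance (7) at a query pair x_star."""
    K_L = kernel_matrix(X, v1, lambda_diag) + v0 * np.eye(len(X))  # (4)
    k_star = v1 * np.exp(-0.5 * np.sum((X - x_star)**2 / lambda_diag, axis=1))
    mean = k_star @ np.linalg.solve(K_L, y)                        # (6)
    var = v1 - k_star @ np.linalg.solve(K_L, k_star)               # (7), k(x*,x*)=v1
    return mean, var
```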
Comparing (6) to (1), we find that if we let $\alpha = K_L^{-1} y$, the neural network is also regarded as Gaussian regression; that is, the critic network can be constructed based on a Gaussian kernel machine if the following conditions are satisfied.

Condition 1. The hyperparameters $\theta$ of the Gaussian kernel are known.

Condition 2. The values $y$ wrt all sample states $X_L$ are known.
With a Gaussian-kernel-based critic network, the sample state-action pairs and the corresponding values are known. Then a criterion, such as the comprehensive utility proposed in [26], can be set up in order to refine the sample set online. At the same time, it is convenient to optimize the hyperparameters in order to approximate the $Q$ value function more accurately. Thus, besides the advantages brought by kernel machines, the critic based on Gaussian kernels will be better at approximating value functions.
Consider Condition 1. If the values $y$ wrt $X_L$ are known (note that this is indeed Condition 2), the common way to get the hyperparameters is evidence maximization, where the log-evidence is given by
$$L(\theta) = -\frac{1}{2} y^T K_L^{-1} y - \frac{1}{2}\log\left|K_L\right| - \frac{n}{2}\log 2\pi. \quad (8)$$
It requires the calculation of the derivative of $L(\theta)$ wrt each $\theta_i$, given by
$$\frac{\partial L(\theta)}{\partial \theta_i} = -\frac{1}{2}\mathrm{tr}\left[K_L^{-1}\frac{\partial K_L}{\partial \theta_i}\right] + \frac{1}{2}\alpha^T \frac{\partial K_L}{\partial \theta_i}\alpha, \quad (9)$$
where $\mathrm{tr}[\cdot]$ denotes the trace and $\alpha = K_L^{-1} y$.
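The sketch below evaluates (8) and the gradient (9) numerically for one hyperparameter, assuming the derivative matrix $\partial K_L/\partial\theta_i$ has been formed elsewhere; in practice the gradient would be handed to any ascent routine for evidence maximization.

```python
import numpy as np

def log_evidence(K_L, y):
    """Log-evidence L(theta) of (8)."""
    n = len(y)
    _, logdet = np.linalg.slogdet(K_L)
    alpha = np.linalg.solve(K_L, y)
    return -0.5 * y @ alpha - 0.5 * logdet - 0.5 * n * np.log(2 * np.pi)

def evidence_gradient(K_L, y, dK_dtheta):
    """Gradient (9) wrt one hyperparameter, given dK_L/dtheta_i."""
    K_inv = np.linalg.inv(K_L)
    alpha = K_inv @ y                       # alpha = K_L^{-1} y
    return (-0.5 * np.trace(K_inv @ dK_dtheta)
            + 0.5 * alpha @ dK_dtheta @ alpha)
```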
Consider Condition 2. For an unknown system, if $K_L$ is known, that is, the hyperparameters are known (note that this is indeed Condition 1), the update of the critic network is transferred to the update of the values wrt the samples by using value iteration.

According to this analysis, both conditions are interdependent. That means the update of the critic depends on known hyperparameters, and the optimization of the hyperparameters depends on accurate sample values.

Hence we need a comprehensive iteration method to realize value approximation and optimization of the hyperparameters simultaneously. A direct way is to update them alternately. Unfortunately, this way is not reasonable because the two processes are tightly coupled. For example, temporal difference errors drive value approximation, but a change of the weights $\alpha$ will cause a change of the hyperparameters $\theta$ simultaneously; then it is difficult to tell whether a temporal difference error is induced by observation or by the changing Gaussian regression model.

To solve this problem, a kind of two-phase value iteration for the critic network is presented in the next section, and the conditions of convergence are analyzed.
3. Two-Phase Value Iteration for Critic Network Approximation
First, a proposition is given to describe the relationship between the hyperparameters and the sample value function.
Proposition 1. The hyperparameters are optimized by evidence maximization according to the samples and their Q values, where the gradient of the log-evidence is set to zero:
$$\frac{\partial L(\theta)}{\partial \theta_i} = -\frac{1}{2}\mathrm{tr}\left[K_L^{-1}\frac{\partial K_L}{\partial \theta_i}\right] + \frac{1}{2}\alpha^T \frac{\partial K_L}{\partial \theta_i}\alpha = 0, \quad i = 1, \ldots, D, \quad (10)$$
where $\alpha = K_L^{-1} y$. It can be proved that, for arbitrary hyperparameters $\theta$, if $\alpha \neq 0$, (10) defines an implicit function, or a continuously differentiable function, as follows:
$$\theta = f(\alpha) = \left[f_1(\alpha)\ f_2(\alpha)\ \cdots\ f_D(\alpha)\right]^T. \quad (11)$$
Then the two-phase value iteration for the critic network is described by the following theorem.

Theorem 2. Given the following conditions:
(1) the system is bounded-input bounded-output (BIBO) stable;
(2) the immediate reward $r$ is bounded;
(3) for $k = 1, 2$: $\eta^k_t \to 0$, $\sum_{t=1}^{\infty}\eta^k_t = \infty$;
the following iteration process is convergent:
$$\begin{aligned} \alpha_{t+1} &= \alpha_t, \\ \theta_{t+1} &= \theta_t + \eta^1_t\left(f\left(\alpha_t\right) - \theta_t\right), \\ \alpha_{t+2} &= \alpha_{t+1} + \eta^2_t k^{\aleph}_{\theta_t}\left(x, X_L\right)\left[r + \beta V\left(\theta_t, x'\right) - k_{\theta_t}\left(x, X_L\right)\alpha_{t+1}\right], \\ \theta_{t+2} &= \theta_{t+1}, \end{aligned} \quad (12)$$
where $V(\theta_t, x') = \max_a k_{\theta_t}(x', X_L)\,\alpha(t)$ and $k^{\aleph}_{\theta_t}(x, X_L)$ is a kind of pseudoinverse of $k_{\theta_t}(x, X_L)$; that is, $k_{\theta_t}(x, X_L)\, k^{\aleph}_{\theta_t}(x, X_L) = 1$.
Proof. From (12) it is clear that the two phases include the update of the hyperparameters in phase 1, which is viewed as the update of the generalization model, and the update of the samples' values in phase 2, which is viewed as the update of the critic network.

The convergence of the iterative algorithm is proved based on the stochastic approximation Lyapunov method.

Define
$$P_t k_{\theta_t}\left(x, X_L\right)\alpha_t = r + \beta V\left(\theta_t, x'\right) = r + \beta\max_a k_{\theta_t}\left(x', X_L\right)\alpha(t). \quad (13)$$
Equation (12) is rewritten as
$$\begin{aligned} \alpha_{t+1} &= \alpha_t, \\ \theta_{t+1} &= \theta_t + \eta^1_t\left(f\left(\alpha_t\right) - \theta_t\right), \\ \alpha_{t+2} &= \alpha_{t+1} + \eta^2_t k^{\aleph}_{\theta_t}\left(x, X_L\right)\left[P_{t+1}k_{\theta_t}\left(x, X_L\right)\alpha_{t+1} - k_{\theta_t}\left(x, X_L\right)\alpha_{t+1}\right], \\ \theta_{t+2} &= \theta_{t+1}. \end{aligned} \quad (14)$$
Define the approximation errors as
$$\varphi^1_{t+2} = \varphi^1_{t+1} = \theta_{t+1} - f\left(\alpha_{t+1}\right) = \theta_t - f\left(\alpha_{t+1}\right) + \eta^1_t\left[f\left(\alpha_t\right) - \theta_t\right] = \left(1 - \eta^1_t\right)\varphi^1_t, \quad (15)$$
$$\begin{aligned} \varphi^2_{t+2} &= \left(\alpha_{t+2} - \alpha^*\right) = \left(\alpha_{t+1} - \alpha^*\right) + \eta^2_t k^{\aleph}_{\theta_t}\left(x, X_L\right)\left[P_{t+1}k_{\theta_t}\left(x, X_L\right)\alpha_{t+1} - P_{t+1}k_{\theta_t}\left(x, X_L\right)\alpha^* + P_{t+1}k_{\theta_t}\left(x, X_L\right)\alpha^* - k_{\theta_t}\left(x, X_L\right)\alpha_{t+1}\right] \\ &= \varphi^2_{t+1} + \eta^2_t k^{\aleph}_{\theta_t}\left(x, X_L\right)\left[P_{t+1}k_{\theta_t}\left(x, X_L\right)\alpha_{t+1} - P_{t+1}k_{\theta_t}\left(x, X_L\right)\alpha^*\right] + \eta^2_t k^{\aleph}_{\theta_t}\left(x, X_L\right)\left[P_{t+1}k_{\theta_t}\left(x, X_L\right)\alpha^* - k_{\theta_t}\left(x, X_L\right)\alpha_{t+1}\right]. \end{aligned} \quad (16)$$
Further define
$$\begin{aligned} \lambda_1\left(x, \alpha_{t+1}, \alpha^*\right) &= k^{\aleph}_{\theta_t}\left(x, X_L\right)\left[P_{t+1}k_{\theta_t}\left(x, X_L\right)\alpha_{t+1} - P_{t+1}k_{\theta_t}\left(x, X_L\right)\alpha^*\right], \\ \lambda_2\left(x, \alpha_{t+1}, \alpha^*\right) &= k^{\aleph}_{\theta_t}\left(x, X_L\right)\left[P_{t+1}k_{\theta_t}\left(x, X_L\right)\alpha^* - k_{\theta_t}\left(x, X_L\right)\alpha_{t+1}\right]. \end{aligned} \quad (17)$$
Thus (16) is reexpressed as
$$\varphi^2_{t+2} = \varphi^2_{t+1} + \eta^2_t\lambda_1\left(x, \alpha_{t+1}, \alpha^*\right) + \eta^2_t\lambda_2\left(x, \alpha_{t+1}, \alpha^*\right). \quad (18)$$
Let $\varphi_t = [(\varphi^1_t)^T\ (\varphi^2_t)^T]^T$; then the two-phase iteration has the shape of a stochastic approximation; that is,
$$\begin{aligned} \begin{bmatrix} \varphi^1_{t+1} \\ \varphi^2_{t+1} \end{bmatrix} &= \begin{bmatrix} \varphi^1_t \\ \varphi^2_t \end{bmatrix} + \begin{bmatrix} -\eta^1_t\varphi^1_t \\ 0 \end{bmatrix}, \\ \begin{bmatrix} \varphi^1_{t+2} \\ \varphi^2_{t+2} \end{bmatrix} &= \begin{bmatrix} \varphi^1_{t+1} \\ \varphi^2_{t+1} \end{bmatrix} + \begin{bmatrix} 0 \\ \eta^2_t\lambda_1\left(x, \alpha_{t+1}, \alpha^*\right) + \eta^2_t\lambda_2\left(x, \alpha_{t+1}, \alpha^*\right) \end{bmatrix}. \end{aligned} \quad (19)$$
Define
$$V\left(\varphi_t\right) = \varphi_t^T\Lambda\varphi_t = \left[\left(\varphi^1_t\right)^T\ \left(\varphi^2_t\right)^T\right]\begin{bmatrix} I_D & 0 \\ 0 & k^T_{\theta_t}\left(x^*, X_L\right)k_{\theta_t}\left(x^*, X_L\right) \end{bmatrix}\begin{bmatrix} \varphi^1_t \\ \varphi^2_t \end{bmatrix}, \quad (20)$$
where $D$ is the dimension of the hyperparameters and $x^*$ will be defined later. Let $k^*_t$ represent $k^T_{\theta_t}(x^*, X_L)\,k_{\theta_t}(x^*, X_L)$ for short. Obviously, the weighting matrix in (20) is positive definite, and $|x^*| < \infty$, $|k^*_t| < \infty$. Hence $V(\varphi_t)$ is a Lyapunov functional. At moment $t$, the conditional expectation is
$$E_t V_{t+2}\left(\varphi_{t+2}\right) = E_t\left[\varphi^T_{t+2}\Lambda\varphi_{t+2}\right] = E_t\left[\left[\left(\varphi^1_{t+2}\right)^T\ \left(\varphi^2_{t+2}\right)^T\right]\begin{bmatrix} I_D & 0 \\ 0 & k^*_t \end{bmatrix}\begin{bmatrix} \varphi^1_{t+2} \\ \varphi^2_{t+2} \end{bmatrix}\right]. \quad (21)$$
It is easy to compute the first-order Taylor expansion of (21) as
$$E_t V_{t+2}\left(\varphi_{t+2}\right) = E_t V_{t+1}\left(\varphi_{t+1}\right) + \eta^2_t E_t\left[V'_{t+1}\left(\varphi_{t+1}\right)\left(\varphi_{t+2} - \varphi_{t+1}\right)\right] + \left(\eta^2_t\right)^2 C_2 E_t\left[\left(\varphi_{t+2} - \varphi_{t+1}\right)^T\left(\varphi_{t+2} - \varphi_{t+1}\right)\right], \quad (22)$$
in which the first item on the right of (22) is
$$E_t V_{t+1}\left(\varphi_{t+1}\right) = V_t\left(\varphi_t\right) + \eta^1_t E_t\left[V'_t\left(\varphi_t\right)\left(\varphi_{t+1} - \varphi_t\right)\right] + \left(\eta^1_t\right)^2 C_1 E_t\left[\left(\varphi_{t+1} - \varphi_t\right)^T\left(\varphi_{t+1} - \varphi_t\right)\right]. \quad (23)$$
Substituting (23) into (22) yields
$$E_t V_{t+2}\left(\varphi_{t+2}\right) - V_t\left(\varphi_t\right) = \eta^1_t E_t\left[V'_t\left(\varphi_t\right)Y^1_t\right] + \eta^2_t E_t\left[V'_{t+1}\left(\varphi_{t+1}\right)Y^2_{t+1}\right] + \left(\eta^1_t\right)^2 C_1 E_t\left[\left(Y^1_t\right)^2\right] + \left(\eta^2_t\right)^2 C_2 E_t\left[\left(Y^2_{t+1}\right)^2\right], \quad (24)$$
where
$$Y^1_t = \begin{bmatrix} -\varphi^1_t \\ 0 \end{bmatrix}, \qquad Y^2_{t+1} = \begin{bmatrix} 0 \\ P_{t+1}k_{\theta_t}\left(x, X_L\right)\alpha_{t+1} - k_{\theta_t}\left(x, X_L\right)\alpha_{t+1} \end{bmatrix} = \begin{bmatrix} 0 \\ \lambda_1\left(x, \alpha_{t+1}, \alpha^*\right) + \lambda_2\left(x, \alpha_{t+1}, \alpha^*\right) \end{bmatrix}. \quad (25)$$
Consider the last two items of (24) first. If the immediate reward is bounded and an infinite discounted reward is applied, the value function is bounded; hence $|\alpha_t| < \infty$.

If the system is BIBO stable, there exist $\Sigma_0, \Sigma_1 \subset R^{n+m}$ ($n, m$ are the dimensions of the state space and the action space, respectively) such that $\forall x_0 \in \Sigma_0$, $x_\infty \in \Sigma_1$; that is, the policy space is bounded.

According to Proposition 1, when the policy space is bounded and $|\alpha_t| < \infty$, $|\theta_t| < \infty$, there exists a constant $c_1$ so that
$$E\left|V'_{t+1}\left(\varphi_{t+1}\right)Y^2_{t+1}\right| \le c_1. \quad (26)$$
In addition,
$$E\left|Y^2_{t+1}\right|^2 = E\left|\left[\lambda_1\left(x, \alpha_{t+1}, \alpha^*\right) + \lambda_2\left(x, \alpha_{t+1}, \alpha^*\right)\right]^2\right| \le 2E\left|\lambda^2_1\left(x, \alpha_{t+1}, \alpha^*\right)\right| + 2E\left|\lambda^2_2\left(x, \alpha_{t+1}, \alpha^*\right)\right| \le c_2\varphi^2_{t+1} + c_3\varphi^2_{t+1} \le c_4 V\left(\varphi_{t+1}\right). \quad (27)$$
From (26) and (27) we know that
$$E\left|Y^2_{t+1}\right|^2 + E\left|V'_{t+1}\left(\varphi_{t+1}\right)Y^2_{t+1}\right| \le C_3 V\left(\varphi_{t+1}\right) + C_3, \quad (28)$$
where $C_3 = \max\{c_1, c_4\}$. Similarly,
$$E\left|Y^1_t\right|^2 + E\left|V'_t\left(\varphi_t\right)Y^1_t\right| \le C_4 V\left(\varphi_t\right) + C_4. \quad (29)$$
According to Lemma 5.4.1 in [30], $EV(\varphi_t) < \infty$, $E|Y^1_t|^2 < \infty$, and $E|Y^2_{t+1}|^2 < \infty$.

Now we focus on the first two items on the right of (24). The first item, $E_t[V'_t(\varphi_t)Y^1_t]$, is computed as
Now we focus on the first two items on the right of (24)The first item 119864
119905[1198811015840
119905(120593119905)119884
1119905] is computed as
119864119905[1198811015840
119905(120593119905) 119884
1119905]
= 119864119905[(120593
1119905)119879
(1205932119905)119879
] [119868119863
0
0 119896lowast
119905
][minus120593
1119905
0]
= minus100381610038161003816100381610038161205931119905
10038161003816100381610038161003816
2
(30)
For the second item, $E_t[V'_{t+1}(\varphi_{t+1})Y^2_{t+1}]$, when the state transition function is time invariant, it is true that $P_{t+1}k_{\theta_t}(x, X_L)\,\alpha_{t+1} = P_t k_{\theta_t}(x, X_L)\,\alpha_{t+1}$ and $\alpha_{t+1} = \alpha_t$. Then we have
$$E_t\lambda_1\left(x, \alpha_{t+1}, \alpha^*\right) = E_t\left[k^{\aleph}_{\theta_t}\left(x, X_L\right)P_t k_{\theta_t}\left(x, X_L\right)\alpha_t - k^{\aleph}_{\theta_t}\left(x, X_L\right)P_t k_{\theta_t}\left(x, X_L\right)\alpha^*\right] = k^{\aleph}_{\theta_t}\left(x, X_L\right) E\left[P_t k_{\theta_t}\left(x, X_L\right)\alpha_t - P_t k_{\theta_t}\left(x, X_L\right)\alpha^*\right] \le k^{\aleph}_{\theta_t}\left(x, X_L\right)\left\|E\left[P_t k_{\theta_t}\left(x, X_L\right)\alpha_t - P_t k_{\theta_t}\left(x, X_L\right)\alpha^*\right]\right\|. \quad (31)$$
This inequality holds because $k^{\aleph}_{\theta_t}(x, X_L)$ is positive: $k^{\aleph}_{\theta_t, i} > 0$, $i = 1, \ldots, L$. Define the norm $\|\Delta(\theta, \cdot)\| = \max_x|\Delta(\theta, x)|$, where $|\Delta(\cdot)|$ is the 1-norm of the vector, and let $x^* = \arg\max_x|\Delta(\theta, x)|$. Then
$$\begin{aligned} &\left\|E\left[P_t k_{\theta_t}\left(x, X_L\right)\alpha_t - P_t k_{\theta_t}\left(x, X_L\right)\alpha^*\right]\right\| \\ &\quad = \max_x E_t\left|P_t k_{\theta_t}\left(x, X_L\right)\alpha_t - P_t k_{\theta_t}\left(x, X_L\right)\alpha^*\right| \\ &\quad = \max_x\left|E_t\left[\left(r + \beta\max_{a'}k_{\theta_t}\left(x', X_L\right)\alpha_t\right) - \left(r + \beta\max_{a'}k_{\theta_t}\left(x', X_L\right)\alpha^*\right)\right]\right| \\ &\quad = \beta\max_x\left|E_t\left[\max_{a'}k_{\theta_t}\left(x', X_L\right)\alpha_t - \max_{a'}k_{\theta_t}\left(x', X_L\right)\alpha^*\right]\right| \\ &\quad = \beta\max_x\left|\int_{s'}p\left(s' \mid x\right)\left[\max_{a'}k_{\theta_t}\left(x', X_L\right)\alpha_t - \max_{a'}k_{\theta_t}\left(x', X_L\right)\alpha^*\right]ds'\right| \\ &\quad \le \beta\max_x\int_{s'}p\left(s' \mid x\right)\left|\max_{a'}k_{\theta_t}\left(x', X_L\right)\alpha_t - \max_{a'}k_{\theta_t}\left(x', X_L\right)\alpha^*\right|ds' \\ &\quad \le \beta\max_x\int_{s'}p\left(s' \mid x\right)\max_{a'}\left|k_{\theta_t}\left(x', X_L\right)\alpha_t - k_{\theta_t}\left(x', X_L\right)\alpha^*\right|ds' \\ &\quad = \beta\max_x\int_{s'}p\left(s' \mid x\right)\max_{a'}\left|k_{\theta_t}\left(x', X_L\right)\left(\alpha_t - \alpha^*\right)\right|ds' \\ &\quad \le \beta\max_x\left|k_{\theta_t}\left(x, X_L\right)\left(\alpha_t - \alpha^*\right)\right| = \beta\left|k_{\theta_t}\left(x^*, X_L\right)\varphi^2_t\right|. \end{aligned} \quad (32)$$
Hence
$$E_t\lambda_1\left(x, \alpha_{t+1}, \alpha^*\right) \le \beta k^{\aleph}_{\theta_t}\left(x^*, X_L\right)\left|k_{\theta_t}\left(x^*, X_L\right)\varphi^2_t\right|. \quad (33)$$
On the other hand,
$$\begin{aligned} E_t\lambda_2\left(x, \alpha_t, \alpha^*\right) &= E_t\left[k^{\aleph}_{\theta_t}\left(x, X_L\right)P_t k_{\theta_t}\left(x, X_L\right)\alpha^* - k^{\aleph}_{\theta_t}\left(x, X_L\right)k_{\theta_t}\left(x, X_L\right)\alpha_t\right] \\ &= E_t\left[k^{\aleph}_{\theta_t}\left(x, X_L\right)P_t k_{\theta_t}\left(x, X_L\right)\alpha^* - k^{\aleph}_{\theta_t}\left(x, X_L\right)k_{\theta_t}\left(x, X_L\right)\alpha^*\right] + k^{\aleph}_{\theta_t}\left(x, X_L\right)k_{\theta_t}\left(x, X_L\right)\alpha^* - k^{\aleph}_{\theta_t}\left(x, X_L\right)k_{\theta_t}\left(x, X_L\right)\alpha_t \\ &= -k^{\aleph}_{\theta_t}\left(x, X_L\right)\left[k_{\theta_t}\left(x, X_L\right)\varphi^2_t - E_t\lambda_3\left(x, \theta_t, \theta^*\right)\right], \end{aligned} \quad (34)$$
where $\lambda_3(x, \theta_t, \theta^*) = P_t k_{\theta_t}(x, X_L)\,\alpha^* - k_{\theta_t}(x, X_L)\,\alpha^*$ is the value function error caused by the estimated hyperparameters.
Substituting (33) and (34) into $E_t[V'_{t+1}(\varphi_{t+1})Y^2_{t+1}]$ yields
$$\begin{aligned} E_t\left[V'_{t+1}\left(\varphi_{t+1}\right)Y^2_{t+1}\right] &= \left[k_{\theta_t}\left(x^*, X_L\right)\varphi^2_t\right]^T k_{\theta_t}\left(x^*, X_L\right)\left[E_t\lambda_1\left(x, \alpha_{t+1}, \alpha^*\right) + E_t\lambda_2\left(x, \alpha_{t+1}, \alpha^*\right)\right] \\ &\le \beta\left[k_{\theta_t}\left(x^*, X_L\right)\varphi^2_t\right]^T k_{\theta_t}\left(x^*, X_L\right)k^{\aleph}_{\theta_t}\left(x^*, X_L\right)\left|k_{\theta_t}\left(x^*, X_L\right)\varphi^2_t\right| - \left[k_{\theta_t}\left(x^*, X_L\right)\varphi^2_t\right]^T k_{\theta_t}\left(x^*, X_L\right)k^{\aleph}_{\theta_t}\left(x^*, X_L\right)\left[k_{\theta_t}\left(x^*, X_L\right)\varphi^2_t - E_t\lambda_3\left(x^*, \theta_t, \theta^*\right)\right] \\ &\le -(1 - \beta)\left|k_{\theta_t}\left(x^*, X_L\right)\varphi^2_t\right|^2 + \left|\left[k_{\theta_t}\left(x^*, X_L\right)\varphi^2_t\right]^T E_t\lambda_3\left(x^*, \theta_t, \theta^*\right)\right|. \end{aligned} \quad (35)$$
Obviously, if the following inequality is satisfied,
$$\eta^2_t\left|\left[k_{\theta_t}\left(x^*, X_L\right)\varphi^2_t\right]^T E_t\lambda_3\left(x^*, \theta_t, \theta^*\right)\right| < \eta^1_t\left|\varphi^1_t\right|^2 + \eta^2_t\left(1 - \beta\right)\left|k_{\theta_t}\left(x^*, X_L\right)\varphi^2_t\right|^2, \quad (36)$$
there exists a positive constant $\delta$ such that the first two items on the right of (24) satisfy
$$\eta^1_t E_t\left[V'_t\left(\varphi_t\right)Y^1_t\right] + \eta^2_t E_t\left[V'_{t+1}\left(\varphi_{t+1}\right)Y^2_{t+1}\right] \le -\delta. \quad (37)$$
According to Theorem 5.4.2 in [30], the iterative process is convergent; namely, $\varphi_t \xrightarrow{w.p.1} 0$.
Remark 3. Let us check the final convergence position. Obviously, $\varphi_t \xrightarrow{w.p.1} 0$ implies $[\alpha_t\ \theta_t]^T \xrightarrow{w.p.1} [\alpha^*\ f(\alpha^*)]^T$. This means that the equilibrium of the critic network meets $E_t[P_t K_{\theta^*}(x, X_L)\,\alpha^*] = K_{\theta^*}(x, X_L)\,\alpha^*$, where $\theta^* = f(\alpha^*)$. And the equilibrium of the hyperparameters, $\theta^* = f(\alpha^*)$, is the solution of the evidence maximization $\max_\theta L(\theta) = \max_\theta\left(-\frac{1}{2}y^T K^{-1}_{\theta^*}\left(X_L, X_L\right)y - \frac{1}{2}\log\left|K_{\theta^*}\right| - \frac{n}{2}\log 2\pi\right)$, where $y = K_{\theta^*}(X_L, X_L)\,\alpha^*$.
Remark 4. It is clear that the selection of the samples is one of the key issues. Since now all samples have values according to the two-phase iteration, it is convenient, according to an information-based criterion, to evaluate the samples and refine the set by arranging relatively more samples near the equilibrium or at the places where the value function has a great gradient, so that the sample set describes the distribution of the value function better.
Remark 5. Since the two-phase iteration belongs to value iteration, the initial policy of the algorithm does not need to be stable. To ensure BIBO in practice, we need a mechanism to clamp the output of the system, even though the system will then no longer be smooth.
Theorem 2 gives the iteration principle for critic network learning. Based on the critic network, with a proper objective defined, such as minimizing the expected total discounted reward [15], an HDP or DHP update can be applied to optimize the actor network. Since this paper focuses on the ACD and, more importantly, the update process of the actor network does not affect the convergence of the critic network (though it may induce premature or local optimization), the gradient update of the actor is not necessary. Hence a simple optimum seeking is applied to get the optimal actions wrt the sample states, and then an actor network based on Gaussian kernels is generated from these optimal state-action pairs.
Up to now, we have built a critic-actor architecture, which is named Gaussian-kernel-based Approximate Dynamic Programming (GK-ADP for short) and shown in Algorithm 1. A span $K$ is introduced so that the hyperparameters are updated every $K$ updates of $\alpha$. If $K > 1$, the two phases are asynchronous. From the view of stochastic approximation, this asynchronous learning does not change the convergence condition (36) but benefits the computational cost of hyperparameter learning.
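To make the interleaving of the two phases concrete, here is a minimal sketch of one span of the iteration (12) as organized in Algorithm 1 (listed in Section 4). It assumes helper routines `kernel_row`, `greedy_action`, and `f_evidence` (the implicit function (11), realized in practice by a few evidence-gradient steps) as well as an `env` object exposing `reset`, `step`, and `candidate_actions`; all names and the learning-rate schedules are illustrative, not the authors' code.

```python
import numpy as np

def gk_adp_span(env, X_L, alpha, theta, t, K=50, beta=0.95,
                eta1=lambda t: 0.01 / (1 + t) ** 0.02,
                eta2=lambda t: 0.01 / (1 + t) ** 0.02):
    """One hyperparameter span of the two-phase iteration (12).

    Phase 2 (inner loop): TD-style update of the sample values alpha with
    the generalization model theta frozen. Phase 1 (outer step): move theta
    toward the evidence-maximizing hyperparameters f(alpha)."""
    s = env.reset()
    for _ in range(K):
        t += 1
        a = greedy_action(s, X_L, alpha, theta)        # optimum-seeking actor
        s_next, r = env.step(a)
        k_x = kernel_row(np.r_[s, a], X_L, theta)      # k_theta(x, X_L), shape (L,)
        v_next = max(kernel_row(np.r_[s_next, b], X_L, theta) @ alpha
                     for b in env.candidate_actions())
        k_pinv = k_x / (k_x @ k_x)                     # so that k_theta k^aleph = 1
        alpha = alpha + eta2(t) * k_pinv * (r + beta * v_next - k_x @ alpha)
        s = s_next
    theta = theta + eta1(t) * (f_evidence(alpha, X_L, theta) - theta)  # phase 1
    return alpha, theta, t
```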
4. Simulation and Discussion
In this section we propose some numerical simulations about continuous control to illustrate the special properties and the feasibility of the algorithm, including the necessity of two-phase learning, the specific properties compared with traditional kernel-based ACDs, and the performance enhancement resulting from online refinement of the sample set.

Before further discussion, we first give the common setup used in all simulations:
(i) The span $K = 50$.
(ii) To keep the output bounded, once a state is out of the boundary, the system is reset randomly.
(iii) The exploration-exploitation tradeoff is left out of account here. During the learning process, all actions are selected randomly within the limited action space. Thus the behavior of the system is totally unordered during learning.
(iv) The same ALD-based sparsification as in [15] is used to determine the initial sample set, in which $w_i = 18$, $v_0 = 0$, and $v_1 = 0.5$ empirically.
(v) The sampling time and control interval are set to 0.02 s.
4.1. The Necessity of Two-Phase Learning. The proof of Theorem 2 shows that properly determined learning rates of phases 1 and 2 guarantee condition (36). Here we propose a simulation to show how the learning rates in phases 1 and 2 affect the performance of the learning.
Consider a simple single homogeneous inverted pendulum system:
$$\begin{aligned} \dot{s}_1 &= s_2, \\ \dot{s}_2 &= \frac{M_p g\sin\left(s_1\right) - \tau\cos\left(s_1\right)}{M_p d}, \end{aligned} \quad (38)$$
where $s_1$ and $s_2$ represent the angle and the angular speed, $M_p = 0.1$ kg, $g = 9.8$ m/s², and $d = 0.2$ m, respectively, and $\tau$ is a horizontal force acting on the pole.
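As a worked example, the following is a minimal Euler-integration sketch of (38) under the stated parameters and the 0.02 s control interval; it is an illustration under these assumptions, not the authors' simulation code.

```python
import numpy as np

M_P, G, D = 0.1, 9.8, 0.2   # pole mass (kg), gravity (m/s^2), length (m)
DT = 0.02                   # sampling time used in the paper

def pendulum_step(s, tau):
    """One Euler step of (38); s = [angle, angular speed], tau = force."""
    s1, s2 = s
    ds2 = (M_P * G * np.sin(s1) - tau * np.cos(s1)) / (M_P * D)
    return np.array([s1 + DT * s2, s2 + DT * ds2])
```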
We test the success rate under different learning rates. Since during the learning the action is always selected randomly, we have to verify the optimal policy after learning; that is, an independent policy test is carried out to test the actor network. Thus the success of one run of learning means that, in the independent test, the pole can be swung up and maintained within [−0.02, 0.02] rad for more than 200 iterations.
In the simulation, 60 state-action pairs are collected to serve as the sample set, and the learning rates are set to $\eta^2_t = 0.01/(1+t)^{0.018}$ and $\eta^1_t = a/(1+t^{0.018})$, where $a$ varies from 0 to 0.12. The learning is repeated 50 times in order to get the average performance, where the initial states of each run are randomly selected within the bounds $[-0.4, 0.4]$ and $[-0.5, 0.5]$ wrt the dimensions of $s_1$ and $s_2$. The action space is bounded within $[-0.5, 0.5]$, and in each run the initial hyperparameters are set to $[e^{-2.5}, e^{0.5}, e^{0.9}, e^{-1}, e^{0.001}]$.
Initialize:
  $\theta$: hyperparameters of the Gaussian kernel model
  $X_L = \{x_i \mid x_i = (s_i, a_i)\}_{i=1}^{L}$: sample set
  $\pi(s_0)$: initial policy
  $\eta^1_t, \eta^2_t$: learning step sizes
Let $t = 0$
Loop:
  for $k = 1$ to $K$ do
    $t = t + 1$
    $a_t = \pi(s_t)$
    Get the reward $r_t$
    Observe the next state $s_{t+1}$
    Update $\alpha$ according to (12)
    Update the policy $\pi$ according to optimum seeking
  end for
  Update $\theta$ according to (12)
Until the termination criterion is satisfied

Algorithm 1: Gaussian-kernel-based Approximate Dynamic Programming.
Figure 1: Success times over 50 runs under different learning rates $\eta^1_t$ (parameter $a$ = 0, 0.01, 0.02, 0.05, 0.08, 0.1, 0.12; success counts 6, 17, 27, 42, 50, 47, and 46, respectively).
Figure 1 shows the resulting success rates. Clearly, with $\eta^2_t$ fixed, different $\eta^1_t$'s affect the performance significantly. In particular, without the learning of hyperparameters, that is, $\eta^1_t = 0$, there are only 6 successful runs over 50 runs. With the increase of $a$, the success rate increases until it reaches 1 when $a = 0.08$. But as $a$ goes on increasing, the performance becomes worse.

Hence both phases are necessary if the hyperparameters are not initialized properly. It should be noted that the learning rates wrt the two phases need to be regulated carefully in order to guarantee condition (36), which leads the learning process to the equilibrium in Remark 3 and not to some kind of boundary where the actor always executes the maximal or minimal actions.
Figure 2: The average evolution processes (the sum of $Q$ values of all sample state-actions vs. iteration) over 50 runs wrt different $\eta^1_t$ (no learning; $a$ = 0.01, 0.02, 0.05, 0.08, 0.1, and 0.12).
If all samples' values on each iteration are summed up and all cumulative values wrt iterations are depicted in series, then we have Figure 2. The evolution process wrt the best parameter, $a = 0.08$, is marked by the darker diamond. It is clear that even with different $\eta^1_t$, the evolution processes of the samples' value learning are similar. That means, due to the BIBO property, the samples' values must finally be bounded and convergent. However, as mentioned above, it does not mean that the learning process converges to the proper equilibrium.
Figure 3: One-dimensional inverted pendulum.
4.2. Value Iteration versus Policy Iteration. As mentioned in Algorithm 1, the value iteration in critic network learning does not depend on the convergence of the actor network, and, compared with HDP or DHP, the direct optimum seeking for the actor network seems a little aimless. To test its performance and discuss its special learning characteristics, a comparison between GK-ADP and KHDP is carried out.

The control objective in the simulation is a one-dimensional inverted pendulum, where a single-stage inverted pendulum is mounted on a cart which is able to move linearly, just as Figure 3 shows.
The mathematical model is given as
$$\begin{aligned} \dot{s}_1 &= s_2 + \varepsilon, \\ \dot{s}_2 &= \frac{g\sin\left(s_1\right) - M_p d\, s_2^2\cos\left(s_1\right)\sin\left(s_1\right)/M}{d\left(4/3 - M_p\cos^2\left(s_1\right)/M\right)} + \frac{\cos\left(s_1\right)/M}{d\left(4/3 - M_p\cos^2\left(s_1\right)/M\right)}\, u, \\ \dot{s}_3 &= s_4, \\ \dot{s}_4 &= \frac{u + M_p d\left(s_2^2\sin\left(s_1\right) - \dot{s}_2\cos\left(s_1\right)\right)}{M}, \end{aligned} \quad (39)$$
where $s_1$ to $s_4$ represent the angle, the angular speed, the linear displacement, and the linear speed of the cart, respectively, $u$ represents the force acting on the cart, $M = 0.2$ kg, $d = 0.5$ m, $\varepsilon \sim U(-0.02, 0.02)$, and the other denotations are the same as in (38).
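A corresponding sketch of the cart-pole dynamics (39), again with simple Euler integration at the 0.02 s control interval, is given below; as before, this is an illustration under the stated assumptions rather than the authors' simulation code.

```python
import numpy as np

M, M_P, D, G, DT = 0.2, 0.1, 0.5, 9.8, 0.02

def cartpole_step(s, u, eps=0.0):
    """One Euler step of (39); s = [s1, s2, s3, s4], u = force on the cart."""
    s1, s2, s3, s4 = s
    denom = D * (4.0 / 3.0 - M_P * np.cos(s1)**2 / M)
    ds2 = ((G * np.sin(s1) - M_P * D * s2**2 * np.cos(s1) * np.sin(s1) / M)
           / denom + np.cos(s1) / (M * denom) * u)
    ds4 = (u + M_P * D * (s2**2 * np.sin(s1) - ds2 * np.cos(s1))) / M
    return np.array([s1 + DT * (s2 + eps), s2 + DT * ds2,
                     s3 + DT * s4, s4 + DT * ds4])
```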
Thus the state-action space is a 5D space, much larger than that in simulation 1. The configurations of both algorithms are listed as follows:
(i) A small sample set with 50 samples, determined by ALD-based sparsification, is adopted to build the critic network.
(ii) For both algorithms, the state-action pair is limited to $[-0.3, 0.3]$, $[-1, 1]$, $[-0.3, 0.3]$, $[-2, 2]$, $[-3, 3]$ wrt $s_1$ to $s_4$ and $u$.
(iii) In GK-ADP, the learning rates are $\eta^1_t = 0.02/(1+t^{0.01})$ and $\eta^2_t = 0.02/(1+t)^{0.02}$.
(iv) The KHDP algorithm in [15] is chosen as the comparison, where the critic network uses Gaussian kernels. To get proper Gaussian parameters, a series of $\sigma$, namely, 0.9, 1.2, 1.5, 1.8, 2.1, and 2.4, is tested in the simulation. The elements of the weights $\alpha$ are initialized randomly in $[-0.5, 0.5]$, the forgetting factor in RLS-TD(0) is set to $\mu = 1$, and the learning rate in KHDP is set to 0.3.
(v) All experiments run 100 times to get the average performance, and in each run there are 10000 iterations to learn the critic network.

Figure 4: The comparison of the average evolution processes over 100 runs (the cumulated critic weights of the kernel ACD and the cumulated sample $Q$ values of GK-ADP vs. iteration; GK-ADP and KHDP with $\sigma$ = 0.9, 1.2, 1.5, 1.8, 2.1, and 2.4).
Before further discussion, it should be noted that it makes no sense to argue over which algorithm learns better, because, besides the debate of policy iteration versus value iteration, there are too many parameters in the simulation configuration affecting learning performance. So the aim of this simulation is to illustrate the learning characteristics of GK-ADP.
However, to make the comparison as fair as possible, we regulate the learning rates of both algorithms to get similar evolution processes. Figure 4 shows the evolution processes of GK-ADP and the KHDPs under different $\sigma$, where the $y$-axis on the left represents the cumulated weights $\sum_i \alpha_i$ of the critic network in the kernel ACD and the other $y$-axis represents the cumulated values of the samples, $\sum_i Q(x_i)$, $x_i \in X_L$, in GK-ADP. It implies that, although with different units, the learning processes under both algorithms converge at nearly the same speed.

Then the success rates of all algorithms are depicted in Figure 5. The left six bars represent the success rates of KHDP with different $\sigma$. With the superiority argument left aside, we find that such a fixed $\sigma$ in KHDP is very similar to the fixed hyperparameters $\theta$, which need to be set properly in order to get a higher success rate. But unfortunately there is no mechanism in KHDP to regulate $\sigma$ online. On the
Figure 5: The success rates over 100 runs wrt all algorithms (KHDP with $\sigma$ = 0.9, 1.2, 1.5, 1.8, 2.1, 2.4: 67, 68, 76, 75, 58, and 57; GK-ADP: 96).
contrary, the two-phase iteration introduces the update of the hyperparameters into the critic network, which is able to drive the hyperparameters to better values even from not-so-good initial values. In fact, this two-phase update can be viewed as a kind of kernel ACD with a dynamic $\sigma$ when the Gaussian kernel is used.

To discuss the performance more deeply, we plot the test trajectories resulting from the actors optimized by GK-ADP and KHDP in Figures 6(a) and 6(b), respectively, where the start state is set to $[0.2, 0, -0.2, 0]$.

Apparently the transition time of GK-ADP is much smaller than that of KHDP. We think that, besides possibly better regulated parameters, an important reason is the nongradient learning of the actor network.

The critic learning only depends on the exploration-exploitation balance, not on the convergence of the actor learning. If the exploration-exploitation balance is designed without the actor network output, the learning processes of the actor and critic networks are relatively independent of each other, and then there are alternatives to gradient learning for actor network optimization, for example, the direct optimum seeking Gaussian regression actor network in GK-ADP.
Such direct optimum seeking may result in a nearly nonsmooth actor network, just like the force output depicted in the second plot of Figure 6(a). To explain this phenomenon, we can find the clue in Figure 7, which illustrates the best actions wrt all sample states according to the final actor network. It is reasonable that the force direction is always the same as the pole angle and contrary to the cart displacement. And clearly the transition interval from the negative limit −3 N to the positive 3 N is very short. Thus the output of the actor network resulting from Gaussian kernels intends to control the angle bias as quickly as possible, even though such quick control sometimes makes the output force a little nonsmooth.
It is obvious that GK-ADP is highly efficient but also carries potential risks, such as the impact on actuators. Hence how to design the exploration-exploitation balance to satisfy Theorem 2 and how to design actor network learning to balance efficiency and engineering applicability are two important issues in future work.
Finally, we check the samples' values, which are depicted in Figure 8, where only the dimensions of pole angle and linear displacement are shown. Due to the definition of the immediate reward, the closer the states are to the equilibrium, the smaller the $Q$ value is.
If we execute the 96 successful policies one by one and record all final cart displacements and linear speeds, just as Figure 9 shows, the shortage of this sample set is uncovered: even with good convergence of the pole angle, the optimal policy can hardly drive the cart back to the zero point.
To solve this problem and improve performance, besides optimizing the learning parameters, Remark 4 implies that another and maybe more efficient way is refining the sample set based on the samples' values. We investigate this in the following subsection.
4.3. Online Adjustment of the Sample Set. Since the samples' values are learned, it is possible to assess whether the samples are chosen reasonably. In this simulation we adopt the following expected utility [26]:
$$U(x) = \rho h\left(E\left[Q(x) \mid X_L\right]\right) + \frac{\beta}{2}\log\left(\mathrm{var}\left[Q(x) \mid X_L\right]\right), \quad (40)$$
where $h(\cdot)$ is a normalization function over the sample set and $\rho = 1$, $\beta = 0.3$ are coefficients.

Let $N_{\text{add}}$ represent the number of samples added to the sample set, $N_{\text{limit}}$ the size limit of the sample set, and $T(x) = \{x(0), x(1), \ldots, x(T_e)\}$ the state-action pair trajectory during GK-ADP learning, where $T_e$ denotes the end iteration of learning. The algorithm to refine the sample set is presented as Algorithm 2.
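A sketch of evaluating (40) for a candidate pair follows, reusing the `gp_predict` helper sketched in Section 2.2; the min-max normalization used for $h(\cdot)$ and its value range are assumptions for illustration, since the paper only calls $h$ a normalization function.

```python
import numpy as np

def expected_utility(x, X_L, y, v0, v1, lambda_diag,
                     rho=1.0, beta=0.3, q_min=0.0, q_max=40.0):
    """Expected utility (40) of a candidate state-action pair x.

    h() is taken here as a min-max normalization of the predicted mean
    over an assumed value range -- one possible choice of normalization."""
    mean, var = gp_predict(np.asarray(x), X_L, y, v0, v1, lambda_diag)
    h = (mean - q_min) / (q_max - q_min)
    return rho * h + 0.5 * beta * np.log(var)
```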
Since in simulation 2 we carried out 100 runs of the experiment for GK-ADP, 100 sets of sample values wrt the same sample states were obtained. Now let us apply Algorithm 2 to these resulting sample sets, where $T(x)$ is picked up from the history trajectory of state-action pairs in each run and $N_{\text{add}} = 10$, $N_{\text{limit}} = 60$, in order to add 10 samples into the sample set.

We repeat all 100 runs of GK-ADP again with the same learning rates $\eta^1$ and $\eta^2$. Compared with the 96 successful runs in simulation 2, 94 runs are successful in obtaining control policies that swing up the pole. Figure 10(a) shows the average evolution process of the sample values over 100 runs, where the first 10000 iterations display the learning process in simulation 2 and the following 8000 iterations display the learning process after adding samples. Clearly the cumulative $Q$ value exhibits a sudden increase after the samples are added and almost keeps steady afterwards. It implies that, even with more samples added, there is not too much left to learn for the sample values.

However, if we check the cart movement, we will find the enhancement brought by the change of the sample set. For the $j$th run, we have two actor networks resulting from GK-ADP before and after adding samples. Using these actors, we carry out the control tests, respectively, and collect the absolute
Figure 6: The outputs of the actor networks resulting from GK-ADP and KHDP: (a) the control performance (pole angle/speed and force vs. time) using GK-ADP; (b) the control performance using KHDP.
Figure 7: The final best policies (best action vs. pole angle and linear displacement) wrt all samples.
Figure 8: The $Q$ values projected onto the dimensions of pole angle ($s_1$) and linear displacement ($s_3$) wrt all samples.
Figure 9: The final cart displacement and linear speed in the tests of the successful policies.
values of the cart displacement, denoted by $S^b_3(j)$ and $S^a_3(j)$, at the moment that the pole has been swung up for 100 instants. We define the enhancement as
$$E(j) = \frac{S^b_3(j) - S^a_3(j)}{S^b_3(j)}. \quad (41)$$
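In code, (41) is a one-line relative improvement; the numeric example below is purely illustrative.

```python
def enhancement(s_b, s_a):
    """Enhancement (41): relative reduction of the absolute cart
    displacement after the sample set has been refined."""
    return (s_b - s_a) / s_b

# e.g. a displacement drop from 0.8 m to 0.2 m gives a 75% enhancement
assert abs(enhancement(0.8, 0.2) - 0.75) < 1e-9
```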
Obviously, a positive enhancement indicates that the actor network after adding samples behaves better. When all enhancements wrt the 100 runs are illustrated together, just
for $l = 1$ to the end of $T(x)$
  Calculate $E[Q(x(l)) \mid X_L]$ and $\mathrm{var}[Q(x(l)) \mid X_L]$ using (6) and (7)
  Calculate $U(x(l))$ using (40)
end for
Sort all $U(x(l))$, $l = 1, \ldots, T_e$, in ascending order
Add the first $N_{\text{add}}$ state-action pairs to the sample set to get the extended set $X_{L+N_{\text{add}}}$
if $L + N_{\text{add}} > N_{\text{limit}}$
  Calculate the hyperparameters based on $X_{L+N_{\text{add}}}$
  for $l = 1$ to $L + N_{\text{add}}$
    Calculate $E[Q(x(l)) \mid X_{L+N_{\text{add}}}]$ and $\mathrm{var}[Q(x(l)) \mid X_{L+N_{\text{add}}}]$ using (6) and (7)
    Calculate $U(x(l))$ using (40)
  end for
  Sort all $U(x(l))$, $l = 1, \ldots, L + N_{\text{add}}$, in descending order
  Delete the first $L + N_{\text{add}} - N_{\text{limit}}$ samples to get the refined set $X_{N_{\text{limit}}}$
end if

Algorithm 2: Refinement of the sample set.
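A compact sketch of Algorithm 2 follows, reusing the `expected_utility` and `gp_predict` helpers sketched earlier; initializing the added samples' values from the GP mean (6) and the `gp_params` packaging are assumptions for illustration.

```python
import numpy as np

def refine_sample_set(trajectory, X_L, y, gp_params, n_add=10, n_limit=60):
    """Sketch of Algorithm 2: extend the sample set with the n_add
    lowest-utility visited pairs, then prune back to n_limit by deleting
    the highest-utility samples; gp_params holds v0, v1, lambda_diag."""
    trajectory = np.asarray(trajectory)
    U = np.array([expected_utility(x, X_L, y, **gp_params) for x in trajectory])
    added = trajectory[np.argsort(U)[:n_add]]              # ascending order
    X_ext = np.vstack([X_L, added])
    # values of the new samples are initialized from the GP mean (6)
    y_ext = np.concatenate([y, [gp_predict(x, X_L, y, **gp_params)[0]
                                for x in added]])
    if len(X_ext) > n_limit:
        # the hyperparameters would be re-estimated on X_ext at this point
        U_ext = np.array([expected_utility(x, X_ext, y_ext, **gp_params)
                          for x in X_ext])
        keep = np.argsort(U_ext)[:n_limit]                 # drop the highest U
        X_ext, y_ext = X_ext[keep], y_ext[keep]
    return X_ext, y_ext
```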
Figure 10: Learning performance of GK-ADP if the sample set is refined by adding 10 samples: (a) the average evolution process of the $Q$ values before and after adding samples; (b) the enhancement of the control performance wrt cart displacement after adding samples.
As Figure 10(b) shows, almost all cart displacements at the moment the pole is swung up are enhanced more or less, except for 8 failures. Hence, with a proper principle based on the sample values, the sample set can be refined online in order to enhance performance.
Finally, let us check which state-action pairs are added into the sample set. We put all samples added over the 100 runs together and depict their values in Figure 11(a), where the values are projected onto the dimensions of pole angle and cart displacement. If we consider only the relationship between the Q values and the pole angle, just as Figure 11(b) shows, we find that the refinement principle tends to select samples a little away from the equilibrium and near the boundary −0.4, due to the ratio between ρ and β.
5. Conclusions and Future Work
ADP methods are among the most promising research directions for RL in continuous environments to construct learning systems for nonlinear optimal control. This paper presents GK-ADP with two-phase value iteration, which combines the advantages of kernel ACDs and GP-based value iteration.

The theoretical analysis reveals that, with proper learning rates, the two-phase iteration is good at making the Gaussian-kernel-based critic network converge to the structure with optimal hyperparameters and approximate all samples' values.

A series of simulations are carried out to verify the necessity of two-phase learning and to illustrate the properties of GK-ADP. Finally, the numerical tests support the viewpoint that the assessment of samples' values provides a way to refine the sample set online.
[Figure 11: The Q values wrt all added samples over 100 runs. (a) Projection onto the dimensions of pole angle (rad) and cart displacement (m); (b) projection onto the dimension of pole angle (rad).]
Such online refinement of the sample set enhances the performance of the critic-actor architecture during operation.
However, some issues need to be considered in the future. The first is how to guarantee condition (36) during learning, which is now handled in empirical ways. The second is the balance between exploration and exploitation, which is always an open question, but which seems more notable here, because a bad exploration-exploitation principle will lead the two-phase iteration to failure, not only to slow convergence.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This work is supported by the National Natural Science Foundation of China under Grants 61473316 and 61202340 and the International Postdoctoral Exchange Fellowship Program under Grant no. 20140011.
References

[1] R. Sutton and A. Barto, Reinforcement Learning: An Introduction, Adaptive Computation and Machine Learning, MIT Press, Cambridge, Mass, USA, 1998.
[2] F. L. Lewis and D. Vrabie, "Reinforcement learning and adaptive dynamic programming for feedback control," IEEE Circuits and Systems Magazine, vol. 9, no. 3, pp. 32–50, 2009.
[3] F. L. Lewis, D. Vrabie, and K. Vamvoudakis, "Reinforcement learning and feedback control: using natural decision methods to design optimal adaptive controllers," IEEE Control Systems Magazine, vol. 32, no. 6, pp. 76–105, 2012.
[4] H. Zhang, Y. Luo, and D. Liu, "Neural-network-based near-optimal control for a class of discrete-time affine nonlinear systems with control constraints," IEEE Transactions on Neural Networks, vol. 20, no. 9, pp. 1490–1503, 2009.
[5] Y. Huang and D. Liu, "Neural-network-based optimal tracking control scheme for a class of unknown discrete-time nonlinear systems using iterative ADP algorithm," Neurocomputing, vol. 125, pp. 46–56, 2014.
[6] X. Xu, L. Zuo, and Z. Huang, "Reinforcement learning algorithms with function approximation: recent advances and applications," Information Sciences, vol. 261, pp. 1–31, 2014.
[7] A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, "Discrete-time nonlinear HJB solution using approximate dynamic programming: convergence proof," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 38, no. 4, pp. 943–949, 2008.
[8] S. Ferrari, J. E. Steck, and R. Chandramohan, "Adaptive feedback control by constrained approximate dynamic programming," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 38, no. 4, pp. 982–987, 2008.
[9] F.-Y. Wang, H. Zhang, and D. Liu, "Adaptive dynamic programming: an introduction," IEEE Computational Intelligence Magazine, vol. 4, no. 2, pp. 39–47, 2009.
[10] D. Wang, D. Liu, Q. Wei, D. Zhao, and N. Jin, "Optimal control of unknown nonaffine nonlinear discrete-time systems based on adaptive dynamic programming," Automatica, vol. 48, no. 8, pp. 1825–1832, 2012.
[11] D. Vrabie and F. Lewis, "Neural network approach to continuous-time direct adaptive optimal control for partially unknown nonlinear systems," Neural Networks, vol. 22, no. 3, pp. 237–246, 2009.
[12] D. Liu, D. Wang, and X. Yang, "An iterative adaptive dynamic programming algorithm for optimal control of unknown discrete-time nonlinear systems with constrained inputs," Information Sciences, vol. 220, pp. 331–342, 2013.
[13] N. Jin, D. Liu, T. Huang, and Z. Pang, "Discrete-time adaptive dynamic programming using wavelet basis function neural networks," in Proceedings of the IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning, pp. 135–142, Honolulu, Hawaii, USA, April 2007.
[14] P. Koprinkova-Hristova, M. Oubbati, and G. Palm, "Adaptive critic design with echo state network," in Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (SMC '10), pp. 1010–1015, IEEE, Istanbul, Turkey, October 2010.
[15] X. Xu, Z. Hou, C. Lian, and H. He, "Online learning control using adaptive critic designs with sparse kernel machines," IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 5, pp. 762–775, 2013.
[16] V. Vapnik, Statistical Learning Theory, Wiley, New York, NY, USA, 1998.
[17] C. E. Rasmussen and C. K. Williams, Gaussian Processes for Machine Learning, Adaptive Computation and Machine Learning, MIT Press, Cambridge, Mass, USA, 2006.
[18] Y. Engel, S. Mannor, and R. Meir, "The kernel recursive least-squares algorithm," IEEE Transactions on Signal Processing, vol. 52, no. 8, pp. 2275–2285, 2004.
[19] T. G. Dietterich and X. Wang, "Batch value function approximation via support vectors," in Advances in Neural Information Processing Systems, pp. 1491–1498, MIT Press, Cambridge, Mass, USA, 2002.
[20] X. Wang, X. Tian, Y. Cheng, and J. Yi, "Q-learning system based on cooperative least squares support vector machine," Acta Automatica Sinica, vol. 35, no. 2, pp. 214–219, 2009.
[21] T. Hofmann, B. Scholkopf, and A. J. Smola, "Kernel methods in machine learning," The Annals of Statistics, vol. 36, no. 3, pp. 1171–1220, 2008.
[22] M. F. Huber, "Recursive Gaussian process: on-line regression and learning," Pattern Recognition Letters, vol. 45, no. 1, pp. 85–91, 2014.
[23] Y. Engel, S. Mannor, and R. Meir, "Bayes meets Bellman: the Gaussian process approach to temporal difference learning," in Proceedings of the 20th International Conference on Machine Learning, pp. 154–161, August 2003.
[24] Y. Engel, S. Mannor, and R. Meir, "Reinforcement learning with Gaussian processes," in Proceedings of the 22nd International Conference on Machine Learning, pp. 201–208, ACM, August 2005.
[25] C. E. Rasmussen and M. Kuss, "Gaussian processes in reinforcement learning," in Advances in Neural Information Processing Systems, S. Thrun, L. K. Saul, and B. Scholkopf, Eds., pp. 751–759, MIT Press, Cambridge, Mass, USA, 2004.
[26] M. P. Deisenroth, C. E. Rasmussen, and J. Peters, "Gaussian process dynamic programming," Neurocomputing, vol. 72, no. 7–9, pp. 1508–1524, 2009.
[27] X. Xu, C. Lian, L. Zuo, and H. He, "Kernel-based approximate dynamic programming for real-time online learning control: an experimental study," IEEE Transactions on Control Systems Technology, vol. 22, no. 1, pp. 146–156, 2014.
[28] A. Girard, C. E. Rasmussen, and R. Murray-Smith, "Gaussian process priors with uncertain inputs: multiple-step-ahead prediction," Tech. Rep., 2002.
[29] X. Xu, T. Xie, D. Hu, and X. Lu, "Kernel least-squares temporal difference learning," International Journal of Information Technology, vol. 11, no. 9, pp. 54–63, 2005.
[30] H. J. Kushner and G. G. Yin, Stochastic Approximation and Recursive Algorithms and Applications, vol. 35 of Applications of Mathematics, Springer, Berlin, Germany, 2nd edition, 2003.
of values. In the literature, ADP approaches are categorized into the following main schemes: heuristic dynamic programming (HDP), dual heuristic programming (DHP), globalized dual heuristic programming (GDHP), and their action-dependent versions [9, 10].
ADP research often adopts multilayer perceptron neural networks (MLPNNs) as the critic design. Vrabie and Lewis proposed an online approach to continuous-time direct adaptive optimal control, which made use of neural networks to parametrically represent the control policy and the performance of the control system [11]. Liu et al. solved the constrained optimal control problem of unknown discrete-time nonlinear systems based on the iterative ADP algorithm via the GDHP technique with three neural networks [12]. In fact, different kinds of neural networks (NNs) play important roles in ADP algorithms, such as radial basis function NNs [3], wavelet basis function NNs [13], and echo state networks [14].
Besides the benefits brought by NNs, ADP methods always suffer from some problems in the design of the NNs. On one hand, the learning control performance greatly depends on the empirical design of the critic network, especially the manual setting of the hidden layer or the basis functions. On the other hand, due to the local minima in neural network training, how to improve the quality of the final policies is still an open problem [15].
As we can see, it is difficult to evaluate the effectiveness of a parametric model when knowledge of the model's order or of the nonlinear characteristics of the system is insufficient. Compared with parametric modeling methods, nonparametric modeling methods, especially kernel methods [16, 17], do not need to set the model structure in advance. Hence kernel machines have been popularly studied to realize nonlinear and nonparametric modeling. Engel et al. proposed the kernel recursive least-squares algorithm to construct minimum mean-squared-error solutions to nonlinear least-squares problems [18]. As popular kernel machines, support vector machines (SVMs) have also been applied to nonparametric modeling problems. Dietterich and Wang combined linear programming with SVMs to find value function approximations [19]. Similar research was published in [20], in which least-squares SVMs were used. Nevertheless, both focused on discrete state/action spaces and more or less lacked theoretical results on the policies obtained.
In addition to SVMs, Gaussian processes (GPs) have become an alternative generalization method. GP models are powerful nonparametric tools for approximate Bayesian inference and learning. In comparison with other popular nonlinear architectures, such as multilayer perceptrons, their behavior is conceptually simpler to understand, and model fitting can be achieved without resorting to nonconvex optimization routines [21, 22]. In [23], Engel et al. first applied GPs in temporal-difference (TD) learning for MDPs with stochastic rewards and deterministic transitions. They derived a new GPTD algorithm in [24] that overcame the limitation of deterministic transitions and extended GPTD to the estimation of state-action values. The GPTD algorithm addresses only value approximation, so it has to be combined with actor-critic methods or other policy iteration methods to solve learning control problems.
An alternative approach employing GPs in RL is the model-based value iteration or policy iteration method, in which a GP model is used to model the system dynamics and represent the value function [25]. In [26], an approximated value-function-based RL algorithm named Gaussian process dynamic programming (GPDP) was presented, which built the dynamic transition model, the value function model, and the action policy model, respectively, using GPs. In this way, the sample set is adjusted to a reasonable shape, with high sample densities near the equilibria or the places where the value function changes dramatically. Thus it is good at controlling nonlinear systems with complex dynamics. A major shortcoming, even if the relatively high computational cost is endurable, is that the states in the sample set need to be revisited again and again in order to update their value functions. Since this condition is impractical in real implementations, it diminishes the appeal of this method.
Kernel-based methods have also been introduced to ADP. In [15], a novel framework of ACDs with sparse kernel machines was presented by integrating kernel methods into the critic network. A sparsification method based on approximate linear dependence (ALD) analysis was used to sparsify the kernel machines. Obviously, this method can overcome the difficulty of presetting the model structure in parametric models and realize actor-critic learning online [27]. However, the selection of samples based on ALD analysis is an offline procedure that does not consider the distribution of the value function. Therefore, the data samples cannot be adjusted online, which makes the method more suitable for control systems with smooth dynamics, where the value function changes gently.
We think GPDP and ACDs with sparse kernel machines are complementary. As indicated in [28], the prediction of a GP can be viewed as a linear combination of the covariances between the new points and the samples. Hence it seems reasonable to introduce a kernel machine with GPs to build the critic network in ACDs, provided the values of the samples are known, or at least can be assessed numerically. Then the sample set can be adjusted online during critic-actor learning.
The major problem here is how to realize value function learning and GP model updating simultaneously, especially under the condition that the samples of the state-action space can hardly be revisited infinitely often in order to approximate their values. To tackle this problem, a two-phase iteration is developed in order to obtain the optimal control policy for a system whose dynamics are unknown a priori.
2. Description of the Problem
In general, ADP is an actor-critic method which approximates the value functions and policies to encourage the realization of generalization in MDPs with large or continuous spaces. The critic design plays the most important role, because it determines how the actor optimizes its action. Hence we give a brief introduction to both kernel-based ACDs and GPs in order to derive a clear description of the theoretical problem.
2.1. Kernel-Based ACDs. Kernel-based ACDs mainly consist of a critic network, a kernel-based feature learning module, a reward function, an actor network/controller, and a model of the plant. The critic, constructed by a kernel machine, is used to approximate the value functions or their derivatives. Then the output of the critic is used in the training process of the actor, so that policy gradients can be computed. As the actor finally converges, the optimal action policy mapping states to actions is described by this actor.
A traditional neural network based on a kernel machine and samples serves as the model of the value function, just as the following equation shows, and a recursive algorithm such as KLSTD [29] serves as the value function approximation:

    Q(x) = Σ_{i=1}^{L} α_i k(x, x_i),  (1)

where x and x_i represent the state-action pairs (s, a) and (s_i, a_i), α_i, i = 1, 2, ..., L, are the weights, (s_i, a_i) represents a selected state-action pair in the sample set, and k(x, x_i) is a kernel function.

The key of critic learning is the update of the weight vector α. The value function modeling in continuous state/action space is a regression problem. From the view of the model, the NN structure is a reproducing kernel space spanned by the samples, in which the value function is expressed as a linear regression function. Obviously, as the basis of the Hilbert space, how the samples are selected determines the VC dimension of identification as well as the performance of value function approximation.
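To make the structure of (1) concrete, here is a minimal sketch in Python/NumPy of such a critic with a Gaussian kernel; the sample pairs, weights α, and kernel widths below are illustrative assumptions, not values from the paper.

    import numpy as np

    def gaussian_kernel(x, xi, widths):
        # k(x, x_i) = exp(-0.5 * (x - x_i)^T diag(widths)^-1 (x - x_i)), amplitude set to 1
        d = x - xi
        return np.exp(-0.5 * np.sum(d * d / widths))

    def critic_q(x, samples, alpha, widths):
        # Q(x) = sum_i alpha_i * k(x, x_i), Eq. (1)
        return sum(a * gaussian_kernel(x, xi, widths) for a, xi in zip(alpha, samples))

    # Illustrative use with L = 3 random state-action samples of dimension 3
    rng = np.random.default_rng(0)
    samples = rng.uniform(-0.5, 0.5, size=(3, 3))   # each row is (s1, s2, a)
    alpha = rng.normal(size=3)                       # weights, normally learned
    widths = np.ones(3)                              # kernel length scales (hyperparameters)
    print(critic_q(np.zeros(3), samples, alpha, widths))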
If only ALD-based kernel sparsification is used, sample selection considers nothing but the independence of the samples. So it is hard to evaluate how good the sample set is, because the sample selection does not consider the distribution of the value function, and the performance of the ALD analysis is affected seriously by the hyperparameters of the kernel function, which are predetermined empirically and fixed during learning.

If the hyperparameters can be optimized online and the value function wrt the samples can be evaluated by iteration algorithms, the critic network will be optimized not only by value approximation but also by hyperparameter optimization. Moreover, with approximated sample values, there is a direct way to evaluate the validity of the sample set in order to regulate the set online. Thus, in this paper, we turn to Gaussian processes to construct the criterion for samples and hyperparameters.
2.2. GP-Based Value Function Model. For an MDP, the data samples x_i and the corresponding Q values can be collected by observing the MDP. Here x_i is the state-action pair (s_i, a_i), s ∈ R^N, a ∈ R^M, and Q(x) is the value function defined as

    Q*(x) = R(x) + β ∫ p(s′ | x) V(s′) ds′,  (2)

where V(s) = max_a Q(x).
Given a sample set collected from a continuous dynamic system, {X_L, y}, where y = Q(X_L) + ε, ε ∼ N(0, v₀), Gaussian regression with the covariance function shown in the following equation is a well-known modeling technique to infer the Q function:

    k(x_p, x_q) = v₁ exp[−(1/2)(x_p − x_q)^T Λ^{−1} (x_p − x_q)],  (3)

where Λ = diag([w₁, w₂, ..., w_{N+M}]).

Assuming additive independent identically distributed Gaussian noise ε with variance v₀, the prior on the noisy observations becomes

    K_L = K(X_L, X_L) + v₀ I_L,  (4)

where

    K(X_L, X_L) = [ k(x₁, x₁)  k(x₁, x₂)  ⋯  k(x₁, x_L) ;  k(x₂, x₁)  ⋱  ⋮ ;  k(x_L, x₁)  ⋯  k(x_L, x_L) ].  (5)
The parameters w_d, v₀, and v₁ are the hyperparameters of the Gaussian regression and are collected within the vector θ = [w₁ ⋯ w_{N+M}, v₁, v₀]^T.
For an arbitrary input x*, the predictive distribution of the function value is Gaussian distributed, with mean and variance given by

    E[Q(x*)] = K(x*, X_L) K_L^{−1} y,  (6)
    var[Q(x*)] = k(x*, x*) − K(x*, X_L) K_L^{−1} K(X_L, x*).  (7)
Comparing (6) to (1), we find that, if we let α = K_L^{−1} y, the neural network can also be regarded as Gaussian regression. In other words, the critic network can be constructed based on a Gaussian kernel machine if the following conditions are satisfied.

Condition 1. The hyperparameters θ of the Gaussian kernel are known.

Condition 2. The values y wrt all sample states X_L are known.

With a Gaussian-kernel-based critic network, the sample state-action pairs and the corresponding values are known. Then a criterion, such as the comprehensive utility proposed in [26], can be set up in order to refine the sample set online. At the same time, it is convenient to optimize the hyperparameters in order to approximate the Q value function more accurately. Thus, besides the advantages brought by the kernel machine, a critic based on Gaussian kernels will be better at approximating value functions.
Consider Condition 1. If the values y wrt X_L are known (note that this is indeed Condition 2), the common way to get the hyperparameters is evidence maximization, where the log-evidence is given by

    L(θ) = −(1/2) y^T K_L^{−1} y − (1/2) log|K_L| − (n/2) log 2π.  (8)
It requires the calculation of the derivative of L(θ) wrt each θ_i, given by

    ∂L(θ)/∂θ_i = −(1/2) tr[K_L^{−1} (∂K_L/∂θ_i)] + (1/2) α^T (∂K_L/∂θ_i) α,  (9)

where tr[·] denotes the trace and α = K_L^{−1} y.

Consider Condition 2. For an unknown system, if K_L is known, that is, the hyperparameters are known (note that this is indeed Condition 1), the update of the critic network is transferred to the update of the values wrt the samples by using value iteration.

According to this analysis, both conditions are interdependent. That means the update of the critic depends on known hyperparameters, and the optimization of the hyperparameters depends on accurate sample values.

Hence we need a comprehensive iteration method to realize value approximation and optimization of the hyperparameters simultaneously. A direct way is to update them alternately. Unfortunately, this way is not reasonable, because the two processes are tightly coupled. For example, temporal difference errors drive value approximation, but a change of the weights α will cause a change of the hyperparameters θ simultaneously; then it is difficult to tell whether a given temporal difference error is induced by observation or by the changing Gaussian regression model.

To solve this problem, a kind of two-phase value iteration for the critic network is presented in the next section, and the conditions of convergence are analyzed.
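Before moving on, here is a small sketch of how the evidence gradient (9) can be evaluated and used for one ascent step; the one-dimensional kernel and its width derivative below are illustrative assumptions.

    import numpy as np

    def log_evidence_grad(K_L, dK_dtheta, y):
        # Eq. (9): dL/dtheta = -0.5 tr(K^-1 dK) + 0.5 alpha^T dK alpha, alpha = K^-1 y
        K_inv = np.linalg.inv(K_L)
        alpha = K_inv @ y
        return -0.5 * np.trace(K_inv @ dK_dtheta) + 0.5 * alpha @ dK_dtheta @ alpha

    # Example: K built from a 1-D Gaussian kernel with width w; dK/dw known analytically
    X = np.array([0.0, 0.5, 1.0])
    y = np.array([1.0, 0.4, 0.1])
    w, v0 = 1.0, 1e-2
    D = (X[:, None] - X[None, :]) ** 2
    K_L = np.exp(-0.5 * D / w) + v0 * np.eye(3)
    dK_dw = np.exp(-0.5 * D / w) * (0.5 * D / w**2)    # derivative of K wrt the width w
    w += 0.1 * log_evidence_grad(K_L, dK_dw, y)        # one gradient-ascent step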
3. Two-Phase Value Iteration for Critic Network Approximation
First, a proposition is given to describe the relationship between the hyperparameters and the sample value function.
Proposition 1. The hyperparameters are optimized by evidence maximization according to the samples and their Q values, and the gradient of the log-evidence satisfies

    ∂L(θ)/∂θ_i = −(1/2) tr[K_L^{−1} (∂K_L/∂θ_i)] + (1/2) α^T (∂K_L/∂θ_i) α = 0,  i = 1, ..., D,  (10)

where α = K_L^{−1} y. It can be proved that, for arbitrary hyperparameters θ, if α ≠ 0, (10) defines an implicit function, or a continuously differentiable function, as follows:

    θ = f(α) = [f₁(α) f₂(α) ⋯ f_D(α)]^T.  (11)
Then the two-phase value iteration for the critic network is described by the following theorem.

Theorem 2. Given the following conditions:

(1) the system is bounded-input bounded-output (BIBO) stable,
(2) the immediate reward r is bounded,
(3) for k = 1, 2: η^k_t → 0, Σ_{t=1}^∞ η^k_t = ∞,

the following iteration process is convergent:

    α_{t+1} = α_t,
    θ_{t+1} = θ_t + η¹_t (f(α_t) − θ_t),
    α_{t+2} = α_{t+1} + η²_t k†_{θ_t}(x, X_L) · [r + β V(θ_t, x′) − k_{θ_t}(x, X_L) α_{t+1}],
    θ_{t+2} = θ_{t+1},  (12)

where V(θ_t, x′) = max_a k_{θ_t}(x′, X_L) α(t), and k†_{θ_t}(x, X_L) is a kind of pseudoinverse of k_{θ_t}(x, X_L), that is, k_{θ_t}(x, X_L) k†_{θ_t}(x, X_L) = 1.
Proof. From (12) it is clear that the two phases include the update of the hyperparameters in phase 1, which is viewed as the update of the generalization model, and the update of the samples' values in phase 2, which is viewed as the update of the critic network.

The convergence of the iterative algorithm is proved based on the stochastic approximation Lyapunov method.

Define

    P_t k_{θ_t}(x, X_L) α_t = r + β V(θ_t, x′) = r + β max_a k_{θ_t}(x′, X_L) α(t).  (13)

Equation (12) is rewritten as

    α_{t+1} = α_t,
    θ_{t+1} = θ_t + η¹_t (f(α_t) − θ_t),
    α_{t+2} = α_{t+1} + η²_t k†_{θ_t}(x, X_L) · [P_{t+1} k_{θ_t}(x, X_L) α_{t+1} − k_{θ_t}(x, X_L) α_{t+1}],
    θ_{t+2} = θ_{t+1}.  (14)
Define the approximation errors as

    φ¹_{t+2} = φ¹_{t+1} = θ_{t+1} − f(α_{t+1}) = θ_t − f(α_{t+1}) + η¹_t [f(α_t) − θ_t] = (1 − η¹_t) φ¹_t,  (15)

    φ²_{t+2} = α_{t+2} − α*
             = (α_{t+1} − α*) + η²_t k†_{θ_t}(x, X_L) · [P_{t+1} k_{θ_t}(x, X_L) α_{t+1} − P_{t+1} k_{θ_t}(x, X_L) α* + P_{t+1} k_{θ_t}(x, X_L) α* − k_{θ_t}(x, X_L) α_{t+1}]
             = φ²_{t+1} + η²_t k†_{θ_t}(x, X_L) · [P_{t+1} k_{θ_t}(x, X_L) α_{t+1} − P_{t+1} k_{θ_t}(x, X_L) α*]
               + η²_t k†_{θ_t}(x, X_L) · [P_{t+1} k_{θ_t}(x, X_L) α* − k_{θ_t}(x, X_L) α_{t+1}].  (16)
Further define

    λ₁(x, α_{t+1}, α*) = k†_{θ_t}(x, X_L) · [P_{t+1} k_{θ_t}(x, X_L) α_{t+1} − P_{t+1} k_{θ_t}(x, X_L) α*],
    λ₂(x, α_{t+1}, α*) = k†_{θ_t}(x, X_L) · [P_{t+1} k_{θ_t}(x, X_L) α* − k_{θ_t}(x, X_L) α_{t+1}].  (17)

Thus (16) is reexpressed as

    φ²_{t+2} = φ²_{t+1} + η²_t λ₁(x, α_{t+1}, α*) + η²_t λ₂(x, α_{t+1}, α*).  (18)

Let φ_t = [(φ¹_t)^T (φ²_t)^T]^T; then the two-phase iteration has the shape of a stochastic approximation, that is,

    [φ¹_{t+1}; φ²_{t+1}] = [φ¹_t; φ²_t] + [−η¹_t φ¹_t; 0],
    [φ¹_{t+2}; φ²_{t+2}] = [φ¹_{t+1}; φ²_{t+1}] + [0; η²_t λ₁(x, α_{t+1}, α*) + η²_t λ₂(x, α_{t+1}, α*)].  (19)
Define

    V(φ_t) = φ_t^T Λ φ_t = [(φ¹_t)^T (φ²_t)^T] [ I_D  0 ;  0  k^T_{θ_t}(x*, X_L) k_{θ_t}(x*, X_L) ] [φ¹_t; φ²_t],  (20)

where D is the scale of the hyperparameters and x* will be defined later. Let k*_t represent k^T_{θ_t}(x*, X_L) k_{θ_t}(x*, X_L) for short.

Obviously the matrix in (20) is positive definite, and |x*| < ∞, |k*_t| < ∞. Hence V(φ_t) is a Lyapunov functional. At moment t, the conditional expectation of V is

    E_t V_{t+2}(φ_{t+2}) = E_t[φ^T_{t+2} Λ φ_{t+2}] = E_t{[(φ¹_{t+2})^T (φ²_{t+2})^T] [ I_D  0 ;  0  k*_t ] [φ¹_{t+2}; φ²_{t+2}]}.  (21)
It is easy to compute the first-order Taylor expansion of (21) as

    E_t V_{t+2}(φ_{t+2}) = E_t V_{t+1}(φ_{t+1}) + η²_t E_t[V′_{t+1}(φ_{t+1})(φ_{t+2} − φ_{t+1})]
                           + (η²_t)² C₂ E_t[(φ_{t+2} − φ_{t+1})^T (φ_{t+2} − φ_{t+1})],  (22)

in which the first item on the right of (22) is

    E_t V_{t+1}(φ_{t+1}) = V_t(φ_t) + η¹_t E_t[V′_t(φ_t)(φ_{t+1} − φ_t)]
                           + (η¹_t)² C₁ E_t[(φ_{t+1} − φ_t)^T (φ_{t+1} − φ_t)].  (23)

Substituting (23) into (22) yields

    E_t V_{t+2}(φ_{t+2}) − V_t(φ_t) = η¹_t E_t[V′_t(φ_t) Y¹_t] + η²_t E_t[V′_{t+1}(φ_{t+1}) Y²_{t+1}]
                                      + (η¹_t)² C₁ E_t[(Y¹_t)²] + (η²_t)² C₂ E_t[(Y²_{t+1})²],  (24)

where

    Y¹_t = [−φ¹_t; 0],
    Y²_{t+1} = [0; P_{t+1} k_{θ_t}(x, X_L) α_{t+1} − k_{θ_t}(x, X_L) α_{t+1}] = [0; λ₁(x, α_{t+1}, α*) + λ₂(x, α_{t+1}, α*)].  (25)
Consider the last two items of (24) first. If the immediate reward is bounded and an infinite-horizon discounted reward is applied, the value function is bounded. Hence |α_t| < ∞.

If the system is BIBO stable, there exist Σ₀, Σ₁ ⊂ R^{n+m}, where n and m are the dimensions of the state space and the action space, respectively, such that ∀x₀ ∈ Σ₀, x_∞ ∈ Σ₁; hence the policy space is bounded.

According to Proposition 1, when the policy space is bounded and |α_t| < ∞, |θ_t| < ∞, there exists a constant c₁ so that

    E|V′_{t+1}(φ_{t+1}) Y²_{t+1}| ≤ c₁.  (26)

In addition,

    E|Y²_{t+1}|² = E|[λ₁(x, α_{t+1}, α*) + λ₂(x, α_{t+1}, α*)]²|
                 ≤ 2E|λ₁²(x, α_{t+1}, α*)| + 2E|λ₂²(x, α_{t+1}, α*)|
                 ≤ c₂ φ²_{t+1} + c₃ φ²_{t+1} ≤ c₄ V(φ_{t+1}).  (27)

From (26) and (27) we know that

    E|Y²_{t+1}|² + E|V′_{t+1}(φ_{t+1}) Y²_{t+1}| ≤ C₃ V(φ_{t+1}) + C₃,  (28)

where C₃ = max{c₁, c₄}. Similarly,

    E|Y¹_t|² + E|V′_t(φ_t) Y¹_t| ≤ C₄ V(φ_t) + C₄.  (29)

According to Lemma 5.4.1 in [30], EV(φ_t) < ∞, E|Y¹_t|² < ∞, and E|Y²_{t+1}|² < ∞.
Now we focus on the first two items on the right of (24). The first item, E_t[V′_t(φ_t) Y¹_t], is computed as

    E_t[V′_t(φ_t) Y¹_t] = E_t{[(φ¹_t)^T (φ²_t)^T] [ I_D  0 ;  0  k*_t ] [−φ¹_t; 0]} = −|φ¹_t|².  (30)

For the second item, E_t[V′_{t+1}(φ_{t+1}) Y²_{t+1}], when the state transition function is time invariant, it is true that P_{t+1} k_{θ_t}(x, X_L) α_{t+1} = P_t k_{θ_t}(x, X_L) α_{t+1} and α_{t+1} = α_t. Then we have

    E_t λ₁(x, α_{t+1}, α*) = E_t[k†_{θ_t}(x, X_L) P_t k_{θ_t}(x, X_L) α_t − k†_{θ_t}(x, X_L) P_t k_{θ_t}(x, X_L) α*]
                           = k†_{θ_t}(x, X_L) · E[P_t k_{θ_t}(x, X_L) α_t − P_t k_{θ_t}(x, X_L) α*]
                           ≤ k†_{θ_t}(x, X_L) · ‖E[P_t k_{θ_t}(x, X_L) α_t − P_t k_{θ_t}(x, X_L) α*]‖.  (31)

This inequality holds because k†_{θ_t}(x, X) is positive, k†_{θ_t,i} > 0, i = 1, ..., L. Define the norm ‖Δ(θ, ·)‖ = max_x |Δ(θ, x)|, where |Δ(·)| is the derivative matrix norm of the vector 1, and let x* = arg max_x |Δ(θ, x)|. Then

    ‖E[P_t k_{θ_t}(x, X_L) α_t − P_t k_{θ_t}(x, X_L) α*]‖
    = max_x E_t|P_t k_{θ_t}(x, X_L) α_t − P_t k_{θ_t}(x, X_L) α*|
    = max_x |E_t[(r + β max_{a′} k_{θ_t}(x′, X_L) α_t) − (r + β max_{a′} k_{θ_t}(x′, X_L) α*)]|
    = β max_x |E_t[max_{a′} k_{θ_t}(x′, X_L) α_t − max_{a′} k_{θ_t}(x′, X_L) α*]|
    = β max_x |∫_{s′} p(s′ | x)[max_{a′} k_{θ_t}(x′, X_L) α_t − max_{a′} k_{θ_t}(x′, X_L) α*] ds′|
    ≤ β max_x ∫_{s′} p(s′ | x) |max_{a′} k_{θ_t}(x′, X_L) α_t − max_{a′} k_{θ_t}(x′, X_L) α*| ds′
    ≤ β max_x ∫_{s′} p(s′ | x) max_{a′} |k_{θ_t}(x′, X_L) α_t − k_{θ_t}(x′, X_L) α*| ds′
    = β max_x ∫_{s′} p(s′ | x) max_{a′} |k_{θ_t}(x′, X_L)(α_t − α*)| ds′
    = β max_x ∫_{s′} p(s′ | x) max_{a′} |E_t[P_t k_{θ_t}(x′, X_L) α_t − P_t k_{θ_t}(x′, X_L) α*]| ds′
    ≤ β max_x |k_{θ_t}(x, X_L)(α_t − α*)| = β |k_{θ_t}(x*, X_L) φ²_t|.  (32)

Hence

    E_t λ₁(x, α_{t+1}, α*) ≤ β k†_{θ_t}(x*, X_L) |k_{θ_t}(x*, X_L) φ²_t|.  (33)
On the other hand,

    E_t λ₂(x, α_t, α*) = E_t[k†_{θ_t}(x, X_L) P_t k_{θ_t}(x, X_L) α* − k†_{θ_t}(x, X_L) k_{θ_t}(x, X_L) α_t]
                       = E_t[k†_{θ_t}(x, X_L) P_t k_{θ_t}(x, X_L) α* − k†_{θ_t}(x, X_L) k_{θ_t}(x, X_L) α*]
                         + k†_{θ_t}(x, X_L) k_{θ_t}(x, X_L) α* − k†_{θ_t}(x, X_L) k_{θ_t}(x, X_L) α_t
                       = −k†_{θ_t}(x, X_L)[k_{θ_t}(x, X_L) φ²_t − E_t λ₃(x, θ_t, θ*)],  (34)

where λ₃(x, θ_t, θ*) = P_t k_{θ_t}(x, X_L) α* − k_{θ_t}(x, X_L) α* is the value function error caused by the estimated hyperparameters.
Substituting (33) and (34) into E_t[V′_{t+1}(φ_{t+1}) Y²_{t+1}] yields

    E_t[V′_{t+1}(φ_{t+1}) Y²_{t+1}] = E_t{[k_{θ_t}(x*, X_L) φ²_t]^T k_{θ_t}(x*, X_L)[λ₁(x, α_{t+1}, α*) + λ₂(x, α_{t+1}, α*)]}
    = [k_{θ_t}(x*, X_L) φ²_t]^T k_{θ_t}(x*, X_L)[E_t λ₁(x, α_{t+1}, α*) + E_t λ₂(x, α_{t+1}, α*)]
    ≤ β [k_{θ_t}(x*, X_L) φ²_t]^T k_{θ_t}(x*, X_L) k†_{θ_t}(x*, X_L) |k_{θ_t}(x*, X_L) φ²_t|
      − [k_{θ_t}(x*, X_L) φ²_t]^T k_{θ_t}(x*, X_L) k†_{θ_t}(x*, X_L)[k_{θ_t}(x*, X_L) φ²_t − E_t λ₃(x*, θ_t, θ*)]
    ≤ −(1 − β)|k_{θ_t}(x*, X_L) φ²_t|² + |[k_{θ_t}(x*, X_L) φ²_t]^T E_t λ₃(x*, θ_t, θ*)|.  (35)
Obviously, if the following inequality is satisfied:

    η²_t |[k_{θ_t}(x*, X_L) φ²_t]^T E_t λ₃(x*, θ_t, θ*)| < η¹_t |φ¹_t|² + η²_t (1 − β)|k_{θ_t}(x*, X_L) φ²_t|²,  (36)

there exists a positive constant δ such that the first two items on the right of (24) satisfy

    η¹_t E_t[V′_t(φ_t) Y¹_t] + η²_t E_t[V′_{t+1}(φ_{t+1}) Y²_{t+1}] ≤ −δ.  (37)

According to Theorem 5.4.2 in [30], the iterative process is convergent, namely, φ_t → 0 w.p.1.
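A minimal sketch of one sweep of the iteration (12) may help fix ideas; kernel_vector (the row k_θ(x, X_L)), f_alpha (the map f(α) of (11)), and the coarse action grid are hypothetical placeholders, and the pseudoinverse is realized as k† = k/(k·k) so that k k† = 1, one reading consistent with the definition above.

    import numpy as np

    def two_phase_step(alpha, theta, s, a, r, s_next, X_L, eta1, eta2,
                       kernel_vector, f_alpha, actions, beta=0.95):
        # Phase 1: hyperparameters drift toward the evidence-maximizing value f(alpha)
        theta = theta + eta1 * (f_alpha(alpha) - theta)
        # Phase 2: TD-style update of the weights alpha at the visited pair x = (s, a)
        k_x = kernel_vector(theta, np.append(s, a), X_L)   # k_theta(x, X_L), a row vector
        k_dagger = k_x / (k_x @ k_x)                       # pseudoinverse: k_x @ k_dagger = 1
        v_next = max(kernel_vector(theta, np.append(s_next, u), X_L) @ alpha
                     for u in actions)                     # V(theta, x') = max_a k_theta(x', X_L) alpha
        alpha = alpha + eta2 * k_dagger * (r + beta * v_next - k_x @ alpha)
        return alpha, theta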
Remark 3. Let us check the final convergence position. Obviously, φ_t → 0 w.p.1 implies [α_t; θ_t] → [α*; f(α*)] w.p.1. This means the equilibrium of the critic network meets E_t[P_t K_{θ*}(x, X_L) α*] = K_{θ*}(x, X_L) α*, where θ* = f(α*). And the equilibrium of the hyperparameters, θ* = f(α*), is the solution of the evidence maximization max_θ L(θ), with the log-evidence L(θ) given by (8), where y = K_{θ*}(X_L, X_L) α*.
Remark 4. It is clear that the selection of the samples is one of the key issues. Since all samples now have values obtained by the two-phase iteration, it is convenient, according to an information-based criterion, to evaluate the samples and refine the set, arranging relatively more samples near the equilibrium or where the gradient of the value function is large, so that the sample set better describes the distribution of the value function.
Remark 5. Since the two-phase iteration belongs to value iteration, the initial policy of the algorithm does not need to be stable. To ensure BIBO stability in practice, we need a mechanism to clamp the output of the system, even though the system will then no longer be smooth.
Theorem 2 gives the iteration principle for critic network learning. Based on the critic network, with a proper objective defined, such as minimizing the expected total discounted reward [15], an HDP or DHP update can be applied to optimize the actor network. Since this paper focuses on the ACD and, more importantly, the update process of the actor network does not affect the convergence of the critic network (though it may induce premature or local optimization), the gradient update of the actor is not necessary. Hence a simple optimum seeking is applied to get the optimal actions wrt the sample states, and then an actor network based on Gaussian kernels is generated from these optimal state-action pairs.
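One plausible reading of this optimum seeking, sketched below: search a discretized action set for the action maximizing the critic's Q estimate at every sample state, then regress from states to these best actions with Gaussian kernels (here a Nadaraya-Watson form); all names are illustrative, not the paper's own implementation.

    import numpy as np

    def build_actor(sample_states, q_of, actions, widths):
        # Optimum seeking: the best action wrt the critic at every sample state
        best = np.array([max(actions, key=lambda a: q_of(s, a)) for s in sample_states])

        def actor(s):
            # Gaussian-kernel regression over the optimal state-action pairs
            w = np.array([np.exp(-0.5 * np.sum((s - si) ** 2 / widths))
                          for si in sample_states])
            return w @ best / np.sum(w)

        return actor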
Up to now we have built a critic-actor architecture, which is named Gaussian-kernel-based Approximate Dynamic Programming (GK-ADP for short) and shown in Algorithm 1. A span K is introduced so that the hyperparameters are updated once every K updates of α. If K > 1, the two phases are asynchronous. From the view of stochastic approximation, this asynchronous learning does not change the convergence condition (36) but reduces the computational cost of hyperparameter learning.
4. Simulation and Discussion
In this section we present some numerical simulations of continuous control to illustrate the special properties and the feasibility of the algorithm, including the necessity of two-phase learning, the specific properties compared with traditional kernel-based ACDs, and the performance enhancement resulting from online refinement of the sample set.

Before further discussion, we first give the common setup used in all simulations.
(i) The span K = 50.
(ii) To keep the output bounded, once a state is out of the boundary, the system is reset randomly.
(iii) The exploration-exploitation tradeoff is left out of account here. During the learning process, all actions are selected randomly within the limited action space. Thus the behavior of the system is totally unordered during learning.
(iv) The same ALD-based sparsification as in [15] is used to determine the initial sample set, in which w_i = 1.8, v₀ = 0, and v₁ = 0.5 empirically.
(v) The sampling time and control interval are set to 0.02 s.
4.1. The Necessity of Two-Phase Learning. The proof of Theorem 2 shows that properly determined learning rates of phases 1 and 2 guarantee condition (36). Here we propose a simulation to show how the learning rates of phases 1 and 2 affect the performance of the learning.
Consider a simple single homogeneous inverted pendulum system:

    ṡ₁ = s₂,
    ṡ₂ = (M_p g sin(s₁) − τ cos(s₁)) / (M_p d),  (38)

where s₁ and s₂ represent the angle and the angular speed, M_p = 0.1 kg, g = 9.8 m/s², and d = 0.2 m, respectively, and τ is a horizontal force acting on the pole.
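For reference, a minimal Euler-integration sketch of (38) at the 0.02 s step stated in the common setup; the parameter values follow the text, while the clamp of the force to the action bound is an assumption drawn from the bounded action space used below.

    import numpy as np

    M_P, G, D, DT = 0.1, 9.8, 0.2, 0.02   # mass (kg), gravity (m/s^2), length (m), step (s)

    def pendulum_step(s, tau):
        tau = np.clip(tau, -0.5, 0.5)     # bounded action space, as in this simulation
        s1, s2 = s
        s2_dot = (M_P * G * np.sin(s1) - tau * np.cos(s1)) / (M_P * D)   # Eq. (38)
        return np.array([s1 + DT * s2, s2 + DT * s2_dot])

    state = np.array([0.3, 0.0])          # initial angle (rad) and angular speed (rad/s)
    state = pendulum_step(state, -0.2)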
We test the success rate under different learning rates. Since during the learning the action is always selected randomly, we have to verify the optimal policy after learning; that is, an independent policy test is carried out to test the actor network. Thus the success of one run of learning means that, in the independent test, the pole can be swung up and maintained within [−0.02, 0.02] rad for more than 200 iterations.
In the simulation, 60 state-action pairs are collected to serve as the sample set, and the learning rates are set to η²_t = 0.01/(1 + t)^0.018 and η¹_t = a/(1 + t^0.018), where a varies from 0 to 0.12.

The learning is repeated 50 times in order to get the average performance, where the initial states of each run are randomly selected within the bounds [−0.4, 0.4] and [−0.5, 0.5] wrt the dimensions of s₁ and s₂.
Initialize:
    θ: hyperparameters of the Gaussian kernel model
    X_L = {x_i | x_i = (s_i, a_i)}_{i=1}^{L}: sample set
    π(s₀): initial policy
    η¹_t, η²_t: learning step sizes
Let t = 0
Loop:
    for k = 1 to K do
        t = t + 1
        a_t = π(s_t)
        Get the reward r_t
        Observe the next state s_{t+1}
        Update α according to (12)
        Update the policy π according to optimum seeking
    end for
    Update θ according to (12)
Until the termination criterion is satisfied

Algorithm 1: Gaussian-kernel-based Approximate Dynamic Programming.
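The span K makes phase 1 asynchronous with phase 2. A skeleton of this schedule might look as follows, where env, update_alpha, and update_theta are hypothetical placeholders for the plant interface and the two phases of (12).

    def gk_adp(env, policy, alpha, theta, update_alpha, update_theta,
               K=50, max_iter=10000):
        # Outer loop: theta (hyperparameters) is updated once per K alpha-updates
        s, t = env.reset(), 0
        while t < max_iter:
            for _ in range(K):
                t += 1
                a = policy(s)                    # exploratory action
                s_next, r = env.step(a)
                alpha = update_alpha(alpha, theta, s, a, r, s_next)  # phase 2 of (12)
                s = s_next
            theta = update_theta(alpha, theta)   # phase 1 of (12), every K steps
        return alpha, theta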
[Figure 1: Success rate over 50 runs under different learning rates η¹_t. The success counts over the parameter a ∈ {0, 0.01, 0.02, 0.05, 0.08, 0.1, 0.12} are 6, 17, 27, 42, 50, 47, and 46, respectively.]
The action space is bounded within [−0.5, 0.5]. And in each run the initial hyperparameters are set to [e^{−2.5}, e^{0.5}, e^{0.9}, e^{−1}, e^{0.001}].

Figure 1 shows the resulting success rates. Clearly, with η²_t fixed, different η¹_t's affect the performance significantly. In particular, without the learning of the hyperparameters, that is, η¹_t = 0, there are only 6 successful runs over 50 runs. With the increase of a, the success rate increases, up to 1 when a = 0.08. But as a goes on increasing, the performance becomes worse.

Hence both phases are necessary if the hyperparameters are not initialized properly. It should be noted that the learning rates of the two phases need to be regulated carefully in order to guarantee condition (36), which leads the learning process to the equilibrium in Remark 3 and not to some kind of boundary where the actor always executes the maximal or minimal actions.
[Figure 2: The average evolution processes over 50 runs wrt different η¹_t. The sum of Q values of all sample state-actions is plotted against iteration for no learning and for a = 0.01, 0.02, 0.05, 0.08, 0.1, and 0.12.]
If all samples' values at each iteration are summed up and the cumulative values are plotted against iterations, we obtain Figure 2. The evolution process wrt the best parameter, a = 0.08, is marked by the darker diamonds. It is clear that, even with different η¹_t, the evolution processes of the samples' value learning are similar. That means, due to the BIBO property, the samples' values must finally be bounded and convergent. However, as mentioned above, this does not mean that the learning process converges to the proper equilibrium.
[Figure 3: One-dimensional inverted pendulum, with the pole angle s₁ and the cart displacement s₃ marked.]
4.2. Value Iteration versus Policy Iteration. As mentioned in Algorithm 1, the value iteration in critic network learning does not depend on the convergence of the actor network, and, compared with HDP or DHP, the direct optimum seeking for the actor network seems a little aimless. To test its performance and discuss its special learning characteristics, a comparison between GK-ADP and KHDP is carried out.

The control objective in the simulation is a one-dimensional inverted pendulum, where a single-stage inverted pendulum is mounted on a cart which is able to move linearly, just as Figure 3 shows.
The mathematical model is given as

    ṡ₁ = s₂ + ε,
    ṡ₂ = (g sin(s₁) − M_p d s₂² cos(s₁) sin(s₁)/M) / (d(4/3 − M_p cos²(s₁)/M))
         + (cos(s₁)/M) / (d(4/3 − M_p cos²(s₁)/M)) · u,
    ṡ₃ = s₄,
    ṡ₄ = (u + M_p d(s₂² sin(s₁) − ṡ₂ cos(s₁))) / M,  (39)

where s₁ to s₄ represent the angle, the angular speed, the linear displacement, and the linear speed of the cart, respectively; u represents the force acting on the cart; M = 0.2 kg, d = 0.5 m, ε ∼ U(−0.02, 0.02), and the other denotations are the same as in (38).
Thus the state-action space is a 5D space, much larger than that in simulation 1. The configurations of both algorithms are listed as follows:

(i) A small sample set with 50 samples, determined by ALD-based sparsification, is adopted to build the critic network.
(ii) For both algorithms, the state-action pair is limited to [−0.3, 0.3], [−1, 1], [−0.3, 0.3], [−2, 2], and [−3, 3] wrt s₁ to s₄ and u.
(iii) In GK-ADP, the learning rates are η¹_t = 0.02/(1 + t^0.01) and η²_t = 0.02/(1 + t)^0.02.
[Figure 4: The comparison of the average evolution processes over 100 runs. The average of the cumulated weights (KHDP with σ = 0.9, 1.2, 1.5, 1.8, 2.1, 2.4; left y-axis) and the sum of the sample Q values (GK-ADP; right y-axis) are plotted against iteration.]
(iv) The KHDP algorithm in [15] is chosen as the comparison, where the critic network uses Gaussian kernels. To get proper Gaussian parameters, a series of σ values, 0.9, 1.2, 1.5, 1.8, 2.1, and 2.4, is tested in the simulation. The elements of the weights α are initialized randomly in [−0.5, 0.5], the forgetting factor in RLS-TD(0) is set to μ = 1, and the learning rate in KHDP is set to 0.3.
(v) All experiments run 100 times to get the average performance. And in each run there are 10000 iterations to learn the critic network.

Before further discussion, it should be noted that it makes no sense to argue over which algorithm has the better learning performance, because, besides the debate of policy iteration versus value iteration, there are too many parameters in the simulation configuration affecting learning performance. So the aim of this simulation is to illustrate the learning characteristics of GK-ADP.
However, to make the comparison as fair as possible, we regulate the learning rates of both algorithms to get similar evolution processes. Figure 4 shows the evolution processes of GK-ADP and the KHDPs under different σ, where the y-axis on the left represents the cumulated weights Σᵢ αᵢ of the critic network in kernel ACD, and the other y-axis represents the cumulated values of the samples, Σᵢ Q(xᵢ), xᵢ ∈ X_L, in GK-ADP. It implies that, although with different units, the learning processes under both algorithms converge at nearly the same speed.

Then the success rates of all algorithms are depicted in Figure 5. The left six bars represent the success rates of KHDP with different σ. With the superiority argument left aside, we find that such a fixed σ in KHDP is very similar to the fixed hyperparameters θ, which need to be set properly in order to get a higher success rate. But unfortunately there is no mechanism in KHDP to regulate σ online.
[Figure 5: The success rates wrt all algorithms over 100 runs: 67, 68, 76, 75, 58, and 57 for KHDP with σ = 0.9, 1.2, 1.5, 1.8, 2.1, and 2.4, respectively, and 96 for GK-ADP.]
On the contrary, the two-phase iteration introduces the update of the hyperparameters into the critic network, which is able to drive the hyperparameters to better values even from not-so-good initial values. In fact, this two-phase update can be viewed as a kind of kernel ACD with dynamic σ when a Gaussian kernel is used.

To discuss the performance in depth, we plot the test trajectories resulting from the actors optimized by GK-ADP and KHDP in Figures 6(a) and 6(b), respectively, where the start state is set to [0.2, 0, −0.2, 0].

Apparently the transition time of GK-ADP is much smaller than that of KHDP. We think that, besides possibly well-regulated parameters, an important reason is the nongradient learning of the actor network.

The critic learning depends only on the exploration-exploitation balance, not on the convergence of actor learning. If the exploration-exploitation balance is designed without reference to the actor network output, the learning processes of the actor and critic networks are relatively independent of each other, and then there are alternatives to gradient learning for actor network optimization, for example, the direct optimum seeking Gaussian regression actor network in GK-ADP.

Such direct optimum seeking may result in a nearly nonsmooth actor network, just like the force output depicted in the second plot of Figure 6(a). To explain this phenomenon, we can find the clue in Figure 7, which illustrates the best actions wrt all sample states according to the final actor network. It is reasonable that the force direction is always the same as the pole angle and contrary to the cart displacement. And clearly the transition interval from the negative limit −3 N to the positive 3 N is very short. Thus the actor network resulting from Gaussian kernels intends to control the angle bias as quickly as possible, even though such quick control sometimes makes the output force a little nonsmooth.

It is obvious that GK-ADP is highly efficient but also carries potential risks, such as the impact on actuators. Hence how to design the exploration-exploitation balance to satisfy Theorem 2 and how to design actor network learning to balance efficiency and engineering applicability are two important issues for future work.
Finally, we check the samples' values, which are depicted in Figure 8, where only the dimensions of pole angle and linear displacement are shown. Due to the definition of the immediate reward, the closer a state is to the equilibrium, the smaller its $Q$ value.
If we execute the 96 successful policies one by one and record all final cart displacements and linear speeds, just as Figure 9 shows, a shortcoming of this sample set is uncovered: even with good convergence of the pole angle, the optimal policy can hardly drive the cart back to the zero point.
To solve this problem and improve performance, besides optimizing the learning parameters, Remark 4 implies that another, and maybe more efficient, way is to refine the sample set based on the samples' values. We investigate this in the following subsection.
4.3. Online Adjustment of Sample Set. Since the samples' values are learned, it is possible to assess whether the samples are chosen reasonably. In this simulation we adopt the following expected utility [26]:
$$U(x) = \rho\, h\bigl(E[Q(x) \mid X_L]\bigr) + \frac{\beta}{2}\log\bigl(\operatorname{var}[Q(x) \mid X_L]\bigr) \quad (40)$$
where $h(\cdot)$ is a normalization function over the sample set and $\rho = 1$, $\beta = 0.3$ are coefficients.
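For reference, a sketch of how (40) can be evaluated from the GP posterior mean (6) and variance (7) is given below, again reusing the `gauss_kernel` helper from the first sketch; the min-max form of the normalization $h(\cdot)$ is our assumption, since the text only states that $h$ normalizes over the sample set:

```python
import numpy as np

def expected_utility(Xq, X_L, y, theta, rho=1.0, beta=0.3):
    """U(x) = rho * h(E[Q(x)|X_L]) + (beta/2) * log var[Q(x)|X_L], cf. (40)."""
    K_L = gauss_kernel(X_L, X_L, theta) + theta[-1] * np.eye(len(X_L))   # (4)
    Kq = gauss_kernel(Xq, X_L, theta)
    mean = Kq @ np.linalg.solve(K_L, y)                                  # (6)
    var = theta[-2] - np.sum(Kq * np.linalg.solve(K_L, Kq.T).T, axis=1)  # (7)
    h = (mean - mean.min()) / (mean.max() - mean.min() + 1e-12)  # assumed h(.)
    return rho * h + 0.5 * beta * np.log(np.maximum(var, 1e-12))
```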
Let $N_{\text{add}}$ represent the number of samples added to the sample set, $N_{\text{limit}}$ the size limit of the sample set, and $T(x) = \{x(0), x(1), \ldots, x(T_e)\}$ the state-action pair trajectory during GK-ADP learning, where $T_e$ denotes the final iteration of learning. The algorithm to refine the sample set is presented as Algorithm 2.
Since in simulation 2 we carried out 100 runs of the experiment for GK-ADP, 100 sets of sample values w.r.t. the same sample states were obtained. Now let us apply Algorithm 2 to these resulting sample sets, where $T(x)$ is picked up from the history trajectory of state-action pairs in each run, and $N_{\text{add}} = 10$, $N_{\text{limit}} = 60$, in order to add 10 samples to the sample set.
We repeat all 100 runs of GK-ADP again with the same learning rates $\eta^1$ and $\eta^2$. Compared with the 96 successful runs in simulation 2, 94 runs are now successful in obtaining control policies that swing up the pole. Figure 10(a) shows the average evolution process of the sample values over the 100 runs, where the first 10000 iterations display the learning process of simulation 2 and the following 8000 iterations display the learning process after adding samples. Clearly, the cumulative $Q$ value exhibits a sudden increase after the samples are added and stays almost steady afterwards. This implies that, even with more samples added, there is not too much left to learn about the sample values.
However, if we check the cart movement, we find the enhancement brought by the change of the sample set. For the $j$th run, we have two actor networks resulting from GK-ADP before and after adding samples.
(Figure 6: The outputs of the actor networks resulting from GK-ADP and KHDP: (a) the resulting control performance using GK-ADP; (b) the resulting control performance using KHDP. Each panel plots the pole angle (rad) and speed (rad/s), and the force (N), against time (s).)
(Figure 7: The final best policies w.r.t. all samples; axes: pole angle (rad), linear displacement (m), best action.)
(Figure 8: The $Q$ values w.r.t. all samples, projected onto the dimensions of pole angle ($s_1$) and linear displacement ($s_3$).)
(Figure 9: The final cart displacement (m) and linear speed (m/s) in the tests of the successful policies.)
Using these actors, we carry out the control tests, respectively, and collect the absolute values of the cart displacement, denoted by $S_3^b(j)$ and $S_3^a(j)$, at the moment the pole has been swung up, for 100 instants. We define the enhancement as
$$E(j) = \frac{S_3^b(j) - S_3^a(j)}{S_3^b(j)} \quad (41)$$
Obviously, a positive enhancement indicates that the actor network behaves better after adding samples. All enhancements w.r.t. the 100 runs are illustrated together in Figure 10(b).
for $l = 1$ to the end of $T(x)$
    Calculate $E[Q(x(l)) \mid X_L]$ and $\operatorname{var}[Q(x(l)) \mid X_L]$ using (6) and (7)
    Calculate $U(x(l))$ using (40)
end for
Sort all $U(x(l))$, $l = 1, \ldots, T_e$, in ascending order
Add the first $N_{\text{add}}$ state-action pairs to the sample set to get the extended set $X_{L+N_{\text{add}}}$
if $L + N_{\text{add}} > N_{\text{limit}}$
    Calculate the hyperparameters based on $X_{L+N_{\text{add}}}$
    for $l = 1$ to $L + N_{\text{add}}$
        Calculate $E[Q(x(l)) \mid X_{L+N_{\text{add}}}]$ and $\operatorname{var}[Q(x(l)) \mid X_{L+N_{\text{add}}}]$ using (6) and (7)
        Calculate $U(x(l))$ using (40)
    end for
    Sort all $U(x(l))$, $l = 1, \ldots, L + N_{\text{add}}$, in descending order
    Delete the first $L + N_{\text{add}} - N_{\text{limit}}$ samples to get the refined set $X_{N_{\text{limit}}}$
end if

Algorithm 2: Refinement of the sample set.
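A compact rendering of Algorithm 2 might look as follows, where `utility_fn` wraps the expected-utility sketch above and `fit_hyperparams` stands for the evidence maximization of the two-phase iteration; both callables, like the function names, are our assumptions:

```python
import numpy as np

def refine_sample_set(X_L, traj, theta, utility_fn, fit_hyperparams,
                      n_add=10, n_limit=60):
    """Algorithm 2: extend the sample set from the learning trajectory T(x),
    then trim it back to n_limit after refitting the hyperparameters."""
    u = utility_fn(traj, theta)                    # U(x(l)) via (6), (7), (40)
    X_ext = np.vstack([X_L, traj[np.argsort(u)[:n_add]]])   # ascending order
    if len(X_ext) > n_limit:
        theta = fit_hyperparams(X_ext)             # recompute hyperparameters
        u_ext = utility_fn(X_ext, theta)
        drop = np.argsort(u_ext)[::-1][:len(X_ext) - n_limit]  # descending order
        X_ext = np.delete(X_ext, drop, axis=0)     # refined set X_{N_limit}
    return X_ext, theta
```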
(Figure 10: Learning performance of GK-ADP when the sample set is refined by adding 10 samples: (a) the average evolution process of the sum of $Q$ values of all sample state-actions before and after adding samples; (b) the enhancement of the control performance regarding cart displacement after adding samples.)
As Figure 10(b) shows, almost all cart displacements at the moment the pole is swung up are enhanced to some degree, except for 8 failures. Hence, with a proper principle based on the sample values, the sample set can be refined online in order to enhance performance.
Finally, let us check which state-action pairs are added into the sample set. We put all the added samples over the 100 runs together and depict their values in Figure 11(a), where the values are projected onto the dimensions of pole angle and cart displacement. If only the relationship between the $Q$ values and the pole angle is concerned, just as Figure 11(b) shows, we find that the refinement principle tends to select samples a little away from the equilibrium and near the boundary $-0.4$, due to the ratio between $\rho$ and $\beta$.
5. Conclusions and Future Work
ADP methods are among the most promising research works on RL in continuous environments for constructing learning systems for nonlinear optimal control. This paper presents GK-ADP with two-phase value iteration, which combines the advantages of kernel ACDs and GP-based value iteration.
The theoretical analysis reveals that, with proper learning rates, two-phase iteration is good at making the Gaussian-kernel-based critic network converge to the structure with optimal hyperparameters and approximate all samples' values.
A series of simulations are carried out to verify the necessity of two-phase learning and to illustrate the properties of GK-ADP.
(Figure 11: The $Q$ values w.r.t. all added samples over 100 runs: (a) projected onto the dimensions of pole angle and cart displacement; (b) projected onto the dimension of pole angle.)
Finally, the numerical tests support the viewpoint that the assessment of the samples' values provides a way to refine the sample set online, in order to enhance the performance of the critic-actor architecture during operation.
However, some issues need to be considered in the future. The first is how to guarantee condition (36) during learning, which is currently handled in an empirical way. The second is the balance between exploration and exploitation, which is always an open question but seems more notable here, because a bad exploration-exploitation principle will lead two-phase iteration to failure, not merely to slow convergence.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This work is supported by the National Natural Science Foundation of China under Grants 61473316 and 61202340 and by the International Postdoctoral Exchange Fellowship Program under Grant no. 20140011.
References
[1] R. Sutton and A. Barto, Reinforcement Learning: An Introduction, Adaptive Computation and Machine Learning, MIT Press, Cambridge, Mass, USA, 1998.
[2] F. L. Lewis and D. Vrabie, "Reinforcement learning and adaptive dynamic programming for feedback control," IEEE Circuits and Systems Magazine, vol. 9, no. 3, pp. 32–50, 2009.
[3] F. L. Lewis, D. Vrabie, and K. Vamvoudakis, "Reinforcement learning and feedback control: using natural decision methods to design optimal adaptive controllers," IEEE Control Systems Magazine, vol. 32, no. 6, pp. 76–105, 2012.
[4] H. Zhang, Y. Luo, and D. Liu, "Neural-network-based near-optimal control for a class of discrete-time affine nonlinear systems with control constraints," IEEE Transactions on Neural Networks, vol. 20, no. 9, pp. 1490–1503, 2009.
[5] Y. Huang and D. Liu, "Neural-network-based optimal tracking control scheme for a class of unknown discrete-time nonlinear systems using iterative ADP algorithm," Neurocomputing, vol. 125, pp. 46–56, 2014.
[6] X. Xu, L. Zuo, and Z. Huang, "Reinforcement learning algorithms with function approximation: recent advances and applications," Information Sciences, vol. 261, pp. 1–31, 2014.
[7] A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, "Discrete-time nonlinear HJB solution using approximate dynamic programming: convergence proof," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 38, no. 4, pp. 943–949, 2008.
[8] S. Ferrari, J. E. Steck, and R. Chandramohan, "Adaptive feedback control by constrained approximate dynamic programming," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 38, no. 4, pp. 982–987, 2008.
[9] F.-Y. Wang, H. Zhang, and D. Liu, "Adaptive dynamic programming: an introduction," IEEE Computational Intelligence Magazine, vol. 4, no. 2, pp. 39–47, 2009.
[10] D. Wang, D. Liu, Q. Wei, D. Zhao, and N. Jin, "Optimal control of unknown nonaffine nonlinear discrete-time systems based on adaptive dynamic programming," Automatica, vol. 48, no. 8, pp. 1825–1832, 2012.
[11] D. Vrabie and F. Lewis, "Neural network approach to continuous-time direct adaptive optimal control for partially unknown nonlinear systems," Neural Networks, vol. 22, no. 3, pp. 237–246, 2009.
[12] D. Liu, D. Wang, and X. Yang, "An iterative adaptive dynamic programming algorithm for optimal control of unknown discrete-time nonlinear systems with constrained inputs," Information Sciences, vol. 220, pp. 331–342, 2013.
[13] N. Jin, D. Liu, T. Huang, and Z. Pang, "Discrete-time adaptive dynamic programming using wavelet basis function neural networks," in Proceedings of the IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning, pp. 135–142, Honolulu, Hawaii, USA, April 2007.
[14] P. Koprinkova-Hristova, M. Oubbati, and G. Palm, "Adaptive critic design with echo state network," in Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (SMC '10), pp. 1010–1015, IEEE, Istanbul, Turkey, October 2010.
[15] X. Xu, Z. Hou, C. Lian, and H. He, "Online learning control using adaptive critic designs with sparse kernel machines," IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 5, pp. 762–775, 2013.
[16] V. Vapnik, Statistical Learning Theory, Wiley, New York, NY, USA, 1998.
[17] C. E. Rasmussen and C. K. Williams, Gaussian Processes for Machine Learning, Adaptive Computation and Machine Learning, MIT Press, Cambridge, Mass, USA, 2006.
[18] Y. Engel, S. Mannor, and R. Meir, "The kernel recursive least-squares algorithm," IEEE Transactions on Signal Processing, vol. 52, no. 8, pp. 2275–2285, 2004.
[19] T. G. Dietterich and X. Wang, "Batch value function approximation via support vectors," in Advances in Neural Information Processing Systems, pp. 1491–1498, MIT Press, Cambridge, Mass, USA, 2002.
[20] X. Wang, X. Tian, Y. Cheng, and J. Yi, "Q-learning system based on cooperative least squares support vector machine," Acta Automatica Sinica, vol. 35, no. 2, pp. 214–219, 2009.
[21] T. Hofmann, B. Schölkopf, and A. J. Smola, "Kernel methods in machine learning," The Annals of Statistics, vol. 36, no. 3, pp. 1171–1220, 2008.
[22] M. F. Huber, "Recursive Gaussian process: on-line regression and learning," Pattern Recognition Letters, vol. 45, no. 1, pp. 85–91, 2014.
[23] Y. Engel, S. Mannor, and R. Meir, "Bayes meets Bellman: the Gaussian process approach to temporal difference learning," in Proceedings of the 20th International Conference on Machine Learning, pp. 154–161, August 2003.
[24] Y. Engel, S. Mannor, and R. Meir, "Reinforcement learning with Gaussian processes," in Proceedings of the 22nd International Conference on Machine Learning, pp. 201–208, ACM, August 2005.
[25] C. E. Rasmussen and M. Kuss, "Gaussian processes in reinforcement learning," in Advances in Neural Information Processing Systems, S. Thrun, L. K. Saul, and B. Schölkopf, Eds., pp. 751–759, MIT Press, Cambridge, Mass, USA, 2004.
[26] M. P. Deisenroth, C. E. Rasmussen, and J. Peters, "Gaussian process dynamic programming," Neurocomputing, vol. 72, no. 7–9, pp. 1508–1524, 2009.
[27] X. Xu, C. Lian, L. Zuo, and H. He, "Kernel-based approximate dynamic programming for real-time online learning control: an experimental study," IEEE Transactions on Control Systems Technology, vol. 22, no. 1, pp. 146–156, 2014.
[28] A. Girard, C. E. Rasmussen, and R. Murray-Smith, "Gaussian process priors with uncertain inputs: multiple-step-ahead prediction," Tech. Rep., 2002.
[29] X. Xu, T. Xie, D. Hu, and X. Lu, "Kernel least-squares temporal difference learning," International Journal of Information Technology, vol. 11, no. 9, pp. 54–63, 2005.
[30] H. J. Kushner and G. G. Yin, Stochastic Approximation and Recursive Algorithms and Applications, vol. 35 of Applications of Mathematics, Springer, Berlin, Germany, 2nd edition, 2003.
Before further discussion it should be noted that it makesno sense to figure out ourselves with which algorithm isbetter in learning performance because besides the debateof policy iteration versus value iteration there are too manyparameters in the simulation configuration affecting learningperformance So the aim of this simulation is to illustrate thelearning characters of GK-ADP
However to make the comparison as fair as possible weregulate the learning rates of both algorithms to get similarevolution processes Figure 4 shows the evolution processesof GK-ADP and KHDPs under different 120590 where the 119910-axisin the left represents the cumulatedweightssum
119894120572119894 of the critic
network in kernel ACD and the other 119910-axis represents thecumulated values of the samples sum
119894119876(119909119894) 119909119894isin 119883119871 in GK-
ADP It implies although with different units the learningprocesses under both learning algorithms converge nearly atthe same speed
Then the success rates of all algorithms are depictedin Figure 5 The left six bars represent the success rates ofKHDP with different 120590 With superiority argument left asidewe find that such fixed 120590 in KHDP is very similar to thefixed hyperparameters 120579 which needs to be set properlyin order to get higher success rate But unfortunately thereis no mechanism in KHDP to regulate 120590 online On the
10 Mathematical Problems in Engineering
09 12 15 18 21 240
01
02
03
04
05
06
07
08
09
1
Parameter 120590 in ALD-based sparsification
Succ
ess r
ate o
ver 1
00 ru
ns
67 6876 75
58 57
96
GK-ADP
Figure 5 The success rates wrt all algorithms over 100 runs
contrary the two-phase iteration introduces the update ofhyperparameters into critic network which is able to drivehyperparameters to better values even with the not so goodinitial values In fact this two-phase update can be viewed asa kind of kernel ACD with dynamic 120590 when Gaussian kernelis used
To discuss the performance in deep we plot the testtrajectories resulting from the actors which are optimized byGK-ADP and KHDP in Figures 6(a) and 6(b) respectivelywhere the start state is set to [02 0 minus02 0]
Apparently the transition time of GK-ADP is muchsmaller than KHDP We think besides the possible wellregulated parameters an important reason is nongradientlearning for actor network
The critic learning only depends on exploration-exploita-tion balance but not the convergence of actor learning Ifexploration-exploitation balance is designed without actornetwork output the learning processes of actor and criticnetworks are relatively independent of each other and thenthere are alternatives to gradient learning for actor networkoptimization for example the direct optimum seeking Gaus-sian regression actor network in GK-ADP
Such direct optimum seeking may result in nearly non-smooth actor network just like the force output depicted inthe second plot of Figure 6(a) To explain this phenomenonwe can find the clue from Figure 7 which illustrates the bestactions wrt all sample states according to the final actornetwork It is reasonable that the force direction is always thesame to the pole angle and contrary to the cart displacementAnd clearly the transition interval from negative limit minus3119873 topositive 3119873 is very short Thus the output of actor networkresulting from Gaussian kernels intends to control anglebias as quickly as possible even though such quick controlsometimes makes the output force a little nonsmooth
It is obvious that GK-ADP achieves high efficiency but also carries potential risks, such as the impact on actuators. Hence, how to design the exploration-exploitation balance to satisfy Theorem 2 and how to design actor network learning to balance efficiency and engineering applicability are two important issues for future work.
Finally, we check the samples' values, which are depicted in Figure 8, where only the dimensions of pole angle and linear displacement are shown. Due to the definition of the immediate reward, the closer the states are to the equilibrium, the smaller the Q value is.
If we execute the 96 successful policies one by one and record all final cart displacements and linear speeds, just as Figure 9 shows, a shortage of this sample set is uncovered: even with good convergence of the pole angle, the optimal policy can hardly drive the cart back to the zero point.
To solve this problem and improve performance, besides optimizing the learning parameters, Remark 4 implies that another and maybe more efficient way is to refine the sample set based on the samples' values. We investigate this in the following subsection.
4.3. Online Adjustment of Sample Set. Since the samples' values are learned, it is possible to assess whether the samples are chosen reasonably. In this simulation, we adopt the following expected utility [26]:
$$U(x) = \rho\, h\left(E[Q(x) \mid X_L]\right) + \frac{\beta}{2}\,\log\left(\mathrm{var}[Q(x) \mid X_L]\right), \tag{40}$$
where $h(\cdot)$ is a normalization function over the sample set and $\rho = 1$, $\beta = 0.3$ are coefficients.
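A minimal sketch of (40) follows, assuming the posterior mean and variance from (6) and (7) have already been evaluated over the candidate points; min-max scaling is our assumption for the unspecified normalization h(·).

import numpy as np

def expected_utility(q_mean, q_var, rho=1.0, beta=0.3, eps=1e-12):
    # U(x) = rho * h(E[Q(x)|X_L]) + (beta/2) * log(var[Q(x)|X_L])  -- eq. (40)
    # h(.): min-max normalization over the evaluated set (our assumption).
    q_mean = np.asarray(q_mean, dtype=float)
    q_var = np.asarray(q_var, dtype=float)
    span = max(q_mean.max() - q_mean.min(), eps)
    h = (q_mean - q_mean.min()) / span
    return rho * h + 0.5 * beta * np.log(np.maximum(q_var, eps))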
Let $N_{\mathrm{add}}$ represent the number of samples added to the sample set, $N_{\mathrm{limit}}$ the size limit of the sample set, and $T(x) = \{x(0), x(1), \ldots, x(T_e)\}$ the state-action pair trajectory during GK-ADP learning, where $T_e$ denotes the end iteration of learning. The algorithm to refine the sample set is presented as Algorithm 2.
Since in simulation 2 we carried out 100 runs of the experiment for GK-ADP, 100 sets of sample values w.r.t. the same sample states were obtained. Now let us apply Algorithm 2 to these resulting sample sets, where $T(x)$ is picked up from the history trajectory of state-action pairs in each run and $N_{\mathrm{add}} = 10$, $N_{\mathrm{limit}} = 60$, in order to add 10 samples into the sample set.
We repeat all 100 runs of GK-ADP again with the same learning rates $\eta^1$ and $\eta^2$. Compared with the 96 successful runs in simulation 2, 94 runs now succeed in obtaining control policies that swing up the pole. Figure 10(a) shows the average evolution process of the sample values over the 100 runs, where the first 10000 iterations display the learning process in simulation 2 and the following 8000 iterations display the learning process after adding samples. Clearly, the cumulative Q value exhibits a sudden increase right after the samples are added and keeps almost steady afterwards. This implies that, even with more samples added, there is not too much left to learn about the sample values.
However, if we check the cart movement, we will find the enhancement brought by the change of the sample set. For the $j$th run, we have two actor networks resulting from GK-ADP, before and after adding samples. Using these actors, we carry out the control tests, respectively, and collect the absolute values of the cart displacement, denoted by $S^b_3(j)$ and $S^a_3(j)$, at the moment the pole has been kept swung up for 100 instants.
Figure 6: The outputs of the actor networks resulting from GK-ADP and KHDP. (a) The resulting control performance using GK-ADP; (b) the resulting control performance using KHDP. Each panel plots the pole angle (rad), the angle speed (rad/s), and the force (N) against time (s).
Figure 7: The final best policies w.r.t. all samples (best action versus pole angle (rad) and linear displacement (m)).
Figure 8: The Q values projected onto the dimensions of pole angle ($s_1$) and linear displacement ($s_3$) w.r.t. all samples.
Figure 9: The final cart displacement (m) and linear speed (m/s) in the test of each successful policy.
We define the enhancement as

$$E(j) = \frac{S^b_3(j) - S^a_3(j)}{S^b_3(j)}. \tag{41}$$
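In code, (41) is a one-line relative improvement; disp_before and disp_after are hypothetical arrays holding $S^b_3(j)$ and $S^a_3(j)$ over the runs:

import numpy as np

def enhancement(disp_before, disp_after, eps=1e-12):
    # E(j) = (S_3^b(j) - S_3^a(j)) / S_3^b(j); positive values mean the actor
    # trained after refining the sample set leaves a smaller displacement.
    b = np.asarray(disp_before, dtype=float)
    a = np.asarray(disp_after, dtype=float)
    return (b - a) / np.maximum(b, eps)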
Obviously, a positive enhancement indicates that the actor network obtained after adding samples behaves better.
for l = 1 to the end of T(x)
    Calculate E[Q(x(l)) | X_L] and var[Q(x(l)) | X_L] using (6) and (7)
    Calculate U(x(l)) using (40)
end for
Sort all U(x(l)), l = 1, ..., T_e, in ascending order
Add the first N_add state-action pairs to the sample set to get the extended set X_{L+N_add}
if L + N_add > N_limit
    Calculate hyperparameters based on X_{L+N_add}
    for l = 1 to L + N_add
        Calculate E[Q(x(l)) | X_{L+N_add}] and var[Q(x(l)) | X_{L+N_add}] using (6) and (7)
        Calculate U(x(l)) using (40)
    end for
    Sort all U(x(l)), l = 1, ..., L + N_add, in descending order
    Delete the first L + N_add − N_limit samples to get the refined set X_{N_limit}
end if

Algorithm 2: Refinement of sample set.
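A compact sketch of Algorithm 2 is given below, reusing the expected_utility helper above. gp_posterior and retrain_hyperparameters are assumed callables: the former evaluates (6) and (7) for a given sample set, and the latter performs the evidence-maximization step of phase 1.

import numpy as np

def refine_sample_set(X_L, trajectory, gp_posterior, retrain_hyperparameters,
                      n_add=10, n_limit=60, rho=1.0, beta=0.3):
    X_L = np.asarray(X_L, dtype=float)
    trajectory = np.asarray(trajectory, dtype=float)

    # Score the trajectory points and add the n_add lowest-utility ones.
    m, v = gp_posterior(X_L, trajectory)        # E[Q|X_L], var[Q|X_L], eqs. (6)-(7)
    u = expected_utility(m, v, rho, beta)       # eq. (40)
    add_idx = np.argsort(u)[:n_add]             # ascending order, take first n_add
    X_ext = np.vstack([X_L, trajectory[add_idx]])

    # If the extended set exceeds the limit: re-optimize the hyperparameters,
    # re-score every sample, and drop the highest-utility ones (keeping the
    # n_limit smallest, equivalent to the descending sort and delete step).
    if len(X_ext) > n_limit:
        retrain_hyperparameters(X_ext)
        m, v = gp_posterior(X_ext, X_ext)
        u = expected_utility(m, v, rho, beta)
        X_ext = X_ext[np.argsort(u)[:n_limit]]
    return X_ext

With N_add = 10 and N_limit = 60 as in the text, each call grows the 50-sample set by ten points, so the deletion branch is not triggered here.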
Figure 10: Learning performance of GK-ADP when the sample set is refined by adding 10 samples. (a) The average evolution process of the sum of Q values of all sample state-action pairs before and after adding samples (18000 iterations in total); (b) the enhancement of the control performance regarding cart displacement after adding samples, per run.
When all the enhancements w.r.t. the 100 runs are illustrated together, just as Figure 10(b) shows, almost all cart displacements at the moment the pole is swung up are enhanced more or less, except for 8 failures. Hence, with a proper principle based on the sample values, the sample set can be refined online in order to enhance performance.
Finally, let us check which state-action pairs are added into the sample set. We put all the added samples over the 100 runs together and depict their values in Figure 11(a), where the values are projected onto the dimensions of pole angle and cart displacement. If we only consider the relationship between the Q values and the pole angle, just as Figure 11(b) shows, we find that the refinement principle intends to select samples a little away from the equilibrium and near the boundary −0.4, due to the ratio between $\rho$ and $\beta$.
5 Conclusions and Future Work
ADP methods are among the most promising research directions of RL in continuous environments for constructing learning systems for nonlinear optimal control. This paper presents GK-ADP with two-phase value iteration, which combines the advantages of kernel ACDs and GP-based value iteration.
The theoretical analysis reveals that, with proper learning rates, two-phase iteration makes the Gaussian-kernel-based critic network converge to the structure with optimal hyperparameters and approximate all samples' values.
A series of simulations are carried out to verify the necessity of two-phase learning and to illustrate the properties of GK-ADP. Finally, the numerical tests support the viewpoint that the assessment of samples' values provides a way to refine the sample set online, in order to enhance the performance of the critic-actor architecture during operation.
Figure 11: The Q values w.r.t. all added samples over 100 runs. (a) Projected onto the dimensions of pole angle (rad) and cart displacement (m); (b) projected onto the dimension of pole angle (rad).
However, there are some issues to be addressed in the future. The first is how to guarantee condition (36) during learning, which is currently handled in empirical ways. The second is the balance between exploration and exploitation, which is always an open question but seems more critical here, because a bad exploration-exploitation principle will lead two-phase iteration to failure, not merely to slow convergence.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This work is supported by the National Natural Science Foundation of China under Grants 61473316 and 61202340 and the International Postdoctoral Exchange Fellowship Program under Grant no. 20140011.
References
[1] R. Sutton and A. Barto, Reinforcement Learning: An Introduction, Adaptive Computation and Machine Learning, MIT Press, Cambridge, Mass, USA, 1998.
[2] F. L. Lewis and D. Vrabie, "Reinforcement learning and adaptive dynamic programming for feedback control," IEEE Circuits and Systems Magazine, vol. 9, no. 3, pp. 32–50, 2009.
[3] F. L. Lewis, D. Vrabie, and K. Vamvoudakis, "Reinforcement learning and feedback control: using natural decision methods to design optimal adaptive controllers," IEEE Control Systems Magazine, vol. 32, no. 6, pp. 76–105, 2012.
[4] H. Zhang, Y. Luo, and D. Liu, "Neural-network-based near-optimal control for a class of discrete-time affine nonlinear systems with control constraints," IEEE Transactions on Neural Networks, vol. 20, no. 9, pp. 1490–1503, 2009.
[5] Y. Huang and D. Liu, "Neural-network-based optimal tracking control scheme for a class of unknown discrete-time nonlinear systems using iterative ADP algorithm," Neurocomputing, vol. 125, pp. 46–56, 2014.
[6] X. Xu, L. Zuo, and Z. Huang, "Reinforcement learning algorithms with function approximation: recent advances and applications," Information Sciences, vol. 261, pp. 1–31, 2014.
[7] A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, "Discrete-time nonlinear HJB solution using approximate dynamic programming: convergence proof," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 38, no. 4, pp. 943–949, 2008.
[8] S. Ferrari, J. E. Steck, and R. Chandramohan, "Adaptive feedback control by constrained approximate dynamic programming," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 38, no. 4, pp. 982–987, 2008.
[9] F.-Y. Wang, H. Zhang, and D. Liu, "Adaptive dynamic programming: an introduction," IEEE Computational Intelligence Magazine, vol. 4, no. 2, pp. 39–47, 2009.
[10] D. Wang, D. Liu, Q. Wei, D. Zhao, and N. Jin, "Optimal control of unknown nonaffine nonlinear discrete-time systems based on adaptive dynamic programming," Automatica, vol. 48, no. 8, pp. 1825–1832, 2012.
[11] D. Vrabie and F. Lewis, "Neural network approach to continuous-time direct adaptive optimal control for partially unknown nonlinear systems," Neural Networks, vol. 22, no. 3, pp. 237–246, 2009.
[12] D. Liu, D. Wang, and X. Yang, "An iterative adaptive dynamic programming algorithm for optimal control of unknown discrete-time nonlinear systems with constrained inputs," Information Sciences, vol. 220, pp. 331–342, 2013.
[13] N. Jin, D. Liu, T. Huang, and Z. Pang, "Discrete-time adaptive dynamic programming using wavelet basis function neural networks," in Proceedings of the IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning, pp. 135–142, Honolulu, Hawaii, USA, April 2007.
[14] P. Koprinkova-Hristova, M. Oubbati, and G. Palm, "Adaptive critic design with echo state network," in Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (SMC '10), pp. 1010–1015, IEEE, Istanbul, Turkey, October 2010.
[15] X. Xu, Z. Hou, C. Lian, and H. He, "Online learning control using adaptive critic designs with sparse kernel machines," IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 5, pp. 762–775, 2013.
[16] V. Vapnik, Statistical Learning Theory, Wiley, New York, NY, USA, 1998.
[17] C. E. Rasmussen and C. K. Williams, Gaussian Processes for Machine Learning, Adaptive Computation and Machine Learning, MIT Press, Cambridge, Mass, USA, 2006.
[18] Y. Engel, S. Mannor, and R. Meir, "The kernel recursive least-squares algorithm," IEEE Transactions on Signal Processing, vol. 52, no. 8, pp. 2275–2285, 2004.
[19] T. G. Dietterich and X. Wang, "Batch value function approximation via support vectors," in Advances in Neural Information Processing Systems, pp. 1491–1498, MIT Press, Cambridge, Mass, USA, 2002.
[20] X. Wang, X. Tian, Y. Cheng, and J. Yi, "Q-learning system based on cooperative least squares support vector machine," Acta Automatica Sinica, vol. 35, no. 2, pp. 214–219, 2009.
[21] T. Hofmann, B. Schölkopf, and A. J. Smola, "Kernel methods in machine learning," The Annals of Statistics, vol. 36, no. 3, pp. 1171–1220, 2008.
[22] M. F. Huber, "Recursive Gaussian process: on-line regression and learning," Pattern Recognition Letters, vol. 45, no. 1, pp. 85–91, 2014.
[23] Y. Engel, S. Mannor, and R. Meir, "Bayes meets Bellman: the Gaussian process approach to temporal difference learning," in Proceedings of the 20th International Conference on Machine Learning, pp. 154–161, August 2003.
[24] Y. Engel, S. Mannor, and R. Meir, "Reinforcement learning with Gaussian processes," in Proceedings of the 22nd International Conference on Machine Learning, pp. 201–208, ACM, August 2005.
[25] C. E. Rasmussen and M. Kuss, "Gaussian processes in reinforcement learning," in Advances in Neural Information Processing Systems, S. Thrun, L. K. Saul, and B. Schölkopf, Eds., pp. 751–759, MIT Press, Cambridge, Mass, USA, 2004.
[26] M. P. Deisenroth, C. E. Rasmussen, and J. Peters, "Gaussian process dynamic programming," Neurocomputing, vol. 72, no. 7–9, pp. 1508–1524, 2009.
[27] X. Xu, C. Lian, L. Zuo, and H. He, "Kernel-based approximate dynamic programming for real-time online learning control: an experimental study," IEEE Transactions on Control Systems Technology, vol. 22, no. 1, pp. 146–156, 2014.
[28] A. Girard, C. E. Rasmussen, and R. Murray-Smith, "Gaussian process priors with uncertain inputs: multiple-step-ahead prediction," Tech. Rep., 2002.
[29] X. Xu, T. Xie, D. Hu, and X. Lu, "Kernel least-squares temporal difference learning," International Journal of Information Technology, vol. 11, no. 9, pp. 54–63, 2005.
[30] H. J. Kushner and G. G. Yin, Stochastic Approximation and Recursive Algorithms and Applications, vol. 35 of Applications of Mathematics, Springer, Berlin, Germany, 2nd edition, 2003.
4 Mathematical Problems in Engineering
It requires the calculation of the derivative of 119871(120579) wrt each120579119894 given by
120597119871 (120579)
120597120579119894
= minus12tr [119870minus1119871
120597119870119871
120597120579119894
]+12120572119879 120597119870119871
120597120579119894
120572 (9)
where tr[sdot] denotes the trace 120572 = 119870minus1119871
119910Consider Condition 2 For unknown system if 119870
119871is
known that is the hyperparameters are known (note thatit is indeed Condition 1) the update of critic network willbe transferred to the update of values wrt samples by usingvalue iteration
According to the analysis both conditions are interde-pendent That means the update of critic depends on knownhyperparameters and the optimization of hyperparametersdepends on accurate sample values
Hence we need a comprehensive iteration method torealize value approximation and optimization of hyperpa-rameters simultaneously A direct way is to update themalternately Unfortunately this way is not reasonable becausethese two processes are tightly coupled For example tem-poral differential errors drive value approximation but thechange of weights 120572will cause the change of hyperparameters120579 simultaneously then it is difficult to tell whether thistemporal differential error is induced by observation or byGaussian regression model changing
To solve this problem a kind of two-phase value iterationfor critic network is presented in the next section and theconditions of convergence are analyzed
3 Two-Phase Value Iteration for CriticNetwork Approximation
First a proposition is given to describe the relationshipbetween hyperparameters and the sample value function
Proposition 1 The hyperparameters are optimized by evi-dence maximization according to the samples and their Qvalues and the log-evidence is given by
120597119871 (120579)
120597120579119894
= minus12tr [119870minus1119871
120597119870119871
120597120579119894
]+12120572119879 120597119870119871
120597120579119894
120572119879= 0
119894 = 1 119863
(10)
where 120572 = 119870minus1119871
119910 It can be proved that for arbitrary hyperpa-rameters 120579 if 120572 = 0 (10) defines an implicit function or acontinuously differentiable function as follows
120579 = 119891 (120572) = [1198911 (120572) 1198912 (120572) sdot sdot sdot 119891119863 (120572)]
119879
(11)
Then the two-phase value iteration for critic network isdescribed as the following theorem
Theorem 2 Given the following conditions
(1) the system is boundary input and boundary output(BIBO) stable
(2) the immediate reward 119903 is bounded(3) for 119896 = 1 2 120578119896
119905rarr 0 suminfin
119905=1 120578119896
119905= infin
the following iteration process is convergent
120572119905+1 = 120572
119905
120579119905+1 = 120579
119905+ 1205781
119905(119891 (120572119905) minus 120579119905)
120572119905+2 = 120572
119905+1 + 1205782
119905119896alefsym
120579119905
(119909119883119871)
sdot [119903 + 120573119881 (120579119905 1199091015840) minus 119896120579119905
(119909119883119871) 120572119905+1]
120579119905+2 = 120579
119905+1
(12)
where 119881(120579119905 1199091015840) = max
119886119896120579119905
(1199091015840 119883119871)120572(119905) and 119896
alefsym
120579119905
(119909 119883119871)
is a kind of pseudoinversion of 119896120579119905
(119909 119883119871) that is
119896120579119905
(119909 119883119871)119896alefsym
120579119905
(119909 119883119871) = 1
Proof From (12) it is clear that the two phases include theupdate of hyperparameters in phase 1 which is viewed as theupdate of generalization model and the update of samplesrsquovalue in phase 2 which is viewed as the update of criticnetwork
The convergence of iterative algorithm is proved based onstochastic approximation Lyapunov method
Define that
119875119905119896120579119905
(119909119883119871) 120572119905= 119903 +120573119881 (120579
119905 1199091015840)
= 119903 +max119886
119896120579119905
(1199091015840 119883119871) 120572 (119905)
(13)
Equation (12) is rewritten as
120572119905+1 = 120572
119905
120579119905+1 = 120579
119905+ 120578
1119905(119891 (120572119905) minus 120579119905)
120572119905+2 = 120572
119905+1 + 1205782119905119896alefsym
120579119905
(119909119883119871)
sdot [119875119905+1119896120579
119905
(119909119883119871) 120572119905+1 minus 119896
120579119905
(119909119883119871) 120572119905+1]
120579119905+2 = 120579
119905+1
(14)
Define approximation errors as
1205931119905+2 = 120593
1119905+1 = 120579
119905+1 minus119891 (120572119905+1) = 120579
119905minus119891 (120572
119905+1)
+ 1205781119905[119891 (120572119905) minus 120579119905] = (1minus 120578
1119905) 120593
1119905
(15)
1205932119905+2 = (120572
119905+2 minus120572lowast) = (120572
119905+1 minus120572lowast) + 120578
2119905119896alefsym
120579119905
(119909119883119871)
sdot [119875119905+1119896120579
119905
(119909 119883119871) 120572119905+1
minus119875119905+1119896120579
119905
(119909119883119871) 120572lowast+119875119905+1119896120579
119905
(119909119883119871) 120572lowast
minus 119896120579119905
(119909119883119871) 120572119905+1] = 120593
1119905+1 + 120578
2119905119896alefsym
120579119905
(119909119883119871)
sdot [119875119905+1119896120579
119905
(119909 119883119871) 120572119905+1
Mathematical Problems in Engineering 5
minus119875119905+1119896120579
119905
(119909119883119871) 120572lowast] + 120578
2119905119896alefsym
120579119905
(119909119883119871)
sdot [119875119905+1119896120579
119905
(119909119883119871) 120572lowast
minus 119896120579119905
(119909119883119871) 120572119905+1]
(16)
Further define that
1205821 (119909 120572119905+1 120572lowast) = 119896alefsym
120579119905
(119909119883119871)
sdot [119875119905+1119896120579
119905
(119909119883119871) 120572119905+1 minus119875
119905+1119896120579119905
(119909119883119871) 120572lowast]
1205822 (119909 120572119905+1 120572lowast) = 119896alefsym
120579119905
(119909119883119871)
sdot [119875119905+1119896120579
119905
(119909119883119871) 120572lowastminus 119896120579119905
(119909119883119871) 120572119905+1]
(17)
Thus (16) is reexpressed as
1205932119905+2 = 120593
1119905+1 + 120578
21199051205821 (119909 120572119905+1 120572
lowast) + 120578
21199051205822 (119909 120572119905+1 120572
lowast) (18)
Let 120593119905= [(120593
1119905)119879
(1205932119905)119879]119879
and the two-phase iteration isin the shape of stochastic approximation that is
[1205931119905+1
1205932119905+1
] = [1205931119905
1205932119905
]+[minus120578
11199051205931119905
0]
[1205931119905+2
1205932119905+2
]
= [1205931119905+1
1205932119905+1
]
+[0
12057821199051205821 (119909 120572119905+1 120572
lowast) + 120578
21199051205822 (119909 120572119905+1 120572
lowast)]
(19)
Define119881 (120593119905)
= 120593119879
119905Λ120593119905
= [(1205931119905)119879
(1205932119905)119879
] [119868119863
0
0 119896119879
120579119905
(119909lowast 119883119871) 119896120579119905
(119909lowast 119883119871)] [
1205931119905
1205932119905
]
(20)
where 119863 is the scale of the hyperparameters and 119909lowast will
be defined later Let 119896lowast119905represent 119896119879
120579119905
(119909lowast 119883119871)119896120579119905
(119909lowast 119883119871) for
shortObviously (20) is a positive definitematrix and |119909
lowast| lt infin
|119896lowast
119905| lt infin Hence119881(120593
119905) is a Lyapunov functional At moment
119905 the conditional expectation of the positive definite matrixis
119864119905119881119905+2 (120593119905+2)
= 119864119905[120593119879
119905+2Λ120593119905+2]
= 119864119905[(120593
1119905+2)119879
(1205932119905+2)119879
] [119868119863
0
0 119896lowast
119905
][1205931119905+2
1205932119905+2
]
(21)
It is easy to compute the first-order Taylor expansion of(21) as
119864119905119881119905+2 (120593119905+2)
= 119864119905119881119905+1 (120593119905+1) + 120578
2119905119864119905[1198811015840
119905+1 (120593119905+1) (120593119905+2 minus120593119905+1)]
+ (1205782119905)21198622119864119905 [(120593119905+2 minus120593
119905+1)119879(120593119905+2 minus120593
119905+1)]
(22)
in which the first item on the right of (22) is
119864119905119881119905+1 (120593119905+1)
= 119881119905(120593119905) + 120578
1119905119864119905[1198811015840
119905(120593119905) (120593119905+1 minus120593
119905)]
+ (1205781119905)21198621119864119905 [(120593119905+1 minus120593
119905)119879(120593119905+1 minus120593
119905)]
(23)
Substituting (23) into (22) yields
119864119905119881119905+2 (120593119905+2) minus119881
119905(120593119905)
= 1205781119905119864119905[1198811015840
119905(120593119905) 119884
1119905] + 120578
2119905119864119905[1198811015840
119905+1 (120593119905+1) 1198842119905+1]
+ (1205781119905)21198621119864119905 [(119884
1119905)2] + (120578
1119905)21198622119864119905 [(119884
2119905+1)
2]
(24)
where 1198841119905= [ minus120593
1119905
0]
1198842119905+1 = [
0119875119905+1119896120579
119905
(119909119883119871) 120572119905+1 minus 119896
120579119905
(119909119883119871) 120572119905+1
]
= [0
1205821 (119909 120572119905+1 120572lowast) + 1205822 (119909 120572119905+1 120572
lowast)]
(25)
Consider the last two items of (24) firstly If the immediatereward is bounded and infinite discounted reward is appliedthe value function is bounded Hence |120572
119905| lt infin
If the system is BIBO stable existΣ0 Σ1 sub 119877119899+119898 119899119898 are the
dimensions of the state space and action space respectivelyforall1199090 isin Σ0 119909infin isin Σ1 the policy space is bounded
According to Proposition 1 when the policy space isbounded and |120572
119905| lt infin |120579
119905| lt infin there exists a constant 1198881
so that
119864100381610038161003816100381610038161198811015840
119905+1 (120593119905+1) 1198842119905+1
10038161003816100381610038161003816le 1198881 (26)
In addition
119864100381610038161003816100381610038161198842119905+1
10038161003816100381610038161003816
2= 119864
10038161003816100381610038161003816[1205821 (119909 120572119905+1 120572
lowast) + 1205822 (119909 120572119905+1 120572
lowast)]
210038161003816100381610038161003816
le 2119864 1003816100381610038161003816100381612058221 (119909 120572119905+1 120572
lowast)10038161003816100381610038161003816+ 2119864 10038161003816100381610038161003816
12058222 (119909 120572119905+1 120572
lowast)10038161003816100381610038161003816
le 11988821205932119905+1 + 1198883120593
2119905+1 le 1198884119881 (120593
119905+1)
(27)
From (26) and (27) we know that
119864100381610038161003816100381610038161198842119905+1
10038161003816100381610038161003816
2+119864
100381610038161003816100381610038161198811015840
119905+1 (120593119905+1) 1198842119905+1
10038161003816100381610038161003816le 1198623119881 (120593
119905+1) +1198623 (28)
where 1198623 = max1198881 1198884 Similarly
119864100381610038161003816100381610038161198841119905
10038161003816100381610038161003816
2+119864
100381610038161003816100381610038161198811015840
119905(120593119905) 119884
1119905
10038161003816100381610038161003816le 1198624119881 (120593
119905) +1198624 (29)
6 Mathematical Problems in Engineering
According to Lemma 541 in [30] 119864119881(120593119905) lt infin and
119864|1198841119905|2lt infin and 119864|119884
2119905+1|
2lt infin
Now we focus on the first two items on the right of (24)The first item 119864
119905[1198811015840
119905(120593119905)119884
1119905] is computed as
119864119905[1198811015840
119905(120593119905) 119884
1119905]
= 119864119905[(120593
1119905)119879
(1205932119905)119879
] [119868119863
0
0 119896lowast
119905
][minus120593
1119905
0]
= minus100381610038161003816100381610038161205931119905
10038161003816100381610038161003816
2
(30)
For the second item 119864119905[1198811015840
119905+1(120593119905+1)1198842119905+1] when the
state transition function is time invariant it is true that119875119905+1119870120579
119905
(119909 119883119871)120572119905+1 = 119875
119905119870120579119905
(119909 119883119871)120572119905+1 120572119905+1 = 120572
119905 Then we
have
1198641199051205821 (119909 120572119905+1 120572
lowast) = 119864119905[119896alefsym
120579119905
(119909119883119871) 119875119905119896120579119905
(119909119883119871) 120572119905
minus 119896alefsym
120579119905
(119909119883119871) 119875119905119896120579119905
(119909119883119871) 120572lowast] = 119896alefsym
120579119905
(119909119883119871)
sdot 119864 [119875119905119896120579119905
(119909119883119871) 120572119905
minus119875119905119896120579119905
(119909119883119871) 120572lowast] le 119896alefsym
120579119905
(119909119883119871)
sdot10038171003817100381710038171003817119864 [119875119905119896120579119905
(119909119883119871) 120572119905minus119875119905119896120579119905
(119909119883119871) 120572lowast]10038171003817100381710038171003817
(31)
This inequality holds because of positive 119896alefsym120579119905
(119909 119883) 119896alefsym120579119905119894gt
0 119894 = 1 119871 Define the norm Δ(120579 sdot) = max119909|Δ(120579 119909)|
where |Δ(sdot)| is the derivative matrix norm of the vector 1 and119909lowast= argmax
119909|Δ(120579 119909)| Then
10038171003817100381710038171003817119864 [119875119905119896120579119905
(119909119883119871) 120572119905minus119875119905119896120579119905
(119909119883119871) 120572lowast]10038171003817100381710038171003817
= max119909
119864119905
10038161003816100381610038161003816[119875119905119896120579119905
(119909119883119871) 120572119905minus119875119905119896120579119905
(119909119883119871) 120572lowast]10038161003816100381610038161003816
= max119909
10038161003816100381610038161003816100381610038161003816119864119905[(119903 + 120573max
1198861015840
119896120579119905
(1199091015840 119883119871) 120572119905)
minus(119903 + 120573max1198861015840
119896120579119905
(1199091015840 119883119871) 120572lowast)]
10038161003816100381610038161003816100381610038161003816= 120573
sdotmax119909
10038161003816100381610038161003816100381610038161003816119864119905[max1198861015840
119896120579119905
(1199091015840 119883119871) 120572119905
minusmax1198861015840
119896120579119905
(1199091015840 119883119871) 120572lowast]
10038161003816100381610038161003816100381610038161003816= 120573max119909
10038161003816100381610038161003816100381610038161003816int1199041015840
119901 (1199041015840| 119909)
sdot [max1198861015840
119896120579119905
(1199091015840 119883119871) 120572119905minusmax1198861015840
119896120579119905
(1199091015840 119883119871) 120572lowast] 1198891199041015840
10038161003816100381610038161003816100381610038161003816
le 120573max119909
int1199041015840
119901 (1199041015840| 119909)
sdot
10038161003816100381610038161003816100381610038161003816[max1198861015840
119896120579119905
(1199091015840 119883119871) 120572119905minusmax1198861015840
119896120579119905
(1199091015840 119883119871) 120572lowast]
100381610038161003816100381610038161003816100381610038161198891199041015840
le 120573max119909
int1199041015840
119901 (1199041015840| 119909)
sdotmax1198861015840
10038161003816100381610038161003816119896120579119905
(1199091015840 119883119871) 120572119905minus 119896120579119905
(1199091015840 119883119871) 120572lowast10038161003816100381610038161003816
1198891199041015840= 120573
sdotmax119909
int1199041015840
119901 (1199041015840| 119909)max
1198861015840
10038161003816100381610038161003816119896120579119905
(1199091015840 119883119871) (120572119905minus 120572lowast)100381610038161003816100381610038161198891199041015840
= 120573max119909
int1199041015840
119901 (1199041015840| 119909)
sdotmax1198861015840
10038161003816100381610038161003816119864119905[119875119905119896120579119905
(1199091015840 119883119871) 120572119905minus 119875119905119896120579119905
(1199091015840 119883119871) 120572lowast]100381610038161003816100381610038161198891199041015840
le 120573max119909
10038161003816100381610038161003816119896120579119905
(119909119883119871) (120572119905minus120572lowast)10038161003816100381610038161003816= 120573
10038161003816100381610038161003816119896120579119905
(119909lowast 119883119871)
sdot 1205932119905
10038161003816100381610038161003816
(32)
Hence
1198641199051205821 (119909 120572119905+1 120572
lowast) le 120573119896
alefsym
120579119905
(119909lowast 119883119871)10038161003816100381610038161003816119896120579119905
(119909lowast 119883119871) 120593
2119905
10038161003816100381610038161003816 (33)
On the other hand
1198641199051205822 (119909 120572119905 120572
lowast) = 119864119905[119896alefsym
120579119905
(119909119883119871) 119875119905119896120579119905
(119909119883119871) 120572lowast
minus 119896alefsym
120579119905
(119909119883119871) 119896120579119905
(119909119883119871) 120572119905]
= 119864119905[119896alefsym
120579119905
(119909119883119871) 119875119905119896120579119905
(119909119883119871) 120572lowast
minus 119896alefsym
120579119905
(119909119883119871) 119896120579119905
(119909119883119871) 120572lowast] + 119896alefsym
120579119905
(119909119883119871)
sdot 119896120579119905
(119909119883119871) 120572lowastminus 119896alefsym
120579119905
(119909119883119871) 119896120579119905
(119909119883119871) 120572119905
= minus 119896alefsym
120579119905
(119909119883119871) [119896120579119905
(119909119883119871) 120593
2119905minus1198641199051205823 (119909 120579119905 120579
lowast)]
(34)
where 1205823(119909 120579119905 120579lowast) = 119875
119905119896120579119905
(119909 119883119871)120572lowastminus 119896120579119905
(119909 119883119871)120572lowast is the
value function error caused by the estimated hyperparame-ters
Substituting (33) and (34) into 119864119905[1198811015840
119905+1(120593119905+1)1198842119905+1] yields
119864119905[1198811015840
119905+1 (120593119905+1) 1198842119905+1] = 119864
119905[119896120579119905
(119909lowast 119883119871) 120593
2119905]119879
sdot 119896120579119905
(119909lowast 119883119871) [1205821 (119909 120572119905+1 120572
lowast) + 1205822 (119909 120572119905+1 120572
lowast)]
= [119896120579119905
(119909lowast 119883119871) 120593
2119905]119879
119896120579119905
(119909lowast 119883119871)
sdot [1198641199051205821 (119909 120572119905+1 120572
lowast) + 1198641199051205822 (119909 120572119905+1 120572
lowast)]
le 120573 [119896120579119905
(119909lowast 119883119871) 120593
2119905]119879
119896120579119905
(119909lowast 119883119871) 119896alefsym
120579119905
(119909lowast 119883119871)
sdot10038161003816100381610038161003816119896120579119905
(119909lowast 119883119871) 120593
2119905
10038161003816100381610038161003816minus [119896120579119905
(119909lowast 119883119871) 120593
2119905]119879
119896120579119905
(119909lowast 119883119871)
sdot 119896alefsym
120579119905
(119909lowast 119883119871) [119896120579119905
(119909lowast 119883119871) 120593
2119905minus1198641199051205823 (119909lowast 120579119905 120579lowast)]
le minus (1minus120573)10038161003816100381610038161003816119896120579119905
(119909lowast 119883119871) 120593
2119905
10038161003816100381610038161003816
2+100381610038161003816100381610038161003816[119896120579119905
(119909lowast 119883119871) 120593
2119905]119879
sdot 1198641199051205823 (119909lowast 120579119905 120579lowast)100381610038161003816100381610038161003816
(35)
Mathematical Problems in Engineering 7
Obviously if the following inequality is satisfied
1205782119905
100381610038161003816100381610038161003816[119896120579119905
(119909lowast 119883119871) 120593
2119905]119879
1198641199051205823 (119909lowast 120579119905 120579lowast)100381610038161003816100381610038161003816
lt 1205781119905
100381610038161003816100381610038161205931119905
10038161003816100381610038161003816
2+ 120578
2119905(1minus120573)
10038161003816100381610038161003816119896120579119905
(119909lowast 119883119871) 120593
2119905
10038161003816100381610038161003816
2(36)
there exists a positive const 120575 such that the first two items onthe right of (24) satisfy
1205781119905119864119905[1198811015840
119905(120593119905) 119884
1119905] + 120578
2119905119864119905[1198811015840
119905+1 (120593119905+1) 1198842119905+1] le minus 120575 (37)
According to Theorem 542 in [30] the iterative process isconvergent namely 120593
119905
1199081199011997888997888997888997888rarr 0
Remark 3 Let us check the final convergence positionObviously 120593
119905
1199081199011997888997888997888997888rarr 0 [ 120572119905
120579119905
]1199081199011997888997888997888997888rarr [
120572lowast
119891(120572lowast)] This means the
equilibriumof the critic networkmeets119864119905[119875119905119870120579lowast(119909 119883
119871)120572lowast] =
119870120579lowast(119909 119883
119871)120572lowast where 120579
lowast= 119891(120572
lowast) And the equilibrium of
hyperparameters 120579lowast = 119891(120572lowast) is the solution of evidencemax-
imization that max120593(119871(120593)) = max
120593((12)119910
119879119870minus1120579lowast (119883119871 119883119871)119910 +
(12)log|119862| + (1198992)log2120587) where 119910 = 119870minus1120579lowast (119883119871 119883119871)120572
lowast
Remark 4 It is clear that the selection of the samples is oneof the key issues Since now all samples have values accordingto two-phase iteration according to the information-basedcriterion it is convenient to evaluate samples and refine theset by arranging relative more samples near the equilibriumorwith great gradient of the value function so that the sampleset is better to describe the distribution of value function
Remark 5 Since the two-phase iteration belongs to valueiteration the initial policy of the algorithm does not need tobe stable To ensure BIBO in practice we need a mechanismto clamp the output of system even though the system willnot be smooth any longer
Theorem 2 gives the iteration principle for critic networklearning Based on the critic network with proper objectivedefined such as minimizing the expected total discountedreward [15] HDP or DHP update is applied to optimizeactor network Since this paper focuses on ACD and moreimportantly the update process of actor network does notaffect the convergence of critic network though maybe itinduces premature or local optimization the gradient updateof actor is not necessary Hence a simple optimum seeking isapplied to get optimal actions wrt sample states and then anactor network based onGaussian kernel is generated based onthese optimal state-action pairs
Up to now we have built a critic-actor architecture whichis namedGaussian-kernel-based Approximate Dynamic Pro-gramming (GK-ADP for short) and shown in Algorithm 1 Aspan 119870 is introduced so that hyperparameters are updatedevery 119870 times of 120572 update If 119870 gt 1 two phases areasynchronous From the view of stochastic approximationthis asynchronous learning does not change the convergencecondition (36) but benefit computational cost of hyperparam-eters learning
4 Simulation and Discussion
In this section we propose some numerical simulationsabout continuous control to illustrate the special propertiesand the feasibility of the algorithm including the necessityof two phases learning the specifical properties comparingwith traditional kernel-based ACDs and the performanceenhancement resulting from online refinement of sample set
Before further discussion we firstly give common setupin all simulations
(i) The span 119870 = 50(ii) To make output bounded once a state is out of the
boundary the system is reset randomly(iii) The exploration-exploitation tradeoff is left out of
account here During learning process all actions areselected randomly within limited action space Thusthe behavior of the system is totally unordered duringlearning
(iv) The same ALD-based sparsification in [15] is used todetermine the initial sample set in which 119908
119894= 18
V0 = 0 and V1 = 05 empirically(v) The sampling time and control interval are set to
002 s
41 The Necessity of Two-Phase Learning The proof ofTheorem 2 shows that properly determined learning rates ofphases 1 and 2 guarantee condition (36) Here we propose asimulation to show how the learning rates in phases 1 and 2affect the performance of the learning
Consider a simple single homogeneous inverted pendu-lum system
1199041 = 1199042
1199042 =(119872119901119892 sin (1199041) minus 120591cos (1199041)) 119889
119872119901119889
(38)
where 1199041 and 1199042 represent the angle and its speed 119872119901
=
01 kg 119892 = 98ms2 119889 = 02m respectively and 120591 is ahorizontal force acting on the pole
We test the success rate under different learning ratesSince during the learning the action is always selectedrandomly we have to verify the optimal policy after learningthat is an independent policy test is carried out to test theactor network Thus the success of one time of learningmeans in the independent test the pole can be swung upand maintained within [minus002 002] rad for more than 200iterations
In the simulation 60 state-action pairs are collected toserve as the sample set and the learning rates are set to 120578
2119905=
001(1 + 119905)0018 1205781
119905= 119886(1 + 119905
0018) where 119886 is varying from 0
to 012The learning is repeated 50 times in order to get the
average performance where the initial states of each runare randomly selected within the bound [minus04 04] and[minus05 05] wrt the dimensions of 1199041 and 1199042 The action space
8 Mathematical Problems in Engineering
Initialize120579 hyperparameters of Gaussian kernel model119883119871= 119909119894| 119909119894= (119904119894 119886119894)119871
119894=1 sample set
120587(1199040) initial policy1205781t 120578
2t learning step size
Let 119905 = 0Loop
119891119900119903 k = 1 119905119900 119870 119889119900
119905 = t + 1119886119905= 120587(119904
119905)
Get the reward 119903119905
Observe next state 119904119905+1
Update 120572 according to (12)Update the policy 120587 according to optimum seeking
119890119899119889 119891119900119903
Update 120579 according to (12)Until the termination criterion is satisfied
Algorithm 1 Gaussian-kernel-based Approximate Dynamic Programming
0 001 002 005 008 01 0120
01
02
03
04
05
06
07
08
09
1
Succ
ess r
ate o
ver 5
0 ru
ns
6
17
27
42
5047 46
Parameter a in learning rate 1205781
Figure 1 Success times over 50 runs under different learning rate1205781119905
is bounded within [minus05 05] And in each run the initialhyperparameters are set to [119890
minus2511989005
11989009
119890minus1
1198900001
]Figure 1 shows the result of success rates Clearly with
1205782119905fixed different 1205781
119905rsquos affect the performance significantly In
particular without the learning of hyperparameters that is1205781119905= 0 there are only 6 successful runs over 50 runsWith the
increasing of 119886 the success rate increases till 1 when 119886 = 008But as 119886 goes on increasing the performance becomes worse
Hence both phases are necessary if the hyperparametersare not initialized properly It should be noted that thelearning rates wrt two phases need to be regulated carefullyin order to guarantee condition (36) which leads the learningprocess to the equilibrium in Remark 3 but not to some kindof boundary where the actor always executed the maximal orminimal actions
0 1000 2000 3000 4000 5000 6000 7000 8000500
1000
1500
2000
2500
3000
3500
4000
4500
5000
Iteration
The s
um o
f Q v
alue
s of a
ll sa
mpl
e sta
te-a
ctio
ns
No learninga = 001a = 002a = 005
a = 008a = 01a = 012
Figure 2 The average evolution processes over 50 runs wrtdifferent 1205781
119905
If all samplesrsquo values on each iteration are summed up andall cumulative values wrt iterations are depicted in seriesthen we have Figure 2 The evolution process wrt the bestparameter 119886 = 008 is marked by the darker diamond It isclear that even with different 1205781
119905 the evolution processes of
the samplesrsquo value learning are similar That means due toBIBO property the samplesrsquo values must be finally boundedand convergent However as mentioned above it does notmean that the learning process converges to the properequilibrium
Mathematical Problems in Engineering 9
S1
S3
Figure 3 One-dimensional inverted pendulum
42 Value Iteration versus Policy Iteration As mentioned inAlgorithm 1 the value iteration in critic network learningdoes not depend on the convergence of the actor networkand compared with HDP or DHP the direct optimumseeking for actor network seems a little aimless To test itsperformance and discuss its special characters of learning acomparison between GK-ADP and KHDP is carried out
The control objective in the simulation is a one-dimensional inverted pendulum where a single-stageinverted pendulum is mounted on a cart which is able tomove linearly just as Figure 3 shows
The mathematic model is given as
1199041 = 1199042 + 120576
1199042 =119892 sin (1199041) minus 119872
119901119889119904
22 cos (1199041) sin (1199041) 119872
119889 (43 minus 119872119901cos2 (1199041) 119872)
+cos (1199041) 119872
119889 (43 minus 119872119901cos2 (1199041) 119872)
119906
1199043 = 1199044
1199044 =119906 + 119872
119901119889 (119904
22 sin (1199041) minus 1199042 cos (1199041))
119872
(39)
where 1199041 to 1199044 represent the state of angle angle speed lineardisplacement and linear speed of the cart respectively 119906represents the force acting on the cart119872 = 02 kg 119889 = 05m120576 sim 119880(minus002 002) and other denotations are the same as(38)
Thus the state-action space is 5D space much larger thanthat in simulation 1The configurations of the both algorithmsare listed as follows
(i) A small sample set with 50 samples is adopted to buildcritic network which is determined by ALD-basedsparsification
(ii) For both algorithms the state-action pair is limited to[minus03 03] [minus1 1] [minus03 03] [minus2 2] [minus3 3] wrt 1199041to 1199044 and 119906
(iii) In GK-ADP the learning rates are 1205781119905= 002(1+119905
001)
and 1205782119905= 002(1 + 119905)
002
0 2000 4000 6000 8000minus02
002040608
112141618
Iteration
The a
vera
ge o
f cum
ulat
ed w
eigh
ts ov
er 1
00 ru
ns
minus1000100200300400500600700800900
10000
GK-ADP120590 = 09
120590 = 12
120590 = 15
120590 = 18
120590 = 21
120590 = 24
Figure 4 The comparison of the average evolution processes over100 runs
(iv) The KHDP algorithm in [15] is chosen as the compar-ison where the critic network uses Gaussian kernelsTo get the proper Gaussian parameters a series of 120590 as09 12 15 18 21 and 24 are tested in the simulationThe elements of weights 120572 are initialized randomly in[minus05 05] the forgetting factor in RLS-TD(0) is set to120583 = 1 and the learning rate in KHDP is set to 03
(v) All experiments run 100 times to get the average per-formance And in each run there are 10000 iterationsto learn critic network
Before further discussion it should be noted that it makesno sense to figure out ourselves with which algorithm isbetter in learning performance because besides the debateof policy iteration versus value iteration there are too manyparameters in the simulation configuration affecting learningperformance So the aim of this simulation is to illustrate thelearning characters of GK-ADP
However to make the comparison as fair as possible weregulate the learning rates of both algorithms to get similarevolution processes Figure 4 shows the evolution processesof GK-ADP and KHDPs under different 120590 where the 119910-axisin the left represents the cumulatedweightssum
119894120572119894 of the critic
network in kernel ACD and the other 119910-axis represents thecumulated values of the samples sum
119894119876(119909119894) 119909119894isin 119883119871 in GK-
ADP It implies although with different units the learningprocesses under both learning algorithms converge nearly atthe same speed
Then the success rates of all algorithms are depictedin Figure 5 The left six bars represent the success rates ofKHDP with different 120590 With superiority argument left asidewe find that such fixed 120590 in KHDP is very similar to thefixed hyperparameters 120579 which needs to be set properlyin order to get higher success rate But unfortunately thereis no mechanism in KHDP to regulate 120590 online On the
10 Mathematical Problems in Engineering
09 12 15 18 21 240
01
02
03
04
05
06
07
08
09
1
Parameter 120590 in ALD-based sparsification
Succ
ess r
ate o
ver 1
00 ru
ns
67 6876 75
58 57
96
GK-ADP
Figure 5 The success rates wrt all algorithms over 100 runs
contrary the two-phase iteration introduces the update ofhyperparameters into critic network which is able to drivehyperparameters to better values even with the not so goodinitial values In fact this two-phase update can be viewed asa kind of kernel ACD with dynamic 120590 when Gaussian kernelis used
To discuss the performance in deep we plot the testtrajectories resulting from the actors which are optimized byGK-ADP and KHDP in Figures 6(a) and 6(b) respectivelywhere the start state is set to [02 0 minus02 0]
Apparently the transition time of GK-ADP is muchsmaller than KHDP We think besides the possible wellregulated parameters an important reason is nongradientlearning for actor network
The critic learning only depends on exploration-exploita-tion balance but not the convergence of actor learning Ifexploration-exploitation balance is designed without actornetwork output the learning processes of actor and criticnetworks are relatively independent of each other and thenthere are alternatives to gradient learning for actor networkoptimization for example the direct optimum seeking Gaus-sian regression actor network in GK-ADP
Such direct optimum seeking may result in nearly non-smooth actor network just like the force output depicted inthe second plot of Figure 6(a) To explain this phenomenonwe can find the clue from Figure 7 which illustrates the bestactions wrt all sample states according to the final actornetwork It is reasonable that the force direction is always thesame to the pole angle and contrary to the cart displacementAnd clearly the transition interval from negative limit minus3119873 topositive 3119873 is very short Thus the output of actor networkresulting from Gaussian kernels intends to control anglebias as quickly as possible even though such quick controlsometimes makes the output force a little nonsmooth
It is obvious that GK-ADP is with high efficiency butalso with potential risks such as the impact to actuatorsHence how to design exploration-exploitation balance tosatisfy Theorem 2 and how to design actor network learning
to balance efficiency and engineering applicability are twoimportant issues in future work
Finally we check the samples values which are depictedin Figure 8 where only dimensions of pole angle and lineardisplacement are depicted Due to the definition of immedi-ate reward the closer the states to the equilibrium are thesmaller the 119876 value is
If we execute 96 successful policies one by one and recordall final cart displacements and linear speed just as Figure 9shows the shortage of this sample set is uncovered that evenwith good convergence of pole angle the optimal policy canhardly drive cart back to zero point
To solve this problem and improve performance besidesoptimizing learning parameters Remark 4 implies thatanother and maybe more efficient way is refining sample setbased on the samplesrsquo values We will investigate it in thefollowing subsection
43 Online Adjustment of Sample Set Due to learning sam-plersquos value it is possible to assess whether the samples arechosen reasonable In this simulation we adopt the followingexpected utility [26]
119880 (119909) = 120588ℎ (119864 [119876 (119909) | 119883119871])
+120573
2log (var [119876 (119909) | 119883119871])
(40)
where ℎ(sdot) is a normalization function over sample set and120588 = 1 120573 = 03 are coefficients
Let 119873add represent the number of samples added to thesample set 119873limit the size limit of the sample set and 119879(119909) =
119909(0) 119909(1) 119909(119879119890) the state-action pair trajectory during
GK-ADP learning where 119879119890denotes the end iteration of
learning The algorithm to refine sample set is presented asAlgorithm 2
Since in simulation 2 we have carried out 100 runs ofexperiment for GK-ADP 100 sets of sample values wrtthe same sample states are obtained Now let us applyAlgorithm 2 to these resulted sample sets where 119879(119909) ispicked up from the history trajectory of state-action pairs ineach run and 119873add = 10 119873limit = 60 in order to add 10samples into sample set
We repeat all 100 runs of GK-ADP again with the samelearning rates 120578
1 and 1205782 Based on 96 successful runs in
simulation 2 94 runs are successful in obtaining controlpolicies swinging up pole Figure 10(a) shows the averageevolutional process of sample values over 100 runs wherethe first 10000 iterations display the learning process insimulation 2 and the second 8000 iterations display thelearning process after adding samples Clearly the cumulative119876 value behaves a sudden increase after sample was addedand almost keeps steady afterwards It implies even withmore samples added there is not toomuch to learn for samplevalues
However if we check the cart movement we will findthe enhancement brought by the change of sample set Forthe 119895th run we have two actor networks resulting from GK-ADP before and after adding samples Using these actors Wecarry out the control test respectively and collect the absolute
Mathematical Problems in Engineering 11
0 1 2 3 4 5minus15
minus1
minus05
0
05
Time (s)
0 1 2 3 4 5Time (s)
minus2
0
2
4
Forc
e (N
)
Angle displacementAngle speed
Pole
angl
e (ra
d) an
d sp
eed
(rad
s)
(a) The resulted control performance using GK-ADP
0 1 2 3 4 5Time (s)
0
0
1 2 3 4 5Time (s)
Angle displacementAngle speed
minus1
0
1
2
3
Forc
e (N
)
minus1
minus05
05
Pole
angl
e (ra
d) an
d sp
eed
(rad
s)
(b) The resulted control performance using KHDP
Figure 6 The outputs of the actor networks resulting from GK-ADP and KHDP
minus04 minus02 0 0204
minus04minus02
002
04minus4
minus2
0
2
4
Pole angle (rad)
Linear displacement (m)
Best
actio
n
Figure 7 The final best policies wrt all samples
minus04 minus020 02
04
minus04minus02
002
04minus10
0
10
20
30
40
Pole angle (rad)Linear displacement (m)
Q v
alue
Figure 8The119876 values projected on to dimensions of pole angle (1199041)and linear displacement (1199043) wrt all samples
0 20 40 60 80 100
0
05
1
15
2
Run
Cart
disp
lace
men
t (m
) and
lini
ear s
peed
(ms
)
Cart displacementLinear speed
Figure 9 The final cart displacement and linear speed in the test ofsuccessful policy
values of cart displacement denoted by 1198781198873(119895) and 119878119886
3(119895) at themoment that the pole has been swung up for 100 instants Wedefine the enhancement as
119864 (119895) =119878119887
3 (119895) minus 119878119886
3 (119895)
1198781198873 (119895) (41)
Obviously the positive enhancement indicates that theactor network after adding samples behaves better As allenhancements wrt 100 runs are illustrated together just
12 Mathematical Problems in Engineering
119891119900119903 l = 1 119905119900 119905ℎ119890 119890119899119889 119900119891 119879(119909)
Calculate 119864[119876(119909(119897)) | 119883119871] and var[119876(119909119897) | 119883
119871] using (6) and (7)
Calculate 119880(119909(119897)) using (40)119890119899119889 119891119900119903 119897
Sort all 119880(119909(119897)) 119897 = 1 119879119890in ascending order
Add the first119873add state-action pairs to sample set to get extended set119883119871+119873add
119894119891 L +119873add gt 119873limit
Calculate hyperparameters based on 119883119871+119873add
119891119900119903 l = 1 119905119900 119871 +119873add
Calculate 119864[119876(119909(119897)) | 119883119871+119873add
] and var[119876(119909119897) | 119883119871+119873add
] using (6) and (7)Calculate 119880(119909(119897)) using (40)
119890119899119889 119891119900119903 119897
Sort all 119880(119909(119897)) 119897 = 1 119871 + 119873add in descending orderDelete the first 119871 +119873add minus 119873limit samples to get refined set 119883
119873limit
119890119899119889 119894119891
Algorithm 2 Refinement of sample set
[Figure 10: Learning performance of GK-ADP when the sample set is refined by adding 10 samples. (a) The average evolution of the sum of $Q$ values of all sample state-actions before and after adding samples, over 18000 iterations. (b) The enhancement of the control performance w.r.t. cart displacement after adding samples, per run.]
As Figure 10(b) shows, almost all cart displacements at the moment the pole is swung up are enhanced to some extent, except for 8 failures. Hence, with a proper principle based on the sample values, the sample set can be refined online in order to enhance performance.
Finally, let us check which state-action pairs are added into the sample set. We put all added samples over the 100 runs together and depict their values in Figure 11(a), where the values are projected onto the dimensions of pole angle and cart displacement. Concerning only the relationship between $Q$ values and pole angle, just as Figure 11(b) shows, we find that the refinement principle tends to select samples a little away from the equilibrium and near the boundary $-0.4$, due to the ratio between $\rho$ and $\beta$.
5 Conclusions and Future Work
ADP methods are among the most promising research directions on RL in continuous environments to construct learning systems for nonlinear optimal control. This paper presents GK-ADP with two-phase value iteration, which combines the advantages of kernel ACDs and GP-based value iteration.

The theoretical analysis reveals that, with proper learning rates, two-phase iteration is good at making the Gaussian-kernel-based critic network converge to the structure with optimal hyperparameters and approximate all samples' values.

A series of simulations are carried out to verify the necessity of two-phase learning and to illustrate properties of GK-ADP. Finally, the numerical tests support the viewpoint that the assessment of samples' values provides a way to refine the sample set online, in order to enhance the performance of the critic-actor architecture during operation.
[Figure 11: The $Q$ values w.r.t. all added samples over 100 runs. (a) Projection onto the dimensions of pole angle (rad) and cart displacement (m). (b) Projection onto the dimension of pole angle (rad).]
However, some issues need further attention in the future. The first is how to guarantee condition (36) during learning, which is currently handled in an empirical way. The second is the balance between exploration and exploitation, which is always an open question but seems more notable here, because a bad exploration-exploitation principle will lead two-phase iteration to failure, not only to slow convergence.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This work is supported by the National Natural Science Foundation of China under Grants 61473316 and 61202340, and the International Postdoctoral Exchange Fellowship Program under Grant no. 20140011.
References
[1] R. Sutton and A. Barto, Reinforcement Learning: An Introduction, Adaptive Computation and Machine Learning, MIT Press, Cambridge, Mass, USA, 1998.
[2] F. L. Lewis and D. Vrabie, "Reinforcement learning and adaptive dynamic programming for feedback control," IEEE Circuits and Systems Magazine, vol. 9, no. 3, pp. 32–50, 2009.
[3] F. L. Lewis, D. Vrabie, and K. Vamvoudakis, "Reinforcement learning and feedback control: using natural decision methods to design optimal adaptive controllers," IEEE Control Systems Magazine, vol. 32, no. 6, pp. 76–105, 2012.
[4] H. Zhang, Y. Luo, and D. Liu, "Neural-network-based near-optimal control for a class of discrete-time affine nonlinear systems with control constraints," IEEE Transactions on Neural Networks, vol. 20, no. 9, pp. 1490–1503, 2009.
[5] Y. Huang and D. Liu, "Neural-network-based optimal tracking control scheme for a class of unknown discrete-time nonlinear systems using iterative ADP algorithm," Neurocomputing, vol. 125, pp. 46–56, 2014.
[6] X. Xu, L. Zuo, and Z. Huang, "Reinforcement learning algorithms with function approximation: recent advances and applications," Information Sciences, vol. 261, pp. 1–31, 2014.
[7] A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, "Discrete-time nonlinear HJB solution using approximate dynamic programming: convergence proof," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 38, no. 4, pp. 943–949, 2008.
[8] S. Ferrari, J. E. Steck, and R. Chandramohan, "Adaptive feedback control by constrained approximate dynamic programming," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 38, no. 4, pp. 982–987, 2008.
[9] F.-Y. Wang, H. Zhang, and D. Liu, "Adaptive dynamic programming: an introduction," IEEE Computational Intelligence Magazine, vol. 4, no. 2, pp. 39–47, 2009.
[10] D. Wang, D. Liu, Q. Wei, D. Zhao, and N. Jin, "Optimal control of unknown nonaffine nonlinear discrete-time systems based on adaptive dynamic programming," Automatica, vol. 48, no. 8, pp. 1825–1832, 2012.
[11] D. Vrabie and F. Lewis, "Neural network approach to continuous-time direct adaptive optimal control for partially unknown nonlinear systems," Neural Networks, vol. 22, no. 3, pp. 237–246, 2009.
[12] D. Liu, D. Wang, and X. Yang, "An iterative adaptive dynamic programming algorithm for optimal control of unknown discrete-time nonlinear systems with constrained inputs," Information Sciences, vol. 220, pp. 331–342, 2013.
[13] N. Jin, D. Liu, T. Huang, and Z. Pang, "Discrete-time adaptive dynamic programming using wavelet basis function neural networks," in Proceedings of the IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning, pp. 135–142, Honolulu, Hawaii, USA, April 2007.
[14] P. Koprinkova-Hristova, M. Oubbati, and G. Palm, "Adaptive critic design with echo state network," in Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (SMC '10), pp. 1010–1015, IEEE, Istanbul, Turkey, October 2010.
[15] X. Xu, Z. Hou, C. Lian, and H. He, "Online learning control using adaptive critic designs with sparse kernel machines," IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 5, pp. 762–775, 2013.
[16] V. Vapnik, Statistical Learning Theory, Wiley, New York, NY, USA, 1998.
[17] C. E. Rasmussen and C. K. Williams, Gaussian Processes for Machine Learning, Adaptive Computation and Machine Learning, MIT Press, Cambridge, Mass, USA, 2006.
[18] Y. Engel, S. Mannor, and R. Meir, "The kernel recursive least-squares algorithm," IEEE Transactions on Signal Processing, vol. 52, no. 8, pp. 2275–2285, 2004.
[19] T. G. Dietterich and X. Wang, "Batch value function approximation via support vectors," in Advances in Neural Information Processing Systems, pp. 1491–1498, MIT Press, Cambridge, Mass, USA, 2002.
[20] X. Wang, X. Tian, Y. Cheng, and J. Yi, "Q-learning system based on cooperative least squares support vector machine," Acta Automatica Sinica, vol. 35, no. 2, pp. 214–219, 2009.
[21] T. Hofmann, B. Scholkopf, and A. J. Smola, "Kernel methods in machine learning," The Annals of Statistics, vol. 36, no. 3, pp. 1171–1220, 2008.
[22] M. F. Huber, "Recursive Gaussian process: on-line regression and learning," Pattern Recognition Letters, vol. 45, no. 1, pp. 85–91, 2014.
[23] Y. Engel, S. Mannor, and R. Meir, "Bayes meets Bellman: the Gaussian process approach to temporal difference learning," in Proceedings of the 20th International Conference on Machine Learning, pp. 154–161, August 2003.
[24] Y. Engel, S. Mannor, and R. Meir, "Reinforcement learning with Gaussian processes," in Proceedings of the 22nd International Conference on Machine Learning, pp. 201–208, ACM, August 2005.
[25] C. E. Rasmussen and M. Kuss, "Gaussian processes in reinforcement learning," in Advances in Neural Information Processing Systems, S. Thrun, L. K. Saul, and B. Scholkopf, Eds., pp. 751–759, MIT Press, Cambridge, Mass, USA, 2004.
[26] M. P. Deisenroth, C. E. Rasmussen, and J. Peters, "Gaussian process dynamic programming," Neurocomputing, vol. 72, no. 7–9, pp. 1508–1524, 2009.
[27] X. Xu, C. Lian, L. Zuo, and H. He, "Kernel-based approximate dynamic programming for real-time online learning control: an experimental study," IEEE Transactions on Control Systems Technology, vol. 22, no. 1, pp. 146–156, 2014.
[28] A. Girard, C. E. Rasmussen, and R. Murray-Smith, "Gaussian process priors with uncertain inputs: multiple-step-ahead prediction," Tech. Rep., 2002.
[29] X. Xu, T. Xie, D. Hu, and X. Lu, "Kernel least-squares temporal difference learning," International Journal of Information Technology, vol. 11, no. 9, pp. 54–63, 2005.
[30] H. J. Kushner and G. G. Yin, Stochastic Approximation and Recursive Algorithms and Applications, vol. 35 of Applications of Mathematics, Springer, Berlin, Germany, 2nd edition, 2003.
6 Mathematical Problems in Engineering
According to Lemma 5.4.1 in [30], $EV(\varphi_t) < \infty$, $E|Y_t^1|^2 < \infty$, and $E|Y_{t+1}^2|^2 < \infty$.

Now we focus on the first two items on the right of (24). The first item, $E_t[V_t'(\varphi_t)Y_t^1]$, is computed as
$$
E_t\left[V_t'(\varphi_t)Y_t^1\right] = E_t\left[\begin{pmatrix}(\varphi_t^1)^T & (\varphi_t^2)^T\end{pmatrix}\begin{pmatrix} I_D & 0 \\ 0 & k_t^* \end{pmatrix}\begin{pmatrix} -\varphi_t^1 \\ 0 \end{pmatrix}\right] = -\left|\varphi_t^1\right|^2. \tag{30}
$$
For the second item, $E_t[V_{t+1}'(\varphi_{t+1})Y_{t+1}^2]$, when the state transition function is time invariant, it is true that $P_{t+1}K_{\theta_t}(x,X_L)\alpha_{t+1} = P_t K_{\theta_t}(x,X_L)\alpha_{t+1}$, $\alpha_{t+1} = \alpha_t$. Then we have
$$
\begin{aligned}
E_t\lambda_1(x,\alpha_{t+1},\alpha^*) &= E_t\left[k_{\theta_t}^{\aleph}(x,X_L)P_t k_{\theta_t}(x,X_L)\alpha_t - k_{\theta_t}^{\aleph}(x,X_L)P_t k_{\theta_t}(x,X_L)\alpha^*\right] \\
&= k_{\theta_t}^{\aleph}(x,X_L)\, E\left[P_t k_{\theta_t}(x,X_L)\alpha_t - P_t k_{\theta_t}(x,X_L)\alpha^*\right] \\
&\le k_{\theta_t}^{\aleph}(x,X_L)\, \left\| E\left[P_t k_{\theta_t}(x,X_L)\alpha_t - P_t k_{\theta_t}(x,X_L)\alpha^*\right] \right\|. 
\end{aligned} \tag{31}
$$
This inequality holds because $k_{\theta_t}^{\aleph}(x,X_L)$ is positive: $k_{\theta_t,i}^{\aleph} > 0$, $i = 1,\dots,L$. Define the norm $\|\Delta(\theta,\cdot)\| = \max_x|\Delta(\theta,x)|$, where $|\Delta(\cdot)|$ is the 1-norm of the vector, and let $x^* = \arg\max_x|\Delta(\theta,x)|$. Then
$$
\begin{aligned}
&\left\| E\left[P_t k_{\theta_t}(x,X_L)\alpha_t - P_t k_{\theta_t}(x,X_L)\alpha^*\right] \right\| \\
&\quad= \max_x E_t\left|P_t k_{\theta_t}(x,X_L)\alpha_t - P_t k_{\theta_t}(x,X_L)\alpha^*\right| \\
&\quad= \max_x \left| E_t\left[\left(r + \beta\max_{a'}k_{\theta_t}(x',X_L)\alpha_t\right) - \left(r + \beta\max_{a'}k_{\theta_t}(x',X_L)\alpha^*\right)\right] \right| \\
&\quad= \beta\max_x \left| E_t\left[\max_{a'}k_{\theta_t}(x',X_L)\alpha_t - \max_{a'}k_{\theta_t}(x',X_L)\alpha^*\right] \right| \\
&\quad= \beta\max_x \left| \int_{s'} p(s'\mid x)\left[\max_{a'}k_{\theta_t}(x',X_L)\alpha_t - \max_{a'}k_{\theta_t}(x',X_L)\alpha^*\right]ds' \right| \\
&\quad\le \beta\max_x \int_{s'} p(s'\mid x)\left|\max_{a'}k_{\theta_t}(x',X_L)\alpha_t - \max_{a'}k_{\theta_t}(x',X_L)\alpha^*\right|ds' \\
&\quad\le \beta\max_x \int_{s'} p(s'\mid x)\max_{a'}\left|k_{\theta_t}(x',X_L)\alpha_t - k_{\theta_t}(x',X_L)\alpha^*\right|ds' \\
&\quad= \beta\max_x \int_{s'} p(s'\mid x)\max_{a'}\left|k_{\theta_t}(x',X_L)(\alpha_t-\alpha^*)\right|ds' \\
&\quad= \beta\max_x \int_{s'} p(s'\mid x)\max_{a'}\left|E_t\left[P_t k_{\theta_t}(x',X_L)\alpha_t - P_t k_{\theta_t}(x',X_L)\alpha^*\right]\right|ds' \\
&\quad\le \beta\max_x\left|k_{\theta_t}(x,X_L)(\alpha_t-\alpha^*)\right| = \beta\left|k_{\theta_t}(x^*,X_L)\,\varphi_t^2\right|. 
\end{aligned} \tag{32}
$$
Hence
$$
E_t\lambda_1(x,\alpha_{t+1},\alpha^*) \le \beta\, k_{\theta_t}^{\aleph}(x^*,X_L)\left|k_{\theta_t}(x^*,X_L)\,\varphi_t^2\right|. \tag{33}
$$
On the other hand,
$$
\begin{aligned}
E_t\lambda_2(x,\alpha_t,\alpha^*) &= E_t\left[k_{\theta_t}^{\aleph}(x,X_L)P_t k_{\theta_t}(x,X_L)\alpha^* - k_{\theta_t}^{\aleph}(x,X_L)k_{\theta_t}(x,X_L)\alpha_t\right] \\
&= E_t\left[k_{\theta_t}^{\aleph}(x,X_L)P_t k_{\theta_t}(x,X_L)\alpha^* - k_{\theta_t}^{\aleph}(x,X_L)k_{\theta_t}(x,X_L)\alpha^*\right] \\
&\quad + k_{\theta_t}^{\aleph}(x,X_L)k_{\theta_t}(x,X_L)\alpha^* - k_{\theta_t}^{\aleph}(x,X_L)k_{\theta_t}(x,X_L)\alpha_t \\
&= -k_{\theta_t}^{\aleph}(x,X_L)\left[k_{\theta_t}(x,X_L)\,\varphi_t^2 - E_t\lambda_3(x,\theta_t,\theta^*)\right],
\end{aligned} \tag{34}
$$
where $\lambda_3(x,\theta_t,\theta^*) = P_t k_{\theta_t}(x,X_L)\alpha^* - k_{\theta_t}(x,X_L)\alpha^*$ is the value function error caused by the estimated hyperparameters.
Substituting (33) and (34) into $E_t[V_{t+1}'(\varphi_{t+1})Y_{t+1}^2]$ yields
$$
\begin{aligned}
E_t\left[V_{t+1}'(\varphi_{t+1})Y_{t+1}^2\right] &= E_t\left[k_{\theta_t}(x^*,X_L)\,\varphi_t^2\right]^T k_{\theta_t}(x^*,X_L)\left[\lambda_1(x,\alpha_{t+1},\alpha^*) + \lambda_2(x,\alpha_{t+1},\alpha^*)\right] \\
&= \left[k_{\theta_t}(x^*,X_L)\,\varphi_t^2\right]^T k_{\theta_t}(x^*,X_L)\left[E_t\lambda_1(x,\alpha_{t+1},\alpha^*) + E_t\lambda_2(x,\alpha_{t+1},\alpha^*)\right] \\
&\le \beta\left[k_{\theta_t}(x^*,X_L)\,\varphi_t^2\right]^T k_{\theta_t}(x^*,X_L)\,k_{\theta_t}^{\aleph}(x^*,X_L)\left|k_{\theta_t}(x^*,X_L)\,\varphi_t^2\right| \\
&\quad - \left[k_{\theta_t}(x^*,X_L)\,\varphi_t^2\right]^T k_{\theta_t}(x^*,X_L)\,k_{\theta_t}^{\aleph}(x^*,X_L)\left[k_{\theta_t}(x^*,X_L)\,\varphi_t^2 - E_t\lambda_3(x^*,\theta_t,\theta^*)\right] \\
&\le -(1-\beta)\left|k_{\theta_t}(x^*,X_L)\,\varphi_t^2\right|^2 + \left|\left[k_{\theta_t}(x^*,X_L)\,\varphi_t^2\right]^T E_t\lambda_3(x^*,\theta_t,\theta^*)\right|.
\end{aligned} \tag{35}
$$
Obviously, if the following inequality is satisfied:
$$
\eta_t^2\left|\left[k_{\theta_t}(x^*,X_L)\,\varphi_t^2\right]^T E_t\lambda_3(x^*,\theta_t,\theta^*)\right| < \eta_t^1\left|\varphi_t^1\right|^2 + \eta_t^2(1-\beta)\left|k_{\theta_t}(x^*,X_L)\,\varphi_t^2\right|^2, \tag{36}
$$
then there exists a positive constant $\delta$ such that the first two items on the right of (24) satisfy
$$
\eta_t^1 E_t\left[V_t'(\varphi_t)Y_t^1\right] + \eta_t^2 E_t\left[V_{t+1}'(\varphi_{t+1})Y_{t+1}^2\right] \le -\delta. \tag{37}
$$
According to Theorem 5.4.2 in [30], the iterative process is convergent; namely, $\varphi_t \xrightarrow{w.p.1} 0$.
Remark 3. Let us check the final convergence position. Obviously, $\varphi_t \xrightarrow{w.p.1} 0$ implies $(\alpha_t, \theta_t) \xrightarrow{w.p.1} (\alpha^*, f(\alpha^*))$. This means that the equilibrium of the critic network satisfies $E_t[P_t K_{\theta^*}(x,X_L)\alpha^*] = K_{\theta^*}(x,X_L)\alpha^*$, where $\theta^* = f(\alpha^*)$. And the equilibrium of the hyperparameters, $\theta^* = f(\alpha^*)$, is the solution of the evidence maximization $\max_\theta L(\theta) = \max_\theta\left(-\tfrac{1}{2}y^T K_{\theta}^{-1}(X_L,X_L)y - \tfrac{1}{2}\log|C| - \tfrac{n}{2}\log 2\pi\right)$, where $y = K_{\theta^*}(X_L,X_L)\alpha^*$.
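To make the evidence term in Remark 3 concrete, the following minimal sketch computes the Gaussian-process log marginal likelihood $-\tfrac12 y^T C^{-1}y - \tfrac12\log|C| - \tfrac n2\log 2\pi$ for a squared-exponential kernel. The kernel form, the random sample set, and the hyperparameter values are illustrative assumptions, not details taken from the paper.

import numpy as np

def se_kernel(X1, X2, length, sf2):
    # squared-exponential kernel with per-dimension (ARD) length scales
    diff = (X1[:, None, :] - X2[None, :, :]) / length
    return sf2 * np.exp(-0.5 * np.sum(diff ** 2, axis=-1))

def log_evidence(X, y, length, sf2, sn2):
    # log p(y | X) = -1/2 y^T C^{-1} y - 1/2 log|C| - n/2 log(2 pi),
    # with C = K(X, X) + sn2 * I, evaluated via a Cholesky factorization
    n = len(y)
    C = se_kernel(X, X, length, sf2) + sn2 * np.eye(n)
    L = np.linalg.cholesky(C)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ alpha - np.log(np.diag(L)).sum() - 0.5 * n * np.log(2 * np.pi)

# hypothetical sample set: 60 state-action pairs with scalar targets
rng = np.random.default_rng(0)
X = rng.uniform(-0.5, 0.5, size=(60, 3))
y = rng.normal(size=60)
print(log_evidence(X, y, length=np.array([0.5, 0.5, 0.5]), sf2=1.0, sn2=0.01))

Maximizing this quantity over the length scales and the signal and noise magnitudes is what the second phase drives the hyperparameters toward at equilibrium.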
Remark 4. It is clear that the selection of the samples is one of the key issues. Since all samples now obtain values through the two-phase iteration, it is convenient to evaluate the samples according to an information-based criterion and to refine the set by arranging relatively more samples near the equilibrium or where the gradient of the value function is large, so that the sample set describes the distribution of the value function better.
Remark 5. Since the two-phase iteration belongs to value iteration, the initial policy of the algorithm does not need to be stabilizing. To ensure BIBO stability in practice, we need a mechanism to clamp the output of the system, even though the system is then no longer smooth.
Theorem 2 gives the iteration principle for critic network learning. Based on the critic network, with a proper objective defined, such as minimizing the expected total discounted reward [15], an HDP or DHP update can be applied to optimize the actor network. Since this paper focuses on the ACD, and, more importantly, the update of the actor network does not affect the convergence of the critic network (though it may induce premature or local optimization), a gradient update of the actor is not necessary. Hence a simple optimum seeking is applied to get the optimal actions w.r.t. the sample states, and then an actor network based on Gaussian kernels is generated from these optimal state-action pairs.
Up to now we have built a critic-actor architecture, named Gaussian-kernel-based Approximate Dynamic Programming (GK-ADP for short) and shown in Algorithm 1. A span K is introduced so that the hyperparameters are updated once per K updates of α. If K > 1, the two phases are asynchronous. From the viewpoint of stochastic approximation, this asynchronous learning does not change the convergence condition (36) but benefits the computational cost of hyperparameter learning.
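Because the actor is obtained by optimum seeking rather than a gradient update, it can be realized as a simple search over the admissible action set under the learned critic. A minimal sketch, assuming a callable critic critic_q on state-action vectors and a one-dimensional action (both hypothetical interfaces, not the paper's implementation):

import numpy as np

def best_action(critic_q, state, a_min=-0.5, a_max=0.5, n_grid=101):
    # evaluate the critic on a grid of candidate actions and keep the best;
    # the resulting (state, best action) pairs can then be fitted by a
    # Gaussian-kernel regression that serves as the actor network
    candidates = np.linspace(a_min, a_max, n_grid)
    q_values = np.array([critic_q(np.append(state, a)) for a in candidates])
    return candidates[int(np.argmax(q_values))]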
4 Simulation and Discussion
In this section we present some numerical simulations of continuous control to illustrate the special properties and the feasibility of the algorithm, including the necessity of two-phase learning, the specific properties in comparison with traditional kernel-based ACDs, and the performance enhancement resulting from online refinement of the sample set.
Before further discussion, we first give the common setup of all simulations:

(i) The span K = 50.
(ii) To keep the output bounded, once a state goes out of the boundary, the system is reset randomly.
(iii) The exploration-exploitation tradeoff is left out of account here. During the learning process, all actions are selected randomly within the limited action space. Thus the behavior of the system is totally unordered during learning.
(iv) The same ALD-based sparsification as in [15] is used to determine the initial sample set, in which $w_i = 1.8$, $v_0 = 0$, and $v_1 = 0.5$, empirically.
(v) The sampling time and control interval are set to 0.02 s.
4.1 The Necessity of Two-Phase Learning

The proof of Theorem 2 shows that properly determined learning rates of phases 1 and 2 guarantee condition (36). Here we present a simulation to show how the learning rates of phases 1 and 2 affect the learning performance.
Consider a simple single homogeneous inverted pendulum system:
$$
\begin{aligned}
\dot{s}_1 &= s_2, \\
\dot{s}_2 &= \frac{M_p g\sin(s_1) - \tau\cos(s_1)}{M_p d},
\end{aligned} \tag{38}
$$
where $s_1$ and $s_2$ represent the angle and the angular speed, respectively, $M_p = 0.1$ kg, $g = 9.8$ m/s², $d = 0.2$ m, and $\tau$ is a horizontal force acting on the pole.
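For reference, a forward-Euler discretization of (38) with the 0.02 s sampling time could read as below; this is a plausible reading of the simulation setup, not code from the paper.

import numpy as np

M_P, G, D, DT = 0.1, 9.8, 0.2, 0.02  # pole mass (kg), gravity (m/s^2), length (m), step (s)

def pendulum_step(s, tau):
    # s = (angle s1, angular speed s2); tau = horizontal force on the pole
    s1, s2 = s
    ds2 = (M_P * G * np.sin(s1) - tau * np.cos(s1)) / (M_P * D)
    return np.array([s1 + DT * s2, s2 + DT * ds2])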
We test the success rate under different learning rates. Since the action is always selected randomly during learning, we have to verify the optimal policy after learning; that is, an independent policy test is carried out to evaluate the actor network. Thus the success of one learning run means that, in the independent test, the pole can be swung up and maintained within $[-0.02, 0.02]$ rad for more than 200 iterations.
In the simulation, 60 state-action pairs are collected to serve as the sample set, and the learning rates are set to $\eta_t^2 = 0.01/(1+t)^{0.018}$ and $\eta_t^1 = a/(1+t^{0.018})$, where $a$ varies from 0 to 0.12. The learning is repeated 50 times in order to get the average performance, where the initial states of each run are randomly selected within the bounds $[-0.4, 0.4]$ and $[-0.5, 0.5]$ w.r.t. the dimensions of $s_1$ and $s_2$. The action space is bounded within $[-0.5, 0.5]$.
Initialize:
  θ: hyperparameters of the Gaussian kernel model
  X_L = {x_i | x_i = (s_i, a_i)}, i = 1, ..., L: sample set
  π(s_0): initial policy
  η_t^1, η_t^2: learning step sizes
Let t = 0
Loop
  for k = 1 to K do
    t = t + 1
    a_t = π(s_t)
    Get the reward r_t
    Observe the next state s_{t+1}
    Update α according to (12)
    Update the policy π according to optimum seeking
  end for
  Update θ according to (12)
Until the termination criterion is satisfied

Algorithm 1: Gaussian-kernel-based Approximate Dynamic Programming.
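Rendered as code, the control flow of Algorithm 1 looks as follows; the environment interface and the two update rules of (12) are injected as callables, so this is a structural sketch rather than the paper's implementation.

def gk_adp(env, policy, update_alpha, update_theta, improve_policy,
           K=50, max_iter=10000):
    # two-phase iteration: the sample values (alpha) are updated at every
    # step, while the hyperparameters (theta) are updated once per span K
    s, t = env.reset(), 0
    while t < max_iter:
        for _ in range(K):
            t += 1
            a = policy(s)                   # exploratory action
            r, s_next = env.step(a)         # reward and next state
            update_alpha(s, a, r, s_next)   # phase 1 update, cf. (12)
            policy = improve_policy()       # optimum seeking on the critic
            s = s_next
        update_theta()                      # phase 2 update, cf. (12)
    return policy

With K > 1 the hyperparameter phase runs asynchronously at a fraction of the value-update rate, which is exactly the computational saving mentioned above.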
Figure 1: Success rates over 50 runs under different learning rates $\eta_t^1$ (x-axis: parameter $a$ in $\eta_t^1$; the numbers of successful runs are 6, 17, 27, 42, 50, 47, and 46 for $a$ = 0, 0.01, 0.02, 0.05, 0.08, 0.1, and 0.12, respectively).
In each run the initial hyperparameters are set to $[e^{-2.5}, e^{0.5}, e^{0.9}, e^{-1}, e^{0.001}]$.

Figure 1 shows the resulting success rates. Clearly, with $\eta_t^2$ fixed, different $\eta_t^1$ affect the performance significantly. In particular, without the learning of hyperparameters, that is, with $\eta_t^1 = 0$, there are only 6 successful runs out of 50. As $a$ increases, the success rate rises, reaching 1 when $a = 0.08$; but as $a$ increases further, the performance becomes worse again.
Hence both phases are necessary if the hyperparameters are not initialized properly. It should be noted that the learning rates w.r.t. the two phases need to be regulated carefully in order to guarantee condition (36), which leads the learning process to the equilibrium in Remark 3 and not to some kind of boundary where the actor always executes the maximal or minimal actions.
Figure 2: The average evolution processes over 50 runs w.r.t. different $\eta_t^1$ (x-axis: iteration; y-axis: the sum of Q values of all sample state-actions; curves for no learning and for $a$ = 0.01, 0.02, 0.05, 0.08, 0.1, and 0.12).
If all samples' values on each iteration are summed up and all cumulative values w.r.t. iterations are depicted in series, then we have Figure 2. The evolution process w.r.t. the best parameter $a = 0.08$ is marked by the darker diamonds. It is clear that, even with different $\eta_t^1$, the evolution processes of the samples' value learning are similar. That means that, due to the BIBO property, the samples' values must finally be bounded and convergent. However, as mentioned above, this does not mean that the learning process converges to the proper equilibrium.
Figure 3: One-dimensional inverted pendulum ($s_1$: pole angle; $s_3$: cart displacement).
4.2 Value Iteration versus Policy Iteration

As mentioned in Algorithm 1, the value iteration in critic network learning does not depend on the convergence of the actor network, and, compared with HDP or DHP, the direct optimum seeking for the actor network seems a little aimless. To test its performance and discuss its special learning characteristics, a comparison between GK-ADP and KHDP is carried out.
The controlled plant in the simulation is a one-dimensional inverted pendulum, where a single-stage pendulum is mounted on a cart that is able to move linearly, just as Figure 3 shows.
The mathematical model is given as
$$
\begin{aligned}
\dot{s}_1 &= s_2 + \varepsilon, \\
\dot{s}_2 &= \frac{g\sin(s_1) - M_p d\, s_2^2\cos(s_1)\sin(s_1)/M}{d\left(4/3 - M_p\cos^2(s_1)/M\right)} + \frac{\cos(s_1)/M}{d\left(4/3 - M_p\cos^2(s_1)/M\right)}\, u, \\
\dot{s}_3 &= s_4, \\
\dot{s}_4 &= \frac{u + M_p d\left(s_2^2\sin(s_1) - \dot{s}_2\cos(s_1)\right)}{M},
\end{aligned} \tag{39}
$$
where $s_1$ to $s_4$ represent the pole angle, the angular speed, the linear displacement, and the linear speed of the cart, respectively; $u$ represents the force acting on the cart; $M = 0.2$ kg, $d = 0.5$ m, $\varepsilon \sim U(-0.02, 0.02)$; and the other denotations are the same as in (38).
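Under the same reading of the symbols, (39) can be integrated in the same forward-Euler fashion; again a sketch under the stated assumptions, with the uniform disturbance entering the angle derivative as written.

import numpy as np

M_P, M, G, D, DT = 0.1, 0.2, 9.8, 0.5, 0.02  # masses (kg), gravity, pole length (m), step (s)

def cartpole_step(s, u, rng=np.random.default_rng()):
    # s = (pole angle s1, angular speed s2, cart position s3, cart speed s4)
    s1, s2, s3, s4 = s
    eps = rng.uniform(-0.02, 0.02)
    denom = D * (4.0 / 3.0 - M_P * np.cos(s1) ** 2 / M)
    ds2 = ((G * np.sin(s1) - M_P * D * s2 ** 2 * np.cos(s1) * np.sin(s1) / M)
           + np.cos(s1) / M * u) / denom
    ds4 = (u + M_P * D * (s2 ** 2 * np.sin(s1) - ds2 * np.cos(s1))) / M
    return np.array([s1 + DT * (s2 + eps), s2 + DT * ds2,
                     s3 + DT * s4, s4 + DT * ds4])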
Thus the state-action space is a 5D space, much larger than that in simulation 1. The configurations of both algorithms are listed as follows:
(i) A small sample set with 50 samples, determined by ALD-based sparsification, is adopted to build the critic network.
(ii) For both algorithms the state-action pair is limited to $[-0.3, 0.3]$, $[-1, 1]$, $[-0.3, 0.3]$, $[-2, 2]$, and $[-3, 3]$ w.r.t. $s_1$ to $s_4$ and $u$.
(iii) In GK-ADP the learning rates are $\eta_t^1 = 0.02/(1+t^{0.01})$ and $\eta_t^2 = 0.02/(1+t)^{0.02}$.
Figure 4: The comparison of the average evolution processes over 100 runs (x-axis: iteration; left y-axis: the average of cumulated weights over 100 runs, for KHDP with $\sigma$ = 0.9, 1.2, 1.5, 1.8, 2.1, and 2.4; right y-axis: the cumulated Q values of the samples, for GK-ADP).
(iv) The KHDP algorithm in [15] is chosen as the comparison, where the critic network uses Gaussian kernels. To get proper Gaussian parameters, a series of $\sigma$ values, 0.9, 1.2, 1.5, 1.8, 2.1, and 2.4, is tested in the simulation. The elements of the weights $\alpha$ are initialized randomly in $[-0.5, 0.5]$, the forgetting factor in RLS-TD(0) is set to $\mu = 1$, and the learning rate in KHDP is set to 0.3.
(v) All experiments are run 100 times to get the average performance, and in each run there are 10000 iterations to learn the critic network.
Before further discussion, it should be noted that it makes no sense to argue about which algorithm has the better learning performance, because, besides the debate of policy iteration versus value iteration, there are too many parameters in the simulation configuration affecting the learning performance. So the aim of this simulation is to illustrate the learning characteristics of GK-ADP.
However, to make the comparison as fair as possible, we regulate the learning rates of both algorithms to get similar evolution processes. Figure 4 shows the evolution processes of GK-ADP and of KHDP under different $\sigma$, where the left y-axis represents the cumulated weights $\sum_i \alpha_i$ of the critic network in kernel ACD and the right y-axis represents the cumulated values of the samples, $\sum_i Q(x_i)$, $x_i \in X_L$, in GK-ADP. It implies that, although in different units, the learning processes under both algorithms converge at nearly the same speed.
Then the success rates of all algorithms are depicted in Figure 5. The left six bars represent the success rates of KHDP with different $\sigma$. Leaving the superiority argument aside, we find that such a fixed $\sigma$ in KHDP is very similar to fixed hyperparameters $\theta$, which need to be set properly in order to get a higher success rate; but unfortunately there is no mechanism in KHDP to regulate $\sigma$ online.
Figure 5: The success rates w.r.t. all algorithms over 100 runs (x-axis: parameter $\sigma$ in ALD-based sparsification; KHDP achieves success rates of 67%, 68%, 76%, 75%, 58%, and 57% for $\sigma$ = 0.9, 1.2, 1.5, 1.8, 2.1, and 2.4, respectively, while GK-ADP achieves 96%).
On the contrary, the two-phase iteration introduces the update of the hyperparameters into the critic network, which is able to drive the hyperparameters to better values even from not-so-good initial values. In fact, this two-phase update can be viewed as a kind of kernel ACD with dynamic $\sigma$ when a Gaussian kernel is used.
To discuss the performance in depth, we plot the test trajectories resulting from the actors optimized by GK-ADP and KHDP in Figures 6(a) and 6(b), respectively, where the start state is set to $[0.2, 0, -0.2, 0]$.

Apparently the transition time of GK-ADP is much smaller than that of KHDP. We think that, besides the possibly better-regulated parameters, an important reason is the nongradient learning of the actor network.

The critic learning only depends on the exploration-exploitation balance, not on the convergence of the actor learning. If the exploration-exploitation balance is designed without the actor network output, the learning processes of the actor and critic networks are relatively independent of each other, and then there are alternatives to gradient learning for actor network optimization, for example, the direct optimum seeking Gaussian regression actor network in GK-ADP.

Such direct optimum seeking may result in a nearly nonsmooth actor network, just like the force output depicted in the second plot of Figure 6(a). To explain this phenomenon, we can find a clue in Figure 7, which illustrates the best actions w.r.t. all sample states according to the final actor network. It is reasonable that the force direction is always the same as the pole angle and contrary to the cart displacement. And clearly the transition interval from the negative limit $-3$ N to the positive limit $3$ N is very short. Thus the actor network resulting from Gaussian kernels intends to correct the angle bias as quickly as possible, even though such quick control sometimes makes the output force a little nonsmooth.
It is obvious that GK-ADP achieves high efficiency but also carries potential risks, such as the impact on actuators. Hence how to design the exploration-exploitation balance to satisfy Theorem 2, and how to design the actor network learning to balance efficiency and engineering applicability, are two important issues for future work.
Finally we check the samples' values, which are depicted in Figure 8, where only the dimensions of pole angle and linear displacement are shown. Due to the definition of the immediate reward, the closer a state is to the equilibrium, the smaller its Q value is.

If we execute the 96 successful policies one by one and record all final cart displacements and linear speeds, just as Figure 9 shows, a shortage of this sample set is uncovered: even with good convergence of the pole angle, the optimal policy can hardly drive the cart back to the zero point.

To solve this problem and improve the performance, besides optimizing the learning parameters, Remark 4 implies that another, and maybe more efficient, way is to refine the sample set based on the samples' values. We investigate this in the following subsection.
4.3 Online Adjustment of Sample Set

Owing to learning the samples' values, it is possible to assess whether the samples are chosen reasonably. In this simulation we adopt the following expected utility [26]:
$$
U(x) = \rho h\left(E\left[Q(x) \mid X_L\right]\right) + \frac{\beta}{2}\log\left(\operatorname{var}\left[Q(x) \mid X_L\right]\right), \tag{40}
$$
where $h(\cdot)$ is a normalization function over the sample set and $\rho = 1$, $\beta = 0.3$ are coefficients.
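A direct evaluation of (40) from the GP posterior mean and variance of the samples' Q values might look as follows; min-max scaling is used here for the normalization h(·), which is one plausible choice rather than the paper's exact definition.

import numpy as np

def expected_utility(mean_q, var_q, rho=1.0, beta=0.3):
    # U(x) = rho * h(E[Q(x)|X_L]) + (beta / 2) * log(var[Q(x)|X_L]),
    # with h taken as min-max normalization over the evaluated points;
    # mean_q and var_q are arrays of posterior means and (positive) variances
    h = (mean_q - mean_q.min()) / (mean_q.max() - mean_q.min() + 1e-12)
    return rho * h + 0.5 * beta * np.log(var_q)

Evaluated over a trajectory T(x), the pairs with the smallest U are the candidates added by Algorithm 2 below.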
Let $N_{\mathrm{add}}$ represent the number of samples added to the sample set, $N_{\mathrm{limit}}$ the size limit of the sample set, and $T(x) = \{x(0), x(1), \dots, x(T_e)\}$ the state-action pair trajectory during GK-ADP learning, where $T_e$ denotes the end iteration of learning. The algorithm to refine the sample set is presented as Algorithm 2.
Since in simulation 2 we have carried out 100 runs of the experiment for GK-ADP, 100 sets of sample values w.r.t. the same sample states are obtained. Now let us apply Algorithm 2 to these resulting sample sets, where $T(x)$ is picked up from the history trajectory of state-action pairs in each run, and $N_{\mathrm{add}} = 10$, $N_{\mathrm{limit}} = 60$, in order to add 10 samples to the sample set.
We repeat all 100 runs of GK-ADP again with the same learning rates $\eta^1$ and $\eta^2$. Compared with the 96 successful runs in simulation 2, 94 runs now succeed in obtaining control policies that swing up the pole. Figure 10(a) shows the average evolution process of the sample values over the 100 runs, where the first 10000 iterations display the learning process of simulation 2 and the following 8000 iterations display the learning process after adding samples. Clearly, the cumulative Q value exhibits a sudden increase right after the samples are added and almost keeps steady afterwards. This implies that, even with more samples added, there is not too much left to learn for the sample values.
However, if we check the cart movement, we will find the enhancement brought by the change of the sample set. For the jth run we have two actor networks resulting from GK-ADP, before and after adding samples. Using these actors, we carry out the control tests, respectively, and collect the absolute values of the cart displacement, denoted by $S_3^b(j)$ and $S_3^a(j)$, at the moment that the pole has been swung up, for 100 instants.
Figure 6: The outputs of the actor networks resulting from (a) GK-ADP and (b) KHDP (each panel: pole angle (rad) and angular speed (rad/s) versus time (s), and force (N) versus time (s)).
Figure 7: The final best policies w.r.t. all samples (best action versus pole angle (rad) and linear displacement (m)).
Figure 8: The Q values projected onto the dimensions of pole angle ($s_1$) and linear displacement ($s_3$) w.r.t. all samples.
Figure 9: The final cart displacement (m) and linear speed (m/s) in the tests of the successful policies.
We define the enhancement as
$$
E(j) = \frac{S_3^b(j) - S_3^a(j)}{S_3^b(j)}. \tag{41}
$$
Obviously, a positive enhancement indicates that the actor network after adding samples behaves better.
for l = 1 to the end of T(x)
  Calculate E[Q(x(l)) | X_L] and var[Q(x(l)) | X_L] using (6) and (7)
  Calculate U(x(l)) using (40)
end for
Sort all U(x(l)), l = 1, ..., T_e, in ascending order
Add the first N_add state-action pairs to the sample set to get the extended set X_{L+N_add}
if L + N_add > N_limit
  Calculate the hyperparameters based on X_{L+N_add}
  for l = 1 to L + N_add
    Calculate E[Q(x(l)) | X_{L+N_add}] and var[Q(x(l)) | X_{L+N_add}] using (6) and (7)
    Calculate U(x(l)) using (40)
  end for
  Sort all U(x(l)), l = 1, ..., L + N_add, in descending order
  Delete the first L + N_add − N_limit samples to get the refined set X_{N_limit}
end if

Algorithm 2: Refinement of the sample set.
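In code, Algorithm 2 amounts to two sort-and-slice passes over the utilities; a compact sketch, assuming a callable utility that implements (40) with respect to a given sample set (a hypothetical interface):

def refine_sample_set(samples, trajectory, utility, n_add=10, n_limit=60):
    # rank the visited state-action pairs by expected utility (ascending)
    # and add the first n_add of them to the sample set
    ranked = sorted(trajectory, key=lambda x: utility(x, samples))
    extended = samples + ranked[:n_add]
    if len(extended) > n_limit:
        # (the hyperparameters would be re-estimated on the extended set here);
        # re-rank the extended set in descending order and delete the excess
        ranked_ext = sorted(extended, key=lambda x: utility(x, extended),
                            reverse=True)
        extended = ranked_ext[len(extended) - n_limit:]
    return extended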
Figure 10: Learning performance of GK-ADP when the sample set is refined by adding 10 samples. (a) The average evolution process of the sum of Q values of all sample state-actions before and after adding samples (x-axis: iteration). (b) The enhancement of the control performance w.r.t. cart displacement after adding samples, over the 100 runs.
As Figure 10(b) shows, when all enhancements w.r.t. the 100 runs are illustrated together, almost all cart displacements at the moment the pole is swung up are enhanced more or less, except for 8 failures. Hence, with a proper principle based on the sample values, the sample set can be refined online in order to enhance performance.
Finally, let us check which state-action pairs are added to the sample set. We put all added samples over the 100 runs together and depict their values in Figure 11(a), where the values are projected onto the dimensions of pole angle and cart displacement. If we only concern ourselves with the relationship between the Q values and the pole angle, just as Figure 11(b) shows, we find that the refinement principle intends to select samples a little away from the equilibrium and near the boundary $-0.4$ rad, due to the ratio between $\rho$ and $\beta$.

Figure 11: The Q values w.r.t. all added samples over 100 runs. (a) Projection onto the dimensions of pole angle (rad) and cart displacement (m). (b) Projection onto the dimension of pole angle (rad).
5 Conclusions and Future Work

ADP methods are among the most promising research directions for RL in continuous environments to construct learning systems for nonlinear optimal control. This paper presents GK-ADP with two-phase value iteration, which combines the advantages of kernel ACDs and GP-based value iteration.

The theoretical analysis reveals that, with proper learning rates, the two-phase iteration is good at making the Gaussian-kernel-based critic network converge to the structure with optimal hyperparameters and approximate all samples' values.

A series of simulations are carried out to verify the necessity of two-phase learning and to illustrate the properties of GK-ADP. Finally, the numerical tests support the viewpoint that the assessment of samples' values provides a way to
refine the sample set online, in order to enhance the performance of the critic-actor architecture during operation.
However, some issues need to be addressed in the future. The first is how to guarantee condition (36) during learning, which is currently handled in an empirical way. The second is the balance between exploration and exploitation, which is always an open question but seems more notable here, because a bad exploration-exploitation principle leads the two-phase iteration not only to slow convergence but to outright failure.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This work is supported by the National Natural Science Foundation of China under Grants 61473316 and 61202340 and the International Postdoctoral Exchange Fellowship Program under Grant no. 20140011.
References
[1] R. Sutton and A. Barto, Reinforcement Learning: An Introduction, Adaptive Computation and Machine Learning, MIT Press, Cambridge, Mass, USA, 1998.
[2] F. L. Lewis and D. Vrabie, "Reinforcement learning and adaptive dynamic programming for feedback control," IEEE Circuits and Systems Magazine, vol. 9, no. 3, pp. 32–50, 2009.
[3] F. L. Lewis, D. Vrabie, and K. Vamvoudakis, "Reinforcement learning and feedback control: using natural decision methods to design optimal adaptive controllers," IEEE Control Systems Magazine, vol. 32, no. 6, pp. 76–105, 2012.
[4] H. Zhang, Y. Luo, and D. Liu, "Neural-network-based near-optimal control for a class of discrete-time affine nonlinear systems with control constraints," IEEE Transactions on Neural Networks, vol. 20, no. 9, pp. 1490–1503, 2009.
[5] Y. Huang and D. Liu, "Neural-network-based optimal tracking control scheme for a class of unknown discrete-time nonlinear systems using iterative ADP algorithm," Neurocomputing, vol. 125, pp. 46–56, 2014.
[6] X. Xu, L. Zuo, and Z. Huang, "Reinforcement learning algorithms with function approximation: recent advances and applications," Information Sciences, vol. 261, pp. 1–31, 2014.
[7] A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, "Discrete-time nonlinear HJB solution using approximate dynamic programming: convergence proof," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 38, no. 4, pp. 943–949, 2008.
[8] S. Ferrari, J. E. Steck, and R. Chandramohan, "Adaptive feedback control by constrained approximate dynamic programming," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 38, no. 4, pp. 982–987, 2008.
[9] F.-Y. Wang, H. Zhang, and D. Liu, "Adaptive dynamic programming: an introduction," IEEE Computational Intelligence Magazine, vol. 4, no. 2, pp. 39–47, 2009.
[10] D. Wang, D. Liu, Q. Wei, D. Zhao, and N. Jin, "Optimal control of unknown nonaffine nonlinear discrete-time systems based on adaptive dynamic programming," Automatica, vol. 48, no. 8, pp. 1825–1832, 2012.
[11] D. Vrabie and F. Lewis, "Neural network approach to continuous-time direct adaptive optimal control for partially unknown nonlinear systems," Neural Networks, vol. 22, no. 3, pp. 237–246, 2009.
[12] D. Liu, D. Wang, and X. Yang, "An iterative adaptive dynamic programming algorithm for optimal control of unknown discrete-time nonlinear systems with constrained inputs," Information Sciences, vol. 220, pp. 331–342, 2013.
[13] N. Jin, D. Liu, T. Huang, and Z. Pang, "Discrete-time adaptive dynamic programming using wavelet basis function neural networks," in Proceedings of the IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning, pp. 135–142, Honolulu, Hawaii, USA, April 2007.
[14] P. Koprinkova-Hristova, M. Oubbati, and G. Palm, "Adaptive critic design with echo state network," in Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (SMC '10), pp. 1010–1015, IEEE, Istanbul, Turkey, October 2010.
[15] X. Xu, Z. Hou, C. Lian, and H. He, "Online learning control using adaptive critic designs with sparse kernel machines," IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 5, pp. 762–775, 2013.
[16] V. Vapnik, Statistical Learning Theory, Wiley, New York, NY, USA, 1998.
[17] C. E. Rasmussen and C. K. Williams, Gaussian Processes for Machine Learning, Adaptive Computation and Machine Learning, MIT Press, Cambridge, Mass, USA, 2006.
[18] Y. Engel, S. Mannor, and R. Meir, "The kernel recursive least-squares algorithm," IEEE Transactions on Signal Processing, vol. 52, no. 8, pp. 2275–2285, 2004.
[19] T. G. Dietterich and X. Wang, "Batch value function approximation via support vectors," in Advances in Neural Information Processing Systems, pp. 1491–1498, MIT Press, Cambridge, Mass, USA, 2002.
[20] X. Wang, X. Tian, Y. Cheng, and J. Yi, "Q-learning system based on cooperative least squares support vector machine," Acta Automatica Sinica, vol. 35, no. 2, pp. 214–219, 2009.
[21] T. Hofmann, B. Schölkopf, and A. J. Smola, "Kernel methods in machine learning," The Annals of Statistics, vol. 36, no. 3, pp. 1171–1220, 2008.
[22] M. F. Huber, "Recursive Gaussian process: on-line regression and learning," Pattern Recognition Letters, vol. 45, no. 1, pp. 85–91, 2014.
[23] Y. Engel, S. Mannor, and R. Meir, "Bayes meets Bellman: the Gaussian process approach to temporal difference learning," in Proceedings of the 20th International Conference on Machine Learning, pp. 154–161, August 2003.
[24] Y. Engel, S. Mannor, and R. Meir, "Reinforcement learning with Gaussian processes," in Proceedings of the 22nd International Conference on Machine Learning, pp. 201–208, ACM, August 2005.
[25] C. E. Rasmussen and M. Kuss, "Gaussian processes in reinforcement learning," in Advances in Neural Information Processing Systems, S. Thrun, L. K. Saul, and B. Schölkopf, Eds., pp. 751–759, MIT Press, Cambridge, Mass, USA, 2004.
[26] M. P. Deisenroth, C. E. Rasmussen, and J. Peters, "Gaussian process dynamic programming," Neurocomputing, vol. 72, no. 7–9, pp. 1508–1524, 2009.
[27] X. Xu, C. Lian, L. Zuo, and H. He, "Kernel-based approximate dynamic programming for real-time online learning control: an experimental study," IEEE Transactions on Control Systems Technology, vol. 22, no. 1, pp. 146–156, 2014.
[28] A. Girard, C. E. Rasmussen, and R. Murray-Smith, "Gaussian process priors with uncertain inputs: multiple-step-ahead prediction," Tech. Rep., 2002.
[29] X. Xu, T. Xie, D. Hu, and X. Lu, "Kernel least-squares temporal difference learning," International Journal of Information Technology, vol. 11, no. 9, pp. 54–63, 2005.
[30] H. J. Kushner and G. G. Yin, Stochastic Approximation and Recursive Algorithms and Applications, vol. 35 of Applications of Mathematics, Springer, Berlin, Germany, 2nd edition, 2003.
Mathematical Problems in Engineering 7
Obviously if the following inequality is satisfied
1205782119905
100381610038161003816100381610038161003816[119896120579119905
(119909lowast 119883119871) 120593
2119905]119879
1198641199051205823 (119909lowast 120579119905 120579lowast)100381610038161003816100381610038161003816
lt 1205781119905
100381610038161003816100381610038161205931119905
10038161003816100381610038161003816
2+ 120578
2119905(1minus120573)
10038161003816100381610038161003816119896120579119905
(119909lowast 119883119871) 120593
2119905
10038161003816100381610038161003816
2(36)
there exists a positive const 120575 such that the first two items onthe right of (24) satisfy
1205781119905119864119905[1198811015840
119905(120593119905) 119884
1119905] + 120578
2119905119864119905[1198811015840
119905+1 (120593119905+1) 1198842119905+1] le minus 120575 (37)
According to Theorem 542 in [30] the iterative process isconvergent namely 120593
119905
1199081199011997888997888997888997888rarr 0
Remark 3 Let us check the final convergence positionObviously 120593
119905
1199081199011997888997888997888997888rarr 0 [ 120572119905
120579119905
]1199081199011997888997888997888997888rarr [
120572lowast
119891(120572lowast)] This means the
equilibriumof the critic networkmeets119864119905[119875119905119870120579lowast(119909 119883
119871)120572lowast] =
119870120579lowast(119909 119883
119871)120572lowast where 120579
lowast= 119891(120572
lowast) And the equilibrium of
hyperparameters 120579lowast = 119891(120572lowast) is the solution of evidencemax-
imization that max120593(119871(120593)) = max
120593((12)119910
119879119870minus1120579lowast (119883119871 119883119871)119910 +
(12)log|119862| + (1198992)log2120587) where 119910 = 119870minus1120579lowast (119883119871 119883119871)120572
lowast
Remark 4 It is clear that the selection of the samples is oneof the key issues Since now all samples have values accordingto two-phase iteration according to the information-basedcriterion it is convenient to evaluate samples and refine theset by arranging relative more samples near the equilibriumorwith great gradient of the value function so that the sampleset is better to describe the distribution of value function
Remark 5 Since the two-phase iteration belongs to valueiteration the initial policy of the algorithm does not need tobe stable To ensure BIBO in practice we need a mechanismto clamp the output of system even though the system willnot be smooth any longer
Theorem 2 gives the iteration principle for critic networklearning Based on the critic network with proper objectivedefined such as minimizing the expected total discountedreward [15] HDP or DHP update is applied to optimizeactor network Since this paper focuses on ACD and moreimportantly the update process of actor network does notaffect the convergence of critic network though maybe itinduces premature or local optimization the gradient updateof actor is not necessary Hence a simple optimum seeking isapplied to get optimal actions wrt sample states and then anactor network based onGaussian kernel is generated based onthese optimal state-action pairs
Up to now we have built a critic-actor architecture whichis namedGaussian-kernel-based Approximate Dynamic Pro-gramming (GK-ADP for short) and shown in Algorithm 1 Aspan 119870 is introduced so that hyperparameters are updatedevery 119870 times of 120572 update If 119870 gt 1 two phases areasynchronous From the view of stochastic approximationthis asynchronous learning does not change the convergencecondition (36) but benefit computational cost of hyperparam-eters learning
4 Simulation and Discussion
In this section we propose some numerical simulationsabout continuous control to illustrate the special propertiesand the feasibility of the algorithm including the necessityof two phases learning the specifical properties comparingwith traditional kernel-based ACDs and the performanceenhancement resulting from online refinement of sample set
Before further discussion we firstly give common setupin all simulations
(i) The span 119870 = 50(ii) To make output bounded once a state is out of the
boundary the system is reset randomly(iii) The exploration-exploitation tradeoff is left out of
account here During learning process all actions areselected randomly within limited action space Thusthe behavior of the system is totally unordered duringlearning
(iv) The same ALD-based sparsification in [15] is used todetermine the initial sample set in which 119908
119894= 18
V0 = 0 and V1 = 05 empirically(v) The sampling time and control interval are set to
002 s
41 The Necessity of Two-Phase Learning The proof ofTheorem 2 shows that properly determined learning rates ofphases 1 and 2 guarantee condition (36) Here we propose asimulation to show how the learning rates in phases 1 and 2affect the performance of the learning
Consider a simple single homogeneous inverted pendu-lum system
1199041 = 1199042
1199042 =(119872119901119892 sin (1199041) minus 120591cos (1199041)) 119889
119872119901119889
(38)
where 1199041 and 1199042 represent the angle and its speed 119872119901
=
01 kg 119892 = 98ms2 119889 = 02m respectively and 120591 is ahorizontal force acting on the pole
We test the success rate under different learning ratesSince during the learning the action is always selectedrandomly we have to verify the optimal policy after learningthat is an independent policy test is carried out to test theactor network Thus the success of one time of learningmeans in the independent test the pole can be swung upand maintained within [minus002 002] rad for more than 200iterations
In the simulation 60 state-action pairs are collected toserve as the sample set and the learning rates are set to 120578
2119905=
001(1 + 119905)0018 1205781
119905= 119886(1 + 119905
0018) where 119886 is varying from 0
to 012The learning is repeated 50 times in order to get the
average performance where the initial states of each runare randomly selected within the bound [minus04 04] and[minus05 05] wrt the dimensions of 1199041 and 1199042 The action space
8 Mathematical Problems in Engineering
Initialize120579 hyperparameters of Gaussian kernel model119883119871= 119909119894| 119909119894= (119904119894 119886119894)119871
119894=1 sample set
120587(1199040) initial policy1205781t 120578
2t learning step size
Let 119905 = 0Loop
119891119900119903 k = 1 119905119900 119870 119889119900
119905 = t + 1119886119905= 120587(119904
119905)
Get the reward 119903119905
Observe next state 119904119905+1
Update 120572 according to (12)Update the policy 120587 according to optimum seeking
119890119899119889 119891119900119903
Update 120579 according to (12)Until the termination criterion is satisfied
Algorithm 1 Gaussian-kernel-based Approximate Dynamic Programming
0 001 002 005 008 01 0120
01
02
03
04
05
06
07
08
09
1
Succ
ess r
ate o
ver 5
0 ru
ns
6
17
27
42
5047 46
Parameter a in learning rate 1205781
Figure 1 Success times over 50 runs under different learning rate1205781119905
is bounded within [minus05 05] And in each run the initialhyperparameters are set to [119890
minus2511989005
11989009
119890minus1
1198900001
]Figure 1 shows the result of success rates Clearly with
1205782119905fixed different 1205781
119905rsquos affect the performance significantly In
particular without the learning of hyperparameters that is1205781119905= 0 there are only 6 successful runs over 50 runsWith the
increasing of 119886 the success rate increases till 1 when 119886 = 008But as 119886 goes on increasing the performance becomes worse
Hence both phases are necessary if the hyperparametersare not initialized properly It should be noted that thelearning rates wrt two phases need to be regulated carefullyin order to guarantee condition (36) which leads the learningprocess to the equilibrium in Remark 3 but not to some kindof boundary where the actor always executed the maximal orminimal actions
0 1000 2000 3000 4000 5000 6000 7000 8000500
1000
1500
2000
2500
3000
3500
4000
4500
5000
Iteration
The s
um o
f Q v
alue
s of a
ll sa
mpl
e sta
te-a
ctio
ns
No learninga = 001a = 002a = 005
a = 008a = 01a = 012
Figure 2 The average evolution processes over 50 runs wrtdifferent 1205781
119905
If all samplesrsquo values on each iteration are summed up andall cumulative values wrt iterations are depicted in seriesthen we have Figure 2 The evolution process wrt the bestparameter 119886 = 008 is marked by the darker diamond It isclear that even with different 1205781
119905 the evolution processes of
the samplesrsquo value learning are similar That means due toBIBO property the samplesrsquo values must be finally boundedand convergent However as mentioned above it does notmean that the learning process converges to the properequilibrium
Mathematical Problems in Engineering 9
S1
S3
Figure 3 One-dimensional inverted pendulum
42 Value Iteration versus Policy Iteration As mentioned inAlgorithm 1 the value iteration in critic network learningdoes not depend on the convergence of the actor networkand compared with HDP or DHP the direct optimumseeking for actor network seems a little aimless To test itsperformance and discuss its special characters of learning acomparison between GK-ADP and KHDP is carried out
The control objective in the simulation is a one-dimensional inverted pendulum where a single-stageinverted pendulum is mounted on a cart which is able tomove linearly just as Figure 3 shows
The mathematic model is given as
1199041 = 1199042 + 120576
1199042 =119892 sin (1199041) minus 119872
119901119889119904
22 cos (1199041) sin (1199041) 119872
119889 (43 minus 119872119901cos2 (1199041) 119872)
+cos (1199041) 119872
119889 (43 minus 119872119901cos2 (1199041) 119872)
119906
1199043 = 1199044
1199044 =119906 + 119872
119901119889 (119904
22 sin (1199041) minus 1199042 cos (1199041))
119872
(39)
where 1199041 to 1199044 represent the state of angle angle speed lineardisplacement and linear speed of the cart respectively 119906represents the force acting on the cart119872 = 02 kg 119889 = 05m120576 sim 119880(minus002 002) and other denotations are the same as(38)
Thus the state-action space is 5D space much larger thanthat in simulation 1The configurations of the both algorithmsare listed as follows
(i) A small sample set with 50 samples is adopted to buildcritic network which is determined by ALD-basedsparsification
(ii) For both algorithms the state-action pair is limited to[minus03 03] [minus1 1] [minus03 03] [minus2 2] [minus3 3] wrt 1199041to 1199044 and 119906
(iii) In GK-ADP the learning rates are 1205781119905= 002(1+119905
001)
and 1205782119905= 002(1 + 119905)
002
0 2000 4000 6000 8000minus02
002040608
112141618
Iteration
The a
vera
ge o
f cum
ulat
ed w
eigh
ts ov
er 1
00 ru
ns
minus1000100200300400500600700800900
10000
GK-ADP120590 = 09
120590 = 12
120590 = 15
120590 = 18
120590 = 21
120590 = 24
Figure 4 The comparison of the average evolution processes over100 runs
(iv) The KHDP algorithm in [15] is chosen as the compar-ison where the critic network uses Gaussian kernelsTo get the proper Gaussian parameters a series of 120590 as09 12 15 18 21 and 24 are tested in the simulationThe elements of weights 120572 are initialized randomly in[minus05 05] the forgetting factor in RLS-TD(0) is set to120583 = 1 and the learning rate in KHDP is set to 03
(v) All experiments run 100 times to get the average per-formance And in each run there are 10000 iterationsto learn critic network
Before further discussion it should be noted that it makesno sense to figure out ourselves with which algorithm isbetter in learning performance because besides the debateof policy iteration versus value iteration there are too manyparameters in the simulation configuration affecting learningperformance So the aim of this simulation is to illustrate thelearning characters of GK-ADP
However to make the comparison as fair as possible weregulate the learning rates of both algorithms to get similarevolution processes Figure 4 shows the evolution processesof GK-ADP and KHDPs under different 120590 where the 119910-axisin the left represents the cumulatedweightssum
119894120572119894 of the critic
network in kernel ACD and the other 119910-axis represents thecumulated values of the samples sum
119894119876(119909119894) 119909119894isin 119883119871 in GK-
ADP It implies although with different units the learningprocesses under both learning algorithms converge nearly atthe same speed
Then the success rates of all algorithms are depictedin Figure 5 The left six bars represent the success rates ofKHDP with different 120590 With superiority argument left asidewe find that such fixed 120590 in KHDP is very similar to thefixed hyperparameters 120579 which needs to be set properlyin order to get higher success rate But unfortunately thereis no mechanism in KHDP to regulate 120590 online On the
10 Mathematical Problems in Engineering
09 12 15 18 21 240
01
02
03
04
05
06
07
08
09
1
Parameter 120590 in ALD-based sparsification
Succ
ess r
ate o
ver 1
00 ru
ns
67 6876 75
58 57
96
GK-ADP
Figure 5 The success rates wrt all algorithms over 100 runs
contrary the two-phase iteration introduces the update ofhyperparameters into critic network which is able to drivehyperparameters to better values even with the not so goodinitial values In fact this two-phase update can be viewed asa kind of kernel ACD with dynamic 120590 when Gaussian kernelis used
To discuss the performance in deep we plot the testtrajectories resulting from the actors which are optimized byGK-ADP and KHDP in Figures 6(a) and 6(b) respectivelywhere the start state is set to [02 0 minus02 0]
Apparently the transition time of GK-ADP is muchsmaller than KHDP We think besides the possible wellregulated parameters an important reason is nongradientlearning for actor network
The critic learning only depends on exploration-exploita-tion balance but not the convergence of actor learning Ifexploration-exploitation balance is designed without actornetwork output the learning processes of actor and criticnetworks are relatively independent of each other and thenthere are alternatives to gradient learning for actor networkoptimization for example the direct optimum seeking Gaus-sian regression actor network in GK-ADP
Such direct optimum seeking may result in nearly non-smooth actor network just like the force output depicted inthe second plot of Figure 6(a) To explain this phenomenonwe can find the clue from Figure 7 which illustrates the bestactions wrt all sample states according to the final actornetwork It is reasonable that the force direction is always thesame to the pole angle and contrary to the cart displacementAnd clearly the transition interval from negative limit minus3119873 topositive 3119873 is very short Thus the output of actor networkresulting from Gaussian kernels intends to control anglebias as quickly as possible even though such quick controlsometimes makes the output force a little nonsmooth
It is obvious that GK-ADP is with high efficiency butalso with potential risks such as the impact to actuatorsHence how to design exploration-exploitation balance tosatisfy Theorem 2 and how to design actor network learning
to balance efficiency and engineering applicability are twoimportant issues in future work
Finally we check the samples values which are depictedin Figure 8 where only dimensions of pole angle and lineardisplacement are depicted Due to the definition of immedi-ate reward the closer the states to the equilibrium are thesmaller the 119876 value is
If we execute 96 successful policies one by one and recordall final cart displacements and linear speed just as Figure 9shows the shortage of this sample set is uncovered that evenwith good convergence of pole angle the optimal policy canhardly drive cart back to zero point
To solve this problem and improve performance besidesoptimizing learning parameters Remark 4 implies thatanother and maybe more efficient way is refining sample setbased on the samplesrsquo values We will investigate it in thefollowing subsection
43 Online Adjustment of Sample Set Due to learning sam-plersquos value it is possible to assess whether the samples arechosen reasonable In this simulation we adopt the followingexpected utility [26]
119880 (119909) = 120588ℎ (119864 [119876 (119909) | 119883119871])
+120573
2log (var [119876 (119909) | 119883119871])
(40)
where ℎ(sdot) is a normalization function over sample set and120588 = 1 120573 = 03 are coefficients
Let 119873add represent the number of samples added to thesample set 119873limit the size limit of the sample set and 119879(119909) =
119909(0) 119909(1) 119909(119879119890) the state-action pair trajectory during
GK-ADP learning where 119879119890denotes the end iteration of
learning The algorithm to refine sample set is presented asAlgorithm 2
Since in simulation 2 we have carried out 100 runs ofexperiment for GK-ADP 100 sets of sample values wrtthe same sample states are obtained Now let us applyAlgorithm 2 to these resulted sample sets where 119879(119909) ispicked up from the history trajectory of state-action pairs ineach run and 119873add = 10 119873limit = 60 in order to add 10samples into sample set
We repeat all 100 runs of GK-ADP again with the samelearning rates 120578
1 and 1205782 Based on 96 successful runs in
simulation 2 94 runs are successful in obtaining controlpolicies swinging up pole Figure 10(a) shows the averageevolutional process of sample values over 100 runs wherethe first 10000 iterations display the learning process insimulation 2 and the second 8000 iterations display thelearning process after adding samples Clearly the cumulative119876 value behaves a sudden increase after sample was addedand almost keeps steady afterwards It implies even withmore samples added there is not toomuch to learn for samplevalues
However, if we check the cart movement, we find the enhancement brought by the change of sample set. For the j-th run, we have two actor networks resulting from GK-ADP, one before and one after adding samples. Using these actors, we carry out the control tests respectively and collect the absolute values of cart displacement, denoted by S_3^b(j) and S_3^a(j), at the moment the pole has been swung up, for 100 test instants.
Figure 6: The outputs of the actor networks resulting from GK-ADP and KHDP. (a) The resulting control performance using GK-ADP; (b) the resulting control performance using KHDP. Each panel plots pole angle (rad) and angle speed (rad/s), together with force (N), versus time (s).
Figure 7: The final best policies w.r.t. all samples (best action versus pole angle (rad) and linear displacement (m)).
Figure 8: The Q values projected onto the dimensions of pole angle (s1) and linear displacement (s3) w.r.t. all samples.
Figure 9: The final cart displacement (m) and linear speed (m/s) in the test of each successful policy, plotted per run.
We define the enhancement as

E(j) = (S_3^b(j) − S_3^a(j)) / S_3^b(j).  (41)
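As a quick numerical check of (41), consider the small sketch below; the displacement values in it are made-up illustrative numbers, not measurements from the experiments.

```python
def enhancement(s_b, s_a):
    """Relative improvement of Eq. (41): positive when the refined actor leaves
    a smaller absolute cart displacement (s_a) than the original actor (s_b)."""
    return (s_b - s_a) / s_b

# Illustrative values only: 0.8 m of displacement before refinement versus
# 0.2 m after gives an enhancement of (0.8 - 0.2) / 0.8 = 0.75.
print(enhancement(0.8, 0.2))
```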
Obviously, a positive enhancement indicates that the actor network after adding samples behaves better.
for l = 1 to the end of T(x)
    Calculate E[Q(x(l)) | X_L] and var[Q(x(l)) | X_L] using (6) and (7)
    Calculate U(x(l)) using (40)
end for
Sort all U(x(l)), l = 1, ..., T_e, in ascending order
Add the first N_add state-action pairs to the sample set to get the extended set X_{L+N_add}
if L + N_add > N_limit
    Calculate hyperparameters based on X_{L+N_add}
    for l = 1 to L + N_add
        Calculate E[Q(x(l)) | X_{L+N_add}] and var[Q(x(l)) | X_{L+N_add}] using (6) and (7)
        Calculate U(x(l)) using (40)
    end for
    Sort all U(x(l)), l = 1, ..., L + N_add, in descending order
    Delete the first L + N_add − N_limit samples to get the refined set X_{N_limit}
end if

Algorithm 2: Refinement of sample set.
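A minimal Python sketch of Algorithm 2 follows. It assumes two illustrative helpers that the listing relies on but does not spell out: utility_fn(x, X), returning U(x) of (40) from the GP posterior (6) and (7) conditioned on sample set X, and fit_hyperparameters(X), re-running the hyperparameter optimization; both names are our assumptions, not the authors' API.

```python
import numpy as np

def refine_sample_set(X_L, trajectory, utility_fn, fit_hyperparameters,
                      n_add=10, n_limit=60):
    """Sketch of Algorithm 2: add the n_add visited state-action pairs with the
    lowest expected utility, then prune the set back down to n_limit samples."""
    trajectory = np.asarray(trajectory)

    # Score every visited pair against the current sample set; ascending sort
    # picks out the n_add pairs with the lowest expected utility to add.
    u_traj = np.array([utility_fn(x, X_L) for x in trajectory])
    X_ext = np.vstack([X_L, trajectory[np.argsort(u_traj)[:n_add]]])

    if len(X_ext) > n_limit:
        fit_hyperparameters(X_ext)  # re-optimize hyperparameters on extended set
        u_ext = np.array([utility_fn(x, X_ext) for x in X_ext])
        # Descending sort: delete the (L + n_add - n_limit) highest-utility samples.
        drop = np.argsort(u_ext)[::-1][: len(X_ext) - n_limit]
        X_ext = np.delete(X_ext, drop, axis=0)
    return X_ext
```

In the simulation described above, this routine would be invoked once per run with n_add = 10 and n_limit = 60, using the run's history trajectory as T(x).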
Figure 10: Learning performance of GK-ADP when the sample set is refined by adding 10 samples. (a) The average evolution process of the sum of Q values of all sample state-actions, before and after adding samples; (b) the enhancement of the control performance regarding cart displacement after adding samples.
As all enhancements w.r.t. the 100 runs are illustrated together in Figure 10(b), almost all cart displacements at the moment the pole is swung up are enhanced more or less, except for 8 failures. Hence, with a proper principle based on the sample values, the sample set can be refined online in order to enhance performance.
Finally, let us check which state-action pairs are added into the sample set. We put all added samples over the 100 runs together and depict their values in Figure 11(a), where the values are projected onto the dimensions of pole angle and cart displacement. If we only consider the relationship between Q values and pole angle, as Figure 11(b) shows, we find that the refinement principle tends to select samples a little away from the equilibrium and near the boundary −0.4, due to the ratio between ρ and β.
5. Conclusions and Future Work

ADP methods are among the most promising approaches to RL in continuous environments for constructing learning systems for nonlinear optimal control. This paper presents GK-ADP with two-phase value iteration, which combines the advantages of kernel ACDs and GP-based value iteration.

The theoretical analysis reveals that, with proper learning rates, two-phase iteration makes the Gaussian-kernel-based critic network converge to the structure with optimal hyperparameters while approximating all samples' values.

A series of simulations verify the necessity of two-phase learning and illustrate the properties of GK-ADP. Finally, the numerical tests support the viewpoint that the assessment of samples' values provides a way to refine the sample set online, in order to enhance the performance of the critic-actor architecture during operation.
Figure 11: The Q values w.r.t. all added samples over 100 runs. (a) Projection on the dimensions of pole angle and cart displacement; (b) projection on the dimension of pole angle.
However, some issues need attention in future work. The first is how to guarantee condition (36) during learning, which is currently handled by empirical means. The second is the balance between exploration and exploitation, which is always an open question but seems particularly notable here, because a bad exploration-exploitation principle will lead two-phase iteration to failure, not merely to slow convergence.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This work is supported by the National Natural Science Foundation of China under Grants 61473316 and 61202340 and the International Postdoctoral Exchange Fellowship Program under Grant no. 20140011.
References
[1] R. Sutton and A. Barto, Reinforcement Learning: An Introduction, Adaptive Computation and Machine Learning, MIT Press, Cambridge, Mass, USA, 1998.
[2] F. L. Lewis and D. Vrabie, "Reinforcement learning and adaptive dynamic programming for feedback control," IEEE Circuits and Systems Magazine, vol. 9, no. 3, pp. 32–50, 2009.
[3] F. L. Lewis, D. Vrabie, and K. Vamvoudakis, "Reinforcement learning and feedback control: using natural decision methods to design optimal adaptive controllers," IEEE Control Systems Magazine, vol. 32, no. 6, pp. 76–105, 2012.
[4] H. Zhang, Y. Luo, and D. Liu, "Neural-network-based near-optimal control for a class of discrete-time affine nonlinear systems with control constraints," IEEE Transactions on Neural Networks, vol. 20, no. 9, pp. 1490–1503, 2009.
[5] Y. Huang and D. Liu, "Neural-network-based optimal tracking control scheme for a class of unknown discrete-time nonlinear systems using iterative ADP algorithm," Neurocomputing, vol. 125, pp. 46–56, 2014.
[6] X. Xu, L. Zuo, and Z. Huang, "Reinforcement learning algorithms with function approximation: recent advances and applications," Information Sciences, vol. 261, pp. 1–31, 2014.
[7] A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, "Discrete-time nonlinear HJB solution using approximate dynamic programming: convergence proof," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 38, no. 4, pp. 943–949, 2008.
[8] S. Ferrari, J. E. Steck, and R. Chandramohan, "Adaptive feedback control by constrained approximate dynamic programming," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 38, no. 4, pp. 982–987, 2008.
[9] F.-Y. Wang, H. Zhang, and D. Liu, "Adaptive dynamic programming: an introduction," IEEE Computational Intelligence Magazine, vol. 4, no. 2, pp. 39–47, 2009.
[10] D. Wang, D. Liu, Q. Wei, D. Zhao, and N. Jin, "Optimal control of unknown nonaffine nonlinear discrete-time systems based on adaptive dynamic programming," Automatica, vol. 48, no. 8, pp. 1825–1832, 2012.
[11] D. Vrabie and F. Lewis, "Neural network approach to continuous-time direct adaptive optimal control for partially unknown nonlinear systems," Neural Networks, vol. 22, no. 3, pp. 237–246, 2009.
[12] D. Liu, D. Wang, and X. Yang, "An iterative adaptive dynamic programming algorithm for optimal control of unknown discrete-time nonlinear systems with constrained inputs," Information Sciences, vol. 220, pp. 331–342, 2013.
[13] N. Jin, D. Liu, T. Huang, and Z. Pang, "Discrete-time adaptive dynamic programming using wavelet basis function neural networks," in Proceedings of the IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning, pp. 135–142, Honolulu, Hawaii, USA, April 2007.
[14] P. Koprinkova-Hristova, M. Oubbati, and G. Palm, "Adaptive critic design with echo state network," in Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (SMC '10), pp. 1010–1015, IEEE, Istanbul, Turkey, October 2010.
[15] X. Xu, Z. Hou, C. Lian, and H. He, "Online learning control using adaptive critic designs with sparse kernel machines," IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 5, pp. 762–775, 2013.
[16] V. Vapnik, Statistical Learning Theory, Wiley, New York, NY, USA, 1998.
[17] C. E. Rasmussen and C. K. Williams, Gaussian Processes for Machine Learning, Adaptive Computation and Machine Learning, MIT Press, Cambridge, Mass, USA, 2006.
[18] Y. Engel, S. Mannor, and R. Meir, "The kernel recursive least-squares algorithm," IEEE Transactions on Signal Processing, vol. 52, no. 8, pp. 2275–2285, 2004.
[19] T. G. Dietterich and X. Wang, "Batch value function approximation via support vectors," in Advances in Neural Information Processing Systems, pp. 1491–1498, MIT Press, Cambridge, Mass, USA, 2002.
[20] X. Wang, X. Tian, Y. Cheng, and J. Yi, "Q-learning system based on cooperative least squares support vector machine," Acta Automatica Sinica, vol. 35, no. 2, pp. 214–219, 2009.
[21] T. Hofmann, B. Schölkopf, and A. J. Smola, "Kernel methods in machine learning," The Annals of Statistics, vol. 36, no. 3, pp. 1171–1220, 2008.
[22] M. F. Huber, "Recursive Gaussian process: on-line regression and learning," Pattern Recognition Letters, vol. 45, no. 1, pp. 85–91, 2014.
[23] Y. Engel, S. Mannor, and R. Meir, "Bayes meets Bellman: the Gaussian process approach to temporal difference learning," in Proceedings of the 20th International Conference on Machine Learning, pp. 154–161, August 2003.
[24] Y. Engel, S. Mannor, and R. Meir, "Reinforcement learning with Gaussian processes," in Proceedings of the 22nd International Conference on Machine Learning, pp. 201–208, ACM, August 2005.
[25] C. E. Rasmussen and M. Kuss, "Gaussian processes in reinforcement learning," in Advances in Neural Information Processing Systems, S. Thrun, L. K. Saul, and B. Schölkopf, Eds., pp. 751–759, MIT Press, Cambridge, Mass, USA, 2004.
[26] M. P. Deisenroth, C. E. Rasmussen, and J. Peters, "Gaussian process dynamic programming," Neurocomputing, vol. 72, no. 7–9, pp. 1508–1524, 2009.
[27] X. Xu, C. Lian, L. Zuo, and H. He, "Kernel-based approximate dynamic programming for real-time online learning control: an experimental study," IEEE Transactions on Control Systems Technology, vol. 22, no. 1, pp. 146–156, 2014.
[28] A. Girard, C. E. Rasmussen, and R. Murray-Smith, "Gaussian process priors with uncertain inputs: multiple-step-ahead prediction," Tech. Rep., 2002.
[29] X. Xu, T. Xie, D. Hu, and X. Lu, "Kernel least-squares temporal difference learning," International Journal of Information Technology, vol. 11, no. 9, pp. 54–63, 2005.
[30] H. J. Kushner and G. G. Yin, Stochastic Approximation and Recursive Algorithms and Applications, vol. 35 of Applications of Mathematics, Springer, Berlin, Germany, 2nd edition, 2003.
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical Problems in Engineering
Hindawi Publishing Corporationhttpwwwhindawicom
Differential EquationsInternational Journal of
Volume 2014
Applied MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
OptimizationJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Operations ResearchAdvances in
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Function Spaces
Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of Mathematics and Mathematical Sciences
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Algebra
Discrete Dynamics in Nature and Society
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Decision SciencesAdvances in
Discrete MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Stochastic AnalysisInternational Journal of
10 Mathematical Problems in Engineering
[Figure 5: The success rates w.r.t. all algorithms over 100 runs. x-axis: parameter σ in ALD-based sparsification (0.9 to 2.4); y-axis: success rate over 100 runs. The compared settings score 67, 68, 76, 75, 58, and 57, against 96 for GK-ADP.]
On the contrary, the two-phase iteration introduces the update of hyperparameters into the critic network, which is able to drive the hyperparameters to better values even from not-so-good initial values. In fact, this two-phase update can be viewed as a kind of kernel ACD with a dynamic σ when the Gaussian kernel is used.
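To make the notion of a dynamic kernel width concrete, here is a minimal Python sketch (the function name and the single scalar sigma are our simplifications; the paper's critic learns its hyperparameters in the second phase rather than fixing them by design):

import numpy as np

def gaussian_kernel(x, x_prime, sigma):
    # Gaussian (RBF) kernel whose width sigma is treated as a learnable
    # quantity; in GK-ADP the second phase of the iteration keeps updating
    # such hyperparameters while the first phase updates the sample values.
    diff = np.asarray(x, dtype=float) - np.asarray(x_prime, dtype=float)
    return float(np.exp(-diff.dot(diff) / (2.0 * sigma ** 2)))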
To examine the performance in more depth, we plot the test trajectories resulting from the actors optimized by GK-ADP and KHDP in Figures 6(a) and 6(b), respectively, where the start state is set to [0.2, 0, −0.2, 0].
Apparently, the transition time of GK-ADP is much shorter than that of KHDP. Besides possibly better-regulated parameters, we believe an important reason is the nongradient learning used for the actor network.
The critic learning depends only on the exploration-exploitation balance, not on the convergence of actor learning. If the exploration-exploitation balance is designed without using the actor network output, the learning processes of the actor and critic networks are relatively independent of each other, and then there are alternatives to gradient learning for actor network optimization, for example, the direct optimum-seeking Gaussian regression actor network in GK-ADP.
Such direct optimum seeking may result in a nearly nonsmooth actor network, just like the force output depicted in the second plot of Figure 6(a). To explain this phenomenon, we can find the clue in Figure 7, which illustrates the best actions w.r.t. all sample states according to the final actor network. It is reasonable that the force direction always follows the pole angle and opposes the cart displacement. Clearly, the transition interval from the negative limit −3 N to the positive limit 3 N is very short. Thus the output of the actor network resulting from Gaussian kernels tends to correct the angle bias as quickly as possible, even though such quick control sometimes makes the output force a little nonsmooth.
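As an illustration of what direct optimum seeking can look like (the paper does not spell out its actor's implementation, so the grid search and the names below are our assumptions), the actor can evaluate the critic's predicted Q over a grid of admissible forces and return the optimizer; with the cost-like Q used here, smaller is better:

import numpy as np

def best_action(state, q_predict, f_limit=3.0, n_grid=61):
    # Evaluate the critic's predicted Q over candidate forces in
    # [-f_limit, f_limit] (the +/-3 N range of the simulation) and
    # return the force with the smallest predicted Q. The coarseness
    # of the search is one source of the nonsmooth force output.
    forces = np.linspace(-f_limit, f_limit, n_grid)
    q_vals = np.array([q_predict(state, f) for f in forces])
    return float(forces[int(np.argmin(q_vals))])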
Obviously, GK-ADP achieves high efficiency but also carries potential risks, such as the impact on actuators. Hence, how to design the exploration-exploitation balance so as to satisfy Theorem 2, and how to design actor network learning so as to balance efficiency and engineering applicability, are two important issues for future work.
Finally, we check the sample values, which are depicted in Figure 8, where only the dimensions of pole angle and linear displacement are shown. Due to the definition of the immediate reward, the closer the states are to the equilibrium, the smaller the Q value is.
If we execute the 96 successful policies one by one and record all final cart displacements and linear speeds, just as Figure 9 shows, a shortage of this sample set is uncovered: even with good convergence of the pole angle, the optimal policy can hardly drive the cart back to the zero point.
To solve this problem and improve performance, besides optimizing the learning parameters, Remark 4 implies that another, and maybe more efficient, way is to refine the sample set based on the samples' values. We investigate this in the following subsection.
4.3. Online Adjustment of Sample Set. Since the samples' values are learned, it is possible to assess whether the samples are chosen reasonably. In this simulation we adopt the following expected utility [26]:

$$U(x) = \rho\, h\left(E[Q(x) \mid X_L]\right) + \frac{\beta}{2} \log\left(\mathrm{var}[Q(x) \mid X_L]\right), \qquad (40)$$

where h(·) is a normalization function over the sample set, and ρ = 1, β = 0.3 are coefficients.
Let $N_{\mathrm{add}}$ denote the number of samples added to the sample set, $N_{\mathrm{limit}}$ the size limit of the sample set, and $T(x) = \{x(0), x(1), \ldots, x(T_e)\}$ the state-action pair trajectory during GK-ADP learning, where $T_e$ denotes the end iteration of learning. The procedure to refine the sample set is presented as Algorithm 2.
Since in simulation 2 we carried out 100 runs of the experiment for GK-ADP, 100 sets of sample values w.r.t. the same sample states were obtained. Now let us apply Algorithm 2 to these resulting sample sets, where $T(x)$ is picked up from the history trajectory of state-action pairs in each run, and $N_{\mathrm{add}} = 10$, $N_{\mathrm{limit}} = 60$, in order to add 10 samples to the sample set.
We repeat all 100 runs of GK-ADP again with the same learning rates $\eta_1$ and $\eta_2$. Compared with the 96 successful runs in simulation 2, 94 runs now succeed in obtaining control policies that swing up the pole. Figure 10(a) shows the average evolution of the sample values over 100 runs, where the first 10000 iterations display the learning process of simulation 2 and the subsequent 8000 iterations display the learning process after adding samples. Clearly, the cumulative Q value exhibits a sudden increase after the samples are added and remains almost steady afterwards. This implies that, even with more samples added, there is not much left to learn about the sample values.
However, if we check the cart movement, we find the enhancement brought by the change of the sample set. For the j-th run we have two actor networks resulting from GK-ADP, before and after adding samples. Using these actors, we carry out control tests respectively and collect the absolute
[Figure 6: The outputs of the actor networks resulting from GK-ADP and KHDP. (a) The resulting control performance using GK-ADP; (b) the resulting control performance using KHDP. Each panel plots pole angle (rad) and speed (rad/s), and force (N), against time (s).]
[Figure 7: The final best policies w.r.t. all samples; best action plotted against pole angle (rad) and linear displacement (m).]
[Figure 8: The Q values projected onto the dimensions of pole angle (s1) and linear displacement (s3) w.r.t. all samples.]
[Figure 9: The final cart displacement (m) and linear speed (m/s) in the test of each successful policy.]
values of cart displacement, denoted by $S_3^b(j)$ and $S_3^a(j)$, at the moment that the pole has been swung up, over 100 instants. We define the enhancement as

$$E(j) = \frac{S_3^b(j) - S_3^a(j)}{S_3^b(j)}. \qquad (41)$$
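Per run, this is just a relative reduction; a sketch of the bookkeeping (names ours):

def enhancement(s3_before, s3_after):
    # E(j) = (S3^b(j) - S3^a(j)) / S3^b(j): the relative reduction of the
    # absolute cart displacement after refining the sample set; a positive
    # value means the refined actor keeps the cart closer to the origin
    # at the moment the pole is swung up.
    return (s3_before - s3_after) / s3_before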
Obviously, a positive enhancement indicates that the actor network behaves better after adding samples. All enhancements w.r.t. the 100 runs are illustrated together in Figure 10(b).
for l = 1 to the end of T(x)
  Calculate E[Q(x(l)) | X_L] and var[Q(x(l)) | X_L] using (6) and (7)
  Calculate U(x(l)) using (40)
end for
Sort all U(x(l)), l = 1, ..., T_e, in ascending order
Add the first N_add state-action pairs to the sample set to get the extended set X_{L+N_add}
if L + N_add > N_limit
  Calculate the hyperparameters based on X_{L+N_add}
  for l = 1 to L + N_add
    Calculate E[Q(x(l)) | X_{L+N_add}] and var[Q(x(l)) | X_{L+N_add}] using (6) and (7)
    Calculate U(x(l)) using (40)
  end for
  Sort all U(x(l)), l = 1, ..., L + N_add, in descending order
  Delete the first L + N_add − N_limit samples to get the refined set X_{N_limit}
end if

Algorithm 2: Refinement of the sample set.
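Read as data flow, Algorithm 2 is two ranked passes over the expected utility: an ascending pass over the trajectory to choose additions, then, if the size limit is exceeded, a re-estimation and a descending pass over the extended set to choose deletions. A minimal Python sketch under our own data layout, reusing expected_utility from the earlier sketch and a hypothetical gp_posterior(x, sample_set) standing in for the posterior formulas (6) and (7):

import numpy as np

def refine_sample_set(samples, trajectory, gp_posterior, n_add=10, n_limit=60):
    def utilities(points, sample_set):
        # Batch the GP posterior so that the normalization h in (40) is
        # taken over the whole candidate batch, then score with (40).
        means, variances = zip(*(gp_posterior(x, sample_set) for x in points))
        return expected_utility(np.array(means), np.array(variances))

    # Ascending order over the trajectory; add the first n_add pairs.
    order = np.argsort(utilities(trajectory, samples))
    extended = list(samples) + [trajectory[i] for i in order[:n_add]]

    if len(extended) > n_limit:
        # Hyperparameters would be re-estimated on the extended set here.
        order_desc = np.argsort(utilities(extended, extended))[::-1]
        surplus = len(extended) - n_limit
        # Delete the first `surplus` samples of the descending ranking.
        extended = [extended[i] for i in order_desc[surplus:]]
    return extended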
[Figure 10: Learning performance of GK-ADP when the sample set is refined by adding 10 samples. (a) The average evolution of the sum of Q values of all sample state-actions before and after adding samples (iterations 0-18000, sum rising from about 100 to 800); (b) the enhancement of the control performance regarding cart displacement after adding samples, per run.]
As Figure 10(b) shows, almost all cart displacements at the moment the pole is swung up are enhanced to some extent, except for 8 failures. Hence, with a proper principle based on the sample values, the sample set can be refined online so as to enhance performance.
Finally, let us check which state-action pairs are added into the sample set. We put all samples added over the 100 runs together and depict their values in Figure 11(a), where the values are projected onto the dimensions of pole angle and cart displacement. Concerning only the relationship between Q values and pole angle, just as Figure 11(b) shows, we find that the refinement principle tends to select samples a little away from the equilibrium and near the boundary −0.4, due to the ratio between ρ and β.
5. Conclusions and Future Work

ADP methods are among the most promising research directions on RL in continuous environments for constructing learning systems for nonlinear optimal control. This paper presents GK-ADP with two-phase value iteration, which combines the advantages of kernel ACDs and GP-based value iteration.

The theoretical analysis reveals that, with proper learning rates, two-phase iteration is good at making the Gaussian-kernel-based critic network converge to the structure with optimal hyperparameters and approximate all samples' values.

A series of simulations are carried out to verify the necessity of two-phase learning and to illustrate the properties of GK-ADP. Finally, the numerical tests support the viewpoint that the assessment of samples' values provides a way to refine the sample set online, and hence to enhance the performance of the critic-actor architecture during operation.
[Figure 11: The Q values w.r.t. all added samples over 100 runs. (a) Projection onto the dimensions of pole angle and cart displacement; (b) projection onto the dimension of pole angle.]
However, some issues need to be addressed in the future. The first is how to guarantee condition (36) during learning, which is currently handled in empirical ways. The second is the balance between exploration and exploitation, which is always an open question but seems more notable here, because a bad exploration-exploitation principle will lead two-phase iteration not only to slow convergence but to failure.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This work is supported by the National Natural Science Foundation of China under Grants 61473316 and 61202340, and the International Postdoctoral Exchange Fellowship Program under Grant no. 20140011.
References
[1] R. Sutton and A. Barto, Reinforcement Learning: An Introduction, Adaptive Computation and Machine Learning, MIT Press, Cambridge, Mass, USA, 1998.
[2] F. L. Lewis and D. Vrabie, "Reinforcement learning and adaptive dynamic programming for feedback control," IEEE Circuits and Systems Magazine, vol. 9, no. 3, pp. 32–50, 2009.
[3] F. L. Lewis, D. Vrabie, and K. Vamvoudakis, "Reinforcement learning and feedback control: using natural decision methods to design optimal adaptive controllers," IEEE Control Systems Magazine, vol. 32, no. 6, pp. 76–105, 2012.
[4] H. Zhang, Y. Luo, and D. Liu, "Neural-network-based near-optimal control for a class of discrete-time affine nonlinear systems with control constraints," IEEE Transactions on Neural Networks, vol. 20, no. 9, pp. 1490–1503, 2009.
[5] Y. Huang and D. Liu, "Neural-network-based optimal tracking control scheme for a class of unknown discrete-time nonlinear systems using iterative ADP algorithm," Neurocomputing, vol. 125, pp. 46–56, 2014.
[6] X. Xu, L. Zuo, and Z. Huang, "Reinforcement learning algorithms with function approximation: recent advances and applications," Information Sciences, vol. 261, pp. 1–31, 2014.
[7] A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, "Discrete-time nonlinear HJB solution using approximate dynamic programming: convergence proof," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 38, no. 4, pp. 943–949, 2008.
[8] S. Ferrari, J. E. Steck, and R. Chandramohan, "Adaptive feedback control by constrained approximate dynamic programming," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 38, no. 4, pp. 982–987, 2008.
[9] F.-Y. Wang, H. Zhang, and D. Liu, "Adaptive dynamic programming: an introduction," IEEE Computational Intelligence Magazine, vol. 4, no. 2, pp. 39–47, 2009.
[10] D. Wang, D. Liu, Q. Wei, D. Zhao, and N. Jin, "Optimal control of unknown nonaffine nonlinear discrete-time systems based on adaptive dynamic programming," Automatica, vol. 48, no. 8, pp. 1825–1832, 2012.
[11] D. Vrabie and F. Lewis, "Neural network approach to continuous-time direct adaptive optimal control for partially unknown nonlinear systems," Neural Networks, vol. 22, no. 3, pp. 237–246, 2009.
[12] D. Liu, D. Wang, and X. Yang, "An iterative adaptive dynamic programming algorithm for optimal control of unknown discrete-time nonlinear systems with constrained inputs," Information Sciences, vol. 220, pp. 331–342, 2013.
[13] N. Jin, D. Liu, T. Huang, and Z. Pang, "Discrete-time adaptive dynamic programming using wavelet basis function neural networks," in Proceedings of the IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning, pp. 135–142, Honolulu, Hawaii, USA, April 2007.
[14] P. Koprinkova-Hristova, M. Oubbati, and G. Palm, "Adaptive critic design with echo state network," in Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (SMC '10), pp. 1010–1015, IEEE, Istanbul, Turkey, October 2010.
[15] X. Xu, Z. Hou, C. Lian, and H. He, "Online learning control using adaptive critic designs with sparse kernel machines," IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 5, pp. 762–775, 2013.
[16] V. Vapnik, Statistical Learning Theory, Wiley, New York, NY, USA, 1998.
[17] C. E. Rasmussen and C. K. Williams, Gaussian Processes for Machine Learning, Adaptive Computation and Machine Learning, MIT Press, Cambridge, Mass, USA, 2006.
[18] Y. Engel, S. Mannor, and R. Meir, "The kernel recursive least-squares algorithm," IEEE Transactions on Signal Processing, vol. 52, no. 8, pp. 2275–2285, 2004.
[19] T. G. Dietterich and X. Wang, "Batch value function approximation via support vectors," in Advances in Neural Information Processing Systems, pp. 1491–1498, MIT Press, Cambridge, Mass, USA, 2002.
[20] X. Wang, X. Tian, Y. Cheng, and J. Yi, "Q-learning system based on cooperative least squares support vector machine," Acta Automatica Sinica, vol. 35, no. 2, pp. 214–219, 2009.
[21] T. Hofmann, B. Schölkopf, and A. J. Smola, "Kernel methods in machine learning," The Annals of Statistics, vol. 36, no. 3, pp. 1171–1220, 2008.
[22] M. F. Huber, "Recursive Gaussian process: on-line regression and learning," Pattern Recognition Letters, vol. 45, no. 1, pp. 85–91, 2014.
[23] Y. Engel, S. Mannor, and R. Meir, "Bayes meets Bellman: the Gaussian process approach to temporal difference learning," in Proceedings of the 20th International Conference on Machine Learning, pp. 154–161, August 2003.
[24] Y. Engel, S. Mannor, and R. Meir, "Reinforcement learning with Gaussian processes," in Proceedings of the 22nd International Conference on Machine Learning, pp. 201–208, ACM, August 2005.
[25] C. E. Rasmussen and M. Kuss, "Gaussian processes in reinforcement learning," in Advances in Neural Information Processing Systems, S. Thrun, L. K. Saul, and B. Schölkopf, Eds., pp. 751–759, MIT Press, Cambridge, Mass, USA, 2004.
[26] M. P. Deisenroth, C. E. Rasmussen, and J. Peters, "Gaussian process dynamic programming," Neurocomputing, vol. 72, no. 7–9, pp. 1508–1524, 2009.
[27] X. Xu, C. Lian, L. Zuo, and H. He, "Kernel-based approximate dynamic programming for real-time online learning control: an experimental study," IEEE Transactions on Control Systems Technology, vol. 22, no. 1, pp. 146–156, 2014.
[28] A. Girard, C. E. Rasmussen, and R. Murray-Smith, "Gaussian process priors with uncertain inputs: multiple-step-ahead prediction," Tech. Rep., 2002.
[29] X. Xu, T. Xie, D. Hu, and X. Lu, "Kernel least-squares temporal difference learning," International Journal of Information Technology, vol. 11, no. 9, pp. 54–63, 2005.
[30] H. J. Kushner and G. G. Yin, Stochastic Approximation and Recursive Algorithms and Applications, vol. 35 of Applications of Mathematics, Springer, Berlin, Germany, 2nd edition, 2003.
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical Problems in Engineering
Hindawi Publishing Corporationhttpwwwhindawicom
Differential EquationsInternational Journal of
Volume 2014
Applied MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
OptimizationJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Operations ResearchAdvances in
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Function Spaces
Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of Mathematics and Mathematical Sciences
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Algebra
Discrete Dynamics in Nature and Society
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Decision SciencesAdvances in
Discrete MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Stochastic AnalysisInternational Journal of
Mathematical Problems in Engineering 11
0 1 2 3 4 5minus15
minus1
minus05
0
05
Time (s)
0 1 2 3 4 5Time (s)
minus2
0
2
4
Forc
e (N
)
Angle displacementAngle speed
Pole
angl
e (ra
d) an
d sp
eed
(rad
s)
(a) The resulted control performance using GK-ADP
0 1 2 3 4 5Time (s)
0
0
1 2 3 4 5Time (s)
Angle displacementAngle speed
minus1
0
1
2
3
Forc
e (N
)
minus1
minus05
05
Pole
angl
e (ra
d) an
d sp
eed
(rad
s)
(b) The resulted control performance using KHDP
Figure 6 The outputs of the actor networks resulting from GK-ADP and KHDP
minus04 minus02 0 0204
minus04minus02
002
04minus4
minus2
0
2
4
Pole angle (rad)
Linear displacement (m)
Best
actio
n
Figure 7 The final best policies wrt all samples
minus04 minus020 02
04
minus04minus02
002
04minus10
0
10
20
30
40
Pole angle (rad)Linear displacement (m)
Q v
alue
Figure 8The119876 values projected on to dimensions of pole angle (1199041)and linear displacement (1199043) wrt all samples
0 20 40 60 80 100
0
05
1
15
2
Run
Cart
disp
lace
men
t (m
) and
lini
ear s
peed
(ms
)
Cart displacementLinear speed
Figure 9 The final cart displacement and linear speed in the test ofsuccessful policy
values of cart displacement denoted by 1198781198873(119895) and 119878119886
3(119895) at themoment that the pole has been swung up for 100 instants Wedefine the enhancement as
119864 (119895) =119878119887
3 (119895) minus 119878119886
3 (119895)
1198781198873 (119895) (41)
Obviously the positive enhancement indicates that theactor network after adding samples behaves better As allenhancements wrt 100 runs are illustrated together just
12 Mathematical Problems in Engineering
119891119900119903 l = 1 119905119900 119905ℎ119890 119890119899119889 119900119891 119879(119909)
Calculate 119864[119876(119909(119897)) | 119883119871] and var[119876(119909119897) | 119883
119871] using (6) and (7)
Calculate 119880(119909(119897)) using (40)119890119899119889 119891119900119903 119897
Sort all 119880(119909(119897)) 119897 = 1 119879119890in ascending order
Add the first119873add state-action pairs to sample set to get extended set119883119871+119873add
119894119891 L +119873add gt 119873limit
Calculate hyperparameters based on 119883119871+119873add
119891119900119903 l = 1 119905119900 119871 +119873add
Calculate 119864[119876(119909(119897)) | 119883119871+119873add
] and var[119876(119909119897) | 119883119871+119873add
] using (6) and (7)Calculate 119880(119909(119897)) using (40)
119890119899119889 119891119900119903 119897
Sort all 119880(119909(119897)) 119897 = 1 119871 + 119873add in descending orderDelete the first 119871 +119873add minus 119873limit samples to get refined set 119883
119873limit
119890119899119889 119894119891
Algorithm 2 Refinement of sample set
0 2000 4000 6000 8000 10000 12000 14000 16000 18000100
200
300
400
500
600
700
800
Iteration
The s
um o
f Q v
alue
s of a
ll sa
mpl
e sta
te-a
ctio
ns
(a) The average evolution process of 119876 values before and after addingsamples
0 20 40 60 80 100minus15
minus1
minus05
0
05
1
Run
Enha
ncem
ent (
)
(b) The enhancement of the control performance about cart displacementafter adding samples
Figure 10 Learning performance of GK-ADP if sample set is refined by adding 10 samples
as Figure 10(b) shows almost all cart displacements as thepole is swung up are enhanced more or less except for 8times of failure Hence with proper principle based on thesample values the sample set can be refined online in orderto enhance performance
Finally let us check which state-action pairs are addedinto sample set We put all added samples over 100 runstogether and depict their values in Figure 11(a) where thevalues are projected onto the dimensions of pole angleand cart displacement If only concerning the relationshipbetween 119876 values and pole angle just as Figure 11(b) showswe find that the refinement principle intends to select thesamples a little away from the equilibrium and near theboundary minus04 due to the ratio between 120588 and 120573
5 Conclusions and Future Work
ADPmethods are among the most promising research workson RL in continuous environment to construct learningsystems for nonlinear optimal control This paper presentsGK-ADP with two-phase value iteration which combines theadvantages of kernel ACDs and GP-based value iteration
The theoretical analysis reveals that with proper learningrates two-phase iteration is good atmakingGaussian-kernel-based critic network converge to the structure with optimalhyperparameters and approximate all samplesrsquo values
A series of simulations are carried out to verify thenecessity of two-phase learning and illustrate properties ofGK-ADP Finally the numerical tests support the viewpointthat the assessment of samplesrsquo values provides the way to
Mathematical Problems in Engineering 13
minus04minus02
002
04
minus04minus0200204minus10
0
10
20
30
40
Pole angle (rad)
Q v
alue
s
Cart
displac
emen
t (m)
(a) Projecting on the dimensions of pole angle and cart displacement
minus04 minus03 minus02 minus01 0 01 02 03 04minus5
0
5
10
15
20
25
30
35
40
Pole angle (rad)
Q v
alue
s
(b) Projecting on the dimension of pole angle
Figure 11 The 119876 values wrt all added samples over 100 runs
refine sample set online in order to enhance the performanceof critic-actor architecture during operation
However there are some issues needed to be concernedin future The first is how to guarantee condition (36) duringlearning which is now determined by empirical ways Thesecond is the balance between exploration and exploitationwhich is always an opening question but seemsmore notablehere because bad exploration-exploitation principle will leadtwo-phase iteration to failure not only to slow convergence
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgments
Thiswork is supported by theNational Natural Science Foun-dation of China under Grants 61473316 and 61202340 andthe International Postdoctoral Exchange Fellowship Programunder Grant no 20140011
References
[1] R Sutton and A Barto Reinforcement Learning An Introduc-tion Adaptive Computation andMachine LearningMIT PressCambridge Mass USA 1998
[2] F L Lewis andD Vrabie ldquoReinforcement learning and adaptivedynamic programming for feedback controlrdquo IEEE Circuits andSystems Magazine vol 9 no 3 pp 32ndash50 2009
[3] F L Lewis D Vrabie and K Vamvoudakis ldquoReinforcementlearning and feedback control using natural decision methodsto design optimal adaptive controllersrdquo IEEE Control SystemsMagazine vol 32 no 6 pp 76ndash105 2012
[4] H Zhang Y Luo and D Liu ldquoNeural-network-based near-optimal control for a class of discrete-time affine nonlinearsystems with control constraintsrdquo IEEE Transactions on NeuralNetworks vol 20 no 9 pp 1490ndash1503 2009
[5] Y Huang and D Liu ldquoNeural-network-based optimal trackingcontrol scheme for a class of unknown discrete-time nonlinear
systems using iterative ADP algorithmrdquo Neurocomputing vol125 pp 46ndash56 2014
[6] X Xu L Zuo and Z Huang ldquoReinforcement learning algo-rithms with function approximation recent advances andapplicationsrdquo Information Sciences vol 261 pp 1ndash31 2014
[7] A Al-Tamimi F L Lewis and M Abu-Khalaf ldquoDiscrete-timenonlinear HJB solution using approximate dynamic program-ming convergence proofrdquo IEEE Transactions on Systems Manand Cybernetics Part B Cybernetics vol 38 no 4 pp 943ndash9492008
[8] S Ferrari J E Steck andRChandramohan ldquoAdaptive feedbackcontrol by constrained approximate dynamic programmingrdquoIEEE Transactions on Systems Man and Cybernetics Part BCybernetics vol 38 no 4 pp 982ndash987 2008
[9] F-Y Wang H Zhang and D Liu ldquoAdaptive dynamic pro-gramming an introductionrdquo IEEE Computational IntelligenceMagazine vol 4 no 2 pp 39ndash47 2009
[10] D Wang D Liu Q Wei D Zhao and N Jin ldquoOptimal controlof unknown nonaffine nonlinear discrete-time systems basedon adaptive dynamic programmingrdquo Automatica vol 48 no 8pp 1825ndash1832 2012
[11] D Vrabie and F Lewis ldquoNeural network approach tocontinuous-time direct adaptive optimal control for partiallyunknown nonlinear systemsrdquo Neural Networks vol 22 no 3pp 237ndash246 2009
[12] D Liu D Wang and X Yang ldquoAn iterative adaptive dynamicprogramming algorithm for optimal control of unknowndiscrete-time nonlinear systemswith constrained inputsrdquo Infor-mation Sciences vol 220 pp 331ndash342 2013
[13] N Jin D Liu T Huang and Z Pang ldquoDiscrete-time adaptivedynamic programming using wavelet basis function neural net-worksrdquo in Proceedings of the IEEE Symposium on ApproximateDynamic Programming and Reinforcement Learning pp 135ndash142 Honolulu Hawaii USA April 2007
[14] P Koprinkova-Hristova M Oubbati and G Palm ldquoAdaptivecritic designwith echo state networkrdquo in Proceedings of the IEEEInternational Conference on SystemsMan andCybernetics (SMCrsquo10) pp 1010ndash1015 IEEE Istanbul Turkey October 2010
[15] X Xu Z Hou C Lian and H He ldquoOnline learning controlusing adaptive critic designs with sparse kernelmachinesrdquo IEEE
14 Mathematical Problems in Engineering
Transactions on Neural Networks and Learning Systems vol 24no 5 pp 762ndash775 2013
[16] V Vapnik Statistical Learning Theory Wiley New York NYUSA 1998
[17] C E Rasmussen and C K Williams Gaussian Processesfor Machine Learning Adaptive Computation and MachineLearning MIT Press Cambridge Mass USA 2006
[18] Y Engel S Mannor and R Meir ldquoThe kernel recursive least-squares algorithmrdquo IEEE Transactions on Signal Processing vol52 no 8 pp 2275ndash2285 2004
[19] T G Dietterich and X Wang ldquoBatch value function approxi-mation via support vectorsrdquo in Advances in Neural InformationProcessing Systems pp 1491ndash1498 MIT Press Cambridge MassUSA 2002
[20] X Wang X Tian Y Cheng and J Yi ldquoQ-learning systembased on cooperative least squares support vector machinerdquoActa Automatica Sinica vol 35 no 2 pp 214ndash219 2009
[21] T Hofmann B Scholkopf and A J Smola ldquoKernel methodsin machine learningrdquoThe Annals of Statistics vol 36 no 3 pp1171ndash1220 2008
[22] M F Huber ldquoRecursive Gaussian process on-line regressionand learningrdquo Pattern Recognition Letters vol 45 no 1 pp 85ndash91 2014
[23] Y Engel S Mannor and R Meir ldquoBayes meets bellman theGaussian process approach to temporal difference learningrdquoin Proceedings of the 20th International Conference on MachineLearning pp 154ndash161 August 2003
[24] Y Engel S Mannor and R Meir ldquoReinforcement learning withGaussian processesrdquo in Proceedings of the 22nd InternationalConference on Machine Learning pp 201ndash208 ACM August2005
[25] C E Rasmussen andM Kuss ldquoGaussian processes in reinforce-ment learningrdquo in Advances in Neural Information ProcessingSystems S Thrun L K Saul and B Scholkopf Eds pp 751ndash759 MIT Press Cambridge Mass USA 2004
[26] M P Deisenroth C E Rasmussen and J Peters ldquoGaussianprocess dynamic programmingrdquo Neurocomputing vol 72 no7ndash9 pp 1508ndash1524 2009
[27] X Xu C Lian L Zuo and H He ldquoKernel-based approximatedynamic programming for real-time online learning controlan experimental studyrdquo IEEE Transactions on Control SystemsTechnology vol 22 no 1 pp 146ndash156 2014
[28] A Girard C E Rasmussen and R Murray-Smith ldquoGaussianprocess priors with uncertain inputs-multiple-step-ahead pre-dictionrdquo Tech Rep 2002
[29] X Xu T Xie D Hu and X Lu ldquoKernel least-squares temporaldifference learningrdquo International Journal of Information Tech-nology vol 11 no 9 pp 54ndash63 2005
[30] H J Kushner and G G Yin Stochastic Approximation andRecursive Algorithms and Applications vol 35 of Applications ofMathematics Springer Berlin Germany 2nd edition 2003
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical Problems in Engineering
Hindawi Publishing Corporationhttpwwwhindawicom
Differential EquationsInternational Journal of
Volume 2014
Applied MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
OptimizationJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Operations ResearchAdvances in
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Function Spaces
Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of Mathematics and Mathematical Sciences
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Algebra
Discrete Dynamics in Nature and Society
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Decision SciencesAdvances in
Discrete MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Stochastic AnalysisInternational Journal of
12 Mathematical Problems in Engineering
119891119900119903 l = 1 119905119900 119905ℎ119890 119890119899119889 119900119891 119879(119909)
Calculate 119864[119876(119909(119897)) | 119883119871] and var[119876(119909119897) | 119883
119871] using (6) and (7)
Calculate 119880(119909(119897)) using (40)119890119899119889 119891119900119903 119897
Sort all 119880(119909(119897)) 119897 = 1 119879119890in ascending order
Add the first119873add state-action pairs to sample set to get extended set119883119871+119873add
119894119891 L +119873add gt 119873limit
Calculate hyperparameters based on 119883119871+119873add
119891119900119903 l = 1 119905119900 119871 +119873add
Calculate 119864[119876(119909(119897)) | 119883119871+119873add
] and var[119876(119909119897) | 119883119871+119873add
] using (6) and (7)Calculate 119880(119909(119897)) using (40)
119890119899119889 119891119900119903 119897
Sort all 119880(119909(119897)) 119897 = 1 119871 + 119873add in descending orderDelete the first 119871 +119873add minus 119873limit samples to get refined set 119883
119873limit
119890119899119889 119894119891
Algorithm 2 Refinement of sample set
0 2000 4000 6000 8000 10000 12000 14000 16000 18000100
200
300
400
500
600
700
800
Iteration
The s
um o
f Q v
alue
s of a
ll sa
mpl
e sta
te-a
ctio
ns
(a) The average evolution process of 119876 values before and after addingsamples
0 20 40 60 80 100minus15
minus1
minus05
0
05
1
Run
Enha
ncem
ent (
)
(b) The enhancement of the control performance about cart displacementafter adding samples
Figure 10 Learning performance of GK-ADP if sample set is refined by adding 10 samples
as Figure 10(b) shows almost all cart displacements as thepole is swung up are enhanced more or less except for 8times of failure Hence with proper principle based on thesample values the sample set can be refined online in orderto enhance performance
Finally let us check which state-action pairs are addedinto sample set We put all added samples over 100 runstogether and depict their values in Figure 11(a) where thevalues are projected onto the dimensions of pole angleand cart displacement If only concerning the relationshipbetween 119876 values and pole angle just as Figure 11(b) showswe find that the refinement principle intends to select thesamples a little away from the equilibrium and near theboundary minus04 due to the ratio between 120588 and 120573
5 Conclusions and Future Work
ADPmethods are among the most promising research workson RL in continuous environment to construct learningsystems for nonlinear optimal control This paper presentsGK-ADP with two-phase value iteration which combines theadvantages of kernel ACDs and GP-based value iteration
The theoretical analysis reveals that with proper learningrates two-phase iteration is good atmakingGaussian-kernel-based critic network converge to the structure with optimalhyperparameters and approximate all samplesrsquo values
A series of simulations are carried out to verify thenecessity of two-phase learning and illustrate properties ofGK-ADP Finally the numerical tests support the viewpointthat the assessment of samplesrsquo values provides the way to
Mathematical Problems in Engineering 13
minus04minus02
002
04
minus04minus0200204minus10
0
10
20
30
40
Pole angle (rad)
Q v
alue
s
Cart
displac
emen
t (m)
(a) Projecting on the dimensions of pole angle and cart displacement
minus04 minus03 minus02 minus01 0 01 02 03 04minus5
0
5
10
15
20
25
30
35
40
Pole angle (rad)
Q v
alue
s
(b) Projecting on the dimension of pole angle
Figure 11 The 119876 values wrt all added samples over 100 runs
refine sample set online in order to enhance the performanceof critic-actor architecture during operation
However there are some issues needed to be concernedin future The first is how to guarantee condition (36) duringlearning which is now determined by empirical ways Thesecond is the balance between exploration and exploitationwhich is always an opening question but seemsmore notablehere because bad exploration-exploitation principle will leadtwo-phase iteration to failure not only to slow convergence
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgments
Thiswork is supported by theNational Natural Science Foun-dation of China under Grants 61473316 and 61202340 andthe International Postdoctoral Exchange Fellowship Programunder Grant no 20140011
References
[1] R Sutton and A Barto Reinforcement Learning An Introduc-tion Adaptive Computation andMachine LearningMIT PressCambridge Mass USA 1998
[2] F L Lewis andD Vrabie ldquoReinforcement learning and adaptivedynamic programming for feedback controlrdquo IEEE Circuits andSystems Magazine vol 9 no 3 pp 32ndash50 2009
[3] F L Lewis D Vrabie and K Vamvoudakis ldquoReinforcementlearning and feedback control using natural decision methodsto design optimal adaptive controllersrdquo IEEE Control SystemsMagazine vol 32 no 6 pp 76ndash105 2012
[4] H Zhang Y Luo and D Liu ldquoNeural-network-based near-optimal control for a class of discrete-time affine nonlinearsystems with control constraintsrdquo IEEE Transactions on NeuralNetworks vol 20 no 9 pp 1490ndash1503 2009
[5] Y Huang and D Liu ldquoNeural-network-based optimal trackingcontrol scheme for a class of unknown discrete-time nonlinear
systems using iterative ADP algorithmrdquo Neurocomputing vol125 pp 46ndash56 2014
[6] X Xu L Zuo and Z Huang ldquoReinforcement learning algo-rithms with function approximation recent advances andapplicationsrdquo Information Sciences vol 261 pp 1ndash31 2014
[7] A Al-Tamimi F L Lewis and M Abu-Khalaf ldquoDiscrete-timenonlinear HJB solution using approximate dynamic program-ming convergence proofrdquo IEEE Transactions on Systems Manand Cybernetics Part B Cybernetics vol 38 no 4 pp 943ndash9492008
[8] S Ferrari J E Steck andRChandramohan ldquoAdaptive feedbackcontrol by constrained approximate dynamic programmingrdquoIEEE Transactions on Systems Man and Cybernetics Part BCybernetics vol 38 no 4 pp 982ndash987 2008
[9] F-Y Wang H Zhang and D Liu ldquoAdaptive dynamic pro-gramming an introductionrdquo IEEE Computational IntelligenceMagazine vol 4 no 2 pp 39ndash47 2009
[10] D Wang D Liu Q Wei D Zhao and N Jin ldquoOptimal controlof unknown nonaffine nonlinear discrete-time systems basedon adaptive dynamic programmingrdquo Automatica vol 48 no 8pp 1825ndash1832 2012
[11] D Vrabie and F Lewis ldquoNeural network approach tocontinuous-time direct adaptive optimal control for partiallyunknown nonlinear systemsrdquo Neural Networks vol 22 no 3pp 237ndash246 2009
[12] D Liu D Wang and X Yang ldquoAn iterative adaptive dynamicprogramming algorithm for optimal control of unknowndiscrete-time nonlinear systemswith constrained inputsrdquo Infor-mation Sciences vol 220 pp 331ndash342 2013
[13] N Jin D Liu T Huang and Z Pang ldquoDiscrete-time adaptivedynamic programming using wavelet basis function neural net-worksrdquo in Proceedings of the IEEE Symposium on ApproximateDynamic Programming and Reinforcement Learning pp 135ndash142 Honolulu Hawaii USA April 2007
[14] P Koprinkova-Hristova M Oubbati and G Palm ldquoAdaptivecritic designwith echo state networkrdquo in Proceedings of the IEEEInternational Conference on SystemsMan andCybernetics (SMCrsquo10) pp 1010ndash1015 IEEE Istanbul Turkey October 2010
[15] X Xu Z Hou C Lian and H He ldquoOnline learning controlusing adaptive critic designs with sparse kernelmachinesrdquo IEEE
14 Mathematical Problems in Engineering
Transactions on Neural Networks and Learning Systems vol 24no 5 pp 762ndash775 2013
[16] V Vapnik Statistical Learning Theory Wiley New York NYUSA 1998
[17] C E Rasmussen and C K Williams Gaussian Processesfor Machine Learning Adaptive Computation and MachineLearning MIT Press Cambridge Mass USA 2006
[18] Y Engel S Mannor and R Meir ldquoThe kernel recursive least-squares algorithmrdquo IEEE Transactions on Signal Processing vol52 no 8 pp 2275ndash2285 2004
[19] T G Dietterich and X Wang ldquoBatch value function approxi-mation via support vectorsrdquo in Advances in Neural InformationProcessing Systems pp 1491ndash1498 MIT Press Cambridge MassUSA 2002
[20] X Wang X Tian Y Cheng and J Yi ldquoQ-learning systembased on cooperative least squares support vector machinerdquoActa Automatica Sinica vol 35 no 2 pp 214ndash219 2009
[21] T Hofmann B Scholkopf and A J Smola ldquoKernel methodsin machine learningrdquoThe Annals of Statistics vol 36 no 3 pp1171ndash1220 2008
[22] M F Huber ldquoRecursive Gaussian process on-line regressionand learningrdquo Pattern Recognition Letters vol 45 no 1 pp 85ndash91 2014
[23] Y Engel S Mannor and R Meir ldquoBayes meets bellman theGaussian process approach to temporal difference learningrdquoin Proceedings of the 20th International Conference on MachineLearning pp 154ndash161 August 2003
[24] Y Engel S Mannor and R Meir ldquoReinforcement learning withGaussian processesrdquo in Proceedings of the 22nd InternationalConference on Machine Learning pp 201ndash208 ACM August2005
[25] C E Rasmussen andM Kuss ldquoGaussian processes in reinforce-ment learningrdquo in Advances in Neural Information ProcessingSystems S Thrun L K Saul and B Scholkopf Eds pp 751ndash759 MIT Press Cambridge Mass USA 2004
[26] M P Deisenroth C E Rasmussen and J Peters ldquoGaussianprocess dynamic programmingrdquo Neurocomputing vol 72 no7ndash9 pp 1508ndash1524 2009
[27] X Xu C Lian L Zuo and H He ldquoKernel-based approximatedynamic programming for real-time online learning controlan experimental studyrdquo IEEE Transactions on Control SystemsTechnology vol 22 no 1 pp 146ndash156 2014
[28] A Girard C E Rasmussen and R Murray-Smith ldquoGaussianprocess priors with uncertain inputs-multiple-step-ahead pre-dictionrdquo Tech Rep 2002
[29] X Xu T Xie D Hu and X Lu ldquoKernel least-squares temporaldifference learningrdquo International Journal of Information Tech-nology vol 11 no 9 pp 54ndash63 2005
[30] H J Kushner and G G Yin Stochastic Approximation andRecursive Algorithms and Applications vol 35 of Applications ofMathematics Springer Berlin Germany 2nd edition 2003
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical Problems in Engineering
Hindawi Publishing Corporationhttpwwwhindawicom
Differential EquationsInternational Journal of
Volume 2014
Applied MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
OptimizationJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Operations ResearchAdvances in
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Function Spaces
Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of Mathematics and Mathematical Sciences
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Algebra
Discrete Dynamics in Nature and Society
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Decision SciencesAdvances in
Discrete MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Stochastic AnalysisInternational Journal of
Mathematical Problems in Engineering 13
minus04minus02
002
04
minus04minus0200204minus10
0
10
20
30
40
Pole angle (rad)
Q v
alue
s
Cart
displac
emen
t (m)
(a) Projecting on the dimensions of pole angle and cart displacement
minus04 minus03 minus02 minus01 0 01 02 03 04minus5
0
5
10
15
20
25
30
35
40
Pole angle (rad)
Q v
alue
s
(b) Projecting on the dimension of pole angle
Figure 11 The 119876 values wrt all added samples over 100 runs
refine sample set online in order to enhance the performanceof critic-actor architecture during operation
However there are some issues needed to be concernedin future The first is how to guarantee condition (36) duringlearning which is now determined by empirical ways Thesecond is the balance between exploration and exploitationwhich is always an opening question but seemsmore notablehere because bad exploration-exploitation principle will leadtwo-phase iteration to failure not only to slow convergence
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgments
Thiswork is supported by theNational Natural Science Foun-dation of China under Grants 61473316 and 61202340 andthe International Postdoctoral Exchange Fellowship Programunder Grant no 20140011
References
[1] R Sutton and A Barto Reinforcement Learning An Introduc-tion Adaptive Computation andMachine LearningMIT PressCambridge Mass USA 1998
[2] F L Lewis andD Vrabie ldquoReinforcement learning and adaptivedynamic programming for feedback controlrdquo IEEE Circuits andSystems Magazine vol 9 no 3 pp 32ndash50 2009
[3] F L Lewis D Vrabie and K Vamvoudakis ldquoReinforcementlearning and feedback control using natural decision methodsto design optimal adaptive controllersrdquo IEEE Control SystemsMagazine vol 32 no 6 pp 76ndash105 2012
[4] H Zhang Y Luo and D Liu ldquoNeural-network-based near-optimal control for a class of discrete-time affine nonlinearsystems with control constraintsrdquo IEEE Transactions on NeuralNetworks vol 20 no 9 pp 1490ndash1503 2009
[5] Y Huang and D Liu ldquoNeural-network-based optimal trackingcontrol scheme for a class of unknown discrete-time nonlinear
systems using iterative ADP algorithmrdquo Neurocomputing vol125 pp 46ndash56 2014
[6] X Xu L Zuo and Z Huang ldquoReinforcement learning algo-rithms with function approximation recent advances andapplicationsrdquo Information Sciences vol 261 pp 1ndash31 2014
[7] A Al-Tamimi F L Lewis and M Abu-Khalaf ldquoDiscrete-timenonlinear HJB solution using approximate dynamic program-ming convergence proofrdquo IEEE Transactions on Systems Manand Cybernetics Part B Cybernetics vol 38 no 4 pp 943ndash9492008
[8] S Ferrari J E Steck andRChandramohan ldquoAdaptive feedbackcontrol by constrained approximate dynamic programmingrdquoIEEE Transactions on Systems Man and Cybernetics Part BCybernetics vol 38 no 4 pp 982ndash987 2008
[9] F-Y Wang H Zhang and D Liu ldquoAdaptive dynamic pro-gramming an introductionrdquo IEEE Computational IntelligenceMagazine vol 4 no 2 pp 39ndash47 2009
[10] D Wang D Liu Q Wei D Zhao and N Jin ldquoOptimal controlof unknown nonaffine nonlinear discrete-time systems basedon adaptive dynamic programmingrdquo Automatica vol 48 no 8pp 1825ndash1832 2012
[11] D Vrabie and F Lewis ldquoNeural network approach tocontinuous-time direct adaptive optimal control for partiallyunknown nonlinear systemsrdquo Neural Networks vol 22 no 3pp 237ndash246 2009
[12] D Liu D Wang and X Yang ldquoAn iterative adaptive dynamicprogramming algorithm for optimal control of unknowndiscrete-time nonlinear systemswith constrained inputsrdquo Infor-mation Sciences vol 220 pp 331ndash342 2013
[13] N Jin D Liu T Huang and Z Pang ldquoDiscrete-time adaptivedynamic programming using wavelet basis function neural net-worksrdquo in Proceedings of the IEEE Symposium on ApproximateDynamic Programming and Reinforcement Learning pp 135ndash142 Honolulu Hawaii USA April 2007
[14] P Koprinkova-Hristova M Oubbati and G Palm ldquoAdaptivecritic designwith echo state networkrdquo in Proceedings of the IEEEInternational Conference on SystemsMan andCybernetics (SMCrsquo10) pp 1010ndash1015 IEEE Istanbul Turkey October 2010
[15] X Xu Z Hou C Lian and H He ldquoOnline learning controlusing adaptive critic designs with sparse kernelmachinesrdquo IEEE
14 Mathematical Problems in Engineering
Transactions on Neural Networks and Learning Systems vol 24no 5 pp 762ndash775 2013
[16] V Vapnik Statistical Learning Theory Wiley New York NYUSA 1998
[17] C E Rasmussen and C K Williams Gaussian Processesfor Machine Learning Adaptive Computation and MachineLearning MIT Press Cambridge Mass USA 2006
[18] Y Engel S Mannor and R Meir ldquoThe kernel recursive least-squares algorithmrdquo IEEE Transactions on Signal Processing vol52 no 8 pp 2275ndash2285 2004
[19] T G Dietterich and X Wang ldquoBatch value function approxi-mation via support vectorsrdquo in Advances in Neural InformationProcessing Systems pp 1491ndash1498 MIT Press Cambridge MassUSA 2002
[20] X Wang X Tian Y Cheng and J Yi ldquoQ-learning systembased on cooperative least squares support vector machinerdquoActa Automatica Sinica vol 35 no 2 pp 214ndash219 2009
[21] T Hofmann B Scholkopf and A J Smola ldquoKernel methodsin machine learningrdquoThe Annals of Statistics vol 36 no 3 pp1171ndash1220 2008
[22] M F Huber ldquoRecursive Gaussian process on-line regressionand learningrdquo Pattern Recognition Letters vol 45 no 1 pp 85ndash91 2014
[23] Y Engel S Mannor and R Meir ldquoBayes meets bellman theGaussian process approach to temporal difference learningrdquoin Proceedings of the 20th International Conference on MachineLearning pp 154ndash161 August 2003
[24] Y Engel S Mannor and R Meir ldquoReinforcement learning withGaussian processesrdquo in Proceedings of the 22nd InternationalConference on Machine Learning pp 201ndash208 ACM August2005
[25] C E Rasmussen andM Kuss ldquoGaussian processes in reinforce-ment learningrdquo in Advances in Neural Information ProcessingSystems S Thrun L K Saul and B Scholkopf Eds pp 751ndash759 MIT Press Cambridge Mass USA 2004
[26] M P Deisenroth C E Rasmussen and J Peters ldquoGaussianprocess dynamic programmingrdquo Neurocomputing vol 72 no7ndash9 pp 1508ndash1524 2009
[27] X Xu C Lian L Zuo and H He ldquoKernel-based approximatedynamic programming for real-time online learning controlan experimental studyrdquo IEEE Transactions on Control SystemsTechnology vol 22 no 1 pp 146ndash156 2014
[28] A Girard C E Rasmussen and R Murray-Smith ldquoGaussianprocess priors with uncertain inputs-multiple-step-ahead pre-dictionrdquo Tech Rep 2002
[29] X Xu T Xie D Hu and X Lu ldquoKernel least-squares temporaldifference learningrdquo International Journal of Information Tech-nology vol 11 no 9 pp 54ndash63 2005
[30] H J Kushner and G G Yin Stochastic Approximation andRecursive Algorithms and Applications vol 35 of Applications ofMathematics Springer Berlin Germany 2nd edition 2003