
Reinforcement Learning for Robust Missile Autopilot Design

Bernardo Cortez · Florian Peter · Thomas Lausenhammer · Paulo Oliveira

Accepted: August 10th 2021

Abstract Designing a missile's autopilot controller has been a complex task, given the extensive flight envelope and the nonlinear flight dynamics. A solution that can excel both in nominal performance and in robustness to uncertainties is still to be found. While Control Theory often resorts to parameter-scheduling procedures, Reinforcement Learning has presented interesting results in ever more complex tasks, going from videogames to robotic tasks with continuous action domains. However, it still lacks clearer insights on how to find adequate reward functions and exploration strategies. To the best of our knowledge, this work is pioneer in proposing Reinforcement Learning as a framework for flight control. In fact, it aims at training a model-free agent that can control the longitudinal nonlinear flight dynamics of a missile, achieving the target performance and robustness to uncertainties. To that end, under TRPO's methodology, the collected experience is augmented according to HER, stored in a replay buffer and sampled according to its significance. Not only does this work enhance the concept of prioritized experience replay into BPER, but it also reformulates HER, activating both only when the training progress converges to suboptimal policies, in what is proposed as the SER methodology. The results show that it is possible both to achieve the target performance and to improve the agent's robustness to uncertainties (with little damage to nominal performance) by further training the agent in non-nominal environments, therefore validating the proposed approach and encouraging future research in this field.

Keywords Reinforcement Learning · TRPO · HER · flight control · missile autopilot

F. Author
Tecnico Lisboa · E-mail: [email protected] · ORCID: 0000-0001-5553-1106
S. Authors
MBDA Deutschland GmbH · E-mail: [email protected]
Technical University of Munich · E-mail: [email protected]
Tecnico Lisboa · E-mail: [email protected]



1 Introduction

Over the last decades, designing the autopilot flight controller for a system such as a missile has been a complex task, given (i) the nonlinear dynamics and (ii) the demanding performance and robustness requirements. This process is classically solved by tedious scheduling procedures, which often lack the ability to generalize across the whole flight envelope and different missile configurations.

Reinforcement Learning (RL) constitutes a promising approach to address the latter issues, given its ability to control systems about which it has no prior information. This motivation increases as the system to be controlled grows in complexity, on the plausible hypothesis that the RL agent could benefit from having a considerably wider range of possible (combinations of) actions. By learning the cross-coupling effects, the agent is expected to converge to the optimal policy faster.

The longitudinal nonlinear dynamics (cf. section 2), however, constitutes a mere first step, necessary for a later possible expansion of the approach to the whole flight dynamics. This work is, hence, motivated by the goal of finding an RL algorithm that can control the longitudinal nonlinear flight dynamics of a Generic Surface-to-Air Missile (GSAM) with no prior information about it, being, thus, model-free.

2 Model

2.1 GSAM Dynamics

The GSAM's flight dynamics to be controlled is modelled by a nonlinear system, in which the total sum of the forces F_{T,G} and of the moments M_{T,G} depends on the flight dynamic states (cf. equations (1) and (2)), like the Mach number M, the height h or the angle of attack α.

F_{T,G} = F(M, h, \alpha, \dots)   (1)

M_{T,G} = M(M, h, \alpha, \dots)   (2)

This system is decomposed into Translation (cf. equation (3)), Rotation (cf. equation (4)), Position (cf. equation (5)) and Attitude (cf. equation (6)) terms, as Peter (2018) [15] described. Equations (3) to (6) follow Peter's (2018) [15] notation.

\dot{V}^E_G = \frac{1}{m} F_{T,G} - \omega^E \times V^E_G   (3)

\dot{\omega}^E = (J_G)^{-1} \left( M_{T,G} - \omega^E \times J_G \omega^E \right)   (4)

\dot{r}^E_G = M^T V^E_G   (5)

\left[ \dot{\Phi}^E, \dot{\Theta}^E, \dot{\Psi}^E \right]^T = R \cdot \omega^E   (6)
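For illustration only, the sketch below shows how equations (3) to (6) could be advanced in time with an explicit Euler step; force, moment, M_trans and R_att are hypothetical callables standing in for the model of section 2.1, and the left-hand sides of (3) to (6) are assumed to be time derivatives of the respective states.

import numpy as np

def euler_step(V, omega, r, att, force, moment, m, J, M_trans, R_att, dt=1e-3):
    """One explicit Euler step of equations (3) to (6).
    force/moment are hypothetical callables returning F_T,G and M_T,G (eqs (1), (2));
    M_trans and R_att return the (generally attitude-dependent) transformation and
    attitude-kinematics matrices of eqs (5) and (6)."""
    F = force(V, omega, r, att)
    Mo = moment(V, omega, r, att)
    V_dot = F / m - np.cross(omega, V)                               # eq. (3)
    omega_dot = np.linalg.solve(J, Mo - np.cross(omega, J @ omega))  # eq. (4)
    r_dot = M_trans(att) @ V                                         # eq. (5)
    att_dot = R_att(att) @ omega                                     # eq. (6)
    return (V + dt * V_dot, omega + dt * omega_dot,
            r + dt * r_dot, att + dt * att_dot)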

To investigate the principal ability of an RL agent to serve as a missile autopilot, the nonlinear longitudinal motion of the missile dynamics is considered. Therefore, the issue of cross-coupling effects within the autopilot design is not addressed in this paper.


2.2 Actuator Dynamics

The GSAM is actuated by four fin deflections, δ_i, which are mapped to three aerodynamic equivalent controls, ξ, η and ζ (cf. equation (7)). The latter provide the advantage of directly matching the GSAM's roll, pitch and yaw axes, respectively.

\begin{bmatrix} \xi \\ \eta \\ \zeta \end{bmatrix} = \frac{1}{4} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -1 & -1 & 1 \\ 1 & 1 & -1 & -1 \end{bmatrix} \begin{bmatrix} \delta_1 \\ \delta_2 \\ \delta_3 \\ \delta_4 \end{bmatrix}   (7)

\begin{bmatrix} \delta_1 \\ \delta_2 \\ \delta_3 \\ \delta_4 \end{bmatrix} = \begin{bmatrix} 1 & 1 & 1 \\ 1 & -1 & 1 \\ 1 & -1 & -1 \\ 1 & 1 & -1 \end{bmatrix} \begin{bmatrix} \xi \\ \eta \\ \zeta \end{bmatrix}   (8)

In the control of the longitudinal flight dynamics, ξ and ζ are set to 0 and, hence, equation (8) becomes equation (9).

\begin{bmatrix} \delta_1 \\ \delta_2 \\ \delta_3 \\ \delta_4 \end{bmatrix} = \begin{bmatrix} 1 \\ -1 \\ -1 \\ 1 \end{bmatrix} \eta   (9)
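As a simple illustration, the allocation in equations (7) to (9) can be expressed with two constant matrices; the sketch below merely reproduces the mapping stated above, with illustrative names.

import numpy as np

# Fin deflections -> aerodynamic equivalent controls, eq. (7)
FIN_TO_CTRL = 0.25 * np.array([[1,  1,  1,  1],
                               [1, -1, -1,  1],
                               [1,  1, -1, -1]])

# Aerodynamic equivalent controls -> fin deflections, eq. (8)
CTRL_TO_FIN = np.array([[1,  1,  1],
                        [1, -1,  1],
                        [1, -1, -1],
                        [1,  1, -1]])

def fins_from_pitch(eta):
    """Longitudinal case (xi = zeta = 0): eq. (8) collapses to eq. (9)."""
    return CTRL_TO_FIN @ np.array([0.0, eta, 0.0])   # equals eta * [1, -1, -1, 1]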

The actuator system is, thus, the system that receives the desired (commanded) η_com and outputs η, modelling the dynamic response of the physical fins with their deflection limit of 30 degrees. The latter is assumed to be a second-order system with the following closed-loop characteristics:

1. Natural frequency ω_n of 150 rad/s

2. Damping factor λ of 0.7
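For illustration, the actuator described above could be simulated, for instance, by discretising a standard closed-loop second-order system and saturating the deflection at ±30 degrees; this is a minimal sketch under those assumptions, not the authors' implementation.

import numpy as np

class SecondOrderActuator:
    """Illustrative discretisation of the fin actuator: a closed-loop second-order
    system (omega_n = 150 rad/s, lambda = 0.7) with a +/-30 deg deflection limit."""
    def __init__(self, wn=150.0, lam=0.7, limit=np.deg2rad(30.0), dt=1e-3):
        self.wn, self.lam, self.limit, self.dt = wn, lam, limit, dt
        self.eta, self.eta_dot = 0.0, 0.0

    def step(self, eta_cmd):
        # eta_ddot = wn^2 (eta_cmd - eta) - 2 lam wn eta_dot
        eta_ddot = (self.wn ** 2 * (eta_cmd - self.eta)
                    - 2 * self.lam * self.wn * self.eta_dot)
        self.eta_dot += self.dt * eta_ddot
        self.eta = float(np.clip(self.eta + self.dt * self.eta_dot,
                                 -self.limit, self.limit))
        return self.eta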

2.3 Performance Requirements

The algorithm must achieve the target performance established in terms of the following requirements:

1. Static error margin = 0.5%
2. Overshoot < 20%
3. Rise time < 0.6 s
4. Settling time (5%) < 0.6 s
5. Bounded actuation
6. Smooth actuation

Besides, the present work also aims at achieving and improving the robustness of the algorithm to conditions different from the training ones.

When trying to optimize the achieved performance, one must, however, be aware of the physical limitations imposed by the system being controlled. A commercial airplane, for example, will never have the agility of a fighter jet, regardless of how optimized its controller is. Therefore, it is hopeless to expect the system to follow too abrupt reference signals like step functions. Instead, the system will be asked to follow shaped reference signals, i.e., the output of the Reference Model. The latter is hereby defined as the system whose closed-loop dynamics is designed to mimic the one desired for the dynamic system being controlled; in this case, a natural frequency ω_n of 10 rad/s and a damping factor λ of 0.7.

Fig. 1 Shaped and command reference signals

The proper workflow of the interaction between the agent and the dynamic system being controlled - including a Reference Model - is illustrated in figure 2. The command reference signal is given as an input to the Reference Model, whose output, the shaped reference signal, is the input of the agent.

Fig. 2 Block diagram including the Reference Model


3 Background

3.1 Topic Overview

RL has been the object of research since the foundations laid out by Sutton et al. (2018) [20], with applications going from the Atari 2600 games, to the MuJoCo environments, to robotic tasks (like grasping) or other classic control problems (like the inverted pendulum).

TRPO (Schulman et al. 2015 [17]) has become one of the commonly accepted benchmarks for its success, achieved by the cautious trust region optimization and the monotonic reward increase. In parallel, Lillicrap et al. (2016) [9] proposed DDPG, a model-free off-policy algorithm, revolutionary not only for its ability to learn directly from pixels while keeping the network architecture simple, but mainly because it was designed to cope with continuous action domains. Contrarily to TRPO, DDPG's off-policy nature implied a much higher sample efficiency, resulting in a faster training process. Both TRPO and DDPG have been the roots for much of the research work that followed.

On the one side, some authors valued more the benefits of an off-policy algorithm and took inspiration in DDPG to develop TD3 (Fujimoto et al. 2018 [3]), addressing DDPG's problem of over-estimation of the states' value. On the other side, others preferred the benefits of an on-policy algorithm and proposed interesting improvements to the original TRPO, either by reducing its implementation complexity (Wu et al. 2017 [21]), by trying to decrease the variance of its estimates (Schulman 2016 [18]) or even by showing the benefits of its interconnection with replay buffers (Kangin et al. [8]).

Apart from these, a new result began to arise: agents were ensuring stability at the expense of converging to suboptimal solutions. Once again, new algorithms were conceived in each family, on- and off-policy. Haarnoja and Tang proposed to express the optimal policy via a Boltzmann distribution in order to learn stochastic behaviors and to improve the exploration phase within the scope of an off-policy actor-critic architecture: Soft Q-learning (Haarnoja et al. 2017 [6]). Almost simultaneously, Schulman et al. published PPO (Schulman et al. 2017 [19]), claiming to have simplified TRPO's implementation and increased its sample efficiency by inserting an entropy bonus to increase exploration performance and avoid premature suboptimal convergence. Furthermore, Haarnoja et al. developed SAC (Haarnoja et al. 2018 [7]), in an attempt to recover the training stability without losing the entropy-encouraged exploration, and Nachum et al. (2018) [11] proposed Smoothie, allying the trust region implementation of PPO with DDPG.

Finally, there has also been research done on merging both on-policy and off-policy algorithms, trying to profit from the upsides of both, like IPG (Gu et al. 2017 [5]), Trust-PCL (Nachum et al. 2018 [12]), Q-Prop (Gu et al. 2019 [4]) and PGQL (O'Donoghue et al. 2019 [14]).

3.2 Trust Region Policy Optimization

TRPO is an on-policy model-free RL algorithm that aims at maximizing the discounted sum of future rewards (cf. equation (10)), following an actor-critic architecture and a trust region search.


R(s_t) = \sum_{l=0}^{\infty} \gamma^l r(t+l)   (10)

Initially, Schulman et al. (2015) [17] proposed to update the policy estimator's parameters with the conjugate gradient algorithm followed by a line search. The trust region search would be ensured by a hard constraint on the Kullback-Leibler divergence D_KL.

Briefly after, Kangin et al. (2019) [8] proposed an enhancement, augmenting the training data by using replay buffers and GAE (Schulman et al. 2016 [18]). Also contrarily to the original proposal, Kangin et al. trained the value estimator's parameters with the ADAM optimizer and the policy's with K-FAC (Martens et al. 2015 [10]). The former was implemented within a regression between the output of the Value NN, V, and its target, V′, whilst the latter used equation (11) as the loss function, which has a first term concerning the objective function being maximized (cf. equation (10)) and a second one penalizing differences between two consecutive policies outside the trust region, whose radius is the hyperparameter δ_TR.

L_P = -\mathbb{E}_{s_0, a_0, \dots} \left[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \right] + \alpha \max\left(0, \mathbb{E}_{a \sim \pi_{\theta_{old}}} \left[ D_{KL}(\pi_{\theta_{old}}(a), \pi_{\theta_{new}}(a)) \right] - \delta_{TR}\right)   (11)

For matters of simplicity of implementation, Schulman et al. (2015) [17] rewrote equation (11) in terms of collected experience (cf. equation (12)), as a function of two different stochastic policies, π_{θ_new} and π_{θ_old}, and the GAE.

L_P = -\mathbb{E}_{s} \mathbb{E}_{a \sim \pi_{\theta_{old}}} \left[ GAE_{\theta_{old}}(s) \frac{\pi_{\theta_{new}}(a)}{\pi_{\theta_{old}}(a)} \right] + \alpha \max\left(0, \mathbb{E}_{a \sim \pi_{\theta_{old}}} \left[ D_{KL}(\pi_{\theta_{old}}(a), \pi_{\theta_{new}}(a)) \right] - \delta_{TR}\right)   (12)

Both versions of TRPO use the same exploration strategy: the output of the Policy NN is used as the mean of a multivariate normal distribution whose covariance matrix is part of the policy's trainable parameters.
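For reference, a minimal sketch of GAE as it is commonly computed from one trajectory of rewards and value estimates; γ and λ here are generic values, not necessarily the ones used in this work.

import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.
    values holds V(s_0), ..., V(s_T): one more entry than rewards."""
    adv = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # one-step TD error
        running = delta + gamma * lam * running
        adv[t] = running
    return adv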

3.3 Hindsight Experience Replay

The key idea of HER (Andrychowicz et al. 2018 [1]) is to store in the replay buffers not only the experience collected from interacting with the environment, but also the experiences that would have been obtained had the agent pursued a different goal. HER was originally proposed as a complement to sparse rewards.


3.4 Prioritized Experience Replay

When working with replay buffers, randomly sampling experience can be outperformed by a more sophisticated heuristic. Schaul et al. (2016) [16] proposed Prioritized Experience Replay, sampling experience according to its significance, measured by the TD error. Besides, the strict required performance (cf. table 1) causes the agent to seldom achieve success and the training dataset to be imbalanced; thus, the agent simply overfits the "failure" class. Narasimhan et al. (2015) [13] have addressed this problem by forcing 25% of the training dataset to be sampled from the less represented class.

4 Application of RL to the Missile’s Flight

One episode is defined as the attempt of following a 5 s-long a_z reference signal consisting of two consecutive steps whose amplitudes and rise times are randomly generated, except when deploying the agent for testing purposes (Cortez et al. 2020 [2]). With a sampling time of 1 ms, an episode includes 5000 steps, of which 2400 are defined as transition periods (the 600 steps composing the 0.6 s after each of the four rise times), whilst the remaining 2600 are resting periods.

From the requirements defined in section 2.3, the RL problem was formulated as the training of an agent that succeeds when the performance of an episode meets the levels defined in table 1 in terms of the maximum stationary tracking error |e_z|max,r, of the overshoot, of the actuation magnitude |η|max and of the actuation noise levels both in resting (η_noise,r) and transition (η_noise,t) periods.

Table 1 Performance Objectives

Requirement     Required Value
|e_z|max,r      0.5 [g]
Overshoot       20 [%]
|η|max          15 [°]
η_noise,r       1 [rad]
η_noise,t       0.2 [rad]
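A sketch of how a logged episode could be scored against Table 1; the function name, the boolean masks and the standard-deviation-of-increments noise proxy are illustrative assumptions, not the authors' exact metrics.

import numpy as np

def episode_success(e_z, eta, overshoot, rest_mask, trans_mask):
    """Check one logged episode against Table 1.
    e_z: tracking error [g]; eta: action [rad]; overshoot in [%];
    rest_mask/trans_mask: boolean masks of resting/transition steps."""
    noise_r = np.std(np.diff(eta[rest_mask]))    # illustrative noise proxy
    noise_t = np.std(np.diff(eta[trans_mask]))
    return (np.max(np.abs(e_z[rest_mask])) < 0.5          # |e_z|max,r  < 0.5 g
            and overshoot < 20.0                          # overshoot   < 20 %
            and np.max(np.abs(np.rad2deg(eta))) < 15.0    # |eta|max    < 15 deg
            and noise_r < 1.0                             # eta_noise,r < 1 rad
            and noise_t < 0.2)                            # eta_noise,t < 0.2 rad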

4.1 Algorithm

As explained in section 1, the current problem required an on-policy model-free RL algorithm. Among them, not only is TRPO a current state-of-the-art algorithm (cf. section 3.1), but it also presents the attractiveness of the trust region search, avoiding sudden drops during the training progress, which is a very interesting feature to be explored by the industry, whose mindset often aims at robust results. TRPO was, therefore, the most suitable choice.


4.1.1 Modifications to original TRPO

The present implementation was inspired by the implementations proposed by Schulman et al. (2015) [17] and by Kangin et al. (2019) [8] (cf. section 3.2). There are several differences, though:

1. The reward function is given by equation (19), whose relative weights w_i can be found in Cortez et al. (2020) [2].

f_1 = -w_1 |e_z|   (13)

f_2 = \begin{cases} 0 & \text{if } |\eta| < \eta_{max} \\ -w_2 & \text{otherwise} \end{cases}   (14)

\eta_{slope} = \frac{\Delta\eta}{t_s}   (15)

f_3 = -w_3 |\eta_{slope}|   (16)

condition = |e_z| < 3g \;\wedge\; |\eta| < 0.2 \;\wedge\; |e_u| < e_{u,max}   (17)

f_4 = \begin{cases} w_4 \frac{e_{u,max} - |e_u|}{e_{u,max}} & \text{if condition} \\ 0 & \text{otherwise} \end{cases}   (18)

r = \sum_{i=1}^{4} f_i   (19)
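A minimal sketch of the reward in equations (13) to (19); the weights w and the threshold eta_max below are placeholders, since their actual values are given in [2], and the sampling time t_s is taken as the 1 ms of section 4.

def reward(e_z, eta, eta_prev, e_u, t_s=1e-3,
           w=(1.0, 1.0, 1.0, 1.0), eta_max=0.26, e_u_max=0.01):
    """Reward of equations (13) to (19). w and eta_max are placeholder values."""
    f1 = -w[0] * abs(e_z)                                            # eq. (13)
    f2 = 0.0 if abs(eta) < eta_max else -w[1]                        # eq. (14)
    eta_slope = (eta - eta_prev) / t_s                               # eq. (15)
    f3 = -w[2] * abs(eta_slope)                                      # eq. (16)
    cond = abs(e_z) < 3.0 and abs(eta) < 0.2 and abs(e_u) < e_u_max  # eq. (17), e_z in [g]
    f4 = w[3] * (e_u_max - abs(e_u)) / e_u_max if cond else 0.0      # eq. (18)
    return f1 + f2 + f3 + f4                                         # eq. (19)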

2. Both neural networks have three hidden layers whose sizes - h_1, h_2 and h_3 - are related to the observations vector (their input size) and to the actions vector (output size of the Policy NN).

3. Observations are normalized (cf. equation (20)) so that the learning process can cope with the different domains of each feature.

obs_{norm} = \frac{obs - \mu_{online}}{\sigma_{online}}   (20)

In equation (20), μ_online and σ_online are the running mean and the running variance of the set of observations collected along the whole training process, which are updated with information about the newly collected observations after each training episode.
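A minimal sketch of the running normalisation in equation (20), assuming the standard incremental combination of per-episode mean and variance; the authors' exact update rule is not restated here.

import numpy as np

class RunningNormalizer:
    """Running mean/variance of the observations for eq. (20), updated after
    each episode with the standard parallel-variance combination formula."""
    def __init__(self, obs_dim, eps=1e-8):
        self.mean = np.zeros(obs_dim)
        self.var = np.ones(obs_dim)
        self.count = eps

    def update(self, batch):                       # batch: (N, obs_dim) array
        b_mean, b_var, b_n = batch.mean(0), batch.var(0), batch.shape[0]
        delta = b_mean - self.mean
        total = self.count + b_n
        self.mean = self.mean + delta * b_n / total
        self.var = (self.var * self.count + b_var * b_n
                    + delta ** 2 * self.count * b_n / total) / total
        self.count = total

    def normalize(self, obs):                      # eq. (20)
        return (obs - self.mean) / np.sqrt(self.var + 1e-8)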

4. The proposed exploration strategy is deeply rooted in Kangin et al.'s (2019) [8], meaning that the new action η is sampled from a normal distribution (cf. equation (21)) whose mean is the output of the policy neural network and whose variance is obtained according to equation (22).

\eta \sim \mathcal{N}(\mu_\eta, \sigma^2_\eta)   (21)

\sigma^2_\eta = e^{\sigma^2_{log}}   (22)

Although similar, this strategy differs from the original in σ²_log (cf. equation (23)), which is directly influenced by the tracking error e_z through σ²_{log,tune}.

\sigma^2_{log} = \sigma^2_{log,train} + \sigma^2_{log,tune}(e_z)   (23)
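A sketch of the exploration strategy in equations (21) to (23); σ²_{log,tune} is represented below by a hypothetical monotone function of the tracking error, since its exact form is given in [2].

import numpy as np

def sample_action(mu_eta, sigma2_log_train, e_z, rng=np.random.default_rng()):
    """Exploration of eqs (21)-(23). sigma2_log_tune below is a hypothetical
    monotone function of the tracking error; the real one is given in [2]."""
    sigma2_log_tune = 0.1 * np.tanh(abs(e_z))          # placeholder tuning term
    sigma2_log = sigma2_log_train + sigma2_log_tune    # eq. (23)
    sigma_eta = np.sqrt(np.exp(sigma2_log))            # eq. (22)
    return rng.normal(mu_eta, sigma_eta)               # eq. (21)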


5. Equation (26) was used as the loss function of the policy parameters, modifying equation (12) in order to emphasize the need of reducing D_KL with a term that linearly penalizes D_KL and a quadratic term that aims at fine-tuning it so that it stays close to the trust region radius δ_TR, encouraging update steps as big as possible.

L_1 = \mathbb{E}_{s} \mathbb{E}_{\eta \sim \pi_{\theta_{old}}} \left[ GAE_{\pi_{\theta_{old}}}(s) \frac{\pi_{\theta_{new}}(\eta)}{\pi_{\theta_{old}}(\eta)} \right]   (24)

L_2 = \mathbb{E}_{\eta \sim \pi_{\theta_{old}}} \left[ D_{KL}(\pi_{\theta_{old}}(\eta), \pi_{\theta_{new}}(\eta)) \right]   (25)

L_P = -L_1 + \alpha \left( \max(0, L_2 - \delta_{TR}) \right)^2 + \beta L_2   (26)

Having the previously mentioned exploration strategy, π_θ is a Gaussian distribution over the continuous action space (cf. equation (27), where μ_η and σ_η are defined in equation (21)).

\pi_\theta(\eta) = \frac{1}{\sigma_\eta \sqrt{2\pi}} \, e^{-\frac{(\eta - \mu_\eta)^2}{2\sigma^2_\eta}}   (27)

Hence, L_1 (cf. equation (24)) is given by equation (28), where n is the number of samples in the training batch, assuming that all samples in the training batch are independent and identically distributed¹.

L_1 = \left[ \frac{1}{n} \sum_{i=1}^{n} GAE_{\pi_{\theta_{old}}}(s_i) \right] \cdot \prod_{i=1}^{n} \left[ \frac{\pi_{\theta_{new}}(\eta_i)}{\pi_{\theta_{old}}(\eta_i)} \right]   (28)

Moreover, L2 is given by equation (29).

L_2 = \frac{1}{n} \sum_{i=1}^{n} D_{KL}(\pi_{\theta_{old}}(\eta_i), \pi_{\theta_{new}}(\eta_i))   (29)
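A minimal sketch of the batch estimate of the policy loss (26), combining equations (28) and (29); α, β and δ_TR stand for the paper's hyperparameters, and the product of likelihood ratios is evaluated in log space for numerical stability.

import numpy as np

def policy_loss(gae_batch, logp_new, logp_old, kl_batch, alpha, beta, delta_tr):
    """Batch estimate of eq. (26) using eqs (28) and (29).
    gae_batch: GAE of each sampled state; logp_new/logp_old: log pi(eta_i) under
    the new/old policy; kl_batch: per-sample KL divergence."""
    ratio_prod = np.exp(np.sum(logp_new - logp_old))   # prod_i pi_new/pi_old, eq. (28)
    L1 = np.mean(gae_batch) * ratio_prod               # eq. (28)
    L2 = np.mean(kl_batch)                             # eq. (29)
    return -L1 + alpha * max(0.0, L2 - delta_tr) ** 2 + beta * L2   # eq. (26)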

6. ADAM was used as the optimizer of both NNs, given its wide success and acceptance as the default optimizer of most ML applications.

4.1.2 Hindsight Experience Replay

The present goal is not defined by achieving a certain final state, as Andrychowicz et al. (2018) [1] proposed, but, instead, a certain performance in the whole sequence of states that constitutes an episode. For this reason, choosing a different goal must mean, in this case, following a different reference signal. After collecting a full episode, those trajectories are replayed with new goals, which are sampled according to two different strategies. These strategies dictate the choice of the amplitudes of the two consecutive steps of each new reference signal.

¹ We can assume they are (i) independent because they are sampled from the replay buffer (stage 5 of algorithm 2, section 4.1.5), breaking the causality correlation that the temporal sequence could entail, and (ii) identically distributed because the exploration strategy is always the same and, therefore, the stochastic policy π_θ is always a Gaussian distribution over the action space (cf. equation (27)).


The first strategy - the mean strategy - consists of choosing the amplitudes of the steps of the command signal as the mean values of the measured acceleration during the first and second resting periods, respectively. Similarly, the second strategy - the final strategy - consists of choosing them as the last values of the measured signal during each resting period. Apart from the step amplitudes, all the other original parameters (Cortez et al. 2020 [2]) of the reference signal are kept.
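A sketch of the two relabelling strategies, assuming the measured acceleration a_z is logged per step together with boolean masks of the two resting periods; the names used here are illustrative.

import numpy as np

def her_step_amplitudes(a_z, rest1_mask, rest2_mask, strategy="mean"):
    """New goal (step amplitudes of the replayed reference signal) from the
    measured acceleration a_z, per the mean and final strategies; the masks
    select the first and second resting periods."""
    if strategy == "mean":
        return a_z[rest1_mask].mean(), a_z[rest2_mask].mean()
    return a_z[rest1_mask][-1], a_z[rest2_mask][-1]    # final strategy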

4.1.3 Balanced Prioritized Experience Replay

Being l_i the priority level of the experience collected in step i (cf. equation (30), with e_{u,max} = 0.01), N_j the number of steps with priority level j and ρ_j the proportion of steps with priority level j desired in the training datasets, BPER was implemented according to algorithm 1.

l_i = \begin{cases} 1 & \text{if } |e_z|_i < 0.5g \;\wedge\; |\eta|_i < \frac{|\eta|_{max}}{2} \;\wedge\; |e_u|_i < e_{u,max} \\ 0 & \text{otherwise} \end{cases}   (30)

Notice that P(i) and p_i follow the notation of Schaul et al. (2016) [16] (assuming α = 1), in which p_i = 1/rank(i) matches the rank-based prioritization, with rank(i) meaning the ordinal position of step i when all steps in the replay buffers are ordered by the magnitude of their temporal differences.

Algorithm 1: Balanced Prioritized Experience Replay

if N_1 < 0.25(N_0 + N_1) then
    ρ_1 = 0.25
    s_j = Σ_i p_i, ∀i : l_i = j, with j ∈ {0, 1}
else
    ρ_1 = 0.5
    s_0 = s_1 = Σ_i p_i, ∀i
end
ρ_0 = 1 − ρ_1

P(i) = ρ_1 · p_i / s_1 if l_i = 1, otherwise ρ_0 · p_i / s_0

As condensed in algorithm 1, when less than 25% of the steps in the replay buffers are successful, the successful and unsuccessful subsets of the replay buffers are sampled separately, with 25% of the training dataset coming from the successful subset. In a later phase of training, when more than 25% of the steps are already successful, both subsets are merged. In either case, sampling is always done according to the temporal differences, i.e., a step with a higher temporal difference has a higher chance of being sampled.
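A sketch of the sampling rule of Algorithm 1, with rank-based priorities p_i = 1/rank(i) computed from the temporal differences stored in the buffer; the final normalisation simply turns P(i) into a proper distribution for sampling.

import numpy as np

def bper_probabilities(td_errors, labels):
    """Sampling probabilities P(i) of Algorithm 1. td_errors are the temporal
    differences stored in the buffer, labels the priority levels l_i of eq. (30)."""
    td = np.abs(np.asarray(td_errors, dtype=float))
    labels = np.asarray(labels)
    ranks = np.empty(len(td))
    ranks[np.argsort(-td)] = np.arange(1, len(td) + 1)
    p = 1.0 / ranks                                    # rank-based priority, alpha = 1
    n1, n0 = np.sum(labels == 1), np.sum(labels == 0)
    if n1 < 0.25 * (n0 + n1):                          # successful steps under-represented
        rho1 = 0.25
        s1, s0 = p[labels == 1].sum(), p[labels == 0].sum()
    else:                                              # both subsets merged
        rho1 = 0.5
        s1 = s0 = p.sum()
    rho0 = 1.0 - rho1
    P = np.where(labels == 1, rho1 * p / s1, rho0 * p / s0)
    return P / P.sum()                                 # normalise to a proper distribution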

4.1.4 Scheduled Experience Replay

Making HER (cf. section 4.1.2) and BPER (cf. section 4.1.3) dependent on a condition - the SER condition - is hereby defined as Scheduled Experience Replay (SER). The SER condition is exemplified in equation (31), where e_{z,past} stands for the mean tracking error of the previously collected episode.


e_{z,past} \leq 2g   (31)

Contrarily to its original context (cf. section 3.3), the reward function is not sparse and the agent was already able to achieve near-target performance without HER. The hypothesis, in this case, is that HER can be a complementary feature, activated only when the agent converges to suboptimal policies.

Moreover, without HER, BPER adds less benefit, since there is no special reason for the agent to believe that some part of the collected experience is more significant than another.

4.1.5 Algorithm Description

Algorithm 2: Implemented TRPO with a Replay Buffer and SER

Initialization
while training do
    1. Collect experience, sampling actions from policy π
    2. Augment the collected experience with synthetic successful episodes (SER)
    3. (a_t, s_t, r_t) ← V′(s_t) and GAE(s_t) for all (a_t, s_t, r_t) in T, for all T in B
    4. Store the newly collected experience in the replay buffer
    5. Sample the training sets from the replay buffer (SER)
    6. Update all value and policy parameters
    7. Update trust region
end

1. One batch B of trajectories T is collected.
2. If the SER condition (cf. section 4.1.4) holds, B is augmented according to HER (cf. section 4.1.2).
3. The targets for the Value NN, V′(s_t), and the GAE(s_t) are computed and added to the trajectories.
4. B is stored in the replay buffer, discarding the oldest batch: the replay buffer R contains data collected from the last policies and works as a FIFO queue.
5. If the SER condition (cf. section 4.1.4) holds, the training dataset is sampled from R according to BPER (cf. section 4.1.3). Otherwise, the entire information available in R is used as the training dataset.
6. The value parameters and the policy parameters are updated.
7. The trust region parameters are updated.
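Putting the pieces together, a schematic sketch of the loop above; every component passed in (collect_batch, her_augment, add_targets, sample_bper, update_networks, update_trust_region, mean_tracking_error) is a placeholder for the corresponding element described in the previous sections.

def train(collect_batch, her_augment, add_targets, replay_buffer, sample_bper,
          update_networks, update_trust_region, mean_tracking_error,
          ser_threshold=2.0, episodes=5000):
    """Schematic version of Algorithm 2; all components are supplied as callables."""
    e_z_past = float("inf")
    for _ in range(episodes):
        batch = collect_batch()                            # step 1
        ser_active = e_z_past <= ser_threshold             # SER condition, eq. (31)
        if ser_active:
            batch = batch + her_augment(batch)             # step 2: HER
        add_targets(batch)                                 # step 3: V' targets and GAE
        replay_buffer.store(batch)                         # step 4: FIFO of last batches
        train_set = (sample_bper(replay_buffer) if ser_active
                     else replay_buffer.all())             # step 5: BPER or full buffer
        update_networks(train_set)                         # step 6
        update_trust_region()                              # step 7
        e_z_past = mean_tracking_error(batch)              # for the next SER check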

4.2 Methodology

As further detailed in [2], the established methodology (i) progressively increases the amplitude of the randomly generated command signal and (ii) periodically tests the agent's performance against a -10g/10g double step without exploration, in order, respectively, (i) to avoid overfitting and (ii) to decide whether or not to finish the training process.


4.3 Robustness Assessments

The robustness of an RL agent composed of neural networks cannot be formally guaranteed in the same terms as that of linear controllers. It was, hence, evaluated by deploying the nominal agent in non-nominal environments. Apart from testing its performance, the hypothesis is also that training the agent in the latter can improve its robustness. To do so, the training of the best found nominal agent was resumed in the presence of non-nominalities (cf. section 4.3.1). This training and its resulting best found agent are henceforward called robustifying training and robustified agent, respectively.

4.3.1 Robustifying Trainings

Three different modifications were separately made to the provided model (cf. section 2), in order to obtain three different non-nominal environments, each of them modelling latency, estimation uncertainty in the nominal estimated Mach number M_nom and height h_nom (cf. equations (32) and (33)), or parametric uncertainty in the nominal aerodynamic coefficients C_{z,nom} and C_{m,nom} (cf. equations (34) and (35)).

M = (1 + \Delta M) \times M_{nom}   (32)

h = (1 + \Delta h) \times h_{nom}   (33)

C_z = (1 + \Delta C_z) \times C_{z,nom}   (34)

C_m = (1 + \Delta C_m) \times C_{m,nom}   (35)

In the case of non-nominal environments including latency, the range of possible values is [0, l_max] ∩ ℕ₀, whereas in the other cases, the uncertainty is assumed to be normally distributed, meaning that, following Peter's (2018) [15] line of thought, its domain is [μ − 3σ, μ + 3σ]².

Before each new episode of a robustifying training, the new value of the non-nominality (either the latency or one of the uncertainties) was sampled from a uniform distribution over its domain and kept constant during the entire episode (a minimal sketch of this sampling is given after the list below). In each case, four different values were tried for the bounds of the range of possibilities (either l_max or 3σ):

1. l_max ∈ {1, 3, 5, 10} [ms]
2. (3σ_estimation) ∈ {1, 2, 3, 5} [%]
3. (3σ_parametric) ∈ {5, 7, 10, 15} [%]
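A minimal sketch of this per-episode sampling, assuming latency is drawn in whole milliseconds and the uncertainties as fractions to be applied in equations (32) to (35).

import numpy as np

rng = np.random.default_rng()

def sample_nonnominality(kind, bound):
    """Per-episode non-nominality, kept constant for the whole episode.
    kind: 'latency' with bound = l_max [ms], or 'estimation'/'parametric'
    with bound = 3*sigma expressed as a fraction (e.g. 0.03 for 3 %)."""
    if kind == "latency":
        return int(rng.integers(0, bound + 1))      # l in [0, l_max], whole milliseconds
    return float(rng.uniform(-bound, bound))        # Delta in [-3sigma, +3sigma]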

Values whose robustifying training had diverged after 2500 episodes were discarded. The remaining ones were run for a total of 5000 episodes (cf. section 5.4).

² To be accurate, this interval covers only 99.73% of the possible values, but it is assumed to be the whole spectrum of possible values.


5 Results

5.1 Expected Results

Empirically, it has been seen that agents that struggle to keep η's magnitude within its bounds (cf. table 1) are unable to achieve low error levels without increasing the level of noise, if they can do it at all. In other words, the only way of having good tracking results with unbounded actions is to fall into a bang-bang control-like situation. Therefore, the best found agent will have to start by learning to use only bounded and smooth action values, which will then allow it to start decreasing the tracking error.

Notice that, while the tracking error and the noise measures are expected to tend to 0, η's magnitude is not, since that would mean that the agent had given up actuating on the environment. Hence, it is expected that, at some point in training, the latter stabilizes.

Moreover, the exploration strategy proposed in section 4.1.1 inserts some variance into the output of the policy neural network, meaning that the noise measures are also not expected to reach exactly 0.

5.2 Best Found Agent

Fig. 3 Test performance of the Best Found Agent: (a) tracking performance; (b) action η

As figure 3 evidences, the agent is clearly able to control the measured acceleration and to track the reference signal it is fed with, satisfying all the performance requirements defined in table 1 (cf. table 2).

5.3 Reproducibility Assessments

Figure 4 shows the tests of the best agents obtained during each of the nine trainings run to assess the reproducibility of the best found nominal agent (cf. section 5.2). It is possible to see that none of them achieved the target performance, i.e., none meets all the performance criteria established in table 1.


Table 2 Performance achieved by the Best Found Nominal Agent

Requirement     Achieved Value
|e_z|max,r      0.4214 [g]
Overshoot       8.480 [%]
|η|max          0.1052 [rad]
η_noise,r       0.04222 [rad]
η_noise,t       0.005513 [rad]

Fig. 4 Nominal test performance of the Reproducibility Trials: (a) tracking performance; (b) action η

Although most of the trials can be considered far from the initial random policy, trials 7 and 8 present a very poor tracking performance and trial 2's action signal is not smooth.

5.4 Robustness Assessments

5.4.1 Latency

As figure 5 shows, of the four different values of l_max, only the 5 ms one converged after 2500 episodes, meaning that it was the only robustifying training run for a total of 5000 episodes, during which the best agent found was defined as the Latency Robustified Agent. The nominal performance (cf. figure 6) is damaged, having a less stable action signal and a poorer tracking performance.

Furthermore, its performance in environments with latency (0 ms to 40 ms) did not provide any enhancement, as its success rate is low in all quantities of interest (cf. table 3).

Since the Latency Robustified Agent was worse than the Nominal Agent in both nominal and non-nominal environments, it lost in both the Performance and Robustness categories. Thus, it is possible to say, in general terms, that, concerning latency, the robustifying trainings failed.


Fig. 5 Mean tracking error of latency robustifying trainings

Fig. 6 Nominal Performance of the Latency Robustified Agent: (a) tracking performance; (b) action η

Table 3 Robustified Agent success rate in improving the robustness to Latency of the Nominal Agent

Requirement     Success %
|e_z|max,r      0.00
Overshoot       25.00
η_max           0.00
η_noise,r       7.50
η_noise,t       5.00

5.5 Estimation Uncertainty

As figure 7 shows, of the four different values of 3σ tried, only the 3% one converged after 2500 episodes, meaning that it was the only robustifying training run for a total of 5000 episodes, during which the best agent found was defined as the Estimation Uncertainty Robustified Agent. The nominal performance is improved, having a lower overshoot and a smoother action signal.


Fig. 7 Mean tracking error of estimation uncertainty robustifying trainings

Fig. 8 Nominal Performance of the Estimation Uncertainty Robustified Agent: (a) tracking performance; (b) action η

Furthermore, it enhanced the performance of the Nominal Agent in environments with estimation uncertainty (-10% to 10%), having achieved high success rates (cf. table 4).

Table 4 Robustified Agent success rate in improving the robustness to Estimation Uncertainty of the Nominal Agent

Requirement     Success %
|e_z|max,r      60.38
Overshoot       66.39
η_max           91.97
η_noise,r       77.51
η_noise,t       82.33

Since the Estimation Uncertainty Robustified Agent was better than the Nominal Agent in both nominal and non-nominal environments, it won in both the Performance and Robustness categories. Thus, it is possible to say, in general terms, that, concerning estimation uncertainty, the robustifying trainings succeeded.

5.6 Parametric Uncertainty

Fig. 9 Mean tracking error of parametric uncertainty robustifying trainings

As figure 9 shows, of the four different values of 3σ tried, only the 5% one converged after 2500 episodes, meaning that it was the only robustifying training run for a total of 5000 episodes, during which the best agent found was defined as the Parametric Uncertainty Robustified Agent. The nominal tracking performance is damaged, but the action signal is smoother (cf. figure 10).

Fig. 10 Nominal Performance of the Parametric Uncertainty Robustified Agent: (a) tracking performance; (b) action η

Furthermore, it enhanced the performance of the Nominal Agent in environments with parametric uncertainty (-40% to 40%), having achieved high success rates (cf. table 5).


Table 5 Robustified Agent success rate in improving the robustness to Parametric Uncertainty of the Nominal Agent

Requirement     Success %
|e_z|max,r      61.55
Overshoot       83.16
η_max           98.28
η_noise,r       100.00
η_noise,t       99.31

Since the Parametric Uncertainty Robustified Agent was better than the Nominal Agent in non-nominal environments, it won in the Robustness category, while remaining acceptably close to the Nominal Agent in terms of performance in nominal environments. Thus, it is possible to say, in general terms, that, concerning parametric uncertainty, the robustifying trainings succeeded.

6 Achievements

The proposed algorithm has been considered successful, since all the objectives established in section 1 were accomplished, confirming the motivations put forth in section 1. Three main achievements must be highlighted:

1. the nominal target performance (cf. section 5.2) achieved by the proposed algorithm with the nonlinear model of the dynamic system (cf. section 2);

2. the ability of SER (cf. section 4.1.4) to boost a performance that had previously converged suboptimally;

3. the very sound success rates in surpassing the performance achieved by the best found nominal agent (cf. sections 5.5 and 5.6). RL has been confirmed as a promising learning framework for real-life applications, where the concept of Robustifying Trainings can bridge the gap between training the agent in the nominal environment and deploying it in reality.

7 Future Work

The first direction of future work is to expand the current task to the control of the whole nonlinear flight dynamics of the GSAM, instead of solely the longitudinal one. Such an expansion would require both (i) straightforward modifications to the code and to the training methodology and (ii) addressing some conceptual challenges, concerning the expansion of the reward function and of the exploration strategy.

Secondly, it would be interesting to investigate how to tackle the main challenges faced during the design of the proposed algorithm, namely (i) how to avoid the time-consuming reward-engineering process, (ii) the definition of the exploration strategy and (iii) the reproducibility issue.


References

1. Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight Experience Replay. In Advances in Neural Information Processing Systems, 2017.

2. Bernardo Cortez. Reinforcement Learning for Robust Missile Autopilot Design. Technical report, Instituto Superior Tecnico, Lisboa, 12 2020.

3. Scott Fujimoto, Herke Van Hoof, and David Meger. Addressing Function Approximation Error in Actor-Critic Methods. 35th International Conference on Machine Learning, ICML 2018, 4:2587–2601, 2018.

4. Shixiang Gu, Timothy Lillicrap, Zoubin Ghahramani, Richard E. Turner, and Sergey Levine. Q-Prop: Sample-efficient policy gradient with an off-policy critic. 5th International Conference on Learning Representations, ICLR 2017 - Conference Track Proceedings, pages 1–13, 2019.

5. Shixiang Gu, Timothy Lillicrap, Zoubin Ghahramani, Richard E. Turner, Bernhard Scholkopf, and Sergey Levine. Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. Advances in Neural Information Processing Systems, pages 3847–3856, 2017.

6. Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. 34th International Conference on Machine Learning, ICML 2017, 3:2171–2186, 2017.

7. Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. 35th International Conference on Machine Learning, ICML 2018, 5:2976–2989, 2018.

8. Dmitry Kangin and Nicolas Pugeault. On-Policy Trust Region Policy Optimisation with Replay Buffers. pages 1–13, 2019.

9. Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. 4th International Conference on Learning Representations, ICLR 2016 - Conference Track Proceedings, 2016.

10. James Martens and Roger Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. 32nd International Conference on Machine Learning, ICML 2015, 3:2398–2407, 2015.

11. Ofir Nachum, Mohammad Norouzi, George Tucker, and Dale Schuurmans. Smoothed action value functions for learning Gaussian policies. 35th International Conference on Machine Learning, ICML 2018, 8:5941–5953, 2018.

12. Ofir Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans. Trust-PCL: An off-policy trust region method for continuous control. 6th International Conference on Learning Representations, ICLR 2018 - Conference Track Proceedings, pages 1–14, 2018.

13. Karthik Narasimhan, Tejas D Kulkarni, and Regina Barzilay. Language Understanding for Text-based Games using Deep Reinforcement Learning. Technical report, Massachusetts Institute of Technology, 2015.

14. Brendan O'Donoghue, Remi Munos, Koray Kavukcuoglu, and Volodymyr Mnih. Combining policy gradient and Q-learning. 5th International Conference on Learning Representations, ICLR 2017 - Conference Track Proceedings, pages 1–15, 2019.

15. Florian Peter. Nonlinear and Adaptive Missile Autopilot Design. PhD thesis, Technische Universitat Munchen, Munich, 5 2018.

16. Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. 4th International Conference on Learning Representations, ICLR 2016 - Conference Track Proceedings, pages 1–21, 2016.

17. John Schulman, Sergey Levine, Philipp Moritz, Michael Jordan, and Pieter Abbeel. Trust region policy optimization. 32nd International Conference on Machine Learning, ICML 2015, 3:1889–1897, 2015.

18. John Schulman, Philipp Moritz, Sergey Levine, Michael I. Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. 4th International Conference on Learning Representations, ICLR 2016 - Conference Track Proceedings, pages 1–14, 2016.

19. John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal Policy Optimization Algorithms. pages 1–12, 2017.

20. Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, 2nd edition, 2018.


21. Yuhuai Wu, Elman Mansimov, Shun Liao, Roger Grosse, and Jimmy Ba. Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. Advances in Neural Information Processing Systems, pages 5280–5289, 2017.