
Research Article
Improving Maneuver Strategy in Air Combat by Alternate Freeze Games with a Deep Reinforcement Learning Algorithm

Zhuang Wang,1 Hui Li,1,2 Haolin Wu,1 and Zhaoxin Wu2

1College of Computer Science, Sichuan University, Chengdu, Sichuan 610065, China
2National Key Laboratory of Fundamental Science on Synthetic Vision, Sichuan University, Chengdu, Sichuan 610065, China

Correspondence should be addressed to Hui Li; lihuib@scu.edu.cn

Received 11 April 2020; Accepted 9 June 2020; Published 30 June 2020

Academic Editor: Ramon Sancibrian

Copyright © 2020 Zhuang Wang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

In a one-on-one air combat game, the opponent's maneuver strategy is usually not deterministic, which leads us to consider a variety of opponent strategies when designing our maneuver strategy. In this paper, an alternate freeze game framework based on deep reinforcement learning is proposed to generate the maneuver strategy in an air combat pursuit. The maneuver strategy agents for aircraft guidance of both sides are designed in a flight level with fixed velocity and the one-on-one air combat scenario. Middleware which connects the agents and air combat simulation software is developed to provide a reinforcement learning environment for agent training. A reward shaping approach is used, by which the training speed is increased and the performance of the generated trajectory is improved. Agents are trained by alternate freeze games with a deep reinforcement learning algorithm to deal with nonstationarity. A league system is adopted to avoid the red queen effect in the game where both sides implement adaptive strategies. Simulation results show that the proposed approach can be applied to maneuver guidance in air combat, and typical angle fight tactics can be learnt by the deep reinforcement learning agents. For the training of an opponent with the adaptive strategy, the winning rate can reach more than 50%, and the losing rate can be reduced to less than 15%. In a competition with all opponents, the winning rate of the strategic agent selected by the league system is more than 44%, and the probability of not losing is about 75%.

1. Introduction

Despite long-range radar and missile technology improvements, there is still a scenario in which two fighter aircrafts may not detect each other until they are within visual range. Therefore, modern fighters are designed for close combat, and military pilots are trained in air combat basic fighter maneuvering (BFM). Pursuit is a kind of BFM, which aims to control an aircraft to reach a position of advantage when it is fighting against another aircraft [1].

In order to reduce the workload of pilots and remove the need to provide them with complex spatial orientation information, many research studies focus on autonomous air combat maneuver decisions. Toubman et al. [2–5] used rule-based dynamic scripting in one-on-one, two-on-one, and two-on-two air combat, which requires hard coding the air-combat tactics into a maneuver selection algorithm. A virtual pursuit point-based combat maneuver guidance law for an unmanned combat aerial vehicle (UCAV) is presented and is used in X-Plane-based nonlinear six-degrees-of-freedom combat simulation [6, 7]. Eklund et al. [8, 9] presented a nonlinear, online model predictive controller for pursuit and evasion of two fixed-wing autonomous aircrafts, which relies on previous knowledge of the maneuvers.

Game-theoretic-based approaches are widely used in the automation of air combat pursuit maneuvers. Austin et al. [10, 11] proposed a matrix game approach to generate intelligent maneuver decisions for one-on-one air combat. A limited search method is adopted over discrete maneuver choices to maximize a scoring function, and the feasibility of real-time autonomous combat is demonstrated in simulation. Ma et al. [12] formulated the cooperative occupancy decision-making problem in air combat as a zero-sum matrix game and designed the double-oracle combined algorithm with neighborhood search to solve the model. In [13, 14], the air combat game is regarded as a Markov game, and then the pursuit maneuver strategy is solved by computing its Nash equilibrium. This approach solves the problem that the matrix game cannot deal with continuous multiple states; however, it is only suitable for rational opponents. An optimal pursuit-evasion fighter maneuver is formulated as a differential game and then solved by nonlinear programming, which is complex and requires an enormous amount of calculation [15].

Other approaches for pursuit maneuver strategy generation include the influence diagram, the genetic algorithm, and approximate dynamic programming. The influence diagram is used to model sequential maneuver decisions in air combat, and high-performance simulation results are obtained [16–18]. However, this approach is difficult to apply in practice, since the influence diagram is converted into a nonlinear programming problem, which cannot meet the demand of fast computation during air combat. In [19], a genetics-based machine learning algorithm is implemented to generate high angle-of-attack air combat maneuver tactics for the X-31 fighter aircraft in a one-on-one air combat scenario. Approximate dynamic programming (ADP) can be employed to solve the air combat pursuit maneuver decision problem quickly and effectively [20, 21]. By controlling the roll rate, it can provide a fast maneuver response in a changing situation.

Most of the previous studies have used various algorithms to solve the pursuit maneuver problem and have obtained some satisfactory results, whereas two problems still exist. One is that previous studies have assumed the maneuver strategy of the opponent is deterministic or generated by a fixed algorithm; however, in realistic situations, it is difficult for these approaches to deal with the flexible strategies adopted by different opponents. The other is that traditional algorithms rely heavily on prior knowledge and have high computational complexity, so they cannot adapt to the rapidly changing situation in air combat. Since UCAVs have received growing interest worldwide and the flight control system of the aircraft is developing rapidly towards intellectualization [22], it is necessary to do research on intelligent maneuver strategies for combat between UCAVs and manned aircraft.

In this paper, a deep reinforcement learning- (DRL-) [23] based alternate freeze game approach is proposed to train guidance agents which can provide maneuver instructions in air combat. Using alternate freeze games, in each training period one agent is learning while its opponent is frozen. DRL is a type of artificial intelligence which combines reinforcement learning (RL) [24] and deep learning (DL) [25]. RL allows an agent to learn directly from the environment through trial and error without perfect knowledge of the environment in advance. A well-trained DRL agent can automatically determine an adequate behavior within a specific context, trying to maximize its performance using little computational time. The theory of DRL is very suitable for solving sequential decision-making problems such as maneuver guidance in air combat.

DRL has been utilized in many decision-making fields, such as video games [26, 27], board games [28, 29], and robot control [30], and has obtained great achievements of human-level or superhuman performance. For aircraft guidance research, Waldock et al. [31] proposed a DQN method to generate a trajectory to perform perched landing on the ground. Alejandro et al. [32] proposed a DRL strategy for autonomous landing of a UAV on a moving platform. Lou and Guo [33] presented a uniform framework of adaptive control by using policy-searching algorithms for a quadrotor. An RL agent is developed for guiding a powered UAV from one thermal location to another by controlling the bank angle [34]. For air combat research, You et al. [35] developed an innovative framework for cognitive electronic warfare tasks by using a DRL algorithm without prior information. Luo et al. [36] proposed a Q-learning-based air combat target assignment algorithm, which avoids relying on prior knowledge and performs well.

Reward shaping is a method of incorporating domain knowledge into RL so that the algorithms are guided faster towards more promising solutions [37]. Reward shaping is widely adopted in the RL community, and it is also used in aircraft planning and control. In [4], a reward function is proposed to remedy the false rewards and punishments for firing air combat missiles, which allows computer-generated forces (CGFs) to generate more intelligent behavior. Tumer and Agogino [38] proposed difference reward functions in a multiagent air traffic system and showed that agents can manage effective route selection and significantly reduce congestion. In [39], two types of reward functions are developed to solve ground holding and air holding problems, which assist air traffic controllers in maintaining a high standard of safety and fairness between airlines.

Previous research studies have shown the benefits of using DRL to solve the air maneuver guidance problem. However, there are still two problems in applying DRL to air combat. One is that previous studies did not have a specific environment for air combat, either developing an environment based on universal ones [32] or using a discrete grid world, which do not have the function of air combat simulation.

The other problem is that classic DRL algorithms are almost all one-sided optimization algorithms, which can only guide an aircraft to a fixed location or a regularly moving destination. However, the opponent in air combat has diverse and variable maneuver strategies, which are nonstationary, and the classic DRL algorithms cannot deal with them. Hernandez-Leal et al. reviewed how the nonstationarity is modelled and addressed by state-of-the-art multiagent learning algorithms [40]. Some researchers combine MDPs with game theory to study reinforcement learning in stochastic games to solve the nonstationarity problem [41–43], while there are still two limitations. One is to assume the opponent's strategy is rational: in each training step, the opponent chooses the optimal action, so a trained agent can only deal with a rational opponent but cannot fight against a nonrational one. The other is to use linear programming or quadratic programming to calculate the Nash equilibrium, which leads to a huge amount of calculation. The Minimax-DQN [44] algorithm, which combines DQN and Minimax-Q learning [41] for the two-player zero-sum Markov game, is proposed in a recent paper. Although it can be applied to complex games, it still can only deal with rational opponents and needs to use linear programming to calculate the Q value.

The main contributions of this paper can be summarized as follows:

(i) Middleware is designed and developed, which makes specific software for air combat simulation usable as an RL environment. An agent can be trained in this environment and then used to guide an aircraft to reach the position of advantage in one-on-one air combat.

(ii) Guidance agents for air combat pursuit maneuvers of both sides are designed. The reward shaping method is adopted to improve the convergence speed of training and the performance of the maneuver guidance strategies. An agent is trained in an environment where its opponent is also an adaptive agent, so that the well-trained agent has the ability to fight against an opponent with an intelligent strategy.

(iii) An alternate freeze game framework is proposed to deal with nonstationarity, which can be used to solve the problem of variable opponent strategies in RL. The league system is adopted to select the agent with the highest performance to avoid the red queen effect [45].

This paper is organized as follows. Section 2 introduces the maneuver guidance strategy problem of one-on-one air combat under consideration and the DRL-based model as well as the training environment to solve it. The training optimization, which includes reward shaping and alternate freeze games, is presented in Section 3. The simulation and results of the proposed approach are given in Section 4. The final section concludes the paper.

2. Problem Formulation

In this section, the problem of maneuver guidance in air combat is introduced. The training environment is designed, and the DRL-based model is set up to solve it.

2.1. Problem Statement. The problem solved in this paper is a one-on-one air combat pursuit-evasion maneuver problem. The red aircraft is regarded as our side and the blue one as an enemy. The objective of the guidance agent is to learn a maneuver strategy (policy) to guide the aircraft from its current position to a position of advantage and maintain the advantage. At the same time, it should guide the aircraft not to enter the advantage position of its opponent.

The dynamics of the aircraft is described by a point-mass model [17] and is given by the following differential equations:

    ẋ = v cos θ cos ψ,
    ẏ = v cos θ sin ψ,
    ż = v sin θ,
    θ̇ = [(L + T sin β) cos φ − mg cos θ] / (mv),
    ψ̇ = [(L + T sin β) sin φ] / (mv cos θ),
    v̇ = (T cos β − D) / m − g sin θ,                                (1)

where (x, y, z) are the 3-dimensional coordinates of the aircraft. The terms v, θ, and ψ are the speed, flight path angle, and heading angle of the aircraft, respectively. The mass of the aircraft and the acceleration due to gravity are denoted by m and g, respectively. The three forces are the lift force L, drag force D, and thrust force T. The remaining two variables are the angle of attack β and the bank angle φ, as shown in Figure 1.

In this paper, the aircraft is assumed to fly at a fixed velocity in the horizontal plane, and the assumption can be written as follows:

    β = 0,
    θ = 0,
    T = D,
    L = mg / cos φ.                                                 (2)

The equations of motion for the aircraft are simplified as follows:

    x_{t+δt} = x_t + v δt cos(ψ_{t+δt}),
    y_{t+δt} = y_t + v δt sin(ψ_{t+δt}),
    ψ_{t+δt} = ψ_t + ψ̇_{t+δt} δt,
    ψ̇_{t+δt} = (g / v) tan(φ_{t+δt}),
    φ_{t+δt} = φ_t + A_t φ̇_t δt,                                    (3)

where v, φ̇_t, φ_t, ψ̇_t, and ψ_t are the speed, roll rate, bank angle, turn rate, and heading angle of the aircraft at time t, respectively. The term A_t is the action generated by the guidance agent, and the control actions available to the aircraft are a ∈ {roll-left, maintain-bank-angle, roll-right}. The simulation time step is denoted by δt.
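To make the simplified model concrete, the following Python sketch advances the state of (3) by one simulation step. It is a minimal sketch, assuming the maximum roll angle and roll rate of Table 1; the function name, the bank-angle clipping, and the default parameter values are illustrative assumptions, not the authors' code.

    import math

    def step_kinematics(x, y, psi, phi, action, v=200.0, dt=0.1,
                        roll_rate=math.radians(40), max_bank=math.radians(75), g=9.81):
        """Advance the fixed-velocity, level-flight model of equation (3) by one step dt.

        action is -1 (roll-left), 0 (maintain-bank-angle), or +1 (roll-right)."""
        # Update the bank angle from the commanded roll direction and limit it.
        phi = max(-max_bank, min(max_bank, phi + action * roll_rate * dt))
        # Turn rate from the coordinated-turn relation psi_dot = (g / v) * tan(phi).
        psi_dot = (g / v) * math.tan(phi)
        psi = psi + psi_dot * dt
        # Position update with the new heading, as in equation (3).
        x = x + v * dt * math.cos(psi)
        y = y + v * dt * math.sin(psi)
        return x, y, psi, phi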


In aircraft maneuver strategy design, maneuvers in a fixed plane are usually used to measure performance. Figure 2 shows a one-on-one air combat scenario [20] in which each aircraft flies at a fixed velocity in the X-Y plane under a maneuver strategy at time t. The position of advantage is the one from which an aircraft gains opportunities to fire at its opponent. It is defined as

    d_min < d_t < d_max,
    |μ^r_t| < μ_max,
    |η^b_t| < η_max,                                                (4)

where superscripts r and b refer to red and blue, respectively. The term d_t is the distance between the aircraft and the target at time t, μ^r_t is the deviation angle, which is defined as the angle difference between the line-of-sight vector and the heading of our aircraft, and η^b_t is the aspect angle, which is between the longitudinal symmetry axis (to the tail direction) of the target plane and the connecting line from the target plane's tail to the attacking plane's nose. The terms d_min, d_max, μ_max, and η_max are thresholds, where the subscripts max and min represent the upper and lower bounds of the corresponding variables, respectively, which are determined by the combat mission and the performance of the aircraft.

The expression d_min < d_t < d_max makes sure that the opponent is within the attacking range of the air-to-air weapon of the aircraft. The expression |μ^r_t| < μ_max refers to an area from which it is difficult for the blue aircraft to escape sensor locking. The expression |η^b_t| < η_max defines an area where the killing probability is high when attacking from the rear of the blue aircraft.
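A minimal sketch of how condition (4) could be checked in the planar setting, assuming the threshold values of Table 1; the function names and the angle conventions (angles wrapped to [−π, π]) are illustrative assumptions rather than the authors' implementation.

    import math

    def wrap_pi(angle):
        """Wrap an angle to [-pi, pi]."""
        return (angle + math.pi) % (2.0 * math.pi) - math.pi

    def in_position_of_advantage(red, blue, d_min=100.0, d_max=500.0,
                                 mu_max=math.radians(60), eta_max=math.radians(30)):
        """Check condition (4) for the red aircraft; red and blue are (x, y, psi) tuples."""
        dx, dy = blue[0] - red[0], blue[1] - red[1]
        d = math.hypot(dx, dy)                 # distance d_t
        los = math.atan2(dy, dx)               # line-of-sight direction from red to blue
        mu = abs(wrap_pi(los - red[2]))        # deviation angle |mu_t^r|
        eta = abs(wrap_pi(los - blue[2]))      # aspect angle |eta_t^b| (0 when attacking from the rear)
        return (d_min < d < d_max) and (mu < mu_max) and (eta < eta_max)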

2.2. DRL-Based Model. An RL agent interacts with an environment over time and aims to maximize the long-term reward [24]. At each training time t, the agent receives a state S_t in a state space S and generates an action A_t from an action space A following a policy π: S × A → R. Then the agent receives a scalar reward R_t and transitions to the next state S_{t+1} according to the environment dynamics, as shown in Figure 3.

A value function V(s) is defined to evaluate the air combat advantage of each state, which is the expectation of the discounted cumulated rewards on all states following time t:

    V(s) = E_π{ R_t + γR_{t+1} + γ²R_{t+2} + ⋯ | S_t = s },          (5)

where γ ∈ [0, 1] is the discount factor and policy π(s, a) is a mapping from the state space to the action space.

The agent guides an aircraft to maneuver by changing its bank angle φ using the roll rate φ̇. The action space of DRL is {−1, 0, 1}, and A_t can take one of three values at time t, which mean roll-left, maintain-bank-angle, and roll-right, respectively. The position of the aircraft is updated according to (3). The action-value function Q(s, a) can be used, which is defined as the value of taking action a in state s under a policy π:

    Q(s, a) = E_π{ Σ_{k=0}^{∞} γ^k R_{t+k} | S_t = s, A_t = a }.      (6)

The Bellman optimality equation is

    Q*(s, a) = E{ R_t + γ max_{a′} Q*(S_{t+1}, a′) | S_t = s, A_t = a }.   (7)

Any policy that is greedy with respect to the optimal evaluation function Q*(s, a) is an optimal policy. Actually, Q*(s, a) can be obtained through iterations using temporal-difference learning, and its update formula is defined as

Figure 2: Aircraft maneuver guidance problem in air combat.

Figure 1: Aircraft dynamics parameters.

Figure 3: Reinforcement learning in air combat.


    Q(S_t, A_t) ← Q(S_t, A_t) + α( R_t + γ max_a Q(S_{t+1}, a) − Q(S_t, A_t) ),   (8)

where Q(S_t, A_t) is the estimated action-value function and α is the learning rate.
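As a minimal tabular illustration of update (8) (the dictionary-based Q table, the function name, and the step size value are assumptions; the paper replaces the table with a neural network, as discussed below):

    def td_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
        """One tabular temporal-difference update of equation (8)."""
        best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))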

In air combat maneuvering applications, the reward is sparse: only at the end of a game (the terminal state), according to the result of winning, losing, or drawing, is a reward value given. This forces learning algorithms to use delayed feedback to determine the long-term consequences of their actions in all nonterminal states, which is time consuming. Reward shaping was proposed by Ng et al. [46] to introduce additional rewards into the learning process, which guide the algorithm towards learning a good policy faster. The shaping reward function F(S_t, A_t, S_{t+1}) has the form

    F(S_t, A_t, S_{t+1}) = γΦ(S_{t+1}) − Φ(S_t),                     (9)

where Φ(S_t) is a real-valued function over states. It can be proven that the final policy after using reward shaping is equivalent to the final policy without it [46]. R_t in the previous equations can be replaced by R_t + F(S_t, A_t, S_{t+1}).

Taking the red agent as an example, the state space of one guidance agent can be described with a vector:

    S_t = { x^r_t, y^r_t, ψ^r_t, φ^r_t, x^b_t, y^b_t, ψ^b_t }.       (10)

The state that an agent receives in one training step includes the position, heading angle, and bank angle of itself and the position and heading angle of its opponent, which is a 7-dimensional vector over an infinite state space. Therefore, a function approximation method should be used to combine features of the state with learned weights. DRL is a solution to address this problem; great achievements have been made using advanced DRL algorithms such as DQN [26], DDPG [47], A3C [48], and PPO [49]. Since the action of aircraft guidance is discrete, the DQN algorithm can be used. The action-value function can be estimated with function approximation such that Q̂(s, a; w) ≈ Q_π(s, a), where w are the weights of a neural network.

The DQN algorithm is executed according to the training time step. However, there is not only a training time step but also a simulation time step. Because of this inconsistency, the DQN algorithm needs to be adapted to train a guidance agent. The pseudo-code of DQN for aircraft maneuver guidance training in air combat is shown in Algorithm 1. At each training step, the agent receives information about the aircraft and its opponent and then generates an action and sends it to the environment. After receiving the action, the aircraft in the environment is maneuvered according to this action in each simulation step.
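A compressed sketch of this nested training/simulation loop, using the time steps of Table 1; the env and agent objects and their method names are hypothetical stand-ins for the middleware of Section 2.3 and a DQN agent, not the authors' code.

    def run_training_episode(env, agent, T=500.0, dt_train=0.5, dt_sim=0.1):
        """One training episode: the agent acts every dt_train seconds, while the aircraft
        motion is integrated every dt_sim seconds, mirroring the nested loops of Algorithm 1."""
        state = env.reset()                              # randomly initialized situation
        for _ in range(int(T / dt_train)):
            action = agent.select_action(state)          # epsilon-greedy on Q(s, a; w)
            done = False
            for _ in range(int(dt_train / dt_sim)):      # hold the action for one training step
                done = env.simulate_step(action)         # advance the simulation by dt_sim
                if done:
                    break
            next_state, reward = env.observe(action)     # shaped reward of Section 3.1
            agent.store(state, action, reward, next_state, done)
            agent.learn()                                # minibatch gradient step + target sync
            state = next_state
            if done:
                break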

2.3. Middleware Design. An RL agent improves its ability through continuous interaction with the environment. However, there is no maneuver guidance environment for air combat, and some combat simulation software can only execute agents, not train them. In this paper, middleware is designed and developed based on commercial air combat simulation software [50] coded in C++. The functions of the middleware include the interface between the software and RL agents, coordination of the simulation time step and the RL training time step, and reward calculation. By using it, an RL environment for aircraft maneuver guidance training in air combat is constructed. The guidance agents of both aircrafts have the same structure. Figure 4 shows the structure and the training process of the red side.

In actual or simulated air combat, an aircraft perceives the situation through multiple sensors. By analyzing the situation after information fusion, maneuver decisions are made by the pilot or the auxiliary decision-making system. In this system, situation perception and decision-making are the same as in actual air combat. An assumption is made that a sensor with full situation perception ability is used, through which an aircraft can obtain the position and heading angle of its opponent.

Maneuver guidance in air combat using RL is an episodic task, and there is the notion of episodes of some length, where the goal is to take the agent from a starting state to a goal state [51]. First, a flight level with fixed velocity and the one-on-one air combat scenario is set up in the simulation software. Then, the situation information of both aircrafts is randomly initialized, including their positions, heading angles, bank angles, and roll rates, which constitute the starting state of each agent. The goal states are achieved when one aircraft reaches the position of advantage and maintains it, one aircraft flies out of the sector, or the simulation time runs out.

Middleware can be used in the training and application phases of agents. The training phase is the process of taking the agent from zero to having the maneuver guidance ability, which includes training episodes and testing episodes. In a training episode, the agent is trained by the DRL algorithm, and its guidance strategy is continuously improved. After a number of training episodes, some testing episodes are performed to verify the ability of the agent at the current stage, in which the strategy does not change. The application phase is the process of using the well-trained agent for maneuver guidance in air combat simulation.

A training episode is composed of training steps. In each training step, first, the information recognized by the airborne sensor is sent to the guidance agent through the middleware as a tuple of state coded in Python; see Steps 1, 2, and 3 in Figure 4. Then the agent generates an action using its neural networks according to the exploration strategy, as shown in Steps 4 and 5. The simulation software receives a guidance instruction transformed by the middleware and sends the next situation information to the middleware; see Steps 6 and 7. Next, the middleware transforms the situation information into state information, calculates the reward, and sends them to the agent, as shown in Steps 2, 3, and 8. Because agent training is an iterative process, all the steps except Step 1 are executed repeatedly in an episode. Last, the tuple of the current state, action, reward, and next state in each step is stored in the replay memory for the training of the guidance agent; see Steps 9 and 10. In each simulation step, the aircraft maneuvers according to the instruction and recognizes the situation according to its sensor configuration.
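The middleware can be pictured as a thin wrapper around the simulation software, as in the illustrative sketch below; the class, its method names, and the assumed simulator API are assumptions, not the authors' interface.

    class AirCombatMiddleware:
        """Illustrative wrapper: translates simulator situations into RL states,
        guidance actions into maneuver instructions, and computes shaped rewards."""

        def __init__(self, sim, reward_fn, dt_sim=0.1):
            self.sim = sim               # handle to the simulation software (hypothetical API)
            self.reward_fn = reward_fn   # shaped reward R(S_t, A_t, S_{t+1}) of equation (11)
            self.dt_sim = dt_sim
            self.last_state = None

        def reset(self):
            # Randomly initialize positions, heading angles, bank angles, and roll rates.
            situation = self.sim.initialize_random_scenario()
            self.last_state = self._to_state(situation)
            return self.last_state

        def simulate_step(self, action):
            # Send the guidance instruction and advance one simulation step (Steps 5-6 in
            # Figure 4); returns True when a goal state is reached.
            self.sim.send_instruction(action, self.dt_sim)
            return self.sim.is_terminal()

        def observe(self, action):
            # Read the next situation (Step 7), convert it to a state tuple, and compute
            # the reward (Steps 2, 3, and 8 in Figure 4).
            situation = self.sim.get_situation()
            state = self._to_state(situation)
            reward = self.reward_fn(self.last_state, action, state)
            self.last_state = state
            return state, reward

        def _to_state(self, situation):
            # 7-dimensional state of equation (10): own x, y, psi, phi and opponent x, y, psi.
            r, b = situation["red"], situation["blue"]
            return (r["x"], r["y"], r["psi"], r["phi"], b["x"], b["y"], b["psi"])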


3. Training Optimization

3.1. Reward Shaping. Two problems need to be solved when using the DRL method to train a maneuver guidance agent in air combat. One is that the only criterion to evaluate the guidance is the successful arrival at the position of advantage, which is a sparse reward problem leading to slow convergence of training. The other is that, in realistic situations, the time to arrive at the position of advantage and the quality of the trajectory need to be considered. In this paper, a reward shaping method is proposed to improve the training speed and the performance of the maneuver guidance strategy.

Reward shaping [46] is usually used to modify the reward function to facilitate learning while maintaining the optimal policy, which is a manual endeavour. In this study, there are four rules to follow in reward shaping:

(i) Limited by the combat area and the maximum number of control times, the aircraft should be guided to the position of advantage.

(ii) The aircraft should be guided closer and closer to its goal zone.

(iii) The aircraft should be guided away from the goal zone of its opponent.

(iv) The aircraft should be guided to its goal zone in as short a time as possible.

(1)  Set parameters of the controlled aircraft and opponent
(2)  Set training time step size Δt and simulation time step size δt
(3)  Initialize maximum number of training episodes M and maximum simulation time T in each episode
(4)  Initialize replay memory D to capacity N
(5)  Initialize action-value function Q and target action-value function Q⁻ with random weights
(6)  for episode = 0, M do
(7)    Initialize state S_0
(8)    for t = 0, T/Δt do
(9)      With probability ε select a random action A_t
(10)     Otherwise select A_t = argmax_a Q(S_t, a; w)
(11)     for i = 0, Δt/δt do
(12)       Execute action A_t in air combat simulation software
(13)       Obtain the positions of aircraft and target
(14)       if episode terminates then
(15)         break
(16)       end if
(17)     end for
(18)     Observe reward R_t and state S_{t+1}
(19)     Store transition [S_t, A_t, R_t, S_{t+1}] in D
(20)     if episode terminates then
(21)       break
(22)     end if
(23)     Sample random minibatch of transitions [S_j, A_j, R_j, S_{j+1}] from D
(24)     if episode terminates at step j + 1 then
(25)       set Y_j = R_j
(26)     else
(27)       set Y_j = R_j + γ max_a Q⁻(S_{j+1}, a; w⁻)
(28)     end if
(29)     Perform a gradient descent step on (Y_j − Q(S_j, A_j; w_j))² with respect to the network parameters w
(30)     Every C steps reset Q⁻ = Q
(31)   end for
(32) end for

Algorithm 1: DQN [26] for maneuver guidance agent training.

Figure 4: Structure and training process of the red side.


According to the above rules, the reward function is defined as

    R(S_t, A_t, S_{t+1}) = R_t + F(S_t, A_t, S_{t+1}),              (11)

where R_t is the original scalar reward and F(S_t, A_t, S_{t+1}) is the shaping reward function. R_t is given by

    R_t = w_1 T(S_{t+1}) + w_2,                                      (12)

where T(S_{t+1}) is the termination reward function. The term w_1 is a coefficient, and w_2 is a penalty for time consumption, which is a constant.

There are four kinds of termination states: the aircraft arrives at its position of advantage; the opponent arrives at its position of advantage; the aircraft moves out of the combat area; and the maximum number of control times has been reached while each aircraft is still in the combat area and has not reached its advantage situation. The termination reward is the reward obtained when the next state is a termination state, which is often used in standard RL training. It is defined as

    T(S_{t+1}) =
        c_1,  if win,
        c_2,  else if loss,
        c_3,  else if out of the sector,
        c_4,  else if no control times left,
        0,    otherwise.                                             (13)

Usually, c_1 is a positive value, c_2 is a negative value, and c_3 and c_4 are nonpositive.

The shaping reward function F(S_t, A_t, S_{t+1}) has the form described in (9). The real-valued function Φ(S_t) is an advantage function of the state: the larger the value of this function, the more advantageous the current situation is. This function can provide additional information to help the agent select actions, which is better than only using the termination reward. It is defined as

    Φ(S_t) = D(S_t) O(S_t) / 100,                                    (14)

where D(S_t) is the distance reward function and O(S_t) is the orientation reward function, which are defined as

    D(S_t) = exp( −|d_t − (d_max + d_min)/2| / (180° · k) ),         (15)

    O(S_t) = 1 − (|μ^r_t| + |η^b_t|) / 180°,                         (16)

where k has units of meters/degree and is used to adjust the relative effect of range and angle; a value of 10 is effective.
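Equations (12)-(16) could be coded as in the sketch below, using the parameter values reported in Table 1 and Section 4.1 (c_1..c_4 = 2, −2, −1, −1, w_1 = 0.98, w_2 = 0.01, k = 10); the function names are illustrative assumptions, not the authors' implementation.

    import math

    def termination_reward(outcome, c1=2.0, c2=-2.0, c3=-1.0, c4=-1.0):
        """Termination reward T(S_{t+1}) of equation (13)."""
        return {"win": c1, "loss": c2, "out_of_sector": c3, "timeout": c4}.get(outcome, 0.0)

    def potential(d, mu_deg, eta_deg, d_min=100.0, d_max=500.0, k=10.0):
        """Potential Phi(S_t) = D(S_t) * O(S_t) / 100 of equations (14)-(16)."""
        dist = math.exp(-abs(d - (d_max + d_min) / 2.0) / (180.0 * k))   # D(S_t), eq. (15)
        orient = 1.0 - (abs(mu_deg) + abs(eta_deg)) / 180.0              # O(S_t), eq. (16)
        return dist * orient / 100.0

    def shaped_reward(phi_s, phi_s_next, outcome, gamma=0.99, w1=0.98, w2=0.01):
        """Shaped reward combining equations (9), (11), and (12):
        R = w1 * T(S_{t+1}) + w2 + gamma * Phi(S_{t+1}) - Phi(S_t)."""
        r = w1 * termination_reward(outcome) + w2
        return r + gamma * phi_s_next - phi_s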

3.2. Alternate Freeze Game DQN for Maneuver Guidance Agent Training. Unlike board and RTS games, data from human players in air combat games are very rare, so pretraining methods using supervised learning algorithms cannot be used. We can only assume a strategy used by the opponent and then propose a strategy to defeat it. We then think about what strategies the opponent will use to confront our strategy and then optimize our strategy.

In this paper, alternate freeze learning is used for games to solve the nonstationarity problem. Both aircrafts in a one-on-one air combat scenario use DRL-based agents to adapt their maneuver strategies. In each training period, one agent is learning from scratch while the strategy of its opponent, which was obtained from the previous training period, is frozen. Through games, the maneuver guidance performance of each side rises alternately, and different agents with a high level of maneuver decision-making ability are generated. The pseudo-code of alternate freeze game DQN is shown in Algorithm 2.

An important effect that must be considered in this approach is the red queen effect. When one aircraft uses a DRL agent and its opponent uses a static strategy, the performance of the aircraft is absolute with respect to its opponent. However, when both aircrafts use DRL agents, the performance of each agent is only relative to its current opponent. As a result, the trained agent is only suitable for its latest opponent but cannot deal with the previous ones.

Through K training periods of alternate freeze learning, K red agents and K + 1 blue agents are saved in the training process. The league system is adopted, in which those agents are regarded as players. By using a combination of strategies from multiple opponents, the opponent's mixed strategy becomes smoother. In different initial scenarios, each agent guides the aircraft to confront the aircraft guided by each opponent. The goal is to select one of our strategies through the league so that the probability of winning is high and the probability of losing is low. According to the results of the competition, the agents obtain different points. The top agent in the league table is selected as the optimum one.
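A sketch of such a round-robin league; the 3/1/0 scoring and the 400 games per pair follow the league description in Section 4.3, while the play_match helper and its return values are assumptions.

    def run_league(red_agents, blue_agents, play_match, games_per_pair=400):
        """Round-robin league: every red agent plays every blue agent; 3 points per win,
        1 per draw, 0 per loss. The top-scoring red agent is selected as the final strategy."""
        points = {"win": 3, "draw": 1, "loss": 0}
        scores = {name: 0 for name in red_agents}
        for name, red in red_agents.items():
            for blue in blue_agents.values():
                for _ in range(games_per_pair):      # 100 games for each of the 4 initial situations
                    scores[name] += points[play_match(red, blue)]   # outcome from red's view
        best = max(scores, key=scores.get)
        return best, scores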

4. Simulation and Results

In this section, first, random air combat scenarios are initialized in simulation software for maneuver guidance agent training and testing. Second, agents of both sides with reward shaping are created and trained using the alternate freeze game DQN algorithm. Third, taking the well-trained agents as players, the league system is adopted and the top agent in the league table is selected. Last, the selected agent is evaluated, and the maneuver behavior of the aircraft is analyzed.

4.1. Simulation Setup. The air combat simulation parameters are shown in Table 1. The Adam optimizer [52] is used for learning the neural network parameters, with a learning rate of 5 × 10⁻⁴ and a discount γ = 0.99. The replay memory size is 1 × 10⁵. The initial value of ε in ε-greedy exploration is 1, and the final value is 0.1. For each simulation, the reward shaping parameters c_1, c_2, c_3, and c_4 are set to 2, −2, −1, and −1, respectively. The terms w_1 and w_2 are set to 0.98 and 0.01, respectively. The number of training periods K is 10.


We evaluated three neural networks and three minibatches. Each neural network has 3 hidden layers, and the numbers of units are 64, 64, and 128 for the small neural network (NN), 128, 128, and 256 for the medium one, and 256, 256, and 512 for the large one. The sizes of the minibatch (MB) are 256, 512, and 1024, respectively. These neural networks and minibatches are used in pairs to train red agents and play against the initial blue agent. The simulation result is shown in Figure 5. First, a large minibatch cannot make the training successful for the small and medium neural networks, and its training speed is slow for the large network. Second, using the small neural network, the average reward obtained is lower than that of the other two networks. Third, if the same neural network is adopted, the training speed using the medium minibatch is faster than using the small one. Last, considering the computational efficiency, the selected network has 3 hidden layers with 128, 128, and 256 units, respectively. The selected minibatch size is 512.
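For reference, the selected Q-network (7-dimensional state input, three hidden layers with 128, 128, and 256 units, and 3 output actions) together with the Adam settings of Section 4.1 could be instantiated as in the PyTorch sketch below; the framework and the ReLU activations are assumptions, since the paper does not publish code.

    import torch
    import torch.nn as nn

    def build_q_network():
        # 7-dimensional state (eq. (10)) -> Q-values for roll-left / maintain / roll-right.
        return nn.Sequential(
            nn.Linear(7, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 256), nn.ReLU(),
            nn.Linear(256, 3),
        )

    q_net = build_q_network()
    target_net = build_q_network()
    target_net.load_state_dict(q_net.state_dict())              # synchronize the target network
    optimizer = torch.optim.Adam(q_net.parameters(), lr=5e-4)   # learning rate from Section 4.1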

In order to improve the generality of a single agent, initial positions are randomized in the air combat simulation. In the first iteration of the training process, the initial strategy is formed by a randomly initialized neural network for the red agent, and the conservative strategy of the maximum-overload turn is adopted by the blue one. In the subsequent training, the training agent is trained from scratch, and its opponent adopts the strategy obtained from the latest training.

Air combat geometry can be divided into four categories (from the red aircraft's perspective): offensive, defensive, neutral, and head-on [7], as shown in Figure 6(a). With the deviation angle and aspect angle defined from −180° to 180°, Figure 6(b) shows an advantage diagram where the distance is 300 m. A smaller |μ^r| means that the heading or gun of the red aircraft has better aim at its opponent, and a smaller |η^b| implies a higher possibility of the blue aircraft being shot or facing a fatal situation.

(1)  Set parameters of both aircrafts
(2)  Set simulation parameters
(3)  Set the number of training periods K and the condition for ending each training period W_threshold
(4)  Set DRL parameters
(5)  Set the opponent initialization policy π^b_0
(6)  for period = 1, K do
(7)    for aircraft ∈ [red, blue] do
(8)      if aircraft = red then
(9)        Set the opponent policy π = π^b_{period−1}
(10)       Initialize neural networks of red agent
(11)       while Winning rate < W_threshold do
(12)         Train agent using Algorithm 1
(13)       end while
(14)       Save the well-trained agent whose maneuver guidance policy is π^r_period
(15)     else
(16)       if period = K + 1 then
(17)         break
(18)       else
(19)         Set the opponent policy π = π^r_period
(20)         Initialize neural networks of blue agent
(21)         while Winning rate < W_threshold do
(22)           Train agent using Algorithm 1
(23)         end while
(24)         Save the well-trained agent whose maneuver guidance policy is π^b_period
(25)       end if
(26)     end if
(27)   end for
(28) end for

Algorithm 2: Alternate freeze game DQN for maneuver guidance agent training in air combat.

Table 1: Air combat simulation parameters.

Red aircraft parameters
    Velocity: 200 m/s
    Maximum roll angle: 75°
    Roll rate: 40°/s
    Initial position: Random position and heading angle

Blue aircraft parameters
    Velocity: 200 m/s
    Maximum roll angle: 75°
    Roll rate: 40°/s
    Initial position: Map center, northward
    Initial policy (π^b_0): Maximum overload turn

Simulation parameters
    Position of advantage: d_max = 500 m, d_min = 100 m, μ_max = 60°, η_max = 30°
    Combat area: 5 km × 5 km
    Simulation time T: 500 s
    Simulation time step δt: 0.1 s
    Training time step Δt: 0.5 s


For example, in the offensive scenario, the initial state value of the red agent is large because of small |μ^r| and |η^b|, according to equations (14) and (16). Therefore, in the offensive initial scenario, the win probability of our side will be larger than that of the opponent, while the defensive initial scenario is reversed. In the neutral and head-on initial scenarios, the initial state values of both sides are almost equal. The performance of the well-trained agents is verified according to the classification of four initial situations in the league system.

4.2. Simulation Results. The policy naming convention is π^b_i or π^r_i, which means a blue or red policy produced after i periods, respectively. After every 100 training episodes, the test episodes run 100 times, and the learning curves are shown in Figure 7. For the first period (π^r_1 vs. π^b_0), due to the simple maneuver policy of the blue aircraft, the red aircraft can achieve an almost 100% success rate through about 4000 episodes of training, as shown in Figure 7(a). At the beginning, most of the games are a draw because of the random initialization of the red agent and the evasion strategy of its opponent. Another reason is that the aircraft often flies out of the airspace in the early stages of training; although the agent will get a penalty for this, the air combat episode will still end in a draw. During the training process, the red aircraft is looking for winning strategies in the airspace, and its winning rate is constantly increasing. However, because of its pursuit of offense and neglect of defense, the winning rate of its opponent is also rising. In the later stage of training, the red aircraft gradually understands the opponent's strategy, and it can achieve a very high winning rate.

With the iterations in games, both agents have learnt intelligent maneuver strategies which can guide the aircraft in the pursuit-evasion game. Figure 7(b) shows a typical example in which π^b_5 is the agent being trained and π^r_5 is its opponent. After about 5000 episodes of training, the blue aircraft can reduce the losing rate to a lower level. Since the red one adopts an intelligent maneuver guidance strategy, the blue agent cannot learn a winning strategy in a short time. After about 20000 episodes, the agent gradually understands the opponent's maneuver strategy, and its winning rate keeps increasing. The final winning rate is stable between 50% and 60%, and the losing rate is below 10%.

The training process of the iteration π^r_10 vs. π^b_9 is shown in Figure 7(c). Because the maneuver strategy of the blue aircraft is an intelligent strategy which has been trained iteratively, training the red agent is much more difficult than in the first iteration. The well-trained agent can win more than 60% and lose less than 10% in the game against π^b_9.

It is hard to get a higher winning rate in every training except for the scenario where the opponent is π^b_0. This is because both aircrafts are homogeneous, the initial situations are randomized, and the opponent in the training environment is intelligent. In some situations, as long as the opponent's strategy is intelligent enough, it is almost impossible for the trained agent to win. Interestingly, in some scenarios where the opponent is intelligent enough and the aircraft cannot defeat it no matter how the agent operates, the agent will guide the aircraft out of the airspace to obtain a draw. This gives us an insight: in a scenario where you cannot win, getting out of the combat area may be the best option.

For the four initial air combat situations, the two aircrafts are guided by the agents trained in each iteration stage. The typical flight trajectories are shown in Figure 8. Since π^b_0 is a static strategy, π^r_1 can easily win in the offensive, head-on, and neutral situations, as shown in Figures 8(a), 8(c), and 8(d). In the defensive situation, the red aircraft first turns away from its opponent, who has the advantage, and then looks for opportunities to establish an advantage situation. There are no fierce rivalries in the process of combat, as shown in Figure 8(b).

Although π^r_5 is an intelligent strategy, the well-trained π^b_5 can still gain an advantage in combat, as shown in Figures 8(e)–8(h). In the offensive situation (from the red aircraft's perspective), the blue aircraft cannot easily get rid of the red one's following, and after fierce confrontation it can establish an advantage situation, as shown in Figure 8(e). Figure 8(f) shows the defensive situation, in which the blue aircraft can keep its advantage until it wins. In the other two scenarios, the blue aircraft performs well, especially in the neutral initial situation, where it adopts the delayed-turning tactic and successfully wins the air combat.

π^b_9 is an intelligent strategy, and the well-trained π^r_10 can achieve more than half of the victories and less than 10% of the failures. However, unlike π^r_1 vs. π^b_0, π^r_10 cannot easily win, especially in head-on situations, as shown in Figures 8(i)–8(l).

Figure 5: Evaluation of the neural network and minibatch.


In the offensive situation, the blue aircraft tries to get rid of the red one by constantly adjusting its bank angle, but it is finally captured by the red one. In the defensive situation, the red aircraft adopts the circling-back tactic cleverly and quickly captures its opponent. In the head-on situation, the red and blue aircrafts alternately occupy an advantage situation, and after fierce maneuver confrontation, the red aircraft finally wins. In the neutral situation, the red one constantly adjusts its bank angle according to the opponent's position and wins.

For the above three pairs of strategies, the head-on initial situations are taken as examples to analyze the advantages of both sides in the game, as shown in Figure 9. For the π^r_1 vs. π^b_0 scenario, because π^b_0 has no intelligent maneuver guidance ability, although both sides are equally competitive in the first half, the red aircraft continues to expand its advantage until it wins, as shown in Figure 9(a). When the blue agent learns the intelligent strategy, it can defeat the red one, as shown in Figure 9(b). For the π^r_10 vs. π^b_9 scenario, the two sides alternately establish an advantage position, and the red aircraft does not win easily, as shown in Figure 9(c). Combined with the trajectories shown in Figures 8(c), 8(g), and 8(k), the maneuver strategies of both sides have been improved using the alternate freeze game DQN algorithm. The approach has the benefit of discovering new maneuver tactics.

4.3. League Results. The objective of the league system is to select a red agent which generates maneuver guidance instructions according to its strategy so as to achieve the best results without knowing the strategy of its opponent. There are ten agents on each side, which are saved in the iteration of games. Each red agent fights each blue agent 400 times, including 100 times for each of the four initial scenarios. The detailed league results are shown in Table 2. Results with a win/loss ratio greater than 1 are highlighted in bold.

Figure 6: Situation recognition based on deviation angle and aspect angle. (a) Four initial air combat situation categories. (b) Situation recognition diagram.

Figure 7: Learning curves in the training process. (a) π^r_1 vs. π^b_0. (b) π^b_5 vs. π^r_5. (c) π^r_10 vs. π^b_9.


Figure 8: Typical trajectory results in the training iteration of air combat games. (a) π^r_1 vs. π^b_0, offensive. (b) π^r_1 vs. π^b_0, defensive. (c) π^r_1 vs. π^b_0, head-on. (d) π^r_1 vs. π^b_0, neutral. (e) π^b_5 vs. π^r_5, offensive. (f) π^b_5 vs. π^r_5, defensive. (g) π^b_5 vs. π^r_5, head-on. (h) π^b_5 vs. π^r_5, neutral. (i) π^r_10 vs. π^b_9, offensive. (j) π^r_10 vs. π^b_9, defensive. (k) π^r_10 vs. π^b_9, head-on. (l) π^r_10 vs. π^b_9, neutral.

Figure 9: Advantage curves in head-on air combat games. (a) π^r_1 vs. π^b_0. (b) π^b_5 vs. π^r_5. (c) π^r_10 vs. π^b_9.


π^r_i, where i ∈ [1, 10], is trained in the environment with π^b_{i−1} as its opponent, and π^b_j is trained against π^r_j, j ∈ [1, 10]. The result of π^r_1 fighting with π^b_0 is ideal. However, since π^b_0 is a static strategy rather than an intelligent strategy, π^r_1 does not perform well in the games with the other, intelligent blue agents. For π^r_2 to π^r_10, although there is no special training against π^b_0, the performance is still very good because π^b_0 is a static strategy. In the iterative game process, the trained agents can get good results in the confrontation with the frozen agents in the environment. For example, π^r_2 has an overwhelming advantage over π^b_1, and so does π^b_2 over π^r_2.

Because of the red queen effect, although π^r_10 has an advantage against π^b_9, it is in a weak position against the early blue agents such as π^b_2, π^b_3, and π^b_4. The league table is shown in Table 3, in which 3 points are awarded for a win, 1 point for a draw, and 0 points for a loss. In the early stage of training, the performance of the trained agents is gradually improved; the later trained agents are better than the former ones, such as from π^r_1 to π^r_6. As the alternate freeze training goes on, the performance does not keep getting better, such as from π^r_7 to π^r_10. The top of the score table is π^r_6, which not only has the highest score but also has advantages when playing with all blue agents except π^b_6, as shown in Table 2. In the competition with all the opponents, it can win 44% and remain unbeaten at 75%.

In order to verify the performance of this agent, it is confronted with opponent agents trained by ADP [20] and Minimax-DQN [44]. Each of the four typical initial situations is performed 100 times for both opponents, and the results are shown in Table 4. The time cost to generate an action for each algorithm is shown in Table 5. It can be found that the performance of the agent presented in this paper is comparable to that of ADP, and the computational time is slightly reduced. However, ADP is a model-based method, which is assumed to know all the information, such as the roll rate of the opponent. Compared with Minimax-DQN, the agent has the overall advantage. This is because Minimax-DQN assumes that the opponent is rational, while the opponent in this paper is not, which is closer to the real world. In addition, compared with Minimax-DQN, the computational efficiency of the proposed algorithm is significantly improved, because linear programming is used to select the optimal action at each step in Minimax-DQN, which is time consuming.

4.4. Agent Evaluation and Behavior Analysis. Evaluating agent performance in air combat is not simply a matter of either winning or losing. In addition to the winning and losing rates, two criteria are added to represent the success level: average time to win (ATW) and average disadvantage time (ADT).

Table 2: League results.

               π^b_0  π^b_1  π^b_2  π^b_3  π^b_4  π^b_5  π^b_6  π^b_7  π^b_8  π^b_9  π^b_10

π^r_1   Win    393    17     83     60     137    75     96     126    119    133    123
        Draw   7      103    213    198    95     128    131    117    113    113    114
        Loss   0      280    104    142    168    197    173    157    168    154    163
π^r_2   Win    384    267    4      82     78     94     88     135    142    161    152
        Draw   16     120    128    178    143    137    157    76     100    87     94
        Loss   0      13     268    140    179    169    155    189    158    152    154
π^r_3   Win    387    185    253    5      82     97     103    123    134    154    131
        Draw   11     104    131    115    194    148    145    132    84     102    97
        Loss   2      111    16     280    124    155    152    145    182    144    172
π^r_4   Win    381    172    182    237    31     92     137    154    149    167    158
        Draw   19     123    102    117    110    151    101    97     144    92     104
        Loss   0      105    116    46     259    157    162    149    107    141    138
π^r_5   Win    382    164    172    187    221    29     95     137    168    154    162
        Draw   18     122    113    142    139    144    118    103    121    143    113
        Loss   0      114    115    71     40     227    187    160    111    103    125
π^r_6   Win    380    162    177    193    197    217    52     106    134    152    143
        Draw   19     105    98     132    101    134    139    194    176    131    144
        Loss   1      133    125    75     102    49     209    100    90     117    113
π^r_7   Win    383    169    169    180    165    187    210    56     82     122    133
        Draw   14     98     111    105    123    134    133    137    180    132    156
        Loss   3      133    120    115    112    79     57     207    138    144    111
π^r_8   Win    383    154    157    172    168    177    182    213    36     107    114
        Draw   16     123    109    114    132    102    97     134    141    124    138
        Loss   1      123    134    114    100    121    121    53     223    169    148
π^r_9   Win    387    140    162    157    154    162    179    182    215    37     102
        Draw   11     133    97     105    127    131    97     104    149    125    147
        Loss   2      127    141    138    119    107    124    114    36     238    151
π^r_10  Win    379    155    146    147    143    159    167    169    170    219    42
        Draw   10     102    84     92     105    97     131    104    85     140    131
        Loss   11     143    170    161    152    144    102    127    145    41     227


ATW is measured as the average elapsed time required to maneuver to the advantage position in the scenarios where our aircraft wins; a smaller ATW is better than a larger one. ADT is the average accumulated time during which the advantage function value of our aircraft is less than that of the opponent, which is used as a criterion to evaluate the risk of exposure to the adversary's weapons.
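For completeness, the two criteria could be computed from logged test episodes as sketched below; the episode record format (result, duration, and per-step advantage values of both sides) is an assumption.

    def average_time_to_win(episodes):
        """ATW: mean elapsed time of the episodes our aircraft wins."""
        times = [ep["duration"] for ep in episodes if ep["result"] == "win"]
        return sum(times) / len(times) if times else float("nan")

    def average_disadvantage_time(episodes, dt_train=0.5):
        """ADT: mean accumulated time per episode in which our advantage value
        is below the opponent's."""
        totals = []
        for ep in episodes:
            steps = sum(1 for red_v, blue_v in zip(ep["red_values"], ep["blue_values"])
                        if red_v < blue_v)
            totals.append(steps * dt_train)
        return sum(totals) / len(totals) if totals else float("nan")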

A thousand air combat scenarios are generated in the simulation software, and the opponent in each scenario is randomly chosen from the ten blue agents. Each well-trained red agent is used to perform these confrontations, and the results are shown in Figure 10. Agent number i means the policy trained after i periods. From π^r_1 to π^r_5, the performance of the agents gradually improves. After π^r_5, their performance is stabilized, and each agent has different characteristics. Comparing π^r_5 to π^r_10, the winning and losing rates are almost the same as those in Table 3. The winning rates of these six agents are similar, with the highest, π^r_5, winning 6% more than the lowest, π^r_10, as shown in Figure 10(a). However, π^r_6 is the agent with the lowest losing rate, and its losing rate is more than 10% lower than that of the other agents, as shown in Figure 10(b). For ATW, π^r_7 is less than 100 s, π^r_6 is 108.7 s, and that of the other agents is more than 110 s, as shown in Figure 10(c). For ADT, π^r_5, π^r_6, and π^r_8 are less than 120 s, π^r_7 is above 130 s, and π^r_9 and π^r_10 are between 120 and 130 s, as shown in Figure 10(d).

In summary, with the increase in the number of iterations, the red queen effect appears obviously. The performance of π^r_8, π^r_9, and π^r_10 is no better than that of π^r_6 and π^r_7. Although the winning rate of π^r_7 is not high and it has more disadvantage time, it can always win quickly. It is an aggressive agent and can be used in scenarios where our aircraft needs to beat the opponent quickly. The winning rate of π^r_6 is similar to that of the other agents, while its losing rate is the lowest; its ATW is only higher than that of π^r_7, and its ADT is the lowest. In most cases, it can be selected to achieve more victories while keeping itself safe. π^r_6 is an agent with good comprehensive performance, which means that the selection method of the league system is effective.

For the four initial scenarios air combat simulationusing πr

6 is selected for behavior analysis as shown inFigure 11 In the offensive scenario the opponent tried toescape by turning in the opposite direction Our aircraft useda lag pursuit tactic which is simple but effective At the 1stsecond our aircraft did not keep up with the opponent butchose to fly straight At the 9th second it gradually turned tothe opponent established and maintained the advantageand finally won as shown in Figure 11(a)

In the defensive scenario, our aircraft was at a disadvantage position, and the opponent was trying to lock us in with a lag pursuit tactic, as shown in Figure 11(b). At the 30th second, our aircraft adjusted its roll angle to the right to avoid being locked by the opponent, as shown in Figure 11(c). After that, it continued to adjust the roll angle, and when the opponent noticed that our strategy had changed, our aircraft had already flown out of the sector. In situations where it is difficult to win, the agent will use a safe maneuver strategy to get a draw.

Table 3: League table.

Rank  Agent (strategy)  Win   Draw  Loss  Score
1     πr6               1913  1373  1114  7112
2     πr7               1858  1323  1219  6897
3     πr5               1871  1276  1253  6889
4     πr9               1877  1226  1297  6857
5     πr8               1863  1230  1307  6819
6     πr10              1896  1081  1423  6769
7     πr4               1860  1160  1380  6740
8     πr3               1654  1263  1483  6225
9     πr2               1587  1236  1577  5997
10    πr1               1362  1332  1706  5418

Table 4: Algorithm comparison.

                    vs. ADP [20]   vs. Minimax-DQN [44]
Offensive   Win         57               66
            Draw        28               21
            Loss        15               13
Defensive   Win         19               20
            Draw        31               42
            Loss        50               38
Head-on     Win         43               46
            Draw        16               26
            Loss        41               28
Neutral     Win         32               43
            Draw        32               22
            Loss        36               35

Table 5: Time cost to generate an action.

Algorithm                 Time cost (ms)
The proposed algorithm    138
ADP                       142
Minimax-DQN               786

Figure 10: Agent evaluation. (a) Number of wins. (b) Number of losses. (c) Average time to win. (d) Average disadvantage time.

Figure 11: Trajectories under four typical initial situations. (a) Offensive. (b) Defensive (0–21 s). (c) Defensive (21–43 s). (d) Head-on (0–50 s). (e) Head-on (50–100 s). (f) Head-on (100–154 s). (g) Neutral (0–40 s). (h) Neutral (40–74 s).


In the head-on scenario, both sides adopted the same maneuver strategy in the first 50 s, that is, the maximum-overload turn, waiting for opportunities, as shown in Figure 11(d). The crucial decision was made in the 50th second: the opponent was still hovering and waiting for an opportunity, while our aircraft stopped hovering and reduced its turning rate to fly towards the opponent. At the 90th second, the aircraft reached an advantage situation over its opponent, as shown in Figure 11(e). In the final stage, our aircraft adopted the lead pursuit tactic to establish and maintain the advantage and won, as shown in Figure 11(f).

In the neutral scenario, our initial strategy was to turn away from the opponent to find opportunities. The opponent's strategy was lag pursuit, and it successfully reached the rear of our aircraft at the 31st second, as shown in Figure 11(g). However, after the 40th second, our aircraft made a wise decision to reduce its roll angle, thereby increasing the turning radius. At the 60th second, the disadvantaged situation was terminated, as shown in Figure 11(h). After that, our aircraft increased its roll angle, and the opponent was in a state of maximum-overload right turn, which made it unable to get rid of our aircraft, and finally it lost. It can be seen that, trained by alternate freeze games using the DQN algorithm and selected through the league system, the agent can learn tactics such as lead pursuit, lag pursuit, and hovering and can use them in air combat.

5. Conclusion

In this paper, middleware connecting air combat simulation software and the reinforcement learning agent is developed. It provides an idea for researchers to design different middleware to transform existing software into reinforcement learning environments, which can expand the application field of reinforcement learning. Maneuver guidance agents with reward shaping are designed. Through training in the environment, an agent can guide an aircraft to fight against its opponent in air combat and reach an advantage situation in most scenarios. An alternate freeze game algorithm is proposed and combined with RL. It can be used in nonstationary situations where the other players' strategies are variable. Through the league system, the agent with improved performance after iterative training is selected. The league results show that the strategy quality of the selected agent is better than that of the other agents, and the red queen effect is avoided. Agents can learn some typical angle tactics and behaviors in the horizontal plane and perform these tactics by guiding an aircraft to maneuver in one-on-one air combat.

In future works, the problem would be extended to 3D maneuvering with less restrictive vehicle dynamics. The 3D formulation will lead to a larger state space and more complex actions, and the learning mechanism would be improved to deal with them. Beyond an extension to 3D maneuvering, a 2-vs-2 air combat scenario would be established, in which there are collaborators as well as adversaries. One option is to use a centralized controller, which turns the problem into a single-agent DRL problem, to obtain the joint action of both aircrafts to execute in each time step. The other option is to adopt a decentralized system in which both agents take decisions for themselves and might cooperate to achieve a common goal.

In theory, the convergence of the alternate freeze game algorithm needs to be analyzed. Furthermore, in order to improve the diversity of potential opponents in a more complex air combat scenario, the league system would be refined.

Nomenclature

(x, y, z): 3-dimensional coordinates of an aircraft
v: Speed
θ: Flight path angle
ψ: Heading angle
β: Angle of attack
ϕ: Bank angle
m: Mass of an aircraft
g: Acceleration due to gravity
T: Thrust force
L: Lift force
D: Drag force
\dot{ψ}: Turn rate
\dot{ϕ}: Roll rate
d_t: Distance between the aircraft and the target at time t
d_min: Minimum distance of the advantage position
d_max: Maximum distance of the advantage position
μ_t: Deviation angle at time t
μ_max: Maximum deviation angle of the advantage position
η_t: Aspect angle at time t
η_max: Maximum aspect angle of the advantage position
S: State space
S_t: State vector at time t
s: All states in the state space
A: Action space
A_t: Action at time t
a: All actions in the action space
π: S ⟶ A: Policy
R(S_t, A_t, S_{t+1}): Reward function
R_t: Scalar reward received at time t
V(s): Value function
Q(s, a): Action-value function
Q*(s, a): Optimal action-value function
Q(S_t, A_t): Estimated action-value function at time t
γ: Discount factor
α: Learning rate
T: Maximum simulation time in each episode
Δt: Training time step size
δt: Simulation time step size
F(S_t, A_t, S_{t+1}): Shaping reward function
Φ(S_t): A real-valued function on states
T(S_t): Termination reward function
D(S_t): Distance reward function
O(S_t): Orientation reward function
K: Number of training periods


Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grants 91338107 and U1836103 and the Development Program of Sichuan, China, under Grants 2017GZDZX0002, 18ZDYF3867, and 19ZDZX0024.

Supplementary Materials

Aircraft dynamics: the derivation of the kinematic and dynamic equations of the aircraft. Code: a simplified version of the environment used in this paper. It does not include the commercial software and middleware, but its aircraft model and other functions are the same as those proposed in this paper. This environment and the RL agent are packaged as a supplementary material. Through this material, the alternate freeze game DQN algorithm proposed in this paper can be reproduced. (Supplementary Materials)

References

[1] L. R. Shaw, "Basic fighter maneuvers," in Fighter Combat: Tactics and Maneuvering, M. D. Annapolis, Ed., Naval Institute Press, New York, NY, USA, 1st edition, 1985.

[2] A. Toubman, J. J. Roessingh, P. Spronck, A. Plaat, and J. Herik, Dynamic Scripting with Team Coordination in Air Combat Simulation, Lecture Notes in Computer Science, Kaohsiung, Taiwan, 2014.

[3] A. Toubman, J. J. Roessingh, P. Spronck, A. Plaat, and J. Herik, "Centralized versus decentralized team coordination using dynamic scripting," in Proceedings of the European Modeling and Simulation Symposium, pp. 1–7, Porto, Portugal, 2014.

[4] A. Toubman, J. J. Roessingh, P. Spronck, A. Plaat, and J. Herik, "Rewarding air combat behavior in training simulations," in Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, pp. 1397–1402, Kowloon, China, 2015.

[5] A. Toubman, J. J. Roessingh, P. Spronck, A. Plaat, and J. Herik, "Transfer learning of air combat behavior," in Proceedings of the IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 226–231, Miami, FL, USA, 2015.

[6] D. I. You and D. H. Shim, "Design of an aerial combat guidance law using virtual pursuit point concept," Proceedings of the Institution of Mechanical Engineers, vol. 229, no. 5, pp. 1–22, 2014.

[7] H. Shin, J. Lee, and D. H. Shim, "Design of a virtual fighter pilot and simulation environment for unmanned combat aerial vehicles," in Proceedings of the Advances in Flight Control Systems, pp. 1–21, Grapevine, TX, USA, 2017.

[8] J. M. Eklund, J. Sprinkle, and S. S. Sastry, "Implementing and testing a nonlinear model predictive tracking controller for aerial pursuit/evasion games on a fixed wing aircraft," in Proceedings of the Control System Applications, pp. 1509–1514, Portland, OR, USA, 2005.

[9] J. M. Eklund, J. Sprinkle, and S. S. Sastry, "Switched and symmetric pursuit/evasion games using online model predictive control with application to autonomous aircraft," IEEE Transactions on Control Systems Technology, vol. 20, no. 3, pp. 604–620, 2012.

[10] F. Austin, G. Carbone, M. Falco, H. Hinz, and M. Lewis, Automated Maneuvering Decisions for Air-to-Air Combat, Grumman Corporate Research Center, Bethpage, NY, USA, 1987.

[11] F. Austin, G. Carbone, M. Falco, H. Hinz, and M. Lewis, "Game theory for automated maneuvering during air-to-air combat," Journal of Guidance, Control, and Dynamics, vol. 13, no. 6, pp. 1143–1149, 1990.

[12] Y. Ma, G. Wang, X. Hu, H. Luo, and X. Lei, "Cooperative occupancy decision making of multi-UAV in beyond-visual-range air combat: a game theory approach," IEEE Access, vol. 323, 2019.

[13] J. P. Hespanha, M. Prandini, and S. Sastry, "Probabilistic pursuit-evasion games: a one-step Nash approach," in Proceedings of the 36th IEEE Conference on Decision and Control, Sydney, NSW, Australia, 2000.

[14] A. Xu, Y. Kou, L. Yu, B. Xu, and Y. Lv, "Engagement maneuvering strategy of air combat based on fuzzy Markov game theory," in Proceedings of the IEEE International Conference on Computer, pp. 126–129, Wuhan, China, 2011.

[15] K. Horic and B. A. Conway, "Optimal fighter pursuit-evasion maneuvers found via two-sided optimization," Journal of Guidance, Control, and Dynamics, vol. 29, no. 1, pp. 105–112, 2006.

[16] K. Virtanen, T. Raivio, and R. P. Hamalainen, "Modeling pilot's sequential maneuvering decisions by a multistage influence diagram," Journal of Guidance, Control, and Dynamics, vol. 27, no. 4, pp. 665–677, 2004.

[17] K. Virtanen, J. Karelahti, and T. Raivio, "Modeling air combat by a moving horizon influence diagram game," Journal of Guidance, Control, and Dynamics, vol. 29, no. 5, pp. 1080–1091, 2006.

[18] Q. Pan, D. Zhou, J. Huang et al., "Maneuver decision for cooperative close-range air combat based on state predicted influence diagram," in Proceedings of the Information and Automation for Sustainability, pp. 726–730, Macau, China, 2017.

[19] R. E. Smith, B. A. Dike, R. K. Mehra, B. Ravichandran, and A. El-Fallah, "Classifier systems in combat: two-sided learning of maneuvers for advanced fighter aircraft," Computer Methods in Applied Mechanics and Engineering, vol. 186, no. 2–4, pp. 421–437, 2000.

[20] J. S. McGrew, J. P. How, B. Williams, and N. Roy, "Air-combat strategy using approximate dynamic programming," Journal of Guidance, Control, and Dynamics, vol. 33, no. 5, pp. 1641–1654, 2010.

[21] Y. Ma, X. Ma, and X. Song, "A case study on air combat decision using approximated dynamic programming," Mathematical Problems in Engineering, vol. 2014, 2014.

[22] I. Bisio, C. Garibotto, F. Lavagetto, A. Sciarrone, and S. Zappatore, "Blind detection: advanced techniques for WiFi-based drone surveillance," IEEE Transactions on Vehicular Technology, vol. 68, no. 1, pp. 938–946, 2019.

[23] Y. Li, "Deep reinforcement learning," 2018.

[24] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, London, UK, 2nd edition, 2018.


[25] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.

[26] V. Mnih, K. Kavukcuoglu, D. Silver et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.

[27] Z. Kavukcuoglu, R. Liu, Z. Meng, Y. Zhang, Y. Yu, and T. Lu, "On reinforcement learning for full-length game of StarCraft," 2018.

[28] D. Silver and C. J. Maddison, "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.

[29] D. Hubert and I. Antonoglou, "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play," Science, vol. 362, no. 6419, pp. 1140–1144, 2018.

[30] J. Schrittwieser, J. A. Bagnell, and J. Peters, "Reinforcement learning in robotics: a survey," The International Journal of Robotics Research, vol. 32, no. 11, pp. 1238–1274, 2013.

[31] A. Waldock, C. Greatwood, F. Salama, and T. Richardson, "Learning to perform a perched landing on the ground using deep reinforcement learning," Journal of Intelligent and Robotic Systems, vol. 92, no. 3-4, pp. 685–704, 2018.

[32] R. R. Alejandro, S. Carlos, B. Hriday, P. Paloma, and P. Campoy, "A deep reinforcement learning strategy for UAV autonomous landing on a moving platform," Journal of Intelligent and Robotic Systems, vol. 93, no. 1-2, pp. 351–366, 2019.

[33] W. Lou and X. Guo, "Adaptive trajectory tracking control using reinforcement learning for quadrotor," Journal of Intelligent and Robotic Systems, vol. 13, no. 1, pp. 1–10, 2016.

[34] C. Dunn, J. Valasek, and K. Kirkpatrick, "Unmanned air system search and localization guidance using reinforcement learning," in Proceedings of the AIAA Infotech@Aerospace 2014 Conference, pp. 1–8, Garden Grove, CA, USA, 2014.

[35] S. You, M. Diao, and L. Gao, "Deep reinforcement learning for target searching in cognitive electronic warfare," IEEE Access, vol. 7, pp. 37432–37447, 2019.

[36] P. Luo, J. Xie, and W. Che, "Q-learning based air combat target assignment algorithm," in Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, pp. 779–783, Budapest, Hungary, 2016.

[37] M. Grzes, "Reward shaping in episodic reinforcement learning," in Proceedings of the International Foundation for Autonomous Agents and Multiagent Systems, pp. 565–573, São Paulo, Brazil, 2017.

[38] K. Tumer and A. Agogino, "Agent reward shaping for alleviating traffic congestion," in Proceedings of the International Foundation for Autonomous Agents and Multiagent Systems, pp. 1–7, Hakodate, Japan, 2006.

[39] L. L. B. V. Cruciol, A. C. de Arruda, L. Weigang, L. Li, and A. M. F. Crespo, "Reward functions for learning to control in air traffic flow management," Transportation Research Part C: Emerging Technologies, vol. 35, pp. 141–155, 2013.

[40] P. Hernandez-Leal, M. Kaisers, T. Baarslag, and E. M. de Cote, "A survey of learning in multiagent environments: dealing with non-stationarity," 2019.

[41] M. L. Littman, "Markov games as a framework for multi-agent reinforcement learning," in Proceedings of the International Conference on Machine Learning, pp. 157–163, New Brunswick, NJ, USA, 1994.

[42] M. L. Littman, "Friend-or-foe Q-learning in general-sum games," in Proceedings of the International Conference on Machine Learning, pp. 322–328, Williamstown, MA, USA, 2001.

[43] J. Hu and M. P. Wellman, "Nash Q-learning for general-sum stochastic games," Journal of Machine Learning Research, vol. 4, pp. 1039–1069, 2003.

[44] J. Fan, Z. Wang, Y. Xie, and Z. Yang, "A theoretical analysis of deep Q-learning," 2020.

[45] R. E. Smith, B. A. Dike, B. Ravichandran, A. El-Fallah, and R. Mehra, "Two-sided genetics-based learning to discover novel fighter combat maneuvers," in Lecture Notes in Computer Science, pp. 233–242, Springer, Berlin, Heidelberg, 2001.

[46] A. Ng, D. Harada, and S. Russell, "Policy invariance under reward transformations: theory and application to reward shaping," in Proceedings of the International Conference on Machine Learning, pp. 278–287, Bled, Slovenia, 1999.

[47] T. P. Lillicrap, J. J. Hunt, A. Pritzel et al., "Continuous control with deep reinforcement learning," in Proceedings of the International Conference on Learning Representations, San Juan, Puerto Rico, 2016.

[48] V. Mnih, A. P. Badia, M. Mirza et al., "Asynchronous methods for deep reinforcement learning," in Proceedings of the International Conference on Machine Learning, pp. 1–19, New York, NY, USA, 2016.

[49] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," 2017.

[50] Commercial air combat simulation software FG_SimStudio, http://www.eyextent.com/producttail117.

[51] M. Wiering and M. van Otterlo, Reinforcement Learning: State-of-the-Art, Springer Press, Heidelberg, Germany, 2014.

[52] D. P. Kingma and J. Ba, "Adam: a method for stochastic optimization," in Proceedings of the International Conference on Learning Representations, pp. 1–15, San Diego, CA, USA, 2015.



algorithm with neighborhood search to solve the model. In [13, 14], the air combat game is regarded as a Markov game, and the pursuit maneuver strategy is then solved by computing its Nash equilibrium. This approach solves the problem that the matrix game cannot deal with continuous multiple states; however, it is only suitable for rational opponents. An optimal pursuit-evasion fighter maneuver is formulated as a differential game and then solved by nonlinear programming, which is complex and requires an enormous amount of calculation [15].

Other approaches for pursuit maneuver strategy generation include the influence diagram, the genetic algorithm, and approximate dynamic programming. The influence diagram is used to model sequential maneuver decisions in air combat, and high-performance simulation results are obtained [16–18]. However, this approach is difficult to apply in practice, since the influence diagram is converted into a nonlinear programming problem, which cannot meet the demand for fast computation during air combat. In [19], a genetics-based machine learning algorithm is implemented to generate high angle-of-attack air combat maneuver tactics for the X-31 fighter aircraft in a one-on-one air combat scenario. Approximate dynamic programming (ADP) can be employed to solve the air combat pursuit maneuver decision problem quickly and effectively [20, 21]. By controlling the roll rate, it can provide fast maneuver response in a changing situation.

Most of the previous studies have used various algorithms to solve the pursuit maneuver problem and have obtained some satisfactory results, whereas two problems still exist. One is that previous studies have assumed that the maneuver strategy of the opponent is deterministic or generated by a fixed algorithm; however, in realistic situations, it is difficult for these approaches to deal with the flexible strategies adopted by different opponents. The other is that traditional algorithms rely heavily on prior knowledge and have high computational complexity, so they cannot adapt to the rapidly changing situation in air combat. Since UCAVs have received growing interest worldwide and the flight control system of the aircraft is developing rapidly towards intellectualization [22], it is necessary to study intelligent maneuver strategies for combat between UCAVs and manned aircraft.

In this paper, a deep reinforcement learning- (DRL-) [23] based alternate freeze game approach is proposed to train guidance agents which can provide maneuver instructions in air combat. Using alternate freeze games, in each training period one agent is learning while its opponent is frozen. DRL is a type of artificial intelligence which combines reinforcement learning (RL) [24] and deep learning (DL) [25]. RL allows an agent to learn directly from the environment through trial and error without perfect knowledge of the environment in advance. A well-trained DRL agent can automatically determine an adequate behavior within a specific context, trying to maximize its performance using little computational time. The theory of DRL is very suitable for solving sequential decision-making problems such as maneuver guidance in air combat.

DRL has been utilized in many decision-making fields, such as video games [26, 27], board games [28, 29], and robot control [30], and has obtained great achievements of human-level or superhuman performance. For aircraft guidance research, Waldock et al. [31] proposed a DQN method to generate a trajectory to perform a perched landing on the ground. Alejandro et al. [32] proposed a DRL strategy for autonomous landing of a UAV on a moving platform. Lou and Guo [33] presented a uniform framework of adaptive control by using policy-searching algorithms for a quadrotor. An RL agent is developed for guiding a powered UAV from one thermal location to another by controlling the bank angle [34]. For air combat research, You et al. [35] developed an innovative framework for cognitive electronic warfare tasks by using a DRL algorithm without prior information. Luo et al. [36] proposed a Q-learning-based air combat target assignment algorithm which avoids relying on prior knowledge and performs well.

Reward shaping is a method of incorporating domain knowledge into RL so that the algorithms are guided faster towards more promising solutions [37]. Reward shaping is widely adopted in the RL community, and it is also used in aircraft planning and control. In [4], a reward function is proposed to remedy the false rewards and punishments for firing air combat missiles, which allows computer-generated forces (CGFs) to generate more intelligent behavior. Tumer and Agogino [38] proposed difference reward functions in a multiagent air traffic system and showed that agents can manage effective route selection and significantly reduce congestion. In [39], two types of reward functions are developed to solve ground holding and air holding problems, which assist air traffic controllers in maintaining a high standard of safety and fairness between airlines.

Previous research studies have shown the benefits of using DRL to solve the air maneuver guidance problem. However, there are still two problems in applying DRL to air combat. One is that previous studies did not have a specific environment for air combat, either developing an environment based on universal ones [32] or using a discrete grid world, neither of which has the function of air combat simulation.

The other problem is that classic DRL algorithms are almost all one-sided optimization algorithms, which can only guide an aircraft to a fixed location or a regularly moving destination. However, the opponent in air combat has diverse and variable maneuver strategies, which are nonstationary, and the classic DRL algorithms cannot deal with them. Hernandez-Leal et al. reviewed how nonstationarity is modelled and addressed by state-of-the-art multiagent learning algorithms [40]. Some researchers combine the MDP with game theory and study reinforcement learning in stochastic games to solve the nonstationarity problem [41–43], but there are still two limitations. One is the assumption that the opponent's strategy is rational: in each training step, the opponent chooses the optimal action, so a trained agent can only deal with a rational opponent and cannot fight against a nonrational one. The other is the use of linear programming or quadratic programming to calculate the Nash equilibrium, which leads to a huge amount of calculation. The Minimax-DQN [44] algorithm, which combines DQN and Minimax-Q learning [41] for the two-player zero-sum Markov game, was proposed in a recent paper. Although it can be applied to complex games, it still can only deal with rational opponents and needs to use linear programming to calculate the Q value.

The main contributions of this paper can be summarized as follows:

(i) Middleware is designed and developed, which makes specific software for air combat simulation usable as an RL environment. An agent can be trained in this environment and then used to guide an aircraft to reach the position of advantage in one-on-one air combat.

(ii) Guidance agents for the air combat pursuit maneuver of both sides are designed. The reward shaping method is adopted to improve the convergence speed of training and the performance of the maneuver guidance strategies. An agent is trained in an environment where its opponent is also an adaptive agent, so that the well-trained agent has the ability to fight against an opponent with an intelligent strategy.

(iii) An alternate freeze game framework is proposed to deal with nonstationarity, which can be used to solve the problem of variable opponent strategies in RL. The league system is adopted to select the agent with the highest performance and to avoid the red queen effect [45].

This paper is organized as follows. Section 2 introduces the maneuver guidance strategy problem of one-on-one air combat under consideration and the DRL-based model, as well as the training environment used to solve it. The training optimization, which includes reward shaping and alternate freeze games, is presented in Section 3. The simulation and results of the proposed approach are given in Section 4. The final section concludes the paper.

2. Problem Formulation

In this section, the problem of maneuver guidance in air combat is introduced. The training environment is designed, and the DRL-based model is set up to solve it.

2.1. Problem Statement. The problem solved in this paper is a one-on-one air combat pursuit-evasion maneuver problem. The red aircraft is regarded as our side and the blue one as the enemy. The objective of the guidance agent is to learn a maneuver strategy (policy) to guide the aircraft from its current position to a position of advantage and maintain the advantage. At the same time, it should guide the aircraft not to enter the advantage position of its opponent.

The dynamics of the aircraft are described by a point-mass model [17] and are given by the following differential equations:

    \dot{x} = v\cos\theta\cos\psi,
    \dot{y} = v\cos\theta\sin\psi,
    \dot{z} = v\sin\theta,
    \dot{\theta} = \frac{1}{mv}\big[(L + T\sin\beta)\cos\phi - mg\cos\theta\big],
    \dot{\psi} = \frac{(L + T\sin\beta)\sin\phi}{mv\cos\theta},
    \dot{v} = \frac{1}{m}(T\cos\beta - D) - g\sin\theta,                              (1)

where (x, y, z) are the 3-dimensional coordinates of the aircraft. The terms v, θ, and ψ are the speed, flight path angle, and heading angle of the aircraft, respectively. The mass of the aircraft and the acceleration due to gravity are denoted by m and g, respectively. The three forces are the lift force L, the drag force D, and the thrust force T. The remaining two variables are the angle of attack β and the bank angle ϕ, as shown in Figure 1.

In this paper, the aircraft is assumed to fly at a fixed velocity in the horizontal plane, and the assumption can be written as follows:

    \beta = 0,\qquad \theta = 0,\qquad T = D,\qquad L = \frac{mg}{\cos\phi}.          (2)

The equations of motion for the aircraft are simplified as follows:

    x_{t+\delta t} = x_t + v\,\delta t\cos(\psi_{t+\delta t}),
    y_{t+\delta t} = y_t + v\,\delta t\sin(\psi_{t+\delta t}),
    \psi_{t+\delta t} = \psi_t + \dot{\psi}_{t+\delta t}\,\delta t,
    \dot{\psi}_{t+\delta t} = \frac{g}{v}\tan(\phi_{t+\delta t}),
    \phi_{t+\delta t} = \phi_t + A_t\,\dot{\phi}_t\,\delta t,                         (3)

where v, \dot{\phi}_t, \phi_t, \dot{\psi}_t, and \psi_t are the speed, roll rate, bank angle, turn rate, and heading angle of the aircraft at time t, respectively. The term A_t is the action generated by the guidance agent, and the control actions available to the aircraft are a ∈ {roll-left, maintain-bank-angle, roll-right}. The simulation time step is denoted by δt.
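The simplified update (3) translates directly into code. The following sketch assumes the fixed 200 m/s speed, 40°/s roll rate, and 75° bank-angle limit later listed in Table 1; the function and variable names are chosen here for illustration.

import math

def step_aircraft(x, y, psi, phi, action, v=200.0, roll_rate=math.radians(40),
                  g=9.81, dt=0.1, max_bank=math.radians(75)):
    """One simulation step of equation (3): action in {-1, 0, +1} rolls left,
    holds the bank angle, or rolls right; heading follows the coordinated-turn
    relation psi_dot = (g / v) * tan(phi)."""
    phi = phi + action * roll_rate * dt                 # bank-angle update
    phi = max(-max_bank, min(max_bank, phi))            # roll-angle limit (Table 1)
    psi_dot = (g / v) * math.tan(phi)                   # turn rate
    psi = psi + psi_dot * dt                            # heading update
    x = x + v * dt * math.cos(psi)                      # position update
    y = y + v * dt * math.sin(psi)
    return x, y, psi, phi

# Example: roll right for one 0.1 s step from straight-and-level flight.
print(step_aircraft(0.0, 0.0, 0.0, 0.0, action=+1))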


In aircraft maneuver strategy design, maneuvers in a fixed plane are usually used to measure performance. Figure 2 shows a one-on-one air combat scenario [20] in which each aircraft flies at a fixed velocity in the X-Y plane under a maneuver strategy at time t. The position of advantage is the position from which one aircraft gains opportunities to fire at its opponent. It is defined as

    d_{\min} < d_t < d_{\max},\qquad |\mu^r_t| < \mu_{\max},\qquad |\eta^b_t| < \eta_{\max},        (4)

where the superscripts r and b refer to red and blue, respectively. The term d_t is the distance between the aircraft and the target at time t, μ^r_t is the deviation angle, which is defined as the angle difference between the line-of-sight vector and the heading of our aircraft, and η^b_t is the aspect angle, which is the angle between the longitudinal symmetry axis (in the tail direction) of the target aircraft and the line connecting the target aircraft's tail to the attacking aircraft's nose. The terms d_min, d_max, μ_max, and η_max are thresholds, where the subscripts max and min represent the upper and lower bounds of the corresponding variables, respectively, which are determined by the combat mission and the performance of the aircraft.

The expression d_min < d_t < d_max makes sure that the opponent is within the attacking range of the air-to-air weapon of the aircraft. The expression |μ^r_t| < μ_max refers to an area from which it is difficult for the blue aircraft to escape sensor locking. The expression |η^b_t| < η_max defines an area where the killing probability is high when attacking from the rear of the blue aircraft.
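Condition (4) translates directly into a short predicate. The sketch below uses the threshold values later listed in Table 1 as defaults; the function name and degree-based angle convention are illustrative choices.

def in_position_of_advantage(d, mu_r, eta_b,
                             d_min=100.0, d_max=500.0,
                             mu_max=60.0, eta_max=30.0):
    """Check condition (4): range inside the weapon envelope, small deviation
    angle of our aircraft, and small aspect angle of the opponent.
    Threshold defaults follow Table 1 (distances in meters, angles in degrees)."""
    return (d_min < d < d_max) and (abs(mu_r) < mu_max) and (abs(eta_b) < eta_max)

# Example: 300 m behind the opponent, almost aligned with its tail.
print(in_position_of_advantage(d=300.0, mu_r=10.0, eta_b=5.0))   # True
print(in_position_of_advantage(d=300.0, mu_r=10.0, eta_b=45.0))  # False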

2.2. DRL-Based Model. An RL agent interacts with an environment over time and aims to maximize the long-term reward [24]. At each training time t, the agent receives a state S_t in a state space S and generates an action A_t from an action space A following a policy π: S × A ⟶ R. Then the agent receives a scalar reward R_t and transitions to the next state S_{t+1} according to the environment dynamics, as shown in Figure 3.

A value function V(s) is defined to evaluate the air combat advantage of each state, which is the expectation of the discounted cumulative rewards over all states following time t:

    V(s) = \mathbb{E}_\pi\{R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots \mid S_t = s\},        (5)

where γ ∈ [0, 1] is the discount factor and the policy π(s, a) is a mapping from the state space to the action space.

The agent guides an aircraft to maneuver by changing its bank angle ϕ using the roll rate \dot{ϕ}. The action space of DRL is {−1, 0, 1}, and A_t can take one of these three values at time t, which mean roll-left, maintain-bank-angle, and roll-right, respectively. The position of the aircraft is updated according to (3). The action-value function Q(s, a) can be used, which is defined as the value of taking action a in state s under a policy π:

    Q(s, a) = \mathbb{E}_\pi\left\{\sum_{k=0}^{\infty} \gamma^k R_{t+k} \,\middle|\, S_t = s,\ A_t = a\right\}.        (6)

The Bellman optimality equation is

    Q^*(s, a) = \mathbb{E}\left\{R_t + \gamma \max_{a'} Q^*(S_{t+1}, a') \,\middle|\, S_t = s,\ A_t = a\right\}.        (7)

Any policy that is greedy with respect to the optimal evaluation function Q*(s, a) is an optimal policy. In practice, Q*(s, a) can be obtained through iterations using temporal-difference learning, and its update formula is given in (8).

Figure 2: Aircraft maneuver guidance problem in air combat (goal zone between d_min and d_max; red aircraft: our side; blue aircraft: enemy).

Figure 1: Aircraft dynamics parameters.

Figure 3: Reinforcement learning in air combat (the air combat guidance agent receives the state S_t, consisting of the aircraft status and opponent status, and the reward R_t from the air combat environment, and outputs the action A_t, changing the bank angle).


    Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\Big(R_t + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t)\Big),        (8)

where Q(S_t, A_t) is the estimated action-value function and α is the learning rate.

In air combat maneuvering applications, the reward is sparse: only at the end of a game (the terminal state), according to the result of winning, losing, or drawing, is a reward value given. This forces learning algorithms to use delayed feedback to determine the long-term consequences of their actions in all nonterminal states, which is time consuming. Reward shaping was proposed by Ng et al. [46] to introduce additional rewards into the learning process, which guides the algorithm towards learning a good policy faster. The shaping reward function F(S_t, A_t, S_{t+1}) has the form

    F(S_t, A_t, S_{t+1}) = \gamma\Phi(S_{t+1}) - \Phi(S_t),        (9)

where Φ(S_t) is a real-valued function over states. It can be proven that the final policy after using reward shaping is equivalent to the final policy without it [46]. R_t in the previous equations can be replaced by R_t + F(S_t, A_t, S_{t+1}).

Taking the red agent as an example, the state space of one guidance agent can be described with a vector:

    S_t = \{x^r_t, y^r_t, \psi^r_t, \phi^r_t, x^b_t, y^b_t, \psi^b_t\}.        (10)

The state that an agent receives in one training step includes the position, heading angle, and bank angle of its own aircraft and the position and heading angle of its opponent, which is a 7-dimensional vector over a continuous space. Therefore, a function approximation method should be used to combine features of the state with learned weights. DRL is a solution to this problem, and great achievements have been made using advanced DRL algorithms such as DQN [26], DDPG [47], A3C [48], and PPO [49]. Since the action of aircraft guidance is discrete, the DQN algorithm can be used. The action-value function can be estimated with function approximation such that \hat{Q}(s, a; w) ≈ Q_\pi(s, a), where w are the weights of a neural network.

The DQN algorithm is executed according to the training time step. However, there is not only a training time step but also a simulation time step. Because of this inconsistency, the DQN algorithm needs to be adapted to train a guidance agent. The pseudo-code of DQN for aircraft maneuver guidance training in air combat is shown in Algorithm 1. At each training step, the agent receives information about the aircraft and its opponent and then generates an action and sends it to the environment. After receiving the action, the aircraft in the environment is maneuvered according to this action in each simulation step.

2.3. Middleware Design. An RL agent improves its ability through continuous interaction with the environment. However, there is no maneuver guidance environment for air combat, and some combat simulation software can only execute agents, not train them. In this paper, middleware is designed and developed based on commercial air combat simulation software [50] coded in C++. The functions of the middleware include the interface between the software and RL agents, coordination of the simulation time step and the RL training time step, and reward calculation. By using it, an RL environment for aircraft maneuver guidance training in air combat is realized. The guidance agents of both aircrafts have the same structure. Figure 4 shows the structure and the training process of the red side.

In actual or simulated air combat, an aircraft perceives the situation through multiple sensors. By analyzing the situation after information fusion, maneuver decisions are made by the pilot or the auxiliary decision-making system. In this system, situation perception and decision-making are the same as in actual air combat. An assumption is made that a sensor with full situation perception ability is used, through which an aircraft can obtain the position and heading angle of its opponent.

Maneuver guidance in air combat using RL is an episodic task, and there is the notion of episodes of some length, where the goal is to take the agent from a starting state to a goal state [51]. First, a flight level with fixed velocity and the one-on-one air combat scenario are set up in the simulation software. Then the situation information of both aircrafts is randomly initialized, including their positions, heading angles, bank angles, and roll rates, which constitute the starting state of each agent. The goal states are reached when one aircraft reaches the position of advantage and maintains it, when one aircraft flies out of the sector, or when the simulation time runs out.

Middleware can be used in the training and application phases of agents. The training phase is the process of taking the agent from zero to having maneuver guidance ability, which includes training episodes and testing episodes. In a training episode, the agent is trained by the DRL algorithm, and its guidance strategy is continuously improved. After a number of training episodes, some testing episodes are performed to verify the ability of the agent at the current stage, in which the strategy does not change. The application phase is the process of using the well-trained agent for maneuver guidance in air combat simulation.

A training episode is composed of training steps. In each training step, first, the information recognized by the airborne sensor is sent to the guidance agent through the middleware as a tuple of state coded in Python; see Steps 1, 2, and 3 in Figure 4. Then the agent generates an action using its neural networks according to the exploration strategy, as shown in Steps 4 and 5. The simulation software receives a guidance instruction transformed by the middleware and sends the next situation information to the middleware; see Steps 6 and 7. Next, the middleware transforms the situation information into state information, calculates the reward, and sends them to the agent, as shown in Steps 2, 3, and 8. Because agent training is an iterative process, all the steps except Step 1 are executed repeatedly in an episode. Last, the tuple of the current state, action, reward, and next state in each step is stored in the replay memory for the training of the guidance agent; see Steps 9 and 10. In each simulation step, the aircraft maneuvers according to the instruction and recognizes the situation according to its sensor configuration.
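The ten-step exchange described above is essentially the familiar reset/step contract of an RL environment. A minimal wrapper sketch is shown below; the simulation client object and its method names are hypothetical stand-ins for the commercial software interface, not the actual middleware implementation.

class AirCombatEnv:
    """Sketch of the middleware as seen from the agent side: it turns the
    simulator's situation reports into 7-dimensional state vectors, forwards
    the agent's action as a guidance instruction for Δt/δt simulation steps,
    and computes the reward (Steps 1-10 in Figure 4)."""

    def __init__(self, sim, train_dt=0.5, sim_dt=0.1):
        self.sim = sim                            # hypothetical simulation client
        self.substeps = int(train_dt / sim_dt)    # Δt/δt inner iterations

    def reset(self):
        situation = self.sim.initialize_random_scenario()
        return self._to_state(situation)          # Steps 1-3: situation -> state

    def step(self, action):
        for _ in range(self.substeps):            # Steps 6-7: hold the instruction
            situation = self.sim.advance(action)  # one δt simulation step
            if situation.terminal:
                break
        next_state = self._to_state(situation)
        reward = self._reward(situation)          # Step 8: reward calculation
        return next_state, reward, situation.terminal

    def _to_state(self, situation):
        own, opp = situation.red, situation.blue  # state vector of equation (10)
        return [own.x, own.y, own.psi, own.phi, opp.x, opp.y, opp.psi]

    def _reward(self, situation):
        return 0.0                                # placeholder; shaped reward of Section 3.1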


3. Training Optimization

3.1. Reward Shaping. Two problems need to be solved when using the DRL method to train a maneuver guidance agent in air combat. One is that the only criterion to evaluate the guidance is the successful arrival at the position of advantage, which is a sparse reward problem leading to slow convergence of training. The other is that, in realistic situations, the time to arrive at the position of advantage and the quality of the trajectory need to be considered. In this paper, a reward shaping method is proposed to improve the training speed and the performance of the maneuver guidance strategy.

Reward shaping [46] is usually used to modify the reward function to facilitate learning while maintaining the optimal policy, which is a manual endeavour. In this study, there are four rules to follow in reward shaping:

(i) Limited by the combat area and the maximum number of control times, the aircraft is guided to the position of advantage.

(ii) The aircraft should be guided closer and closer to its goal zone.

(iii) The aircraft should be guided away from the goal zone of its opponent.

(1) Set parameters of the controlled aircraft and its opponent
(2) Set the training time step size Δt and the simulation time step size δt
(3) Initialize the maximum number of training episodes M and the maximum simulation time T in each episode
(4) Initialize the replay memory D to capacity N
(5) Initialize the action-value function Q and the target action-value function Q⁻ with random weights
(6) for episode = 0, M do
(7)   Initialize state S_0
(8)   for t = 0, T/Δt do
(9)     With probability ϵ select a random action A_t
(10)    Otherwise select A_t = argmax_a Q(S_t, a; w)
(11)    for i = 0, Δt/δt do
(12)      Execute action A_t in the air combat simulation software
(13)      Obtain the positions of the aircraft and the target
(14)      if episode terminates then
(15)        break
(16)      end if
(17)    end for
(18)    Observe reward R_t and state S_{t+1}
(19)    Store transition [S_t, A_t, R_t, S_{t+1}] in D
(20)    if episode terminates then
(21)      break
(22)    end if
(23)    Sample a random minibatch of transitions [S_j, A_j, R_j, S_{j+1}] from D
(24)    if episode terminates at step j + 1 then
(25)      set Y_j = R_j
(26)    else
(27)      set Y_j = R_j + γ max_a Q⁻(S_{j+1}, a; w⁻)
(28)    end if
(29)    Perform a gradient descent step on (Y_j − Q(S_j, A_j; w_j))² with respect to the network parameters w
(30)    Every C steps reset Q⁻ = Q
(31)  end for
(32) end for

Algorithm 1: DQN [26] for maneuver guidance agent training.

Figure 4: Structure and training process of the red side (the middleware exchanges state, reward, and action between the red aircraft guidance agent and the air combat simulation software; the aircraft motion loop runs Δt/δt times per training step, and transitions [S_t, A_t, R_t, S_{t+1}] are stored in the replay memory).


(iv) The aircraft should be guided to its goal zone in as short a time as possible.

According to the above rules, the reward function is defined as

    R(S_t, A_t, S_{t+1}) = R_t + F(S_t, A_t, S_{t+1}),        (11)

where R_t is the original scalar reward and F(S_t, A_t, S_{t+1}) is the shaping reward function. R_t is given by

    R_t = w_1 T(S_{t+1}) + w_2,        (12)

where T(S_{t+1}) is the termination reward function. The term w_1 is a coefficient, and w_2 is a penalty for time consumption, which is a constant.

There are four kinds of termination states: the aircraft arrives at the position of advantage; the opponent arrives at its position of advantage; the aircraft moves out of the combat area; and the maximum number of control times has been reached while each aircraft is still in the combat area and has not reached its advantage situation. The termination reward is the reward obtained when the next state is a termination state, which is often used in standard RL training. It is defined as

    T(S_{t+1}) =
      \begin{cases}
        c_1, & \text{if win},\\
        c_2, & \text{else if loss},\\
        c_3, & \text{else if out of the sector},\\
        c_4, & \text{else if no control times left},\\
        0, & \text{else},
      \end{cases}        (13)

Usually, c_1 is a positive value, c_2 is a negative value, and c_3 and c_4 are nonpositive.

The shaping reward function F(S_t, A_t, S_{t+1}) has the form described in (9). The real-valued function Φ(S_t) is an advantage function of the state: the larger the value of this function, the more advantageous the current situation is. This function provides additional information to help the agent select actions, which is better than only using the termination reward. It is defined as

    \Phi(S_t) = \frac{D(S_t)\,O(S_t)}{100},        (14)

where D(S_t) is the distance reward function and O(S_t) is the orientation reward function, which are defined as

    D(S_t) = \exp\left(-\frac{\left|d_t - (d_{\max} + d_{\min})/2\right|}{180^\circ\, k}\right),        (15)

    O(S_t) = 1 - \frac{|\mu^r_t| + |\eta^b_t|}{180^\circ},        (16)

where k has units of meters/degree and is used to adjust the relative effect of range and angle; a value of 10 is effective.
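Equations (9) and (11)–(16) can be combined into a single per-step reward computation. The sketch below uses the constants reported in Section 4.1 and Table 1 as they are printed; the function signatures and the way the outcome is flagged are illustrative choices, not the paper's implementation.

import math

# Constants from Section 4.1 and Table 1 (angles in degrees, distances in meters).
C1, C2, C3, C4 = 2.0, -2.0, -1.0, -1.0
W1, W2 = 0.98, 0.01
GAMMA = 0.99
D_MIN, D_MAX = 100.0, 500.0
K = 10.0  # meters per degree, relative weight of range vs. angle

def termination_reward(outcome):
    """Equation (13): outcome is one of 'win', 'loss', 'out', 'timeout', or None."""
    return {"win": C1, "loss": C2, "out": C3, "timeout": C4}.get(outcome, 0.0)

def advantage(d, mu_r, eta_b):
    """Equations (14)-(16): distance score times orientation score, divided by 100."""
    dist = math.exp(-abs(d - (D_MAX + D_MIN) / 2.0) / (180.0 * K))
    orient = 1.0 - (abs(mu_r) + abs(eta_b)) / 180.0
    return dist * orient / 100.0

def shaped_reward(state, next_state, outcome=None):
    """Equations (9), (11), and (12): termination reward plus time term, plus
    the potential-based shaping term gamma * Phi(s') - Phi(s).
    state and next_state are (d, mu_r, eta_b) tuples."""
    base = W1 * termination_reward(outcome) + W2                   # equation (12)
    shaping = GAMMA * advantage(*next_state) - advantage(*state)   # equation (9)
    return base + shaping                                          # equation (11)

# Example: closing in on the goal zone from a slightly worse geometry.
print(shaped_reward((800.0, 40.0, 50.0), (600.0, 25.0, 35.0)))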

3.2. Alternate Freeze Game DQN for Maneuver Guidance Agent Training. Unlike board and RTS games, data from human players in air combat games are very rare, so pretraining methods using supervised learning algorithms cannot be used. We can only assume a strategy used by the opponent and then propose a strategy to defeat it. We then think about what strategies the opponent will use to confront our strategy and then optimize our strategy.

In this paper, alternate freeze learning is used in games to solve the nonstationarity problem. Both aircrafts in a one-on-one air combat scenario use DRL-based agents to adapt their maneuver strategies. In each training period, one agent is learning from scratch while the strategy of its opponent, which was obtained from the previous training period, is frozen. Through games, the maneuver guidance performance of each side rises alternately, and different agents with a high level of maneuver decision-making ability are generated. The pseudo-code of alternate freeze game DQN is shown in Algorithm 2.

An important effect that must be considered in this approach is the red queen effect. When one aircraft uses a DRL agent and its opponent uses a static strategy, the performance of the aircraft is absolute with respect to its opponent. However, when both aircrafts use DRL agents, the performance of each agent is only relative to its current opponent. As a result, the trained agent is only suitable for its latest opponent and cannot deal with the previous ones.

Through K training periods of alternate freeze learning, K red agents and K + 1 blue agents are saved in the training process. The league system is adopted, in which those agents are regarded as players. By using a combination of strategies from multiple opponents, the opponent's mixed strategy becomes smoother. In different initial scenarios, each agent guides the aircraft to confront the aircraft guided by each opponent. The goal is to select one of our strategies through the league so that the probability of winning is high and the probability of losing is low. According to the results of the competition, each agent obtains different points. The top agent in the league table is selected as the optimum one.
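A round-robin league of this kind can be tallied with a few lines of code. In the sketch below, the 3/1/0 scoring for win/draw/loss is an assumption that is consistent with the Score column of Table 3, and play_match is a hypothetical helper that runs one multi-game confrontation and returns (wins, draws, losses).

def league_table(red_agents, blue_agents, play_match, win_pts=3, draw_pts=1):
    """Round-robin league: every red agent plays every blue agent, and the
    red agents are ranked by total score (points assumed to match Table 3)."""
    table = []
    for red in red_agents:
        wins = draws = losses = 0
        for blue in blue_agents:
            w, d, l = play_match(red, blue)   # e.g. 400 games per pairing
            wins, draws, losses = wins + w, draws + d, losses + l
        score = win_pts * wins + draw_pts * draws
        table.append((red, wins, draws, losses, score))
    return sorted(table, key=lambda row: row[-1], reverse=True)

# The top entry of the returned table is the selected strategy.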

4. Simulation and Results

In this section, first, random air combat scenarios are initialized in the simulation software for maneuver guidance agent training and testing. Second, agents of both sides with reward shaping are created and trained using the alternate freeze game DQN algorithm. Third, taking the well-trained agents as players, the league system is adopted, and the top agent in the league table is selected. Last, the selected agent is evaluated, and the maneuver behavior of the aircraft is analyzed.

4.1. Simulation Setup. The air combat simulation parameters are shown in Table 1. The Adam optimizer [52] is used for learning the neural network parameters, with a learning rate of 5 × 10^−4 and a discount factor γ = 0.99. The replay memory size is 1 × 10^5. The initial value of ϵ in ϵ-greedy exploration is 1, and the final value is 0.1. For each simulation, the reward shaping parameters c_1, c_2, c_3, and c_4 are set to 2, −2, −1, and −1, respectively. The terms w_1 and w_2 are set to 0.98 and 0.01, respectively. The number of training periods K is 10.
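For convenience, these settings can be collected into a single configuration object. The dictionary below merely restates the values of this subsection (the key names are illustrative), with the network and minibatch choices taken from the comparison discussed next.

TRAINING_CONFIG = {
    "optimizer": "Adam",                 # [52]
    "learning_rate": 5e-4,
    "discount_gamma": 0.99,
    "replay_memory_size": 100_000,
    "epsilon_initial": 1.0,
    "epsilon_final": 0.1,
    "reward_c": {"win": 2, "loss": -2, "out_of_sector": -1, "no_control_left": -1},
    "reward_w1": 0.98,
    "reward_w2": 0.01,
    "training_periods_K": 10,
    "hidden_layers": (128, 128, 256),    # selected network (see Figure 5 discussion)
    "minibatch_size": 512,               # selected minibatch size
}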


We evaluated three neural networks and three minibatch sizes. Each neural network has 3 hidden layers, and the numbers of units are 64, 64, and 128 for the small neural network (NN); 128, 128, and 256 for the medium one; and 256, 256, and 512 for the large one. The minibatch (MB) sizes are 256, 512, and 1024, respectively. These neural networks and minibatches are used in pairs to train red agents that play against the initial blue agent. The simulation results are shown in Figure 5. First, a large minibatch cannot make the training successful for the small and medium neural networks, and its training speed is slow for the large network. Second, using the small neural network, the average reward obtained is lower than that of the other two networks. Third, if the same neural network is adopted, the training speed using the medium minibatch is faster than using the small one. Last, considering the computational efficiency, the selected network has 3 hidden layers with 128, 128, and 256 units, respectively, and the selected minibatch size is 512.
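A sketch of how the three candidate Q-networks can be constructed parametrically is shown below; PyTorch is used purely for illustration, since the paper does not name its deep learning framework.

import torch.nn as nn

def build_q_network(hidden, state_dim=7, n_actions=3):
    """Three hidden fully connected layers, as in the comparison above."""
    layers, in_dim = [], state_dim
    for width in hidden:
        layers += [nn.Linear(in_dim, width), nn.ReLU()]
        in_dim = width
    layers.append(nn.Linear(in_dim, n_actions))
    return nn.Sequential(*layers)

small_nn  = build_q_network((64, 64, 128))
medium_nn = build_q_network((128, 128, 256))   # the configuration finally selected
large_nn  = build_q_network((256, 256, 512))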

In order to improve the generality of a single agent, initial positions are randomized in the air combat simulation. In the first iteration of the training process, the initial strategy is formed by a randomly initialized neural network for the red agent, and the conservative strategy of the maximum-overload turn is adopted by the blue one. In the subsequent training, the training agent is trained from scratch, and its opponent adopts the strategy obtained from the latest training.

Air combat geometry can be divided into four categories (from the red aircraft's perspective): offensive, defensive, neutral, and head-on [7], as shown in Figure 6(a). Defining the deviation angle and aspect angle from −180° to 180°, Figure 6(b) shows an advantage diagram where the distance is 300 m. A smaller |μ^r| means that the heading or gun of the red aircraft is better aimed at its opponent, and a smaller |η^b| implies a higher possibility of the blue aircraft being shot or facing a fatal situation.

(1) Set parameters of both aircrafts
(2) Set simulation parameters
(3) Set the number of training periods K and the condition for ending each training period, W_threshold
(4) Set DRL parameters
(5) Set the opponent initialization policy πb0
(6) for period = 1, K do
(7)   for aircraft ∈ [red, blue] do
(8)     if aircraft = red then
(9)       Set the opponent policy π = πb(period−1)
(10)      Initialize the neural networks of the red agent
(11)      while winning rate < W_threshold do
(12)        Train the agent using Algorithm 1
(13)      end while
(14)      Save the well-trained agent, whose maneuver guidance policy is πr(period)
(15)    else
(16)      if period = K + 1 then
(17)        break
(18)      else
(19)        Set the opponent policy π = πr(period)
(20)        Initialize the neural networks of the blue agent
(21)        while winning rate < W_threshold do
(22)          Train the agent using Algorithm 1
(23)        end while
(24)        Save the well-trained agent, whose maneuver guidance policy is πb(period)
(25)      end if
(26)    end if
(27)  end for
(28) end for

Algorithm 2: Alternate freeze game DQN for maneuver guidance agent training in air combat.
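The outer loop of Algorithm 2 can be sketched compactly as follows; make_env, init_agent, train_until, and evaluate_winning_rate are hypothetical helpers (the last one standing in for the testing episodes described in Section 2.3), and the 0.5 winning-rate threshold is only an illustrative default, since the paper does not report the value of W_threshold.

def alternate_freeze_training(make_env, init_agent, train_until, evaluate_winning_rate,
                              initial_blue_policy, periods=10, win_threshold=0.5):
    """Sketch of Algorithm 2: in each period, one side learns from scratch while
    the other side's latest policy is frozen inside the environment."""
    red_policies, blue_policies = [], [initial_blue_policy]
    for period in range(1, periods + 1):
        # Red learns against the frozen blue policy from the previous period.
        red = init_agent()
        env = make_env(opponent=blue_policies[-1])
        while evaluate_winning_rate(red, blue_policies[-1]) < win_threshold:
            train_until(red, env)              # some episodes of Algorithm 1
        red_policies.append(red)
        # Blue learns against the newly frozen red policy.
        blue = init_agent()
        env = make_env(opponent=red_policies[-1])
        while evaluate_winning_rate(blue, red_policies[-1]) < win_threshold:
            train_until(blue, env)
        blue_policies.append(blue)
    return red_policies, blue_policies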

Table 1: Air combat simulation parameters.

Red aircraft parameters
  Velocity: 200 m/s
  Maximum roll angle: 75°
  Roll rate: 40°/s
  Initial position: random position and heading angle

Blue aircraft parameters
  Velocity: 200 m/s
  Maximum roll angle: 75°
  Roll rate: 40°/s
  Initial position: map center, northward
  Initial policy (πb0): maximum overload turn

Simulation parameters
  Position of advantage: d_max = 500 m, d_min = 100 m, μ_max = 60°, η_max = 30°
  Combat area: 5 km × 5 km
  Simulation time T: 500 s
  Simulation time step δt: 0.1 s
  Training time step Δt: 0.5 s


For example, in the offensive scenario, the initial state value of the red agent is large because of the small |μ^r| and |η^b|, according to equations (14) and (16). Therefore, in the offensive initial scenario, the win probability of our side will be larger than that of the opponent, while the defensive initial scenario is the reverse. In the neutral and head-on initial scenarios, the initial state values of both sides are almost equal. The performance of the well-trained agents is verified according to the classification of the four initial situations in the league system.

4.2. Simulation Results. The policy naming convention is πbi or πri, which means a blue or red policy produced after i periods, respectively. After every 100 training episodes, the test episodes are run 100 times, and the learning curves are shown in Figure 7. For the first period (πr1 vs. πb0), due to the simple maneuver policy of the blue aircraft, the red aircraft can achieve an almost 100% success rate through about 4000 episodes of training, as shown in Figure 7(a). At the beginning, most of the games are draws because of the random initialization of the red agent and the evasion strategy of its opponent. Another reason is that the aircraft often flies out of the airspace in the early stages of training; although the agent gets a penalty for this, the air combat episode still ends in a draw. During the training process, the red aircraft is looking for winning strategies in the airspace, and its winning rate is constantly increasing. However, because of its pursuit of offense and neglect of defense, the winning rate of its opponent is also rising. In the later stage of training, the red aircraft gradually understands the opponent's strategy, and it can achieve a very high winning rate.

With the iterations in the games, both agents learn intelligent maneuver strategies which can guide the aircraft in the pursuit-evasion game. Figure 7(b) shows a typical example, in which πb5 is the agent being trained and πr5 is its frozen opponent. After about 5000 episodes of training, the blue aircraft can reduce the losing rate to a lower level. Since the red one adopts an intelligent maneuver guidance strategy, the blue agent cannot learn a winning strategy in a short time. After about 20000 episodes, the agent gradually understands the opponent's maneuver strategy, and its winning rate keeps increasing. The final winning rate is stable between 50% and 60%, and the losing rate is below 10%.

The training process iteration of πr10 vs. πb9 is shown in Figure 7(c). Because the maneuver strategy of the blue aircraft is an intelligent strategy which has been trained iteratively, training the red agent is much more difficult than in the first iteration. The well-trained agent can win more than 60% and lose less than 10% of the games against πb9.

It is hard to get a higher winning rate in any training except for the scenario where the opponent is πb0. This is because both aircrafts are homogeneous, the initial situations are randomized, and the opponent in the training environment is intelligent. In some situations, as long as the opponent's strategy is intelligent enough, it is almost impossible for the trained agent to win. Interestingly, in some scenarios where the opponent is intelligent enough and the aircraft cannot defeat it no matter how the agent operates, the agent will guide the aircraft out of the airspace to obtain a draw. This gives us an insight: in a scenario where you cannot win, it may be best to get out of the combat area.

For the four initial air combat situations, the two aircrafts are guided by the agents trained in each iteration stage. The typical flight trajectories are shown in Figure 8. Since πb0 is a static strategy, πr1 can easily win in the offensive, head-on, and neutral situations, as shown in Figures 8(a), 8(c), and 8(d). In the defensive situation, the red aircraft first turns away from its opponent, who has the advantage, and then looks for opportunities to establish an advantage situation. There are no fierce rivalries in the process of combat, as shown in Figure 8(b).

Although πr5 is an intelligent strategy, the well-trained πb5 can still gain an advantage in combat, as shown in Figures 8(e)–8(h). In the offensive situation (from the red aircraft's perspective), the blue aircraft cannot easily get rid of the red one's following, and after fierce confrontation it can establish an advantage situation, as shown in Figure 8(e). Figure 8(f) shows the defensive situation, in which the blue aircraft can keep its advantage until it wins. In the other two scenarios, the blue aircraft performs well, especially in the neutral initial situation, in which it adopts delayed-turning tactics and successfully wins the air combat.

πb9 is an intelligent strategy, and the well-trained πr10 can achieve more than half of the victories and less than 10% of the failures. However, unlike πr1 vs. πb0, πr10 cannot easily win, especially in head-on situations, as shown in Figures 8(i)–8(l).

Figure 5: Evaluation of the neural network and minibatch (average reward versus training episodes for each NN/MB combination).


In the offensive situation, the blue aircraft tries to get rid of the red one by constantly adjusting its bank angle, but it is finally captured by the red one. In the defensive situation, the red aircraft cleverly adopts the circling-back tactic and quickly captures its opponent. In the head-on situation, the red and blue aircrafts alternately occupy the advantage situation, and after fierce maneuver confrontation, the red aircraft finally wins. In the neutral situation, the red one constantly adjusts its bank angle according to the opponent's position and wins.

For the above three pairs of strategies, the head-on initial situations are taken as examples to analyze the advantages of both sides in the game, as shown in Figure 9. For the πr1 vs. πb0 scenario, because πb0 has no intelligent maneuver guidance ability, although both sides are equally competitive in the first half, the red aircraft continues to expand its advantage until it wins, as shown in Figure 9(a). When the blue agent learns an intelligent strategy, it can defeat the red one, as shown in Figure 9(b). For the πr10 vs. πb9 scenario, the two sides alternately establish an advantage position, and the red aircraft does not win easily, as shown in Figure 9(c). Combined with the trajectories shown in Figures 8(c), 8(g), and 8(k), the maneuver strategies of both sides have been improved using the alternate freeze game DQN algorithm. The approach has the benefit of discovering new maneuver tactics.

4.3. League Results. The objective of the league system is to select a red agent that generates maneuver guidance instructions according to its strategy so as to achieve the best results without knowing the strategy of its opponent. There are ten agents on each side, which are saved in the iteration of games. Each red agent fights each blue agent 400 times, including 100 times for each of the four initial scenarios. The detailed league results are shown in Table 2; pairings in which the red agent's win/loss ratio is greater than 1 indicate an advantage over that opponent.

[Figure 6: Situation recognition based on deviation angle and aspect angle. (a) Four initial air combat situation categories (offensive, defensive, neutral, head-on). (b) Situation recognition diagram, with deviation angle μ^r and aspect angle η^b each ranging from −180° to 180°.]

[Figure 7: Learning curves in the training process (win, draw, and loss percentages versus training episodes). (a) π^r_1 vs. π^b_0. (b) π^b_5 vs. π^r_5. (c) π^r_10 vs. π^b_9.]


[Figure 8: Typical trajectory results in the training iteration of air combat games. (a) π^r_1 vs. π^b_0, offensive. (b) π^r_1 vs. π^b_0, defensive. (c) π^r_1 vs. π^b_0, head-on. (d) π^r_1 vs. π^b_0, neutral. (e) π^b_5 vs. π^r_5, offensive. (f) π^b_5 vs. π^r_5, defensive. (g) π^b_5 vs. π^r_5, head-on. (h) π^b_5 vs. π^r_5, neutral. (i) π^r_10 vs. π^b_9, offensive. (j) π^r_10 vs. π^b_9, defensive. (k) π^r_10 vs. π^b_9, head-on. (l) π^r_10 vs. π^b_9, neutral.]

[Figure 9: Advantage curves (advantage function value of each side versus training steps) in head-on air combat games. (a) π^r_1 vs. π^b_0. (b) π^b_5 vs. π^r_5. (c) π^r_10 vs. π^b_9.]


π^r_i, where i ∈ [1, 10], is trained in the environment with π^b_(i−1) as its opponent, and π^b_j is trained against π^r_j, j ∈ [1, 10]. The result of π^r_1 fighting against π^b_0 is ideal. However, since π^b_0 is a static strategy rather than an intelligent one, π^r_1 does not perform well in the games against the other, intelligent blue agents. For π^r_2 to π^r_10, although there is no special training against π^b_0, the performance against it is still very good, because π^b_0 is a static strategy. In the iterative game process, the trained agents obtain good results in the confrontation with the frozen agents in their environment; for example, π^r_2 has an overwhelming advantage over π^b_1, and likewise π^b_2 over π^r_2.

Because of the red queen effect, although π^r_10 has an advantage against π^b_9, it is in a weak position against the early blue agents, such as π^b_2, π^b_3, and π^b_4. The league table is shown in Table 3, in which 3 points are awarded for a win, 1 point for a draw, and 0 points for a loss. In the early stage of training, the performance of the trained agents improves gradually, and the later trained agents are better than the earlier ones, for example from π^r_1 to π^r_6. As the alternate freeze training goes on, the performance does not keep improving, as seen from π^r_7 to π^r_10. The top of the score table is π^r_6, which not only has the highest score but also holds an advantage when playing against all blue agents except π^b_6, as shown in Table 2. In the competition with all opponents, it wins 44% of the games and remains unbeaten in 75%.
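The league table in Table 3 is obtained by tallying each red agent's wins, draws, and losses over all pairings in Table 2 and applying the 3/1/0 scoring rule. A minimal sketch of this tally is given below; the record format (a mapping from red/blue pairings to win-draw-loss counts) is an assumed layout for illustration, not the authors' data structure.

def build_league_table(results):
    # results[(red, blue)] = (wins, draws, losses) over the 400 games of a pairing.
    # Returns (agent, wins, draws, losses, score) rows sorted by score, with
    # 3 points per win, 1 per draw, and 0 per loss, as in Table 3.
    totals = {}
    for (red, _blue), (win, draw, loss) in results.items():
        w, d, l = totals.get(red, (0, 0, 0))
        totals[red] = (w + win, d + draw, l + loss)
    table = [(agent, w, d, l, 3 * w + d) for agent, (w, d, l) in totals.items()]
    table.sort(key=lambda row: row[-1], reverse=True)
    return table

# Example with the first three pairings of Table 2 (red agent pi_r1):
results = {("pi_r1", "pi_b" + str(j)): wdl for j, wdl in enumerate(
    [(393, 7, 0), (17, 103, 280), (83, 213, 104)])}
print(build_league_table(results))   # [('pi_r1', 493, 323, 384, 1802)]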

To verify the performance of this agent, it is confronted with opponent agents trained by ADP [20] and Minimax-DQN [44]. Each of the four typical initial situations is played 100 times against both opponents, and the results are shown in Table 4. The time cost to generate an action for each algorithm is shown in Table 5. The performance of the agent presented in this paper is comparable to that of ADP, and the computational time is slightly reduced. However, ADP is a model-based method that is assumed to know all the information, such as the roll rate of the opponent. Compared with Minimax-DQN, the agent has an overall advantage. This is because Minimax-DQN assumes that the opponent is rational, while the opponent in this paper is not, which is closer to the real world. In addition, compared with Minimax-DQN, the computational efficiency of the proposed algorithm is significantly improved, because Minimax-DQN uses linear programming to select the optimal action at each step, which is time consuming.
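To see where the per-action time costs in Table 5 come from, the sketch below contrasts greedy action selection, as used by a DQN-style agent, with the matrix-game selection attributed to Minimax-DQN, which solves a small linear program at every decision step. This is an illustrative comparison under the usual zero-sum matrix-game formulation, not the implementation of [44] or of the proposed algorithm.

import numpy as np
from scipy.optimize import linprog

def greedy_action(q_values):
    # DQN-style selection: a single argmax over our action values.
    return int(np.argmax(q_values))

def minimax_action(q_matrix):
    # q_matrix[i, j]: value of our action i against opponent action j.
    # Solve max_x min_j sum_i x_i * q_matrix[i, j] as a linear program and
    # play the most probable action of the resulting mixed strategy x.
    n, m = q_matrix.shape
    c = np.zeros(n + 1)
    c[-1] = -1.0                                          # maximize the game value v
    a_ub = np.hstack([-q_matrix.T, np.ones((m, 1))])      # v <= sum_i x_i Q[i, j] for all j
    b_ub = np.zeros(m)
    a_eq = np.hstack([np.ones((1, n)), np.zeros((1, 1))]) # sum_i x_i = 1
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * n + [(None, None)]
    res = linprog(c, A_ub=a_ub, b_ub=b_ub, A_eq=a_eq, b_eq=b_eq, bounds=bounds)
    return int(np.argmax(res.x[:n]))

Per decision, the greedy rule costs one network forward pass plus an argmax, whereas the minimax rule additionally solves a linear program over the joint action matrix, which is consistent with the higher time cost reported for Minimax-DQN in Table 5.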

4.4. Agent Evaluation and Behavior Analysis. Evaluating agent performance in air combat is not simply a matter of winning or losing. In addition to the winning and losing rates, two criteria are added to represent the success level: the average time to win (ATW) and the average disadvantage time (ADT).

Table 2: League results (each red agent vs. each blue agent; 400 games per pairing, counted as wins/draws/losses for the red agent).

                 π^b_0  π^b_1  π^b_2  π^b_3  π^b_4  π^b_5  π^b_6  π^b_7  π^b_8  π^b_9  π^b_10
π^r_1   Win       393     17     83     60    137     75     96    126    119    133    123
        Draw        7    103    213    198     95    128    131    117    113    113    114
        Loss        0    280    104    142    168    197    173    157    168    154    163
π^r_2   Win       384    267      4     82     78     94     88    135    142    161    152
        Draw       16    120    128    178    143    137    157     76    100     87     94
        Loss        0     13    268    140    179    169    155    189    158    152    154
π^r_3   Win       387    185    253      5     82     97    103    123    134    154    131
        Draw       11    104    131    115    194    148    145    132     84    102     97
        Loss        2    111     16    280    124    155    152    145    182    144    172
π^r_4   Win       381    172    182    237     31     92    137    154    149    167    158
        Draw       19    123    102    117    110    151    101     97    144     92    104
        Loss        0    105    116     46    259    157    162    149    107    141    138
π^r_5   Win       382    164    172    187    221     29     95    137    168    154    162
        Draw       18    122    113    142    139    144    118    103    121    143    113
        Loss        0    114    115     71     40    227    187    160    111    103    125
π^r_6   Win       380    162    177    193    197    217     52    106    134    152    143
        Draw       19    105     98    132    101    134    139    194    176    131    144
        Loss        1    133    125     75    102     49    209    100     90    117    113
π^r_7   Win       383    169    169    180    165    187    210     56     82    122    133
        Draw       14     98    111    105    123    134    133    137    180    132    156
        Loss        3    133    120    115    112     79     57    207    138    144    111
π^r_8   Win       383    154    157    172    168    177    182    213     36    107    114
        Draw       16    123    109    114    132    102     97    134    141    124    138
        Loss        1    123    134    114    100    121    121     53    223    169    148
π^r_9   Win       387    140    162    157    154    162    179    182    215     37    102
        Draw       11    133     97    105    127    131     97    104    149    125    147
        Loss        2    127    141    138    119    107    124    114     36    238    151
π^r_10  Win       379    155    146    147    143    159    167    169    170    219     42
        Draw       10    102     84     92    105     97    131    104     85    140    131
        Loss       11    143    170    161    152    144    102    127    145     41    227


ATW is measured as the average elapsed time required to maneuver to the advantage position in the scenarios where our aircraft wins; a smaller ATW is better than a larger one. ADT is the average accumulated time during which the advantage function value of our aircraft is less than that of the opponent, and it serves as a criterion for evaluating the risk of exposure to the adversary's weapons.
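A sketch of how these two criteria can be computed from logged test episodes is given below; the episode record layout (a result flag plus per-step advantage values of both sides, sampled every training time step) is an assumed format for illustration.

def evaluate_agent(episodes, step_size=0.5):
    # ATW: average elapsed time of the winning episodes.
    # ADT: average accumulated time during which our advantage value is
    #      below the opponent's, over all episodes.
    win_times, disadvantage_times = [], []
    for ep in episodes:
        if ep["result"] == "win":
            win_times.append(len(ep["red_advantage"]) * step_size)
        dis = sum(step_size
                  for r, b in zip(ep["red_advantage"], ep["blue_advantage"])
                  if r < b)
        disadvantage_times.append(dis)
    atw = sum(win_times) / len(win_times) if win_times else float("nan")
    adt = sum(disadvantage_times) / len(disadvantage_times)
    return atw, adt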

A thousand air combat scenarios are generated in the simulation software, and the opponent in each scenario is randomly chosen from the ten blue agents. Each well-trained red agent performs these confrontations, and the results are shown in Figure 10. Agent number i denotes the policy trained after i periods. From π^r_1 to π^r_5, the performance of the agents improves gradually. After π^r_5, the performance stabilizes, and each agent has different characteristics. Comparing π^r_5 to π^r_10, the winning and losing rates are almost the same as those in Table 3. The winning rates of these six agents are similar, with the highest, π^r_5, winning 6% more than the lowest, π^r_10, as shown in Figure 10(a). However, π^r_6 is the agent with the lowest losing rate, more than 10% lower than that of the other agents, as shown in Figure 10(b). For ATW, π^r_7 is below 100 s, π^r_6 is 108.7 s, and the other agents are above 110 s, as shown in Figure 10(c). For ADT, π^r_5, π^r_6, and π^r_8 are below 120 s, π^r_7 is above 130 s, and π^r_9 and π^r_10 are between 120 and 130 s, as shown in Figure 10(d).

In summary, as the number of iterations increases, the red queen effect appears clearly. The performance of π^r_8, π^r_9, and π^r_10 is no better than that of π^r_6 and π^r_7. Although the winning rate of π^r_7 is not high and it spends more time at a disadvantage, it can always win quickly; it is an aggressive agent and can be used in scenarios where our aircraft needs to beat the opponent quickly. The winning rate of π^r_6 is similar to that of the other agents, while its losing rate is the lowest; its ATW is second only to that of π^r_7, and its ADT is the lowest. In most cases, it can be selected to achieve more victories while keeping itself safe. π^r_6 is therefore an agent with good comprehensive performance, which shows that the selection method of the league system is effective.

For the four initial scenarios, air combat simulations using π^r_6 are selected for behavior analysis, as shown in Figure 11. In the offensive scenario, the opponent tried to escape by turning in the opposite direction. Our aircraft used a lag pursuit tactic, which is simple but effective: at the 1st second, our aircraft did not keep up with the opponent but chose to fly straight; at the 9th second, it gradually turned toward the opponent, established and maintained the advantage, and finally won, as shown in Figure 11(a).

In the defensive scenario, our aircraft was at a disadvantageous position, and the opponent was trying to lock onto us with a lag pursuit tactic, as shown in Figure 11(b).

Table 3: League table.

Rank  Agent (strategy)  Win    Draw   Loss   Score
1     π^r_6             1913   1373   1114   7112
2     π^r_7             1858   1323   1219   6897
3     π^r_5             1871   1276   1253   6889
4     π^r_9             1877   1226   1297   6857
5     π^r_8             1863   1230   1307   6819
6     π^r_10            1896   1081   1423   6769
7     π^r_4             1860   1160   1380   6740
8     π^r_3             1654   1263   1483   6225
9     π^r_2             1587   1236   1577   5997
10    π^r_1             1362   1332   1706   5418

Table 4: Algorithm comparison (100 games per initial situation).

Initial situation   Outcome   vs. ADP [20]   vs. Minimax-DQN [44]
Offensive           Win            57                66
                    Draw           28                21
                    Loss           15                13
Defensive           Win            19                20
                    Draw           31                42
                    Loss           50                38
Head-on             Win            43                46
                    Draw           16                26
                    Loss           41                28
Neutral             Win            32                43
                    Draw           32                22
                    Loss           36                35

Table 5: Time cost to generate an action.

Algorithm                 Time cost (ms)
The proposed algorithm         138
ADP                            142
Minimax-DQN                    786


At the 30th second, our aircraft adjusted its roll angle to the right to avoid being locked on by the opponent, as shown in Figure 11(c). After that, it continued to adjust its roll angle, and by the time the opponent noticed that our strategy had changed, our aircraft had already flown out of the sector. In situations where it is difficult to win, the agent uses a safe maneuver strategy to obtain a draw.

[Figure 10: Agent evaluation over the 1000 randomized scenarios, for agent numbers 1–10. (a) Number of wins. (b) Number of losses. (c) Average time to win. (d) Average disadvantage time.]

[Figure 11: Trajectories under four typical initial situations. (a) Offensive. (b) Defensive (0–21 s). (c) Defensive (21–43 s). (d) Head-on (0–50 s). (e) Head-on (50–100 s). (f) Head-on (100–154 s). (g) Neutral (0–40 s). (h) Neutral (40–74 s).]


In the head-on scenario, both sides adopted the same maneuver strategy in the first 50 s, that is, the maximum-overload turn while waiting for opportunities, as shown in Figure 11(d). The crucial decision was made at the 50th second: the opponent kept hovering and waiting for an opportunity, while our aircraft stopped hovering and reduced its turn rate to fly towards the opponent. At the 90th second, the aircraft reached an advantage situation over its opponent, as shown in Figure 11(e). In the final stage, our aircraft adopted the lead pursuit tactic to establish and maintain the advantage and won, as shown in Figure 11(f).

In the neutral scenario, our initial strategy was to turn away from the opponent to look for opportunities. The opponent's strategy was lag pursuit, and it successfully reached the rear of our aircraft at the 31st second, as shown in Figure 11(g). However, after the 40th second, our aircraft made a wise decision to reduce its roll angle, thereby increasing the turning radius. At the 60th second, the disadvantaged situation was terminated, as shown in Figure 11(h). After that, our aircraft increased its roll angle while the opponent was in a maximum-overload right turn, which made it unable to get rid of our aircraft, and it finally lost. It can be seen that, trained by alternate freeze games using the DQN algorithm and selected through the league system, the agent can learn tactics such as lead pursuit, lag pursuit, and hovering, and can use them in air combat.

5. Conclusion

In this paper, middleware connecting air combat simulation software and the reinforcement learning agent is developed. It offers researchers a way to design middleware that turns existing software into a reinforcement learning environment, which can expand the application field of reinforcement learning. Maneuver guidance agents with reward shaping are designed. Through training in this environment, an agent can guide an aircraft to fight against its opponent in air combat and reach an advantage situation in most scenarios. An alternate freeze game algorithm is proposed and combined with RL; it can be used in nonstationary situations where the other players' strategies are variable. Through the league system, the agent with improved performance after iterative training is selected. The league results show that the strategy quality of the selected agent is better than that of the other agents and that the red queen effect is avoided. Agents can learn some typical angle tactics and behaviors in the horizontal plane and perform these tactics by guiding an aircraft to maneuver in one-on-one air combat.

In future work, the problem will be extended to 3D maneuvering with less restrictive vehicle dynamics. The 3D formulation will lead to a larger state space and more complex actions, and the learning mechanism will be improved to deal with them. Beyond the extension to 3D maneuvering, a 2-vs-2 air combat scenario will be established, in which there are collaborators as well as adversaries. One option is to use a centralized controller, which turns the problem into a single-agent DRL problem that outputs the joint action of both aircraft to execute at each time step. The other option is to adopt a decentralized system in which each agent makes a decision for itself and may cooperate to achieve a common goal.

In theory, the convergence of the alternate freeze game algorithm still needs to be analyzed. Furthermore, in order to improve the diversity of potential opponents in a more complex air combat scenario, the league system will be refined.

Nomenclature

(x, y, z): 3-dimensional coordinates of an aircraft
v: Speed
θ: Flight path angle
ψ: Heading angle
β: Angle of attack
ϕ: Bank angle
m: Mass of an aircraft
g: Acceleration due to gravity
T: Thrust force
L: Lift force
D: Drag force
ψ̇: Turn rate
ϕ̇: Roll rate
d_t: Distance between the aircraft and the target at time t
d_min: Minimum distance of the advantage position
d_max: Maximum distance of the advantage position
μ_t: Deviation angle at time t
μ_max: Maximum deviation angle of the advantage position
η_t: Aspect angle at time t
η_max: Maximum aspect angle of the advantage position
S: State space
S_t: State vector at time t
s: All states in the state space
A: Action space
A_t: Action at time t
a: All actions in the action space
π: Policy, a mapping S → A
R(S_t, A_t, S_{t+1}): Reward function
R_t: Scalar reward received at time t
V(s): Value function
Q(s, a): Action-value function
Q*(s, a): Optimal action-value function
Q(S_t, A_t): Estimated action-value function at time t
γ: Discount factor
α: Learning rate
T: Maximum simulation time in each episode
Δt: Training time step size
δt: Simulation time step size
F(S_t, A_t, S_{t+1}): Shaping reward function
Φ(S_t): A real-valued function on states
T(S_t): Termination reward function
D(S_t): Distance reward function
O(S_t): Orientation reward function
K: Number of training periods


Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grants 91338107 and U1836103 and by the Development Program of Sichuan, China, under Grants 2017GZDZX0002, 18ZDYF3867, and 19ZDZX0024.

Supplementary Materials

Aircraft dynamics: the derivation of the kinematic and dynamic equations of the aircraft. Code: a simplified version of the environment used in this paper. It does not include the commercial software and middleware, but its aircraft model and other functions are the same as those proposed in this paper. This environment and the RL agent are packaged as supplementary material; through this material, the alternate freeze game DQN algorithm proposed in this paper can be reproduced. (Supplementary Materials)

References

[1] L. R. Shaw, "Basic fighter maneuvers," in Fighter Combat: Tactics and Maneuvering, Naval Institute Press, Annapolis, MD, USA, 1st edition, 1985.
[2] A. Toubman, J. J. Roessingh, P. Spronck, A. Plaat, and J. Herik, "Dynamic scripting with team coordination in air combat simulation," Lecture Notes in Computer Science, Kaohsiung, Taiwan, 2014.
[3] A. Toubman, J. J. Roessingh, P. Spronck, A. Plaat, and J. Herik, "Centralized versus decentralized team coordination using dynamic scripting," in Proceedings of the European Modeling and Simulation Symposium, pp. 1–7, Porto, Portugal, 2014.
[4] A. Toubman, J. J. Roessingh, P. Spronck, A. Plaat, and J. Herik, "Rewarding air combat behavior in training simulations," in Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, pp. 1397–1402, Kowloon, China, 2015.
[5] A. Toubman, J. J. Roessingh, P. Spronck, A. Plaat, and J. Herik, "Transfer learning of air combat behavior," in Proceedings of the IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 226–231, Miami, FL, USA, 2015.
[6] D. I. You and D. H. Shim, "Design of an aerial combat guidance law using virtual pursuit point concept," Proceedings of the Institution of Mechanical Engineers, vol. 229, no. 5, pp. 1–22, 2014.
[7] H. Shin, J. Lee, and D. H. Shim, "Design of a virtual fighter pilot and simulation environment for unmanned combat aerial vehicles," in Proceedings of the Advances in Flight Control Systems, pp. 1–21, Grapevine, TX, USA, 2017.
[8] J. M. Eklund, J. Sprinkle, and S. S. Sastry, "Implementing and testing a nonlinear model predictive tracking controller for aerial pursuit/evasion games on a fixed wing aircraft," in Proceedings of the Control System Applications, pp. 1509–1514, Portland, OR, USA, 2005.
[9] J. M. Eklund, J. Sprinkle, and S. S. Sastry, "Switched and symmetric pursuit/evasion games using online model predictive control with application to autonomous aircraft," IEEE Transactions on Control Systems Technology, vol. 20, no. 3, pp. 604–620, 2012.
[10] F. Austin, G. Carbone, M. Falco, H. Hinz, and M. Lewis, Automated Maneuvering Decisions for Air-to-Air Combat, Grumman Corporate Research Center, Bethpage, NY, USA, 1987.
[11] F. Austin, G. Carbone, M. Falco, H. Hinz, and M. Lewis, "Game theory for automated maneuvering during air-to-air combat," Journal of Guidance, Control, and Dynamics, vol. 13, no. 6, pp. 1143–1149, 1990.
[12] Y. Ma, G. Wang, X. Hu, H. Luo, and X. Lei, "Cooperative occupancy decision making of multi-UAV in beyond-visual-range air combat: a game theory approach," IEEE Access, vol. 323, 2019.
[13] J. P. Hespanha, M. Prandini, and S. Sastry, "Probabilistic pursuit-evasion games: a one-step Nash approach," in Proceedings of the 36th IEEE Conference on Decision and Control, Sydney, NSW, Australia, 2000.
[14] A. Xu, Y. Kou, L. Yu, B. Xu, and Y. Lv, "Engagement maneuvering strategy of air combat based on fuzzy Markov game theory," in Proceedings of the IEEE International Conference on Computer, pp. 126–129, Wuhan, China, 2011.
[15] K. Horic and B. A. Conway, "Optimal fighter pursuit-evasion maneuvers found via two-sided optimization," Journal of Guidance, Control, and Dynamics, vol. 29, no. 1, pp. 105–112, 2006.
[16] K. Virtanen, T. Raivio, and R. P. Hamalainen, "Modeling pilot's sequential maneuvering decisions by a multistage influence diagram," Journal of Guidance, Control, and Dynamics, vol. 27, no. 4, pp. 665–677, 2004.
[17] K. Virtanen, J. Karelahti, and T. Raivio, "Modeling air combat by a moving horizon influence diagram game," Journal of Guidance, Control, and Dynamics, vol. 29, no. 5, pp. 1080–1091, 2006.
[18] Q. Pan, D. Zhou, J. Huang et al., "Maneuver decision for cooperative close-range air combat based on state predicted influence diagram," in Proceedings of the Information and Automation for Sustainability, pp. 726–730, Macau, China, 2017.
[19] R. E. Smith, B. A. Dike, R. K. Mehra, B. Ravichandran, and A. El-Fallah, "Classifier systems in combat: two-sided learning of maneuvers for advanced fighter aircraft," Computer Methods in Applied Mechanics and Engineering, vol. 186, no. 2–4, pp. 421–437, 2000.
[20] J. S. McGrew, J. P. How, B. Williams, and N. Roy, "Air-combat strategy using approximate dynamic programming," Journal of Guidance, Control, and Dynamics, vol. 33, no. 5, pp. 1641–1654, 2010.
[21] Y. Ma, X. Ma, and X. Song, "A case study on air combat decision using approximated dynamic programming," Mathematical Problems in Engineering, vol. 2014, 2014.
[22] I. Bisio, C. Garibotto, F. Lavagetto, A. Sciarrone, and S. Zappatore, "Blind detection: advanced techniques for WiFi-based drone surveillance," IEEE Transactions on Vehicular Technology, vol. 68, no. 1, pp. 938–946, 2019.
[23] Y. Li, "Deep reinforcement learning," 2018.
[24] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, London, UK, 2nd edition, 2018.
[25] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[26] V. Mnih, K. Kavukcuoglu, D. Silver et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[27] Z. Kavukcuoglu, R. Liu, Z. Meng, Y. Zhang, Y. Yu, and T. Lu, "On reinforcement learning for full-length game of StarCraft," 2018.
[28] D. Silver and C. J. Maddison, "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[29] D. Hubert and I. Antonoglou, "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play," Science, vol. 362, no. 6419, pp. 1140–1144, 2018.
[30] J. Schrittwieser, J. A. Bagnell, and J. Peters, "Reinforcement learning in robotics: a survey," The International Journal of Robotics Research, vol. 32, no. 11, pp. 1238–1274, 2013.
[31] A. Waldock, C. Greatwood, F. Salama, and T. Richardson, "Learning to perform a perched landing on the ground using deep reinforcement learning," Journal of Intelligent and Robotic Systems, vol. 92, no. 3-4, pp. 685–704, 2018.
[32] R. R. Alejandro, S. Carlos, B. Hriday, P. Paloma, and P. Campoy, "A deep reinforcement learning strategy for UAV autonomous landing on a moving platform," Journal of Intelligent and Robotic Systems, vol. 93, no. 1-2, pp. 351–366, 2019.
[33] W. Lou and X. Guo, "Adaptive trajectory tracking control using reinforcement learning for quadrotor," Journal of Intelligent and Robotic Systems, vol. 13, no. 1, pp. 1–10, 2016.
[34] C. Dunn, J. Valasek, and K. Kirkpatrick, "Unmanned air system search and localization guidance using reinforcement learning," in Proceedings of the AIAA Infotech Aerospace 2014 Conference, pp. 1–8, Garden Grove, CA, USA, 2014.
[35] S. You, M. Diao, and L. Gao, "Deep reinforcement learning for target searching in cognitive electronic warfare," IEEE Access, vol. 7, pp. 37432–37447, 2019.
[36] P. Luo, J. Xie, and W. Che, "Q-learning based air combat target assignment algorithm," in Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, pp. 779–783, Budapest, Hungary, 2016.
[37] M. Grzes, "Reward shaping in episodic reinforcement learning," in Proceedings of the International Conference on Autonomous Agents and Multiagent Systems, pp. 565–573, São Paulo, Brazil, 2017.
[38] K. Tummer and A. Agogino, "Agent reward shaping for alleviating traffic congestion," in Proceedings of the International Conference on Autonomous Agents and Multiagent Systems, pp. 1–7, Hakodate, Japan, 2006.
[39] L. L. B. V. Cruciol, A. C. de Arruda, L. Weigang, L. Li, and A. M. F. Crespo, "Reward functions for learning to control in air traffic flow management," Transportation Research Part C: Emerging Technologies, vol. 35, pp. 141–155, 2013.
[40] P. Hernandez-Leal, M. Kaisers, T. Baarslag, and E. M. de Cote, "A survey of learning in multiagent environments: dealing with non-stationarity," 2019.
[41] M. L. Littman, "Markov games as a framework for multi-agent reinforcement learning," in Proceedings of the International Conference on Machine Learning, pp. 157–163, New Brunswick, NJ, USA, 1994.
[42] M. L. Littman, "Friend-or-foe Q-learning in general-sum games," in Proceedings of the International Conference on Machine Learning, pp. 322–328, Williamstown, MA, USA, 2001.
[43] J. Hu and M. P. Wellman, "Nash Q-learning for general-sum stochastic games," Journal of Machine Learning Research, vol. 4, pp. 1039–1069, 2003.
[44] J. Fan, Z. Wang, Y. Xie, and Z. Yang, "A theoretical analysis of deep Q-learning," 2020.
[45] R. E. Smith, B. A. Dike, B. Ravichandran, A. El-Fallah, and R. Mehra, "Two-sided genetics-based learning to discover novel fighter combat maneuvers," in Lecture Notes in Computer Science, pp. 233–242, Springer, Berlin, Heidelberg, 2001.
[46] A. Ng, D. Harada, and S. Russell, "Policy invariance under reward transformations: theory and application to reward shaping," in Proceedings of the International Conference on Machine Learning, pp. 278–287, Bled, Slovenia, 1999.
[47] T. P. Lillicrap, J. J. Hunt, A. Pritzel et al., "Continuous control with deep reinforcement learning," in Proceedings of the International Conference on Learning Representations, San Juan, Puerto Rico, 2016.
[48] V. Mnih, A. P. Badia, M. Mirza et al., "Asynchronous methods for deep reinforcement learning," in Proceedings of the International Conference on Machine Learning, pp. 1–19, New York, NY, USA, 2016.
[49] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," 2017.
[50] Commercial Air Combat Simulation Software FG_SimStudio, http://www.eyextent.com/producttail117.
[51] M. Wiering and M. van Otterlo, Reinforcement Learning: State-of-the-Art, Springer, Heidelberg, Germany, 2014.
[52] D. P. Kingma and J. Ba, "Adam: a method for stochastic optimization," in Proceedings of the International Conference on Learning Representations, pp. 1–15, San Diego, CA, USA, 2015.


Page 3: ImprovingManeuverStrategyinAirCombatbyAlternateFreeze ...downloads.hindawi.com/journals/mpe/2020/7180639.pdf · 1CollegeofComputerScience,SichuanUniversity,Chengdu,Sichuan610065,China

calculations e Minimax-DQN [44] algorithm whichcombines DQN and Minimax-Q learning [41] for the two-player zero-sumMarkov game is proposed in a recent paperAlthough it can be applied to complex games it still can onlydeal with rational opponents and needs to use linear pro-gramming to calculate the Q value

e main contributions of this paper can be summarizedas follows

(i) Middleware is designed and developed whichmakes specific software for air combat simulationused as an RL environment An agent can be trainedin this environment and then used to guide anaircraft to reach the position of advantage in one-on-one air combat

(ii) Guidance agents for air combat pursuit maneuver ofboth sides are designede reward shaping methodis adopted to improve the convergence speed oftraining and the performance of the maneuverguidance strategies An agent is trained in the en-vironment where its opponent is also an adaptiveagent so that the well-trained agent has the ability tofight against an opponent with the intelligentstrategy

(iii) An alternate freeze game framework is proposed todeal with nonstationarity which can be used tosolve the problem of variable opponent strategies inRL e league system is adopted to select an agentwith the highest performance to avoid the red queeneffect [45]

is paper is organized as follows Section 2 intro-duces the maneuver guidance strategy problem of one-on-one air combat maneuver under consideration andthe DRL-based model as well as training environment tosolve it e training optimization includes rewardshaping and alternate freeze games which are presentedin Section 3 e simulation and results of the proposedapproach are given in Section 4 e final section con-cludes the paper

2 Problem Formulation

In this section the problem of maneuver guidance in aircombat is introducede training environment is designedand the DRL-based model is set up to solve it

21ProblemStatement eproblem solved in this paper is aone-on-one air combat pursuit-evasion maneuver probleme red aircraft is regarded as our side and the blue one as anenemy e objective of the guidance agent is to learn amaneuver strategy (policy) to guide the aircraft from itscurrent position to a position of advantage and maintain theadvantage At the same time it should guide the aircraft notto enter the advantage position of its opponent

e dynamics of the aircraft is described by a point-massmodel [17] and is given by the following differentialequations

_x v cos θ cosψ

_y v cosθ sinψ

_z vsinθ

θ

1

mv((L + Tsinβ) times cosϕ minus mgcosθ)

_ψ 1

mvcosθ(L + Tsinβ) times sinϕ

_v 1m

(Tcosβ minus D) minus gsinθ

⎧⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎨

⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎩

(1)

where (x y z) are 3-dimensional coordinates of the aircrafte terms v θ and ψ are speed flight path angle and theheading angle of the aircraft respectively e mass of theaircraft and the acceleration due to the gravity are denotedby m and g respectively e three forces are lift force Ldrag force D and thrust force T e remaining two vari-ables are the angle of attack β and the bank angle ϕ as shownin Figure 1

In this paper the aircraft is assumed to fly at a fixedvelocity in the horizontal plane and the assumption can bewritten as follows

β 0

θ 0

T D

L mg

cosϕ

⎧⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎨

⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎩

(2)

e equations of motion for the aircraft are simplified asfollows

xt+δt xt + vδtcos ψt+δt( 1113857

yt+δt yt + vδtsin ψt+δt( 1113857

ψt+δt ψt + _ψt+δtδt

_ψt+δt g

vtan ϕt+δt( 1113857

ϕt+δt ϕt + At_ϕtδt

⎧⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎨

⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎩

(3)

where v _ϕt ϕt _ψt and ψt are the speed roll rate bank angleturn rate and heading angle of the aircraft at time t re-spectively e term At is the action generated by theguidance agent and the control actions available to theaircraft are a isin roll-left maintain-bank-angle roll-righte simulation time step is denoted by δt

Mathematical Problems in Engineering 3

In aircraft maneuver strategy design maneuvers in a fixedplane are usually used to measure its performance Figure 2shows a one-on-one air combat scenario [20] in which eachaircraft flies at a fixed velocity in the X-Y plane under a ma-neuver strategy at time t e position of advantage is that oneaircraft gains opportunities to fire at its opponent It is defined as

dmin ltdt ltdmax

μrt

11138681113868111386811138681113868111386811138681113868lt μmax

ηbt

11138681113868111386811138681113868111386811138681113868lt ηmax

⎧⎪⎪⎪⎨

⎪⎪⎪⎩

(4)

where superscripts r and b refer to red and blue respectivelye term dt is the distance between the aircraft and the target attime t μr

t is the deviation angle which is defined as the angledifference between the line-of-sight vector and the heading ofour aircraft and ηb

t is the aspect angle which is between thelongitudinal symmetry axis (to the tail direction) of the targetplane and the connecting line from the target planersquos tail toattacking the planersquos nosee terms dmin dmax μmax and μminare thresholds where the subscriptsmax andmin represent theupper and lower bounds of the corresponding variables re-spectively which are determined by the combat mission andthe performance of the aircraft

e expression dmin ltdt ltdmax makes sure that theopponent is within the attacking range of the air-to-airweapon of the aircraft e expression |μr

t |lt μmax refers to anarea in which the blue aircraft is difficult to escape withsensor locking e expression |ηb

t |lt ηmax defines an areawhere the killing probability is high when attacking from therear of the blue aircraft

22 DRL-Based Model An RL agent interacts with an envi-ronment over time and aims tomaximize the long-term reward[24] At each training time t the agent receives a state St in astate space S and generates an actionAt from an action spaceA

following a policy π S times A⟶ R en the agent receives ascalar reward Rt and transitions to the next state St+1 accordingto the environment dynamics as shown in Figure 3

A value function V(s) is defined to evaluate the aircombat advantages of each state which is the expectation ofdiscounted cumulated rewards on all states following time t

V(s) Eπ Rt + cRt+1 + c2Rt+2 + St s

11138681113868111386811138681113966 1113967 (5)

where c isin [0 1] is the discount factor and policy π(s a) is amapping from the state space to the action space

e agent guides an aircraft to maneuver by changing itsbank angleϕ using roll rate _ϕe action space of DRL is minus1 01 and At can take a value from three options at time t whichmeans roll-left maintain-bank-angle and roll-right respec-tively e position of the aircraft is updated according to (3)e action-value functionQ(s a) can be used which is definedas the value of taking action a in state s under a policy π

Q(s a) Eπ 1113944

infin

k0c

kRt+k St s

1113868111386811138681113868 At a⎧⎨

⎫⎬

⎭ (6)

e Bellman optimality equation is

Qlowast(s a) E Rt + cmax

aprimeQlowast

St+1 aprime( 1113857St s At a1113896 1113897

(7)

Any policy that is greedy with respect to the optimalevaluation function Qlowast(s a) is an optimal policy ActuallyQlowast(s a) can be obtained through iterations using temporal-difference learning and its updated formula is defined as

dmindmax

vb

vr

(xbt yb

t)

(xrt yr

t)

ηbt

micrort

Goal zoneRed aircra (our side)Blue aircra (enemy)

Figure 2 Aircraft maneuver guidance problem in air combat

L

D

T

v

β

θ ψ

ϕ

Figure 1 Aircraft dynamics parameters

Air combatenvironment

Air combatguidance agent

At = Changing bank angle

St = aircra status opponent status

Rtndash1 = R(Stndash1 Atndash1 St)

St+1

Rt

Figure 3 Reinforcement learning in air combat

4 Mathematical Problems in Engineering

Q St At( 1113857⟵Q St At( 1113857 + α Rt + cmaxa

Q St+1 a( 1113857 minus Q St At( 11138571113874 1113875

(8)

where Q(St At) is the estimated action-value function and αis the learning rate

In air combat maneuvering applications the reward issparse and only at the end of a game (the terminal state)according to the results of winning losing or drawing areward value is given is makes learning algorithms to usedelayed feedback to determine the long-term consequencesof their actions in all nonterminal states which is timeconsuming Reward shaping is proposed by Ng et al [46] tointroduce additional rewards into the learning processwhich will guide the algorithm towards learning a goodpolicy faster e shaping reward function F(St At St+1) hasthe form

F St At St+1( 1113857 cΦ St+1( 1113857 minusΦ St( 1113857 (9)

where Φ(St) is a real-valued function over states It can beproven that the final policy after using reward shaping isequivalent to the final policy without it [46] Rt in previousequations can be replaced by Rt + F(St At St+1)

Taking the red agent as an example the state space of oneguidance agent can be described with a vector

St xrt y

rt ψ

rt ϕ

rt x

bt y

bt ψb

t1113966 1113967 (10)

e state that an agent received in one training stepincludes the position heading angle and bank angle of itselfand the position heading angle and bank angle of its op-ponent which is a 7-dimensional infinite vector ereforethe function approximate method should be used to com-bine features of state and learned weights DRL is a solutionto address this problem Great achievements have beenmadeusing the advanced DRL algorithms such as DQN [26]DDPG [47] A3C [48] and PPO [49] Since the action ofaircraft guidance is discrete DQN algorithm can be usede action-value function can be estimated with functionapproximation such as 1113954Q(s aw) asymp Qπ(s a) where w arethe weights of a neural network

DQN algorithm is executed according to the trainingtime step However there is not only training time step butalso simulation time step Because of the inconsistencyDQN algorithm needs to be improved to train a guidanceagent e pseudo-code of DQN for aircraft maneuverguidance training in air combat is shown in Algorithm 1 Ateach training step the agent receives information about theaircraft and its opponent and then generates an action andsends it to the environment After receiving the action theaircraft in the environment is maneuvered according to thisaction in each simulation step

23 Middleware Design An RL agent improves its abilitythrough continuous interaction with the environmentHowever there is no maneuver guidance environment forair combat and some combat simulation software can onlyexecute agents not train them In this paper middleware isdesigned and developed based on commercial air combat

simulation software [50] coded in C++ e functions ofmiddleware include interface between software and RLagents coordination of the simulation time step and the RLtraining time step and reward calculation By using it an RLenvironment for aircraft maneuver guidance training in aircombat is performed e guidance agents of both aircraftshave the same structure Figure 4 shows the structure and thetraining process of the red side

In actual or simulated air combat an aircraft perceivesthe situation through multisensors By analyzing the situ-ation after information fusion maneuver decisions are madeby the pilot or the auxiliary decision-making system In thissystem situation perception and decision-making are thesame as in actual air combat An assumption is made that asensor with full situation perception ability is used throughwhich an aircraft can obtain the position and heading angleof its opponent

Maneuver guidance in air combat using RL is an episodictask and there is the notion of episodes of some length wherethe goal is to take the agent from a starting state to a goal state[51] First a flight level with fixed velocity and the one-on-oneair combat scenario is setup in simulation software en thesituation information of both aircrafts is randomly initializedincluding their positions heading angles bank angles and rollrates which constitute the starting state of each agente goalstates are achieved when one aircraft reaches the position ofadvantage and maintains it one aircraft flies out of the sectoror the simulation time is out

Middleware can be used in the training and applicationphases of agents e training phase is the process of makingthe agent from zero to have the maneuver guidance abilitywhich includes training episodes and testing episodes In thetraining episode the agent is trained by the DRL algorithmand its guidance strategy is continuously improved After anumber of training episodes some testing episodes areperformed to verify the ability of the agent in the currentstage in which the strategy does not change e applicationphase is the process of using the well-trained agent tomaneuver guidance in air combat simulation

A training episode is composed of training steps In eachtraining step first the information recognized by the air-borne sensor is sent to the guidance agent through mid-dleware as a tuple of state coded in Python see Steps 1 2 and3 in Figure 4 en the agent generates an action using itsneural networks according to the exploration strategy asshown in Steps 4 and 5 Simulation software receives aguidance instruction transformed by middleware and sendsthe next situation information to middleware see Steps 6and 7 Next middleware transforms the situation infor-mation into state information and calculates reward andsends them to the agent as shown in Steps 2 3 and 8Because agent training is an iterative process all the stepsexcept Step 1 are executed repeatedly in an episode Last thetuple of the current state action reward and the next statein each step is stored in the replay memory for the training ofthe guidance agent see Steps 9 and 10 In each simulationstep the aircraft maneuvers according to the instruction andrecognizes the situation according to its sensorconfiguration

Mathematical Problems in Engineering 5

3 Training Optimization

31 Reward Shaping Two problems need to be solved whenusing the DRL method to train a maneuver guidance agentin air combat One is that the only criterion to evaluate theguidance is the successful arrival to the position of advan-tage which is a sparse reward problem leading to slowconvergence of training Secondly in realistic situations thetime to arrive at the position of advantage and the quality ofthe trajectory need to be considered In this paper a rewardshaping method is proposed to improve the training speedand the performance of the maneuver guidance strategy

Reward shaping [46] is usually used to modify the re-ward function to facilitate learning while maintaining op-timal policy which is a manual endeavour In this studythere are four rules to follow in reward shaping

(i) Limited by the combat area and the maximumnumber of control times the aircraft is guided to theposition of advantage

(ii) e aircraft should be guided closer and closer to itsgoal zone

(iii) e aircraft should be guided away from the goalzone of its opponent

(1) Set parameters of the controlled aircraft and opponent(2) Set training time step size Δt and simulation time step size δt

(3) Initialize maximum number of training episode M and maximum simulation time T in each episode(4) Initialize replay memory D to capacity N

(5) Initialize action-value function Q and target action-value function Qminus with random weights(6) for episode 0 Mdo(7) Initialize state S0(8) for t 0 TΔtdo(9) With probability ϵ select a random action At

(10) Otherwise select At argmaxa

Q(St aw)

(11) for i 0 Δtδtdo(12) Execute action At in air combat simulation software(13) Obtain the positions of aircraft and target(14) if episode terminates then(15) break(16) end if(17) end for(18) Observe reward Rt and state St+1(19) Store transition [St At Rt St+1] in D

(20) if episode terminates then(21) break(22) end if(23) Sample random minibatch of transitions [Sj Aj Rj Sj+1] from D

(24) if episode terminates at step j + 1then(25) set Yj Rj

(26) else(27) set Yj Rj + max

aQminus (Sj+1 awminus )

(28) end if(29) Perform a gradient descent step on (Yj minus Q(Sj Ajwj))

2 with respect to the network parameters w(30) Every C steps reset Qminus Q

(31) end for(32) end for

ALGORITHM 1 DQN [26] for maneuver guidance agent training

1

7

2 6

93 8

4 5

10

Air combat simulation soware

Initialization Visualization

Situation recognitionAir combat

Aircra motion

Loop ∆tδt times

InstructionSituation

Middleware

ActionRewardState

State perception

Action generation

Blueaircra

guidanceagent

LearningTrain

[S0 A0 R0 S1][S1 A1 R1 S2]

[St At Rt St+1]

[STndash1 ATndash1 RTndash1 ST]

Replay memoryRed aircra guidance agent

Store

Figure 4 Structure and training process of the red side

6 Mathematical Problems in Engineering

(iv) e aircraft should be guided to its goal zone inshort time as much as possible

According to the above rules the reward function isdefined as

R St At St+1( 1113857 Rt + F St At St+1( 1113857 (11)

where Rt is the original scalar reward and F(St At St+1) isthe shaping reward function Rt is given by

Rt w1T St+1( 1113857 + w2 (12)

where T(St+1) is the termination reward function e termw1 is a coefficient and w2 is a penalty for time consumptionwhich is a constant

ere are four kinds of termination states the aircraftarrives at the position of advantage the opponent arrives atits position of advantage the aircraft moves out of thecombat area and the maximum number of control times hasbeen reached and each aircraft is still in the combat area andhas not reached its advantage situation Termination rewardis the reward obtained when the next state is terminationwhich is often used in the standard RL training It is definedas

T St+1( 1113857

c1 if win

c2 else if loss

c3 else if out of the sector

c4 else if no control times left

0 else

⎧⎪⎪⎪⎪⎪⎪⎪⎪⎨

⎪⎪⎪⎪⎪⎪⎪⎪⎩

(13)

Usually c1 is a positive value c2 is a negative value andc3 and c4 are nonpositive

e shaping reward function F(St At St+1) has the formas described in (9) e real-value function Φ(St) is anadvantage function of the state e larger the value of thisfunction the more advantageous the current situation isis function can provide additional information to help theagent to select action which is better than only using thetermination reward It is defined as

Φ St( 1113857 D St( 1113857O St( 1113857

100 (14)

where D(St) is the distance reward function and O(St) is theorientation reward function which are defined as

D St( 1113857 expminusdt minus dmax + dmin( 11138571113868111386811138681113868 2

1113868111386811138681113868

180degk1113888 1113889 (15)

O St( 1113857 1 minusμr

t

11138681113868111386811138681113868111386811138681113868 + ηb

t

11138681113868111386811138681113868111386811138681113868

180deg (16)

where k has units of metersdegrees and is used to adjust therelative effect of range and angle and a value of 10 iseffective

32 Alternate Freeze Game DQN for Maneuver GuidanceAgent Training Unlike board and RTS games the data ofhuman players in air combat games are very rare so

pretraining methods using supervised learning algorithmscannot be used We can only assume a strategy used by theopponent and then propose a strategy to defeat it We thenthink about what strategies the opponent will use to confrontour strategy and then optimize our strategy

In this paper alternate freeze learning is used for gamesto solve the nonstationarity problem Both aircrafts in a one-on-one air combat scenario use DRL-based agents to adapttheir maneuver strategies In each training period one agentis learning from scratch while the strategy of its opponentwhich was obtained from the previous training period isfrozen rough games the maneuver guidance perfor-mance of each side rises alternately and different agents withhigh level of maneuver decision-making ability are gener-ated e pseudo-code of alternate freeze game DQN isshown in Algorithm 2

An important effect that must be considered in thisapproach is the red queen effect When one aircraft uses aDRL agent and its opponent uses a static strategy theperformance of the aircraft is absolute with respect to itsopponent However when both aircrafts use DRL agents theperformance of each agent is only related to its currentopponent As a result the trained agent is only suitable for itslatest opponent but cannot deal with the previous ones

rough K training periods of alternate freeze learningK red agents and K + 1 blue agents are saved in the trainingprocess e league system is adopted in which those agentsare regarded as players By using a combination of strategiesfrom multiple opponents the opponentrsquos mixed strategybecomes smoother In different initial scenarios each agentguides the aircraft to confront the aircraft guided by eachopponent e goal is to select one of our strategies throughthe league so that the probability of winning is high and theprobability of losing is low According to the result of thecompetition the agent obtains different points e topagent in the league table is selected as the optimum one

4 Simulation and Results

In this section first random air combat scenarios are ini-tialized in simulation software for maneuver guidance agenttraining and testing Second agents of both sides with re-ward shaping are created and trained using the alternatefreeze game DQN algorithm ird taking the well-trainedagents as players the league system is adopted and the topagent in the league table is selected Last the selected agent isevaluated and the maneuver behavior of the aircraft isanalyzed

41 SimulationSetup e air combat simulation parametersare shown in Table 1 e Adam optimizer [52] is used forlearning the neural network parameters with a learning rateof 5 times 10minus 4 and a discount c 099 e replay memory sizeis 1 times 105 e initial value of ϵ in ϵ-greedy exploration is 1and the final value is 01 For each simulation the rewardshaping parameters c1 c2 c3 and c4 are set to 2 minus2 minus1 andminus1 respectively e terms w1 and w2 are set to 098 and001 respectively e training period K is 10

Mathematical Problems in Engineering 7

We evaluated three neural networks and three mini-batches Each neural network has 3 hidden layers and thenumber of units is 64 64 128 for the small neural network(NN) 128 128 256 for the medium one and 256 256 512for the large one e sizes of minibatch (MB) are 256 512and 1024 respectively ese neural networks and mini-batches are used in pairs to train red agents and play againstthe initial blue agent e simulation result is shown in

Figure 5 First a large minibatch cannot make the trainingsuccessful for small and medium neural networks and itstraining speed is slow for a large network Second using asmall neural network the average reward obtained will belower than that of the other two networks ird if the sameneural network is adopted the training speed using themedium minibatch is faster than using a small one Lastconsidering the computational efficiency the selected net-work has 3 hidden layers with 128 128 and 256 unitsrespectively e selected minibatch size is 512

In order to improve the generality of a single agentinitial positions are randomized in air combat simulation Inthe first iteration of the training process the initial strategy isformed by a randomly initialized neural network for the redagent and the conservative strategy of the maximum-overload turn is adopted by the blue one In the subsequenttraining the training agent is trained from scratch and itsopponent adopts the strategy obtained from the latesttraining

Air combat geometry can be divided into four categories(from red aircraftrsquos perspective) offensive defensive neu-tral and head-on [7] as shown in Figure 6(a) Defining thedeviation angle and aspect angle from minus180deg to 180degFigure 6(b) shows an advantage diagram where the distanceis 300m Smaller |μr| means that the heading or gun of thered aircraft has better aim at its opponent and smaller |ηb|

implies a higher possibility of the blue aircraft being shot orfacing a fatal situation For example in the offensive

(1) Set parameters of both aircrafts(2) Set simulation parameters(3) Set the number of training periods K and the condition for ending each training period Wthreshold(4) Set DRL parameters(5) Set the opponent initialization policy πblue0(6) for period 1 Kdo(7) for aircraft [red blue] do(8) if aircraft red then(9) Set the opponent policy π πblueperiodminus1(10) Initialize neural networks of red agent(11) while Winning ratelt Wthresholddo(12) Train agent using Algorithm 1(13) end while(14) Save the well-trained agent whose maneuver guidance policy is πredperiod(15) else(16) if period K + 1then(17) break(18) else(19) Set the opponent policy π πredperiod(20) Initialize neural networks of blue agent(21) while Winning ratelt Wthresholddo(22) Train agent using Algorithm 1(23) end while(24) Save the well-trained agent whose maneuver guidance policy is πblueperiod(25) end if(26) end if(27) end for(28) end for

ALGORITHM 2 Alternate freeze game DQN for maneuver guidance agent training in air combats

Table 1 Air combat simulation parameters

Red aircraft parametersVelocity 200msMaximum roll angle 75degRoll rate 40deg sInitial position Random position and heading angle

Blue aircraft parametersVelocity 200msMaximum roll angle 75degRoll rate 40deg sInitial position Map center northwardInitial policy (πb

0) Maximum overload turnSimulation parameters

Position of advantage dmax 500m dmin 100mμmax 60deg ηmax 30deg

Combat area 5 km times 5 kmSimulation time T 500 sSimulation time step δt 01 sTraining time step Δt 05 s

8 Mathematical Problems in Engineering

scenario the initial state value of the red agent is largebecause of small |μr| and |ηb| according to equations (14)and (16) erefore in the offensive initial scenario the winprobability of our side will be larger than that of the op-ponent while the defensive initial scenario is reversed Inneutral and head-one initial scenarios the initial state valuesof both sides are almost equal e performance of the well-trained agents is verified according to the classification offour initial situations in the league system

42 Simulation Results e policy naming convention is πbi

or πri which means a blue or red policy produced after i

periods respectively After every 100 training episodes thetest episodes run 100 times and the learning curves areshown in Figure 7 For the first period (πr

1 vs πb0) due to the

simple maneuver policy of the blue aircraft the red aircraftcan achieve almost 100 success rate through about 4000episodes of training as shown in Figure 7(a) At the be-ginning most of the games are draw because of the randominitialization of the red agent and the evasion strategy of itsopponent Another reason is that the aircraft often flies outof the airspace in the early stages of training Although theagent will get a penalty for this the air combat episode willstill get a draw During the training process the red aircraft islooking for winning strategies in the airspace and itswinning rate is constantly increasing However because of

its pursuit of offensive and neglect of defensive the winningrate of its opponent is also rising In the later stage oftraining the red aircraft gradually understands the oppo-nentrsquos strategy and it can achieve a very high winning rate

With the iterations in games both agents have learnt theintelligent maneuver strategy which can guide the aircraft inthe pursuit-evasion game Figure 7(b) shows a typical ex-ample that πb

5 is the trainer and πr5 is its opponent After

about 5000 episodes of training the blue aircraft can reducethe losing rate to a lower level Since the red one adopts anintelligent maneuver guidance strategy the blue agentcannot learn a winning strategy in a short time After about20000 episodes the agent gradually understands the op-ponentrsquos maneuver strategy and its winning rate keepsincreasing e final winning rate is stable between 50 and60 and the losing rate is below 10

e training process iteration of πr10 vs πb

9 is shown inFigure 7(c) Because the maneuver strategy of the blue aircraftis an intelligent strategy which has been trained iterativelytraining for the red agent ismuchmore difficult than that in thefirst iteration e well-trained agent can win more than 60and lose less than 10 in the game against πb

9It is hard to get a higher winning rate in every training

except for the scenario that the opponent is πb0 is is because

both aircrafts are homogeneous the initial situations arerandomized and the opponent in the training environment isintelligent In some situations as long as the opponentrsquosstrategy is intelligent enough the trained agent is almostimpossible to win Interestingly in some scenarios where theopponent is intelligent enough and the aircraft cannot defeat itno matter how the agent operates the agent will guide theaircraft out of the airspace to obtain a draw is gives us aninspiration that is in a scenario where you cannot win it maybe the best way to get out of the combat area

For the four initial air combat situations, the two aircraft are guided by the agents trained at each iteration stage. Typical flight trajectories are shown in Figure 8. Since π^b_0 is a static strategy, π^r_1 can easily win in the offensive, head-on, and neutral situations, as shown in Figures 8(a), 8(c), and 8(d). In the defensive situation, the red aircraft first turns away from its opponent, who has the advantage, and then looks for opportunities to establish an advantageous situation. There are no fierce rivalries in the process of combat, as shown in Figure 8(b).

Although π^r_5 is an intelligent strategy, the well-trained π^b_5 can still gain an advantage in combat, as shown in Figures 8(e)-8(h). In the offensive situation (from the red aircraft's perspective), the blue aircraft cannot easily shake off the red aircraft, but after a fierce confrontation it can establish an advantageous situation, as shown in Figure 8(e). Figure 8(f) shows the defensive situation, in which the blue aircraft keeps its advantage until it wins. In the other two scenarios, the blue aircraft also performs well, especially in the neutral initial situation, where it adopts delayed-turn tactics and successfully wins the engagement.

π^b_9 is an intelligent strategy, and the well-trained π^r_10 can achieve victory in more than half of the games and failure in less than 10%. However, unlike π^r_1 vs π^b_0, π^r_10 cannot win easily, especially in head-on situations, as shown in Figures 8(i)-8(l).

Figure 5: Evaluation of the neural network and minibatch (average reward versus training episodes for each combination of small/medium/large neural network and small/medium/large minibatch).


In the offensive situation, the blue aircraft tries to get rid of the red one by constantly adjusting its bank angle, but it is finally captured by the red one. In the defensive situation, the red aircraft cleverly adopts circling-back tactics and quickly captures its opponent. In the head-on situation, the red and blue aircraft alternately occupy the advantageous situation, and after a fierce maneuver confrontation, the red aircraft finally wins. In the neutral situation, the red one constantly adjusts its bank angle according to the opponent's position and wins.

For the above three pairs of strategies, the head-on initial situations are taken as examples to analyze the advantages of both sides in the game, as shown in Figure 9. For the π^r_1 vs π^b_0 scenario, because π^b_0 has no intelligent maneuver guidance ability, although both sides are equally matched in the first half of the engagement, the red aircraft continues to expand its advantage until it wins, as shown in Figure 9(a). When the blue agent learns an intelligent strategy, it can defeat the red one, as shown in Figure 9(b). For the π^r_10 vs π^b_9 scenario, the two sides alternately establish the advantageous position, and the red aircraft does not win easily, as shown in Figure 9(c). Combined with the trajectories shown in Figures 8(c), 8(g), and 8(k), this shows that the maneuver strategies of both sides have been improved by the alternate freeze game DQN algorithm. The approach has the benefit of discovering new maneuver tactics.

4.3. League Results. The objective of the league system is to select a red agent that generates maneuver guidance instructions according to its strategy so as to achieve the best results without knowing the strategy of its opponent. There are ten agents on each side, saved during the iteration of games. Each red agent fights each blue agent 400 times, including 100 times for each of the four initial scenarios. The detailed league results are shown in Table 2. Results with a win/loss ratio greater than 1 are highlighted in bold.

Figure 6: Situation recognition based on deviation angle and aspect angle. (a) Four initial air combat situation categories (offensive, defensive, neutral, and head-on in the μ^r-η^b plane). (b) Situation recognition diagram (aspect angle η^b versus deviation angle μ^r, each from −180° to 180°).

Figure 7: Learning curves in the training process (percentages of blue win, red win, and draw versus training episodes). (a) π^r_1 vs π^b_0. (b) π^b_5 vs π^r_5. (c) π^r_10 vs π^b_9.


Figure 8: Typical trajectory results in the training iteration of air combat games. (a) π^r_1 vs π^b_0, offensive. (b) π^r_1 vs π^b_0, defensive. (c) π^r_1 vs π^b_0, head-on. (d) π^r_1 vs π^b_0, neutral. (e) π^b_5 vs π^r_5, offensive. (f) π^b_5 vs π^r_5, defensive. (g) π^b_5 vs π^r_5, head-on. (h) π^b_5 vs π^r_5, neutral. (i) π^r_10 vs π^b_9, offensive. (j) π^r_10 vs π^b_9, defensive. (k) π^r_10 vs π^b_9, head-on. (l) π^r_10 vs π^b_9, neutral.

Figure 9: Advantage curves in head-on air combat games (advantage function values of the red and blue sides versus training steps). (a) π^r_1 vs π^b_0. (b) π^b_5 vs π^r_5. (c) π^r_10 vs π^b_9.


π^r_i, where i ∈ [1, 10], is trained in the environment with π^b_{i-1} as its opponent, and π^b_j is trained against π^r_j, j ∈ [1, 10]. The result of π^r_1 fighting against π^b_0 is ideal. However, since π^b_0 is a static strategy rather than an intelligent one, π^r_1 does not perform well in games with the other, intelligent blue agents. For π^r_2 to π^r_10, although there is no special training against π^b_0, the performance is still very good because π^b_0 is a static strategy. In the iterative game process, the trained agents obtain good results in the confrontation with the frozen agents of their training environment; for example, π^r_2 has an overwhelming advantage over π^b_1, and likewise π^b_2 over π^r_2.

Because of the red queen effect, although π^r_10 has an advantage against π^b_9, it is in a weak position against early blue agents such as π^b_2, π^b_3, and π^b_4. The league table is shown in Table 3, in which 3 points are awarded for a win, 1 point for a draw, and 0 points for a loss. In the early stage of training, the performance of the trained agents improves gradually: the later trained agents are better than the earlier ones, for example, from π^r_1 to π^r_6. As the alternate freeze training goes on, the performance does not keep improving, for example, from π^r_7 to π^r_10. The top of the score table is π^r_6, which not only has the highest score but also holds an advantage against every blue agent except π^b_6, as shown in Table 2. In the competition with all the opponents, it wins 44% of the games and remains unbeaten in 75%.
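A minimal sketch of this scoring scheme, applied to the win/draw/loss totals of a few agents taken from Table 3 (the dictionary keys and function name are mine):

def league_table(records):
    # records: {agent: (wins, draws, losses)} over all 4400 league games (11 opponents x 400 games).
    scored = [(name, 3 * wins + draws) for name, (wins, draws, _losses) in records.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

totals = {"pi_r_6": (1913, 1373, 1114), "pi_r_10": (1896, 1081, 1423), "pi_r_1": (1362, 1332, 1706)}
print(league_table(totals))  # [('pi_r_6', 7112), ('pi_r_10', 6769), ('pi_r_1', 5418)]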

In order to verify the performance of this agent, it is confronted with opponent agents trained by ADP [20] and Minimax-DQN [44]. Each of the four typical initial situations is performed 100 times against both opponents, and the results are shown in Table 4. The time cost to generate an action for each algorithm is shown in Table 5. The performance of the agent presented in this paper is comparable to that of ADP, and its computational time is slightly lower. However, ADP is a model-based method that is assumed to know all the information, such as the roll rate of the opponent. Compared with Minimax-DQN, the agent has an overall advantage. This is because Minimax-DQN assumes that the opponent is rational, while the opponent in this paper is not, which is closer to the real world. In addition, the computational efficiency of the proposed algorithm is significantly higher than that of Minimax-DQN, because Minimax-DQN uses linear programming to select the optimal action at each step, which is time consuming.

Table 2: League results (number of wins, draws, and losses of each red agent against each blue agent over 400 games).

            π^b_0  π^b_1  π^b_2  π^b_3  π^b_4  π^b_5  π^b_6  π^b_7  π^b_8  π^b_9  π^b_10
π^r_1  Win    393     17     83     60    137     75     96    126    119    133    123
       Draw     7    103    213    198     95    128    131    117    113    113    114
       Loss     0    280    104    142    168    197    173    157    168    154    163
π^r_2  Win    384    267      4     82     78     94     88    135    142    161    152
       Draw    16    120    128    178    143    137    157     76    100     87     94
       Loss     0     13    268    140    179    169    155    189    158    152    154
π^r_3  Win    387    185    253      5     82     97    103    123    134    154    131
       Draw    11    104    131    115    194    148    145    132     84    102     97
       Loss     2    111     16    280    124    155    152    145    182    144    172
π^r_4  Win    381    172    182    237     31     92    137    154    149    167    158
       Draw    19    123    102    117    110    151    101     97    144     92    104
       Loss     0    105    116     46    259    157    162    149    107    141    138
π^r_5  Win    382    164    172    187    221     29     95    137    168    154    162
       Draw    18    122    113    142    139    144    118    103    121    143    113
       Loss     0    114    115     71     40    227    187    160    111    103    125
π^r_6  Win    380    162    177    193    197    217     52    106    134    152    143
       Draw    19    105     98    132    101    134    139    194    176    131    144
       Loss     1    133    125     75    102     49    209    100     90    117    113
π^r_7  Win    383    169    169    180    165    187    210     56     82    122    133
       Draw    14     98    111    105    123    134    133    137    180    132    156
       Loss     3    133    120    115    112     79     57    207    138    144    111
π^r_8  Win    383    154    157    172    168    177    182    213     36    107    114
       Draw    16    123    109    114    132    102     97    134    141    124    138
       Loss     1    123    134    114    100    121    121     53    223    169    148
π^r_9  Win    387    140    162    157    154    162    179    182    215     37    102
       Draw    11    133     97    105    127    131     97    104    149    125    147
       Loss     2    127    141    138    119    107    124    114     36    238    151
π^r_10 Win    379    155    146    147    143    159    167    169    170    219     42
       Draw    10    102     84     92    105     97    131    104     85    140    131
       Loss    11    143    170    161    152    144    102    127    145     41    227

4.4. Agent Evaluation and Behavior Analysis. Evaluating agent performance in air combat is not simply a matter of either winning or losing. In addition to the winning and losing rates, two criteria are added to represent the level of success: average time to win (ATW) and average disadvantage time (ADT). ATW is measured as the average elapsed time required to maneuver to the advantage position in the scenarios where our aircraft wins; a smaller ATW is better than a larger one. ADT is the average accumulated time during which the advantage function value of our aircraft is less than that of the opponent, and it is used as a criterion to evaluate the exposure to risk from the adversary's weapons.
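A minimal sketch of how these two criteria could be computed from logged test episodes, assuming each episode record stores its outcome, the elapsed time at which the advantage position was reached (for wins), and per-step advantage-function values of both sides sampled every training step (0.5 s); all field names are illustrative.

def average_time_to_win(episodes):
    # ATW: mean time to reach the advantage position, over winning episodes only.
    win_times = [ep["time_to_advantage"] for ep in episodes if ep["outcome"] == "win"]
    return sum(win_times) / len(win_times) if win_times else float("nan")

def average_disadvantage_time(episodes, step=0.5):
    # ADT: mean accumulated time per episode in which our advantage value is below the opponent's.
    totals = []
    for ep in episodes:
        steps_behind = sum(1 for ours, theirs in zip(ep["own_value"], ep["opp_value"]) if ours < theirs)
        totals.append(steps_behind * step)
    return sum(totals) / len(totals) if totals else float("nan")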

A thousand air combat scenarios are generated in the simulation software, and the opponent in each scenario is randomly chosen from the ten blue agents. Each well-trained red agent performs these confrontations, and the results are shown in Figure 10, where agent number i denotes the policy trained after i periods. From π^r_1 to π^r_5, the performance of the agents improves gradually. After π^r_5, the performance stabilizes, and each agent has different characteristics. Comparing π^r_5 to π^r_10, the winning and losing rates are almost the same as those in Table 3. The winning rates of these six agents are similar, with the highest, π^r_5, winning 6% more than the lowest, π^r_10, as shown in Figure 10(a). However, π^r_6 is the agent with the lowest losing rate, more than 10% lower than that of the other agents, as shown in Figure 10(b). For ATW, π^r_7 is below 100 s, π^r_6 is 108.7 s, and the other agents are above 110 s, as shown in Figure 10(c). For ADT, π^r_5, π^r_6, and π^r_8 are below 120 s, π^r_7 is above 130 s, and π^r_9 and π^r_10 are between 120 and 130 s, as shown in Figure 10(d).

In summary, as the number of iterations increases, the red queen effect appears clearly: the performance of π^r_8, π^r_9, and π^r_10 is no better than that of π^r_6 and π^r_7. Although the winning rate of π^r_7 is not the highest and it accumulates more disadvantage time, it always wins quickly; it is an aggressive agent and can be used in scenarios where our aircraft needs to beat the opponent quickly. The winning rate of π^r_6 is similar to that of the other agents, while its losing rate is the lowest; its ATW is the second lowest, after π^r_7, and its ADT is the lowest. In most cases, it can be selected to achieve more victories while keeping itself safe. π^r_6 is therefore an agent with good overall performance, which shows that the selection method of the league system is effective.

For the four initial scenarios, air combat simulation using π^r_6 is selected for behavior analysis, as shown in Figure 11. In the offensive scenario, the opponent tried to escape by turning in the opposite direction. Our aircraft used a lag pursuit tactic, which is simple but effective: at the 1st second, our aircraft did not turn to follow the opponent but chose to fly straight; at the 9th second, it gradually turned toward the opponent, established and maintained the advantage, and finally won, as shown in Figure 11(a).

In the defensive scenario, our aircraft was at a disadvantageous position, and the opponent was trying to lock onto us with a lag pursuit tactic, as shown in Figure 11(b). At the 30th second, our aircraft adjusted its roll angle to the right to avoid being locked by the opponent, as shown in Figure 11(c). After that, it continued to adjust its roll angle, and by the time the opponent noticed that our strategy had changed, our aircraft had already flown out of the sector. In situations where it is difficult to win, the agent will use a safe maneuver strategy to obtain a draw.

Table 3: League table (3 points for a win, 1 point for a draw, 0 points for a loss).

Rank  Agent (strategy)  Win   Draw  Loss  Score
1     π^r_6             1913  1373  1114  7112
2     π^r_7             1858  1323  1219  6897
3     π^r_5             1871  1276  1253  6889
4     π^r_9             1877  1226  1297  6857
5     π^r_8             1863  1230  1307  6819
6     π^r_10            1896  1081  1423  6769
7     π^r_4             1860  1160  1380  6740
8     π^r_3             1654  1263  1483  6225
9     π^r_2             1587  1236  1577  5997
10    π^r_1             1362  1332  1706  5418

Table 4: Algorithm comparison (results of the selected agent over 100 games per initial situation).

                     vs ADP [20]   vs Minimax-DQN [44]
Offensive   Win          57               66
            Draw         28               21
            Loss         15               13
Defensive   Win          19               20
            Draw         31               42
            Loss         50               38
Head-on     Win          43               46
            Draw         16               26
            Loss         41               28
Neutral     Win          32               43
            Draw         32               22
            Loss         36               35

Table 5: Time cost to generate an action.

Algorithm               Time cost (ms)
The proposed algorithm  138
ADP                     142
Minimax-DQN             786

Figure 10: Agent evaluation over 1000 randomly generated scenarios. (a) Number of wins. (b) Number of losses. (c) Average time to win. (d) Average disadvantage time.

Figure 11: Trajectories under four typical initial situations (the labels along each trajectory are elapsed time in seconds). (a) Offensive. (b) Defensive (0–21 s). (c) Defensive (21–43 s). (d) Head-on (0–50 s). (e) Head-on (50–100 s). (f) Head-on (100–154 s). (g) Neutral (0–40 s). (h) Neutral (40–74 s).


In the head-on scenario, both sides adopted the same maneuver strategy in the first 50 s, that is, a maximum-overload turn while waiting for an opportunity, as shown in Figure 11(d). The crucial decision was made at the 50th second: the opponent kept circling and waiting for an opportunity, while our aircraft stopped circling and reduced its turning rate to fly towards the opponent. At the 90th second, the aircraft reached an advantageous situation over its opponent, as shown in Figure 11(e). In the final stage, our aircraft adopted the lead pursuit tactic to establish and maintain the advantage, and it won, as shown in Figure 11(f).

In the neutral scenario, our initial strategy was to turn away from the opponent to look for opportunities. The opponent's strategy was lag pursuit, and it successfully reached the rear of our aircraft at the 31st second, as shown in Figure 11(g). However, after the 40th second, our aircraft made a wise decision to reduce its roll angle, thereby increasing its turning radius. At the 60th second, the disadvantageous situation was terminated, as shown in Figure 11(h). After that, our aircraft increased its roll angle, while the opponent remained in a maximum-overload right turn, which made it unable to get rid of our aircraft, and it finally lost. It can be seen that, trained by alternate freeze games using the DQN algorithm and selected through the league system, the agent can learn tactics such as lead pursuit, lag pursuit, and sustained circling, and it can use them in air combat.

5. Conclusion

In this paper, middleware connecting air combat simulation software and a reinforcement learning agent is developed. It offers researchers a way to design different middleware that turns existing software into reinforcement learning environments, which can expand the application field of reinforcement learning. Maneuver guidance agents with reward shaping are designed. Through training in the environment, an agent can guide an aircraft to fight against its opponent in air combat and reach an advantageous situation in most scenarios. An alternate freeze game algorithm is proposed and combined with RL; it can be used in nonstationary situations where the other players' strategies are variable. Through the league system, the agent whose performance improved after iterative training is selected. The league results show that the strategy quality of the selected agent is better than that of the other agents and that the red queen effect is avoided. Agents can learn some typical angle tactics and behaviors in the horizontal plane and perform these tactics by guiding an aircraft to maneuver in one-on-one air combat.

In future work, the problem will be extended to 3D maneuvering with less restrictive vehicle dynamics. The 3D formulation will lead to a larger state space and more complex actions, and the learning mechanism will be improved to deal with them. Beyond the extension to 3D maneuvering, a 2-vs-2 air combat scenario will be established, in which there are collaborators as well as adversaries. One option is to use a centralized controller, which turns the problem into a single-agent DRL problem that obtains the joint action for both aircraft to execute at each time step. The other option is to adopt a decentralized system in which each agent makes decisions for itself and the agents may cooperate to achieve a common goal.

In theory, the convergence of the alternate freeze game algorithm still needs to be analyzed. Furthermore, in order to improve the diversity of potential opponents in more complex air combat scenarios, the league system will be refined.

Nomenclature

(x, y, z): 3-dimensional coordinates of an aircraft
v: Speed
θ: Flight path angle
ψ: Heading angle
β: Angle of attack
ϕ: Bank angle
m: Mass of an aircraft
g: Acceleration due to gravity
T: Thrust force
L: Lift force
D: Drag force
ψ̇: Turn rate
ϕ̇: Roll rate
d_t: Distance between the aircraft and the target at time t
d_min: Minimum distance of the advantage position
d_max: Maximum distance of the advantage position
μ_t: Deviation angle at time t
μ_max: Maximum deviation angle of the advantage position
η_t: Aspect angle at time t
η_max: Maximum aspect angle of the advantage position
S: State space
S_t: State vector at time t
s: All states in the state space
A: Action space
A_t: Action at time t
a: All actions in the action space
π: Policy, a mapping S → A
R(S_t, A_t, S_{t+1}): Reward function
R_t: Scalar reward received at time t
V(s): Value function
Q(s, a): Action-value function
Q*(s, a): Optimal action-value function
Q(S_t, A_t): Estimated action-value function at time t
γ: Discount factor
α: Learning rate
T: Maximum simulation time in each episode
Δt: Training time step size
δt: Simulation time step size
F(S_t, A_t, S_{t+1}): Shaping reward function
Φ(S_t): A real-valued function on states
T(S_t): Termination reward function
D(S_t): Distance reward function
O(S_t): Orientation reward function
K: Number of training periods


Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grants 91338107 and U1836103 and by the Development Program of Sichuan, China, under Grants 2017GZDZX0002, 18ZDYF3867, and 19ZDZX0024.

Supplementary Materials

Aircraft dynamics: the derivation of the kinematic and dynamic equations of the aircraft. Code: a simplified version of the environment used in this paper. It does not include the commercial software and middleware, but its aircraft model and other functions are the same as those proposed in this paper. This environment and the RL agent are packaged as supplementary material, through which the alternate freeze game DQN algorithm proposed in this paper can be reproduced. (Supplementary Materials)

References

[1] L. R. Shaw, "Basic fighter maneuvers," in Fighter Combat: Tactics and Maneuvering, Naval Institute Press, Annapolis, MD, USA, 1st edition, 1985.
[2] A. Toubman, J. J. Roessingh, P. Spronck, A. Plaat, and J. Herik, Dynamic Scripting with Team Coordination in Air Combat Simulation, Lecture Notes in Computer Science, Kaohsiung, Taiwan, 2014.
[3] A. Toubman, J. J. Roessingh, P. Spronck, A. Plaat, and J. Herik, "Centralized versus decentralized team coordination using dynamic scripting," in Proceedings of the European Modeling and Simulation Symposium, pp. 1–7, Porto, Portugal, 2014.
[4] A. Toubman, J. J. Roessingh, P. Spronck, A. Plaat, and J. Herik, "Rewarding air combat behavior in training simulations," in Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, pp. 1397–1402, Kowloon, China, 2015.
[5] A. Toubman, J. J. Roessingh, P. Spronck, A. Plaat, and J. Herik, "Transfer learning of air combat behavior," in Proceedings of the IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 226–231, Miami, FL, USA, 2015.
[6] D. I. You and D. H. Shim, "Design of an aerial combat guidance law using virtual pursuit point concept," Proceedings of the Institution of Mechanical Engineers, vol. 229, no. 5, pp. 1–22, 2014.
[7] H. Shin, J. Lee, and D. H. Shim, "Design of a virtual fighter pilot and simulation environment for unmanned combat aerial vehicles," in Proceedings of the Advances in Flight Control Systems, pp. 1–21, Grapevine, TX, USA, 2017.
[8] J. M. Eklund, J. Sprinkle, and S. S. Sastry, "Implementing and testing a nonlinear model predictive tracking controller for aerial pursuit/evasion games on a fixed wing aircraft," in Proceedings of the Control System Applications, pp. 1509–1514, Portland, OR, USA, 2005.
[9] J. M. Eklund, J. Sprinkle, and S. S. Sastry, "Switched and symmetric pursuit/evasion games using online model predictive control with application to autonomous aircraft," IEEE Transactions on Control Systems Technology, vol. 20, no. 3, pp. 604–620, 2012.
[10] F. Austin, G. Carbone, M. Falco, H. Hinz, and M. Lewis, Automated Maneuvering Decisions for Air-to-Air Combat, Grumman Corporate Research Center, Bethpage, NY, USA, 1987.
[11] F. Austin, G. Carbone, M. Falco, H. Hinz, and M. Lewis, "Game theory for automated maneuvering during air-to-air combat," Journal of Guidance, Control, and Dynamics, vol. 13, no. 6, pp. 1143–1149, 1990.
[12] Y. Ma, G. Wang, X. Hu, H. Luo, and X. Lei, "Cooperative occupancy decision making of multi-UAV in beyond-visual-range air combat: a game theory approach," IEEE Access, vol. 323, 2019.
[13] J. P. Hespanha, M. Prandini, and S. Sastry, "Probabilistic pursuit-evasion games: a one-step Nash approach," in Proceedings of the 36th IEEE Conference on Decision and Control, Sydney, NSW, Australia, 2000.
[14] A. Xu, Y. Kou, L. Yu, B. Xu, and Y. Lv, "Engagement maneuvering strategy of air combat based on fuzzy Markov game theory," in Proceedings of the IEEE International Conference on Computer, pp. 126–129, Wuhan, China, 2011.
[15] K. Horic and B. A. Conway, "Optimal fighter pursuit-evasion maneuvers found via two-sided optimization," Journal of Guidance, Control, and Dynamics, vol. 29, no. 1, pp. 105–112, 2006.
[16] K. Virtanen, T. Raivio, and R. P. Hamalainen, "Modeling pilot's sequential maneuvering decisions by a multistage influence diagram," Journal of Guidance, Control, and Dynamics, vol. 27, no. 4, pp. 665–677, 2004.
[17] K. Virtanen, J. Karelahti, and T. Raivio, "Modeling air combat by a moving horizon influence diagram game," Journal of Guidance, Control, and Dynamics, vol. 29, no. 5, pp. 1080–1091, 2006.
[18] Q. Pan, D. Zhou, J. Huang et al., "Maneuver decision for cooperative close-range air combat based on state predicted influence diagram," in Proceedings of the Information and Automation for Sustainability, pp. 726–730, Macau, China, 2017.
[19] R. E. Smith, B. A. Dike, R. K. Mehra, B. Ravichandran, and A. El-Fallah, "Classifier systems in combat: two-sided learning of maneuvers for advanced fighter aircraft," Computer Methods in Applied Mechanics and Engineering, vol. 186, no. 2–4, pp. 421–437, 2000.
[20] J. S. McGrew, J. P. How, B. Williams, and N. Roy, "Air-combat strategy using approximate dynamic programming," Journal of Guidance, Control, and Dynamics, vol. 33, no. 5, pp. 1641–1654, 2010.
[21] Y. Ma, X. Ma, and X. Song, "A case study on air combat decision using approximated dynamic programming," Mathematical Problems in Engineering, vol. 2014, 2014.
[22] I. Bisio, C. Garibotto, F. Lavagetto, A. Sciarrone, and S. Zappatore, "Blind detection: advanced techniques for WiFi-based drone surveillance," IEEE Transactions on Vehicular Technology, vol. 68, no. 1, pp. 938–946, 2019.
[23] Y. Li, "Deep reinforcement learning," 2018.
[24] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, London, UK, 2nd edition, 2018.
[25] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[26] V. Mnih, K. Kavukcuoglu, D. Silver et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[27] Z. Kavukcuoglu, R. Liu, Z. Meng, Y. Zhang, Y. Yu, and T. Lu, "On reinforcement learning for full-length game of StarCraft," 2018.
[28] D. Silver and C. J. Maddison, "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[29] D. Hubert and I. Antonoglou, "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play," Science, vol. 362, no. 6419, pp. 1140–1144, 2018.
[30] J. Schrittwieser, J. A. Bagnell, and J. Peters, "Reinforcement learning in robotics: a survey," The International Journal of Robotics Research, vol. 32, no. 11, pp. 1238–1274, 2013.
[31] A. Waldock, C. Greatwood, F. Salama, and T. Richardson, "Learning to perform a perched landing on the ground using deep reinforcement learning," Journal of Intelligent and Robotic Systems, vol. 92, no. 3-4, pp. 685–704, 2018.
[32] R. R. Alejandro, S. Carlos, B. Hriday, P. Paloma, and P. Campoy, "A deep reinforcement learning strategy for UAV autonomous landing on a moving platform," Journal of Intelligent and Robotic Systems, vol. 93, no. 1-2, pp. 351–366, 2019.
[33] W. Lou and X. Guo, "Adaptive trajectory tracking control using reinforcement learning for quadrotor," Journal of Intelligent and Robotic Systems, vol. 13, no. 1, pp. 1–10, 2016.
[34] C. Dunn, J. Valasek, and K. Kirkpatrick, "Unmanned air system search and localization guidance using reinforcement learning," in Proceedings of the AIAA Infotech Aerospace 2014 Conference, pp. 1–8, Garden Grove, CA, USA, 2014.
[35] S. You, M. Diao, and L. Gao, "Deep reinforcement learning for target searching in cognitive electronic warfare," IEEE Access, vol. 7, pp. 37432–37447, 2019.
[36] P. Luo, J. Xie, and W. Che, "Q-learning based air combat target assignment algorithm," in Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, pp. 779–783, Budapest, Hungary, 2016.
[37] M. Grzes, "Reward shaping in episodic reinforcement learning," in Proceedings of the International Foundation for Autonomous Agents and Multiagent Systems, pp. 565–573, São Paulo, Brazil, 2017.
[38] K. Tumer and A. Agogino, "Agent reward shaping for alleviating traffic congestion," in Proceedings of the International Foundation for Autonomous Agents and Multiagent Systems, pp. 1–7, Hakodate, Japan, 2006.
[39] L. L. B. V. Cruciol, A. C. de Arruda, L. Weigang, L. Li, and A. M. F. Crespo, "Reward functions for learning to control in air traffic flow management," Transportation Research Part C: Emerging Technologies, vol. 35, pp. 141–155, 2013.
[40] P. Hernandez-Leal, M. Kaisers, T. Baarslag, and E. M. de Cote, "A survey of learning in multiagent environments: dealing with non-stationarity," 2019.
[41] M. L. Littman, "Markov games as a framework for multi-agent reinforcement learning," in Proceedings of the International Conference on Machine Learning, pp. 157–163, New Brunswick, NJ, USA, 1994.
[42] M. L. Littman, "Friend-or-foe Q-learning in general-sum games," in Proceedings of the International Conference on Machine Learning, pp. 322–328, Williamstown, MA, USA, 2001.
[43] J. Hu and M. P. Wellman, "Nash Q-learning for general-sum stochastic games," Journal of Machine Learning Research, vol. 4, pp. 1039–1069, 2003.
[44] J. Fan, Z. Wang, Y. Xie, and Z. Yang, "A theoretical analysis of deep Q-learning," 2020.
[45] R. E. Smith, B. A. Dike, B. Ravichandran, A. Fallah, and R. Mehra, "Two-sided genetics-based learning to discover novel fighter combat maneuvers," in Lecture Notes in Computer Science, pp. 233–242, Springer, Berlin, Heidelberg, 2001.
[46] A. Ng, D. Harada, and S. Russell, "Policy invariance under reward transformations: theory and application to reward shaping," in Proceedings of the International Conference on Machine Learning, pp. 278–287, Bled, Slovenia, 1999.
[47] T. P. Lillicrap, J. J. Hunt, A. Pritzel et al., "Continuous control with deep reinforcement learning," in Proceedings of the International Conference on Learning Representations, Puerto Rico, 2016.
[48] V. Mnih, A. P. Badia, M. Mirza et al., "Asynchronous methods for deep reinforcement learning," in Proceedings of the International Conference on Machine Learning, pp. 1–19, New York, NY, USA, 2016.
[49] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," 2017.
[50] Commercial Air Combat Simulation Software FG_SimStudio, http://www.eyextent.com/producttail117.
[51] M. Wiering and M. van Otterlo, Reinforcement Learning: State-of-the-Art, Springer Press, Heidelberg, Germany, 2014.
[52] D. P. Kingma and J. Ba, "Adam: a method for stochastic optimization," in Proceedings of the International Conference on Learning Representations, pp. 1–15, San Diego, CA, USA, 2015.



Combat area 5 km times 5 kmSimulation time T 500 sSimulation time step δt 01 sTraining time step Δt 05 s

8 Mathematical Problems in Engineering

scenario the initial state value of the red agent is largebecause of small |μr| and |ηb| according to equations (14)and (16) erefore in the offensive initial scenario the winprobability of our side will be larger than that of the op-ponent while the defensive initial scenario is reversed Inneutral and head-one initial scenarios the initial state valuesof both sides are almost equal e performance of the well-trained agents is verified according to the classification offour initial situations in the league system

42 Simulation Results e policy naming convention is πbi

or πri which means a blue or red policy produced after i

periods respectively After every 100 training episodes thetest episodes run 100 times and the learning curves areshown in Figure 7 For the first period (πr

1 vs πb0) due to the

simple maneuver policy of the blue aircraft the red aircraftcan achieve almost 100 success rate through about 4000episodes of training as shown in Figure 7(a) At the be-ginning most of the games are draw because of the randominitialization of the red agent and the evasion strategy of itsopponent Another reason is that the aircraft often flies outof the airspace in the early stages of training Although theagent will get a penalty for this the air combat episode willstill get a draw During the training process the red aircraft islooking for winning strategies in the airspace and itswinning rate is constantly increasing However because of

its pursuit of offensive and neglect of defensive the winningrate of its opponent is also rising In the later stage oftraining the red aircraft gradually understands the oppo-nentrsquos strategy and it can achieve a very high winning rate

With the iterations in games both agents have learnt theintelligent maneuver strategy which can guide the aircraft inthe pursuit-evasion game Figure 7(b) shows a typical ex-ample that πb

5 is the trainer and πr5 is its opponent After

about 5000 episodes of training the blue aircraft can reducethe losing rate to a lower level Since the red one adopts anintelligent maneuver guidance strategy the blue agentcannot learn a winning strategy in a short time After about20000 episodes the agent gradually understands the op-ponentrsquos maneuver strategy and its winning rate keepsincreasing e final winning rate is stable between 50 and60 and the losing rate is below 10

e training process iteration of πr10 vs πb

9 is shown inFigure 7(c) Because the maneuver strategy of the blue aircraftis an intelligent strategy which has been trained iterativelytraining for the red agent ismuchmore difficult than that in thefirst iteration e well-trained agent can win more than 60and lose less than 10 in the game against πb

9It is hard to get a higher winning rate in every training

except for the scenario that the opponent is πb0 is is because

both aircrafts are homogeneous the initial situations arerandomized and the opponent in the training environment isintelligent In some situations as long as the opponentrsquosstrategy is intelligent enough the trained agent is almostimpossible to win Interestingly in some scenarios where theopponent is intelligent enough and the aircraft cannot defeat itno matter how the agent operates the agent will guide theaircraft out of the airspace to obtain a draw is gives us aninspiration that is in a scenario where you cannot win it maybe the best way to get out of the combat area

For the four initial air combat situations, the two aircraft are guided by the agents trained in each iteration stage. Typical flight trajectories are shown in Figure 8. Since πb0 is a static strategy, πr1 can easily win in the offensive, head-on, and neutral situations, as shown in Figures 8(a), 8(c), and 8(d). In the defensive situation, the red aircraft first turns away from its opponent, who has the advantage, and then looks for opportunities to establish an advantageous situation. There are no fierce rivalries in the process of combat, as shown in Figure 8(b).

Although πr5 is an intelligent strategy, the well-trained πb5 can still gain an advantage in combat, as shown in Figures 8(e)–8(h). In the offensive situation (from the red aircraft's perspective), the blue aircraft cannot easily get rid of the red one's pursuit, but after a fierce confrontation it can establish an advantageous situation, as shown in Figure 8(e). Figure 8(f) shows the defensive situation, in which the blue aircraft keeps its advantage until it wins. In the other two scenarios, the blue aircraft also performs well, especially in the neutral initial situation, where it adopts a delayed-turning tactic and successfully wins the air combat.

πb9 is an intelligent strategy, and the well-trained πr10 can achieve more than half of the victories and less than 10% of the failures. However, unlike πr1 vs. πb0, πr10 cannot win easily, especially in head-on situations, as shown in Figures 8(i)–8(l).

Figure 5: Evaluation of the neural network and minibatch (average reward versus training episodes for small/medium/large neural networks combined with small/medium/large minibatches).


In the offensive situation, the blue aircraft tries to get rid of the red one by constantly adjusting its bank angle, but it is finally captured by the red one. In the defensive situation, the red aircraft cleverly adopts the circling-back tactic and quickly captures its opponent. In the head-on situation, the red and blue aircraft alternately occupy the advantageous situation, and after a fierce maneuver confrontation the red aircraft finally wins. In the neutral situation, the red one constantly adjusts its bank angle according to the opponent's position and wins.

For the above three pairs of strategies, the head-on initial situations are taken as examples to analyze the advantages of both sides in the game, as shown in Figure 9. For the πr1 vs. πb0 scenario, because πb0 has no intelligent maneuver guidance ability, although both sides are equally competitive in the first half, the red aircraft continues to expand its advantage until it wins, as shown in Figure 9(a). When the blue agent learns the intelligent strategy, it can defeat the red one, as shown in Figure 9(b). For the πr10 vs. πb9 scenario, the two sides alternately establish an advantageous position, and the red aircraft does not win easily, as shown in Figure 9(c). Combined with the trajectories shown in Figures 8(c), 8(g), and 8(k), the maneuver strategies of both sides have been improved using the alternate freeze game DQN algorithm. The approach has the benefit of discovering new maneuver tactics.

4.3. League Results. The objective of the league system is to select a red agent that generates maneuver guidance instructions according to its own strategy so as to achieve the best results without knowing the strategy of its opponent. There are ten agents on each side, which are saved in the iteration of games. Each red agent fights each blue agent 400 times, including 100 times for each of the four initial scenarios. The detailed league results are shown in Table 2. Results with a win/loss ratio greater than 1 are highlighted in bold.
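A hedged sketch of the round-robin schedule is given below: every red agent plays every blue agent 100 times in each of the four initial scenarios (400 games per pairing), and the win/draw/loss counts are tallied as in Table 2. The env.play call and the agent containers are assumed placeholder interfaces.

from itertools import product

SCENARIOS = ("offensive", "defensive", "head-on", "neutral")

def league_round_robin(red_agents, blue_agents, env, runs_per_scenario=100):
    # Fight every red agent against every blue agent and tally outcomes (red's view).
    table = {}
    for (ri, red), (bj, blue) in product(enumerate(red_agents, 1), enumerate(blue_agents)):
        counts = {"win": 0, "draw": 0, "loss": 0}
        for scenario in SCENARIOS:
            for _ in range(runs_per_scenario):
                result = env.play(red, blue, scenario)   # placeholder: returns 'win'/'draw'/'loss'
                counts[result] += 1
        table[(f"pi_r{ri}", f"pi_b{bj}")] = counts
    return table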

Figure 6: Situation recognition based on deviation angle μr and aspect angle ηb. (a) Four initial air combat situation categories (offensive, defensive, neutral, and head-on). (b) Situation recognition diagram over deviation angle μr and aspect angle ηb, each ranging from −180° to 180°.

Figure 7: Learning curves (percentages of blue win, red win, and draw versus training episodes) in the training process. (a) πr1 vs. πb0. (b) πb5 vs. πr5. (c) πr10 vs. πb9.


Figure 8: Typical trajectory results in the training iteration of air combat games. (a) πr1 vs. πb0, offensive. (b) πr1 vs. πb0, defensive. (c) πr1 vs. πb0, head-on. (d) πr1 vs. πb0, neutral. (e) πb5 vs. πr5, offensive. (f) πb5 vs. πr5, defensive. (g) πb5 vs. πr5, head-on. (h) πb5 vs. πr5, neutral. (i) πr10 vs. πb9, offensive. (j) πr10 vs. πb9, defensive. (k) πr10 vs. πb9, head-on. (l) πr10 vs. πb9, neutral.

Figure 9: Advantage function value curves of the red and blue sides versus training steps in head-on air combat games. (a) πr1 vs. πb0. (b) πb5 vs. πr5. (c) πr10 vs. πb9.


πri, where i ∈ [1, 10], is trained in the environment with πb(i−1) as its opponent, and πbj is trained against πrj, j ∈ [1, 10]. The result of πr1 fighting πb0 is ideal. However, since πb0 is a static strategy rather than an intelligent one, πr1 does not perform well in the games with the other, intelligent blue agents. For πr2 to πr10, although there is no special training against πb0, the performance is still very good, because πb0 is a static strategy. In the iterative game process, the trained agents obtain good results in the confrontation with the frozen agents in the environment; for example, πr2 has an overwhelming advantage over πb1, and so does πb2 over πr2.

Because of the red queen effect, although πr10 has an advantage against πb9, it is in a weak position against the early blue agents, such as πb2, πb3, and πb4. The league table is shown in Table 3, in which 3 points are awarded for a win, 1 point for a draw, and 0 points for a loss. In the early stage of training, the performance of the trained agents gradually improves, and the later trained agents are better than the former ones, for example from πr1 to πr6. As the alternate freeze training goes on, however, the performance does not keep improving, for example from πr7 to πr10. The top of the score table is πr6, which not only has the highest score but also has an advantage when playing against all blue agents except πb6, as shown in Table 2. In the competition with all the opponents, it wins 44% of the games and remains unbeaten in about 75%.
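The 3/1/0 scoring and the ranking of Table 3 can be reproduced from those counts with a short sketch such as the following; it consumes the dictionary produced by the league_round_robin sketch shown earlier, and both the data layout and the function name are illustrative assumptions.

def league_table(results):
    # results: {(red, blue): {'win': w, 'draw': d, 'loss': l}}; score = 3*win + 1*draw + 0*loss.
    scores = {}
    for (red, _blue), c in results.items():
        agg = scores.setdefault(red, {"win": 0, "draw": 0, "loss": 0})
        for k in agg:
            agg[k] += c[k]
    ranked = sorted(scores.items(),
                    key=lambda kv: 3 * kv[1]["win"] + kv[1]["draw"], reverse=True)
    return [(rank + 1, red, c["win"], c["draw"], c["loss"], 3 * c["win"] + c["draw"])
            for rank, (red, c) in enumerate(ranked)]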

In order to verify the performance of this agent, it is confronted with opponent agents trained by ADP [20] and Minimax-DQN [44]. Each of the four typical initial situations is performed 100 times against both opponents, and the results are shown in Table 4. The time cost to generate an action for each algorithm is shown in Table 5. It can be seen that the performance of the agent presented in this paper is comparable to that of ADP, and its computational time is slightly lower. However, ADP is a model-based method, which is assumed to know all the information, such as the roll rate of the opponent. Compared with Minimax-DQN, the agent has the overall advantage. This is because Minimax-DQN assumes that the opponent is rational, whereas the opponent in this paper is not, which is closer to the real world. In addition, compared with Minimax-DQN, the computational efficiency of the proposed algorithm is significantly better, because Minimax-DQN uses linear programming to select the optimal action at each step, which is time consuming.
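The per-action time cost reported in Table 5 can be estimated with a simple wall-clock benchmark along the lines of the sketch below; agent.act is the same placeholder interface assumed earlier, and the repeat count is arbitrary.

import time

def average_action_time_ms(agent, states, repeats=1000):
    # Average wall-clock time (ms) the agent needs to generate one action.
    start = time.perf_counter()
    for i in range(repeats):
        agent.act(states[i % len(states)], explore=False)
    return 1000.0 * (time.perf_counter() - start) / repeats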

4.4. Agent Evaluation and Behavior Analysis. Evaluating agent performance in air combat is not simply a matter of either winning or losing. In addition to the winning and losing rates, two criteria are added to represent the success level: average time to win (ATW) and average disadvantage time (ADT).

Table 2: League results.

            πb0   πb1   πb2   πb3   πb4   πb5   πb6   πb7   πb8   πb9   πb10
πr1  Win    393    17    83    60   137    75    96   126   119   133   123
     Draw     7   103   213   198    95   128   131   117   113   113   114
     Loss     0   280   104   142   168   197   173   157   168   154   163
πr2  Win    384   267     4    82    78    94    88   135   142   161   152
     Draw    16   120   128   178   143   137   157    76   100    87    94
     Loss     0    13   268   140   179   169   155   189   158   152   154
πr3  Win    387   185   253     5    82    97   103   123   134   154   131
     Draw    11   104   131   115   194   148   145   132    84   102    97
     Loss     2   111    16   280   124   155   152   145   182   144   172
πr4  Win    381   172   182   237    31    92   137   154   149   167   158
     Draw    19   123   102   117   110   151   101    97   144    92   104
     Loss     0   105   116    46   259   157   162   149   107   141   138
πr5  Win    382   164   172   187   221    29    95   137   168   154   162
     Draw    18   122   113   142   139   144   118   103   121   143   113
     Loss     0   114   115    71    40   227   187   160   111   103   125
πr6  Win    380   162   177   193   197   217    52   106   134   152   143
     Draw    19   105    98   132   101   134   139   194   176   131   144
     Loss     1   133   125    75   102    49   209   100    90   117   113
πr7  Win    383   169   169   180   165   187   210    56    82   122   133
     Draw    14    98   111   105   123   134   133   137   180   132   156
     Loss     3   133   120   115   112    79    57   207   138   144   111
πr8  Win    383   154   157   172   168   177   182   213    36   107   114
     Draw    16   123   109   114   132   102    97   134   141   124   138
     Loss     1   123   134   114   100   121   121    53   223   169   148
πr9  Win    387   140   162   157   154   162   179   182   215    37   102
     Draw    11   133    97   105   127   131    97   104   149   125   147
     Loss     2   127   141   138   119   107   124   114    36   238   151
πr10 Win    379   155   146   147   143   159   167   169   170   219    42
     Draw    10   102    84    92   105    97   131   104    85   140   131
     Loss    11   143   170   161   152   144   102   127   145    41   227


ATW is measured as the average elapsed time required to maneuver to the position of advantage in the scenarios where our aircraft wins; a smaller ATW is better than a larger one. ADT is the average accumulated time during which the advantage function value of our aircraft is less than that of the opponent; it is used as a criterion to evaluate the risk of exposure to the adversary's weapons.
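Both criteria can be computed from per-episode logs as in the hedged sketch below; the field names of the episode records are assumptions made for illustration.

def evaluate_metrics(episodes):
    # episodes: list of dicts with assumed keys
    #   'result'            : 'win' / 'loss' / 'draw'
    #   'time_to_win'       : elapsed seconds when result == 'win'
    #   'disadvantage_time' : accumulated seconds with our advantage value below the opponent's
    wins = [e for e in episodes if e["result"] == "win"]
    losses = [e for e in episodes if e["result"] == "loss"]
    atw = sum(e["time_to_win"] for e in wins) / max(len(wins), 1)
    adt = sum(e["disadvantage_time"] for e in episodes) / len(episodes)
    return {"win_rate": len(wins) / len(episodes),
            "loss_rate": len(losses) / len(episodes),
            "ATW_s": atw, "ADT_s": adt}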

A thousand air combat scenarios are generated in the simulation software, and the opponent in each scenario is randomly chosen from the ten blue agents. Each well-trained red agent is used to perform these confrontations, and the results are shown in Figure 10. Agent number i means the policy trained after i periods. From πr1 to πr5, the performance of the agents gradually improves. After πr5, the performance stabilizes, and each agent has different characteristics. Comparing πr5 to πr10, the winning and losing rates are almost the same as those in Table 3. The winning rates of these six agents are similar, with the highest, πr5, winning 6% more than the lowest, πr10, as shown in Figure 10(a). However, πr6 is the agent with the lowest losing rate, and its losing rate is more than 10% lower than that of the other agents, as shown in Figure 10(b). For ATW, πr7 is less than 100 s, πr6 is 108.7 s, and that of the other agents is more than 110 s, as shown in Figure 10(c). For ADT, πr5, πr6, and πr8 are less than 120 s, πr7 is above 130 s, and πr9 and πr10 are between 120 and 130 s, as shown in Figure 10(d).

In summary, with the increase in the number of iterations, the red queen effect appears clearly. The performance of πr8, πr9, and πr10 is no better than that of πr6 and πr7. Although the winning rate of πr7 is not high and it spends more time at a disadvantage, it can always win quickly. It is an aggressive agent and can be used in scenarios where our aircraft needs to beat the opponent quickly. The winning rate of πr6 is similar to that of the other agents, while its losing rate is the lowest; its ATW is only higher than that of πr7, and its ADT is the lowest. In most cases, it can be selected to achieve more victories while keeping itself safe. πr6 is an agent with good comprehensive performance, which means that the selection method of the league system is effective.

For the four initial scenarios, air combat simulations using πr6 are selected for behavior analysis, as shown in Figure 11. In the offensive scenario, the opponent tried to escape by turning in the opposite direction. Our aircraft used a lag pursuit tactic, which is simple but effective. At the 1st second, our aircraft did not keep up with the opponent but chose to fly straight. At the 9th second, it gradually turned toward the opponent, established and maintained the advantage, and finally won, as shown in Figure 11(a).

In the defensive scenario, our aircraft was at a disadvantageous position, and the opponent was trying to lock on to us with a lag pursuit tactic, as shown in Figure 11(b).

Table 3: League table.

Rank   Agent (strategy)   Win    Draw   Loss   Score
1      πr6                1913   1373   1114   7112
2      πr7                1858   1323   1219   6897
3      πr5                1871   1276   1253   6889
4      πr9                1877   1226   1297   6857
5      πr8                1863   1230   1307   6819
6      πr10               1896   1081   1423   6769
7      πr4                1860   1160   1380   6740
8      πr3                1654   1263   1483   6225
9      πr2                1587   1236   1577   5997
10     πr1                1362   1332   1706   5418

Table 4: Algorithm comparison.

                      vs. ADP [20]   vs. Minimax-DQN [44]
Offensive   Win            57               66
            Draw           28               21
            Loss           15               13
Defensive   Win            19               20
            Draw           31               42
            Loss           50               38
Head-on     Win            43               46
            Draw           16               26
            Loss           41               28
Neutral     Win            32               43
            Draw           32               22
            Loss           36               35

Table 5: Time cost to generate an action.

Algorithm                 Time cost (ms)
The proposed algorithm    138
ADP                       142
Minimax-DQN               786


At the 30th second, our aircraft adjusted its roll angle to the right to avoid being locked on by the opponent, as shown in Figure 11(c). After that, it continued to adjust its roll angle, and by the time the opponent noticed that our strategy had changed, our aircraft had already flown out of the sector. In situations where it is difficult to win, the agent will use a safe maneuver strategy to obtain a draw.

Figure 10: Agent evaluation. (a) Number of wins. (b) Number of losses. (c) Average time to win. (d) Average disadvantage time.

Figure 11: Trajectories under the four typical initial situations. (a) Offensive. (b) Defensive (0–21 s). (c) Defensive (21–43 s). (d) Head-on (0–50 s). (e) Head-on (50–100 s). (f) Head-on (100–154 s). (g) Neutral (0–40 s). (h) Neutral (40–74 s).


In the head-on scenario, both sides adopted the same maneuver strategy in the first 50 s, that is, the maximum-overload turn while waiting for opportunities, as shown in Figure 11(d). The crucial decision was made in the 50th second: the opponent was still hovering and waiting for an opportunity, while our aircraft stopped hovering and reduced its turning rate to fly towards the opponent. At the 90th second, the aircraft reached an advantageous situation over its opponent, as shown in Figure 11(e). In the final stage, our aircraft adopted the lead pursuit tactic to establish and maintain the advantage and won, as shown in Figure 11(f).

In the neutral scenario, our initial strategy was to turn away from the opponent to find opportunities. The opponent's strategy was lag pursuit, and it successfully reached the rear of our aircraft at the 31st second, as shown in Figure 11(g). However, after the 40th second, our aircraft made a wise decision to reduce its roll angle, thereby increasing the turning radius. At the 60th second, the disadvantageous situation was terminated, as shown in Figure 11(h). After that, our aircraft increased its roll angle, and the opponent was in a state of maximum-overload right turn, which made it unable to get rid of our aircraft, and it finally lost. It can be seen that, trained by alternate freeze games using the DQN algorithm and selected through the league system, the agent can learn tactics such as lead pursuit, lag pursuit, and hovering and can use them in air combat.

5. Conclusion

In this paper, middleware connecting air combat simulation software and a reinforcement learning agent is developed. It provides an approach for researchers to design different middleware that transforms existing software into reinforcement learning environments, which can expand the application field of reinforcement learning. Maneuver guidance agents with reward shaping are designed. Through training in the environment, an agent can guide an aircraft to fight against its opponent in air combat and reach an advantageous situation in most scenarios. An alternate freeze game algorithm is proposed and combined with RL. It can be used in nonstationary situations where the other players' strategies are variable. Through the league system, the agent with improved performance after iterative training is selected. The league results show that the strategy quality of the selected agent is better than that of the other agents, and the red queen effect is avoided. Agents can learn some typical angle tactics and behaviors in the horizontal plane and perform these tactics by guiding an aircraft to maneuver in one-on-one air combat.

In future work, the problem will be extended to 3D maneuvering with less restrictive vehicle dynamics. The 3D formulation will lead to a larger state space and more complex actions, and the learning mechanism will be improved to deal with them. Beyond an extension to 3D maneuvering, a 2-vs-2 air combat scenario will be established, in which there are collaborators as well as adversaries. One option is to use a centralized controller, which turns the problem into a single-agent DRL problem that obtains the joint action for both aircraft to execute at each time step. The other option is to adopt a decentralized system, in which both agents make decisions for themselves and may cooperate to achieve a common goal.

In theory, the convergence of the alternate freeze game algorithm needs to be analyzed. Furthermore, in order to improve the diversity of potential opponents in a more complex air combat scenario, the league system will be refined.

Nomenclature

(x, y, z):  3-dimensional coordinates of an aircraft
v:  Speed
θ:  Flight path angle
ψ:  Heading angle
β:  Angle of attack
ϕ:  Bank angle
m:  Mass of an aircraft
g:  Acceleration due to gravity
T:  Thrust force
L:  Lift force
D:  Drag force
ψ̇:  Turn rate
ϕ̇:  Roll rate
d_t:  Distance between the aircraft and the target at time t
d_min:  Minimum distance of the advantage position
d_max:  Maximum distance of the advantage position
μ_t:  Deviation angle at time t
μ_max:  Maximum deviation angle of the advantage position
η_t:  Aspect angle at time t
η_max:  Maximum aspect angle of the advantage position
S:  State space
S_t:  State vector at time t
s:  All states in the state space
A:  Action space
A_t:  Action at time t
a:  All actions in the action space
π: S → A:  Policy
R(S_t, A_t, S_{t+1}):  Reward function
R_t:  Scalar reward received at time t
V(s):  Value function
Q(s, a):  Action-value function
Q*(s, a):  Optimal action-value function
Q(S_t, A_t):  Estimated action-value function at time t
γ:  Discount factor
α:  Learning rate
T:  Maximum simulation time in each episode
Δt:  Training time step size
δt:  Simulation time step size
F(S_t, A_t, S_{t+1}):  Shaping reward function
Φ(S_t):  A real-valued function on states
T(S_t):  Termination reward function
D(S_t):  Distance reward function
O(S_t):  Orientation reward function
K:  Number of training periods


Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grants 91338107 and U1836103 and the Development Program of Sichuan, China, under Grants 2017GZDZX0002, 18ZDYF3867, and 19ZDZX0024.

Supplementary Materials

Aircraft dynamics: the derivation of the kinematic and dynamic equations of the aircraft. Code: a simplified version of the environment used in this paper. It does not include the commercial software and middleware, but its aircraft model and other functions are the same as those proposed in this paper. This environment and the RL agent are packaged as supplementary material. Through this material, the alternate freeze game DQN algorithm proposed in this paper can be reproduced. (Supplementary Materials)

References

[1] L. R. Shaw, "Basic fighter maneuvers," in Fighter Combat: Tactics and Maneuvering, M. D. Annapolis, Ed., Naval Institute Press, New York, NY, USA, 1st edition, 1985.
[2] A. Toubman, J. J. Roessingh, P. Spronck, A. Plaat, and J. Herik, Dynamic Scripting with Team Coordination in Air Combat Simulation, Lecture Notes in Computer Science, Kaohsiung, Taiwan, 2014.
[3] A. Toubman, J. J. Roessingh, P. Spronck, A. Plaat, and J. Herik, "Centralized versus decentralized team coordination using dynamic scripting," in Proceedings of the European Modeling and Simulation Symposium, pp. 1–7, Porto, Portugal, 2014.
[4] A. Toubman, J. J. Roessingh, P. Spronck, A. Plaat, and J. Herik, "Rewarding air combat behavior in training simulations," in Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, pp. 1397–1402, Kowloon, China, 2015.
[5] A. Toubman, J. J. Roessingh, P. Spronck, A. Plaat, and J. Herik, "Transfer learning of air combat behavior," in Proceedings of the IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 226–231, Miami, FL, USA, 2015.
[6] D. I. You and D. H. Shim, "Design of an aerial combat guidance law using virtual pursuit point concept," Proceedings of the Institution of Mechanical Engineers, vol. 229, no. 5, pp. 1–22, 2014.
[7] H. Shin, J. Lee, and D. H. Shim, "Design of a virtual fighter pilot and simulation environment for unmanned combat aerial vehicles," in Proceedings of the Advances in Flight Control Systems, pp. 1–21, Grapevine, TX, USA, 2017.
[8] J. M. Eklund, J. Sprinkle, and S. S. Sastry, "Implementing and testing a nonlinear model predictive tracking controller for aerial pursuit/evasion games on a fixed wing aircraft," in Proceedings of the Control System Applications, pp. 1509–1514, Portland, OR, USA, 2005.
[9] J. M. Eklund, J. Sprinkle, and S. S. Sastry, "Switched and symmetric pursuit/evasion games using online model predictive control with application to autonomous aircraft," IEEE Transactions on Control Systems Technology, vol. 20, no. 3, pp. 604–620, 2012.
[10] F. Austin, G. Carbone, M. Falco, H. Hinz, and M. Lewis, Automated Maneuvering Decisions for Air-to-Air Combat, Grumman Corporate Research Center, Bethpage, NY, USA, 1987.
[11] F. Austin, G. Carbone, M. Falco, H. Hinz, and M. Lewis, "Game theory for automated maneuvering during air-to-air combat," Journal of Guidance, Control, and Dynamics, vol. 13, no. 6, pp. 1143–1149, 1990.
[12] Y. Ma, G. Wang, X. Hu, H. Luo, and X. Lei, "Cooperative occupancy decision making of multi-UAV in beyond-visual-range air combat: a game theory approach," IEEE Access, vol. 323, 2019.
[13] J. P. Hespanha, M. Prandini, and S. Sastry, "Probabilistic pursuit-evasion games: a one-step Nash approach," in Proceedings of the 36th IEEE Conference on Decision and Control, Sydney, NSW, Australia, 2000.
[14] A. Xu, Y. Kou, L. Yu, B. Xu, and Y. Lv, "Engagement maneuvering strategy of air combat based on fuzzy Markov game theory," in Proceedings of the IEEE International Conference on Computer, pp. 126–129, Wuhan, China, 2011.
[15] K. Horic and B. A. Conway, "Optimal fighter pursuit-evasion maneuvers found via two-sided optimization," Journal of Guidance, Control, and Dynamics, vol. 29, no. 1, pp. 105–112, 2006.
[16] K. Virtanen, T. Raivio, and R. P. Hamalainen, "Modeling pilot's sequential maneuvering decisions by a multistage influence diagram," Journal of Guidance, Control, and Dynamics, vol. 27, no. 4, pp. 665–677, 2004.
[17] K. Virtanen, J. Karelahti, and T. Raivio, "Modeling air combat by a moving horizon influence diagram game," Journal of Guidance, Control, and Dynamics, vol. 29, no. 5, pp. 1080–1091, 2006.
[18] Q. Pan, D. Zhou, J. Huang et al., "Maneuver decision for cooperative close-range air combat based on state predicted influence diagram," in Proceedings of the Information and Automation for Sustainability, pp. 726–730, Macau, China, 2017.
[19] R. E. Smith, B. A. Dike, R. K. Mehra, B. Ravichandran, and A. El-Fallah, "Classifier systems in combat: two-sided learning of maneuvers for advanced fighter aircraft," Computer Methods in Applied Mechanics and Engineering, vol. 186, no. 2–4, pp. 421–437, 2000.
[20] J. S. McGrew, J. P. How, B. Williams, and N. Roy, "Air-combat strategy using approximate dynamic programming," Journal of Guidance, Control, and Dynamics, vol. 33, no. 5, pp. 1641–1654, 2010.
[21] Y. Ma, X. Ma, and X. Song, "A case study on air combat decision using approximated dynamic programming," Mathematical Problems in Engineering, vol. 2014, 2014.
[22] I. Bisio, C. Garibotto, F. Lavagetto, A. Sciarrone, and S. Zappatore, "Blind detection: advanced techniques for WiFi-based drone surveillance," IEEE Transactions on Vehicular Technology, vol. 68, no. 1, pp. 938–946, 2019.
[23] Y. Li, "Deep reinforcement learning," 2018.
[24] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, London, UK, 2nd edition, 2018.


[25] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[26] V. Mnih, K. Kavukcuoglu, D. Silver et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[27] Z. Kavukcuoglu, R. Liu, Z. Meng, Y. Zhang, Y. Yu, and T. Lu, "On reinforcement learning for full-length game of StarCraft," 2018.
[28] D. Silver and C. J. Maddison, "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[29] D. Hubert and I. Antonoglou, "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play," Science, vol. 362, no. 6419, pp. 1140–1144, 2018.
[30] J. Schrittwieser, J. A. Bagnell, and J. Peters, "Reinforcement learning in robotics: a survey," The International Journal of Robotics Research, vol. 32, no. 11, pp. 1238–1274, 2013.
[31] A. Waldock, C. Greatwood, F. Salama, and T. Richardson, "Learning to perform a perched landing on the ground using deep reinforcement learning," Journal of Intelligent and Robotic Systems, vol. 92, no. 3-4, pp. 685–704, 2018.
[32] R. R. Alejandro, S. Carlos, B. Hriday, P. Paloma, and P. Campoy, "A deep reinforcement learning strategy for UAV autonomous landing on a moving platform," Journal of Intelligent and Robotic Systems, vol. 93, no. 1-2, pp. 351–366, 2019.
[33] W. Lou and X. Guo, "Adaptive trajectory tracking control using reinforcement learning for quadrotor," Journal of Intelligent and Robotic Systems, vol. 13, no. 1, pp. 1–10, 2016.
[34] C. Dunn, J. Valasek, and K. Kirkpatrick, "Unmanned air system search and localization guidance using reinforcement learning," in Proceedings of the AIAA Infotech@Aerospace 2014 Conference, pp. 1–8, Garden Grove, CA, USA, 2014.
[35] S. You, M. Diao, and L. Gao, "Deep reinforcement learning for target searching in cognitive electronic warfare," IEEE Access, vol. 7, pp. 37432–37447, 2019.
[36] P. Luo, J. Xie, and W. Che, "Q-learning based air combat target assignment algorithm," in Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, pp. 779–783, Budapest, Hungary, 2016.
[37] M. Grzes, "Reward shaping in episodic reinforcement learning," in Proceedings of the International Foundation for Autonomous Agents and Multiagent Systems, pp. 565–573, São Paulo, Brazil, 2017.
[38] K. Tummer and A. Agogino, "Agent reward shaping for alleviating traffic congestion," in Proceedings of the International Foundation for Autonomous Agents and Multiagent Systems, pp. 1–7, Hakodate, Japan, 2006.
[39] L. L. B. V. Cruciol, A. C. de Arruda, L. Weigang, L. Li, and A. M. F. Crespo, "Reward functions for learning to control in air traffic flow management," Transportation Research Part C: Emerging Technologies, vol. 35, pp. 141–155, 2013.
[40] P. Hernandez-Leal, M. Kaisers, T. Baarslag, and E. M. de Cote, "A survey of learning in multiagent environments: dealing with non-stationarity," 2019.
[41] M. L. Littman, "Markov games as a framework for multi-agent reinforcement learning," in Proceedings of the International Conference on Machine Learning, pp. 157–163, New Brunswick, NJ, USA, 1994.
[42] M. L. Littman, "Friend-or-foe Q-learning in general-sum games," in Proceedings of the International Conference on Machine Learning, pp. 322–328, Williamstown, MA, USA, 2001.
[43] J. Hu and M. P. Wellman, "Nash Q-learning for general-sum stochastic games," Journal of Machine Learning Research, vol. 4, pp. 1039–1069, 2003.
[44] J. Fan, Z. Wang, Y. Xie, and Z. Yang, "A theoretical analysis of deep Q-learning," 2020.
[45] R. E. Smith, B. A. Dike, B. Ravichandran, A. El-Fallah, and R. Mehra, "Two-sided genetics-based learning to discover novel fighter combat maneuvers," in Lecture Notes in Computer Science, pp. 233–242, Springer, Berlin, Heidelberg, 2001.
[46] A. Ng, D. Harada, and S. Russell, "Policy invariance under reward transformations: theory and application to reward shaping," in Proceedings of the International Conference on Machine Learning, pp. 278–287, Bled, Slovenia, 1999.
[47] T. P. Lillicrap, J. J. Hunt, A. Pritzel et al., "Continuous control with deep reinforcement learning," in Proceedings of the International Conference on Learning Representations, Puerto Rico, USA, 2016.
[48] V. Mnih, A. P. Badia, M. Mirza et al., "Asynchronous methods for deep reinforcement learning," in Proceedings of the International Conference on Machine Learning, pp. 1–19, New York, NY, USA, 2016.
[49] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," 2017.
[50] Commercial Air Combat Simulation Software FG_SimStudio, http://www.eyextent.com/producttail117.
[51] M. Wiering and M. van Otterlo, Reinforcement Learning: State-of-the-Art, Springer Press, Heidelberg, Germany, 2014.
[52] D. P. Kingma and J. Ba, "Adam: a method for stochastic optimization," in Proceedings of the International Conference on Machine Learning, pp. 1–15, San Diego, CA, USA, 2015.


Page 5: ImprovingManeuverStrategyinAirCombatbyAlternateFreeze ...downloads.hindawi.com/journals/mpe/2020/7180639.pdf · 1CollegeofComputerScience,SichuanUniversity,Chengdu,Sichuan610065,China

Q St At( 1113857⟵Q St At( 1113857 + α Rt + cmaxa

Q St+1 a( 1113857 minus Q St At( 11138571113874 1113875

(8)

where Q(St At) is the estimated action-value function and αis the learning rate

In air combat maneuvering applications the reward issparse and only at the end of a game (the terminal state)according to the results of winning losing or drawing areward value is given is makes learning algorithms to usedelayed feedback to determine the long-term consequencesof their actions in all nonterminal states which is timeconsuming Reward shaping is proposed by Ng et al [46] tointroduce additional rewards into the learning processwhich will guide the algorithm towards learning a goodpolicy faster e shaping reward function F(St At St+1) hasthe form

F St At St+1( 1113857 cΦ St+1( 1113857 minusΦ St( 1113857 (9)

where Φ(St) is a real-valued function over states It can beproven that the final policy after using reward shaping isequivalent to the final policy without it [46] Rt in previousequations can be replaced by Rt + F(St At St+1)

Taking the red agent as an example the state space of oneguidance agent can be described with a vector

St xrt y

rt ψ

rt ϕ

rt x

bt y

bt ψb

t1113966 1113967 (10)

e state that an agent received in one training stepincludes the position heading angle and bank angle of itselfand the position heading angle and bank angle of its op-ponent which is a 7-dimensional infinite vector ereforethe function approximate method should be used to com-bine features of state and learned weights DRL is a solutionto address this problem Great achievements have beenmadeusing the advanced DRL algorithms such as DQN [26]DDPG [47] A3C [48] and PPO [49] Since the action ofaircraft guidance is discrete DQN algorithm can be usede action-value function can be estimated with functionapproximation such as 1113954Q(s aw) asymp Qπ(s a) where w arethe weights of a neural network

DQN algorithm is executed according to the trainingtime step However there is not only training time step butalso simulation time step Because of the inconsistencyDQN algorithm needs to be improved to train a guidanceagent e pseudo-code of DQN for aircraft maneuverguidance training in air combat is shown in Algorithm 1 Ateach training step the agent receives information about theaircraft and its opponent and then generates an action andsends it to the environment After receiving the action theaircraft in the environment is maneuvered according to thisaction in each simulation step

23 Middleware Design An RL agent improves its abilitythrough continuous interaction with the environmentHowever there is no maneuver guidance environment forair combat and some combat simulation software can onlyexecute agents not train them In this paper middleware isdesigned and developed based on commercial air combat

simulation software [50] coded in C++ e functions ofmiddleware include interface between software and RLagents coordination of the simulation time step and the RLtraining time step and reward calculation By using it an RLenvironment for aircraft maneuver guidance training in aircombat is performed e guidance agents of both aircraftshave the same structure Figure 4 shows the structure and thetraining process of the red side

In actual or simulated air combat an aircraft perceivesthe situation through multisensors By analyzing the situ-ation after information fusion maneuver decisions are madeby the pilot or the auxiliary decision-making system In thissystem situation perception and decision-making are thesame as in actual air combat An assumption is made that asensor with full situation perception ability is used throughwhich an aircraft can obtain the position and heading angleof its opponent

Maneuver guidance in air combat using RL is an episodictask and there is the notion of episodes of some length wherethe goal is to take the agent from a starting state to a goal state[51] First a flight level with fixed velocity and the one-on-oneair combat scenario is setup in simulation software en thesituation information of both aircrafts is randomly initializedincluding their positions heading angles bank angles and rollrates which constitute the starting state of each agente goalstates are achieved when one aircraft reaches the position ofadvantage and maintains it one aircraft flies out of the sectoror the simulation time is out

Middleware can be used in the training and applicationphases of agents e training phase is the process of makingthe agent from zero to have the maneuver guidance abilitywhich includes training episodes and testing episodes In thetraining episode the agent is trained by the DRL algorithmand its guidance strategy is continuously improved After anumber of training episodes some testing episodes areperformed to verify the ability of the agent in the currentstage in which the strategy does not change e applicationphase is the process of using the well-trained agent tomaneuver guidance in air combat simulation

A training episode is composed of training steps In eachtraining step first the information recognized by the air-borne sensor is sent to the guidance agent through mid-dleware as a tuple of state coded in Python see Steps 1 2 and3 in Figure 4 en the agent generates an action using itsneural networks according to the exploration strategy asshown in Steps 4 and 5 Simulation software receives aguidance instruction transformed by middleware and sendsthe next situation information to middleware see Steps 6and 7 Next middleware transforms the situation infor-mation into state information and calculates reward andsends them to the agent as shown in Steps 2 3 and 8Because agent training is an iterative process all the stepsexcept Step 1 are executed repeatedly in an episode Last thetuple of the current state action reward and the next statein each step is stored in the replay memory for the training ofthe guidance agent see Steps 9 and 10 In each simulationstep the aircraft maneuvers according to the instruction andrecognizes the situation according to its sensorconfiguration

Mathematical Problems in Engineering 5

3 Training Optimization

31 Reward Shaping Two problems need to be solved whenusing the DRL method to train a maneuver guidance agentin air combat One is that the only criterion to evaluate theguidance is the successful arrival to the position of advan-tage which is a sparse reward problem leading to slowconvergence of training Secondly in realistic situations thetime to arrive at the position of advantage and the quality ofthe trajectory need to be considered In this paper a rewardshaping method is proposed to improve the training speedand the performance of the maneuver guidance strategy

Reward shaping [46] is usually used to modify the re-ward function to facilitate learning while maintaining op-timal policy which is a manual endeavour In this studythere are four rules to follow in reward shaping

(i) Limited by the combat area and the maximumnumber of control times the aircraft is guided to theposition of advantage

(ii) e aircraft should be guided closer and closer to itsgoal zone

(iii) e aircraft should be guided away from the goalzone of its opponent

(1) Set parameters of the controlled aircraft and opponent(2) Set training time step size Δt and simulation time step size δt

(3) Initialize maximum number of training episode M and maximum simulation time T in each episode(4) Initialize replay memory D to capacity N

(5) Initialize action-value function Q and target action-value function Qminus with random weights(6) for episode 0 Mdo(7) Initialize state S0(8) for t 0 TΔtdo(9) With probability ϵ select a random action At

(10) Otherwise select At argmaxa

Q(St aw)

(11) for i 0 Δtδtdo(12) Execute action At in air combat simulation software(13) Obtain the positions of aircraft and target(14) if episode terminates then(15) break(16) end if(17) end for(18) Observe reward Rt and state St+1(19) Store transition [St At Rt St+1] in D

(20) if episode terminates then(21) break(22) end if(23) Sample random minibatch of transitions [Sj Aj Rj Sj+1] from D

(24) if episode terminates at step j + 1then(25) set Yj Rj

(26) else(27) set Yj Rj + max

aQminus (Sj+1 awminus )

(28) end if(29) Perform a gradient descent step on (Yj minus Q(Sj Ajwj))

2 with respect to the network parameters w(30) Every C steps reset Qminus Q

(31) end for(32) end for

ALGORITHM 1 DQN [26] for maneuver guidance agent training

1

7

2 6

93 8

4 5

10

Air combat simulation soware

Initialization Visualization

Situation recognitionAir combat

Aircra motion

Loop ∆tδt times

InstructionSituation

Middleware

ActionRewardState

State perception

Action generation

Blueaircra

guidanceagent

LearningTrain

[S0 A0 R0 S1][S1 A1 R1 S2]

[St At Rt St+1]

[STndash1 ATndash1 RTndash1 ST]

Replay memoryRed aircra guidance agent

Store

Figure 4 Structure and training process of the red side

6 Mathematical Problems in Engineering

(iv) e aircraft should be guided to its goal zone inshort time as much as possible

According to the above rules the reward function isdefined as

R St At St+1( 1113857 Rt + F St At St+1( 1113857 (11)

where Rt is the original scalar reward and F(St At St+1) isthe shaping reward function Rt is given by

Rt w1T St+1( 1113857 + w2 (12)

where T(St+1) is the termination reward function e termw1 is a coefficient and w2 is a penalty for time consumptionwhich is a constant

ere are four kinds of termination states the aircraftarrives at the position of advantage the opponent arrives atits position of advantage the aircraft moves out of thecombat area and the maximum number of control times hasbeen reached and each aircraft is still in the combat area andhas not reached its advantage situation Termination rewardis the reward obtained when the next state is terminationwhich is often used in the standard RL training It is definedas

T St+1( 1113857

c1 if win

c2 else if loss

c3 else if out of the sector

c4 else if no control times left

0 else

⎧⎪⎪⎪⎪⎪⎪⎪⎪⎨

⎪⎪⎪⎪⎪⎪⎪⎪⎩

(13)

Usually c1 is a positive value c2 is a negative value andc3 and c4 are nonpositive

e shaping reward function F(St At St+1) has the formas described in (9) e real-value function Φ(St) is anadvantage function of the state e larger the value of thisfunction the more advantageous the current situation isis function can provide additional information to help theagent to select action which is better than only using thetermination reward It is defined as

Φ St( 1113857 D St( 1113857O St( 1113857

100 (14)

where D(St) is the distance reward function and O(St) is theorientation reward function which are defined as

D St( 1113857 expminusdt minus dmax + dmin( 11138571113868111386811138681113868 2

1113868111386811138681113868

180degk1113888 1113889 (15)

O St( 1113857 1 minusμr

t

11138681113868111386811138681113868111386811138681113868 + ηb

t

11138681113868111386811138681113868111386811138681113868

180deg (16)

where k has units of metersdegrees and is used to adjust therelative effect of range and angle and a value of 10 iseffective

32 Alternate Freeze Game DQN for Maneuver GuidanceAgent Training Unlike board and RTS games the data ofhuman players in air combat games are very rare so

pretraining methods using supervised learning algorithmscannot be used We can only assume a strategy used by theopponent and then propose a strategy to defeat it We thenthink about what strategies the opponent will use to confrontour strategy and then optimize our strategy

In this paper alternate freeze learning is used for gamesto solve the nonstationarity problem Both aircrafts in a one-on-one air combat scenario use DRL-based agents to adapttheir maneuver strategies In each training period one agentis learning from scratch while the strategy of its opponentwhich was obtained from the previous training period isfrozen rough games the maneuver guidance perfor-mance of each side rises alternately and different agents withhigh level of maneuver decision-making ability are gener-ated e pseudo-code of alternate freeze game DQN isshown in Algorithm 2

An important effect that must be considered in thisapproach is the red queen effect When one aircraft uses aDRL agent and its opponent uses a static strategy theperformance of the aircraft is absolute with respect to itsopponent However when both aircrafts use DRL agents theperformance of each agent is only related to its currentopponent As a result the trained agent is only suitable for itslatest opponent but cannot deal with the previous ones

rough K training periods of alternate freeze learningK red agents and K + 1 blue agents are saved in the trainingprocess e league system is adopted in which those agentsare regarded as players By using a combination of strategiesfrom multiple opponents the opponentrsquos mixed strategybecomes smoother In different initial scenarios each agentguides the aircraft to confront the aircraft guided by eachopponent e goal is to select one of our strategies throughthe league so that the probability of winning is high and theprobability of losing is low According to the result of thecompetition the agent obtains different points e topagent in the league table is selected as the optimum one

4 Simulation and Results

In this section first random air combat scenarios are ini-tialized in simulation software for maneuver guidance agenttraining and testing Second agents of both sides with re-ward shaping are created and trained using the alternatefreeze game DQN algorithm ird taking the well-trainedagents as players the league system is adopted and the topagent in the league table is selected Last the selected agent isevaluated and the maneuver behavior of the aircraft isanalyzed

41 SimulationSetup e air combat simulation parametersare shown in Table 1 e Adam optimizer [52] is used forlearning the neural network parameters with a learning rateof 5 times 10minus 4 and a discount c 099 e replay memory sizeis 1 times 105 e initial value of ϵ in ϵ-greedy exploration is 1and the final value is 01 For each simulation the rewardshaping parameters c1 c2 c3 and c4 are set to 2 minus2 minus1 andminus1 respectively e terms w1 and w2 are set to 098 and001 respectively e training period K is 10

Mathematical Problems in Engineering 7

We evaluated three neural networks and three mini-batches Each neural network has 3 hidden layers and thenumber of units is 64 64 128 for the small neural network(NN) 128 128 256 for the medium one and 256 256 512for the large one e sizes of minibatch (MB) are 256 512and 1024 respectively ese neural networks and mini-batches are used in pairs to train red agents and play againstthe initial blue agent e simulation result is shown in

Figure 5 First a large minibatch cannot make the trainingsuccessful for small and medium neural networks and itstraining speed is slow for a large network Second using asmall neural network the average reward obtained will belower than that of the other two networks ird if the sameneural network is adopted the training speed using themedium minibatch is faster than using a small one Lastconsidering the computational efficiency the selected net-work has 3 hidden layers with 128 128 and 256 unitsrespectively e selected minibatch size is 512

In order to improve the generality of a single agentinitial positions are randomized in air combat simulation Inthe first iteration of the training process the initial strategy isformed by a randomly initialized neural network for the redagent and the conservative strategy of the maximum-overload turn is adopted by the blue one In the subsequenttraining the training agent is trained from scratch and itsopponent adopts the strategy obtained from the latesttraining

Air combat geometry can be divided into four categories(from red aircraftrsquos perspective) offensive defensive neu-tral and head-on [7] as shown in Figure 6(a) Defining thedeviation angle and aspect angle from minus180deg to 180degFigure 6(b) shows an advantage diagram where the distanceis 300m Smaller |μr| means that the heading or gun of thered aircraft has better aim at its opponent and smaller |ηb|

implies a higher possibility of the blue aircraft being shot orfacing a fatal situation For example in the offensive

(1) Set parameters of both aircrafts(2) Set simulation parameters(3) Set the number of training periods K and the condition for ending each training period Wthreshold(4) Set DRL parameters(5) Set the opponent initialization policy πblue0(6) for period 1 Kdo(7) for aircraft [red blue] do(8) if aircraft red then(9) Set the opponent policy π πblueperiodminus1(10) Initialize neural networks of red agent(11) while Winning ratelt Wthresholddo(12) Train agent using Algorithm 1(13) end while(14) Save the well-trained agent whose maneuver guidance policy is πredperiod(15) else(16) if period K + 1then(17) break(18) else(19) Set the opponent policy π πredperiod(20) Initialize neural networks of blue agent(21) while Winning ratelt Wthresholddo(22) Train agent using Algorithm 1(23) end while(24) Save the well-trained agent whose maneuver guidance policy is πblueperiod(25) end if(26) end if(27) end for(28) end for

ALGORITHM 2 Alternate freeze game DQN for maneuver guidance agent training in air combats

Table 1 Air combat simulation parameters

Red aircraft parametersVelocity 200msMaximum roll angle 75degRoll rate 40deg sInitial position Random position and heading angle

Blue aircraft parametersVelocity 200msMaximum roll angle 75degRoll rate 40deg sInitial position Map center northwardInitial policy (πb

0) Maximum overload turnSimulation parameters

Position of advantage dmax 500m dmin 100mμmax 60deg ηmax 30deg

Combat area 5 km times 5 kmSimulation time T 500 sSimulation time step δt 01 sTraining time step Δt 05 s

8 Mathematical Problems in Engineering

scenario the initial state value of the red agent is largebecause of small |μr| and |ηb| according to equations (14)and (16) erefore in the offensive initial scenario the winprobability of our side will be larger than that of the op-ponent while the defensive initial scenario is reversed Inneutral and head-one initial scenarios the initial state valuesof both sides are almost equal e performance of the well-trained agents is verified according to the classification offour initial situations in the league system

42 Simulation Results e policy naming convention is πbi

or πri which means a blue or red policy produced after i

periods respectively After every 100 training episodes thetest episodes run 100 times and the learning curves areshown in Figure 7 For the first period (πr

1 vs πb0) due to the

simple maneuver policy of the blue aircraft the red aircraftcan achieve almost 100 success rate through about 4000episodes of training as shown in Figure 7(a) At the be-ginning most of the games are draw because of the randominitialization of the red agent and the evasion strategy of itsopponent Another reason is that the aircraft often flies outof the airspace in the early stages of training Although theagent will get a penalty for this the air combat episode willstill get a draw During the training process the red aircraft islooking for winning strategies in the airspace and itswinning rate is constantly increasing However because of

its pursuit of offensive and neglect of defensive the winningrate of its opponent is also rising In the later stage oftraining the red aircraft gradually understands the oppo-nentrsquos strategy and it can achieve a very high winning rate

With the iterations in games both agents have learnt theintelligent maneuver strategy which can guide the aircraft inthe pursuit-evasion game Figure 7(b) shows a typical ex-ample that πb

5 is the trainer and πr5 is its opponent After

about 5000 episodes of training the blue aircraft can reducethe losing rate to a lower level Since the red one adopts anintelligent maneuver guidance strategy the blue agentcannot learn a winning strategy in a short time After about20000 episodes the agent gradually understands the op-ponentrsquos maneuver strategy and its winning rate keepsincreasing e final winning rate is stable between 50 and60 and the losing rate is below 10

e training process iteration of πr10 vs πb

9 is shown inFigure 7(c) Because the maneuver strategy of the blue aircraftis an intelligent strategy which has been trained iterativelytraining for the red agent ismuchmore difficult than that in thefirst iteration e well-trained agent can win more than 60and lose less than 10 in the game against πb

9It is hard to get a higher winning rate in every training

except for the scenario that the opponent is πb0 is is because

both aircrafts are homogeneous the initial situations arerandomized and the opponent in the training environment isintelligent In some situations as long as the opponentrsquosstrategy is intelligent enough the trained agent is almostimpossible to win Interestingly in some scenarios where theopponent is intelligent enough and the aircraft cannot defeat itno matter how the agent operates the agent will guide theaircraft out of the airspace to obtain a draw is gives us aninspiration that is in a scenario where you cannot win it maybe the best way to get out of the combat area

For four initial air combat situations two aircrafts areguided by the agents trained in each iteration stage etypical flight trajectories are shown in Figure 8 Since πb

0 is astatic strategy πr

1 can easily win in offensive head-on andneutral situations as shown in Figures 8(a) 8(c) and 8(d) Inthe defensive situation the red aircraft first turns away fromits opponent who has the advantage and then looks foropportunities to establish an advantage situation ere areno fierce rivalries in the process of combat as shown inFigure 8(b)

Although πr5 is an intelligent strategy well-trained π

b5 can

still gain an advantage in combat as shown in Figures 8(e)ndash8(h) In the offensive situation (from the red aircraftrsquosperspective) the blue aircraft cannot easily get rid of redonersquos following and after fierce confrontation it can es-tablish an advantage situation as shown in Figure 8(e)Figure 8(f) shows the defensive situation that the blueaircraft can keep its advantage until it wins In the other twoscenarios the blue aircraft performs well especially in theinitial situation of neutral which adopts the delayed-turningtactics and successfully wins in air combat

πb9is an intelligent strategy and well-trained πr

10 canachieve more than half of the victory and less than 10 of thefailure However unlike πr

1 vs πb0 π

r10 cannot easily win

especially in head-on situations as shown in Figures 8(i)ndash8(l)

5000 100000Training episodes

ndash2

ndash1

0

1

2Av

erag

e rew

ard

Small NNsmall MBSmall NNmedium MBSmall NNlarge MBMedium NNsmall MBMedium NNmedium MB

Medium NNlarge MBLarge NNsmall MBLarge NNmedium MBLarge NNlarge MB

Figure 5 Evaluation of the neural network and minibatch

Mathematical Problems in Engineering 9

In the offensive situation the blue aircraft tries to get rid ofthe red one by constantly adjusting its bank angle but it isfinally captured by the red one In the defensive situation thered aircraft adopted the circling back tactics cleverly andquickly captured its opponent In the head-on situation thered and blue aircrafts alternately occupy an advantage sit-uation and after fierce maneuver confrontation the redaircraft finally wins In the neutral situation the red oneconstantly adjusts its bank angle according to the opponentrsquosposition and wins

For the above three pairs of strategies the head-on initialsituations are taken as examples to analyze the advantages ofboth sides in the game as shown in Figure 9 For the πr

1 vs πb0

scenario because πb0 has no intelligent maneuver guidance

ability although both sides are equally competitive at first halftime the red aircraft continues to expand its advantages untilit wins as shown in Figure 9(a) When the blue agent learnsthe intelligence strategy it can defeat the red one as shown in

Figure 9(b) For the πr10 vs πb

9 scenario the two sides havealternately established an advantage position and the redaircraft has not won easily as shown in Figure 9(c) Com-bining with the trajectories shown in Figures 8(c) 8(g) and8(k) the maneuver strategies of both sides have been im-proved using the alternate freeze game DQN algorithm eapproach has the benefit of discovering newmaneuver tactics

43 League Results e objective of the league system is toselect a red agent who generates maneuver guidance in-structions according to its strategy so as to achieve the bestresults without knowing the strategy of its opponent ereare ten agents on each side which are saved in the iterationof games Each red agent fights each blue agent 400 timesincluding 100 times for each of four initial scenarios edetailed league results are shown in Table 2 Results with awinloss ratio greater than 1 are highlighted in bold

μr

ηbOffensive

Defensive

Neutral

Head-on

(a)

Offensive

Defensivendash180

0

180

Asp

ect a

ngle

ηb (

degr

ee)

0 180ndash180Deviation angle μr (degree)

(b)

Figure 6 Situation recognition based on deviation angle and aspect angle (a) Four initial air combat situation categories (b) Situationrecognition diagram

1000 2000 3000 40000Training episodes

00

02

04

06

08

10

Perc

enta

ge

Blue win

Red winDraw

(a)

10000 200000Training episodes

00

02

04

06

08

10

Perc

enta

ge

Blue win

Red winDraw

(b)

10000 200000Training episodes

00

02

04

06

08

10

Perc

enta

ge

Blue win

Red winDraw

(c)

Figure 7 Learning curves in the training process (a) πr1 vs πb

0 (b) πb5 vs πr

5 (c) πr10 vs πb

g

10 Mathematical Problems in Engineering

Figure 8: Typical trajectory results in the training iteration of air combat games. (a) π^r_1 vs. π^b_0, offensive. (b) π^r_1 vs. π^b_0, defensive. (c) π^r_1 vs. π^b_0, head-on. (d) π^r_1 vs. π^b_0, neutral. (e) π^b_5 vs. π^r_5, offensive. (f) π^b_5 vs. π^r_5, defensive. (g) π^b_5 vs. π^r_5, head-on. (h) π^b_5 vs. π^r_5, neutral. (i) π^r_10 vs. π^b_9, offensive. (j) π^r_10 vs. π^b_9, defensive. (k) π^r_10 vs. π^b_9, head-on. (l) π^r_10 vs. π^b_9, neutral.

Figure 9: Advantage curves in head-on air combat games (advantage function value of the red and blue aircraft versus training steps). (a) π^r_1 vs. π^b_0. (b) π^b_5 vs. π^r_5. (c) π^r_10 vs. π^b_9.


π^r_i, where i ∈ [1, 10], is trained in an environment with π^b_{i−1} as its opponent, and π^b_j is trained against π^r_j, j ∈ [1, 10]. The result of π^r_1 fighting against π^b_0 is ideal. However, since π^b_0 is a static strategy rather than an intelligent one, π^r_1 does not perform well in the games against the other, intelligent blue agents. For π^r_2 to π^r_10, although there is no special training against π^b_0, the performance is still very good because π^b_0 is a static strategy. In the iterative game process, the trained agents obtain good results in the confrontation with the frozen agents in the environment; for example, π^r_2 has an overwhelming advantage over π^b_1, and likewise π^b_2 over π^r_2.
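The training relationship described above, where each π^r_i learns against the frozen π^b_{i−1} and each π^b_i then learns against the frozen π^r_i, amounts to a simple alternating schedule. The sketch below only enumerates these pairings; the commented `train_against` call is a stand-in for one DQN training period (Algorithm 1), and the policy names are written in plain ASCII.

```python
def alternate_freeze_schedule(K=10):
    """Yield (learner, frozen opponent) pairs in the order used by the alternate
    freeze game: pi_r_i trains vs pi_b_{i-1}, then pi_b_i trains vs pi_r_i."""
    for i in range(1, K + 1):
        yield (f"pi_r_{i}", f"pi_b_{i-1}")   # red learns, blue frozen
        yield (f"pi_b_{i}", f"pi_r_{i}")     # blue learns, red frozen

for learner, frozen in alternate_freeze_schedule():
    # train_against(learner, frozen)  # stand-in for one DQN training period (Algorithm 1)
    print(f"{learner} trains against frozen {frozen}")
```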

Because of the red queen effect, although π^r_10 has an advantage against π^b_9, it is in a weak position against the early blue agents such as π^b_2, π^b_3, and π^b_4. The league table is shown in Table 3, in which 3 points are awarded for a win, 1 point for a draw, and 0 points for a loss. In the early stage of training, the performance of the trained agents gradually improves, and the later trained agents are better than the earlier ones, for example from π^r_1 to π^r_6. As the alternate freeze training goes on, the performance does not keep improving, for example from π^r_7 to π^r_10. The top of the score table is π^r_6, which not only has the highest score but also holds an advantage against all blue agents except π^b_6, as shown in Table 2. In the competition with all the opponents, it wins 44% of the games and remains unbeaten in 75%.
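The league table in Table 3 follows directly from the totals of Table 2 under the 3/1/0 scoring just described. The short sketch below (with helper names of our own choosing) reproduces the ranking; π^r_6's row from Table 3 indeed yields the top score of 7112.

```python
def league_score(wins, draws, losses):
    """3 points per win, 1 per draw, 0 per loss (Table 3 scoring)."""
    return 3 * wins + 1 * draws + 0 * losses

def rank_agents(totals):
    """totals: {agent: (wins, draws, losses)} -> list of (agent, score), best first."""
    return sorted(((a, league_score(*wdl)) for a, wdl in totals.items()),
                  key=lambda item: item[1], reverse=True)

# Row totals taken from Table 3; pi_r_6 comes out on top with 7112 points.
totals = {"pi_r_6": (1913, 1373, 1114), "pi_r_7": (1858, 1323, 1219),
          "pi_r_1": (1362, 1332, 1706)}
print(rank_agents(totals))   # [('pi_r_6', 7112), ('pi_r_7', 6897), ('pi_r_1', 5418)]
```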

In order to verify the performance of this agent, it is confronted with opponent agents trained by ADP [20] and Minimax-DQN [44]. Each of the four typical initial situations is performed 100 times against both opponents, and the results are shown in Table 4. The time cost to generate an action for each algorithm is shown in Table 5. It can be found that the performance of the agent presented in this paper is comparable to that of ADP, and the computational time is slightly reduced. However, ADP is a model-based method, which is assumed to know all the information, such as the roll rate of the opponent. Compared with Minimax-DQN, the agent has an overall advantage. This is because Minimax-DQN assumes that the opponent is rational, while the opponent in this paper is not, which is closer to the real world. In addition, compared with Minimax-DQN, the computational efficiency of the proposed algorithm is significantly improved, because Minimax-DQN uses linear programming to select the optimal action at each step, which is time consuming.
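Per-action time costs of the kind reported in Table 5 can be measured with a simple timing harness such as the hedged sketch below; the policy used here is a dummy stand-in rather than the trained agent, and the measured value is only illustrative.

```python
import time

def measure_action_latency(select_action, states, repeats=1000):
    """Average wall-clock time (ms) that select_action needs per decision,
    in the spirit of Table 5. select_action and states are placeholders."""
    start = time.perf_counter()
    for i in range(repeats):
        select_action(states[i % len(states)])
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / repeats

# Dummy policy standing in for the trained agent's argmax-Q action selection.
dummy_policy = lambda state: max(range(3), key=lambda a: hash((state, a)) % 97)
print(f"{measure_action_latency(dummy_policy, states=[(0.0, 1.0), (2.0, 3.0)]):.3f} ms per action")
```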

4.4. Agent Evaluation and Behavior Analysis. Evaluating agent performance in air combat is not simply a matter of winning or losing. In addition to the winning and losing rates, two criteria are added to represent the success level: the average time to win (ATW) and the average disadvantage time (ADT).

Table 2: League results (wins, draws, and losses of each red agent against each blue agent over 400 combats per pairing).

             π^b_0  π^b_1  π^b_2  π^b_3  π^b_4  π^b_5  π^b_6  π^b_7  π^b_8  π^b_9  π^b_10
π^r_1   Win    393     17     83     60    137     75     96    126    119    133    123
        Draw     7    103    213    198     95    128    131    117    113    113    114
        Loss     0    280    104    142    168    197    173    157    168    154    163
π^r_2   Win    384    267      4     82     78     94     88    135    142    161    152
        Draw    16    120    128    178    143    137    157     76    100     87     94
        Loss     0     13    268    140    179    169    155    189    158    152    154
π^r_3   Win    387    185    253      5     82     97    103    123    134    154    131
        Draw    11    104    131    115    194    148    145    132     84    102     97
        Loss     2    111     16    280    124    155    152    145    182    144    172
π^r_4   Win    381    172    182    237     31     92    137    154    149    167    158
        Draw    19    123    102    117    110    151    101     97    144     92    104
        Loss     0    105    116     46    259    157    162    149    107    141    138
π^r_5   Win    382    164    172    187    221     29     95    137    168    154    162
        Draw    18    122    113    142    139    144    118    103    121    143    113
        Loss     0    114    115     71     40    227    187    160    111    103    125
π^r_6   Win    380    162    177    193    197    217     52    106    134    152    143
        Draw    19    105     98    132    101    134    139    194    176    131    144
        Loss     1    133    125     75    102     49    209    100     90    117    113
π^r_7   Win    383    169    169    180    165    187    210     56     82    122    133
        Draw    14     98    111    105    123    134    133    137    180    132    156
        Loss     3    133    120    115    112     79     57    207    138    144    111
π^r_8   Win    383    154    157    172    168    177    182    213     36    107    114
        Draw    16    123    109    114    132    102     97    134    141    124    138
        Loss     1    123    134    114    100    121    121     53    223    169    148
π^r_9   Win    387    140    162    157    154    162    179    182    215     37    102
        Draw    11    133     97    105    127    131     97    104    149    125    147
        Loss     2    127    141    138    119    107    124    114     36    238    151
π^r_10  Win    379    155    146    147    143    159    167    169    170    219     42
        Draw    10    102     84     92    105     97    131    104     85    140    131
        Loss    11    143    170    161    152    144    102    127    145     41    227


ATW is measured as the average elapsed time required to maneuver to the advantage position in the scenarios where our aircraft wins; a smaller ATW is better than a larger one. ADT is the average accumulated time during which the advantage function value of our aircraft is less than that of the opponent, and it is used as a criterion to evaluate the risk exposure to the adversary's weapons.
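Both criteria can be computed directly from per-step simulation logs. The sketch below shows one possible implementation, assuming each episode log records whether our aircraft won, the number of control steps, and the pair of advantage values at every step; the log format, the names, and the Δt = 0.5 s step size (taken from Table 1) are assumptions of this sketch rather than details given with the metric definitions.

```python
def average_time_to_win(episodes, dt=0.5):
    """ATW: mean elapsed time (s) of the episodes our aircraft won.
    episodes: list of dicts with 'won' (bool) and 'steps' (number of control steps)."""
    times = [ep["steps"] * dt for ep in episodes if ep["won"]]
    return sum(times) / len(times) if times else float("nan")

def average_disadvantage_time(episodes, dt=0.5):
    """ADT: mean accumulated time (s) during which our advantage value is
    below the opponent's, averaged over all episodes."""
    per_episode = []
    for ep in episodes:
        disadvantaged = sum(dt for ours, theirs in ep["advantage_log"] if ours < theirs)
        per_episode.append(disadvantaged)
    return sum(per_episode) / len(per_episode) if per_episode else float("nan")

# Tiny fabricated example: one 3-step win that spends one step at a disadvantage.
episodes = [{"won": True, "steps": 3,
             "advantage_log": [(0.1, 0.2), (0.3, 0.1), (0.4, 0.1)]}]
print(average_time_to_win(episodes), average_disadvantage_time(episodes))
```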

A thousand air combat scenarios are generated in the simulation software, and the opponent in each scenario is randomly chosen from the ten blue agents. Each well-trained red agent is used to perform these confrontations, and the results are shown in Figure 10, where agent number i denotes the policy trained after i periods. From π^r_1 to π^r_5, the performance of the agents gradually improves. After π^r_5, the performance stabilizes and each agent has different characteristics. Comparing π^r_5 to π^r_10, the winning and losing rates are almost the same as those in Table 3. The winning rates of the six agents are similar, with the highest, π^r_5, winning 6% more than the lowest, π^r_10, as shown in Figure 10(a). However, π^r_6 is the agent with the lowest losing rate, which is more than 10% lower than that of the other agents, as shown in Figure 10(b). For ATW, π^r_7 needs less than 100 s, π^r_6 needs 108.7 s, and the other agents need more than 110 s, as shown in Figure 10(c). For ADT, π^r_5, π^r_6, and π^r_8 are below 120 s, π^r_7 is above 130 s, and π^r_9 and π^r_10 are between 120 and 130 s, as shown in Figure 10(d).

In summary, with the increase in the number of iterations, the red queen effect appears clearly. The performance of π^r_8, π^r_9, and π^r_10 is no better than that of π^r_6 and π^r_7. Although the winning rate of π^r_7 is not high and it spends more time at a disadvantage, it can always win quickly; it is an aggressive agent and can be used in scenarios where our aircraft needs to beat the opponent quickly. The winning rate of π^r_6 is similar to that of the other agents, while its losing rate is the lowest; its ATW is only higher than that of π^r_7, and its ADT is the lowest. In most cases, it can be selected to achieve more victories while keeping itself safe. π^r_6 is an agent with good comprehensive performance, which means that the selection method of the league system is effective.

For the four initial scenarios, air combat simulations using π^r_6 are selected for behavior analysis, as shown in Figure 11. In the offensive scenario, the opponent tried to escape by turning in the opposite direction. Our aircraft used a lag pursuit tactic, which is simple but effective. At the 1st second, our aircraft did not keep up with the opponent but chose to fly straight. At the 9th second, it gradually turned towards the opponent, established and maintained the advantage, and finally won, as shown in Figure 11(a).

In the defensive scenario, our aircraft was at a disadvantageous position, and the opponent was trying to lock onto us with a lag pursuit tactic, as shown in Figure 11(b).

Table 3: League table.

Rank  Agent (strategy)  Win   Draw  Loss  Score
1     π^r_6             1913  1373  1114  7112
2     π^r_7             1858  1323  1219  6897
3     π^r_5             1871  1276  1253  6889
4     π^r_9             1877  1226  1297  6857
5     π^r_8             1863  1230  1307  6819
6     π^r_10            1896  1081  1423  6769
7     π^r_4             1860  1160  1380  6740
8     π^r_3             1654  1263  1483  6225
9     π^r_2             1587  1236  1577  5997
10    π^r_1             1362  1332  1706  5418

Table 4: Algorithm comparison.

Scenario   Result  vs. ADP [20]  vs. Minimax-DQN [44]
Offensive  Win     57            66
           Draw    28            21
           Loss    15            13
Defensive  Win     19            20
           Draw    31            42
           Loss    50            38
Head-on    Win     43            46
           Draw    16            26
           Loss    41            28
Neutral    Win     32            43
           Draw    32            22
           Loss    36            35

Table 5: Time cost to generate an action.

Algorithm                Time cost (ms)
The proposed algorithm   138
ADP                      142
Minimax-DQN              786


At the 30th second, our aircraft adjusted its roll angle to the right to avoid being locked by the opponent, as shown in Figure 11(c). After that, it continued to adjust the roll angle, and when the opponent noticed that our strategy had changed, our aircraft had already flown out of the sector. In situations where it is difficult to win, the agent uses a safe maneuver strategy to obtain a draw.

Figure 10: Agent evaluation (horizontal axis: agent number). (a) Number of wins. (b) Number of losses. (c) Average time to win (s). (d) Average disadvantage time (s).

Figure 11: Trajectories under four typical initial situations (numbers along the trajectories indicate elapsed time in seconds). (a) Offensive. (b) Defensive (0–21 s). (c) Defensive (21–43 s). (d) Head-on (0–50 s). (e) Head-on (50–100 s). (f) Head-on (100–154 s). (g) Neutral (0–40 s). (h) Neutral (40–74 s).


In the head-on scenario, both sides adopted the same maneuver strategy in the first 50 s, that is, the maximum-overload turn while waiting for opportunities, as shown in Figure 11(d). The crucial decision was made in the 50th second: the opponent was still hovering and waiting for an opportunity, while our aircraft stopped hovering and reduced its turning rate to fly towards the opponent. At the 90th second, the aircraft reached an advantageous situation over its opponent, as shown in Figure 11(e). In the final stage, our aircraft adopted the lead pursuit tactic to establish and maintain the advantage and won, as shown in Figure 11(f).

In the neutral scenario, our initial strategy was to turn away from the opponent to look for opportunities. The opponent's strategy was lag pursuit, and it successfully reached the rear of our aircraft at the 31st second, as shown in Figure 11(g). However, after the 40th second, our aircraft made a wise decision to reduce its roll angle, thereby increasing the turning radius. At the 60th second, the disadvantageous situation was terminated, as shown in Figure 11(h). After that, our aircraft increased its roll angle, and the opponent, being in a state of maximum-overload right turn, was unable to get rid of our aircraft and finally lost. It can be seen that, trained by alternate freeze games using the DQN algorithm and selected through the league system, the agent can learn tactics such as lead pursuit, lag pursuit, and hovering, and can use them in air combat.

5. Conclusion

In this paper, middleware connecting air combat simulation software and reinforcement learning agents is developed. It provides an approach for researchers to design different middleware that transforms existing software into reinforcement learning environments, which can expand the application field of reinforcement learning. Maneuver guidance agents with reward shaping are designed. Through training in the environment, an agent can guide an aircraft to fight against its opponent in air combat and reach an advantageous situation in most scenarios. An alternate freeze game algorithm is proposed and combined with RL; it can be used in nonstationary situations where the other players' strategies are variable. Through the league system, the agent with improved performance after iterative training is selected. The league results show that the strategy quality of the selected agent is better than that of the other agents and that the red queen effect is avoided. Agents can learn some typical angle tactics and behaviors in the horizontal plane and perform these tactics by guiding an aircraft to maneuver in one-on-one air combat.

In future works, the problem would be extended to 3D maneuvering with less restrictive vehicle dynamics. The 3D formulation will lead to a larger state space and more complex actions, and the learning mechanism would be improved to deal with them. Beyond an extension to 3D maneuvering, a 2-vs-2 air combat scenario would be established, in which there are collaborators as well as adversaries. One option is to use a centralized controller, which turns the problem into a single-agent DRL problem that obtains the joint action of both aircraft to execute in each time step. The other option is to adopt a decentralized system, in which both agents make decisions for themselves and might cooperate to achieve a common goal.

In theory, the convergence of the alternate freeze game algorithm needs to be analyzed. Furthermore, in order to improve the diversity of potential opponents in a more complex air combat scenario, the league system would be refined.

Nomenclature

(x, y, z): 3-dimensional coordinates of an aircraft
v: Speed
θ: Flight path angle
ψ: Heading angle
β: Angle of attack
ϕ: Bank angle
m: Mass of an aircraft
g: Acceleration due to gravity
T: Thrust force
L: Lift force
D: Drag force
ψ̇: Turn rate
ϕ̇: Roll rate
d_t: Distance between the aircraft and the target at time t
d_min: Minimum distance of the advantage position
d_max: Maximum distance of the advantage position
μ_t: Deviation angle at time t
μ_max: Maximum deviation angle of the advantage position
η_t: Aspect angle at time t
η_max: Maximum aspect angle of the advantage position
S: State space
S_t: State vector at time t
s: All states in the state space
A: Action space
A_t: Action at time t
a: All actions in the action space
π: S → A, policy
R(S_t, A_t, S_{t+1}): Reward function
R_t: Scalar reward received at time t
V(s): Value function
Q(s, a): Action-value function
Q*(s, a): Optimal action-value function
Q(S_t, A_t): Estimated action-value function at time t
γ: Discount factor
α: Learning rate
T: Maximum simulation time in each episode
Δt: Training time step size
δt: Simulation time step size
F(S_t, A_t, S_{t+1}): Shaping reward function
Φ(S_t): A real-valued function on states
T(S_t): Termination reward function
D(S_t): Distance reward function
O(S_t): Orientation reward function
K: Number of training periods


Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grants 91338107 and U1836103 and the Development Program of Sichuan, China, under Grants 2017GZDZX0002, 18ZDYF3867, and 19ZDZX0024.

Supplementary Materials

Aircraft dynamics: the derivation of the kinematic and dynamic equations of the aircraft. Code: a simplified version of the environment used in this paper; it does not include the commercial software and the middleware, but its aircraft model and other functions are the same as those proposed in this paper. This environment and the RL agent are packaged as a supplementary material. Through this material, the alternate freeze game DQN algorithm proposed in this paper can be reproduced. (Supplementary Materials)

References

[1] L. R. Shaw, "Basic fighter maneuvers," in Fighter Combat: Tactics and Maneuvering, M. D. Annapolis, Ed., Naval Institute Press, New York, NY, USA, 1st edition, 1985.
[2] A. Toubman, J. J. Roessingh, P. Spronck, A. Plaat, and J. Herik, Dynamic Scripting with Team Coordination in Air Combat Simulation, Lecture Notes in Computer Science, Kaohsiung, Taiwan, 2014.
[3] A. Toubman, J. J. Roessingh, P. Spronck, A. Plaat, and J. Herik, "Centralized versus decentralized team coordination using dynamic scripting," in Proceedings of the European Modeling and Simulation Symposium, pp. 1–7, Porto, Portugal, 2014.
[4] A. Toubman, J. J. Roessingh, P. Spronck, A. Plaat, and J. Herik, "Rewarding air combat behavior in training simulations," in Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, pp. 1397–1402, Kowloon, China, 2015.
[5] A. Toubman, J. J. Roessingh, P. Spronck, A. Plaat, and J. Herik, "Transfer learning of air combat behavior," in Proceedings of the IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 226–231, Miami, FL, USA, 2015.
[6] D. I. You and D. H. Shim, "Design of an aerial combat guidance law using virtual pursuit point concept," Proceedings of the Institution of Mechanical Engineers, vol. 229, no. 5, pp. 1–22, 2014.
[7] H. Shin, J. Lee, and D. H. Shim, "Design of a virtual fighter pilot and simulation environment for unmanned combat aerial vehicles," in Proceedings of the Advances in Flight Control Systems, pp. 1–21, Grapevine, TX, USA, 2017.
[8] J. M. Eklund, J. Sprinkle, and S. S. Sastry, "Implementing and testing a nonlinear model predictive tracking controller for aerial pursuit/evasion games on a fixed wing aircraft," in Proceedings of the Control System Applications, pp. 1509–1514, Portland, OR, USA, 2005.
[9] J. M. Eklund, J. Sprinkle, and S. S. Sastry, "Switched and symmetric pursuit/evasion games using online model predictive control with application to autonomous aircraft," IEEE Transactions on Control Systems Technology, vol. 20, no. 3, pp. 604–620, 2012.
[10] F. Austin, G. Carbone, M. Falco, H. Hinz, and M. Lewis, Automated Maneuvering Decisions for Air-to-Air Combat, Grumman Corporate Research Center, Bethpage, NY, USA, 1987.
[11] F. Austin, G. Carbone, M. Falco, H. Hinz, and M. Lewis, "Game theory for automated maneuvering during air-to-air combat," Journal of Guidance, Control, and Dynamics, vol. 13, no. 6, pp. 1143–1149, 1990.
[12] Y. Ma, G. Wang, X. Hu, H. Luo, and X. Lei, "Cooperative occupancy decision making of multi-UAV in beyond-visual-range air combat: a game theory approach," IEEE Access, vol. 323, 2019.
[13] J. P. Hespanha, M. Prandini, and S. Sastry, "Probabilistic pursuit-evasion games: a one-step Nash approach," in Proceedings of the 36th IEEE Conference on Decision and Control, Sydney, NSW, Australia, 2000.

[14] A. Xu, Y. Kou, L. Yu, B. Xu, and Y. Lv, "Engagement maneuvering strategy of air combat based on fuzzy Markov game theory," in Proceedings of the IEEE International Conference on Computer, pp. 126–129, Wuhan, China, 2011.
[15] K. Horic and B. A. Conway, "Optimal fighter pursuit-evasion maneuvers found via two-sided optimization," Journal of Guidance, Control, and Dynamics, vol. 29, no. 1, pp. 105–112, 2006.
[16] K. Virtanen, T. Raivio, and R. P. Hamalainen, "Modeling pilot's sequential maneuvering decisions by a multistage influence diagram," Journal of Guidance, Control, and Dynamics, vol. 27, no. 4, pp. 665–677, 2004.
[17] K. Virtanen, J. Karelahti, and T. Raivio, "Modeling air combat by a moving horizon influence diagram game," Journal of Guidance, Control, and Dynamics, vol. 29, no. 5, pp. 1080–1091, 2006.
[18] Q. Pan, D. Zhou, J. Huang et al., "Maneuver decision for cooperative close-range air combat based on state predicted influence diagram," in Proceedings of the Information and Automation for Sustainability, pp. 726–730, Macau, China, 2017.
[19] R. E. Smith, B. A. Dike, R. K. Mehra, B. Ravichandran, and A. El-Fallah, "Classifier systems in combat: two-sided learning of maneuvers for advanced fighter aircraft," Computer Methods in Applied Mechanics and Engineering, vol. 186, no. 2–4, pp. 421–437, 2000.
[20] J. S. McGrew, J. P. How, B. Williams, and N. Roy, "Air-combat strategy using approximate dynamic programming," Journal of Guidance, Control, and Dynamics, vol. 33, no. 5, pp. 1641–1654, 2010.
[21] Y. Ma, X. Ma, and X. Song, "A case study on air combat decision using approximated dynamic programming," Mathematical Problems in Engineering, vol. 2014, 2014.
[22] I. Bisio, C. Garibotto, F. Lavagetto, A. Sciarrone, and S. Zappatore, "Blind detection: advanced techniques for WiFi-based drone surveillance," IEEE Transactions on Vehicular Technology, vol. 68, no. 1, pp. 938–946, 2019.
[23] Y. Li, "Deep reinforcement learning," 2018.
[24] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, London, UK, 2nd edition, 2018.


[25] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[26] V. Mnih, K. Kavukcuoglu, D. Silver et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[27] Z. Kavukcuoglu, R. Liu, Z. Meng, Y. Zhang, Y. Yu, and T. Lu, "On reinforcement learning for full-length game of StarCraft," 2018.
[28] D. Silver and C. J. Maddison, "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[29] D. Hubert and I. Antonoglou, "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play," Science, vol. 362, no. 6419, pp. 1140–1144, 2018.
[30] J. Schrittwieser, J. A. Bagnell, and J. Peters, "Reinforcement learning in robotics: a survey," The International Journal of Robotics Research, vol. 32, no. 11, pp. 1238–1274, 2013.
[31] A. Waldock, C. Greatwood, F. Salama, and T. Richardson, "Learning to perform a perched landing on the ground using deep reinforcement learning," Journal of Intelligent and Robotic Systems, vol. 92, no. 3-4, pp. 685–704, 2018.
[32] R. R. Alejandro, S. Carlos, B. Hriday, P. Paloma, and P. Campoy, "A deep reinforcement learning strategy for UAV autonomous landing on a moving platform," Journal of Intelligent and Robotic Systems, vol. 93, no. 1-2, pp. 351–366, 2019.
[33] W. Lou and X. Guo, "Adaptive trajectory tracking control using reinforcement learning for quadrotor," Journal of Intelligent and Robotic Systems, vol. 13, no. 1, pp. 1–10, 2016.
[34] C. Dunn, J. Valasek, and K. Kirkpatrick, "Unmanned air system search and localization guidance using reinforcement learning," in Proceedings of the AIAA Infotech Aerospace 2014 Conference, pp. 1–8, Garden Grove, CA, USA, 2014.
[35] S. You, M. Diao, and L. Gao, "Deep reinforcement learning for target searching in cognitive electronic warfare," IEEE Access, vol. 7, pp. 37432–37447, 2019.
[36] P. Luo, J. Xie, and W. Che, "Q-learning based air combat target assignment algorithm," in Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, pp. 779–783, Budapest, Hungary, 2016.
[37] M. Grzes, "Reward shaping in episodic reinforcement learning," in Proceedings of the International Foundation for Autonomous Agents and Multiagent Systems, pp. 565–573, São Paulo, Brazil, 2017.
[38] K. Tummer and A. Agogino, "Agent reward shaping for alleviating traffic congestion," in Proceedings of the International Foundation for Autonomous Agents and Multiagent Systems, pp. 1–7, Hakodate, Japan, 2006.
[39] L. L. B. V. Cruciol, A. C. de Arruda, L. Weigang, L. Li, and A. M. F. Crespo, "Reward functions for learning to control in air traffic flow management," Transportation Research Part C: Emerging Technologies, vol. 35, pp. 141–155, 2013.

[40] P. Hernandez-Leal, M. Kaisers, T. Baarslag, and E. M. de Cote, "A survey of learning in multiagent environments: dealing with non-stationarity," 2019.
[41] M. L. Littman, "Markov games as a framework for multi-agent reinforcement learning," in Proceedings of the International Conference on Machine Learning, pp. 157–163, New Brunswick, NJ, USA, 1994.
[42] M. L. Littman, "Friend-or-foe Q-learning in general-sum games," in Proceedings of the International Conference on Machine Learning, pp. 322–328, Williamstown, MA, USA, 2001.
[43] J. Hu and M. P. Wellman, "Nash Q-learning for general-sum stochastic games," Journal of Machine Learning Research, vol. 4, pp. 1039–1069, 2003.
[44] J. Fan, Z. Wang, Y. Xie, and Z. Yang, "A theoretical analysis of deep Q-learning," 2020.
[45] R. E. Smith, B. A. Dike, B. Ravichandran, A. Fallah, and R. Mehra, "Two-sided genetics-based learning to discover novel fighter combat maneuvers," in Lecture Notes in Computer Science, pp. 233–242, Springer, Berlin, Heidelberg, 2001.
[46] A. Ng, D. Harada, and S. Russell, "Policy invariance under reward transformations: theory and application to reward shaping," in Proceedings of the International Conference on Machine Learning, pp. 278–287, Bled, Slovenia, 1999.
[47] T. P. Lillicrap, J. J. Hunt, A. Pritzel et al., "Continuous control with deep reinforcement learning," in Proceedings of the International Conference on Learning Representations, Puerto Rico, USA, 2016.
[48] V. Mnih, A. P. Badia, M. Mirza et al., "Asynchronous methods for deep reinforcement learning," in Proceedings of the International Conference on Machine Learning, pp. 1–19, New York, NY, USA, 2016.
[49] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," 2017.
[50] Commercial Air Combat Simulation Software FG_SimStudio, http://www.eyextent.com/producttail117.
[51] M. Wiering and M. van Otterlo, Reinforcement Learning: State-of-the-Art, Springer Press, Heidelberg, Germany, 2014.
[52] D. P. Kingma and J. Ba, "Adam: a method for stochastic optimization," in Proceedings of the International Conference on Learning Representations, pp. 1–15, San Diego, CA, USA, 2015.


still gain an advantage in combat as shown in Figures 8(e)ndash8(h) In the offensive situation (from the red aircraftrsquosperspective) the blue aircraft cannot easily get rid of redonersquos following and after fierce confrontation it can es-tablish an advantage situation as shown in Figure 8(e)Figure 8(f) shows the defensive situation that the blueaircraft can keep its advantage until it wins In the other twoscenarios the blue aircraft performs well especially in theinitial situation of neutral which adopts the delayed-turningtactics and successfully wins in air combat

πb9is an intelligent strategy and well-trained πr

10 canachieve more than half of the victory and less than 10 of thefailure However unlike πr

1 vs πb0 π

r10 cannot easily win

especially in head-on situations as shown in Figures 8(i)ndash8(l)

5000 100000Training episodes

ndash2

ndash1

0

1

2Av

erag

e rew

ard

Small NNsmall MBSmall NNmedium MBSmall NNlarge MBMedium NNsmall MBMedium NNmedium MB

Medium NNlarge MBLarge NNsmall MBLarge NNmedium MBLarge NNlarge MB

Figure 5 Evaluation of the neural network and minibatch

Mathematical Problems in Engineering 9

In the offensive situation the blue aircraft tries to get rid ofthe red one by constantly adjusting its bank angle but it isfinally captured by the red one In the defensive situation thered aircraft adopted the circling back tactics cleverly andquickly captured its opponent In the head-on situation thered and blue aircrafts alternately occupy an advantage sit-uation and after fierce maneuver confrontation the redaircraft finally wins In the neutral situation the red oneconstantly adjusts its bank angle according to the opponentrsquosposition and wins

For the above three pairs of strategies the head-on initialsituations are taken as examples to analyze the advantages ofboth sides in the game as shown in Figure 9 For the πr

1 vs πb0

scenario because πb0 has no intelligent maneuver guidance

ability although both sides are equally competitive at first halftime the red aircraft continues to expand its advantages untilit wins as shown in Figure 9(a) When the blue agent learnsthe intelligence strategy it can defeat the red one as shown in

Figure 9(b) For the πr10 vs πb

9 scenario the two sides havealternately established an advantage position and the redaircraft has not won easily as shown in Figure 9(c) Com-bining with the trajectories shown in Figures 8(c) 8(g) and8(k) the maneuver strategies of both sides have been im-proved using the alternate freeze game DQN algorithm eapproach has the benefit of discovering newmaneuver tactics

43 League Results e objective of the league system is toselect a red agent who generates maneuver guidance in-structions according to its strategy so as to achieve the bestresults without knowing the strategy of its opponent ereare ten agents on each side which are saved in the iterationof games Each red agent fights each blue agent 400 timesincluding 100 times for each of four initial scenarios edetailed league results are shown in Table 2 Results with awinloss ratio greater than 1 are highlighted in bold

μr

ηbOffensive

Defensive

Neutral

Head-on

(a)

Offensive

Defensivendash180

0

180

Asp

ect a

ngle

ηb (

degr

ee)

0 180ndash180Deviation angle μr (degree)

(b)

Figure 6 Situation recognition based on deviation angle and aspect angle (a) Four initial air combat situation categories (b) Situationrecognition diagram

1000 2000 3000 40000Training episodes

00

02

04

06

08

10

Perc

enta

ge

Blue win

Red winDraw

(a)

10000 200000Training episodes

00

02

04

06

08

10

Perc

enta

ge

Blue win

Red winDraw

(b)

10000 200000Training episodes

00

02

04

06

08

10

Perc

enta

ge

Blue win

Red winDraw

(c)

Figure 7 Learning curves in the training process (a) πr1 vs πb

0 (b) πb5 vs πr

5 (c) πr10 vs πb

g

10 Mathematical Problems in Engineering

0 03

3

6

6

9

9

12

12

12

16

(a)

133133

120 120

105105

90

9075

7560

60

45

45

30

30

15

15

00

(b)

0 016

16

32

32

48

48

64

64

80

80

(c)

00

33 6

6 99

1313

(d)

150

150

300300

30

30

240

2409060

60

120

180

180

270120

0 0270

210

210

90

(e)

0 0 2 2 44 6

6 88

(f )

40

120

120 0 0

100

100

14060

60

80

20

190

190160

16040

140

20 80

(g)

50 5040

40

30

30

20

2010

10

0

0

(h)

0

10

10

20

30

30

102

102

20

40

40

50

50

6060

7070

80

800

90

90

(i)

0 0

6

6 12

12

18

18

24

24

30

30

(j)

50

50

466466300

300150200

150350

00

250

250 400400

100

100

200

350

(k)

00 16

16

32

32

4848

64

64

79

79

(l)

Figure 8 Typical trajectory results in the training iteration of air combat games (a) πr1 vs πb

0 offensive (b) πr1 vs πb

0 defensive (c) πr1 vs πb

0head-on (d) πr

1 vs πb0 neutral (e) πb

5 vs πr5 offensive (f ) πb

5 vs πr5 defensive (g) πb

5 vs πr5 head-on (h) πb

5 vs πr5 neutral (i) πr

10 vs πbg

offensive (j) πr10 vs π

bg defensive (k) π

r10 vs π

bg head-on (l) π

r10 vs π

bg neutral

25 50 750Training steps

ndash050

ndash025

000

025

050

Adva

ntag

e fun

ctio

n va

lue

Red valueBlue value

(a)

100 2000Training steps

ndash050

ndash025

000

025

050

Adva

ntag

e fun

ctio

n va

lue

Red valueBlue value

(b)

200 4000Training steps

ndash05

00

05

Adva

ntag

e fun

ctio

n va

lue

Red valueBlue value

(c)

Figure 9 Advantage curves in head-on air combat games (a) πr1 vs πb

0 (b) πb5 vs πr

5 (c) πr10 vs πb

g

Mathematical Problems in Engineering 11

πri where i isin [1 10] is trained in the environment with

πbiminus1 as its opponent and πb

j is trained against πrj

j isin [1 10] e result of πr1 fighting with πb

0 is idealHowever since πb

0 is a static strategy rather than an in-telligent strategy πr

1 does not perform well in the gamewith other intelligent blue agents For πr

2 to πr10 although

there is no special training for against πb0 the performance

is still very good because πb0 is a static strategy In the

iterative game process the trained agents can get goodresults in the confrontation with the frozen agents in theenvironment For example πr

2 has an overwhelming ad-vantage over πb

1 so does πb2 and πr

2Because of the red queen effect although πr

10 has an ad-vantage against πb

9 it is in a weak position against the early blueagents such as πb

2 πb3 and πb

4 e league table is shown inTable 3 in which 3 points are for win 1 point for draw and 0points for loss In the early stage of training the performance ofthe trained agents will be gradually improvede later trainedagents are better than the former ones such as from πr

1 to πr6

As the alternate freeze training goes on the performance doesnot get better such as πr

7 to πr10e top of the score table is πr

6which not only has the highest score but also has advantageswhen playing with all blue agents except πb

6 as shown inTable 2 In the competition with all the opponents it can win44 and remain unbeaten at 75

In order to verify the performance of this agent it isconfronted with the opponent agents trained by ADP [20]and Minimax-DQN [44] Each of the four typical initialsituations is performed 100 times for both opponents andthe results are shown in Table 4 e time cost to generate anaction for each algorithm is shown in Table 5 It can be foundthat the performance of the agent presented in this paper iscomparable to that of ADP and the computational time isslightly reduced However ADP is a model-based methodwhich is assumed to know all the information such as the rollrate of the opponent Compared with Minimax-DQN theagent has the overall advantage is is because Minimax-DQN assumes that the opponent is rational while the op-ponent in this paper is not which is closer to the real worldIn addition compared with Minimax-DQN the computa-tional efficiency of the algorithm proposed has been sig-nificantly improved because linear programming is used toselect the optimal action at each step in Minimax-DQNwhich is time consuming

44 Agent Evaluation and Behavior Analysis Evaluatingagent performance in air combat is not simply a matter ofeither winning or losing In addition to winning and losingrate two criteria are added to represent success level average

Table 2 League results

πb0 πb

1 πb2 πb

3 πb4 πb

5 πb6 πb

7 πb8 πb

9 πb10

πr1

Win 393 17 83 60 137 75 96 126 119 133 123Draw 7 103 213 198 95 128 131 117 113 113 114Loss 0 280 104 142 168 197 173 157 168 154 163

πr2

Win 384 267 4 82 78 94 88 135 142 161 152Draw 16 120 128 178 143 137 157 76 100 87 94Loss 0 13 268 140 179 169 155 189 158 152 154

πr3

Win 387 185 253 5 82 97 103 123 134 154 131Draw 11 104 131 115 194 148 145 132 84 102 97Loss 2 111 16 280 124 155 152 145 182 144 172

πr4

Win 381 172 182 237 31 92 137 154 149 167 158Draw 19 123 102 117 110 151 101 97 144 92 104Loss 0 105 116 46 259 157 162 149 107 141 138

πr5

Win 382 164 172 187 221 29 95 137 168 154 162Draw 18 122 113 142 139 144 118 103 121 143 113Loss 0 114 115 71 40 227 187 160 111 103 125

πr6

Win 380 162 177 193 197 217 52 106 134 152 143Draw 19 105 98 132 101 134 139 194 176 131 144Loss 1 133 125 75 102 49 209 100 90 117 113

πr7

Win 383 169 169 180 165 187 210 56 82 122 133Draw 14 98 111 105 123 134 133 137 180 132 156Loss 3 133 120 115 112 79 57 207 138 144 111

πr8

Win 383 154 157 172 168 177 182 213 36 107 114Draw 16 123 109 114 132 102 97 134 141 124 138Loss 1 123 134 114 100 121 121 53 223 169 148

πr9

Win 387 140 162 157 154 162 179 182 215 37 102Draw 11 133 97 105 127 131 97 104 149 125 147Loss 2 127 141 138 119 107 124 114 36 238 151

πr10

Win 379 155 146 147 143 159 167 169 170 219 42Draw 10 102 84 92 105 97 131 104 85 140 131Loss 11 143 170 161 152 144 102 127 145 41 227

12 Mathematical Problems in Engineering

time to win (ATW) and average disadvantage time (ADT)ATW is measured as the average elapsed time required tomaneuver to the advantage position in the scenario whereour aircraft wins Smaller ATW is better than a larger oneADT is the average accumulated time that the advantagefunction value of our aircraft is less than that of the op-ponent which is used as a criterion to evaluate the riskexposure from the adversary weapons

A thousand air combat scenarios are generated insimulation software and the opponent in each scenario israndomly chosen from ten blue agents Each well-trainedred agent is used to perform these confrontations and theresults are shown in Figure 10 Agent number i means thepolicy trained after i periods From πr

1 to πr5 the per-

formance of agents is gradually improving After πr5 their

performance is stabilized and each agent has differentcharacteristics Comparing πr

5 to πr10 the winning and

losing rates are almost the same as those in Table 3 ewinning rate of the six agents is similar with the highestπr5 winning 6 higher than the lowest πr

10 as shown inFigure 10(a) However πr

6 is the agent with the lowestlosing rate and its losing rate is more than 10 lower than

that of other agents as shown in Figure 10(b) For ATWπr7 is less than 100 s πr

6 is 1087 s and that of other agents ismore than 110 s as shown in Figure 10(c) For ADT πr

5πr6 and πr

8 are less than 120 s πr7 is above 130 s and πr

9 andπr10 are between 120 and 130 s as shown in Figure 10(d)In summary with the increase in the number of itera-

tions the red queen effect appears obviously e perfor-mance of πr

8 πr9 and πr

10 is no better than that of πr6 and πr

7Although the winning rate of πr

7 is not high and has moredisadvantage time it can always win quickly It is an ag-gressive agent and can be used in scenarios where ouraircraft needs to beat the opponent quicklye winning rateof πr

6 is similar to other agents while its losing rate is thelowest Its ATW is only higher than πr

7 and ADT is thelowest In most cases it can be selected to achieve morevictories while keeping itself safe πr

6 is an agent with goodcomprehensive performance which means that the selectionmethod of the league system is effective

For the four initial scenarios air combat simulationusing πr

6 is selected for behavior analysis as shown inFigure 11 In the offensive scenario the opponent tried toescape by turning in the opposite direction Our aircraft useda lag pursuit tactic which is simple but effective At the 1stsecond our aircraft did not keep up with the opponent butchose to fly straight At the 9th second it gradually turned tothe opponent established and maintained the advantageand finally won as shown in Figure 11(a)

In the defensive scenario our aircraft was at a disad-vantage position and the opponent was trying to lock us in

Table 3 League table

Rank Agent (strategy) Win Draw Loss Score1 πr

6 1913 1373 1114 71122 πr

7 1858 1323 1219 68973 πr

5 1871 1276 1253 68894 πr

9 1877 1226 1297 68575 πr

8 1863 1230 1307 68196 πr

10 1896 1081 1423 67697 πr

4 1860 1160 1380 67408 πr

3 1654 1263 1483 62259 πr

2 1587 1236 1577 599710 πr

1 1362 1332 1706 5418

Table 4 Algorithm comparison

vs ADP [20] vs Minimax-DQN [44]

OffensiveWin 57 66Draw 28 21Loss 15 13

DefensiveWin 19 20Draw 31 42Loss 50 38

Head-onWin 43 46Draw 16 26Loss 41 28

NeutralWin 32 43Draw 32 22Loss 36 35

Table 5 Time cost to generate an action

Algorithm Time cost (ms)e proposed algorithm 138ADP 142Minimax-DQN 786

Mathematical Problems in Engineering 13

with a lag pursuit tactic as shown in Figure 11(b) At the30th second our aircraft adjusted its roll angle to the right toavoid being locked by the opponent as shown inFigure 11(c) After that it continued to adjust the roll angle

and when the opponent noticed that our strategy hadchanged our aircraft had already flown out of the sector Insituations where it is difficult to win the agent will use a safemaneuver strategy to get a draw

Figure 10: Agent evaluation, plotted over agent number: (a) number of wins; (b) number of losses; (c) average time to win (s); (d) average disadvantage time (s).

Figure 11: Trajectories under four typical initial situations (numbers along the trajectories mark elapsed time in seconds): (a) offensive; (b) defensive (0–21 s); (c) defensive (21–43 s); (d) head-on (0–50 s); (e) head-on (50–100 s); (f) head-on (100–154 s); (g) neutral (0–40 s); (h) neutral (40–74 s).


In the head-on scenario, both sides adopted the same maneuver strategy in the first 50 s, that is, a maximum-overload turn while waiting for opportunities, as shown in Figure 11(d). The crucial decision was made at the 50th second: the opponent was still hovering and waiting for an opportunity, while our aircraft stopped hovering and reduced its turning rate to fly towards the opponent. At the 90th second, our aircraft reached an advantage situation over its opponent, as shown in Figure 11(e). In the final stage, our aircraft adopted the lead pursuit tactic to establish and maintain the advantage and won, as shown in Figure 11(f).

In the neutral scenario, our initial strategy was to turn away from the opponent to look for opportunities. The opponent's strategy was lag pursuit, and it successfully reached the rear of our aircraft at the 31st second, as shown in Figure 11(g). After the 40th second, however, our aircraft made a wise decision to reduce its roll angle, thereby increasing its turning radius. At the 60th second, the disadvantaged situation was terminated, as shown in Figure 11(h). After that, our aircraft increased its roll angle while the opponent was stuck in a maximum-overload right turn, which made it unable to get rid of our aircraft, and it finally lost. It can be seen that, trained by alternate freeze games with the DQN algorithm and selected through the league system, the agent can learn tactics such as lead pursuit, lag pursuit, and hovering, and can use them in air combat.
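The "advantage situation" referred to throughout this analysis is the position of advantage defined by the distance and angle bounds listed in the Nomenclature (d_min, d_max, μ_max, η_max). A minimal sketch of that geometric check follows, using the threshold values reported for the simulations (distance between 100 m and 500 m, |μ| ≤ 60°, |η| ≤ 30°) as illustrative defaults; it is not the simulation software's implementation.

```python
def in_position_of_advantage(distance_m: float,
                             deviation_deg: float,   # mu_t: our nose-to-target angle
                             aspect_deg: float,      # eta_t: angle off the target's tail
                             d_min: float = 100.0, d_max: float = 500.0,
                             mu_max: float = 60.0, eta_max: float = 30.0) -> bool:
    """True when the aircraft holds the position of advantage: the opponent is
    within the allowed distance band and both the deviation angle and the
    aspect angle are inside their maximum bounds."""
    return (d_min <= distance_m <= d_max
            and abs(deviation_deg) <= mu_max
            and abs(aspect_deg) <= eta_max)

# Example: 300 m behind the opponent, nose 20 deg off its position,
# and only 10 deg off its tail -> a winning geometry.
print(in_position_of_advantage(300.0, 20.0, 10.0))   # True
```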

5. Conclusion

In this paper, middleware connecting air combat simulation software and a reinforcement learning agent is developed. It offers researchers a way to design middleware that turns existing software into a reinforcement learning environment, which can expand the application field of reinforcement learning. Maneuver guidance agents with reward shaping are designed. Through training in this environment, an agent can guide an aircraft to fight against its opponent in air combat and reach an advantage situation in most scenarios. An alternate freeze game algorithm is proposed and combined with RL; it can be used in nonstationary situations where the other players' strategies are variable. Through the league system, the agent whose performance improves after iterative training is selected. The league results show that the strategy quality of the selected agent is better than that of the other agents and that the red queen effect is avoided. Agents can learn some typical angle tactics and behaviors in the horizontal plane and perform these tactics by guiding an aircraft to maneuver in one-on-one air combat.

In future work, the problem would be extended to 3D maneuvering with less restrictive vehicle dynamics. The 3D formulation will lead to a larger state space and more complex actions, and the learning mechanism would be improved to deal with them. Beyond an extension to 3D maneuvering, a 2-vs-2 air combat scenario would be established in which there are collaborators as well as adversaries. One option is to use a centralized controller, which turns the task into a single-agent DRL problem that outputs the joint action of both aircraft at each time step. The other option is to adopt a decentralized system in which both agents make decisions for themselves and may cooperate to achieve a common goal.

In theory, the convergence of the alternate freeze game algorithm needs to be analyzed. Furthermore, in order to improve the diversity of potential opponents in a more complex air combat scenario, the league system would be refined.

Nomenclature

(x, y, z): 3-dimensional coordinates of an aircraft
v: Speed
θ: Flight path angle
ψ: Heading angle
β: Angle of attack
ϕ: Bank angle
m: Mass of an aircraft
g: Acceleration due to gravity
T: Thrust force
L: Lift force
D: Drag force
ψ̇: Turn rate
ϕ̇: Roll rate
d_t: Distance between the aircraft and the target at time t
d_min: Minimum distance of the advantage position
d_max: Maximum distance of the advantage position
μ_t: Deviation angle at time t
μ_max: Maximum deviation angle of the advantage position
η_t: Aspect angle at time t
η_max: Maximum aspect angle of the advantage position
S: State space
S_t: State vector at time t
s: All states in the state space
A: Action space
A_t: Action at time t
a: All actions in the action space
π (S → A): Policy
R(S_t, A_t, S_t+1): Reward function
R_t: Scalar reward received at time t
V(s): Value function
Q(s, a): Action-value function
Q*(s, a): Optimal action-value function
Q(S_t, A_t): Estimated action-value function at time t
γ: Discount factor
α: Learning rate
T: Maximum simulation time in each episode
Δt: Training time step size
δt: Simulation time step size
F(S_t, A_t, S_t+1): Shaping reward function
Φ(S_t): A real-valued function on states
T(S_t): Termination reward function
D(S_t): Distance reward function
O(S_t): Orientation reward function
K: Number of training periods


Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grants 91338107 and U1836103 and by the Development Program of Sichuan, China, under Grants 2017GZDZX0002, 18ZDYF3867, and 19ZDZX0024.

Supplementary Materials

Aircraft dynamics: the derivation of the kinematic and dynamic equations of the aircraft. Code: a simplified version of the environment used in this paper; it does not include the commercial software and the middleware, but its aircraft model and other functions are the same as those proposed in this paper. This environment and the RL agent are packaged as supplementary material, through which the alternate freeze game DQN algorithm proposed in this paper can be reproduced. (Supplementary Materials)

References

[1] L. R. Shaw, "Basic fighter maneuvers," in Fighter Combat: Tactics and Maneuvering, M. D. Annapolis, Ed., Naval Institute Press, New York, NY, USA, 1st edition, 1985.

[2] A. Toubman, J. J. Roessingh, P. Spronck, A. Plaat, and J. Herik, Dynamic Scripting with Team Coordination in Air Combat Simulation, Lecture Notes in Computer Science, Kaohsiung, Taiwan, 2014.

[3] A. Toubman, J. J. Roessingh, P. Spronck, A. Plaat, and J. Herik, "Centralized versus decentralized team coordination using dynamic scripting," in Proceedings of the European Modeling and Simulation Symposium, pp. 1–7, Porto, Portugal, 2014.

[4] A. Toubman, J. J. Roessingh, P. Spronck, A. Plaat, and J. Herik, "Rewarding air combat behavior in training simulations," in Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, pp. 1397–1402, Kowloon, China, 2015.

[5] A. Toubman, J. J. Roessingh, P. Spronck, A. Plaat, and J. Herik, "Transfer learning of air combat behavior," in Proceedings of the IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 226–231, Miami, FL, USA, 2015.

[6] D. I. You and D. H. Shim, "Design of an aerial combat guidance law using virtual pursuit point concept," Proceedings of the Institution of Mechanical Engineers, vol. 229, no. 5, pp. 1–22, 2014.

[7] H. Shin, J. Lee, and D. H. Shim, "Design of a virtual fighter pilot and simulation environment for unmanned combat aerial vehicles," in Proceedings of the Advances in Flight Control Systems, pp. 1–21, Grapevine, TX, USA, 2017.

[8] J. M. Eklund, J. Sprinkle, and S. S. Sastry, "Implementing and testing a nonlinear model predictive tracking controller for aerial pursuit/evasion games on a fixed wing aircraft," in Proceedings of the Control System Applications, pp. 1509–1514, Portland, OR, USA, 2005.

[9] J. M. Eklund, J. Sprinkle, and S. S. Sastry, "Switched and symmetric pursuit/evasion games using online model predictive control with application to autonomous aircraft," IEEE Transactions on Control Systems Technology, vol. 20, no. 3, pp. 604–620, 2012.

[10] F. Austin, G. Carbone, M. Falco, H. Hinz, and M. Lewis, Automated Maneuvering Decisions for Air-to-Air Combat, Grumman Corporate Research Center, Bethpage, NY, USA, 1987.

[11] F. Austin, G. Carbone, M. Falco, H. Hinz, and M. Lewis, "Game theory for automated maneuvering during air-to-air combat," Journal of Guidance, Control, and Dynamics, vol. 13, no. 6, pp. 1143–1149, 1990.

[12] Y. Ma, G. Wang, X. Hu, H. Luo, and X. Lei, "Cooperative occupancy decision making of multi-UAV in beyond-visual-range air combat: a game theory approach," IEEE Access, vol. 323, 2019.

[13] J. P. Hespanha, M. Prandini, and S. Sastry, "Probabilistic pursuit-evasion games: a one-step Nash approach," in Proceedings of the 36th IEEE Conference on Decision and Control, Sydney, NSW, Australia, 2000.

[14] A. Xu, Y. Kou, L. Yu, B. Xu, and Y. Lv, "Engagement maneuvering strategy of air combat based on fuzzy Markov game theory," in Proceedings of the IEEE International Conference on Computer, pp. 126–129, Wuhan, China, 2011.

[15] K. Horic and B. A. Conway, "Optimal fighter pursuit-evasion maneuvers found via two-sided optimization," Journal of Guidance, Control, and Dynamics, vol. 29, no. 1, pp. 105–112, 2006.

[16] K. Virtanen, T. Raivio, and R. P. Hamalainen, "Modeling pilot's sequential maneuvering decisions by a multistage influence diagram," Journal of Guidance, Control, and Dynamics, vol. 27, no. 4, pp. 665–677, 2004.

[17] K. Virtanen, J. Karelahti, and T. Raivio, "Modeling air combat by a moving horizon influence diagram game," Journal of Guidance, Control, and Dynamics, vol. 29, no. 5, pp. 1080–1091, 2006.

[18] Q. Pan, D. Zhou, J. Huang et al., "Maneuver decision for cooperative close-range air combat based on state predicted influence diagram," in Proceedings of the Information and Automation for Sustainability, pp. 726–730, Macau, China, 2017.

[19] R. E. Smith, B. A. Dike, R. K. Mehra, B. Ravichandran, and A. El-Fallah, "Classifier systems in combat: two-sided learning of maneuvers for advanced fighter aircraft," Computer Methods in Applied Mechanics and Engineering, vol. 186, no. 2–4, pp. 421–437, 2000.

[20] J. S. McGrew, J. P. How, B. Williams, and N. Roy, "Air-combat strategy using approximate dynamic programming," Journal of Guidance, Control, and Dynamics, vol. 33, no. 5, pp. 1641–1654, 2010.

[21] Y. Ma, X. Ma, and X. Song, "A case study on air combat decision using approximated dynamic programming," Mathematical Problems in Engineering, vol. 2014, 2014.

[22] I. Bisio, C. Garibotto, F. Lavagetto, A. Sciarrone, and S. Zappatore, "Blind detection: advanced techniques for WiFi-based drone surveillance," IEEE Transactions on Vehicular Technology, vol. 68, no. 1, pp. 938–946, 2019.

[23] Y. Li, "Deep reinforcement learning," 2018.

[24] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, London, UK, 2nd edition, 2018.

[25] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.

[26] V. Mnih, K. Kavukcuoglu, D. Silver et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.

[27] Z. Kavukcuoglu, R. Liu, Z. Meng, Y. Zhang, Y. Yu, and T. Lu, "On reinforcement learning for full-length game of StarCraft," 2018.

[28] D. Silver and C. J. Maddison, "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.

[29] D. Hubert and I. Antonoglou, "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play," Science, vol. 362, no. 6419, pp. 1140–1144, 2018.

[30] J. Schrittwieser, J. A. Bagnell, and J. Peters, "Reinforcement learning in robotics: a survey," The International Journal of Robotics Research, vol. 32, no. 11, pp. 1238–1274, 2013.

[31] A. Waldock, C. Greatwood, F. Salama, and T. Richardson, "Learning to perform a perched landing on the ground using deep reinforcement learning," Journal of Intelligent and Robotic Systems, vol. 92, no. 3-4, pp. 685–704, 2018.

[32] R. R. Alejandro, S. Carlos, B. Hriday, P. Paloma, and P. Campoy, "A deep reinforcement learning strategy for UAV autonomous landing on a moving platform," Journal of Intelligent and Robotic Systems, vol. 93, no. 1-2, pp. 351–366, 2019.

[33] W. Lou and X. Guo, "Adaptive trajectory tracking control using reinforcement learning for quadrotor," Journal of Intelligent and Robotic Systems, vol. 13, no. 1, pp. 1–10, 2016.

[34] C. Dunn, J. Valasek, and K. Kirkpatrick, "Unmanned air system search and localization guidance using reinforcement learning," in Proceedings of the AIAA Infotech@Aerospace 2014 Conference, pp. 1–8, Garden Grove, CA, USA, 2014.

[35] S. You, M. Diao, and L. Gao, "Deep reinforcement learning for target searching in cognitive electronic warfare," IEEE Access, vol. 7, pp. 37432–37447, 2019.

[36] P. Luo, J. Xie, and W. Che, "Q-learning based air combat target assignment algorithm," in Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, pp. 779–783, Budapest, Hungary, 2016.

[37] M. Grzes, "Reward shaping in episodic reinforcement learning," in Proceedings of the International Foundation for Autonomous Agents and Multiagent Systems, pp. 565–573, São Paulo, Brazil, 2017.

[38] K. Tummer and A. Agogino, "Agent reward shaping for alleviating traffic congestion," in Proceedings of the International Foundation for Autonomous Agents and Multiagent Systems, pp. 1–7, Hakodate, Japan, 2006.

[39] L. L. B. V. Cruciol, A. C. de Arruda, L. Weigang, L. Li, and A. M. F. Crespo, "Reward functions for learning to control in air traffic flow management," Transportation Research Part C: Emerging Technologies, vol. 35, pp. 141–155, 2013.

[40] P. Hernandez-Leal, M. Kaisers, T. Baarslag, and E. M. de Cote, "A survey of learning in multiagent environments: dealing with non-stationarity," 2019.

[41] M. L. Littman, "Markov games as a framework for multi-agent reinforcement learning," in Proceedings of the International Conference on Machine Learning, pp. 157–163, New Brunswick, NJ, USA, 1994.

[42] M. L. Littman, "Friend-or-foe Q-learning in general-sum games," in Proceedings of the International Conference on Machine Learning, pp. 322–328, Williamstown, MA, USA, 2001.

[43] J. Hu and M. P. Wellman, "Nash Q-learning for general-sum stochastic games," Journal of Machine Learning Research, vol. 4, pp. 1039–1069, 2003.

[44] J. Fan, Z. Wang, Y. Xie, and Z. Yang, "A theoretical analysis of deep Q-learning," 2020.

[45] R. E. Smith, B. A. Dike, B. Ravichandran, A. Fallah, and R. Mehra, "Two-sided genetics-based learning to discover novel fighter combat maneuvers," in Lecture Notes in Computer Science, pp. 233–242, Springer, Berlin, Heidelberg, 2001.

[46] A. Ng, D. Harada, and S. Russell, "Policy invariance under reward transformations: theory and application to reward shaping," in Proceedings of the International Conference on Machine Learning, pp. 278–287, Bled, Slovenia, 1999.

[47] T. P. Lillicrap, J. J. Hunt, A. Pritzel et al., "Continuous control with deep reinforcement learning," in Proceedings of the International Conference on Learning Representations, San Juan, Puerto Rico, 2016.

[48] V. Mnih, A. P. Badia, M. Mirza et al., "Asynchronous methods for deep reinforcement learning," in Proceedings of the International Conference on Machine Learning, pp. 1–19, New York, NY, USA, 2016.

[49] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," 2017.

[50] Commercial Air Combat Simulation Software FG_SimStudio, http://www.eyextent.com/producttail117.

[51] M. Wiering and M. van Otterlo, Reinforcement Learning: State-of-the-Art, Springer Press, Heidelberg, Germany, 2014.

[52] D. P. Kingma and J. Ba, "Adam: a method for stochastic optimization," in Proceedings of the International Conference on Learning Representations (ICLR), pp. 1–15, San Diego, CA, USA, 2015.



Page 8: ImprovingManeuverStrategyinAirCombatbyAlternateFreeze ...downloads.hindawi.com/journals/mpe/2020/7180639.pdf · 1CollegeofComputerScience,SichuanUniversity,Chengdu,Sichuan610065,China

We evaluated three neural networks and three minibatch sizes. Each neural network has 3 hidden layers, and the numbers of units are 64, 64, and 128 for the small neural network (NN), 128, 128, and 256 for the medium one, and 256, 256, and 512 for the large one. The sizes of the minibatch (MB) are 256, 512, and 1024, respectively. These neural networks and minibatches are used in pairs to train red agents that play against the initial blue agent. The simulation result is shown in Figure 5. First, a large minibatch cannot make the training successful for the small and medium neural networks, and its training speed is slow for the large network. Second, with the small neural network, the average reward obtained is lower than that of the other two networks. Third, if the same neural network is adopted, training with the medium minibatch is faster than with the small one. Last, considering the computational efficiency, the selected network has 3 hidden layers with 128, 128, and 256 units, respectively, and the selected minibatch size is 512.
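As a concrete reference for the selected configuration, the following is a minimal PyTorch sketch of such a Q-network, not the paper's implementation; the state dimension, number of discrete maneuver actions, and learning rate are illustrative assumptions, while the hidden-layer widths and minibatch size are the values selected above.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Q-network with the selected architecture: 3 hidden layers of 128, 128, and 256 units."""
    def __init__(self, state_dim: int, num_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 256), nn.ReLU(),
            nn.Linear(256, num_actions),  # one Q-value per discrete maneuver action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# Hypothetical dimensions and learning rate, for illustration only.
q_net = QNetwork(state_dim=8, num_actions=5)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)  # Adam optimizer, cf. [52]
MINIBATCH_SIZE = 512  # selected minibatch size
```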

In order to improve the generality of a single agent, initial positions are randomized in the air combat simulation. In the first iteration of the training process, the initial strategy of the red agent is formed by a randomly initialized neural network, and the conservative strategy of the maximum-overload turn is adopted by the blue one. In the subsequent training, the training agent is trained from scratch, and its opponent adopts the strategy obtained from the latest training.

Air combat geometry can be divided into four categories (from the red aircraft's perspective): offensive, defensive, neutral, and head-on [7], as shown in Figure 6(a). Defining the deviation angle and aspect angle from −180° to 180°, Figure 6(b) shows an advantage diagram where the distance is 300 m. A smaller |μr| means that the heading or gun of the red aircraft is better aimed at its opponent, and a smaller |ηb| implies a higher possibility of the blue aircraft being shot or facing a fatal situation.

(1) Set parameters of both aircraft
(2) Set simulation parameters
(3) Set the number of training periods K and the condition for ending each training period, W_threshold
(4) Set DRL parameters
(5) Set the opponent initialization policy πb0
(6) for period = 1, ..., K do
(7)   for aircraft in [red, blue] do
(8)     if aircraft = red then
(9)       Set the opponent policy π = πb(period−1)
(10)      Initialize the neural networks of the red agent
(11)      while winning rate < W_threshold do
(12)        Train the agent using Algorithm 1
(13)      end while
(14)      Save the well-trained agent, whose maneuver guidance policy is πr(period)
(15)    else
(16)      if period = K + 1 then
(17)        break
(18)      else
(19)        Set the opponent policy π = πr(period)
(20)        Initialize the neural networks of the blue agent
(21)        while winning rate < W_threshold do
(22)          Train the agent using Algorithm 1
(23)        end while
(24)        Save the well-trained agent, whose maneuver guidance policy is πb(period)
(25)      end if
(26)    end if
(27)  end for
(28) end for

Algorithm 2: Alternate freeze game DQN for maneuver guidance agent training in air combat.
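The procedure can be condensed into a short Python sketch, shown below. It is illustrative only: train_agent is a hypothetical stand-in for Algorithm 1 (DQN training against a frozen opponent until the winning-rate threshold is reached), and the threshold value and dummy policies are assumptions rather than the paper's settings.

```python
import random

def train_agent(opponent_policy, win_threshold):
    """Stand-in for Algorithm 1: in the paper this trains a DQN agent against the
    frozen opponent_policy until its test winning rate reaches win_threshold.
    Here it only returns a dummy policy so the loop structure can be executed."""
    return lambda state: random.randrange(5)  # dummy maneuver choice

def alternate_freeze_game(K, win_threshold, initial_blue_policy):
    """Alternate freeze game: train red against frozen blue, then blue against frozen red."""
    red_policies = []                       # pi_r1 ... pi_rK
    blue_policies = [initial_blue_policy]   # pi_b0: the maximum-overload turn strategy
    for period in range(1, K + 1):
        red = train_agent(blue_policies[period - 1], win_threshold)   # red vs frozen pi_b(period-1)
        red_policies.append(red)
        blue = train_agent(red_policies[period - 1], win_threshold)   # blue vs frozen pi_r(period)
        blue_policies.append(blue)
    return red_policies, blue_policies

# Example: ten periods, yielding pi_r1..pi_r10 and pi_b0..pi_b10 as in the league.
reds, blues = alternate_freeze_game(K=10, win_threshold=0.5,
                                    initial_blue_policy=lambda state: 0)
```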

Table 1: Air combat simulation parameters.

Red aircraft parameters:
  Velocity: 200 m/s
  Maximum roll angle: 75°
  Roll rate: 40°/s
  Initial position: Random position and heading angle
Blue aircraft parameters:
  Velocity: 200 m/s
  Maximum roll angle: 75°
  Roll rate: 40°/s
  Initial position: Map center, heading northward
  Initial policy (πb0): Maximum-overload turn
Simulation parameters:
  Position of advantage: dmax = 500 m, dmin = 100 m, μmax = 60°, ηmax = 30°
  Combat area: 5 km × 5 km
  Simulation time T: 500 s
  Simulation time step δt: 0.1 s
  Training time step Δt: 0.5 s
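Using the advantage-position parameters from Table 1, the geometric win condition can be sketched as a simple test. This is an illustrative check under the nomenclature's definitions (distance d in metres, deviation angle μ and aspect angle η in degrees); it does not reproduce the paper's full termination logic.

```python
D_MIN, D_MAX = 100.0, 500.0    # advantage distance band (m), Table 1
MU_MAX, ETA_MAX = 60.0, 30.0   # advantage angle limits (deg), Table 1

def in_position_of_advantage(d: float, mu: float, eta: float) -> bool:
    """True if the pursuer is inside the advantage region: within the distance band,
    roughly pointing at the target (small |mu|), and behind it (small |eta|)."""
    return (D_MIN <= d <= D_MAX) and (abs(mu) <= MU_MAX) and (abs(eta) <= ETA_MAX)

# Example: 300 m behind the target, 20 deg off the nose, 10 deg aspect angle.
print(in_position_of_advantage(300.0, 20.0, 10.0))  # True
```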


For example, in the offensive scenario, the initial state value of the red agent is large because of small |μr| and |ηb|, according to equations (14) and (16). Therefore, in the offensive initial scenario, the win probability of our side will be larger than that of the opponent, while the defensive initial scenario is the reverse. In the neutral and head-on initial scenarios, the initial state values of both sides are almost equal. The performance of the well-trained agents is verified according to this classification of four initial situations in the league system.
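The four-way classification of Figure 6(a) can be sketched from the two angles as well. The 60° threshold and the exact quadrant assignment below are assumptions for illustration; the paper's Figure 6(b) defines the actual partition of the (μr, ηb) plane.

```python
def classify_situation(mu_r: float, eta_b: float, angle_threshold: float = 60.0) -> str:
    """Rough classification of the initial geometry from the red aircraft's view.
    Small |mu_r| means red points at blue; small |eta_b| means red is behind blue."""
    red_aims = abs(mu_r) <= angle_threshold
    red_behind = abs(eta_b) <= angle_threshold
    if red_aims and red_behind:
        return "offensive"
    if not red_aims and not red_behind:
        return "defensive"
    if red_aims and not red_behind:
        return "head-on"   # roughly nose-to-nose
    return "neutral"
```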

4.2. Simulation Results. The policy naming convention is πbi or πri, which denotes a blue or red policy produced after i periods, respectively. After every 100 training episodes, the test episodes are run 100 times, and the learning curves are shown in Figure 7. For the first period (πr1 vs. πb0), due to the simple maneuver policy of the blue aircraft, the red aircraft can achieve an almost 100% success rate after about 4000 episodes of training, as shown in Figure 7(a). At the beginning, most of the games are draws because of the random initialization of the red agent and the evasion strategy of its opponent. Another reason is that the aircraft often flies out of the airspace in the early stages of training; although the agent receives a penalty for this, the air combat episode still ends in a draw. During the training process, the red aircraft searches for winning strategies within the airspace, and its winning rate increases constantly. However, because it pursues offense and neglects defense, the winning rate of its opponent also rises. In the later stage of training, the red aircraft gradually understands the opponent's strategy, and it can achieve a very high winning rate.

With the iterations of the game, both agents learn an intelligent maneuver strategy, which can guide the aircraft in the pursuit-evasion game. Figure 7(b) shows a typical example in which πb5 is the agent being trained and πr5 is its frozen opponent. After about 5000 episodes of training, the blue aircraft can reduce its losing rate to a lower level. Since the red one adopts an intelligent maneuver guidance strategy, the blue agent cannot learn a winning strategy in a short time. After about 20,000 episodes, the agent gradually understands the opponent's maneuver strategy, and its winning rate keeps increasing. The final winning rate is stable between 50% and 60%, and the losing rate is below 10%.

The training process of the iteration πr10 vs. πb9 is shown in Figure 7(c). Because the maneuver strategy of the blue aircraft is an intelligent strategy that has been trained iteratively, training the red agent is much more difficult than in the first iteration. The well-trained agent can win more than 60% and lose less than 10% of the games against πb9. It is hard to reach a higher winning rate in every training run except for the scenario in which the opponent is πb0. This is because both aircraft are homogeneous, the initial situations are randomized, and the opponent in the training environment is intelligent. In some situations, as long as the opponent's strategy is intelligent enough, it is almost impossible for the trained agent to win. Interestingly, in scenarios where the opponent is intelligent enough and the aircraft cannot defeat it no matter how the agent operates, the agent will guide the aircraft out of the airspace to obtain a draw. This offers an insight: in a scenario where you cannot win, it may be best to leave the combat area.

For the four initial air combat situations, the two aircraft are guided by the agents trained at each iteration stage. The typical flight trajectories are shown in Figure 8. Since πb0 is a static strategy, πr1 can easily win in the offensive, head-on, and neutral situations, as shown in Figures 8(a), 8(c), and 8(d). In the defensive situation, the red aircraft first turns away from its opponent, who has the advantage, and then looks for opportunities to establish an advantageous situation. There are no fierce rivalries in the process of combat, as shown in Figure 8(b).

Although πr5 is an intelligent strategy, the well-trained πb5 can still gain an advantage in combat, as shown in Figures 8(e)–8(h). In the offensive situation (from the red aircraft's perspective), the blue aircraft cannot easily get rid of the red one's pursuit, but after a fierce confrontation it can establish an advantageous situation, as shown in Figure 8(e). Figure 8(f) shows the defensive situation, in which the blue aircraft keeps its advantage until it wins. In the other two scenarios, the blue aircraft also performs well, especially in the neutral initial situation, where it adopts the delayed-turning tactic and successfully wins the air combat.

πb9 is an intelligent strategy, and the well-trained πr10 can achieve more than half of the victories with less than 10% failures. However, unlike πr1 vs. πb0, πr10 cannot win easily, especially in head-on situations, as shown in Figures 8(i)–8(l).

Figure 5: Evaluation of the neural network and minibatch (average reward versus training episodes for the nine NN/MB combinations: small/medium/large NN paired with small/medium/large MB).


In the offensive situation, the blue aircraft tries to get rid of the red one by constantly adjusting its bank angle, but it is finally captured by the red one. In the defensive situation, the red aircraft cleverly adopts the circling-back tactic and quickly captures its opponent. In the head-on situation, the red and blue aircraft alternately occupy the advantageous situation, and after a fierce maneuvering confrontation the red aircraft finally wins. In the neutral situation, the red one constantly adjusts its bank angle according to the opponent's position and wins.

For the above three pairs of strategies, the head-on initial situations are taken as examples to analyze the advantage of both sides during the game, as shown in Figure 9. For the πr1 vs. πb0 scenario, because πb0 has no intelligent maneuver guidance ability, although both sides are equally competitive in the first half, the red aircraft continues to expand its advantage until it wins, as shown in Figure 9(a). When the blue agent learns an intelligent strategy, it can defeat the red one, as shown in Figure 9(b). For the πr10 vs. πb9 scenario, the two sides alternately establish the advantageous position, and the red aircraft does not win easily, as shown in Figure 9(c). Combined with the trajectories shown in Figures 8(c), 8(g), and 8(k), the maneuver strategies of both sides have been improved using the alternate freeze game DQN algorithm. The approach has the benefit of discovering new maneuver tactics.

4.3. League Results. The objective of the league system is to select a red agent that generates maneuver guidance instructions according to its strategy so as to achieve the best results without knowing the strategy of its opponent. There are ten agents on each side, which are saved during the iteration of games. Each red agent fights each blue agent 400 times, including 100 times for each of the four initial scenarios. The detailed league results are shown in Table 2. Results with a win/loss ratio greater than 1 are highlighted in bold.
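The league evaluation itself is a plain round-robin loop; a sketch is given below. It assumes a hypothetical play_episode(red, blue, situation) callback supplied by the simulation middleware that returns 'win', 'draw', or 'loss' from the red side's perspective; this is not the paper's actual interface.

```python
from collections import Counter

SITUATIONS = ["offensive", "defensive", "head-on", "neutral"]
GAMES_PER_SITUATION = 100  # 4 x 100 = 400 games per pairing, as in Table 2

def run_league(red_policies, blue_policies, play_episode):
    """Round-robin: every red agent plays every blue agent 400 times."""
    results = {}
    for i, red in enumerate(red_policies, start=1):
        for j, blue in enumerate(blue_policies):  # blue index starts at 0 (pi_b0)
            tally = Counter()
            for situation in SITUATIONS:
                for _ in range(GAMES_PER_SITUATION):
                    tally[play_episode(red, blue, situation)] += 1
            results[(f"pi_r{i}", f"pi_b{j}")] = dict(tally)
    return results
```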

Figure 6: Situation recognition based on the deviation angle and aspect angle. (a) The four initial air combat situation categories: offensive, defensive, neutral, and head-on. (b) Situation recognition diagram: aspect angle ηb versus deviation angle μr, each ranging from −180° to 180°.

Figure 7: Learning curves in the training process (percentages of blue wins, red wins, and draws versus training episodes). (a) πr1 vs. πb0. (b) πb5 vs. πr5. (c) πr10 vs. πb9.


Figure 8: Typical trajectory results in the training iterations of the air combat games. (a) πr1 vs. πb0, offensive. (b) πr1 vs. πb0, defensive. (c) πr1 vs. πb0, head-on. (d) πr1 vs. πb0, neutral. (e) πb5 vs. πr5, offensive. (f) πb5 vs. πr5, defensive. (g) πb5 vs. πr5, head-on. (h) πb5 vs. πr5, neutral. (i) πr10 vs. πb9, offensive. (j) πr10 vs. πb9, defensive. (k) πr10 vs. πb9, head-on. (l) πr10 vs. πb9, neutral.

Figure 9: Advantage curves in head-on air combat games (red and blue advantage function values versus training steps). (a) πr1 vs. πb0. (b) πb5 vs. πr5. (c) πr10 vs. πb9.


πri, where i ∈ [1, 10], is trained in the environment with πb(i−1) as its opponent, and πbj is trained against πrj, j ∈ [1, 10]. The result of πr1 fighting πb0 is ideal. However, since πb0 is a static strategy rather than an intelligent one, πr1 does not perform well in the games against the other, intelligent blue agents. For πr2 to πr10, although they are not specifically trained against πb0, their performance is still very good because πb0 is a static strategy. In the iterative game process, the trained agents obtain good results in the confrontation with the frozen agents of their training environment; for example, πr2 has an overwhelming advantage over πb1, and so does πb2 over πr2. Because of the red queen effect, although πr10 has an advantage against πb9, it is in a weak position against the early blue agents, such as πb2, πb3, and πb4. The league table is shown in Table 3, in which 3 points are awarded for a win, 1 point for a draw, and 0 points for a loss. In the early stage of training, the performance of the trained agents improves gradually; the later trained agents are better than the former ones, such as from πr1 to πr6. As the alternate freeze training goes on, the performance does not keep getting better, as seen from πr7 to πr10. The top of the score table is πr6, which not only has the highest score but also has an advantage when playing against all blue agents except πb6, as shown in Table 2. In the competition with all the opponents, it can win 44% of the games and remain unbeaten in 75%.

In order to verify the performance of this agent, it is confronted with opponent agents trained by ADP [20] and Minimax-DQN [44]. Each of the four typical initial situations is played 100 times against both opponents, and the results are shown in Table 4. The time cost to generate an action for each algorithm is shown in Table 5. The performance of the agent presented in this paper is comparable to that of ADP, while the computational time is slightly reduced; moreover, ADP is a model-based method, which is assumed to know all the information, such as the roll rate of the opponent. Compared with Minimax-DQN, the agent has an overall advantage. This is because Minimax-DQN assumes that the opponent is rational, whereas the opponent in this paper is not, which is closer to the real world. In addition, compared with Minimax-DQN, the computational efficiency of the proposed algorithm is significantly higher, because Minimax-DQN uses linear programming to select the optimal action at each step, which is time consuming.

4.4. Agent Evaluation and Behavior Analysis. Evaluating agent performance in air combat is not simply a matter of winning or losing. In addition to the winning and losing rates, two criteria are added to represent the success level: average time to win (ATW) and average disadvantage time (ADT).

Table 2: League results (each red agent plays each blue agent 400 times; entries are numbers of wins, draws, and losses).

          πb0   πb1   πb2   πb3   πb4   πb5   πb6   πb7   πb8   πb9   πb10
πr1  Win   393    17    83    60   137    75    96   126   119   133   123
     Draw    7   103   213   198    95   128   131   117   113   113   114
     Loss    0   280   104   142   168   197   173   157   168   154   163
πr2  Win   384   267     4    82    78    94    88   135   142   161   152
     Draw   16   120   128   178   143   137   157    76   100    87    94
     Loss    0    13   268   140   179   169   155   189   158   152   154
πr3  Win   387   185   253     5    82    97   103   123   134   154   131
     Draw   11   104   131   115   194   148   145   132    84   102    97
     Loss    2   111    16   280   124   155   152   145   182   144   172
πr4  Win   381   172   182   237    31    92   137   154   149   167   158
     Draw   19   123   102   117   110   151   101    97   144    92   104
     Loss    0   105   116    46   259   157   162   149   107   141   138
πr5  Win   382   164   172   187   221    29    95   137   168   154   162
     Draw   18   122   113   142   139   144   118   103   121   143   113
     Loss    0   114   115    71    40   227   187   160   111   103   125
πr6  Win   380   162   177   193   197   217    52   106   134   152   143
     Draw   19   105    98   132   101   134   139   194   176   131   144
     Loss    1   133   125    75   102    49   209   100    90   117   113
πr7  Win   383   169   169   180   165   187   210    56    82   122   133
     Draw   14    98   111   105   123   134   133   137   180   132   156
     Loss    3   133   120   115   112    79    57   207   138   144   111
πr8  Win   383   154   157   172   168   177   182   213    36   107   114
     Draw   16   123   109   114   132   102    97   134   141   124   138
     Loss    1   123   134   114   100   121   121    53   223   169   148
πr9  Win   387   140   162   157   154   162   179   182   215    37   102
     Draw   11   133    97   105   127   131    97   104   149   125   147
     Loss    2   127   141   138   119   107   124   114    36   238   151
πr10 Win   379   155   146   147   143   159   167   169   170   219    42
     Draw   10   102    84    92   105    97   131   104    85   140   131
     Loss   11   143   170   161   152   144   102   127   145    41   227


ATW is measured as the average elapsed time required to maneuver to the advantage position in the scenarios where our aircraft wins; a smaller ATW is better than a larger one. ADT is the average accumulated time during which the advantage function value of our aircraft is less than that of the opponent, which is used as a criterion to evaluate the risk of exposure to the adversary's weapons.
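Both criteria can be computed from logged episodes. The sketch below assumes illustrative record fields (outcome, time_to_win, and per-step advantage-value traces for both sides sampled at the training time step); the paper's data format is not specified here.

```python
from statistics import mean

TIME_STEP = 0.5  # training time step (s), Table 1

def average_time_to_win(episodes):
    """ATW: mean elapsed time to the advantage position over winning episodes only."""
    times = [ep["time_to_win"] for ep in episodes if ep["outcome"] == "win"]
    return mean(times) if times else float("nan")

def average_disadvantage_time(episodes):
    """ADT: mean accumulated time per episode during which our advantage value
    is below the opponent's."""
    totals = []
    for ep in episodes:
        steps_at_disadvantage = sum(
            1 for ours, theirs in zip(ep["our_value"], ep["their_value"]) if ours < theirs
        )
        totals.append(steps_at_disadvantage * TIME_STEP)
    return mean(totals) if totals else float("nan")
```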

A thousand air combat scenarios are generated in the simulation software, and the opponent in each scenario is randomly chosen from the ten blue agents. Each well-trained red agent performs these confrontations, and the results are shown in Figure 10. Agent number i denotes the policy trained after i periods. From πr1 to πr5, the performance of the agents improves gradually. After πr5, the performance stabilizes and each agent has different characteristics. Comparing πr5 to πr10, the winning and losing rates are almost the same as those in Table 3. The winning rates of these six agents are similar, with the highest, πr5, winning about 6% more than the lowest, πr10, as shown in Figure 10(a). However, πr6 is the agent with the lowest losing rate, and its losing rate is more than 10% lower than that of the other agents, as shown in Figure 10(b). For ATW, πr7 is below 100 s, πr6 is 108.7 s, and that of the other agents is more than 110 s, as shown in Figure 10(c). For ADT, πr5, πr6, and πr8 are below 120 s, πr7 is above 130 s, and πr9 and πr10 are between 120 and 130 s, as shown in Figure 10(d).

In summary, as the number of iterations increases, the red queen effect appears clearly. The performance of πr8, πr9, and πr10 is no better than that of πr6 and πr7. Although the winning rate of πr7 is not high and it spends more time at a disadvantage, it can always win quickly; it is an aggressive agent and can be used in scenarios where our aircraft needs to beat the opponent quickly. The winning rate of πr6 is similar to that of the other agents, while its losing rate is the lowest; only πr7 has a lower ATW, and its ADT is the lowest. In most cases it can be selected to achieve more victories while keeping itself safe. πr6 is an agent with good comprehensive performance, which means that the selection method of the league system is effective.

For the four initial scenarios, air combat simulations using πr6 are selected for behavior analysis, as shown in Figure 11. In the offensive scenario, the opponent tried to escape by turning in the opposite direction. Our aircraft used a lag pursuit tactic, which is simple but effective. At the 1st second, our aircraft did not keep up with the opponent but chose to fly straight. At the 9th second, it gradually turned toward the opponent, established and maintained the advantage, and finally won, as shown in Figure 11(a).

Table 3: League table (3 points for a win, 1 for a draw, 0 for a loss).

Rank  Agent (strategy)   Win    Draw   Loss   Score
1     πr6                1913   1373   1114   7112
2     πr7                1858   1323   1219   6897
3     πr5                1871   1276   1253   6889
4     πr9                1877   1226   1297   6857
5     πr8                1863   1230   1307   6819
6     πr10               1896   1081   1423   6769
7     πr4                1860   1160   1380   6740
8     πr3                1654   1263   1483   6225
9     πr2                1587   1236   1577   5997
10    πr1                1362   1332   1706   5418

Table 4: Algorithm comparison (results of the selected agent, out of 100 games per initial situation).

                     vs. ADP [20]   vs. Minimax-DQN [44]
Offensive   Win          57               66
            Draw         28               21
            Loss         15               13
Defensive   Win          19               20
            Draw         31               42
            Loss         50               38
Head-on     Win          43               46
            Draw         16               26
            Loss         41               28
Neutral     Win          32               43
            Draw         32               22
            Loss         36               35

Table 5: Time cost to generate an action.

Algorithm                 Time cost (ms)
The proposed algorithm    13.8
ADP                       14.2
Minimax-DQN               78.6


In the defensive scenario, our aircraft was at a disadvantageous position, and the opponent was trying to lock onto us with a lag pursuit tactic, as shown in Figure 11(b). At the 30th second, our aircraft adjusted its roll angle to the right to avoid being locked by the opponent, as shown in Figure 11(c). After that, it continued to adjust the roll angle, and by the time the opponent noticed that our strategy had changed, our aircraft had already flown out of the sector. In situations where it is difficult to win, the agent will use a safe maneuver strategy to obtain a draw.

Figure 10: Agent evaluation over the 1000 randomly generated scenarios, per agent number. (a) Number of wins. (b) Number of losses. (c) Average time to win (s). (d) Average disadvantage time (s).

Figure 11: Trajectories under the four typical initial situations. (a) Offensive. (b) Defensive (0–21 s). (c) Defensive (21–43 s). (d) Head-on (0–50 s). (e) Head-on (50–100 s). (f) Head-on (100–154 s). (g) Neutral (0–40 s). (h) Neutral (40–74 s).


In the head-on scenario, both sides adopted the same maneuver strategy in the first 50 s, that is, the maximum-overload turn, each waiting for an opportunity, as shown in Figure 11(d). The crucial decision was made at the 50th second: the opponent was still circling and waiting for an opportunity, while our aircraft stopped circling and reduced its turning rate to fly towards the opponent. At the 90th second, the aircraft reached an advantageous situation over its opponent, as shown in Figure 11(e). In the final stage, our aircraft adopted the lead pursuit tactic to establish and maintain the advantage, and it won, as shown in Figure 11(f).

In the neutral scenario, our initial strategy was to turn away from the opponent and look for opportunities. The opponent's strategy was lag pursuit, and it successfully reached the rear of our aircraft at the 31st second, as shown in Figure 11(g). However, after the 40th second, our aircraft made a wise decision to reduce its roll angle, thereby increasing its turning radius. At the 60th second, the disadvantageous situation was terminated, as shown in Figure 11(h). After that, our aircraft increased its roll angle, while the opponent remained in a maximum-overload right turn, which made it unable to get rid of our aircraft, and it finally lost. It can be seen that, trained by alternate freeze games using the DQN algorithm and selected through the league system, the agent can learn tactics such as lead pursuit, lag pursuit, and circling, and can use them in air combat.

5. Conclusion

In this paper, middleware connecting air combat simulation software and a reinforcement learning agent is developed. It offers researchers a way to design middleware that turns existing software into a reinforcement learning environment, which can expand the application field of reinforcement learning. Maneuver guidance agents with reward shaping are designed. Through training in this environment, an agent can guide an aircraft to fight against its opponent in air combat and reach an advantageous situation in most scenarios. An alternate freeze game algorithm is proposed and combined with RL; it can be used in nonstationary situations where the other players' strategies are variable. Through the league system, the agent whose performance is improved by the iterative training is selected. The league results show that the strategy quality of the selected agent is better than that of the other agents and that the red queen effect is avoided. Agents can learn some typical angle tactics and behaviors in the horizontal plane and perform these tactics by guiding an aircraft to maneuver in one-on-one air combat.

In future work, the problem would be extended to 3D maneuvering with less restrictive vehicle dynamics. The 3D formulation will lead to a larger state space and more complex actions, and the learning mechanism would be improved to deal with them. Beyond the extension to 3D maneuvering, a 2-vs-2 air combat scenario would be established, in which there are collaborators as well as adversaries. One option is to use a centralized controller, which turns the problem into a single-agent DRL problem that outputs the joint action of both aircraft at each time step. The other option is to adopt a decentralized system, in which each agent takes decisions for itself and the agents might cooperate to achieve a common goal.

In theory, the convergence of the alternate freeze game algorithm needs to be analyzed. Furthermore, in order to improve the diversity of potential opponents in a more complex air combat scenario, the league system would be refined.

Nomenclature

(x, y, z): 3-dimensional coordinates of an aircraft
v: Speed
θ: Flight path angle
ψ: Heading angle
β: Angle of attack
ϕ: Bank angle
m: Mass of an aircraft
g: Acceleration due to gravity
T: Thrust force
L: Lift force
D: Drag force
ψ̇: Turn rate
ϕ̇: Roll rate
dt: Distance between the aircraft and the target at time t
dmin: Minimum distance of the advantage position
dmax: Maximum distance of the advantage position
μt: Deviation angle at time t
μmax: Maximum deviation angle of the advantage position
ηt: Aspect angle at time t
ηmax: Maximum aspect angle of the advantage position
S: State space
St: State vector at time t
s: All states in the state space
A: Action space
At: Action at time t
a: All actions in the action space
π: Policy, a mapping S → A
R(St, At, St+1): Reward function
Rt: Scalar reward received at time t
V(s): Value function
Q(s, a): Action-value function
Q*(s, a): Optimal action-value function
Q(St, At): Estimated action-value function at time t
γ: Discount factor
α: Learning rate
T: Maximum simulation time in each episode
Δt: Training time step size
δt: Simulation time step size
F(St, At, St+1): Shaping reward function
Φ(St): A real-valued function on states
T(St): Termination reward function
D(St): Distance reward function
O(St): Orientation reward function
K: Number of training periods


Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grants 91338107 and U1836103 and the Development Program of Sichuan, China, under Grants 2017GZDZX0002, 18ZDYF3867, and 19ZDZX0024.

Supplementary Materials

Aircraft dynamics: the derivation of the kinematic and dynamic equations of the aircraft. Code: a simplified version of the environment used in this paper. It does not include the commercial software and middleware, but its aircraft model and other functions are the same as those proposed in this paper. This environment and the RL agent are packaged as supplementary material, through which the alternate freeze game DQN algorithm proposed in this paper can be reproduced. (Supplementary Materials)

References

[1] L. R. Shaw, "Basic fighter maneuvers," in Fighter Combat: Tactics and Maneuvering, Naval Institute Press, Annapolis, MD, USA, 1st edition, 1985.

[2] A. Toubman, J. J. Roessingh, P. Spronck, A. Plaat, and J. Herik, Dynamic Scripting with Team Coordination in Air Combat Simulation, Lecture Notes in Computer Science, Kaohsiung, Taiwan, 2014.

[3] A. Toubman, J. J. Roessingh, P. Spronck, A. Plaat, and J. Herik, "Centralized versus decentralized team coordination using dynamic scripting," in Proceedings of the European Modeling and Simulation Symposium, pp. 1–7, Porto, Portugal, 2014.

[4] A. Toubman, J. J. Roessingh, P. Spronck, A. Plaat, and J. Herik, "Rewarding air combat behavior in training simulations," in Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, pp. 1397–1402, Kowloon, China, 2015.

[5] A. Toubman, J. J. Roessingh, P. Spronck, A. Plaat, and J. Herik, "Transfer learning of air combat behavior," in Proceedings of the IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 226–231, Miami, FL, USA, 2015.

[6] D. I. You and D. H. Shim, "Design of an aerial combat guidance law using virtual pursuit point concept," Proceedings of the Institution of Mechanical Engineers, vol. 229, no. 5, pp. 1–22, 2014.

[7] H. Shin, J. Lee, and D. H. Shim, "Design of a virtual fighter pilot and simulation environment for unmanned combat aerial vehicles," in Proceedings of the Advances in Flight Control Systems, pp. 1–21, Grapevine, TX, USA, 2017.

[8] J. M. Eklund, J. Sprinkle, and S. S. Sastry, "Implementing and testing a nonlinear model predictive tracking controller for aerial pursuit/evasion games on a fixed wing aircraft," in Proceedings of the Control System Applications, pp. 1509–1514, Portland, OR, USA, 2005.

[9] J. M. Eklund, J. Sprinkle, and S. S. Sastry, "Switched and symmetric pursuit/evasion games using online model predictive control with application to autonomous aircraft," IEEE Transactions on Control Systems Technology, vol. 20, no. 3, pp. 604–620, 2012.

[10] F. Austin, G. Carbone, M. Falco, H. Hinz, and M. Lewis, Automated Maneuvering Decisions for Air-to-Air Combat, Grumman Corporate Research Center, Bethpage, NY, USA, 1987.

[11] F. Austin, G. Carbone, M. Falco, H. Hinz, and M. Lewis, "Game theory for automated maneuvering during air-to-air combat," Journal of Guidance, Control, and Dynamics, vol. 13, no. 6, pp. 1143–1149, 1990.

[12] Y. Ma, G. Wang, X. Hu, H. Luo, and X. Lei, "Cooperative occupancy decision making of multi-UAV in beyond-visual-range air combat: a game theory approach," IEEE Access, vol. 323, 2019.

[13] J. P. Hespanha, M. Prandini, and S. Sastry, "Probabilistic pursuit-evasion games: a one-step Nash approach," in Proceedings of the 36th IEEE Conference on Decision and Control, Sydney, NSW, Australia, 2000.

[14] A. Xu, Y. Kou, L. Yu, B. Xu, and Y. Lv, "Engagement maneuvering strategy of air combat based on fuzzy Markov game theory," in Proceedings of the IEEE International Conference on Computer, pp. 126–129, Wuhan, China, 2011.

[15] K. Horic and B. A. Conway, "Optimal fighter pursuit-evasion maneuvers found via two-sided optimization," Journal of Guidance, Control, and Dynamics, vol. 29, no. 1, pp. 105–112, 2006.

[16] K. Virtanen, T. Raivio, and R. P. Hamalainen, "Modeling pilot's sequential maneuvering decisions by a multistage influence diagram," Journal of Guidance, Control, and Dynamics, vol. 27, no. 4, pp. 665–677, 2004.

[17] K. Virtanen, J. Karelahti, and T. Raivio, "Modeling air combat by a moving horizon influence diagram game," Journal of Guidance, Control, and Dynamics, vol. 29, no. 5, pp. 1080–1091, 2006.

[18] Q. Pan, D. Zhou, J. Huang et al., "Maneuver decision for cooperative close-range air combat based on state predicted influence diagram," in Proceedings of the Information and Automation for Sustainability, pp. 726–730, Macau, China, 2017.

[19] R. E. Smith, B. A. Dike, R. K. Mehra, B. Ravichandran, and A. El-Fallah, "Classifier systems in combat: two-sided learning of maneuvers for advanced fighter aircraft," Computer Methods in Applied Mechanics and Engineering, vol. 186, no. 2–4, pp. 421–437, 2000.

[20] J. S. McGrew, J. P. How, B. Williams, and N. Roy, "Air-combat strategy using approximate dynamic programming," Journal of Guidance, Control, and Dynamics, vol. 33, no. 5, pp. 1641–1654, 2010.

[21] Y. Ma, X. Ma, and X. Song, "A case study on air combat decision using approximated dynamic programming," Mathematical Problems in Engineering, vol. 2014, 2014.

[22] I. Bisio, C. Garibotto, F. Lavagetto, A. Sciarrone, and S. Zappatore, "Blind detection: advanced techniques for WiFi-based drone surveillance," IEEE Transactions on Vehicular Technology, vol. 68, no. 1, pp. 938–946, 2019.

[23] Y. Li, "Deep reinforcement learning," 2018.

[24] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, London, UK, 2nd edition, 2018.

[25] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.

[26] V. Mnih, K. Kavukcuoglu, D. Silver et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.

[27] Z. Kavukcuoglu, R. Liu, Z. Meng, Y. Zhang, Y. Yu, and T. Lu, "On reinforcement learning for full-length game of StarCraft," 2018.

[28] D. Silver and C. J. Maddison, "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.

[29] D. Hubert and I. Antonoglou, "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play," Science, vol. 362, no. 6419, pp. 1140–1144, 2018.

[30] J. Schrittwieser, J. A. Bagnell, and J. Peters, "Reinforcement learning in robotics: a survey," The International Journal of Robotics Research, vol. 32, no. 11, pp. 1238–1274, 2013.

[31] A. Waldock, C. Greatwood, F. Salama, and T. Richardson, "Learning to perform a perched landing on the ground using deep reinforcement learning," Journal of Intelligent and Robotic Systems, vol. 92, no. 3-4, pp. 685–704, 2018.

[32] R. R. Alejandro, S. Carlos, B. Hriday, P. Paloma, and P. Campoy, "A deep reinforcement learning strategy for UAV autonomous landing on a moving platform," Journal of Intelligent and Robotic Systems, vol. 93, no. 1-2, pp. 351–366, 2019.

[33] W. Lou and X. Guo, "Adaptive trajectory tracking control using reinforcement learning for quadrotor," Journal of Intelligent and Robotic Systems, vol. 13, no. 1, pp. 1–10, 2016.

[34] C. Dunn, J. Valasek, and K. Kirkpatrick, "Unmanned air system search and localization guidance using reinforcement learning," in Proceedings of the AIAA Infotech Aerospace 2014 Conference, pp. 1–8, Garden Grove, CA, USA, 2014.

[35] S. You, M. Diao, and L. Gao, "Deep reinforcement learning for target searching in cognitive electronic warfare," IEEE Access, vol. 7, pp. 37432–37447, 2019.

[36] P. Luo, J. Xie, and W. Che, "Q-learning based air combat target assignment algorithm," in Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, pp. 779–783, Budapest, Hungary, 2016.

[37] M. Grzes, "Reward shaping in episodic reinforcement learning," in Proceedings of the International Foundation for Autonomous Agents and Multiagent Systems, pp. 565–573, São Paulo, Brazil, 2017.

[38] K. Tummer and A. Agogino, "Agent reward shaping for alleviating traffic congestion," in Proceedings of the International Foundation for Autonomous Agents and Multiagent Systems, pp. 1–7, Hakodate, Japan, 2006.

[39] L. L. B. V. Cruciol, A. C. de Arruda, L. Weigang, L. Li, and A. M. F. Crespo, "Reward functions for learning to control in air traffic flow management," Transportation Research Part C: Emerging Technologies, vol. 35, pp. 141–155, 2013.

[40] P. Hernandez-Leal, M. Kaisers, T. Baarslag, and E. M. de Cote, "A survey of learning in multiagent environments: dealing with non-stationarity," 2019.

[41] M. L. Littman, "Markov games as a framework for multi-agent reinforcement learning," in Proceedings of the International Conference on Machine Learning, pp. 157–163, New Brunswick, NJ, USA, 1994.

[42] M. L. Littman, "Friend-or-foe Q-learning in general-sum games," in Proceedings of the International Conference on Machine Learning, pp. 322–328, Williamstown, MA, USA, 2001.

[43] J. Hu and M. P. Wellman, "Nash Q-learning for general-sum stochastic games," Journal of Machine Learning Research, vol. 4, pp. 1039–1069, 2003.

[44] J. Fan, Z. Wang, Y. Xie, and Z. Yang, "A theoretical analysis of deep Q-learning," 2020.

[45] R. E. Smith, B. A. Dike, B. Ravichandran, A. Fallah, and R. Mehra, "Two-sided genetics-based learning to discover novel fighter combat maneuvers," in Lecture Notes in Computer Science, pp. 233–242, Springer, Berlin, Heidelberg, 2001.

[46] A. Ng, D. Harada, and S. Russell, "Policy invariance under reward transformations: theory and application to reward shaping," in Proceedings of the International Conference on Machine Learning, pp. 278–287, Bled, 1999.

[47] T. P. Lillicrap, J. J. Hunt, A. Pritzel et al., "Continuous control with deep reinforcement learning," in Proceedings of the International Conference on Learning Representations, Puerto Rico, USA, 2016.

[48] V. Mnih, A. P. Badia, M. Mirza et al., "Asynchronous methods for deep reinforcement learning," in Proceedings of the 32nd International Conference on Machine Learning, pp. 1–19, New York, NY, USA, 2016.

[49] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," 2017.

[50] Commercial Air Combat Simulation Software FG_SimStudio, http://www.eyextent.com/producttail117.

[51] M. Wiering and M. van Otterlo, Reinforcement Learning: State-of-the-Art, Springer Press, Heidelberg, Germany, 2014.

[52] D. P. Kingma and J. Ba, "Adam: a method for stochastic optimization," in Proceedings of the 32nd International Conference on Machine Learning, pp. 1–15, San Diego, CA, USA, 2015.


Page 9: ImprovingManeuverStrategyinAirCombatbyAlternateFreeze ...downloads.hindawi.com/journals/mpe/2020/7180639.pdf · 1CollegeofComputerScience,SichuanUniversity,Chengdu,Sichuan610065,China

scenario the initial state value of the red agent is largebecause of small |μr| and |ηb| according to equations (14)and (16) erefore in the offensive initial scenario the winprobability of our side will be larger than that of the op-ponent while the defensive initial scenario is reversed Inneutral and head-one initial scenarios the initial state valuesof both sides are almost equal e performance of the well-trained agents is verified according to the classification offour initial situations in the league system

42 Simulation Results e policy naming convention is πbi

or πri which means a blue or red policy produced after i

periods respectively After every 100 training episodes thetest episodes run 100 times and the learning curves areshown in Figure 7 For the first period (πr

1 vs πb0) due to the

simple maneuver policy of the blue aircraft the red aircraftcan achieve almost 100 success rate through about 4000episodes of training as shown in Figure 7(a) At the be-ginning most of the games are draw because of the randominitialization of the red agent and the evasion strategy of itsopponent Another reason is that the aircraft often flies outof the airspace in the early stages of training Although theagent will get a penalty for this the air combat episode willstill get a draw During the training process the red aircraft islooking for winning strategies in the airspace and itswinning rate is constantly increasing However because of

its pursuit of offensive and neglect of defensive the winningrate of its opponent is also rising In the later stage oftraining the red aircraft gradually understands the oppo-nentrsquos strategy and it can achieve a very high winning rate

With the iterations in games both agents have learnt theintelligent maneuver strategy which can guide the aircraft inthe pursuit-evasion game Figure 7(b) shows a typical ex-ample that πb

5 is the trainer and πr5 is its opponent After

about 5000 episodes of training the blue aircraft can reducethe losing rate to a lower level Since the red one adopts anintelligent maneuver guidance strategy the blue agentcannot learn a winning strategy in a short time After about20000 episodes the agent gradually understands the op-ponentrsquos maneuver strategy and its winning rate keepsincreasing e final winning rate is stable between 50 and60 and the losing rate is below 10

e training process iteration of πr10 vs πb

9 is shown inFigure 7(c) Because the maneuver strategy of the blue aircraftis an intelligent strategy which has been trained iterativelytraining for the red agent ismuchmore difficult than that in thefirst iteration e well-trained agent can win more than 60and lose less than 10 in the game against πb

9It is hard to get a higher winning rate in every training

except for the scenario that the opponent is πb0 is is because

both aircrafts are homogeneous the initial situations arerandomized and the opponent in the training environment isintelligent In some situations as long as the opponentrsquosstrategy is intelligent enough the trained agent is almostimpossible to win Interestingly in some scenarios where theopponent is intelligent enough and the aircraft cannot defeat itno matter how the agent operates the agent will guide theaircraft out of the airspace to obtain a draw is gives us aninspiration that is in a scenario where you cannot win it maybe the best way to get out of the combat area

For four initial air combat situations two aircrafts areguided by the agents trained in each iteration stage etypical flight trajectories are shown in Figure 8 Since πb

0 is astatic strategy πr

1 can easily win in offensive head-on andneutral situations as shown in Figures 8(a) 8(c) and 8(d) Inthe defensive situation the red aircraft first turns away fromits opponent who has the advantage and then looks foropportunities to establish an advantage situation ere areno fierce rivalries in the process of combat as shown inFigure 8(b)

Although πr5 is an intelligent strategy well-trained π

b5 can

still gain an advantage in combat as shown in Figures 8(e)ndash8(h) In the offensive situation (from the red aircraftrsquosperspective) the blue aircraft cannot easily get rid of redonersquos following and after fierce confrontation it can es-tablish an advantage situation as shown in Figure 8(e)Figure 8(f) shows the defensive situation that the blueaircraft can keep its advantage until it wins In the other twoscenarios the blue aircraft performs well especially in theinitial situation of neutral which adopts the delayed-turningtactics and successfully wins in air combat

πb9is an intelligent strategy and well-trained πr

10 canachieve more than half of the victory and less than 10 of thefailure However unlike πr

1 vs πb0 π

r10 cannot easily win

especially in head-on situations as shown in Figures 8(i)ndash8(l)

5000 100000Training episodes

ndash2

ndash1

0

1

2Av

erag

e rew

ard

Small NNsmall MBSmall NNmedium MBSmall NNlarge MBMedium NNsmall MBMedium NNmedium MB

Medium NNlarge MBLarge NNsmall MBLarge NNmedium MBLarge NNlarge MB

Figure 5 Evaluation of the neural network and minibatch

Mathematical Problems in Engineering 9

In the offensive situation the blue aircraft tries to get rid ofthe red one by constantly adjusting its bank angle but it isfinally captured by the red one In the defensive situation thered aircraft adopted the circling back tactics cleverly andquickly captured its opponent In the head-on situation thered and blue aircrafts alternately occupy an advantage sit-uation and after fierce maneuver confrontation the redaircraft finally wins In the neutral situation the red oneconstantly adjusts its bank angle according to the opponentrsquosposition and wins

For the above three pairs of strategies the head-on initialsituations are taken as examples to analyze the advantages ofboth sides in the game as shown in Figure 9 For the πr

1 vs πb0

scenario because πb0 has no intelligent maneuver guidance

ability although both sides are equally competitive at first halftime the red aircraft continues to expand its advantages untilit wins as shown in Figure 9(a) When the blue agent learnsthe intelligence strategy it can defeat the red one as shown in

Figure 9(b) For the πr10 vs πb

9 scenario the two sides havealternately established an advantage position and the redaircraft has not won easily as shown in Figure 9(c) Com-bining with the trajectories shown in Figures 8(c) 8(g) and8(k) the maneuver strategies of both sides have been im-proved using the alternate freeze game DQN algorithm eapproach has the benefit of discovering newmaneuver tactics

43 League Results e objective of the league system is toselect a red agent who generates maneuver guidance in-structions according to its strategy so as to achieve the bestresults without knowing the strategy of its opponent ereare ten agents on each side which are saved in the iterationof games Each red agent fights each blue agent 400 timesincluding 100 times for each of four initial scenarios edetailed league results are shown in Table 2 Results with awinloss ratio greater than 1 are highlighted in bold

μr

ηbOffensive

Defensive

Neutral

Head-on

(a)

Offensive

Defensivendash180

0

180

Asp

ect a

ngle

ηb (

degr

ee)

0 180ndash180Deviation angle μr (degree)

(b)

Figure 6 Situation recognition based on deviation angle and aspect angle (a) Four initial air combat situation categories (b) Situationrecognition diagram

1000 2000 3000 40000Training episodes

00

02

04

06

08

10

Perc

enta

ge

Blue win

Red winDraw

(a)

10000 200000Training episodes

00

02

04

06

08

10

Perc

enta

ge

Blue win

Red winDraw

(b)

10000 200000Training episodes

00

02

04

06

08

10

Perc

enta

ge

Blue win

Red winDraw

(c)

Figure 7 Learning curves in the training process (a) πr1 vs πb

0 (b) πb5 vs πr

5 (c) πr10 vs πb

g

10 Mathematical Problems in Engineering

0 03

3

6

6

9

9

12

12

12

16

(a)

133133

120 120

105105

90

9075

7560

60

45

45

30

30

15

15

00

(b)

0 016

16

32

32

48

48

64

64

80

80

(c)

00

33 6

6 99

1313

(d)

150

150

300300

30

30

240

2409060

60

120

180

180

270120

0 0270

210

210

90

(e)

0 0 2 2 44 6

6 88

(f )

40

120

120 0 0

100

100

14060

60

80

20

190

190160

16040

140

20 80

(g)

50 5040

40

30

30

20

2010

10

0

0

(h)

0

10

10

20

30

30

102

102

20

40

40

50

50

6060

7070

80

800

90

90

(i)

0 0

6

6 12

12

18

18

24

24

30

30

(j)

50

50

466466300

300150200

150350

00

250

250 400400

100

100

200

350

(k)

00 16

16

32

32

4848

64

64

79

79

(l)

Figure 8 Typical trajectory results in the training iteration of air combat games (a) πr1 vs πb

0 offensive (b) πr1 vs πb

0 defensive (c) πr1 vs πb

0head-on (d) πr

1 vs πb0 neutral (e) πb

5 vs πr5 offensive (f ) πb

5 vs πr5 defensive (g) πb

5 vs πr5 head-on (h) πb

5 vs πr5 neutral (i) πr

10 vs πbg

offensive (j) πr10 vs π

bg defensive (k) π

r10 vs π

bg head-on (l) π

r10 vs π

bg neutral

25 50 750Training steps

ndash050

ndash025

000

025

050

Adva

ntag

e fun

ctio

n va

lue

Red valueBlue value

(a)

100 2000Training steps

ndash050

ndash025

000

025

050

Adva

ntag

e fun

ctio

n va

lue

Red valueBlue value

(b)

200 4000Training steps

ndash05

00

05

Adva

ntag

e fun

ctio

n va

lue

Red valueBlue value

(c)

Figure 9 Advantage curves in head-on air combat games (a) πr1 vs πb

0 (b) πb5 vs πr

5 (c) πr10 vs πb

g

Mathematical Problems in Engineering 11

πri where i isin [1 10] is trained in the environment with

πbiminus1 as its opponent and πb

j is trained against πrj

j isin [1 10] e result of πr1 fighting with πb

0 is idealHowever since πb

0 is a static strategy rather than an in-telligent strategy πr

1 does not perform well in the gamewith other intelligent blue agents For πr

2 to πr10 although

there is no special training for against πb0 the performance

is still very good because πb0 is a static strategy In the

iterative game process the trained agents can get goodresults in the confrontation with the frozen agents in theenvironment For example πr

2 has an overwhelming ad-vantage over πb

1 so does πb2 and πr

2Because of the red queen effect although πr

10 has an ad-vantage against πb

9 it is in a weak position against the early blueagents such as πb

2 πb3 and πb

4 e league table is shown inTable 3 in which 3 points are for win 1 point for draw and 0points for loss In the early stage of training the performance ofthe trained agents will be gradually improvede later trainedagents are better than the former ones such as from πr

1 to πr6

As the alternate freeze training goes on the performance doesnot get better such as πr

7 to πr10e top of the score table is πr

6which not only has the highest score but also has advantageswhen playing with all blue agents except πb

6 as shown inTable 2 In the competition with all the opponents it can win44 and remain unbeaten at 75

In order to verify the performance of this agent it isconfronted with the opponent agents trained by ADP [20]and Minimax-DQN [44] Each of the four typical initialsituations is performed 100 times for both opponents andthe results are shown in Table 4 e time cost to generate anaction for each algorithm is shown in Table 5 It can be foundthat the performance of the agent presented in this paper iscomparable to that of ADP and the computational time isslightly reduced However ADP is a model-based methodwhich is assumed to know all the information such as the rollrate of the opponent Compared with Minimax-DQN theagent has the overall advantage is is because Minimax-DQN assumes that the opponent is rational while the op-ponent in this paper is not which is closer to the real worldIn addition compared with Minimax-DQN the computa-tional efficiency of the algorithm proposed has been sig-nificantly improved because linear programming is used toselect the optimal action at each step in Minimax-DQNwhich is time consuming

44 Agent Evaluation and Behavior Analysis Evaluatingagent performance in air combat is not simply a matter ofeither winning or losing In addition to winning and losingrate two criteria are added to represent success level average

Table 2 League results

πb0 πb

1 πb2 πb

3 πb4 πb

5 πb6 πb

7 πb8 πb

9 πb10

πr1

Win 393 17 83 60 137 75 96 126 119 133 123Draw 7 103 213 198 95 128 131 117 113 113 114Loss 0 280 104 142 168 197 173 157 168 154 163

πr2

Win 384 267 4 82 78 94 88 135 142 161 152Draw 16 120 128 178 143 137 157 76 100 87 94Loss 0 13 268 140 179 169 155 189 158 152 154

πr3

Win 387 185 253 5 82 97 103 123 134 154 131Draw 11 104 131 115 194 148 145 132 84 102 97Loss 2 111 16 280 124 155 152 145 182 144 172

πr4

Win 381 172 182 237 31 92 137 154 149 167 158Draw 19 123 102 117 110 151 101 97 144 92 104Loss 0 105 116 46 259 157 162 149 107 141 138

πr5

Win 382 164 172 187 221 29 95 137 168 154 162Draw 18 122 113 142 139 144 118 103 121 143 113Loss 0 114 115 71 40 227 187 160 111 103 125

πr6

Win 380 162 177 193 197 217 52 106 134 152 143Draw 19 105 98 132 101 134 139 194 176 131 144Loss 1 133 125 75 102 49 209 100 90 117 113

πr7

Win 383 169 169 180 165 187 210 56 82 122 133Draw 14 98 111 105 123 134 133 137 180 132 156Loss 3 133 120 115 112 79 57 207 138 144 111

πr8

Win 383 154 157 172 168 177 182 213 36 107 114Draw 16 123 109 114 132 102 97 134 141 124 138Loss 1 123 134 114 100 121 121 53 223 169 148

πr9

Win 387 140 162 157 154 162 179 182 215 37 102Draw 11 133 97 105 127 131 97 104 149 125 147Loss 2 127 141 138 119 107 124 114 36 238 151

πr10

Win 379 155 146 147 143 159 167 169 170 219 42Draw 10 102 84 92 105 97 131 104 85 140 131Loss 11 143 170 161 152 144 102 127 145 41 227

12 Mathematical Problems in Engineering

time to win (ATW) and average disadvantage time (ADT)ATW is measured as the average elapsed time required tomaneuver to the advantage position in the scenario whereour aircraft wins Smaller ATW is better than a larger oneADT is the average accumulated time that the advantagefunction value of our aircraft is less than that of the op-ponent which is used as a criterion to evaluate the riskexposure from the adversary weapons

A thousand air combat scenarios are generated in the simulation software, and the opponent in each scenario is randomly chosen from the ten blue agents. Each well-trained red agent is used to perform these confrontations, and the results are shown in Figure 10. Agent number i denotes the policy trained after i periods. From πr1 to πr5, the performance of the agents gradually improves. After πr5, their performance stabilizes, and each agent has different characteristics. Comparing πr5 to πr10, the winning and losing rates are almost the same as those in Table 3. The winning rates of these six agents are similar, with the highest, πr5, winning about 6% more than the lowest, πr10, as shown in Figure 10(a). However, πr6 is the agent with the lowest losing rate, and its losing rate is more than 10% lower than that of the other agents, as shown in Figure 10(b). For ATW, πr7 is less than 100 s, πr6 is 108.7 s, and that of the other agents is more than 110 s, as shown in Figure 10(c). For ADT, πr5, πr6, and πr8 are less than 120 s, πr7 is above 130 s, and πr9 and πr10 are between 120 and 130 s, as shown in Figure 10(d).

In summary, with the increase in the number of iterations, the red queen effect appears clearly. The performance of πr8, πr9, and πr10 is no better than that of πr6 and πr7. Although the winning rate of πr7 is not high and it has more disadvantage time, it can always win quickly. It is an aggressive agent and can be used in scenarios where our aircraft needs to beat the opponent quickly. The winning rate of πr6 is similar to that of the other agents, while its losing rate is the lowest; its ATW is only higher than that of πr7, and its ADT is the lowest. In most cases, it can be selected to achieve more victories while keeping itself safe. πr6 is an agent with good comprehensive performance, which means that the selection method of the league system is effective.
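The evaluation protocol described at the beginning of this subsection (one thousand random scenarios, each against a blue opponent drawn at random from the ten trained blue agents) can be sketched as follows; run_episode and the agent objects are placeholders standing in for the simulation middleware, not the released code.

```python
import random


def evaluate_red_agent(red_agent, blue_pool, run_episode, n_scenarios=1000):
    """Confront one trained red agent with randomly generated scenarios, each
    against a blue opponent drawn at random from the pool of trained blue agents.
    `run_episode(red, blue)` is a placeholder for one simulated engagement and
    is assumed to return 'win', 'draw', or 'loss' from the red side's view."""
    tally = {'win': 0, 'draw': 0, 'loss': 0}
    for _ in range(n_scenarios):
        blue_agent = random.choice(blue_pool)
        tally[run_episode(red_agent, blue_agent)] += 1
    return tally
```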

For the four initial scenarios, air combat simulation using πr6 is selected for behavior analysis, as shown in Figure 11. In the offensive scenario, the opponent tried to escape by turning in the opposite direction. Our aircraft used a lag pursuit tactic, which is simple but effective. At the 1st second, our aircraft did not keep up with the opponent but chose to fly straight. At the 9th second, it gradually turned toward the opponent, established and maintained the advantage, and finally won, as shown in Figure 11(a).

In the defensive scenario, our aircraft was at a disadvantage position, and the opponent was trying to lock us in with a lag pursuit tactic, as shown in Figure 11(b). At the 30th second, our aircraft adjusted its roll angle to the right to avoid being locked by the opponent, as shown in Figure 11(c). After that, it continued to adjust the roll angle, and when the opponent noticed that our strategy had changed, our aircraft had already flown out of the sector. In situations where it is difficult to win, the agent uses a safe maneuver strategy to obtain a draw.

Table 3: League table (3 points per win, 1 point per draw, 0 points per loss).

Rank   Agent (strategy)   Win    Draw   Loss   Score
1      πr6                1913   1373   1114   7112
2      πr7                1858   1323   1219   6897
3      πr5                1871   1276   1253   6889
4      πr9                1877   1226   1297   6857
5      πr8                1863   1230   1307   6819
6      πr10               1896   1081   1423   6769
7      πr4                1860   1160   1380   6740
8      πr3                1654   1263   1483   6225
9      πr2                1587   1236   1577   5997
10     πr1                1362   1332   1706   5418

Table 4: Algorithm comparison.

                      vs ADP [20]   vs Minimax-DQN [44]
Offensive   Win           57              66
            Draw          28              21
            Loss          15              13
Defensive   Win           19              20
            Draw          31              42
            Loss          50              38
Head-on     Win           43              46
            Draw          16              26
            Loss          41              28
Neutral     Win           32              43
            Draw          32              22
            Loss          36              35

Table 5: Time cost to generate an action.

Algorithm                 Time cost (ms)
The proposed algorithm    138
ADP                       142
Minimax-DQN               786
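The Score column of Table 3 follows directly from the tallies of Table 2 under the league scoring of 3 points per win and 1 point per draw. A minimal check using the πr6 row of Table 2:

```python
# Win/draw tallies of pi_r6 against pi_b0 ... pi_b10, copied from Table 2.
wins  = [380, 162, 177, 193, 197, 217, 52, 106, 134, 152, 143]
draws = [19, 105, 98, 132, 101, 134, 139, 194, 176, 131, 144]

score = 3 * sum(wins) + 1 * sum(draws)
print(sum(wins), sum(draws), score)  # 1913 1373 7112, matching the first row of Table 3
```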


Figure 10: Agent evaluation. (a) Number of wins. (b) Number of losses. (c) Average time to win. (d) Average disadvantage time.


Figure 11: Trajectories under four typical initial situations. (a) Offensive. (b) Defensive (0–21 s). (c) Defensive (21–43 s). (d) Head-on (0–50 s). (e) Head-on (50–100 s). (f) Head-on (100–154 s). (g) Neutral (0–40 s). (h) Neutral (40–74 s).


In the head-on scenario, both sides adopted the same maneuver strategy in the first 50 s, that is, the maximum overload turn while waiting for opportunities, as shown in Figure 11(d). The crucial decision was made in the 50th second: the opponent was still hovering and waiting for an opportunity, while our aircraft stopped hovering and reduced its turning rate to fly towards the opponent. At the 90th second, the aircraft reached an advantage situation over its opponent, as shown in Figure 11(e). In the final stage, our aircraft adopted the lead pursuit tactic to establish and maintain the advantage, and won, as shown in Figure 11(f).

In the neutral scenario, our initial strategy was to turn away from the opponent to find opportunities. The opponent's strategy was lag pursuit, and it successfully reached the rear of our aircraft at the 31st second, as shown in Figure 11(g). However, after the 40th second, our aircraft made a wise decision to reduce its roll angle, thereby increasing the turning radius. At the 60th second, the disadvantaged situation was terminated, as shown in Figure 11(h). After that, our aircraft increased its roll angle, and the opponent was in a state of maximum-overload right turn, which made it unable to get rid of our aircraft, and finally it lost. It can be seen that, trained by alternate freeze games using the DQN algorithm and selected through the league system, the agent can learn tactics such as lead pursuit, lag pursuit, and hovering, and can use them in air combat.

5. Conclusion

In this paper, middleware connecting air combat simulation software and the reinforcement learning agent is developed. It provides an idea for researchers to design different middleware to transform existing software into reinforcement learning environments, which can expand the application field of reinforcement learning. Maneuver guidance agents with reward shaping are designed. Through training in the environment, an agent can guide an aircraft to fight against its opponent in air combat and reach an advantage situation in most scenarios. An alternate freeze game algorithm is proposed and combined with RL. It can be used in nonstationary situations where the other players' strategies are variable. Through the league system, the agent with improved performance after iterative training is selected. The league results show that the strategy quality of the selected agent is better than that of the other agents, and the red queen effect is avoided. Agents can learn some typical angle tactics and behaviors in the horizontal plane and perform these tactics by guiding an aircraft to maneuver in one-on-one air combat.
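As a schematic summary of the training procedure described above, the following sketch shows the alternate freeze loop and the league-based selection. The function and agent interfaces are placeholders for the middleware-backed environment and the DQN agent, not the code released with the paper.

```python
def alternate_freeze_training(initial_blue, new_agent, train_one_period, league_score, K):
    """Alternate freeze game: in each period one side is trained with a DRL
    algorithm while the other side's strategy stays frozen, then the roles swap.
    All frozen agents enter the league, and the red agent with the best league
    score over every blue agent is selected.

    new_agent()                     -> a fresh trainable agent (e.g., a DQN agent)
    train_one_period(agent, frozen) -> trains `agent` against the frozen opponent
                                       and returns the resulting frozen policy
    league_score(red, blue_pool)    -> league points of `red` against all blue agents
    """
    red_pool, blue_pool = [], [initial_blue]   # blue_pool[0] is the static strategy pi_b0
    for _ in range(K):
        red_pool.append(train_one_period(new_agent(), blue_pool[-1]))   # train red, blue frozen
        blue_pool.append(train_one_period(new_agent(), red_pool[-1]))   # train blue, red frozen
    # League system: pick the red agent that scores best against the whole blue pool,
    # rather than the one that merely beats its most recent opponent (red queen effect).
    return max(red_pool, key=lambda red: league_score(red, blue_pool))
```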

In future works, the problem would be extended to 3D maneuvering with less restrictive vehicle dynamics. The 3D formulation will lead to a larger state space and more complex actions, and the learning mechanism would be improved to deal with them. Beyond an extension to 3D maneuvering, a 2-vs-2 air combat scenario would be established, in which there are collaborators as well as adversaries. One option is to use a centralized controller, which turns the problem into a single-agent DRL problem, to obtain the joint action of both aircraft to execute in each time step. The other option is to adopt a decentralized system, in which both agents take decisions for themselves and might cooperate to achieve a common goal.

In theory, the convergence of the alternate freeze game algorithm needs to be analyzed. Furthermore, in order to improve the diversity of potential opponents in a more complex air combat scenario, the league system would be refined.

Nomenclature

(x, y, z)         3-dimensional coordinates of an aircraft
v                 Speed
θ                 Flight path angle
ψ                 Heading angle
β                 Angle of attack
ϕ                 Bank angle
m                 Mass of an aircraft
g                 Acceleration due to gravity
T                 Thrust force
L                 Lift force
D                 Drag force
ψ̇                 Turn rate
ϕ̇                 Roll rate
dt                Distance between the aircraft and the target at time t
dmin              Minimum distance of the advantage position
dmax              Maximum distance of the advantage position
μt                Deviation angle at time t
μmax              Maximum deviation angle of the advantage position
ηt                Aspect angle at time t
ηmax              Maximum aspect angle of the advantage position
S                 State space
St                State vector at time t
s                 All states in the state space
A                 Action space
At                Action at time t
a                 All actions in the action space
π: S → A          Policy
R(St, At, St+1)   Reward function
Rt                Scalar reward received at time t
V(s)              Value function
Q(s, a)           Action-value function
Q*(s, a)          Optimal action-value function
Q(St, At)         Estimated action-value function at time t
γ                 Discount factor
α                 Learning rate
T                 Maximum simulation time in each episode
Δt                Training time step size
δt                Simulation time step size
F(St, At, St+1)   Shaping reward function
Φ(St)             A real-valued function on states
T(St)             Termination reward function
D(St)             Distance reward function
O(St)             Orientation reward function
K                 Number of training periods
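Several of these symbols jointly describe the advantage position (a distance within [dmin, dmax], a deviation angle within ±μmax, and an aspect angle within ±ηmax). The small helper below expresses that condition as one plausible reading of the definitions above; the threshold values themselves are not repeated here and are passed in as parameters.

```python
def in_advantage_position(d_t, mu_t, eta_t, d_min, d_max, mu_max, eta_max):
    """True when the distance to the target lies in the advantage band and both
    the deviation angle and the aspect angle are within their maximum values."""
    return (d_min <= d_t <= d_max
            and abs(mu_t) <= mu_max
            and abs(eta_t) <= eta_max)
```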


Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grants 91338107 and U1836103, and by the Development Program of Sichuan, China, under Grants 2017GZDZX0002, 18ZDYF3867, and 19ZDZX0024.

Supplementary Materials

Aircraft dynamics: the derivation of the kinematic and dynamic equations of the aircraft. Code: a simplified version of the environment used in this paper. It does not include the commercial software and middleware, but its aircraft model and other functions are the same as those proposed in this paper. This environment and the RL agent are packaged as a supplementary material. Through this material, the alternate freeze game DQN algorithm proposed in this paper can be reproduced. (Supplementary Materials)

References

[1] L. R. Shaw, "Basic fighter maneuvers," in Fighter Combat Tactics and Maneuvering, M. D. Annapolis, Ed., Naval Institute Press, New York, NY, USA, 1st edition, 1985.
[2] A. Toubman, J. J. Roessingh, P. Spronck, A. Plaat, and J. Herik, Dynamic Scripting with Team Coordination in Air Combat Simulation, Lecture Notes in Computer Science, Kaohsiung, Taiwan, 2014.
[3] A. Toubman, J. J. Roessingh, P. Spronck, A. Plaat, and J. Herik, "Centralized versus decentralized team coordination using dynamic scripting," in Proceedings of the European Modeling and Simulation Symposium, pp. 1–7, Porto, Portugal, 2014.
[4] A. Toubman, J. J. Roessingh, P. Spronck, A. Plaat, and J. Herik, "Rewarding air combat behavior in training simulations," in Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, pp. 1397–1402, Kowloon, China, 2015.
[5] A. Toubman, J. J. Roessingh, P. Spronck, A. Plaat, and J. Herik, "Transfer learning of air combat behavior," in Proceedings of the IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 226–231, Miami, FL, USA, 2015.
[6] D. I. You and D. H. Shim, "Design of an aerial combat guidance law using virtual pursuit point concept," Proceedings of the Institution of Mechanical Engineers, vol. 229, no. 5, pp. 1–22, 2014.
[7] H. Shin, J. Lee, and D. H. Shim, "Design of a virtual fighter pilot and simulation environment for unmanned combat aerial vehicles," in Proceedings of the Advances in Flight Control Systems, pp. 1–21, Grapevine, TX, USA, 2017.
[8] J. M. Eklund, J. Sprinkle, and S. S. Sastry, "Implementing and testing a nonlinear model predictive tracking controller for aerial pursuit/evasion games on a fixed wing aircraft," in Proceedings of the Control System Applications, pp. 1509–1514, Portland, OR, USA, 2005.
[9] J. M. Eklund, J. Sprinkle, and S. S. Sastry, "Switched and symmetric pursuit/evasion games using online model predictive control with application to autonomous aircraft," IEEE Transactions on Control Systems Technology, vol. 20, no. 3, pp. 604–620, 2012.
[10] F. Austin, G. Carbone, M. Falco, H. Hinz, and M. Lewis, Automated Maneuvering Decisions for Air-to-Air Combat, Grumman Corporate Research Center, Bethpage, NY, USA, 1987.
[11] F. Austin, G. Carbone, M. Falco, H. Hinz, and M. Lewis, "Game theory for automated maneuvering during air-to-air combat," Journal of Guidance, Control, and Dynamics, vol. 13, no. 6, pp. 1143–1149, 1990.
[12] Y. Ma, G. Wang, X. Hu, H. Luo, and X. Lei, "Cooperative occupancy decision making of multi-UAV in beyond-visual-range air combat: a game theory approach," IEEE Access, vol. 323, 2019.
[13] J. P. Hespanha, M. Prandini, and S. Sastry, "Probabilistic pursuit-evasion games: a one-step Nash approach," in Proceedings of the 36th IEEE Conference on Decision and Control, Sydney, NSW, Australia, 2000.
[14] A. Xu, Y. Kou, L. Yu, B. Xu, and Y. Lv, "Engagement maneuvering strategy of air combat based on fuzzy Markov game theory," in Proceedings of the IEEE International Conference on Computer, pp. 126–129, Wuhan, China, 2011.
[15] K. Horic and B. A. Conway, "Optimal fighter pursuit-evasion maneuvers found via two-sided optimization," Journal of Guidance, Control, and Dynamics, vol. 29, no. 1, pp. 105–112, 2006.
[16] K. Virtanen, T. Raivio, and R. P. Hamalainen, "Modeling pilot's sequential maneuvering decisions by a multistage influence diagram," Journal of Guidance, Control, and Dynamics, vol. 27, no. 4, pp. 665–677, 2004.
[17] K. Virtanen, J. Karelahti, and T. Raivio, "Modeling air combat by a moving horizon influence diagram game," Journal of Guidance, Control, and Dynamics, vol. 29, no. 5, pp. 1080–1091, 2006.
[18] Q. Pan, D. Zhou, J. Huang et al., "Maneuver decision for cooperative close-range air combat based on state predicted influence diagram," in Proceedings of the Information and Automation for Sustainability, pp. 726–730, Macau, China, 2017.
[19] R. E. Smith, B. A. Dike, R. K. Mehra, B. Ravichandran, and A. El-Fallah, "Classifier systems in combat: two-sided learning of maneuvers for advanced fighter aircraft," Computer Methods in Applied Mechanics and Engineering, vol. 186, no. 2–4, pp. 421–437, 2000.
[20] J. S. McGrew, J. P. How, B. Williams, and N. Roy, "Air-combat strategy using approximate dynamic programming," Journal of Guidance, Control, and Dynamics, vol. 33, no. 5, pp. 1641–1654, 2010.
[21] Y. Ma, X. Ma, and X. Song, "A case study on air combat decision using approximated dynamic programming," Mathematical Problems in Engineering, vol. 2014, 2014.
[22] I. Bisio, C. Garibotto, F. Lavagetto, A. Sciarrone, and S. Zappatore, "Blind detection: advanced techniques for WiFi-based drone surveillance," IEEE Transactions on Vehicular Technology, vol. 68, no. 1, pp. 938–946, 2019.
[23] Y. Li, "Deep reinforcement learning," 2018.
[24] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, London, UK, 2nd edition, 2018.
[25] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[26] V. Mnih, K. Kavukcuoglu, D. Silver et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[27] Z. Kavukcuoglu, R. Liu, Z. Meng, Y. Zhang, Y. Yu, and T. Lu, "On reinforcement learning for full-length game of StarCraft," 2018.
[28] D. Silver and C. J. Maddison, "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[29] D. Hubert and I. Antonoglou, "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play," Science, vol. 362, no. 6419, pp. 1140–1144, 2018.
[30] J. Schrittwieser, J. A. Bagnell, and J. Peters, "Reinforcement learning in robotics: a survey," The International Journal of Robotics Research, vol. 32, no. 11, pp. 1238–1274, 2013.
[31] A. Waldock, C. Greatwood, F. Salama, and T. Richardson, "Learning to perform a perched landing on the ground using deep reinforcement learning," Journal of Intelligent and Robotic Systems, vol. 92, no. 3-4, pp. 685–704, 2018.
[32] R. R. Alejandro, S. Carlos, B. Hriday, P. Paloma, and P. Campoy, "A deep reinforcement learning strategy for UAV autonomous landing on a moving platform," Journal of Intelligent and Robotic Systems, vol. 93, no. 1-2, pp. 351–366, 2019.
[33] W. Lou and X. Guo, "Adaptive trajectory tracking control using reinforcement learning for quadrotor," Journal of Intelligent and Robotic Systems, vol. 13, no. 1, pp. 1–10, 2016.
[34] C. Dunn, J. Valasek, and K. Kirkpatrick, "Unmanned air system search and localization guidance using reinforcement learning," in Proceedings of the AIAA Infotech Aerospace 2014 Conference, pp. 1–8, Garden Grove, CA, USA, 2014.
[35] S. You, M. Diao, and L. Gao, "Deep reinforcement learning for target searching in cognitive electronic warfare," IEEE Access, vol. 7, pp. 37432–37447, 2019.
[36] P. Luo, J. Xie, W. Che, and C. Man, "Q-learning based air combat target assignment algorithm," in Proceedings of the IEEE International Conference on Systems, pp. 779–783, Budapest, Hungary, 2016.
[37] M. Grzes, "Reward shaping in episodic reinforcement learning," in Proceedings of the International Foundation for Autonomous Agents and Multiagent Systems, pp. 565–573, São Paulo, Brazil, 2017.
[38] K. Tummer and A. Agogino, "Agent reward shaping for alleviating traffic congestion," in Proceedings of the International Foundation for Autonomous Agents and Multiagent Systems, pp. 1–7, Hakodate, Japan, 2006.
[39] L. L. B. V. Cruciol, A. C. de Arruda, L. Weigang, L. Li, and A. M. F. Crespo, "Reward functions for learning to control in air traffic flow management," Transportation Research Part C: Emerging Technologies, vol. 35, pp. 141–155, 2013.
[40] P. Hernandez-Leal, M. Kaisers, T. Baarslag, and E. M. de Cote, "A survey of learning in multiagent environments: dealing with non-stationarity," 2019.
[41] M. L. Littman, "Markov games as a framework for multi-agent reinforcement learning," in Proceedings of the International Conference on Machine Learning, pp. 157–163, New Brunswick, NJ, USA, 1994.
[42] M. L. Littman, "Friend-or-foe Q-learning in general-sum games," in Proceedings of the International Conference on Machine Learning, pp. 322–328, Williamstown, MA, USA, 2001.
[43] J. Hu and M. P. Wellman, "Nash Q-learning for general-sum stochastic games," Journal of Machine Learning Research, vol. 4, pp. 1039–1069, 2003.
[44] J. Fan, Z. Wang, Y. Xie, and Z. Yang, "A theoretical analysis of deep Q-learning," 2020.
[45] R. E. Smith, B. A. Dike, B. Ravichandran, A. Fallah, and R. Mehra, "Two-sided genetics-based learning to discover novel fighter combat maneuvers," in Lecture Notes in Computer Science, pp. 233–242, Springer, Berlin, Heidelberg, 2001.
[46] A. Ng, D. Harada, and S. Russell, "Policy invariance under reward transformations: theory and application to reward shaping," in Proceedings of the International Conference on Machine Learning, pp. 278–287, Bled, Slovenia, 1999.
[47] T. P. Lillicrap, J. J. Hunt, A. Pritzel et al., "Continuous control with deep reinforcement learning," in Proceedings of the International Conference on Learning Representations, Puerto Rico, 2016.
[48] V. Mnih, A. P. Badia, M. Mirza et al., "Asynchronous methods for deep reinforcement learning," in Proceedings of the International Conference on Machine Learning, pp. 1–19, New York, NY, USA, 2016.
[49] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," 2017.
[50] Commercial Air Combat Simulation Software FG_SimStudio, http://www.eyextent.com/producttail117.
[51] M. Wiering and M. van Otterlo, Reinforcement Learning: State-of-the-Art, Springer Press, Heidelberg, Germany, 2014.
[52] D. P. Kingma and J. Ba, "Adam: a method for stochastic optimization," in Proceedings of the International Conference on Learning Representations, pp. 1–15, San Diego, CA, USA, 2015.





0 03

3

6

6

9

9

12

12

12

16

(a)

133133

120 120

105105

90

9075

7560

60

45

45

30

30

15

15

00

(b)

0 016

16

32

32

48

48

64

64

80

80

(c)

00

33 6

6 99

1313

(d)

150

150

300300

30

30

240

2409060

60

120

180

180

270120

0 0270

210

210

90

(e)

0 0 2 2 44 6

6 88

(f )

40

120

120 0 0

100

100

14060

60

80

20

190

190160

16040

140

20 80

(g)

50 5040

40

30

30

20

2010

10

0

0

(h)

0

10

10

20

30

30

102

102

20

40

40

50

50

6060

7070

80

800

90

90

(i)

0 0

6

6 12

12

18

18

24

24

30

30

(j)

50

50

466466300

300150200

150350

00

250

250 400400

100

100

200

350

(k)

00 16

16

32

32

4848

64

64

79

79

(l)

Figure 8 Typical trajectory results in the training iteration of air combat games (a) πr1 vs πb

0 offensive (b) πr1 vs πb

0 defensive (c) πr1 vs πb

0head-on (d) πr

1 vs πb0 neutral (e) πb

5 vs πr5 offensive (f ) πb

5 vs πr5 defensive (g) πb

5 vs πr5 head-on (h) πb

5 vs πr5 neutral (i) πr

10 vs πbg

offensive (j) πr10 vs π

bg defensive (k) π

r10 vs π

bg head-on (l) π

r10 vs π

bg neutral

25 50 750Training steps

ndash050

ndash025

000

025

050

Adva

ntag

e fun

ctio

n va

lue

Red valueBlue value

(a)

100 2000Training steps

ndash050

ndash025

000

025

050

Adva

ntag

e fun

ctio

n va

lue

Red valueBlue value

(b)

200 4000Training steps

ndash05

00

05

Adva

ntag

e fun

ctio

n va

lue

Red valueBlue value

(c)

Figure 9 Advantage curves in head-on air combat games (a) πr1 vs πb

0 (b) πb5 vs πr

5 (c) πr10 vs πb

g

Mathematical Problems in Engineering 11

πri where i isin [1 10] is trained in the environment with

πbiminus1 as its opponent and πb

j is trained against πrj

j isin [1 10] e result of πr1 fighting with πb

0 is idealHowever since πb

0 is a static strategy rather than an in-telligent strategy πr

1 does not perform well in the gamewith other intelligent blue agents For πr

2 to πr10 although

there is no special training for against πb0 the performance

is still very good because πb0 is a static strategy In the

iterative game process the trained agents can get goodresults in the confrontation with the frozen agents in theenvironment For example πr

2 has an overwhelming ad-vantage over πb

1 so does πb2 and πr

2Because of the red queen effect although πr

10 has an ad-vantage against πb

9 it is in a weak position against the early blueagents such as πb

2 πb3 and πb

4 e league table is shown inTable 3 in which 3 points are for win 1 point for draw and 0points for loss In the early stage of training the performance ofthe trained agents will be gradually improvede later trainedagents are better than the former ones such as from πr

1 to πr6

As the alternate freeze training goes on the performance doesnot get better such as πr

7 to πr10e top of the score table is πr

6which not only has the highest score but also has advantageswhen playing with all blue agents except πb

6 as shown inTable 2 In the competition with all the opponents it can win44 and remain unbeaten at 75

In order to verify the performance of this agent it isconfronted with the opponent agents trained by ADP [20]and Minimax-DQN [44] Each of the four typical initialsituations is performed 100 times for both opponents andthe results are shown in Table 4 e time cost to generate anaction for each algorithm is shown in Table 5 It can be foundthat the performance of the agent presented in this paper iscomparable to that of ADP and the computational time isslightly reduced However ADP is a model-based methodwhich is assumed to know all the information such as the rollrate of the opponent Compared with Minimax-DQN theagent has the overall advantage is is because Minimax-DQN assumes that the opponent is rational while the op-ponent in this paper is not which is closer to the real worldIn addition compared with Minimax-DQN the computa-tional efficiency of the algorithm proposed has been sig-nificantly improved because linear programming is used toselect the optimal action at each step in Minimax-DQNwhich is time consuming

44 Agent Evaluation and Behavior Analysis Evaluatingagent performance in air combat is not simply a matter ofeither winning or losing In addition to winning and losingrate two criteria are added to represent success level average

Table 2 League results

πb0 πb

1 πb2 πb

3 πb4 πb

5 πb6 πb

7 πb8 πb

9 πb10

πr1

Win 393 17 83 60 137 75 96 126 119 133 123Draw 7 103 213 198 95 128 131 117 113 113 114Loss 0 280 104 142 168 197 173 157 168 154 163

πr2

Win 384 267 4 82 78 94 88 135 142 161 152Draw 16 120 128 178 143 137 157 76 100 87 94Loss 0 13 268 140 179 169 155 189 158 152 154

πr3

Win 387 185 253 5 82 97 103 123 134 154 131Draw 11 104 131 115 194 148 145 132 84 102 97Loss 2 111 16 280 124 155 152 145 182 144 172

πr4

Win 381 172 182 237 31 92 137 154 149 167 158Draw 19 123 102 117 110 151 101 97 144 92 104Loss 0 105 116 46 259 157 162 149 107 141 138

πr5

Win 382 164 172 187 221 29 95 137 168 154 162Draw 18 122 113 142 139 144 118 103 121 143 113Loss 0 114 115 71 40 227 187 160 111 103 125

πr6

Win 380 162 177 193 197 217 52 106 134 152 143Draw 19 105 98 132 101 134 139 194 176 131 144Loss 1 133 125 75 102 49 209 100 90 117 113

πr7

Win 383 169 169 180 165 187 210 56 82 122 133Draw 14 98 111 105 123 134 133 137 180 132 156Loss 3 133 120 115 112 79 57 207 138 144 111

πr8

Win 383 154 157 172 168 177 182 213 36 107 114Draw 16 123 109 114 132 102 97 134 141 124 138Loss 1 123 134 114 100 121 121 53 223 169 148

πr9

Win 387 140 162 157 154 162 179 182 215 37 102Draw 11 133 97 105 127 131 97 104 149 125 147Loss 2 127 141 138 119 107 124 114 36 238 151

πr10

Win 379 155 146 147 143 159 167 169 170 219 42Draw 10 102 84 92 105 97 131 104 85 140 131Loss 11 143 170 161 152 144 102 127 145 41 227

12 Mathematical Problems in Engineering

time to win (ATW) and average disadvantage time (ADT)ATW is measured as the average elapsed time required tomaneuver to the advantage position in the scenario whereour aircraft wins Smaller ATW is better than a larger oneADT is the average accumulated time that the advantagefunction value of our aircraft is less than that of the op-ponent which is used as a criterion to evaluate the riskexposure from the adversary weapons

A thousand air combat scenarios are generated in the simulation software, and the opponent in each scenario is randomly chosen from the ten blue agents. Each well-trained red agent is used to perform these confrontations, and the results are shown in Figure 10. Agent number i denotes the policy trained after i periods. From πr1 to πr5, the performance of the agents gradually improves. After πr5, their performance stabilizes, and each agent has different characteristics. Comparing πr5 to πr10, the winning and losing rates are almost the same as those in Table 3. The winning rates of these six agents are similar, with the highest, πr5, winning 6% more than the lowest, πr10, as shown in Figure 10(a). However, πr6 is the agent with the lowest losing rate, and its losing rate is more than 10% lower than that of the other agents, as shown in Figure 10(b). For ATW, πr7 is below 100 s, πr6 is 108.7 s, and the other agents are above 110 s, as shown in Figure 10(c). For ADT, πr5, πr6, and πr8 are below 120 s, πr7 is above 130 s, and πr9 and πr10 are between 120 and 130 s, as shown in Figure 10(d).

In summary, as the number of iterations increases, the red queen effect becomes apparent. The performance of πr8, πr9, and πr10 is no better than that of πr6 and πr7. Although the winning rate of πr7 is not high and it accumulates more disadvantage time, it can always win quickly; it is an aggressive agent and can be used in scenarios where our aircraft needs to beat the opponent quickly. The winning rate of πr6 is similar to that of the other agents, while its losing rate is the lowest; its ATW is the second lowest, higher only than that of πr7, and its ADT is the lowest. In most cases, it can be selected to achieve more victories while keeping itself safe. πr6 is therefore an agent with good comprehensive performance, which shows that the selection method of the league system is effective.
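A sketch of this evaluation protocol, reusing the ATW/ADT helpers above, might look as follows; the scenario generator, the policy objects, and simulate_episode are placeholders standing in for the simulation software and the trained agents, not functions defined in the paper.

import random

def evaluate_red_agent(red_policy, blue_policies, make_scenario, simulate_episode,
                       num_scenarios=1000, seed=0):
    """Confront one red agent with randomly drawn blue opponents and tally the outcomes."""
    rng = random.Random(seed)
    tally = {"win": 0, "draw": 0, "loss": 0}
    episodes = []
    for _ in range(num_scenarios):
        blue_policy = rng.choice(blue_policies)   # opponent drawn at random from the ten blue agents
        scenario = make_scenario(rng)             # random initial air combat situation
        record = simulate_episode(red_policy, blue_policy, scenario)
        tally[record.outcome] += 1
        episodes.append(record)
    return tally, average_time_to_win(episodes), average_disadvantage_time(episodes)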

For the four initial scenarios, air combat simulations using πr6 are selected for behavior analysis, as shown in Figure 11. In the offensive scenario, the opponent tried to escape by turning in the opposite direction. Our aircraft used a lag pursuit tactic, which is simple but effective. At the 1st second, our aircraft did not keep up with the opponent but chose to fly straight. At the 9th second, it gradually turned towards the opponent, established and maintained the advantage, and finally won, as shown in Figure 11(a).

In the defensive scenario, our aircraft was at a disadvantage position, and the opponent was trying to lock us in with a lag pursuit tactic, as shown in Figure 11(b).

Table 3: League table.

Rank   Agent (strategy)   Win    Draw   Loss   Score
1      πr6                1913   1373   1114   7112
2      πr7                1858   1323   1219   6897
3      πr5                1871   1276   1253   6889
4      πr9                1877   1226   1297   6857
5      πr8                1863   1230   1307   6819
6      πr10               1896   1081   1423   6769
7      πr4                1860   1160   1380   6740
8      πr3                1654   1263   1483   6225
9      πr2                1587   1236   1577   5997
10     πr1                1362   1332   1706   5418
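The league table can be reproduced from Table 2 by summing each red agent's wins, draws, and losses over the eleven blue opponents; the scores are consistent with 3 points per win and 1 point per draw. A small Python check of that aggregation, shown here for πr1 only, follows.

def league_row(wins, draws, losses, win_points=3, draw_points=1):
    """Aggregate one red agent's per-opponent results (Table 2) into a league-table row (Table 3)."""
    total_win, total_draw, total_loss = sum(wins), sum(draws), sum(losses)
    score = win_points * total_win + draw_points * total_draw
    return total_win, total_draw, total_loss, score

# Per-opponent results of pi_r1 against pi_b0 ... pi_b10, copied from Table 2.
pi_r1_wins   = [393, 17, 83, 60, 137, 75, 96, 126, 119, 133, 123]
pi_r1_draws  = [7, 103, 213, 198, 95, 128, 131, 117, 113, 113, 114]
pi_r1_losses = [0, 280, 104, 142, 168, 197, 173, 157, 168, 154, 163]

print(league_row(pi_r1_wins, pi_r1_draws, pi_r1_losses))  # -> (1362, 1332, 1706, 5418), matching Table 3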

Table 4: Algorithm comparison.

                     vs ADP [20]   vs Minimax-DQN [44]
Offensive   Win          57                66
            Draw         28                21
            Loss         15                13
Defensive   Win          19                20
            Draw         31                42
            Loss         50                38
Head-on     Win          43                46
            Draw         16                26
            Loss         41                28
Neutral     Win          32                43
            Draw         32                22
            Loss         36                35

Table 5: Time cost to generate an action.

Algorithm                 Time cost (ms)
The proposed algorithm        13.8
ADP                           14.2
Minimax-DQN                   78.6


At the 30th second, our aircraft adjusted its roll angle to the right to avoid being locked by the opponent, as shown in Figure 11(c). After that, it continued to adjust the roll angle, and by the time the opponent noticed that our strategy had changed, our aircraft had already flown out of the sector. In situations where it is difficult to win, the agent uses a safe maneuver strategy to obtain a draw.

Figure 10: Agent evaluation (x-axis: agent number 1-10). (a) Number of wins. (b) Number of losses. (c) Average time to win (s). (d) Average disadvantage time (s).

Figure 11: Trajectories under four typical initial situations. (a) Offensive. (b) Defensive (0-21 s). (c) Defensive (21-43 s). (d) Head-on (0-50 s). (e) Head-on (50-100 s). (f) Head-on (100-154 s). (g) Neutral (0-40 s). (h) Neutral (40-74 s).


In the head-on scenario, both sides adopted the same maneuver strategy in the first 50 s, that is, a maximum-overload turn while waiting for opportunities, as shown in Figure 11(d). The crucial decision was made in the 50th second: the opponent was still hovering and waiting for an opportunity, while our aircraft stopped hovering and reduced its turning rate to fly towards the opponent. At the 90th second, our aircraft reached an advantage situation over its opponent, as shown in Figure 11(e). In the final stage, our aircraft adopted the lead pursuit tactic to establish and maintain the advantage and won, as shown in Figure 11(f).

In the neutral scenario, our initial strategy was to turn away from the opponent to look for opportunities. The opponent's strategy was lag pursuit, and it successfully reached the rear of our aircraft at the 31st second, as shown in Figure 11(g). However, after the 40th second, our aircraft made a wise decision to reduce its roll angle, thereby increasing the turning radius. At the 60th second, the disadvantaged situation was terminated, as shown in Figure 11(h). After that, our aircraft increased its roll angle, and the opponent was left in a state of maximum-overload right turn, which made it unable to get rid of our aircraft, and it finally lost. It can be seen that, trained by alternate freeze games using the DQN algorithm and selected through the league system, the agent can learn tactics such as lead pursuit, lag pursuit, and hovering and can use them in air combat.

5. Conclusion

In this paper, middleware connecting air combat simulation software and the reinforcement learning agent is developed. It offers researchers an approach for designing different middleware to transform existing software into reinforcement learning environments, which can expand the application field of reinforcement learning. Maneuver guidance agents with reward shaping are designed. Through training in the environment, an agent can guide an aircraft to fight against its opponent in air combat and reach an advantage situation in most scenarios. An alternate freeze game algorithm is proposed and combined with RL; it can be used in nonstationary situations where the other players' strategies are variable. Through the league system, the agent whose performance improves over iterative training is selected. The league results show that the strategy quality of the selected agent is better than that of the other agents and that the red queen effect is avoided. Agents can learn some typical angle tactics and behaviors in the horizontal plane and perform these tactics by guiding an aircraft to maneuver in one-on-one air combat.
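As a compact illustration of the alternate freeze idea summarized above, the following Python sketch alternates DQN training between the two sides, freezing one policy while the other learns, and then ranks the resulting red policies by a league-style evaluation. The DQNAgent class, environment factory, and evaluate helper are placeholders rather than the interfaces of the paper's middleware or supplementary code.

def alternate_freeze_training(make_env, DQNAgent, evaluate, num_periods=10, initial_blue=None):
    """Alternately train red against a frozen blue policy, then blue against the newly frozen red.

    Returns the best red policy selected by a round-robin (league) evaluation, together with
    the pools of red and blue policies produced after each training period.
    """
    red_pool, blue_pool = [], [initial_blue]      # pi_b0: the initial (static) blue strategy
    frozen_blue = initial_blue
    for period in range(1, num_periods + 1):
        # Train a red agent while the blue opponent's strategy is frozen in the environment.
        red = DQNAgent(side="red")
        red.train(make_env(opponent=frozen_blue))
        red_pool.append(red.policy())

        # Freeze the new red policy and train a blue agent against it.
        blue = DQNAgent(side="blue")
        blue.train(make_env(opponent=red.policy()))
        blue_pool.append(blue.policy())
        frozen_blue = blue.policy()

    # League stage: play every red policy against every blue policy and keep the best scorer.
    scores = [sum(evaluate(r, b) for b in blue_pool) for r in red_pool]
    best_red = red_pool[scores.index(max(scores))]
    return best_red, red_pool, blue_pool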

In future works, the problem would be extended to 3D maneuvering with less restrictive vehicle dynamics. The 3D formulation will lead to a larger state space and more complex actions, and the learning mechanism would be improved to deal with them. Beyond an extension to 3D maneuvering, a 2-vs-2 air combat scenario would be established, in which there are collaborators as well as adversaries. One option is to use a centralized controller, which turns the problem into a single-agent DRL problem, to obtain the joint action for both aircraft to execute in each time step. The other option is to adopt a decentralized system in which both agents make decisions for themselves and might cooperate to achieve a common goal.

In theory, the convergence of the alternate freeze game algorithm needs to be analyzed. Furthermore, in order to improve the diversity of potential opponents in a more complex air combat scenario, the league system would be refined.

Nomenclature

(x, y, z): 3-dimensional coordinates of an aircraft
v: Speed
θ: Flight path angle
ψ: Heading angle
β: Angle of attack
ϕ: Bank angle
m: Mass of an aircraft
g: Acceleration due to gravity
T: Thrust force
L: Lift force
D: Drag force
ψ̇: Turn rate
ϕ̇: Roll rate
d_t: Distance between the aircraft and the target at time t
d_min: Minimum distance of the advantage position
d_max: Maximum distance of the advantage position
μ_t: Deviation angle at time t
μ_max: Maximum deviation angle of the advantage position
η_t: Aspect angle at time t
η_max: Maximum aspect angle of the advantage position
S: State space
S_t: State vector at time t
s: All states in the state space
A: Action space
A_t: Action at time t
a: All actions in the action space
π (S → A): Policy
R(S_t, A_t, S_{t+1}): Reward function
R_t: Scalar reward received at time t
V(s): Value function
Q(s, a): Action-value function
Q*(s, a): Optimal action-value function
Q(S_t, A_t): Estimated action-value function at time t
γ: Discount factor
α: Learning rate
T: Maximum simulation time in each episode
Δt: Training time step size
δt: Simulation time step size
F(S_t, A_t, S_{t+1}): Shaping reward function
Φ(S_t): A real-valued function on states
T(S_t): Termination reward function
D(S_t): Distance reward function
O(S_t): Orientation reward function
K: Number of training periods
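For reference, the shaping term F can be written in the potential-based form of Ng et al. [46]; the sketch below assumes that form and, purely as an assumption for illustration, builds the potential Φ from the distance reward D and the orientation reward O. The paper's exact composition of the reward terms may differ.

def shaping_reward(phi_next, phi_curr, gamma=0.99):
    """Potential-based shaping in the sense of [46]: F(S_t, A_t, S_{t+1}) = gamma * Phi(S_{t+1}) - Phi(S_t)."""
    return gamma * phi_next - phi_curr

def shaped_step_reward(termination, d_curr, o_curr, d_next, o_next, gamma=0.99):
    """Illustrative per-step reward: termination reward T(S_t) plus a shaping term whose
    potential is assumed (for illustration only) to combine the distance reward D(S_t)
    and the orientation reward O(S_t)."""
    phi_curr = d_curr + o_curr    # assumed Phi(S_t)
    phi_next = d_next + o_next    # assumed Phi(S_{t+1})
    return termination + shaping_reward(phi_next, phi_curr, gamma)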


Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grants 91338107 and U1836103 and by the Development Program of Sichuan, China, under Grants 2017GZDZX0002, 18ZDYF3867, and 19ZDZX0024.

Supplementary Materials

Aircraft dynamics: the derivation of the kinematic and dynamic equations of the aircraft. Code: a simplified version of the environment used in this paper. It does not include the commercial software and middleware, but its aircraft model and other functions are the same as those proposed in this paper. This environment and the RL agent are packaged as supplementary material. Through this material, the alternate freeze game DQN algorithm proposed in this paper can be reproduced. (Supplementary Materials)

References

[1] L. R. Shaw, "Basic fighter maneuvers," in Fighter Combat: Tactics and Maneuvering, M. D. Annapolis, Ed., Naval Institute Press, New York, NY, USA, 1st edition, 1985.

[2] A. Toubman, J. J. Roessingh, P. Spronck, A. Plaat, and J. Herik, Dynamic Scripting with Team Coordination in Air Combat Simulation, Lecture Notes in Computer Science, Kaohsiung, Taiwan, 2014.

[3] A. Toubman, J. J. Roessingh, P. Spronck, A. Plaat, and J. Herik, "Centralized versus decentralized team coordination using dynamic scripting," in Proceedings of the European Modeling and Simulation Symposium, pp. 1-7, Porto, Portugal, 2014.

[4] A. Toubman, J. J. Roessingh, P. Spronck, A. Plaat, and J. Herik, "Rewarding air combat behavior in training simulations," in Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, pp. 1397-1402, Kowloon, China, 2015.

[5] A. Toubman, J. J. Roessingh, P. Spronck, A. Plaat, and J. Herik, "Transfer learning of air combat behavior," in Proceedings of the IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 226-231, Miami, FL, USA, 2015.

[6] D. I. You and D. H. Shim, "Design of an aerial combat guidance law using virtual pursuit point concept," Proceedings of the Institution of Mechanical Engineers, vol. 229, no. 5, pp. 1-22, 2014.

[7] H. Shin, J. Lee, and D. H. Shim, "Design of a virtual fighter pilot and simulation environment for unmanned combat aerial vehicles," in Proceedings of the Advances in Flight Control Systems, pp. 1-21, Grapevine, TX, USA, 2017.

[8] J. M. Eklund, J. Sprinkle, and S. S. Sastry, "Implementing and testing a nonlinear model predictive tracking controller for aerial pursuit/evasion games on a fixed wing aircraft," in Proceedings of the Control System Applications, pp. 1509-1514, Portland, OR, USA, 2005.

[9] J. M. Eklund, J. Sprinkle, and S. S. Sastry, "Switched and symmetric pursuit/evasion games using online model predictive control with application to autonomous aircraft," IEEE Transactions on Control Systems Technology, vol. 20, no. 3, pp. 604-620, 2012.

[10] F. Austin, G. Carbone, M. Falco, H. Hinz, and M. Lewis, Automated Maneuvering Decisions for Air-to-Air Combat, Grumman Corporate Research Center, Bethpage, NY, USA, 1987.

[11] F. Austin, G. Carbone, M. Falco, H. Hinz, and M. Lewis, "Game theory for automated maneuvering during air-to-air combat," Journal of Guidance, Control, and Dynamics, vol. 13, no. 6, pp. 1143-1149, 1990.

[12] Y. Ma, G. Wang, X. Hu, H. Luo, and X. Lei, "Cooperative occupancy decision making of multi-UAV in beyond-visual-range air combat: a game theory approach," IEEE Access, vol. 323, 2019.

[13] J. P. Hespanha, M. Prandini, and S. Sastry, "Probabilistic pursuit-evasion games: a one-step Nash approach," in Proceedings of the 36th IEEE Conference on Decision and Control, Sydney, NSW, Australia, 2000.

[14] A. Xu, Y. Kou, L. Yu, B. Xu, and Y. Lv, "Engagement maneuvering strategy of air combat based on fuzzy Markov game theory," in Proceedings of the IEEE International Conference on Computer, pp. 126-129, Wuhan, China, 2011.

[15] K. Horic and B. A. Conway, "Optimal fighter pursuit-evasion maneuvers found via two-sided optimization," Journal of Guidance, Control, and Dynamics, vol. 29, no. 1, pp. 105-112, 2006.

[16] K. Virtanen, T. Raivio, and R. P. Hamalainen, "Modeling pilot's sequential maneuvering decisions by a multistage influence diagram," Journal of Guidance, Control, and Dynamics, vol. 27, no. 4, pp. 665-677, 2004.

[17] K. Virtanen, J. Karelahti, and T. Raivio, "Modeling air combat by a moving horizon influence diagram game," Journal of Guidance, Control, and Dynamics, vol. 29, no. 5, pp. 1080-1091, 2006.

[18] Q. Pan, D. Zhou, J. Huang et al., "Maneuver decision for cooperative close-range air combat based on state predicted influence diagram," in Proceedings of the Information and Automation for Sustainability, pp. 726-730, Macau, China, 2017.

[19] R. E. Smith, B. A. Dike, R. K. Mehra, B. Ravichandran, and A. El-Fallah, "Classifier systems in combat: two-sided learning of maneuvers for advanced fighter aircraft," Computer Methods in Applied Mechanics and Engineering, vol. 186, no. 2-4, pp. 421-437, 2000.

[20] J. S. McGrew, J. P. How, B. Williams, and N. Roy, "Air-combat strategy using approximate dynamic programming," Journal of Guidance, Control, and Dynamics, vol. 33, no. 5, pp. 1641-1654, 2010.

[21] Y. Ma, X. Ma, and X. Song, "A case study on air combat decision using approximated dynamic programming," Mathematical Problems in Engineering, vol. 2014, 2014.

[22] I. Bisio, C. Garibotto, F. Lavagetto, A. Sciarrone, and S. Zappatore, "Blind detection: advanced techniques for WiFi-based drone surveillance," IEEE Transactions on Vehicular Technology, vol. 68, no. 1, pp. 938-946, 2019.

[23] Y. Li, "Deep reinforcement learning," 2018.

[24] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, London, UK, 2nd edition, 2018.

[25] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436-444, 2015.

[26] V. Mnih, K. Kavukcuoglu, D. Silver et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529-533, 2015.

[27] Z. Kavukcuoglu, R. Liu, Z. Meng, Y. Zhang, Y. Yu, and T. Lu, "On reinforcement learning for full-length game of StarCraft," 2018.

[28] D. Silver and C. J. Maddison, "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484-489, 2016.

[29] D. Hubert and I. Antonoglou, "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play," Science, vol. 362, no. 6419, pp. 1140-1144, 2018.

[30] J. Schrittwieser, J. A. Bagnell, and J. Peters, "Reinforcement learning in robotics: a survey," The International Journal of Robotics Research, vol. 32, no. 11, pp. 1238-1274, 2013.

[31] A. Waldock, C. Greatwood, F. Salama, and T. Richardson, "Learning to perform a perched landing on the ground using deep reinforcement learning," Journal of Intelligent and Robotic Systems, vol. 92, no. 3-4, pp. 685-704, 2018.

[32] R. R. Alejandro, S. Carlos, B. Hriday, P. Paloma, and P. Campoy, "A deep reinforcement learning strategy for UAV autonomous landing on a moving platform," Journal of Intelligent and Robotic Systems, vol. 93, no. 1-2, pp. 351-366, 2019.

[33] W. Lou and X. Guo, "Adaptive trajectory tracking control using reinforcement learning for quadrotor," Journal of Intelligent and Robotic Systems, vol. 13, no. 1, pp. 1-10, 2016.

[34] C. Dunn, J. Valasek, and K. Kirkpatrick, "Unmanned air system search and localization guidance using reinforcement learning," in Proceedings of the AIAA Infotech@Aerospace 2014 Conference, pp. 1-8, Garden Grove, CA, USA, 2014.

[35] S. You, M. Diao, and L. Gao, "Deep reinforcement learning for target searching in cognitive electronic warfare," IEEE Access, vol. 7, pp. 37432-37447, 2019.

[36] P. Luo, J. Xie, and W. Che, "Q-learning based air combat target assignment algorithm," in Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, pp. 779-783, Budapest, Hungary, 2016.

[37] M. Grzes, "Reward shaping in episodic reinforcement learning," in Proceedings of the International Foundation for Autonomous Agents and Multiagent Systems, pp. 565-573, Sao Paulo, Brazil, 2017.

[38] K. Tummer and A. Agogino, "Agent reward shaping for alleviating traffic congestion," in Proceedings of the International Foundation for Autonomous Agents and Multiagent Systems, pp. 1-7, Hakodate, Japan, 2006.

[39] L. L. B. V. Cruciol, A. C. de Arruda, L. Weigang, L. Li, and A. M. F. Crespo, "Reward functions for learning to control in air traffic flow management," Transportation Research Part C: Emerging Technologies, vol. 35, pp. 141-155, 2013.

[40] P. Hernandez-Leal, M. Kaisers, T. Baarslag, and E. M. de Cote, "A survey of learning in multiagent environments: dealing with non-stationarity," 2019.

[41] M. L. Littman, "Markov games as a framework for multi-agent reinforcement learning," in Proceedings of the International Conference on Machine Learning, pp. 157-163, New Brunswick, NJ, USA, 1994.

[42] M. L. Littman, "Friend-or-foe Q-learning in general-sum games," in Proceedings of the International Conference on Machine Learning, pp. 322-328, Williamstown, MA, USA, 2001.

[43] J. Hu and M. P. Wellman, "Nash Q-learning for general-sum stochastic games," Journal of Machine Learning Research, vol. 4, pp. 1039-1069, 2003.

[44] J. Fan, Z. Wang, Y. Xie, and Z. Yang, "A theoretical analysis of deep Q-learning," 2020.

[45] R. E. Smith, B. A. Dike, B. Ravichandran, A. Fallah, and R. Mehra, "Two-sided genetics-based learning to discover novel fighter combat maneuvers," in Lecture Notes in Computer Science, pp. 233-242, Springer, Berlin, Heidelberg, 2001.

[46] A. Ng, D. Harada, and S. Russell, "Policy invariance under reward transformations: theory and application to reward shaping," in Proceedings of the International Conference on Machine Learning, pp. 278-287, Bled, Slovenia, 1999.

[47] T. P. Lillicrap, J. J. Hunt, A. Pritzel et al., "Continuous control with deep reinforcement learning," in Proceedings of the International Conference on Learning Representations, Puerto Rico, USA, 2016.

[48] V. Mnih, A. P. Badia, M. Mirza et al., "Asynchronous methods for deep reinforcement learning," in Proceedings of the International Conference on Machine Learning, pp. 1-19, New York, NY, USA, 2016.

[49] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," 2017.

[50] Commercial Air Combat Simulation Software FG_SimStudio, http://www.eyextent.com/producttail117.

[51] M. Wiering and M. van Otterlo, Reinforcement Learning: State-of-the-Art, Springer Press, Heidelberg, Germany, 2014.

[52] D. P. Kingma and J. Ba, "Adam: a method for stochastic optimization," in Proceedings of the International Conference on Machine Learning, pp. 1-15, San Diego, CA, USA, 2015.


Page 12: ImprovingManeuverStrategyinAirCombatbyAlternateFreeze ...downloads.hindawi.com/journals/mpe/2020/7180639.pdf · 1CollegeofComputerScience,SichuanUniversity,Chengdu,Sichuan610065,China

πri where i isin [1 10] is trained in the environment with

πbiminus1 as its opponent and πb

j is trained against πrj

j isin [1 10] e result of πr1 fighting with πb

0 is idealHowever since πb

0 is a static strategy rather than an in-telligent strategy πr

1 does not perform well in the gamewith other intelligent blue agents For πr

2 to πr10 although

there is no special training for against πb0 the performance

is still very good because πb0 is a static strategy In the

iterative game process the trained agents can get goodresults in the confrontation with the frozen agents in theenvironment For example πr

2 has an overwhelming ad-vantage over πb

1 so does πb2 and πr

2Because of the red queen effect although πr

10 has an ad-vantage against πb

9 it is in a weak position against the early blueagents such as πb

2 πb3 and πb

4 e league table is shown inTable 3 in which 3 points are for win 1 point for draw and 0points for loss In the early stage of training the performance ofthe trained agents will be gradually improvede later trainedagents are better than the former ones such as from πr

1 to πr6

As the alternate freeze training goes on the performance doesnot get better such as πr

7 to πr10e top of the score table is πr

6which not only has the highest score but also has advantageswhen playing with all blue agents except πb

6 as shown inTable 2 In the competition with all the opponents it can win44 and remain unbeaten at 75

In order to verify the performance of this agent it isconfronted with the opponent agents trained by ADP [20]and Minimax-DQN [44] Each of the four typical initialsituations is performed 100 times for both opponents andthe results are shown in Table 4 e time cost to generate anaction for each algorithm is shown in Table 5 It can be foundthat the performance of the agent presented in this paper iscomparable to that of ADP and the computational time isslightly reduced However ADP is a model-based methodwhich is assumed to know all the information such as the rollrate of the opponent Compared with Minimax-DQN theagent has the overall advantage is is because Minimax-DQN assumes that the opponent is rational while the op-ponent in this paper is not which is closer to the real worldIn addition compared with Minimax-DQN the computa-tional efficiency of the algorithm proposed has been sig-nificantly improved because linear programming is used toselect the optimal action at each step in Minimax-DQNwhich is time consuming

44 Agent Evaluation and Behavior Analysis Evaluatingagent performance in air combat is not simply a matter ofeither winning or losing In addition to winning and losingrate two criteria are added to represent success level average

Table 2 League results

πb0 πb

1 πb2 πb

3 πb4 πb

5 πb6 πb

7 πb8 πb

9 πb10

πr1

Win 393 17 83 60 137 75 96 126 119 133 123Draw 7 103 213 198 95 128 131 117 113 113 114Loss 0 280 104 142 168 197 173 157 168 154 163

πr2

Win 384 267 4 82 78 94 88 135 142 161 152Draw 16 120 128 178 143 137 157 76 100 87 94Loss 0 13 268 140 179 169 155 189 158 152 154

πr3

Win 387 185 253 5 82 97 103 123 134 154 131Draw 11 104 131 115 194 148 145 132 84 102 97Loss 2 111 16 280 124 155 152 145 182 144 172

πr4

Win 381 172 182 237 31 92 137 154 149 167 158Draw 19 123 102 117 110 151 101 97 144 92 104Loss 0 105 116 46 259 157 162 149 107 141 138

πr5

Win 382 164 172 187 221 29 95 137 168 154 162Draw 18 122 113 142 139 144 118 103 121 143 113Loss 0 114 115 71 40 227 187 160 111 103 125

πr6

Win 380 162 177 193 197 217 52 106 134 152 143Draw 19 105 98 132 101 134 139 194 176 131 144Loss 1 133 125 75 102 49 209 100 90 117 113

πr7

Win 383 169 169 180 165 187 210 56 82 122 133Draw 14 98 111 105 123 134 133 137 180 132 156Loss 3 133 120 115 112 79 57 207 138 144 111

πr8

Win 383 154 157 172 168 177 182 213 36 107 114Draw 16 123 109 114 132 102 97 134 141 124 138Loss 1 123 134 114 100 121 121 53 223 169 148

πr9

Win 387 140 162 157 154 162 179 182 215 37 102Draw 11 133 97 105 127 131 97 104 149 125 147Loss 2 127 141 138 119 107 124 114 36 238 151

πr10

Win 379 155 146 147 143 159 167 169 170 219 42Draw 10 102 84 92 105 97 131 104 85 140 131Loss 11 143 170 161 152 144 102 127 145 41 227

12 Mathematical Problems in Engineering

time to win (ATW) and average disadvantage time (ADT)ATW is measured as the average elapsed time required tomaneuver to the advantage position in the scenario whereour aircraft wins Smaller ATW is better than a larger oneADT is the average accumulated time that the advantagefunction value of our aircraft is less than that of the op-ponent which is used as a criterion to evaluate the riskexposure from the adversary weapons

A thousand air combat scenarios are generated insimulation software and the opponent in each scenario israndomly chosen from ten blue agents Each well-trainedred agent is used to perform these confrontations and theresults are shown in Figure 10 Agent number i means thepolicy trained after i periods From πr

1 to πr5 the per-

formance of agents is gradually improving After πr5 their

performance is stabilized and each agent has differentcharacteristics Comparing πr

5 to πr10 the winning and

losing rates are almost the same as those in Table 3 ewinning rate of the six agents is similar with the highestπr5 winning 6 higher than the lowest πr

10 as shown inFigure 10(a) However πr

6 is the agent with the lowestlosing rate and its losing rate is more than 10 lower than

that of other agents as shown in Figure 10(b) For ATWπr7 is less than 100 s πr

6 is 1087 s and that of other agents ismore than 110 s as shown in Figure 10(c) For ADT πr

5πr6 and πr

8 are less than 120 s πr7 is above 130 s and πr

9 andπr10 are between 120 and 130 s as shown in Figure 10(d)In summary with the increase in the number of itera-

tions the red queen effect appears obviously e perfor-mance of πr

8 πr9 and πr

10 is no better than that of πr6 and πr

7Although the winning rate of πr

7 is not high and has moredisadvantage time it can always win quickly It is an ag-gressive agent and can be used in scenarios where ouraircraft needs to beat the opponent quicklye winning rateof πr

6 is similar to other agents while its losing rate is thelowest Its ATW is only higher than πr

7 and ADT is thelowest In most cases it can be selected to achieve morevictories while keeping itself safe πr

6 is an agent with goodcomprehensive performance which means that the selectionmethod of the league system is effective

For the four initial scenarios air combat simulationusing πr

6 is selected for behavior analysis as shown inFigure 11 In the offensive scenario the opponent tried toescape by turning in the opposite direction Our aircraft useda lag pursuit tactic which is simple but effective At the 1stsecond our aircraft did not keep up with the opponent butchose to fly straight At the 9th second it gradually turned tothe opponent established and maintained the advantageand finally won as shown in Figure 11(a)

In the defensive scenario our aircraft was at a disad-vantage position and the opponent was trying to lock us in

Table 3 League table

Rank Agent (strategy) Win Draw Loss Score1 πr

6 1913 1373 1114 71122 πr

7 1858 1323 1219 68973 πr

5 1871 1276 1253 68894 πr

9 1877 1226 1297 68575 πr

8 1863 1230 1307 68196 πr

10 1896 1081 1423 67697 πr

4 1860 1160 1380 67408 πr

3 1654 1263 1483 62259 πr

2 1587 1236 1577 599710 πr

1 1362 1332 1706 5418

Table 4 Algorithm comparison

vs ADP [20] vs Minimax-DQN [44]

OffensiveWin 57 66Draw 28 21Loss 15 13

DefensiveWin 19 20Draw 31 42Loss 50 38

Head-onWin 43 46Draw 16 26Loss 41 28

NeutralWin 32 43Draw 32 22Loss 36 35

Table 5 Time cost to generate an action

Algorithm Time cost (ms)e proposed algorithm 138ADP 142Minimax-DQN 786

Mathematical Problems in Engineering 13

with a lag pursuit tactic as shown in Figure 11(b) At the30th second our aircraft adjusted its roll angle to the right toavoid being locked by the opponent as shown inFigure 11(c) After that it continued to adjust the roll angle

and when the opponent noticed that our strategy hadchanged our aircraft had already flown out of the sector Insituations where it is difficult to win the agent will use a safemaneuver strategy to get a draw

2 3 4 5 6 7 8 9 101Agent number

Win

s

300

351

387

425443440436438429

418

(a)

2 3 4 5 6 7 8 9 101Agent number

Loss

es

402

371342

309

279259

278281290312

(b)

2 3 4 5 6 7 8 9 101Agent number

Tim

e (s)

1513 15041437

10871159 1163

1176

984

1215

2234

(c)

2 3 4 5 6 7 8 9 101Agent number

Tim

e (s)

1354

1528

1296

11701157

1199

1113

1289 12961251

(d)

Figure 10 Agent evaluation (a) Number of winning (b) Number of losing (c) Average time to win (d) Average disadvantage time

115

5

9

9

13

13

17

17

21

21

24

24

(a)

11

4

4

7

7

10

10

13

13

16

16

19

19

21

21

(b)

21

21

24

24

27

27

30

30

33

33

36 36

39 39

43 43

(c)

1 1 66 111116

16

21

21

26

26

31

31

36

36

41

41

46

46

50

50

(d)

50

50

55

55

60

60

65

65

70

70

75

75

80

8085

85 90

90 95

95 100

100

(e)

100

100

105

105

110110

115115

120

120

125

125

130

130

135

135

140

140

145

145

150

150

154

154

(f )

40

40

36

36

31

31

26

26

21

21

16

16

11

11

6

611

(g)

74 7470

7065

6560

6055

5550

5045

45

40

40

(h)

Figure 11 Trajectories under four typical initial situations (a) Offensive (b) Defensive (0ndash21 s) (c) Defensive (21ndash43 s) (d) Head-on(0ndash50 s) (e) Head-on (50ndash100 s) (f ) Head-on (100ndash154 s) (g) Neutral (0ndash40 s) (h) Neutral (40ndash74 s)

14 Mathematical Problems in Engineering

In the head-on scenario both sides adopted the samemaneuver strategy in the first 50 s that is the maximumoverload turn and waiting for opportunities as shown inFigure 11(d) e crucial decision was made in the 50thsecond the opponent was still hovering and waiting for anopportunity while our aircraft stopped hovering and re-duced its turning rate to fly towards the opponent At the90th second the aircraft reached an advantage situation overits opponent as shown in Figure 11(e) In the final stage ouraircraft adopted the lead pursuit tactic to establish andmaintain the advantage and won as shown in Figure 11(f )

In the neutral scenario our initial strategy was to turn awayfrom the opponent to find opportunities e opponentrsquosstrategy was to lag pursuit and successfully reached the rear ofour aircraft at the 31st second as shown in Figure 11(g)However after the 40th second our aircraft made a wisedecision to reduce the roll angle thereby increasing the turningradius At the 60th second the disadvantaged situation wasterminated as shown in Figure 11(h) After that our aircraftincreased its roll angle and the opponent was in a state ofmaximum overload right-turn which made it unable to get ridof our aircraft and finally it lost It can be seen that trained byalternate freeze games using the DQN algorithm and selectedthrough the league system the agent can learn tactics such aslead pursuit lag pursuit and hovering and can use them in aircombat

5 Conclusion

In this paper middleware connecting air combat simulationsoftware and the reinforcement learning agent is developedIt provides an idea for researchers to design differentmiddleware to transform existing software into reinforce-ment learning environment which can expand the appli-cation field of reinforcement learning Maneuver guidanceagents with reward reshaping are designed roughtraining in the environment an agent can guide an aircraft tofight against its opponent in air combat and reach an ad-vantage situation in most scenarios An alternate freezegame algorithm is proposed and combined with RL It can beused in nonstationarity situations where other playersrsquostrategies are variable rough the league system the agentwith improved performance after iterative training is se-lectede league results show that the strategy quality of theselected agent is better than that of the other agents and thered queen effect is avoided Agents can learn some typicalangle tactics and behaviors in the horizontal plane andperform these tactics by guiding an aircraft to maneuver inone-on-one air combat

In future works the problem would be extended to 3Dmaneuvering with less restrictive vehicle dynamics e 3Dformulation will lead to a larger state space and morecomplex actions and the learning mechanism would beimproved to deal with them Beyond an extension to 3Dmaneuvering a 2-vs-2 air combat scenario would beestablished in which there are collaborators as well as ad-versaries One option is to use a centralized controller whichturns to a single-agent DRL problem to obtain the jointaction of both aircrafts to execute in each time step e

other option is to adopt a decentralized system in whichboth agents take a decision for themselves and might co-operate to achieve a common goal

In theory the convergence of the alternate freeze gamealgorithm needs to be analyzed Furthermore in order toimprove the diversity of potential opponents in amore complexair combat scenario the league system would be refined

Nomenclature

(x y z) 3-dimensional coordinates of an aircraftv Speedθ Flight path angleψ Heading angleβ Angle of attackϕ Bank anglem Mass of an aircraftg Acceleration due to gravityT rust forceL Lift forceD Drag force_ψ Turn rate_ϕ Roll ratedt Distance between the aircraft and the target

at time t

dmin Minimum distance of the advantageposition

dmax Maximum distance of the advantageposition

μt Deviation angle at time t

μmax Maximum deviation angle of the advantageposition

ηt Aspect angle at time t

ηmax Maximum aspect angle of the advantageposition

S State spaceSt State vector at time t

s All states in the state spaceA Action spaceAt Action at time t

a All actions in the action spaceπ S⟶ A PolicyR(St At St+1) Reward functionRt Scalar reward received at time t

V(s) Value functionQ(s a) Action-value functionQlowast(s a) Optimal action-value functionQ(St At) Estimated action-value function at time t

c Discount factorα Learning rateT Maximum simulation time in each episodeΔt Training time step sizeδt Simulation time step sizeF(St At St+1) Shaping reward functionΦ(St) A real-valued function on statesT(St) Termination reward functionD(St) Distance reward functionO(St) Orientation reward functionK Number of training periods

Mathematical Problems in Engineering 15

Data Availability

e data used to support the findings of this study areavailable from the corresponding author upon request

Conflicts of Interest

e authors declare that they have no conflicts of interest

Acknowledgments

is work was supported in part by the National NaturalScience Foundation of China under Grants 91338107 andU1836103 and the Development Program of Sichuan Chinaunder Grants 2017GZDZX0002 18ZDYF3867 and19ZDZX0024

Supplementary Materials

Aircraft dynamics the derivation of the kinematic anddynamic equations of the aircraft Code a simplified versionof the environment used in this paper It does not havecommercial software and middleware but its aircraft modeland other functions are the same as those proposed in thispaper is environment and the RL agent are packaged as asupplementary material rough this material the alternatefreeze game DQN algorithm proposed in this paper can bereproduced (Supplementary Materials)

References

[1] L R Shaw ldquoBasic fighter maneuversrdquo in Fighter CombatTactics and Maneuvering M D Annapolis Ed Naval In-stitute Press New York NY USA 1st edition 1985

[2] A Toubman J J Roessingh P Spronck A Plaat and J HerikDynamic Scripting with Team Coordination in Air CombatSimulation Lecture Notes in Computer Science KaohsiungTaiwan 2014

[3] A Toubman J J Roessingh P Spronck A Plaat and J HerikldquoCentralized versus decentralized team coordination usingdynamic scriptingrdquo in Proceedings of the European Modelingand Simulation Symposium pp 1ndash7 Porto Portugal 2014

[4] A Toubman J J Roessingh P Spronck A Plaat and J HerikldquoRewarding air combat behavior in training simulationsrdquo inProceedings of IEEE International Conference on Systems Manand Cybernetics pp 1397ndash1402 Kowloon China 2015

[5] A Toubman J J Roessingh P Spronck A Plaat and J HerikldquoTransfer learning of air combat behaviorrdquo in Proceedings ofthe IEEE International Conference on Machine Learning andApplications (ICMLA) pp 226ndash231 Miami FL USA 2015

[6] D I You and D H Shim ldquoDesign of an aerial combatguidance law using virtual pursuit point conceptrdquo Proceedingsof the Institution of Mechanical Engineers vol 229 no 5pp 1ndash22 2014

[7] H Shin J Lee and D H Shim ldquoDesign of a virtual fighterpilot and simulation environment for unmanned combataerial vehiclesrdquo in Prodeedings of the Advances in FlightControl Systems pp 1ndash21 Grapevine TX USA 2017

[8] J M Eklund J Sprinkle and S S Sastry ldquoImplementing andtesting a nonlinear model predictive tracking controller foraerial pursuitevasion games on a fixed wing aircraftrdquo in

Poceedings of the Control System Applications pp 1509ndash1514Portland OR USA 2005

[9] J M Eklund J Sprinkle and S S Sastry ldquoSwitched andsymmetric pursuitevasion games using online model pre-dictive control with application to autonomous aircraftrdquo IEEETransactions on Control Systems Technology vol 20 no 3pp 604ndash620 2012

[10] F Austin G Carbone M Falco H Hinz and M LewisAutomated Maneuvering Decisions for Air-to-Air CombatGrumman Corporate Research Center Bethpage NY USA1987

[11] F Austin G Carbone M Falco H Hinz and M LewisldquoGame theory for automated maneuvering during air-to-aircombatrdquo Journal of Guidance Control and Dynamics vol 13no 6 pp 1143ndash1149 1990

[12] Y Ma G Wang X Hu H Luo and X Lei ldquoCooperativeoccupancy decision making of multi-uav in beyond-visual-range air combat a game theory approachrdquo IEEE Accessvol 323 2019

[13] J P Hespanha M Prandini and S Sastry ldquoProbabilisticpursuit-evasion games a one-step Nash approachrdquo in Pro-ceedings of the 36th IEEE Conference on Decision and ControlSydney NSW Australia 2000

[14] A Xu Y Kou L Yu B Xu and Y Lv ldquoEngagement ma-neuvering strategy of air combat based on fuzzy markov gametheoryrdquo in Proceedings of IEEE International Conference onComputer pp 126ndash129 Wuhan China 2011

[15] K Horic and B A Conway ldquoOptimal fighter pursuit-evasionmaneuvers found via two-sided optimizationrdquo Journal ofGuidance Control and Dynamics vol 29 no 1 pp 105ndash1122006

[16] K Virtanen T Raivio and R P Hamalainen ldquoModelingpilotrsquos sequential maneuvering decisions by a multistage in-fluence diagramrdquo Journal of Guidance Control and Dy-namics vol 27 no 4 pp 665ndash677 2004

[17] K Virtanen J Karelahti and T Raivio ldquoModeling air combatby a moving horizon influence diagram gamerdquo Journal ofGuidance Control and Dynamics vol 29 no 5 pp 1080ndash1091 2006

[18] Q Pan D Zhou J Huang et al ldquoManeuver decision forcooperative close-range air combat based on state predictedinfluence diagramrdquo in Proceedings of the Information andAutomation for Sustainability pp 726ndash730 Macau China2017

[19] R E Smith B A Dike R K Mehra B Ravichandran andA El-Fallah ldquoClassifier systems in combat two-sided learningof maneuvers for advanced fighter aircraftrdquo ComputerMethods in Applied Mechanics and Engineering vol 186no 2ndash4 pp 421ndash437 2000

[20] J S McGrew J P How BWilliams and N Roy ldquoAir-combatstrategy using approximate dynamic programmingrdquo Journalof Guidance Control and Dynamics vol 33 no 5pp 1641ndash1654 2010

[21] Y Ma X Ma and X Song ldquoA case study on air combatdecision using approximated dynamic programmingrdquoMathematical Problems in Engineering vol 2014 2014

[22] I Bisio C Garibotto F Lavagetto A Sciarrone andS Zappatore ldquoBlind detection advanced techniques forWiFi-based drone surveillancerdquo IEEE Transactions on VehicularTechnology vol 68 no 1 pp 938ndash946 2019

[23] Y Li ldquoDeep reinforcement learningrdquo 2018[24] R S Sutton and A G Barto Reinforcement Learning An

Introduction MIT Press London UK 2nd edition 2018

16 Mathematical Problems in Engineering

[25] Y LeCun Y Bengio and G Hinton ldquoDeep learningrdquoNaturevol 521 no 7553 pp 436ndash444 2015

[26] V Mnih K Kavukcuoglu D Silver et al ldquoHuman-levelcontrol through deep reinforcement learningrdquo Naturevol 518 no 7540 pp 529ndash533 2015

[27] Z Kavukcuoglu R Liu Z Meng Y Zhang Y Yu and T LuldquoOn reinforcement learning for full-length game of starcraftrdquo2018

[28] D Silver and C J Maddison ldquoMastering the game of Go withdeep neural networks and tree searchrdquo Nature vol 529no 7587 pp 484ndash489 2016

[29] D Hubert and I Antonoglou ldquoA general reinforcementlearning algorithm that masters chess shogi and Go throughself-playrdquo Science vol 362 no 6419 pp 1140ndash1144 2018

[30] J Schrittwieser J A Bagnell and J Peters ldquoReinforcementlearning in robotics a surveyrdquo Be International Journal ofRobotics Research vol 32 no 11 pp 1238ndash1274 2013

[31] A Waldock C Greatwood F Salama and T RichardsonldquoLearning to perform a perched landing on the ground usingdeep reinforcement learningrdquo Journal of Intelligent and Ro-botic Systems vol 92 no 3-4 pp 685ndash704 2018

[32] R R Alejandro S Carlos B Hriday P Paloma andP Campoy ldquoA deep reinforcement learning strategy for UAVautonomous landing on a moving platformrdquo Journal of In-telligent and Robotic Systems vol 93 no 1-2 pp 351ndash3662019

[33] W Lou and X Guo ldquoAdaptive trajectory tracking controlusing reinforcement learning for quadrotorrdquo Journal of In-telligent and Robotic Systems vol 13 no 1 pp 1ndash10 2016

[34] C Dunn J Valasek and K Kirkpatrick ldquoUnmanned airsystem search and localization guidance using reinforcementlearningrdquo in Proceedings of the of the AIAA Infotech Aerospace2014 Conference pp 1ndash8 Garden Grove CA USA 2014

[35] S You M Diao and L Gao ldquoDeep reinforcement learning fortarget searching in cognitive electronic warfarerdquo IEEE Accessvol 7 pp 37432ndash37447 2019

[36] P Luo J Xie and W Che C Man ldquoQ-learning based aircombat target assignment algorithmrdquo in Proceedings of IEEEInternational Conference on Systems pp 779ndash783 BudapestHungary 2016

[37] M Grzes S Paulo ldquoReward shaping in episodic reinforce-ment learningrdquo in Proceedings of the International Foundationfor Autonomous Agents and Multiagent Systems pp 565ndash573Brazil 2017

[38] K Tummer and A Agogino H Hakodate ldquoAgent rewardshaping for alleviating traffic congestionrdquo in Proceedings of theInternational Foundation for Autonomous Agents and Mul-tiagent Systems pp 1ndash7 Japan 2006

[39] L L B V Cruciol A C de Arruda L Weigang L Li andA M F Crespo ldquoReward functions for learning to control inair traffic flow managementrdquo Transportation Research Part CEmerging Technologies vol 35 pp 141ndash155 2013

[40] P Hernandez-Leal M Kaisers T Baarslag and E M de CoteldquoA survey of learning in multiagent environments dealingwith non-stationarityrdquo 2019

[41] M L Littman ldquoMarkov games as a framework for multi-agentreinforcement learningrdquo in Proceedings of the InternationalConference on International Conference on Machine Learningpp 157ndash163 New Brunswick NJ USA 1994

[42] M L Littman ldquoFriend-or-foe Q-learning in general-sumgamesrdquo in Proceedigs of the International Conference on In-ternational Conference on Machine Learning pp 322ndash328Williamstown MA USA 2001

[43] J Hu and M P Wellman ldquoNash Q-learning for general-sumstochastic gamesrdquo Journal of Machine Learning Researchvol 4 pp 1039ndash1069 2003

[44] J Fan Z Wang Y Xie and Z Yang ldquoA theoretical analysis ofdeep Q-learningrdquo 2020

[45] R E Smith B A Dike B Ravichandran A Fallah andR Mehra ldquoTwo-sided genetics-based learning to discovernovel fighter combat maneuversrdquo in Lecture Notes in Com-puter Science pp 233ndash242 Springer Berlin Heidelberg 2001

[46] A Ng D Harada and S Russell S Bled ldquoPolicy invarianceunder reward transformations theory and application toreward shapingrdquo in Proceedings of the 32nd InternationalConference on International Conference on Machine Learningpp 278ndash287 Berlin Germany 1999

[47] T P Lillicrap J J Hunt A Pritzel et al ldquoContinuous controlwith deep reinforcement learningrdquo in Proceedings of the In-ternational Conference on Learning Representations PuertoRico MA USA 2016

[48] V Mnih A P Badia MMirza et al ldquoAsynchronous methodsfor deep reinforcement learningrdquo in Proceedings of the 32ndInternational Conference on International Conference onMachine Learning pp 1ndash19 New York NY USA 2016

[49] J Schulman F Wolski P Dhariwal A Radford andO Klimov ldquoProximal policy optimization algorithmsrdquo 2017

[50] Commercial Air Combat Simulation Software FG_SimStudiohttpwwweyextentcomproducttail117

[51] M Wiering and M van Otterlo Reinforcement LearningState-of-Be-Art Springer Press Heidelberg Germany 2014

[52] D P Kingma and J Ba ldquoAdam a method for stochasticoptimizationrdquo in Proceedings of the 32nd InternationalConference on International Conference on Machine Learningpp 1ndash15 San Diego CA USA 2015

Mathematical Problems in Engineering 17

Page 13: ImprovingManeuverStrategyinAirCombatbyAlternateFreeze ...downloads.hindawi.com/journals/mpe/2020/7180639.pdf · 1CollegeofComputerScience,SichuanUniversity,Chengdu,Sichuan610065,China

time to win (ATW) and average disadvantage time (ADT)ATW is measured as the average elapsed time required tomaneuver to the advantage position in the scenario whereour aircraft wins Smaller ATW is better than a larger oneADT is the average accumulated time that the advantagefunction value of our aircraft is less than that of the op-ponent which is used as a criterion to evaluate the riskexposure from the adversary weapons

A thousand air combat scenarios are generated insimulation software and the opponent in each scenario israndomly chosen from ten blue agents Each well-trainedred agent is used to perform these confrontations and theresults are shown in Figure 10 Agent number i means thepolicy trained after i periods From πr

1 to πr5 the per-

formance of agents is gradually improving After πr5 their

performance is stabilized and each agent has differentcharacteristics Comparing πr

5 to πr10 the winning and

losing rates are almost the same as those in Table 3 ewinning rate of the six agents is similar with the highestπr5 winning 6 higher than the lowest πr

10 as shown inFigure 10(a) However πr

6 is the agent with the lowestlosing rate and its losing rate is more than 10 lower than

that of other agents as shown in Figure 10(b) For ATWπr7 is less than 100 s πr

6 is 1087 s and that of other agents ismore than 110 s as shown in Figure 10(c) For ADT πr

5πr6 and πr

8 are less than 120 s πr7 is above 130 s and πr

9 andπr10 are between 120 and 130 s as shown in Figure 10(d)In summary with the increase in the number of itera-

tions the red queen effect appears obviously e perfor-mance of πr

8 πr9 and πr

10 is no better than that of πr6 and πr

7Although the winning rate of πr

7 is not high and has moredisadvantage time it can always win quickly It is an ag-gressive agent and can be used in scenarios where ouraircraft needs to beat the opponent quicklye winning rateof πr

6 is similar to other agents while its losing rate is thelowest Its ATW is only higher than πr

7 and ADT is thelowest In most cases it can be selected to achieve morevictories while keeping itself safe πr

6 is an agent with goodcomprehensive performance which means that the selectionmethod of the league system is effective

For the four initial scenarios air combat simulationusing πr

6 is selected for behavior analysis as shown inFigure 11 In the offensive scenario the opponent tried toescape by turning in the opposite direction Our aircraft useda lag pursuit tactic which is simple but effective At the 1stsecond our aircraft did not keep up with the opponent butchose to fly straight At the 9th second it gradually turned tothe opponent established and maintained the advantageand finally won as shown in Figure 11(a)

In the defensive scenario our aircraft was at a disad-vantage position and the opponent was trying to lock us in

Table 3 League table

Rank Agent (strategy) Win Draw Loss Score1 πr

6 1913 1373 1114 71122 πr

7 1858 1323 1219 68973 πr

5 1871 1276 1253 68894 πr

9 1877 1226 1297 68575 πr

8 1863 1230 1307 68196 πr

10 1896 1081 1423 67697 πr

4 1860 1160 1380 67408 πr

3 1654 1263 1483 62259 πr

2 1587 1236 1577 599710 πr

1 1362 1332 1706 5418

Table 4 Algorithm comparison

vs ADP [20] vs Minimax-DQN [44]

OffensiveWin 57 66Draw 28 21Loss 15 13

DefensiveWin 19 20Draw 31 42Loss 50 38

Head-onWin 43 46Draw 16 26Loss 41 28

NeutralWin 32 43Draw 32 22Loss 36 35

Table 5 Time cost to generate an action

Algorithm Time cost (ms)e proposed algorithm 138ADP 142Minimax-DQN 786

Mathematical Problems in Engineering 13

with a lag pursuit tactic as shown in Figure 11(b) At the30th second our aircraft adjusted its roll angle to the right toavoid being locked by the opponent as shown inFigure 11(c) After that it continued to adjust the roll angle

and when the opponent noticed that our strategy hadchanged our aircraft had already flown out of the sector Insituations where it is difficult to win the agent will use a safemaneuver strategy to get a draw

2 3 4 5 6 7 8 9 101Agent number

Win

s

300

351

387

425443440436438429

418

(a)

2 3 4 5 6 7 8 9 101Agent number

Loss

es

402

371342

309

279259

278281290312

(b)

2 3 4 5 6 7 8 9 101Agent number

Tim

e (s)

1513 15041437

10871159 1163

1176

984

1215

2234

(c)

2 3 4 5 6 7 8 9 101Agent number

Tim

e (s)

1354

1528

1296

11701157

1199

1113

1289 12961251

(d)

Figure 10 Agent evaluation (a) Number of winning (b) Number of losing (c) Average time to win (d) Average disadvantage time

115

5

9

9

13

13

17

17

21

21

24

24

(a)

11

4

4

7

7

10

10

13

13

16

16

19

19

21

21

(b)

21

21

24

24

27

27

30

30

33

33

36 36

39 39

43 43

(c)

1 1 66 111116

16

21

21

26

26

31

31

36

36

41

41

46

46

50

50

(d)

50

50

55

55

60

60

65

65

70

70

75

75

80

8085

85 90

90 95

95 100

100

(e)

100

100

105

105

110110

115115

120

120

125

125

130

130

135

135

140

140

145

145

150

150

154

154

(f )

40

40

36

36

31

31

26

26

21

21

16

16

11

11

6

611

(g)

74 7470

7065

6560

6055

5550

5045

45

40

40

(h)

Figure 11 Trajectories under four typical initial situations (a) Offensive (b) Defensive (0ndash21 s) (c) Defensive (21ndash43 s) (d) Head-on(0ndash50 s) (e) Head-on (50ndash100 s) (f ) Head-on (100ndash154 s) (g) Neutral (0ndash40 s) (h) Neutral (40ndash74 s)

14 Mathematical Problems in Engineering

In the head-on scenario both sides adopted the samemaneuver strategy in the first 50 s that is the maximumoverload turn and waiting for opportunities as shown inFigure 11(d) e crucial decision was made in the 50thsecond the opponent was still hovering and waiting for anopportunity while our aircraft stopped hovering and re-duced its turning rate to fly towards the opponent At the90th second the aircraft reached an advantage situation overits opponent as shown in Figure 11(e) In the final stage ouraircraft adopted the lead pursuit tactic to establish andmaintain the advantage and won as shown in Figure 11(f )

In the neutral scenario our initial strategy was to turn awayfrom the opponent to find opportunities e opponentrsquosstrategy was to lag pursuit and successfully reached the rear ofour aircraft at the 31st second as shown in Figure 11(g)However after the 40th second our aircraft made a wisedecision to reduce the roll angle thereby increasing the turningradius At the 60th second the disadvantaged situation wasterminated as shown in Figure 11(h) After that our aircraftincreased its roll angle and the opponent was in a state ofmaximum overload right-turn which made it unable to get ridof our aircraft and finally it lost It can be seen that trained byalternate freeze games using the DQN algorithm and selectedthrough the league system the agent can learn tactics such aslead pursuit lag pursuit and hovering and can use them in aircombat

5 Conclusion

In this paper middleware connecting air combat simulationsoftware and the reinforcement learning agent is developedIt provides an idea for researchers to design differentmiddleware to transform existing software into reinforce-ment learning environment which can expand the appli-cation field of reinforcement learning Maneuver guidanceagents with reward reshaping are designed roughtraining in the environment an agent can guide an aircraft tofight against its opponent in air combat and reach an ad-vantage situation in most scenarios An alternate freezegame algorithm is proposed and combined with RL It can beused in nonstationarity situations where other playersrsquostrategies are variable rough the league system the agentwith improved performance after iterative training is se-lectede league results show that the strategy quality of theselected agent is better than that of the other agents and thered queen effect is avoided Agents can learn some typicalangle tactics and behaviors in the horizontal plane andperform these tactics by guiding an aircraft to maneuver inone-on-one air combat

In future works the problem would be extended to 3Dmaneuvering with less restrictive vehicle dynamics e 3Dformulation will lead to a larger state space and morecomplex actions and the learning mechanism would beimproved to deal with them Beyond an extension to 3Dmaneuvering a 2-vs-2 air combat scenario would beestablished in which there are collaborators as well as ad-versaries One option is to use a centralized controller whichturns to a single-agent DRL problem to obtain the jointaction of both aircrafts to execute in each time step e

other option is to adopt a decentralized system in whichboth agents take a decision for themselves and might co-operate to achieve a common goal

In theory the convergence of the alternate freeze gamealgorithm needs to be analyzed Furthermore in order toimprove the diversity of potential opponents in amore complexair combat scenario the league system would be refined

Nomenclature

(x y z) 3-dimensional coordinates of an aircraftv Speedθ Flight path angleψ Heading angleβ Angle of attackϕ Bank anglem Mass of an aircraftg Acceleration due to gravityT rust forceL Lift forceD Drag force_ψ Turn rate_ϕ Roll ratedt Distance between the aircraft and the target

at time t

dmin Minimum distance of the advantageposition

dmax Maximum distance of the advantageposition

μt Deviation angle at time t

μmax Maximum deviation angle of the advantageposition

ηt Aspect angle at time t

ηmax Maximum aspect angle of the advantageposition

S State spaceSt State vector at time t

s All states in the state spaceA Action spaceAt Action at time t

a All actions in the action spaceπ S⟶ A PolicyR(St At St+1) Reward functionRt Scalar reward received at time t

V(s) Value functionQ(s a) Action-value functionQlowast(s a) Optimal action-value functionQ(St At) Estimated action-value function at time t

c Discount factorα Learning rateT Maximum simulation time in each episodeΔt Training time step sizeδt Simulation time step sizeF(St At St+1) Shaping reward functionΦ(St) A real-valued function on statesT(St) Termination reward functionD(St) Distance reward functionO(St) Orientation reward functionK Number of training periods

Mathematical Problems in Engineering 15

Data Availability

e data used to support the findings of this study areavailable from the corresponding author upon request

Conflicts of Interest

e authors declare that they have no conflicts of interest

Acknowledgments

is work was supported in part by the National NaturalScience Foundation of China under Grants 91338107 andU1836103 and the Development Program of Sichuan Chinaunder Grants 2017GZDZX0002 18ZDYF3867 and19ZDZX0024

Supplementary Materials

Aircraft dynamics the derivation of the kinematic anddynamic equations of the aircraft Code a simplified versionof the environment used in this paper It does not havecommercial software and middleware but its aircraft modeland other functions are the same as those proposed in thispaper is environment and the RL agent are packaged as asupplementary material rough this material the alternatefreeze game DQN algorithm proposed in this paper can bereproduced (Supplementary Materials)

References

[1] L. R. Shaw, "Basic fighter maneuvers," in Fighter Combat: Tactics and Maneuvering, Naval Institute Press, Annapolis, MD, USA, 1st edition, 1985.
[2] A. Toubman, J. J. Roessingh, P. Spronck, A. Plaat, and J. Herik, Dynamic Scripting with Team Coordination in Air Combat Simulation, Lecture Notes in Computer Science, Kaohsiung, Taiwan, 2014.
[3] A. Toubman, J. J. Roessingh, P. Spronck, A. Plaat, and J. Herik, "Centralized versus decentralized team coordination using dynamic scripting," in Proceedings of the European Modeling and Simulation Symposium, pp. 1–7, Porto, Portugal, 2014.
[4] A. Toubman, J. J. Roessingh, P. Spronck, A. Plaat, and J. Herik, "Rewarding air combat behavior in training simulations," in Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, pp. 1397–1402, Kowloon, China, 2015.
[5] A. Toubman, J. J. Roessingh, P. Spronck, A. Plaat, and J. Herik, "Transfer learning of air combat behavior," in Proceedings of the IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 226–231, Miami, FL, USA, 2015.
[6] D. I. You and D. H. Shim, "Design of an aerial combat guidance law using virtual pursuit point concept," Proceedings of the Institution of Mechanical Engineers, vol. 229, no. 5, pp. 1–22, 2014.
[7] H. Shin, J. Lee, and D. H. Shim, "Design of a virtual fighter pilot and simulation environment for unmanned combat aerial vehicles," in Proceedings of the Advances in Flight Control Systems, pp. 1–21, Grapevine, TX, USA, 2017.
[8] J. M. Eklund, J. Sprinkle, and S. S. Sastry, "Implementing and testing a nonlinear model predictive tracking controller for aerial pursuit/evasion games on a fixed wing aircraft," in Proceedings of the Control System Applications, pp. 1509–1514, Portland, OR, USA, 2005.
[9] J. M. Eklund, J. Sprinkle, and S. S. Sastry, "Switched and symmetric pursuit/evasion games using online model predictive control with application to autonomous aircraft," IEEE Transactions on Control Systems Technology, vol. 20, no. 3, pp. 604–620, 2012.
[10] F. Austin, G. Carbone, M. Falco, H. Hinz, and M. Lewis, Automated Maneuvering Decisions for Air-to-Air Combat, Grumman Corporate Research Center, Bethpage, NY, USA, 1987.
[11] F. Austin, G. Carbone, M. Falco, H. Hinz, and M. Lewis, "Game theory for automated maneuvering during air-to-air combat," Journal of Guidance, Control, and Dynamics, vol. 13, no. 6, pp. 1143–1149, 1990.
[12] Y. Ma, G. Wang, X. Hu, H. Luo, and X. Lei, "Cooperative occupancy decision making of multi-UAV in beyond-visual-range air combat: a game theory approach," IEEE Access, 2019.
[13] J. P. Hespanha, M. Prandini, and S. Sastry, "Probabilistic pursuit-evasion games: a one-step Nash approach," in Proceedings of the 36th IEEE Conference on Decision and Control, Sydney, NSW, Australia, 2000.
[14] A. Xu, Y. Kou, L. Yu, B. Xu, and Y. Lv, "Engagement maneuvering strategy of air combat based on fuzzy Markov game theory," in Proceedings of the IEEE International Conference on Computer, pp. 126–129, Wuhan, China, 2011.
[15] K. Horic and B. A. Conway, "Optimal fighter pursuit-evasion maneuvers found via two-sided optimization," Journal of Guidance, Control, and Dynamics, vol. 29, no. 1, pp. 105–112, 2006.
[16] K. Virtanen, T. Raivio, and R. P. Hamalainen, "Modeling pilot's sequential maneuvering decisions by a multistage influence diagram," Journal of Guidance, Control, and Dynamics, vol. 27, no. 4, pp. 665–677, 2004.
[17] K. Virtanen, J. Karelahti, and T. Raivio, "Modeling air combat by a moving horizon influence diagram game," Journal of Guidance, Control, and Dynamics, vol. 29, no. 5, pp. 1080–1091, 2006.
[18] Q. Pan, D. Zhou, J. Huang et al., "Maneuver decision for cooperative close-range air combat based on state predicted influence diagram," in Proceedings of the Information and Automation for Sustainability, pp. 726–730, Macau, China, 2017.
[19] R. E. Smith, B. A. Dike, R. K. Mehra, B. Ravichandran, and A. El-Fallah, "Classifier systems in combat: two-sided learning of maneuvers for advanced fighter aircraft," Computer Methods in Applied Mechanics and Engineering, vol. 186, no. 2–4, pp. 421–437, 2000.
[20] J. S. McGrew, J. P. How, B. Williams, and N. Roy, "Air-combat strategy using approximate dynamic programming," Journal of Guidance, Control, and Dynamics, vol. 33, no. 5, pp. 1641–1654, 2010.
[21] Y. Ma, X. Ma, and X. Song, "A case study on air combat decision using approximated dynamic programming," Mathematical Problems in Engineering, vol. 2014, 2014.
[22] I. Bisio, C. Garibotto, F. Lavagetto, A. Sciarrone, and S. Zappatore, "Blind detection: advanced techniques for WiFi-based drone surveillance," IEEE Transactions on Vehicular Technology, vol. 68, no. 1, pp. 938–946, 2019.
[23] Y. Li, "Deep reinforcement learning," 2018.
[24] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, London, UK, 2nd edition, 2018.


[25] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[26] V. Mnih, K. Kavukcuoglu, D. Silver et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[27] Z. Kavukcuoglu, R. Liu, Z. Meng, Y. Zhang, Y. Yu, and T. Lu, "On reinforcement learning for full-length game of StarCraft," 2018.
[28] D. Silver and C. J. Maddison, "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[29] D. Hubert and I. Antonoglou, "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play," Science, vol. 362, no. 6419, pp. 1140–1144, 2018.
[30] J. Schrittwieser, J. A. Bagnell, and J. Peters, "Reinforcement learning in robotics: a survey," The International Journal of Robotics Research, vol. 32, no. 11, pp. 1238–1274, 2013.
[31] A. Waldock, C. Greatwood, F. Salama, and T. Richardson, "Learning to perform a perched landing on the ground using deep reinforcement learning," Journal of Intelligent and Robotic Systems, vol. 92, no. 3-4, pp. 685–704, 2018.
[32] R. R. Alejandro, S. Carlos, B. Hriday, P. Paloma, and P. Campoy, "A deep reinforcement learning strategy for UAV autonomous landing on a moving platform," Journal of Intelligent and Robotic Systems, vol. 93, no. 1-2, pp. 351–366, 2019.
[33] W. Lou and X. Guo, "Adaptive trajectory tracking control using reinforcement learning for quadrotor," Journal of Intelligent and Robotic Systems, vol. 13, no. 1, pp. 1–10, 2016.
[34] C. Dunn, J. Valasek, and K. Kirkpatrick, "Unmanned air system search and localization guidance using reinforcement learning," in Proceedings of the AIAA Infotech Aerospace 2014 Conference, pp. 1–8, Garden Grove, CA, USA, 2014.
[35] S. You, M. Diao, and L. Gao, "Deep reinforcement learning for target searching in cognitive electronic warfare," IEEE Access, vol. 7, pp. 37432–37447, 2019.
[36] P. Luo, J. Xie, and W. Che, "Q-learning based air combat target assignment algorithm," in Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, pp. 779–783, Budapest, Hungary, 2016.
[37] M. Grzes, "Reward shaping in episodic reinforcement learning," in Proceedings of the International Foundation for Autonomous Agents and Multiagent Systems, pp. 565–573, São Paulo, Brazil, 2017.
[38] K. Tummer and A. Agogino, "Agent reward shaping for alleviating traffic congestion," in Proceedings of the International Foundation for Autonomous Agents and Multiagent Systems, pp. 1–7, Hakodate, Japan, 2006.
[39] L. L. B. V. Cruciol, A. C. de Arruda, L. Weigang, L. Li, and A. M. F. Crespo, "Reward functions for learning to control in air traffic flow management," Transportation Research Part C: Emerging Technologies, vol. 35, pp. 141–155, 2013.
[40] P. Hernandez-Leal, M. Kaisers, T. Baarslag, and E. M. de Cote, "A survey of learning in multiagent environments: dealing with non-stationarity," 2019.
[41] M. L. Littman, "Markov games as a framework for multi-agent reinforcement learning," in Proceedings of the International Conference on Machine Learning, pp. 157–163, New Brunswick, NJ, USA, 1994.
[42] M. L. Littman, "Friend-or-foe Q-learning in general-sum games," in Proceedings of the International Conference on Machine Learning, pp. 322–328, Williamstown, MA, USA, 2001.
[43] J. Hu and M. P. Wellman, "Nash Q-learning for general-sum stochastic games," Journal of Machine Learning Research, vol. 4, pp. 1039–1069, 2003.
[44] J. Fan, Z. Wang, Y. Xie, and Z. Yang, "A theoretical analysis of deep Q-learning," 2020.
[45] R. E. Smith, B. A. Dike, B. Ravichandran, A. Fallah, and R. Mehra, "Two-sided genetics-based learning to discover novel fighter combat maneuvers," in Lecture Notes in Computer Science, pp. 233–242, Springer, Berlin, Heidelberg, 2001.
[46] A. Ng, D. Harada, and S. Russell, "Policy invariance under reward transformations: theory and application to reward shaping," in Proceedings of the 16th International Conference on Machine Learning, pp. 278–287, Bled, Slovenia, 1999.
[47] T. P. Lillicrap, J. J. Hunt, A. Pritzel et al., "Continuous control with deep reinforcement learning," in Proceedings of the International Conference on Learning Representations, San Juan, Puerto Rico, 2016.
[48] V. Mnih, A. P. Badia, M. Mirza et al., "Asynchronous methods for deep reinforcement learning," in Proceedings of the 33rd International Conference on Machine Learning, pp. 1–19, New York, NY, USA, 2016.
[49] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," 2017.
[50] Commercial Air Combat Simulation Software FG_SimStudio, http://www.eyextent.com/producttail117.
[51] M. Wiering and M. van Otterlo, Reinforcement Learning: State-of-the-Art, Springer Press, Heidelberg, Germany, 2014.
[52] D. P. Kingma and J. Ba, "Adam: a method for stochastic optimization," in Proceedings of the 3rd International Conference on Learning Representations, pp. 1–15, San Diego, CA, USA, 2015.
