abstract 1. introduction 1.1. motivation arxiv:1905 ... · 1.1. motivation with the rapid increase...

Autonomous Air Traffic Controller A Deep Multi-Agent ReinforcementLearning Approach

Marc Brittain 1 Peng Wei 1

Abstract

Air traffic control is a real-time safety-criticaldecision making process in highly dynamic andstochastic environments In todayrsquos aviation prac-tice a human air traffic controller monitors anddirects many aircraft flying through its designatedairspace sector With the fast growing air traf-fic complexity in traditional (commercial airlin-ers) and low-altitude (drones and eVTOL aircraft)airspace an autonomous air traffic control systemis needed to accommodate high density air traf-fic and ensure safe separation between aircraftWe propose a deep multi-agent reinforcementlearning framework that is able to identify and re-solve conflicts between aircraft in a high-densitystochastic and dynamic en-route sector with mul-tiple intersections and merging points The pro-posed framework utilizes an actor-critic modelA2C that incorporates the loss function from Prox-imal Policy Optimization (PPO) to help stabilizethe learning process In addition we use a cen-tralized learning decentralized execution schemewhere one neural network is learned and shared byall agents in the environment We show that ourframework is both scalable and efficient for largenumber of incoming aircraft to achieve extremelyhigh traffic throughput with safety guarantee Weevaluate our model via extensive simulations inthe BlueSky environment Results show that ourframework is able to resolve 9997 and 100of all conflicts both at intersections and mergingpoints respectively in extreme high-density airtraffic scenarios

1Department of Aerospace Engineering Iowa State Univer-sity Ames IA 50021 USA Correspondence to Marc Brittainltmwbiastateedugt Peng Wei ltpweiiastateedugt

1 Introduction11 Motivation

With the rapid increase in global air traffic and expected highdensity air traffic in specific airspace regions to guaranteeair transportation safety and efficiency becomes a criticalchallenge Tactical air traffic control (ATC) decisions toensure safe separation between aircraft are still being madeby human air traffic controllers in en-route airspace sectorswhich is the same as compared to 50 years ago (Councilet al 2014) Heinz Erzberger and his NASA colleaguesfirst proposed autonomous air traffic control by introducingthe Advanced Airspace Concept (AAC) to increase airspacecapacity and operation safety by designing automation toolssuch as the Autoresolver and TSAFE to augment humancontrollers (Erzberger 2005 Erzberger amp Heere 2010 Far-ley amp Erzberger 2007) in conflict resolution Inspired byErzberger we believe that a fully automated ATC system isthe ultimate solution to handle the high-density complexand dynamic air traffic in the future en-route and terminalairspace for commercial air traffic

In recent proposals for low-altitude airspace operations suchas UAS Traffic Management (UTM) (Kopardekar et al2016) U-space (Undertaking 2017) and urban air mobility(Mueller Kopardekar and Goodrich 2017) there is also astrong demand for an autonomous air traffic control systemto provide advisories to these intelligent aircraft facilitateon-board autonomy or human operator decisions and copewith high-density air traffic while maintaining safety andefficiency (Air 2015 Airbus 2018 Google 2015 Holdenamp Goel 2016 Kopardekar 2015 Mueller et al 2017 Uber2018) According to the most recent study by Hunter andWei (Hunter amp Wei 2019) the key to these low-altitudeairspace operations is to design the autonomous ATC onstructured airspace to achieve envisioned high throughputTherefore the critical challenge here is to design an au-tonomous air traffic control system to provide real-timeadvisories to aircraft to ensure safe separation both along airroutes and at intersections of these air routes Furthermorewe need this autonomous ATC system to be able to managemultiple intersections and handle uncertainty in real time

To implement such a system we need a model to perceive

arX

iv1

905

0130

3v1

[cs

LG

] 2

May

201

9

Autonomous Air Traffic Controller A Deep Multi-Agent Reinforcement Learning Approach

the current air traffic situation and provide advisories toaircraft in an efficient and scalable manner Reinforcementlearning a branch of machine learning is a promising wayto solve this problem The goal in reinforcement learning isto allow an agent to learn an optimal policy by interactingwith an environment The agent is trained by first perceivingthe state in the environment selecting an action based onthe perceived state and receiving a reward based on this per-ceived state and action By formulating the tasks of humanair traffic controllers as a reinforcement learning problemthe trained agent can provide dynamic real-time air trafficadvisories to aircraft with extremely high safety guaranteeunder uncertainty and little computation overhead

Artificial intelligence (AI) algorithms are achieving per-formance beyond humans in many real-world applicationstoday An artificial intelligence agent called AlphaGo builtby DeepMind defeated the world champion Ke Jie in threematches of Go in May 2017 (Silver amp Hassabis 2016) Thisnotable advance in the AI field demonstrated the theoreticalfoundation and computational capability to potentially aug-ment and facilitate human tasks with intelligent agents andAI technologies To utilize such techniques fast-time sim-ulators are needed to allow the agent to efficiently learn inthe environment Until recently there were no open-sourcehigh-quality air traffic control simulators that allowed forfast-time simulations to enable an AI agent to interact withThe air traffic control simulator BlueSky developed by TUDelft allows for realistic real-time air traffic scenarios andwe decide to use this software as the environment and simu-lator for performance evaluation of our proposed framework(Hoekstra amp Ellerbroek 2016)

In this paper a deep multi-agent reinforcement learningframework is proposed to enable autonomous air trafficseparation in en-route airspace where each aircraft is repre-sented by an agent Each agent will comprehend the currentair traffic situation and perform online sequential decisionmaking to select speed advisories in real-time to avoid con-flicts at intersections merging points and along route Ourproposed framework provides another promising potentialsolution to enable an autonomous air traffic control system

12 Related Work

Deep reinforcement learning has been widely explored inground transportation in the form of traffic light control(Genders and Razavi 2016 Liang Du Wang and Han2018) In these approaches the authors deal with a singleintersection and use one agent per intersection to controlthe traffic lights Our problem is similar to ground trans-portation in the sense we want to provide speed advisoriesto aircraft to avoid conflict in the same way a traffic lightadvises cars to stop and go The main difference with ourproblem is that we need to control the speed of each aircraft

to ensure there is no along route conflict In our work werepresent each aircraft as an agent instead of the intersectionto handle along route and intersection conflicts

There have been many important contributions to the topicof autonomous air traffic control One of the most promis-ing and well-known lines of work is the Autoresolver de-signed and developed by Heinz Erzberger and his NASAcolleagues (Erzberger 2005 Erzberger amp Heere 2010 Far-ley amp Erzberger 2007) It employs an iterative approach se-quentially computing and evaluating candidate trajectoriesuntil a trajectory is found that satisfies all of the resolutionconditions The candidate trajectory is then output by thealgorithm as the conflict resolution trajectory The Autore-solver is a physics-based approach that involves separatecomponents of conflict detection and conflict resolution Ithas been tested in various large-scale simulation scenarioswith promising performance

Strategies for increasing throughput of aircraft while min-imizing delay in high-density sectors are currently beingdesigned and implemented by NASA These works includethe Traffic Management Advisor (TMA) (Erzberger andItoh 2014) or Traffic Based Flow Management (TBFM) acentral component of ATD-1 (Baxley Johnson Scardinaand Shay 2016) In this approach a centralized plannerdetermines conflict free time-slots for aircraft to ensure sep-aration requirements are maintained at the metering fix Ouralgorithm also is able to achieve a conflict free metering fixby allowing the agents to learn a cooperative strategy thatis queried quickly online during execution unlike TBFMAnother main difference with our work is that our proposedframework is a decentralized framework that can handleuncertainty In TMA or TBFM once the arrival sequenceis determined and aircraft are within the ldquofreeze horizonrdquono deviation from the sequence is allowed which could beproblematic if one aircraft becomes uncooperative

Multi-agent approaches have also been applied to conflictresolution (Wollkind Valasek and Ioerger 2004) In thisline of work negotiation techniques are used to resolveidentified conflicts in the sector In our research we donot impose any negotiation techniques but leave it to theagents to derive negotiation techniques through learning andtraining

Reinforcement learning and deep Q-networks have beendemonstrated to play games such as Go Atari and Warcraftand most recently Starcraft II (Amato and Shani 2010Mnih Kavukcuoglu Silver Graves Antonoglou Wierstraand Riedmiller 2013 Silver Huang Maddison Guez SifreVan Den Driessche Schrittwieser Antonoglou Panneer-shelvam Lanctot et al 2016 Vinyals Ewalds BartunovGeorgiev Vezhnevets Yeo Makhzani Kuttler AgapiouSchrittwieser et al 2017) The results from these papersshow that a well-designed sophisticated AI agent is capable


of learning complex strategies It was also shown in pre-vious work that a hierarchical deep reinforcement learningagent was able to avoid conflict and choose optimal routecombinations for a pair of aircraft (Brittain and Wei 2018)

Recently the field of multi-agent collision avoidancehas seen much success in using a decentralized non-communicating framework in ground robots (Chen LiuEverett and How 2017 Everett Chen and How 2018) Inthis work the authors develop an extension to the policy-based learning algorithm (GA3C) that proves to be efficientin learning complex interactions between many agents Wefind that the field of collision avoidance can be adapted toconflict resolution by considering larger separation require-ments so our framework is inspired by the ideas set forthby (Everett et al 2018)

In this paper the deep multi-agent reinforcement learningframework is developed to solve the separation problem forautonomous air traffic control in en-route dynamic airspacewhere we avoid the computationally expensive forward in-tegration method by learning a policy that can be quicklyqueried The results show that our framework has verypromising performance

The structure of this paper is as follows in Section II thebackground of reinforcement learning policy based learn-ing and multi-agent reinforcement learning will be intro-duced In Section III the description of the problem andits mathematical formulation of deep multi-agent reinforce-ment learning are presented Section IV presents our de-signed deep multi-agent reinforcement learning frameworkto solve this problem The numerical experiments and re-sults are shown in Section V and Section VI concludes thispaper

2 Background21 Reinforcement Learning

Reinforcement learning is one type of sequential decisionmaking where the goal is to learn how to act optimally ina given environment with unknown dynamics A reinforce-ment learning problem involves an environment an agentand different actions the agent can select in this environmentThe agent is unique to the environment and we assume theagent is only interacting with one environment If we let trepresent the current time then the components that makeup a reinforcement learning problem are as follows

bull S - The state space S is a set of all possible states in theenvironment

bull A - The action space A is a set of all actions the agentcan select in the environment

bull r(st at) - The reward function determines how much

reward the agent is able to acquire for a given (st at)transition

bull γ isin [01] - A discount factor determines how far in thefuture to look for rewards As γ rarr 0 immediaterewards are emphasized whereas when γ rarr 1 futurerewards are prioritized

S contains all information about the environment and eachelement st can be considered a snapshot of the environmentat time t The agent accepts st and with this the agent thenselects an action at By selecting action at the state isnow updated to st+1 and there is an associated reward frommaking the transition from (st at)rarr st+1 How the stateevolves from strarr st+1 given action at is dependent uponthe dynamics of the system which is often unknown Thereward function is user defined but needs to be carefullydesigned to reflect the goal of the environment

From this framework the agent is able to extract the optimalactions for each state in the environment by maximizing acumulative reward function We call the actions the agentselects for each state in the environment a policy Let πrepresent some policy and T represent the total time for agiven environment then the optimal policy can be definedas follows

πlowast = arg maxπ

E[

Tsumt=0

(r(st at)|π)] (1)

If we design the reward to reflect the goal in the environmentthen by maximizing the total reward we have obtained theoptimal solution to the problem

22 Policy-Based Learning

In this work we consider a policy-based reinforcementlearning algorithm to generate policies for each agent toexecute The advantage of policy-based learning is that thesealgorithms are able to learn stochastic policies whereasvalue-based learning can not This is especially beneficial innon-communicating multi-agent environments where thereis uncertainty in other agentrsquos action A3C (Mnih BadiaMirza Graves Lillicrap Harley Silver and Kavukcuoglu2016) a recent policy-based algorithm uses a single neuralnetwork to approximate both the policy (actor) and value(critic) functions with many threads of an agent running inparallel to allow for increased exploration of the state-spaceThe actor and critic are trained according to the two lossfunctions

Lπ = log π(at |st)(Rt minus V (st)) + β middotH(π(st)) (2)

Lv = (Rt minus V (st))2 (3)

where in (2) the first term log π(at |st)(Rt minus V (st)) re-duces the probability of sampling an action that led to a


lower return than was expected by the critic and the secondterm β middotH(π(st)) is used to encourage exploration by dis-couraging premature convergence to suboptimal determinis-tic polices Here H is the entropy and the hyperparameter βcontrols the strength of the entropy regularization term In(3) the critic is trained to approximate the future discountedrewards Rt =

sumkminus1i=0 γ

irt+i + γkV (st+k)

One drawback of Lπ is that it can lead to large destruc-tive policy updates and hinder the final performance of themodel A recent algorithm called Proximal Policy Opti-mization (PPO) solved this problem by introducing a newtype of loss function that limits the change from the previ-ous policy to the new policy (Schulman Wolski DhariwalRadford and Klimov 2017) If we let rt(θ) denote theprobability ratio and θ represent the neural network weightsat time t rt(θ) = πθ(at|st)

πθold (at|st) the PPO loss function can be

formulated as follows

LCLIP(θ) =

Et[min(rt(θ)(A) clip(rt(θ) 1minus ε 1 + ε)(A))] (4)

where A = Rt minus V (st) and ε is a hyperparameter thatdetermines the bound for rt(θ) This loss function allowsthe previous policy to move in the direction of the newpolicy but by limiting this change it is shown to lead tobetter performance (Schulman et al 2017)

23 Multi-Agent Reinforcement Learning

In multi-agent reinforcement learning instead of consider-ing one agentrsquos interaction with the environment we areconcerned with a set of agents that share the same environ-ment (Bu Babu De Schutter et al 2008) Fig 1 shows theprogression of a multi-agent reinforcement learning prob-lem Each agent has its own goals that it is trying to achievein the environment that is typically unknown to the otheragents In these types of problems the difficulty of learninguseful policies greatly increases since the agents are bothinteracting with the environment and each other One strat-egy for solving multi-agent environments is IndependentQ-learning (Tan 1993) where other agents are consideredto be part of the environment and there is no communica-tion between agents This approach often fails since eachagent is operating in the environment and in return createslearning instability This learning instability is caused bythe fact that each agent is changing its own policy and howthe agent changes this policy will influence the policy of theother agents (Matignon Laurent and Le Fort-Piat 2012)Without some type of communication it is very difficult forthe agents to converge to a good policy

Figure 1 Progression of a multi-agent reinforcement learning prob-lem

3 Problem FormulationIn real world practice air traffic controllers in en-route andterminal sectors are responsible for separating aircraft Inour research we used the BlueSky air traffic control simu-lator as our deep reinforcement learning environment Wedeveloped two challenging Case Studies one with multipleintersections (Case Study 1) and one with a merging point(Case Study 2) both with high-density air traffic to evalu-ate the performance of our deep multi-agent reinforcementlearning framework

31 Objective

The objective in these Case Studies is to maintain safe sep-aration between aircraft and resolve conflict for aircraft inthe sector by providing speed advisories In Case Study 1three routes are constructed with two intersections so thatthe agents must navigate through the intersection with noconflicts In Case Study 2 there are two routes that reacha merging point and continue on one route so ensuringproper separation requirements at the merging point is adifficult problem to solve In order to obtain the optimal so-lution in this environment the agents have to maintain safeseparation and resolve conflict and every time step in theenvironment To increase the difficulty of the Case Studiesand to provide a more realistic environment aircraft enterthe sector stochastically so that the agents need to develop astrategy instead of simply memorizing actions

32 Simulation Settings

There are many settings we imposed to make these CaseStudies feasible For each simulation run there is a fixedmax number of aircraft This is to allow comparable per-formance between simulation runs and to evaluate the finalperformance of the model In BlueSky the performancemetrics of each aircraft type impose different constraintson the range of cruise speeds We set all aircraft to bethe same type Boeing 747-400 in both Case Studies We


also imposed a setting that all aircraft can not deviate fromtheir route The final setting in the Case Studies is the de-sired speed of each aircraft Each aircraft has the abilityto select three different desired speeds minimum allowedcruise speed current cruise speed and maximum allowedcruise speed which is defined in the BlueSky simulationenvironment

33 Multi-Agent Reinforcement Learning Formulation

Here we formulate our BlueSky Case Study as a deep multi-agent reinforcement learning problem by representing eachaircraft as an agent and define the state space action spacetermination criteria and reward function for the agents

331 STATE SPACE

A state contains all the information the AI agent needs tomake decisions Since this is a multi-agent environment weneeded to incorporate communication between the agentsTo allow the agents to communicate the state for a givenagent also contains state information from the N-closestagents We follow a similar state space definition as in(Everett et al 2018) but instead we use half of the lossof separation distance as the radius of the aircraft In thisway the sum of the radii between two aircraft is equal to theloss of separation distance The state information includesdistance to the goal aircraft speed aircraft acceleration dis-tance to the intersection a route identifier and half the lossof separation distance for the N-closest agents where theposition for a given aircraft can be represented as (distanceto the goal route identifier) We also included the distanceto the N-closest agents in the state space of the agents andthe full loss of separation distance From this we can seethat the state space for the agents is constant in size sinceit only depends on the N-closest agents and does not scaleas the number of agents in the environment increase Fig 2shows an example of a state in the BlueSky environment

We found that defining which N-closest agents to consideris very important to obtain a good result since we do notwant to add irrelevant information in the state space Forexample consider Fig 2 If the ownship is on R1 andone of the closest aircraft on R3 has already passed theintersection there is no reason to include its information inthe state space of the ownship We defined the followingrules for the aircraft that are allowed to be in the state of theownship

bull aircraft on conflicting route must have not reached theintersection

bull aircraft must either be on the same route or on a con-flicting route

By utilizing these rules we eliminated useless information

Figure 2 BlueSky sector designed for our Case Study Shown isCase Study 1 for illustration there are three routes R1 R2 andR3 along with two intersections I1 and I2

which we found to be critical in obtaining convergence tothis problem

If we consider Fig 2 as an example we can acquire all ofthe state information we need from the aircraft If we let I(i)

represent the distance to the goal aircraft speed aircraftacceleration distance to the intersection route identifierand half the loss of separation distance of aircraft i the statewill be represented as follows

sot = (I(o) d(1)LOS(o 1) d(2)LOS(o 2) d(n)

LOS(o n) I(1) I(2) I(n))

where sot represents the ownship state d(i) represents thedistance from ownship to aircraft i LOS(o i) represents theloss of separation distance between aircraft o and aircraft iand n represents the number of closest aircraft to include inthe state of each agent By defining the loss of separationdistance between two aircraft in the state space the agentsshould be able to develop a strategy for non-uniform loss ofseparation requirements for different aircraft types In thiswork we consider the standard uniform loss of separationrequirements and look to explore this idea in future work

332 ACTION SPACE

All agents decide to change or maintain their desired speedevery 12 seconds in simulation time The action space forthe agents can be defined as follows

At = [vmin vtminus1 vmax]

where vmin is the minimum allowed cruise speed (decel-erate) vtminus1 is the current speed of the aircraft (hold) andvmax is the maximum allowed cruise speed (accelerate)

333 TERMINAL STATE

Termination in the episode was achieved when all aircrafthad exited the sector

Naircraft = 0


334 REWARD FUNCTION

The reward function for the agents were all identical butlocally applied to encourage cooperation between the agentsIf two agents were in conflict they would both receive apenalty but the remaining agents that were not in conflictwould not receive a penalty Here a conflict is defined asthe distance between any two aircraft is less than 3 nmiThe reward needed to be designed to reflect the goal of thispaper safe separation and conflict resolution We were ableto capture our goals in the following reward function for theagents

rt =

minus1 if dco lt 3

minusα+ β middot dco if dco lt 10 and dco ge 3

0 otherwise

where dco is the distance from the ownship to the closestaircraft in nautical miles and α and β are small positiveconstants to penalize agents as they approach the loss ofseparation distance By defining the reward to reflect thedistance to the closest aircraft this allows the agent to learnto select actions to maintain safe separation requirements

4 Solution ApproachTo solve the BlueSky Case Studies we designed and de-veloped a novel deep multi-agent reinforcement learningframework called the Deep Distributed Multi-Agent Rein-forcement Learning framework (DD-MARL) In this sec-tion we introduce and describe the framework then weexplain why this framework is needed to solve this CaseStudy

To formulate this environment as a deep multi-agent rein-forcement learning problem we utilized a centralized learn-ing with decentralized execution framework with one neuralnetwork where the actor and critic share layers of same theneural network further reducing the number of trainableparameters By using one neural network we can train amodel that improves the joint expected return of all agentsin the sector which encourages cooperation between theagents We utilized the synchronous version of A3C A2C(advantage actor critic) which is shown to achieve the sameor better performance as compared to the asynchronous ver-sion (Schulman et al 2017) We also adapted the A2Calgorithm to incorporate the PPO loss function defined in(6) which we found led to a more stable policy and resultedin better final performance We follow a similar approachto (Everett et al 2018) to split the state into the two partsownship state information and all other information whichwe call the local state information slocal We then encodethe local state information using a fully connected layerbefore combining the encoded state with the ownship stateinformation From there the combined state is sent through

Figure 3 Illustration of the neural network architecture for A2Cwith shared layers between the actor and critic Each hidden layeris a fully connected (FC) layer with 32 nodes for the encoded stateand 256 nodes for the last two layers

two fully connected layers and produces two outputs thepolicy and value for a given state Fig 3 shows an illus-tration of the the neural network architecture With thisframework we can implement the neural network to allaircraft instead of having a specified neural network forall individual aircraft In this way the neural network isacting as a centralized learner and distributing knowledgeto each aircraft The neural networkrsquos policy is distributedat the beginning of each episode and updated at the endof each episode which reduces the amount of informationthat is sent to each aircraft since sending an updated modelduring the route could be computationally expensive In thisformulation each agent has identical neural networks butsince they are evolving different states their actions can bedifferent

It is also important to note that this framework is invariant tothe number of aircraft When observing an en-route sectoraircraft are entering and exiting which creates a dynamicenvironment with varying number of aircraft Since ourapproach does not depend on the number of aircraft ourframework can handle any number of aircraft arriving basedon stochastic inter-arrival times

5 Numerical Experiments51 Interface

To test the performance of our proposed framework weutilized the BlueSky air traffic control simulator This sim-ulator is built around python so we were able to quicklyobtain the state space information of all aircraft1 By designwhen restarting the simulation all objectives were the samemaintain safe separation and sequencing resolve conflictsand minimize delay Aircraft initial positions and availablespeed changes did not change between simulation runs

1Code will be made available at httpsgithubcommarcbrittain


52 Environment Setting

For each simulation run in BlueSky we discretized the envi-ronment into episodes where each run through the simula-tion counted as one episode We also introduced a time-step∆t so that after the agents selected an action the envi-ronment would evolve for ∆t seconds until a new actionwas selected We set ∆t to 12 seconds to allow for a no-ticeable change in state from st rarr st+1 and to check thesafe-separation requirements at regular intervals

There were many different parameters that needed to betuned and selected for the Case Studies We implementedthe adapted A2C concept mentioned earlier with two hiddenlayers consisting of 256 nodes The encoding layer for theN -closest aircraft state information consisted of 32 nodesand we used the ReLU activation function for all hiddenlayers The output of the actor used a Softmax activationfunction and the output of the critic used a Linear activationfunction Other key parameter values included learning ratelr = 00001 γ = 099 ε = 02 α = 01 β = 0005 and weused the Adam optimizer for both the actor and critic loss(Kingma and Ba 2014)

53 Case Study 1 Three routes with two intersections

In this Case Study we considered three routes with twointersections as shown in Fig 2 In our DD-MARL frame-work the single neural network is implemented on eachaircraft as they enter the sector Each agent is then ableto select its own desired speed which greatly increases thecomplexity of this problem since the agents need to learnhow to cooperate in order to maintain safe-separation re-quirements What also makes this problem interesting isthat each agent does not have a complete representation ofthe state space since only the ownship (any given agent)state information and the N-closest agent state informationare included

54 Case Study 2 Two routes with one merging point

This Case Study consisted of two routes merging to one sin-gle point and then following one route thereafter (see Fig 4This poses another critical challenge for an autonomous airtraffic control system that is not present in Case Study 1merging When merging to a single point safe separationrequirements need to be maintain both before the mergingpoint and after This is particularly challenging since thereis high density traffic on each route that is now combiningto one route Agents need to carefully coordinate in order tomaintain safe separation requirements after the merge point

In both Case Studies there were 30 total aircraft that enteredthe airspace following a uniform distribution over 4 5 and6 minutes This is an extremely difficult problem to solvebecause the agents cannot simply memorize actions the

Figure 4 Case Study 2 two routes R1 and R2 merge to a singleroute at M1

Table 1 Performance of the policy tested for 200 episodes

CASE STUDY MEAN MEDIAN

1 2999 plusmn 0141 302 30 30

agents need to develop a strategy in order to solve the prob-lem We also included the 3 closest agents state informationin the state of the ownship All other agents are not includedin the state of the ownship The episode terminated whenall 30 aircraft had exited the sector so the optimal solutionin this problem is 30 goals achieved Here we define goalachieved as an aircraft exiting it the sector without conflict

55 Algorithm Performance

In this section we analyze the performance of DD-MARLon the Case Studies We allowed the AI agents to trainfor 20000 episodes and 5000 episodes for Case Study 1and Case Study 2 resepctively We then evaluated the finalpolicy for 200 episodes to calculate the mean and standarddeviation along with the median to evaluate the performanceof the final policy as shown in Table 12 We can see fromFig 5 that for Case Study 1 the policy began convergingto a good policy by around episode 7500 then began tofurther refine to a near optimal policy for the remainingepisodes For Case Study 2 we can see from Fig 6 that anear optimal policy was obtained in only 2000 episodes andcontinued to improve through the remainder of the 3000episodes Training for only 20000 episodes (as required inCase Study 1) is computationally inexpensive as it equatesto less than 4 days of training We suspect that this is dueto the approach of distributing one neural network to allaircraft and by allowing shared layers between the actor andcritic

2A video of the final converged policy can be foundat httpswwwyoutubecomwatchv=sjRGjiRZWxg andhttpsyoutubeNvLxTJNd-q0


0 2500 5000 7500 10000 12500 15000 17500 20000

Episode

0

5

10

15

20

25

30

Count

Learning Curve

Goals

Conflicts

Optimal

Figure 5 Learning curve of the DD-MARL framework for CaseStudy 1 Results are smoothed with a 30 episode rolling averagefor clarity

We can see from Table 1 that on average we obtained ascore of 2999 throughout the 200 episode testing phasefor Case Study 1 and 30 for Case Study 2 This equates toresolving conflict 9997 at the intersections and 100 atthe merging point Given that this is a stochastic environ-ment we speculate that there could be cases where there isan orientation of aircraft where the 3 nmi loss of separationdistance can not be achieved and in such cases we wouldalert human ATC to resolve this type of conflict The me-dian score removes any outliers from our testing phase andwe can see the median score is optimal for both Case Study1 and Case Study 2

6 ConclusionWe built an autonomous air traffic controller to ensure safeseparation between aircraft in high-density en-route airspacesector The problem is formulated as a deep multi-agentreinforcement learning problem with the actions of selectingdesired aircraft speed The problem is then solved by usingthe DD-MARL framework which is shown to be capableof solving complex sequential decision making problemsunder uncertainty According to our knowledge the majorcontribution of this research is that we are the first researchgroup to investigate the feasibility and performance of au-tonomous air traffic control with a deep multi-agent rein-forcement learning framework to enable an automated safeand efficient en-route airspace The promising results fromour numerical experiments encourage us to conduct futurework on more complex sectors We will also investigate thefeasibility of the autonomous air traffic controller to replacehuman air traffic controllers in ultra dense dynamic andcomplex airspace in the future

0 1000 2000 3000 4000 5000

Episode

0

5

10

15

20

25

30

Count

Learning Curve

Goals

Conflicts

Optimal


AcknowledgementsWe would like to thank Xuxi Yang Guodong Zhu JoshBertram Priyank Pradeep and Xufang Zheng whose inputand discussion helped in the success of this work

ReferencesAir A P Revising the airspace model for the safe integra-

tion of small unmanned aircraft systems Amazon PrimeAir 2015

Airbus Blueprint for the sky 2018 URLhttpsstoragegoogleapiscomblueprintAirbus_UTM_Blueprintpdf

Amato C and Shani G High-level reinforcement learningin strategy games In Proceedings of the 9th Interna-tional Conference on Autonomous Agents and MultiagentSystems volume 1-Volume 1 pp 75ndash82 InternationalFoundation for Autonomous Agents and Multiagent Sys-tems 2010

Baxley B T Johnson W C Scardina J and Shay R FAir traffic management technology demonstration-1 con-cept of operations (atd-1 conops) version 30 2016

Brittain M and Wei P Autonomous aircraft sequencingand separation with hierarchical deep reinforcement learn-ing In Proceedings of the International Conference forResearch in Air Transportation 2018

Bu L Babu R De Schutter B et al A comprehen-sive survey of multiagent reinforcement learning IEEETransactions on Systems Man and Cybernetics Part C(Applications and Reviews) 38(2)156ndash172 2008


Chen Y F Liu M Everett M and How J P Decentral-ized non-communicating multiagent collision avoidancewith deep reinforcement learning In 2017 IEEE interna-tional conference on robotics and automation (ICRA) pp285ndash292 IEEE 2017

Council N R et al Autonomy research for civil aviationtoward a new era of flight National Academies Press2014

Erzberger H Automated conflict resolution for air trafficcontrol 2005

Erzberger H and Heere K Algorithm and operational con-cept for resolving short-range conflicts Proceedings ofthe Institution of Mechanical Engineers Part G Journalof Aerospace Engineering 224(2)225ndash243 2010

Erzberger H and Itoh E Design principles and algorithmsfor air traffic arrival scheduling 2014

Everett M Chen Y F and How J P Motion planningamong dynamic decision-making agents with deep re-inforcement learning In 2018 IEEERSJ InternationalConference on Intelligent Robots and Systems (IROS) pp3052ndash3059 IEEE 2018

Farley T and Erzberger H Fast-time simulation evaluationof a conflict resolution algorithm under high air trafficdemand In 7th USAEurope ATM 2007 RampD Seminar2007

Genders W and Razavi S Using a deep reinforcementlearning agent for traffic signal control arXiv preprintarXiv161101142 2016

Google Google uas airspace system overview 2015URL httpsutmarcnasagovdocsGoogleUASAirspaceSystemOverview5pager[1]pdf

Hoekstra J M and Ellerbroek J Bluesky atc simulatorproject an open data and open source approach InProceedings of the 7th International Conference on Re-search in Air Transportation pp 1ndash8 FAAEurocontrolUSAEurope 2016

Holden J and Goel N Fast-forwarding to a future ofon-demand urban air transportation San Francisco CA2016

Hunter G and Wei P Service-oriented separation assurancefor small uas traffic management In Integrated Communi-cations Navigation and Surveillance (ICNS) ConferenceHerndon VA 2019

Kingma D P and Ba J Adam A method for stochasticoptimization arXiv preprint arXiv14126980 2014

Kopardekar P Rios J Prevot T Johnson M Jung Jand Robinson J E Unmanned aircraft system trafficmanagement (utm) concept of operations 2016

Kopardekar P H Safely enabling civilian unmanned aerialsystem (uas) operations in low-altitude airspace by un-manned aerial system traffic management (utm) 2015

Liang X Du X Wang G and Han Z Deep reinforce-ment learning for traffic light control in vehicular net-works arXiv preprint arXiv180311115 2018

Matignon L Laurent G J and Le Fort-Piat N Inde-pendent reinforcement learners in cooperative markovgames a survey regarding coordination problems TheKnowledge Engineering Review 27(1)1ndash31 2012

Mnih V Kavukcuoglu K Silver D Graves AAntonoglou I Wierstra D and Riedmiller M Playingatari with deep reinforcement learning arXiv preprintarXiv13125602 2013

Mnih V Badia A P Mirza M Graves A LillicrapT Harley T Silver D and Kavukcuoglu K Asyn-chronous methods for deep reinforcement learning InInternational conference on machine learning pp 1928ndash1937 2016

Mueller E R Kopardekar P H and Goodrich K HEnabling airspace integration for high-density on-demandmobility operations In 17th AIAA Aviation TechnologyIntegration and Operations Conference pp 3086 2017

Schulman J Wolski F Dhariwal P Radford A andKlimov O Proximal policy optimization algorithmsarXiv preprint arXiv170706347 2017

Silver D and Hassabis D Alphago Mastering the ancientgame of go with machine learning Research Blog 92016

Silver D Huang A Maddison C J Guez A Sifre LVan Den Driessche G Schrittwieser J Antonoglou IPanneershelvam V Lanctot M et al Mastering thegame of go with deep neural networks and tree searchnature 529(7587)484 2016

Tan M Multi-agent reinforcement learning Independentvs cooperative agents In Proceedings of the tenth inter-national conference on machine learning pp 330ndash3371993

Uber Uber elevate - the future of urban air transport 2018URL httpswwwubercominfoelevate

Undertaking S J U-space blueprint SESAR Joint Under-taking Accessed September 18 2017


Vinyals O Ewalds T Bartunov S Georgiev P Vezhn-evets A S Yeo M Makhzani A Kuttler H AgapiouJ Schrittwieser J et al Starcraft ii A new challenge forreinforcement learning arXiv preprint arXiv1708047822017

Wollkind S Valasek J and Ioerger T Automated conflictresolution for air traffic management using cooperativemultiagent negotiation In AIAA Guidance Navigationand Control Conference and Exhibit pp 4992 2004


the current air traffic situation and provide advisories toaircraft in an efficient and scalable manner Reinforcementlearning a branch of machine learning is a promising wayto solve this problem The goal in reinforcement learning isto allow an agent to learn an optimal policy by interactingwith an environment The agent is trained by first perceivingthe state in the environment selecting an action based onthe perceived state and receiving a reward based on this per-ceived state and action By formulating the tasks of humanair traffic controllers as a reinforcement learning problemthe trained agent can provide dynamic real-time air trafficadvisories to aircraft with extremely high safety guaranteeunder uncertainty and little computation overhead

Artificial intelligence (AI) algorithms are achieving per-formance beyond humans in many real-world applicationstoday An artificial intelligence agent called AlphaGo builtby DeepMind defeated the world champion Ke Jie in threematches of Go in May 2017 (Silver amp Hassabis 2016) Thisnotable advance in the AI field demonstrated the theoreticalfoundation and computational capability to potentially aug-ment and facilitate human tasks with intelligent agents andAI technologies To utilize such techniques fast-time sim-ulators are needed to allow the agent to efficiently learn inthe environment Until recently there were no open-sourcehigh-quality air traffic control simulators that allowed forfast-time simulations to enable an AI agent to interact withThe air traffic control simulator BlueSky developed by TUDelft allows for realistic real-time air traffic scenarios andwe decide to use this software as the environment and simu-lator for performance evaluation of our proposed framework(Hoekstra amp Ellerbroek 2016)

In this paper a deep multi-agent reinforcement learningframework is proposed to enable autonomous air trafficseparation in en-route airspace where each aircraft is repre-sented by an agent Each agent will comprehend the currentair traffic situation and perform online sequential decisionmaking to select speed advisories in real-time to avoid con-flicts at intersections merging points and along route Ourproposed framework provides another promising potentialsolution to enable an autonomous air traffic control system

12 Related Work

Deep reinforcement learning has been widely explored inground transportation in the form of traffic light control(Genders and Razavi 2016 Liang Du Wang and Han2018) In these approaches the authors deal with a singleintersection and use one agent per intersection to controlthe traffic lights Our problem is similar to ground trans-portation in the sense we want to provide speed advisoriesto aircraft to avoid conflict in the same way a traffic lightadvises cars to stop and go The main difference with ourproblem is that we need to control the speed of each aircraft

to ensure there is no along route conflict In our work werepresent each aircraft as an agent instead of the intersectionto handle along route and intersection conflicts

There have been many important contributions to the topicof autonomous air traffic control One of the most promis-ing and well-known lines of work is the Autoresolver de-signed and developed by Heinz Erzberger and his NASAcolleagues (Erzberger 2005 Erzberger amp Heere 2010 Far-ley amp Erzberger 2007) It employs an iterative approach se-quentially computing and evaluating candidate trajectoriesuntil a trajectory is found that satisfies all of the resolutionconditions The candidate trajectory is then output by thealgorithm as the conflict resolution trajectory The Autore-solver is a physics-based approach that involves separatecomponents of conflict detection and conflict resolution Ithas been tested in various large-scale simulation scenarioswith promising performance

Strategies for increasing throughput of aircraft while min-imizing delay in high-density sectors are currently beingdesigned and implemented by NASA These works includethe Traffic Management Advisor (TMA) (Erzberger andItoh 2014) or Traffic Based Flow Management (TBFM) acentral component of ATD-1 (Baxley Johnson Scardinaand Shay 2016) In this approach a centralized plannerdetermines conflict free time-slots for aircraft to ensure sep-aration requirements are maintained at the metering fix Ouralgorithm also is able to achieve a conflict free metering fixby allowing the agents to learn a cooperative strategy thatis queried quickly online during execution unlike TBFMAnother main difference with our work is that our proposedframework is a decentralized framework that can handleuncertainty In TMA or TBFM once the arrival sequenceis determined and aircraft are within the ldquofreeze horizonrdquono deviation from the sequence is allowed which could beproblematic if one aircraft becomes uncooperative

Multi-agent approaches have also been applied to conflictresolution (Wollkind Valasek and Ioerger 2004) In thisline of work negotiation techniques are used to resolveidentified conflicts in the sector In our research we donot impose any negotiation techniques but leave it to theagents to derive negotiation techniques through learning andtraining

Reinforcement learning and deep Q-networks have beendemonstrated to play games such as Go Atari and Warcraftand most recently Starcraft II (Amato and Shani 2010Mnih Kavukcuoglu Silver Graves Antonoglou Wierstraand Riedmiller 2013 Silver Huang Maddison Guez SifreVan Den Driessche Schrittwieser Antonoglou Panneer-shelvam Lanctot et al 2016 Vinyals Ewalds BartunovGeorgiev Vezhnevets Yeo Makhzani Kuttler AgapiouSchrittwieser et al 2017) The results from these papersshow that a well-designed sophisticated AI agent is capable
















E[

Tsumt=0

(r(st at)|π)] (1)









sumkminus1i=0 γ

irt+i + γkV (st+k)




LCLIP(θ) =







31 Objective








331 STATE SPACE











LOS(o n) I(1) I(2) I(n))


332 ACTION SPACE




333 TERMINAL STATE


Naircraft = 0


334 REWARD FUNCTION


rt =

minus1 if dco lt 3


0 otherwise






















1 2999 plusmn 0141 302 30 30






0 2500 5000 7500 10000 12500 15000 17500 20000

Episode

0

5

10

15

20

25

30

Count

Learning Curve

Goals

Conflicts

Optimal




0 1000 2000 3000 4000 5000

Episode

0

5

10

15

20

25

30

Count

Learning Curve

Goals

Conflicts

Optimal























































E[

Tsumt=0

(r(st at)|π)] (1)









sumkminus1i=0 γ

irt+i + γkV (st+k)




LCLIP(θ) =







31 Objective








331 STATE SPACE











LOS(o n) I(1) I(2) I(n))


332 ACTION SPACE




333 TERMINAL STATE


Naircraft = 0


334 REWARD FUNCTION


rt =

minus1 if dco lt 3


0 otherwise






















1 2999 plusmn 0141 302 30 30






0 2500 5000 7500 10000 12500 15000 17500 20000

Episode

0

5

10

15

20

25

30

Count

Learning Curve

Goals

Conflicts

Optimal




0 1000 2000 3000 4000 5000

Episode

0

5

10

15

20

25

30

Count

Learning Curve

Goals

Conflicts

Optimal










































sumkminus1i=0 γ

irt+i + γkV (st+k)




LCLIP(θ) =







31 Objective








331 STATE SPACE











LOS(o n) I(1) I(2) I(n))


332 ACTION SPACE




333 TERMINAL STATE


Naircraft = 0


334 REWARD FUNCTION


rt =

minus1 if dco lt 3


0 otherwise






















1 2999 plusmn 0141 302 30 30






0 2500 5000 7500 10000 12500 15000 17500 20000

Episode

0

5

10

15

20

25

30

Count

Learning Curve

Goals

Conflicts

Optimal




0 1000 2000 3000 4000 5000

Episode

0

5

10

15

20

25

30

Count

Learning Curve

Goals

Conflicts

Optimal












































331 STATE SPACE











LOS(o n) I(1) I(2) I(n))


332 ACTION SPACE




333 TERMINAL STATE


Naircraft = 0


334 REWARD FUNCTION


rt =

minus1 if dco lt 3


0 otherwise






















1 2999 plusmn 0141 302 30 30






0 2500 5000 7500 10000 12500 15000 17500 20000

Episode

0

5

10

15

20

25

30

Count

Learning Curve

Goals

Conflicts

Optimal




0 1000 2000 3000 4000 5000

Episode

0

5

10

15

20

25

30

Count

Learning Curve

Goals

Conflicts

Optimal









































334 REWARD FUNCTION


rt =

minus1 if dco lt 3


0 otherwise






















1 2999 plusmn 0141 302 30 30






0 2500 5000 7500 10000 12500 15000 17500 20000

Episode

0

5

10

15

20

25

30

Count

Learning Curve

Goals

Conflicts

Optimal




0 1000 2000 3000 4000 5000

Episode

0

5

10

15

20

25

30

Count

Learning Curve

Goals

Conflicts

Optimal




















































1 2999 plusmn 0141 302 30 30






0 2500 5000 7500 10000 12500 15000 17500 20000

Episode

0

5

10

15

20

25

30

Count

Learning Curve

Goals

Conflicts

Optimal




0 1000 2000 3000 4000 5000

Episode

0

5

10

15

20

25

30

Count

Learning Curve

Goals

Conflicts

Optimal









































0 2500 5000 7500 10000 12500 15000 17500 20000

Episode

0

5

10

15

20

25

30

Count

Learning Curve

Goals

Conflicts

Optimal




0 1000 2000 3000 4000 5000

Episode

0

5

10

15

20

25

30

Count

Learning Curve

Goals

Conflicts

Optimal








































abstract 1. introduction 1.1. motivation arxiv:1905 ... · 1.1. motivation with the rapid increase...

Documents