
Toward Packet Routing with Fully-distributed Multi-agent Deep Reinforcement Learning

Xinyu You†, Xuanjie Li†, Yuedong Xu, Hui Feng, Jin Zhao, Huaicheng Yan

Abstract—Packet routing is one of the fundamental problems in computer networks, in which a router determines the next-hop of each packet in the queue to get it as quickly as possible to its destination. Reinforcement learning (RL) has been introduced to design autonomous packet routing policies with local information of stochastic packet arrival and service. However, the curse of dimensionality of RL prohibits a more comprehensive representation of dynamic network states, thus limiting its potential benefit. In this paper, we propose a novel packet routing framework based on multi-agent deep reinforcement learning (DRL) in which each router possesses an independent LSTM recurrent neural network for training and decision making in a fully distributed environment. The LSTM recurrent neural network extracts routing features from rich information regarding backlogged packets and past actions, and effectively approximates the value function of Q-learning. We further allow each router to communicate periodically with its direct neighbors so that a broader view of the network state can be incorporated. Experimental results show that our multi-agent DRL policy can strike the delicate balance between congestion-aware and shortest routes, and significantly reduces the packet delivery time in general network topologies compared with its counterparts.

Index Terms—Packet routing, multi-agent learning, deep reinforcement learning, local communications

I. INTRODUCTION

PACKET routing is a very challenging problem in distributed and autonomous computer networks, especially in wireless networks in the absence of centralized or coordinated service providers. Each router decides to which neighbour it should send its packet in order to minimize the delivery time. The primary feature of packet routing resides in its fine-grained per-packet forwarding policy. No information regarding the network traffic is shared between neighbouring nodes. In contrast, existing protocols use flooding approaches either to maintain a globally consistent routing table (e.g. DSDV [10]), or to construct an on-demand flow-level routing table (e.g. AODV [9]). Packet routing is essential to meet the dynamically changing traffic patterns in today's communication networks. Meanwhile, it symbolizes the difficulty of designing a fully distributed forwarding policy that strikes a balance between choosing short paths and less congested paths through learning with local observations.

X. You, X. Li, Y. Xu and H. Feng are with the School of Information Science and Technology, Fudan University, Shanghai 200237, China. Emails: {xyyou18, xuanjieli16, ydxu, hfeng}@fudan.edu.cn

J. Zhao is with the School of Computer Science, Fudan University, Shanghai 200237, China. Email: [email protected]

H. Yan is with the Key Laboratory of Advanced Control and Optimization for Chemical Process of Ministry of Education, East China University of Science and Technology, Shanghai 200237, China, and also with the College of Mechatronics and Control Engineering, Hubei Normal University, Huangshi 435002, China. Email: [email protected].

† The first two authors contributed equally to this work.

Reinforcement learning (RL) is a bio-inspired machine learning approach that acquires knowledge by exploring the interaction with the local environment without the need of external supervision [1]. Therefore, it is suitable to address the routing challenge in distributed networks where each node (interchangeable with router) measures the per-hop delivery delays as the reward of its actions and learns the best action accordingly. The authors of [5] proposed the first multi-agent Q-learning approach for packet routing in a generalized network topology. This straightforward routing policy achieves a much smaller mean delivery delay compared with the benchmark shortest-path approach. Xia et al. [33] applied a dual RL-based Q-routing approach to improve the convergence rate of routing in cognitive radio networks. Lin and Schaar [3] adopted a joint Q-routing and power control policy for delay-sensitive applications in wireless networks. More applications of RL-based routing algorithms can be found in [11]. Owing to the well-known "curse of dimensionality" [14], the state-action space of RL is usually small, such that the existing RL-based routing algorithms cannot take full advantage of the history of network traffic dynamics and cannot explore sufficiently many trajectories before deciding the packet forwarding. The complexity of training RL with a large state-action space becomes an obstacle to deploying RL-based packet routing.

The breakthrough of deep reinforcement learning (DRL) provides a new opportunity to a good many RL-based networking applications that were previously perplexed by the prohibitive training burden. With a deep neural network (DNN) as a powerful approximator of the Q-table, the network designer can leverage its advantages from two aspects: (1) the neural network can take much more information as its input, enlarging the state-action space for better policy making; (2) the neural network can automatically abstract invisible features from high-dimensional input data [17], thus achieving end-to-end decision making and alleviating handcrafted feature selection. Recent successful applications include cloud resource allocation [21], adaptive bitrate video streaming [22], and cellular scheduling [23]. DRL is even used to generate routing policies in [24] against dynamic traffic patterns that are hardly predictable. However, the authors of [24] consider a centralized routing policy that requires the global topology and the global traffic demand matrix, and operates at the flow level. Inspired by the power of DRL and in view of the limitations of Q-routing [5], we aim to make an early attempt to develop fully-distributed packet routing policies using multi-agent deep reinforcement learning.

In this paper, we propose a novel multi-agent DRL algorithm named Deep Q-routing with Communication (DQRC) for fully distributed packet routing. Each router uses a carefully designed LSTM recurrent neural network (RNN) to learn and infer the dynamic routing policy independently. DQRC takes high-dimensional information as its input: the destinations of the head-of-line (HOL) as well as backlogged packets, the action history, and the queue lengths of neighboring routers that are reported periodically. The action of DQRC is the next-hop of the HOL packet, and the packet delivery time is chosen as the reward of Q-learning so as to train the LSTM neural network at each router. The intuitions of our design are twofold, both related to the queueing process. On one hand, the action history is closely related to the congestion of next hops, the number of backlogged packets indicates the load of the current router, and knowing the destinations of outgoing packets avoids pumping them into the same adjacent routers. On the other hand, with a lightweight communication scheme, an agent can acquire the queue information of its direct neighbors and learn to send packets to less congested next hops. With such a large input space, existing RL approaches such as Q-routing [5] cannot handle the training online, so training deep neural networks using RL rewards becomes a necessity. DQRC is fully distributed in the sense that each router is configured with an independent neural network for parameter update and decision making. This differs from the recent multi-agent DRL learning frameworks in other domains [6] where the training of neural networks is simultaneous and globally consistent. The training of multi-agent DRL is usually difficult (e.g. convergence and training speed), while DQRC proves the feasibility of deploying DRL-based packet routing in a dynamic environment.

With Q-routing [5] and Backpressure [34] as benchmarks, our experimental results reveal a few interesting observations. Firstly, DQRC significantly outperforms the other two representative algorithms in terms of the average delivery delay under different network loads and topologies. With careful examination of the DQRC routing policy, we observe that each router makes adaptive routing decisions by considering more information than the destination of the HOL packet, thus avoiding congestion on "popular" paths. Secondly, the lightweight communication mechanism is very beneficial to DQRC. An agent only shares its raw queue length rather than a host of DNN parameters with its neighbors, and this sharing can be infrequent or delayed for several time slots with merely a gentle increase of delivery time. Thirdly, DQRC is robust to the hyper-parameters of the LSTM neural network, indicating that a moderate complexity of neural networks (e.g. 3 hidden layers and 128 neurons in a hidden layer) is sufficient.

The remainder of this paper is organized as follows: Section II reviews the background knowledge of RL, DRL and POMDP. Section III presents our design of DQRC. The delivery delay of the proposed algorithm is evaluated in Section IV with Q-routing and Backpressure as the benchmarks. Section V is devoted to discussions about future study and challenges. Section VI concludes this work.

II. BACKGROUND AND LITERATURE

In this section, we briefly review traditional routing algorithms, RL and DRL techniques, and their applications to the routing problem. We then put forward the necessity of fully-distributed learning for real-world routing problems. Finally, the background of POMDP is included.

A. Traditional Routing Algorithm

As a profound and globally applicable routing approach, the shortest-path algorithm reveals how to transfer all the data as quickly as possible. Among the algorithms in the shortest-path class, the Bellman-Ford algorithm [39] and Dijkstra's algorithm [40] are crucial to the development of network protocols. The definition of a shortest path may vary with different contexts, such as transmission delay or number of hops in the network. It is easy to understand that along the shortest path between two nodes, data packets can be delivered in the least amount of time provided that there is no congestion along the route. However, this assumption rarely holds in realistic networks. When the network load is heavy, the shortest-path algorithm will lead to severe backlogs in busy routers and thus necessitates manual monitoring and adjustment.

Backpressure [34] has shown its efficiency in dynamic traffic routing in multi-hop networks by using congestion gradients. Each node maintains a separate queue for each potential destination node, and each packet is assigned to one queue according to its destination. The basic idea of Backpressure is to utilize the differential of the queue backlogs between the current node and its neighbor nodes as the pressure to drive packet transmission. Backpressure has been widely studied in the literature because of its simplicity and asymptotic optimality in the heavy-traffic regime, but there are still some drawbacks hindering its wider application: Backpressure may suffer from poor delay performance particularly under light load, in which case there is not enough pressure to push data towards the destination, and therefore packets may take unnecessarily long paths and even loops.
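To make the congestion-gradient idea concrete, the following is a minimal Python sketch of the classic Backpressure forwarding decision with per-destination queues; the data structures and the function name are illustrative assumptions rather than a definitive implementation.

```python
# Minimal sketch of the classic Backpressure forwarding rule.
# `queues[node][dest]` holds the per-destination backlog at each node and
# `neighbors[node]` lists adjacent nodes; both names are illustrative.

def backpressure_next_hop(node, dest, queues, neighbors):
    """Pick the neighbor with the largest positive backlog differential for
    `dest`; return None when no positive pressure exists (light-load case)."""
    best_neighbor, best_pressure = None, 0
    for v in neighbors[node]:
        # Congestion gradient: own backlog minus the neighbor's backlog.
        pressure = queues[node][dest] - queues[v][dest]
        if pressure > best_pressure:
            best_neighbor, best_pressure = v, pressure
    return best_neighbor
```

Returning None when no neighbor has positive pressure mirrors the light-load weakness discussed above: without a fallback rule, packets may linger or wander.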

B. RL Routing Algorithm

Based on the mapping between observed states and executed actions, RL aims to construct an agent that maximizes the expected discounted reward through interaction with the environment. Without prior knowledge of which state the environment would transition to or which actions yield the highest reward, the learner must discover the optimal policy by trial-and-error.

The first attempt to apply RL to the packet routing problem is the Q-routing algorithm, which is a variant of Q-learning [1]. Since Q-routing is essentially based on a multi-agent approach, each node is viewed as an independent agent and endowed with a Q-table to store Q-values as estimates of the transmission time between that node and the others. With the aim of shortening the average packet delivery time, agents update their Q-tables and learn the optimal routing policy through the feedback from their neighboring nodes when receiving the packets sent to them. Despite its superior performance over the shortest-path algorithm in dynamic network environments, Q-routing suffers from the inability to fine-tune its routing policy under heavy network load and its inadequate adaptability to network load changes. To address these problems, other improved algorithms have been proposed, such as PQ-routing [7], which uses previous routing memory to predict the traffic trend, and DRQ-routing [8], which utilizes the information from both forward and backward exploration to make better decisions.
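For reference, the per-hop Q-routing update described above can be sketched as follows; the nested-dictionary Q-table, the variable names and the learning rate value are illustrative assumptions.

```python
# Sketch of the Q-routing update at node x after it forwards a packet
# destined for d to neighbor y. Q[x][(d, y)] estimates the remaining
# delivery time via y; q and l are the measured queueing and transmission
# delays for this hop. All names are illustrative.

def q_routing_update(Q, x, y, d, q, l, neighbors, alpha=0.1):
    # Feedback from neighbor y: its best estimate of the time still needed
    # to deliver the packet from y to destination d.
    t_remaining = min(Q[y][(d, z)] for z in neighbors[y])
    target = q + l + t_remaining
    Q[x][(d, y)] += alpha * (target - Q[x][(d, y)])
```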

C. DRL Routing Algorithm

DRL brings the advantages of deep neural networks [41], [42] to the training process, thereby improving the learning speed and the performance of RL [4]. One popular DRL algorithm is Deep Q-Learning (DQL) [25], which implements a Deep Q-Network (DQN) instead of a Q-table to approximate the Q-value, with the special mechanisms of experience replay and a target Q-network.

Recently, network routing problems have been solved using DRL under different environments and optimization targets. Based on the control model of the agent, these algorithms can be categorized as follows:

Class 1: Single-agent learning.

A single-agent algorithm treats the network controller as a central agent which can observe the global information of the network and control the packet transmission at each router. Both the learning and execution processes of this kind of algorithm are centralized [28]; in other words, the communication between routers is not restricted during training and execution.

SDN-Routing [24] presents the first attempt to apply single-agent DRL to the routing optimization of traffic engineering. The traffic demand, which represents the bandwidth request between each source-destination pair, is viewed as the environment state. The network controller determines the transmission paths of packets to achieve the objective of minimizing the network delay. Another algorithm [20] considers a similar network model while taking minimum link utilization as the optimization target.

Class 2: Multi-agent learning.

Cooperative control in multi-agent systems provides an alternative way to solve complicated problems that are hard for one agent to perform [26], [27]. In multi-agent learning, each router in the network is treated as a single agent which can observe partial environment information and take actions according to its own routing policy.

The first multi-agent DRL routing algorithm is DQN-routing [6], which is the combination of Q-routing and DQN. Each router is regarded as an agent whose parameters are shared with the others and updated at the same time during the training process (centralized training), but which provides independent instructions for packet transmission (decentralized execution). The comparison with contemporary routing algorithms confirms a substantial performance gain.

Nevertheless, the algorithms with a centralized learning process stated above are hard to apply in realistic networks. The centralized learning controller is usually unable to gather environment transitions from widely distributed routers once an action is executed somewhere, or to update the parameters of each neural network simultaneously, owing to the limited bandwidth [31], [32].

Accordingly, for better applicability in real-world scenarios, the routing algorithm we propose is executed in a fully-distributed way [19], which means both the training process and the execution process are decentralized. Under these circumstances, each agent owns a unique neural network with independent parameters for policy update and decision making, thereby avoiding the necessity of communications among routers in the process of environment transition collection and parameter update.

D. POMDP

In real-world environments, it is rare that agents can observe the full state of the system. This type of problem can be modelled as a partially observable Markov decision process (POMDP), where the observation an agent receives is only a partial glimpse of the underlying system state [35]. The partial observation of the system state aggravates the non-stationarity problem in multi-agent reinforcement learning (MARL).

To tackle this problem, an efficient solution is to utilize the Deep Recurrent Q-Network (DRQN) [35], which is capable of capturing invisible features based on time-continuous observations. Derived from DQN, this ingenious neural network architecture substitutes Long Short-Term Memory (LSTM) [36] layers for some fully-connected layers; these can not only store and reuse history information to explore temporal features, but also split inputs into several parts to search for connections between them. The LSTM layer contains special units called memory blocks, in which memory cells store the temporal state of the network and gates control the flow of information [37]. The existence of LSTM helps alleviate the dilemma that the present task cannot access relevant information received long before it.

Another promising direction is to design a cooperation or communication mechanism to help agents learn not only from their own knowledge but also from public information broadcast by other agents. Two MARL algorithms are put forward in [28] to address partially observable multi-agent decision-making problems: reinforced inter-agent learning (RIAL) and differentiable inter-agent learning (DIAL). The former uses independent DQN, while the latter backpropagates error derivatives through communication channels during learning. With centralised learning but decentralised execution, the communication between agents promises success in the MARL domain.

Inspired by these ideas, we propose our algorithm DQRC with fully distributed multi-agent deep reinforcement learning. DQRC amalgamates traditional DQN with an LSTM architecture as well as a communication mechanism to address POMDPs in the packet routing problem.

III. DESIGN

We establish the mathematical model of the packet routing problem and describe the representation of each element in the reinforcement learning formulation. We then propose a novel deep recurrent neural network architecture and the corresponding training algorithm.

Fig. 1. 3x3 network topology.

A. Mathematical Model

We consider a network consisting of routers and the links connected between them; packet routing is the transmission of packets from a source to its destination through intermediate nodes. We now present the mathematical model of the packet routing problem.

Network. The network is modeled as a directed graph G = (N, E), where N and E are defined as finite sets of nodes and transmission links between them, respectively. A simple network topology can be found in Fig. 1, containing 9 nodes and 12 pairs of bidirectional links. Each packet is originated from node s and destined for node d, with s, d ∈ N and s ≠ d, at randomly generated intervals.

Routing. The mission of packet routing is to transfer each packet to its destination through the relaying of multiple routers. The queue of each router follows the first-in first-out (FIFO) criterion. Each router n constantly delivers its HOL packet to a neighbor node v until that packet reaches its destination.

Target. The packet routing problem aims at finding the optimal transmission path between source and destination nodes based on some routing metric, which, in our experiment, is defined as the average delivery time of packets. Formally, we denote the packet set as P and the total transmission time as tp for each packet p ∈ P. Our target is to minimize the average delivery time T = (1/K) ∑_{p∈P} tp, where K denotes the number of packets in P.

B. Reinforcement Learning Formulation

Packet routing can be modeled as a multi-agent reinforcement learning problem with partially observable Markov decision processes (POMDPs) [30]. Each node is an independent agent and learns its routing policy by observing the local network state and communicating with its neighbor nodes. Therefore, we will describe the definitions of each element in reinforcement learning for a single agent.

State space. The packet p to be sent by agent n is defined as the current packet. We denote the state space of agent n as Sn : {dp, En, Cn}, where dp is the destination of the current packet, En is some extra information related to agent n, and Cn is the information shared from the neighbor nodes of agent n. At different time steps, the state observed by the agent is time-varying due to the dynamic change of network traffic.

Fig. 2. LSTM neural network with ReLU activation. The input layer takes the one-hot encodings of the current destination (N), the action history (k·N), the future destinations (m·N) and the max-queue node (N); fc layer 1 consists of four 32-neuron subsets (one per input part), followed by fc layers and an LSTM layer of 128 neurons each, and an output layer of size |An| (fc: fully-connected).

Action space. The action space of agent n is defined as An : Vn, where Vn is the set of neighbor nodes of node n. Accordingly, for each agent, the size of the action space equals the number of its adjacent nodes. For example, node 4 in Fig. 1 has four candidate actions. Once a packet arrives at the head of the queue at time step t, agent n observes the current state st ∈ Sn and picks an action at ∈ An, and then the current packet is delivered to the corresponding neighbor of node n.

Reward. We craft the reward to guide the agent towards an effective policy for our target: minimizing the average delivery time. The reward at time step t is set to be the sum of the queueing time and the transmission time: rt = q + l, where the former, q, is the time spent in the queue of agent n, and the latter, l, is the transmission latency to the next hop.

C. Deep Neural Network

We will introduce the architecture of the neural network designed for our algorithm in this part. Note that, in the formulation of fully-distributed reinforcement learning, each node is an individual agent and therefore possesses its own neural network for decision making. Accordingly, the following description of the neural network architecture is tailored for a single agent.

As shown in Fig. 2, we build a deep recurrent neural network with three fully-connected layers and one LSTM layer. The input of the neural network can be classified into four parts:

(1) current destination: the destination node of the current packet.

(2) action history: the executed actions for the past k packets sent out just before the current packet.

(3) future destinations: the destination nodes of the next m packets waiting behind the current packet.

(4) max-queue node: the node which has the longest queue among all the neighbor nodes of the current node.

In the formulation of the state space in Section III-B, the current destination corresponds to dp, the action history and the future destinations correspond to En, and the max-queue node corresponds to Cn.

Before being input into the neural network, all of the above information is processed with one-hot encoding. Taking the first part of the input information as an example, if the current packet is destined for node 4, the one-hot encoding of the current destination is [000010000]. Therefore, the number of input neurons equals (1 + k + m + 1) × N, where N is defined as the total number of nodes in the network topology.
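The construction of the input vector can be sketched as follows, where N, k and m follow the definitions above; the padding of short histories with zero vectors is an assumption for illustration.

```python
import numpy as np

def one_hot(node, N):
    # One-hot encoding of a node index, e.g. node 4 with N = 9 gives
    # [0, 0, 0, 0, 1, 0, 0, 0, 0].
    v = np.zeros(N, dtype=np.float32)
    v[node] = 1.0
    return v

def build_state(cur_dest, action_history, future_dests, max_queue_node, N, k, m):
    # Histories shorter than k (or m) are padded with all-zero vectors.
    hist = [one_hot(a, N) for a in action_history[-k:]]
    hist += [np.zeros(N, dtype=np.float32)] * (k - len(hist))
    fut = [one_hot(d, N) for d in future_dests[:m]]
    fut += [np.zeros(N, dtype=np.float32)] * (m - len(fut))
    parts = [one_hot(cur_dest, N)] + hist + fut + [one_hot(max_queue_node, N)]
    return np.concatenate(parts)   # length (1 + k + m + 1) * N
```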

The first hidden layer is a concatenation of four subsets of hidden neurons. Each subset possesses 32 neurons and is fully connected with the corresponding part of the input neurons independently. Following the first hidden layer, another two hidden layers with 128 neurons are added.

In a partially observable environment, each agent can only receive an observation s which is correlated with the full environment state. Inspired by Deep Recurrent Q-Networks (DRQN) [35], we add an LSTM layer to maintain an internal state and aggregate observations over time. In addition to the partial observation s, the hidden state h of the agent is included to represent the Q-value, which is defined as Q(s, h, a).

Furthermore, the size of the output layer and the agent's action space |An| are identical, and the value of each output neuron is the estimated Q-value of the corresponding action. With this change of the representation of the Q-value, we update the parameters of the neural network instead of the values of the Q-table.

We use the Rectified Linear Unit (ReLU) as the activation function and Root Mean Square Prop (RMSProp) as the optimization algorithm.
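A minimal PyTorch sketch of the per-agent network is given below, assuming the four-part input, the 32-neuron sub-layers, 128-neuron hidden and LSTM layers, ReLU activations and an RMSProp optimizer described above; the exact layer ordering and the learning rate are assumptions.

```python
import torch
import torch.nn as nn

class DQRCNet(nn.Module):
    """Sketch of one agent's recurrent Q-network (cf. Fig. 2)."""

    def __init__(self, N, k, m, num_actions):
        super().__init__()
        # One 32-neuron sub-layer per input part.
        self.fc_dest = nn.Linear(N, 32)
        self.fc_hist = nn.Linear(k * N, 32)
        self.fc_fut = nn.Linear(m * N, 32)
        self.fc_queue = nn.Linear(N, 32)
        self.fc2 = nn.Linear(4 * 32, 128)
        self.lstm = nn.LSTM(input_size=128, hidden_size=128, batch_first=True)
        self.fc3 = nn.Linear(128, 128)
        self.out = nn.Linear(128, num_actions)   # one Q-value per neighbor
        self.relu = nn.ReLU()

    def forward(self, dest, hist, fut, queue, hidden=None):
        # Each input has shape (batch, features); `hidden` carries the
        # LSTM state (h, c) across successive decisions of the agent.
        x = torch.cat([self.relu(self.fc_dest(dest)),
                       self.relu(self.fc_hist(hist)),
                       self.relu(self.fc_fut(fut)),
                       self.relu(self.fc_queue(queue))], dim=1)
        x = self.relu(self.fc2(x))
        x, hidden = self.lstm(x.unsqueeze(1), hidden)   # sequence length 1
        x = self.relu(self.fc3(x.squeeze(1)))
        return self.out(x), hidden

# optimizer = torch.optim.RMSprop(net.parameters(), lr=1e-3)  # assumed lr
```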

D. Learning Algorithm

By integrating Q-routing and DRQN, we propose a packet routing algorithm with multi-agent deep reinforcement learning, where both the training and execution processes are decentralized. The pseudo-code of the learning algorithm, which we call Deep Q-routing with Communication (DQRC), is shown in Algorithm 1, in which the initialization and the training process are identical for each node.

Each node i is treated as an individual agent and possesses its own neural network Qi with particular parameters θi to estimate the state-action value function Qi(s, h, a; θi), which represents the expected delivery time for a packet to reach the destination when the agent executes action a in state s and hidden state h. A replay memory Di with a capacity of 100 is also initialized independently for each agent to store its environment transitions, and from it a random mini-batch of size 16 is sampled for the update of its network parameters.

Algorithm 1 Deep Q-routing with Communication (DQRC)

// initialization
for agent i = 1, N do
    Initialize replay buffer Di ← ∅
    Initialize Q-network Qi with random weights θi
end for

// training process
for episode = 1, M do
    for each decision epoch t do
        Assign current agent n and packet p
        Observe local information dp and En
        Collect shared information Cn
        Integrate current state st : {dp, En, Cn} and hidden state ht
        Select and execute action
            at = a random action               with probability ε
                 argmax_a Qn(st, ht, a; θn)    with probability 1 − ε
        Forward p to next agent vt
        Calculate and collect reward rt
        Observe next state st+1 and next hidden state ht+1
        Set transmission flag ft = 1 if vt = dp, and 0 otherwise
        Store transition (st, ht, rt, vt, st+1, ht+1, ft) in Dn
        Sample a random batch (sj, hj, rj, vj, sj+1, hj+1, fj) from Dn
        Set yj = rj + max_a′ Qvj(sj+1, hj+1, a′; θvj)(1 − fj)
        θn ← GradientDescent((yj − Qn(sj, hj, aj; θn))²)
    end for
end for

For each decision epoch t, when a packet p arrives at the head of line of a certain node n, agent n observes local information dp and En and collects shared information Cn through communication with its neighbor nodes. By integrating the current state st : {dp, En, Cn} and hidden state ht, agent n executes an action at based on the ε-greedy policy, which means that agent n chooses a random action from its action space An with probability ε or chooses the action with the highest Q-value with probability 1 − ε. The assignment of at is given by:

    at = { a random action                with probability ε
         { argmax_a Qn(st, ht, a; θn)     with probability 1 − ε        (3.1)

Then the current packet p is forwarded to the corresponding neighbor node vt, and the reward rt is calculated and sent back to agent n. The current state and hidden state transition to st+1 and ht+1, respectively. Besides, the transmission flag ft is set to 1 if the next node vt matches the packet's destination dp, or to 0 otherwise. The assignment of ft is given by:

    ft = { 1    vt = dp
         { 0    otherwise                                               (3.2)

After the feedback of this information, agent n records the transition (st, ht, rt, vt, st+1, ht+1, ft) into its replay memory Dn. Different from the sequential update process of DRQN, a training batch (sj, hj, rj, vj, sj+1, hj+1, fj) is sampled randomly from Dn to avoid violating the DQN random sampling policy. As a result of the unstable environment caused by the multi-agent characteristic, the remaining delivery time τ that packet p is expected to spend from vt to dp needs to be recalculated before the training process. τ is given by:

    τ = max_a′ Qvj(sj+1, hj+1, a′; θvj)                                 (3.3)

At the end of the decision epoch t, gradient descent is used to fit the neural network Qn(θn). The target value yj is the sum of the immediate reward rj and the remaining delivery time τ:

    yj = rj + τ(1 − fj)                                                 (3.4)

The parameters of Qn(θn) can be trained by minimising the loss function Lt:

    Lt = (yj − Qn(sj, hj, aj; θn))²                                     (3.5)

After differentiating the loss function Lt with respect to the weights θn, we can update θn by:

    θn ← θn − α∇θn (yj − Qn(sj, hj, aj; θn))²                           (3.6)

where α is the learning rate. In this way, the network parameters of each agent are updated with episodic training until convergence.
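The update in Eqs. (3.3)-(3.6) can be sketched in PyTorch as follows, reusing the hypothetical DQRCNet from the previous sketch; the batch layout, tensor shapes and helper names are assumptions for illustration, not the exact implementation.

```python
import torch

def dqrc_train_step(q_net, neighbor_net, optimizer, batch):
    """One gradient step at agent n for a sampled transition batch.

    `q_net` is agent n's network, `neighbor_net` the network of the
    next-hop agent v_j; `batch` unpacks into the stored transition fields.
    """
    s, h, a, r, s_next, h_next, f = batch
    with torch.no_grad():
        # Remaining delivery time estimated by the next-hop agent (Eq. 3.3),
        # masked out when the packet has already reached its destination.
        q_next, _ = neighbor_net(*s_next, hidden=h_next)
        tau = q_next.max(dim=1).values * (1.0 - f)
    y = r + tau                                      # target value (Eq. 3.4)
    q_pred, _ = q_net(*s, hidden=h)
    q_taken = q_pred.gather(1, a.unsqueeze(1)).squeeze(1)
    loss = torch.mean((y - q_taken) ** 2)            # loss (Eq. 3.5)
    optimizer.zero_grad()
    loss.backward()                                  # descent step (Eq. 3.6)
    optimizer.step()
    return loss.item()
```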

IV. EVALUATION

We conducted several experiments in a simulated computer network environment to evaluate the performance of DQRC in both online and offline training modes. Our experiments cover a broad set of network conditions and topologies. We then explain the comparison results from the perspective of the input information and the learned policy. Furthermore, a deep dive into DQRC is made to test its robustness to various parameters.

A. Simulation Environment

We now describe the settings of our simulation environment and another two algorithms to be compared with DQRC.

Topology. The topology of the network we used is the same as Fig. 1, which remains static during the whole experiment. Despite the simple structure, we can explore new insights into packet routing, and a more complex network actually leads to similar results. All the nodes and links in the network share the same attributes: the buffer size of each node is unlimited and the bandwidth of each link equals the packet size, in which case only a single packet can be transmitted at a time.

Packet. A certain proportion of packets, named the distribution ratio, are generated from node 0 (busy ingress-router) to node 8 (busy egress-router), while the sources and destinations of the other packets are chosen uniformly. Packets are introduced into the network with the same size, and their generation intervals are fixed at a certain value which is inversely proportional to the network traffic load.
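A minimal sketch of this traffic generator is shown below; the dictionary layout and parameter defaults are illustrative assumptions.

```python
import random

def generate_packets(num_packets, nodes, ratio=0.7, interval=0.5,
                     ingress=0, egress=8):
    # With probability `ratio` a packet travels from the busy ingress to the
    # busy egress; otherwise its source-destination pair is drawn uniformly.
    packets, t = [], 0.0
    for _ in range(num_packets):
        if random.random() < ratio:
            src, dst = ingress, egress
        else:
            src, dst = random.sample(nodes, 2)   # uniform pair, src != dst
        packets.append({"birth": t, "src": src, "dst": dst})
        t += interval                            # fixed generation interval
    return packets
```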

Fig. 3. Offline training speed comparison.

Time setting. Time during the simulation is measured in milliseconds. The transmission time a packet spends between adjacent nodes is set to 1.0 ms. The performance criterion of the experiment is the average delivery time of packets within a certain period.

Compared algorithms. Both the Q-routing and Backpressure algorithms stated in Section II are included for performance comparison. Since [5] has proved the inferior performance of the shortest-path algorithm, it is not taken into comparison. In our experiment, in order to follow the FIFO criterion, the traditional Backpressure algorithm is tailored to route the HOL packet to the neighbor node with the minimal number of packets with the same destination.
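A one-line sketch of this tailored rule, with illustrative data structures, is given below.

```python
# `queues[v]` is the list of destinations of packets queued at node v.
def tailored_backpressure_next_hop(node, hol_dest, queues, neighbors):
    # Send the HOL packet to the neighbor holding the fewest packets that
    # share its destination.
    return min(neighbors[node],
               key=lambda v: sum(1 for d in queues[v] if d == hol_dest))
```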

B. Offline-training Online-test Result

In the offline experiments, we generated a fixed packet series containing 1000 packets as the training set, on which we trained the neural network and Q-table separately. Instead of ending with all the packets routed to their destinations, each training episode stops after 100 ms, in order to match the endless packet stream of the online environment. After 1000 training episodes, we restored the well-trained models to compare their performance in a new test environment where packets were generated at the corresponding network load level but with different source-destination pairs from the training set. Note that Backpressure does not need training and can be applied to the test directly.

Training speed. With a fixed packet sequence whose generation interval is 0.5 ms and distribution ratio is 70%, we trained DQRC and Q-routing and compared their training speeds. Fig. 3 shows the variation trend of the average packet delivery time along with the training episode. The solid line is the smoothed result of the true value depicted in the shadowed part. Though performing worse in the first few episodes, DQRC converges quickly and stays stable at a lower delay level. On the contrary, Q-routing fluctuates violently from time to time and never converges. From the perspective of either convergence speed or final training result, DQRC outperforms Q-routing in our simulation.


Fig. 4. Offline test with different network loads.

Network load. With the distribution ratio at 70%, we carried out online tests with packet generation intervals ranging from 0.3 ms to 1.0 ms and recorded 50 sets of results for the first 100 ms under each parameter. The average value and standard deviation of the resulting data are depicted with a curve and a shadow, respectively, in Fig. 4. As expected, we can clearly see that:

(1) All three algorithms present a rising tendency in average delivery time as the load becomes heavier. DQRC outperforms the other two algorithms at all network load levels.

(2) Q-routing behaves equivalently to DQRC when the generation interval is between 0.9 ms and 1.0 ms (low network load). Conversely, Backpressure has fairly good performance when the generation interval is between 0.4 ms and 0.8 ms (high network load). This interesting phenomenon may reveal the fact that DQRC has the merits of Q-routing and Backpressure while discarding their shortcomings.

(3) The width of the shadow band of DQRC is the narrowest, meaning that our proposed algorithm performs more stably during the test and is robust to the randomly generated packet set.

Distribution ratio. We collected another 50 sets of results from online tests by adjusting the distribution ratio from 10% to 90% and fixing the generation interval at 0.5 ms. Similarly, each test ends after 100 ms. From the result diagram shown in Fig. 5, we can obtain more appealing discoveries:

(1) As the distribution ratio keeps increasing, the average delivery time of all three algorithms is prolonged, because a larger distribution ratio puts more pressure on the links connecting node 0 and node 8.

(2) The distribution ratios for which Q-routing and Backpressure perform well are 10%–40% and 50%–90%, respectively, but DQRC works well in either case. This result leads to the same conclusion as above: DQRC possesses the features of both Q-routing and Backpressure.

(3) Considering DQRC, there is almost no shadow area around the solid line before the distribution ratio reaches 60%, indicating good robustness to the randomness of packet generation.

(4) The slope of Q-routing is the largest, showing that Q-routing has higher sensitivity to the change of distribution ratio than the other two algorithms. By contrast, DQRC and Backpressure are more robust to this kind of traffic change.

Fig. 5. Offline test with different distribution ratios.

In summary, no matter in which kind of test environment, DQRC integrates the merits of both Q-routing and Backpressure and performs better in the following three aspects:

(1) The average delivery time of DQRC is the shortest regardless of the generation interval or distribution ratio.

(2) DQRC is more adaptable and resistant to changes in network load and in the spatial distribution of packet generation.

(3) DQRC frees itself from random errors caused by network disturbances.

C. Online-training Online-test Result

In the online simulation environment, packets are generated continuously following the rules described in Section IV-A. The simulation timeline is split into intervals of 100 ms, and for each interval the average delivery time of transmitted packets is recorded. In order to test each algorithm's adaptability to changes in traffic pattern, we initially set the generation interval of packets to 1.0 ms, suddenly change it to 0.7 ms at time 4000 ms, and reset it to 1.0 ms at time 8000 ms. Before being put into the online test, DQRC and Q-routing are well trained in the offline training environment with the generation interval set to 1.0 ms, and during the simulation the parameters of the neural networks and the values of the Q-table are updated from time to time. Note that Backpressure does not need training.

Fig. 6 plots the average result of 50 online tests with different packet sets. We can clearly see that: (1) Q-routing has comparable performance with DQRC in the first and third stages; however, when the network load begins to increase at 4000 ms, Q-routing finds it hard to converge to a new routing policy to address the growing number of packets, leading to larger transmission delays and unstable performance; (2) though behaving fairly well in the second stage, Backpressure has inferior performance compared to DQRC and Q-routing in the lightly loaded network due to insufficient pressure to push packets to their destinations; (3) DQRC achieves the shortest delivery time and the lowest fluctuations in all three stages. Besides, DQRC converges more quickly than Q-routing, and is therefore more adaptable to dynamic changes of network load.

Fig. 6. Online test result with changing network loads.

Fig. 7. AT&T North America Topology.

D. Complex Topology

In the above experiments, the 3×3 topology in Fig. 1 is symmetrically designed and each node has no more than 4 neighbor nodes. To test the scalability of our proposed algorithm, we expand the network scale to a 25-node, 56-edge topology depicted in Fig. 7. The topology is taken from the Topology Zoo [38] and represents the real AT&T North America network. Each vertex represents a city, for example node 0 is New York City, and each edge is a bidirectional link. The connectivity of routers becomes more complex and each router has more choices to make, thus increasing the difficulty of decision making. The attributes of the new topology are the same as those in Section IV-A except that node 17 is viewed as the busy ingress-router and node 8 is viewed as the busy egress-router.

Fig. 8. Pre-training result with different optimization algorithms.

Fig. 9. Comparison between DQRC with and without pre-training.

In multi-agent DRL, the addition of more agents will increase the convergence difficulty of the neural networks. To address this problem, we introduce the technique of pre-training. In the neural network designed in Section III-C, the value of each output neuron represents the estimated remaining delivery time after choosing the corresponding neighbor node. We take the shortest-path algorithm as an initial guide to grasp the topology information. In detail, we first calculate the lengths of the shortest paths between each source-destination pair; then we treat these lengths as labels and craft states by editing only the current destination and setting the other three parts of the input information to zeros; finally, we perform supervised learning to train the neural network before putting it into the RL process.
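The pre-training step can be sketched as follows, reusing the hypothetical DQRCNet from the earlier sketch; the use of networkx for hop counts, the Adam learning rate, and the exact labeling (one hop plus the shortest hop count from the chosen neighbor to the destination) are assumptions.

```python
import networkx as nx
import torch

def pretrain_agent(net, graph, node, N, k, m, epochs=200, lr=1e-3):
    """Supervised pre-training of one agent's network with shortest-path labels."""
    dist = dict(nx.all_pairs_shortest_path_length(graph))
    nbrs = list(graph.neighbors(node))          # output neuron i <-> nbrs[i]
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(epochs):
        for dest in graph.nodes:
            if dest == node:
                continue
            # Crafted state: only the current destination is set; the other
            # three input parts are all-zero, as described above.
            d = torch.zeros(1, N)
            d[0, dest] = 1.0
            hist = torch.zeros(1, k * N)
            fut = torch.zeros(1, m * N)
            q = torch.zeros(1, N)
            target = torch.tensor([[1.0 + dist[v][dest] for v in nbrs]])
            pred, _ = net(d, hist, fut, q)
            loss = torch.mean((pred - target) ** 2)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```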

We conducted pre-training with three optimization algorithms: Stochastic Gradient Descent (SGD), Adam and RMSProp. Fig. 8 depicts the average loss of all the agents with respect to the pre-training episode. We can clearly see that, with Adam or RMSProp as the optimization algorithm, the average loss decreases sharply in the first few episodes, and after about 200 episodes the neural network of each node converges with an average loss close to zero. However, pre-training with SGD converges to a higher level at a slower speed. Therefore, we choose Adam as the final optimization algorithm in the pre-training process.

Fig. 10. Offline test with different network loads in the AT&T network.

Fig. 9 shows the performance gap between DQRC with and without pre-training during the RL training process. We find that, in such a complex topology with 25 nodes, DQRC takes a very long period to converge with random initialization. However, after the implementation of pre-training, DQRC reaches fairly good performance in the initial state and converges quickly to a low delay level.

In line with the experiment settings in Section IV-B, we execute online tests in two cases: (1) fixing the distribution ratio and changing the network load level with different packet generation intervals; (2) fixing the generation interval and changing the distribution ratio. The simulation results are illustrated in Fig. 10 and Fig. 11, respectively. Unsurprisingly, in this complicated topology, the variation trends of the average delivery time with respect to the network load level and the distribution ratio are fairly consistent with those in the 3×3 topology. The similar results in both topologies demonstrate the good robustness of DQRC.

E. Performance Analysis

With the results in both Offline-training Online-test and Online-training Online-test experiments with different topologies, DQRC can be regarded as a combination of Q-routing and Backpressure, achieving not only shorter average delivery time but also greater stability in dynamically changing networks. We will clarify the main reasons for this result from the perspective of the routing policy learned by each algorithm. To simplify our interpretation, the following analysis is based on the 3x3 topology (Fig. 1).

Fig. 11. Offline test with different distribution ratios in AT&T network.

Why does DQRC possess the features of Q-routing and Backpressure? From Fig. 4 and Fig. 5, we can see that in both cases DQRC resembles Q-routing under light network load and Backpressure under heavy network load. This is an inevitable result owing to the neural network structure. In the description of the neural network design in Section III-C, the deep recurrent neural network of DQRC is constructed on the basis of the Q-table in Q-routing, which is used to store Q-values for each state-action pair. Therefore, DQRC retains the ability to estimate the remaining delivery time for all possible decisions. When the network is relatively lightly loaded, both the Q-table and the neural network can give the agent an accurate estimate of the Q-value to help with decision making. But this consonance will be disturbed by the rise of network load, in which case the traits of Backpressure become useful. Different from Q-routing, agents in DQRC are no longer isolated without communication with each other; in contrast, they exchange their queue information with adjacent agents. With this extra information, agents in DQRC can be aware of the pressure in the queues of neighbor nodes, thus avoiding conveying too many packets to busy nodes. DQRC has grasped the core idea of Backpressure, which has proved its excellent performance especially in heavy traffic patterns.

What are the advantages of DQRC? In addition to time delay estimation with the neural network and information sharing between neighbor nodes, we also introduce new techniques into DQRC: (1) as described in Section III-C, we take full advantage of the representation ability of the neural network by expanding the input dimension. With the introduction of action history and future destinations, DQRC can make more accurate estimates of the Q-value. (2) With the LSTM layer in the neural network, DQRC can capture invisible connections among input elements and temporal features of sequential decisions.

Whenever the agents of Q-routing choose an action, the only information they can utilize is the destination of the current packet, leading to the same routing decision for packets with the same destination. For this reason, the sole input will cause violent fluctuations during the process of offline training (Fig. 3).


However, with additional information as the network input, the agents of DQRC can execute different but effective routing policies for each packet despite the same destination. More precisely, we evaluate the case where the generation interval and distribution ratio are set to 0.1 ms and 70%, respectively, in the online test. The network load is so heavy that a large number of packets are waiting in the queue of node 0 to be transmitted to node 8. For these packets, the agent of node 0 has two choices: sending them to node 1 or node 3. The well-trained agent of node 0 in Q-routing follows the shortest path, and therefore all those packets will be sent to node 1. Under this strategy, serious congestion will unavoidably occur on the links 0→1, 1→2 and 1→4, which eventually leads to longer delivery time. However, DQRC can overcome this difficulty cleverly. Before making a decision for every packet destined for node 8 at node 0, the agent will first collect the information about the actions the last five packets have taken and the nodes the next five packets are destined for, and then check the queue lengths of node 1 and node 3 through communication. For example, when the last five packets were sent to node 1, the agent will decide to send the current packet to node 3 to avoid long queuing delay. Similarly, after some packets were sent to node 3, the agent will change its policy and decide to transfer the packet through node 1 again. Therefore, the great packet burden can be shared by node 1 and node 3 in DQRC instead of being borne only by node 0 in Q-routing.

Of course, it is possible that different kinds of information will lead to various decisions, so agents need to train themselves to judge which kind of information is more crucial and needs more attention. After this self-adaptation in the training process, the astute agents have the ability to grasp the dynamic changes in the network and adjust their routing policies accordingly, which, as shown in our test results, yields shorter average delivery time and better performance.

F. Deep Dive into DQRC

In this part, we are committed to exploring the intrinsic disciplines within DQRC by evaluating its performance in the following situations: (1) we change the architecture of the neural network with different numbers of neurons and dense layers, (2) we modify the communication interval of information sharing among agents, (3) we rule out certain elements of DQRC, including extra information, shared information and the LSTM layer. Note that the following experiments are conducted with the generation interval set to 0.5 ms and the distribution ratio to 70%.

Neural network architecture. Starting with the originalarchitecture (Fig. 2), we altered a range of neural networkparameters to understand their impact on the performanceof DQRC. First, after fixing the number of neurons in eachhidden layer at 128, we varied the number of hidden layersbetween the first hidden layer (fc1) and LSTM layer in thearchitecture of DQRC. The comparison results are shown inTABLE I. We find that the default setting yields the bestperformance, but the delay gap between it and the others isvery small, showing DRQC’s great robustness to the depth ofneural network. Then with the default layer number, we varied

TABLE ITEST WITH DIFFERENT NUMBER OF HIDDEN LAYERS

Number of hidden layers(between fc1 and LSTM layer) Average delivery time

0 4.281 4.172 4.113 4.33

TABLE IITEST WITH DIFFERENT NUMBER OF NEURONS

Number of neurons Average delivery time

32 5.2664 4.96128 4.11256 4.71

TABLE IIITEST WITH DIFFERENT INTERVALS OF INFORMATION SHARING

Intervals ofinformation sharing (ms) Average delivery time

0 4.111 4.322 4.753 4.894 5.195 5.35

Fig. 12. Test with different algorithms. (Bar chart comparing the average delivery time of Q-routing, DQR, DQRC without communication, DQRC without LSTM layer, and DQRC.)

the number of neurons in each hidden layer and in the LSTM layer. These parameters change synchronously, i.e., when 64 neurons are used, each sub-layer of the first hidden layer has 16 neurons. The results of this alteration are presented in TABLE II. We can see that the number of neurons has a larger influence on DQRC's performance: too few or too many neurons lead to performance degradation. In conclusion, careful choice of the neural network's hyper-parameters is needed for better estimation.
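As a concrete reading of this architecture discussion, the PyTorch sketch below builds a DQRC-like Q-network: a first hidden layer split into equal-width sub-layers (one per input block), a configurable number of dense layers, an LSTM layer, and a Q-value output over the candidate next hops. It is an illustrative approximation of Fig. 2 under our own assumptions (four input blocks, equal sub-layer widths), not the authors' exact model.

import torch
import torch.nn as nn

class DQRCNet(nn.Module):
    # Sketch of a DQRC-style Q-network: per-block sub-layers -> dense layers -> LSTM -> Q-values.

    def __init__(self, block_dims, num_actions, hidden=128, num_hidden_layers=2):
        super().__init__()
        # One sub-layer per input block; with hidden=128 and four blocks each
        # sub-layer has 32 neurons (16 when hidden=64, matching the text above).
        sub_width = hidden // len(block_dims)
        self.sub_layers = nn.ModuleList(nn.Linear(d, sub_width) for d in block_dims)
        # Hidden layers between fc1 and the LSTM layer (the quantity varied in TABLE I).
        dims = [sub_width * len(block_dims)] + [hidden] * num_hidden_layers
        dense = []
        for i in range(num_hidden_layers):
            dense += [nn.Linear(dims[i], dims[i + 1]), nn.ReLU()]
        self.dense = nn.Sequential(*dense) if dense else nn.Identity()
        self.lstm = nn.LSTM(input_size=dims[-1], hidden_size=hidden, batch_first=True)
        self.q_out = nn.Linear(hidden, num_actions)

    def forward(self, blocks, hidden_state=None):
        # blocks: list of tensors shaped (batch, seq_len, block_dims[i])
        x = torch.cat([torch.relu(f(b)) for f, b in zip(self.sub_layers, blocks)], dim=-1)
        x = self.dense(x)
        x, hidden_state = self.lstm(x, hidden_state)
        return self.q_out(x), hidden_state

# Example: four input blocks (destination, action history, future destinations,
# shared queue lengths) and two candidate next hops.
net = DQRCNet(block_dims=[12, 60, 60, 2], num_actions=2)
blocks = [torch.randn(1, 1, d) for d in [12, 60, 60, 2]]
q_values, _ = net(blocks)

Varying num_hidden_layers and hidden in this sketch mirrors the sweeps reported in TABLE I and TABLE II.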

Communication interval. In our original design, each agent collects the queue lengths of its neighbouring nodes immediately whenever it makes a decision. However, when applying DQRC


to a network with delayed feedback, the time delay during the transmission of shared information may impede the adoption of DQRC in practice. Therefore, we modified the communication interval of information sharing among agents to test its influence on DQRC's performance. With the communication interval ranging from 1 ms to 5 ms, the test results are shown in TABLE III. We find that the average delivery time rises steadily as we increase the communication interval. Despite the slight performance degradation, the average delivery time of DQRC is still less than half of that of Backpressure, showing DQRC's feasibility in non-real-time networks.
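One plausible reading of this experiment is that each agent acts on neighbour queue lengths that may be up to one communication interval old. The sketch below caches shared queue lengths and refreshes them only once per interval; the class and method names are our own hypothetical stand-ins, not the simulator's API.

class ToyNode:
    # Stand-in for a router node; only exposes its current queue length.
    def __init__(self, queue_len):
        self._queue_len = queue_len
    def queue_len(self):
        return self._queue_len

class NeighborQueueCache:
    # Refreshes neighbours' queue lengths only every interval_ms, so decisions
    # made between refreshes rely on stale shared information (as in TABLE III).
    def __init__(self, neighbors, interval_ms):
        self.neighbors = neighbors          # dict {node_id: ToyNode}
        self.interval_ms = interval_ms
        self.last_update = float('-inf')
        self.cached = {nid: 0 for nid in neighbors}

    def get(self, now_ms):
        if now_ms - self.last_update >= self.interval_ms:
            self.cached = {nid: n.queue_len() for nid, n in self.neighbors.items()}
            self.last_update = now_ms
        return dict(self.cached)

# With a 3 ms interval, the query at t = 2 ms still returns the stale snapshot.
cache = NeighborQueueCache({1: ToyNode(40), 3: ToyNode(3)}, interval_ms=3)
print(cache.get(now_ms=0))
print(cache.get(now_ms=2))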

Elements of DQRC. In order to verify the rationale for each element of DQRC, we derive another three algorithms from DQRC, as sketched below: (1) DQRC without communication: we forbid the communication process in DQRC and eliminate the shared information from the state space; (2) DQRC without LSTM: we delete the LSTM layer from the original neural network (Fig. 2); (3) DQR: we include only the current destination in the input information.
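To make the three ablations concrete, the snippet below shows how each variant's input could be assembled; the exact input layout of the authors' implementation may differ, so treat the function names as hypothetical.

# Each *_state function returns the blocks that the variant's network consumes.

def dqrc_state(dest, last_actions, future_dests, neighbor_queues):
    # Full DQRC: destination + extra information + shared information.
    return [dest, last_actions, future_dests, neighbor_queues]

def dqrc_no_comm_state(dest, last_actions, future_dests, _neighbor_queues):
    # DQRC without communication: the shared queue lengths are dropped.
    return [dest, last_actions, future_dests]

def dqr_state(dest, *_ignored):
    # DQR: only the current destination, i.e. the same input as Q-routing.
    return [dest]

# DQRC without LSTM keeps the full state but replaces the LSTM layer
# with an ordinary fully-connected layer inside the network itself.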

The comparison of the different algorithms is shown in Fig. 12. We can see that the lack of either the extra information or the shared information leads to remarkable performance degradation. If not equipped with the LSTM layer, DQRC's average delivery time increases by 10 percent. As for DQR and Q-routing, the only difference between these two algorithms is the representation of the Q-value; the learning algorithms are identical. A neural network that merely approximates the Q-table does not help to estimate the transmission time between the source and the destination of a packet more accurately. As a result, DQR and Q-routing, which share the same input, have comparable performance.

V. DISCUSSION

In this section, we put forward our research plan and the ensuing challenges in several directions, deriving from some limitations of the current work.

Other DRL algorithms. The routing algorithm we propose is based on DQN [25], which is a classical but simple form of DRL. Thanks to the tremendous contributions made by the DRL research community, more effective DRL algorithms can be leveraged in packet routing. For example, as an optimization of policy-gradient-based RL, TRPO [12] is combined with neural networks in the continuous-control domain to ensure monotonic performance improvement [13]. Besides, based on DPG [15], the off-policy actor-critic algorithm DDPG [16] uses a neural network as a differentiable function approximator to estimate the action-value function, and then updates the policy parameters in the direction of the deterministic policy gradient.
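As a reminder of what such an actor-critic update looks like, the sketch below performs one generic deterministic-policy-gradient step with toy networks; it is not part of DQRC and only illustrates the direction of the policy update mentioned above.

import torch
import torch.nn as nn

# Toy actor (deterministic policy) and critic (action-value function) networks.
actor = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2), nn.Tanh())
critic = nn.Sequential(nn.Linear(4 + 2, 32), nn.ReLU(), nn.Linear(32, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

def actor_update(states):
    # One deterministic policy gradient step: increase Q(s, mu(s)) w.r.t. actor parameters.
    actions = actor(states)                                   # mu(s)
    q_values = critic(torch.cat([states, actions], dim=-1))   # Q(s, mu(s))
    loss = -q_values.mean()
    actor_opt.zero_grad()
    loss.backward()
    actor_opt.step()

actor_update(torch.randn(8, 4))   # batch of 8 toy states with 4 features each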

Realistic simulation environment. In this paper, the experiments conducted in the simulation environment rely on several restrictive assumptions, as described in Section IV-A, which would impede the adoption of our proposed routing algorithms in realistic networks with complex traffic patterns. In future work, we will consider more general network settings, such as routers with finite queues and packets of varied sizes. Uncertain events such as link

breakage or node failure can also be taken into account. Furthermore, the NS-3 network simulator [2] can be utilized as a standard platform to test the performance of routing algorithms.

Multi-agent learning. The packet routing system in our current implementation is built on a multi-agent model, where each router is treated as an independent agent and learns asynchronously. However, in a multi-agent environment, the inherent non-stationarity problem [18] during the learning process is magnified by the use of DNNs. Besides the information-sharing mechanism stated in our paper, importance sampling and fingerprints [29] provide alternative solutions to this problem. Moreover, witnessing the benefits of cooperative learning [28], we will analyse the performance boost that local coordination of DNNs (e.g., parameter sharing and memory sharing [28]) can yield in our future study.

VI. CONCLUSION

We presented DQRC, a fully-distributed packet routing algorithm based on multi-agent deep reinforcement learning. We designed a deep recurrent neural network with proper inputs to estimate the transmission delay more accurately. With the information-sharing mechanism, each agent in DQRC can learn an adaptive routing policy through interaction with the environment. From the preliminary experimental results, we find that, compared with Q-routing and Backpressure, DQRC reduces the average packet delivery time to a considerable extent under different traffic patterns and topologies.

ACKNOWLEDGMENT

This work is partially supported by the Natural Science Foundation of China (No. 61772139), the National Key Research and Development Program of China (No.213), the Shanghai-Hong Kong Collaborative Project under Grant 18510760900, and the CERNET Innovation Project NGII20170209.

REFERENCES

[1] R. S. Sutton and A. G. Barto, “Reinforcement Learning: An Introduction,” MIT Press, 2018.

[2] T. R. Henderson, M. Lacage, G. F. Riley, C. Dowell, and J. Kopena, “Network simulations with the ns-3 simulator,” SIGCOMM demonstration, vol. 14, no. 14, p. 527, 2008.

[3] Z. Lin and M. van der Schaar, “Autonomic and distributed joint routing and power control for delay-sensitive applications in multi-hop wireless networks,” IEEE Transactions on Wireless Communications, vol. 10, no. 1, pp. 102-113, 2011.

[4] N. C. Luong et al., “Applications of Deep Reinforcement Learning in Communications and Networking: A Survey,” arXiv preprint arXiv:1810.07862, 2018.

[5] J. A. Boyan and M. L. Littman, “Packet routing in dynamically changing networks: A reinforcement learning approach,” Advances in Neural Information Processing Systems, 1994, pp. 671-678.

[6] D. Mukhutdinov, A. Filchenkov, A. Shalyto, and V. Vyatkin, “Multi-agent deep learning for simultaneous optimization for time and energy in distributed routing system,” Future Generation Computer Systems, vol. 94, pp. 587-600, 2019.

[7] S. P. Choi and D.-Y. Yeung, “Predictive Q-routing: A memory-based reinforcement learning approach to adaptive traffic control,” Advances in Neural Information Processing Systems, 1996, pp. 945-951.

[8] S. Kumar and R. Miikkulainen, “Dual reinforcement Q-routing: An on-line adaptive routing algorithm,” Proceedings of the Artificial Neural Networks in Engineering Conference, 1997, pp. 231-238.


[9] C. Perkins, E. Belding-Royer, and S. Das, “Ad hoc on-demand distance vector (AODV) routing,” 2003.

[10] C. E. Perkins and P. Bhagwat, “Highly dynamic destination-sequenced distance-vector routing (DSDV) for mobile computers,” ACM SIGCOMM Computer Communication Review, 1994, vol. 24, pp. 234-244.

[11] H. A. Al-Rawi, M. A. Ng, and K.-L. A. Yau, “Application of reinforcement learning to routing in distributed wireless networks: a review,” Artificial Intelligence Review, vol. 43, no. 3, pp. 381-416, 2015.

[12] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” International Conference on Machine Learning, 2015, pp. 1889-1897.

[13] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel, “Benchmarking deep reinforcement learning for continuous control,” International Conference on Machine Learning, 2016, pp. 1329-1338.

[14] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798-1828, 2013.

[15] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, “Deterministic policy gradient algorithms,” ICML, 2014.

[16] T. P. Lillicrap et al., “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.

[17] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath, “A brief survey of deep reinforcement learning,” arXiv preprint arXiv:1708.05866, 2017.

[18] P. Hernandez-Leal, M. Kaisers, T. Baarslag, and E. M. de Cote, “A survey of learning in multiagent environments: Dealing with non-stationarity,” arXiv preprint arXiv:1707.09183, 2017.

[19] Y. Xu, Z.-G. Wu, Y.-J. Pan, C. K. Ahn, and H. Yan, “Consensus of Linear Multiagent Systems With Input-Based Triggering Condition,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 49, no. 11, pp. 2308-2317, Nov. 2019.

[20] A. Valadarsky, M. Schapira, D. Shahaf, and A. Tamar, “Learning to route with deep RL,” NIPS Deep Reinforcement Learning Symposium, 2017.

[21] H. Mao, M. Alizadeh, I. Menache, and S. Kandula, “Resource management with deep reinforcement learning,” Proceedings of the 15th ACM Workshop on Hot Topics in Networks, 2016, pp. 50-56.

[22] H. Mao, R. Netravali, and M. Alizadeh, “Neural adaptive video streaming with Pensieve,” Proceedings of the Conference of the ACM Special Interest Group on Data Communication, 2017, pp. 197-210.

[23] Z. Xu, Y. Wang, J. Tang, J. Wang, and M. C. Gursoy, “A deep reinforcement learning based framework for power-efficient resource allocation in cloud RANs,” 2017 IEEE International Conference on Communications (ICC), 2017, pp. 1-6.

[24] G. Stampa, M. Arias, D. Sanchez-Charles, V. Munts-Mulero, and A. Cabellos, “A deep-reinforcement learning approach for software-defined networking routing optimization,” arXiv preprint arXiv:1709.07080, 2017.

[25] V. Mnih et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.

[26] R. Yang, H. Zhang, G. Feng, H. Yan, and Z. Wang, “Robust cooperative output regulation of multi-agent systems via adaptive event-triggered control,” Automatica, vol. 102, pp. 129-136, Apr. 2019.

[27] Y. Zhang, H. Li, J. Sun, and W. He, “Cooperative Adaptive Event-Triggered Control for Multiagent Systems With Actuator Failures,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 49, no. 9, pp. 1759-1768, Sep. 2019.

[28] J. Foerster, I. A. Assael, N. de Freitas, and S. Whiteson, “Learning to communicate with deep multi-agent reinforcement learning,” Advances in Neural Information Processing Systems, 2016, pp. 2137-2145.

[29] J. Foerster et al., “Stabilising experience replay for deep multi-agent reinforcement learning,” Proceedings of the 34th International Conference on Machine Learning, 2017, pp. 1146-1155.

[30] G. E. Monahan, “State of the art: a survey of partially observable Markov decision processes: theory, models, and algorithms,” Management Science, vol. 28, no. 1, pp. 1-16, 1982.

[31] Z.-G. Wu, Y. Xu, Y.-J. Pan, P. Shi, and Q. Wang, “Event-Triggered Pinning Control for Consensus of Multiagent Systems With Quantized Information,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 48, no. 11, pp. 1929-1938, Nov. 2018.

[32] R. Yang, H. Zhang, G. Feng, and H. Yan, “Distributed Event-Triggered Adaptive Control for Cooperative Output Regulation of Heterogeneous Multiagent Systems Under Switching Topology,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 9, pp. 4347-4358, Sep. 2018.

[33] B. Xia, M. H. Wahab, Y. Yang, Z. Fan, and M. Sooriyabandara, “Reinforcement learning based spectrum-aware routing in multi-hop cognitive radio networks,” 2009 4th International Conference on Cognitive Radio Oriented Wireless Networks and Communications, 2009, pp. 1-5.

[34] L. Tassiulas and A. Ephremides, “Stability properties of constrained queueing systems and scheduling policies for maximum throughput in multihop radio networks,” 29th IEEE Conference on Decision and Control, 1990, pp. 2130-2132.

[35] M. Hausknecht and P. Stone, “Deep recurrent Q-learning for partially observable MDPs,” 2015 AAAI Fall Symposium Series, 2015.

[36] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.

[37] H. Sak, A. Senior, and F. Beaufays, “Long short-term memory recurrent neural network architectures for large scale acoustic modeling,” Fifteenth Annual Conference of the International Speech Communication Association, 2014.

[38] S. Knight, H. Nguyen, N. Falkner, R. Bowden, and M. Roughan, “The Internet Topology Zoo,” IEEE Journal on Selected Areas in Communications, 2011.

[39] R. Bellman, “On a routing problem,” Quarterly of Applied Mathematics, vol. 16, no. 1, pp. 87-90, 1958.

[40] E. W. Dijkstra, “A note on two problems in connexion with graphs,” Numer. Math., vol. 1, no. 1, pp. 269-271, Dec. 1959.

[41] X. Zhao, H. Yang, and G. Zong, “Adaptive Neural Hierarchical Sliding Mode Control of Nonstrict-Feedback Nonlinear Systems and an Application to Electronic Circuits,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 47, no. 7, pp. 1394-1404, Jul. 2017.

[42] M. Chen, “Constrained Control Allocation for Overactuated Aircraft Using a Neurodynamic Model,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 46, no. 12, pp. 1630-1641, Dec. 2016.