
Master Thesis Report
Federated Averaging Deep Q-Network

A Distributed Deep Reinforcement Learning Algorithm

Sebastian Backstad
June 4, 2018

Student, Spring 2018
Master Thesis, 30 ECTS
Master of Science in Computing Science, 300 ECTS
Supervised by Mehmood Khan, Ph.D.


Abstract

In the telecom sector, a huge amount of rich data is generated every day. This trend will increase with the launch of 5G networks. Telco companies are interested in analyzing their data to shape and improve their core businesses. However, a number of limiting factors can prevent them from logging data to central data centers for analysis. Examples include data privacy, data transfer costs, and network latency.

In this work, we present a distributed Deep Reinforcement Learning (DRL) method called Federated Averaging Deep Q-Network (FADQN), which employs a distributed hierarchical reinforcement learning architecture. It utilizes gradient averaging to decrease communication cost. Privacy concerns are also addressed by training the agent locally and only sending aggregated information to the centralized server. We introduce two versions of FADQN: synchronous and asynchronous.

Results on the cart-pole environment show an 80-fold reduction in communication without any significant loss in performance. Additionally, in the asynchronous case, we see a great improvement in convergence.


Acknowledgements

I would like to thank Business Area Digital Services at Ericsson AB for proposing the project and allowing me to conduct this research. My greatest gratitude goes to my excellent supervisor, Mehmood Khan, Ph.D., who has supported me throughout this whole endeavor. Additional thanks are in order for my manager, Tobias Ley, as well as team members Wenfeng Hu, Vincent A. Huang, and many others. Thank you all for many interesting discussions and helpful seminars.


Contents

1 Introduction
  1.1 About the problem
  1.2 Hypothesis
  1.3 Problem scope

2 Machine Learning
  2.1 Supervised learning
  2.2 Unsupervised learning
  2.3 Reinforcement learning
    2.3.1 Policy
    2.3.2 Reward function
    2.3.3 Value function
  2.4 Deep Learning
    2.4.1 Neural Network
    2.4.2 Deep Q-Learning
  2.5 Distributed DRL
    2.5.1 Asynchronously Distributed Reinforcement Learning
    2.5.2 Synchronously Distributed Reinforcement Learning

3 Methodology
  3.1 RL Environment
    3.1.1 Cartpole V0
    3.1.2 Reward function
    3.1.3 Defining Convergence
    3.1.4 Agent traces
  3.2 Algorithms
    3.2.1 Parameters
    3.2.2 FADQN Architecture
    3.2.3 Asynchronous and Synchronous Federated Averaging DQN
    3.2.4 Decaying Epoch Number
  3.3 Weighting Gradients
    3.3.1 Staleness Weighting
    3.3.2 Normalized Score Weighting
    3.3.3 Augmented Score Weighting (statistical)
  3.4 Implementation Details
    3.4.1 Neural Network
    3.4.2 Experience Replay Memory
    3.4.3 Optimizer

4 Results
  4.1 Experiment Setup
  4.2 Results


5 Discussion and Conclusions
  5.1 Discussion
    5.1.1 Upper Limits on Number of Workers

6 Related and Future Work
  6.1 Related Work
    6.1.1 Optimization to Fed-Avg
    6.1.2 Distributed Reinforcement Learning Frameworks
    6.1.3 OpenAI Request for Research 2
  6.2 Cart Pole as a reference problem
    6.2.1 Improvements to the way we define convergence
  6.3 Optimization
    6.3.1 Decoupled Actors and Learners
    6.3.2 Backup Workers
    6.3.3 Security and encryption

Acronyms

A Appendix
  A.1 Colored Equations
    A.1.1 Staleness Weighting
    A.1.2 Normalized Score Weighting
    A.1.3 Augmented Score Weighting (statistical)
  A.2 Default Parameters


List of Figures

2.1 State-Action-Reward-State Cycle
2.2 Neural Network
2.3 The activation of a single neuron
2.4 Activation Functions

3.1 Cart Pole
3.2 FADQN Architecture

4.1 Convergence values for various values of target update
4.2 Communication rounds required for convergence
4.3 Comparison of median communication rounds in orders of magnitude (log10)
4.4 Convergence for various batch sizes
4.5 Comparison of the sync/async FADQN algorithms convergence

6.1 Coupled and decoupled actor-learner architectures

List of Tables

2.1 Activation Functions

3.1 File format used to store the result of a single run of the algorithm
3.2 FADQN Parameter description

A.1 FADQN default hyperparameters

List of Algorithms

1 Asynchronous Federated Averaging Deep Q-Network
2 Synchronous Federated Averaging Deep Q-Network


Chapter 1

Introduction

1.1 About the problem

The world is quickly getting more and more connected. With the launch of 5G networks, the available communication capacity will increase drastically. This is required to meet a demand that is increasing equally fast. Optimal control of telecom networks will be a crucial component in meeting this need. Accurate prediction of load, and intelligent resource allocation to handle that load, are and will continue to be very important components of network operations.

Network functions are being virtualized and put into data centers, where they can scale freely through the addition or removal of virtual machines. At the same time, network nodes are gaining more and more computing power, paving the way for edge computing and other interesting technologies.

Additionally, these networks will generate a lot of data, both in the form of network performance logs and user behaviour data. User data is an incredibly valuable resource and a major success factor for data-driven companies like Google. To be able to operate the networks optimally, this data needs to be processed and analyzed. This kind of data often contains sensitive information, which makes operators hesitant to share it, or even store it outside of country borders.

Another limiting factor can be the size of the data. In some cases it might not even be possible to send the huge amounts of data that a telecom network generates to a centralized server or data center: sending large amounts of data over the network and storing it can be expensive, and by the time the data is aggregated at the centralized server, it may already be obsolete.

Machine learning is an area that is generating a lot of interest. With machine learning models getting increasingly complex and requiring more and more computation, using distributed, scalable, and flexible models is the obvious next step. Distributed machine learning is usually performed in big data centers with near-instant communication between the distributed agents. This, however, will not be the case for many future distributed models.

As machine learning models get distributed to new platforms such as mobile base stations, smartphones [1], and Internet of Things (IoT) devices, they each bring new resource constraints and challenges. Different constraints can require vastly different methods; examples are the limited processing power of some IoT devices and the high cost of bandwidth on smartphones. The algorithms described in this thesis try to address one of these constraints by lowering communication costs at the expense of more local computing.

1.2 Hypothesis

The problem we are primarily trying to solve is to automatically optimize the operation of mobile networks using machine learning. Google has proven, through their data center optimization algorithm, that this is possible, and that it can have a huge impact on both the core business and the environment [2].

While Google trained their agent on old performance logs, the problem of autonomous optimization has many of the characteristics of a Markov decision process. This feature often makes Reinforcement Learning (RL) a valid strategy. Chapter 2 briefly describes some of the most popular machine learning methods and the theory behind RL (section 2.3). RL has had some big breakthroughs lately, such as AlphaGo Zero [3]. Despite these recent successes, it is still a very underrepresented field of machine learning. These are some of the reasons why we decided to explore solutions using RL.

By doing both data processing and training locally, we hope to solve not only the communication cost problem, but also the data size problem. This could be possible if some of the data generated by the environment is only used for training the machine learning model and does not need to be stored for further analysis. Additionally, by sharing aggregated features instead of direct updates, we also address the privacy concerns that arise when sharing customer data or otherwise sensitive information. It is possible to encrypt the training updates to make the system even more secure and privacy-preserving [4]; more on this in chapter 6.

We hope the proposed algorithm will be able to solve the problem using fewer communication rounds than a standard distributed algorithm (one that does not use averaging), while converging faster (or at least not slower) than a non-distributed version of the algorithm (only one agent). Another interesting aspect is how well it scales to a high number of agents; we discuss this briefly in chapter 5.

1.3 Problem scope

The problem is limited to a very simple environment. This decision was taken to make development easier and testing less time consuming. The environment is described in detail in section 3.1. Further analysis of the choice of environment and its drawbacks can be found in section 6.2.

The original scope only included an asynchronous version of the algorithm. This was extended to include a synchronous version as well, which allowed us to compare the two paradigms and gain additional insight into how the algorithm performs.


Chapter 2

Machine Learning

Machine learning is a field of computer science that gives computer systems the ability to "learn" (i.e. progressively improve performance on a specific task) with data, without being explicitly programmed.1

In the following sections, we briefly discuss the three main areas of machine learning, i.e. supervised learning, unsupervised learning, and reinforcement learning. Sections 2.1 and 2.2 give a brief overview of supervised and unsupervised learning, while section 2.3 provides a more detailed overview of reinforcement learning, which is the main focus of this thesis.

2.1 Supervised learning

Supervised learning is the machine learning task of inferring a function from supervised training data. Supervised, in this context, means that each training sample consists of both an input object and the desired output value. This means that a labeled data set is required to train a supervised model. The dependency on labeled data is one of the biggest drawbacks of supervised learning.

We will not go any further into the different methods and algorithms of supervised learning since they are not very relevant to this thesis. The exception is the following distributed supervised learning algorithm.

FedAvg

Federated Averaging (FedAvg) [1] is a distributed supervised learning algorithm that focuses on reducing the amount of network communication needed to train by doing more computation locally. First, the workers train on their local data sets multiple times. Before sending the results to the parameter server (which holds the global model), each worker compiles its new knowledge into an average. This reduces the amount of communication between the workers and the parameter server. Drastically reduced communication cost is a desirable property in many contexts, including much of what was described in chapter 1. Distributed models are covered more in depth in section 2.5.
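To make the mechanics concrete, the following is a minimal NumPy sketch of a FedAvg-style communication round on a toy linear-regression problem. The shard layout, learning rate, and function names are illustrative assumptions, not the exact procedure of [1].

```python
import numpy as np

def local_gradient(w, X, y):
    """Gradient of the mean squared error for a toy linear model y ~ X @ w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

def local_update(w_global, X, y, epochs, lr=0.05):
    """Each worker copies the global weights and trains on its own shard
    for several epochs before reporting back."""
    w = w_global.copy()
    for _ in range(epochs):
        w -= lr * local_gradient(w, X, y)
    return w

def fedavg_round(w_global, shards, epochs=5):
    """One communication round: local training on every shard,
    then a weighted average of the resulting models on the server."""
    n_total = sum(len(y) for _, y in shards)
    new_w = np.zeros_like(w_global)
    for X, y in shards:
        new_w += (len(y) / n_total) * local_update(w_global, X, y, epochs)
    return new_w

# Toy usage: three "workers", each holding a private shard of a linear problem.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
shards = []
for _ in range(3):
    X = rng.normal(size=(50, 2))
    shards.append((X, X @ true_w + 0.1 * rng.normal(size=50)))

w = np.zeros(2)
for _ in range(20):
    w = fedavg_round(w, shards, epochs=5)
print(w)  # approaches [2, -1] after only 20 communication rounds
```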

2.2 Unsupervised learning

Unsupervised learning is focused on having an agent infer a function from unlabeled data. This means that the samples in the data set don't have a "correct answer" attached to them. Rather, the algorithm has to use other means to estimate how good a guess actually is.

A recent and very promising example of unsupervised learning is Generative Adversarial Networks (GAN) [5].

1Wikipedia


2.3 Reinforcement learning

Reinforcement Learning (RL) is a machine learning technique that is mainly used to solve problems that can be described as a Markov decision process. This type of problem is considered solved when the agent finds an optimal policy, i.e. it takes the best action in every possible state. The basic idea behind reinforcement learning is that an agent evaluates the current state of the environment, selects the action that will give the highest expected reward, and performs said action in the environment. When acting in the environment, the agent receives a reward and is put in another state, at which point the process starts over. This means that RL agents learn from their own experience through trial and error rather than from preexisting data.


Figure 2.1: State-Action-Reward-State Cycle

In a typical RL problem, the current state of the agent depends on what actions were performed in the previous states. Taking the action which will give the "highest expected reward" usually refers to the action that puts the agent in the state with the highest potential to gain rewards in the future. Figure 2.1 shows the agent moving between states using actions and collecting rewards at every step. This process is often called SARS.

For a more complete overview of different kinds of reinforcement learning and its applications, we refer to Deep Reinforcement Learning: An Overview [6].

2.3.1 Policy

A policy (π) is what the agent uses to choose an action. It is a function that takes the current state of the environment (s) and returns the action that the agent thinks is optimal. It is a mapping from every possible state to an action.

π(s) : S → A

2.3.2 Reward function

The reward function is used to determine the reward for a certain state. The goal of an RL agent is to maximize its total accumulated reward. This is usually achieved by finding and staying in states which grant a high reward.

R(s) : S → ℝ

Choosing a good reward function is very important for the performance of RL algorithms. It can sometimes be challenging to even come up with a reward function if the goal isn't clearly quantifiable. One such example is the game of Go: it is very difficult to look at an unfinished game of Go and accurately predict how many points each player has.

2.3.3 Value function

A value function is used to determine the value of being in a certain state. Value is always counted in the amount of reward the agent will be able to collect. The immediate reward available from a state is often easy to calculate, but as we progress through the SARS cycles, more and more uncertainty is added. This is where the discount factor comes into play. For every step k we take into the future, we discount the reward we would supposedly gain from that state by a factor γ^k. A small value of γ means the agent will focus on short-term rewards, while a big value means a focus on future potential rewards. A typical value of γ is somewhere around 0.99.

Given the policy π, the value function V^π(s) calculates the expected accumulated discounted future reward for an agent that starts at state s and acts according to policy π.
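As a small worked example, the discounted return for a short reward trace can be computed directly from the definition above; the numbers are purely illustrative.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of future rewards, each discounted by gamma**k for k steps ahead."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Ten steps of reward 1 are worth a bit less than 10 once discounting is applied.
print(discounted_return([1.0] * 10))              # ~9.56 with gamma = 0.99
print(discounted_return([1.0] * 10, gamma=0.5))   # ~2.00: a short-sighted agent
```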

2.4 Deep Learning

Deep learning refers to just about any machine learning technique in which a deep neural network is used as a general function approximator.

2.4.1 Neural Network

A neural network is a universal nonlinear function approximator: it takes any value from its input space and will always return a value from its output space. Figure 2.2 shows a typical deep neural network and some of its main components. In essence, such networks are made up of inputs, weights, biases, hidden/intermediate neurons, and outputs.

Input is just the entrance point to the function. The only job of an input node is to send its input value to all the neurons that are connected to it.

Weights are trainable variables that each hold a single floating-point value. They are the "memory" of a neural network, and their only job is to multiply any value that passes through them with their own stored value.

Biases are constant inputs that have their own weights and act like input nodes. They help make the network more flexible by adding a trainable constant to all the variable inputs.

[Diagram: a fully connected network with four input nodes (s1–s4), one hidden layer, and two output nodes (a1, a2), labeled Input Layer, Hidden Layer, and Output Layer.]

Figure 2.2: Neural Network

Neurons are simple intermediate units that collect all their inputs, apply an activation function to them, and send the result of that activation out to all of their outputs. Figure 2.3 shows a neuron with four inputs and four outputs. It multiplies every input with that input's corresponding weight and then sums the results into a single number. That number is passed to the activation function, which outputs another (usually simpler) number. The output of the activation function is sent out as the neuron's output.

Some of the most popular activation functions are graphed in figure 2.4, and their mathematical definitions can be seen in table 2.1.

Output is the final step of the process, where the output values are converted to an appropriate format (e.g. a probability distribution) and then returned to the user.

5

Page 12: Master Thesis Report - umu.diva-portal.orgumu.diva-portal.org/smash/get/diva2:1223266/FULLTEXT01.pdf · the distributed agents. This, however, will not be the case for many future

2.4. DEEP LEARNING CHAPTER 2. MACHINE LEARNING

[Diagram: a single neuron with inputs x0, …, xn and corresponding weights w0, …, wn; the neuron computes f(Σ_{i=0}^{n} w_i · x_i) and sends the result to its outputs.]

Figure 2.3: The activation of a single neuron

[Plots of the Step, Sigmoid, Tanh, ReLU, LReLU, and ELU activation functions.]

Figure 2.4: Activation Functions

Binary Step   f(x) = 0 for x < 0;  1 for x ≥ 0
Sigmoid       f(x) = 1 / (1 + e^(−x))
TanH          f(x) = tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))
ReLU          f(x) = 0 for x < 0;  x for x ≥ 0
LReLU         f(α, x) = αx for x < 0;  x for x ≥ 0
ELU           f(α, x) = α(e^x − 1) for x < 0;  x for x ≥ 0

Table 2.1: Activation Functions
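The neuron computation of figure 2.3 and a few of the activations in table 2.1 can be written directly in NumPy; this is an illustrative sketch, not code from the thesis implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def elu(z, alpha=1.0):
    return np.where(z < 0.0, alpha * (np.exp(z) - 1.0), z)

def neuron(x, w, b=0.0, activation=np.tanh):
    """A single neuron: weighted sum of the inputs plus a bias,
    passed through an activation function (cf. figure 2.3)."""
    return activation(np.dot(w, x) + b)

x = np.array([0.5, -1.2, 3.0, 0.1])   # four inputs, e.g. a Cart Pole state
w = np.array([0.2, 0.4, -0.1, 0.7])   # one trainable weight per input
print(neuron(x, w, b=0.05))           # tanh activation, as used in section 3.4.1
```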


2.4.2 Deep Q-Learning

What sets Reinforcement Learning (RL) and Deep Reinforcement Learning (DRL) apart is that DRL uses a (deep) neural network as the function approximator for the action-value function (Q-function).

DQN

All information in this section is taken directly from the original Deep Q-Network paper by Mnih et al., 2013 [7]. For more in-depth explanations, please refer to that paper.

First, we make the assumption that future rewards are discounted by a factor of γ per timestep. The future discounted reward at time t is then R_t = Σ_{t′=t}^{T} γ^(t′−t) r_{t′}, where T is the timestep at which the game ends.

The optimal action-value function Q∗(s, a) is the maximum expected reward achievable by following any strategy after being in state s and taking some action a, Q∗(s, a) = max_π E[R_t | s_t = s, a_t = a, π], where π is a policy mapping states to actions (or distributions over actions).

The optimal action-value function obeys an important identity known as the Bellman equation. This is based on the following intuition: if the optimal value Q∗(s′, a′) of the state s′ at the next timestep was known for all possible actions a′, then the optimal strategy is to select the action a′ maximizing the expected value of r + γQ∗(s′, a′),

    Q∗(s, a) = E_{s′∼D}[ r + γ max_{a′} Q∗(s′, a′) | s, a ]    (2.1)

We approximate the action-value function using a neural network function approximator with weights θ as a Q-network. Our new action-value function then becomes Q(s, a; θ) ≈ Q∗(s, a). This Q-network can be trained by minimizing a sequence of loss functions L_i(θ_i) that change with each iteration i,

    L_i(θ_i) = E_{s,a∼p(·)}[ (y_i − Q(s, a; θ_i))² ]    (2.2)

where y_i = E_{s′∼D}[ r + γ max_{a′} Q(s′, a′; θ_{i−1}) | s, a ] is the target for iteration i and p(s, a) is a probability distribution over states s and actions a that we refer to as the behaviour distribution. The parameters from the previous iteration, θ_{i−1}, are held fixed when optimizing the loss function L_i(θ_i). Differentiating the loss function with respect to the weights, we arrive at the following gradient,

    ∇_{θ_i} L_i(θ_i) = E_{s,a∼p(·); s′∼D}[ (r + γ max_{a′} Q(s′, a′; θ_{i−1}) − Q(s, a; θ_i)) ∇_{θ_i} Q(s, a; θ_i) ]    (2.3)

In practice, this gradient is estimated on sampled mini-batches and the loss function is optimized using Stochastic Gradient Descent (SGD).
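On a sampled mini-batch, the targets of equation (2.2) and the squared Bellman error reduce to a few array operations. The sketch below uses NumPy and made-up Q-values purely for illustration; in practice the Q-values come from the Q-network.

```python
import numpy as np

def td_targets(rewards, next_q_values, terminal, gamma=0.99):
    """y_i from equation (2.2): r for terminal transitions, otherwise
    r + gamma * max_a' Q(s', a'; theta_{i-1}) from the frozen target network."""
    return rewards + gamma * (1.0 - terminal) * next_q_values.max(axis=1)

def dqn_loss(q_chosen, targets):
    """Mean squared Bellman error, the quantity minimized by SGD."""
    return np.mean((targets - q_chosen) ** 2)

# Toy mini-batch of four transitions in a two-action environment.
rewards  = np.array([1.0, 1.0, 1.0, 0.0])
terminal = np.array([0.0, 0.0, 0.0, 1.0])
next_q   = np.array([[0.9, 1.2], [0.4, 0.8], [1.5, 1.1], [0.0, 0.0]])  # Q(s', ·; θ_{i-1})
q_chosen = np.array([1.1, 0.7, 1.4, 0.2])                              # Q(s, a; θ_i)
print(dqn_loss(q_chosen, td_targets(rewards, next_q, terminal)))
```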

2.5 Distributed DRL

When talking about distributed computing in general, there are two major paradigms: synchronous (sync) and asynchronous (async). These have vastly different properties, strengths, and weaknesses, which will be explained in the following sections. We will then discuss some of the most successful algorithms from both categories and see how they capitalize on their strengths and mitigate their weaknesses.

For a system to be distributed, the agents must have some way of communicating with each other. This is usually done through sharing of parameters or, in the case of a shared environment, through actions.

• Shared parameters, separate environments

– The standard way to do distributed RL


– Every agent tries to optimize the global parameters by interacting with separate copies of the same environment

– Greatly increases both exploration and training speed

• Separate parameters, same environment

– Cooperative or competitive agents that are learning to help or defeat each other in a shared environment

• Shared parameters, same environment

– Very common in self-play (DotA 2 / AlphaGo Zero [3]), where an agent plays against itself to learn multiple things at once

2.5.1 Asynchronously Distributed Reinforcement Learning

Most asynchronously distributed reinforcement learning algorithms are centered around having multiple actors separately exploring their own environments, while asynchronously updating a global set of parameters. This allows them to be easily scalable and relatively fast. Asynchronous algorithms often suffer from stale (old) gradients, which makes synchronous algorithms superior except in special cases [8].

A3C

Asynchronous Advantage Actor-Critic (A3C) is an algorithm published by Google DeepMind in 2016 [9]. It is able to learn many of the Atari benchmark games very fast on inexpensive CPU hardware. This is achieved by being highly parallel, using up to 16 (or 32) workers on a single CPU.

Gorila

General Reinforcement Learning Architecture (Gorila) [10] is an asynchronous framework for large-scale distributed reinforcement learning. It uses a shared replay memory between all workers, and it separates workers into those that only generate experience by acting in the environment and those that only train by sampling from the replay memory.

2.5.2 Synchronously Distributed Reinforcement Learning

Sync-Opt

Synchronous Stochastic Optimization (Sync-Opt) [8] tries to solve the problem of slow, straggling workers by only waiting for a set number of workers and hence dropping the slowest few.

A2C

[Synchronous] Advantage Actor-Critic (A2C), also known as Parallel Advantage Actor-Critic (PAAC) [11], is a modified version of the famous A3C. It works the same way, with the exception that it synchronizes all agents between rounds. OpenAI claims in their blog post2 that synchronous A2C outperforms A3C.

DPPO

Proximal Policy Optimization (PPO) [12] is a policy gradient algorithm that focuses on making constrained and very stable changes to its policy. Its performance is so good that OpenAI has started using it as their go-to algorithm. Distributed PPO (DPPO) is simply a distributed version of this popular algorithm.

2https://blog.openai.com/baselines-acktr-a2c/ (Oct 2017)


IMPALA

Importance Weighted Actor-Learner Architecture (IMPALA) [13] is a new and very promising algorithm that combines features of many of the aforementioned algorithms.

It is tested on the new benchmark DMLab-30, a suite of 30 challenging new environments added to DeepMind Lab [14].


Chapter 3

Methodology

3.1 RL Environment

3.1.1 Cartpole V0

Cart Pole [15] (or inverted pendulum) is a classic problem in control theory. This simple problem consists of a pole attached to a cart that can move in two dimensions. The cart sits on a flat surface and can move only left or right. The pole is to be balanced on top of the cart. Figure 3.1 shows the setup and defines the parameters. The game (episode) ends if one of the following three conditions is met:

1. Failure - The cart moves too far away from the center (|x| > xmax)

2. Failure - The pole falls down (angle from upright position gets too large) (|ρ| > ρmax)

3. Success - A set amount of time (timesteps or frames) passes (t > tmax)

OpenAI has implemented this problem as an environment in their reinforcement learning playground OpenAI Gym1, with the limits xmax = 2.4, ρmax = 15◦, and tmax = 200.


Figure 3.1: Cart Pole

1https://gym.openai.com/envs/CartPole-v0/ (May 2018)


State Space

The state space of this problem has four dimensions. Together, these values make up a state; performing an action progresses the environment to a new state.

x  = horizontal distance
Δx = horizontal velocity
ρ  = pole angle
Δρ = pole angular velocity

Action Space

The action space of this problem has size 2. The agent returns a softmax probability distribution over the two values. Each number represents the probability that that action is the "optimal choice", and they naturally sum to 1 (i.e. 100%).

Δx+ = positive force (push right)
Δx− = negative force (push left)

This action space is fairly small compared to many other environments, but it is not far off from how a real application would look. In the example of autonomous scaling, the actions could represent "scale up" and "scale down".
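A minimal random-policy episode against this environment looks as follows; the snippet assumes the classic `gym` API in use at the time (a reset that returns only the observation and a step that returns four values).

```python
import gym

env = gym.make("CartPole-v0")
state = env.reset()                  # [x, Δx, ρ, Δρ], the four-dimensional state
total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()           # 0 = push left, 1 = push right
    state, reward, done, _ = env.step(action)    # reward is 1 per surviving time-step
    total_reward += reward
print(total_reward)   # a random policy rarely survives more than a few dozen steps
```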

3.1.2 Reward function

The reward function for Cart Pole is very simple. A reward of 1 is given for every time-step where neither of the two conditions (cart position and pole angle) is violated. A reward of 0 is given whenever at least one of the conditions is violated. Since the episode ends upon failure, the reward will always be 1 except for the last time-step in case of failure. A reward function this simple holds almost no information about how good the state actually is, only whether it is illegal or not. A hand-crafted reward function would probably give a higher reward for being close to the center and for having a small angle on the pole. That kind of more expressive reward function would likely result in much faster and better learning.

Bias in the reward function

The choice of reward function will always introduce some form of bias into the system. If we look at the example above, the simple binary reward function makes no assumptions about how the game "should" be played and lets the agent learn from scratch. Even though this might seem unbiased, it will actually introduce a form of technical bias. This kind of bias can, for instance, come from how the environment is designed.

A good example of technical bias in this implementation of the Cart Pole environment comes from the episode length (200 timesteps). Since the agent is only penalized (receiving a reward of 0 instead of 1) when it violates a constraint (position or pole angle), it has no incentive to improve beyond the point of just barely avoiding failure. We can sometimes see that the agent "stops trying" once it has passed approximately 180 timesteps, because the pole does not have enough time to fall before the episode ends successfully.

On the other side of the spectrum we have the human bias introduced when making a hand-crafted reward function. Not only does it require a high level of domain-specific knowledge to create; it also assumes that the human designing the function actually knows how the value of these specific features should be quantified. This kind of human bias will often result in a faster learning curve at the cost of potentially limiting the agent's "creativity". This trade-off can certainly be valuable in many circumstances, but one should always take into consideration all the different biases introduced into a system.


3.1.3 Defining Convergence

OpenAI Gym defines convergence of its "CartPole-v0" environment as the episode where the average reward of the last 100 episodes is equal to or greater than 195. This 97.5% limit might seem strict, but it is actually quite reasonable considering how easy the environment is.

CartPole-v0 defines "solving" as getting an average reward of 195.0 over 100 consecutive trials.

The OpenAI Baselines [16] implementation of DQN was used as a template and baseline in our experiments. It assumes an even stricter limit, requiring an average reward of 200 per episode (100%).

When developing a distributed version of this implementation, we kept the strict 100% requirement but consider the environment solved (converged) as soon as at least one worker meets the criterion.
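A rolling-average check of this kind can be implemented in a few lines; the class name and threshold handling below are illustrative, not the Baselines code.

```python
from collections import deque

class ConvergenceTracker:
    """Rolling average of the last 100 episode rewards; reports convergence when
    the average reaches the threshold (195 for Gym's definition, 200 for the
    stricter criterion used here)."""
    def __init__(self, threshold=200.0, window=100):
        self.threshold = threshold
        self.rewards = deque(maxlen=window)

    def update(self, episode_reward):
        self.rewards.append(episode_reward)
        window_full = len(self.rewards) == self.rewards.maxlen
        return window_full and sum(self.rewards) / len(self.rewards) >= self.threshold

tracker = ConvergenceTracker(threshold=200.0)
episode_rewards = [180.0] * 40 + [200.0] * 120       # placeholder trace for one worker
solved_at = next((i for i, r in enumerate(episode_rewards) if tracker.update(r)), None)
print(solved_at)   # first episode index at which the strict criterion is met
```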

Section 6.2.1 briefly discusses some ways to improve the way convergence is defined, to make learning the environment more stable and efficient.

3.1.4 Agent traces

After every completed episode, the agent logs its results to a csv-file. Once all agents are done, the files are combined into a single csv-file. Using this file, averages and other measurements can be calculated. The format of the file is described in table 3.1.

Episode | Reward_w | Avg Reward_w                | Global T_w | Com Round_w | ... | Avg Reward
k       | r_k      | (1/100) Σ_{i=k−100}^{k} r_i | T_k        | cr_k        | ... | (1/w) Σ_{i=1}^{w} r_{w,i}

Table 3.1: File format used to store the result of a single run of the algorithm

Data from the workers is stored side by side in w sets of columns. Every row (k) represents an episode. The first average reward is the worker's rolling average over the last 100 episodes (used to determine convergence). The final average reward is the momentary reward for that episode, averaged over all workers.
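A possible way to combine the per-worker traces and compute the two averages of table 3.1 is sketched below with pandas; the file and column names are assumptions, not the exact format used in the experiments.

```python
import glob
import pandas as pd

# Hypothetical per-worker files worker_0.csv, worker_1.csv, ... with columns
# episode, reward, global_t and com_round logged after every episode.
frames = [pd.read_csv(path).assign(worker=i)
          for i, path in enumerate(sorted(glob.glob("worker_*.csv")))]
runs = pd.concat(frames, ignore_index=True)

# Rolling average of the last 100 episodes per worker (the convergence signal) ...
runs["avg_reward_100"] = (runs.sort_values("episode")
                              .groupby("worker")["reward"]
                              .transform(lambda r: r.rolling(100).mean()))

# ... and the per-episode reward averaged over all workers (last column of table 3.1).
avg_over_workers = runs.groupby("episode")["reward"].mean()
```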

3.2 Algorithms

In this section we propose a new algorithm called Federated Averaging Deep Q-Network (FADQN). Both asynchronous and synchronous versions of the algorithm were developed. The algorithm takes inspiration mainly from Federated Averaging [1] and A3C [9], but uses a Deep Q-Network (DQN) [7] with experience replay as the base algorithm. The synchronous algorithm is also able to utilize several different schemes for weighting the gradients it averages (section 3.3).

3.2.1 Parameters

The parameters used in the description of the algorithm can be found in table 3.2. Other technical parameters are described in appendix A.2.

Epochs, Episodes, and Communication Rounds

An epoch is one round of training. It results in a single gradient, which is applied (added) to the parameters (weights). FedAvg [1] uses a full pass through its local data set to calculate one gradient, and repeats this E (epochs) times to create an averaged gradient. In the FADQN case, one epoch means drawing b samples from the local experience replay memory (D) and using those samples to calculate one gradient.


Param | Description
E     | Epochs per communication round
b     | Mini-batch size
t     | Local time-step
T     | Global time-step
θ     | Global parameters
θ′    | Local copy of parameters
N     | Target network update frequency
c     | Communication rounds
D     | Local experience replay memory

Table 3.2: FADQN parameter description

A communication round is the average of at least one epoch. The number of epochs making up a communication round can be constant, scaling, or dependent on the size of the data set (like FedAvg). In FADQN we use a scaling size for the communication rounds, where it starts out as a high number that slowly decreases as training progresses.

An episode always runs from start (t0) to finish (one of the end conditions is met). It is thus completely separate from both epochs and communication rounds.

3.2.2 FADQN Architecture

The FADQN architecture is, like most distributed machine learning architectures, composed of at least one parameter server and several workers. Figure 3.2 is a visual representation of the parts that make up the FADQN architecture and how they communicate.

The parameter server is responsible for keeping the global parameter set and sending a copy of the parameters to the workers whenever needed. It also collects the averaged gradients from the workers and updates the global parameters using those values.

Most of the inner workings of the workers remain unchanged from the OpenAI Baselines implementation. That includes the agent's interaction with the environment, storage and sampling from the replay memory, calculating loss, computing gradients, etc.

In addition to making the architecture distributed, we have added several features on top of the workers, which include: averaging of gradients using multiple different methods, all functions required to synchronize the training (in the synchronous case), and a more advanced global time-step counter to ensure correct asynchronous execution (in the asynchronous case).

3.2.3 Asynchronous and Synchronous Federated Averaging DQN

The two algorithms are defined in algorithms 1 and 2. The pseudocode should be fairly self-explanatory to anyone who is familiar with DQN. The averaging performed on the last line of each algorithm is explained in detail in section 3.3.

3.2.4 Decaying Epoch Number

By taking the average, we lose some precision. This intuition is partially true for the averaging done in the FADQN algorithm. For a single worker, there is no "precision" lost when averaging gradients, but with multiple workers averaging hundreds of gradients every communication round, some precision is lost. If one of the workers takes 100 steps in one direction while another takes 100 steps in the opposite direction, they will essentially cancel each other out.

This is fine in the beginning, because the global policy is so bad that we just want to explore as much of the environment as possible, as fast as possible. But towards the end of training, when the global parameters almost represent the optimal policy, we want to be careful not to make changes that are too big.


Figure 3.2: FADQN Architecture

Algorithm 1 Asynchronous Federated Averaging Deep Q-Network
Require: Global shared parameter vector θ
Require: Agent-specific parameter vectors θ′ and θ′_target
Require: Global shared counter T = 0 and local counter t
Require: Local experience replay memory D
 1: repeat                                            ▷ Start of new communication round
 2:   θ′ ← θ
 3:   T_old ← T
 4:   t ← 1
 5:   for t ≤ E do
 6:     t ← t + 1
 7:     Act in environment using θ′
 8:     Save (s_t, a_t, r_t, s′_t) to D
 9:     Sample random mini-batch of (s_j, a_j, r_j, s_{j+1}) from D
10:     y_j ← r_j if terminal, otherwise r_j + γ max_{a′} Q(s_{j+1}, a′; θ′_target)
11:     Perform gradient descent step on (y_j − Q(s_j, a_j; θ′))² w.r.t. θ′
12:   end for
13:   t_tot ← t_tot + t
14:   T ← T + t
15:   θ ← θ + (t / (T − T_old)) · Σ_{i=1}^{E} g_i(θ′)
16: until T ≥ T_max


Algorithm 2 Synchronous Federated Averaging Deep Q-Network
Require: Global shared parameter vector θ
Require: Agent-specific parameter vectors θ′ and θ′_target
Require: Local experience replay memory D
 1: repeat                                            ▷ Start of new communication round
 2:   θ′ ← θ
 3:   t ← 1
 4:   for t ≤ E do
 5:     t ← t + 1
 6:     Act in environment using θ′
 7:     Save (s_t, a_t, r_t, s′_t) to D
 8:     Sample random mini-batch of (s_j, a_j, r_j, s_{j+1}) from D
 9:     y_j ← r_j if terminal, otherwise r_j + γ max_{a′} Q(s_{j+1}, a′; θ′_target)
10:     Perform gradient descent step on (y_j − Q(s_j, a_j; θ′))² w.r.t. θ′
11:   end for
12:   Wait for other workers
13:   θ ← θ + (1/N) · Σ_{i=1}^{E} g_i(θ′)
14: until one of the N workers converges

For this reason we implemented a decaying epoch number. The number of epochs to average over is high to begin with, but is slowly reduced as training progresses. This helps stabilize the training in the later stages.
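One simple way to realize this is an exponentially decaying schedule for E, as sketched below; the start value, floor, and decay rate are illustrative, not the constants used in the experiments.

```python
def epochs_for_round(round_idx, e_start=100, e_min=10, decay=0.97):
    """Many epochs per communication round early on (coarse, exploratory averages),
    fewer later so the near-optimal global policy only receives careful updates."""
    return max(e_min, int(e_start * decay ** round_idx))

print([epochs_for_round(c) for c in (0, 10, 50, 100)])   # e.g. [100, 73, 21, 10]
```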

3.3 Weighting Gradients

While gradient averaging is not a new concept, it recently got a lot of attention with the release of Google's FedAvg paper [1]. As described in the paper, for every communication round each worker's gradients are weighted depending on their individual contribution to the final result.

The following sections describe how the gradients are averaged. We first explain the weighting used in the asynchronous case, followed by the synchronous case. Two additional approaches are also explained, but they were never properly tested and are thus not included in the results. For easier understanding, colored versions of these equations and descriptions are provided in appendix A.1.

3.3.1 Staleness Weighting

Asynchronous Staleness Weighting

    θ_{c+1} ← θ_c + (E / (T_new − T_old)) Σ_{i=1}^{E} g_i(θ′)    (3.1)

The global parameters for the next communication round are calculated by first taking the sum of the gradients from every epoch with respect to the local parameters. The gradient sum is then scaled by a factor representing the staleness of the gradient: the number of epochs divided by the difference in global time steps between when the local computation started and when it ended. The averaged gradient is finally added to the global parameters of the current communication round.
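The server-side update of equation (3.1) is a one-liner once the worker reports its summed gradient together with the global step counts; the NumPy sketch below is illustrative and not the thesis implementation.

```python
import numpy as np

def apply_async_update(theta, gradient_sum, epochs, t_new, t_old):
    """Equation (3.1): scale the worker's summed gradient by its staleness factor
    (epochs computed / global steps elapsed) before adding it to the global
    parameters, so stale contributions are automatically damped."""
    staleness_weight = epochs / float(t_new - t_old)
    return theta + staleness_weight * gradient_sum

theta = np.zeros(4)
g_sum = np.ones(4)                       # stands in for the sum of g_i(theta')
print(apply_async_update(theta, g_sum, epochs=100, t_new=450, t_old=100))
# 350 global steps elapsed while the worker computed, so the update is scaled by ~0.29.
```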


Synchronous Staleness Weighting

    θ_{c+1} ← θ_c + Σ_{w=1}^{N} (1/N) Σ_{i=1}^{E} g_{w,i}(θ′)    (3.2)

This is a special case of the asynchronous version where most parts are identical, with two exceptions. The staleness factor has been replaced by a constant value based on the total number of agents participating in the training. The second difference is the additional sum, which adds the averaged gradients of all agents.

This sum is only present in the synchronous version, because the parameter server waits for all workers to complete before adding the gradient, which implicitly creates the additional sum. In contrast, an asynchronous worker just adds its own averaged gradient, without any regard for what the other workers are doing.

3.3.2 Normalized Score Weighting

    θ_{c+1} ← θ_c + Σ_{i=1}^{E} ((X_i − min) / (max − min)) g_i(θ′)    (3.3)

The idea behind this method is to scale the size of each worker's gradient to match its relative reward. To get the normalized reward, we take the reward of worker i and subtract the lowest reward registered among all workers this communication round; we then divide this by the difference between the smallest and the largest reward registered in this communication round.

We tried implementing this, but it does not work with environments that have a maximum score/reward that is commonly reached, like cart pole (max 200). This would have to be tweaked to account for the different special cases.

3.3.3 Augmented Score Weighting (statistical)

    β_{i,0} = 1/n

    ProbScore_r = β_{i,r} · (X_i − min_r) / (max_r − min_r)

    β_{i,r+1} ← β_{i,r} + (ProbScore_r − ProbScore_{r−1})    (3.4)

The weighting factor for the first round starts out as the average. We then create a probability score for round r by multiplying the weighting factor of worker i with worker i's normalized reward for round r (as in the normalized score weighting). Finally, to get worker i's weighting factor for the next round, we add the difference in probability score from the last round to the weighting factor of this round.

3.4 Implementation Details

3.4.1 Neural Network

This implementation of FADQN [17] utilizes a fully connected deep neural network with two weight layers to perform its function approximation. A single hidden layer of 64 neurons connects the input layer to the output layer, and the neurons in the hidden layer use Tanh as their activation function. Figure 2.2 in section 2.4 about deep learning is a fairly accurate, but simplified, depiction of this specific neural network.


Input:    4
Weights1: 4 × 64 = 256
Hidden:   64
Weights2: 64 × 2 = 128
Output:   2

As we can see from the above calculation, a 4-64-2 fully connected neural network has a total of 384 trainable weights (plus the bias terms of the hidden and output layers).
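For reference, an equivalent network can be written in a few lines with tf.keras; the thesis implementation uses plain TensorFlow, so this is only a hedged reconstruction of the same 4-64-2 architecture.

```python
import tensorflow as tf

def build_q_network():
    """4 state inputs -> 64 tanh hidden units -> 2 linear Q-values (one per action)."""
    return tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="tanh", input_shape=(4,)),
        tf.keras.layers.Dense(2),
    ])

q_net = build_q_network()
q_net.compile(optimizer=tf.keras.optimizers.Adam(), loss="mse")  # ADAM + MSE, as in 3.4.3
print(q_net.count_params())   # 450 parameters in total: 384 weights + 66 biases
```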

3.4.2 Experience Replay Memory

The experience replay memory is initialized to a specific size, which never changes after that. It starts out empty, and the agent acts randomly for 1000 timesteps to add some data to the replay memory. Only after these initial samples have been added can the agent start actually training by sampling from its past experience.
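A minimal replay memory of the kind described could look as follows; the capacity, warm-up length, and method names are assumptions for illustration.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity buffer of (s, a, r, s', done) transitions with uniform sampling."""
    def __init__(self, capacity=50_000):
        self.buffer = deque(maxlen=capacity)   # old transitions are dropped automatically

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def ready(self, warmup=1000):
        """Training only starts once the initial random steps have been collected."""
        return len(self.buffer) >= warmup
```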

3.4.3 Optimizer

To train the neural network we use the ADAM optimizer [18] with default settings to perform Stochastic Gradient Descent (SGD).


Chapter 4

Results

4.1 Experiment Setup

The default parameters used for the experiments are defined in appendix A.2. Any differences and the varying parameters are stated explicitly. The environment used for all experiments is the default OpenAI Gym implementation of Cart Pole [15].

The experiments were primarily focused on the synchronous algorithm, since synchronous methods have been shown to perform better [8]. Some experiments and baselines were performed with the asynchronous version to test, and possibly confirm, this claim.

Most tests were conducted by locking all but one or two parameters and then running the algorithm to convergence ten times for every distinct value of the variable parameters. We always show the average or median of the best seven results (depending on the experiment). In some cases outliers like max and min values are also plotted.

The results are compared to an appropriate baseline made up of either a non-distributed (out-of-the-box) version of OpenAI's DQN, or a distributed version of FADQN that does not use any averaging (i.e. a plain distributed DQN). The FADQN baseline includes both synchronous and asynchronous versions of the algorithm.

4.2 Results

In this section we discuss the results of the experiments described in section 4.1.

Target Network Update Frequency

The first experiment is aimed at finding the optimal value for the target network update frequency. The (non-distributed) baseline has this value set to 1000, which is far too high. In the first experiment we decrement this value in steps of 100 until it is so low that the algorithm diverges. Figure 4.1 shows the convergence time (in episodes) for the different target update settings. This experiment was only conducted on the synchronous version of the algorithm.

The baseline seems to be less affected by the change in target update, which might be the reason why OpenAI chose such a high value for it. Both versions did, however, diverge when the target update was set to 100 or less.

As described in the section about deep Q-learning (2.4.2), the loss of the prediction is calculated by taking the mean squared error (MSE) between the Q-values of the main parameters and the target parameters. If the target Q-network is updated too frequently, the two sets of parameters (θ and θ−) will tend to the same value. This essentially nullifies the training stability gained by using separate parameters for the target and original Q-networks.

Our network is sensitive to this high variance and diverges, as seen in figure 4.1 when "Target Update" is set to 200 or less.


Figure 4.1: Convergence values for various values of target update

Communication Rounds

While conducting the above experiment, we also collected data on how many communication rounds every run required to converge (data storage is described in table 3.1). These results are shown in figure 4.2. While the two graphs have the same label on the y-axis (communication rounds), they have very different values. This becomes apparent when we plot them on a logarithmic y-axis, as in figure 4.3.

Aggregating only one epoch per communication round instead of 100 consistently requires about two orders of magnitude more communication. This might seem obvious, since we aggregate 100 gradients instead of sending every single one, but it does show that averaging this many gradients to reduce communication costs has little to no detrimental effect on the performance of the algorithm.


Figure 4.2: Communication rounds required for convergence

Figure 4.3: Comparison of median communication rounds in orders of magnitude (log10)


Varying Mini-batch Sizes

The following experiment tries using smaller mini-batch sizes. There is a trend towards using bigger mini-batch sizes [19], but this experiment shows that, in this case, smaller mini-batch sizes seem to lead to faster convergence. We are not really sure why this happens. It could be a case of overfitting, since new data is added to the experience replay memory fairly slowly and we sample over 100 times for every point of experience added. This is, however, only speculation; clearly more research is needed.

Figure 4.4: Convergence for various batch sizes

Adding very accurate gradients (many synchronous workers' averaged gradients) still only gives a marginal convergence increase over the single-worker case. This could be a result of the Cart Pole problem's simplicity.


Reward Per Episode

Figure 4.5 shows the average single-run performance of both versions of the FADQN algorithm. The hyperparameters were manually chosen to be Target Update = 300 and Mini-batch Size = 32. The graphs were generated by taking the mean of the seven best runs out of ten, and then smoothing each point using the average of the last ten episodes.

Figure 4.5: Comparison of the sync/async FADQN algorithms convergence

Both algorithms clearly show sub-par performance for the single-worker run. The asynchronous algorithm looks chaotic, with no clear performance increase as the number of workers grows. The synchronous algorithm, on the other hand, shows both a more stable learning curve and performance increases as the number of workers grows.


Chapter 5

Discussion and Conclusions

5.1 Discussion

As we have seen from the results in figures 4.2 and 4.3, the number of communication rounds required for convergence decreases in proportion to the value of E (the number of epochs per round). This makes sense for the current version of the FADQN algorithm, since the length of every communication round is fixed. This decision is motivated by a desire to keep the lower bound on the length of every communication round high enough to cause a considerable decrease in communication cost.

We can compare this to how the A3C algorithm operates. A3C sends a gradient update every xmax steps OR as soon as the current episode ends. This theoretically puts the lower bound of a single communication round at 1 step. The following are some different ways of limiting the communication rounds:

1. Limit only by episode

- Run a full episode before sending gradient update

2. Limit by episode and fixed upper bound

- Send gradient update after xmax time-steps or at the end of the episode, whichever comes first. (Like A3C)

3. Limit by episode and fixed lower bound

- Send gradient update when the episode ends, as long as xmin time-steps have passed since the last update. If the episode ends too early, start the next round and run until that episode finishes. (Can cause very long rounds)

4. Limit by episode with both lower and upper bound

- Same as 3, but with added protection against very long rounds. If the upper bound (xmax) is reached, the gradient is sent even though the episode is not yet finished.

5. Fixed communication round length

- Always send the gradient after x steps, without any regard to the start or end of episodes. (xmin = xmax)

The version of FADQN used in the experiments employs technique 4. Because the DQN algorithm [7] trains on random samples from the experience replay memory, there will be no significant difference in performance between the different techniques; the only impact is on communication efficiency.
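Option 4 can be captured by a small decision helper like the one below; the bounds are illustrative values, not the ones used in the experiments.

```python
def should_send_update(steps_since_update, episode_done, x_min=50, x_max=400):
    """Option 4 above: send at episode end once the lower bound is met,
    or force a send when the upper bound is reached mid-episode."""
    if steps_since_update >= x_max:
        return True                      # upper bound protects against very long rounds
    return episode_done and steps_since_update >= x_min

print(should_send_update(30, episode_done=True))     # False: round too short, keep going
print(should_send_update(120, episode_done=True))    # True: normal end-of-episode send
print(should_send_update(400, episode_done=False))   # True: forced by the upper bound
```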

Algorithms that depend on the end of an episode to calculate the correct reward, such as A3C [9], are more sensitive to how a communication round is ended. In those cases, method 2 or 4 might be the optimal choice.


5.1.1 Upper Limits on Number of Workers

Algorithms that do a lot of aggregation are by nature more vulnerable to reaching an upper limit on the number of workers. We do not have an exact number for this limit, but running more than 50 workers would probably be detrimental to performance, especially for the asynchronous version.

Having just a single neural network that all workers update will not scale to massive sizes (tens of thousands of workers or more). Other paradigms are needed to operate at that scale. Perhaps a worker should only update one or a few of the weights in the network, or perhaps there should be multiple layers of abstraction where a smaller group helps train a part of a larger system (similar to the UNREAL architecture [20]).


Chapter 6

Related and Future Work

6.1 Related Work

6.1.1 Optimization to Fed-Avg

Federated Averaging [1] has been gaining attention lately. The importance of communication-efficient distributed models is apparent, and there are research efforts to reduce communication costs even further. Compressing the averaged gradients and using sparse gradient updates appear to be active research topics.

6.1.2 Distributed Reinforcement Learning Frameworks

There currently exists a large number of different machine learning frameworks. FADQN is written in plain TensorFlow [21], which is comparatively low level. TensorFlow's distributed functions are still very primitive, and it took a considerable amount of development time to get everything to work correctly. If this project were to be continued, it could be a good idea to rewrite it in a higher-level framework specifically built to handle distributed machine learning. Some examples are Ray [22] and Chainer [23].

6.1.3 OpenAI Request for Research 2

Further research in the field of parameter averaging in distributed machine learning is being requested by OpenAI¹, showing that the field is quickly gaining attention from the major players in AI research. We might see averaging techniques becoming state of the art for distributed machine learning.

6.2 Cart Pole as a reference problem

While cart pole might be a fairly good representation of an auto-scaling problem, its simplicity limits its usefulness as a general reference problem: it has very small state and action spaces, and very simple dynamics.

To address this, future work should use some of the current state-of-the-art benchmarks for reinforcement learning to rigorously verify the agent's performance. Some examples are Atari [24], MuJoCo [25], or even DeepMind Lab [14].

These more sophisticated environments often require additional work, such as graphics processing, to integrate into the system. That is why we decided to limit the scope of this research to cart pole only.

¹ https://blog.openai.com/requests-for-research-2/ (Feb 2018)


6.2.1 Improvements to the way we define convergence

In section 3.1.3 we mentioned that the convergence requirement of the OpenAI Baselines implementation [16], 100% over 100 episodes, is a bit too strict. This strict policy has one clear disadvantage: it introduces a lot of randomness into the convergence number (episodes required to converge), which is used to determine the performance of the agent.

With the current system, an agent can complete 99 perfect (200-reward) episodes only to fail the last one with a reward of 199. This forces the agent to perform 100 new perfect episodes before it can "officially" converge. As a result, the same agent can sometimes converge after 400 episodes and sometimes after 600 episodes, based purely on randomness.

Just setting the convergence threshold to 199, 198 or 197 can drastically reduce the influence of randomness, but it is still possible to get three non-perfect episodes in a row, causing the same problem.

We propose a layered convergence condition that combines a less strict threshold with a streak requirement to prevent bad algorithms from converging early. An example of such a condition could be: every episode in the evaluation window reaches a reward of at least 195, and the last 50 episodes have to be perfect. This would allow for some bad randomness to happen, and it would cause at most a 50-episode delay. These numbers would of course have to be tweaked.
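A sketch of such a layered condition (the 100-episode window is our assumption; the thresholds are the illustrative numbers above and would need tuning):

def has_converged(episode_rewards, threshold=195, streak=50, window=100, perfect=200):
    if len(episode_rewards) < window:
        return False
    recent = episode_rewards[-window:]
    # Less strict threshold over the whole window ...
    if min(recent) < threshold:
        return False
    # ... plus a streak requirement: the last `streak` episodes must be perfect.
    return all(r == perfect for r in episode_rewards[-streak:])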

6.3 Optimization

The current implementation (June 2018) of the FADQN algorithm [17] is not very well optimized. This section describes some possible improvements that were not included because of time or other limitations.

6.3.1 Decoupled Actors and Learners

Many recent distributed reinforcement learning algorithms have been centered around the idea of decoupling the actors from the learners [10, 13]. In these kinds of architectures, any learner can learn from the experience (data) generated by any of the actors. This gives rise to three categories of architectures, which can be described as follows:

1. Decoupled - Uses a centralized replay memory. Learners can access samples generated by any actor and potentially train on the same sample multiple times.

2. Coupled - Every actor-learner pair has its own replay memory. Learners can train on the same sample multiple times, but all samples come from the same actor.

3. Hard Coupled - Learners observe their actor's actions and learn from them once. The sample is then discarded, possibly leading to data inefficiency.

Figure 6.1 gives a visual representation of how the actors and learners of the different architectures relate to each other. FADQN currently uses a coupled architecture, and it would be interesting to see how its performance would change if it were switched to a decoupled architecture.

One thing to note about the FADQN architecture is that the actor and learner are completely sequential. The actor first acts once in the environment and stores the result of that action in the replay memory. When that is finished, the learner samples several experiences from the replay memory to train on, and there is no guarantee that the learner will draw the sample that the actor just put in the replay memory. This could potentially be improved by running the actor on a separate thread, continuously generating data for the learner to use. Doing this would most likely speed up data generation, causing the replay memory to contain more fresh samples. Whether this would increase performance would have to be tested.
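A rough sketch of what such a free-running actor thread could look like (not part of the current implementation; `env` and `agent` are hypothetical objects following the usual Gym-style interface):

import random
import threading
from collections import deque

replay_memory = deque(maxlen=50000)   # bounded experience replay memory
memory_lock = threading.Lock()

def actor_loop(env, agent, stop_event):
    # Keep acting and storing transitions, independently of the learner's pace.
    state = env.reset()
    while not stop_event.is_set():
        action = agent.act(state)
        next_state, reward, done, _ = env.step(action)
        with memory_lock:
            replay_memory.append((state, action, reward, next_state, done))
        state = env.reset() if done else next_state

def sample_batch(batch_size=32):
    # Learner side: draw a random mini-batch; it may or may not contain
    # the freshest transition, just as in the sequential version.
    with memory_lock:
        return random.sample(list(replay_memory), batch_size)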


Figure 6.1: Coupled and decoupled actor-learner architectures. (Panels: Coupled, Decoupled, Hard Coupled (data inefficient); each panel shows actors a0...an, per-worker replay memories m0...mn or a shared memory M, learners l0...ln, and the shared parameters θ.)

In the FedAvg paper [1], the authors purposefully deal with non-IID (not independent and identically distributed) data to show that their algorithm works even when the data does not come from a single, neatly organized data set. By using a coupled architecture, we also make our data slightly non-IID, since every worker's replay memory is unique and cannot be accessed by the other workers. That means that even if we were to gain performance by using a shared replay memory, the current architecture tests the algorithm's robustness in an additional way.

6.3.2 Backup Workers

When using synchronous distributed machine learning, a small fraction of the workers will always be unusually slow. This causes all workers to wait for the stragglers, wasting a lot of time. Chen et al. [8] show that the use of a few backup workers can significantly reduce the time needed to train synchronous models. Even as few as 4-6% extra workers can be enough to greatly speed up convergence time.

We were not concerned with training time when implementing FADQN, only convergence speed and communication rounds, so we did not spend any time implementing this feature. It should, however, be fairly simple to add backup workers, and the FADQN architecture is fully compatible with them.
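A sketch of how backup workers could slot into the synchronous aggregation step, following the idea in Chen et al. [8]; `collect_gradient` stands in for a blocking receive from whichever worker finishes next and is a hypothetical helper:

import numpy as np

def aggregate_with_backups(collect_gradient, n_required, n_total):
    # n_total workers compute gradients, but we only wait for the first
    # n_required results; the remaining (backup) workers' results are dropped.
    assert n_required <= n_total
    gradients = [collect_gradient() for _ in range(n_required)]
    return np.mean(gradients, axis=0)  # average as in the ordinary synchronous update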

6.3.3 Security and encryption

We have discussed privacy and security when handling the data used to train machine learning models. The act of averaging several gradients together automatically helps to disguise the data that was used to generate them. This is, however, not completely safe, because there are ways to reverse the training process and potentially reveal some of the underlying information.

When Google released the Federated Averaging paper [1], they also released some accompanying papers that, among other things, dealt with this kind of security [4]. The simplest measure is to encrypt the averaged gradient before sending it to the central server. That prevents an outside attacker from seeing your data, but it does not address the privacy concern of sensitive data being exposed to whatever server happens to process the gradient. They suggest different methods of ensuring privacy, including the use of a digital signature to ensure that the gradient is aggregated with a sufficient number of other agents' gradients.


Most of these methods could easily be incorporated into FADQN, since the architecture is so similar to the FedAvg architecture.
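As a hedged illustration of the simplest measure above (plain symmetric encryption of the serialized averaged gradient, using the third-party cryptography package; key distribution and the secure-aggregation protocol of Bonawitz et al. [4] are out of scope here):

import io
import numpy as np
from cryptography.fernet import Fernet

key = Fernet.generate_key()   # in practice, agreed upon out of band
cipher = Fernet(key)

def encrypt_gradient(gradient):
    # Serialize the numpy gradient and encrypt it for transport.
    buffer = io.BytesIO()
    np.save(buffer, gradient)
    return cipher.encrypt(buffer.getvalue())

def decrypt_gradient(token):
    # Server side: decrypt and deserialize.
    return np.load(io.BytesIO(cipher.decrypt(token)))

ciphertext = encrypt_gradient(np.random.randn(4, 24))
restored = decrypt_gradient(ciphertext)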


Bibliography

[1] H. Brendan McMahan et al. "Communication-efficient learning of deep networks from decentralized data". In: arXiv preprint arXiv:1602.05629 (2016).

[2] Jim Gao. Machine Learning Applications for Data Center Optimization. 2014.

[3] David Silver et al. "Mastering the game of Go without human knowledge". In: Nature 550 (Oct. 2017), p. 354. URL: http://dx.doi.org/10.1038/nature24270.

[4] Keith Bonawitz et al. Practical Secure Aggregation for Privacy-Preserving Machine Learning. Cryptology ePrint Archive, Report 2017/281. https://eprint.iacr.org/2017/281. 2017.

[5] Ian Goodfellow et al. "Generative adversarial nets". In: Advances in Neural Information Processing Systems. 2014, pp. 2672–2680.

[6] Yuxi Li. "Deep Reinforcement Learning: An Overview". In: CoRR abs/1701.07274 (2017). arXiv: 1701.07274. URL: http://arxiv.org/abs/1701.07274.

[7] Volodymyr Mnih et al. "Playing Atari with deep reinforcement learning". In: arXiv preprint arXiv:1312.5602 (2013).

[8] Jianmin Chen et al. "Revisiting distributed synchronous SGD". In: arXiv preprint arXiv:1604.00981 (2016).

[9] Volodymyr Mnih et al. "Asynchronous methods for deep reinforcement learning". In: International Conference on Machine Learning. 2016, pp. 1928–1937.

[10] Arun Nair et al. "Massively parallel methods for deep reinforcement learning". In: arXiv preprint arXiv:1507.04296 (2015).

[11] Alfredo V. Clemente, Humberto N. Castejón, and Arjun Chandra. "Efficient Parallel Methods for Deep Reinforcement Learning". In: arXiv preprint arXiv:1705.04862 (2017).

[12] John Schulman et al. "Proximal policy optimization algorithms". In: arXiv preprint arXiv:1707.06347 (2017).

[13] Lasse Espeholt et al. "IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures". In: arXiv preprint arXiv:1802.01561 (2018).

[14] Charles Beattie et al. "DeepMind Lab". In: CoRR abs/1612.03801 (2016). arXiv: 1612.03801. URL: http://arxiv.org/abs/1612.03801.

[15] Andrew G. Barto, Richard S. Sutton, and Charles W. Anderson. "Neuronlike adaptive elements that can solve difficult learning control problems". In: IEEE Transactions on Systems, Man, and Cybernetics 5 (1983), pp. 834–846.

[16] Prafulla Dhariwal et al. OpenAI Baselines. https://github.com/openai/baselines. 2017.

[17] Sebastian Backstad and Mehmood Khan. FADQN. https://github.com/aztah/distrl. 2018.

[18] Diederik P. Kingma and Jimmy Ba. "Adam: A Method for Stochastic Optimization". In: CoRR abs/1412.6980 (2014). arXiv: 1412.6980. URL: http://arxiv.org/abs/1412.6980.


[19] Takuya Akiba, Shuji Suzuki, and Keisuke Fukuda. "Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes". In: CoRR abs/1711.04325 (2017). arXiv: 1711.04325. URL: http://arxiv.org/abs/1711.04325.

[20] Max Jaderberg et al. "Reinforcement Learning with Unsupervised Auxiliary Tasks". In: CoRR abs/1611.05397 (2016). arXiv: 1611.05397. URL: http://arxiv.org/abs/1611.05397.

[21] Martín Abadi et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Software available from tensorflow.org. 2015. URL: https://www.tensorflow.org/.

[22] Philipp Moritz et al. "Ray: A Distributed Framework for Emerging AI Applications". In: arXiv preprint arXiv:1712.05889 (2017).

[23] Seiya Tokui et al. "Chainer: a Next-Generation Open Source Framework for Deep Learning". In: Proceedings of the Workshop on Machine Learning Systems (LearningSys) at the Twenty-ninth Annual Conference on Neural Information Processing Systems (NIPS). 2015. URL: http://learningsys.org/papers/LearningSys_2015_paper_33.pdf.

[24] Marc G. Bellemare et al. "The Arcade Learning Environment: An evaluation platform for general agents". In: Journal of Artificial Intelligence Research (JAIR) 47 (2013), pp. 253–279.

[25] Emanuel Todorov, Tom Erez, and Yuval Tassa. "MuJoCo: A physics engine for model-based control". In: Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on. IEEE, 2012, pp. 5026–5033.


Acronyms

A2C Advantage Actor-Critic. 8

A3C Asynchronous Advantage Actor-Critic. 7

DQN Deep Q-Network. ii, 7

DRL Deep Reinforcement Learning. i, 5

FADQN Federated Averaging Deep Q-Network. i, 11

FedAvg Federated Averaging. 3, 11, 12

Gorila General Reinforcement Learning Architecture. 7

IMPALA Importance Weighted Actor-Learner Architecture. 7

PPO Proximal Policy Optimization. 5, 8

RL Reinforcement Learning. 4, 5, 9

Sync-Opt Synchronous Stochastic Optimization. 8


Appendix A

Appendix

A.1 Colored Equations

Using colors in equations is an attempt to make the equations more understandable. The color of each symbol in an equation matches the color of the text describing that symbol. The following equations are from chapter 3.3.

A.1.1 Staleness Weighting

Asynchronous Staleness Weighting

\theta_{c+1} \leftarrow \theta_c + \frac{E}{T_{\mathrm{new}} - T_{\mathrm{old}}} \sum_{i=1}^{E} g_i(\theta') \qquad (A.1)

The global parameters for the next communication round are calculated by first taking the sum of the gradients from every epoch with respect to the local parameters. The gradient sum is then scaled by a factor representing the staleness of the gradient: the number of epochs divided by the difference in global time steps between when the local computation started and when it ended. The averaged gradient is finally added to the global parameters of the current communication round.
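A plain-numpy sketch of the update in equation (A.1); the variable names are ours:

import numpy as np

def async_staleness_update(global_params, epoch_gradients, t_started, t_now):
    # Sum the E per-epoch gradients, scale by the staleness factor
    # E / (T_new - T_old), and add the result to the global parameters.
    E = len(epoch_gradients)
    staleness_factor = E / (t_now - t_started)
    return global_params + staleness_factor * np.sum(epoch_gradients, axis=0)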

Synchronous Staleness Weighting

\theta_{c+1} \leftarrow \theta_c + \sum_{w=1}^{N} \frac{1}{N} \sum_{i=1}^{E} g_{wi}(\theta') \qquad (A.2)

This is a special case of the asynchronous version. Most parts are identical, with two exceptions: the staleness factor has been replaced by a constant value based on the total number of agents participating in the training, and there is an additional sum that adds up the averaged gradients of all agents.

This sum is only present in the synchronous version, because the parameter server waits for all workers to complete before adding the gradient, which implicitly creates the additional sum. In contrast, an asynchronous worker just adds its own averaged gradient, without any regard for what the other workers are doing.
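The synchronous special case in equation (A.2), sketched the same way; each element of worker_gradients is one worker's list of E per-epoch gradients:

import numpy as np

def sync_update(global_params, worker_gradients):
    # Sum each worker's E gradients, average the sums over the N workers,
    # and add the result to the global parameters.
    N = len(worker_gradients)
    per_worker_sums = [np.sum(g, axis=0) for g in worker_gradients]
    return global_params + np.sum(per_worker_sums, axis=0) / N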

A.1.2 Normalized Score Weighting

\theta_{c+1} \leftarrow \theta_c + \sum_{i=1}^{E} \frac{X_i - \min}{\max - \min}\, g_i(\theta') \qquad (A.3)

The idea behind this method is to scale the size of each worker's gradient to match its relative reward. To get the normalized reward, we take the reward of worker i and subtract the lowest reward registered among all workers this communication round; we then divide this by the difference between the smallest and the largest reward registered in this communication round.

We tried implementing this, but it does not work with environments that have a maximum score/reward that is commonly reached, like cart pole (max 200). It would have to be tweaked to account for these special cases.
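A sketch of equation (A.3) that also handles the degenerate case just described, where every worker hits the maximum score and max − min becomes zero (the fallback to an unweighted update is our assumption, not part of the original method):

import numpy as np

def normalized_score_update(global_params, epoch_gradients, rewards, worker):
    lo, hi = min(rewards), max(rewards)
    if hi == lo:
        # e.g. every worker reaches cart pole's maximum of 200:
        # avoid dividing by zero by falling back to an unweighted update.
        weight = 1.0
    else:
        weight = (rewards[worker] - lo) / (hi - lo)
    return global_params + weight * np.sum(epoch_gradients, axis=0)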

A.1.3 Augmented Score Weighting (statistical)

\beta_{i0} = \frac{1}{n}

\mathit{ProbScore}_r = \beta_{ir} \cdot \frac{X_i - \min_r}{\max_r - \min_r}

\beta_{i,r+1} \leftarrow \beta_{ir} + \left(\mathit{ProbScore}_r - \mathit{ProbScore}_{r-1}\right) \qquad (A.4)

The weighting factor for the first round starts out as the average, 1/n. We then create a probability score for round r by multiplying the weighting factor of worker i by worker i's normalized reward for round r (as in the normalized score weighting). Finally, to get worker i's weighting factor for the next round, we add the difference in probability score since the last round to the weighting factor of this round.
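The corresponding sketch for equation (A.4); treating the round-(r−1) probability score as 0 in the very first round is our assumption:

def update_weighting_factor(beta, prev_prob_score, reward, rewards):
    # Normalized reward, as in the normalized score weighting.
    lo, hi = min(rewards), max(rewards)
    normalized = (reward - lo) / (hi - lo) if hi != lo else 1.0
    prob_score = beta * normalized
    # Move the weighting factor by the change in probability score.
    return beta + (prob_score - prev_prob_score), prob_score

# First round: beta starts at 1/n for each of n workers.
n_workers = 4
beta, prob_score = 1.0 / n_workers, 0.0
beta, prob_score = update_weighting_factor(beta, prob_score, 180, [150, 170, 180, 200])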


A.2 Default Parameters

This appendix lists all the default parameters used for the algorithms. These parameters were either manually tuned or taken from the defaults of the OpenAI Baselines [16] implementation.

Param   Value   Name                        Description
E       100     Epochs                      Decays linearly to 80 over 500 episodes
N       1000    Target Network Update       600 in most experiments
bs      32      Mini-batch Size             —
mem     50000   Experience Replay Memory    —
lr      5e-4    Learning Rate               —

Table A.1: FADQN default hyperparameters
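For convenience, the same defaults expressed as a plain configuration dictionary; the key names are ours and the actual FADQN code [17] may use different identifiers:

FADQN_DEFAULTS = {
    "epochs_per_round": 100,         # E, decays linearly to 80 over 500 episodes
    "target_network_update": 1000,   # N, 600 in most experiments
    "mini_batch_size": 32,
    "replay_memory_size": 50000,
    "learning_rate": 5e-4,
}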
