
Hierarchical reinforcement learning via dynamic subspace search for multi-agent planning

Aaron Ma · Michael Ouimet · Jorge Cortés


Abstract We consider scenarios where a swarm of unmanned vehicles (UxVs) seek to satisfy a number of diverse, spatially distributed objectives. The UxVs strive to determine an efficient plan to service the objectives while operating in a coordinated fashion. We focus on developing autonomous high-level planning, where low-level controls are leveraged from previous work in distributed motion, target tracking, localization, and communication. We rely on the use of state and action abstractions in a Markov decision process framework to introduce a hierarchical algorithm, Dynamic Domain Reduction for Multi-Agent Planning, that enables multi-agent planning for large multi-objective environments. Our analysis establishes the correctness of our search procedure within specific subsets of the environment, termed 'sub-environments', and characterizes the algorithm performance with respect to the optimal trajectories in single-agent and sequential multi-agent deployment scenarios using tools from submodularity. Simulated results show significant improvement over using a standard Monte Carlo tree search in an environment with large state and action spaces.

A preliminary version of this work appeared as [1] at the International Symposium on Multi-Robot and Multi-Agent Systems.

A. Ma · J. Cortés, Department of Mechanical and Aerospace Engineering, University of California, San Diego. E-mail: [email protected], [email protected]

M. Ouimet, SPAWAR Systems Center Pacific, San Diego. E-mail: [email protected]

    1 Introduction

Recent technology has enabled the deployment of UxVs in a wide range of applications involving intelligence gathering, surveillance and reconnaissance, disaster response, exploration, and surveying for agriculture. In many scenarios, these unmanned vehicles are controlled by one or, more often than not, multiple human operators. Reducing UxV dependence on human effort enhances their capability in scenarios where communication is expensive, low bandwidth, delayed, or contested, as agents can make smart and safe choices on their own.

In this paper we design a framework for enabling multi-agent autonomy within a swarm in order to satisfy arbitrary spatially distributed objectives. Planning presents a challenge because the computational complexity of determining optimal sequences of actions becomes expensive as the size of the swarm, environment, and objectives increases.

    Literature review

Recent algorithms for decentralized methods of multi-agent deployment and path planning enable agents to use local information to satisfy some global objective. A variety of decentralized methods can be used for deployment of robotic swarms with the ability to monitor spaces, see e.g. [2-7] and references therein. We gather motivation from these decentralized methods because they enable centralized goals to be realized with decentralized computation and autonomy. In general, these decentralized methods provide approaches for low-level autonomy in multi-agent systems, so we look to common approaches used for high-level planning and scheduling algorithms.

Reinforcement learning is relevant to this goal because it enables generalized planning. Reinforcement learning algorithms commonly use Markov decision processes (MDPs) as the standard framework for temporal planning. Variations of MDPs exist, such as semi-Markov decision processes (SMDPs) and partially observable MDPs (POMDPs). These frameworks are invaluable for planning under uncertainty, see e.g. [8-11]. Given a (finite or infinite) time horizon, the MDP framework is conducive to constructing a policy for optimally executing actions in an environment [12-14]. Reinforcement learning contains a large ecosystem of approaches. We separate them into three classes with respect to their flexibility of problem application and their ability to plan online versus the need to be trained offline prior to use.

The first class of approaches is capable of running online, but is tailored to solve a specific domain of objectives, such as navigation. The work [15] introduces an algorithm that allows an agent to simultaneously optimize hierarchical levels by learning policies from primitive actions to solve an abstract state space with a chosen abstraction function. Although this algorithm is implemented for navigational purposes, it can be tailored for other objective domains. In contrast, our framework reasons using higher levels of abstraction over different types of multiple objectives. In our formulation, we assume agents have the ability to use preexisting algorithms, such as [15-18], as actions that an agent utilizes in a massive environment. The dec-POMDP framework [19] incorporates joint decision making and collaboration of multiple agents in uncertain and high-dimensional environments. Masked Monte Carlo Search is a dec-POMDP algorithm [20] that determines joint abstracted actions in a centralized way for multiple agents that plan their trajectories in a decentralized POMDP. Belief states are used to contract the expanding history and curse of dimensionality found in POMDPs. Inspired by Rapidly-Exploring Randomized Trees [16], the Belief Roadmap [17] allows an agent to find minimum-cost paths efficiently by finding a trajectory through belief spaces. Similarly, the algorithm in [18] creates Gaussian belief states and exploits feedback controllers to reduce POMDPs to MDPs for tractability in order to find a trajectory. Most of the algorithms in this class are not necessarily comparable to each other due to the specific context of their problem statements and type of objective. For that reason, we are motivated to find an online method that is still flexible and can be utilized for a large class of objectives.

The second class of approaches have flexible use cases and are most often computed offline. These formulations include reinforcement learning algorithms for value or policy iteration. In general, these algorithms rely on MDPs to examine convergence, although the model is considered hidden or unknown in the algorithm. An example of a state-of-the-art model-free reinforcement learner is the Deep Q-network (DQN), which uses deep neural networks and reinforcement learning to approximate the value function of a high-dimensional state space in order to indirectly determine a policy afterwards [21]. Policy optimization reinforcement algorithms focus on directly optimizing a policy of an agent in an environment. Trust Region Policy Optimization (TRPO) [22] enforces constraints on the KL-divergence between the new and old policy after each update to produce more incremental, stable policy improvement. Actor-Critic using Kronecker-factored Trust Region (ACKTR) [23] is a hybrid of policy optimization and Q-learning which alternates between policy improvement and policy evaluation to better guide the policy optimization. These techniques have been successfully applied to a range of Atari 2600 games, with results similar to advanced human players. Offline, model-free reinforcement algorithms are attractive because they can reason over abstract objectives and problem statements; however, they do not take advantage of inherent problem model structure. Because of this, model-free learning algorithms usually produce good policies more slowly than model-based algorithms, and often require offline computation.

The third class of algorithms are flexible in application and can be used online. Many of these algorithms require a model in the form of an MDP, or other variations. Standard algorithms include Monte Carlo tree search (MCTS) [24] and modifications such as the upper confidence bound tree search [25]. Many works in this category attempt to address the curse of dimensionality by abstracting either the state space [26], the history in POMDPs [27], or the action space [28]. These algorithms most closely fit our problem statement, because we are interested in online techniques for optimizing across a large domain of objectives.

In the analysis of our algorithm, we rely on the notion of submodular set functions and the characterization of the performance of greedy algorithms, see e.g. [29, 30]. Even though some processes of our algorithm are not completely submodular, we are able to invoke these results by resorting to the concept of submodularity ratio [31], which quantifies how far a set function is from being submodular, using tools from scenario optimization [32].

    Statement of contributions

Our approach seeks to bridge the gap between the MDP-based approaches described above. We provide a framework that remains general enough to reason over multiple objective domains, while taking advantage of the inherent spatial structure and known vehicle model of most robotic applications to plan efficiently. Our goal is to synthesize a multi-agent algorithm that enables agents to abstract and plan over large, complex environments, taking advantage of the benefits resulting from coordinating their actions. We determine meaningful ways to represent the environment and develop an algorithm that reduces the computational burden on an agent to determine a plan. We introduce methods of generalizing positions of agents and objectives with respect to proximity. We rely on the concept of a 'sub-environment', which is a subset of the environment with respect to proximity-based generalizations, and use high-level actions, with the help of low-level controllers, designed to reduce the action space and plan in the sub-environments. The main contribution of the paper is an algorithm for splitting the work of an agent between dynamically constructing and evaluating sub-environments and learning how to best act in that sub-environment, cf. Figure 1.1. We also introduce modifications that enable multi-agent deployment by allowing agents to interact with the plans of other team members. We provide convergence guarantees on key components of our algorithm design and identify metrics to evaluate the performance of the sub-environment selection and sequential multi-agent deployment. Using tools from submodularity and scenario optimization, we establish formal guarantees on the suboptimality gap of these procedures. We illustrate the effectiveness of dynamically constructing sub-environments for planning in environments with large state spaces through simulation and compare our proposed algorithm against Monte Carlo tree search techniques.

Fig. 1.1 Workflow of the proposed hierarchical algorithm. A sub-environment is dynamically constructed as a series of spatially based state abstractions in the environment in a process called SubEnvSearch. Given this sub-environment, an agent performs tasks, which are abstracted actions constrained to properties of the sub-environment, in order to satisfy objectives. Once a sub-environment is created, the process TaskSearch uses a semi-Markov decision process to model the sub-environment and determine an optimal 'task' to perform. Past experiences are recycled for similar-looking sub-environments, allowing the agent to quickly converge to an optimal policy. The agent dedicates time to both finding the best sub-environment and evaluating that sub-environment by cycling through SubEnvSearch and TaskSearch.

    Organization

The paper is organized as follows. Section 2 presents preliminaries on Markov decision processes. Section 3 introduces the problem of interest. Section 4 describes our approach to abstract states and actions with respect to spatial proximity, and Section 5 builds on these models to design our hierarchical planning algorithm. Section 6 presents analysis on algorithm convergence and performance, and Section 7 shows our simulation results. We gather our conclusions and ideas for future work in Section 8.

    Notation

We use Z, Z≥1, R, and R>0 to denote integer, positive integer, real, and positive real numbers, respectively. We let |Y| denote the cardinality of an arbitrary set Y. We employ object-oriented notation throughout the paper; b.c means that c belongs to b, for arbitrary objects b and c. For reference, Appendix C presents a list of commonly used symbols.

    2 Preliminaries on Markov decision processes

We follow the exposition from [15] to introduce basic notions on Markov decision processes (MDPs). An MDP is a tuple ⟨S, A, Pr_s, R⟩, where S and A are the state and action spaces, respectively; Pr_s(s′ | a, s) is a transition function that returns the probability that state s ∈ S becomes state s′ after taking action a ∈ A; and R(s, a, s′) is the reward that an agent gets after taking action a from state s to reach state s′. A policy is a feedback control that maps each state to an action, π : s → a, for each s. The value of a state under a given policy is

V^π(s) = R(s, π(s)) + γ ∑_{s′∈S} Pr_s(s′ | π(s), s) V^π(s′),

where γ ∈ (0, 1) is the discount factor. The value function takes finite values. The solution to the MDP is an optimal policy that maximizes the value function, π*(s) = argmax_π V^π(s) for all s. The value of taking an action at a given state under a given policy is

Q(s, a) = R(s, a) + γ ∑_{s′∈S} Pr_s(s′ | a, s) V^π(s′).


Usual methods for obtaining π* require a tree search of the possible states that can be reached by taking a series of actions. The rate at which the tree of states grows is called the branching factor. This search is a challenge for solving MDPs with large state spaces and actions with low-likelihood probabilistic state transitions. A technique often used to decrease the size of the state space is state abstraction, where a collection of states is clustered into one in some meaningful way. This can be formalized with a state abstraction function of the form φ_s : s → s_φ. Similarly, actions can be abstracted with an action abstraction function φ_a : a → a_φ. Abstracting actions is used to decrease the action space, which can make π* easier to calculate. In MDPs, actions take one time step per action. However, abstracted actions may take a probabilistic amount of time to complete, Pr_t(t | a_φ, s). When considering the problem using abstracted actions a_φ ∈ A_φ in ⟨S, A_φ, Pr_s, R⟩, the process becomes a semi-Markov decision process (SMDP), which allows for a probabilistic amount of time per abstracted action. The loss of precision in the abstracted actions means that an optimal policy for an SMDP with abstracted modifications may not be optimal with respect to the original MDP.

Determining π* often involves constructing a tree of reachable MDP/SMDP states determined through simulating actions from an initial state. Dynamic programming is commonly used for approximating π* by using a Monte Carlo tree search to explore the MDP from the initial state. The action with the maximum upper confidence bound (UCB) [25] of the approximated expected value for taking the action at a given state,

argmax_{a∈A} { Q̂(s, a) + C √( ln(N_s) / N_{s,a} ) },

is often used to efficiently explore the MDP. Here, N_s is the number of times a state has been visited, N_{s,a} is the number of times that action a has been taken at state s, and C is a constant. Taking the action that maximizes this quantity balances between exploiting actions that previously had high reward and exploring actions with uncertain but potentially higher reward.
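As a concrete illustration, the following Python sketch implements this selection rule from tabulated statistics; the names q_hat, n_s, n_sa, and the value of C are illustrative stand-ins for Q̂, N_s, N_{s,a}, and the exploration constant, not part of the paper.

    import math

    def ucb_action(actions, q_hat, n_s, n_sa, C=1.0):
        # Return the action maximizing Q_hat(s, a) + C * sqrt(ln(N_s) / N_{s,a}).
        # Untried actions (N_{s,a} = 0) get an infinite score so they are explored first.
        def score(a):
            if n_sa.get(a, 0) == 0:
                return float('inf')
            return q_hat.get(a, 0.0) + C * math.sqrt(math.log(n_s) / n_sa[a])
        return max(actions, key=score)

    # Example with three abstract actions and accumulated statistics.
    actions = ['a1', 'a2', 'a3']
    q_hat = {'a1': 1.2, 'a2': 0.9, 'a3': 0.0}
    n_sa = {'a1': 10, 'a2': 3, 'a3': 0}
    print(ucb_action(actions, q_hat, n_s=13, n_sa=n_sa))   # 'a3' is tried first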

    3 Problem statement

Consider a set of agents A indexed by α ∈ A. We assume agents are able to communicate with each other and have access to each other's locations. An agent occupies a point in Z^d, and has computational, communication, and mobile capabilities. A waypoint o ∈ Z^d is a point that an agent must visit in order to serve an objective. Every waypoint then belongs to a class of objective of the form O_b = {o_1, ..., o_{|O_b|}}. Agents are able to satisfy objectives when waypoints are visited, such that o ∈ O_b is removed from O_b. When O_b = ∅, the objective is considered 'complete'. An agent receives a reward r_b ∈ R for visiting a waypoint o ∈ O_b. Define the set of objectives to be O = {O_1, ..., O_{|O|}}, and assume agents are only able to service one O_b ∈ O at a time. We consider the environment to be E = O × A, which contains information about all objectives and agents. The state space of E increases exponentially with the number of objectives |O|, the cardinality of each objective |O_b| for each b, and the number of agents |A|.
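To make the notation concrete, a minimal sketch of how objectives, waypoints, and the environment E could be represented is given below; the class and method names are hypothetical implementation choices, not notation from the paper.

    from dataclasses import dataclass, field
    from typing import List, Set, Tuple

    @dataclass
    class Objective:
        reward: float                                  # reward r_b for visiting one of its waypoints
        waypoints: Set[Tuple[int, ...]] = field(default_factory=set)   # points of Z^d left to visit

        def complete(self) -> bool:
            return len(self.waypoints) == 0            # O_b = empty set means 'complete'

    @dataclass
    class Environment:
        objectives: List[Objective]                    # the set O
        agent_locations: List[Tuple[int, ...]]         # one point of Z^d per agent

        def visit(self, b: int, waypoint: Tuple[int, ...]) -> float:
            # Servicing objective b at a waypoint removes it and returns the reward r_b.
            if waypoint in self.objectives[b].waypoints:
                self.objectives[b].waypoints.discard(waypoint)
                return self.objectives[b].reward
            return 0.0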

We strive to design a decentralized algorithm that allows the agents in A to individually approximate the policy π* that optimally services objectives in O in scenarios where E is very large. To tackle this problem, we rely on abstractions that reduce the stochastic branching factor to find a good policy in the environment. We begin our strategy by spatially abstracting objectives in the environment into convex sets termed 'regions'. We dynamically create and search subsets of the environment to reduce the dimensions of the state that individual agents reason over. Then we structure a plan of high-level actions with respect to the subset of the environment. Finally, we specify a limited, tunable number of interactions that must be considered in the multi-agent joint planning problem, leading up to the Dynamic Domain Reduction for Multi-Agent Planning algorithm to approximate the optimal policy.

    4 Abstractions

In order to leverage dynamic programming solutions to approximate a good policy, we reduce the state space and action space. We begin by introducing methods of abstraction with respect to spatial proximity for a single agent, then move on to the multi-agent case.

    4.1 Single-agent abstractions

To tackle the fact that the number of states in the environment grows exponentially with respect to the number of agents, the number of objectives, and their cardinality, we cluster waypoints and agent locations into convex sets in space, a process we term region abstraction. Then, we construct abstracted actions that agents are allowed to execute with respect to the region abstractions.


    4.1.1 Region abstraction

We define a region to be a convex set x ⊆ R^d. Let Ω_x be a set of disjoint regions whose union is the entire space that agents reason over in the environment. We consider regions to be equal in size, shape, and orientation, so that the set of regions creates a tessellation of the environment. This makes the presentation simpler, although our framework can be extended to handle non-regular regions by including region definitions for each unique region in consideration.

Furthermore, let O_{bi} be the set of waypoints of objective O_b in region x_i, i.e., such that O_{bi} ⊆ x_i. We use an abstraction function, φ : O_{bi} → s^b_i, to get the abstracted objective state, s^b_i, which enables us to generalize the states of an objective in a region. In general, the abstraction function is designed by a human user who distinguishes the importance of an objective in a region. We define a regional state, s_i = (s^1_i, ..., s^{|O|}_i), to be the Cartesian product of s^b_i for all objectives O_b ∈ O with respect to x_i.
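Since φ is left to the designer, the following is one possible (entirely hypothetical) choice: bucket the number of outstanding waypoints of each objective in a region into a few coarse levels, and take the tuple over objectives as the regional state. The thresholds below are arbitrary.

    def phi(waypoints_in_region):
        # Map the waypoint set of one objective inside a region x_i to a coarse
        # abstracted objective state s^b_i in {0, 1, 2, 3}.
        n = len(waypoints_in_region)
        if n == 0:
            return 0        # nothing left to service
        if n <= 2:
            return 1        # few waypoints
        if n <= 10:
            return 2        # moderate
        return 3            # many

    def regional_state(objectives, region_contains):
        # s_i is the tuple of abstracted states s^b_i over all objectives O_b in O,
        # reusing the illustrative Objective container sketched earlier.
        return tuple(phi({o for o in obj.waypoints if region_contains(o)})
                     for obj in objectives)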

    4.1.2 Action abstraction

We assume that low-level feedback controllers allowing for servicing of waypoints are available. We next describe how we use low-level controllers as components of a 'task' to reason over. We define a task to be τ = ⟨s^b_i, s′^b_i, x_i, x_j, b⟩, where s^b_i and s′^b_i are abstracted objective states associated to x_i, x_j is a target region, and b is the index of a target objective. Assume that low-level controllers satisfy the following requirements:

– Objective transition: the low-level controller executing τ drives the state transition τ.s^b_i → τ.s′^b_i.
– Regional transition: the low-level controller executing τ drives the agent's location to x_j after the objective transition is complete.

Candidates for low-level controllers include policies determined using the approaches in [15, 25] after setting up the region as an MDP, modified traveling salesperson [33], or path-planning-based algorithms interfaced with the dynamics of the agent. Agents that start a task are required to continue working on the task until its requirements are met. Because the tasks are dependent on abstracted objective states, the agent completes a task in a probabilistic amount of time, given by Pr_t(t | τ), that is determined heuristically. The set of all possible tasks is given by Γ. If an agent begins a task such that the above properties are not satisfied, then Pr_t(∞ | τ) = 1 and the agent never completes it.
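A minimal sketch of how a task τ = ⟨s^b_i, s′^b_i, x_i, x_j, b⟩ and its feasibility check could be coded is shown below; travel_time plays the role of an abstracted transit time between regions, and all names are illustrative assumptions rather than the paper's implementation.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Task:
        s_b_i: int        # abstracted state of objective b in region x_i before the task
        s_b_i_next: int   # abstracted state of objective b after the task
        x_i: int          # region where the objective transition takes place
        x_j: int          # target region the agent ends up in
        b: int            # index of the target objective

    def is_feasible(task, current_regional_state, travel_time):
        # A task is only completable if the required objective transition is possible from
        # the current abstracted state and the target region is reachable in finite time;
        # otherwise its completion-time distribution places all mass at infinity.
        return (current_regional_state[task.b] == task.s_b_i
                and travel_time(task.x_i, task.x_j) != float('inf'))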

    4.1.3 Sub-environment

In order to further alleviate the curse of dimensionality, we introduce sub-environments, subsets of the environment, in an effort to only utilize relevant regions. A sub-environment is composed of a sequence of regions and a state that encodes proximity and regional states of those regions. Formally, we let the sub-environment region sequence, \vec{x}, be a sequence of regions of length up to N_ϵ ∈ Z≥1. The kth region in \vec{x} is denoted by \vec{x}_k. The regional state of \vec{x}_k is given by s_{\vec{x}_k}. For example, \vec{x} = [x_2, x_1, x_3] is a valid sub-environment region sequence with N_ϵ ≥ 3; the first region in \vec{x} is \vec{x}_1 = x_2, and the regional state of \vec{x}_1 is s_{\vec{x}_1} = s_{x_2}. In order to simulate the sub-environment, the agent must know if there are repeated regions in \vec{x}. Let ξ(k, \vec{x}) return the first index h of \vec{x} such that \vec{x}_h = \vec{x}_k. Define the repeated region list to be ξ_{\vec{x}} = [ξ(1, \vec{x}), ..., ξ(N_ϵ, \vec{x})]. Let φ_t(x_i, x_j) : x_i, x_j → Z be the abstracted amount of time it takes for an agent to move from x_i to x_j, or ∞ if no path exists. Let

s = [s_{\vec{x}_1}, ..., s_{\vec{x}_{N_ϵ}}] × ξ_{\vec{x}} × [φ_t(\vec{x}_1, \vec{x}_2), ..., φ_t(\vec{x}_{N_ϵ−1}, \vec{x}_{N_ϵ})]

be the sub-environment state for a given \vec{x}, and let S_ϵ be the set of all possible sub-environment states. We define a sub-environment to be ϵ = ⟨\vec{x}, s⟩.

In general, we allow a sub-environment to contain any region that is reachable in finite time. However, in practice, we only allow agents to choose sub-environments that they can traverse within some reasonable time in order to reduce the number of possible sub-environments and save onboard memory. In what follows, we use the notation ϵ.s to denote the sub-environment state of sub-environment ϵ.
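A sketch of how the sub-environment state s could be assembled from a region sequence is given below; regional_state_of and travel_time are assumed, designer-supplied stand-ins for the regional state s_{x_i} and the abstracted travel time φ_t.

    def repeated_region_list(xs):
        # xi(k, x_vec): index of the first occurrence of x_vec[k] in x_vec (1-based, as in the text).
        return [xs.index(xs[k]) + 1 for k in range(len(xs))]

    def sub_environment_state(xs, regional_state_of, travel_time):
        # s = [regional states] x [repeated-region list] x [abstracted travel times
        # between consecutive regions of the sequence].
        regional_states = tuple(regional_state_of(x) for x in xs)
        xi_list = tuple(repeated_region_list(xs))
        hops = tuple(travel_time(xs[k], xs[k + 1]) for k in range(len(xs) - 1))
        return (regional_states, xi_list, hops)

    # Example: the sequence [x2, x1, x2] has repeated-region list [1, 2, 1].
    print(repeated_region_list(['x2', 'x1', 'x2']))   # [1, 2, 1]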

    4.1.4 Task trajectory

We also define an ordered list of tasks that the agents execute with respect to a sub-environment ϵ. Let \vec{τ} = [\vec{τ}_1, ..., \vec{τ}_{N_ϵ−1}] be an ordered list of feasible tasks such that \vec{τ}_k.x_i = ϵ.\vec{x}_k and \vec{τ}_k.x_j = ϵ.\vec{x}_{k+1} for all k ∈ {1, ..., N_ϵ − 1}, where x_i and x_j are the regions in the definition of task \vec{τ}_k. Agents generate this ordered list of tasks assuming that they will execute each of them in order. The probability distribution on the time of completing the kth task in \vec{τ} (after completing all previous tasks in \vec{τ}) is given by \vec{Pr}^t_k. We define \vec{Pr}^t = [\vec{Pr}^t_1, ..., \vec{Pr}^t_{N_ϵ−1}] to be the ordered list of probability distributions. We construct the task trajectory to be ϑ = ⟨\vec{τ}, \vec{Pr}^t⟩, which is used to determine the finite-time reward for a sub-environment.


    4.1.5 Sub-environment SMDP

As tasks are completed, the environment evolves, so we denote by E′ the environment after an agent has performed a task. Because the agents perform tasks that reason over abstracted objective states, there are many possible initial and outcome environments. The exact reward that an agent receives when acting on the environment is a function of E and E′, which is complex to determine due to our use of abstracted objective states. We determine the reward an agent receives for completing τ as a probabilistic function Pr_r that is determined heuristically. Let r_ϵ be the abstracted reward function, determined by

r_ϵ(τ) = ∑_{r∈R} Pr_r(r | τ) r,    (4.1)

which is the expected reward for completing τ given the state of the sub-environment. Next we define the sub-environment evolution procedure. Note that agents must begin in region ϵ.\vec{x}_1 and end up in region ϵ.\vec{x}_2 by definition of the task. Evolving a sub-environment consists of two steps. First, the first element of the sub-environment region sequence is removed, so the length of the sub-environment region sequence ϵ.\vec{x} is reduced by 1. Next, the sub-environment state ϵ.s is recalculated with the new sub-environment region sequence. To do this, we draw the sub-environment state from a probability distribution and determine the sub-environment after completing the task,

ϵ′ = ⟨\vec{x} = [\vec{x}_2, ..., \vec{x}_{N_ϵ}], s ∼ Pr_s(ϵ′.s | ϵ.s, τ)⟩.    (4.2)

Finally, we can represent the process of executing tasks in sub-environments as the sub-environment SMDP M = ⟨S_ϵ, Γ, Pr_s, r_ϵ, Pr_t⟩. The goal of the agent is to determine a policy π_ϵ : ϵ.s → τ that yields the greatest reward in M. The state value under policy π_ϵ is given by

V^{π_ϵ}(ϵ.s) = r_ϵ + ∑_{t_ϵ} Pr_t(t_ϵ) γ^{t_ϵ} ∑_{ϵ.s′∈S_ϵ} Pr_s V^{π_ϵ}(ϵ.s′).    (4.3)

We strive to generate a policy that yields the optimal state value,

π*_ϵ(ϵ.s) = argmax_{π_ϵ} V^{π_ϵ}(ϵ.s),

with associated optimal value V^{π*_ϵ}(ϵ.s) = max_{π_ϵ} V^{π_ϵ}(ϵ.s).
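The evolution step (4.2), together with the kind of sampled rollout used later by TaskSearch, can be sketched as follows; sample_next_state, sample_time, and expected_reward are hypothetical stand-ins for Pr_s, Pr_t, and r_ϵ, and the discount value is arbitrary.

    def evolve(sub_env, task, sample_next_state):
        # Drop the first region of the sequence and redraw the sub-environment state,
        # following (4.2). Here sub_env is a pair (region_sequence, state).
        xs, s = sub_env
        new_xs = xs[1:]
        return (new_xs, sample_next_state(s, task, new_xs))   # draw from Pr_s(. | eps.s, tau)

    def simulate_rollout(sub_env, policy, sample_next_state, sample_time,
                         expected_reward, gamma=0.95):
        # Accumulate the discounted reward of following a fixed policy until the region
        # sequence is exhausted (mirrors the recursion used later in TaskSearch).
        total, discount = 0.0, 1.0
        while sub_env[0]:                                 # while regions remain
            tau = policy(sub_env[1])
            t = sample_time(sub_env[1], tau)              # draw from Pr_t(t | tau)
            total += discount * expected_reward(sub_env[1], tau)   # r_eps(tau)
            discount *= gamma ** t
            sub_env = evolve(sub_env, tau, sample_next_state)
        return total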

Remark 4.1 (Extension to heterogeneous swarms) The framework described above can also handle UxV agents with heterogeneous capabilities. In order to do this, one can consider the possibility of any given agent having a unique set of controls which allow it to complete some tasks more quickly than others. The agents use our framework to develop a policy that maximizes their rewards with respect to their own capability, which is implicitly encoded in the reward function. For example, if an agent chooses to serve some objective and has no low-level control policy that can achieve it, then \vec{Pr}^t_k(∞) = 1 and the agent will never complete it. In this case, the agent would naturally receive a reward of 0 for the remainder of the trajectory. •

    4.2 Multi-agent abstractions

Due to the large environment induced by the action coupling of multi-agent joint planning, determining the optimal policy is computationally infeasible. To reduce the computational burden on any given agent, we restrict the number of coupled interactions. In this section, we modify the sub-environment, task trajectory, and rewards to allow for multi-agent coupled interactions. The following discussion is written from the perspective of an arbitrary agent labeled α in the swarm, where other agents are indexed by β ∈ A.

    4.2.1 Sub-environment with interaction set

Agent α may choose to interact with other agents in its interaction set I_α ⊆ A while executing \vec{τ}_α in order to more effectively complete the tasks. The interaction set is constructed as a parameter of the sub-environment and indicates to the agent which tasks should be avoided based on the other agents' trajectories. Let N be a (user-specified) maximum number of agents that an agent can interact with (hence |I_α| ≤ N at all times). The interaction set is updated by adding another agent β and their interaction set, I_α = I_α ∪ {β} ∪ I_β. If |I_α ∪ {β} ∪ I_β| > N, then we consider agent β to be an invalid candidate. Adding β's interaction set is necessary because tasks that affect the task trajectory ϑ_β may also affect all agents in I_β. Constraining the maximum interaction set size reduces the large state size that occurs when agents' actions are coupled. To avoid interacting with agents not in the interaction set, we create a set of waypoints that are off-limits when creating a trajectory.

We define a claimed regional objective as θ = ⟨O_b, x_i⟩. The agent creates a set of claimed region objectives Θ_α = {θ_1, ..., θ_{N_ϵ−1}} that contains a claimed region objective for every task in its trajectory and describes a waypoint in O_b in a region that the agent is planning to service. We define the global claimed objective set to be Θ^A = {Θ_1, ..., Θ_{|A|}}, which contains the claimed region objective sets of all agents. Lastly, let Θ′_α = Θ^A \ {∪_{β∈I_α} Θ_β} be the complete set of claimed objectives an agent must avoid when planning a trajectory. The agent uses Θ′_α to modify its perception of the environment. As shown in the following function, the agent sets the state of claimed objectives in Θ′_α to 0, removing the appropriate tasks from the feasible task set:

s^b_{Θ′_α,\vec{x}_k} = 0 if ⟨O_b, ϵ_α.\vec{x}_k⟩ ∈ Θ′_α, and s^b_{Θ′_α,\vec{x}_k} = s^b_{\vec{x}_k} otherwise.    (4.4)
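A sketch of the masking rule (4.4) is given below: objectives claimed by agents outside the interaction set are zeroed out in the perceived regional states. The container names are illustrative.

    def mask_regional_states(region_sequence, regional_states, claimed):
        # regional_states[k][b] is the abstracted state of objective b in the k-th region of
        # the sub-environment sequence; claimed is the set Theta'_alpha of
        # (objective index, region) pairs reserved by agents outside the interaction set.
        masked = []
        for k, region in enumerate(region_sequence):
            masked.append(tuple(
                0 if (b, region) in claimed else s_b
                for b, s_b in enumerate(regional_states[k])))
        return tuple(masked)

    # Example: objective 1 in region 'x3' is claimed elsewhere, so its state is zeroed.
    print(mask_regional_states(['x3'], [(2, 1)], claimed={(1, 'x3')}))   # ((2, 0),)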

Let \vec{S}_{Θ′_α,\vec{x}} = \vec{S}_{Θ′_α,\vec{x}_1} × ... × \vec{S}_{Θ′_α,\vec{x}_{N_ϵ}}, where \vec{S}_{Θ′_α,\vec{x}_k} = (s^{O_1}_{Θ′_α,\vec{x}_k}, ..., s^{O_{|O|}}_{Θ′_α,\vec{x}_k}). In addition to the modified sub-environment state, we include the partial trajectories of the other agents being interacted with. Consider β's trajectory ϑ_β and an arbitrary ϵ_α. Let ϑ^p_{β,k} = ⟨ϑ_β.\vec{τ}_k.s^b_i, ϑ_β.\vec{τ}_k.b⟩. The partial trajectory ϑ^p_β = [ϑ^p_{β,1}, ..., ϑ^p_{β,|ϑ_β|}] describes β's trajectory with respect to ϵ_α.\vec{x}. Let ξ(k, ϵ_α, ϵ_β) return the first index h of ϵ_α.\vec{x} such that ϵ_β.\vec{x}_h = ϵ_α.\vec{x}_k, or 0 if there is no match. Each agent in the interaction set creates a matrix Ξ of elements ξ(k, ϵ_α, ϵ_β), for k ∈ {1, ..., N_ϵ} and β ∈ {1, ..., |I_α|}. We finally determine the complete multi-agent state,

s = \vec{S}_{Θ′_α,\vec{x}} × {⟨ϑ^p_1, Ξ_1⟩, ..., ⟨ϑ^p_{|I_α|}, Ξ_{|I_α|}⟩} × [φ_t(\vec{x}_1, \vec{x}_2), ..., φ_t(\vec{x}_{N_ϵ−1}, \vec{x}_{N_ϵ})].

With these modifications to the sub-environment state, we define the (multi-agent) sub-environment as ϵ_α = ⟨\vec{x}, s, I_α⟩.

    4.2.2 Multi-agent action abstraction

We consider the effect that an agent α has on another agent β ∈ A when executing a task that affects s^b_i in β's trajectory. Some tasks will be completed sooner with two or more agents working on them, for instance. For all β ∈ I_α, let t_β be the time that β begins a task that transitions s^b_i → s′^b_i. If agent β does not contain such a task in its trajectory, then t_β = ∞. Let T^A_{I_α} = [t_1, ..., t_{|I_α|}]. We denote by Pr^t_{I_α}(t | τ, T^A_{I_α}) the probability that τ is completed at exactly time t if other agents work on transitioning s^b_i → s′^b_i. We redefine here the definition of \vec{Pr}^t in Section 4.1.4 as \vec{Pr}^t = [\vec{Pr}^t_{1,I_α}, ..., \vec{Pr}^t_{N_ϵ−1,I_α}], which is the probability time set of α modified by accounting for other agents' trajectories. Furthermore, if an agent chooses a task that modifies agent α's trajectory, we define the probability time set to be \vec{Pr}^{t′} = [\vec{Pr}^{t′}_{1,I_α}, ..., \vec{Pr}^{t′}_{N_ϵ−1,I_α}]. With these modifications, we redefine the task trajectory to be ϑ = ⟨\vec{τ}, \vec{Pr}^t⟩. Finally, we designate X_ϑ to be the set that contains all trajectories of the agents.

    4.2.3 Multi-agent sub-environment SMDP

We modify the reward abstraction so that each agent takes into account the agents that it may interact with. When α interacts with other agents, it modifies the expected discounted reward gained by those agents. We define the interaction reward function, which returns a reward based on whether the agent executes a task that interacts with one or more other agents. The interaction reward function is defined as

r_φ(τ, X_ϑ) = R(τ, X_ϑ) if τ ∈ ϑ^p_β for some β, and r_φ(τ, X_ϑ) = r_ϵ(τ) otherwise.    (4.5)

Here, the term R represents a designed reward that the agent receives for completing τ when it is shared by other agents. This expression quantifies the effect that an interacting task has on an existing task. If a task helps another agent's trajectory in a significant way, the agent may choose a task that aids the global expected reward amongst the agents. Let the multi-agent sub-environment SMDP be defined as the tuple M = ⟨S_ϵ, Γ, Pr_s, r_φ, Pr^t_{I_α}⟩. The state value from (4.3) is updated using (4.5):

V^{π_ϵ}(ϵ.s) = r_φ + ∑_{t_ϵ} Pr_t(t_ϵ) γ^{t_ϵ} ∑_{ϵ.s′∈S_ϵ} Pr_s V^{π_ϵ}(ϵ.s′).    (4.6)

We strive to generate a policy that yields the optimal state value,

π*_ϵ(ϵ.s) = argmax_{π_ϵ} V^{π_ϵ}(ϵ.s),

with associated optimal value V^{π*_ϵ}(ϵ.s) = max_{π_ϵ} V^{π_ϵ}(ϵ.s). Our next section introduces an algorithm for approximating this optimal state value.
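The interaction reward (4.5) simply switches between a designed shared-task reward and the single-agent abstracted reward; a minimal sketch, with shared_reward and base_reward standing in for R(τ, X_ϑ) and r_ϵ(τ):

    def interaction_reward(task, other_partial_trajectories, shared_reward, base_reward):
        # other_partial_trajectories: iterable of partial trajectories (lists of tasks)
        # of the agents in the interaction set.
        for partial in other_partial_trajectories:
            if task in partial:                               # tau appears in some theta^p_beta
                return shared_reward(task, other_partial_trajectories)   # designed reward R
        return base_reward(task)                              # fall back to r_eps(tau)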

5 Dynamic Domain Reduction for Multi-Agent Planning

This section describes our algorithmic solution to approximate the optimal policy π*_ϵ. The Dynamic Domain Reduction for Multi-Agent Planning algorithm consists of three main functions: DDRMAP, TaskSearch, and SubEnvSearch (pseudocode of functions denoted with † is omitted but described in detail). Algorithm 1 presents a formal description in the multi-agent case, where each agent can interact with a maximum of N other agents for planning purposes. In the case of a single agent, we take N = 0 and refer to our algorithm as Dynamic Domain Reduction Planning (DDRP). In what follows, we first describe the variables and parameters employed in the algorithm and then discuss each of the main functions and the role of the support functions.

Algorithm 1: Dynamic Domain Reduction for Multi-Agent Planning

 1  Ω_x = ∪{x} ∀x
 2  E ← current environment
 3  Θ^A ← ∪ Θ_β, ∀β ∈ A
 4  Q̂, V_x, N_{ϵ.s}, N_b ← loaded from previous trials
 5  DDRMAP(Ω_x, N_ϵ, E, Θ^A, Q̂, V_x, N_{ϵ.s}, N_b, M):
 6      Y_ϑ = ∅
 7      while run time < step time:
 8          ϵ ← SubEnvSearch(Ω_x, E, Θ^A, V_x)
 9          TaskSearch(Q̂, N_{ϵ.s}, N_b, ϵ)
10          Y_ϑ = Y_ϑ ∪ {MaxTrajectory(ϵ)}
11      return Y_ϑ
12  TaskSearch(Q̂, N_{ϵ.s}, N_b, ϵ):
13      if ϵ.\vec{x} is empty:
14          return 0
15      τ ← argmax_{τ∈M.Γ} { Q̂[ϵ.s][τ.b] + 2 C_p √( ln N_{ϵ.s}[ϵ.s] / N_b[ϵ.s][τ.b] ) }
16      t ← Sample† M.Pr_t(t | τ, ϵ)
17      ϵ′ = ⟨[ϵ.\vec{x}_2, ..., ϵ.\vec{x}_{N_ϵ}], Sample† M.Pr_s(ϵ.s, τ)⟩
18      r = M.r_ϵ(τ, ϵ) + γ^t TaskSearch(Q̂, N_{ϵ.s}, N_b, ϵ′)
19      TaskValueUpdate(Q̂, N_{ϵ.s}, N_b, ϵ.s, τ.b, r)
20      return r
21  TaskValueUpdate(N_{ϵ.s}, N_b, Q̂, ϵ.s, τ.b, r):
22      N_{ϵ.s}[ϵ.s] = N_{ϵ.s}[ϵ.s] + 1
23      N_b[ϵ.s][τ.b] = N_b[ϵ.s][τ.b] + 1
24      Q̂[ϵ.s][τ.b] = Q̂[ϵ.s][τ.b] + (1 / N_b[ϵ.s][τ.b]) (r − Q̂[ϵ.s][τ.b])
25  SubEnvSearch(Ω_x, E, Θ^A, V_x):
26      X_x = ∅
27      while |X_x| < N_ϵ:
28          for x ∈ Ω_x:
29              if V_x[X_x ∪ {x}] is empty:
30                  V_x[X_x ∪ {x}] = max_{τ∈M.Γ} Q̂[InitSubEnv(X_x ∪ {x}, E, Θ^A).s][τ]
31          x = argmax_{x∈Ω_x} V_x[X_x ∪ {x}]
32          X_x = X_x ∪ {x}
33      return InitSubEnv(X_x, E, Θ^A)
34  InitSubEnv(X_x, E, Θ^A):
35      \vec{x} = argmax over orderings \vec{x} of X_x of { max_{τ∈M.Γ} Q̂[GetSubEnvState(\vec{x}, Θ^A)][τ] }
36      ϵ = ⟨\vec{x}, GetSubEnvState(\vec{x}, Θ^A)⟩
37      return ϵ

The following variables are common across the multi-agent system: the set of regions Ω_x, the current environment E, the claimed objective set Θ^A, and the multi-agent sub-environment SMDP M. Some variables can be loaded or initialized to zero, such as the number of times an agent has visited a state, N_{ϵ.s}, the number of times an agent has taken an action in a state, N_b, the estimated value of taking an action in a state, Q̂, and the estimated value of selecting a region in the sub-environment search process, V_x. The set V_ϵ contains sub-environments as they are explored by the agent.

The main function DDRMAP structures Q-learning with domain reduction of the environment. In essence, DDRMAP maps the environment into a sub-environment where it can use a pre-constructed SMDP and an upper confidence bound tree search to determine the value of the sub-environment. DDRMAP begins by initializing the set of constructed trajectories Y_ϑ as an empty set. The function uses SubEnvSearch to find a suitable sub-environment from the given environment, then TaskSearch is used to evaluate that sub-environment. MaxTrajectory constructs a trajectory using the sub-environment, which is added to Y_ϑ. This process is repeated for an allotted amount of time. The function returns the set of constructed trajectories Y_ϑ.

TaskSearch is a modification of Monte Carlo tree search. Given a sub-environment ϵ, the function finds an appropriate task τ to exploit and explore the SMDP. On line 15 we select a task based on the upper confidence bound of Q̂. We simulate executing the task by sampling Pr_t for the amount of time it takes to complete the task. We then evolve the sub-environment to get ϵ′ by following the sub-environment evolution procedure (4.2). On line 18, we get the discounted reward of the sub-environment by summing the reward for the current task, using (4.1) and (4.5), and the reward returned by recursively calling TaskSearch with the sampled evolution of the sub-environment ϵ′ at a discount. The recursive process is terminated at line 13 when the sub-environment no longer contains regions in ϵ.\vec{x}. TaskValueUpdate is called after each recursion and updates N_{ϵ.s}, N_b, and Q̂. On line 24, Q̂ is updated by recalculating the average reward over all experiences given the task and sub-environment state, which is done by using N_b^{−1} as the learning rate. The time complexity of TaskSearch is O(|Γ| N_ϵ) due to the task selection on line 15 and the recursive depth being the size of the sub-environment N_ϵ.
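The update on line 24 is the standard incremental-mean form of Q-learning with learning rate 1/N_b; a small sketch, with tables keyed by (sub-environment state, objective index) as a hypothetical implementation choice:

    from collections import defaultdict

    N_state = defaultdict(int)        # N_{eps.s}: visits per sub-environment state
    N_action = defaultdict(int)       # N_b      : visits per (state, objective) pair
    Q_hat = defaultdict(float)        # Q_hat    : estimated value per (state, objective) pair

    def task_value_update(state, objective, r):
        # Incremental mean: Q <- Q + (r - Q) / N, so Q_hat is the running average of all
        # sampled returns observed for this (state, objective) pair.
        N_state[state] += 1
        N_action[(state, objective)] += 1
        Q_hat[(state, objective)] += (r - Q_hat[(state, objective)]) / N_action[(state, objective)]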

We employ SubEnvSearch to explore and find the value of possible sub-environments in the environment. The function InitSubEnv maps a set of regions X_x ⊆ Ω_x (we use the subindex 'x' to emphasize that this set contains regions) to the sub-environment with the highest expected reward. We do this by finding the sequence of regions \vec{x}, given X_x, that maximizes max_{τ∈Γ} Q̂[ϵ.s][τ] on line 35. We keep track of the expected value of choosing X_x, and of the sub-environment that is returned by InitSubEnv, with V_x. SubEnvSearch begins by initializing X_x to the empty set. The region that increases the value V_x the most when appended to X_x is then appended to X_x. This process is repeated until the length of X_x is N_ϵ. Finally, the best sub-environment given X_x is returned with InitSubEnv. The time complexities of InitSubEnv and SubEnvSearch are O(N_ϵ!) and O(|Ω_x| N_ϵ log(N_ϵ!)), respectively. InitSubEnv requires iteration over all possible permutations of X_x; however, in practice we use heuristics to reduce the computation time.
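The greedy construction of X_x described above can be sketched as follows; init_sub_env and value_of are hypothetical stand-ins for InitSubEnv and max_{τ∈Γ} Q̂[·][τ], and the memo dictionary plays the role of V_x.

    def sub_env_search(regions, N_eps, init_sub_env, value_of, memo=None):
        # Greedily add the region that most increases the value of the best
        # sub-environment built from the selected set, until N_eps regions are chosen.
        memo = {} if memo is None else memo          # plays the role of V_x
        chosen = frozenset()
        for _ in range(N_eps):
            best_region, best_value = None, float('-inf')
            for x in regions:
                if x in chosen:
                    continue
                candidate = chosen | {x}
                if candidate not in memo:
                    memo[candidate] = value_of(init_sub_env(candidate))
                if memo[candidate] > best_value:
                    best_region, best_value = x, memo[candidate]
            if best_region is None:
                break
            chosen = chosen | {best_region}
        return init_sub_env(chosen)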

    6 Convergence and performance analysis

In this section, we look at the performance of individual elements of our algorithm. First, we establish the convergence of the estimated value of performing a task determined over time using TaskSearch. We build on this result to characterize the performance of the SubEnvSearch and of sequential multi-agent deployment.

    6.1 TaskSearch estimated value convergence

We start by making the following assumption about the sub-environment SMDP.

Assumption 6.1 There always exists a task that an agent can complete in finite time. Furthermore, no task can be completed in zero time steps.

Assumption 6.1 is reasonable because, if it did not hold, the agent's actions would be irrelevant and the scenario trivial. The following result characterizes the long-term performance of the function TaskSearch, which is necessary for the analysis of other elements of the algorithm.

Theorem 6.1 Let Q̂ be the estimated value of performing a task determined over time using TaskSearch. Under Assumption 6.1, Q̂ converges to the optimal state value V*_ϵ of the sub-environment SMDP with probability 1.

Proof. SMDP Q-learning converges to the optimal value under the following conditions [34], rewritten here to match our notation:

(i) state and action spaces are finite;
(ii) Var{r_ϵ} is finite;
(iii) ∑_{p=1}^∞ α_p(ϵ.s, τ) = ∞ and ∑_{p=1}^∞ α_p^2(ϵ.s, τ) < ∞ uniformly over ϵ.s, τ;
(iv) 0 < B_max = max_{ϵ.s∈S_ϵ, τ∈Γ} ∑_t Pr_t(t | ϵ.s, τ) γ^t < 1.

In the construction of the sub-environment SMDP, we assume that sub-environment lengths and the number of objectives are finite, satisfying (i). We reason over the expected value of the reward in R, which is determined heuristically, implying that Var{r_ϵ} = 0 and satisfying (ii). From TaskValueUpdate on line 24, we have that α_p(ϵ.s, τ) = 1/p if we substitute p for N_b. Therefore, (iii) is satisfied because ∑_{N_b=1}^∞ 1/N_b = ∞ and ∑_{N_b=1}^∞ (1/N_b)^2 = π²/6 (finite). Lastly, to satisfy (iv), we use Assumption 6.1 (there always exists some τ such that Pr_t(∞ | ϵ.s, τ) < 1) and the fact that γ ∈ (0, 1) to ensure that B_max is always greater than 0. We use Assumption 6.1 again (for all ϵ.s, τ, Pr_t(t > 0 | ϵ.s, τ) = 1) to ensure that B_max is always less than 1. •

In the following, we consider a version of our algorithm that is trained offline, called Dynamic Domain Reduction Planning: Online+Offline (DDRP-OO). DDRP-OO utilizes data that it learned from previous experiments in similar environments. In order to do this, we train DDRP offline and save the state values for online use. We use the result of Theorem 6.1 as justification for the following assumption.

Assumption 6.2 Agents using DDRP-OO are well trained, i.e., Q̂ = V*_ϵ.

In practice, we accomplish this by running DDRP on randomized environments until Q̂ remains unchanged for a substantial amount of time, an indication that it has nearly converged to V*_ϵ. The study of DDRP-OO gives insight into the tree-search aspect of finding a sub-environment in DDRP and gives intuition about its long-term performance.

    6.2 Sub-environment search by a single agent

We are interested in how well the sub-environments are chosen with respect to the best possible sub-environment. In our study, we make the following assumption.

Assumption 6.3 Rewards are positive. Objectives are uncoupled, meaning that the reward for completing one objective is independent of the completion of any other objective. Furthermore, objectives only require one agent's service for completion.

Our technical approach builds on the submodularity framework, cf. Appendix A, to establish analytical guarantees of the sub-environment search. The basic idea is to show that the algorithmic components of this procedure can be cast as a greedy search with respect to a conveniently defined set function. Let Ω_x be a finite set of regions. InitSubEnv takes a set of regions X_x ⊆ Ω_x and returns the sub-environment made up of regions in X_x in optimal order. We define the power set function f_x : 2^{Ω_x} → R, mapping each set of regions to the discounted reward that an agent expects to receive for choosing the corresponding sub-environment,

f_x(X_x) = max_{τ∈Γ} Q̂[InitSubEnv(X_x).s][τ].    (6.1)

For convenience, let X*_x = argmax_{X_x⊆Ω_x} f_x(X_x) denote the set of regions that yields the sub-environment with the highest expected reward amongst all possible sub-environments. The following counterexample shows that f_x is, in general, not submodular.

Lemma 6.1 Let f_x be the discounted reward that an agent expects to receive for choosing the corresponding sub-environment given a set of regions, as defined in (6.1). Under Assumptions 6.2-6.3, f_x is not submodular.

Proof. We provide a counterexample to show that f_x is not submodular in general. Consider a 1-dimensional environment. For simplicity, let regions be singletons, where Ω_x = {x_0 = {0}, x_1 = {−1}, x_2 = {1}, x_3 = {2}}. Assume that only one objective exists, which is to enter a region. For this objective, agents only get rewarded for entering a region the first time. Let the time required to complete a task be directly proportional to the distance between regions, t = |τ.x_i − τ.x_j|. Let X_x = {x_1} ⊂ Y_x = {x_1, x_3} ⊂ Ω_x. f_x is submodular only if the marginal gain of including {x_2} is greater for X_x than for Y_x. Assuming that the agent begins in x_0, one can verify that the region sequences of the sub-environments returned by InitSubEnv are as follows:

InitSubEnv(X_x) → \vec{x} = [x_1],
InitSubEnv(X_x ∪ {x_2}) → \vec{x} = [x_1, x_2],
InitSubEnv(Y_x) → \vec{x} = [x_1, x_3],
InitSubEnv(Y_x ∪ {x_2}) → \vec{x} = [x_2, x_3, x_1].

Assuming that satisfying each task yields the same reward r, we can calculate the marginal gain as

f_x(X_x ∪ {x_2}) − f_x(X_x) = (γ^{t_1} r + γ^{t_1} γ^{t_2} r) − (γ^{t_1} r) ≈ 0.73 r,

evaluated at t_1 = |x_1 − x_0| = 1 and t_2 = |x_2 − x_1| = 2. The marginal gain for appending {x_2} to Y_x is

f_x(Y_x ∪ {x_2}) − f_x(Y_x) = (γ^{t_3} r + γ^{t_3} γ^{t_4} r + γ^{t_3} γ^{t_4} γ^{t_5} r) − (γ^{t_1} r + γ^{t_1} γ^{t_2} r) ≈ 0.74 r,

evaluated at t_1 = |x_1 − x_0| = 1, t_2 = |x_3 − x_1| = 3, t_3 = |x_2 − x_0| = 1, t_4 = |x_3 − x_2| = 1, and t_5 = |x_1 − x_3| = 3, showing that the marginal gain for including {x_2} is greater for Y_x than for X_x. Hence, f_x is not submodular. •
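The two marginal gains can be checked numerically. The discount factor is not stated in the excerpt, so the sketch below assumes γ = 0.9 and r = 1, which reproduces the quoted values of roughly 0.73r and 0.74r.

    gamma, r = 0.9, 1.0

    def value(times):
        # Discounted reward of visiting a chain of regions whose consecutive travel
        # times are given; rewards accrue at cumulative times t1, t1+t2, ...
        total, elapsed = 0.0, 0
        for t in times:
            elapsed += t
            total += gamma ** elapsed * r
        return total

    # X_x = {x1}: sequence [x1] vs. [x1, x2] (travel times 1, then 2).
    gain_X = value([1, 2]) - value([1])
    # Y_x = {x1, x3}: sequence [x1, x3] (times 1, 3) vs. [x2, x3, x1] (times 1, 1, 3).
    gain_Y = value([1, 1, 3]) - value([1, 3])
    print(round(gain_X, 3), round(gain_Y, 3))   # 0.729  0.744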

Even though f_x is not submodular in general, one can invoke the notion of submodularity ratio to provide a guaranteed lower bound on the performance of the sub-environment search. According to (A.4), the submodularity ratio of a function f_x is the largest scalar λ ∈ [0, 1] such that

λ ≤ [ ∑_{z∈Z_x} f_x(X_x ∪ {z}) − f_x(X_x) ] / [ f_x(X_x ∪ Z_x) − f_x(X_x) ]    (6.2)

for all X_x, Z_x ⊆ Ω_x. This ratio measures how far the function is from being submodular. The following result provides a guarantee on the expected reward with respect to the optimal sub-environment choice in terms of the submodularity ratio.

Theorem 6.2 Let X_x be the region set returned by the sub-environment search algorithm in DDRP-OO. Under Assumptions 6.2-6.3, it holds that f_x(X_x) ≥ (1 − e^{−λ}) f_x(X*_x).

Proof. According to Theorem A.3, we need to show that f_x is a monotone set function, that f_x(∅) = 0, and that the sub-environment search algorithm has greedy characteristics. Because of Assumption 6.3, adding regions to X_x monotonically increases the expected reward, hence equation (A.2) is satisfied. Next, we note that if X_x = ∅, then InitSubEnv returns an empty sub-environment, which implies from equation (6.1) that f_x(∅) = 0. Lastly, by construction, the first iteration of the sub-environment search adds regions to the sub-environment one at a time in a greedy fashion. Because the sub-environment search keeps in memory the sub-environment with the highest expected reward, the entire algorithm is lower bounded by the first iteration of the sub-environment search. The result now follows from Theorem A.3. •

Given the generality of our proposed framework, the submodularity ratio of f_x is in general difficult to determine. To deal with this, we resort to tools from scenario optimization, cf. Appendix B, to obtain an estimate of the submodularity ratio. The basic observation is that, from its definition, the computation of the submodularity ratio can be cast as a robust convex optimization problem. Solving this optimization problem is difficult given the large number of constraints that need to be considered. Instead, the procedure for estimating the submodularity ratio samples the current environment that an agent is in, randomizing the optimization parameters. The human supervisor chooses confidence and violation parameters, ϖ and ε, that are satisfactory.

Algorithm 2: Submodularity ratio estimation

 1  ∆ ← set of all possible pairs of X_x, Z_x
 2  Pr_δ ← probability distribution for sampling δ from ∆
 3  submodularityRatioEstimation(ϖ, ε, Ω_x, ∆, Pr_δ):
 4      λ⁺ = 1, d = 1, h = ∅
 5      N_SCP = (2/ε)(ln(1/ϖ) + d)
 6      for n = 0; n < N_SCP; n++:
 7          X_x, Z_x = Sample†(Pr_δ)
 8          λ_δ = [ ∑_{z∈Z_x} f_x(X_x ∪ {z}) − f_x(X_x) ] / [ f_x(X_x ∪ Z_x) − f_x(X_x) ]
 9          if λ_δ < λ⁺:
10              λ⁺ = λ_δ
11              h.append(λ_δ, n)
12      a, b = argmin_{a,b∈R} ∑_{(λ_δ,n)∈h} (λ_δ − a − b n)²
13      λ̂ = a + b |∆|
14      return λ̂

submodularityRatioEstimation, cf. Algorithm 2, iterates through randomly sampled parameters while maintaining the maximum submodularity ratio that does not violate the sampled constraints. Once the agent has completed N_SCP sample iterations, we extrapolate the history of the evolution of the submodularity ratio to get λ̂, the approximate submodularity ratio. We do this by using a simple linear regression in lines 12-13 and evaluating the expression at n = |∆|, the cardinality of the constraint parameter set, to determine an estimate for the robust convex optimization problem.
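For concreteness, the sample count prescribed on line 5 of Algorithm 2, N_SCP = (2/ε)(ln(1/ϖ) + d), can be computed as below; the chosen values of ϖ and ε are arbitrary examples.

    import math

    def n_scp(confidence, violation, d=1):
        # confidence plays the role of the parameter varpi, violation the role of epsilon,
        # and d is the number of decision variables (here 1, since only lambda is optimized).
        return math.ceil((2.0 / violation) * (math.log(1.0 / confidence) + d))

    print(n_scp(confidence=0.01, violation=0.05))   # 225 samples for varpi = 0.01, eps = 0.05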

The following result justifies to what extent the obtained ratio is a good approximation of the actual submodularity ratio.

Lemma 6.2 Let λ̂ be the approximate submodularity ratio returned by submodularityRatioEstimation with inputs ϖ and ε. With probability 1 − ϖ, up to an ε-fraction of constraints will be violated with respect to the robust convex optimization problem (B.1).

Proof. First, we show that submodularityRatioEstimation can be formulated as a scenario convex program and that it satisfies the convex constraint condition in Theorem B.1. Lines 6-11 provide a solution, λ⁺, to the following scenario convex program:

λ⁺ = argmin_{λ⁻∈R} −λ⁻
     s.t. f_{δ_i}(λ⁻) ≤ 0,  i = 1, ..., N_SCP,

where line 8 is a convex function that comes from equation (6.2). Since f_{δ_i} is a convex function, we can apply Theorem B.1; with probability 1 − ϖ, λ⁺ violates at most an ε-fraction of constraints in ∆.

The simple linear regression portion of the algorithm, lines 12-13, uses data points (λ_δ, n) that are only included in h when λ_δ < λ⁺. Therefore, the slope b of the linear regression is strictly negative. On line 13, λ̂ is evaluated at n = |∆| ≥ N_SCP, which implies that λ̂ ≤ λ⁺ and that, with probability 1 − ϖ, λ̂ violates at most an ε-fraction of constraints in ∆. •

Note that ϖ and ε can be chosen as small as desired to ensure that λ̂ is a good approximation of λ. As λ̂ approaches λ, our approximation of the lower bound on the performance of the algorithm with respect to f_x becomes more accurate. We conclude this section by studying whether the submodularity ratio is strictly positive. First, we prove that it is always positive in non-degenerate cases.

Theorem 6.3 Under Assumptions 6.1-6.3, f_x is a weakly submodular function.

Proof. We need to establish that the submodularity ratio of f_x is positive. We reason by contradiction, i.e., assume that the submodularity ratio is 0. This means that there exist X_x and Z_x such that the right-hand side of expression (6.2) is zero (this rules out, in particular, the possibility of either X_x or Z_x being empty). In particular, this implies that f_x(X_x ∪ {z}) − f_x(X_x) = 0 for every z ∈ Z_x and that f_x(X_x ∪ Z_x) − f_x(X_x) > 0. Assume that InitSubEnv(X_x ∪ {z}) yields an ordered region list [x_1, x_2, ..., z] for each z. Let r_x and t_x denote the reward and time for completing a task in a region x conditioned by the generated sub-environment InitSubEnv(X_x ∪ {z}). Then,

f_x(X_x) = γ^{t_{x_1}} r_{x_1} + γ^{t_{x_2}} r_{x_2} + ...,
f_x(X_x ∪ {z}) = γ^{t_{x_1}} r_{x_1} + γ^{t_{x_2}} r_{x_2} + ... + γ^{t_z} r_z,
f_x(X_x ∪ {z}) − f_x(X_x) = γ^{t_z} r_z,

for each z ∈ Z_x conditioned by the generated sub-environment InitSubEnv(X_x ∪ {z}). Under Assumption 6.3, the term f_x(X_x ∪ {z}) − f_x(X_x) equals 0 only when t_z is infinite. On the other hand, let r′_x and t′_x denote the reward and time for completing a task in region x conditioned by the generated sub-environment InitSubEnv(X_x ∪ Z_x). The denominator is nonzero only when t′_z is finite for some z ∈ Z_x. This cannot hold when t_z is infinite for each z ∈ Z_x without contradicting Assumption 6.3, concluding the proof. •

The next remark discusses the challenge of determining an explicit lower bound on the submodularity ratio.

Remark 6.1 Beyond the result in Theorem 6.3, it is of interest to determine an explicit positive lower bound on the submodularity ratio. In general, obtaining such a bound for arbitrary scenarios is challenging and would likely yield overly conservative results. To counter this, we believe that restricting attention to specific families of scenarios may instead lead to informative bounds. Our simulation results in Section 7.1 later suggest, for instance, that the submodularity ratio is approximately 1 in common scenarios related to scheduling spatially distributed tasks. However, formally establishing this fact remains an open problem. •

    6.3 Sequential multi-agent deployment

We explore the performance of DDRP-OO in an environment with multiple agents. We consider the following assumption for the rest of this section.

    Assumption 6.4 If an agent chooses a task that was al-ready selected in Xϑ, none of the completion time prob-ability distributions are modified. Furthermore, the ex-pected discounted reward for completing task τ giventhe set of trajectories Xϑ is

T(τ, Xϑ) = max_{ϑ ∈ Xϑ} ϑ.→Pr_{t,k} γ^t rφ = max_{ϑ ∈ Xϑ} T(τ, {ϑ}),    (6.3)

given that ϑ.→τ_k = τ.

This assumption is utilized in the following sequential multi-agent deployment algorithm.

Algorithm 3: Sequential multi-agent deployment
1  Ωx = ⋃{x}, ∀x
2  E ← current environment
3  Q̂, Vx, N�.s, Nb ← loaded from previous trials
4  Evaluate(ϑ, Xϑ):
5      val = 0
6      for →τ_k in ϑ.→τ:
7          if T(→τ_k, Xϑ) < ϑ.→Pr_{t,k}(t) γ^t rφ:
8              val = val + ϑ.→Pr_{t,k}(t) γ^t rφ − T(→τ_k, Xϑ)
9      return val
10 SequentialMultiAgentDeployment(E, A, Ωx):
11     ΘA = ∅
12     Xϑ = ∅
13     for β ∈ A:
14         ΘA = ΘA ∪ {Θβ}
15         Zϑ = DDRMAP(Ωx, N�, E, ΘA, Q̂, Vx, N�.s, Nb, M)
16         ϑα = argmax_{ϑ ∈ Zϑ} Evaluate(ϑ, Xϑ);  Xϑ = Xϑ ∪ {ϑα}
17     return Xϑ

In this algorithm, agents plan their sub-environments and task search to determine a task trajectory ϑ one at a time. The function Evaluate returns the marginal gain of including ϑ, which is the added benefit of including ϑ in Xϑ. We define the set function fϑ : 2^{Ωϑ} → R to be a metric for measuring the performance of SequentialMultiAgentDeployment as follows:

fϑ(Xϑ) = Σ_{∀τ} T(τ, Xϑ).

This function is interpreted as the sum of discounted rewards given a set of trajectories Xϑ. The definition of T from Assumption 6.4 allows us to state the following result.

Lemma 6.3 Under Assumptions 6.3-6.4, fϑ is a submodular, monotone set function.

Proof. With (6.3) and the fact that rewards are non-negative (Assumption 6.3), we have that the marginal gain is never negative; therefore, the function is monotone. For the function to be submodular, we show that it satisfies the condition of diminishing returns, meaning that fϑ(Xϑ ∪ {ϑα}) − fϑ(Xϑ) ≥ fϑ(Yϑ ∪ {ϑα}) − fϑ(Yϑ) for any Xϑ ⊆ Yϑ ⊆ Ωϑ and ϑα ∈ Ωϑ \ Yϑ. Let

G(Xϑ, ϑα) = T(τ, Xϑ ∪ {ϑα}) − T(τ, Xϑ) = max_{ϑ ∈ Xϑ ∪ {ϑα}} T(τ, {ϑ}) − max_{ϑ ∈ Xϑ} T(τ, {ϑ})

be the marginal gain of including ϑα in Xϑ. The maximum marginal gain occurs when no ϑ ∈ Xϑ shares the same tasks as ϑα. We determine the marginal gains G(Xϑ, ϑα) and G(Yϑ, ϑα) for every possible case and show that G(Xϑ, ϑα) ≥ G(Yϑ, ϑα).

Case 1: ϑα = argmax_{ϑ ∈ Xϑ ∪ {ϑα}} T(τ, {ϑ}) = argmax_{ϑ ∈ Yϑ ∪ {ϑα}} T(τ, {ϑ}). This implies that T(τ, Xϑ ∪ {ϑα}) = T(τ, Yϑ ∪ {ϑα}).

Case 2: ϑ = argmax_{ϑ ∈ Xϑ ∪ {ϑα}} T(τ, {ϑ}) = argmax_{ϑ ∈ Yϑ ∪ {ϑα}} T(τ, {ϑ}) such that ϑ ∈ Xϑ. This implies that T(τ, Xϑ ∪ {ϑα}) = T(τ, Yϑ ∪ {ϑα}).

Case 3: ϑα = argmax_{ϑ ∈ Xϑ ∪ {ϑα}} T(τ, {ϑ}), and ϑ = argmax_{ϑ ∈ Yϑ ∪ {ϑα}} T(τ, {ϑ}) such that ϑ ∈ Yϑ \ Xϑ. Thus G(Xϑ, ϑα) ≥ 0 and G(Yϑ, ϑα) = 0.

For both cases 1 and 2, since the function is monotone, we have T(τ, Yϑ) ≥ T(τ, Xϑ). Therefore, the marginal gain satisfies G(Xϑ, ϑα) ≥ G(Yϑ, ϑα) in all cases. •
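To make the role of T, Evaluate, and the greedy selection concrete, here is a minimal Python sketch of the selection logic of Algorithm 3 under Assumption 6.4. The names Trajectory, single_value, and propose are hypothetical stand-ins: single_value(τ, ϑ) plays the role of T(τ, {ϑ}) and propose abstracts the DDRMAP call on line 15. This is an illustration of the greedy structure, not the paper's implementation.

```python
from typing import Callable, List

Trajectory = str  # hypothetical identifier for a task trajectory

def T(task: str, chosen: List[Trajectory],
      single_value: Callable[[str, Trajectory], float]) -> float:
    """Discounted reward for `task` given a set of trajectories, as in (6.3):
    the best any single chosen trajectory does on that task."""
    return max((single_value(task, v) for v in chosen), default=0.0)

def evaluate(candidate: Trajectory, chosen: List[Trajectory], tasks: List[str],
             single_value: Callable[[str, Trajectory], float]) -> float:
    """Marginal gain of adding `candidate` (lines 4-9 of Algorithm 3). Tasks the
    candidate does not serve are assumed to have single_value equal to 0."""
    gain = 0.0
    for task in tasks:
        current = T(task, chosen, single_value)
        proposed = single_value(task, candidate)
        if proposed > current:
            gain += proposed - current
    return gain

def sequential_deployment(agents: List[str], tasks: List[str],
                          propose: Callable[[str, List[Trajectory]], List[Trajectory]],
                          single_value: Callable[[str, Trajectory], float]) -> List[Trajectory]:
    """Agents pick trajectories one at a time, each maximizing Evaluate
    against the running set (lines 10-17 of Algorithm 3)."""
    chosen: List[Trajectory] = []
    for agent in agents:
        candidates = propose(agent, chosen)   # stand-in for the DDRMAP call on line 15
        best = max(candidates, key=lambda v: evaluate(v, chosen, tasks, single_value))
        chosen.append(best)                   # line 16
    return chosen
```

With these pieces, fϑ(Xϑ) = Σ_τ T(τ, Xϑ) can be evaluated as sum(T(t, chosen, single_value) for t in tasks), which is exactly the set function whose greedy maximization is analyzed in Theorem 6.4.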

Having established that fϑ is a submodular and monotone function, our next step is to provide conditions that allow us to cast SequentialMultiAgentDeployment as a greedy algorithm with respect to this function. This would enable us to employ Theorem A.1 to guarantee lower bounds on fϑ(Xϑ), with Xϑ being the output of SequentialMultiAgentDeployment.

First, when picking trajectories in line 16 of SequentialMultiAgentDeployment, all trajectories must be chosen from the same set of possible trajectories Ωϑ. We satisfy this requirement with the following assumption.

Assumption 6.5 Agents begin at the same location in the environment, share the same SMDP, and are capable of interacting with as many other agents as needed, i.e., N = |A|. Agents choose their sub-environment trajectories one at a time. Furthermore, agents are given sufficient time for DDRP (line 15) to have visited all possible trajectory sets Ωϑ.

The next assumption we make allows agents to pick trajectories regardless of order and repetition, which is needed to mimic the set function properties of fϑ.

Assumption 6.6 R is set to the single-agent reward r�. As a result, the multi-agent interaction reward function is rφ = r�.

This assumption is necessary because, if we instead consider a reward R that depends on the number of agents acting on τ, the order in which the agents choose their trajectories would affect their decision making. Furthermore, this restriction ensures that Var(R) is finite, as required in Theorem 6.1. We are now ready to characterize the lower bound on the performance of SequentialMultiAgentDeployment with respect to the optimal set of task trajectories. For convenience, let X∗ϑ = argmax_{Xϑ ⊆ Ωϑ} fϑ(Xϑ) denote the optimal set of task trajectories.

Theorem 6.4 Let Xϑ be the trajectory set returned by SequentialMultiAgentDeployment. Under Assumptions 6.2-6.6, it holds that fϑ(Xϑ) ≥ (1 − e^{-1}) fϑ(X∗ϑ).

Proof. Our strategy relies on making sure we can invoke Theorem A.1 for SequentialMultiAgentDeployment. From Lemma 6.3, we know fϑ is submodular and monotone. What is left is to show that SequentialMultiAgentDeployment chooses trajectories from Ωϑ which maximize the marginal gain with respect to fϑ. First, we have that the agents all choose from the set Ωϑ as a direct result of Assumptions 6.2 and 6.5. This is because Q̂ and the SMDP are equivalent for all agents and they all begin with the same initial conditions. Now we show that SequentialMultiAgentDeployment chooses trajectories that locally maximize the marginal gain of fϑ. Given any set Xϑ and ϑα ∈ Ωϑ \ Xϑ, the marginal gain is

fϑ(Xϑ ∪ {ϑα}) − fϑ(Xϑ) = Σ_{∀τ} T(τ, Xϑ ∪ {ϑα}) − Σ_{∀τ} T(τ, Xϑ) = Σ_{∀τ} ( T(τ, Xϑ ∪ {ϑα}) − T(τ, Xϑ) ).

The function Evaluate on lines 7 and 8 has the agent calculate the marginal gain for a particular task, given ϑ and Xϑ. Evaluate calculates the marginal gain for all tasks as

Σ_{∀τ ∈ ϑ.→τ} ( max( T(τ, {ϑ}), T(τ, Xϑ) ) − T(τ, Xϑ) ),

which is equivalent to Σ_{∀τ} ( T(τ, Xϑ ∪ {ϑ}) − T(τ, Xϑ) ). Since SequentialMultiAgentDeployment takes the trajectory that maximizes Evaluate, the result now follows from Theorem A.1. •

7 Empirical validation and evaluation of performance

In this section we perform simulations in order to validate our theoretical analysis and justify the use of DDRP over other model-based and model-free methods. All simulations were performed on a Linux-based workstation with 16 GB of RAM and a stock AMD Ryzen 1600 CPU. A GPU was not utilized in our studies, but could be used to improve the sampling speed of some of the tested algorithms. We first illustrate the optimality ratio obtained by the sub-environment search in single-agent and sequential multi-agent deployment scenarios. Next, we compare the performance of DDRP, DDRP-OO, MCTS, and ACKTR in a simple environment. Lastly, we study the effect of multi-agent interaction on the performance of the proposed algorithm.

    7.1 Illustration of performance guarantees

Here we perform simulations that help validate the results of Section 6. We achieve this by determining Xx from SubEnvSearch, which is an implementation of greedy maximization of submodular set functions, and the optimal region set X∗x by brute-force computation of all possible sets. In this simulation, we use 1 agent and 1 objective with 25 regions. We implement DDRP-OO by loading previously trained data and compare the value of the first sub-environment found to the value of the optimal sub-environment. 1000 trials are simulated by randomizing the initial state of the environments. We plot the probability distribution function of fx(Xx)/fx(X∗x) in Figure 7.1. The empirical lower bound of fx(Xx)/fx(X∗x) is a little less than 1 − e^{-1}, consistent with the result, cf. Theorem 6.2, that the submodularity ratio of SubEnvSearch may not be 1. We believe that another factor for this empirical lower bound is that Assumption 6.2 is not fully satisfied in our experiments. In order to perform the simulation, we trained the agent on similar environments for 10 minutes. Because the number of possible states in this simulation is very large, some of the uncommon states may not have been visited enough for Q̂ to mature.

    Fig. 7.1 Probability distribution function of fx(Xx)/fx(X∗x).

Next we look for empirical validation of the lower bounds on fϑ(Xϑ)/fϑ(X∗ϑ). This is a difficult task because determining the optimal set of trajectories X∗ϑ is combinatorial with respect to the trajectory size, number of regions, and number of agents. We simulate SequentialMultiAgentDeployment in a 36-region environment with one type of objective, where 3 agents are required to start at the same region and share the same Q̂, which is assumed to have converged to the optimal value. 1000 trials are simulated by randomizing the initial state of the environments. As shown in Figure 7.2, the lower bound on the performance with respect to the optimal set of trajectories is greater than 1 − e^{-1}, as guaranteed by Theorem 6.4. This empirical lower bound may change under more complex environments with an increased number of agents, more regions, and longer sub-environment lengths. Due to the combinatorial nature of determining the optimal set of trajectories, it is difficult to simulate environments of higher complexity.

Fig. 7.2 Probability distribution function of fϑ(Xϑ)/fϑ(X∗ϑ).

    7.2 Comparisons to alternative algorithms

In DDRP, DDRP-OO, and MCTS, the agent is given an allocated time to search for the best trajectory. For ACKTR, we look at the number of simulations needed to converge to a policy comparable to the ones found by DDRP, DDRP-OO, and MCTS. We simulate the same environment across these algorithms. The environment has |O| = 1, where objectives have a random number of waypoints (|Ob| ≤ 15) placed uniformly at random. The environment contains 100 × 100 points in R², with 100 evenly distributed regions. The sub-environment length for DDRP and DDRP-OO is 10 in both cases, and the maximum number of steps that an agent is allowed to take is 100. Furthermore, the maximum reward per episode is capped at 10. We choose this environment because of its large state space and action space, in order to illustrate the strength of Dynamic Domain Reduction for Multi-Agent Planning in breaking it down into components that have been previously seen. Figure 7.3 shows that MCTS performs poorly in the chosen environment because of the large state space and branching factor. DDRP initially performs poorly, but yields strong results given enough time to think. DDRP-OO performs well even when not given much time to think. Theorem 6.2 helps give intuition for the immediate performance of DDRP-OO. The ACKTR simulation, displayed in Figure 7.4, performs well, but only after several million episodes of training, corresponding to approximately 2 hours using 10 CPU cores. This illustrates the inherent advantage of model-based reinforcement learning approaches when the MDP model is available. Data is plotted to show the average and confidence intervals of the expected discounted reward of the agent(s) found in the allotted time. We perform 100 trials per data point in the case studies.

Fig. 7.3 Performance of DDRP, DDRP-OO, and MCTS in randomized 2D environments with one objective type.

Fig. 7.4 Performance of the offline algorithm ACKTR in randomized 2D environments with one objective type.

We perform another study to visually compare trajectories generated by DDRP, MCTS, and ACKTR, as shown in Figure 7.5. The environment contains three objectives with waypoints denoted by 'x', 'square', and 'triangle' markers. Visiting an 'x' waypoint yields a reward of 3, while visiting a 'square' or 'triangle' waypoint yields a reward of 1. We ran both MCTS and DDRP for 3.16 seconds and ACKTR for 10000 trials, all with a discount factor of γ = .99. The best trajectories found by MCTS, DDRP, and ACKTR are shown in Figure 7.5, which yield discounted rewards of 1.266, 5.827, and 4.58, respectively. It is likely that the ACKTR policy converged to a local maximum because the trajectories generated near the end of the 100000 trials had little deviation. We use randomized instances of this environment to compare DDRP, DDRP-OO, and MCTS with respect to runtime in Figure 7.6 and show the performance of ACKTR in a static instance of this environment in Figure 7.7.

    7.3 Effect of multi-agent interaction

Our next simulation evaluates the effect of multi-agent cooperation on the algorithm performance. We consider an environment similar to the one in Section 7.2, except with 10 agents and |O| = 3, where objectives have a random number of waypoints (|Ob| ≤ 5) that are placed randomly. In this simulation we do not train the agents before trials, and Dynamic Domain Reduction for Multi-Agent Planning is used with varying N, where agents asynchronously choose trajectories. In Figure 7.8, we can see the benefit of allowing agents to interact with each other. When agents are able to take coupled actions, the expected potential discounted reward is greater, a feature that becomes more marked as agents are given more time T to think.

Fig. 7.5 The trajectories generated by MCTS, DDRP, and ACKTR are shown with dashed, solid, and dash-dot lines, respectively, in a 100 × 100 environment with 3 objectives. The squares, x's, and triangles represent waypoints of three objective types. The agent (represented by a torpedo) starts at the center of the environment.

Fig. 7.6 Performance of DDRP, DDRP-OO, and MCTS in randomized 2D environments with three objective types.

Fig. 7.7 Performance of ACKTR in a static 2D environment with three objective types.

Fig. 7.8 Performance of multi-agent deployment using DDRMAP in a 2D environment.

8 Conclusions

We have presented a framework for high-level multi-agent planning leading to the Dynamic Domain Reduction for Multi-Agent Planning algorithm. Our design builds on a hierarchical approach that simultaneously searches for and creates sequences of actions and sub-environments with the greatest expected reward, helping alleviate the curse of dimensionality. Our algorithm allows for multi-agent interaction by including other agents' state in the sub-environment search. We have shown that the action value estimation procedure in DDRP converges to the optimal value of the sub-environment SMDP with probability 1. We also identified metrics to quantify the performance of the sub-environment selection in SubEnvSearch and of the sequential multi-agent deployment in SequentialMultiAgentDeployment, and provided formal guarantees using scenario optimization and submodularity. We have illustrated our results and compared the algorithm performance against other approaches in simulation. The biggest limitation of our approach is related to the spatial distribution of objectives: the algorithm does not perform well if the environment is set up such that objectives cannot be split well into regions. Future work will explore the incorporation of constraints on battery life and connectivity maintenance of the team of agents, the consideration of partial agent observability and limited communication, and the refinement of multi-agent cooperation capabilities enabled by prediction elements that indicate whether other agents will aid in the completion of an objective. We also plan to explore the characterization, in specific families of scenarios, of positive lower bounds on the submodularity ratio of the set function that assigns the discounted reward of the selected sub-environment, and the use of parallel, distributed methods for submodular optimization capable of handling asynchronous communication and computation.

    Acknowledgments

This work was supported by ONR Award N00014-16-1-2836. The authors would like to thank the organizers of the International Symposium on Multi-Robot and Multi-Agent Systems (MRS 2017), which provided us with the opportunity to obtain valuable feedback on this research, and the reviewers.

    References

1. A. Ma, M. Ouimet, and J. Cortés, "Dynamic domain reduction for multi-agent planning," in International Symposium on Multi-Robot and Multi-Agent Systems, Los Angeles, CA, 2017, pp. 142–149.

2. B. P. Gerkey and M. J. Mataric, "A formal analysis and taxonomy of task allocation in multi-robot systems," International Journal of Robotics Research, vol. 23, no. 9, pp. 939–954, 2004.

3. F. Bullo, J. Cortés, and S. Martinez, Distributed Control of Robotic Networks, ser. Applied Mathematics Series. Princeton University Press, 2009, electronically available at http://coordinationbook.info.

4. M. Mesbahi and M. Egerstedt, Graph Theoretic Methods in Multiagent Networks, ser. Applied Mathematics Series. Princeton University Press, 2010.

5. M. Dunbabin and L. Marques, "Robots for environmental monitoring: Significant advancements and applications," IEEE Robotics & Automation Magazine, vol. 19, no. 1, pp. 24–39, 2012.

6. J. Das, F. Py, J. B. J. Harvey, J. P. Ryan, A. Gellene, R. Graham, D. A. Caron, K. Rajan, and G. S. Sukhatme, "Data-driven robotic sampling for marine ecosystem monitoring," The International Journal of Robotics Research, vol. 34, no. 12, pp. 1435–1452, 2015.

7. J. Cortés and M. Egerstedt, "Coordinated control of multi-robot systems: A survey," SICE Journal of Control, Measurement, and System Integration, vol. 10, no. 6, pp. 495–503, 2017.

8. R. Sutton, D. Precup, and S. Singh, "Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning," Artificial Intelligence, vol. 112, no. 1-2, pp. 181–211, 1999.



9. F. Broz, I. Nourbakhsh, and R. Simmons, "Planning for human-robot interaction using time-state aggregated POMDPs," in AAAI, vol. 8, 2008, pp. 1339–1344.

10. M. Puterman, Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.

11. R. Howard, Dynamic programming and Markov processes. M.I.T. Press, 1960.

12. C. H. Papadimitriou and J. N. Tsitsiklis, "The complexity of Markov decision processes," Mathematics of Operations Research, vol. 12, no. 3, pp. 441–450, 1987.

13. W. S. Lovejoy, "A survey of algorithmic methods for partially observed Markov decision processes," Annals of Operations Research, vol. 28, no. 1, pp. 47–65, 1991.

14. R. Bellman, Dynamic programming. Courier Corporation, 2013.

15. A. Bai, S. Srivastava, and S. Russell, "Markovian state and action abstractions for MDPs via hierarchical MCTS," in Proceedings of the Twenty-fifth International Joint Conference on Artificial Intelligence, IJCAI, New York, NY, 2016, pp. 3029–3039.

16. S. M. LaValle and J. J. Kuffner, "Rapidly-exploring random trees: Progress and prospects," in Workshop on Algorithmic Foundations of Robotics, Dartmouth, NH, Mar. 2000, pp. 293–308.

17. S. Prentice and N. Roy, "The belief roadmap: Efficient planning in linear POMDPs by factoring the covariance," in Robotics Research. Springer, 2010, pp. 293–305.

18. A. A. Agha-mohammadi, S. Chakravorty, and N. M. Amato, "FIRM: Feedback controller-based information-state roadmap - a framework for motion planning under uncertainty," in IEEE/RSJ Int. Conf. on Intelligent Robots & Systems, San Francisco, CA, 2011, pp. 4284–4291.

19. F. A. Oliehoek and C. Amato, A Concise Introduction to Decentralized POMDPs, ser. SpringerBriefs in Intelligent Systems. New York: Springer, 2016.

20. S. Omidshafiei, A. A. Agha-mohammadi, C. Amato, and J. P. How, "Decentralized control of partially observable Markov decision processes using belief space macro-actions," in IEEE Int. Conf. on Robotics and Automation, Seattle, WA, May 2015, pp. 5962–5969.

21. V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, and G. Ostrovski, "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, p. 529, 2015.

22. J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, "Trust region policy optimization," in International Conference on Machine Learning, Lille, France, 2015, pp. 1889–1897.

23. Y. Wu, E. Mansimov, R. B. Grosse, S. Liao, and J. Ba, "Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation," in Advances in Neural Information Processing Systems, vol. 30, 2017, pp. 5285–5294.

24. D. P. Bertsekas, Dynamic Programming and Optimal Control. Athena Scientific, 1995.

25. L. Kocsis and C. Szepesvári, "Bandit based Monte-Carlo planning," in ECML, vol. 6. Springer, 2006, pp. 282–293.

26. E. A. Hansen and Z. Feng, "Dynamic programming for POMDPs using a factored state representation," in International Conference on Artificial Intelligence Planning Systems, Breckenridge, CO, 2000, pp. 130–139.

27. A. K. McCallum and D. Ballard, "Reinforcement learning with selective perception and hidden state," Ph.D. dissertation, University of Rochester, Dept. of Computer Science, 1996.

28. G. Theocharous and L. P. Kaelbling, "Approximate planning in POMDPs with macro-actions," in Advances in Neural Information Processing Systems, 2004, pp. 775–782.

29. A. Clark, B. Alomair, L. Bushnell, and R. Poovendran, Submodularity in Dynamics and Control of Networked Systems, ser. Communications and Control Engineering. New York: Springer, 2016.

30. A. A. Bian, J. M. Buhmann, A. Krause, and S. Tschiatschek, "Guarantees for greedy maximization of non-submodular functions with applications," in International Conference on Machine Learning, vol. 70, Sydney, Australia, 2017, pp. 498–507.

31. A. Das and D. Kempe, "Submodular meets spectral: Greedy algorithms for subset selection, sparse approximation and dictionary selection," CoRR, February 2011.

32. M. C. Campi, S. Garatti, and M. Prandini, "The scenario approach for systems and control design," Annual Reviews in Control, vol. 32, no. 2, pp. 149–157, 2009.

33. A. Blum, S. Chawla, D. R. Karger, T. Lane, A. Meyerson, and M. Minkoff, "Approximation algorithms for orienteering and discounted-reward TSP," SIAM Journal on Computing, vol. 37, no. 2, pp. 653–670, 2007.

34. R. Parr and S. Russell, Hierarchical control and learning for Markov decision processes. University of California, Berkeley, Berkeley, CA, 1998.

35. G. Nemhauser, L. Wolsey, and M. Fisher, "An analysis of the approximations for maximizing submodular set functions," Mathematical Programming, vol. 14, pp. 265–294, 1978.

36. P. R. Goundan and A. S. Schulz, "Revisiting the greedy approach to submodular set function maximization," Optimization Online, pp. 1–25, 2007.

37. A. Ben-Tal and A. Nemirovski, "Robust convex optimization," Math. Oper. Res., vol. 23, pp. 769–805, 1998.

38. L. E. Ghaoui, F. Oustry, and H. Lebret, "Robust solutions to uncertain semidefinite programs," SIAM Journal on Optimization, vol. 9, no. 1, pp. 33–52, 1998.

    A Submodularity

We review here concepts of submodularity and monotonicity of set functions following [29]. A power set function f : 2^Ω → R is submodular if it satisfies the property of diminishing returns,

    f(X ∪ {x})− f(X) ≥ f(Y ∪ {x})− f(Y ), (A.1)

for all X ⊆ Y ⊆ Ω and x ∈ Ω \ Y. The set function f is monotone if

    f(X) ≤ f(Y ), (A.2)

for all X ⊆ Y ⊆ Ω. In general, monotonicity of a set function does not imply submodularity, and vice versa. These properties play a key role in determining near-optimal solutions to the cardinality-constrained submodular maximization problem defined by

max f(X)
s.t. |X| ≤ k.    (A.3)

In general, this problem is NP-hard. Greedy algorithms seek to find a suboptimal solution to (A.3) by building a set X one element at a time, starting with |X| = 0 to |X| = k. These algorithms proceed by choosing the best next element,

max_{x ∈ Ω\X} f(X ∪ {x}),

to include in X. The following result [29,35] provides a lower bound on the performance of greedy algorithms.

Theorem A.1 Let X∗ denote the optimal solution of problem (A.3). If f is monotone, submodular, and satisfies f(∅) = 0, then the set X returned by the greedy algorithm satisfies

f(X) ≥ (1 − e^{-1}) f(X∗).
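For reference, the following is a minimal Python sketch of the greedy procedure for problem (A.3); f and ground_set are generic placeholders. By Theorem A.1, whenever f is monotone, submodular, and f(∅) = 0, the returned set is within a factor 1 − e^{-1} of optimal.

```python
from typing import Callable, FrozenSet, Iterable

def greedy_max(f: Callable[[FrozenSet], float],
               ground_set: Iterable,
               k: int) -> FrozenSet:
    """Greedily build X, one element at a time, always adding the element
    with the largest marginal gain f(X | {x}) - f(X)."""
    X: FrozenSet = frozenset()
    remaining = set(ground_set)
    for _ in range(min(k, len(remaining))):
        best = max(remaining, key=lambda x: f(X | {x}) - f(X))
        X = X | {best}
        remaining.remove(best)
    return X
```

For instance, with a coverage function f(X) = |⋃_{x∈X} S_x| over finite sets S_x, which is monotone and submodular, greedy_max returns an index set achieving at least a (1 − e^{-1}) fraction of the optimal coverage.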

An important extension of this result characterizes the performance of a greedy algorithm where, at each step, one chooses an element x that satisfies

    f(X ∪ {x})− f(X) ≥ α(f(X ∪ {x∗})− f(X)),

for some α ∈ [0, 1]. That is, the algorithm chooses an element that is at least an α-fraction of the locally optimal element choice, x∗. In this case, the following result [36] characterizes the performance.

Theorem A.2 Let X∗ denote the optimal solution of problem (A.3). If f is monotone, submodular, and satisfies f(∅) = 0, then the set X returned by a greedy algorithm that chooses elements achieving at least an α-fraction of the locally optimal element choice satisfies

f(X) ≥ (1 − e^{-α}) f(X∗).

A generalization of the notion of submodular set function is given by the submodularity ratio [31], which measures how far the function is from being submodular. This ratio is defined as the largest scalar λ ∈ [0, 1] such that

λ ≤ ( Σ_{z ∈ Z} [ f(X ∪ {z}) − f(X) ] ) / ( f(X ∪ Z) − f(X) ),    (A.4)

for all X, Z ⊂ Ω. The function f is called weakly submodular if it has a submodularity ratio in (0, 1]. If a function f is submodular, then its submodularity ratio is 1. The following result [31] generalizes Theorem A.1 to monotone set functions with submodularity ratio λ.

Theorem A.3 Let X∗ denote the optimal solution of problem (A.3). If f is monotone, weakly submodular with submodularity ratio λ ∈ (0, 1], and satisfies f(∅) = 0, then the set X returned by the greedy algorithm satisfies

f(X) ≥ (1 − e^{-λ}) f(X∗).

    B Scenario optimization

Scenario optimization aims to determine robust solutions for practical problems with unknown parameters [37,38] by hedging against uncertainty. Consider the following robust convex optimization problem defined by

RCP:  min_{γ ∈ R^d}  c^T γ
      subject to:  fδ(γ) ≤ 0,  ∀δ ∈ ∆,    (B.1)

where fδ is a convex function, d is the dimension of the optimization variable, δ is an uncertain parameter, and ∆ is the set of all possible parameter values. In practice, solving the optimization (B.1) can be difficult depending on the cardinality of ∆. One approach to this problem is to solve (B.1) with sampled constraint parameters from ∆. This approach views the uncertainty of situations in the robust convex optimization problem through a probability distribution Prδ of ∆, which encodes either the likelihood or importance of situations occurring through the constraint parameters. To alleviate the computational load, one selects a finite number NSCP of parameter values in ∆ sampled according to Prδ and solves the scenario convex program [32] defined by

SCP_N:  min_{γ ∈ R^d}  c^T γ
        s.t.  f_{δ(i)}(γ) ≤ 0,  i = 1, . . . , NSCP.    (B.2)

The following result states to what extent the solution of (B.2) solves the original robust optimization problem.

Theorem B.1 Let γ∗ be the optimal solution to the scenario convex program (B.2) when NSCP is the number of convex constraints. Given a 'violation parameter' ε and a 'confidence parameter' $, if

NSCP ≥ (2/ε) ( ln(1/$) + d ),

then, with probability 1 − $, γ∗ satisfies all but an ε-fraction of constraints in ∆.
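As an illustration of how the bound in Theorem B.1 is used, the following minimal Python sketch computes a sufficient number of samples NSCP for a given violation parameter ε, confidence parameter (written beta here in place of the paper's symbol), and decision dimension d. Solving the resulting scenario program itself would additionally require a convex solver and is not shown.

```python
import math

def scenario_sample_size(eps: float, beta: float, d: int) -> int:
    """Smallest integer N_SCP satisfying N_SCP >= (2/eps) * (ln(1/beta) + d),
    the sufficient condition of Theorem B.1."""
    if not (0.0 < eps < 1.0 and 0.0 < beta < 1.0):
        raise ValueError("eps and beta must lie in (0, 1)")
    return math.ceil((2.0 / eps) * (math.log(1.0 / beta) + d))

# Example: d = 2 decision variables, 5% violation level, confidence 1 - 1e-3.
# scenario_sample_size(0.05, 1e-3, 2) returns 357.
```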

    C List of symbols

(Z≥1) Z . . . (non-negative) integer
(R>0) R . . . (positive) real number
|Y| . . . cardinality of set Y
s ∈ S . . . state/state space
a ∈ A . . . action/action space
Prs . . . transition function
r, R . . . reward, reward function
(π∗) π . . . (optimal) policy
V π . . . value of a state given policy π
γ . . . discount factor
α, β . . . agent indices
A . . . set of agents
o ∈ Ob ∈ O . . . waypoint/objective/objective set
E . . . environment
x ∈ Ωx . . . region/region set
Obi . . . set of waypoints of an objective in a region
sbi . . . abstracted objective state
si . . . regional state
τ . . . task
Γ . . . set of feasible tasks
→x . . . ordered list of regions
ξ→x . . . repeated region list
φt(·, ·) . . . time abstraction function
(�k) � . . . (partial) sub-environment
N� . . . size of a sub-environment
→τ . . . ordered list of tasks
→Prt . . . ordered list of probability distributions
(ϑpβ) ϑ . . . (partial) task trajectory
Iα . . . interaction set of agent α
N . . . max size of interaction set
θ . . . claimed regional objective
Θα . . . claimed objective set of agent α
ΘA . . . global claimed objective set
Ξ . . . interaction matrix
Q . . . value for choosing τ given �.s
Q̂ . . . estimated value for choosing τ given �.s
N . . . number of simulations in �.s
NOb . . . number of simulations of an objective in �.s
t . . . time
T . . . multi-agent expected discounted reward per task
λ . . . submodularity ratio
λ̂ . . . approximate submodularity ratio
fx . . . sub-environment search value set function
Xx ⊂ Yx ⊂ Ωx . . . finite set of regions
fϑ . . . sequential multi-agent deployment value set function
Xϑ ⊂ Yϑ ⊂ Ωϑ . . . finite set of trajectories
$ . . . confidence parameter
ε . . . violation parameter
