
Mean-Field Controls with Q-learning for Cooperative MARL: Convergence and Complexity Analysis*

Haotian Gu†, Xin Guo‡, Xiaoli Wei‡, and Renyuan Xu§

*October 19, 2020.
†Department of Mathematics, University of California, Berkeley, USA ([email protected]).
‡IEOR Department, University of California, Berkeley, USA ([email protected], [email protected]).
§Mathematical Institute, University of Oxford, UK ([email protected]).

Abstract. Multi-agent reinforcement learning (MARL), despite its popularity and empirical success, suffers from the curse of dimensionality. This paper builds the mathematical framework to approximate cooperative MARL by a mean-field control (MFC) framework, and shows that the approximation error is of O(1/√N). By establishing an appropriate form of the dynamic programming principle for both the value function and the Q function, it proposes a model-free kernel-based Q-learning algorithm (MFC-K-Q), which is shown to have a linear convergence rate, the first of its kind in the MARL literature. It further establishes that the convergence rate and the sample complexity of MFC-K-Q are independent of the number of agents N. Empirical studies for the network traffic congestion problem demonstrate that MFC-K-Q outperforms existing MARL algorithms when N is large, for instance when N > 50.

Key words. Mean-Field Control, Multi-Agent Reinforcement Learning, Q-Learning, Cooperative Games, Dynamic Programming Principle.

AMS subject classifications. 49N80, 68Q32, 68T05, 90C40

1. Introduction. Multi-agent reinforcement learning (MARL) has enjoyed substantial success in analyzing otherwise challenging games, including two-agent or two-team computer games [41, 47], self-driving vehicles [40], real-time bidding games [22], ride-sharing [26], and traffic routing [10]. Despite this empirical success, MARL suffers from the curse of dimensionality, also known as the combinatorial nature of MARL: the sample complexity of existing algorithms for stochastic dynamics grows exponentially with respect to the number of agents N. (See [18] and also Proposition 2.1 in Section 2.) In practice, this N can be on the scale of thousands or more, for instance, in rider match-up for Uber-pool and network routing for Zoom.

One classical approach to tackle this curse of dimensionality is to focus on local policies, namely by exploiting special structures of MARL problems and by designing problem-dependent algorithms to reduce the complexity. For instance, [25] developed a value-based distributed Q-learning algorithm for deterministic and finite Markov decision problems (MDPs), and [35] exploited special dependence structures among agents. (See the review by [54] and the references therein.)

Another approach is to consider MARL in the regime with a large number of homogeneous agents. In this paradigm, by the functional strong law of large numbers (a.k.a. propagation of chaos) [23, 29, 44, 14], non-cooperative MARL can be approximated under the Nash equilibrium by mean-field games with learning, and cooperative MARL can be studied under Pareto optimality by analyzing mean-field controls (MFC) with learning. This approach is appealing


not only because the dimension of MFC or MFG is independent of the number of agents N, but also because solutions of MFC/MFG (without learning) have been shown to provide good approximations to the corresponding N-agent game in terms of both game values and optimal strategies [19, 24, 30, 36, 38].

MFG with learning has gained popularity in the RL community [13, 16, 53, 20], with its sample complexity shown to be similar to that of single-agent RL ([13, 16]). Yet MFC with learning is by and large an uncharted field despite its potentially wide range of applications [26, 27, 49, 50]. The main challenge for MFC with learning is to deal with the probability measure space over the state-action space, which is shown ([15]) to be the minimal space on which the Dynamic Programming Principle holds. One of the open problems for MFC with learning is therefore, as pointed out in [30], to design efficient RL algorithms on probability measure spaces.

To circumvent designing algorithms on probability measure spaces, [6] proposed to add common noises to the underlying dynamics. This approach enables them to apply the standard RL theory for stochastic dynamics. Their model-free algorithm, however, suffers from high sample complexity, as illustrated in Table 1 below, and from weak performance, as demonstrated in Section 6. For special classes of linear-quadratic MFCs with stochastic dynamics, [5] explored the policy gradient method and [28] developed an actor-critic type algorithm.

Our work. This paper builds the mathematical framework to approximate cooperative MARL by MFCs with learning. The approximation error is shown to be of O(1/√N). It then identifies the minimal space on which the Dynamic Programming Principle holds, and proposes an efficient approximation algorithm (MFC-K-Q) for MFC with learning. This model-free Q-learning-based algorithm combines the technique of kernel regression with an approximated Bellman operator. The convergence rate and the sample complexity of this algorithm are shown to be independent of the number of agents N, and rely only on the size of the state-action space of the underlying single-agent dynamics (Table 1). As far as we are aware, there is no prior algorithm with a linear convergence rate for cooperative MARL.

Mathematically, the DPP is established by lifting the state-action space and aggregating the reward and the underlying dynamics. This lifting idea has been used in previous MFC frameworks ([34, 51] without learning and [15] with learning). Our work finds that this lifting idea is critical for efficient algorithm design for MFC with learning: the resulting deterministic dynamics from this lifting trivialize the choice of the learning rate in the convergence analysis and significantly reduce the sample complexity.

Our experiment in Section 6 demonstrates that MFC-K-Q avoids the curse of dimensionality and outperforms existing MARL algorithms (when N > 50) as well as the MFC algorithm in [6]. Table 1 summarizes the complexity of our MFC-K-Q algorithm along with these relevant algorithms.

Table 1: Comparison of algorithms

Work             MFC / N-agent   Method               Sample complexity guarantee
Our work         MFC             Q-learning           Ω(T_cov · log(1/δ))
[6]              MFC             Q-learning           Ω((T_cov · log(1/δ))^l · poly(log(1/(δε))/ε))
Vanilla N-agent  N-agent         Q-learning           Ω(poly((|X||U|)^N · log(1/(δε)) · N/ε))
[35]             N-agent         Actor-critic         Ω(poly((|X||U|)^{f(log(1/ε))} · log(1/δ) · N/ε))

Here T_cov in Table 1 is the covering time of the exploration policy and l = max{3 + 1/κ, 1/(1 − κ)} > 4 for some κ ∈ (0.5, 1). Other parameters are as in Proposition 2.1 and Theorem 5.6. Note that [35] assumed that agents interact locally through a given graph so that local policies can approximate the global one, yet f(log(1/ε)) can scale as N for a dense graph.

Organization. Section 2 connects cooperative MARL and MFC with learning. Section 3 establishes the dynamic programming principle for MFC with learning. Section 4 proposes the algorithm (MFC-K-Q) for MFC with learning. Section 5 provides the convergence and sample complexity analysis of MFC-K-Q and is dedicated to the proofs of the main theorems. Finally, Section 6 tests the performance of MFC-K-Q in a network congestion control problem. For ease of exposition, proofs for all lemmas are in the Appendix.

Notation. For a set S, we denote by R^S the set of all real-valued functions on S, R^S := {f : S → R}. For each f ∈ R^S, define the sup norm of f as ‖f‖_∞ = sup_{s∈S} |f(s)|. In addition, when S is finite, we denote by |S| the size of S, and by P(S) the set of all probability measures on S, {p : p(s) ≥ 0, ∑_{s∈S} p(s) = 1}, which is equivalent to the probability simplex in R^{|S|}. Moreover, on P(S), let d_{P(S)} be the metric induced by the l1 norm: for any u, v ∈ P(S), d_{P(S)}(u, v) = ∑_{s∈S} |u(s) − v(s)|.

2. MARL and MFC with Learning.

2.1. MARL and its Complexity. First recall cooperative MARL in an infinite time horizon, where there are N agents whose game strategies are coordinated by a central controller. Let us assume the state space X and the action space U are both finite.

At each step t = 0, 1, · · ·, the state of agent j (= 1, 2, · · ·, N) is x_t^j ∈ X and she takes an action u_t^j ∈ U. Given the current state profile x_t = (x_t^1, · · ·, x_t^N) ∈ X^N and the current action profile u_t = (u_t^1, · · ·, u_t^N) ∈ U^N of the N agents, agent j receives a reward r^j(x_t, u_t) and her state changes to x_{t+1}^j according to a transition probability function P^j(x_t, u_t). A Markovian game further restricts the admissible policy for agent j to be of the form u_t^j ∼ π_t^j(x_t). That is, π_t^j : X^N → P(U) maps each state profile x ∈ X^N to a randomized action, with P(U) the space of probability measures on U.

In this cooperative MARL, the central controller aims to maximize the aggregated accumulated reward, averaged over all agents, over all policies. That is, to find

sup_π (1/N) ∑_{j=1}^N v^j(x, π),

where

v^j(x, π) = E[ ∑_{t=0}^∞ γ^t r^j(x_t, u_t) | x_0 = x ]

is the accumulated reward for agent j, given the initial state profile x_0 = x and the policy π = {π_t}_{t=0}^∞ with π_t = (π_t^1, . . . , π_t^N). Here γ ∈ (0, 1) is a discount factor, u_t^j ∼ π_t^j(x_t), and x_{t+1}^j ∼ P^j(x_t, u_t).


The sample complexity of the Q-learning algorithm for this cooperative MARL is exponential with respect to N. Indeed, taking Theorem 4 in [11] and noting that the corresponding covering time for the policy of the central controller will be at least (|X||U|)^N, we see

Proposition 2.1. Let |X| and |U| be respectively the sizes of the state space X and the action space U. Let Q* and Q_T be respectively the optimal value and the value of the asynchronous Q-learning algorithm in [11] using a polynomial learning rate with time T = Ω(poly((|X||U|)^N · N/ε · ln(1/(δε)))). Then with probability at least 1 − δ, ‖Q_T − Q*‖_∞ ≤ ε.
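To make the combinatorial nature of the problem concrete, the following small Python computation (purely illustrative; the sizes and ε below are hypothetical) contrasts the joint state-action space of N agents, which drives the bound in Proposition 2.1, with the size of a discretization of the lifted mean-field space introduced in Section 4, which does not depend on N.

# Illustrative comparison: the joint state-action space (|X||U|)^N of the
# N-agent problem versus an eps-net on the lifted space C = P(X) x H used by
# the mean-field approach (a compact manifold of dimension |X||U| - 1).

def joint_space_size(n_states: int, n_actions: int, n_agents: int) -> int:
    # Number of joint state-action profiles for N agents.
    return (n_states * n_actions) ** n_agents

def eps_net_size_order(n_states: int, n_actions: int, eps: float) -> float:
    # Order of the size of an eps-net on C (see Theorem 5.6); independent of N.
    return (1.0 / eps) ** (n_states * n_actions - 1)

if __name__ == "__main__":
    X, U = 5, 3
    for N in (2, 10, 50):
        print(f"N = {N:2d}: joint space size = {joint_space_size(X, U, N):.3e}")
    print(f"eps-net size order (eps = 0.1): {eps_net_size_order(X, U, 0.1):.3e}")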

2.2. MFC with Learning. To overcome the curse of dimensionality in N, we now propose a mean-field control (MFC) framework to approximate this cooperative MARL.

In this MFC framework, all agents are assumed to be identical, indistinguishable, and interchangeable, and each agent j (= 1, · · ·, N) is assumed to depend on the other agents only through the empirical distributions of their states and actions. That is, denote by P(X) and P(U) the spaces of probability measures over the state space X and the action space U, respectively. The empirical distribution of the states is µ_t^N = (1/N) ∑_{j=1}^N δ_{x_t^j} ∈ P(X), and the empirical distribution of the actions is ν_t^N = (1/N) ∑_{j=1}^N δ_{u_t^j} ∈ P(U). Then, by the law of large numbers, this cooperative MARL becomes an MFC with learning as N → ∞. Moreover, as all agents are indistinguishable, one can focus on a single representative agent.

Mathematically, this MFC with learning is as follows. At each time t = 0, 1, · · ·, the representative agent in state x_t takes an action u_t ∈ U according to the admissible policy π_t(x_t, µ_t) : X × P(X) → P(U) assigned by the central controller, who can observe the population state distribution µ_t ∈ P(X). The agent then receives a reward r(x_t, µ_t, u_t, ν_t) and moves to the next state x_{t+1} ∈ X according to a probability transition function P(x_t, µ_t, u_t, ν_t). Here P and r depend on the state distribution µ_t and the action distribution ν_t(·) := ∑_{x∈X} π_t(x, µ_t)(·) µ_t(x), and are possibly unknown.

The objective of this MFC with learning is to find v, the maximal accumulated reward over all admissible policies π = {π_t}_{t=0}^∞, namely

v = sup_π v^π := sup_π E[ ∑_{t=0}^∞ γ^t r(x_t, µ_t, u_t, ν_t) | x_0 ∼ µ ],        (MFC)

subject to x_{t+1} ∼ P(x_t, µ_t, u_t, ν_t), u_t ∼ π_t(x_t, µ_t).

Note that if µ_t is fixed, then π_t(·, µ_t) can be viewed as a mapping from X to P(U). We denote H := {h : X → P(U)}, which is identical to the product of |X| copies of P(U).

We will show that the objective in (MFC) is law-invariant and that the probability distribution of the dynamics in (MFC) satisfies a flow property. This flow property is crucial for establishing the convergence of the associated cooperative MARL to (MFC), and for deriving the Dynamic Programming Principle (DPP) of (MFC).

Lemma 2.2. Under any admissible policy π = {π_t}_{t=0}^∞ and the initial state distribution x_0 ∼ µ_0 = µ, the evolution of the state distribution {µ_t}_{t≥0}, x_t ∼ µ_t, is given by

µ_{t+1} = Φ(µ_t, π_t(·, µ_t)),        (2.1)

where for any (µ, h) ∈ P(X) × H, ν(µ, h) := ∑_{x∈X} h(x) µ(x) ∈ P(U), and the dynamics Φ is defined as

Φ(µ, h) := ∑_{x∈X} ∑_{u∈U} P(x, µ, u, ν(µ, h)) µ(x) h(x)(u) ∈ P(X).        (2.2)

Moreover, the value function v^π defined in (MFC) can be rewritten as

v^π(µ) = ∑_{t=0}^∞ γ^t r(µ_t, π_t(·, µ_t)),        (2.3)

where for any (µ, h) ∈ P(X) × H, the reward r is defined as

r(µ, h) := ∑_{x∈X} ∑_{u∈U} r(x, µ, u, ν(µ, h)) µ(x) h(x)(u).        (2.4)

Remark 2.3. Because of the aggregated form of Φ and r, they are also called the aggregated dynamics and the aggregated reward, respectively.
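For concreteness, here is a minimal numpy sketch of the aggregation in Lemma 2.2, assuming (purely for illustration) that the single-agent transition function P(x, µ, u, ν) and reward r(x, µ, u, ν) are available as callables; in the learning setting they are unknown and only accessed through samples.

import numpy as np

# Aggregated quantities of Lemma 2.2:
#   nu(mu, h)(u)  = sum_x h(x)(u) mu(x)                         (action distribution)
#   Phi(mu, h)(y) = sum_{x,u} P(x, mu, u, nu)(y) mu(x) h(x)(u)  (eq. (2.2))
#   r(mu, h)      = sum_{x,u} r(x, mu, u, nu) mu(x) h(x)(u)     (eq. (2.4))
# Conventions (ours): mu is a vector of length |X|; h is an |X| x |U| array with
# rows summing to one; P(x, mu, u, nu) returns a vector of length |X|.

def action_distribution(mu, h):
    return mu @ h                                  # shape (|U|,)

def aggregated_dynamics(mu, h, P):
    nu = action_distribution(mu, h)
    n_x, n_u = h.shape
    mu_next = np.zeros(n_x)
    for x in range(n_x):
        for u in range(n_u):
            mu_next += P(x, mu, u, nu) * mu[x] * h[x, u]
    return mu_next                                 # an element of P(X)

def aggregated_reward(mu, h, r):
    nu = action_distribution(mu, h)
    n_x, n_u = h.shape
    return sum(r(x, mu, u, nu) * mu[x] * h[x, u]
               for x in range(n_x) for u in range(n_u))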

2.3. MFC Approximation to Cooperative MARL. Now we will show that under the Pareto optimality criterion, (MFC) is an approximation to its corresponding cooperative MARL, with an error of O(1/√N).

First, note that the cooperative MARL in Section 2 with N identical, indistinguishable, and interchangeable agents becomes

sup_π (1/N) ∑_{j=1}^N v^{j,π} = sup_π (1/N) ∑_{j=1}^N E[ ∑_{t=0}^∞ γ^t r(x_t^{j,N}, µ_t^N, u_t^{j,N}, ν_t^N) ],        (MARL)

subject to x_{t+1}^{j,N} ∼ P(x_t^{j,N}, µ_t^N, u_t^{j,N}, ν_t^N), u_t^{j,N} ∼ π_t(x_t^{j,N}, µ_t^N), 1 ≤ j ≤ N.

Definition 2.4. π^ε is ε-Pareto optimal for (MARL) if

(1/N) ∑_{j=1}^N v^{j,π^ε} ≥ sup_π (1/N) ∑_{j=1}^N v^{j,π} − ε.

Next, the following assumptions are needed for such an approximation and for the subsequent convergence and sample complexity analysis of the learning algorithms. They are standard regularity assumptions for MFC problems [4]. To start, we use the l1 distance for the metrics d_{P(X)} and d_{P(U)} on P(X) and P(U), and define d_H(h_1, h_2) = max_{x∈X} ‖h_1(x) − h_2(x)‖_1 and d_C((µ_1, h_1), (µ_2, h_2)) = ‖µ_1 − µ_2‖_1 + d_H(h_1, h_2) for the spaces H and C := P(X) × H, respectively.
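Written out explicitly (a small numpy sketch, with µ stored as a vector of length |X| and h as an |X| × |U| array whose rows sum to one), these metrics are:

import numpy as np

# l1-based metrics on P(X) (or P(U)), on H, and on C = P(X) x H, as defined above.

def d_simplex(p, q):                 # d_{P(X)} or d_{P(U)}
    return np.abs(p - q).sum()

def d_H(h1, h2):                     # max over states of the l1 distance between rows
    return np.abs(h1 - h2).sum(axis=1).max()

def d_C(c1, c2):                     # c = (mu, h)
    (mu1, h1), (mu2, h2) = c1, c2
    return d_simplex(mu1, mu2) + d_H(h1, h2)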


Assumption 2.5 (Continuity and boundedness of r). There exist R > 0 and L_r > 0 such that for all x ∈ X, u ∈ U, µ_1, µ_2 ∈ P(X), ν_1, ν_2 ∈ P(U),

|r(x, µ_1, u, ν_1)| ≤ R,    |r(x, µ_1, u, ν_1) − r(x, µ_2, u, ν_2)| ≤ L_r · (‖µ_1 − µ_2‖_1 + ‖ν_1 − ν_2‖_1).

Assumption 2.6 (Continuity of P). There exists L_P > 0 such that for all x ∈ X, u ∈ U, µ_1, µ_2 ∈ P(X), ν_1, ν_2 ∈ P(U), ‖P(x, µ_1, u, ν_1) − P(x, µ_2, u, ν_2)‖_1 ≤ L_P · (‖µ_1 − µ_2‖_1 + ‖ν_1 − ν_2‖_1).

Under Assumptions 2.5 and 2.6, it is clear that the probability measure ν over the action space, the aggregated reward r in (2.4), and the aggregated dynamics Φ in (2.2) are all Lipschitz continuous, which will be useful for the subsequent analysis.

Lemma 2.7 (Continuity of ν). ‖ν(µ, h) − ν(µ′, h′)‖_1 ≤ d_C((µ, h), (µ′, h′)).        (2.5)

Lemma 2.8 (Continuity of r). Under Assumption 2.5, |r(µ, h) − r(µ′, h′)| ≤ (R + 2L_r) d_C((µ, h), (µ′, h′)).        (2.6)

Lemma 2.9 (Continuity of Φ). Under Assumption 2.6, ‖Φ(µ, h) − Φ(µ′, h′)‖_1 ≤ (2L_P + 1) d_C((µ, h), (µ′, h′)).        (2.7)

We are now ready to show that the optimal policy for (MFC) is approximately Pareto optimal for (MARL) as N → ∞.

Theorem 2.10 (Approximation). Given Assumptions 2.5 and 2.6, assume γ · (2L_P + 1) < 1 and π = (π_t)_{t=0}^∞ with π_t ∈ H for any t ≥ 0. Then there exists a constant C = C(L_P, L_r, R, γ) such that

sup_π | (1/N) ∑_{j=1}^N v^{j,π} − v^π | ≤ C · 1/√N,        (2.8)

where v^π and v^{j,π} are given in (MFC) and (MARL), respectively. Consequently, for any ε > 0, there exists N(ε) such that for N ≥ N(ε), any δ-optimal policy for (MFC) with learning is (δ + ε)-optimal for (MARL).

Proof of Theorem 2.10. First note that

(1/N) ∑_{j=1}^N v^{j,π} = (1/N) ∑_{j=1}^N ∑_{t=0}^∞ γ^t E[r(x_t^{j,N}, µ_t^N, u_t^{j,N}, ν_t^N)] = ∑_{t=0}^∞ γ^t E[r(µ_t^N, π_t)],

v^π = ∑_{t=0}^∞ γ^t E[r(x_t, µ_t, u_t, ν_t)] = ∑_{t=0}^∞ γ^t r(µ_t, π_t).

By the continuity of r from Lemma 2.8, we have

sup_π | (1/N) ∑_{j=1}^N v^{j,π} − v^π | ≤ (R + 2L_r) ∑_{t=0}^∞ γ^t sup_π E[ ‖µ_t^{N,π} − µ_t^π‖_1 ].


To prove (2.8), it suffices to estimate δ_t^N := sup_{π∈Π} E[‖µ_t^{N,π} − µ_t^π‖_1], with δ_0^N = 0. We claim that δ_t^N = O(N^{−1/2}). This is done by induction. The claim holds for t = 0. Suppose the claim holds for t and consider t + 1:

E[‖µ_{t+1}^N − µ_{t+1}‖_1] ≤ E[‖µ_{t+1}^N − Φ(µ_t^N, π_t)‖_1] + E[‖Φ(µ_t^N, π_t) − µ_{t+1}‖_1].        (2.9)

Now denote, for any µ ∈ P(X), π ∈ H, and g : X → R,

µ(g) := ∑_{x∈X} g(x) µ(x),    P_µ^π(x, y) := ∑_{u∈U} P(x, µ, u, ν)(y) π(x)(u).

For the first term on the RHS of (2.9),

E[‖µ_{t+1}^N − Φ(µ_t^N, π_t)‖_1]^2
  ≤ E[ E[‖µ_{t+1}^N − Φ(µ_t^N, π_t)‖_1 | x_t^{1,N}, · · ·, x_t^{N,N}]^2 ]
  = 4 E[ E[ sup_{g:X→[−1,1]} (µ_{t+1}^N(g) − Φ(µ_t^N, π_t)(g)) | x_t^{1,N}, · · ·, x_t^{N,N} ]^2 ]
  = 4 E[ sup_{g:X→[−1,1]} ( (1/N) ∑_{j=1}^N g(x_{t+1}^{j,N}) − (1/N) ∑_{j=1}^N ∑_{y∈X} P_{µ_t^N}^{π_t}(x_t^{j,N}, y) g(y) )^2 ]
  ≤ (4/N^2) E[ sup_{g:X→[−1,1]} ∑_{j=1}^N ( ∑_{y∈X} g^2(y) P_{µ_t^N}^{π_t}(x_t^{j,N}, y) + ( ∑_{y∈X} g(y) P_{µ_t^N}^{π_t}(x_t^{j,N}, y) )^2 ) ]
  ≤ 8/N,

where the first inequality is by the law of total expectation and Jensen's inequality, and the last inequality is due to g taking values in [−1, 1].

For the second term on the RHS of (2.9),

E[‖Φ(µ_t^N, π_t) − µ_{t+1}‖_1] = E[‖Φ(µ_t^N, π_t) − Φ(µ_t, π_t)‖_1] ≤ (2L_P + 1) E[‖µ_t^N − µ_t‖_1],

where the first equality is from the flow of the probability measure µ_{t+1} = Φ(µ_t, π_t) by Lemma 2.2, and the inequality is by the continuity of Φ from Lemma 2.9. Taking the supremum over π on both sides of (2.9), we have δ_{t+1}^N ≤ (2L_P + 1) δ_t^N + 2√2 N^{−1/2}, hence

δ_t^N ≤ (√2/L_P) ( (2L_P + 1)^t N^{−1/2} − N^{−1/2} ).

We deduce that

sup_π | (1/N) ∑_{j=1}^N v^{j,π} − v^π | ≤ (R + 2L_r) ∑_{t=0}^∞ γ^t δ_t^N ≤ (R + 2L_r) (√2/L_P) ( 1/(1 − (2L_P + 1)γ) − 1/(1 − γ) ) N^{−1/2}.

This proves (2.8).


3. DPP for the Q Function in MFC with Learning. In this section, we will establish the DPP of the Q function for (MFC). We adapt the approach of [15] to incorporate the dependence of P and r on the population's action distribution ν_t.

First, by Lemma 2.2, (MFC) can be recast as a general Markov decision problem (MDP) with the probability measure space as the new state-action space.

More specifically, if one views the policy π_t as a mapping from P(X) to H and defines the set of admissible policies Π := {π = {π_t}_{t=0}^∞ | π_t : P(X) → H}, then (MFC) can be restated as the following MDP with unknown r and Φ:

v(µ) := sup_{π∈Π} ∑_{t=0}^∞ γ^t r(µ_t, π_t(µ_t))        (MDP)

subject to µ_{t+1} = Φ(µ_t, π_t(µ_t)), µ_0 = µ.

With this reformulation, we can define the associated optimal Q function for (MDP), starting from an arbitrary (µ, h) ∈ C,

Q(µ, h) := sup_{π∈Π} [ ∑_{t=0}^∞ γ^t r(µ_t, π_t(µ_t)) | µ_0 = µ, π_0(µ_0) = h ].        (3.1)

Similarly, we can define Q^π to be the Q function associated with a given policy π,

Q^π(µ, h) := [ ∑_{t=0}^∞ γ^t r(µ_t, π_t(µ_t)) | µ_0 = µ, π_0(µ_0) = h ].        (3.2)

Remark 3.1. With this reformulation, (MFC) is now lifted from the finite state-action spaces X and U to a compact continuous state-action space C embedded in a Euclidean space. In addition, the dynamics become deterministic through the aggregation over the original state-action space. Due to this aggregation for r, Φ, and the Q function, we will subsequently refer to the Q in (3.1) as an Integrated Q (IQ) function, to underline the difference between the Q function for single-agent RL and that for MFC with learning.

The following theorem gives the Bellman equation for the IQ function in (3.1).

Theorem 3.2. For any µ ∈ P(X),

v(µ) = sup_{h∈H} Q(µ, h) = sup_{h∈H} sup_{π∈Π} Q^π(µ, h).        (3.3)

Moreover, the Bellman equation for Q : C → R is

Q(µ, h) = r(µ, h) + γ sup_{h′∈H} Q(Φ(µ, h), h′).        (3.4)

Proof of Theorem 3.2. Recall the definition of v in (MDP) and of Q in (3.1). For v(µ), the supremum is taken over all the admissible policies Π, while for Q(µ, h), the supremum is taken over all the admissible policies Π with the further restriction that π_0(µ) = h. In sup_{h∈H} Q(µ, h), since we are free to choose h, it is equivalent to v. Moreover,

v(µ) = sup_{π∈Π} [ ∑_{t=0}^∞ γ^t r(µ_t, π_t(µ_t)) | µ_0 = µ ]
     = sup_{h∈H, π∈Π, π_0(µ)=h} [ ∑_{t=0}^∞ γ^t r(µ_t, π_t(µ_t)) | µ_0 = µ, π_0(µ_0) = h ]
     = sup_{h∈H} sup_{π∈Π, π_0(µ)=h} [ ∑_{t=0}^∞ γ^t r(µ_t, π_t(µ_t)) | µ_0 = µ, π_0(µ_0) = h ]
     = sup_{h∈H} Q(µ, h),

and

Q(µ, h) = sup_{π∈Π} [ ∑_{t=0}^∞ γ^t r(µ_t, π_t(µ_t)) | µ_0 = µ, π_0(µ_0) = h ]
        = r(µ, h) + sup_{{π_t}_{t=1}^∞} [ ∑_{t=1}^∞ γ^t r(µ_t, π_t(µ_t)) | µ_1 = Φ(µ, h) ]
        = r(µ, h) + sup_{{π_t}_{t=0}^∞} γ [ ∑_{t=0}^∞ γ^t r(µ_t, π_t(µ_t)) | µ_0 = Φ(µ, h) ]
        = r(µ, h) + γ v(Φ(µ, h))
        = r(µ, h) + γ sup_{h′∈H} Q(Φ(µ, h), h′),

where the third equality is from shifting the time index by one.

Next, we have the following verification theorem for this IQ function.

Proposition 3.3 (Verification). Assume Assumption 2.5 and define V_max := R/(1 − γ). Then,
• Q defined in (3.1) is the unique function in {f ∈ R^C : ‖f‖_∞ ≤ V_max} satisfying the Bellman equation (3.4);
• suppose that for every µ ∈ P(X) one can find an h_µ ∈ H such that h_µ ∈ arg max_{h∈H} Q(µ, h); then π* = {π*_t}_{t=0}^∞, where π*_t(µ) = h_µ for any µ ∈ P(X) and t ≥ 0, is an optimal stationary policy of (MDP).

In order to prove the proposition, let us first define the following two operators.
• Define the operator B : R^C → R^C for (MDP) by

(B q)(c) = r(c) + γ max_{h∈H} q(Φ(c), h).        (3.5)

• Define the operator B_π : R^C → R^C for (MDP) under a given stationary policy {π_t = π : P(X) → H}_{t=0}^∞ by

(B_π q)(c) = r(c) + γ q(Φ(c), π(Φ(c))).        (3.6)


Proof. Since ‖r‖_∞ ≤ R, for any µ ∈ P(X) and h ∈ H, the aggregated reward function (2.4) satisfies

|r(µ, h)| ≤ R · ∑_{x∈X} ∑_{u∈U} µ(x) h(x)(u) = R.

In this case, for any µ ∈ P(X), h ∈ H and policy π, |Q^π(µ, h)| ≤ R · ∑_{t=0}^∞ γ^t = V_max. Hence, Q of (3.1) and Q^π of (3.2) both belong to {f ∈ R^C : ‖f‖_∞ ≤ V_max}. Meanwhile, by definition, it is easy to show that B and B_π map {f ∈ R^C : ‖f‖_∞ ≤ V_max} to itself.

Next, we notice that B is a contraction operator with modulus γ < 1 under the sup norm on {f ∈ R^C : ‖f‖_∞ ≤ V_max}: for any (µ, h) ∈ C,

|Bq_1(µ, h) − Bq_2(µ, h)| ≤ γ max_{h′∈H} |q_1(Φ(µ, h), h′) − q_2(Φ(µ, h), h′)| ≤ γ ‖q_1 − q_2‖_∞.

Thus, ‖Bq_1 − Bq_2‖_∞ ≤ γ ‖q_1 − q_2‖_∞. By the Banach Fixed Point Theorem, B has a unique fixed point in {f ∈ R^C : ‖f‖_∞ ≤ V_max}. By (3.4) in Theorem 3.2, we conclude that this unique fixed point is Q.

Similarly, we can show that for any stationary policy π, B_π is also a contraction operator with modulus γ < 1. Meanwhile, by the standard DPP argument as in Theorem 3.2, we have Q^π = B_π Q^π. This implies that Q^π is the unique fixed point of B_π in {f ∈ R^C : ‖f‖_∞ ≤ V_max}.

Now let π* be the stationary policy defined in the statement of Proposition 3.3. By definition, for any c ∈ C, Q(c) = r(c) + γ max_{h∈H} Q(Φ(c), h) = r(c) + γ Q(Φ(c), π*(Φ(c))) = B_{π*} Q(c). Since B_{π*} has a unique fixed point Q^{π*} in {f ∈ R^C : ‖f‖_∞ ≤ V_max}, which is the IQ function for the stationary policy π*, clearly Q^{π*} = Q, and the optimal IQ function is attained by the optimal policy π*.

Lemma 3.4 (Characterization of Q). Assume Assumptions 2.5, 2.6, and γ · L_Φ < 1 (with L_Φ := 2L_P + 1 as in Lemma 2.9). Then Q of (3.1) is continuous.

The continuity of Q from Lemma 3.4, along with the compactness of H and Proposition 3.3, leads to the following existence of a stationary optimal policy.

Lemma 3.5. Assume Assumptions 2.5, 2.6 and γ · L_Φ < 1. Then there exists an optimal stationary policy π* : P(X) → H such that Q^{π*} = Q.

This existence of a stationary optimal policy is essential for the convergence analysis. In particular, it allows for comparing the optimal values of two MDPs with different action spaces: (MDP) and its variant defined in (5.9)-(5.10).

Note that the existence of a stationary optimal policy is well known when the state and action spaces are finite. Yet, we are unable to find any prior corresponding result for the continuous case.

4. MFC-K-Q Algorithm via Kernel Regression and Approximated Bellman Operator. In this section, we will develop a kernel-based Q-learning algorithm (MFC-K-Q) for the MFC problem with learning, based on (3.4).

Note from (3.4) that the MFC problem with learning differs from the classical MDP [43] in two aspects. First, the lifted state space P(X) and the lifted action space H are continuous, rather than discrete or finite. Second, the maximum in the Bellman operator is taken over a continuous space H.


To handle the lifted continuous state-action space, we use a kernel regression method on a discretized state-action space. Kernel regression is a local averaging approach for approximating the value at an unobserved state-action pair from observed data on a discretized set called an ε-net. Mathematically, a set Cε = {c_i = (µ_i, h_i)}_{i=1}^{Nε} is an ε-net for C if min_{1≤i≤Nε} d_C(c, c_i) < ε for all c ∈ C. Note that the compactness of C implies the existence of such an ε-net Cε. The choice of ε is critical for the convergence and the sample complexity analysis.

Correspondingly, we define the so-called kernel regression operator Γ_K : R^{Cε} → R^C by

Γ_K f(c) = ∑_{i=1}^{Nε} K(c_i, c) f(c_i),        (4.1)

where K(c_i, c) ≥ 0 is a weighted kernel function such that for all c ∈ C and c_i ∈ Cε,

∑_{i=1}^{Nε} K(c_i, c) = 1, and K(c_i, c) = 0 if d_C(c_i, c) > ε.        (4.2)

In fact, K can be of any form

K(c_i, c) = φ(c_i, c) / ∑_{j=1}^{Nε} φ(c_j, c),        (4.3)

with some function φ satisfying φ ≥ 0 and φ(x, y) = 0 when d_C(x, y) ≥ ε. (See Section 6 for some choices of φ.)
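As an illustration, the following minimal numpy sketch implements Γ_K in (4.1)-(4.3) with the triangular profile used in Section 6, here measured in the metric d_C; the helper names are ours, and any φ satisfying the conditions above can be substituted.

import numpy as np

# Kernel regression operator Gamma_K of (4.1)-(4.3): a normalized local average
# of the values f(c_i) at eps-net points c_i, with weights vanishing whenever
# d_C(c_i, c) > eps.  Points c are pairs (mu, h) as in the earlier sketches.

def d_C(c1, c2):
    (mu1, h1), (mu2, h2) = c1, c2
    return np.abs(mu1 - mu2).sum() + np.abs(h1 - h2).sum(axis=1).max()

def phi_triangular(ci, c, eps):
    d = d_C(ci, c)
    return max(eps - d, 0.0)

def kernel_weights(net, c, eps, phi=phi_triangular):
    w = np.array([phi(ci, c, eps) for ci in net])
    s = w.sum()
    if s == 0.0:                      # cannot happen for a true eps-net; fall back
        w[int(np.argmin([d_C(ci, c) for ci in net]))] = 1.0
        s = 1.0
    return w / s

def gamma_K(net, f_values, c, eps):
    # f_values[i] = f(c_i); returns (Gamma_K f)(c).
    return kernel_weights(net, c, eps) @ np.asarray(f_values)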

Meanwhile, to avoid maximizing over the continuous space H as in the Bellman equation (3.4), we take the maximum over an ε-net Hε on H. Here Hε is the ε-net on H induced from Cε, i.e., Hε contains all the action components appearing in Cε; its size is denoted by N_{Hε}.

The corresponding approximated Bellman operator Bε, acting on functions defined on the ε-net Cε, Bε : R^{Cε} → R^{Cε}, is given by

(Bε q)(c_i) = r(c_i) + γ max_{h∈Hε} Γ_K q(Φ(c_i), h).        (4.4)

Since (Φ(c_i), h) may not be on the ε-net, one needs to approximate the value at that point via the kernel regression Γ_K q(Φ(c_i), h).

In practice, one may only have access to noisy estimates {r̂(c_i), Φ̂(c_i)}_{i=1}^{Nε} instead of the accurate data {r(c_i), Φ(c_i)}_{i=1}^{Nε} on Cε. Taking this into consideration, Algorithm 4.1 consists of two steps (see the sketch and the listing below). First, it collects samples on C under an exploration policy; for each component c_i of the ε-net Cε, the estimated data (r̂(c_i), Φ̂(c_i)) are computed by averaging the samples in the ε-neighborhood of c_i. Second, fixed-point iteration is applied to the approximated Bellman operator built from {r̂(c_i), Φ̂(c_i)}_{i=1}^{Nε}. Under appropriate conditions, Algorithm 4.1 provides an accurate estimate of the true Q function with efficient sample complexity (see Theorem 5.5).
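Below is a compact Python sketch of these two steps, complementing the pseudocode of Algorithm 4.1. It assumes a simulator mfc_step(mu, h) returning the observed (Φ(µ, h), r(µ, h)), a pre-built ε-net net (a list of (µ_i, h_i) arrays) with induced action net H_eps, an exploration policy explore, and the gamma_K helper from the sketch above; all of these names are illustrative, not from the paper.

import numpy as np

# Sketch of Algorithm 4.1 (MFC-K-Q).  Phase 1 averages the noisy observations of
# (r, Phi) over each eps-neighborhood of the net; phase 2 iterates the
# approximated Bellman operator (5.2) with learning rate 1, exploiting the
# deterministic lifted dynamics.

def d_C(c1, c2):
    (mu1, h1), (mu2, h2) = c1, c2
    return np.abs(mu1 - mu2).sum() + np.abs(h1 - h2).sum(axis=1).max()

def mfc_k_q(mfc_step, net, H_eps, explore, mu0, eps, gamma, gamma_K,
            max_iters=500, tol=1e-8):
    n = len(net)
    r_hat = np.zeros(n)
    phi_hat = [np.zeros(len(mu0)) for _ in range(n)]
    counts = np.zeros(n, dtype=int)

    # Phase 1: explore until every eps-neighborhood of the net has been visited.
    mu = np.asarray(mu0, dtype=float)
    while counts.min() == 0:                      # finite in expectation by Thm 5.6
        h = explore(mu)
        mu_next, rew = mfc_step(mu, h)
        for i, ci in enumerate(net):
            if d_C(ci, (mu, h)) < eps:
                counts[i] += 1
                r_hat[i] += (rew - r_hat[i]) / counts[i]          # running mean
                phi_hat[i] += (mu_next - phi_hat[i]) / counts[i]  # running mean
        mu = mu_next

    # Phase 2: fixed-point iteration of the approximated Bellman operator.
    q = np.zeros(n)
    for _ in range(max_iters):
        q_new = np.array([
            r_hat[i] + gamma * max(gamma_K(net, q, (phi_hat[i], h), eps)
                                   for h in H_eps)
            for i in range(n)
        ])
        if np.abs(q_new - q).max() < tol:
            return q_new
        q = q_new
    return q

Because the lifted dynamics are deterministic, the iteration uses learning rate 1; this is exactly the point made in the discussion of stochastic vs. deterministic dynamics in Section 5.3.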

Algorithm 4.1 Kernel-based Q-learning Algorithm for MFC (MFC-K-Q)

1: Input: initial state distribution µ_0, ε > 0, ε-net on C: Cε = {c_i = (µ_i, h_i)}_{i=1}^{Nε}, exploration policy π taking actions from the Hε induced from Cε, regression kernel K on Cε.
2: Initialize: r̂(c_i) = 0, Φ̂(c_i) = 0, N(c_i) = 0 for all i.
3: repeat
4:   At the current state distribution µ_t, act h_t according to π; observe µ_{t+1} = Φ(µ_t, h_t) and r_t = r(µ_t, h_t).
5:   for 1 ≤ i ≤ Nε do
6:     if d_C(c_i, (µ_t, h_t)) < ε then
7:       N(c_i) ← N(c_i) + 1
8:       r̂(c_i) ← (N(c_i) − 1)/N(c_i) · r̂(c_i) + 1/N(c_i) · r_t
9:       Φ̂(c_i) ← (N(c_i) − 1)/N(c_i) · Φ̂(c_i) + 1/N(c_i) · µ_{t+1}
10:      end if
11:  end for
12: until N(c_i) > 0 for all i
13: Initialize: q_0(c_i) = 0 for all c_i ∈ Cε, l = 0.
14: repeat
15:   for c_i ∈ Cε do
16:     q_{l+1}(c_i) ← r̂(c_i) + γ max_{h∈Hε} Γ_K q_l(Φ̂(c_i), h)
17:   end for
18:   l ← l + 1
19: until convergence

5. Convergence and Sample Complexity Analysis of MFC-K-Q. In this section, we will establish the convergence of the MFC-K-Q algorithm and analyze its sample complexity. The convergence analysis in Section 5.1 relies on studying the fixed-point iteration of B̂ε, and the complexity analysis in Section 5.2 is based on an upper bound on the number of samples needed to visit each ε-neighborhood of the ε-net at least once.

In addition to Assumptions 2.5 and 2.6, the following conditions are needed for the convergence and sample complexity analysis.

Assumption 5.1 (Controllability of the dynamics). For all ε > 0, there exists M_ε ∈ N such that for any ε-net Hε on H and any µ, µ′ ∈ P(X), there exists an action sequence (h_1, . . . , h_m) with h_i ∈ Hε and m < M_ε, under which the state µ is driven into an ε-neighborhood of µ′.

Assumption 5.2 (Regularity of kernels). For any point c ∈ C, there exist at most N_K points c_i in Cε such that K(c_i, c) > 0. Moreover, there exists an L_K > 0 such that for all c ∈ Cε and c′, c″ ∈ C, |K(c, c′) − K(c, c″)| ≤ L_K · d_C(c′, c″).

Assumption 5.1 ensures that the dynamics are controllable. Assumption 5.2 is easy to satisfy: take a uniform grid as the ε-net; then N_K is roughly bounded from above by 2^{dim(C)}. Meanwhile, a number of commonly used kernels, including the triangular kernel in Section 6, satisfy the Lipschitz condition in Assumption 5.2.

5.1. Convergence Analysis. To start, recall the Lipschitz continuity of the aggregated reward r and dynamics Φ from Lemma 2.8 and Lemma 2.9. To simplify the notation, let us denote by L̄_r := R + 2L_r the Lipschitz constant of r and by L_Φ := 2L_P + 1 the Lipschitz constant of Φ.

Next, recall that there are three sources of approximation error in Algorithm 4.1: the kernel regression Γ_K on C with the ε-net Cε, the discretized action space Hε on H, and the sampled data r̂ and Φ̂ for the rewards and the dynamics.

The key idea of the convergence analysis is to decompose the error according to these sources and to analyze each decomposed error accordingly. That is, we consider the following types of Bellman operators:
• the operator B in (3.5) for (MDP);
• the operator B_{Hε} : R^C → R^C, which involves the discretized action space Hε,

B_{Hε} q(c) = r(c) + γ max_{h∈Hε} q(Φ(c), h);        (5.1)

• the operator Bε in (4.4) defined on the ε-net Cε, which involves the discretized action space Hε and the kernel approximation;
• the operator B̂ε : R^{Cε} → R^{Cε} defined by

(B̂ε q)(c_i) = r̂(c_i) + γ max_{h∈Hε} Γ_K q(Φ̂(c_i), h),        (5.2)

which involves the discretized action space Hε, the kernel approximation, and the estimated data;
• the operator T that maps {f ∈ R^{P(X)} : ‖f‖_∞ ≤ V_max} to itself, such that

T v(µ) = max_{h∈Hε} (r(µ, h) + γ v(Φ(µ, h))).        (5.3)

We will show that under mild assumptions, each of the above operators admits a unique fixed point.

Lemma 5.3. Assume Assumption 2.5 and let V_max := R/(1 − γ). Then,
• B in (3.5) has a unique fixed point in {f ∈ R^C : ‖f‖_∞ ≤ V_max}. That is, there exists a unique Q such that

(B Q)(c) = r(c) + γ max_{h∈H} Q(Φ(c), h).        (5.4)

• B_{Hε} in (5.1) has a unique fixed point in {f ∈ R^C : ‖f‖_∞ ≤ V_max}. That is, there exists a unique Q_{Hε} such that

(B_{Hε} Q_{Hε})(c) = r(c) + γ max_{h∈Hε} Q_{Hε}(Φ(c), h).        (5.5)

• Bε in (4.4) has a unique fixed point in {f ∈ R^{Cε} : ‖f‖_∞ ≤ V_max}. That is, there exists a unique Qε such that for any c_i ∈ Cε,

(Bε Qε)(c_i) = r(c_i) + γ max_{h∈Hε} Γ_K Qε(Φ(c_i), h).        (5.6)

• B̂ε in (5.2) has a unique fixed point in {f ∈ R^{Cε} : ‖f‖_∞ ≤ V_max}. That is, there exists a unique Q̂ε such that for any c_i ∈ Cε, with r̂ and Φ̂ sampled from the ε-neighborhood of c_i,

(B̂ε Q̂ε)(c_i) = r̂(c_i) + γ max_{h∈Hε} Γ_K Q̂ε(Φ̂(c_i), h).        (5.7)

• T has a unique fixed point V_{Hε} in {f ∈ R^{P(X)} : ‖f‖_∞ ≤ V_max}. That is,

T V_{Hε}(µ) = max_{h∈Hε} (r(µ, h) + γ V_{Hε}(Φ(µ, h))).        (5.8)

Lemma 5.4 (Characterization of Q_{Hε}). Assume Assumption 2.5. Then V_{Hε} in (5.8) is the optimal value function of the following MFC problem with continuous state space P(X) and discretized action space Hε:

V_{Hε}(µ) = sup_{π∈Πε} ∑_{t=0}^∞ γ^t r(µ_t, π_t(µ_t)),        (5.9)

with Πε := {π = {π_t}_{t=0}^∞ | π_t : P(X) → Hε}, subject to

µ_{t+1} = Φ(µ_t, π_t(µ_t)), µ_0 = µ.        (5.10)

Moreover, Q_{Hε} in (5.5) and V_{Hε} in (5.8) satisfy the relation

Q_{Hε}(µ, h) = r(µ, h) + γ V_{Hε}(Φ(µ, h)),        (5.11)

and Q_{Hε} is Lipschitz continuous.

This connection between Q_{Hε} and the optimal value function V_{Hε} of the MFC problem with continuous state space P(X) and discretized action space Hε is critical for estimating the error bounds in the convergence analysis.

Theorem 5.5 (Convergence). Assume Assumptions 2.5, 2.6, 5.1, 5.2, and γ · L_Φ < 1. Let B̂ε : R^{Cε} → R^{Cε} be the operator defined in (5.2),

(B̂ε q)(c_i) = r̂(c_i) + γ max_{h∈Hε} Γ_K q(Φ̂(c_i), h),

where r̂(c) and Φ̂(c) are sampled from an ε-neighborhood of c. Then B̂ε has a unique fixed point Q̂ε in {f ∈ R^{Cε} : ‖f‖_∞ ≤ V_max}. Moreover, the sup distance between Γ_K Q̂ε in (4.1) and Q in (3.1) satisfies

‖Q − Γ_K Q̂ε‖_∞ ≤ (L̄_r + 2γ N_K L_K V_max L_Φ)/(1 − γ) · ε + 2 L̄_r /((1 − γ L_Φ)(1 − γ)) · ε.        (5.12)

In particular, for a fixed ε, Algorithm 4.1 converges linearly to Q̂ε.

Proof of Theorem 5.5. The proof of the convergence quantifies ‖Q − Γ_K Q̂ε‖_∞ via the estimate

‖Q − Γ_K Q̂ε‖_∞ ≤ ‖Q − Q_{Hε}‖_∞ + ‖Q_{Hε} − Γ_K Qε‖_∞ + ‖Γ_K Qε − Γ_K Q̂ε‖_∞ =: (I) + (II) + (III).        (5.13)

Here (I) can be regarded as the approximation error from discretizing the lifted action space H by Hε; (II) is the error from the kernel regression on C with the ε-net Cε; and (III) is the error introduced by the sampled data r̂ and Φ̂.


Step 1. We shall use Lemma 3.5 and Lemma 5.4 to show that ‖Q − Q_{Hε}‖_∞ ≤ L̄_r/((1 − γL_Φ)(1 − γ)) · ε. By Lemma 5.4, Q(c) − Q_{Hε}(c) = γ(V(Φ(c)) − V_{Hε}(Φ(c))), where V is the optimal value function of the problem on P(X) and H in (MDP), and V_{Hε} is the optimal value function of the problem on P(X) and Hε in (5.9)-(5.10). Hence it suffices to prove that ‖V − V_{Hε}‖_∞ ≤ L̄_r/((1 − γL_Φ)(1 − γ)) · ε. We adopt a similar strategy as in the proof of Lemma 3.5.

Let π* be the optimal policy of (MDP), whose existence is shown in Lemma 3.5. For any µ ∈ P(X), let (µ, h) = (µ_0, h_0), (µ_1, h_1), (µ_2, h_2), . . . , (µ_t, h_t), . . . be the trajectory of the system under the optimal policy π*, starting from µ. We have V(µ) = ∑_{t=0}^∞ γ^t r(µ_t, h_t).

Now let h_{i_t} be the nearest neighbor of h_t in Hε, so that d_H(h_{i_t}, h_t) ≤ ε. Consider the trajectory of the system starting from µ and taking h_{i_0}, . . . , h_{i_t}, . . ., and denote the corresponding states by µ′_t. We have V_{Hε}(µ) ≥ ∑_{t=0}^∞ γ^t r(µ′_t, h_{i_t}), since V_{Hε} is the optimal value function. Moreover,

d_{P(X)}(µ′_t, µ_t) = d_{P(X)}(Φ(µ′_{t−1}, h_{i_{t−1}}), Φ(µ_{t−1}, h_{t−1})) ≤ L_Φ · (d_{P(X)}(µ′_{t−1}, µ_{t−1}) + ε).

By iteration, we have d_{P(X)}(µ′_t, µ_t) ≤ (L_Φ − L_Φ^{t+1})/(1 − L_Φ) · ε. Therefore

|r(µ′_t, h_{i_t}) − r(µ_t, h_t)| ≤ L̄_r · (d_{P(X)}(µ′_t, µ_t) + ε) ≤ L̄_r · (L_Φ^{t+1} − 1)/(L_Φ − 1) · ε,

which implies

0 ≤ V(µ) − V_{Hε}(µ) ≤ ∑_{t=0}^∞ γ^t (r(µ_t, h_t) − r(µ′_t, h_{i_t})) ≤ ∑_{t=0}^∞ γ^t · L̄_r · (L_Φ^{t+1} − 1)/(L_Φ − 1) · ε = L̄_r/((1 − γL_Φ)(1 − γ)) · ε.

Here 0 ≤ V (µ)− VHε(µ) is by the optimality of VC .Step 2. We shall use Lemmas 5.3 and 5.4 to show that ||QHε − ΓKQε||∞ ≤ Lr

(1−γLΦ)(1−γ) · ε.Note that

||ΓKQε −QHε ||∞=||ΓKBεQε −QHε ||∞ = ||ΓKBHεΓKQε −QHε ||∞≤||ΓKBHεΓKQε − ΓKBHεQHε ||∞ + ||ΓKBHεQHε −QHε ||∞=||ΓKBHεΓKQε − ΓKBHεQHε ||∞ + ||ΓKQHε −QHε ||∞≤γ||ΓKQε −QHε ||∞ + ||ΓKQHε −QHε ||∞.

Here the first and the third equalities come from the fact that Qε is the fixed point of Bε andQHε is the fixed point of BHε . The second inequality is by the fact that ΓK is a non-expansionmapping, i.e., ‖ΓKf‖∞ ≤ ‖f‖∞, and that BHε is a contraction with modulus γ with thesupremum norm. Meanwhile, for any Lipschitz function f ∈ RC with Lipschitz constant L, wehave for all c ∈ C,

|ΓKf(c)− f(c)| =Nε∑i=1

K(c, ci)|f(ci)− f(c)| ≤Nε∑i=1

K(c, ci)εL = εL.

Page 16: Q-Learning Algorithm for Mean-Field Controls, with ... · Q-Learning Algorithm for Mean-Field Controls, with Convergence and Complexity Analysis Haotian Gu Xin Guo Xiaoli Wei Renyuan

16 GU, GUO, WEI AND XU

Note here the inequality follows from K(c, ci) = 0 for all dC(c, ci) ≥ ε. Therefore,

||ΓKQε −QHε ||∞ ≤LQHε1− γ

ε.

where LQHε = Lr1−γLΦ

is the Lipschitz constant for QHε .Final step. Let q0 denote the zero function on Cε. By Lemma 5.3, Qε = limn→∞B

nε q0, and

Qε = limn→∞ Bnε q0. Denote qn := Bn

ε q0, qn := Bnε q0, and en := ||qn − qn||∞. For any c ∈ Cε,

en+1(c)

=∣∣r(c) + γ max

h∈HεΓK qn(Φ(c), h)− r(c)− γ max

h∈HεΓKqn(Φ(c), h)

∣∣≤|r(c)− r(c)|+ γ max

h∈Hε

∣∣ΓK qn(Φ(c), h)− ΓKqn(Φ(c), h)∣∣

≤εLr + γ maxh∈Hε

[|ΓK qn(Φ(c), h)− ΓK qn(Φ(c), h)|+ |ΓK qn(Φ(c), h)− ΓKqn(Φ(c), h)|

].

Here |r(c)− r(c)| ≤ εLr because r(c) is sampled from an ε-neighborhood of c and by Assump-tion 2.5. Moreover, for any fixed h,

|ΓK qn(Φ(c), h)− ΓK qn(Φ(c), h)| = |Nε∑i=1

(K(ci, (Φ(c), h))−K(ci, (Φ(c), h)))qn(ci)|

≤ 2NKLKVmax · dP(X )(Φ(c),Φ(c))

≤ 2NKLKVmaxLΦε.

The first inequality comes from Assumption 5.2, because K(ci, (Φ(c), h))−K(ci, (Φ(c), h)) isnonzero for at most 2NK index i ∈ 1, 2, . . . , Nε, K is Lipschitz continuous, and ||qn||∞ ≤ Vmax.The second inequality comes from the fact that Φ(c) is sampled from an ε-neighborhood of cand by Assumption 2.6. Meanwhile,

|ΓK qn(Φ(c), h)− ΓKqn(Φ(c), h)| ≤ ||qn − qn||∞ = en,

since Γ is non-expansion. Putting these pieces together, we have

en+1 = maxc∈Cε

en+1(c) ≤ εLr + εγ2NKLKVmaxLΦ + γen.

In this case, elementary algebra shows that en ≤ ε · Lr+γ2NKLKVmaxLΦ1−γ , ∀n. Then since ΓK is

non-expansion, ||ΓKQCε − ΓKQε||∞ ≤ ε · Lr+γ2NKLKVmaxLΦ1−γ , hence the error bound (5.12).

The claim regarding the convergence rate follows from the γ−contraction of the operatorBε.

5.2. Sample Complexity Analysis. In classical Q-learning for MDPs with a stochastic environment, every component of the ε-net is required to be visited a number of times in order to obtain a desirable estimate of the Q function. The usual terminology covering time refers to the expected number of steps needed to visit every component of the ε-net at least once, for a given exploration policy. The complexity analysis thus focuses on the necessary number of rounds of the covering time.

In contrast, with deterministic dynamics, visiting each component of the ε-net once is sufficient. We will demonstrate that using the deterministic mean-field dynamics to approximate the N-agent stochastic environment indeed significantly reduces the complexity.

To start, denote by T_{C,π} the covering time of the ε-net under a (random) policy π:

T_{C,π} := sup_{µ∈P(X)} inf{ t > 0 : µ_0 = µ, ∀ c_i ∈ Cε, ∃ t_i ≤ t such that (µ_{t_i}, h_{t_i}) is in the ε-neighborhood of c_i, under the policy π }.

Recall that an ε′-greedy policy on Hε is a policy which, with probability at least ε′, explores uniformly over the actions in Hε. Note that this type of policy always exists, and we have the following sample complexity result.
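For instance, an ε′-greedy exploration policy on Hε can be sketched as follows (greedy is a placeholder base policy of our own choosing; any base policy works for the covering-time bound):

import numpy as np

# eps'-greedy exploration on the finite action net H_eps: with probability
# eps_prime the policy picks an action uniformly from H_eps, otherwise it
# follows the (hypothetical) base policy `greedy`.

def make_eps_greedy(H_eps, greedy, eps_prime, seed=None):
    rng = np.random.default_rng(seed)
    def policy(mu):
        if rng.random() < eps_prime:
            return H_eps[rng.integers(len(H_eps))]
        return greedy(mu)
    return policy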

Theorem 5.6 (Sample complexity). Given Assumption 5.1, for any ε′ > 0, let π_{ε′} be an ε′-greedy policy on Hε. Then

E[T_{C,π_{ε′}}] ≤ (M_ε + 1) · (N_{Hε})^{M_ε+1} / (ε′)^{M_ε+1} · log(Nε).        (5.14)

Moreover, with probability 1 − δ, for any initial state, under the ε′-greedy policy the dynamics will visit each ε-neighborhood of the elements of Cε at least once after

(M_ε + 1) · (N_{Hε})^{M_ε+1} / (ε′)^{M_ε+1} · log(Nε) · e · log(1/δ)        (5.15)

time steps, where log(Nε) = Θ(|X||U| log(1/ε)) and N_{Hε} = Θ((1/ε)^{(|U|−1)|X|}).

Theorem 5.6 provides an upper bound of order poly((1/ε) · log(1/δ)) for the covering time under the ε′-greedy policy, in terms of the size of the ε-net and the accuracy 1/δ. The proof of Theorem 5.6 relies on the following lemma.

Lemma 5.7. Assume that for some policy π, E[T_{C,π}] ≤ T < ∞. Then with probability 1 − δ, for any initial state µ, under the policy π the dynamics will visit each ε-neighborhood of the elements of Cε at least once after T · e · log(1/δ) time steps, i.e., P(T_{C,π} ≤ T · e · log(1/δ)) ≥ 1 − δ.

Proof of Theorem 5.6. Recall that there are Nε different pairs in the ε-net. Denote the ε-neighborhoods of these pairs by Bε = {B_i}_{i=1}^{Nε}. Without loss of generality, we may assume that the B_i are disjoint, since the covering time only becomes smaller if they overlap with each other. Let T_k := min{t > 1 : k of the sets in Bε have been visited}; then T_k − T_{k−1} is the time to visit a new neighborhood after k − 1 neighborhoods have been visited. By Assumption 5.1, for any B_i ∈ Bε with center (µ_i, h_i) and any µ ∈ P(X), there exists a sequence of actions in Hε, of length at most M_ε, such that starting from µ and taking that sequence of actions leads to a visit of the ε-neighborhood of µ_i. Then, at that point, taking h_i yields a visit of B_i. Hence for all B_i ∈ Bε and µ ∈ P(X),

P(B_i is visited within M_ε + 1 steps | µ_{T_{k−1}} = µ) ≥ (ε′/N_{Hε})^{M_ε+1},

P(a new neighborhood is visited within M_ε + 1 steps | µ_{T_{k−1}} = µ) ≥ (Nε − k + 1) · (ε′/N_{Hε})^{M_ε+1}.

This implies E[T_k − T_{k−1}] ≤ (M_ε + 1)/(Nε − k + 1) · (N_{Hε}/ε′)^{M_ε+1}. Summing E[T_k − T_{k−1}] from k = 1 to k = Nε yields the desired result. The second part follows directly from Lemma 5.7. Meanwhile, N_{Hε}, the size of the ε-net on H, is Θ((1/ε)^{(|U|−1)|X|}), because H is a compact (|U| − 1)|X|-dimensional manifold. Similarly, Nε = Θ((1/ε)^{|U||X|−1}), as C is a compact (|U||X| − 1)-dimensional manifold.

5.3. Discussions.

Related works on kernel-based reinforcement learning. The kernel method is a popular dimension reduction technique that maps high-dimensional features into a low-dimensional space that best represents the original features. This technique was first introduced to RL by [33, 32], in which a kernel-based reinforcement learning algorithm (KBRL) was proposed to handle the continuity of the state space. Subsequent works demonstrated the applicability of KBRL to large-scale problems and to various types of RL algorithms ([2], [45] and [52]). However, there is no prior work on convergence rate or sample complexity analysis.

Our kernel regression idea is closely related to [39], which combined Q-learning with kernel-based nearest neighbor regression to study continuous-state stochastic MDPs with sample complexity guarantees. However, our problem setting and our technique for the error bound analysis are different from theirs. In particular, Theorem 5.5 involves both action space approximation and state space approximation, whereas [39] has only state space approximation and their action space is finite. The error control in [39] was obtained via martingale concentration inequalities, whereas ours is obtained from the regularity of the underlying dynamics.

Stochastic vs. deterministic dynamics. We reiterate that, unlike learning algorithms for stochastic dynamics where the learning rate η_t must be chosen to guarantee the convergence of the Q function, MFC-K-Q directly performs the fixed-point iteration of the approximated Bellman operator B̂ε on the sampled data set and sets the learning rate to 1, to fully exploit the deterministic nature of the dynamics. Consequently, the complexity analysis of this algorithm is significantly simplified. By comparison, for stochastic systems each component of the ε-net has to be visited sufficiently many times to obtain a decent estimate in Q-learning.

Sample complexity comparison. Theorem 5.6 shows that the sample complexity of MFC with learning is Ω(poly((1/ε) · log(1/δ))), instead of exponential in N as with existing algorithms for cooperative MARL in Proposition 2.1. A careful reading reveals that this complexity analysis holds for other exploration schemes, including Gaussian exploration and Boltzmann exploration, as long as Lemma 5.7 holds.

Centralized training with decentralized execution. When the rewards can be decomposed additively across agent observations, [42] developed the centralized-training-with-decentralized-execution scheme to reduce the communication cost and computational complexity. If agents only have access to partial observations, [17] and [12] showed empirically that using a single shared policy network across all agents is an efficient training mechanism.

It is worth mentioning that our MFC framework is fully adaptable to this practically popular "centralized training with decentralized execution" scheme. For example, if

r_t = ∑_{s,a} r(µ_t(s), ν_t(a)) 1_{(s_t=s, a_t=a)},

then

Q^π_{global}(µ, h) = E[ ∑_{t=0}^∞ γ^t ∑_{s,a} r(µ_t(s), ν_t(a)) 1_{(s_t=s, a_t=a)} | s_0 ∼ µ, a_0 ∼ h(s_0, ·), a_t ∼ π ]
                  = ∑_{s,a} Q^π_{local}(µ(s), h(s, a)) µ(s) h(s, a).

With this scheme, the communication efficiency of our algorithm can be improved, as agents only need to maintain a "local" Q table (Q_local) for execution while the central controller learns and updates the "global" Q table (Q_global).

Single-agent MDP with continuous state-action space and deterministic dynamics. In addition to the problem of MFC with learning, our MFC-K-Q algorithm can also be applied to single-agent MDP problems with continuous state-action spaces and deterministic dynamics, with similar convergence and sample complexity analysis. This may be of independent research interest.

Convergence under different norms. Our main assumptions and results adopt the infinity norm (‖·‖_∞) for ease of exposition. They can also be established under the L^p norm (‖·‖_p) to allow for neural network approximations in Q-learning. In addition, by properly controlling the Lipschitz constant, the empirical performance of the neural network approximation may be further improved ([1]).

Extensions to other settings. For future research, we are interested in extending our framework and learning algorithm to other variants of mean-field controls, including risk-sensitive mean-field controls ([3], [7], and [8]), robust mean-field controls ([48]), and partially observed mean-field controls ([7, 37]).

6. Experiments. We test the MFC-K-Q algorithm on a network traffic congestion control problem. In the network there are senders and receivers. Multiple senders share a single communication link which has an unknown and limited bandwidth. When the total sending rate from these senders exceeds the shared bandwidth, packets may be lost. Each sender streams data packets to its receiver and receives feedback from the receiver on success or failure in the form of packet acknowledgements (ACKs). (See Figure 1 for an illustration and [21] for a similar set-up.) The control problem for each sender is to send packets as fast as possible while keeping the risk of packet loss as small as possible. Given a large interacting population of senders, the exact dynamics of the system and the rewards are unknown; it is thus natural to formulate this control problem in the framework of MFC with learning.

Figure 1. Multiple network traffic flows sharing the same link.


6.1. Set-up.

States. For a representative agent in the MFC problem with learning, at the beginning of each round t the state x_t is her inventory (currently unsent packet units), taking values in X = {0, . . . , |X| − 1}. Denote by µ_t := {µ_t(x)}_{x∈X} the population state distribution over X.

Actions. The action is the sending rate. At the beginning of each round t, the agent can adjust her sending rate u_t, which remains fixed on [t, t + 1). Here we assume u_t ∈ U = {0, . . . , |U| − 1}. Denote by h_t = {h_t(x)(u)}_{x∈X, u∈U} the policy from the central controller.

Limited bandwidth and packet loss. A system with N agents has a shared link of unknown bandwidth cN (c > 0). In the mean-field limit N → ∞, F_t = ∑_{x∈X, u∈U} u h_t(x)(u) µ_t(x) is the average sending rate at time t. If F_t > c, then with probability (F_t − c)/F_t each agent's packet will be lost.

MFC dynamics. At time t + 1, the state of the representative agent moves from x_t to x_t − u_t. Overshooting is not allowed: u_t ≤ x_t. Meanwhile, at the end of each round, some packets are added to each agent's packet sending queue. The packet fulfillment consists of two scenarios. First, a lost packet is added back to the original queue. Second, once the inventory hits zero, a random fulfillment with uniform distribution Unif(X) is added to her queue. That is,

x_{t+1} = x_t − u_t + u_t 1_t(L) + (1 − 1_t(L)) I(u_t = x_t) · U_t,

where 1_t(L) = I(packet is lost in round t), with I an indicator function and U_t ∼ Unif(X).

Evolution of the population state distribution µ_t. Define, for x ∈ X,

µ̃_t(x) = ∑_{x′≥x} µ_t(x′) h_t(x′)(x′ − x) (1 − I(F_t>c) (F_t − c)/F_t) + µ_t(x) I(F_t>c) (F_t − c)/F_t.

Then µ̃_t represents the population distribution after the first step of task fulfillment and before the second step of task fulfillment. Finally, for x ∈ X,

µ_{t+1}(x) = (µ̃_t(x) + µ̃_t(0)/|X|) I(x≠0) + µ̃_t(0)/|X| I(x=0)

describes the transition of the flows µ_{t+1} = Φ(µ_t, h_t).

Rewards. Consistent with [9] and [21], the reward function, depending on throughput and latency and with a loss penalty, is defined as r = a · throughput − b · latency² − d · loss, with a, b, d ≥ 0.
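A minimal numpy sketch of one step of this mean-field traffic model, i.e. (µ_{t+1}, r_t) = G(µ_t, h_t), is given below. The throughput, latency, and loss terms in the reward are simple placeholders of our own choosing, since the text only specifies the general form r = a·throughput − b·latency² − d·loss.

import numpy as np

# One step of the mean-field traffic-congestion dynamics of Section 6.1.
# mu: population state distribution over X = {0, ..., |X|-1} (inventories);
# h : policy array of shape (|X|, |U|) with h[x, u] = h_t(x)(u) and h[x, u] = 0 for u > x.

def traffic_mfc_step(mu, h, c=0.4, a=30.0, b=10.0, d=50.0):
    n_x, n_u = h.shape
    F = float(np.sum(mu[:, None] * h * np.arange(n_u)[None, :]))  # average sending rate
    p_loss = (F - c) / F if F > c else 0.0                        # per-packet loss prob.

    # Step 1: sending and possible loss (the distribution mu_tilde in the text).
    mu_tilde = np.zeros(n_x)
    for x in range(n_x):
        for xp in range(x, n_x):
            u = xp - x                                            # rate moving x' -> x
            if u < n_u:
                mu_tilde[x] += mu[xp] * h[xp, u] * (1.0 - p_loss)
        mu_tilde[x] += mu[x] * p_loss                             # lost packets return

    # Step 2: uniform random refill once the inventory hits zero.
    mu_next = mu_tilde + mu_tilde[0] / n_x
    mu_next[0] = mu_tilde[0] / n_x

    # Placeholder reward of the form r = a*throughput - b*latency^2 - d*loss.
    throughput = F * (1.0 - p_loss)                               # delivered rate
    latency = float(mu @ np.arange(n_x))                          # mean inventory (proxy)
    loss = F * p_loss                                             # lost rate
    return mu_next, a * throughput - b * latency ** 2 - d * loss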

6.2. Performance of the MFC-K-Q Algorithm. We first test the convergence property and performance of MFC-K-Q (Algorithm 4.1) for this traffic control problem with different kernel choices and with varying N. We then compare MFC-K-Q with the MFQ algorithm [6] on MFC, and with Deep PPQ [21] and PCC-VIVACE [9] on MARL.

We assume access to an MFC simulator G(µ, h) = (µ′, r). That is, for any pair (µ, h) ∈ C, we can sample the aggregated population reward r and the next population state distribution µ′ under policy h. We sample G(µ, h) = (µ′, r) once for each (µ, h) ∈ Cε. Within each outer iteration, each update of a (µ, h) ∈ Cε is one inner iteration; therefore, the total number of inner iterations within each outer iteration equals |Cε|.

Applying the MFC policy to the N-agent game. To measure the performance of an MFC policy π in the N-agent set-up, we apply π to the empirical state distribution of the N agents.


Performance criteria. We assume access to an N-agent simulator G_N(x, u) = (x′, r). That is, if the agents take the joint action u from the state x, we can observe the joint reward r and the next joint state x′. We evaluate different policies in the N-agent environment.

We randomly sample K initial states {x^k_0 ∈ X^N}_{k=1}^K, apply the policy π to each initial state x^k_0, and collect the rewards along each path for T_0 rounds, {r̄^π_{k,t}}_{t=1}^{T_0}. Here r̄^π_{k,t} = (1/N) ∑_{i=1}^N r^{π,i}_{k,t} is the average reward of the N agents in round t under policy π. Then R^π_N(x^k_0) := ∑_{t=1}^{T_0} γ^t r̄^π_{k,t} is used to approximate the value function under policy π when T_0 is large.

Two performance criteria are used (see the sketch after the experiment set-up below): the first one, C^{(1)}_N(π) = (1/K) ∑_{k=1}^K R^π_N(x^k_0), measures the average reward of policy π; the second one, C^{(2)}_N(π_1, π_2) = (1/K) ∑_{k=1}^K (R^{π_1}_N(x^k_0) − R^{π_2}_N(x^k_0))/R^{π_1}_N(x^k_0), measures the relative improvement of using policy π_1 instead of policy π_2.

Experiment set-up. We set γ = 0.5, a = 30, b = 10, d = 50, c = 0.4, M = 2, K = 500 and T_0 = 30, and compare policies with N = 5n agents (n = 1, 2, · · ·, 20). For the ε-net, we take uniform grids with ε distance between adjacent points on the net. The confidence intervals are calculated over 20 repeated experiments.
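As referenced above, the criteria C^{(1)}_N and C^{(2)}_N can be computed from rollouts of an N-agent simulator as in the following sketch; the simulator and policy interfaces are assumptions of ours.

import numpy as np

# R^pi_N(x_0) = sum_{t=1}^{T0} gamma^t * (average reward over the N agents in round t),
# estimated by rolling out the policy in the N-agent simulator
#   simulator(x, u) -> (x_next, rewards)    with rewards an array of length N.
# The policy maps (agent state, empirical state distribution) to an action.

def rollout_return(simulator, policy, x0, n_states, gamma, T0):
    x = np.asarray(x0)
    total = 0.0
    for t in range(1, T0 + 1):
        mu_emp = np.bincount(x, minlength=n_states) / len(x)   # empirical distribution
        u = np.array([policy(xi, mu_emp) for xi in x])
        x, rewards = simulator(x, u)
        total += gamma ** t * float(np.mean(rewards))
    return total

def criterion_1(simulator, policy, initial_states, n_states, gamma, T0):
    return float(np.mean([rollout_return(simulator, policy, x0, n_states, gamma, T0)
                          for x0 in initial_states]))

def criterion_2(simulator, pi1, pi2, initial_states, n_states, gamma, T0):
    rel = []
    for x0 in initial_states:
        R1 = rollout_return(simulator, pi1, x0, n_states, gamma, T0)
        R2 = rollout_return(simulator, pi2, x0, n_states, gamma, T0)
        rel.append((R1 - R2) / R1)
    return float(np.mean(rel))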

Figure 2. Performance comparison among different kernels: (a) convergence of the Q function; (b) $C^{(1)}_N$: average reward.

Results with different kernels. We use the following kernels with hyper-parameter $\epsilon$: triangular, (truncated) Gaussian, and (truncated) constant kernels. That is,
$$
\phi^{(1)}_{\epsilon}(x, y) = \mathbf{1}_{\|x-y\|_2 \le \epsilon}\,\big|\epsilon - \|x-y\|_2\big|, \quad
\phi^{(2)}_{\epsilon}(x, y) = \mathbf{1}_{\|x-y\|_2 \le \epsilon}\,\tfrac{1}{\sqrt{2\pi}}\exp\!\big(-|\epsilon - \|x-y\|_2|^2\big), \quad
\phi^{(3)}_{\epsilon}(x, y) = \mathbf{1}_{\|x-y\|_2 \le \epsilon}.
$$
We run the experiments with the normalized kernels
$$
K^{(j)}_{\epsilon}(c_i, c) = \frac{\phi^{(j)}_{\epsilon}(c_i, c)}{\sum_{i=1}^{N_\epsilon} \phi^{(j)}_{\epsilon}(c_i, c)},
$$
for $j = 1, 2, 3$ and $\epsilon = 0.1$.
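The three kernels and their normalized weights can be written out directly; the sketch below uses our own helper names, with $c$ and the net points $c_i$ represented as flattened feature vectors.

```python
# Kernel weights K^(j)_eps(c_i, c) over the epsilon-net.
import numpy as np

def phi(j, eps, x, y):
    d = np.linalg.norm(x - y)
    if d > eps:
        return 0.0
    if j == 1:                                        # triangular
        return abs(eps - d)
    if j == 2:                                        # truncated Gaussian
        return np.exp(-abs(eps - d) ** 2) / np.sqrt(2 * np.pi)
    return 1.0                                        # truncated constant

def kernel_weights(j, eps, net_points, c):
    """K^(j)_eps(c_i, c) for all net points c_i; weights sum to 1 when nonzero."""
    raw = np.array([phi(j, eps, ci, c) for ci in net_points])
    total = raw.sum()
    return raw / total if total > 0 else raw
```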

All kernels lead to convergence of the Q function within 15 outer iterations (Figure 2a). When $N \le 10$, the performances of all kernels are similar, since the $\epsilon$-net is accurate for games with $N = \tfrac{1}{\epsilon}$ agents. When $N \ge 15$, $K^{(1)}_{0.1}$ performs the best and $K^{(3)}_{0.1}$ the worst (Figure 2b), implying that treating all nearby $\epsilon$-net points with equal weights yields relatively poor performance.

Further comparison of the $K^{(j)}_{0.1}$'s suggests that, for problems with particular structures of the Q function, an appropriate choice of kernel helps reduce the error introduced by a fixed $\epsilon$-net.


Figure 3. Comparison between $K^{(1)}_{0.1}(x, y)$ and $k$-NN ($k = 1, 3$): (a) convergence of the Q function; (b) $C^{(1)}_N$: average reward; (c) $C^{(2)}_N$: improvement of $K^{(1)}_{0.1}$ over 1-NN; (d) $C^{(2)}_N$: improvement of $K^{(1)}_{0.1}$ over 3-NN.

Results with different $k$-nearest neighbors. We compare the kernel $K^{(1)}_{0.1}(x, y)$ with the $k$-nearest-neighbor ($k$-NN) method ($k = 1, 3$). Here 1-NN is the projection approach in which each point is projected onto the closest point in $\mathcal{C}_\epsilon$, a simple method for continuous state and action spaces [31, 46].
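For comparison, the $k$-NN regressor can be written in the same weight form as the kernels above: 1-NN puts all weight on the closest net point (the projection of [31, 46]), while $k$-NN averages the $k$ closest points uniformly. The sketch below uses our own helper name.

```python
# k-NN weights over the epsilon-net, as a degenerate (uniform) kernel.
import numpy as np

def knn_weights(k, net_points, c):
    """Uniform weights on the k net points closest to c; weights sum to 1."""
    dists = np.array([np.linalg.norm(ci - c) for ci in net_points])
    nearest = np.argsort(dists)[:k]
    w = np.zeros(len(net_points))
    w[nearest] = 1.0 / k
    return w
```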

Both $K^{(1)}_{0.1}(x, y)$ and $k$-NN converge within 15 outer iterations. The performances of $K^{(1)}_{0.1}(x, y)$ and $k$-NN are similar when $N \le 10$. However, $K^{(1)}_{0.1}(x, y)$ outperforms both 1-NN and 3-NN for large $N$ under both criteria $C^{(1)}_N$ and $C^{(2)}_N$: under $C^{(1)}_N$, $K^{(1)}_{0.1}(x, y)$, 1-NN, and 3-NN have average rewards of 1.4, 1.07, and 1.2, respectively, when $N \ge 65$; under $C^{(2)}_N$, $K^{(1)}_{0.1}(x, y)$ outperforms 1-NN and 3-NN by 15% and 13%, respectively, when $N = 10$, by 29% and 21% when $N = 15$, and by 25% and 16% when $N \ge 60$.

Comparison with other algorithms. We compare MFC-K-Q with $K^{(1)}_{0.1}$ against three representative algorithms: MFQ from [6], Deep PPQ from [21], and PCC-VIVACE from [9]. Our experiment demonstrates the superior performance of MFC-K-Q.

• When $N > 40$, MFC-K-Q dominates all three algorithms (Figure 4a) and it learns the bandwidth parameter $c$ most accurately (Figure 4b). Despite being the best performer when $N < 35$, Deep PPQ suffers from the "curse of dimensionality" and its performance gets increasingly worse as $N$ increases;
• MFC-K-Q with $K^{(1)}_{0.1}$ dominates MFQ, which performs similarly to our worst performer, MFC-K-Q with 1-NN. In general, kernel regression performs better than the simple projection (adopted in MFQ), where only one point is used to estimate $Q$;
• the decentralized PCC-VIVACE has the worst performance. Moreover, it is insensitive to the bandwidth parameter $c$ (see Figure 4b).

Figure 4. Performance comparison among different algorithms: (a) $C^{(1)}_N$: average reward; (b) average sending flow.

REFERENCES

[1] K. Asadi, D. Misra, and M. L. Littman, Lipschitz continuity in model-based reinforcement learning, arXiv preprint arXiv:1804.07193, (2018).
[2] A. M. Barreto, D. Precup, and J. Pineau, Practical kernel-based reinforcement learning, Journal of Machine Learning Research, 17 (2016), pp. 2372–2441.
[3] A. Bensoussan, B. Djehiche, H. Tembine, and P. Yam, Risk-sensitive mean-field-type control, in 2017 IEEE 56th Annual Conference on Decision and Control, IEEE, 2017, pp. 33–38.
[4] R. Carmona and F. Delarue, Probabilistic Theory of Mean Field Games with Applications I-II, Springer, 2018.
[5] R. Carmona, M. Laurière, and Z. Tan, Linear-quadratic mean-field reinforcement learning: Convergence of policy gradient methods, arXiv preprint arXiv:1910.04295, (2019).
[6] R. Carmona, M. Laurière, and Z. Tan, Model-free mean-field reinforcement learning: Mean-field MDP and mean-field Q-learning, arXiv preprint arXiv:1910.12802, (2019).
[7] B. Djehiche and H. Tembine, Risk-sensitive mean-field type control under partial observation, in Stochastics of Environmental and Financial Economics, Springer, Cham, 2016, pp. 243–263.
[8] B. Djehiche, H. Tembine, and R. Tempone, A stochastic maximum principle for risk-sensitive mean-field type control, IEEE Transactions on Automatic Control, 60 (2015), pp. 2640–2649.
[9] M. Dong, T. Meng, D. Zarchy, E. Arslan, Y. Gilad, B. Godfrey, and M. Schapira, PCC Vivace: Online-learning congestion control, in 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18), Renton, WA, Apr. 2018, USENIX Association, pp. 343–356.
[10] S. El-Tantawy, B. Abdulhai, and H. Abdelgawad, Multiagent reinforcement learning for integrated network of adaptive traffic signal controllers (MARLIN-ATSC): Methodology and large-scale application on downtown Toronto, IEEE Transactions on Intelligent Transportation Systems, 14 (2013), pp. 1140–1150.
[11] E. Even-Dar and Y. Mansour, Learning rates for Q-learning, Journal of Machine Learning Research, 5 (2003), pp. 1–25.
[12] J. Foerster, I. A. Assael, N. De Freitas, and S. Whiteson, Learning to communicate with deep multi-agent reinforcement learning, in Advances in Neural Information Processing Systems, 2016, pp. 2137–2145.
[13] Z. Fu, Z. Yang, Y. Chen, and Z. Wang, Actor-critic provably finds Nash equilibria of linear-quadratic mean-field games, arXiv preprint arXiv:1910.07498, (2019).
[14] J. Gärtner, On the McKean-Vlasov limit for interacting diffusions, Mathematische Nachrichten, 137 (1988), pp. 197–248.
[15] H. Gu, X. Guo, X. Wei, and R. Xu, Dynamic programming principles for learning MFCs, arXiv preprint arXiv:1911.07314, (2019).
[16] X. Guo, A. Hu, R. Xu, and J. Zhang, Learning mean-field games, in Advances in Neural Information Processing Systems, 2019, pp. 4966–4976.
[17] J. K. Gupta, M. Egorov, and M. Kochenderfer, Cooperative multi-agent control using deep reinforcement learning, in International Conference on Autonomous Agents and Multiagent Systems, Springer, 2017, pp. 66–83.
[18] P. Hernandez-Leal, B. Kartal, and M. E. Taylor, A survey and critique of multiagent deep reinforcement learning, Autonomous Agents and Multi-Agent Systems, 33 (2019), pp. 750–797.
[19] M. Huang, P. E. Caines, and R. P. Malhamé, Large-population cost-coupled LQG problems with nonuniform agents: individual-mass behavior and decentralized ε-Nash equilibria, IEEE Transactions on Automatic Control, 52 (2007), pp. 1560–1571.
[20] K. Iyer, R. Johari, and M. Sundararajan, Mean field equilibria of dynamic auctions with learning, Management Science, 60 (2014), pp. 2949–2970.
[21] N. Jay, N. Rotman, B. Godfrey, M. Schapira, and A. Tamar, A deep reinforcement learning perspective on internet congestion control, in Proceedings of the 36th International Conference on Machine Learning, 2019, pp. 3050–3059.
[22] J. Jin, C. Song, H. Li, K. Gai, J. Wang, and W. Zhang, Real-time bidding with multi-agent reinforcement learning in display advertising, in Proceedings of the 27th ACM International Conference on Information and Knowledge Management, 2018, pp. 2193–2201.
[23] M. Kac, Foundations of kinetic theory, in Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, vol. 3, University of California Press, Berkeley and Los Angeles, California, 1956, pp. 171–197.
[24] J.-M. Lasry and P.-L. Lions, Mean field games, Japanese Journal of Mathematics, 2 (2007), pp. 229–260.
[25] M. Lauer and M. Riedmiller, An algorithm for distributed reinforcement learning in cooperative multi-agent systems, in Proceedings of the 17th International Conference on Machine Learning, Citeseer, 2000.
[26] M. Li, Z. Qin, Y. Jiao, Y. Yang, J. Wang, C. Wang, G. Wu, and J. Ye, Efficient ridesharing order dispatching with mean field multi-agent reinforcement learning, in The World Wide Web Conference, 2019, pp. 983–994.
[27] K. Lin, R. Zhao, Z. Xu, and J. Zhou, Efficient large-scale fleet management via multi-agent deep reinforcement learning, in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 1774–1783.
[28] Y. Luo, Z. Yang, Z. Wang, and M. Kolar, Natural actor-critic converges globally for hierarchical linear quadratic regulator, arXiv preprint arXiv:1912.06875, (2019).
[29] H. P. McKean, Propagation of chaos for a class of non-linear parabolic equations, Stochastic Differential Equations (Lecture Series in Differential Equations, Session 7, Catholic Univ., 1967), (1967), pp. 41–57.
[30] M. Motte and H. Pham, Mean-field Markov decision processes with common noise and open-loop controls, arXiv preprint arXiv:1912.07883, (2019).
[31] R. Munos and A. Moore, Variable resolution discretization in optimal control, Machine Learning, 49 (2002), pp. 291–323.
[32] D. Ormoneit and P. Glynn, Kernel-based reinforcement learning in average-cost problems, IEEE Transactions on Automatic Control, 47 (2002), pp. 1624–1636.
[33] D. Ormoneit and Ś. Sen, Kernel-based reinforcement learning, Machine Learning, 49 (2002), pp. 161–178.
[34] H. Pham and X. Wei, Discrete time McKean–Vlasov control problem: a dynamic programming approach, Applied Mathematics & Optimization, 74 (2016), pp. 487–506.
[35] G. Qu, A. Wierman, and N. Li, Scalable reinforcement learning of localized policies for multi-agent networked systems, arXiv preprint arXiv:1912.02906, (2019).
[36] N. Saldi, T. Başar, and M. Raginsky, Markov–Nash equilibria in mean-field games with discounted cost, SIAM Journal on Control and Optimization, 56 (2018), pp. 4256–4287.
[37] N. Saldi, T. Başar, and M. Raginsky, Approximate Nash equilibria in partially observed stochastic games with mean-field interactions, Mathematics of Operations Research, 44 (2019), pp. 1006–1033.
[38] N. Saldi, T. Başar, and M. Raginsky, Approximate Markov–Nash equilibria for discrete-time risk-sensitive mean-field games, Mathematics of Operations Research, (2020).
[39] D. Shah and Q. Xie, Q-learning with nearest neighbors, in Advances in Neural Information Processing Systems, 2018, pp. 3111–3121.
[40] S. Shalev-Shwartz, S. Shammah, and A. Shashua, Safe, multi-agent, reinforcement learning for autonomous driving, arXiv preprint arXiv:1610.03295, (2016).
[41] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, and M. Lanctot, Mastering the game of Go with deep neural networks and tree search, Nature, 529 (2016), p. 484.
[42] P. Sunehag, G. Lever, A. Gruslys, W. M. Czarnecki, V. Zambaldi, M. Jaderberg, M. Lanctot, N. Sonnerat, J. Z. Leibo, K. Tuyls, et al., Value-decomposition networks for cooperative multi-agent learning, arXiv preprint arXiv:1706.05296, (2017).
[43] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, 2018.
[44] A.-S. Sznitman, Topics in propagation of chaos, in École d'été de Probabilités de Saint-Flour XIX-1989, Springer, 1991, pp. 165–251.
[45] G. Taylor and R. Parr, Kernelized value function approximation for reinforcement learning, in Proceedings of the 26th International Conference on Machine Learning, 2009, pp. 1017–1024.
[46] H. Van Hasselt, Reinforcement learning in continuous state and action spaces, in Reinforcement Learning, Springer, 2012, pp. 207–251.
[47] O. Vinyals, I. Babuschkin, J. Chung, M. Mathieu, M. Jaderberg, W. M. Czarnecki, A. Dudzik, A. Huang, P. Georgiev, and R. Powell, AlphaStar: Mastering the real-time strategy game StarCraft II, DeepMind Blog, (2019), p. 2.
[48] B.-C. Wang and Y. Liang, Robust mean field social control problems with applications in analysis of opinion dynamics, arXiv preprint arXiv:2002.12040, (2020).
[49] H. Wang, X. Wang, X. Hu, X. Zhang, and M. Gu, A multi-agent reinforcement learning approach to dynamic service composition, Information Sciences, 363 (2016), pp. 96–119.
[50] M. A. Wiering, Multi-agent reinforcement learning for traffic light control, in Proceedings of the 17th International Conference on Machine Learning, 2000, pp. 1151–1158.
[51] C. Wu, J. Zhang, et al., Viscosity solutions to parabolic master equations and McKean–Vlasov SDEs with closed-loop controls, Annals of Applied Probability, 30 (2020), pp. 936–986.
[52] X. Xu, D. Hu, and X. Lu, Kernel-based least squares policy iteration for reinforcement learning, IEEE Transactions on Neural Networks, 18 (2007), pp. 973–992.
[53] Y. Yang, R. Luo, M. Li, M. Zhou, W. Zhang, and J. Wang, Mean field multi-agent reinforcement learning, arXiv preprint arXiv:1802.05438, (2018).
[54] K. Zhang, Z. Yang, and T. Başar, Multi-agent reinforcement learning: A selective overview of theories and algorithms, arXiv preprint arXiv:1911.10635, (2019).


Appendix A. Proofs of Lemmas.

Proof of Lemma 2.2. At time step $t$, assume $x_t \sim \mu_t$. Under the policy $\pi_t$, it is easy to check via direct computation that the corresponding action distribution $\nu_t$ is $\nu(\mu_t, \pi_t(\cdot, \mu_t))$. Meanwhile, for any bounded function $\varphi$ on $\mathcal{X}$, by the law of iterated conditional expectation,
$$
\begin{aligned}
\mathbb{E}^{\pi}[\varphi(x_{t+1})] &= \mathbb{E}^{\pi}\big[\mathbb{E}^{\pi}[\varphi(x_{t+1}) \,|\, x_0, \ldots, x_t]\big] = \mathbb{E}^{\pi}\Big[\sum_{x' \in \mathcal{X}} \varphi(x')\, P(x_t, \mu_t, u_t, \nu_t)(x')\Big] \\
&= \sum_{x' \in \mathcal{X}} \varphi(x')\, \mathbb{E}^{\pi}\big[P(x_t, \mu_t, u_t, \nu_t)(x')\big] \\
&= \sum_{x' \in \mathcal{X}} \varphi(x') \sum_{x \in \mathcal{X}} \mu_t(x) \sum_{u \in \mathcal{U}} \pi_t(x, \mu_t)(u)\, P(x, \mu_t, u, \nu_t)(x'),
\end{aligned}
$$
which concludes that $x_{t+1} \sim \Phi(\mu_t, \pi_t(\cdot, \mu_t))$. Here $\mathbb{E}^{\pi}$ denotes the expectation under policy $\pi$. Therefore, under $\pi = \{\pi_t\}_{t=0}^{\infty}$, $\mu_{t+1} = \Phi(\mu_t, \pi_t(\cdot, \mu_t))$ defines a deterministic flow $\{\mu_t\}_{t=0}^{\infty}$ in $\mathcal{P}(\mathcal{X})$, and $x_t \sim \mu_t$. Moreover, by Fubini's theorem,
$$
\begin{aligned}
v^{\pi}(\mu) &= \mathbb{E}^{\pi}\Big[\sum_{t=0}^{\infty} \gamma^t r(x_t, \mu_t, u_t, \nu_t)\,\Big|\, x_0 \sim \mu\Big] = \sum_{t=0}^{\infty} \gamma^t\, \mathbb{E}^{\pi}\Big[r(x_t, \mu_t, u_t, \nu_t)\,\Big|\, x_0 \sim \mu\Big] \\
&= \sum_{t=0}^{\infty} \gamma^t\, \mathbb{E}\Big[r(x_t, \mu_t, u_t, \nu_t)\,\Big|\, x_t \sim \mu_t,\, u_t \sim \pi_t(x_t, \mu_t)\Big] \\
&= \sum_{t=0}^{\infty} \gamma^t \sum_{x \in \mathcal{X}} \sum_{u \in \mathcal{U}} r\big(x, \mu_t, u, \nu(\mu_t, \pi_t(\cdot, \mu_t))\big)\, \mu_t(x)\, \pi_t(x, \mu_t)(u) \\
&= \sum_{t=0}^{\infty} \gamma^t\, r(\mu_t, \pi_t(\cdot, \mu_t)).
\end{aligned}
$$
This proves (2.3).

Proof of Lemma 2.7.
$$
\begin{aligned}
\|\nu(\mu, h) - \nu(\mu', h')\|_1 &\le \|\nu(\mu, h) - \nu(\mu, h')\|_1 + \|\nu(\mu, h') - \nu(\mu', h')\|_1 \\
&\le \Big\|\sum_{x \in \mathcal{X}} (h(x) - h'(x))\,\mu(x)\Big\|_1 + \Big\|\sum_{x \in \mathcal{X}} (\mu(x) - \mu'(x))\,h'(x)\Big\|_1 \\
&\le \sum_{x \in \mathcal{X}} \mu(x)\,\|h(x) - h'(x)\|_1 + \Big\|\sum_{x \in \mathcal{X}} (\mu(x) - \mu'(x))\,h'(x)\Big\|_1 \\
&\le \max_{x \in \mathcal{X}} \|h(x) - h'(x)\|_1 + \sum_{u \in \mathcal{U}} \sum_{x \in \mathcal{X}} |\mu(x) - \mu'(x)|\,h'(x)(u) \\
&= d_{\mathcal{H}}(h, h') + \|\mu - \mu'\|_1 = d_{\mathcal{C}}((\mu, h), (\mu', h')).
\end{aligned}
$$
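The estimate above can be checked numerically on small instances. The sketch below is an illustration only (the helpers `random_dist` and `nu` are ours): it samples random distributions $\mu, \mu'$ and policies $h, h'$ on a small finite state and action space and verifies $\|\nu(\mu, h) - \nu(\mu', h')\|_1 \le d_{\mathcal{C}}((\mu, h), (\mu', h'))$.

```python
# Numerical sanity check of the Lipschitz estimate in Lemma 2.7.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 3

def random_dist(n):
    """A random probability vector of length n."""
    v = rng.random(n)
    return v / v.sum()

def nu(mu, h):
    """Aggregated action distribution: nu(u) = sum_x mu(x) h(x)(u)."""
    return mu @ h   # h is an (n_states, n_actions) row-stochastic matrix

for _ in range(1000):
    mu, mu2 = random_dist(n_states), random_dist(n_states)
    h = np.stack([random_dist(n_actions) for _ in range(n_states)])
    h2 = np.stack([random_dist(n_actions) for _ in range(n_states)])
    lhs = np.abs(nu(mu, h) - nu(mu2, h2)).sum()      # ||nu - nu'||_1
    d_H = np.abs(h - h2).sum(axis=1).max()           # max_x ||h(x) - h'(x)||_1
    d_C = d_H + np.abs(mu - mu2).sum()               # d_C((mu,h),(mu',h'))
    assert lhs <= d_C + 1e-12
print("Lemma 2.7 bound holds on all sampled instances.")
```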


Proof of Lemma 2.8.
$$
\begin{aligned}
|r(\mu, h) - r(\mu', h')| &= \Big|\sum_{x \in \mathcal{X}} \sum_{u \in \mathcal{U}} r(x, \mu, u, \nu(\mu, h))\,\mu(x)\,h(x)(u) - \sum_{x \in \mathcal{X}} \sum_{u \in \mathcal{U}} r(x, \mu', u, \nu(\mu', h'))\,\mu'(x)\,h'(x)(u)\Big| \\
&\le \Big|\sum_{x \in \mathcal{X}} \sum_{u \in \mathcal{U}} (r_{x,u} - r'_{x,u})\,\mu(x)\,h(x)(u)\Big| + \Big|\sum_{x \in \mathcal{X}} \sum_{u \in \mathcal{U}} r'_{x,u}\big(\mu(x)h(x)(u) - \mu'(x)h'(x)(u)\big)\Big|,
\end{aligned}
$$
where, for simplicity, we denote $r_{x,u} = r(x, \mu, u, \nu(\mu, h))$ and $r'_{x,u} = r(x, \mu', u, \nu(\mu', h'))$.
By Assumption 2.5 and Lemma 2.7, for any $x \in \mathcal{X}$, $u \in \mathcal{U}$,
$$
|r_{x,u} - r'_{x,u}| \le L_r\big(\|\mu - \mu'\|_1 + \|\nu(\mu, h) - \nu(\mu', h')\|_1\big) \le L_r\big(\|\mu - \mu'\|_1 + d_{\mathcal{C}}((\mu, h), (\mu', h'))\big) \le 2 L_r\, d_{\mathcal{C}}((\mu, h), (\mu', h')).
$$
Meanwhile,
$$
\begin{aligned}
\sum_{x \in \mathcal{X}} \sum_{u \in \mathcal{U}} |\mu(x)h(x)(u) - \mu'(x)h'(x)(u)| &\le \sum_{x \in \mathcal{X}} \sum_{u \in \mathcal{U}} |\mu(x) - \mu'(x)|\,h(x)(u) + \sum_{x \in \mathcal{X}} \sum_{u \in \mathcal{U}} \mu'(x)\,|h(x)(u) - h'(x)(u)| \\
&= \sum_{x \in \mathcal{X}} |\mu(x) - \mu'(x)| + \sum_{x \in \mathcal{X}} \mu'(x)\,\|h(x) - h'(x)\|_1 \\
&\le \|\mu - \mu'\|_1 + \max_{x \in \mathcal{X}} \|h(x) - h'(x)\|_1 = d_{\mathcal{C}}((\mu, h), (\mu', h')).
\end{aligned}
$$
Combining all these results, we have
$$
|r(\mu, h) - r(\mu', h')| \le \sum_{x \in \mathcal{X}} \sum_{u \in \mathcal{U}} |r_{x,u} - r'_{x,u}|\,\mu(x)h(x)(u) + R \sum_{x \in \mathcal{X}} \sum_{u \in \mathcal{U}} |\mu(x)h(x)(u) - \mu'(x)h'(x)(u)| \le (R + 2L_r)\, d_{\mathcal{C}}((\mu, h), (\mu', h')).
$$

Proof of Lemma 2.9.
$$
\begin{aligned}
\|\Phi(\mu, h) - \Phi(\mu', h')\|_1 &= \Big\|\sum_{x \in \mathcal{X}} \sum_{u \in \mathcal{U}} P(x, \mu, u, \nu(\mu, h))\,\mu(x)\,h(x)(u) - \sum_{x \in \mathcal{X}} \sum_{u \in \mathcal{U}} P(x, \mu', u, \nu(\mu', h'))\,\mu'(x)\,h'(x)(u)\Big\|_1 \\
&\le \Big\|\sum_{x \in \mathcal{X}} \sum_{u \in \mathcal{U}} (P_{x,u} - P'_{x,u})\,\mu(x)\,h(x)(u)\Big\|_1 + \Big\|\sum_{x \in \mathcal{X}} \sum_{u \in \mathcal{U}} P'_{x,u}\big(\mu(x)h(x)(u) - \mu'(x)h'(x)(u)\big)\Big\|_1,
\end{aligned}
$$
where, for simplicity, we denote $P_{x,u} = P(x, \mu, u, \nu(\mu, h))$ and $P'_{x,u} = P(x, \mu', u, \nu(\mu', h'))$.
By Assumption 2.6 and Lemma 2.7, for any $x$ and $u$,
$$
\|P_{x,u} - P'_{x,u}\|_1 \le L_P\big(\|\mu - \mu'\|_1 + \|\nu(\mu, h) - \nu(\mu', h')\|_1\big) \le L_P\big(\|\mu - \mu'\|_1 + d_{\mathcal{C}}((\mu, h), (\mu', h'))\big) \le 2 L_P\, d_{\mathcal{C}}((\mu, h), (\mu', h')).
$$
Meanwhile, from the proof of Lemma 2.8, we know
$$
\sum_{x \in \mathcal{X}} \sum_{u \in \mathcal{U}} |\mu(x)h(x)(u) - \mu'(x)h'(x)(u)| \le d_{\mathcal{C}}((\mu, h), (\mu', h')).
$$
Combining all these results, we have
$$
\|\Phi(\mu, h) - \Phi(\mu', h')\|_1 \le \sum_{x \in \mathcal{X}} \sum_{u \in \mathcal{U}} \|P_{x,u} - P'_{x,u}\|_1\,\mu(x)h(x)(u) + \sum_{x \in \mathcal{X}} \sum_{u \in \mathcal{U}} \|P'_{x,u}\|_1\,|\mu(x)h(x)(u) - \mu'(x)h'(x)(u)| \le (2L_P + 1)\, d_{\mathcal{C}}((\mu, h), (\mu', h')).
$$

Proof of Lemma 3.4. To prove the continuity of $Q$, first fix $\epsilon > 0$ and $c, c' \in \mathcal{C}$. Then there exists some policy $\pi$ such that $Q(c) - Q^{\pi}(c) < \frac{\epsilon}{2}$. Let $c = (\mu_0, h_0), (\mu_1, h_1), (\mu_2, h_2), \ldots, (\mu_t, h_t), \ldots$ be the trajectory of the system starting from $c$ and then taking the policy $\pi$. Then $Q^{\pi}(c) = \sum_{t=0}^{\infty} \gamma^t r(\mu_t, h_t)$.

Now consider the trajectory of the system starting from $c'$ and then taking $h_1, \ldots, h_t, \ldots$, denoted by $c' = (\mu'_0, h'_0), (\mu'_1, h_1), (\mu'_2, h_2), \ldots, (\mu'_t, h_t), \ldots$. Note that this trajectory starting from $c'$ may not be the optimal trajectory; therefore, $Q(c') \ge \sum_{t=0}^{\infty} \gamma^t r(\mu'_t, h_t)$. By Lemma 2.8 and Lemma 2.9,
$$
|r(\mu'_t, h_t) - r(\mu_t, h_t)| \le L_r \cdot d_{\mathcal{P}(\mathcal{X})}(\mu'_t, \mu_t) = L_r \cdot d_{\mathcal{P}(\mathcal{X})}\big(\Phi(\mu'_{t-1}, h_{t-1}), \Phi(\mu_{t-1}, h_{t-1})\big) \le L_r \cdot L_{\Phi} \cdot d_{\mathcal{P}(\mathcal{X})}(\mu'_{t-1}, \mu_{t-1}) \le \cdots \le L_r \cdot L_{\Phi}^t \cdot d_{\mathcal{C}}(c, c'),
$$
implying that
$$
Q(c) - Q(c') \le \frac{\epsilon}{2} + Q^{\pi}(c) - Q(c') \le \frac{\epsilon}{2} + \big(r(c) - r(c')\big) + \sum_{t=1}^{\infty} \gamma^t \big(r(\mu_t, h_t) - r(\mu'_t, h_t)\big) \le \frac{\epsilon}{2} + \sum_{t=0}^{\infty} \gamma^t \cdot L_{\Phi}^t \cdot L_r \cdot d_{\mathcal{C}}(c, c') = \frac{\epsilon}{2} + \frac{L_r}{1 - \gamma \cdot L_{\Phi}} \cdot d_{\mathcal{C}}(c, c').
$$
Similarly, one can show $Q(c') - Q(c) \le \frac{\epsilon}{2} + \frac{L_r}{1 - \gamma \cdot L_{\Phi}} \cdot d_{\mathcal{C}}(c, c')$. Therefore, as long as $d_{\mathcal{C}}(c, c') \le \frac{\epsilon\,(1 - \gamma \cdot L_{\Phi})}{2 L_r}$, we have $|Q(c') - Q(c)| \le \epsilon$. This proves that $Q$ is continuous.

Proof of Lemma 5.3. By definition, it is easy to show that $B$ and $B_{\mathcal{H}_\epsilon}$ map $\{f \in \mathbb{R}^{\mathcal{C}} : \|f\|_{\infty} \le V_{\max}\}$ to itself, $B_\epsilon$ and $\hat{B}_\epsilon$ map $\{f \in \mathbb{R}^{\mathcal{C}_\epsilon} : \|f\|_{\infty} \le V_{\max}\}$ to itself, and $T$ maps $\{f \in \mathbb{R}^{\mathcal{P}(\mathcal{X})} : \|f\|_{\infty} \le V_{\max}\}$ to itself.

For $B_\epsilon$, we have
$$
\begin{aligned}
\|B_\epsilon q_1 - B_\epsilon q_2\|_{\infty} &\le \gamma \max_{c \in \mathcal{C}_\epsilon} \max_{h \in \mathcal{H}_\epsilon} |\Gamma_K q_1(\Phi(c), h) - \Gamma_K q_2(\Phi(c), h)| \\
&\le \gamma \max_{c \in \mathcal{C}_\epsilon} \max_{h \in \mathcal{H}_\epsilon} \sum_{i=1}^{N_\epsilon} K\big(c_i, (\Phi(c), h)\big)\,|q_1(c_i) - q_2(c_i)| \\
&\le \gamma\, \|q_1 - q_2\|_{\infty},
\end{aligned}
$$
where we use (4.2) for the property of the kernel function $K(c_i, c)$. Therefore, $B_\epsilon$ is a contraction mapping with modulus $\gamma < 1$ under the sup norm on $\{f \in \mathbb{R}^{\mathcal{C}_\epsilon} : \|f\|_{\infty} \le V_{\max}\}$. By the Banach Fixed Point Theorem, the statement for $B_\epsilon$ holds. Similar arguments prove the statements for the other four operators.
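The contraction argument can be visualized on a toy instance. The sketch below is our own illustration, not part of the paper's experiments: it builds a synthetic net with normalized kernel weights (mimicking property (4.2)) and shows that the sup-distance between two iterates shrinks by at least a factor $\gamma$ per application of the operator.

```python
# Toy illustration of the contraction of a kernel-based Bellman operator.
import numpy as np

rng = np.random.default_rng(1)
n_net, n_actions = 20, 4          # |C_eps| net points, |H_eps| actions
gamma = 0.5

# Synthetic rewards r(c) and kernel weights K(c_i, (Phi(c), h)) over the net;
# each weight row sums to 1, mimicking the normalization in (4.2).
rewards = rng.random(n_net)
weights = rng.random((n_net, n_actions, n_net))
weights /= weights.sum(axis=2, keepdims=True)

def bellman(q):
    """B q(c) = r(c) + gamma * max_h sum_i K(c_i, (Phi(c), h)) q(c_i)."""
    interpolated = weights @ q                 # shape (n_net, n_actions)
    return rewards + gamma * interpolated.max(axis=1)

q1, q2 = rng.random(n_net), rng.random(n_net)
for _ in range(5):
    d_before = np.abs(q1 - q2).max()
    q1, q2 = bellman(q1), bellman(q2)
    d_after = np.abs(q1 - q2).max()
    print(f"||Bq1 - Bq2||_inf = {d_after:.4f}  (gamma * previous = {gamma * d_before:.4f})")
```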

Proof of Lemma 5.4. Using the same DPP argument as in Theorem 3.2, we can show that the value function for (5.9)-(5.10) is a fixed point of the operator $T$ in (5.3) on $\{f \in \mathbb{R}^{\mathcal{P}(\mathcal{X})} : \|f\|_{\infty} \le V_{\max}\}$. By Lemma 5.3, it coincides with $V_{\mathcal{H}_\epsilon}$.

To prove (5.11), recall from Lemma 5.3 that $T$ is a contraction mapping with modulus $\gamma$ under the supremum norm on $\{f \in \mathbb{R}^{\mathcal{P}(\mathcal{X})} : \|f\|_{\infty} \le V_{\max}\}$, with the fixed point $V_{\mathcal{H}_\epsilon}$, which is the value function of the MFC (5.9)-(5.10), i.e., (MDP) with the action space restricted to $\mathcal{H}_\epsilon$. Moreover, define $Q(\mu, h) := r(\mu, h) + \gamma V_{\mathcal{H}_\epsilon}(\Phi(\mu, h))$. Then
$$
\begin{aligned}
Q(\mu, h) &= r(\mu, h) + \gamma V_{\mathcal{H}_\epsilon}(\Phi(\mu, h)) \\
&= r(\mu, h) + \gamma \max_{h' \in \mathcal{H}_\epsilon} \big(r(\Phi(\mu, h), h') + \gamma V_{\mathcal{H}_\epsilon}(\Phi(\Phi(\mu, h), h'))\big) \\
&= r(\mu, h) + \gamma \max_{h' \in \mathcal{H}_\epsilon} Q(\Phi(\mu, h), h').
\end{aligned}
$$
So $Q \in \{f \in \mathbb{R}^{\mathcal{C}} : \|f\|_{\infty} \le V_{\max}\}$ is a fixed point of $B_{\mathcal{H}_\epsilon}$. By Lemma 5.3, $Q = Q_{\mathcal{H}_\epsilon}$.

Now, since $Q_{\mathcal{H}_\epsilon}$ is the value function of the MFC problem (5.9), replacing $Q$ with $Q_{\mathcal{H}_\epsilon}$ in the argument of Lemma 3.5 and then taking $\epsilon \to 0$ yield the Lipschitz continuity of $Q_{\mathcal{H}_\epsilon}$.

Proof of Lemma 5.7. By Markov's inequality,
$$
\mathbb{P}(T_{\mathcal{C}, \pi} > eT) \le \frac{\mathbb{E}[T_{\mathcal{C}, \pi}]}{eT} \le \frac{1}{e}.
$$
Since $T_{\mathcal{C}, \pi}$ is independent of the initial state and the dynamics are Markovian, the probability that $\mathcal{C}_\epsilon$ has not been covered during any time period of length $eT$ is less than or equal to $\frac{1}{e}$. Therefore, for any positive integer $k$, $\mathbb{P}(T_{\mathcal{C}, \pi} > ekT) \le \frac{1}{e^k}$. Taking $k = \log(1/\delta)$ gives the desired result.
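As a concrete instance of this tail bound (our own illustration, not from the paper): with $\delta = 0.01$,
$$
k = \log(1/\delta) = \log 100 \approx 4.61, \qquad \mathbb{P}\big(T_{\mathcal{C}, \pi} > e\,k\,T\big) \le e^{-k} = \delta,
$$
so a horizon of roughly $4.61\,e\,T \approx 12.5\,T$ covers $\mathcal{C}_\epsilon$ with probability at least $0.99$.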