
Sampling From the Wasserstein Barycenter

Chiheb Daaloul^1, Thibaut Le Gouic^2, Jacques Liandrat^1, Magali Tournus^1

Abstract. This work presents an algorithm to sample from the Wasserstein barycenter of absolutely continuous measures. Our method is based on the gradient flow of the multimarginal formulation of the Wasserstein barycenter, with an additive penalization to account for the marginal constraints. We prove that the minimum of this penalized multimarginal formulation is achieved for a coupling that is close to the Wasserstein barycenter. The performance of the algorithm is showcased in several settings.

1. INTRODUCTION

The barycenter in the space of probability measures equipped with the Wasserstein distance, first introduced by Agueh and Carlier (2011), has gained a lot of popularity in recent years. The most striking property of the Wasserstein space is probably how its geodesics can be interpreted as a displacement of particles in $\mathbb{R}^d$, allowing the Wasserstein barycenter of measures to take into account the geometry of the underlying space $\mathbb{R}^d$. Thanks to this geometric interpretation, it has proved useful in a variety of applications. For instance, it has been applied in image processing: Julien et al. (2011) developed an algorithm for texture synthesis built around this measure, Barré et al. (2020) used the Wasserstein barycenter of images to improve the precision of gas emission sourcing, and it played a central role in Gramfort, Peyré, and Cuturi (2015) to compare brain image data. In the field of fairness, Gordaliza et al. (2019) and Le Gouic, Loubes, and Rigollet (2020) leveraged the distribution to tackle fairness in regression and classification problems. Barycenters have also found applications in Bayesian inference: Srivastava, Cevher, et al. (2015) and Srivastava, Li, and Dunson (2018) proposed a method to accelerate the computation of posterior distributions using the Wasserstein barycenter.

More formally, let us denote by $(\mathcal{P}_2(\mathbb{R}^d), W_2)$ the set of all measures defined on $\mathbb{R}^d$ with finite second order moment, endowed with the Wasserstein distance
$$W_2 : (\mu, \nu) \longmapsto \sqrt{\inf_{\gamma} \int_{\mathbb{R}^d \times \mathbb{R}^d} |x - y|^2 \, d\gamma(x, y)},$$
where the infimum is taken over the set of couplings of $\mu, \nu$. A Wasserstein barycenter $b$ of $\mu_1, \dots, \mu_n$ with weights $\lambda_1, \dots, \lambda_n$ is a minimizer of the map $\nu \mapsto \sum \lambda_i W_2^2(\mu_i, \nu)$, i.e. the measure $b$ corresponds to a Fréchet mean of $\mu_1, \dots, \mu_n$ in the Wasserstein space $(\mathcal{P}_2(\mathbb{R}^d), W_2)$. Equivalently, the multimarginal formulation of the barycenter problem asserts that $b$ is obtained by pushing forward the minimizer $\gamma^\star$ of a functional $G$ (see equation (2)) defined over the set of couplings of the $\mu_i$'s. Under mild conditions on the $\mu_i$'s, both problems admit unique solutions and yield the same measure. Section 2 provides more details on the Wasserstein barycenter.

Most methods in the literature propose to estimate the Wasserstein barycenter with a discrete measure, e.g. Cuturi and Doucet (2014), Solomon et al. (2015), and Benamou et al. (2015).

1 Aix-Marseille Univ., CNRS, I2M, UMR7373, Centrale Marseille, 13451 Marseille, France
2 Massachusetts Institute of Technology, Department of Mathematics, 77 Massachusetts Avenue, Cambridge, MA 02139-4307, USA


However, this approach does not scale well with the dimension, as noted by Altschuler and Boix-Adserà (2021). Indeed, they showed that computing the infimum of $\nu \mapsto \sum \lambda_i W_2^2(\mu_i, \nu)$ when the $\mu_i$ are discrete measures is an NP-hard problem in the dimension, even when only approximate solutions are acceptable. We can, however, avoid estimating the density altogether and focus on generating samples distributed according to the barycenter of known measures. Given the broad applicability of the Wasserstein barycenter and of sampling techniques in general, we believe that such sampling procedures should be part of the statistician's toolbox.

We are motivated by the success of sampling methods in high dimensional settings, which is due to the possibility of integrating functions against the target measure without resorting to discretization on a large grid. Markov chain methods have become popular tools to this end. The most well known among them are probably Hamiltonian Monte Carlo methods, variants of the Metropolis-Hastings algorithm relying on simulations to generate diverse samples from a distribution (Bishop (2006, Chapter 11)), and Langevin diffusion (Pavliotis (2014, Chapter 4)), which transports points along random trajectories to redistribute them according to the target measure. Such transportation methods are computationally attractive since one only needs to store minimal information about how to move the particles at each iteration.

In their celebrated work, Jordan, Kinderlehrer, and Otto (1998) showed that the marginals of Langevin diffusion are distributed according to the gradient flow, in the Wasserstein space, of the Kullback-Leibler divergence with respect to the stationary measure. This insight, studied rigorously in Ambrosio, Gigli, and Savaré (2008), brought to light a connection between sampling and optimization which added a new perspective to the study of Monte Carlo algorithms (e.g. Vempala and Wibisono (2019) analyze the Unadjusted Langevin Algorithm, and Chewi, Le Gouic, Lu, Maunu, Rigollet, and Stromme (2020) analyze the Mirror Langevin diffusion in this framework). Inspired by this insight, we aim to minimize the multimarginal formulation $G$ with gradient descent in the Wasserstein space. This procedure, like Langevin diffusion, iteratively redistributes randomly initialized points to produce a sample from the barycenter. We thereby recover an approximation of the optimal coupling through the samples, which can be desirable in applications.

However, since the barycenter is defined by a constrained optimization problem, one cannot expect the constraints to remain satisfied along the gradient flow. This issue of constraints has been tackled in a variety of ways in the literature. A natural solution consists in restricting the domain of possible directions at each iteration, which hints at the Frank-Wolfe algorithm, i.e. choosing the steepest descent direction within the admissible set. This approach was studied in Luise et al. (2019) within the context of regularized optimal transport, where the regularized Wasserstein barycenter (also known as the Sinkhorn barycenter) minimizes a sum of Sinkhorn divergences. When the admissible set of directions forms a Reproducing Kernel Hilbert Space (RKHS) with a suitable kernel, Shen et al. (2020) propose to generate samples distributed according to the Sinkhorn barycenter by iterative pushforward of an initial measure with the map $\mathrm{id}_{\mathbb{R}^d} - h \cdot d$, where $h$ is a step size and $d$ is the direction of steepest descent in the RKHS. Both methods operate on discrete measures and the authors prove that the continuous measure is recovered as a (weak) limit when the number of samples increases, ensuring consistency. We refer to Peyré and Cuturi (2020) for details about regularized optimal transport. We note that in some settings the issue of constraints can be addressed more easily, for example on the Bures-Wasserstein manifold (i.e. the subspace of Gaussian measures in the Wasserstein space), where the barycenter problem reduces to a finite dimensional optimization problem, and Chewi, Maunu, et al. (2020) propose a gradient descent algorithm for the original formulation of the barycenter problem. However, we obviously cannot hope that similar properties hold for arbitrary continuous measures.

In this paper, we perform gradient descent on a penalized functional $F_\alpha : \mathcal{P}_2((\mathbb{R}^d)^n) \to \mathbb{R}_+$ obtained from $G$ by adding a penalization term to control the distance between the coupling marginals and the $\mu_i$'s. We weigh the penalization with a coefficient $\alpha$ and control the induced error with $\alpha$. Taking advantage of the differential calculus on the Wasserstein space, we can define a gradient flow for $F_\alpha$. To implement this procedure, we focus on the popular SVGD algorithm introduced in Liu and Wang (2016) to approximate the gradient of the penalization; technical details are given in Section 4.

Contributions. Inspired by the work of Jordan, Kinderlehrer, and Otto cited above, we introduce a new sampling algorithm, called BARYGD, that performs kernelized gradient descent on a well chosen functional $F_\alpha$, built as a penalized version of the now classical multimarginal formulation of the barycenter problem introduced by Agueh and Carlier (2011). We show that our method is consistent: with fixed penalization strength $\alpha$, if gradient descent yields a coupling $\gamma^{\varepsilon,\alpha}$ such that $F_\alpha(\gamma^{\varepsilon,\alpha}) - \inf F_\alpha \leq \varepsilon$, then we get a quantitative bound on the Wasserstein distance between the approximate barycenter obtained from $\gamma^{\varepsilon,\alpha}$ and the true barycenter obtained from the optimal coupling $\gamma^\star$. This bound vanishes when $\varepsilon \to 0$ and $\alpha \to \infty$. As a consequence, we show that the minimizers of $F_\alpha$ converge to $\gamma^\star$ when $\alpha \to \infty$ and quantify the rate of convergence. Furthermore, we perform numerical experiments with the algorithm in several settings (see Figures 1 and 2).

Organization. In Section 2, we describe the Wasserstein barycenter in more detail. We present and analyze our method in Section 3. Section 4 is devoted to implementation details. In Section 5, we discuss some open questions.

Notation. All probability measures are assumed to be absolutely continuous with respect to the Lebesgue measure and, with a slight abuse of notation, we will identify measures with their densities. We write $X \sim \nu$ when the random variable $X$ has distribution $\nu$. For $\mu, \nu \in \mathcal{P}_2(\mathbb{R}^d)$ and a measurable map $f : \mathbb{R}^d \to \mathbb{R}^d$, we write $\nu = f_\# \mu$ if $f(X) \sim \nu$ when $X \sim \mu$. The set of couplings of $\mu$ and $\nu$, i.e. probability measures on $\mathbb{R}^d \times \mathbb{R}^d$ with marginals $\mu$ and $\nu$, is denoted by $\Pi(\mu, \nu)$. We denote marginals by indices, so $\gamma_1 = \mu$ and $\gamma_2 = \nu$ for $\gamma \in \Pi(\mu, \nu)$; the notation extends naturally to couplings of $n$ measures. We denote Euclidean gradients by $\nabla$ and write $\nabla_W$ for gradients in the Wasserstein space. Finally, we write indifferently $c_t = c(t)$ and $c' = \dot{c} = \partial_t c$ when $c : [a, b] \to X$ is a curve in some space $X$.

2. WASSERSTEIN BARYCENTERS

This section briefly presents background notions on the Wasserstein space and its barycenters that were first introduced in the seminal paper of Agueh and Carlier (2011). We refer to Le Gouic and Loubes (2016) for questions of existence and stability of the barycenter.

The Wasserstein space over $\mathbb{R}^d$ is defined as the set $\mathcal{P}_2(\mathbb{R}^d)$ of probability measures over $\mathbb{R}^d$ with finite second order moment, endowed with the distance $W_2$ defined by
$$W_2^2(\mu_0, \mu_1) = \inf_{\gamma \in \Pi(\mu_0, \mu_1)} \int |x - y|^2 \, d\gamma(x, y), \qquad \forall \mu_0, \mu_1 \in \mathcal{P}_2(\mathbb{R}^d),$$
where the infimum is taken over the set $\Pi(\mu_0, \mu_1)$ of all couplings $\gamma$ of $\mu_0$ and $\mu_1$. The Wasserstein space is a geodesic space: for every $\mu_0$ and $\mu_1$ in $\mathcal{P}_2(\mathbb{R}^d)$ there exists a path $(\mu_t)_{t \in [0,1]}$ such that
$$W_2(\mu_s, \mu_t) = |s - t| \, W_2(\mu_0, \mu_1), \qquad \forall s, t \in [0, 1].$$

Such a path $(\mu_t)_{t \in [0,1]}$ is called a constant-speed geodesic between its end points $\mu_0$ and $\mu_1$; we will often shorten it to geodesic. A functional $G$ defined over $\mathcal{P}_2(\mathbb{R}^d)$ is said to be geodesically convex if it is convex along every geodesic.

Let $\mu_1, \dots, \mu_n \in \mathcal{P}_2(\mathbb{R}^d)$ and let $\lambda_1, \dots, \lambda_n$ be strictly positive weights such that $\sum \lambda_i = 1$. A Wasserstein barycenter of $(\mu_i)$ with weights $(\lambda_i)$ is defined as any measure
$$b \in \operatorname*{argmin}_{\nu \in \mathcal{P}_2(\mathbb{R}^d)} \sum \lambda_i W_2^2(\mu_i, \nu), \qquad (1)$$
i.e. a barycenter corresponds to a Fréchet mean of the $\mu_i$ in the metric space $(\mathcal{P}_2(\mathbb{R}^d), W_2)$. The barycenter always exists and, whenever one of $\mu_1, \dots, \mu_n$ has a density with respect to the Lebesgue measure, it is unique. We assume throughout that the $\mu_i$ are absolutely continuous with respect to the Lebesgue measure on $\mathbb{R}^d$.

The minimization problem (1) is equivalent to the following multimarginal problem. Let $T(x) = \sum \lambda_i x_i$ for any $x \in (\mathbb{R}^d)^n$ and recall that $\Pi(\mu_1, \dots, \mu_n)$ denotes the set of all couplings of the $\mu_i$. The infimum of the functional
$$G : \gamma \longmapsto \int \sum \lambda_i \, |x_i - T(x)|^2 \, d\gamma(x) \qquad (2)$$
over $\Pi(\mu_1, \dots, \mu_n)$ is achieved for a coupling $\gamma^\star$ such that
$$b = T_\# \gamma^\star, \qquad (3a)$$
$$G(\gamma^\star) = \sum \lambda_i W_2^2(\mu_i, T_\# \gamma^\star). \qquad (3b)$$
Moreover, if one of the $\mu_i$'s is absolutely continuous with respect to the Lebesgue measure, then $\gamma^\star$ is unique. Note that, introducing the probability measure $P = \sum \lambda_i \delta_{\mu_i}$ on $\mathcal{P}_2(\mathbb{R}^d)$, the right hand side of (3b) corresponds to the variance of $P$. With this notation, $b$ minimizes the variance functional $\nu \mapsto \sum \lambda_i W_2^2(\mu_i, \nu)$ of $P$, and $b$ corresponds to the barycenter of $P$.

In the next section, we leverage the smooth structure of the Wasserstein space in order to develop an optimization scheme for $G$ that will be the basis of our sampling algorithm.

3. SAMPLING WITH GRADIENT FLOWS

This section defines the gradient flow of a functional on the Wasserstein space. Some useful details of this technique, introduced in Jordan, Kinderlehrer, and Otto (1998), are gathered in Appendix B.

3.1. Definition of the problem

We aim to optimize the functional $G$ defined in Section 2 with a gradient flow on the Wasserstein space. For simplicity, we first define the cost function $c$ by
$$c : x \mapsto \sum \lambda_i \, |x_i - T(x)|^2;$$
then, for any $\gamma \in \mathcal{P}_2((\mathbb{R}^d)^n)$, we have
$$G(\gamma) = \int c \, d\gamma.$$

The set of marginal constraints $\Pi(\mu_1, \dots, \mu_n)$ is convex in $L^2((\mathbb{R}^d)^n)$ but is not geodesically convex in the Wasserstein space, which makes the multimarginal problem a difficult non-convex optimization problem on $\mathcal{P}_2((\mathbb{R}^d)^n)$. Since a gradient descent scheme on $G$ will inevitably leave the constraint set $\Pi(\mu_1, \dots, \mu_n)$, we modify the problem by enforcing the constraints with a penalization term that ensures that the marginals are close to the $\mu_i$'s in $\chi^2$ divergence. We recall that the $\chi^2$ divergence between two measures $\mu, \nu$ is given by
$$\chi^2_\nu(\mu) = \int \left( \frac{\mu}{\nu} - 1 \right)^2 d\nu \ \text{ if } \mu \ll \nu, \quad \text{and} \quad \chi^2_\nu(\mu) = +\infty \ \text{ otherwise.}$$
The problem thus becomes to minimize the functional $F_\alpha$ defined by
$$\gamma \mapsto F_\alpha(\gamma) := \int c \, d\gamma + \alpha \sum \lambda_i \chi^2_{\mu_i}(\gamma_i),$$
for $\alpha \in [0, \infty]$ (we use the convention $0 \times \infty = 0$).

For $\alpha$ large enough, the minimizer of $F_\alpha$ should be close to the minimizer of $G$ under the marginal constraints, which also corresponds to the minimizer of $F_\infty$. The following proposition ensures that a minimizer exists.

Proposition 1. Suppose at least one of the $\mu_i$'s is absolutely continuous. Then, for any $\alpha > 0$, the functional $F_\alpha$ admits at least one minimizer in $\mathcal{P}_2((\mathbb{R}^d)^n)$. Moreover, this minimizer is absolutely continuous with respect to the Lebesgue measure.

Given a minimizer $\gamma^\alpha$ of $F_\alpha$, the measure $T_\# \gamma^\alpha$, dubbed the barycenter associated to $\gamma^\alpha$, should be close to the barycenter $b$ of $P$ thanks to (3a) and (3b). Our next result quantifies this proximity with respect to $\alpha$ under extra assumptions on $\mu_1, \dots, \mu_n$.

We first assume that $\mu_1, \dots, \mu_n$ satisfy the Poincaré inequality with a strictly positive constant $C_P$, i.e. that for any $i \in \{1, \dots, n\}$ and any Lipschitz $f \in L^2(\mu_i)$ with $\int f \, d\mu_i = 0$, we have
$$\|f\|^2_{L^2(\mu_i)} \leq C_P \, \|\nabla f\|^2_{L^2(\mu_i)}, \qquad (4)$$
where $\nabla f$ is defined Lebesgue-almost everywhere. Such inequalities are very common in the sampling setting. We also assume that $P = \sum \lambda_i \delta_{\mu_i}$ satisfies a variance inequality with constant $k$, i.e. that there exists a strictly positive constant $k$ such that for any $\nu \in \mathcal{P}_2(\mathbb{R}^d)$, it holds
$$k \, W_2^2(\nu, b_P) \leq B(\nu) - B(b_P),$$
where $b_P$ is the barycenter of $\mu_1, \dots, \mu_n$ and $B : \nu \mapsto \sum \lambda_i W_2^2(\mu_i, \nu)$ is the functional appearing in the original formulation of the barycenter problem.

Such inequalities were introduced by Sturm (2003) to describe curvature properties of geodesic spaces and have played a central role in understanding the behavior of the empirical Wasserstein barycenter (see Ahidar-Coutrix, Le Gouic, and Paris (2020) and Le Gouic et al. (2019)). We refer to Appendix C for details on variance inequalities. We can now state the following result.

Proposition 2. Let $\alpha \geq 1$ and $\varepsilon \leq 1$. Suppose $\mu_1, \dots, \mu_n$ satisfy a Poincaré inequality with constant $C_P$ and that $\sum \lambda_i \delta_{\mu_i}$ satisfies a variance inequality with constant $k$. Then there exists a strictly positive constant $C$, only depending on the variance $\sigma^2 = \int c \, d\gamma^\star$ and $C_P$, such that for any $\gamma^{\varepsilon,\alpha} \in \mathcal{P}_2((\mathbb{R}^d)^n)$ satisfying $F_\alpha(\gamma^{\varepsilon,\alpha}) - \inf F_\alpha \leq \varepsilon$, it holds
$$\frac{k}{4} W_2^2(T_\# \gamma^{\varepsilon,\alpha}, T_\# \gamma^\star) \leq \varepsilon + \frac{C}{\sqrt{\alpha}}. \qquad (5)$$

Proposition 2 shows that any approximate minimizer of $F_\alpha$ produces a barycenter whose distance to the desired barycenter $b$ is of order $O(1/\sqrt{\alpha})$ and can therefore be controlled to arbitrary precision with $\alpha$.

The fact that the minimization of $F_\alpha$ is carried out without constraints on the marginals of $\gamma$ allows the use of classical optimization techniques. The next section is devoted to the introduction of the gradient flow of $F_\alpha$ on the Wasserstein space, which will be the backbone of our algorithm.

3.2. Minimizing scheme

We first briefly recall that the Wasserstein gradient of a functional $F : \mathcal{P}_2(\mathbb{R}^d) \to \mathbb{R}$ at a point $\nu \in \mathcal{P}_2(\mathbb{R}^d)$ is a vector field on $\mathbb{R}^d$ given by the Euclidean gradient of its first variation $f$, i.e.
$$\nabla_W F(\nu) := x \mapsto \nabla f(x) \in L^2(\mathbb{R}^d, \mathbb{R}^d; \nu), \qquad (6)$$
where $f$ is such that for any signed measure $\delta$ with $\int d\delta = 0$, we have
$$F(\nu + \varepsilon \delta) = F(\nu) + \varepsilon \int f \, d\delta + o(\varepsilon).$$
For instance, the Wasserstein gradient of $\mu \mapsto \chi^2_\nu(\mu)$ is given by
$$\nabla_W \chi^2_\nu(\mu) = 2 \nabla \frac{\mu}{\nu}.$$

Intuitively, when a probability measure is seen as a large number of particles, the Wasserstein gradient of a functional defines a vector field along which each particle should be moved in order to locally increase the functional. More details about the differential calculus on the Wasserstein space are provided in Appendix B.
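As a simple illustration of this machinery (a standard example, not taken from the text), consider the potential energy $\mathcal{V} : \mu \mapsto \int V \, d\mu$ for a smooth $V : \mathbb{R}^d \to \mathbb{R}$. Since $\mathcal{V}(\mu + \varepsilon\delta) = \mathcal{V}(\mu) + \varepsilon \int V \, d\delta$, its first variation is $V$ itself, so
$$\nabla_W \mathcal{V}(\mu) = \nabla V,$$
which is, up to sign, the drift appearing in the Langevin diffusion mentioned in the introduction.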

To sample from the barycenter of $P = \sum \lambda_i \delta_{\mu_i}$, we consider the flow described by
$$\begin{cases} \partial_t X_t = -\nabla_W F_\alpha(\gamma(t))(X_t) \\ X(0) = X_0 \sim \gamma(0), \end{cases} \qquad (7)$$
where $\gamma(t)$ denotes the distribution of $X_t$ at time $t > 0$. Recall that for a measure $\gamma \in \mathcal{P}_2((\mathbb{R}^d)^n)$, we write $\gamma_i$ for its $i$-th marginal. Using (6), the Wasserstein gradient of $F_\alpha$ is given by
$$\nabla_W F_\alpha(\gamma)(x) = \nabla c(x) + 2\alpha \sum \lambda_i \nabla \frac{\gamma_i}{\mu_i}(x_i),$$
for any $\gamma \in \mathcal{P}_2((\mathbb{R}^d)^n)$ and $x \in (\mathbb{R}^d)^n$, where the $i$-th term of the sum acts on the $i$-th block of coordinates.

The explicit Euler scheme for (7) provides an approximation of $(X_{t_m})_m$ for $t_{m+1} - t_m = h_m$, with
$$X_{m+1} = X_m - h_m \left[ \nabla c(X_m) + 2\alpha \sum \lambda_i \nabla \frac{\gamma_i(t_m)}{\mu_i}(X_m) \right], \qquad (8)$$
where $\gamma(t_m)$ is the distribution of $X_m$ and $h_m$ is a given step size.
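To make (8) concrete, note that the Euclidean gradient of the cost $c$ has a simple closed form (a direct computation from $T(x) = \sum \lambda_i x_i$, not spelled out in the text): for the $j$-th block of coordinates,
$$\nabla_{x_j} c(x) = 2 \lambda_j \left( x_j - T(x) \right),$$
since the contributions coming from the dependence of $T$ on $x_j$ cancel: $\sum_i 2\lambda_i (x_i - T(x)) \cdot (-\lambda_j) = -2\lambda_j \left( T(x) - T(x) \right) = 0$.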


However, note that since $\gamma(t)$ is unknown, this discretization scheme is not implementable as such. In order to implement it, we use a kernel to approximate the Wasserstein gradient of the penalization term in $F_\alpha$, as is done in Liu and Wang (2016) and Chewi, Le Gouic, Lu, Maunu, and Rigollet (2020). The next section provides more details.

4. IMPLEMENTATION

As noted in Section 3, Algorithm (8) cannot be implemented directly since there is no canonical way to recover the unknown distribution $\gamma(t_m)$ of the current state $X_{t_m}$ from the mere knowledge of $X_m$. In this section, we present a concrete implementation of the scheme (8).

The SVGD algorithm. We consider the popular SVGD algorithm introduced in Liu and Wang (2016), which uses an explicit kernel $K$ independent of the target measure. The kernel $K$ can be chosen to have a convenient analytical form (e.g. a Gaussian kernel) amenable to direct evaluation. Let $\pi \propto e^{-V}$, with $V \in \mathcal{C}^1(\mathbb{R}^d)$, be the target measure. SVGD aims to minimize the $\chi^2_\pi$ divergence by transporting randomly initialized points along the flow described by
$$\partial_t X_t = -K_\pi \nabla \chi^2_\pi(\mu_t)(X_t), \qquad (9)$$
where $X_t \sim \mu_t$ and $K_\pi : f \mapsto \int K(\cdot, x) f(x) \, d\pi(x)$ is the integral operator associated with $K$. Note that integration by parts yields
$$K_\pi \nabla \chi^2_\pi(\mu_t)(X_t) = \int \left\{ K(X_t, x) \nabla V(x) - \nabla_2 K(X_t, x) \right\} d\mu_t(x),$$
where $\nabla_2 K$ denotes the gradient of $K$ with respect to its second argument.

Approximating the integrals by averaging over samples $(X_t^i)_{i=1,\dots,N}$ distributed according to $\mu_t$, one gets the SVGD algorithm
$$X_{t+1}^i = X_t^i + \frac{h_t}{N} \sum_{j=1}^N \left\{ -K(X_t^i, X_t^j) \nabla V(X_t^j) + \nabla_2 K(X_t^i, X_t^j) \right\} \qquad (10)$$
where the initial points $X_0^1, \dots, X_0^N$ are chosen at random.
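For concreteness, here is a minimal NumPy sketch of the update (10) with the Gaussian kernel $K(x, y) = \exp(-|x - y|^2)$ used later in the paper; the score function `grad_V` and the Gaussian example target are assumptions of this sketch, and no claim is made that it reproduces the authors' implementation.

```python
import numpy as np

def svgd_step(X, grad_V, h):
    """One SVGD iteration (10) on an (N, d) array of particles X.

    Uses the Gaussian kernel K(x, y) = exp(-|x - y|^2), whose gradient in
    the second argument is grad_2 K(x, y) = 2 (x - y) K(x, y).
    """
    diff = X[None, :, :] - X[:, None, :]          # diff[i, j] = X[j] - X[i]
    K = np.exp(-np.sum(diff ** 2, axis=-1))       # kernel matrix, shape (N, N)
    grad2_K = -2.0 * diff * K[:, :, None]         # grad_2 K(X[i], X[j])
    drift = -K @ grad_V(X) + grad2_K.sum(axis=1)  # sum over j of both terms in (10)
    return X + (h / len(X)) * drift

# Example: sample from a standard Gaussian target, V(x) = |x|^2 / 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1)) * 3.0               # random initialization
for _ in range(500):
    X = svgd_step(X, grad_V=lambda x: x, h=0.1)
```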

Proposed algorithm. To minimize $F_\alpha$, we implement (8), replacing each gradient in the penalization sum $\sum_{i=1}^n \lambda_i \chi^2_{\mu_i}$ with the SVGD iteration for the corresponding marginal. Thus, we couple $n$ batches of $N$ points and generate $N$ samples approximately distributed according to the barycenter $T_\# \gamma^\alpha$ of a coupling $\gamma^\alpha$ minimizing $F_\alpha$. Writing $X_t^i = (X_t^{i,1}, \dots, X_t^{i,n})$ for the $i$-th coupled particle, our iteration takes the form
$$X_{t+1}^{i,j} = X_t^{i,j} - h_t \nabla_j c(X_t^i) + \frac{\alpha h_t \lambda_j}{N} \sum_{\ell=1}^N \left\{ K(X_t^{i,j}, X_t^{\ell,j}) \nabla \log \mu_j(X_t^{\ell,j}) + \nabla_2 K(X_t^{i,j}, X_t^{\ell,j}) \right\}$$
where $j \in \{1, \dots, n\}$ is the index of the marginal $\mu_j$ and $X_t^{i,j}$, $i \in \{1, \dots, N\}$, stands for the $i$-th sample associated with the $j$-th marginal.

Numerical details. All the simulations are carried out with a Gaussian kernel
$$K(x, y) = \exp\left( -|x - y|^2 \right), \qquad \forall x, y \in \mathbb{R}^d.$$


The step size $h_t$ as well as $\alpha = \alpha_t$ vary over iterations, starting from $h_0 = 0.1$ and $\alpha_0 = 1000$. We first keep the parameters fixed and let gradient descent concentrate the particles on the graph of an increasing function, before doubling the penalization strength while keeping $h_t \cdot \alpha_t$ constant. Thus, we enforce the marginal constraints once the coupling corresponds to an optimal transport. In Figure 2, the step size evolves according to AdaGrad (see Duchi, Hazan, and Singer (2011)). We give further experimental results in Appendix A.
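A sketch of this schedule, assuming the coupled-particle array `X` of the sketch above with $d = 1$ (the monotonicity test below is our reading of the "increasing functions of each other" criterion described in Appendix A):

```python
import numpy as np

def is_monotone_coupling(X, tol=1e-8):
    """Check, in 1D, whether all marginals are increasing functions of each
    other, i.e. one ordering of the particles sorts every marginal batch."""
    order = np.argsort(X[:, 0, 0])
    return all(np.all(np.diff(X[order, j, 0]) >= -tol) for j in range(X.shape[1]))

def update_schedule(h, alpha, X):
    """Double the penalization and halve the step size, keeping h * alpha
    constant, once the coupling looks like an optimal (monotone) transport."""
    if is_monotone_coupling(X):
        return h / 2.0, alpha * 2.0
    return h, alpha
```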

5. CONCLUSION

We have presented a new algorithm to sample from the Wasserstein barycenter of a finite set of measures. Our numerical experiments illustrate its convergence in several examples. We also proved a theoretical bound for samples drawn from almost minimizers of $F_\alpha$, the gradient flow of which inspired our algorithm. It remains to formally prove that this algorithm, a kernelized gradient descent on $F_\alpha$, converges to a minimum of $F_\alpha$, and to bound the rate of convergence. On another note, it would be interesting to import other common tools of optimization to the Wasserstein setting in order to derive new sampling algorithms by optimizing $F_\alpha$.

ACKNOWLEDGMENTS

We thank Philippe Rigollet for useful discussions. Thibaut Le Gouic was supported by NSF award IIS-1838071.


Figure 1: Top plot: 200 samples drawn from the barycenter of $\mathcal{N}(0, 1)$ and $\arctan_\# \mathcal{N}(0, 1)$ using BARYGD with a Gaussian kernel and initial step size $h = 10^{-3}$, and the resulting kernel density estimate. Bottom plot: Trajectories of the samples evolving over 220 iterations, from green to red samples. We can see the $\arctan$ transport map is approximately reconstructed.


Figure 2: Top figure: Top plot: 150 samples drawn from the barycenter of two unit variance normal distributions (shown in light gray) using our algorithm with a Gaussian kernel, $\alpha = 10^3$ and $h_m = 10^{-4}$, and the resulting kernel density estimator. Bottom plot: Trajectories of the samples evolving over 1000 iterations, from green to red samples. The transport map is almost perfectly reconstructed. Bottom figure: the Wasserstein distance to the true barycenter over iterations, plotted in log-log scale. From the graph, we read that BARYGD performs as $O(1/\sqrt{m})$ in the number of iterations $m$.


REFERENCES

Agueh, M. and G. Carlier (2011). "Barycenters in the Wasserstein Space". In: SIAM J. Math. Anal. 43.2, pp. 904–924.

Ahidar-Coutrix, A., T. Le Gouic, and Q. Paris (2020). "Convergence rates for empirical barycenters in metric spaces: curvature, convexity and extendable geodesics". In: Probab. Theory Related Fields 177.1, pp. 323–368.

Altschuler, J. M. and E. Boix-Adserà (2021). "Wasserstein barycenters are NP-hard to compute". arXiv: 2101.01100.

Ambrosio, L., N. Gigli, and G. Savaré (2008). Gradient Flows: In Metric Spaces and in the Space of Probability Measures. 2nd ed. Lectures in Mathematics ETH Zürich. Springer Science & Business Media.

Barré, M., C. Giron, M. Mazzolini, and A. d'Aspremont (2020). "Averaging Atmospheric Gas Concentration Data using Wasserstein Barycenters". arXiv: 2010.02762.

Benamou, J.-D., G. Carlier, M. Cuturi, L. Nenna, and G. Peyré (2015). "Iterative Bregman projections for regularized transportation problems". In: SIAM J. Sci. Comput. 37.2, pp. A1111–A1138.

Bishop, C. (2006). Pattern Recognition and Machine Learning. Information Science and Statistics. Springer-Verlag New York.

Chewi, S., T. Le Gouic, C. Lu, T. Maunu, and P. Rigollet (2020). "SVGD as a kernelized Wasserstein gradient flow of the chi-squared divergence". arXiv: 2006.02509.

Chewi, S., T. Le Gouic, C. Lu, T. Maunu, P. Rigollet, and A. Stromme (2020). "Exponential ergodicity of mirror-Langevin diffusions". arXiv: 2005.09669.

Chewi, S., T. Maunu, P. Rigollet, and A. J. Stromme (2020). "Gradient descent algorithms for Bures-Wasserstein barycenters". arXiv: 2001.01700.

Cuturi, M. and A. Doucet (2014). "Fast computation of Wasserstein barycenters". In: Proceedings of the 31st International Conference on Machine Learning (ICML), JMLR 32.

Ding, Y. (2015). "A note on quadratic transportation and divergence inequality". In: Statist. Probab. Letters 100, pp. 115–123.

Duchi, J., E. Hazan, and Y. Singer (2011). "Adaptive Subgradient Methods for Online Learning and Stochastic Optimization". In: J. Mach. Learn. Res. 12, pp. 2121–2159.

Gordaliza, P., E. Del Barrio, G. Fabrice, and J.-M. Loubes (2019). "Obtaining fairness using optimal transport theory". In: International Conference on Machine Learning, pp. 2357–2365.

Gramfort, A., G. Peyré, and M. Cuturi (2015). "Fast optimal transport averaging of neuroimaging data". In: International Conference on Information Processing in Medical Imaging. Springer, pp. 261–272.

Jordan, R., D. Kinderlehrer, and F. Otto (1998). "The Variational Formulation of the Fokker-Planck Equation". In: SIAM J. Math. Anal. 29.1, pp. 1–17.

Julien, R., G. Peyré, J. Delon, and B. Marc (2011). "Wasserstein Barycenter and its Application to Texture Mixing". In: SSVM'11. Israel: Springer, pp. 435–446.

Le Gouic, T. and J.-M. Loubes (2016). "Existence and Consistency of Wasserstein Barycenters". arXiv: 1506.04153.

Le Gouic, T., J.-M. Loubes, and P. Rigollet (2020). "Projection to fairness in statistical learning". arXiv: 2005.11720.

Le Gouic, T., Q. Paris, P. Rigollet, and A. J. Stromme (2019). "Fast convergence of empirical barycenters in Alexandrov spaces and the Wasserstein space". arXiv: 1908.00828.

Ledoux, M. (2018). "Remarks on some transportation cost inequalities". Preprint.

Liu, Q. and D. Wang (2016). "Stein variational gradient descent: A general purpose Bayesian inference algorithm". In: Advances in Neural Information Processing Systems, pp. 2378–2386.

Luise, G., S. Salzo, M. Pontil, and C. Ciliberto (2019). "Sinkhorn Barycenters with Free Support via Frank-Wolfe Algorithm". arXiv: 1905.13194.

Malagò, L. and G. Pistone (2018). "Wasserstein Riemannian geometry of Gaussian densities". In: Information Geometry 1, pp. 137–179.

Pavliotis, G. A. (2014). Stochastic Processes and Applications: Diffusion Processes, the Fokker-Planck and Langevin Equations. Vol. 60. Springer.

Peyré, G. and M. Cuturi (2020). "Computational Optimal Transport". arXiv: 1803.00567.

Polyanskiy, Y. and Y. Wu (2019). "Lecture Notes on Information Theory".

Shen, Z., Z. Wang, A. Ribeiro, and H. Hassani (2020). "Sinkhorn Barycenter via Functional Gradient Descent". arXiv: 2007.10449.

Solomon, J., F. De Goes, G. Peyré, M. Cuturi, A. Butscher, A. Nguyen, T. Du, and L. Guibas (2015). "Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains". In: ACM Transactions on Graphics (TOG) 34.4, pp. 1–11.

Srivastava, S., V. Cevher, Q. Dinh, and D. Dunson (2015). "WASP: Scalable Bayes via barycenters of subset posteriors". In: Artificial Intelligence and Statistics, pp. 912–920.

Srivastava, S., C. Li, and D. B. Dunson (2018). "Scalable Bayes via barycenter in Wasserstein space". In: The Journal of Machine Learning Research 19.1, pp. 312–346.

Sturm, K.-T. (2003). "Probability measures on metric spaces of nonpositive curvature". In: Heat Kernels and Analysis on Manifolds, Graphs, and Metric Spaces (Emile Borel Centre of the Henri Poincaré Institute, Paris, April 16–July 13, 2002), pp. 338–357.

Vempala, S. and A. Wibisono (2019). "Rapid convergence of the unadjusted Langevin algorithm: Isoperimetry suffices". In: Advances in Neural Information Processing Systems, pp. 8094–8106.

Villani, C. (2009). Optimal Transport: Old and New. Vol. 338. Grundlehren der mathematischen Wissenschaften. Springer-Verlag Berlin Heidelberg.


A. ADDITIONAL SIMULATIONS

In this section, we illustrate our algorithm with further experiments. We also test the LAWGD algorithm introduced by Chewi, Le Gouic, Lu, Maunu, and Rigollet (2020) to approximate the gradient of the penalization in the Euler scheme (8).

The LAWGD algorithm. LAWGD generates a batch of samples from the target measure $\pi \propto e^{-V}$ by transporting randomly initialized points $X_0^1, \dots, X_0^N \in \mathbb{R}^d$ along the flow described by
$$\partial_t \mu_t = \operatorname{div}\left( \mu_t \nabla_2 \mathcal{L}^{-1}(\mu_t) \right) \qquad (11)$$
where $\mathcal{L} = \Delta - \nabla V \cdot \nabla$ is the infinitesimal generator of a Markov diffusion process with stationary measure $\pi$. It is assumed to have discrete spectrum and that its eigenvectors $\psi_k$ (associated with strictly positive eigenvalues $\lambda_k$) form a basis of $L^2(\pi)$. An eigendecomposition then yields
$$\mathcal{L}^{-1} = \sum_k \frac{\psi_k \otimes \psi_k}{\lambda_k}, \qquad (12)$$
which leads to the LAWGD algorithm upon discretization:
$$X_{m+1}^i = X_m^i - \frac{h_m}{N} \sum_{j=1}^N \nabla_2 \mathcal{L}^{-1}(X_m^i, X_m^j) \qquad (13)$$
where $h_m$ is the step size at iteration $m$. In contrast to SVGD, where one uses $\nabla V$ directly, here the gradient intervenes only indirectly to compute the $\psi_k$, while the actual computations are carried out with $\nabla_2 \mathcal{L}^{-1}$. To compute the eigendecomposition (12), we implement the Schrödinger scheme described in Chewi, Le Gouic, Lu, Maunu, and Rigollet (2020, Section 5).

Barycenter of Gaussians. Let $b$ denote the Wasserstein barycenter of $\mu_i = \mathcal{N}(m_i, S_i)$, $i = 1, \dots, n$. We have $b = \mathcal{N}(m^\star, S^\star)$ where $m^\star = \sum \lambda_i m_i$ and $S^\star$ is the fixed point of $S \mapsto \sum \lambda_i (S^{1/2} S_i S^{1/2})^{1/2}$ on positive semidefinite matrices. Moreover, the Wasserstein distance between two Gaussians is explicit:
$$W_2^2(\mu_i, \mu_j) = |m_i - m_j|^2 + \operatorname{tr}\left( S_i + S_j - 2 \left( S_i^{1/2} S_j S_i^{1/2} \right)^{1/2} \right)$$
for any $i, j \in \{1, \dots, n\}$. See Malagò and Pistone (2018) for details on the geometry of Gaussians and Agueh and Carlier (2011, Section 6) for the characterization of the barycenter of Gaussians. These results allow us to compare our approximate barycenters to the true barycenters for Gaussian marginals.
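A short sketch of the fixed-point iteration for $S^\star$ (using SciPy's matrix square root; convergence for positive definite $S_i$ is assumed rather than checked here):

```python
import numpy as np
from scipy.linalg import sqrtm

def gaussian_barycenter_cov(covs, lams, iters=100):
    """Fixed-point iteration S <- sum_i lam_i (S^(1/2) S_i S^(1/2))^(1/2)."""
    S = np.mean(covs, axis=0)                  # any positive definite start
    for _ in range(iters):
        R = sqrtm(S)
        S = sum(l * sqrtm(R @ C @ R) for l, C in zip(lams, covs))
        S = np.real(S)                         # discard round-off imaginary parts
    return S

# Example: the two almost orthogonal covariances of Figure 3.
covs = [np.diag([1.0, 0.01]), np.diag([0.01, 1.0])]
S_star = gaussian_barycenter_cov(covs, [0.5, 0.5])
```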

Experimental setup. In our experiments, we first test BARYGD in one dimension with measures obtained by pushing forward the standard normal distribution by an increasing map. With three marginals, to choose the parameters $\alpha_t$ and $h_t$, we start with $h_0 = 0.1$ and $\alpha_0 = 1000$, and we double $\alpha_t$ while dividing $h_t$ by 2 (keeping $\alpha_t h_t$ constant) whenever all marginal samples are increasing functions of each other. In this one-dimensional setting, we can also generate reference samples from the barycenter against which to compare our results. Figures 1 and 2 illustrate our algorithm with two marginals. Figures 4 and 5 show results with three marginals. The golden line represents the Gangbo-Święch map $T = (T^1, T^2, T^3)$ where the $T^i$ are the coordinate transformations, e.g. $T^1 = \mathrm{id}$, $T^2 = \arctan$, $T^3$ an affine transformation, when the distributions are $\mathcal{N}(0, 1)$, $\arctan_\# \mathcal{N}(0, 1)$ and $\mathcal{N}(\mu, \sigma^2)$.

Then we test the algorithm in two dimensions by sampling from the barycenter of three Gaussian distributions. In this case, we choose $\alpha = 1000$ and $h = 10^{-4}$. This is shown in Figure 6. The colored regions show the kernel estimates of the marginals. The contour lines represent the true marginals and barycenter.

Choosing an appropriate sequence of step sizes that ensures gradient descent converges is a difficult problem. We focused on testing BARYGD with SVGD, as our implementation of this variant proved easier to parametrize. However, as illustrated in Figure 7, we have observed that when BARYGD with the LAWGD kernel is well parametrized, it can yield qualitatively better results than when SVGD is used.

Effect of the penalization strength. To illustrate the effect of the penalization coefficient, we generate samples distributed according to the barycenter of two Gaussians in two dimensions (see Figure 3). In this experiment, the step size and the penalization strength are kept fixed. We observe that the quality of the approximate barycenter increases with the penalization.


Figure 3: 100 samples (red) drawn from the barycenter of two almost orthogonal normal distributions (gray patches) $\mathcal{N}\left(0, \begin{pmatrix} 1 & 0 \\ 0 & \varepsilon \end{pmatrix}\right)$ and $\mathcal{N}\left(0, \begin{pmatrix} \varepsilon & 0 \\ 0 & 1 \end{pmatrix}\right)$ with $\varepsilon = 10^{-2}$, using BARYGD. We used a Gaussian kernel and fixed step size $h = 10^{-2} \alpha^{-1}$. The resulting density estimates are shown as red patches. Initial samples are shown in light green. The density of the true barycenter is represented by the black contour lines and the densities of the marginals by the white contour lines. The algorithm ran for 2000 iterations with penalization strengths $\alpha = 1, 10^2, 10^3$ (top left to top right to bottom plots). The barycentric weights are $\lambda_1 = \lambda_2 = 0.5$. We see that the marginals and the barycenter are increasingly better approximated as the value of $\alpha$ increases.


Figure 4: Top plot: 300 samples drawn from the barycenter of three normal distributions using BARYGD with a Gaussian kernel, and the density estimator. Initial samples are shown in light green. Bottom plot: Evolution of the samples over 600 iterations. We see the linear transport map is well reconstructed. The "front" is formed when the algorithm starts imposing the marginal constraints.


Figure 5: Top plot: 300 samples drawn from the barycenter of two normal distributions and $\arctan_\# \mathcal{N}(0, 1)$ using BARYGD with a Gaussian kernel, and the density estimator. Initial samples are shown in light green. Bottom plot: Evolution of the samples over 600 iterations. We see the samples coil around the transport map, approximately reconstructing it. The "front" is formed when the algorithm starts imposing the marginal constraints.


Figure 6: 150 samples (red) drawn from the barycenter of three Gaussian distributions in the plane using BARYGD with an adaptive Gaussian kernel and adaptive step size (AdaGrad) over 1000 iterations. Initial points are shown in green. The colored regions represent kernel density estimates of the marginals computed from the generated samples. Contour lines represent the reference measures.


Figure 7: Left: 100 samples drawn from the barycenter of two Gaussian distributions using BARYGD with the LAWGD kernel over 300 iterations. We see that the density is well approximated. Right: Coupling at the end of the algorithm. The points are aligned on a line, as expected when transporting Gaussians.


B. FIRST ORDER DIFFERENTIAL STRUCTURE IN WASSERSTEIN SPACES

The fundamental elements upon which a Riemannian structure is constructed are firstly the curves, from which the tangent spaces are constructed, and then the metric, which gives each tangent space a Euclidean structure. To do so on the Wasserstein space, it is helpful to think of a probability measure as a vast collection of appropriately distributed particles in $\mathbb{R}^d$. Then, properties of the measure can be phrased in terms of properties of the particles and vice versa.

Let $\mu_0 \in \mathcal{P}_2(\mathbb{R}^d)$ and let $X_0 \sim \mu_0$ be a point on an integral curve $(X_t)_t$ of a vector field $(V_t)_t$, i.e. $\partial_t X_t = V_t(X_t)$, and let $\mu_t$ be the law of $X_t$. Differentiating $\mu_t$ by duality against smooth, compactly supported functions $f \in C_c^\infty(\mathbb{R}^d)$ yields
$$\langle \partial_t \mu_t, f \rangle = \mathbb{E}\, \partial_t f(X_t) = \mathbb{E}\, \langle \nabla f(X_t), V_t(X_t) \rangle = \int_{\mathbb{R}^d} \nabla f \cdot V_t \, d\mu_t = -\int_{\mathbb{R}^d} f \operatorname{div}(\mu_t V_t). \qquad (14)$$
In particular, $(\mu_t)_t$ satisfies the conservation of mass equation
$$\partial_t \mu_t = -\operatorname{div}(\mu_t V_t) \qquad (15)$$
in a weak sense.

In comparison with its Riemannian analogue, equality (14) suggests interpreting the tangent vector to the curve at time $t$ as the vector field $V_t$. However, equation (15) does not uniquely define such a vector: replacing $V_t$ with $V_t + \tilde{V}_t$, where $\operatorname{div}(\mu_t \tilde{V}_t) = 0$, gives the same curve $(\mu_t)_t$ for a different "tangent vector".

Fortunately, optimal transport theory provides a natural choice for $V_t$ when one wants to transport $\mu_t$ to a nearby $\mu_{t+h}$ at minimal quadratic cost: gradients $\nabla(\psi_t - \frac{1}{2} |\cdot|^2)$ of Kantorovich potentials $\psi_t$, i.e. convex functions such that $\mu_{t+h} = \nabla {\psi_t}_\# \mu_t$. Indeed, the optimal trajectory between $X_t$ and $X_{t+h}$ is a geodesic, $X_{t+\nu} = (1 - \nu) X_t + \nu X_{t+h}$ for $0 \leq \nu \leq 1$, thus
$$(\partial_\nu X_{t+\nu})|_{\nu=0} = \nabla\left( \psi_t - \tfrac{1}{2} |\cdot|^2 \right)(X_t).$$

Motivated by this case, one then defines the tangent space at $\mu_t$ as
$$T_{\mu_t} \mathcal{P}_2(\mathbb{R}^d) = \overline{\{ \nabla \psi \mid \psi \in C^\infty(\mathbb{R}^d) \}}^{L^2(\mu_t)},$$
equipped with the Hilbert structure inherited from $L^2(\mathbb{R}^d, \mathbb{R}^d; \mu_t)$. We denote by $\langle \cdot, \cdot \rangle_\mu$ the corresponding scalar product, given by $\langle f, g \rangle_\mu = \int \langle f, g \rangle \, d\mu$ for $f, g : \mathbb{R}^d \to \mathbb{R}^d$. This Hilbert structure is also consistent with the Benamou-Brenier formula for the Wasserstein metric
$$W_2^2(\mu_0, \mu_1) = \inf_{\mu \in \Gamma(\mu_0, \mu_1)} \int_0^1 |V_t|^2_{\mu_t} \, dt \qquad (16)$$
where $\Gamma(\mu_0, \mu_1) = \{ (\mu_t)_t \mid \mu(0) = \mu_0, \ \mu(1) = \mu_1, \ \partial_t \mu_t = -\operatorname{div}(\mu_t V_t) \}$. Equation (16) is a perfect analogue of the length formula in Riemannian manifolds. Moreover, it confirms that optimal transport curves are indeed the geodesics in the Wasserstein space.

Now let $F : \mathcal{P}_2(\mathbb{R}^d) \to \mathbb{R}$ be a functional on the Wasserstein space and let $f$ denote its first variation, i.e. the function satisfying
$$(\partial_t F(\mu + t\rho))|_{t=0} = \int_{\mathbb{R}^d} f \, d\rho$$
for signed measures $\rho$ with total mass $\int_{\mathbb{R}^d} d\rho = 0$. To define the gradient of $F$ at a point $\mu \in \mathcal{P}_2(\mathbb{R}^d)$, let $(\mu_t)_t$ be a curve such that $\mu_0 = \mu$. Then, formally, $\mu_t = \mu_0 - t \operatorname{div}(\mu_0 V_0) + o(t)$ and
$$F(\mu_t) = F(\mu_0) - t \int_{\mathbb{R}^d} f \operatorname{div}(\mu_0 V_0) + o(t),$$
so
$$(\partial_t F(\mu_t))|_{t=0} = \langle \nabla_W F(\mu_0), V_0 \rangle_{\mu_0} = \langle \nabla f, V_0 \rangle_{\mu_0}.$$
This resembles the behavior of Riemannian gradients and suggests that we define the Wasserstein gradient of $F$ as the gradient of its first variation, i.e. that we set
$$\nabla_W F(\mu) := \nabla f : \mathbb{R}^d \to \mathbb{R}^d \in L^2(\mathbb{R}^d, \mathbb{R}^d; \mu).$$

With this machinery, functionals over the Wasserstein space can be optimized by following Wasserstein gradient flows. This was the main result of the seminal paper of Jordan, Kinderlehrer, and Otto (1998), in which they proved that the marginal distributions of Langevin diffusion form a path in the Wasserstein space that is the gradient flow of the Kullback-Leibler divergence.

Let $\mu \in \mathcal{P}_2^{ac}(\mathbb{R}^d)$. For completeness, we compute the gradient of $\chi^2_\mu$ at $\nu \in \mathcal{P}_2^{ac}(\mathbb{R}^d)$. Let $\delta$ be a signed measure such that $\int d\delta = 0$. Then, for small $t$, we have
$$\chi^2_\mu(\nu + t\delta) = \chi^2_\mu(\nu) + 2t \int \left( \frac{\nu}{\mu} - 1 \right) d\delta + o(t) = \chi^2_\mu(\nu) + 2t \int \frac{\nu}{\mu} \, d\delta + o(t),$$
hence the first variation of $\chi^2_\mu$ at $\nu$ is $2\nu/\mu$, and
$$\nabla_W \chi^2_\mu(\nu) = 2 \nabla \frac{\nu}{\mu}.$$

C. PROOFS

In the next two subsections, we recall some notions that will appear in the proofs. We recall useful notation in the third subsection, and then the different proofs are gathered.

C.1. The χ2 transportation inequality

We recall a transportation inequality which connects the $\chi^2$ penalization with the Wasserstein distance under the Poincaré inequality. We say that a measure $\mu \in \mathcal{P}_2(\mathbb{R}^d)$ satisfies a $\chi^2$-transport inequality with positive constant $c$, denoted by $T^{\chi^2}_2(c)$, if we have
$$W_2^2(\mu, \nu) \leq 2c \, \chi^2_\mu(\nu)$$
for all $\nu \in \mathcal{P}_2(\mathbb{R}^d)$. The following important result is due to Ding (2015) (see also the note by Ledoux (2018)); it makes the connection between the $\chi^2$ divergence and the Wasserstein distance precise.

Proposition 3 (Ding). Let $\mu \in \mathcal{P}_2(\mathbb{R}^d)$ and $c \in \mathbb{R}_+^\star$. If $\mu$ satisfies a Poincaré inequality with constant $c$, then it satisfies $T^{\chi^2}_2(2c)$. Conversely, if $\mu$ satisfies $T^{\chi^2}_2(c)$, then it satisfies a Poincaré inequality with constant $2c$.

C.2. Variance inequalities

Variance inequalities are crucial to the proof of Proposition 2 in Subsection C.5. Moreover, they naturally ensure uniqueness of the barycenter. These inequalities were introduced by Sturm (2003) in his investigation of the curvature of general metric spaces. They have since played a central role in the study of convergence rates of empirical barycenters in Ahidar-Coutrix, Le Gouic, and Paris (2020) and Le Gouic et al. (2019). Recently, Chewi, Maunu, et al. (2020) gave a simple condition on the optimal transport maps from the barycenter to any other measure implying a variance inequality.

Recall that a probability measure $P \in \mathcal{P}_2(\mathcal{P}_2(\mathbb{R}^d))$ with barycenter $b_P \in \mathcal{P}_2(\mathbb{R}^d)$ satisfies a variance inequality with constant $k \in [0, 1]$ if
$$k \, W_2^2(\nu, b_P) \leq \int \left[ W_2^2(\mu, \nu) - W_2^2(\mu, b_P) \right] P(d\mu) \qquad (17)$$
for any $\nu \in \mathcal{P}_2(\mathbb{R}^d)$. Note that such an inequality always holds with $k = 0$ and that if $k > 0$ then the barycenter is unique. Variance inequalities express that the variance functional $\nu \mapsto \int W_2^2(\mu, \nu) \, P(d\mu)$ behaves quadratically around the barycenter $b_P$.

From Sturm (2003), it is known that variance inequalities with constant 1 are satisfied by probability measures defined over spaces of non-positive Alexandrov curvature. The situation is more contrasted for measures defined over spaces of non-negative curvature, such as $\mathcal{P}_2(\mathbb{R}^d)$: on such spaces, variance inequalities may fail for every $k > 0$. Let us recall a recent result of Chewi, Maunu, et al. (2020, Theorem 6) in that direction.

Let $P \in \mathcal{P}_2(\mathcal{P}_2(\mathbb{R}^d))$ and let $b_P \in \mathcal{P}_2(\mathbb{R}^d)$ be a barycenter of $P$. For any $\nu \in \mathcal{P}_2(\mathbb{R}^d)$, denote by $\varphi_\nu$ the Kantorovich potential from $b_P$ to $\nu$. If there exists $k : \mathcal{P}_2(\mathbb{R}^d) \to \mathbb{R}_+$ such that $\varphi_\nu$ is $k(\nu)$-strongly convex for $P$-almost any $\nu \in \mathcal{P}_2(\mathbb{R}^d)$, and $\int \nabla \varphi_\nu \, dP(\nu) = \mathrm{id}$, then $P$ satisfies a variance inequality with constant $\int k(\nu) \, P(d\nu)$.

C.3. Notation

We recall that $T$ denotes the map $x \mapsto \sum \lambda_i x_i$, where the $\lambda_i$ are the barycentric weights, and that, for any $\alpha \geq 0$, we minimize the functional
$$F_\alpha : \gamma \longmapsto \int c \, d\gamma + \alpha \sum \lambda_i \chi^2_{\mu_i}(\gamma_i), \quad \text{where } c(x) = \sum \lambda_i |x_i - T(x)|^2 \ \ \forall x \in (\mathbb{R}^d)^n.$$
We write $\gamma^\star$ for the barycenter coupling of the $\mu_i$ with weights $\lambda_i$, i.e. it satisfies
$$T_\# \gamma^\star = \operatorname*{argmin}_{\nu \in \mathcal{P}_2(\mathbb{R}^d)} \sum \lambda_i W_2^2(\mu_i, \nu). \qquad (18)$$
We assume throughout that the measures $\mu_1, \dots, \mu_n$ satisfy the Poincaré inequality with constant $C_P$.


C.4. Proof of Proposition 1

Let $\alpha > 0$. We begin by showing that $F_\alpha$ is lower semi-continuous, i.e. that for every weakly convergent sequence $(\gamma^k)$ with limit $\gamma$, we have
$$F_\alpha(\gamma) \leq \liminf_{k \to \infty} F_\alpha(\gamma^k).$$
Let $R = \sum \lambda_i \chi^2_{\mu_i}$ and $F : \gamma \mapsto \int c \, d\gamma$, so that $F_\alpha = F + \alpha R$. By Polyanskiy and Wu (2019, Theorem 3.6), $\alpha R$ is lower semi-continuous with respect to the weak topology on $\mathcal{P}_2((\mathbb{R}^d)^n)$, and it remains to show that this property also holds for $F$. Let $m > 0$ and $c_m = c \wedge m$. Since $c$ is continuous, it follows that $c_m \in \mathcal{C}^0_b((\mathbb{R}^d)^n)$, and one sees that $c_m \geq 0$ is non-decreasing in $m$ (i.e. $c_m \leq c_{m'}$ when $m < m'$). Moreover, it is clear that $c_m \to c$ pointwise as $m \to \infty$. For any $k \geq 0$,
$$\int c_m \, d\gamma^k \leq \int c \, d\gamma^k,$$
and taking the limit inferior in $k$ on both sides we obtain
$$\liminf_{k \to \infty} \int c_m \, d\gamma^k = \int c_m \, d\gamma \leq \liminf_{k \to \infty} \int c \, d\gamma^k,$$
where the equality follows from weak convergence. We now use the monotone convergence theorem to obtain
$$\int c \, d\gamma \leq \liminf_{k \to \infty} \int c \, d\gamma^k,$$
which shows that $F$ is lower semi-continuous with respect to the weak topology on $\mathcal{P}_2((\mathbb{R}^d)^n)$. Hence, $F_\alpha$ is lower semi-continuous.

Next, we show that $F_\alpha$ admits a minimizer. It is clear that $\inf F_\alpha \geq 0$. Let $(\gamma^k)$ be a minimizing sequence of probability measures on $\mathcal{P}_2((\mathbb{R}^d)^n)$, i.e. a sequence satisfying $F_\alpha(\gamma^k) \to \inf F_\alpha$. Then, from a certain rank $k_0$ on, for any $k \geq k_0$ we have $F_\alpha(\gamma^k) \leq \inf F_\alpha + 1$. Writing $m_2(\gamma) = \sum \lambda_i \int |x_i|^2 \, d\gamma(x)$ for the weighted second moment, we have, for any coupling $\gamma$,
$$F_\alpha(\gamma) = m_2(\gamma) + \sum \lambda_i \int \left( |T(x)|^2 - 2 \langle x_i, T(x) \rangle \right) \gamma(dx) + \alpha \sum \lambda_i \chi^2_{\mu_i}(\gamma_i) =: m_2(\gamma) + r(\gamma),$$
and since the assumption controls $r(\gamma^k)$ in terms of $\inf F_\alpha + 1$, it holds
$$m_2(\gamma^k) \leq 2 (\inf F_\alpha + 1) < \infty. \qquad (19)$$
This guarantees that the sequence $(\gamma^k)$ is tight. Indeed, if by contradiction $(\gamma^k)$ were not tight, then for at least one $i \in \{1, \dots, n\}$ the sequence of marginals $(\gamma_i^k)$ would not be tight either (if all the $\gamma_i^k$ were tight, then for any $\varepsilon > 0$ there would exist compact sets $K_i^\varepsilon$ such that $\gamma_i^k(\mathbb{R}^d \setminus K_i^\varepsilon) \leq \varepsilon/n$ for all $k$, and then $\gamma^k((\mathbb{R}^d)^n \setminus \bigotimes K_i^\varepsilon) \leq \varepsilon$, which would make $(\gamma^k)$ tight as well). Now if $(\gamma_i^k)$ is not tight, then for some $m > 0$ and every compact $K \subset \mathbb{R}^d$ we have $\gamma_i^k(\mathbb{R}^d \setminus K) > m$ along a subsequence, still denoted by $(\gamma_i^k)$; taking $K$ to be balls of arbitrarily large radius $R$ shows that $\sup_k \int |x_i|^2 \, d\gamma_i^k \geq m R^2$ for every $R$, hence $\sup_k m_2(\gamma^k) = \infty$, which contradicts (19). Thus $(\gamma^k)$ is tight. Then, by Villani (2009, Lemma 6.14), at least a subsequence of $(\gamma^k)$ converges to some absolutely continuous $\gamma \in \mathcal{P}_2((\mathbb{R}^d)^n)$ (if $\gamma$ were singular, we would have $F_\alpha(\gamma) = \infty$ due to the penalization term, which cannot be the case). The lower semi-continuity of $F_\alpha$ then implies that $\gamma$ is a minimizer, which concludes the proof. □


C.5. Proof of Proposition 2

Notation. We define the set of $\varepsilon$-approximate minimizers of $F_\alpha$:
$$M^{\varepsilon,\alpha} = \left\{ \gamma \in \mathcal{P}_2((\mathbb{R}^d)^n) \mid 0 \leq F_\alpha(\gamma) - \inf F_\alpha \leq \varepsilon \right\},$$
where $\alpha$ and $\varepsilon$ are strictly positive.

C.5.1 Main result

The main result of this subsection is Proposition 2. It states that, assuming the $\mu_i$ satisfy a variance inequality and a Poincaré inequality, for any $\alpha \geq 1$ and small $\varepsilon$, the distance between barycenters of couplings in $M^{\varepsilon,\alpha}$ and the barycenter $T_\# \gamma^\star$ is controlled by $\varepsilon$ and $1/\sqrt{\alpha}$. The implication of this result is that, in the regime $\varepsilon \ll 1 \ll \alpha$ we are interested in, this distance is bounded by $\varepsilon + 1/\sqrt{\alpha}$, up to a constant. We recall the precise statement.

Proposition 2. Let $\alpha \geq 1$ and $\varepsilon \leq 1$. Suppose $\mu_1, \dots, \mu_n$ satisfy the Poincaré inequality with constant $C_P$ and that $\sum \lambda_i \delta_{\mu_i}$ satisfies a variance inequality with constant $k$. Then there exists a constant $C_1$, only depending on the variance $\sigma^2 = \int c \, d\gamma^\star$ and $C_P$, such that for any $\gamma^{\varepsilon,\alpha} \in M^{\varepsilon,\alpha}$,
$$\frac{k}{4} W_2^2(T_\# \gamma^{\varepsilon,\alpha}, T_\# \gamma^\star) \leq \varepsilon + \frac{C_1}{\sqrt{\alpha}}.$$

Let us now sketch the proof. For any $\gamma^{\varepsilon,\alpha} \in M^{\varepsilon,\alpha}$, we have
$$\frac{1}{2} W_2^2(T_\# \gamma^{\varepsilon,\alpha}, T_\# \gamma^\star) \leq W_2^2(T_\# \gamma^{\varepsilon,\alpha}, T_\# \overline{\gamma}^{\varepsilon,\alpha}) + W_2^2(T_\# \overline{\gamma}^{\varepsilon,\alpha}, T_\# \gamma^\star), \qquad (20)$$
where $\overline{\gamma}^{\varepsilon,\alpha}$ denotes the coupling associated to the multimarginal problem for the marginals of $\gamma^{\varepsilon,\alpha}$. Our first step is to show that, under the Poincaré inequality, couplings in $M^{\varepsilon,\alpha}$ are close to $\Pi(\mu_1, \dots, \mu_n)$, i.e. their marginals are, up to a constant, $1/\sqrt{\alpha}$-close to the desired marginals; this is Lemma 4. Assuming that $\sum \lambda_i \delta_{\mu_i}$ satisfies a variance inequality, we then use Lemmas 4 and 5 to derive a bound on the distance between the barycenters of $\overline{\gamma}^{\varepsilon,\alpha}$ and $\gamma^\star$.

Lemma 4. Suppose $\mu_1, \dots, \mu_n$ satisfy the Poincaré inequality with constant $C_P$. Recall $\sigma^2 = \int c \, d\gamma^\star$. Let $\alpha > 0$, $\varepsilon \leq \sigma^2$ and let $\gamma^{\varepsilon,\alpha} \in M^{\varepsilon,\alpha}$. Then
$$\sum \lambda_i W_2^2(\gamma_i^{\varepsilon,\alpha}, \mu_i) \leq \frac{8 C_P \sigma^2}{\alpha}. \qquad (21)$$

Lemma 5. Suppose the $\mu_i$ satisfy the $C_P$-Poincaré inequality. Let $\alpha \geq \sigma^2$, $\varepsilon \leq \sigma^2$ and let $\gamma^{\varepsilon,\alpha} \in M^{\varepsilon,\alpha}$. Further, let $\overline{\gamma}^{\varepsilon,\alpha}$ be the barycenter coupling with weights $\lambda_1, \dots, \lambda_n$ for $\gamma_1^{\varepsilon,\alpha}, \dots, \gamma_n^{\varepsilon,\alpha}$, and define
$$C_2 = 4\sigma \left( 2 C_P + \sqrt{2 C_P} \right).$$
Then
$$\int c \, d\gamma^{\varepsilon,\alpha} - \int c \, d\gamma^\star \leq \varepsilon, \qquad (22)$$
$$\int c \, d\overline{\gamma}^{\varepsilon,\alpha} - \int c \, d\gamma^\star \leq \frac{C_2}{\sqrt{\alpha}},$$
and
$$\int c \, d\gamma^{\varepsilon,\alpha} - \int c \, d\overline{\gamma}^{\varepsilon,\alpha} \leq \varepsilon + \frac{3 C_2}{\sqrt{\alpha}}.$$


Our second step is to show that a perturbed variance inequality holds for $\sum \lambda_i \delta_{\gamma_i}$ for any $\gamma \in M^{\varepsilon,\alpha}$; this is Lemma 6. We exploit this result to derive a bound on the distance between the barycenters of $\gamma^{\varepsilon,\alpha}$ and $\overline{\gamma}^{\varepsilon,\alpha}$. However, it introduces a supplementary term (the second one in the right hand side of (23)), which we need to control with $\varepsilon$ and $1/\sqrt{\alpha}$. We carry this out with Lemmas 4 and 5. We conclude the proof by injecting the two distance bounds back into (20).

Lemma 6. Let $P, Q \in \mathcal{P}_2(\mathcal{P}_2(\mathbb{R}^d))$. Assume that $P$ satisfies a $k$-variance inequality and that $P$ and $Q$ admit barycenters $b_P$ and $b_Q \in \mathcal{P}_2(\mathbb{R}^d)$. Let
$$\sigma^2(P) = \int W_2^2(\nu, b_P) \, P(d\nu) \quad \text{and} \quad \sigma^2(Q) = \int W_2^2(\nu, b_Q) \, Q(d\nu).$$
Then, for any $\mu \in \mathcal{P}_2(\mathbb{R}^d)$,
$$\frac{k}{2} W_2^2(\mu, b_Q) \leq \int \left( W_2^2(\nu, \mu) - W_2^2(\nu, b_Q) \right) Q(d\nu) + C(\mu) \, W_2(Q, P) \qquad (23)$$
where
$$C(\mu) = 2 \sqrt{8 \left[ \sigma^2(Q) + \sigma^2(P) + W_2^2(b_P, b_Q) \right] + 2 W_2^2(\mu, b_Q)}.$$

C.5.2 Proofs

Proof of Lemma 4. Let $\gamma^\alpha$ be a minimizer of $F_\alpha$ and let $R = \sum \lambda_i \chi^2_{\mu_i}$. The definition of $M^{\varepsilon,\alpha}$ gives
$$0 \leq \int c \, d\gamma^{\varepsilon,\alpha} - \int c \, d\gamma^\alpha + \alpha R(\gamma^{\varepsilon,\alpha}) - \alpha R(\gamma^\alpha) \leq \varepsilon,$$
whence
$$\alpha R(\gamma^{\varepsilon,\alpha}) \leq \varepsilon - \int c \, d\gamma^{\varepsilon,\alpha} + \int c \, d\gamma^\alpha + \alpha R(\gamma^\alpha).$$
Since $\gamma^\alpha$ is a minimizer, we have $F_\alpha(\gamma^\alpha) \leq F_\alpha(\gamma^\star) = \int c \, d\gamma^\star$ and, subtracting $\int c \, d\gamma^{\varepsilon,\alpha}$ from both sides, we find
$$-\int c \, d\gamma^{\varepsilon,\alpha} + \int c \, d\gamma^\alpha + \alpha R(\gamma^\alpha) \leq \int c \, d\gamma^\star - \int c \, d\gamma^{\varepsilon,\alpha} \leq \int c \, d\gamma^\star,$$
hence $\alpha R(\gamma^{\varepsilon,\alpha}) \leq \varepsilon + \int c \, d\gamma^\star$. Now, by Proposition 3, the Poincaré inequality implies $W_2^2(\mu_j, \cdot) \leq 4 C_P \, \chi^2_{\mu_j}$ for any $j \in \{1, \dots, n\}$; using the assumption that $\varepsilon \leq \int c \, d\gamma^\star = \sigma^2$, we thus get
$$\sum \lambda_i W_2^2(\gamma_i^{\varepsilon,\alpha}, \mu_i) \leq 4 C_P \, R(\gamma^{\varepsilon,\alpha}) \leq \frac{4 C_P}{\alpha} \left( \varepsilon + \int c \, d\gamma^\star \right) \leq \frac{8 C_P \sigma^2}{\alpha},$$
which completes the proof. □

Proof of Lemma 5. (First inequality) By definition,
$$\int c \, d\gamma^{\varepsilon,\alpha} - \int c \, d\gamma^\star = F_\alpha(\gamma^{\varepsilon,\alpha}) - \alpha R(\gamma^{\varepsilon,\alpha}) - F_\alpha(\gamma^\star) \leq F_\alpha(\gamma^{\varepsilon,\alpha}) - F_\alpha(\gamma^\star),$$
and adding and subtracting $F_\alpha(\gamma^\alpha)$ yields
$$F_\alpha(\gamma^{\varepsilon,\alpha}) - F_\alpha(\gamma^\star) = \underbrace{\left\{ F_\alpha(\gamma^{\varepsilon,\alpha}) - F_\alpha(\gamma^\alpha) \right\}}_{\leq \varepsilon \text{ since } \gamma^{\varepsilon,\alpha} \in M^{\varepsilon,\alpha}} + \underbrace{\left\{ F_\alpha(\gamma^\alpha) - F_\alpha(\gamma^\star) \right\}}_{\leq 0 \text{ since } \gamma^\alpha \text{ minimizes } F_\alpha} \leq \varepsilon,$$


hence the inequality.

(Second inequality) By equality of the original and multimarginal formulations of the barycenter problem for $\overline{\gamma}^{\varepsilon,\alpha}$ and $\gamma^\star$, we have
$$\int c \, d\overline{\gamma}^{\varepsilon,\alpha} - \int c \, d\gamma^\star = \sum \lambda_i \left( W_2^2(\gamma_i^{\varepsilon,\alpha}, T_\# \overline{\gamma}^{\varepsilon,\alpha}) - W_2^2(\mu_i, T_\# \gamma^\star) \right) \leq \sum \lambda_i \left( W_2^2(\gamma_i^{\varepsilon,\alpha}, T_\# \gamma^\star) - W_2^2(\mu_i, T_\# \gamma^\star) \right),$$
where we used the suboptimality of $T_\# \gamma^\star$ for the marginals $\gamma_i^{\varepsilon,\alpha}$ to assert that $\sum \lambda_i W_2^2(\gamma_i^{\varepsilon,\alpha}, T_\# \overline{\gamma}^{\varepsilon,\alpha}) \leq \sum \lambda_i W_2^2(\gamma_i^{\varepsilon,\alpha}, T_\# \gamma^\star)$. Combining the triangle inequality with the identity $a^2 - b^2 = (a - b)(a + b)$ for all $a, b \in \mathbb{R}$ yields
$$\int c \, d\overline{\gamma}^{\varepsilon,\alpha} - \int c \, d\gamma^\star \leq \sum \lambda_i \left( W_2(\gamma_i^{\varepsilon,\alpha}, T_\# \gamma^\star) - W_2(\mu_i, T_\# \gamma^\star) \right) \left( W_2(\mu_i, T_\# \gamma^\star) + W_2(\gamma_i^{\varepsilon,\alpha}, T_\# \gamma^\star) \right)$$
$$\leq \sum \lambda_i W_2(\gamma_i^{\varepsilon,\alpha}, \mu_i) \left( W_2(\gamma_i^{\varepsilon,\alpha}, T_\# \gamma^\star) + W_2(\mu_i, T_\# \gamma^\star) \right) \leq \sum \lambda_i W_2(\gamma_i^{\varepsilon,\alpha}, \mu_i) \left( W_2(\mu_i, \gamma_i^{\varepsilon,\alpha}) + 2 W_2(\mu_i, T_\# \gamma^\star) \right).$$
Applying the Cauchy-Schwarz inequality, we get
$$\int c \, d\overline{\gamma}^{\varepsilon,\alpha} - \int c \, d\gamma^\star \leq \sum \lambda_i W_2^2(\gamma_i^{\varepsilon,\alpha}, \mu_i) + 2 \sqrt{\sum \lambda_i W_2^2(\gamma_i^{\varepsilon,\alpha}, \mu_i)} \sqrt{\sum \lambda_i W_2^2(\mu_i, T_\# \gamma^\star)}.$$
We conclude with Lemma 4, which yields
$$\int c \, d\overline{\gamma}^{\varepsilon,\alpha} - \int c \, d\gamma^\star \leq \frac{8 C_P \sigma^2}{\alpha} + 2 \frac{\sigma \sqrt{8 C_P}}{\sqrt{\alpha}} \, \sigma \leq \frac{C_2}{\sqrt{\alpha}}, \qquad (24)$$
using the assumptions $\varepsilon \leq \sigma^2$ and $\alpha \geq \sigma^2$.

(Third inequality) Adding and subtracting $\int c \, d\overline{\gamma}^{\varepsilon,\alpha}$ in (22), we get
$$\left\{ \int c \, d\gamma^{\varepsilon,\alpha} - \int c \, d\overline{\gamma}^{\varepsilon,\alpha} \right\} + \left\{ \int c \, d\overline{\gamma}^{\varepsilon,\alpha} - \int c \, d\gamma^\star \right\} \leq \varepsilon,$$
and, moving the second term of the left hand side to the right hand side,
$$\int c \, d\gamma^{\varepsilon,\alpha} - \int c \, d\overline{\gamma}^{\varepsilon,\alpha} \leq \varepsilon + \int c \, d\gamma^\star - \int c \, d\overline{\gamma}^{\varepsilon,\alpha}. \qquad (25)$$
Proceeding similarly as for (24), we obtain using Lemma 4 that
$$\int c \, d\gamma^\star - \int c \, d\overline{\gamma}^{\varepsilon,\alpha} \leq \sum \lambda_i W_2^2(\gamma_i^{\varepsilon,\alpha}, \mu_i) + 2 \sqrt{\sum \lambda_i W_2^2(\gamma_i^{\varepsilon,\alpha}, \mu_i)} \sqrt{\sum \lambda_i W_2^2(\gamma_i^{\varepsilon,\alpha}, T_\# \overline{\gamma}^{\varepsilon,\alpha})}.$$
However, by optimality of $\overline{\gamma}^{\varepsilon,\alpha}$, using inequality (24) and the assumption that $\alpha \geq \sigma^2$,
$$\sum \lambda_i W_2^2(\gamma_i^{\varepsilon,\alpha}, T_\# \overline{\gamma}^{\varepsilon,\alpha}) = \int c \, d\overline{\gamma}^{\varepsilon,\alpha} \leq \sigma^2 + \frac{8 C_P \sigma^2}{\alpha} + 2 \frac{\sigma \sqrt{8 C_P}}{\sqrt{\alpha}} \, \sigma \leq \left( \sigma + \sqrt{8 C_P} \right)^2.$$


Therefore, we proved
$$\int c \, d\gamma^\star - \int c \, d\overline{\gamma}^{\varepsilon,\alpha} \leq \frac{8 C_P \sigma^2}{\alpha} + 2 \frac{\sigma \sqrt{8 C_P}}{\sqrt{\alpha}} \left( \sigma + \sqrt{8 C_P} \right) \leq \frac{1}{\sqrt{\alpha}} \left( 4\sigma (2 C_P + \sqrt{2 C_P}) + 16 \sigma C_P \right) = \frac{4\sigma}{\sqrt{\alpha}} \left( 6 C_P + \sqrt{2 C_P} \right) \leq \frac{3 C_2}{\sqrt{\alpha}},$$
which concludes the proof together with (25). □

Proof of Lemma 6. Let $\mu \in \mathcal{P}_2(\mathbb{R}^d)$. From the variance inequality for $P$, we have
$$k W_2^2(\mu, b_P) \leq \int W_2^2(\mu, \nu) \, P(d\nu) - \int W_2^2(\nu, b_P) \, P(d\nu)$$
$$= \int W_2^2(\mu, \nu) \, (P - Q)(d\nu) + \int W_2^2(\nu, b_Q) \, Q(d\nu) - \int W_2^2(\nu, b_P) \, P(d\nu) + \int \left( W_2^2(\mu, \nu) - W_2^2(b_Q, \nu) \right) Q(d\nu)$$
$$\leq \int W_2^2(\mu, \nu) \, (P - Q)(d\nu) + \int W_2^2(\nu, b_P) \, (Q - P)(d\nu) + \int \left( W_2^2(\mu, \nu) - W_2^2(\nu, b_Q) \right) Q(d\nu), \qquad (26)$$
where we used the optimality of $b_Q$ as a barycenter for $Q$ in the last line. We start by bounding the first term in (26). Let $\Gamma$ be an optimal coupling between $P$ and $Q$. Then
$$\int W_2^2(\mu, \nu) \, (P - Q)(d\nu) = \int \left( W_2^2(\mu, \nu) - W_2^2(\mu, \zeta) \right) \Gamma(d\nu \, d\zeta) = \int \left( W_2(\mu, \nu) - W_2(\mu, \zeta) \right) \left( W_2(\mu, \nu) + W_2(\mu, \zeta) \right) \Gamma(d\nu \, d\zeta),$$
and, from the triangle inequality, we have $W_2(\mu, \nu) \leq W_2(\mu, \zeta) + W_2(\zeta, \nu)$, hence
$$\int W_2^2(\mu, \nu) \, (P - Q)(d\nu) \leq \int W_2(\nu, \zeta) \left( W_2(\mu, \nu) + W_2(\mu, \zeta) \right) \Gamma(d\nu \, d\zeta).$$
Recall the notation
$$\sigma^2(\mu; P) = \int W_2^2(\nu, \mu) \, P(d\nu) \quad \text{and} \quad \sigma^2(\mu; Q) = \int W_2^2(\nu, \mu) \, Q(d\nu).$$
Applying the Cauchy-Schwarz and Young inequalities,
$$\int W_2^2(\mu, \nu) \, (P - Q)(d\nu) \leq W_2(Q, P) \sqrt{2 \left( \sigma^2(\mu; P) + \sigma^2(\mu; Q) \right)}.$$
Now, since the variance inequality holds in the reverse direction with $k = 1$ in nonnegatively curved spaces (see Ahidar-Coutrix, Le Gouic, and Paris (2020, Theorem 3.2)),
$$\sigma^2(\mu; P) - \sigma^2(b_P; P) \leq W_2^2(\mu, b_P) \quad \text{and} \quad \sigma^2(\mu; Q) - \sigma^2(b_Q; Q) \leq W_2^2(\mu, b_Q),$$
whence
$$\int W_2^2(\nu, \mu) \, (P - Q)(d\nu) \leq W_2(Q, P) \sqrt{2 \left( \sigma^2(P) + \sigma^2(Q) + W_2^2(\mu, b_P) + W_2^2(\mu, b_Q) \right)}.$$


We follow similar steps to bound the second term of (26) and obtain
$$\int W_2^2(\nu, b_P) \, (Q - P)(d\nu) \leq W_2(Q, P) \sqrt{2 \left( \sigma^2(Q) + \sigma^2(P) + W_2^2(b_Q, b_P) \right)}.$$
Injecting these bounds into (26) and writing $A(\mu; Q, P)$ for the sum of the two square-root factors above, we have
$$k W_2^2(\mu, b_P) \leq \int \left( W_2^2(\mu, \nu) - W_2^2(b_Q, \nu) \right) Q(d\nu) + A(\mu; Q, P) \, W_2(Q, P). \qquad (27)$$
We conclude by applying (27) twice, with $\mu$ and with $b_Q$, and the triangle inequality, to get
$$\frac{k}{2} W_2^2(\mu, b_Q) \leq k W_2^2(b_Q, b_P) + k W_2^2(\mu, b_P)$$
$$\leq k W_2^2(b_Q, b_P) + A(\mu; Q, P) \, W_2(Q, P) + \int \left( W_2^2(\nu, \mu) - W_2^2(\nu, b_Q) \right) Q(d\nu)$$
$$\leq \int \left( W_2^2(\mu, \nu) - W_2^2(b_Q, \nu) \right) Q(d\nu) + W_2(Q, P) \left( A(b_Q; Q, P) + A(\mu; Q, P) \right). \qquad \square$$

Proof of Proposition 2. The proof follows the outline described above. Let $\gamma^{\varepsilon,\alpha} \in M^{\varepsilon,\alpha}$ and let $\overline{\gamma}^{\varepsilon,\alpha}$ be the barycentric coupling for $\gamma_1^{\varepsilon,\alpha}, \dots, \gamma_n^{\varepsilon,\alpha}$ with weights $\lambda_1, \dots, \lambda_n$. Recall that $C_2 = 4\sigma(2 C_P + \sqrt{2 C_P})$. From the triangle and Young inequalities, we have
$$\frac{1}{2} W_2^2(T_\# \gamma^\star, T_\# \gamma^{\varepsilon,\alpha}) \leq W_2^2(T_\# \gamma^\star, T_\# \overline{\gamma}^{\varepsilon,\alpha}) + W_2^2(T_\# \overline{\gamma}^{\varepsilon,\alpha}, T_\# \gamma^{\varepsilon,\alpha}). \qquad (28)$$

(Step 1. Bound on $W_2(T_\# \gamma^\star, T_\# \overline{\gamma}^{\varepsilon,\alpha})$) We start by bounding the distance between the barycenters of $\overline{\gamma}^{\varepsilon,\alpha}$ and $\gamma^\star$. Using the variance inequality for $P$, we have
$$k W_2^2(T_\# \gamma^\star, T_\# \overline{\gamma}^{\varepsilon,\alpha}) \leq \sum \lambda_i \left( W_2^2(\mu_i, T_\# \overline{\gamma}^{\varepsilon,\alpha}) - W_2^2(\mu_i, T_\# \gamma^\star) \right)$$
$$= \sum \lambda_i \left( W_2(\mu_i, T_\# \overline{\gamma}^{\varepsilon,\alpha}) - W_2(\mu_i, T_\# \gamma^\star) \right) \left( W_2(\mu_i, T_\# \overline{\gamma}^{\varepsilon,\alpha}) + W_2(\mu_i, T_\# \gamma^\star) \right)$$
and, by the triangle inequality applied to $W_2(\mu_i, T_\# \overline{\gamma}^{\varepsilon,\alpha})$ in both factors, we get
$$k W_2^2(T_\# \overline{\gamma}^{\varepsilon,\alpha}, T_\# \gamma^\star) \leq \sum \lambda_i \left( W_2(\mu_i, \gamma_i^{\varepsilon,\alpha}) + W_2(\gamma_i^{\varepsilon,\alpha}, T_\# \overline{\gamma}^{\varepsilon,\alpha}) - W_2(\mu_i, T_\# \gamma^\star) \right) \cdot \left( W_2(\mu_i, \gamma_i^{\varepsilon,\alpha}) + W_2(\gamma_i^{\varepsilon,\alpha}, T_\# \overline{\gamma}^{\varepsilon,\alpha}) + W_2(\mu_i, T_\# \gamma^\star) \right)$$
$$= \sum \lambda_i W_2^2(\mu_i, \gamma_i^{\varepsilon,\alpha}) + 2 \sum \lambda_i W_2(\mu_i, \gamma_i^{\varepsilon,\alpha}) \, W_2(\gamma_i^{\varepsilon,\alpha}, T_\# \overline{\gamma}^{\varepsilon,\alpha}) + \sum \lambda_i \left( W_2^2(\gamma_i^{\varepsilon,\alpha}, T_\# \overline{\gamma}^{\varepsilon,\alpha}) - W_2^2(\mu_i, T_\# \gamma^\star) \right). \qquad (29)$$
We now bound the second term of (29). By Cauchy-Schwarz, it satisfies
$$\sum \lambda_i W_2(\mu_i, \gamma_i^{\varepsilon,\alpha}) \, W_2(\gamma_i^{\varepsilon,\alpha}, T_\# \overline{\gamma}^{\varepsilon,\alpha}) \leq \sqrt{\sum \lambda_i W_2^2(\mu_i, \gamma_i^{\varepsilon,\alpha})} \sqrt{\sum \lambda_i W_2^2(\gamma_i^{\varepsilon,\alpha}, T_\# \overline{\gamma}^{\varepsilon,\alpha})}. \qquad (30)$$


Recall that $\sigma^2(P) = \sum \lambda_i W_2^2(\mu_i, T_\# \gamma^\star)$ denotes the variance of $P$. By optimality of $T_\# \overline{\gamma}^{\varepsilon,\alpha}$, using the triangle and Young inequalities,
$$\sum \lambda_i W_2^2(\gamma_i^{\varepsilon,\alpha}, T_\# \overline{\gamma}^{\varepsilon,\alpha}) \leq \sum \lambda_i W_2^2(\gamma_i^{\varepsilon,\alpha}, T_\# \gamma^\star) \leq 2 \sum \lambda_i \left( W_2^2(\gamma_i^{\varepsilon,\alpha}, \mu_i) + W_2^2(\mu_i, T_\# \gamma^\star) \right) \qquad (31)$$
$$\leq 2 \left( \frac{8 C_P \sigma^2}{\alpha} + \sigma^2 \right), \qquad (32)$$
where the last inequality follows from Lemma 4. Therefore, using Lemma 4 again and the assumption $\alpha \geq \sigma^2$,
$$\sum \lambda_i W_2(\mu_i, \gamma_i^{\varepsilon,\alpha}) \, W_2(\gamma_i^{\varepsilon,\alpha}, T_\# \overline{\gamma}^{\varepsilon,\alpha}) \leq \frac{\sigma \sqrt{16 C_P}}{\sqrt{\alpha}} \sqrt{8 C_P + \sigma^2}.$$
For the last term of (29), the multimarginal formulation for $\overline{\gamma}^{\varepsilon,\alpha}$ and $\gamma^\star$ shows that this sum equals $\int c \, d\overline{\gamma}^{\varepsilon,\alpha} - \int c \, d\gamma^\star$, which we control by Lemma 5. The first term of (29) is bounded by a direct application of Lemma 4. Therefore, we get from (29):
$$k W_2^2(T_\# \overline{\gamma}^{\varepsilon,\alpha}, T_\# \gamma^\star) \leq \frac{\sigma \sqrt{8 C_P}}{\sqrt{\alpha}} + \frac{2 \sigma \sqrt{16 C_P}}{\sqrt{\alpha}} \sqrt{8 C_P + \sigma^2} + \frac{C_2}{\sqrt{\alpha}} \approx \frac{C_P \sigma^2}{\sqrt{\alpha}}. \qquad (33)$$

(Step 2. Bound on $W_2(T_\# \gamma^{\varepsilon,\alpha}, T_\# \overline{\gamma}^{\varepsilon,\alpha})$) We bound $W_2(T_\# \gamma^{\varepsilon,\alpha}, T_\# \overline{\gamma}^{\varepsilon,\alpha})$ by applying Lemma 6 with $\mu = T_\# \gamma^{\varepsilon,\alpha}$ and
$$P := \sum \lambda_i \delta_{\mu_i} \quad \text{and} \quad Q := P^{\varepsilon,\alpha} := \sum \lambda_i \delta_{\gamma_i^{\varepsilon,\alpha}}.$$
Since $P$ satisfies a $k$-variance inequality by assumption, this gives us
$$\frac{k}{2} W_2^2(T_\# \gamma^{\varepsilon,\alpha}, T_\# \overline{\gamma}^{\varepsilon,\alpha}) \leq \sum \lambda_i \left( W_2^2(\gamma_i^{\varepsilon,\alpha}, T_\# \gamma^{\varepsilon,\alpha}) - W_2^2(\gamma_i^{\varepsilon,\alpha}, T_\# \overline{\gamma}^{\varepsilon,\alpha}) \right) + C(T_\# \gamma^{\varepsilon,\alpha}) \, W_2(P^{\varepsilon,\alpha}, P) \qquad (34)$$
where
$$C(T_\# \gamma^{\varepsilon,\alpha}) = 2 \sqrt{8 \left[ \sigma^2(P^{\varepsilon,\alpha}) + \sigma^2(P) + W_2^2(T_\# \gamma^\star, T_\# \overline{\gamma}^{\varepsilon,\alpha}) \right] + 2 W_2^2(T_\# \gamma^{\varepsilon,\alpha}, T_\# \overline{\gamma}^{\varepsilon,\alpha})}.$$

First, we bound $C(T_\# \gamma^{\varepsilon,\alpha})$. Using the multimarginal problem for $\overline{\gamma}^{\varepsilon,\alpha}$, the variance of $P^{\varepsilon,\alpha}$ is easily bounded with Lemma 5. Indeed, we have
$$\sigma^2(P^{\varepsilon,\alpha}) = \sum \lambda_i W_2^2(\gamma_i^{\varepsilon,\alpha}, T_\# \overline{\gamma}^{\varepsilon,\alpha}) = \int c \, d\overline{\gamma}^{\varepsilon,\alpha} \leq \sigma^2 + \frac{C_2}{\sqrt{\alpha}}.$$
Then we bound $W_2^2(T_\# \gamma^{\varepsilon,\alpha}, T_\# \overline{\gamma}^{\varepsilon,\alpha})$. Using again the triangle and Young inequalities, we get from Lemma 5
$$W_2^2(T_\# \gamma^{\varepsilon,\alpha}, T_\# \overline{\gamma}^{\varepsilon,\alpha}) \leq 2 \sum \lambda_i W_2^2(T_\# \gamma^{\varepsilon,\alpha}, \gamma_i^{\varepsilon,\alpha}) + 2 \sum \lambda_i W_2^2(\gamma_i^{\varepsilon,\alpha}, T_\# \overline{\gamma}^{\varepsilon,\alpha})$$
$$\leq 2 \int c \, d\gamma^{\varepsilon,\alpha} + 2 \int c \, d\overline{\gamma}^{\varepsilon,\alpha} \leq 2\varepsilon + \frac{6 C_2}{\sqrt{\alpha}} + 4 \sigma^2(P^{\varepsilon,\alpha}) \leq \frac{6 C_2}{\sqrt{\alpha}} + 6 \sigma^2 + \frac{4 C_2}{\sqrt{\alpha}}.$$


Since $\sigma^2 = \sigma^2(P) = \int c \, d\gamma^\star$, gathering our bounds and using (33) yields
$$C(T_\# \gamma^{\varepsilon,\alpha}) \leq 2 \sqrt{8 \left[ 2\sigma^2 + \frac{\sigma \sqrt{8 C_P}}{\sqrt{\alpha}} + \frac{2 \sigma \sqrt{16 C_P}}{\sqrt{\alpha}} \sqrt{8 C_P + \sigma^2} + \frac{C_2}{\sqrt{\alpha}} \right] + \frac{12 C_2}{\sqrt{\alpha}} + 12 \sigma^2 + \frac{8 C_2}{\sqrt{\alpha}}}$$
$$\leq 2 \sqrt{28 \sigma^2 + \frac{21 C_2}{\sqrt{\alpha}} + \frac{2 \sigma \sqrt{16 C_P}}{\sqrt{\alpha}} \sqrt{8 C_P + \sigma^2}} =: A_b \approx \sqrt{\frac{C_P}{\sqrt{\alpha}} + \sigma^2}. \qquad (35)$$

Remark that $W_2^2(P, P^{\varepsilon,\alpha}) \leq \sum \lambda_i W_2^2(\mu_i, \gamma_i^{\varepsilon,\alpha})$. Collecting our bounds, we can now apply (34) together with Lemma 5 to bound $W_2^2(T_\# \gamma^{\varepsilon,\alpha}, T_\# \overline{\gamma}^{\varepsilon,\alpha})$:
$$\frac{k}{2} W_2^2(T_\# \gamma^{\varepsilon,\alpha}, T_\# \overline{\gamma}^{\varepsilon,\alpha}) \leq \sum \lambda_i \left( W_2^2(\gamma_i^{\varepsilon,\alpha}, T_\# \gamma^{\varepsilon,\alpha}) - W_2^2(\gamma_i^{\varepsilon,\alpha}, T_\# \overline{\gamma}^{\varepsilon,\alpha}) \right) + A_b \, W_2(P, P^{\varepsilon,\alpha})$$
$$\leq \int c \, d\gamma^{\varepsilon,\alpha} - \int c \, d\overline{\gamma}^{\varepsilon,\alpha} + A_b \sqrt{\sum \lambda_i W_2^2(\gamma_i^{\varepsilon,\alpha}, \mu_i)}$$
$$\leq \varepsilon + \frac{1}{\sqrt{\alpha}} \left( 3 C_2 + \sigma A_b \sqrt{8 C_P} \right). \qquad (36)$$

(Step 3. Conclusion) We conclude by applying the bounds (36) and (33) to (28), obtaining
$$\frac{k}{2} W_2^2(T_\# \gamma^\star, T_\# \gamma^{\varepsilon,\alpha}) \leq 2\varepsilon + \frac{\sigma \sqrt{8 C_P}}{\sqrt{\alpha}} + \frac{2 \sigma \sqrt{16 C_P}}{\sqrt{\alpha}} \sqrt{8 C_P + \sigma^2} + \frac{C_2}{\sqrt{\alpha}} + \frac{2}{\sqrt{\alpha}} \left( 3 C_2 + \sigma A_b \sqrt{8 C_P} \right).$$
The proof is complete upon setting
$$2 C_1 = \sigma \sqrt{8 C_P} + 2 \sigma \sqrt{16 C_P (8 C_P + \sigma^2)} + C_2 + 2 \left( 3 C_2 + \sigma A_b \sqrt{8 C_P} \right). \qquad (37) \qquad \square$$
