TRANSCRIPT
Quantitative Stability of Optimal Transport Maps and Linearization of the 2-Wasserstein Space
Alex Delalande*,+ with Quentin Mérigot* and Frédéric Chazal+
*Laboratoire de Mathématiques d’Orsay, Université Paris-Sud
+DataShape team, INRIA Saclay
AISTATS 2020
Introduction
- Numerous problems involve the comparison of point clouds / probability measures (e.g. in astronomy, shape recognition, image processing, generative modelling, large-scale learning, etc.).
Figure 1: Point cloud comparison may appear in astronomy, 3D shape recognition or color transfer (from [Yu et al., 2012, Wu et al., 2015, Paty et al., 2019]).
- Wasserstein distances, provided by Optimal Transport (OT), are natural to perform these comparisons.
Wasserstein distances.
- X, Y compact and convex subsets of R^d.
- α, β probability measures on X, Y respectively.

  W_p^p(α, β) := min_π { ∫_{X×Y} ‖x − y‖^p dπ(x, y) | π ∈ U(α, β) },

where U(α, β) = { π ∈ P(X × Y) | (P_X)#π = α and (P_Y)#π = β }, with P_X(x, y) = x and P_Y(x, y) = y.
They enable geometric notions on measures, such as barycenters.
Introduction
- In practice: for α_n = (1/n) ∑_{i=1}^n δ_{x_i} and β_m = (1/m) ∑_{j=1}^m δ_{y_j}, computing W_p^p(α_n, β_m) corresponds to solving an LP problem.
Several algorithms, e.g.:

  Algorithm                     Complexity
  Network Simplex               O(n^3 log(n)^2)
  Auction                       O(n^3)
  Sinkhorn (τ-approximate OT)   O(n^2 log(n) τ^{-3})

[Peyré and Cuturi, 2019]
Solving a single OT problem =⇒ high computational cost.
To get the distance matrix of k point clouds, O(k^2) OT problems must be solved: this can be prohibitive for large values of k.
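For two uniform point clouds of the same size, the LP above reduces to an optimal assignment problem. A minimal sketch with NumPy/SciPy (the point clouds here are illustrative, not from the slides):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def w2_uniform(x, y):
    """W_2^2 between two uniform n-point clouds alpha_n and beta_n.

    For equal-size uniform discrete measures, the Kantorovich LP has an
    optimal solution supported on a permutation (Birkhoff's theorem),
    so it reduces to an assignment problem.
    """
    # Cost matrix C[i, j] = ||x_i - y_j||^2.
    C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    rows, cols = linear_sum_assignment(C)      # optimal permutation
    return C[rows, cols].mean()                # (1/n) * sum of matched costs

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 2))
y = x + np.array([3.0, 0.0])                   # same cloud, translated
print(np.sqrt(w2_uniform(x, y)))               # W2 equals the shift length, 3.0
```

`linear_sum_assignment` runs the Hungarian-type algorithm, matching the cubic-in-n complexities listed in the table above.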
Introduction
- In practice: solving numerous OT problems can be computationally prohibitive.
Wasserstein spaces are curved, hence nonlinear.
Figure 2: (P2,W2) is curved.
No "closed-form" notions of sum or mean in (P_p, W_p) (e.g. Wasserstein barycenters are defined as arg min’s).
In ML, it can be interesting to have a Hilbertian structure: this is impossible in Wasserstein spaces.
- Here: we propose a measure embedding into a Hilbert space that conserves some of the (unregularized) Wasserstein geometry.
Proposed workaround: Monge embedding
Let ρ be a fixed a.c. measure on X .
By [Brenier, 1991], for any µ ∈ P(Y),
  min_π { ∫_{X×Y} ‖x − y‖^2 dπ(x, y) | π ∈ U(ρ, µ) }
    = min_T { ∫_X ‖x − T(x)‖^2 dρ(x) | T : X → Y, T#ρ = µ },

where T#ρ is s.t. for all B ⊆ Y, T#ρ(B) = ρ(T^{-1}(B)).
[Figure: the reference measure ρ is transported onto µ by Tµ and onto ν_N = (1/N) ∑_i δ_{y_i} by T_{ν_N}.]
A solution Tµ always exists and is uniquely defined as the gradient Tµ = ∇φµ of a convex function φµ that minimizes the Kantorovich functional

  K(φµ) = ∫_X φµ dρ + ∫_Y ψµ dµ,

where ψµ = φµ* is the Legendre transform of φµ.
Definition (Monge embedding). For all µ ∈ P(Y), denote by Tµ the solution of Monge’s OT problem between ρ and µ for the squared Euclidean cost. The Monge embedding is the mapping

  P(Y) → L2(ρ, R^d), µ ↦ Tµ.
Remarks
- When µ is discrete: semi-discrete OT. Efficiently solved in low dimensions with second-order methods [Kitagawa et al., 2019] and in higher dimensions with stochastic optimization methods [Genevay et al., 2016].
- L2(ρ)-distance matrix of k point clouds =⇒ k OT problems to solve + O(k^2) distance computations on Hilbertian/Euclidean data.
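A rough way to play with the embedding numerically (not the paper's semi-discrete solver of [Kitagawa et al., 2019]): discretize ρ by n samples and approximate Tµ by the optimal assignment from those samples to an n-point discretization of µ. All names, measures and sample sizes below are illustrative:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def monge_embedding(rho_pts, mu_pts):
    """Approximate Monge embedding mu -> T_mu on n samples of rho.

    T_mu is approximated by the optimal assignment from n samples of the
    fixed reference measure rho to an n-point discretization of mu;
    the returned array stacks T_mu(x_1), ..., T_mu(x_n).
    """
    C = ((rho_pts[:, None, :] - mu_pts[None, :, :]) ** 2).sum(-1)
    _, cols = linear_sum_assignment(C)
    return mu_pts[cols]

def embedding_dist(T1, T2):
    """||T_mu - T_nu||_{L2(rho)}, Monte-Carlo average over the rho samples."""
    return np.sqrt(((T1 - T2) ** 2).sum(-1).mean())

rng = np.random.default_rng(0)
rho = rng.random((200, 2))            # samples of rho = uniform on [0, 1]^2
mu = rng.normal([0.2, 0.2], 0.05, size=(200, 2))
nu = rng.normal([0.8, 0.8], 0.05, size=(200, 2))
T_mu, T_nu = monge_embedding(rho, mu), monge_embedding(rho, nu)
print(embedding_dist(T_mu, T_nu))     # >= W2(mu, nu) by the lower bound below
```

Note that (Tµ, Tν)#ρ_n is an admissible coupling between the two empirical measures, so the lower bound W2 ≤ ‖Tµ − Tν‖ holds exactly even in this discretized sketch.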
Main result: bi-Hölder behavior of µ ↦ Tµ

  ∀ µ, ν ∈ P(Y),  W2(µ, ν) ≤ ‖Tµ − Tν‖L2(ρ) ≤ C_{d,X,Y} W2(µ, ν)^{2/15}.
Geometric interpretation of the Monge embedding

Monge embedding as a logarithm map (similar construction and interpretation in the Linear Optimal Transportation framework of [Wang et al., 2013]).
[Figure: µ and ν in (P(Y), W2) are mapped to Tµ and Tν in the tangent space L2(ρ) at ρ (with T_ρ = Id); W2(µ, ν) is compared to W2,ρ(µ, ν).]

where W2,ρ(µ, ν) := ‖Tµ − Tν‖L2(ρ).
  Riemannian geometry                                            Optimal transport
  Point x ∈ M                                                    µ ∈ P2(R^d)
  Geodesic distance d_g(x, y)                                    W2(µ, ν)
  Tangent space T_ρM                                             T_ρP2(R^d) ≈ L2(ρ)
  Inverse exponential map exp^{-1}(x) ∈ T_ρM                     Tµ ∈ L2(ρ)
  Distance in tangent space ‖exp^{-1}(x) − exp^{-1}(y)‖_{g(ρ)}   ‖Tµ − Tν‖L2(ρ)
What amount of the Wasserstein geometry is preserved by the embedding µ ↦ Tµ?
Immediate properties of the Monge embedding
µ ↦ Tµ is discriminative
- µ ↦ Tµ is injective (by definition of the push-forward operator).
- µ ↦ Tµ is reverse-Lipschitz: ‖Tµ − Tν‖L2(ρ) ≥ W2(µ, ν) (γ := (Tµ, Tν)#ρ defines an admissible coupling between µ and ν).
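The parenthetical coupling argument, spelled out:

```latex
\[
W_2^2(\mu,\nu)
\;\le\; \int_{\mathcal{Y}\times\mathcal{Y}} \|y - y'\|^2 \,\mathrm{d}\gamma(y, y')
\;=\; \int_{\mathcal{X}} \|T_\mu(x) - T_\nu(x)\|^2 \,\mathrm{d}\rho(x)
\;=\; \|T_\mu - T_\nu\|_{L^2(\rho)}^2,
\]
```

since γ := (Tµ, Tν)#ρ has marginals (Tµ)#ρ = µ and (Tν)#ρ = ν, i.e. γ ∈ U(µ, ν).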
µ ↦ Tµ is not better than 1/2-Hölder
- Take ρ := (1/π) Leb_{B(0,1)} on R^2 and µ_θ := (δ_{x_θ} + δ_{x_{θ+π}})/2 with x_θ = (cos(θ), sin(θ)). Then

    ‖T_{µ_θ} − T_{µ_{θ+δ}}‖^2_{L2(ρ)} ≥ Cδ while W2(µ_θ, µ_{θ+δ}) ≤ Cδ.
Theorem (1/2-Hölder continuity near a regular measure, similar to a result of [Gigli, 2011]). Let µ, ν ∈ P(Y) and assume that Tµ is K-Lipschitz. Then

  ‖Tµ − Tν‖L2(ρ) ≤ 2 √(M_X K) W1(µ, ν)^{1/2},

where M_X is s.t. X ⊂ B(0, M_X).
Very strong hypothesis on µ.
Theorem (General Hölder continuity as a corollary of [Berman, 2018], Proposition 3.4). If ρ ≡ 1 on X with |X| = 1, then for any measures µ and ν in P(Y),

  ‖Tµ − Tν‖L2(ρ) ≤ C_{d,X,Y} W1(µ, ν)^{1/((d+2)2^{d−1})}.
High dependence on the ambient dimension. Optimal exponent?
Stability of the Monge embedding
Theorem (Dimension-independent Hölder continuity). If ρ ≡ 1 on X convex with |X| = 1, then for any measures µ and ν in P(Y),

  ‖Tν − Tµ‖L2(X) ≤ C_{d,X,Y} W1(µ, ν)^{2/15}.

Remarks
- Hölder exponent independent of the ambient dimension.
- No hypothesis on µ and ν except that they are compactly supported.
- No regularization.
- Optimality: the best exponent belongs to [2/15, 1/2].
Proof ingredients:
- Discrete µ and ν (general case by density).
- A discrete Poincaré–Wirtinger inequality =⇒ local estimate of the strong convexity of the Kantorovich functional (non-trivial because of the lack of regularization). Gives:

    ‖ψν − ψµ‖L2(µ+ν) ≤ C W1(µ, ν)^{1/3}.

- An inverse Poincaré–Wirtinger inequality then gives:

    ‖Tν − Tµ‖L2(ρ) = ‖∇φν − ∇φµ‖L2(ρ) ≤ C W1(µ, ν)^{2/15}.
Numerical illustrations
- Let X = Y = [0, 1]^2, ρ ≡ 1 on X.
- Project L2(ρ, R^2) onto a finite-dimensional space.
Distance approximation: W2,ρ vs. W2
[Figure: scatter plots of ‖T_{µ_M} − T_{ν_N}‖L2(ρ) against W2(µ_M, ν_N), with the line y = x. W2,ρ vs. W2 between point clouds sampled from a Gaussian, a mixture of 4 Gaussians and a uniform distribution.]
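A small self-contained sketch of this comparison, using assignment-based OT on synthetic clipped-Gaussian clouds (the clouds and sizes are illustrative, not the figure's exact setup):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign(src, tgt):
    """Optimal assignment src -> tgt for the squared Euclidean cost."""
    C = ((src[:, None, :] - tgt[None, :, :]) ** 2).sum(-1)
    r, c = linear_sum_assignment(C)
    return c, C[r, c].mean()          # permutation and mean matched squared cost

rng = np.random.default_rng(1)
rho = rng.random((300, 2))            # discretization of rho = uniform on [0, 1]^2
for _ in range(3):
    mu = np.clip(rng.normal(rng.random(2), 0.1, size=(300, 2)), 0, 1)
    nu = np.clip(rng.normal(rng.random(2), 0.1, size=(300, 2)), 0, 1)
    T_mu = mu[assign(rho, mu)[0]]     # Monge embedding of mu (assignment-based)
    T_nu = nu[assign(rho, nu)[0]]     # Monge embedding of nu
    w2 = np.sqrt(assign(mu, nu)[1])   # W2(mu, nu)
    w2_rho = np.sqrt(((T_mu - T_nu) ** 2).sum(-1).mean())
    print(f"W2 = {w2:.3f}   W2,rho = {w2_rho:.3f}")   # always W2 <= W2,rho
```

Each printed pair sits on or above the diagonal y = x, as in the figure, since W2,ρ upper-bounds W2 by the reverse-Lipschitz property.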
Numerical illustrations
Wasserstein barycenter [Agueh and Carlier, 2011] approximation

Approximate argmin_µ ∑_{s=1}^S λ_s W2^2(µ, µ_s) with µ = (∑_{s=1}^S λ_s T_{µ_s})#ρ.

Barycenters of 4 point clouds. Weights (λ_s)_s are bilinear w.r.t. the corners of the square.
Push-forwards of the 20 centroids after clustering of the Monge embeddings of the MNISTtraining set.
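The barycenter recipe above, sketched with two synthetic Gaussian clouds and assignment-based embeddings (all names, weights and parameters are illustrative):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def monge_embedding(rho_pts, mu_pts):
    """Assignment-based approximation of the Monge embedding T_mu."""
    C = ((rho_pts[:, None, :] - mu_pts[None, :, :]) ** 2).sum(-1)
    return mu_pts[linear_sum_assignment(C)[1]]

rng = np.random.default_rng(2)
n = 200
rho = rng.random((n, 2))              # reference rho = uniform on [0, 1]^2
# Two input clouds and barycentric weights lambda_s.
clouds = [rng.normal([0.2, 0.5], 0.05, size=(n, 2)),
          rng.normal([0.8, 0.5], 0.05, size=(n, 2))]
weights = [0.5, 0.5]
# Approximate barycenter: push rho forward by the averaged embeddings,
# i.e. the point cloud (sum_s lambda_s T_{mu_s})(x_i), i = 1..n.
bary = sum(w * monge_embedding(rho, c) for w, c in zip(weights, clouds))
print(bary.mean(axis=0))              # centroid lands between the two clouds
```

The averaging happens in the linear space L2(ρ), so no arg min has to be solved: one OT problem per input measure, then a plain weighted sum.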
Conclusion
µ ↦ Tµ is a Hilbert space embedding that:
- Linearizes to some extent the 2-Wasserstein space.
- Is bi-Hölder continuous w.r.t. W2.
- Allows for the direct use of generic ML algorithms on measure data thanks to the linearity and Hilbertian structure of L2(ρ).

Future work:
- Non-compactly supported target measures.
- More general source measures.
- Statistical properties: concentration and sample complexity of the defined distances.
- Applications: compact encoding of Tµ ∈ L2(ρ) that scales well to high dimensions.
References I
Agueh, M. and Carlier, G. (2011). Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis, 43(2):904–924.

Berman, R. J. (2018). Convergence rates for discretized Monge–Ampère equations and quantitative stability of optimal transport. arXiv preprint arXiv:1803.00785.

Brenier, Y. (1991). Polar factorization and monotone rearrangement of vector-valued functions. Communications on Pure and Applied Mathematics, 44(4):375–417.

Genevay, A., Cuturi, M., Peyré, G., and Bach, F. (2016). Stochastic optimization for large-scale optimal transport. In Advances in Neural Information Processing Systems, pages 3440–3448.

Gigli, N. (2011). On Hölder continuity-in-time of the optimal transport map towards measures along a curve. Proceedings of the Edinburgh Mathematical Society, 54(2):401–409.
References II
Kitagawa, J., Mérigot, Q., and Thibert, B. (2019). Convergence of a Newton algorithm for semi-discrete optimal transport. Journal of the European Mathematical Society.

Paty, F.-P., d’Aspremont, A., and Cuturi, M. (2019). Regularity as regularization: Smooth and strongly convex Brenier potentials in optimal transport.

Peyré, G. and Cuturi, M. (2019). Computational optimal transport. Foundations and Trends in Machine Learning, 11(5-6):355–607.

Wang, W., Slepčev, D., Basu, S., Ozolek, J. A., and Rohde, G. K. (2013). A linear optimal transportation framework for quantifying and visualizing variations in sets of images. International Journal of Computer Vision, 101(2):254–269.

Wu, Z., Song, S., Khosla, A., Yu, F., Zhang, L., Tang, X., and Xiao, J. (2015). 3D ShapeNets: A deep representation for volumetric shapes. In CVPR, pages 1912–1920. IEEE Computer Society.
References III
Yu, L., Efstathiou, K., Isenberg, P., and Isenberg, T. (2012). Efficient structure-aware selection techniques for 3D point cloud visualizations with 2DOF input. IEEE Transactions on Visualization and Computer Graphics, 18(12):2245–2254.