
Maximum Mean Discrepancy Gradient Flow

Michael Arbel (1), Anna Korba (1), Adil Salim (2), Arthur Gretton (1)

(1) Gatsby Computational Neuroscience Unit, UCL, London
(2) Visual Computing Center, KAUST, Saudi Arabia

Fields Institute, September 2020

Problem and Outline

Problem:
- Transport mass from a starting probability distribution to a target distribution.
- How? By finding a continuous path on the space of distributions that decreases some loss (Wasserstein gradient flows).
- This work: minimize the Maximum Mean Discrepancy (MMD) on the space of probability distributions.

Application: insights on the theoretical properties of some large neural networks, and alteration of the dynamics to improve convergence.

Outline

- Background and motivation
- Maximum Mean Discrepancy (Wasserstein) Gradient Flow
- Convergence properties of the MMD gradient flow
- A noise-injection algorithm for better convergence

Setting

Let P be the set of probability measures on Z ⊂ R^d with finite second moment:

    P = { µ ∈ P(Z) : ∫ ‖z‖² dµ(z) < ∞ }

The space P is endowed with the Wasserstein-2 distance from optimal transport:

    W_2²(ν, µ) = inf_{π ∈ Π(ν,µ)} ∫_{Z×Z} ‖z − z′‖² dπ(z, z′)   for all ν, µ ∈ P,

where Π(ν, µ) is the set of possible couplings between ν and µ. In other words, Π(ν, µ) contains all distributions π on Z × Z such that if (Z, Z′) ∼ π then Z ∼ ν and Z′ ∼ µ.
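As a concrete aside (not on the slides): when ν and µ are uniform empirical measures with the same number of points, the infimum over couplings is attained at a permutation, so W_2² reduces to an assignment problem. A minimal sketch, assuming SciPy is available:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def w2_squared_empirical(x, y):
    # W_2^2 between two uniform empirical measures (1/n) sum_i delta_{x_i} and (1/n) sum_j delta_{y_j}.
    # With equal uniform weights, an optimal coupling pi is a permutation matrix (assignment problem).
    cost = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)  # pairwise squared distances ||x_i - y_j||^2
    rows, cols = linear_sum_assignment(cost)                      # optimal assignment
    return cost[rows, cols].mean()                                # average transport cost under that coupling

x = np.random.randn(100, 2)        # sample of nu
y = np.random.randn(100, 2) + 1.0  # sample of mu
print(w2_squared_empirical(x, y))
```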


Maximum Mean Discrepancy

- Let k : Z × Z → R be a positive semi-definite kernel, k(z, z′) = ⟨φ(z), φ(z′)⟩_H, with feature map φ : Z → H.
- Let H be its RKHS (Reproducing Kernel Hilbert Space).

Recall: H is a Hilbert space with inner product ⟨·, ·⟩_H and norm ‖·‖_H. It satisfies the reproducing property:

    ∀ f ∈ H, z ∈ Z,   f(z) = ⟨f, k(z, ·)⟩_H

Assume µ ↦ ∫ k(z, ·) dµ(z) is injective (k is characteristic).

The Maximum Mean Discrepancy [Gretton et al., 2012] defines a distance on P (probability distributions on Z):

    MMD(µ, ν) = ‖f_{µ,ν}‖_H,   where   f_{µ,ν}(·) = ∫ k(z, ·) dµ(z) − ∫ k(z, ·) dν(z)

is the "witness function".


MMD functional

For a fixed target distribution ν* and any ν ∈ P:

    F(ν) = ½ MMD²(ν, ν*)
         = ½ ‖f_{ν,ν*}‖²_H
         = ½ ∫ k(z, z′) dν(z) dν(z′) + ½ ∫ k(z, z′) dν*(z) dν*(z′) − ∫ k(z, z′) dν(z) dν*(z′)

Proof: use the reproducing property with f_{ν,ν*}(·) = ∫ k(z, ·) dν(z) − ∫ k(z, ·) dν*(z).

This functional appears as a loss when optimizing some large neural networks.
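As an illustration, the expansion above gives a plug-in estimator of MMD² from samples of ν and ν*. A minimal NumPy sketch with a Gaussian kernel (the kernel and bandwidth are illustrative choices, not fixed by the slides):

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    # k(z, z') = exp(-||z - z'||^2 / (2 sigma^2)), evaluated for all pairs of rows of a and b
    sq_dists = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2.0 * a @ b.T
    return np.exp(-sq_dists / (2.0 * sigma**2))

def mmd_squared(z_nu, z_target, sigma=1.0):
    # Plug-in (V-statistic) estimate of MMD^2(nu, nu*) from samples z_nu ~ nu and z_target ~ nu*:
    # E_{nu,nu}[k] + E_{nu*,nu*}[k] - 2 E_{nu,nu*}[k]
    k_xx = gaussian_kernel(z_nu, z_nu, sigma).mean()
    k_yy = gaussian_kernel(z_target, z_target, sigma).mean()
    k_xy = gaussian_kernel(z_nu, z_target, sigma).mean()
    return k_xx + k_yy - 2.0 * k_xy

z_nu = np.random.randn(200, 3)
z_target = np.random.randn(200, 3) + 0.5
print(mmd_squared(z_nu, z_target))
```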

Regression with 1-hidden layer NN

Consider the following regression problem:

- φ_{Z_i}(x) = w_i g(x, θ_i), with Z_i = (w_i, θ_i) ∈ R × R^d; φ_{Z_i} is the non-linearity of one hidden unit.
- Example: φ_Z(x) = w g(ax + b), where g : R → R is e.g. the sigmoid g(z) = 1/(1 + e^{−z}) or the ReLU g(z) = max(0, z).
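A minimal sketch of this parametrization, assuming a ReLU non-linearity, no bias term, and scalar outputs (all illustrative choices): one particle Z = (w, θ) defines a hidden unit φ_Z(x) = w g(θᵀx), and the N-unit network output is the average of φ_Z over the particles, i.e. the mean of φ under the empirical measure (1/N) Σ_i δ_{Z_i}.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def phi(Z, x):
    # One hidden unit: phi_Z(x) = w * g(theta . x), with Z = (w, theta) in R x R^d (bias omitted).
    w, theta = Z[0], Z[1:]
    return w * relu(theta @ x)

def network(particles, x):
    # 1-hidden-layer network seen as an average over its N units (particles) Z_1, ..., Z_N.
    return np.mean([phi(Z, x) for Z in particles])

d = 4
particles = np.random.randn(100, d + 1)  # row i is Z_i = (w_i, theta_i)
x = np.random.randn(d)
print(network(particles, x))
```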


Regression with 1-hidden layer NN

Finite-dimensional non-convex optimization (regression setting):

    min_{Z_1, ..., Z_N ∈ Z}  F( (1/N) Σ_{i=1}^N δ_{Z_i} )

- Optimization using gradient descent (GD):

    Z_i^{t+1} = Z_i^t − γ ∇_{Z_i} F( (1/N) Σ_{i=1}^N δ_{Z_i^t} )

- Hard to describe the dynamics of GD!
- Idea: look at the distribution of the Z_i's.


Infinite width regime

Infinite-dimensional convex optimization [Chizat and Bach, 2018], [Mei et al., 2018]:

    min_{Z_1, ..., Z_N ∈ Z}  F( (1/N) Σ_{i=1}^N δ_{Z_i} )   −→   min_{ν ∈ P} F(ν)   as N → ∞

Minimization of the MMD: the well-specified case

Assume ∃ ν* ∈ P such that E[y | X = x] = E_{Z∼ν*}[φ_Z(x)]. Then (up to an additive constant that does not depend on ν):

    min_{ν ∈ P}  E[ ‖y − E_{Z∼ν}[φ_Z(x)]‖² ]
⇔  min_{ν ∈ P}  E[ ‖E_{Z∼ν*}[φ_Z(x)] − E_{Z∼ν}[φ_Z(x)]‖² ]
⇔  min_{ν ∈ P}  E_{Z∼ν*, Z′∼ν*}[k(Z, Z′)] + E_{Z∼ν, Z′∼ν}[k(Z, Z′)] − 2 E_{Z∼ν*, Z′∼ν}[k(Z, Z′)],   with k(Z, Z′) = E_{x∼data}[φ_Z(x)ᵀ φ_{Z′}(x)]
⇔  min_{ν ∈ P}  MMD²(ν, ν*)

Optimizing the parameters of a NN ⇔ minimizing the MMD on P, in the population limit (N → ∞).
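The parameter-space kernel k(Z, Z′) = E_{x∼data}[φ_Z(x)ᵀ φ_{Z′}(x)] appearing above can be estimated by Monte Carlo over a data sample. A minimal sketch (the ReLU feature and the synthetic data are illustrative assumptions):

```python
import numpy as np

def phi_batch(Z, X):
    # phi_Z(x) = w * max(0, theta . x), evaluated for every row x of X (shape: n x d).
    w, theta = Z[0], Z[1:]
    return w * np.maximum(0.0, X @ theta)

def induced_kernel(Z, Zp, X):
    # k(Z, Z') = E_{x ~ data}[phi_Z(x) * phi_{Z'}(x)], estimated on the sample X.
    return np.mean(phi_batch(Z, X) * phi_batch(Zp, X))

d = 4
X = np.random.randn(1000, d)                      # stand-in for samples from the data distribution
Z, Zp = np.random.randn(d + 1), np.random.randn(d + 1)
print(induced_kernel(Z, Zp, X))
```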


Maximum Mean Discrepancy (Wasserstein) Gradient Flow

We consider

    min_{ν ∈ P} F(ν),   where F(ν) = ½ MMD²(ν, ν*).

- Gradient descent dynamics in this setting takes the form of a PDE (a gradient flow on P):

    ∂νt/∂t = div(νt ∇F′(νt)) = div(νt ∇f_{νt,ν*}),

  where F′(νt) = f_{νt,ν*} is the first variation of F at νt. The flow can be obtained as the limit when τ → 0 of the JKO scheme [Jordan et al., 1998]:

    ν_{n+1} = argmin_{ν ∈ P}  F(ν) + (1/2τ) W_2²(ν, ν_n)

- It describes the density of particles following a McKean-Vlasov dynamic:

    dZt/dt = −∇f_{νt,ν*}(Zt),   Zt ∼ νt,

  where ∇f_{νt,ν*}(Zt) = ∫ ∇₂k(Z, Zt) dνt(Z) − ∫ ∇₂k(Z, Zt) dν*(Z).

Example: Student-Teacher network

Satisfies the "well-specified" assumption! (∃ ν*, E[y | X = x] = E_{Z∼ν*}[φ_Z(x)])

- The output of the Teacher network is deterministic and given by y = ∫ φ_Z(x) dν*(Z), where ν* = (1/M) Σ_{m=1}^M δ_{U_m}.
- The Student network, parametrized by ν0 = (1/N) Σ_{n=1}^N δ_{Z_0^n}, tries to learn the mapping x ↦ ∫ φ_Z(x) dν*(Z).

Gradient descent on each parameter n ∈ {1, ..., N}:

    z_{t+1}^n = z_t^n − γ E_{x∼data}[ ( (1/N) Σ_{n′=1}^N φ_{z_t^{n′}}(x) − (1/M) Σ_{m=1}^M φ_{u_m}(x) ) ∇_{z_t^n} φ_{z_t^n}(x) ]

Re-arranging terms and recalling that k(Z, U) = E_{x∼data}[φ_Z(x)ᵀ φ_U(x)], the update becomes:

    z_{t+1}^n = z_t^n − γ ( (1/N) Σ_{n′=1}^N ∇₂k(z_t^{n′}, z_t^n) − (1/M) Σ_{m=1}^M ∇₂k(u_m, z_t^n) )
              = z_t^n − γ ∇f_{νt,ν*}(z_t^n),

where νt = (1/N) Σ_{n=1}^N δ_{z_t^n}. The above equation is a time-discretized version of the gradient flow of the MMD.
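To make the particle picture concrete, here is a minimal NumPy sketch of this time-discretized MMD flow on a Student-Teacher toy problem. It uses a Gaussian kernel on parameter space as a stand-in for the data-induced kernel k(Z, U) above; the kernel choice, dimensions, and step size are illustrative assumptions rather than the talk's exact setup.

```python
import numpy as np

def grad2_gaussian_k(z, z_eval, sigma=1.0):
    # Gradient of k(z, z') = exp(-||z - z'||^2 / (2 sigma^2)) with respect to its second argument z'.
    k = np.exp(-np.sum((z - z_eval) ** 2) / (2.0 * sigma ** 2))
    return k * (z - z_eval) / sigma ** 2

def witness_grad(z_eval, students, teachers, sigma=1.0):
    # grad f_{nu_t, nu*}(z_eval) = (1/N) sum_n grad_2 k(z^n, z_eval) - (1/M) sum_m grad_2 k(u^m, z_eval)
    g = np.mean([grad2_gaussian_k(z, z_eval, sigma) for z in students], axis=0)
    g -= np.mean([grad2_gaussian_k(u, z_eval, sigma) for u in teachers], axis=0)
    return g

rng = np.random.default_rng(0)
teachers = rng.normal(loc=2.0, size=(5, 3))     # U_1, ..., U_M: fixed Teacher particles (nu*)
students = rng.normal(loc=-2.0, size=(50, 3))   # Z_0^1, ..., Z_0^N: Student particles (nu_0)
gamma = 0.5

for t in range(500):
    # One step of the discretized MMD flow: z^n <- z^n - gamma * grad f_{nu_t, nu*}(z^n)
    grads = np.array([witness_grad(z, students, teachers) for z in students])
    students = students - gamma * grads
```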

Convergence properties of the MMD gradient flow

Convergence of the MMD GF - Convexity

What happens to F(νt) along the MMD flow ∂νt/∂t = div(νt ∇f_{νt,ν*})?

- A functional F is λ-geodesically convex if it is convex along W2 geodesics, i.e. if for any t ∈ [0, 1]:

    F(ρ(t)) ≤ (1 − t) F(ρ(0)) + t F(ρ(1)) − (λ/2) t(1 − t) W_2²(ρ(0), ρ(1)),

  where ρ(t) = ((1 − t) I + t T_{ρ(0)}^{ρ(1)})_# ρ(0) is the displacement geodesic between ρ(0) and ρ(1).

- If F is λ-convex with λ > 0, all gradient flows of F converge to the unique minimizer of F [Carrillo et al., 2006].

Our finding: the MMD is not λ-convex with λ > 0 in general.


Convergence of the MMD GF - a Lojasiewicz inequality

    dF(νt)/dt ≤ −C F(νt)²

Applying Gronwall's lemma results in F(νt) = O(1/t).

- On the right, the RKHS norm: F(νt) = ½ ‖f_{νt,ν*}‖²_H.
- On the left, the weighted Sobolev semi-norm:

    dF(νt)/dt = −∫ ‖∇f_{νt,ν*}(x)‖² dνt(x) = −‖f_{νt,ν*}‖²_{Ḣ(νt)},

  since ∂νt/∂t = div(νt ∇f_{νt,ν*}) (dissipation of the MMD along the MMD flow).
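For completeness, here is the Gronwall / ODE-comparison step written out (a short sketch, assuming the differential inequality above holds with a constant C > 0 and F(ν0) > 0):

```latex
% From  a'(t) <= -C a(t)^2  with  a(t) = F(nu_t) > 0  to an O(1/t) rate.
\begin{align*}
  \frac{d}{dt}\,\frac{1}{a(t)} = -\frac{a'(t)}{a(t)^2} \ge C
    &\qquad \text{(divide the inequality by } a(t)^2 > 0\text{)}\\
  \frac{1}{a(t)} \ge \frac{1}{a(0)} + C\,t
    &\qquad \text{(integrate from } 0 \text{ to } t\text{)}\\
  F(\nu_t) = a(t) \le \frac{1}{a(0)^{-1} + C\,t} = O(1/t).
\end{align*}
```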


A criterion for convergence

It can be shown that:

    ‖f_{νt,ν*}‖²_H ≤ ‖f_{νt,ν*}‖_{Ḣ(νt)} ‖ν* − νt‖_{Ḣ⁻¹(νt)},

where on the r.h.s. we have the dual norm of Ḣ(νt):

    ‖νt − ν*‖_{Ḣ⁻¹(νt)} = sup { |E_{Z∼νt}[g(Z)] − E_{U∼ν*}[g(U)]| :  g such that E_{Z∼νt}[‖∇g(Z)‖²] ≤ 1 }

Assume that ‖νt − ν*‖_{Ḣ⁻¹(νt)} ≤ C for all t. Then

    MMD²(νt, ν*) ≤ 1 / ( MMD⁻²(ν0, ν*) + 2 C⁻² t ) = O(1/t)

Problem: the assumption depends on the whole sequence (νt); it is hard to verify in general [Peyre, 2018]; and we have seen failure cases in practice.


Convergence: Failure case

[Four slides of figures illustrating a failure case; no accompanying text.]

Convergence issues

- The condition we exhibited for global convergence may not hold, and (F(νt))_t might get stuck at a local minimum. Indeed,

    dF(νt)/dt = −∫ ‖∇f_{νt,ν*}(x)‖² dνt(x),   so at equilibrium   ∫ ‖∇f_{ν∞,ν*}(x)‖² dν∞(x) = 0.

  If ν∞ is positive everywhere, this implies f_{ν∞,ν*} = constant = 0¹, but ν∞ might be singular...

- Idea: evaluate ∇f_{νt,ν*} outside of the support of νt to get a better signal!

¹ as soon as the RKHS H does not contain non-zero constant functions, e.g. for a Gaussian kernel.

A noise-injection algorithm for better convergence

Noise injection

- At each iteration n, sample ξn ∼ N(0, I); βn is the noise level:

    Z_{n+1} = Z_n − γ ∇f_{νn,ν*}(Z_n + βn ξn)

- Similar to randomized smoothing², but extended to interacting particles.
- Different from adding the noise outside the gradient ("diffusion"),

    Z_{n+1} = Z_n − γ ∇f_{νn,ν*}(Z_n) + βn ξn,

  which corresponds to an entropic regularization of the original loss³.

² [Duchi et al., 2012]
³ [Mei et al., 2018]
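A minimal sketch contrasting the two updates, reusing a Gaussian-kernel witness gradient as in the earlier Student-Teacher sketch (the kernel, step size, and noise level are illustrative assumptions):

```python
import numpy as np

def witness_grad(z_eval, students, teachers, sigma=1.0):
    # grad f_{nu_n, nu*}(z_eval) for a Gaussian kernel on parameter space (illustrative choice).
    def grad2_k(z, zp):
        k = np.exp(-np.sum((z - zp) ** 2) / (2.0 * sigma ** 2))
        return k * (z - zp) / sigma ** 2
    g = np.mean([grad2_k(z, z_eval) for z in students], axis=0)
    g -= np.mean([grad2_k(u, z_eval) for u in teachers], axis=0)
    return g

def step_noise_injection(students, teachers, gamma, beta, rng):
    # Noise injection: evaluate the witness gradient at the *perturbed* positions Z_n + beta * xi_n.
    xi = rng.standard_normal(students.shape)
    grads = np.array([witness_grad(z, students, teachers) for z in students + beta * xi])
    return students - gamma * grads

def step_diffusion(students, teachers, gamma, beta, rng):
    # Diffusion: evaluate the gradient at Z_n, then add the noise outside the gradient.
    xi = rng.standard_normal(students.shape)
    grads = np.array([witness_grad(z, students, teachers) for z in students])
    return students - gamma * grads + beta * xi

rng = np.random.default_rng(0)
teachers = rng.normal(loc=2.0, size=(5, 3))
students = rng.normal(loc=-2.0, size=(50, 3))
for n in range(500):
    students = step_noise_injection(students, teachers, gamma=0.5, beta=1.0, rng=rng)
```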


Noise Injection: Theory (discrete time)

Tradeoff for the level of noise βn:

- Too large βn: νn+1 no longer gives a descent direction, i.e. F(νn+1) > F(νn). To prevent this, require

    βn² F(νn) ≤ C_k E_{Zn∼νn, ξn∼N(0,I)}[ ‖∇f_{νn,ν*}(Zn + βn ξn)‖² ]   (1)

- Too small βn: back to the failure mode, ∇f_{νn,ν*}(Zn + βn ξn) ≈ 0. To keep enough noise, require

    Σ_{n=1}^N βn² → ∞   (2)

Under (1) and (2):

    F(νN) ≤ F(ν0) exp( −C_k γ (1 − γ C′_k) Σ_{n=1}^N βn² )


Noise Injection: Experiments (Gaussians)

[Figure only.]

Noise Injection: Experiments (Student-Teacher)

Methods compared:
- SGD
- SGD + noise injection
- SGD + diffusion
- KSD⁴: SGD using a (regularized) negative Sobolev distance as a loss function; it also decreases the MMD.

⁴ "Kernel Sobolev Descent" [Mroueh et al., 2019]

Noise Injection: Experiments

[Figure: test error per epoch (log scale, up to 8×10³ epochs) in dimension 50, comparing SGD, SGD + noise, SGD + diffusion, and the KSD flow.]

Noise Injection: Experiments

[Figure: sensitivity to the noise level (test error vs. noise level, log-log axes), comparing SGD, SGD + noise, SGD + diffusion, and the KSD flow.]

Contributions and openings

- Provided a convergence criterion for the Wasserstein gradient flow of the MMD.
- Proposed a perturbation of the dynamics with a noise injection and showed its effectiveness on simple examples.
- New insights for training large neural networks.

Openings:
- Control of the weighted negative Sobolev norm?
- Stronger guarantees of convergence for the noise-injection algorithm.

References

Carrillo, J. A., McCann, R. J., and Villani, C. (2006). Contractions in the 2-Wasserstein length space and thermalization of granular media. Archive for Rational Mechanics and Analysis, 179(2):217-263.

Chizat, L. and Bach, F. (2018). On the global convergence of gradient descent for over-parameterized models using optimal transport. NIPS.

Duchi, J. C., Bartlett, P. L., and Wainwright, M. J. (2012). Randomized smoothing for stochastic optimization. SIAM Journal on Optimization, 22(2):674-701.

Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. J. (2012). A kernel two-sample test. JMLR, 13.

Jordan, R., Kinderlehrer, D., and Otto, F. (1998). The variational formulation of the Fokker-Planck equation. SIAM Journal on Mathematical Analysis, 29(1):1-17.

Mei, S., Montanari, A., and Nguyen, P.-M. (2018). A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665-E7671.

Mroueh, Y., Sercu, T., and Raj, A. (2019). Sobolev descent. In AISTATS.

Peyre, R. (2018). Comparison between W2 distance and H^{-1} norm, and localization of Wasserstein distance. ESAIM: Control, Optimisation and Calculus of Variations, 24(4):1489-1501.