wasserstein gan - postechmlg.postech.ac.kr/~readinglist/slides/20170313.pdf · wasserstein gan...

Wasserstein GAN

Juho Lee

Jan 23, 2017

Wasserstein GAN (WGAN)

- arXiv submission
- Martin Arjovsky, Soumith Chintala, and Leon Bottou
- A new GAN model that minimizes the Earth-Mover’s distance (Wasserstein-1 distance)
- Stabilizes GAN training, with far less mode collapse
- Provides meaningful learning curves that are useful for debugging

Towards principled methods for training generative adversarial networks

- ICLR 2017 (oral)
- Martin Arjovsky and Leon Bottou
- Why do updates get worse as the discriminator gets better?
- Why is GAN training massively unstable?
- The impact of the $-\log D(G(z))$ trick: does it still follow the JSD?

Learning probability distribution

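- Given a set of observations $\{x_i\}_{i=1}^n$, assume a model distribution $\mathbb{P}_\theta$ from a parametric family.
- Select a distance measure $\rho(\mathbb{P}_\theta, \mathbb{P}_r)$ between the model distribution and the real distribution $\mathbb{P}_r$.
- Convergence: as $t \to \infty$, $\theta_t \to \theta$, so $\mathbb{P}_{\theta_t} \to \mathbb{P}_\theta$, where $\rho(\mathbb{P}_r, \mathbb{P}_\theta) \to 0$.
- Desirable condition: the mapping $\theta \mapsto \rho(\mathbb{P}_r, \mathbb{P}_\theta)$ is continuous.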

Distances between probability distributions I

Let $(\mathcal{X}, \Sigma)$ be a measurable space, where $\mathcal{X}$ is a compact metric set and $\Sigma$ is its Borel σ-algebra.

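- The Total Variation (TV) distance

\[ \delta(\mathbb{P}_r, \mathbb{P}_\theta) = \sup_{A \in \Sigma} |\mathbb{P}_r(A) - \mathbb{P}_\theta(A)|. \]

- The Kullback-Leibler (KL) divergence

\[ KL(\mathbb{P}_r \,\|\, \mathbb{P}_\theta) = \int \log\!\left(\frac{P_r(x)}{P_\theta(x)}\right) P_r(x) \, d\mu(x), \]

where both $\mathbb{P}_r$ and $\mathbb{P}_\theta$ are assumed to be absolutely continuous, and therefore admit densities $P_r$ and $P_\theta$, with respect to the same measure $\mu$ on $\mathcal{X}$.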

Distances between probability distributions II

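- The Jensen-Shannon (JS) divergence

\[ JS(\mathbb{P}_r, \mathbb{P}_\theta) = \tfrac{1}{2} KL(\mathbb{P}_r \,\|\, \mathbb{P}_m) + \tfrac{1}{2} KL(\mathbb{P}_\theta \,\|\, \mathbb{P}_m), \]

where $\mathbb{P}_m := (\mathbb{P}_r + \mathbb{P}_\theta)/2$.

- The Earth-Mover’s (EM) distance or Wasserstein-1 distance

\[ W(\mathbb{P}_r, \mathbb{P}_\theta) = \inf_{\gamma \in \Pi(\mathbb{P}_r, \mathbb{P}_\theta)} \mathbb{E}_{(x,y) \sim \gamma}\big[\, |x - y| \,\big], \]

where $\Pi(\mathbb{P}_r, \mathbb{P}_\theta)$ denotes the set of all joint distributions $\gamma(x, y)$ whose marginals are respectively $\mathbb{P}_r$ and $\mathbb{P}_\theta$.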
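To make the four definitions concrete, here is a minimal NumPy sketch for distributions on a finite, sorted 1-D grid. The function names (`tv`, `kl`, `js`, `w1`) and the two point-mass distributions are illustrative assumptions, not part of the slides. In one dimension the Wasserstein-1 distance reduces to the area between the two CDFs, which is what `w1` uses.

```python
import numpy as np

def tv(p, q):
    """Total variation between two pmfs on the same finite support."""
    return 0.5 * np.abs(p - q).sum()

def kl(p, q):
    """KL(p || q); infinite when p puts mass where q has none."""
    mask = p > 0
    if np.any(q[mask] == 0):
        return np.inf
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js(p, q):
    """Jensen-Shannon divergence via the mixture m = (p + q) / 2."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def w1(p, q, grid):
    """Wasserstein-1 on a sorted 1-D grid: integral of |CDF_p - CDF_q|."""
    cdf_gap = np.abs(np.cumsum(p) - np.cumsum(q))[:-1]
    return float(np.sum(cdf_gap * np.diff(grid)))

grid = np.linspace(0.0, 1.0, 11)   # support points 0.0, 0.1, ..., 1.0
p = np.zeros(11); p[0] = 1.0       # point mass at 0.0
q = np.zeros(11); q[5] = 1.0       # point mass at 0.5
print(tv(p, q), kl(p, q), js(p, q), w1(p, q, grid))
```

For these two disjoint point masses, TV, KL, and JS are already at their maximal values, while only `w1` reflects how far apart the masses are.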

Distances between probability distributions III

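Example: let $Z \sim \mathrm{Unif}([0, 1])$, let $\mathbb{P}_0$ be the distribution of $(0, Z)$, and let $\mathbb{P}_\theta$ be the distribution of $(\theta, Z)$.

- $KL(\mathbb{P}_\theta \,\|\, \mathbb{P}_0) = \begin{cases} \infty & \text{if } \theta \neq 0, \\ 0 & \text{if } \theta = 0. \end{cases}$

- $JS(\mathbb{P}_0, \mathbb{P}_\theta) = \begin{cases} \log 2 & \text{if } \theta \neq 0, \\ 0 & \text{if } \theta = 0. \end{cases}$

- $\delta(\mathbb{P}_0, \mathbb{P}_\theta) = \begin{cases} 1 & \text{if } \theta \neq 0, \\ 0 & \text{if } \theta = 0. \end{cases}$

- $W(\mathbb{P}_0, \mathbb{P}_\theta) = |\theta|$.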
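A short worked check of these values for $\theta \neq 0$, assuming the usual reading of the example: $\mathbb{P}_0$ is supported on the segment $\{0\} \times [0, 1]$ and $\mathbb{P}_\theta$ on $\{\theta\} \times [0, 1]$, so the two supports are disjoint. The coupling used for the upper bound on $W$ is an illustrative choice.

```latex
% With P_m = (P_0 + P_theta)/2, the density of P_0 w.r.t. P_m equals 2 on supp(P_0),
% and likewise for P_theta.
\begin{align*}
JS(\mathbb{P}_0, \mathbb{P}_\theta)
  &= \tfrac12 \int \log\frac{d\mathbb{P}_0}{d\mathbb{P}_m}\, d\mathbb{P}_0
   + \tfrac12 \int \log\frac{d\mathbb{P}_\theta}{d\mathbb{P}_m}\, d\mathbb{P}_\theta
   = \tfrac12 \log 2 + \tfrac12 \log 2 = \log 2, \\
\delta(\mathbb{P}_0, \mathbb{P}_\theta)
  &\ge \bigl|\mathbb{P}_0(A) - \mathbb{P}_\theta(A)\bigr| = 1
   \quad \text{for } A = \{0\} \times [0, 1], \\
W(\mathbb{P}_0, \mathbb{P}_\theta)
  &\le \mathbb{E}\bigl[\,|(0, Z) - (\theta, Z)|\,\bigr] = |\theta|
   \quad \text{(coupling each } (0, z) \text{ with } (\theta, z)\text{)}, \\
W(\mathbb{P}_0, \mathbb{P}_\theta)
  &\ge |\theta|
   \quad \text{(any two points of the two supports are at least } |\theta| \text{ apart)}.
\end{align*}
```

Only $W$ varies continuously with $\theta$; the other three jump at $\theta = 0$.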

Instability of GAN I

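Original objective function:

\[ L(D, g_\theta) = \mathbb{E}_{x \sim \mathbb{P}_r}[\log D(x)] + \mathbb{E}_{x \sim \mathbb{P}_g}[\log(1 - D(x))]. \]

The optimal discriminator is

\[ D^*(x) = \frac{P_r(x)}{P_r(x) + P_g(x)}, \]

and

\[ L(D^*, g_\theta) = 2\,JS(\mathbb{P}_r, \mathbb{P}_g) - 2 \log 2. \]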
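The last identity follows by substituting the optimal discriminator into the objective. A brief derivation, writing $P_r$, $P_g$ for the densities and $P_m = (P_r + P_g)/2$ for the mixture (the same notation as in the JS definition):

```latex
\begin{align*}
L(D^*, g_\theta)
  &= \int P_r(x) \log\frac{P_r(x)}{P_r(x) + P_g(x)}\, d\mu(x)
   + \int P_g(x) \log\frac{P_g(x)}{P_r(x) + P_g(x)}\, d\mu(x) \\
  &= \int P_r(x) \log\frac{P_r(x)}{2 P_m(x)}\, d\mu(x)
   + \int P_g(x) \log\frac{P_g(x)}{2 P_m(x)}\, d\mu(x) \\
  &= KL(\mathbb{P}_r \,\|\, \mathbb{P}_m) - \log 2
   + KL(\mathbb{P}_g \,\|\, \mathbb{P}_m) - \log 2 \\
  &= 2\, JS(\mathbb{P}_r, \mathbb{P}_g) - 2 \log 2 .
\end{align*}
```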

Instability of GAN II

Theorem 1

Let $\mathbb{P}_r$ and $\mathbb{P}_g$ be two distributions whose supports are contained in two closed manifolds $\mathcal{M}$ and $\mathcal{P}$ that don’t perfectly align and don’t have full dimension. We further assume that $\mathbb{P}_r$ and $\mathbb{P}_g$ are continuous in their respective manifolds, meaning that if there is a set $A$ with measure 0 in $\mathcal{M}$, then $\mathbb{P}_r(A) = 0$ (and analogously for $\mathbb{P}_g$). Then, there exists an optimal discriminator $D^* : \mathcal{X} \to [0, 1]$ that has accuracy 1, and for almost any $x$ in $\mathcal{M} \cup \mathcal{P}$, $D^*$ is smooth in a neighbourhood of $x$ and $\nabla_x D^*(x) = 0$.

Instability of GAN III

Theorem 2

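(Vanishing gradients on the generator) Let $g_\theta : \mathcal{Z} \to \mathcal{X}$ be a differentiable function that induces a distribution $\mathbb{P}_g$. If some conditions are satisfied, $\|D - D^*\| < \epsilon$, and $\mathbb{E}_{z \sim p(z)}[\|J_\theta g_\theta(z)\|_2^2] \le M^2$, then

\[ \bigl\| \nabla_\theta \mathbb{E}_{z \sim p(z)}[\log(1 - D(g_\theta(z)))] \bigr\|_2 < M \frac{\epsilon}{1 - \epsilon}. \]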

Instability of GAN IV

Instability of GAN V

The − logD trick I

For the generator, instead of minimizing $\mathbb{E}_{z \sim p(z)}[\log(1 - D(g_\theta(z)))]$, minimize $\mathbb{E}_{z \sim p(z)}[-\log D(g_\theta(z))]$. This does not change the fixed points.
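A small numerical sketch of why the alternative loss is attractive in practice. It assumes the discriminator has the form $D = \sigma(a)$, where $a$ is its logit on a generated sample. This parametrization and the function names are illustrative assumptions, not taken from the slides. The code compares the derivative of each generator loss with respect to $a$ when the discriminator confidently rejects the sample ($a \ll 0$).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def grad_saturating(a):
    """d/da log(1 - sigmoid(a)) = -sigmoid(a)."""
    return -sigmoid(a)

def grad_non_saturating(a):
    """d/da [-log sigmoid(a)] = -(1 - sigmoid(a))."""
    return -(1.0 - sigmoid(a))

# a << 0 means the discriminator is confident the sample is fake.
for a in (-10.0, -5.0, 0.0):
    print(a, grad_saturating(a), grad_non_saturating(a))
```

As $a \to -\infty$ the gradient of the original loss vanishes, while the gradient of the $-\log D$ loss approaches $-1$, so the generator still receives a signal.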

Theorem 3

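Let $D^* = \frac{P_r}{P_r + P_{g_{\theta_0}}}$ be the optimal discriminator for a fixed $\theta = \theta_0$. Then

\[ \mathbb{E}_{z \sim p(z)}\bigl[ -\nabla_\theta \log D^*(g_\theta(z)) \big|_{\theta = \theta_0} \bigr] = \nabla_\theta \bigl[ KL(\mathbb{P}_{g_\theta} \,\|\, \mathbb{P}_r) - 2\, JS(\mathbb{P}_{g_\theta}, \mathbb{P}_r) \bigr] \big|_{\theta = \theta_0}. \]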

The − logD trick II

Theorem 4

(Under some conditions) $\mathbb{E}_{z \sim p(z)}[-\nabla_\theta \log D(g_\theta(z))]$ is a centered Cauchy distribution with infinite expectation and variance.

Why should we use Wasserstein distance I

Theorem 5

Let $\mathbb{P}_r$ be a fixed distribution over $\mathcal{X}$. Let $Z$ be a random variable over another space $\mathcal{Z}$. Let $g : \mathcal{Z} \times \mathbb{R}^d \to \mathcal{X}$ be a function, denoted $g_\theta(z)$. Let $\mathbb{P}_\theta$ denote the distribution of $g_\theta(Z)$. Then,

1. If $g$ is continuous in $\theta$, so is $W(\mathbb{P}_r, \mathbb{P}_\theta)$.

2. If $g$ is locally Lipschitz and satisfies regularity assumption 1, then $W(\mathbb{P}_r, \mathbb{P}_\theta)$ is continuous everywhere and differentiable almost everywhere.

3. Statements 1 and 2 are false for the Jensen-Shannon and KL divergences.

If we choose $g_\theta$ to be any feedforward neural network parametrized by $\theta$, and the prior $p(z)$ to satisfy $\mathbb{E}_{z \sim p(z)}[\|z\|] < \infty$, then regularity assumption 1 is satisfied.

Why should we use Wasserstein distance II

Theorem 6

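Let $\mathbb{P}$ be a distribution on a compact space $\mathcal{X}$ and $(\mathbb{P}_n)_{n \in \mathbb{N}}$ be a sequence of distributions on $\mathcal{X}$. Then, as $n \to \infty$,

1. $\delta(\mathbb{P}_n, \mathbb{P}) \to 0 \iff JS(\mathbb{P}_n, \mathbb{P}) \to 0$.

2. $W(\mathbb{P}_n, \mathbb{P}) \to 0 \iff \mathbb{P}_n \xrightarrow{D} \mathbb{P}$ (convergence in distribution).

3. $KL(\mathbb{P}_n \,\|\, \mathbb{P}) \to 0$ or $KL(\mathbb{P} \,\|\, \mathbb{P}_n) \to 0$ implies the statements in 1.

4. The statements in 1 imply the statements in 2.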
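A one-line example, not taken from the slides, showing that the implication in item 4 cannot be reversed. Take $\mathcal{X} = [0, 1]$ and point masses, where $\delta_a$ below denotes the Dirac measure at $a$ (not the TV distance $\delta(\cdot, \cdot)$):

```latex
\[
\mathbb{P}_n = \delta_{1/n}, \quad \mathbb{P} = \delta_0
\;\Longrightarrow\;
W(\mathbb{P}_n, \mathbb{P}) = \tfrac{1}{n} \to 0,
\qquad
\delta(\mathbb{P}_n, \mathbb{P}) = 1 \ \text{ for all } n .
\]
```

The sequence converges in the Wasserstein sense (and in distribution), yet it never converges in total variation or JS.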

Why should we use Wasserstein distance III

Approximating the Earth-Mover’s distance

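By the Kantorovich-Rubinstein duality [1],

\[ W(\mathbb{P}_r, \mathbb{P}_\theta) = \sup_{\|f\|_L \le 1} \mathbb{E}_{x \sim \mathbb{P}_r}[f(x)] - \mathbb{E}_{x \sim \mathbb{P}_\theta}[f(x)], \]

where the supremum is over all 1-Lipschitz functions $f : \mathcal{X} \to \mathbb{R}$. The 1-Lipschitz constraint can be replaced by a K-Lipschitz one, in which case the supremum equals $K \cdot W(\mathbb{P}_r, \mathbb{P}_\theta)$.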
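The duality can be checked numerically in one dimension. The sketch below assumes two hand-picked pmfs `p` (playing the role of the real distribution) and `q` (the model) on a small grid; these values and the function names are illustrative assumptions. The primal value uses the 1-D fact that Wasserstein-1 is the area between the CDFs, and the dual value uses a piecewise-linear 1-Lipschitz $f$ whose slope on each grid interval is the sign of the CDF gap.

```python
import numpy as np

def w1_primal(p, q, grid):
    """W1 on a sorted 1-D grid via the CDF formula: integral of |F_p - F_q|."""
    gap = (np.cumsum(p) - np.cumsum(q))[:-1]
    return float(np.sum(np.abs(gap) * np.diff(grid)))

def dual_potential(p, q, grid):
    """A 1-Lipschitz f whose slope on each interval is sign(F_q - F_p)."""
    gap = (np.cumsum(q) - np.cumsum(p))[:-1]
    steps = np.sign(gap) * np.diff(grid)
    return np.concatenate(([0.0], np.cumsum(steps)))

grid = np.linspace(0.0, 1.0, 6)
p = np.array([0.5, 0.2, 0.1, 0.1, 0.1, 0.0])  # mass concentrated on the left
q = np.array([0.0, 0.1, 0.1, 0.2, 0.2, 0.4])  # mass concentrated on the right
f = dual_potential(p, q, grid)
dual_value = float(np.dot(f, p - q))          # E_p[f] - E_q[f]
print(w1_primal(p, q, grid), dual_value)
```

Both numbers agree, which illustrates that the supremum in the dual is attained by a 1-Lipschitz function in this discrete setting.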

Theorem 7

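Let $\mathbb{P}_r$ be any distribution, and let $\mathbb{P}_\theta$ be the distribution of $g_\theta(Z)$, with $g$ satisfying assumption 1. Then, there exists a solution $f : \mathcal{X} \to \mathbb{R}$ to the problem

\[ \max_{\|f\|_L \le 1} \mathbb{E}_{x \sim \mathbb{P}_r}[f(x)] - \mathbb{E}_{x \sim \mathbb{P}_\theta}[f(x)] \]

and we have

\[ \nabla_\theta W(\mathbb{P}_r, \mathbb{P}_\theta) = -\mathbb{E}_{z \sim p(z)}[\nabla_\theta f(g_\theta(z))], \]

when both terms are well-defined.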

WGAN algorithm
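A minimal toy sketch of the alternating scheme that the dual formulation and the gradient formula of Theorem 7 suggest: several critic updates with weight clipping, then one generator update along $-\mathbb{E}_z[\nabla_\theta f(g_\theta(z))]$. Everything concrete here is an assumption chosen for illustration: 1-D real data drawn from N(3, 1), a generator $g_\theta(z) = \theta + z$, a linear critic $f_w(x) = w x$, plain gradient steps instead of an adaptive optimizer, and all hyperparameter values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (all assumptions): real data ~ N(3, 1), generator g_theta(z) = theta + z
# with z ~ N(0, 1), and a linear critic f_w(x) = w * x.
theta, w = 0.0, 0.0
clip, n_critic, batch = 1.0, 5, 64
lr_critic, lr_gen = 0.1, 0.05

for step in range(500):
    for _ in range(n_critic):
        x_real = rng.normal(3.0, 1.0, size=batch)
        x_fake = theta + rng.normal(0.0, 1.0, size=batch)
        # Gradient ascent on E[f_w(x_real)] - E[f_w(x_fake)] with respect to w.
        grad_w = x_real.mean() - x_fake.mean()
        # Clipping keeps |w| <= clip, so f_w stays clip-Lipschitz.
        w = float(np.clip(w + lr_critic * grad_w, -clip, clip))
    # Generator step: descend -E[f_w(g_theta(z))]; d/dtheta of w * (theta + z) is w.
    grad_theta = -w
    theta -= lr_gen * grad_theta

print(theta)
```

In this toy problem the critic weight saturates at the clipping bound while the generated mean is far from the real mean, which pushes `theta` toward 3; near the optimum the two updates oscillate around it rather than settling exactly.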

Experiments I

Experiments II

[1] C. Villani. Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften. Springer, Berlin, 2009.
