Wasserstein GAN
Juho Lee
Jan 23, 2017
Wasserstein GAN (WGAN)
- arXiv submission
- Martin Arjovsky, Soumith Chintala, and Leon Bottou
- A new GAN model minimizing the Earth-Mover’s distance (Wasserstein-1 distance)
- Stabilizes GAN training, with far less mode collapse
- Provides meaningful learning curves that are useful for debugging
Towards principled methods for training generative adversarial networks
- ICLR 2017 (oral)
- Martin Arjovsky and Leon Bottou
- Why do updates get worse as the discriminator gets better?
- Why is GAN training massively unstable?
- The impact of the $-\log D(G(z))$ trick: does it still follow the JSD?
Learning probability distribution
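- Given a set of observations $\{x_i\}_{i=1}^n$, assume a model distribution $\mathbb{P}_\theta$ from a parametric family.
- Select a distance measure between the model distribution and the real distribution $\mathbb{P}_r$: $\rho(\mathbb{P}_\theta, \mathbb{P}_r)$.
- Convergence: as $t \to \infty$, $\theta_t \to \theta$, so $\mathbb{P}_{\theta_t} \to \mathbb{P}_\theta$, where $\rho(\mathbb{P}_r, \mathbb{P}_\theta) \to 0$.
- Desirable condition: the mapping $\theta \mapsto \rho(\mathbb{P}_r, \mathbb{P}_\theta)$ is continuous.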
Distances between probability distributions I
Let $(\mathcal{X}, \Sigma)$ be a measurable space, where $\mathcal{X}$ is a compact metric set and $\Sigma$ is its Borel $\sigma$-algebra.
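- The Total Variation (TV) distance
$$\delta(\mathbb{P}_r, \mathbb{P}_\theta) = \sup_{A \in \Sigma} |\mathbb{P}_r(A) - \mathbb{P}_\theta(A)|.$$
- The Kullback-Leibler (KL) divergence
$$KL(\mathbb{P}_r \| \mathbb{P}_\theta) = \int \log\frac{P_r(x)}{P_\theta(x)} \, P_r(x) \, d\mu(x),$$
where both $\mathbb{P}_r$ and $\mathbb{P}_\theta$ are assumed to be absolutely continuous, and therefore admit densities $P_r$ and $P_\theta$, w.r.t. the same measure $\mu$ on $\mathcal{X}$.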
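On a finite sample space both definitions reduce to sums, which makes them easy to check by hand. The sketch below is illustrative only: the helper names `tv_distance` and `kl_divergence` and the toy distributions are assumptions introduced here, not part of the slides.

```python
import math

def tv_distance(p, q):
    """Total variation distance between two discrete distributions.

    On a finite space the supremum over events A is attained at
    A = {x : p(x) > q(x)}, which equals half the L1 distance.
    """
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def kl_divergence(p, q):
    """KL(p || q) in nats; infinite when p puts mass where q has none."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # 0 * log(0 / q) is taken to be 0
        if qi == 0.0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total

# Toy distributions on three points (hypothetical values).
p_r = [0.5, 0.5, 0.0]
p_theta = [0.25, 0.25, 0.5]

tv = tv_distance(p_r, p_theta)
kl_forward = kl_divergence(p_r, p_theta)  # finite: supp(p_r) inside supp(p_theta)
kl_reverse = kl_divergence(p_theta, p_r)  # infinite: p_theta has mass where p_r has none
```

The asymmetry of KL shows up directly: one direction is finite while the other blows up as soon as the supports differ.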
Distances between probability distributions II
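- The Jensen-Shannon (JS) divergence
$$JS(\mathbb{P}_r, \mathbb{P}_\theta) = \frac{1}{2} KL(\mathbb{P}_r \| \mathbb{P}_m) + \frac{1}{2} KL(\mathbb{P}_\theta \| \mathbb{P}_m),$$
where $\mathbb{P}_m := (\mathbb{P}_r + \mathbb{P}_\theta)/2$.
- The Earth-Mover’s (EM) distance or Wasserstein-1 distance
$$W(\mathbb{P}_r, \mathbb{P}_\theta) = \inf_{\gamma \in \Pi(\mathbb{P}_r, \mathbb{P}_\theta)} \mathbb{E}_{(x,y) \sim \gamma}[|x - y|],$$
where $\Pi(\mathbb{P}_r, \mathbb{P}_\theta)$ denotes the set of all joint distributions $\gamma(x, y)$ whose marginals are respectively $\mathbb{P}_r$ and $\mathbb{P}_\theta$.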
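A small sketch can make the contrast between the two concrete. It assumes discrete distributions for JS and equal-size empirical samples on the real line for $W$, where the optimal coupling is known in closed form (sorted matching). The function names are hypothetical.

```python
import math

def kl(p, q):
    """KL(p || q) in nats for discrete distributions (0 log 0 := 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """JS(p, q) = KL(p || m) / 2 + KL(q || m) / 2 with m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def wasserstein1_1d(xs, ys):
    """W1 between two equal-size empirical samples on the real line.

    In one dimension the optimal coupling pairs the i-th smallest x
    with the i-th smallest y, so no optimization is needed.
    """
    assert len(xs) == len(ys)
    return sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)

# Disjoint supports: JS saturates at log 2 no matter how far apart they are.
js_disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])

# W1 keeps growing with the size of the shift.
w_near = wasserstein1_1d([0.0, 1.0, 2.0], [0.5, 1.5, 2.5])
w_far = wasserstein1_1d([0.0, 1.0, 2.0], [10.0, 11.0, 12.0])
```

Here `w_near` and `w_far` equal the shifts 0.5 and 10, while `js_disjoint` is $\log 2$ regardless of distance.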
Distances between probability distributions III
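Example: let $Z \sim \mathrm{Unif}([0, 1])$, let $\mathbb{P}_0$ be the distribution of $(0, Z)$, and let $\mathbb{P}_\theta$ be the distribution of $(\theta, Z)$ (two parallel line segments).
- $KL(\mathbb{P}_\theta \| \mathbb{P}_0) = \begin{cases} \infty & \text{if } \theta \neq 0 \\ 0 & \text{if } \theta = 0 \end{cases}$
- $JS(\mathbb{P}_0, \mathbb{P}_\theta) = \begin{cases} \log 2 & \text{if } \theta \neq 0 \\ 0 & \text{if } \theta = 0 \end{cases}$
- $\delta(\mathbb{P}_0, \mathbb{P}_\theta) = \begin{cases} 1 & \text{if } \theta \neq 0 \\ 0 & \text{if } \theta = 0 \end{cases}$
- $W(\mathbb{P}_0, \mathbb{P}_\theta) = |\theta|$.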
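A short derivation sketch of the values of $W$ and $JS$ in this example, using only the definitions of the two quantities. The comments describe the coupling used for the upper bound, which is notation introduced here.

```latex
% Upper bound: couple (0, z) with (\theta, z) for the same z.
W(\mathbb{P}_0, \mathbb{P}_\theta)
  \le \mathbb{E}_{Z}\bigl[\,|(0, Z) - (\theta, Z)|\,\bigr] = |\theta| .
% Lower bound: every x in supp(P_0) and y in supp(P_\theta) differ by
% exactly \theta in the first coordinate, so |x - y| \ge |\theta|
% under every coupling \gamma.
W(\mathbb{P}_0, \mathbb{P}_\theta) \ge |\theta| .
% JS for \theta \neq 0: the supports are disjoint, so on supp(P_0) the
% mixture P_m has half the mass of P_0 (and likewise for P_\theta).
JS(\mathbb{P}_0, \mathbb{P}_\theta)
  = \tfrac{1}{2} \log 2 + \tfrac{1}{2} \log 2 = \log 2 .
```

Only $W$ varies continuously with $\theta$; the other three jump at $\theta = 0$.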
Instability of GAN I
Original objective function:
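$$L(D, g_\theta) = \mathbb{E}_{x \sim \mathbb{P}_r}[\log D(x)] + \mathbb{E}_{x \sim \mathbb{P}_g}[\log(1 - D(x))].$$
The optimal discriminator is
$$D^*(x) = \frac{P_r(x)}{P_r(x) + P_g(x)},$$
and
$$L(D^*, g_\theta) = 2 JS(\mathbb{P}_r, \mathbb{P}_g) - 2 \log 2.$$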
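The identity for $L(D^*, g_\theta)$ can be checked numerically on a finite sample space. This is a minimal sketch with made-up distributions; `gan_value` and `js_divergence` are hypothetical helper names.

```python
import math

def gan_value(d, p_r, p_g):
    """L(D, g) = E_{P_r}[log D(x)] + E_{P_g}[log(1 - D(x))] on a finite space."""
    return sum(pr * math.log(dx) + pg * math.log(1.0 - dx)
               for dx, pr, pg in zip(d, p_r, p_g))

def js_divergence(p, q):
    """Jensen-Shannon divergence in nats for discrete distributions."""
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy real and generated distributions on three points (hypothetical values).
p_r = [0.6, 0.3, 0.1]
p_g = [0.2, 0.3, 0.5]

# Optimal discriminator D*(x) = P_r(x) / (P_r(x) + P_g(x)).
d_star = [pr / (pr + pg) for pr, pg in zip(p_r, p_g)]

value_at_optimum = gan_value(d_star, p_r, p_g)
closed_form = 2 * js_divergence(p_r, p_g) - 2 * math.log(2)

# A constant discriminator D = 1/2 does strictly worse than D*.
value_constant = gan_value([0.5, 0.5, 0.5], p_r, p_g)
```

`value_at_optimum` and `closed_form` agree up to floating-point error, and `value_constant` equals $-2 \log 2$, the value the objective takes when $JS = 0$.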
Instability of GAN II
Theorem 1
Let $\mathbb{P}_r$ and $\mathbb{P}_g$ be two distributions that have support contained in two closed manifolds $\mathcal{M}$ and $\mathcal{P}$ that don’t perfectly align and don’t have full dimension. We further assume that $\mathbb{P}_r$ and $\mathbb{P}_g$ are continuous in their respective manifolds, meaning that if there is a set $A$ with measure 0 in $\mathcal{M}$, then $\mathbb{P}_r(A) = 0$ (and analogously for $\mathbb{P}_g$). Then, there exists an optimal discriminator $D^* : \mathcal{X} \to [0, 1]$ that has accuracy 1, and for almost any $x$ in $\mathcal{M} \cup \mathcal{P}$, $D^*$ is smooth in a neighbourhood of $x$ and $\nabla_x D^*(x) = 0$.
Instability of GAN III
Theorem 2
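(Vanishing gradients on the generator) Let $g_\theta : \mathcal{Z} \to \mathcal{X}$ be a differentiable function that induces a distribution $\mathbb{P}_g$. If some conditions are satisfied, $\|D - D^*\| < \epsilon$, and $\mathbb{E}_{z \sim p(z)}[\|J_\theta g_\theta(z)\|_2^2] \le M^2$, then
$$\|\nabla_\theta \mathbb{E}_{z \sim p(z)}[\log(1 - D(g_\theta(z)))]\|_2 < M \frac{\epsilon}{1 - \epsilon}.$$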
Instability of GAN IV
Instability of GAN V
The −log D trick I
For the generator, instead of minimizing $\mathbb{E}_{z \sim p(z)}[\log(1 - D(g_\theta(z)))]$, minimize $\mathbb{E}_{z \sim p(z)}[-\log D(g_\theta(z))]$. This does not change the fixed points.
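To see why the alternative loss is attractive in practice, compare the gradient each loss sends back through the discriminator output. The sketch assumes the discriminator is a sigmoid of a scalar logit $s$, $D = \sigma(s)$, which is a common parametrization but is not stated in the slides.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def grad_saturating(s):
    """d/ds log(1 - sigmoid(s)): gradient of the original generator loss."""
    return -sigmoid(s)

def grad_non_saturating(s):
    """d/ds [-log sigmoid(s)]: gradient of the -log D alternative."""
    return sigmoid(s) - 1.0

# A confident discriminator assigns D(g(z)) close to 0, i.e. a very negative logit.
logit = -10.0
g_sat = grad_saturating(logit)      # magnitude near 0: almost no learning signal
g_alt = grad_non_saturating(logit)  # magnitude near 1: strong learning signal
```

Both gradients are negative, so gradient descent pushes the logit up (towards "real") in either case; only the magnitude differs when the discriminator is confident.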
Theorem 3
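Let $D^* = \frac{P_r}{P_r + P_g}$ be the optimal discriminator for a fixed $\theta = \theta_0$. Then
$$\mathbb{E}_{z \sim p(z)}\left[-\nabla_\theta \log D^*(g_\theta(z))\big|_{\theta = \theta_0}\right] = \nabla_\theta\left[KL(\mathbb{P}_{g_\theta} \| \mathbb{P}_r) - 2 JS(\mathbb{P}_{g_\theta}, \mathbb{P}_r)\right]\big|_{\theta = \theta_0}.$$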
The −log D trick II
Theorem 4
(Under some conditions) $\mathbb{E}_{z \sim p(z)}[-\nabla_\theta \log D(g_\theta(z))]$ is a centered Cauchy distribution with infinite expectation and variance.
Why should we use Wasserstein distance I
Theorem 5
Let $\mathbb{P}_r$ be a fixed distribution over $\mathcal{X}$. Let $Z$ be a random variable over another space $\mathcal{Z}$. Let $g : \mathcal{Z} \times \mathbb{R}^d \to \mathcal{X}$ be a function, which will be denoted $g_\theta(z)$. Let $\mathbb{P}_\theta$ denote the distribution of $g_\theta(Z)$. Then,
1. If $g$ is continuous in $\theta$, so is $W(\mathbb{P}_r, \mathbb{P}_\theta)$.
2. If $g$ is locally Lipschitz and satisfies regularity assumption 1, then $W(\mathbb{P}_r, \mathbb{P}_\theta)$ is continuous everywhere, and differentiable almost everywhere.
3. Statements 1 and 2 are false for the Jensen-Shannon and KL divergences.
If we choose $g_\theta$ to be any feedforward neural network parametrized by $\theta$, and $p(z)$ to be a prior over $z$ with $\mathbb{E}_{z \sim p(z)}[\|z\|] < \infty$, then regularity assumption 1 is satisfied.
Why should we use Wasserstein distance II
Theorem 6
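Let $\mathbb{P}$ be a distribution on a compact space $\mathcal{X}$ and $(\mathbb{P}_n)_{n \in \mathbb{N}}$ be a sequence of distributions on $\mathcal{X}$. Then, as $n \to \infty$,
1. $\delta(\mathbb{P}_n, \mathbb{P}) \to 0 \iff JS(\mathbb{P}_n, \mathbb{P}) \to 0$.
2. $W(\mathbb{P}_n, \mathbb{P}) \to 0 \iff \mathbb{P}_n \xrightarrow{\mathcal{D}} \mathbb{P}$, where $\xrightarrow{\mathcal{D}}$ denotes convergence in distribution.
3. $KL(\mathbb{P}_n \| \mathbb{P}) \to 0$ or $KL(\mathbb{P} \| \mathbb{P}_n) \to 0$ implies the statements in 1.
4. The statements in 1 imply the statements in 2.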
Why should we use Wasserstein distance III
Approximating the Earth-Mover’s distance
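By the Kantorovich-Rubinstein duality [1],
$$W(\mathbb{P}_r, \mathbb{P}_\theta) = \sup_{\|f\|_L \le 1} \mathbb{E}_{x \sim \mathbb{P}_r}[f(x)] - \mathbb{E}_{x \sim \mathbb{P}_\theta}[f(x)],$$
where the supremum is over all the 1-Lipschitz functions $f : \mathcal{X} \to \mathbb{R}$. 1-Lipschitz can be replaced by $K$-Lipschitz, in which case the supremum equals $K \cdot W(\mathbb{P}_r, \mathbb{P}_\theta)$.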
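The dual form says that every 1-Lipschitz function yields a lower bound on $W$, and the best one attains it. The sketch below checks this on a one-dimensional toy case with empirical samples, where $W$ is available exactly by sorted matching. The samples and the candidate functions are hand-picked assumptions.

```python
import math

def wasserstein1_1d(xs, ys):
    """Exact W1 between equal-size 1-D samples via sorted matching."""
    return sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)

def dual_objective(f, xs, ys):
    """E_{x ~ P_r}[f(x)] - E_{x ~ P_theta}[f(x)] for empirical samples."""
    return sum(map(f, xs)) / len(xs) - sum(map(f, ys)) / len(ys)

real = [0.0, 1.0, 2.0, 3.0]
fake = [2.0, 3.0, 4.0, 5.0]  # the real sample shifted by 2
w1 = wasserstein1_1d(real, fake)

# A few hand-picked 1-Lipschitz test functions (hypothetical choices).
candidates = {
    "x": lambda x: x,
    "-x": lambda x: -x,
    "sin": math.sin,
    "|x - 2.5|": lambda x: abs(x - 2.5),
}
dual_values = {name: dual_objective(f, real, fake)
               for name, f in candidates.items()}
best = max(dual_values.values())
```

Every entry of `dual_values` is at most `w1`, and $f(x) = -x$ attains it because the generated sample is a pure shift of the real one. In WGAN the critic network plays the role of $f$, searched over a parametrized family instead of a hand-picked list.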
Theorem 7
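Let $\mathbb{P}_r$ be any distribution, and let $\mathbb{P}_\theta$ be the distribution of $g_\theta(Z)$ satisfying assumption 1. Then, there exists a solution $f : \mathcal{X} \to \mathbb{R}$ to the problem
$$\max_{\|f\|_L \le 1} \mathbb{E}_{x \sim \mathbb{P}_r}[f(x)] - \mathbb{E}_{x \sim \mathbb{P}_\theta}[f(x)]$$
and we have
$$\nabla_\theta W(\mathbb{P}_r, \mathbb{P}_\theta) = -\mathbb{E}_{z \sim p(z)}[\nabla_\theta f(g_\theta(z))],$$
when both terms are well-defined.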
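As a sanity check of the gradient formula, consider a toy setup that is an assumption made here, not part of the slides: $\mathbb{P}_r = \mathcal{N}(\mu, 1)$ and $g_\theta(z) = z + \theta$ with $z \sim \mathcal{N}(0, 1)$, so that $\mathbb{P}_\theta = \mathcal{N}(\theta, 1)$.

```latex
% Both distributions are shifts of the same Gaussian, so
W(\mathbb{P}_r, \mathbb{P}_\theta) = |\mu - \theta| .
% For \theta \neq \mu, a 1-Lipschitz maximizer is linear:
f(x) = \operatorname{sign}(\mu - \theta)\, x ,
\qquad
\mathbb{E}_{x \sim \mathbb{P}_r}[f(x)] - \mathbb{E}_{x \sim \mathbb{P}_\theta}[f(x)]
  = \operatorname{sign}(\mu - \theta)\,(\mu - \theta) = |\mu - \theta| .
% Both sides of the gradient formula agree:
\nabla_\theta W(\mathbb{P}_r, \mathbb{P}_\theta) = -\operatorname{sign}(\mu - \theta),
\qquad
-\mathbb{E}_{z \sim p(z)}[\nabla_\theta f(g_\theta(z))] = -\operatorname{sign}(\mu - \theta) .
```

Following the negative gradient therefore moves $\theta$ towards $\mu$ at a constant rate, however far apart the two distributions are.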
WGAN algorithm
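A minimal sketch of the loop structure, assuming a one-dimensional toy problem: several critic updates with weight clipping, followed by one generator update. The generator $g_\theta(z) = z + \theta$, the linear critic $f_w(x) = w x$, plain gradient steps in place of the RMSProp optimizer used in the paper, and all hyperparameter values are simplifications chosen for illustration.

```python
import random

def clip(value, c):
    """Project a scalar onto [-c, c] (the weight-clipping step)."""
    return max(-c, min(c, value))

def train_toy_wgan(real_mean=3.0, iterations=300, n_critic=5,
                   batch_size=64, clip_value=1.0,
                   lr_critic=0.5, lr_generator=0.05, seed=0):
    """Toy 1-D WGAN: generator g_theta(z) = z + theta, critic f_w(x) = w * x."""
    rng = random.Random(seed)
    theta, w = 0.0, 0.0
    for _ in range(iterations):
        # Critic phase: ascend E[f_w(x)] - E[f_w(g_theta(z))], then clip w.
        for _ in range(n_critic):
            real = [rng.gauss(real_mean, 1.0) for _ in range(batch_size)]
            fake = [rng.gauss(0.0, 1.0) + theta for _ in range(batch_size)]
            grad_w = sum(real) / batch_size - sum(fake) / batch_size
            w = clip(w + lr_critic * grad_w, clip_value)
        # Generator phase: descend -E[f_w(g_theta(z))]. For this linear
        # critic the gradient with respect to theta is exactly -w.
        theta -= lr_generator * (-w)
    return theta, w

theta, w = train_toy_wgan()
```

With these toy settings `theta` drifts towards `real_mean` and then hovers around it. A linear critic only compares means, so this illustrates the update order and the clipping step, not the behaviour of a neural-network critic.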
Experiments I
Experiments II
[1] C. Villani. Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften. Springer, Berlin, 2009.