

Cooperative Training of Descriptor and Generator Networks

Jianwen Xie JIANWEN@UCLA.EDU
Yang Lu YANGLV@UCLA.EDU
Song-Chun Zhu SCZHU@STAT.UCLA.EDU
Ying Nian Wu YWU@STAT.UCLA.EDU

Department of Statistics, University of California, Los Angeles, CA, USA

Abstract

This paper studies the cooperative training of two probabilistic models of signals such as images. Both models are parametrized by convolutional neural networks (ConvNets). The first network is a descriptor network, which is an exponential family model or an energy-based model, whose feature statistics or energy function are defined by a bottom-up ConvNet, which maps the observed signal to the feature statistics. The second network is a generator network, which is a non-linear version of factor analysis. It is defined by a top-down ConvNet, which maps the latent factors to the observed signal. The maximum likelihood training algorithms of both the descriptor net and the generator net are in the form of alternating back-propagation, and both algorithms involve Langevin sampling. We observe that the two training algorithms can cooperate with each other by jumpstarting each other's Langevin sampling, and they can be naturally and seamlessly interwoven into a CoopNets algorithm that can train both nets simultaneously.

1. Introduction

1.1. Two ConvNets of opposite directions

We begin with a tale that the reader of this paper can readily relate to. A student writes up an initial draft of a paper. His advisor then revises it. After that they submit the revised paper for review. The student then learns from his advisor's revision, while the advisor learns from the outside review. In this tale, the advisor guides the student, but the student does most of the work.

Such is a tale we are going to tell in this paper. It is a tale of two nets. It is about two probabilistic models of signals such as images. Both models are parametrized by convolutional neural networks (ConvNets or CNNs) (LeCun et al., 1998; Krizhevsky et al., 2012). As mighty as ConvNets have become, they are still one-way streets. So the two nets in our tale take two opposite directions. One is bottom-up, and the other is top-down, as illustrated by the following diagram:

Bottom-up ConvNet              Top-down ConvNet
    features                    latent variables
       ⇑                               ⇓
     signal                          signal
(a) Descriptor Net             (b) Generator Net          (1)

The simultaneous training of such two nets was first studied by the recent work of (Kim & Bengio, 2016). These two nets belong to two major classes of probabilistic models. (a) The exponential family models or the energy-based models (LeCun et al., 2006; Hinton, 2002) or the Markov random field models (Zhu et al., 1997; Roth & Black, 2005), where the probability distribution is defined by the feature statistics or the energy function computed from the signal by a bottom-up process. (b) The latent variable models or the directed graphical models, where the signal is assumed to be a transformation of the latent factors that follow a known prior distribution. The latent factors generate the signal by a top-down process. A classical example is factor analysis (Rubin & Thayer, 1982).

The two classes of models have been contrasted by (Zhu, 2003; Guo et al., 2003; Teh et al., 2003; Wu et al., 2008; Ngiam et al., 2011). (Zhu, 2003; Guo et al., 2003) called the two classes of models the descriptive models and the generative models respectively. Both classes of models can benefit from the high capacity of multi-layer ConvNets. (a) In the exponential family models or the energy-based models, the feature statistics or the energy function can be defined by a bottom-up ConvNet that maps the signal to the features (Ngiam et al., 2011; Dai et al., 2015; Lu et al., 2016; Xie et al., 2016), and the energy function is usually the sum or a linear combination of the features at the top layer. We call the resulting model a descriptive network or a descriptor net following (Zhu, 2003), because it is built on descriptive feature statistics. (b) In the latent variable models or the directed graphical models, the transformation from the latent factors to the signal can be defined by a top-down ConvNet (Zeiler & Fergus, 2014; Dosovitskiy et al., 2015), which maps the latent factors to the signal. We call the resulting model a generative network or generator net following (Goodfellow et al., 2014), who proposed such a model in their work on the generative adversarial networks (GAN).

Figure 1. The flow chart of Algorithm D for training the descriptor net. The updating in Step D2 is based on the difference between the observed examples and the synthesized examples. The Langevin sampling of the synthesized examples from the current model in Step D1 can be time consuming.
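To make the two opposite directions concrete, here is a minimal PyTorch sketch of the two nets. The library choice is ours (the paper's experiments use MatConvNet), and all layer sizes and shapes are illustrative assumptions, not the paper's architectures:

```python
import torch
import torch.nn as nn

class Descriptor(nn.Module):
    """Bottom-up ConvNet: maps a signal Y to a scalar f(Y; W_D)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(64, 128, 5, stride=2, padding=2), nn.ReLU())
        self.head = nn.Linear(128, 1)

    def forward(self, y):
        h = self.features(y).mean(dim=(2, 3))  # pool the top-layer feature maps
        return self.head(h).squeeze(1)         # one score per image

class Generator(nn.Module):
    """Top-down ConvNet: maps latent factors X to a signal g(X; W_G)."""
    def __init__(self, d=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(d, 128, 4), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh())

    def forward(self, x):
        # view the d-dimensional factors as a 1x1 map; upsample to a 16x16 image
        return self.net(x.view(x.size(0), -1, 1, 1))
```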

In this article, we assume that the two nets have separate structures and parameters. What we shall do is to fuse the maximum likelihood training algorithms of these two nets. Although each training algorithm can stand on its own, their fusion is both beneficial and natural; in particular, the fusion fuels the Markov chain Monte Carlo (MCMC) for sampling the descriptor net, and turns the training of the generator net from unsupervised learning into supervised learning. This fusion leads to a cooperative training algorithm that interweaves the steps of the two algorithms seamlessly. We call the resulting algorithm the CoopNets algorithm, and we show that it can train both nets simultaneously.

1.2. Two algorithms of alternating back-propagation

Both the descriptor net and the generator net can be learned from the training examples by maximum likelihood.

The training algorithm for the descriptor net alternates between the following two steps (Xie et al., 2016). We call it Algorithm D. See Figure 1 for an illustration.

Step D1 for Langevin revision: Sampling synthesized examples from the current model by Langevin dynamics (Neal, 2011; Girolami & Calderhead, 2011). We call this step Langevin revision because it keeps revising the current synthesized examples.

Step D2 for density shifting: Updating the parameters of

Figure 2. The flow chart of Algorithm G for training the generator net. The updating in Step G2 is based on the observed examples and their inferred latent factors. The Langevin sampling of the latent factors from the current posterior distribution in Step G1 can be time consuming.

the descriptor net based on the difference between the observed examples and the synthesized examples obtained in D1. This step is to shift the high probability regions of the descriptor from the synthesized examples to the observed examples.

Both steps can be powered by back-propagation. The algorithm is thus an alternating back-propagation algorithm.

In Step D1, the descriptor net is dreaming by synthesizing images from the current model. In Step D2, the descriptor updates its parameters to make the dream more realistic.

The training algorithm for the generator net alternates between the following two steps (Han et al., 2016). We call it Algorithm G. See Figure 2 for an illustration.

Step G1 for Langevin inference: For each observed example, sample the latent factors from the current posterior distribution by Langevin dynamics. We call this step Langevin inference because it infers the latent factors for each observed example, in order for the inferred latent variables to explain or reconstruct the observed examples.

Step G2 for reconstruction: Updating the parameters of the generator net based on the observed examples and their inferred latent factors obtained in G1, so that the inferred latent variables can better reconstruct the observed examples.

Both steps can be powered by back-propagation. The algorithm is thus an alternating back-propagation algorithm.

The training algorithm for the generator net is similar to the EM algorithm (Dempster et al., 1977), where step G1 can be mapped to the E-step, and step G2 can be mapped to the M-step. It is an unsupervised learning algorithm because the latent factors are unobserved.

In Step G1, the generator net is thinking about each observed example by inferring the latent factors that can reconstruct it. The thinking involves explaining-away reasoning: the latent factors compete with each other in the Langevin inference process to explain the example. In Step G2, the generator net updates its parameters to make the thinking more accurate.

Compared to Step D2, Step G2 is also a form of density shifting from the current reconstructed examples towards the observed examples, but it requires inferring the latent factors for each observed example.

1.3. CoopNets algorithm: reconstruct the revisions

The two algorithms can operate separately on their own (Xie et al., 2016; Han et al., 2016). But just like the case with the student and the advisor, they benefit from cooperating with each other, where the generator net plays the role of the student, and the descriptor net plays the role of the advisor.

The need for cooperation stems from the fact that the Langevin sampling in Step D1 and Step G1 can be time-consuming. If the two algorithms cooperate, they can jumpstart each other's Langevin sampling in D1 and G1. The resulting algorithm, which we call the CoopNets algorithm, naturally and seamlessly interweaves the steps of the two algorithms with minimal modifications.

In plainer language, while the descriptor needs to dream hard in Step D1 for synthesis, the generator needs to think hard in Step G1 for explaining-away reasoning. On the other hand, the generator is actually a much better dreamer because it can generate images by direct ancestral sampling, while the descriptor does not need to think in order to learn.

Specifically, we have the following steps in the CoopNets algorithm:

Step G0 for generation: Generate the initial synthesized examples using the generator net. These initial examples can be obtained by direct ancestral sampling.

Step D1 for Langevin revision: Starting from the initial synthesized examples produced in Step G0, run the Langevin revision dynamics for a number of steps to obtain the revised synthesized examples.

Step D2 for density shifting: The same as before, except that we use the revised synthesized examples produced by Step D1 to shift the density of the descriptor towards the observed examples.

Step G1 for Langevin inference: The generator net can learn from the revised synthesized examples produced by Step D1. For each revised synthesized example, we know the values of the latent factors that generate the corresponding initial synthesized example in Step G0. Therefore we may simply infer the latent factors to be their known values given in Step G0, or initialize the Langevin inference dynamics in Step G1 from the known values.

Figure 3. The flow chart of the CoopNets algorithm. The part of the flow chart for training the descriptor is similar to Algorithm D in Figure 1, except that the D1 Langevin sampling is initialized from the initial synthesized examples supplied by the generator. The part of the flow chart for training the generator can also be mapped to Algorithm G in Figure 2, except that the revised synthesized examples play the role of the observed examples, and the known generated latent factors can be used as the inferred latent factors (or be used to initialize the G1 Langevin sampling of the latent factors).

Step G2 for reconstruction: The same as before, except that we use the revised synthesized examples and the inferred latent factors obtained in Step G1 to update the generator. The generator in Step G0 generates and thus reconstructs the initial synthesized examples. Step G2 updates the generator to reconstruct the revised synthesized examples. The revision of the generator accounts for the revisions made by the Langevin revision dynamics in Step D1.

Figure 3 shows the flow chart of the CoopNets algorithm. The generator is like the student. It generates the initial draft of the synthesized examples. The descriptor is like the advisor. It revises the initial draft by running a number of Langevin revisions. The descriptor learns from the outside review, which is in the form of the difference between the observed examples and the revised synthesized examples. The generator learns from how the descriptor revises the initial draft by reconstructing the revised draft. Such is the tale of two nets.

The reason we let the generator net learn from the revised synthesized examples instead of the observed examples is that the generator does not know the latent factors that generate the observed examples, and it has to think hard to infer them by explaining-away reasoning. However, the generator knows the latent factors that generate each initial synthesized example, and thus it essentially knows the latent factors when learning from the revised synthesized examples. By reconstructing the revised synthesized examples, the generator traces and accumulates the Langevin revisions made by the descriptor. This cooperation is thus beneficial to the generator by relieving it of the burden of inferring the latent variables.

While the generator may find it hard to learn from the observed examples directly, the descriptor has no problem learning from the observed examples because it only needs to compute the bottom-up features deterministically. However, it needs synthesized examples to find its way to shift its density, and they do not come by easily. The generator can provide an unlimited number of examples, and in each learning iteration, the generator supplies a completely new batch of independent examples on demand. The descriptor only needs to revise the new batch of examples instead of generating a new batch by itself from scratch. The generator has memorized the cumulative effect of all the past Langevin revisions, so that it can produce new samples in one shot. This cooperation is thus beneficial to the descriptor by relieving it of the burden of synthesizing examples from scratch.

In terms of density shifting, the descriptor shifts its density from the revised synthesized examples towards the observed examples in Step D2, and the generator shifts its density from the initial synthesized examples towards the revised synthesized examples in Step G2 by reconstructing the latter.

In terms of energy function, the descriptor shifts its low energy regions towards the observed examples, and it induces the generator to map the latent factors to its low energy regions. It achieves that by stochastically relaxing the synthesized examples towards the low energy regions, and letting the generator track the synthesized examples.

True to the essence of learning, the two nets still learn from the data. The descriptor learns from the observed data and teaches the generator through the synthesized data. Specifically, they collaborate and communicate with each other through the synthesized data in the following three versions (S stands for synthesis). (S1) The generator generates the initial draft. (S2) The descriptor revises the draft. (S3) The generator reconstructs the revised draft.

2. Related work

Previous authors have already told fascinating tales of two nets, although the plot of our tale is among the more intricate.

The most fabled tale of two nets is about the generative adversarial networks (GAN) (Goodfellow et al., 2014; Denton et al., 2015; Radford et al., 2015). In GAN, the generator net is paired with a discriminator net. The two nets play adversarial roles. In our work, the generator net and the descriptor net play cooperative roles, and they feed each other the synthesized data. The learning of both nets is based on maximum likelihood, and the learning process is quite stable because of the cooperative nature and the consistent direction of the two maximum likelihood training algorithms.

The connection between the descriptor and the discriminator has been explored by (Xie et al., 2016), where the descriptor can be derived from the discriminator. See also earlier pioneering work by (Tu, 2007; Welling et al., 2002).

Our tale of two nets is most similar to the one in the recent pioneering work of (Kim & Bengio, 2016). In fact, the settings of the two nets are exactly the same. In their work, the generator is learned from the current descriptor by minimizing the Kullback-Leibler divergence from the generator to the current descriptor, which can be decomposed into an energy term and an entropy term, with the latter approximated based on quantities produced by batch normalization (Ioffe & Szegedy, 2015). Their work avoids MCMC sampling; in particular, their work does not involve revising and reconstructing the synthesized examples, which plays a pivotal role in our method. In their work, the descriptor acts on the generator directly, whereas in our work, the descriptor acts on the generator through the synthesized data.

Despite the difference, there is an interesting connection between our work and (Kim & Bengio, 2016): while (Kim & Bengio, 2016) seeks to minimize the Kullback-Leibler divergence from the generator to the current descriptor, the Langevin revision dynamics in our work runs a Markov chain from the current generator to the current descriptor, and it automatically and monotonically decreases the Kullback-Leibler divergence from the distribution of the revised synthesized data to the distribution of the current descriptor (to zero in the time limit), which is an important property of the Markov chain that underlies the second law of thermodynamics and the so-called "arrow of time" (Cover & Thomas, 2012). The revised synthesized data are then fed to the generator for it to reconstruct the effect of the Langevin revisions. Therefore our learning method is consistent with that of (Kim & Bengio, 2016). In fact, the energy and entropy terms in the learning objective function in (Kim & Bengio, 2016) correspond to the gradient and noise terms of the Langevin dynamics in our work, but the Langevin dynamics relieves us of the need to approximate the intractable entropy term.

Our work is related to the contrastive divergence algorithm (Hinton, 2002) for training the descriptor net. The contrastive divergence initializes the MCMC sampling from the observed examples. The CoopNets algorithm initializes the Langevin sampling from the examples supplied by the generator.

Our work is somewhat related to the generative stochastic network of (Thibodeau-Laufer et al., 2014), which learns the Markov transition probability directly. In our work, the generator serves to absorb the cumulative effect of all the past Markov transitions (i.e., the Langevin revisions) of the descriptor and reproduce it in one shot.

Our work appears to be related to knowledge distilling (Hinton et al., 2015). In our work, the descriptor distills its knowledge to the generator.

Our work bears similarity to the co-training method of (Blum & Mitchell, 1998). The two learners in our work are of different directions, and they feed each other the synthesized data, instead of class labels.

3. Two nets and two algorithms

This section reviews the descriptor net and the generator net. The review of the descriptor net mainly follows (Xie et al., 2016). The review of the generator net mainly follows (Han et al., 2016).

3.1. Descriptor net and training algorithm

The descriptor net has its root in the exponential family model in statistics and the energy-based model in machine learning. Let $Y$ be the signal, such as an image, a video sequence, or a sound sequence. The descriptor model is in the form of exponential tilting of a reference distribution (Xie et al., 2016):

$$P_D(Y; W_D) = \frac{1}{Z(W_D)} \exp\left[f(Y; W_D)\right] q(Y). \qquad (2)$$

$q(Y)$ is the reference distribution, such as Gaussian white noise

$$q(Y) = \frac{1}{(2\pi s^2)^{D/2}} \exp\left[-\frac{\|Y\|^2}{2s^2}\right], \qquad (3)$$

where $D$ is the dimensionality of the signal $Y$, and $Y \sim {\rm N}(0, s^2 I_D)$ under $q(Y)$ ($I_D$ denotes the $D$-dimensional identity matrix). $f(Y; W_D)$ ($f$ stands for features) is the feature statistics or energy function, defined by a ConvNet whose parameters are denoted by $W_D$. This ConvNet is bottom-up because it maps the signal $Y$ to the feature statistics. See the diagram in (1). $Z(W_D) = \int \exp\left[f(Y; W_D)\right] q(Y) \, dY = {\rm E}_q\{\exp[f(Y; W_D)]\}$ is the normalizing constant, where ${\rm E}_q$ is the expectation with respect to $q$.

Suppose we observe training examples $\{Y_i, i = 1, \ldots, n\}$ from an unknown data distribution $P_{\rm data}(Y)$. The maximum likelihood training seeks to maximize the log-likelihood function

$$L_D(W_D) = \frac{1}{n} \sum_{i=1}^{n} \log P_D(Y_i; W_D). \qquad (4)$$

If the sample size $n$ is large, the maximum likelihood estimator minimizes ${\rm KL}(P_{\rm data} \| P_D)$, the Kullback-Leibler divergence from the data distribution $P_{\rm data}$ to the model distribution $P_D$.

The gradient of $L_D(W_D)$ is

$$L'_D(W_D) = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial}{\partial W_D} f(Y_i; W_D) - {\rm E}_{W_D}\left[\frac{\partial}{\partial W_D} f(Y; W_D)\right], \qquad (5)$$

where ${\rm E}_{W_D}$ denotes the expectation with respect to $P_D(Y; W_D)$. The key to the above identity is that $\frac{\partial}{\partial W_D} \log Z(W_D) = {\rm E}_{W_D}\left[\frac{\partial}{\partial W_D} f(Y; W_D)\right]$.

The expectation in equation (5) is analytically intractable and has to be approximated by MCMC, such as the Langevin revision dynamics, which iterates the following step:

$$Y_{\tau+1} = Y_\tau - \frac{\delta^2}{2}\left[\frac{Y_\tau}{s^2} - \frac{\partial}{\partial Y} f(Y_\tau; W_D)\right] + \delta U_\tau, \qquad (6)$$

where $\tau$ indexes the time steps of the Langevin dynamics, $\delta$ is the step size, and $U_\tau \sim {\rm N}(0, I_D)$ is the Gaussian white noise term.
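The Langevin revision step (6) is straightforward to implement with automatic differentiation. Below is a minimal sketch under the same assumptions as the earlier block (the function and parameter names are ours; the values of $s$, $\delta$, and the step count are illustrative, borrowed from the experiment section):

```python
import torch

def langevin_revision(y, f, s=0.016, delta=0.005, steps=10):
    """Run the Langevin revision dynamics of equation (6) on a batch y,
    where f is the descriptor returning f(Y; W_D) per example."""
    y = y.clone().detach()
    for _ in range(steps):
        y.requires_grad_(True)
        grad = torch.autograd.grad(f(y).sum(), y)[0]  # df(Y; W_D)/dY
        with torch.no_grad():
            noise = torch.randn_like(y)               # U_tau ~ N(0, I_D)
            y = y - 0.5 * delta**2 * (y / s**2 - grad) + delta * noise
    return y
```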

We can run $\tilde{n}$ parallel chains of Langevin dynamics according to (6) to obtain the synthesized examples $\{\tilde{Y}_i, i = 1, \ldots, \tilde{n}\}$. The Monte Carlo approximation to $L'_D(W_D)$ is

$$L'_D(W_D) \approx \frac{1}{n} \sum_{i=1}^{n} \frac{\partial}{\partial W_D} f(Y_i; W_D) - \frac{1}{\tilde{n}} \sum_{i=1}^{\tilde{n}} \frac{\partial}{\partial W_D} f(\tilde{Y}_i; W_D), \qquad (7)$$

which is the difference between the observed examples and the synthesized examples. See (Lu et al., 2016) for details.
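In code, the gradient estimate (7) corresponds to a simple two-term loss whose gradient an optimizer can follow. A sketch under the same assumptions as above (any torch optimizer over the descriptor's parameters will do):

```python
def descriptor_step(f, opt_d, y_obs, y_syn):
    """Step D2 (density shifting): follow the gradient estimate of
    equation (7). Minimizing this loss ascends the log-likelihood L_D."""
    loss = f(y_syn).mean() - f(y_obs).mean()
    opt_d.zero_grad()
    loss.backward()
    opt_d.step()
```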

Algorithm 1 (Xie et al., 2016) describes the training algorithm. See Figure 1 for an illustration. Step D1 needs to compute $\partial f(Y; W_D)/\partial Y$. Step D2 needs to compute $\partial f(Y; W_D)/\partial W_D$. The computations of both derivatives can be powered by back-propagation. Because of the ConvNet structure of $f(Y; W_D)$, the computations of the two derivatives share most of their steps in the chain rule computations.

Algorithm 1 Algorithm D
Input:
(1) training examples $\{Y_i, i = 1, \ldots, n\}$
(2) number of Langevin steps $l_D$
(3) number of learning iterations $T$
Output:
(1) estimated parameters $W_D$
(2) synthesized examples $\{\tilde{Y}_i, i = 1, \ldots, \tilde{n}\}$

1: Let $t \leftarrow 0$, initialize $W_D$.
2: Initialize $\tilde{Y}_i$, $i = 1, \ldots, \tilde{n}$.
3: repeat
4: Step D1 Langevin revision: For each $i$, run $l_D$ steps of Langevin dynamics to update $\tilde{Y}_i$, i.e., starting from the current $\tilde{Y}_i$, each step follows equation (6).
5: Step D2 density shifting: Update $W_D^{(t+1)} = W_D^{(t)} + \gamma_t L'_D(W_D^{(t)})$, with learning rate $\gamma_t$, where $L'_D(W_D^{(t)})$ is computed according to (7).
6: Let $t \leftarrow t + 1$
7: until $t = T$

Because the parameter $W_D$ keeps changing in the learning process, the energy landscape and the local energy minima also keep changing. This may help the Langevin revision dynamics avoid being trapped by the local energy minima.

Algorithm D is a stochastic approximation algorithm (Robbins & Monro, 1951), except that the synthesized examples are obtained by a finite number of Langevin steps in each learning iteration. The convergence of an algorithm of this type to the maximum likelihood estimate has been established by (Younes, 1999).
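Putting the two sketches together, one iteration of Algorithm D is a revision followed by a density shift. A sketch reusing the helpers defined above (the persistent chain state y_syn is our assumed bookkeeping for the warm start):

```python
def algorithm_d_step(f, opt_d, y_obs, y_syn, l_d=10):
    """One learning iteration of Algorithm D."""
    y_syn = langevin_revision(y_syn, f, steps=l_d)  # Step D1
    descriptor_step(f, opt_d, y_obs, y_syn)         # Step D2
    return y_syn  # warm start for the next iteration
```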

3.2. Generator net and training algorithm

The generator net (Goodfellow et al., 2014) has its root in factor analysis in statistics and the latent variable model or directed graphical model in machine learning. The generator net seeks to explain the signal $Y$ of dimension $D$ by a vector of latent factors $X$ of dimension $d$, and usually $d \ll D$. The model is of the following form:

$$X \sim {\rm N}(0, I_d), \qquad Y = g(X; W_G) + \epsilon, \qquad \epsilon \sim {\rm N}(0, \sigma^2 I_D). \qquad (8)$$

$g(X; W_G)$ ($g$ stands for generator) is a top-down ConvNet defined by the parameters $W_G$. The ConvNet $g$ maps the latent factors $X$ to the signal $Y$. See the diagram in (1).

The joint density of model (8) is $P_G(X, Y; W_G) = P_G(X) P_G(Y | X; W_G)$, and

$$\log P_G(X, Y; W_G) = -\frac{1}{2\sigma^2} \|Y - g(X; W_G)\|^2 - \frac{1}{2} \|X\|^2 + {\rm constant}, \qquad (9)$$

where the constant term is independent of $X$, $Y$ and $W_G$. The marginal density is obtained by integrating out the latent factors $X$, i.e., $P_G(Y; W_G) = \int P_G(X, Y; W_G) \, dX$.
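Since the later updates all differentiate the complete-data log-likelihood (9), it helps to have it as a function. A minimal sketch under the same assumptions as the earlier blocks ($\sigma = 0.3$ follows the texture experiments):

```python
def log_joint(x, y, g, sigma=0.3):
    """log P_G(X, Y; W_G) of equation (9), up to an additive constant."""
    recon = ((y - g(x)) ** 2).flatten(1).sum(1) / (2 * sigma**2)
    prior = (x ** 2).flatten(1).sum(1) / 2
    return -(recon + prior)  # one value per example in the batch
```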

The inference of $X$ given $Y$ is based on the posterior density $P_G(X | Y; W_G) = P_G(X, Y; W_G)/P_G(Y; W_G) \propto P_G(X, Y; W_G)$ as a function of $X$.

For the training data $\{Y_i, i = 1, \ldots, n\}$, the generator net can be trained by maximizing the log-likelihood

$$L_G(W_G) = \frac{1}{n} \sum_{i=1}^{n} \log P_G(Y_i; W_G). \qquad (10)$$

For a large sample, $W_G$ is obtained by minimizing ${\rm KL}(P_{\rm data} \| P_G)$, the Kullback-Leibler divergence from the data distribution $P_{\rm data}$ to the model distribution $P_G$.

The gradient of $L_G(W_G)$ is obtained according to the following identity:

$$\frac{\partial}{\partial W_G} \log P_G(Y; W_G) = {\rm E}_{P_G(X|Y; W_G)}\left[\frac{\partial}{\partial W_G} \log P_G(X, Y; W_G)\right]. \qquad (11)$$

The above identity underlies the EM algorithm. The usefulness of identity (11) lies in the fact that the derivative of the complete-data log-likelihood $\log P_G(X, Y; W_G)$ on the right-hand side can be obtained in closed form.

In general, the expectation in (11) is analytically intractable, and has to be approximated by MCMC that samples from the posterior $P_G(X | Y; W_G)$, such as the Langevin inference dynamics, which iterates

$$X_{\tau+1} = X_\tau + \frac{\delta^2}{2} \frac{\partial}{\partial X} \log P_G(X_\tau, Y; W_G) + \delta U_\tau, \qquad (12)$$

where $\tau$ indexes the time step, $\delta$ is the step size, and, for notational simplicity, we continue to use $U_\tau$ to denote the noise term, but here $U_\tau \sim {\rm N}(0, I_d)$. The Langevin inference solves an $\ell_2$-penalized non-linear least squares problem, so that $X_i$ can reconstruct $Y_i$ given the current $W_G$. The Langevin inference process performs explaining-away reasoning, where the latent factors in $X$ compete with each other to explain the current residual $Y - g(X; W_G)$.
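The inference dynamics (12) mirrors the revision dynamics (6), but runs in the latent space. A sketch reusing log_joint from above (same assumed library and illustrative step size):

```python
def langevin_inference(x, y, g, delta=0.005, steps=10):
    """Langevin inference dynamics of equation (12): sample X given Y."""
    x = x.clone().detach()
    for _ in range(steps):
        x.requires_grad_(True)
        grad = torch.autograd.grad(log_joint(x, y, g).sum(), x)[0]
        with torch.no_grad():
            x = x + 0.5 * delta**2 * grad + delta * torch.randn_like(x)
    return x
```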

With $X_i$ sampled from $P_G(X_i | Y_i; W_G)$ for each observation $Y_i$ by the Langevin inference process, the Monte Carlo approximation to $L'_G(W_G)$ is

$$L'_G(W_G) \approx \frac{1}{n} \sum_{i=1}^{n} \frac{\partial}{\partial W_G} \log P_G(X_i, Y_i; W_G) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{\sigma^2} \left(Y_i - g(X_i; W_G)\right) \frac{\partial}{\partial W_G} g(X_i; W_G). \qquad (13)$$

The updating of $W_G$ solves a non-linear regression problem, so that the learned $W_G$ enables better reconstruction of $Y_i$ by the inferred $X_i$. Given the inferred $X_i$, the learning of $W_G$ is a supervised learning problem (Dosovitskiy et al., 2015).
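In code, the update (13) is the gradient of a squared reconstruction error, which is what makes Step G2 ordinary supervised regression. A sketch under the same assumptions:

```python
def generator_step(g, opt_g, x, y, sigma=0.3):
    """Step G2 (reconstruction): non-linear regression of Y on the
    inferred X; minimizing this loss follows equation (13)."""
    loss = (((y - g(x)) ** 2).flatten(1).sum(1) / (2 * sigma**2)).mean()
    opt_g.zero_grad()
    loss.backward()
    opt_g.step()
```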

Algorithm 2 Algorithm G
Input:
(1) training examples $\{Y_i, i = 1, \ldots, n\}$
(2) number of Langevin steps $l_G$
(3) number of learning iterations $T$
Output:
(1) estimated parameters $W_G$
(2) inferred latent factors $\{X_i, i = 1, \ldots, n\}$

1: Let $t \leftarrow 0$, initialize $W_G$.
2: Initialize $X_i$, $i = 1, \ldots, n$.
3: repeat
4: Step G1 Langevin inference: For each $i$, run $l_G$ steps of Langevin dynamics to update $X_i$, i.e., starting from the current $X_i$, each step follows equation (12).
5: Step G2 reconstruction: Update $W_G^{(t+1)} = W_G^{(t)} + \gamma_t L'_G(W_G^{(t)})$, with learning rate $\gamma_t$, where $L'_G(W_G^{(t)})$ is computed according to equation (13).
6: Let $t \leftarrow t + 1$
7: until $t = T$

Algorithm 2 (Han et al., 2016) describes the training algorithm. See Figure 2 for an illustration. Step G1 needs to compute $\partial g(X; W_G)/\partial X$. Step G2 needs to compute $\partial g(X; W_G)/\partial W_G$. The computations of both derivatives can be powered by back-propagation, and the computations of the two derivatives share most of their steps in the chain rule computations.

Algorithm G is a stochastic approximation or stochastic gradient algorithm that converges to the maximum likelihood estimate (Younes, 1999).
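As with Algorithm D, one iteration of Algorithm G composes the two sketches above (the persistent latent factors x for the training batch are our assumed bookkeeping):

```python
def algorithm_g_step(g, opt_g, x, y_obs, l_g=10):
    """One learning iteration of Algorithm G."""
    x = langevin_inference(x, y_obs, g, steps=l_g)  # Step G1
    generator_step(g, opt_g, x, y_obs)              # Step G2
    return x  # warm start for the next iteration
```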

4. CoopNets algorithm

In Algorithm D and Algorithm G, both steps D1 and G1 are Langevin dynamics. They may be slow to converge and may become bottlenecks in their respective algorithms. An interesting observation is that the two algorithms can cooperate with each other by jumpstarting each other's Langevin sampling.

Specifically, in Step D1, we can initialize the synthesized examples by generating examples from the generator net, which does not require MCMC, because the generator net is a directed graphical model. More specifically, we first generate $\hat{X}_i \sim {\rm N}(0, I_d)$, and then generate $\hat{Y}_i = g(\hat{X}_i; W_G) + \epsilon_i$, for $i = 1, \ldots, \tilde{n}$. If the current generator $P_G$ is close to the current descriptor $P_D$, then the generated $\{\hat{Y}_i\}$ should be a good initialization for sampling from the descriptor net, i.e., starting from the $\{\hat{Y}_i, i = 1, \ldots, \tilde{n}\}$ supplied by the generator net, we run Langevin dynamics in Step D1 for $l_D$ steps to get $\{\tilde{Y}_i, i = 1, \ldots, \tilde{n}\}$, which are revised versions of $\{\hat{Y}_i\}$. These $\{\tilde{Y}_i\}$ can be used as the synthesized examples from the descriptor net. We can then update $W_D$ according to Step D2 of Algorithm D.

In order to update $W_G$ of the generator net, we treat the $\{\tilde{Y}_i, i = 1, \ldots, \tilde{n}\}$ produced by the above D1 step as the training data for the generator. Since these $\{\tilde{Y}_i\}$ are obtained by the Langevin revision dynamics initialized from the $\{\hat{Y}_i, i = 1, \ldots, \tilde{n}\}$ produced by the generator net with known latent factors $\{\hat{X}_i, i = 1, \ldots, \tilde{n}\}$, we can update $W_G$ by learning from $\{(\tilde{Y}_i, \hat{X}_i), i = 1, \ldots, \tilde{n}\}$, which is a supervised learning problem, or more specifically, a non-linear regression of $\tilde{Y}_i$ on $\hat{X}_i$. At $W_G^{(t)}$, the latent factors $\hat{X}_i$ generate and thus reconstruct the initial examples $\hat{Y}_i$. After updating $W_G$, we want $\hat{X}_i$ to reconstruct the revised examples $\tilde{Y}_i$. That is, we revise $W_G$ to absorb the revision from $\hat{Y}_i$ to $\tilde{Y}_i$, so that the generator shifts its density from $\{\hat{Y}_i\}$ to $\{\tilde{Y}_i\}$. The reconstruction error can tell us whether the generator has caught up with the descriptor by fully absorbing the revision.

The left diagram in (14) illustrates the basic idea.

$$\begin{array}{ccc} & \hat{X}_i & \\ \overset{W_G^{(t)}}{\swarrow} & & \overset{W_G^{(t+1)}}{\searrow} \\ \hat{Y}_i & \overset{W_D^{(t)}}{\dashrightarrow} & \tilde{Y}_i \end{array} \qquad\qquad \begin{array}{ccc} \hat{X}_i & \overset{W_G^{(t)}}{\dashrightarrow} & \tilde{X}_i \\ \big\Downarrow {\scriptstyle W_G^{(t)}} & & \big\Downarrow {\scriptstyle W_G^{(t+1)}} \\ \hat{Y}_i & \overset{W_D^{(t)}}{\dashrightarrow} & \tilde{Y}_i \end{array} \qquad (14)$$

In the two diagrams in (14), the double-line arrows indicate generation and reconstruction in the generator net, while the dashed-line arrows indicate Langevin dynamics for revision and inference in the two nets. The diagram on the right in (14) illustrates a more rigorous method, where we initialize the Langevin inference of $\{\tilde{X}_i, i = 1, \ldots, \tilde{n}\}$ in Step G1 from $\{\hat{X}_i\}$, and then update $W_G$ in Step G2 based on $\{(\tilde{Y}_i, \tilde{X}_i), i = 1, \ldots, \tilde{n}\}$.

Algorithm 3 describes the cooperative training that interweaves Algorithm D and Algorithm G. See Figure 3 for the flow chart of the CoopNets algorithm. In our experiments, we set $l_G = 0$ and infer $\tilde{X}_i = \hat{X}_i$.

The learning of both the descriptor and the generator follows the "analysis by synthesis" principle (Grenander & Miller, 2007). There are three sets of synthesized examples (S stands for synthesis): (S1) the initial synthesized examples $\{\hat{Y}_i\}$ generated by Step G0; (S2) the revised synthesized examples $\{\tilde{Y}_i\}$ produced by Step D1; (S3) the reconstructed synthesized examples $\{g(\tilde{X}_i; W_G^{(t+1)})\}$ produced by Step G2.


Algorithm 3 CoopNets Algorithm
Input:
(1) training examples $\{Y_i, i = 1, \ldots, n\}$
(2) numbers of Langevin steps $l_D$ and $l_G$
(3) number of learning iterations $T$
Output:
(1) estimated parameters $W_D$ and $W_G$
(2) synthesized examples $\{\hat{Y}_i, \tilde{Y}_i, i = 1, \ldots, \tilde{n}\}$

1: Let $t \leftarrow 0$, initialize $W_D$ and $W_G$.
2: repeat
3: Step G0: For $i = 1, \ldots, \tilde{n}$, generate $\hat{X}_i \sim {\rm N}(0, I_d)$, and generate $\hat{Y}_i = g(\hat{X}_i; W_G^{(t)}) + \epsilon_i$.
4: Step D1: For $i = 1, \ldots, \tilde{n}$, starting from $\hat{Y}_i$, run $l_D$ steps of Langevin revision dynamics to obtain $\tilde{Y}_i$, each step following equation (6).
5: Step G1: Treat the current $\{\tilde{Y}_i, i = 1, \ldots, \tilde{n}\}$ as the training data; for each $i$, infer $\tilde{X}_i = \hat{X}_i$. Or, more rigorously, starting from $\tilde{X}_i = \hat{X}_i$, run $l_G$ steps of Langevin inference dynamics to update $\tilde{X}_i$, each step following equation (12).
6: Step D2: Update $W_D^{(t+1)} = W_D^{(t)} + \gamma_t L'_D(W_D^{(t)})$, where $L'_D(W_D^{(t)})$ is computed according to (7).
7: Step G2: Update $W_G^{(t+1)} = W_G^{(t)} + \gamma_t L'_G(W_G^{(t)})$, where $L'_G(W_G^{(t)})$ is computed according to equation (13), except that $Y_i$ is replaced by $\tilde{Y}_i$, $X_i$ by $\tilde{X}_i$, and $n$ by $\tilde{n}$. We can run multiple iterations of Step G2 to learn from and reconstruct $\{\tilde{Y}_i\}$, and to allow the generator to catch up with the descriptor.
8: Let $t \leftarrow t + 1$
9: until $t = T$

The descriptor shifts its density from (S2) towards the observed data, while the generator shifts its density from (S1) towards (S2) (by arriving at (S3)).

The evolution from (S1) to (S2) is the work of the descriptor. It is a process of stochastic relaxation that settles the synthesized examples in the low energy regions of the descriptor. The descriptor works as an associative memory, with (S1) being the cue and (S2) being the recalled memory. It serves as a feedback to the generator. The reconstruction of (S2) by (S3) is the work of the generator, which seeks to absorb the feedback conveyed by the evolution from (S1) to (S2). The descriptor can test whether the generator learns well by checking whether (S3) is close to (S2). The two nets collaborate and communicate with each other via the synthesized data.
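Assembling the helpers from Section 3, one CoopNets iteration can be sketched as follows. Again a hedged illustration: the batch size, the omitted noise term in Step G0, and the $l_G = 0$ shortcut follow the choices stated in the text, and all names are ours:

```python
def coopnets_step(f, g, opt_d, opt_g, y_obs, d=100, n_syn=64, l_d=10, l_g=0):
    """One iteration of the CoopNets algorithm (Algorithm 3)."""
    # Step G0: ancestral sampling of the initial draft (noise term omitted).
    x_hat = torch.randn(n_syn, d)
    with torch.no_grad():
        y_hat = g(x_hat)
    # Step D1: the descriptor revises the generator's draft.
    y_tilde = langevin_revision(y_hat, f, steps=l_d)
    # Step G1: with l_g = 0 we simply reuse the known latent factors.
    x_tilde = langevin_inference(x_hat, y_tilde, g, steps=l_g) if l_g > 0 else x_hat
    # Step D2: shift the descriptor's density towards the observed data.
    descriptor_step(f, opt_d, y_obs, y_tilde)
    # Step G2: the generator reconstructs the revised draft.
    generator_step(g, opt_g, x_tilde, y_tilde)
```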

5. Experiments

We use the MatConvNet of (Vedaldi & Lenc, 2015) for coding. For the descriptor net, we adopt the structure of (Xie et al., 2016), where the bottom-up network consists of multiple layers of convolution by linear filtering, ReLU non-linearity, and down-sampling. We adopt the structure of the generator network of (Radford et al., 2015; Dosovitskiy et al., 2015), where the top-down network consists of multiple layers of deconvolution by linear superposition, ReLU non-linearity, and up-sampling, with tanh non-linearity at the bottom layer (Radford et al., 2015) to make the signals fall within $[-1, 1]$. The synthesized images we show are all generated by the descriptor net.

5.1. Learning texture patterns

We learn a separate model from each texture image. The training images are collected from the Internet, and then resized to $224 \times 224$. The synthesized images are of the same size as the training images.

We use a 3-layer descriptor net, where the first layer has 100 $15 \times 15$ filters with a sub-sampling rate of 3 pixels, the second layer has 70 $9 \times 9$ filters with sub-sampling of 1, and the third layer has 30 $7 \times 7$ filters with sub-sampling of 1. We fix the standard deviation of the reference distribution of the descriptor net to be $s = 0.012$. We use $L = 20$ or $30$ steps of Langevin revision dynamics within each learning iteration, and the Langevin step size is set at 0.003. The learning rate is 0.01.

Starting from $7 \times 7$ latent factors, the generator net has 5 layers of deconvolution with $5 \times 5$ kernels (basis functions), with an up-sampling factor of 2 at each layer. The standard deviation of the noise vector is $\sigma = 0.3$. The learning rate is $10^{-6}$. The number of generator learning steps is 1 at each cooperative learning iteration.

We run $10^4$ cooperative learning iterations to train the CoopNets. Figure 4 displays the results of generating texture patterns. For each category, the first image is the training image, and the rest are 3 of the images generated by the learning algorithm. We run $\tilde{n} = 6$ parallel chains for the first example, where images from 3 of them are presented. We run a single chain for the remaining examples, where the synthesized images are generated at different iterations. Even though we run a single chain, it is as if we run an infinite number of chains, because in each iteration we run the Langevin revision dynamics from a new image sampled from the generator.

5.2. Learning object patterns

We learn a separate model for each object pattern. The training images are collected from the Internet, and then resized to $64 \times 64$. The number of training images for each category is about 5. The structure of the generator network is the same as in (Radford et al., 2015; Dosovitskiy et al., 2015). We adopt a 3-layer descriptor net. The first layer has 100 $4 \times 4$ filters with sub-sampling of 2, the second layer has 64 $2 \times 2$ filters with sub-sampling of 1, and the third layer is a fully connected layer with a single-channel output. We fix the standard deviation of the reference distribution of the descriptor net to be $s = 0.016$. We use $L = 10$ steps of Langevin revision dynamics within each learning iteration, and the Langevin step size is set at 0.005. The learning rate is 0.07. We run 3,000 cooperative learning iterations to train the model. Figure 5 displays the results. Each row displays an experiment, where the first 4 images are 4 of the training images and the rest are generated images.

Figure 4. Generating texture patterns. Each row displays one texture experiment, where the first image is the training image, and the rest are 3 of the images generated by the CoopNets algorithm.

Figure 5. Generating object patterns. Each row displays one object experiment, where the first 4 images are 4 of the training images, and the rest are 4 of the images generated by the CoopNets algorithm.

Figure 6. Generating the human face pattern. The synthesized images are generated by the CoopNets algorithm that learns from 10,000 images.

We also conduct an experiment on human faces, where we adopt a 4-layer descriptor net. The first layer has 96 $5 \times 5$ filters with sub-sampling of 2, the second layer has 128 $5 \times 5$ filters with sub-sampling of 2, the third layer has 256 $5 \times 5$ filters with sub-sampling of 2, and the final layer is a fully connected layer with 50 channels as output. We reduce the Langevin step size to 0.002. The training data are 10,000 human faces randomly selected from the CelebA dataset (Liu et al., 2015). We run 600 cooperative learning iterations. Figure 6 displays 144 synthesized human faces by the descriptor net.

6. Conclusion

The most unique feature of our work is that the two networks feed each other the synthesized data in the learning process. The generator feeds the descriptor the initial version of the synthesized data. The descriptor feeds back to the generator the revised version of the synthesized data. The generator then produces the reconstructed version of the synthesized data. While the descriptor learns from a finite amount of observed data, the generator learns from a virtually infinite amount of synthesized data.

Another unique feature of our work is that the learning process interweaves the existing maximum likelihood learning algorithms for the two networks, so that the two algorithms jumpstart each other's MCMC sampling.

A third unique feature of our work is that the MCMC for the descriptor keeps rejuvenating the chains by refreshing the samples by independent replacements supplied by the generator, so that a single chain effectively amounts to an infinite number of chains, or the evolution of the whole marginal distribution. Powering the MCMC sampling of the descriptive models in (Lu et al., 2016; Xie et al., 2016) is the main motivation of this paper, with the bonus of turning the unsupervised learning of the generator (Han et al., 2016) into supervised learning.

Code and data

http://www.stat.ucla.edu/~ywu/CoopNets/main.html

Acknowledgement

We thank Hansheng Jiang for her work on this project as a summer visiting student. We thank Tian Han for sharing the code on learning the generator network, and for helpful discussions.

The work is supported by NSF DMS 1310391, DARPA SIMPLEX N66001-15-C-4035, ONR MURI N00014-16-1-2007, and DARPA ARO W911NF-16-1-0579.

7. Appendix 1: Convergence

7.1. Generator of infinite capacity

In the CoopNets algorithm, the descriptor learns from the observed examples, while the generator learns from the descriptor through the synthesized examples. Therefore, the descriptor is the driving force in terms of learning, although the generator is the driving force in terms of synthesis. In order to understand the convergence of learning, we can start from Algorithm D for learning the descriptor.

Algorithm D is a stochastic approximation algorithm (Robbins & Monro, 1951), except that the samples are generated by finite-step MCMC transitions. According to (Younes, 1999), Algorithm D converges to the maximum likelihood estimate under suitable regularity conditions on the mixing of the transition kernel of the MCMC and the schedule of the learning rate $\gamma_t$, even if the number of Langevin steps $l_D$ is finite or small (e.g., $l_D = 1$), and even if the number of parallel chains $\tilde{n}$ is finite or small (e.g., $\tilde{n} = 1$). The reason is that the random fluctuations caused by the finite number of chains, $\tilde{n}$, and the limited mixing caused by the finite steps of MCMC, $l_D$, are mitigated if the learning rate $\gamma_t$ is sufficiently small. At learning iteration $t$, let $W_D^{(t)}$ be the estimated parameter of the descriptor. Let $P_D^{(t+1)}$ be the marginal distribution of $\{\tilde{Y}_i\}$. Even though $P_D^{(t+1)} \neq P_D(Y; W_D^{(t)})$ because $l_D$ is finite ($P_D^{(t+1)} = P_D(Y; W_D^{(t)})$ if $l_D \to \infty$), we still have $W_D^{(t)} \to \hat{W}_D$ in probability according to (Younes, 1999), where $\hat{W}_D$ is the maximum likelihood estimate of $W_D$.

The efficiency of Algorithm D increases if the number of parallel chains $\tilde{n}$ is large, because it leads to a more accurate estimation of the expectation in the gradient $L'_D(W_D)$ of equation (5), so that we can afford to use a larger learning rate $\gamma_t$ for faster convergence.

Now let us come back to the CoopNets algorithm. In order to understand how the descriptor net helps the training of the generator net, let us consider the idealized scenario where the number of parallel chains $\tilde{n} \to \infty$, the generator has infinite capacity, and in each iteration it estimates $W_G$ by maximum likelihood using the synthesized data from $P_D^{(t+1)}$. In this idealized scenario, the learned generator $P_G(Y; W_G^{(t+1)})$ will reproduce $P_D^{(t+1)}$ by minimizing ${\rm KL}(P_D^{(t+1)}(Y) \| P_G(Y; W_G))$, with $P_D^{(t+1)}$ serving as its data distribution. Then eventually the learned generator $P_G(Y; \hat{W}_G)$ will reproduce $P_D(Y; \hat{W}_D)$. Thus the cooperative training helps the learning of the generator. Note that the learned generator $P_G(Y; \hat{W}_G)$ will not reproduce the distribution of the observed data $P_{\rm data}$, unless the descriptor is of infinite capacity too.

Conversely, the generator net also helps the learning of the descriptor net in the CoopNets algorithm. In Algorithm D, it is impractical to make the number of parallel chains $\tilde{n}$ too large. On the other hand, it would be difficult for a small number of chains $\{\tilde{Y}_i, i = 1, \ldots, \tilde{n}\}$ to explore the state space. In the CoopNets algorithm, because $P_G(Y; W_G^{(t)})$ reproduces $P_D^{(t)}$, we can generate a completely new batch of independent samples $\{\hat{Y}_i\}$ from $P_G(Y; W_G^{(t)})$, and revise $\{\hat{Y}_i\}$ to $\{\tilde{Y}_i\}$ by Langevin dynamics, instead of running Langevin dynamics from the same old batch of $\{\tilde{Y}_i\}$ as in the original Algorithm D. This is like implementing an infinite number of parallel chains, because each iteration evolves a fresh batch of examples, as if each iteration evolves a new set of chains. By updating the generator $W_G$, it is as if we are updating the infinite number of parallel chains, because $W_G$ memorizes the whole distribution. Even if $\tilde{n}$ in the CoopNets algorithm is small, e.g., $\tilde{n} = 1$, viewed from the perspective of Algorithm D, it is as if $\tilde{n} \to \infty$. Thus the above idealization $\tilde{n} \to \infty$ is sound.

7.2. Generator of finite capacity

From an information geometry (Amari & Nagaoka, 2007) point of view, let $\mathcal{D} = \{P_D(Y; W_D), \forall W_D\}$ be the manifold of the descriptor models, where each distribution $P_D(Y; W_D)$ is a point on this manifold. Then the maximum likelihood estimate of $W_D$ is a projection of the data distribution $P_{\rm data}$ onto the manifold $\mathcal{D}$. Let $\mathcal{G} = \{P_G(Y; W_G), \forall W_G\}$ be the manifold of the generator models, where each distribution $P_G(Y; W_G)$ is a point on this manifold. Then the maximum likelihood estimate of $W_G$ is a projection of the data distribution $P_{\rm data}$ onto the manifold $\mathcal{G}$.

From now on, for notational simplicity and with a slight abuse of notation, we use $W_D$ to denote the descriptor distribution $P_D(Y; W_D)$, and use $W_G$ to denote the generator distribution $P_G(Y; W_G)$.

We assume that both the observed data size $n$ and the synthesized data size $\tilde{n}$ are large enough, so that we shall work on distributions or populations instead of finite samples. As explained above, assuming $\tilde{n} \to \infty$ is sound because the generator net can supply an unlimited number of examples.

The Langevin revision dynamics runs a Markov chain from $W_G^{(t)}$ towards $W_D^{(t)}$. Let $L_{W_D}$ be the Markov transition kernel of $l_D$ steps of Langevin revisions towards $W_D$. The distribution of the revised synthesized data is

$$P_D^{(t+1)} = L_{W_D^{(t)}} \cdot W_G^{(t)}, \qquad (15)$$

where the notation $L \cdot P$ denotes the marginal distribution obtained by running the Markov transition $L$ from $P$. The distribution $P_D^{(t+1)}$ is in the middle between the two nets $W_G^{(t)}$ and $W_D^{(t)}$, and it serves as the data distribution to train the generator, i.e., we project this distribution onto the manifold $\mathcal{G} = \{P_G(Y; W_G), \forall W_G\} = \{W_G\}$ (recall we use $W_G$ to denote the distribution $P_G(Y; W_G)$) in the information geometry picture, so that

$$W_G^{(t+1)} = \arg\min_{\mathcal{G}} {\rm KL}(P_D^{(t+1)} \| W_G). \qquad (16)$$

The learning process alternates between the Markov transition in (15) and the projection in (16), as illustrated by Figure 7. Compared to the three sets of synthesized data mentioned at the end of Section 4, $W_G^{(t)}$ corresponds to (S1), $P_D^{(t+1)}$ corresponds to (S2), and $W_G^{(t+1)}$ corresponds to (S3).

In the case of $l_D \to \infty$,

$$W_D^{(t)} \to \hat{W}_D = \arg\min_{\mathcal{D}} {\rm KL}(P_{\rm data} \| W_D), \qquad (17)$$

$$W_G^{(t)} \to \hat{W}_G = \arg\min_{\mathcal{G}} {\rm KL}(\hat{W}_D \| W_G). \qquad (18)$$


Figure 7. The learning of the generator alternates between Markov transition and projection. The family of the generator models $\mathcal{G}$ is illustrated by the black curve. Each distribution is illustrated by a point.

That is, we first project $P_{\rm data}$ onto $\mathcal{D}$, and from there continue to project onto $\mathcal{G}$. Therefore, $W_D$ converges to the maximum likelihood estimate with $P_{\rm data}$ being the data distribution, and $W_G$ converges to the maximum likelihood estimate with $\hat{W}_D$ serving as the data distribution.

For finite $l_D$, the algorithm may converge to the following fixed points. The fixed point for the generator satisfies

$$\hat{W}_G = \arg\min_{\mathcal{G}} {\rm KL}(L_{\hat{W}_D} \cdot \hat{W}_G \| W_G). \qquad (19)$$

The fixed point for the descriptor satisfies

$$\hat{W}_D = \arg\min_{\mathcal{D}} \left[{\rm KL}(P_{\rm data} \| W_D) - {\rm KL}(L_{W_D} \cdot \hat{W}_G \| W_D)\right], \qquad (20)$$

which is similar to contrastive divergence (Hinton, 2002), except that $\hat{W}_G$ takes the place of $P_{\rm data}$ in the second Kullback-Leibler divergence. Because $\hat{W}_G$ is supposed to be close to $\hat{W}_D$, the second Kullback-Leibler divergence is supposed to be small; hence our algorithm is closer to maximum likelihood learning than contrastive divergence.

(Kim & Bengio, 2016) learned the generator by gradient descent on ${\rm KL}(W_G \| W_D^{(t)})$ over $\mathcal{G}$. The objective function is ${\rm KL}(W_G \| W_D^{(t)}) = {\rm E}_{W_G}[\log P_G(Y; W_G)] - {\rm E}_{W_G}[\log P_D(Y; W_D^{(t)})]$, where the first term is the negative entropy, which is intractable, and the second term is the expected energy, which is tractable. Our learning method for the generator is consistent with the learning objective ${\rm KL}(W_G \| W_D^{(t)})$, because

$${\rm KL}(P_D^{(t+1)} \| W_D^{(t)}) \leq {\rm KL}(W_G^{(t)} \| W_D^{(t)}). \qquad (21)$$

In fact, ${\rm KL}(P_D^{(t+1)} \| W_D^{(t)}) \to 0$ monotonically as $l_D \to \infty$ due to the second law of thermodynamics (Cover & Thomas, 2012). The reduction of the Kullback-Leibler divergence in (21) and the projection in (16) in our learning of the generator are consistent with the learning objective of reducing ${\rm KL}(W_G \| W_D^{(t)})$ in (Kim & Bengio, 2016). But the Monte Carlo implementation of $L$ in our work avoids the need to approximate the intractable entropy term.
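For the reader's convenience, here is a sketch of the reasoning behind (21), based on the standard data-processing property of the Kullback-Leibler divergence under a common Markov kernel (Cover & Thomas, 2012), and on the idealization that the Langevin kernel $L_{W_D^{(t)}}$ leaves its target distribution $W_D^{(t)}$ exactly invariant: for any two distributions $P$ and $Q$ evolved by the same kernel,

$${\rm KL}(L_{W_D^{(t)}} \cdot P \,\|\, L_{W_D^{(t)}} \cdot Q) \leq {\rm KL}(P \,\|\, Q),$$

so taking $P = W_G^{(t)}$ and $Q = W_D^{(t)}$ gives

$${\rm KL}(P_D^{(t+1)} \,\|\, W_D^{(t)}) = {\rm KL}(L_{W_D^{(t)}} \cdot W_G^{(t)} \,\|\, L_{W_D^{(t)}} \cdot W_D^{(t)}) \leq {\rm KL}(W_G^{(t)} \,\|\, W_D^{(t)}),$$

which is inequality (21); iterating the argument over successive Langevin steps gives the monotone decrease as $l_D$ grows.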

8. Appendix 2: Generator and MCMC

The general idea of the interaction between MCMC and the generator can be illustrated by the following diagram:

$$\begin{array}{cccc} \text{MCMC:} & P^{(t)} & \xrightarrow{\;\text{Markov transition}\;} & P^{(t+1)} \\ & \updownarrow & & \updownarrow \\ \text{Generator:} & W^{(t)} & \xrightarrow{\;\text{parameter updating}\;} & W^{(t+1)} \end{array} \qquad (22)$$

where $P^{(t)}$ is the marginal distribution of the MCMC, and $W^{(t)}$ is the parameter of the generator, which traces the evolution of the marginal distribution in MCMC by absorbing the cumulative effect of all the past Markov transitions (we drop the subscripts $D$ and $G$ here for simplicity and generality).

In traditional MCMC, we only have access to Monte Carlo samples, instead of their marginal distributions, which exist only theoretically but are analytically intractable. However, with the generator net, we can actually implement MCMC at the level of the whole distributions, instead of a number of Monte Carlo samples, in the sense that after learning $W^{(t)}$ from the existing samples of $P^{(t)}$, we can replace the existing samples by fresh new samples drawn from the generator defined by the learned $W^{(t)}$, which is illustrated by the two-way arrow between $P^{(t)}$ and $W^{(t)}$. Effectively, the generator powers the MCMC by implicitly running an infinite number of parallel chains. Conversely, the MCMC does not only drive the evolution of the samples, it also drives the evolution of a generator model.

If the descriptor is fixed, the generator will converge to the descriptor if the generator has infinite capacity. In the CoopNets algorithm, the descriptor is an evolving one, because we keep updating the parameter $W_D$ of the descriptor by learning from the observed examples. Therefore the Markov chain in (22) is a non-stationary one, because the target distribution of the MCMC, i.e., the current descriptor, keeps evolving, so that the Markov transition kernel also keeps changing. Initially, the distribution of the descriptor is of high temperature, and the noise term dominates the Langevin dynamics. By following the descriptor, the generator will gain high entropy and high variance. As the learning proceeds, the parameters of the descriptor become bigger in magnitude, and the distribution becomes colder and multi-modal, so that the Langevin dynamics will be dominated by the gradient term. At the low temperature, the patterns begin to form. By chasing the descriptor, the generator begins to map the latent factors to the local modes to create the patterns. In the end, it is possible that the generator maps each small neighborhood of the factor space into a local mode of the energy landscape of the descriptor. As observed by (Xie et al., 2016), the local modes of the descriptor satisfy an auto-encoder $Y/s^2 = \partial f(Y; W_D)/\partial Y$. Hence the examples generated by the generator, $Y = g(X; W_G)$, satisfy the auto-encoder defined by the descriptor, so that the stochastic relaxation by the descriptor cannot make much revision of the examples generated by the generator. From an embedding perspective, the generator embeds the high-dimensional local modes of the descriptor into its low-dimensional factor space, thus providing a map of the energy landscape of the descriptor. From the MCMC perspective, the learning process is like a simulated annealing or tempering process (Kirkpatrick et al., 1983; Marinari & Parisi, 1992), and the generator traces out this process in distribution, as illustrated by diagram (22). As is the case with simulated annealing, it is necessary to begin with a high temperature and high entropy distribution to cast a sufficiently wide generator net, and to use a slow tempering schedule, or learning schedule in our case, to capture most of the major modes.

MCMC sampling of an evolving descriptor can be easier than sampling from a fixed descriptor, because the energy landscape of an evolving descriptor keeps changing, or even destroying, the local modes defined by the auto-encoder, thus releasing the Markov chains from being trapped in local modes.

9. Appendix 3: More on two nets

9.1. Backgrounds

To illustrate the descriptive models and the generative models, (Zhu, 2003) used the FRAME (Filters, Random field, And Maximum Entropy) model of (Zhu et al., 1997) as a prototype of the descriptive models, and used the sparse coding model of (Olshausen & Field, 1997) as a prototype of the generative models. Both the FRAME model and the sparse coding model involve wavelets, but the wavelets play different roles in the two models. In the FRAME model, the wavelets serve as bottom-up filters for extracting features. In the sparse coding model, the wavelets serve as top-down basis functions for linear superposition. These two roles can be unified only in an orthogonal basis or a tight frame of wavelets, but in general we cannot have both. In a more general setting, in a descriptive model the features are explicitly defined by a bottom-up process, but the signal cannot be recovered from the features by an explicit top-down process. In a generative model, the latent variables generate the signal by an explicit top-down process, but inferring the latent variables requires explaining-away reasoning that cannot be defined by an explicit bottom-up process. Efforts have been made to integrate these two classes of models and to understand their entropy regimes, such as (Guo et al., 2003; Wu et al., 2008; Xie et al., 2014; 2016), but we still lack a clear understanding of their roles, their scopes, and their relationship.

Both the descriptor net and the generator net are fundamental representations of knowledge. The descriptor provides multiple layers of features, and its proper use may be to serve as an associative memory or content-addressable memory (Hopfield, 1982), where the conditional distribution of the signal given partial information is explicitly defined. Compared to the full distribution, the conditional distribution can be easier to sample from because it can be less multi-modal than the full distribution. The generator net disentangles the variations in the training examples and provides non-linear dimension reduction. It embeds the highly non-linear manifold formed by the high-dimensional training examples into the low-dimensional Euclidean space of latent factors, so that linear interpolation in the factor space results in highly non-linear interpolation in the signal space, as the sketch below illustrates.
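A minimal sketch of that last point, with an arbitrary random two-layer ReLU network standing in for a learned generator (the specific numbers are meaningless; only the qualitative behavior matters): a straight line in the factor space maps to a curved path in the signal space.

import numpy as np

rng = np.random.default_rng(1)
W1, W2 = rng.standard_normal((32, 2)), rng.standard_normal((64, 32))

def g(X):
    # Toy top-down network standing in for g(X; W_G).
    return W2 @ np.maximum(0.0, W1 @ X)

x0, x1 = rng.standard_normal(2), rng.standard_normal(2)
for a in np.linspace(0, 1, 5):
    x = (1 - a) * x0 + a * x1          # linear interpolation in factor space
    print(a, np.round(g(x)[:3], 3))    # traces a non-linear path in signal space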

We could have called the descriptor net an energy-based model, but the latter term carries a much broader meaning than probabilistic models; see, e.g., (LeCun et al., 2006). In fact, energy-based probabilistic models are often looked upon unfavorably because of the intractable normalizing constant and the reliance on MCMC. However, a probabilistic model appears to be the only coherent way to define an associative memory, where the conditional distributions given different pieces of partial information are always consistent with each other. As to the difficulty with MCMC for sampling the descriptor net, the generator net can actually be of great help, as we have explored in this paper.

There are other probabilistic models that do not belong to, but are related to, the above two classes of models. For example, there are energy-based models that involve latent variables, such as the restricted Boltzmann machine and its generalizations (Hinton et al., 2006; Salakhutdinov & Hinton, 2009; Lee et al., 2009). These models are undirected. There are also models where the latent variables and the signals are of the same dimension and can be transformed into each other, so that the latent variables can also be considered features (Hyvarinen et al., 2004; Dinh et al., 2016). There are also models that impose causal auto-regressive factorizations (Oord et al., 2016; Dinh et al., 2016). These models are beyond the scope of this paper.

9.2. Descriptor net

The descriptor net can be considered a multi-layer generalization of the FRAME (Filters, Random field, And Maximum Entropy) model (Zhu et al., 1997; Lu et al., 2016). The energy function of the descriptor model is

E(Y; W_D) = ‖Y‖^2/(2s^2) − f(Y; W_D).    (23)

(Xie et al., 2016) noticed that the local energy minima satisfy an auto-encoder ∂E(Y; W_D)/∂Y = 0, which is

Y/s^2 = ∂f(Y; W_D)/∂Y.    (24)
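As a numerical sanity check of (23)-(24), with a one-layer sum-of-ReLU score standing in for the ConvNet f (an illustrative assumption, not the paper's architecture), descending the energy E should drive Y to a point where the auto-encoder condition holds:

import numpy as np

rng = np.random.default_rng(2)
W, b, s2 = rng.standard_normal((8, 4)), rng.standard_normal(8), 1.0

def grad_f(Y):
    # f(Y) = sum_k ReLU(w_k . Y + b_k), so df/dY is the sum of the active filters.
    return W.T @ (W @ Y + b > 0).astype(float)

Y = rng.standard_normal(4)
for _ in range(2000):
    Y -= 0.01 * (Y / s2 - grad_f(Y))           # gradient descent on E(Y; W_D)
print(np.linalg.norm(Y / s2 - grad_f(Y)))      # ~0: the auto-encoder (24) holds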


The right-hand side involves a bottom-up pass of convolution and a top-down pass of back-propagation for computing the derivative, and this back-propagation is actually a deconvolution process (Zeiler & Fergus, 2014). This auto-encoder is curious in that back-propagation becomes representation. It also connects the Hopfield associative memory to the hierarchical distributed representation.

(Xie et al., 2016) also noticed that if the non-linearity in the bottom-up ConvNet that defines f is the rectified linear unit (ReLU), h(r) = max(0, r) = 1(r > 0)r, where 1(r > 0) is the binary activation, then P_D(Y; W_D) is piecewise Gaussian, and the modes of the Gaussian pieces are auto-encoding. Specifically, the activation pattern of all the ReLU units of the ConvNet partitions the space of Y into different pieces, so that those Y within the same piece share the same activation pattern. f(Y; W_D) is linear on each piece (Pascanu et al., 2013), so that the energy function ‖Y‖^2/(2s^2) − f(Y; W_D) is piecewise quadratic, which gives rise to the piecewise Gaussian structure.
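The exact linearity of f within one activation-pattern piece can be checked directly. The following snippet (same toy one-layer f as above) perturbs Y slightly so that the ReLU pattern is unchanged, and verifies that the change in f equals the inner product with the fixed within-piece gradient:

import numpy as np

rng = np.random.default_rng(3)
W, b = rng.standard_normal((8, 4)), rng.standard_normal(8)
f = lambda Y: np.maximum(0.0, W @ Y + b).sum()

Y = rng.standard_normal(4)
Yp = Y + 1e-4 * rng.standard_normal(4)          # small enough to stay in the piece
same_pattern = np.array_equal(W @ Y + b > 0, W @ Yp + b > 0)
grad = W.T @ (W @ Y + b > 0).astype(float)      # constant gradient on the piece
print(same_pattern, np.isclose(f(Yp) - f(Y), grad @ (Yp - Y)))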

In the Langevin equation (6), the gradient term takes the form of the reconstruction error Y/s^2 − ∂f(Y; W_D)/∂Y. If Y is a local mode of the energy function, then Y satisfies the auto-encoder Y/s^2 = ∂f(Y; W_D)/∂Y mentioned above. The Langevin revision dynamics explore these local modes by stochastically relaxing the current sample towards a local mode.
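For concreteness, here is one Langevin revision step written for the same toy one-layer descriptor as above (delta denotes the step size; the drift is the reconstruction error, following the general Langevin form rather than any particular setting from the paper's experiments):

import numpy as np

rng = np.random.default_rng(4)
W, b, s2, delta = rng.standard_normal((8, 4)), rng.standard_normal(8), 1.0, 0.05

def langevin_revision(Y):
    grad_f = W.T @ (W @ Y + b > 0).astype(float)
    drift = Y / s2 - grad_f                      # reconstruction error
    return Y - (delta**2 / 2) * drift + delta * rng.standard_normal(Y.shape)

Y = rng.standard_normal(4)
for _ in range(100):
    Y = langevin_revision(Y)                     # stochastic relaxation of Y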

(Xie et al., 2016) derived the descriptor net from the discriminator net, where f(Y; W_D) is the score for soft-max or multinomial logistic classification, and the reference distribution q(Y) plays the role of the negative or baseline category.

9.3. Generator net

The generator network can be viewed as a multi-layer recursion of factor analysis. (Han et al., 2016) noted that if the non-linearity in the top-down ConvNet that defines g(X; W_G) is ReLU, then the model becomes piecewise linear factor analysis. Specifically, the activation pattern of all the ReLU units of the ConvNet partitions the factor space of X into different pieces, so that those X within the same piece share the same activation pattern. Then g(X; W_G) is linear on each piece (Pascanu et al., 2013), so that the model is piecewise linear factor analysis, and P_G(Y; W_G) is essentially piecewise Gaussian.

We can compare the descriptor net P_D(Y; W_D) with the generator net P_G(Y; W_G) = ∫ P_G(X, Y; W_G) dX, where the integral can be approximated by the Laplace approximation around the posterior mode X = X(Y; W_G) that maximizes P_G(X|Y, W_G). Then we may compare the energy ‖Y‖^2/(2s^2) − f(Y; W_D) of the descriptor with ‖Y − g(X; W_G)‖^2/(2σ^2) + ‖X‖^2/2 in the Laplace approximation of the generator. X = X(Y; W_G) can be considered an encoder or recognizer for inferring X. Such an encoder or recognizer net has been learned in the wake-sleep algorithm (Hinton et al., 1995) and the variational auto-encoder (Kingma & Welling, 2014; Rezende et al., 2014; Mnih & Gregor, 2014). Recall that f(Y; W_D) can be used as the score for a soft-max discriminator. Thus the four nets of discriminator, descriptor, generator, and recognizer are connected.

To understand how Algorithm G or its EM idealization maps the prior distribution of the latent factors P_G(X) to the data distribution P_data by the learned g(X; W_G), let us define P_data(X, Y; W_G) = P_data(Y) P_G(X|Y, W_G) = P_data(X; W_G) P_data(Y|X, W_G), where P_data(X; W_G) = ∫ P_G(X|Y, W_G) P_data(Y) dY is obtained by averaging the posteriors P_G(X|Y, W_G) over the observed data Y ∼ P_data. That is, P_data(X; W_G) can be considered the data prior. The data prior P_data(X; W_G) is close to the true prior P_G(X) in the sense that KL(P_data(X; W_G) | P_G(X)) ≤ KL(P_data(Y) | P_G(Y; W_G)), with the right-hand side minimized at the maximum likelihood estimate of W_G; hence the data prior at the maximum likelihood estimate should be especially close to the true prior P_G(X). In other words, at the maximum likelihood estimate, the posteriors P_G(X|Y, W_G) of all the data points Y ∼ P_data tend to pave the whole prior P_G(X). Based on (9) and (11), in the M-step of EM, for each data point Y, W_G seeks to reconstruct Y by g(X; W_G) from the inferred latent factors X ∼ P_G(X|Y, W_G). In other words, the M-step seeks to map all those X ∼ P_G(X|Y, W_G) to Y. Since the posteriors P_G(X|Y, W_G) of all Y tend to pave the prior P_G(X), overall, the M-step seeks to map the prior P_G(X) to the whole training data set P_data. Of course the mapping from all those X ∼ P_G(X|Y, W_G) to the observed Y cannot be exact. In fact, g(X; W_G) maps P_G(X|Y, W_G) to a d-dimensional patch around the D-dimensional Y. The local patches of all the observed examples patch up the d-dimensional manifold formed by the D-dimensional observed examples and their interpolations. The EM algorithm is a process of density shifting, so that the mapping of P_data(X; W_G) by g(X; W_G) shifts towards P_data.
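One way to see the KL inequality above is through the chain rule for the Kullback-Leibler divergence: the two joint distributions defined above share the conditional P_G(X|Y, W_G), so the joint divergence collapses to the divergence between the Y-marginals, while marginalizing to X can only decrease it:

\mathrm{KL}\big(P_{\mathrm{data}}(X; W_G) \,\|\, P_G(X)\big)
\;\le\;
\mathrm{KL}\big(P_{\mathrm{data}}(X, Y; W_G) \,\|\, P_G(X, Y; W_G)\big)
\;=\;
\mathrm{KL}\big(P_{\mathrm{data}}(Y) \,\|\, P_G(Y; W_G)\big),

where the equality holds because the shared conditional P_G(X|Y, W_G) cancels inside the log-ratio, and the inequality is the standard marginalization (data-processing) property of KL.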

In the CoopNets algorithm, we avoid dealing with the posterior distributions P_G(X|Y; W_G) of the observed examples and the data prior P_data(X; W_G). Instead of using P_G(X|Y; W_G) of the observed Y to pave the prior, we generate directly from the prior distribution P_G(X). The density shifting is accomplished by tracking the Langevin revisions made by the descriptor.

10. Appendix 4: ConvNet and Alternating back-propagation

Both the descriptor net and the generator net are parametrized by ConvNets. In general, an L-layer network Y = f(X; W) can be written recursively as

X^(l) = f_l(W_l X^(l−1) + b_l),    (25)

where l = 1, ..., L, X^(0) = X, X^(L) = f(X; W), and W = (W_l, b_l, l = 1, ..., L). f_l is an element-wise non-linearity.
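A direct transcription of (25) for a small fully-connected network (convolution is a special case of the linear map W_l; taking f_l to be ReLU at every layer and these particular widths is an arbitrary choice for illustration):

import numpy as np

rng = np.random.default_rng(5)
sizes = [10, 8, 6, 1]                           # layer widths, so L = 3
Ws = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [rng.standard_normal(m) for m in sizes[1:]]

def forward(X):
    Xl = X
    for Wl, bl in zip(Ws, bs):
        Xl = np.maximum(0.0, Wl @ Xl + bl)      # X^(l) = f_l(W_l X^(l-1) + b_l)
    return Xl                                   # X^(L) = f(X; W)

print(forward(rng.standard_normal(10)))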

The alternating back-propagation algorithms for both the descriptor and the generator involve computing ∂f(X; W)/∂W for updating W, and ∂f(X; W)/∂X for Langevin sampling of X. Both derivatives can be computed by the chain-rule back-propagation, and they share the computation of

∂X^(l)/∂X^(l−1) = f_l'(W_l X^(l−1) + b_l) W_l    (26)

in the chain rule, where f_l' is the derivative matrix of f_l. Because f_l is element-wise, f_l' is a diagonal matrix whose diagonal elements are the element-wise derivatives. The computations ∂X^(l)/∂X^(l−1) for l = 1, ..., L give us ∂f/∂X = ∂X^(L)/∂X^(0). In addition, ∂f/∂W_l = ∂X^(L)/∂X^(l) × ∂X^(l)/∂W_l. Therefore the computations of ∂f/∂X are a subset of the computations of ∂f/∂W. In other words, in terms of coding, the back-propagation for Langevin dynamics is a by-product of the back-propagation for parameter updating. The two alternating back-propagation algorithms for the two nets are thus very similar to each other in code.
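The sharing is visible in code: a single backward pass over the recursion (25) produces both ∂f/∂X and all the ∂f/∂W_l, reusing the per-layer factors in (26). A minimal sketch for the toy network above (ReLU layers, output summed to a scalar f):

import numpy as np

rng = np.random.default_rng(6)
sizes = [10, 8, 6, 1]
Ws = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [rng.standard_normal(m) for m in sizes[1:]]

def forward_backward(X):
    xs, zs = [X], []
    for Wl, bl in zip(Ws, bs):                  # bottom-up pass, storing states
        zs.append(Wl @ xs[-1] + bl)
        xs.append(np.maximum(0.0, zs[-1]))
    f = xs[-1].sum()
    d = np.ones_like(xs[-1])                    # df/dX^(L)
    dWs = []
    for Wl, zl, xprev in zip(Ws[::-1], zs[::-1], xs[-2::-1]):
        g = d * (zl > 0)                        # apply f_l' (diagonal for ReLU)
        dWs.append(np.outer(g, xprev))          # df/dW_l, needs the same g
        d = Wl.T @ g                            # df/dX^(l-1): shared factor (26)
    return f, d, dWs[::-1]                      # d is now df/dX

f, dX, dWs = forward_backward(rng.standard_normal(10))
print(dX.shape, [dW.shape for dW in dWs])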

For the ReLU non-linearity, f_l' = diag(1(W_l X^(l−1) + b_l > 0)) = δ_l, where diag(v) makes the vector v into a diagonal matrix, and for a vector v, 1(v > 0) denotes a binary vector whose elements indicate whether the corresponding elements of v are greater than 0. δ_l is the activation pattern of the ReLU units at layer l.
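In numpy, δ_l and the per-layer factor in (26) are one line each (toy shapes chosen arbitrarily):

import numpy as np

rng = np.random.default_rng(7)
Wl, bl = rng.standard_normal((5, 3)), rng.standard_normal(5)
x = rng.standard_normal(3)                           # X^(l-1)
delta_l = np.diag((Wl @ x + bl > 0).astype(float))   # activation pattern as diag
print(delta_l @ Wl)                                  # f_l'(W_l x + b_l) W_l in (26)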

References

Amari, Shun-ichi and Nagaoka, Hiroshi. Methods of Information Geometry, volume 191. American Mathematical Society, 2007.

Blum, Avrim and Mitchell, Tom. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pp. 92-100. ACM, 1998.

Cover, Thomas M and Thomas, Joy A. Elements of Information Theory. John Wiley & Sons, 2012.

Dai, Jifeng, Lu, Yang, and Wu, Ying Nian. Generative modeling of convolutional neural networks. In ICLR, 2015.

Dempster, Arthur P, Laird, Nan M, and Rubin, Donald B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pp. 1-38, 1977.

Denton, Emily L, Chintala, Soumith, Fergus, Rob, et al. Deep generative image models using a Laplacian pyramid of adversarial networks. In Advances in Neural Information Processing Systems, pp. 1486-1494, 2015.

Dinh, Laurent, Sohl-Dickstein, Jascha, and Bengio, Samy. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.

Dosovitskiy, A., Springenberg, J. T., and Brox, T. Learning to generate chairs with convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

Girolami, Mark and Calderhead, Ben. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123-214, 2011.

Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672-2680, 2014.

Grenander, Ulf and Miller, Michael I. Pattern Theory: From Representation to Inference. Oxford University Press, 2007.

Guo, Cheng-En, Zhu, Song-Chun, and Wu, Ying Nian. Modeling visual patterns by integrating descriptive and generative methods. International Journal of Computer Vision, 53(1):5-29, 2003.

Han, Tian, Lu, Yang, Zhu, Song-Chun, and Wu, Ying Nian. Alternating back-propagation for generator network. arXiv preprint arXiv:1606.08571, 2016.

Hinton, Geoffrey, Vinyals, Oriol, and Dean, Jeff. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

Hinton, Geoffrey E. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771-1800, 2002.

Hinton, Geoffrey E, Dayan, Peter, Frey, Brendan J, and Neal, Radford M. The "wake-sleep" algorithm for unsupervised neural networks. Science, 268(5214):1158-1161, 1995.

Hinton, Geoffrey E, Osindero, Simon, and Teh, Yee-Whye. A fast learning algorithm for deep belief nets. Neural Computation, 18:1527-1554, 2006.

Hopfield, John J. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79(8):2554-2558, 1982.


Hyvarinen, Aapo, Karhunen, Juha, and Oja, Erkki. Independent Component Analysis, volume 46. John Wiley & Sons, 2004.

Ioffe, Sergey and Szegedy, Christian. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

Kim, Taesup and Bengio, Yoshua. Deep directed generative models with energy-based probability estimation. arXiv preprint arXiv:1606.03439, 2016.

Kingma, Diederik P. and Welling, Max. Auto-encoding variational Bayes. In ICLR, 2014.

Kirkpatrick, Scott, Gelatt, C Daniel, and Vecchi, Mario P. Optimization by simulated annealing. Science, 220(4598):671-680, 1983.

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. ImageNet classification with deep convolutional neural networks. In NIPS, pp. 1097-1105, 2012.

LeCun, Yann, Bottou, Leon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

LeCun, Yann, Chopra, Sumit, Hadsell, Raia, Ranzato, Marc'Aurelio, and Huang, Fu Jie. A tutorial on energy-based learning. In Predicting Structured Data. MIT Press, 2006.

Lee, Honglak, Grosse, Roger, Ranganath, Rajesh, and Ng, Andrew Y. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, pp. 609-616. ACM, 2009.

Liu, Ziwei, Luo, Ping, Wang, Xiaogang, and Tang, Xiaoou. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3730-3738, 2015.

Lu, Yang, Zhu, Song-Chun, and Wu, Ying Nian. Learning FRAME models using CNN filters. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.

Marinari, Enzo and Parisi, Giorgio. Simulated tempering: a new Monte Carlo scheme. EPL (Europhysics Letters), 19(6):451, 1992.

Mnih, Andriy and Gregor, Karol. Neural variational inference and learning in belief networks. In ICML, 2014.

Neal, Radford M. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2, 2011.

Ngiam, Jiquan, Chen, Zhenghao, Koh, Pang Wei, and Ng, Andrew Y. Learning deep energy models. In International Conference on Machine Learning, 2011.

Olshausen, Bruno A and Field, David J. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311-3325, 1997.

Oord, Aaron van den, Kalchbrenner, Nal, and Kavukcuoglu, Koray. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.

Pascanu, Razvan, Montufar, Guido, and Bengio, Yoshua. On the number of response regions of deep feed forward networks with piece-wise linear activations. arXiv preprint arXiv:1312.6098, 2013.

Radford, Alec, Metz, Luke, and Chintala, Soumith. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

Rezende, Danilo J., Mohamed, Shakir, and Wierstra, Daan. Stochastic backpropagation and approximate inference in deep generative models. In ICML, pp. 1278-1286, 2014.

Robbins, Herbert and Monro, Sutton. A stochastic approximation method. The Annals of Mathematical Statistics, pp. 400-407, 1951.

Roth, Stefan and Black, Michael J. Fields of experts: A framework for learning image priors. In CVPR, volume 2, pp. 860-867. IEEE, 2005.

Rubin, Donald B and Thayer, Dorothy T. EM algorithms for ML factor analysis. Psychometrika, 47(1):69-76, 1982.

Salakhutdinov, Ruslan and Hinton, Geoffrey E. Deep Boltzmann machines. In AISTATS, 2009.

Teh, Yee Whye, Welling, Max, Osindero, Simon, and Hinton, Geoffrey E. Energy-based models for sparse overcomplete representations. Journal of Machine Learning Research, 4(Dec):1235-1260, 2003.

Thibodeau-Laufer, Eric, Alain, Guillaume, and Yosinski, Jason. Deep generative stochastic networks trainable by backprop. In ICML, 2014.

Tu, Zhuowen. Learning generative models via discriminative approaches. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-8. IEEE, 2007.

Vedaldi, A. and Lenc, K. MatConvNet - convolutional neural networks for MATLAB. In Proceedings of the ACM International Conference on Multimedia, 2015.

Welling, Max, Zemel, Richard S, and Hinton, Geoffrey E. Self supervised boosting. In Advances in Neural Information Processing Systems, pp. 665-672, 2002.


Wu, Ying Nian, Zhu, Song-Chun, and Guo, Cheng-En. From information scaling of natural images to regimes of statistical models. Quarterly of Applied Mathematics, 66:81-122, 2008.

Xie, Jianwen, Hu, Wenze, Zhu, Song-Chun, and Wu, Ying Nian. Learning sparse FRAME models for natural image patterns. International Journal of Computer Vision, pp. 1-22, 2014.

Xie, Jianwen, Lu, Yang, Zhu, Song-Chun, and Wu, Ying Nian. A theory of generative ConvNet. In ICML, 2016.

Younes, Laurent. On the convergence of Markovian stochastic algorithms with rapidly decreasing ergodicity rates. Stochastics: An International Journal of Probability and Stochastic Processes, 65(3-4):177-228, 1999.

Zeiler, Matthew D and Fergus, Rob. Visualizing and understanding convolutional neural networks. In ECCV, 2014.

Zhu, Song-Chun. Statistical modeling and conceptualization of visual patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(6):691-712, 2003.

Zhu, Song-Chun, Wu, Ying Nian, and Mumford, David. Minimax entropy principle and its application to texture modeling. Neural Computation, 9(8):1627-1660, 1997.
