
Autoencoders
Lecture slides for Chapter 14 of Deep Learning

www.deeplearningbook.org Ian Goodfellow

2016-09-30


Structure of an Autoencoder


…to the activations on the reconstructed input. Recirculation is regarded as more biologically plausible than back-propagation, but is rarely used for machine learning applications.

[Diagram: x → f → h → g → r]

Figure 14.1: The general structure of an autoencoder, mapping an input x to an output (called reconstruction) r through an internal representation or code h. The autoencoder has two components: the encoder f (mapping x to h) and the decoder g (mapping h to r).

14.1 Undercomplete Autoencoders

Copying the input to the output may sound useless, but we are typically not interested in the output of the decoder. Instead, we hope that training the autoencoder to perform the input copying task will result in h taking on useful properties.

One way to obtain useful features from the autoencoder is to constrain h to have smaller dimension than x. An autoencoder whose code dimension is less than the input dimension is called undercomplete. Learning an undercomplete representation forces the autoencoder to capture the most salient features of the training data.

The learning process is described simply as minimizing a loss function

L(x, g(f(x))) (14.1)

where L is a loss function penalizing g(f(x)) for being dissimilar from x, such as the mean squared error.
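
As an illustration (not from the slides), a minimal PyTorch sketch of an undercomplete autoencoder trained with this loss; the layer sizes, optimizer settings, and the random stand-in minibatch are assumptions for the example.

    import torch
    import torch.nn as nn

    # Minimal undercomplete autoencoder: code dimension (32) < input dimension (784).
    class Autoencoder(nn.Module):
        def __init__(self, input_dim=784, code_dim=32):
            super().__init__()
            # Encoder f: maps x to the code h.
            self.f = nn.Sequential(nn.Linear(input_dim, code_dim), nn.ReLU())
            # Decoder g: maps h back to a reconstruction r.
            self.g = nn.Linear(code_dim, input_dim)

        def forward(self, x):
            h = self.f(x)
            return self.g(h)

    model = Autoencoder()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()          # L(x, g(f(x))), equation 14.1 with squared error

    x = torch.rand(64, 784)         # stand-in minibatch; in practice, real data
    for _ in range(10):
        r = model(x)
        loss = loss_fn(r, x)        # penalize g(f(x)) for being dissimilar from x
        opt.zero_grad()
        loss.backward()
        opt.step()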

When the decoder is linear and L is the mean squared error, an undercomplete autoencoder learns to span the same subspace as PCA. In this case, an autoencoder trained to perform the copying task has learned the principal subspace of the training data as a side-effect.
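
A small numerical check of this claim (again an illustration, not from the slides): train a purely linear autoencoder with mean squared error and compare the subspace spanned by its decoder with the top principal directions of the data. The dataset, dimensions, and training settings are arbitrary choices.

    import torch

    torch.manual_seed(0)
    n, d, k = 2000, 10, 3
    # Synthetic data whose variance is concentrated in a few directions.
    basis = torch.linalg.qr(torch.randn(d, d)).Q
    scales = torch.tensor([5.0, 3.0, 2.0, 0.3, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1])
    X = (torch.randn(n, d) * scales) @ basis.T
    X = X - X.mean(0)

    # Linear autoencoder: both f and g are linear, code dimension k < d.
    W_enc = torch.randn(d, k, requires_grad=True)
    W_dec = torch.randn(k, d, requires_grad=True)
    opt = torch.optim.Adam([W_enc, W_dec], lr=1e-2)
    for _ in range(3000):
        r = (X @ W_enc) @ W_dec
        loss = ((r - X) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()

    # Compare the span of the decoder rows with the top-k principal directions.
    def proj(A):                       # orthogonal projector onto the column span of A
        Q, _ = torch.linalg.qr(A)
        return Q @ Q.T

    V = torch.linalg.svd(X, full_matrices=False).Vh[:k].T   # PCA directions
    P_pca, P_ae = proj(V), proj(W_dec.detach().T)
    print(torch.norm(P_pca - P_ae))    # should be small: same subspace as PCA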

Autoencoders with nonlinear encoder functions f and nonlinear decoder functions g can thus learn a more powerful nonlinear generalization of PCA. Unfortunately, if the encoder and decoder are allowed too much capacity, the autoencoder can learn to perform the copying task without extracting useful information about the distribution of the data.


Figure 14.1

Input → Hidden layer (code) → Reconstruction


Stochastic Autoencoders


Typically, the output variables are treated as being conditionally independent given h so that this probability distribution is inexpensive to evaluate, but some techniques such as mixture density outputs allow tractable modeling of outputs with correlations.

[Diagram: x → p_encoder(h | x) → h → p_decoder(x | h) → r]

Figure 14.2: The structure of a stochastic autoencoder, in which both the encoder and the decoder are not simple functions but instead involve some noise injection, meaning that their output can be seen as sampled from a distribution: p_encoder(h | x) for the encoder and p_decoder(x | h) for the decoder.

To make a more radical departure from the feedforward networks we have seen previously, we can also generalize the notion of an encoding function f(x) to an encoding distribution p_encoder(h | x), as illustrated in figure 14.2. Any latent variable model p_model(h, x) defines a stochastic encoder

p_encoder(h | x) = p_model(h | x)   (14.12)

and a stochastic decoder

p_decoder(x | h) = p_model(x | h).   (14.13)

In general, the encoder and decoder distributions are not necessarily conditional distributions compatible with a unique joint distribution p_model(x, h). Alain et al. (2015) showed that training the encoder and decoder as a denoising autoencoder will tend to make them compatible asymptotically (with enough capacity and examples).
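
One common way to realize such a stochastic encoder and decoder is to let a network emit the parameters of a diagonal Gaussian for each conditional; the sketch below uses that choice, which is an assumption for illustration rather than something specified on this slide.

    import torch
    import torch.nn as nn

    class StochasticAutoencoder(nn.Module):
        """p_encoder(h | x) and p_decoder(x | h) as diagonal Gaussians (an illustrative choice)."""
        def __init__(self, d=784, k=32):
            super().__init__()
            self.enc_mu = nn.Linear(d, k)
            self.enc_logvar = nn.Linear(d, k)
            self.dec_mu = nn.Linear(k, d)   # decoder mean; fixed unit variance for simplicity

        def encode(self, x):
            mu, logvar = self.enc_mu(x), self.enc_logvar(x)
            std = torch.exp(0.5 * logvar)
            return mu + std * torch.randn_like(std)      # sample h ~ p_encoder(h | x)

        def decode_logprob(self, x, h):
            mu = self.dec_mu(h)                           # mean of p_decoder(x | h)
            return -0.5 * ((x - mu) ** 2).sum(dim=1)      # log-density up to a constant

    model = StochasticAutoencoder()
    x = torch.rand(8, 784)
    h = model.encode(x)
    print(model.decode_logprob(x, h).shape)               # one log-probability per example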

14.5 Denoising Autoencoders

The denoising autoencoder (DAE) is an autoencoder that receives a corrupted data point as input and is trained to predict the original, uncorrupted data point as its output.

The DAE training procedure is illustrated in figure 14.3. We introduce a corruption process C(x̃ | x), which represents a conditional distribution over corrupted samples x̃, given a data sample x.

Figure 14.2


Avoiding Trivial Identity

• Undercomplete autoencoders

• h has lower dimension than x

• f or g has low capacity (e.g., linear g)

• Must discard some information in h

• Overcomplete autoencoders

• h has higher dimension than x

• Must be regularized


Regularized Autoencoders

• Sparse autoencoders

• Denoising autoencoders

• Autoencoders with dropout on the hidden layer (see the sketch after this list)

• Contractive autoencoders
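
A sketch of the dropout option above: dropout applied to the code layer of an overcomplete autoencoder. The sizes and dropout rate are illustrative assumptions, not values from the slides.

    import torch
    import torch.nn as nn

    # Overcomplete autoencoder regularized by dropout on the code h.
    encoder = nn.Sequential(nn.Linear(784, 1024), nn.ReLU())
    code_dropout = nn.Dropout(p=0.5)      # randomly zeroes code units during training
    decoder = nn.Linear(1024, 784)

    def reconstruct(x):
        return decoder(code_dropout(encoder(x)))

    x = torch.rand(16, 784)               # stand-in minibatch
    r = reconstruct(x)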


Sparse Autoencoders

• Limit capacity of the autoencoder by adding a term to the cost function penalizing the code for being large (see the sketch after this list)

• Special case of variational autoencoder

• Probabilistic model

• Laplace prior corresponds to L1 sparsity penalty

• Dirac variational posterior
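
A minimal sketch of the penalty described in the first bullet, using an L1 term on the code of an overcomplete autoencoder; the sizes and the penalty weight are illustrative assumptions.

    import torch
    import torch.nn as nn

    d, k, lam = 784, 1024, 1e-3                     # overcomplete code (k > d), L1 weight lam
    f = nn.Sequential(nn.Linear(d, k), nn.ReLU())   # encoder
    g = nn.Linear(k, d)                             # decoder
    opt = torch.optim.Adam(list(f.parameters()) + list(g.parameters()), lr=1e-3)

    x = torch.rand(64, d)                           # stand-in minibatch
    for _ in range(10):
        h = f(x)
        loss = ((g(h) - x) ** 2).mean() + lam * h.abs().mean()   # reconstruction + L1 on code
        opt.zero_grad(); loss.backward(); opt.step()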


Denoising Autoencoder

[Diagram: x → C(x̃ | x) → x̃ → f → h → g → L]

Figure 14.3: The computational graph of the cost function for a denoising autoencoder, which is trained to reconstruct the clean data point x from its corrupted version x̃. This is accomplished by minimizing the loss L = −log p_decoder(x | h = f(x̃)), where x̃ is a corrupted version of the data example x, obtained through a given corruption process C(x̃ | x). Typically the distribution p_decoder is a factorial distribution whose mean parameters are emitted by a feedforward network g.

We introduce a corruption process C(x̃ | x), which represents a conditional distribution over corrupted samples x̃, given a data sample x. The autoencoder then learns a reconstruction distribution p_reconstruct(x | x̃) estimated from training pairs (x, x̃), as follows:

1. Sample a training example x from the training data.

2. Sample a corrupted version x̃ from C(x̃ | x = x).

3. Use (x, x̃) as a training example for estimating the autoencoder reconstruction distribution p_reconstruct(x | x̃) = p_decoder(x | h), with h the output of encoder f(x̃) and p_decoder typically defined by a decoder g(h).

Typically we can simply perform gradient-based approximate minimization (such as minibatch gradient descent) on the negative log-likelihood −log p_decoder(x | h). So long as the encoder is deterministic, the denoising autoencoder is a feedforward network and may be trained with exactly the same techniques as any other feedforward network.

We can therefore view the DAE as performing stochastic gradient descent on the following expectation:

−E_{x ∼ p̂_data(x)} E_{x̃ ∼ C(x̃ | x)} log p_decoder(x | h = f(x̃))   (14.14)

where p̂_data(x) is the training distribution.
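
A sketch of this training loop, assuming Gaussian corruption and a fixed-variance Gaussian p_decoder, so that the negative log-likelihood reduces to squared error; the architecture and noise level are illustrative assumptions.

    import torch
    import torch.nn as nn

    f = nn.Sequential(nn.Linear(784, 128), nn.ReLU())    # encoder
    g = nn.Linear(128, 784)                               # decoder
    opt = torch.optim.Adam(list(f.parameters()) + list(g.parameters()), lr=1e-3)
    sigma = 0.3                                           # corruption noise level

    def corrupt(x):
        return x + sigma * torch.randn_like(x)            # sample x_tilde ~ C(x_tilde | x)

    x = torch.rand(64, 784)                               # stand-in for samples of p_data
    for _ in range(10):
        x_tilde = corrupt(x)                              # step 2: corrupt the clean example
        r = g(f(x_tilde))                                 # reconstruct from the corrupted input
        loss = ((r - x) ** 2).mean()                      # -log p_decoder(x | h) for Gaussian p_decoder
        opt.zero_grad(); loss.backward(); opt.step()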


Figure 14.3

C: corruption process (introduce noise)



Denoising Autoencoders Learn a Manifold


[Diagram: the corruption C(x̃ | x) maps x to x̃; the reconstruction g ∘ f maps x̃ back towards the manifold]

Figure 14.4: A denoising autoencoder is trained to map a corrupted data point x̃ back to the original data point x. We illustrate training examples x as red crosses lying near a low-dimensional manifold illustrated with the bold black line. We illustrate the corruption process C(x̃ | x) with a gray circle of equiprobable corruptions. A gray arrow demonstrates how one training example is transformed into one sample from this corruption process. When the denoising autoencoder is trained to minimize the average of squared errors ||g(f(x̃)) − x||², the reconstruction g(f(x̃)) estimates E_{x, x̃ ∼ p_data(x) C(x̃ | x)}[x | x̃]. The vector g(f(x̃)) − x̃ points approximately towards the nearest point on the manifold, since g(f(x̃)) estimates the center of mass of the clean points x which could have given rise to x̃. The autoencoder thus learns a vector field g(f(x)) − x, indicated by the green arrows. This vector field estimates the score ∇_x log p_data(x) up to a multiplicative factor that is the average root mean square reconstruction error.

Figure 14.4


Score Matching

• Score: ∇_x log p(x)

• Fit a density model by matching score of model to score of data

• Some denoising autoencoders are equivalent to score matching applied to some density models


14.5.1 Estimating the Score

Score matching (Hyvärinen, 2005) is an alternative to maximum likelihood. It provides a consistent estimator of probability distributions based on encouraging the model to have the same score as the data distribution at every training point x. In this context, the score is a particular gradient field:

∇_x log p(x).   (14.15)

Score matching is discussed further in section 18.4. For the present discussion regarding autoencoders, it is sufficient to understand that learning the gradient field of log p_data is one way to learn the structure of p_data itself.

A very important property of DAEs is that their training criterion (with conditionally Gaussian p(x | h)) makes the autoencoder learn a vector field (g(f(x)) − x) that estimates the score of the data distribution. This is illustrated in figure 14.4.

Denoising training of a specific kind of autoencoder (sigmoidal hidden units, linear reconstruction units) using Gaussian noise and mean squared error as the reconstruction cost is equivalent (Vincent, 2011) to training a specific kind of undirected probabilistic model called an RBM with Gaussian visible units. This kind of model will be described in detail in section 20.5.1; for the present discussion it suffices to know that it is a model that provides an explicit p_model(x; θ). When the RBM is trained using denoising score matching (Kingma and LeCun, 2010), its learning algorithm is equivalent to denoising training in the corresponding autoencoder. With a fixed noise level, regularized score matching is not a consistent estimator; it instead recovers a blurred version of the distribution. However, if the noise level is chosen to approach 0 when the number of examples approaches infinity, then consistency is recovered. Denoising score matching is discussed in more detail in section 18.5.

Other connections between autoencoders and RBMs exist. Score matching applied to RBMs yields a cost function that is identical to reconstruction error combined with a regularization term similar to the contractive penalty of the CAE (Swersky et al., 2011). Bengio and Delalleau (2009) showed that an autoencoder gradient provides an approximation to contrastive divergence training of RBMs.

For continuous-valued x, the denoising criterion with Gaussian corruption and reconstruction distribution yields an estimator of the score that is applicable to general encoder and decoder parametrizations (Alain and Bengio, 2013). This means a generic encoder-decoder architecture may be made to estimate the score



Vector Field Learned by a Denoising Autoencoder


by training with the squared error criterion

||g(f(x̃)) − x||²   (14.16)

and corruption

C(x̃ | x) = N(x̃; μ = x, Σ = σ²I)   (14.17)

with noise variance σ². See figure 14.5 for an illustration of how this works.
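
A toy numerical illustration of this relationship (not from the slides): for 1-D standard-normal data with Gaussian corruption, the optimal denoiser is linear and the score of the corrupted distribution is known in closed form, so the learned vector field can be compared with it directly. All numbers here are arbitrary choices for the example.

    import numpy as np

    rng = np.random.default_rng(0)
    sigma = 0.5
    x = rng.standard_normal(200_000)                      # data ~ N(0, 1)
    x_tilde = x + sigma * rng.standard_normal(x.size)     # corruption N(x_tilde; x, sigma^2)

    # Fit the denoiser r(x_tilde) = w * x_tilde by minimizing ||r(x_tilde) - x||^2 (eq. 14.16).
    w = np.sum(x * x_tilde) / np.sum(x_tilde ** 2)        # least-squares solution

    # The learned vector field (r - x_tilde) / sigma^2 should approximate the score
    # of the corrupted data distribution N(0, 1 + sigma^2), i.e. -x_tilde / (1 + sigma^2).
    pts = np.array([-2.0, -0.5, 0.0, 1.0, 2.5])
    learned = (w * pts - pts) / sigma ** 2
    analytic = -pts / (1.0 + sigma ** 2)
    print(np.round(learned, 3))
    print(np.round(analytic, 3))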

Figure 14.5: Vector field learned by a denoising autoencoder around a 1-D curved manifold near which the data concentrates in a 2-D space. Each arrow is proportional to the reconstruction minus input vector of the autoencoder and points towards higher probability according to the implicitly estimated probability distribution. The vector field has zeros at both maxima of the estimated density function (on the data manifolds) and at minima of that density function. For example, the spiral arm forms a one-dimensional manifold of local maxima that are connected to each other. Local minima appear near the middle of the gap between two arms. When the norm of reconstruction error (shown by the length of the arrows) is large, it means that probability can be significantly increased by moving in the direction of the arrow, and that is mostly the case in places of low probability. The autoencoder maps these low probability points to higher probability reconstructions. Where probability is maximal, the arrows shrink because the reconstruction becomes more accurate. Figure reproduced with permission from Alain and Bengio (2013).

In general, there is no guarantee that the reconstruction g(f(x)) minus the input x corresponds to the gradient of any function, let alone to the score. That is…


Figure 14.5


Tangent Hyperplane of a Manifold


Figure 14.6: An illustration of the concept of a tangent hyperplane. Here we create a one-dimensional manifold in 784-dimensional space. We take an MNIST image with 784 pixels and transform it by translating it vertically. The amount of vertical translation defines a coordinate along a one-dimensional manifold that traces out a curved path through image space. This plot shows a few points along this manifold. For visualization, we have projected the manifold into two-dimensional space using PCA. An n-dimensional manifold has an n-dimensional tangent plane at every point. This tangent plane touches the manifold exactly at that point and is oriented parallel to the surface at that point. It defines the space of directions in which it is possible to move while remaining on the manifold. This one-dimensional manifold has a single tangent line. We indicate an example tangent line at one point, with an image showing how this tangent direction appears in image space. Gray pixels indicate pixels that do not change as we move along the tangent line, white pixels indicate pixels that brighten, and black pixels indicate pixels that darken.
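
A sketch of the construction described in this caption, using a small synthetic 28x28 image instead of an MNIST digit; the image, the range of shifts, and the use of a cyclic shift are stand-ins for illustration.

    import numpy as np

    # A synthetic 28x28 "image": a bright horizontal bar.
    img = np.zeros((28, 28))
    img[10:14, 6:22] = 1.0

    # Translate it vertically (cyclic shift); each shift is one point on a 1-D manifold in R^784.
    shifts = range(-8, 9)
    points = np.stack([np.roll(img, s, axis=0).ravel() for s in shifts])   # (17, 784)

    # Project the manifold to 2-D with PCA for visualization.
    centered = points - points.mean(axis=0)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    coords_2d = centered @ Vt[:2].T                      # (17, 2)

    # A finite-difference tangent vector at the middle point of the manifold,
    # analogous to the tangent-line image in figure 14.6.
    mid = len(points) // 2
    tangent = (points[mid + 1] - points[mid - 1]) / 2.0
    print(coords_2d.shape, tangent.reshape(28, 28).shape)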


Figure 14.6


Learning a Collection of 0-D Manifolds by Resisting Perturbation

[Figure 14.7 plot: reconstruction r(x) versus input x, with data points x0, x1, x2; the dashed line is the identity function and the solid curve is the optimal reconstruction]

Figure 14.7: If the autoencoder learns a reconstruction function that is invariant to small perturbations near the data points, it captures the manifold structure of the data. Here the manifold structure is a collection of 0-dimensional manifolds. The dashed diagonal line indicates the identity function target for reconstruction. The optimal reconstruction function crosses the identity function wherever there is a data point. The horizontal arrows at the bottom of the plot indicate the r(x) − x reconstruction direction vector at the base of the arrow, in input space, always pointing towards the nearest "manifold" (a single data point, in the 1-D case). The denoising autoencoder explicitly tries to make the derivative of the reconstruction function r(x) small around the data points. The contractive autoencoder does the same for the encoder. Although the derivative of r(x) is asked to be small around the data points, it can be large between the data points. The space between the data points corresponds to the region between the manifolds, where the reconstruction function must have a large derivative in order to map corrupted points back onto the manifold.

…the manifold. Such a representation for a particular example is also called its embedding. It is typically given by a low-dimensional vector, with fewer dimensions than the "ambient" space of which the manifold is a low-dimensional subset. Some algorithms (non-parametric manifold learning algorithms, discussed below) directly learn an embedding for each training example, while others learn a more general mapping, sometimes called an encoder, or representation function, that maps any point in the ambient space (the input space) to its embedding.

Manifold learning has mostly focused on unsupervised learning procedures that attempt to capture these manifolds. Most of the initial machine learning research on learning nonlinear manifolds has focused on non-parametric methods based on the nearest-neighbor graph. This graph has one node per training example and edges connecting near neighbors to each other. These methods (Schölkopf et al., 1998; Roweis and Saul, 2000; Tenenbaum et al., 2000; Brand, 2003; Belkin


Figure 14.7


Non-Parametric Manifold Learning with Nearest-Neighbor Graphs

Figure 14.8: Non-parametric manifold learning procedures build a nearest-neighbor graph in which nodes represent training examples and directed edges indicate nearest-neighbor relationships. Various procedures can thus obtain the tangent plane associated with a neighborhood of the graph, as well as a coordinate system that associates each training example with a real-valued vector position, or embedding. It is possible to generalize such a representation to new examples by a form of interpolation. So long as the number of examples is large enough to cover the curvature and twists of the manifold, these approaches work well. Images from the QMUL Multiview Face Dataset (Gong et al., 2000).

and Niyogi, 2003; Donoho and Grimes, 2003; Weinberger and Saul, 2004; Hinton and Roweis, 2003; van der Maaten and Hinton, 2008) associate each of these nodes with a tangent plane that spans the directions of variation associated with the difference vectors between the example and its neighbors, as illustrated in figure 14.8.
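
A compact sketch of this procedure: build a k-nearest-neighbor graph on toy data near a 1-D manifold and estimate a tangent direction at each example from the difference vectors to its neighbors. The dataset, k, and manifold dimension are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    # Toy data near a 1-D manifold (a noisy circle) embedded in 2-D.
    t = rng.uniform(0, 2 * np.pi, size=200)
    X = np.stack([np.cos(t), np.sin(t)], axis=1) + 0.02 * rng.standard_normal((200, 2))

    k, d_manifold = 6, 1
    # Nearest-neighbor graph: indices of the k closest other points for each example.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    neighbors = np.argsort(dists, axis=1)[:, :k]

    # Tangent plane at each example: leading directions of the neighbor difference vectors.
    tangents = []
    for i in range(len(X)):
        diffs = X[neighbors[i]] - X[i]                 # difference vectors to the neighbors
        _, _, Vt = np.linalg.svd(diffs, full_matrices=False)
        tangents.append(Vt[:d_manifold])               # top singular vectors span the tangent
    tangents = np.stack(tangents)                      # (n_examples, d_manifold, ambient_dim)
    print(tangents.shape)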

A global coordinate system can then be obtained through an optimization or by solving a linear system. Figure 14.9 illustrates how a manifold can be tiled by a large number of locally linear Gaussian-like patches (or "pancakes," because the Gaussians are flat in the tangent directions).

However, there is a fundamental difficulty with such local non-parametric approaches to manifold learning, raised in Bengio and Monperrus (2005): if the manifolds are not very smooth (they have many peaks and troughs and twists), one may need a very large number of training examples to cover each one of


Figure 14.8


Tiling a Manifold with Local Coordinate Systems

Figure 14.9: If the tangent planes (see figure 14.6) at each location are known, then they can be tiled to form a global coordinate system or a density function. Each local patch can be thought of as a local Euclidean coordinate system or as a locally flat Gaussian, or "pancake," with a very small variance in the directions orthogonal to the pancake and a very large variance in the directions defining the coordinate system on the pancake. A mixture of these Gaussians provides an estimated density function, as in the manifold Parzen window algorithm (Vincent and Bengio, 2003) or its non-local neural-net-based variant (Bengio et al., 2006c).

these variations, with no chance to generalize to unseen variations. Indeed, these methods can only generalize the shape of the manifold by interpolating between neighboring examples. Unfortunately, the manifolds involved in AI problems can have very complicated structure that can be difficult to capture from only local interpolation. Consider for example the manifold resulting from translation shown in figure 14.6. If we watch just one coordinate within the input vector, x_i, as the image is translated, we will observe that one coordinate encounters a peak or a trough in its value once for every peak or trough in brightness in the image. In other words, the complexity of the patterns of brightness in an underlying image template drives the complexity of the manifolds that are generated by performing simple image transformations. This motivates the use of distributed representations and deep learning for capturing manifold structure.


Figure 14.9


Contractive Autoencoders


14.7 Contractive Autoencoders

The contractive autoencoder (Rifai et al., 2011a,b) introduces an explicit regularizer on the code h = f(x), encouraging the derivatives of f to be as small as possible:

Ω(h) = λ ||∂f(x)/∂x||²_F.   (14.18)

The penalty Ω(h) is the squared Frobenius norm (sum of squared elements) of the Jacobian matrix of partial derivatives associated with the encoder function.
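
A sketch of computing this penalty for a single example with automatic differentiation; the encoder architecture and the value of λ are illustrative, and the per-example Jacobian is obtained with torch.autograd.functional.jacobian.

    import torch
    import torch.nn as nn

    f = nn.Sequential(nn.Linear(10, 4), nn.Sigmoid())   # encoder, code dimension 4
    lam = 0.1

    def contractive_penalty(x_single):
        # Jacobian of the encoder at one input point: shape (code_dim, input_dim).
        J = torch.autograd.functional.jacobian(f, x_single, create_graph=True)
        return lam * (J ** 2).sum()                     # lambda * squared Frobenius norm

    x = torch.rand(10)
    g = nn.Linear(4, 10)                                # decoder
    loss = ((g(f(x)) - x) ** 2).sum() + contractive_penalty(x)
    loss.backward()                                     # the penalty participates in training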

There is a connection between the denoising autoencoder and the contractive autoencoder: Alain and Bengio (2013) showed that in the limit of small Gaussian input noise, the denoising reconstruction error is equivalent to a contractive penalty on the reconstruction function that maps x to r = g(f(x)). In other words, denoising autoencoders make the reconstruction function resist small but finite-sized perturbations of the input, while contractive autoencoders make the feature extraction function resist infinitesimal perturbations of the input. When using the Jacobian-based contractive penalty to pretrain features f(x) for use with a classifier, the best classification accuracy usually results from applying the contractive penalty to f(x) rather than to g(f(x)). A contractive penalty on f(x) also has close connections to score matching, as discussed in section 14.5.1.

The name contractive arises from the way that the CAE warps space. Specifically, because the CAE is trained to resist perturbations of its input, it is encouraged to map a neighborhood of input points to a smaller neighborhood of output points. We can think of this as contracting the input neighborhood to a smaller output neighborhood.

To clarify, the CAE is contractive only locally: all perturbations of a training point x are mapped near to f(x). Globally, two different points x and x′ may be mapped to points f(x) and f(x′) that are farther apart than the original points. It is plausible that f could be expanding in-between or far from the data manifolds (see for example what happens in the 1-D toy example of figure 14.7). When the Ω(h) penalty is applied to sigmoidal units, one easy way to shrink the Jacobian is to make the sigmoid units saturate to 0 or 1. This encourages the CAE to encode input points with extreme values of the sigmoid that may be interpreted as a binary code. It also ensures that the CAE will spread its code values throughout most of the hypercube that its sigmoidal hidden units can span.

We can think of the Jacobian matrix J at a point x as approximating the nonlinear encoder f(x) as being a linear operator. This allows us to use the word "contractive" more formally. In the theory of linear operators, a linear operator…


[Figure 14.10 panel labels: input point, tangent vectors, local PCA (no sharing across regions), contractive autoencoder]

Figure 14.10: Illustration of tangent vectors of the manifold estimated by local PCA and by a contractive autoencoder. The location on the manifold is defined by the input image of a dog drawn from the CIFAR-10 dataset. The tangent vectors are estimated by the leading singular vectors of the Jacobian matrix ∂h/∂x of the input-to-code mapping. Although both local PCA and the CAE can capture local tangents, the CAE is able to form more accurate estimates from limited training data because it exploits parameter sharing across different locations that share a subset of active hidden units. The CAE tangent directions typically correspond to moving or changing parts of the object (such as the head or legs). Images reproduced with permission from Rifai et al. (2011c).

…if we do not impose some sort of scale on the decoder. For example, the encoder could consist of multiplying the input by a small constant ε and the decoder could consist of dividing the code by ε. As ε approaches 0, the encoder drives the contractive penalty Ω(h) to approach 0 without having learned anything about the distribution. Meanwhile, the decoder maintains perfect reconstruction. In Rifai et al. (2011a), this is prevented by tying the weights of f and g. Both f and g are standard neural network layers consisting of an affine transformation followed by an element-wise nonlinearity, so it is straightforward to set the weight matrix of g to be the transpose of the weight matrix of f.

14.8 Predictive Sparse Decomposition

Predictive sparse decomposition (PSD) is a model that is a hybrid of sparse coding and parametric autoencoders (Kavukcuoglu et al., 2008). A parametric encoder is trained to predict the output of iterative inference. PSD has been applied to unsupervised feature learning for object recognition in images and video (Kavukcuoglu et al., 2009, 2010; Jarrett et al., 2009; Farabet et al., 2011), as well as for audio (Henaff et al., 2011). The model consists of an encoder f(x) and a decoder g(h) that are both parametric. During training, h is controlled by the…
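
A rough sketch of a PSD-style training step, combining reconstruction error, an L1 sparsity penalty on h, and a term encouraging f(x) to predict h, with h itself optimized iteratively; the specific objective form, weights, and sizes here are assumptions for illustration rather than details given on this slide.

    import torch
    import torch.nn as nn

    d, k, lam, gamma = 64, 128, 0.5, 1.0
    f = nn.Linear(d, k)       # parametric encoder, trained to predict h
    g = nn.Linear(k, d)       # parametric decoder
    params = torch.optim.Adam(list(f.parameters()) + list(g.parameters()), lr=1e-3)

    def psd_objective(x, h):
        # Reconstruction + sparsity on h + encoder prediction term.
        return (((g(h) - x) ** 2).sum()
                + lam * h.abs().sum()
                + gamma * ((h - f(x)) ** 2).sum())

    x = torch.rand(1, d)
    # Inner loop: h is controlled by an optimizer (iterative inference over the code).
    h = torch.zeros(1, k, requires_grad=True)
    h_opt = torch.optim.SGD([h], lr=0.1)
    for _ in range(50):
        h_opt.zero_grad()
        psd_objective(x, h).backward()
        h_opt.step()

    # Outer step: update f and g with the inferred h held fixed.
    params.zero_grad()
    psd_objective(x, h.detach()).backward()
    params.step()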


Figure 14.10