
Deep learning via Hessian-free optimization

James Martens [email protected]

University of Toronto, Ontario, M5S 1A1, Canada

Abstract

We develop a 2nd-order optimization method based on the "Hessian-free" approach, and apply it to training deep auto-encoders. Without using pre-training, we obtain results superior to those reported by Hinton & Salakhutdinov (2006) on the same tasks they considered. Our method is practical, easy to use, scales nicely to very large datasets, and isn't limited in applicability to auto-encoders, or any specific model class. We also discuss the issue of "pathological curvature" as a possible explanation for the difficulty of deep learning and how 2nd-order optimization, and our method in particular, effectively deals with it.

1. Introduction

Learning the parameters of neural networks is perhaps one of the most well studied problems within the field of machine learning. Early work on backpropagation algorithms showed that the gradient of the neural net learning objective could be computed efficiently and used within a gradient-descent scheme to learn the weights of a network with multiple layers of non-linear hidden units. Unfortunately, this technique doesn't seem to generalize well to networks that have very many hidden layers (i.e. deep networks). The common experience is that gradient descent progresses extremely slowly on deep nets, seeming to halt altogether before making significant progress, resulting in poor performance on the training set (under-fitting).

It is well known within the optimization community that gradient descent is unsuitable for optimizing objectives that exhibit pathological curvature. 2nd-order optimization methods, which model the local curvature and correct for it, have been demonstrated to be quite successful on such objectives. There are even simple 2D examples such as the Rosenbrock function where these methods can demonstrate considerable advantages over gradient descent. Thus it is reasonable to suspect that the deep learning problem could be resolved by the application of such techniques. Unfortunately, there has yet to be a demonstration that any of these methods are effective on deep learning problems that are known to be difficult for gradient descent.

Appearing in Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s).

Much of the recent work on applying 2nd-order methods to learning has focused on making them practical for large datasets. This is usually attempted by adopting an "online" approach akin to the one used in stochastic gradient descent (SGD). The only demonstrated advantages of these methods over SGD are that they can sometimes converge in fewer training epochs and that they require less tweaking of meta-parameters, such as learning rate schedules.

The most important recent advance in learning for deep networks has been the development of layer-wise unsupervised pre-training methods (Hinton & Salakhutdinov, 2006; Bengio et al., 2007). Applying these methods before running SGD seems to overcome the difficulties associated with deep learning. Indeed, there have been many successful applications of these methods to hard deep learning problems, such as auto-encoders and classification nets. But the question remains: why does pre-training work and why is it necessary? Some researchers (e.g. Erhan et al., 2010) have investigated this question and proposed various explanations, such as a higher prevalence of bad local optima in the learning objectives of deep models.

Another explanation is that these objectives exhibit pathological curvature, making them nearly impossible for curvature-blind methods like gradient descent to successfully navigate. In this paper we will argue in favor of this explanation and provide a solution in the form of a powerful semi-online 2nd-order optimization algorithm which is practical for very large models and datasets. Using this technique, we are able to overcome the under-fitting problem encountered when training deep auto-encoder neural nets far more effectively than the pre-training + fine-tuning approach proposed by Hinton & Salakhutdinov (2006). Being an optimization algorithm, our approach doesn't deal specifically with the problem of over-fitting; however, we show that this is only a serious issue for one of the three deep auto-encoder problems considered by Hinton & Salakhutdinov, and can be handled by the usual methods of regularization.

These results also help us address the question of why deep learning is hard and why pre-training sometimes helps. Firstly, while bad local optima do exist in deep networks (as they do with shallow ones), in practice they do not seem to pose a significant threat, at least not to strong optimizers like ours. Instead of bad local minima, the difficulty associated with learning deep auto-encoders is better explained by regions of pathological curvature in the objective function, which to 1st-order optimization methods resemble bad local minima.

2. Newton’s method

In this section we review the canonical 2nd-order optimization scheme, Newton's method, and discuss its main benefits and why they may be important in the deep-learning setting. While Newton's method itself is impractical on large models due to the quadratic relationship between the size of the Hessian and the number of parameters in the model, studying it nevertheless informs us about how its more practical derivatives (i.e. quasi-Newton methods) might behave.

Newton's method, like gradient descent, is an optimization algorithm which iteratively updates the parameters θ ∈ R^N of an objective function f by computing search directions p and updating θ as θ + αp for some α. The central idea motivating Newton's method is that f can be locally approximated around each θ, up to 2nd-order, by the quadratic:

    f(θ + p) ≈ q_θ(p) ≡ f(θ) + ∇f(θ)ᵀp + (1/2) pᵀBp        (1)

where B = H(θ) is the Hessian matrix of f at θ. Finding a good search direction then reduces to minimizing this quadratic with respect to p. Complicating this idea is that H may be indefinite, so this quadratic may not have a minimum, and moreover we don't necessarily trust it as an approximation of f for large values of p. Thus in practice the Hessian is "damped" or re-conditioned so that B = H + λI for some constant λ ≥ 0.
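To make (1) concrete, here is a minimal numpy sketch (ours, not from the paper) that computes the damped Newton direction for a toy 2D objective by solving (H + λI)p = −∇f(θ) directly; the explicit solve is only feasible here because the problem is tiny.

```python
import numpy as np

def f(theta):
    """Toy objective: the 2D Rosenbrock function mentioned in the introduction."""
    x, y = theta
    return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2

def grad(theta):
    x, y = theta
    return np.array([-2 * (1 - x) - 400 * x * (y - x ** 2),
                     200 * (y - x ** 2)])

def hessian(theta):
    x, y = theta
    return np.array([[2 - 400 * (y - 3 * x ** 2), -400 * x],
                     [-400 * x, 200.0]])

def damped_newton_direction(theta, lam=1e-2):
    """Minimize q_theta(p) = f(theta) + grad^T p + 0.5 p^T B p, eq. (1),
    with the damped curvature matrix B = H(theta) + lam * I."""
    g = grad(theta)
    B = hessian(theta) + lam * np.eye(theta.size)
    return np.linalg.solve(B, -g)   # explicit solve: only feasible for a 2D toy problem

p = damped_newton_direction(np.array([-1.2, 1.0]))
print(p)
```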

2.1. Scaling and curvature

An important property of Newton's method is "scale invariance". By this we mean that it behaves the same for any linear rescaling of the parameters. To be technically precise, if we adopt a new parameterization θ̂ = Aθ for some invertible matrix A, then the optimal search direction in the new parameterization is p̂ = Ap, where p is the original optimal search direction. By contrast, the search direction produced by gradient descent has the opposite response to linear re-parameterizations: p̂ = A⁻ᵀp.

Scale invariance is important because, without it, poorly scaled parameters will be much harder to optimize. It also eliminates the need to tweak learning rates for individual parameters and/or anneal global learning rates according to arbitrary schedules. Moreover, there is an implicit "scaling" which varies over the entire parameter space and is determined by the local curvature of the objective function.

Figure 1. Optimization in a long narrow valley

By taking the curvature information into account (in the form of the Hessian), Newton's method rescales the gradient so it is a much more sensible direction to follow.

Intuitively, if the curvature is low (and positive) in a particular descent direction d, this means that the gradient of the objective changes slowly along d, and so d will remain a descent direction over a long distance. It is thus sensible to choose a search direction p which travels far along d (i.e. by making pᵀd large), even if the amount of reduction in the objective associated with d (given by −∇fᵀd) is relatively small. Similarly, if the curvature associated with d is high, then it is sensible to choose p so that the distance traveled along d is smaller. Newton's method makes this intuition rigorous by computing the distance to move along d as its reduction divided by its associated curvature: −∇fᵀd / dᵀHd. This is precisely the point along d after which f is predicted by (1) to start increasing.
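As a small illustration (ours, with made-up numbers for g, H and d), this prescribed distance is just the ratio of the two quantities:

```python
import numpy as np

g = np.array([-0.01, 2.0])    # hypothetical gradient: only a tiny reduction available along d
H = np.diag([1e-4, 10.0])     # hypothetical curvature: very low along d, high elsewhere
d = np.array([1.0, 0.0])      # the descent direction under consideration

reduction = -g @ d            # -grad(f)^T d: initial rate of decrease along d
curvature = d @ H @ d         # d^T H d: how quickly that rate of decrease dies off
print(reduction / curvature)  # Newton's distance along d: large (about 100), despite the
                              # tiny reduction, because the curvature is also tiny
```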

Not accounting for the curvature when computing search directions can lead to many undesirable scenarios. First, the sequence of search directions might constantly move too far in directions of high curvature, causing an unstable "bouncing" behavior that is often observed with gradient descent and is usually remedied by decreasing the learning rate. Second, directions of low curvature will be explored much more slowly than they should be, a problem exacerbated by lowering the learning rate. And if the only directions of significant decrease in f are ones of low curvature, the optimization may become too slow to be practical and even appear to halt altogether, creating the false impression of a local minimum. It is our theory that the under-fitting problem encountered when optimizing deep nets using 1st-order techniques is mostly due to such techniques becoming trapped in such false local minima.

Figure 1 visualizes a "pathological curvature scenario", where the objective function locally resembles a long narrow valley. At the base of the valley is a direction of low reduction and low curvature that needs to be followed in order to make progress. The smaller arrows represent the steps taken by gradient descent with large and small learning rates respectively, while the large arrow along the base of the valley represents the step computed by Newton's method. What makes this scenario "pathological" is not the presence of merely low or high curvature directions, but the mixture of both of them together.

Algorithm 1 The Hessian-free optimization method
 1: for n = 1, 2, ... do
 2:   g_n ← ∇f(θ_n)
 3:   compute/adjust λ by some method
 4:   define the function B_n(d) = H(θ_n)d + λd
 5:   p_n ← CG-Minimize(B_n, −g_n)
 6:   θ_{n+1} ← θ_n + p_n
 7: end for
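A minimal Python rendering of this skeleton might look as follows (our sketch, not the paper's code): `grad_fn` and `hvp_fn` are assumed user-supplied callables, λ is held fixed for simplicity, and SciPy's linear CG stands in for CG-Minimize.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def hf_optimize(grad_fn, hvp_fn, theta0, n_iters=50, lam=1.0, max_cg=250):
    """Skeleton of Algorithm 1.  grad_fn(theta) returns grad f(theta);
    hvp_fn(theta, d) returns H(theta) d (e.g. via finite differences or an exact
    procedure).  lam is held fixed here; section 4.1 adjusts it each iteration."""
    theta = theta0.copy()
    N = theta.size
    for _ in range(n_iters):
        g = grad_fn(theta)
        # B_n(d) = H(theta_n) d + lam * d, exposed to CG as a matrix-free operator
        B_n = LinearOperator((N, N), matvec=lambda d: hvp_fn(theta, d) + lam * d)
        p, _info = cg(B_n, -g, maxiter=max_cg)   # partial (truncated) CG run on q_theta
        theta = theta + p
    return theta
```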

2.2. Examples of pathological curvature in neural nets

For a concrete example of pathological curvature in neural networks, consider the situation in which two units a and b in the same layer have nearly identical incoming and outgoing weights and biases. Let d be a descent direction which increases the value of one of a's outgoing weights, say parameter i, while simultaneously decreasing the corresponding weight for unit b, say parameter j, so that d_k = δ_ik − δ_jk. d can be interpreted as a direction which "differentiates" the two units. The reduction associated with d is −∇fᵀd = (∇f)_j − (∇f)_i ≈ 0 and the curvature is dᵀHd = (H_ii − H_ij) + (H_jj − H_ji) ≈ 0 + 0 = 0. Gradient descent will only make progress along d which is proportional to the reduction, which is very small, whereas Newton's method will move much farther, because the associated curvature is also very small.

Another example of pathological curvature, particular to deeper nets, is the commonly observed phenomenon where, depending on the magnitude of the initial weights, the gradients will either shrink towards zero or blow up as they are back-propagated, making learning of the weights before the last few layers nearly impossible. This difficulty in learning all but the last few layers is sometimes called the "vanishing gradients" problem and may be slightly mitigated by using heuristics to adapt the learning rates of each parameter individually. The issue here is not so much that the gradients become very small or very large absolutely, but rather that they become so relative to the gradients of the weights associated with units near the end of the net. Critically, the second derivatives will shrink or blow up in an analogous way, corresponding to either very low or high curvature along directions which change the affected parameters. Newton's method thus will rescale these directions so that they are far more reasonable to follow.

3. Hessian-free optimization

The basis of the 2nd-order optimization approach we develop in this paper is a technique known as Hessian-free optimization (HF), aka truncated Newton, which has been studied in the optimization community for decades (e.g. Nocedal & Wright, 1999), but never seriously applied within machine learning.

In the standard Newton's method, q_θ(p) is optimized by computing the N × N matrix B and then solving the system Bp = −∇f(θ). This is prohibitively expensive when N is large, as it is with even modestly sized neural networks. Instead, HF optimizes q_θ(p) by exploiting two simple ideas. The first is that for an N-dimensional vector d, Hd can be easily computed using finite differences at the cost of a single extra gradient evaluation via the identity:

    Hd = lim_{ε→0} [∇f(θ + εd) − ∇f(θ)] / ε

The second is that there is a very effective algorithm for optimizing quadratic objectives (such as q_θ(p)) which requires only matrix-vector products with B: the linear conjugate gradient algorithm (CG). Now since in the worst case CG will require N iterations to converge (thus requiring the evaluation of N Bd-products), it is clearly impractical to wait for CG to completely converge in general. But fortunately, the behavior of CG is such that it will make significant progress in the minimization of q_θ(p) after a much more practical number of iterations. Algorithm 1 gives the basic skeleton of the HF method.
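The following is a minimal sketch (ours, not the paper's code) of both ingredients: the finite-difference Hd product and a plain linear CG loop that touches B only through matrix-vector products; `grad_fn` and `matvec` are assumed callables.

```python
import numpy as np

def hvp_fd(grad_fn, theta, d, eps=1e-6):
    """Approximate H(theta) d by the forward-difference version of the identity
    above; costs one extra gradient evaluation per product."""
    return (grad_fn(theta + eps * d) - grad_fn(theta)) / eps

def cg_minimize(matvec, b, x0=None, max_iter=250, tol=1e-10):
    """Plain linear CG for minimizing phi(x) = 0.5 x^T A x - b^T x, where A is
    touched only through matvec(x) = A x (in HF: A = B and b = -grad f(theta))."""
    x = np.zeros_like(b) if x0 is None else x0.copy()
    r = b - matvec(x)              # residual, equal to -grad phi(x)
    d = r.copy()
    rr = r @ r
    for _ in range(max_iter):
        Ad = matvec(d)
        alpha = rr / (d @ Ad)      # exact minimizer of phi along the direction d
        x += alpha * d
        r -= alpha * Ad
        rr_new = r @ r
        if rr_new < tol:
            break
        d = r + (rr_new / rr) * d  # next direction, conjugate to the previous ones
        rr = rr_new
    return x
```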

HF is appealing because, unlike many other quasi-Newton methods, it does not make any approximation to the Hessian. Indeed, the Hd products can be computed accurately by the finite differences method, or other more stable algorithms. HF differs from Newton's method only because it is performing an incomplete optimization (via un-converged CG) of q_θ(p) in lieu of doing a full matrix inversion.

Another appealing aspect of the HF approach lies in the power of the CG method. Distinct from the non-linear CG method (NCG) often used in machine learning, linear CG makes strong use of the quadratic nature of the optimization problem it solves in order to iteratively generate a set of "conjugate directions" d_i (with the property that d_iᵀA d_j = 0 for i ≠ j) and optimize along these independently and exactly. In particular, the movement along each direction is precisely what Newton's method would select: the reduction divided by the curvature, i.e. −∇fᵀd_i / d_iᵀA d_i, a fact which follows from the conjugacy property. On the other hand, when applying the non-linear CG method (which is done on f directly, not q_θ), the directions it generates won't remain conjugate for very long, even approximately so, and the line search is usually performed inexactly and at a relatively high expense.

Nevertheless, CG and NCG are in many ways similar, and NCG even becomes equivalent to CG when it uses an exact line search and is applied to a quadratic objective (i.e. one with constant curvature). Perhaps the most important difference is that when NCG is applied to a highly non-linear objective f, the underlying curvature evolves with each new search direction processed, while when CG is applied to the local quadratic approximation of f (i.e. q_θ), the curvature remains fixed. It seems likely that the latter condition would allow CG to be much more effective than NCG at finding directions of low reduction and curvature, as directions of high reduction and high curvature can be found by the early iterations of CG and effectively "subtracted away" from consideration via the conjugate-directions decomposition. NCG, on the other hand, must try to keep up with the constantly evolving curvature conditions of f, and therefore focuses on the more immediate directions of high reduction and curvature which arise at each successively visited position in the parameter space.

4. Making HF suitable for machine learning problems

Our experience with using off-the-shelf implementations of HF is that they simply don't work for neural network training, or are at least grossly impractical. In this section we will describe the modifications and design choices we made to the basic HF approach in order to yield an algorithm which is both effective and practical on the problems we considered. Note that none of these enhancements are specific to neural networks, and they should be applicable to other optimization problems that are of interest to machine-learning researchers.

4.1. Damping

The issue of damping, as with standard Newton's method, is of vital importance to the HF approach. Unlike methods such as L-BFGS, where the curvature matrix is crudely approximated, the exact curvature matrix implicitly available to the HF method allows for the identification of directions with extremely low curvature. When such a direction is found that also happens to have a reasonably large reduction, CG will elect to move very far along it, and possibly well outside of the region where (1) is a sensible approximation. The damping parameter λ can be interpreted as controlling how "conservative" the approximation is, essentially by adding the constant λ‖d‖² to the curvature estimate for each direction d. Using a fixed setting of λ is not viable for several reasons, but most importantly because the relative scale of B is constantly changing. It might also be the case that the "trustworthiness" of the approximation varies significantly over the parameter space.

There are advanced techniques, known as Newton-Lanczos methods, for computing the value of λ which corresponds to a given "trust-region radius" τ. However, we found that such methods were very expensive and thus not cost-effective in practice, and so instead we used a simple Levenberg-Marquardt style heuristic for adjusting λ directly:

    if ρ < 1/4:   λ ← (3/2)λ
    elseif ρ > 3/4:   λ ← (2/3)λ
    endif

where ρ is the "reduction ratio". The reduction ratio is a scalar quantity which attempts to measure the accuracy of q_θ and is given by:

    ρ = [f(θ + p) − f(θ)] / [q_θ(p) − q_θ(0)]
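A direct transcription of this heuristic (our sketch; the 3/2 and 2/3 multipliers and the 1/4, 3/4 thresholds follow the rule above) might look like:

```python
def update_lambda(lam, rho):
    """Levenberg-Marquardt style adjustment of the damping parameter."""
    if rho < 0.25:
        lam *= 1.5          # model over-predicted the reduction: damp more heavily
    elif rho > 0.75:
        lam *= 2.0 / 3.0    # model is trustworthy: allow larger, less damped steps
    return lam

def reduction_ratio(f, theta, p, g, Bp):
    """rho = [f(theta + p) - f(theta)] / [q_theta(p) - q_theta(0)], where the
    denominator equals g^T p + 0.5 p^T B p."""
    predicted = g @ p + 0.5 * (p @ Bp)
    return (f(theta + p) - f(theta)) / predicted
```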

4.2. Computing the matrix-vector products

While the product Hd can be computed using finite differences, this approach is subject to numerical problems and also requires the computationally expensive evaluation of non-linear functions. Pearlmutter (1994) showed that there is an efficient procedure for computing the product Hd exactly for neural networks and several other models such as RNNs and Boltzmann machines. This algorithm is like backprop in that it involves a forward and backward pass, is "local", and has a similar computational cost. Moreover, for standard neural nets it can also be performed without the need to evaluate non-linear functions.

In the development of his online 2nd-order method "SMD", Schraudolph (2002) generalized Pearlmutter's method in order to compute the product Gd, where G is the Gauss-Newton approximation to the Hessian. While the classical Gauss-Newton method applies only to a sum-of-squared-error objective, it can be extended to neural networks whose output units "match" their loss function (e.g. logistic units with cross-entropy error).

While at first glance this might seem pointless, since we can already compute Hd with relative efficiency, there are good reasons for using G instead of H. Firstly, the Gauss-Newton matrix G is guaranteed to be positive semi-definite, even when un-damped, which avoids the problem of negative curvature, thus guaranteeing that CG will work for any positive value of λ. Mizutani & Dreyfus (2008) argue against using G, on the grounds that recognizing and exploiting negative curvature is important, particularly for training neural nets. Indeed, some implementations of HF will perform a check for directions of negative curvature during the CG runs, and if one is found they will abort CG and run a specialized subroutine in order to search along it. Based on our limited experience with such methods, we feel that they are not particularly cost-effective. Moreover, on all of the learning problems we tested, using G instead of H consistently resulted in much better search directions, even in situations where negative curvature was not present. Another more mundane advantage of using G over H is that the associated matrix-vector product algorithm for G uses about half the memory and runs nearly twice as fast.
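As an illustration of such a Gd product in the "matching" case, here is a sketch (ours, not Schraudolph's algorithm) for a single-layer logistic model with cross-entropy loss, where the Jacobian of the pre-activations is simply the input matrix X; Schraudolph's procedure computes the analogous product for deep nets without ever forming J.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gauss_newton_vec(X, w, v):
    """G v for logistic outputs with cross-entropy loss: G = J^T H_L J, where
    J = X is the Jacobian of the pre-activations z = X w and H_L = diag(p(1-p))
    is the Hessian of the loss with respect to z.  G is positive semi-definite
    because it has the form J^T D J with D >= 0."""
    p = sigmoid(X @ w)           # predictions for each training case
    Jv = X @ v                   # forward: directional change of z along v
    HLJv = p * (1.0 - p) * Jv    # curvature of the matching loss in output space
    return X.T @ HLJv            # backward: pull the result back to parameter space
```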

4.3. Handling large datasets

In general, the computational cost associated with computing the Bd products will grow linearly with the amount of training data. Thus for large training datasets it may be impractical to compute these vectors as many times as is needed by CG in order to sufficiently optimize q_θ(p). One obvious remedy is to just truncate the dataset when computing the products, but this is unsatisfying. Instead we seek something akin to "online learning", where the dataset used for each gradient evaluation is a constantly changing subset of the total, i.e. a "mini-batch".

Fortunately, there is a simple way to adapt HF as an online algorithm which we have found works well in practice, although some care must be taken. One might be tempted to cycle through a sequence of mini-batches for the evaluation of each Bd product, but this is a bad strategy for several reasons. Firstly, it is much more efficient to compute these products if the unit activations for each training example can be cached and reused across evaluations, which would be much less viable if the set of training examples is constantly changing. Secondly, and more importantly, the CG algorithm is not robust to changes in the B matrix while it is running. The power of the CG method relies on the invariants it maintains across iterations, such as the conjugacy of its search directions. These will be quickly violated if the implicit definition of B is constantly changing as CG iterates. Thus the mini-batch should be kept constant during each CG run, cycling only at the end of each HF iteration. It might also be tempting to try using very small mini-batches, perhaps even of size 1. This strategy, too, is problematic, since the B matrix, if defined using a very small mini-batch, will not contain enough useful curvature information to produce a good search direction.

The strategy we found that works best is to use relatively large mini-batches, the optimal size of which grows as the optimization progresses. And while the optimal mini-batch size may be a function of the size of the model (we don't have enough data to make a determination), it critically doesn't seem to bear any relation to the total dataset size. In our experiments, while we used mini-batches to compute the Bd products, the gradients and log-likelihoods were computed using the entire dataset. The rationale for this is quite simple: each HF iteration involves a run of CG which may require hundreds of Bd evaluations but only 1 gradient evaluation. Thus it is cost-effective to obtain a much higher quality estimate of the gradient. And it should be noted that unlike SGD, which performs tens of thousands of iterations, the number of iterations performed by our HF approach rarely exceeds 200.

4.4. Termination conditions for CG

Implementations of HF generally employ a convergence test for the CG runs of the form ‖Bp + ∇f(θ)‖₂ < ε, where the tolerance ε is chosen high enough so as to ensure that CG will terminate in a number of iterations that is practical. A popular choice seems to be

    ε = min(1/2, ‖∇f(θ)‖₂^{1/2}) ‖∇f(θ)‖₂

which is supported by some theoretical convergence results. In this section we will argue why this type of convergence test is bad and propose one which we have found works much better in practice.

While CG is usually thought of as an algorithm for finding a least-squares solution to the linear system Ax = b, it is not actually optimizing the squared-error objective ‖Ax − b‖². Instead, it optimizes the quadratic φ(x) = (1/2)xᵀAx − bᵀx, and is invoked within HF implementations by setting A = B and b = −∇f(θ). While φ(x) and ‖Ax − b‖² have the same global minimizer, a good but sub-optimal solution for one may actually be a terrible solution for the other. On a typical run of CG one observes that the objective function φ(x) steadily decreases with each iteration (which is guaranteed by the theory), while ‖Ax − b‖² fluctuates wildly up and down and only starts shrinking to 0 towards the very end of the optimization. Moreover, the original motivation for running CG within the HF method was to minimize the quadratic q_θ(p), and not ‖Bp + ∇f(θ)‖².

Thus it is in our opinion surprising that the commonly used termination condition for CG within HF is based on ‖Ax − b‖². One possible reason is that while ‖Ax − b‖² is bounded below by 0, it is not clear how to find a similar bound for φ(x) that would generate a reasonable termination condition. We experimented with several obvious heuristics and found that the best one by far was to terminate the iterations once the relative per-iteration progress made in minimizing φ(x) fell below some tolerance. In particular, we terminate CG at iteration i if the following condition is satisfied:

    i > k,   φ(x_i) < 0,   and   [φ(x_i) − φ(x_{i−k})] / φ(x_i) < kε

where k determines how many iterations into the past we look in order to compute an estimate of the current per-iteration reduction rate. Choosing k > 1 is in general necessary because, while the average per-iteration reduction in φ tends to decrease over time, it also displays a considerable amount of variance, and thus we need to average over many iterations to obtain a reliable estimate. In all of our experiments we set k = max(10, 0.1i) and ε = 0.0005, thus averaging over a progressively larger interval as i grows. Note that φ can be computed from the CG iterates at essentially no additional cost.
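A direct transcription of this test (our sketch; `phi_history` is an assumed list holding φ(x_0), ..., φ(x_i) collected during the CG run):

```python
def should_terminate_cg(phi_history, eps=0.0005):
    """Stop CG when the relative per-iteration progress in phi, averaged over the
    last k = max(10, 0.1*i) iterations, falls below eps.  phi_history[j] = phi(x_j)."""
    i = len(phi_history) - 1
    k = int(max(10, 0.1 * i))
    if i <= k or phi_history[i] >= 0:
        return False
    progress = (phi_history[i] - phi_history[i - k]) / phi_history[i]
    return progress < k * eps
```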

In practice this approach will cause CG to terminate in very few iterations when λ is large (which makes q_θ(p) easy to optimize), which is the typical scenario during the early stages of optimization. In later stages, q_θ(p) begins to exhibit pathological curvature (as long as λ is decayed appropriately), which reflects the actual properties of f, thus making it harder to optimize. In these situations our termination condition will permit CG to run much longer, resulting in a much more expensive HF iteration. But this is the price that seemingly must be paid in order to properly compensate for the true curvature in f.

Heuristics which attempt to significantly reduce the number of iterations by terminating CG early (and we tried several, such as stopping when f(θ + p) increases) provide a speed boost early on in the optimization, but consistently seem to result in worse long-term outcomes, both in terms of generalization error and overall rate of reduction in f. A possible explanation for this is that shorter CG runs are more "greedy" and do not pursue as many low-curvature directions, which seem to be of vital importance, both for reducing generalization error and for avoiding extreme-curvature scenarios such as unit saturation and poor differentiation between units.

4.5. Sharing information across iterations

A simple enhancement to the HF algorithm which we found improves its performance by an order of magnitude is to use the search direction p_{n−1} found by CG in the previous HF iteration as the starting point for CG in the current one. There are intuitively appealing reasons why p_{n−1} might make a good initialization. Indeed, the values of B and ∇f(θ) for a given HF iteration should be "similar" in some sense to their values at the previous iteration, and thus the optimization problem solved by CG is also similar to the previous one, making the previous solution a potentially good starting point.

In typical implementations of HF, the CG runs are initialized with the zero vector, and doing this has the nice property that the initial value of φ will be non-positive (0 in fact). This in turn ensures that the search direction produced by CG will always provide a reduction in q_θ, even if CG is terminated after the first iteration. In general, if CG is initialized with a non-zero vector, the initial value of φ can be greater than zero, and we have indeed found this to be the case when using p_{n−1}. However, judging an initialization merely by its φ value may be misleading, and we found that runs initialized from p_{n−1} rather than 0 consistently yielded better reductions in q_θ, even when φ(p_{n−1}) ≫ 0. A possible explanation for this finding is that p_{n−1} is "wrong" within the new quadratic objective mostly along the volatile high-curvature directions, which are quickly and easily discovered and "corrected" by CG, thus leaving the harder-to-find low-curvature directions, which tend to be more stable over the parameter space and thus more likely to remain descent directions between iterations of HF.

4.6. CG iteration backtracking

While each successive iteration of CG improves the value of p with respect to the 2nd-order model q_θ(p), these improvements are not necessarily reflected in the value of f(θ + p). In particular, if q_θ(p) is untrustworthy due to the damping parameter λ being set too low or the current mini-batch being too small or unrepresentative, then running CG past a certain number of iterations can actually be harmful. In fact, the dependency of the directions generated by CG on the "quality" of B should almost certainly increase with the number of CG iterations, as the basis from which CG generates directions (called the Krylov basis) expands to include matrix-vector products with increasingly large powers of B.

By storing the current solution for p at iteration ⌈γ^j⌉ of CG for each j (where γ > 1 is a constant; 1.3 in our experiments), we can "backtrack" along them after CG has terminated, reducing j as long as the ⌈γ^{j−1}⌉-th iterate of CG yields a lower value of f(θ + p) than the ⌈γ^j⌉-th. However, we have observed experimentally that, for best performance, the value of p used to initialize the next CG run (as described in the previous sub-section) should not be backtracked in this manner. The likely explanation for this effect is that directions which are followed too strongly in p, due to bad curvature information from an unrepresentative mini-batch or λ being too small, will be "corrected" by the next run of CG, since it will use a different mini-batch and possibly a larger λ to compute B.
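A sketch of this backtracking procedure (ours; `saved` and `f_at` are assumed inputs, and the snapshot schedule uses γ = 1.3 as in the text):

```python
import math

def snapshot_iterations(max_iter, gamma=1.3):
    """CG iterations at which to store the current solution: ceil(gamma**j)."""
    its, j = [], 1
    while math.ceil(gamma ** j) <= max_iter:
        it = math.ceil(gamma ** j)
        if not its or it != its[-1]:
            its.append(it)
        j += 1
    return its

def backtrack(saved, f_at):
    """Walk backwards through the stored iterates (list of (iteration, p) pairs in
    increasing order), keeping an earlier p whenever it gives a lower f(theta + p)."""
    best_p = saved[-1][1]
    best_val = f_at(best_p)
    for _, p in reversed(saved[:-1]):
        val = f_at(p)
        if val >= best_val:
            break                # going further back no longer helps: stop
        best_p, best_val = p, val
    return best_p
```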

4.7. Preconditioning CG

Preconditioning is a technique used to accelerate CG. It does this by performing a linear change of variables x̂ = Cx for some matrix C, and then optimizing the transformed quadratic objective given by

    φ̂(x̂) = (1/2) x̂ᵀ C⁻ᵀ A C⁻¹ x̂ − (C⁻¹b)ᵀ x̂

φ̂ may have more forgiving curvature properties than the original φ, depending on the value of the matrix C. To use preconditioned CG one specifies M ≡ CᵀC, with the understanding that it must be easy to solve My = x for arbitrary x. Preconditioning is somewhat of an application-specific art, and we experimented with many possible choices. One that we found to be particularly effective was the diagonal matrix:

    M = [ diag( Σ_{i=1}^{D} ∇f_i(θ) ⊙ ∇f_i(θ) ) + λI ]^α

where f_i is the value of the objective associated with training case i, ⊙ denotes the element-wise product, and the exponent α is chosen to be less than 1 in order to suppress "extreme" values (we used 0.75 in our experiments). The inner sum has the nice interpretation of being the diagonal of the empirical Fisher information matrix, which is similar in some ways to the G matrix. Unfortunately, it is impractical to use the diagonal of G itself, since the obvious algorithm has a cost similar to K backprop operations, where K is the size of the output layer, which is large for auto-encoders (although typically not for classification nets).
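A sketch of this preconditioner (ours; `per_case_grads` is an assumed D × N array whose i-th row is ∇f_i(θ), and only the diagonal is ever stored):

```python
import numpy as np

def diag_preconditioner(per_case_grads, lam, alpha=0.75):
    """Diagonal of M = [diag(sum_i grad_i * grad_i) + lam*I]^alpha, returned as a
    vector; the inner sum is the diagonal of the empirical Fisher matrix."""
    m = np.sum(per_case_grads ** 2, axis=0) + lam
    return m ** alpha

def solve_preconditioner(m, x):
    """Solve M y = x for diagonal M, as required inside preconditioned CG."""
    return x / m
```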

5. Random initialization

Our HF approach, like almost any deterministic optimization scheme, is not completely immune to "bad" initializations, although it does tend to be far more robust to these than 1st-order methods. For example, it cannot break symmetry between two units in the same layer that are initialized with identical weights. However, as discussed in section 2.2, it has a far better chance than 1st-order methods of doing so if the weights are nearly identical.

In our initial investigations we tried a variety of random initialization schemes. The better ones, which were more careful about avoiding issues like saturation, seemed to allow the runs of CG to terminate after fewer iterations, most likely because these initializations resulted in more favorable local curvature properties. The best random initialization scheme we found was one of our own design, "sparse initialization". In this scheme we hard-limit the number of non-zero incoming connection weights to each unit (we used 15 in our experiments) and set the biases to 0 (or 0.5 for tanh units). Doing this allows the units to be both highly differentiated as well as unsaturated, avoiding the problem in dense initializations where the connection weights must all be scaled very small in order to prevent saturation, leading to poor differentiation between units.
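A sketch of sparse initialization (ours; the text fixes the number of non-zero incoming weights at 15 and the biases at 0, while the Gaussian scale of the non-zero weights is a choice we make here):

```python
import numpy as np

def sparse_init(n_in, n_out, n_nonzero=15, scale=1.0, rng=None):
    """Each unit gets exactly n_nonzero non-zero incoming weights (Gaussian,
    scaled by `scale`); all other weights are zero, and biases are zero
    (the text uses 0.5 for tanh units)."""
    rng = np.random.default_rng() if rng is None else rng
    W = np.zeros((n_out, n_in))
    for unit in range(n_out):
        idx = rng.choice(n_in, size=min(n_nonzero, n_in), replace=False)
        W[unit, idx] = scale * rng.standard_normal(idx.size)
    b = np.zeros(n_out)
    return W, b
```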

6. Related work on 2nd-order optimization

LeCun et al. (1998) have proposed several diagonal approximations of the H and G matrices for multi-layer neural nets. While these are easy to invert, update and store, the diagonal approximation may be overly simplistic since it neglects the very important interaction between parameters. For example, in the "nearly identical units" scenario considered in section 2.2, a diagonal approximation would not be able to recognize the "differentiating direction" as being one of low curvature.

Amari et al. (2000) have proposed a 2nd-order learning algorithm for neural nets based on an empirical approximation of the Fisher information matrix (which can be defined for a neural net by casting it as a probabilistic model). Since Schraudolph's approach for computing Gd may be generalized to compute Fd, we were thus able to evaluate the possibility of using F as an alternative to G within our HF approach. The resulting algorithm wasn't able to make significant progress on the deep auto-encoder problems we considered, possibly indicating that F doesn't contain sufficient curvature information to overcome the problems associated with deep learning. A more theoretical observation which supports the use of G over F in neural nets is that the ranks of F and G are D and DL respectively, where D is the size of the training set and L is the size of the output layer. Another observation is that G will converge to the Hessian as the error of the net approaches zero, a property not shared by F.

Building on the work of Pearlmutter (1994), Schraudolph (2002) proposed a 2nd-order method called "Stochastic Meta-descent" (SMD), which uses an on-line 2nd-order approximation to f and optimizes it via updates to p which are also computed on-line. This method differs from HF in several important ways, but most critically in the way it optimizes p. The update scheme used by SMD is a form of preconditioned gradient descent given by:

    p_{n+1} = p_n + M⁻¹r_n,    where r_n ≡ −(∇f(θ_n) + Bp_n)

where M is a diagonal pre-conditioning matrix chosen to approximate B. Using the previously discussed method for computing Bd products, SMD is able to compute these updates efficiently. However, using gradient descent instead of CG to optimize q_θ(p), even with a good diagonal preconditioner, is an approach likely to fail, because q_θ(p) will exhibit the same pathological curvature as the objective function f that it approximates. And pathological curvature was the primary reason for abandoning gradient descent as an optimization method in the first place. Moreover, it can be shown that in the batch case, the updates computed by SMD lie in the same Krylov subspace as those computed by an equal number of CG iterations, and that CG finds the optimal solution of q_θ(p) within this subspace.

Table 1. Experimental parameters

    NAME      SIZE      K       ENCODER DIMS
    CURVES    20000     5000    784-400-200-100-50-25-6
    MNIST     60000     7500    784-1000-500-250-30
    FACES     103500    5175    625-2000-1000-500-30

These 2nd-order methods, plus others we haven't discussed, have only been validated on shallow nets and/or toy problems. And none of them have been shown to be fundamentally more effective than 1st-order optimization methods on deep learning problems.

7. Experiments

We present the results from a series of experiments designed to test the effectiveness of our HF approach on the deep auto-encoder problems considered by Hinton & Salakhutdinov (2006) (abbr. H&S). We adopt precisely the same model architectures, datasets, loss functions and training/test partitions that they did, so as to ensure that our results can be directly compared with theirs.

Each dataset consists of a collection of small grey-scale images of various objects such as hand-written digits and faces. Table 1 summarizes the datasets and associated experimental parameters, where SIZE gives the size of the training set, K gives the size of the minibatches used, and ENCODER DIMS gives the encoder network architecture. In each case, the decoder architecture is the mirror image of the encoder, yielding a "symmetric autoencoder". This symmetry is required in order to be able to apply H&S's pre-training approach. Note that CURVES is the synthetic curves dataset from H&S's paper and FACES is the augmented Olivetti face dataset.

We implemented our approach using the GPU-computing MATLAB package Jacket. We also re-implemented, using Jacket, the precise approach considered by H&S, using their provided code as a basis, and then re-ran their experiments using many more training epochs than they did, and for far longer than we ran our HF approach on the same models. With these extended runs we were able to obtain slightly better results than they reported for both the CURVES and MNIST experiments. Unfortunately, we were not able to reproduce their results for the FACES dataset, as each net we pre-trained had very high generalization error, even before fine-tuning. We ran each method until it either seemed to converge, or started to overfit (which happened for MNIST and FACES, but not CURVES). We found that since our method was much better at fitting the training data, it was thus more prone to overfitting, and so we ran additional experiments where we introduced an ℓ2 prior on the connection weights.

Table 2. Results (training and test errors)

              PT+NCG         RAND+HF        PT+HF
    CURVES    0.74, 0.82     0.11, 0.20     0.10, 0.21
    MNIST     2.31, 2.72     1.64, 2.78     1.63, 2.46
    MNIST*    2.07, 2.61     1.75, 2.55     1.60, 2.28
    FACES     -, 124         55.4, 139      -, -
    FACES*    -, -           60.6, 122      -, -

Table 2 summarizes our results, where PT+NCG is the pre-training + non-linear CG fine-tuning approach of H&S, RAND+HF is our Hessian-free method initialized randomly, and PT+HF is our approach initialized with pre-trained parameters. The numbers given in each entry of the table are the average sum of squared reconstruction errors on the training set and the test set. The *'s indicate that an ℓ2 prior was used, with strength 10^{-4} on MNIST and 10^{-2} on FACES. Error numbers for FACES which involve pre-training are missing due to our failure to reproduce the results of H&S on that dataset (instead we just give the test-error number they reported).

Figure 2 demonstrates the performance of our implementations on the CURVES dataset. Pre-training time is included where applicable. This plot is not meant to be a definitive performance analysis, but merely a demonstration that our method is indeed quite efficient.

8. Discussion of results and implications

The most important implication of our results is that learning in deep models can be achieved effectively and efficiently by a completely general optimizer without any need for pre-training. This opens the door to examining a diverse range of deep or otherwise difficult-to-optimize architectures for which there are no effective pre-training methods, such as asymmetric auto-encoders, or recurrent neural nets.

A clear theme which emerges from our results is that the HF-optimized nets have much lower training error, implying that our HF approach does well because it is more effective than pre-training + fine-tuning approaches at solving the under-fitting problem. Because both the MNIST and FACES experiments used early stopping, the training error numbers reported in Table 2 are actually much higher than can otherwise be achieved. When we initialized randomly and didn't use early stopping, we obtained a training error of 1.40 on MNIST and 12.9 on FACES.

A pre-trained initialization benefits our approach in terms of optimization speed and generalization error. However, there doesn't seem to be any significant benefit in regards to under-fitting, as our HF approach seems to solve this problem almost completely by itself. This is in stark contrast to the situation with 1st-order optimization algorithms, where the main hurdle overcome by pre-training is that of under-fitting, at least in the setting of auto-encoders.

[Figure 2. Error (train and test) vs. computation time in seconds (log scale) on CURVES, for PT+NCG, RAND+HF, and PT+HF.]

Based on these results we can hypothesize that the way pre-training helps 1st-order optimization algorithms overcome the under-fitting problem is by placing the parameters in a region less affected by issues of pathological curvature in the objective, such as those discussed in section 2.2. This would also explain why our HF approach optimizes faster from pre-trained parameters, as more favorable local curvature conditions allow the CG runs to make more rapid progress when optimizing q_θ(p).

Finally, while these early results are very encouraging, clearly further research is warranted in order to address the many interesting questions that arise from them, such as: how much more powerful are deep nets than shallow ones, and is this power fully exploited by current pre-training/fine-tuning schemes?

Acknowledgments

The author would like to thank Geoffrey Hinton, Richard Zemel, Ilya Sutskever and Hugo Larochelle for their helpful suggestions. This work was supported by NSERC and the University of Toronto.

References

Amari, S., Park, H., and Fukumizu, K. Adaptive method of realizing natural gradient learning for multilayer perceptrons. Neural Computation, 2000.

Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. Greedy layer-wise training of deep networks. In NIPS, 2007.

Erhan, D., Bengio, Y., Courville, A., Manzagol, P., Vincent, P., and Bengio, S. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 2010.

Hinton, G. E. and Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science, July 2006.

LeCun, Y., Bottou, L., Orr, G., and Muller, K. Efficient backprop. In Orr, G. and Muller, K. (eds.), Neural Networks: Tricks of the Trade. Springer, 1998.

Mizutani, E. and Dreyfus, S. E. Second-order stagewise backpropagation for Hessian-matrix analyses and investigation of negative curvature. Neural Networks, 21(2-3):193-203, 2008.

Nocedal, J. and Wright, S. J. Numerical Optimization. Springer, 1999.

Pearlmutter, B. A. Fast exact multiplication by the Hessian. Neural Computation, 1994.

Schraudolph, N. N. Fast curvature matrix-vector products for second-order gradient descent. Neural Computation, 2002.