CS224d Deep NLP Lecture 6: Neural Tips and Tricks + Recurrent Neural Networks. Richard Socher, richard@metamind.io


Page 1:

CS224d Deep NLP

Lecture 6: Neural Tips and Tricks + Recurrent Neural Networks

Richard Socher, richard@metamind.io

Page 2:

Overview Today:

• Another explanation of backprop (from a tutorial Yoshua, Chris and I did)

• Practical tips and tricks:
  • Multi-task learning
  • Nonlinearities
  • Finite difference gradient check
  • Momentum, AdaGrad

• Language Models
• First intro to Recurrent Neural Networks

Page 3:

Backpropagation (Another explanation)

• Compute gradient of example-wise loss wrt parameters

• Simply applying the derivative chain rule wisely

• If computing the loss(example, parameters) is O(n) computation, then so is computing the gradient

Page 4:

Simple Chain Rule

Page 5:

Multiple Paths Chain Rule

Page 6:

Multiple Paths Chain Rule - General

Page 7:

Chain Rule in Flow Graph

Flow graph: any directed acyclic graph
  node = computation result
  arc = computation dependency

{y1, y2, …, yn} = successors of x

Page 8:

Back-Prop in Multi-Layer Net


h = sigmoid(Vx)
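To make the slide's picture concrete, here is a minimal numpy sketch (my own illustration, not code from the lecture) of the forward and backward passes for a one-hidden-layer scorer with h = sigmoid(Vx); the scalar score s = u·h and the names u, V, x are assumptions for the example.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_backward(x, V, u):
    # Forward pass: hidden layer and scalar score
    z = V @ x                    # pre-activation
    h = sigmoid(z)               # h = sigmoid(Vx)
    s = u @ h                    # scalar score s = u^T h

    # Backward pass: start from ds/ds = 1 and apply the chain rule
    grad_u = h                   # ds/du = h
    delta = u * h * (1 - h)      # ds/dz = u * sigmoid'(z)
    grad_V = np.outer(delta, x)  # ds/dV
    grad_x = V.T @ delta         # ds/dx (needed if x are word vectors)
    return s, grad_u, grad_V, grad_x

# Tiny usage example with random numbers
rng = np.random.default_rng(0)
x = rng.standard_normal(4)
V = rng.standard_normal((3, 4))
u = rng.standard_normal(3)
print(forward_backward(x, V, u)[0])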

Page 9:

Back-Prop in General Flow Graph

{y1, y2, …, yn} = successors of x

1. Fprop: visit nodes in topo-sort order
   - Compute value of node given predecessors

2. Bprop:
   - initialize output gradient = 1
   - visit nodes in reverse order:
     Compute gradient wrt each node using gradient wrt successors

Single scalar output
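As an illustration only (not from the lecture), a tiny sketch of this fprop/bprop recipe over a topologically sorted flow graph; the node interface (forward/backward) and the node classes Mul and Sum are my own.

import numpy as np

class Mul:
    """Example node type: elementwise product of two predecessor values."""
    def forward(self, a, b):
        self.a, self.b = a, b
        return a * b
    def backward(self, grad_out):
        # gradient wrt each input, given the gradient wrt the output
        return grad_out * self.b, grad_out * self.a

class Sum:
    """Example node type: sum of a vector to a scalar."""
    def forward(self, a):
        self.a = a
        return a.sum()
    def backward(self, grad_out):
        return (grad_out * np.ones_like(self.a),)

# Flow graph for y = sum(a * b); nodes listed in topological order
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
mul, total = Mul(), Sum()

# Fprop: visit nodes in topo-sort order
h = mul.forward(a, b)
y = total.forward(h)

# Bprop: initialize output gradient = 1, visit nodes in reverse order
grad_h, = total.backward(1.0)
grad_a, grad_b = mul.backward(grad_h)
print(y, grad_a, grad_b)        # dy/da = b, dy/db = a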

Page 10:

Automatic Differentiation

• The gradient computation can be automatically inferred from the symbolic expression of the fprop.

• Each node type needs to know how to compute its output and how to compute the gradient wrt its inputs given the gradient wrt its output.

• Easy and fast prototyping

Page 11:

Neural Tips and Tricks

Page 12:

Multi-task learning / Weight sharing

• Base model: Neural network from last class but replaces the single scalar score with a Softmax classifier

• Training is again done via backpropagation

• NLP (almost) from scratch, Collobert et al. 2011

[Figure: window-based network with inputs x1, x2, x3, +1, hidden units a1, a2, and softmax outputs c1, c2, c3]

Page 13:

The model

• We already know the softmax classifier and how to optimize it
• The interesting twist in deep learning is that the input features x and their transformations in a hidden layer are also learned.
• Two final layers are possible:

[Figure: the same window network with either a softmax output layer S over classes c1, c2, c3, or a single linear score s on top of the hidden layer a1, a2 (weights U2, W23)]

Page 14:

Multitask learning

• Main idea: We can share both the word vectors AND the hidden layer weights. Only the softmax weights are different.

• Cost function is just the sum of two cross entropy errors (a minimal sketch follows below)

[Figure: two copies of the network share the inputs x1, x2, x3, +1 and hidden units a1, a2; task 1 has softmax S1 over classes c1, c2, c3, task 2 has softmax S2 over classes c4, c5, c6]
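As an illustration only (not code from the lecture), a minimal numpy sketch of the shared-weights idea: one hidden layer feeds two softmax heads, and the training cost is the sum of the two cross-entropy errors. All names (W, b, U1, U2, etc.) are assumptions.

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def multitask_loss(x, y1, y2, W, b, U1, U2):
    # Shared hidden layer (the word vectors in x and the weights W, b are shared across tasks)
    h = np.tanh(W @ x + b)
    # Task-specific softmax heads
    p1 = softmax(U1 @ h)
    p2 = softmax(U2 @ h)
    # Total cost = sum of the two cross-entropy errors
    return -np.log(p1[y1]) - np.log(p2[y2])

rng = np.random.default_rng(0)
x = rng.standard_normal(6)                  # concatenated window of word vectors
W, b = rng.standard_normal((4, 6)), np.zeros(4)
U1, U2 = rng.standard_normal((3, 4)), rng.standard_normal((5, 4))
print(multitask_loss(x, y1=1, y2=3, W=W, b=b, U1=U1, U2=U2))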

Page 15:

The multitask model - Training

• Example: predict each window's center NER tag and POS tag (example POS tags: DT, NN, NNP, JJ, JJS (superlative adj), VB, …)

• Efficient implementation: Same forward prop, compute errors on hidden vectors and add

[Figure: same two-headed network as on the previous slide (shared x and a, softmax heads S1 and S2)]

Page 16:

                                                   POS WSJ (acc.)   NER CoNLL (F1)
State-of-the-art*                                       97.24            89.31
Supervised NN                                           96.37            81.47
Word vector pre-training followed by supervised NN**    97.20            88.87
+ hand-crafted features***                              97.29            89.59

* Representative systems: POS: (Toutanova et al. 2003), NER: (Ando & Zhang 2005)
** 130,000-word embedding trained on Wikipedia and Reuters with 11 word window, 100 unit hidden layer, then supervised task training
*** Features are character suffixes for POS and a gazetteer for NER

The secret sauce (sometimes) is the unsupervised word vector pre-training on a large text collection

Page 17:

                            POS WSJ (acc.)   NER CoNLL (F1)
Supervised NN                    96.37            81.47
NN with Brown clusters           96.92            87.15
Fixed embeddings*                97.10            88.87
C&W 2011**                       97.29            89.59

* Same architecture as C&W 2011, but word embeddings are kept constant during the supervised training phase
** C&W is unsupervised pre-train + supervised NN + features model of last slide

Supervised refinement of the unsupervised word representation helps

Page 18:

General Strategy for Successful NNets

1. Select network structure appropriate for problem
   1. Structure: Single words, fixed windows, sentence based, document level; bag of words, recursive vs. recurrent, CNN
   2. Nonlinearity
2. Check for implementation bugs with gradient checks
3. Parameter initialization
4. Optimization tricks
5. Check if the model is powerful enough to overfit
   1. If not, change model structure or make model “larger”
   2. If you can overfit: Regularize

Page 19:

Non-linearities: What's used

logistic (“sigmoid”), tanh

tanh is just a rescaled and shifted sigmoid: tanh(z) = 2 logistic(2z) − 1

Page 20:

For many models, tanh is the best!

• In comparison to sigmoid:

  • At initialization: values close to 0

  • Faster convergence in practice

• Like sigmoid: nice derivative: tanh′(z) = 1 − tanh²(z)

Page 21:

Non-linearities: There are various other choices

hard tanh, softsign, rectified linear (ReLU)

• hard tanh: similar but computationally cheaper than tanh and saturates hard.
• Glorot and Bengio, AISTATS 2011 discuss softsign and rectifier

softsign(z) = z / (1 + |z|)
rect(z) = max(z, 0)
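For reference, a small numpy sketch (my own, not from the slides) of the nonlinearities mentioned here, together with the derivatives needed when wiring them into backprop.

import numpy as np

def logistic(z):                 # "sigmoid"
    return 1.0 / (1.0 + np.exp(-z))

def d_logistic(z):
    s = logistic(z)
    return s * (1 - s)

def d_tanh(z):                   # tanh(z) = 2*logistic(2z) - 1
    return 1.0 - np.tanh(z) ** 2

def hard_tanh(z):
    return np.clip(z, -1.0, 1.0)

def softsign(z):
    return z / (1.0 + np.abs(z))

def relu(z):                     # rect(z) = max(z, 0)
    return np.maximum(z, 0.0)

def d_relu(z):
    return (z > 0).astype(z.dtype)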

Page 22:

MaxOut Network

A recent type of nonlinearity/network

Goodfellow et al. (2013)

h(x) = max_j (W_j x + b_j), where each (W_j, b_j) is a separate affine transformation of the input

This function also becomes a universal approximator when stacked in multiple layers

State of the art on several image datasets
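A minimal sketch (assuming the standard maxout definition from Goodfellow et al. 2013; the function and variable names are mine) of a maxout unit with k linear pieces:

import numpy as np

def maxout(x, W, b):
    """Maxout layer: W has shape (k, out_dim, in_dim), b has shape (k, out_dim).

    Each of the k slices is a separate affine map; the output takes the
    element-wise maximum over the k pieces."""
    z = np.einsum('koi,i->ko', W, x) + b    # shape (k, out_dim)
    return z.max(axis=0)

rng = np.random.default_rng(0)
x = rng.standard_normal(5)
W = rng.standard_normal((2, 3, 5))          # k = 2 pieces, 3 output units
b = rng.standard_normal((2, 3))
print(maxout(x, W, b))                      # shape (3,)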

Page 23:

Gradient Checks are Awesome!

• Allow you to know that there are no bugs in your neural network implementation!

• Steps:
  1. Implement your gradient
  2. Implement a finite difference computation by looping through the parameters of your network, adding and subtracting a small epsilon (∼10⁻⁴) and estimate derivatives
  3. Compare the two and make sure they are almost the same
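A minimal sketch of such a centered finite-difference check (illustrative only; f is any function returning the scalar loss, grad its analytic gradient):

import numpy as np

def gradient_check(f, grad, theta, eps=1e-4, tol=1e-6):
    """Compare an analytic gradient against centered finite differences."""
    numeric = np.zeros_like(theta)
    for i in range(theta.size):
        old = theta.flat[i]
        theta.flat[i] = old + eps
        plus = f(theta)
        theta.flat[i] = old - eps
        minus = f(theta)
        theta.flat[i] = old                         # restore the parameter
        numeric.flat[i] = (plus - minus) / (2 * eps)
    diff = np.max(np.abs(numeric - grad(theta)))
    return diff < tol, diff

# Example: f(theta) = sum(theta^2), whose gradient is 2*theta
ok, diff = gradient_check(lambda t: np.sum(t ** 2), lambda t: 2 * t,
                          np.random.randn(5))
print(ok, diff)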

Page 24:

Using gradient checks and model simplification

• If your gradient check fails and you don't know why, what now? Create a very tiny synthetic model and dataset
• Simplify your model until you have no bug!
• Example: Start from simplest model then go to what you want:
  • Only softmax on fixed input
  • Backprop into word vectors and softmax
  • Add single unit single hidden layer
  • Add multi unit single layer
  • Add bias
  • Add second layer single unit, add multiple units, bias
  • Add one softmax on top, then two softmax layers

Page 25:

General Strategy

1. Select appropriate Network Structure
   1. Structure: Single words, fixed windows vs. Recursive Sentence Based vs. Bag of words
   2. Nonlinearity
2. Check for implementation bugs with gradient check
3. Parameter initialization
4. Optimization tricks
5. Check if the model is powerful enough to overfit
   1. If not, change model structure or make model “larger”
   2. If you can overfit: Regularize

Page 26:

Parameter Initialization

• Initialize hidden layer biases to 0 and output (or reconstruction) biases to optimal value if weights were 0 (e.g., mean target or inverse sigmoid of mean target).

• Initialize weights ∼ Uniform(−r, r), r inversely proportional to fan-in (previous layer size) and fan-out (next layer size):

  r = sqrt(6 / (fan-in + fan-out)) for tanh units, and 4x bigger for sigmoid units [Glorot AISTATS 2010]
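A short sketch of this initialization recipe (assuming the usual Glorot/Xavier uniform range; the function and variable names are mine):

import numpy as np

def init_weights(fan_in, fan_out, unit='tanh', rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    r = np.sqrt(6.0 / (fan_in + fan_out))
    if unit == 'sigmoid':
        r *= 4.0                            # 4x bigger range for sigmoid units
    return rng.uniform(-r, r, size=(fan_out, fan_in))

W1 = init_weights(100, 50)                  # weights into a 50-unit tanh hidden layer
b1 = np.zeros(50)                           # hidden biases initialized to 0
print(W1.shape, np.abs(W1).max())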

Page 27:

Stochastic Gradient Descent (SGD)

• Gradient descent uses total gradient over all examples per update, SGD updates after only 1 or few examples (see the update sketch after this list):

• J_t = loss function at current example, θ = parameter vector, α = learning rate.

• Ordinary gradient descent as a batch method is very slow, should never be used. Use 2nd order batch method such as L-BFGS.

• On large datasets, SGD usually wins over all batch methods. On smaller datasets L-BFGS or Conjugate Gradients win. Large-batch L-BFGS extends the reach of L-BFGS [Le et al. ICML 2011].
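The update rule itself is not legible in this transcript; the standard SGD form consistent with the symbols defined above (my reconstruction, not copied from the slide) is:

\theta^{\text{new}} = \theta^{\text{old}} - \alpha \, \nabla_{\theta} J_t(\theta)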

Page 28:

Mini-batch Stochastic Gradient Descent (SGD)

• Gradient descent uses total gradient over all examples per update, SGD updates after only 1 example

• Most commonly used now: Minibatches (a minimal training-loop sketch follows below)

• Size of each minibatch B: 20 to 1000

• Helps parallelizing any model by computing gradients for multiple elements of the batch in parallel
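An illustrative minibatch SGD loop (my own sketch; loss_and_grad is a hypothetical function returning the loss and gradient on a batch):

import numpy as np

def minibatch_sgd(theta, data, loss_and_grad, alpha=0.01, B=50, epochs=5):
    """Plain minibatch SGD; loss_and_grad(theta, batch) -> (loss, gradient)."""
    n = len(data)
    for _ in range(epochs):
        order = np.random.permutation(n)            # shuffle each epoch
        for start in range(0, n, B):
            batch = [data[i] for i in order[start:start + B]]
            _, grad = loss_and_grad(theta, batch)
            theta = theta - alpha * grad            # one SGD step per minibatch
    return theta

# Toy usage: move theta toward the mean of a few points
data = [np.array([1.0, 2.0]), np.array([3.0, 0.0]), np.array([2.0, 2.0])]
f = lambda theta, batch: (np.mean([np.sum((theta - x) ** 2) for x in batch]),
                          sum(2 * (theta - x) for x in batch) / len(batch))
print(minibatch_sgd(np.zeros(2), data, f, alpha=0.1, B=2, epochs=200))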

Page 29:

Improvement over SGD: Momentum

• Idea: Add a fraction v of previous update to current one
• When the gradient keeps pointing in the same direction, this will increase the size of the steps taken towards the minimum
• Reduce global learning rate α when using a lot of momentum

• Update rule (a sketch follows below):

• v is initialized at 0
• Common: μ = 0.9
• Momentum often increased after some epochs (0.5 → 0.99)
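The update rule is not legible in the transcript; a standard classical-momentum formulation consistent with the bullets above (v is the velocity, μ the momentum, α the learning rate; my own sketch) is:

import numpy as np

def momentum_step(theta, v, grad, alpha=0.01, mu=0.9):
    """One SGD-with-momentum update; v is the running velocity (initialized at 0)."""
    v = mu * v - alpha * grad       # accumulate a fraction of the previous update
    return theta + v, v

theta, v = np.zeros(3), np.zeros(3)
theta, v = momentum_step(theta, v, grad=np.array([1.0, -2.0, 0.5]))
print(theta, v)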

Page 30:

Intuition: Momentum

• Adds friction (momentum ~ misnomer)

• Parameters build up velocity in direction of consistent gradient

• Simple convex function optimization dynamics, without momentum vs. with momentum:

Figure from https://www.willamette.edu/~gorr/classes/cs449/momrate.html

Page 31:

Learning Rates

• Simplest recipe: keep it fixed and use the same for all parameters. Standard:

• Better results by allowing learning rates to decrease. Options:
  • Reduce by 0.5 when validation error stops improving
  • Reduction by O(1/t) because of theoretical convergence guarantees, e.g. with hyper-parameters ε₀ and τ, where t is the iteration number (one common form is sketched below)

• Better yet: No hand-set learning of rates by using AdaGrad →
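The O(1/t) schedule referred to here is not shown in the transcript; one common form with these hyper-parameters (an assumption on my part, following Bengio 2012's recommendations) is:

\varepsilon_t = \frac{\varepsilon_0 \, \tau}{\max(t, \tau)}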

Page 32:

Adagrad

• Adaptive learning rates for each parameter!
• Related paper: Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, Duchi et al. 2010

• Learning rate is adapting differently for each parameter and rare parameters get larger updates than frequently occurring parameters. Word vectors!

• Let g_{t,i} = ∂J_t/∂θ_i and G_t = Σ_{τ=1}^{t} g_τ g_τᵀ, then:

  θ_{t,i} = θ_{t−1,i} − (α / √(Σ_{τ=1}^{t} g²_{τ,i})) · g_{t,i}

[Paper excerpt shown on the slide:]

3.6 Subgradient Methods and AdaGrad
The objective function is not differentiable due to the hinge loss. Therefore, we generalize gradient ascent via the subgradient method (Ratliff et al., 2007) which computes a gradient-like direction. Let θ = (X, W(··), v(··)) ∈ R^M be a vector of all M model parameters, where we denote W(··) as the set of matrices that appear in the training set. The subgradient of Eq. 3 becomes:

∂J/∂θ = Σ_i [ ∂s(x_i, y_max)/∂θ − ∂s(x_i, y_i)/∂θ ] + λθ,

where y_max is the tree with the highest score. To minimize the objective, we use the diagonal variant of AdaGrad (Duchi et al., 2011) with minibatches. For our parameter updates, we first define g_τ ∈ R^{M×1} to be the subgradient at time step τ and G_t = Σ_{τ=1}^{t} g_τ g_τᵀ. The parameter update at time step t then becomes:

θ_t = θ_{t−1} − α (diag(G_t))^{−1/2} g_t,   (7)

where α is the learning rate. Since we use the diagonal of G_t, we only have to store M values and the update becomes fast to compute: at time step t, the update for the i-th parameter θ_{t,i} is:

θ_{t,i} = θ_{t−1,i} − (α / √(Σ_{τ=1}^{t} g²_{τ,i})) g_{t,i}.   (8)

Hence, the learning rate is adapting differently for each parameter and rare parameters get larger updates than frequently occurring parameters. This is helpful in our setting since some W matrices appear in only a few training trees. This procedure found much better optima (by ≈3% labeled F1 on the dev set), and converged more quickly than L-BFGS which we used previously in RNN training (Socher et al., 2011a). Training time is roughly 4 hours on a single machine.

3.7 Initialization of Weight Matrices
In the absence of any knowledge on how to combine two categories, our prior for combining two vectors is to average them instead of performing a completely random projection. Hence, we initialize the binary W matrices with:

W(··) = 0.5 [I_{n×n}  I_{n×n}  0_{n×1}] + ε,

where we include the bias in the last column and the random variable is uniformly distributed: ε ∼ U[−0.001, 0.001]. The first block is multiplied by the left child and the second by the right child:

W(AB) [a; b; 1] = [W(A)  W(B)  bias] [a; b; 1] = W(A) a + W(B) b + bias.

4 Experiments
We evaluate the CVG in two ways: first, by a standard parsing evaluation on Penn Treebank WSJ and then by analyzing the model errors in detail.

4.1 Cross-validating Hyperparameters
We used the first 20 files of WSJ section 22 to cross-validate several model and optimization choices. The base PCFG uses simplified categories of the Stanford PCFG Parser (Klein and Manning, 2003a). We decreased the state splitting of the PCFG grammar (which helps both by making it less sparse and by reducing the number of parameters in the SU-RNN) by adding the following options to training: ‘-noRightRec -dominatesV 0 -baseNP 0’. This reduces the number of states from 15,276 to 12,061 states and 602 POS tags. These include split categories, such as parent annotation categories like VP^S. Furthermore, we ignore all category splits for the SU-RNN weights, resulting in 66 unary and 882 binary child pairs. Hence, the SU-RNN has 66+882 transformation matrices and scoring vectors. Note that any PCFG, including latent annotation PCFGs (Matsuzaki et al., 2005) could be used. However, since the vectors will capture lexical and semantic information, even simple base PCFGs can be substantially improved. Since the computational complexity of PCFGs depends on the number of states, a base PCFG with fewer states is much faster.

Testing on the full WSJ section 22 dev set (1700 sentences) takes roughly 470 seconds with the simple base PCFG, 1320 seconds with our new CVG and 1600 seconds with the currently published Stanford factored parser. Hence, increased performance comes also with a speed improvement of approximately 20%.

We fix the same regularization of λ = 10⁻⁴ for all parameters. The minibatch size was set to 20. We also cross-validated on AdaGrad's learning rate which was eventually set to α = 0.1 and word vector size. The 25-dimensional vectors provided by Turian et al. (2010) provided the best …

Page 33:

General Strategy

1. Select appropriate Network Structure
   1. Structure: Single words, fixed windows vs. Recursive Sentence Based vs. Bag of words
   2. Nonlinearity
2. Check for implementation bugs with gradient check
3. Parameter initialization
4. Optimization tricks
5. Check if the model is powerful enough to overfit
   1. If not, change model structure or make model “larger”
   2. If you can overfit: Regularize

Assuming you found the right network structure, implemented it correctly, optimized it properly and you can make your model overfit on your training data:

Now, it's time to regularize

Page 34:

Prevent Overfitting: Model Size and Regularization

• Simple first step: Reduce model size by lowering number of units and layers and other parameters

• Standard L1 or L2 regularization on weights

• Early Stopping: Use parameters that gave best validation error

• Sparsity constraints on hidden activations, e.g., add to cost:

Page 35:

Prevent Feature Co-adaptation

Dropout (Hinton et al. 2012)
• Training time: at each instance of evaluation (in online SGD-training), randomly set 50% of the inputs to each neuron to 0
• Test time: halve the model weights (now twice as many)
• This prevents feature co-adaptation: A feature cannot be useful only in the presence of particular other features
• In a single layer: A kind of middle-ground between Naïve Bayes (where all feature weights are set independently) and logistic regression models (where weights are set in the context of all others)
• Can be thought of as a form of model bagging
• It also acts as a strong regularizer
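A minimal sketch of this train/test recipe (my own illustration: 50% dropout on a layer's inputs at training time, halved weights at test time):

import numpy as np

def layer_train(x, W, rng):
    mask = rng.random(x.shape) < 0.5        # randomly keep 50% of the inputs
    return np.tanh(W @ (x * mask))

def layer_test(x, W):
    return np.tanh((0.5 * W) @ x)           # halve the weights at test time

rng = np.random.default_rng(0)
x = rng.standard_normal(10)
W = rng.standard_normal((4, 10))
print(layer_train(x, W, rng), layer_test(x, W))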

Page 36:

Deep Learning Tricks of the Trade

• Y. Bengio (2012), “Practical Recommendations for Gradient-Based Training of Deep Architectures”
  • Unsupervised pre-training
  • Stochastic gradient descent and setting learning rates
  • Main hyper-parameters
    • Learning rate schedule & early stopping, Minibatches, Parameter initialization, Number of hidden units, regularization (= weight decay)
  • How to efficiently search for hyper-parameter configurations
    • Short answer: Random hyperparameter search (!)

• Some more advanced and recent tricks in later lectures

Page 37:

Language Models

Page 38:

Language Models

A language model computes a probability for a sequence of words:

Probability is usually conditioned on a window of n previous words:

Very useful for a lot of tasks:

Can be used to determine whether a sequence is a good / grammatical translation or speech utterance
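The probability formulas on this slide are images in the original; the standard forms they refer to (my reconstruction, not copied from the slide) are:

P(w_1, \ldots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \ldots, w_{t-1}) \approx \prod_{t=1}^{T} P(w_t \mid w_{t-n+1}, \ldots, w_{t-1})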

Page 39:

Original neural language model

A Neural Probabilistic Language Model, Bengio et al. 2003

Original equations:

Problem: Fixed window of context for conditioning

[Excerpt from Bengio et al. 2003, shown on the slide:]

Figure 1: Neural architecture: f(i, w_{t−1}, …, w_{t−n+1}) = g(i, C(w_{t−1}), …, C(w_{t−n+1})) where g is the neural network and C(i) is the i-th word feature vector. [The figure shows the look-up table C with parameters shared across words, a tanh hidden layer where most computation happens, and a softmax output layer whose i-th output is P(w_t = i | context).]

Parameters of the mapping C are simply the feature vectors themselves, represented by a |V|×m matrix C whose row i is the feature vector C(i) for word i. The function g may be implemented by a feed-forward or recurrent neural network or another parametrized function, with parameters ω. The overall parameter set is θ = (C, ω).

Training is achieved by looking for θ that maximizes the training corpus penalized log-likelihood:

L = (1/T) Σ_t log f(w_t, w_{t−1}, …, w_{t−n+1}; θ) + R(θ),

where R(θ) is a regularization term. For example, in our experiments, R is a weight decay penalty applied only to the weights of the neural network and to the C matrix, not to the biases. (The biases are the additive parameters of the neural network, such as b and d in equation 1 below.)

In the above model, the number of free parameters only scales linearly with V, the number of words in the vocabulary. It also only scales linearly with the order n: the scaling factor could be reduced to sub-linear if more sharing structure were introduced, e.g. using a time-delay neural network or a recurrent neural network (or a combination of both).

In most experiments below, the neural network has one hidden layer beyond the word features mapping, and optionally, direct connections from the word features to the output. Therefore there are really two hidden layers: the shared word features layer C, which has no non-linearity (it would not add anything useful), and the ordinary hyperbolic tangent hidden layer. More precisely, the neural network computes the following function, with a softmax output layer, which guarantees positive probabilities summing to 1:

P(w_t | w_{t−1}, …, w_{t−n+1}) = e^{y_{w_t}} / Σ_i e^{y_i}.

The y_i are the unnormalized log-probabilities for each output word i, computed as follows, with parameters b, W, U, d and H:

y = b + Wx + U tanh(d + Hx)   (1)

where the hyperbolic tangent tanh is applied element by element, W is optionally zero (no direct connections), and x is the word features layer activation vector, which is the concatenation of the input word features from the matrix C:

x = (C(w_{t−1}), C(w_{t−2}), …, C(w_{t−n+1})).

Let h be the number of hidden units, and m the number of features associated with each word. When no direct connections from word features to outputs are desired, the matrix W is set to 0. The free parameters of the model are the output biases b (with |V| elements), the hidden layer biases d (with h elements), the hidden-to-output weights U (a |V|×h matrix), the word features to output weights W (a |V|×(n−1)m matrix), the hidden layer weights H (an h×(n−1)m matrix), and the word features C (a |V|×m matrix):

θ = (b, d, W, U, H, C).

The number of free parameters is |V|(1 + nm + h) + h(1 + (n−1)m). The dominating factor is |V|(nm + h). Note that in theory, if there is a weight decay on the weights W and H but not on C, then W and H could converge towards zero while C would blow up. In practice we did not observe such behavior when training with stochastic gradient ascent.

Stochastic gradient ascent on the neural network consists in performing the following iterative update after presenting the t-th word of the training corpus:

θ ← θ + ε ∂ log P(w_t | w_{t−1}, …, w_{t−n+1}) / ∂θ

where ε is the “learning rate”. Note that a large fraction of the parameters needs not be updated or visited after each example: the word features C(j) of all words j that do not occur in the input window.

Mixture of models. In our experiments (see Section 4) we have found improved performance by combining the probability predictions of the neural network with those of an interpolated trigram model, either with a simple fixed weight of 0.5, a learned weight (maximum likelihood on the validation set) or a set of weights that are conditional on the frequency of the context (using the same procedure that combines trigram, bigram, and unigram in the interpolated trigram, which is a mixture).

3. Parallel Implementation

Although the number of parameters scales nicely, i.e. linearly with the size of the input window and linearly with the size of the vocabulary, the amount of computation required for obtaining the output probabilities is much greater than that required from n-gram models. The main reason is that with n-gram models, obtaining a particular P(w_t | w_{t−1}, …, w_{t−n+1}) does not require the computation of the probabilities for all the words in the vocabulary, because of the easy normalization (performed when training the model) enjoyed by the linear combinations of relative frequencies. The main computational bottleneck with the neural implementation is the computation of the activations of the output layer.

Page 40:

Recurrent Neural Networks

Page 41:

Recurrent Neural Networks!

Solution: Condition the neural network on all previous words and tie the weights at each time step

[Figure: an unrolled RNN with inputs x_{t−1}, x_t, x_{t+1}, hidden states h_{t−1}, h_t, h_{t+1}, and the same weights W reused between consecutive hidden states]

Page 42:

Recurrent Neural Network language model

Given list of word vectors:

At a single time step:

Page 43:

Recurrent Neural Network language model

Main idea: we use the same set of W weights at all time steps!

Everything else is the same:

h₀ is some initialization vector for the hidden layer at time step 0

x_t is the column vector of L at index [t] at time step t

Page 44:

Recurrent Neural Network language model

ŷ_t is a probability distribution over the vocabulary

Same cross entropy loss function but predicting words instead of classes
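The slide's equations are images in this transcript; a minimal numpy sketch of the standard RNN language-model step they describe (h_t from h_{t−1} and x_t, softmax ŷ_t over the vocabulary, cross-entropy loss; the weight names are my own) is:

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def rnn_lm_step(h_prev, x_t, W_hh, W_hx, W_s, target):
    """One time step of a simple RNN language model."""
    h_t = np.tanh(W_hh @ h_prev + W_hx @ x_t)   # hidden state, same weights at every step
    y_hat = softmax(W_s @ h_t)                  # distribution over the vocabulary
    loss = -np.log(y_hat[target])               # cross-entropy against the true next word
    return h_t, y_hat, loss

# Tiny example: hidden size 8, word vectors of size 5, vocabulary of 20 words
rng = np.random.default_rng(0)
W_hh = rng.standard_normal((8, 8)) * 0.1
W_hx = rng.standard_normal((8, 5)) * 0.1
W_s = rng.standard_normal((20, 8)) * 0.1
h, x = np.zeros(8), rng.standard_normal(5)
h, y_hat, loss = rnn_lm_step(h, x, W_hh, W_hx, W_s, target=3)
print(loss)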

Page 45:

Recurrent Neural Network language model

Evaluation could just be negative of average log probability over dataset of size (number of words) T:

But more commonly: Perplexity: 2^J

Lower is better!
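The evaluation formula is an image in the original; a standard form consistent with the text (J is the average negative log probability in base 2, so that perplexity is 2^J; my reconstruction) is:

J = -\frac{1}{T} \sum_{t=1}^{T} \log_2 P(w_t \mid w_1, \ldots, w_{t-1}), \qquad \text{Perplexity} = 2^{J}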

Page 46:

Training RNNs is hard

• The gradient is a product of Jacobian matrices, each associated with a step in the forward computation.

• Multiply the same matrix at each time step during backprop

• This can become very small or very large quickly [Bengio et al 1994], and the locality assumption of gradient descent breaks down. → Vanishing or exploding gradient
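The Jacobian-product expression is an image in the original; for a standard RNN h_t = f(W^{(hh)} h_{t-1} + W^{(hx)} x_t) it has the form (my reconstruction, not copied from the slide):

\frac{\partial E_t}{\partial h_k} = \frac{\partial E_t}{\partial h_t} \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}}, \qquad \frac{\partial h_j}{\partial h_{j-1}} = \operatorname{diag}\big(f'(z_j)\big)\, W^{(hh)}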

[Derivation shown on the slide:]

The second derivative in eq. 28 for output units is simply

∂a_i^(n_l) / ∂W_ij^(n_l−1) = ∂/∂W_ij^(n_l−1) z_i^(n_l) = ∂/∂W_ij^(n_l−1) ( W_{i·}^(n_l−1) a^(n_l−1) ) = a_j^(n_l−1).   (46)

We adopt standard notation and introduce the error δ related to an output unit:

∂E_n / ∂W_ij^(n_l−1) = (y_i − t_i) a_j^(n_l−1) = δ_i^(n_l) a_j^(n_l−1).   (47)

So far, we only computed errors for output units, now we will derive δ's for normal hidden units and show how these errors are backpropagated to compute weight derivatives of lower levels. We will start with second to top layer weights from which a generalization to arbitrarily deep layers will become obvious. Similar to eq. 28, we start with the error derivative:

∂E / ∂W_ij^(n_l−2) = Σ_n (∂E_n / ∂a^(n_l)) · (∂a^(n_l) / ∂W_ij^(n_l−2)) + λ W_ji^(n_l−2),   (48)

where the first factor is δ^(n_l). Now,

(δ^(n_l))ᵀ ∂a^(n_l)/∂W_ij^(n_l−2)
  = (δ^(n_l))ᵀ ∂z^(n_l)/∂W_ij^(n_l−2)   (49)
  = (δ^(n_l))ᵀ ∂/∂W_ij^(n_l−2) [ W^(n_l−1) a^(n_l−1) ]   (50)
  = (δ^(n_l))ᵀ ∂/∂W_ij^(n_l−2) [ W_{·i}^(n_l−1) a_i^(n_l−1) ]   (51)
  = (δ^(n_l))ᵀ W_{·i}^(n_l−1) ∂/∂W_ij^(n_l−2) a_i^(n_l−1)   (52)
  = (δ^(n_l))ᵀ W_{·i}^(n_l−1) ∂/∂W_ij^(n_l−2) f(z_i^(n_l−1))   (53)
  = (δ^(n_l))ᵀ W_{·i}^(n_l−1) ∂/∂W_ij^(n_l−2) f( W_{i·}^(n_l−2) a^(n_l−2) )   (54)
  = (δ^(n_l))ᵀ W_{·i}^(n_l−1) f′(z_i^(n_l−1)) a_j^(n_l−2)   (55)
  = ( (δ^(n_l))ᵀ W_{·i}^(n_l−1) ) f′(z_i^(n_l−1)) a_j^(n_l−2)   (56)
  = ( Σ_{j=1}^{s_{l+1}} W_ji^(n_l−1) δ_j^(n_l) ) f′(z_i^(n_l−1)) a_j^(n_l−2)   (57)
  = δ_i^(n_l−1) a_j^(n_l−2),   (58)

where we used in the first line that the top layer is linear. This is a very detailed account of essentially just the chain rule.

So, we can write the δ errors of all layers l (except the top layer) (in vector format, using the Hadamard product ∘):

δ^(l) = ( (W^(l))ᵀ δ^(l+1) ) ∘ f′(z^(l)).   (59)

Page 47:

Initialization trick for RNNs!

• Initialize W(hh) to be the identity matrix I and f(z) = rect(z) = max(z, 0)

• → Huge difference!

• Initialization idea first introduced in Parsing with Compositional Vector Grammars, Socher et al. 2013

• New experiments with recurrent neural nets last week (!) in A Simple Way to Initialize Recurrent Networks of Rectified Linear Units, Le et al. 2015

[Excerpt from Le et al. 2015, shown on the slide:]

Table 1: Best hyperparameters found for adding problems after grid search. lr is the learning rate, gc is gradient clipping, and fb is forget gate bias. N/A is when there is no hyperparameter combination that gives good result.

  T    LSTM                             RNN + Tanh            IRNN
  150  lr = 0.01, gc = 10, fb = 1.0     lr = 0.01, gc = 100   lr = 0.01, gc = 100
  200  lr = 0.001, gc = 100, fb = 4.0   N/A                   lr = 0.01, gc = 1
  300  lr = 0.01, gc = 1, fb = 4.0      N/A                   lr = 0.01, gc = 10
  400  lr = 0.01, gc = 100, fb = 10.0   N/A                   lr = 0.01, gc = 1

4.2 MNIST Classification from a Sequence of Pixels

Another challenging toy problem is to learn to classify the MNIST digits [21] when the 784 pixels are presented sequentially to the recurrent net. In our experiments, the networks read one pixel at a time in scanline order (i.e. starting at the top left corner of the image, and ending at the bottom right corner). The networks are asked to predict the category of the MNIST image only after seeing all 784 pixels. This is therefore a huge long range dependency problem because each recurrent network has 784 time steps.

To make the task even harder, we also used a fixed random permutation of the pixels of the MNIST digits and repeated the experiments.

All networks have 100 recurrent hidden units. We stop the optimization after it converges or when it reaches 1,000,000 iterations and report the results in figure 3 (best hyperparameters are listed in table 2).

Figure 3: The results of recurrent methods on the “pixel-by-pixel MNIST” problem. We report the test set accuracy for all methods. Left: normal MNIST. Right: permuted MNIST. [Plots of test accuracy vs. training steps for LSTM, RNN + Tanh, RNN + ReLUs, and IRNN.]

Table 2: Best hyperparameters found for pixel-by-pixel MNIST problems after grid search. lr is the learning rate, gc is gradient clipping, and fb is the forget gate bias.

  Problem          LSTM                          RNN + Tanh          RNN + ReLUs         IRNN
  MNIST            lr = 0.01, gc = 1, fb = 1.0   lr = 10⁻⁸, gc = 10   lr = 10⁻⁸, gc = 10   lr = 10⁻⁸, gc = 1
  permuted MNIST   lr = 0.01, gc = 1, fb = 1.0   lr = 10⁻⁸, gc = 1    lr = 10⁻⁶, gc = 10   lr = 10⁻⁹, gc = 1

The results using the standard scanline ordering of the pixels show that this problem is so difficult that standard RNNs fail to work, even with ReLUs, whereas the IRNN achieves 3% test error rate which is better than most off-the-shelf linear classifiers [21]. We were surprised that the LSTM did not work as well as IRNN given the various initialization schemes that we tried. While it still possible that a better tuned LSTM would do better, the fact that the IRNN perform well is encouraging.

Page 48:

Long-Term dependencies and clipping trick

• The solution first introduced by Mikolov is to clip gradients to a maximum value.

• Makes a big difference in RNNs.
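A minimal numpy sketch of the norm-clipping rule in Algorithm 1 of the excerpt below (my own illustration; the threshold value is arbitrary):

import numpy as np

def clip_gradient(g, threshold=5.0):
    """Rescale the gradient g if its norm exceeds the threshold."""
    norm = np.linalg.norm(g)
    if norm >= threshold:
        g = (threshold / norm) * g
    return g

g = np.array([3.0, 4.0, 12.0])                               # norm = 13
print(clip_gradient(g), np.linalg.norm(clip_gradient(g)))    # rescaled to norm 5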

[Excerpt from Pascanu et al. 2013, “On the difficulty of training Recurrent Neural Networks”, shown on the slide:]

… region of space. It has been shown that in practice it can reduce the chance that gradients explode, and even allow training generator models or models that work with unbounded amounts of memory (Pascanu and Jaeger, 2011; Doya and Yoshizawa, 1991). One important downside is that it requires a target to be defined at every time step.

In Hochreiter and Schmidhuber (1997); Graves et al. (2009) a solution is proposed for the vanishing gradients problem, where the structure of the model is changed. Specifically it introduces a special set of units called LSTM units which are linear and have a recurrent connection to itself which is fixed to 1. The flow of information into the unit and from the unit is guarded by an input and output gates (their behaviour is learned). There are several variations of this basic structure. This solution does not address explicitly the exploding gradients problem.

Sutskever et al. (2011) use the Hessian-Free optimizer in conjunction with structural damping, a specific damping strategy of the Hessian. This approach seems to deal very well with the vanishing gradient, though more detailed analysis is still missing. Presumably this method works because in high dimensional spaces there is a high probability for long term components to be orthogonal to short term ones. This would allow the Hessian to rescale these components independently. In practice, one can not guarantee that this property holds. As discussed in section 2.3, this method is able to deal with the exploding gradient as well. Structural damping is an enhancement that forces the change in the state to be small, when the parameter changes by some small value Δθ. This asks for the Jacobian matrices ∂x_t/∂θ to have small norm, hence further helping with the exploding gradients problem. The fact that it helps when training recurrent neural models on long sequences suggests that while the curvature might explode at the same time with the gradient, it might not grow at the same rate and hence not be sufficient to deal with the exploding gradient.

Echo State Networks (Lukosevicius and Jaeger, 2009) avoid the exploding and vanishing gradients problem by not learning the recurrent and input weights. They are sampled from hand crafted distributions. Because usually the largest eigenvalue of the recurrent weight is, by construction, smaller than 1, information fed in to the model has to die out exponentially fast. This means that these models can not easily deal with long term dependencies, even though the reason is slightly different from the vanishing gradients problem. An extension to the classical model is represented by leaky integration units (Jaeger et al., 2007), where

x_k = α x_{k−1} + (1 − α) σ(W_rec x_{k−1} + W_in u_k + b).

While these units can be used to solve the standard benchmark proposed by Hochreiter and Schmidhuber (1997) for learning long term dependencies (see (Jaeger, 2012)), they are more suitable to deal with low frequency information as they act as a low pass filter. Because most of the weights are randomly sampled, is not clear what size of models one would need to solve complex real world tasks.

We would make a final note about the approach proposed by Tomas Mikolov in his PhD thesis (Mikolov, 2012) (and implicitly used in the state of the art results on language modelling (Mikolov et al., 2011)). It involves clipping the gradient's temporal components element-wise (clipping an entry when it exceeds in absolute value a fixed threshold). Clipping has been shown to do well in practice and it forms the backbone of our approach.

3.2. Scaling down the gradients

As suggested in section 2.3, one simple mechanism to deal with a sudden increase in the norm of the gradients is to rescale them whenever they go over a threshold (see algorithm 1).

Algorithm 1: Pseudo-code for norm clipping the gradients whenever they explode
  ĝ ← ∂E/∂θ
  if ‖ĝ‖ ≥ threshold then
    ĝ ← (threshold / ‖ĝ‖) ĝ
  end if

This algorithm is very similar to the one proposed by Tomas Mikolov and we only diverged from the original proposal in an attempt to provide a better theoretical foundation (ensuring that we always move in a descent direction with respect to the current mini-batch), though in practice both variants behave similarly.

The proposed clipping is simple to implement and computationally efficient, but it does however introduce an additional hyper-parameter, namely the threshold. One good heuristic for setting this threshold is to look at statistics on the average norm over a sufficiently large number of updates. In our experiments we have noticed that for a given task and model size, training is not very sensitive to this hyper-parameter and the algorithm behaves well even for rather small thresholds.

The algorithm can also be thought of as adapting the learning rate based on the norm of the gradient. Compared to other learning rate adaptation strategies, which focus on improving convergence by collecting statistics on the gradient (as for example in …

Page 49:

Gradient clipping intuition

• Error surface of a single hidden unit RNN

• High curvature walls

• Solid lines: standard gradient descent trajectories

• Dashed lines: gradients rescaled to fixed size

[Excerpt from Pascanu et al. 2013, shown on the slide:]

Figure 6: We plot the error surface of a single hidden unit recurrent network, highlighting the existence of high curvature walls. The solid lines depict standard trajectories that gradient descent might follow. Using dashed arrows the diagram shows what would happen if the gradients is rescaled to a fixed size when its norm is above a threshold.

… explode so does the curvature along v, leading to a wall in the error surface, like the one seen in Fig. 6. If this holds, then it gives us a simple solution to the exploding gradients problem depicted in Fig. 6.

If both the gradient and the leading eigenvector of the curvature are aligned with the exploding direction v, it follows that the error surface has a steep wall perpendicular to v (and consequently to the gradient). This means that when stochastic gradient descent (SGD) reaches the wall and does a gradient descent step, it will be forced to jump across the valley moving perpendicular to the steep walls, possibly leaving the valley and disrupting the learning process.

The dashed arrows in Fig. 6 correspond to ignoring the norm of this large step, ensuring that the model stays close to the wall. The key insight is that all the steps taken when the gradient explodes are aligned with v and ignore other descent direction (i.e. the model moves perpendicular to the wall). At the wall, a small-norm step in the direction of the gradient therefore merely pushes us back inside the smoother low-curvature region besides the wall, whereas a regular gradient step would bring us very far, thus slowing or preventing further training. Instead, with a bounded step, we get back in that smooth region near the wall where SGD is free to explore other descent directions.

The important addition in this scenario to the classical high curvature valley, is that we assume that the valley is wide, as we have a large region around the wall where if we land we can rely on first order methods to move towards the local minima. This is why just clipping the gradient might be sufficient, not requiring the use a second order method. Note that this algorithm should work even when the rate of growth of the gradient is not the same as the one of the curvature (a case for which a second order method would fail as the ratio between the gradient and curvature could still explode).

Our hypothesis could also help to understand the recent success of the Hessian-Free approach compared to other second order methods. There are two key differences between Hessian-Free and most other second-order algorithms. First, it uses the full Hessian matrix and hence can deal with exploding directions that are not necessarily axis-aligned. Second, it computes a new estimate of the Hessian matrix before each update step and can take into account abrupt changes in curvature (such as the ones suggested by our hypothesis) while most other approaches use a smoothness assumption, i.e., averaging 2nd order signals over many steps.

3. Dealing with the exploding and vanishing gradient

3.1. Previous solutions

Using an L1 or L2 penalty on the recurrent weights can help with exploding gradients. Given that the parameters initialized with small values, the spectral radius of W_rec is probably smaller than 1, from which it follows that the gradient can not explode (see necessary condition found in section 2.1). The regularization term can ensure that during training the spectral radius never exceeds 1. This approach limits the model to a simple regime (with a single point attractor at the origin), where any information inserted in the model has to die out exponentially fast in time. In such a regime we can not train a generator network, nor can we exhibit long term memory traces.

Doya (1993) proposes to pre-program the model (to initialize the model in the right regime) or to use teacher forcing. The first proposal assumes that if the model exhibits from the beginning the same kind of asymptotic behaviour as the one required by the target, then there is no need to cross a bifurcation boundary. The downside is that one can not always know the required asymptotic behaviour, and, even if such information is known, it is not trivial to initialize a model in this specific regime. We should also note that such initialization does not prevent crossing the boundary between basins of attraction, which, as shown, could happen even though no bifurcation boundary is crossed.

Teacher forcing is a more interesting, yet a not very well understood solution. It can be seen as a way of initializing the model in the right regime and the right …

On the difficulty of training Recurrent Neural Networks, Pascanu et al. 2013

Page 50:

Summary

Tips and tricks to become a deep neural net ninja

Introduction to Recurrent Neural Networks

Next week:

→ Lecture on TensorFlow for practical implementations, PSet and project

→ More RNN details and variants (LSTMs and GRUs)

→ Exciting times!