
CS224d Deep NLP

Lecture 5: Project Information + Neural Networks & Backprop

Richard Socher
richard@metamind.io

Overview Today:

• Organizational Stuff
• Project Tips
• From one-layer to multilayer neural network!
• Max-Margin loss and backprop! (This is the hardest lecture of the quarter)

4/12/16 Richard Socher Lecture 5, Slide 2

Announcement:

• 1% extra credit for Piazza participation!
• Hint for PSet 1: Understand the math and dimensionality, then add print statements, e.g. shape checks (see the sketch after this list)
• Student survey sent out last night, please give us feedback to improve the class :)
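A minimal sketch of the kind of debugging prints meant here (variable names and shapes are made up, not the actual PSet code):

```python
import numpy as np

# Hypothetical shapes for a window-based classifier.
W = np.random.randn(8, 15)   # hidden-layer weights
x = np.random.randn(15)      # concatenated window of word vectors
b = np.random.randn(8)       # bias

z = W.dot(x) + b
print("W:", W.shape, "x:", x.shape, "b:", b.shape)   # check dimensionality before debugging the math
print("z:", z.shape)                                  # should be (8,)
```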

4/12/16 Richard Socher Lecture 5, Slide 3

Class Project

• Most important (40%) and lasting result of the class
• PSet 3 a little easier to have more time
• Start early and clearly define your task and dataset

• Project types:
  1. Apply existing neural network model to a new task
  2. Implement a complex neural architecture
  3. Come up with a new neural network model
  4. Theory

4/12/16 Richard Socher Lecture 5, Slide 4

Class Project: Apply Existing NNets to Tasks

1. Define Task:
   • Example: Summarization

2. Define Dataset
   1. Search for academic datasets
      • They already have baselines
      • E.g.: Document Understanding Conference (DUC)
   2. Define your own (harder, need more new baselines)
      • If you're a graduate student: connect to your research
      • Summarization, Wikipedia: Intro paragraph and rest of large article
      • Be creative: Twitter, Blogs, News

4/12/16 Richard Socher Lecture 5, Slide 5

Class Project: Apply Existing NNets to Tasks

3. Define your metric
   • Search online for well-established metrics on this task
   • Summarization: ROUGE (Recall-Oriented Understudy for Gisting Evaluation), which measures n-gram overlap with human summaries (see the toy sketch after this list)

4. Split your dataset!
   • Train/Dev/Test
   • Academic datasets often come pre-split
   • Don't look at the test split until ~1 week before the deadline!
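As an illustration of the idea behind ROUGE-1 recall (a toy sketch of unigram overlap, not the official ROUGE toolkit):

```python
from collections import Counter

def rouge1_recall(candidate: str, reference: str) -> float:
    """Unigram overlap with the human reference, recall-oriented (illustrative only)."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum(min(cand_counts[w], ref_counts[w]) for w in ref_counts)
    return overlap / max(sum(ref_counts.values()), 1)

print(rouge1_recall("museums in paris are amazing", "paris museums are amazing"))  # 1.0
```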

4/12/16 Richard Socher Lecture 5, Slide 6

Class Project: Apply Existing NNets to Tasks

5. Establish a baseline
   • Implement the simplest model (often logistic regression on unigrams and bigrams) first (see the sketch after this list)
   • Compute metrics on train AND dev
   • Analyze errors
   • If metrics are amazing and no errors: done, problem was too easy, restart :)

6. Implement existing neural net model
   • Compute metric on train and dev
   • Analyze output and errors
   • Minimum bar for this class
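A minimal sketch of such a unigram+bigram logistic-regression baseline, assuming scikit-learn is available; the tiny texts and labels are stand-ins for your own train/dev splits:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for a real dataset.
train_texts = ["museums in paris are amazing", "not all museums in paris"]
train_labels = [1, 0]
dev_texts, dev_labels = ["paris museums are amazing"], [1]

# Unigrams and bigrams as sparse count features, then logistic regression.
baseline = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), min_df=2),
    LogisticRegression(max_iter=1000),
)
baseline.fit(train_texts, train_labels)
print("train acc:", baseline.score(train_texts, train_labels))
print("dev acc:  ", baseline.score(dev_texts, dev_labels))
```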

4/12/16 Richard Socher Lecture 5, Slide 7

Class Project: Apply Existing NNets to Tasks

7. Always be close to your data!
   • Visualize the dataset
   • Collect summary statistics
   • Look at errors
   • Analyze how different hyperparameters affect performance

8. Try out different model variants
   • Soon you will have more options
   • Word vector averaging model (neural bag of words) — see the sketch after this list
   • Fixed window neural model
   • Recurrent neural network
   • Recursive neural network
   • Convolutional neural network
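A minimal sketch of the word-vector-averaging ("neural bag of words") variant, with a toy vocabulary and random vectors standing in for trained embeddings:

```python
import numpy as np

vocab = {"museums": 0, "in": 1, "paris": 2, "are": 3, "amazing": 4}   # toy vocabulary
E = np.random.randn(len(vocab), 50)                                   # stand-in for trained word vectors

def nbow_features(sentence):
    """Average the word vectors of the in-vocabulary tokens (zeros if none are found)."""
    ids = [vocab[w] for w in sentence.lower().split() if w in vocab]
    return E[ids].mean(axis=0) if ids else np.zeros(E.shape[1])

# The averaged vector can then be fed to any classifier, e.g. logistic regression or a small MLP.
print(nbow_features("Museums in Paris are amazing").shape)   # (50,)
```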

4/12/16 Richard Socher Lecture 5, Slide 8

Class Project: A New Model -- Advanced Option

• Do all other steps first (Start early!)
• Gain intuition of why existing models are flawed
• Talk to other researchers, come to my office hours a lot
• Implement new models and iterate quickly over ideas
• Set up an efficient experimental framework
• Build simpler new models first
• Example Summarization:
  • Average word vectors per paragraph, then greedy search
  • Implement language model or autoencoder (introduced later)
  • Stretch goal for potential paper: Generate summary!

4/12/16 Richard Socher Lecture 5, Slide 9

Project Ideas

• Summarization
• NER, like PSet 2 but with larger data
  Natural Language Processing (Almost) from Scratch, Ronan Collobert, Jason Weston, Leon Bottou, Michael Karlen, Koray Kavukcuoglu, Pavel Kuksa, http://arxiv.org/abs/1103.0398
• Simple question answering, A Neural Network for Factoid Question Answering over Paragraphs, Mohit Iyyer, Jordan Boyd-Graber, Leonardo Claudino, Richard Socher and Hal Daumé III (EMNLP 2014)
• Image to text mapping or generation, Grounded Compositional Semantics for Finding and Describing Images with Sentences, Richard Socher, Andrej Karpathy, Quoc V. Le, Christopher D. Manning, Andrew Y. Ng (TACL 2014), or Deep Visual-Semantic Alignments for Generating Image Descriptions, Andrej Karpathy, Li Fei-Fei
• Entity level sentiment
• Use DL to solve an NLP challenge on kaggle, e.g. Develop a scoring algorithm for student-written short-answer responses, https://www.kaggle.com/c/asap-sas

4/12/16 Richard Socher Lecture 5, Slide 10

Default project: sentiment classification

• Sentiment on movie reviews: http://nlp.stanford.edu/sentiment/
• Lots of deep learning baselines and methods have been tried

4/12/16 Richard Socher Lecture 1, Slide 11

A more powerful window classifier

• Revisiting:

• X_window = [x_museums  x_in  x_Paris  x_are  x_amazing]   (see the sketch below)

• Assume we want to classify whether the center word is a location or not
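A minimal sketch of building such a window vector by concatenating the five word vectors (toy vocabulary and random vectors standing in for trained embeddings):

```python
import numpy as np

d = 50                                                              # assumed word-vector dimensionality
vocab = {w: i for i, w in enumerate(["museums", "in", "paris", "are", "amazing"])}
E = np.random.randn(len(vocab), d)                                  # toy stand-in for learned word vectors

window = ["museums", "in", "Paris", "are", "amazing"]
x_window = np.concatenate([E[vocab[w.lower()]] for w in window])    # concatenation, shape (5*d,)
print(x_window.shape)                                               # (250,)
```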

4/12/16 Richard Socher Lecture 5, Slide 12

A Single Layer Neural Network

• A single layer is a combination of a linear layer and a nonlinearity:
  z = Wx + b,  a = f(z)

• The neural activations a can then be used to compute some function

• For instance, an unnormalized score (s = U^T a) or a softmax probability we care about:
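A minimal numpy sketch of this single layer plus score (made-up sizes: a 20-dimensional input and 8 hidden units):

```python
import numpy as np

def forward(x, W, b, U):
    """z = Wx + b, a = f(z) with a sigmoid f, unnormalized score s = U . a"""
    z = W.dot(x) + b
    a = 1.0 / (1.0 + np.exp(-z))   # element-wise nonlinearity
    s = U.dot(a)                   # scalar score
    return z, a, s

x = np.random.randn(20)            # e.g. a concatenated window vector
W, b, U = np.random.randn(8, 20), np.random.randn(8), np.random.randn(8)
z, a, s = forward(x, W, b, U)
print(a.shape, float(s))           # (8,) and a scalar score
```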

13

Summary: Feed-forward Computation

14

Computing a window's score with a 3-layer neural net:
  s = score(museums in Paris are amazing)

X_window = [x_museums  x_in  x_Paris  x_are  x_amazing]

Main intuition for extra layer

15

The layer learns non-linear interactions between the input word vectors.

Example: only if "museums" is the first vector should it matter that "in" is in the second position

X_window = [x_museums  x_in  x_Paris  x_are  x_amazing]

Summary: Feed-forward Computation

• s = score(museums in Paris are amazing)
• s_c = score(Not all museums in Paris)

• Idea for training objective: make the score of the true window larger and the corrupt window's score lower (until they're good enough): minimize
  J = max(0, 1 − s + s_c)

• This is continuous, so we can perform SGD

16

Max-margin Objective function

• Objective for a single window:
  J = max(0, 1 − s + s_c)

• Each window with a location at its center should have a score +1 higher than any window without a location at its center

• xxx |← 1 →| ooo   (true windows' scores and corrupt windows' scores, separated by a margin of 1)

• For full objective function: sum over all training windows
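A minimal sketch of this single-window hinge loss, assuming s and s_c are computed by a forward pass like the one sketched earlier:

```python
def max_margin_loss(s_true, s_corrupt):
    """J = max(0, 1 - s + s_c): zero once the true window outscores the corrupt one by at least 1."""
    return max(0.0, 1.0 - s_true + s_corrupt)

print(max_margin_loss(3.0, 1.5))   # 0.0, margin satisfied
print(max_margin_loss(1.0, 0.5))   # 0.5, margin violated -> gradients flow
```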

17

Training with Backpropagation

Assuming cost J is > 0, compute the derivatives of s and s_c wrt all the involved variables: U, W, b, x

18

Training with Backpropagation

• Let's consider the derivative of a single weight W_ij

• This only appears inside a_i

• For example: W_23 is only used to compute a_2

[Figure: single-layer network — inputs x1, x2, x3 and a +1 bias feed into hidden units a1, a2 (bias b2); the score s is computed via U (U2 shown); the weight W23 connecting x3 to a2 is highlighted]

19

Training with Backpropagation

Derivative of weight W_ij:
  ∂s/∂W_ij = U_i · ∂a_i/∂W_ij = U_i f'(z_i) x_j

where for logistic f: f'(z) = f(z)(1 − f(z))

20

Training with Backpropagation

Derivative of single weight W_ij:
  ∂s/∂W_ij = δ_i x_j,  where δ_i = U_i f'(z_i)

  δ_i: local error signal
  x_j: local input signal

21

• We want all combinations of i = 1, 2 and j = 1, 2, 3 → ?
• Solution: Outer product: ∂s/∂W = δ x^T, where δ is the "responsibility" or error message coming from each activation a
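A minimal numpy sketch of these gradients for the hypothetical single-layer setup above:

```python
import numpy as np

# Hypothetical shapes from the earlier forward-pass sketch.
x = np.random.randn(20)
W, b, U = np.random.randn(8, 20), np.random.randn(8), np.random.randn(8)

z = W.dot(x) + b
a = 1.0 / (1.0 + np.exp(-z))

delta = U * a * (1 - a)            # local error signal: delta_i = U_i f'(z_i) for logistic f
grad_W = np.outer(delta, x)        # ds/dW = delta x^T, same shape as W: (8, 20)
grad_b = delta                     # ds/db = delta
print(grad_W.shape, grad_b.shape)
```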

Training with Backpropagation

• From single weight W_ij to full W:
  ∂s/∂W = δ x^T   (outer product of the error vector δ and the input x)

22

Training with Backpropagation

• For biases b, we get:
  ∂s/∂b = δ

23

Training with Backpropagation

24

That's almost backpropagation. It's simply taking derivatives and using the chain rule!

Remaining trick: we can re-use derivatives computed for higher layers in computing derivatives for lower layers!

Example: last derivatives of model, the word vectors in x

Training with Backpropagation

• Take derivative of score with respect to a single element of a word vector

• Now, we cannot just take into consideration one a_i, because each x_j is connected to all the neurons above and hence x_j influences the overall score through all of them, hence:
  ∂s/∂x_j = Σ_i ∂s/∂a_i · ∂a_i/∂x_j = Σ_i δ_i W_ij

Re-used part of previous derivative

25
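A minimal numpy sketch of that sum, reusing the shapes from the outer-product sketch above (hypothetical values):

```python
import numpy as np

# Reusing the hypothetical shapes from the sketch above: W is (8, 20), delta is (8,).
W, delta = np.random.randn(8, 20), np.random.randn(8)
grad_x = W.T.dot(delta)        # ds/dx_j = sum_i delta_i W_ij for every j, i.e. W^T delta
print(grad_x.shape)            # (20,)
```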

Training with Backpropagation

• With ∂s/∂x_j = Σ_i δ_i W_ij, what is the full gradient? → ∂s/∂x = W^T δ, which then gets split back into updates for the concatenated word vectors

• Observations: The error message δ that arrives at a hidden layer has the same dimensionality as that hidden layer

26

Putting all gradients together:

• Remember: Full objective function for each window was:
  J = max(0, 1 − s + s_c)

• For example, the gradient for U (only nonzero when J > 0):
  ∂J/∂U = 1{1 − s + s_c > 0} · (a_c − a)
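A minimal sketch of that U gradient for one (true, corrupt) window pair, where a and a_c are the top activations of the true and corrupt windows (hypothetical helper, not the assignment code):

```python
import numpy as np

def grad_U(s, s_c, a, a_c):
    """dJ/dU for J = max(0, 1 - s + s_c): zero if the margin holds, otherwise a_c - a."""
    if 1.0 - s + s_c <= 0:
        return np.zeros_like(a)
    return a_c - a

print(grad_U(s=2.0, s_c=0.5, a=np.ones(8), a_c=np.zeros(8)))   # margin satisfied -> all zeros
```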

27

Two layer neural nets and full backprop

• Let's look at a 2 layer neural network
• Same window definition for x
• Same scoring function
• 2 hidden layers (careful: note the superscripts now!)

4/12/16 Richard Socher Lecture 5, Slide 28

[Figure: two-layer network — input x feeds through W(1) to activations a(2), through W(2) to activations a(3), and through U to the score s]

Two layer neural nets and full backprop

• Fully written out as one function:
  s = U^T a^(3) = U^T f(W^(2) f(W^(1) x + b^(1)) + b^(2))

• Same derivation as before for W^(2) (now sitting on a^(1))

4/12/16 Richard Socher Lecture 5, Slide 29


Two layer neural nets and full backprop

• Same derivation as before for the top W^(2):
  ∂s/∂W^(2)_ij = δ^(3)_i a^(2)_j

• In matrix notation:
  ∂s/∂W^(2) = δ^(3) (a^(2))^T,  where δ^(3) = U ∘ f'(z^(3)) and ∘ is the element-wise product, also called Hadamard product

• Last missing piece for understanding general backprop: δ^(2)
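A minimal numpy sketch of this two-layer forward pass and the backprop equations above (made-up layer sizes, biases omitted to match the simplified matrix formulas):

```python
import numpy as np

def f(z):  return 1.0 / (1.0 + np.exp(-z))      # sigmoid
def fp(z): return f(z) * (1.0 - f(z))           # its derivative

x = np.random.randn(20)                          # a(1) = x (window vector)
W1, W2, U = np.random.randn(8, 20), np.random.randn(8, 8), np.random.randn(8)

# Forward pass
z2 = W1.dot(x);  a2 = f(z2)
z3 = W2.dot(a2); a3 = f(z3)
s  = U.dot(a3)

# Backward pass with error vectors
d3 = U * fp(z3)                                  # delta(3) = U o f'(z(3))
d2 = W2.T.dot(d3) * fp(z2)                       # delta(2) = (W(2)^T delta(3)) o f'(z(2))

grad_W2 = np.outer(d3, a2)                       # ds/dW(2) = delta(3) a(2)^T
grad_W1 = np.outer(d2, x)                        # ds/dW(1) = delta(2) a(1)^T
grad_x  = W1.T.dot(d2)                           # error passed down to the word vectors
```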

4/12/16 Richard Socher Lecture 5, Slide 30

Two layer neural nets and full backprop

• Last missing piece: δ^(2)

• What's the bottom layer's error message δ^(2)?

• Similar derivation to the single layer model

• Main difference: we already have δ^(3) and need to apply the chain rule again, through a^(2) = f(z^(2))

4/12/16 Richard Socher Lecture 5, Slide 31

Two layer neural nets and full backprop

• Chain rule for δ^(2):

• Get intuition by deriving it as if it were a scalar

• Intuitively, we have to sum over all the nodes coming into the layer

• Putting it all together:
  δ^(2) = ((W^(2))^T δ^(3)) ∘ f'(z^(2))

4/12/16 Richard Socher Lecture 5, Slide 32

(Excerpt from the accompanying lecture notes:)

The second derivative in eq. 28 for output units is simply

$$\frac{\partial a^{(n_l)}_i}{\partial W^{(n_l-1)}_{ij}} = \frac{\partial}{\partial W^{(n_l-1)}_{ij}} z^{(n_l)}_i = \frac{\partial}{\partial W^{(n_l-1)}_{ij}} \left( W^{(n_l-1)}_{i\cdot}\, a^{(n_l-1)} \right) = a^{(n_l-1)}_j. \quad (46)$$

We adopt standard notation and introduce the error $\delta$ related to an output unit:

$$\frac{\partial E_n}{\partial W^{(n_l-1)}_{ij}} = (y_i - t_i)\, a^{(n_l-1)}_j = \delta^{(n_l)}_i a^{(n_l-1)}_j. \quad (47)$$

So far, we only computed errors for output units; now we will derive $\delta$'s for normal hidden units and show how these errors are backpropagated to compute weight derivatives of lower levels. We will start with the second-to-top layer weights, from which a generalization to arbitrarily deep layers will become obvious. Similar to eq. 28, we start with the error derivative:

$$\frac{\partial E}{\partial W^{(n_l-2)}_{ij}} = \sum_n \underbrace{\frac{\partial E_n}{\partial a^{(n_l)}}}_{\delta^{(n_l)}} \frac{\partial a^{(n_l)}}{\partial W^{(n_l-2)}_{ij}} + \lambda W^{(n_l-2)}_{ij}. \quad (48)$$

Now,

$$(\delta^{(n_l)})^T \frac{\partial a^{(n_l)}}{\partial W^{(n_l-2)}_{ij}} = (\delta^{(n_l)})^T \frac{\partial z^{(n_l)}}{\partial W^{(n_l-2)}_{ij}} \quad (49)$$
$$= (\delta^{(n_l)})^T \frac{\partial}{\partial W^{(n_l-2)}_{ij}} W^{(n_l-1)} a^{(n_l-1)} \quad (50)$$
$$= (\delta^{(n_l)})^T \frac{\partial}{\partial W^{(n_l-2)}_{ij}} W^{(n_l-1)}_{\cdot i}\, a^{(n_l-1)}_i \quad (51)$$
$$= (\delta^{(n_l)})^T W^{(n_l-1)}_{\cdot i} \frac{\partial}{\partial W^{(n_l-2)}_{ij}} a^{(n_l-1)}_i \quad (52)$$
$$= (\delta^{(n_l)})^T W^{(n_l-1)}_{\cdot i} \frac{\partial}{\partial W^{(n_l-2)}_{ij}} f\!\left(z^{(n_l-1)}_i\right) \quad (53)$$
$$= (\delta^{(n_l)})^T W^{(n_l-1)}_{\cdot i} \frac{\partial}{\partial W^{(n_l-2)}_{ij}} f\!\left(W^{(n_l-2)}_{i\cdot}\, a^{(n_l-2)}\right) \quad (54)$$
$$= (\delta^{(n_l)})^T W^{(n_l-1)}_{\cdot i}\, f'\!\left(z^{(n_l-1)}_i\right) a^{(n_l-2)}_j \quad (55)$$
$$= \left( (\delta^{(n_l)})^T W^{(n_l-1)}_{\cdot i} \right) f'\!\left(z^{(n_l-1)}_i\right) a^{(n_l-2)}_j \quad (56)$$
$$= \underbrace{\left( \sum_{j=1}^{s_{l+1}} W^{(n_l-1)}_{ji}\, \delta^{(n_l)}_j \right) f'\!\left(z^{(n_l-1)}_i\right)}_{\delta^{(n_l-1)}_i}\; a^{(n_l-2)}_j \quad (57)$$
$$= \delta^{(n_l-1)}_i a^{(n_l-2)}_j \quad (58)$$

where we used in the first line that the top layer is linear. This is a very detailed account of essentially just the chain rule.

So, we can write the $\delta$ errors of all layers $l$ (except the top layer) in vector format, using the Hadamard product $\circ$:

$$\delta^{(l)} = \left( (W^{(l)})^T \delta^{(l+1)} \right) \circ f'(z^{(l)}), \quad (59)$$

Two layer neural nets and full backprop

• Last missing piece: δ^(l)

• In general, for any matrix W^(l) at internal layer l and any error with regularization E_R, all of backprop in standard multilayer neural networks boils down to 2 equations:
  δ^(l) = ((W^(l))^T δ^(l+1)) ∘ f'(z^(l))
  ∂E_R/∂W^(l) = δ^(l+1) (a^(l))^T + λ W^(l)

• Top and bottom layers have simpler δ

4/12/16 Richard Socher Lecture 5, Slide 33


(Lecture notes excerpt, continued:)

where the sigmoid derivative from eq. 14 gives $f'(z^{(l)}) = (1 - a^{(l)})\, a^{(l)}$. Using that definition, we get the hidden layer backprop derivatives:

$$\frac{\partial}{\partial W^{(l)}_{ij}} E_R = a^{(l)}_j \delta^{(l+1)}_i + \lambda W^{(l)}_{ij} \quad (60)$$

which in one simplified vector notation becomes:

$$\frac{\partial}{\partial W^{(l)}} E_R = \delta^{(l+1)} (a^{(l)})^T + \lambda W^{(l)}. \quad (62)$$

In summary, the backprop procedure consists of four steps:

1. Apply an input $x_n$ and forward propagate it through the network to get the hidden and output activations using eq. 18.
2. Evaluate $\delta^{(n_l)}$ for output units using eq. 42.
3. Backpropagate the $\delta$'s to obtain a $\delta^{(l)}$ for each hidden layer in the network using eq. 59.
4. Evaluate the required derivatives with eq. 62 and update all the weights using an optimization procedure such as conjugate gradient or L-BFGS. CG seems to be faster and work better when using mini-batches of training data to estimate the derivatives.
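A minimal sketch of this four-step procedure for a generic fully connected net (sigmoid layers and a squared-error output chosen for concreteness, plain gradient steps instead of CG/L-BFGS; not the notes' exact setup):

```python
import numpy as np

def f(z):             return 1.0 / (1.0 + np.exp(-z))
def fprime_from_a(a): return a * (1.0 - a)            # sigmoid derivative written via its activation

def backprop_step(Ws, x, t, lam=1e-4, lr=0.1):
    # 1. Forward propagate to get the activations a(l) of every layer.
    acts = [x]
    for W in Ws:
        acts.append(f(W.dot(acts[-1])))
    # 2. Error at the output (squared-error loss with a sigmoid output layer).
    delta = (acts[-1] - t) * fprime_from_a(acts[-1])
    # 3./4. Backpropagate the deltas and update every W(l):
    #       dE_R/dW(l) = delta(l+1) a(l)^T + lam * W(l)   (eq. 62)
    for l in reversed(range(len(Ws))):
        grad = np.outer(delta, acts[l]) + lam * Ws[l]
        if l > 0:
            delta = Ws[l].T.dot(delta) * fprime_from_a(acts[l])   # eq. 59
        Ws[l] -= lr * grad
    return Ws

Ws = [0.1 * np.random.randn(8, 20), 0.1 * np.random.randn(3, 8)]
x, t = np.random.randn(20), np.array([1.0, 0.0, 0.0])
Ws = backprop_step(Ws, x, t)
```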

If you have any further questions or found errors, please send an email to richard@socher.org

5 Recursive Neural Networks

Same as backprop in previous section but splitting error derivatives and noting that the derivatives of the same W at each node can all be added up. Lastly, the delta's from the parent node and possible delta's from a softmax classifier at each node are just added.


Visualization of intuition

• Let's say we want ∂s/∂W^(1), with the previous-layer setup and f = σ

4/12/16 Richard Socher Lecture 1, Slide 34

Our first example: Backpropagation using error vectors (CS224D: Deep Learning for NLP, 31)

[Figure: feed-forward chain with bias inputs (1) and σ nonlinearities: z(1) → a(1) → W(1) → z(2) → a(2) → W(2) → z(3) → s; error vector δ(3) at the output]

Gradient w.r.t. W(2) = δ(3) a(2)^T

Visualization of intuition

4/12/16 Richard Socher Lecture 5, Slide 35

Our first example: Backpropagation using error vectors (CS224D: Deep Learning for NLP, 32)

[Figure: same chain; δ(3) is passed back through W(2), giving W(2)^T δ(3)]

-- Reusing the δ(3) for downstream updates.
-- Moving the error vector across an affine transformation simply requires multiplication with the transpose of the forward matrix.
-- Notice that the dimensions will line up perfectly too!

Visualization of intuition

Our first example: Backpropagation using error vectors (CS224D: Deep Learning for NLP, 33)

[Figure: same chain; crossing the σ nonlinearity: σ'(z(2)) ⊙ W(2)^T δ(3) = δ(2)]

-- Moving the error vector across a point-wise non-linearity requires point-wise multiplication with the local gradient of the non-linearity.

4/12/16 Richard Socher Lecture 5, Slide 36

Visualization of intuition

Our first example: Backpropagation using error vectors (CS224D: Deep Learning for NLP, 34)

[Figure: same chain; δ(2) arrives at the bottom layer]

Gradient w.r.t. W(1) = δ(2) a(1)^T
W(1)^T δ(2)

4/12/16 Richard Socher Lecture 5, Slide 37

Backpropagation (Another explanation)

• Compute gradient of example-wise loss wrt parameters

• Simply applying the derivative chain rule wisely

• If computing the loss(example, parameters) is O(n) computation, then so is computing the gradient

38

Simple Chain Rule

39

Multiple Paths Chain Rule

40

Multiple Paths Chain Rule - General

41

Chain Rule in Flow Graph

Flow graph: any directed acyclic graph
  node = computation result
  arc = computation dependency

{y1, y2, …, yn} = successors of x

42

Back-Prop in Multi-Layer Net

43

h = sigmoid(Vx)

Back-Prop in General Flow Graph

{y1, y2, …, yn} = successors of x

1. Fprop: visit nodes in topo-sort order
   - Compute value of node given predecessors
2. Bprop:
   - initialize output gradient = 1
   - visit nodes in reverse order:
     Compute gradient wrt each node using gradient wrt successors

Single scalar output

44

Automatic Differentiation

• The gradient computation can be automatically inferred from the symbolic expression of the fprop.

• Each node type needs to know how to compute its output and how to compute the gradient wrt its inputs given the gradient wrt its output.

• Easy and fast prototyping
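A toy sketch of that idea: each node type knows its fprop and how to turn the gradient wrt its output into gradients wrt its inputs (illustrative only, not any particular autodiff library):

```python
class Multiply:
    """Toy node type: computes its output and the gradients wrt its inputs given the output gradient."""
    def fprop(self, a, b):
        self.a, self.b = a, b
        return a * b
    def bprop(self, grad_out):
        return grad_out * self.b, grad_out * self.a   # d(ab)/da = b, d(ab)/db = a

node = Multiply()
out = node.fprop(3.0, 4.0)        # 12.0
print(node.bprop(1.0))            # (4.0, 3.0)
```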

45…

Summary

4/12/16 Richard Socher 46

• Congrats!

• You survived the hardest part of this class.

• Everything else from now on is just more matrix multiplications and backprop :)

• Next up:
  • Recurrent Neural Networks
