
CS224d Deep NLP

Lecture 4: Word Window Classification and Neural Networks

Richard Socher

Overview Today:

• General classification background
• Updating word vectors for classification
• Window classification & cross entropy error derivation tips
• A single layer neural network!
• (Max-Margin loss and backprop)


Classification setup and notation

• Generally we have a training dataset consisting of samples {x_i, y_i}, i = 1 … N

• x_i are inputs, e.g. words (indices or vectors!), context windows, sentences, documents, etc.

• y_i are labels we try to predict, e.g. other words; classes such as sentiment, named entities, a buy/sell decision; later: multi-word sequences

Classification intuition

• Training data: {x_i, y_i}, i = 1 … N

• Simple illustration case: fixed 2-d word vectors to classify, using logistic regression → a linear decision boundary

• General ML: assume x is fixed and only train the logistic regression weights W, i.e. only modify the decision boundary

Visualizations with ConvNetJS by Karpathy! http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html

Classification notation

• Cross entropy loss function over the dataset {x_i, y_i}, i = 1 … N:

  J(θ) = (1/N) Σ_i −log( e^{f_{y_i}} / Σ_c e^{f_c} )

• Where for each data pair (x_i, y_i) the class scores are f = W x_i

• We can write f in matrix notation and index elements of it based on class: f_c = W_c· x
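A minimal numpy sketch of this classifier and loss (the variable names and toy sizes are illustrative, not from the lecture): W is the C × d softmax weight matrix, X holds the inputs row-wise, y holds the integer class labels.

```python
import numpy as np

def softmax(scores):
    # scores: N x C matrix of class scores, one row f = Wx per example
    shifted = scores - scores.max(axis=1, keepdims=True)   # shift for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

def cross_entropy_loss(W, X, y):
    # W: C x d softmax weights, X: N x d inputs, y: length-N integer class labels
    probs = softmax(X.dot(W.T))                             # N x C predicted probabilities
    return -np.mean(np.log(probs[np.arange(len(y)), y]))    # mean of -log p(correct class)

# toy usage: 3 classes, 4-dimensional inputs, 5 examples
W = np.random.randn(3, 4)
X = np.random.randn(5, 4)
y = np.array([0, 2, 1, 1, 0])
print(cross_entropy_loss(W, X, y))
```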

Classification: Regularization!

• Really the full loss function over any dataset includes regularization over all parameters θ:

  J_full(θ) = J(θ) + λ Σ_k θ_k^2

• Regularization will prevent overfitting when we have a lot of features (or later a very powerful/deep model)

• Overfitting plot: x-axis is a more powerful model or more training iterations; blue is training error, red is test error

Details: General ML optimization

• For general machine learning, θ usually only consists of the columns of W:

• So we only update the decision boundary

Visualizations with ConvNetJS by Karpathy

Classification difference with word vectors

• Common in deep learning: learn both W and the word vectors x

• The parameter vector θ then also contains all the word vectors, so it is very large. Overfitting danger!

Losing generalization by re-training word vectors

• Setting: training logistic regression for movie review sentiment; in the training data we have the words "TV" and "telly"

• In the testing data we have "television"

• Originally they were all similar (from pre-trained word vectors)

• What happens when we train the word vectors?

[Figure: 2-d plot where the vectors for "TV", "telly", and "television" start out close together.]

Losing generalization by re-training word vectors

• What happens when we train the word vectors?
• Those that are in the training data move around
• Words from pre-training that do NOT appear in training stay put

• Example:
• In training data: "TV" and "telly"
• In testing data only: "television"

[Figure: after training, "TV" and "telly" have moved, while "television" stays where it was.]

Losing generalization by re-training word vectors

• Take home message:

If you only have a small training dataset, don't train the word vectors.

If you have a very large dataset, it may work better to train word vectors to the task.

Side note on word vectors notation

• The word vector matrix L is also called the lookup table
• Word vectors = word embeddings = word representations (mostly)
• Mostly from methods like word2vec or GloVe

• L is a d × |V| matrix with one column per vocabulary word: aardvark, a, …, meta, …, zebra
• These columns are the word features x_word from now on

• Conceptually you get a word's vector by left-multiplying a one-hot vector e by L:  x = Le ∈ R^d  (a d × |V| matrix times a |V| × 1 vector)
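A small numpy sketch of the lookup (vocabulary size, dimension, and word index are toy values for illustration):

```python
import numpy as np

d, V = 4, 6                       # toy embedding dimension and vocabulary size
L = np.random.randn(d, V)         # lookup table: one d-dimensional column per word

word_index = 2                    # hypothetical index of some word in the vocabulary
e = np.zeros(V)
e[word_index] = 1.0               # one-hot vector for that word

x_conceptual = L.dot(e)           # the conceptual version: x = L e
x_practical = L[:, word_index]    # what is done in practice: just index a column

assert np.allclose(x_conceptual, x_practical)
```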

Window classification

• Classifying single words is rarely done.

• Interesting problems like ambiguity arise in context!

• Example: auto-antonyms:
• "To sanction" can mean "to permit" or "to punish."
• "To seed" can mean "to place seeds" or "to remove seeds."

• Example: ambiguous named entities:
• Paris → Paris, France vs. Paris Hilton
• Hathaway → Berkshire Hathaway vs. Anne Hathaway

Window classification

• Idea: classify a word in its context window of neighboring words.

• For example, named entity recognition into 4 classes:
• Person, location, organization, none

• Many possibilities exist for classifying one word in context, e.g. averaging all the words in a window, but that loses position information

Window classification

• Train a softmax classifier by assigning a label to a center word and concatenating all word vectors surrounding it

• Example: classify "Paris" in the context of this sentence with window length 2:

  … museums in Paris are amazing …

  x_window = [ x_museums  x_in  x_Paris  x_are  x_amazing ]^T

• The resulting vector x_window = x ∈ R^{5d} is a column vector!
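A sketch of building the window vector by concatenation (the toy dimension and vocabulary are made up; in practice d is much larger):

```python
import numpy as np

d = 5                                         # toy embedding dimension
sentence = ["museums", "in", "Paris", "are", "amazing"]
vocab = {w: i for i, w in enumerate(sentence)}
L = np.random.randn(d, len(vocab))            # lookup table, one column per word

def window_vector(sentence, center_pos, window=2):
    # concatenate the vectors of the center word and its +/- window neighbors
    context = sentence[center_pos - window : center_pos + window + 1]
    return np.concatenate([L[:, vocab[w]] for w in context])

x_window = window_vector(sentence, center_pos=2, window=2)   # center word "Paris"
print(x_window.shape)   # (25,) i.e. 5d, a single column vector
```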

Simplest window classifier: Softmax

• With x = x_window we can use the same softmax classifier as before:

  ŷ_y = p(y | x) = exp(W_y· x) / Σ_c exp(W_c· x)   (the predicted model output probability)

• With cross entropy error as before

• But how do you update the word vectors?

Updating concatenated word vectors

• Short answer: just take derivatives as before

• Long answer: let's go over the steps together (you'll have to fill in the details in PSet 1!)

• Define:
• ŷ: softmax probability output vector (see previous slide)
• t: target probability distribution (all 0's except at the ground truth index of class y, where it's 1)
• and f_c = c'th element of the f vector

• Hard the first time, hence some tips now :)

Updating concatenated word vectors

• Tip 1: Carefully define your variables and keep track of their dimensionality!

• Tip 2: Know thy chain rule and don't forget which variables depend on what.

• Tip 3: For the softmax part of the derivative: first take the derivative wrt f_c when c = y (the correct class), then take the derivative wrt f_c when c ≠ y (all the incorrect classes)

Updating concatenated word vectors

• Tip 4: When you take the derivative wrt one element of f, try to see if you can create a gradient in the end that includes all partial derivatives.

• Tip 5: To keep your sanity later (and for the implementation!), express the results in terms of vector operations and define single index-able vectors, e.g. δ = ŷ − t.

Updating concatenated word vectors

• Tip 6: When you start with the chain rule, first use explicit sums and look at partial derivatives of e.g. x_i or W_ij

• Tip 7: To clean it up for even more complex functions later: know the dimensionality of variables & simplify into matrix notation

• Tip 8: Write this out in full sums if it's not clear!

Updating concatenated word vectors

• What is the dimensionality of the window vector gradient?

• x is the entire window, 5 d-dimensional word vectors, so the derivative wrt x has to have the same dimensionality:

  ∇_x J ∈ R^{5d}

Updating concatenated word vectors

• The gradient that arrives at and updates the word vectors can simply be split up for each word vector:

• Let ∇_x J be the gradient with respect to the full window vector (5d-dimensional)
• With x_window = [ x_museums  x_in  x_Paris  x_are  x_amazing ]

• We have: ∇_x J splits into five d-dimensional pieces, one gradient for each of x_museums, x_in, x_Paris, x_are, x_amazing (see the sketch below)
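Putting the tips together, a hedged numpy sketch of the window-vector gradient and how it splits into per-word updates, assuming the standard softmax cross-entropy error signal δ = ŷ − t and toy shapes:

```python
import numpy as np

d, window_len, C = 5, 5, 4                    # toy sizes: word dim, words per window, classes
x_window = np.random.randn(window_len * d)    # concatenated window vector, shape (5d,)
W = np.random.randn(C, window_len * d)        # softmax weights
t = np.zeros(C)
t[1] = 1.0                                    # target distribution: 1 at the true class

f = W.dot(x_window)                           # class scores
y_hat = np.exp(f - f.max())
y_hat /= y_hat.sum()                          # softmax probabilities

delta = y_hat - t                             # error signal dJ/df for softmax + cross entropy
grad_x = W.T.dot(delta)                       # dJ/dx_window, same 5d dimensionality as x_window

# split the window gradient back into one d-dimensional gradient per word vector
grad_per_word = np.split(grad_x, window_len)  # updates for x_museums, x_in, x_Paris, x_are, x_amazing
```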

Updating concatenated word vectors

• This will push word vectors into areas such that they will be helpful in determining named entities.

• For example, the model can learn that seeing x_in as the word just before the center word is indicative of the center word being a location

What's missing for training the window model?

• The gradient of J wrt the softmax weights W!

• Similar steps: write down the partial wrt W_ij first! Then we have the full ∇_W J (see the sketch below)
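With the same δ, the gradient for the softmax weights is an outer product; a sketch with toy shapes (names are illustrative):

```python
import numpy as np

C, n = 4, 25                       # toy number of classes and window-vector dimension (5d)
x = np.random.randn(n)             # the concatenated window vector
W = np.random.randn(C, n)          # softmax weights
t = np.zeros(C)
t[0] = 1.0                         # target distribution

f = W.dot(x)
y_hat = np.exp(f - f.max())
y_hat /= y_hat.sum()

delta = y_hat - t                  # same error signal as before
grad_W = np.outer(delta, x)        # dJ/dW: one row of partials per class, shape C x n
assert grad_W.shape == W.shape
```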

A note on matrix implementations

• There are two expensive operations in the softmax:

• The matrix multiplication and the exp

• Implementing this with a for loop is never as efficient as using one larger matrix multiplication!

• Example code →

A note on matrix implementations

• Looping over word vectors, instead of concatenating them all into one large matrix and then multiplying the softmax weights with that matrix:

• Loop:   1000 loops, best of 3: 639 µs per loop
• Matrix: 10000 loops, best of 3: 53.8 µs per loop
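A sketch along the lines of the slide's example (toy sizes, not the original code), comparing the loop against one large matrix multiplication:

```python
import numpy as np
from timeit import timeit

C, d, N = 5, 300, 1000                              # toy sizes: classes, vector dim, number of windows
W = np.random.randn(C, d)                           # softmax weights
vectors = [np.random.randn(d, 1) for _ in range(N)] # N window vectors
matrix = np.hstack(vectors)                         # the same vectors as one d x N matrix

loop_version = lambda: [W.dot(v) for v in vectors]  # N small matrix-vector products
matrix_version = lambda: W.dot(matrix)              # one big C x N matrix product

print("loop:  ", timeit(loop_version, number=100))
print("matrix:", timeit(matrix_version, number=100))  # typically an order of magnitude faster
```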

A note on matrix implementations

• The result of the faster method is a C × N matrix:

• Each column is an f(x) in our notation (unnormalized class scores)

• Matrices are awesome!

• You should speed-test your code a lot too

Softmax (= logistic regression) is not very powerful

• Softmax only gives linear decision boundaries in the original space.

• With little data that can be a good regularizer

• With more data it is very limiting!

Softmax (= logistic regression) is not very powerful

• Softmax gives only linear decision boundaries

• → Lame when the problem is complex

• Wouldn't it be cool to get these correct?

Neural Nets for the Win!

• Neural networks can learn much more complex functions and nonlinear decision boundaries!

From logistic regression to neural nets

Demystifying neural networks

Neural networks come with their own terminological baggage

… just like SVMs

But if you understand how softmax models work

Then you already understand the operation of a basic neural network neuron!

A single neuron

A computational unit with n (= 3 here) inputs and 1 output, and parameters W, b

[Figure: the inputs feed into an activation function that produces the output; the bias unit corresponds to the intercept term.]

A neuron is essentially a binary logistic regression unit

h_{w,b}(x) = f(w^T x + b)

f(z) = 1 / (1 + e^{−z})

w, b are the parameters of this neuron, i.e. this logistic regression model

b: we can have an "always on" feature, which gives a class prior, or separate it out as a bias term
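A direct numpy sketch of such a unit, with toy inputs and parameters:

```python
import numpy as np

def sigmoid(z):
    # the logistic activation f(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    # a single neuron = a binary logistic regression unit: h_{w,b}(x) = f(w^T x + b)
    return sigmoid(w.dot(x) + b)

x = np.array([1.0, 2.0, -1.0])    # n = 3 toy inputs
w = np.array([0.5, -0.3, 0.8])    # toy weights
b = 0.1                           # bias / intercept term
print(neuron(x, w, b))            # a single output in (0, 1)
```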

A neural network = running several logistic regressions at the same time

If we feed a vector of inputs through a bunch of logistic regression functions, then we get a vector of outputs …

But we don't have to decide ahead of time what variables these logistic regressions are trying to predict!

A neural network = running several logistic regressions at the same time

… which we can feed into another logistic regression function

It is the loss function that will direct what the intermediate hidden variables should be, so as to do a good job at predicting the targets for the next layer, etc.

A neural network = running several logistic regressions at the same time

Before we know it, we have a multilayer neural network …

Matrix notation for a layer

We have

a_1 = f(W_11 x_1 + W_12 x_2 + W_13 x_3 + b_1)
a_2 = f(W_21 x_1 + W_22 x_2 + W_23 x_3 + b_2)
etc.

In matrix notation

z = Wx + b
a = f(z)

where f is applied element-wise:

f([z_1, z_2, z_3]) = [ f(z_1), f(z_2), f(z_3) ]
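In code, the whole layer is just the two lines of the matrix notation; a sketch with a 3-input, 3-neuron toy layer and a logistic f:

```python
import numpy as np

def f(z):
    # logistic non-linearity, applied element-wise
    return 1.0 / (1.0 + np.exp(-z))

x = np.random.randn(3)            # the three inputs x_1, x_2, x_3
W = np.random.randn(3, 3)         # one row of weights per neuron a_1, a_2, a_3
b = np.random.randn(3)            # one bias per neuron

z = W.dot(x) + b                  # z = Wx + b
a = f(z)                          # a = f(z), i.e. the vector [a_1, a_2, a_3]
print(a)
```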

Non-linearities (f): Why they're needed

• Example: function approximation, e.g. regression or classification
• Without non-linearities, deep neural networks can't do anything more than a linear transform

• Extra layers could just be compiled down into a single linear transform: W_1 W_2 x = W x (a quick numerical check follows below)

• With more layers, they can approximate more complex functions!
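A quick check of the W_1 W_2 x = W x point, with random toy matrices:

```python
import numpy as np

x = np.random.randn(4)
W1 = np.random.randn(3, 5)
W2 = np.random.randn(5, 4)

# without a non-linearity in between, two layers are just one linear map W = W1 W2
two_layers = W1.dot(W2.dot(x))
one_layer = W1.dot(W2).dot(x)
print(np.allclose(two_layers, one_layer))   # True
```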

A more powerful window classifier

• Revisiting

• x_window = [ x_museums  x_in  x_Paris  x_are  x_amazing ]

A Single Layer Neural Network

• A single layer is a combination of a linear layer and a nonlinearity:

  z = Wx + b,   a = f(z)

• The neural activations a can then be used to compute some function

• For instance, a softmax probability or an unnormalized score:

  s = U^T a

Summary: Feed-forward Computation

Computing a window's score with a 3-layer neural net:  s = score(museums in Paris are amazing)

s = U^T f(Wx + b),   x_window = [ x_museums  x_in  x_Paris  x_are  x_amazing ]
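A sketch of this feed-forward computation for a window score (the hidden size and other shapes are made up):

```python
import numpy as np

def f(z):
    return 1.0 / (1.0 + np.exp(-z))           # logistic non-linearity

def score(x_window, W, b, U):
    # single hidden layer followed by an unnormalized score: s = U^T f(Wx + b)
    z = W.dot(x_window) + b
    a = f(z)
    return U.dot(a)

d, window_len, hidden = 5, 5, 8               # toy sizes
x_window = np.random.randn(window_len * d)    # concatenation of the five word vectors
W = np.random.randn(hidden, window_len * d)
b = np.random.randn(hidden)
U = np.random.randn(hidden)
print(score(x_window, W, b, U))               # a single scalar score for the window
```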

Next lecture:

Training a window-based neural network.

Taking deeper derivatives → backprop

Then we have all the basic tools in place to learn about more complex models :)

Probably for next lecture …

Another output layer and loss function combo!

• So far: softmax and cross-entropy error (the exp is slow)

• We don't always need probabilities; often unnormalized scores are enough to classify correctly.

• Also: max-margin!

• More on that in future lectures!

Neural Net model to classify grammatical phrases

• Idea: train a neural network to produce high scores for grammatical phrases of a specific length and low scores for ungrammatical phrases

• s  = score(cat chills on a mat)

• s_c = score(cat chills Menlo a mat)

Another output layer and loss function combo!

• Idea for the training objective: make the score of the true window larger and the corrupt window's score lower (until they're good enough). Minimize

  J = max(0, 1 − s + s_c)

• This is continuous, so we can perform SGD
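A sketch of this objective, assuming the margin-1 form written above; the scores here are placeholder numbers rather than outputs of a real scoring network:

```python
def max_margin_loss(s_true, s_corrupt, margin=1.0):
    # push the true window's score above the corrupt window's score by at least the margin
    return max(0.0, margin - s_true + s_corrupt)

s = 2.3      # placeholder for score(cat chills on a mat)
s_c = 1.9    # placeholder for score(cat chills Menlo a mat)
print(max_margin_loss(s, s_c))   # 0.6 > 0: still inside the margin, so there is a gradient
```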

Training with Backpropagation

Assuming the cost J is > 0, it is simple to see that we can compute the derivatives of s and s_c wrt all the involved variables: U, W, b, x

Training with Backpropagation

• Let's consider the derivative of a single weight W_ij

• W_ij only appears inside a_i

• For example: W_23 is only used to compute a_2

[Figure: network with inputs x_1, x_2, x_3 and a bias unit +1, hidden units a_1 and a_2, and score s computed via U; the weight W_23 is highlighted.]

Training with Backpropagation

Derivative of weight W_ij:

∂s/∂W_ij = U_i f′(z_i) x_j

where for logistic f:  f′(z) = f(z) (1 − f(z))

Training with Backpropagation

Derivative of a single weight W_ij:

∂s/∂W_ij = δ_i x_j,  with δ_i = U_i f′(z_i)

Here δ_i is the local error signal and x_j is the local input signal.

• We want all combinations of i = 1, 2 and j = 1, 2, 3

• Solution: outer product: ∂s/∂W = δ x^T, where δ is the "responsibility" coming from each activation a

Training with Backpropagation

• From the single weight W_ij to the full W:

  ∂s/∂W = δ x^T

Training with Backpropagation

• For the biases b, we get:

  ∂s/∂b = δ
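A numpy sketch of the last two gradients, assuming a logistic f so that f′(z) = f(z)(1 − f(z)); δ is the local error signal and the gradients for W and b follow from it:

```python
import numpy as np

def f(z):
    return 1.0 / (1.0 + np.exp(-z))   # logistic f, so f'(z) = f(z) (1 - f(z))

x = np.random.randn(3)                # inputs x_1, x_2, x_3
W = np.random.randn(2, 3)             # weights into the hidden units a_1, a_2
b = np.random.randn(2)
U = np.random.randn(2)                # score weights: s = U^T a

z = W.dot(x) + b
a = f(z)

delta = U * a * (1 - a)               # local error signal: U elementwise-times f'(z)
grad_W = np.outer(delta, x)           # ds/dW: local error signal (rows) times local input signal
grad_b = delta                        # ds/db is the error signal itself
```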

Training with Backpropagation

That's almost backpropagation. It's simply taking derivatives and using the chain rule!

Remaining trick: we can re-use derivatives computed for higher layers in computing derivatives for lower layers

Example: the last derivatives of the model, the word vectors in x

Training with Backpropagation

• Take the derivative of the score with respect to a single word vector (for simplicity a 1-d vector, but the same if it were longer)

• Now, we cannot just take into consideration one a_i, because each x_j is connected to all the neurons above; hence x_j influences the overall score through all of them:

  ∂s/∂x_j = Σ_i U_i f′(z_i) W_ij = (W^T δ)_j

• δ is the re-used part of the previous derivative
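Continuing the same assumptions, a sketch of the re-use: the δ computed for the W derivative is multiplied by W^T to give the gradient with respect to the concatenated word vectors, which then splits per word:

```python
import numpy as np

def f(z):
    return 1.0 / (1.0 + np.exp(-z))

d, window_len, hidden = 5, 5, 8
x = np.random.randn(window_len * d)           # concatenated window of word vectors
W = np.random.randn(hidden, window_len * d)
b = np.random.randn(hidden)
U = np.random.randn(hidden)

z = W.dot(x) + b
a = f(z)

delta = U * a * (1 - a)                       # re-used error signal from the W derivative
grad_x = W.T.dot(delta)                       # ds/dx: each x_j feeds every hidden unit, hence the sum over i
grad_per_word = np.split(grad_x, window_len)  # gradients for x_museums, x_in, x_Paris, x_are, x_amazing
```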

Summary
