Deep Learning and Lexical, Syntactic and Semantic Analysis Wanxiang Che and Yue Zhang 2016-10


Page 1: Deep Learning and Lexical, Syntactic and Semantic Analysis

Wanxiang Che and Yue Zhang, 2016-10

Page 2: Part 2: Introduction to Deep Learning

Page 3: Part 2.1: Deep Learning Background

Page 4: What is Machine Learning?

• From Data to Knowledge
  – Traditional program: Input + Algorithm → Output
  – ML program: Input + Output → “Algorithm”

Page 5: A Standard Example of ML

• The MNIST (Modified NIST) database of hand-written digit recognition
  – Publicly available
  – A huge amount is known about how well various ML methods do on it
  – 60,000 + 10,000 hand-written digits (28x28 pixels each)

Page 6: Very hard to say what makes a 2

Page 7: Traditional Model (before 2012)

• Fixed/engineered features + trainable classifier
  – Designing a feature extractor requires considerable effort by experts
  – e.g. SIFT, GIST, shape context

Page 8: Deep Learning (after 2012)

• Learning hierarchical representations
• DEEP means more than one stage of non-linear feature transformation

Page 9: Deep Learning Architecture

Page 10: Deep Learning is Not New

• 1980s technology (neural networks)

Page 11: About Neural Networks

• Pros
  – Simple to learn p(y|x)
  – Results OK for shallow nets
• Cons
  – Does not learn p(x)
  – Trouble with > 3 layers
  – Overfits
  – Slow to train

Page 12: Deep Learning beats NN

• Pros
  – Simple to learn p(y|x)
  – Results OK for shallow nets
• Cons
  – Does not learn p(x)
  – Trouble with > 3 layers
  – Overfits
  – Slow to train
• Remedies
  – Unsupervised feature learning: RBMs, DAEs
  – Dropout, Maxout, stochastic pooling
  – GPUs
  – New activation functions: ReLU, …
  – Gated mechanisms

Page 13: Results on MNIST

• Naïve neural network – 96.59%
• SVM (default settings for libsvm) – 94.35%
• Optimal SVM [Andreas Mueller] – 98.56%
• The state of the art: convolutional NN (2013) – 99.79%

Page 14: Deep Learning Wins

9. MICCAI 2013 Grand Challenge on Mitosis Detection
8. ICPR 2012 Contest on Mitosis Detection in Breast Cancer Histological Images
7. ISBI 2012 Brain Image Segmentation Challenge (with superhuman pixel error rate)
6. IJCNN 2011 Traffic Sign Recognition Competition (only our method achieved superhuman results)
5. ICDAR 2011 Offline Chinese Handwriting Competition
4. Online German Traffic Sign Recognition Contest
3. ICDAR 2009 Arabic Connected Handwriting Competition
2. ICDAR 2009 Handwritten Farsi/Arabic Character Recognition Competition
1. ICDAR 2009 French Connected Handwriting Competition

Compare the overview page on handwriting recognition:
• http://people.idsia.ch/~juergen/deeplearning.html

Page 15: Deep Learning for Speech Recognition

Page 16: Deep Learning for NLP

• Big data: monolingual, multilingual, and multimodal data
• Models, illustrated on the phrase 喜欢 (“like”) 苹果 (“apple”): recurrent NN, convolutional NN, recursive NN
• Semantic vector space
• Applications: POS tagging, parsing, word segmentation, QA, MT, caption generation, dialog, MCTest, …

Page 17: Part 2.2: Feedforward Neural Networks

Page 18: The Traditional Paradigm for ML

1. Convert the raw input vector into a vector of feature activations
   – Use hand-written programs based on common sense to define the features
2. Learn how to weight each of the feature activations to get a single scalar quantity
3. If this quantity is above some threshold, decide that the input vector is a positive example of the target class

Page 19: The Standard Perceptron Architecture

• Input units → feature units (hand-coded programs) → decision unit (learned weights)
• Example: “IPhone is very good .” → features good, very/good, very, … with weights 0.8, 0.9, 0.1, …; decide positive if the weighted sum exceeds a threshold (> 5?)
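The feature-plus-weights pipeline on this slide can be sketched in a few lines. The weight values follow the slide; the lexicon of scored features and the 0.5 threshold are illustrative assumptions.

```python
# A sketch of the perceptron pipeline: hand-coded unigram/bigram
# features, a weighted sum, and a threshold decision.

def extract_features(tokens):
    # Hand-coded feature units: unigrams and adjacent bigrams.
    feats = {}
    for t in tokens:
        feats[t] = feats.get(t, 0) + 1
    for a, b in zip(tokens, tokens[1:]):
        key = a + "/" + b
        feats[key] = feats.get(key, 0) + 1
    return feats

# Weights as on the slide; unlisted features weigh 0.
WEIGHTS = {"good": 0.8, "very/good": 0.9, "very": 0.1}

def classify(tokens, threshold=0.5):
    feats = extract_features(tokens)
    score = sum(WEIGHTS.get(f, 0.0) * v for f, v in feats.items())
    return score > threshold

print(classify("IPhone is very good .".split()))  # True
```

The score for the example is 0.8 + 0.9 + 0.1 = 1.8, above the threshold, so the sentence is classified positive.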

Page 20: The Limitations of Perceptrons

• The hand-coded features
  – Have a great influence on the performance
  – Finding suitable features costs a lot of effort
• A linear classifier with a hyperplane
  – Cannot separate non-linear data; e.g. the XOR function cannot be learned by a single-layer perceptron
  – The positive cases (0,1), (1,0) and the negative cases (0,0), (1,1) cannot be separated by a plane

Page 21: Learning with Non-linear Hidden Layers
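As a concrete illustration of what a non-linear hidden layer buys, a two-layer network with hand-set (not learned) weights computes XOR, which no single-layer perceptron can:

```python
# XOR via one non-linear hidden layer; the weights below are set
# by hand for illustration, not learned.

def step(x):
    # Threshold non-linearity.
    return 1 if x > 0 else 0

def xor(x1, x2):
    h1 = step(x1 + x2 - 0.5)   # hidden unit: x1 OR x2
    h2 = step(x1 + x2 - 1.5)   # hidden unit: x1 AND x2
    return step(h1 - h2 - 0.5) # output: OR but not AND

print([xor(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 0]
```

The hidden units re-represent the four points so that the output unit only needs a linear separation.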

Page 22: Feedforward Neural Networks

• The information is propagated from the inputs to the outputs
• Time plays no role (NO cycle between outputs and inputs)
• Also known as the multi-layer perceptron (MLP)
• Learning the weights of the hidden units is equivalent to learning features
• Networks without hidden layers are very limited in the input-output mappings they can model
  – More layers of linear units do not help: the result is still linear
  – Fixed output non-linearities are not enough
• Architecture: inputs x1 … xn → 1st hidden layer → 2nd hidden layer → output layer

Page 23: Multiple-Layer Neural Networks

• What are those hidden neurons doing?
  – Maybe representing outlines

Page 24: General Optimizing (Learning) Algorithms

• Gradient Descent
• Stochastic Gradient Descent (SGD)
  – Minibatch SGD (m > 1), Online GD (m = 1)
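A minimal sketch of minibatch SGD on a toy least-squares problem; the data, learning rate, and batch size m = 4 are illustrative assumptions (setting m = 1 gives online GD):

```python
import random

random.seed(0)
# Toy data for y = w * x with true w = 3 (illustrative assumption).
data = [(k / 10.0, 3.0 * (k / 10.0)) for k in range(1, 21)]

w, lr, m = 0.0, 0.01, 4  # initial weight, learning rate, minibatch size
for epoch in range(200):
    random.shuffle(data)  # stochastic: visit examples in random order
    for i in range(0, len(data), m):
        batch = data[i:i + m]
        # Gradient of the mean squared error over the minibatch.
        grad = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
        w -= lr * grad
print(round(w, 3))  # ≈ 3.0
```

Full-batch gradient descent would use all 20 examples per update; the minibatch variant trades gradient accuracy for many more (cheaper, noisier) updates.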

Page 25: Computational/Flow Graphs

• Describing mathematical expressions
• For example
  – e = (a + b) * (b + 1)
    • c = a + b, d = b + 1, e = c * d
  – If a = 2, b = 1, then c = 3, d = 2, e = 6
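The example above, evaluated forward along the graph and then differentiated backward by the chain rule (plain Python, values as on the slide):

```python
# Forward and backward pass on the graph e = (a + b) * (b + 1),
# with intermediate nodes c = a + b and d = b + 1, at a = 2, b = 1.
a, b = 2.0, 1.0
c = a + b  # 3
d = b + 1  # 2
e = c * d  # 6

# Backward pass: chain rule along each path back to the inputs.
de_dc = d                          # ∂e/∂c = d = 2
de_dd = c                          # ∂e/∂d = c = 3
de_da = de_dc * 1.0                # ∂e/∂a = ∂e/∂c · ∂c/∂a = 2
de_db = de_dc * 1.0 + de_dd * 1.0  # b feeds both c and d: 2 + 3 = 5
print(e, de_da, de_db)  # 6.0 2.0 5.0
```

The two-path sum for ∂e/∂b is exactly what backpropagation automates on larger graphs.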

Page 26: Derivatives on Computational Graphs

Page 27: Computational Graph Backward Pass (Backpropagation)

Page 28: An FNN POS Tagger

Page 29: Part 2.3: Word Embeddings

Page 30: Typical Approaches for Word Representation

• 1-hot representation (orthogonality)
  – bag-of-words model
  – sun → [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, …]
  – star → [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …]
  – sim(star, sun) = 0
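The orthogonality is easy to verify: under cosine similarity, any two distinct one-hot vectors score exactly 0, "star" and "sun" included. The toy vocabulary below is an illustrative assumption.

```python
import math

vocab = ["a", "star", "sun", "the"]  # illustrative toy vocabulary

def one_hot(word):
    # A 1 in the word's vocabulary slot, 0 everywhere else.
    return [1.0 if w == word else 0.0 for w in vocab]

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv)

print(cosine(one_hot("star"), one_hot("sun")))  # 0.0
```

Distinct one-hot vectors never share a non-zero slot, so their dot product, and hence their similarity, is zero regardless of meaning.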

Page 31: Distributed Word Representation

• Each word is associated with a low-dimensional (compressed, 50-1000), dense (non-sparse), real-valued (continuous) vector: a word embedding
  – Learning word vectors through supervised models
• Nature
  – Semantic similarity as vector similarity

Page 32: How to Obtain Word Embeddings

Page 33: Neural Network Language Models

• Neural Network Language Models (NNLM)
  – Feedforward (Bengio et al. 2003)
    • Maximum-likelihood estimation
    • Back-propagation
    • Input: (n − 1) embeddings

Page 34: Predict Word Vectors Directly

• SENNA (Collobert and Weston, 2008)
• word2vec (Mikolov et al. 2013)

Page 35: word2vec: CBOW (Continuous Bag-of-Words)

• Add inputs from words within a short window to predict the current word
• The weights for different positions are shared
• Computationally much more efficient than a normal NNLM
• The hidden layer is just linear
• Each word is an embedding v(w)
• Each context is an embedding v′(c)

Page 36: word2vec: Skip-Gram

• Predicting surrounding words using the current word
• Similar performance to CBOW
• Each word is an embedding v(w)
• Each context is an embedding v′(c)

Page 37: word2vec Training

• SGD + backpropagation
• Most of the computational cost is a function of the size of the vocabulary (millions)
• Accelerating training
  – Negative sampling
    • Mikolov et al. 2013
  – Hierarchical decomposition
    • Morin and Bengio 2005; Mnih and Hinton 2008; Mikolov et al. 2013
  – Graphics Processing Units (GPU)

Page 38: Word Analogy
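The classic analogy v(king) − v(man) + v(woman) ≈ v(queen) can be demonstrated with hand-built 3-d toy vectors; they are constructed so the analogy holds exactly, whereas real embeddings satisfy it only approximately.

```python
# Toy embeddings with interpretable axes (royalty, male, female);
# hand-built for illustration, not learned.
emb = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.8, 0.1],
    "woman": [0.1, 0.1, 0.8],
}

# king - man + woman, component-wise.
target = [k - m + w for k, m, w in
          zip(emb["king"], emb["man"], emb["woman"])]

def sqdist(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v))

nearest = min(emb, key=lambda word: sqdist(emb[word], target))
print(nearest)  # queen
```

The arithmetic cancels the "male" component of king and adds the "female" one, landing on queen's vector.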

Page 39: Part 2.4: Recurrent and Other Neural Networks

Page 40: Language Models

• A language model computes a probability for a sequence of words, P(w_1, …, w_m), or predicts a probability for the next word, P(w_{m+1} | w_1, …, w_m)
• Useful for machine translation, speech recognition, and so on
• Word ordering
  • P(the cat is small) > P(small the is cat)
• Word choice
  • P(there are four cats) > P(there are for cats)

Page 41: Traditional Language Models

• An incorrect but necessary Markov assumption!
• Probability is usually conditioned on a window of n previous words
  • P(w_1, …, w_m) = ∏_{i=1}^{m} P(w_i | w_1, …, w_{i−1}) ≈ ∏_{i=1}^{m} P(w_i | w_{i−(n−1)}, …, w_{i−1})
• How to estimate probabilities
  • p(w_2 | w_1) = count(w_1, w_2) / count(w_1)
  • p(w_3 | w_1, w_2) = count(w_1, w_2, w_3) / count(w_1, w_2)
• Performance improves by keeping counts of higher-order n-grams and smoothing, such as backoff (e.g. if a 4-gram is not found, try the 3-gram, etc.)
• Disadvantages
  • There are A LOT of n-grams!
  • Cannot see too long a history
  • e.g. P(坐/作了一整天的火车/作业): choosing between 坐 (“sit”) and 作 (“do”) depends on the distant word 火车 (“train”) vs. 作业 (“homework”)
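The count-based estimates above, with a crude backoff to unigram counts, look like this in code; the tiny corpus is an illustrative assumption:

```python
from collections import Counter

# Toy corpus (illustrative assumption).
corpus = "the cat is small . the cat sat .".split()
uni = Counter(corpus)                      # unigram counts
bi = Counter(zip(corpus, corpus[1:]))      # bigram counts

def p(w2, w1):
    # Bigram MLE: p(w2 | w1) = count(w1, w2) / count(w1),
    # backing off to the unigram estimate for unseen bigrams.
    if bi[(w1, w2)] > 0:
        return bi[(w1, w2)] / uni[w1]
    return uni[w2] / len(corpus)

print(p("cat", "the"))  # 1.0: every "the" in the corpus precedes "cat"
print(p("dog", "the"))  # 0.0: unseen bigram, and "dog" has no unigram count
```

Real systems replace this crude backoff with proper smoothing (e.g. Katz backoff or Kneser-Ney), but the counting skeleton is the same.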

Page 42: Recurrent Neural Networks (RNNs)

• Condition the neural network on all previous inputs
• RAM requirement only scales with the number of inputs
• Unrolled architecture: at each step t, input x_t enters through W2, the hidden state h_t is fed from h_{t−1} through W1, and the output y_t is read out through W3

Page 43: Recurrent Neural Networks (RNNs)

• At a single time step t
  • h_t = tanh(W1 h_{t−1} + W2 x_t)
  • ŷ_t = softmax(W3 h_t)
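The single-step equations translate directly into code; the 2-dimensional sizes and the weight values below are illustrative assumptions:

```python
import math

def matvec(W, v):
    # Matrix-vector product over plain Python lists.
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_step(h_prev, x, W1, W2, W3):
    # h_t = tanh(W1 h_{t-1} + W2 x_t)
    pre = [a + b for a, b in zip(matvec(W1, h_prev), matvec(W2, x))]
    h = [math.tanh(z) for z in pre]
    # y_hat_t = softmax(W3 h_t)
    logits = matvec(W3, h)
    mx = max(logits)
    exps = [math.exp(z - mx) for z in logits]
    total = sum(exps)
    y = [e / total for e in exps]
    return h, y

# Illustrative weights, not learned values.
W1 = [[0.5, 0.0], [0.0, 0.5]]
W2 = [[1.0, 0.0], [0.0, 1.0]]
W3 = [[1.0, -1.0], [-1.0, 1.0]]
h, y = rnn_step([0.0, 0.0], [1.0, -1.0], W1, W2, W3)
print(abs(sum(y) - 1.0) < 1e-9)  # True: the softmax output sums to 1
```

Running the network over a sequence just means feeding each step's h back in as the next step's h_prev.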

Page 44: Training RNNs is Hard

• Ideally, inputs from many time steps ago can modify output y
• For example, consider the network unrolled over 2 time steps

Page 45: Back-Propagation Through Time (BPTT)

• The total error is the sum of the errors at each time step t
  • ∂E/∂W = Σ_{t=1}^{T} ∂E_t/∂W
• ∂E_t/∂W3 = (∂E_t/∂ŷ_t)(∂ŷ_t/∂W3) is easy to calculate
• But calculating ∂E_t/∂W1 = (∂E_t/∂ŷ_t)(∂ŷ_t/∂h_t)(∂h_t/∂W1) is hard (also for W2)
• Because h_t = tanh(W1 h_{t−1} + W2 x_t) depends on h_{t−1}, which depends on W1 and h_{t−2}, and so on
• So ∂E_t/∂W1 = Σ_{k=1}^{t} (∂E_t/∂ŷ_t)(∂ŷ_t/∂h_t)(∂h_t/∂h_k)(∂h_k/∂W1)

Page 46: The Vanishing Gradient Problem

• ∂E_t/∂W = Σ_{k=1}^{t} (∂E_t/∂ŷ_t)(∂ŷ_t/∂h_t)(∂h_t/∂h_k)(∂h_k/∂W), with h_t = tanh(W1 h_{t−1} + W2 x_t)
• ∂h_t/∂h_k = ∏_{j=k+1}^{t} ∂h_j/∂h_{j−1} = ∏_{j=k+1}^{t} W1 diag[tanh′(⋯)]
• ‖∂h_t/∂h_{t−1}‖ ≤ γ ‖W1‖ ≤ γ λ1
  • where γ bounds ‖diag[tanh′(⋯)]‖ and λ1 is the largest singular value of W1
• ‖∂h_t/∂h_k‖ ≤ (γ λ1)^{t−k} → 0 if γ λ1 < 1
• This product can become very small or very large quickly → vanishing or exploding gradients
• Trick for exploding gradients: the clipping trick (set a threshold)
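The clipping trick is a one-liner: rescale the gradient whenever its norm exceeds a threshold. The threshold value 5.0 below is an illustrative assumption.

```python
import math

def clip(grad, threshold=5.0):
    # Rescale the gradient vector to the threshold norm if it is
    # too large; leave it untouched otherwise.
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > threshold:
        return [g * threshold / norm for g in grad]
    return grad

g = clip([30.0, 40.0], threshold=5.0)  # norm 50 → rescaled to norm 5
print([round(x, 2) for x in g])  # [3.0, 4.0]
```

Clipping preserves the gradient's direction while bounding the step size, which is why it helps with exploding (but not vanishing) gradients.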

Page 47: A “Solution”

• Intuition
  • Ensure γ λ1 ≥ 1 → to prevent vanishing gradients
• So…
  • Proper initialization of the W matrices
  • Use ReLU instead of tanh or sigmoid activation functions

Page 48: A Better “Solution”

• Recall the original transition equation
  • h_t = tanh(W1 h_{t−1} + W2 x_t)
• We can instead update the state additively
  • u_t = tanh(W1 h_{t−1} + W2 x_t)
  • h_t = h_{t−1} + u_t
  • then ∂h_t/∂h_{t−1} = 1 + ∂u_t/∂h_{t−1} ≥ 1
• On the other hand
  • h_t = h_{t−1} + u_t = h_{t−2} + u_{t−1} + u_t = ⋯

Page 49: A Better “Solution” (cont.)

• Interpolate between the old state and the new state (“choosing to forget”)
  • f_t = σ(W_f x_t + U_f h_{t−1})
  • h_t = f_t ⊙ h_{t−1} + (1 − f_t) ⊙ u_t
• Introduce a separate input gate i_t
  • i_t = σ(W_i x_t + U_i h_{t−1})
  • h_t = f_t ⊙ h_{t−1} + i_t ⊙ u_t
• Selectively expose the memory cell c_t with an output gate o_t
  • o_t = σ(W_o x_t + U_o h_{t−1})
  • c_t = f_t ⊙ c_{t−1} + i_t ⊙ u_t
  • h_t = o_t ⊙ tanh(c_t)
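The three bullet points compose into one gated update. Here is the full step in scalar (1-dimensional) form, with hand-picked illustrative weights:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U):
    f = sigmoid(W["f"] * x + U["f"] * h_prev)    # forget gate f_t
    i = sigmoid(W["i"] * x + U["i"] * h_prev)    # input gate i_t
    o = sigmoid(W["o"] * x + U["o"] * h_prev)    # output gate o_t
    u = math.tanh(W["u"] * x + U["u"] * h_prev)  # candidate update u_t
    c = f * c_prev + i * u                       # additive cell update
    h = o * math.tanh(c)                         # selectively exposed state
    return h, c

# Illustrative weights, not learned values.
W = {"f": 0.5, "i": 0.5, "o": 0.5, "u": 1.0}
U = {"f": 0.1, "i": 0.1, "o": 0.1, "u": 0.1}
h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x, h, c, W, U)
print(-1.0 < h < 1.0)  # True: h is bounded by the o_t * tanh(c_t) form
```

Note that the cell state c is the additive pathway from the previous slide; the gates only modulate what flows into it and what is read out of it.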

Page 50: Long Short-Term Memory (LSTM)

• Hochreiter & Schmidhuber, 1997
• LSTM = additive updates + gating
• Cell diagram: inputs x_t, h_{t−1}, c_{t−1} pass through the σ gates (forget, input, output) and tanh non-linearities to produce c_t and h_t

Page 51: Gated Recurrent Units, GRU (Cho et al. 2014)

• Main ideas
  • Keep around memories to capture long-distance dependencies
  • Allow error messages to flow at different strengths depending on the inputs
• Update gate
  • Based on the current input and hidden state
  • z_t = σ(W_z x_t + U_z h_{t−1})
• Reset gate
  • Similar, but with different weights
  • r_t = σ(W_r x_t + U_r h_{t−1})

Page 52: GRU

• New memory content
  • h̃_t = tanh(W x_t + r_t ⊙ U h_{t−1})
• The update gate z controls how much of the past state should matter now
  • If z is close to 1, then we can copy information in that unit through many time steps → less vanishing gradient!
• If the reset gate r unit is close to 0, then this ignores the previous memory and only stores the new input information → allows the model to drop information that is irrelevant in the future
• Units with long-term dependencies have active update gates z
• Units with short-term dependencies often have very active reset gates r
• The final memory at time step t combines the current and previous time steps
  • h_t = z_t ⊙ h_{t−1} + (1 − z_t) ⊙ h̃_t
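Putting the gates and the interpolation together, one GRU step in scalar form (the weight values are illustrative assumptions):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(x, h_prev, Wz, Uz, Wr, Ur, W, U):
    z = sigmoid(Wz * x + Uz * h_prev)              # update gate z_t
    r = sigmoid(Wr * x + Ur * h_prev)              # reset gate r_t
    h_tilde = math.tanh(W * x + r * (U * h_prev))  # new memory content
    return z * h_prev + (1 - z) * h_tilde          # interpolated state h_t

# Illustrative weights, not learned values.
h = 0.0
for x in [1.0, 0.5, -0.5]:
    h = gru_step(x, h, 0.5, 0.1, 0.5, 0.1, 1.0, 0.2)
print(-1.0 < h < 1.0)  # True: h is a convex mix of values in (-1, 1)
```

Compared with the LSTM step, there is no separate cell state: the interpolation with z plays the role of the forget/input gates, and r gates only the recurrent contribution to the candidate.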

Page 53: LSTM vs. GRU

• No clear winner!
• Tuning hyperparameters like layer size is probably more important than picking the ideal architecture
• GRUs have fewer parameters and thus may train a bit faster or need less data to generalize
• If you have enough data, the greater expressive power of LSTMs may lead to better results

Page 54: More RNNs

• Bidirectional RNN
• Stacked bidirectional RNN

Page 55: Tree-LSTMs

• Traditional sequential composition
• Tree-structured composition

Page 56: More Applications of RNNs

• Neural machine translation
• Handwriting generation
• Image caption generation
• …

Page 57: Neural Machine Translation

• RNN trained end-to-end: encoder-decoder
• Encoder: reads “I love you” into hidden states h1, h2, h3; the final state needs to capture the entire sentence!
• Decoder: generates the translation (我 …) conditioned on that state

Page 58: Attention Mechanism – Scoring

• Bahdanau et al. 2015
• Each decoder state h_t is scored against every encoder state h_s: α = score(h_t, h_s)
• The resulting weights (e.g. 0.3, 0.6, 0.1 over h1, h2, h3 for “I love you”) mix the encoder states into a context vector c
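A sketch of the scoring step: dot-product scores (one possible choice of score function), a softmax into the weights α, and a weighted mix of the encoder states into the context c. The toy vectors are illustrative assumptions.

```python
import math

def attend(h_dec, enc_states):
    # Score each encoder state against the decoder state
    # (dot product here; other score functions are possible).
    scores = [sum(a * b for a, b in zip(h_dec, h_s)) for h_s in enc_states]
    # Softmax the scores into attention weights alpha.
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    # Mix the encoder states into the context vector c.
    c = [sum(a * h[i] for a, h in zip(alpha, enc_states))
         for i in range(len(h_dec))]
    return alpha, c

enc = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # encoder states h1, h2, h3
alpha, c = attend([0.0, 1.0], enc)
print(abs(sum(alpha) - 1.0) < 1e-9)  # True: weights form a distribution
```

Because c is recomputed for every decoder step, the decoder no longer has to squeeze the whole source sentence into one fixed vector.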

Page 59: Convolutional Neural Networks

• CS231n: Convolutional Neural Networks for Visual Recognition
• Pooling

Page 60: CNN for NLP

• Zhang, Y., & Wallace, B. (2015). A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification.

Page 61: Recursive Neural Network

• Socher, R., Manning, C., & Ng, A. (2011). Learning Continuous Phrase Representations and Syntactic Parsing with Recursive Neural Networks. NIPS.