RNN Review & Hierarchical Attention Networks
Source: web.eecs.utk.edu/~hqi/deeplearning/lecture17-rnn-han.pdf
RNN Review & Hierarchical Attention Networks
Shang Gao
Overview
◦ Review of Recurrent Neural Networks
◦ Advanced RNN Architectures
  ◦ Long Short-Term Memory
  ◦ Gated Recurrent Units
◦ RNNs for Natural Language Processing
  ◦ Word Embeddings
  ◦ NLP Applications
◦ Attention Mechanisms
◦ Hierarchical Attention Networks
Feedforward Neural Networks
In a regular feedforward network, each neuron takes in inputs from the neurons in the previous layer and then passes its output to the neurons in the next layer.
The neurons at the end make a classification based only on the data from the current input.
Recurrent Neural Networks
In a recurrent neural network, each neuron takes in data from the previous layer AND its own output from the previous time step.
The neurons at the end make a classification decision based on NOT ONLY the input at the current time step BUT ALSO the inputs from all time steps before it.
Recurrent neural networks can thus capture patterns over time (e.g. weather, stock market data, speech audio, natural language).
Recurrent Neural Networks
In the example below, the neuron at the first time step takes in an input and generates an output.
The neuron at the second time step takes in an input AND ALSO the output from the first time step to make its decision.
The neuron at the third time step takes in an input and also the output from the second time step (which accounted for data from the first time step), so its output is affected by data from both the first and second time steps.
Recurrent Neural Networks
Traditional neuron: output = sigmoid(weights * input + bias)
Recurrent neuron: output = sigmoid(weights1 * input + weights2 * previous_output + bias)
or: output = sigmoid(weights * concat(input, previous_output) + bias)
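As a concrete illustration, a single recurrent step can be written in a few lines of NumPy (the layer sizes and random weights below are arbitrary, purely for demonstration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
W_x = rng.normal(size=(4, 3))   # weights for the current input (3 features, 4 units)
W_h = rng.normal(size=(4, 4))   # weights for the neuron's own previous output
b = np.zeros(4)

def recurrent_neuron(x, h_prev):
    """One step of a basic recurrent cell: mixes the current input
    with the cell's own output from the previous time step."""
    return sigmoid(W_x @ x + W_h @ h_prev + b)

# Unroll over a short sequence of 5 inputs.
h = np.zeros(4)
for x in rng.normal(size=(5, 3)):
    h = recurrent_neuron(x, h)
```

After the loop, `h` depends on every input in the sequence, not just the last one.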
Toy RNN Example: Adding Binary
At each time step, the RNN takes in two values representing binary input.
At each time step, the RNN outputs the sum of the two binary values, taking into account any carryover from the previous time step.
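The toy task's data can be generated directly; this sketch (the function name and bit width are illustrative) builds one input/target sequence, least-significant bit first, so the carry flows in the direction the RNN reads:

```python
import numpy as np

def binary_add_sequence(a, b, n_bits=8):
    """Build one toy training sequence: at each time step the RNN sees
    one bit of a and one bit of b (least-significant bit first) and
    should output the corresponding bit of a + b."""
    a_bits = [(a >> i) & 1 for i in range(n_bits)]
    b_bits = [(b >> i) & 1 for i in range(n_bits)]
    s_bits = [((a + b) >> i) & 1 for i in range(n_bits)]
    x = np.array(list(zip(a_bits, b_bits)))  # shape (n_bits, 2): the two inputs per step
    y = np.array(s_bits)                     # shape (n_bits,): the target sum bit per step
    return x, y

x, y = binary_add_sequence(3, 6)  # 3 + 6 = 9, i.e. bits 1001
```

Predicting `y[1]` correctly requires remembering the carry produced at step 0, which is exactly what the recurrent state is for.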
Problems with Basic RNNs
In a basic RNN, new data is written into each cell at every time step.
Data from very early time steps gets diluted because it is written over so many times.
In the example below, data from the first time step is read into the RNN.
At each subsequent time step, the RNN factors in data from the current time step.
By the end of the RNN, the data from the first time step has very little impact on the output of the RNN.
Problems with Basic RNNs
Basic RNN cells can't retain information across a large number of time steps.
Depending on the problem, RNNs can lose data in as few as 3-5 time steps.
This causes problems on tasks where information needs to be retained over a long time.
For example, in natural language processing, the meaning of a pronoun may depend on what was stated in a previous sentence.
Long Short-Term Memory
Long Short-Term Memory (LSTM) cells are advanced RNN cells that address the problem of long-term dependencies.
Instead of always writing to each cell at every time step, each unit has an internal 'memory' that can be written to selectively.
Long Short-Term Memory
Input from the current time step is written to the internal memory based on how relevant it is to the problem (relevance is learned during training through backpropagation).
If the input isn't relevant, no data is written into the cell.
This way data can be preserved over many time steps and retrieved when it is needed.
Long Short-Term Memory
Movement of data into and out of an LSTM cell is controlled by "gates".
The "forget gate" outputs a value between 0 (delete) and 1 (keep) and controls how much of the internal memory to keep from the previous time step.
For example, at the end of a sentence, when a '.' is encountered, we may want to reset the internal memory of the cell.
Long Short-Term Memory
The "candidate value" is the processed input value from the current time step that may be added to memory.
◦ Note that tanh activation is used for the "candidate value" to allow negative values to subtract from memory
The "input gate" outputs a value between 0 (delete) and 1 (keep) and controls how much of the candidate value to add to memory.
Long Short-Term Memory
Combined, the "input gate" and "candidate value" determine what new data gets written into memory.
The "forget gate" determines how much of the previous memory to retain.
The new memory of the LSTM cell is the "forget gate" * the previous memory state + the "input gate" * the "candidate value" from the current time step.
Long Short-Term Memory
The LSTM cell does not output the full contents of its memory to the next layer.
◦ Stored data in memory might not be relevant for the current time step, e.g., a cell can store a pronoun reference and only output it when the pronoun appears
Instead, an "output gate" outputs a value between 0 and 1 that determines how much of the memory to output.
The memory goes through a final tanh activation before being passed to the next layer.
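Putting the gates together, one LSTM time step can be sketched in NumPy as follows (the stacked weight matrix, sizes, and initialization are illustrative, not the lecture's exact setup):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step following the gate equations above.
    W maps concat(h_prev, x) to the four stacked gate pre-activations."""
    z = W @ np.concatenate([h_prev, x]) + b
    n = h_prev.shape[0]
    f = sigmoid(z[0*n:1*n])        # forget gate: how much old memory to keep
    i = sigmoid(z[1*n:2*n])        # input gate: how much new data to write
    g = np.tanh(z[2*n:3*n])        # candidate value (tanh allows negative values)
    o = sigmoid(z[3*n:4*n])        # output gate: how much memory to reveal
    c = f * c_prev + i * g         # new internal memory
    h = o * np.tanh(c)             # final tanh, then gated output to the next layer
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = rng.normal(scale=0.1, size=(4 * n_hid, n_hid + n_in))
b = np.zeros(4 * n_hid)
h, c = np.zeros(n_hid), np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):
    h, c = lstm_step(x, h, c, W, b)
```

Note that the memory `c` is carried forward separately from the output `h`, which is what lets data survive many time steps.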
Gated Recurrent Units
Gated Recurrent Units (GRUs) are very similar to LSTMs but use two gates instead of three.
The "update gate" determines how much of the previous memory to keep.
The "reset gate" determines how to combine the new input with the previous memory.
The entire internal memory is output without an additional activation.
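For comparison with the LSTM step, here is a matching GRU step sketch (weight shapes and the omitted biases are illustrative simplifications):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, W_z, W_r, W_h):
    """One GRU time step: two gates, and the state itself is the output."""
    hx = np.concatenate([h_prev, x])
    z = sigmoid(W_z @ hx)                    # update gate: keep old memory vs. update
    r = sigmoid(W_r @ hx)                    # reset gate: how to mix in the past
    h_cand = np.tanh(W_h @ np.concatenate([r * h_prev, x]))
    return (1 - z) * h_prev + z * h_cand     # no extra output activation

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W_z, W_r, W_h = (rng.normal(scale=0.1, size=(n_hid, n_hid + n_in)) for _ in range(3))
h = np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):
    h = gru_step(x, h, W_z, W_r, W_h)
```

Unlike the LSTM, there is no separate memory vector: the hidden state plays both roles, which is why the GRU needs fewer parameters.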
LSTMs vs GRUs
Greff et al. (2015) compared LSTMs and GRUs and found they perform about the same.
Jozefowicz et al. (2015) generated more than ten thousand RNN variants and determined that, depending on the task, some may perform better than LSTMs.
GRUs train faster than LSTMs because they are less complex.
Generally speaking, tuning hyperparameters (e.g. number of units, size of weights) will probably affect performance more than picking between GRU and LSTM.
RNNs for Natural Language Processing
The natural input for a neural network is a vector of numeric values (e.g. pixel intensities for imaging or audio frequencies for speech recognition).
How do you feed language as input into a neural network?
The most basic solution is one-hot encoding:
◦ A long vector (equal to the length of your vocabulary) where each index represents one word in the vocabulary
◦ For each word, the index corresponding to that word is set to 1, and everything else is set to 0
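The encoding takes two lines of NumPy (the three-word vocabulary is a toy example):

```python
import numpy as np

vocab = ["the", "cat", "sat"]               # toy vocabulary
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Vector as long as the vocabulary: 1 at the word's index, 0 elsewhere."""
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

v = one_hot("cat")
```

A real vocabulary has tens of thousands of entries, so these vectors are long and almost entirely zero.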
One-Hot Encoding LSTM Example
Trained an LSTM to predict the next character given a sequence of characters.
Training corpus: all books in the Hitchhiker's Guide to the Galaxy series.
One-hot encoding used to convert each character into a vector.
72 possible characters: lowercase letters, uppercase letters, numbers, and punctuation.
The input vector is fed into a layer of 256 LSTM nodes.
The LSTM output is fed into a softmax layer that predicts the following character.
The character with the highest softmax probability is chosen as the next character.
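The final step, greedy decoding from the softmax, looks like this (a toy three-character vocabulary and made-up logits stand in for the 72-character LSTM output):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())     # subtract max for numerical stability
    return e / e.sum()

chars = np.array(list("abc"))            # toy 3-character "vocabulary"
logits = np.array([0.5, 2.1, -1.0])      # hypothetical LSTM layer output
probs = softmax(logits)
next_char = chars[np.argmax(probs)]      # pick the most probable character
```

Sampling from `probs` instead of taking the argmax gives more varied generated text; the slides' samples use the greedy choice described above.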
Generated Samples
700 iterations: aeae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae aeae ae ae ae ae
4200 iterations: the sand and the said the sand and the said the sand and the said the sand and the said the sand and the said the
36000 iterations: seared to be a little was a small beach of the ship was a small beach of the ship was a small beach of the ship
100000 iterations: the second the stars is the stars to the stars in the stars that he had been so the ship had been so the ship had been
290000 iterations: started to run a computer to the computer to take a bit of a problem off the ship and the sun and the air was the sound
500000 iterations: "I think the Galaxy will be a lot of things that the second man who could not be continually and the sound of the stars
One-Hot Encoding Shortcomings
One-hot encoding is lacking because it fails to capture semantic similarity between words, i.e., the inherent meaning of words.
For example, the words "happy", "joyful", and "pleased" all have similar meanings, but under one-hot encoding they are three distinct and unrelated entities.
What if we could capture the meaning of words within a numerical context?
Word Embeddings
Word embeddings are vector representations of words that attempt to capture semantic meaning.
Each word is represented as a vector of numerical values.
Each index in the vector represents some abstract "concept":
◦ These concepts are unlabeled and learned during training
Words that are similar will have similar vectors.

| Word | Masculinity | Royalty | Youth | Intelligence |
|---|---|---|---|---|
| King | 0.95 | 0.95 | -0.1 | 0.6 |
| Queen | -0.95 | 0.95 | -0.1 | 0.6 |
| Prince | 0.8 | 0.8 | 0.7 | 0.4 |
| Woman | -0.95 | 0.01 | -0.1 | 0.2 |
| Peasant | 0.1 | -0.95 | 0.1 | -0.3 |
| Doctor | 0.12 | 0.1 | -0.2 | 0.95 |
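"Similar words have similar vectors" can be checked numerically with cosine similarity, using the toy vectors from the table above:

```python
import numpy as np

# Embeddings copied from the toy table above
emb = {
    "King":   np.array([0.95, 0.95, -0.1, 0.6]),
    "Queen":  np.array([-0.95, 0.95, -0.1, 0.6]),
    "Prince": np.array([0.8, 0.8, 0.7, 0.4]),
    "Woman":  np.array([-0.95, 0.01, -0.1, 0.2]),
}

def cosine(a, b):
    """Cosine of the angle between two vectors: near 1 = similar."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

king_prince = cosine(emb["King"], emb["Prince"])   # high: related concepts
king_woman = cosine(emb["King"], emb["Woman"])     # low: dissimilar on most axes
```

Under one-hot encoding every pair of distinct words has cosine similarity 0, which is exactly the shortcoming embeddings fix.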
Word2Vec
Words that appear in the same context are more likely to have the same meaning:
◦ I am excited to see you today!
◦ I am ecstatic to see you today!
Word2Vec is an algorithm that uses a funnel-shaped single-hidden-layer neural network (similar to an autoencoder) to create word embeddings.
Given a word (in one-hot encoded format), it tries to predict the neighbors of that word (also in one-hot encoded format), or vice versa.
Words that appear in the same context will have similar embeddings.
Word2Vec
The model is trained on a large corpus of text using regular backpropagation.
For each word in the corpus, predict the 5 words to the left and right (or vice versa).
Once the model is trained, the embedding for a particular word is the row of the weight matrix associated with that word.
Many pretrained vectors (e.g. Google's) can be downloaded online.
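The (center word, context word) training pairs described above can be generated with a sliding window (a window of 2 is used here rather than 5, purely for brevity):

```python
def skipgram_pairs(tokens, window=2):
    """Build (center, context) training pairs: for each word, the targets
    are the words up to `window` positions to its left and right."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs(["i", "am", "excited", "to", "see", "you"], window=2)
```

Swapping the roles of center and context in each pair gives the "vice versa" (CBOW-style) variant.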
Word2Vec on 20 Newsgroups
Basic Deep Learning NLP Pipeline
Generate word embeddings
◦ Python gensim package
Feed word embeddings into an LSTM or GRU layer
Feed the output of the LSTM or GRU layer into a softmax classifier
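The shape of this pipeline can be sketched end to end in NumPy (all sizes and random weights are illustrative; a plain RNN stands in for the LSTM/GRU layer, and the embeddings would in practice come from a trained model such as gensim's Word2Vec):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, emb_dim, hid, n_classes = 50, 8, 16, 3

# 1. Word embedding table (one row per vocabulary word)
E = rng.normal(size=(vocab_size, emb_dim))

# 2. Recurrent layer weights (plain RNN as a stand-in for LSTM/GRU)
W_x = rng.normal(scale=0.1, size=(hid, emb_dim))
W_h = rng.normal(scale=0.1, size=(hid, hid))

# 3. Softmax classifier over the final hidden state
W_out = rng.normal(scale=0.1, size=(n_classes, hid))

def classify(token_ids):
    h = np.zeros(hid)
    for t in token_ids:                     # look up embedding, step the RNN
        h = np.tanh(W_x @ E[t] + W_h @ h)
    z = W_out @ h                           # classify the last hidden state
    e = np.exp(z - z.max())
    return e / e.sum()

probs = classify([4, 17, 23, 9])            # hypothetical token id sequence
```

Training would backpropagate through all three stages; only the forward pass is shown here.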
NLP Applications for RNNs
Language Models
◦ Given a series of words, predict the next word
◦ Understand the inherent patterns in a given language
◦ Useful for autocompletion and machine translation
Sentiment Analysis
◦ Given a sentence or document, classify it as positive or negative
◦ Useful for analyzing the success of a product launch or automated stock trading based off news
Other forms of text classification
◦ Cancer pathology report classification
Advanced Applications
Question Answering
◦ Read a document and then answer questions
◦ Many models use RNNs as their foundation
Automated Image Captioning
◦ Given an image, automatically generate a caption
◦ Many models use both CNNs and RNNs
Machine Translation
◦ Automatically translate text from one language to another
◦ Many models (including Google Translate) use RNNs as their foundation
LSTM Improvements: Bi-directional LSTMs
Sometimes, important context for a word comes after the word (especially important for translation):
◦ I saw a crane flying across the sky
◦ I saw a crane lifting a large boulder
Solution: use two LSTM layers, one that reads the input forward and one that reads the input backwards, and concatenate their outputs.
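The forward/backward/concatenate scheme looks like this (a plain RNN stands in for the LSTM, and all sizes are illustrative):

```python
import numpy as np

def run_rnn(xs, W_x, W_h):
    """Plain RNN (stand-in for an LSTM) returning the output at every step."""
    h, outs = np.zeros(W_h.shape[0]), []
    for x in xs:
        h = np.tanh(W_x @ x + W_h @ h)
        outs.append(h)
    return np.array(outs)

def bidirectional(xs, fwd_weights, bwd_weights):
    """Read the sequence forward and backward with two separate RNNs,
    then concatenate the two outputs at each time step."""
    f = run_rnn(xs, *fwd_weights)
    b = run_rnn(xs[::-1], *bwd_weights)[::-1]   # re-reverse to align time steps
    return np.concatenate([f, b], axis=1)

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
xs = rng.normal(size=(6, n_in))
fwd = (rng.normal(scale=0.1, size=(n_hid, n_in)), rng.normal(scale=0.1, size=(n_hid, n_hid)))
bwd = (rng.normal(scale=0.1, size=(n_hid, n_in)), rng.normal(scale=0.1, size=(n_hid, n_hid)))
out = bidirectional(xs, fwd, bwd)   # each step now sees left AND right context
```

At every position, the output vector combines a summary of everything before the word (forward pass) with everything after it (backward pass).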
LSTM Improvements: Attention Mechanisms
Sometimes only a few words in a sentence or document are important and the rest do not contribute as much meaning.
◦ For example, when classifying cancer location from cancer pathology reports, we may only care about certain keywords like "right upper lung" or "ovarian"
In a traditional RNN, we usually take the output at the last time step.
By the last time step, information from the important words may have been diluted, even with LSTM and GRU units.
How can we capture the information at the most important words?
LSTM Improvements: Attention Mechanisms
Naïve solution: to prevent information loss, instead of using the LSTM output at the last time step, take the LSTM output at every time step and use the average.
Better solution: find the important time steps, and weight the output at those time steps much higher when doing the average.
LSTM Improvements: Attention Mechanisms
An attention mechanism calculates how important the LSTM output at each time step is.
It's a simple feedforward network with a single (tanh) hidden layer and a softmax output.
At each time step, feed the output from the LSTM/GRU into the attention mechanism.
LSTM Improvements: Attention Mechanisms
Once the attention mechanism has all the time steps, it calculates a softmax over all the time steps.
◦ The softmax always adds to 1
The softmax tells us how to weight the output at each time step, i.e., how important each time step is.
Multiply the output at each time step by its corresponding softmax weight and add to create a weighted average.
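These steps — score each time step with a small tanh layer, softmax the scores, take the weighted average — can be sketched as (all dimensions and weights are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention(H, W, b, v):
    """H: (T, d) matrix of RNN outputs, one row per time step.
    A single tanh hidden layer scores each time step, softmax turns
    the scores into weights, and the result is a weighted average."""
    scores = np.tanh(H @ W + b) @ v    # (T,) importance score per time step
    alpha = softmax(scores)            # weights over time steps, sum to 1
    return alpha @ H, alpha            # weighted average of the outputs

rng = np.random.default_rng(0)
T, d, a = 6, 8, 4                       # 6 time steps, 8-dim outputs, 4 hidden units
H = rng.normal(size=(T, d))
context, alpha = attention(H, rng.normal(size=(d, a)), np.zeros(a), rng.normal(size=a))
```

The vector `alpha` is exactly what the attention visualizations later in the lecture display: one weight per time step.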
LSTM Improvements: Attention Mechanisms
Attention mechanisms can take into account "context" to determine what's important.
Remember that the dot product is a measure of similarity: two vectors that are similar will have a larger dot product.
In a normal softmax layer, the input is dot-producted with randomly initialized weights before applying the softmax function.
LSTM Improvements: Attention Mechanisms
Instead, we can dot product with a vector that represents "context" to find the words most similar/relevant to that context:
◦ For question answering, it can represent the question being asked
◦ For machine translation, it can represent the previous word
◦ For classification, it can be initialized randomly and learned during training
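A tiny numeric illustration of scoring relevance by dot product with a context vector (all vectors below are made up):

```python
import numpy as np

question = np.array([0.9, 0.1, 0.0])           # hypothetical context vector
words = {
    "lung": np.array([0.8, 0.2, 0.1]),          # points in a similar direction
    "the":  np.array([0.0, 0.1, 0.9]),          # nearly orthogonal to the context
}

# Larger dot product = more similar/relevant to the context
scores = {w: float(v @ question) for w, v in words.items()}
```

Feeding these scores through a softmax, as in the previous slide, would concentrate nearly all the attention weight on "lung".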
LSTM Improvements: Attention Mechanisms
With attention, you can visualize how important each time step is for a particular task.
CNNs for Text Classification
Start with word embeddings
◦ If you have 10 words and your embedding size is 300, you'll have a 10x300 matrix
3 parallel convolution layers
◦ Take in word embeddings
◦ Sliding window that processes 3, 4, and 5 words at a time (1D conv)
◦ Filter sizes are 3x300x100, 4x300x100, and 5x300x100 (width, in-channels, out-channels)
◦ Each conv layer outputs a 10x100 matrix
Maxpool and concatenate
◦ For each filter channel, max pool across the entire width of the sentence
◦ This is like picking the 'most important' word in the sentence for each channel
◦ Also ensures every sentence, no matter how long, is represented by a same-length vector
◦ For each of the three 10x100 matrices, returns a 1x100 matrix
◦ Concatenate the three 1x100 matrices into a 1x300 matrix
Dense and softmax layers
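The convolution-then-maxpool stage can be sketched as follows (random filters, no padding — so each layer yields T-width+1 positions rather than the slide's padded 10 — and no bias or nonlinearity, for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
T, emb, ch = 10, 300, 100                  # 10 words, 300-dim embeddings, 100 filters
X = rng.normal(size=(T, emb))              # the 10x300 sentence matrix

def conv1d_maxpool(X, width, n_filters):
    """Slide a window of `width` words over the sentence, apply n_filters
    filters, then max-pool each filter over all window positions."""
    W = rng.normal(scale=0.01, size=(n_filters, width * emb))
    windows = np.array([X[i:i + width].ravel() for i in range(len(X) - width + 1)])
    feats = windows @ W.T                   # (positions, n_filters)
    return feats.max(axis=0)                # the 'most important' window per filter

# Three parallel widths, concatenated into one fixed-length vector
pooled = np.concatenate([conv1d_maxpool(X, w, ch) for w in (3, 4, 5)])
```

Whatever the sentence length, `pooled` always has length 300, which is what lets a fixed-size dense + softmax head sit on top.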
Hierarchical Attention Networks
Problem Overview
The National Cancer Institute has asked Oak Ridge National Lab to develop a program that can automatically classify cancer pathology reports.
Pathology reports are what doctors write up when they diagnose cancer, and NCI uses them to calculate national statistics and track health trends.
Challenges:
◦ Different doctors use different terminology to label the same types of cancer
◦ Some diagnoses may reference other types of cancer or other organs that are not the actual cancer being diagnosed
◦ Typos
Task: given a pathology report, teach a program to find the type of cancer, location of cancer, histological grade, etc.
Approach
The performance of various classifiers was tested:
◦ Traditional machine learning classifiers: Naive Bayes, Logistic Regression, Support Vector Machines, Random Forests, and XGBoost
◦ Traditional machine learning classifiers require manually defined features, such as n-grams and tf-idf
◦ Deep learning methods: recurrent neural networks, convolutional neural networks, and hierarchical attention networks
◦ Given enough data, deep learning methods can learn their own features, such as which words or phrases are important
The Hierarchical Attention Network is a relatively new deep learning model that came out last year and is one of the top performers.
HAN Architecture
The Hierarchical Attention Network (HAN) is a deep learning model for document classification.
It is built from bidirectional RNNs composed of GRUs/LSTMs with attention mechanisms.
It is composed of "hierarchies" where the outputs of the lower hierarchies become the inputs to the upper hierarchies.
HAN Architecture
Before feeding a document into the HAN, we first break it down into sentences (or in our case, lines).
The word hierarchy is responsible for creating sentence embeddings:
◦ This hierarchy reads in one full sentence at a time, in the form of word embeddings
◦ The attention mechanism selects the most important words
◦ The output is a sentence embedding that captures the semantic content of the sentence based on the most important words
HAN Architecture
The sentence hierarchy is responsible for creating the final document embedding:
◦ Identical structure to the word hierarchy
◦ Reads in the sentence embeddings output from the word hierarchy
◦ The attention mechanism selects the most important sentences
◦ The output is a document embedding representing the meaning of the entire document
The final document embedding is used for classification.
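The two-level structure can be sketched compactly — attention over words produces sentence embeddings, attention over those produces the document embedding. (A real HAN runs a bidirectional GRU at each level before attending; that step is omitted here, and all sizes and vectors are illustrative.)

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(H, v):
    """Attention-weighted average of the rows of H against context vector v."""
    alpha = softmax(np.tanh(H) @ v)
    return alpha @ H

d = 8
word_ctx, sent_ctx = rng.normal(size=d), rng.normal(size=d)   # learned in practice

# A 'document' of 3 sentences, each a matrix of word embeddings
doc = [rng.normal(size=(n_words, d)) for n_words in (5, 7, 4)]

sent_embs = np.array([attend(S, word_ctx) for S in doc])  # word hierarchy
doc_emb = attend(sent_embs, sent_ctx)                     # sentence hierarchy
# doc_emb now feeds a softmax classifier
```

Because each level has its own attention weights, the model can point at both the most important words within a sentence and the most important sentences within the document.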
Experimental Setup
945 cancer pathology reports, all cases of breast and lung cancer.
10-fold cross-validation used, 30 epochs per fold.
Hyperparameter optimization applied to models to find optimal parameters.
Two main tasks: primary site classification and histological grade classification.
◦ Uneven class distribution, some classes with only ~10 occurrences in the dataset
◦ F-score used as the performance metric
◦ Micro F-score is the weighted F-score average based on class size
◦ Macro F-score is the unweighted F-score average across all classes
HAN Performance: Primary Site
12 possible cancer subsite locations:
◦ 5 lung subsites
◦ 7 breast subsites
Deep learning methods outperformed all traditional ML methods except for XGBoost.
HAN had the best performance; pretraining improved performance even further.

Traditional Machine Learning Classifiers

| Classifier | Primary Site Micro F-Score | Primary Site Macro F-Score |
|---|---|---|
| Naive Bayes | .554 (.521, .586) | .161 (.152, .170) |
| Logistic Regression | .621 (.589, .652) | .222 (.207, .237) |
| Support Vector Machine (C=1, gamma=1) | .616 (.585, .646) | .220 (.205, .234) |
| Random Forest (num trees=100) | .628 (.597, .661) | .258 (.236, .283) |
| XGBoost (max depth=5, n estimators=300) | .709 (.681, .738) | .441 (.404, .474) |

Deep Learning Classifiers

| Classifier | Primary Site Micro F-Score | Primary Site Macro F-Score |
|---|---|---|
| Recurrent Neural Network (with attention mechanism) | .694 (.666, .722) | .468 (.432, .502) |
| Convolutional Neural Network | .712 (.680, .736) | .398 (.359, .434) |
| Hierarchical Attention Network (no pretraining) | .784 (.759, .810) | .566 (.525, .607) |
| Hierarchical Attention Network (with pretraining) | .800 (.776, .825) | .594 (.553, .636) |
HAN Performance: Histological Grade
4 possible histological grades:
◦ 1-4, indicating how abnormal tumor cells and tumor tissues look under a microscope, with 4 being most abnormal
◦ Indicates how quickly a tumor is likely to grow and spread
Other than RNNs, deep learning models generally outperform traditional ML models.
HAN had the best performance, but pretraining did not help performance.

Traditional Machine Learning Classifiers

| Classifier | Histological Grade Micro F-Score | Histological Grade Macro F-Score |
|---|---|---|
| Naive Bayes | .481 (.442, .519) | .264 (.244, .283) |
| Logistic Regression | .540 (.499, .576) | .340 (.309, .371) |
| Support Vector Machine (C=1, gamma=1) | .520 (.482, .558) | .330 (.301, .357) |
| Random Forest (num trees=100) | .597 (.558, .636) | .412 (.364, .476) |
| XGBoost (max depth=5, n estimators=300) | .673 (.634, .709) | .593 (.516, .662) |

Deep Learning Classifiers

| Classifier | Histological Grade Micro F-Score | Histological Grade Macro F-Score |
|---|---|---|
| Recurrent Neural Network (with attention mechanism) | .580 (.541, .617) | .474 (.416, .536) |
| Convolutional Neural Network | .716 (.681, .750) | .521 (.493, .548) |
| Hierarchical Attention Network (no pretraining) | .916 (.895, .936) | .841 (.778, .895) |
| Hierarchical Attention Network (with pretraining) | .904 (.881, .927) | .822 (.744, .883) |
TFIDF Document Embeddings
TFIDF-weighted Word2Vec embeddings reduced to 2 dimensions via PCA for (A) primary site train reports, (B) histological grade train reports, (C) primary site test reports, and (D) histological grade test reports.
![Page 50: RNN Review & Hierarchical Attention Networksweb.eecs.utk.edu/~hqi/deeplearning/lecture17-rnn-han.pdfWord Embeddings NLP Applications Attention Mechanisms Hierarchical Attention Networks](https://reader034.vdocuments.site/reader034/viewer/2022042919/5f6081b3600c73148e62a1ad/html5/thumbnails/50.jpg)
HAN Document Embeddings

HAN document embeddings reduced to 2 dimensions via PCA for (A.) primary site train reports, (B.) histological grade train reports, (C.) primary site test reports, and (D.) histological grade test reports.
Pretraining

We have access to more unlabeled data than labeled data (approximately 1500 unlabeled, 1000 labeled).

To utilize the unlabeled data, we trained our HAN to produce document embeddings that matched the corresponding TF-IDF-weighted word embeddings for each document.

HAN training and validation accuracy with and without pretraining for (A.) the primary site task and (B.) the histological grade task
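This pretraining objective can be sketched as a mean-squared-error regression of encoder outputs onto the TF-IDF document embeddings. In the sketch below a single linear map stands in for the full HAN encoder, and all shapes, names, and the learning rate are illustrative assumptions, not values from the source.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; a linear map stands in for the full HAN encoder.
vocab_size, emb_dim, n_docs = 50, 8, 32
W = rng.normal(scale=0.1, size=(vocab_size, emb_dim))   # encoder weights
X = rng.random((n_docs, vocab_size))                    # unlabeled documents
targets = rng.normal(size=(n_docs, emb_dim))            # TF-IDF doc embeddings

def pretrain_step(W, X, targets, lr=0.01):
    """One gradient step pushing encoder outputs toward the TF-IDF targets."""
    err = X @ W - targets            # prediction error
    loss = (err ** 2).mean()         # mean squared error
    grad = 2 * X.T @ err / err.size  # gradient of the MSE w.r.t. W
    return W - lr * grad, loss
```

After pretraining on the unlabeled reports, the encoder weights are fine-tuned on the labeled classification tasks.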
HAN Document Annotations
Most Important Words per Task

We can also use the HAN's attention weights to find the words that contribute most toward the classification task at hand:

Primary Site: mainstem, adeno, ca, lul, lower, breast, carina, cusa, upper, middle, rul, buttock, temporal, upper, retro, sputum

Histological Grade: poorly, g2, high, iii, dlr, undifferentiated, g3, ii, g1, moderately, intermediate, well, arising, 2
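One way such a ranking can be computed is to average each word's attention weight across the corpus, assuming per-token weights (e.g. the word-level softmax outputs of a HAN) are available for each document; the function below is an illustrative sketch, not the authors' code.

```python
import numpy as np

def top_words_by_attention(docs, attn, k=3):
    """Rank words by their average attention weight across documents.

    docs: list of token lists; attn: matching list of per-token weights.
    """
    totals, counts = {}, {}
    for tokens, weights in zip(docs, attn):
        for w, a in zip(tokens, weights):
            totals[w] = totals.get(w, 0.0) + a   # accumulate attention mass
            counts[w] = counts.get(w, 0) + 1     # count occurrences
    means = {w: totals[w] / counts[w] for w in totals}
    return sorted(means, key=means.get, reverse=True)[:k]
```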
Scaling

Relative to other models, the HAN is very slow to train
◦ On CPU, the HAN takes approximately 4 hours to go through 30 epochs
◦ In comparison, the CNN takes around 40 minutes to go through 30 epochs, and the traditional machine learning classifiers take less than a minute
◦ The HAN is slow due to its complex architecture and use of RNNs, which make gradients very expensive to compute

We are currently working to scale the HAN to run on multiple GPUs
◦ On TensorFlow, RNNs on GPU run slower than on CPU
◦ We are considering exploring a PyTorch implementation to get around this problem

We have successfully developed a distributed CPU-only HAN that runs on TITAN using MPI, with a 4x speedup on 8 nodes
Attention Is All You Need

A new paper from Google Brain, published June 2017, showed that competitive results in machine translation could be achieved with only attention mechanisms and no RNNs.

We applied the same architecture to replace the RNNs in our HAN.

Because attention mechanisms are just matrix multiplications, the new model runs about 10x faster than the HAN with RNNs.

The new model performs almost as well as the HAN with RNNs – 0.77 micro-F on primary site (compared to 0.78 for the original HAN) and 0.86 micro-F on histological grade (compared to 0.91 for the original HAN).

Because no RNNs are used, this model is much easier to scale on the GPU.
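The core operation behind this, scaled dot-product attention, can be sketched in a few lines of NumPy. This is the standard formulation from the paper (softmax(QKᵀ/√d_k)V), not the authors' exact implementation; note that it is nothing but matrix multiplications and a softmax, with no sequential recurrence to unroll.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # similarity of queries to keys
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over keys
    return weights @ V                                # weighted sum of values
```

Because every position attends to every other position in one batched matmul, the whole sequence is processed in parallel on a GPU, unlike an RNN.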
Other Future Work

Multitask Learning
◦ Predict histological grade, primary site, and other tasks simultaneously within the same model
◦ Hopefully boost the performance of all tasks by sharing information across tasks

Semi-Supervised Learning
◦ Utilize unlabeled data during training rather than in pretraining, with the goal of improving classification performance
◦ This task is challenging because in most semi-supervised settings all labels within the dataset are known; in our case, we only have a subset of the labels.
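The multitask idea above is commonly realized as one shared encoder feeding a separate softmax head per task; the NumPy sketch below illustrates that shape, with all layer sizes and weight names being hypothetical, not taken from the source.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical sizes: one shared document encoder, two task-specific heads.
doc_dim, hidden, n_sites, n_grades = 16, 8, 5, 4
W_shared = rng.normal(size=(doc_dim, hidden))   # shared encoder weights
W_site = rng.normal(size=(hidden, n_sites))     # primary-site head
W_grade = rng.normal(size=(hidden, n_grades))   # histological-grade head

def multitask_forward(x):
    """Shared representation; separate softmax outputs per task."""
    h = np.tanh(x @ W_shared)   # shared features drive both task losses
    return softmax(h @ W_site), softmax(h @ W_grade)
```

Training sums the per-task losses, so gradients from both tasks update the shared encoder, which is how information is shared across tasks.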
Questions?