RNN Review & Hierarchical Attention Networks
Source: web.eecs.utk.edu/~hqi/deeplearning/lecture17-rnn-han.pdf
RNN Review & Hierarchical Attention Networks
Shang Gao
Overview
◦ Review of Recurrent Neural Networks
◦ Advanced RNN Architectures
  ◦ Long Short-Term Memory
  ◦ Gated Recurrent Units
◦ RNNs for Natural Language Processing
  ◦ Word Embeddings
  ◦ NLP Applications
◦ Attention Mechanisms
◦ Hierarchical Attention Networks
Feedforward Neural Networks
In a regular feedforward network, each neuron takes in inputs from the neurons in the previous layer and then passes its output to the neurons in the next layer.
The neurons at the end make a classification based only on the data from the current input.
Recurrent Neural Networks
In a recurrent neural network, each neuron takes in data from the previous layer AND its own output from the previous time step.
The neurons at the end make a classification decision based on NOT ONLY the input at the current time step BUT ALSO the inputs from all time steps before it.
Recurrent neural networks can thus capture patterns over time (e.g. weather, stock market data, speech audio, natural language).
Recurrent Neural Networks
In the example below, the neuron at the first time step takes in an input and generates an output.
The neuron at the second time step takes in an input AND ALSO the output from the first time step to make its decision.
The neuron at the third time step takes in an input and also the output from the second time step (which accounted for data from the first time step), so its output is affected by data from both the first and second time steps.
Recurrent Neural Networks
Traditional neuron: output = sigmoid(weights * input + bias)
Recurrent neuron: output = sigmoid(weights1 * input + weights2 * previous_output + bias)
or: output = sigmoid(weights * concat(input, previous_output) + bias)
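As a concrete illustration, a single recurrent step can be written in a few lines of NumPy (the layer sizes and random weights below are arbitrary, purely for demonstration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
W_x = rng.normal(size=(4, 3))   # weights for the current input (3 features, 4 units)
W_h = rng.normal(size=(4, 4))   # weights for the neuron's own previous output
b = np.zeros(4)

def recurrent_neuron(x, h_prev):
    """One step of a basic recurrent cell: mixes the current input
    with the cell's own output from the previous time step."""
    return sigmoid(W_x @ x + W_h @ h_prev + b)

# Unroll over a short sequence of 5 inputs.
h = np.zeros(4)
for x in rng.normal(size=(5, 3)):
    h = recurrent_neuron(x, h)
```

After the loop, `h` depends on every input in the sequence, not just the last one.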
Toy RNN Example: Adding Binary
At each time step, the RNN takes in two values representing binary input.
At each time step, the RNN outputs the sum of the two binary values, taking into account any carryover from the previous time step.
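The toy task's data can be generated directly; this sketch (the function name and bit width are illustrative) builds one input/target sequence, least-significant bit first, so the carry flows in the direction the RNN reads:

```python
import numpy as np

def binary_add_sequence(a, b, n_bits=8):
    """Build one toy training sequence: at each time step the RNN sees
    one bit of a and one bit of b (least-significant bit first) and
    should output the corresponding bit of a + b."""
    a_bits = [(a >> i) & 1 for i in range(n_bits)]
    b_bits = [(b >> i) & 1 for i in range(n_bits)]
    s_bits = [((a + b) >> i) & 1 for i in range(n_bits)]
    x = np.array(list(zip(a_bits, b_bits)))  # shape (n_bits, 2): the two inputs per step
    y = np.array(s_bits)                     # shape (n_bits,): the target sum bit per step
    return x, y

x, y = binary_add_sequence(3, 6)  # 3 + 6 = 9, i.e. bits 1001
```

Predicting `y[1]` correctly requires remembering the carry produced at step 0, which is exactly what the recurrent state is for.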
Problems with Basic RNNs
In a basic RNN, new data is written into each cell at every time step.
Data from very early time steps gets diluted because it is written over so many times.
In the example below, data from the first time step is read into the RNN.
At each subsequent time step, the RNN factors in data from the current time step.
By the end of the RNN, the data from the first time step has very little impact on the output of the RNN.
Problems with Basic RNNs
Basic RNN cells can't retain information across a large number of time steps.
Depending on the problem, RNNs can lose data in as few as 3-5 time steps.
This causes problems on tasks where information needs to be retained over a long time.
For example, in natural language processing, the meaning of a pronoun may depend on what was stated in a previous sentence.
Long Short-Term Memory
Long Short-Term Memory (LSTM) cells are advanced RNN cells that address the problem of long-term dependencies.
Instead of always writing to each cell at every time step, each unit has an internal 'memory' that can be written to selectively.
Long Short-Term Memory
Input from the current time step is written to the internal memory based on how relevant it is to the problem (relevance is learned during training through backpropagation).
If the input isn't relevant, no data is written into the cell.
This way data can be preserved over many time steps and retrieved when it is needed.
Long Short-Term Memory
Movement of data into and out of an LSTM cell is controlled by "gates".
The "forget gate" outputs a value between 0 (delete) and 1 (keep) and controls how much of the internal memory to keep from the previous time step.
For example, at the end of a sentence, when a '.' is encountered, we may want to reset the internal memory of the cell.
Long Short-Term Memory
The "candidate value" is the processed input value from the current time step that may be added to memory.
◦ Note that tanh activation is used for the "candidate value" to allow negative values to subtract from memory
The "input gate" outputs a value between 0 (delete) and 1 (keep) and controls how much of the candidate value to add to memory.
Long Short-Term Memory
Combined, the "input gate" and "candidate value" determine what new data gets written into memory.
The "forget gate" determines how much of the previous memory to retain.
The new memory of the LSTM cell is the "forget gate" * the previous memory state + the "input gate" * the "candidate value" from the current time step.
Long Short-Term Memory
The LSTM cell does not output the full contents of its memory to the next layer.
◦ Stored data in memory might not be relevant for the current time step, e.g., a cell can store a pronoun reference and only output it when the pronoun appears
Instead, an "output gate" outputs a value between 0 and 1 that determines how much of the memory to output.
The memory goes through a final tanh activation before being passed to the next layer.
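Putting the gates together, one LSTM time step can be sketched in NumPy as follows (the stacked weight matrix, sizes, and initialization are illustrative, not the lecture's exact setup):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step following the gate equations above.
    W maps concat(h_prev, x) to the four stacked gate pre-activations."""
    z = W @ np.concatenate([h_prev, x]) + b
    n = h_prev.shape[0]
    f = sigmoid(z[0*n:1*n])        # forget gate: how much old memory to keep
    i = sigmoid(z[1*n:2*n])        # input gate: how much new data to write
    g = np.tanh(z[2*n:3*n])        # candidate value (tanh allows negative values)
    o = sigmoid(z[3*n:4*n])        # output gate: how much memory to reveal
    c = f * c_prev + i * g         # new internal memory
    h = o * np.tanh(c)             # final tanh, then gated output to the next layer
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = rng.normal(scale=0.1, size=(4 * n_hid, n_hid + n_in))
b = np.zeros(4 * n_hid)
h, c = np.zeros(n_hid), np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):
    h, c = lstm_step(x, h, c, W, b)
```

Note that the memory `c` is carried forward separately from the output `h`, which is what lets data survive many time steps.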
Gated Recurrent Units
Gated Recurrent Units (GRUs) are very similar to LSTMs but use two gates instead of three.
The "update gate" determines how much of the previous memory to keep.
The "reset gate" determines how to combine the new input with the previous memory.
The entire internal memory is output without an additional activation.
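For comparison with the LSTM step, here is a matching GRU step sketch (weight shapes and the omitted biases are illustrative simplifications):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, W_z, W_r, W_h):
    """One GRU time step: two gates, and the state itself is the output."""
    hx = np.concatenate([h_prev, x])
    z = sigmoid(W_z @ hx)                    # update gate: keep old memory vs. update
    r = sigmoid(W_r @ hx)                    # reset gate: how to mix in the past
    h_cand = np.tanh(W_h @ np.concatenate([r * h_prev, x]))
    return (1 - z) * h_prev + z * h_cand     # no extra output activation

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W_z, W_r, W_h = (rng.normal(scale=0.1, size=(n_hid, n_hid + n_in)) for _ in range(3))
h = np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):
    h = gru_step(x, h, W_z, W_r, W_h)
```

Unlike the LSTM, there is no separate memory vector: the hidden state plays both roles, which is why the GRU needs fewer parameters.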
LSTMs vs GRUs
Greff et al. (2015) compared LSTMs and GRUs and found they perform about the same.
Jozefowicz et al. (2015) generated more than ten thousand RNN variants and determined that, depending on the task, some may perform better than LSTMs.
GRUs train faster than LSTMs because they are less complex.
Generally speaking, tuning hyperparameters (e.g. number of units, size of weights) will probably affect performance more than picking between GRU and LSTM.
RNNs for Natural Language Processing
The natural input for a neural network is a vector of numeric values (e.g. pixel intensities for imaging or audio frequencies for speech recognition).
How do you feed language as input into a neural network?
The most basic solution is one-hot encoding:
◦ A long vector (equal to the length of your vocabulary) where each index represents one word in the vocabulary
◦ For each word, the index corresponding to that word is set to 1, and everything else is set to 0
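The encoding takes two lines of NumPy (the three-word vocabulary is a toy example):

```python
import numpy as np

vocab = ["the", "cat", "sat"]               # toy vocabulary
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Vector as long as the vocabulary: 1 at the word's index, 0 elsewhere."""
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

v = one_hot("cat")
```

A real vocabulary has tens of thousands of entries, so these vectors are long and almost entirely zero.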
One-Hot Encoding LSTM Example
Trained an LSTM to predict the next character given a sequence of characters.
Training corpus: all books in the Hitchhiker's Guide to the Galaxy series.
One-hot encoding used to convert each character into a vector.
72 possible characters: lowercase letters, uppercase letters, numbers, and punctuation.
The input vector is fed into a layer of 256 LSTM nodes.
The LSTM output is fed into a softmax layer that predicts the following character.
The character with the highest softmax probability is chosen as the next character.
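The final step, greedy decoding from the softmax, looks like this (a toy three-character vocabulary and made-up logits stand in for the 72-character LSTM output):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())     # subtract max for numerical stability
    return e / e.sum()

chars = np.array(list("abc"))            # toy 3-character "vocabulary"
logits = np.array([0.5, 2.1, -1.0])      # hypothetical LSTM layer output
probs = softmax(logits)
next_char = chars[np.argmax(probs)]      # pick the most probable character
```

Sampling from `probs` instead of taking the argmax gives more varied generated text; the slides' samples use the greedy choice described above.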
Generated Samples
700 iterations: aeae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae aeae ae ae ae ae
4200 iterations: the sand and the said the sand and the said the sand and the said the sand and the said the sand and the said the
36000 iterations: seared to be a little was a small beach of the ship was a small beach of the ship was a small beach of the ship
100000 iterations: the second the stars is the stars to the stars in the stars that he had been so the ship had been so the ship had been
290000 iterations: started to run a computer to the computer to take a bit of a problem off the ship and the sun and the air was the sound
500000 iterations: "I think the Galaxy will be a lot of things that the second man who could not be continually and the sound of the stars
One-Hot Encoding Shortcomings
One-hot encoding is lacking because it fails to capture semantic similarity between words, i.e., the inherent meaning of words.
For example, the words "happy", "joyful", and "pleased" all have similar meanings, but under one-hot encoding they are three distinct and unrelated entities.
What if we could capture the meaning of words within a numerical context?
Word Embeddings
Word embeddings are vector representations of words that attempt to capture semantic meaning.
Each word is represented as a vector of numerical values.
Each index in the vector represents some abstract "concept":
◦ These concepts are unlabeled and learned during training
Words that are similar will have similar vectors.

| Word | Masculinity | Royalty | Youth | Intelligence |
|---|---|---|---|---|
| King | 0.95 | 0.95 | -0.1 | 0.6 |
| Queen | -0.95 | 0.95 | -0.1 | 0.6 |
| Prince | 0.8 | 0.8 | 0.7 | 0.4 |
| Woman | -0.95 | 0.01 | -0.1 | 0.2 |
| Peasant | 0.1 | -0.95 | 0.1 | -0.3 |
| Doctor | 0.12 | 0.1 | -0.2 | 0.95 |
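"Similar words have similar vectors" can be checked numerically with cosine similarity, using the toy vectors from the table above:

```python
import numpy as np

# Embeddings copied from the toy table above
emb = {
    "King":   np.array([0.95, 0.95, -0.1, 0.6]),
    "Queen":  np.array([-0.95, 0.95, -0.1, 0.6]),
    "Prince": np.array([0.8, 0.8, 0.7, 0.4]),
    "Woman":  np.array([-0.95, 0.01, -0.1, 0.2]),
}

def cosine(a, b):
    """Cosine of the angle between two vectors: near 1 = similar."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

king_prince = cosine(emb["King"], emb["Prince"])   # high: related concepts
king_woman = cosine(emb["King"], emb["Woman"])     # low: dissimilar on most axes
```

Under one-hot encoding every pair of distinct words has cosine similarity 0, which is exactly the shortcoming embeddings fix.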
Word2Vec
Words that appear in the same context are more likely to have the same meaning:
◦ I am excited to see you today!
◦ I am ecstatic to see you today!
Word2Vec is an algorithm that uses a funnel-shaped single-hidden-layer neural network (similar to an autoencoder) to create word embeddings.
Given a word (in one-hot encoded format), it tries to predict the neighbors of that word (also in one-hot encoded format), or vice versa.
Words that appear in the same context will have similar embeddings.
Word2Vec
The model is trained on a large corpus of text using regular backpropagation.
For each word in the corpus, predict the 5 words to the left and right (or vice versa).
Once the model is trained, the embedding for a particular word is the row of the weight matrix associated with that word.
Many pretrained vectors (e.g. Google's) can be downloaded online.
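The (center word, context word) training pairs described above can be generated with a sliding window (a window of 2 is used here rather than 5, purely for brevity):

```python
def skipgram_pairs(tokens, window=2):
    """Build (center, context) training pairs: for each word, the targets
    are the words up to `window` positions to its left and right."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs(["i", "am", "excited", "to", "see", "you"], window=2)
```

Swapping the roles of center and context in each pair gives the "vice versa" (CBOW-style) variant.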
Word2Vec on 20 Newsgroups
Basic Deep Learning NLP Pipeline
Generate word embeddings
◦ Python gensim package
Feed word embeddings into an LSTM or GRU layer
Feed the output of the LSTM or GRU layer into a softmax classifier
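The shape of this pipeline can be sketched end to end in NumPy (all sizes and random weights are illustrative; a plain RNN stands in for the LSTM/GRU layer, and the embeddings would in practice come from a trained model such as gensim's Word2Vec):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, emb_dim, hid, n_classes = 50, 8, 16, 3

# 1. Word embedding table (one row per vocabulary word)
E = rng.normal(size=(vocab_size, emb_dim))

# 2. Recurrent layer weights (plain RNN as a stand-in for LSTM/GRU)
W_x = rng.normal(scale=0.1, size=(hid, emb_dim))
W_h = rng.normal(scale=0.1, size=(hid, hid))

# 3. Softmax classifier over the final hidden state
W_out = rng.normal(scale=0.1, size=(n_classes, hid))

def classify(token_ids):
    h = np.zeros(hid)
    for t in token_ids:                     # look up embedding, step the RNN
        h = np.tanh(W_x @ E[t] + W_h @ h)
    z = W_out @ h                           # classify the last hidden state
    e = np.exp(z - z.max())
    return e / e.sum()

probs = classify([4, 17, 23, 9])            # hypothetical token id sequence
```

Training would backpropagate through all three stages; only the forward pass is shown here.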
NLP Applications for RNNs
Language Models
◦ Given a series of words, predict the next word
◦ Understand the inherent patterns in a given language
◦ Useful for autocompletion and machine translation
Sentiment Analysis
◦ Given a sentence or document, classify it as positive or negative
◦ Useful for analyzing the success of a product launch or automated stock trading based off news
Other forms of text classification
◦ Cancer pathology report classification
Advanced Applications
Question Answering
◦ Read a document and then answer questions
◦ Many models use RNNs as their foundation
Automated Image Captioning
◦ Given an image, automatically generate a caption
◦ Many models use both CNNs and RNNs
Machine Translation
◦ Automatically translate text from one language to another
◦ Many models (including Google Translate) use RNNs as their foundation
LSTM Improvements: Bi-directional LSTMs
Sometimes, important context for a word comes after the word (especially important for translation):
◦ I saw a crane flying across the sky
◦ I saw a crane lifting a large boulder
Solution: use two LSTM layers, one that reads the input forward and one that reads the input backwards, and concatenate their outputs.
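The forward/backward/concatenate scheme looks like this (a plain RNN stands in for the LSTM, and all sizes are illustrative):

```python
import numpy as np

def run_rnn(xs, W_x, W_h):
    """Plain RNN (stand-in for an LSTM) returning the output at every step."""
    h, outs = np.zeros(W_h.shape[0]), []
    for x in xs:
        h = np.tanh(W_x @ x + W_h @ h)
        outs.append(h)
    return np.array(outs)

def bidirectional(xs, fwd_weights, bwd_weights):
    """Read the sequence forward and backward with two separate RNNs,
    then concatenate the two outputs at each time step."""
    f = run_rnn(xs, *fwd_weights)
    b = run_rnn(xs[::-1], *bwd_weights)[::-1]   # re-reverse to align time steps
    return np.concatenate([f, b], axis=1)

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
xs = rng.normal(size=(6, n_in))
fwd = (rng.normal(scale=0.1, size=(n_hid, n_in)), rng.normal(scale=0.1, size=(n_hid, n_hid)))
bwd = (rng.normal(scale=0.1, size=(n_hid, n_in)), rng.normal(scale=0.1, size=(n_hid, n_hid)))
out = bidirectional(xs, fwd, bwd)   # each step now sees left AND right context
```

At every position, the output vector combines a summary of everything before the word (forward pass) with everything after it (backward pass).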
LSTM Improvements: Attention Mechanisms
Sometimes only a few words in a sentence or document are important and the rest do not contribute as much meaning.
◦ For example, when classifying cancer location from cancer pathology reports, we may only care about certain keywords like "right upper lung" or "ovarian"
In a traditional RNN, we usually take the output at the last time step.
By the last time step, information from the important words may have been diluted, even with LSTM and GRU units.
How can we capture the information at the most important words?
LSTM Improvements: Attention Mechanisms
Naïve solution: to prevent information loss, instead of using the LSTM output at the last time step, take the LSTM output at every time step and use the average.
Better solution: find the important time steps, and weight the output at those time steps much higher when doing the average.
LSTM Improvements: Attention Mechanisms
An attention mechanism calculates how important the LSTM output at each time step is.
It's a simple feedforward network with a single (tanh) hidden layer and a softmax output.
At each time step, feed the output from the LSTM/GRU into the attention mechanism.
LSTM Improvements: Attention Mechanisms
Once the attention mechanism has all the time steps, it calculates a softmax over all the time steps.
◦ The softmax always adds to 1
The softmax tells us how to weight the output at each time step, i.e., how important each time step is.
Multiply the output at each time step by its corresponding softmax weight and add to create a weighted average.
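These steps — score each time step with a small tanh layer, softmax the scores, take the weighted average — can be sketched as (all dimensions and weights are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention(H, W, b, v):
    """H: (T, d) matrix of RNN outputs, one row per time step.
    A single tanh hidden layer scores each time step, softmax turns
    the scores into weights, and the result is a weighted average."""
    scores = np.tanh(H @ W + b) @ v    # (T,) importance score per time step
    alpha = softmax(scores)            # weights over time steps, sum to 1
    return alpha @ H, alpha            # weighted average of the outputs

rng = np.random.default_rng(0)
T, d, a = 6, 8, 4                       # 6 time steps, 8-dim outputs, 4 hidden units
H = rng.normal(size=(T, d))
context, alpha = attention(H, rng.normal(size=(d, a)), np.zeros(a), rng.normal(size=a))
```

The vector `alpha` is exactly what the attention visualizations later in the lecture display: one weight per time step.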
LSTM Improvements: Attention Mechanisms
Attention mechanisms can take into account "context" to determine what's important.
Remember that the dot product is a measure of similarity: two vectors that are similar will have a larger dot product.
In a normal softmax layer, the input is dot-producted with randomly initialized weights before applying the softmax function.
LSTM Improvements: Attention Mechanisms
Instead, we can dot product with a vector that represents "context" to find the words most similar/relevant to that context:
◦ For question answering, it can represent the question being asked
◦ For machine translation, it can represent the previous word
◦ For classification, it can be initialized randomly and learned during training
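A tiny numeric illustration of scoring relevance by dot product with a context vector (all vectors below are made up):

```python
import numpy as np

question = np.array([0.9, 0.1, 0.0])           # hypothetical context vector
words = {
    "lung": np.array([0.8, 0.2, 0.1]),          # points in a similar direction
    "the":  np.array([0.0, 0.1, 0.9]),          # nearly orthogonal to the context
}

# Larger dot product = more similar/relevant to the context
scores = {w: float(v @ question) for w, v in words.items()}
```

Feeding these scores through a softmax, as in the previous slide, would concentrate nearly all the attention weight on "lung".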
LSTM Improvements: Attention Mechanisms
With attention, you can visualize how important each time step is for a particular task.
CNNs for Text Classification
Start with word embeddings
◦ If you have 10 words and your embedding size is 300, you'll have a 10x300 matrix
3 parallel convolution layers
◦ Take in word embeddings
◦ Sliding window that processes 3, 4, and 5 words at a time (1D conv)
◦ Filter sizes are 3x300x100, 4x300x100, and 5x300x100 (width, in-channels, out-channels)
◦ Each conv layer outputs a 10x100 matrix
Maxpool and concatenate
◦ For each filter channel, max pool across the entire width of the sentence
◦ This is like picking the 'most important' word in the sentence for each channel
◦ Also ensures every sentence, no matter how long, is represented by a same-length vector
◦ For each of the three 10x100 matrices, returns a 1x100 matrix
◦ Concatenate the three 1x100 matrices into a 1x300 matrix
Dense and softmax layers
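The convolution-then-maxpool stage can be sketched as follows (random filters, no padding — so each layer yields T-width+1 positions rather than the slide's padded 10 — and no bias or nonlinearity, for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
T, emb, ch = 10, 300, 100                  # 10 words, 300-dim embeddings, 100 filters
X = rng.normal(size=(T, emb))              # the 10x300 sentence matrix

def conv1d_maxpool(X, width, n_filters):
    """Slide a window of `width` words over the sentence, apply n_filters
    filters, then max-pool each filter over all window positions."""
    W = rng.normal(scale=0.01, size=(n_filters, width * emb))
    windows = np.array([X[i:i + width].ravel() for i in range(len(X) - width + 1)])
    feats = windows @ W.T                   # (positions, n_filters)
    return feats.max(axis=0)                # the 'most important' window per filter

# Three parallel widths, concatenated into one fixed-length vector
pooled = np.concatenate([conv1d_maxpool(X, w, ch) for w in (3, 4, 5)])
```

Whatever the sentence length, `pooled` always has length 300, which is what lets a fixed-size dense + softmax head sit on top.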
Hierarchical Attention Networks
Problem Overview
The National Cancer Institute has asked Oak Ridge National Lab to develop a program that can automatically classify cancer pathology reports.
Pathology reports are what doctors write up when they diagnose cancer, and NCI uses them to calculate national statistics and track health trends.
Challenges:
◦ Different doctors use different terminology to label the same types of cancer
◦ Some diagnoses may reference other types of cancer or other organs that are not the actual cancer being diagnosed
◦ Typos
Task: given a pathology report, teach a program to find the type of cancer, location of cancer, histological grade, etc.
Approach
The performance of various classifiers was tested:
◦ Traditional machine learning classifiers: Naive Bayes, Logistic Regression, Support Vector Machines, Random Forests, and XGBoost
◦ Traditional machine learning classifiers require manually defined features, such as n-grams and tf-idf
◦ Deep learning methods: recurrent neural networks, convolutional neural networks, and hierarchical attention networks
◦ Given enough data, deep learning methods can learn their own features, such as which words or phrases are important
The Hierarchical Attention Network is a relatively new deep learning model that came out last year and is one of the top performers.
HAN Architecture
The Hierarchical Attention Network (HAN) is a deep learning model for document classification.
It is built from bidirectional RNNs composed of GRUs/LSTMs with attention mechanisms.
It is composed of "hierarchies" where the outputs of the lower hierarchies become the inputs to the upper hierarchies.
HAN Architecture
Before feeding a document into the HAN, we first break it down into sentences (or in our case, lines).
The word hierarchy is responsible for creating sentence embeddings:
◦ This hierarchy reads in one full sentence at a time, in the form of word embeddings
◦ The attention mechanism selects the most important words
◦ The output is a sentence embedding that captures the semantic content of the sentence based on the most important words
HAN Architecture
The sentence hierarchy is responsible for creating the final document embedding:
◦ Identical structure to the word hierarchy
◦ Reads in the sentence embeddings output from the word hierarchy
◦ The attention mechanism selects the most important sentences
◦ The output is a document embedding representing the meaning of the entire document
The final document embedding is used for classification.
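The two-level structure can be sketched compactly — attention over words produces sentence embeddings, attention over those produces the document embedding. (A real HAN runs a bidirectional GRU at each level before attending; that step is omitted here, and all sizes and vectors are illustrative.)

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(H, v):
    """Attention-weighted average of the rows of H against context vector v."""
    alpha = softmax(np.tanh(H) @ v)
    return alpha @ H

d = 8
word_ctx, sent_ctx = rng.normal(size=d), rng.normal(size=d)   # learned in practice

# A 'document' of 3 sentences, each a matrix of word embeddings
doc = [rng.normal(size=(n_words, d)) for n_words in (5, 7, 4)]

sent_embs = np.array([attend(S, word_ctx) for S in doc])  # word hierarchy
doc_emb = attend(sent_embs, sent_ctx)                     # sentence hierarchy
# doc_emb now feeds a softmax classifier
```

Because each level has its own attention weights, the model can point at both the most important words within a sentence and the most important sentences within the document.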
Experimental Setup
945 cancer pathology reports, all cases of breast and lung cancer.
10-fold cross-validation used, 30 epochs per fold.
Hyperparameter optimization applied to models to find optimal parameters.
Two main tasks: primary site classification and histological grade classification.
◦ Uneven class distribution, some classes with only ~10 occurrences in the dataset
◦ F-score used as the performance metric
◦ Micro F-score is the weighted F-score average based on class size
◦ Macro F-score is the unweighted F-score average across all classes
HAN Performance: Primary Site
12 possible cancer subsite locations:
◦ 5 lung subsites
◦ 7 breast subsites
Deep learning methods outperformed all traditional ML methods except for XGBoost.
HAN had the best performance; pretraining improved performance even further.

Traditional Machine Learning Classifiers

| Classifier | Primary Site Micro F-Score | Primary Site Macro F-Score |
|---|---|---|
| Naive Bayes | .554 (.521, .586) | .161 (.152, .170) |
| Logistic Regression | .621 (.589, .652) | .222 (.207, .237) |
| Support Vector Machine (C=1, gamma=1) | .616 (.585, .646) | .220 (.205, .234) |
| Random Forest (num trees=100) | .628 (.597, .661) | .258 (.236, .283) |
| XGBoost (max depth=5, n estimators=300) | .709 (.681, .738) | .441 (.404, .474) |

Deep Learning Classifiers

| Classifier | Primary Site Micro F-Score | Primary Site Macro F-Score |
|---|---|---|
| Recurrent Neural Network (with attention mechanism) | .694 (.666, .722) | .468 (.432, .502) |
| Convolutional Neural Network | .712 (.680, .736) | .398 (.359, .434) |
| Hierarchical Attention Network (no pretraining) | .784 (.759, .810) | .566 (.525, .607) |
| Hierarchical Attention Network (with pretraining) | .800 (.776, .825) | .594 (.553, .636) |
HAN Performance: Histological Grade
4 possible histological grades:
◦ 1-4, indicating how abnormal tumor cells and tumor tissues look under a microscope, with 4 being most abnormal
◦ Indicates how quickly a tumor is likely to grow and spread
Other than RNNs, deep learning models generally outperform traditional ML models.
HAN had the best performance, but pretraining did not help performance.

Traditional Machine Learning Classifiers

| Classifier | Histological Grade Micro F-Score | Histological Grade Macro F-Score |
|---|---|---|
| Naive Bayes | .481 (.442, .519) | .264 (.244, .283) |
| Logistic Regression | .540 (.499, .576) | .340 (.309, .371) |
| Support Vector Machine (C=1, gamma=1) | .520 (.482, .558) | .330 (.301, .357) |
| Random Forest (num trees=100) | .597 (.558, .636) | .412 (.364, .476) |
| XGBoost (max depth=5, n estimators=300) | .673 (.634, .709) | .593 (.516, .662) |

Deep Learning Classifiers

| Classifier | Histological Grade Micro F-Score | Histological Grade Macro F-Score |
|---|---|---|
| Recurrent Neural Network (with attention mechanism) | .580 (.541, .617) | .474 (.416, .536) |
| Convolutional Neural Network | .716 (.681, .750) | .521 (.493, .548) |
| Hierarchical Attention Network (no pretraining) | .916 (.895, .936) | .841 (.778, .895) |
| Hierarchical Attention Network (with pretraining) | .904 (.881, .927) | .822 (.744, .883) |
TFIDF Document Embeddings
TFIDF-weighted Word2Vec embeddings reduced to 2 dimensions via PCA for (A) primary site train reports, (B) histological grade train reports, (C) primary site test reports, and (D) histological grade test reports.
![Page 50: RNN Review & Hierarchical Attention Networksweb.eecs.utk.edu/~hqi/deeplearning/lecture17-rnn-han.pdfWord Embeddings NLP Applications Attention Mechanisms Hierarchical Attention Networks](https://reader034.vdocuments.site/reader034/viewer/2022042919/5f6081b3600c73148e62a1ad/html5/thumbnails/50.jpg)
HAN Document Embeddings

HAN document embeddings reduced to 2 dimensions via PCA for (A.) primary site train reports, (B.) histological grade train reports, (C.) primary site test reports, and (D.) histological grade test reports.
Pretraining

We have access to more unlabeled data than labeled data (approximately 1500 unlabeled, 1000 labeled).

To utilize the unlabeled data, we trained our HAN to produce document embeddings that matched the corresponding TF-IDF-weighted word embeddings for each document.

HAN training and validation accuracy with and without pretraining for (A.) the primary site task and (B.) the histological grade task
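This pretraining objective can be sketched as a mean-squared-error regression of encoder outputs onto the TF-IDF document embeddings. In the sketch below a single linear map stands in for the full HAN encoder, and all shapes, names, and the learning rate are illustrative assumptions, not values from the source.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; a linear map stands in for the full HAN encoder.
vocab_size, emb_dim, n_docs = 50, 8, 32
W = rng.normal(scale=0.1, size=(vocab_size, emb_dim))   # encoder weights
X = rng.random((n_docs, vocab_size))                    # unlabeled documents
targets = rng.normal(size=(n_docs, emb_dim))            # TF-IDF doc embeddings

def pretrain_step(W, X, targets, lr=0.01):
    """One gradient step pushing encoder outputs toward the TF-IDF targets."""
    err = X @ W - targets            # prediction error
    loss = (err ** 2).mean()         # mean squared error
    grad = 2 * X.T @ err / err.size  # gradient of the MSE w.r.t. W
    return W - lr * grad, loss
```

After pretraining on the unlabeled reports, the encoder weights are fine-tuned on the labeled classification tasks.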
HAN Document Annotations
Most Important Words per Task

We can also use the HAN's attention weights to find the words that contribute most toward the classification task at hand:

Primary Site: mainstem, adeno, ca, lul, lower, breast, carina, cusa, upper, middle, rul, buttock, temporal, upper, retro, sputum

Histological Grade: poorly, g2, high, iii, dlr, undifferentiated, g3, ii, g1, moderately, intermediate, well, arising, 2
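One way such a ranking can be computed is to average each word's attention weight across the corpus, assuming per-token weights (e.g. the word-level softmax outputs of a HAN) are available for each document; the function below is an illustrative sketch, not the authors' code.

```python
import numpy as np

def top_words_by_attention(docs, attn, k=3):
    """Rank words by their average attention weight across documents.

    docs: list of token lists; attn: matching list of per-token weights.
    """
    totals, counts = {}, {}
    for tokens, weights in zip(docs, attn):
        for w, a in zip(tokens, weights):
            totals[w] = totals.get(w, 0.0) + a   # accumulate attention mass
            counts[w] = counts.get(w, 0) + 1     # count occurrences
    means = {w: totals[w] / counts[w] for w in totals}
    return sorted(means, key=means.get, reverse=True)[:k]
```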
Scaling

Relative to other models, the HAN is very slow to train
◦ On CPU, the HAN takes approximately 4 hours to go through 30 epochs
◦ In comparison, the CNN takes around 40 minutes to go through 30 epochs, and the traditional machine learning classifiers take less than a minute
◦ The HAN is slow due to its complex architecture and use of RNNs, which make gradients very expensive to compute

We are currently working to scale the HAN to run on multiple GPUs
◦ On TensorFlow, RNNs on GPU run slower than on CPU
◦ We are considering exploring a PyTorch implementation to get around this problem

We have successfully developed a distributed CPU-only HAN that runs on TITAN using MPI, with a 4x speedup on 8 nodes
Attention Is All You Need

A new paper from Google Brain, published June 2017, showed that competitive results in machine translation could be achieved with only attention mechanisms and no RNNs.

We applied the same architecture to replace the RNNs in our HAN.

Because attention mechanisms are just matrix multiplications, the new model runs about 10x faster than the HAN with RNNs.

The new model performs almost as well as the HAN with RNNs – 0.77 micro-F on primary site (compared to 0.78 for the original HAN) and 0.86 micro-F on histological grade (compared to 0.91 for the original HAN).

Because no RNNs are used, this model is much easier to scale on the GPU.
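The core operation behind this, scaled dot-product attention, can be sketched in a few lines of NumPy. This is the standard formulation from the paper (softmax(QKᵀ/√d_k)V), not the authors' exact implementation; note that it is nothing but matrix multiplications and a softmax, with no sequential recurrence to unroll.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # similarity of queries to keys
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over keys
    return weights @ V                                # weighted sum of values
```

Because every position attends to every other position in one batched matmul, the whole sequence is processed in parallel on a GPU, unlike an RNN.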
Other Future Work

Multitask Learning
◦ Predict histological grade, primary site, and other tasks simultaneously within the same model
◦ Hopefully boost the performance of all tasks by sharing information across tasks

Semi-Supervised Learning
◦ Utilize unlabeled data during training rather than in pretraining, with the goal of improving classification performance
◦ This task is challenging because in most semi-supervised settings all labels within the dataset are known; in our case, we only have a subset of the labels.
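The multitask idea above is commonly realized as one shared encoder feeding a separate softmax head per task; the NumPy sketch below illustrates that shape, with all layer sizes and weight names being hypothetical, not taken from the source.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical sizes: one shared document encoder, two task-specific heads.
doc_dim, hidden, n_sites, n_grades = 16, 8, 5, 4
W_shared = rng.normal(size=(doc_dim, hidden))   # shared encoder weights
W_site = rng.normal(size=(hidden, n_sites))     # primary-site head
W_grade = rng.normal(size=(hidden, n_grades))   # histological-grade head

def multitask_forward(x):
    """Shared representation; separate softmax outputs per task."""
    h = np.tanh(x @ W_shared)   # shared features drive both task losses
    return softmax(h @ W_site), softmax(h @ W_grade)
```

Training sums the per-task losses, so gradients from both tasks update the shared encoder, which is how information is shared across tasks.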
Questions?