
CS839: Probabilistic Graphical Models

Lecture 22: The Attention Mechanism
Theo Rekatsinas

Why Attention?

• Consider machine translation: we need to pay attention to the word we are currently translating. Is the entire sequence needed as context?
• The cat is black -> Le chat est noir

• RNNs are the de-facto standard for machine translation.
• Problem: translation relies on reading a complete sentence and compressing all of its information into a fixed-length vector; a sentence with hundreds of words represented by a single fixed-size vector will surely lead to information loss, inadequate translation, etc.
• Long-range dependencies are tricky.

Basic encoder-decoder
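To make the fixed-length bottleneck concrete, here is a minimal NumPy sketch of a vanilla encoder-decoder (illustrative only, not the lecture's code): the encoder folds the whole source sentence into a single vector, and that vector is all the decoder ever sees. All sizes and weights below are toy placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
d_emb, d_hid, src_len = 8, 16, 5              # toy sizes

# Random source embeddings and (untrained) weights -- illustrative only.
src = rng.normal(size=(src_len, d_emb))
W_xh = rng.normal(size=(d_emb, d_hid)) * 0.1
W_hh = rng.normal(size=(d_hid, d_hid)) * 0.1
W_hy = rng.normal(size=(d_hid, d_emb)) * 0.1

# Encoder: fold the whole sentence into one fixed-length vector h.
h = np.zeros(d_hid)
for x_t in src:
    h = np.tanh(x_t @ W_xh + h @ W_hh)
context = h                                    # the single bottleneck vector

# Decoder: every output step starts from (and only from) that one vector.
s, outputs = context, []
for _ in range(4):
    s = np.tanh(s @ W_hh)                      # no further access to the source words
    outputs.append(s @ W_hy)

print(context.shape)   # (16,) -- all source information squeezed in here
```

Attention removes this bottleneck by letting the decoder look back at all encoder states at every output step.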

Soft Attention for Translation

"I love coffee" -> "Me gusta el café"

[Figure: for each output word, a distribution over the input words.]

Bahdanau et al., "Neural Machine Translation by Jointly Learning to Align and Translate", ICLR 2015


Soft Attention

From Y. Bengio, CVPR 2015 Tutorial
[Figure: a bidirectional encoder RNN, a decoder RNN, and an attention model connecting them.]

Soft Attention

Context vector (input to decoder): c_i = Σ_j a_ij h_j

Mixture weights: a_ij = exp(e_ij) / Σ_k exp(e_ik)

Alignment score (how well do input words near j match output words at position i): e_ij = f_att(s_{i-1}, h_j), where s_{i-1} is the previous decoder state and h_j the encoder state at input position j.
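A minimal NumPy sketch of these three quantities for one decoding step, assuming an additive (MLP) scoring function in the spirit of Bahdanau et al.; the weight matrices W_a, U_a, v_a and all sizes are illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
T_src, d_enc, d_dec, d_att = 6, 16, 16, 8     # toy sizes

H = rng.normal(size=(T_src, d_enc))           # encoder states h_1..h_T (bidirectional in practice)
s_prev = rng.normal(size=d_dec)               # previous decoder state s_{i-1}
W_a = rng.normal(size=(d_dec, d_att)) * 0.1   # attention parameters (random here)
U_a = rng.normal(size=(d_enc, d_att)) * 0.1
v_a = rng.normal(size=d_att) * 0.1

# Alignment scores: e_ij = v_a . tanh(W_a s_{i-1} + U_a h_j)
e = np.tanh(s_prev @ W_a + H @ U_a) @ v_a     # shape (T_src,)

# Mixture weights: softmax over the source positions
a = np.exp(e - e.max())
a /= a.sum()                                   # a_ij >= 0 and sums to 1

# Context vector: attention-weighted sum of encoder states
c = a @ H                                      # shape (d_enc,), fed to the decoder at step i

print(a.round(3), c.shape)
```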

Soft Attention

Luong, Pham and Manning's translation system (2015):
[Figure from Luong and Manning, IWSLT 2015: Translation Error Rate vs. human.]

Hard Attention

Monotonic Attention
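As a rough illustration only (a simplification I am assuming, not the exact formulation from any particular paper), one way to impose monotonicity is to let each decoding step attend only at or after the position attended most strongly at the previous step:

```python
import numpy as np

rng = np.random.default_rng(0)
T_src, T_out, d = 8, 4, 16
H = rng.normal(size=(T_src, d))               # encoder states
S = rng.normal(size=(T_out, d))               # decoder states (toy)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

prev = 0                                       # earliest position we may attend to
for s_t in S:
    scores = H @ s_t
    scores[:prev] = -np.inf                    # mask positions before the previous peak
    a = softmax(scores)                        # attention mass only at or after `prev`
    prev = int(np.argmax(a))                   # attention can only move forward
    print(prev, a.round(2))
```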

Global Attention
• Blue = encoder, red = decoder.
• Attend to a context vector.
• The decoder captures global information, not only the information from one hidden state.
• The context vector takes all cells' outputs as input and computes a probability distribution over the source positions for each token the decoder wants to generate.

Local Attention

• Compute a best-aligned position first.
• Then compute a context vector centered at that position (see the sketch below).
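A rough NumPy sketch contrasting the two, in the spirit of Luong et al.'s global vs. local attention; the dot-product scoring, the choice of aligned position p_t, and the window half-width D are illustrative assumptions rather than the paper's exact recipe.

```python
import numpy as np

rng = np.random.default_rng(1)
T_src, d = 10, 16
H = rng.normal(size=(T_src, d))      # encoder hidden states
s_t = rng.normal(size=d)             # current decoder state

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Global attention: score every source position (dot-product scoring here).
a_global = softmax(H @ s_t)                      # distribution over all T_src positions
c_global = a_global @ H

# Local attention: first pick a best-aligned position p_t, then attend
# only inside a window around it (half-width D is an assumption).
D = 2
p_t = int(np.argmax(H @ s_t))                    # crude stand-in for the predicted position
lo, hi = max(0, p_t - D), min(T_src, p_t + D + 1)
a_local = softmax(H[lo:hi] @ s_t)                # distribution over the window only
c_local = a_local @ H[lo:hi]

print(a_global.shape, a_local.shape)             # (10,) vs (<=5,)
```

Local attention keeps the cost per decoding step independent of the source length, since only the window is scored.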

RNN for Captioning

[Diagram: a CNN maps the image (H x W x 3) to features (D) that initialize hidden state h0; at each step the hidden state (h1, h2, ...) takes the previous word (y1 = first word, y2 = second word, ...) and produces a distribution over the vocabulary (d1, d2, ...).]

The RNN only looks at the whole image, once.

What if the RNN looks at different parts of the image at each time step?

Soft Attention for Captioning

[Diagram: a CNN maps the image (H x W x 3) to a grid of features (L x D). From hidden state h0 the model computes a1, a distribution over the L locations, and z1, the weighted combination of features (weighted features: D). z1 and the first word y1 feed hidden state h1, which outputs d1, a distribution over the vocabulary, plus the next attention distribution a2; the process repeats with z2, h2, y2, d2, a3, and so on.]

Xu et al., "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention", ICML 2015
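A minimal NumPy sketch of the loop above: at each step the model forms a distribution a_t over the L feature locations, takes the weighted feature vector z_t, and feeds it into the RNN to get a new hidden state and a distribution over the vocabulary. A real model would also feed the previous word's embedding; all weights and sizes here are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
L, D, d_hid, vocab = 49, 32, 64, 100        # e.g. a 7x7 grid of D-dim CNN features

feats = rng.normal(size=(L, D))             # CNN features for one image (L x D)
W_att = rng.normal(size=(d_hid, D)) * 0.1   # toy attention / RNN / output weights
W_h   = rng.normal(size=(D + d_hid, d_hid)) * 0.1
W_out = rng.normal(size=(d_hid, vocab)) * 0.1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

h = np.zeros(d_hid)                         # h0
words = []
for t in range(3):                          # generate a few words
    a_t = softmax(feats @ W_att.T @ h)      # distribution over the L locations
    z_t = a_t @ feats                       # weighted combination of features (D,)
    h = np.tanh(np.concatenate([z_t, h]) @ W_h)   # next hidden state
    d_t = softmax(h @ W_out)                # distribution over the vocabulary
    words.append(int(np.argmax(d_t)))

print(words)
```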

Soft vs Hard Attention

[Diagram: a CNN maps the image (H x W x 3) to a grid of features a, b, c, d (each D-dimensional). From the RNN comes a distribution over grid locations pa, pb, pc, pd, with pa + pb + pc + pd = 1, which is used to build a context vector z (D-dimensional).]

Soft attention: summarize ALL locations, z = pa*a + pb*b + pc*c + pd*d. The derivative dz/dp is nice! Train with gradient descent.

Hard attention: sample ONE location according to p; z = that location's feature vector. With argmax, dz/dp is zero almost everywhere... Can't use gradient descent; need reinforcement learning.

Xu et al., "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention", ICML 2015
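A tiny NumPy sketch of the distinction on the 2x2 grid above: soft attention returns the expectation of the features under p and is differentiable in p, while hard attention picks a single location, so dz/dp is zero almost everywhere.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4
grid = {k: rng.normal(size=D) for k in "abcd"}   # features a, b, c, d
p = np.array([0.1, 0.2, 0.3, 0.4])               # pa + pb + pc + pd = 1 (from the RNN)
A = np.stack(list(grid.values()))                # (4, D)

# Soft attention: weighted average over ALL locations -> smooth in p.
z_soft = p @ A                                   # z = pa*a + pb*b + pc*c + pd*d

# Hard attention: pick ONE location according to p -> dz/dp is 0 almost everywhere.
idx = rng.choice(4, p=p)                         # (or argmax at test time)
z_hard = A[idx]

print(z_soft.shape, z_hard.shape)                # both (D,), but only z_soft is differentiable in p
```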

Multi-headed Attention

Attention is all you need
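As a pointer to where this leads, here is a compact NumPy sketch of multi-headed scaled dot-product attention in the style of "Attention Is All You Need"; the head count, dimensions, and random projection matrices are illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_model, n_heads = 5, 32, 4
d_head = d_model // n_heads

X = rng.normal(size=(T, d_model))                     # input sequence
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

Q, K, V = X @ W_q, X @ W_k, X @ W_v                   # (T, d_model) each

heads = []
for h in range(n_heads):
    sl = slice(h * d_head, (h + 1) * d_head)
    q, k, v = Q[:, sl], K[:, sl], V[:, sl]            # (T, d_head) per head
    scores = q @ k.T / np.sqrt(d_head)                # scaled dot-product scores
    attn = softmax(scores, axis=-1)                   # each row: distribution over positions
    heads.append(attn @ v)                            # (T, d_head)

out = np.concatenate(heads, axis=-1) @ W_o            # concat heads, project back to d_model
print(out.shape)                                      # (5, 32)
```

Each head attends with its own projections, so different heads can focus on different positions or relations; the outputs are concatenated and projected back to the model dimension.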

Attention tricks

Attention Takeaways

Performance:
• Attention models can improve accuracy and reduce computation at the same time.

Complexity:
• There are many design choices.
• Those choices have a big effect on performance.
• Ensembling has unusually large benefits.
• Simplify where possible!

Explainability:
• Attention models encode explanations.
• Both locus and trajectory help understand what's going on.

Hard vs. Soft:
• Soft models are easier to train; hard models require reinforcement learning.
• They can be combined, as in Luong et al.
