
Page 1:

CS839: Probabilistic Graphical Models

Lecture 22: The Attention Mechanism

Theo Rekatsinas

Pages 2-3: Why Attention?

• Consider machine translation: we need to pay attention to the word we are currently translating. Is the entire sequence needed as context?
• "The cat is black" -> "Le chat est noir"
• RNNs are the de-facto standard for machine translation.
• Problem: translation relies on reading a complete sentence and compressing all of its information into a fixed-length vector. A sentence with hundreds of words squeezed into a single fixed-length vector will surely suffer information loss, inadequate translation, etc.
• Long-range dependencies are tricky.

Page 4: Basic encoder-decoder
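To make the bottleneck concrete, here is a minimal sketch of the basic encoder-decoder loop, assuming plain vanilla-RNN cells; the function and parameter names (rnn_step, Wenc, Uenc, Wdec, Udec) are illustrative, not the lecture's code.

```python
import numpy as np

def rnn_step(x, h, W, U):
    # one generic vanilla-RNN cell: new state from input x and previous state h
    return np.tanh(W @ x + U @ h)

def encode_decode(src, tgt_len, Wenc, Uenc, Wdec, Udec):
    """Encode the whole source, then decode from one fixed-length vector."""
    h = np.zeros(Uenc.shape[0])
    for x in src:                    # read the complete source sentence...
        h = rnn_step(x, h, Wenc, Uenc)
    # ...all of it compressed into the single fixed-length vector h
    s, outputs = h, []
    for _ in range(tgt_len):         # the decoder sees the source ONLY through h
        s = rnn_step(s, s, Wdec, Udec)
        outputs.append(s)
    return outputs
```

Every piece of source information the decoder will ever use must survive the squeeze into h, which is exactly the problem attention addresses.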

Page 5: Soft Attention for Translation

Pages 6-9: Soft Attention for Translation

"I love coffee" -> "Me gusta el café"

[Figure, built up over four slides: at each decoding step, a distribution over the input words]

Bahdanau et al., "Neural Machine Translation by Jointly Learning to Align and Translate", ICLR 2015

Page 10: Soft Attention

[Figure: bidirectional encoder RNN, decoder RNN, and attention model; from Y. Bengio's CVPR 2015 tutorial]

Page 11: Soft Attention

Context vector (input to the decoder): c_i = Σ_j α_ij h_j

Mixture weights: α_ij = exp(e_ij) / Σ_k exp(e_ik)

Alignment score (how well do input words near position j match the output word at position i): e_ij = a(s_{i-1}, h_j)
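A minimal NumPy sketch of one decoder step of this soft (additive) attention, following the equations above; the parameter names Wa, Ua, va and the shapes are illustrative assumptions. Later sketches on these pages reuse the softmax helper defined here.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def bahdanau_attention(s_prev, H, Wa, Ua, va):
    """s_prev: decoder state s_{i-1}, shape (n,); H: encoder states, shape (T, m)."""
    # e_ij = va^T tanh(Wa s_{i-1} + Ua h_j): score every input position j
    e = np.tanh(Wa @ s_prev + H @ Ua.T) @ va   # shape (T,)
    alpha = softmax(e)                          # mixture weights alpha_ij
    c = alpha @ H                               # context c_i = sum_j alpha_ij h_j
    return c, alpha
```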

Page 12: Soft Attention

Luong, Pham, and Manning's translation system (2015).

[Figure: translation error rate vs. human; Luong and Manning, IWSLT 2015]

Page 13: Hard Attention

Page 14: Monotonic Attention

Page 15: Global Attention

• Blue = encoder, red = decoder.
• Attend to a context vector.
• The decoder captures global information, not only the information from one hidden state.
• The context vector takes all cells' outputs as input and computes a probability distribution for each token the decoder wants to generate (see the sketch below).
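A minimal sketch of one global-attention step in this spirit, using dot-product scoring against every encoder state (the scoring choice is an assumption, and encoder and decoder states are taken to have the same width; softmax is the helper defined on page 11):

```python
def global_attention(s_t, H):
    """s_t: decoder state, shape (n,); H: ALL encoder states, shape (T, n)."""
    scores = H @ s_t            # alignment score for every source position
    alpha = softmax(scores)     # probability distribution over the whole input
    c = alpha @ H               # context vector summarizing all cells' outputs
    return c, alpha
```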

Page 16: Local Attention

• Compute a best aligned position first.
• Then compute a context vector centered at that position (see the sketch below).
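A sketch of this in the style of Luong et al.'s local-p attention: first predict an aligned position p_t, then attend within a window around it, downweighting positions far from p_t with a Gaussian. The parameters Wp and vp and the window half-width D are illustrative assumptions:

```python
def local_attention(s_t, H, Wp, vp, D=2):
    """Attend only to a window of width 2D+1 centered near position p_t."""
    T = len(H)
    # 1) best aligned position: p_t = T * sigmoid(vp^T tanh(Wp s_t))
    p_t = T / (1.0 + np.exp(-(vp @ np.tanh(Wp @ s_t))))
    lo, hi = max(0, int(p_t) - D), min(T, int(p_t) + D + 1)
    # 2) context vector centered at p_t: score the window, then apply a
    #    Gaussian (sigma = D/2) favoring positions close to p_t
    alpha = softmax(H[lo:hi] @ s_t)
    alpha = alpha * np.exp(-((np.arange(lo, hi) - p_t) ** 2) / (2 * (D / 2) ** 2))
    c = alpha @ H[lo:hi]
    return c, alpha
```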

Page 17: RNN for Captioning

[Figure: a CNN maps an H x W x 3 image to a feature vector of size D, which initializes the hidden state h0 (size H) of an RNN; at each step the RNN state h1, h2, ... emits a distribution over the vocabulary (d1, d2, ...) and a word (y1 = first word, y2 = second word, ...)]

The RNN only looks at the whole image, once.

What if the RNN looks at different parts of the image at each time step?

Pages 18-26: Soft Attention for Captioning

[Figure, built up step by step: a CNN maps an H x W x 3 image to an L x D grid of features; hidden state h0 produces a1, a distribution over the L locations; a weighted combination of the features gives z1, a D-dimensional weighted feature vector; z1 and the first word y1 feed h1, which emits d1, a distribution over the vocabulary, plus the next attention a2; the process repeats for z2, h2, y2, d2, a3, ...]

Xu et al., "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention", ICML 2015
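A minimal sketch of the attention computation inside this loop, assuming for simplicity that the hidden state has the same width D as the features and that location scores come from a plain dot product (the actual model uses a small learned network); softmax is the helper from page 11:

```python
def attend(h, F):
    """h: hidden state, shape (D,); F: grid of CNN features, shape (L, D)."""
    a = softmax(F @ h)   # a_t: distribution over the L locations
    z = a @ F            # z_t: weighted combination of features, shape (D,)
    return a, z          # z is fed, with the previous word, to the next step
```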

Pages 27-30: Soft vs. Hard Attention

[Figure: a CNN maps an H x W x 3 image to a 2 x 2 grid of features a, b, c, d (each D-dimensional); from the RNN comes a distribution pa, pb, pc, pd over grid locations, with pa + pb + pc + pd = 1, and a D-dimensional context vector z]

Soft attention: summarize ALL locations, z = pa*a + pb*b + pc*c + pd*d.
The derivative dz/dp is nice! Train with gradient descent.

Hard attention: sample ONE location according to p; z = that vector.
With argmax, dz/dp is zero almost everywhere...
Can't use gradient descent; need reinforcement learning.

Xu et al., "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention", ICML 2015
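The same contrast in code, using the softmax helper from page 11; random features stand in for the 2 x 2 grid:

```python
rng = np.random.default_rng(0)
features = rng.normal(size=(4, 8))   # a, b, c, d from the grid, D = 8
p = softmax(rng.normal(size=4))      # pa, pb, pc, pd; sums to 1

# Soft attention: differentiable summary of ALL locations
z_soft = p @ features                # z = pa*a + pb*b + pc*c + pd*d

# Hard attention: sample ONE location; dz/dp is zero almost everywhere,
# so training needs REINFORCE-style estimates, not plain gradient descent
z_hard = features[rng.choice(4, p=p)]
```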

Page 31: Multi-headed Attention

Page 32: Attention is all you need
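Pages 31-32 point to the Transformer of Vaswani et al., "Attention Is All You Need" (2017). A minimal sketch of its scaled dot-product attention and the multi-headed version; the projection names Wq, Wk, Wv, Wo are illustrative:

```python
def softmax_rows(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_head(X, Wq, Wk, Wv):
    # scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return softmax_rows(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head(X, heads, Wo):
    # run each head independently, concatenate, project back to model width
    return np.concatenate([attention_head(X, *h) for h in heads], axis=-1) @ Wo
```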

Page 33: Attention tricks

Page 34: Attention Takeaways

Performance:
• Attention models can improve accuracy and reduce computation at the same time.

Complexity:
• There are many design choices.
• Those choices have a big effect on performance.
• Ensembling has unusually large benefits.
• Simplify where possible!

Page 35: Attention Takeaways

Explainability:
• Attention models encode explanations.
• Both locus and trajectory help understand what's going on.

Hard vs. Soft:
• Soft models are easier to train; hard models require reinforcement learning.
• They can be combined, as in Luong et al.