
On Learning Form and Meaning in Neural Machine Translation Models

Yonatan Belinkov, May 2017

With: Nadir Durrani, Hassan Sajjad, Fahim Dalvi, Lluís Màrquez, James Glass

Motivation

• Neural machine translation (NMT) obtains state-of-the-art results
• Elegant and simple end-to-end architecture
• However, NMT models are difficult to interpret; what do they learn about the source and target languages?
• Recent interest in the community (e.g. Shi+16 on syntax)
• This work: analyzing morphology (and semantics) in NMT

Translation as Decoding

• Warren Weaver to Norbert Wiener, March 4, 1947:

Also knowing nothing official about, but having guessed and inferred considerable about, powerful new mechanized methods in cryptography - methods which I believe succeed even when one does not know what language has been coded - one naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say "This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode."

Brief History of Machine Translation

• 1947: Initial ideas of MT (Weaver)
• 1950s: First MT systems
• 1960s: High-quality MT fails, cut in government funding
• 1970s-1980s: Rule-based systems, interlingua ideas
• 1990s: Statistical MT, IBM alignment models
• 2000s: Phrase-based MT, open-source toolkits
• 2014-2015: Neural MT: seq2seq + attention

Statistical Machine Translation

• Translate a source sentence F into a target sentence E
  – Translation model
  – Language model

Maria no dió una bofetada a la bruja verde
Mary did not slap the green witch

From: Jurafsky & Martin 2009
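
The translation-model / language-model split above corresponds to the standard noisy-channel objective; the slide's own equation is not preserved in this transcript, so the following is a minimal reconstruction of that standard formulation (notation may differ from the original slide):

```latex
% Standard noisy-channel SMT objective (assumed; not the slide's exact notation).
\hat{E} = \operatorname*{arg\,max}_{E} P(E \mid F)
        = \operatorname*{arg\,max}_{E} \;
          \underbrace{P(F \mid E)}_{\text{translation model}} \,
          \underbrace{P(E)}_{\text{language model}}
```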

Neural Machine Translation

[Figure: input text → Encoder → Decoder → translated text]

Neural Machine Translation

• Encoder:
• Decoder:
• Loss:

[Equation annotations on the slide: source hidden state, target hidden state, summary vector]
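
The encoder, decoder, and loss equations themselves are not reproduced in this transcript; the following is a minimal sketch of the standard recurrent encoder-decoder formulation they presumably correspond to, with the slide's annotations marked:

```latex
% Assumed standard recurrent encoder-decoder; not necessarily the slide's exact notation.
h_t = \mathrm{enc}(x_t, h_{t-1})                          % source hidden state
s_k = \mathrm{dec}(y_{k-1}, s_{k-1}, c), \quad c = h_T    % target hidden state; summary vector c
P(y_k \mid y_{<k}, x) = \mathrm{softmax}(W s_k)
\mathcal{L} = -\sum_{k} \log P(y_k \mid y_{<k}, x)        % cross-entropy loss
```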

Encoder-Decoder

Maria no dió una bofetada a la bruja verde
Mary did not slap the green witch <STOP>

The Problem with the Encoder-Decoder

• Raymond Mooney, June 26, 2016:

"You can't cram the meaning of a whole %&!$# sentence into a single $&!#* vector!"

Attention Mechanism

Maria no dió una bofetada a la bruja verde
Mary did not slap the green witch <STOP>
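
The attention computation is only shown graphically on these slides; a minimal sketch of the standard additive (Bahdanau-style) attention that such figures typically depict, assuming the encoder/decoder states defined earlier:

```latex
% Assumed Bahdanau-style additive attention; the talk's exact variant is not
% specified in the transcript.
e_{k,t} = v^{\top} \tanh(W_s s_{k-1} + W_h h_t)                   % alignment score
\alpha_{k,t} = \frac{\exp(e_{k,t})}{\sum_{t'} \exp(e_{k,t'})}     % attention weight
c_k = \sum_{t} \alpha_{k,t} \, h_t                                % per-step context vector
```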

Attention as soft alignment

Maria no dió una bofetada a la bruja verde
Mary did not slap the green witch

[Side-by-side figures: word alignments in phrase-based MT vs. attention weights in neural MT]

Research Questions

• Which parts of the NMT architecture capture word structure? Which capture meaning?
• What is the division of labor between different components?
• How do different word representations help learn better morphology?
• How does the target language affect the learning of word structure?

Methodology

• Three-step procedure (sketched in code below):
  1. Train a neural MT system
  2. Extract feature representations using the trained model
  3. Train a classifier using the extracted features and evaluate it on an extrinsic task

• Assumption: the performance of the classifier reflects the quality of the NMT representations for the given task
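
A minimal sketch of this three-step probing procedure. It is illustrative only: `nmt_encoder`, `train_sents`, and `test_sents` are hypothetical stand-ins, and the classifier type is not specified in the talk, so a logistic-regression probe is used here for brevity.

```python
# Illustrative probing sketch; `nmt_encoder` is assumed to be a trained NMT
# encoder that returns one hidden-state vector per input token.
import numpy as np
from sklearn.linear_model import LogisticRegression

def extract_features(tagged_sentences, nmt_encoder):
    """Step 2: run the frozen NMT encoder and collect per-token vectors."""
    feats, labels = [], []
    for tokens, tags in tagged_sentences:      # tags: gold POS / morphological tags
        states = nmt_encoder(tokens)           # shape (len(tokens), hidden_dim)
        feats.extend(states)
        labels.extend(tags)
    return np.array(feats), np.array(labels)

# Step 3 (example usage, assuming tagged data and a trained encoder exist):
# X_train, y_train = extract_features(train_sents, nmt_encoder)
# X_test, y_test = extract_features(test_sents, nmt_encoder)
# probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# print("probing accuracy:", probe.score(X_test, y_test))
```

The probe's held-out accuracy is then read as a proxy for how much of the linguistic property the frozen representations encode.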


Part A: Morphology

Experimental Setup

• Tasks
  • Part-of-speech tagging
  • Morphological tagging
• Languages
  • Arabic-, German-, French-, and Czech-English
  • Arabic-Hebrew (rich and similar)
  • Arabic-German (rich but different)
• MT data: TED talks
• Annotated data
  • Gold tags
  • Predicted tags

Encoder

Effect of Word Representation

[Figure: the word "running" represented by a word embedding vs. a character CNN]
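
A minimal PyTorch-style sketch of the two input representations being compared; the class names and dimensions are illustrative, not the authors' implementation.

```python
# Illustrative word-embedding vs. character-CNN input representations.
import torch
import torch.nn as nn

class WordEmbedding(nn.Module):
    """One vector per word type, looked up from a table."""
    def __init__(self, vocab_size, dim=500):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)

    def forward(self, word_ids):                      # (batch, seq_len)
        return self.emb(word_ids)                     # (batch, seq_len, dim)

class CharCNNWord(nn.Module):
    """Word vector built from its characters with a 1-D convolution, so
    'running' shares parameters with 'run', 'runs', etc."""
    def __init__(self, char_vocab, char_dim=50, dim=500, width=3):
        super().__init__()
        self.char_emb = nn.Embedding(char_vocab, char_dim)
        self.conv = nn.Conv1d(char_dim, dim, kernel_size=width, padding=width // 2)

    def forward(self, char_ids):                      # (batch, word_len)
        x = self.char_emb(char_ids).transpose(1, 2)   # (batch, char_dim, word_len)
        x = torch.relu(self.conv(x))                  # (batch, dim, word_len)
        return x.max(dim=2).values                    # max-pool over characters -> (batch, dim)
```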

Effect of Word Representation

         POS Accuracy        BLEU
         Word     Char       Word    Char
Ar-En    89.62    95.35      24.7    28.4
Ar-He    88.33    94.66       9.9    10.7
De-En    93.54    94.63      29.6    30.4
Fr-En    94.61    95.55      37.8    38.8
Cz-En    75.71    79.10      23.2    25.4

• Character-based models generate better representations for POS tagging
• Especially with richer morphological systems
• Character-based models improve translation quality

Impact of Word Frequency

Impact of Tag Frequency

Comparing Specific Tags

[Figures: confusions for specific tags (e.g. NN, NNP, Det) under word-based vs. char-based representations]

Effect of Encoder Depth

• NMT models can be very deep
  • Google Translate: 8 encoder/decoder layers
  • Zhou+ 2016: 16 layers
• What kind of information is learned at each layer?
• We analyzed a 2-layer encoder
• Extract representations from different layers for training the classifier (see the sketch below)

• Layer 1 > Layer 2 > Layer 0
• But deeper models translate better
• Is layer 2 learning more about semantics? More on that later…
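
A minimal sketch of pulling per-layer representations out of a 2-layer recurrent encoder for the probing classifier; this is illustrative PyTorch with arbitrary dimensions, not the talk's actual model code.

```python
# Illustrative per-layer feature extraction from a 2-layer LSTM encoder.
# Layer 0 = word embeddings, layers 1 and 2 = LSTM outputs.
import torch.nn as nn

emb = nn.Embedding(10000, 500)                    # hypothetical vocab size / dimensions
lstm1 = nn.LSTM(500, 500, batch_first=True)       # kept as separate modules so that
lstm2 = nn.LSTM(500, 500, batch_first=True)       # every layer's states are exposed

def layer_states(word_ids):                       # word_ids: (batch, seq_len) LongTensor
    layer0 = emb(word_ids)                        # (batch, seq_len, 500)
    layer1, _ = lstm1(layer0)
    layer2, _ = lstm2(layer1)
    return {0: layer0, 1: layer1, 2: layer2}      # feed any one of these to the probe
```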

Effect of Target Language

• How does the target language affect the learned source language representations?
• Experiment:
  • Fix the source side and train NMT models on different target languages
  • Compare the learned representations on POS/morphological tagging
• Source language: Arabic
• Target languages: English, German, Hebrew, Arabic

• Poorer morphology on the target side, better source-side representations for morphology
• Higher BLEU ≠ better representations

Decoder

Encoder vs Decoder

                     POS Accuracy
                     Encoder    Decoder
Arabic ↔ English     89.6       43.9
German ↔ English     93.5       53.6
Czech ↔ English      75.7       36.3

• The decoder learns very little about target language morphology
• Why?

Effect of Attention

Maria no dió una bofetada a la bruja verde
Mary did not slap the green witch <STOP>

Effect of Attention

                   With        Without     With most       Only most
                   attention   attention   attended word   attended word
English → German   44.55       50.26       60.34           43.43
English → Czech    36.35       42.09       48.64           36.36

• Removing attention improves decoder representations
  • Attention removes a burden from the decoder
  • The decoder does not need to learn as much about target words
• Concatenating the most attended word improves performance (feature settings sketched below)
  • Encoder representations are helpful for target morphology
• But using only the encoder side is not as good
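
A minimal sketch of the feature settings in the table above, given a decoder state, the encoder states, and the attention weights for one target word. It is illustrative only; the function and variable names are not from the authors' code, and the "with attention" vs. "without attention" columns correspond to models trained with and without attention rather than to different feature constructions.

```python
# Illustrative construction of the decoder-side probing features.
import numpy as np

def decoder_features(dec_state, enc_states, attn_weights, mode):
    """dec_state: decoder hidden state for the current target word;
    enc_states: (src_len, hidden_dim); attn_weights: (src_len,)."""
    most_attended = enc_states[np.argmax(attn_weights)]   # source state with max attention
    if mode == "decoder_only":          # "with/without attention" columns: decoder state alone
        return dec_state
    if mode == "concat_most_attended":  # decoder state + most attended encoder state
        return np.concatenate([dec_state, most_attended])
    if mode == "only_most_attended":    # most attended encoder state alone
        return most_attended
    raise ValueError(mode)
```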

Summary

• The NMT encoder learns good representations for morphology
  • Character-based representations are much better than word-based
  • The target language impacts source-side representations
  • Layer 1 > Layer 2 > Layer 0
• The decoder learns poor target-side representations
  • The attention model helps the decoder exploit source representations

Part B: Semantics

Recap

• We saw:
  • NMT representations from layer 1 are better than layer 2 (and layer 0) for POS and morphological tagging
  • Deeper networks lead to better translation performance
• Questions:
  • What is captured in higher layers?
  • How is semantic information represented?
• Let's apply a similar methodology to a semantic task

Semantic tagging

• Lexical semantics
• An abstraction over POS tagging
• Language-neutral, aimed at multi-lingual semantic parsing
• Some examples:
  • Determiners: every, no, some
  • Comma as conjunction, disjunction, apposition
  • Role nouns vs. entity nouns
  • Comparison adjectives: comparative, superlative, equative

Experimental Setup

• Semantic tagging data
  • 66 fine-grained tags, 13 coarse categories

            Train     Dev      Test
Sentences   42.5K     6.1K     12.2K
Tokens      937.1K    132.3K   265.5K

• MT data: UN corpus
  • Multi-parallel
  • 11M sentences
  • Arabic, Chinese, English, French, Spanish, Russian

Baselines

System                          Accuracy
Most frequent tag               82.0
Unsupervised embeddings         81.1
Word2Tag encoder-decoder        91.4
State-of-the-art (Bjerva+16)    95.5
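
The most-frequent-tag baseline is straightforward to reproduce; a minimal sketch (illustrative, assuming word-level tagged training sentences):

```python
# Most-frequent-tag baseline: tag each known word with its most common
# training-set tag, and unknown words with the globally most common tag.
from collections import Counter, defaultdict

def train_mft(tagged_sentences):
    counts = defaultdict(Counter)
    for words, tags in tagged_sentences:
        for w, t in zip(words, tags):
            counts[w][t] += 1
    global_tag = Counter(
        t for c in counts.values() for t in c.elements()
    ).most_common(1)[0][0]
    lexicon = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    return lexicon, global_tag

def tag_mft(words, lexicon, global_tag):
    return [lexicon.get(w, global_tag) for w in words]
```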

Effect of Network Depth

[Figure: semantic tagging accuracy by encoder layer, with the most-frequent-tag baseline marked]

• Layer 0 below baseline
• Layer 1 >> layer 0
• Layer 4 > layer 1
• Similar trends for coarse tags

Effect of Target Language

[Figure: semantic tagging accuracy by target language, with the most-frequent-tag baseline marked]

• No impact on semantic tagging
• But large impact on translation:

        BLEU
En-Ar   32.7
En-Es   49.1
En-Fr   38.5
En-Ru   34.2
En-Zh   32.1

Analyzing Specific Tags

• Layer 4 vs. layer 1
  • Blue: distinguishing among coarse tags
  • Red: distinguishing among fine-grained tags within a coarse category

• Layer 4 > layer 1
• Especially with:
  • Discourse relations (DIS)
  • Properties of nouns (ENT)
  • Events, tenses (EVE, TNS)
  • Logic relations and quantifiers (LOG)
  • Comparative constructions (COM)

• Negative examples:
  • Modality (MOD)
    • Closed class ("no", "not", "should", "must", etc.)
  • Named entities (NAM)
    • OOVs?
    • A neural MT limitation?

Semantic tags vs. POS tags

             Layer 0   1      2      3      4
Uni   POS    87.9      92.0   91.7   91.8   91.9
      Sem    81.8      87.8   87.4   87.6   88.2
Bi    POS    87.9      93.3   92.9   93.2   92.8
      Sem    81.9      91.3   90.8   91.9   91.9

• Higher layers improve semantic tagging but not POS tagging
• Layer 1 is best for POS; layer 4 is best for semantic tagging
• Similar trends with a bidirectional encoder

Summary

• Neural MT representations contain useful information about word form and meaning
  • Lower layers focus on POS/morphology
  • Higher layers focus on (lexical) semantics
  • The target language does not affect semantic tagging quality

Future Work

• Other neural MT architectures
  • Word representations; multi-lingual models
• Other linguistic properties
  • Syntactic and semantic relations, complex structures
• Improving neural MT
  • Multi-task learning
• Analyzing representations in other neural models
  • End-to-end speech recognition
