Stephan Vogel - Machine Translation 1
Machine Translation
Factored Models
Stephan Vogel, Spring Semester 2011
Stephan Vogel - Machine Translation 2
Overview
Factored Language Models
Multi-Stream Word Alignment
Factored Translation Models
Stephan Vogel - Machine Translation 3
Motivation
Vocabulary grows dramatically for morphologically rich languages
Looking only at the surface word form does not take connections (morphological derivations) between words into account
  Example: 'book' and 'books' are treated as unrelated as 'book' and 'sky'
Dependencies between words within a sentence are not well captured
  Example: number or gender agreement
  Singular: der alte Tisch (the old table)
  Plural: die alten Tische (the old tables)
Consider a word as a bundle of factors:
  Surface word form, stem, root, prefix, suffix, POS, gender marker, case marker, number marker, …
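As a minimal illustration of this bundle-of-factors view (the class and factor names below are hypothetical, not from the lecture):

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class FactoredWord:
        # A word as a bundle of factors; more factors (stem, case, ...) could be added.
        surface: str
        lemma: str
        pos: str
        number: str

    # 'book' and 'books' now share the lemma and POS factors, so a model can
    # relate them instead of treating them as unrelated as 'book' and 'sky'.
    book  = FactoredWord(surface="book",  lemma="book", pos="NN", number="singular")
    books = FactoredWord(surface="books", lemma="book", pos="NN", number="plural")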
Stephan Vogel - Machine Translation 4
Two solutions
Morphological decomposition into a stream of morphemes:
  Compound noun splitting
  Prefix-stem-suffix splitting
  [Figure: sentence as a morpheme stream: prefix stem suffix prefix stem suffix …]
Words as bundles of (parallel) factors:
  [Figure: each word w1 w2 w3 w4 … as a stack of factors: word, lemma, POS, morphology, word class]
Stephan Vogel - Machine Translation 5
Questions
Which information is the most useful?
How to use this information? In the language model? In the translation model?
How to use it at training time? How to use it at decoding time?
Stephan Vogel - Machine Translation 6
Factored Models
Morphological preprocessing: a significant body of work
Factored language models: Kirchhoff et al.
Hierarchical lexicon: Nießen et al.
Bi-stream alignment: Zhao et al.
Factored translation models: Koehn et al.
Stephan Vogel - Machine Translation 7
Factored Language Model
Some papers:
Bilmes and Kirchhoff, 2003. Factored Language Models and Generalized Parallel Backoff.
Duh and Kirchhoff, 2004. Automatic Learning of Language Model Structure.
Kirchhoff and Yang, 2005. Improved Language Modeling for Statistical Machine Translation.
Stephan Vogel - Machine Translation 8
Factored Language Model
Representation: a word is a bundle of K factors,
$w \equiv f^{1:K} = \{f^1, f^2, \ldots, f^K\}$
LM probability:
$p(w_1, \ldots, w_I) \approx p(f_1^{1:K}, \ldots, f_I^{1:K}) = \prod_{i=1}^{I} p(f_i^{1:K} \mid f_1^{1:K}, \ldots, f_{i-1}^{1:K})$
Stephan Vogel - Machine Translation 9
Language Model Backoff
Smoothing by backing off
Backoff paths:
  in a standard LM: a single path, dropping the most distant context word first
  in a factored LM: the context can be shortened by dropping any factor, so there are many possible paths
Stephan Vogel - Machine Translation 10
Choosing Backoff Paths
Different possibilities:
  Fixed path (a sketch follows below)
  Choose the path dynamically during training
  Choose multiple paths dynamically during training and combine the results (Generalized Parallel Backoff)
Many paths -> an optimization problem
  Duh & Kirchhoff (2004) use a genetic algorithm
Bilmes and Kirchhoff (2003) report LM perplexities
Kirchhoff and Yang (2005) use an FLM to rescore n-best lists generated by an SMT system:
  the 3-gram FLM is slightly worse than a standard 4-gram LM
  the combined LM does not outperform the standard 4-gram LM
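A minimal sketch of the fixed-path variant, assuming simple absolute discounting and omitting the backoff weights for brevity (all names and the discounting scheme are illustrative, not from the cited papers):

    def flm_prob(counts, context_counts, target, context, discount=0.4):
        # P(target | context) with a fixed backoff path over factors:
        # whenever the full context is unseen, drop its last factor and retry.
        # counts[(context, target)] and context_counts[context] are n-gram counts;
        # context is a tuple of factor values, e.g. (prev_lemma, prev_pos).
        if not context:
            return 1e-7  # floor for completely unseen events (illustration only)
        c = counts.get((context, target), 0)
        if c > 0:
            return (c - discount) / context_counts[context]
        # Proper backoff would scale this recursion by a weight alpha(context).
        return flm_prob(counts, context_counts, target, context[:-1], discount)

Generalized parallel backoff would evaluate several such drop orders in parallel and combine the resulting estimates (e.g. by their mean or maximum).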
Stephan Vogel - Machine Translation 11
Hierarchical Lexicon
Morphological analysis:
  Using GERCG, a constraint grammar parser for German, for lexical analysis and morphological and syntactic disambiguation
Build equivalence classes:
  Group words which tend to translate into the same target word
  Don't distinguish what does not need to be distinguished!
  E.g. for nouns: gender is irrelevant, as are nominative, dative, and accusative; but the genitive translates differently
Sonja Nießen and Hermann Ney, Toward hierarchical models for statistical machine translation of inflected languages. Proceedings of the workshop on data-driven methods in machine translation - Volume 14, 2001.
Stephan Vogel - Machine Translation 12
Hierarchical Lexicon
Equivalence classes at different levels of abstraction
Example: ankommen
  Level n is the full analysis
  Level n-1: drop the person distinction -> groups "ankomme", "ankommst", "ankommt"
  Level n-2: additionally drop the singular/plural distinction
  …
Stephan Vogel - Machine Translation 13
Hierarchical Lexicon
Translation probability at level i, taking all factors up to level i into account:
$p_i(f \mid e) = \sum_{t_i^0} p(t_i^0 \mid e)\, p(f \mid t_i^0, e)$
Assumption: $p(f \mid t_i^0, e)$ does not depend on e, and the word form follows unambiguously from the tags
Linear combination of the $p_i$:
$p(f \mid e) = \lambda_0\, p_0(f \mid e) + \ldots + \lambda_n\, p_n(f \mid e)$
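A small sketch of these two equations, with hypothetical table layouts (tag_prob[e] maps level-i tag classes t to p(t|e); form_prob maps (f, t) to p(f|t), using the independence assumption above):

    def p_level(f, e, tag_prob, form_prob):
        # p_i(f|e) = sum over tags t of p(t|e) * p(f|t) at one abstraction level.
        return sum(pt * form_prob.get((f, t), 0.0) for t, pt in tag_prob[e].items())

    def p_hierarchical(f, e, level_models, lambdas):
        # Linear combination over abstraction levels: sum_i lambda_i * p_i(f|e).
        # level_models: callables taking (f, e), e.g. built with
        # functools.partial(p_level, tag_prob=..., form_prob=...).
        return sum(lam * p(f, e) for lam, p in zip(lambdas, level_models))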
Stephan Vogel - Machine Translation 14
Multi-Stream Word Alignment
Use multiple annotations: stem, POS, …
Consider each annotation as an additional stream or tier
Use generative alignment models:
  Model each stream
  But tie the streams together through the alignment
Example: Bi-stream HMM word alignment (Zhao et al., 2005)
Stephan Vogel - Machine Translation 15
Bi-Stream HMM Alignment
HMM:
  Relative word position as the distortion component (can be conditioned on word classes)
  Forward-backward algorithm for training
[Figure: HMM trellis; each source word f_j is emitted by the target word e_{a_j} it is aligned to]
$P(f_1^J \mid e_1^I) = \sum_{a_1^J} \prod_{j=1}^{J} P(f_j \mid e_{a_j})\, P(a_j \mid a_{j-1})$
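For concreteness, a sketch of the forward algorithm for this likelihood (the uniform initial alignment distribution is my assumption; trans and dist are precomputed probability tables):

    import numpy as np

    def hmm_alignment_likelihood(trans, dist):
        # trans[j][i] = P(f_j | e_i), shape (J, I)
        # dist[i_prev][i] = P(a_j = i | a_{j-1} = i_prev), shape (I, I)
        J, I = trans.shape
        alpha = np.full(I, 1.0 / I) * trans[0]   # j = 1, uniform start (assumption)
        for j in range(1, J):
            alpha = (alpha @ dist) * trans[j]    # sum out the previous alignment
        return alpha.sum()                       # P(f_1^J | e_1^I)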
Stephan Vogel - Machine Translation 16
Bi-Stream HMM Alignment
Bi-stream HMM: assume the hidden alignment generates two data streams: words and word class labels
[Figure: HMM trellis with two emission streams from the aligned target word e_{a_j}: stream 1, the source words f_{j-1}, f_j, f_{j+1}; stream 2, the word class labels g_{j-1}, g_j, g_{j+1}]
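Written out, a reconstruction consistent with the two-stream picture above (the exact factorization in Zhao et al. 2005 may differ in details; $c_{a_j}$ denotes the class label of the aligned target word):
$$P(f_1^J, g_1^J \mid e_1^I) = \sum_{a_1^J} \prod_{j=1}^{J} P(f_j \mid e_{a_j})\, P(g_j \mid c_{a_j})\, P(a_j \mid a_{j-1})$$
That is, a single hidden alignment emits both streams, which is what ties them together.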
Stephan Vogel - Machine Translation 17
Second Stream: Bilingual Word Clusters
Ideally, we want word classes such that the translations of the words in a source-language cluster again fall into a single cluster on the target side
Bilingual word clusters (Och, 1999):
  Fix the monolingual clusters of one language first
  Optimize the clusters for the other language (mkcls in GIZA++)
Bilingual word spectral clusters:
  Eigenstructure analysis
  K-means or single-linkage clustering
Other word clusters:
  LDA (Blei et al., 2003)
  Co-clusters, etc.
Stephan Vogel - Machine Translation 18
Bi-Stream HMM with Word Clusters
Evaluating word alignment accuracy (F-measure):
  Bi-stream HMM (Bi-HMM) is better than HMM
  Bilingual word-spectral clusters are better than traditional ones
  The gains are larger for small training data
[Charts: alignment F-measure on TreeBank and FBIS data, F2E and E2F directions, comparing HMM (HMM-fe/ef), Bi-HMM with traditional mkcls clusters (Bi-HMM-mfe/mef), and Bi-HMM with bilingual spectral clusters (Bi-HMM-bfe/bef)]
Stephan Vogel - Machine Translation 19
Factored Translation Models
Paper: Koehn and Hoang, Factored Translation Models, EMNLP 2007
Factored translation models as an extension of phrase-based SMT
Interesting for translating into or between morphologically rich languages
Experiments for English-German, English-Spanish, English-Czech
(I follow that paper; the description on the Moses web site is nearly identical, see http://www.statmt.org/moses/?n=Moses.FactoredModels. The example is also from http://www.inf.ed.ac.uk/teaching/courses/mt/lectures/factored-models.pdf)
Stephan Vogel - Machine Translation 20
Factored Model
Analysis as preprocessing
Need to specify the transfer
Need to specify the generation
[Figure 1, factored representation: input and output words each represented as a vector of factors (word, lemma, POS, morphology, word class, …)]
[Figure 2, factored model: transfer mappings between input and output factors, plus generation mappings among the output factors]
Stephan Vogel - Machine Translation 21
Transfer
Mapping individual factors:
  As in non-factored models
  Example: Haus -> house, home, building, shell
Mapping combinations of factors (see the sketch below):
  The new vocabulary is the Cartesian product of the vocabularies of the individual factors, e.g. NN and singular -> NN|singular
  Map these combinations
  Example: NN|plural|nominative -> NN|plural, NN|singular
The number of factors on the source and target side can differ
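A minimal sketch of building factor combinations via the Cartesian product (the function name and table layout are hypothetical):

    from itertools import product

    def combine_factor_options(*per_factor_options):
        # Candidate factor bundles as the Cartesian product of the translation
        # options of the individual factors, joined into NN|singular-style strings.
        return ["|".join(parts) for parts in product(*per_factor_options)]

    # combine_factor_options(["house", "home"], ["NN|plural", "NN|singular"])
    # -> ['house|NN|plural', 'house|NN|singular', 'home|NN|plural', 'home|NN|singular']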
Stephan Vogel - Machine Translation 22
Generation
Generate the surface form from the factors
Examples:
  house|NN|plural -> houses
  house|NN|singular -> house
  house|VB|present|3rd-person -> houses
Stephan Vogel - Machine Translation 23
Example including all Steps
German word: Häuser
Analysis:
  häuser|haus|NN|plural|nominative|neutral
Translation, mapping the lemma:
  { ?|house|?|?|?, ?|home|?|?|?, ?|building|?|?|? }
Translation, mapping the morphology:
  { ?|house|NN|plural, ?|house|NN|singular, ?|home|NN|plural, ?|building|NN|plural }
Generation, generating surface forms:
  { houses|house|NN|plural, house|house|NN|singular, homes|home|NN|plural, buildings|building|NN|plural }
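Chaining the three steps for one analysed word might look like this (a sketch; the table layouts and the fixed factor order are assumptions for illustration):

    def translate_factored(analysis, lemma_table, morph_table, gen_table):
        # analysis: e.g. 'häuser|haus|NN|plural|nominative|neutral'
        _, lemma, pos, number, case, _gender = analysis.split("|")
        candidates = []
        for tgt_lemma in lemma_table[lemma]:                    # map the lemma
            for tgt_morph in morph_table[(pos, number, case)]:  # map the morphology
                key = (tgt_lemma,) + tgt_morph
                for surface in gen_table.get(key, []):          # generate the surface form
                    candidates.append("|".join((surface, tgt_lemma) + tgt_morph))
        return candidates

    # e.g. lemma_table['haus'] = ['house', 'home', 'building'],
    # morph_table[('NN', 'plural', 'nominative')] = [('NN', 'plural'), ('NN', 'singular')],
    # gen_table[('house', 'NN', 'plural')] = ['houses'], etc.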
Stephan Vogel - Machine Translation 24
Training the Model
Parallel data needs to be annotated -> preprocessing
  Source- and target-side annotation are typically independent of each other
  Some work on 'coupled' annotation, e.g. inducing word classes through clustering with mkcls, or morphological analysis of Arabic conditioned on the English side (Linh)
Word alignment:
  Operate on the surface form only
  Use multi-stream alignment (example: Bi-stream HMM)
  Use discriminative alignment (example: CRF approach)
  Estimate translation probabilities: collect counts for factors or combinations of factors
Phrase alignment:
  Extract from the word alignment using standard heuristics
  Estimate various scoring functions
Stephan Vogel - Machine Translation 25
Training the Model
Word alignment (symmetrized)
Stephan Vogel - Machine Translation 26
Training the Model
Extract phrase: natürlich hat john # naturally john has
Stephan Vogel - Machine Translation 27
Training the Model
Extract phrase for other factors: ADV V NNP # ADV NNP V
Stephan Vogel - Machine Translation 28
Training the Generation Steps
Train on the target side of the corpus
  Can use additional monolingual data
Map factor(s) to factor(s), e.g. word -> POS and POS -> word
Example: The/DET big/ADJ tree/NN
Count collection:
  count( the, DET )++
  count( big, ADJ )++
  count( tree, NN )++
Probability distributions (maximum likelihood estimates):
  p( the | DET ) and p( DET | the )
  p( big | ADJ ) and p( ADJ | big )
  p( tree | NN ) and p( NN | tree )
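A sketch of this count collection and relative-frequency estimation for a word<->POS generation step (names are illustrative):

    from collections import Counter

    def train_generation(tagged_corpus):
        # tagged_corpus: iterable of (word, tag) pairs, e.g. [('the', 'DET'), ...]
        pair, word, tag = Counter(), Counter(), Counter()
        for w, t in tagged_corpus:
            pair[(w, t)] += 1
            word[w] += 1
            tag[t] += 1
        p_word_given_tag = {(w, t): c / tag[t] for (w, t), c in pair.items()}
        p_tag_given_word = {(t, w): c / word[w] for (w, t), c in pair.items()}
        return p_word_given_tag, p_tag_given_word

    # train_generation([('the', 'DET'), ('big', 'ADJ'), ('tree', 'NN')])
    # yields p(the|DET) = 1.0 and p(DET|the) = 1.0, etc.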
Stephan Vogel - Machine Translation 29
Combination of Components
Log-linear combination of feature functions:
$p(e \mid f) = \frac{1}{Z} \exp \sum_{i=1}^{n} \lambda_i h_i(e, f)$
A sentence translation is generated from a set of phrase pairs $(\bar{e}_j, \bar{f}_j)$
Translation component: feature functions h defined over phrase pairs,
$h(e, f) = \sum_j h(\bar{e}_j, \bar{f}_j)$
Generation component: feature functions h defined over output words $e_k$,
$h(e) = \sum_k h(e_k)$
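A minimal sketch of how these components combine into one model score (unnormalized, as a decoder would compare hypotheses; all names are illustrative):

    import math

    def translation_feature(phrase_pairs, h_pair):
        # h(e,f) = sum over phrase pairs (e_j, f_j) of a per-pair feature.
        return sum(h_pair(e_j, f_j) for e_j, f_j in phrase_pairs)

    def generation_feature(output_words, h_word):
        # h(e) = sum over output words e_k of a per-word feature.
        return sum(h_word(e_k) for e_k in output_words)

    def model_score(feature_values, weights):
        # exp(sum_i lambda_i * h_i(e,f)); dividing by Z would normalize.
        return math.exp(sum(lam * h for lam, h in zip(weights, feature_values)))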
Stephan Vogel - Machine Translation 30
Decoding with Factored Models
Instead of just one phrase table, there are now multiple tables
Important: all mappings operate on the same segmentation of the source sentence into phrases
More target translations are now possible
  Example: … beat … can be a verb or a noun
  Translations: beat # schlag (NN or VB), schlagen (VB), Rhythmus (NN)
Non-factored translation options for … beat …: schlag, schlagen, Rhythmus
Factored translation options for … beat …: schlag|NN|Nom, schlag|VB|1-person|singular, schlag|NN|Dat, schlag|NN|Akk
Stephan Vogel - Machine Translation 31
Decoding with Factored Models
Combinatorial explosion -> harsher pruning needed
Notice: translation-step features and generation-step features depend only on the phrase pair
  Alternative translations can be generated and inserted into the translation lattice before the best-path search begins (building a fully expanded phrase table?)
  Features can be calculated and used for translation model pruning (observation pruning; see the sketch below)
Pruning in the Moses decoder:
  Non-factored model: the default is 20 alternatives
  Factored model: the default is 50 alternatives
  Increase in decoding time: factor 2-3
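A sketch of observation pruning as described above (keep=50 mirrors the quoted factored-model default; the function itself is illustrative, not Moses code):

    def prune_translation_options(options, scores, keep=50):
        # Keep only the highest-scoring factored translation options for a
        # source phrase before they enter the translation lattice.
        ranked = sorted(zip(options, scores), key=lambda pair: -pair[1])
        return [opt for opt, _ in ranked[:keep]]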
Stephan Vogel - Machine Translation 32
Factored LMs in Moses
The training script allows specifying multiple LMs on different factors, with individual orders (history lengths)
Example (the fields are factor index : n-gram order : LM file):
  --lm 0:3:factored-corpus/surface.lm   // surface form 3-gram LM
  --lm 2:3:factored-corpus/pos.lm       // POS 3-gram LM
This generates different LMs on the different factors, not a factored LM
  The different LMs are used as independent features in the decoder
  There is no backing off between the different factors
Stephan Vogel - Machine Translation 33
Summary
Factored models:
  Deal with the large vocabulary of morphologically rich languages
  'Connect' words, thereby getting better model estimates
  Explicitly model morphological dependencies within sentences
Factored models are not always called factored models:
  Hierarchical model (lexicon)
  Multi-stream model (alignment)
Factored LMs were introduced for ASR:
  Many backoff paths
Moses decoder:
  Allows factored TMs and factored LMs
  But no backing off between factors, only log-linear combination