Stephan Vogel - Machine Translation 1
Machine Translation
Factored Models
Stephan Vogel, Spring Semester 2011
Stephan Vogel - Machine Translation 2
Overview
Factored Language Models
Multi-Stream Word Alignment
Factored Translation Models
Stephan Vogel - Machine Translation 3
Motivation
Vocabulary grows dramatically for morphologically rich languages
Looking only at the surface word form does not take connections (morphological derivations) between words into account
  Example: 'book' and 'books' are treated as unrelated as 'book' and 'sky'
Dependencies between words within a sentence are not well captured
  Example: number or gender agreement
  Singular: der alte Tisch (the old table)
  Plural: die alten Tische (the old tables)
Consider a word as a bundle of factors:
  Surface word form, stem, root, prefix, suffix, POS, gender marker, case marker, number marker, …
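As a minimal illustration of this bundle-of-factors view (the class and factor names below are hypothetical, not from the lecture):

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class FactoredWord:
        # A word as a bundle of factors; more factors (stem, case, ...) could be added.
        surface: str
        lemma: str
        pos: str
        number: str

    # 'book' and 'books' now share the lemma and POS factors, so a model can
    # relate them instead of treating them as unrelated as 'book' and 'sky'.
    book  = FactoredWord(surface="book",  lemma="book", pos="NN", number="singular")
    books = FactoredWord(surface="books", lemma="book", pos="NN", number="plural")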
Stephan Vogel - Machine Translation 4
Two solutions
Morphological decomposition into a stream of morphemes:
  Compound noun splitting
  Prefix-stem-suffix splitting
  [Figure: sentence as a morpheme stream: prefix stem suffix prefix stem suffix …]
Words as bundles of (parallel) factors:
  [Figure: each word w1 w2 w3 w4 … as a stack of factors: word, lemma, POS, morphology, word class]
Stephan Vogel - Machine Translation 5
Questions
Which information is the most useful?
How to use this information? In the language model? In the translation model?
How to use it at training time? How to use it at decoding time?
Stephan Vogel - Machine Translation 6
Factored Models
Morphological preprocessing: a significant body of work
Factored language models: Kirchhoff et al.
Hierarchical lexicon: Nießen et al.
Bi-stream alignment: Zhao et al.
Factored translation models: Koehn et al.
Stephan Vogel - Machine Translation 7
Factored Language Model
Some papers:
Bilmes and Kirchhoff, 2003. Factored Language Models and Generalized Parallel Backoff.
Duh and Kirchhoff, 2004. Automatic Learning of Language Model Structure.
Kirchhoff and Yang, 2005. Improved Language Modeling for Statistical Machine Translation.
Stephan Vogel - Machine Translation 8
Factored Language Model
Representation: a word is a bundle of K factors,
$w \equiv f^{1:K} = \{f^1, f^2, \ldots, f^K\}$
LM probability:
$p(w_1, \ldots, w_I) \approx p(f_1^{1:K}, \ldots, f_I^{1:K}) = \prod_{i=1}^{I} p(f_i^{1:K} \mid f_1^{1:K}, \ldots, f_{i-1}^{1:K})$
Stephan Vogel - Machine Translation 9
Language Model Backoff
Smoothing by backing off
Backoff paths:
  in a standard LM: a single path, dropping the most distant context word first
  in a factored LM: the context can be shortened by dropping any factor, so there are many possible paths
Stephan Vogel - Machine Translation 10
Choosing Backoff Paths
Different possibilities:
  Fixed path (a sketch follows below)
  Choose the path dynamically during training
  Choose multiple paths dynamically during training and combine the results (Generalized Parallel Backoff)
Many paths -> an optimization problem
  Duh & Kirchhoff (2004) use a genetic algorithm
Bilmes and Kirchhoff (2003) report LM perplexities
Kirchhoff and Yang (2005) use an FLM to rescore n-best lists generated by an SMT system:
  the 3-gram FLM is slightly worse than a standard 4-gram LM
  the combined LM does not outperform the standard 4-gram LM
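A minimal sketch of the fixed-path variant, assuming simple absolute discounting and omitting the backoff weights for brevity (all names and the discounting scheme are illustrative, not from the cited papers):

    def flm_prob(counts, context_counts, target, context, discount=0.4):
        # P(target | context) with a fixed backoff path over factors:
        # whenever the full context is unseen, drop its last factor and retry.
        # counts[(context, target)] and context_counts[context] are n-gram counts;
        # context is a tuple of factor values, e.g. (prev_lemma, prev_pos).
        if not context:
            return 1e-7  # floor for completely unseen events (illustration only)
        c = counts.get((context, target), 0)
        if c > 0:
            return (c - discount) / context_counts[context]
        # Proper backoff would scale this recursion by a weight alpha(context).
        return flm_prob(counts, context_counts, target, context[:-1], discount)

Generalized parallel backoff would evaluate several such drop orders in parallel and combine the resulting estimates (e.g. by their mean or maximum).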
Stephan Vogel - Machine Translation 11
Hierarchical Lexicon
Morphological analysis:
  Using GERCG, a constraint grammar parser for German, for lexical analysis and morphological and syntactic disambiguation
Build equivalence classes:
  Group words which tend to translate into the same target word
  Don't distinguish what does not need to be distinguished!
  E.g. for nouns: gender is irrelevant, as are nominative, dative, and accusative; but the genitive translates differently
Sonja Nießen and Hermann Ney, Toward hierarchical models for statistical machine translation of inflected languages. Proceedings of the workshop on data-driven methods in machine translation - Volume 14, 2001.
Stephan Vogel - Machine Translation 12
Hierarchical Lexicon
Equivalence classes at different levels of abstraction
Example: ankommen
  Level n is the full analysis
  Level n-1: drop the person distinction -> groups "ankomme", "ankommst", "ankommt"
  Level n-2: additionally drop the singular/plural distinction
  …
Stephan Vogel - Machine Translation 13
Hierarchical Lexicon
Translation probability at level i, taking all factors up to level i into account:
$p_i(f \mid e) = \sum_{t_i^0} p(t_i^0 \mid e)\, p(f \mid t_i^0, e)$
Assumption: $p(f \mid t_i^0, e)$ does not depend on e, and the word form follows unambiguously from the tags
Linear combination of the $p_i$:
$p(f \mid e) = \lambda_0\, p_0(f \mid e) + \ldots + \lambda_n\, p_n(f \mid e)$
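A small sketch of these two equations, with hypothetical table layouts (tag_prob[e] maps level-i tag classes t to p(t|e); form_prob maps (f, t) to p(f|t), using the independence assumption above):

    def p_level(f, e, tag_prob, form_prob):
        # p_i(f|e) = sum over tags t of p(t|e) * p(f|t) at one abstraction level.
        return sum(pt * form_prob.get((f, t), 0.0) for t, pt in tag_prob[e].items())

    def p_hierarchical(f, e, level_models, lambdas):
        # Linear combination over abstraction levels: sum_i lambda_i * p_i(f|e).
        # level_models: callables taking (f, e), e.g. built with
        # functools.partial(p_level, tag_prob=..., form_prob=...).
        return sum(lam * p(f, e) for lam, p in zip(lambdas, level_models))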
Stephan Vogel - Machine Translation 14
Multi-Stream Word Alignment
Use multiple annotations: stem, POS, …
Consider each annotation as an additional stream or tier
Use generative alignment models:
  Model each stream
  But tie the streams together through the alignment
Example: Bi-stream HMM word alignment (Zhao et al., 2005)
Stephan Vogel - Machine Translation 15
Bi-Stream HMM Alignment
HMM:
  Relative word position as the distortion component (can be conditioned on word classes)
  Forward-backward algorithm for training
[Figure: HMM trellis; each source word f_j is emitted by the target word e_{a_j} it is aligned to]
$P(f_1^J \mid e_1^I) = \sum_{a_1^J} \prod_{j=1}^{J} P(f_j \mid e_{a_j})\, P(a_j \mid a_{j-1})$
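For concreteness, a sketch of the forward algorithm for this likelihood (the uniform initial alignment distribution is my assumption; trans and dist are precomputed probability tables):

    import numpy as np

    def hmm_alignment_likelihood(trans, dist):
        # trans[j][i] = P(f_j | e_i), shape (J, I)
        # dist[i_prev][i] = P(a_j = i | a_{j-1} = i_prev), shape (I, I)
        J, I = trans.shape
        alpha = np.full(I, 1.0 / I) * trans[0]   # j = 1, uniform start (assumption)
        for j in range(1, J):
            alpha = (alpha @ dist) * trans[j]    # sum out the previous alignment
        return alpha.sum()                       # P(f_1^J | e_1^I)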
Stephan Vogel - Machine Translation 16
Bi-Stream HMM Alignment
Bi-stream HMM: assume the hidden alignment generates two data streams: words and word class labels
[Figure: HMM trellis with two emission streams from the aligned target word e_{a_j}: stream 1, the source words f_{j-1}, f_j, f_{j+1}; stream 2, the word class labels g_{j-1}, g_j, g_{j+1}]
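Written out, a reconstruction consistent with the two-stream picture above (the exact factorization in Zhao et al. 2005 may differ in details; $c_{a_j}$ denotes the class label of the aligned target word):
$$P(f_1^J, g_1^J \mid e_1^I) = \sum_{a_1^J} \prod_{j=1}^{J} P(f_j \mid e_{a_j})\, P(g_j \mid c_{a_j})\, P(a_j \mid a_{j-1})$$
That is, a single hidden alignment emits both streams, which is what ties them together.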
Stephan Vogel - Machine Translation 17
Second Stream: Bilingual Word Clusters
Ideally, we want word classes such that the translations of the words in a source-language cluster again fall into a single cluster on the target side
Bilingual word clusters (Och, 1999):
  Fix the monolingual clusters of one language first
  Optimize the clusters for the other language (mkcls in GIZA++)
Bilingual word spectral clusters:
  Eigenstructure analysis
  K-means or single-linkage clustering
Other word clusters:
  LDA (Blei et al., 2003)
  Co-clusters, etc.
Stephan Vogel - Machine Translation 18
Bi-Stream HMM with Word Clusters
Evaluating word alignment accuracy (F-measure):
  Bi-stream HMM (Bi-HMM) is better than HMM
  Bilingual word-spectral clusters are better than traditional ones
  The gains are larger for small training data
[Charts: alignment F-measure on TreeBank and FBIS data, F2E and E2F directions, comparing HMM (HMM-fe/ef), Bi-HMM with traditional mkcls clusters (Bi-HMM-mfe/mef), and Bi-HMM with bilingual spectral clusters (Bi-HMM-bfe/bef)]
Stephan Vogel - Machine Translation 19
Factored Translation Models
Paper: Koehn and Hoang, Factored Translation Models, EMNLP 2007
Factored translation models as an extension of phrase-based SMT
Interesting for translating into or between morphologically rich languages
Experiments for English-German, English-Spanish, English-Czech
(I follow that paper; the description on the Moses web site is nearly identical, see http://www.statmt.org/moses/?n=Moses.FactoredModels. The example is also from http://www.inf.ed.ac.uk/teaching/courses/mt/lectures/factored-models.pdf)
Stephan Vogel - Machine Translation 20
Factored Model
Analysis as preprocessing
Need to specify the transfer
Need to specify the generation
[Figure 1, factored representation: input and output words each represented as a vector of factors (word, lemma, POS, morphology, word class, …)]
[Figure 2, factored model: transfer mappings between input and output factors, plus generation mappings among the output factors]
Stephan Vogel - Machine Translation 21
Transfer
Mapping individual factors:
  As in non-factored models
  Example: Haus -> house, home, building, shell
Mapping combinations of factors (see the sketch below):
  The new vocabulary is the Cartesian product of the vocabularies of the individual factors, e.g. NN and singular -> NN|singular
  Map these combinations
  Example: NN|plural|nominative -> NN|plural, NN|singular
The number of factors on the source and target side can differ
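A minimal sketch of building factor combinations via the Cartesian product (the function name and table layout are hypothetical):

    from itertools import product

    def combine_factor_options(*per_factor_options):
        # Candidate factor bundles as the Cartesian product of the translation
        # options of the individual factors, joined into NN|singular-style strings.
        return ["|".join(parts) for parts in product(*per_factor_options)]

    # combine_factor_options(["house", "home"], ["NN|plural", "NN|singular"])
    # -> ['house|NN|plural', 'house|NN|singular', 'home|NN|plural', 'home|NN|singular']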
Stephan Vogel - Machine Translation 22
Generation
Generate the surface form from the factors
Examples:
  house|NN|plural -> houses
  house|NN|singular -> house
  house|VB|present|3rd-person -> houses
Stephan Vogel - Machine Translation 23
Example including all Steps
German word: Häuser
Analysis:
  häuser|haus|NN|plural|nominative|neutral
Translation, mapping the lemma:
  { ?|house|?|?|?, ?|home|?|?|?, ?|building|?|?|? }
Translation, mapping the morphology:
  { ?|house|NN|plural, ?|house|NN|singular, ?|home|NN|plural, ?|building|NN|plural }
Generation, generating surface forms:
  { houses|house|NN|plural, house|house|NN|singular, homes|home|NN|plural, buildings|building|NN|plural }
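Chaining the three steps for one analysed word might look like this (a sketch; the table layouts and the fixed factor order are assumptions for illustration):

    def translate_factored(analysis, lemma_table, morph_table, gen_table):
        # analysis: e.g. 'häuser|haus|NN|plural|nominative|neutral'
        _, lemma, pos, number, case, _gender = analysis.split("|")
        candidates = []
        for tgt_lemma in lemma_table[lemma]:                    # map the lemma
            for tgt_morph in morph_table[(pos, number, case)]:  # map the morphology
                key = (tgt_lemma,) + tgt_morph
                for surface in gen_table.get(key, []):          # generate the surface form
                    candidates.append("|".join((surface, tgt_lemma) + tgt_morph))
        return candidates

    # e.g. lemma_table['haus'] = ['house', 'home', 'building'],
    # morph_table[('NN', 'plural', 'nominative')] = [('NN', 'plural'), ('NN', 'singular')],
    # gen_table[('house', 'NN', 'plural')] = ['houses'], etc.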
Stephan Vogel - Machine Translation 24
Training the Model
Parallel data needs to be annotated -> preprocessing
  Source- and target-side annotation are typically independent of each other
  Some work on 'coupled' annotation, e.g. inducing word classes through clustering with mkcls, or morphological analysis of Arabic conditioned on the English side (Linh)
Word alignment:
  Operate on the surface form only
  Use multi-stream alignment (example: Bi-stream HMM)
  Use discriminative alignment (example: CRF approach)
  Estimate translation probabilities: collect counts for factors or combinations of factors
Phrase alignment:
  Extract from the word alignment using standard heuristics
  Estimate various scoring functions
Stephan Vogel - Machine Translation 25
Training the Model
Word alignment (symmetrized)
Stephan Vogel - Machine Translation 26
Training the Model
Extract phrase: natürlich hat john # naturally john has
Stephan Vogel - Machine Translation 27
Training the Model
Extract phrase for other factors: ADV V NNP # ADV NNP V
Stephan Vogel - Machine Translation 28
Training the Generation Steps
Train on the target side of the corpus
  Can use additional monolingual data
Map factor(s) to factor(s), e.g. word -> POS and POS -> word
Example: The/DET big/ADJ tree/NN
Count collection:
  count( the, DET )++
  count( big, ADJ )++
  count( tree, NN )++
Probability distributions (maximum likelihood estimates):
  p( the | DET ) and p( DET | the )
  p( big | ADJ ) and p( ADJ | big )
  p( tree | NN ) and p( NN | tree )
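A sketch of this count collection and relative-frequency estimation for a word<->POS generation step (names are illustrative):

    from collections import Counter

    def train_generation(tagged_corpus):
        # tagged_corpus: iterable of (word, tag) pairs, e.g. [('the', 'DET'), ...]
        pair, word, tag = Counter(), Counter(), Counter()
        for w, t in tagged_corpus:
            pair[(w, t)] += 1
            word[w] += 1
            tag[t] += 1
        p_word_given_tag = {(w, t): c / tag[t] for (w, t), c in pair.items()}
        p_tag_given_word = {(t, w): c / word[w] for (w, t), c in pair.items()}
        return p_word_given_tag, p_tag_given_word

    # train_generation([('the', 'DET'), ('big', 'ADJ'), ('tree', 'NN')])
    # yields p(the|DET) = 1.0 and p(DET|the) = 1.0, etc.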
Stephan Vogel - Machine Translation 29
Combination of Components
Log-linear combination of feature functions:
$p(e \mid f) = \frac{1}{Z} \exp \sum_{i=1}^{n} \lambda_i h_i(e, f)$
A sentence translation is generated from a set of phrase pairs $(\bar{e}_j, \bar{f}_j)$
Translation component: feature functions h defined over phrase pairs,
$h(e, f) = \sum_j h(\bar{e}_j, \bar{f}_j)$
Generation component: feature functions h defined over output words $e_k$,
$h(e) = \sum_k h(e_k)$
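A minimal sketch of how these components combine into one model score (unnormalized, as a decoder would compare hypotheses; all names are illustrative):

    import math

    def translation_feature(phrase_pairs, h_pair):
        # h(e,f) = sum over phrase pairs (e_j, f_j) of a per-pair feature.
        return sum(h_pair(e_j, f_j) for e_j, f_j in phrase_pairs)

    def generation_feature(output_words, h_word):
        # h(e) = sum over output words e_k of a per-word feature.
        return sum(h_word(e_k) for e_k in output_words)

    def model_score(feature_values, weights):
        # exp(sum_i lambda_i * h_i(e,f)); dividing by Z would normalize.
        return math.exp(sum(lam * h for lam, h in zip(weights, feature_values)))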
Stephan Vogel - Machine Translation 30
Decoding with Factored Models
Instead of just one phrase table, there are now multiple tables
Important: all mappings operate on the same segmentation of the source sentence into phrases
More target translations are now possible
  Example: … beat … can be a verb or a noun
  Translations: beat # schlag (NN or VB), schlagen (VB), Rhythmus (NN)
Non-factored translation options for … beat …: schlag, schlagen, Rhythmus
Factored translation options for … beat …: schlag|NN|Nom, schlag|VB|1-person|singular, schlag|NN|Dat, schlag|NN|Akk
Stephan Vogel - Machine Translation 31
Decoding with Factored Models
Combinatorial explosion -> harsher pruning needed
Notice: translation-step features and generation-step features depend only on the phrase pair
  Alternative translations can be generated and inserted into the translation lattice before the best-path search begins (building a fully expanded phrase table?)
  Features can be calculated and used for translation model pruning (observation pruning; see the sketch below)
Pruning in the Moses decoder:
  Non-factored model: the default is 20 alternatives
  Factored model: the default is 50 alternatives
  Increase in decoding time: factor 2-3
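A sketch of observation pruning as described above (keep=50 mirrors the quoted factored-model default; the function itself is illustrative, not Moses code):

    def prune_translation_options(options, scores, keep=50):
        # Keep only the highest-scoring factored translation options for a
        # source phrase before they enter the translation lattice.
        ranked = sorted(zip(options, scores), key=lambda pair: -pair[1])
        return [opt for opt, _ in ranked[:keep]]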
Stephan Vogel - Machine Translation 32
Factored LMs in Moses
The training script allows specifying multiple LMs on different factors, with individual orders (history lengths)
Example (the fields are factor index : n-gram order : LM file):
  --lm 0:3:factored-corpus/surface.lm   // surface form 3-gram LM
  --lm 2:3:factored-corpus/pos.lm       // POS 3-gram LM
This generates different LMs on the different factors, not a factored LM
  The different LMs are used as independent features in the decoder
  There is no backing off between the different factors
Stephan Vogel - Machine Translation 33
Summary
Factored models:
  Deal with the large vocabulary of morphologically rich languages
  'Connect' words, thereby getting better model estimates
  Explicitly model morphological dependencies within sentences
Factored models are not always called factored models:
  Hierarchical model (lexicon)
  Multi-stream model (alignment)
Factored LMs were introduced for ASR:
  Many backoff paths
Moses decoder:
  Allows factored TMs and factored LMs
  But no backing off between factors, only log-linear combination