Transfer-based MT
Syntactic Transfer-based Machine Translation
• Direct and Example-based approaches
– Two ends of a spectrum
– Recombination of fragments for better coverage
• What if the matching/transfer is done at syntactic parse level
• Three Steps
– Parse: syntactic parse of the source-language sentence
• Hierarchical representation of a sentence
– Transfer: rules to transform the source parse tree into a target parse tree
• Subject-Verb-Object → Subject-Object-Verb
– Generation: regenerating the target-language sentence from the parse tree
• Morphology of the target language
• Tree-structure provides better matching and longer distance transformations than is possible in string-based EBMT.
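On a toy scale, the three steps can be run end to end; the nested-tuple tree encoding, the single SVO→SOV transfer rule, and the example sentence below are invented for illustration.

```python
# Parse trees as (label, children) tuples; leaves are plain strings.
# Toy transfer rule: rewrite Verb-Object order to Object-Verb (SVO -> SOV).

def transfer_svo_to_sov(tree):
    """Recursively reorder VP(V NP) subtrees into VP(NP V)."""
    label, children = tree
    children = [transfer_svo_to_sov(c) if isinstance(c, tuple) else c
                for c in children]
    if label == "VP" and len(children) == 2 and children[0][0] == "V":
        verb, obj = children
        children = [obj, verb]          # Verb-Object -> Object-Verb
    return (label, children)

def generate(tree):
    """Generation step (no morphology here): read the leaves off in order."""
    label, children = tree
    words = []
    for c in children:
        words.extend(generate(c) if isinstance(c, tuple) else [c])
    return words

# Step 1: parse of "I use my card" (given here by hand).
parse = ("S", [("NP", ["I"]),
               ("VP", [("V", ["use"]), ("NP", ["my", "card"])])])
# Steps 2-3: transfer, then regenerate in target (SOV) order.
print(" ".join(generate(transfer_svo_to_sov(parse))))  # I my card use
```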
Examples of SynTran-MT
[Figure: mostly parallel parse trees for Spanish "ajá quiero usar mi tarjeta de crédito" ("yeah I-want to-use my card of credit") and English "yeah (I) wanna use my credit card".]
• Mostly parallel parse structures
• Might have to insert words – pronouns, morphological particles
Example of SynTran MT -2
• Pros:
– Allows for structure transfer
– Re-orderings are typically restricted to parent-child nodes
• Cons:
– Transfer rules are needed for each language pair (N² sets of rules)
– Hard to reuse rules when one of the languages is changed
[Figure: parallel parse trees for English "I need to make a collect call" and Japanese 私は (I) コレクト (collect) コールを (call) かける (make) 必要があります (need).]
Lexical-semantic Divergences
Linguistic Divergences
• Structural differences between languages
– Categorical Divergence
• Translation of words in one language into words that have different parts of speech in another language
– To be jealous
– Tener celos (To have jealousy)
Issues
Linguistic Divergences
– Conflational Divergence
• Translation of two or more words in one language into one word in another language
– To kick
– Dar una patada (Give a kick)
Issues
Linguistic Divergences
– Structural Divergence
• Realization of verb arguments in different syntactic configurations in different languages
– To enter the house
– Entrar en la casa (Enter in the house)
Issues
Linguistic Divergences
– Head-Swapping Divergence
• Inversion of a structural-dominance relation between two semantically equivalent words
– To run in
– Entrar corriendo (Enter running)
Issues
Linguistic Divergences
– Thematic Divergence
• Realization of verb arguments that reflect different thematic-to-syntactic mapping orders
– I like grapes
– Me gustan uvas (To-me please grapes)
Divergence counts from Bonnie Dorr
32% of sentences in UN Spanish/English Corpus (5K) contain divergences:
• Categorial (X tener hambre ↔ Y have hunger): 98%
• Conflational (X dar puñaladas a Z ↔ X stab Z): 83%
• Structural (X entrar en Y ↔ X enter Y): 35%
• Head Swapping (X cruzar Y nadando ↔ X swim across Y): 8%
• Thematic (X gustar a Y ↔ Y likes X): 6%
Transfer rules
Syntax-driven statistical machine translation
Slides from Deyi Xiong, CAS, Beijing
Why syntax-based SMT
Weakness of phrase-based SMT
• Long-distance reordering: phrase-level reordering
• Discontinuous phrases
• Generalization
• …
Other methods using syntactic knowledge
• Word alignment integrating syntactic constraints
• Pre-order source sentences
• Rerank n-best output of translation models
SSMT based on formal structures
Compared with phrase-based SMT
• Translated hierarchically
• The target structures finally generated are not necessarily real linguistic structures, but
– Make long-distance reordering more feasible
– Introduce non-terminals/variables
• Discontinuous phrases: put x on, 在 x 时
• Generalization
SCFG
Formulated:
• Two CFGs and their correspondences: SCFG = (G1, G2, ~)
Or
• SCFG = (N, T, P, S)
• P: synchronous productions of the form X → ⟨γ, α, ~⟩
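A minimal sketch of that formulation: each synchronous production pairs a source RHS with a target RHS, and matching integer indices link nonterminal occurrences across the two sides. The three-production toy grammar below is invented for illustration.

```python
# lhs -> list of (source_rhs, target_rhs); (sym, i) marks linked nonterminals.
GRAMMAR = {
    "S": [([("A", 1), ("B", 2)], [("B", 2), ("A", 1)])],  # inverted order
    "A": [(["a"], ["x"])],
    "B": [(["b"], ["y"])],
}

def derive(sym):
    """Expand sym synchronously; returns (source_words, target_words)."""
    src_rhs, tgt_rhs = GRAMMAR[sym][0]   # toy: always take the first production
    linked = {}                          # link index -> derived (src, tgt) pair
    src = []
    for s in src_rhs:
        if isinstance(s, tuple):         # linked nonterminal
            linked[s[1]] = derive(s[0])
            src.extend(linked[s[1]][0])
        else:                            # terminal
            src.append(s)
    tgt = []
    for s in tgt_rhs:
        tgt.extend(linked[s[1]][1] if isinstance(s, tuple) else [s])
    return src, tgt

print(derive("S"))  # (['a', 'b'], ['y', 'x'])
```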
SCFG: an example
SCFG: derivation
ITG as reordering constraint
Two kinds of reordering
• Straight
• Inverted
Coverage
• Wu (1997): "been unable to find real examples" of cases where alignments would fail under this constraint, at least in "lightly inflected languages, such as English and Chinese."
• Wellington (2006): "we found examples", "at least 5% of the Chinese/English sentence pairs".
Weakness
• No strong mechanism determining which order is better, inverted or straight.
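One way to see what the constraint covers: an alignment permutation is ITG-reorderable exactly when it avoids the two "inside-out" patterns 2413 and 3142 (the alignments Wu could not find in practice). A brute-force checker; the example permutations are illustrative.

```python
from itertools import combinations

def is_itg(perm):
    """True iff perm avoids patterns 2413 and 3142, i.e. is ITG-reorderable."""
    for idx in combinations(range(len(perm)), 4):
        vals = [perm[i] for i in idx]
        ranks = [sorted(vals).index(v) + 1 for v in vals]  # relative order
        if ranks in ([2, 4, 1, 3], [3, 1, 4, 2]):
            return False
    return True

print(is_itg([2, 0, 3, 1]))  # False: the classic "inside-out" alignment
print(is_itg([1, 0, 3, 2]))  # True: inverted merges suffice
```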
Chiang’05: Hierarchical Phrase-based Model (HPM)
Rules: X → ⟨γ, α⟩, e.g. X → ⟨yu X1 you X2, have X2 with X1⟩
Glue rules: S → ⟨S1 X2, S1 X2⟩ and S → ⟨X1, X1⟩
Model: log-linear
Decoder: CKY
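Applying a hierarchical rule is synchronized substitution into the linked X slots. A sketch using Chiang's example rule X → ⟨yu X1 you X2, have X2 with X1⟩, with sub-translations from his running example:

```python
# One hierarchical rule: linked integers mark the X1/X2 slots on both sides.
RULE = ("X", ["yu", 1, "you", 2], ["have", 2, "with", 1])

def apply_rule(rule, subs):
    """Substitute sub-translations subs[i] = (src_words, tgt_words) into rule."""
    _, src_rhs, tgt_rhs = rule
    src = [w for s in src_rhs
           for w in (subs[s][0] if isinstance(s, int) else [s])]
    tgt = [w for s in tgt_rhs
           for w in (subs[s][1] if isinstance(s, int) else [s])]
    return src, tgt

# X1 = Beihan / North Korea, X2 = bangjiao / diplomatic relations
src, tgt = apply_rule(RULE, {1: (["Beihan"], ["North", "Korea"]),
                             2: (["bangjiao"], ["diplomatic", "relations"])})
print(" ".join(src))  # yu Beihan you bangjiao
print(" ".join(tgt))  # have diplomatic relations with North Korea
```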
Chiang’05: rule extraction
Chiang’05: rule extraction restrictions
• Initial base rules: at most 15 words on the French side
• Final rules: at most 5 symbols on the French side
• At most two non-terminals on each side, non-adjacent
• At least one pair of aligned terminals
Chiang’05: Model
• Log-linear form
Chiang’05: decoder
SSMT based on phrase structures
Using grammars with linguistic knowledge
• The grammars are based on SCFG
Two categories:
• Tree-string– Tree-to-string– String-to-tree
• Tree-tree
Yamada & Knight 2001, 2003
Yamada’s work vs. SCFG
Insertion operation:
• A → ⟨w A1, A1⟩
Reordering operation:
• A → ⟨A1 A2 A3, A1 A3 A2⟩
Translation operation:
• A → ⟨x, y⟩
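The three operations compose into a noisy-channel sketch that turns an English parse into Japanese order; the lexicon, reorder table, and particle-insertion table below are hand-filled toy stand-ins for the model's learned t-, r-, and n-tables.

```python
# Toy sketch of Yamada & Knight's channel operations on an English parse.
LEX = {"he": "kare", "adores": "daisuki desu", "listening": "kiku no",
       "to": "wo", "music": "ongaku"}
REORDER = {("PRP", "VB1", "VB2"): (0, 2, 1),   # SVO -> SOV at the top node
           ("VB", "TO"): (1, 0),
           ("TO", "NN"): (1, 0)}
INSERT_AFTER = {"PRP": "ha", "VB": "ga"}        # particle insertion

def channel(tree):
    label, children = tree
    if all(isinstance(c, str) for c in children):      # leaf: translate
        return [LEX.get(w, w) for w in children]
    order = REORDER.get(tuple(c[0] for c in children),
                        range(len(children)))          # reorder children
    out = []
    for i in order:
        out.extend(channel(children[i]))
        if children[i][0] in INSERT_AFTER:             # insert particle
            out.append(INSERT_AFTER[children[i][0]])
    return out

tree = ("VB", [("PRP", ["he"]), ("VB1", ["adores"]),
               ("VB2", [("VB", ["listening"]),
                        ("TO", [("TO", ["to"]), ("NN", ["music"])])])])
print(" ".join(channel(tree)))  # kare ha ongaku wo kiku no ga daisuki desu
```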
Yamada: weakness
Single-level mapping
• Multi-level reordering– Yamada: flatten
Word-based
• Yamada: phrasal leaf
Galley et al. 2004, 2006
The translation model incorporates syntactic structure on the target-language side
• trained by learning "translation rules" from bilingual data
The decoder uses a parser-like method to create syntactic trees as output hypotheses
Translation rules
• Target: multi-level subtrees
• Source: continuous or discontinuous phrases
Types of translation rules
• Translating source phrases into target chunks– NPB(PRP/I) ↔我– NP-C(NPB(DT/this NN/address)) ↔这个 地址
Types of translation rules
Have variables
• NP-C(NPB(PRP$/my x0:NN)) ↔ 我 的 x0
• PP(TO/to NP-C(NPB(x0:NNS NNP/park))) ↔ 去 x0 公园
Combine previously translated results together
• VP(x0:VBZ x1:NP-C) ↔ x1 x0
– takes a noun phrase followed by a verb, switches their order, then combines them into a new verb phrase
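A sketch of matching such a rule's English-side tree pattern and emitting the captured variables' translations in x1 x0 order; the tree encoding and the Chinese glosses for "likes"/"music" are illustrative assumptions.

```python
def match(pattern, tree, binding=None):
    """Bind "xN:LABEL" variables in pattern to subtrees of tree, or None."""
    binding = {} if binding is None else binding
    if isinstance(pattern, str):                 # variable node, e.g. "x0:VBZ"
        var, label = pattern.split(":")
        if tree[0] != label:
            return None
        binding[var] = tree
        return binding
    if pattern[0] != tree[0] or len(pattern[1]) != len(tree[1]):
        return None
    for p, t in zip(pattern[1], tree[1]):
        if match(p, t, binding) is None:
            return None
    return binding

# The combination rule VP(x0:VBZ x1:NP-C) <-> x1 x0
pattern = ("VP", ["x0:VBZ", "x1:NP-C"])
tree = ("VP", [("VBZ", ["likes"]),
               ("NP-C", [("NPB", [("NN", ["music"])])])])
b = match(pattern, tree)
# Suppose the captured subtrees were already translated lower in the chart:
translated = {"x0": ["喜欢"], "x1": ["音乐"]}
print(" ".join(translated["x1"] + translated["x0"]))  # 音乐 喜欢 (x1 x0 order)
```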
Rule extraction
Word-align a parallel corpus
Parse the target side
Extract translation rules
• Minimal rules: cannot be decomposed
• Composed rules: composed of minimal rules
Estimate probabilities
Rule extraction
Minimal rule
Composed rules
Format is Expressive
• Multilevel Re-Ordering: S(x0:NP VP(x1:VB x2:NP2)) ↔ x1, x0, x2
• Non-constituent Phrases: S(PRO/there VP(VB/are x0:NP)) ↔ hay, x0
• Lexicalized Re-Ordering: NP(x0:NP PP(P/of x1:NP)) ↔ x1, , x0
• Phrasal Translation: VP(VBZ/is VBG/singing) ↔ está, cantando
• Non-contiguous Phrases: VP(VB/put x0:NP PRT/on) ↔ poner, x0
• Context-Sensitive Word Insertion: NPB(DT/the x0:NNS) ↔ x0
[Knight & Graehl, 2005]
decoder
probabilistic CYK-style parsing algorithm with beams
results in an English syntax tree corresponding to the Chinese sentence
guarantees the output to have some kind of globally coherent syntactic structure
Decoding example
Marcu et al. 2006
SPMT
• Integrating non-syntactifiable phrases
• Multiple features for each rule
• Decoding with multiple models
SSMT based on phrase structures
Two categories:
• Tree-string– String-to-tree– Tree-to-string
• Tree-tree
Tree-to-string
Liu et al. 2006
• Tree-to-string alignment template model
TAT
• NP(NR/布什 NN/总统) ↔ President Bush
• LCP(NP(NR/美国 CC/和 NR/…) LC/间) ↔ between United States and …
• NP(DNP(NP(…) DEG(…)) NP(…)) ↔ …
TAT: extraction
Constraints
• The source tree has to be a subtree
• Has to be consistent with the word alignment
Restrictions on extraction
• both the first and last symbols in the target string must be aligned to some source symbols
• The height of T(z) is limited to no greater than h
• The number of direct descendants of a node of T(z) is limited to no greater than c
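The extraction restrictions can be phrased as admissibility filters on a candidate template; the tree encoding, the alignment representation, and the limits h=3, c=4 below are illustrative assumptions.

```python
def height(tree):
    """Height of a (label, children) tree; leaves are plain strings."""
    subtrees = [c for c in tree[1] if isinstance(c, tuple)]
    return 1 + (max(map(height, subtrees)) if subtrees else 0)

def max_branching(tree):
    """Largest number of direct descendants of any node."""
    subtrees = [c for c in tree[1] if isinstance(c, tuple)]
    return max([len(tree[1])] + [max_branching(t) for t in subtrees])

def admissible(tree, target, aligned, h=3, c=4):
    """aligned: set of target positions aligned to some source symbol."""
    ends_aligned = 0 in aligned and len(target) - 1 in aligned
    return ends_aligned and height(tree) <= h and max_branching(tree) <= c

tat = ("NP", [("NR", ["布什"]), ("NN", ["总统"])])
print(admissible(tat, ["President", "Bush"], {0, 1}))         # True
print(admissible(tat, ["the", "President", "Bush"], {1, 2}))  # False: first target word unaligned
```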
TAT: Model
Decoding
Tree-to-string vs. string-to-tree
Tree-to-string
• Integrates source structures into translation and reordering
• The output is not guaranteed to be grammatical
String-to-tree
• Guarantees the output to have some kind of globally coherent syntactic structure
• Cannot use any knowledge from source structures
SSMT based on phrase structures
Two categories:
• Tree-string– String-to-tree– Tree-to-string
• Tree-tree
Tree-Tree
Synchronous tree-adjoining grammar (STAG)
Synchronous tree substitution grammar (STSG)
STAG
STAG: derivation
STSG
[Figure: an STSG derivation pairing French "beaucoup d'enfants donnent un baiser à Sam" with English "kids kiss Sam quite often": enfants ("kids") ↔ kids, Sam ↔ Sam, donnent ("give") un ("a") baiser ("kiss") à ("to") ↔ kiss, beaucoup ("lots") d' ("of") ↔ null, and null ↔ quite / often adverbs.]
STSG: elementary trees
[Figure: the elementary tree pairs of the same derivation, each rooted in a nonterminal (NP, Adv, Start) shared across the two languages.]
Dependency structures
外商 投资 企业 成为 中国 外贸 重要 增长点
("Foreign-invested enterprises have become an important growth point of China's foreign trade")
[Figure: (a) the phrase structure of the sentence, with POS tags NN NN NN VV NR NN JJ NN and constituents NP, ADJP, NP, VP, IP; (b) its dependency structure, headed by 成为 ("become").]
For MT: dependency structures vs. phrase structures
Advantages of dependency structures over phrase structures for machine translation
• Inherent lexicalization
• Meaning-relative
• Better representation of divergences across languages
SSMT based on dependency structures
Lin 2004
• A Path-based Transfer Model for Machine Translation
Quirk et al. 2005
• Dependency Treelet Translation: Syntactically Informed Phrasal SMT
Ding et al. 2005
• Machine Translation Using Probabilistic Synchronous Dependency Insertion Grammars
Lin 2004
Translation model trained by learning transfer rules from bilingual corpus where the source language sentences are parsed.
decoding: finding the minimum path covering of the source language dependency tree
Lin 2004: path
Lin 2004: transfer rule
Quirk et al. 2005
Translation model trained by learning treelet pairs from bilingual corpus where the source language sentences are parsed.
Decoding: CKY-style
Treelet pairs
Quirk 2005: decoding
Ding 2005
Summary
[Figure: the MT pyramid between source language and target language: word → string (phrase or chunk) → tree (formal, phrase, or dependency structure) → semantics → interlingua.]
State-of-the-art machine translation systems are based on statistical models rooted in the theory of formal grammars/automata.
Translation models based on finite-state devices cannot easily model translations between languages with strong differences in word ordering.
Recently, several models based on context-free grammars have been investigated, borrowing from the theory of compilers the idea of synchronous rewriting.
Introduction
Slides from G. Satta
Translation models based on synchronous rewriting:
Inversion Transduction Grammars (Wu, 1997)
Head Transducer Grammars (Alshawi et al., 2000)
Tree-to-string models (Yamada & Knight, 2001; Galley et al, 2004)
“Loosely tree-based” model (Gildea, 2003)
Multi-Text Grammars (Melamed, 2003)
Hierarchical phrase-based model (Chiang, 2005)
We use synchronous CFGs to study formal properties of all these
Introduction
Synchronous CFG
A synchronous context-free grammar (SCFG) is based on three components:
Context free grammar (CFG) for source language
CFG for target language
Pairing relation on the productions of the two grammars and on the nonterminals in their right-hand sides
Synchronous CFG
Example (Yamada & Knight, 2001):
VB → ⟨PRP(1) VB1(2) VB2(3), PRP(1) VB2(3) VB1(2)⟩
VB2 → ⟨VB(1) TO(2), TO(2) VB(1) ga⟩
TO → ⟨TO(1) NN(2), NN(2) TO(1)⟩
PRP → ⟨he, kare ha⟩
VB1 → ⟨adores, daisuki desu⟩
VB → ⟨listening, kiku no⟩
TO → ⟨to, wo⟩
NN → ⟨music, ongaku⟩
Synchronous CFG
Example (cont’d):
[Figure: the paired derivation trees these productions generate for "he adores listening to music" ↔ "kare ha ongaku wo kiku no ga daisuki desu".]
Synchronous CFG
A pair of CFG productions in a SCFG is called a synchronous production
A SCFG generates pairs of trees/strings, where each component is a translation of the other
A SCFG can be extended with probabilities:
Each pair of productions is assigned a probability
Probability of a pair of trees is the product of probabilities of synchronous productions involved
Membership
The membership problem (Wu, 1997) for SCFGs is defined as follows:
Input: SCFG and pair of strings [w1, w2 ]
Output: Yes/No depending on whether w1 translates into w2 under the SCFG
Applications in segmentation, word alignment and bracketing of parallel corpora
Assumption that SCFG is part of the input is made here to investigate the dependency of problem complexity on grammar size
Membership
Result: Membership problem for SCFGs is NP-complete
Proof uses SCFG derivations to explore space of consistent truth assignments that satisfy source 3SAT instance
Remarks:
Result transfers to (Yamada & Knight, 2001), (Gildea, 2003), (Melamed, 2003), which are at least as powerful as SCFG
Membership
Remarks (cont’d):
Problem can be solved in polynomial time if:
• input grammar is fixed or production length is bounded (Melamed, 2004)
• Inversion Transduction Grammars (Wu, 1997)
• Head Transducer Grammars (Alshawi et al., 2000)
For NLP applications, it is more realistic to assume a fixed grammar and varying input string
Chart parsing
Providing an exponential time lower bound for the membership problem would amount to showing P ≠ NP
But we can show such a lower bound if we make some assumptions on the class of algorithms and data structures that we use to solve the problem
Result: If chart parsing techniques are used to solve the membership problem for SCFG, a number of partial analyses is obtained that grows exponentially with the production length of the input grammar
Chart parsing
Chart parsing for CFGs works by combining completed constituents with partial analyses:
A → B1 B2 B3 … Bn
Three indices are used to process each combination, for a total number of O(n³) possible combinations that must be checked, n the length of the input string.
[Figure: completed constituents B1 … B4 combined left to right under A.]
Chart parsing
Consider the synchronous production:
[ A → B(1) B(2) B(3) B(4) , A → B(3) B(1) B(4) B(2) ]
representing the permutation:
B(1) B(2) B(3) B(4)
B(3) B(1) B(4) B(2)
Chart parsing
When applying chart parsing, there is no way to keep partial analyses “contiguous”:
[Figure: a partial analysis covering B(1) B(2) on one side is discontinuous on the other side, where B(3) and B(4) interleave.]
Chart parsing
The proof of our result generalizes the previous observations
We show that, for some worst case permutations of length q, any combination strategy we choose leads to a number of indices growing with order at least sqrt(q)
Then for SCFGs of size q, sqrt(q) is an asymptotic lower bound for the membership problem when chart parsing algorithms are used
Translation
A probabilistic SCFG provides the probability that tree t1 translates into tree t2:
Pr( [t1 , t2] )
Accordingly, we can define the probability that string w1 translates into string w2:
Pr( [w1 , w2] ) = Σ_{t1 ⇒ w1, t2 ⇒ w2} Pr( [t1 , t2] )
and the probability that string w translates into tree t:
Pr( [w , t ] ) = Σ_{t1 ⇒ w} Pr( [t1 , t ] )
Translation
The string-to-tree translation problem for probabilistic SCFGs is defined as follows:
Input: Probabilistic SCFG and string w
Output: tree t such that Pr([w, t ]) is maximized
Application in machine translation
Again, assumption that SCFG is part of the input is made to investigate the dependency of problem complexity on grammar size
Result: string-to-tree translation problem for probabilistic SCFGs (summing over possible source trees) is NP-hard
Proof reduces from consensus problem:
Strings generated by probabilistic finite automaton or hidden Markov model have probabilities defined as sum of probabilities of several paths
Maximizing such summation is NP-hard (Casacuberta & Higuera, 2000) (Lyngso & Pedersen, 2002)
Translation
Remarks:
Source of complexity of the problem comes from the fact that several source trees can be translated into the same target tree
Result persists if there is a constant bound on length of synchronous productions
Open: can the problem be solved in polynomial time if probabilistic SCFG is fixed?
Translation
Learning Non-Isomorphic Tree Mappings for Machine Translation
"wrongly report events to-John" ↔ "him misinform of the events"
[Figure: the two dependency trees of this free translation, annotated: 2 words become 1 (wrongly report → misinform), dependents are reordered, and 0 words become 1 (of, the).]
Slides from J. Eisner
Syntax-Based Machine Translation
Previous work assumes essentially isomorphic trees
• Wu 1995, Alshawi et al. 2000, Yamada & Knight 2000
But trees are not isomorphic!
• Discrepancies between the languages
• Free translation in the training data
[Figure: the two dependency trees for "wrongly report events to-John" ↔ "him misinform of the events".]
Two training trees, showing a free translation from French to English.
Synchronous Tree Substitution Grammar
"beaucoup d'enfants donnent un baiser à Sam" ↔ "kids kiss Sam quite often"
Two training trees, showing a free translation from French to English. A possible alignment is shown in orange; a much worse alignment is also possible. The alignment shows how the trees are generated synchronously from "little trees" ...
[Figure: the aligned dependency trees, built from elementary tree pairs such as enfants ("kids") ↔ kids, Sam ↔ Sam, donnent ("give") un ("a") baiser ("kiss") à ("to") ↔ kiss, beaucoup ("lots") d' ("of") ↔ null, and null ↔ quite / often.]
Grammar = Set of Elementary Trees
[Figure: the elementary tree pairs extracted under the alignment:
• donnent ("give") un ("a") baiser ("kiss") à ("to") with two NP slots ↔ kiss with NP, NP, and Adv slots — an idiomatic translation
• enfants ("kids") ↔ kids; Sam ↔ Sam
• beaucoup ("lots") d' ("of") over an NP ↔ NP — "beaucoup d'" deletes inside the tree, matching nothing in English
• null Adv ↔ quite, null Adv ↔ often — the adverbial subtrees match nothing in French]
Probability model similar to PCFG
Probability of generating training trees T1, T2 with alignment A:
P(T1, T2, A) = ∏ p(t1, t2, a | n)
i.e., the product of the probabilities of the "little" trees that are used.
p(t1, t2, a | n) is given by a maximum entropy model of little tree pairs, e.g. p(wrongly–report little tree ↔ misinform little tree | VP).
FEATURES
• report + wrongly → misinform? (use dictionary)
• report → misinform? (at root)
• wrongly → misinform?
• verb incorporates adverb child?
• verb incorporates child 1 of 3?
• children 2, 3 switch positions?
• common tree sizes & shapes?
• ... etc. ...
Inside Probabilities
β(report-VP ↔ misinform-VP) = ... + p(little tree pair | VP) · β(events ↔ of the events) · β(to-John ↔ him) + ...
[Figure: the inside probability of an aligned node pair sums, over the little tree pairs that can be rooted there, each pair's probability times the inside probabilities of its aligned child pairs.]
only O(n²) aligned node pairs
An MT Architecture
[Diagram: a dynamic programming engine connects the probability model p(t1, t2, a) of little trees to a trainer and a decoder.
• Trainer: scores all alignments of two big trees T1, T2; each possible (t1, t2, a) is scored by the model; inside-outside estimated counts update the parameters.
• Decoder: scores all alignments between a big tree T1 and a forest of big trees T2; the model proposes translations t2 of each little tree t1 and scores each proposed (t1, t2, a); the Viterbi alignment yields the output T2.]