Transfer-based MT
Syntactic Transfer-based Machine Translation
• Direct and Example-based approaches
– Two ends of a spectrum
– Recombination of fragments for better coverage
• What if the matching/transfer is done at syntactic parse level
• Three Steps
– Parse: syntactic parse of the source-language sentence
• Hierarchical representation of a sentence
– Transfer: rules to transform the source parse tree into a target parse tree
• Subject-Verb-Object → Subject-Object-Verb
– Generation: regenerating the target-language sentence from the parse tree
• Morphology of the target language
• Tree-structure provides better matching and longer distance transformations than is possible in string-based EBMT.
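On a toy scale, the three steps can be run end to end; the nested-tuple tree encoding, the single SVO→SOV transfer rule, and the example sentence below are invented for illustration.

```python
# Parse trees as (label, children) tuples; leaves are plain strings.
# Toy transfer rule: rewrite Verb-Object order to Object-Verb (SVO -> SOV).

def transfer_svo_to_sov(tree):
    """Recursively reorder VP(V NP) subtrees into VP(NP V)."""
    label, children = tree
    children = [transfer_svo_to_sov(c) if isinstance(c, tuple) else c
                for c in children]
    if label == "VP" and len(children) == 2 and children[0][0] == "V":
        verb, obj = children
        children = [obj, verb]          # Verb-Object -> Object-Verb
    return (label, children)

def generate(tree):
    """Generation step (no morphology here): read the leaves off in order."""
    label, children = tree
    words = []
    for c in children:
        words.extend(generate(c) if isinstance(c, tuple) else [c])
    return words

# Step 1: parse of "I use my card" (given here by hand).
parse = ("S", [("NP", ["I"]),
               ("VP", [("V", ["use"]), ("NP", ["my", "card"])])])
# Steps 2-3: transfer, then regenerate in target (SOV) order.
print(" ".join(generate(transfer_svo_to_sov(parse))))  # I my card use
```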
Examples of SynTran-MT
[Figure: mostly parallel parse trees for Spanish "ajá quiero usar mi tarjeta de crédito" ("yeah I-want to-use my card of credit") and English "yeah (I) wanna use my credit card".]
• Mostly parallel parse structures
• Might have to insert words – pronouns, morphological particles
Example of SynTran MT -2
• Pros:
– Allows for structure transfer
– Re-orderings are typically restricted to parent-child nodes
• Cons:
– Transfer rules are needed for each language pair (N² sets of rules)
– Hard to reuse rules when one of the languages is changed
[Figure: parallel parse trees for English "I need to make a collect call" and Japanese 私は (I) コレクト (collect) コールを (call) かける (make) 必要があります (need).]
Lexical-semantic Divergences
Linguistic Divergences
• Structural differences between languages
– Categorical Divergence
• Translation of words in one language into words that have different parts of speech in another language
– To be jealous
– Tener celos (To have jealousy)
Issues
Linguistic Divergences
– Conflational Divergence
• Translation of two or more words in one language into one word in another language
– To kick
– Dar una patada (Give a kick)
Issues
Linguistic Divergences
– Structural Divergence
• Realization of verb arguments in different syntactic configurations in different languages
– To enter the house
– Entrar en la casa (Enter in the house)
Issues
Linguistic Divergences
– Head-Swapping Divergence
• Inversion of a structural-dominance relation between two semantically equivalent words
– To run in
– Entrar corriendo (Enter running)
Issues
Linguistic Divergences
– Thematic Divergence
• Realization of verb arguments that reflect different thematic-to-syntactic mapping orders
– I like grapes
– Me gustan uvas (To-me please grapes)
Divergence counts from Bonnie Dorr
32% of sentences in UN Spanish/English Corpus (5K) contain divergences:
• Categorial (X tener hambre ↔ Y have hunger): 98%
• Conflational (X dar puñaladas a Z ↔ X stab Z): 83%
• Structural (X entrar en Y ↔ X enter Y): 35%
• Head Swapping (X cruzar Y nadando ↔ X swim across Y): 8%
• Thematic (X gustar a Y ↔ Y likes X): 6%
Transfer rules
Syntax-driven statistical machine translation
Slides from Deyi Xiong, CAS, Beijing
Why syntax-based SMT
Weakness of phrase-based SMT
• Long-distance reordering: phrase-level reordering
• Discontinuous phrases
• Generalization
• …
Other methods using syntactic knowledge
• Word alignment integrating syntactic constraints
• Pre-order source sentences
• Rerank n-best output of translation models
SSMT based on formal structures
Compared with phrase-based SMT
• Translated hierarchically
• The target structures finally generated are not necessarily real linguistic structures, but
– Make long-distance reordering more feasible
– Introduce non-terminals/variables
• Discontinuous phrases: put x on, 在 x 时
• Generalization
SCFG
Formulated:
• Two CFGs and their correspondences: SCFG = (G1, G2, ~)
Or
• SCFG = (N, T, P, S)
• P: synchronous productions of the form X → ⟨γ, α, ~⟩
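A minimal sketch of that formulation: each synchronous production pairs a source RHS with a target RHS, and matching integer indices link nonterminal occurrences across the two sides. The three-production toy grammar below is invented for illustration.

```python
# lhs -> list of (source_rhs, target_rhs); (sym, i) marks linked nonterminals.
GRAMMAR = {
    "S": [([("A", 1), ("B", 2)], [("B", 2), ("A", 1)])],  # inverted order
    "A": [(["a"], ["x"])],
    "B": [(["b"], ["y"])],
}

def derive(sym):
    """Expand sym synchronously; returns (source_words, target_words)."""
    src_rhs, tgt_rhs = GRAMMAR[sym][0]   # toy: always take the first production
    linked = {}                          # link index -> derived (src, tgt) pair
    src = []
    for s in src_rhs:
        if isinstance(s, tuple):         # linked nonterminal
            linked[s[1]] = derive(s[0])
            src.extend(linked[s[1]][0])
        else:                            # terminal
            src.append(s)
    tgt = []
    for s in tgt_rhs:
        tgt.extend(linked[s[1]][1] if isinstance(s, tuple) else [s])
    return src, tgt

print(derive("S"))  # (['a', 'b'], ['y', 'x'])
```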
SCFG: an example
SCFG: derivation
ITG as reordering constraint
Two kinds of reordering
• Straight
• Inverted
Coverage
• Wu (1997): "been unable to find real examples" of cases where alignments would fail under this constraint, at least in "lightly inflected languages, such as English and Chinese."
• Wellington (2006): "we found examples", "at least 5% of the Chinese/English sentence pairs".
Weakness
• No strong mechanism determining which order is better, inverted or straight.
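One way to see what the constraint covers: an alignment permutation is ITG-reorderable exactly when it avoids the two "inside-out" patterns 2413 and 3142 (the alignments Wu could not find in practice). A brute-force checker; the example permutations are illustrative.

```python
from itertools import combinations

def is_itg(perm):
    """True iff perm avoids patterns 2413 and 3142, i.e. is ITG-reorderable."""
    for idx in combinations(range(len(perm)), 4):
        vals = [perm[i] for i in idx]
        ranks = [sorted(vals).index(v) + 1 for v in vals]  # relative order
        if ranks in ([2, 4, 1, 3], [3, 1, 4, 2]):
            return False
    return True

print(is_itg([2, 0, 3, 1]))  # False: the classic "inside-out" alignment
print(is_itg([1, 0, 3, 2]))  # True: inverted merges suffice
```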
Chiang’05: Hierarchical Phrase-based Model (HPM)
Rules: X → ⟨γ, α⟩, e.g. X → ⟨yu X1 you X2, have X2 with X1⟩
Glue rules: S → ⟨S1 X2, S1 X2⟩ and S → ⟨X1, X1⟩
Model: log-linear
Decoder: CKY
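Applying a hierarchical rule is synchronized substitution into the linked X slots. A sketch using Chiang's example rule X → ⟨yu X1 you X2, have X2 with X1⟩, with sub-translations from his running example:

```python
# One hierarchical rule: linked integers mark the X1/X2 slots on both sides.
RULE = ("X", ["yu", 1, "you", 2], ["have", 2, "with", 1])

def apply_rule(rule, subs):
    """Substitute sub-translations subs[i] = (src_words, tgt_words) into rule."""
    _, src_rhs, tgt_rhs = rule
    src = [w for s in src_rhs
           for w in (subs[s][0] if isinstance(s, int) else [s])]
    tgt = [w for s in tgt_rhs
           for w in (subs[s][1] if isinstance(s, int) else [s])]
    return src, tgt

# X1 = Beihan / North Korea, X2 = bangjiao / diplomatic relations
src, tgt = apply_rule(RULE, {1: (["Beihan"], ["North", "Korea"]),
                             2: (["bangjiao"], ["diplomatic", "relations"])})
print(" ".join(src))  # yu Beihan you bangjiao
print(" ".join(tgt))  # have diplomatic relations with North Korea
```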
Chiang’05: rule extraction
Chiang’05: rule extraction restrictions
• Initial base rules: at most 15 words on the French side
• Final rules: at most 5 symbols on the French side
• At most two non-terminals on each side, non-adjacent
• At least one pair of aligned terminals
Chiang’05: Model
• Log-linear form
Chiang’05: decoder
SSMT based on phrase structures
Using grammars with linguistic knowledge
• The grammars are based on SCFG
Two categories:
• Tree-string– Tree-to-string– String-to-tree
• Tree-tree
Yamada & Knight 2001, 2003
Yamada’s work vs. SCFG
Insertion operation:
• A → ⟨w A1, A1⟩
Reordering operation:
• A → ⟨A1 A2 A3, A1 A3 A2⟩
Translation operation:
• A → ⟨x, y⟩
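The three operations compose into a noisy-channel sketch that turns an English parse into Japanese order; the lexicon, reorder table, and particle-insertion table below are hand-filled toy stand-ins for the model's learned t-, r-, and n-tables.

```python
# Toy sketch of Yamada & Knight's channel operations on an English parse.
LEX = {"he": "kare", "adores": "daisuki desu", "listening": "kiku no",
       "to": "wo", "music": "ongaku"}
REORDER = {("PRP", "VB1", "VB2"): (0, 2, 1),   # SVO -> SOV at the top node
           ("VB", "TO"): (1, 0),
           ("TO", "NN"): (1, 0)}
INSERT_AFTER = {"PRP": "ha", "VB": "ga"}        # particle insertion

def channel(tree):
    label, children = tree
    if all(isinstance(c, str) for c in children):      # leaf: translate
        return [LEX.get(w, w) for w in children]
    order = REORDER.get(tuple(c[0] for c in children),
                        range(len(children)))          # reorder children
    out = []
    for i in order:
        out.extend(channel(children[i]))
        if children[i][0] in INSERT_AFTER:             # insert particle
            out.append(INSERT_AFTER[children[i][0]])
    return out

tree = ("VB", [("PRP", ["he"]), ("VB1", ["adores"]),
               ("VB2", [("VB", ["listening"]),
                        ("TO", [("TO", ["to"]), ("NN", ["music"])])])])
print(" ".join(channel(tree)))  # kare ha ongaku wo kiku no ga daisuki desu
```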
Yamada: weakness
Single-level mapping
• Multi-level reordering– Yamada: flatten
Word-based
• Yamada: phrasal leaf
Galley et al. 2004, 2006
The translation model incorporates syntactic structure on the target-language side
• trained by learning "translation rules" from bilingual data
The decoder uses a parser-like method to create syntactic trees as output hypotheses
Translation rules
• Target: multi-level subtrees
• Source: continuous or discontinuous phrases
Types of translation rules
• Translating source phrases into target chunks– NPB(PRP/I) ↔我– NP-C(NPB(DT/this NN/address)) ↔这个 地址
Types of translation rules
Have variables
• NP-C(NPB(PRP$/my x0:NN)) ↔ 我 的 x0
• PP(TO/to NP-C(NPB(x0:NNS NNP/park))) ↔ 去 x0 公园
Combine previously translated results together
• VP(x0:VBZ x1:NP-C) ↔ x1 x0
– takes a noun phrase followed by a verb, switches their order, then combines them into a new verb phrase
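A sketch of matching such a rule's English-side tree pattern and emitting the captured variables' translations in x1 x0 order; the tree encoding and the Chinese glosses for "likes"/"music" are illustrative assumptions.

```python
def match(pattern, tree, binding=None):
    """Bind "xN:LABEL" variables in pattern to subtrees of tree, or None."""
    binding = {} if binding is None else binding
    if isinstance(pattern, str):                 # variable node, e.g. "x0:VBZ"
        var, label = pattern.split(":")
        if tree[0] != label:
            return None
        binding[var] = tree
        return binding
    if pattern[0] != tree[0] or len(pattern[1]) != len(tree[1]):
        return None
    for p, t in zip(pattern[1], tree[1]):
        if match(p, t, binding) is None:
            return None
    return binding

# The combination rule VP(x0:VBZ x1:NP-C) <-> x1 x0
pattern = ("VP", ["x0:VBZ", "x1:NP-C"])
tree = ("VP", [("VBZ", ["likes"]),
               ("NP-C", [("NPB", [("NN", ["music"])])])])
b = match(pattern, tree)
# Suppose the captured subtrees were already translated lower in the chart:
translated = {"x0": ["喜欢"], "x1": ["音乐"]}
print(" ".join(translated["x1"] + translated["x0"]))  # 音乐 喜欢 (x1 x0 order)
```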
Rule extraction
Word-align a parallel corpus
Parse the target side
Extract translation rules
• Minimal rules: cannot be decomposed
• Composed rules: composed of minimal rules
Estimate probabilities
Rule extraction
Minimal rule
Composed rules
Format is Expressive
• Multilevel Re-Ordering: S(x0:NP VP(x1:VB x2:NP2)) ↔ x1, x0, x2
• Non-constituent Phrases: S(PRO/there VP(VB/are x0:NP)) ↔ hay, x0
• Lexicalized Re-Ordering: NP(x0:NP PP(P/of x1:NP)) ↔ x1, , x0
• Phrasal Translation: VP(VBZ/is VBG/singing) ↔ está, cantando
• Non-contiguous Phrases: VP(VB/put x0:NP PRT/on) ↔ poner, x0
• Context-Sensitive Word Insertion: NPB(DT/the x0:NNS) ↔ x0
[Knight & Graehl, 2005]
decoder
probabilistic CYK-style parsing algorithm with beams
results in an English syntax tree corresponding to the Chinese sentence
guarantees the output to have some kind of globally coherent syntactic structure
Decoding example
Marcu et al. 2006
SPMT
• Integrating non-syntactifiable phrases
• Multiple features for each rule
• Decoding with multiple models
SSMT based on phrase structures
Two categories:
• Tree-string– String-to-tree– Tree-to-string
• Tree-tree
Tree-to-string
Liu et al. 2006
• Tree-to-string alignment template model
TAT
• NP(NR/布什 NN/总统) ↔ President Bush
• LCP(NP(NR/美国 CC/和 NR/…) LC/间) ↔ between United States and …
• NP(DNP(NP(…) DEG(…)) NP(…)) ↔ …
TAT: extraction
Constraints
• The source tree has to be a subtree
• Has to be consistent with the word alignment
Restrictions on extraction
• both the first and last symbols in the target string must be aligned to some source symbols
• The height of T(z) is limited to no greater than h
• The number of direct descendants of a node of T(z) is limited to no greater than c
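The extraction restrictions can be phrased as admissibility filters on a candidate template; the tree encoding, the alignment representation, and the limits h=3, c=4 below are illustrative assumptions.

```python
def height(tree):
    """Height of a (label, children) tree; leaves are plain strings."""
    subtrees = [c for c in tree[1] if isinstance(c, tuple)]
    return 1 + (max(map(height, subtrees)) if subtrees else 0)

def max_branching(tree):
    """Largest number of direct descendants of any node."""
    subtrees = [c for c in tree[1] if isinstance(c, tuple)]
    return max([len(tree[1])] + [max_branching(t) for t in subtrees])

def admissible(tree, target, aligned, h=3, c=4):
    """aligned: set of target positions aligned to some source symbol."""
    ends_aligned = 0 in aligned and len(target) - 1 in aligned
    return ends_aligned and height(tree) <= h and max_branching(tree) <= c

tat = ("NP", [("NR", ["布什"]), ("NN", ["总统"])])
print(admissible(tat, ["President", "Bush"], {0, 1}))         # True
print(admissible(tat, ["the", "President", "Bush"], {1, 2}))  # False: first target word unaligned
```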
TAT: Model
Decoding
Tree-to-string vs. string-to-tree
Tree-to-string
• Integrates source structures into translation and reordering
• The output is not guaranteed to be grammatical
String-to-tree
• Guarantees the output to have some kind of globally coherent syntactic structure
• Cannot use any knowledge from source structures
SSMT based on phrase structures
Two categories:
• Tree-string– String-to-tree– Tree-to-string
• Tree-tree
Tree-Tree
Synchronous tree-adjoining grammar (STAG)
Synchronous tree substitution grammar (STSG)
STAG
STAG: derivation
STSG
[Figure: an STSG derivation pairing French "beaucoup d'enfants donnent un baiser à Sam" with English "kids kiss Sam quite often": enfants ("kids") ↔ kids, Sam ↔ Sam, donnent ("give") un ("a") baiser ("kiss") à ("to") ↔ kiss, beaucoup ("lots") d' ("of") ↔ null, and null ↔ quite / often adverbs.]
STSG: elementary trees
[Figure: the elementary tree pairs of the same derivation, each rooted in a nonterminal (NP, Adv, Start) shared across the two languages.]
Dependency structures
外商 投资 企业 成为 中国 外贸 重要 增长点
("Foreign-invested enterprises have become an important growth point of China's foreign trade")
[Figure: (a) the phrase structure of the sentence, with POS tags NN NN NN VV NR NN JJ NN and constituents NP, ADJP, NP, VP, IP; (b) its dependency structure, headed by 成为 ("become").]
For MT: dependency structures vs. phrase structures
Advantages of dependency structures over phrase structures for machine translation
• Inherent lexicalization
• Meaning-relative
• Better representation of divergences across languages
SSMT based on dependency structures
Lin 2004
• A Path-based Transfer Model for Machine Translation
Quirk et al. 2005
• Dependency Treelet Translation: Syntactically Informed Phrasal SMT
Ding et al. 2005
• Machine Translation Using Probabilistic Synchronous Dependency Insertion Grammars
Lin 2004
Translation model trained by learning transfer rules from bilingual corpus where the source language sentences are parsed.
decoding: finding the minimum path covering of the source language dependency tree
Lin 2004: path
Lin 2004: transfer rule
Quirk et al. 2005
Translation model trained by learning treelet pairs from bilingual corpus where the source language sentences are parsed.
Decoding: CKY-style
Treelet pairs
Quirk 2005: decoding
Ding 2005
Summary
[Figure: the MT pyramid between source language and target language: word → string (phrase or chunk) → tree (formal, phrase, or dependency structure) → semantics → interlingua.]
State-of-the-art machine translation systems are based on statistical models rooted in the theory of formal grammars/automata.
Translation models based on finite-state devices cannot easily model translations between languages with strong differences in word ordering.
Recently, several models based on context-free grammars have been investigated, borrowing from the theory of compilers the idea of synchronous rewriting.
Introduction
Slides from G. Satta
Translation models based on synchronous rewriting:
Inversion Transduction Grammars (Wu, 1997)
Head Transducer Grammars (Alshawi et al., 2000)
Tree-to-string models (Yamada & Knight, 2001; Galley et al, 2004)
“Loosely tree-based” model (Gildea, 2003)
Multi-Text Grammars (Melamed, 2003)
Hierarchical phrase-based model (Chiang, 2005)
We use synchronous CFGs to study formal properties of all these
Introduction
Synchronous CFG
A synchronous context-free grammar (SCFG) is based on three components:
Context free grammar (CFG) for source language
CFG for target language
Pairing relation on the productions of the two grammars and on the nonterminals in their right-hand sides
Synchronous CFG
Example (Yamada & Knight, 2001):
VB → ⟨PRP(1) VB1(2) VB2(3), PRP(1) VB2(3) VB1(2)⟩
VB2 → ⟨VB(1) TO(2), TO(2) VB(1) ga⟩
TO → ⟨TO(1) NN(2), NN(2) TO(1)⟩
PRP → ⟨he, kare ha⟩
VB1 → ⟨adores, daisuki desu⟩
VB → ⟨listening, kiku no⟩
TO → ⟨to, wo⟩
NN → ⟨music, ongaku⟩
Synchronous CFG
Example (cont’d):
[Figure: the paired derivation trees these productions generate for "he adores listening to music" ↔ "kare ha ongaku wo kiku no ga daisuki desu".]
Synchronous CFG
A pair of CFG productions in a SCFG is called a synchronous production
A SCFG generates pairs of trees/strings, where each component is a translation of the other
A SCFG can be extended with probabilities:
Each pair of productions is assigned a probability
Probability of a pair of trees is the product of probabilities of synchronous productions involved
Membership
The membership problem (Wu, 1997) for SCFGs is defined as follows:
Input: SCFG and pair of strings [w1, w2 ]
Output: Yes/No depending on whether w1 translates into w2 under the SCFG
Applications in segmentation, word alignment and bracketing of parallel corpora
Assumption that SCFG is part of the input is made here to investigate the dependency of problem complexity on grammar size
Membership
Result: Membership problem for SCFGs is NP-complete
Proof uses SCFG derivations to explore space of consistent truth assignments that satisfy source 3SAT instance
Remarks:
Result transfers to (Yamada & Knight, 2001), (Gildea, 2003), (Melamed, 2003), which are at least as powerful as SCFG
Membership
Remarks (cont’d):
Problem can be solved in polynomial time if:
• input grammar is fixed or production length is bounded (Melamed, 2004)
• Inversion Transduction Grammars (Wu, 1997)
• Head Transducer Grammars (Alshawi et al., 2000)
For NLP applications, it is more realistic to assume a fixed grammar and varying input string
Chart parsing
Providing an exponential time lower bound for the membership problem would amount to showing P ≠ NP
But we can show such a lower bound if we make some assumptions on the class of algorithms and data structures that we use to solve the problem
Result: If chart parsing techniques are used to solve the membership problem for SCFG, a number of partial analyses is obtained that grows exponentially with the production length of the input grammar
Chart parsing
Chart parsing for CFGs works by combining completed constituents with partial analyses:
A → B1 B2 B3 … Bn
Three indices are used to process each combination, for a total number of O(n³) possible combinations that must be checked, n the length of the input string.
[Figure: completed constituents B1 … B4 combined left to right under A.]
Chart parsing
Consider the synchronous production:
[ A → B(1) B(2) B(3) B(4) , A → B(3) B(1) B(4) B(2) ]
representing the permutation:
B(1) B(2) B(3) B(4)
B(3) B(1) B(4) B(2)
Chart parsing
When applying chart parsing, there is no way to keep partial analyses “contiguous”:
[Figure: a partial analysis covering B(1) B(2) on one side is discontinuous on the other side, where B(3) and B(4) interleave.]
Chart parsing
The proof of our result generalizes the previous observations
We show that, for some worst case permutations of length q, any combination strategy we choose leads to a number of indices growing with order at least sqrt(q)
Then for SCFGs of size q, sqrt(q) is an asymptotic lower bound for the membership problem when chart parsing algorithms are used
Translation
A probabilistic SCFG provides the probability that tree t1 translates into tree t2:
Pr( [t1 , t2] )
Accordingly, we can define the probability that string w1 translates into string w2:
Pr( [w1 , w2] ) = Σ_{t1 ⇒ w1, t2 ⇒ w2} Pr( [t1 , t2] )
and the probability that string w translates into tree t:
Pr( [w , t ] ) = Σ_{t1 ⇒ w} Pr( [t1 , t ] )
Translation
The string-to-tree translation problem for probabilistic SCFGs is defined as follows:
Input: Probabilistic SCFG and string w
Output: tree t such that Pr([w, t ]) is maximized
Application in machine translation
Again, assumption that SCFG is part of the input is made to investigate the dependency of problem complexity on grammar size
Result: string-to-tree translation problem for probabilistic SCFGs (summing over possible source trees) is NP-hard
Proof reduces from consensus problem:
Strings generated by probabilistic finite automaton or hidden Markov model have probabilities defined as sum of probabilities of several paths
Maximizing such summation is NP-hard (Casacuberta & Higuera, 2000) (Lyngso & Pedersen, 2002)
Translation
Remarks:
Source of complexity of the problem comes from the fact that several source trees can be translated into the same target tree
Result persists if there is a constant bound on length of synchronous productions
Open: can the problem be solved in polynomial time if probabilistic SCFG is fixed?
Translation
Learning Non-Isomorphic Tree Mappings for Machine Translation
"wrongly report events to-John" ↔ "him misinform of the events"
[Figure: the two dependency trees of this free translation, annotated: 2 words become 1 (wrongly report → misinform), dependents are reordered, and 0 words become 1 (of, the).]
Slides from J. Eisner
Syntax-Based Machine Translation
Previous work assumes essentially isomorphic trees
• Wu 1995, Alshawi et al. 2000, Yamada & Knight 2000
But trees are not isomorphic!
• Discrepancies between the languages
• Free translation in the training data
[Figure: the two dependency trees for "wrongly report events to-John" ↔ "him misinform of the events".]
Two training trees, showing a free translation from French to English.
Synchronous Tree Substitution Grammar
"beaucoup d'enfants donnent un baiser à Sam" ↔ "kids kiss Sam quite often"
Two training trees, showing a free translation from French to English. A possible alignment is shown in orange; a much worse alignment is also possible. The alignment shows how the trees are generated synchronously from "little trees" ...
[Figure: the aligned dependency trees, built from elementary tree pairs such as enfants ("kids") ↔ kids, Sam ↔ Sam, donnent ("give") un ("a") baiser ("kiss") à ("to") ↔ kiss, beaucoup ("lots") d' ("of") ↔ null, and null ↔ quite / often.]
Grammar = Set of Elementary Trees
[Figure: the elementary tree pairs extracted under the alignment:
• donnent ("give") un ("a") baiser ("kiss") à ("to") with two NP slots ↔ kiss with NP, NP, and Adv slots — an idiomatic translation
• enfants ("kids") ↔ kids; Sam ↔ Sam
• beaucoup ("lots") d' ("of") over an NP ↔ NP — "beaucoup d'" deletes inside the tree, matching nothing in English
• null Adv ↔ quite, null Adv ↔ often — the adverbial subtrees match nothing in French]
Probability model similar to PCFG
Probability of generating training trees T1, T2 with alignment A:
P(T1, T2, A) = ∏ p(t1, t2, a | n)
i.e., the product of the probabilities of the "little" trees that are used.
p(t1, t2, a | n) is given by a maximum entropy model of little tree pairs, e.g. p(wrongly–report little tree ↔ misinform little tree | VP).
FEATURES
• report + wrongly → misinform? (use dictionary)
• report → misinform? (at root)
• wrongly → misinform?
• verb incorporates adverb child?
• verb incorporates child 1 of 3?
• children 2, 3 switch positions?
• common tree sizes & shapes?
• ... etc. ...
Inside Probabilities
β(report-VP ↔ misinform-VP) = ... + p(little tree pair | VP) · β(events ↔ of the events) · β(to-John ↔ him) + ...
[Figure: the inside probability of an aligned node pair sums, over the little tree pairs that can be rooted there, each pair's probability times the inside probabilities of its aligned child pairs.]
only O(n²) aligned node pairs
An MT Architecture
[Diagram: a dynamic programming engine connects the probability model p(t1, t2, a) of little trees to a trainer and a decoder.
• Trainer: scores all alignments of two big trees T1, T2; each possible (t1, t2, a) is scored by the model; inside-outside estimated counts update the parameters.
• Decoder: scores all alignments between a big tree T1 and a forest of big trees T2; the model proposes translations t2 of each little tree t1 and scores each proposed (t1, t2, a); the Viterbi alignment yields the output T2.]