Latest Developments in (S)MT
Harold Somers, University of Manchester
MT Wars II: The Empire* strikes back
* Linguistics
Overview
The story so far: EBMT, SMT, latest developments in RBMT
Is there convergence? Some attempts to classify MT (Carl's and Wu's MT model spaces)
Has the empire struck back?
The story so far: EBMT
Early history well known: Nagao (1981/3)
Early development as part of RBMT
Relationship with Translation Memories
Focus (cf. Somers 1998) on matching algorithms and on selection and storage of examples (mainly sentence-based)
TL generation (recombination) not much addressed
Somers, H. (1998) New paradigms in MT. 10th European Summer School in Logic, Language and Information, Workshop on MT, Saarbrücken; revised version in Machine Translation 14 (1999); 2nd revised version in M. Carl & A. Way (eds) (2003) Recent Advances in EBMT (Kluwer).
EBMT in a nutshell (in case you've been on Tatooine for the last 15 years)
Database of paired examples
Translation involves:
finding the best example(s) (matching)
identifying which bits do(n't) match (alignment)
replacing the non-matching bits and, if multiple examples, gluing them together (recombination)
All of the above at run-time
Input: The operation was interrupted because the file was hidden.
a. The operation was interrupted because the Ctrl-c key was pressed. : L'opération a été interrompue car la touche Ctrl-c a été enfoncée.
b. The specified method failed because the file is hidden. : La méthode spécifiée a échoué car le fichier est masqué.
EBMT in a nutshell (cont.)
Main difficulty is “boundary friction” in two senses:
The old man is dead : Le vieil homme est mort
The old woman is dead : *Le vieil femme est mort
EBMT: later developments
Example generalisation (templates)
Incorporation of linguistic resources and/or statistical measures
Structured representation of examples
Use of statistical techniques
Example generalisation
(Furuse & Iida, Kaji et al., Matsumoto et al., Carl, Cicekli & Güvenir, Brown, McTait, Way et al.)
Similar examples can be combined to give a more general example
Can be seen as a way of generating transfer rules (and lexicons)
Process may be entirely automatic, based on string matching …
… or "seeded" using linguistic information (POS tags) or resources (bilingual dictionary)
Example generalisation (cont.)
The dog ate a rabbit : inu wa usagi o tabeta
From the pair of examples
The monkey ate a peach : saru wa momo o tabeta
The man ate a peach : hito wa momo o tabeta
we obtain the word pairs (monkey : saru), (man : hito) and the template
The … ate a peach : … wa momo o tabeta
Similarly, with (dog : inu), (rabbit : usagi), the fully general template
The …x ate a …y : …x wa …y o tabeta
Example generalisation (cont.)
That’s too simple (e.g. because of boundary friction)
Need to introduce constraints on the slots, e.g. using POS tags and morphological information (which implies some other processing)
Can use clustering algorithms to infer substitution sets
Incorporation of linguistic resources
Actually, early EBMT used all sorts of linguistic resources
Briefly there was a move towards more "pure" approaches
Now we see much use of POS tags (sometimes only partial, e.g. marker words – Way et al.), morphological analysis (as just mentioned), bilingual lexicons
Target-language grammars for the recombination/generation phase
Incorporation of statistical measures
Example database preprocessed to assign weights (probabilities) to fragments and their translations (Aramaki et al.): a good way of handling "ambiguities" due to alternative translations
Clustering words into equivalence classes for example generalisation (Brown)
Using statistical tools to extract translation knowledge from parallel corpora (Yamamoto & Matsumoto)
Statistically induced grammars for translation or generation, as in …
Use of structured representations
Again, a feature of early EBMT, now reappearing
Translation grammars induced from the example set
Examples stored as tree structures (overwhelmingly: dependency structures)
Translation grammars
Carl: generates translation grammars from aligned, linguistically annotated texts
Way: Data-Oriented Translation (based on Poutsma's DOP, using both PS and LFG models)
Structured examples
Use of tree comparison algorithms to extract translation patterns from parsed corpora/treebanks (Watanabe et al.)
Translation pairings extracted from aligned parsed examples (Menezes & Richardson)
Tree-to-string approach used by Langlais & Gotti and by Liu et al. (+ statistical generation model)
Typical use of structured examples
Rule-based analysis and generation + example-based transfer
Input is parsed into a representation using a traditional or statistics-based analyser
TL representation constructed by combining translation mappings learned from the parallel corpus
TL sentence generated using a hand-written or machine-learned generation grammar
Is this still EBMT?
Note that the only example-based part is the use of mappings, which are learned, not computed at run-time
Pure EBMT (Lepage & Denoual)
In contrast (but now something of an oddity): pure analogy-based EBMT
Use of proportional analogies A : B :: C : D
Terms in the analogies are translation pairs: A→A′ : B→B′ :: C→C′ : D→D′
Pure EBMT
No explicit transfer
No extraction of symbolic knowledge
No use of templates
Analogies do not always represent any sort of linguistic reality
No training or preprocessing
Solving the proportional analogies is done at run-time
The story so far: SMT
Early history well known
IBM group inspired by improved results in speech recognition when non-linguistic approach taken
Availability of Canadian Hansards inspired purely statistical approach to MT (1988)
Immediate partial success (60%) to the dismay of MT people
Early observers (Wilks) predicted hybrid methods (“stone soup”) would evolve
Later developments: phrase-based SMT, syntax-based SMT
SMT in a nutshell (in case you've been on Kamino for the last 15 years)
From the parallel corpus two sets of statistical data are extracted:
Translation model: probabilities that a given word e in the SL gives rise to a word f in the TL
(Target) language model: most probable word order for the words predicted by the translation model
These two models are computed off-line
Given an input sentence, a "decoder" applies the two models and juggles the probabilities to get the best score; various methods have been proposed
SMT in a nutshell (cont.)
The translation model has to take into account the fact that:
for a given e there may be various different fs, depending on context (grammatical variants as well as alternatives due to polysemy or homonymy)
a given e may not necessarily correspond to a single f, or to any f at all: "fertility" (e.g. may have → aurait; implemented → mis en application)
SMT in a nutshell (cont.)
The language model has to take into account the fact that:
the TL words predicted by the translation model will not occur in the same order as the SL words: "distortion"
TL word choices can depend on neighbouring words (which may be easy to model) or, especially because of distortion, on more distant words: "long-distance dependencies", much harder to model
SMT in a nutshell (cont.)
Main difficulty: combination of fertility and distortion:
Zeitmangel erschwert das Problem. : Lack of time makes the problem more difficult.
Eine Diskussion erübrigt sich demnach. : Therefore there is no point in discussion.
Das ist der Sache nicht angemessen. : That is not appropriate for this matter.
Den Vorschlag lehnt die Kommission ab. : The Commission rejects the proposal.
SMT: later developments
Phrase-based SMT: extend models beyond individual words to word sequences (phrases)
Three approaches: direct phrase alignment; word-alignment-induced phrase model; alignment templates
Results better than word-based models, and show improvement proportional (log-linear) to corpus size
Phrases do not correspond to constituents, and limiting them to constituents hurts results
Direct phrase alignment (Wang & Waibel 1998, Och et al. 1999, Marcu & Wong 2002)
Enhance word translation model by adding joint probabilities, i.e. probabilities for phrases
Phrase probabilities compensate for missing lexical probabilities
Easy to integrate probabilities from different sources/methods, allows for mutual compensation
Word-alignment-induced model (Koehn et al. 2003; example stolen from Knight & Koehn, http://www.iccs.inf.ed.ac.uk/~pkoehn/publications/tutorial2003.pdf)
Maria did not slap the green witch
Maria no daba una bofetada a la bruja verde
Start with all phrase pairs justified by the word alignment
First, the minimal phrase pairs:
(Maria, Maria), (no, did not), (daba una bofetada, slap), (a la, the), (verde, green), (bruja, witch)
Then longer pairs:
(Maria, Maria), (no, did not), (daba una bofetada, slap), (a la, the), (verde, green), (bruja, witch), (Maria no, Maria did not), (no daba una bofetada, did not slap), (daba una bofetada a la, slap the), (bruja verde, green witch)
etc.
Finally, all phrase pairs justified by the word alignment:
(Maria, Maria), (no, did not), (daba una bofetada, slap), (a la, the), (bruja, witch), (verde, green), (Maria no, Maria did not), (no daba una bofetada, did not slap), (daba una bofetada a la, slap the), (bruja verde, green witch), (Maria no daba una bofetada, Maria did not slap), (no daba una bofetada a la, did not slap the), (a la bruja verde, the green witch), (Maria no daba una bofetada a la, Maria did not slap the), (daba una bofetada a la bruja verde, slap the green witch), (no daba una bofetada a la bruja verde, did not slap the green witch), (Maria no daba una bofetada a la bruja verde, Maria did not slap the green witch)
Word-alignment-induced model
Given the phrase pairs collected, estimate the phrase translation probability distribution by relative frequency (without smoothing)
Alignment templates (Och et al. 1999; further developed by Marcu & Wong 2002, Koehn & Knight 2003, Koehn et al. 2003)
Problem of sparse data is worse for phrases
So use word classes instead of words, alignment templates instead of phrases:
more reliable statistics for the translation table
smaller translation table
more complex decoding
Word classes are induced (by distributional statistics), so may not correspond to intuitive (linguistic) classes
Takes context into account
Problems with phrase-based models
Still do not handle very well:
dependencies (especially long-distance)
distortion
discontinuities (e.g. bought = habe ... gekauft)
More promising seems to be ...
Syntax-based SMT
Better able to handle: constituents; function words; grammatical context (e.g. case marking)
Four approaches: Inversion Transduction Grammars; hierarchical transduction model; tree-to-string translation; tree-to-tree translation
Inversion transduction grammars
Wu and colleagues (1997 onwards)
Grammar generates two trees in parallel, and mappings between them
Rules can specify order changes
Restriction to binary rules limits complexity
Inversion transduction grammars
Grammar is trained on a word-aligned bilingual corpus: note that all the rules are learned automatically
Translation uses a decoder which effectively works like traditional RBMT:
parser uses the source side of the transduction rules to build a parse tree
transduction rules are applied to transform the tree
the target text is generated by linearizing the tree
Almost all possible mappings can be handled
Missing ones (crossing constraints) are not found in Wu's corpus
But examples can be found, apparently
Hierarchical transduction model
(Alshawi et al. 1998)
Based on finite-state transducers; also uses binary notation
Uses automatically induced dependency structure
Initial head-word pair is chosen
Sentence is then expanded by translating the dependent structures
Tree-to-string translation
(Yamada & Knight 2001, Charniak 2003)
Uses a (statistical) parser on the input side only
Tree is then subject to reordering and insertion according to models learned from data
Lexical translation is then done, again according to probability models
[Figure: Yamada & Knight channel operations applied to a parse tree — reorder, insert, translate, linearize — yielding kare ha ongaku wo kiku no ga daisuki desu]
Tree-to-tree translation
(Gildea 2003) Uses a parser on both sides to capture structural differences; subtree cloning
(Habash 2002, Čmejrek et al. 2003) Full morphological/syntactic/semantic parsing
All based on stochastic grammars
Latest developments in RBMT
RBMT making a come-back (e.g. METIS)
Perhaps it was always there, just wasn't represented in CL journals/conferences
There is some activity, but around the periphery: open-source systems, development for low-density languages
Much use made of corpus-derived modules, e.g. tagging, chunking
SMT is now RBMT, only the rules are learned rather than written by linguists
Overview
The story so far: EBMT, SMT, latest developments in RBMT
Is there convergence? Some attempts to classify MT (Carl's and Wu's MT model spaces)
Has the empire struck back?
Classifications of MT
Empirical vs. rationalist:
data- vs. theory-driven
use (or not) of symbolic representation
From MLIM chapter 4:
high vs. low coverage
low vs. high quality/fluency
shallow vs. deep representation
Distinguish in the above: design vs. consequence
How true are they anyway?
EBMT~SMT: Is there convergence?
Lively debate on mtlist
Articles by Somers and by Turcato & Popowich in Carl & Way (2003), and by Hutchins, Carl and Wu (2006) in a special issue of Machine Translation
Slides marked ! need your input!!
Essential features of EBMT
Use of bilingual corpus data as the main (only?) source of knowledge (Somers): most early EBMT systems were hybrids
We do not know a priori which parts of an example are relevant (Turcato & Popowich): raw data is consulted at run-time, with (little or) no preprocessing; therefore template-based EBMT is already a hybrid (with RBMT)
The act of matching the input against the examples, regardless of how they are stored (Hutchins)
Pros (and cons) of the analogy model (Wu, 2006)
Like CBR:
library of cases used during task performance
analogous examples broken down, adapted, recombined
In contrast with other machine learning methods (offline learning to compile an abstract performance model):
no loss of coverage due to incorrect generalization during training
guaranteed correct when the input is exactly like an example in the training set (not true of SMT)
But: lack of generalization leads to potential run-time inefficiency
EBMT~SMT: common features
Easily agreed:
Use of bilingual corpus data as the main (only?) source of knowledge
Translation relations are derived automatically from the data
Underlying methods are independent of language-pair, and hence of language similarity
More contentious:
Bilingual corpus data should be real (a practical issue for SMT, but some EBMT systems use "hand-crafted" examples)
System can be easily extended just by adding more data
EBMT~RBMT common features
Hybrid is easy to conceive:
rule-based analysis/generation with example-based transfer
example-based processing only for awkward cases
!
SMT~RBMT common features
Some versions of SMT exactly mirror classic RBMT parse-transfer-generate
Same things are hard: long-distance dependencies, discontinuous constituents
!
Wu's 3D classification of all MT
Example-based vs. schema-based: whether abstraction or generalization is performed at run-time
Compositional vs. lexical: relates primarily to transfer (or equivalent)
Statistical vs. logical
Pictures also show historical development
Classic (direct and transfer) MT models
Early systems (Georgetown) lexical and compositional
Treatment of idioms, collocations, phrasal translations in classical 2G transfer systems
Modern RBMT systems starting to adopt statistical methods (according to Wu)
Where do commercial systems sit?
EBMT systems
SMT systems
Example-based SMT systems
Summary
Model space of corpus-based MT (Carl 2000)
Based on Dummett’s theory of meaning
Rich vs. austere: complexity of representations
Molecular vs. holistic: descriptions based on a finite set of predefined features vs. global distinctions
Fine-grained vs. coarse-grained: based on smaller or larger units
Rich vs. austere
Translation memories are most austere, depending only on graphemic similarity
TMs with annotated examples (e.g. Planas & Furuse) are richer
Early EBMT systems, and recent systems where examples are generalized, are rich
EBMT using light annotation (e.g. tags, markers) is moderately rich
Pure EBMT (Lepage & Denoual) is austere
Early SMT systems were austere, but the move towards syntax makes them richer
Phrase-based SMT still austere
[Diagram: systems placed along the rich–austere axis: annotated translation memories; classic EBMT (Sato, Nagao); template-based EBMT (McTait, Brown, Cicekli); phrase-based SMT; syntax-based SMT; marker-based EBMT (Way); EBMT with lightly annotated examples; translation memories; early SMT (Brown et al.); pure EBMT (Lepage); METIS]
Molecular vs. holistic
Early SMT purely holistic, as is pure EBMT
TMs molecular: distance measure based on a fixed set of symbols
Translation templates are holistic, but molecular if they depend on some sort of analysis
Phrase-based and syntax-based SMT highly molecular
[Diagram: systems placed along the molecular–holistic axis: annotated translation memories; classic EBMT (Sato, Nagao); template-based EBMT (Cicekli); phrase-based SMT; syntax-based SMT; marker-based EBMT (Way); EBMT with lightly annotated examples; translation memories; early SMT (Brown et al.); pure EBMT (Lepage); template-based EBMT (McTait, Brown); METIS analysis; METIS generation]
Coarse- vs. fine-grained
Coarse-grained translates with bigger units
TM systems work only on sentences: coarse-grained
Word-based systems are fine-grained: early SMT
Phrase-based SMT slightly more coarse-grained
Template-based EBMT fine-grained
!
[Diagram: systems placed along the coarse–fine axis: translation memories; phrase-based SMT; marker-based EBMT (Way); template-based EBMT (McTait, Brown); early SMT (Brown et al.)]
Overview
The story so far: EBMT, SMT, latest developments in RBMT
Is there convergence? Some attempts to classify MT (Carl's and Wu's MT model spaces)
Has the empire struck back?
Has the empire struck back?
Is linguistics back in MT?
Was MT ever of interest to linguists?
Is SMT like RBMT?
!
Vauquois triangle
To what extent can a given system be described in terms of the classic view of MT (G2)?
!
[Diagram: the Vauquois triangle — interlingua at the apex, transfer below it, direct translation at the base; analysis up the left side, generation down the right]
Has the empire struck back?
Is linguistics back in MT?
Was MT ever of interest to linguists?
Is SMT like RBMT?
!
As predicted by Wilks ("stone soup" talk, 1992), the way forward is hybrid
Negative experience (for me) of seeing SMT presenters rediscovering problems first described by Yngve, Vauquois ...
... without referencing the original papers!
IT’S LIFE, JIM, BUT NOT AS WE KNOW IT.
[Diagram: LINGUISTICS in relation to SMT, EBMT and RBMT]
!
Fill in the gaps
[Diagram: the model space again, with gaps to fill: annotated translation memories; classic EBMT (Sato, Nagao); template-based EBMT (Cicekli); phrase-based SMT; syntax-based SMT; marker-based EBMT (Way); EBMT with lightly annotated examples; translation memories; early SMT (Brown et al.); pure EBMT (Lepage); template-based EBMT (McTait, Brown)]