when multiwords go bad in machine translation
DESCRIPTION
This preseantation addresses the impact of multiword translation errors in machine translation (MT). We have analysed translations of multiwords in the OpenLogos rule-based system (RBMT) and in the Google Translate statistical system (SMT) for the English-French, English-Italian, and English-Portuguese language pairs. Our study shows that, for distinct reasons, multiwords remain a problematic area for MT independently of the approach, and require adequate linguistic quality evaluation metrics founded on a systematic categorization of errors by MT expert linguists. We propose an empirically-driven taxonomy for multiwords, and highlight the need for the development of specific corpora for multiword evaluation. Finally, the paper presents the Logos approach to multiword processing, illustrating how semantico-syntactic rules contribute to multiword translation quality.TRANSCRIPT
technology from seed
WHEN MULTIWORDS GO BAD
IN MACHINE TRANSLATION
Anabela Barreiro, L2F - INESC-ID, Portugal
Johanna Monti, University of Sassari, Italy
Brigitte Orliac, Logos Institute, USA
Fernando Batista, L2F - INESC-ID and ISCTE, Portugal
• Introduction
• Multiwords in NLP and MT
• RBMT and SMT approaches to multiword processing
• Multiword Assessment
– Corpus
– Multiword taxonomy
– Error Categorization for multiword translations
– Quantitative results
– Analysis of relevant problems
• OL semantico-syntactic rules for multiword translation precision
• Main conclusions and future work
OUTLINE
• MT has become popular, widespread and useful to society
• Every internet user is now using MT (with or without knowing it!)
• HOWEVER, linguistic quality is still a serious problem,
as translations contain morphological, syntactic and semantic errors
• Successful MWU processing still represents
one of the most significant linguistic challenges
for MT systems
INTRODUCTION
• Crucial role in NLP
• Essential in MT
– Unfeasible to include all MWU in dictionaries
– Poor syntactic and semantic analysis – reduces the performance of NLP systems
– Fragmentation of any part of a MWU leads to generation errors
– Incorrect MWU generation has a negative impact on the understandability and quality of the translated text
MULTIWORDS IN NLP AND MT
• Lack/degree of compositionality
• Constituent dependencies
• contiguous (adjacent) - no inserts
• non-contiguous (remote) – inserts
• Morphosyntactic variations
Freely available RBMT and SMT fail at translating MWU:
– RBMT systems fail for lack of MWU coverage
– SMT systems fail for lack of linguistic (semantico-syntactic) knowledge to process them, leading to structural problems
CRITICAL PROBLEMS FOR MULTIWORD PROCESSING
MT APPROACHES TO MULTIWORD PROCESSING
• Importance of a correct processing of MWU so that they can be translated correctly by MT systems
• Sag et al., 2001, Thurmair, 2004, Rayson et al., 2010, Monti, 2013
• Solutions to resolve MWU translation problems:
– Use of generative dependency grammars with features
• Diaconescu, 2004
– grouping bilingual MWU before performing statistical alignment
• Lambert and Banchs, 2006
– paraphrasing MWU
• Barreiro, 2010
MULTIWORD PROCESSING IN RBMT
• Lexical approach
– MWU as single lemmata in dictionaries
– Suitable for contiguous compounds
• Compositional approach
– MWU processing by means of POS tagging and syntactic analysis of its different components
– Suitable for compounds not coded in the dictionary and for verbal constructions
MULTIWORD PROCESSING IN SMT
• Traditional approach to word alignment
• Brown et al., 1993
– Inability to handle many-to-many correspondences
• Current state-of-the-art phrase-based SMT systems
• Koehn et al., 2003
– The correct translation of MWU occurs if the its constituents are marked and aligned as parts of consecutive phrases in the training set
– Phrases are defined as sequences of contiguous words (n-grams) with limited linguistic information (mostly syntactic)
• will stay - linguistically meaningful
• that he - no linguistic significance
MULTIWORD PROCESSING IN SMT • MWU processing and translation in SMT as a problem of:
– Automatically learning and integrating translations
– Word sense disambiguation
– Word alignment
• Barreiro et al., 2013
• Integration of phrase-based models with linguistic knowledge:
– Identification of possible monolingual MWU
• Wu et al., 2008, Okita et al., 2010
– Integration of bilingual domain MWU in SMT
• Ren et al., 2009
– Incorporation of machine-readable dictionaries and glossaries, treating these resources as phrases in the phrase-based table
• Okuma et al., 2008
– Identification and grouping of MWU prior to statistical alignment
• Lambert and Banchs, 2006
• Linguistic analysis and error categorization of the MWU translations
• 2 MT systems:
• OpenLogos RBMT
• Google Translate SMT
• 3 language pairs: EN-FR, EN-IT and EN-PT
• Analysis of MWU translations by 3 MT expert linguists
• MWU taxonomy to evaluate MWU (in any system independently of the approach)
• OpenLogos solution to MWU processing in MT
OUR WORK
• Created a Corpus of MWU
• Translated sentences with the MWU into EN, IT and PT using the OL and GT systems
The purpose of our work WAS NOT to compare and evaluate systems
The purpose of our work WAS to assess and measure the quality of MWU translation independently of the two systems considered
• Developed an empirically-driven taxonomy for MWU
• Analysed the MWU translation errors based on this taxonomy
– The different errors were categorized by MT expert linguists of the respective target languages
METHODOLOGY
CORPUS
• 150 English sentences - news and internet
• Average of ~4-5 MWU per sentence
• The corpus was divided into 3 sets of 50 sentences translated for each language pair by the 2 systems
• 3 native linguists reviewed 50 sentences each for the 3 target languages, and evaluated the MWU translations for each of these languages (1 evaluator for each language), classifying the translations according to a binary evaluation metrics:
• OK for correct translations
• ERR for incorrect translations
• None of the systems was specifically trained for the task - texts were not domain specific
MULTIWORD TAXONOMY Type Subtype Acronym Example
VERB
Compound Verb COMPV may have been done, have [already] shown
Support Verb Construction
SVC make a presentation, be meaningful, have [particularly good] links, give an illustration of, be the ADV cause of, fall [so far] short of, take a seat; to play a [very important] role
Prepositional Verb PREPV deal with, give N to
Phrasal Verb PHRV closing down, make N up, slow down to; stand up to, mix N up with
Other Verbal Expression VEXPR in trying to, hold N in place
NOUN
Compound Noun COMPN union spokesman, constraint-based grammar, air conditioning
Prepositional Noun PREPN interest in, right side of
ADJECTIVE
Compound Adjective COMPADJ cost-cutting
Prepositional Adjective PREPADJ famous for; similar to
ADVERB
Compound Adverb COMPADV in a fast way, most notably, last time
Prepositional Adverb PREPADV in front of
DETERMINER
Compound Determiner COMPDET certain of these
Prepositional Determiner PREPDET most of
CONJUNCTION Compound Conjunction COMPCONJ in order to, as a result of, rather than
PREPOSITION Compound Preposition COMPPREP as part of
OTHER
EXPRESSION
Named Entity NE Economic Council
Idiomatic Expression IDIOM get to the bottom of the situation, purr like a cat; for goodness’ sake
Lexical Bundle BUNDLE I believe that, as much if not more than, if I were you
• The results shed some light on the demand for higher precision MWU translation
• MWU occur frequently in our corpus - several times within
the same sentence:
Witnesses said the speeding car may have been playing tag with
another vehicle when it veered into the southbound lane occupied
by Lopez' truck shortly before 8 p.m. Sunday
• may have been playing tag with - COMPV - idiomatic PREPSVC
• veered into - PREPV
• southbound lane - COMPN
• 8 p.m. Sunday - double temporal expression (time + date)
RESULTS
QUANTITATIVE RESULTS
Correct and incorrect MWU translations
15
System Lang pair OK ERR Total
OL
EN-FR 40 48 88
EN-IT 36 83 119
EN-PT 60 96 156
Total 136 227 363
GT
EN-FR 70 38 108
EN-IT 59 47 106
EN-PT 67 47 114
Total 196 132 328
Performance for the 3 most frequent MWU
16
EN-FR OL GT
Type Ok Error Ok Error
VERB 17 21 27 12
COMPN 8 10 13 18
NE 6 4 16 4
EN-IT OL GT
Type Ok Error Ok Error
COMPN 14 39 26 21
VERB 10 12 6 15
NE 2 8 14 2
EN-PT OL GT
Type Ok Error Ok Error
VERB 30 21 11 23
COMPN 28 12 18 17
NE 11 26 9 9
QUANTITATIVE RESULTS
General language or domain-specific COMPN - 32,5%
• hit-run driver
pilot hit run chauffeur/conducteur ayant commis un délit de fuite
• nuclear fuel cycle
cycle de combustible nucléaire
cycle de combustion nucléaire
SVC - 18,6%
• is a bit misleading (adjectival)
est un égarement de morceau
est quelque peu trompeur
• it has [wide] applicability (nominal)
il a l’applicabilité large
il a de nombreuses possibilités d’application
MULTIWORDS “GOING BAD” IN FRENCH
LOGOS APPROACH TO MULTIWORD TRANSLATION
Main linguistic knowledge bases of the LOGOS system:
• Dictionaries
• Semantico-syntactic rules - analysis, transfer and generation
• Semantic Table SEMTAB - language-pair specific rules
– Analysis and translation of words in their context
– invoked after dictionary look-up and during the execution of target transfer rules to solve analysis and lexical ambiguity problems
• verb dependencies - different verb argument structures
– speak to
– speak against
– speak of
– speak on N (radio, TV, television, etc.)
• MWU of different nature
SAL - Semantico-syntactic Abstraction Language
– Taxonomy: 3 levels organized hierarchically:
• Supersets / Sets / Subsets
– Semantico-Syntactic continuum from NL word to Word Class
• Literal word: airport
• Head morph: port
• SAL Subset: Agfunc (agentive functional location)
• SAL Set: func (functional location)
• SAL Superset: PL (place)
• Word Class: N
– SAL combines both the lexical and the compositional approaches in
order to process different types of MWU
LOGOS APPROACH TO MULTIWORD TRANSLATION
RESOLUTION OF POLYSEMY
NL String SEMTAB Rule Portuguese Transfer
raise a child V(‘raise’) N(ANdes) criar. . .
raise corn V(‘raise’) N(MAedib) cultivar. . .
raise the rent V(‘raise’) N(MEabs) aumentar. . .
DEEP STRUCTURE RULES OF SEMTAB
A single deep-structure rule matches multiple surface-structures
and produces correct target transfers
he raised the rent ele aumentou a renda V+Object
the raising of the rent o aumento da renda Gerund
the rent, raised by … a renda, aumentada por… Part. ADJ
a rent raise um aumento de renda Noun
• MWU – problematic for MT systems independently of the approach
• Literal translations lead to unclear/incorrect translations or loss of meaning
• Correct identification and analysis of source language MWU is a challenging
task, but the starting point for higher quality MT
– Linguistic quality evaluation metrics
– Systematic categorization of errors by MT expert linguists
– Specific corpora for MWU evaluation
• OpenLogos approach to MWU processing uses semantico-syntactic rules,
which can contribute to MWU translation quality with reference to any
language pair
CONCLUSIONS
FUTURE WORK
• Research on how OpenLogos linguistic knowledge – SEMTAB - can be applied to a SMT system to correct MWU errors … and vice-versa
• Successful combination of linguistic PRECISION (OL approach) and COVERAGE (GT approach) in resolving the MWU problem
– evolution in the MT field
• Successful integration of semantico-syntactic knowledge in SMT
– solution for achieving high quality MT
– The accomplishment of this task requires a combination of expertise in MT technology and deep linguistic knowledge to address reverse research avenue: integration of SMT technology/processes in RBMT to advance MT
24
Thank you!
This work was supported by Fundação para a Ciência e Tecnologia (Portugal) through
Anabela Barreiro’s post-doctoral grant SFRH/BPD/91446/2012 and project PEst-OE/EEI/LA0021/2013.