When Multiwords Go Bad in Machine Translation

WHEN MULTIWORDS GO BAD IN MACHINE TRANSLATION

Anabela Barreiro, L2F - INESC-ID, Portugal
Johanna Monti, University of Sassari, Italy
Brigitte Orliac, Logos Institute, USA
Fernando Batista, L2F - INESC-ID and ISCTE, Portugal

Upload: anabela-barreiro

Post on 05-Dec-2014


DESCRIPTION

This presentation addresses the impact of multiword translation errors in machine translation (MT). We have analysed translations of multiwords in the OpenLogos rule-based system (RBMT) and in the Google Translate statistical system (SMT) for the English-French, English-Italian, and English-Portuguese language pairs. Our study shows that, for distinct reasons, multiwords remain a problematic area for MT independently of the approach, and require adequate linguistic quality evaluation metrics founded on a systematic categorization of errors by MT expert linguists. We propose an empirically-driven taxonomy for multiwords, and highlight the need for the development of specific corpora for multiword evaluation. Finally, the paper presents the Logos approach to multiword processing, illustrating how semantico-syntactic rules contribute to multiword translation quality.

TRANSCRIPT

Page 1: When Multiwords Go Bad in Machine Translation


WHEN MULTIWORDS GO BAD IN MACHINE TRANSLATION

Anabela Barreiro, L2F - INESC-ID, Portugal

Johanna Monti, University of Sassari, Italy

Brigitte Orliac, Logos Institute, USA

Fernando Batista, L2F - INESC-ID and ISCTE, Portugal

Page 2

OUTLINE

• Introduction

• Multiwords in NLP and MT

• RBMT and SMT approaches to multiword processing

• Multiword assessment

– Corpus

– Multiword taxonomy

– Error categorization for multiword translations

– Quantitative results

– Analysis of relevant problems

• OL semantico-syntactic rules for multiword translation precision

• Main conclusions and future work

Page 3

INTRODUCTION

• MT has become popular, widespread and useful to society

• Every internet user now uses MT (with or without knowing it!)

• HOWEVER, linguistic quality is still a serious problem, as translations contain morphological, syntactic and semantic errors

• Successful MWU processing still represents one of the most significant linguistic challenges for MT systems

Page 4

MULTIWORDS IN NLP AND MT

• Crucial role in NLP

• Essential in MT

– Unfeasible to include all MWU in dictionaries

– Poor syntactic and semantic analysis reduces the performance of NLP systems

– Fragmentation of any part of an MWU leads to generation errors

– Incorrect MWU generation has a negative impact on the understandability and quality of the translated text

Page 5

CRITICAL PROBLEMS FOR MULTIWORD PROCESSING

• Lack/degree of compositionality

• Constituent dependencies

– contiguous (adjacent): no inserts

– non-contiguous (remote): inserts

• Morphosyntactic variations

Freely available RBMT and SMT systems fail at translating MWU:

– RBMT systems fail for lack of MWU coverage

– SMT systems fail for lack of linguistic (semantico-syntactic) knowledge to process them, leading to structural problems

Page 6

MT APPROACHES TO MULTIWORD PROCESSING

• Importance of correct MWU processing so that MT systems can translate them correctly

• Sag et al., 2001; Thurmair, 2004; Rayson et al., 2010; Monti, 2013

• Solutions to resolve MWU translation problems:

– Use of generative dependency grammars with features

• Diaconescu, 2004

– Grouping bilingual MWU before performing statistical alignment

• Lambert and Banchs, 2006

– Paraphrasing MWU

• Barreiro, 2010

Page 7

MULTIWORD PROCESSING IN RBMT

• Lexical approach

– MWU as single lemmata in dictionaries

– Suitable for contiguous compounds

• Compositional approach

– MWU processing by means of POS tagging and syntactic analysis of their different components

– Suitable for compounds not coded in the dictionary and for verbal constructions

Page 8

MULTIWORD PROCESSING IN SMT

• Traditional approach to word alignment

• Brown et al., 1993

– Inability to handle many-to-many correspondences

• Current state-of-the-art phrase-based SMT systems

• Koehn et al., 2003

– An MWU is translated correctly only if its constituents are marked and aligned as parts of consecutive phrases in the training set

– Phrases are defined as sequences of contiguous words (n-grams) with limited linguistic information (mostly syntactic)

• will stay - linguistically meaningful

• that he - no linguistic significance
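The contiguity limitation above can be made concrete with a toy sketch (the helper `contiguous_ngrams` is ours, not from any SMT toolkit): phrase extraction over contiguous n-grams captures an adjacent phrasal verb, but a separable one with an insert falls outside every phrase.

```python
# Toy illustration: contiguous n-gram "phrases" miss non-contiguous MWU.
# All names here are illustrative, not from an actual SMT system.

def contiguous_ngrams(tokens, max_n=3):
    """Collect all contiguous word sequences up to length max_n."""
    return {tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)}

mwu = ("make", "up")

adjacent = "they make up a story".split()
remote = "they make the whole story up".split()

print(mwu in contiguous_ngrams(adjacent))  # True: "make up" is adjacent
print(mwu in contiguous_ngrams(remote))    # False: the insert breaks the n-gram
```

The second case is exactly the "non-contiguous (remote)" dependency from the critical-problems slide: no contiguous phrase in the training data pairs "make ... up" with its translation.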

Page 9

MULTIWORD PROCESSING IN SMT

• MWU processing and translation in SMT as a problem of:

– Automatically learning and integrating translations

– Word sense disambiguation

– Word alignment

• Barreiro et al., 2013

• Integration of phrase-based models with linguistic knowledge:

– Identification of possible monolingual MWU

• Wu et al., 2008, Okita et al., 2010

– Integration of bilingual domain MWU in SMT

• Ren et al., 2009

– Incorporation of machine-readable dictionaries and glossaries, treating these resources as phrases in the phrase-based table

• Okuma et al., 2008

– Identification and grouping of MWU prior to statistical alignment

• Lambert and Banchs, 2006

Page 10

OUR WORK

• Linguistic analysis and error categorization of the MWU translations

• 2 MT systems:

– OpenLogos RBMT

– Google Translate SMT

• 3 language pairs: EN-FR, EN-IT and EN-PT

• Analysis of MWU translations by 3 MT expert linguists

• MWU taxonomy to evaluate MWU in any system, independently of the approach

• OpenLogos solution to MWU processing in MT

Page 11

METHODOLOGY

• Created a corpus of MWU

• Translated the sentences containing the MWU into FR, IT and PT using the OL and GT systems

The purpose of our work WAS NOT to compare and evaluate systems

The purpose of our work WAS to assess and measure the quality of MWU translation independently of the two systems considered

• Developed an empirically-driven taxonomy for MWU

• Analysed the MWU translation errors based on this taxonomy

– The different errors were categorized by MT expert linguists who are native speakers of the respective target languages

Page 12

CORPUS

• 150 English sentences - news and internet

• Average of ~4-5 MWU per sentence

• The corpus was divided into 3 sets of 50 sentences translated for each language pair by the 2 systems

• 3 native-speaker linguists (1 evaluator per target language) each reviewed 50 sentences and classified every MWU translation according to a binary evaluation metric:

– OK for correct translations

– ERR for incorrect translations

• Neither system was specifically trained for the task - texts were not domain-specific

Page 13

MULTIWORD TAXONOMY

VERB

– Compound Verb (COMPV): may have been done, have [already] shown

– Support Verb Construction (SVC): make a presentation, be meaningful, have [particularly good] links, give an illustration of, be the ADV cause of, fall [so far] short of, take a seat, to play a [very important] role

– Prepositional Verb (PREPV): deal with, give N to

– Phrasal Verb (PHRV): closing down, make N up, slow down to, stand up to, mix N up with

– Other Verbal Expression (VEXPR): in trying to, hold N in place

NOUN

– Compound Noun (COMPN): union spokesman, constraint-based grammar, air conditioning

– Prepositional Noun (PREPN): interest in, right side of

ADJECTIVE

– Compound Adjective (COMPADJ): cost-cutting

– Prepositional Adjective (PREPADJ): famous for, similar to

ADVERB

– Compound Adverb (COMPADV): in a fast way, most notably, last time

– Prepositional Adverb (PREPADV): in front of

DETERMINER

– Compound Determiner (COMPDET): certain of these

– Prepositional Determiner (PREPDET): most of

CONJUNCTION

– Compound Conjunction (COMPCONJ): in order to, as a result of, rather than

PREPOSITION

– Compound Preposition (COMPPREP): as part of

OTHER EXPRESSION

– Named Entity (NE): Economic Council

– Idiomatic Expression (IDIOM): get to the bottom of the situation, purr like a cat, for goodness’ sake

– Lexical Bundle (BUNDLE): I believe that, as much if not more than, if I were you
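To make the taxonomy machine-usable for annotation, it can be encoded as a simple lookup table keyed by acronym (the dict encoding and the `annotate` helper are our sketch; types, subtypes and acronyms are the slide's):

```python
# Sketch: the MWU taxonomy as a lookup table keyed by acronym.
# Labels follow the slide; the data-structure choice is illustrative.
TAXONOMY = {
    "COMPV":    ("VERB", "Compound Verb"),
    "SVC":      ("VERB", "Support Verb Construction"),
    "PREPV":    ("VERB", "Prepositional Verb"),
    "PHRV":     ("VERB", "Phrasal Verb"),
    "VEXPR":    ("VERB", "Other Verbal Expression"),
    "COMPN":    ("NOUN", "Compound Noun"),
    "PREPN":    ("NOUN", "Prepositional Noun"),
    "COMPADJ":  ("ADJECTIVE", "Compound Adjective"),
    "PREPADJ":  ("ADJECTIVE", "Prepositional Adjective"),
    "COMPADV":  ("ADVERB", "Compound Adverb"),
    "PREPADV":  ("ADVERB", "Prepositional Adverb"),
    "COMPDET":  ("DETERMINER", "Compound Determiner"),
    "PREPDET":  ("DETERMINER", "Prepositional Determiner"),
    "COMPCONJ": ("CONJUNCTION", "Compound Conjunction"),
    "COMPPREP": ("PREPOSITION", "Compound Preposition"),
    "NE":       ("OTHER EXPRESSION", "Named Entity"),
    "IDIOM":    ("OTHER EXPRESSION", "Idiomatic Expression"),
    "BUNDLE":   ("OTHER EXPRESSION", "Lexical Bundle"),
}

def annotate(mwu, acronym):
    """Attach type/subtype labels to one MWU occurrence."""
    pos, subtype = TAXONOMY[acronym]
    return {"mwu": mwu, "type": pos, "subtype": subtype, "acronym": acronym}

print(annotate("veered into", "PREPV"))
```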

Page 14

RESULTS

• The results shed some light on the demand for higher-precision MWU translation

• MWU occur frequently in our corpus, often several times within the same sentence:

"Witnesses said the speeding car may have been playing tag with another vehicle when it veered into the southbound lane occupied by Lopez' truck shortly before 8 p.m. Sunday"

– may have been playing tag with - COMPV - idiomatic PREPSVC

– veered into - PREPV

– southbound lane - COMPN

– 8 p.m. Sunday - double temporal expression (time + date)

Page 15

QUANTITATIVE RESULTS

Correct and incorrect MWU translations

System  Lang pair   OK   ERR  Total
OL      EN-FR       40    48     88
OL      EN-IT       36    83    119
OL      EN-PT       60    96    156
OL      Total      136   227    363
GT      EN-FR       70    38    108
GT      EN-IT       59    47    106
GT      EN-PT       67    47    114
GT      Total      196   132    328
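The raw counts above can be turned into accuracy rates with a few lines (a sketch; the numbers are copied verbatim from the table):

```python
# Per-pair and overall MWU translation accuracy from the counts above.
counts = {  # (system, language pair): (OK, ERR)
    ("OL", "EN-FR"): (40, 48), ("OL", "EN-IT"): (36, 83), ("OL", "EN-PT"): (60, 96),
    ("GT", "EN-FR"): (70, 38), ("GT", "EN-IT"): (59, 47), ("GT", "EN-PT"): (67, 47),
}

for (system, pair), (ok, err) in counts.items():
    print(f"{system} {pair}: {ok / (ok + err):.1%} correct")

for system in ("OL", "GT"):
    ok = sum(o for (s, _), (o, _) in counts.items() if s == system)
    tot = sum(o + e for (s, _), (o, e) in counts.items() if s == system)
    print(f"{system} overall: {ok}/{tot} = {ok / tot:.1%}")
# OL overall: 136/363 = 37.5%
# GT overall: 196/328 = 59.8%
```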

Page 16

QUANTITATIVE RESULTS

Performance for the 3 most frequent MWU

EN-FR     OL OK  OL Error  GT OK  GT Error
VERB        17      21       27      12
COMPN        8      10       13      18
NE           6       4       16       4

EN-IT     OL OK  OL Error  GT OK  GT Error
COMPN       14      39       26      21
VERB        10      12        6      15
NE           2       8       14       2

EN-PT     OL OK  OL Error  GT OK  GT Error
VERB        30      21       11      23
COMPN       28      12       18      17
NE          11      26        9       9

Page 17

MULTIWORDS “GOING BAD” IN FRENCH

General language or domain-specific COMPN - 32.5%

• hit-run driver

– pilot hit run chauffeur (incorrect, literal)

– conducteur ayant commis un délit de fuite (correct)

• nuclear fuel cycle

– cycle de combustible nucléaire (correct)

– cycle de combustion nucléaire (incorrect: “combustion” for “fuel”)

SVC - 18.6%

• is a bit misleading (adjectival)

– est un égarement de morceau (incorrect, literal)

– est quelque peu trompeur (correct)

• it has [wide] applicability (nominal)

– il a l’applicabilité large (incorrect, literal)

– il a de nombreuses possibilités d’application (correct)

Page 18

LOGOS APPROACH TO MULTIWORD TRANSLATION

Main linguistic knowledge bases of the LOGOS system:

• Dictionaries

• Semantico-syntactic rules - analysis, transfer and generation

• Semantic Table SEMTAB - language-pair specific rules

– Analysis and translation of words in their context

– Invoked after dictionary look-up and during the execution of target transfer rules to solve analysis and lexical ambiguity problems

• verb dependencies - different verb argument structures

– speak to

– speak against

– speak of

– speak on N (radio, TV, television, etc.)

• MWU of different nature

Page 19

SAL - Semantico-syntactic Abstraction Language

– Taxonomy: 3 levels organized hierarchically:

• Supersets / Sets / Subsets

– Semantico-Syntactic continuum from NL word to Word Class

• Literal word: airport

• Head morph: port

• SAL Subset: Agfunc (agentive functional location)

• SAL Set: func (functional location)

• SAL Superset: PL (place)

• Word Class: N

– SAL combines both the lexical and the compositional approaches in order to process different types of MWU

LOGOS APPROACH TO MULTIWORD TRANSLATION
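The SAL abstraction continuum for "airport" can be pictured as an ordered ladder from the literal word up to the bare word class (a sketch; the codes Agfunc, func, PL and N are the slide's, the data structure is ours):

```python
# Sketch: the SAL abstraction ladder for "airport", most concrete first.
# SAL codes are as given on the slide; the list encoding is illustrative.
SAL_LADDER = [
    ("literal word", "airport"),
    ("head morph", "port"),
    ("SAL subset", "Agfunc"),  # agentive functional location
    ("SAL set", "func"),       # functional location
    ("SAL superset", "PL"),    # place
    ("word class", "N"),
]

def generalize(level):
    """Return the representation at a given abstraction level (0 = literal)."""
    return SAL_LADDER[level][1]

for level, form in SAL_LADDER:
    print(f"{level}: {form}")
```

A rule can then match at whichever rung gives the right trade-off between precision (literal word) and coverage (word class), which is how SAL combines the lexical and compositional approaches.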

Page 20

RESOLUTION OF POLYSEMY

NL String        SEMTAB Rule            Portuguese Transfer
raise a child    V(‘raise’) N(ANdes)    criar …
raise corn       V(‘raise’) N(MAedib)   cultivar …
raise the rent   V(‘raise’) N(MEabs)    aumentar …
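The disambiguation behaviour in the table can be sketched as rules keyed on the verb plus the SAL class of its object (a toy encoding; the SAL codes are those given above, but the rule format is ours, not the actual Logos one):

```python
# Sketch of SEMTAB-style transfer: "raise" translates differently into
# Portuguese depending on the SAL class of its object noun.

SAL_CLASS = {        # SAL codings as given in the SEMTAB rules above
    "child": "ANdes",
    "corn": "MAedib",
    "rent": "MEabs",
}

SEMTAB_RULES = {     # (verb, object SAL class) -> Portuguese transfer
    ("raise", "ANdes"): "criar",
    ("raise", "MAedib"): "cultivar",
    ("raise", "MEabs"): "aumentar",
}

def transfer(verb, obj):
    """Pick the target verb via the SAL class of the object."""
    return SEMTAB_RULES[(verb, SAL_CLASS[obj])]

print(transfer("raise", "child"))  # criar
print(transfer("raise", "rent"))   # aumentar
```

Because the rule matches the SAL class rather than the literal noun, any other noun coded MEabs (e.g. a price or a fee) would also select "aumentar" without a new rule.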

Page 21

DEEP STRUCTURE RULES OF SEMTAB

A single deep-structure rule matches multiple surface structures and produces correct target transfers:

he raised the rent        ele aumentou a renda        V + Object
the raising of the rent   o aumento da renda          Gerund
the rent, raised by …     a renda, aumentada por …    Part. ADJ
a rent raise              um aumento de renda         Noun

Page 22

CONCLUSIONS

• MWU are problematic for MT systems independently of the approach

• Literal translations lead to unclear/incorrect translations or loss of meaning

• Correct identification and analysis of source-language MWU is a challenging task, but it is the starting point for higher-quality MT, which requires:

– Linguistic quality evaluation metrics

– Systematic categorization of errors by MT expert linguists

– Specific corpora for MWU evaluation

• The OpenLogos approach to MWU processing uses semantico-syntactic rules, which can contribute to MWU translation quality with reference to any language pair

Page 23

FUTURE WORK

• Research on how OpenLogos linguistic knowledge (SEMTAB) can be applied to an SMT system to correct MWU errors … and vice versa

• Successful combination of linguistic PRECISION (OL approach) and COVERAGE (GT approach) in resolving the MWU problem

– evolution in the MT field

• Successful integration of semantico-syntactic knowledge in SMT

– solution for achieving high quality MT

– Accomplishing this task requires combining expertise in MT technology with deep linguistic knowledge, and points to the reverse research avenue: integrating SMT technology/processes into RBMT to advance MT

Page 24


Thank you!

This work was supported by Fundação para a Ciência e Tecnologia (Portugal) through Anabela Barreiro’s post-doctoral grant SFRH/BPD/91446/2012 and project PEst-OE/EEI/LA0021/2013.