
  • Proceedings of SSST-9

    Ninth Workshop on

    Syntax, Semantics and Structure in Statistical Translation

    Dekai Wu, Marine Carpuat, Eneko Agirre and Nora Aranberri (editors)

    NAACL HLT 2015 / SIGMT / SIGLEX Workshop
    4 June 2015

    Denver, Colorado, USA

  • This workshop is partially funded by the European Union QTLeap project (FP7-ICT-2013-10-610516).

    ©2015 The Association for Computational Linguistics

    Order print-on-demand copies from:

    Curran Associates
    57 Morehouse Lane
    Red Hook, New York 12571
    USA
    Tel: +1-845-758-0400
    Fax: [email protected]

    ISBN 978-1-941643-41-9


  • Introduction

    The Ninth Workshop on Syntax, Semantics and Structure in Statistical Translation (SSST-9) was held on 4 June 2015 following the NAACL HLT 2015 conference in Denver, Colorado. Like the first eight SSST workshops in 2007, 2008, 2009, 2010, 2011, 2012, 2013 and 2014, it aimed to bring together researchers from different communities working in the rapidly growing field of structured statistical models of natural language translation.

    This year's SSST featured an award for the best paper advancing statistical MT using lexical semantics and deep language processing. The €500 prize was sponsored by the European Union QTLeap project (http://qtleap.eu, FP7-ICT-2013.4.1-610516), which aims to research and deliver an articulated methodology for machine translation that explores deep language engineering approaches with a view to opening the way to translations of higher quality.

    We selected 13 papers and extended abstracts for this year's workshop, many of which reflect statistical machine translation's movement toward not only tree-structured and syntactic models incorporating stochastic synchronous/transduction grammars, but also increasingly semantic models and the closely linked issues of deep syntax and shallow semantics, vector space representations to support these approaches, and semantic evaluation methodologies.

    Thanks are due once again to our authors and our Program Committee for making the ninth SSST workshop another success.

    Dekai Wu, Marine Carpuat, Eneko Agirre and Nora Aranberri


  • Organizers:

    Dekai Wu, Hong Kong University of Science and Technology (HKUST)
    Marine Carpuat, University of Maryland
    Eneko Agirre, University of the Basque Country (UPV/EHU)
    Nora Aranberri, University of the Basque Country (UPV/EHU)

    Program Committee:

    Timothy Baldwin, University of Melbourne
    Ondřej Bojar, Charles University in Prague
    Aljoscha Burchardt, German Research Centre for Artificial Intelligence (DFKI)
    Francisco Casacuberta, Universitat Politècnica de València
    Colin Cherry, National Research Council (NRC) Canada
    David Chiang, USC/ISI
    Stephen Clark, University of Cambridge
    Kevin Duh, Nara Institute of Science and Technology (NAIST)
    Marc Dymetman, Xerox Research Centre Europe
    Daniel Gildea, University of Rochester
    Nal Kalchbrenner, University of Oxford
    Philipp Koehn, University of Edinburgh
    Gorka Labaka, University of the Basque Country (UPV/EHU)
    Alon Lavie, Carnegie Mellon University
    Chi-kiu Lo, Hong Kong University of Science and Technology (HKUST)
    Markus Saers, Hong Kong University of Science and Technology (HKUST)
    Khalil Sima’an, University of Amsterdam
    Ivan Vulić, University of Leuven
    Taro Watanabe, Google
    Andy Way, Dublin City University
    Deyi Xiong, Soochow University
    François Yvon, LIMSI


  • Table of Contents

    Harmonizing word alignments and syntactic structures for extracting phrasal translation equivalents
        Dun Deng, Nianwen Xue and Shiman Guo . . . . . . . . . . 1

    Non-projective Dependency-based Pre-Reordering with Recurrent Neural Network for Machine Translation
        Antonio Valerio Miceli Barone and Giuseppe Attardi . . . . . . . . . . 10

    Translating Negation: Induction, Search And Model Errors
        Federico Fancellu and Bonnie Webber . . . . . . . . . . 21

    SMT error analysis and mapping to syntactic, semantic and structural fixes
        Nora Aranberri . . . . . . . . . . 30

    Unsupervised False Friend Disambiguation Using Contextual Word Clusters and Parallel Word Alignments
        Maryam Aminian, Mahmoud Ghoneim and Mona Diab . . . . . . . . . . 39

    METEOR-WSD: Improved Sense Matching in MT Evaluation
        Marianna Apidianaki and Benjamin Marie . . . . . . . . . . 49

    Analyzing English-Spanish Named-Entity enhanced Machine Translation
        Mikel Artetxe, Eneko Agirre, Iñaki Alegria and Gorka Labaka . . . . . . . . . . 52

    Predicting Prepositions for SMT
        Marion Weller, Alexander Fraser and Sabine Schulte im Walde . . . . . . . . . . 55

    Translation reranking using source phrase dependency features
        Antonio Valerio Miceli Barone . . . . . . . . . . 57

    Semantics-based pretranslation for SMT using fuzzy matches
        Tom Vanallemeersch and Vincent Vandeghinste . . . . . . . . . . 61

    What Matters Most in Morphologically Segmented SMT Models?
        Mohammad Salameh, Colin Cherry and Grzegorz Kondrak . . . . . . . . . . 65

    Improving Chinese-English PropBank Alignment
        Shumin Wu and Martha Palmer . . . . . . . . . . 74


  • Conference Program

    2015/06/04

    08:55–09:00 Opening Remarks

    09:00–10:30 Session 1

    09:00–10:00 Invited Talk
                Philipp Koehn

    10:00–10:30 Harmonizing word alignments and syntactic structures for extracting phrasal translation equivalents
                Dun Deng, Nianwen Xue and Shiman Guo

    10:30–11:00 Coffee Break

    11:00–12:30 Session 2

    11:00–11:30 Non-projective Dependency-based Pre-Reordering with Recurrent Neural Network for Machine Translation
                Antonio Valerio Miceli Barone and Giuseppe Attardi

    11:30–12:00 Translating Negation: Induction, Search And Model Errors
                Federico Fancellu and Bonnie Webber

    12:00–12:30 SMT error analysis and mapping to syntactic, semantic and structural fixes
                Nora Aranberri


  • 2015/06/04 (continued)

    12:30–13:55 Lunch Break

    13:55–14:30 Session 3

    13:55–14:00 QTLeap Best Paper Award

    14:00–14:30 Unsupervised False Friend Disambiguation Using Contextual Word Clusters and Parallel Word Alignments
                Maryam Aminian, Mahmoud Ghoneim and Mona Diab

    14:30–15:30 Session 4: Posters

    14:30–14:35 METEOR-WSD: Improved Sense Matching in MT Evaluation
                Marianna Apidianaki and Benjamin Marie

    14:35–14:40 Analyzing English-Spanish Named-Entity enhanced Machine Translation
                Mikel Artetxe, Eneko Agirre, Iñaki Alegria and Gorka Labaka

    14:40–14:45 Predicting Prepositions for SMT
                Marion Weller, Alexander Fraser and Sabine Schulte im Walde

    14:45–14:50 Translation reranking using source phrase dependency features
                Antonio Valerio Miceli Barone

    14:50–14:55 Semantics-based pretranslation for SMT using fuzzy matches
                Tom Vanallemeersch and Vincent Vandeghinste



    15:30–16:00 Coffee Break

    16:00–17:00 Session 5

    16:00–16:30 What Matters Most in Morphologically Segmented SMT Models?
                Mohammad Salameh, Colin Cherry and Grzegorz Kondrak

    16:30–17:00 Improving Chinese-English PropBank Alignment
                Shumin Wu and Martha Palmer


  • Proceedings of SSST-9, Ninth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 1–9, Denver, Colorado, June 4, 2015. ©2015 Association for Computational Linguistics

    Harmonizing word alignments and syntactic structures for extracting phrasal translation equivalents

    Dun Deng, Nianwen Xue and Shiman Guo
    Computer Science Department
    Brandeis University
    415 South Street, Waltham, MA 02453

    [email protected], [email protected], [email protected]

    Abstract

    Accurate identification of phrasal translation equivalents is critical to both phrase-based and syntax-based machine translation systems. We show that the extraction of many phrasal translation equivalents is made impossible by word alignments done without taking syntactic structures into consideration. To address the problem, we propose a new annotation scheme where word alignment and the alignment of non-terminal nodes (i.e., phrases) are done simultaneously to avoid conflicts between word alignments and syntactic structures. Relying on this new alignment approach, we construct a Hierarchically Aligned Chinese-English Parallel Treebank (HACEPT), and show that all phrasal translation equivalents can be automatically extracted based on the phrase alignments in HACEPT.

    1 Introduction

    During the past two decades since the emergence of the statistical paradigm of Machine Translation (MT) (Brown et al., 1993), the field of Statistical Machine Translation (SMT) has reached consensus on the need for structural mappings between languages in MT. Accurately identifying structural mappings (i.e., phrasal translation equivalents) is critical to the performance of both phrase-based systems (Koehn, Och, and Marcu, 2003; Och and Ney, 2004) and syntax-based systems (Chiang, 2005; Chiang, 2007; Galley et al., 2004). Phrasal translation equivalents are identified on the basis of word alignments, so how word alignments are done directly affects the identification of phrasal translation equivalents. As reported by (Zhu, Li, and Xiao, 2015), even one spurious word alignment can prevent desirable phrasal translation equivalents from being extracted. Unfortunately, spurious word alignments abound in the word-aligned parallel texts currently used for extracting phrasal translation equivalents. This is because the word alignments in these parallel texts, whether induced in an unsupervised manner such as that described by (Och and Ney, 2003) or manually annotated according to existing word alignment standards such as (Li, Ge, and Strassel, 2009) and (Melamed, 1998), are generally produced as an independent task without taking syntactic structures into consideration. As a result, conflicts between word alignments and syntactic structures are inevitable, and when such a conflict arises, the extraction of desirable phrasal translation equivalents becomes impossible.

    To address this shortcoming, we designed a hierarchical alignment scheme in which word-level alignment (namely alignment of terminal nodes) and phrase-level alignment (namely alignment of non-terminals) are done simultaneously in a coordinated manner so that conflicts between word alignments and syntactic structures are avoided. Based on this alignment scheme, we constructed a Hierarchically Aligned Chinese-English Parallel Treebank (HACEPT), which currently has 9,897 sentence pairs. We show that this hierarchically aligned corpus provides a new way to extract hierarchical translation rules and can be used as a training corpus to learn this type of alignment.

    The rest of the paper is organized as follows: Section 2 shows how phrasal translation equivalents can be


  • made impossible to extract by word alignments done without considering syntactic structures. To avoid this problem, Section 3 introduces our new alignment scheme and describes how HACEPT is constructed using it. Section 4 shows how hierarchical translation rules can be extracted from the phrase alignments in HACEPT. We also provide statistics on two important aspects of the rules, namely the distribution of terminal and non-terminal nodes in the rules and the number of terminal nodes contained in a single rule. Section 5 discusses related work in the literature. Section 6 concludes the paper and points out future work.

    2 Spurious word alignments impede the extraction of phrase pairs

    Spurious word alignments arise in any word alignment practice where the alignment is done as an independent task without taking syntactic structures into consideration, regardless of whether the alignment is automatically generated by a word aligner such as the GIZA++ toolkit (Och and Ney, 2003) or manually annotated using alignment standards such as (Li, Ge, and Strassel, 2009) and (Melamed, 1998). (Zhu, Li, and Xiao, 2015) describe how a spurious word alignment in automatically generated word alignments prevents some phrasal translation equivalents from being extracted. In this section, we show how spurious word alignments in human-annotated word alignments make the extraction of phrasal translation equivalents impossible.

    Consider the following examples quoted from (Li, Ge, and Strassel, 2009), where the relevant word alignment in each sentence/phrase pair is highlighted by underlining. Note that the word alignments are done without taking syntactic structures into consideration, as can be seen from the fact that none of the underlined aligned multi-word strings corresponds to a constituent in a Penn TreeBank (Marcus, Santorini, and Marcinkiewicz, 1993) or Chinese TreeBank (Xue et al., 2005) parse tree.

    1a. He is visiting Beijing 他正访问北京
    1b. He has gone to Beijing 他去北京了
    1c. to quickly and efficiently solve the problem 迅速有效地解决问题

    1d. Results can be obtained by doing experiments 做实验可以得出结果

    1e. We fully agree with the Chinese position that there is only one China in the world 我们完全同意中方的立场，世界上只有一个中国

    Just like the spurious word alignment discussed by (Zhu, Li, and Xiao, 2015), the underlined word alignment in each of the sentence/phrase pairs above makes it impossible to extract at least one desirable phrasal translation equivalent. For each of the sentence/phrase pairs in (1), (2) lists the phrasal translation equivalents that cannot be extracted due to the word alignment done in that pair:

    2a. visiting Beijing 访问北京
    2b. gone to Beijing 去北京
    2c. solve the problem 解决问题
    2d. doing experiments 做实验
    2e. the Chinese position 中方的立场

    The reason why the phrasal translation equivalents in (2) cannot be extracted is that a word in a phrase on one side is aligned to a word that is not part of the corresponding phrase on the other side. Take (2c) for instance. The Chinese verb 解决/solve in the phrase 解决问题 is aligned to both solve and to in (1c), and to is not part of the phrase solve the problem. As a result, the phrase pair in (2c) cannot be obtained.

    It is not desirable that legitimate phrase pairs such as those in (2) cannot be extracted. To fix the problem, Section 3 proposes a new alignment scheme.
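The blocking effect described above is the standard consistency criterion used in phrase-pair extraction. A minimal sketch (not the paper's code; word-index positions are illustrative):

```python
# Standard phrase-pair consistency check from SMT phrase extraction:
# a source span and a target span form a legal phrase pair only if no
# word-alignment link connects a word inside either span to a word
# outside the other span, and at least one link falls inside both.

def is_consistent(src_span, tgt_span, links):
    """src_span/tgt_span: inclusive (start, end) word-index ranges.
    links: set of (src_index, tgt_index) word-alignment links."""
    touching = [(s, t) for (s, t) in links
                if src_span[0] <= s <= src_span[1]
                or tgt_span[0] <= t <= tgt_span[1]]
    if not touching:
        return False  # a phrase pair must be supported by some link
    return all(src_span[0] <= s <= src_span[1]
               and tgt_span[0] <= t <= tgt_span[1]
               for (s, t) in touching)

# Example (1c): "to(0) quickly(1) and(2) efficiently(3) solve(4)
# the(5) problem(6)" vs. 迅速(0) 有效(1) 地(2) 解决(3) 问题(4).
good = {(4, 3), (6, 4)}        # solve-解决, problem-问题
spurious = good | {(0, 3)}     # 解决 additionally aligned to "to"
print(is_consistent((4, 6), (3, 4), good))      # True
print(is_consistent((4, 6), (3, 4), spurious))  # False
```

With the spurious link from 解决 to to, the span pair (solve the problem, 解决问题) fails the check, which is exactly why the phrase pair in (2c) cannot be obtained.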

    3 Hierarchical alignment and the creation of HACEPT

    Hierarchical alignment is a new alignment scheme in which both terminal nodes (words) and non-terminal nodes (linguistic phrases) in parallel parse trees are aligned in a coordinated way so that conflicts, in the form of redundancies and incompatibilities between word alignments and syntactic structures, are avoided. We use this scheme to construct HACEPT with the goal of providing the field of MT with


  • a resource that has human-annotated tree-structured mappings for MT training purposes.

    The word alignment done in HACEPT differs from the common practice of word alignment in the field (Melamed, 1998; Li, Ge, and Strassel, 2009) in that the requirement that every word in a sentence pair be word-aligned is relaxed. On the word level, we only align words that have an equivalent in terms of lexical meaning and grammatical function. Words that do not have a translation counterpart are left unaligned at the word level; instead, we align the appropriate phrases in which they appear. This strategy ensures that both redundancies and incompatibilities between word alignments and syntactic structures are avoided. In addition, artificial ambiguities are eliminated. These points are illustrated in the discussion of the concrete example in Figure 1 below.

    We take the Chinese-English portion of the Parallel Aligned Treebank (PAT) described in (Li et al., 2012) for annotation. Our data have three batches: one batch is weblogs, one is postings from online discussion forums, and one is newswire. The English sentences in the data set are annotated according to the original Penn TreeBank (PTB) annotation stylebook (Bies et al., 1995) as well as its extensions (Warner et al., 2004), while the Chinese sentences are annotated according to the Chinese TreeBank (CTB) annotation guidelines (Xue and Xia, 1998) and their extensions (Zhang and Xue, 2012). The PAT has no phrase alignments, and its word alignments were done under the requirement that all the words in a sentence be aligned.

    Next we discuss our annotation procedure in detail. Our annotators are presented with sentence pairs that come with parallel parse trees. The task of the annotator is to decide, first on the word level and then on the phrase level, whether a word or phrase needs to be aligned at all, and if so, to which word or phrase it should be aligned. The decisions about word alignment and phrase alignment are not independent, and must obey the well-formedness constraints outlined in (Tinsley et al., 2007):

    a. A non-terminal node can only be aligned once.

    b. If node nc is aligned to node ne, then the descendants of nc can only be aligned to descendants of ne.

    c. If node nc is aligned to node ne, then the ancestors of nc can only be aligned to ancestors of ne.
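Constraints b and c can be checked mechanically from the word alignments. A minimal sketch under illustrative data structures (sets of dominated terminals), not the authors' annotation tool:

```python
# Check whether a candidate phrase alignment (nc, ne) is compatible
# with the word alignments under Constraints b and c: no word under
# one node may be aligned to a word outside the other node.

def phrase_alignment_ok(nc_terminals, ne_terminals, word_links):
    """nc_terminals / ne_terminals: sets of terminals dominated by the
    candidate Chinese / English non-terminal nodes.
    word_links: set of (chinese_word, english_word) alignments."""
    for c, e in word_links:
        if c in nc_terminals and e not in ne_terminals:
            return False  # a descendant of nc aligned outside ne
        if e in ne_terminals and c not in nc_terminals:
            return False  # a descendant of ne aligned outside nc
    return True

# Per Figure 1 (terminal sets abbreviated for illustration): 通常, a
# descendant of VPc0, is aligned to "often", which is not a descendant
# of VPe1, so the candidate alignment VPc0-VPe1 is ruled out.
links = {("通常", "often"), ("人们", "people")}
vpc0 = {"通常", "不", "知道"}
vpe1 = {"have", "no", "idea"}      # "often" lies outside VPe1
print(phrase_alignment_ok(vpc0, vpe1, links))   # False
```

Note that a word with no link at all never triggers a violation, which is what allows words like 的 or the to stay unaligned without blocking any phrase alignment.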

    This means that once a word alignment is in place, it puts constraints on phrase alignments. A pair of non-terminal nodes (nc, ne) cannot be aligned if a word that is a descendant of nc is aligned to a word that is not a descendant of ne on the word level.

    Let us use the concrete example in Figure 1 to

  • illustrate the annotation process, which is guided by a set of detailed annotation guidelines. On the word level, only those words that are connected with a dashed line are aligned, since they have equivalents. Note that the Chinese prenominal modification marker 的 and the existential verb 有/have, and the English determiner the, the relative pronoun who, the preposition of, the expletive subject it, the copular verb is, the infinitive marker to and the conjunction word both are all left unaligned on the word level. Aligning these words would generate artificially ambiguous cases and create both redundancies and incompatibilities between word alignments and parse trees.

    For instance, if 的 were to be word-aligned, it could be glued to the preceding verb 喋喋不休 and the whole string would be aligned to harp. Note that 喋喋不休 and harp are both unambiguous and form a one-to-one correspondence. With the word alignment between 喋喋不休的 and harp, we make the unambiguous harp correspond to both 喋喋不休 and 喋喋不休的 (and possibly more strings), thus creating a spurious ambiguity. Also note that the string 喋喋不休的 does not form a constituent in the Chinese parse tree, so the word alignment is incompatible with the syntactic structure of the sentence. By leaving 的 unaligned, we avoid both the spurious ambiguity and the incompatibility.

    As for redundancies, consider the English determiner the, which has no translation counterpart in the Chinese sentence. If the were to be word-aligned, it could be attached to the noun people and the whole string the people would be aligned to 人们. This would create a redundancy, since the English parse tree already groups the and people together to form an NP, and there is therefore no need to repeat this information on the word level by attaching the to people, especially when the word alignment also generates a spurious ambiguity for 人们, which unambiguously


  • means people but is aligned to the people.

    With word alignments in place, the annotator next needs to perform phrase alignments. Note that word alignments place restrictions on phrase alignments. For instance, VPc0 cannot be a possible alignment for VPe1, because 通常, a descendant of VPc0, is aligned to often, which is not a descendant of VPe1. For a phrase that does have a possible alignment, the annotator needs to decide whether the possible phrase alignment can actually be made. This is a challenging task since, for a given phrase, there are usually several candidates from which a single alignment needs to be picked. For instance, for the English ADJP, there are in total two possible phrase alignments, namely VPc6 and VPc7, both of which obey the well-formedness constraints. Since a non-terminal node is not allowed to be aligned to multiple non-terminal nodes on the other side, the annotator needs to choose one among the candidates. This highlights the point that the alignment of non-terminal nodes cannot be deterministically inferred from the alignment of terminal nodes. This is especially true given our approach, where some terminal nodes are left unaligned on the word level. For instance, the reason why VPc7 is a possible alignment for ADJP is that the word 有 is left unaligned. If 有 were aligned with, say, is, VPc7 could not be aligned with ADJP, since is is not a descendant of ADJP and aligning the two nodes would violate Constraint b.

    While Constraints b and c can be enforced automatically given the word alignments, the decisions regarding the alignment of non-terminal nodes that satisfy Constraint a are based on linguistic considerations. One key consideration is to determine which non-terminal nodes encapsulate the grammatical relations signaled by the unaligned words, so that the alignment of the non-terminal nodes effectively captures the unaligned words in their syntactic context. When identifying non-terminal nodes to align, we follow two seemingly conflicting general principles:

    • Phrase alignment should not sever key dependencies involving the grammatical relation signaled by an unaligned word.

    • Phrase alignment should be minimal, in the sense that the phrase alignment should contain only the elements involved in the grammatical relation, and nothing more.
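One hypothetical way to operationalize the minimality principle, assuming candidates have already been filtered by the well-formedness constraints and the first principle (this selection rule and the terminal counts below are our illustration, not the authors' guideline text):

```python
# Hypothetical operationalization of the minimality principle: among
# candidate non-terminal alignments that already satisfy the
# well-formedness constraints and encapsulate the grammatical
# relation, prefer the candidate spanning the fewest terminals.

def pick_minimal(candidates):
    """candidates: list of (node_label, num_terminals_spanned)."""
    return min(candidates, key=lambda c: c[1])[0]

# For the English ADJP in Figure 1, VPc6 and VPc7 both obey the
# constraints; the terminal counts here are illustrative only.
print(pick_minimal([("VPc6", 2), ("VPc7", 3)]))  # VPc6
```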

    The first principle ensures that the grammatical relation is properly encapsulated in the aligned non-terminal nodes. For example, in Figure 1, if we attached the English preposition on to tolls and aligned them to 通行费, we would fail to capture the lexical dependency between harp and on. Aligning VPc2 with VPe2 captures the dependency.

    The first principle in and of itself is insufficient to produce the desired alignment. Taken to the extreme, it can be trivially satisfied by aligning the two root nodes of the sentence pair. We also need the alignment to be minimal, in the sense that aligned non-terminal nodes should contain only the elements involved in the grammatical relation, and nothing more. These two requirements used in conjunction ensure that a unique phrase alignment can be found for each unaligned word. The phrase alignments in Figure 1, which are indicated by blue dotted lines, all satisfy these two principles.

    Following the principles and the procedure introduced above, we constructed HACEPT,1 which has 9,897 sentence pairs. In the next section, we show how the alignments in HACEPT can help to extract translation rules.

    4 Extracting hierarchical translation rules in HACEPT

    Hierarchical translation rules can be automatically extracted from the phrase alignments in HACEPT. Given a pair of aligned non-terminal nodes (nc, ne), a translation rule based on the alignment between nc and ne can be extracted as follows: check each of the immediate daughter nodes of both nc and ne. For any daughter node that is aligned, stop looking down into the node and keep the phrase category label of the node as a variable in the rule. For each daughter node that is not aligned, recursively traverse its children until either an aligned node is found, in which case its phrase category label is kept as a variable in the rule, or a terminal node is

    1 As of the writing of this paper, we are in the process of adjudicating the double annotation done to create HACEPT. We look forward to finishing adjudication soon and releasing the resource to the public.


  • [Figure 1 appears here in the original: a hierarchically aligned sentence pair, shown as parallel Chinese (root IP0) and English (root S0) parse trees, with word-level alignments drawn as dashed lines and phrase-level alignments (e.g., NPc0–NPe0, VPc2–VPe2) as blue dotted lines. The tree diagram itself does not survive text extraction.]

    Figure 1: A hierarchically aligned sentence pair

  • reached, in which case the word is included as part of the translation rule.

    To illustrate the rule extraction process specified

  • above, let us take the phrase alignment between NPc0 and NPe0 in Figure 1 as an example. The search starts top-down from the two root nodes. On the Chinese side, NPc0 has two immediate daughter nodes: CP and NPc1. NPc1 is aligned, so we stop looking inside the node and simply keep its phrase category label as part of the rule. CP is not aligned, so we keep checking its two immediate daughter nodes: VPc2 and DEC. VPc2 is aligned and is not further checked. DEC is not aligned and dominates the terminal node 的, which is kept in the rule. Since DEC is the last node inside NPc0 and a terminal node has been reached, the search on the Chinese side ends. The same procedure simultaneously takes place on the English side, and when the search is done, we get the translation rule in (3) below:

    (3) NPc0 ⇔ NPe0:
        VPc2 的 NPc1 ⇔ NPe1 who VPe2

    Note that the rule contains both terminals (的 and who) and non-terminals represented by phrase category labels.

    The rule in (3) illustrates one type of rule, namely rules containing both terminal and non-terminal nodes. There are also rules with only terminal nodes and rules with only non-terminal nodes. Figure 1 has quite a few examples of the former; one is given below:

    (4) NPc2 ⇔ NPe2:
        通行费 ⇔ tolls

    The rule above contains only terminals. Figure 1 does not contain an example of a rule with only non-terminals, but such rules do exist, and here is a common one:

    (5) IP ⇔ S:
        NPsubj VPpred ⇔ NPsubj VPpred

    The rule above covers parallel sentences whose subjects and predicates are both aligned.

    Table 1 provides statistics on the distribution of the three types of rules in HACEPT.

    Rule types                 No.      Percentage
    with only terminals        52379    50.46
    with only non-terminals     2621     2.53
    with both                  48796    47.01
    Total                     103796   100.00

    Table 1: Rule distribution
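The extraction procedure described at the beginning of this section can be sketched as a recursive traversal. The node representation below is our illustration, not the authors' code:

```python
# Illustrative sketch of the rule-extraction traversal of Section 4:
# starting from an aligned node pair, recurse through unaligned
# daughters, emitting aligned nodes as non-terminal variables and
# unaligned terminals as lexical items.

class Node:
    def __init__(self, label, children=None, word=None, aligned=False):
        self.label = label            # phrase category or POS label
        self.children = children or []
        self.word = word              # terminal string, if a leaf
        self.aligned = aligned        # has a phrase/word alignment?

def rule_side(node):
    """Collect one side of a translation rule from an aligned node."""
    parts = []
    for child in node.children:
        if child.aligned:
            parts.append(child.label)       # keep label as a variable
        elif child.word is not None:
            parts.append(child.word)        # unaligned terminal: keep word
        else:
            parts.extend(rule_side(child))  # recurse into unaligned phrase
    return parts

# Rebuilding the Chinese side of rule (3) from Figure 1:
npc1 = Node("NPc1", aligned=True)
vpc2 = Node("VPc2", aligned=True)
dec = Node("DEC", word="的")
cp = Node("CP", children=[vpc2, dec])
npc0 = Node("NPc0", children=[cp, npc1], aligned=True)
print(rule_side(npc0))   # ['VPc2', '的', 'NPc1']
```

Running the same traversal on the English side of the NPc0–NPe0 alignment yields the other half of rule (3).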

    Given the importance of hierarchical translation rules for MT, a natural question to ask about the rules extracted from HACEPT is: are they usable? The most crucial factor determining the usability of a rule is its length in terms of the number of terminal nodes it contains. If a rule contains too many terminal nodes, it cannot easily be used for MT purposes. Table 2 provides statistics on the number of terminal nodes (TN) in the extracted rules:

    TN            Rules    Percentage   Cumulative
    0              6974       6.72         6.72
    1              4017       3.87        10.59
    2             30829      29.70        40.29
    3             18780      17.09        58.38
    4             12897      12.43        70.81
    5              9387       9.04        79.85
    6              6079       5.86        85.71
    7              4404       4.24        89.95
    More than 7   10429      10.05       100.00

    Table 2: Rule length

    As shown by the table, 89.95 percent of the rules contain 7 or fewer terminal nodes. Ten percent of the rules still contain more than 7 terminal nodes.

    One primary factor that increases the number of terminal nodes in a rule is how the parse trees are designed. Specifically, some parts of the parse trees are designed to be flat, presumably for the sake of increasing treebank annotation throughput, but this makes some otherwise legitimate phrase alignments inaccessible unless we change the underlying parse trees. When a phrase alignment cannot be made, some terminal nodes are left out and appear in the rule. This is illustrated by Figure 1.

    On the Chinese side, there is a node, namely VPc0, which dominates the predicate part of the sentence.


  • On the English side, the predicate part of the English sentence is split into ADVP and VPe1, and there is no single node dominating these two nodes. As a result, VPc0 has no phrase alignment. If a node VPe0 were created that included ADVP and VPe1 as its immediate daughters, VPc0 and VPe0 could be aligned. (6) below is the rule based on the alignment between the two sentences in Figure 1, and (7) is the rule that would result if a node were created for the predicate of the English sentence and aligned to VPc0.

    (6) IP0 ⇔ S0:
        NPc0 ADVP 不知道 IP1 ⇔ NPe0 ADVP have no idea of S1

    (7) IP0 ⇔ S0:
        NPc0 VPc0 ⇔ NPe0 VPe0
        (VPe0 ⇒ ADVP VPe1)

    Note that the rule in (6) has 6 terminal nodes intotal whereas the rule in (7) has none. This is agood example to illustrate the fact that a flat struc-ture makes some legitimate phrase alignment impos-sible and as a result increases the number of terminalnodes in a rule.There is another place in Figure 1 that has the same

    problem. Note that the Chinese VPc0 has three im-mediate daughter nodes: ADVP, ADVP, and VPc1.This structure is flat and can become deeper if an in-termediate node is created to dominate the secondADVP and VPc1. This node can then combine withthe first ADVP to form VPc0. Note that this inter-mediate node will serve as the phrase alignment ofVPe1, which cannot be unaligned in the figure. Withthe phrase alignment between VPe1 and the hypo-thetical intermediate node (call it VPc9), the numberof terminal nodes in (6) will be reduced to zero evenwithout the creation of VPe0 in (7). The new rulelooks like this:

    (8) IP0⇔ S0NPc0 ADVP VPc9 NPe0 ADVP VPe1

    (VPc9 ⇒ ADVP VPc1)

In the near future, we plan to binarize the flat structures as illustrated above to create some intermediate nodes, which can be aligned and reduce the number of terminal nodes in existing rules.
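The binarization step sketched above can be illustrated with a minimal tree transformation. This is our own sketch, not the authors' implementation: the `Node` class and the primed labels (e.g. `VPc0'` playing the role of VPc9) are hypothetical, and grouping the rightmost two daughters is only one of several possible binarization schemes.

```python
# Minimal sketch of binarizing a flat parse node by inserting
# intermediate nodes, so a node like VPc0 with daughters
# (ADVP, ADVP, VPc1) gains an intermediate node over the last two.
# The Node class and label scheme are illustrative, not from the paper.

class Node:
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

def binarize(node):
    """Recursively insert intermediate nodes so no node has more than 2 children."""
    for child in node.children:
        binarize(child)
    while len(node.children) > 2:
        # Group the last two daughters under a new intermediate node,
        # analogous to ADVP + VPc1 -> VPc9 in the example above.
        right = node.children[-2:]
        inter = Node(node.label + "'", right)
        node.children = node.children[:-2] + [inter]

def bracketing(node):
    if not node.children:
        return node.label
    return "(%s %s)" % (node.label, " ".join(bracketing(c) for c in node.children))

vp = Node("VPc0", [Node("ADVP1"), Node("ADVP2"), Node("VPc1")])
binarize(vp)
print(bracketing(vp))  # (VPc0 ADVP1 (VPc0' ADVP2 VPc1))
```

The new intermediate node (`VPc0'` here) is exactly the kind of node that can then receive a phrase alignment of its own.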

5 Related work

To address the problem caused by spurious word

alignments, there has been research done to improve word alignment quality by incorporating syntactic information into word alignments (May and Knight, 2007; Fossum, Knight, and Abney, 2008). Another research direction has been explored to conduct syntactic alignment between parse trees (Tinsley et al., 2007; Pauls et al., 2010; Sun, Zhang, and Tan, 2010b; Sun, Zhang, and Tan, 2010a), and implements syntactic rule extraction based on syntactic alignment instead of word alignment. Our work reported in Section 3 can be viewed as a combination of these two lines of research.

There has also been research done to automatically obtain phrasal translation equivalents (Ambati and Lavie, 2008; Hanneman, Burroughs, and Lavie, 2011; Lavie, Parlikar, and Ambati, 2008; Zhu, Li, and Xiao, 2015). This line of research is different from our work in two respects.

First, word alignment as the foundation of phrase-

pair extraction is done differently in the two approaches. Automatic extraction of phrase pairs uses automatically generated word alignments, where there are many spurious word alignments, which, as pointed out by Zhu, Li, and Xiao (2015), are harmful to rule extraction and affect translation quality. By contrast, HACEPT is free of spurious word alignments. As already mentioned in Section 3, all the word alignments in HACEPT are compatible with the syntactic structures and will not block any legitimate phrase alignment.

Second, phrase alignment is inferred from word

alignment in automatic approaches. As reported by Ambati and Lavie (2008), in places where language-particular function words, such as English auxiliary verbs, exist in one language but not the other, there is usually more than one candidate in the language that has the function words for a phrase in the language that does not have a counterpart of the function words. Automatic inference cannot always make the right decision in such situations. We have strict standards for choosing the correct phrase alignment in such cases and, as a result, HACEPT can function as a training corpus for automatic approaches.

6 Conclusion

In this paper, we report a resource we have constructed with a novel alignment scheme. The corpus contains both word and phrase alignments and can help extract hierarchical translation rules and train syntax-based MT models. The next step is, of course, to do MT experiments with this resource to see if it indeed helps to improve system performance.

Acknowledgments

This work is supported by the IBM subcontract No. 4913014934 under DARPA Prime Contract No. 0011-12-C-0015 entitled "Broad Operational Language Translation". We would like to thank Libin Shen and Salim Roukos for their inspiration and discussion during early stages of the project, Abe Ittycheriah and Niyu Ge for their help with setting up the data, Loretta Bandera for developing and maintaining the annotation tool, and two anonymous reviewers for their helpful comments. We are grateful for the hard work of our annotators Hui Gao, Tseming Wang and Lingya Zhou. Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor or any of the people mentioned above.

References

Ambati, Vamshi and Alon Lavie. 2008. Improving syntax driven translation models by re-structuring divergent and non-isomorphic parse tree structures. In Proceedings of the AMTA-2008 Student Research Workshop, pages 235--244.

Bies, Ann, Mark Ferguson, Karen Katz, Robert MacIntyre, Victoria Tredinnick, Grace Kim, Mary Ann Marcinkiewicz, and Britta Schasberger. 1995. Bracketing guidelines for Treebank II style Penn Treebank project. Technical report, University of Pennsylvania.

Brown, Peter F., Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263--311.

Chiang, David. 2005. A hierarchical phrase-based model for statistical machine translation. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 263--270.

Chiang, David. 2007. Hierarchical phrase-based translation. Computational Linguistics, 33(2):201--228.

Fang, Licheng and Chengqing Zong. 2008. An efficient approach to rule redundancy reduction in hierarchical phrase-based translation. In Proceedings of NLP-KE '08, International Conference on Natural Language Processing and Knowledge Engineering, pages 1--6.

Fossum, V., K. Knight, and S. Abney. 2008. Using syntax to improve word alignment precision for syntax-based machine translation. In Proceedings of the Third Workshop on Statistical Machine Translation, pages 44--52.

Galley, Michel, Mark Hopkins, Kevin Knight, and Daniel Marcu. 2004. What's in a translation rule? In Susan Dumais, Daniel Marcu, and Salim Roukos, editors, HLT-NAACL 2004: Main Proceedings, pages 273--280.

Hanneman, Greg, Michelle Burroughs, and Alon Lavie. 2011. A general-purpose rule extractor for SCFG-based machine translation. In Proceedings of SSST-5, Fifth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 135--144.

Koehn, Philipp, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, volume 1, pages 48--54.

Lavie, Alon, Alok Parlikar, and Vamshi Ambati. 2008. Syntax-driven learning of sub-sentential translation equivalents and translation rules from parsed parallel corpora. In Proceedings of the Second ACL Workshop on Syntax and Structure in Statistical Translation (SSST-2), pages 87--95.

Li, Xuansong, Niyu Ge, and Stephanie Strassel. 2009. Tagging guidelines for Chinese-English word alignment. Technical report, Linguistic Data Consortium.

Li, Xuansong, Stephanie Strassel, Stephen Grimes, Safa Ismael, Mohamed Maamouri, Ann Bies, and Nianwen Xue. 2012. Parallel aligned treebanks at LDC: New challenges interfacing existing infrastructures. In Proceedings of LREC-2012, Istanbul, Turkey.

Marcus, Mitchell P., Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313--330.

May, J. and K. Knight. 2007. Syntactic re-alignment models for machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 360--368.

Melamed, I. Dan. 1998. Annotation style guide for the Blinker project. Technical report, University of Pennsylvania.

Och, Franz Josef and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19--51.

Och, Franz Josef and Hermann Ney. 2004. The alignment template approach to statistical machine translation. Computational Linguistics, 30(4):417--449.

Pauls, A., D. Klein, D. Chiang, and K. Knight. 2010. Unsupervised syntactic alignment with inversion transduction grammars. In Proceedings of Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), pages 118--126.

Sun, J., M. Zhang, and C.L. Tan. 2010a. Discriminative induction of sub-tree alignment using limited labeled data. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING), pages 1047--1055.

Sun, J., M. Zhang, and C.L. Tan. 2010b. Exploring syntactic structural features for sub-tree alignment using bilingual tree kernels. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL), pages 306--315.

Tinsley, John, Ventsislav Zhechev, Mary Hearne, and Andy Way. 2007. Robust language pair-independent subtree alignment. In Proceedings of Machine Translation Summit XI, Copenhagen, Denmark.

Warner, Colin, Ann Bies, Christine Brisson, and Justin Mott. 2004. Addendum to the Penn Treebank II style bracketing guidelines: Biomedical treebank annotation. Technical report, University of Pennsylvania.

Xue, Nianwen and Fei Xia. 1998. The bracketing guidelines for the Penn Chinese Treebank project. Technical report, University of Pennsylvania.

Xue, Nianwen, Fei Xia, Fudong Chiou, and Martha Palmer. 2005. The Penn Chinese Treebank: Phrase structure annotation of a large corpus. Natural Language Engineering, 11(2):207--238.

Zhu, Jingbo, Qiang Li, and Tong Xiao. 2015. Improving syntactic rule extraction through deleting spurious links with translation span alignment. Natural Language Engineering, pages 1--23.

Proceedings of SSST-9, Ninth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 10–20, Denver, Colorado, June 4, 2015. ©2015 Association for Computational Linguistics

Non-projective Dependency-based Pre-Reordering with Recurrent Neural Network for Machine Translation

Antonio Valerio Miceli-Barone
Università di Pisa
Largo B. Pontecorvo, 3
56127 Pisa, Italy
[email protected]

Giuseppe Attardi
Università di Pisa
Largo B. Pontecorvo, 3
56127 Pisa, Italy
[email protected]

    Abstract

The quality of statistical machine translation performed with phrase-based approaches can be increased by permuting the words in the source sentences into an order which resembles that of the target language. We propose a class of recurrent neural models which exploit source-side dependency syntax features to reorder the words into a target-like order. We evaluate these models on the German-to-English language pair, showing significant improvements over a phrase-based Moses baseline, obtaining a quality similar or superior to that of hand-coded syntactic reordering rules.

    1 Introduction

Statistical machine translation is typically performed using phrase-based systems (Koehn et al., 2007). These systems can usually produce accurate local reordering, but they have difficulties dealing with the long-distance reordering that tends to occur between certain language pairs (Birch et al., 2008).

The quality of phrase-based machine translation can be improved by reordering the words in each sentence of the source side of the parallel training corpus into a "target-like" order and then applying the same transformation as a pre-processing step to input strings during execution.

When the source-side sentences can be accurately parsed, pre-reordering can be performed using hand-coded rules. This approach

has been successfully applied to German-to-English (Collins et al., 2005) and other language pairs. The main issue with it is that these rules must be designed for each specific language pair, which requires considerable linguistic expertise.

Fully statistical approaches, on the other hand, learn the reordering relation from word alignments. Some of them learn reordering rules on the constituency (Dyer and Resnik, 2010; Khalilov and Fonollosa, 2011) or projective dependency (Genzel, 2010; Lerner and Petrov, 2013) parse trees of source sentences. The permutations that these methods can learn are generally non-local (i.e., high-distance) on the sentences but local (parent-child or sibling-sibling swaps) on the parse trees. Moreover, constituency or projective dependency trees may not be the ideal way of representing the syntax of non-analytic languages, which could be better described using non-projective dependency trees (Bosco and Lombardo, 2004). Other methods, based on recasting reordering as a combinatorial optimization problem (Tromble and Eisner, 2009; Visweswariah et al., 2011), can in principle learn to generate arbitrary permutations, but they can only make use of minimal syntactic information (part-of-speech tags) and therefore cannot exploit the potentially valuable structural syntactic information provided by a parser.

In this work we propose a class of reordering models which attempt to close this gap by exploiting rich dependency syntax features and


at the same time being able to process non-projective dependency parse trees and generate permutations which may be non-local both on the sentences and on the parse trees. We represent these problems as sequence prediction machine learning tasks, which we address using recurrent neural networks.

We applied our model to reorder German sentences into an English-like word order as a pre-processing step for phrase-based machine translation, obtaining significant improvements over the unreordered baseline system and quality comparable to the hand-coded rules introduced by Collins et al. (2005).

2 Reordering as a walk on a dependency tree

In order to describe the non-local reordering phenomena that can occur between language pairs such as German-to-English, we introduce a reordering framework similar to that of Miceli Barone and Attardi (2013), based on a graph walk of the dependency parse tree of the source sentence. This framework doesn't restrict the parse tree to be projective, and allows the generation of arbitrary permutations.

Let f ≡ (f1, f2, . . . , fLf) be a source sentence, annotated by a rooted dependency parse tree: ∀j ∈ {1, . . . , Lf}, hj ≡ PARENT(j).

We define a walker process that walks from word to word across the edges of the parse tree, and at each step optionally emits the current word, with the constraint that each word must eventually be emitted exactly once. Therefore, the final string of emitted words f′ is a permutation of the original sentence f, and any permutation can be generated by a suitable walk on the parse tree.

    2.1 Reordering automaton

We formalize the walker process as a non-deterministic finite-state automaton. The state v of the automaton is the tuple v ≡ (j, E, a), where j ∈ {1, . . . , Lf} is the current vertex (word index), E is the set of emitted vertices, and a is the last action taken by the automaton. The initial state is v(0) ≡ (rootf, {}, null), where rootf is the root vertex of the parse tree.

    At each step t, the automaton chooses one ofthe following actions:

• EMIT: emit the word fj at the current vertex j. This action is enabled only if the current vertex has not already been emitted:

      j ∉ E
      (j, E, a) —EMIT→ (j, E ∪ {j}, EMIT)    (1)

• UP: move to the parent of the current vertex. Enabled if there is a parent and we did not just come down from it:

      hj ≠ null, a ≠ DOWNj
      (j, E, a) —UP→ (hj, E, UPj)    (2)

• DOWNj′: move to the child j′ of the current vertex. Enabled if the subtree s(j′) rooted at j′ contains vertices that have not already been emitted and if we did not just come up from it:

      hj′ = j, a ≠ UPj′, ∃k ∈ s(j′) : k ∉ E
      (j, E, a) —DOWNj′→ (j′, E, DOWNj′)    (3)

The execution continues until all the vertices have been emitted.

We define the sequence of states of the walker automaton during one run as an execution v̄ ∈ GEN(f). An execution also uniquely specifies the sequence of actions performed by the automaton.

The preconditions make sure that every execution of the automaton ends by generating a permutation of the source sentence. Furthermore, no cycles are possible: progress is made at every step, and it is not possible to enter an execution that later turns out to be invalid. Every permutation of the source sentence can be generated by some execution. In fact, each permutation f′ can be generated by exactly one execution, which we denote as v̄(f′).

We can split the execution v̄(f′) into a sequence of Lf emission fragments v̄j(f′), each ending with an EMIT action. The first fragment has zero or more DOWN∗ actions followed by one EMIT action, while each


other fragment has a non-empty sequence of UP and DOWN∗ actions (always zero or more UPs followed by zero or more DOWNs) followed by one EMIT action.

Finally, we define an action in an execution as forced if it was the only action enabled at the step where it occurred.
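The enabled-action rules (1)-(3) can be made concrete with a small sketch. The tree encoding (a parent array) and the function names are ours, not the paper's; the action subscripts follow the automaton, so an UP from vertex j is recorded as ("UP", j) and a DOWN to child k as ("DOWN", k).

```python
# Sketch of the walker automaton's enabled-action computation (Sec. 2.1).
# parent[j] is the parent index of vertex j, or None for the root.
# A state is (current vertex j, emitted set E, last action a).

def subtree(parent, j):
    """All vertices in the subtree rooted at j."""
    kids = [k for k in range(len(parent)) if parent[k] == j]
    out = {j}
    for k in kids:
        out |= subtree(parent, k)
    return out

def enabled_actions(parent, state):
    j, emitted, last = state
    actions = []
    if j not in emitted:                                    # rule (1): EMIT
        actions.append(("EMIT", j))
    if parent[j] is not None and last != ("DOWN", j):       # rule (2): UP
        actions.append(("UP", j))
    for k in range(len(parent)):                            # rule (3): DOWN_k
        if parent[k] == j and last != ("UP", k) and subtree(parent, k) - emitted:
            actions.append(("DOWN", k))
    return actions

# Tiny tree: vertex 0 is the root, with children 1 and 2.
parent = [None, 0, 0]
print(enabled_actions(parent, (0, set(), None)))
# [('EMIT', 0), ('DOWN', 1), ('DOWN', 2)]
```

Note how, after taking ("DOWN", 1), the UP action back to the root is disabled until some other action intervenes, exactly as required by rule (2).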

    2.2 Application

Suppose we perform reordering using a typical syntax-based system which processes source-side projective dependency parse trees and is limited to swaps between pairs of vertices which are either in a parent-child relation or in a sibling relation. In such an execution the UP actions are always forced, since the "walker" process never leaves a subtree before all its vertices have been emitted.

Suppose instead that we could perform reordering according to an "oracle". The executions of our automaton corresponding to these permutations will in general contain unforced UP actions. We define these actions, and the execution fragments that exhibit them, as non-tree-local.

In practice we don't have access to a reordering "oracle", but for sentence pairs in a parallel corpus we can compute heuristic "pseudo-oracle" reference permutations of the source sentences from word alignments.

Following Al-Onaizan and Papineni (2006), Tromble and Eisner (2009), Visweswariah et al. (2011) and Navratil et al. (2012), we generate word alignments in both the source-to-target and the target-to-source directions using IBM Model 4 as implemented in GIZA++ (Och et al., 1999) and then combine them into a symmetrical word alignment using the "grow-diag-final-and" heuristic implemented in Moses (Koehn et al., 2007).

Given the symmetric word-aligned corpus, we assign to each source-side word an integer index corresponding to the position of the leftmost target-side word it is aligned to (attaching unaligned words to the following aligned word) and finally we perform a stable sort of source-side words according to this index.
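The pseudo-oracle construction just described fits in a few lines. This is a sketch under our own assumptions: the alignment is represented as a set of 0-based (source, target) index pairs, and the function name is hypothetical.

```python
# Sketch of the pseudo-oracle reference permutation (Sec. 2.2):
# each source word gets the index of the leftmost target word it is
# aligned to; unaligned words attach to the following aligned word;
# a stable sort then yields the target-like source order.

def pseudo_oracle_order(n_src, alignment):
    """alignment: set of (src_index, tgt_index) pairs, 0-based."""
    key = [None] * n_src
    for s, t in alignment:
        if key[s] is None or t < key[s]:
            key[s] = t                      # leftmost aligned target position
    # Attach unaligned words to the following aligned word.
    next_key = float("inf")
    for s in reversed(range(n_src)):
        if key[s] is None:
            key[s] = next_key
        else:
            next_key = key[s]
    return sorted(range(n_src), key=lambda s: key[s])  # Python's sort is stable

# Source words 0..3; word 2 is unaligned and attaches to word 3.
alignment = {(0, 2), (1, 0), (3, 1)}
print(pseudo_oracle_order(4, alignment))  # [1, 2, 3, 0]
```

The stability of the sort matters: it keeps the unaligned word 2 just before word 3, the word it was attached to.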

    2.3 Reordering example

Consider the segment of a German sentence shown in Fig. 1. The English-reordered segment "die Währungsreserven anfangs lediglich dienen sollten zur Verteidigung" corresponds to the English: "the reserve assets were originally intended to provide protection".

In order to compose this segment from the original German, the reordering automaton described in our framework must perform a complex sequence of moves on the parse tree:

• Starting from "sollten", descend to "dienen", descend to "Währungsreserven" and finally to "die". Emit it, then go up to "Währungsreserven", emit it, and go up to "dienen" and up again to "sollten". Note that the last UP is unforced, since "dienen" has not been emitted at that point and also has unemitted children. This unforced action indicates non-tree-local reordering.

• Go down to "anfangs". Note that in the parse tree this edge crosses another edge, indicating non-projectivity. Emit "anfangs" and go up (forced) back to "sollten".

• Go down to "dienen", down to "zur", down to "lediglich" and emit it. Go up (forced) to "zur", up (unforced) to "dienen", emit it, go up (unforced) to "sollten", emit it. Go down to "dienen", down to "zur", emit it, go down to "Verteidigung" and emit it.

Correct reordering of this segment would be difficult both for a phrase-based system (since the words are further apart than both the typical maximum distortion distance and maximum phrase length) and for a syntax-based system (due to the presence of non-projectivity and non-tree-locality).

3 Recurrent Neural Network reordering models

Figure 1: Section of the dependency parse tree of a German sentence.

Given the reordering framework described above, we could try to directly predict the executions as Miceli Barone and Attardi (2013) attempted with their version of the framework. However, the executions of a given sentence can have widely different lengths, which could make incremental inexact decoding such as beam search difficult due to the need to prune over partial hypotheses that have different numbers of emitted words.

Therefore, we decided to investigate a different class of models which have the property that state transitions happen only in correspondence with word emission. This enables us to leverage the technology of incremental language models.

Using language models for reordering is not new (Feng et al., 2010; Durrani et al., 2011; Bisazza and Federico, 2013), but instead of using a more or less standard n-gram language model, we are going to base our model on recurrent neural network language models (Mikolov et al., 2010).

Neural networks allow easy incorporation of multiple types of features and can be trained more specifically on the types of sequences that will occur during decoding; hence they can avoid wasting model space to represent the probabilities of non-permutations.

    3.1 Base RNN-RM

Let f ≡ (f1, f2, . . . , fLf) be a source sentence. We model the reordering system as a deterministic single-hidden-layer recurrent neural network:

      v(t) = τ(Θ(1) · x(t) + ΘREC · v(t − 1))    (4)

where x(t) ∈ Rⁿ is a feature vector associated to the t-th word in a permutation f′, v(0) ≡ vinit, Θ(1) and ΘREC are parameters,¹ and τ(·) is the hyperbolic tangent function.

If we know the first t − 1 words of the permutation f′, then in order to compute the probability distribution of the t-th word we do the following:

• Iteratively compute the state v(t − 1) from the feature vectors x(1), . . . , x(t − 1).

• For all the indices of the words that haven't occurred in the permutation so far, j ∈ J(t) ≡ [1, Lf] − īt−1:, compute a score hj,t ≡ ho(v(t − 1), xo(j)), where xo(·) is the feature vector of the candidate target word.

• Normalize the scores using the logistic softmax function:

      P(Īt = j | f, īt−1:, t) = exp(hj,t) / Σj′∈J(t) exp(hj′,t)

The scoring function ho(v(t − 1), xo(j)) applies a feed-forward hidden layer to the feature inputs xo(j), and then takes a weighted inner product between the activation of this layer and the state v(t − 1). The result is then linearly combined with an additional feature equal to the logarithm of the number of remaining words in the permutation (Lf − t),² and with a bias feature:

      hj,t ≡ ⟨τ(Θ(o) · xo(j)), θ(2) ⊙ v(t − 1)⟩ + θ(α) · log(Lf − t) + θ(bias)    (5)

where hj,t ≡ ho(v(t − 1), xo(j)).

¹ We don't use a bias feature since it is redundant when the layer has input features encoded with the "one-hot" encoding.

² Since we are then passing this score to a softmax of variable size (Lf − t), this feature helps the model to keep the score already approximately scaled.


We can compute the probability of an entire permutation f′ just by multiplying the probabilities for each word: P(f′ | f) = P(Ī = ī | f) = ∏t=1..Lf P(Īt = īt | f, t).
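One prediction step of this model can be sketched numerically. This is our own toy illustration: the dimensions and random parameters are arbitrary, and the feature-based hidden-layer scorer of eq. (5) is simplified to a plain inner product between each candidate word's feature vector and the state.

```python
# Sketch of one Base RNN-RM prediction step (eqs. 4-5, greatly
# simplified): update the recurrent state, score the not-yet-emitted
# words, and normalize with a softmax over J(t).
import math
import random

random.seed(0)
n = 4                                    # state / feature dimension (toy)
L_f = 4                                  # sentence length (toy)
Theta1 = [[random.uniform(-.1, .1) for _ in range(n)] for _ in range(n)]
ThetaREC = [[random.uniform(-.1, .1) for _ in range(n)] for _ in range(n)]
x = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(L_f)]

def matvec(M, v):
    return [sum(m * vj for m, vj in zip(row, v)) for row in M]

def rnn_step(v_prev, x_t):               # eq. (4)
    a = matvec(Theta1, x_t)
    b = matvec(ThetaREC, v_prev)
    return [math.tanh(ai + bi) for ai, bi in zip(a, b)]

def next_word_distribution(v_prev, remaining):
    # Simplified stand-in for h_{j,t}: inner product of x_o(j) and state.
    scores = [sum(xi * vi for xi, vi in zip(x[j], v_prev)) for j in remaining]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]         # softmax over J(t)

v = rnn_step([0.0] * n, x[0])
p = next_word_distribution(v, remaining=[1, 2, 3])
print(round(sum(p), 6))  # 1.0 -- a proper distribution over the remaining words
```

The key structural point survives the simplification: the softmax at step t runs only over the words not yet emitted, so the model never assigns probability mass to non-permutations.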

    3.1.1 Training

Given a training set of pairs of sentences and reference permutations, the training problem is defined as finding the set of parameters θ ≡ (vinit, Θ(1), θ(2), ΘREC, Θ(o), θ(α), θ(bias)) which minimizes the per-word empirical cross-entropy of the model w.r.t. the reference permutations in the training set. Gradients can be efficiently computed using backpropagation through time (BPTT).

In practice we used the following training architecture:

• Stochastic gradient descent, with each training pair (f, f′) considered as a single mini-batch for updating purposes.

• Gradients computed using the automatic differentiation facilities of Theano (Bergstra et al., 2010), which implements a generalized BPTT. No truncation is used.

• L2 regularization.³

• Learning rates dynamically adjusted per scalar parameter using the AdaDelta heuristic (Zeiler, 2012).

• A gradient clipping heuristic to prevent the "exploding gradient" problem (Graves, 2013).

• Early stopping w.r.t. a validation set to prevent overfitting.

• Uniform random initialization for parameters other than the recurrent parameter matrix ΘREC.

• Random initialization with the echo state property for ΘREC, with contraction coefficient σ = 0.99 (Jaeger, 2001; Gallicchio and Micheli, 2011).

Training time complexity is O(L²f) per sentence, which could be reduced to O(Lf) using truncated BPTT at the expense of update accuracy and hence convergence speed. Space complexity is O(Lf) per sentence.

    3.1.2 Decoding

In order to use the RNN-RM model for pre-reordering we need to compute the most likely permutation f′∗ of the source sentence f:

      f′∗ ≡ argmax f′∈GEN(f) P(f′ | f)    (6)

³ λ = 10⁻⁴ on the recurrent matrix, λ = 10⁻⁶ on the final layer, per minibatch.

Solving this problem to the global optimum is computationally hard,⁴ hence we solve it to a local optimum using a beam search strategy.

We generate the permutation incrementally from left to right. Starting from an initial state consisting of an empty string and the initial state vector vinit, at each step we generate all possible successor states and retain the B most probable of them (histogram pruning), according to the probability of the entire prefix of the permutation they represent.

Since RNN state vectors do not decompose in a meaningful way, we don't use any hypothesis recombination. At step t there are Lf − t possible successor states, and the process always takes exactly Lf steps,⁵ therefore time complexity is O(B · L²f) and space complexity is O(B).
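The histogram-pruned beam search just described can be sketched as follows. The scorer interface and the toy scorer are our own assumptions; in the real system the score would come from the RNN's softmax over the remaining words.

```python
# Sketch of the beam search of Sec. 3.1.2: hypotheses are
# (log-prob, emitted-prefix) pairs; at each of the L_f steps every
# hypothesis is extended with each not-yet-emitted word, and only the
# B best prefixes survive (histogram pruning, no recombination).
import math

def beam_search(L_f, score, B=4):
    beam = [(0.0, [])]                       # (log P(prefix), prefix)
    for _ in range(L_f):
        candidates = []
        for logp, prefix in beam:
            remaining = [j for j in range(L_f) if j not in prefix]
            for j in remaining:
                candidates.append((logp + math.log(score(prefix, j)),
                                   prefix + [j]))
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:B]
    return beam[0][1]                        # most probable permutation found

# Hypothetical scorer that rewards emitting word j at position j,
# so the unique optimum is the identity permutation.
def toy_score(prefix, j):
    return 2.0 if j == len(prefix) else 1.0

perm = beam_search(4, toy_score)
print(perm)  # [0, 1, 2, 3]
```

Note that each returned prefix is a valid partial permutation by construction, mirroring the fact that the model only ever scores words in J(t).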

    3.1.3 Features

We use two different feature configurations: unlexicalized and lexicalized.

In the unlexicalized configuration, the state transition input feature function x(j) is composed of the following features, all encoded using the "one-hot" encoding scheme:

• Unigram: POS(j), DEPREL(j), POS(j) ∗ DEPREL(j). Left, right and parent unigram: POS(k), DEPREL(k), POS(k) ∗ DEPREL(k), where k is the index of, respectively, the word at the left (in the original sentence), at the right, and the dependency parent of word j. Unique tags are used for padding.

• Pair features: POS(j) ∗ POS(k), POS(j) ∗ DEPREL(k), DEPREL(j) ∗ POS(k), DEPREL(j) ∗ DEPREL(k), for k defined as above.

⁴ NP-hard for at least certain choices of features and parameters.

⁵ Actually Lf − 1, since the last choice is forced.


• Triple features: POS(j) ∗ POS(leftj) ∗ POS(rightj), POS(j) ∗ POS(leftj) ∗ POS(parentj), POS(j) ∗ POS(rightj) ∗ POS(parentj).

• Bigram: POS(j) ∗ POS(k), POS(j) ∗ DEPREL(k), DEPREL(j) ∗ POS(k), where k is the previously emitted word in the permutation.

• Topological features: three binary features which indicate whether word j and the previously emitted word are in a parent-child, child-parent or sibling-sibling relation, respectively.

The target word feature function xo(j) is the same as x(j) except that each feature is also conjoined with a quantized signed distance⁶ between word j and the previously emitted word. Feature value combinations that appear fewer than 100 times in the training set are replaced by a distinguished "rare" tag.

The lexicalized configuration is equivalent to the unlexicalized one except that x(j) and xo(j) also include the surface form of word j (not conjoined with the signed distance).
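The quantized signed distance described in footnote 6 reduces to a small mapping; the function below is our own sketch of that rule (magnitudes up to 5 are kept, 6-9 collapse to 5, and 10 or more collapses to 10, preserving the sign):

```python
# Sketch of the quantized signed distance used to conjoin the
# target-word features (footnote 6).

def quantized_signed_distance(d):
    sign = -1 if d < 0 else 1
    m = abs(d)
    if 5 < m < 10:
        m = 5          # magnitudes strictly between 5 and 10 -> 5
    elif m >= 10:
        m = 10         # magnitudes of 10 or more -> 10
    return sign * m

print([quantized_signed_distance(d) for d in (3, 7, 12, -8, -15)])
# [3, 5, 10, -5, -10]
```

Quantizing the distance this way keeps the number of conjoined feature values small while still distinguishing forward from backward jumps.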

    3.2 Fragment RNN-RM

The Base RNN-RM described in the previous section includes dependency information, but not the full information of reordering fragments as defined by our automaton model (sec. 2). In order to determine whether this rich information is relevant to machine translation pre-reordering, we propose an extension, denoted as Fragment RNN-RM, which includes reordering fragment features, at the expense of a significant increase in time complexity.

We consider a hierarchical recurrent neural network. At the top level, this is defined as the previous RNN. However, the x(j) and xo(j) vectors, in addition to the feature vectors described above, now also contain the final states of another recurrent neural network. This internal RNN has a separate clock and a

⁶ Values greater than 5 and smaller than 10 are quantized as 5; values greater than or equal to 10 are quantized as 10. Negative values are treated similarly.

separate state vector. For each step t of the top-level RNN which transitions between words f′(t − 1) and f′(t), the internal RNN is reinitialized to its own initial state and performs multiple internal steps, one for each action in the fragment of the execution that the walker automaton must perform to walk between words f′(t − 1) and f′(t) in the dependency parse (with a special shortcut of length one if they are adjacent in f with monotonic relative order).

The state transition of the inner RNN is defined as:

      vr(tr) = τ(Θ(r1) · xr(tr) + ΘrREC · vr(tr − 1))    (7)

where xr(tr) is the feature function for the word traversed at inner time tr in the execution fragment, vr(0) = vinitr, and Θ(r1) and ΘrREC are parameters.

Evaluation and decoding are performed essentially in the same way as in Base RNN-RM, except that the time complexity is now O(L³f), since the length of execution fragments is O(Lf). Training is also essentially performed in the same way, though gradient computation is much more involved since gradients propagate from the top-level RNN to the inner RNN. In our implementation we just used the automatic differentiation facilities of Theano.

    3.2.1 Features

The unlexicalized features for the inner RNN input vector xr(tr) depend on the current word in the execution fragment (at index tr), the previous one, and the action label: UP, DOWN or RIGHT (shortcut). EMIT actions are not included as they always implicitly occur at the end of each fragment.

Specifically the features, encoded with the "one-hot" encoding, are: A ∗ POS(tr) ∗ POS(tr − 1), A ∗ POS(tr) ∗ DEPREL(tr − 1), A ∗ DEPREL(tr) ∗ POS(tr − 1), A ∗ DEPREL(tr) ∗ DEPREL(tr − 1). These features are also conjoined with the quantized signed distance (in the original sentence) between each pair of words. The lexicalized features just include the surface form of each visited word at tr.


3.3 Base GRU-RM

We also propose a variant of the Base RNN-RM where the standard recurrent hidden layer is replaced by a Gated Recurrent Unit (GRU) layer, recently proposed by Cho et al. (2014) for machine translation applications. The Base GRU-RM is defined as the Base RNN-RM of sec. 3.1, except that the recurrence relation (4) is replaced by the equations in fig. 2.

Features are the same as for the unlexicalized Base RNN-RM (we experienced difficulties training the Base GRU-RM with lexicalized features). Training is also performed in the same way, except that we found it more beneficial to convergence speed to optimize using Adam (Kingma and Ba, 2014)⁷ rather than AdaDelta. In principle we could also extend the Fragment RNN-RM into a Fragment GRU-RM, but we did not investigate that model in this work.

    4 Experiments

We performed German-to-English pre-reordering experiments with Base RNN-RM (both unlexicalized and lexicalized), Fragment RNN-RM and Base GRU-RM.

    4.1 Setup

The baseline phrase-based system was trained on the German-to-English corpus included in Europarl v7 (Koehn, 2005). We randomly split it into a 1,881,531 sentence pair training set, a 2,000 sentence pair development set (used for tuning) and a 2,000 sentence pair test set. The English language model was trained on the English side of the parallel corpus augmented with a corpus of sentences from AP News, for a total of 22,891,001 sentences.

The baseline system is phrase-based Moses in a default configuration with maximum distortion distance equal to 6 and lexicalized reordering enabled. Maximum phrase size is equal to 7. The language model is a 5-gram IRSTLM/KenLM model.

The pseudo-oracle system was trained on

⁷ With learning rate 2 · 10⁻⁵ and all the other hyperparameters equal to the default values in the article.

the training and tuning corpus obtained by permuting the German source side using the heuristic described in section 2.2, and is otherwise equal to the baseline system. In addition to the test set extracted from Europarl, we also used a 2,525 sentence pair test set ("news2009") and a 3,000 sentence pair "challenge" set used for the WMT 2013 translation task ("news2013").

We also trained a Moses system with pre-reordering performed by the Collins et al. (2005) rules, as implemented by Howlett and Dras (2011). Constituency parsing for the Collins et al. (2005) rules was performed with the Berkeley parser (Petrov et al., 2006), while non-projective dependency parsing for our models was performed with the DeSR transition-based parser (Attardi, 2006).

For our experiments, we extracted approximately 300,000 sentence pairs from the Moses training set based on a heuristic confidence measure of word-alignment quality (Huang, 2009; Navratil et al., 2012). We randomly removed 2,000 sentences from this filtered dataset to form a validation set for early stopping; the rest were used for training the pre-reordering models.

    4.2 Results

The hidden state size s of the RNNs was set to 100, while it was set to 30 for the GRU model; validation was performed every 2,000 training examples. After 50 consecutive validation rounds without improvement, training was stopped and the set of training parameters that resulted in the lowest validation cross-entropy was saved. Training took approximately 1.5 days for the unlexicalized Base RNN-RM, 2.5 days for the lexicalized Base RNN-RM and for the unlexicalized Base GRU-RM, and 5 days for the unlexicalized Fragment RNN-RM on a 24-core machine without GPU (CPU load never rose to more than 400%).
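The early-stopping schedule described above can be sketched as follows. The 50-round patience comes from the text; the `train_step` and `validate` callables are placeholders standing in for a validation round over 2,000 training examples and the cross-entropy computation, so this is a sketch of the schedule rather than the authors' implementation:

```python
def train_with_early_stopping(train_step, validate, max_rounds=100000,
                              patience=50):
    """Stop after `patience` consecutive validation rounds without
    improvement; keep the parameters with the lowest validation
    cross-entropy seen so far."""
    best_loss = float("inf")
    best_params = None
    rounds_without_improvement = 0
    for _ in range(max_rounds):
        params = train_step()    # train on the next 2,000 examples
        loss = validate(params)  # validation cross-entropy
        if loss < best_loss:
            best_loss, best_params = loss, params
            rounds_without_improvement = 0
        else:
            rounds_without_improvement += 1
            if rounds_without_improvement >= patience:
                break
    return best_params, best_loss
```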

Decoding was performed with a beam size of 4. Decoding the whole corpus took about 1.0-1.2 days for all the models except the Fragment


v_rst(t) = π(Θ(1)_rst · x(t) + ΘREC_rst · v(t − 1))

v_upd(t) = π(Θ(1)_upd · x(t) + ΘREC_upd · v(t − 1))

v_raw(t) = τ(Θ(1) · x(t) + ΘREC · (v(t − 1) ⊙ v_upd(t)))

v(t) = v_rst(t) ⊙ v(t − 1) + (1 − v_rst(t)) ⊙ v_raw(t)

(8)

Figure 2: GRU recurrence equations. v_rst(t) and v_upd(t) are the activation vectors of the "reset" and "update" gates, respectively, ⊙ denotes the elementwise product, and π(·) is the logistic sigmoid function.
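A minimal NumPy sketch of one step of the Figure 2 recurrence is given below. The parameter names are illustrative, and τ is assumed to be tanh (as in the base recurrence); this is a reading of the equations, not the authors' Theano implementation:

```python
import numpy as np

def gru_step(x, v_prev, params):
    """One step of the GRU recurrence in Figure 2.

    params maps names to weight matrices: W1_* applied to the input
    x(t), Wrec_* applied to the previous hidden state v(t - 1).
    """
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))  # pi(.)
    v_rst = sigmoid(params["W1_rst"] @ x + params["Wrec_rst"] @ v_prev)
    v_upd = sigmoid(params["W1_upd"] @ x + params["Wrec_upd"] @ v_prev)
    # Candidate state: previous state gated elementwise by the update gate.
    v_raw = np.tanh(params["W1"] @ x + params["Wrec"] @ (v_prev * v_upd))
    # Interpolate between previous state and candidate via the reset gate.
    return v_rst * v_prev + (1.0 - v_rst) * v_raw
```

Note that, as written in the paper, the "update" gate modulates the previous state inside the candidate while the "reset" gate performs the interpolation; the sketch follows the paper's equations rather than the more common naming.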

Reordering                BLEU    improvement
none                      62.10
unlex. Base RNN-RM        64.03   +1.93
lex. Base RNN-RM          63.99   +1.89
unlex. Fragment RNN-RM    64.43   +2.33
unlex. Base GRU-RM        64.78   +2.68

Figure 3: "Monolingual" reordering scores (upstream system output vs. "oracle"-permuted German) on the Europarl test set. All improvements are significant at the 1% level.

RNN-RM, for which it took about 3 days. Effects on monolingual reordering score are

shown in fig. 3; effects on translation quality are shown in fig. 4.

    4.3 Discussion and analysis

All our models significantly improve over the phrase-based baseline, performing as well as or almost as well as Collins et al. (2005), which is an interesting result since our models don't require any specific linguistic expertise.

Surprisingly, the lexicalized version of the Base RNN-RM performed worse than the unlexicalized one. This runs contrary to expectation, as neural language models are usually lexicalized and in fact often use nothing but lexical features.

The unlexicalized Fragment RNN-RM was quite accurate but very expensive both during training and decoding, so it may not be practical.

The unlexicalized Base GRU-RM performed very well, especially on the Europarl dataset (where all the scores are much higher than on the other datasets), and it never performed significantly worse than the unlexicalized Fragment

RNN-RM, which is much slower. We also performed exploratory experiments

with different feature sets (such as lexical-only features), but we couldn't obtain a good training error. Larger network sizes should increase model capacity and may possibly enable training on simpler feature sets.

    5 Conclusions

We presented a class of statistical syntax-based pre-reordering systems for machine translation. Our systems process source sentences parsed with non-projective dependency parsers and permute them into a target-like word order, suitable for translation by an appropriately trained downstream phrase-based system.

The models we proposed are completely trained with machine learning approaches and are, in principle, capable of generating arbitrary permutations, without the hard constraints that are commonly present in other statistical syntax-based pre-reordering methods. Practical constraints depend on the choice of features and are therefore quite flexible, allowing a trade-off between accuracy and speed.

    In our experiments with the RNN-RM and


Test set   system                   BLEU    improvement
Europarl   baseline                 33.00
Europarl   "oracle"                 41.80   +8.80
Europarl   Collins                  33.52   +0.52
Europarl   unlex. Base RNN-RM       33.41   +0.41
Europarl   lex. Base RNN-RM         33.38   +0.38
Europarl   unlex. Fragment RNN-RM   33.54   +0.54
Europarl   unlex. Base GRU-RM       34.15   +1.15
news2013   baseline                 18.80
news2013   Collins                  NA      NA
news2013   unlex. Base RNN-RM       19.19   +0.39
news2013   lex. Base RNN-RM         19.01   +0.21
news2013   unlex. Fragment RNN-RM   19.27   +0.47
news2013   unlex. Base GRU-RM       19.28   +0.48
news2009   baseline                 18.09
news2009   Collins                  18.74   +0.65
news2009   unlex. Base RNN-RM       18.50   +0.41
news2009   lex. Base RNN-RM         18.44   +0.35
news2009   unlex. Fragment RNN-RM   18.60   +0.51
news2009   unlex. Base GRU-RM       18.58   +0.49

Figure 4: RNN-RM translation scores. All improvements are significant at the 1% level.

GRU-RM models, we managed to achieve translation quality improvements comparable to those of the best hand-coded pre-reordering rules.

    References

Yaser Al-Onaizan and Kishore Papineni. 2006. Distortion models for statistical machine translation. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, ACL-44, pages 529–536, Stroudsburg, PA, USA. Association for Computational Linguistics.

Giuseppe Attardi. 2006. Experiments with a multilanguage non-projective dependency parser. In Proceedings of the Tenth Conference on Computational Natural Language Learning, CoNLL-X '06, pages 166–170, Stroudsburg, PA, USA. Association for Computational Linguistics.

James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. 2010. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), June. Oral Presentation.

Alexandra Birch, Miles Osborne, and Philipp Koehn. 2008. Predicting success in machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '08, pages 745–754, Stroudsburg, PA, USA. Association for Computational Linguistics.

Arianna Bisazza and Marcello Federico. 2013. Efficient solutions for word reordering in German-English phrase-based statistical machine translation. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 440–451, Sofia, Bulgaria, August. Association for Computational Linguistics.

Cristina Bosco and Vincenzo Lombardo. 2004. Dependency and relational structure in treebank annotation. In COLING 2004 Recent Advances in Dependency Grammar, pages 1–8.

Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259.

Michael Collins, Philipp Koehn, and Ivona Kučerová. 2005. Clause restructuring for statistical machine translation. In Proceedings of


the 43rd Annual Meeting of the Association for Computational Linguistics, pages 531–540. Association for Computational Linguistics.

Nadir Durrani, Helmut Schmid, and Alexander Fraser. 2011. A joint sequence translation model with integrated reordering. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 1045–1054. Association for Computational Linguistics.

Chris Dyer and Philip Resnik. 2010. Context-free reordering, finite-state translation. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT '10, pages 858–866, Stroudsburg, PA, USA. Association for Computational Linguistics.

Minwei Feng, Arne Mauser, and Hermann Ney. 2010. A source-side decoding sequence model for statistical machine translation. In Conference of the Association for Machine Translation in the Americas (AMTA).

C. Gallicchio and A. Micheli. 2011. Architectural and markovian factors of echo state networks. Neural Networks, 24(5):440–456.

Dmitriy Genzel. 2010. Automatically learning source-side reordering rules for large scale machine translation. In Proceedings of the 23rd International Conference on Computational Linguistics, COLING '10, pages 376–384, Stroudsburg, PA, USA. Association for Computational Linguistics.

Alex Graves. 2013. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.

Susan Howlett and Mark Dras. 2011. Clause restructuring for SMT not absolutely helpful. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 384–388.

Fei Huang. 2009. Confidence measure for word alignment. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2 - Volume 2, pages 932–940. Association for Computational Linguistics.

Herbert Jaeger. 2001. The echo state approach to analysing and training recurrent neural networks - with an erratum note. Bonn, Germany: German National Research Center for Information Technology GMD Technical Report, 148:34.

Maxim Khalilov and José A. R. Fonollosa. 2011. Syntax-based reordering for statistical machine translation. Computer Speech & Language, 25(4):761–788.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, ACL '07, pages 177–180, Stroudsburg, PA, USA. Association for Computational Linguistics.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Conference Proceedings: the tenth Machine Translation Summit, pages 79–86, Phuket, Thailand. AAMT.

Uri Lerner and Slav Petrov. 2013. Source-side classifier preordering for machine translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP '13).

Antonio Valerio Miceli Barone and Giuseppe Attardi. 2013. Pre-reordering for machine translation using transition-based walks on dependency parse trees. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 164–169, Sofia, Bulgaria, August. Association for Computational Linguistics.

Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernockỳ, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In INTERSPEECH, pages 1045–1048.

Jiri Navratil, Karthik Visweswariah, and Ananthakrishnan Ramanathan. 2012. A comparison of syntactic reordering methods for English-German machine translation. In COLING, pages 2043–2058.

Franz Josef Och, Christoph Tillmann, Hermann Ney, et al. 1999. Improved alignment models for statistical machine translation. In Proc. of the Joint SIGDAT Conf. on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 20–28.

Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 433–440. Association for Computational Linguistics.


Roy Tromble and Jason Eisner. 2009. Learning linear ordering problems for better translation. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2 - Volume 2, EMNLP '09, pages 1007–1016, Stroudsburg, PA, USA. Association for Computational Linguistics.

Karthik Visweswariah, Rajakrishnan Rajkumar, Ankur Gandhe, Ananthakrishnan Ramanathan, and Jiri Navratil. 2011. A word reordering model for improved machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '11, pages 486–496, Stroudsburg, PA, USA. Association for Computational Linguistics.

Matthew D. Zeiler. 2012. Adadelta: An adaptive learning rate method. arXiv preprint arXiv:1212.5701.


Proceedings of SSST-9, Ninth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 21–29, Denver, Colorado, June 4, 2015. ©2015 Association for Computational Linguistics

    Translating Negation: Induction, Search And Model Errors

Federico Fancellu and Bonnie Webber
School of Informatics

University of Edinburgh
11 Crichton Street, Edinburgh

f.fancellu[at]sms.ed.ac.uk, bonnie[at]inf.ed.ac.uk

    Abstract

Statistical Machine Translation systems show considerably worse performance in translating negative sentences than positive ones (Fancellu and Webber, 2014; Wetzel and Bond, 2012). Various techniques have addressed the problem of translating negation, but their underlying assumptions have never been validated by a proper error analysis. A related paper (Fancellu and Webber, 2015) reports on a manual error analysis of the kinds of errors involved in translating negation. The present paper presents ongoing work to discover their causes by considering which, if any, are induction, search or model errors. We show that standard oracle decoding techniques provide little help due to the locality of negation scope and their reliance on a single reference. We are working to address these weaknesses using a chart analysis based on oracle hypotheses, guided by the negation elements contained in a source span and by how these elements are expected to be translated at each decoding step. Preliminary results show chart analysis is able to give a more in-depth analysis of the above errors and better explains the results of the manual analysis.

    1 Introduction

In recent years there has been increasing interest in improving the quality of SMT systems over a wide range of linguistic phenomena, including coreference resolution (Hardmeier et al., 2014) and modality (Baker et al., 2012). Negation, however, is a problem that has still not been researched thoroughly (section 2).

Our previous study (Fancellu and Webber, 2015) takes a first step towards understanding why negation is a problem in SMT, through manual analysis of the kinds of errors involved in its translation. Our error analysis employs a small set of standard string-based operations, applying them to the semantic elements involved in the meaning of negation (section 3).

The present paper describes our current work on understanding the causes of these errors. Focussing on the distinction between induction, search and model errors, we point out the challenges in trying to use existing techniques to quantify these three types of errors in the context of translating negation.

Previous work on ascribing errors to induction, search, or model has taken an approach using oracle decoding, i.e. forcing the decoder to reconstruct the reference sentence as a proxy for analysing its potential. We show, however, that this technique does not suit semantic phenomena with local scope (such as negation) well, given that a conclusion drawn from the reconstruction of an entire sentence might refer to spans unrelated to them. Moreover, as in previous work, we stress once again the limitation of using a single reference to compute the oracle (section 4.1).

To overcome these problems, we propose the use of an oracle hypothesis, instead of an oracle sentence, that relies uniquely on the negation elements contained in the source span and on how these are expected to be translated in the target hypothesis at a given time during decoding (section 4.2).

    Sections 5 and 6 report results of the analysison a Chinese-to-English Hierarchical Phrase Based


Model (Chiang, 2007). We show that even if it is possible to detect the presence of model errors through the use of an oracle sentence, computing an oracle hypothesis at each step during decoding offers a more robust, in-depth analysis of the problem of translating negation and helps explain the errors observed during the manual analysis.

    2 Previous Work

While recent years have seen work on automatically detecting negation in monolingual texts (Chowdhury and Mahbub, 2012; Read et al., 2012), SMT has mainly considered it a side problem. For this reason, no actual analysis of the type of errors involved in translating negation or their causes has been specifically carried out. The standard approach has been to formulate a hypothesis about what can go wrong when translating negation, modify the SMT system in a way aimed at reducing the number of times that happens, and then assume that any increase in BLEU score - the standard automatic evaluation metric used in SMT - confirms the initial hypothesis. Collins et al. (2005) and Li et al. (2009) consider negation, along with other linguistic phenomena, as a problem of structural mismatch between source and target; Wetzel and Bond (2012) consider it instead as a problem of training data sparsity; finally, Baker et al. (2012) and Fancellu and Webber (2014) consider it as a model problem, where the system needs enhancement with respect to the semantics of negation.

Only a few efforts have tried to investigate errors occurring during decoding. Automatic evaluation metrics are in fact only informative about the quality of the output, but not about the decoding process that produces the output. As such, the most relevant related work consists of two studies on the main categories of errors during decoding (Auli et al., 2009; Wisniewski and Yvon, 2013). Both works use the reference sentence as a proxy to generate an oracle hypothesis, but they differ in the technique they use and in the problem they are interested in analysing. Auli et al. (2009) target induction errors, i.e. cases where a good translation is absent from the search space, by forcing the decoder to generate the reference sentence with varying translation options (for each source span) and distortion limits. If, when increasing the number of target translations considered for each span, the number of references that it is possible to fully generate also increases, an induction error has occurred. Results on a French-to-English PBSMT system validate this hypothesis.

Wisniewski and Yvon (2013) consider instead oracle decoding as a proxy to distinguish search vs. model errors. If the oracle translation has a model score higher than the 1-best system output, a search error has occurred, since the system could not output the hypothesis with the highest probability; in contrast, a model error has occurred when the scoring function is unable to rank translations correctly. Here, the oracle translation is generated via ILP by maximising the unigram recall between oracle and reference translation, resembling the work of Germann et al. (2001) on optimal decoding in word-based models. In both Auli et al. (2009) and Wisniewski and Yvon (2013), almost all the errors during decoding are model errors.

A shortcoming of both methods is that neither can generate more than 35% of the references in the test set, by virtue of taking only one particular reference as the oracle, despite there usually being many ways that a source sentence can be translated.

    3 Manual Error Analysis

This section briefly summarises the key points of the manual error analysis described in (Fancellu and Webber, 2015), since they also underpin the automated analysis described in section 4. The manual error analysis makes two assumptions:

• the semantic structure of negation can be annotated in a similar way across different languages, because the essentials of negation are language-independent.

• for analytic languages like English and Chinese, a set of string-based operations (deletion, insertion and reordering) can be used to assess translation errors in the semantics of negation.

Both assumptions involve first of all reducing a rather abstract semantic phenomenon to elements tangible at string level. Following Blanco and Moldovan (2011), Morante and Blanco (2012) and Fancellu and Webber (2014), we decompose negation into its three main components, described below, and use them as the target of our analysis.

• Cue, i.e. the word or multi-word unit inherently expressing negation (e.g. 'He is not washing his clothes')

• Event, i.e. the lexical event the cue directly refers to (e.g. 'He is not washing his clothes')

• Scope, i.e. all the elements whose falsity would prove the statement to be true (e.g. 'He is not washing his clothes'); the event is taken to be part of the scope, since its falsity influences the truth value of negation. In the error analysis, however, we exclude the event from the scope (since it is already considered per se) and further decompose the scope, to isolate the semantic fillers within its boundaries (He, his clothes), here taken to be PropBank-like semantic roles.

Given that we are combining standard, widely used error categories and language-independent semantic elements, we expect the annotation process and the error analysis to be robust and applicable to languages other than English and Chinese.

Results show an in-depth analysis of negation-related errors, where we are able to discern clearly which operations affect which elements and to what extent. We found the cue to be the element least prone to translation errors, with only four cases of it being deleted during translation. We also found reordering to be the most frequent error category, especially for the fillers, given that the SMT system does not possess explicit knowledge of semantic frames and their boundaries.

By making use of the decoding trace, containing the rules used to build the 1-best hypothesis, we could also inspect the causes of deletion and insertion. We found that almost all deletion and insertion errors are caused by a wrong rule application that translates a Chinese source span containing negation into an English hypothesis that does not, or vice versa. OOV items do not seem to constitute a problem when translating negation. This is important especially in the case of the cue, whose absence means that the whole negation instance is lost. Given that all the cues in the test set have been seen during

training, we also know the system has the ability to potentially reproduce negation on the target side.

    4 Automatic Error Analysis

The manual error analysis can only get us as far as analysing the 1-best hypothesis and its building blocks. No explicit information on the causes of these errors can be recovered from the decoding trace alone. To address this problem, we introduce two different techniques to analyse and distinguish different kinds of errors occurring at decoding time.

First, however, we give a more formal definition of the three main categories of decoding-related errors as follows, where e and p(e) are the optimal translation the decoder can produce, along with its probability, while ê and p(ê) stand for the 1-best output and its probability.

• Search error: e ≠ ê and p(e) > p(ê); the 1-best output is not the most probable output, given the model. Search errors are a consequence of the impossibility of exploring th
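The definitions above can be sketched as a small classifier over decoder outputs and their model scores. The function name and argument layout below are illustrative, not from the paper; the model-error branch follows the characterisation used by Wisniewski and Yvon (2013), discussed in section 2:

```python
def classify_decoding_error(p_best, p_optimal, best_output, optimal_output):
    """Classify a decoding outcome given the 1-best output (score p_best)
    and the optimal translation the decoder can produce (score p_optimal).

    Search error: the outputs differ yet the model scores the optimal
    translation higher, so the decoder failed to find its own optimum.
    Model error: the scoring function ranks the 1-best output at least
    as high as the optimal translation, i.e. it misranks translations.
    """
    if best_output == optimal_output:
        return "no error"
    if p_optimal > p_best:
        return "search error"
    return "model error"
```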