This document is part of the Research and Innovation Action “Quality Translation 21 (QT21).” This project has received funding from the European Union’s Horizon 2020 program for ICT under grant agreement no. 645452.
Deliverable D5.7
Data Management Plan (final)
Author(s): Christian Dugast (DFKI) Marco Turchi (FBK) Lucia Specia (USFD) Kim Harris (t&f) Anna Samiotou (TAUS) Jan Niehues (KIT) Raivis Skadiņš (TILDE) Roberts Rozis (TILDE) Phil Williams (UEDIN)
Dissemination Level: Public Date: 2018-01-31
Copyright: No part of this document may be reproduced or transmitted in any form, or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission from the copyright owner.
Quality Translation 21
D5.7 Data Management Plan- final
Page 2 of 52
Grant agreement no. 645452
Project acronym QT21
Project full title Quality Translation 21
Type of action Research and Innovation Action
Coordinator Prof Josef van Genabith (DFKI)
Start date, duration 1 February 2015, 36 months
Dissemination level Public
Contractual date of delivery 31/07/2017
Actual date of delivery 31/01/2018
Deliverable number D5.7
Deliverable title Data Management Plan
Type Report
Status and version Version 10 Final
Number of pages 52
Contributing partners DFKI, FBK, KIT, UEDIN, TAUS
WP leader DFKI
Task leader DFKI
Author(s) Christian Dugast, Marco Turchi, Lucia Specia, Kim Harris, Anna Samiotou, Jan Niehues
EC project officer Susan Fraser
The partners in QT21 are: Deutsches Forschungszentrum für Künstliche Intelligenz GmbH (DFKI), Germany Rheinisch-Westfälische Technische Hochschule Aachen (RWTH), Germany Universiteit van Amsterdam (UvA), Netherlands Dublin City University (DCU), Ireland University of Edinburgh (UEDIN), United Kingdom Karlsruher Institut für Technologie (KIT), Germany Centre National de la Recherche Scientifique (CNRS), France Univerzita Karlova v Praze (CUNI), Czech Republic Fondazione Bruno Kessler (FBK), Italy University of Sheffield (USFD), United Kingdom TAUS b.v. (TAUS), Netherlands text & form GmbH (TAF), Germany TILDE SIA (TILDE), Latvia Hong Kong University of Science and Technology (HKUST), Hong Kong
For copies of reports, updates on project activities, and other QT21-related information, contact: DFKI GmbH QT21
Prof. Stephan Busemann Stuhlsatzenhausweg 3 Campus D3_2 D-66123 Saarbrücken, Germany
[email protected] Phone: +49 (681) 85775 5286 Fax: +49 (681) 85775 5338
Copies of reports and other material can also be accessed via http://www.qt21.eu © 2016 QT21 Consortium
Contents
1 Executive Summary
2 MT Generic Training Data
2.1 Introduction
2.2 Data Description
2.3 Standards and Metadata
2.4 Sharing
2.5 Archiving
3 Human Annotations
3.1 Introduction and Scope
3.1.1 Post-Editing
3.1.2 Error Annotation
3.2 Data description
3.3 Data Sharing
3.4 Archiving
4 WMT Automatic Post Editing and Quality Estimation Tasks
4.1 Introduction and Scope
4.2 Data description
4.3 Standards and Metadata
4.4 Sharing
4.5 Archiving
5 WMT News evaluation campaigns
5.1 Introduction and Scope
5.2 Data description
5.3 Standards and Metadata
5.4 Sharing
5.5 Archiving
A Annex A: SGML format of bi-lingual text (TM)
B Annex B: Generation of HPE and HEA Data - Methodology
B.1. Corpus Selection
B.2. Segment Selection
C Annex C: Guidelines for Post-Editing Only
C.1. How to Achieve “Good Enough” Quality
C.2. How to Achieve Quality Similar or Equal to Human Translation
C.3. Best Practices for Evaluating Post-Editor Performance
C.4. Defining Goals
C.5. Structuring Your Analysis
C.6. Analysing Results
D Annex D: Error-Annotation and its Related Post-Editing Guidelines
D.1. Annotation and post-editing level
D.2. What is an Error?
D.3. The Annotation Process
D.4. Tricky Cases
D.5. Minimal Mark-Up
D.6. Issue Categories
D.6.1 Accuracy Issues
D.6.2 Terminology Issues
D.6.3 Locale Convention Issues
D.6.4 Fluency Issues
D.7. Decision Tree
E Annex E: Schema for MQM Annotations in XML Format
F Annex F: XSLT Stylesheet for Converting Annotated XML into HTML
G Annex G: Example of an XML File with a Single Annotated Segment
H Annex H: Prose Explanation of the Basic Elements and Attributes
I Annex I: Licence Agreement for Using HPE and HEA Data
J Annex J: Glossary of Terms
1 Executive Summary¹
This Data Management Plan (DMP) reports on the final state (as of month 36) of the data QT21 has used and generated.
This document follows the structure recommended for all Horizon 2020 DMPs: it describes the data selection methodology and then formally describes the data. The formal data description starts with a name and a reference to the data, followed by a description of its content. Standards, data sharing and data archiving are then addressed.
QT21 has produced four data sets. This document therefore presents four DMPs: the first relates to WP1 and WP2, the second and third are specific to WP3, and the fourth to WP4.
The first DMP is organised around the data used and produced together with the CRACKER project for the Workshop on Machine Translation (WMT, http://www.statmt.org/) in order to train and fine-tune SMT engines. This data set has been used by Work Packages 1 and 2 (WP1, WP2, see section 2).
Two other DMPs have been defined with respect to the work done in WP3 which deals with continuous learning on domain-specific datasets.
The first DMP for WP3 (section 3) covers the human annotations (human post-edits and human error-annotations) produced by professional translators on the output of the MT engines for two domains, Information Technology (IT) and Pharmacy, as well as on the output of a selected set of WMT17 news translation engines on WMT17 news data. An artificial corpus for training Automatic Post-Editing systems, called eSCAPE, has also been created; it contains no HPEs, only triplets of source, reference and target segments. In addition, a data set has been built for training machine-learning quality estimation (QE) models, primarily at the word level but potentially also at the phrase, sentence and document levels. The goal is to use this dataset for the WMT18 shared task on quality estimation².
The second DMP for WP3 (section 4) is contained in the previous one but is defined independently, as it is a referenceable set used for the WMT APE task.
Guidelines to help produce human post-edits and human error-annotations are appended to this deliverable.
Last but not least, the fourth DMP (for WP4) covers the data produced for three WMT “translation tasks” that ran in 2016 and 2017. The data for WMT 2018 is ready but not yet publicly available. This data production has been organised jointly (as a shared task) with the EC-funded project CRACKER (see section 5.5).
¹ See Annex J for a glossary of terms used in this document.
² This will be the largest collection of data points of its kind, approximately 10K segments totalling 100K words, with fine-grained error annotation (MQM). The currently largest existing dataset of this type (albeit simpler, without the severity levels we describe below) has only 1.9K segments (the WMT14 quality estimation English-Spanish dataset).
2 MT Generic Training Data
2.1 Introduction
WP1 and WP2 are mainly focused on improving technology for the language pairs considered. Neither WP has specific data requirements. As a consequence, both WPs rely on existing data sets and on the news crawls organised for WMT.
For the German-English and Czech-English language pairs, well-established test and training sets are available from the Workshop on Statistical Machine Translation (http://www.statmt.org/). Using these data sets, we can compare performance not only within the project but also across the research community.
The data consists of translated news articles to and from different languages. In order to concentrate on method comparison, the training data is limited to the data available for the WMT Evaluations. During the three years of the project life, QT21 has followed the constraints given by WMT.
All this data is downloadable from http://www.statmt.org/wmtXX/translation-task.html, XX being the year of the WMT campaign.
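The URL for a given campaign year can be derived mechanically from this pattern. A minimal Python sketch (the helper names are our own; exact file names per year may differ on statmt.org):

```python
# Construct WMT translation-task URLs for a given campaign year,
# following the statmt.org naming convention described above.

def translation_task_url(year: int) -> str:
    """Translation-task page for a WMT campaign year (wmtXX)."""
    return f"http://www.statmt.org/wmt{year % 100:02d}/translation-task.html"

def news_crawl_archive_url(year: int) -> str:
    """Monolingual news-crawl tarball used in sections 2.2 and 2.4."""
    return (f"http://data.statmt.org/wmt{year % 100:02d}/translation-task/"
            "training-monolingual-news-crawl.tgz")

for y in (2016, 2017, 2018):
    print(translation_task_url(y))
    print(news_crawl_archive_url(y))
```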
2.2 Data Description
Table 1 WMT 2016 News Crawl
Resource Name WMT 2016 News Crawl
Resource Type corpus
Media Type text
Language(s) English, German, Czech and Romanian.
Licence The source data are crawled from online news sites and carry the respective licensing conditions.
Distribution Medium
downloadable
Usage Building MT systems
Size 4.8 GB
Description This data set consists of text crawled from online news, with the HTML stripped out and sentences shuffled.
Table 2 WMT 2017 News Crawl
Resource Name WMT 2017 News Crawl
Resource Type corpus
Media Type text
Language(s) English, German, Czech and Latvian.
Licence The source data are crawled from online news sites and carry the respective licensing conditions.
Distribution Medium
downloadable
Usage Building MT systems
Size 3.7 GB
Description This data set consists of text crawled from online news, with the HTML stripped out and sentences shuffled.
Table 3 WMT 2018 News Crawl
Resource Name WMT 2018 News Crawl³
Resource Type corpus
Media Type text
Language(s) English, German, Czech and Finnish.
Licence The source data are crawled from online news sites and carry the respective licensing conditions.
Distribution Medium
downloadable
Usage Building MT systems
Size 4 GB
Description
This data set consists of text crawled from online news, with the HTML stripped out and sentences shuffled. Download: http://data.statmt.org/wmt18/translation-task/training-monolingual-news-crawl.tgz
2.3 Standards and Metadata
The data was collected over several years and is available in standard formats. An exact description can be found specifically for each year at http://www.statmt.org/wmt16/translation-task.html, http://www.statmt.org/wmt17/translation-task.html, http://www.statmt.org/wmt18/translation-task.html. See Annex A for an example of the format used.
³ The WMT 2018 News Crawl will be made available at the end of February 2018.
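Annex A shows the SGML wrapping used for these sets. As a rough illustration, a minimal extractor, assuming the `<seg id="...">` convention commonly used for WMT sets (the sample input below is illustrative, not taken from a real release):

```python
import re

# Minimal extractor for WMT-style SGML files: each translatable unit
# is wrapped in a <seg id="..."> element. Real files add <doc> and
# set-level metadata; this sketch only pulls out the segments.
SEG_RE = re.compile(r'<seg id="(\d+)">(.*?)</seg>', re.DOTALL)

def extract_segments(sgml_text: str) -> dict:
    """Map segment id -> segment text for one SGML document."""
    return {int(i): s.strip() for i, s in SEG_RE.findall(sgml_text)}

sample = '''<doc docid="example" genre="news">
<seg id="1">This is the first sentence.</seg>
<seg id="2">And this is the second.</seg>
</doc>'''

print(extract_segments(sample))
```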
2.4 Sharing
The data is freely available at
- http://data.statmt.org/wmt16/translation-task/training-monolingual-news-crawl.tgz
- http://data.statmt.org/wmt17/translation-task/training-monolingual-news-crawl.tgz
- http://data.statmt.org/wmt18/translation-task/training-monolingual-news-crawl.tgz.
Licences to be accepted by users prior to downloading the data are standard licences.
2.5 Archiving
All datasets produced are provided and made sustainable through the existing META-SHARE repositories, or new repositories that partners may choose to set up and link to the META-SHARE network (e.g. LINDAT). Data sets are locally stored in the repositories’ storage layer in compressed format.
The data will remain available for download from http://www.statmt.org/. This website is currently hosted at the University of Edinburgh.
3 Human Annotations
3.1 Introduction and Scope
The success of WP3 is connected to the availability of data containing human feedback in the form of Human Post-Editing (HPE) and/or Human Error Annotation (HEA) of MT errors. HPE is about “what” is wrong: it corrects translations and provides insight into what text is corrected. HEA is about “why” it is wrong: it identifies and names specific errors and thus is useful for understanding why corrections are made and what types of errors are made.
Table 4 gives, for each QT21 language pair, an overview of the volume of human-generated data (post-edits and error annotations) that the project has produced based on domain-specific corpora. Tables 5 and 6 give the details.
The consortium has also produced a small corpus of HPE and HEA on WMT17 data (see Table 7 and Table 8).
Table 4 WP3- QT21 language pairs and related HPE and HEA volumes in number of segments. The error annotation volume consists of 200 double and 1,600 single annotated segments.
Language Pair | MT Engine Used to Produce Target | Post-Editing Volume: source words (segments) | Error Annotation Volume | Data Set Label
EN-DE | PBMT | 420,000 (30,000) | 2,200 | Set A
EN-DE | NMT | 420,000 (30,000) | 2,200 | Set A
EN-CS | PBMT | 630,000 (45,000) | 2,200 | Set A
EN-LV | PBMT | 330,000 (22,500) | 2,000 | Set B
EN-LV | NMT | 330,000 (22,500) | 2,000 | Set B
DE-EN | PBMT | 630,000 (45,000) | 2,200 | Set C
During the three years of the project, WP3 has used the two typologies to generate two sets of human-annotated data with the following content:
1. HPE sentences: for each source sentence, the reference, the MT output (target) and the post-edited sentence produced by professional translators are made available;
2. HEA information: for each source sentence, the target, the post-edited sentence and the post-edited sentence enriched with one or two error annotations are provided by professional translators using MQM, the harmonised error metric developed in WP3.
The methodology used to produce this data set is described in Annex B.
In order to produce HPE and HEA, professional human translators have followed the guidelines described in the Annexes C and D.
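Quadruplets of this shape are commonly summarised by HTER, the token-level edit distance between the MT output and its post-edit, normalised by post-edit length. A simplified sketch (true HTER/TER also counts block shifts, which this omits):

```python
def edit_distance(a: list, b: list) -> int:
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (wa != wb)))   # substitution
        prev = cur
    return prev[-1]

def hter(mt: str, post_edit: str) -> float:
    """Edit distance between MT output and its human post-edit,
    normalised by post-edit length (shift operations omitted)."""
    mt_tok, pe_tok = mt.split(), post_edit.split()
    return edit_distance(mt_tok, pe_tok) / max(len(pe_tok), 1)

print(hter("the house is blue", "the house is red"))  # one substitution over four words: 0.25
```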
3.1.1 Post-Editing
The post-editing guidelines (see Annex C) are aimed at helping QT21 participants (project managers, post-editors, evaluators) set clear expectations and can be used as a basis on which to instruct post-editors. It is not practical to present a set of guidelines that covers all scenarios; they are better used as baseline guidelines and tailored to the given purpose. Generally, these guidelines assume bilingual post-editing that is ideally carried out by a paid translator, but in some scenarios it could be carried out by bilingual
domain experts or volunteers. While the QT21 project aims at delivering 15,000 segments in five language pairs, the guidelines presented here are not system- or language-specific and can therefore be applied throughout the whole project.
3.1.2 Error Annotation
In QT21, error annotation has always been performed on segments that are also post-edited⁴. This means that the HEA and HPE guidelines had to be harmonised, which leads to more precise guidelines for the post-editing process when a segment has also been annotated for errors: the specific error-annotation and related post-editing guidelines are described in Annex D.
For error annotations, an XML form developed in the QTLaunchPad project has been used. It groups together the results of multiple annotators and provides a number of features.
The permissible elements and attributes are defined in the schema (annotations.xsd) included in Annex E of this document.
The XSLT stylesheet included in Annex F can be used to convert the XML format into an HTML output format.
Annex G gives an example of an XML file containing one annotated segment.
Annex H gives a prose description of the basic XML elements and attributes.
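To illustrate the kind of XML-to-HTML conversion the stylesheet performs, here is a toy sketch. The element and attribute names below (`<segment>`, `<issue type="...">`) are hypothetical stand-ins; the real vocabulary is fixed by annotations.xsd (Annex E) and the conversion is normally done with the XSLT stylesheet (Annex F):

```python
import xml.etree.ElementTree as ET

# Toy conversion of one annotated segment to HTML. Element and
# attribute names here are hypothetical, not taken from annotations.xsd.
sample = ('<segment id="1">The house <issue type="Fluency">are</issue> blue.'
          '</segment>')

def to_html(xml_text: str) -> str:
    """Wrap the segment in <p> and each marked issue in a classed <span>."""
    seg = ET.fromstring(xml_text)
    parts = [seg.text or ""]
    for issue in seg:
        parts.append(f'<span class="{issue.get("type")}">{issue.text}</span>')
        parts.append(issue.tail or "")
    return "<p>" + "".join(parts) + "</p>"

print(to_html(sample))
```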
3.2 Data description
Table 5 Domain-Specific Human Post-Edited data set
Resource Name QT21 Domain-Specific Human Post-Edited data set
Resource Type corpus
Media Type text
Language(s) English to German, English to Czech, English to Latvian, German to English
Licence
QT21-TAUS Terms of Use. TAUS grants access to QT21 Users to the WMT Data Set with the following rights: i) the right to use the target side of the translation units in a commercial product, provided that QT21 User may not resell the WMT Data Set as if it is his own new translation; ii) the right to make Derivative Works; and iii) the right to use or resell such Derivative Works commercially and for the following goals: i) research and benchmarking; ii) piloting new solutions; and iii) testing of new commercial services.
Distribution Medium
downloadable
⁴ At the same time and by the same individual, in order to ensure consistency between post-editing and error annotation.
Usage Training of Automatic Post-editing and Quality Estimation components / Error Analysis
Size 70 MB
Description
Set of 195,000 domain-specific Human Post-Edited (HPE) quadruplets for four language pairs and six translation engines. Each quadruplet consists of (source, reference, target, HPE). The domain for En-De and En-Cz is IT; the domain for En-Lv and De-En is Pharma. A total of six translation engines have been used to produce the targets that have been post-edited: PBMT from KIT and NMT (using Nematus) for En-De, PBMT from KIT for De-En, PBMT from CUNI for En-Cz, and both PBMT and NMT systems from Tilde for En-Lv. For each language pair, one unique set of source segments has been used as input to the different translation engines. The De-En and En-Cz engines have provided 45,000 target segments each, both En-De engines have provided 30,000 target segments each, and both En-Lv engines have provided 22,500 target segments each. En-De and De-En HPEs have been collected by professional translators from Text&Form. En-Lv HPEs have been collected by professional translators from Tilde. En-Cz HPEs have been collected by professional translators from Traductera.
Table 6 Domain-Specific Human Error-Annotated data set
Resource Name QT21 Domain-Specific Human Error-Annotated data set
Resource Type corpus
Media Type text
Language(s) English to German, English to Czech, English to Latvian, German to English
Licence
QT21-TAUS Terms of Use TAUS grants access to QT21 Users to the WMT Data Set with the following rights: i) the right to use the target side of the translation units in a commercial product, provided that QT21 User may not resell the WMT Data Set as if it is his own new translation; ii) the right to make Derivative Works; and iii) the right to use or resell such Derivative Works commercially and for the following goals: i) research and benchmarking; ii) piloting new solutions; and iii) testing of new commercial services.
Distribution Medium
downloadable
Usage Training of Automatic Post-editing and Quality Estimation components / Error Analysis
Size 39 MB
Description
Set of 10,800 domain-specific Human Error-Annotated (HEA) tuples for four language pairs and six translation engines. The set consists of 8,800 quadruplets of (source, target, HPE, HEA) and 2,000 quintuplets which contain a second HEA produced by a different annotator, i.e. (source, target, HPE, HEA1, HEA2). The domain for En-De and En-Cz is IT; the domain for En-Lv and De-En is Pharma. This HEA data set is based on the HPEs in Table 5. A total of six translation engines have been used to produce the targets that have been post-edited: PBMT from KIT and NMT (Nematus) for En-De, PBMT from KIT for De-En, PBMT from CUNI for En-Cz, and both PBMT and NMT systems from Tilde for En-Lv. For each language pair, one unique set of source segments has been used as input to the different translation engines. From each translation engine, 1,800 target segments have been error-annotated. From each subset of 1,800 HEA segments, 400 are annotated by two different professional translators (except for the two En-Lv engines, for which only 200 have been doubly annotated). En-De and De-En HEAs have been collected by professional translators from Text & Form. En-Lv HEAs have been collected by professional translators from Tilde. En-Cz HEAs have been collected by professional translators from Aspena.
Table 7 WMT17 Human Post-Edited data set
Resource Name QT21 WMT Human Post-Edited data set
Resource Type corpus
Media Type text
Language(s) English to German, English to Czech, English to Latvian
Licence The source data are crawled from online news sites and carry the respective licensing conditions.
Distribution Medium
downloadable
Usage Training of Automatic Post-editing and Quality Estimation components / Quality Estimation / Error Analysis
Size 10,800 Human Post-Edited (HPE) quadruplets (for 3 language pairs)
Description
Set of 10,800 Human Post-Edited (HPE) quadruplets for three language pairs on WMT17 news task data. Each quadruplet consists of (source, reference, target, HPE). For each language pair, the target segments have been produced on the WMT17 news task by the three best WMT17 systems for that language pair. Each translation engine has provided 1,200 segments. Translations (targets) have been generated using uedin-nmt, limsi-factored-norm and CU-Chimera for En-Cz; uedin-nmt, KIT and RWTH-nmt-ensemb for En-De; and tilde-nc-nmt-smt, limsi-fact-norm and usfd-cons-qt21 for En-Lv. HPEs for En-De have been collected by professional translators from Text&Form. En-Lv HPEs have been
collected by professional translators from Tilde. En-Cz HPEs have been collected by professional translators from Traductera.
Table 8 WMT17 Human Error-Annotated data set
Resource Name QT21 WMT Human Error-Annotated data set
Resource Type corpus
Media Type text
Language(s) English to German, English to Czech, English to Latvian
Licence The source data are crawled from online news sites and carry the respective licensing conditions.
Distribution Medium
downloadable
Usage Training of Automatic Post-editing and Quality Estimation components / Quality Estimation / Error Analysis
Size 1,800 quintuplets (for 3 language pairs)
Description
Set of 1,800 WMT17 Human Error-Annotated (HEA) quintuplets for three language pairs and nine translation engines. Each quintuplet consists of (source, target, HPE, HEA1, HEA2). The source data comes from the WMT17 news task. A total of nine translation engines have been used to produce the targets that have been post-edited: translations (targets) have been generated using uedin-nmt, limsi-factored-norm and CU-Chimera for En-Cz; uedin-nmt, KIT and RWTH-nmt-ensemb for En-De; and tilde-nc-nmt-smt, limsi-fact-norm and usfd-cons-qt21 for En-Lv. From each translation engine, 200 target segments have been post-edited and then error-annotated by two different professional translators. En-De HEAs have been collected by professional translators from Text&Form. En-Lv HEAs have been collected by professional translators from Tilde. En-Cz HEAs have been collected by professional translators from Aspena.
Table 9 eSCAPE: corpus to train and test APE
Resource Name eSCAPE: a Large-scale Synthetic Corpus for Automatic Post-Editing
Resource Type corpus
Media Type text
Language(s) English to German and English to Italian
Licence No licence (data publicly available)
Distribution Medium
downloadable
Usage Training of Automatic Post-editing models based on (source, reference and target)
Size 3.5 GB
Description
Set of 7,258,533 English-German and 3,357,371 English-Italian triplets (source, target and reference). For each language pair, the data set is available in two versions: one where the target segments are produced by a neural MT system, and another where the targets are obtained by a phrase-based MT system. The MT systems used to generate the targets are instances of the open-source Modern MT tool developed by the European project MMT. The original (source, reference) pairs are derived from a collection of corpora from different domains available in the OPUS repository.
Table 10 DA-Annotated WMT14 QE
Resource Name DA-Annotated WMT14 Quality Estimation Task 1.2
Resource Type corpus
Media Type text
Language(s) English, Spanish.
Licence No licence – data publicly available
Distribution Medium
downloadable⁵
Usage Building MT systems
Size 2.5 KB
Description This data set consists of sentences used for the quality estimation shared task from WMT14 with crowd-sourced direct assessment human quality judgments.
3.3 Data Sharing
Domain-specific data is rare. The domain-specific data we found in large volumes was at TAUS, our partner. As the data provided by TAUS needs to comply with the agreements TAUS has signed with its own data providers, we had to create a specific licence agreement with TAUS (see Annex I, section 3.1), in order to give researchers all rights to use (access, mine, exploit, reproduce and disseminate) the data and its associated metadata and also
⁵ Link will be made available before the final review meeting.
the rights to create derivative work from it provided it is for the purpose of research, benchmarking, piloting, and testing commercial products.
In order to respect the licence conditions, users need to accept the licence agreement, with a single click, prior to downloading the data (see Annex I for the full text of the agreement).
For the other post-edits and error-annotations, such as those from WMT17, we follow the same principles as for all WMT data, for which no licence agreement is needed.
3.4 Archiving
All data sets produced are provided and made sustainable through the existing META-SHARE repositories, or new repositories that partners may choose to set up and link to the META-SHARE network (e.g. LINDAT). Data sets are locally stored in the repositories’ storage layer in compressed format.
The domain-specific data is made available through the following CLARIN node https://lindat.mff.cuni.cz/
The WMT data will remain available for download from http://www.statmt.org/. This website is currently hosted at the University of Edinburgh.
4 WMT Automatic Post Editing and Quality Estimation Tasks
4.1 Introduction and Scope
This data plan concerns only the Automatic Post Editing (APE) and Quality Estimation (QE) shared tasks with WMT.
At WMT16 and WMT17, the APE and QE tasks shared the same data. For WMT18, the APE data set differs from the QE data set: QE will cover more languages (Table 14) and will contain an additional data set (Table 15).
4.2 Data description
Table 11 WMT 2016 Automatic Post-editing and Quality Estimation Data Set
Resource Name WMT 2016 Automatic Post-editing and Quality Estimation data set
Resource Type corpus
Media Type text
Language(s) English to German
Licence
TAUS Terms of Use TAUS grants access to QT21 Users to the WMT Data Set with the following rights: i) the right to use the target side of the translation units in a commercial product, provided that QT21 User may not resell the WMT Data Set as if it is his own new translation; ii) the right to make Derivative Works; and iii) the right to use or resell such Derivative Works commercially and for the following goals: i) research and benchmarking; ii) piloting new solutions; and iii) testing of new commercial services.
Distribution Medium
downloadable
Usage Training of Automatic Post-editing and Quality Estimation components
Size 1294 kb
Description
Training, development and test data consist of English-German triplets (source, target and post-edit) belonging to the Information Technology domain, already tokenised. The training and development sets contain 12,000 and 1,000 triplets respectively, while the test set contains 2,000. Target sentences are machine-translated with the KIT system. Post-edits are collected by Text & Form from professional translators.
Table 12 WMT 2017 Automatic Post-editing and Quality Estimation Data Set
Resource Name
WMT 2017 Automatic Post-editing and Quality Estimation data set
Resource Type corpus
Media Type text
Language(s) English to German and German to English
Licence
TAUS Terms of Use TAUS grants access to QT21 Users to the WMT Data Set with the following rights: i) the right to use the target side of the translation units in a commercial product, provided that QT21 User may not resell the WMT Data Set as if it is his own new translation; ii) the right to make Derivative Works; and iii) the right to use or resell such Derivative Works commercially and for the following goals: i) research and benchmarking; ii) piloting new solutions; and iii) testing of new commercial services.
Distribution Medium
downloadable
Usage Training of Automatic Post-editing and Quality Estimation components
Size 1294 kb
Description
For WMT 2017, 11,000 segments were added to the WMT16 En-De training set, together with a new 2017 test set of 2,000 En-De segments. A new language pair, De-En, was also added: 25k segments for training, 1k for development and 2k for test. Combining the 2016 and 2017 APE and QE data, each language pair totals 28k segments, split as follows: En-De: training set = 23k, dev set = 1k, test-set16 = 2k, test-set17 = 2k; De-En: training set = 25k, dev set = 1k, test-set17 = 2k.
Training, development and test data consist of English-German triplets (source, target and post-edit) belonging to the Information Technology domain, already tokenised. Target sentences are machine-translated with the KIT system. Post-edits are collected by Text & Form from professional translators.
Training, development and test data consist of German-English triplets (source, target and post-edit) belonging to the Pharma domain, already tokenised. Target sentences are machine-translated with the KIT system. Post-edits are collected by Text & Form from professional translators.
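The split arithmetic quoted above can be sanity-checked with a short script (an illustrative sketch; the counts are exactly those stated in the table):

```python
# Segment counts for the combined WMT16+17 APE/QE data, as quoted above.
splits = {
    "En-De": {"train": 23_000, "dev": 1_000, "test16": 2_000, "test17": 2_000},
    "De-En": {"train": 25_000, "dev": 1_000, "test17": 2_000},
}

# Each language pair should total 28k segments.
for pair, parts in splits.items():
    total = sum(parts.values())
    print(pair, total)
    assert total == 28_000
```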
Table 13 WMT 2018 Automatic Post-editing data set
Resource Name
WMT 2018 Automatic Post-editing data set
Resource Type corpus
Media Type text
Language(s) English to German and German to English
Licence
QT21-TAUS Terms of Use. (https://lindat.mff.cuni.cz/repository/xmlui/page/licence-TAUS_QT21). TAUS grants access to QT21 Users to the WMT Data Set with the following rights: i) the right to use the target side of the translation units in a commercial product, provided that QT21 User may not resell the WMT Data Set as if it is his own new translation; ii) the right to make Derivative Works; and iii) the right to use or resell such Derivative Works commercially and for the following goals: i) research and benchmarking; ii) piloting new solutions; and iii) testing of new commercial services.
Distribution Medium
downloadable6
Usage Training of Automatic Post-editing components
Size 1294 kb
Description
For the APE shared task at WMT2018, we will use:
- A new test set of 2,000 segments for each of the language pairs from 2017 (En-De and De-En), where the MT segments are generated by the SMT system. In total the En-De pair covers 30k segments, split as: training set = 23k, dev set = 1k, test-set16 = 2k, test-set17 = 2k, test-set18 = 2k.
- A new English-German dataset of 30,000 segments where the MT segments are generated by an NMT system, split as: training set = 27k, dev set = 1k, test-set18 = 1k.
The SMT English-German test data consists of 2,000 triplets (source, target and post-edit) belonging to the Information Technology domain and already tokenised. Target sentences are machine-translated with the KIT SMT system. Post-edits are collected by Text & Form from professional translators. The NMT English-German data consists of 30,000 triplets (source, target and post-edit) belonging to the Information Technology domain and already tokenised. Target sentences are machine-translated with the Nematus system. Post-edits are collected by Text & Form from professional translators.
6 The 2018 APE data sets will be made available end of June 2018
Table 14 WMT 2018 Quality Estimation Core Data Set
Resource Name
WMT 2018 Quality Estimation Core Data Set
Resource Type corpus
Media Type text
Language(s) English to German, English to Latvian, English to Czech
Licence
QT21-TAUS Terms of Use. (https://lindat.mff.cuni.cz/repository/xmlui/page/licence-TAUS_QT21). TAUS grants access to QT21 Users to the WMT Data Set with the following rights: i) the right to use the target side of the translation units in a commercial product, provided that QT21 User may not resell the WMT Data Set as if it is his own new translation; ii) the right to make Derivative Works; and iii) the right to use or resell such Derivative Works commercially and for the following goals: i) research and benchmarking; ii) piloting new solutions; and iii) testing of new commercial services.
Distribution Medium
downloadable7
Usage Training and testing of Quality Estimation components
Size 3700 kb
Description
For WMT2018 we will use 6 sets in total:
1) English-German SMT: 30,000 segments split into 27k for training, 1k for development and 1k for test.
2) English-German NMT: 30,000 segments split into 27,000 for training, 1,000 for development and 2,000 for test (same source segments as for SMT).
3) English-Latvian SMT: 20,738 segments split into 17,738 for training, 1,000 for development and 2,000 for test.
4) English-Latvian NMT: 20,738 segments split into 17,738 for training, 1,000 for development and 2,000 for test (same source as for SMT).
5) English-Czech SMT: 45,000 segments split into 42,000 for training, 1,000 for development and 2,000 for test.
6) German-English SMT: 45,000 segments split into 42,000 for training, 1,000 for development and 2,000 for test.
Training, development and test data for English-German and English-Czech consist of triplets (source, target and post-edit) belonging to the Information Technology domain, already tokenized. Target sentences are machine-translated with the SMT and NMT KIT systems (German) and a CUNI system (Czech). Post-edits are collected by Text & Form from professional translators (German) and subcontracted for Czech.
Training, development and test data for German-English and English-Latvian consist of triplets (source, target and post-edit) belonging to the Pharma domain, already tokenized. Target sentences are machine-translated with the KIT SMT system (German) and TILDE NMT and SMT systems (Latvian). Post-edits are collected from professional translators by Text & Form (German) and by TILDE (Latvian).
7 The 2018 QE data sets will be made available end of June 2018
Table 15 WMT18 Quality Estimation Product Reviews Data Set
Resource Name WMT18 Quality Estimation Task: Product Reviews
Resource Type corpus
Media Type text
Language(s) English to French
Licence No licence (data publicly available)
Distribution Medium
downloadable8
Usage Training of Automatic Quality Estimation methods based on source, machine translation and annotated version of the machine translation at the word level
Size 100K words annotated with specific, fine-grained errors, plus severity levels
Description
This data consists of a selection of product titles and descriptions from the Amazon Product Reviews dataset (http://jmcauley.ucsd.edu/data/amazon/qa/), focusing on the Sports and Outdoors category. The data was machine-translated by a state-of-the-art off-the-shelf MT system (Bing) and annotated for errors at the word level as follows. The errors are annotated following the MQM fine-grained typology, which is composed of three major branches: accuracy (the translation does not accurately reflect the source text), fluency (the translation affects the reading of the text) and style (the translation has stylistic problems, such as the use of a wrong register). These branches include more specific issues lower in the hierarchy. Besides the identification of an error and its classification according to this typology (by applying a specific tag), each error receives a severity level that shows its impact on the overall meaning, style, and fluency of the translation. An error can be minor (if it does not lead to a loss of meaning and does not confuse or mislead the user), major (if it changes the meaning) or critical (if it changes the meaning and carries any type of implication, or could be seen as offensive).
In essence, the annotation process involves the following steps:
- select the error (a unit that comprises all elements that constitute the error): unitising step;
- apply a specific tag (from the error typology): tagging step;
- choose a severity degree: rating step.
8 The 2018 QE Product Reviews data sets will be made available end of June 2018
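The three annotation steps can be pictured as a simple record per error. The following is a minimal sketch (the field names are illustrative, not the released file format):

```python
from dataclasses import dataclass

# Severity levels named in the description above.
SEVERITIES = ("minor", "major", "critical")

@dataclass
class MQMAnnotation:
    """One word-level error annotation on a machine-translated segment."""
    segment_id: int
    span: tuple        # (start, end) token offsets of the error unit (unitising step)
    issue_type: str    # tag from the MQM typology (tagging step)
    severity: str      # one of SEVERITIES (rating step)

    def __post_init__(self) -> None:
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity!r}")

# Example: a major accuracy error covering tokens 4-6 of segment 12.
ann = MQMAnnotation(12, (4, 6), "accuracy/mistranslation", "major")
print(ann.issue_type, ann.severity)
```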
4.3 Standards and Metadata
WMT test sets are distributed in an SGML format, which is compatible with common machine translation evaluation tools, such as the NIST scoring tool (mteval-v13a.pl). The text encoding is Unicode (UTF-8).
Metadata, such as language codes and document identifiers, are provided in the SGML documents. See Annex A for an example of the format used.
4.4 Sharing
The data is made available from the appropriate WMT websites. For APE we have the following links:
- http://www.statmt.org/wmt16/ape-task.html
- http://www.statmt.org/wmt17/ape-task.html
- http://www.statmt.org/wmt18/ape-task.html
For QE we have the following links:
- http://www.statmt.org/wmt16/quality-estimation-task.html
- http://www.statmt.org/wmt17/quality-estimation-task.html
- http://www.statmt.org/wmt18/quality-estimation-task.html
Licences to be accepted by users prior to downloading the data are standard licences.
4.5 Archiving
All datasets produced are provided and made sustainable through the existing META-SHARE repositories, or new repositories that partners may choose to set up and link to the META-SHARE network (e.g. LINDAT). Data sets are locally stored in the repositories’ storage layer in compressed format.
The domain-specific data is primarily made available through the following CLARIN node https://lindat.mff.cuni.cz/
The WMT data will remain available for download from http://www.statmt.org/. This website is currently hosted at the University of Edinburgh.
5 WMT News evaluation campaigns
5.1 Introduction and Scope
This data plan concerns the news translation shared task with WMT.
In WP4, we organise three annual shared task campaigns in collaboration with the CRACKER project. Each campaign involves a translation shared task, a quality estimation shared task, and a metrics shared task. These campaigns continue a successful series of shared tasks held by the Workshop on Statistical Machine Translation (WMT) in previous years.
We aim to create around 6,000 sentences of human-translated text for each year of the translation task, in two language pairs. This text will be used as an evaluation set or be split into separate sets for system development and evaluation.
Collaboration with other projects such as CRACKER enables us to cover more than just two languages in the shared tasks. The core language pairs are German-English and Czech-English, but other challenging language pairs will be introduced each year. The WMT translation task typically has three to five language pairs in total.
5.2 Data description
Table 16 WMT 2016 Test Sets
Resource Name WMT 2016 Test Sets
Resource Type corpus
Media Type text
Language(s)
QT21 has contributed to the German-English and Czech-English test sets from 2015 to 2018, as well as a different guest language in each of these years.
The guest language pair for 2016 was Romanian-English.
We also included Russian, Turkish, Chinese, Estonian and Kazakh with funding from other sources, as well as Finnish in 2016.
Licence The source data are crawled from online news sites and carry the respective licensing conditions.
Distribution Medium
downloadable
Usage For tuning and testing MT systems.
Size 3,000 sentences per language pair, per year.
Description
These are the test sets for the WMT shared translation task. They are small parallel data sets used for testing MT systems, and are typically created by translating a selection of crawled articles from online news sites.
Table 17 WMT 2017 Test Sets
Resource Name WMT 2017 Test Sets
Resource Type corpus
Media Type text
Language(s)
QT21 has contributed to the German-English and Czech-English test sets from 2015 to 2018, as well as a different guest language in each of these years. The guest language pair for 2017 was Latvian-English.
We also included Russian, Turkish, Chinese, Estonian and Kazakh with funding from other sources, as well as Finnish in 2017.
Licence The source data are crawled from online news sites and carry the respective licensing conditions.
Distribution Medium
downloadable
Usage For tuning and testing MT systems.
Size 3,000 sentences per language pair, per year.
Description
These are the test sets for the WMT shared translation task. They are small parallel data sets used for testing MT systems, and are typically created by translating a selection of crawled articles from online news sites.
Table 18 WMT 2018 Test Sets
Resource Name WMT 2018 Test Sets9
Resource Type corpus
Media Type text
Language(s)
QT21 has contributed to the German-English and Czech-English test sets from 2015 to 2018 as well as a different guest language in each of these years. The guest language pair for 2018 is English-Finnish.
We also included Russian, Turkish, Chinese, Estonian and Kazakh with funding from other sources.
Licence The source data are crawled from online news sites and carry the respective licensing conditions.
Distribution Medium
downloadable10
9 The 2018 test sets will be made available end of June 2018.
10 The 2018 test sets will be made available end of June 2018.
Usage For tuning and testing MT systems.
Size 3,000 sentences per language pair, per year.
Description
These are the test sets for the WMT shared translation task. They are small parallel data sets used for testing MT systems, and are typically created by translating a selection of crawled articles from online news sites.
5.3 Standards and Metadata
WMT test sets are distributed in an SGML format, which is compatible with common machine translation evaluation tools, such as the NIST scoring tool (mteval-v13a.pl). The text encoding is Unicode (UTF-8).
Metadata, such as language codes and document identifiers, are provided in the SGML documents. See Annex A for an example of the format used.
5.4 Sharing
The data is made available from the appropriate WMT website.
- http://www.statmt.org/wmt16/translation-task.html
- http://www.statmt.org/wmt17/translation-task.html
- http://www.statmt.org/wmt18/translation-task.html
Licences to be accepted by users prior to downloading the data are standard licences.
5.5 Archiving
All datasets produced are provided and made sustainable through the existing META-SHARE repositories, or new repositories that partners may choose to set up and link to the META-SHARE network (e.g. LINDAT). Data sets are locally stored in the repositories’ storage layer in compressed format.
The WMT data will remain available for download from http://www.statmt.org/. This website is currently hosted at the University of Edinburgh.
A Annex A: SGML format of bi-lingual text (TM)
The following excerpts from the Czech-to-English newstest2015 evaluation set from WMT 2015 exemplify the SGML format conventions and the metadata stored therein.
The source SGML document (newstest2015-csen-src.cs.sgm) packages source-language (here: Czech) text:
<srcset setid="newstest2015" srclang="any">
<doc sysid="ref" docid="101-aktualne.cz" genre="news" origlang="cs">
<p>
<seg id="1">Petr Čech: Přestup na poslední chvíli?</seg>
<seg id="2">Možné je všechno</seg>
<seg id="3">Dnešek je posledním dnem, kdy lze ještě proskočit transferním oknem a udělat velký přestup.</seg>
<seg id="4">Další šance přijde až v zimě.</seg>
...
</p>
</doc>
<doc sysid="ref" docid="1012-bbc" genre="news" origlang="en">
<p>
<seg id="1">Předsedové vlády Indie a Japonska na setkání v Tokiu</seg>
<seg id="2">Nový indický předseda vlády Narendra Modi je v Tokiu na setkání se svým japonským protějškem Shinzo Abem. Na své první větší návštěvě od květnového vítězství ve volbách má projednat ekonomické a bezpečnostní závazky.</seg>
...
</p>
</doc>
...
</srcset>
The reference SGML document (newstest2015-csen-ref.en.sgm) packages the corresponding target-language (here: English) reference translations:
<refset setid="newstest2015" srclang="any" trglang="en">
<doc sysid="ref" docid="101-aktualne.cz" genre="news" origlang="cs">
<p>
<seg id="1">Petr Čech: Transfer at the last minute?</seg>
<seg id="2">Everything is possible</seg>
<seg id="3">Today is the last day when it is still possible to jump through the transfer window and make a major change.</seg>
<seg id="4">The next chance won't come until winter.</seg>
...
</p>
</doc>
<doc sysid="ref" docid="1012-bbc" genre="news" origlang="en">
<p>
<seg id="1">India and Japan prime ministers meet in Tokyo</seg>
<seg id="2">India's new prime minister, Narendra Modi, is meeting his Japanese counterpart, Shinzo Abe, in Tokyo to discuss economic and security ties, on his first major foreign visit since winning May's election.</seg>
...
</p>
</doc>
...
</refset>
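As a reading aid, the <seg> elements in such SGML files can be pulled out with a few lines of Python (a sketch using a simple regular expression rather than a full SGML parser; the embedded fragment repeats part of the excerpt above):

```python
import re

# A fragment in the SGML format shown above (English reference side).
sgml = '''<refset setid="newstest2015" srclang="any" trglang="en">
<doc sysid="ref" docid="101-aktualne.cz" genre="news" origlang="cs">
<p>
<seg id="1">Petr Cech: Transfer at the last minute?</seg>
<seg id="2">Everything is possible</seg>
</p>
</doc>
</refset>'''

# Collect (docid, segid) -> text; seg ids restart within each <doc>,
# so they must be keyed per document.
segs = {}
for doc in re.finditer(r'<doc [^>]*docid="([^"]+)"[^>]*>(.*?)</doc>', sgml, re.S):
    docid, body = doc.group(1), doc.group(2)
    for seg in re.finditer(r'<seg id="(\d+)">(.*?)</seg>', body, re.S):
        segs[(docid, int(seg.group(1)))] = seg.group(2)

print(segs[("101-aktualne.cz", 2)])  # Everything is possible
```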
B Annex B: Generation of HPE and HEA Data - Methodology
B.1. Corpus Selection
The goal of WP3 is to develop new translation techniques leveraging and learning from human feedback.
The efficiency of learning from human feedback depends very much on the quality of that feedback. We need a good balance of high-quality (though not perfect) MT-generated output and output of lower quality11. Also, the less ambiguous the human annotation (on the MT output), the clearer the message to the learning system.
Furthermore, because the methods developed in WP3 are statistical, their efficiency in learning from human feedback also depends on the number of similar annotations (messages) the learning system observes: the more repetition of error types, the better the system becomes at the translation task. This is best achieved when working on a specific domain, from which a higher repetition of errors can be expected.
Working with domain-specific data will give WP3 the advantage of using data that reflects the kinds of data managed on a daily basis by Language Service Providers and professional translators.
The WP3 data selected has to reflect the following minimal constraint set:
- Data contains source and reference segments12
- Data is within a narrow domain13
- Data can be shared and referenced within the research community
- Data covers the four QT21 language pairs
- Data should contain, in total for all the languages, at least 2.5M source words. This results in a number of clean, high-quality source-reference segment pairs ranging from around 25k to 45k.
The data sets that cover these constraints are those of the TAUS Data Association (TDA) and Tilde. Table B-19 gives the number of words in translation memories available in the domains of interest for WP3.
This set allowed us to define three data sets:
Set A comprises bilingual segments in EN (US) - CS and EN (US) - DE in the domain of Computer Software. The content creator is Adobe. The total numbers of segments in the selected corpora are 6.5 million for EN-DE and nearly 1 million for EN-CS. Both data sets are provided by TAUS.
Set B comprises bilingual segments in EN (UK) - LV in the domain of Pharmaceuticals and Biotechnology. The content creator is the European Medicines Agency and the dataset consists of documents from 2009 to 2014. For this corpus, about 260,000 unique segments are available. This data set is provided by Tilde.
11 A bad quality of MT generated output would lead to too many annotations that could be ambiguous 12E.g. translation memories, parallel corpora. 13 E.g. IT or Medical would qualify. Legal corpora, however, would not, as legal texts typically cover a
wide variety of different domains.
![Page 27: Data Management Plan (final) · 2018-08-09 · Quality Translation 21 D5.7 Data Management Plan- final Page 6 of 52 2 MT Generic Training Data 2.1 Introduction WP1 and WP2 are mainly](https://reader034.vdocuments.site/reader034/viewer/2022042119/5e98c25227014e37b1348941/html5/thumbnails/27.jpg)
Quality Translation 21
D5.7 Data Management Plan- final
Page 27 of 52
Set C comprises bilingual segments DE - EN (UK) in the domain of Pharmaceuticals and Biotechnology. The content creator is the European Medicines Agency and the dataset consists of documents up to 2009. The original corpus was translated from English into German; for the purpose of our task, we use the source and target languages in the opposite direction (German as source). This is not ideal; however, very little domain-specific data exists for under-resourced language pairs, and this includes German as a source language. For both corpora, the number of available segments is about 450k. This dataset is provided by TAUS.
Language Pairs | Computer Hardware (# Words) | Computer Software (# Words) | Pharma (# Words) | Data Set Label
EN-DE | 24,166,846 | 83,001,203 | 412,397 | Set A
EN-CS | 2,731,003 | 12,470,776 | 0 | Set A
EN-LV | - | - | ca. 3.9 M | Set B
DE-EN | 6,298,559 | 1,211,718 | 6,385,014 | Set C
Table B-19 – WP3 domain selection within the TAUS and Tilde data, based on number of words. We had three domains to choose from; the selected data sets are those carrying a label (highlighted in yellow in the original document).
B.2. Segment Selection
To ensure good post-edits and good annotations, it is important that the segments on which the MT engines run are clean and sentence-like. For this reason, we examined the data provided by TAUS and Tilde and added some constraints.
TAUS data:
The data provided by TDA are translation memories that have been uploaded by TAUS partners. These data sets comprise translation units that can be well-formed sentences or small sequences of words. To keep only the clean and well-formed pairs, the following constraints are applied:
1. To ensure comparability between language pairs, we select source segments that are identical across both language pairs (in other words, the same source text has been translated into both languages)
2. Each source segment contains between 3 and 50 words14.
3. Both the source and the target segment end with a punctuation mark. The five selected punctuation marks are the following (see also Table B-20):
a. Full stop ‘.’
b. Colon ‘:’
c. Semicolon ‘;’
d. Question mark ‘?’
e. Exclamation mark ‘!’

14 Short segments are not that interesting for MT and we want to reduce their frequency within the selected corpus (the IT corpus has a large number of short segments); long segments of above 50 words are typically the result of mis-segmentation, which we also want to avoid.
4. The data does not contain duplicate bilingual segments: it is sorted-unique on bilingual segments.
Constraints 1 and 2 above each reduced the size of the corpora by about 15% (relative).
Constraint 3 contributed most to the size reduction of the corpora we are working on, by about 30% (relative).
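Constraints 2 and 4 can be sketched as a simple filter over (source, target) pairs. This is an illustrative sketch only; constraint 1 (identical sources across language pairs) and the punctuation constraint 3 are handled separately, and the example segments are invented:

```python
def keep_pair(source: str, target: str, seen: set,
              min_words: int = 3, max_words: int = 50) -> bool:
    """Length check (constraint 2) and de-duplication (constraint 4)."""
    if not (min_words <= len(source.split()) <= max_words):
        return False                      # constraint 2: 3-50 source words
    key = (source, target)
    if key in seen:
        return False                      # constraint 4: drop duplicate bilingual segments
    seen.add(key)
    return True

seen = set()
pairs = [
    ("Click the OK button.", "Klicken Sie auf OK."),
    ("Click the OK button.", "Klicken Sie auf OK."),  # duplicate -> dropped
    ("OK", "OK"),                                     # too short -> dropped
]
kept = [p for p in pairs if keep_pair(p[0], p[1], seen)]
print(len(kept))  # 1
```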
Table B-20 shows how punctuation is used to classify segments as sentence-like or not: if the last character of a segment is within the punctuation character set, the segment is considered a sentence. This definition can be applied to both source and target segments, to only one of them (e.g. only the source segment), or to neither.
For example, the data set extracted from the TAUS data under the “Punct_5” label is one where both source and target segments end with a character from the “Punct_5” set.
It has been observed that the data sets following the “Source_Punct_5” or “Target_Punct_5” definitions are very small, suggesting the TAUS data is very clean. For this reason we consider only the two disjoint data sets “Punct_5” and “No_Punct_5”.
Punctuation character set: . ; : ? !

Label | Source segment ends in the punctuation set | Target segment ends in the punctuation set
Punct_5 | Yes | Yes
No_Punct_5 | No | No
Source_Punct_5 | Yes | No
Target_Punct_5 | No | Yes
Table B-20 – WP3 punctuation: punctuation sets labelled according to the segment type (source or target) to which the end-of-segment check is applied.
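The labels in Table B-20 can be computed directly from the last character of each segment (a sketch; the example segments are illustrative):

```python
PUNCT_5 = (".", ";", ":", "?", "!")

def punct_label(source: str, target: str) -> str:
    """Classify a segment pair by whether each side ends in the Punct_5 character set."""
    s = source.rstrip().endswith(PUNCT_5)
    t = target.rstrip().endswith(PUNCT_5)
    if s and t:
        return "Punct_5"
    if not s and not t:
        return "No_Punct_5"
    return "Source_Punct_5" if s else "Target_Punct_5"

print(punct_label("Is the file saved?", "Ist die Datei gespeichert?"))  # Punct_5
print(punct_label("Main menu", "Hauptmenü"))                            # No_Punct_5
```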
By applying the constraint set above, we obtain a high-quality set of segments, as shown in Table B-21, from which the required number of words to be post-edited and annotated is extracted15.
Data Set | Language Pair | Punctuation Set | Number of Segments | Number of Source Words | Domain | Data Provider
Set A | EN(US)-DE | Punct_5 | 80,874 | 1,322,775 | IT-Soft | Adobe
Set A | EN(US)-CS | Punct_5 | 81,352 | 7,942,426 | IT-Soft | Adobe
Set C | DE-EN(UK) | Punct_5 | 193,637 | 25,397,824 | Pharma | European Medicines Agency
Table B-21 - WP3 high-quality TM (based on the punctuation set Punct_5) that can be used for post-edits and annotations
15 Each data set has English either as source or as target language for both language pairs. The English segments are almost identical for each set – the difference in the number of bilingual segments lies in the fact that there may be more translations in one of the two language pairs of each set.
TILDE Data:
Tilde created English-Latvian and English-Romanian datasets for the HPE and HEA tasks and for MT system training. Initially it was planned to create these data sets using the OPUS EMEA corpus or similar TAUS data, but tokenization issues were discovered in both, so the latest version of data from the European Medicines Agency was used instead.
The Tilde EMEA 2014 corpus was created as part of an internal Tilde project that collects and processes data from the EMEA web site16. At the time the EMEA 2014 data collection project was run, the OPUS EMEA corpus was based on data released up to 2009. From the EMEA 2014 project, we selected the data about drugs and medicines approved for use in 2009-2014.
The compilation of the EMEA 2014 corpus had the following steps:
1. Collecting the drug titles catalogue and selecting the entries released after 2009. We collected 1,003 drug entries to process.
2. Building links to the search page for each of the selected drugs.
3. Running the search queries; for each drug on the EMA web site we collected links to the following PDF documents:
- Summary for the public
- Product Information
- All Authorised presentations
4. Downloading the PDF files. For each of the links from step 3 we collected the respective PDF files in all European languages, ending up with 69,622 PDF files.
5. Extracting the text from the PDF files in a two-step process. First, we used Adobe Acrobat v10 Professional to convert the PDF files to HTML, as this preserved most of the original document structure. Then we ran custom-tailored Perl scripts to convert the HTML files to TXT and clean the data (noise, extra content from page-break areas, etc.).
6. Aligning the parallel files for the chosen language pairs using the Microsoft Bilingual Sentence Aligner, a free Perl library that sentence-aligns parallel TXT files. We aligned the biggest European languages with all other languages, resulting in 130 language pairs.
The EMEA 2014 datasets are quite repetitive, so duplicates were removed. For the purposes of the project, and following the same strategy we applied to the TAUS data, segments with fewer than 3 or more than 35 source words were discarded.17 The statistics of the cleaned EN-RO and EN-LV data are reported in Table B-22.
Data Set | Language Pair | Number of Segments | Number of Source Words | Domain | Data Provider
Set B | EN(UK)-LV | 231,028 | 3,607,102 | Pharma | European Medicines Agency
Set B | EN(UK)-RO | 225,024 | 3,458,164 | Pharma | European Medicines Agency

Table B-22 - WP3-High Quality EMEA 2014 EN-RO and EN-LV TM data that can be used for Post Edits and Annotations

16 http://www.ema.europa.eu
17 Unlike the TAUS data, we limited the maximum number of source words to 35 instead of 50, because pilot experiments on the EN-DE TAUS data showed that post-editing overly long sentences can be problematic for translators, who tend to rewrite the segment from scratch.
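The de-duplication and length filtering described above can be sketched in a few lines. This is an illustrative sketch, not the project's actual tooling; the helper name `clean_parallel_corpus` and whitespace tokenisation are assumptions, while the 3- and 35-word thresholds come from the text:

```python
def clean_parallel_corpus(pairs, min_words=3, max_words=35):
    """Remove duplicate segment pairs and filter by source length.

    `pairs` is an iterable of (source, target) segment tuples.
    Word counts use simple whitespace tokenisation (an assumption;
    the project may have used a language-specific tokeniser).
    """
    seen = set()
    kept = []
    for src, tgt in pairs:
        if (src, tgt) in seen:      # drop exact duplicate pairs
            continue
        seen.add((src, tgt))
        n = len(src.split())        # source word count
        if n < min_words or n > max_words:
            continue                # discard too-short / too-long segments
        kept.append((src, tgt))
    return kept

# Example with a duplicate pair and a too-short segment
corpus = [
    ("Take two tablets daily with food.", "Lietojiet divas tabletes dienā."),
    ("Take two tablets daily with food.", "Lietojiet divas tabletes dienā."),
    ("Yes.", "Jā."),
]
print(len(clean_parallel_corpus(corpus)))  # 1
```

The duplicate pair and the one-word segment are both removed, leaving a single usable pair.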
C Annex C: Guidelines for Post-Editing Only18
The effort involved in post-editing will be determined by two main criteria:
1. The quality of the raw MT output.
2. The expected end quality of the content.
To reach quality similar to “high-quality human translation and revision” (a.k.a. “publishable quality”), full post-editing is usually recommended. For quality of a lower standard, often referred to as “good enough” or “fit for purpose”, light post-editing is usually recommended. However, light post-editing of very poor MT output may not bring the output up to publishable quality standards. On the other hand, if the raw MT output is of good quality, then perhaps all that is needed is a light, not a full, post-edit to achieve publishable quality. So, instead of differentiating between guidelines for light and full post-editing, we differentiate here between two levels of expected quality. Other levels could be defined, but we will stick to two to keep things simple. The set of guidelines proposed below is conceptualised as a group from which individual guidelines can be selected, depending on the needs of the customer and the raw MT quality.
C.1. How to Achieve “Good Enough” Quality
“Good enough” is defined as comprehensible (i.e. you can understand the main content of the message) and accurate (i.e. it communicates the same meaning as the source text), but not stylistically compelling. The text may sound as if it was generated by a computer, syntax might be somewhat unusual, and grammar may not be perfect, but the message is accurate.
• Aim for semantically correct translation.
• Ensure that no information has been accidentally added or omitted.
• Edit any offensive, inappropriate or culturally unacceptable content.
• Use as much of the raw MT output as possible.
• Basic rules regarding spelling apply.
• No need to implement corrections that are of a stylistic nature only.
• No need to restructure sentences solely to improve the natural flow of the text.
C.2. How to Achieve Quality Similar or Equal to Human Translation
This level of quality is generally defined as comprehensible (i.e. an end user perfectly understands the content of the message), accurate (i.e. it communicates the same meaning as the source text) and stylistically fine, though the style may not be as good as that achieved by a native-speaker human translator. Syntax is normal; grammar and punctuation are correct.
• Aim for grammatically, syntactically and semantically correct translation.
• Ensure that key terminology is correctly translated and that untranslated terms belong to the client’s list of “Do Not Translate” terms.
• Ensure that no information has been accidentally added or omitted.
• Edit any offensive, inappropriate or culturally unacceptable content.
• Use as much of the raw MT output as possible.
• Basic rules regarding spelling, punctuation and hyphenation apply.
• Ensure that formatting is correct.
18 Post-editing alone, not coupled with error annotation.
C.3. Best Practices for Evaluating Post-Editor Performance
Machine translation (MT) with post-editing is fast becoming a standard practice in our industry. This means that organisations need to be able to easily identify, qualify, train and evaluate Post-Editor performance.
Today, many methodologies are in use, resulting in a lack of cohesive standards as organisations take various approaches to evaluating performance. Some use final output quality evaluation or post-editor productivity as a standalone metric. Others analyse quality data, such as the degree of “over-editing” or “under-editing” in the post-editor’s work, or evaluate the percentage of MT suggestions used versus discarded in the final output.
An agreed set of best practices will help the industry fairly and efficiently select the most suitable talent for post-editing work and identify the training opportunities that will help translators and new players, such as crowdsourcing resources, become highly skilled and qualified post-editors.
C.4. Defining Goals
Determine the objectives of your evaluation.
• Identify the best performers from the pool of post-editors who deliver the desired level of output quality with the highest productivity gains; identify the “ideal” post-editor profile for the specific content type and quality requirements (linguist, domain specialist, “casual” translator).
• Identify common over-edit and under-edit mistakes in order to refine post-editing guidelines and determine the workforce training needed to achieve higher productivity.
• Gather intelligence on the performance of the in-house technology used to enable the post-editing process, such as a translation management system, a recommended post-editing environment and MT engines.
• Based on post-editor productivity, set realistic TAT (turnaround time) expectations, determine the appropriate pricing structure for specific content types and language pairs, and reflect the above in an SLA (Service Level Agreement).
C.5. Structuring Your Analysis
In order to select the top productivity performers and evaluate the quality of their output:
• Select a subset of the content used for the productivity test for which the highest and the lowest productivity are seen (the “outliers”), and evaluate the quality of the output using reviewers and automated quality evaluation tools (spellcheckers, Checkmate, X-bench, style consistency evaluation tools). Make sure the final output meets your quality expectations for the selected content types.
• Use multiple translators and multiple reviewers.
• Make sure there is minimal subjective evaluation. Provide clear evaluation guidelines to the reviewers and make certain the reviewers’ expectations and the post-editors’ instructions are aligned. Refer to the known most common post-editing mistakes.
Examples of known problem areas in full post-editing:
• Handling of measurements and locale-specific punctuation, date formats and the like
• Correcting inconsistencies in terminology; terminology disambiguation
• Handling of list elements, tables or headers versus body text
• Handling of proper names, product names and other DoNotTranslate (DNT) elements
• Repetitions (consistent exact matches)
• Removing duplicates, fixing omissions (for SMT output post-editing)
• Morphology (agreement), negations, word order, plural vs. singular
Examples of known problem areas in light post-editing:
• Correctly conveying the meaning of the source sentence
• Correcting inconsistencies in terminology
• Removing duplicates, fixing omissions (for SMT output post-editing)
• Morphology (agreement), negations, word order, plural vs. singular
In order to identify the common over-edit and under-edit patterns:
• Obtain edit distance and MT quality data for the content post-edited during the productivity evaluation, using industry-standard methods, e.g. GTM19, Levenshtein20, BLEU21, TER22.
• If your productivity evaluation tool captures this data, obtain information on the changes made by post-editors, i.e. edit location and the nature of edits; if your tool does not capture such data, a simple file difference tool can be used.
You will be able to determine realistic turnaround time and pricing expectations based on the productivity and quality data. Always have a clear human translation benchmark for reference (e.g. 2000 words per day for European languages) unless your tool allows you to capture the actual in-production data.
For smaller LSPs and freelance translators, tight-turnaround projects or other limited-bandwidth scenarios, reliance on legacy industry information is recommended.
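As a concrete illustration of the edit-distance data mentioned above, the word-level Levenshtein distance defined in footnote 20 can be computed with a short dynamic-programming routine, and Python’s standard difflib module can play the role of the “simple file difference tool” for locating edits. This is a minimal sketch, not the tooling used in QT21; the MT/post-edit sentence pair is borrowed from the example in Annex D:

```python
import difflib

def levenshtein(a, b):
    """Minimum number of single-token insertions, deletions or
    substitutions needed to turn sequence `a` into sequence `b`."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution / match
        prev = curr
    return prev[-1]

mt = "He has the man yesterday seen".split()
pe = "He saw the man yesterday".split()
print(levenshtein(mt, pe))  # 2: substitute 'has' -> 'saw', delete 'seen'

# Locate the edits (edit location and nature) with difflib
for op, i1, i2, j1, j2 in difflib.SequenceMatcher(a=mt, b=pe).get_opcodes():
    if op != "equal":
        print(op, mt[i1:i2], "->", pe[j1:j2])
```

The opcode loop reports each changed span, which is the kind of “edit location, nature of edits” information described above when no dedicated tool is available.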
C.6. Analysing Results
• Make sure that the final output quality matches the desired quality level for the selected content type, and differentiate between the full post-editing and light post-editing quality levels.
• Assess individual post-editors’ performance using side-by-side data for more than one post-editor; use the individual post-editor data to identify the most suitable post-editor profile for the specific content types and quality levels.
19 GTM: a software package that measures the similarity between texts by matching the components of, e.g., a text and its translation. GTM can be used to help evaluate machine translation by checking whether all elements in the source are represented in the target.
20 A very basic edit distance: the Levenshtein distance between two words (or two phrases) is the minimum number of single-character (or single-word) edits (i.e. insertions, deletions or substitutions) required to change one word (phrase) into the other.
21 Bi-Lingual Evaluation Understudy, an algorithm for evaluating machine translation output against a reference human translation. Best used to evaluate improvements of a machine translation system over several cycles of training. BLEU is not a useful metric for machine translation end users trying to evaluate quality.
22 Translation Error Rate (TER) is an automatic metric for measuring the number of edit operations needed to transform machine translation output into a human-translated reference. It is used to assess the post-editing load.
• Calculate the mode (the number that appears most often in a set of numbers) based on the scores of multiple post-editors to obtain data specific to a certain content/quality/language combination; gather data for specific sentence-length ranges and sentence types.
• Do not use the obtained productivity data in isolation to calculate expected daily throughputs and turnaround times, as the reported values reflect an “ideal” off-production scenario; add the time necessary for terminology and concept research, administrative tasks, breaks and the like.
• Identify the best and worst performing sentence types (length, presence/absence of DNT elements, tags, terminology, certain syntactic structures) and gather best practices for post-editors on the optimal handling of such sentences and elements.
• Analyse edit distance data alongside the MT quality evaluation data and the reviewers’ assessments to determine whether the edits made by post-editors were necessary to meet the output quality requirements; gather best practices for specific edit types. Do not use edit distance data in isolation to evaluate post-editor performance.
• Analyse the nature of edits using the data obtained during the productivity tests; use the top performers’ data to create recommendations for lower performers; use the obtained information to provide feedback on the MT engine.
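The mode calculation described above can be done directly with Python’s standard statistics module. This is a minimal sketch; the score values are invented for illustration:

```python
from statistics import multimode

# Hypothetical productivity scores (e.g. words per hour) from several
# post-editors for one content/quality/language combination.
scores = [550, 600, 600, 650, 600, 700, 550]

# The mode is the value that appears most often; multimode() returns
# all most-frequent values, which handles ties gracefully.
print(multimode(scores))  # [600]
```

Using `multimode` rather than `mode` avoids an ambiguity when two scores are tied for most frequent, which is plausible with a small pool of post-editors.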
D Annex D: Error-Annotation and its Related Post-Editing Guidelines
Selecting issues can be a complex task. To assist evaluators, a decision tree (see section D.7 on page 43) helps them select appropriate issues. Use the decision tree not only to learn about MQM issues, but also to guide your annotation efforts and resolve any questions or concerns you may have.
In QT21 annotation tasks, you will be asked to provide annotations (but not to edit) in one column in translate5 (see Figure 1, in which annotation takes place in one column and the next column shows the result of repairing the annotated issue) and to provide a post-edited translation in another column. Please ensure that the issues you identify are resolved in the post-edited version: doing so helps researchers understand why you chose the issues you did.
Figure 1. Annotation and post-editing in translate5
Note that translate5 cannot correlate your post-edits with your annotations, so if you feel that anything is unclear, please use the Notes feature to supply comments on the annotations or segment.
To use the decision tree, start at the upper left corner and then answer the questions and follow the arrows to find appropriate issues.
If using translate5, note that the decision tree is organised somewhat differently from the hierarchy in translate5, because it eliminates specific issue types before moving to general ones, so familiarise yourself with how issues are organised in translate5 before beginning annotation.
Add notes in the translate5 scorecard to explain any decisions that you feel need clarification, to ask questions, or to provide information needed to understand issues, such as notes about what has been omitted in a translation.
In addition to using the decision tree, please understand and follow the guidelines in this document. Email us at [email protected] if you have questions that the decision tree and other content in this document do not address.
D.1. Annotation and post-editing level
The goal of the annotation and post-editing is not to produce a stylistically perfect text, but rather to produce one that is free from errors. As a result, even if a text might be improved
with stylistic edits, please refrain from making them. However, any issues that would impede the ability of a typical reader to understand a translation should be marked. This guideline will sometimes result in ambiguous results. For example, if a translated segment has a stilted syntax that allows the intended meaning to be understood, but only with careful reading, it should probably be marked with an error in the Fluency branch (this MQM subset does not contain any Style issues). By contrast, if the translation reads as something that a human could make, even if there is a more elegant or easier way to say the same thing, it should not be marked or edited to correct that problem. When in doubt, use the Notes function in translate5 to discuss specific cases.
For the post-editing portion, do not change text that is not accounted for in the MQM annotation. For example, if you have the following German source/English target pair and seen is marked for Word order (with no other issues noted):
Source: Er hat den Mann gestern gesehen.
Target: He has the man yesterday seen.
You should not then produce He saw the man yesterday as your post-edit, even though it is a perfectly valid translation of the German source. If you want to produce He saw the man yesterday as your post-edit, you would need to mark has as well (with Tense/aspect/mood), because you are also changing the tense in the English translation, in addition to moving the verb to the proper place.
It is important that all post-edits correspond to one or more annotated issues—and that all annotated issues correspond to post-edits—to ensure the usability of the results and the correlation between the post-edits and the annotations.
D.2. What is an Error?
An error represents any issue you may find with the translated text that either does not correspond to the source or is considered incorrect in the target language. The list of language issues upon which you are to base your annotation is described in detail below and provides a range of examples.
The list is divided into four main issue categories: Accuracy, Fluency, Terminology, and Locale convention. In the full MQM hierarchy, each of these contains relevant, more detailed subcategories; in this subset, by contrast, Terminology and Locale convention do not include any subcategories.
Whenever possible, the correct subcategory should be chosen; however, if in doubt, please do not guess. Instead, select the category level about which you are most certain in order to avoid inconsistencies in the results.
Example: The German term Zoomfaktor was incorrectly translated as zoom shot factor, and you are unsure whether this represents a Mistranslation or an Addition. In this case, categorise the error as an Accuracy error since it is unclear whether content has been added or a term mistranslated.
D.3. The Annotation Process
The translations you annotate will be a set of “near miss” (i.e., “almost perfect”) translations. Please follow these rules when selecting errors and tagging the respective text in the translations:
1. Use the examples in this documentation to understand specific classes.
2. If multiple types could be used to describe an issue (e.g., Agreement, Word form, Grammar, and Fluency), select the first one that the decision tree guides you to. The tree is organised along the following principles:
   a. It prefers more specific types (e.g., Part of speech) to general ones (e.g., Grammar). However, if a specific type does not apply, it guides you to use the general type.
   b. General types are used where the problem is of a general nature or where the specific problem does not have a precise type. For example, He slept the baby exhibits what is technically known as a valency error, but because there is no specific type for this error available, it is assigned to Grammar.
3. Less is more. Only tag the relevant text. For example, if a single word is wrong in a phrase, tag only the single word rather than the entire phrase. If two words, separated by other words, constitute an error, mark only those two words separately. (See the section on “minimal mark-up” below.)
4. If correcting one error would take care of others, tag only that error. For example, if fixing an Agreement error would fix other related issues that derive from it, tag only the Agreement error, not the errors that result from it.
Examples
Source: Importfilter werden geladen
Translation: Import filter are being loaded
Correct: Import filters are being loaded
In this example, the only error is the translation of filter in the singular rather than the plural (as made clear by the verb form in the source text). This case should be classified as Mistranslation, even though it shows problems with agreement: if the subject had been translated properly the agreement problem would be resolved. In this case only filter should be tagged as a Mistranslation.
Source: Nach der Installation eines Exportfilters wird der Name des Filters der Dateitypliste im Dialog Exportieren hinzugefügt.
Translation: After the installation, the export filters are added to the file type list [in the dialogue export].
Correct: After the installation, the export filters are added to the file type list [in the Export dialogue].
In this example, only Terminology should be marked for the portion in brackets. While both Word order and Spelling (capitalisation) appear to be errors in this sentence, these two words constitute one domain-specific term that has been incorrectly rendered in English and fixing the terminology problem would also resolve these other problems. In such difficult cases where it is not entirely clear what problems may be involved, please provide notes that will assist us in understanding your rationale.
5. If one word contains two errors (e.g., it has a Spelling issue and is also an Extraneous function word), enter both errors separately and mark the respective word in both cases.
6. If in doubt, choose a more general category. The categories Accuracy and Fluency can be used if the nature of an error is unclear. In such cases, providing notes to explain the problem will assist their interpretation.
D.4. Tricky Cases
The following examples are ones that have been encountered in practice and that we wish to clarify.
• Function words: In some cases issues related to function words break the Accuracy/Fluency division seen in the decision tree because they are listed under Fluency even though they may impact meaning. Despite this issue, please categorise them as the appropriate class under Function words. Example: The ejector may be found with the external case (should be on in this case). Even though this error changes the meaning, it should be classified as Function words: incorrect in the Fluency branch.
• Word order: Word order problems often affect long spans of text. When encountering word order issues, mark the smallest possible portion that could be moved to correct the problem. Example: He has the man with the telescope seen. Here only seen should be marked, as moving this one word would fix the problem.
• Hyphenation: Hyphenation issues should generally be classified as Spelling unless they arise because content was untranslated and had a hyphen in the source. Examples: Load the XML-files (Spelling, if the source has XML Datei); Nützen Sie die macro-lens (Untranslated, if the source has macro-lens as well).
• Number (plural vs. singular): if it does not reflect the source, a problem with number is a Mistranslation.
• Terminology: Inappropriate use of terms, as distinct from general-language Mistranslation. Example: An English translation uses the term thumb drive to translate the German USB Speicherkarte. This translation is intelligible, but if the translation mandated in specifications or a relevant term-base is USB memory stick, the use of thumb drive constitutes a Terminology error, even if thumb drive would be acceptable in everyday usage. However, if USB Speicherkarte were to be translated as USB Menu, this would be a Mistranslation since the words would be translated incorrectly, regardless of whether the original phrase is a term.
NOTE: Because no specific terminology list is provided, please use your understanding of relevant domain terminology for the evaluation task.
• Unintelligible: Use Unintelligible if content cannot be understood and the reason cannot be analysed according to the decision tree. This category is used as a last resort for text where the nature of the problem is not clear at all. Example: In the sentence “You can also you can use this tab to precision, with the colours are described as well as the PostScript Level,” there are enough errors that the meaning is unclear and the precise nature of the errors that lead to its unintelligibility cannot be easily determined.
• Agreement: This category generally refers to agreement between subject and predicate or gender and case. Examples: The boy was playing with her own train and I is at work.
• Untranslated: Many words may look as if they have been translated, with the translator simply having forgotten to apply proper capitalisation or hyphenation rules. In many cases, this represents an untranslated term and not a Spelling issue. If the target word or phrase is identical to the source word or phrase, it should be treated as Untranslated, even if a Spelling error could also account for the problem.
D.5. Minimal Mark-Up
It is vital when creating error mark-up that errors be marked with the shortest possible spans. Mark-up must identify only the area needed to specify the problem. In some cases this requirement means that two separate spans must be identified. The following examples help clarify the general principles:
Incorrect mark-up: Double click on the number faded in the status bar. [Mistranslation]
Problem: Only the single word faded is problematic, but the mark-up indicates that number faded in is incorrect.
Correct minimal mark-up: Double click on the number faded in the status bar.

Incorrect mark-up: The standard font size for dialogs is 12pt, which corresponds to a standard of 100%. [Terminology]
Problem: Only the term Maßstab has been translated incorrectly. The larger span indicates that text that is perfectly fine has a problem.
Correct minimal mark-up: The standard font size for dialogs is 12pt, which corresponds to a standard of 100%.

Incorrect mark-up: The in 1938 nascent leader with flair divined %temp_name eating lonely. [Unintelligible]
Problem: The entire sentence is Unintelligible and should be marked as such.
Correct minimal mark-up: The in 1938 nascent leader with flair divined %temp_name eating lonely.
As noted above, Word order can be problematic because it is often unclear what portion(s) of the text should be marked. In cases of word order, mark the shortest portion of text (in number of words) that could be moved to fix the problem. If two portions of the text could resolve the problem and are equal in length, mark the one that occurs first in the text. The following examples provide guidance:
Incorrect mark-up: The telescope big observed the operation
Problem: Moving the word telescope would solve the problem, and only this word should be marked (since it occurs first in the text).
Correct minimal mark-up: The telescope big observed the operation

Incorrect mark-up: The eruption by many instruments was recorded.
Problem: Although this entire portion shows word order problems, moving was recorded would resolve the problem (and is the shortest span that would resolve it).
Correct minimal mark-up: The eruption by many instruments was recorded.

Incorrect mark-up: The given policy in the manual user states that this action voids the warranty.
Problem: This example actually has two separate issues that should be marked separately.
Correct minimal mark-up: The given policy in the manual user states that this action voids the warranty.
Agreement poses special challenges because portions that disagree may be widely separated. To select appropriate minimal spans, consider the following guidelines:
o If two items disagree and it is readily apparent which should be fixed, mark only the portion that needs to be fixed. E.g., in “The man and its companion were business partners” it is readily apparent that its should be his and the wrong grammatical gender has been used, so only its should be marked.
o If two items disagree and it is not clear which portion is incorrect, mark both items for Agreement, as shown in the examples below.
The following examples demonstrate how to mark Agreement:
Incorrect mark-up: The man and its companion were business partners. [Agreement]
Problem: In this example, it is clear that its is the problematic portion and that man is correct, so only its should be marked.
Correct minimal mark-up: The man and its companion were business partners.

Incorrect mark-up: He saw her own car. [Agreement]
Problem: For this example, assume that the translation is from a language such as Spanish in which the gender of the subject and pronoun are not specified, so it is unclear which one is correct and both then show agreement problems. In this case mark both for Agreement and make a note of the problem. (Such cases should be uncommon.)
Correct minimal mark-up: He saw her own car. [Agreement]
In the event of questions about the scope of mark-up that should be used, utilise the Notes field to make a query or explain your choice.
D.6. Issue Categories
The error corpus uses the following issue categories:
D.6.1 Accuracy Issues
Accuracy. Accuracy addresses the extent to which the target text accurately renders the meaning of the source text. For example, if a translated text tells the user to push a button when the source tells the user not to push it, there is an accuracy issue (specifically, a Mistranslation).
Note(s): Plain accuracy errors rarely occur in this MQM subset since errors are generally specific to one of the subtypes.
o Mistranslation. The target content does not accurately represent the source content.
Example: A source text states that a medicine should not be administered in doses greater than 200 mg, but the translation states that it should not be administered in doses less than 200 mg.
Note(s): Mistranslation can be used for both words and phrases.
o Omission. Content is missing from the translation that is present in the source.
Example: A source text refers to a “mouse pointer” but the translation does not mention it.
Note(s): Omission should be reserved for those cases where content present in the source and essential to its meaning is not found in the target text.
o Addition. The target text includes text not present in the source.
Example: A translation includes portions of another translation that were inadvertently pasted into the document.
o Untranslated. Content that should have been translated has been left untranslated.
Example: A sentence in a Japanese document translated into English is left in Japanese.
Note(s): As noted above, if a term is passed through untranslated, it should be classified as Untranslated rather than as Mistranslation.
D.6.2 Terminology Issues
Terminology. Domain- or industry-specific terms (including multi-word terms) are translated incorrectly.
Example: In a musicological text the term dog is encountered and translated into German as Hund ‘dog’ rather than the domain-specific term Schnarre ‘snare’.
Note(s): Terminology errors may be valid translations for the source word in general language, but are incorrect for the specific domain or organisation.
D.6.3 Locale Convention Issues
Locale convention. Text is accurate but appears in a form or format inappropriate for the target locale.
Examples: A number appears in a German text with commas (,) as thousands separators and a full stop (.) as the decimal separator (123,456.789) instead of the reverse (123.456,789).
A text translated from English into Czech uses miles to describe distances, rather than kilometres.
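The separator convention described above can be illustrated with a small sketch. This is a hypothetical helper, not part of the QT21 tooling; real localisation should rely on CLDR-based formatting libraries rather than character swapping:

```python
def to_german_number_format(text: str) -> str:
    """Swap the English thousands/decimal separators (123,456.789)
    for the German convention (123.456,789)."""
    # str.translate swaps ',' and '.' in a single pass, so the two
    # replacements cannot clobber each other.
    return text.translate(str.maketrans({",": ".", ".": ","}))


print(to_german_number_format("123,456.789"))  # 123.456,789
```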
D.6.4 Fluency Issues
Fluency. Fluency relates to the monolingual qualities of the source or target text, relative to agreed-upon specifications, but independent of the relationship between source and target. In other words, fluency issues can be assessed without regard to whether the text is a translation. For example, a spelling error or a problem with grammatical register remains an issue regardless of whether the text is translated.
o Grammatical register. If a language uses grammatical markers of formality, they are used incorrectly.
Examples: A formal announcement text in German is translated with the informal du pronouns instead of the formal Sie pronouns.
Note(s): This category does not apply to English target texts, since English does not make a grammatical distinction between formal and informal address.
o Spelling. Issues related to the spelling of words (including capitalisation).
Examples: The German word Zustellung is spelled Zustetlugn. The name John Smith is written as “john smith”.
o Typography. Issues related to the mechanical presentation of text. This category should be used for any typographical errors other than spelling.
Examples: Extra, unneeded carriage returns are present in a text. A semicolon is used in place of a comma.
o Grammar. Issues related to the grammar or syntax of the text, other than spelling and orthography.
Example: An English text reads “The man was sleeping the baby.”
Note(s): Use Grammar only if no subtype accurately describes the issue.
Word form. The wrong form of a word is used. Subtypes should be used when possible.
Example: An English text has comed instead of came.
 Part of speech. A word is the wrong part of speech.
 Example: A text reads “Read these instructions careful” instead of “Read these instructions carefully.”
 Agreement. Two or more words do not agree with respect to case, number, person, or other grammatical features.
 Example: A text reads “They was expecting a report.”
 Tense/aspect/mood. A verbal form inappropriate for the context is used.
 Example: An English text reads “Yesterday he sees his friend” instead of “Yesterday he saw his friend”; an English text reads “The button must be pressing” instead of “The button must be pressed”.
 Word order. The word order is incorrect.
 Example: A German text reads “Er hat gesehen den Mann” instead of “Er hat den Mann gesehen.”
 Function words. Linguistic function words such as prepositions, particles, and pronouns are used incorrectly.
 Example: An English text reads “He beat him around” instead of “He beat him up.”
 Note(s): Function words is used for cases where individual words with a grammatical function are used incorrectly. The most common problems involve prepositions and particles. For languages where verbal prefixes play a significant role in meaning (as in German), they should be included here, even if they are not independent words.
 There are three subtypes of Function words, used to indicate whether an unneeded function word is present (Extraneous), a needed function word is missing (Missing), or an incorrect function word is used (Incorrect).
 Unintelligible. The exact nature of the error cannot be determined; indicates a major breakdown in fluency.
 Example: The following text appears in an English translation of a German automotive manual: “The brake from whe this કુતારો િ સ S149235 part numbr,,.”
Note(s): Use this category sparingly for cases where further analysis is too uncertain to be useful. If an issue is categorised as Unintelligible no further categorisation is required. Unintelligible can refer to texts where a significant number of issues combine to create a text for which no further determination of error type can be made or where the relationship of target to source is entirely unclear.
D.7. Decision Tree
The MQM annotation decision tree is a tool to aid annotators in selecting the proper issue type. It is used by answering the questions with a yes or no response and following the appropriate arrows until an issue type is selected. The decision tree should be consulted frequently when first annotating and at any time thereafter when the choice of issue type is unclear.
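Since the decision-tree figure itself cannot be reproduced in plain text, the kind of flow it encodes can be approximated in code. The following Python sketch is illustrative only: the question names and their ordering are assumptions drawn from the category notes in D.6 (e.g., Untranslated takes precedence over Mistranslation; Unintelligible ends categorisation), not a transcription of the actual tree:

```python
def select_issue_type(answers: dict) -> str:
    """Toy triage loosely following the category notes in D.6.
    `answers` maps yes/no questions (illustrative names) to booleans."""
    if answers.get("unintelligible"):
        # Unintelligible requires no further categorisation.
        return "Unintelligible"
    if answers.get("meaning_differs_from_source"):      # Accuracy branch
        if answers.get("left_untranslated"):
            return "Untranslated"   # takes precedence over Mistranslation
        if answers.get("source_content_missing"):
            return "Omission"
        if answers.get("extra_content_added"):
            return "Addition"
        return "Mistranslation"
    if answers.get("wrong_domain_term"):
        return "Terminology"
    if answers.get("wrong_locale_format"):
        return "Locale convention"
    # Otherwise a monolingual problem: choose the most specific Fluency
    # subtype, falling back to Grammar only when no subtype fits.
    return "Fluency (select subtype)"
```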
[Figure: MQM annotation decision tree]
E Annex E: Schema for MQM Annotations in XML Format

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
  <xs:import namespace="http://www.w3.org/XML/1998/namespace" schemaLocation="xml.xsd"/>
  <xs:element name="annotations">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="annotGrp" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="sourceText" maxOccurs="1">
                <xs:complexType mixed="true">
                  <xs:attribute ref="xml:lang" use="required"/>
                </xs:complexType>
              </xs:element>
              <xs:element name="targetGrp" maxOccurs="unbounded">
                <xs:complexType>
                  <xs:sequence>
                    <xs:element name="target">
                      <xs:complexType mixed="true">
                        <xs:attribute name="generator" use="required" type="xs:NCName"/>
                        <xs:attribute ref="xml:id"/>
                        <xs:attribute ref="xml:lang" use="required"/>
                      </xs:complexType>
                    </xs:element>
                    <xs:element name="annotatedTarget" maxOccurs="unbounded">
                      <xs:complexType mixed="true">
                        <xs:choice minOccurs="0" maxOccurs="unbounded">
                          <xs:element name="issueStart">
                            <xs:complexType>
                              <xs:attribute name="note" use="required"/>
                              <xs:attribute name="type" use="required"/>
                              <xs:attribute ref="xml:id"/>
                            </xs:complexType>
                          </xs:element>
                          <xs:element name="issueEnd">
                            <xs:complexType>
                              <xs:attribute name="idref" use="required" type="xs:IDREF"/>
                            </xs:complexType>
                          </xs:element>
                        </xs:choice>
                        <xs:attribute name="annotator" use="required" type="xs:NCName"/>
                        <xs:attribute ref="xml:id"/>
                      </xs:complexType>
                    </xs:element>
                  </xs:sequence>
                </xs:complexType>
              </xs:element>
            </xs:sequence>
            <xs:attribute name="visID" use="required"/>
            <xs:attribute ref="xml:id"/>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
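The schema marks issue spans with empty <issueStart>/<issueEnd> elements rather than wrapping elements, which permits overlapping annotations. The following sketch shows one way such stand-off markers can be paired when reading a file; the input fragment is invented for illustration, and plain id/idref attributes are used for brevity in place of xml:id:

```python
import xml.etree.ElementTree as ET

# Invented fragment using the marker elements defined by the schema.
DOC = ('<annotatedTarget annotator="a1">The '
       '<issueStart id="i1" note="" type="Agreement"/>man and its'
       '<issueEnd idref="i1"/> companion were business partners.'
       '</annotatedTarget>')

def issue_spans(xml_text: str) -> dict:
    """Return {issue id: covered text} by pairing start/end markers.
    Sketch only: assumes the markers appear in document order."""
    root = ET.fromstring(xml_text)
    open_issues = {}  # id -> text pieces collected while the issue is open
    spans = {}
    for marker in root:
        if marker.tag == "issueStart":
            open_issues[marker.get("id")] = []
        elif marker.tag == "issueEnd":
            iid = marker.get("idref")
            spans[iid] = "".join(open_issues.pop(iid))
        # Text following this marker belongs to every still-open issue.
        for pieces in open_issues.values():
            pieces.append(marker.tail or "")
    return spans

print(issue_spans(DOC))  # {'i1': 'man and its'}
```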
F Annex F: XSLT Stylesheet for Converting Annotated XML into HTML
The following XSLT stylesheet can be used to convert the annotated XML into user-friendly HTML. Note that the output will require further wrapping and work to produce a user-friendly and filterable version such as that seen at http://www.qt21.eu/deliverables/annotations/de-en-round2.html

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <xsl:for-each select="annotations/annotGrp">
      <tbody id="{@id}" class="{@source}" data-min="" data-max="">
        <tr class="source"><td colspan="7">
          <p class="segment-id"><xsl:value-of select="@original_id"/></p>
          <p class="source-text"><xsl:value-of select="src"/></p>
        </td></tr>
        <xsl:for-each select="targets/target">
          <tr class="target">
            <td class="row-number"><xsl:value-of select="@row_number"/></td>
            <td class="type">(<xsl:value-of select="@engine"/>)</td>
            <td class="target-text" colspan="4"><xsl:value-of select="targetSeg"/></td>
            <td class="class"><span class="minimum"></span>–<span class="maximum"></span></td>
          </tr>
          <tr class="annotated-row">
            <td colspan="2"></td><td class="annotator"></td>
            <td class="annotated"><span class="issuetag first" id="">Text</span> text</td>
            <td class="num-annotations"></td>
            <td class="annotations" colspan="2"><ul>
              <li id=""><span class="issuetag first">Mistranslation</span> [Nehmen]</li>
            </ul></td>
          </tr>
        </xsl:for-each>
        <tr class="space"><td colspan="7"></td></tr>
      </tbody>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
G Annex G: Example of an XML File with a Single Annotated Segment
<?xml version="1.0" encoding="UTF-8"?>
<annotations xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="annotations.xsd">
<annotGrp id="derstandart_at_2012_12_01_141907-1_de" original_id="derstandart_at/2012/12/01/141907-1_de" source="wmt">
<src xml:lang="de">Die MVRDV Architekten beweisen, dass die wahren Abenteuer nicht nur im Kopf sind - am Beispiel von Spijkenisse und dem jüngst errichteten Bücherberg - 2 Fotos</src>
<targets xml:lang="en">
<target engine="SMT">
<targetSeg>The architects MVRDV prove that the true adventures are not only in the HEA-Evald - the example of Spijkenisse and the newly-created books mountain - 2 photos</targetSeg>
<annotatedTargets>
<annotatedTarget annotator="1">The <issue id="de-en-1-1" type="Word order">architects MVRDV</issue> prove that the true adventures are not only <issue id="de-en-1-2" type="Terminology">in the HEA-Evald</issue> - <issue id="de-en-1-3" type="Style/register">the example of</issue> Spijkenisse and the newly-created books mountain - 2 photos</annotatedTarget>
<annotatedTarget annotator="7">The <issue id="de-en-1-4" type="Word order">architects MVRDV</issue> prove that the true adventures are not only in the HEA-Evald -<issue id="de-en-1-5" type="Omission"> </issue>the example of Spijkenisse and the newly-created <issue id="de-en-1-6" type="Mistranslation">books mountain</issue> - 2 photos</annotatedTarget>
<annotatedTarget annotator="8">The <issue id="de-en-1-7" type="Word order">architects</issue> MVRDV prove that <issue id="de-en-1-8" type="Function words">the</issue> <issue id="de-en-1-9" type="Mistranslation">true</issue> adventures are not only in the HEA-Evald -<issue id="de-en-1-10" type="Omission"> </issue>the example of Spijkenisse and the newly-created books mountain - 2 photos</annotatedTarget>
</annotatedTargets>
</target>
</targets>
</annotGrp>
</annotations>
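Annotations in this inline format can be read with standard XML tooling. A minimal sketch (the fragment is adapted and shortened from the annotator="1" target above):

```python
import xml.etree.ElementTree as ET

# Fragment adapted from the annotator="1" target in the example above.
SEG = ('<annotatedTarget annotator="1">The '
       '<issue id="de-en-1-1" type="Word order">architects MVRDV</issue>'
       ' prove that the true adventures are not only '
       '<issue id="de-en-1-2" type="Terminology">in the HEA-Evald</issue>'
       ' - 2 photos</annotatedTarget>')

def issues_by_type(xml_text: str):
    """List (issue type, marked text) pairs in document order.
    itertext() flattens the content, so the span of an outer <issue>
    includes the text of any <issue> nested inside it."""
    root = ET.fromstring(xml_text)
    return [(iss.get("type"), "".join(iss.itertext()))
            for iss in root.iter("issue")]

print(issues_by_type(SEG))
# [('Word order', 'architects MVRDV'), ('Terminology', 'in the HEA-Evald')]
```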
H Annex H: Prose Explanation of the Basic Elements and Attributes
<annotations>: root element of the file
  o Child element(s): one or more <annotGrp> elements

<annotGrp>: A group of segments related to a single source-target pair
  o Child element(s): one <src> element; one <targets> element
  o Attribute(s):
     id (mandatory): an ID for the annotGrp; by convention, this is based on an internal ID (such as the WMT segment ID), which should be recorded using the original-id attribute. (Note: This attribute should be replaced by xml:id in a future version.)
     original-id (optional): the original form of the segment ID
     source (optional): a text descriptor of the source of the segment (e.g., the data set from which it was drawn)

<src>: The source segment (as plain text)
  o Child element(s): none
  o Attribute(s):
     xml:lang: identifies the language of the source segment

<targets>: Container for all annotated target segments
  o Child element(s): one or more <target> elements
  o Attribute(s):
     xml:lang: identifies the language of the target segments

<target>: Contains all annotation for a specific target segment. (This nesting allows multiple targets for the same source to be included in one file.)
  o Child element(s): one <targetSeg> element and one <annotatedTargets> element
  o Attribute(s):
     engine: optional identifier of the engine that translated the text

<targetSeg>: Contains the text of the unannotated target segment
  o Child element(s): none
  o Attribute(s): none

<annotatedTargets>: Container for all annotated target segments corresponding to a single translated target
  o Child element(s): one or more <annotatedTarget> elements
  o Attribute(s): none

<annotatedTarget>: A specific annotated target segment
  o Child element(s): any number of optional <issue> elements
  o Attribute(s):
     annotator: a text identifier for the individual or organisation that annotated the segment

<issue>: A specific MQM issue
  o Child element(s): may contain nested <issue> elements
  o Attribute(s):
     id (mandatory): identifier for a specific issue (Note: should be replaced by xml:id in the future)
     type (mandatory): MQM issue type identifier
I Annex I: Licence Agreement for Using HPE and HEA Data
AGREEMENT ON THE USE OF DATA IN QT21
between
You, hereinafter referred to as the “QT21 User”, and
Taus B.V., a limited liability company organized and existing under the laws of the Netherlands, having its office at Keizersgracht 74, Amsterdam in the Netherlands, registered with the Dutch Chamber of Commerce under registration number [37147141], hereby duly represented by Mr. Jaap van der Meer (hereinafter: “TAUS”).
QT21 is a research action receiving funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 645452 between 2015 and 2018. TAUS, being a project partner in QT21, offers data that will be used in QT21 to support statistical machine translation.
This Agreement is based on the TAUS ToU and sets forth the additional terms and conditions applying to the Parties. It addresses the needs defined in section 29.3 of QT21 Grant Agreement, in particular the ability to
(a) deposit in a research data repository and take measures to make it possible for third parties to access, mine, exploit, reproduce and disseminate — free of charge for any user — the WMT Data Set, including associated metadata, needed to validate the results presented in scientific publications as soon as possible;
(b) provide information — via the repository — about tools and instruments at the disposal of the beneficiaries and necessary for validating the results (and — where possible — provide the tools and instruments themselves).
1. Definitions and rules of construction
1.1 The following words and phrases shall have the meanings and definitions set forth below:
(a) “TAUS ToU” means the TAUS Terms of Use as of January 2016, attached as Annex I to this Agreement;
(b) "QT21 User" means You, the contracting party. A QT21 User is any legal person or legal entity who intends to use the WMT Data Set;
(c) "Agreement" means this agreement;
(d) "Parties" means TAUS and QT21 User jointly;
(e) “WMT Data Set” refers to the data collections that have been prepared for the QT21 Users and that originate from the TAUS data repository: benchmark data named “TAUS_TEST_Data sets_for_QT21_WP3” as defined in the QT21 Data Management Plan.
2. The subject matter of this Agreement
2.1 All terms and conditions of the TAUS ToU are fully incorporated in this Agreement and fully apply unless explicitly stipulated otherwise in this Agreement in writing. In case of any inconsistencies between the TAUS ToU and the terms and conditions as laid down in this Agreement, the terms and conditions of this Agreement will prevail.
3. Licence
3.1 Subject to the terms and conditions of this Agreement, TAUS hereby grants to QT21 User access to the WMT Data Set with the following rights:
i) the right to use the target side of the translation units in a commercial product, provided that QT21 User may not resell the WMT Data Set as if it is his own new translation;
ii) the right to make Derivative Works; and
iii) the right to use or resell such Derivative Works commercially;
and for the following goals:
i) research and benchmarking;
ii) piloting new solutions; and
iii) testing of new commercial services.
3.2 The licence in Clause 3.1 of this Agreement is explicitly limited to the use as set forth in that Clause and excludes any other use of the WMT Data Set.
3.3 QT21 User shall not be entitled to and agrees that it shall not disclose and/or make available the WMT Data Set to any Third Party.
3.4 QT21 User does not need to be a TAUS Member to sign this Agreement.
4. No Warranty / No Liability / Indemnification
4.1 TAUS makes no representation nor any warranty that the WMT Data Set is correct or fit for any purpose.
4.2 Except for liability arising from willful misconduct or gross negligence, TAUS will not be liable for any damages, loss of business, profits, revenue, anticipated savings, indirect or punitive damages or for any special, incidental or consequential losses with respect to the WMT Data Set and/or the Database and Language Data Platform.
4.3 QT21 User shall indemnify, defend and hold TAUS harmless for any losses, claims, damages, awards, penalties or injuries incurred, including reasonable attorney’s fees, which arise from any claim by any third party in relation to this Agreement, including the use of the WMT Data Set and/or use of the Database and Language Data Platform.
5. Termination
5.1 The QT21 User rights granted in this Agreement are irrevocable, unless QT21 User breaches any of the conditions of this Agreement, upon which QT21 User must immediately cease access to, and use of, the WMT Data Set.
5.2 To ensure the needs defined in section 29.3 of QT21 Grant Agreement, it is agreed that in the case where TAUS is out of business, or changes ownership, the QT21 User rights set out in this Agreement shall persist.
6. Miscellaneous
6.1 This Agreement is meant to be signed online by QT21 User in the course of the download of the WMT Data Set. Successful completion of the sign-up procedure shall mean the agreement of TAUS to use the WMT Data Set in accordance with this Agreement.
6.2 This Agreement and its terms and conditions cannot be assigned by a Party, without the other Party's prior written consent, such consent not to be unreasonably withheld.
6.3 This Agreement is open to any legal person or legal entity who intends to use the WMT Data Set.
6.4 If one clause is deemed invalid, this does not affect the validity of the rest of the Agreement.
J Annex J: Glossary of Terms
Acronyms Definition
EN English
DE German
CS Czech
LV Latvian
RO Romanian
CRACKER Coordinating project within Horizon 2020
TDA TAUS Data Association
APE Automatic Post Editing
BLEU Bilingual Evaluation Understudy (Machine translation error metric)
DQF Dynamic Quality Framework
DNT Do Not Translate
GTM General Text Matcher (see Annex C.5)
HEA Human Error Annotation
HPE Human Post Editing
IT Information Technology
LSP Language Service Provider
MQM Multidimensional Quality Metrics
MT Machine Translation
TBA To Be Announced
TER Translation Error Rate (see page 33)
TM Translation Memories
WMT Workshop on Machine Translation (organised yearly)
WP Work Package
Table J-23 – Table of acronyms used in this document