Work Package 1
Michael Carl, Mercedes Garcia Martinez, Bartolome Mesa-Lao, Nancy Underwood, CBS
Frank Keller, Robin Hill, UEDIN
November 25, 2013
Name 2nd year review meeting November 25, 2013
Overview of WP1
• Task 1.1: Post-editing (months 6–18) – completed
• Task 1.2: Interactive Translation (months 19–30) – ongoing
• Task 1.3: Translator Types and Translation Styles (months 1–24) – completed
• Task 1.4: Text Type (months 6–30) – ongoing
• Task 1.5: Cognitive Modelling (months 6–36) – ongoing
• Task 1.6: User Modelling (months 6–36) – ongoing
WP1 - Task 1.1: Post-editing
First Casmacat field trial: comparing translation from scratch and post-editing.
Post-editing was faster than from-scratch translation.
Already described in:
• Deliverable D6.1
• Mesa-Lao, Bartolomé (2012). "The next generation translator's workbench: post-editing in CASMACAT v.1.0" . Proceedings of the 34th Translating and the Computer Conference. 29 & 30 November 2012. ASLIB - The Association for Information Management, London.
• Elming, Jakob, Michael Carl, and Laura Winther Balling. (Forthcoming). "Investigating User Behaviour in Post-editing and Translation Using the CasMaCat Workbench."
WP1 - Task 1.2: Interactive Translation
• Collection and post-processing of data from the field trial
• Simple statistics
  – Post-editing time (productivity)
  – Typing activity
  – Gaze behaviour
  – Post-editing quality
  – Revision
• Correlation of processes and translation product properties
  – Keystrokes (insertions and deletions) vs. time vs. edit distance
  – Translation ambiguity vs. gaze fixation time on source and target text
  – Relative translation distortion vs. fixation time on source and target
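The process–product correlations above can be sketched with a plain Pearson correlation over per-segment measures. The numbers below are hypothetical illustrations, not field-trial data:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-segment values: keystrokes, editing time (s), edit distance.
keystrokes = [12, 45, 8, 60, 23]
time_s = [30, 95, 20, 130, 50]
edit_dist = [5, 20, 3, 28, 10]

print(pearson(keystrokes, time_s))     # keystrokes vs. time
print(pearson(keystrokes, edit_dist))  # keystrokes vs. edit distance
```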
WP1 - Task 1.2: Interactive Translation
Data overview in the CRITT TPR database (2nd Field Trial)
3 datasets:
• dataset1 – recorded at Celer, with gaze data, and reviewed
• dataset2 & dataset3 – recorded at home
Raw logging data were post-processed to extract User Activity Data
System      #Segments   Segments containing gaze data   Segments reviewed
CFT1: P        1345                372                        372
CFT2: PI       1368                372                        372
CFT3: PIA      1373                372                        372
Total          4086               1116                       1116
WP1 - Task 1.2
Information calculated per segment
• Nedit: number of times the segment was opened.
• Tdur: total duration the segment was open.
• Kdur: total duration of keystroke activity, excluding pauses of 5 seconds or more.
• Fdur: total duration of post-editing, excluding pauses of 200 seconds or more.
• GazeS: fixation duration on the source segment.
• GazeT: fixation duration on the target segment.
• Mins: manual insertions.
• Ains: automatic insertions.
• Adel: automatic deletions.
• TokS: number of tokens in the source segment.
• LenS: number of characters in the source segment.
• TokT: number of tokens in the target segment.
• LenT: number of characters in the target segment.
• edDistMP: edit distance between MT output and post-edited version
• edDistPR: edit distance between post-edited version and revision
• edDistMR: edit distance between MT output and revision
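The three edit-distance features can be computed with standard Levenshtein distance over token sequences. A minimal sketch (the sentences below are made-up illustrations):

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

# Hypothetical MT output, post-edited version and revision:
mt  = "the house green is big".split()
pe  = "the green house is big".split()
rev = "the green house is large".split()

edDistMP = edit_distance(mt, pe)    # MT output -> post-edited
edDistPR = edit_distance(pe, rev)   # post-edited -> revision
edDistMR = edit_distance(mt, rev)   # MT output -> revision
```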
WP1 - Task 1.2: Productivity – learning effect
• Some post-editors uniformly improved productivity over time
• Longitudinal study?
Gaze fixation on source and target text
• Post-editors fixate more on target than source text
• Enabling interactivity increases fixation on the target text and decreases fixation on the source text
Average gaze fixations on source and target window per system
WP1 - Task 1.2
Total gaze fixations on source and target texts per no. of translation alternatives
WP1 - Task 1.2
Translation alternatives vs gaze fixation
Alignment cross distance: "how much you need to read ahead or back in the text before being able to translate the current alignment unit"
[Figure: alignment cross distance vs. gaze fixation time, for gaze on source text and gaze on target text]
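One simplified reading of this feature: track how far the reader must jump in the source text for each successive target word, given a word alignment. A sketch under that assumption (not the exact TPR-DB definition):

```python
def cross_distance(alignment):
    """Average absolute jump in source position between consecutive
    target words, given `alignment`: target index -> source index.
    A monotone (no-reordering) alignment gives an average jump of 1."""
    src = [alignment[t] for t in sorted(alignment)]
    jumps = [abs(b - a) for a, b in zip(src, src[1:])]
    return sum(jumps) / len(jumps)

# Monotone vs. reordered hypothetical alignments:
print(cross_distance({0: 0, 1: 1, 2: 2}))  # 1.0
print(cross_distance({0: 2, 1: 0, 2: 1}))  # 1.5
```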
WP1 - Task 1.2
Alignment cross distance effect on gaze fixation
Quality of post-edited text
a. Number of revisions made by reviewers
   Initial sessions (dataset 1), carried out at Celer, were reviewed
   Calculated on text modifications, edit distance & revision time
   No significant difference found between GUI configurations
b. Manual analysis of errors in post-editors' output
   Final sessions (dataset 3)
   2 error types:
   • Essential changes not implemented
   • Errors introduced by post-editors
WP1 - Task 1.2
Residual errors in post-edited output:
• Errors introduced by post-editors are often typos and punctuation errors
• Open questions:
– Do the different GUI configurations affect the sort of errors produced? If so, how?
– Do error types correlate with different user types? If so, how?
WP1 - Task 1.2
                         Essential changes       Errors introduced
                         not implemented         by post-editor
Configuration            P    PI   PIA           P    PI   PIA
Mistranslation           9    10   7             -    4    -
Target language errors   42   29   47            27   14   51
Task 1.3: Translator Types and Translation Styles
Post-editing styles
• Backtracking between segments – 3 backtracking strategies:
  – Exclusively local backtracking
  – Text-final long-distance backtracking
  – Mixed in-text backtracking
• Gaze fixations on source and target texts
Task 1.3: Post-editing Styles
[Figure: four panels illustrating post-editing Styles 1–4]
Distribution of post-editing styles
* predominant style, ∙ style also present
WP1 - Task 1.4: Text Type
1. Existing Field trial experiments
Text type: news items
2. Further post-editing experiments using technical texts. Texts and training data from EMEA corpus (European Medicines Agency).
Text type: Technical
Domain: Pharmacy
Language pairs: EN to DA, EN to DE, EN to ES, EN to PT.
System configurations: CASMACAT GUI with and without interactivity
3. Experiments to date
   – Pilot test with an EN to DA system
   – EN to PT experiment: data collected for 21 participants (work in progress)
WP1 - Task 1.5: Cognitive Modelling.
Goal
To understand the cognitive processes involved in human verification and error-checking behaviour while post-editing machine translated output.
Robin Hill Event November 25, 2013
[T1.5] Intuitively something wrong
[T1.5] Google gets it wrong
Jorge Rivas ran into an offensive glitch when using Google Translate. In eight out of ten tries, the Spanish language word “indocumentado,” which translates to “undocumented,” was mistranslated by Google Translate as “illegal” when it appeared in a headline. "As a journalist, when I use the term undocumented immigrant instead of illegal immigrant I’m doing so in order to remain more neutral and not use language charged with anti-immigrant sentiment. When you use the term illegal immigrant, it affects attitudes towards immigrants and people of colour."
[T1.5] Dynamic processes and interaction
• In order to create an interactive system (Casmacat) we need to understand the progressive nature of post-editing and not just the final result. “Help along the way.”
• Human translation is an incremental, dynamic (time and space) and analogue process.
• Little is known about the cognitive processes involved in detecting that a translation is wrong.
– Some literature on proofreading and on plausibility.
[T1.5] Methodology: eye-tracking
1. Precise indication of where and when attention is focused.
2. Patterns of eye movements can reveal how problems are initially spotted, checked/verified and then resolved.
3. People do not read and parse the sentences normally and then generate a BLEU score.
[T1.5] Experiments
• Investigate the cognitive processes involved in checking for lexical, syntactic and semantic violations in translated text.
• Establish clear baselines.
• Contrast monolingual (native) and multilingual (non-native) readers of English.
• Establish whether there are “levels of processing difficulty” between classes of errors.
[T1.5] Error Classifications
1) TE: Transposition (Easy). Hypothesised to be the easiest and to provide a baseline measure.
Picasso said that good artists ocpy [copy], great artists steal.
[Picasso sagte, dass gute Künstler kopieren, großartige Künstler klauen.]
2) TD: Transposition (Difficult). Two internal letters switched to produce an incorrect but legitimate word.
I have decided to write all my deepest thoughts in a dairy [diary] again.
[T1.5] Error Classifications
3) WO: Word Order. Transposition at the word level rather than letter level.
Mostly were affected [affected were] the vegetable, corn and chickpea crops. [Betroffen waren vor allem der Gemüse-, Mais- und Kichererbsenanbau.]
4) MT: Mistranslation of Tense or agreement. Violation in verb tense or a mismatch in gender or number agreement.
Many of our friend [friends] are surfers and I have a great friend who lives in Tamarindo.
[Viele unserer Freund sind Surfer und ich habe einen großartigen Freund, der in Tamarindo lebt.]
The cuts were [would] ultimately hit the combat troops. [Die Kürzungen wurden letztendlich die Kampftruppen treffen.]
[T1.5] Error Classifications
5) ML: Mistranslated Lexical item. Semantically connected but contextually odd or inappropriate.
Judge Torkjel Nesheim cancelled [interrupted] Breivik during his monologue.
[Richter Torkjel Nesheim unterbrach Breivik während diesem Monolog.]
[T1.5] Materials and design
• Four conditions were drawn from the German-to-English Machine Translation Marathon 2012 (MTM12) competition dataset.
• 24 sentence frames were constructed for each of the five error conditions. Each item had two variants: a correct version and a version where one word was the primary source of an error.
• Two balanced item lists.
• 180 sentences (including distractor items) presented in random order.
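The two-list counterbalancing described above can be sketched as follows. The item format is hypothetical; the idea is only that each participant sees one variant of every frame:

```python
import random

def build_lists(frames):
    """Counterbalance (correct, error) sentence pairs over two lists:
    a frame shown with its error on list 1 appears correct on list 2,
    and vice versa, so each list contains one variant of every frame."""
    list1, list2 = [], []
    for i, (correct, error) in enumerate(frames):
        if i % 2 == 0:
            list1.append(error)
            list2.append(correct)
        else:
            list1.append(correct)
            list2.append(error)
    return list1, list2

# Hypothetical sentence frames (correct version, error version):
frames = [("I write in a diary.", "I write in a dairy."),
          ("Many of our friends surf.", "Many of our friend surf.")]
list1, list2 = build_lists(frames)
random.shuffle(list1)  # items were presented in random order
```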
[T1.5] Procedure
• Participants had to read each sentence and decide whether there was an error (yes/no decision).
• If yes, they had to click on the first word of where the problem began (location as well as judgement).
• Binocular recording of eye movements at a 1 kHz sample rate per eye.
[T1.5] Participants
• Monolinguals (native English)
  – 20 native English speakers.
  – 11 male, 9 female; mean age 23.05.
• Multilinguals (non-native English)
  – 20 non-native English speakers.
  – 6 male, 14 female; mean age 30.2.
  – European first language (L1) and English as their second language (L2), averaging 20.6 years of English.
  – 7 bilingual; 13 trilingual or more.
  – 7 had experience or training in professional translation.
[T1.5] Analyses
• Range of measures (see D1.2), essentially broken into:
– Global Effects (sentence-level analyses)
– Local Effects (word-level analyses)
• Focus on target word (error vs. no error) and its following word (spillover effects)
• Combined results presented here but full details by experiment in D1.2.
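A word-level measure such as first fixation duration can be extracted from raw fixation data roughly like this. The data layout is a hypothetical simplification; the real analyses are based on the eye-tracking logs described in D1.2:

```python
def first_fixation_duration(fixations, target_span):
    """Duration (ms) of the first fixation landing on the target word.
    `fixations` is a time-ordered list of (x_position, duration_ms);
    `target_span` is the (start, end) x-range of the word on screen.
    Returns None if the word was skipped."""
    lo, hi = target_span
    for x, dur in fixations:
        if lo <= x < hi:
            return dur
    return None

# Hypothetical trial: the third fixation is the first to land on the word.
fixations = [(10, 210), (35, 190), (58, 240), (80, 180)]
print(first_fixation_duration(fixations, (50, 70)))  # 240
```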
[T1.5] Combined summary (global)
Detection rates for each error type (maximum 12) and false positives (maximum 120)

Error Type        Monolinguals          Multilinguals
                  Mean    Percent       Mean    Percent
TE                11.85   98.75         11.50   95.83
TD                 8.90   74.17          9.40   78.33
WO                10.65   88.75         10.85   90.42
MT                10.50   87.50          9.95   82.92
ML                 8.25   68.75          7.65   63.75
False Positives   16.85   14.04         20.85   17.38

• No significant difference in detection between the linguistic groups for any error type.
• 25% more false detections by non-native English speakers (more cautious or pernickety?), but not reliable (large variance and individual differences).
• Significant ranking of errors: TE > WO >= MT > TD > ML
[T1.5] Combined summary (global)
Mean overall reading times for the five sentence constructions
[Figure: total sentence reading time (ms, 5000–15000) for error and no-error sentences across error types TE, TD, WO, MT, ML, for monolingual and multilingual groups]
More careful in non-error? Non-sig.
Reading speed of error sentences: TE < (WO = MT = TD) < ML
[T1.5] Combined summary (global)
• Errors appear to lead to longer individual fixations but not necessarily longer overall reading times for sentences.
• Small pupillometric response to an error for the multilinguals.
• As far as end performance is concerned, participants scored consistently well and took a similar length of time, irrespective of whether they were native or non-native speakers of English.
[T1.5] Combined summary (local)
• Differences between the linguistic groups emerge on gaze behaviour around the target word.
• Problems were detected faster by the monolinguals than by the multilinguals.
• Immediate impact for multilinguals only for the simplest baseline condition (TE). Disruption “spills over” more.
• Temporal disassociation between eye-movement control and sentence processing for multilinguals.
  – E.g. greater likelihood of making a leftwards regressive movement, but only two or more fixations after initially encountering the error.
[T1.5] Combined summary (local)
Mean First Fixation Duration on Target Word
[Figure: mean first fixation duration on the critical word (ms, 100–320) for error and no-error sentences across error types TE, TD, WO, MT, ML, for monolingual and multilingual groups; significant differences at p<0.05 marked]
[T1.5] Combined summary (local)
Probability of going back after reaching the “spill over” word
[Figure: probability of making a regressive eye movement (0.0–0.7) after reaching the spill-over word, for error and no-error sentences across error types TE, TD, WO, MT, ML, for monolingual and multilingual groups]
[T1.5] Next stage
[T1.5] Integration
• Monolinguals could perform a cheap, fast first pass, detecting potential problems.
  – Effectively a manual modification of word-confidence levels.
  – Reduce both time and skill wastage of professional translators.
  – GUI modification for monolingual checking?
• Display of text should avoid placing low-confidence words at the beginning or end of lines, as well as avoiding sentence and clause breaks over lines.
  – Multilingual eyes in particular may have moved on to the next word before a mistake is identified, increasing regressions.
  – Particularly costly and disruptive if a return sweep has already been made.
• Dynamic window size for the amount of predictive or suggested text shown at any moment.
• The separation into global and localised effects complements the approaches of other work packages:
  – sentence-level post-editing effort (global processing) and word-level confidence measures (local effects);
  – paraphrasing granularity (sentential/clausal versus lexical/phrasal).
Summary
• Bilingual advantages and disadvantages
• Levels of error difficulty
• Post-editing styles
• Post-editor performance
– Quality (errors)
– Productivity
– Alignment cross distance and translation ambiguity
Future Work
• T1.6 User Modelling
  – Integrate:
    • Cognitive modelling (T1.5)
    • Translator types and styles (T1.3)
    • Text types and language pairs (T1.4) – user behaviour w.r.t. specific text types
  – Correlate:
    • User profiles, text types, quality, error productions, productivity, gaze activity
  – Longitudinal study – learning effects
  – Interaction of translator types and translation briefs
Future Work
• Evaluate specific UI components:
  – Visualisation of word alignments (WP1)
  – Visualisation of translation options (WP3)
  – Correlate confidence measures with post-editing difficulty (WP1)
  – Visualisation of confidence measures (WP2+4)
  – Size of prediction window (WP3)
• Translation Data Analytics
  – 6-week intensive workshop in summer 2014, including user group data
  – Disseminate in an EAMT workshop
WP1 End
Thanks!