Work Package 1
Michael Carl, Mercedes Garcia Martinez, Bartolome Mesa-Lao, Nancy Underwood, CBS
Frank Keller, Robin Hill, UEDIN
November 25, 2013
Name 2nd year review meeting November 25, 2013
Overview of WP1
• Task 1.1: Post-editing (months 6–18) – completed
• Task 1.2: Interactive Translation (months 19–30) – ongoing
• Task 1.3: Translator Types and Translation Styles (months 1–24) – completed
• Task 1.4: Text Type (months 6–30) – ongoing
• Task 1.5: Cognitive Modelling (months 6–36) – ongoing
• Task 1.6: User Modelling (months 6–36) – ongoing
WP1 - Task 1.1: Post-editing
First Casmacat field trial: comparing translation from scratch and post-editing.
Post-editing was faster than from-scratch translation.
Already described in:
• Deliverable D6.1
• Mesa-Lao, Bartolomé (2012). "The next generation translator's workbench: post-editing in CASMACAT v.1.0" . Proceedings of the 34th Translating and the Computer Conference. 29 & 30 November 2012. ASLIB - The Association for Information Management, London.
• Elming, Jakob, Michael Carl, and Laura Winther Balling. (Forthcoming). "Investigating User Behaviour in Post-editing and Translation Using the CasMaCat Workbench."
WP1 - Task 1.2: Interactive Translation
• Collection and post-processing of data from the field trial
• Simple statistics
  – Post-editing time (productivity)
  – Typing activity
  – Gaze behaviour
  – Post-editing quality
  – Revision
• Correlation of processes and translation product properties
  – Keystrokes (insertions and deletions) vs. time vs. edit distance
  – Translation ambiguity vs. gaze fixation time on source and target text
  – Relative translation distortion vs. fixation time on source and target
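The process–product correlations above can be sketched with a plain Pearson correlation over per-segment measures. The numbers below are hypothetical illustrations, not field-trial data:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-segment values: keystrokes, editing time (s), edit distance.
keystrokes = [12, 45, 8, 60, 23]
time_s = [30, 95, 20, 130, 50]
edit_dist = [5, 20, 3, 28, 10]

print(pearson(keystrokes, time_s))     # keystrokes vs. time
print(pearson(keystrokes, edit_dist))  # keystrokes vs. edit distance
```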
WP1 - Task 1.2: Interactive Translation
Data overview in the CRITT TPR database (2nd Field Trial)
3 datasets:
• dataset1 – recorded at Celer, with gaze data, and reviewed
• dataset2 & dataset3 – recorded at home
Raw logging data were post-processed to extract User Activity Data
System      #Segments   Segments containing gaze data   Segments reviewed
CFT1: P        1345                372                        372
CFT2: PI       1368                372                        372
CFT3: PIA      1373                372                        372
Total          4086               1116                       1116
WP1 - Task 1.2
Information calculated per segment
• Nedit: number of times the segment was opened.
• Tdur: total duration the segment was open.
• Kdur: total duration of keystroke activity, excluding pauses of 5 seconds or more.
• Fdur: total duration of post-editing, excluding pauses of 200 seconds or more.
• GazeS: fixation duration on the source segment.
• GazeT: fixation duration on the target segment.
• Mins: manual insertions.
• Ains: automatic insertions.
• Adel: automatic deletions.
• TokS: number of tokens in the source segment.
• LenS: number of characters in the source segment.
• TokT: number of tokens in the target segment.
• LenT: number of characters in the target segment.
• edDistMP: edit distance between MT output and post-edited version
• edDistPR: edit distance between post-edited version and revision
• edDistMR: edit distance between MT output and revision
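The three edit-distance features can be computed with standard Levenshtein distance over token sequences. A minimal sketch (the sentences below are made-up illustrations):

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

# Hypothetical MT output, post-edited version and revision:
mt  = "the house green is big".split()
pe  = "the green house is big".split()
rev = "the green house is large".split()

edDistMP = edit_distance(mt, pe)    # MT output -> post-edited
edDistPR = edit_distance(pe, rev)   # post-edited -> revision
edDistMR = edit_distance(mt, rev)   # MT output -> revision
```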
WP1 - Task 1.2: Productivity – learning effect
• Some post-editors uniformly improved productivity over time
• Longitudinal study?
Gaze fixation on source and target text
• Post-editors fixate more on target than source text
• Enabling interactivity increases fixation on the target text and decreases fixation on the source text
Average gaze fixations on source and target window per system
WP1 - Task 1.2
Total gaze fixations on source and target texts per no. of translation alternatives
WP1 - Task 1.2
Translation alternatives vs gaze fixation
Alignment cross distance: "how much you need to read ahead or back in the text before being able to translate the current alignment unit"
[Figure: alignment cross distance vs. gaze fixation time, for gaze on source text and gaze on target text]
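One simplified reading of this feature: track how far the reader must jump in the source text for each successive target word, given a word alignment. A sketch under that assumption (not the exact TPR-DB definition):

```python
def cross_distance(alignment):
    """Average absolute jump in source position between consecutive
    target words, given `alignment`: target index -> source index.
    A monotone (no-reordering) alignment gives an average jump of 1."""
    src = [alignment[t] for t in sorted(alignment)]
    jumps = [abs(b - a) for a, b in zip(src, src[1:])]
    return sum(jumps) / len(jumps)

# Monotone vs. reordered hypothetical alignments:
print(cross_distance({0: 0, 1: 1, 2: 2}))  # 1.0
print(cross_distance({0: 2, 1: 0, 2: 1}))  # 1.5
```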
WP1 - Task 1.2
Alignment cross distance effect on gaze fixation
Quality of post-edited text
a. Number of revisions made by reviewers
   Initial sessions (dataset 1), carried out at Celer, were reviewed
   Calculated on text modifications, edit distance & revision time
   No significant difference found between GUI configurations
b. Manual analysis of errors in post-editors' output
   Final sessions (dataset 3)
   2 error types:
   • Essential changes not implemented
   • Errors introduced by post-editors
WP1 - Task 1.2
Residual errors in post-edited output:
• Errors introduced by post-editors are often typos and punctuation errors
• Open questions:
– Do the different GUI configurations affect the sort of errors produced? If so, how?
– Do error types correlate with different user types? If so, how?
WP1 - Task 1.2
                         Essential changes       Errors introduced
                         not implemented         by post-editor
Configuration            P    PI   PIA           P    PI   PIA
Mistranslation           9    10   7             -    4    -
Target language errors   42   29   47            27   14   51
Task 1.3: Translator Types and Translation Styles
Post-editing styles
• Backtracking between segments – 3 backtracking strategies:
  – Exclusively local backtracking
  – Text-final long-distance backtracking
  – Mixed in-text backtracking
• Gaze fixations on source and target texts
Task 1.3: Post-editing Styles
[Figure: four panels illustrating post-editing Styles 1–4]
Distribution of post-editing styles
* predominant style, ∙ style also present
WP1 - Task 1.4: Text Type
1. Existing Field trial experiments
Text type: news items
2. Further post-editing experiments using technical texts. Texts and training data from EMEA corpus (European Medicines Agency).
Text type: Technical
Domain: Pharmacy
Language pairs: EN to DA, EN to DE, EN to ES, EN to PT.
System configurations: CASMACAT GUI with and without interactivity
3. Experiments to date
   – Pilot test with an EN to DA system
   – EN to PT experiment: data collected for 21 participants (work in progress)
WP1 - Task 1.5: Cognitive Modelling.
Goal
To understand the cognitive processes involved in human verification and error-checking behaviour while post-editing machine translated output.
Robin Hill Event November 25, 2013
[T1.5] Intuitively something wrong
[T1.5] Google gets it wrong
Jorge Rivas ran into an offensive glitch when using Google Translate. In eight out of ten tries, the Spanish language word “indocumentado,” which translates to “undocumented,” was mistranslated by Google Translate as “illegal” when it appeared in a headline. "As a journalist, when I use the term undocumented immigrant instead of illegal immigrant I’m doing so in order to remain more neutral and not use language charged with anti-immigrant sentiment. When you use the term illegal immigrant, it affects attitudes towards immigrants and people of colour."
[T1.5] Dynamic processes and interaction
• In order to create an interactive system (Casmacat) we need to understand the progressive nature of post-editing and not just the final result. “Help along the way.”
• Human translation is an incremental, dynamic (time and space) and analogue process.
• Little is known about the cognitive processes involved in detecting that a translation is wrong.
– Some literature on proofreading and on plausibility.
[T1.5] Methodology: eye-tracking
1. Precise indication of where and when attention is focused.
2. Patterns of eye movements can reveal how problems are initially spotted, checked/verified and then resolved.
3. People do not read and parse the sentences normally and then generate a BLEU score.
[T1.5] Experiments
• Investigate the cognitive processes involved in checking for lexical, syntactic and semantic violations in translated text.
• Establish clear baselines.
• Contrast monolingual (native) and multilingual (non-native) readers of English.
• Establish whether there are “levels of processing difficulty” between classes of errors.
[T1.5] Error Classifications
1) TE: Transposition (Easy). Hypothesised to be the easiest and to provide a baseline measure.
Picasso said that good artists ocpy [copy], great artists steal.
[Picasso sagte, dass gute Künstler kopieren, großartige Künstler klauen.]
2) TD: Transposition (Difficult). Two internal letters switched to produce an incorrect but legitimate word.
I have decided to write all my deepest thoughts in a dairy [diary] again.
[T1.5] Error Classifications
3) WO: Word Order. Transposition at the word level rather than letter level.
Mostly were affected [affected were] the vegetable, corn and chickpea crops. [Betroffen waren vor allem der Gemüse-, Mais- und Kichererbsenanbau.]
4) MT: Mistranslation of Tense or agreement. Violation in verb tense or a mismatch in gender or number agreement.
Many of our friend [friends] are surfers and I have a great friend who lives in Tamarindo.
[Viele unserer Freund sind Surfer und ich habe einen großartigen Freund, der in Tamarindo lebt.]
The cuts were [would] ultimately hit the combat troops. [Die Kürzungen wurden letztendlich die Kampftruppen treffen.]
[T1.5] Error Classifications
5) ML: Mistranslated Lexical item. Semantically connected but contextually odd or inappropriate.
Judge Torkjel Nesheim cancelled [interrupted] Breivik during his monologue.
[Richter Torkjel Nesheim unterbrach Breivik während diesem Monolog.]
[T1.5] Materials and design
• Four conditions were drawn from the German-to-English Machine Translation Marathon 2012 (MTM12) competition dataset.
• 24 sentence frames were constructed for each of the five error conditions. Each item had two variants: a correct version and a version where one word was the primary source of an error.
• Two balanced item lists.
• 180 sentences (including distractor items) presented in random order.
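The two-list counterbalancing described above can be sketched as follows. The item format is hypothetical; the idea is only that each participant sees one variant of every frame:

```python
import random

def build_lists(frames):
    """Counterbalance (correct, error) sentence pairs over two lists:
    a frame shown with its error on list 1 appears correct on list 2,
    and vice versa, so each list contains one variant of every frame."""
    list1, list2 = [], []
    for i, (correct, error) in enumerate(frames):
        if i % 2 == 0:
            list1.append(error)
            list2.append(correct)
        else:
            list1.append(correct)
            list2.append(error)
    return list1, list2

# Hypothetical sentence frames (correct version, error version):
frames = [("I write in a diary.", "I write in a dairy."),
          ("Many of our friends surf.", "Many of our friend surf.")]
list1, list2 = build_lists(frames)
random.shuffle(list1)  # items were presented in random order
```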
[T1.5] Procedure
• Participants had to read each sentence and decide whether there was an error (yes/no decision).
• If yes, they had to click on the first word of where the problem began (location as well as judgement).
• Binocular recording of eye movements at a 1 kHz sample rate per eye.
[T1.5] Participants
• Monolinguals (native English)
  – 20 native English speakers.
  – 11 male, 9 female; mean age 23.05.
• Multilinguals (non-native English)
  – 20 non-native English speakers.
  – 6 male, 14 female; mean age 30.2.
  – European first language (L1) and English as their second language (L2), averaging 20.6 years of English.
  – 7 bilingual; 13 trilingual or more.
  – 7 had experience or training in professional translation.
[T1.5] Analyses
• Range of measures (see D1.2), essentially broken into:
– Global Effects (sentence-level analyses)
– Local Effects (word-level analyses)
• Focus on target word (error vs. no error) and its following word (spillover effects)
• Combined results presented here but full details by experiment in D1.2.
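A word-level measure such as first fixation duration can be extracted from raw fixation data roughly like this. The data layout is a hypothetical simplification; the real analyses are based on the eye-tracking logs described in D1.2:

```python
def first_fixation_duration(fixations, target_span):
    """Duration (ms) of the first fixation landing on the target word.
    `fixations` is a time-ordered list of (x_position, duration_ms);
    `target_span` is the (start, end) x-range of the word on screen.
    Returns None if the word was skipped."""
    lo, hi = target_span
    for x, dur in fixations:
        if lo <= x < hi:
            return dur
    return None

# Hypothetical trial: the third fixation is the first to land on the word.
fixations = [(10, 210), (35, 190), (58, 240), (80, 180)]
print(first_fixation_duration(fixations, (50, 70)))  # 240
```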
[T1.5] Combined summary (global)
Detection rates for each error type (maximum 12) and false positives (maximum 120)

Error Type        Monolinguals          Multilinguals
                  Mean    Percent       Mean    Percent
TE                11.85   98.75         11.50   95.83
TD                 8.90   74.17          9.40   78.33
WO                10.65   88.75         10.85   90.42
MT                10.50   87.50          9.95   82.92
ML                 8.25   68.75          7.65   63.75
False Positives   16.85   14.04         20.85   17.38

• No significant difference in detection between the linguistic groups for any error type.
• 25% more false detections by non-native English speakers (more cautious or pernickety?), but not reliable (large variance and individual differences).
• Significant ranking of errors: TE > WO >= MT > TD > ML
[T1.5] Combined summary (global)
Mean overall reading times for the five sentence constructions
[Figure: total sentence reading time (ms, 5000–15000) for error and no-error sentences across error types TE, TD, WO, MT, ML, for monolingual and multilingual groups]
More careful in non-error? Non-sig.
Reading speed of error sentences: TE < (WO = MT = TD) < ML
[T1.5] Combined summary (global)
• Errors appear to lead to longer individual fixations but not necessarily longer overall reading times for sentences.
• Small pupillometric response to an error for the multilinguals.
• As far as end performance is concerned, participants scored consistently well and took a similar length of time, irrespective of whether they were native or non-native speakers of English.
[T1.5] Combined summary (local)
• Differences between the linguistic groups emerge on gaze behaviour around the target word.
• Problems were detected faster by the monolinguals than by the multilinguals.
• Immediate impact for multilinguals only for the simplest baseline condition (TE). Disruption “spills over” more.
• Temporal disassociation between eye-movement control and sentence processing for multilinguals.
  – E.g. greater likelihood of making a leftwards regressive movement, but only two or more fixations after initially encountering the error.
[T1.5] Combined summary (local)
Mean First Fixation Duration on Target Word
[Figure: mean first fixation duration on the critical word (ms, 100–320) for error and no-error sentences across error types TE, TD, WO, MT, ML, for monolingual and multilingual groups; significant differences at p<0.05 marked]
[T1.5] Combined summary (local)
Probability of going back after reaching the “spill over” word
[Figure: probability of making a regressive eye movement (0.0–0.7) after reaching the spill-over word, for error and no-error sentences across error types TE, TD, WO, MT, ML, for monolingual and multilingual groups]
[T1.5] Next stage
[T1.5] Integration
• Monolinguals could perform a cheap, fast first pass, detecting potential problems.
  – Effectively a manual modification of word-confidence levels.
  – Reduce both time and skill wastage of professional translators.
  – GUI modification for monolingual checking?
• Display of text should avoid placing low-confidence words at the beginning or end of lines, as well as avoiding sentence and clause breaks over lines.
  – Multilingual eyes in particular may have moved on to the next word before a mistake is identified, increasing regressions.
  – Particularly costly and disruptive if a return sweep has already been made.
• Dynamic window size for the amount of predictive or suggested text shown at any moment.
• The separation into global and localised effects complements the approaches of other work packages:
  – sentence-level post-editing effort (global processing) and word-level confidence measures (local effects);
  – paraphrasing granularity (sentential/clausal versus lexical/phrasal).
Summary
• Bilingual advantages and disadvantages
• Levels of error difficulty
• Post-editing styles
• Post-editor performance
– Quality (errors)
– Productivity
– Alignment cross distance and translation ambiguity
Future Work
• T1.6 User Modelling
  – Integrate:
    • Cognitive modelling (T1.5)
    • Translator types and styles (T1.3)
    • Text types and language pairs (T1.4) – user behaviour w.r.t. specific text types
  – Correlate:
    • User profiles, text types, quality, error productions, productivity, gaze activity
  – Longitudinal study – learning effects
  – Interaction of translator types and translation briefs
Future Work
• Evaluate specific UI components:
  – Visualisation of word alignments (WP1)
  – Visualisation of translation options (WP3)
  – Correlate confidence measures with post-editing difficulty (WP1)
  – Visualisation of confidence measures (WP2+4)
  – Size of prediction window (WP3)
• Translation Data Analytics
  – 6-week intensive workshop in summer 2014, including user group data
  – Disseminate in an EAMT workshop
WP1 End
Thanks!