“Data & Linguistics”: Delivering Machine Translation with Subject Matter Expertise
John Tinsley, Director / Co-Founder
Localization World. 31st Oct 2014, Vancouver
What is Linguistic Engineering?
Data Engineering
Pre-processing Post-processing
Input Output
Training Data
Patents: an MT nightmare
L is an organic group selected from -CH2-(OCH2CH2)n-, -CO-NR'-, with R'=H or C1-C4 alkyl group; n=0-8; Y=F, CF3 …
maximum stress of 1.2 to 3.5 N/mm<2> and a maximum elongation of 700 to 1,300% at 0[deg.] C.
Long Sentences
Technical constructions
Largest single document: 249,322 words
Longest Sentence: 1,417 words
“Most of these things are not like the other”
Many languages aren’t a dream either
(And teaches the teacher her students language the Arabic)
Spanish – Italian, English – Spanish, Arabic – English
Data Engineering + Linguistic Engineering
An “ensemble” architecture
Input
Output
Training Data
Chinese pre-ordering rules
Statistical post-editing
Spanish med-device entity recognizer
Multi-output combination
Korean pharma tokenizer
Patent input classifier
Client TM/terminology (optional)
Japanese script normalisation
German compounding rules
Engines: Moses, RBMT, Moses, Moses
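The ensemble idea can be pictured as a chain of pre-processing steps, an engine, and a chain of post-processing steps. The following is a minimal sketch under stated assumptions, not Iconic's implementation: every function name here is hypothetical, and the "engine" is a toy word-for-word lookup standing in for a real system such as Moses or an RBMT engine.

```python
from typing import Callable, Dict, List

# A step is any function that takes text and returns text.
Step = Callable[[str], str]


def normalise_whitespace(text: str) -> str:
    """Hypothetical pre-processing step: collapse runs of whitespace."""
    return " ".join(text.split())


# Toy French-to-English lexicon (illustrative only, not real training data).
TOY_LEXICON: Dict[str, str] = {"la": "the", "maison": "house"}


def toy_engine(text: str) -> str:
    """Stand-in for an MT engine: word-for-word lookup, unknown words pass through."""
    return " ".join(TOY_LEXICON.get(word, word) for word in text.split())


def capitalise_first(text: str) -> str:
    """Hypothetical post-processing step: restore sentence-initial capitalisation."""
    return text[:1].upper() + text[1:]


def build_pipeline(pre: List[Step], engine: Step, post: List[Step]) -> Step:
    """Compose pre-processing, engine, and post-processing into one callable."""
    def run(text: str) -> str:
        for step in pre:
            text = step(text)
        text = engine(text)
        for step in post:
            text = step(text)
        return text
    return run


translate = build_pipeline(
    pre=[normalise_whitespace, str.lower],
    engine=toy_engine,
    post=[capitalise_first],
)
print(translate("  La   maison "))  # The house
```

Because each step has the same signature, a language-specific or domain-specific component (a tokenizer, an entity recognizer, a compounding rule) could be added to or removed from a given pipeline without touching the engine.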
Easier said than done
“A very particular set of skills”
MT Knowledge
(from a scientific perspective)
Domain Knowledge
(the nature of the content)
Linguistic Knowledge
(the characteristics of the language)
MT Knowledge
Implementation
• Computer science!
• Programming
• Data structures
• Algorithms
Science
• Machine learning
• Probability theory
• Bayesian statistics
• Markov Models
Domain Knowledge
What’s important?
• Chemical names
• References to figures
• Claim cross-references
Where do we learn?
• Commercial partners
• LSPs & Translators
• Research
Consistent across langs?
• Japanese abstract order
• Numbering / bullets
• Document layout
Document types?
• Patents: applications, reports
• Pharmaceutical: IFUs, labels
Iconic Translation Machines
Linguistic Knowledge
Number agreement: the house / the houses vs. la maison / les maisons
Gender agreement: the house / the cheese vs. la maison / le fromage
English - Spanish
English - French
Linguistic Knowledge
English - German
English - Chinese
种水果的农民
The farmer who grows fruit [Lit: “grow fruit (particle) farmer”]
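To illustrate what a pre-ordering rule might do with the example 种水果的农民, here is a toy sketch. It assumes the input is already tokenized and handles only a single 的 joining a modifier clause to a noun; the function name is hypothetical, and real pre-ordering rules would rely on parsing and be far more selective.

```python
from typing import List

DE = "的"  # particle linking a modifier clause to the noun it modifies


def preorder_relative_clause(tokens: List[str]) -> List[str]:
    """Toy rule: rewrite 'CLAUSE 的 NOUN' as 'NOUN 的 CLAUSE'.

    Inputs with zero or several 的, or with an empty side, are returned unchanged.
    """
    if tokens.count(DE) != 1:
        return list(tokens)
    i = tokens.index(DE)
    clause, noun = tokens[:i], tokens[i + 1:]
    if not clause or not noun:
        return list(tokens)
    return noun + [DE] + clause


tokens = ["种", "水果", "的", "农民"]  # grow / fruit / (particle) / farmer
print(preorder_relative_clause(tokens))  # ['农民', '的', '种', '水果']
```

After reordering, the source tokens follow the English order "farmer (who) grow fruit", which is the kind of alignment a statistical engine finds easier to learn.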
If you don’t understand it, you can’t translate it
MT with Subject Matter Expertise
“Allopurinol-induced serious cutaneous adverse reactions (SCAR), including Stevens-Johnson syndrome (SJS) and toxic epidermal necrolysis (TEN), are associated with a genetic marker, the HLA-B*5801 allele.”
“IPTranslator is perfect for someone who needs to search [patents] across multiple languages and is useful in the case of both patentability and infringement searches.”
– Aalt van de Kuilen, Global Head of Patent Information, Abbott
Machine Translation for Patents
What is the value for users?
Specialist solutions deliver more useable outcomes for the user
Post-editing = Increased productivity
For information purposes = Extract more meaning
Multilingual search = Retrieve more relevant results
De-risking the machine translation proposition
What is the value for users?
Typical Prerequisites: + Data + Time + €€€ = ???
New Prerequisites: + No data needed + Systems are ready to go + No upfront cost = Evaluate immediately
Customisation. Refinement.
» Incorporation of user feedback
» Incremental training with post-edits
» Tuning for specific input types
Case Studies
1. What this approach means straight up in terms of quality…
2. Productivity gains from using these systems…
3. As a foundation for client customization…
Case 1: Quality
[Chart: German to English translation, scores on a scale of 1 to 5]
Evaluator 1: 2.83
Evaluator 2: 4
Evaluator 3: 3.86
Average: 3.56
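As a quick sanity check on the Case 1 chart, the snippet below recomputes the average from the three evaluator scores. It assumes the values 2.83, 4 and 3.86 belong to Evaluators 1 to 3 in the order shown; the variable names are illustrative.

```python
# Scores as read from the chart (pairing with evaluators is an assumption).
scores = {"Evaluator 1": 2.83, "Evaluator 2": 4.0, "Evaluator 3": 3.86}

average = sum(scores.values()) / len(scores)
print(round(average, 2))  # 3.56
```

The result matches the 3.56 shown in the chart's "Average" column.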
Case 2: Productivity
Business Need
Machine Translation technology for the legal industry
Iconic had a domain-specific MT solution for that industry
Process (1)
Translation samples required for initial evaluation
Delivered immediately and initial results were positive
Process (2)
Integrate Iconic with GlobalSight for productivity pilot
“The complexities and unforeseen but inevitable surprises of MT integration in large scale production processes were handled both competently and efficiently.”
Performance
>20% productivity increase for translator post-editing Iconic output
“Measurable productivity gains delivered from the outset”
Looking forward
• Ongoing improvement through feedback from translators
• Ongoing improvement through the incorporation of post-edits
• More than 5 million words translated to date for Asian languages
• Periodic roll-out of new languages over time
Case 3: Customization
- Modify our patent machine translation engines for “Written Opinions” on patents
- 0.25% new data, 2 new ensemble processes
[Chart: Chinese to English, axis 0 to 60. Labels: Iconic, Google, Baseline, + Modification. Values: 21, 20, 27]
Take home messages…
All content is not created equal
We cannot afford to be dogmatic when it comes to MT
Know your subject matter!
Domain specific MT is about more than just data + Linguistics!