2008 – copyright SYSTRAN

SYSTRAN Challenges and Recent Advances in Hybrid Machine Translation

Jean Senellart, Jin Yang, Jens Stephan

jyang@systransoft.com


Overview

SYSTRAN – 40 years of innovation

The MT Challenges

SYSTRANLab

• Projects
• Hybrid Engines
• From Research to Products

CWMT08

Conclusions


SYSTRAN

40 years of history

Located in Paris (La Défense) and San Diego

70+ employees: ~20 linguists, ~30 engineers

• Including 10 PhDs


Core Technology

Core technology: “Rule-Based”

Based on language description

Analysis – Transfer – Generation paradigm

Build a « syntax tree » based on hierarchical constituents with multi-level relationships

Multi-pass analysis

• Morphology Analysis
• Homograph Resolution
• Clause Boundary
• Syntagm Identification
• Syntactic Role Identification
• …

Rely heavily on linguistic resources
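The multi-pass idea can be pictured as a chain of functions, each adding annotations that later passes read. The sketch below is only an illustration: the `Token` class, the `run_passes` helper and both toy rules are hypothetical and are not SYSTRAN's actual analysis modules.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Token:
    surface: str
    # Annotations accumulated by successive passes (feature names are illustrative).
    features: Dict[str, str] = field(default_factory=dict)


Pass = Callable[[List[Token]], List[Token]]


def morphology(tokens: List[Token]) -> List[Token]:
    # Toy rule: flag words ending in "s" as inflected forms.
    for t in tokens:
        t.features["morph"] = "s-form" if t.surface.endswith("s") else "base"
    return tokens


def homograph_resolution(tokens: List[Token]) -> List[Token]:
    # Toy rule: a word directly after "the" is read as a noun.
    for prev, cur in zip(tokens, tokens[1:]):
        if prev.surface.lower() == "the":
            cur.features["pos"] = "noun"
    for t in tokens:
        t.features.setdefault("pos", "unknown")
    return tokens


def run_passes(sentence: str, passes: List[Pass]) -> List[Token]:
    # Each pass receives the tokens annotated by all earlier passes.
    tokens = [Token(word) for word in sentence.split()]
    for analysis_pass in passes:
        tokens = analysis_pass(tokens)
    return tokens


analysed = run_passes("the books open", [morphology, homograph_resolution])
for token in analysed:
    print(token.surface, token.features)
```

Keeping every pass behind the same signature means further passes (clause boundaries, syntactic roles) could be appended to the list without touching the earlier ones.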


Languages

Chinese 882      Korean 78
Arabic 422       Italian 62
Spanish 358      Ukrainian 47
English 350      Polish 42
Hindi 325        Dutch 23
Portuguese 250   Serbo-Croatian 21
Russian 170      Greek 18
French 130       Czech 12
Japanese 125     Albanian 6
Urdu 100         Slovak 6
German 100
Farsi 82

22 source languages

70 language pairs

Dictionaries: 200K-1M entries per language pair (LP)

~6M-entry reference multi-source / multi-target dictionary

3600


SYSTRAN Activity

Retail products

• Windows Desktop Product
• SYSTRAN Mobile on PDA
• Mac OS Dashboard Widget

Online Services

• SYSTRANBox, SYSTRANNet, SYSTRANLinks

Corporate customers

• Symantec, Cisco, Verizon, Ford, Daimler, Chemical Abstract…

Institutional Customers

• EC and US agencies

Portals - Online Translation

• “Babel Fish”, Google, Yahoo!, Microsoft Live, …


MT Challenges RBMT/SMT Strengths and Weaknesses - I

Rule-Based system builds a translation with available linguistic resources (dictionaries, rules)

Human-built resources

• Incremental

Track the translation process

• Predictable output

Some phenomena are hard to formalize

• Need semantic/pragmatic knowledge

Not designed to deal with exceptions to the rules

• … which are very frequent


MT Challenges RBMT/SMT Strengths and Weaknesses - II

Statistical system finds a translation within a choice of many, many possible translations

Very easy to build

• Automatic training process

Knowledge acquisition is easy…

• Not limited to predefined linguistic patterns – “phrase”

… but cannot “understand” or generalize information

• Not even elementary rules

Output is “unpredictable”


MT Challenges: Corpus-Based or Rule-Based Approach?

No conflict between “corpus” and “rule-based” approaches

Possible to learn rules

• Already learns terminology – monolingual and multilingual
• Some approaches acquire complex rules

Possible to find the best translation amongst several translations

• “Decoding” can be constrained by syntactic restrictions
• Linguistic rules but corpus drives!
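One way to read "linguistic rules but corpus drives" is: rules remove candidates that violate a hard restriction, and a corpus-derived score ranks whatever remains. The following is a minimal sketch under that reading; `pick_translation`, the scores and the `agrees` check are invented for illustration and do not come from the slides.

```python
from typing import Callable, Dict, List, Optional

Constraint = Callable[[str], bool]


def pick_translation(candidates: Dict[str, float],
                     constraints: List[Constraint]) -> Optional[str]:
    """Return the best-scoring candidate that satisfies every constraint."""
    allowed = [c for c in candidates if all(ok(c) for ok in constraints)]
    if not allowed:
        return None  # every candidate violated at least one rule
    return max(allowed, key=lambda c: candidates[c])


# Hypothetical corpus-based scores (higher is better, e.g. log-probabilities).
candidates = {
    "he have a car": -2.0,
    "he has a car": -2.5,
    "has he a car": -4.0,
}


def agrees(sentence: str) -> bool:
    # Toy syntactic restriction: reject a subject-verb agreement error.
    return "he have" not in sentence


best = pick_translation(candidates, [agrees])
print(best)
```

Without the constraint the highest-scoring candidate is the ungrammatical one; with it, the score only decides among candidates the rule allows.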


SYSTRANLab

Research Projects Overview

Toward Hybrid Engines

Collaborations

Statistical Post-Edition

Lattice Decoding

Source Analysis Adaptation

From Research to Products


Research Projects

Resources Acquisition

• Consolidating a 6M entry multilingual dictionary
• Acquiring more from corpus – lexicon and rules

Linguistic Development

• Entity Recognition with local grammars
• Autonomous Generation modules

Introduction of corpus-based technology

Applications

• More interactive applications
• Professional Post-Edition Module (POEM)


SYSTRANLab Research Projects

The Phoenix Project

Collaboration with P. Koehn (University of Edinburgh)

Introduce corpus-based decision modules in SYSTRAN

Specialized modules

• Word Sense Disambiguation
• Lattice Generation
• Preposition / Determiner Choice


SYSTRANLab Research Projects

The Sphinx Project

Collaboration with CNRC

Sequential use of SYSTRAN and statistical engines (Statistical Post-Edition)

GALE (DARPA Project)

Participated in WMT07, NIST08
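The sequential set-up (rule-based output first, statistical correction second) can be illustrated with a deliberately tiny post-editor. This is a toy under strong assumptions: it aligns tokens position by position and learns single-word substitutions, whereas a real statistical post-edition system would use a proper word aligner and a phrase-based decoder. The function names and the training pairs are made up.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def train_post_editor(pairs: List[Tuple[str, str]]) -> Dict[str, str]:
    """Learn token substitutions from (rule-based output, reference) pairs."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for mt_output, reference in pairs:
        mt_tokens, ref_tokens = mt_output.split(), reference.split()
        if len(mt_tokens) != len(ref_tokens):
            continue  # toy aligner: only handles equal-length sentences
        for mt_tok, ref_tok in zip(mt_tokens, ref_tokens):
            counts[mt_tok][ref_tok] += 1
    # Keep the most frequent reference token for each rule-based token.
    return {tok: c.most_common(1)[0][0] for tok, c in counts.items()}


def post_edit(mt_output: str, table: Dict[str, str]) -> str:
    """Apply the learned substitutions; unknown tokens pass through."""
    return " ".join(table.get(tok, tok) for tok in mt_output.split())


# Hypothetical training data: the rule-based engine says "knob" where the
# domain's reference translations say "button".
pairs = [
    ("click on the knob", "click on the button"),
    ("press the knob", "press the button"),
    ("the knob is red", "the button is red"),
]
table = train_post_editor(pairs)
edited = post_edit("hold the knob", table)
print(edited)
```

The point of the sketch is the data flow: the statistical component is trained on the rule-based engine's own output, so it learns to correct that engine's systematic choices rather than to translate from scratch.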


SYSTRANLab Research Projects

The Pegasus Project

Collaboration with H. Schwenk (Université du Maine)

Introduce linguistic knowledge in statistical engines

Participated in WMT08


SYSTRANLab Hybrid Engines

Introduce Self-Learning capability

Learn “post-edition rules”

Deep integration of statistical decision modules

Insert linguistic knowledge in statistical engines

HYBRID


CWMT08

Chinese-English MT evaluation

Primary: RBMT+SPE

Contrast: RBMT

• Started in 1994, 1.2M terms, S&T-focus

            BLEU4   BLEU4-SBP  NIST5   GTM     mWER    mPER    ICT
Primary-a   0.2275  0.2193     7.9180  0.7101  0.7209  0.5085  0.3262
Contrast-b  0.1956  0.1930     7.6356  0.7089  0.7165  0.5123  0.2942
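To put the two rows in proportion, the relative change of the primary run (RBMT+SPE) over the contrast run (RBMT only) can be computed directly from the reported scores. The snippet is plain arithmetic on three of the columns; the helper name `relative_change` is ours.

```python
# Scores copied from the CWMT08 results above.
primary = {"BLEU4": 0.2275, "BLEU4-SBP": 0.2193, "NIST5": 7.9180}
contrast = {"BLEU4": 0.1956, "BLEU4-SBP": 0.1930, "NIST5": 7.6356}


def relative_change(new: float, old: float) -> float:
    """Relative change of `new` with respect to `old`."""
    return (new - old) / old


gains = {name: relative_change(primary[name], contrast[name]) for name in primary}
for name, gain in gains.items():
    print(f"{name}: {gain:+.1%}")
```

On these figures, adding SPE raises BLEU4 by roughly 16% relative to the rule-based baseline.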


CWMT08: SPE Usage

SPE module trained on 1.8M sentences

• CWMT08 training data not used

Not only translation but also annotation by RBMT

• Dates, numerals, etc.

Transfer model is filtered

• Exclusion of “bad rules” by rule-based filtering
• Examples are “random” quotes, entities appearing

Some expressions are “protected”

• Constituents will be replaced with placeholders before SPE
• Translated with RBMT
• Re-injected in translation after SPE

SPE model for CWMT08 is trained using GIZA++, with decoding done by Moses (www.statmt.org/moses)
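The "protected expressions" step can be sketched as mask, post-edit, unmask. The code below assumes the protected spans are picked out of the rule-based output with a regular expression (here only ISO-style dates and plain numbers) and that the stashed string is already the rule-based rendering; the pattern, the placeholder format and `toy_spe` are all hypothetical stand-ins.

```python
import re
from typing import Callable, Dict, Tuple

# Hypothetical pattern for protected constituents: ISO-style dates, then numbers.
PROTECTED = re.compile(r"\d{4}-\d{2}-\d{2}|\d+(?:\.\d+)?")


def protect(text: str) -> Tuple[str, Dict[str, str]]:
    """Replace protected spans with placeholders and remember the originals."""
    saved: Dict[str, str] = {}

    def stash(match: "re.Match[str]") -> str:
        key = f"__PH{len(saved)}__"
        saved[key] = match.group(0)
        return key

    return PROTECTED.sub(stash, text), saved


def reinject(text: str, saved: Dict[str, str]) -> str:
    """Put the stashed rule-based renderings back after post-edition."""
    for key, original in saved.items():
        text = text.replace(key, original)
    return text


def post_edit_protected(rbmt_output: str, spe: Callable[[str], str]) -> str:
    masked, saved = protect(rbmt_output)
    return reinject(spe(masked), saved)


def toy_spe(text: str) -> str:
    # Stand-in for a statistical post-editor that also (wrongly) rewrites numbers.
    return text.replace("knob", "button").replace("3 ", "three ")


rbmt_output = "Released on 2008-11-27 with 3 knobs"
unprotected = toy_spe(rbmt_output)
protected = post_edit_protected(rbmt_output, toy_spe)
print(unprotected)
print(protected)
```

The placeholders keep the post-editor from touching spans the rule-based engine already handles reliably, while the rest of the sentence still benefits from the statistical corrections.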


Statistical Post-Edition: A Case Study

Case Study – SYMANTEC – English>Chinese

System                             BLEU   PERFECT  Improv / Degrad
SYSTRAN Raw                        20.89  2        -
SYSTRAN Cust                       34.49  4.8      ref
SYSTRAN Raw + Translation Model    46.86  7.4      -
SYSTRAN Cust + Translation Model   50.90  10.5     15


Conclusions

Our approach is to start with a rule-based framework

• Developed techniques give very competitive results
• Major focus on “degradation” control
• Learn more advanced post-edition rules

Generic Translation – still a long way to go

• Bigger still better?

Domain Translation

• Quality is there – statistics provides adaptation and fluidity

Need dedicated applications, workflow

Bootstrapping new language pair development