
Exploiting Multilingual Corpora for Machine Translation

Andreas Eisele, Saarland University & DFKI

[email protected]

Arona, September 2005 JRC Enlargement and Integration Workshop

Exploiting parallel corpora in up to 20 languages


Overview

Multilingual/MT Projects & Tools at DFKI

MT-Related Activities at Saarland University

Work in the PTOLEMAIOS Project

Plans for Near-Term Future


Multilingual Projects at DFKI

Main LT Application Areas:

Multilingual Natural Communication

Multilingual Document Production

Crosslingual Information and Knowledge Management


Multilingual Natural Communication

NL Dialogue Systems (DISCO, COSMA, Interprice)

Speech Dialogue Processing (Verbmobil, Interprice)

Robust Speech Parsing (Verbmobil, Interprice)

Automatic Processing and Answering of Email (COSMA, ICC, XtraMind)

Natural Speech Synthesis (Mary, Interprice)

Sample Application Areas: e-commerce (product search, CRM)

Application Projects with Interprice, AOL Europe and spin-off company XtraMind Technologies


Multilingual Document Production

Terminology Checking (DiET, FLAG, WHITEBOARD, SKATE)

Grammar and Style Checking (LATESLAV, FLAG, SKATE)

Controlled Language Checking (FLAG, WHITEBOARD, SKATE)

Automatic XML Tagging (WHITEBOARD)

Consistency Control (BiLD, WHITEBOARD)

Sample Application Areas: multilingual document production, web-content production

Application Project with SAP

Spin-Off company


Crosslingual Information and Knowledge Management

Crosslingual Content Management (TWENTYONE, MUCHMORE)

Crosslingual Information Retrieval (TWENTYONE, MULINEX, MIETTA, MUCHMORE)

Crosslingual Multimedia Retrieval (POP-EYE, OLIVE, MUMIS, DIRECT INFO)

Crosslingual Information Extraction (PARADIME, WHITEBOARD, DIRECT INFO)

Crosslingual Text Mining, Terminology Extraction (GETESS, AIRFORCE, WIPO)

Multilingual Summarization (MULINEX, MUCHMORE, MUSI)

Multilingual Language Generation (TG/2, TEMSIS, MIETTA)

Sample Application Areas: multilingual and crosslingual search, tourism information on the web, up-to-date air quality reporting, information management for mega-events (world championship, Olympic Games), phonetic trademark search, term extraction from patent translations

Application Projects with German Telekom, ESG, Dresdner Bank, law firm Boehmert&Boehmert, feasibility study on terminology extraction with WIPO (via acrolinx), …


Multilingual Resources at DFKI

POS-tagger TnT (T.Brants) and Chunkie can be trained for arbitrary languages

Middleware HoG for multilingual robust shallow and HPSG-based deep analysis (mapping into RMRSs)

Morphologies from MMorph project exist for German, English, French, Spanish, Italian

Morphologies are encoded as FS transducers, usable for error-tolerant analysis and generation

Adding more languages is very easy (as done for Arabic with A.Soudi)

Uniform handling of all EU languages would be extremely convenient, but linguistic resources are currently lacking


Multilingual Projects at DFKI

Main LT Application Areas:

Multilingual Natural Communication

Multilingual Document Production

Crosslingual Information and Knowledge Management

Topic emerging since 2005: Machine Translation


Machine Translation at DFKI

Topics in Compass (Digital Olympics 2006): Multi-Engine Machine Translation, Speech Technologies, Multilingual Content Management, Cross-lingual Information Retrieval and Multilingual Question Answering

Open LOGOS

LOGOS MT ® = one of the largest and most powerful among the commercial MT engines

DFKI turned LOGOS MT into an open source product (in cooperation with GlobalWare AG)

Plans for integrated, hybrid MT from rule-based and stochastic engines (code name: EuroMatrix)


MT Activities at Saarland University

Guiding principle: Start with method that works today, improve it by adding linguistic functionality as appropriate

Starting point: Phrase-based SMT (Köhn, Och, Marcu, HLT-NAACL 2003)

Conceptually, phrase-based SMT is an intermediate step between translation memory (TM) and MT: it combines TM’s ability to learn from examples with the compositionality of MT

Among the best approaches in the ongoing DARPA evaluation campaign

Easy to deploy (thanks to tools by F.J. Och and P. Köhn)

Conceptually very simple, hence a good candidate for enriching the models with linguistic sophistication
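To make the "learn from examples, then compose" idea concrete, here is a minimal sketch of how phrase pairs can be read off a word-aligned sentence pair. It is a simplified illustration of the general consistency criterion used in phrase-based SMT, not the code of the tools mentioned on the slide; the function name `extract_phrase_pairs`, the `max_len` parameter, and the toy sentences are assumptions made for this example, and the usual extension over unaligned words is left out.

```python
from typing import List, Set, Tuple


def extract_phrase_pairs(
    src: List[str],
    tgt: List[str],
    alignment: Set[Tuple[int, int]],
    max_len: int = 3,
) -> Set[Tuple[str, str]]:
    """Collect phrase pairs that are consistent with a word alignment.

    `alignment` holds (source_index, target_index) links. A phrase pair is
    kept only if no word inside the target span is linked to a source word
    outside the source span. (Hypothetical helper, simplified sketch.)
    """
    pairs = set()
    for s_start in range(len(src)):
        for s_end in range(s_start, min(s_start + max_len, len(src))):
            # Target positions linked to any word in the source span.
            linked = [t for (s, t) in alignment if s_start <= s <= s_end]
            if not linked:
                continue
            t_start, t_end = min(linked), max(linked)
            if t_end - t_start + 1 > max_len:
                continue
            # Consistency check: every link touching the target span must
            # originate inside the source span.
            consistent = all(
                s_start <= s <= s_end
                for (s, t) in alignment
                if t_start <= t <= t_end
            )
            if consistent:
                pairs.add(
                    (" ".join(src[s_start:s_end + 1]),
                     " ".join(tgt[t_start:t_end + 1]))
                )
    return pairs


# Toy example (invented sentence pair with a monotone alignment).
example_pairs = extract_phrase_pairs(
    ["das", "kleine", "Haus"],
    ["the", "small", "house"],
    {(0, 0), (1, 1), (2, 2)},
    max_len=2,
)
```

The extracted pairs behave like translation-memory entries of varying length, and a decoder composes them into translations of unseen sentences.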


MT Activities at Saarland University

April ’05: participation in ACL shared task on statistical machine translation with a multi-engine approach {Finnish, French, German, Spanish} → English

May ’05: participation in DARPA MT evaluation with baseline phrase-based SMT system (Chinese → English)

Project seminar on empirical MT, students learned to turn parallel corpora into SMT systems (based on EuroParl corpus, but also Welsh ↔ English and Arabic ↔ English)

Diploma Thesis on corpus-based MT via RMRS alignment

Experience: Using parallel corpora for MT quickly yields very promising results!

We should have more language pairs and more data…

Crawling of UN document repository, collection of 6-way parallel {Arabic, Chinese, English, French, Russian, Spanish} corpus (+ some German)


The PTOLEMAIOS project

Assumptions:

Advanced language technology for truly multilingual applications is a key challenge for computational linguistics

Treebanking and supervised learning have been successful for English (and some other languages), but may not be feasible for “smaller” languages

Parallel corpora can be used to transfer knowledge about linguistic relations across languages or to induce linguistic knowledge from data

Word alignments derived from simple models (GIZA++) can help to support this process

“Parallel-Text-based Optimization for Language learning: Exploiting Multilingual Alignment for the Induction Of Syntactic grammars”
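As an illustration of what "word alignments derived from simple models" means, the sketch below trains IBM Model 1 with EM on a toy parallel corpus. GIZA++ implements this model family among others, but the code here is an independent, simplified illustration and not GIZA++ itself: the function names `train_ibm_model1` and `align`, the three-sentence corpus, and the omission of the NULL word are all assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Bitext = List[Tuple[List[str], List[str]]]


def train_ibm_model1(bitext: Bitext, iterations: int = 10) -> Dict[Tuple[str, str], float]:
    """Estimate lexical translation probabilities t(target | source) with EM."""
    tgt_vocab = {t for _, tgt in bitext for t in tgt}
    # Start from a uniform distribution over target words.
    t_prob = defaultdict(lambda: 1.0 / len(tgt_vocab))
    for _ in range(iterations):
        count = defaultdict(float)   # expected count of (target, source) links
        total = defaultdict(float)   # expected count of each source word
        for src, tgt in bitext:
            for t in tgt:
                # E-step: distribute one unit of probability mass for `t`
                # over the source words of the same sentence pair.
                norm = sum(t_prob[(t, s)] for s in src)
                for s in src:
                    frac = t_prob[(t, s)] / norm
                    count[(t, s)] += frac
                    total[s] += frac
        # M-step: renormalise the expected counts per source word.
        t_prob = defaultdict(
            float, {(t, s): c / total[s] for (t, s), c in count.items()}
        )
    return t_prob


def align(src: List[str], tgt: List[str], t_prob) -> List[Tuple[int, int]]:
    """Link each target word to its most probable source word (source_idx, target_idx)."""
    return [
        (max(range(len(src)), key=lambda i: t_prob[(t, src[i])]), j)
        for j, t in enumerate(tgt)
    ]


# Toy corpus (invented for illustration).
toy_bitext = [
    (["das", "Haus"], ["the", "house"]),
    (["das", "Buch"], ["the", "book"]),
    (["ein", "Buch"], ["a", "book"]),
]
toy_model = train_ibm_model1(toy_bitext)
toy_links = align(["das", "Haus"], ["the", "house"], toy_model)
```

Links of this kind are the raw material for the transfer and induction steps named in the assumptions: once words are linked across languages, annotations available on one side can be projected to the other.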


PTOLEMAIOS

Funding: Emmy-Noether fellowship from DFG, P.I. Jonas Kuhn

Expected Duration: April 2005 – March 2009

Original Goal:

Induce grammars from parallel corpora (and evaluate them in isolation)

Revised Goal (since August ’05):

Evaluate grammars wrt. impact on MT performance

First Steps:

Use GIZA++-derived word alignment as filter to speed up parsing, several papers on suitable parsing algorithms

Use of LinearB’s SMT decoder on phrase-aligned EuroParl corpus

Planned Steps:

Explore the usefulness of syntactic analyses for phrase-based SMT

  word-based and syntax-based partial analyses are offered to decoder

  decoder can exploit syntax if useful, fall back to plain PBSMT if not

  optimal weight of syntactic dependencies can be determined empirically

Work on more languages (UN corpus in 6 languages, AC corpus)
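One way to read the planned step of determining the weight of syntactic dependencies empirically is as tuning one extra feature weight in a log-linear scoring function on held-out data. The sketch below is an assumption about how that could look, not the project's actual decoder: the feature names `pbsmt` and `syntax`, the per-candidate `quality` field (standing in for any automatic metric score), and the grid search are all hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Candidate = Dict[str, object]  # {"features": {...}, "quality": float}


def score(features: Dict[str, float], weights: Dict[str, float]) -> float:
    """Log-linear model score: weighted sum of feature values."""
    return sum(weights[name] * value for name, value in features.items())


def best_candidate(candidates: List[Candidate], weights: Dict[str, float]) -> Candidate:
    """Pick the candidate translation the model prefers under `weights`."""
    return max(candidates, key=lambda c: score(c["features"], weights))


def tune_syntax_weight(
    dev_set: List[List[Candidate]],
    base_weights: Dict[str, float],
    grid: Sequence[float],
) -> Tuple[float, float]:
    """Grid-search the syntax weight that maximises average dev-set quality.

    Put 0.0 first in `grid`: a weight of 0.0 corresponds to plain PBSMT,
    and ties are resolved in favour of the earlier grid value.
    """
    best_w, best_quality = None, float("-inf")
    for w in grid:
        weights = dict(base_weights, syntax=w)
        quality = sum(
            best_candidate(cands, weights)["quality"] for cands in dev_set
        ) / len(dev_set)
        if quality > best_quality:
            best_w, best_quality = w, quality
    return best_w, best_quality


# Toy dev set with invented numbers: the syntactically preferred candidate
# is also the better translation, so a positive syntax weight should win.
toy_dev = [[
    {"features": {"pbsmt": -1.0, "syntax": 0.0}, "quality": 0.4},
    {"features": {"pbsmt": -1.5, "syntax": 1.0}, "quality": 0.8},
]]
toy_weight, toy_quality = tune_syntax_weight(toy_dev, {"pbsmt": 1.0}, [0.0, 0.5, 1.0])
```

If the syntactic feature does not help on the held-out data, the search settles on weight 0.0, which matches the fallback to plain PBSMT described on the slide.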


EuroMatrix: current situation (joint work with Philipp Köhn and Chris Callison-Burch, Edinburgh)

MT systems per language pair (data taken from J.Hutchins’ Compendium of Translation Software, 10th Edition)


EuroMatrix: current situation

Most language pairs remain uncovered


EuroMatrix: SMT for many languages

EuroParl Corpus has been constructed to build statistical MT systems

Source: “Europarl: A Parallel Corpus for Statistical Machine Translation”, Philipp Köhn, MT Summit X, September 2005



Multilingual corpora can be aligned across all languages…




SMT systems derived from the corpora vary in quality




Difficulty of translation into and from a given language may differ widely…



EuroMatrix

Ideas:

For language pairs where rule-based MT and SMT based on parallel corpora exist, they should be integrated to exploit complementary strengths of both approaches

Parallel corpora can then be used in two ways: feeding the SMT sub-system and fine-tuning the integrated setup

For language pairs where only monolingual resources (lexicons, morphologies, taggers,…) and parallel corpora exist, transfer rules operating on linguistic representations should be derived from data

We need a generic framework that makes it possible to plug and play with different approaches (an open source MT toolbox)

Development of MT systems needs an open evaluation campaign, in the style of DARPA MTeval / ACL shared task


Conclusion

Machine translation performance can be enabled/boosted by parallel corpora

Current work just scratches the surface of what can be done

SMT systems for the languages of new member states should soon emerge from AC corpus

More parallel data for these languages would be desirable (100MW much better than 10MW!)

It would be very helpful to cooperate with teams from “new” countries for morphologies, taggers, parsers,…
