panacea presentation - pangeanic - budapest

PangeaMT putting open standards to work… well Manuel Herranz

Upload: manuel-herranz

Post on 04-Jul-2015

DESCRIPTION

Presentation on the history of MT and how language resources have helped to develop MT (particularly statistical MT), with an emphasis on Pangeanic's experience.

TRANSCRIPT

Page 1: Panacea presentation - Pangeanic - Budapest

PangeaMT – putting open standards to work… well

Manuel Herranz

Page 2: Panacea presentation - Pangeanic - Budapest

Chomsky: Imagine that, if large enough amounts of data existed in the future, they could be processed by computers with enough computing power

rule-based systems, IBM licenses, many linked to patent EN/RU & Intel

First statistical papers

1st Open source SMT

Translation industry appropriating Moses: http://euromatrixplus.net/moses

DIY SMT

http://t.co/HDTboxQ

Page 3: Panacea presentation - Pangeanic - Budapest

PEAK of Cold War and information control. Products & information directed to consumers / users / citizens

BEGINNING of data resources. Internet. Accessibility to information

Content generated BY USERS / CONSUMERS / CITIZENS, multilingual, free information exchange across the world

Page 4: Panacea presentation - Pangeanic - Budapest

2007/08

2009/10

2011/12

• DIY SMT

• Empower Users

• Glossary

• Automated re-training

• Transfer architecture and know-how to users

• Compatibility with commercial formats (ttx, sdlxliff, itd)

2007 and before

• RB tests with commercial software

• Insufficiently good output

• Only internal production

• EU Post-Editing Award

• V1: Small data sets (2-5M words), automotive & electronics

• (ES), then Fr/It/De in other fields

• Division born

• 00's of engine trials and language combinations

• Open-Source to commercial

• TMX / XLIFF workflows

As of May 2009: 487 billion gigabytes, or 1,000,000,000 × 487,000,000,000 = 4.87 × 10^20 bytes

Estimates: up 50% a year (Oracle); doubles every 11 hours (IBM)

Page 5: Panacea presentation - Pangeanic - Budapest

OBJECTIVES = CHALLENGES

Turn academic development (Moses) into a commercial application.

Provide high-quality MT for post-editing to save time and cost. No Google-type broad translation, but domain-specific and user-centric.

Lower the entry level for MT. Bring democracy and affordability to MT. Bring it to the user, take it away from the programmer.

How? By fostering open-standard geared translation automation strategies

Use only community-based open standards (OASIS / ISO: XLIFF / TMX, XML). NO proprietary formats (technology independence), so USERS are not “locked in” to buying and updating expensive software.
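As an illustration of why an open format such as TMX avoids lock-in, the sketch below reads translation units with nothing but the Python standard library. It is a minimal example, not PangeaMT code: the sample file content, the function name `read_tmx_pairs` and the choice to flatten inline tags into plain text are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical, minimal TMX document used only for this example.
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Open the file.</seg></tuv>
      <tuv xml:lang="es"><seg>Abra el archivo.</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def read_tmx_pairs(tmx_text, src_lang, tgt_lang):
    """Return (source, target) segment pairs for the two language codes."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            # Older TMX files may use a plain "lang" attribute instead.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also returns text held inside inline tags.
                segments[lang] = "".join(seg.itertext()).strip()
        if src_lang in segments and tgt_lang in segments:
            pairs.append((segments[src_lang], segments[tgt_lang]))
    return pairs


pairs = read_tmx_pairs(SAMPLE_TMX, "en", "es")
```

Because the format is plain XML, the same pairs could be extracted with any XML library in any language, which is the point of technology independence.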

Page 6: Panacea presentation - Pangeanic - Budapest

LR… The plight

• TAUS TDA, millions of words

• Own data

• Sony (client donation)

• (Manual) data gathering & alignment

• Manual cleaning until some tools developed – limited SME resources

• Resources getting smaller soon – need to build more

• HELP!!!!

Page 7: Panacea presentation - Pangeanic - Budapest

The rush for data

We soon realised that there was a rush to gather data, but that other resources around the data were also necessary

cleaning

More cleaning

Page 8: Panacea presentation - Pangeanic - Budapest

The rush for data (clean)

cleaning

More cleaning

<tu srclang="en-GB">

<tuv xml:lang="EN-GB">

<seg>A system for recovering the methane that is emitted from the manure so that

it does not leak into the atmosphere.</seg>

</tuv>

<tuv xml:lang="FR-FR">

<seg>Système permettant de r€ pérer le méthane qui se dégage de l'engrais naturel

d'origine animale de sorte qu'il ne se dissipe pas dans l'atmosphère.</seg>

</tuv>

<tu creationdate="20090817T114430Z" creationid="APIACCESS"

changedate="20110617T141159Z" changeid=“pat">

<tuv xml:lang="EN-US">

<seg>Overall heigtht –<bpt i="1">{\f43 </bpt> <ept i="1">}</ept>25&quot;; width –

<bpt i="2">{\f43 </bpt> <ept i="2">}</ept>20.1&quot;.</seg>

</tuv>

<tuv xml:lang="ES-EM">

<seg><bpt i="1">{\f2 </bpt>Altura total - 25&quot;; anchura <ept i="1">}</ept>–

<bpt i="2">{\f43 </bpt> <ept i="2">}</ept><bpt i="3">{\f2 </bpt>20,1&quot;.<ept

i="3">}</ept></seg>

</tuv>

</tu>

<tuv xml:lang=“EN-US">

<seg>On 22nd May we decided not to join the group.</seg>

<tuv xml:lang=“DE-DE">

<seg>Am 22. </seg>
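The samples above show three typical problems: corrupted characters, leftover formatting codes inside segments, and a target cut short at the first full stop. A minimal sketch of automated checks for these cases follows. The function name `flag_pair`, the regular expressions and the length-ratio threshold are illustrative assumptions, not the cleaning tools referred to in the slides.

```python
import re

# Leftover RTF-style formatting codes such as "{\f43 " (assumed pattern).
RTF_CODE = re.compile(r"\{\\[a-z]+\d*")
# Signs of encoding damage: the Unicode replacement character, or a euro
# sign glued to the end of a letter (assumed heuristic).
ENCODING_DAMAGE = re.compile(r"\ufffd|[^\W\d_]€")


def flag_pair(src, tgt, max_ratio=3.0):
    """Return a list of issue labels for one source/target segment pair."""
    s, t = src.strip(), tgt.strip()
    if not s or not t:
        return ["empty"]
    issues = []
    # A large length mismatch often means a truncated or misaligned segment.
    ratio = max(len(s), len(t)) / min(len(s), len(t))
    if ratio > max_ratio:
        issues.append("length_ratio")
    if RTF_CODE.search(s) or RTF_CODE.search(t):
        issues.append("formatting_codes")
    if ENCODING_DAMAGE.search(s) or ENCODING_DAMAGE.search(t):
        issues.append("encoding")
    return issues


truncated = flag_pair("On 22nd May we decided not to join the group.", "Am 22.")
tagged = flag_pair('Overall height {\\f43 }25"', 'Altura total {\\f2 }25"')
garbled = flag_pair("recovering", "r€ pérer")
```

Flagged pairs would normally be sent for review or dropped before training rather than repaired automatically.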

Page 9: Panacea presentation - Pangeanic - Budapest

LR… The plight

• MORE DATA!!!

• Domain specific

• Monolingual (for LM) and parallel crawling

• Corpus normalization

• (Semi) automated PoS tagging

• Self-generation of similar texts for morphologically-rich languages
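As a rough idea of what corpus normalization can involve, the sketch below applies three common steps to a segment: Unicode composition, quote unification and whitespace collapsing. The function name `normalize_segment` and the choice of steps are assumptions for illustration; the slides do not specify which normalization rules were used.

```python
import re
import unicodedata

# Map typographic quotes to their plain ASCII equivalents.
QUOTE_MAP = str.maketrans({"“": '"', "”": '"', "‘": "'", "’": "'"})


def normalize_segment(text):
    """Normalize one segment so equivalent strings compare as equal."""
    # Compose characters, e.g. "e" + combining acute accent becomes "é".
    text = unicodedata.normalize("NFC", text)
    text = text.translate(QUOTE_MAP)
    # Collapse runs of whitespace (including non-breaking spaces).
    text = re.sub(r"\s+", " ", text)
    return text.strip()


cleaned = normalize_segment("  “Altura   total”\u00a0- 25 ")
```

Applying the same function to both sides of a parallel corpus and to the monolingual data keeps the translation model and the language model consistent with each other.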

Page 10: Panacea presentation - Pangeanic - Budapest

- “on the fly” SMT training (minutes / hours, not manually) – April 2011 !!

- pick and match sets of data: “extreme customization” – April 2011 !!

- online, user-customizable glossaries, DNTs, expressions → “predictive SMT” – May 2011 !!

- objective stats for post-editors (calculate effort)

- confidence scores for users (→ translators or readers) with CAT integration (web-based / desktop)…
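One way to obtain an objective post-editing statistic is to compare the raw MT output with the post-edited version. The sketch below uses a word-level edit distance divided by the length of the post-edited segment. This metric is an assumption chosen for illustration (similar in spirit to TER, but without block shifts); the slides do not say which measure PangeaMT uses, and the function names are hypothetical.

```python
def word_edit_distance(a, b):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        curr = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            curr.append(min(prev[j] + 1,          # delete word_a
                            curr[j - 1] + 1,      # insert word_b
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]


def post_edit_effort(mt_output, post_edited):
    """Word edits per post-edited word; 0.0 means no changes were made."""
    mt, pe = mt_output.split(), post_edited.split()
    if not pe:
        return 0.0 if not mt else 1.0
    return word_edit_distance(mt, pe) / len(pe)


effort = post_edit_effort("the cat sat on mat", "the cat sat on the mat")
```

Averaged over a job, a score like this gives post-editors and project managers a comparable figure across engines and domains.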

Page 11: Panacea presentation - Pangeanic - Budapest

- API integration + user domain building

- Audiovisual integration

- Release the code to users → create a community and flavours for each situation; hybridize and create rules

- Have more and more companies and institutions use PangeaMT as their platform and make it grow

Page 12: Panacea presentation - Pangeanic - Budapest

[Chart: timeline 2009–2018, with labels “User empowerment”, “YEAR 2016” and “000's of customized MT systems”]

Predictions

Tech. not the realm of a few providers

Page 13: Panacea presentation - Pangeanic - Budapest

[Chart: timeline; only the axis years 2009, 2010 and 2018 survive]

Page 14: Panacea presentation - Pangeanic - Budapest

[Chart: timeline 2009–2018, with labels “MT acceptance” and “User empowerment”]

• MT acceptance growth.

• Translator engagement challenge

• Need for data has been addressed – still more work to be done.

• Users and practitioners now can build their own systems.

Until 2011

[Chart labels: “YEAR 2016”, “000's of customized MT systems”]

In 5 years... after 2016

Predictions

• Combinations??

• Supra-engines??

• World-knowledge??

…...suggestions....???

Tech. not the realm of a few providers

Page 15: Panacea presentation - Pangeanic - Budapest

QUESTIONS ?

[email protected]