PANACEA presentation - Pangeanic - Budapest
DESCRIPTION
Presentation on the history of MT and how language resources have helped to develop MT (particularly statistical MT), with an emphasis on Pangeanic's experience
TRANSCRIPT
PangeaMT – putting open standards to work… well
Manuel Herranz
Chomsky: Imagine that if in the future large enough amounts of data existed, they could be processed by computers with enough computing power
rule-based systems, IBM licenses, many linked to patent EN/RU & Intel
First statistical papers
1st Open source SMT
Translation industry appropriating Moses
http://euromatrixplus.net/moses
DIY SMT
http://t.co/HDTboxQ
PEAK of Cold War and information control. Products & information directed to consumers / users / citizens
BEGINNING of data resources. Internet. Accessibility to information
Content generated BY USERS / CONSUMERS / CITIZENS, multilingual, free information exchange across the world
2007/08
2009/10
2011/12
• DIY SMT
• Empower Users
• Glossary
• Automated re-training
• Transfer architecture and know-how to users
• Compatibility with commercial formats (ttx, sdlxliff, itd)
2007 and before
• RB tests with commercial software
• Insufficiently good output
• Only internal production
• EU Post-Editing Award
• V1: Small data sets (2-5M words), automotive & electronics (ES), then FR/IT/DE in other fields
• Division born
• 100s of engine trials and language combinations
• Open-Source to commercial
• TMX / XLIFF workflows
As of May 2009: 487 billion gigabytes, or 1,000,000,000 × 487,000,000,000 = 4.87 × 10^20 bytes
Estimates
Up 50% a year (Oracle)
Doubles every 11 hours (IBM)
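The May 2009 figure can be checked with a few lines of Python. This is only a sketch: it assumes decimal gigabytes (10^9 bytes) and reads the "up 50% a year" estimate as simple yearly compounding. The helper name `projected_bytes` and the five-year horizon are illustrative and do not come from the slides.

```python
GIGABYTE = 10**9                       # assumption: decimal gigabytes
gigabytes_2009 = 487 * 10**9           # 487 billion gigabytes (May 2009)
total_bytes_2009 = gigabytes_2009 * GIGABYTE


def projected_bytes(start_bytes: int, yearly_growth_pct: int, years: int) -> int:
    """Compound start_bytes by yearly_growth_pct percent per year (hypothetical helper)."""
    value = start_bytes
    for _ in range(years):
        value = value * (100 + yearly_growth_pct) // 100
    return value


if __name__ == "__main__":
    print(f"May 2009: {total_bytes_2009:.2e} bytes")
    # Illustrative projection only, under the 50%-a-year reading.
    for offset in range(1, 6):
        print(2009 + offset, f"{projected_bytes(total_bytes_2009, 50, offset):.2e}")
```

Under the 50%-a-year assumption the volume grows roughly 7.6-fold in five years, which gives a sense of why data resources moved to the centre of MT development.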
OBJECTIVES = CHALLENGES
Turn academic development (Moses) into a commercial application.
Provide high-quality MT for post-editing to save time and cost. Not Google-type broad translation, but domain-specific and user-centric.
Lower the entry level for MT. Bring democracy and affordability to MT. Bring it to the user, take it away from the programmer.
How? By fostering open-standard geared translation automation strategies
Use only community-based open standards (OASIS / ISO: XLIFF, TMX, XML). NO proprietary formats (technology independence), so USERS are not “locked in” to buying and updating expensive software.
LR… The plight
• TAUS TDA, millions of words
• Own data
• Sony (client donation)
• (Manual) data gathering & alignment
• Manual cleaning until some tools developed – limited SME resources
• Resources getting smaller soon – need to build more
• HELP!!!!
The rush for data
Soon realised that there was a rush to gather data but that other resources around data were necessary
cleaning
More cleaning
The rush for data (clean)
cleaning
More cleaning
<tu srclang="en-GB">
<tuv xml:lang="EN-GB">
<seg>A system for recovering the methane that is emitted from the manure so that
it does not leak into the atmosphere.</seg>
</tuv>
<tuv xml:lang="FR-FR">
<seg>Système permettant de r€ pérer le méthane qui se dégage de l'engrais naturel
d'origine animale de sorte qu'il ne se dissipe pas dans l'atmosphère.</seg>
</tuv>
<tu creationdate="20090817T114430Z" creationid="APIACCESS"
changedate="20110617T141159Z" changeid=“pat">
<tuv xml:lang="EN-US">
<seg>Overall heigtht –<bpt i="1">{\f43 </bpt> <ept i="1">}</ept>25"; width –
<bpt i="2">{\f43 </bpt> <ept i="2">}</ept>20.1".</seg>
</tuv>
<tuv xml:lang="ES-EM">
<seg><bpt i="1">{\f2 </bpt>Altura total - 25"; anchura <ept i="1">}</ept>–
<bpt i="2">{\f43 </bpt> <ept i="2">}</ept><bpt i="3">{\f2 </bpt>20,1".<ept
i="3">}</ept></seg>
</tuv>
</tu>
<tuv xml:lang=“EN-US">
<seg>On 22nd May we decided not to join the group.</seg>
<tuv xml:lang=“DE-DE">
<seg>Am 22. </seg>
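The three dirty translation units above (encoding debris, leftover formatting tags with a bad locale code, and a truncated target) hint at the kind of checks a cleaning tool can automate. The following is a minimal sketch in Python, not PangeaMT's actual tooling: the names `KNOWN_LOCALES`, `strip_inline_tags` and `check_unit`, the tiny locale allow-list and the length-ratio threshold of 3.0 are all assumptions chosen for illustration. It works on already-extracted segment strings, because units like these may not even parse as XML.

```python
import re

# TMX inline elements such as <bpt i="1">{\f43 </bpt> or <ept i="1">}</ept>.
INLINE_TAG = re.compile(r"<(bpt|ept|ph|it)\b[^>]*>.*?</\1>", re.DOTALL)
# A currency sign or replacement character glued to a letter, as in "r€ pérer".
DEBRIS = re.compile(r"[^\W\d_][€£¥\ufffd]")
# Illustrative allow-list only; a real tool would check full language/region lists.
KNOWN_LOCALES = {"EN-GB", "EN-US", "FR-FR", "DE-DE", "ES-ES"}


def strip_inline_tags(segment: str) -> str:
    """Remove inline formatting elements and collapse whitespace."""
    text = INLINE_TAG.sub("", segment)
    return re.sub(r"\s+", " ", text).strip()


def check_unit(src_lang, source, tgt_lang, target, max_ratio=3.0):
    """Return a list of problems found in one source/target pair."""
    problems = []
    for lang in (src_lang, tgt_lang):
        if lang.upper() not in KNOWN_LOCALES:
            problems.append(f"unknown locale: {lang}")
    src, tgt = strip_inline_tags(source), strip_inline_tags(target)
    if DEBRIS.search(src) or DEBRIS.search(tgt):
        problems.append("encoding debris")
    longer, shorter = max(len(src), len(tgt)), min(len(src), len(tgt))
    if shorter == 0 or longer / shorter > max_ratio:
        problems.append("length mismatch")
    return problems


if __name__ == "__main__":
    print(check_unit("EN-US", "On 22nd May we decided not to join the group.",
                     "DE-DE", "Am 22. "))
```

Units that return a non-empty list would be dropped or queued for manual review; the thresholds need tuning per language pair, since legitimate length ratios differ between, say, EN/DE and EN/ZH.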
LR… The plight
• MORE DATA!!!
• Domain specific
• Monolingual (for LM) and parallel crawling
• Corpus normalization
• (Semi) automated PoS tagging
• Self-generation of similar texts for morphologically-rich languages
- “on the fly” SMT training (minutes / hours, not manually) –April 2011 !!
- pick and match sets of data: “extreme customization” –April 2011 !!
- online, user-customizable glossaries, DNTs, expressions → “predictive SMT” – May 2011 !!
- objective stats for post-editors (calculate effort)
- confidence scores for users (→ translators or readers) with CAT integration (web-based / desktop)…
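One common way to give post-editors an objective effort figure is the word-level edit distance between the raw MT output and the post-edited text, normalised by the length of the post-edited text. The sketch below illustrates that idea only; it is not confirmed to be the metric PangeaMT uses, it ignores the block shifts that TER-style metrics also count, and the function names are hypothetical.

```python
def word_edit_distance(hyp: list, ref: list) -> int:
    """Levenshtein distance over word tokens (insertions, deletions, substitutions)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a word from the MT output
                            curr[j - 1] + 1,      # insert a word from the post-edit
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]


def post_edit_effort(mt_output: str, post_edited: str) -> float:
    """Edits per post-edited word: 0.0 means the MT output was left untouched."""
    hyp, ref = mt_output.split(), post_edited.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    return word_edit_distance(hyp, ref) / len(ref)


if __name__ == "__main__":
    print(post_edit_effort("the cat sat on mat", "the cat sat on the mat"))
```

Averaged over a job, a score like this lets a project manager compare engines or estimate post-editing time, though whitespace tokenisation is a simplification for languages without word spacing.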
- API integration + user domain building
- Audiovisual integration
- Release the code to users → create a community and flavours for each situation; hybridize and create rules
- Have more and more companies and institutions use PangeaMT as their platform and make it grow
[Chart, timeline 2009 to 2018: User empowerment]
YEAR 2016: 1,000s of customized MT systems
Predictions
Tech. not the realm of a few providers
[Chart, timeline 2009 to 2018: MT acceptance, User empowerment]
• MT acceptance growth.
• Translator engagement challenge
• Need for data has been addressed – still more work to be done.
• Users and practitioners now can build their own systems.
Until 2011
YEAR 2016: 1,000s of customized MT systems
In 5 years... after 2016
Predictions
• Combinations??
• Supra-engines??
• World-knowledge??
…...suggestions....???
Tech. not the realm of a few providers