Machine Translation in Numbers
![Page 1: Machine Translation in Numbers](https://reader034.vdocuments.site/reader034/viewer/2022042611/58f1577b1a28ab5b018b45fd/html5/thumbnails/1.jpg)
TAUS - Portland, October 24, 2016
MMT
Machine Translation in Numbers
![Page 2: Machine Translation in Numbers](https://reader034.vdocuments.site/reader034/viewer/2022042611/58f1577b1a28ab5b018b45fd/html5/thumbnails/2.jpg)
Team
![Page 3: Machine Translation in Numbers](https://reader034.vdocuments.site/reader034/viewer/2022042611/58f1577b1a28ab5b018b45fd/html5/thumbnails/3.jpg)
Problems with current Open Source MT?
22-year-old idea (Brown, Della Pietra - 1994)
10-year-old implementation (Moses, JHU workshop 2006)
![Page 4: Machine Translation in Numbers](https://reader034.vdocuments.site/reader034/viewer/2022042611/58f1577b1a28ab5b018b45fd/html5/thumbnails/4.jpg)
Problems with current Open Source
Needs retraining to learn from new data
![Page 5: Machine Translation in Numbers](https://reader034.vdocuments.site/reader034/viewer/2022042611/58f1577b1a28ab5b018b45fd/html5/thumbnails/5.jpg)
Problems with current Open Source
Does not adapt to context
![Page 6: Machine Translation in Numbers](https://reader034.vdocuments.site/reader034/viewer/2022042611/58f1577b1a28ab5b018b45fd/html5/thumbnails/6.jpg)
Problems with current Open Source
Does not adapt to context
Today, you often end up with an absurd result:
More data = Lower Quality
![Page 7: Machine Translation in Numbers](https://reader034.vdocuments.site/reader034/viewer/2022042611/58f1577b1a28ab5b018b45fd/html5/thumbnails/7.jpg)
Welcome to MMT
● Incremental: Learns corrections in seconds.
● Adapts to context as you use it.
● No more initial training needed, like our old TMs :)
● Comes with data. Lots of data.
![Page 8: Machine Translation in Numbers](https://reader034.vdocuments.site/reader034/viewer/2022042611/58f1577b1a28ab5b018b45fd/html5/thumbnails/8.jpg)
One more thing...
It is Free and Open Source
![Page 9: Machine Translation in Numbers](https://reader034.vdocuments.site/reader034/viewer/2022042611/58f1577b1a28ab5b018b45fd/html5/thumbnails/9.jpg)
How does it work?
![Page 10: Machine Translation in Numbers](https://reader034.vdocuments.site/reader034/viewer/2022042611/58f1577b1a28ab5b018b45fd/html5/thumbnails/10.jpg)
Context Analyzer
Retrieves best matching TMs based on context similarity
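As a rough illustration of this retrieval step, the sketch below ranks translation memories by cosine similarity between bag-of-words vectors of the context and of each TM. The similarity measure, the `rank_tms` function and the toy TM contents are assumptions made for illustration; the slides do not say how MMT computes context similarity.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased, whitespace-tokenized term counts (a deliberately simple tokenizer)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_tms(context, tms, top_k=2):
    """Return the top_k (tm_name, score) pairs, best match first."""
    ctx = bag_of_words(context)
    scored = [(name, cosine(ctx, bag_of_words(text))) for name, text in tms.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical TMs, each reduced to a blob of source-side text.
TMS = {
    "payments": "send money to your account pay the invoice transfer the balance",
    "social": "connect with your network share a post update your profile",
    "legal": "the parties agree to the terms of this contract",
}

ranking = rank_tms("please pay the invoice from your account balance", TMS)
print(ranking)
```

The scores could then serve as per-TM weights for the rest of the engine, which is how the term "weights" is used in this sketch.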
![Page 11: Machine Translation in Numbers](https://reader034.vdocuments.site/reader034/viewer/2022042611/58f1577b1a28ab5b018b45fd/html5/thumbnails/11.jpg)
Indexing instead of Training
● TMs are indexed in a suffix array (SA)
● Phrase table is built on the fly by sampling from the SA
● Phrases from the TMs with the highest weights are sampled first
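The indexing idea can be sketched in a few lines. The code below builds a naive suffix array over the source side of several TMs, looks up a phrase by binary search, and returns matching segments from the highest-weighted TMs first. All names (`build_index`, `sample_translations`, the toy TMs and weights) are hypothetical, and the sketch returns whole target segments; a real phrase table would also need word alignment to extract target phrases, which is omitted here.

```python
SEP = "<eos>"  # sentinel between segments so matches never span two segments


def build_suffix_array(tokens):
    """Start positions of all suffixes, sorted lexicographically (naive build)."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def build_index(tms):
    """tms maps a TM name to a list of (source, target) segment pairs."""
    tokens, origin, targets = [], [], []
    for name, segments in tms.items():
        for source, target in segments:
            seg_id = len(targets)
            targets.append(target)
            for word in source.lower().split() + [SEP]:
                tokens.append(word)
                origin.append((name, seg_id))  # which TM and segment each token came from
    return {"tokens": tokens, "origin": origin, "targets": targets,
            "sa": build_suffix_array(tokens)}


def find_occurrences(tokens, sa, phrase):
    """Binary-search the suffix array for every position where phrase starts."""
    n = len(phrase)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose n-token prefix is >= phrase
        mid = (lo + hi) // 2
        if tokens[sa[mid]:sa[mid] + n] < phrase:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:  # first suffix whose n-token prefix is > phrase
        mid = (lo + hi) // 2
        if tokens[sa[mid]:sa[mid] + n] <= phrase:
            lo = mid + 1
        else:
            hi = mid
    return [sa[i] for i in range(start, lo)]


def sample_translations(index, phrase, tm_weights, max_samples=2):
    """Return up to max_samples (tm_name, target_segment) hits, highest-weighted TMs first."""
    positions = find_occurrences(index["tokens"], index["sa"], phrase.lower().split())
    hits = [index["origin"][p] for p in positions]
    hits.sort(key=lambda hit: tm_weights.get(hit[0], 0.0), reverse=True)
    return [(name, index["targets"][seg_id]) for name, seg_id in hits[:max_samples]]


# Hypothetical TMs and weights (the weights stand in for context-similarity scores).
TMS = {
    "payments": [("check your balance", "consulta tu saldo"),
                 ("your balance is low", "tu saldo es bajo")],
    "fitness": [("keep your balance", "no pierdas el equilibrio")],
}
index = build_index(TMS)
print(sample_translations(index, "your balance", {"payments": 0.9, "fitness": 0.1}))
```

Because nothing is precomputed beyond the index, adding a new segment only means extending the index, which is the property that makes retraining unnecessary in this sketch.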
![Page 12: Machine Translation in Numbers](https://reader034.vdocuments.site/reader034/viewer/2022042611/58f1577b1a28ab5b018b45fd/html5/thumbnails/12.jpg)
Adaptive Language Model
![Page 13: Machine Translation in Numbers](https://reader034.vdocuments.site/reader034/viewer/2022042611/58f1577b1a28ab5b018b45fd/html5/thumbnails/13.jpg)
Why is this different from Matecat or Lilt?
Learns for all users, not just one
![Page 14: Machine Translation in Numbers](https://reader034.vdocuments.site/reader034/viewer/2022042611/58f1577b1a28ab5b018b45fd/html5/thumbnails/14.jpg)
Why is this different from Matecat or Lilt?
Learns for all users, not just one
Uses context
![Page 15: Machine Translation in Numbers](https://reader034.vdocuments.site/reader034/viewer/2022042611/58f1577b1a28ab5b018b45fd/html5/thumbnails/15.jpg)
Quality - Using the TAUS Data Cloud
MS Translator HUB - commercial adaptive engine by Microsoft
[Chart; axis labels: Paypal 1, Paypal 2, Paypal 3, Linkedin 1]
![Page 16: Machine Translation in Numbers](https://reader034.vdocuments.site/reader034/viewer/2022042611/58f1577b1a28ab5b018b45fd/html5/thumbnails/16.jpg)
MS Translator Hub vs Modern MT
ModernMT - our adaptive and incremental solution
[Chart; axis labels: Paypal 1, Paypal 2, Paypal 3, Linkedin 1]
![Page 17: Machine Translation in Numbers](https://reader034.vdocuments.site/reader034/viewer/2022042611/58f1577b1a28ab5b018b45fd/html5/thumbnails/17.jpg)
Initial Setup
MMT - 3 hours ($3 AWS)
Moses - 30 hours ($30 AWS)
Neural MT - 300 hours ($300 AWS)
100M parallel words, 1B monolingual, $1 / hour AWS
![Page 18: Machine Translation in Numbers](https://reader034.vdocuments.site/reader034/viewer/2022042611/58f1577b1a28ab5b018b45fd/html5/thumbnails/18.jpg)
Translation speed
MMT - 855 w/s
Moses - 455 w/s
Neural MT - 409 w/s
100M parallel words, 1B monolingual, $1 / hour AWS
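The setup and speed figures reduce to simple arithmetic. The sketch below recomputes setup cost from hours at the stated $1 / hour rate and derives relative throughput. Pairing each figure with a system follows the order in which they appear on the slides, which is an assumption, and the one-million-word workload is a hypothetical example.

```python
AWS_RATE_USD_PER_HOUR = 1.0  # rate stated on the slides

# Figures as read from the slides; the system-to-figure pairing is assumed from slide order.
SETUP_HOURS = {"MMT": 3, "Moses": 30, "Neural MT": 300}
SPEED_WORDS_PER_SEC = {"MMT": 855, "Moses": 455, "Neural MT": 409}


def setup_cost_usd(system):
    """Initial setup cost: hours of machine time multiplied by the hourly rate."""
    return SETUP_HOURS[system] * AWS_RATE_USD_PER_HOUR


def hours_to_translate(system, words):
    """Machine hours needed to translate `words` words at the reported speed."""
    return words / SPEED_WORDS_PER_SEC[system] / 3600


def speedup(system, baseline):
    """How many times faster `system` translates than `baseline`."""
    return SPEED_WORDS_PER_SEC[system] / SPEED_WORDS_PER_SEC[baseline]


WORKLOAD_WORDS = 1_000_000  # hypothetical job size

for name in SETUP_HOURS:
    print(f"{name}: setup ${setup_cost_usd(name):.0f}, "
          f"{hours_to_translate(name, WORKLOAD_WORDS):.2f} h per 1M words, "
          f"{speedup(name, 'Moses'):.2f}x Moses speed")
```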
![Page 19: Machine Translation in Numbers](https://reader034.vdocuments.site/reader034/viewer/2022042611/58f1577b1a28ab5b018b45fd/html5/thumbnails/19.jpg)
Marco, stop talking: it’s Jaap time.
![Page 20: Machine Translation in Numbers](https://reader034.vdocuments.site/reader034/viewer/2022042611/58f1577b1a28ab5b018b45fd/html5/thumbnails/20.jpg)
TAUS Data Cloud
● Largest industry-shared repository of translation data
● A neutral and secure repository platform for
○ Sharing/pooling translation data based on a reciprocity model
○ Searching domain-specific or general data
○ Leveraging Translation Data
● Solid legal framework established by 45 founding members
● Addresses the shortage of available in-domain parallel data from the industry
● September 2016: 72B+ words in the repository
● 10M to 100M words per ModernMT language pair
![Page 21: Machine Translation in Numbers](https://reader034.vdocuments.site/reader034/viewer/2022042611/58f1577b1a28ab5b018b45fd/html5/thumbnails/21.jpg)
![Page 22: Machine Translation in Numbers](https://reader034.vdocuments.site/reader034/viewer/2022042611/58f1577b1a28ab5b018b45fd/html5/thumbnails/22.jpg)
Collecting from the Web - Hard!
● The Web is large - even the so-called Surface or Indexable Web
● The Web is messy
● The Web is constantly in flux
● Not many organizations crawl the entire indexable web
○ Google - about 49B web pages in index (Source: www.worldwidewebsize.com)
○ Microsoft - about 20B web pages in index (Source: www.worldwidewebsize.com)
● Other crawls are focused: they cover a subset selected by specific criteria or goals
● Still hard for the same reasons
![Page 23: Machine Translation in Numbers](https://reader034.vdocuments.site/reader034/viewer/2022042611/58f1577b1a28ab5b018b45fd/html5/thumbnails/23.jpg)
Common Crawl Comes to the Rescue
● Commoncrawl.org
○ “CommonCrawl is a 501(c)(3) non-profit organization dedicated to providing a copy of the internet to internet researchers, companies and individuals at no cost for the purpose of research and analysis.”
● On average 1.5B unique URLs per crawl
● A very good resource for sourcing bilingual and monolingual data for machine translation purposes
○ Prototype developed by academic developers in 2012/2013 showed potential to mine parallel corpora with millions of source words
![Page 24: Machine Translation in Numbers](https://reader034.vdocuments.site/reader034/viewer/2022042611/58f1577b1a28ab5b018b45fd/html5/thumbnails/24.jpg)
Common Crawl Comes to the Rescue
● Implemented data collection pipeline based on prototype techniques
● Collecting monolingual and bilingual data
● Open sourced at https://github.com/ModernMT/DataCollection
● We are making the indices of parallel pages we discover available
○ Saves running half of the data collection pipeline
○ Each user still has to download their own data
● Avoids potential copyright issues
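One common heuristic for discovering candidate parallel pages in a crawl is to match URLs that differ only in a language code. The sketch below illustrates that idea on a hypothetical list of URLs. It is an assumption made for illustration and not a description of what the ModernMT/DataCollection pipeline actually does; the language list, regular expression and function names are all invented here.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit

LANG_CODES = {"en", "it", "fr", "de", "es"}  # hypothetical subset of languages

# A two-letter token delimited by URL separators, e.g. /en/ or ?lang=en
TOKEN = re.compile(r"(?<=[/=_\-.])([a-z]{2})(?=[/&_\-.]|$)")


def language_key(url):
    """Return (url key with the language code masked, language code), or None."""
    parts = urlsplit(url.lower())
    # Only the path and query are inspected, so country-code domains are left alone.
    rest = parts.path + ("?" + parts.query if parts.query else "")
    for match in TOKEN.finditer(rest):
        if match.group(1) in LANG_CODES:
            key = parts.netloc + rest[:match.start()] + "*" + rest[match.end():]
            return key, match.group(1)
    return None


def candidate_pairs(urls, source_lang, target_lang):
    """Pair URLs that are identical once the language code is masked."""
    groups = defaultdict(dict)
    for url in urls:
        result = language_key(url)
        if result is not None:
            key, lang = result
            groups[key].setdefault(lang, url)
    return [(by_lang[source_lang], by_lang[target_lang])
            for _, by_lang in sorted(groups.items())
            if source_lang in by_lang and target_lang in by_lang]


# Hypothetical crawl excerpt.
URLS = [
    "http://example.com/en/products/",
    "http://example.com/it/products/",
    "http://example.com/en/contact",
    "http://example.org/page?lang=en",
    "http://example.org/page?lang=it",
    "http://example.com/blog/2016/",
]

pairs = candidate_pairs(URLS, "en", "it")
for source_url, target_url in pairs:
    print(source_url, "<->", target_url)
```

A list of URL pairs like this is comparable in spirit to a shared index of parallel pages: each user would still need to download the pages, and steps such as language identification and sentence alignment would still have to run on the downloaded text.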
![Page 25: Machine Translation in Numbers](https://reader034.vdocuments.site/reader034/viewer/2022042611/58f1577b1a28ab5b018b45fd/html5/thumbnails/25.jpg)
Parallel Data Stats
![Page 26: Machine Translation in Numbers](https://reader034.vdocuments.site/reader034/viewer/2022042611/58f1577b1a28ab5b018b45fd/html5/thumbnails/26.jpg)
Monolingual Data
![Page 27: Machine Translation in Numbers](https://reader034.vdocuments.site/reader034/viewer/2022042611/58f1577b1a28ab5b018b45fd/html5/thumbnails/27.jpg)
What’s next
● Release 0.14 - Next Week
○ Planned for AMTA next week. 45 languages supported, adding incremental learning.
● Baseline engines and data - 3 months
○ Finish the crawling and legal activity to release the data for the baseline engines.
● Neural MT - 12 months
○ Engineering effort to make it cost-effective, incremental and context-aware. Included by default in MMT.
![Page 28: Machine Translation in Numbers](https://reader034.vdocuments.site/reader034/viewer/2022042611/58f1577b1a28ab5b018b45fd/html5/thumbnails/28.jpg)
How to contribute
● Do you want to use MMT? Provide Feedback (it is on GitHub).
● Do you want your engineers to contribute to the project?
● Do you want to add your data to the TAUS Data Cloud and help share baseline engines?
![Page 29: Machine Translation in Numbers](https://reader034.vdocuments.site/reader034/viewer/2022042611/58f1577b1a28ab5b018b45fd/html5/thumbnails/29.jpg)
Thank you
https://github.com/ModernMT/MMT