machine translation in numbers

TAUS - Portland, October 24, 2016 MMT Machine Translation in Numbers

Upload: taus-enabling-better-translation

Post on 15-Apr-2017

TRANSCRIPT

Page 1: Machine Translation in Numbers

TAUS - Portland, October 24, 2016

MMT

Machine Translation in Numbers

Page 2: Machine Translation in Numbers

Team

Page 3: Machine Translation in Numbers

Problems with current Open Source MT?

22-year-old idea (Brown, Della Pietra - 1994)

10-year-old implementation (Moses, JHU workshop 2006)

Page 4: Machine Translation in Numbers

Problems with current Open Source

Need re-training to learn from new data

Page 5: Machine Translation in Numbers

Problems with current Open Source

Does not adapt to context

Page 6: Machine Translation in Numbers

Problems with current Open Source

Does not adapt to context

Today, you often end up with an absurd result:

More data = Lower Quality

Page 7: Machine Translation in Numbers

Welcome to MMT

● Incremental: Learns corrections in seconds.

● Adapts to context as you use it.

● No more initial training needed, like our old TMs :)

● Comes with data. Lots of data.

Page 8: Machine Translation in Numbers

One more thing...

It is Free and Open Source

Page 9: Machine Translation in Numbers

How does it work?

Page 10: Machine Translation in Numbers

Context Analyzer

Retrieves best matching TMs based on context similarity
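
To illustrate the idea of ranking TMs by context similarity, here is a minimal sketch. It assumes a TF-IDF bag-of-words representation and cosine similarity, which the slides do not specify; the function name `rank_tms` and the sample TM contents are hypothetical and are not part of MMT.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; a real system would normalize far more.
    return text.lower().split()


def rank_tms(context, tms):
    """Return (tm_name, score) pairs, best match first.

    `tms` maps a TM name to a sample of its source-side text.
    Scoring is TF-IDF cosine similarity (an assumption for this sketch).
    """
    docs = {name: Counter(tokenize(text)) for name, text in tms.items()}
    n_docs = len(docs)
    # Document frequency: in how many TMs each term appears.
    df = Counter(term for counts in docs.values() for term in counts)

    def weigh(counts):
        # Smoothed IDF, so a term found in every TM still gets some weight.
        return {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
                for t, c in counts.items()}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    query = weigh(Counter(tokenize(context)))
    scores = {name: cosine(query, weigh(counts)) for name, counts in docs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Calling `rank_tms("refund the payment to my account", tms)` with one payments-related TM and one jobs-related TM would place the payments TM first, and the resulting scores could then serve as per-TM weights.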

Page 11: Machine Translation in Numbers

Indexed instead of Training

● Suffix array indexed with TMs

● Phrase table is built on the fly by sampling from the SA

● Phrases of TMs with highest weights sampled first
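
The indexing approach above can be sketched roughly as follows. This is a toy illustration, not MMT's implementation: the class name `SuffixArrayIndex`, the in-memory layout, and the sample weights are assumptions, and a real phrase table would also need word alignments to extract target phrases, which are omitted here.

```python
from bisect import bisect_left


class SuffixArrayIndex:
    """Toy token-level suffix array over the source side of several TMs."""

    def __init__(self, pairs):
        # pairs: list of (tm_name, source_sentence, target_sentence)
        self.pairs = [(tm, src.split(), tgt) for tm, src, tgt in pairs]
        # One entry per token position, sorted by the suffix starting there.
        # Storing whole suffixes is wasteful but keeps the sketch short.
        suffixes = sorted(
            (tuple(src[pos:]), sent_id)
            for sent_id, (_, src, _) in enumerate(self.pairs)
            for pos in range(len(src))
        )
        self.keys = [key for key, _ in suffixes]
        self.sent_ids = [sent_id for _, sent_id in suffixes]

    def occurrences(self, phrase):
        """Return ids of sentences whose source side contains `phrase`."""
        phrase = tuple(phrase.split())
        n = len(phrase)
        # All suffixes that start with `phrase` form one contiguous block.
        i = bisect_left(self.keys, phrase)
        hits = []
        while i < len(self.keys) and self.keys[i][:n] == phrase:
            hits.append(self.sent_ids[i])
            i += 1
        # Drop duplicates (phrase repeated in one sentence), keep order.
        return list(dict.fromkeys(hits))

    def sample(self, phrase, tm_weights, limit=2):
        """Return up to `limit` (tm_name, target) pairs, highest-weight TMs first."""
        hits = self.occurrences(phrase)
        hits.sort(key=lambda sid: tm_weights.get(self.pairs[sid][0], 0.0),
                  reverse=True)
        return [(self.pairs[sid][0], self.pairs[sid][2]) for sid in hits[:limit]]
```

Because nothing is precomputed beyond the sorted suffixes, adding a new sentence pair only requires updating the index, not retraining a model; here the weights passed to `sample` stand in for the scores produced by the context analyzer.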

Page 12: Machine Translation in Numbers

Adaptive Language Model

Page 13: Machine Translation in Numbers

Why is this different from Matecat or Lilt?

Learns for all users, not just one

Page 14: Machine Translation in Numbers

Why is this different from Matecat or Lilt?

Learns for all users, not just one

Uses context

Page 15: Machine Translation in Numbers

Quality - Using the TAUS Data Cloud

MS Translator Hub - commercial adaptive engine by Microsoft

[Chart not reproduced. Axis labels: Paypal 1, Paypal 2, Paypal 3, Linkedin 1, Paypal 1, Paypal 2, Paypal 3]

Page 16: Machine Translation in Numbers

MS Translator Hub vs ModernMT

ModernMT - our adaptive and incremental solution

[Chart not reproduced. Axis labels: Paypal 1, Paypal 2, Paypal 3, Linkedin 1, Paypal 1, Paypal 2, Paypal 3]

Page 17: Machine Translation in Numbers

Initial Setup

● MMT: 3 hours ($3 AWS)

● Moses: 30 hours ($30 AWS)

● Neural MT: 300 hours ($300 AWS)

Test conditions: 100M parallel words, 1B monolingual words, $1/hour on AWS

Page 18: Machine Translation in Numbers

Translation speed

● MMT: 855 w/s

● Moses: 455 w/s

● Neural MT: 409 w/s

Test conditions: 100M parallel words, 1B monolingual words, $1/hour on AWS

Page 19: Machine Translation in Numbers

Marco, stop talking; it’s Jaap time.

Page 20: Machine Translation in Numbers

TAUS Data Cloud

● Largest industry-shared repository of translation data

● A neutral and secure repository platform for

○ Sharing/pooling translation data based on a reciprocity model

○ Searching domain-specific or general data

○ Leveraging Translation Data

● Solid legal framework established by 45 founding members

● Addresses the shortage of available in-domain parallel data from the industry

● September 2016: 72B+ words in the repository

● 10M to 100M words per ModernMT language pair

Page 21: Machine Translation in Numbers
Page 22: Machine Translation in Numbers

Collecting from the Web - Hard!

● The Web is large - even the so-called Surface or Indexable Web

● The Web is messy

● The Web is constantly in flux

● Not many organizations crawl the entire indexable web

○ Google - about 49B web pages in index (Source: www.worldwidewebsize.com)

○ Microsoft - about 20B web pages in index (Source: www.worldwidewebsize.com)

● Other crawls are focused crawls on a subset with certain criteria/goals

● Still hard for the same reasons

Page 23: Machine Translation in Numbers

Common Crawl Comes to the Rescue

● Commoncrawl.org

○ “CommonCrawl is a 501(c)(3) non-profit organization dedicated to providing a copy of the internet to internet researchers, companies and individuals at no cost for the purpose of research and analysis.”

● On average 1.5B unique URLs per crawl

● A very good resource for sourcing bilingual and monolingual data for machine translation purposes

○ Prototype developed by academic developers in 2012/2013 showed potential to mine parallel corpora with millions of source words

Page 24: Machine Translation in Numbers

Common Crawl Comes to the Rescue

● Implemented data collection pipeline based on prototype techniques

● Collecting monolingual and bilingual data

● Open sourced at https://github.com/ModernMT/DataCollection

● We are making the indices of parallel pages we discover available

○ Saves running half of the data collection pipeline

○ Each user still has to download their own data

● Avoids potential copyright issues
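
As an illustration of how candidate parallel pages might be discovered in a crawl index, the sketch below pairs URLs that differ only in a language-code path segment. This is one common heuristic and an assumption here; the slides do not say which rules the ModernMT DataCollection pipeline applies. The function name `candidate_pairs`, the language list, and the example URLs are hypothetical.

```python
import re
from collections import defaultdict

# Language codes to look for (an illustrative subset).
LANG_CODES = {"en", "it", "fr", "de", "es"}
# A language code that stands alone as a path segment, e.g. /en/ or /it.
SEGMENT = re.compile(r"(?<=/)(%s)(?=/|$)" % "|".join(sorted(LANG_CODES)))


def candidate_pairs(urls, src="en", tgt="it"):
    """Pair URLs that are identical except for the language segment."""
    by_template = defaultdict(dict)
    for url in urls:
        match = SEGMENT.search(url)
        if match:
            # Replace the language code with a placeholder to get a template.
            template = url[:match.start()] + "{lang}" + url[match.end():]
            by_template[template][match.group(1)] = url
    return sorted(
        (langs[src], langs[tgt])
        for langs in by_template.values()
        if src in langs and tgt in langs
    )
```

An index of such URL pairs is small and contains no page content, which fits the point above that each user still downloads their own data.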

Page 25: Machine Translation in Numbers

Parallel Data Stats

Page 26: Machine Translation in Numbers

Monolingual Data

Page 27: Machine Translation in Numbers

What’s next

● Release 0.14 - Next Week

○ Planned for AMTA next week. 45 languages supported, adding incremental learning.

● Baseline engines and data - 3 months

○ Finish the crawling and legal activity to release the data for the baseline engines.

● Neural MT - 12 months

○ Engineering effort to make it cost-effective, incremental and context-aware. Included by default in MMT.

Page 28: Machine Translation in Numbers

How to contribute

● Do you want to use MMT? Provide Feedback (it is on GitHub).

● Do you want your engineers to contribute to the project?

● Do you want to add your data to the TAUS Data Cloud and help share baseline engines?

Page 29: Machine Translation in Numbers

Thank you

https://github.com/ModernMT/MMT