Parallel Multilingual Data from Monolingual Speakers

Bill Dolan, Microsoft Research

Joint work with David Chen, Department of Computer Science, The University of Texas at Austin

FLaReNeT, 2011

Introduction

• Statistical machine translation systems require large parallel corpora

• Existing translation data often created for government or business uses

• Heavy bias toward English as one of the languages

• No similar data for training paraphrase engines

Issues in collecting language data

• Professional translators are expensive

– E.g. $0.36/word to create a Tamil-English corpus [Germann, ACL 2001 Workshop]

• No such resource for paraphrasing at all

Crowdsource data collection

Variety of Natural Language Processing tasks done on Mechanical Turk

• Evaluating machine translation quality

– Callison-Burch, EMNLP 2009

– Denkowski and Lavie, NAACL 2010 AMT workshop

• Collecting translation data

– Ambati and Vogel, NAACL 2010 AMT workshop

– Bloodgood and Callison-Burch, NAACL 2010 AMT workshop

• Collecting paraphrase data

– Buzek et al., NAACL 2010 AMT workshop

– Denkowski et al., NAACL 2010 AMT workshop

Issues with Mechanical Turk

• Cheaters use online MT engines

– Collect and check against online translations

– Use an image instead of text

• Translation quality can be poor

– Ask other workers to edit the translations

– Ask other workers to rank the translations
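The "collect and check against online translations" countermeasure can be sketched as a string-similarity test. This is only an illustration, not the check used in any of the cited studies: the function names, the normalization, and the 0.9 threshold are assumptions, and the MT outputs are expected to be gathered separately by the requester.

```python
from difflib import SequenceMatcher

def normalize(text):
    """Lowercase and collapse whitespace so trivial edits do not hide a copy."""
    return " ".join(text.lower().split())

def looks_machine_translated(submission, mt_outputs, threshold=0.9):
    """Flag a submission that is nearly identical to any collected MT output.

    `mt_outputs` is a list of translations of the same source sentence that
    the requester obtained from online engines beforehand (hypothetical input).
    The threshold is an assumed value and would need tuning on real data.
    """
    sub = normalize(submission)
    return any(
        SequenceMatcher(None, sub, normalize(mt)).ratio() >= threshold
        for mt in mt_outputs
    )

copied = looks_machine_translated(
    "A man is eating spaghetti.", ["a man is  eating spaghetti."]
)
original = looks_machine_translated(
    "The cook tastes his food.", ["A man is eating spaghetti."]
)
```

A character-level ratio is a blunt instrument; flagged submissions would still need a human look before a worker is rejected.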

Issues with Mechanical Turk

• Attracting workers

– Higher pay doesn’t always mean higher quality

– More incentive to cheat if pay is high

• Make tasks simple and quick enough

– Difficult to collect data when the task requires typed input rather than button selection

– More difficult to translate whole sentences than n-grams

Use video to collect language data

• Use short video clips to elicit descriptions from people around the world

• Parallel descriptions in different languages: translation data

• Parallel descriptions in the same language: paraphrase data
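As a rough sketch of how descriptions of the same clip turn into the two kinds of data, the snippet below groups sentences by video and language, then pairs them. The record layout `(video_id, language, sentence)` and the function name are assumptions for illustration, not the format of the released dataset.

```python
from collections import defaultdict
from itertools import combinations

def build_pairs(descriptions):
    """Pair up descriptions of the same video.

    descriptions: iterable of (video_id, language, sentence) tuples
    (a hypothetical record layout).
    Returns (paraphrases, translations).
    """
    by_video = defaultdict(lambda: defaultdict(list))
    for video_id, language, sentence in descriptions:
        by_video[video_id][language].append(sentence)

    paraphrases, translations = [], []
    for langs in by_video.values():
        # Same video, same language: paraphrase candidates.
        for language, sentences in langs.items():
            paraphrases.extend(
                (language, a, b) for a, b in combinations(sentences, 2)
            )
        # Same video, different languages: translation candidates.
        for (l1, s1), (l2, s2) in combinations(sorted(langs.items()), 2):
            translations.extend((l1, a, l2, b) for a in s1 for b in s2)
    return paraphrases, translations

sample = [
    ("vid1", "English", "A man is eating spaghetti."),
    ("vid1", "English", "The man ate some pasta from a bowl."),
    ("vid1", "German", "Ein Mann isst Spagetti"),
    ("vid1", "Serbian", "Čovek jede špagete."),
]
paraphrases, translations = build_pairs(sample)
```

Pairs produced this way are only loosely parallel, since two workers may describe the same clip at different levels of detail.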

Example video

English Descriptions

• A man eats sphagetti sauce.

• A man is eating food.

• A man is eating from a plate.

• A man is eating something.

• A man is eating spaghetti from a large bowl while standing.

• A man is eating spaghetti out of a large bowl.

• A man is eating spaghetti.

• A man is eating spaghetti.

• A man is eating.

• A man is eating.

• A man is eating.

• A man tasting some food in the kitchen is expressing his satisfaction.

• The man ate some pasta from a bowl.

• The man is eating.

• The man tried his pasta and sauce.

Descriptions in Other Languages

• Tagalog: Linasahan ng kusinero ang kanyang pagkain.

• Slovene: Moški je špagete z vilico.

• German: Ein Mann isst Spagetti

• Romanian: Un barbat mananca paste.

Un barbat mananca spaghetti.

Un bucatar mananca ce a preparat.

• French: Un homme mande des pates.

• Spanish: Un gordo saborea un plato de pasta

• Dutch: De luie kok neemt gulzig een hap van zijn bord spaghetti met worstjes.

• Serbian: Čovek jede špagete.

• Russian: Мужчина что-то ест из тарелки.

• Tamil: ஒருவர் சாப்பிட்டுக்கொண்டு இருக்கிறார். ஒருவர் முள்கரண்டியால் உணவை சாப்பிடுகிறார்.

மனிதன் சாப்பிட்டு கொண்டு இருக்கிறான்.

Why do this?

• Problems with collecting translation data

– Requires bilingual speakers

– Need quality control to prevent cheating using online translation services

• Problems with collecting paraphrase data

– Biased by the source sentence

– Task is ill-defined; even the highly skilled perform poorly

“This is the last major battle of the war.” “The concluding of the war takes place with this final battle.”

– 92% find our task more enjoyable, 75% find it easier

Video description task

• Show a video segment to the user

• Ask them to write a single-sentence description in any language

– Little incentive to cheat

• Requires only monolingual skills

– Bederson et al. [Graphics Interface 2010]

– Luis von Ahn, DuoLingo

Quality Control

• 2-tiered payment system

• First, the worker has to prove their competence [Novotney and Callison-Burch, NAACL 2010 AMT workshop]

• Everyone has access to the Tier-1 tasks

• Good workers were manually selected and granted access to Tier-2 tasks

• Tasks identical in the two tiers, except for pay ($0.01 for Tier-1, $0.05 for Tier-2)
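The two-tier scheme amounts to a lookup on a manually maintained list of promoted workers. The sketch below assumes the quoted amounts are paid per completed task and works in whole cents to avoid floating-point rounding; the names `TIER_PAY_CENTS` and `payment_cents` are made up for illustration.

```python
# Pay per task in cents, taken from the slide ($0.01 for Tier-1, $0.05 for Tier-2).
# Treating these as per-task amounts is an assumption.
TIER_PAY_CENTS = {1: 1, 2: 5}

def payment_cents(worker_id, tier2_workers, tasks_completed):
    """Return what a worker is owed, in cents.

    tier2_workers: set of worker IDs promoted by hand after their
    Tier-1 work was reviewed (hypothetical bookkeeping).
    """
    tier = 2 if worker_id in tier2_workers else 1
    return TIER_PAY_CENTS[tier] * tasks_completed

promoted = {"worker_b"}
tier1_total = payment_cents("worker_a", promoted, 10)
tier2_total = payment_cents("worker_b", promoted, 10)
```

Because the tasks are identical in both tiers, promotion changes only the pay rate, not what the worker is asked to do.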

Video collection task

• Single, unambiguous action/event

• Short (4-10 seconds)

• Generally accessible

• No dialog

• No words (subtitles, overlaid text, titles)

Statistics for data collected

• Money spent: ~$5000

• Number of videos: 2089

• Number of total sentences: 122K

• Total number of workers: 820 (94 gained access to Tier-2)

• Total work hours: 2711

– Not including video finding task

• Cost per sentence: $0.04

• Work time per sentence: 79.6 sec

• Video segment length on average: 9.9 sec

• Number of sentences per video: 58.4
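The per-sentence and per-video figures follow from the totals above. The quick check below uses the rounded values as they appear on the slide ("~$5000", "122K"), so small differences from the reported numbers are expected.

```python
money_spent = 5000.0   # dollars; the slide gives "~$5000"
videos = 2089
sentences = 122_000    # the slide gives "122K", so this is a rounded count
work_hours = 2711

cost_per_sentence = money_spent / sentences           # about $0.04
seconds_per_sentence = work_hours * 3600 / sentences  # about 80 seconds
sentences_per_video = sentences / videos              # about 58.4

print(f"cost per sentence:      ${cost_per_sentence:.3f}")
print(f"work time per sentence: {seconds_per_sentence:.1f} sec")
print(f"sentences per video:    {sentences_per_video:.1f}")
```

The computed work time comes out slightly above the reported 79.6 sec, which is consistent with the true sentence count being a little over 122,000.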

Sample worker responses

• “The speed of approval gave me confidence that you would pay me for future work.”

• “Posters on Turker Nation recommended it as both high-paying and interesting. Both ended up being true.”

• “I consider this task pretty enjoyable. some videos are funny, others interesting.”

• “Fast, easy, really fun to do it.”

Daily number of descriptions collected

[Chart: number of descriptions collected per day, 7/21 to 9/1; vertical axis from 0 to 14,000]

Distribution of languages collected

Language      Count
English       85550 (33855)
Hindi         6245
Romanian      3998
Slovene       3584
Serbian       3420
Tamil         2789
Dutch         2735
German        2326
Macedonian    1915
Spanish       1883
Gujarati      1437
Russian       1243
French        1226
Italian       953
Georgian      907
Polish        544
Chinese       494
Malayalam     394

• Other: Tagalog, Portuguese, Norwegian, Filipino, Estonian, Turkish, Arabic, Urdu, Hungarian, Indonesian, Malay, Bulgarian, Danish, Bosnian, Marathi, Swedish, Albanian

Dataset Publicly Available

• http://research.microsoft.com/en-us/downloads/38cf15fd-b8df-477e-a4e4-a4680caa75af/default.aspx

Worker statistics

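[Pie charts of worker demographics; the proportions are not recoverable from the extracted text]

• Gender: Male, Female

• Country: United States, Europe, India, South America, Philippines, Mexico, Canada

• Age: 18-24, 25-35, 36-45, 46-55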

Based on a survey of 46 workers

Unique Data Characteristics

• Massively parallel

– Easy to collect 50 or 100 or more utterances per video

– Relatively exhaustive coverage of natural monolingual paraphrases

• Truly multilingual data

– Visual, not linguistic source = no fluency bias

– Rich data for new research on cluster-level multilingual alignment

• Can be targeted to specific topic domains, languages

– As long as they have a simple visual component…

– Travel, cooking, sports, automotive, etc.

• Particularly useful for creating evaluation sets

– Easy to build a corpus with arbitrarily many reference sentences

– Useful for metrics like BLEU, which work best with many references
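To see why extra references matter for BLEU-style metrics, the sketch below computes clipped unigram precision against a set of references. This is only one ingredient of BLEU (no higher-order n-grams, no brevity penalty), and the function name and whitespace tokenization are assumptions made for the example.

```python
from collections import Counter

def clipped_unigram_precision(candidate, references):
    """Fraction of candidate tokens matched in the references.

    Each candidate token count is clipped to the largest count of that
    token in any single reference, as in BLEU's modified precision.
    Tokenization is plain lowercased whitespace splitting (an assumption).
    """
    cand_counts = Counter(candidate.lower().split())
    max_ref = Counter()
    for ref in references:
        for token, count in Counter(ref.lower().split()).items():
            max_ref[token] = max(max_ref[token], count)
    clipped = sum(min(count, max_ref[token]) for token, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

candidate = "a man eats pasta"
one_ref = ["a man is eating spaghetti"]
two_refs = one_ref + ["the man eats pasta from a bowl"]

score_one = clipped_unigram_precision(candidate, one_ref)
score_two = clipped_unigram_precision(candidate, two_refs)
```

With a single reference the acceptable wording "eats pasta" goes unrewarded; adding a second description of the same clip lets it match, which is the effect the many-references point relies on.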

Conclusion

• Introduced a novel way to collect translation and paraphrase data

– From monolingual speakers

• Conducted a successful pilot data collection on Mechanical Turk
