Parallel Multilingual Data from Monolingual Speakers

Bill Dolan Microsoft Research

Joint work with David Chen Department of Computer Science The University of Texas at Austin

FLaReNeT, 2011

Introduction

• Statistical machine translation systems require large amounts of parallel corpora

• Existing translation data often created for government or business uses

• Heavy bias for English as one of the languages

• No similar data for training paraphrase engines

Issues in collecting language data

• Professional translators are expensive

– E.g. $0.36/word to create a Tamil-English corpus [Germann, ACL 2001 Workshop]

• No such resource for paraphrasing at all

• Alternative: crowdsource the data collection

Variety of Natural Language Processing tasks done on Mechanical Turk

• Evaluating machine translation quality

– Callison-Burch, EMNLP 2009

– Denkowski and Lavie, NAACL 2010 AMT workshop

• Collecting translation data

– Ambati and Vogel, NAACL 2010 AMT workshop

– Bloodgood and Callison-Burch, NAACL 2010 AMT workshop

• Collecting paraphrase data

– Buzek et al., NAACL 2010 AMT workshop

– Denkowski et al., NAACL 2010 AMT workshop

Issues with Mechanical Turk

• Cheaters use online MT engines

– Collect and check against online translations

– Use an image instead of text

• Translation quality can be poor

– Ask other workers to edit the translations

– Ask other workers to rank the translations
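
The first countermeasure above (collect online translations and check submissions against them) can be sketched roughly as follows. This is an illustration only, not the method used in the work presented here: the function names, the 0.9 threshold, and the choice of Python's `difflib.SequenceMatcher` as the similarity measure are all assumptions.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial edits do not hide a copy."""
    return " ".join(text.lower().split())


def looks_machine_translated(submission: str, mt_outputs: list[str],
                             threshold: float = 0.9) -> bool:
    """Return True if the submission is nearly identical to any MT output.

    `mt_outputs` is a hypothetical input: translations of the same source
    sentence gathered beforehand from online MT engines.
    The threshold is an illustrative guess, not a tuned value.
    """
    candidate = normalize(submission)
    for mt in mt_outputs:
        ratio = SequenceMatcher(None, candidate, normalize(mt)).ratio()
        if ratio >= threshold:
            return True
    return False
```

A flagged submission would still need human review, since a short, simple sentence can legitimately match an MT output.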

Issues with Mechanical Turk

• Attracting workers

– Higher pay doesn’t always mean higher quality

– More incentive to cheat if pay is high

• Make tasks simple and quick enough

– Difficult to collect data where input is required rather than selecting buttons

– More difficult to translate whole sentences than n-grams

Use video to collect language data

• Use short video clips to elicit descriptions from people around the world

• Parallel descriptions in different languages

– Translation data

• Parallel descriptions in the same language

– Paraphrase data
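
One way to picture how descriptions of the same clip turn into both kinds of data is the sketch below. It assumes each description is stored as a `(video_id, language, sentence)` tuple; that layout and the function name `build_pairs` are hypothetical, not taken from the talk.

```python
from collections import defaultdict
from itertools import combinations


def build_pairs(descriptions):
    """Split co-descriptions of each video into paraphrase and translation pairs.

    `descriptions` is an iterable of (video_id, language, sentence) tuples
    (an assumed record layout). Two sentences describing the same video form
    a paraphrase candidate if their languages match, and a translation
    candidate otherwise.
    """
    by_video = defaultdict(list)
    for video_id, language, sentence in descriptions:
        by_video[video_id].append((language, sentence))

    paraphrases, translations = [], []
    for items in by_video.values():
        for (lang_a, sent_a), (lang_b, sent_b) in combinations(items, 2):
            if lang_a == lang_b:
                paraphrases.append((sent_a, sent_b))
            else:
                translations.append((sent_a, sent_b))
    return paraphrases, translations


sample = [
    ("v1", "en", "A man is eating spaghetti."),
    ("v1", "en", "The man ate some pasta from a bowl."),
    ("v1", "de", "Ein Mann isst Spagetti"),
]
paraphrases, translations = build_pairs(sample)
```

The resulting pairs are only candidates: two people watching the same clip may focus on different details, so the sentences are loosely rather than strictly equivalent.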

Example video

English Descriptions

• A man eats sphagetti sauce.

• A man is eating food.

• A man is eating from a plate.

• A man is eating something.

• A man is eating spaghetti from a large bowl while standing.

• A man is eating spaghetti out of a large bowl.

• A man is eating spaghetti.

• A man is eating spaghetti.

• A man is eating.

• A man is eating.

• A man is eating.

• A man tasting some food in the kitchen is expressing his satisfaction.

• The man ate some pasta from a bowl.

• The man is eating.

• The man tried his pasta and sauce.

Descriptions in Other Languages

• Tagalog: Linasahan ng kusinero ang kanyang pagkain.

• Slovene: Moški je špagete z vilico.

• German: Ein Mann isst Spagetti

• Romanian: Un barbat mananca paste.

Un barbat mananca spaghetti.

Un bucatar mananca ce a preparat.

• French: Un homme mande des pates.

• Spanish: Un gordo saborea un plato de pasta

• Dutch: De luie kok neemt gulzig een hap van zijn bord spaghetti met worstjes.

• Serbian: Čovek jede špagete.

• Russian: Мужчина что-то ест из тарелки.

• Tamil: ஒருவர் சாப்பிட்டுக்கொண்டு இருக்கிறார். ஒருவர் முள்கரண்டியால் உணவை சாப்பிடுகிறார்.

மனிதன் சாப்பிட்டு கொண்டு இருக்கிறான்.

Why do this?

• Problems with collecting translation data

– Requires bilingual speakers

– Need quality control to prevent cheating using online translation services

• Problems with collecting paraphrase data

– Biased by the source sentence

– Task is ill-defined; even the highly-skilled perform poorly

“This is the last major battle of the war.”
“The concluding of the war takes place with this final battle.”

– 92% find our task more enjoyable, 75% find it easier

Video description task

• Show a video segment to the user

• Ask them to write a single-sentence description in any language

– Little incentive to cheat

• Requires only monolingual skills

– Bederson et al. [Graphics Interface 2010]

– Luis von Ahn, DuoLingo

Quality Control

• 2-tiered payment system

• First, the worker has to prove their competence

– [Novotney and Callison-Burch, NAACL 2010 AMT workshop]

• Everyone has access to the Tier-1 tasks

• Good workers are manually selected and granted access to Tier-2 tasks

• Tasks identical in the two tiers, except for pay ($0.01 for Tier-1, $0.05 for Tier-2)
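
The two-tier scheme can be summarized in a few lines of code. This is only a sketch: the rates come from the slide, but the function name `pay_for`, the `approved_tier2` set, and the assumption that the rate applies once per submitted description are illustrative.

```python
# Rates from the slide, stored in cents to avoid floating-point rounding.
TIER_PAY_CENTS = {1: 1, 2: 5}


def pay_for(worker_id, approved_tier2, n_descriptions):
    """Return (tier, total pay in cents) for a worker.

    `approved_tier2` is a hypothetical set of worker IDs that the requester
    has manually promoted. Everyone else works at Tier-1.
    Assumes one payment per submitted description.
    """
    tier = 2 if worker_id in approved_tier2 else 1
    return tier, n_descriptions * TIER_PAY_CENTS[tier]
```

Because the task itself is the same in both tiers, promotion changes only the rate, not what the worker is asked to do.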

Video collection task

• Single, unambiguous action/event

• Short (4-10 seconds)

• Generally accessible

• No dialog

• No words (subtitles, overlaid text, titles)

Statistics for data collected

• Money spent: ~$5000

• Number of videos: 2089

• Number of total sentences: 122K

• Total number of workers: 820 (94 gained access to Tier-2)

• Total work hours: 2711

– Not including video finding task

• Cost per sentence: $0.04

• Work time per sentence: 79.6 sec

• Video segment length on average: 9.9 sec

• Number of sentences per video: 58.4
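
The per-sentence and per-video figures above follow from the totals. The short check below recomputes them from the rounded numbers on the slide (the variable names are illustrative), so the results should land close to, but not exactly on, the reported values.

```python
# Totals as printed on the slide; "~$5000" and "122K" are rounded figures.
money_spent_usd = 5000
num_videos = 2089
num_sentences = 122_000
work_hours = 2711

cost_per_sentence = money_spent_usd / num_sentences
seconds_per_sentence = work_hours * 3600 / num_sentences
sentences_per_video = num_sentences / num_videos

print(f"cost per sentence:    ${cost_per_sentence:.3f}")
print(f"work time per sentence: {seconds_per_sentence:.1f} sec")
print(f"sentences per video:    {sentences_per_video:.1f}")
```

Small gaps between these results and the slide's figures are expected, since the exact sentence count and spend are not given.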

Sample worker responses

• “The speed of approval gave me confidence that you would pay me for future work.”

• “Posters on Turker Nation recommended it as both high-paying and interesting. Both ended up being true.”

• “I consider this task pretty enjoyable. some videos are funny, others interesting.”

• “Fast, easy, really fun to do it.”

Daily number of descriptions collected

[Chart: number of descriptions collected per day, 7/21 through 9/1; vertical axis from 0 to 14,000]

Distribution of languages collected

Language      Descriptions
English       85550 (33855)
Hindi         6245
Romanian      3998
Slovene       3584
Serbian       3420
Tamil         2789
Dutch         2735
German        2326
Macedonian    1915
Spanish       1883
Gujarati      1437
Russian       1243
French        1226
Italian       953
Georgian      907
Polish        544
Chinese       494
Malayalam     394

• Other: Tagalog, Portuguese, Norwegian, Filipino, Estonian, Turkish, Arabic, Urdu, Hungarian, Indonesian, Malay, Bulgarian, Danish, Bosnian, Marathi, Swedish, Albanian

Worker statistics

[Three pie charts]

• Gender: Male, Female

• Country: United States, Europe, India, South America, Philippines, Mexico, Canada

• Age: 18-24, 25-35, 36-45, 46-55

Based on a survey of 46 workers

Unique Data Characteristics

• Massively parallel

– Easy to collect 50 or 100 or more utterances per video

– Relatively exhaustive coverage of natural monolingual paraphrases

• Truly multilingual data

– Visual, not linguistic source = no fluency bias

– Rich data for new research on cluster-level multilingual alignment

• Can be targeted to specific topic domains, languages

– As long as they have a simple visual component…

– Travel, cooking, sports, automotive, etc.

• Particularly useful for creating evaluation sets

– Easy to build a corpus with arbitrarily many reference sentences

– Useful for metrics like BLEU, which work best with many references
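
To illustrate why many references help a metric like BLEU, the sketch below implements only BLEU's clipped n-gram precision over multiple references. It is not full BLEU (there is no brevity penalty and no averaging across n-gram orders), and the lowercasing and whitespace tokenization are simplifying assumptions.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n=1):
    """Clipped n-gram precision of a candidate against several references.

    Each candidate n-gram count is clipped to the largest count of that
    n-gram in any single reference, as in BLEU's modified precision.
    """
    cand_counts = Counter(ngrams(candidate.lower().split(), n))
    if not cand_counts:
        return 0.0

    max_ref_counts = Counter()
    for ref in references:
        ref_counts = Counter(ngrams(ref.lower().split(), n))
        for gram, count in ref_counts.items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)

    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


one_ref = ["A man is eating spaghetti."]
two_refs = one_ref + ["The man ate some pasta from a bowl."]
score_one = modified_precision("a man eats pasta", one_ref)
score_two = modified_precision("a man eats pasta", two_refs)
```

With the second reference added, the candidate's word "pasta" finds a match, so the score rises. A larger pool of natural paraphrases makes it less likely that a valid wording is penalized simply for being absent from the references.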

Conclusion

• Introduced a novel way to collect translation and paraphrase data

– From monolingual speakers

• Conducted a successful pilot data collection on Mechanical Turk