SLTU 2014 – 4th Workshop on Spoken Language Technologies for Under-resourced Languages
St. Petersburg, Russia

www.kit.edu
KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association

Zeigen Sie andere Apps für einfaches Multitasking neben dem Browser an Internet Explorer nutzt Hardwarebeschleunigung Websites werden schneller geladen damit Sie noch reibungsloser surfen können

Nimm deine Lieblingsmusik überallhin mit kommt der iPod shuffle mit Speicher genug für hunderte von Songs alle wichtigen Songs fürs Training Wiedergabelisten Genius Mixes Podcasts und Hörbücher

Automatic Detection of Anglicisms for the Pronunciation Dictionary Generation:

A Case Study on our German IT Corpus

Sebastian Leidig, Tim Schlippe, Tanja Schultz

2 15-May-2014

Motivation

Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios

From Microsoft's German website www.microsoft.de:

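“Zeigen Sie andere Apps für einfaches Multitasking neben dem Browser an.” (English: “Show other apps next to the browser for easy multitasking.”)

“Internet Explorer nutzt Hardwarebeschleunigung. Websites werden schneller geladen, damit Sie noch reibungsloser surfen können.” (English: “Internet Explorer uses hardware acceleration. Websites load faster so that you can surf even more smoothly.”)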

Motivation

With globalization, words from other languages enter a language without being assimilated to the phonetic system of the new language.

To build up lexical resources economically with automatic or semi-automatic methods, such words need to be detected and treated separately.

Overview

Input: word list (word1, word2, word3, word4, word5, word6)

Features: grapheme perplexity, g2p confidence, hunspell lookup (native), hunspell lookup (English), Wiktionary lookup, Google hit count

Combination: voting, decision tree, SVM

Output: classification
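
The overview can be read as a small pipeline: every feature gives a yes/no opinion on each word, and a combination step merges the opinions. The following is a minimal sketch of the voting variant only, assuming a simple majority rule; the transcript does not say how the votes were weighted. The toy word sets and the three lambda features are hypothetical stand-ins, not the features used in the talk.

```python
from typing import Callable, Dict, List

# A feature maps a word to True ("English") or False ("native").
Feature = Callable[[str], bool]


def majority_vote(word: str, features: List[Feature]) -> bool:
    """Return True if more than half of the features flag the word as English."""
    votes = sum(1 for feature in features if feature(word))
    return 2 * votes > len(features)


def classify(words: List[str], features: List[Feature]) -> Dict[str, bool]:
    """Classify every word of the input word list by majority vote."""
    return {word: majority_vote(word, features) for word in words}


# Hypothetical toy data and features, for illustration only.
TOY_ENGLISH = {"browser", "software", "apps"}
TOY_NATIVE = {"schneller", "geladen", "zeigen"}

toy_features: List[Feature] = [
    lambda w: w.lower() in TOY_ENGLISH,     # stand-in for an English lookup
    lambda w: w.lower() not in TOY_NATIVE,  # stand-in for a native lookup
    lambda w: "ow" in w.lower(),            # stand-in for a score-based feature
]

result = classify(["Browser", "schneller"], toy_features)
```

A decision tree or SVM would replace `majority_vote` with a trained classifier over the same feature outputs.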

Outline

1. Motivation and Overview

2. Test Sets

3. Single Features

4. Combinations

5. Summary and Future Work

Test Sets - Domains

German IT website (www.microsoft.de): 4.6k unique words

German general news (www.spiegel.de): 6.6k unique words

Afrikaans NCHLT corpus (Heerden, Davel, Barnard, 2013), (Basson, Davel, 2013): 9.4k unique words

Test Sets - Domains

Tag for “English”:

e.g. Software, Brain, …

Foreign hybrids

Compound words: e.g. Schadsoftware, …

Grammatically adapted words: e.g. downloaden, …

Decisions based on

Agreement of annotators

duden.de

Different word categories:

Abbreviations: e.g. UV, CIA, …

Other foreign words, compound words: e.g. Français, Niveau, …

Foreign words in different test sets

Single Features – Design Criteria

Features trained on commonly available resources: word lists, pronunciation dictionaries, spellchecker dictionaries, Wiktionary, Google

Thresholds without supervised training: comparison between English and native models

New approaches

Grapheme Perplexity
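
The slide content for this feature did not survive in the transcript, so the following is only a sketch of one way the "comparison between English and native models" from the design criteria could be realized with grapheme perplexity. It assumes a grapheme bigram model with add-one smoothing and tiny toy word lists; the model order, smoothing, and training data used in the talk are not stated, and all names below are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, List


class GraphemeBigramModel:
    """Grapheme bigram model with add-one smoothing (an assumption, see above)."""

    def __init__(self, words: Iterable[str]) -> None:
        self.bigrams: Counter = Counter()
        self.contexts: Counter = Counter()
        self.vocab = set()
        for word in words:
            symbols = self._symbols(word)
            self.vocab.update(symbols)
            for prev, cur in zip(symbols, symbols[1:]):
                self.bigrams[(prev, cur)] += 1
                self.contexts[prev] += 1

    @staticmethod
    def _symbols(word: str) -> List[str]:
        # Word boundary markers let the model learn typical beginnings and endings.
        return ["<s>"] + list(word.lower()) + ["</s>"]

    def perplexity(self, word: str) -> float:
        symbols = self._symbols(word)
        v = len(self.vocab) + 1  # one extra slot for unseen graphemes
        log_prob = 0.0
        for prev, cur in zip(symbols, symbols[1:]):
            p = (self.bigrams[(prev, cur)] + 1) / (self.contexts[prev] + v)
            log_prob += math.log(p)
        return math.exp(-log_prob / (len(symbols) - 1))


# Hypothetical toy training lists; real models would use large word lists.
english_model = GraphemeBigramModel(
    ["software", "download", "browser", "window", "brain", "shuffle", "show", "power"]
)
native_model = GraphemeBigramModel(
    ["schneller", "geladen", "können", "reibungsloser", "beschleunigung",
     "zeigen", "einfaches", "neben"]
)


def looks_english(word: str) -> bool:
    """Flag a word as English if the English model is less perplexed by it."""
    return english_model.perplexity(word) < native_model.perplexity(word)
```

Comparing the two perplexities directly gives a decision threshold without any supervised training, which matches the design criterion named earlier.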

Grapheme-to-Phoneme Confidence

Phonetisaurus confidence scores (costs)

Hunspell Lookup

English: word list (word1, word2, word3, word4); Hunspell dictionary lookup in the spellchecker dictionary Hunspell-en, with derived word forms; classification

German: word list (word1, word2, word3, word4); Hunspell dictionary lookup in the spellchecker dictionary Hunspell-de, with derived word forms; classification

2 features performed best
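
The lookup idea can be sketched without Hunspell itself. The example below is a simplified stand-in, assuming plain Python sets instead of real Hunspell dictionaries and a naive suffix list instead of Hunspell affix rules; the stems, suffixes, and function names are hypothetical. It yields two features: a word found in the English dictionary is flagged as English, and a word missing from the native dictionary is flagged as English.

```python
from typing import Iterable, Set


def derive_word_forms(stems: Iterable[str], suffixes: Iterable[str]) -> Set[str]:
    """Expand stems with suffixes, a crude stand-in for Hunspell affix rules."""
    suffix_list = list(suffixes)
    forms: Set[str] = set()
    for stem in stems:
        forms.add(stem)
        forms.update(stem + suffix for suffix in suffix_list)
    return forms


# Hypothetical toy dictionaries standing in for Hunspell-en and Hunspell-de.
english_forms = derive_word_forms(["browser", "download"], ["s", "ed", "ing"])
native_forms = derive_word_forms(["schnell", "lad"], ["er", "en", "e"])


def english_lookup_says_english(word: str) -> bool:
    """English feature: the word is a known English word form."""
    return word.lower() in english_forms


def native_lookup_says_english(word: str) -> bool:
    """Native feature: the word is not a known native word form."""
    return word.lower() not in native_forms
```

The two features can disagree, for example on words unknown to both dictionaries, which is one reason to combine them with other features.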

Wiktionary Lookup

Check crowdsourced information in the Wiktionary of the matrix language

Google Hit Count

Based on Alex, B. (2008), “Automatic Detection of English Inclusions in Mixed-lingual Data with an Application to Parsing”, University of Edinburgh
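
The transcript gives no details of this feature beyond the citation. The sketch below assumes that hit counts from two language-restricted searches are compared after normalizing each by an estimated size of the web corpus in that language. No search API is called; the function names, parameters, and all numbers are hypothetical placeholders.

```python
def normalized_hits(hits: int, corpus_size: int) -> float:
    """Hit count relative to the estimated size of the language's web corpus."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return hits / corpus_size


def hit_count_says_english(
    english_hits: int,
    native_hits: int,
    english_corpus_size: int,
    native_corpus_size: int,
) -> bool:
    """Flag a word as English if it is relatively more frequent on English pages."""
    english_score = normalized_hits(english_hits, english_corpus_size)
    native_score = normalized_hits(native_hits, native_corpus_size)
    return english_score > native_score


# Made-up counts for two words, with made-up corpus size estimates.
flag_a = hit_count_says_english(1000, 100, 10_000, 5_000)
flag_b = hit_count_says_english(100, 1000, 10_000, 5_000)
```

Normalizing matters because the English web is much larger than the web in most other languages, so raw hit counts would favor English for almost every word.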

Result: Single Features

On the Spiegel-de test set, a higher ratio of the words classified as English is wrong
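
The observation above corresponds to lower precision for the English class. The transcript does not say which metrics the talk reported, so the following is only a generic sketch of precision, recall, and F-score over sets of words; the function name and the example word sets are hypothetical.

```python
from typing import Set, Tuple


def precision_recall_f1(predicted: Set[str], gold: Set[str]) -> Tuple[float, float, float]:
    """Score the words classified as English against the annotated English words."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up example: one of three words classified as English is wrong.
predicted_english = {"software", "browser", "niveau"}
annotated_english = {"software", "browser", "download"}
precision, recall, f1 = precision_recall_f1(predicted_english, annotated_english)
```

A test set with few English words makes each false alarm weigh more heavily on precision, which is consistent with the difference noted for the general news domain.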

Result: Combination

Challenges

Performance after filtering difficult words (oracle)

Conclusion and Future Work

Features based on available sources

New approaches: G2P confidence, Wiktionary

Further features: part-of-speech (POS); context, trigger words; capitalization; translate and compare

Thank you for your attention!
