Transcript
Page 1:

SLTU 2014 – 4th Workshop on Spoken Language Technologies for Under-resourced Languages, St. Petersburg, Russia

www.kit.edu

KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association

Zeigen Sie andere Apps für einfaches Multitasking neben dem Browser an Internet Explorer nutzt Hardwarebeschleunigung Websites werden schneller geladen damit Sie noch reibungsloser surfen können

Nimm deine Lieblingsmusik überallhin mit kommt der iPod shuffle mit Speicher genug für hunderte von Songs alle wichtigen Songs fürs Training Wiedergabelisten Genius Mixes Podcasts und Hörbücher

Automatic Detection of Anglicisms for the Pronunciation Dictionary Generation:

A Case Study on our German IT Corpus

Sebastian Leidig, Tim Schlippe, Tanja Schultz

Page 2:

2 15-May-2014

Motivation

Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios

From Microsoft's German website www.microsoft.de:

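“Zeigen Sie andere Apps für einfaches Multitasking neben dem Browser an.” (English: “Display other apps next to the browser for easy multitasking.”)

“Internet Explorer nutzt Hardwarebeschleunigung. Websites werden schneller geladen, damit Sie noch reibungsloser surfen können.” (English: “Internet Explorer uses hardware acceleration. Websites load faster so that you can surf even more smoothly.”)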

Page 3:

Motivation

With globalization, words from other languages enter a language without being assimilated to its phonetic system.

To build up lexical resources economically with automatic or semi-automatic methods, such words must be detected and treated separately.

Page 4:

Overview

Input: word list (word1, word2, word3, word4, word5, word6)

Features: grapheme perplexity, g2p confidence, Hunspell lookup (native), Hunspell lookup (English), Wiktionary lookup, Google hit count

Combination: voting, decision tree, SVM

Output: classification
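The voting step in the overview can be sketched as follows. This is a minimal illustration, not the authors' implementation: the slide names voting as one combination method but does not give the rule, so the strict-majority default, the function name `vote`, and the example feature decisions are all assumptions.

```python
from typing import Dict, Optional


def vote(feature_votes: Dict[str, bool], min_votes: Optional[int] = None) -> bool:
    """Combine binary feature decisions (True = "English") by voting.

    min_votes is a hypothetical parameter; by default a strict majority
    of the features must vote "English".
    """
    if min_votes is None:
        min_votes = len(feature_votes) // 2 + 1
    return sum(feature_votes.values()) >= min_votes


# Made-up decisions of the six features for a single word.
votes = {
    "grapheme_perplexity": True,
    "g2p_confidence": True,
    "hunspell_native": False,
    "hunspell_english": True,
    "wiktionary": False,
    "google_hit_count": True,
}
is_english = vote(votes)  # four of six features vote "English"
```

Lowering `min_votes` trades precision for recall; a decision tree or SVM would instead learn how to weight the same feature outputs from annotated data.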

Page 5:

Outline

1. Motivation and Overview

2. Test Sets

3. Single Features

4. Combinations

5. Summary and Future Work

Page 6:

Test Sets - Domains

German IT website (www.microsoft.de): 4.6k unique words

German general news (www.spiegel.de): 6.6k unique words

Afrikaans NCHLT corpus (Heerden, Davel, Barnard, 2013), (Basson, Davel, 2013): 9.4k unique words

Page 7:

Test Sets - Domains

Different word categories:

Tag for “English”: e.g. Software, Brain, …

Abbreviations: e.g. UV, CIA, …

Other foreign words: e.g. Français, Niveau, …

Foreign hybrids:

Compound words, e.g. Schadsoftware, …

Grammatically adapted words, e.g. downloaden, …

Decisions based on: agreement of annotators, duden.de

Page 8:

Foreign words in different test sets

Page 9:

Single Features – Design Criteria

Features trained on commonly available resources: word lists, pronunciation dictionaries, spellchecker dictionaries, Wiktionary, Google

Thresholds without supervised training: comparison between English and native models

New approaches
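The idea of replacing a supervised threshold with a comparison between an English and a native model can be illustrated with grapheme perplexity. The sketch below is an assumption-laden toy, not the setup used in the talk: the class `GraphemeBigramModel`, the bigram order, the add-one smoothing, and the tiny word lists are all chosen here for illustration only.

```python
import math
from collections import Counter
from typing import Iterable, List


class GraphemeBigramModel:
    """Add-one smoothed grapheme bigram model trained on a word list."""

    def __init__(self, words: Iterable[str]) -> None:
        self.bigrams: Counter = Counter()
        self.contexts: Counter = Counter()
        self.vocab: set = set()
        for word in words:
            symbols = self._symbols(word)
            self.vocab.update(symbols)
            for prev, cur in zip(symbols, symbols[1:]):
                self.bigrams[(prev, cur)] += 1
                self.contexts[prev] += 1

    @staticmethod
    def _symbols(word: str) -> List[str]:
        # Word boundary markers let the model score word starts and ends.
        return ["<s>"] + list(word.lower()) + ["</s>"]

    def perplexity(self, word: str) -> float:
        symbols = self._symbols(word)
        v = len(self.vocab) + 1  # +1 reserves probability mass for unseen graphemes
        log_prob = 0.0
        for prev, cur in zip(symbols, symbols[1:]):
            p = (self.bigrams[(prev, cur)] + 1) / (self.contexts[prev] + v)
            log_prob += math.log(p)
        return math.exp(-log_prob / (len(symbols) - 1))


def looks_english(word: str, english: GraphemeBigramModel,
                  native: GraphemeBigramModel) -> bool:
    """Classify as English when the English model is less 'surprised'."""
    return english.perplexity(word) < native.perplexity(word)


# Toy training lists (assumptions); real models would use large word lists.
english_model = GraphemeBigramModel(
    ["software", "download", "browser", "update", "website", "cloud"])
german_model = GraphemeBigramModel(
    ["schaden", "beschleunigung", "geladen", "reibungslos", "können", "zeigen"])
```

Because the decision is a direct comparison of two model scores, no threshold has to be tuned on annotated data.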

Page 10:

Grapheme Perplexity

Page 11:

Grapheme Perplexity

Page 12:

Grapheme-to-Phoneme Confidence

Phonetisaurus confidence scores (costs)

Page 13:

Grapheme-to-Phoneme Confidence

Page 14:

Hunspell Lookup

Input: word list (word1, word2, word3, word4)

English feature: Hunspell dictionary lookup (derives word forms) in the spellchecker dictionary Hunspell-en, yielding a classification

German feature: Hunspell dictionary lookup (derives word forms) in the spellchecker dictionary Hunspell-de, yielding a classification

2 features performed best
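The two lookup features can be sketched as below. This is a simplified stand-in under stated assumptions: plain Python sets replace the Hunspell-en and Hunspell-de dictionaries, so no word forms are derived from affix rules as Hunspell would do, and the decision rule for the native feature (a word missing from the German dictionary counts as a vote for "English") is an assumption, since the slide only shows that each lookup yields a classification. The function names and dictionary contents are hypothetical.

```python
from typing import Set


def english_lookup_feature(word: str, english_dict: Set[str]) -> bool:
    """Vote 'English' if the word is found in the English dictionary."""
    return word.lower() in english_dict


def native_lookup_feature(word: str, native_dict: Set[str]) -> bool:
    """Vote 'English' if the word is NOT found in the native dictionary.

    Assumed rule; the slide does not spell out how the lookup result
    is mapped to a classification.
    """
    return word.lower() not in native_dict


# Tiny stand-in dictionaries (assumptions, not real Hunspell data).
english_dict = {"software", "browser", "download", "surf"}
german_dict = {"schaden", "laden", "zeigen", "browser"}

for word in ["Software", "Browser", "Zeigen"]:
    print(word,
          english_lookup_feature(word, english_dict),
          native_lookup_feature(word, german_dict))
```

A word such as "Browser", which appears in both stand-in dictionaries, shows why the two lookups are kept as separate features: they can disagree, and the combination step decides.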

Page 15:

Hunspell Lookup

Page 16:

Wiktionary Lookup

Check crowdsourced information from the matrix-language Wiktionary

Page 17:

Google Hit Count

Based on Alex, B. (2008), “Automatic Detection of English Inclusion in Mixed-lingual Data with an Application to Parsing”, University of Edinburgh

Page 18:

Google Hit Count

Page 19:

Result: Single Features

Page 20:

Grapheme-to-Phoneme Confidence

Page 21:

Result: Single Features

On the Spiegel-de test set, a higher proportion of the words classified as English are misclassified.

Page 22:

Result: Combination

Page 23:

Challenges

Performance after filtering difficult words (oracle)

Page 24:

Conclusion and Future Work

Features based on available sources

New approaches: G2P confidence, Wiktionary

Further features: part-of-speech (POS); context, trigger words; capitalization; translate and compare

Page 25:

Благодарим за внимание! (Thank you for your attention!)
