Considerations on using PLS for Slovenian Pronunciation Lexicon Construction

Page 1: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction

PLS

Considerations on using PLS for Slovenian

Pronunciation Lexicon Construction

Jerneja Žganec Gros

Alpineon d.o.o., Ljubljana, Slovenia

[email protected]

Internationalizing W3C's Speech Synthesis Markup Language, Workshop II, Heraklion, Crete, May 2006

Page 2: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction

ALPINEon

SI-PRON lexicon:

– word list

– lexicon format

– phonetic transcription

– morpho-syntactic descriptions

Proposed extensions to PLS, SSML

Conclusions

Page 3: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction

Language specifics

Slovenian language:

– Slavic language, 2 million speakers, over 70 dialects

– complex inflectional paradigm (common to Slavic languages)

• including "dual" – like ancient Greek!

– lexical stress position: not fixed and mobile, as in Russian (unlike some other Slavic languages; Croatian, for example, never places the accent on the last syllable)

– many homographs; POS information usually helps with disambiguation:

• example: On je. (He is. / He eats.): auxiliary_verb vs. indicative
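The homograph example above can be sketched as a lexicon keyed by orthography and POS tag. This is only a minimal illustration: the tag names are taken from the slide, while the SAMPA-like transcriptions and the function name `pronounce` are assumptions for the sake of the example, not entries from SI-PRON.

```python
from typing import Dict, List, Optional, Tuple

# Keys are (orthography, POS tag); values are pronunciations.
# The transcriptions below are illustrative assumptions, not SI-PRON data.
LEXICON: Dict[Tuple[str, str], str] = {
    ("je", "auxiliary_verb"): "jE",   # "is"
    ("je", "indicative"): "je:",      # "eats"
}


def pronounce(word: str, pos: Optional[str] = None) -> List[str]:
    """Return candidate pronunciations, narrowed by POS when it is known."""
    word = word.lower()
    if pos is not None and (word, pos) in LEXICON:
        return [LEXICON[(word, pos)]]
    # Without usable POS information every homograph reading stays a candidate.
    return [pron for (w, _), pron in LEXICON.items() if w == word]
```

Calling `pronounce("je")` without a tag returns both readings, which is the ambiguity that POS information is meant to resolve.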

Page 4: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction

Pron lex

Speech technology applications:

– automatic speech recognition (ASR)

– text-to-speech synthesis (TTS)

– both require a consistent specification of pronunciation

– Slovenian: lexical stress position is not fixed, so a pronunciation lexicon is crucial

Pronunciation lexicons:

– general: not supposed to be covered by PLS

– application-specific:

• word/phrase pronunciations

• application-specific proper nouns: personal and location names

Page 5: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction

SI-PRON wordlist:

(a) 93,154 lemmas from SSKJ

(b) over 1,000,000 word forms derived from (a) by morphological derivation

(c) additional word list:

• corpus-based search

• 20,000 most frequent inflected word forms not covered by SSKJ lemmas

(d) collocations, multi-word expressions

SSKJ: Slovar slovenskega knjižnega jezika (Dictionary of the Slovenian Standard Language)

Word-list

Page 6: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction

Phonetic transcriptions

SSKJ lemmas:

– automatic derivation, based on dynamic/tonemic accent information

– manual corrections for about 2,500 lemmas (words of foreign origin)

Word forms derived from SSKJ:

– automatic: SSKJ lemma pronunciation look-up, inflectional paradigms

Additional corpus-based word list:

– automatic lexical stress assignment

– AlpSynth grapheme-to-phoneme rule set

Page 7: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction

GTP rules

193 context-dependent grapheme-to-phoneme rules:

Left context | Grapheme string | Right context | Phonetic transcr. | Example | Rule explanation

$ | er | _ | [@r] | Gaber | @ occurs before each -r not followed by a vowel (Toporisic91, p. 49)

= | m | f | [F] | Simfonija | <m> in front of <f> and <v> is pronounced as a labiodental (Pravopis90, p. 145)
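A context-dependent rule set of this shape can be applied with a simple left-to-right matcher. The sketch below is not the AlpSynth implementation. It assumes that "$" stands for a consonant, "_" for the word boundary, and "=" for any context; these readings are guesses based on the rule explanations, and graphemes not covered by a rule are passed through unchanged as a placeholder.

```python
from typing import Callable, List, NamedTuple

VOWELS = set("aeiou")


def is_consonant(ch: str) -> bool:
    # The empty string (word boundary) is not a consonant.
    return ch.isalpha() and ch not in VOWELS


class Rule(NamedTuple):
    left: Callable[[str], bool]    # test on the character before the match ("" at word start)
    graphemes: str                 # grapheme string to be rewritten
    right: Callable[[str], bool]   # test on the character after the match ("" at word end)
    phones: List[str]              # phonetic output


RULES = [
    # "$ er _ -> [@r]": ASSUMED to mean "er" after a consonant, at the end of the word.
    Rule(is_consonant, "er", lambda ch: ch == "", ["@r"]),
    # "= m f -> [F]": ASSUMED to mean "m" in any left context, directly before "f".
    Rule(lambda ch: True, "m", lambda ch: ch == "f", ["F"]),
]


def transcribe(word: str) -> List[str]:
    """Apply the first matching rule at each position, scanning left to right."""
    word = word.lower()
    out: List[str] = []
    i = 0
    while i < len(word):
        for rule in RULES:
            end = i + len(rule.graphemes)
            left = word[i - 1] if i > 0 else ""
            right = word[end] if end < len(word) else ""
            if (word.startswith(rule.graphemes, i)
                    and rule.left(left) and rule.right(right)):
                out.extend(rule.phones)
                i = end
                break
        else:
            # No rule matched: copy the grapheme unchanged (placeholder behaviour).
            out.append(word[i])
            i += 1
    return out
```

With these two rules, `transcribe("Gaber")` rewrites only the final "er", and `transcribe("Simfonija")` rewrites only the "m" before "f"; a full rule set would of course cover every grapheme.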

Page 8: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction

Transcription accuracy experiment

reference: hand-crafted pron lex, 30K lexemes, no loanwords(!)

automatic lexical stress assignment: 15% error rate

lexical stress & o/e pronunciation known in advance:

– transcription success rate 99.1% (0.6% handcrafting errors)

conclusion: for semi-automatic derivation of phonetic transcriptions with a 0.3% error rate, only lexical stress positions and e/o pronunciations need to be manually validated
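The 0.3% figure appears to follow from the two numbers on the slide, assuming that the 0.6% of handcrafting errors are mismatches caused by the reference lexicon rather than by the automatic transcription. Under that reading, the arithmetic is:

```python
from decimal import Decimal

# Figures from the slide; the interpretation of the 0.6% is an assumption.
success_rate = Decimal("99.1")        # transcriptions matching the reference (%)
handcrafting_errors = Decimal("0.6")  # mismatches attributed to the reference (%)

total_mismatch = Decimal("100") - success_rate
residual_error = total_mismatch - handcrafting_errors

print(f"total mismatch: {total_mismatch}%, residual error: {residual_error}%")
```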

Page 9: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction

SI-PRON format

LC-STAR lexicon specs – STTS (Shamas & v Heuvel, 2004)

Pronunciation Lexicon Specification (PLS):

– Version 1.0, W3C Last Call Working Draft, 31 January 2006

• http://www.w3.org/TR/pronunciation-lexicon/

PLS:

– Ver 1.0 not designed for TTS internal lexicons

– on the other hand, we want to have a stronger link between SSML and the lexicon

– we are even thinking of introducing a POS attribute into token-like elements!

– leave these issues for PLS Ver 2.x or address them now?

Page 10: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction

Pronunciation variations

multiple pronunciations:

– several <phoneme> elements

– preferred pronunciation:

• indicated by the prefer attribute of the <phoneme> element

• usually the 1st pronunciation from the SSKJ

• for some words, 2 prons are equally preferred, e.g.:

- masculine Slovenian nouns ending in "ilec", like /borilec/, /darovalec/

- "iUts"/"ilts", "ilts"/"iUts", "ilts", or "iUts"

- these typically account for a more fluent "iUts" or an overarticulated "ilts" pronunciation
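A lexeme with several pronunciations can be sketched in PLS 1.0 syntax using Python's standard `xml.etree.ElementTree`. The element and attribute names (`lexicon`, `lexeme`, `grapheme`, `phoneme`, `prefer`) and the namespace follow PLS 1.0; the transcriptions of "borilec", the `x-sampa` alphabet label, and the helper name `build_lexeme` are illustrative assumptions rather than SI-PRON content.

```python
import xml.etree.ElementTree as ET

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"


def build_lexeme(grapheme, pronunciations, preferred_index=None):
    """Create a <lexeme> with one <phoneme> child per pronunciation.

    If preferred_index is given, that pronunciation gets prefer="true";
    if it is None, no pronunciation is explicitly marked.
    """
    lexeme = ET.Element("lexeme")
    ET.SubElement(lexeme, "grapheme").text = grapheme
    for i, pron in enumerate(pronunciations):
        phoneme = ET.SubElement(lexeme, "phoneme")
        phoneme.text = pron
        if i == preferred_index:
            phoneme.set("prefer", "true")
    return lexeme


lexicon = ET.Element("lexicon", {
    "version": "1.0",
    "xmlns": PLS_NS,
    "alphabet": "x-sampa",   # assumed label for a SAMPA-like alphabet
    "xml:lang": "sl",
})
# Two variants with neither marked as preferred (transcriptions are illustrative).
lexicon.append(build_lexeme("borilec", ["bori:Uts", "bori:lts"]))

pls_document = ET.tostring(lexicon, encoding="unicode")
print(pls_document)
```

Leaving `preferred_index` unset mirrors the "equally preferred" case on the slide; a consumer of the document may still treat document order as significant.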

Page 11: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction

Extensions…

proposed extension for PLS/SSML:

– a new optional attribute for the <phoneme> element:

• pron-style attribute

• values: "fluent", "overarticulated"

– pron-style also for other elements (linkage SSML-lex!):

• <voice>, <speak>, <p>, <s>

• another optional attribute for the above elements: emotion for expressive TTS?

- could this be covered by the new role attribute?

- similar to <speaking_style>, proposed yesterday
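How a synthesizer might use the proposed attribute can be sketched as follows. Note that `pron-style` is the extension proposed on this slide and is not part of PLS 1.0; the transcriptions and the function name `select_pronunciation` are likewise assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# A lexeme using the PROPOSED (non-standard) pron-style attribute.
LEXEME_XML = """
<lexeme>
  <grapheme>borilec</grapheme>
  <phoneme pron-style="fluent">bori:Uts</phoneme>
  <phoneme pron-style="overarticulated">bori:lts</phoneme>
</lexeme>
"""


def select_pronunciation(lexeme, style):
    """Return the pronunciation tagged with the requested style.

    Falls back to the first listed pronunciation when no variant
    carries the requested style.
    """
    phonemes = lexeme.findall("phoneme")
    for phoneme in phonemes:
        if phoneme.get("pron-style") == style:
            return phoneme.text
    return phonemes[0].text if phonemes else None


lexeme = ET.fromstring(LEXEME_XML)
print(select_pronunciation(lexeme, "overarticulated"))
```

If the same attribute were allowed on SSML elements such as <voice> or <s>, the style value in scope at a given token could be passed straight into a lookup of this kind.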

Page 12: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction

Extensions…

PLS…. source/creator:

– only the <metadata> element

– source of multiple pronunciations:

• useful information when merging multiple PLS documents

• some sources/creators may be more reliable than others…

- additional optional attribute pron-source for the <phoneme> element
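The merging scenario can be sketched with each pronunciation carrying a source label, as the proposed pron-source attribute would allow. The source names, their reliability ranking, and the transcriptions below are hypothetical; the sketch only shows one possible policy (keep the most reliable source for each distinct pronunciation and list more reliable variants first).

```python
from typing import Dict, List, Tuple

# Hypothetical source labels, lower rank = more reliable.
SOURCE_RANK = {"sskj": 0, "manual": 1, "gtp-rules": 2}

Entry = Tuple[str, str]  # (pronunciation, source)


def rank(source: str) -> int:
    # Unknown sources are treated as least reliable.
    return SOURCE_RANK.get(source, len(SOURCE_RANK))


def merge_lexicons(lexicons: List[Dict[str, List[Entry]]]) -> Dict[str, List[Entry]]:
    """Merge lexicons, de-duplicating pronunciations and ordering by reliability."""
    best: Dict[str, Dict[str, str]] = {}  # word -> {pronunciation: best source}
    for lexicon in lexicons:
        for word, entries in lexicon.items():
            by_pron = best.setdefault(word, {})
            for pron, source in entries:
                current = by_pron.get(pron)
                if current is None or rank(source) < rank(current):
                    by_pron[pron] = source
    return {
        word: sorted(by_pron.items(), key=lambda item: rank(item[1]))
        for word, by_pron in best.items()
    }


rule_based = {"borilec": [("bori:lts", "gtp-rules")]}
curated = {"borilec": [("bori:Uts", "sskj"), ("bori:lts", "manual")]}
merged = merge_lexicons([rule_based, curated])
print(merged["borilec"])
```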

Page 13: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction

Extensions…

part-of-speech tags:

– Slovenian – complex inflectional paradigm

– morphological, syntactic and semantic(?) descriptors welcome in future revisions of the PLS specification

– SSML: POS tags could be defined as an optional attribute of the <token> element

lemma, MSD attributes used in SI-PRON

MULTEXT-East MSDs (Erjavec, 2004) – Telri, Concede

Multext-East LRs, http://nl.ijs.si/ME/V3

EAGLES,TEI P4 compliant
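MULTEXT-East MSDs are compact positional strings in which the first character gives the part of speech and each following position encodes one attribute. The sketch below decodes the first four noun attributes only; the position table is a simplified assumption written from memory of the MULTEXT-East conventions and should be checked against the specification before use.

```python
# Simplified, ASSUMED position table for noun MSDs (positions after the "N").
NOUN_POSITIONS = [
    ("type", {"c": "common", "p": "proper"}),
    ("gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("number", {"s": "singular", "d": "dual", "p": "plural"}),
    ("case", {
        "n": "nominative", "g": "genitive", "d": "dative",
        "a": "accusative", "l": "locative", "i": "instrumental",
    }),
]


def decode_noun_msd(msd: str) -> dict:
    """Decode the leading attributes of a noun MSD such as "Ncmdn"."""
    if not msd.startswith("N"):
        raise ValueError(f"not a noun MSD: {msd!r}")
    features = {"category": "noun"}
    for code, (name, values) in zip(msd[1:], NOUN_POSITIONS):
        if code != "-":  # "-" marks an attribute that does not apply
            features[name] = values.get(code, code)
    return features


print(decode_noun_msd("Ncmdn"))
```

The example string describes a common masculine noun in the dual nominative, the kind of distinction a POS attribute on SSML tokens would need to carry for Slovenian.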

Page 14: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction

MSDs

Page 15: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction

MSDs

Page 16: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction

MSDs

Page 17: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction

MSDs

TTS-internal lexicon (for highly inflected languages)

– full-blown form (PLS or other)

– compact lexicons:

– exception lexicon

– derivational scheme/paradigm for providing prefix/suffix morphological rules, indications of lexical stress position shifts (hardly an issue of PLS)

Page 18: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction

Conclusion

possible extensions to PLS, SSML:

– pron-style attribute

– emotion attribute needed?

– source/creator attribute welcome

– morpho-syntactic, semantic descriptors

Page 19: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction

Alpineon

ZRC-SAZU • Fran Ramovš Institute of the Slovenian Language

Project Partners

L6-5405 project

– Research of Slovenian Language in Lexicography and Lexicology based on Digital Language Resources

– Spoken representation of Slovenian words:

• http://bos.zrc-sazu.si/sskj.html

Page 20: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction

PLS

THANK YOU FOR YOUR ATTENTION!