Considerations on using PLS for Slovenian Pronunciation Lexicon Construction
Jerneja Žganec Gros, Alpineon d.o.o., Ljubljana, Slovenia
Internationalizing W3C's Speech Synthesis Markup Language, Workshop II, Heraklion, Crete, May 2006
SI-PRON lexicon:
– word list
– lexicon format
– phonetic transcription
– morpho-syntactic descriptions
Proposed extensions to PLS, SSML
Conclusions
Language specifics
Slovenian language:
– Slavic language, 2 million speakers, over 70 dialects
– complex inflectional paradigm (common to Slavic languages)
• including "dual" – like ancient Greek!
– lexical stress position: not fixed and mobile, as in Russian (unlike some other Slavic languages; e.g. Croatian never carries the accent on the last syllable)
– many homographs; POS information usually helps with disambiguation:
• example: On je. (He is/eats). auxiliary_verb/indicative
Pronunciation lexicon
Speech technology applications:
– automatic speech recognition (ASR)
– text-to-speech synthesis (TTS)
– require consistent specification of pronunciation
– Slovenian: lexical stress position not fixed -> pronunciation lexicon crucial
Pronunciation lexicons:
– general: not supposed to be covered by PLS
– application-specific
• word/phrase pronunciations
• application-specific proper nouns: personal & location names
Word-list
SI-PRON wordlist:
(a) 93,154 lemmas from SSKJ
(b) over 1,000,000 word forms derived from (a) by morphological derivation
(c) additional word list:
• corpus-based search
• 20,000 most frequent inflected word forms not covered by SSKJ lemmas
(d) collocations, multi-word expressions
SSKJ: Slovar slovenskega knjižnega jezika (Dictionary of the Slovenian Standard Language)
Phonetic transcriptions
SSKJ lemmas:
– automatic derivation, based on dynamic/tonemic accent information
– manual corrections for about 2,500 lemmas (words of foreign origin)
Word forms derived from SSKJ:
– automatic: SSKJ lemma pronunciation look-up, inflectional paradigms
Additional corpus-based word list:
– automatic lexical stress assignment
– AlpSynth grapheme-to-phoneme rule set
GTP rules
193 context-dependent grapheme-to-phoneme rules:
Left context | Grapheme string | Right context | Phonetic transcription | Example | Rule explanation
$ | er | _ | [@r] | Gaber | @ occurs before each -r not followed by a vowel (Toporisic91, p. 49)
= | m | f | [F] | Simfonija | <m> in front of <f> and <v> is pronounced as a labiodental (Pravopis90, p. 145)
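The two rules in the table can be sketched as a small rule interpreter. This is a minimal illustration, not the AlpSynth rule engine: the meaning of the context symbols is an assumption ("$" taken as any consonant, "=" as any context, "_" as a word boundary), and the names `Rule`, `context_matches` and `transcribe` are hypothetical. Graphemes that no rule covers are copied unchanged, so the output is only partially phonetic.

```python
from typing import NamedTuple

VOWELS = set("aeiou")


class Rule(NamedTuple):
    left: str      # left-context symbol
    grapheme: str  # grapheme string to rewrite
    right: str     # right-context symbol
    phones: str    # phonetic transcription to emit


def context_matches(symbol, neighbour):
    """Check one context symbol against the adjacent character.

    neighbour is '' at a word boundary. The symbol semantics below are
    assumptions made for this sketch, not taken from the slides.
    """
    if symbol == "=":   # assumed: any context
        return True
    if symbol == "_":   # assumed: word boundary
        return neighbour == ""
    if symbol == "$":   # assumed: any consonant
        return neighbour != "" and neighbour not in VOWELS
    return neighbour == symbol  # otherwise: literal grapheme


# The two example rules from the table above.
RULES = [
    Rule("$", "er", "_", "@r"),
    Rule("=", "m", "f", "F"),
]


def transcribe(word, rules=RULES):
    """Scan left to right and apply the first rule whose contexts match."""
    word = word.lower()
    out = []
    i = 0
    while i < len(word):
        for rule in rules:
            end = i + len(rule.grapheme)
            if word[i:end] != rule.grapheme:
                continue
            left = word[i - 1] if i > 0 else ""
            right = word[end] if end < len(word) else ""
            if context_matches(rule.left, left) and context_matches(rule.right, right):
                out.append(rule.phones)
                i = end
                break
        else:
            out.append(word[i])  # no rule applies: copy the grapheme
            i += 1
    return "".join(out)
```

With these two rules only, `transcribe("Gaber")` rewrites the final "er" and `transcribe("Simfonija")` rewrites the "m" before "f"; a full rule set would order the 193 rules so that more specific contexts are tried first.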
Transcription accuracy experiment
reference: hand-crafted pron lex, 30K lexemes, no loanwords(!)
automatic lexical stress assignment: 15% error rate
lexical stress & o/e pronunciation known in advance:
– transcription success rate 99.1% (0.6% handcrafting errors)
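conclusion: for semi-automatic derivation of phonetic transcriptions with a 0.3% error rate (the 0.9% of mismatches less the 0.6% traced to handcrafting errors in the reference), only lexical stress positions & e/o pronunciation need to be manually validated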
SI-PRON format
LC-STAR lexicon specs – STTS (Shamas & v Heuvel, 2004)
Pronunciation Lexicon Specification (PLS)
– Version 1.0, W3C Last Call Working Draft 31 January 2006
• http://www.w3.org/TR/pronunciation-lexicon/
PLS:
– Ver 1.0 not designed for TTS internal lexicons
– on the other hand, we want to have a stronger link between SSML and the lexicon
– we are even thinking of introducing POS attribute into token-like elements!
– leave these issues for PLS Ver 2.x or address them now?
Pronunciation variations
multiple pronunciations:
– several <phoneme> elements
– preferred pronunciation:
• indicated by the prefer attribute
• usually the 1st pronunciation from the SSKJ
• for some words, 2 prons are equally preferred, e.g.:
- masculine Slovenian nouns ending in "ilec", such as /borilec/, /darovalec/
- possible transcriptions of the ending: "iUts"/"ilts", "ilts"/"iUts", "ilts", or "iUts"
- the variants typically correspond to a more fluent ("iUts") or an overarticulated ("ilts") pronunciation
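A lexicon entry with several pronunciations can be sketched by building a PLS-style `<lexeme>` with Python's standard `xml.etree.ElementTree`. The element names `lexeme`, `grapheme` and `phoneme` and the `prefer` marker follow PLS; the helper `make_lexeme` is hypothetical, and the full transcriptions "boriUts" and "borilts" are illustrative strings assembled from the endings quoted above, not validated SI-PRON entries.

```python
import xml.etree.ElementTree as ET


def make_lexeme(grapheme, pronunciations, preferred=0):
    """Build a <lexeme> with one <phoneme> child per pronunciation.

    The pronunciation at index `preferred` is marked prefer="true".
    """
    lexeme = ET.Element("lexeme")
    ET.SubElement(lexeme, "grapheme").text = grapheme
    for index, pron in enumerate(pronunciations):
        phoneme = ET.SubElement(lexeme, "phoneme")
        if index == preferred:
            phoneme.set("prefer", "true")
        phoneme.text = pron
    return lexeme


# Illustrative transcriptions only (assumption, not SI-PRON data).
entry = make_lexeme("borilec", ["boriUts", "borilts"])
entry_xml = ET.tostring(entry, encoding="unicode")
```

A single preferred marker cannot express two equally preferred pronunciations, which is the gap the proposed extension on the following slide addresses.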
Extensions…
proposed extension for PLS/SSML:
– a new optional attribute for the <phoneme> element:
• pron-style attribute
• values: "fluent", "overarticulated"
– pron-style also for other elements (linkage SSML-lex!):
• <voice>, <speak>, <p>, <s>
• another optional attribute for the above elements: emotion for expressive TTS?
- could this be covered by the new role attribute?
- similar to <speaking_style>, proposed yesterday
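How a synthesizer might use the proposed attribute can be sketched as a simple selection step: the style requested in the SSML context picks the matching lexicon pronunciation, falling back to the first one listed. Everything here is hypothetical, since pron-style is only a proposal: the `Pron` class, the `select_pron` function and the transcription strings are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Pron:
    phones: str
    style: Optional[str] = None  # value of the proposed pron-style attribute


def select_pron(prons, requested_style=None):
    """Return the pronunciation matching the requested style.

    Falls back to the first listed pronunciation when no style is
    requested or no entry carries the requested style.
    """
    if requested_style is not None:
        for pron in prons:
            if pron.style == requested_style:
                return pron.phones
    return prons[0].phones


# Illustrative entry (assumption, not SI-PRON data).
borilec = [
    Pron("boriUts", "fluent"),
    Pron("borilts", "overarticulated"),
]
```

For example, `select_pron(borilec, "overarticulated")` picks the second variant, while an unknown style leaves the default first variant in place.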
Extensions…
PLS source/creator:
– currently only the <metadata> element
– source of multiple pronunciations:
• useful info when merging multiple PLS documents
• some sources/creators may be more reliable than others
- additional optional attribute pron-source for the <phoneme> element
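The merging scenario can be sketched as follows: each pronunciation carries a source label, duplicates keep the most reliable source, and the merged list is ordered by reliability. The function `merge_lexicons`, the source labels "SSKJ" and "corpus", the reliability scores and the transcriptions are all assumptions made for this illustration.

```python
def merge_lexicons(lexicons, reliability):
    """Merge lexicons of the form {word: [(phones, source), ...]}.

    For a pronunciation found in several lexicons, the most reliable
    source is kept. Each word's pronunciations are returned sorted from
    most to least reliable source.
    """
    merged = {}
    for lexicon in lexicons:
        for word, prons in lexicon.items():
            seen = merged.setdefault(word, {})
            for phones, source in prons:
                best = seen.get(phones)
                if best is None or reliability.get(source, 0) > reliability.get(best, 0):
                    seen[phones] = source
    return {
        word: sorted(seen.items(), key=lambda item: -reliability.get(item[1], 0))
        for word, seen in merged.items()
    }


# Illustrative data (assumption): two lexicons with overlapping entries.
lexicon_a = {"borilec": [("boriUts", "SSKJ")]}
lexicon_b = {"borilec": [("borilts", "corpus"), ("boriUts", "corpus")]}
scores = {"SSKJ": 2, "corpus": 1}

merged = merge_lexicons([lexicon_a, lexicon_b], scores)
```

Without a per-pronunciation source, the same merge would have to rely on document-level metadata and could not rank individual variants.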
Extensions…
part-of-speech tags:
– Slovenian – complex inflectional paradigm
– morphological, syntactic and semantic(?) descriptors welcome in future revisions of the PLS specification
– SSML: POS tags could be defined as an optional attribute of the <token> element
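The benefit of a POS tag on the token can be sketched with the homograph "je" mentioned earlier: with a tag, the lexicon look-up returns one pronunciation; without it, all candidates remain. The tag "auxiliary_verb" appears in the slides, while "main_verb", the `lookup` function and both transcription strings are unverified placeholders introduced for this illustration.

```python
# Hypothetical lexicon keyed by (grapheme, POS tag). The transcriptions
# are illustrative placeholders, not SI-PRON entries.
LEXICON = {
    ("je", "auxiliary_verb"): "jE",
    ("je", "main_verb"): "je:",
}


def lookup(token, pos=None, lexicon=LEXICON):
    """Return candidate pronunciations for a token.

    With a known POS tag the result is a single pronunciation; without
    one, every pronunciation listed for the grapheme is returned.
    """
    if pos is not None and (token, pos) in lexicon:
        return [lexicon[(token, pos)]]
    return [phones for (grapheme, _), phones in lexicon.items() if grapheme == token]
```

Here `lookup("je", "auxiliary_verb")` is unambiguous, whereas `lookup("je")` returns both candidates and leaves the choice to later processing.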
lemma, MSD attributes used in SI-PRON
MULTEXT-East MSDs (Erjavec, 2004) – Telri, Concede
Multext-East LRs, http://nl.ijs.si/ME/V3
EAGLES, TEI P4 compliant
MSDs
TTS-internal lexicon (for highly inflected languages)
– full-blown form (PLS or other)
– compact lexicons:
• exception lexicon
• derivational scheme/paradigm providing prefix/suffix morphological rules and indications of lexical stress position shifts (hardly an issue for PLS)
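The compact-lexicon idea can be sketched as a lemma plus a suffix paradigm, with an exception table overriding individual forms. This is a simplified illustration on orthographic forms only: the paradigm name "fem_a", the `PARADIGMS` table and the `expand` function are hypothetical, and stress position shifts are not modelled.

```python
# Hypothetical paradigm table: name -> (lemma ending to strip, endings for
# nominative, genitive, dative, accusative, locative, instrumental singular).
PARADIGMS = {
    "fem_a": ("a", ["a", "e", "i", "o", "i", "o"]),
}


def expand(lemma, paradigm, exceptions=None):
    """Generate word forms from a lemma and a suffix paradigm.

    `exceptions` maps a lemma to {form index: irregular form}; listed
    forms replace the regularly derived ones.
    """
    strip, endings = PARADIGMS[paradigm]
    if not lemma.endswith(strip):
        raise ValueError(f"{lemma!r} does not fit paradigm {paradigm!r}")
    stem = lemma[: len(lemma) - len(strip)]
    forms = [stem + ending for ending in endings]
    for index, form in (exceptions or {}).get(lemma, {}).items():
        forms[index] = form
    return forms
```

For instance, `expand("miza", "fem_a")` derives the six singular forms of "miza" (table) from a single stored lemma, which is how a compact lexicon avoids listing over a million word forms explicitly.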
Conclusion
possible extensions to PLS, SSML:
– pron-style attribute
– emotion attribute needed?
– source/creator attribute welcome
– morpho-syntactic, semantic descriptors
Project Partners
Alpineon
ZRC-SAZU • Fran Ramovš Institute of the Slovenian Language
L6-5405 project
– Research of Slovenian Language in Lexicography and Lexicology based on Digital Language Resources
– Spoken representation of Slovenian words:
• http://bos.zrc-sazu.si/sskj.html
THANK YOU FOR YOUR ATTENTION!