Considerations on using PLS for Slovenian
Pronunciation Lexicon Construction
Jerneja Žganec Gros
Alpineon d.o.o., Ljubljana, Slovenia
jerneja.gros@alpineon.com
Internationalizing W3C's Speech Synthesis Markup Language, Workshop II, Heraklion, Crete, May 2006
SI-PRON lexicon:
– word list
– lexicon format
– phonetic transcription
– morpho-syntactic descriptions
Proposed extensions to PLS, SSML
Conclusions
Language specifics
Slovenian language:
– Slavic language, 2 million speakers, over 70 dialects
– complex inflectional paradigm (common to Slavic languages)
• including "dual" – like ancient Greek!
– lexical stress position: not fixed and mobile, as in Russian
(unlike some other Slavic languages, e.g. Croatian, which never carries
the accent on the last syllable)
– many homographs; POS information usually helps with disambiguation:
• example: On je. ("He is." or "He eats."): je is either an auxiliary verb or an indicative main verb
Pron lex
Speech technology applications:
– automatic speech recognition (ASR)
– text-to-speech synthesis (TTS)
– both require a consistent specification of pronunciation
– Slovenian: lexical stress position is not fixed, so a pronunciation lexicon is crucial
Pronunciation lexicons:
– general: not supposed to be covered by PLS
– application-specific:
• word/phrase pronunciations
• application-specific proper nouns: personal and location names
Word-list
SI-PRON wordlist:
(a) 93,154 lemmas from SSKJ
(b) over 1,000,000 word forms derived from (a) by morphological derivation
(c) additional word list:
• corpus-based search
• 20,000 most frequent inflected word forms not covered by SSKJ lemmas
(d) collocations, multi-word expressions
SSKJ: Slovar slovenskega knjižnega jezika (Dictionary of Standard Slovenian)
Phonetic transcriptions
SSKJ lemmas:
– automatic derivation, based on dynamic/tonemic accent information
– manual corrections for about 2,500 lemmas (words of foreign origin)
Word forms derived from SSKJ:
– automatic: SSKJ lemma pronunciation look-up, inflectional paradigms
Additional corpus-based word list:
– automatic lexical stress assignment
– AlpSynth grapheme-to-phoneme rule set
GTP rules
193 context-dependent grapheme-to-phoneme rules:
Left context | Grapheme string | Right context | Phonetic transcr. | Example | Rule explanation
$ | er | _ | [@r] | Gaber | @ occurs before each -r not followed by a vowel (Toporisic91, p. 49)
= | m | f | [F] | Simfonija | <m> in front of <f> and <v> is pronounced as a labiodental (Pravopis90, p. 145)
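The two example rules can be sketched as a small rule engine. This is only an illustration of how context-dependent rewriting might work, not the AlpSynth implementation: the `Rule` type, the `transcribe` function, the vowel set, and the use of regular expressions for the contexts are all assumptions, and the context symbols used in the original rule table are not interpreted here.

```python
import re
from typing import NamedTuple


class Rule(NamedTuple):
    left: str       # regex that the text before the grapheme string must end with
    graphemes: str  # literal grapheme string to rewrite
    right: str      # regex that the text after the grapheme string must start with
    phones: str     # phonetic output for the grapheme string


VOWELS = "aeiou"  # assumption: plain Slovenian vowel letters

# Hypothetical encodings of the two rules shown on the slide.
RULES = [
    Rule(left="", graphemes="er", right=rf"(?![{VOWELS}])", phones="@r"),
    Rule(left="", graphemes="m", right=r"(?=[fv])", phones="F"),
]


def transcribe(word, rules=RULES):
    """Apply the first matching rule at each position; copy other letters as-is."""
    word = word.lower()
    out = []
    i = 0
    while i < len(word):
        for rule in rules:
            g = rule.graphemes
            if (word.startswith(g, i)
                    and re.search(rule.left + r"$", word[:i])
                    and re.match(rule.right, word[i + len(g):])):
                out.append(rule.phones)
                i += len(g)
                break
        else:
            out.append(word[i])  # no rule applies: keep the grapheme unchanged
            i += 1
    return "".join(out)


print(transcribe("Gaber"))      # gab@r
print(transcribe("Simfonija"))  # siFfonija
```

Letters not covered by a rule are copied through unchanged, so the output is only a partial transcription; a full rule set would cover every grapheme.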
Transcription accuracy experiment
reference: hand-crafted pron lex, 30K lexemes, no loanwords(!)
automatic lexical stress assignment: 15% error rate
lexical stress & o/e pronunciation known in advance:
– transcription success rate 99.1% (0.6% handcrafting errors)
conclusion: for semi-automatic derivation of phonetic transcriptions
with a 0.3% error rate only lexical stress positions & e/o
need to be manually validated
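One plausible reading of the 0.3% figure, stated here as an assumption rather than as the author's derivation, is that it is the disagreement with the reference lexicon minus the errors found in the hand-crafted reference itself:

```python
from decimal import Decimal

success_rate = Decimal("99.1")     # agreement with the hand-crafted reference (%)
reference_errors = Decimal("0.6")  # disagreements caused by handcrafting errors (%)

disagreement = Decimal("100") - success_rate     # total mismatch with the reference
residual_error = disagreement - reference_errors  # mismatch left for the automatic rules

print(disagreement, residual_error)  # 0.9 0.3
```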
SI-PRON format
LC-STAR lexicon specs – STTS (Shamas & v Heuvel, 2004)
Pronunciation Lexicon Specification (PLS):
– Version 1.0, W3C Last Call Working Draft 31 January 2006
• http://www.w3.org/TR/pronunciation-lexicon/
PLS:
– Ver 1.0 not designed for TTS internal lexicons
– on the other hand, we want to have a stronger link between SSML and the lexicon
– we are even thinking of introducing POS attribute into token-like elements!
– leave these issues for PLS Ver 2.x or address them now?
Pronunciation variations
multiple pronunciations:
– several <phoneme> elements
– preferred pronunciation:
• indicated by the prefer attribute of <phoneme>
• usually the 1st pronunciation from the SSKJ
• for some words, 2 pronunciations are equally preferred, e.g.:
- masculine Slovenian nouns ending in "ilec", like /borilec/, /darovalec/
- "iUts"/"ilts", "ilts"/"iUts", "ilts", or "iUts"
- these typically correspond to the more fluent "iUts" or the overarticulated "ilts" pronunciation
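A minimal sketch of a lexeme with two pronunciations, read back with Python's standard XML parser. The namespace and the prefer attribute follow PLS 1.0 as far as I understand it, including the default that the first pronunciation in document order is used when none is marked. The alphabet value and the transcription strings are illustrative placeholders built from the endings quoted above, not validated SI-PRON entries, and `pronunciations` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"

# Illustrative document; "x-sampa" and both transcriptions are placeholders.
PLS_DOC = f"""<lexicon version="1.0" xmlns="{PLS_NS}" alphabet="x-sampa" xml:lang="sl">
  <lexeme>
    <grapheme>borilec</grapheme>
    <phoneme>boriUts</phoneme>
    <phoneme>borilts</phoneme>
  </lexeme>
</lexicon>"""


def pronunciations(lexicon_xml, word):
    """Return (all pronunciations, preferred pronunciation) for a word."""
    ns = {"pls": PLS_NS}
    root = ET.fromstring(lexicon_xml)
    for lexeme in root.findall("pls:lexeme", ns):
        graphemes = [g.text for g in lexeme.findall("pls:grapheme", ns)]
        if word not in graphemes:
            continue
        phonemes = lexeme.findall("pls:phoneme", ns)
        all_prons = [p.text for p in phonemes]
        marked = [p.text for p in phonemes if p.get("prefer") == "true"]
        if marked:
            return all_prons, marked[0]
        # No explicit preference: fall back to document order.
        return all_prons, (all_prons[0] if all_prons else None)
    return [], None


print(pronunciations(PLS_DOC, "borilec"))
```

With only a boolean preference there is no way to say that both variants are equally acceptable but differ in style, which is the gap the proposed extension addresses.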
Extensions…
proposed extension for PLS/SSML:
– a new optional attribute for the <phoneme> element:
• pron-style attribute
• values: "fluent", "overarticulated"
– pron-style also for other elements (linkage SSML-lex!):
• <voice>, <speak>, <p>, <s>
• another optional attribute for the above elements: emotion for expressive TTS?
- could this be covered by the new role attribute?
- similar to <speaking_style>, proposed yesterday
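To make the proposal concrete, here is a sketch of how a synthesizer might pick a pronunciation by style. The pron-style attribute is the extension proposed on this slide and is not part of PLS 1.0; the `select_by_style` function and the transcription strings are hypothetical.

```python
import xml.etree.ElementTree as ET

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"

# Hypothetical lexeme using the proposed (non-standard) pron-style attribute.
EXTENDED_LEXEME = f"""<lexeme xmlns="{PLS_NS}">
  <grapheme>borilec</grapheme>
  <phoneme pron-style="fluent">boriUts</phoneme>
  <phoneme pron-style="overarticulated">borilts</phoneme>
</lexeme>"""


def select_by_style(lexeme_xml, style):
    """Return the pronunciation tagged with the requested style.

    Falls back to the first pronunciation when no entry carries that style.
    """
    lexeme = ET.fromstring(lexeme_xml)
    phonemes = lexeme.findall(f"{{{PLS_NS}}}phoneme")
    for phoneme in phonemes:
        if phoneme.get("pron-style") == style:
            return phoneme.text
    return phonemes[0].text if phonemes else None


print(select_by_style(EXTENDED_LEXEME, "overarticulated"))  # borilts
```

The requested style would come from the SSML side, for example from a pron-style value set on an enclosing <s> or <voice> element, which is the SSML-lexicon linkage the slide asks for.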
Extensions…
PLS: source/creator
– currently only in the <metadata> element
– source of multiple pronunciations:
• useful info when merging multiple PLS documents
• some sources/creators may be more reliable than others
- additional optional attribute pron-source for the <phoneme> element
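A sketch of why a per-pronunciation source helps when merging lexicons. The data layout (word mapped to a list of pronunciation/source pairs), the source names, and the reliability ranking are all assumptions made for illustration.

```python
# Hypothetical reliability ranking: lower number means more trusted.
SOURCE_RANK = {"sskj": 0, "g2p-rules": 1}


def rank(source):
    """Unknown sources are ranked after all known ones."""
    return SOURCE_RANK.get(source, len(SOURCE_RANK))


def merge_lexicons(*lexicons):
    """Merge lexicons, keeping each pronunciation once with its best source.

    Pronunciations of a word are ordered from most to least trusted source.
    """
    best = {}
    for lexicon in lexicons:
        for word, entries in lexicon.items():
            per_word = best.setdefault(word, {})
            for pron, source in entries:
                if pron not in per_word or rank(source) < rank(per_word[pron]):
                    per_word[pron] = source
    return {
        word: sorted(per_word.items(), key=lambda item: rank(item[1]))
        for word, per_word in best.items()
    }


rule_based = {"borilec": [("boriUts", "g2p-rules")]}
dictionary = {"borilec": [("borilts", "sskj"), ("boriUts", "sskj")]}
print(merge_lexicons(rule_based, dictionary))
```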
Extensions…
part-of-speech tags:
– Slovenian – complex inflectional paradigm
– morphological, syntactic and semantic(?) descriptors welcome in
future revisions of the PLS specification
– SSML: POS tags could be defined as an optional attribute of the
<token> element
lemma, MSD attributes used in SI-PRON
MULTEXT-East MSDs (Erjavec, 2004) – Telri, Concede
Multext-East LRs, http://nl.ijs.si/ME/V3
EAGLES, TEI P4 compliant
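The homograph case from the "On je." example can be sketched as a lexicon lookup keyed by part of speech. The tag names and the bracketed pronunciation labels are placeholders, not MULTEXT-East MSDs or real transcriptions, and `lookup` is a hypothetical helper.

```python
# Placeholder lexicon: the bracketed strings stand in for real transcriptions.
LEXICON = {
    "je": {
        "auxiliary_verb": "je[aux]",
        "main_verb": "je[eat]",
    },
    "on": {"pronoun": "on"},
}


def lookup(word, pos=None):
    """Return the pronunciation for a word, using the POS tag if needed.

    Returns None when the word is unknown or still ambiguous.
    """
    readings = LEXICON.get(word.lower(), {})
    if pos in readings:
        return readings[pos]
    if len(readings) == 1:
        return next(iter(readings.values()))
    return None


tagged = [("On", "pronoun"), ("je", "main_verb")]
print([lookup(word, pos) for word, pos in tagged])  # ['on', 'je[eat]']
```

Without a tag, `lookup("je")` returns None, which is the situation a POS attribute on SSML tokens is meant to resolve.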
MSDs
TTS-internal lexicon (for highly inflected languages)
– full-blown form (PLS or other)
– compact lexicons:
• exception lexicon
• derivational scheme/paradigm providing prefix/suffix morphological rules and indications of lexical stress position shifts (hardly an issue for PLS)
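A compact lexicon of this kind might be sketched as suffix paradigms plus an exception table. The paradigm name, the data layout, and both functions are assumptions; the example uses the singular forms of miza ("table") and does not model stress position shifts.

```python
# Hypothetical paradigm: strip the lemma ending, then attach each case suffix
# (singular nominative, genitive, dative, accusative, locative, instrumental).
PARADIGMS = {
    "fem-a": {"strip": "a", "suffixes": ["a", "e", "i", "o", "i", "o"]},
}


def expand(lemma, paradigm_name):
    """Generate inflected word forms of a lemma from a suffix paradigm."""
    paradigm = PARADIGMS[paradigm_name]
    ending = paradigm["strip"]
    if not lemma.endswith(ending):
        raise ValueError(f"{lemma!r} does not fit paradigm {paradigm_name!r}")
    stem = lemma[: len(lemma) - len(ending)]
    return [stem + suffix for suffix in paradigm["suffixes"]]


def pronounce(form, exceptions, fallback):
    """Use the exception lexicon first, then a rule-based fallback."""
    if form in exceptions:
        return exceptions[form]
    return fallback(form)


print(expand("miza", "fem-a"))
# ['miza', 'mize', 'mizi', 'mizo', 'mizi', 'mizo']
```

Only irregular forms need explicit entries; everything else is derived on demand, which keeps the stored lexicon far smaller than the full list of word forms.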
Conclusion
possible extensions to PLS, SSML:
– pron-style attribute
– emotion attribute needed?
– source/creator attribute welcome
– morpho-syntactic, semantic descriptors
Project Partners
Alpineon
ZRC-SAZU, Fran Ramovš Institute of the Slovenian Language
L6-5405 project
– Research of Slovenian Language in Lexicography and Lexicology based on Digital Language Resources
– Spoken representation of Slovenian words:
• http://bos.zrc-sazu.si/sskj.html
THANK YOU FOR YOUR ATTENTION!