![Page 1: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction](https://reader036.vdocuments.site/reader036/viewer/2022082710/56812b2b550346895d8f3824/html5/thumbnails/1.jpg)
PLS
Considerations on using PLS for Slovenian
Pronunciation Lexicon Construction
Jerneja Žganec Gros
Alpineon d.o.o., Ljubljana, Slovenia
Internationalizing W3C's Speech Synthesis Markup Language, Workshop II, Heraklion, Crete, May 2006
![Page 2: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction](https://reader036.vdocuments.site/reader036/viewer/2022082710/56812b2b550346895d8f3824/html5/thumbnails/2.jpg)
ALPINEon
SI-PRON lexicon:
– word list
– lexicon format
– phonetic transcription
– morpho-syntactic descriptions
Proposed extensions to PLS, SSML
Conclusions
![Page 3: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction](https://reader036.vdocuments.site/reader036/viewer/2022082710/56812b2b550346895d8f3824/html5/thumbnails/3.jpg)
Language specifics
Slovenian language:
– Slavic language, 2 million speakers, over 70 dialects
– complex inflectional paradigm (common to Slavic languages)
• including "dual" – like ancient Greek!
– lexical stress position: free and mobile, as in Russian (unlike some other Slavic languages, e.g. Croatian never carries the accent on the last syllable)
– many homographs; POS information usually helps with disambiguation:
• example: On je. (He is/eats). auxiliary_verb/indicative
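A minimal sketch of how POS information could select between homograph pronunciations. The tag names and the two transcriptions are illustrative placeholders, not entries taken from SI-PRON.

```python
# Toy homograph lexicon keyed by (orthography, part of speech).
# The transcriptions are illustrative placeholders, not SI-PRON entries.
HOMOGRAPHS = {
    ("je", "auxiliary_verb"): "jE",   # "is"
    ("je", "main_verb"): "je:",       # "eats"
}


def pronounce(word, pos):
    """Return the transcription for a (word, POS) pair, or None if unknown."""
    return HOMOGRAPHS.get((word.lower(), pos))
```

Without the POS tag the key is ambiguous, which is why the lookup takes both fields.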
![Page 4: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction](https://reader036.vdocuments.site/reader036/viewer/2022082710/56812b2b550346895d8f3824/html5/thumbnails/4.jpg)
Pron lex
Speech technology applications:
– automatic speech recognition (ASR)
– text-to-speech synthesis (TTS)
– require consistent specification of pronunciation
– Slovenian: lexical stress position is not fixed, so a pronunciation lexicon is crucial
Pronunciation lexicons:
– general: not supposed to be covered by PLS
– application-specific
• word/phrase pronunciations
• application-specific proper nouns: personal and location names
![Page 5: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction](https://reader036.vdocuments.site/reader036/viewer/2022082710/56812b2b550346895d8f3824/html5/thumbnails/5.jpg)
Word-list
SI-PRON wordlist:
(a) 93,154 lemmas from SSKJ
(b) over 1,000,000 word forms derived from (a) by morphological derivation
(c) additional word list:
• corpus-based search
• 20,000 most frequent inflected word forms not covered by SSKJ lemmas
(d) collocations, multi-word expressions
SSKJ: Slovar slovenskega knjižnega jezika (Dictionary of the Slovenian Standard Language)
![Page 6: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction](https://reader036.vdocuments.site/reader036/viewer/2022082710/56812b2b550346895d8f3824/html5/thumbnails/6.jpg)
Phonetic transcriptions
SSKJ lemmas:
– automatic derivation, based on dynamic/tonemic accent information
– manual corrections for about 2,500 lemmas (words of foreign origin)
Word forms derived from SSKJ:
– automatic: SSKJ lemma pronunciation look-up, inflectional paradigms
Additional corpus-based word list:
– automatic lexical stress assignment
– AlpSynth grapheme-to-phoneme rule set
![Page 7: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction](https://reader036.vdocuments.site/reader036/viewer/2022082710/56812b2b550346895d8f3824/html5/thumbnails/7.jpg)
GTP rules
193 context-dependent grapheme-to-phoneme rules:

| Left context | Grapheme string | Right context | Phonetic transcr. | Example | Rule explanation |
|---|---|---|---|---|---|
| $ | er | _ | [@r] | Gaber | @ occurs before each -r not followed by a vowel (Toporisic91, p. 49) |
| = | m | f | [F] | Simfonija | <m> in front of <f> and <v> is pronounced as a labiodental (Pravopis90, p. 145) |
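The two example rules can be sketched as a small context-dependent rewriter. The slide does not define the context symbols, so this sketch assumes `$` means "preceded by a consonant", `=` means "any context", and `_` means "end of word". Letters not matched by a rule are copied unchanged, so the output is not a full transcription. This is not the AlpSynth rule engine.

```python
from dataclasses import dataclass

VOWELS = set("aeiou")


@dataclass(frozen=True)
class Rule:
    left: str       # "$" = consonant (assumed), "=" = any context (assumed)
    graphemes: str  # grapheme string the rule rewrites
    right: str      # "_" = end of word (assumed), "=" = any, else literal text
    phones: str     # phonetic transcription emitted by the rule


def _left_ok(symbol, word, start):
    if symbol == "=":
        return True
    if symbol == "$":
        return start > 0 and word[start - 1] not in VOWELS
    return start > 0 and word[start - 1] == symbol


def _right_ok(symbol, word, end):
    if symbol == "=":
        return True
    if symbol == "_":
        return end == len(word)
    return word.startswith(symbol, end)


# Only the two rules shown on the slide, out of 193.
RULES = [Rule("$", "er", "_", "@r"), Rule("=", "m", "f", "F")]


def transcribe(word):
    """Apply the first matching rule at each position; copy other letters."""
    word = word.lower()
    out, i = [], 0
    while i < len(word):
        for rule in RULES:
            end = i + len(rule.graphemes)
            if (word.startswith(rule.graphemes, i)
                    and _left_ok(rule.left, word, i)
                    and _right_ok(rule.right, word, end)):
                out.append(rule.phones)
                i = end
                break
        else:
            out.append(word[i])  # no rule matched: copy the letter unchanged
            i += 1
    return "".join(out)
```

Under these assumptions `transcribe("Gaber")` rewrites the final `er` to `@r`, and `transcribe("Simfonija")` rewrites the `m` before `f` to `F`.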
![Page 8: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction](https://reader036.vdocuments.site/reader036/viewer/2022082710/56812b2b550346895d8f3824/html5/thumbnails/8.jpg)
Transcription accuracy experiment
reference: hand-crafted pron lex, 30K lexemes, no loanwords(!)
automatic lexical stress assignment: 15% error rate
lexical stress & e/o pronunciation known in advance:
– transcription success rate 99.1%; of the remaining 0.9%, 0.6% are handcrafting errors in the reference lexicon
conclusion: 100% − 99.1% − 0.6% = 0.3% genuine transcription errors, so for semi-automatic derivation of phonetic transcriptions with a 0.3% error rate, only the lexical stress positions and the e/o pronunciations need to be manually validated
![Page 9: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction](https://reader036.vdocuments.site/reader036/viewer/2022082710/56812b2b550346895d8f3824/html5/thumbnails/9.jpg)
SI-PRON format
LC-STAR lexicon specs – STTS (Shamas & v Heuvel, 2004)
Pronunciation Lexicon Specification (PLS)
– Version 1.0, W3C Last Call Working Draft 31 January 2006
• http://www.w3.org/TR/pronunciation-lexicon/
PLS:
– Ver 1.0 is not designed for TTS-internal lexicons
– on the other hand, we want a stronger link between SSML and the lexicon
– we are even thinking of introducing a POS attribute into token-like elements!
– leave these issues for PLS Ver 2.x or address them now?
![Page 10: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction](https://reader036.vdocuments.site/reader036/viewer/2022082710/56812b2b550346895d8f3824/html5/thumbnails/10.jpg)
Pronunciation variations
multiple pronunciations:
– several <phoneme> elements
– preferred pronunciation:
• indicated by the prefer attribute
• usually the 1st pronunciation from the SSKJ
• for some words, two pronunciations are equally preferred, e.g.:
- masculine Slovenian nouns ending in "ilec", like /borilec/, /darovalec/
- "iUts"/"ilts", "ilts"/"iUts", "ilts", or "iUts"
- the variants typically reflect a more fluent "iUts" or an overarticulated "ilts" pronunciation
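A minimal sketch of a PLS lexeme with two pronunciations, read with Python's standard `xml.etree.ElementTree`. It assumes the `prefer` attribute on `<phoneme>` and the PLS 1.0 namespace. The `x-sampa` alphabet label and the full-word transcriptions are illustrative, built only from the "iUts"/"ilts" endings above.

```python
import xml.etree.ElementTree as ET

PLS_NS = {"pls": "http://www.w3.org/2005/01/pronunciation-lexicon"}

# Illustrative document; the transcriptions are placeholders.
PLS_DOC = """<lexicon version="1.0"
    xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
    alphabet="x-sampa" xml:lang="sl">
  <lexeme>
    <grapheme>borilec</grapheme>
    <phoneme prefer="true">boriUts</phoneme>
    <phoneme>borilts</phoneme>
  </lexeme>
</lexicon>"""


def pronunciations(doc, word):
    """Return (transcription, preferred) pairs for word, preferred first."""
    root = ET.fromstring(doc)
    for lexeme in root.findall("pls:lexeme", PLS_NS):
        graphemes = [g.text for g in lexeme.findall("pls:grapheme", PLS_NS)]
        if word in graphemes:
            prons = [(p.text, p.get("prefer") == "true")
                     for p in lexeme.findall("pls:phoneme", PLS_NS)]
            # sorted() is stable, so document order is kept within each group.
            return sorted(prons, key=lambda item: not item[1])
    return []
```

A boolean `prefer` flag cannot express "both equally preferred", which is the gap the slide points at.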
![Page 11: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction](https://reader036.vdocuments.site/reader036/viewer/2022082710/56812b2b550346895d8f3824/html5/thumbnails/11.jpg)
Extensions…
proposed extension for PLS/SSML:
– a new optional attribute for the <phoneme> element:
• pron-style attribute
• values: "fluent", "overarticulated"
– pron-style also for other elements (linkage SSML-lex!):
• <voice>, <speak>, <p>, <s>
• another optional attribute for the above elements: emotion for expressive TTS?
- could this be covered by the new role attribute?
- similar to <speaking_style>, proposed yesterday
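A sketch of how a synthesizer might use the proposed attribute. `pron-style` is the extension suggested on this slide and is not part of PLS 1.0. The fallback to the first listed pronunciation is an assumption of this sketch, and the transcriptions are placeholders.

```python
import xml.etree.ElementTree as ET

NS = {"pls": "http://www.w3.org/2005/01/pronunciation-lexicon"}

# Hypothetical lexeme using the proposed pron-style attribute.
LEXEME = """<lexeme xmlns="http://www.w3.org/2005/01/pronunciation-lexicon">
  <grapheme>borilec</grapheme>
  <phoneme pron-style="fluent">boriUts</phoneme>
  <phoneme pron-style="overarticulated">borilts</phoneme>
</lexeme>"""


def select_pronunciation(lexeme_xml, style):
    """Pick the pronunciation matching style, else the first one listed."""
    lexeme = ET.fromstring(lexeme_xml)
    phonemes = lexeme.findall("pls:phoneme", NS)
    for phoneme in phonemes:
        if phoneme.get("pron-style") == style:
            return phoneme.text
    return phonemes[0].text if phonemes else None
```

The `style` argument would come from the enclosing SSML element (`<voice>`, `<speak>`, `<p>`, `<s>`) if those elements carried the same attribute.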
![Page 12: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction](https://reader036.vdocuments.site/reader036/viewer/2022082710/56812b2b550346895d8f3824/html5/thumbnails/12.jpg)
Extensions…
PLS source/creator information:
– currently only the <metadata> element
– source of multiple pronunciations:
• useful info when merging multiple PLS documents
• some sources/creators may be more reliable than others
- an additional optional attribute pron-source for the <phoneme> element
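A sketch of merging lexicons once each pronunciation carries a source label, as the proposed `pron-source` attribute would allow. The source names and their reliability ranking are hypothetical. Lexicons are modelled as plain dictionaries, not parsed PLS.

```python
# Hypothetical reliability ranking; a lower number means a more trusted source.
SOURCE_RANK = {"SSKJ": 0, "manual": 1, "g2p": 2}


def _rank(source):
    # Unknown sources are trusted least.
    return SOURCE_RANK.get(source, len(SOURCE_RANK))


def merge_lexicons(*lexicons):
    """Merge {word: [(pronunciation, source), ...]} lexicons.

    Duplicate pronunciations keep their most reliable source, and each
    word's pronunciations are ordered from most to least reliable.
    """
    merged = {}
    for lexicon in lexicons:
        for word, prons in lexicon.items():
            bucket = merged.setdefault(word, {})
            for pron, source in prons:
                if pron not in bucket or _rank(source) < _rank(bucket[pron]):
                    bucket[pron] = source
    return {word: sorted(bucket.items(), key=lambda kv: _rank(kv[1]))
            for word, bucket in merged.items()}
```

Without a per-pronunciation source, the merge step has no basis for ordering or de-duplicating competing entries.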
![Page 13: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction](https://reader036.vdocuments.site/reader036/viewer/2022082710/56812b2b550346895d8f3824/html5/thumbnails/13.jpg)
Extensions…
part-of-speech tags:
– Slovenian: complex inflectional paradigm
– morphological, syntactic and semantic(?) descriptors welcome in future revisions of the PLS specification
– SSML: POS tags could be defined as an optional attribute of the <token> element
lemma, MSD attributes used in SI-PRON
MULTEXT-East MSDs (Erjavec, 2004) – Telri, Concede
Multext-East LRs, http://nl.ijs.si/ME/V3
EAGLES,TEI P4 compliant
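MULTEXT-East MSDs are positional strings, one character per feature. The sketch below decodes the first noun positions only. The feature table is a partial reconstruction for illustration and should be checked against the MULTEXT-East specification before use.

```python
# Partial, illustrative table of noun positions (after the leading "N").
NOUN_FIELDS = [
    ("type", {"c": "common", "p": "proper"}),
    ("gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("number", {"s": "singular", "d": "dual", "p": "plural"}),
    ("case", {"n": "nominative", "g": "genitive", "d": "dative",
              "a": "accusative", "l": "locative", "i": "instrumental"}),
]


def decode_noun_msd(msd):
    """Expand a noun MSD such as "Ncmdn" into a feature dictionary."""
    if not msd.startswith("N"):
        raise ValueError("this sketch only handles noun MSDs")
    features = {"category": "noun"}
    for code, (name, values) in zip(msd[1:], NOUN_FIELDS):
        if code != "-":  # "-" marks a feature that does not apply
            features[name] = values.get(code, code)
    return features
```

For example, `decode_noun_msd("Ncmdn")` describes a common masculine noun in the dual nominative, the number category mentioned under the language specifics.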
![Page 14: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction](https://reader036.vdocuments.site/reader036/viewer/2022082710/56812b2b550346895d8f3824/html5/thumbnails/14.jpg)
MSDs
![Page 17: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction](https://reader036.vdocuments.site/reader036/viewer/2022082710/56812b2b550346895d8f3824/html5/thumbnails/17.jpg)
MSDs
TTS-internal lexicon (for highly inflected languages)
– full-blown form (PLS or other)
– compact lexicons:
• exception lexicon
• derivational scheme/paradigm providing prefix/suffix morphological rules and indications of lexical stress position shifts (hardly an issue for PLS)
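A toy sketch of the compact-lexicon idea: an exception lexicon is consulted first, and regular forms are generated from a suffix paradigm. The paradigm name, the MSD keys and the one-letter stem stripping are assumptions of this sketch. It generates orthographic forms only and does not model lexical stress shifts.

```python
# Hypothetical suffix paradigm: singular endings of a feminine a-stem noun.
PARADIGMS = {
    "fem_a": {"Ncfsn": "a", "Ncfsg": "e", "Ncfsd": "i",
              "Ncfsa": "o", "Ncfsl": "i", "Ncfsi": "o"},
}

# Compact lexicon: lemma -> paradigm name.
COMPACT_LEXICON = {"miza": "fem_a"}

# Exception lexicon: irregular forms stored explicitly.
EXCEPTIONS = {("hči", "Ncfsg"): "hčere"}


def word_form(lemma, msd):
    """Return the inflected form, preferring the exception lexicon."""
    if (lemma, msd) in EXCEPTIONS:
        return EXCEPTIONS[(lemma, msd)]
    paradigm = PARADIGMS[COMPACT_LEXICON[lemma]]
    stem = lemma[:-1]  # strip the one-letter citation-form ending
    return stem + paradigm[msd]
```

A full-blown lexicon would instead list every form explicitly, which for Slovenian means over a million entries.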
![Page 18: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction](https://reader036.vdocuments.site/reader036/viewer/2022082710/56812b2b550346895d8f3824/html5/thumbnails/18.jpg)
Conclusion
possible extensions to PLS, SSML:
– pron-style attribute
– emotion attribute needed?
– source/creator attribute welcome
– morpho-syntactic, semantic descriptors
![Page 19: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction](https://reader036.vdocuments.site/reader036/viewer/2022082710/56812b2b550346895d8f3824/html5/thumbnails/19.jpg)
Alpineon
ZRC-SAZU • Fran Ramovš Institute of the Slovenian Language
Project Partners
L6-5405 project
– Research of Slovenian Language in Lexicography and Lexicology based on Digital Language Resources
– Spoken representation of Slovenian words:
• http://bos.zrc-sazu.si/sskj.html
![Page 20: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction](https://reader036.vdocuments.site/reader036/viewer/2022082710/56812b2b550346895d8f3824/html5/thumbnails/20.jpg)
THANK YOU FOR YOUR ATTENTION!