Knowledge of Language Origin Improves Pronunciation Accuracy
Ariadna Font Llitjos
April 13, 2001
Advisor: Alan W Black
Motivation
It is impossible to have a lexicon with complete coverage, and a high proportion of unknown words are proper names:
In an experiment by [Black, Lenzo and Pagel, 1998] on the first section of the WSJ Penn Treebank (about 40,000 words), 4.6% of the words (1775) were out of vocabulary with respect to OALD, and 76.6% of those were proper names.
Motivation cont.
We need an automatic way of learning an acceptable pronunciation for OOV words, most of which are proper names.
General approach: LTS rules (CART)
Specifically, add language probability information
Data and limits
- 56,000 proper names from the CMUDICT lexicon with stress [originally from Bell Labs directory listings, ~20 years ago]
- 90% training set & 10% test set
- We only looked at the educated native American English pronunciation of proper names: e.g. for ‘Van Gogh’, we don’t want our system to say /F AE1 N G O K/ or /F AE1 N G O G/, which some people may claim is the correct way of pronouncing it, but rather the educated American pronunciation of it: /V AE1 N . G OW1/.
Baseline Technique
Decision trees to predict phones based on letters and their context (n-grams). In English, a letter maps to epsilon, a phone, or occasionally two phones:
(a) Monongahela   m ah n oa1 ng g ah hh ey1 l ax
(b) Pittsburgh    p ih1 t s b er g
(c) exchange      ih k-s ch ey1 n jh
Allowables (45 –> 101 phones) and alignments (stress & epsilon misplacements affect accuracy)
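As a rough illustration of "letters and their context", the sketch below turns a word into one feature row per letter. The feature names (p.name, n.name, ...) and the '#' / '0' padding symbols follow the CART example shown later in these slides; the function name letter_contexts and the window width of 3 are assumptions made for this sketch, not the actual training code.

```python
def letter_contexts(word, width=3):
    """Return one feature dict per letter of `word`.

    Assumption: '#' marks the word boundary and '0' fills positions
    beyond it, as the CART example in these slides suggests.
    """
    letters = list(word.lower())
    pad = ["0"] * (width - 1) + ["#"]
    padded = pad + letters + pad[::-1]
    rows = []
    for i, letter in enumerate(letters):
        centre = i + width  # index of this letter inside `padded`
        row = {"name": letter}
        for k in range(1, width + 1):
            row["p." * k + "name"] = padded[centre - k]  # k-th previous letter
            row["n." * k + "name"] = padded[centre + k]  # k-th next letter
        rows.append(row)
    return rows


rows = letter_contexts("zysk")
for row in rows:
    print(row)
```

Each row would be paired with the phone (or epsilon) that the letter is aligned to, and the decision tree is then trained to predict that phone from the row.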
Origin Class info
What does origin class mean?
- geographic?
- etymologic? [Church, 2000]
- language (our 1st approach)
- data driven (what we really want, current work)
LLM for 26 languages
- European Corpus IMC I: English, French, German, Spanish, Croatian, Czech, Danish, Dutch, Estonian, Hebrew, Italian, Malaysian, Norwegian, Portuguese, Serbian, Slovenian, Swedish, Turkish
- using the Corpusbuilder + manually: Catalan, Chinese, Japanese, Korean, Polish, Thai, Tamil and other Indian languages (except for Tamil).
Language Identifier
An implementation of a variation of the algorithm presented in Cavnar, W.B. and Trenkle, J.M. N-Gram-Based Text Categorization. In Proceedings of the 3rd Annual Symposium on Document Analysis and Information Retrieval, 1994.
Language Identifier cont.
The language identifier creates an LLM on the fly for the input word (or document). For every trigram in the input, it estimates the probability of the input belonging to each language by multiplying the input's trigram values by the relative frequencies of those trigrams in each language's LLM.
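A minimal sketch of a trigram-based language identifier follows. The exact scoring used by the real identifier is not spelled out on the slide, so this sketch assumes one reading: the score for a language is the product of the (floored) relative frequencies of the input's trigrams in that language's LLM, normalized so the scores sum to 1. The toy training words, the floor value, and all function names are hypothetical.

```python
import math
from collections import Counter


def trigrams(text):
    """Letter trigrams of `text`, with '#' marking both ends."""
    padded = "#" + text.lower() + "#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def train_llm(words):
    """Letter language model: trigram counts plus their total."""
    counts = Counter(t for w in words for t in trigrams(w))
    return counts, sum(counts.values())


def classify(text, llms, floor=1e-4):
    """Rank languages by the product of trigram relative frequencies.

    Assumption: unseen trigrams get a small floor probability, and the
    product is computed in log space to avoid underflow.
    """
    log_scores = {
        lang: sum(math.log(max(counts[t] / total, floor)) for t in trigrams(text))
        for lang, (counts, total) in llms.items()
    }
    top = max(log_scores.values())
    weights = {lang: math.exp(s - top) for lang, s in log_scores.items()}
    norm = sum(weights.values())
    return sorted(((lang, w / norm) for lang, w in weights.items()),
                  key=lambda pair: -pair[1])


# Hypothetical toy training data; the real LLMs were built from corpora.
llms = {
    "chinese-pn": train_llm(["zhang", "wang", "ying", "zhao", "huang"]),
    "english.train": train_llm(["smith", "brown", "taylor", "walker", "wright"]),
}
ranked = classify("Ying Zhang", llms)
for lang, prob in ranked:
    print(f"{lang}: {prob}")
```

With realistic LLMs, the ranked list plays the role of the classify.pl output shown on the next slide.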
LI example
./classify.pl -t "Ying Zhang"
chinese-pn: 0.730594870150084
english.train: 0.0525988955766553
german-pn: 0.0506847882275029
british-pn: 0.0378543572677309
german.train: 0.0303455616225699
tamil-pn: 0.029581372574322
french-pn: 0.0201655107720744
spanish-pn: 0.0185146818045872
catalan-pn: 0.0162318631058251
japanese-pn: 0.00851225092810786
french.train: 0.002861385664355
spanish.train: 0.00205446230618505
Indirect use of the Language Identifier
Instead of building trees explicitly for each language (data sparseness problem), we use the results from the language identification process as features within the CART build process, allowing those features to affect the tree building only when their information is relevant.
Features for our pronunciation model
We decided to add the following to the n-gram features:
- most probable language, with its probability
- 2nd most likely language, with its probability
- difference between the 2 highest probabilities
(zysk ( (best-lang slovenian.train) (higher-prob 0.18471) (2nd-best-lang czech.train)(2nd-higher-prob 0.18428) (prob-difference 0.00043)))
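The five language features can be derived from the identifier's scores roughly as follows. The feature names are taken from the "zysk" example above; the function name language_features and the rounding to five decimals are assumptions for this sketch.

```python
def language_features(scores):
    """Build the five language features from a {language: probability} dict."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (best, p1), (second, p2) = ranked[0], ranked[1]
    return {
        "best-lang": best,
        "higher-prob": p1,
        "2nd-best-lang": second,
        "2nd-higher-prob": p2,
        # Assumption: rounded to five decimals, as in the slide's example.
        "prob-difference": round(p1 - p2, 5),
    }


# The two top scores from the "zysk" example on the slide.
feats = language_features({"czech.train": 0.18428, "slovenian.train": 0.18471})
print(feats)
```

A small difference such as this one signals that the identifier is unsure, which is information the tree can use when deciding whether to split on the language features at all.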
CART example
((a
((n.n.n.name is 0) ((n.name is #) ((p.name is e) ((p.p.p.name is #) ((_epsilon_)) ((p.p.p.name is c) ((_epsilon_)) ((ax)))) ((ax))) ((n.name is y) ((p.p.p.name is #) ((ey1)) ((p.p.p.name is 0) ((ey1)) ((p.name is w) ((p.p.p.name is e) ((ey1)) ((p.p.p.name is t) ((ey)) ((p.p.p.name is n) ((2nd-best-lang is "english.train") ((ey)) ((ey1))) ((2nd-best-lang is "czech.train") ((p.p.p.name is d) ((ey1)) ((ey))) ((ey1)))))) ((p.name is d) ((2nd-best-lang is "english.train") ((p.p.p.name is l) ((ey)) ((ey1))) ((ey1))) ((p.p.p.name is c) ((ey1)) ((2nd-best-lang is "malaysian.train") ((p.p.p.name is m) ((ey1)) ((_epsilon_))) ((2nd-best-lang is "czech.train") ((_epsilon_)) ((ey)))))))))
Results
Lexicons     Letters   Words
PN-base-5    89.02%    54.08%
PN-lang-5    91.23%    61.72%
PN-base-8    90.29%    52.88%
PN-lang-8    90.63%    59.77%
CMUDICT      91.99%    57.80%
OALD         95.80%    74.56%
Rho’s example
Cepstral’s talking head ./oscars-example
User Studies
From the names that both PN-base-8 and PN-lang-8 got “wrong” (did not exactly match the CMUDICT pronunciation in the test set), I selected the ones for which the two models assigned a different pronunciation (112), and from those, I selected 20 at random to run perceived accuracy user studies.
Overall, the perceived accuracy of the PN-lang-8 model was 17 percentage points higher (PN-lang-8: 46%, PN-base-8: 29%, no preference: 25%).
… or about a 60% relative improvement
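The absolute and relative figures can be reconciled with a quick calculation on the reported preference percentages. The variable names are made up for this sketch.

```python
# Preference percentages reported in the user study.
pn_lang_8 = 46
pn_base_8 = 29
no_preference = 25

absolute_gain = pn_lang_8 - pn_base_8       # difference in percentage points
relative_gain = absolute_gain / pn_base_8   # gain relative to the baseline

print(absolute_gain, round(relative_gain * 100, 1))
```

The relative gain comes out just under 59%, which the slide rounds to 60%.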
Upper bound
UB is determined by:
- how noisy the data is
- how much language origin info can really help us in this task [hard to estimate without having reliably labeled data]
… - what about adding prior probabilities?
Priors
For each language, we could have a prior probability that tells us how likely it is to find a name in that language, independently of the name. If our model were trained from newswire data instead of directory listings, it would be relatively easy to determine such priors. E.g.:
“Yesterday in Barcelona, the mayor Joan Clos inaugurated the Forum of Cultures…”,
P(Catalan) = 0.8
P(Spanish) = 0.15
P(all other languages) ~ 0
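One way such priors could be combined with the language identifier's output is a Bayes-style reweighting: multiply each score by the prior and renormalize. This is only a sketch of the idea raised here, not something the current system does. The identifier scores for "Joan Clos" and the default prior for unlisted languages are hypothetical; only the Catalan and Spanish priors come from the slide.

```python
def apply_priors(li_scores, priors, default_prior=1e-3):
    """Reweight language-identifier scores by context priors and renormalize.

    Assumption: languages without an explicit prior get `default_prior`,
    standing in for "P(all other languages) ~ 0".
    """
    weighted = {lang: score * priors.get(lang, default_prior)
                for lang, score in li_scores.items()}
    norm = sum(weighted.values())
    return {lang: w / norm for lang, w in weighted.items()}


# Hypothetical identifier scores for "Joan Clos": nearly a three-way tie.
li_scores = {"catalan": 0.30, "spanish": 0.35, "english": 0.35}
# Priors from the Barcelona newswire example on the slide.
priors = {"catalan": 0.8, "spanish": 0.15}

posterior = apply_priors(li_scores, priors)
print(max(posterior, key=posterior.get), posterior)
```

In this toy case the prior breaks the tie in favor of Catalan even though the identifier alone slightly preferred the other two languages.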
What I’m working on now
Unsupervised clustering of proper names taking the pronunciation into account.
Traditionally, people working on grapheme to phoneme conversion have only looked at the written words, not at the actual pronunciation.
Second approach
- Convert a word into a set of features of the form: l1 l2 l3 ph2, i.e. a letter in context (trigram) and the phone it is aligned to
- Bottom-up unsupervised clustering
Criterion: merge two clusters unless there is a clash
Defining clash
Two clusters will be merged if their contexts (trigrams) are different or if, for every context they have in common, the letter is aligned to the same phone in both clusters.
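The clash test and a greedy bottom-up merge can be sketched as below. Each cluster is represented as a dict from a letter trigram (l1, l2, l3) to the phone aligned with the middle letter. The greedy merge order, the function names, and the toy alignments are assumptions for illustration; this is work in progress and the actual procedure may differ.

```python
def clash(a, b):
    """True if some shared trigram context maps to different phones."""
    return any(ctx in b and b[ctx] != phone for ctx, phone in a.items())


def cluster(items):
    """Greedy bottom-up clustering: merge any two clusters that do not clash.

    Assumption: clusters are merged in the order pairs are found; the
    result can depend on that order.
    """
    clusters = [dict(feats) for feats in items]  # start with one cluster per item
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if not clash(clusters[i], clusters[j]):
                    clusters[i].update(clusters.pop(j))
                    merged = True
                    break
            if merged:
                break
    return clusters


# Hypothetical alignments: the same context "v a n" with two different vowels.
a = {("v", "a", "n"): "ae1"}
b = {("v", "a", "n"): "aa1"}
c = {("g", "o", "g"): "ow1"}

result = cluster([a, b, c])
print(len(result), result)
```

Here a and b clash on their shared context, so they stay apart, while c has no context in common with a and is merged into it.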
Example
References
- Black, A., Lenzo, K. and Pagel, V. Issues in Building General Letter to Sound Rules. 3rd ESCA Speech Synthesis Workshop, pp. 77-80, Jenolan Caves, Australia, 1998.
- CMUDICT. Carnegie Mellon Pronunciation Dictionary. 1998. http://www.speech.cs.cmu.edu/cgi-bin/cmudict
- Church, K. Stress Assignment in Letter to Sound Rules for Speech Synthesis (Technical Memorandum). AT&T Labs - Research. November 27, 2000.
- Chotimongkol, A. and Black, A. Statistically trained orthographic to sound models for Thai. Beijing, October 2000.
- Tomokiyo, T. Applying Maximum Entropy to English Grapheme-to-Phoneme Conversion. LTI, CMU. Project for 11-744, unpublished. May 9, 2000.
- Ghani, R., Jones, R. and Mladenic, D. Building Minority Language Corpora by Learning to Generate Web Search Queries. Technical Report CMU-CALD-01-100, 2001. http://www.cs.cmu.edu/~TextLearning/corpusbuilder/
Question & Ideas
…
… Thanks