
Knowledge of Language Origin Improves Pronunciation Accuracy

Ariadna Font Llitjos

April 13, 2001

Advisor: Alan W Black

Motivation

It is impossible to have a lexicon with complete coverage, and a high proportion of unknown words are proper names:

In an experiment by [Black, Lenzo and Pagel, 1998] on the first section of the WSJ Penn Treebank (about 40,000 words), 4.6% of the words (1775) were out-of-vocabulary (OOV) with respect to OALD, and 76.6% of those were proper names.

Motivation cont.

We need an automatic way of learning an acceptable pronunciation for OOV words, most of which are proper names.

General approach: LTS rules (CART)

Specifically, add language probability information

Data and limits

- 56,000 proper names with stress from the CMUDICT lexicon [originally from Bell Labs directory listings, ~20 years ago]
- 90% training set & 10% test set
- We only looked at the educated native American English pronunciation of proper names: e.g. for ‘Van Gogh’, we don’t want our system to say /F AE1 N G O K/ or /F AE1 N G O G/, which some people may claim is the correct way of pronouncing it, but rather the educated American pronunciation of it: /V AE1 N . G OW1/.

Baseline Technique

Decision trees to predict phones based on letters and their context (n-grams). In English, letters map to epsilon, a phone or occasionally two phones:

(a) Monongahela   m ah n oa1 ng g ah hh ey1 l ax
(b) Pittsburgh    p ih1 t s b er g
(c) exchange      ih k-s ch ey1 n jh

Allowables (45 –> 101 phones) and alignments (stress & epsilon misplacements affect accuracy)
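The letter-in-context features that the decision trees split on can be sketched as follows. This is a minimal illustration, not the system's actual feature extractor: the function name `letter_contexts` is hypothetical, and the conventions that '#' marks the word boundary and '0' marks positions beyond it are assumptions inferred from the feature names (p.name, n.name, ...) that appear in the CART example later in the talk.

```python
def letter_contexts(word, width=3, pad="#"):
    """Return one feature dict per letter: the letter itself plus `width`
    letters of context on each side, named p.name / n.name as in the tree."""
    padded = pad + word.lower() + pad  # assumed: '#' marks the word boundary
    rows = []
    for i in range(1, len(padded) - 1):
        row = {"name": padded[i]}
        for k in range(1, width + 1):
            left, right = i - k, i + k
            # assumed: '0' stands for positions beyond the boundary marker
            row["p." * k + "name"] = padded[left] if left >= 0 else "0"
            row["n." * k + "name"] = padded[right] if right < len(padded) else "0"
        rows.append(row)
    return rows


rows = letter_contexts("zysk")
for row in rows:
    print(row)
```

Each row would then be paired with the phone (or epsilon) that the letter is aligned to, giving one training example per letter.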

Origin Class info

What does origin class mean?

- geographic?
- etymologic? [Church, 2000]
- language (our 1st approach)
- data driven (what we really want, current work)

LLM for 26 languages

- European Corpus IMC I: English, French, German, Spanish, Croatian, Czech, Danish, Dutch, Estonian, Hebrew, Italian, Malaysian, Norwegian, Portuguese, Serbian, Slovenian, Swedish, Turkish

- using the Corpusbuilder + manually: Catalan, Chinese, Japanese, Korean, Polish, Thai, Tamil and other Indian languages (except for Tamil).

Language Identifier

An implementation of a variation of the algorithm presented in Cavnar, W.B. and Trenkle, J.M., N-Gram-Based Text Categorization, in Proceedings of the 3rd Annual Symposium on Document Analysis and Information Retrieval, 1994.

Language Identifier cont.

The language identifier creates an LLM on the fly for the input word (or document). For every trigram in the input, it calculates the probability of that trigram belonging to each language by multiplying the trigram's value in the input LLM by its relative frequency in that language's LLM.
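One possible reading of this scoring step is sketched below. It is not the classify.pl implementation: the function names, the '#' padding, the sum over trigrams, the normalisation so that scores add up to 1, and the tiny training lists are all assumptions made for illustration.

```python
from collections import Counter


def trigrams(text):
    """Letter trigrams of the input, padded with '#' (assumed boundary mark)."""
    padded = f"#{text.lower()}#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def build_llm(texts):
    """Relative trigram frequencies over a list of strings."""
    counts = Counter(t for text in texts for t in trigrams(text))
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}


def classify(text, llms):
    """Multiply each input trigram's relative frequency by its relative
    frequency in every language, sum per language, and normalise."""
    query = build_llm([text])  # model built on the fly for the input
    raw = {
        lang: sum(freq * llm.get(tri, 0.0) for tri, freq in query.items())
        for lang, llm in llms.items()
    }
    z = sum(raw.values())
    if z == 0:  # no trigram seen in any language: fall back to uniform
        return {lang: 1.0 / len(llms) for lang in llms}
    ranked = sorted(raw.items(), key=lambda kv: kv[1], reverse=True)
    return {lang: score / z for lang, score in ranked}


# Toy, made-up training lists (hypothetical; not the corpora used in the talk).
llms = {
    "chinese-pn": build_llm(["zhang", "wang", "ying", "zheng", "yang"]),
    "german-pn": build_llm(["schmidt", "schneider", "fischer", "weber"]),
}
scores = classify("Ying Zhang", llms)
for lang, score in scores.items():
    print(f"{lang}: {score}")
```

With real corpora every language would receive some probability mass; the toy lists here are far too small for the numbers to mean anything.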

LI example

./classify.pl -t "Ying Zhang"
chinese-pn: 0.730594870150084
english.train: 0.0525988955766553
german-pn: 0.0506847882275029
british-pn: 0.0378543572677309
german.train: 0.0303455616225699
tamil-pn: 0.029581372574322
french-pn: 0.0201655107720744
spanish-pn: 0.0185146818045872
catalan-pn: 0.0162318631058251
japanese-pn: 0.00851225092810786
french.train: 0.002861385664355
spanish.train: 0.00205446230618505

Indirect use of the Language Identifier

Instead of building trees explicitly for each language (data sparseness problem), we use the results from the language identification process as features within the CART build process, allowing those features to affect the tree building only when their information is relevant.

Features for our pronunciation model

We decided to add the following to the n-gram features:

- most probable language, with its probability
- 2nd most likely language, with its probability
- difference between the 2 highest probabilities

(zysk ( (best-lang slovenian.train) (higher-prob 0.18471) (2nd-best-lang czech.train)(2nd-higher-prob 0.18428) (prob-difference 0.00043)))
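Deriving these five features from the language identifier's output could look like the sketch below. The function name `language_features` and the rounding to five decimals are assumptions; the feature names and the two scores for "zysk" are taken from the example above.

```python
def language_features(scores):
    """Turn language-identifier scores into the five extra CART features."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (best, p1), (second, p2) = ranked[0], ranked[1]  # needs >= 2 languages
    return {
        "best-lang": best,
        "higher-prob": p1,
        "2nd-best-lang": second,
        "2nd-higher-prob": p2,
        "prob-difference": round(p1 - p2, 5),  # rounding is an assumption
    }


# The two top scores shown for "zysk" on the slide.
feats = language_features({"czech.train": 0.18428, "slovenian.train": 0.18471})
print(feats)
```

These values are attached to every letter of the name, so the tree builder can split on them wherever they help.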

CART example

((a
((n.n.n.name is 0) ((n.name is #) ((p.name is e) ((p.p.p.name is #) ((_epsilon_)) ((p.p.p.name is c) ((_epsilon_)) ((ax)))) ((ax))) ((n.name is y) ((p.p.p.name is #) ((ey1)) ((p.p.p.name is 0) ((ey1)) ((p.name is w) ((p.p.p.name is e) ((ey1)) ((p.p.p.name is t) ((ey)) ((p.p.p.name is n) ((2nd-best-lang is "english.train") ((ey)) ((ey1))) ((2nd-best-lang is "czech.train") ((p.p.p.name is d) ((ey1)) ((ey))) ((ey1)))))) ((p.name is d) ((2nd-best-lang is "english.train") ((p.p.p.name is l) ((ey)) ((ey1))) ((ey1))) ((p.p.p.name is c) ((ey1)) ((2nd-best-lang is "malaysian.train") ((p.p.p.name is m) ((ey1)) ((_epsilon_))) ((2nd-best-lang is "czech.train") ((_epsilon_)) ((ey)))))))))

Results

Lexicon      Letters correct   Words correct
PN-base-5    89.02%            54.08%
PN-lang-5    91.23%            61.72%
PN-base-8    90.29%            52.88%
PN-lang-8    90.63%            59.77%
CMUDICT      91.99%            57.80%
OALD         95.80%            74.56%

Rho’s example

Cepstral’s talking head ./oscars-example

User Studies

From the names that both PN-base-8 and PN-lang-8 got “wrong” (did not exactly match the CMUDICT pronunciation in the test set), I selected the ones for which the two models assigned a different pronunciation (112), and from those, I selected 20 at random to run perceived accuracy user studies.

Overall, the perceived accuracy of the PN-lang-8 model was 17 percentage points higher (PN-lang-8: 46%, PN-base-8: 29%, no preference: 25%), or roughly a 60% relative improvement.
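The absolute and relative figures follow directly from the two preference rates; a small check, using only the percentages reported above (variable names are illustrative):

```python
lang_pref = 0.46  # listeners preferring PN-lang-8
base_pref = 0.29  # listeners preferring PN-base-8

absolute_gain = lang_pref - base_pref      # difference in percentage points
relative_gain = absolute_gain / base_pref  # gain relative to the baseline

print(f"{absolute_gain:.0%} absolute, {relative_gain:.0%} relative")
# prints "17% absolute, 59% relative"
```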

Upper bound

UB is determined by:

- how noisy the data is
- how much language origin info can really help us in this task [hard to estimate without having reliably labeled data]

…

- what about adding prior probabilities?

Priors

For each language, we could have a prior probability that would tell us how likely it is to find a name in that language, independently of the name. If our model were trained on newswire data instead of directory listings, it would be relatively easy to determine such priors. E.g.:

“Yesterday in Barcelona, the mayor Joan Clos inaugurated the Forum of Cultures…”,

P(Catalan) = 0.8
P(Spanish) = 0.15
P(all other languages) ~ 0
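One way such priors might be combined with the language identifier's scores is to weight each score by the prior and renormalise. This is a sketch of the idea only, not something the system does: the function name `apply_priors` and the identifier scores are made up, while the two priors come from the example above.

```python
def apply_priors(li_scores, priors, default_prior=0.0):
    """Weight each language-identifier score by a context prior, renormalise."""
    weighted = {
        lang: score * priors.get(lang, default_prior)
        for lang, score in li_scores.items()
    }
    z = sum(weighted.values())
    if z == 0:  # the priors rule out every candidate: keep the raw scores
        return dict(li_scores)
    return {lang: w / z for lang, w in weighted.items()}


# Made-up identifier scores for a name seen in the Barcelona sentence.
li_scores = {"spanish": 0.40, "catalan": 0.35, "english": 0.25}
# Priors from the example; all other languages are treated as ~0.
priors = {"catalan": 0.8, "spanish": 0.15}

posterior = apply_priors(li_scores, priors)
print(posterior)
```

In this toy case the prior overturns the identifier's first choice, which is the effect the slide is hoping for when the name alone is ambiguous.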

What I’m working on now

Unsupervised clustering of proper names taking the pronunciation into account.

Traditionally, people working on grapheme-to-phoneme conversion have only looked at the written words, not at the actual pronunciation.

Second approach

- Convert a word into a set of features of the form: l1 l2 l3 ph2, i.e. a letter in context (trigram) and the phone it is aligned to

- Bottom-up unsupervised clustering. Criterion: merge two clusters unless there is a clash

Defining clash

Two clusters will be merged if their contexts (trigrams) are all different or if every context they share is aligned to the same phone in both clusters. Otherwise there is a clash.
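The merge criterion can be sketched as below. This is an illustration under stated assumptions, not the work in progress itself: each word is represented as a dict from letter trigram to the phone aligned with the middle letter, the greedy merge order is arbitrary, and the toy contexts and phone symbols are invented.

```python
def clash(a, b):
    """True if some shared letter trigram is aligned to different phones."""
    return any(ctx in b and b[ctx] != phone for ctx, phone in a.items())


def cluster(words):
    """Greedy bottom-up clustering: start with one cluster per word and
    merge any pair that does not clash, until no further merge is possible."""
    clusters = [dict(features) for features in words]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if not clash(clusters[i], clusters[j]):
                    clusters[i].update(clusters.pop(j))
                    merged = True
                    break
            if merged:
                break
    return clusters


# Invented toy features: (l1, l2, l3) -> phone aligned to l2.
words = [
    {("#", "j", "o"): "jh"},  # word-initial 'j' aligned to jh
    {("#", "j", "o"): "hh"},  # same context, different phone: clashes
    {("#", "w", "a"): "w"},   # no shared context: free to merge
]
result = cluster(words)
print(len(result), result)
```

Because the merging is greedy, the final clusters depend on the order in which words are visited; a real implementation would probably need a scoring criterion to choose among the possible merges.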

Example

References

- Black, A., Lenzo, K. and Pagel, V. Issues in Building General Letter to Sound Rules. 3rd ESCA Speech Synthesis Workshop, pp. 77-80, Jenolan Caves, Australia, 1998.

- CMUDICT. Carnegie Mellon Pronunciation Dictionary. 1998. http://www.speech.cs.cmu.edu/cgi-bin/cmudict

- Church, K. Stress Assignment in Letter to Sound Rules for Speech Synthesis (Technical Memorandum). AT&T Labs-Research. November 27, 2000.

- Chotimongkol, A. and Black, A. Statistically trained orthographic to sound models for Thai. Beijing, October 2000.

- Tomokiyo, T. Applying Maximum Entropy to English Grapheme-to-Phoneme Conversion. LTI, CMU. Project for 11-744, unpublished. May 9, 2000.

- Ghani, R., Jones, R. and Mladenic, D. Building Minority Language Corpora by Learning to Generate Web Search Queries. Technical Report CMU-CALD-01-100, 2001. http://www.cs.cmu.edu/~TextLearning/corpusbuilder/

Questions & Ideas

…

… Thanks
