Pronunciation Modeling Lecture 11 Spoken Language Processing Prof. Andrew Rosenberg

Post on 25-Feb-2016

Page 1: Pronunciation Modeling

Pronunciation Modeling

Lecture 11

Spoken Language Processing

Prof. Andrew Rosenberg

Page 2: Pronunciation Modeling

What is a pronunciation model?

[Diagram: Audio Features -> Acoustic Model -> Phone Hypotheses -> Pronunciation Model -> Word Hypotheses -> Language Model -> Word Hypotheses]

Page 3: Pronunciation Modeling

Why do we need one?

• The pronunciation model defines the mapping between sequences of phones and words.
• The acoustic model can deliver a one-best hypothesis: its “best guess”.
• From this single guess, converting to words can be done with dynamic programming alignment.
• Or the model can be viewed as a finite state automaton.
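A minimal sketch of the dynamic-programming idea, assuming the one-best phone string contains no recognition errors. The toy lexicon and the name `segment_phones` are illustrative assumptions, not part of any toolkit.

```python
def segment_phones(phones, lexicon):
    """Return one list of words whose pronunciations concatenate to `phones`,
    or None if the phone string cannot be covered exactly."""
    n = len(phones)
    # best[i] holds a word sequence that exactly covers phones[:i], if any.
    best = [None] * (n + 1)
    best[0] = []
    for i in range(n):
        if best[i] is None:
            continue  # position i is not reachable
        for word, pron in lexicon:
            j = i + len(pron)
            if j <= n and best[j] is None and phones[i:j] == pron:
                best[j] = best[i] + [word]
    return best[n]


# Hypothetical toy lexicon: (word, phone list) pairs.
LEXICON = [
    ("ACHE", ["EY", "K"]),
    ("ACHES", ["EY", "K", "S"]),
    ("ADVANTAGE", ["AH", "D", "V", "AE", "N", "T", "IH", "JH"]),
]

words = segment_phones("EY K S AH D V AE N T IH JH".split(), LEXICON)
```

A real decoder has to tolerate phone errors, so it would score inexact alignments instead of demanding exact matches.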

Page 4: Pronunciation Modeling

Simplest Pronunciation “model”

• A dictionary.
• Associate a word (lexical item, orthographic form) with a pronunciation.

ACHE       EY K
ACHES      EY K S
ADJUNCT    AE JH AH NG K T
ADJUNCTS   AE JH AN NG K T S
ADVANTAGE  AH D V AE N T IH JH
ADVANTAGE  AH D V AE N IH JH
ADVANTAGE  AH D V AE N T AH JH
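A dictionary like this can be held as a mapping from each word to a list of pronunciations. The sketch below assumes a one-entry-per-line text format (word first, then phones); `load_dictionary` is an illustrative name.

```python
from collections import defaultdict

DICT_TEXT = """\
ACHE EY K
ACHES EY K S
ADVANTAGE AH D V AE N T IH JH
ADVANTAGE AH D V AE N IH JH
ADVANTAGE AH D V AE N T AH JH
"""


def load_dictionary(text):
    """Map each word to the list of its pronunciations (each a list of phones)."""
    lexicon = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        word, phones = fields[0], fields[1:]
        lexicon[word].append(phones)
    return dict(lexicon)


lexicon = load_dictionary(DICT_TEXT)
```

Keeping a list per word lets alternative pronunciations such as the three ADVANTAGE variants sit side by side.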

Page 5: Pronunciation Modeling

Example of a pronunciation dictionary

Page 6: Pronunciation Modeling

Finite State Automata view

• Each word is an automaton over phones

[Figure: phone-level automata for ACHE (EY K), ACHES (EY K S), and ADVANTAGE (AH D V AE N T IH JH)]
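One way to picture a word as an automaton over phones is a small trie whose arcs are labeled with phones. This is a sketch under that assumption; the function names are illustrative.

```python
def build_word_fsa(pronunciations):
    """Build a deterministic automaton (a trie) over phones.
    Returns (transitions, final_states); state 0 is the start state."""
    transitions = {}  # (state, phone) -> next state
    finals = set()
    next_state = 1
    for pron in pronunciations:
        state = 0
        for phone in pron:
            key = (state, phone)
            if key not in transitions:
                transitions[key] = next_state
                next_state += 1
            state = transitions[key]
        finals.add(state)  # the end of each pronunciation is accepting
    return transitions, finals


def accepts(fsa, phones):
    """True if the phone sequence is a complete pronunciation in the automaton."""
    transitions, finals = fsa
    state = 0
    for phone in phones:
        state = transitions.get((state, phone))
        if state is None:
            return False
    return state in finals


advantage = build_word_fsa([
    "AH D V AE N T IH JH".split(),
    "AH D V AE N IH JH".split(),
    "AH D V AE N T AH JH".split(),
])
```

The three ADVANTAGE variants share the arcs for AH D V AE N and branch only where they differ.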

Page 7: Pronunciation Modeling

Size of whole word models

• These models get very big, very quickly

[Figure: the ACHE, ACHES, and ADVANTAGE phone automata joined in parallel between shared START and END states]

Page 8: Pronunciation Modeling

Potential problems

• Every word in the training material and test vocabulary must be in the dictionary
• The dictionary is generally written by hand
• Prone to errors and inconsistencies

ACHE       EY K
ACHES      EY K S
ADJUNCT    AE JH AH NG K T
ADJUNCTS   AE JH AN NG K T S
ADVANTAGE  AH D V AE N T IH JH
ADVANTAGE  AH D V AE N IH JH
ADVANTAGE  AH D V AE N T AH JH

Page 9: Pronunciation Modeling

Baseforms represented by graphs

Page 10: Pronunciation Modeling

Composition

• From the word graph, we can replace each phone by its Markov model
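As a rough illustration of the substitution, the sketch below expands a phone sequence into a chain of named HMM states. The three states per phone and the `PHONE_k` naming are assumptions made for the example, not something the slides specify.

```python
def expand_to_hmm_states(phones, states_per_phone=3):
    """Replace each phone by a left-to-right chain of HMM state names."""
    return [
        f"{phone}_{k}"
        for phone in phones
        for k in range(1, states_per_phone + 1)
    ]


states = expand_to_hmm_states(["EY", "K"])
```

Applying the same expansion to every arc of a word graph yields one large state graph for the recognizer to search.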

Page 11: Pronunciation Modeling

Automating the construction

• Do we need to write a rule for every word?
• Pluralizing?
  – Where is it +[Z]? +[IH Z]?
• Prefixes, unhappy, etc.
  – +[UH N]
  – How can you tell the difference between “unhappy”, “unintelligent” and “under” and “
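The pluralizing question can be answered by a rule on the final phone of the base form. This sketch uses the usual English pattern: +[IH Z] after sibilants, +[S] after other voiceless consonants, +[Z] elsewhere. The phone sets are assumptions based on an ARPAbet-style inventory, and irregular plurals are not covered.

```python
SIBILANTS = {"S", "Z", "SH", "ZH", "CH", "JH"}
VOICELESS = {"P", "T", "K", "F", "TH"}  # voiceless, non-sibilant consonants


def pluralize(phones):
    """Append the regular plural suffix chosen by the last phone."""
    last = phones[-1]
    if last in SIBILANTS:
        suffix = ["IH", "Z"]
    elif last in VOICELESS:
        suffix = ["S"]
    else:
        suffix = ["Z"]
    return phones + suffix


aches = pluralize(["EY", "K"])
```

Prefixes are harder, since spelling alone does not tell “unhappy” (prefix) apart from “under” (no prefix).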

Page 12: Pronunciation Modeling

Is every pronunciation equally likely?

• Different phonetic realizations can be weighted.

• The FSA view of the pronunciation model makes this easy.

ACAPULCO  AE K AX P AH L K OW
ACAPULCO  AA K AX P UH K OW
THE       TH IY
THE       TH AX
PROBABLY  P R AA B AX B L IY
PROBABLY  P R AA B L IY
PROBABLY  P R AA L IY

Page 13: Pronunciation Modeling

Is every pronunciation equally likely?

• Different phonetic realizations can be weighted.

• The FSA view of the pronunciation model makes this easy.

ACAPULCO  AE K AX P AH L K OW   0.75
ACAPULCO  AA K AX P UH K OW     0.25
THE       TH IY                 0.15
THE       TH AX                 0.85
PROBABLY  P R AA B AX B L IY    0.5
PROBABLY  P R AA B L IY         0.4
PROBABLY  P R AA L IY           0.1
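A sketch of how weighted entries might be stored and used. The weights are the ones on the slide; turning them into negative log probabilities is a common convention for weighted automata and is assumed here, as are the function names.

```python
import math

WEIGHTED = {
    "THE": [(("TH", "IY"), 0.15), (("TH", "AX"), 0.85)],
    "PROBABLY": [
        (("P", "R", "AA", "B", "AX", "B", "L", "IY"), 0.5),
        (("P", "R", "AA", "B", "L", "IY"), 0.4),
        (("P", "R", "AA", "L", "IY"), 0.1),
    ],
}


def pronunciation_cost(word, phones):
    """Negative log probability of `phones` for `word`; infinity if unlisted."""
    for pron, prob in WEIGHTED[word]:
        if pron == tuple(phones):
            return -math.log(prob)
    return math.inf


def most_likely(word):
    """Return the highest-weighted pronunciation of `word`."""
    return max(WEIGHTED[word], key=lambda entry: entry[1])[0]
```

Lower cost means a more likely pronunciation, so costs along a path through the automaton can simply be added.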

Page 14: Pronunciation Modeling

Collecting pronunciations

• Collect a lot of data
• Ask a phonetician to phonetically transcribe the data.
• Count how many times each production is observed.
• This is very expensive: it is time consuming and requires finding linguists.
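The counting step amounts to relative-frequency estimation. A minimal sketch, with made-up observations standing in for real transcribed data; `estimate_weights` is an illustrative name.

```python
from collections import Counter, defaultdict


def estimate_weights(observations):
    """observations: iterable of (word, phone tuple) pairs.
    Returns word -> {pronunciation: relative frequency}."""
    counts = defaultdict(Counter)
    for word, pron in observations:
        counts[word][pron] += 1
    return {
        word: {pron: c / sum(prons.values()) for pron, c in prons.items()}
        for word, prons in counts.items()
    }


# Hypothetical transcribed tokens, invented for illustration.
observed = [("THE", ("TH", "AX"))] * 3 + [("THE", ("TH", "IY"))]
weights = estimate_weights(observed)
```

The same counting works whether the productions come from a phonetician or from the pronunciations a recognizer picks during forced alignment.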

Page 15: Pronunciation Modeling

Collecting pronunciations

• Start with equal likelihoods of all pronunciations
• Run the recognizer on transcribed speech
  – forced alignment
• See how many times the recognizer uses each pronunciation.
• Much cheaper, but less reliable

Page 16: Pronunciation Modeling

Out of Vocabulary Words

• A major problem for dictionary-based pronunciation is out-of-vocabulary terms.
• If you’ve never seen a name, or new word, how do you know how to pronounce it?
  – Person names
  – Organization and company names
  – New words: “truthiness”, “hypermiling”, “woot”, “app”
  – Medical, scientific and technical terms

Page 17: Pronunciation Modeling

Collecting Pronunciations from the web

• Newspapers, blog posts etc. often use new names and unknown terms.
• For example:
  – Flickeur (pronounced like Voyeur) randomly retrieves images from Flickr.com and creates an infinite film with a style that can vary between stream-of-consciousness, documentary or video clip.
  – Our group traveled to Peterborough (pronounced like “Pita-borough”)...
• The web can be mined for pronunciations [Riley, Jansche, Ramabhadran 2009]

Page 18: Pronunciation Modeling

Grapheme to Phoneme Conversion

• Given a new word, how do you pronounce it?
• Grapheme is a language-independent term for things like “letters”, “characters”, “kanji”, etc.
• With a grapheme-to-phoneme converter, dictionaries can be augmented with any word.
• Some languages are more ambiguous than others.

Page 19: Pronunciation Modeling

Grapheme to Phoneme conversion

• Goal: Learn an alignment between graphemes (letters) and phonemes (sounds)
• Find the lowest cost alignment.
• Weight rules, and learn contextual variants.

T  E   X  -  T
T  EH  K  S  T

T  E  X  T  -  -   -  -  -
-  -  -  -  T  EH  K  S  T
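The lowest-cost alignment can be found with the same dynamic program as edit distance. The sketch below assumes hand-picked costs (0 for a plausible letter/phoneme pair, 2 for an implausible one, 1 for a gap); in practice these weights would be learned, and the `PLAUSIBLE` table is invented for this one word.

```python
# Hypothetical letter/phoneme pairs considered a good match.
PLAUSIBLE = {("T", "T"), ("E", "EH"), ("X", "K"), ("X", "S")}
GAP = 1


def sub_cost(letter, phone):
    return 0 if (letter, phone) in PLAUSIBLE else 2


def align(letters, phones):
    """Return (cost, pairs) for a lowest-cost alignment; '-' marks a gap."""
    n, m = len(letters), len(phones)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * GAP
    for j in range(1, m + 1):
        cost[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + sub_cost(letters[i - 1], phones[j - 1]),
                cost[i - 1][j] + GAP,   # letter aligned to nothing
                cost[i][j - 1] + GAP,   # phoneme aligned to nothing
            )
    # Trace back from the end to recover one optimal alignment.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                cost[i][j] == cost[i - 1][j - 1] + sub_cost(letters[i - 1], phones[j - 1])):
            pairs.append((letters[i - 1], phones[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + GAP:
            pairs.append((letters[i - 1], "-"))
            i -= 1
        else:
            pairs.append(("-", phones[j - 1]))
            j -= 1
    pairs.reverse()
    return cost[n][m], pairs


total, pairs = align(list("TEXT"), "T EH K S T".split())
```

With these costs the good alignment on the slide costs 1 (one gap), while the fully shifted alignment would cost 9 (nine gaps).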

Page 20: Pronunciation Modeling

Grapheme to Phoneme Difficulties

• How to deal with abbreviations?
  – US CENSUS
  – NASA, scuba vs. AT&T, ASR
  – LOL
  – IEEE
• What about misspellings?
  – Should “teh” have an entry in the dictionary?
  – If we’re collecting new terms from the web, or other unreliable sources, how do we know what is a new word?

Page 21: Pronunciation Modeling

Application of Grapheme to Phoneme Conversion

• This pronunciation model is used much more often in speech synthesis than in speech recognition.
• In speech recognition we’re trying to do phoneme-to-grapheme conversion.
  – This is a very tricky problem.
  – “ghoti” -> F IH SH
  – “ghoti” -> silence

Page 22: Pronunciation Modeling

Approaches to Grapheme to Phoneme conversion

• “Instance Based Learning”
  – Lookup based on a sliding window of 3 letters
  – Helps with sounds like “ch” and “sh”
• Hidden Markov Model
  – Observations are phones
  – States are letters
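The instance-based lookup can be sketched as a table from 3-letter windows to the phoneme most often seen for the middle letter. The aligned training pairs below are invented for illustration, with “-” marking a letter that produces no phoneme of its own.

```python
from collections import Counter, defaultdict


def windows(word):
    """3-letter windows centered on each letter, padded with '#'."""
    padded = "#" + word + "#"
    return [padded[i:i + 3] for i in range(len(word))]


def train(aligned):
    """aligned: list of (word, one phoneme or '-' per letter)."""
    table = defaultdict(Counter)
    for word, phones in aligned:
        for window, phone in zip(windows(word), phones):
            table[window][phone] += 1
    return table


def predict(table, word):
    """Look up each window; '?' marks a window never seen in training."""
    phones = []
    for window in windows(word):
        if window not in table:
            phones.append("?")
            continue
        phone = table[window].most_common(1)[0][0]
        if phone != "-":
            phones.append(phone)
    return phones


# Hypothetical aligned examples.
table = train([
    ("chip", ["CH", "-", "IH", "P"]),
    ("ship", ["SH", "-", "IH", "P"]),
    ("sip", ["S", "IH", "P"]),
    ("hip", ["HH", "IH", "P"]),
])
```

Because the window for “s” includes the following “h”, the lookup can tell “ship” from “sip”, which is how it helps with “ch” and “sh”.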

Page 23: Pronunciation Modeling

Machine Learning for Grapheme to Phoneme Conversion

• Input:
  – A letter, and surrounding context, e.g. 2 previous and 2 following letters
• Output:
  – Phoneme
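A sketch of the input side: one feature dictionary per letter position. The `#` padding symbol and the feature names `L1`, `L2`, `R1`, `R2` (nearest neighbor first) are assumptions chosen to match the context notation used in the decision tree slides.

```python
def context_features(word, i, width=2):
    """Features for the letter at index i: the letter plus `width`
    neighbors on each side, padded with '#' at the word edges."""
    padded = "#" * width + word.lower() + "#" * width
    c = i + width  # position of the target letter in the padded string
    feats = {"letter": padded[c]}
    for k in range(1, width + 1):
        feats[f"L{k}"] = padded[c - k]
        feats[f"R{k}"] = padded[c + k]
    return feats


feats = context_features("apple", 1)
```

Each such dictionary, paired with the phoneme aligned to that letter, is one training example for a classifier.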

Page 24: Pronunciation Modeling

Decision Trees

• Decision trees are intuitive classifiers
  – Classifier: supervised machine learning, generating categorical predictions

[Figure: a single decision node testing “Feature > threshold?” with two leaves, Class A and Class B]

Page 25: Pronunciation Modeling

Decision Trees Example

Page 26: Pronunciation Modeling

Decision Tree Training

• How does the letter “p” sound?
• Training data
  – P: loophole, peanuts, pay, apple
  – F: physics, telephone, graph, photo
  – ø: apple, psycho, pterodactyl, pneumonia
• Pronunciation depends on context

Page 27: Pronunciation Modeling

Decision Trees example

• Context: L1, L2, p, R1, R2

R1 = “h”?
  Yes: P loophole, F physics, F telephone, F graph, F photo
  No:  P peanut, P pay, P apple, ø apple, ø psycho, ø pterodactyl, ø pneumonia

Page 28: Pronunciation Modeling

Decision Trees example

• Context: L1, L2, p, R1, R2

R1 = “h”?
  Yes: P loophole, F physics, F telephone, F graph, F photo
    L1 = “o”?
      Yes: P loophole
      No:  F physics, F telephone, F graph, F photo
  No:  P peanut, P pay, P apple, ø apple, ø psycho, ø pterodactyl, ø pneumonia
    R1 = consonant?
      No:  P peanut, P pay
      Yes: P apple, ø psycho, ø pterodactyl, ø pneumonia

Page 29: Pronunciation Modeling

Decision Trees example

• Context: L1, L2, p, R1, R2

R1 = “h”?
  Yes: P loophole, F physics, F telephone, F graph, F photo
    L1 = “o”?
      Yes: P loophole
      No:  F physics, F telephone, F graph, F photo
  No:  P peanut, P pay, P apple, ø apple, ø psycho, ø pterodactyl, ø pneumonia
    R1 = consonant?
      No:  P peanut, P pay
      Yes: P apple, ø psycho, ø pterodactyl, ø pneumonia

Try “PARIS”

Page 30: Pronunciation Modeling

Decision Trees example

• Context: L1, L2, p, R1, R2

R1 = “h”?
  Yes: P loophole, F physics, F telephone, F graph, F photo
    L1 = “o”?
      Yes: P loophole
      No:  F physics, F telephone, F graph, F photo
  No:  P peanut, P pay, P apple, ø apple, ø psycho, ø pterodactyl, ø pneumonia
    R1 = consonant?
      No:  P peanut, P pay
      Yes: P apple, ø psycho, ø pterodactyl, ø pneumonia

Now try “GOPHER”
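Reading the finished tree as nested if-statements gives a quick way to try the two words the slides ask about. This is a sketch with two assumptions: each leaf predicts its majority label, and “consonant” means any letter other than a, e, i, o, u. `predict_p` is an illustrative name.

```python
VOWELS = set("aeiou")


def predict_p(word, i):
    """Predict the sound of the letter 'p' at index i of `word`
    by walking the example tree: returns 'P', 'F', or 'ø' (silent)."""
    word = word.lower()
    left1 = word[i - 1] if i > 0 else "#"               # L1
    right1 = word[i + 1] if i + 1 < len(word) else "#"  # R1
    if right1 == "h":
        # Leaf {P loophole} versus leaf {F physics, telephone, graph, photo}.
        return "P" if left1 == "o" else "F"
    if right1.isalpha() and right1 not in VOWELS:
        # Leaf {P apple, ø psycho, ø pterodactyl, ø pneumonia}: majority is ø.
        return "ø"
    # Leaf {P peanut, P pay}.
    return "P"


paris = predict_p("PARIS", 0)
gopher = predict_p("GOPHER", 2)
```

For “PARIS” the tree answers P. For “GOPHER” it also answers P, because L1 is “o”, although the word is pronounced with F: the “loophole” leaf generalizes badly.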

Page 31: Pronunciation Modeling

Training a Decision Tree

• At each node, decide what the most useful split is.
  – Consider all features
  – Select the one that improves the performance the most
• There are a few ways to calculate improved performance
  – Information Gain is typically used.
  – Accuracy is less common.
• Can require many evaluations
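Information gain is the drop in label entropy produced by a split. The sketch below computes it for the twelve “p” examples from the training slide, each reduced to an (R1 letter, label) pair; that encoding and the function names are assumptions made for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(examples, test):
    """Entropy before the split minus the size-weighted entropy after it."""
    labels = [label for _, label in examples]
    yes = [label for feat, label in examples if test(feat)]
    no = [label for feat, label in examples if not test(feat)]
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (yes, no) if part)
    return entropy(labels) - remainder


# (R1, label): loophole, physics, telephone, graph, photo,
# peanut, pay, apple (1st p), apple (2nd p), psycho, pterodactyl, pneumonia
EXAMPLES = [("h", "P"), ("h", "F"), ("h", "F"), ("h", "F"), ("h", "F"),
            ("e", "P"), ("a", "P"), ("p", "P"),
            ("l", "ø"), ("s", "ø"), ("t", "ø"), ("n", "ø")]

gain_h = information_gain(EXAMPLES, lambda r1: r1 == "h")
gain_consonant = information_gain(EXAMPLES, lambda r1: r1 not in "aeiou")
```

On this data the R1 = “h” test has the higher gain of the two, which is consistent with it being the root split in the example tree. Every candidate test needs a pass over the node's examples, which is why training can require many evaluations.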

Page 32: Pronunciation Modeling

Pronunciation Models in TTS and ASR

• In ASR, we have phone hypotheses from the acoustic model, and need word hypotheses.

• In TTS, we have the desired word, but need a corresponding phone sequence to synthesize.

Page 33: Pronunciation Modeling

Next Class

• Language Modeling
• Reading: J&M Chapter 4