![Page 1: Learning from Mistakes: Expanding Pronunciation Lexicons ... › ~sravana › slides › lfm_interspeech_s… · SPEECH RECOGNITION Mary and the fable Black Box Latent Phonetic Similarity](https://reader036.vdocuments.site/reader036/viewer/2022070821/5f200196242b20550f00e11a/html5/thumbnails/1.jpg)
Learning from Mistakes: Expanding Pronunciation Lexicons
using Word Recognition Errors
Sravana Reddy
The University of Chicago
Joint Work with Evandro Gouvêa
![Page 2: Learning from Mistakes: Expanding Pronunciation Lexicons ... › ~sravana › slides › lfm_interspeech_s… · SPEECH RECOGNITION Mary and the fable Black Box Latent Phonetic Similarity](https://reader036.vdocuments.site/reader036/viewer/2022070821/5f200196242b20550f00e11a/html5/thumbnails/2.jpg)
Sang Bissenette
SPEECH RECOGNITION
Sane visitor
![Page 3: Learning from Mistakes: Expanding Pronunciation Lexicons ... › ~sravana › slides › lfm_interspeech_s… · SPEECH RECOGNITION Mary and the fable Black Box Latent Phonetic Similarity](https://reader036.vdocuments.site/reader036/viewer/2022070821/5f200196242b20550f00e11a/html5/thumbnails/3.jpg)
Mariano DiFabio
SPEECH RECOGNITION
Mary and the fable
![Page 4: Learning from Mistakes: Expanding Pronunciation Lexicons ... › ~sravana › slides › lfm_interspeech_s… · SPEECH RECOGNITION Mary and the fable Black Box Latent Phonetic Similarity](https://reader036.vdocuments.site/reader036/viewer/2022070821/5f200196242b20550f00e11a/html5/thumbnails/4.jpg)
SPEECH RECOGNITION
Mary and the fable
Black Box Latent Phonetic Similarity Channel
Out of Vocabulary (OOV) Words
Known Words
This Work
Pronunciations of OOV words (Mariano and DiFabio)
Mariano DiFabio
![Page 5: Learning from Mistakes: Expanding Pronunciation Lexicons ... › ~sravana › slides › lfm_interspeech_s… · SPEECH RECOGNITION Mary and the fable Black Box Latent Phonetic Similarity](https://reader036.vdocuments.site/reader036/viewer/2022070821/5f200196242b20550f00e11a/html5/thumbnails/5.jpg)
Previous Work
SPEECH RECOGNITION
Mary and the fable
Mariano DiFabio
[Figure: phoneme graph with labels `<s>` M N L AH EH AE EY AH R L N D IY IH EY AH AA AE AH AH]
Pronunciations of OOV words (Mariano and DiFabio)
![Page 6: Learning from Mistakes: Expanding Pronunciation Lexicons ... › ~sravana › slides › lfm_interspeech_s… · SPEECH RECOGNITION Mary and the fable Black Box Latent Phonetic Similarity](https://reader036.vdocuments.site/reader036/viewer/2022070821/5f200196242b20550f00e11a/html5/thumbnails/6.jpg)
Previous Work
Wooters and Stolcke (ICASSP 1994)
Sloboda and Waibel (ICSLP 1996)
Fosler-Lussier (Ph.D. Thesis 1999)
Maison (Eurospeech 2003)
Tan and Besacier (Interspeech 2008)
Bansal et al. (ICASSP 2009)
Badr et al. (Interspeech 2010)
etc.
![Page 7: Learning from Mistakes: Expanding Pronunciation Lexicons ... › ~sravana › slides › lfm_interspeech_s… · SPEECH RECOGNITION Mary and the fable Black Box Latent Phonetic Similarity](https://reader036.vdocuments.site/reader036/viewer/2022070821/5f200196242b20550f00e11a/html5/thumbnails/7.jpg)
Why assume black-box access?
Practical: What if ASR engine is a black box? (proprietary speech recognition tools, etc.)
Example possible use of our approach: Third-party app analyzes results of black-box recognition engine, returns OOV pronunciations
Scientific: How much pronunciation information can we get from only word recognition errors?
![Page 8: Learning from Mistakes: Expanding Pronunciation Lexicons ... › ~sravana › slides › lfm_interspeech_s… · SPEECH RECOGNITION Mary and the fable Black Box Latent Phonetic Similarity](https://reader036.vdocuments.site/reader036/viewer/2022070821/5f200196242b20550f00e11a/html5/thumbnails/8.jpg)
Our Generative Model… for input word w and output recognition hypothesis e
1. Generate word w with Pr(w) (example: DiFabio)
2. Generate pronunciation baseform b with Pr(b|w) (example: D IY F AA B IH OW)
3. Generate phoneme sequence p with Pr(p|b, w) by passing through phonetic confusion channel (example: DH AH F EY B AH L)
4. Generate hypothesis word or phrase e with Pr(e|p, b, w) (example: the fable)
Black Box ASR

$$\Pr(w, e) = \sum_{b,p} \Pr(w)\,\Pr(b \mid w)\,\Pr(p \mid b, w)\,\Pr(e \mid p, b, w)$$
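
The sum over b and p in the joint probability can be sketched with toy lookup tables. This is a minimal illustration only: every probability value and the table names (`p_word`, `p_base`, `p_channel`, `p_hyp`) are invented for the example and do not come from the talk.

```python
# Toy tables; every number below is invented for illustration.
p_word = {"DiFabio": 1.0}                          # Pr(w)
p_base = {                                         # Pr(b | w)
    "DiFabio": {"D IY F AA B IH OW": 0.6, "D IY F EY B IH OW": 0.4},
}
p_channel = {                                      # Pr(p | b, w), keyed by b only
    "D IY F AA B IH OW": {"DH AH F EY B AH L": 0.2, "D IH F ER B AH T": 0.1},
    "D IY F EY B IH OW": {"DH AH F EY B AH L": 0.3},
}
p_hyp = {                                          # Pr(e | p, b, w), keyed by p only
    "DH AH F EY B AH L": {"the fable": 1.0},
    "D IH F ER B AH T": {"differ but": 1.0},
}


def joint(w, e):
    """Pr(w, e): sum over all baseforms b and phoneme sequences p."""
    total = 0.0
    for b, prob_b in p_base.get(w, {}).items():
        for p, prob_p in p_channel.get(b, {}).items():
            prob_e = p_hyp.get(p, {}).get(e, 0.0)
            total += p_word[w] * prob_b * prob_p * prob_e
    return total


print(joint("DiFabio", "the fable"))
print(joint("DiFabio", "differ but"))
```

With these toy numbers, "the fable" is reachable from both candidate baseforms, so both terms contribute to the sum.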
![Page 9: Learning from Mistakes: Expanding Pronunciation Lexicons ... › ~sravana › slides › lfm_interspeech_s… · SPEECH RECOGNITION Mary and the fable Black Box Latent Phonetic Similarity](https://reader036.vdocuments.site/reader036/viewer/2022070821/5f200196242b20550f00e11a/html5/thumbnails/9.jpg)
Our Generative Model… for input word w and output recognition hypothesis e
1. Generate word w with Pr(w)
2. Generate pronunciation baseform b with Pr(b|w)
3. Generate phoneme sequence p with Pr(p|b, w) by passing through phonetic confusion channel
4. Generate hypothesis word or phrase e with Pr(e|p, b, w)
5. Repeat steps 2-4 to generate more e
DiFabio
D IY F AA B IH OW
DH AH F EY B AH L
the fable differ but
D IH F ER B AH T
D IY F EY B IH OW
![Page 10: Learning from Mistakes: Expanding Pronunciation Lexicons ... › ~sravana › slides › lfm_interspeech_s… · SPEECH RECOGNITION Mary and the fable Black Box Latent Phonetic Similarity](https://reader036.vdocuments.site/reader036/viewer/2022070821/5f200196242b20550f00e11a/html5/thumbnails/10.jpg)
Learning Algorithm
GOAL: find best pronunciation for input word w

$$\arg\max_{b} \Pr(b \mid w)$$

Given
Current guess about Pr(baseform b|w)
Pr(transformed phonemes p|b, w) (Phonetic Confusions) -- will explain later
Pr(word recognition output e|p, b, w) = Pr(e|p) (Current Lexicon)
![Page 11: Learning from Mistakes: Expanding Pronunciation Lexicons ... › ~sravana › slides › lfm_interspeech_s… · SPEECH RECOGNITION Mary and the fable Black Box Latent Phonetic Similarity](https://reader036.vdocuments.site/reader036/viewer/2022070821/5f200196242b20550f00e11a/html5/thumbnails/11.jpg)
Learning Algorithm
Compute posterior probability of baseform b given w and e

$$\Pr(b \mid e, w) = \frac{\Pr(b \mid w)\,\Pr(p \mid b, w)\,\Pr(e \mid p, b, w)}{\sum_{c} \Pr(c \mid w)\,\Pr(p \mid c, w)\,\Pr(e \mid p, c, w)}$$

Pr(b|w): Guess; Pr(p|b, w): Phonetic Confusions; Pr(e|p, b, w): Current Lexicon

Sum over all e in n-best word recognition lists over all utterances of w

$$\Pr(b \mid w) = \sum_{e \in E_w} \Pr(b \mid e, w)\,\Pr(e)$$

Pr(b|e, w): From Above; Pr(e): Uniform

Expectation Maximization: Iterate
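
One way to sketch the two update equations in code is shown below. It is an illustration under stated assumptions, not the authors' implementation: `toy_channel` is a hypothetical, unnormalized stand-in for Pr(p|b), each hypothesis e is represented directly by its phoneme string p (so Pr(e|p) is the same for every candidate and cancels in the posterior), and hypotheses that no candidate can explain are skipped.

```python
def toy_channel(p, b):
    """Hypothetical stand-in for Pr(p | b): an unnormalized score that keeps
    a phoneme with weight 0.8 and substitutes it with weight 0.2.
    Sequences of different length get score 0."""
    pp, bb = p.split(), b.split()
    if len(pp) != len(bb):
        return 0.0
    score = 1.0
    for x, y in zip(pp, bb):
        score *= 0.8 if x == y else 0.2
    return score


def em_update(prior, hyps, channel=toy_channel):
    """One iteration for a single word w.

    prior: dict mapping baseform b -> current Pr(b | w)
    hyps:  phoneme strings p, one per hypothesis e in E_w
    """
    new = {b: 0.0 for b in prior}
    used = 0
    for p in hyps:
        scores = {b: prior[b] * channel(p, b) for b in prior}
        z = sum(scores.values())
        if z == 0.0:
            continue  # assumption: skip hypotheses no candidate explains
        used += 1
        for b in prior:
            new[b] += scores[b] / z  # posterior Pr(b | e, w)
    if used == 0:
        return dict(prior)
    # Uniform Pr(e) over the hypotheses that were used.
    return {b: total / used for b, total in new.items()}


def run_em(prior, hyps, iterations=10, channel=toy_channel):
    for _ in range(iterations):
        prior = em_update(prior, hyps, channel)
    return prior


prior = {"D IY F AA B IH OW": 0.5, "D IY F EY B IH OW": 0.5}
hyps = ["DH AH F EY B AH L", "D IH F ER B AH T"]
learned = run_em(prior, hyps)
print(max(learned, key=learned.get))
```

The final `max` corresponds to the goal of choosing the baseform with the highest Pr(b|w) once the iterations stop.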
![Page 12: Learning from Mistakes: Expanding Pronunciation Lexicons ... › ~sravana › slides › lfm_interspeech_s… · SPEECH RECOGNITION Mary and the fable Black Box Latent Phonetic Similarity](https://reader036.vdocuments.site/reader036/viewer/2022070821/5f200196242b20550f00e11a/html5/thumbnails/12.jpg)
Initial Guess for Pr(b|w)
Limit to reasonable candidates
Initialize

$$\Pr(b \mid w) = \begin{cases} \dfrac{1}{|B_w|} & \text{if } b \in B_w \\ 0 & \text{otherwise} \end{cases}$$

Joint-sequence g2p algorithm (Sequitur*)
Existing Lexicon
Broad coverage: order 2 multigrams (low accuracy, high recall)
Bw = {all sequences b with > 0.00001 probability}
\* Bisani and Ney (2008)
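
The initialization can be sketched as follows, assuming the g2p candidate scores for a word are already available as a plain dictionary. The function name `init_uniform` and the example scores are hypothetical; no Sequitur API is called here.

```python
def init_uniform(candidates, threshold=0.00001):
    """Build the initial Pr(b | w).

    candidates: dict mapping baseform b -> probability assigned by the g2p model
    Returns a uniform distribution over B_w, the candidates whose probability
    exceeds the threshold. Baseforms outside B_w are simply absent (probability 0).
    """
    b_w = [b for b, prob in candidates.items() if prob > threshold]
    if not b_w:
        return {}
    return {b: 1.0 / len(b_w) for b in b_w}


# Invented scores for illustration only.
scores = {
    "D IY F AA B IH OW": 0.4,
    "D IY F EY B IH OW": 0.001,
    "D AY F AE B AY OW": 0.0000001,
}
print(init_uniform(scores))
```

Note that the g2p probabilities are used only to select candidates; the starting distribution itself ignores how the g2p model ranked them.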
![Page 13: Learning from Mistakes: Expanding Pronunciation Lexicons ... › ~sravana › slides › lfm_interspeech_s… · SPEECH RECOGNITION Mary and the fable Black Box Latent Phonetic Similarity](https://reader036.vdocuments.site/reader036/viewer/2022070821/5f200196242b20550f00e11a/html5/thumbnails/13.jpg)
Modeling Phonetic Confusions
TIMIT (train)
Phone Recognition
Phoneme Hypotheses Phoneme References
Phoneme Confusion Finite-State Transducer
Pr(p|b,w) = Pr(p|b) = sum of paths with input b & output p
(p conditionally independent of w given b)
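
The "sum of paths" computation can be sketched with a dynamic program over a one-state confusion transducer with substitution, deletion, and insertion arcs. The arc weights below are made up; in the talk they would be derived from TIMIT phone recognition output, which is not shown here. The function name `path_sum` and the argument layout are assumptions of this sketch.

```python
def path_sum(b, p, sub, dele, ins):
    """Sum of the weights of all paths with input b and output p.

    b, p: lists of phonemes
    sub[(x, y)]: weight of reading x and writing y
    dele[x]:     weight of reading x and writing nothing
    ins[y]:      weight of writing y without reading anything
    """
    n, m = len(b), len(p)
    # alpha[i][j] = total weight of paths that consume b[:i] and emit p[:j]
    alpha = [[0.0] * (m + 1) for _ in range(n + 1)]
    alpha[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                alpha[i][j] += alpha[i - 1][j - 1] * sub.get((b[i - 1], p[j - 1]), 0.0)
            if i > 0:
                alpha[i][j] += alpha[i - 1][j] * dele.get(b[i - 1], 0.0)
            if j > 0:
                alpha[i][j] += alpha[i][j - 1] * ins.get(p[j - 1], 0.0)
    return alpha[n][m]


# Invented weights for illustration only.
sub = {("D", "D"): 0.7, ("D", "DH"): 0.3, ("IY", "IY"): 0.9}
dele = {"D": 0.1}
ins = {"DH": 0.05}
print(path_sum(["D"], ["DH"], sub, dele, ins))
```

For the single-phoneme example, three paths contribute: one substitution, deletion followed by insertion, and insertion followed by deletion.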
![Page 14: Learning from Mistakes: Expanding Pronunciation Lexicons ... › ~sravana › slides › lfm_interspeech_s… · SPEECH RECOGNITION Mary and the fable Black Box Latent Phonetic Similarity](https://reader036.vdocuments.site/reader036/viewer/2022070821/5f200196242b20550f00e11a/html5/thumbnails/14.jpg)
Data
CSLU Names Corpus
Only use single-word names (isolated-word experiments)
20423 utterances, 7771 unique names
Train (learn OOV pronunciations): Random 50% of utterances for each name
Test (evaluate new lexicon): Remaining utterances
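
A per-name random split along these lines might look like the sketch below. The input layout (a list of `(name, utterance_id)` pairs), the fixed seed, and the choice to round down the training share for names with an odd number of utterances are all assumptions; the slides do not specify them.

```python
import random
from collections import defaultdict


def split_by_name(utterances, seed=0):
    """Split (name, utterance_id) pairs into train and test sets,
    sending a random half of each name's utterances to train."""
    by_name = defaultdict(list)
    for name, utt in utterances:
        by_name[name].append(utt)
    rng = random.Random(seed)
    train, test = [], []
    for name in sorted(by_name):
        utts = list(by_name[name])
        rng.shuffle(utts)
        half = len(utts) // 2  # assumption: odd counts round down for train
        train += [(name, u) for u in utts[:half]]
        test += [(name, u) for u in utts[half:]]
    return train, test


# Hypothetical utterance ids for illustration.
data = [("mariano", i) for i in range(4)] + [("difabio", i) for i in range(3)]
train, test = split_by_name(data)
print(len(train), len(test))
```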
![Page 15: Learning from Mistakes: Expanding Pronunciation Lexicons ... › ~sravana › slides › lfm_interspeech_s… · SPEECH RECOGNITION Mary and the fable Black Box Latent Phonetic Similarity](https://reader036.vdocuments.site/reader036/viewer/2022070821/5f200196242b20550f00e11a/html5/thumbnails/15.jpg)
Setup
Sphinx 3
MFCCs extracted using Sphinx’s default parameters
Acoustic Models trained on TIMIT
Original Lexicon: CMU Dictionary, CSLU names removed
Language Model: unigrams over names, add-one smoothing to include all CMU Dictionary words
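
A unigram model with add-one smoothing over a combined vocabulary can be sketched as below. The function name, the input layout, and the toy word lists are assumptions for illustration; this is not the Sphinx language model format.

```python
from collections import Counter


def unigram_add_one(name_tokens, vocabulary):
    """Unigram probabilities with add-one smoothing.

    name_tokens: observed name tokens (with repeats)
    vocabulary:  extra words that must receive nonzero probability,
                 for example all dictionary words
    """
    counts = Counter(name_tokens)
    vocab = set(vocabulary) | set(counts)
    total = sum(counts.values()) + len(vocab)
    return {word: (counts[word] + 1) / total for word in vocab}


# Toy inputs for illustration only.
lm = unigram_add_one(["mary", "mary", "marian"], ["mary", "the", "fable"])
print(lm["mary"], lm["the"])
```

Smoothing is what lets words that never occur as names, such as "the" and "fable" here, still appear in recognition hypotheses.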
![Page 16: Learning from Mistakes: Expanding Pronunciation Lexicons ... › ~sravana › slides › lfm_interspeech_s… · SPEECH RECOGNITION Mary and the fable Black Box Latent Phonetic Similarity](https://reader036.vdocuments.site/reader036/viewer/2022070821/5f200196242b20550f00e11a/html5/thumbnails/16.jpg)
Evaluation
Word Error Rate of ASR recognition with learned lexicon
Baseform Error Rate: proportion of learned baseforms different from corpus transcriptions
Phoneme Error Rate: proportion of insertions, deletions, and substitutions of learned baseforms against corpus transcriptions
Baselines:
1. State of the art g2p: Sequitur, multigrams of order 6 (SEQUITUR)
2. CMU Dictionary pronunciations for names in dictionary (CMUGOLD)
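
The baseform and phoneme error rates can be sketched with a standard Levenshtein distance. Normalizing the phoneme error rate by the total number of reference phonemes is a common convention and an assumption here; the slides do not state the denominator. The example pronunciations are invented.

```python
def edit_distance(ref, hyp):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn the sequence hyp into the sequence ref."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def phoneme_error_rate(pairs):
    """pairs: list of (reference, learned) phoneme lists."""
    errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    total = sum(len(ref) for ref, _ in pairs)
    return errors / total


def baseform_error_rate(pairs):
    """Fraction of learned baseforms that differ from the reference."""
    return sum(ref != hyp for ref, hyp in pairs) / len(pairs)


# Invented reference/learned pairs for illustration.
pairs = [
    ("D IY F AA B IY OW".split(), "D IY F AA B IH OW".split()),
    ("M EH R IY".split(), "M EH R IY".split()),
]
print(phoneme_error_rate(pairs), baseform_error_rate(pairs))
```

The same `edit_distance` applied to word sequences instead of phoneme sequences gives the numerator of a word error rate.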
![Page 17: Learning from Mistakes: Expanding Pronunciation Lexicons ... › ~sravana › slides › lfm_interspeech_s… · SPEECH RECOGNITION Mary and the fable Black Box Latent Phonetic Similarity](https://reader036.vdocuments.site/reader036/viewer/2022070821/5f200196242b20550f00e11a/html5/thumbnails/17.jpg)
Results
Can we get better pronunciations than a grapheme-to-phoneme system?
[Plot; legend: Ew (set of hypotheses) = results from 10-best recognition; Ew = results from 5-best recognition; SEQUITUR]
![Page 18: Learning from Mistakes: Expanding Pronunciation Lexicons ... › ~sravana › slides › lfm_interspeech_s… · SPEECH RECOGNITION Mary and the fable Black Box Latent Phonetic Similarity](https://reader036.vdocuments.site/reader036/viewer/2022070821/5f200196242b20550f00e11a/html5/thumbnails/18.jpg)
Results
How does ASR recognition with gold standard pronunciations compare?
(Only those utterances where the names are in the CMU Dictionary)
[Plot; legend: Ew (set of hypotheses) = results from 10-best recognition; Ew = results from 5-best recognition; CMUGOLD]
![Page 19: Learning from Mistakes: Expanding Pronunciation Lexicons ... › ~sravana › slides › lfm_interspeech_s… · SPEECH RECOGNITION Mary and the fable Black Box Latent Phonetic Similarity](https://reader036.vdocuments.site/reader036/viewer/2022070821/5f200196242b20550f00e11a/html5/thumbnails/19.jpg)
Results
[Plot; legend: Ew (set of hypotheses) = results from 10-best recognition; Ew = results from 5-best recognition; SEQUITUR]
![Page 20: Learning from Mistakes: Expanding Pronunciation Lexicons ... › ~sravana › slides › lfm_interspeech_s… · SPEECH RECOGNITION Mary and the fable Black Box Latent Phonetic Similarity](https://reader036.vdocuments.site/reader036/viewer/2022070821/5f200196242b20550f00e11a/html5/thumbnails/20.jpg)
Results
[Plot; legend: Ew (set of hypotheses) = results from 10-best recognition; Ew = results from 5-best recognition; SEQUITUR]
![Page 21: Learning from Mistakes: Expanding Pronunciation Lexicons ... › ~sravana › slides › lfm_interspeech_s… · SPEECH RECOGNITION Mary and the fable Black Box Latent Phonetic Similarity](https://reader036.vdocuments.site/reader036/viewer/2022070821/5f200196242b20550f00e11a/html5/thumbnails/21.jpg)
What Works?
Rutherford
Sparse phonetic neighborhood
Rumor for
Ruder for
Luther of
Not so successful
Marilyn
Mary Mary and Merry
in
Mary-land
Marian Mari-time
Successful pronunciation recovery
Dense phonetic neighborhood
Perelman
![Page 22: Learning from Mistakes: Expanding Pronunciation Lexicons ... › ~sravana › slides › lfm_interspeech_s… · SPEECH RECOGNITION Mary and the fable Black Box Latent Phonetic Similarity](https://reader036.vdocuments.site/reader036/viewer/2022070821/5f200196242b20550f00e11a/html5/thumbnails/22.jpg)
Conclusion
Can we learn pronunciations from word recognition errors?
Yes! Learned pronunciations are better than grapheme-to-phoneme results
Preliminary work: lots more to be done
Extend EM to also learn (or augment) phonetic confusions
Learn pronunciation variants of words in lexicon
Adapt to continuous speech (not just isolated words)
Seed Pr(b|w) independent of Sequitur or other g2p
Combine phone lattice information and word recognition output as cues for pronunciation
![Page 23: Learning from Mistakes: Expanding Pronunciation Lexicons ... › ~sravana › slides › lfm_interspeech_s… · SPEECH RECOGNITION Mary and the fable Black Box Latent Phonetic Similarity](https://reader036.vdocuments.site/reader036/viewer/2022070821/5f200196242b20550f00e11a/html5/thumbnails/23.jpg)
Dank Yu!