Landmark-Based Speech Recognition
The Marriage of High-Dimensional Machine Learning Techniques with Modern Linguistic Representations

Mark Hasegawa-Johnson

Research performed in collaboration with James Baker (Carnegie Mellon), Sarah Borys (Illinois), Ken Chen (Illinois), Emily Coogan (Illinois), Steven Greenberg (Berkeley), Amit Juneja (Maryland), Katrin Kirchhoff (Washington), Karen Livescu (MIT), Srividya Mohan (Johns Hopkins), Jen Muller (Dept. of Defense), Kemal Sonmez (SRI), and Tianyu Wang (Georgia Tech)
Goal of this Talk

1. Experiments with human subjects (since 1910 at Bell Labs, since 1950 at Harvard) give us detailed knowledge of human speech perception.
   • Human speech perception is multi-resolution, like progressive JPEG: syllables and prosody → distinctive features → words

2. Automatic speech recognition (ASR) works best if all parameters in the system can be simultaneously learned to adjust a global optimality criterion.
   • In 1967, it became possible to globally optimize all parameters of a very simple recognition model called the hidden Markov model.
   • Multi-resolution speech models could not be globally optimized.
   • Therefore, from 1985 to 1999, standard ASR ignored results from speech psychology.

3. In the 1990s, new results in machine learning made it possible to globally optimize a multi-resolution model of speech psychology, and to use the resulting model as an automatic speech recognizer.
   • We do not yet know how best to "marry" speech psychology with new machine learning technology.
   • Goal of this talk: to test globally optimized computational models of speech psychology as automatic speech recognizers.
Talk Outline

History and Overview

1. Acoustics → Landmarks
   a. Psychological Results: Landmark-Based Speech Perception
   b. Psychological Results: Perceptual Space ≠ Acoustic Space
   c. Computational Model: Landmark Detection and Classification
   d. Algorithm: Support Vector Machines

2. Landmarks → Words
   a. The Pronunciation Modeling Problem
   b. Psychological Model #1: An Underspecified Distinctive Feature Lexicon
      • Computational Model: Discriminative Selection of Landmarks
   c. Psychological Model #2: Articulatory Phonology
      • Computational Model: Dynamic Bayesian Network (DBN)

3. Technological Evaluation
   a. Landmark Detection and Classification
   b. Forced Alignment using the DBN
   c. Rescoring of word lattice output from an HMM-based recognizer
   d. Error Analysis and Future Plans
History

• Human Speech Recognition Models
  – 1955, Miller and Nicely: Distinctive Features
  – 1955, Delattre, Liberman, and Cooper: Landmarks
  – 1975, Goldsmith: Underspecified Lexicon
  – 1992, Stevens: Landmark-Based Speech Perception Model
  – 1990, Browman and Goldstein: Articulatory Phonology

• Automatic Speech Recognition
  – 1999, Niyogi and Ramesh: Support Vector Machines for Landmark Detection
  – 2003, Livescu and Glass: Dynamic Bayesian Network implementation of Articulatory Phonology
  – 2004, Hasegawa-Johnson et al., WS04 Summer Workshop at the Johns Hopkins Center for Language and Speech Processing
    • Underspecified Lexicon with Discriminative Landmark Selection
    • Hybrid SVM-DBN implementation of Articulatory Phonology
Landmark-Based Speech Recognition

(Figure: a waveform annotated with syllable structure: ONSET, NUCLEUS, CODA, NUCLEUS, CODA, ONSET. Pronunciation variants considered include "… backed up …", "… backtup …", "… back up …", "… backt ihp …", "… wackt ihp …"; the lattice hypothesis, with scores, words, and times, is "… backed up …".)
Acoustics → Landmarks: Results and Models from Psychology and Linguistics
Spectral Dynamics
(Delattre, Liberman, and Cooper, 1955)

• To recognize a stop consonant, one spectrum is not enough.
• Recognition depends on the pattern of spectral change over a 50 ms period following the release "landmark."
Landmarks are Redundant
(Many authors, including Stevens, 1999)

To recognize a stop consonant, it is necessary and sufficient to hear any one of these:
• Release into vowel
• Closure from vowel
• "Ejective" burst
… three "acoustic landmarks" with very different spectral patterns.

(Figure: the word "backed," illustrating the three landmark types.)
Recognition Depends on Rhythm
(Warren, Healy, and Chalikia, 1996)
Heard as one voice saying “aa iy uw ow ae iy”
Heard as two voices: one says “hi uw,” one says “iowa”
Nonlinear Map from Acoustic Features to Perceptual Features
(Kuhl et al., 1992)
In the Perceptual Space, Distinctive Feature Errors are Independent
(Miller and Nicely, 1955)

• Experimental Method:
  – Subjects listen to nonsense syllables mixed with noise (white noise or band-pass filtered, BPF)
  – Subjects write the consonant they hear

• Results:

  p(q* | q, SNR, BPF) ≈ ∏_i p(f_i* | f_i, SNR, BPF)

  q* = consonant label heard by the listener
  q = true consonant label
  F* = [f_1*, …, f_6*] = perceived distinctive feature labels
  F = [f_1, …, f_6] = true distinctive feature labels
      [±nasal, ±voiced, ±fricated, ±strident, ±lips, ±blade]
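The independence result lends itself to a tiny worked example. The sketch below (Python, with made-up per-feature confusion probabilities, not Miller and Nicely's actual data) computes p(q*|q) as the product of per-feature confusion probabilities:

```python
# A minimal sketch of the Miller & Nicely (1955) independence model:
# consonant confusion probabilities factor into a product of independent
# per-feature confusion probabilities. All numbers are illustrative.
import numpy as np

# p(f_i* | f_i) at some fixed SNR. Rows: true value (0/1); columns: perceived.
feature_confusion = {
    "nasal":    np.array([[0.95, 0.05], [0.10, 0.90]]),
    "voiced":   np.array([[0.85, 0.15], [0.20, 0.80]]),
    "fricated": np.array([[0.80, 0.20], [0.25, 0.75]]),
    "strident": np.array([[0.90, 0.10], [0.15, 0.85]]),
}

def p_confusion(true_feats, heard_feats):
    """p(q*|q) ~= product over features of p(f_i* | f_i)."""
    p = 1.0
    for name, f_true in true_feats.items():
        p *= feature_confusion[name][f_true, heard_feats[name]]
    return p

# /p/ and /b/ differ only in voicing, so the predicted confusion is high.
p_feats = {"nasal": 0, "voiced": 0, "fricated": 0, "strident": 0}
b_feats = {"nasal": 0, "voiced": 1, "fricated": 0, "strident": 0}
print(p_confusion(p_feats, b_feats))  # 0.95 * 0.15 * 0.80 * 0.90
```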
Consonant Confusions at −6 dB SNR
(rows: spoken consonant; columns: perceived consonant; cell values are response counts. Blank cells of the original matrix were lost in transcription, so each row below lists its nonzero counts in column order.)
P T K F TH S SH B D G V DH Z ZH M N
P 80 43 64 17 14 6 2 1 1 1 1 2
T 71 84 55 5 9 3 8 1 1 1
K 66 76 107 12 8 9 4 1 1
F 18 12 9 175 48 11 1 7 2 1 2 2
TH 19 17 16 104 64 32 7 5 4 5 6 4 5
S 8 5 4 23 39 107 45 4 2 3 1 1 3 2 1
SH 1 6 3 4 6 29 195 3 1
B 1 5 4 4 136 10 9 47 16 6 1 5 4
D 8 5 80 45 11 20 20 26 1
G 2 3 63 66 3 19 37 56 3
V 2 2 48 5 5 145 45 12 4
DH 6 31 6 17 86 58 21 5 6 4
Z 1 1 17 20 27 16 28 94 44 1
ZH 1 26 18 3 8 45 129 2
M 1 4 4 1 3 177 46
N 4 1 5 2 7 1 6 47 163
Distinctive Features: ±nasal, ±voiced, ±fricative, ±strident
In the Acoustic Space, Distinctive Features are Not Independent
(Volaitis and Miller, 1992)
Acoustics → Landmarks: A Computational Model
Landmark Detection and Explanation
(based on Stevens, Manuel, Shattuck-Hufnagel, and Liu, 1992)
(Figure: landmark-based understanding. The signal is annotated with syllable structure: ONSET, NUCLEUS, CODA, NUCLEUS, CODA, ONSET. The search space contains hypotheses such as "… buck up …", "… big dope …", "… backed up …", "… bagged up …", "… big doowop …"; the MAP understanding is "… backed up …".)
Landmark Detector Inputs: Acoustic, Phonetic, and Auditory Features
Total feature vector dimension: 483/frame

• MFCCs, 25 ms window (standard ASR features)
• Spectral shape: energy, spectral tilt, and spectral compactness, once per millisecond
• Noise-robust MUSIC-based formant frequencies, amplitudes, and bandwidths (Zheng & Hasegawa-Johnson, ICSLP 2004)
• Acoustic-phonetic parameters (formant-based relative spectral measures and time-domain measures; Bitar & Espy-Wilson, 1996)
• Rate-place model of neural response fields in the cat auditory cortex (Carlyon & Shamma, JASA 2003)
Cues for Place of Articulation: MFCCs + formants + rate-scale, within 150 ms of the landmark
Landmark Detection using Support Vector Machines (SVMs)

False acceptance vs. false rejection errors, TIMIT, per 10 ms frame. The SVM stop release detector achieves half the error of an HMM (Niyogi, Ramesh & Burges, 1999, 2002):
(1) Delta-energy ("Deriv"): equal error rate = 0.2%
(2) HMM: false rejection error = 0.3%
(3) Linear SVM: equal error rate = 0.15%
(4) Radial basis function SVM: equal error rate = 0.13%
What is a Support Vector Machine?

• An SVM is a hyperplane, RBF, or other kernel-based classifier, trained to minimize an upper bound on the EXPECTED TEST CORPUS ERROR:

  EXPECTED TEST CORPUS ERROR ≤ (TRAINING CORPUS ERROR) + (distance between the hyperplane and the nearest data point)^(−2)

• Classifier on the right of the slide's figure (the wider-margin one): higher TRAINING CORPUS ERROR, but lower EXPECTED TEST CORPUS ERROR
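As a rough illustration of this trade-off (not the workshop's code), the following scikit-learn sketch trains a linear SVM with a small and a large regularization constant C on synthetic data; the small-C classifier accepts more training errors in exchange for a wider margin:

```python
# A minimal sketch of the margin / training-error trade-off.
# Data and C values are illustrative.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1.0, (100, 2)), rng.normal(+1, 1.0, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_)   # geometric margin width
    train_err = 1.0 - clf.score(X, y)
    print(f"C={C}: margin={margin:.2f}, training error={train_err:.3f}")
```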
What are the SVMs trained to detect?

Simple answer: any binary distinction that will be useful to the recognizer.
Hard answer (as implemented in July 2004):
• SVMs trained to be correct in every frame:
  – Articulator-free features: speech vs. silence, vowel vs. consonant, sonorant vs. obstruent, nasal vs. non-nasal, fricative vs. non-fricative
  – Landmarks: stop release vs. any other frame, fricative release vs. any other frame, stop closure vs. any other frame, …
• SVMs trained to be correct given a specified context, and meaningless otherwise:
  – Primary articulator: lips vs. tongue blade vs. tongue body
  – Secondary articulators: voiced vs. unvoiced, nasal vs. not
Why are we studying binary distinctive features?

By focusing on binary distinctions, and using regularized learners (SVMs), we can "push the limit" of classifier complexity in order to get high binary classification accuracy.
NTIMIT, landmark vs. non-landmark (accuracy per frame):

Feature           Multiwindow MFCCs   MFCCs + Formants
−→+continuant          94.1%               94.8%
+→−continuant          94.9%               95.6%
−→+sonorant            97.2%               97.0%
+→−sonorant            96.4%               97.4%
−→+syllabic            95.3%               96.1%
+→−syllabic            90.1%               94.3%

TIMIT, place of articulation for stop consonant releases:

blade   83.3%
lips    90.5%
body    88.1%
Perceptual Space Encodes Distinctive Features: Errors Independent even if Acoustics Not

(Figure: a nonlinear transform, implicit in the SVM kernel, maps acoustic features into a space where the distinctive features separate.)

"Phonetic Features" = Nonlinear Transform followed by a One-Dimensional Cut
• The SVM extracts a discriminant dimension: argmin( error(margin) + 1/width(margin) )
• Niyogi & Burges, 2002: posterior PDF = sigmoid model in the discriminant dimension
• An equivalent model: likelihoods = Gaussian in the discriminant dimension
Soft Decisions once per 5 ms:

  p( manner feature d(t) | Y(t) )
  p( place feature d(t) | Y(t), t is a landmark )

Pipeline: a 2000-dimensional acoustic feature vector → SVM → discriminant y_i(t) → histogram (sigmoid fit) → posterior probability of the distinctive feature, p( d_i(t)=1 | y_i(t) ).
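A minimal sketch of this two-stage soft decision, using scikit-learn in place of the workshop's SVM and histogram code (the data below are synthetic stand-ins for per-frame acoustic features and landmark labels):

```python
# Stage 1: an SVM produces a per-frame discriminant y_i(t).
# Stage 2: a sigmoid fit to held-out discriminants (Platt-style scaling)
# converts the discriminant to p(d_i(t)=1 | y_i(t)).
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 40))                    # stand-in feature vectors
d = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)

svm = SVC(kernel="rbf").fit(X[:400], d[:400])
disc = svm.decision_function(X[400:]).reshape(-1, 1)   # discriminant y_i(t)

# Fit the sigmoid posterior on discriminants the SVM was not trained on.
platt = LogisticRegression().fit(disc, d[400:])
posterior = platt.predict_proba(disc)[:, 1]            # p(d=1 | y)
print(posterior[:5])
```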
Landmarks → Words: The Problem of Pronunciation Variability
The Problem of Pronunciation Variability
(Livescu, 2004)
(Figure: number of pronunciations per word vs. word frequency, for read and casual speech; casual speech shows far more variants per word.)

Observed pronunciations of "probably" (with counts):
  p r aa b iy        2
  p r ay             1
  p r aw l uh        1
  p r ah b iy        1
  p r aa lg iy       1
  p r aa b uw        1
  p ow ih            1
  p aa iy            1
  p aa b uh b l iy   1
  p aa ah iy         1
Landmarks → Words
Phonological Model #1: Underspecified Lexicon

Underspecified Lexicon (Goldsmith, 1975)
Features     s a p | n a p | s p a t | s c a t
vowel        – + – | – + – | – – + – | – – + –
sonorant     – – + – – – – –
continuant   + – – – – – – –
strident     + – – + – – + – –
lips         – + – + + – – –
blade        + – + – – + – –
voiced       – – – – – – –

(Blank cells of the original table, i.e. features left unspecified in the lexicon, were lost in transcription; only the vowel row's column alignment is recoverable.)
• Once the listener hears [+vowel], the features sonorant, continuant, strident, lips, blade, and voiced are meaningless and redundant.
• There are no [+sonorant, +strident] or [+sonorant, −voiced] phonemes. Given [+strident], the features [sonorant, strident] are meaningless and redundant.
• If /s/ is in a consonant cluster, the listener only needs to hear [+strident]; no other features are necessary, because no word could have anything but an /s/ in this position.
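To make lexical access with an underspecified lexicon concrete, here is a minimal sketch (entries and feature names are illustrative, not the talk's actual lexicon): a word candidate matches the input iff every feature the lexicon specifies agrees, and unspecified features are simply ignored:

```python
# A minimal sketch of matching against an underspecified lexicon.
# Each lexical entry lists, per segment, only the features it specifies.
LEXICON = {
    "sap": [{"vowel": "-", "strident": "+"}, {"vowel": "+"}, {"vowel": "-", "lips": "+"}],
    "nap": [{"vowel": "-", "sonorant": "+"}, {"vowel": "+"}, {"vowel": "-", "lips": "+"}],
}

def matches(entry, observed):
    """True iff every specified feature of every segment agrees."""
    if len(entry) != len(observed):
        return False
    return all(obs.get(f) == v for seg, obs in zip(entry, observed)
               for f, v in seg.items())

heard = [{"vowel": "-", "strident": "+", "voiced": "-"},
         {"vowel": "+", "sonorant": "+"},
         {"vowel": "-", "lips": "+", "voiced": "-"}]
print([w for w, e in LEXICON.items() if matches(e, heard)])  # ['sap']
```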
Computational Model: Select Landmarks to Distinguish Confusable Word Pairs

• Rationale: the baseline HMM-based system already provides high-quality hypotheses
  – 1-best error rate from N-best lists: 24.4% (RT-03 dev set)
  – Oracle error rate: 16.2%
• Method:
  1. Use an HMM-NN hybrid system to generate a first-pass word lattice
  2. Use landmark detection only where necessary, to correct errors made by the baseline recognition system

Example (utterance fsh_60386_1_0105420_0108380):
  Ref: that cannot be that hard to sneak onto an airplane
  Hyp: they can be a that hard to speak on an airplane
Identifying Confusable Hypotheses

• Use existing alignment algorithms for converting lattices into confusion networks (Mangu, Brill & Stolcke, 2000)
• Hypotheses ranked by posterior probability
• Generated from n-best lists without 4-gram or pronunciation model scores (→ higher WER compared to lattices)
• Multi-words ("I_don't_know") were split prior to generating confusion networks

(Figure: confusion network for the example above, with competing arcs such as that/they, can/can't, be/*DEL*, a, that, hard, to, sneak/speak, onto/on, an, airplane.)
Identifying Confusable Hypotheses

• How much can be gained from fixing confusions?
• Baseline error rate: 25.8%
• Oracle error rates when selecting the correct word from each confusion set:

  # hypotheses to select from   Including homophones   Not including homophones
  2                             23.9%                  23.9%
  3                             23.0%                  23.0%
  4                             22.4%                  22.5%
  5                             22.0%                  22.1%
Selecting Relevant Landmarks

• Convert each word into a fixed-length vector
• Dimensions of the vector = frequencies of occurrence, in the word, of selected binary landmark-pair relationships:
  – Manner landmarks: precedence, e.g., V ≺ Son. Cons.
  – Manner & place features: overlap, e.g., Stop ○ +blade
• Not all possible relations are used; the dimensionality of the feature space is 40-60
• The vector for each word
  – … should be derived from actual pronunciation data, e.g., from landmarks automatically detected in a very large speech corpus
  – … unfortunately, due to time constraints, that experiment hasn't been run yet
  – In the meantime, the vector for each word was derived from a standard pronunciation dictionary (Pronlex)
Vector-Space Word Representation

Word    Start<Fric  Fric<Stop  Fric<Son  Fric<Vowel  Stop<Vowel  Vowel○high  Vowel○front  Fric○strident
speak       1           1          0          0           1           1            1             1
sneak       1           0          1          0           0           1            1             1
seek        1           0          0          1           0           1            1             1
he          1           0          0          1           0           1            1             0
she         1           0          0          1           0           1            1             1
steak       1           1          0          0           1           0            1             1
…
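A minimal sketch of how such vectors can be computed, assuming the input is a word's manner-landmark sequence (the sequences below are hand-written stand-ins; the talk derives them from Pronlex). Each dimension counts occurrences of one precedence relation between adjacent landmarks:

```python
# A minimal sketch of the vector-space word representation.
from collections import Counter

# Precedence relations used as vector dimensions (a subset of the table above).
PRECEDENCE = [("Start", "Fric"), ("Fric", "Stop"), ("Fric", "Son"),
              ("Fric", "Vowel"), ("Stop", "Vowel")]

def word_vector(landmarks):
    """Count each adjacent-precedence relation in the landmark sequence."""
    counts = Counter(zip(landmarks, landmarks[1:]))
    return [counts[rel] for rel in PRECEDENCE]

print(word_vector(["Start", "Fric", "Stop", "Vowel"]))  # "speak"-like: [1, 1, 0, 0, 1]
print(word_vector(["Start", "Fric", "Son", "Vowel"]))   # "sneak"-like: [1, 0, 1, 0, 0]
```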
Maximum-Entropy Discrimination

Use a maxent classifier:

  P(y | x) = (1/Z(x)) exp( Σ_k λ_k f_k(x, y) )

• Here: y = words, x = acoustics, f = landmark relationships
• Why a maxent classifier?
  – Discriminative classifier
  – Handles a possibly large set of confusable words
  – Allows later addition of non-binary features
• Training: ideally on real landmark detection output
• Here: on entries from the lexicon (includes pronunciation variants)
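A minimal numeric sketch of this classifier for a two-word confusion set, with weights loosely echoing the sneak/speak example on the next slide (all values illustrative):

```python
# A minimal sketch of maxent discrimination over one confusion set:
# P(y|x) = exp(sum_k lambda_k f_k(x,y)) / Z(x).
import numpy as np

WORDS = ["speak", "sneak"]
# Per-word weights lambda_k for four landmark features
# [SC o +blade, FR < SC, FR < SIL, SIL < ST]; values are illustrative.
weights = {"speak": np.array([-2.47, -2.47, 2.11, 1.75]),
           "sneak": np.array([+2.47, +2.47, -2.11, -1.75])}

def posterior(landmark_scores):
    """Softmax over per-word linear scores = maxent P(y | x)."""
    scores = np.array([weights[w] @ landmark_scores for w in WORDS])
    exps = np.exp(scores - scores.max())          # numerically stable
    return dict(zip(WORDS, exps / exps.sum()))

# Hypothetical detector outputs for the four landmark features:
print(posterior(np.array([0.75, 0.56, 0.2, 0.1])))
```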
Maximum-Entropy Discrimination

• Example: sneak vs. speak
• A different model is trained for each confusion set → landmarks can have different weights in different contexts

  speak:  SC ○ +blade  −2.47   FR < SC  −2.47   FR < SIL  2.11   SIL < ST  1.75   …
  sneak:  SC ○ +blade  +2.47   FR < SC  +2.47   FR < SIL  −2.11  SIL < ST  −1.75  …
Landmark Queries

• Select the N landmarks with the highest weights
• Ask the landmark detection module to produce scores for the selected landmarks within the word boundaries given by the baseline system
• Example (confusion networks → landmark detectors):

  sneak  1.70  1.99   SC ○ +blade  ?
  sneak  1.70  1.99   SC ○ +blade  0.75  0.56
Landmarks → Words
Phonological Model #2: Articulatory Phonology
Articulatory Phonology: Lips and Tongue Have Different Variability

(Figure: tract variables LIP-OP, TT-LOC, TT-OPEN, TB-LOC, TB-OPEN, VELUM, VOICING.)

• warmth [w ao r m p th]: phone insertion?
• I don't know [ah dx uh_n ow_n]: phone deletion??
• several [s eh r v ax l]: exchange of two phones???
Articulatory Phonology
(Browman and Goldstein, 1990; slide from Livescu and Glass, 2004)

• Many pronunciation phenomena can be parsimoniously described as resulting from asynchrony and reduction of quasi-independent speech articulators.
• instruments [ih_n s ch em ih_n n s]; everybody [eh r uw ay]
Brief Review: Bayesian Networks

• Each node in the graph is a random variable
• An arrow represents a probabilistic dependency; there is one probability distribution per variable
  – Number of columns = number of different values the variable can take
  – Number of rows = number of different values the variable's parents can take
• Modularity of the graph → modularity of computation; very complicated models can be used for speech recognition at not-so-bad computational cost

Example (graph: G → H, G → L):

  G:         G = 0   G = 1
  P(G)        0.5     0.5

  H:         H = 0   H = 1
  P(H|G=0)    0.7     0.3
  P(H|G=1)    0.3     0.7

  L:         L = 0   L = 1
  P(L|G=0)    0.1     0.9
  P(L|G=1)    0.9     0.1
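For concreteness, a few lines of Python computing the marginals P(H) and P(L) from the tables above by summing over G:

```python
# A minimal sketch of inference in the two-table example above.
P_G = {0: 0.5, 1: 0.5}
P_H_given_G = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.3, 1: 0.7}}
P_L_given_G = {0: {0: 0.1, 1: 0.9}, 1: {0: 0.9, 1: 0.1}}

def marginal(cpt):
    """P(X=x) = sum_g P(g) * P(X=x | g)."""
    return {x: sum(P_G[g] * cpt[g][x] for g in P_G) for x in (0, 1)}

print("P(H):", marginal(P_H_given_G))   # {0: 0.5, 1: 0.5}
print("P(L):", marginal(P_L_given_G))   # {0: 0.5, 1: 0.5}
```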
Dynamic Bayesian Network Model
(Livescu and Glass, 2004)

• The model is implemented as a dynamic Bayesian network (DBN): a representation, via a directed graph, of a distribution over a set of variables that evolve through time
• Example DBN with three articulators shown below
(Figure: one frame of the DBN. Per-frame variables: word_t; for each articulator i, an index ind_t^i, a canonical value U_t^i, and a surface value S_t^i; asynchrony variables async_t^{1;2} and async_t^{1,2;3}, with checkSync_t^{1;2} = 1 and checkSync_t^{1,2;3} = 1 observed.)

  Pr( async_t^{1;2} = a ) = Pr( |ind_t^1 − ind_t^2| = a )

  checkSync_t^{1;2} = 1  iff  |ind_t^1 − ind_t^2| = async_t^{1;2}

(The slide also showed the conditional probability table over degrees of asynchrony, with values such as .7, .2, .1, and 0 across columns a = 0 … 3; the exact row/column alignment was lost in transcription.)

The canonical values U_t^i are given by baseform pronunciations.
(Figure: the same DBN unrolled over frames 0, 1, …, T, with per-frame copies of word, the sync/checkSync variables, and ind^i, S^i, U^i for each of the three articulators.)
A DBN Model of Articulatory Phonology for Speech Recognition
(Livescu and Glass, 2004)

• word_t: word ID at frame t
• wdTr_t: word transition?
• ind_t^i: which gesture, from the canonical word model, should articulator i be trying to implement?
• async_t^{i;j}: how asynchronous are articulators i and j?
• U_t^i: canonical setting of articulator i
• S_t^i: surface setting of articulator i
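A minimal sketch of the asynchrony mechanism these variables implement (the prior values are illustrative, not the trained CPT): checkSync is observed to equal 1, which forces |ind^1 − ind^2| to equal the sampled async value:

```python
# A minimal sketch of the async / checkSync semantics defined above.
import itertools

ASYNC_PRIOR = {0: 0.7, 1: 0.2, 2: 0.1}   # illustrative Pr(async^{1;2} = a)

def check_sync(ind1, ind2, async_val):
    """checkSync^{1;2} = 1 iff |ind^1 - ind^2| == async^{1;2}."""
    return int(abs(ind1 - ind2) == async_val)

def config_weight(ind1, ind2):
    """Weight of an articulator-index configuration after observing
    checkSync = 1, marginalizing over the asynchrony variable."""
    return sum(p * check_sync(ind1, ind2, a) for a, p in ASYNC_PRIOR.items())

# Synchronous configurations get weight 0.7; one-step lags get 0.2, etc.
for ind1, ind2 in itertools.product(range(3), repeat=2):
    print(ind1, ind2, config_weight(ind1, ind2))
```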
Incorporating the SVMs: An SVM-DBN Hybrid Model

(Figure: generative chain for the word LIKE. Word → canonical form (e.g., Tongue Front, Glide) → surface form → place and manner variables (e.g., Palatal, Front Vowel; Tongue closed, Semi-closed, Tongue Mid, Tongue open) → SVM outputs such as p( g_PGR(x) | palatal glide release ) and p( g_GR(x) | glide release ), where x is a multi-frame observation including spectrum, formants, and an auditory model.)
Technological Evaluation

Acoustic Feature Selection

1. Accuracy per frame (%), stop releases only, NTIMIT:

               MFCCs+Shape       MFCCs+Formants
   Kernel      Linear    RBF     Linear    RBF
   ±lips        78.3     90.7     92.7     95.0
   ±blade       73.4     87.1     79.6     85.1
   ±body        73.0     85.2     85.7     87.2

2. Word error rate: lattice rescoring, RT03-devel, one talker (WARNING: this talker is atypical):
   Baseline: 15.0% (113/755)
   Rescoring, place based on MFCCs + formant-based params: 14.6% (110/755)
   Rescoring, place based on rate-scale + formant-based params: 14.3% (108/755)
SVM Training: Mixed vs. Targeted Data (accuracy, %)

Train              NTIMIT        NTIMIT&SWB    NTIMIT         Switchboard
Test               NTIMIT        NTIMIT&SWB    Switchboard    Switchboard
Kernel             Linear  RBF   Linear  RBF   Linear  RBF    Linear  RBF
speech onset 95.1 96.2 86.9 89.9 71.4 62.2 81.6 81.6
speech offset 79.6 88.5 76.3 86.4 65.3 78.6 68.4 83.7
consonant onset 94.5 95.5 91.4 93.5 70.3 72.7 95.8 97.7
consonant offset 91.7 93.7 94.3 96.8 80.3 86.2 92.8 96.8
continuant onset 89.4 94.1 87.3 95.0 69.1 81.9 86.2 92.0
continuant offset 90.8 94.9 90.4 94.6 69.3 68.8 89.6 94.3
sonorant onset 95.6 97.2 97.8 96.7 85.2 86.5 96.3 96.3
sonorant offset 95.3 96.4 94.0 97.4 75.6 75.2 95.2 96.4
syllabic onset 90.7 95.2 91.4 95.5 69.5 78.9 87.9 92.6
syllabic offset 90.1 88.9 87.1 92.9 54.4 60.8 88.2 89.7
DBN-SVM Models Nonstandard Phones

Example: in "I don't know," the /d/ becomes a flap and the /n/ becomes a creaky nasal glide.
DBN-SVM Design Decisions

• What kind of SVM outputs should be used in the DBN?
  – Method 1 (EBS/DBN): Generate a landmark segmentation with EBS using manner SVMs, then apply place SVMs at appropriate points in the segmentation
    • Force the DBN to use the EBS segmentation
    • Allow the DBN to stray from the EBS segmentation, using place/voicing SVM outputs whenever available
  – Method 2 (SVM/DBN): Apply all SVMs in all frames; allow the DBN to consider all possible segmentations
    • In a single pass
    • In two passes: (1) manner-based segmentation; (2) place+manner scoring
• How should we take into account the distinctive feature hierarchy?
• How do we avoid "over-counting" evidence?
• How do we train the DBN (feature transcriptions vs. SVM outputs)?
DBN-SVM Rescoring Experiments

• For each lattice edge:
  – SVM probabilities are computed over the edge duration and used as soft evidence in the DBN
  – The DBN computes a score S ∝ P(word | evidence)
  – The final edge score is a weighted interpolation of the baseline scores and the EBS/DBN or SVM/DBN score (sketched below)
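A minimal sketch of the edge-score combination (the 0.8/0.2 weights below are the ones reported for the discriminative pronunciation model later in the talk; treat them as illustrative here):

```python
# A minimal sketch of weighted product combination of edge scores.
import math

def rescore_edge(log_p_baseline, log_p_dbn, w_old=0.8, w_new=0.2):
    """Weighted product combination in the log domain:
    p_old**w_old * p_new**w_new."""
    return w_old * log_p_baseline + w_new * log_p_dbn

# Example: combine a baseline edge probability with a DBN score.
print(math.exp(rescore_edge(math.log(0.6), math.log(0.3))))
```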
Date       Experimental setup                                              3-speaker WER (# errors)   RT03 dev WER
-          Baseline                                                        27.7 (550)                 26.8
Jul31_0    EBS/DBN, "hierarchically-normalized" SVM output probabilities,  27.6 (549)                 26.8
           DBN trained on subset of ICSI transcriptions
Aug1_19    + improved silence modeling                                     27.6 (549)
Aug2_19    EBS/DBN, unnormalized SVM probs + fricative lip feature         27.3 (543)                 26.8
Aug4_2     + DBN trained using SVM outputs                                 27.3 (543)
Aug6_20    + full feature hierarchy in DBN                                 27.4 (545)
Aug7_3     + reduction probabilities depend on word frequency              27.4 (544)
Aug8_19    + retrained SVMs + nasal classifier + DBN bug fixes             27.4 (544)
Aug11_19   SVM/DBN, 1 pass                                                 Miserable failure!
Aug14_0    SVM/DBN, 2 pass                                                 27.3 (542)
Aug14_20   SVM/DBN, 2 pass, using only high-accuracy SVMs                  27.2 (541)
Discriminative Pronunciation Model

           WER     Insertions   Deletions     Substitutions
Baseline   25.8%   2.6% (982)   9.2% (3526)   14.1% (5417)
Rescored   25.8%   2.6% (984)   9.2% (3524)   14.1% (5408)

RT-03 dev set, 35497 words, 2930 segments, 36 speakers (Switchboard and Fisher data)
Rescored: product combination of old and new probability distributions, weights 0.8 (old) and 0.2 (new)

• The correct/incorrect decision changed in about 8% of all cases
• Slightly more errors were fixed than newly introduced
Analysis

• When does it work?
  – The detectors give high probability to the correct distinguishing feature
• When does it not work?
  – Problems in the lexicon representation
  – The landmark detectors are confident but wrong

Examples (score of the queried landmark):
  once (correct) vs. what (false):            Sil ○ +blade   0.87
  like (correct) vs. liked (false):           Sil ○ +blade   0.95
  can't [kæ̃P t] (correct) vs. cat (false):    SC ○ +nasal    0.26
  mean (correct) vs. me (false):              V < +nasal     0.76
Analysis

• Incorrect landmark scores are often due to word boundary effects
• Word boundaries given by the baseline system may exclude relevant landmarks or include parts of neighbouring words
• The DBN-SVM system also failed when word boundaries were grossly misaligned

(Figure: waveform example of a boundary effect involving "much" followed by "he" / "she".)
Conclusions

• SVMs work best with:
  – Mixed training data, at least 3000 landmarks per class
  – Manner classification: small acoustic feature vectors are OK (3-20 dimensions)
  – Place classification: large acoustic feature vectors are best (~2000 dimensions)
• The DBN-SVM correctly models non-canonical pronunciations
  – The DBN is able to match a nasalized glide in place of /n/
  – One talker laughed a lot while speaking; the DBN-SVM reduced WER for that talker
• Both DBN-SVM and MaxEnt need more training data
  – Our training data: 3.5 hours. Baseline HMM training data: 3000 hours.
  – DBN-SVM novel errors: mostly unexpected pronunciations
  – The MaxEnt model currently defines a "lexically discriminative" feature by comparing dictionary entries; therefore it fails most frequently when observing pronunciation variants.
  – The MaxEnt model should instead be trained using automatic landmark transcriptions of confusable words from a large training corpus.
• Both DBN-SVM and MaxEnt are sensitive to word boundary time errors. Solution: probabilistic word boundary times?