TRANSCRIPT
Landmark-Based Speech Recognition
The Marriage of High-Dimensional Machine Learning Techniques with Modern Linguistic Representations
Mark Hasegawa-Johnson
Research performed in collaboration with James Baker (Carnegie Mellon), Sarah Borys (Illinois),
Ken Chen (Illinois), Emily Coogan (Illinois), Steven Greenberg (Berkeley), Amit Juneja (Maryland), Katrin Kirchhoff (Washington), Karen Livescu (MIT),
Srividya Mohan (Johns Hopkins), Jen Muller (Dept. of Defense), Kemal Sonmez (SRI), and Tianyu Wang (Georgia Tech)
What are Landmarks?
• Time-frequency regions of high mutual information between phone and signal (maxima of I(phone label; acoustics(t,f)))
• Acoustic events with similar importance in all languages, and across all speaking styles
• Acoustic events that can be detected even in extremely noisy environments
• Syllable Onset ≈ Consonant Release
• Syllable Nucleus ≈ Vowel Center
• Syllable Coda ≈ Consonant Closure
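A minimal sketch, assuming a simple histogram estimator and invented array names (phone_ids, spectra), of how the mutual-information criterion I(phone label; acoustics(t,f)) might be estimated from aligned tokens; this illustrates the idea, not the code behind the 2000 experiment cited below.

```python
import numpy as np

def mutual_information(labels, values, n_bins=16):
    """I(label; quantized value) in bits, estimated from joint counts."""
    edges = np.histogram_bin_edges(values, bins=n_bins)
    binned = np.digitize(values, edges)            # bin index 0 .. n_bins+1
    joint = np.zeros((labels.max() + 1, n_bins + 2))
    for y, v in zip(labels, binned):
        joint[y, v] += 1
    joint /= joint.sum()
    py = joint.sum(axis=1, keepdims=True)          # marginal over labels
    pv = joint.sum(axis=0, keepdims=True)          # marginal over bins
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (py @ pv)[nz])).sum())

# Hypothetical usage: spectra[n, t, f] = log-energy of token n at offset t
# from the phone boundary and frequency bin f; phone_ids[n] = phone label.
# I_map[t, f] = mutual_information(phone_ids, spectra[:, t, f])
# Landmarks correspond to (t, f) regions where I_map is large.
```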
Where do these things happen?
[Figure: I(phone; acoustics) experiment; Hasegawa-Johnson, 2000]
Landmark-Based Speech Recognition
[Figure: syllable structure (ONSET, NUCLEUS, CODA) aligned against pronunciation variants of the phrase "… backed up …" (… backtup …, … back up …, … backt ihp …, … wackt ihp …, …) and the lattice hypothesis "… backed up …", annotated with scores, words, and times.]
Talk Outline
Overview
1. Acoustic Modeling
– Speech data and acoustic features
– Landmark detection
– Estimation of real-valued "distinctive features" using support vector machines (SVMs)
2. Pronunciation Modeling
– A Dynamic Bayesian Network (DBN) implementation of Articulatory Phonology
– A discriminative pronunciation model implemented using Maximum Entropy (MaxEnt)
3. Technological Evaluation
– Rescoring of word lattice output from an HMM-based recognizer
– Errors that we fixed: channel noise, laughter, etcetera
– New errors that we caused: pronunciation models trained on 3 hours can't compete with triphone models trained on 3000 hours
– Future plans
Overview
• History
– Research described in this talk was performed between June 30 and August 17, 2004, at the Johns Hopkins summer workshop WS04
• Scientific Goal
– To use high-dimensional machine learning technologies (SVM, DBN) to create representations capable of learning, from data, the types of speech knowledge that humans exhibit in psychophysical speech perception experiments
• Technological Goal
– Long-term: to create a better speech recognizer
– Short-term: lattice rescoring, applied to word lattices produced by SRI's NN/HMM hybrid
[System diagram: MFCCs (5ms & 1ms frame periods), formants, and phonetic & auditory model parameters are concatenated over 4-15 frames and passed to the acoustic model (SVMs), which produces p(landmark | SVM). A pronunciation model (DBN or MaxEnt) converts landmark probabilities into p(SVM | word), given word labels and start & end times from the first-pass ASR word lattice. Rescoring: log-linear score combination of the new scores with the lattice's p(MFCC,PLP | word) and p(word | words).]
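As a sketch of the rescoring step: log-linear score combination amounts to a weighted sum of log-probabilities per lattice edge. The function and weights below are illustrative assumptions; in practice the weights would be tuned on held-out data.

```python
import math

def rescore_edge(logp_acoustic, logp_lm, logp_landmark,
                 weights=(1.0, 1.0, 0.5)):
    """Log-linear combination of baseline and landmark-based scores.
    logp_acoustic ~ log p(MFCC,PLP | word), logp_lm ~ log p(word | words),
    logp_landmark ~ log p(SVM | word) from the DBN or MaxEnt model."""
    w_ac, w_lm, w_land = weights
    return w_ac * logp_acoustic + w_lm * logp_lm + w_land * logp_landmark

# Each lattice edge keeps its word label and start & end times; only its
# score changes before the best path is re-extracted.
score = rescore_edge(math.log(1e-8), math.log(0.02), math.log(0.3))
```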
Overview of Systems to be Described
I. Acoustic Modeling
• Goal: Learn precise and generalizable models of the acoustic boundary associated with each distinctive feature.
• Methods:
– Large input vector space (many acoustic feature types)
– Regularized binary classifiers (SVMs)
– SVM outputs "smoothed" using dynamic programming
– SVM outputs converted to posterior probability estimates once per 5ms using a histogram
Speech Databases

Corpus             Size     Phonetic Transcr.   Word Lattices
NTIMIT             14 hrs   manual              –
WS96&97            3.5 hrs  manual              –
SWB1 WS04 subset   12 hrs   auto (SRI)          BBN
Eval01             10 hrs   –                   BBN & SRI
RT03 Dev           6 hrs    –                   SRI
RT03 Eval          6 hrs    –                   SRI
Acoustic and Auditory Features
• MFCCs, 25ms window (standard ASR features)
• Spectral shape: energy, spectral tilt, and spectral compactness, once per millisecond
• Noise-robust MUSIC-based formant frequencies, amplitudes, and bandwidths (Zheng & Hasegawa-Johnson, ICSLP 2004)
• Acoustic-phonetic parameters (formant-based relative spectral measures and time-domain measures; Bitar & Espy-Wilson, 1996)
• Rate-place model of neural response fields in the cat auditory cortex (Carlyon & Shamma, JASA 2003)
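For concreteness, a rough sketch of the first two kinds of frame-level measurements (standard MFCCs and simple spectral-shape proxies), assuming 16kHz audio, a hypothetical file name, and the librosa library; the MUSIC formant tracker and the rate-place auditory model are not reproduced here.

```python
import numpy as np
import librosa

y, sr = librosa.load("utt.wav", sr=16000)          # hypothetical utterance

# Standard ASR features: MFCCs over a 25ms window (400 samples), 5ms hop.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=400, hop_length=80)

# Spectral-shape measures once per millisecond (hop = 16 samples at 16kHz).
S = np.abs(librosa.stft(y, n_fft=400, hop_length=16))
freqs = librosa.fft_frequencies(sr=sr, n_fft=400)
energy = S.sum(axis=0)
# Spectral tilt: slope of a line fit to the log-magnitude spectrum per frame.
tilt = np.polyfit(freqs, np.log(S + 1e-10), deg=1)[0]
# Compactness proxy: spectral spread about the centroid.
centroid = (freqs[:, None] * S).sum(axis=0) / (energy + 1e-10)
spread = np.sqrt(((freqs[:, None] - centroid) ** 2 * S).sum(axis=0)
                 / (energy + 1e-10))
```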
What are Distinctive Features? What are Landmarks?
• Distinctive feature =
– a binary partition of the phonemes (Jakobson, 1952)
– … that compactly describes pronunciation variability (Halle)
– … and correlates with distinct acoustic cues (Stevens)
• Landmark = change in the value of a manner feature
– [+sonorant] to [–sonorant], [–sonorant] to [+sonorant]
– 5 manner features: [sonorant, consonantal, continuant, syllabic, silence]
• Place and voicing features: SVMs are trained only at landmarks
– Primary articulator: lips, tongue blade, or tongue body
– Features of primary articulator: anterior, strident
– Features of secondary articulator: nasal, voiced
Landmark Detection using Support Vector Machines (SVMs)
[Figure: false acceptance vs. false rejection errors, TIMIT, per 10ms frame. The SVM stop release detector has half the error of an HMM (Niyogi & Burges, 1999, 2002):
(1) Delta-energy ("Deriv"): equal error rate = 0.2%
(2) HMM (*): false rejection error = 0.3%
(3) Linear SVM: equal error rate = 0.15%
(4) Radial basis function SVM: equal error rate = 0.13%]
Dynamic Programming Smooths SVMs
• Maximize ∏ᵢ p( features(tᵢ) | X(tᵢ) ) p( tᵢ₊₁ − tᵢ | features(tᵢ) )
• Soft-decision "smoothing" mode: p( acoustics | landmarks ) is computed and fed to the pronunciation model
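A minimal dynamic-programming sketch of this smoothing criterion, with invented names and a table-driven duration model; the system's actual segmenter is more elaborate.

```python
import numpy as np

def smooth(logp_feat, logp_dur, max_dur):
    """Choose landmark times/classes maximizing
    prod_i p(features(t_i)|X(t_i)) * p(t_{i+1}-t_i | features(t_i)).
    logp_feat[t, c]: log p(class c | acoustics at frame t);
    logp_dur[c, d]:  log p(next landmark d frames later | class c)."""
    T, C = logp_feat.shape
    best = np.full((T, C), -np.inf)
    back = {}
    best[0] = logp_feat[0]                       # first landmark at frame 0
    for t in range(1, T):
        for c in range(C):
            for d in range(1, min(t, max_dur) + 1):
                for cp in range(C):
                    s = best[t - d, cp] + logp_dur[cp, d] + logp_feat[t, c]
                    if s > best[t, c]:
                        best[t, c], back[(t, c)] = s, (t - d, cp)
    t, c = T - 1, int(best[T - 1].argmax())      # backtrace from best end
    path = [(t, c)]
    while (t, c) in back:
        t, c = back[(t, c)]
        path.append((t, c))
    return path[::-1]                            # [(frame, class), ...]
```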
Soft-Decision Landmark Probabilities
[Figure: cues for place of articulation are MFCCs + formants + rate-scale parameters within 150ms of the landmark. A kernel transforms the input to an infinite-dimensional Hilbert space, and the SVM extracts a discriminant dimension: argmin(error(margin) + 1/width(margin)). The discriminant is converted to a posterior in one of two ways: p(class | acoustics) ≈ sigmoid model in the discriminant dimension (Niyogi & Burges, 2002), OR p(class | acoustics) ≈ histogram in the discriminant dimension (Juneja & Espy-Wilson, 2003).]
Soft Decisions once per 5ms:
p( manner feature d(t) | Y(t) )
p( place feature d(t) | Y(t), t is a landmark )
[Diagram: 2000-dimensional acoustic feature vector → SVM → discriminant yᵢ(t) → histogram → posterior probability of distinctive feature p(dᵢ(t)=1 | yᵢ(t))]
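A minimal sketch of this two-stage pipeline using scikit-learn and synthetic stand-in data: an SVM maps the high-dimensional acoustic vector to a scalar discriminant yᵢ(t), and a histogram over held-out discriminants converts it to p(dᵢ(t)=1 | yᵢ(t)).

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)                       # synthetic stand-ins
X_train = rng.normal(size=(500, 2000)); d_train = rng.integers(0, 2, 500)
X_held = rng.normal(size=(300, 2000));  d_held = rng.integers(0, 2, 300)

svm = SVC(kernel="rbf", gamma="scale").fit(X_train, d_train)

# Histogram the held-out discriminants to estimate the posterior per bin.
y_held = svm.decision_function(X_held)
edges = np.histogram_bin_edges(y_held, bins=20)
pos, _ = np.histogram(y_held[d_held == 1], bins=edges)
tot, _ = np.histogram(y_held, bins=edges)
posterior = (pos + 1) / (tot + 2)                    # add-one smoothing

def p_feature(X):
    """p(d=1 | y(t)) for each acoustic vector, once per 5ms frame."""
    b = np.digitize(svm.decision_function(X), edges) - 1
    return posterior[np.clip(b, 0, len(posterior) - 1)]
```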
II. Pronunciation Modeling
• Goal: Represent a large number of pronunciation variants, in a controlled fashion, using distinctive features. Pick out the distinctive features that are most important for each word recognition task.
• Methods:
1. Distinctive-feature based lexicon + dynamic programming alignment
2. Dynamic Bayesian Network model of Articulatory Phonology (articulation-based pronunciation variability model)
3. MaxEnt search for lexically discriminative features (perceptually based "pronunciation model")
1. Distinctive-Feature Based Lexicon
• Merger of the English Switchboard and Callhome dictionaries
• Converted to landmarks using Hasegawa-Johnson's perl transcription tools
• Landmarks in blue, place and voicing features in green:

AGO(0.441765) +syllabic +reduced +back AX
  +–continuant +–sonorant +velar +voiced G closure
  –+continuant –+sonorant +velar +voiced G release
  +syllabic –low –high +back +round +tense OW
AGO(0.294118) +syllabic +reduced –back IX
  +–continuant +–sonorant +velar +voiced G closure
  –+continuant –+sonorant +velar +voiced G release
  +syllabic –low –high +back +round +tense OW
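As an illustration of this conversion (not the workshop's perl tools), a toy sketch that expands a phone string into landmark lines like the AGO entries above, with each stop contributing separate closure and release landmarks; the feature table is a tiny assumed fragment.

```python
FEATURES = {  # tiny assumed fragment of a distinctive-feature table
    "AX": ["+syllabic", "+reduced", "+back"],
    "IX": ["+syllabic", "+reduced", "-back"],
    "G":  ["+velar", "+voiced"],
    "OW": ["+syllabic", "-low", "-high", "+back", "+round", "+tense"],
}
STOPS = {"G", "K", "B", "P", "D", "T"}

def to_landmarks(phones):
    lines = []
    for ph in phones:
        if ph in STOPS:
            # A stop yields a closure landmark (+ to - manner change) and
            # a release landmark (- to + manner change).
            lines.append("+-continuant +-sonorant "
                         + " ".join(FEATURES[ph]) + f" {ph} closure")
            lines.append("-+continuant -+sonorant "
                         + " ".join(FEATURES[ph]) + f" {ph} release")
        else:
            lines.append(" ".join(FEATURES[ph]) + f" {ph}")
    return lines

print("\n".join(to_landmarks(["AX", "G", "OW"])))   # the AGO(0.44) variant
```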
Dynamic Programming Lexical Search
• Choose the word that maximizes ∏ᵢ p( features(tᵢ) | X(tᵢ) ) p( tᵢ₊₁ − tᵢ | features(tᵢ) ) p( features(tᵢ) | word )
[Figure: gestural score with tract variables LIP-OP, TT-OPEN, TT-LOC, TB-LOC, TB-OPEN, VELUM, VOICING]
• warmth [w ao r m p th] – Phone insertion?
• I don't know [ah dx uh_n ow_n] – Phone deletion??
• several [s eh r v ax l] – Exchange of two phones???
2. Articulatory Phonology
• Many pronunciation phenomena can be parsimoniously described as resulting from asynchrony and reduction of sub-phonetic features
– e.g. instruments [ih_n s ch em ih_n n s], everybody [eh r uw ay]
• One set of features is based on articulatory phonology [Browman & Goldstein 1990] (the tract variables of the gestural score above):
Dynamic Bayesian Network Model (Livescu and Glass, 2004)
• The model is implemented as a dynamic Bayesian network (DBN):
– A representation, via a directed graph, of a distribution over a set of variables that evolve through time
• Example DBN with three features:
[Figure: the example DBN, shown for one frame and unrolled over frames 0 … T. Each frame t contains a word variable word_t; per-feature index variables ind1_t, ind2_t, ind3_t (given by baseform pronunciations); underlying and surface feature variables U1_t … U3_t and S1_t … S3_t; and asynchrony variables async(2;1)_t and async(3;2,1)_t, whose checkSync variables are observed to equal 1. The asynchrony mechanism is defined by:

Pr( async(2;1) = a ) ≜ Pr( |ind1 − ind2| = a )
checkSync(2;1) = 1 iff |ind1 − ind2| = async(2;1)

A conditional probability table over asynchrony degrees (entries such as .7, .2, .1, 0 for degrees 0, 1, 2, 3) controls how far the feature streams may drift apart.]
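A toy sketch of the asynchrony mechanism just defined, with an illustrative probability table: since checkSync is "observed" to be 1, only index configurations whose difference matches a permitted asynchrony degree receive probability mass.

```python
ASYNC_PROBS = {0: 0.7, 1: 0.2, 2: 0.1}   # illustrative Pr(async = a)

def check_sync(ind1, ind2, async_degree):
    """checkSync = 1 iff |ind1 - ind2| equals the asynchrony degree."""
    return int(abs(ind1 - ind2) == async_degree)

def config_weight(ind1, ind2):
    """Weight of an index configuration after observing checkSync = 1:
    sum over asynchrony degrees consistent with the observation."""
    return sum(p for a, p in ASYNC_PROBS.items()
               if check_sync(ind1, ind2, a))

print(config_weight(3, 3))   # 0.7: feature streams aligned
print(config_weight(3, 1))   # 0.1: streams two positions apart
print(config_weight(3, 0))   # 0.0: drift beyond the table is disallowed
```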
The DBN-SVM Hybrid Developed at WS04
[Figure: for the word LIKE, the DBN expands Word → canonical form (e.g. tongue front/palatal, tongue closed; tongue mid, tongue open) → surface form (e.g. semi-closed glide, front vowel), yielding frame-level manner and place variables. The multi-frame observation x (spectrum, formants, & auditory model) connects through SVM outputs such as p( gPGR(x) | palatal glide release ) and p( gGR(x) | glide release ).]
3. Discriminative Pronunciation Model
• Rationale: the baseline HMM-based system already provides high-quality hypotheses
– 1-best error rate from N-best lists: 24.4% (RT-03 dev set)
– Oracle error rate: 16.2%
• Method: use landmark detection only where necessary, to correct errors made by the baseline recognition system
• Example (fsh_60386_1_0105420_0108380):
Ref: that cannot be that hard to sneak onto an airplane
Hyp: they can be a that hard to speak on an airplane
Identifying Confusable Hypotheses
• Use existing alignment algorithms for converting lattices into confusion networks (Mangu, Brill & Stolcke 2000)
• Hypotheses ranked by posterior probability
• Generated from n-best lists without 4-gram or pronunciation model scores (→ higher WER compared to lattices)
• Multi-words ("I_don't_know") were split prior to generating confusion networks
[Figure: confusion network for the example, with confusion sets such as that/they, can/can't, be/*DEL*, sneak/speak, onto/on/to, followed by an airplane.]
Identifying Confusable Hypotheses
• How much can be gained from fixing confusions?
• Baseline error rate: 25.8%
• Oracle error rates when selecting correct word from confusion set:
# hypotheses to select from   Including homophones   Not including homophones
2                             23.9%                  23.9%
3                             23.0%                  23.0%
4                             22.4%                  22.5%
5                             22.0%                  22.1%
Selecting Relevant Landmarks
• Not all landmarks are equally relevant for distinguishing between competing word hypotheses (e.g. vowel features are irrelevant for sneak vs. speak)
• Using all available landmarks might deteriorate performance when irrelevant landmarks have weak scores (but: redundancy might be useful)
• Automatic selection algorithm
– Should optimally distinguish the set of confusable words (discriminative)
– Should rank landmark features according to their relevance for distinguishing words (i.e. the output should be interpretable in phonetic terms)
– Should be extendable to features beyond landmarks
Maximum-Entropy Landmark Selection
• Convert each word in the confusion set into a fixed-length landmark-based representation, using an idea from information retrieval:
• Vector space consisting of binary relations between two landmarks
– Manner landmarks: precedence, e.g. V < Son. Cons.
– Manner & place features: overlap, e.g. V o +high
– Preserves basic temporal information
• Words represented as frequency entries in a feature vector
• Not all possible relations are used (phonotactic constraints; place features are detected dependent on manner landmarks)
• Dimensionality of feature space: 40-60
• Word entries derived from the phone representation plus pronunciation rules
Vector-Space Word Representation

        Start<Fric  Fric<Stop  Fric<Son  Fric<Vowel  Stop<Vowel  Vowel o high  Vowel o front  Fric o strident
speak   1           1          0         0           1           1             1              1
sneak   1           0          1         0           0           1             1              1
seek    1           0          0         1           0           1             1              1
he      1           0          0         1           0           1             1              0
she     1           0          0         1           0           1             1              1
steak   1           1          0         0           1           0             1              1
…
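A small sketch of this representation with the toy relation inventory from the table above; vectorize simply marks which precedence and overlap relations a word's landmark transcription contains.

```python
RELATIONS = ["Start<Fric", "Fric<Stop", "Fric<Son", "Fric<Vowel",
             "Stop<Vowel", "Vowel o high", "Vowel o front", "Fric o strident"]

def vectorize(relations_present):
    """Binary vector over the fixed relation inventory."""
    return [1 if r in relations_present else 0 for r in RELATIONS]

speak = vectorize({"Start<Fric", "Fric<Stop", "Stop<Vowel",
                   "Vowel o high", "Vowel o front", "Fric o strident"})
sneak = vectorize({"Start<Fric", "Fric<Son",
                   "Vowel o high", "Vowel o front", "Fric o strident"})
print(speak)   # [1, 1, 0, 0, 1, 1, 1, 1] -- matches the table row
print(sneak)   # [1, 0, 1, 0, 0, 1, 1, 1]
```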
Maximum-Entropy Discrimination
Use a maxent classifier:

P(y | x) = (1/Z(x)) exp( Σₖ λₖ fₖ(x, y) )

• Here: y = words, x = acoustics, f = landmark relationships
• Why a maxent classifier?
– Discriminative classifier
– Possibly large set of confusable words
– Later addition of non-binary features
• Training: ideally on real landmark detection output
• Here: on entries from the lexicon (includes pronunciation variants)
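A minimal sketch of such a classifier using scikit-learn's logistic regression (a maximum-entropy model), trained, as on the slide, from lexicon vectors rather than real detector output; with two classes the signed weights play the role of the λₖ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

NAMES = ["Start<Fric", "Fric<Stop", "Fric<Son", "Fric<Vowel",
         "Stop<Vowel", "Vowel o high", "Vowel o front", "Fric o strident"]
X = np.array([[1, 1, 0, 0, 1, 1, 1, 1],    # speak
              [1, 0, 1, 0, 0, 1, 1, 1]])   # sneak
y = np.array(["speak", "sneak"])

clf = LogisticRegression(max_iter=1000).fit(X, y)
# Classes are sorted alphabetically, so "speak" is the positive class:
# positive weights favor "speak", negative favor "sneak".
for name, w in zip(NAMES, clf.coef_[0]):
    print(f"{name:16s} {w:+.2f}")   # e.g. Fric<Stop positive, Fric<Son negative
```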
Maximum-Entropy Discrimination
• Example: sneak vs. speak
• A different model is trained for each confusion set → landmarks can have different weights in different contexts

speak: SC ○ +blade −2.47   FR < SC −2.47   FR < SIL 2.11    SIL < ST 1.75   …
sneak: SC ○ +blade 2.47    FR < SC 2.47    FR < SIL −2.11   SIL < ST −1.75  …
Landmark Queries
• Select the N landmarks with the highest weights
• Ask the landmark detection module to produce scores for the selected landmarks within the word boundaries given by the baseline system
• Example: for the confusion-network entry "sneak 1.70 1.99", the query "SC ○ +blade ?" is sent to the landmark detectors, which return scores:

sneak 1.70 1.99   SC ○ +blade ?            (from confusion networks)
sneak 1.70 1.99   SC ○ +blade 0.75 0.56    (from landmark detectors)
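A schematic sketch, with invented interfaces, of the query protocol just described: take the N landmark relations with the largest absolute MaxEnt weights and ask a detector to score them inside the baseline word boundaries.

```python
def landmark_queries(weights, detector, start, end, n=3):
    """weights: {relation: lambda}; detector(relation, start, end) -> score.
    Both the dict and the detector callable are assumed interfaces, not
    real WS04 module names."""
    top = sorted(weights, key=lambda r: abs(weights[r]), reverse=True)[:n]
    return {r: detector(r, start, end) for r in top}

# e.g. for "sneak" between 1.70s and 1.99s:
# landmark_queries({"SC o +blade": 2.47, "FR < SC": 2.47,
#                   "FR < SIL": -2.11, "SIL < ST": -1.75},
#                  detector, 1.70, 1.99)
```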
III. Evaluation
Acoustic Feature Selection

1. Accuracy per frame (%), stop releases only, NTIMIT:

             MFCCs+Shape       MFCCs+Formants
Kernel       Linear   RBF      Linear   RBF
+/- lips     78.3     90.7     92.7     95.0
+/- blade    73.4     87.1     79.6     85.1
+/- body     73.0     85.2     85.7     87.2

2. Word error rate: lattice rescoring, RT03-devel, one talker (WARNING: this talker is atypical).
Baseline: 15.0% (113/755)
Rescoring, place based on MFCCs + formant-based params: 14.6% (110/755)
Rescoring, place based on rate-scale + formant-based params: 14.3% (108/755)
SVM Training: Mixed vs. Targeted Data

Train               NTIMIT         NTIMIT&SWB     NTIMIT         Switchboard
Test                NTIMIT         NTIMIT&SWB     Switchboard    Switchboard
Kernel              Linear  RBF    Linear  RBF    Linear  RBF    Linear  RBF
speech onset        95.1    96.2   86.9    89.9   71.4    62.2   81.6    81.6
speech offset       79.6    88.5   76.3    86.4   65.3    78.6   68.4    83.7
consonant onset     94.5    95.5   91.4    93.5   70.3    72.7   95.8    97.7
consonant offset    91.7    93.7   94.3    96.8   80.3    86.2   92.8    96.8
continuant onset    89.4    94.1   87.3    95.0   69.1    81.9   86.2    92.0
continuant offset   90.8    94.9   90.4    94.6   69.3    68.8   89.6    94.3
sonorant onset      95.6    97.2   97.8    96.7   85.2    86.5   96.3    96.3
sonorant offset     95.3    96.4   94.0    97.4   75.6    75.2   95.2    96.4
syllabic onset      90.7    95.2   91.4    95.5   69.5    78.9   87.9    92.6
syllabic offset     90.1    88.9   87.1    92.9   54.4    60.8   88.2    89.7
DBN-SVM: Models Nonstandard Phones
[Figure: alignment of "I don't know", in which /d/ becomes a flap and /n/ becomes a creaky nasal glide]
DBN-SVM Design Decisions
• What kind of SVM outputs should be used in the DBN?
– Method 1 (EBS/DBN): Generate a landmark segmentation with EBS using manner SVMs, then apply place SVMs at appropriate points in the segmentation
• Force the DBN to use the EBS segmentation
• Allow the DBN to stray from the EBS segmentation, using place/voicing SVM outputs whenever available
– Method 2 (SVM/DBN): Apply all SVMs in all frames; allow the DBN to consider all possible segmentations
• In a single pass
• In two passes: (1) manner-based segmentation; (2) place+manner scoring
• How should we take into account the distinctive feature hierarchy?
• How do we avoid "over-counting" evidence?
• How do we train the DBN (feature transcriptions vs. SVM outputs)?
DBN-SVM Rescoring Experiments
• For each lattice edge:
– SVM probabilities are computed over the edge duration and used as soft evidence in the DBN
– The DBN computes a score S ∝ P(word | evidence)
– The final edge score is a weighted interpolation of the baseline scores and the EBS/DBN or SVM/DBN score
Date      Experimental setup                                               3-speaker WER (# errors)   RT03 dev WER
–         Baseline                                                         27.7 (550)                 26.8
Jul31_0   EBS/DBN, "hierarchically-normalized" SVM output probabilities,
          DBN trained on subset of ICSI transcriptions                     27.6 (549)                 26.8
Aug1_19   + improved silence modeling                                      27.6 (549)
Aug2_19   EBS/DBN, unnormalized SVM probs + fricative lip feature          27.3 (543)                 26.8
Aug4_2    + DBN trained using SVM outputs                                  27.3 (543)
Aug6_20   + full feature hierarchy in DBN                                  27.4 (545)
Aug7_3    + reduction probabilities depend on word frequency               27.4 (544)
Aug8_19   + retrained SVMs + nasal classifier + DBN bug fixes              27.4 (544)
Aug11_19  SVM/DBN, 1 pass                                                  Miserable failure!
Aug14_0   SVM/DBN, 2 pass                                                  27.3 (542)
Aug14_20  SVM/DBN, 2 pass, using only high-accuracy SVMs                   27.2 (541)
Discriminative Pronunciation Model

           WER     Insertions    Deletions     Substitutions
Baseline   25.8%   2.6% (982)    9.2% (3526)   14.1% (5417)
Rescored   25.8%   2.6% (984)    9.2% (3524)   14.1% (5408)

RT-03 dev set, 35497 words, 2930 segments, 36 speakers (Switchboard and Fisher data)
Rescored: product combination of old and new probability distributions, weights 0.8 (old), 0.2 (new)
• The correct/incorrect decision changed in about 8% of all cases
• Slightly higher number of fixed errors vs. new errors
Analysis
• When does it work?
– The detectors give high probability for the correct distinguishing feature
• When does it not work?
– Problems in the lexicon representation
– Landmark detectors are confident but wrong
• Examples:
once (correct) vs. what (false): Sil ○ +blade 0.87
like (correct) vs. liked (false): Sil ○ +blade 0.95
can't [kæ̃ʔt] (correct) vs. cat (false): SC ○ +nasal 0.26
mean (correct) vs. me (false): V < +nasal 0.76
Analysis
• Incorrect landmark scores are often due to word boundary effects, e.g.:
[Figure: "much he", where frication from the end of "much" falls within the next word's boundaries, making "he" score like "she"]
• Word boundaries given by the baseline system may exclude relevant landmarks or include parts of neighbouring words
• The DBN-SVM system also failed when word boundaries were grossly misaligned
Conclusions
• SVMs work best when:
– Mixed training data, at least 3000 landmarks per class
– Manner classification: small acoustic feature vectors OK (3-20 dimensions)
– Place classification: large acoustic feature vectors best (~2000 dimensions)
• DBN-SVM correctly models non-canonical pronunciations
– The DBN is able to match a nasalized glide in place of /n/
– One talker laughed a lot while speaking; DBN-SVM reduced WER for that talker
• Both DBN-SVM and MaxEnt need more training data
– Our training data: 3.5 hours. Baseline HMM training data: 3000 hours.
– DBN-SVM novel errors: mostly unexpected pronunciations
– The MaxEnt model currently defines a "lexically discriminative" feature by comparing dictionary entries; therefore it fails most frequently when observing pronunciation variants.
– The MaxEnt model should instead be trained using automatic landmark transcriptions of confusable words from a large training corpus.
• Both DBN-SVM and MaxEnt are sensitive to word boundary time errors. Solution: probabilistic word boundary times?