LSA 352 Speech Recognition and Synthesis, Lecture 7: Baum-Welch/Learning and Disfluencies (Dan Jurafsky, Summer 2007)



    1LSA 352 Summer 2007

LSA 352: Speech Recognition and Synthesis

    Dan Jurafsky

Lecture 7: Baum-Welch/Learning and Disfluencies

IP Notice: Some of these slides were derived from Andrew Ng's CS 229 notes, as well as lecture notes from Chen, Picheny et al., and Bryan Pellom. I'll try to accurately give credit on each slide.

    2LSA 352 Summer 2007

    Outline for Today

Learning
EM for HMMs (the "Baum-Welch Algorithm")
Embedded Training
Training mixture Gaussians

    Disfluencies

    3LSA 352 Summer 2007

    The Learning Problem

Baum-Welch = Forward-Backward Algorithm (Baum 1972)
Is a special case of the EM or Expectation-Maximization algorithm (Dempster, Laird, Rubin)
The algorithm will let us train the transition probabilities A = {a_ij} and the emission probabilities B = {b_i(o_t)} of the HMM

    4LSA 352 Summer 2007

    Input to Baum-Welch

O: unlabeled sequence of observations
Q: vocabulary of hidden states

For the ice-cream task:
O = {1, 3, 2, ...}
Q = {H, C}

    5LSA 352 Summer 2007

Starting out with Observable Markov Models

How to train?
Run the model on the observation sequence O.
Since it's not hidden, we know which states we went through, hence which transitions and observations were used.
Given that information, training:

B = {b_k(o_t)}: Since every state can only generate one observation symbol, observation likelihoods B are all 1.0
A = {a_ij}:

$a_{ij} = \dfrac{C(i \to j)}{\sum_{q \in Q} C(i \to q)}$

    6LSA 352 Summer 2007

    Extending Intuition to HMMs

For HMM, cannot compute these counts directly from observed sequences
Baum-Welch intuitions:

Iteratively estimate the counts.
– Start with an estimate for a_ij and b_k, iteratively improve the estimates

Get estimated probabilities by:
– computing the forward probability for an observation
– dividing that probability mass among all the different paths that contributed to this forward probability


    7LSA 352 Summer 2007

    The Backward algorithm

We define the backward probability as follows:

$\beta_t(i) = P(o_{t+1}, o_{t+2}, \ldots, o_T \mid q_t = i, \Phi)$

This is the probability of generating the partial observation O_{t+1}^{T} from time t+1 to the end, given that the HMM is in state i at time t and of course given Φ.

    8LSA 352 Summer 2007

    The Backward algorithm

    We compute backward prob by induction:

    9LSA 352 Summer 2007

Inductive step of the backward algorithm (figure inspired by Rabiner and Juang):
Computation of β_t(i) by weighted sum of all successive values β_{t+1}(j)
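The recursion itself appears only as a figure in the original slides; in the standard formulation it is β_T(i) = 1 (or a_{iF} with an explicit final state) and β_t(i) = Σ_j a_ij b_j(o_{t+1}) β_{t+1}(j). A minimal NumPy sketch of that induction, assuming A is an N x N transition matrix, B an N x V emission table, and obs a list of observation indices (no scaling, so only suitable for short sequences):

import numpy as np

def backward(A, B, obs):
    """beta[t, i] = P(o_{t+1} ... o_T | q_t = i)."""
    N, T = A.shape[0], len(obs)
    beta = np.zeros((T, N))
    beta[T - 1, :] = 1.0                     # base case: nothing left to generate
    for t in range(T - 2, -1, -1):           # induction, right to left
        # weighted sum over successors j of a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
        beta[t, :] = A @ (B[:, obs[t + 1]] * beta[t + 1, :])
    return beta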

    10LSA 352 Summer 2007

    Intuition for re-estimation of aij

We will estimate $\hat{a}_{ij}$ via this intuition:

$\hat{a}_{ij} = \dfrac{\text{expected number of transitions from state } i \text{ to state } j}{\text{expected number of transitions from state } i}$

Numerator intuition:
Assume we had some estimate of the probability that a given transition i→j was taken at time t in the observation sequence.
If we knew this probability for each time t, we could sum over all t to get the expected value (count) for i→j.

    11LSA 352 Summer 2007

    Re-estimation of aij

Let ξ_t(i,j) be the probability of being in state i at time t and state j at time t+1, given O_{1..T} and model Φ:

$\xi_t(i, j) = P(q_t = i, q_{t+1} = j \mid O, \Phi)$

We can compute ξ from not-quite-ξ, which is:

$\text{not\_quite\_}\xi_t(i, j) = P(q_t = i, q_{t+1} = j, O \mid \Phi)$

    12LSA 352 Summer 2007

    Computing not-quite-ξ

The four components of $P(q_t = i, q_{t+1} = j, O \mid \Phi)$: $\alpha$, $\beta$, $a_{ij}$ and $b_j(o_{t+1})$


    13LSA 352 Summer 2007

    From not-quite-ξ to ξ

We want:

$\xi_t(i, j) = P(q_t = i, q_{t+1} = j \mid O, \Phi)$

We've got:

$\text{not\_quite\_}\xi_t(i, j) = P(q_t = i, q_{t+1} = j, O \mid \Phi)$

Which we compute as follows:

$\text{not\_quite\_}\xi_t(i, j) = \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)$

    14LSA 352 Summer 2007

    From not-quite-ξ to ξ

We want:

$\xi_t(i, j) = P(q_t = i, q_{t+1} = j \mid O, \Phi)$

We've got:

$\text{not\_quite\_}\xi_t(i, j) = P(q_t = i, q_{t+1} = j, O \mid \Phi)$

Since:

$\xi_t(i, j) = \dfrac{\text{not\_quite\_}\xi_t(i, j)}{P(O \mid \Phi)}$

We need: $P(O \mid \Phi)$

    15LSA 352 Summer 2007

    From not-quite-ξ to ξ

$\xi_t(i, j) = \dfrac{\text{not\_quite\_}\xi_t(i, j)}{P(O \mid \Phi)}$

    16LSA 352 Summer 2007

    From ξ to aij

The expected number of transitions from state i to state j is the sum over all t of ξ_t(i,j).
The total expected number of transitions out of state i is the sum over all transitions out of state i.
Final formula for re-estimated a_ij:

$\hat{a}_{ij} = \dfrac{\text{expected number of transitions from state } i \text{ to state } j}{\text{expected number of transitions from state } i} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \sum_{k=1}^{N} \xi_t(i, k)}$
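A small NumPy sketch of this re-estimation step, under the assumption that alpha and beta are (T x N) arrays from forward and backward passes (as in the backward() sketch earlier), A and B are the current transition and emission matrices, and obs is the list of observation indices:

import numpy as np

def reestimate_transitions(A, B, obs, alpha, beta):
    """Compute xi[t, i, j] = P(q_t = i, q_{t+1} = j | O, model) and return the new A."""
    T, N = alpha.shape
    P_O = alpha[-1].sum()                                  # P(O | model)
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        xi[t] = (alpha[t][:, None] * A
                 * B[:, obs[t + 1]][None, :]
                 * beta[t + 1][None, :]) / P_O
    # expected i->j transitions divided by expected transitions out of i
    return xi.sum(axis=0) / xi.sum(axis=(0, 2))[:, None]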

    17LSA 352 Summer 2007

Re-estimating the observation likelihood b

$\hat{b}_j(v_k) = \dfrac{\text{expected number of times in state } j \text{ and observing symbol } v_k}{\text{expected number of times in state } j}$

    We’ll need to know the probability of being in state j at time t:

    18LSA 352 Summer 2007

    Computing gamma
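The figure for this slide is not reproduced in the transcript; the standard definition, consistent with the α and β above, is

$\gamma_t(j) = \dfrac{\alpha_t(j)\,\beta_t(j)}{P(O \mid \Phi)}$

and the observation likelihoods are then re-estimated as

$\hat{b}_j(v_k) = \dfrac{\sum_{t\ \text{s.t.}\ o_t = v_k} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}$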


    19LSA 352 Summer 2007

    Summary

$\hat{a}_{ij}$: the ratio between the expected number of transitions from state i to j and the expected number of all transitions from state i

$\hat{b}_j(v_k)$: the ratio between the expected number of times the observation data emitted from state j is v_k, and the expected number of times any observation is emitted from state j

    20LSA 352 Summer 2007

Summary: Forward-Backward Algorithm

1) Initialize Φ = (A, B)
2) Compute α, β, ξ
3) Estimate new Φ' = (A, B)
4) Replace Φ with Φ'
5) If not converged, go to 2
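Putting the pieces together, a compact sketch of this loop for a discrete-output HMM (a toy version with no scaling or log-space arithmetic, so only usable on short sequences such as the ice-cream example; pi is an initial-state distribution standing in for the textbook's non-emitting start state, and backward() is repeated from the earlier sketch so the block is self-contained):

import numpy as np

def forward(A, B, pi, obs):
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha

def backward(A, B, obs):
    N, T = A.shape[0], len(obs)
    beta = np.zeros((T, N))
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

def baum_welch(A, B, pi, obs, n_iter=20):
    """EM loop: E-step computes alpha/beta/xi/gamma, M-step re-estimates A, B, pi."""
    A, B, pi = A.copy(), B.copy(), pi.copy()
    obs = np.asarray(obs)
    N, V = B.shape
    T = len(obs)
    for _ in range(n_iter):
        alpha, beta = forward(A, B, pi, obs), backward(A, B, obs)
        P_O = alpha[-1].sum()
        gamma = alpha * beta / P_O                           # P(q_t = i | O)
        xi = np.zeros((T - 1, N, N))
        for t in range(T - 1):
            xi[t] = (alpha[t][:, None] * A
                     * B[:, obs[t + 1]][None, :]
                     * beta[t + 1][None, :]) / P_O
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]  # expected-count ratio
        for k in range(V):
            B[:, k] = gamma[obs == k].sum(axis=0)
        B /= gamma.sum(axis=0)[:, None]
        pi = gamma[0]
    return A, B, pi

# Example: two hidden states (e.g. Hot/Cold), three symbols (1-3 ice creams -> 0..2)
# rng = np.random.default_rng(0)
# A0 = rng.dirichlet(np.ones(2), size=2); B0 = rng.dirichlet(np.ones(3), size=2)
# A, B, pi = baum_welch(A0, B0, np.array([0.5, 0.5]), [2, 0, 2, 1, 0])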

    21LSA 352 Summer 2007

    The Learning Problem: Caveats

Network structure of HMM is always created by hand; no algorithm for double-induction of optimal structure and probabilities has been able to beat simple hand-built structures.
Always Bakis network = links go forward in time
Subcase of Bakis net: beads-on-string net

Baum-Welch only guaranteed to return local max, rather than global optimum

    22LSA 352 Summer 2007

    Complete Embedded Training

Setting all the parameters in an ASR system
Given:
training set: wavefiles & word transcripts for each sentence
hand-built HMM lexicon

Uses:
Baum-Welch algorithm

    We’ll return to this after we’ve introduced GMMs

    23LSA 352 Summer 2007

    Baum-Welch for Mixture Models

By analogy with ξ earlier, let's define the probability of being in state j at time t with the m-th mixture component accounting for o_t:

$\xi_{tm}(j) = \dfrac{\sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\, c_{jm}\, b_{jm}(o_t)\, \beta_t(j)}{\alpha_F(T)}$

Now,

$\hat{\mu}_{jm} = \dfrac{\sum_{t=1}^{T} \xi_{tm}(j)\, o_t}{\sum_{t=1}^{T} \sum_{k=1}^{M} \xi_{tk}(j)}$

$\hat{c}_{jm} = \dfrac{\sum_{t=1}^{T} \xi_{tm}(j)}{\sum_{t=1}^{T} \sum_{k=1}^{M} \xi_{tk}(j)}$

$\hat{\Sigma}_{jm} = \dfrac{\sum_{t=1}^{T} \xi_{tm}(j)\, (o_t - \mu_{jm})(o_t - \mu_{jm})^{T}}{\sum_{t=1}^{T} \sum_{k=1}^{M} \xi_{tk}(j)}$

    24LSA 352 Summer 2007

    How to train mixtures?

• Choose M (often 16; or can tune M dependent on amount of training observations)
• Then can do various splitting or clustering algorithms
• One simple method for "splitting":
1) Compute global mean µ and global variance
2) Split into two Gaussians, with means µ ± ε (sometimes ε is 0.2σ)
3) Run Forward-Backward to retrain
4) Go to 2 until we have 16 mixtures
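A rough sketch of that splitting schedule, under some illustrative assumptions (diagonal covariances, ε = 0.2σ, one global pool of feature vectors); the retrain argument is a placeholder for the Forward-Backward retraining in step 3, which is not implemented here:

import numpy as np

def split_mixtures(data, M=16, eps_frac=0.2, retrain=None):
    """Grow a diagonal-covariance GMM by repeated mean splitting, from one Gaussian."""
    mu = data.mean(axis=0, keepdims=True)        # 1) global mean (1 x D)
    var = data.var(axis=0, keepdims=True)        #    and global variance
    weights = np.ones(1)
    while len(weights) < M:
        eps = eps_frac * np.sqrt(var)            # 2) split each Gaussian: mu - eps, mu + eps
        mu = np.concatenate([mu - eps, mu + eps], axis=0)
        var = np.concatenate([var, var], axis=0)
        weights = np.concatenate([weights, weights]) / 2.0
        if retrain is not None:                  # 3) e.g. a Forward-Backward pass
            mu, var, weights = retrain(mu, var, weights)
    return mu, var, weights                      # 4) repeat until M mixtures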


    25LSA 352 Summer 2007

    Embedded Training

Components of a speech recognizer:
Feature extraction: not statistical
Language model: word transition probabilities, trained on some other corpus
Acoustic model:
– Pronunciation lexicon: the HMM structure for each word, built by hand
– Observation likelihoods b_j(o_t)
– Transition probabilities a_ij

    26LSA 352 Summer 2007

Embedded training of acoustic model

If we had hand-segmented and hand-labeled training data
with word and phone boundaries
we could just compute the:

B: means and variances of all our triphone Gaussians
A: transition probabilities

And we'd be done!
But we don't have word and phone boundaries, nor phone labeling

    27LSA 352 Summer 2007

    Embedded training

    Instead:We’ll train each phone HMM embedded in an entiresentenceWe’ll do word/phone segmentation and alignmentautomatically as part of training process

    28LSA 352 Summer 2007

    Embedded Training

    29LSA 352 Summer 2007

    Initialization: “Flat start”

Transition probabilities: set to zero any that you want to be "structurally zero"
– The γ probability computation includes the previous value of a_ij, so if it's zero it will never change
Set the rest to identical values

Likelihoods:
initialize µ and σ of each state to the global mean and variance of all training data
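A minimal sketch of a flat start for a single left-to-right (Bakis) HMM with self-loops; the function name and topology are illustrative assumptions, not from any particular toolkit:

import numpy as np

def flat_start(n_states, features):
    """Flat start: structural zeros in A, identical values elsewhere, global Gaussians."""
    A = np.zeros((n_states, n_states))           # zeros stay zero under re-estimation
    for i in range(n_states):
        allowed = [i] if i == n_states - 1 else [i, i + 1]   # self-loop + next state
        A[i, allowed] = 1.0 / len(allowed)       # identical values for allowed arcs
    global_mean = features.mean(axis=0)          # likelihoods: every state starts the same
    global_var = features.var(axis=0)
    means = np.tile(global_mean, (n_states, 1))
    variances = np.tile(global_var, (n_states, 1))
    return A, means, variances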

    30LSA 352 Summer 2007

    Embedded Training

Now we have estimates for A and B
So we just run the EM algorithm
During each iteration, we compute forward and backward probabilities
Use them to re-estimate A and B
Run EM until convergence


    31LSA 352 Summer 2007

    Viterbi training

Baum-Welch training says:
We need to know what state we were in, to accumulate counts of a given output symbol o_t
We'll compute γ_t(i), the probability of being in state i at time t, by using forward-backward to sum over all possible paths that might have been in state i and output o_t.

Viterbi training says:
Instead of summing over all possible paths, just take the single most likely path
Use the Viterbi algorithm to compute this "Viterbi" path
Via "forced alignment"

    32LSA 352 Summer 2007

    Forced Alignment

Computing the "Viterbi path" over the training data is called "forced alignment"
Because we know which word string to assign to each observation sequence.
We just don't know the state sequence.
So we use a_ij to constrain the path to go through the correct words
And otherwise do normal Viterbi
Result: state sequence!

    33LSA 352 Summer 2007

    Viterbi training equations

Viterbi counts (vs. the Baum-Welch expected counts above):

$\hat{a}_{ij} = \dfrac{n_{ij}}{n_i}$

$\hat{b}_j(v_k) = \dfrac{n_j(\text{s.t. } o_t = v_k)}{n_j}$

For all pairs of emitting states, $1 \le i, j \le N$
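A sketch of the counting step for discrete observations, assuming the Viterbi/forced-alignment state paths have already been produced by a decoder (not shown); the counts are then turned into exactly the ratios above:

import numpy as np

def viterbi_counts(state_paths, obs_seqs, N, V):
    """Re-estimate A and B from Viterbi (forced-alignment) state sequences."""
    trans = np.zeros((N, N))                      # n_ij
    emit = np.zeros((N, V))                       # n_j such that o_t = v_k
    for path, obs in zip(state_paths, obs_seqs):
        for t, (state, symbol) in enumerate(zip(path, obs)):
            emit[state, symbol] += 1
            if t + 1 < len(path):
                trans[state, path[t + 1]] += 1
    A_hat = trans / np.maximum(trans.sum(axis=1, keepdims=True), 1)   # n_ij / n_i
    B_hat = emit / np.maximum(emit.sum(axis=1, keepdims=True), 1)     # n_j(o_t=v_k) / n_j
    return A_hat, B_hat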


    37LSA 352 Summer 2007

    Repeating:

    With some rearrangement of terms

    Where:

Note that this looks like a weighted Mahalanobis distance! This may also justify why these aren't really probabilities (point estimates); they are really just distances.

    Log domain
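The formulas on this slide are not reproduced in the transcript. For a diagonal-covariance Gaussian output distribution, the rearrangement being referred to is the standard one:

$\log b_j(o_t) = -\frac{1}{2} \sum_{d=1}^{D} \left[ \log(2\pi\sigma_{jd}^2) + \frac{(o_{td} - \mu_{jd})^2}{\sigma_{jd}^2} \right]$

so, up to a per-state additive constant, the log likelihood is minus a variance-weighted (Mahalanobis) distance $\sum_d (o_{td} - \mu_{jd})^2 / \sigma_{jd}^2$ between the observation and the state mean, which is also why the computation is done in the log domain.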

    38LSA 352 Summer 2007

    State Tying: Young, Odell, Woodland 1994

The steps in creating CD phones:
Start with monophones, do EM training
Then clone Gaussians into triphones
Then build decision tree and cluster Gaussians
Then clone and train mixtures (GMMs)

    39LSA 352 Summer 2007

Summary: Acoustic Modeling for LVCSR

Increasingly sophisticated models
For each state:
Gaussians
Multivariate Gaussians
Mixtures of multivariate Gaussians

Where a state is progressively:
CI phone
CI subphone (3ish per phone)
CD phone (= triphones)
State-tying of CD phones

Forward-Backward training
Viterbi training

    40LSA 352 Summer 2007

    Outline

Disfluencies
Characteristics of disfluencies
Detecting disfluencies
MDE bakeoff
Fragments

    41LSA 352 Summer 2007

    Disfluencies

    42LSA 352 Summer 2007

    Disfluencies: standard terminology (Levelt)

Reparandum: thing repaired
Interruption point (IP): where speaker breaks off
Editing phase (edit terms): uh, I mean, you know
Repair: fluent continuation


    43LSA 352 Summer 2007

    Why disfluencies?

Need to clean them up to get understanding
Does American Airlines offer any one-way flights [uh] one-way fares for 160 dollars?
Delta leaving Boston seventeen twenty one arriving Fort Worth twenty two twenty one forty

Might help in language modeling
Disfluencies might occur at particular positions (boundaries of clauses, phrases)

Annotating them helps readability
Disfluencies cause errors in nearby words

    44LSA 352 Summer 2007

    Counts (from Shriberg, Heeman)

Sentence disfluency rate
ATIS: 6% of sentences disfluent (10% long sentences)
Levelt human dialogs: 34% of sentences disfluent
Swbd: ~50% of multiword sentences disfluent
TRAINS: 10% of words are in reparandum or editing phrase

Word disfluency rate
SWBD: 6%
ATIS: 0.4%
AMEX: 13%
– (human-human air travel)

    45LSA 352 Summer 2007

Prosodic characteristics of disfluencies

Nakatani and Hirschberg 1994
Fragments are good cues to disfluencies
Prosody:
Pause duration is shorter in disfluent silence than fluent silence
F0 increases from end of reparandum to beginning of repair, but only minor change
Repair interval offsets have minor prosodic phrase boundary, even in middle of NP:
– Show me all n- | round-trip flights | from Pittsburgh | to Atlanta

    46LSA 352 Summer 2007

Syntactic Characteristics of Disfluencies

Hindle (1983)
The repair often has the same structure as the reparandum
Both are Noun Phrases (NPs) in this example:

So if we could automatically find the IP, we could find and correct the reparandum!

    47LSA 352 Summer 2007

    Disfluencies and LM

Clark and Fox Tree
Looked at "um" and "uh"
"uh" includes "er" ("er" is just British non-rhotic dialect spelling for "uh")

Different meanings
Uh: used to announce minor delays
– Preceded and followed by shorter pauses
Um: used to announce major delays
– Preceded and followed by longer pauses

    48LSA 352 Summer 2007

Um versus uh: delays (Clark and Fox Tree)


    49LSA 352 Summer 2007

    Utterance Planning

The more difficulty speakers have in planning, the more delays
Consider 3 locations:
I: before intonation phrase: hardest
II: after first word of intonation phrase: easier
III: later: easiest

And then uh somebody said, . [I] but um -- [II] don't you think there's evidence of this, in the twelfth - [III] and thirteenth centuries?

    50LSA 352 Summer 2007

    Delays at different points in phrase

    51LSA 352 Summer 2007

    Disfluencies in language modeling

Should we "clean up" disfluencies before training the LM (i.e. skip over disfluencies)?

Filled pauses
– Does United offer any [uh] one-way fares?
Repetitions
– What what are the fares?
Deletions
– Fly to Boston from Boston
Fragments (we'll come back to these)
– I want fl- flights to Boston.

    52LSA 352 Summer 2007

Does deleting disfluencies improve LM perplexity?

Stolcke and Shriberg (1996)
Build two LMs:
Raw
Removing disfluencies
See which has a higher perplexity on the data.
Filled Pause
Repetition
Deletion
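A toy sketch of that comparison with an add-one-smoothed bigram LM; strip_disfluencies is a hypothetical placeholder for whichever removal rule (filled pause, repetition, deletion) is being tested, and note that the actual study measures perplexity specifically at the word following the disfluency rather than over the whole test set:

import math
from collections import Counter

def train_bigram(sentences):
    """sentences: lists of tokens. Returns (unigram counts, bigram counts, vocab size)."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        toks = ["<s>"] + sent + ["</s>"]
        unigrams.update(toks[:-1])
        bigrams.update(zip(toks[:-1], toks[1:]))
    vocab = {w for s in sentences for w in s} | {"</s>"}
    return unigrams, bigrams, len(vocab)

def perplexity(model, sentences):
    unigrams, bigrams, V = model
    logprob, n = 0.0, 0
    for sent in sentences:
        toks = ["<s>"] + sent + ["</s>"]
        for prev, w in zip(toks[:-1], toks[1:]):
            p = (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)   # add-one smoothing
            logprob += math.log(p)
            n += 1
    return math.exp(-logprob / n)

# raw_lm = train_bigram(train_sentences)
# cleaned_lm = train_bigram([strip_disfluencies(s) for s in train_sentences])
# print(perplexity(raw_lm, test_sentences), perplexity(cleaned_lm, test_sentences))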

    53LSA 352 Summer 2007

Change in Perplexity when Filled Pauses (FP) are removed

LM perplexity goes up at the following word:

Removing filled pauses makes the LM worse!!
I.e., filled pauses seem to help to predict the next word.
Why would that be?

Stolcke and Shriberg 1996

54LSA 352 Summer 2007

Filled pauses tend to occur at clause boundaries

Word before FP is end of previous clause; word after is start of new clause

Best not to delete FP

Some of the different things we're doing [uh] there's not time to do it all
"there's" is very likely to start a sentence
So P(there's | uh) is better estimate than P(there's | doing)


    55LSA 352 Summer 2007

    Suppose we just delete medial FPs

Experiment 2:
Parts of SWBD hand-annotated for clauses
Build FP-model by deleting only medial FPs
Now prediction of post-FP word (perplexity) improves greatly!
Siu and Ostendorf found same with "you know"

    56LSA 352 Summer 2007

    What about REP and DEL

S+S built a model with "cleaned-up" REP and DEL
Slightly lower perplexity
But exact same word error rate (49.5%)
Why?

Rare: only 2 words per 100
Doesn't help because adjacent words are misrecognized anyhow!

    57LSA 352 Summer 2007

Stolcke and Shriberg conclusions wrt LM and disfluencies

Disfluency modeling purely in the LM probably won't vastly improve WER
But:
Disfluencies should be considered jointly with the sentence segmentation task
Need to do better at recognizing disfluent words themselves
Need acoustic and prosodic features

    58LSA 352 Summer 2007

WER reductions from modeling disfluencies + background events

    Rose and Riccardi (1999)

    59LSA 352 Summer 2007

    HMIHY Background events

    Out of 16,159 utterances:

Event               Count
Echoed Prompt       5353
Operator Utt.       5112
Background Speech   3585
Non-Speech Noise    8834
Breath              8048
Lipsmack            2171
Laughter             163
Hesitations          792
Word Fragments      1265
Filled Pauses       7189

    60LSA 352 Summer 2007

    Phrase-based LM

"I would like to make a collect call"  "a [wfrag]"  [brth]  "[brth] and I"


    61LSA 352 Summer 2007

Rose and Riccardi (1999): Modeling "LBEs" does help in WER

ASR Word Accuracy

System Configuration       Test Corpora
HMM       LM               Greeting (HMIHY)   Card Number
Baseline  Baseline         58.7               87.7
LBE       Baseline         60.8               88.1
LBE       LBE              60.8               89.8

    62LSA 352 Summer 2007

    More on location of FPs

Peters: Medical dictation task
Monologue rather than dialogue
In this data, FPs occurred INSIDE clauses
Trigram PP after FP: 367
Trigram PP after word: 51

Stolcke and Shriberg (1996b)
w_k FP w_k+1: looked at P(w_k+1 | w_k)
Transition probabilities lower for these transitions than normal ones

Conclusion:
People use FPs when they are planning difficult things, so following words likely to be unexpected/rare/difficult

    63LSA 352 Summer 2007

    Detection of disfluencies

Nakatani and Hirschberg
Decision tree at the w_i-w_j boundary
Pause duration
Word fragments
Filled pause
Energy peak within w_i
Amplitude difference between w_i and w_j
F0 of w_i
F0 differences
Whether w_i accented

Results: 78% recall / 89.2% precision

    64LSA 352 Summer 2007

    Detection/Correction

Bear, Dowding, Shriberg (1992)
System 1:
Hand-written pattern matching rules to find repairs
– Look for identical sequences of words
– Look for syntactic anomalies ("a the", "to from")
– 62% precision, 76% recall
– Rate of accurately correcting: 57%

    65LSA 352 Summer 2007

Using Natural Language Constraints

Gemini natural language system
Based on Core Language Engine
Full syntax and semantics for ATIS
Coverage of whole corpus:
70% syntax
50% semantics

    66LSA 352 Summer 2007

Using Natural Language Constraints

Gemini natural language system
Run the pattern matcher
For each sentence it returned:
Remove fragment sentences
Leaving 179 repairs, 176 false positives
Parse each sentence
– If it succeeds: mark as false positive
– If it fails:
run the pattern matcher, make corrections
parse again
if it succeeds, mark as repair
if it fails, mark no opinion


    67LSA 352 Summer 2007

    NL Constraints

Syntax Only
Precision: 96%

Syntax and Semantics
Correction: 70%

    68LSA 352 Summer 2007

Recent work: EARS Metadata Evaluation (MDE)

A recent multiyear DARPA bakeoff

Sentence-like Unit (SU) detection:
find end points of SUs
detect subtype (question, statement, backchannel)

Edit word detection:
find all words in the reparandum (words that will be removed)

Filler word detection:
filled pauses (uh, um)
discourse markers (you know, like, so)
editing terms (I mean)

    Interruption point detection

    Liu et al 2003

    69LSA 352 Summer 2007

    Kinds of disfluencies

Repetitions
I * I like it

Revisions
We * I like it

Restarts (false starts)
It's also * I like it

    70LSA 352 Summer 2007

    MDE transcription

Conventions:
./ for statement SU boundaries
<> for fillers
[] for edit words
* for IP (interruption point) inside edits

And wash your clothes wherever you are ./ and [ you ] * you really get used to the outdoors ./

    71LSA 352 Summer 2007

    MDE Labeled Corpora

                        BN      CTS
Training set (words)    182K    484K
Test set (words)        45K     35K
STT WER (%)             11.7    14.9
SU %                    8.1     13.6
Edit word %             1.8     7.4
Filler word %           1.8     6.8

    72LSA 352 Summer 2007

    MDE Algorithms

Use both text and prosodic features
At each interword boundary:
Extract prosodic features (pause length, durations, pitch contours, energy contours)
Use N-gram language model
Combine via HMM, Maxent, CRF, or other classifier
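A schematic sketch of that combination, using scikit-learn's logistic regression as a stand-in for the maxent combiner; the feature names and the lm_boundary_logprob value (N-gram evidence for a boundary at this position) are illustrative assumptions:

from sklearn.linear_model import LogisticRegression

def boundary_features(pause_len, word_dur, f0_delta, energy_delta, lm_boundary_logprob):
    """One feature vector per interword boundary: prosodic cues plus LM evidence."""
    return [pause_len, word_dur, f0_delta, energy_delta, lm_boundary_logprob]

def train_boundary_classifier(feature_rows, labels):
    """labels[i] = 1 if boundary i carries the metadata event (e.g. an SU boundary)."""
    clf = LogisticRegression(max_iter=1000)      # maxent-style combination of the cues
    clf.fit(feature_rows, labels)
    return clf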


    73LSA 352 Summer 2007

    State of the art: SU detection

2 stage:
Decision tree plus N-gram LM to decide boundary
Second maxent classifier to decide subtype

Current error rates:
Finding boundaries
– 40-60% using ASR
– 26-47% using transcripts

    74LSA 352 Summer 2007

State of the art: Edit word detection

Multi-stage model
– HMM combining LM and decision tree finds the IP
– Heuristic rules find the onset of the reparandum
– Separate repetition detector for repeated words

One-stage model
CRF jointly finds the edit region and IP
BIO tagging (each word has a tag for whether it is the beginning of an edit, inside an edit, or outside an edit)

Error rates:
43-50% using transcripts
80-90% using ASR

    75LSA 352 Summer 2007

    Using only lexical cues

3-way classification for each word:
Edit, filler, fluent

Using TBL
Templates: Change
– Word X from L1 to L2
– Word sequence X Y to L1
– Left side of simple repeat to L1
– Word with POS X from L1 to L2 if followed by word with POS Y

    76LSA 352 Summer 2007

    Rules learned

Label all fluent filled pauses as fillers
Label the left side of a simple repeat as an edit
Label "you know" as fillers
Label fluent well's as filler
Label fluent fragments as edits
Label "I mean" as a filler

    77LSA 352 Summer 2007

Error rates using only lexical cues

CTS, using transcripts:
Edits: 68%
Fillers: 18.1%

Broadcast News, using transcripts:
Edits: 45%
Fillers: 6.5%

Using speech:
Broadcast News filler detection goes from 6.5% error to 57.2%

Other systems (using prosody) better on CTS, not on Broadcast News

    78LSA 352 Summer 2007

    Conclusions: Lexical Cues Only

Can do pretty well with only words
(As long as the words are correct)

Much harder to do fillers and fragments from ASR output, since recognition of these is bad


    79LSA 352 Summer 2007

    Fragments

Incomplete or cut-off words:
Leaving at seven fif- eight thirty
uh, I, I d-, don't feel comfortable
You know the fam-, well, the families
I need to know, uh, how- how do you feel…

Uh yeah, yeah, well, it- it- that's right. And it-

SWBD: around 0.7% of words are fragments (Liu 2003)
ATIS: 60.2% of repairs contain fragments (6% of corpus sentences had at least 1 repair) Bear et al (1992)
Another ATIS corpus: 74% of all reparanda end in word fragments (Nakatani and Hirschberg 1994)

    80LSA 352 Summer 2007

    Fragment glottalization

    Uh yeah, yeah, well, it- it- that’s right. And it-

    81LSA 352 Summer 2007

    Why fragments are important

Frequent enough to be a problem:
Only 1% of words / 3% of sentences
But if we miss a fragment, we are likely to get the surrounding words wrong (word segmentation error).
So it could be 3 or more times worse (3% of words, 9% of sentences): a problem!

Useful for finding other repairs
In 40% of SRI-ATIS sentences containing fragments, the fragment occurred at the right edge of a long repair
74% of ATT-ATIS reparanda ended in fragments

Sometimes they are the only cue to the repair
"leaving at eight thirty"

    82LSA 352 Summer 2007

How fragments are dealt with in current ASR systems

In training, throw out any sentences with fragments
In test, get them wrong
Probably get neighboring words wrong too!!!!!!

    83LSA 352 Summer 2007

    Cues for fragment detection

49/50 cases examined ended in silence > 60 msec; average 282 ms (Bear et al)
24 of 25 vowel-final fragments glottalized (Bear et al)

Glottalization: increased time between glottal pulses

75% don't even finish the vowel in the first syllable (i.e., speaker stopped after first consonant) (O'Shaughnessy)

    84LSA 352 Summer 2007

    Cues for fragment detection

Nakatani and Hirschberg (1994)
Word fragments tend to be content words:

Lexical Class    Tokens   Percent
Content          121      42%
Function         12       4%
Untranscribed    155      54%


    85LSA 352 Summer 2007

    Cues for fragment detection

Nakatani and Hirschberg (1994)
91% are one syllable or less:

Syllables   Tokens   Percent
0           113      39%
1           149      52%
2           25       9%
3           1        0.3%

    86LSA 352 Summer 2007

    Cues for fragment detection

Nakatani and Hirschberg (1994)
Fricative-initial common; not vowel-initial:

Class    % words   % frags   % 1-C frags
Fric     33%       45%       73%
Vowel    25%       13%       0%
Stop     23%       23%       11%

    87LSA 352 Summer 2007

Liu (2003): Acoustic-Prosodic detection of fragments

Prosodic features
Duration (from alignments)
– Of word, pause, last rhyme in word
– Normalized in various ways
F0 (from pitch tracker)
– Modified to compute stylized speaker-specific contours
Energy
– Frame-level, modified in various ways

    88LSA 352 Summer 2007

Liu (2003): Acoustic-Prosodic detection of fragments

Voice quality features
Jitter
– A measure of perturbation in the pitch period; Praat computes this
Spectral tilt
– Overall slope of the spectrum
– Speakers modify this when they stress a word
Open Quotient (OQ)
– Ratio of times in which the vocal folds are open to the total length of the glottal cycle
– Can be estimated from the first and second harmonics
– Creaky voice (laryngealization): vocal folds held together, so short open quotient

    89LSA 352 Summer 2007

    Liu (2003)

Use Switchboard, 80%/20% split
Downsampled to 50% fragments, 50% words
Generated forced alignments with gold transcripts
Extract prosodic and voice quality features
Train a decision tree
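A sketch of that training setup with scikit-learn's decision tree standing in for the tree learner; X is assumed to already contain the prosodic and voice-quality features per word, and y marks fragments, so the feature extraction itself is out of scope here:

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_score, recall_score

def train_fragment_detector(X, y):
    """X: per-word prosodic/voice-quality features; y: 1 = fragment, 0 = complete word."""
    # 80%/20% split, mirroring the setup on the slide (class balance handled upstream)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    tree = DecisionTreeClassifier(max_depth=8)
    tree.fit(X_tr, y_tr)
    pred = tree.predict(X_te)
    return tree, precision_score(y_te, pred), recall_score(y_te, pred)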

    90LSA 352 Summer 2007

    Liu (2003) results

    Precision 74.3%, Recall 70.1%

                          hypothesis
                          complete   fragment
reference    complete     109        35
             fragment     43         101


    91LSA 352 Summer 2007

    Liu (2003) features

    Features most queried by DT

Feature                                                        %
Jitter                                                         0.272
Energy slope difference between current and following word    0.241
Ratio between F0 before and after boundary                     0.238
Average OQ                                                     0.147
Position of current turn                                       0.084
Pause duration                                                 0.018

    92LSA 352 Summer 2007

    Liu (2003) conclusion

Very preliminary work
Fragment detection is a good problem that is understudied!

    93LSA 352 Summer 2007

    Fragments in other languages

Mandarin (Chu, Sung, Zhao, Jurafsky 2006)
Fragments cause similar errors as in English:

但我﹣我問的是
"I- I was asking…"

他是﹣卻很﹣活的很好
"He very- lived very well"

    94LSA 352 Summer 2007

    Fragments in Mandarin

Mandarin fragments unlike English; no glottalization.
Instead: mostly (unglottalized) repetitions

    So: best features are lexical, rather than voice quality

    95LSA 352 Summer 2007

    Summary

Disfluencies
Characteristics of disfluencies
Detecting disfluencies
MDE bakeoff
Fragments