TRANSCRIPT
LSA 352 Speech Recognition and Synthesis
Dan Jurafsky
Lecture 6: Feature Extraction and Acoustic Modeling
IP Notice: Various slides were derived from Andrew Ng’s CS 229 notes, as well as lecture notes from Chen, Picheny et al., Yun-Hsuan Sung, and Bryan Pellom. I’ll try to give correct credit on each slide, but I’ll probably miss some.
MFCC
Mel-Frequency Cepstral Coefficients (MFCC)
The most widely used spectral representation in ASR
Windowing
Why divide the speech signal into successive overlapping frames?
Speech is not a stationary signal; we want information about a small enough region that the spectral information is a useful cue.
Frames
Frame size: typically 10-25 ms
Frame shift: the length of time between successive frames, typically 5-10 ms
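As a rough illustration (not from the original slides), here is a minimal NumPy sketch of slicing a signal into overlapping Hamming-windowed frames; the 25 ms / 10 ms values match the typical settings above, and the function and variable names are my own.

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=25, shift_ms=10):
    """Slice a 1-D signal into overlapping Hamming-windowed frames."""
    frame_len = int(sample_rate * frame_ms / 1000)
    frame_shift = int(sample_rate * shift_ms / 1000)
    window = np.hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, frame_shift):
        frames.append(signal[start:start + frame_len] * window)
    return np.array(frames)   # shape: (n_frames, frame_len)
```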
Discrete Fourier Transform: computing a spectrum
A 24 ms Hamming-windowed signal
And its spectrum as computed by DFT (plus other smoothing)
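A minimal sketch (my own, assuming the frame_signal helper above and NumPy's FFT) of computing the magnitude spectrum of one windowed frame.

```python
import numpy as np

def magnitude_spectrum(windowed_frame, n_fft=512):
    """Magnitude spectrum of one windowed frame via the DFT (FFT)."""
    spectrum = np.fft.rfft(windowed_frame, n=n_fft)
    return np.abs(spectrum)   # |X[k]| for k = 0 .. n_fft/2
```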
Mel-scale
Human hearing is not equally sensitive to all frequency bands
Less sensitive at higher frequencies, roughly > 1000 Hz
I.e., human perception of frequency is non-linear
Mel-scale
A mel is a unit of pitch
Definition: pairs of sounds that are perceptually equidistant in pitch are separated by an equal number of mels
Mel-scale is approximately linear below 1 kHz and logarithmic above 1 kHz
Definition: $\text{mel}(f) = 1127 \,\ln\!\left(1 + \frac{f}{700}\right)$
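A small sketch (not from the slides) of that mel mapping and its inverse, using the 1127·ln(1 + f/700) form; the function names are my own.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Mel value of a frequency in Hz: roughly linear below 1 kHz, logarithmic above."""
    return 1127.0 * np.log(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse mapping, from mels back to Hz."""
    return 700.0 * (np.exp(m / 1127.0) - 1.0)

# e.g. hz_to_mel(1000) ~ 1000, hz_to_mel(4000) ~ 2146
```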
Mel Filter Bank Processing
Mel filter bank
Uniformly spaced before 1 kHz
Logarithmic scale after 1 kHz
Log energy computation
Compute the logarithm of the squared magnitude of the output of the Mel filter bank
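A rough NumPy sketch (my own, not from the slides) of building triangular mel-spaced filters and taking the log of the filter-bank energies; it reuses the hz_to_mel/mel_to_hz helpers sketched earlier and assumes a power spectrum from an n_fft-point DFT.

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, sample_rate, f_low=0.0, f_high=None):
    """Triangular filters whose centers are evenly spaced on the mel scale."""
    f_high = f_high or sample_rate / 2.0
    mel_points = np.linspace(hz_to_mel(f_low), hz_to_mel(f_high), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def log_mel_energies(power_spectrum, fbank):
    """Log of the squared-magnitude (power) spectrum passed through the filter bank."""
    energies = fbank @ power_spectrum
    return np.log(np.maximum(energies, 1e-10))   # floor avoids log(0)
```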
Log energy computation
Why log energy?
Logarithm compresses the dynamic range of values
Human response to signal level is logarithmic: humans are less sensitive to slight differences in amplitude at high amplitudes than at low amplitudes
Makes frequency estimates less sensitive to slight variations in input (power variation due to the speaker's mouth moving closer to the mike)
Phase information is not helpful in speech
Dynamic Cepstral Coefficient
The cepstral coefficients do not capture energy
So we add an energy feature
Also, we know that the speech signal is not constant (slope of formants, change from stop burst to release).
So we want to add the changes in features (the slopes).
We call these delta features
We also add double-delta acceleration features
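A minimal sketch (my own; it uses a simple centered two-frame difference rather than the regression formula HTK actually uses) of adding delta and double-delta features to a cepstral feature matrix.

```python
import numpy as np

def delta(features):
    """Frame-to-frame slope, approximated as a centered difference."""
    padded = np.pad(features, ((1, 1), (0, 0)), mode="edge")
    return (padded[2:] - padded[:-2]) / 2.0

def add_dynamic_features(cepstra):
    """Stack static, delta, and double-delta (acceleration) features."""
    d = delta(cepstra)
    dd = delta(d)
    return np.hstack([cepstra, d, dd])   # (n_frames, 3 * n_cepstra)
```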
Typical MFCC features
Window size: 25 ms
Window shift: 10 ms
Pre-emphasis coefficient: 0.97
MFCC:
12 MFCC (mel frequency cepstral coefficients)
1 energy feature
12 delta MFCC features
12 double-delta MFCC features
1 delta energy feature
1 double-delta energy feature
Total 39-dimensional features
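As a rough modern equivalent (not part of the original course material), a 39-dimensional feature matrix of this kind can be assembled with the librosa library; the file name and exact parameter values here are illustrative assumptions.

```python
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)            # hypothetical input file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,          # 12 cepstra + an energy-like C0
                            n_fft=400, hop_length=160)      # 25 ms window, 10 ms shift
d = librosa.feature.delta(mfcc)
dd = librosa.feature.delta(mfcc, order=2)
features = np.vstack([mfcc, d, dd])                          # 39 x n_frames
```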
Why is MFCC so popular?
Efficient to compute
Incorporates a perceptual Mel frequency scale
Separates the source and filter
IDFT (DCT) decorrelates the features
Improves the diagonal assumption in HMM modeling
Alternative: PLP
…and now what?
What we have = what we OBSERVE
A MATRIX OF OBSERVATIONS: O = [O1, O2, …, OT]
Multivariate Gaussians
Vector of means and covariance matrix
$$f(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$

M mixtures of Gaussians:

$$f(x \mid \mu_{jk}, \Sigma_{jk}) = \sum_{k=1}^{M} c_{jk}\, N(x, \mu_{jk}, \Sigma_{jk})$$

$$b_j(o_t) = \sum_{k=1}^{M} c_{jk}\, N(o_t, \mu_{jk}, \Sigma_{jk})$$
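A minimal NumPy sketch (my own) of evaluating that mixture-of-Gaussians output likelihood b_j(o_t) for one state, given mixture weights, mean vectors, and full covariance matrices.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate Gaussian density N(x; mean, cov)."""
    n = len(mean)
    diff = x - mean
    norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm

def state_likelihood(o_t, weights, means, covs):
    """b_j(o_t) = sum_k c_jk * N(o_t; mu_jk, Sigma_jk)."""
    return sum(c * gaussian_pdf(o_t, mu, cov)
               for c, mu, cov in zip(weights, means, covs))
```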
Gaussian Intuitions: Off-diagonal
As we increase the off-diagonal entries, there is more correlation between the value of x and the value of y
Text and figures from Andrew Ng’s lecture notes for CS229
Diagonal covariance
The diagonal contains the variance of each dimension, $\sigma_{ii}^2$
So this means we consider the variance of each acoustic feature (dimension) separately

$$f(x \mid \mu, \sigma) = \prod_{d=1}^{D} \frac{1}{\sigma_d\sqrt{2\pi}} \exp\!\left(-\frac{1}{2}\left(\frac{x_d - \mu_d}{\sigma_d}\right)^2\right)$$

$$f(x \mid \mu, \sigma) = \frac{1}{(2\pi)^{D/2} \prod_{d=1}^{D}\sigma_d} \exp\!\left(-\frac{1}{2}\sum_{d=1}^{D}\frac{(x_d - \mu_d)^2}{\sigma_d^2}\right)$$
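A small sketch (my own) of the diagonal-covariance case, computed in the log domain as is usual in practice to avoid underflow; only a variance vector is stored per component.

```python
import numpy as np

def log_gaussian_diag(x, mean, var):
    """Log N(x; mean, diag(var)) for a diagonal-covariance Gaussian."""
    d = len(mean)
    return -0.5 * (d * np.log(2 * np.pi)
                   + np.sum(np.log(var))
                   + np.sum((x - mean) ** 2 / var))
```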
Gaussian Intuitions: Size of Σ
μ = [0 0], Σ = I;  μ = [0 0], Σ = 0.6I;  μ = [0 0], Σ = 2I
As Σ becomes larger, the Gaussian becomes more spread out; as Σ becomes smaller, the Gaussian becomes more compressed
Text and figures from Andrew Ng’s lecture notes for CS229
Observation “probability”
Text and figures from Andrew Ng’s lecture notes for CS229
How “well” an observation “fits” into a given state
Example: Obs. 3 in State 2

$$b_2(o_3) = \sum_{k=1}^{M} c_{2k}\, N(o_3, \mu_{2k}, \Sigma_{2k})$$
Gaussian PDFs
A Gaussian is a probability density function; probability is the area under the curve.
To make it a probability, we constrain the area under the curve to be 1. BUT…
We will be using “point estimates”: the value of the Gaussian at a point.
Technically these are not probabilities, since a pdf gives a probability over an interval and needs to be multiplied by dx.
BUT… this is OK since the same factor is omitted from all Gaussians (we will use argmax, so it is still correct).
Using Gaussian as an acoustic likelihood estimator
EXAMPLE: Let's suppose our observation was a single real-valued feature (instead of a 33-dimensional vector).
Then, if we had learned a Gaussian over the distribution of values of this feature,
we could compute the likelihood of any given observation ot as follows:

$$b_j(o_t) = \frac{1}{\sqrt{2\pi\sigma_j^2}} \exp\!\left(-\frac{(o_t - \mu_j)^2}{2\sigma_j^2}\right)$$
…what about transition probabilities?
$$P(\text{State}_i \text{ at time } t \mid \text{State}_k \text{ at time } t-1,\ \text{State}_l \text{ at time } t-2, \ldots) \approx P(\text{State}_i \text{ at time } t \mid \text{State}_k \text{ at time } t-1)$$
So, how do we train an HMM?
First decide the topology: number of states, transitions, …
HMM = λ (the transition probabilities A and the observation probabilities B)
In our HTK example (HMM prototypes)
File name MUST be equal to the linguistic unit: si, no, and sil
We will use:
12 states ( 14 in HTK: 12 + 2 start/end)
and two Gaussians (mixtures) per state…
…..with dimension 33.
In our HTK example (HMM prototypes)
So, for example, the file named si will be:
~o <VecSize> 33 <MFCC_0_D_A>
~h "si"
<BeginHMM>
<NumStates> 14
<State> 2 <NumMixes> 2
<Stream> 1
<Mixture> 1 0.6
<Mean> 33 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
<Variance> 33 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
<Mixture> 2 0.4
<Mean> 33 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
<Variance> 33 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
<State> 3 <NumMixes> 2
<Stream> 1
......................................
In our HTK example (HMM prototypes)
......................................
<TransP> 14
 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
 0.0 0.6 0.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
 0.0 0.0 0.6 0.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
 0.0 0.0 0.0 0.6 0.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
 0.0 0.0 0.0 0.0 0.6 0.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
 0.0 0.0 0.0 0.0 0.0 0.6 0.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0
 0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.4 0.0 0.0 0.0 0.0 0.0 0.0
 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.4 0.0 0.0 0.0 0.0 0.0
 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.4 0.0 0.0 0.0 0.0
 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.4 0.0 0.0 0.0
 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.4 0.0 0.0
 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.4 0.0
 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.4
 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
<EndHMM>
…but, how do we train an HMM?
Train = “estimate” the transition and observation probabilities (weights, means, and variances)
HMM = λ

$$b_j(o_t) = \sum_{k=1}^{M} c_{jk}\, N(o_t, \mu_{jk}, \Sigma_{jk})$$

for a particular linguistic unit
…but, how do we train an HMM?
We have some training data for each linguistic unit
…but, how do we train an HMM?
$$\mu_i = \frac{1}{T}\sum_{t=1}^{T} o_t \quad \text{s.t. } o_t \text{ is state } i$$

$$\sigma_i^2 = \frac{1}{T}\sum_{t=1}^{T} (o_t - \mu_i)^2 \quad \text{s.t. } o_t \text{ is state } i$$
Example for a (single) Gaussian characterized by a mean and a variance
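A small sketch (my own, assuming a hard assignment of each frame to a state, e.g. from a forced alignment) of those single-Gaussian estimates over 1-D observations.

```python
import numpy as np

def estimate_state_gaussian(observations, state_ids, state):
    """Mean and variance of the 1-D observations assigned to one state."""
    obs = observations[state_ids == state]
    mu = obs.mean()
    var = ((obs - mu) ** 2).mean()
    return mu, var
```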
…what is the “hidden” sequence of states?
GIVEN: an HMM λ and a sequence of states S
…what is the “hidden” sequence of states?
…and by summing over all possible sequences S
…what is the “hidden” sequence of states?
…and by summing over all possible sequences S
(Baum-Welch)
…so with the optimum sequence of states on the training data..
transition and observation probabilities (weights, means, and variances) can be re-trained … UNTIL CONVERGENCE
…training is usually done using Baum-Welch
Simplifying (too much?): the adaptation formulas weight observations by their probability of occupying a particular state (the occupation probability).
Weighted by occupation probability and normalized.
HTKBook
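A simplified sketch (my own; it takes the state-occupation probabilities gamma as given rather than computing them with the forward-backward algorithm) of the occupation-weighted mean and variance re-estimation alluded to above.

```python
import numpy as np

def reestimate_gaussian(frames, gamma):
    """Occupation-weighted mean/variance update for one state.

    frames: (T, D) observation matrix
    gamma:  (T,)  probability of occupying this state at each frame
    """
    weight = gamma.sum()
    mu = (gamma[:, None] * frames).sum(axis=0) / weight
    var = (gamma[:, None] * (frames - mu) ** 2).sum(axis=0) / weight
    return mu, var
```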
…why not train all HMMs together?
In our example, in each file we have:
…like a global HMM …
…and now what?
For our example: TWO-WORD RECOGNITION PROBLEM
Each word is represented by an HMM (also silence, but…)
HMM for “si” = λ1, HMM for “no” = λ2
GIVEN the OBSERVATION MATRIX: O = [O1, O2, …, OT]
RECOGNITION SHOULD BE BASED ON: IS P(λ1 | O) <> P(λ2 | O) ?
…and now what?
HMM for “si” = λ1, HMM for “no” = λ2
GIVEN the OBSERVATION MATRIX: O = [O1, O2, …, OT]
RECOGNITION SHOULD BE BASED ON: IS P(λ1 | O) <> P(λ2 | O) ?
BUT WHAT WE CAN EVALUATE IS: P(O | λ)
Basics: Bayes' Theorem

$$P(W \mid O) = \frac{P(O \mid W)\, P(W)}{P(O)}$$

P(W) … probability of pronouncing W (a priori)
P(O|W) … given W, the “likelihood” of generating O
P(O) … probability of O

$$\hat{W} = \arg\max_{W} P(O \mid W)\, P(W)$$

P(O|W) … Acoustic Model    P(W) … Language Model

Let's call (Word) W = λ

THIS IS WHAT WE CAN EVALUATE: P(O|W)
THIS IS WHAT WE NEED TO DECIDE: P(W|O)
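A toy sketch (my own; the probability values are placeholders, not real model outputs) of the resulting decision rule, done in log space: pick the word whose acoustic likelihood times language-model prior is largest.

```python
import math

# hypothetical scores for the two-word "si"/"no" task
acoustic_loglik = {"si": -1520.3, "no": -1498.7}               # log P(O | W) from each HMM
language_logprob = {"si": math.log(0.5), "no": math.log(0.5)}  # log P(W)

best_word = max(acoustic_loglik,
                key=lambda w: acoustic_loglik[w] + language_logprob[w])
print(best_word)   # argmax_W  log P(O|W) + log P(W)
```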
Language Models
Deterministic: GRAMMAR
Stochastic: N-Gram
The recognition grammar mainly indicates which words or vocabulary the system will be able to recognize, and how those words are defined in terms of the HMMs it has.
In our case the mapping from vocabulary word to HMM is direct, so the dic file is very simple:
sil sil
si si
no no
(Recognition network: SIL → SI | NO → SIL)
Statistical Language Modeling
• Goal: compute the probability of a sentence or sequence of words: P(W) = P(w1,w2,w3,w4,w5…wn)
• Related task: probability of an upcoming word: P(w5|w1,w2,w3,w4)
• A model that computes either of these: P(W) or P(wn|w1,w2…wn-1) is called a
language model.
How to compute P(W)
• How to compute this joint probability:
P(its, water, is, so, transparent, that)
• Intuition: let’s rely on the Chain Rule of Probability
Reminder: The Chain Rule
• Recall the definition of conditional probabilities: P(B|A) = P(A,B) / P(A)
  Rewriting: P(A,B) = P(A) P(B|A)
• More variables: P(A,B,C,D) = P(A)P(B|A)P(C|A,B)P(D|A,B,C)
• The Chain Rule in General P(x1,x2,x3,…,xn) = P(x1)P(x2|x1)P(x3|x1,x2)…P(xn|x1,…,xn-1)
The Chain Rule applied to compute joint probability of
words in sentence
P(“its water is so transparent”) =
P(its) × P(water|its) × P(is|its water) × P(so|its water is)
× P(transparent|its water is so)
$$P(w_1 w_2 \ldots w_n) = \prod_{i} P(w_i \mid w_1 w_2 \ldots w_{i-1})$$
How to estimate these probabilities
• Could we just count and divide?
• No! Too many possible sentences!
• We'll never see enough data for estimating these
Markov Assumption
• In other words, we approximate each component in the product
$$P(w_1 w_2 \ldots w_n) \approx \prod_{i} P(w_i \mid w_{i-k} \ldots w_{i-1})$$

$$P(w_i \mid w_1 w_2 \ldots w_{i-1}) \approx P(w_i \mid w_{i-k} \ldots w_{i-1})$$
Simplest case: Unigram model
fifth, an, of, futures, the, an, incorporated, a, a, the, inflation, most, dollars, quarter, in, is, mass
thrift, did, eighty, said, hard, 'm, july, bullish
that, or, limited, the
Some automatically generated sentences from a unigram model
$$P(w_1 w_2 \ldots w_n) \approx \prod_{i} P(w_i)$$
Condition on the previous word:
Bigram model
texaco, rose, one, in, this, issue, is, pursuing, growth, in, a, boiler, house, said, mr., gurria, mexico, 's, motion, control, proposal, without, permission, from, five, hundred, fifty, five, yen
outside, new, car, parking, lot, of, the, agreement, reached
this, would, be, a, record, november
$$P(w_i \mid w_1 w_2 \ldots w_{i-1}) \approx P(w_i \mid w_{i-1})$$
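A toy sketch (my own) of estimating such bigram probabilities by maximum likelihood, i.e. counting and normalizing over a corpus; the sentences here are placeholders.

```python
from collections import Counter

corpus = [["its", "water", "is", "so", "transparent"],
          ["the", "water", "is", "cold"]]            # toy corpus

bigrams = Counter()
unigrams = Counter()
for sent in corpus:
    tokens = ["<s>"] + sent + ["</s>"]
    unigrams.update(tokens[:-1])                      # counts of contexts
    bigrams.update(zip(tokens[:-1], tokens[1:]))      # counts of (prev, w) pairs

def p_bigram(w, prev):
    """MLE estimate P(w | prev) = count(prev, w) / count(prev)."""
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

print(p_bigram("is", "water"))   # -> 1.0 in this toy corpus
```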
N-gram models
• We can extend to trigrams, 4-grams, 5-grams
• In general this is an insufficient model of language because language has long-distance
dependencies:
“The computer which I had just put into the machine room on the fifth floor crashed.”
• But we can often get away with N-gram models