
1LSA 352 Summer 2007

LSA 352 Speech Recognition and Synthesis

Dan Jurafsky

Lecture 6: Feature Extraction and Acoustic Modeling

IP Notice: Various slides were derived from Andrew Ng’s CS 229 notes, as well as lecture notes from Chen, Picheny et al, Yun-Hsuan Sung, and Bryan Pellom. I’ll try to give correct credit on each slide, but I’ll prob miss some.

2LSA 352 Summer 2007

MFCC

Mel-Frequency Cepstral Coefficient (MFCC): the most widely used spectral representation in ASR

3LSA 352 Summer 2007

Windowing

Why divide the speech signal into successive overlapping frames?

Speech is not a stationary signal; we want information about a small enough region that the spectral information is a useful cue.

Frames:
Frame size: typically 10-25 ms
Frame shift: the length of time between successive frames, typically 5-10 ms
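A minimal framing sketch in numpy, assuming a 16 kHz mono signal and the 25 ms / 10 ms values used later in these slides (the function and variable names are mine, not from the lecture):

import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=25, shift_ms=10):
    # Split a 1-D signal into overlapping frames (one frame per row).
    frame_len = int(sample_rate * frame_ms / 1000)    # 400 samples at 16 kHz
    frame_shift = int(sample_rate * shift_ms / 1000)  # 160 samples at 16 kHz
    n_frames = 1 + max(0, (len(signal) - frame_len) // frame_shift)
    return np.stack([signal[i * frame_shift : i * frame_shift + frame_len]
                     for i in range(n_frames)])

frames = frame_signal(np.random.randn(16000))  # one second of stand-in audio
print(frames.shape)                            # (98, 400)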

4LSA 352 Summer 2007

Windowing

Slide from Bryan Pellom

5LSA 352 Summer 2007

MFCC

6LSA 352 Summer 2007

Discrete Fourier Transform computing a spectrum

A 24 ms Hamming-windowed signal, and its spectrum as computed by the DFT (plus other smoothing)
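A small sketch of the windowing-plus-DFT step with numpy (the 400-sample frame and 512-point FFT are my assumptions, not values from the slide):

import numpy as np

frame = np.random.randn(400)               # one stand-in frame (25 ms at 16 kHz)
windowed = frame * np.hamming(len(frame))  # taper the edges to reduce leakage
spectrum = np.fft.rfft(windowed, n=512)    # DFT, zero-padded to 512 points
magnitude = np.abs(spectrum)               # |X[k]| over 257 frequency bins
print(magnitude.shape)                     # (257,)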

7LSA 352 Summer 2007

MFCC

8LSA 352 Summer 2007

Mel-scale

Human hearing is not equally sensitive to all frequency bands.
It is less sensitive at higher frequencies, roughly > 1000 Hz.
I.e. human perception of frequency is non-linear:

9LSA 352 Summer 2007

Mel-scale

A mel is a unit of pitch.

Definition: pairs of sounds that are perceptually equidistant in pitch are separated by an equal number of mels.

The mel scale is approximately linear below 1 kHz and logarithmic above 1 kHz.
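The defining formula did not survive the transcript; a commonly used definition (the one in Jurafsky & Martin's textbook) is:

$$\mathrm{mel}(f) = 1127 \ln\!\left(1 + \frac{f}{700}\right) = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right)$$

With this definition 1000 Hz maps to roughly 1000 mels, and higher frequencies are progressively compressed.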

10LSA 352 Summer 2007

Mel Filter Bank Processing

Mel filter bank:
Filters uniformly spaced below 1 kHz
Spaced on a logarithmic scale above 1 kHz
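A sketch of how such a triangular mel filter bank can be built with numpy (26 filters and a 512-point FFT are illustrative choices, not values given in the lecture):

import numpy as np

def hz_to_mel(f):
    return 1127.0 * np.log(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (np.exp(m / 1127.0) - 1.0)

def mel_filterbank(n_filters=26, n_fft=512, sample_rate=16000):
    # Filter centers equally spaced on the mel scale from 0 Hz to Nyquist.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                             n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):       # rising edge of the triangle
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):      # falling edge of the triangle
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

fbank = mel_filterbank()                               # shape (26, 257)
power = np.abs(np.fft.rfft(np.random.randn(400), 512)) ** 2
mel_energies = fbank @ power                           # 26 filter-bank energies
print(mel_energies.shape)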

11LSA 352 Summer 2007

MFCC

12LSA 352 Summer 2007

Log energy computation

Compute the logarithm of the squared magnitude of the output of the Mel filter bank.
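A short sketch of this step, together with the DCT that produces the cepstral coefficients in the next stage of the pipeline (the stand-in filter-bank outputs below are random placeholders):

import numpy as np
from scipy.fftpack import dct

mel_energies = np.abs(np.random.randn(26)) + 1e-3  # stand-in filter-bank outputs
log_mel = np.log(np.maximum(mel_energies, 1e-10))  # log compresses dynamic range
cepstra = dct(log_mel, type=2, norm='ortho')[:13]  # DCT, keep the first 13 cepstra
print(cepstra.shape)                               # (13,)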

13LSA 352 Summer 2007

Log energy computation

Why log energy?

The logarithm compresses the dynamic range of values.

Human response to signal level is logarithmic: humans are less sensitive to slight differences in amplitude at high amplitudes than at low amplitudes.

It makes frequency estimates less sensitive to slight variations in the input (e.g., power variation due to the speaker's mouth moving closer to the microphone).

Phase information is not helpful in speech.

14LSA 352 Summer 2007

MFCC

15LSA 352 Summer 2007

MFCC

16LSA 352 Summer 2007

Dynamic Cepstral Coefficient

The cepstral coefficients do not capture energy

So we add an energy feature

Also, we know that the speech signal is not constant (slope of formants, change from stop burst to release).

So we want to add the changes in features (the slopes).

We call these delta features

We also add double-delta acceleration features

17LSA 352 Summer 2007

Delta and double-delta

Derivative: in order to obtain temporal information
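A sketch of the delta computation using the regression formula given in the HTK Book, with window parameter Θ = 2 (a typical choice; array shapes and names here are my assumptions):

import numpy as np

def deltas(feats, theta=2):
    # feats: (T, D) array of per-frame features; returns the (T, D) deltas
    # d_t = sum_th th*(c_{t+th} - c_{t-th}) / (2 * sum_th th^2).
    T = len(feats)
    padded = np.pad(feats, ((theta, theta), (0, 0)), mode='edge')
    denom = 2 * sum(th * th for th in range(1, theta + 1))
    d = np.zeros_like(feats, dtype=float)
    for t in range(T):
        for th in range(1, theta + 1):
            d[t] += th * (padded[t + theta + th] - padded[t + theta - th])
    return d / denom

cepstra = np.random.randn(100, 13)   # 100 frames of 13 stand-in cepstral features
delta = deltas(cepstra)              # first derivative (delta) features
delta2 = deltas(delta)               # second derivative (double-delta) features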

18LSA 352 Summer 2007

Typical MFCC features

Window size: 25 ms
Window shift: 10 ms
Pre-emphasis coefficient: 0.97

MFCC:
12 MFCC (mel frequency cepstral coefficients)
1 energy feature
12 delta MFCC features
12 double-delta MFCC features
1 delta energy feature
1 double-delta energy feature

Total: 39-dimensional features
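Putting the pieces together into the 39-dimensional vector per frame (a sketch; simple first differences stand in here for the regression deltas shown earlier, and all values are random placeholders):

import numpy as np

T = 100
cepstra = np.random.randn(T, 12)                      # 12 MFCCs per frame
energy = np.random.randn(T, 1)                        # 1 energy value per frame
static = np.hstack([cepstra, energy])                 # 13 static features
delta = np.diff(static, axis=0, prepend=static[:1])   # simple first difference
delta2 = np.diff(delta, axis=0, prepend=delta[:1])    # second difference
features = np.hstack([static, delta, delta2])
print(features.shape)                                 # (100, 39)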

19LSA 352 Summer 2007

Why is MFCC so popular?

Efficient to compute

Incorporates a perceptual Mel frequency scale

Separates the source and filter

IDFT (DCT) decorrelates the features, which improves the diagonal-covariance assumption in HMM modeling

Alternative: PLP

20LSA 352 Summer 2007

…and now what?

What we have = what we OBSERVE

A MATRIX OF OBSERVATIONS: O = [o1, o2, …, oT]

21LSA 352 Summer 2007

22LSA 352 Summer 2007

Multivariate Gaussians

Vector of means μ and covariance matrix Σ:

$$f(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\right)$$

M mixtures of Gaussians:

$$f(x \mid \mu_{jk}, \Sigma_{jk}) = \sum_{k=1}^{M} c_{jk}\, N(x,\ \mu_{jk}, \Sigma_{jk})$$

$$b_{j}(o_t) = \sum_{k=1}^{M} c_{jk}\, N(o_t,\ \mu_{jk}, \Sigma_{jk})$$
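A minimal numpy sketch of b_j(o_t) for a diagonal-covariance mixture, computed in log space for numerical stability (the weights, means and variances below are placeholders, not trained values):

import numpy as np

def log_gauss_diag(o, mean, var):
    # Log of a multivariate Gaussian with diagonal covariance.
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (o - mean) ** 2 / var)

def log_b_j(o, weights, means, variances):
    # log b_j(o) = log sum_k c_jk N(o; mu_jk, Sigma_jk), via log-sum-exp.
    log_terms = [np.log(w) + log_gauss_diag(o, m, v)
                 for w, m, v in zip(weights, means, variances)]
    top = max(log_terms)
    return top + np.log(sum(np.exp(t - top) for t in log_terms))

D, M = 39, 2
o_t = np.random.randn(D)
weights = np.array([0.6, 0.4])
means = np.random.randn(M, D)
variances = np.ones((M, D))
print(log_b_j(o_t, weights, means, variances))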

23LSA 352 Summer 2007

Gaussian Intuitions: Off-diagonal

As we increase the off-diagonal entries, more correlation between value of x and value of y

Text and figures from Andrew Ng’s lecture notes for CS229

24LSA 352 Summer 2007

Diagonal covariance

The diagonal contains the variance of each dimension: $\sigma_{ii}^{2}$

So this means we consider the variance of each acoustic feature (dimension) separately:

$$f(x \mid \mu, \sigma) = \prod_{d=1}^{D} \frac{1}{\sigma_d\sqrt{2\pi}} \exp\!\left(-\frac{1}{2}\left(\frac{x_d-\mu_d}{\sigma_d}\right)^{2}\right)$$

$$f(x \mid \mu, \sigma) = \frac{1}{(2\pi)^{D/2}\sqrt{\prod_{d=1}^{D}\sigma_d^{2}}} \exp\!\left(-\frac{1}{2}\sum_{d=1}^{D}\frac{(x_d-\mu_d)^{2}}{\sigma_d^{2}}\right)$$
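A quick numerical check (not from the slides) that the diagonal-covariance Gaussian above is just a product of one-dimensional Gaussians, one per feature:

import numpy as np

x = np.array([0.3, -1.2, 0.7])
mu = np.array([0.0, -1.0, 1.0])
var = np.array([1.0, 0.5, 2.0])

joint = np.exp(-0.5 * np.sum((x - mu) ** 2 / var)) / \
        np.sqrt((2 * np.pi) ** len(x) * np.prod(var))
product = np.prod(np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var))
print(np.isclose(joint, product))   # True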

25LSA 352 Summer 2007

Gaussian Intuitions: Size of Σ

μ = [0 0], Σ = I;  μ = [0 0], Σ = 0.6I;  μ = [0 0], Σ = 2I
As Σ becomes larger, the Gaussian becomes more spread out; as Σ becomes smaller, the Gaussian becomes more compressed.

Text and figures from Andrew Ng’s lecture notes for CS229

26LSA 352 Summer 2007

Observation “probability”

Text and figures from Andrew Ng’s lecture notes for CS229

How "well" an observation "fits" into a given state.
Example: observation 3 in state 2:

$$b_{2}(o_3) = \sum_{k=1}^{M} c_{2k}\, N(o_3,\ \mu_{2k}, \Sigma_{2k})$$

27LSA 352 Summer 2007

Gaussian PDFs

A Gaussian is a probability density function; probability is the area under the curve. To make it a probability, we constrain the area under the curve to be 1. BUT…

We will be using "point estimates": the value of the Gaussian at a point.

Technically these are not probabilities, since a pdf gives a probability over an interval and needs to be multiplied by dx. BUT… this is OK, since the same factor is omitted from all Gaussians (we will use argmax, so the result is still correct).

28LSA 352 Summer 2007

Gaussian as Probability Density Function

29LSA 352 Summer 2007

Gaussian as Probability Density Function

Point estimate

30LSA 352 Summer 2007

Using Gaussian as an acoustic likelihood estimator

EXAMPLE: Let's suppose our observation was a single real-valued feature (instead of a 33-dimensional vector). Then, if we had learned a Gaussian over the distribution of values of this feature, we could compute the likelihood of any given observation ot as follows:
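The formula itself did not survive the transcript; for a single feature modeled by one Gaussian with mean μj and variance σj², the standard form is:

$$b_j(o_t) = \frac{1}{\sqrt{2\pi\sigma_j^{2}}} \exp\!\left(-\frac{(o_t-\mu_j)^{2}}{2\sigma_j^{2}}\right)$$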

31LSA 352 Summer 2007

…what about transition probabilities?

$$P(\text{State}_i \text{ at time } t \mid \text{State}_k \text{ at } t-1,\ \text{State}_l \text{ at } t-2,\ \ldots)\ \approx\ P(\text{State}_i \text{ at time } t \mid \text{State}_k \text{ at } t-1)$$

32LSA 352 Summer 2007

So, how to train a HMM ??

First decide the topology: number of states, transitions, …

HMM = λ

33LSA 352 Summer 2007

In our HTK example (HMM prototypes)

The file name MUST be equal to the linguistic unit: si, no and sil. We will use:

12 states (14 in HTK: 12 + 2 start/end)

and two Gaussians (mixture components) per state…

…with dimension 33.

34LSA 352 Summer 2007

In our HTK example (HMM prototypes)

So, for example, the file named si will be:

~o <vecsize> 33 <MFCC_0_D_A>
~h "si"

<BeginHMM>
<Numstates> 14
<State> 2 <NumMixes> 2
<Stream> 1
<Mixture> 1 0.6
<Mean> 33
 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
<Variance> 33
 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
<Mixture> 2 0.4
<Mean> 33
 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
<Variance> 33
 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
<State> 3 <NumMixes> 2
<Stream> 1

......................................

35LSA 352 Summer 2007

In our HTK example (HMM prototypes)

......................................

<Transp> 14
 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
 0.0 0.6 0.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
 0.0 0.0 0.6 0.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
 0.0 0.0 0.0 0.6 0.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
 0.0 0.0 0.0 0.0 0.6 0.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
 0.0 0.0 0.0 0.0 0.0 0.6 0.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0
 0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.4 0.0 0.0 0.0 0.0 0.0 0.0
 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.4 0.0 0.0 0.0 0.0 0.0
 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.4 0.0 0.0 0.0 0.0
 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.4 0.0 0.0 0.0
 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.4 0.0 0.0
 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.4 0.0
 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.4
 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

<EndHMM>

36LSA 352 Summer 2007

…but, how do we train a HMM ??

Train = "estimate" transition and observation probabilities (weights, means and variances)

HMM = λ

$$b_{j}(o_t) = \sum_{k=1}^{M} c_{jk}\, N(o_t,\ \mu_{jk}, \Sigma_{jk})$$

for a particular linguistic unit

37LSA 352 Summer 2007

…but, how do we train a HMM ??

We have some training data for each linguistic unit

38LSA 352 Summer 2007

…but, how do we train a HMM ??

Example for a (single) Gaussian characterized by a mean and a variance:

$$\hat{\mu}_i = \frac{1}{T}\sum_{t=1}^{T} o_t \qquad \text{s.t. } o_t \text{ is in state } i$$

$$\hat{\sigma}_i^{2} = \frac{1}{T}\sum_{t=1}^{T}\left(o_t - \hat{\mu}_i\right)^{2} \qquad \text{s.t. } o_t \text{ is in state } i$$
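A small sketch of these estimates, assuming we already know which frames belong to state i (the hard-assignment case the slide describes; all names and values are made up):

import numpy as np

observations = np.random.randn(500)            # one real-valued feature per frame
state_of_frame = np.random.randint(0, 3, 500)  # hard state assignment per frame

i = 1
frames_in_i = observations[state_of_frame == i]
mu_i = frames_in_i.mean()                      # average of the frames in state i
var_i = ((frames_in_i - mu_i) ** 2).mean()     # average squared deviation
print(mu_i, var_i)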

39LSA 352 Summer 2007

…but, how do we train a HMM ??

40LSA 352 Summer 2007

Uniform segmentation is used by HInit

OUTPUT HMMs

41LSA 352 Summer 2007

…what is the "hidden" sequence of states??

GIVEN: an HMM = λ and a sequence of states S

42LSA 352 Summer 2007

…what is the "hidden" sequence of states??

P(O | S) can be easily obtained
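A small sketch of what "easily obtained" means here (toy numbers and names are my assumptions, not course code): once S is fixed, P(O | S) is just the product of the per-frame output likelihoods; multiplying in the transition probabilities would give the joint P(O, S | λ) that is summed over all S on the following slides.

import numpy as np

def log_prob_O_given_S(log_b, states):
    # log P(O | S) = sum over t of log b_{s_t}(o_t), once S is fixed.
    return sum(log_b[t, s] for t, s in enumerate(states))

T, N = 5, 3                             # 5 frames, 3 emitting states
log_b = np.log(np.full((T, N), 0.1))    # toy per-frame output log-likelihoods
print(log_prob_O_given_S(log_b, [0, 0, 1, 1, 2]))   # 5 * log(0.1)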

43LSA 352 Summer 2007

…what is the "hidden" sequence of states??

…and by summing over all possible sequences S

44LSA 352 Summer 2007

…what is the "hidden" sequence of states??

…and by summing over all possible sequences S

(Baum-Welch)

45LSA 352 Summer 2007

…what is the "hidden" sequence of states??

46LSA 352 Summer 2007

…so, with the optimum sequence of states on the training data…

transition and observation probabilities (weights, means and variances) can be re-trained … UNTIL CONVERGENCE

47LSA 352 Summer 2007

HMM training using HRest

OUTPUT HMMs

48LSA 352 Summer 2007

…training is usually done using Baum-Welch

Simplifying (too much..?): the adaptation formulas weight observations by their probability of occupying a particular state (the occupation probability).

Weighted by occupation probability and normalized (HTK Book).
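As a reminder of the shape of these formulas (this is the single-Gaussian form given in the HTK Book, with Lj(t) the occupation probability of state j at time t; a sketch, not the slide's exact content):

$$\hat{\mu}_{j} = \frac{\sum_{t=1}^{T} L_{j}(t)\, o_t}{\sum_{t=1}^{T} L_{j}(t)} \qquad\qquad \hat{\Sigma}_{j} = \frac{\sum_{t=1}^{T} L_{j}(t)\,(o_t-\hat{\mu}_{j})(o_t-\hat{\mu}_{j})^{T}}{\sum_{t=1}^{T} L_{j}(t)}$$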

49LSA 352 Summer 2007

…why not train all the HMMs together?

In our example, in each file we have:

…like a global HMM …

50LSA 352 Summer 2007

HMM training using HERest

OUTPUT HMMs

51LSA 352 Summer 2007

Young et al. state tying

52LSA 352 Summer 2007

Decision tree for clustering triphones for tying

53LSA 352 Summer 2007

…and now what?

For our example: TWO-WORD RECOGNITION PROBLEM

Each word is represented by an HMM (also silence, but…)

HMM for "si" = λ1, HMM for "no" = λ2

GIVEN the OBSERVATION MATRIX: O = [o1, o2, …, oT]

RECOGNITION SHOULD BE BASED ON: IS P(λ1 | O) < or > P(λ2 | O)?

54LSA 352 Summer 2007

…and now what?

HMM for "si" = λ1, HMM for "no" = λ2

GIVEN the OBSERVATION MATRIX: O = [o1, o2, …, oT]

RECOGNITION SHOULD BE BASED ON: IS P(λ1 | O) < or > P(λ2 | O)?

BUT WHAT WE CAN EVALUATE IS: P(O | λ)

55LSA 352 Summer 2007

Basics: Bayes' Theorem

$$P(W \mid O) = \frac{P(O \mid W)\,P(W)}{P(O)}$$

P(W) … probability of pronouncing W (a priori)
P(O|W) … given W, the "likelihood" of generating O
P(O) … probability of O

Let's call the (word) model W = λ

P(W|O) is what we need to decide; P(O|W) is what we can evaluate, so:

$$\hat{W} = \operatorname*{argmax}_{W}\ \underbrace{P(O \mid W)}_{\text{Acoustic Model}}\ \underbrace{P(W)}_{\text{Language Model}}$$
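A toy sketch of this decision rule in log space (the scores and probabilities below are placeholders; only the comparison pattern matters):

import math

log_p_O_given = {'si': -1234.5, 'no': -1250.2}            # acoustic: log P(O | W)
log_p_word = {'si': math.log(0.5), 'no': math.log(0.5)}   # language model: log P(W)

best = max(log_p_O_given, key=lambda w: log_p_O_given[w] + log_p_word[w])
print(best)   # 'si' with these made-up scores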

56LSA 352 Summer 2007

Language Models

Deterministic: GRAMMAR

Stochastic: N-Gram

… of recognition mainly indicates which words, or vocabulary, the system will be able to recognize, and how those words are defined in terms of the HMMs the system has.

In our case the mapping from vocabulary word to HMM is direct, so the dic file is very simple:

sil  sil
si   si
no   no

[Diagram: recognition network with nodes SIL, (SI | NO), SIL]

Statistical Language Modeling

• Goal: compute the probability of a sentence or sequence of words: P(W) = P(w1,w2,w3,w4,w5…wn)

• Related task: probability of an upcoming word: P(w5|w1,w2,w3,w4)

• A model that computes either of these, P(W) or P(wn|w1,w2…wn-1), is called a language model.

How to compute P(W)

• How to compute this joint probability:

P(its, water, is, so, transparent, that)

• Intuition: let’s rely on the Chain Rule of Probability

Reminder: The Chain Rule

• Recall the definition of conditional probabilities: P(B|A) = P(A,B) / P(A)

• Rewriting: P(A,B) = P(A) P(B|A)

• More variables: P(A,B,C,D) = P(A)P(B|A)P(C|A,B)P(D|A,B,C)

• The Chain Rule in General P(x1,x2,x3,…,xn) = P(x1)P(x2|x1)P(x3|x1,x2)…P(xn|x1,…,xn-1)

The Chain Rule applied to compute the joint probability of words in a sentence

P(“its water is so transparent”) =

P(its) × P(water|its) × P(is|its water) × P(so|its water is)

× P(transparent|its water is so)

$$P(w_1 w_2 \ldots w_n) = \prod_{i} P(w_i \mid w_1 w_2 \ldots w_{i-1})$$

How to estimate these probabilities

• Could we just count and divide?

• No! Too many possible sentences!
• We'll never see enough data for estimating these

Markov Assumption

• Simplifying assumption: P(the | its water is so transparent that) ≈ P(the | that)

• Or maybe: P(the | its water is so transparent that) ≈ P(the | transparent that)

Andrei Markov

Markov Assumption

• In other words, we approximate each component in the product

$$P(w_1 w_2 \ldots w_n) \approx \prod_{i} P(w_i \mid w_{i-k} \ldots w_{i-1})$$

$$P(w_i \mid w_1 w_2 \ldots w_{i-1}) \approx P(w_i \mid w_{i-k} \ldots w_{i-1})$$

Simplest case: Unigram model

fifth, an, of, futures, the, an, incorporated, a, a, the, inflation, most, dollars, quarter, in, is, mass

thrift, did, eighty, said, hard, 'm, july, bullish

that, or, limited, the

Some automatically generated sentences from a unigram model

$$P(w_1 w_2 \ldots w_n) \approx \prod_{i} P(w_i)$$

Bigram model

Condition on the previous word:

texaco, rose, one, in, this, issue, is, pursuing, growth, in, a, boiler, house, said, mr., gurria, mexico, 's, motion, control, proposal, without, permission, from, five, hundred, fifty, five, yen

outside, new, car, parking, lot, of, the, agreement, reached

this, would, be, a, record, november

$$P(w_i \mid w_1 w_2 \ldots w_{i-1}) \approx P(w_i \mid w_{i-1})$$

N-gram models

• We can extend to trigrams, 4-grams, 5-grams

• In general this is an insufficient model of language, because language has long-distance dependencies:

“The computer which I had just put into the machine room on the fifth floor crashed.”

• But we can often get away with N-gram models

Estimating bigram probabilities

• The Maximum Likelihood Estimate
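The estimate itself is missing from the transcript; the standard bigram MLE is:

$$P(w_i \mid w_{i-1}) = \frac{\mathrm{count}(w_{i-1}, w_i)}{\mathrm{count}(w_{i-1})}$$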

An example

<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
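A quick way to check the bigram estimates this example leads to (a sketch, not course code): count unigrams and bigrams over the three sentences and divide.

from collections import Counter

sentences = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]
unigrams, bigrams = Counter(), Counter()
for s in sentences:
    words = s.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def p(w, prev):
    return bigrams[(prev, w)] / unigrams[prev]

print(p("I", "<s>"))    # 2/3
print(p("am", "I"))     # 2/3
print(p("Sam", "am"))   # 1/2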

69LSA 352 Summer 2007

The recognition ENGINE