TRANSCRIPT
LSA 352 Speech Recognition and Synthesis
Dan Jurafsky
Lecture 6: Feature Extraction and Acoustic Modeling
IP Notice: Various slides were derived from Andrew Ng’s CS 229 notes, as well as lecture notes from Chen, Picheny et al., Yun-Hsuan Sung, and Bryan Pellom. I’ll try to give correct credit on each slide, but I’ll probably miss some.
MFCC
Mel-Frequency Cepstral Coefficients (MFCC)
The most widely used spectral representation in ASR
Windowing
Why divide the speech signal into successive overlapping frames?
Speech is not a stationary signal; we want information about a small enough region that the spectral information is a useful cue.
Frames
Frame size: typically 10-25 ms
Frame shift: the length of time between successive frames, typically 5-10 ms
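As a rough illustration (not from the original slides), here is a minimal NumPy sketch of slicing a signal into overlapping Hamming-windowed frames; the 25 ms / 10 ms values match the typical settings above, and the function and variable names are my own.

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=25, shift_ms=10):
    """Slice a 1-D signal into overlapping Hamming-windowed frames."""
    frame_len = int(sample_rate * frame_ms / 1000)
    frame_shift = int(sample_rate * shift_ms / 1000)
    window = np.hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, frame_shift):
        frames.append(signal[start:start + frame_len] * window)
    return np.array(frames)   # shape: (n_frames, frame_len)
```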
Discrete Fourier Transform: computing a spectrum
A 24 ms Hamming-windowed signal
And its spectrum as computed by DFT (plus other smoothing)
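A minimal sketch (my own, assuming the frame_signal helper above and NumPy's FFT) of computing the magnitude spectrum of one windowed frame.

```python
import numpy as np

def magnitude_spectrum(windowed_frame, n_fft=512):
    """Magnitude spectrum of one windowed frame via the DFT (FFT)."""
    spectrum = np.fft.rfft(windowed_frame, n=n_fft)
    return np.abs(spectrum)   # |X[k]| for k = 0 .. n_fft/2
```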
Mel-scale
Human hearing is not equally sensitive to all frequency bands
Less sensitive at higher frequencies, roughly > 1000 Hz
I.e., human perception of frequency is non-linear
Mel-scale
A mel is a unit of pitch
Definition: pairs of sounds that are perceptually equidistant in pitch are separated by an equal number of mels
Mel-scale is approximately linear below 1 kHz and logarithmic above 1 kHz
Definition: $\text{mel}(f) = 1127 \,\ln\!\left(1 + \frac{f}{700}\right)$
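A small sketch (not from the slides) of that mel mapping and its inverse, using the 1127·ln(1 + f/700) form; the function names are my own.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Mel value of a frequency in Hz: roughly linear below 1 kHz, logarithmic above."""
    return 1127.0 * np.log(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse mapping, from mels back to Hz."""
    return 700.0 * (np.exp(m / 1127.0) - 1.0)

# e.g. hz_to_mel(1000) ~ 1000, hz_to_mel(4000) ~ 2146
```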
Mel Filter Bank Processing
Mel filter bank
Uniformly spaced before 1 kHz
Logarithmic scale after 1 kHz
Log energy computation
Compute the logarithm of the squared magnitude of the output of the Mel filter bank
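A rough NumPy sketch (my own, not from the slides) of building triangular mel-spaced filters and taking the log of the filter-bank energies; it reuses the hz_to_mel/mel_to_hz helpers sketched earlier and assumes a power spectrum from an n_fft-point DFT.

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, sample_rate, f_low=0.0, f_high=None):
    """Triangular filters whose centers are evenly spaced on the mel scale."""
    f_high = f_high or sample_rate / 2.0
    mel_points = np.linspace(hz_to_mel(f_low), hz_to_mel(f_high), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def log_mel_energies(power_spectrum, fbank):
    """Log of the squared-magnitude (power) spectrum passed through the filter bank."""
    energies = fbank @ power_spectrum
    return np.log(np.maximum(energies, 1e-10))   # floor avoids log(0)
```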
Log energy computation
Why log energy?
Logarithm compresses the dynamic range of values
Human response to signal level is logarithmic: humans are less sensitive to slight differences in amplitude at high amplitudes than at low amplitudes
Makes frequency estimates less sensitive to slight variations in input (power variation due to the speaker's mouth moving closer to the mike)
Phase information is not helpful in speech
Dynamic Cepstral Coefficient
The cepstral coefficients do not capture energy
So we add an energy feature
Also, we know that the speech signal is not constant (slope of formants, change from stop burst to release).
So we want to add the changes in features (the slopes).
We call these delta features
We also add double-delta acceleration features
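A minimal sketch (my own; it uses a simple centered two-frame difference rather than the regression formula HTK actually uses) of adding delta and double-delta features to a cepstral feature matrix.

```python
import numpy as np

def delta(features):
    """Frame-to-frame slope, approximated as a centered difference."""
    padded = np.pad(features, ((1, 1), (0, 0)), mode="edge")
    return (padded[2:] - padded[:-2]) / 2.0

def add_dynamic_features(cepstra):
    """Stack static, delta, and double-delta (acceleration) features."""
    d = delta(cepstra)
    dd = delta(d)
    return np.hstack([cepstra, d, dd])   # (n_frames, 3 * n_cepstra)
```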
Typical MFCC features
Window size: 25 ms
Window shift: 10 ms
Pre-emphasis coefficient: 0.97
MFCC:
12 MFCC (mel frequency cepstral coefficients)
1 energy feature
12 delta MFCC features
12 double-delta MFCC features
1 delta energy feature
1 double-delta energy feature
Total 39-dimensional features
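As a rough modern equivalent (not part of the original course material), a 39-dimensional feature matrix of this kind can be assembled with the librosa library; the file name and exact parameter values here are illustrative assumptions.

```python
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)            # hypothetical input file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,          # 12 cepstra + an energy-like C0
                            n_fft=400, hop_length=160)      # 25 ms window, 10 ms shift
d = librosa.feature.delta(mfcc)
dd = librosa.feature.delta(mfcc, order=2)
features = np.vstack([mfcc, d, dd])                          # 39 x n_frames
```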
Why is MFCC so popular?
Efficient to compute
Incorporates a perceptual Mel frequency scale
Separates the source and filter
IDFT (DCT) decorrelates the features
Improves the diagonal assumption in HMM modeling
Alternative: PLP
…and now what?
What we have = what we OBSERVE
A MATRIX OF OBSERVATIONS: O = [O1, O2, …, OT]
Multivariate Gaussians
Vector of means and covariance matrix
$$f(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$

M mixtures of Gaussians:

$$f(x \mid \mu_{jk}, \Sigma_{jk}) = \sum_{k=1}^{M} c_{jk}\, N(x, \mu_{jk}, \Sigma_{jk})$$

$$b_j(o_t) = \sum_{k=1}^{M} c_{jk}\, N(o_t, \mu_{jk}, \Sigma_{jk})$$
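A minimal NumPy sketch (my own) of evaluating that mixture-of-Gaussians output likelihood b_j(o_t) for one state, given mixture weights, mean vectors, and full covariance matrices.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate Gaussian density N(x; mean, cov)."""
    n = len(mean)
    diff = x - mean
    norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm

def state_likelihood(o_t, weights, means, covs):
    """b_j(o_t) = sum_k c_jk * N(o_t; mu_jk, Sigma_jk)."""
    return sum(c * gaussian_pdf(o_t, mu, cov)
               for c, mu, cov in zip(weights, means, covs))
```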
Gaussian Intuitions: Off-diagonal
As we increase the off-diagonal entries, there is more correlation between the value of x and the value of y
Text and figures from Andrew Ng’s lecture notes for CS229
Diagonal covariance
The diagonal contains the variance of each dimension, $\sigma_{ii}^2$
So this means we consider the variance of each acoustic feature (dimension) separately

$$f(x \mid \mu, \sigma) = \prod_{d=1}^{D} \frac{1}{\sigma_d\sqrt{2\pi}} \exp\!\left(-\frac{1}{2}\left(\frac{x_d - \mu_d}{\sigma_d}\right)^2\right)$$

$$f(x \mid \mu, \sigma) = \frac{1}{(2\pi)^{D/2} \prod_{d=1}^{D}\sigma_d} \exp\!\left(-\frac{1}{2}\sum_{d=1}^{D}\frac{(x_d - \mu_d)^2}{\sigma_d^2}\right)$$
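A small sketch (my own) of the diagonal-covariance case, computed in the log domain as is usual in practice to avoid underflow; only a variance vector is stored per component.

```python
import numpy as np

def log_gaussian_diag(x, mean, var):
    """Log N(x; mean, diag(var)) for a diagonal-covariance Gaussian."""
    d = len(mean)
    return -0.5 * (d * np.log(2 * np.pi)
                   + np.sum(np.log(var))
                   + np.sum((x - mean) ** 2 / var))
```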
Gaussian Intuitions: Size of Σ
μ = [0 0], Σ = I;  μ = [0 0], Σ = 0.6I;  μ = [0 0], Σ = 2I
As Σ becomes larger, the Gaussian becomes more spread out; as Σ becomes smaller, the Gaussian becomes more compressed
Text and figures from Andrew Ng’s lecture notes for CS229
Observation “probability”
Text and figures from Andrew Ng’s lecture notes for CS229
How “well” an observation “fits” into a given state
Example: Obs. 3 in State 2

$$b_2(o_3) = \sum_{k=1}^{M} c_{2k}\, N(o_3, \mu_{2k}, \Sigma_{2k})$$
Gaussian PDFs
A Gaussian is a probability density function; probability is the area under the curve.
To make it a probability, we constrain the area under the curve to be 1. BUT…
We will be using “point estimates”: the value of the Gaussian at a point.
Technically these are not probabilities, since a pdf gives a probability over an interval and needs to be multiplied by dx.
BUT… this is OK since the same factor is omitted from all Gaussians (we will use argmax, so it is still correct).
Using Gaussian as an acoustic likelihood estimator
EXAMPLE: Let's suppose our observation was a single real-valued feature (instead of a 33-dimensional vector).
Then, if we had learned a Gaussian over the distribution of values of this feature,
we could compute the likelihood of any given observation ot as follows:

$$b_j(o_t) = \frac{1}{\sqrt{2\pi\sigma_j^2}} \exp\!\left(-\frac{(o_t - \mu_j)^2}{2\sigma_j^2}\right)$$
…what about transition probabilities?
$$P(\text{State}_i \text{ at time } t \mid \text{State}_k \text{ at time } t-1,\ \text{State}_l \text{ at time } t-2, \ldots) \approx P(\text{State}_i \text{ at time } t \mid \text{State}_k \text{ at time } t-1)$$
So, how do we train an HMM?
First decide the topology: number of states, transitions, …
HMM = λ (the transition probabilities A and the observation probabilities B)
In our HTK example (HMM prototypes)
File name MUST be equal to the linguistic unit: si, no, and sil
We will use:
12 states ( 14 in HTK: 12 + 2 start/end)
and two Gaussians (mixtures) per state…
…..with dimension 33.
In our HTK example (HMM prototypes)
So, for example, the file named si will be:
~o <VecSize> 33 <MFCC_0_D_A>
~h "si"
<BeginHMM>
<NumStates> 14
<State> 2 <NumMixes> 2
<Stream> 1
<Mixture> 1 0.6
<Mean> 33 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
<Variance> 33 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
<Mixture> 2 0.4
<Mean> 33 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
<Variance> 33 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
<State> 3 <NumMixes> 2
<Stream> 1
......................................
In our HTK example (HMM prototypes)
......................................
<TransP> 14
 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
 0.0 0.6 0.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
 0.0 0.0 0.6 0.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
 0.0 0.0 0.0 0.6 0.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
 0.0 0.0 0.0 0.0 0.6 0.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
 0.0 0.0 0.0 0.0 0.0 0.6 0.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0
 0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.4 0.0 0.0 0.0 0.0 0.0 0.0
 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.4 0.0 0.0 0.0 0.0 0.0
 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.4 0.0 0.0 0.0 0.0
 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.4 0.0 0.0 0.0
 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.4 0.0 0.0
 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.4 0.0
 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.4
 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
<EndHMM>
…but, how do we train an HMM?
Train = “estimate” the transition and observation probabilities (weights, means, and variances)
HMM = λ

$$b_j(o_t) = \sum_{k=1}^{M} c_{jk}\, N(o_t, \mu_{jk}, \Sigma_{jk})$$

for a particular linguistic unit
…but, how do we train an HMM?
We have some training data for each linguistic unit
…but, how do we train an HMM?
$$\mu_i = \frac{1}{T}\sum_{t=1}^{T} o_t \quad \text{s.t. } o_t \text{ is state } i$$

$$\sigma_i^2 = \frac{1}{T}\sum_{t=1}^{T} (o_t - \mu_i)^2 \quad \text{s.t. } o_t \text{ is state } i$$
Example for a (single) Gaussian characterized by a mean and a variance
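A small sketch (my own, assuming a hard assignment of each frame to a state, e.g. from a forced alignment) of those single-Gaussian estimates over 1-D observations.

```python
import numpy as np

def estimate_state_gaussian(observations, state_ids, state):
    """Mean and variance of the 1-D observations assigned to one state."""
    obs = observations[state_ids == state]
    mu = obs.mean()
    var = ((obs - mu) ** 2).mean()
    return mu, var
```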
…what is the “hidden” sequence of states?
GIVEN: an HMM λ and a sequence of states S
…what is the “hidden” sequence of states?
…and by summing over all possible sequences S
…what is the “hidden” sequence of states?
…and by summing over all possible sequences S
(Baum-Welch)
…so with the optimum sequence of states on the training data..
transition and observation probabilities (weights, means, and variances) can be re-trained … UNTIL CONVERGENCE
…training is usually done using Baum-Welch
Simplifying (too much?): the adaptation formulas weight observations by their probability of occupying a particular state (the occupation probability).
Weighted by occupation probability and normalized.
HTKBook
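A simplified sketch (my own; it takes the state-occupation probabilities gamma as given rather than computing them with the forward-backward algorithm) of the occupation-weighted mean and variance re-estimation alluded to above.

```python
import numpy as np

def reestimate_gaussian(frames, gamma):
    """Occupation-weighted mean/variance update for one state.

    frames: (T, D) observation matrix
    gamma:  (T,)  probability of occupying this state at each frame
    """
    weight = gamma.sum()
    mu = (gamma[:, None] * frames).sum(axis=0) / weight
    var = (gamma[:, None] * (frames - mu) ** 2).sum(axis=0) / weight
    return mu, var
```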
…why not train all HMMs together?
In our example, in each file we have:
…like a global HMM …
…and now what?
For our example: TWO-WORD RECOGNITION PROBLEM
Each word is represented by an HMM (also silence, but…)
HMM for “si” = λ1, HMM for “no” = λ2
GIVEN the OBSERVATION MATRIX: O = [O1, O2, …, OT]
RECOGNITION SHOULD BE BASED ON: IS P(λ1 | O) <> P(λ2 | O) ?
…and now what?
HMM for “si” = λ1, HMM for “no” = λ2
GIVEN the OBSERVATION MATRIX: O = [O1, O2, …, OT]
RECOGNITION SHOULD BE BASED ON: IS P(λ1 | O) <> P(λ2 | O) ?
BUT WHAT WE CAN EVALUATE IS: P(O | λ)
Basics: Bayes' Theorem

$$P(W \mid O) = \frac{P(O \mid W)\, P(W)}{P(O)}$$

P(W) … probability of pronouncing W (a priori)
P(O|W) … given W, the “likelihood” of generating O
P(O) … probability of O

$$\hat{W} = \arg\max_{W} P(O \mid W)\, P(W)$$

P(O|W) … Acoustic Model    P(W) … Language Model

Let's call (Word) W = λ

THIS IS WHAT WE CAN EVALUATE: P(O|W)
THIS IS WHAT WE NEED TO DECIDE: P(W|O)
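A toy sketch (my own; the probability values are placeholders, not real model outputs) of the resulting decision rule, done in log space: pick the word whose acoustic likelihood times language-model prior is largest.

```python
import math

# hypothetical scores for the two-word "si"/"no" task
acoustic_loglik = {"si": -1520.3, "no": -1498.7}               # log P(O | W) from each HMM
language_logprob = {"si": math.log(0.5), "no": math.log(0.5)}  # log P(W)

best_word = max(acoustic_loglik,
                key=lambda w: acoustic_loglik[w] + language_logprob[w])
print(best_word)   # argmax_W  log P(O|W) + log P(W)
```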
Language Models
Deterministic: GRAMMAR
Stochastic: N-Gram
The recognition grammar mainly indicates which words or vocabulary the system will be able to recognize, and how those words are defined in terms of the HMMs it has.
In our case the mapping from vocabulary word to HMM is direct, so the dic file is very simple:
sil sil
si si
no no
(Recognition network: SIL → SI | NO → SIL)
Statistical Language Modeling
• Goal: compute the probability of a sentence or sequence of words: P(W) = P(w1,w2,w3,w4,w5…wn)
• Related task: probability of an upcoming word: P(w5|w1,w2,w3,w4)
• A model that computes either of these: P(W) or P(wn|w1,w2…wn-1) is called a
language model.
How to compute P(W)
• How to compute this joint probability:
P(its, water, is, so, transparent, that)
• Intuition: let’s rely on the Chain Rule of Probability
Reminder: The Chain Rule
• Recall the definition of conditional probabilities: P(B|A) = P(A,B) / P(A)
  Rewriting: P(A,B) = P(A) P(B|A)
• More variables: P(A,B,C,D) = P(A)P(B|A)P(C|A,B)P(D|A,B,C)
• The Chain Rule in General P(x1,x2,x3,…,xn) = P(x1)P(x2|x1)P(x3|x1,x2)…P(xn|x1,…,xn-1)
The Chain Rule applied to compute joint probability of
words in sentence
P(“its water is so transparent”) =
P(its) × P(water|its) × P(is|its water) × P(so|its water is)
× P(transparent|its water is so)
$$P(w_1 w_2 \ldots w_n) = \prod_{i} P(w_i \mid w_1 w_2 \ldots w_{i-1})$$
How to estimate these probabilities
• Could we just count and divide?
• No! Too many possible sentences!
• We'll never see enough data for estimating these
Markov Assumption
• In other words, we approximate each component in the product
$$P(w_1 w_2 \ldots w_n) \approx \prod_{i} P(w_i \mid w_{i-k} \ldots w_{i-1})$$

$$P(w_i \mid w_1 w_2 \ldots w_{i-1}) \approx P(w_i \mid w_{i-k} \ldots w_{i-1})$$
Simplest case: Unigram model
fifth, an, of, futures, the, an, incorporated, a, a, the, inflation, most, dollars, quarter, in, is, mass
thrift, did, eighty, said, hard, 'm, july, bullish
that, or, limited, the
Some automatically generated sentences from a unigram model
$$P(w_1 w_2 \ldots w_n) \approx \prod_{i} P(w_i)$$
Condition on the previous word:
Bigram model
texaco, rose, one, in, this, issue, is, pursuing, growth, in, a, boiler, house, said, mr., gurria, mexico, 's, motion, control, proposal, without, permission, from, five, hundred, fifty, five, yen
outside, new, car, parking, lot, of, the, agreement, reached
this, would, be, a, record, november
$$P(w_i \mid w_1 w_2 \ldots w_{i-1}) \approx P(w_i \mid w_{i-1})$$
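A toy sketch (my own) of estimating such bigram probabilities by maximum likelihood, i.e. counting and normalizing over a corpus; the sentences here are placeholders.

```python
from collections import Counter

corpus = [["its", "water", "is", "so", "transparent"],
          ["the", "water", "is", "cold"]]            # toy corpus

bigrams = Counter()
unigrams = Counter()
for sent in corpus:
    tokens = ["<s>"] + sent + ["</s>"]
    unigrams.update(tokens[:-1])                      # counts of contexts
    bigrams.update(zip(tokens[:-1], tokens[1:]))      # counts of (prev, w) pairs

def p_bigram(w, prev):
    """MLE estimate P(w | prev) = count(prev, w) / count(prev)."""
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

print(p_bigram("is", "water"))   # -> 1.0 in this toy corpus
```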
N-gram models
• We can extend to trigrams, 4-grams, 5-grams
• In general this is an insufficient model of language because language has long-distance
dependencies:
“The computer which I had just put into the machine room on the fifth floor crashed.”
• But we can often get away with N-gram models