Page 1:

CS 552/652 Speech Recognition with Hidden Markov Models

Winter 2011

Oregon Health & Science University, Center for Spoken Language Understanding

John-Paul Hosom

Lecture 17½: Speaker Adaptation

Notes based on:
Huang, Acero, and Hon (2001), “Spoken Language Processing”, section 9.6
Lee and Gauvain (1993), “Speaker Adaptation Based on MAP Estimation of HMM Parameters”, ICASSP 93
Woodland (2001), “Speaker Adaptation for Continuous Density HMMs: A Review”
Gauvain and Lee (1994), “Maximum A Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains”
Renals (2008), speaker adaptation lecture notes
Lee and Rose (1996), “Speaker Normalization Using Efficient Frequency Warping Procedures”
Panchapagesan and Alwan (2008), “Frequency Warping for VTLN and Speaker Adaptation by Linear Transformation of Standard MFCC”

Page 2:

Speaker Adaptation

Given an HMM that has been trained on a large number of people (a speaker-independent HMM), we can try to improve performance by adapting to the speaker currently being recognized in testing.

Two basic types of speaker adaptation:

1. Adaptation of the feature space (speaker normalization):
   Vocal-Tract Length Normalization (VTLN) = warp the feature space to better fit the model parameters

2. Adaptation of the model parameters:
   Maximum A Posteriori (MAP) adaptation = retrain individual state parameters
   Maximum Likelihood Linear Regression (MLLR) = “warp” model parameters to better fit adaptation data

Page 3:

Speaker Normalization

A common technique is Vocal Tract Length Normalization (VTLN).

Assumption: The majority of speaker differences in the acoustic space are caused by different vocal tract lengths. Different vocal tract lengths can be normalized using a non-linear frequency warping (like the Mel scale, but on a speaker-by-speaker basis).

Performance using VTLN typically improves by a relative error reduction of about 10% (e.g. from 22% WER to 20% WER, or 10% to 9%, or 5% to 4.5%).

Two questions need to be answered to implement VTLN:
1. What type of non-linear warping should be used?
2. How do we determine the optimal parameter value for the non-linear warping during both training and recognition?

Page 4:

Speaker Normalization

With different vocal tract lengths, the resonant frequencies (formants) shift. A shorter vocal tract yields higher formants; a longer vocal tract yields lower formants. But the shift is not a linear function of frequency, so we need to choose a non-linear warping function.

1. What type of non-linear warping?
   • piecewise linear
   • adjustment of the Mel scale
   • power function

Also, what range should the warping parameter cover? If we consider vocal tract length to be correlated with a person’s height, then we can look at variation in height to determine the range of vocal tract lengths. In the U.S., the average male is 5’10” (1.78 m) and has a vocal tract length (VTL) of 17 cm. A tall man might be 6’6”, or 11% taller than average. An average woman is 5’4”, or about 90% of the average male height. A short woman might be 85% of the average male height.
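To make the implied parameter range concrete (this is our arithmetic, not stated on the slide): if warp factors track vocal tract length, and vocal tract length tracks height, the height figures above suggest

$$\alpha \in \left[\,0.85,\; \frac{78\ \text{in}}{70\ \text{in}}\,\right] \approx [\,0.85,\ 1.11\,],$$

which matches the α = 0.85 to 1.10 range used in the warping examples on the following pages.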

Page 5:

Speaker Normalization

warping of Mel scale:

(equation from Huang, Acero, Hon “Spoken Language Processing” 2001 p. 427)

$$\text{Mel}(f) = 2595\,\log_{10}\!\left(1 + \frac{\alpha f}{700}\right)$$

[Figure: the warped Mel scale plotted against frequency, showing the curve for no warping (α = 1) and curves for warping with α from 0.85 to 1.10 in steps of 0.05; the α = 0.85, α = 1.0, and α = 1.10 curves are labeled.]
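As an illustration, here is a minimal Python sketch of this α-warped Mel mapping, assuming the warp factor simply scales frequency before the Mel compression (the function and variable names are ours, not from the lecture):

```python
import numpy as np

def warped_mel(f_hz, alpha=1.0):
    """Alpha-warped Mel scale; alpha = 1.0 gives the standard Mel curve."""
    return 2595.0 * np.log10(1.0 + (alpha * f_hz) / 700.0)

# Compare warped values at 1 kHz for the range of alphas shown on the slide
for alpha in (0.85, 1.0, 1.10):
    print(f"alpha = {alpha:4.2f}: Mel(1000 Hz) = {warped_mel(1000.0, alpha):.1f}")
```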

Page 6:

Speaker Normalization

piecewise linear warping:

(figure from Renals’ ASR lecture, 2008)
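The formula isn’t given on the slide, but a minimal sketch of one common piecewise-linear warp is below, assuming frequencies are scaled by α up to a breakpoint and then interpolated linearly so that the Nyquist frequency maps to itself (the 6 kHz breakpoint is our assumption):

```python
import numpy as np

def piecewise_linear_warp(f_hz, alpha, f_break=6000.0, f_nyq=8000.0):
    """Scale frequency by alpha below f_break; above it, use the line
    that keeps the warp continuous and fixes the endpoint f_nyq."""
    f_hz = np.asarray(f_hz, dtype=float)
    slope = (f_nyq - alpha * f_break) / (f_nyq - f_break)
    return np.where(f_hz <= f_break,
                    alpha * f_hz,
                    alpha * f_break + slope * (f_hz - f_break))

print(piecewise_linear_warp([1000.0, 7000.0, 8000.0], alpha=0.9))
```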

Page 7:

Speaker Normalization

warping by power function:

(figure from Renals’ ASR lecture, 2008)

$$\hat{f} = f\left(\frac{f}{8000}\right)^{\alpha^{3}-1}$$

Page 8:

Speaker Normalization

Actual estimated warping for different vocal tract lengths, based on a two-tube model of four vowels (/ax/, /iy/, /ae/, /aa/; tube parameter values taken from CS551 Lecture 9):

[Figure: formant frequencies for a 17-cm vocal tract (x-axis) plotted against formant frequencies for vocal tract lengths of 85%, 90%, 95%, 100%, 105%, and 110% of 17 cm (y-axis); 85% = 14.4 cm, 100% = 17 cm, 110% = 18.7 cm.]

So the complexity of a non-linear warping actually isn’t warranted; a linear model, or α-warping of the Mel scale, fits the theoretical data well.

Page 9:

Speaker Normalization

2. How do we determine the optimal parameter value during both training and recognition?

• “Grid search”: try 13 regularly-spaced values of α from 0.88 to 1.12, and find the value that maximizes the likelihood of the model (a linear increase in processing time) (Lee and Rose, 1996). A sketch of this search follows below.

• Use a gradient search instead of a grid search.

• Estimate and align (along the frequency scale) formant peaks in the speaker’s data. For example, take the ratio of the median position of the 3rd formant for the current speaker divided by the median F3 averaged over all speakers (Eide and Gish, 1996):

$$\alpha = \frac{\operatorname{median}_{\,speaker}(F_3)}{\operatorname{median}_{\,all\ speakers}(F_3)}$$
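A minimal Python sketch of the grid search, assuming we already have a feature extractor that applies an α-warped filterbank and an HMM scorer that returns a log-likelihood via forced alignment (both function parameters are hypothetical placeholders, not from the lecture):

```python
import numpy as np

def find_best_alpha(waveform, transcript, extract_feats, score_hmm):
    """Grid search over 13 warp factors from 0.88 to 1.12 (step 0.02).

    extract_feats(waveform, alpha) -> feature matrix computed with an
    alpha-warped filterbank; score_hmm(feats, transcript) -> log-likelihood
    of the features given the speaker-independent model.
    """
    best_alpha, best_score = 1.0, -np.inf
    for alpha in np.linspace(0.88, 1.12, 13):
        feats = extract_feats(waveform, alpha)
        score = score_hmm(feats, transcript)
        if score > best_score:
            best_alpha, best_score = alpha, score
    return best_alpha
```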

Page 10:

Maximum a Posteriori (MAP) Adaptation of Model Parameters

If we have some (labeled) data for a speaker, we can adapt our model parameters to better fit that speaker using MAP adaptation.

Sometimes just the means are updated; the covariance matrices are assumed to stay the same, as are the transition probabilities and mixture weights. We also assume that each aspect (means, covariance matrix, etc.) can be treated independently.

Maximum Likelihood estimation:

$$\lambda_{ML} = \underset{\lambda}{\operatorname{argmax}}\ f(\mathbf{O} \mid \lambda)$$

MAP estimation:

$$\lambda_{MAP} = \underset{\lambda}{\operatorname{argmax}}\ f(\mathbf{O} \mid \lambda)\, g(\lambda)$$

where g(λ) is the prior probability distribution of the model over the space of model parameter values. (If we know nothing about g(λ), the prior probability of the model, then MAP reduces to ML estimation.)

[Figure: sketch of the prior g(λ) over the parameter space.]

(Original paper on MAP: Lee and Gauvain, ICASSP 1993.)

Page 11:

Maximum a Posteriori (MAP) Adaptation of Model Parameters

What do we know about g(λ), the prior probability density function of the new model? Usually we don’t know g(λ), so we use maximum-likelihood (EM) training. However, in this case we have an existing, speaker-independent (S.I.) model (known prior information), and we want to learn the model for a specific speaker.

If we assume that the parts of the GMM model (μ, Σ, mixture weights) are independent, we can optimize each of these sub-problems independently.

For the D-dimensional Gaussian distributions characterized by μ and Σ, the prior density g(λ) can be represented with a normal-Wishart density with parameters α > D − 1 and τ > 0. The normal-Wishart pdf also has a vector μ_nw, the mean of the Gaussian of the speaker-independent model, and a matrix S, the covariance matrix from the speaker-independent model.

Page 12:

Maximum a Posteriori (MAP) Adaptation of Model Parameters

Using a Lagrange multiplier, similar to the EM derivation (Lecture 12), applied to this normal-Wishart pdf, the update formula for the means of the model becomes:

$$\hat{\boldsymbol{\mu}}_{ik} = \frac{\tau_{ik}\,\boldsymbol{\mu}^{nw}_{ik} + \sum_{t=1}^{T} \gamma_t(i,k)\,\mathbf{o}_t}{\tau_{ik} + \sum_{t=1}^{T} \gamma_t(i,k)}$$

where o_t = observations for the new speaker, γ_t(i,k) = probabilities computed for the new speaker, and μ^nw_ik, τ_ik = from the S.I. model.

μ^nw_ik is the mean of the S.I. model for state i, component k. τ_ik, the weight of the contribution of prior knowledge (the S.I. model) versus the new observed data (the speaker-dependent data), is determined empirically; it controls the rate of change of μ̂_ik.

γ_t(i,k) is the probability of being in state i and component k at time t, given the speaker-dependent data and model (Lecture 11).

This updating of the means is iterated, just like EM. Each iteration changes the γ_t(i,k) values, and therefore the μ̂_ik values.
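A minimal numpy sketch of this MAP mean update for a single Gaussian component, assuming the posteriors γ_t(i,k) have already been computed by the forward-backward algorithm (function and variable names are ours):

```python
import numpy as np

def map_update_mean(mu_si, obs, gamma, tau=10.0):
    """MAP re-estimation of one Gaussian mean.

    mu_si : (D,)   speaker-independent (prior) mean mu_nw
    obs   : (T, D) adaptation observations o_t for the new speaker
    gamma : (T,)   posteriors gamma_t(i,k) for this state i, component k
    tau   : prior weight; large tau keeps the mean near the S.I. model,
            small tau lets the adaptation data dominate
    """
    num = tau * mu_si + (gamma[:, None] * obs).sum(axis=0)
    den = tau + gamma.sum()
    return num / den
```

As the amount of adaptation data grows, the γ-weighted data terms swamp τ, so the estimate approaches the ML mean of the new data, as the next slide notes.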

Page 13:

Maximum a Posteriori (MAP) Adaptation of Model Parameters

“When τ_ik is large, the prior density is sharply peaked around the values of the seed (S.I.) HMM parameters, which will be only slightly modified by the adaptation process. Conversely, if τ_ik is small, the adaptation will be very fast” (Lee and Gauvain (1993), p. 560). When this weight is small, the effect of the S.I. model is smaller, and the speaker-specific observations dominate the computation.

As the number of observations of the new speaker increases for state i and component k (or, as T approaches infinity), the MAP estimate approaches the ML estimate of the new data, as the new data dominate over the old mean μ^nw_ik.

The same approach can be used to adjust the covariance matrix.

τ_ik can be constrained to be the same for all components in all GMMs and states; a typical value is between 2 and 20.

Page 14:

Maximum a Posteriori (MAP) Adaptation of Model Parameters

“MAP HMM can be regarded as an interpolated model between the speaker-independent and speaker-dependent HMM. Both are derived from the standard ML forward-backward algorithm.” (Huang, p. 447)

How much data is needed? Of course, more is better. Results have been reported for as few as several utterances per new speaker, and for up to 600 utterances per new speaker.

Problem 1: MAP needs a (relatively) large amount of training data from the speaker being adapted to.

Problem 2: each state and component is updated independently. If a speaker produces no data associated with a particular state and component, then that state still uses the S.I. model. It would be nice to update all of the parameters of the model from a small amount of data.

Page 15:

Maximum Likelihood Linear Regression (MLLR)

• The idea behind MLLR is to use a set of linear-regression transformation functions to map the means (and possibly also the covariances) in order to maximize the likelihood of the adaptation data.

• In other words, we want to find some linear transform (of the form ax + b) that warps the mean vectors in such a way that the likelihood of the model given the new data, $L(\mathbf{O}_{new} \mid \lambda)$, is maximized. (In the following, o_t is one frame of O_new.)

• Updating only the means is effective; updating the covariance matrix gives less than an additional 2% error reduction (Huang, p. 450) and so is less commonly done.

• The same transformation can be used for similar GMMs; this sharing allows the entire model to be updated faster and more uniformly.

Page 16:

Maximum Likelihood Linear Regression (MLLR)

The mean vector μ_ik for state i, component k can be transformed using the following equation:

$$\hat{\boldsymbol{\mu}}_{ik} = \mathbf{A}_c\,\boldsymbol{\mu}_{ik} + \mathbf{b}_c$$

where A_c is a regression matrix and b_c is an additive bias vector; A_c and b_c are associated with a broad class of phonemes or a set of tied states (not just an individual state), called c, to better share model parameters.

We want to find A_c and b_c such that the mismatch with the new (speaker-specific) data is smallest. We can re-write this as

$$\hat{\boldsymbol{\mu}}_{ik} = \mathbf{W}_c\,\bar{\boldsymbol{\mu}}_{ik}$$

where μ_ik is rewritten as the extended vector $\bar{\boldsymbol{\mu}}_{ik} = [1,\ \boldsymbol{\mu}_{ik}^{\top}]^{\top}$, and we need to solve for W_c, which contains both A_c and b_c, i.e. W_c = [b_c, A_c].

Page 17:

Maximum Likelihood Linear Regression (MLLR)

Maximizing a Q function by setting its derivative to zero, in the same way as was done in Lecture 12, maximizes the likelihood of the adaptation data (Huang, pp. 448-449); this yields the equation

$$\sum_{t=1}^{T} \sum_{(i,k)\in c} \gamma_t(i,k)\, \boldsymbol{\Sigma}_{ik}^{-1}\, \mathbf{o}_t\, \bar{\boldsymbol{\mu}}_{ik}^{\top} \;=\; \sum_{t=1}^{T} \sum_{(i,k)\in c} \gamma_t(i,k)\, \boldsymbol{\Sigma}_{ik}^{-1}\, \mathbf{W}_c\, \bar{\boldsymbol{\mu}}_{ik}\, \bar{\boldsymbol{\mu}}_{ik}^{\top}$$

which can be re-written as

$$\mathbf{Z} = \mathbf{V}\, \mathbf{W}_c\, \mathbf{D}$$

where

$$\mathbf{Z} = \sum_{t=1}^{T} \gamma_t(i,k)\, \boldsymbol{\Sigma}_{ik}^{-1}\, \mathbf{o}_t\, \bar{\boldsymbol{\mu}}_{ik}^{\top}, \qquad \mathbf{V} = \sum_{t=1}^{T} \gamma_t(i,k)\, \boldsymbol{\Sigma}_{ik}^{-1}, \qquad \mathbf{D} = \bar{\boldsymbol{\mu}}_{ik}\, \bar{\boldsymbol{\mu}}_{ik}^{\top}$$

Page 18:

Maximum Likelihood Linear Regression (MLLR)

If the covariance matrices Σ_ik are diagonal, there is a closed-form solution for W_c:

$$\mathbf{w}_q = \mathbf{z}_q\, \mathbf{G}_q^{-1}$$

where subscript q denotes the qth row of the matrices W_c and Z, and

$$\mathbf{G}_q = \sum_{(i,k)\in c} v^{ik}_{qq}\, \mathbf{D}_{ik}$$

where v_qq^ik denotes the qth diagonal element of V_ik.

We need to make sure that G_q is invertible, by having enough training data. If there is not enough data, we can tie more classes together.

This process can be iterated with new values of γ_t(i,k) and μ̂_ik in each iteration, but usually one iteration gives most of the gain in performance. A sketch of one pass of this estimation follows below.
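A minimal numpy sketch of this row-wise MLLR solution for one regression class with diagonal covariances, accumulating the statistics exactly as defined above (all names are ours, not from the lecture):

```python
import numpy as np

def mllr_transform(means, inv_vars, gammas, obs):
    """Estimate W_c = [b_c, A_c] for one regression class.

    means    : list of (D,) Gaussian means mu_ik in the class
    inv_vars : list of (D,) diagonal inverse variances of Sigma_ik
    gammas   : list of (T,) posteriors gamma_t(i,k), one per Gaussian
    obs      : (T, D) adaptation observations o_t
    Returns W with shape (D, D+1), so that new_mean = W @ [1, mu].
    """
    D = obs.shape[1]
    Z = np.zeros((D, D + 1))
    G = np.zeros((D, D + 1, D + 1))          # one G_q per row q
    for mu, iv, g in zip(means, inv_vars, gammas):
        xi = np.concatenate(([1.0], mu))     # extended mean [1, mu]
        D_ik = np.outer(xi, xi)              # D_ik = xi xi^T
        gsum = g.sum()
        # Z += sum_t gamma_t Sigma^-1 o_t xi^T  (Sigma^-1 is diagonal)
        Z += (iv * (g[:, None] * obs).sum(axis=0))[:, None] * xi[None, :]
        # V_ik = gsum * diag(iv); its qth diagonal element is gsum * iv[q]
        for q in range(D):
            G[q] += gsum * iv[q] * D_ik
    W = np.zeros((D, D + 1))
    for q in range(D):
        W[q] = np.linalg.solve(G[q].T, Z[q])  # w_q = z_q G_q^{-1}
    return W
```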

Page 19:

Maximum Likelihood Linear Regression (MLLR)

Unsupervised adaptation can be done by (a) recognizing with a speaker-independent (S.I.) model, and then (b) assuming that the recognized results are correct and using them as training data for adaptation. In this case, confidence scores (indicating which regions of speech are better recognized) may help constrain the training so that we adapt only to correctly-recognized speech samples.

MLLR and MAP can be combined for (slightly) better performance over either technique alone. Also, the performance improvements from MLLR and VTLN are often approximately additive: for example, a 10% relative WER reduction from VTLN and a 15% relative WER reduction from MLLR in isolation yield approximately a 25% relative WER reduction when both are used (Pye and Woodland, ICASSP 97).
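As a rough check on “approximately additive” (our arithmetic, not from the slide): if the two error reductions instead compounded multiplicatively, the combined relative reduction would be

$$1 - (1 - 0.10)(1 - 0.15) = 1 - 0.765 = 0.235 \approx 23.5\%,$$

so treating the gains as additive (25%) is a close approximation of the compounded figure.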

Page 20:

Maximum Likelihood Linear Regression (MLLR)

One example of combining MAP and MLLR is from the Whisper system: [results figure missing from the transcript] a 15% relative WER reduction using MLLR alone, and a total 22% relative error reduction on 1000 utterances from combined MAP+MLLR. (The speaker-dependent system used for comparison was trained on the 1000 utterances from that speaker.)