
  • Automatic Speech Recognition: From Theory to Practice 1 (T-61.184)

    T-61.184 Automatic Speech Recognition: From Theory to Practice

    http://www.cis.hut.fi/Opinnot/T-61.184/
    October 4, 2004

    Prof. Bryan Pellom
    Department of Computer Science
    Center for Spoken Language Research
    University of Colorado
    [email protected]

  • Automatic Speech Recognition: From Theory to Practice 2 (T-61.184)

    Role of Hidden Markov Models in Speech Recognition

    [Block diagram: Speech → Feature Extraction → Optimization / Search → Text + Timing + Confidence, where the search component draws on the Lexicon, the Language Model, and the Acoustic Model.]

  • Automatic Speech Recognition: From Theory to Practice 3 (T-61.184)

    Motivation for Today’s Topic

    SPEECH

    S P IY CH

  • Automatic Speech Recognition: From Theory to Practice 4 (T-61.184)

    Expectations for this course

    Today: present the basic idea of Hidden Markov Models (HMMs)

    Next week we'll discuss how HMMs are used in practice to model the acoustics of speech

    Don't get lost in the math today: I don't expect you to absorb the theory from these slides alone. If you want to learn about HMMs, it's important to take a look at the resources listed on the next slide.

  • Automatic Speech Recognition: From Theory to Practice 5 (T-61.184)

    Resources for Today's Topic

    L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proc. IEEE, Vol. 77, No. 2, pp. 257-286, February 1989.

    L. R. Rabiner and B. W. Juang, Fundamentals of Speech Recognition, Prentice-Hall, ISBN 0-13-015157-2, 1993 (see Chapter 6).

    The Cambridge HTK Toolkit has a user manual, the "HTK Book". It is an excellent resource and covers many of the equations related to HMM modeling.

  • Automatic Speech Recognition: From Theory to Practice 6 (T-61.184)

    Discrete-Time Markov Process

    Characterized by:

    A set of N states: $S = \{S_0, S_1, \ldots, S_N\}$

    Transition probabilities: $a_{ij} = P(q_t = S_j \mid q_{t-1} = S_i)$

    First-order Markov chain: the current system state depends only on the previous state. Let $q_t$ be the system state at time t; then the transition probability depends only on the previous state:

    $P(q_t = S_j \mid q_{t-1} = S_i, q_{t-2} = S_k, \ldots) = P(q_t = S_j \mid q_{t-1} = S_i)$

    [State diagram: two states $S_0$ and $S_1$ with transition probabilities $a_{00}$, $a_{01}$, $a_{10}$, $a_{11}$.]

  • Automatic Speech Recognition: From Theory to Practice 7 (T-61.184)

    Discrete-Time Markov Process

    Properties:

    $a_{ij} \ge 0 \quad \forall\, i, j$

    $\sum_{j=0}^{N} a_{ij} = 1 \quad \forall\, i$

    "Observable Markov Model": the output of the process is the sequence of states, each of which corresponds to an observable event.

    $A = \begin{bmatrix} a_{00} & a_{01} \\ a_{10} & a_{11} \end{bmatrix}$

    [State diagram: two states $S_0$ and $S_1$ with transition probabilities $a_{00}$, $a_{01}$, $a_{10}$, $a_{11}$.]

  • Automatic Speech Recognition: From Theory to Practice 8 (T-61.184)

    Example 1: Single Fair Coin (Observable Markov Process)

    Outcomes: Head (State 0), Tail (State 1)

    Observed outcomes uniquely define the state sequence,
    e.g., H H H T T T H H T T  →  S = 0 0 0 1 1 1 0 0 1 1

    Transition probabilities:

    $A = \begin{bmatrix} 0.5 & 0.5 \\ 0.5 & 0.5 \end{bmatrix}$

    [State diagram: HEAD ($S_0$) and TAILS ($S_1$) with transitions $a_{00}$, $a_{01}$, $a_{10}$, $a_{11}$.]

  • Automatic Speech Recognition: From Theory to Practice 9 (T-61.184)

    Example 2: Observable Markov Model of Weather

    States: $S_0$ = rainy, $S_1$ = cloudy, $S_2$ = sunny

    With state transition probabilities,

    $A = \{a_{ij}\} = \begin{bmatrix} 0.4 & 0.3 & 0.3 \\ 0.2 & 0.6 & 0.2 \\ 0.1 & 0.1 & 0.8 \end{bmatrix}$

    [State diagram: three fully connected states $S_0$, $S_1$, $S_2$ with transitions $a_{ij}$.]

  • Automatic Speech Recognition: From Theory to Practice 10 (T-61.184)

    Observable Markov Model of Weather

    What is the probability that the weather for 8 consecutive days is "sun, sun, sun, rain, rain, sun, cloudy, sun"?

    Solution:
    Observation sequence: O = {sun, sun, sun, rain, rain, sun, cloudy, sun}
    Corresponding state sequence: S = {2, 2, 2, 0, 0, 2, 1, 2}
    Want to determine P(O | model):
    P(O | model) = P(S = {2, 2, 2, 0, 0, 2, 1, 2} | model)

  • Automatic Speech Recognition: From Theory to Practice 11 (T-61.184)

    Observable Markov Model of Weather

    $P(O \mid \text{model}) = P(S = \{2,2,2,0,0,2,1,2\} \mid \text{model})$

    $= P(q_1 = 2)\, P(q_2 = 2 \mid q_1 = 2) \cdots P(q_8 = 2 \mid q_7 = 1)$

    $= \pi_2 \cdot a_{22}\, a_{22}\, a_{20}\, a_{00}\, a_{02}\, a_{21}\, a_{12}$

    $= \pi_2 \cdot (0.8)^2 (0.1)(0.4)(0.3)(0.1)(0.2)$

    where $\pi_i = P(q_1 = i)$ is the initial state probability, and

    $A = \{a_{ij}\} = \begin{bmatrix} 0.4 & 0.3 & 0.3 \\ 0.2 & 0.6 & 0.2 \\ 0.1 & 0.1 & 0.8 \end{bmatrix}$
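
    A minimal NumPy sketch of this calculation. The helper name markov_sequence_prob is mine, and the initial distribution (probability 1 on the observed first state, "sunny") is an assumption taken from Rabiner's tutorial rather than stated on the slide:

```python
import numpy as np

# Transition matrix from the weather example: states 0 = rainy, 1 = cloudy, 2 = sunny.
A = np.array([[0.4, 0.3, 0.3],
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])

def markov_sequence_prob(states, A, pi):
    """P(S | model) for an observable Markov chain: pi[s_1] * prod_t a[s_{t-1}, s_t]."""
    p = pi[states[0]]
    for prev, cur in zip(states[:-1], states[1:]):
        p *= A[prev, cur]
    return p

pi = np.array([0.0, 0.0, 1.0])            # assumed: start in the observed state "sunny"
S = [2, 2, 2, 0, 0, 2, 1, 2]              # sun, sun, sun, rain, rain, sun, cloudy, sun
print(markov_sequence_prob(S, A, pi))     # 1.536e-04
```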

  • Automatic Speech Recognition: From Theory to Practice 12 (T-61.184)

    "Hidden" Markov Models

    Observations are a probabilistic function of the state

    Underlying state sequence is not observable (hidden)

    First-order assumption

    Output independence: observations depend only on the state that generated them, not on each other.

  • Automatic Speech Recognition: From Theory to Practice 13 (T-61.184)

    Example: 2 Coins, Hidden Markov Process

    Observations: Head, Tail

    Observed outcomes do not uniquely define the state sequence

    Emission probabilities: state $S_0$ has $P(H) = P_1$, $P(T) = 1 - P_1$; state $S_1$ has $P(H) = P_2$, $P(T) = 1 - P_2$

    Transition probabilities:

    $A = \begin{bmatrix} a_{00} & 1 - a_{00} \\ 1 - a_{11} & a_{11} \end{bmatrix}$

    [State diagram: two states $S_0$ and $S_1$, one per coin, with transitions $a_{00}$, $a_{01}$, $a_{10}$, $a_{11}$.]

  • Automatic Speech Recognition: From Theory to Practice 14 (T-61.184)

    Urn-and-Ball Illustration

    A Genie plays a game with a blind observer:
    1. Genie initially selects an urn at random
    2. Genie picks a colored ball and tells the blind observer the color
    3. Genie puts the ball back in the urn
    4. Genie moves to the next urn based on a random selection procedure from the current urn
    5. Steps 2-4 are repeated

  • Automatic Speech Recognition: From Theory to Practice 15 (T-61.184)

    Urn-and-Ball Illustration

    Observations: the color sequence of the balls

    States: the identity of the urn

    State transitions: the process of selecting the urns

  • Automatic Speech Recognition: From Theory to Practice 16 (T-61.184)

    Urn-and-Ball Illustration

    Observation sequence: R B Y Y G B Y G R …
    Individual colors (observations) don't reveal the urn (state)

    Per-urn color probabilities:

    Urn 1: P(R) = 0.20, P(G) = 0.30, P(B) = 0.10, P(Y) = 0.60
    Urn 2: P(R) = 0.45, P(G) = 0.15, P(B) = 0.20, P(Y) = 0.20
    Urn 3: P(R) = 0.15, P(G) = 0.70, P(B) = 0.10, P(Y) = 0.05

  • Automatic Speech Recognition: From Theory to Practice 17 (T-61.184)

    Discrete Symbol Observation HMM

    A set of N states: $S = \{S_0, S_1, \ldots, S_N\}$

    Transition probabilities: $a_{ij} = P(q_t = S_j \mid q_{t-1} = S_i)$

    A set of M observation symbols: $V = \{v_1, v_2, \ldots, v_M\}$

    Probability distribution (for state j, symbol k): $b_j(k) = P(o_t = v_k \mid q_t = j)$

    [State diagram: states $S_i$ and $S_j$ with transitions $a_{ii}$, $a_{ij}$, $a_{ji}$, $a_{jj}$; each state carries an emission distribution $[P(v_1 \mid S_i), P(v_2 \mid S_i), \ldots, P(v_M \mid S_i)]$.]

  • Automatic Speech Recognition: From Theory to Practice 18 (T-61.184)

    Discrete Symbol Observation HMM

    Also characterized by an initial state distribution: $\pi = \{\pi_i\}, \quad \pi_i = P(q_1 = i)$

    Specification thus requires:
    2 model parameters, N and M
    Specification of the M symbols
    3 probability measures: A, B, π

    Compactly, $\lambda = (A, B, \pi)$

    [State diagram: as on the previous slide, states $S_i$ and $S_j$ with transition probabilities and per-state emission distributions.]

  • Automatic Speech Recognition: From Theory to Practice 19 (T-61.184)

    Left-to-Right HMMs

    $A = \{a_{ij}\} = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}$

    In a left-to-right topology, transitions only move forward through the states, so $a_{ij} = 0$ for $j < i$.

  • Automatic Speech Recognition: From Theory to Practice 20 (T-61.184)

    Ergodic HMMs

    $a_{ij} > 0 \quad \forall\, i, \forall\, j$

    (Every state can be reached from every other state in a single step.)

    [State diagram: three fully connected states $S_1$, $S_2$, $S_3$ with all transitions $a_{ij}$.]

  • Automatic Speech Recognition: From Theory to Practice 21 (T-61.184)

    HMM as a Symbol Generator

    Given parameters N, M, A, B, and π, an HMM can generate an observation sequence $O = \{o_1, o_2, \ldots, o_T\}$:

    1. Choose an initial state $q_1 = S_i$ from the initial state distribution π
    2. Set $t = 1$
    3. Choose $o_t = v_k$ according to the symbol distribution of the current state, $b_i(k)$
    4. Transition to the next state $q_{t+1} = S_j$ according to the state-transition probability distribution $a_{ij}$
    5. Set $t = t + 1$ and go to step 3

    A code sketch of this generation procedure is given below.
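
    A minimal NumPy sketch of the generation procedure above; the function name sample_hmm and the array-shape conventions are assumptions, not something given on the slides:

```python
import numpy as np

def sample_hmm(A, B, pi, T, rng=None):
    """Generate an observation sequence O = o_1 .. o_T from a discrete-symbol HMM.

    A  : (N, N) state-transition matrix, A[i, j] = a_ij
    B  : (N, M) emission matrix, B[j, k] = b_j(k)
    pi : (N,)   initial state distribution
    """
    rng = np.random.default_rng() if rng is None else rng
    states, obs = [], []
    q = rng.choice(len(pi), p=pi)                   # step 1: draw q_1 from pi
    for _ in range(T):                              # steps 2-5
        obs.append(rng.choice(B.shape[1], p=B[q]))  # step 3: emit o_t ~ b_q(.)
        states.append(q)
        q = rng.choice(len(A), p=A[q])              # step 4: q_{t+1} ~ a_{q, .}
    return np.array(states), np.array(obs)
```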

  • Automatic Speech Recognition: From Theory to Practice 22 (T-61.184)

    HMM as a Symbol Generator

    Example output from symbol generation:

    Time, t:      1    2    3    4    5    6    …   T
    State:        q1   q2   q3   q4   q5   q6   …   qT
    Observation:  o1   o2   o3   o4   o5   o6   …   oT

    We can think of the HMM as generating the observation sequence as it transitions from state to state.

  • Automatic Speech Recognition: From Theory to Practice 23 (T-61.184)

    Three Interesting Problems

    Problem 1: Scoring & Evaluation
    How to efficiently compute the probability of an observation sequence (O) given a model (λ)? i.e., P(O | λ)

    Problem 2: Decoding
    Given an observation sequence (O) and a model (λ), how do we determine the corresponding state sequence (q) that "best explains" how the observations were generated?

    Problem 3: Training
    How to adjust the model parameters (λ = {A, B, π}) to maximize the probability of generating a given observation sequence? i.e., maximize P(O | λ).

  • Automatic Speech Recognition: From Theory to Practice 24 (T-61.184)

    Problem 1: Scoring and Evaluation

    Given an observation sequence $O = \{o_1, o_2, \ldots, o_T\}$,

    we want to compute the probability of generating it: $P(O \mid \lambda)$.

    Let's assume a particular sequence of states, $q = \{q_1, q_2, \ldots, q_T\}$.

  • Automatic Speech Recognition: From Theory to Practice 25 (T-61.184)

    Problem 1: Scoring and Evaluation

    Using the chain rule we can decompose the problem by summing over all possible state sequences,

    $P(O \mid \lambda) = \sum_{\text{all } q} P(O \mid q, \lambda)\, P(q \mid \lambda)$

    The first term is the likelihood of generating the observed symbol sequence given the assumed state sequence.

    The second term is the probability that the system steps through that sequence of states.

  • Automatic Speech Recognition: From Theory to Practice 26 (T-61.184)

    Problem 1: Scoring and Evaluation

    Probability of the observation sequence given the state sequence:

    $P(O \mid q, \lambda) = \prod_{t=1}^{T} p(o_t \mid q_t, \lambda) = b_{q_1}(o_1)\, b_{q_2}(o_2) \cdots b_{q_T}(o_T)$

    Probability of the state sequence:

    $P(q \mid \lambda) = \pi_{q_1}\, a_{q_1 q_2}\, a_{q_2 q_3} \cdots a_{q_{T-1} q_T}$

  • Automatic Speech Recognition: From Theory to Practice 27 (T-61.184)

    Problem 1: Scoring and Evaluation

    Using the chain rule,

    $P(O \mid \lambda) = \sum_{\text{all } q} P(O \mid q, \lambda)\, P(q \mid \lambda) = \sum_{\text{all } q} \pi_{q_1} b_{q_1}(o_1)\, a_{q_1 q_2} b_{q_2}(o_2) \cdots a_{q_{T-1} q_T} b_{q_T}(o_T)$

    This is not practical to compute directly: with N states and T observations there are $N^T$ possible state sequences, requiring on the order of $2T \cdot N^T$ operations.

  • Automatic Speech Recognition: From Theory to Practice 28 (T-61.184)

    Forward Algorithm

    Definition:

    $\alpha_t(i) = P(o_1 o_2 \cdots o_t, q_t = i \mid \lambda)$

    (Probability of seeing observations $o_1$ to $o_t$ and ending at state i, given HMM λ)

    1. Initialization: $\alpha_0(i) = \pi_i$

    2. Induction: $\alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i)\, a_{ij} \right] b_j(o_{t+1})$

    3. Termination: $P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$

  • Automatic Speech Recognition: From Theory to Practice 29 (T-61.184)

    Forward Algorithm Illustration

    [Trellis diagram: each node $\alpha_{t+1}(j)$ at time t+1 is computed from all N nodes $\alpha_t(1), \ldots, \alpha_t(N)$ at time t via $\alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i)\, a_{ij} \right] b_j(o_{t+1})$.]

    Complexity: $O(T \cdot N^2)$

  • Automatic Speech Recognition: From Theory to Practice 30 (T-61.184)

    Forward Algorithm Example

    Two-state HMM with discrete observations "A" and "B":
    transitions $a_{00} = 0.6$, $a_{01} = 0.4$, $a_{10} = 0.0$, $a_{11} = 1.0$;
    emissions $b_0(A) = 0.8$, $b_0(B) = 0.2$, $b_1(A) = 0.3$, $b_1(B) = 0.7$;
    initial distribution $\pi = (1.0,\ 0.0)$.

    Given this HMM, what is the probability of generating the sequence O = {A, A, B}?

    In other words, find P(O = {A, A, B} | λ).

  • Automatic Speech Recognition: From Theory to Practice 31 (T-61.184)

    Forward Algorithm Example

    Forward trellis for O = {A, A, B} (same HMM as on the previous slide):

            t=0    t=1     t=2     t=3
    S0      1.0    0.48    0.23    0.03
    S1      0.0    0.12    0.09    0.13

    (e.g., $\alpha_1(S_0) = 1.0 \cdot 0.6 \cdot 0.8 = 0.48$ and $\alpha_1(S_1) = 1.0 \cdot 0.4 \cdot 0.3 = 0.12$)

    Answer: P(O | λ) = 0.03 + 0.13 = 0.16
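
    A minimal NumPy sketch of the forward algorithm, using the slides' convention $\alpha_0(i) = \pi_i$ (no emission at t = 0). The function name and the reading $a_{10} = 0$ (no arc from S1 back to S0 is drawn) are my assumptions:

```python
import numpy as np

def forward(A, B, pi, obs):
    """Forward algorithm with the slides' convention alpha_0(i) = pi_i."""
    N, T = len(pi), len(obs)
    alpha = np.zeros((T + 1, N))
    alpha[0] = pi                                    # initialization
    for t in range(T):                               # induction
        alpha[t + 1] = (alpha[t] @ A) * B[:, obs[t]]
    return alpha[-1].sum(), alpha                    # termination: sum_i alpha_T(i)

# HMM from the example (a_10 = 0.0 is assumed since no S1 -> S0 arc is drawn).
A  = np.array([[0.6, 0.4],
               [0.0, 1.0]])
B  = np.array([[0.8, 0.2],      # state S0: P(A), P(B)
               [0.3, 0.7]])     # state S1: P(A), P(B)
pi = np.array([1.0, 0.0])
obs = [0, 0, 1]                 # the sequence A, A, B

p, alpha = forward(A, B, pi, obs)
print(np.round(alpha, 2))       # reproduces the trellis values above (0.48, 0.12, ...)
print(round(p, 3))              # ~0.158, i.e. 0.03 + 0.13 = 0.16 after rounding
```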

  • Automatic Speech Recognition: From Theory to Practice 32 (T-61.184)

    Backward Algorithm

    Definition:

    $\beta_t(i) = P(o_{t+1} o_{t+2} \cdots o_T \mid q_t = i, \lambda)$

    (Probability of the observation sequence $o_{t+1}$ to $o_T$, given state i at time t and HMM λ)

    Initialization: $\beta_T(i) = 1$

    Induction: $\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j), \qquad t = T-1, T-2, \ldots, 1, \quad 1 \le i \le N$

  • Automatic Speech Recognition: From Theory to Practice 33 (T-61.184)

    Backward Algorithm Illustration

    [Trellis diagram: each node $\beta_t(i)$ at time t is computed from the N nodes $\beta_{t+1}(1), \ldots, \beta_{t+1}(N)$ at time t+1 via $\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)$.]

    Complexity: $O(T \cdot N^2)$

  • Automatic Speech Recognition: From Theory to Practice 34 (T-61.184)

    Backward Algorithm Example

    Backward trellis for O = {A, A, B} (same HMM as before), with $\pi_0 = 1$, $\pi_1 = 0$:

            t=0     t=1     t=2     t=3
    S0      0.16    0.28    0.40    1.0
    S1      0.00    0.21    0.70    1.0

    (The t=0 entries shown are $\pi_i\, \beta_0(i)$; they sum to $P(O \mid \lambda) \approx 0.16$, matching the forward algorithm.)
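
    A matching sketch of the backward algorithm under the same convention; reusing A, B, pi, and obs from the forward sketch above, it should again give P(O | λ) ≈ 0.158:

```python
import numpy as np

def backward(A, B, pi, obs):
    """Backward algorithm; with the slides' convention, P(O | lambda) = sum_i pi_i * beta_0(i)."""
    N, T = len(pi), len(obs)
    beta = np.zeros((T + 1, N))
    beta[T] = 1.0                                    # initialization: beta_T(i) = 1
    for t in range(T - 1, -1, -1):                   # induction down to t = 0
        beta[t] = A @ (B[:, obs[t]] * beta[t + 1])
    return float((pi * beta[0]).sum()), beta

# With A, B, pi, obs from the forward sketch, beta reproduces 0.40/0.70, 0.28/0.21, ...
# and (pi * beta[0]).sum() is again ~0.158.
```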

  • Automatic Speech Recognition: From Theory to Practice 35 (T-61.184)

    Problem 1: Scoring and Evaluation

    In fact, there are two ways to compute $P(O \mid \lambda)$:

    Forward algorithm: $P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$

    Backward algorithm: $P(O \mid \lambda) = \sum_{i=1}^{N} \pi_i\, \beta_0(i)$

    Computation of each is $O(N^2 T)$.

  • Automatic Speech Recognition: From Theory to Practice 36 (T-61.184)

    Problem 2: Decoding

    Given an observation sequence $O = \{o_1, o_2, \ldots, o_T\}$,

    find the single best sequence of states $q = \{q_1, q_2, \ldots, q_T\}$

    which maximizes $P(O, q \mid \lambda)$.

  • Automatic Speech Recognition: From Theory to Practice 37 (T-61.184)

    Solution: Viterbi Algorithm

    Define:

    $\delta_t(i) = \max_{q_1, q_2, \ldots, q_{t-1}} P(q_1 q_2 \cdots q_{t-1}, q_t = i, o_1 o_2 \cdots o_t \mid \lambda)$

    "The highest probability along a single state sequence that accounts for observations 1 through t and ends in state i at time t."

    $\delta_{t+1}(j) = \left[ \max_i \delta_t(i)\, a_{ij} \right] \cdot b_j(o_{t+1})$

  • Automatic Speech Recognition: From Theory to Practice 38 (T-61.184)

    Viterbi Algorithm Illustration

    [Trellis diagram: each node $\delta_{t+1}(j)$ at time t+1 keeps only the best predecessor at time t: $\delta_{t+1}(j) = \max_{1 \le i \le N} [\delta_t(i)\, a_{ij}]\, b_j(o_{t+1})$.]

  • Automatic Speech Recognition: From Theory to Practice 39 (T-61.184)

    Viterbi Algorithm

    1. Initialization: $\delta_1(i) = \pi_i\, b_i(o_1)$, $\quad \psi_1(i) = 0$

    2. Recursion: $\delta_t(j) = \max_{1 \le i \le N} [\delta_{t-1}(i)\, a_{ij}]\, b_j(o_t)$, $\quad \psi_t(j) = \arg\max_{1 \le i \le N} [\delta_{t-1}(i)\, a_{ij}]$

    3. Termination: $P^{*} = \max_{1 \le i \le N} [\delta_T(i)]$, $\quad q_T^{*} = \arg\max_{1 \le i \le N} [\delta_T(i)]$

    4. Path backtrace: $q_t^{*} = \psi_{t+1}(q_{t+1}^{*})$

  • Automatic Speech Recognition: From Theory to Practice 40 (T-61.184)

    Viterbi Algorithm Illustration

    Viterbi trellis for O = {A, A, B} (same HMM as in the forward example):

            t=0    t=1     t=2     t=3
    S0      1.0    0.48    0.23    0.03
    S1      0.0    0.12    0.06    0.06
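
    A minimal NumPy sketch of Viterbi decoding, again with the $\delta_0(i) = \pi_i$ convention used in this trellis. With A, B, pi, and obs from the forward example it should give $P^{*} \approx 0.065$ and the path S0, S0, S0, S1:

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Viterbi decoding with the delta_0(i) = pi_i convention of this trellis."""
    N, T = len(pi), len(obs)
    delta = np.zeros((T + 1, N))
    psi   = np.zeros((T + 1, N), dtype=int)
    delta[0] = pi
    for t in range(T):
        scores = delta[t][:, None] * A               # scores[i, j] = delta_t(i) * a_ij
        psi[t + 1]   = scores.argmax(axis=0)         # best predecessor of each state j
        delta[t + 1] = scores.max(axis=0) * B[:, obs[t]]
    # Termination and backtrace of the single best state sequence.
    path = [int(delta[T].argmax())]
    for t in range(T, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return float(delta[T].max()), path[::-1]

# With A, B, pi, obs from the forward example: P* ~ 0.065 (the 0.06 in the trellis),
# and the decoded path is S0, S0, S0, S1.
```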

  • Automatic Speech Recognition: From Theory to Practice 41 (T-61.184)

    Viterbi Algorithm in the Log-Domain

    Due to numerical underflow issues, it is common to carry out the Viterbi algorithm in the log domain,

    $\tilde{\pi}_i = \log(\pi_i), \qquad \tilde{b}_j(o_t) = \log(b_j(o_t)), \qquad \tilde{a}_{ij} = \log(a_{ij})$

    Viterbi search in speech recognition systems, for example, is implemented in the log domain.

  • Automatic Speech Recognition: From Theory to Practice 42 (T-61.184)

    Viterbi Algorithm in the Log-Domain

    1. Initialization: $\tilde{\delta}_1(i) = \tilde{\pi}_i + \tilde{b}_i(o_1)$, $\quad \psi_1(i) = 0$

    2. Recursion: $\tilde{\delta}_t(j) = \max_{1 \le i \le N} [\tilde{\delta}_{t-1}(i) + \tilde{a}_{ij}] + \tilde{b}_j(o_t)$, $\quad \psi_t(j) = \arg\max_{1 \le i \le N} [\tilde{\delta}_{t-1}(i) + \tilde{a}_{ij}]$

    3. Termination: $\tilde{P}^{*} = \max_{1 \le i \le N} [\tilde{\delta}_T(i)]$, $\quad q_T^{*} = \arg\max_{1 \le i \le N} [\tilde{\delta}_T(i)]$

    4. Path backtrace: $q_t^{*} = \psi_{t+1}(q_{t+1}^{*})$
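
    The same decoder in the log domain, as a sketch; mapping log(0) to -inf so that impossible transitions can never win the max is an assumption about handling zeros, not something specified on the slide:

```python
import numpy as np

def viterbi_log(logA, logB, logpi, obs):
    """Same recursion as viterbi() above, with products replaced by sums of logs."""
    N, T = len(logpi), len(obs)
    delta = np.full((T + 1, N), -np.inf)
    psi   = np.zeros((T + 1, N), dtype=int)
    delta[0] = logpi
    for t in range(T):
        scores = delta[t][:, None] + logA
        psi[t + 1]   = scores.argmax(axis=0)
        delta[t + 1] = scores.max(axis=0) + logB[:, obs[t]]
    path = [int(delta[T].argmax())]
    for t in range(T, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return float(delta[T].max()), path[::-1]

# Building the log tables from the A, B, pi of the earlier example; log(0) -> -inf.
with np.errstate(divide="ignore"):
    logA, logB, logpi = np.log(A), np.log(B), np.log(pi)
```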

  • Automatic Speech Recognition: From Theory to Practice 43 (T-61.184)

    Problem 3: Training

    Train the parameters of the HMM:
    Tune λ to maximize P(O | λ)
    There is no efficient algorithm for the global optimum
    An efficient iterative algorithm finds a local optimum

    (Baum-Welch) Forward-Backward Algorithm:
    Compute probabilities using the current model λ
    Refine λ into an updated estimate $\bar{\lambda}$ based on the computed values
    Uses α and β from the forward and backward algorithms

  • Automatic Speech Recognition: From Theory to Practice 44 (T-61.184)

    Forward-Backward Algorithm

    Define:

    $\xi_t(i,j) = P(q_t = i, q_{t+1} = j \mid O, \lambda) = \dfrac{P(q_t = i, q_{t+1} = j, O \mid \lambda)}{P(O \mid \lambda)}$

    "Probability of being in state i at time t and state j at time t+1, given the model and observation sequence."

  • Automatic Speech Recognition: From Theory to Practice 45 (T-61.184)

    Forward-Backward Algorithm

    $\xi_t(i,j) = P(q_t = i, q_{t+1} = j \mid O, \lambda) = \dfrac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)}$

    [Trellis fragment: the numerator combines the forward probability $\alpha_t(i)$, the arc weight $a_{ij}\, b_j(o_{t+1})$, and the backward probability $\beta_{t+1}(j)$.]

  • Automatic Speech Recognition: From Theory to Practice 46 (T-61.184)

    Forward-Backward Algorithm

    $\xi_t(i,j) = P(q_t = i, q_{t+1} = j \mid O, \lambda) = \dfrac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)} = \dfrac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_{i=1}^{N} \alpha_t(i)\, \beta_t(i)}$

  • Automatic Speech Recognition: From Theory to Practice 47 (T-61.184)

    Forward-Backward Algorithm

    $\gamma_t(i) = \sum_{j=1}^{N} \xi_t(i,j)$ : probability of being in state i at time t

    $\sum_{t=1}^{T-1} \gamma_t(i)$ = expected number of transitions from state i in O

    $\sum_{t=1}^{T-1} \xi_t(i,j)$ = expected number of transitions from state i to state j in O

  • Automatic Speech Recognition: From Theory to Practice 48 (T-61.184)

    Forward-Backward Algorithm

    Using these definitions, we can compute the initial state occupancy probability,

    $\bar{\pi}_i$ = expected number of times in state i at time t = 1 $= \gamma_1(i)$

  • Automatic Speech Recognition: From Theory to Practice 49 (T-61.184)

    Forward-Backward Algorithm

    $\bar{a}_{ij} = \dfrac{\text{expected number of transitions from state } i \text{ to state } j}{\text{expected number of transitions from state } i} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \sum_{j=1}^{N} \xi_t(i,j)} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$

  • Automatic Speech Recognition: From Theory to Practice 50 (T-61.184)

    Forward-Backward Algorithm

    $\bar{b}_j(k) = \dfrac{\text{expected number of times in state } j \text{ observing symbol } v_k}{\text{expected number of times in state } j} = \dfrac{\sum_{t=1,\ o_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)} = \dfrac{\sum_{t=1,\ o_t = v_k}^{T} \alpha_t(j)\, \beta_t(j)}{\sum_{t=1}^{T} \alpha_t(j)\, \beta_t(j)}$

  • Automatic Speech Recognition: From Theory to Practice 51 (T-61.184)

    Forward-Backward Algorithm

    1. Initialize $\lambda = (A, B, \pi)$
    2. Compute $\alpha$, $\beta$, and $\xi$
    3. Estimate $\bar{\lambda} = (\bar{A}, \bar{B}, \bar{\pi})$
    4. Replace $\lambda$ with $\bar{\lambda}$
    5. If not converged, go to 2

    It can be shown that $P(O \mid \bar{\lambda}) > P(O \mid \lambda)$ unless $\bar{\lambda} = \lambda$, so each iteration improves the likelihood until a local optimum is reached.

    A compact code sketch of one re-estimation iteration follows below.
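
    A compact sketch of one Baum-Welch re-estimation step for a discrete-symbol HMM. It uses the textbook convention $\alpha_1(i) = \pi_i\, b_i(o_1)$ and unscaled probabilities, so it is illustrative only; real trainers scale α and β or work in the log domain:

```python
import numpy as np

def baum_welch_step(A, B, pi, obs):
    """One forward-backward re-estimation step; looping this until P(O | lambda)
    stops improving implements the procedure on the slide above."""
    N, M, T = A.shape[0], B.shape[1], len(obs)

    # Forward and backward passes (unscaled, for illustration).
    alpha = np.zeros((T, N)); beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    prob = alpha[-1].sum()                         # P(O | lambda)

    # State and transition posteriors gamma_t(i) and xi_t(i, j).
    gamma = alpha * beta / prob
    xi = (alpha[:-1, :, None] * A[None] *
          (B[:, obs[1:]].T * beta[1:])[:, None, :]) / prob

    # Re-estimation formulas from the preceding slides.
    new_pi = gamma[0]
    new_A  = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B  = np.zeros_like(B)
    for k in range(M):
        new_B[:, k] = gamma[np.array(obs) == k].sum(axis=0) / gamma.sum(axis=0)
    return new_A, new_B, new_pi, prob
```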

  • Automatic Speech Recognition: From Theory to Practice 52 (T-61.184)

    The FB Algorithm Satisfies the Constraints:

    $\sum_{i=1}^{N} \bar{\pi}_i = 1$

    $\sum_{j=1}^{N} \bar{a}_{ij} = 1, \quad 1 \le i \le N$

    $\sum_{k=1}^{M} \bar{b}_j(k) = 1, \quad 1 \le j \le N$

  • Automatic Speech Recognition: From Theory to Practice 53 (T-61.184)

    Continuous Observation Densities

    Discussions up until this point have considered observations characterized by discrete symbols

    Often, we work with continuous valued data

    Can convert continuous variables to discrete symbols using a Vector Quantizer (VQ) codebook [with information loss]

    How to model continuous observations directly?

  • Automatic Speech Recognition: From Theory to Practice 54 (T-61.184)

    Mixture Gaussian PDFs

    Mixture Gaussian PDF with M components,

    $b_j(\mathbf{o}_t) = \sum_{k=1}^{M} c_{jk}\, \mathcal{N}(\mathbf{o}_t; \mu_{jk}, \Sigma_{jk})$

    Constraints on the mixture weights,

    $\sum_{k=1}^{M} c_{jk} = 1, \qquad c_{jk} \ge 0, \quad 1 \le k \le M$
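
    A small sketch of evaluating such a mixture density for one state, assuming diagonal covariance matrices (a common simplification that the slide does not require):

```python
import numpy as np

def mixture_gaussian_pdf(o, weights, means, variances):
    """b_j(o) = sum_k c_jk * N(o; mu_jk, Sigma_jk) for one state j,
    with diagonal covariances so each component is a product of 1-D Gaussians."""
    o = np.asarray(o, dtype=float)
    total = 0.0
    for c, mu, var in zip(weights, means, variances):
        norm = 1.0 / np.sqrt((2.0 * np.pi) ** len(o) * np.prod(var))
        total += c * norm * np.exp(-0.5 * np.sum((o - mu) ** 2 / var))
    return total
```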

  • Automatic Speech Recognition: From Theory to Practice 55 (T-61.184)

    Definition

    Probability of being in state j at time t with the kth mixture component accounting for $\mathbf{o}_t$:

    $\gamma_t(j,k) = \left[ \dfrac{\alpha_t(j)\, \beta_t(j)}{\sum_{j=1}^{N} \alpha_t(j)\, \beta_t(j)} \right] \left[ \dfrac{c_{jk}\, \mathcal{N}(\mathbf{o}_t; \mu_{jk}, \Sigma_{jk})}{\sum_{m=1}^{M} c_{jm}\, \mathcal{N}(\mathbf{o}_t; \mu_{jm}, \Sigma_{jm})} \right]$

  • Automatic Speech Recognition: From Theory to Practice 56 (T-61.184)

    Resulting Equations for Mixture Weight & Mean Update

    $\bar{c}_{jk} = \dfrac{\sum_{t=1}^{T} \gamma_t(j,k)}{\sum_{t=1}^{T} \sum_{k=1}^{M} \gamma_t(j,k)}$

    $\bar{\mu}_{jk} = \dfrac{\sum_{t=1}^{T} \gamma_t(j,k)\, \mathbf{o}_t}{\sum_{t=1}^{T} \gamma_t(j,k)}$

  • Automatic Speech Recognition: From Theory to Practice 57 (T-61.184)

    Resulting Equations for Transition Matrix & Variance Update

    The transition probability $a_{ij}$ is re-estimated as in the discrete-symbol case. Covariance matrix:

    $\bar{\Sigma}_{jk} = \dfrac{\sum_{t=1}^{T} \gamma_t(j,k)\, (\mathbf{o}_t - \mu_{jk})(\mathbf{o}_t - \mu_{jk})'}{\sum_{t=1}^{T} \gamma_t(j,k)}$

  • Automatic Speech Recognition: From Theory to Practice 58 (T-61.184)

    Estimation from Multiple Observation Sequences

    We estimate HMM parameters from multiple examples of speech:
    collected from different speakers,
    in different contexts.

    We attempt to model the variability in producing each sound unit by estimating parameters from large corpora of speech data.

  • Automatic Speech Recognition: From Theory to Practice 59 (T-61.184)

    Multiple Observations

    Assume K training patterns (observation sequences),

    $\mathbf{O} = [O^{(1)}, O^{(2)}, \ldots, O^{(K)}]$

    where $O^{(k)} = \{o_1^{(k)}, o_2^{(k)}, \ldots, o_{T_k}^{(k)}\}$

    and $P_k = P(O^{(k)} \mid \lambda)$.

  • Automatic Speech Recognition: From Theory to Practice 60 (T-61.184)

    Transition Probabilities with Multiple Observations

    $\bar{a}_{ij} = \dfrac{\sum_{k=1}^{K} \frac{1}{P_k} \sum_{t=1}^{T_k - 1} \alpha_t^k(i)\, a_{ij}\, b_j(o_{t+1}^{(k)})\, \beta_{t+1}^k(j)}{\sum_{k=1}^{K} \frac{1}{P_k} \sum_{t=1}^{T_k - 1} \alpha_t^k(i)\, \beta_t^k(i)}$

  • Automatic Speech Recognition: From Theory to Practice 61 (T-61.184)

    Observation Probabilities with Multiple Observations

    $\bar{b}_j(l) = \dfrac{\sum_{k=1}^{K} \frac{1}{P_k} \sum_{t=1,\ \text{s.t.}\ o_t^{(k)} = v_l}^{T_k} \alpha_t^k(j)\, \beta_t^k(j)}{\sum_{k=1}^{K} \frac{1}{P_k} \sum_{t=1}^{T_k} \alpha_t^k(j)\, \beta_t^k(j)}$

  • Automatic Speech Recognition: From Theory to Practice 62 (T-61.184)

    Single-State HMM Example

    How do we estimate the parameters of a single-state HMM with M component Gaussians?

    $b(\mathbf{o}_t) = \sum_{k=1}^{M} w_k\, b_k(\mathbf{o}_t) = \sum_{k=1}^{M} w_k\, \mathcal{N}(\mathbf{o}_t; \mu_k, \Sigma_k)$

    $b_k(\mathbf{o}_t) = \dfrac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} \exp\left( -\tfrac{1}{2} (\mathbf{o}_t - \mu_k)'\, \Sigma_k^{-1} (\mathbf{o}_t - \mu_k) \right)$

  • Automatic Speech Recognition: From Theory to Practice 63 (T-61.184)

    Single-State HMM Example

    Define the probability of the tth observation being generated by the kth mixture component,

    $P(k \mid \mathbf{o}_t, \lambda) = \dfrac{w_k\, b_k(\mathbf{o}_t)}{\sum_{k=1}^{M} w_k\, b_k(\mathbf{o}_t)}$

    Note that k (in $b_k$) here refers to the mixture component, not the HMM state. We are assuming just one HMM state.

  • Automatic Speech Recognition: From Theory to Practice 64 (T-61.184)

    Single-State HMM Update Equations

    $\bar{w}_k = \dfrac{1}{T} \sum_{t=1}^{T} P(k \mid \mathbf{o}_t, \lambda)$

    $\bar{\mu}_k = \dfrac{\sum_{t=1}^{T} P(k \mid \mathbf{o}_t, \lambda)\, \mathbf{o}_t}{\sum_{t=1}^{T} P(k \mid \mathbf{o}_t, \lambda)}$

    $\bar{\sigma}_k^2 = \dfrac{\sum_{t=1}^{T} P(k \mid \mathbf{o}_t, \lambda)\, (\mathbf{o}_t - \mu_k)^2}{\sum_{t=1}^{T} P(k \mid \mathbf{o}_t, \lambda)}$

    The term $P(k \mid \mathbf{o}_t, \lambda)$ is computed using the model parameters from the previous algorithm iteration; the left-hand sides are the updated (new) parameter estimates.
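
    A sketch of one EM iteration for this single-state case, assuming diagonal covariances (so per-dimension variances rather than full covariance matrices); the function and variable names are mine:

```python
import numpy as np

def gmm_em_step(obs, weights, means, variances):
    """One EM update for a single-state HMM with M diagonal-covariance Gaussians.
    obs: (T, d) array; weights: (M,); means, variances: (M, d)."""
    T, d = obs.shape
    # E-step: P(k | o_t, lambda) using the *previous* iteration's parameters.
    comp = np.stack([
        w * np.exp(-0.5 * np.sum((obs - mu) ** 2 / var, axis=1))
            / np.sqrt((2 * np.pi) ** d * np.prod(var))
        for w, mu, var in zip(weights, means, variances)], axis=1)    # (T, M)
    post = comp / comp.sum(axis=1, keepdims=True)                     # P(k | o_t)

    # M-step: updated (new) parameter estimates from the slide.
    occ = post.sum(axis=0)                                            # (M,)
    new_weights = occ / T
    new_means = (post.T @ obs) / occ[:, None]
    new_vars  = np.stack([(post[:, k, None] * (obs - new_means[k]) ** 2).sum(axis=0)
                          for k in range(len(weights))]) / occ[:, None]
    return new_weights, new_means, new_vars
```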

  • Automatic Speech Recognition: From Theory to Practice 65 (T-61.184)

    Mixture Gaussian (1-D Case)

    [Figure: a one-dimensional mixture density $b(o)$ plotted against $o$; three mixture components are used to model an underlying random process of 3 Gaussians.]

  • Automatic Speech Recognition: From Theory to Practice 66 (T-61.184)

    Homework #3

    You will estimate the parameters of a single-state multivariate HMM given a sequence of observations (O).

    Training data (observations) are 13-dimensional MFCC vectors from 10 speakers.

    Test data come from unknown speakers… which speaker most likely produced each sequence?

  • Automatic Speech Recognition: From Theory to Practice 67 (T-61.184)

    Homework #3

    We can compute the probability of generating the observation sequence under each estimated single-state HMM:

    $\log P(O \mid \lambda_s) = \sum_{t=1}^{T} \log b_s(\mathbf{o}_t)$

    Find the model which has the maximum log-probability…
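
    A sketch of this scoring step, again assuming diagonal-covariance mixtures; the models dictionary in the usage comment is hypothetical:

```python
import numpy as np

def log_likelihood(obs, weights, means, variances):
    """log P(O | lambda_s) = sum_t log b_s(o_t) for one speaker's single-state GMM."""
    T, d = obs.shape
    comp = np.stack([
        w * np.exp(-0.5 * np.sum((obs - mu) ** 2 / var, axis=1))
            / np.sqrt((2 * np.pi) ** d * np.prod(var))
        for w, mu, var in zip(weights, means, variances)], axis=1)
    return float(np.log(comp.sum(axis=1)).sum())

# Hypothetical usage: `models` maps speaker id -> (weights, means, variances).
# best_speaker = max(models, key=lambda s: log_likelihood(test_obs, *models[s]))
```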

  • Automatic Speech Recognition: From Theory to Practice 68 (T-61.184)

    HMM Assumptions

    First-order Markov chain: the probability of transitioning to a state depends only on the current state.

    Time-independence of state transitions: the probability of transitioning from state A to state B is independent of time.

    Observation independence: observations don't depend on each other, just on the states that generated them.

    (Sometimes) left-to-right topology: as we will see, a left-to-right HMM is generally used to model speech units (phonemes).

  • Automatic Speech Recognition: From Theory to Practice 69 (T-61.184)

    Important Concepts from Today

    Do you understand the basic idea of an HMM?

    Could you implement the Viterbi algorithm if you had to?

    Could you at least estimate the parameters of a single-state HMM (mixture Gaussians)?

    What assumptions do HMMs impose on the data being modeled?

  • Automatic Speech Recognition: From Theory to Practice 70 (T-61.184)

    Next Week

    How to use HMMs for modeling speech units?

    Consider major strategies for acoustic training

    How to model context dependencies using HMMs

    Some practical notes on HMM training for speech recognition