
Speech Recognition
Lecture 8: Expectation-Maximization Algorithm, Hidden Markov Models

Mehryar Mohri
Courant Institute and Google Research
[email protected]


This Lecture

• Expectation-Maximization (EM) algorithm

• Hidden Markov Models


Latent Variables

Definition: unobserved or hidden variables, as opposed to those directly available at training and test time.

• example: mixture models, HMMs.

Why latent variables?

• naturally unavailable variables: e.g., economic variables.

• modeling: latent variables introduced to model dependencies.



ML with Latent Variables

Problem:

• with fully observed variables, ML is often straightforward.

• with latent variables, the log-likelihood contains a sum over the latent variable, which makes finding the best parameter values harder:

$$L(\theta, x) = \log p_\theta[x] = \log \sum_z p_\theta[x, z] = \log \sum_z p_\theta[x \mid z]\, p_\theta[z].$$

Idea: use the current parameter values to estimate the latent variables, and use those estimates to re-estimate the parameter values: the EM algorithm.


EM Theorem

Theorem: for any $x$ and any parameter values $\theta$ and $\theta'$,

$$L(\theta', x) - L(\theta, x) \;\geq\; \sum_z p_\theta[z \mid x] \log p_{\theta'}[x, z] \;-\; \sum_z p_\theta[z \mid x] \log p_\theta[x, z].$$

Proof:

$$
\begin{aligned}
L(\theta', x) - L(\theta, x) &= \log p_{\theta'}[x] - \log p_\theta[x]\\
&= \sum_z p_\theta[z \mid x] \log p_{\theta'}[x] - \sum_z p_\theta[z \mid x] \log p_\theta[x]\\
&= \sum_z p_\theta[z \mid x] \log \frac{p_{\theta'}[x, z]}{p_{\theta'}[z \mid x]} - \sum_z p_\theta[z \mid x] \log \frac{p_\theta[x, z]}{p_\theta[z \mid x]}\\
&= \sum_z p_\theta[z \mid x] \log \frac{p_\theta[z \mid x]}{p_{\theta'}[z \mid x]} + \sum_z p_\theta[z \mid x] \log p_{\theta'}[x, z] - \sum_z p_\theta[z \mid x] \log p_\theta[x, z]\\
&= D\big(p_\theta[z \mid x] \,\|\, p_{\theta'}[z \mid x]\big) + \sum_z p_\theta[z \mid x] \log p_{\theta'}[x, z] - \sum_z p_\theta[z \mid x] \log p_\theta[x, z]\\
&\geq \sum_z p_\theta[z \mid x] \log p_{\theta'}[x, z] - \sum_z p_\theta[z \mid x] \log p_\theta[x, z],
\end{aligned}
$$

where the second equality uses $\sum_z p_\theta[z \mid x] = 1$ and the last step uses the non-negativity of the relative entropy $D(\cdot \,\|\, \cdot)$.
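As a quick sanity check of the theorem, the following minimal sketch evaluates both sides of the bound on an assumed toy model (a two-component Bernoulli mixture with a single observation); the model, parameter values, and function names are illustrative choices, not part of the lecture.

```python
# Minimal numerical check of the EM bound
#   L(theta', x) - L(theta, x) >= sum_z p_theta[z|x] (log p_theta'[x, z] - log p_theta[x, z])
# on an assumed toy model: a two-component Bernoulli mixture (not from the lecture).
import numpy as np

def joint(theta, x, z):
    """p_theta[x, z]: component z is chosen, then x ~ Bernoulli(mu_z)."""
    pi, mu0, mu1 = theta
    p_z, mu = (1.0 - pi, pi), (mu0, mu1)
    return p_z[z] * (mu[z] if x == 1 else 1.0 - mu[z])

def log_lik(theta, x):
    """L(theta, x) = log p_theta[x]."""
    return np.log(sum(joint(theta, x, z) for z in (0, 1)))

def expected_complete_log_lik(theta_new, theta, x):
    """sum_z p_theta[z|x] log p_theta_new[x, z]."""
    joints = np.array([joint(theta, x, z) for z in (0, 1)])
    posterior = joints / joints.sum()
    return sum(posterior[z] * np.log(joint(theta_new, x, z)) for z in (0, 1))

theta = (0.3, 0.2, 0.9)        # current parameters (pi, mu0, mu1): assumed values
theta_new = (0.6, 0.5, 0.7)    # candidate new parameters: assumed values
x = 1

lhs = log_lik(theta_new, x) - log_lik(theta, x)
rhs = (expected_complete_log_lik(theta_new, theta, x)
       - expected_complete_log_lik(theta, theta, x))
print(lhs, rhs)
assert lhs >= rhs - 1e-12      # gap equals D(p_theta[z|x] || p_theta_new[z|x]) >= 0
```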


EM Idea

Maximize the expectation:

$$\sum_z p_\theta[z \mid x] \log p_{\theta'}[x, z] = \operatorname{E}_{z \sim p_\theta[\cdot \mid x]}\big[\log p_{\theta'}[x, z]\big].$$

Iterations:

• compute $p_\theta[z \mid x]$ for the current value of $\theta$.

• compute the expectation and find the $\theta'$ maximizing it.


EM Algorithm

Algorithm: maximum-likelihood meta-algorithm for models with latent variables.

• E-step: $q_{t+1} \leftarrow p_{\theta_t}[z \mid x]$.

• M-step: $\theta_{t+1} \leftarrow \operatorname*{argmax}_\theta \sum_z q_{t+1}(z \mid x) \log p_\theta[x, z]$.

Interpretation:

• E-step: posterior probability of the latent variables given the observed variables and the current parameter.

• M-step: maximum-likelihood parameter given all the data.

(Dempster, Laird, and Rubin, 1977; Wu, 1983)
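As a concrete instance of this meta-algorithm, here is a minimal sketch of EM for a one-dimensional mixture of two Gaussians, with closed-form M-step updates; the synthetic data, initialization, and iteration count are assumptions made for illustration, not part of the lecture.

```python
# Minimal EM sketch for a 1-D mixture of two Gaussians (assumed data and initialization).
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 0.5, 100)])

# Parameters theta = (mixing weights, means, variances), arbitrary initialization.
w = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])

def log_likelihood(x, w, mu, var):
    dens = w * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return np.log(dens.sum(axis=1)).sum()

for t in range(50):
    # E-step: posterior probability of each component given each point (q_{t+1}).
    dens = w * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    post = dens / dens.sum(axis=1, keepdims=True)
    # M-step: closed-form maximizer of sum_z q_{t+1}(z|x) log p_theta[x, z].
    n_k = post.sum(axis=0)
    w = n_k / len(x)
    mu = (post * x[:, None]).sum(axis=0) / n_k
    var = (post * (x[:, None] - mu) ** 2).sum(axis=0) / n_k
    print(t, log_likelihood(x, w, mu, var))   # non-decreasing across iterations
```

Tracking the log-likelihood at each iteration makes the monotone-improvement property mentioned in the notes below directly visible.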


EM Algorithm - Notes

Applications:

• mixture models, e.g., Gaussian mixtures. Latent variables: which component generated each point.

• HMMs (Baum-Welch algorithm).

Notes:

• positive: each iteration increases the likelihood; no parameter tuning is required.

• negative: the algorithm can converge to a local optimum; the likelihood may converge while the parameters do not; the result depends on the initial parameter values.


EM = Coordinate Ascent

Theorem: EM = coordinate ascent applied to

$$A(q, \theta) = \sum_z q[z \mid x] \log \frac{p_\theta[x, z]}{q[z \mid x]} = \sum_z q[z \mid x] \log p_\theta[x, z] - \sum_z q[z \mid x] \log q[z \mid x].$$

• E-step: $q_{t+1} \leftarrow \operatorname*{argmax}_q A(q, \theta_t)$.

• M-step: $\theta_{t+1} \leftarrow \operatorname*{argmax}_\theta A(q_{t+1}, \theta)$.

Proof: the E-step follows from the two identities

$$A(q, \theta) \leq \log \sum_z q[z \mid x] \, \frac{p_\theta[x, z]}{q[z \mid x]} = \log \sum_z p_\theta[x, z] = \log p_\theta[x] = L(\theta, x),$$

which holds by Jensen's inequality and the concavity of $\log$, and

$$A(p_\theta[z \mid x], \theta) = \sum_z p_\theta[z \mid x] \log p_\theta[x] = \log p_\theta[x] = L(\theta, x).$$

Thus, the posterior $p_{\theta_t}[z \mid x]$ attains the upper bound and maximizes $q \mapsto A(q, \theta_t)$.


EM = Coordinate Ascent

Proof (cont.): finally, note that

$$\operatorname*{argmax}_\theta A(q_{t+1}, \theta) = \operatorname*{argmax}_\theta \sum_z q_{t+1}[z \mid x] \log p_\theta[x, z],$$

which coincides with the M-step of EM.
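To illustrate the coordinate-ascent view numerically, the sketch below compares $A(q, \theta)$ with $L(\theta, x)$ on an assumed two-component Bernoulli mixture: any $q$ gives a lower bound, and the posterior $p_\theta[z \mid x]$ attains it. The model and parameter values are illustrative assumptions, not from the lecture.

```python
# Illustrative check (assumed toy model): A(q, theta) <= L(theta, x), with equality
# at q = p_theta[z|x], so the E-step argmax over q is the posterior.
import numpy as np

def joint(theta, x, z):
    pi, mu0, mu1 = theta
    p_z, mu = (1.0 - pi, pi), (mu0, mu1)
    return p_z[z] * (mu[z] if x == 1 else 1.0 - mu[z])

def A(q, theta, x):
    """A(q, theta) = sum_z q[z] log(p_theta[x, z] / q[z])."""
    return sum(q[z] * np.log(joint(theta, x, z) / q[z]) for z in (0, 1))

theta, x = (0.3, 0.2, 0.9), 1
joints = np.array([joint(theta, x, z) for z in (0, 1)])
L = np.log(joints.sum())                  # L(theta, x)
posterior = joints / joints.sum()

print(A(posterior, theta, x), L)          # equal up to rounding
for q0 in (0.1, 0.5, 0.9):                # other choices of q lie below L(theta, x)
    assert A(np.array([q0, 1.0 - q0]), theta, x) <= L + 1e-12
```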


EM - Extensions and Variants

Generalized EM (GEM): full maximization is not necessary at each step, since any (strict) increase of the auxiliary function guarantees an increase of the log-likelihood.

Sparse EM: posterior probabilities are computed only at some points in the E-step.

(Dempster et al., 1977; Jamshidian and Jennrich, 1993; Wu, 1983)


This Lecture

• Expectation-Maximization (EM) algorithm

• Hidden Markov Models


Motivation

Data: a sample of $m$ sequences over an alphabet $\Sigma$, drawn i.i.d. according to some distribution $D$:

$$x^i_1, x^i_2, \ldots, x^i_{k_i}, \quad i = 1, \ldots, m.$$

Problem: find the sequence model that best estimates the distribution $D$.

Latent variables: the observed sequences may have been generated by a model with states and transitions that are hidden or unobserved.


Example

Observations: sequences of Heads and Tails.

Model:

[Figure: a three-state probabilistic automaton over {H, T}, with transitions labeled symbol/probability; labels include H/.3, T/.7, T/.4, H/.6, T/.5, H/.1.]

observed sequence: H, T, T, H, T, H, T, H, H, H, T, T, . . . , H.

unobserved paths: (0, H, 0), (0, T, 1), (1, T, 1), . . . , (2, H, 2).


Hidden Markov Models

Definition: probabilistic automata, generative view.

• discrete case: finite alphabet $\Sigma$.

• transition probability: $\Pr[\text{transition } e]$.

• emission probability: $\Pr[\text{emission } a \mid e]$.

• simplification: for us, $w[e] = \Pr[a \mid e] \Pr[e]$.

Illustration:

[Figure: a small probabilistic automaton with transitions labeled symbol/weight (a/.3, a/.2, a/.2, b/.1, c/.2) and a final weight /.5.]
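To make the transition-based representation concrete, here is a minimal sketch of an HMM stored as a list of weighted transitions with $w[e] = \Pr[a \mid e]\Pr[e]$, together with a generative sampling routine; the states, symbols, and weights are placeholders chosen for illustration, not the automaton of the figure.

```python
# Sketch of a transition-based HMM, following the simplification w[e] = Pr[a|e] Pr[e].
# States, symbols, and weights below are illustrative placeholders.
import random

# Each transition e = (origin, label, weight, destination); the weights of the
# transitions leaving a state sum to one (emission folded into the weight).
TRANSITIONS = [
    (0, "H", 0.3, 0),
    (0, "T", 0.7, 1),
    (1, "T", 0.4, 1),
    (1, "H", 0.6, 2),
    (2, "T", 0.5, 2),
    (2, "H", 0.5, 2),
]
INITIAL_STATE = 0

def sample_sequence(length, seed=None):
    """Generate a symbol sequence and the (normally unobserved) path that produced it."""
    rng = random.Random(seed)
    state, symbols, path = INITIAL_STATE, [], []
    for _ in range(length):
        out = [e for e in TRANSITIONS if e[0] == state]
        e = rng.choices(out, weights=[t[2] for t in out], k=1)[0]
        symbols.append(e[1])
        path.append((e[0], e[1], e[3]))
        state = e[3]
    return symbols, path

symbols, path = sample_sequence(10, seed=1)
print(symbols)   # observed sequence
print(path)      # hidden path of transitions
```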


Other Models

Outputs at states (instead of transitions): equivalent model, dual representation.

Non-discrete case: outputs in $\mathbb{R}^N$.

• fixed probabilities replaced by distributions, e.g., Gaussian mixtures.

• application to acoustic modeling.


Three Problems

Problem 1: given a sequence $x_1, \ldots, x_k$ and a hidden Markov model $p_\theta$, compute $p_\theta[x_1, \ldots, x_k]$.

Problem 2: given a sequence $x_1, \ldots, x_k$ and a hidden Markov model $p_\theta$, find the most likely path $\pi = e_1, \ldots, e_r$ that generated that sequence.

Problem 3: estimate the parameters $\theta$ of a hidden Markov model via maximum likelihood:

$$\theta^* = \operatorname*{argmax}_\theta p_\theta[x].$$

(Rabiner, 1989)


P1: Evaluation

Algorithm: let $X$ be a finite automaton representing the sequence $x = x_1 \cdots x_k$ and let $H$ denote the current hidden Markov model.

• Then,

$$p_\theta[x] = \sum_{\pi \in X \circ H} w[\pi].$$

• Thus, it can be computed using composition and a shortest-distance algorithm in time $O(|X||H|)$.
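Rather than building the composition explicitly, the sketch below performs the same sum over paths by dynamic programming (a forward-style computation) on the transition list used in the earlier sketch; the encoding and the simplification that every state is accepting are assumptions made for illustration.

```python
# Forward-style evaluation sketch: p_theta[x] as the sum of the weights of all paths
# consistent with x, i.e., a state-by-state sum over the paths of X ∘ H.
from collections import defaultdict

# transitions: (origin, label, weight, destination); start: initial state.
def sequence_probability(x, transitions, start):
    forward = {start: 1.0}                     # prob. of reading the prefix and being in a state
    for symbol in x:
        nxt = defaultdict(float)
        for (q, a, w, r) in transitions:
            if a == symbol and q in forward:
                nxt[r] += forward[q] * w       # extend every matching partial path
        forward = dict(nxt)
    return sum(forward.values())               # all states treated as accepting, for simplicity

TRANSITIONS = [
    (0, "H", 0.3, 0), (0, "T", 0.7, 1),
    (1, "T", 0.4, 1), (1, "H", 0.6, 2),
    (2, "T", 0.5, 2), (2, "H", 0.5, 2),
]
print(sequence_probability(["H", "T", "T", "H"], TRANSITIONS, start=0))
```

The loop visits each transition once per input symbol, matching the $O(|X||H|)$ bound stated above.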


P2: Most Likely Path

Algorithm: shortest-path problem in the $(\max, \times)$ semiring.

• any shortest-path algorithm can be applied to the result of the composition $X \circ H$. In particular, since the automaton is acyclic, using a linear-time shortest-path algorithm, the total complexity is in $O(|X \circ H|) = O(|X||H|)$.

• a traditional solution is to use the Viterbi algorithm.
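A corresponding Viterbi-style sketch for Problem 2, under the same assumed transition encoding: the $(\max, \times)$ shortest-distance computation, with backpointers to recover the most likely path.

```python
# Viterbi sketch: shortest path in the (max, ×) semiring over the composition X ∘ H.
# Same assumed transition encoding as before: (origin, label, weight, destination).
def viterbi(x, transitions, start):
    best = {start: (1.0, [])}                          # state -> (best prob, best path so far)
    for symbol in x:
        nxt = {}
        for (q, a, w, r) in transitions:
            if a == symbol and q in best:
                prob = best[q][0] * w                  # (×) along the path
                if r not in nxt or prob > nxt[r][0]:   # (max) over competing paths
                    nxt[r] = (prob, best[q][1] + [(q, a, r)])
        best = nxt
    end = max(best, key=lambda s: best[s][0])          # best end state
    return best[end]

TRANSITIONS = [
    (0, "H", 0.3, 0), (0, "T", 0.7, 1),
    (1, "T", 0.4, 1), (1, "H", 0.6, 2),
    (2, "T", 0.5, 2), (2, "H", 0.5, 2),
]
prob, path = viterbi(["H", "T", "T", "H"], TRANSITIONS, start=0)
print(prob, path)
```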


P3: Parameter Estimation

Baum-Welch algorithm: special case of EM applied to HMMs.

• Here, $\theta$ represents the transition weights $w[e]$.

• The latent or hidden variables are the paths $\pi$.

• M-step:

$$\operatorname*{argmax}_\theta \sum_\pi p_{\theta'}[\pi \mid x] \log p_\theta[x, \pi] \quad \text{subject to} \quad \forall q \in Q, \ \sum_{e \in E[q]} w_\theta[e] = 1,$$

where $\theta'$ denotes the current parameter values and $E[q]$ the set of transitions leaving state $q$.

(Baum, 1972)


P3: Parameter Estimation

Using Lagrange multipliers $\lambda_q$, $q \in Q$, the problem consists of setting the following partial derivatives to zero:

$$\frac{\partial}{\partial w_\theta[e]} \Big[ \sum_\pi p_{\theta'}[\pi \mid x] \log p_\theta[x, \pi] - \sum_q \lambda_q \sum_{e' \in E[q]} w_\theta[e'] \Big] = 0$$

$$\Longleftrightarrow \sum_\pi p_{\theta'}[\pi \mid x] \, \frac{\partial \log p_\theta[x, \pi]}{\partial w_\theta[e]} - \lambda_{\operatorname{orig}[e]} = 0.$$

$p_\theta[x, \pi] \neq 0$ only for paths $\pi$ labeled with $x$: $\pi \in \Pi(x)$.

In that case, $p_\theta[x, \pi] = \prod_e w_\theta[e]^{|\pi|_e}$, where $|\pi|_e$ is the number of occurrences of $e$ in $\pi$.


P3: Parameter Estimation

Thus, the equation with partial derivatives can be rewritten as

$$\sum_{\pi \in \Pi(x)} p_{\theta'}[\pi \mid x] \, \frac{|\pi|_e}{w_\theta[e]} = \lambda_{\operatorname{orig}[e]}$$

$$\Longleftrightarrow w_\theta[e] = \frac{1}{\lambda_{\operatorname{orig}[e]}} \sum_{\pi \in \Pi(x)} p_{\theta'}[\pi \mid x] \, |\pi|_e$$

$$\Longleftrightarrow w_\theta[e] = \frac{1}{\lambda_{\operatorname{orig}[e]} \, p_{\theta'}[x]} \sum_{\pi \in \Pi(x)} p_{\theta'}[x, \pi] \, |\pi|_e$$

$$\Longleftrightarrow w_\theta[e] \propto \sum_{\pi \in \Pi(x)} p_{\theta'}[x, \pi] \, |\pi|_e.$$


P3: Parameter Estimation

But,

$$\sum_{\pi \in \Pi(x)} p_{\theta'}[x, \pi] \, |\pi|_e = \sum_{i=1}^k \alpha_{i-1}(e) \, w_{\theta'}[e] \, \beta_i(e).$$

Thus,

$$w_\theta[e] \propto \sum_{i=1}^k \alpha_{i-1}(e) \, w_{\theta'}[e] \, \beta_i(e),$$

with

$$\alpha_i(e) = p_{\theta'}[x_1, \ldots, x_i \wedge q = \operatorname{orig}[e]] \qquad \beta_i(e) = p_{\theta'}[x_{i+1}, \ldots, x_k \mid \operatorname{dest}[e]].$$

These quantities can be computed using forward-backward algorithms, or, for us, shortest-distance algorithms. The normalized sum $\frac{1}{p_{\theta'}[x]} \sum_{\pi \in \Pi(x)} p_{\theta'}[x, \pi] \, |\pi|_e$ is the expected number of times through transition $e$.
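The sketch below turns this re-estimation formula into code for the assumed transition encoding used earlier: forward and backward scores give the expected transition counts, which are then normalized over the transitions leaving each state (a single observation sequence and one EM iteration, as a rough illustration only).

```python
# Baum-Welch-style re-estimation sketch (single sequence, one EM iteration), following
#   w_theta[e] ∝ sum_i alpha_{i-1}(orig[e]) * w_theta'[e] * beta_i(dest[e])
# restricted to positions i where e's label matches x_i. Assumed encoding as before.
from collections import defaultdict

def forward_scores(x, transitions, start):
    alphas = [{start: 1.0}]
    for symbol in x:
        nxt = defaultdict(float)
        for (q, a, w, r) in transitions:
            if a == symbol and q in alphas[-1]:
                nxt[r] += alphas[-1][q] * w
        alphas.append(dict(nxt))
    return alphas                      # alphas[i][q] = p[x_1..x_i ∧ state q]

def backward_scores(x, transitions):
    betas = [defaultdict(float) for _ in range(len(x) + 1)]
    for (q, a, w, r) in transitions:   # all states treated as accepting, as before
        betas[len(x)][q] = 1.0
        betas[len(x)][r] = 1.0
    for i in range(len(x) - 1, -1, -1):
        for (q, a, w, r) in transitions:
            if a == x[i]:
                betas[i][q] += w * betas[i + 1][r]
    return betas                       # betas[i][q] = p[x_{i+1}..x_k | state q]

def reestimate(x, transitions, start):
    alphas = forward_scores(x, transitions, start)
    betas = backward_scores(x, transitions)
    counts = defaultdict(float)        # unnormalized expected number of times through e
    for i, symbol in enumerate(x, start=1):
        for e in transitions:
            q, a, w, r = e
            if a == symbol:
                counts[e] += alphas[i - 1].get(q, 0.0) * w * betas[i].get(r, 0.0)
    new_weights = {}
    for e in transitions:              # normalize over transitions leaving each state
        total = sum(counts[f] for f in transitions if f[0] == e[0])
        new_weights[e] = counts[e] / total if total > 0 else e[2]
    return new_weights

TRANSITIONS = [
    (0, "H", 0.3, 0), (0, "T", 0.7, 1),
    (1, "T", 0.4, 1), (1, "H", 0.6, 2),
    (2, "T", 0.5, 2), (2, "H", 0.5, 2),
]
print(reestimate(["H", "T", "T", "H", "H"], TRANSITIONS, start=0))
```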


References

• L. E. Baum. An Inequality and Associated Maximization Technique in Statistical Estimation for Probabilistic Functions of Markov Processes. Inequalities, 3:1-8, 1972.

• Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B, Vol. 39, No. 1, pp. 1-38, 1977.

• M. Jamshidian and R. I. Jennrich. Conjugate Gradient Acceleration of the EM Algorithm. Journal of the American Statistical Association, 88(421):221-228, 1993.

• Lawrence Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, Vol. 77, No. 2, pp. 257-286, 1989.

• O'Sullivan. Alternating minimization algorithms: From Blahut-Arimoto to expectation-maximization. In Codes, Curves and Signals: Common Threads in Communications, A. Vardy (editor), Kluwer, 1998.

• C. F. Jeff Wu. On the Convergence Properties of the EM Algorithm. The Annals of Statistics, Vol. 11, No. 1, pp. 95-103, 1983.