Hidden Markov Processes - George Mason University (ece.gmu.edu/~yephraim/talks/hmp-montpilier.pdf)
TRANSCRIPT
HIDDEN MARKOV PROCESSES
Yariv Ephraim
George Mason University, Fairfax, VA 22030
• This presentation is based on “Hidden Markov Processes,” by Y.
Ephraim and N. Merhav, which appeared in IEEE Trans. Inform.
Theory, vol. 48, pp. 1518-1569, June 2002
1
Motivation
• HMP’s have attracted significant research interest for over 50
years:
– They are fairly general processes that are amenable to
mathematical analysis
– They were found useful in many applications
• I will present a brief survey from the statistical and
information-theoretic viewpoints
2
Definition of HMP
• A Markov chain observed through a noisy channel
{S_t} Markov chain, S_t ∈ S  →  memoryless channel b(y_t|s_t)  →  {Y_t} observation sequence, Y_t ∈ Y

p(y^n | s^n) = ∏_{t=1}^{n} b(y_t | s_t)
– If Y is finite, we have a finite-alphabet HMP
– If Y is not necessarily finite, e.g., standard alphabet, we have a
general HMP
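This hookup is straightforward to simulate. A minimal Python sketch with an invented two-state chain and binary channel (the matrices A, B and the vector pi are illustrative, not from the talk):

```python
import random

A = [[0.9, 0.1],   # a_ij = P(S_{t+1} = j | S_t = i), illustrative values
     [0.2, 0.8]]
B = [[0.7, 0.3],   # b(y|s) = P(Y_t = y | S_t = s), illustrative values
     [0.1, 0.9]]
pi = [0.5, 0.5]    # initial state distribution

def sample_hmp(n, seed=0):
    """Sample n steps of the chain {S_t} and its noisy observations {Y_t}."""
    rng = random.Random(seed)
    states, obs = [], []
    s = rng.choices(range(2), weights=pi)[0]
    for _ in range(n):
        # memoryless channel: Y_t depends only on the current state S_t
        y = rng.choices(range(2), weights=B[s])[0]
        states.append(s)
        obs.append(y)
        s = rng.choices(range(2), weights=A[s])[0]   # Markov transition
    return states, obs

states, obs = sample_hmp(100)
```

With a finite observation alphabet (here {0, 1}) this is a finite-alphabet HMP; replacing the channel draw with, say, a state-dependent Gaussian would give a general HMP.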
3
Examples of HMP’s
• Deterministic channel
– Yt = f(St) where f(·) is a many-to-one function
– These are aggregated Markov chains, but generally not Markov
chains themselves; they exhibit longer memory than Markov chains
– Any finite-alphabet HMP is a deterministic function of a Markov
chain with augmented state space S × Y
• Gaussian channel
– The conditional density b(y_t|s_t) is normal with mean and/or
variance dependent on s_t
4
Density of HMP
• The n-dimensional joint density of observations and states is given
by
p(y^n, s^n) = π_{s_1} b(y_1|s_1) ∏_{t=2}^{n} a_{s_{t-1} s_t} b(y_t|s_t)
• Matrix form:
p(y^n) = π [ ∏_{t=1}^{n} B_t A ] 1

where

π = {π_i};  A = {a_ij};  1 = all-ones column vector
B_t = diag(b(y_t|s_1), . . . , b(y_t|s_M))
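The matrix form can be checked against the defining sum over all state sequences. A sketch with invented two-state parameters (brute force is only feasible for short sequences):

```python
import itertools

A = [[0.9, 0.1], [0.2, 0.8]]       # transition matrix {a_ij}, illustrative
Bchan = [[0.7, 0.3], [0.1, 0.9]]   # channel b(y|s), rows indexed by state
pi = [0.5, 0.5]                    # initial state distribution

def likelihood_matrix(y):
    # p(y^n) = pi (B_1 A)(B_2 A)...(B_n A) 1, with B_t = diag(b(y_t|s_i))
    v = list(pi)
    for yt in y:
        v = [v[i] * Bchan[i][yt] for i in range(2)]                    # v <- v B_t
        v = [sum(v[i] * A[i][j] for i in range(2)) for j in range(2)]  # v <- v A
    return sum(v)                                                      # v 1

def likelihood_bruteforce(y):
    # direct evaluation: sum over all s^n of pi_{s1} b(y1|s1) prod a_{..} b(..)
    total = 0.0
    for s in itertools.product(range(2), repeat=len(y)):
        p = pi[s[0]] * Bchan[s[0]][y[0]]
        for t in range(1, len(y)):
            p *= A[s[t-1]][s[t]] * Bchan[s[t]][y[t]]
        total += p
    return total
```

The matrix recursion costs O(nM²) operations rather than the O(M^n) of the direct sum.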
5
An Alternative Useful Density Form
p(y^n) = p(y_1) ∏_{t=2}^{n} p(y_t | y^{t-1})

p(y_1) = ∑_{s_1=1}^{M} π_{s_1} b(y_1|s_1)

p(y_t | y^{t-1}) = ∑_{s_t=1}^{M} p(s_t | y^{t-1}) b(y_t|s_t)
• The predictive density p(st|yt−1) can be efficiently calculated using
a forward recursion
6
Parametrization of HMP
• The “usual” parameter of the HMP is given by φ = (π, A, θ)
π – the initial state distribution
A – the transition matrix
θ = {θ_j, j = 1, . . . , M} – the parameter of the channel
– For stationary Markov chains, π = πA, and the usual parameter
becomes φ = (A, θ)
• Non-usual parametrization occurs when (π, A, θ) depend on another
parameter
7
HMP’s vs. Markov Chains
• HMP’s have far more complex structure than Markov chains
• Information theorists deal primarily with finite-alphabet HMP’s.
For these processes:
– There is no closed-form single-letter expression for the entropy
rate of an HMP
– A powerful technique, known as the method of types, works for
Markov chains but not for HMP’s
8
HMP’s as Dependent Mixture Processes
• “HMP’s are mixture processes with Markov dependence” (Leroux
1992)
– Each observation has a mixture density
p(y_t) = ∑_{j=1}^{M} π_j b(y_t | S_t = j)
– Unlike independent mixture processes, HMP observations need not
be statistically independent
– HMP’s of interest have weakly dependent observations
9
HMP’s as Martingale Difference Sequences
• If each density b(y_t|s_t) has zero mean, then

E{Y_t | Y^{t-1}} = 0

and {Y_t} is a martingale difference sequence
– As such, the HMP is a sequence of uncorrelated random
variables that need not be statistically independent
– The observations Y^{t-1} are not useful in predicting Y_t in the
MMSE sense
10
HMP’s as Dynamical Systems
• HMP’s have state-space representations (Segall, 1976)
• For example, a finite-state finite-alphabet HMP has the following
representation
S_{t+1} = A′ S_t + V_{t+1}
Y_{t+1} = B′ S_t + W_{t+1}

where S_t is a unit vector in R^M, and {V_t} and {W_t} are martingale
difference sequences
11
Statistical properties of HMP’s
• HMP’s inherit their statistical properties from the Markov chains
(Adler, 1961)
– Stationary Markov chain ⇒ stationary HMP
– Stationary ergodic Markov chain ⇒ stationary ergodic HMP
– Stationary mixing Markov chain ⇒ stationary mixing HMP
12
Stationary, Ergodic, Mixing Markov Chains
• A Markov chain has a unique stationary distribution if and only if
the set of positive recurrent states is non-empty and irreducible
– An irreducible finite-state Markov chain has a unique positive
stationary distribution
• A Markov chain is an ergodic process if and only if it is irreducible
recurrent
– An irreducible finite-state Markov chain is ergodic
• An irreducible recurrent Markov chain is mixing if and only if it is
aperiodic
13
Extensions of HMP’s
• Examples of extensions of the HMP as defined earlier:
– Switching autoregressive processes
– Continuous-time hidden Markov processes
– Markov-modulated Poisson processes (MMPP’s)
– Finite-state channels
14
Switching autoregressive processes
• Markov chains observed through channels with memory
• A simple example of a switching autoregressive process is
Y_t = −∑_{i=1}^{r} c_{s_t}(i) Y_{t-i} + σ_{s_t} W_{s_t}(t),   t = 1, 2, . . .

where the prediction coefficients {c_{s_t}(i), i = 1, . . . , r} and the gain
σ_{s_t} depend on the state s_t at the given time
• Non-trivial stability issues
• A sufficient condition for second-order stationarity is that the
spectral radius of a specifically constructed Mr² × Mr² matrix is
smaller than one (Lindgren, Holst and Thuvesholmen, 1994)
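A minimal simulation sketch for r = 1 and two states (all parameter values are invented, and both regimes are chosen stable, so the sketch sidesteps the stability issues just mentioned):

```python
import random

A = [[0.95, 0.05], [0.05, 0.95]]   # state transition probabilities, illustrative
c = [[-0.5], [0.3]]                # prediction coefficients c_s(i), r = 1
sigma = [1.0, 2.0]                 # state-dependent gains

def simulate_switching_ar(n, seed=1):
    rng = random.Random(seed)
    s = 0
    y = [0.0]                      # initial condition Y_0 = 0
    for _ in range(n):
        s = rng.choices(range(2), weights=A[s])[0]   # switch regime
        w = rng.gauss(0.0, 1.0)                      # i.i.d. N(0, 1) innovation
        # Y_t = -c_{s_t}(1) Y_{t-1} + sigma_{s_t} W_t
        y.append(-c[s][0] * y[-1] + sigma[s] * w)
    return y[1:]
```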
15
Markov-Modulated Poisson Processes
• MMPP’s have many applications in medicine, computer networks
modelling, and queueing theory
• An MMPP is a Poisson process whose rate is controlled by a
non-observable continuous-time finite-state homogeneous Markov
chain
• MMPP’s can be cast as HMP’s as follows:
– The discrete-time Markov chain is defined by sampling the
continuous-time chain at the Poisson event epochs
– The observation sequence is given by the inter-event time
durations
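The construction can be sketched by simulating competing exponential clocks: in each chain state, either a Poisson event or a state change occurs first. All rates below are hypothetical:

```python
import random

leave_rate = [0.5, 2.0]   # rate of leaving each chain state (hypothetical)
lam = [1.0, 10.0]         # Poisson event rate in each state (hypothetical)

def simulate_mmpp(n_events, seed=2):
    rng = random.Random(seed)
    s, t, last_event = 0, 0.0, 0.0
    inter_times = []
    while len(inter_times) < n_events:
        # competing exponential clocks: next Poisson event vs. state change
        t_event = rng.expovariate(lam[s])
        t_jump = rng.expovariate(leave_rate[s])
        if t_event < t_jump:
            t += t_event
            inter_times.append(t - last_event)   # observed inter-event duration
            last_event = t
        else:
            t += t_jump
            s = 1 - s                            # two-state chain: switch state
    return inter_times
```

The returned inter-event durations are the HMP observations; the hidden discrete-time chain is the continuous-time chain sampled at the event epochs.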
16
An Important Special Case of a Finite-Alphabet HMP
• Unifilar Source, Gallager, 1968:
– The n-dimensional density of the observation sequence yn given
the initial state s0 is
p(y^n | s_0) = ∏_{t=1}^{n} p(y_t | s_{t-1})

s_t = g(y_t, s_{t-1})

where g(·, ·) is a deterministic next-state function
– Amenable to the method of types
17
Problems of Interest
• Parameter estimation
• State estimation
• Order estimation
• Simultaneous parameter estimation for several HMP’s
• Waveform estimation from noisy observations
• Classification of HMP’s
• Minimum bit rate coding of an HMP
18
Examples of Important Applications
• The hookup of Markov chains with memoryless channels occurs
naturally in information theory and in various applications of
communications
• HMP’s yield a rich family of parametric models that were found
useful in many applications, e.g.,
– Automatic speech, language, and text recognition
– Turbo-codes in data communications
– DNA sequencing
– Ion-channel analysis
– Econometrics
19
Early Milestones in HMP’s
• Deterministic functions of Markov chains were proposed by
Shannon in 1948 as models for sources with memory
• They were extensively studied in the late 1950's by several notable
statisticians: Blackwell and Koopmans, Burke and Rosenblatt, and
Gilbert
• In 1965 Wonham developed a finite-dimensional conditional mean
estimator for the states of a continuous-time Markov chain
observed in white noise
• In 1966 Baum and Petrie published their seminal work on
finite-alphabet HMP’s
20
The Work of Baum and Petrie (1966-1970)
• Thorough analysis of finite-state finite-alphabet HMP’s:
– Identifiability, stationarity and ergodicity
– Entropy ergodic theorem
– Asymptotic optimality of the ML parameter estimator
– Recursive state estimation
– EM algorithm for parameter estimation
21
The 1992-Present Era
• Renewed interest in the theory of HMP’s began with the pioneering
work of Leroux in 1992 and of Bickel, Ritov and Ryden in 1998.
They have extended the work of Baum and Petrie to HMP’s with
finite-state spaces and general alphabets.
• Subsequent extensions were for the following cases:
– HMP’s with compact state spaces and general alphabets
(Jensen and Petersen, 1999)
– Switching autoregressive processes with compact state spaces
and general alphabets (Douc, Moulines and Ryden, 2001)
22
Ergodic theorems for HMP’s
• The theorems were used in proving consistency of the ML
estimator of the parameter of the HMP
• They form the basis for the equipartition property in information
theory
• I will review the following ergodic theorems:
– Shannon-McMillan-Breiman theorem
– Baum-Petrie-Leroux theorem
– Barron theorem
– Finesso theorem
23
Shannon-McMillan-Breiman theorem for
Finite-Alphabet HMP’s
• The SMB theorem applies to stationary ergodic HMP’s:
lim_{n→∞} −(1/n) log p(Y^n; φ_0) = H(P_{φ_0})   P_{φ_0}−a.s.

where

H(P_{φ_0}) = lim_{n→∞} (1/n) E_{φ_0}{−log p(Y^n; φ_0)} < ∞

is the entropy rate of {Y_t}
24
Leroux Ergodic Theorem for General HMP’s, 1992
• For stationary ergodic HMP {Yt} ∼ Pφ0 and any parameter φ ∈ Φ
lim_{n→∞} (1/n) log p(Y^n; φ) = H(P_{φ_0}, P_φ)   P_{φ_0}−a.s.

where

H(P_{φ_0}, P_φ) = lim_{n→∞} (1/n) E_{φ_0}{log p(Y^n; φ)} < ∞
• The theorem was first proved for finite-alphabet HMP’s by Baum
and Petrie (1966)
25
Leroux Ergodic Theorem for HMP’s (Cont.)
• Leroux’s theorem implies convergence of the relative entropy
density to the relative entropy rate:
lim_{n→∞} (1/n) log [ dP^{(n)}_{φ_0} / dP^{(n)}_φ ] = D(P_{φ_0} || P_φ)   P_{φ_0}−a.s.

where

D(P_{φ_0} || P_φ) = lim_{n→∞} (1/n) E_{φ_0}{ log [ p(Y^n; φ_0) / p(Y^n; φ) ] }

• An important property:

D(P_{φ_0} || P_φ) ≥ 0 with equality iff φ ∼ φ_0
26
Barron Ergodic Theorem, 1984
• P is a stationary ergodic probability measure and Q is an mth order
Markov measure on a standard alphabet Borel space
lim_{n→∞} (1/n) log [ dP^{(n)} / dQ^{(n)} ] = D(P || Q)   P−a.e. and in L1
• In Leroux's theorem, P and Q are HMP measures; here, P is any
stationary ergodic measure but Q is a Markov measure
• An extension of Barron's theorem to an HMP measure Q is not
generally known. Such an extension for finite-alphabet processes was
shown by Finesso, 1990.
27
Underlying Principles for Ergodic Theorems
• Kingman’s sub-additive ergodic theorem (1973)
• Martingale convergence theorem
• Furstenberg and Kesten ergodic theorem for product of random
matrices (1960)
• Le Gland and Mevel geometric ergodicity of an extended Markov
chain (2000)
28
Asymptotic Optimality of MLE for General HMP’s
• The ML estimator

φ̂(n) = argmax_{φ ∈ Φ} log p(y^n; φ)

• Strong consistency (Leroux, 1992)

lim_{n→∞} φ̂(n) = φ_0   P_{φ_0}−a.s.

• Asymptotic normality (Bickel, Ritov and Ryden, 1998)

n^{1/2} (φ̂(n) − φ_0) → N(0, I_{φ_0}^{−1})   P_{φ_0}−weakly as n → ∞

– I_{φ_0} is the Fisher information matrix

I_{φ_0} = E_{φ_0}{Z Z′};  Z = lim_{n→∞} D_φ log p(Y_1 | Y^0_{−n}; φ) |_{φ=φ_0}

• The MLE is efficient
29
The proof of Bickel, Ritov and Ryden, 1998
• A Taylor expansion of DL_n(φ̂_n) = D log p(y^n; φ̂_n) about φ_0 gives

n^{1/2} (φ̂_n − φ_0) = [ −n^{−1} D²L_n(φ̄_n) ]^{−1} n^{−1/2} DL_n(φ_0)

where φ̄_n ∈ (φ_0, φ̂_n)

– CLT for the score function: n^{−1/2} DL_n(φ_0) → N(0, I_{φ_0})
– LLN for the observed information: −n^{−1} D²L_n(φ̄_n) → I_{φ_0}
• The crux of the proof is based on bounds developed for the mixing
coefficients of the inhomogeneous Markov chain {St} given {Yt}
30
Algorithms for ML Parameter Estimation
• Baum, Petrie, Soules and Weiss, 1970, developed an iterative
algorithm for local maximization of the likelihood function of a
general HMP. It is the EM algorithm applied to HMP’s.
– The algorithm increases the likelihood function unless a stationary
point is reached
– Unlike the Kalman filter, the Baum algorithm does not provide
the error covariance in each iteration
• Louis, 1982, developed a formula for calculating the observed
information matrix, and Bickel, Ritov and Ryden proved consistency of the estimator
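A minimal sketch of one Baum re-estimation step for a two-state binary-alphabet HMP, without the numerical scaling needed for long sequences; all numbers are invented. The monotonicity property can be observed directly, as the likelihood after the step is no smaller than before:

```python
def likelihood(y, pi, A, B):
    # forward pass: alpha_t(i) = p(y^t, S_t = i); likelihood is sum_i alpha_n(i)
    alpha = [pi[i] * B[i][y[0]] for i in range(2)]
    for yt in y[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(2)) * B[j][yt]
                 for j in range(2)]
    return sum(alpha)

def baum_step(y, pi, A, B):
    n = len(y)
    # forward and backward variables
    alpha = [[pi[i] * B[i][y[0]] for i in range(2)]]
    for t in range(1, n):
        alpha.append([sum(alpha[t-1][i] * A[i][j] for i in range(2)) * B[j][y[t]]
                      for j in range(2)])
    beta = [[1.0, 1.0] for _ in range(n)]
    for t in range(n - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[j][y[t+1]] * beta[t+1][j] for j in range(2))
                   for i in range(2)]
    py = sum(alpha[n-1])
    # state and transition posteriors
    gamma = [[alpha[t][i] * beta[t][i] / py for i in range(2)] for t in range(n)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][y[t+1]] * beta[t+1][j] / py
            for j in range(2)] for i in range(2)] for t in range(n - 1)]
    # re-estimation formulas
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(n-1)) /
              sum(gamma[t][i] for t in range(n-1)) for j in range(2)]
             for i in range(2)]
    new_B = [[sum(gamma[t][i] for t in range(n) if y[t] == k) /
              sum(gamma[t][i] for t in range(n)) for k in range(2)]
             for i in range(2)]
    return new_pi, new_A, new_B

y_obs = [0, 1, 1, 0, 0, 1, 0, 1, 1, 1]
pi0, A0, B0 = [0.5, 0.5], [[0.6, 0.4], [0.3, 0.7]], [[0.7, 0.3], [0.4, 0.6]]
new_pi, new_A, new_B = baum_step(y_obs, pi0, A0, B0)
```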
31
Algorithms for ML Estimation (Cont.)
• The Baum-Viterbi algorithm, Jelinek, 1976, performs alternate
maximization of log p(sn, yn;φ) over the state sequence sn and
parameter φ
– Results in inconsistent parameter and state estimators
– Useful in applications where observation vectors of sufficiently
high dimensions are used, e.g., speech recognition
• Farago-Lugosi, 1989, provided an elegant non-iterative algorithm
for global maximization of log p(sn, yn;φ) for left-right HMP’s
– The Markov chain of a left-right HMP allows only self-transitions
and forward transitions
32
Ziv’s Inequality
• Tight upper bound on the global maximum of the likelihood
function of any stationary ergodic finite-alphabet HMP
• The bound uses universal coding of the observation sequence
• The bound for any observation sequence yn is given by
max_φ (1/n) log p(y^n; φ) ≤ −(1/n) u(y^n) + δ_n

where u(y^n) is the binary codeword length for y^n in the Lempel-Ziv
universal data compression algorithm, and δ_n → 0 as n → ∞ independently of φ
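The codeword length u(y^n) comes from LZ78 incremental parsing. A sketch that parses a sequence into phrases and uses the common approximation of c (log2 c + log2 |Y|) bits for c phrases; the exact bit-accounting convention is an assumption here:

```python
import math

def lz78_codeword_length(y, alphabet_size=2):
    # incremental parsing: each new phrase = longest seen prefix + one symbol
    phrases = {(): 0}
    current = ()
    for symbol in y:
        candidate = current + (symbol,)
        if candidate in phrases:
            current = candidate        # extend the current phrase
        else:
            phrases[candidate] = len(phrases)
            current = ()               # start a new phrase
    c = len(phrases) - 1 + (1 if current else 0)   # number of parsed phrases
    if c == 0:
        return 0.0
    # approximate codeword length in bits (assumed accounting convention)
    return c * (math.log2(c) + math.log2(alphabet_size))
```

For example, [0, 1, 0, 0, 1, 1] parses into the four phrases 0 | 1 | 00 | 11.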
33
Order Estimation
• Strongly consistent order estimators were developed for
finite-alphabet HMP’s
– Penalized likelihood function (Finesso, 1990)
M̂_n = min { argmin_{j≥1} { −(1/n) log p(y^n; φ̂_j) + 2 c_j² (log n)/n } }

where φ̂_j is the MLE of φ and c_j = j(j + |Y| − 2)
• For a general HMP, Ryden (1995) proposed a penalized likelihood
estimator that does not underestimate the order as n →∞, a.s.
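Given the per-order maximized log-likelihoods, the penalized-likelihood estimator is a one-line scan. A sketch in which the log-likelihood values passed in are hypothetical, not fitted:

```python
import math

def estimate_order(loglik_by_order, n, y_card):
    # loglik_by_order[j-1] = log p(y^n; phi_hat_j) for orders j = 1, 2, ...
    best_j, best_val = None, float("inf")
    for j, ll in enumerate(loglik_by_order, start=1):
        c_j = j * (j + y_card - 2)
        # penalized criterion: -(1/n) log-likelihood + 2 c_j^2 log(n)/n
        val = -ll / n + 2 * (c_j ** 2) * math.log(n) / n
        if val < best_val:        # strict '<' keeps the smallest minimizer
            best_j, best_val = j, val
    return best_j

# hypothetical maximized log-likelihoods for orders 1..4 on n = 1000 samples
order = estimate_order([-900.0, -650.0, -649.0, -648.5], n=1000, y_card=2)
```

The penalty grows quickly in j, so orders beyond the point of diminishing likelihood gains are rejected.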
34
Concluding Remarks
• The theory and applications of HMP’s are very rich and there is
much more to say about HMP’s
• Recent successes of the theory include the application of the
forward-backward recursions in turbo-coding and in Monte Carlo
simulation for nonlinear signal estimation (“particle filters”)
• The large-sample behavior of HMP's and of their maximum likelihood
estimators is well understood
35
Several problems of current interest
• Fast algorithms for parameter estimation
• Robust parameter estimation from relatively short data
• HMP design for optimal performance in classification problems
36