Hidden Markov Processes - George Mason University (ece.gmu.edu/~yephraim/talks/hmp-montpilier.pdf)
TRANSCRIPT
HIDDEN MARKOV PROCESSES
Yariv Ephraim
George Mason University, Fairfax, VA 22030
• This presentation is based on “Hidden Markov Processes,” by Y.
Ephraim and N. Merhav, which appeared in IEEE Trans. Inform.
Theory, vol. 48, pp. 1518-1569, June 2002
1
Motivation
• HMP’s have attracted significant research interest for over 50
years:
– They are fairly general processes that are amenable to
mathematical analysis
– They were found useful in many applications
• I will present a brief survey from the statistical and
information-theoretic viewpoints
2
Definition of HMP
• A Markov chain observed through a noisy channel
{S_t} Markov chain, S_t ∈ S  →  memoryless channel b(y_t|s_t)  →  {Y_t} observation sequence, Y_t ∈ Y

p(y^n | s^n) = ∏_{t=1}^{n} b(y_t | s_t)
– If Y is finite, we have a finite-alphabet HMP
– If Y is not necessarily finite, e.g., standard alphabet, we have a
general HMP
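This hookup is straightforward to simulate. A minimal Python sketch with an invented two-state chain and binary channel (the matrices A, B and the vector pi are illustrative, not from the talk):

```python
import random

A = [[0.9, 0.1],   # a_ij = P(S_{t+1} = j | S_t = i), illustrative values
     [0.2, 0.8]]
B = [[0.7, 0.3],   # b(y|s) = P(Y_t = y | S_t = s), illustrative values
     [0.1, 0.9]]
pi = [0.5, 0.5]    # initial state distribution

def sample_hmp(n, seed=0):
    """Sample n steps of the chain {S_t} and its noisy observations {Y_t}."""
    rng = random.Random(seed)
    states, obs = [], []
    s = rng.choices(range(2), weights=pi)[0]
    for _ in range(n):
        # memoryless channel: Y_t depends only on the current state S_t
        y = rng.choices(range(2), weights=B[s])[0]
        states.append(s)
        obs.append(y)
        s = rng.choices(range(2), weights=A[s])[0]   # Markov transition
    return states, obs

states, obs = sample_hmp(100)
```

With a finite observation alphabet (here {0, 1}) this is a finite-alphabet HMP; replacing the channel draw with, say, a state-dependent Gaussian would give a general HMP.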
3
Examples of HMP’s
• Deterministic channel
– Yt = f(St) where f(·) is a many-to-one function
– These are aggregated Markov chains, but generally not Markov
chains themselves; they exhibit longer memory than Markov chains
– Any finite-alphabet HMP is a deterministic function of a Markov
chain with augmented state space S × Y
• Gaussian channel
– The conditional density b(y_t|s_t) is normal with mean and/or
variance dependent on s_t
4
Density of HMP
• The n-dimensional joint density of observations and states is given
by
p(y^n, s^n) = π_{s_1} b(y_1|s_1) ∏_{t=2}^{n} a_{s_{t-1} s_t} b(y_t|s_t)
• Matrix form:
p(y^n) = π [ ∏_{t=1}^{n} B_t A ] 1

where

π = {π_i};  A = {a_ij};  1 = all-ones column vector
B_t = diag(b(y_t|s_1), . . . , b(y_t|s_M))
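The matrix form can be checked against the defining sum over all state sequences. A sketch with invented two-state parameters (brute force is only feasible for short sequences):

```python
import itertools

A = [[0.9, 0.1], [0.2, 0.8]]       # transition matrix {a_ij}, illustrative
Bchan = [[0.7, 0.3], [0.1, 0.9]]   # channel b(y|s), rows indexed by state
pi = [0.5, 0.5]                    # initial state distribution

def likelihood_matrix(y):
    # p(y^n) = pi (B_1 A)(B_2 A)...(B_n A) 1, with B_t = diag(b(y_t|s_i))
    v = list(pi)
    for yt in y:
        v = [v[i] * Bchan[i][yt] for i in range(2)]                    # v <- v B_t
        v = [sum(v[i] * A[i][j] for i in range(2)) for j in range(2)]  # v <- v A
    return sum(v)                                                      # v 1

def likelihood_bruteforce(y):
    # direct evaluation: sum over all s^n of pi_{s1} b(y1|s1) prod a_{..} b(..)
    total = 0.0
    for s in itertools.product(range(2), repeat=len(y)):
        p = pi[s[0]] * Bchan[s[0]][y[0]]
        for t in range(1, len(y)):
            p *= A[s[t-1]][s[t]] * Bchan[s[t]][y[t]]
        total += p
    return total
```

The matrix recursion costs O(nM²) operations rather than the O(M^n) of the direct sum.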
5
An Alternative Useful Density Form
p(y^n) = p(y_1) ∏_{t=2}^{n} p(y_t | y^{t-1})

p(y_1) = ∑_{s_1=1}^{M} π_{s_1} b(y_1|s_1)

p(y_t | y^{t-1}) = ∑_{s_t=1}^{M} p(s_t | y^{t-1}) b(y_t|s_t)
• The predictive density p(st|yt−1) can be efficiently calculated using
a forward recursion
6
Parametrization of HMP
• The “usual” parameter of the HMP is given by φ = (π, A, θ)
π – the initial state distribution
A – the transition matrix
θ = {θ_j, j = 1, . . . , M} – the parameter of the channel
– For stationary Markov chains, π = πA, and the usual parameter
becomes φ = (A, θ)
• Non-usual parametrization occurs when (π, A, θ) depend on another
parameter
7
HMP’s vs. Markov Chains
• HMP’s have far more complex structure than Markov chains
• Information theorists deal primarily with finite-alphabet HMP’s.
For these processes:
– There is no closed-form single-letter expression for the entropy
rate of an HMP
– A powerful technique, known as the method of types, works for
Markov chains but not for HMP’s
8
HMP’s as Dependent Mixture Processes
• “HMP’s are mixture processes with Markov dependence” (Leroux
1992)
– Each observation has a mixture density
p(y_t) = ∑_{j=1}^{M} π_j b(y_t | S_t = j)
– Unlike independent mixture processes, HMP observations need not
be statistically independent
– HMP’s of interest have weakly dependent observations
9
HMP’s as Martingale Difference Sequences
• If each density b(y_t|s_t) has zero mean, then

E{Y_t | Y^{t-1}} = 0

and {Y_t} is a martingale difference sequence
– As such, the HMP is a sequence of uncorrelated random
variables that need not be statistically independent
– The observations Y^{t-1} are not useful in predicting Y_t in the
MMSE sense
10
HMP’s as Dynamical Systems
• HMP’s have state-space representations (Segall, 1976)
• For example, a finite-state finite-alphabet HMP has the following
representation
S_{t+1} = A′ S_t + V_{t+1}
Y_{t+1} = B′ S_t + W_{t+1}

where S_t is a unit vector in R^M, and {V_t} and {W_t} are martingale
difference sequences
11
Statistical properties of HMP’s
• HMP’s inherit their statistical properties from the Markov chains
(Adler, 1961)
– Stationary Markov chain ⇒ stationary HMP
– Stationary ergodic Markov chain ⇒ stationary ergodic HMP
– Stationary mixing Markov chain ⇒ stationary mixing HMP
12
Stationary, Ergodic, Mixing Markov Chains
• A Markov chain has a unique stationary distribution if and only if
the set of positive recurrent states is non-empty and irreducible
– An irreducible finite-state Markov chain has a unique positive
stationary distribution
• A Markov chain is an ergodic process if and only if it is irreducible
recurrent
– An irreducible finite-state Markov chain is ergodic
• An irreducible recurrent Markov chain is mixing if and only if it is
aperiodic
13
Extensions of HMP’s
• Examples of extensions of the HMP as defined earlier:
– Switching autoregressive processes
– Continuous-time hidden Markov processes
– Markov-modulated Poisson processes (MMPP’s)
– Finite-state channels
14
Switching autoregressive processes
• Markov chains observed through channels with memory
• A simple example of a switching autoregressive process is
Y_t = −∑_{i=1}^{r} c_{s_t}(i) Y_{t-i} + σ_{s_t} W_{s_t}(t),   t = 1, 2, . . .

where the prediction coefficients {c_{s_t}(i), i = 1, . . . , r} and the gain
σ_{s_t} depend on the state s_t at the given time
• Non-trivial stability issues
• A sufficient condition for second-order stationarity is that the
spectral radius of a specifically constructed Mr² × Mr² matrix is
smaller than one (Lindgren, Holst and Thuvesholmen, 1994)
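A minimal simulation sketch for r = 1 and two states (all parameter values are invented, and both regimes are chosen stable, so the sketch sidesteps the stability issues just mentioned):

```python
import random

A = [[0.95, 0.05], [0.05, 0.95]]   # state transition probabilities, illustrative
c = [[-0.5], [0.3]]                # prediction coefficients c_s(i), r = 1
sigma = [1.0, 2.0]                 # state-dependent gains

def simulate_switching_ar(n, seed=1):
    rng = random.Random(seed)
    s = 0
    y = [0.0]                      # initial condition Y_0 = 0
    for _ in range(n):
        s = rng.choices(range(2), weights=A[s])[0]   # switch regime
        w = rng.gauss(0.0, 1.0)                      # i.i.d. N(0, 1) innovation
        # Y_t = -c_{s_t}(1) Y_{t-1} + sigma_{s_t} W_t
        y.append(-c[s][0] * y[-1] + sigma[s] * w)
    return y[1:]
```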
15
Markov-Modulated Poisson Processes
• MMPP’s have many applications in medicine, computer networks
modelling, and queueing theory
• An MMPP is a Poisson process whose rate is controlled by a
non-observable continuous-time finite-state homogeneous Markov
chain
• MMPP’s can be cast as HMP’s as follows:
– The discrete-time Markov chain is defined by sampling the
continuous-time chain at the Poisson event epochs
– The observation sequence is given by the inter-event time
durations
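The construction can be sketched by simulating competing exponential clocks: in each chain state, either a Poisson event or a state change occurs first. All rates below are hypothetical:

```python
import random

leave_rate = [0.5, 2.0]   # rate of leaving each chain state (hypothetical)
lam = [1.0, 10.0]         # Poisson event rate in each state (hypothetical)

def simulate_mmpp(n_events, seed=2):
    rng = random.Random(seed)
    s, t, last_event = 0, 0.0, 0.0
    inter_times = []
    while len(inter_times) < n_events:
        # competing exponential clocks: next Poisson event vs. state change
        t_event = rng.expovariate(lam[s])
        t_jump = rng.expovariate(leave_rate[s])
        if t_event < t_jump:
            t += t_event
            inter_times.append(t - last_event)   # observed inter-event duration
            last_event = t
        else:
            t += t_jump
            s = 1 - s                            # two-state chain: switch state
    return inter_times
```

The returned inter-event durations are the HMP observations; the hidden discrete-time chain is the continuous-time chain sampled at the event epochs.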
16
An Important Special Case of a Finite-Alphabet HMP
• Unifilar Source, Gallager, 1968:
– The n-dimensional density of the observation sequence yn given
the initial state s0 is
p(y^n | s_0) = ∏_{t=1}^{n} p(y_t | s_{t-1})

s_t = g(y_t, s_{t-1})

where g(·, ·) is a deterministic next-state function
– Amenable to the method of types
17
Problems of Interest
• Parameter estimation
• State estimation
• Order estimation
• Simultaneous parameter estimation for several HMP’s
• Waveform estimation from noisy observations
• Classification of HMP’s
• Minimum bit rate coding of an HMP
18
Examples of Important Applications
• The hookup of Markov chains with memoryless channels occurs
naturally in information theory and in various applications of
communications
• HMP’s yield a rich family of parametric models that were found
useful in many applications, e.g.,
– Automatic speech, language, and text recognition
– Turbo-codes in data communications
– DNA sequencing
– Ion-channel analysis
– Econometrics
19
Early Milestones in HMP’s
• Deterministic functions of Markov chains were proposed by
Shannon in 1948 as models for sources with memory
• They were extensively studied in the late 1950's by several notable
statisticians: Blackwell and Koopmans, Burke and Rosenblatt, and
Gilbert
• In 1965 Wonham developed a finite-dimensional conditional mean
estimator for the states of a continuous-time Markov chain
observed in white noise
• In 1966 Baum and Petrie published their seminal work on
finite-alphabet HMP’s
20
The Work of Baum and Petrie (1966-1970)
• Thorough analysis of finite-state finite-alphabet HMP’s:
– Identifiability, stationarity and ergodicity
– Entropy ergodic theorem
– Asymptotic optimality of the ML parameter estimator
– Recursive state estimation
– EM algorithm for parameter estimation
21
The 1992-Present Era
• Renewed interest in the theory of HMP’s began with the pioneering
work of Leroux in 1992 and of Bickel, Ritov and Ryden in 1998.
They have extended the work of Baum and Petrie to HMP’s with
finite-state spaces and general alphabets.
• Subsequent extensions were for the following cases:
– HMP’s with compact state spaces and general alphabets
(Jensen and Petersen, 1999)
– Switching autoregressive processes with compact state spaces
and general alphabets (Douc, Moulines and Ryden, 2001)
22
Ergodic theorems for HMP’s
• The theorems were used in proving consistency of the ML
estimator of the parameter of the HMP
• They form the basis for the equipartition property in information
theory
• I will review the following ergodic theorems:
– Shannon-McMillan-Breiman theorem
– Baum-Petrie-Leroux theorem
– Barron theorem
– Finesso theorem
23
Shannon-McMillan-Breiman theorem for
Finite-Alphabet HMP’s
• The SMB theorem applies to stationary ergodic HMP’s:
lim_{n→∞} −(1/n) log p(Y^n; φ_0) = H(P_{φ_0})   P_{φ_0}−a.s.

where

H(P_{φ_0}) = lim_{n→∞} (1/n) E_{φ_0}{−log p(Y^n; φ_0)} < ∞

is the entropy rate of {Y_t}
24
Leroux Ergodic Theorem for General HMP’s, 1992
• For stationary ergodic HMP {Yt} ∼ Pφ0 and any parameter φ ∈ Φ
lim_{n→∞} (1/n) log p(Y^n; φ) = H(P_{φ_0}, P_φ)   P_{φ_0}−a.s.

where

H(P_{φ_0}, P_φ) = lim_{n→∞} (1/n) E_{φ_0}{log p(Y^n; φ)} < ∞
• The theorem was first proved for finite-alphabet HMP’s by Baum
and Petrie (1966)
25
Leroux Ergodic Theorem for HMP’s (Cont.)
• Leroux’s theorem implies convergence of the relative entropy
density to the relative entropy rate:
lim_{n→∞} (1/n) log [ dP^{(n)}_{φ_0} / dP^{(n)}_φ ] = D(P_{φ_0} || P_φ)   P_{φ_0}−a.s.

where

D(P_{φ_0} || P_φ) = lim_{n→∞} (1/n) E_{φ_0}{ log [ p(Y^n; φ_0) / p(Y^n; φ) ] }

• An important property:

D(P_{φ_0} || P_φ) ≥ 0 with equality iff φ ∼ φ_0
26
Barron Ergodic Theorem, 1984
• P is a stationary ergodic probability measure and Q is an mth order
Markov measure on a standard alphabet Borel space
lim_{n→∞} (1/n) log [ dP^{(n)} / dQ^{(n)} ] = D(P || Q)   P−a.e. and in L1
• In Leroux's theorem, P and Q are HMP measures; here, P is any
stationary ergodic measure but Q is a Markov measure
• An extension of Barron's theorem to an HMP measure Q is not
generally known. Such an extension for finite-alphabet processes was
shown by Finesso, 1990.
27
Underlying Principles for Ergodic Theorems
• Kingman’s sub-additive ergodic theorem (1973)
• Martingale convergence theorem
• Furstenberg and Kesten ergodic theorem for product of random
matrices (1960)
• Le Gland and Mevel geometric ergodicity of an extended Markov
chain (2000)
28
Asymptotic Optimality of MLE for General HMP’s
• The ML estimator

φ̂(n) = argmax_{φ ∈ Φ} log p(y^n; φ)

• Strong consistency (Leroux, 1992)

lim_{n→∞} φ̂(n) = φ_0   P_{φ_0}−a.s.

• Asymptotic normality (Bickel, Ritov and Ryden, 1998)

n^{1/2} (φ̂(n) − φ_0) → N(0, I_{φ_0}^{−1})   P_{φ_0}−weakly as n → ∞

– I_{φ_0} is the Fisher information matrix

I_{φ_0} = E_{φ_0}{Z Z′};  Z = lim_{n→∞} D_φ log p(Y_1 | Y^0_{−n}; φ) |_{φ=φ_0}

• The MLE is efficient
29
The proof of Bickel, Ritov and Ryden, 1998
• A Taylor expansion of DL_n(φ̂_n) = D log p(y^n; φ̂_n) about φ_0 gives

n^{1/2} (φ̂_n − φ_0) = [ −n^{−1} D²L_n(φ̄_n) ]^{−1} n^{−1/2} DL_n(φ_0)

where φ̄_n ∈ (φ_0, φ̂_n)

– CLT for the score function: n^{−1/2} DL_n(φ_0) → N(0, I_{φ_0})
– LLN for the observed information: −n^{−1} D²L_n(φ̄_n) → I_{φ_0}
• The crux of the proof is based on bounds developed for the mixing
coefficients of the inhomogeneous Markov chain {St} given {Yt}
30
Algorithms for ML Parameter Estimation
• Baum, Petrie, Soules and Weiss, 1970, developed an iterative
algorithm for local maximization of the likelihood function of a
general HMP. It is the EM algorithm applied to HMP’s.
– The algorithm increases the likelihood function unless a stationary
point is reached
– Unlike the Kalman filter, the Baum algorithm does not provide
the error covariance in each iteration
• Louis, 1982, developed a formula for calculating the observed
information matrix, and Bickel, Ritov and Ryden proved consistency of the estimator
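A minimal sketch of one Baum re-estimation step for a two-state binary-alphabet HMP, without the numerical scaling needed for long sequences; all numbers are invented. The monotonicity property can be observed directly, as the likelihood after the step is no smaller than before:

```python
def likelihood(y, pi, A, B):
    # forward pass: alpha_t(i) = p(y^t, S_t = i); likelihood is sum_i alpha_n(i)
    alpha = [pi[i] * B[i][y[0]] for i in range(2)]
    for yt in y[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(2)) * B[j][yt]
                 for j in range(2)]
    return sum(alpha)

def baum_step(y, pi, A, B):
    n = len(y)
    # forward and backward variables
    alpha = [[pi[i] * B[i][y[0]] for i in range(2)]]
    for t in range(1, n):
        alpha.append([sum(alpha[t-1][i] * A[i][j] for i in range(2)) * B[j][y[t]]
                      for j in range(2)])
    beta = [[1.0, 1.0] for _ in range(n)]
    for t in range(n - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[j][y[t+1]] * beta[t+1][j] for j in range(2))
                   for i in range(2)]
    py = sum(alpha[n-1])
    # state and transition posteriors
    gamma = [[alpha[t][i] * beta[t][i] / py for i in range(2)] for t in range(n)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][y[t+1]] * beta[t+1][j] / py
            for j in range(2)] for i in range(2)] for t in range(n - 1)]
    # re-estimation formulas
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(n-1)) /
              sum(gamma[t][i] for t in range(n-1)) for j in range(2)]
             for i in range(2)]
    new_B = [[sum(gamma[t][i] for t in range(n) if y[t] == k) /
              sum(gamma[t][i] for t in range(n)) for k in range(2)]
             for i in range(2)]
    return new_pi, new_A, new_B

y_obs = [0, 1, 1, 0, 0, 1, 0, 1, 1, 1]
pi0, A0, B0 = [0.5, 0.5], [[0.6, 0.4], [0.3, 0.7]], [[0.7, 0.3], [0.4, 0.6]]
new_pi, new_A, new_B = baum_step(y_obs, pi0, A0, B0)
```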
31
Algorithms for ML Estimation (Cont.)
• The Baum-Viterbi algorithm, Jelinek, 1976, performs alternate
maximization of log p(sn, yn;φ) over the state sequence sn and
parameter φ
– Results in inconsistent parameter and state estimators
– Useful in applications where observation vectors of sufficiently
high dimensions are used, e.g., speech recognition
• Farago-Lugosi, 1989, provided an elegant non-iterative algorithm
for global maximization of log p(sn, yn;φ) for left-right HMP’s
– The Markov chain of a left-right HMP allows only self-transitions
and forward transitions
32
Ziv’s Inequality
• Tight upper bound on the global maximum of the likelihood
function of any stationary ergodic finite-alphabet HMP
• The bound uses universal coding of the observation sequence
• The bound for any observation sequence yn is given by
max_φ (1/n) log p(y^n; φ) ≤ −(1/n) u(y^n) + δ_n

where u(y^n) is the binary codeword length for y^n in the Lempel-Ziv
universal data compression algorithm, and δ_n → 0 as n → ∞ independently of φ
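The codeword length u(y^n) comes from LZ78 incremental parsing. A sketch that parses a sequence into phrases and uses the common approximation of c (log2 c + log2 |Y|) bits for c phrases; the exact bit-accounting convention is an assumption here:

```python
import math

def lz78_codeword_length(y, alphabet_size=2):
    # incremental parsing: each new phrase = longest seen prefix + one symbol
    phrases = {(): 0}
    current = ()
    for symbol in y:
        candidate = current + (symbol,)
        if candidate in phrases:
            current = candidate        # extend the current phrase
        else:
            phrases[candidate] = len(phrases)
            current = ()               # start a new phrase
    c = len(phrases) - 1 + (1 if current else 0)   # number of parsed phrases
    if c == 0:
        return 0.0
    # approximate codeword length in bits (assumed accounting convention)
    return c * (math.log2(c) + math.log2(alphabet_size))
```

For example, [0, 1, 0, 0, 1, 1] parses into the four phrases 0 | 1 | 00 | 11.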
33
Order Estimation
• Strongly consistent order estimators were developed for
finite-alphabet HMP’s
– Penalized likelihood function (Finesso, 1990)
M̂_n = min { argmin_{j≥1} { −(1/n) log p(y^n; φ̂_j) + 2 c_j² (log n)/n } }

where φ̂_j is the MLE of φ and c_j = j(j + |Y| − 2)
• For a general HMP, Ryden (1995) proposed a penalized likelihood
estimator that does not underestimate the order as n →∞, a.s.
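Given the per-order maximized log-likelihoods, the penalized-likelihood estimator is a one-line scan. A sketch in which the log-likelihood values passed in are hypothetical, not fitted:

```python
import math

def estimate_order(loglik_by_order, n, y_card):
    # loglik_by_order[j-1] = log p(y^n; phi_hat_j) for orders j = 1, 2, ...
    best_j, best_val = None, float("inf")
    for j, ll in enumerate(loglik_by_order, start=1):
        c_j = j * (j + y_card - 2)
        # penalized criterion: -(1/n) log-likelihood + 2 c_j^2 log(n)/n
        val = -ll / n + 2 * (c_j ** 2) * math.log(n) / n
        if val < best_val:        # strict '<' keeps the smallest minimizer
            best_j, best_val = j, val
    return best_j

# hypothetical maximized log-likelihoods for orders 1..4 on n = 1000 samples
order = estimate_order([-900.0, -650.0, -649.0, -648.5], n=1000, y_card=2)
```

The penalty grows quickly in j, so orders beyond the point of diminishing likelihood gains are rejected.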
34
Concluding Remarks
• The theory and applications of HMP’s are very rich and there is
much more to say about HMP’s
• Recent successes of the theory include the application of the
forward-backward recursions in turbo-coding and in Monte Carlo
simulation for nonlinear signal estimation (“particle filters”)
• The large-sample behavior of HMP's and of their maximum likelihood
estimators is well understood
35
Several problems of current interest
• Fast algorithms for parameter estimation
• Robust parameter estimation from relatively short data
• HMP design for optimal performance in classification problems
36