Fast Temporal State-Splitting for HMM Model Selection and Learning

Sajid Siddiqi, Geoffrey Gordon, Andrew Moore

TRANSCRIPT

Page 1: Fast Temporal State-Splitting for HMM Model Selection and Learning

Fast Temporal State-Splitting for HMM Model Selection and Learning

Sajid Siddiqi

Geoffrey Gordon

Andrew Moore

Page 2: Fast Temporal State-Splitting for HMM Model Selection and Learning

[Plot: a sequence of observations x against time t]

Page 3: Fast Temporal State-Splitting for HMM Model Selection and Learning

[Plot: the same observation sequence]

How many kinds of observations (x)?

Page 4: Fast Temporal State-Splitting for HMM Model Selection and Learning

[Plot: the same observation sequence]

How many kinds of observations (x)? 3

Page 5: Fast Temporal State-Splitting for HMM Model Selection and Learning

[Plot: the same observation sequence]

How many kinds of observations (x)? 3

How many kinds of transitions (x_{t+1} | x_t)?

Page 6: Fast Temporal State-Splitting for HMM Model Selection and Learning

[Plot: the same observation sequence]

How many kinds of observations (x)? 3

How many kinds of transitions (x_{t+1} | x_t)? 4

Page 7: Fast Temporal State-Splitting for HMM Model Selection and Learning

[Plot: the same observation sequence]

How many kinds of observations (x)? 3

How many kinds of transitions (x_{t+1} | x_t)? 4

We say that this sequence ‘exhibits four states under the first-order Markov assumption’

Our goal is to discover the number of such states (and their parameter settings) in sequential data, and to do so efficiently

Page 8: Fast Temporal State-Splitting for HMM Model Selection and Learning

Definitions

An HMM is a 3-tuple λ = {A, B, π}, where

A : NxN transition matrix

B : NxM observation probability matrix

π : Nx1 prior probability vector

|λ| : number of states in HMM λ, i.e. N

T : number of observations in sequence

qt : the state the HMM is in at time t
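These definitions map directly onto a small data structure. A minimal Python/numpy sketch (the class and field names are illustrative choices, not from the paper's code):

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class HMM:
    """lambda = {A, B, pi} for a discrete-observation HMM."""
    A: np.ndarray    # N x N transition matrix, rows sum to 1
    B: np.ndarray    # N x M observation probability matrix
    pi: np.ndarray   # prior probability vector, shape (N,)

    @property
    def n_states(self) -> int:   # |lambda|, i.e. N
        return self.A.shape[0]
```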

Page 9: Fast Temporal State-Splitting for HMM Model Selection and Learning

HMMs as DBNs

[Figure: HMM unrolled as a Dynamic Bayesian Network: hidden-state chain q_0 → q_1 → q_2 → q_3 → q_4 (prior 1/3 marked on q_0), each q_t emitting an observation O_t]

Page 10: Fast Temporal State-Splitting for HMM Model Selection and Learning

Each of these probability tables is identical

Transition Model

[Figure: the DBN chain q_0 → q_1 → … → q_4]

i | P(q_{t+1}=s_1|q_t=s_i)  P(q_{t+1}=s_2|q_t=s_i)  …  P(q_{t+1}=s_j|q_t=s_i)  …  P(q_{t+1}=s_N|q_t=s_i)
1 | a_11  a_12  …  a_1j  …  a_1N
2 | a_21  a_22  …  a_2j  …  a_2N
3 | a_31  a_32  …  a_3j  …  a_3N
: | :     :        :        :
i | a_i1  a_i2  …  a_ij  …  a_iN
: | :     :        :        :
N | a_N1  a_N2  …  a_Nj  …  a_NN

Notation: a_ij = P(q_{t+1} = s_j | q_t = s_i)

Page 11: Fast Temporal State-Splitting for HMM Model Selection and Learning

Observation Model

[Figure: the DBN chain q_0 → … → q_4 with emissions O_0 … O_4]

i | P(O_t=1|q_t=s_i)  P(O_t=2|q_t=s_i)  …  P(O_t=k|q_t=s_i)  …  P(O_t=M|q_t=s_i)
1 | b_1(1)  b_1(2)  …  b_1(k)  …  b_1(M)
2 | b_2(1)  b_2(2)  …  b_2(k)  …  b_2(M)
3 | b_3(1)  b_3(2)  …  b_3(k)  …  b_3(M)
: | :       :          :          :
i | b_i(1)  b_i(2)  …  b_i(k)  …  b_i(M)
: | :       :          :          :
N | b_N(1)  b_N(2)  …  b_N(k)  …  b_N(M)

Notation: b_i(k) = P(O_t = k | q_t = s_i)

Page 12: Fast Temporal State-Splitting for HMM Model Selection and Learning

HMMs as DBNs

[Figure: the same DBN, repeated]

Page 13: Fast Temporal State-Splitting for HMM Model Selection and Learning

HMMs as FSAs

[Figure: the DBN view alongside a finite-state-automaton view: states S1, S2, S3, S4 with transition arrows]

Page 14: Fast Temporal State-Splitting for HMM Model Selection and Learning

Operations on HMMs

Problem 1: Evaluation
Given an HMM and an observation sequence, what is the likelihood of this sequence?

Problem 2: Most Probable Path
Given an HMM and an observation sequence, what is the most probable path through state space?

Problem 3: Learning HMM parameters
Given an observation sequence and a fixed number of states, what is an HMM that is likely to have produced this string of observations?

Problem 4: Learning the number of states
Given an observation sequence, what is an HMM (of any size) that is likely to have produced this string of observations?

Page 15: Fast Temporal State-Splitting for HMM Model Selection and Learning

Operations on HMMs

Problem                                      Algorithm           Complexity
Evaluation: computing P(O|λ)                 Forward-Backward    O(TN²)
Path inference: computing
  Q* = argmax_Q P(O,Q|λ)                     Viterbi             O(TN²)
Parameter learning:
  1. computing λ* = argmax_{λ,Q} P(O,Q|λ)    Viterbi Training    O(TN²) per iteration
  2. computing λ* = argmax_λ P(O|λ)          Baum-Welch (EM)     O(TN²) per iteration
Learning the number of states                ??                  ??

Page 16: Fast Temporal State-Splitting for HMM Model Selection and Learning

Path Inference

• Viterbi Algorithm for calculating argmax_Q P(O,Q|λ)

Page 17: Fast Temporal State-Splitting for HMM Model Selection and Learning

[Figure: empty Viterbi trellis: one row per timestep t = 1 … 9, columns δ_t(1), δ_t(2), δ_t(3), …, δ_t(N)]

Page 18: Fast Temporal State-Splitting for HMM Model Selection and Learning

[Figure: the same trellis, partially filled: each entry δ_t(i) is computed from the previous row's values]

Page 19: Fast Temporal State-Splitting for HMM Model Selection and Learning

Path Inference

• Viterbi Algorithm for calculating argmax_Q P(O,Q|λ)

Running time: O(TN²)

Yields a globally optimal path through hidden state space, associating each timestep with exactly one HMM state.
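As a concrete reference for the recursion sketched in the trellis figures, here is a minimal numpy implementation in log space (the function name, argument shapes, and the log-space formulation are my choices; the slides show no code):

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Most probable state path: argmax_Q P(O, Q | lambda).

    A  : (N, N) transition matrix, A[i, j] = P(q_{t+1}=j | q_t=i)
    B  : (N, M) observation matrix, B[i, k] = P(O_t=k | q_t=i)
    pi : (N,)  prior over initial states
    obs: length-T sequence of observation symbols (ints in [0, M))
    """
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))            # delta[t, i]: best log-prob of a path ending in state i at time t
    back = np.zeros((T, N), dtype=int)  # argmax pointers for backtracking
    logA, logB = np.log(A), np.log(B)
    delta[0] = np.log(pi) + logB[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logA   # scores[i, j]: extend best path at i with i -> j
        back[t] = np.argmax(scores, axis=0)
        delta[t] = scores[back[t], np.arange(N)] + logB[:, obs[t]]
    path = np.zeros(T, dtype=int)               # backtrack from the best final state
    path[-1] = np.argmax(delta[-1])
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path
```

Each of the T steps compares N predecessors for each of N states, giving the O(TN²) bound above.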

Page 20: Fast Temporal State-Splitting for HMM Model Selection and Learning

Parameter Learning I

• Viterbi Training (≈ k-means for sequences)

Page 21: Fast Temporal State-Splitting for HMM Model Selection and Learning

Parameter Learning I

• Viterbi Training (≈ k-means for sequences)

Q*_{s+1} = argmax_Q P(O, Q | λ_s)   (Viterbi algorithm)

λ_{s+1} = argmax_λ P(O, Q*_{s+1} | λ)

Running time: O(TN²) per iteration

Models the posterior belief as a δ-function per timestep in the sequence. Performs well on data with easily distinguishable states.
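A sketch of this alternation, reusing the viterbi function above (the 1e-6 smoothing floor and the fixed iteration count are illustrative assumptions, not from the slides):

```python
def viterbi_training(A, B, pi, obs, n_iters=20):
    """Hard-assignment EM: decode the best path, then re-estimate
    parameters by counting along that path."""
    N, M = B.shape
    for _ in range(n_iters):
        path = viterbi(A, B, pi, obs)        # Q* = argmax_Q P(O,Q|lambda_s)
        A_new = np.full((N, N), 1e-6)        # small floor keeps rows normalizable
        B_new = np.full((N, M), 1e-6)
        for t in range(len(obs) - 1):
            A_new[path[t], path[t + 1]] += 1
        for t, o in enumerate(obs):
            B_new[path[t], o] += 1
        A = A_new / A_new.sum(axis=1, keepdims=True)   # M-step: normalized counts
        B = B_new / B_new.sum(axis=1, keepdims=True)
        pi = np.full(N, 1e-6)
        pi[path[0]] += 1
        pi /= pi.sum()
    return A, B, pi
```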

Page 22: Fast Temporal State-Splitting for HMM Model Selection and Learning

Parameter Learning II

• Baum-Welch (≈ GMM for sequences)

Iterate the following two steps until convergence:

1. Calculate the expected complete log-likelihood given λ_s
2. Obtain updated model parameters λ_{s+1} by maximizing this log-likelihood

Page 23: Fast Temporal State-Splitting for HMM Model Selection and Learning

Parameter Learning II

• Baum-Welch (≈ GMM for sequences)

Iterate the following two steps until convergence:

1. Calculate the expected complete log-likelihood given λ_s
2. Obtain updated model parameters λ_{s+1} by maximizing this log-likelihood

Obj(λ, λ_s) = E_Q[log P(O, Q | λ) | O, λ_s]

λ_{s+1} = argmax_λ Obj(λ, λ_s)

Running time: O(TN²) per iteration, but with a larger constant. Models the full posterior belief over hidden states per timestep. Effectively models sequences with overlapping states, at the cost of extra computation.
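For contrast with the hard assignments above, one scaled forward-backward E-step plus the closed-form M-step; a standard-textbook sketch for discrete observations, not the authors' implementation:

```python
def baum_welch_step(A, B, pi, obs):
    """One Baum-Welch (EM) iteration with per-step scaling."""
    N, T = len(pi), len(obs)
    alpha = np.zeros((T, N)); beta = np.zeros((T, N)); c = np.zeros(T)
    alpha[0] = pi * B[:, obs[0]]
    c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for t in range(1, T):                       # scaled forward pass
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        c[t] = alpha[t].sum(); alpha[t] /= c[t]
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):              # scaled backward pass
        beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / c[t + 1]
    gamma = alpha * beta                        # gamma[t, i] = P(q_t = i | O, lambda)
    xi = np.zeros((N, N))                       # expected transition counts
    for t in range(T - 1):
        xi += np.outer(alpha[t], B[:, obs[t + 1]] * beta[t + 1]) * A / c[t + 1]
    A_new = xi / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.zeros_like(B)
    for t in range(T):
        B_new[:, obs[t]] += gamma[t]
    B_new /= gamma.sum(axis=0)[:, None]
    return A_new, B_new, gamma[0]               # pi_new = posterior at t = 0
```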

Page 24: Fast Temporal State-Splitting for HMM Model Selection and Learning

HMM Model Selection

• Distinction between model search and the actual selection step:
– We can search the space of HMMs with different N using parameter learning, and perform selection using a criterion like BIC.

Page 25: Fast Temporal State-Splitting for HMM Model Selection and Learning

HMM Model Selection

• Distinction between model search and the actual selection step:
– We can search the space of HMMs with different N using parameter learning, and perform selection using a criterion like BIC.

Running time: O(Tn²) to compute the likelihood for BIC (sketched below)
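That likelihood-plus-penalty computation can be sketched as follows. The free-parameter count below is one standard choice for an N-state, M-symbol discrete HMM; the paper's exact penalty term may differ:

```python
def forward_log_likelihood(A, B, pi, obs):
    """Scaled forward pass: returns log P(O | lambda) in O(T N^2)."""
    alpha = pi * B[:, obs[0]]
    c = alpha.sum(); alpha /= c
    log_lik = np.log(c)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        c = alpha.sum(); alpha /= c      # rescale to avoid underflow
        log_lik += np.log(c)
    return log_lik

def bic_score(A, B, pi, obs):
    """BIC = log P(O | lambda) - 0.5 * (#free parameters) * log T.
    Assumed count: N(N-1) transition + N(M-1) emission + (N-1) prior."""
    N, M = B.shape
    T = len(obs)
    n_params = N * (N - 1) + N * (M - 1) + (N - 1)
    return forward_log_likelihood(A, B, pi, obs) - 0.5 * n_params * np.log(T)
```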

Page 26: Fast Temporal State-Splitting for HMM Model Selection and Learning

HMM Model Selection I

• for n = 1 … Nmax
  – Initialize n-state HMM randomly
  – Learn model parameters
  – Calculate BIC score
  – If best so far, store model
• If larger model not chosen, stop

Page 27: Fast Temporal State-Splitting for HMM Model Selection and Learning

HMM Model Selection I

• for n = 1 … Nmax
  – Initialize n-state HMM randomly
  – Learn model parameters
  – Calculate BIC score
  – If best so far, store model
• If larger model not chosen, stop

Running time: O(Tn²) per iteration

Drawback: local minima in parameter optimization
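Wiring the earlier sketches together, this selection loop might look like the following (the Dirichlet random initialization is an assumption; viterbi_training and bic_score are the sketches above):

```python
def select_model(obs, M, N_max, rng=np.random.default_rng(0)):
    """Model Selection I: grow n, fit, score with BIC,
    stop as soon as the larger model is not selected."""
    best_bic, best_model = -np.inf, None
    for n in range(1, N_max + 1):
        A = rng.dirichlet(np.ones(n), size=n)        # random row-stochastic init
        B = rng.dirichlet(np.ones(M), size=n)
        pi = rng.dirichlet(np.ones(n))
        A, B, pi = viterbi_training(A, B, pi, obs)   # learn model parameters
        score = bic_score(A, B, pi, obs)
        if score > best_bic:
            best_bic, best_model = score, (A, B, pi)
        else:
            break                                    # larger model not chosen: stop
    return best_model, best_bic
```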

Page 28: Fast Temporal State-Splitting for HMM Model Selection and Learning

HMM Model Selection II

• for n = 1 … Nmax
  – for i = 1 … NumTries
    • Initialize n-state HMM randomly
    • Learn model parameters
    • Calculate BIC score
    • If best so far, store model
  – If larger model not chosen, stop

Page 29: Fast Temporal State-Splitting for HMM Model Selection and Learning

HMM Model Selection II

• for n = 1 … Nmax
  – for i = 1 … NumTries
    • Initialize n-state HMM randomly
    • Learn model parameters
    • Calculate BIC score
    • If best so far, store model
  – If larger model not chosen, stop

Running time: O(NumTries × Tn²) per iteration

Evaluates NumTries candidate models for each n to overcome local minima. However, this is expensive, and it remains prone to local minima, especially for large N.

Page 30: Fast Temporal State-Splitting for HMM Model Selection and Learning

Idea: Binary state splits* to generate candidate models

• To split state s into s1 and s2,

– Create λ' such that λ'_{\s} = λ_{\s}

– Initialize λ'_{s1} and λ'_{s2} based on λ_s and on parameter constraints

* first proposed in Ostendorf and Singer, 1997

Notation:
λ_s : HMM parameters related to state s
λ_{\s} : HMM parameters not related to state s

Page 31: Fast Temporal State-Splitting for HMM Model Selection and Learning

Idea: Binary state splits* to generate candidate models

• To split state s into s1 and s2,

– Create λ' such that λ'_{\s} = λ_{\s}

– Initialize λ'_{s1} and λ'_{s2} based on λ_s and on parameter constraints

• This is an effective heuristic for avoiding local minima

* first proposed in Ostendorf and Singer, 1997

Notation:
λ_s : HMM parameters related to state s
λ_{\s} : HMM parameters not related to state s
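One plausible way to initialize such a split for a discrete HMM is sketched below: copy state s's parameters, divide incoming probability evenly, and perturb the two emission rows symmetrically so EM can break the tie. The paper only requires λ'_{\s} = λ_{\s}; the specific perturbation scheme here is an assumption:

```python
def split_state(A, B, pi, s, eps=0.01, rng=np.random.default_rng(0)):
    """Split state s into (s, N): copy its parameters, share incoming
    probability mass between the two copies, perturb the emissions."""
    N, M = B.shape
    A2 = np.zeros((N + 1, N + 1))
    A2[:N, :N] = A
    A2[N, :N] = A[s]               # new state's outgoing row copies s's
    A2[:, N] = A2[:, s] / 2        # incoming mass shared evenly ...
    A2[:, s] /= 2                  # ... between s and the new state
    noise = eps * rng.standard_normal(M)
    B2 = np.vstack([B, B[s] * (1 - noise)])
    B2[s] = B[s] * (1 + noise)     # symmetric perturbation breaks the tie
    B2[[s, N]] = np.clip(B2[[s, N]], 1e-8, None)
    B2[s] /= B2[s].sum()
    B2[N] /= B2[N].sum()
    pi2 = np.append(pi, pi[s] / 2)
    pi2[s] /= 2
    return A2, B2, pi2
```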

Page 32: Fast Temporal State-Splitting for HMM Model Selection and Learning

Overall algorithm

Page 33: Fast Temporal State-Splitting for HMM Model Selection and Learning

Overall algorithm

• Start with a small number of states
• Binary state splits* followed by EM (B.W. or V.T.)
• BIC on training set
• Stop when a bigger HMM is not selected

Page 34: Fast Temporal State-Splitting for HMM Model Selection and Learning

Overall algorithm

• Start with a small number of states
• Binary state splits followed by EM (B.W. or V.T.)
• BIC on training set
• Stop when a bigger HMM is not selected

What is 'efficient'? We want this loop to be at most O(TN²).

Page 35: Fast Temporal State-Splitting for HMM Model Selection and Learning

HMM Model Selection III

• Initialize n_0-state HMM randomly
• for n = n_0 … Nmax
  – Learn model parameters
  – for i = 1 … n
    • Split state i, learn model parameters
    • Calculate BIC score
    • If best so far, store model
  – If larger model not chosen, stop

Page 36: Fast Temporal State-Splitting for HMM Model Selection and Learning

HMM Model Selection III

• Initialize n_0-state HMM randomly
• for n = n_0 … Nmax
  – Learn model parameters
  – for i = 1 … n
    • Split state i, learn model parameters (O(Tn²))
    • Calculate BIC score
    • If best so far, store model
  – If larger model not chosen, stop

Page 37: Fast Temporal State-Splitting for HMM Model Selection and Learning

HMM Model Selection III

• Initialize n_0-state HMM randomly
• for n = n_0 … Nmax
  – Learn model parameters
  – for i = 1 … n
    • Split state i, learn model parameters (O(Tn²))
    • Calculate BIC score
    • If best so far, store model
  – If larger model not chosen, stop

Running time: O(Tn³) per iteration of the outer loop, since each of the n split candidates costs O(Tn²).

More effective at avoiding local minima than the previous approaches. However, it scales poorly because of the n³ term.

Page 38: Fast Temporal State-Splitting for HMM Model Selection and Learning

Fast Candidate Generation

Page 39: Fast Temporal State-Splitting for HMM Model Selection and Learning

Fast Candidate Generation

• Only consider timesteps owned by s in the Viterbi path
• Only allow parameters of the split states to vary
• Merge parameters and store as candidate

Page 40: Fast Temporal State-Splitting for HMM Model Selection and Learning

OptimizeSplitParams I: Split-State Viterbi Training (SSVT)

Iterate until convergence:

Page 41: Fast Temporal State-Splitting for HMM Model Selection and Learning

Constrained Viterbi

Splitting state s into s1, s2: we calculate the updated Viterbi path using a fast 'constrained' Viterbi algorithm that runs over only those timesteps owned by s in Q*, constraining each of them to belong to s1 or s2.
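A sketch of that constrained decode, under my reading of the slides: `path` is the current Viterbi path (integer array), `owned` a boolean mask of the timesteps assigned to s, and each maximal owned run is re-decoded over {s1, s2} with its entry and exit transitions to the fixed neighbours respected:

```python
def refine_split_path(logA, logB, logpi, obs, path, owned, s1, s2):
    """Constrained Viterbi over the timesteps owned by the split state.
    Only two candidate states per owned timestep, so the decode costs
    O(|T_s|) in total."""
    new_path = path.copy()
    T, states = len(obs), np.array([s1, s2])
    t = 0
    while t < T:
        if not owned[t]:
            t += 1
            continue
        a = t
        while t < T and owned[t]:
            t += 1
        b = t - 1                                   # maximal owned run [a, b]
        d = logB[states, obs[a]] + (logpi[states] if a == 0
                                    else logA[path[a - 1], states])
        back = np.zeros((b - a + 1, 2), dtype=int)
        for u in range(a + 1, b + 1):               # 2-state Viterbi inside the run
            scores = d[:, None] + logA[np.ix_(states, states)]
            back[u - a] = np.argmax(scores, axis=0)
            d = scores[back[u - a], [0, 1]] + logB[states, obs[u]]
        if b < T - 1:                               # respect the fixed exit transition
            d = d + logA[states, path[b + 1]]
        k = int(np.argmax(d))
        for u in range(b, a - 1, -1):               # backtrack within the run
            new_path[u] = states[k]
            k = back[u - a, k]
    return new_path
```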

Page 42: Fast Temporal State-Splitting for HMM Model Selection and Learning

[Figure: the Viterbi trellis with the chosen path highlighted]

The Viterbi path is denoted by the highlighted cells. Suppose we split state N into s1, s2.

Page 43: Fast Temporal State-Splitting for HMM Model Selection and Learning

[Figure: the same trellis with column δ_t(N) replaced by columns δ_t(s1) and δ_t(s2); entries at the timesteps owned by the split state are now unknown (marked ?)]

Page 44: Fast Temporal State-Splitting for HMM Model Selection and Learning

[Figure: constrained Viterbi recomputes only the owned timesteps, restricted to the δ_t(s1) and δ_t(s2) columns]

Page 45: Fast Temporal State-Splitting for HMM Model Selection and Learning

[Figure: the completed trellis: each owned timestep is reassigned to s1 or s2, and the rest of the path is unchanged]

Page 46: Fast Temporal State-Splitting for HMM Model Selection and Learning
Page 47: Fast Temporal State-Splitting for HMM Model Selection and Learning

OptimizeSplitParams I: Split-State Viterbi Training (SSVT)

Iterate until convergence: a constrained Viterbi step to update the owned portion of the path, then a maximization step over the split states' parameters.

Running time: O(|T_s| n) per iteration

When splitting state s, this assumes that the rest of the HMM parameters (λ_{\s}) and the rest of the Viterbi path (Q*_{\T_s}) are both held fixed.

Page 48: Fast Temporal State-Splitting for HMM Model Selection and Learning

Fast approximate BIC

Compute once for the base model: O(Tn²)

Update optimistically* for each candidate model: O(|T_s|)

* first proposed in Stolcke and Omohundro, 1994
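In sketch form, assuming the base model's log-likelihood is cached and the constrained optimization reports the likelihood change on the owned timesteps (delta_owned); the parameter count matches the earlier bic_score sketch:

```python
def approx_bic_after_split(base_loglik, delta_owned, n, M, T):
    """Optimistic O(|T_s|) BIC update for an (n+1)-state candidate:
    only the re-scored owned timesteps change the likelihood term,
    and the penalty is recomputed for one extra state."""
    n2 = n + 1
    n_params = n2 * (n2 - 1) + n2 * (M - 1) + (n2 - 1)
    return (base_loglik + delta_owned) - 0.5 * n_params * np.log(T)
```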

Page 49: Fast Temporal State-Splitting for HMM Model Selection and Learning

HMM Model Selection IV

• Initialize n_0-state HMM randomly
• for n = n_0 … Nmax
  – Learn model parameters
  – for i = 1 … n
    • Split state i, optimize by constrained EM
    • Calculate approximate BIC score
    • If best so far, store model
  – If larger model not chosen, stop

Page 50: Fast Temporal State-Splitting for HMM Model Selection and Learning

HMM Model Selection IV

• Initialize n_0-state HMM randomly
• for n = n_0 … Nmax
  – Learn model parameters
  – for i = 1 … n
    • Split state i, optimize by constrained EM (O(|T_s| n))
    • Calculate approximate BIC score
    • If best so far, store model
  – If larger model not chosen, stop

Running time: O(Tn²) per iteration of the outer loop! The constrained optimizations over all n candidates total O(Tn), since Σ_s |T_s| = T.

Page 51: Fast Temporal State-Splitting for HMM Model Selection and Learning

Algorithms

SOFT: Baum-Welch / Constrained Baum-Welch (slower, more accurate)

HARD: Viterbi Training / Constrained Viterbi Training (faster, coarser)

Page 52: Fast Temporal State-Splitting for HMM Model Selection and Learning

Results

1. Learning fixed-size models

2. Learning variable-sized models

Baseline: fixed-size HMM trained with Baum-Welch, five random restarts

Page 53: Fast Temporal State-Splitting for HMM Model Selection and Learning

Learning fixed-size models

Page 54: Fast Temporal State-Splitting for HMM Model Selection and Learning
Page 55: Fast Temporal State-Splitting for HMM Model Selection and Learning

Fixed-size experiments table, continued

Page 56: Fast Temporal State-Splitting for HMM Model Selection and Learning

Learning fixed-size models

Page 57: Fast Temporal State-Splitting for HMM Model Selection and Learning

Learning fixed-size models

Page 58: Fast Temporal State-Splitting for HMM Model Selection and Learning

Learning variable-size models

Page 59: Fast Temporal State-Splitting for HMM Model Selection and Learning

Learning variable-size models

Page 60: Fast Temporal State-Splitting for HMM Model Selection and Learning

Learning variable-size models

Page 61: Fast Temporal State-Splitting for HMM Model Selection and Learning

Learning variable-size models

Page 62: Fast Temporal State-Splitting for HMM Model Selection and Learning

Learning variable-size models

Page 63: Fast Temporal State-Splitting for HMM Model Selection and Learning

Conclusion

• Pros:
  – Simple and efficient method for HMM model selection
  – Also learns better fixed-size models (often faster than a single run of Baum-Welch)
  – Different variants suitable for different problems

Page 64: Fast Temporal State-Splitting for HMM Model Selection and Learning

Conclusion

• Cons:
  – Greedy heuristic: no performance guarantees
  – Binary splits are also prone to local minima
  – Why binary splits?
  – Works less well on discrete-valued data
    • Greater error from Viterbi path assumptions

• Pros:
  – Simple and efficient method for HMM model selection
  – Also learns better fixed-size models (often faster than a single run of Baum-Welch)
  – Different variants suitable for different problems

Page 65: Fast Temporal State-Splitting for HMM Model Selection and Learning

Thank you

Page 66: Fast Temporal State-Splitting for HMM Model Selection and Learning

Appendix

Page 67: Fast Temporal State-Splitting for HMM Model Selection and Learning

Viterbi Algorithm

Page 68: Fast Temporal State-Splitting for HMM Model Selection and Learning

Constrained Viterbi

Page 69: Fast Temporal State-Splitting for HMM Model Selection and Learning

EM for HMMs

Page 70: Fast Temporal State-Splitting for HMM Model Selection and Learning
Page 71: Fast Temporal State-Splitting for HMM Model Selection and Learning

More definitions

Page 72: Fast Temporal State-Splitting for HMM Model Selection and Learning
Page 73: Fast Temporal State-Splitting for HMM Model Selection and Learning

OptimizeSplitParams II: Constrained Baum-Welch

Iterate until convergence:

Page 74: Fast Temporal State-Splitting for HMM Model Selection and Learning

Penalized BIC