dynamic bayesian network

Dynamic Bayesian Network

Fuzzy SystemsLifelog management

• Introduction

• Definition

• Representation

• Inference

• Learning

• Comparison

• Summary

Outline

Brief Review of Bayesian Networks• Graphical representations of joint distributions:

)/( ABP

A B

)(AP

)(AP ),/( CABPB

D

C

A

)(CP

)/( BDP

Static world, each random variable has a single fixed value.

)()()/()/(

BPAPABPBAP

Mathematical formula used for calculating conditional probabilities. Develop by the mathematician and theologian Thomas Bayes (published in 1763)

3/29

Introduction• Dynamic system

– Sequential data modeling (part of speech)– Time series modeling (activity recognition)

• Classic approaches – Linear models: ARIMA (autoregressive integrated moving average),

ARMAX (autoregressive moving average exogenous variables model)

– Nonlinear models: neural networks, decision trees– Problems

• Prediction of the future based on only a finite window• Difficult to incorporate prior knowledge• Difficult to deal with multi-dimensional inputs and/or outputs

• Recent approaches– Hidden Markov models (HMMs): discrete random variable– Kalman filter models (KFMs): continuous state variables– Dynamic Bayesian networks (DBNs)

4/29

Motivation

Time = t

Mt

Xt

Ot

Transportation Mode: Walking, Running, Car, Bus

True velocity and location

Observed location

Need conditional probability distributions

e.g. a distribution on (velocity, location)

given the transportation mode

Prior knowledge or learned from data

Time = t+1

Mt+1

Xt+1

Ot+1

Given a sequence of observations (Ot), find the most likely Mt’s that explain it.

Or could provide a probability distribution on the possible Mt’s.

5/29

• Introduction

• Definition

• Representation

• Inference

• Learning

• Comparison

• Summary

Outline

Dynamic Bayesian Networks• BNs consisting of a structure that repeats an indefinite (or dynamic)

number of times– Time-invariant: the term ‘dynamic’ means that we are modeling

a dynamic model, not that the networks change over time

• General form of HMMs and KFLs by representing the hidden and observed state in terms of state variables of complex interdependencies

D

A

C

B

D

A

C

B

D

A

C

B

frame i frame i+1frame i-1

7/29

Formal Definition• Defined as

– : a directed, acyclic graph of starting nodes (initial probability distribution)

– : a directed, acyclic graph of transition nodes (transition probabilities between time slices)

– : starting vectors of observable as well as hidden random variable

– : transition matrices regarding observable as well as hidden random variables

8/29

• Introduction

• Definition

• Representation

• Inference

• Learning

• Comparison

• Summary

Outline

Representation (1): Problem• Target: Is it raining today?

• Necessity to specify an unbounded number of conditional probability table, one for each variable in each slice

• Each one might involve an unbounded number of parents

}{}{

tt

tt

UERX

next step: specify dependencies among the variables.

10/29

Representation (2): Solution• Assume that change in the world state are caused by a stationary

process (unmoving process over time)

• Use Markov assumption - The current state depends on only in a finite history of previous states.

Using the first-order Markov process:

• In addition to restricting the parents of the state variable Xt, we must restrict the parents of the evidence variable Et

))(/( tt UParentUP is the same for all t

)/()/( 11:0 tttt XXPXXP Transition Model

)/(),/( 1:0:0 ttttt XEPEXEP Sensor Model

11/29

Representation: Extension• There are two possible fixes if the approximation is too inaccurate:

– Increasing the order of the Markov process model. For example, adding as a parent of , which might give slightly more accurate predictions

– Increasing the set of state variables. For example, adding to allow to incorporate historical records of rainy seasons, or adding , and to allow to use a physical model of rainy conditions

• Bigram

• Trigram

Wi+1WiWi-1 . . . . . .

Wi+1WiWi-1 . . . . . .

2tRain tRain

tSeason

teTemperatur tHumiditytPressure

12/29

• Introduction

• Definition

• Representation

• Inference

• Learning

• Comparison

• Summary

Outline

Inference: Overview• To infer the hidden states X given the observations Y1:t

– Extend HMM and KFM’s / call BN inference algorithms as subroutines – NP-hard problem

• Inference tasks– Filtering(monitoring): recursively estimate the belief state using Bayes’ rule

• Predict: computing P(Xt| y1:t-1 )• Updating: computing P(Xt | y1:t )• Throw away the old belief state once we have computed the

prediction(“rollup”)– Smoothing: estimate the state of the past, given all the evidence up to the

current time• Fixed-lag smoothing(hindsight): computing P(Xt-1 | y1:t ) where l > 0 is the

lag– Prediction: predict the future

• Lookahead: computing P(Xt+h | y1:t) where h > 0 is how far we want to look ahead

– Viterbi decoding: compute the most likely sequence of hidden states given the data

• MPE(abduction): x*1:t = argmax P(x1:t | y1:t )

14/29

Inference: Comparison• Filtering: r = t

• Smoothing: r > t

• Prediction: r < t

• Viterbi: MPE

15/29

Inference: Filtering• Compute the belief state - the posterior distribution over the current

state, given all evidence to date

• Filtering is what a rational agent needs to do in order to keep track of the current state so that the rational decisions can be made

• Given the results of filtering up to time t, one can easily compute the result for t+1 from the new evidence

)/( :1 tt eXP

))/(()/( 1:1,11:11 ttttt eXPefeXP

)/()/(

)/()/(

)/(

:1111

:11:1,11

1,:11

tttt

ttttt

ttt

eXPXeP

eXPeXeP

eeXP

(dividing up the evidence)

(for some function f)

(using Bayes’ Theorem)

(by the Marcov propertyof evidence)

α is a normalizing constant used to make probabilities sum up to 1

16/29

Inference: Filtering• Illustration for two steps in the Umbrella example:

• On day 1, the umbrella appears so U1=true

– The prediction from t=0 to t=1 is

and updating it with the evidence for t=1 gives

• On day 2, the umbrella appears so U2=true– The prediction from t=1 to t=2 is

and updating it with the evidence for t=2 gives

0

)()/()( 0011r

rPrRPRP

)()/()/( 11111 RPRuPuRP

1

)/()/()/( 111212r

urPrRPuRP

)/()/(),/( 1222212 uRPRuPuuRP

17/29

Inference: Smoothing• Compute the posterior distribution over the past state, given all

evidence up to the present

• Hindsight provides a better estimate of the state than was available at the time, because it incorporates more evidence

)/( :1 tk eXP for some k such that 0 ≤ k < t.

18/29

Inference: Prediction• Compute the posterior distribution over the future state, given all

evidence to date

• The task of prediction can be seen simply as filtering without the addition of new evidence

)/( :1 tkt eXP for some k>0

19/29

Inference: Most Likely Explanation (MLE)• Compute the sequence of states that is most likely to have

generated a given sequence of observation

• Algorithms for this task are useful in many applications, including speech recognition

• There exist a recursive relationship between the most likely paths to each state Xt+1 and the most likely paths to each state Xt. This relationship can be write as an equation connecting the probabilities of the paths:

)/(maxarg :1:1:1 ttx eXPt

))/,...,()/(()/(

)/,,...,(

:111...

111

1:111

maxmax

max

11

...1

ttXX

ttX

tt

tttXX

exxPxXPXeP

eXxxP

tt

t

20/29

Inference: Algorithms• Exact Inference algorithms

– Forwards-backwards smoothing algorithm (on any discrete-state DBN)

– The frontier algorithm (sweep a Markov blanket, the frontier set F, across the DBN, first forwards and then backwards)

– The interface algorithm (use only the set of nodes with outgoing arcs to the next time slice to d-separate the past from the future)

– Kalman filtering and smoothing

• Approximate algorithms:– The Boyen-Koller (BK) algorithm (approximate the joint distribution

over the interface as a product of marginals)– Factored frontier (FF) algorithm / Loopy propagation algorithm (LBP)– Kalman filtering and smoother– Stochastic sampling algorithm:

• Importance sampling or MCMC (offline inference)• Particle filtering (PF) (online)

21/29

• Introduction

• Definition

• Representation

• Inference

• Learning

• Comparison

• Summary

Outline

Learning (1)• The techniques for learning DBN are mostly straightforward extensions of

the techniques for learning BNs• Parameter learning

– The transition model P(Xt | Xt-1) / The observation model P(Yt | Xt) – Offline learning

• Parameters must be tied across time-slices• The initial state of the dynamic system can be learned independently

of the transition matrix– Online learning

• Add the parameters to the state space and then do online inference (filtering)

– The usual criterion is maximum-likelihood(ML)

• The goal of parameter learning is to compute– θ*

ML = argmaxθP( Y| θ) = argmaxθlog P( Y| θ) – θ*

MAP = argmaxθlog P( Y| θ) + logP(θ)– Two standard approaches: gradient ascent and EM(Expectation

Maximization)

23/29

Learning (2)• Structure learning

– The intra-slice connectivity must be a DAG– Learning the inter-slice connectivity is equivalent to the variable

selection problem, since for each node in slice t, we must choose its parents from slice t-1.

– Learning for DBNs reduces to feature selection if we assume the intra-slice connections are fixed

24/29

• Introduction

• Definition

• Representation

• Inference

• Learning

• Comparison

• Summary

Outline

Comparison (HMM: Hidden Markov Model)

• Structure– One discrete hidden node (X: hidden variables)– One discrete or continuous observed node per time slice (Y: observations)

• Parameters– The initial state distribution P( X1 ) – The transition model P( Xt | Xt-1 )– The observation model P( Yt | Xt )

• Features– A discrete state variable with arbitrary dynamics and arbitrary

measurements– Structures and parameters remain same over time

X1

Y1

X2

Y2

X3

Y3

X4

Y4

26/29

Comparison with HMMs• HMMs

1 2 3

obs obs obs

.3.7 .8

.21

1 2 3qiqi-1

1 .7 .3 0

2 0 .8 .23 0 0 1

P(qi|qi-1)

q=1

q=2

q=3

P(obsi | qi)

frame i

Qi

obsi

. . .. . .Qi-1

obsi-1

Qi+1

obsi+1

frame i+1frame i-1

= state= allowed transition

= variable= allowed dependency

• DBNs

27/29

Comparison (KFL: Kalman Filter Model)• KFL has the same topology as an HMM

• All the nodes are assumed to have linear-Gaussian distributions

– x(t+1) = F*x(t) + w(t), • w ~ N(0, Q) : process noise, x(0) ~ N(X(0), V(0))

– y(t) = H*x(t) + v(t), • v ~ N(0, R) : measurement noise

• Features– A continuous state variable with linear-Gaussian dynamics and

measurements– Also known as Linear Dynamic Systems(LDSs)

• A partially observed stochastic process • With linear dynamics and linear observations: f( a + b) = f(a) +

f(b)• Both subject to Gaussian noise

X1

Y1

X2

Y2

28/29

Comparison with HMM and KFM• DBN represents the hidden state in terms of a set of random

variables– HMM’s state space consists of a single random variable

• DBN allows arbitrary CPDs– KFM requires all the CPDs to be linear-Gaussian

• DBN allows much more general graph structures– HMMs and KFMs have a restricted topology

• DBN generalizes HMM and KFM (more expressive power)

29/29

Summary• DBN: a Bayesian network with a temporal probability model

• Complexity in DBNs– Inference– Structure learning

• Comparison with other methods– HMMs: discrete variables– KFMs: continuous variables

• Discussion– Why to use DBNs instead of HMMs or KFMs?– Why to use DBNs instead of BNs?

30/29

dynamic bayesian network

Documents

world state

set of state variables

state variable xt

hidden random variables

current state

markov process model

dynamic model

order markov process