
Simulation of Discrete Event Systems

Univ.-Prof. Dr.-Ing. Dipl.-Wirt.-Ing. Christopher M. Schlick Dr.-Ing. Sven Tackenberg

Chair and Institute of Industrial Engineering and Ergonomics RWTH Aachen University

Bergdriesch 27 52062 Aachen

phone: 0241 80 99 440 email: [email protected]

Unit 10 and 11 Bayesian Networks and Dynamic Bayesian Networks

Winter 2016/2017


Contents

1. Introduction

2. Background - Bayes theorem and rules of probability

- Maximum a posteriori hypothesis

- Bayesian methodology to calculate posterior distributions

3. Bayesian networks - Approach

- Definition

- Inference in simple Bayesian networks

4. Introduction to Dynamic Bayesian networks

5. Formalism of Dynamic Bayesian Networks


[Figure: classification tree of model types along the dichotomies dynamic/static, time-invariant/time-varying, nonlinear/linear, discrete/continuous states, event-driven/time-driven, stochastic/deterministic, continuous-time/discrete-time; the highlighted branches mark the focus of the lecture and exercise.]


1. Introduction



What are Bayesian networks helpful for?

[Figure: cycle Observe → Decide → Learn]

Experts are persons who need specific expertise to carry out their tasks.

The usual procedure of an expert: decisions are made based on observations. Decisions lead to actions. An action causes good or bad results. The results lead to learning by the expert.

Experts often have to make decisions based on incomplete and conflicting information.

The best decision is, in general, the one that minimizes the risk!

Bayesian networks can be used to build up such an expert system.


Motivation

! Why is it worth considering the Bayes methodology?
The Bayes theorem and the associated rules of probability form a consistent and powerful basis for algorithms that manipulate probability mass functions and probability density functions directly.

! What is the concept of the Bayes methodology?
The basic assumption is that the state variables can be represented by probability mass functions (discrete variables) or probability density functions (continuous variables).

! What is the Bayes methodology used for?
The Bayes methodology is a statistical approach to modeling and simulating discrete-event systems under uncertainty. Based on the Bayes theorem, conclusions can be drawn to identify optimal decisions.


Think about…

… if you see that there are clouds, what is the probability that it will soon rain? → p(rain | clouds)

… if you know that it is raining, because you hear it patter on the roof, what is the probability that there are clouds? → p(clouds | rain)

? Is p(rain | clouds) equal to p(clouds | rain)?


Repetition of relevant definitions and formulas of probability theory

Probability of event A: P(A)

Probability of event A under the condition of event B: P(A|B)

Bayes' formula (theoretical basis of Bayesian networks):

$$P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}$$

This formula enables the conversion of the probability of event A under the condition of event B into the probability of event B under the condition of event A.

Formula of the total probability:

$$P(A) = \sum_i P(A|B_i)\,P(B_i)$$

The absolute probability of A can be calculated from the conditional probabilities of A.


Rules of probability

1. Product rule: The joint probability of A and B is:

$$P(A,B) = P(B|A)\,P(A) = P(A|B)\,P(B)$$

2. Independence: The random variables A and B are independent if the joint probability distribution can be factorized as:

$$P(A,B) = P(A)\,P(B)$$

3. Sum rule: If the hypotheses B_1, ..., B_n are mutually exclusive and therefore form a partition of the set B, the marginal likelihood of the data is:

$$P(A) = \sum_i P(A,B_i) = \sum_i P(A|B_i)\,P(B_i)$$

Hence, the Bayes theorem can be expanded:

$$P(B_i|A) = \frac{P(A|B_i)\,P(B_i)}{\sum_{i'} P(A|B_{i'})\,P(B_{i'})}$$

Note: in connection with the Bayesian methodology, the random variables A and B are named D and h.
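These rules can be checked numerically with a minimal Python sketch; the distribution values below are invented for illustration and are not part of the lecture material:

```python
# Hypothetical prior over two exclusive hypotheses B1, B2 and a likelihood for
# the event A = true; values are illustrative only.
P_B = {"B1": 0.3, "B2": 0.7}                      # prior P(B_i)
P_A_given_B = {"B1": 0.9, "B2": 0.2}              # P(A | B_i)

# Sum rule: marginal P(A) = sum_i P(A | B_i) P(B_i)
P_A = sum(P_A_given_B[b] * P_B[b] for b in P_B)   # 0.9*0.3 + 0.2*0.7 = 0.41

# Expanded Bayes theorem: posterior P(B_i | A)
posterior = {b: P_A_given_B[b] * P_B[b] / P_A for b in P_B}
print(P_A)        # 0.41
print(posterior)  # {'B1': ~0.659, 'B2': ~0.341}
```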


Causal networks – Introduction

Causal networks are a precursor of Bayesian networks!

A causal network is a formalism to describe causal dependence within a given situation. It consists of:

- a set of variables; each variable can have different (finitely or infinitely many) states,
- a set of directed arcs.

Each variable must be in one of its defined states, but the current state may be unknown!

A → B: a state of variable A directly causes the occurrence of states of variable B.


Example of a causal network

[Figure: causal network with arcs W → C, C → K, C → M, D → K]

W (Winter): {true, false} | C (Slippery roads): {true, false} | D (Klaus drank alcohol): {true, false} | K (Klaus has an accident): {true, false} | M (Mike has an accident): {true, false}

- Season of the year: variable W | states {true, false} | has a significant impact on the condition of the street.
- Condition of the street: variable C | states {true, false} | describes the slipperiness of the street and has a significant impact on the risk of an accident of Klaus (K) or Mike (M).
- Occurrence of an accident: variables K and M | states {true, false} | describe the occurrence of an accident of Klaus (K) or Mike (M).
- Condition of Klaus: variable D | states {true, false} | describes whether Klaus has drunk alcohol.


Dependency and conditional dependency

Two variables A and B of a causal network are called dependent if the probabilities of the states of variable A depend on the state of variable B and vice versa:

$$P(A,B) \neq P(A)\,P(B)$$

Two variables A and B of a causal network are called conditionally dependent if A and B are dependent for specific states Z and independent for all other states $\bar{Z}$:

$$P(A,B\,|\,Z) \neq P(A|Z)\,P(B|Z) \qquad\text{and}\qquad P(A,B\,|\,\bar{Z}) = P(A|\bar{Z})\,P(B|\bar{Z})$$


Dependencies (1/2)

Serial dependency: W → C → M

W (Winter): {true, false} | C (Slippery roads): {true, false} | M (Mike has an accident): {true, false}

Variables W and M are independent if the condition of the road C is known: if the condition of the street is known, the season has no impact on the probability of an accident.

Branch (diverging connection): K ← C → M

C (Slippery roads): {true, false} | K (Klaus has an accident): {true, false} | M (Mike has an accident): {true, false}

Variables K and M are independent if the condition of the road C is known. If K has an accident and the condition of the street is unknown, the probability that the street is slippery increases; consequently, the probability of an accident of M increases as well.


Dependencies (2/2)

Merge (converging connection): D → K ← C

D (Klaus drank alcohol): {true, false} | C (Slippery roads): {true, false} | K (Klaus has an accident): {true, false}

Variables D and C become dependent on each other if the state of variable K is known: if Klaus (K) has an accident and the street is not slippery, then the probability that he has drunk alcohol increases.


2. Background



Example: Diagnosis of scarce faults

An X-ray test of a track is performed:

- If the object has hairline cracks, the measurement result is "hairline crack: true" in 98% of the cases.
- If the object has no hairline cracks, the measurement result is "hairline crack: false" in 97% of the cases.

Hairline cracks occur in only 0.8% of the produced tracks.

? Calculate the probability that, when a measurement indicates hairline cracks, the track really has cracks.


Bayes theorem

To answer this question, the Bayes theorem is used. It goes back to the seminal work of the English reverend Thomas Bayes in the 18th century on games of chance.

Formula (the basis of what is known as the Bayesian methodology):

$$P(h|D) = \frac{P(D|h)\,P(h)}{P(D)}$$

P(h): a priori probability of a hypothesis h (or a model), representing the initial degree of belief
P(D): a priori probability of the data D (observations)
P(h|D): a posteriori probability of hypothesis h under the condition of given data D
P(D|h): probability of data D under the condition of hypothesis h

Two meanings of probability:
- Frequencies of outcomes in random experiments, e.g. repeated rolling of a die.
- Degrees of belief in propositions that do not necessarily involve random experiments, e.g. the probability that a certain production machine will fail, given the evidence of a poor surface quality of the workpiece.


Example: Diagnosis of scarce faults

- Measurement result "hairline crack: true" in 98% of cases (if cracks are present)
- Measurement result "hairline crack: false" in 97% of cases (if no cracks are present)
- Hairline cracks occur in only 0.8% of the produced tracks

? Calculate the probability that, when a measurement indicates hairline cracks, the track really has cracks.

$$P(h|D) = \frac{P(D|h)\,P(h)}{P(D)}$$

with P(h), P(D), P(h|D) and P(D|h) as defined on the previous slide. Writing "crack" for the hypothesis that the track has a crack and ⨁ for a measurement indicating a crack (the data):

$$P(\text{crack}\,|\,\oplus) = \frac{P(\oplus\,|\,\text{crack})\,P(\text{crack})}{P(\oplus)}$$

where P(⨁) is the probability that the data shows a crack.


Example: Diagnosis of scarce faults

$$P(\text{crack}\,|\,\oplus) = \frac{P(\oplus\,|\,\text{crack})\,P(\text{crack})}{P(\oplus)}$$

Given (⊣ denotes "not", ⨁ a positive and ⊝ a negative test result):

P(crack) = 0.008, P(⨁ | crack) = 0.98, P(⊝ | ⊣crack) = 0.97
P(⊣crack) = 0.992, P(⊝ | crack) = 0.02, P(⨁ | ⊣crack) = 0.03

Auxiliary calculation (probability that the data shows a crack):

$$P(\oplus) = P(\oplus\,|\,\text{crack})\,P(\text{crack}) + P(\oplus\,|\,\dashv\text{crack})\,P(\dashv\text{crack}) = 0.98 \cdot 0.008 + 0.03 \cdot 0.992 = 0.0376$$

$$P(\text{crack}\,|\,\oplus) = \frac{0.98 \cdot 0.008}{0.0376} \approx 0.21$$

The probability that a positively tested track really has hairline cracks is only 21%!
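A short Python sketch (not part of the original slides) reproduces the calculation:

```python
# Base rate and test characteristics from the slide.
p_crack = 0.008           # P(crack)
p_pos_given_crack = 0.98  # P(+ | crack), sensitivity
p_pos_given_ok = 0.03     # P(+ | no crack), false-positive rate

# Total probability of a positive test result.
p_pos = p_pos_given_crack * p_crack + p_pos_given_ok * (1 - p_crack)

# Bayes theorem: posterior probability of a real crack given a positive test.
p_crack_given_pos = p_pos_given_crack * p_crack / p_pos
print(round(p_pos, 4), round(p_crack_given_pos, 2))  # 0.0376 0.21
```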


Bayesian methodology

- The choice of P(h) and P(D|h) represents the a priori knowledge and assumptions of the modeler concerning the application domain.
- The hypotheses are regarded as functions of the observations, which can be adapted iteratively to the state of knowledge of an observer.

The objective function for Bayesian parameter estimation is the most likely hypothesis given the observations. The hypothesis h_MAP representing the maximum of the probability mass is called the maximum a posteriori hypothesis:

$$h_{MAP} = \arg\max_{h \in H} P(h|D) = \arg\max_{h \in H} \frac{P(D|h)\,P(h)}{P(D)} = \arg\max_{h \in H} P(D|h)\,P(h)$$

- If all hypotheses have the same a priori probability, the equation above can be simplified further and only the term P(D|h) has to be maximized. Each hypothesis maximizing P(D|h) is called the maximum likelihood hypothesis (h_ML):

$$h_{ML} = \arg\max_{h \in H} P(D|h)$$


Example of Bayesian methodology (I)

Workpieces of only one type are stored in a pallet cage:
- a produced workpiece is faultless (index g for "good"), or
- a produced workpiece is defective (index b for "bad").

Due to a new manufacturing process, the prior probability distribution of the frequency of faultless and defective workpieces is unknown.

? Calculate, step by step, the posterior distribution of the proportion of faultless workpieces on the basis of the Bayesian methodology.

The input data are a sample of N workpieces, randomly drawn from the line! The workpieces in the sample are tested independently!


Example of Bayesian methodology (II)

The probability of observing exactly n_g faultless workpieces in the sample follows the binomial distribution.

The proportions to be estimated under hypothesis h on the basis of the sample of size N are:

$$h = (\hat{p}_g,\, \hat{p}_b) = (\hat{p}_g,\, 1 - \hat{p}_g)$$

where $\hat{p}_g$ is the estimated proportion of "good" workpieces.

The properties of the sample can be described sufficiently by the following aggregated quantities: n_g, the frequency of "good" workpieces after N tests, and n_b = N − n_g, the frequency of "bad" workpieces.


Binomial distribution

Probably the most important discrete distribution is the binomial distribution. Consider an experiment with n trials:
- Each trial can result in one of two states {a, b}.
- The probability of a (and hence of b) is the same in each trial.
- The number of occurrences of {a} is X.

Probability that a specific number of {a} appears:

$$p(X = x) = \binom{n}{x}\, p^x\, (1-p)^{n-x}$$

The distribution is defined by n and p:
- Mean value: $\mu = n \cdot p$
- Variance: $\sigma^2 = n \cdot p\,(1-p)$

Example: If an accident occurs, every tenth person of the population is able to provide initial medical treatment. How large is the probability that there are 0, 1, …, up to 10 persons out of a total of 10 who are able to provide initial medical treatment? (E.g. p(X = 1): exactly one person is able to provide treatment.)

$$p(X = 0) = \binom{10}{0}\, 0.1^0\, (1-0.1)^{10-0} = 0.3487$$

$$p(X = 1) = \binom{10}{1}\, 0.1^1\, (1-0.1)^{10-1} = 0.3874$$
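A small sketch (standard library only) that reproduces these values:

```python
from math import comb

def binom_pmf(x: int, n: int, p: float) -> float:
    """Binomial probability p(X = x) for n trials with success probability p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# First-aid example: n = 10 persons, p = 0.1 that a person can give treatment.
for x in range(3):
    print(x, round(binom_pmf(x, 10, 0.1), 4))
# 0 0.3487
# 1 0.3874
# 2 0.1937
```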


Example of Bayesian methodology (III)

The binomial distribution represents the generative model of the data P(D|h) under hypothesis h:

$$P(n_g\,|\,p_g, N) = \frac{N!}{(N - n_g)!\; n_g!}\; p_g^{n_g}\,(1 - p_g)^{N - n_g}$$

This is the probability of observing exactly n_g faultless workpieces in the sample.

Remember the Bayesian methodology: the objective function for Bayesian parameter estimation is the most likely hypothesis given the observations:

$$h_{MAP} = \arg\max_{h \in H} \frac{P(D|h)\,P(h)}{P(D)} = \arg\max_{h \in H} P(D|h)\,P(h)$$

Due to the new manufacturing process, there is no knowledge regarding the proportion of faultless workpieces. The prior probability of the corresponding hypothesis h is therefore described by a uniform distribution for the parameter p_g:

$$f_p(p_g\,|\,N = 0, n_g = 0) = \frac{\Gamma(2)}{\Gamma(1)\,\Gamma(1)}\; p_g^0\,(1 - p_g)^0 = 1$$


Example of Bayesian methodology (IV)

With each measurement, the initial uniform distribution is transformed into a Beta-type posterior distribution for the independent parameter p_g. Following the Bayesian methodology, the a posteriori probability density is:

$$f_p(p_g\,|\,N, n_g) = \frac{\Gamma(N+2)}{\Gamma(n_g+1)\,\Gamma(N-n_g+1)}\; p_g^{n_g}\,(1-p_g)^{N-n_g} \;\sim\; \mathrm{Beta}(n_g+1,\, N-n_g+1)$$

Incremental measuring of the workpieces drawn from the production line leads to the samples:
- after N = 5 measurements, n_g = 3 workpieces turned out to be faultless
- after N = 10 measurements, n_g = 6 workpieces turned out to be faultless
- after N = 15 measurements, n_g = 9 workpieces turned out to be faultless


Example of Bayesian methodology (V)

[Figure: posterior densities f_p(p_g) over p_g for the uniform prior and the samples (n_g = 3, N = 5), (n_g = 6, N = 10), (n_g = 9, N = 15); the densities become increasingly peaked around the MAP estimates $\hat{p}_g^{MAP}(n_g, N)$ = 0.6.]
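The posterior curves in the figure can be reproduced with a short sketch (scipy is assumed to be available; not part of the original slides):

```python
import numpy as np
from scipy.stats import beta

# (n_g, N) pairs from the slide; the uniform prior corresponds to Beta(1, 1).
samples = [(3, 5), (6, 10), (9, 15)]

for n_g, N in samples:
    posterior = beta(n_g + 1, N - n_g + 1)  # Beta(n_g+1, N-n_g+1) posterior
    map_estimate = n_g / N                  # mode of the Beta posterior
    print(f"N={N:2d}, n_g={n_g}: MAP = {map_estimate:.2f}, "
          f"density at MAP = {posterior.pdf(map_estimate):.2f}")
```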


Example of Bayesian methodology (V, continued)

Conversely, when using the maximum likelihood estimator instead of the maximum a posteriori estimator, we have the point estimate:

$$\hat{p}_g^{ML} = \arg\max_{p_g} P(n_g, p_g) = \arg\max_{p_g} \frac{N!}{(N - n_g)!\; n_g!}\; p_g^{n_g}\,(1 - p_g)^{N - n_g}$$

For instance, the maximum likelihood value for the first sample that had been drawn from the line (N = 5, n_g = 3) is:

$$\hat{p}_g^{ML} = \arg\max_{p_g}\; p_g^3\,(1 - p_g)^2 \;\Leftrightarrow\; \frac{d}{dp_g}\, p_g^3\,(1 - p_g)^2 = 0 \;\Leftrightarrow\; 3 p_g^2\,(1 - p_g)^2 + 2 p_g^3\,(1 - p_g)(-1) = 0 \;\Leftrightarrow\; \hat{p}_g^{ML} = \frac{3}{5}$$

Obviously, the maximum likelihood estimate is equivalent to the relative frequency of the faultless workpieces in the tested sample!
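A quick numerical check of this derivation (a grid search over p_g, illustrative only):

```python
import numpy as np

# Likelihood of the first sample (N = 5, n_g = 3), up to the constant binomial factor.
p = np.linspace(0.001, 0.999, 999)
likelihood = p**3 * (1 - p)**2

print(p[np.argmax(likelihood)])  # ~0.6, i.e. the relative frequency n_g / N
```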


3. Bayesian Networks



Example of a Bayesian Network (1/9)

[Figure: network with arcs Winter → Sprinkler, Winter → Rain, Sprinkler → Wet Grass, Rain → Wet Grass, Rain → Wet Road]

Θ_Winter|∅ is:

    P(Winter = true) = 0.6,  P(Winter = false) = 0.4


Example of a Bayesian Network (2/9)

(¬ represents "false")

Θ_Rain|Winter is:

    Winter = true:   P(Rain) = 0.8,  P(¬Rain) = 0.2
    Winter = false:  P(Rain) = 0.1,  P(¬Rain) = 0.9


Example of a Bayesian Network (3/9)

Θ_Wet Grass|Sprinkler,Rain is:

    Sprinkler = true,  Rain = true:   P(Wet Grass) = 0.95,  P(¬Wet Grass) = 0.05
    Sprinkler = true,  Rain = false:  P(Wet Grass) = 0.9,   P(¬Wet Grass) = 0.1
    Sprinkler = false, Rain = true:   P(Wet Grass) = 0.8,   P(¬Wet Grass) = 0.2
    Sprinkler = false, Rain = false:  P(Wet Grass) = 0,     P(¬Wet Grass) = 1


Example of a Bayesian Network (4/9)

Θ_Wet Road|Rain is:

    Rain = true:   P(Wet Road) = 0.7,  P(¬Wet Road) = 0.3
    Rain = false:  P(Wet Road) = 0,    P(¬Wet Road) = 1


Example of a Bayesian Network (5/9)

Probability distribution described by a Bayesian network. Allocation of interest:

    ω(Winter) = true, ω(Sprinkler) = false, ω(Rain) = true, ω(Wet Grass) = true, ω(Wet Road) = true

Θ_Winter|∅: P(Winter) = 0.6, P(¬Winter) = 0.4, so the probability of winter is P(Winter) = 0.6.

Θ_Sprinkler|Winter:

    Winter = true:   P(Sprinkler) = 0.2,   P(¬Sprinkler) = 0.8
    Winter = false:  P(Sprinkler) = 0.75,  P(¬Sprinkler) = 0.25

Probability of "Winter" and not used "Sprinkler":

    P(W ∧ ¬S) = 0.6 · 0.8 = 0.48


Example of a Bayesian Network (6/9)

Allocation of interest (as before): ω = (Winter = true, Sprinkler = false, Rain = true, Wet Grass = true, Wet Road = true)

Θ_Rain|Winter:

    Winter = true:   P(Rain) = 0.8,  P(¬Rain) = 0.2
    Winter = false:  P(Rain) = 0.1,  P(¬Rain) = 0.9

Probability of "Winter" and not used "Sprinkler": P(W ∧ ¬S) = 0.48

Probability of "Winter" and not used "Sprinkler" and "Rain":

    P(W ∧ ¬S ∧ R) = 0.48 · 0.8 = 0.384


Example of a Bayesian Network (7/9)

Allocation of interest (as before): ω = (Winter = true, Sprinkler = false, Rain = true, Wet Grass = true, Wet Road = true)

Θ_Wet Grass|Sprinkler,Rain (see slide 3/9), with P(Wet Grass | ¬Sprinkler, Rain) = 0.8.

Probability of "Winter" and not used "Sprinkler" and "Rain": P(W ∧ ¬S ∧ R) = 0.384

Probability of "Winter" and not used "Sprinkler" and "Rain" and "Wet Grass":

    P(W ∧ ¬S ∧ R ∧ WG) = 0.384 · 0.8 = 0.3072


Example of a Bayesian Network (8/9)

Allocation of interest (as before): ω = (Winter = true, Sprinkler = false, Rain = true, Wet Grass = true, Wet Road = true)

Θ_Wet Road|Rain (see slide 4/9), with P(Wet Road | Rain) = 0.7.

Probability of "Winter" and not used "Sprinkler" and "Rain" and "Wet Grass": P(W ∧ ¬S ∧ R ∧ WG) = 0.3072

Probability of "Winter" and not used "Sprinkler" and "Rain" and "Wet Grass" and "Wet Road":

    P(W ∧ ¬S ∧ R ∧ WG ∧ WR) = 0.3072 · 0.7 = 0.21504
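This chain of multiplications over the network's CPTs can be sketched in a few lines of Python (values taken from the slides; the variable names are ours):

```python
# CPT entries needed for the allocation omega =
# (Winter=T, Sprinkler=F, Rain=T, WetGrass=T, WetRoad=T).
p_winter = 0.6                      # P(W)
p_no_sprinkler_given_winter = 0.8   # P(not S | W)
p_rain_given_winter = 0.8           # P(R | W)
p_wetgrass_given = 0.8              # P(WG | not S, R)
p_wetroad_given_rain = 0.7          # P(WR | R)

# Chain-rule factorization along the network structure.
p_omega = (p_winter * p_no_sprinkler_given_winter * p_rain_given_winter
           * p_wetgrass_given * p_wetroad_given_rain)
print(p_omega)  # 0.21504
```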


Example of a Bayesian Network (9/9)

Allocation of interest (as before): ω = (Winter = true, Sprinkler = false, Rain = true, Wet Grass = true, Wet Road = true)

Summarized:

$$\Pr(\omega) = \Theta_{W|\emptyset} \cdot \Theta_{\neg S|W} \cdot \Theta_{R|W} \cdot \Theta_{WG|\neg S \wedge R} \cdot \Theta_{WR|R}$$

This basically corresponds to the chain rule of probabilities:

$$\Pr(\varphi_1 \wedge \dots \wedge \varphi_n) = \Pr(\varphi_1 \,|\, \varphi_2 \wedge \dots \wedge \varphi_n)\;\Pr(\varphi_2 \,|\, \varphi_3 \wedge \dots \wedge \varphi_n) \cdots \Pr(\varphi_n)$$


Approach (I)

To classify and predict a discrete event system model under uncertainty, it is necessary to make assumptions about the statistical independence of variables.

Reason: the number of alternatives to factorize the joint probability distribution increases exponentially with the number of variables:

$$P(X_1, X_2) = P(X_2|X_1) \cdot P(X_1) = P(X_1|X_2) \cdot P(X_2)$$

$$P(X_1, X_2, X_3) = P(X_1|X_2, X_3) \cdot P(X_2|X_3) \cdot P(X_3) = P(X_2|X_1, X_3) \cdot P(X_1|X_3) \cdot P(X_3) = \dots$$

Conditional independence: random variables X and Y are conditionally independent given Z if it holds that

$$P(X, Y\,|\,Z) = P(X|Z) \cdot P(Y|Z) \;\Leftrightarrow\; P(X\,|\,Y, Z) = P(X|Z)$$

Bayesian networks
- … encode conditional independence assumptions among subsets of random system variables,
- … are represented by a directed acyclic graphical model, with directed arcs between nodes (model structure) and conditional probability tables related to the random system variables (model parameters).


Approach (II)

Semantics of the graphical model of a Bayesian network:

- Nodes: random variables serving as state variables and observables of the system model.
- Directed arcs: causal dependencies of the system model, from which the conditional independence of the random system variables follows.
- If a directed arc is drawn from node X ("Rain") to node Y ("Wet Road"), node X is called the parent node of Y and Y is called the child node of X.
- Nodes without parent nodes are called root nodes. [Figure: Clouds (root node) → Rain → Wet Road]
- A directed path from node X to node Y is said to exist if one can find a valid sequence of nodes starting at X and ending in Y such that each node in the sequence is a parent of the following node in the sequence.
- Each random variable Y with parent nodes X1, ..., Xn is associated with a conditional probability table (CPT) encoding the conditional probability P(Y = y | X1 = x1, ..., Xn = xn). [Figure: a child node Y (Wet Road) with parent nodes X1 (Rain) and X2 (Sprinkler)]


Definition of a Bayesian network

Definition of a discrete Bayesian network (BN): a discrete BN is represented by the parameter tuple

$$\lambda_{BN} = (G, \Theta)$$

- G is a directed, acyclic graph. Its nodes represent discrete random variables X_i (i = 1, ..., n): "A node is conditionally independent of its non-descendants, given its parents." [Figure: chain Clouds → Rain → Wet Road → Slippery Road, illustrating a given parent and a non-descendant]

- $\Theta_i = (a_{imr})$ are the conditional probability tables (CPTs) of the nodes of the network, with the components (values)

$$a_{imr} = P(X_i = x_m \,|\, \mathrm{Parents}(X_i) = w_r)$$

for m = 1, ..., |X_i| and r = 1, ..., |Parent_1(X_i)| · |Parent_2(X_i)| · …

- The index r of the CPT columns enumerates the possible combinations of values w_r of the associated parent nodes (if the node is a root node, r is simply 1).
- The index m = 1, ..., |X_i| simply enumerates the values of the discrete random variable X_i.
- The column vectors in the CPTs always sum to one.


Factorization of the joint probability distribution

1. Proposition: The joint probability distribution of a discrete Bayesian network with the random variables X_1, X_2, …, X_n can be factorized as follows:

$$P(X_1, X_2, \dots, X_n) = \prod_{i=1}^{n} P(X_i \,|\, \mathrm{Parents}(X_i))$$

Parents(X_i) are the predecessors of X_i; a transformation is therefore only forward directed!

Note: The factorization mechanism is directly associated with the graphical model. Compared to a fully interlinked and structurally uninformative graph, the number of alternatives to factorize the joint probability distribution can be significantly reduced. A graphical model can be developed from first principles and established theories about cause-and-effect relationships.

Note: Several valid factorizations can exist for a given joint probability distribution of a Bayesian model.


Example of a Bayesian network (I)

A production machine (M) tends to produce a significant amount of defective parts.

Causes:
- Its drive (D) is over-heated.
- The control electronics (E) are disturbed.
- The shop floor temperature (T) influences the over-heating of the drive (D).
- The shop floor temperature (T) depends on the season (S), because there is no air conditioning system.
- The functioning of the control electronics (E) is affected by grid (G) voltage jitters and by the shop floor temperature (T).

Graphical model of conditional independencies: [Figure: Season → Temperature; Temperature → Drive and Control Electronics; Grid → Control Electronics; Drive and Control Electronics → Machine]


Example of a Bayesian network (II)

Random system variables of the system model:
- X1 = M with binary states: {normal productivity, low productivity} = {m, ¬m}
- X2 = E with binary states: {faultless, disturbed} = {e, ¬e}
- X3 = D with binary states: {normal, over-heated} = {d, ¬d}
- X4 = G with binary states: {no voltage jitters, significant jitters} = {g, ¬g}
- X5 = T with ternary states: {high, normal, low} = {h, n, l}
- X6 = S with quaternary states: {winter, spring, summer, fall} = {w, p, s, f}


Example of a Bayesian network (III)

Example conditional probability table (CPT_T) of the variable temperature (T), states {high, normal, low} = {h, n, l}, given the season:

                 S = w   S = p   S = s   S = f
    P(T = h|·)   0.05    0.10    0.75    0.10
    P(T = n|·)   0.20    0.30    0.20    0.30
    P(T = l|·)   0.75    0.60    0.05    0.60

Example conditional probability table (CPT_M) of the production machine (M), given the electronics (faultless e, disturbed ¬e) and the drive (normal d, over-heated ¬d):

                 E=e ∧ D=d   E=e ∧ D=¬d   E=¬e ∧ D=d   E=¬e ∧ D=¬d
    P(M = m|·)   0.94        0.01         0.025        0.01
    P(M = ¬m|·)  0.06        0.99         0.975        0.99


Example of a Bayesian network (IV)

Remember: the joint probability distribution encoded by a discrete Bayesian network with the random variables X_1, X_2, …, X_n can be factorized as

$$P(X_1, X_2, \dots, X_n) = \prod_{i=1}^{n} P(X_i \,|\, \mathrm{Parents}(X_i))$$

For the example, the following parameterization is developed:

$$P(M, E, D, G, T, S) = P(M|E,D)\; P(E|G,T)\; P(D|T)\; P(T|S)\; P(S)\; P(G)$$

[Figure: the network annotated with its factors, namely Machine: P(M|E,D); Control Electronics: P(E|G,T); Drive: P(D|T); Temperature: P(T|S); Season: P(S); Grid: P(G)]


Inference in Bayesian networks (I)

Overall goal of probability calculation with Bayesian networks, also referred to as "inference": estimation of the probability mass functions of non-observable (hidden) random variables in the network when (some) states of observable variables are known.

[Figure: the production machine network with parent (root) and child nodes marked]

! If, due to the network structure, the child nodes are observable and hidden causes have to be estimated, the inference is called a diagnosis or bottom-up inference.
Example: P("significant grid voltage jitters" | "low productivity of machine") (inference from Machine up to Grid)

! If root nodes or parent nodes are observable and effects have to be estimated, the inference is called a prognosis or top-down inference.
Example: P("low productivity of machine" | "over-heated drive") (inference from Drive down to Machine)


Inference in Bayesian networks (II)

! Inference in Bayesian networks is very flexible: the states of arbitrary network nodes can be fixed as evidence, and the probability distributions of the other nodes can then be updated.
Example: P(... | "winter season", "significant grid voltage jitters")

But… the exact calculation of probability values is, in general, an NP-hard problem. Therefore, in this introductory course we only present closed-form solutions for chains of variables (like Markov chains) and for a simple tree.


Diagnosis in chains (I)

Case 1: Dual chain X → Y, and {Y = y} is observed.

Remember the Bayes theorem:

$$P(h|D) = \frac{P(D|h)\,P(h)}{P(D)}$$

with P(h), P(D), P(h|D) and P(D|h) as before. Here y plays the role of the observed data D, and p(y|x) is the probability of the observation y under the condition of x:

$$\mathrm{Belief}(x) \equiv P(X = x \,|\, Y = y) = \frac{P(Y = y \,|\, X = x)\,P(X = x)}{P(Y = y)} = \frac{p(y|x)\,p(x)}{\sum_{x'} p(y|x')\,p(x')} = c \cdot p(x) \cdot l(x)$$

with

$$c = \left[\sum_{x'} p(y|x')\,p(x')\right]^{-1}, \qquad l(x) = p(y|x)$$


Diagnosis in chains (II)

Case 1: Dual chain X → Y, and {Y = y} is observed.

Example: Grid → Control electronics; observed is {Control electronics = "disturbed"}.

Assumptions:
- P(E = "faultless" | G = "no jitters") ≡ p(e|g) = 0.9 ⟹ p(¬e|g) = 0.1 (faultless electronics given no grid voltage jitters)
- P(E = "disturbed" | G = "significant jitters") ≡ p(¬e|¬g) = 0.8 ⟹ p(e|¬g) = 0.2 (disturbed electronics given significant grid voltage jitters)
- P(G = "no jitters") ≡ p(g) = 0.95 ⟹ p(¬g) = 0.05 (probability of the occurrence of jitters)

$$\mathrm{Belief}(g) = c \cdot p(g) \cdot l(g) = c \cdot p(g) \cdot p(\neg e|g) = c \cdot 0.95 \cdot 0.1 = c \cdot 0.095$$

$$\mathrm{Belief}(\neg g) = c \cdot p(\neg g) \cdot l(\neg g) = c \cdot p(\neg g) \cdot p(\neg e|\neg g) = c \cdot 0.05 \cdot 0.8 = c \cdot 0.04$$

$$c \cdot 0.095 + c \cdot 0.04 = 1 \;\Rightarrow\; c = (0.095 + 0.04)^{-1} = 7.4074$$

$$\mathrm{Belief}(g) \approx 0.70 \qquad\text{and}\qquad \mathrm{Belief}(\neg g) \approx 0.30$$
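A minimal Python sketch of this belief update (values from the slide, variable names are ours):

```python
# Prior over the hidden cause G and likelihood of the observation E = "disturbed".
prior = {"g": 0.95, "not_g": 0.05}        # p(g), p(not g)
likelihood = {"g": 0.1, "not_g": 0.8}     # p(not e | g), p(not e | not g)

# Unnormalized beliefs, then the normalization constant c.
unnorm = {x: prior[x] * likelihood[x] for x in prior}
c = 1.0 / sum(unnorm.values())            # c = (0.095 + 0.04)^-1

belief = {x: c * v for x, v in unnorm.items()}
print({x: round(v, 2) for x, v in belief.items()})  # {'g': 0.7, 'not_g': 0.3}
```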


Diagnosis in chains (III)

Case 2: Triple chain X → Y → Z, and {Z = z} is observed.

$$\mathrm{Belief}(x) = p(x|z) = \frac{1}{p(z)}\, p(x)\, p(z|x) = c \cdot p(x) \cdot l(x)$$

with the likelihood function

$$l(x) = \sum_{y} p(z|y, x)\, p(y|x) = \sum_{y} p(z|y)\, p(y|x)$$

(the second equality holds because Z is conditionally independent of X given Y).

Example: Grid → Electronics → Machine; observed is {Machine = "low productivity"}.

Assumptions:
- P(M = "normal productivity" | E = "faultless") ≡ p(m|e) = 0.95 ⟹ p(¬m|e) = 0.05
- P(M = "low productivity" | E = "disturbed") ≡ p(¬m|¬e) = 0.85 ⟹ p(m|¬e) = 0.15
- P(E = "faultless" | G = "no jitters") ≡ p(e|g) = 0.9 ⟹ p(¬e|g) = 0.1
- P(E = "disturbed" | G = "significant jitters") ≡ p(¬e|¬g) = 0.8 ⟹ p(e|¬g) = 0.2
- P(G = "no jitters") ≡ p(g) = 0.95 ⟹ p(¬g) = 0.05


Diagnosis in chains (IV)

Recall l(x) = p(y|x) from the dual chain; here l(g) = p(¬m|g), the probability of low productivity given no voltage jitters, obtained by marginalizing over the faultless/disturbed electronics:

$$l(g) = p(\neg m|g) = p(\neg m|e)\,p(e|g) + p(\neg m|\neg e)\,p(\neg e|g) = 0.05 \cdot 0.9 + 0.85 \cdot 0.1 = 0.13$$

$$\mathrm{Belief}(g) = c \cdot p(g) \cdot l(g) = c \cdot 0.95 \cdot 0.13 = c \cdot 0.1235$$

$$l(\neg g) = p(\neg m|\neg g) = p(\neg m|e)\,p(e|\neg g) + p(\neg m|\neg e)\,p(\neg e|\neg g) = 0.05 \cdot 0.2 + 0.85 \cdot 0.8 = 0.69$$

$$\mathrm{Belief}(\neg g) = c \cdot p(\neg g) \cdot l(\neg g) = c \cdot 0.05 \cdot 0.69 = c \cdot 0.0345$$

$$c = (0.1235 + 0.0345)^{-1} \;\Rightarrow\; \mathrm{Belief}(g) \approx 0.78 \qquad\text{and}\qquad \mathrm{Belief}(\neg g) \approx 0.22$$
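The same computation as a Python sketch, marginalizing over the hidden middle variable E (values from the slide):

```python
# CPT entries for the chain Grid -> Electronics -> Machine.
p_g = {"g": 0.95, "not_g": 0.05}                    # prior over Grid
p_e_given_g = {"g": {"e": 0.9, "not_e": 0.1},
               "not_g": {"e": 0.2, "not_e": 0.8}}
p_notm_given_e = {"e": 0.05, "not_e": 0.85}         # p(not m | E)

# Likelihood l(x) = sum_y p(not m | y) p(y | x), then normalized belief.
l = {x: sum(p_notm_given_e[y] * p_e_given_g[x][y] for y in ("e", "not_e"))
     for x in p_g}
c = 1.0 / sum(p_g[x] * l[x] for x in p_g)
belief = {x: c * p_g[x] * l[x] for x in p_g}
print({x: round(v, 2) for x, v in belief.items()})  # {'g': 0.78, 'not_g': 0.22}
```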


Diagnosis in chains (V)

Case 3: n-tuple chain X_1 → ... → X_n, and {X_n = x_n} is observed.

$$\mathrm{Belief}(x_1) = p(x_1|x_n) = \frac{1}{p(x_n)}\,p(x_1)\,p(x_n|x_1) = c \cdot p(x_1) \cdot l(x_1)$$

with the likelihood function obtained by summing over all intermediate variables:

$$l(x_1) = \sum_{x_{n-1}} \cdots \sum_{x_2} p(x_n \,|\, x_{n-1}, \dots, x_1) \cdots p(x_2|x_1) = \sum_{x_{n-1}} p(x_n|x_{n-1}) \sum_{x_{n-2}} p(x_{n-1}|x_{n-2}) \cdots \sum_{x_2} p(x_3|x_2)\,p(x_2|x_1)$$


Topologies of trees

[Figure: a tree and a multiply connected tree]

Note: in a tree there is no node that merges arcs (no node has more than one parent).


Diagnosis in a simple tree

Case 4: Simple tree X_1 → Y ← X_2 with Y → Z, and {Z = z} is observed.

$$\mathrm{bel}(x_1) = p(x_1|z) = \frac{1}{p(z)}\,p(x_1)\,p(z|x_1) = c \cdot p(x_1) \cdot l(x_1)$$

$$l(x_1) = \sum_{y}\sum_{x_2} p(z \,|\, y, x_1, x_2)\,p(y \,|\, x_1, x_2)\,p(x_2|x_1) = \sum_{y} p(z|y) \sum_{x_2} p(y \,|\, x_1, x_2)\,p(x_2)$$

(Z depends only on Y, and the root node X_2 is independent of X_1.)

Moreover, it is possible to derive exact inference algorithms for trees with multiple layers as well as for multiply connected trees; multiply connected trees are converted into multiple-layer trees. These algorithms are given in KOCH (2000).


4. Introduction to Dynamic Bayesian Networks


Approach (I)

In the previous lecture the approach of static Bayesian networks with discrete random variables was introduced. It is able to encode prior knowledge and independence assumptions of a problem domain to be modelled, both efficiently and consistently, in a graphical model, and it allows us to infer the system state from incomplete data.

In this lecture the primary question is how we can exploit the methodology of Bayesian networks to model and simulate stochastic processes. These processes were already analyzed in the 7th and 8th lectures. As in the case of Markov chains, we are only interested in the total probability p(x', x) of a transition from state x to state x' and do not distinguish the events triggering the state transition.

For instance, it is possible to represent a discrete-state, discrete-time Markov chain as a Bayesian network. The time-indexed random variable O_t, defined over the integers 1, 2, …, encodes the observable state of the chain in each time step t of the process:

[Figure: chain of time slices O_1 → O_2 → O_3 → … → O_T for t = 1, 2, 3, …, T]

$$\Pi_0 = \left[\,P(O_1 = 1)\quad P(O_1 = 2)\quad \dots\,\right]$$

$$\mathbf{P} = (p_{ij}) = \begin{bmatrix} p_{11} & p_{12} & \dots \\ p_{21} & p_{22} & \dots \\ \dots & \dots & \dots \end{bmatrix}, \qquad \text{e.g.}\;\; p_{12} = P(O_t = 2 \,|\, O_{t-1} = 1)$$


Approach (II)

Clearly, we can make use of the structure of the graphical model according to Proposition 1 of the previous lecture to factorize the joint probability distribution of the observables:

$$P(O_1, O_2, \dots, O_T) = P(O_1)\,P(O_2|O_1)\,P(O_3|O_2) \cdots P(O_T|O_{T-1})$$

Furthermore, we showed in the previous lecture how to compute the bottom-up inference (diagnosis) in such a Markov chain using the Bayes theorem.

According to the factorization of the joint distribution, the predictive power of this simple process model is limited, because the state transition mechanism considers only two neighboring time slices. In other words, if we have modeled the state sequence {O_1, ..., O_t} and we want to predict the future state of the stochastic process O_{t+1}, the simple Markov chain model considers only the distribution of the probability mass related to O_t in conjunction with the single-step transition probabilities p_ij. The previous instances of the process are irrelevant, given the present state.

This minimum chain model is also called a first-order Markov chain, because only two consecutive time slices are linked in the graphical process model. The first-order Markov chain can be considered as the minimum structure of a dynamic Bayesian network.


High-order Markov chains

A significantly larger predictive power of the chain model is possible (without recoding states, see 8th lecture!) if the present state of the chain (t) does not only depend on the state in the previous time slice (t−1) but also on additional time slices in the past of the process (t−2, t−3, …). If the "memory depth" of the model is 2, it is called a second-order Markov chain and drawn as follows:

[Figure: chain O_1 → O_2 → O_3 → … → O_T with additional arcs skipping one time slice, for t = 1, 2, 3, …, T]

Clearly, the joint probability distribution of the second-order Markov chain can be factorized as:

$$P(O_1, O_2, \dots, O_T) = P(O_1)\,P(O_2|O_1)\,P(O_3|O_2, O_1)\,P(O_4|O_3, O_2) \cdots P(O_T|O_{T-1}, O_{T-2})$$

1. Proposition: The joint probability distribution of a discrete-state, discrete-time Markov chain of order k can be factorized in each time step T as:

$$P(O_1, O_2, \dots, O_T) = P(O_1)\,P(O_2|O_1) \cdots P(O_k \,|\, O_{k-1}, \dots, O_1)\; \cdot\; P(O_{k+1} \,|\, O_k, \dots, O_1)\,P(O_{k+2} \,|\, O_{k+1}, \dots, O_2) \cdots P(O_T \,|\, O_{T-1}, \dots, O_{T-k})$$


Markov chains with hidden variables (I)

Markov chains (MC) of finite order k are able to simulate significant memory capacity, but the number of model parameters N = |λ| (λ represents the parameter tuple) that are stored in the prior and conditional probability tables grows exponentially with the order.

Consider a stochastic process with three states o_t ∈ {1, 2, 3}. We have:
- First-order MC: N_1 = (3−1) + 3·(3−1) (initial state probabilities plus transition matrix; rows must sum up to 1)
- Second-order MC: N_2 = (3−1) + 3·(3−1) + 3²·(3−1)
- k-th-order MC: N_k = (3−1) + 3·(3−1) + … + 3^k·(3−1)

In order to avoid this rapid growth of the number of parameters, and to be able to model processes with latent dependency structures leading to long-range correlations, the approach of Markov chains with hidden variables was invented in engineering science. These Hidden Markov Models (HMM) distinguish a not directly observable state process {Q_t} that satisfies the Markov property and a non-Markovian observation process {O_t} that depends on the state process. This kind of dynamic Bayesian network with hidden (latent) state variables has the following structure:

[Figure: graph of a Hidden Markov Model, i.e. the hidden chain Q_1 → Q_2 → Q_3 → … → Q_T with emissions Q_t → O_t in each time slice t = 1, 2, 3, …, T]
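A small sketch of the parameter count N_k for the three-state chain (the formula above, as a function):

```python
def n_params(k: int, s: int = 3) -> int:
    """Number of free parameters of a k-th order Markov chain with s states."""
    # Initial distribution plus one conditional table per memory depth 1..k.
    return (s - 1) + sum(s**i * (s - 1) for i in range(1, k + 1))

for k in (1, 2, 3, 4):
    print(k, n_params(k))  # 1: 8, 2: 26, 3: 80, 4: 242 -> grows like 3^k
```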


Application examples of HMM

- Phoneme and word recognition on the basis of adequately sampled and encoded acoustic spectra in speech recognition
- Classification of human behavior when interacting with anthropomorphic robots
- Prediction of event sequences in communication engineering and human-computer interaction

[Figure: functional model of a speech recognition system by Prof. Schukat (Jena University)]


Markov chains with hidden variables (II)

2. Proposition: The joint probability distribution of a Hidden Markov Model can be factorized in each time step T as:

$$P(Q_1, \dots, Q_T, O_1, \dots, O_T) = P(Q_1)\,P(O_1|Q_1)\,\prod_{t=2}^{T} P(Q_t|Q_{t-1})\,P(O_t|Q_t)$$

1. Def.: A discrete-time, discrete-state Hidden Markov Model is represented by the parameter tuple λ_HMM = (Q, O, Π, A, B), where
- Q is a set of hidden states, mapped in the following onto the integers {1, 2, ..., J},
- O is a set of observable states, mapped in the following onto the integers {1, 2, ..., K},
- Π = (π_1, ..., π_J) encodes the start vector indicating the initial distribution of the probability mass over the hidden states, with π_j = P(Q_1 = j) (j = 1...J),
- A = (a_ij) with a_ij = P(Q_t = j | Q_{t-1} = i) encodes the transition matrix of the hidden process (i, j = 1...J),
- B = (b_jk) with b_jk = P(O_t = k | Q_t = j) encodes the emission matrix of the observable states given the hidden states (j = 1...J, k = 1...K).

Therefore, the distribution of the probability mass over the observable states in time step t, given the initial distribution Π, transition matrix A and emission matrix B, can be calculated as follows:

$$P(O_t) = \Pi(t)\,B = \Pi\,A^{t-1}\,B$$

where Π(t) = Π A^{t−1} is the distribution over the hidden states in time step t.


HMM example

A fluid in a chemical reactor has two states Q = {1 (non-toxic), 2 (toxic)}. According to the molecular properties of the fluid, its state can change spontaneously (e.g. due to temperature jitters) from the non-toxic state to the toxic state with probability p_12 = 0.01 at any time instant. This state switching is irreversible. Laboratory studies have shown that the temporal unfolding of the state process can be represented with a sufficiently high level of accuracy by a first-order Markov chain model. Initially, the fluid is filled into the reactor in the non-toxic state.

The measurement of the state of the fluid can only be carried out with the help of an integrated sensor; a direct state observation is not possible. The sensor is fast enough to finish the measurement in the same time instant. The sensor identifies the toxic state with a reliability of 99.9% and the non-toxic state with a reliability of 95%.

How is the probability mass distributed over the observable states in time step t = 4 when the system is initialized in the non-toxic state?

Solution:

$$\Pi = \begin{bmatrix} 1.00 & 0.00 \end{bmatrix}, \qquad A = \begin{bmatrix} 0.99 & 0.01 \\ 0.00 & 1.00 \end{bmatrix}, \qquad B = \begin{bmatrix} 0.95 & 0.05 \\ 0.001 & 0.999 \end{bmatrix}$$

$$P(O_{t=4}) = \Pi\,A^3\,B = \begin{bmatrix} 0.92 & 0.08 \end{bmatrix}$$
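A numpy sketch reproducing the solution:

```python
import numpy as np

Pi = np.array([1.00, 0.00])      # initial distribution over the hidden states
A = np.array([[0.99, 0.01],      # transition matrix (irreversible switch)
              [0.00, 1.00]])
B = np.array([[0.95, 0.05],      # emission matrix (sensor reliabilities)
              [0.001, 0.999]])

# P(O_4) = Pi A^3 B
p_obs = Pi @ np.linalg.matrix_power(A, 3) @ B
print(np.round(p_obs, 2))  # [0.92 0.08]
```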


5. Formalism of Dynamic Bayesian Networks



Network definition

2. Def.: A discrete-state, discrete-time dynamic Bayesian network is represented by the parameter tuple λ_DBN = (G_1, G_tr, {Π_i}_{i∈{1,...,I}}, {CPT_j}_{j∈{1,...,J}}), where

- G_1 is a directed, acyclic graph of start nodes in the first time slice (t = 1), encoding the initial distribution of the probability mass; it has the same meaning as a static Bayesian network: "Each node is conditionally independent of its non-descendants, given its parents",

- G_tr is a directed, acyclic graph of transition nodes in the replicated time slices, encoding the transition probabilities between time steps, with the same meaning as a static Bayesian network,

- $\Pi_i = (\pi^i_{km})$ encode the start vectors or start matrices of the observable as well as hidden random variables $X^i_1$ of the start nodes in the first time slice (t = 1), with the components $\pi^i_{1m} = P(X^i_1 = m)$ (i = 1...|G_1|, m = 1,…,|X^i_1|) if $X^i_1$ is a root node, or $\pi^i_{km} = P(X^i_1 = m \,|\, \mathrm{Parents}(X^i_1) = w^1_k)$ (k = 1,…,|Parents(X^i_1)|) if $X^i_1$ is not a root node,

- $CPT_j = (a^j_{km})$ encode the transition matrices regarding the observable as well as hidden random variables $X^j_{tr}$ in the replicated time slices (t = 2, 3, …), with the components $a^j_{km} = P(X^j_{tr} = m \,|\, \mathrm{Parents}(X^j_{tr}) = w^{tr}_k)$ (j = 1...|G_tr(t=2)|−1; k = 1...|Parents(X^j_{tr})|, m = 1...|X^j_{tr}|).


Basic DBN structures with parameterization for binary states (I)

1. First-order Markov chain: G_1 contains O_1; G_tr contains the transition O_{t-1} → O_t.

$$\Pi_1 = \left[\,P(O_1 = 1)\quad P(O_1 = 2)\,\right] = \Pi$$

$$CPT_1 = \begin{bmatrix} P(O_t = 1 \,|\, O_{t-1} = 1) & P(O_t = 2 \,|\, O_{t-1} = 1) \\ P(O_t = 1 \,|\, O_{t-1} = 2) & P(O_t = 2 \,|\, O_{t-1} = 2) \end{bmatrix} = \mathbf{P}$$

2. HMM: G_1 contains Q_1; G_tr contains Q_{t-1} → Q_t and Q_t → O_t.

$$\Pi_1 = \left[\,P(Q_1 = 1)\quad P(Q_1 = 2)\,\right] = \Pi$$

$$CPT_1 = \begin{bmatrix} P(Q_t = 1 \,|\, Q_{t-1} = 1) & P(Q_t = 2 \,|\, Q_{t-1} = 1) \\ P(Q_t = 1 \,|\, Q_{t-1} = 2) & P(Q_t = 2 \,|\, Q_{t-1} = 2) \end{bmatrix} = A$$

$$CPT_2 = \begin{bmatrix} P(O_t = 1 \,|\, Q_t = 1) & P(O_t = 2 \,|\, Q_t = 1) \\ P(O_t = 1 \,|\, Q_t = 2) & P(O_t = 2 \,|\, Q_t = 2) \end{bmatrix} = B$$


Basic DBN structures with parameterization for binary states (II)

3. Autoregressive HMM: as the HMM, but with an additional arc O_{t-1} → O_t; G_1 contains Q_1 → O_1, and G_tr contains Q_{t-1} → Q_t, Q_t → O_t and O_{t-1} → O_t.

$$\Pi_1 = \left[\,P(Q_1 = 1)\quad P(Q_1 = 2)\,\right]$$

$$\Pi_2 = \begin{bmatrix} P(O_1 = 1 \,|\, Q_1 = 1) & P(O_1 = 2 \,|\, Q_1 = 1) \\ P(O_1 = 1 \,|\, Q_1 = 2) & P(O_1 = 2 \,|\, Q_1 = 2) \end{bmatrix}$$

$$CPT_1 = \begin{bmatrix} P(Q_t = 1 \,|\, Q_{t-1} = 1) & P(Q_t = 2 \,|\, Q_{t-1} = 1) \\ P(Q_t = 1 \,|\, Q_{t-1} = 2) & P(Q_t = 2 \,|\, Q_{t-1} = 2) \end{bmatrix} = A$$

$$CPT_2 = \begin{bmatrix} P(O_t = 1 \,|\, Q_t = 1, O_{t-1} = 1) & P(O_t = 2 \,|\, Q_t = 1, O_{t-1} = 1) \\ P(O_t = 1 \,|\, Q_t = 2, O_{t-1} = 1) & P(O_t = 2 \,|\, Q_t = 2, O_{t-1} = 1) \\ P(O_t = 1 \,|\, Q_t = 1, O_{t-1} = 2) & P(O_t = 2 \,|\, Q_t = 1, O_{t-1} = 2) \\ P(O_t = 1 \,|\, Q_t = 2, O_{t-1} = 2) & P(O_t = 2 \,|\, Q_t = 2, O_{t-1} = 2) \end{bmatrix}$$


Basic DBN structures with parameterization for binary states (III)

4. Factorial HMM: two independent hidden chains {Q¹_t} and {Q²_t} jointly drive the observation O_t; G_1 contains Q¹_1 and Q²_1, and G_tr contains Q¹_{t-1} → Q¹_t, Q²_{t-1} → Q²_t, and (Q¹_t, Q²_t) → O_t.

$$\Pi_1 = \left[\,P(Q^1_1 = 1)\quad P(Q^1_1 = 2)\,\right], \qquad \Pi_2 = \left[\,P(Q^2_1 = 1)\quad P(Q^2_1 = 2)\,\right]$$

$$CPT_1 = \begin{bmatrix} P(Q^1_t = 1 \,|\, Q^1_{t-1} = 1) & P(Q^1_t = 2 \,|\, Q^1_{t-1} = 1) \\ P(Q^1_t = 1 \,|\, Q^1_{t-1} = 2) & P(Q^1_t = 2 \,|\, Q^1_{t-1} = 2) \end{bmatrix}, \qquad CPT_2 = \begin{bmatrix} P(Q^2_t = 1 \,|\, Q^2_{t-1} = 1) & P(Q^2_t = 2 \,|\, Q^2_{t-1} = 1) \\ P(Q^2_t = 1 \,|\, Q^2_{t-1} = 2) & P(Q^2_t = 2 \,|\, Q^2_{t-1} = 2) \end{bmatrix}$$

$$CPT_3 = \begin{bmatrix} P(O_t = 1 \,|\, Q^1_t = 1, Q^2_t = 1) & P(O_t = 2 \,|\, Q^1_t = 1, Q^2_t = 1) \\ P(O_t = 1 \,|\, Q^1_t = 2, Q^2_t = 1) & P(O_t = 2 \,|\, Q^1_t = 2, Q^2_t = 1) \\ P(O_t = 1 \,|\, Q^1_t = 1, Q^2_t = 2) & P(O_t = 2 \,|\, Q^1_t = 1, Q^2_t = 2) \\ P(O_t = 1 \,|\, Q^1_t = 2, Q^2_t = 2) & P(O_t = 2 \,|\, Q^1_t = 2, Q^2_t = 2) \end{bmatrix}$$


Factorization in DBN

3. Def.: A DBN over two consecutive time slices with the aggregated random state variables

$$X = (X^1_t, X^2_t, \dots, X^n_t) \qquad\text{and}\qquad X' = (X^1_{t+1}, X^2_{t+1}, \dots, X^n_{t+1})$$

is a net fragment G_tr and represents the probability distribution of a transition model according to

$$P_{tr}(X'\,|\,X) \equiv \prod_{i=1}^{n} P\!\left(X'^{\,i} \,\middle|\, \mathrm{Parents}(X'^{\,i})\right)$$

3. Proposition: The joint probability distribution of the aggregated random state variables of a DBN can be factorized in each time slice T according to

$$P(X_1, \dots, X_T \,|\, \lambda_{DBN}) = \Pi_1(X_1 \,|\, \lambda_{DBN}) \cdot \prod_{t=2}^{T} P_{tr}(X_t \,|\, X_{t-1}, \lambda_{DBN})$$

where Π_1(·) represents the initial probability distribution of the aggregated state variables in the first time slice and P_tr(·) represents the transition model defined by Def. 3.


Open questions?