Directed Graphical Models or Bayesian Networks
Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012
Le Song
Bayesian Networks
One of the most exciting recent advances in statistical AI
Compact representation for exponentially-large probability distributions
Fast marginalization algorithm
Exploit conditional independencies
Difference from undirected graphical models!
[Figure: DAG with edges Flu → Sinus, Allergy → Sinus, Sinus → Nose, Sinus → Headache]
Handwriting recognition
[Figure: chain of character variables X1 - X2 - X3 - X4 - X5]
Summary: basic concepts for R.V.
Outcome: assign x1, …, xn to X1, …, Xn
Conditional probability: P(X, Y) = P(X) P(Y|X)
Bayes rule: P(Y|X) = P(X|Y) P(Y) / P(X)
Chain rule:
P(X1, …, Xn) = P(X1) P(X2|X1) … P(Xn|X1, …, Xn−1)
Summary: conditional independence
𝑋 is independent of 𝑌 given 𝑍 if 𝑃(𝑋 = 𝑥|𝑌 = 𝑦, 𝑍 = 𝑧) = 𝑃(𝑋 = 𝑥| 𝑍 = 𝑧) ∀𝑥 ∈ 𝑉𝑎𝑙 𝑋 , 𝑦 ∈ 𝑉𝑎𝑙 𝑌 , 𝑧 ∈ 𝑉𝑎𝑙 𝑍
Shorthand:
(𝑋 ⊥ 𝑌 | 𝑍)
For (𝑋 ⊥ 𝑌| ∅), write 𝑋 ⊥ 𝑌
Proposition: (𝑋 ⊥ 𝑌 | 𝑍) if and only if 𝑃(𝑋, 𝑌|𝑍) = 𝑃(𝑋|𝑍)𝑃(𝑌|𝑍)
Representation of Bayesian Networks
Consider P(Xi)
Assign a probability to each xi ∈ Val(Xi)
Assume |Val(Xi)| = k; how many independent parameters?
k − 1
Consider P(X1, …, Xn)
How many independent parameters if |Val(Xi)| = k for every i?
k^n − 1
Bayesian Networks can represent the same joint probability with fewer parameters
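The two counts above can be checked with a short script (single variable: k − 1 free parameters; full joint table: k^n − 1):

```python
# Independent parameters needed to specify discrete distributions.
def single_var_params(k):
    # P(X): k outcomes, probabilities sum to 1 -> k - 1 free parameters
    return k - 1

def full_joint_params(n, k):
    # P(X1,...,Xn) as one big table: k^n outcomes, minus the sum-to-1 constraint
    return k**n - 1

print(single_var_params(2))     # 1
print(full_joint_params(5, 2))  # 31 -- grows exponentially in n
```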
Simple Bayesian Networks
What if variables are independent?
What if all variables are independent?
Is it enough to have Xi ⊥ Xj, ∀i ≠ j?
Not enough!!!
Must assume that 𝒳 ⊥ 𝒴 for all disjoint subsets 𝒳, 𝒴 of {X1, …, Xn}
E.g., {X1, X3} ⊥ {X2, X4}, X1 ⊥ {X7, X8, X9}, …
Can write as
P(X1, …, Xn) = ∏_{i=1…n} P(Xi)
Bayesian networks with no edges
How many independent parameters now?
n(k − 1)
[Figure: BN with no edges over X1, X2, X3, …, Xn]
Conditional parameterization – two nodes
Grade (G) is determined by intelligence (I)
P(I):
  I     VH    H
        0.85  0.15

P(G|I):
  G \ I  VH    H
  A      0.9   0.5
  B      0.1   0.5

P(I = VH, G = B) = P(I = VH) P(G = B | I = VH) = 0.85 × 0.1 = 0.085
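A minimal sketch of this two-node model; `P_I` and `P_G_given_I` are just dictionary encodings of the two tables above:

```python
# Joint P(I, G) = P(I) P(G|I) from the tables on this slide.
P_I = {"VH": 0.85, "H": 0.15}
P_G_given_I = {("A", "VH"): 0.9, ("B", "VH"): 0.1,
               ("A", "H"): 0.5,  ("B", "H"): 0.5}

def joint(i, g):
    return P_I[i] * P_G_given_I[(g, i)]

print(round(joint("VH", "B"), 6))  # 0.085, matching the slide
```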
Conditional parameterization – three nodes
Grade and SAT score are determined by intelligence
(G ⊥ S | I) ⇔ P(S | G, I) = P(S | I)
P(I, G, S) = P(I) P(G|I) P(S|I), why?
Chain rule: P(I, G, S) = P(I) P(G|I) P(S|G, I)
Using conditional independence, we get P(I) P(G|I) P(S|I)
[Figure: DAG I → G, I → S, annotated with CPDs P(I), P(G|I), P(S|I)]
The naïve Bayes model
Class variable: 𝐶
Evidence variables: 𝑋1, …𝑋𝑛
Assume that (𝒳 ⊥ 𝒴 | C) for all disjoint subsets 𝒳, 𝒴 of {X1, …, Xn}
P(C, X1, …, Xn) = P(C) ∏_i P(Xi | C), why?
Chain rule: P(C, X1, …, Xn) = P(C) P(X1|C) P(X2|C, X1) … P(Xn|C, X1, …, Xn−1)
[Figure: naive Bayes DAG, C → X1, C → X2, …, C → Xn]
P(X2 | C, X1) = P(X2 | C) since X1 ⊥ X2 | C
P(Xn | C, X1, …, Xn−1) = P(Xn | C) since Xn ⊥ {X1, …, Xn−1} | C
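The naive Bayes factorization can be sketched directly; the CPT numbers below are made up for illustration (the slide does not give any):

```python
import itertools

# Naive Bayes joint: P(C, X1,...,Xn) = P(C) * prod_i P(Xi | C).
p_c = {True: 0.4, False: 0.6}    # P(C), invented numbers
p_x_given_c = [                  # P(Xi = True | C), one dict per Xi, invented
    {True: 0.9, False: 0.2},
    {True: 0.7, False: 0.5},
    {True: 0.1, False: 0.8},
]

def joint(c, xs):
    p = p_c[c]
    for cpt, x in zip(p_x_given_c, xs):
        p_true = cpt[c]
        p *= p_true if x else 1.0 - p_true
    return p

# Sanity check: the factored joint sums to 1 over all assignments.
total = sum(joint(c, xs) for c in (True, False)
            for xs in itertools.product((True, False), repeat=3))
print(round(total, 10))  # 1.0
```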
More Complicated Bayesian Networks and an Example
Causal Structure
We will learn the semantics of Bayesian Networks (BNs), relate them to independence assumptions encoded by the graph
Suppose we know the following:
The flu (F) causes sinus inflammation (S)
Allergies (A) cause sinus inflammation
Sinus inflammation causes a runny nose (N)
Sinus inflammation causes headaches (H)
How are these variables connected?
[Figure: DAG with edges Flu → Sinus, Allergy → Sinus, Sinus → Nose, Sinus → Headache]
Possible queries
Probabilistic Inference
E.g., P(A = t | H = t, N = f)
Most probable explanation
max over F, A, S of P(F, A, S | H = t, N = t)
Active data collection
What is the next best variable to observe?
[Figure: DAG with edges Flu → Sinus, Allergy → Sinus, Sinus → Nose, Sinus → Headache]
Car starts BN
18 binary variables
P(A, F, L, …, Spark)
Inference
P(BatteryAge | Starts = f)
= Σ over the 16 unobserved variables of P(A, F, L, …)
Sum over 2^16 terms; why is the BN so fast?
Use the sparse graph structure
For the HailFinder BN in JavaBayes:
more than 3^54 = 58,149,737,003,040,059,690,390,169 terms
Factored joint distribution--preview
P(F, A, S, H, N) = P(F) P(A) P(S|F, A) P(H|S) P(N|S)
[Figure: flu DAG annotated with CPDs P(F), P(A), P(S|F, A), P(N|S), P(H|S)]
The Bayesian network (BN) needs only 10 parameters
The full probability table P(F, A, S, H, N) has 2^5 − 1 = 31 independent parameters
Number of parameters
[Figure: flu DAG annotated with per-node parameter counts: P(F): 1, P(A): 1, P(S|F, A): 4, P(N|S): 2, P(H|S): 2]

P(S | F, A):
  S \ (F, A)   tt    tf    ft    ff
  t            0.9   0.7   0.8   0.2
  f            0.1   0.3   0.2   0.8
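A sketch of the factored joint for the flu network. P(S|F, A) is the table above and the Flu/Allergy priors (0.1 and 0.3) are taken from the marginal-independence slide of this lecture; P(H|S) and P(N|S) are invented numbers, since the slides do not give them:

```python
import itertools

# Joint P(F,A,S,H,N) = P(F) P(A) P(S|F,A) P(H|S) P(N|S) for the flu network.
p_f = {True: 0.1, False: 0.9}                   # P(F), from the lecture
p_a = {True: 0.3, False: 0.7}                   # P(A), from the lecture
p_s = {(True, True): 0.9, (True, False): 0.7,   # P(S=t | F, A), table above
       (False, True): 0.8, (False, False): 0.2}
p_h = {True: 0.8, False: 0.1}                   # P(H=t | S), made up
p_n = {True: 0.7, False: 0.05}                  # P(N=t | S), made up

def bern(p_true, val):
    return p_true if val else 1.0 - p_true

def joint(f, a, s, h, n):
    return (bern(p_f[True], f) * bern(p_a[True], a) *
            bern(p_s[(f, a)], s) * bern(p_h[s], h) * bern(p_n[s], n))

total = sum(joint(*vals) for vals in itertools.product((True, False), repeat=5))
print(round(total, 10))  # 1.0 -- 10 CPT parameters recover a 31-parameter table
```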
Key: Independence assumptions
¬(H ⊥ N)
H ⊥ N | S
¬(A ⊥ N)
A ⊥ N | S
Knowing sinus separates the symptom variables from each other
[Figure: DAG with edges Flu → Sinus, Allergy → Sinus, Sinus → Nose, Sinus → Headache]
Marginal independence
Flu and Allergy are (marginally) independent
𝐹 ⊥ 𝐴
𝑃 𝐹, 𝐴 = 𝑃 𝐹 𝑃 𝐴
More generally:
For all disjoint subsets 𝒳, 𝒴 of {X1, …, Xn}: 𝒳 ⊥ 𝒴
P(X1, …, Xn) = ∏_i P(Xi)
P(Flu):  Flu = t: 0.1,  Flu = f: 0.9
P(Al):   Al = t: 0.3,   Al = f: 0.7

P(Flu, Al) = P(Flu) P(Al):
            Flu = t            Flu = f
  Al = t    0.1 × 0.3 = 0.03   0.9 × 0.3 = 0.27
  Al = f    0.1 × 0.7 = 0.07   0.9 × 0.7 = 0.63
[Figure: DAG with edges Flu → Sinus, Allergy → Sinus, Sinus → Nose, Sinus → Headache]
Conditional independence
Flu and headache are not (marginally) independent
¬(F ⊥ H): P(F|H) ≠ P(F), P(F, H) ≠ P(F) P(H)
Flu and headache are independent given sinus infection
F ⊥ H | S: P(F, H | S) = P(F|S) P(H|S)
P(F | H, S) = P(F | S)
More generally:
X1, X2, …, Xn are independent of each other given C
P(X1, X2, …, Xn | C) = P(X1 | C) P(X2, …, Xn | C)
P(X1, X2, …, Xn | C) = ∏_i P(Xi | C)
[Figure: DAG with edges Flu → Sinus, Allergy → Sinus, Sinus → Nose, Sinus → Headache]
The conditional independence assumption
Local Markov Assumption: a variable X is independent of its non-descendants given its parents, and only its parents
𝑋 ⊥ 𝑁𝑜𝑛𝐷𝑒𝑠𝑐𝑒𝑛𝑑𝑎𝑛𝑡𝑠𝑋 | 𝑃𝑎𝑋
Flu: Pa_Flu = ∅, NonDescendants_Flu = {A}
F ⊥ A
Nose: Pa_Nose = {S}, NonDescendants_Nose = {F, A, H}
N ⊥ {F, A, H} | S
Sinus: Pa_Sinus = {F, A}, NonDescendants_Sinus = ∅
The local Markov assumption makes no claim of the form S ⊥ ⋯ | F, A
[Figure: DAG with edges Flu → Sinus, Allergy → Sinus, Sinus → Nose, Sinus → Headache]
Explaining away
The local Markov assumption does not imply F ⊥ A | S
XOR example: P(A = t) = 0.5, P(F = t) = 0.5, S = F XOR A
S = t, A = t ⇒ F = f, i.e., P(F = t | S = t, A = t) = 0
Another example: P(F = t) = 0.2, P(F = t | S = t) = 0.5, P(F = t | S = t, A = t) = 0.3
P(F = t) ≤ P(F = t | S = t, A = t) ≤ P(F = t | S = t)
Knowing A = t lowers the probability of F = t (A = t explains away F = t)
This depends on P(S | F, A); it could instead be that P(F = t | S = t, A = t) ≥ P(F = t | S = t)
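The deterministic XOR example on this slide can be verified numerically; `joint` encodes P(F) P(A) 1[S = F xor A]:

```python
# Explaining away in the XOR example: P(A=t) = P(F=t) = 0.5, S = F XOR A.
def joint(f, a, s):
    # P(F) P(A) times the indicator that S equals F xor A
    return 0.25 if s == (f != a) else 0.0

def cond(f_val, s_val, a_val):
    # P(F = f_val | S = s_val, A = a_val), by summing the joint
    num = joint(f_val, a_val, s_val)
    den = sum(joint(f, a_val, s_val) for f in (True, False))
    return num / den

print(cond(True, True, True))   # 0.0 -- observing A=t fully explains away F=t
print(cond(False, True, True))  # 1.0
```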
[Figure: DAG with edges Flu → Sinus, Allergy → Sinus, Sinus → Nose, Sinus → Headache]
Chain rule and Local Markov Assumption
Pick a topological ordering of nodes
Interpret the BN using a particular chain-rule order
Use the conditional independence assumption
P(N, H, S, A, F) = P(F) P(A|F) P(S|F, A) P(H|S, F, A) P(N|S, F, A, H)
P(N, H, S, A, F) = P(F) P(A) P(S|F, A) P(H|S) P(N|S)
Why can we decompose the joint distribution this way?
[Figure: flu DAG with topological ordering 1: Flu, 2: Allergy, 3: Sinus, 4: Headache, 5: Nose; annotations: P(A) since A ⊥ F, P(S|F, A), P(H|S) since H ⊥ {F, A} | S, P(N|S) since N ⊥ {F, A, H} | S]
General Bayesian Networks
Set of random variables 𝑋1, … , 𝑋𝑛
Directed acyclic graph (DAG)
Loops (cycles in the underlying undirected graph) are OK
But no directed cycles
Local Markov Assumptions
A variable 𝑋 is independent of its non-descendants given its parents and only its parents (𝑋 ⊥ 𝑁𝑜𝑛𝐷𝑒𝑠𝑐𝑒𝑛𝑑𝑎𝑛𝑡𝑠𝑋 | 𝑃𝑎𝑋)
Conditional probability tables (CPTs): P(Xi | Pa_Xi) for each Xi
Joint distribution: P(X1, …, Xn) = ∏_i P(Xi | Pa_Xi)
A general Bayes net
What distributions can be represented by a Bayesian Networks?
What Bayesian Networks can represent a distribution?
What are the independence assumptions encoded by a BN?
In addition to the local Markov assumption
Question?
[Figure: chain A → B → C → D]
Local Markov: A ⊥ C | B, D ⊥ {A, B} | C
Derived independence: A ⊥ D | B
Conditional Independence in Problem
World, data, reality: the true distribution P contains conditional independence assertions I(P)
Bayesian network: the graph G encodes local independence assumptions I_l(G)
Key representational assumption: I_l(G) ⊆ I(P)
The representation theorem – True conditional independence => BN factorization
BN encodes local conditional independence assumptions 𝐼𝑙 𝐺
BN encodes local conditional independence assumptions I_l(G)
If the local conditional independencies of the BN are a subset of the conditional independencies in P, i.e., I_l(G) ⊆ I(P),
then the joint probability P can be written as
P(X1, …, Xn) = ∏_i P(Xi | Pa_Xi)
The representation theorem – BN factorization => True conditional independence
BN encodes local conditional independence assumptions 𝐼𝑙 𝐺
BN encodes local conditional independence assumptions I_l(G)
If the joint probability P can be written as
P(X1, …, Xn) = ∏_i P(Xi | Pa_Xi),
then the local conditional independencies of the BN are a subset of the conditional independencies in P, i.e., I_l(G) ⊆ I(P)
Example: naïve Bayes – True conditional independence => BN Factorization
Independence assumptions:
𝑋𝑖 are independent of each other given 𝐶
That is, (𝒳 ⊥ 𝒴 | C) for all disjoint subsets 𝒳, 𝒴 of {X1, …, Xn}
E.g., {X1, X3} ⊥ {X2, X4} | C, X1 ⊥ {X7, X8, X9} | C, …
Same as the local conditional independence assumptions
Prove: P(C, X1, …, Xn) = P(C) ∏_i P(Xi | C)
Use the chain rule and the local Markov property: P(C, X1, …, Xn) = P(C) P(X1|C) P(X2|C, X1) … P(Xn|C, X1, …, Xn−1)
[Figure: naive Bayes DAG, C → X1, C → X2, …, C → Xn]
P(X2 | C, X1) = P(X2 | C) since X1 ⊥ X2 | C
P(Xn | C, X1, …, Xn−1) = P(Xn | C) since Xn ⊥ {X1, …, Xn−1} | C
Example: naïve Bayes – BN Factorization => True conditional independence
Assume P(C, X1, …, Xn) = P(C) ∏_i P(Xi | C)
Prove the independence assumptions:
Xi are independent of each other given C
That is, (𝒳 ⊥ 𝒴 | C) for all disjoint subsets 𝒳, 𝒴 of {X1, …, Xn}
E.g., {X1, X3} ⊥ {X2, X4} | C, X1 ⊥ {X7, X8, X9} | C, …
E.g., n = 4:
P(X1, X2 | C) = P(X1, X2, C) / P(C)
= [Σ_{x3, x4} P(X1, X2, x3, x4, C)] / P(C)
= (1 / P(C)) Σ_{x3, x4} P(C) P(X1|C) P(X2|C) P(x3|C) P(x4|C)
= P(X1|C) P(X2|C) [Σ_{x3} P(x3|C)] [Σ_{x4} P(x4|C)]
= P(X1|C) P(X2|C) ⋅ 1 ⋅ 1
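The n = 4 derivation above can be checked numerically; all CPT numbers below are invented for illustration:

```python
import itertools

# If P(C, X1..X4) = P(C) * prod_i P(Xi|C), then P(X1, X2 | C) = P(X1|C) P(X2|C).
p_c = 0.3                       # P(C = t), invented
p_x = [0.9, 0.6, 0.2, 0.7]      # P(Xi = t | C = t), invented
q_x = [0.4, 0.1, 0.5, 0.8]      # P(Xi = t | C = f), invented

def joint(c, xs):
    p = p_c if c else 1 - p_c
    for pt, qt, x in zip(p_x, q_x, xs):
        t = pt if c else qt
        p *= t if x else 1 - t
    return p

c = True
# P(X1=t, X2=t | C=t): marginalize x3, x4 out of the joint, divide by P(C=t)
num = sum(joint(c, (True, True, x3, x4))
          for x3, x4 in itertools.product((True, False), repeat=2))
lhs = num / p_c
rhs = p_x[0] * p_x[1]
print(abs(lhs - rhs) < 1e-12)  # True
```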
[Figure: naive Bayes DAG, C → X1, C → X2, …, C → Xn]
How about the general case?
BN encodes local conditional independence assumptions I_l(G)
If I_l(G) ⊆ I(P), i.e., this BN is an I-map of P,
then the joint probability P can be written as P(X1, …, Xn) = ∏_i P(Xi | Pa_Xi), i.e., P factorizes according to the BN
Conversely, if P factorizes this way, then I_l(G) ⊆ I(P)
Every P has at least one BN structure G
Read independencies of P from the BN structure G
Proof from I-map to factorization
Topological ordering of X1, X2, …, Xn:
Number the variables such that
a parent has a lower number than its children, i.e., Xi → Xj ⇒ i < j
a variable has a lower number than all its descendants
Use the chain rule:
P(X1, …, Xn) = P(X1) P(X2|X1) … P(Xn|X1, …, Xn−1)
For each factor P(Xi | X1, …, Xi−1):
Pa_Xi ⊆ {X1, …, Xi−1},
and there is no descendant of Xi in {X1, …, Xi−1}
so by the local Markov assumption, P(Xi | X1, …, Xi−1) = P(Xi | Pa_Xi)
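The topological ordering used in this proof can be computed with Kahn's algorithm; this is a generic sketch, tested on the flu network:

```python
from collections import deque

# Topological ordering of a DAG (Kahn's algorithm): every parent gets a
# lower position than its children, as the proof above requires.
def topological_order(nodes, edges):
    children = {v: [] for v in nodes}
    indegree = {v: 0 for v in nodes}
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1
    queue = deque(v for v in nodes if indegree[v] == 0)
    order = []
    while queue:
        v = queue.popleft()
        order.append(v)
        for c in children[v]:
            indegree[c] -= 1
            if indegree[c] == 0:
                queue.append(c)
    if len(order) != len(nodes):
        raise ValueError("graph has a directed cycle")
    return order

# Flu network: F -> S <- A, S -> H, S -> N
edges = [("F", "S"), ("A", "S"), ("S", "H"), ("S", "N")]
order = topological_order(["F", "A", "S", "H", "N"], edges)
pos = {v: i for i, v in enumerate(order)}
print(order)  # ['F', 'A', 'S', 'H', 'N'] -- parents always precede children
```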
[Figure: flu DAG with topological ordering 1: Flu, 2: Allergy, 3: Sinus, 4: Headache, 5: Nose]
Local Markov Assumption: X ⊥ NonDescendants_X | Pa_X
Let G be an I-map for P; then any DAG G′ that includes all the directed edges of G is also an I-map of P
G′ is strictly more expressive than G
If G is an I-map for P, then adding edges still results in an I-map
E.g., with an extra edge A → N, the CPT P(N | S, A) can still represent P:
if P(N | S, A = t) = P(N | S, A = f)
⇒ P(N | S) = P(N | S, A)
⇔ N ⊥ A | S
Adding edges doesn't hurt
[Figure: DAG with edges Flu → Sinus, Allergy → Sinus, Sinus → Nose, Sinus → Headache]
Minimal I-maps
G is a minimal I-map for P if deleting any edge from G makes it no longer an I-map
E.g., ¬(A ⊥ B), A ⊥ B | C: a minimal I-map is A → C → B; if we remove an edge, it is no longer an I-map
To obtain a minimal I-map given a set of variables and conditional independence assertions:
Choose an ordering of the variables X1, …, Xn
For i = 1 to n
Add Xi to the network
Define the parents of Xi, Pa_Xi, as the minimal subset of {X1, …, Xi−1} such that the local Markov assumption holds
Define/learn the CPT P(Xi | Pa_Xi)
Read conditional independence from BN structure G
Conditional Independence encoded in BN
Local Markov assumption
𝑋 ⊥ 𝑁𝑜𝑛𝐷𝑒𝑠𝑐𝑒𝑛𝑑𝑎𝑛𝑡𝑠𝑋 | 𝑃𝑎𝑋
There are other conditional independencies
E.g., explaining away, derived independencies
What other derived conditional independencies are there?
[Figure: V-structure Allergy → Sinus ← Flu]
A ⊥ F, but ¬(A ⊥ F | S)
[Figure: chain A → B → C → D]
Local Markov: A ⊥ C | B, D ⊥ {A, B} | C
Derived independence: A ⊥ D | B
3-node cases
Causal effect (chain): Allergy → Sinus → Headache; ¬(A ⊥ H), but A ⊥ H | S
Common cause: Nose ← Sinus → Headache; ¬(N ⊥ H), but N ⊥ H | S
Common effect (V-structure): Allergy → Sinus ← Flu; A ⊥ F, but ¬(A ⊥ F | S)
How about other derived relations (in the example DAG below)?
? F ⊥ {B, E, G, J}:  yes, F ⊥ {B, E, G, J}
? {A, C, F, I} ⊥ {B, E, G, J}:  yes, {A, C, F, I} ⊥ {B, E, G, J}
? B ⊥ J | E:  yes, B ⊥ J | E
? E ⊥ F | K:  no, ¬(E ⊥ F | K)
? E ⊥ F | {K, I}:  yes, E ⊥ F | {K, I}
? F ⊥ G | D:  no, ¬(F ⊥ G | D)
? F ⊥ G | H:  no, ¬(F ⊥ G | H)
? F ⊥ G | {H, K}:  no, ¬(F ⊥ G | {H, K})
? F ⊥ G | {H, A}:  yes, F ⊥ G | {H, A}
[Figure: example DAG over nodes A, B, C, D, E, F, G, H, I, J, K; edges not recoverable from the transcript]
Active trails
A trail X1 - X2 - ⋯ - Xk is an active trail when variables O ⊆ {X1, …, Xn} are observed if, for each consecutive triplet in the trail:
Xi−1 → Xi → Xi+1 and Xi is not observed (Xi ∉ O)
Xi−1 ← Xi ← Xi+1 and Xi is not observed (Xi ∉ O)
Xi−1 ← Xi → Xi+1 and Xi is not observed (Xi ∉ O)
Xi−1 → Xi ← Xi+1 and Xi or one of its descendants is observed (V-structure)
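The four rules above can be transcribed directly into a brute-force checker, suitable only for small graphs: enumerate every simple trail between two nodes and test each consecutive triplet. The flu network serves as the example:

```python
# Brute-force active-trail check: two nodes are independent given the observed
# set iff no trail between them is active under the four triplet rules.
def descendants(node, edges):
    out, stack = set(), [node]
    while stack:
        v = stack.pop()
        for p, c in edges:
            if p == v and c not in out:
                out.add(c)
                stack.append(c)
    return out

def is_active(trail, edges, observed, edge_set):
    for i in range(1, len(trail) - 1):
        a, b, c = trail[i - 1], trail[i], trail[i + 1]
        if (a, b) in edge_set and (b, c) in edge_set:    # a -> b -> c
            ok = b not in observed
        elif (b, a) in edge_set and (c, b) in edge_set:  # a <- b <- c
            ok = b not in observed
        elif (b, a) in edge_set and (b, c) in edge_set:  # a <- b -> c
            ok = b not in observed
        else:                                            # a -> b <- c, V-structure
            ok = b in observed or bool(descendants(b, edges) & observed)
        if not ok:
            return False
    return True

def d_separated(x, y, observed, nodes, edges):
    edge_set = set(edges)
    neighbors = {v: set() for v in nodes}
    for p, c in edges:
        neighbors[p].add(c)
        neighbors[c].add(p)
    stack = [[x]]                       # depth-first enumeration of simple trails
    while stack:
        trail = stack.pop()
        if trail[-1] == y:
            if is_active(trail, edges, observed, edge_set):
                return False
            continue
        for nxt in neighbors[trail[-1]]:
            if nxt not in trail:
                stack.append(trail + [nxt])
    return True

nodes = ["F", "A", "S", "H", "N"]
edges = [("F", "S"), ("A", "S"), ("S", "H"), ("S", "N")]
print(d_separated("H", "N", {"S"}, nodes, edges))  # True: H ⊥ N | S
print(d_separated("F", "A", set(), nodes, edges))  # True: F ⊥ A
print(d_separated("F", "A", {"S"}, nodes, edges))  # False: explaining away
```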
Active trails and conditional independence
Variables Xi and Xj are independent given Z ⊆ {X1, …, Xn} if there is no active trail between Xi and Xj when the variables in Z are observed: (Xi ⊥ Xj | Z)
We say that Xi and Xj are d-separated given Z (dependency separation)
E.g., F ⊥ G
[Figure: example DAG over nodes A, B, C, D, E, F, G, H, I, J, K]
Soundness of d-separation
Given a BN with structure G
The set of conditional independence assertions obtained by d-separation:
I(G) = {X ⊥ Y | Z : d-sep_G(X; Y | Z)}
Soundness of d-separation:
if P factorizes over G, then I(G) ⊆ I(P) (not only I_l(G) ⊆ I(P))
Interpretation: d-separation only captures true conditional independencies
For most P's that factorize over G, I(G) = I(P) (such a G is a P-map)
Bayesian Networks are not enough
Inexistence of P-maps, example 1
XOR: A = B XOR C
A ⊥ B, but ¬(A ⊥ B | C)
B ⊥ C, but ¬(B ⊥ C | A)
C ⊥ A, but ¬(C ⊥ A | B)
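A numeric check of the XOR independencies above, with B and C as independent fair bits and A = B xor C:

```python
import itertools

# XOR distribution: B, C fair and independent, A = B xor C.
# A is pairwise independent of B, but becomes a function of B once C is known.
def joint(a, b, c):
    return 0.25 if a == (b != c) else 0.0

def marg(**fixed):
    # Marginal probability of the fixed assignment, summing out the rest.
    vals = (True, False)
    return sum(joint(a, b, c) for a in vals for b in vals for c in vals
               if all({"a": a, "b": b, "c": c}[k] == v for k, v in fixed.items()))

# Marginal independence: P(A=t, B=t) == P(A=t) P(B=t)
print(marg(a=True, b=True) == marg(a=True) * marg(b=True))  # True
# But given C=t, A is determined by B: P(A=t | B=t, C=t) = 0
p_given = joint(True, True, True) / marg(b=True, c=True)
print(p_given)  # 0.0
```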
[Figure: minimal I-map A → C ← B]
Minimal I-map: encodes A ⊥ B
Not a P-map: cannot read B ⊥ C or A ⊥ C from the graph
Inexistence of P-maps, example 2
Swinging couples of variables 𝑋1, 𝑌1 and 𝑋2, 𝑌2
X1 ⊥ Y1
X2 ⊥ Y2
X1 ⊥ X2 | {Y1, Y2}
Y1 ⊥ Y2 | {X1, X2}
No Bayesian network is a P-map for this set of independencies
Need undirected graphical models!
[Figure: four variables X1, X2, Y1, Y2 arranged in a square]