
Page 1

Directed Graphical Models or Bayesian Networks

Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012

Le Song

Page 2

Bayesian Networks

One of the most exciting recent advancements in statistical AI

Compact representation for exponentially-large probability distributions

Fast marginalization algorithm

Exploit conditional independencies

Different from undirected graphical models!

[Figure: BN with edges Flu → Sinus, Allergy → Sinus, Sinus → Nose, Sinus → Headache]

Page 3

Handwriting recognition


Page 4

Handwriting recognition

[Figure: variables $X_1, X_2, X_3, X_4, X_5$ for the characters in a handwritten word]

Page 5

Summary: basic concepts for random variables

Outcome: assign values $x_1, \dots, x_n$ to $X_1, \dots, X_n$

Conditional probability (product rule): $P(X, Y) = P(X)\, P(Y \mid X)$

Bayes rule: $P(Y \mid X) = \frac{P(X \mid Y)\, P(Y)}{P(X)}$

Chain rule: $P(X_1, \dots, X_n) = P(X_1)\, P(X_2 \mid X_1) \cdots P(X_n \mid X_1, \dots, X_{n-1})$

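These identities are easy to sanity-check numerically. Below is a small Python check on a made-up joint over two binary variables; the numbers and helper names are ours, not the lecture's.

```python
# Sanity check of the product rule and Bayes rule on an arbitrary joint P(X, Y).
P = {("t", "t"): 0.2, ("t", "f"): 0.3,
     ("f", "t"): 0.4, ("f", "f"): 0.1}

p_x = {x: sum(P[(x, y)] for y in "tf") for x in "tf"}                  # P(X)
p_y = {y: sum(P[(x, y)] for x in "tf") for y in "tf"}                  # P(Y)
p_y_given_x = {(y, x): P[(x, y)] / p_x[x] for x in "tf" for y in "tf"}
p_x_given_y = {(x, y): P[(x, y)] / p_y[y] for x in "tf" for y in "tf"}

# Product rule: P(X, Y) = P(X) P(Y | X)
assert all(abs(P[(x, y)] - p_x[x] * p_y_given_x[(y, x)]) < 1e-12
           for x in "tf" for y in "tf")

# Bayes rule: P(Y | X) = P(X | Y) P(Y) / P(X)
assert all(abs(p_y_given_x[(y, x)] - p_x_given_y[(x, y)] * p_y[y] / p_x[x]) < 1e-12
           for x in "tf" for y in "tf")
print("product rule and Bayes rule hold")
```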

Page 6

Summary: conditional independence

$X$ is independent of $Y$ given $Z$ if $P(X = x \mid Y = y, Z = z) = P(X = x \mid Z = z)$ for all $x \in Val(X)$, $y \in Val(Y)$, $z \in Val(Z)$

Shorthand: $(X \perp Y \mid Z)$

For $(X \perp Y \mid \emptyset)$, write $X \perp Y$

Proposition: $(X \perp Y \mid Z)$ if and only if $P(X, Y \mid Z) = P(X \mid Z)\, P(Y \mid Z)$
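For the forward direction of the proposition, one step suffices: by the product rule, $P(X, Y \mid Z) = P(X \mid Z)\, P(Y \mid X, Z)$, and $(X \perp Y \mid Z)$ turns the last factor into $P(Y \mid Z)$.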


Page 7

Representation of Bayesian Networks

Consider $P(X_i)$

Assign a probability to each $x_i \in Val(X_i)$

Assume $|Val(X_i)| = k$; how many independent parameters?

$k - 1$

Consider $P(X_1, \dots, X_n)$

How many independent parameters if $|Val(X_i)| = k$ for each $i$?

$k^n - 1$

Bayesian Networks can represent the same joint probability with fewer parameters
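For example, with $n = 5$ binary variables ($k = 2$) the full table needs $2^5 - 1 = 31$ independent parameters, while the Flu network on a later slide represents a joint over five binary variables with only 10.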


Page 8

Simple Bayesian Networks

Page 9

What if variables are independent?

What if all variables are independent?

Is it enough to have $X_i \perp X_j$ for all $i, j$?

Not enough!!!

Must assume that $(\mathcal{X} \perp \mathcal{Y})$ for all disjoint subsets $\mathcal{X}, \mathcal{Y}$ of $\{X_1, \dots, X_n\}$

E.g., $(\{X_1, X_3\} \perp \{X_2, X_4\})$, $(X_1 \perp \{X_7, X_8, X_9\})$, …

Can write as

$P(X_1, \dots, X_n) = \prod_{i=1}^{n} P(X_i)$

Bayesian networks with no edges

How many independent parameters now?

$n(k - 1)$

[Figure: $n$ disconnected nodes $X_1, X_2, X_3, \dots, X_n$]

Page 10

Conditional parameterization – two nodes

Grade (G) is determined by intelligence (I)

$P(I)$:

    I      VH     H
           0.85   0.15

$P(G \mid I)$:

    G \ I   VH    H
    A       0.9   0.5
    B       0.1   0.5

$P(I = VH, G = B) = P(I = VH)\, P(G = B \mid I = VH) = 0.85 \times 0.1 = 0.085$

[Figure: $I \to G$]
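As a tiny sketch, the same computation with the CPTs stored as Python dictionaries (the numbers are exactly the tables above; the variable names are ours):

```python
p_i = {"VH": 0.85, "H": 0.15}                      # P(I)
p_g_given_i = {("A", "VH"): 0.9, ("A", "H"): 0.5,  # P(G | I)
               ("B", "VH"): 0.1, ("B", "H"): 0.5}

def joint(i, g):
    # P(I, G) = P(I) P(G | I)
    return p_i[i] * p_g_given_i[(g, i)]

print(joint("VH", "B"))                            # 0.85 * 0.1 = 0.085
```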

Page 11

Conditional parameterization – three nodes

Grade and SAT score are determined by intelligence

$(G \perp S \mid I) \Leftrightarrow P(S \mid G, I) = P(S \mid I)$

$P(I, G, S) = P(I)\, P(G \mid I)\, P(S \mid I)$, why?

Chain rule: $P(I, G, S) = P(I)\, P(G \mid I)\, P(S \mid G, I)$

Using the conditional independence, we get $P(I)\, P(G \mid I)\, P(S \mid I)$

[Figure: $I \to G$, $I \to S$, annotated with CPTs $P(I)$, $P(G \mid I)$, $P(S \mid I)$]

Page 12

The naïve Bayes model

Class variable: $C$

Evidence variables: $X_1, \dots, X_n$

Assume that $(\mathcal{X} \perp \mathcal{Y} \mid C)$ for all disjoint subsets $\mathcal{X}, \mathcal{Y}$ of $\{X_1, \dots, X_n\}$

$P(C, X_1, \dots, X_n) = P(C) \prod_i P(X_i \mid C)$, why?

Chain rule: $P(C, X_1, \dots, X_n) = P(C)\, P(X_1 \mid C)\, P(X_2 \mid C, X_1) \cdots P(X_n \mid C, X_1, \dots, X_{n-1})$

$P(X_2 \mid C, X_1) = P(X_2 \mid C)$ since $X_1 \perp X_2 \mid C$

$P(X_n \mid C, X_1, \dots, X_{n-1}) = P(X_n \mid C)$ since $X_n \perp \{X_1, \dots, X_{n-1}\} \mid C$

[Figure: naïve Bayes structure, $C$ with children $X_1, X_2, \dots, X_n$]
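A minimal sketch of this factorization in Python. Only the structure comes from the slide; the class names, the CPT numbers, and the two-feature setup are made up for illustration.

```python
p_c = {"spam": 0.4, "ham": 0.6}          # P(C), made-up numbers
p_x_given_c = [                          # P(X_i = t | C) for two binary features
    {"spam": 0.8, "ham": 0.1},
    {"spam": 0.3, "ham": 0.4},
]

def joint(c, xs):
    # P(C, X_1, ..., X_n) = P(C) * prod_i P(X_i | C)
    p = p_c[c]
    for cpt, x in zip(p_x_given_c, xs):
        p *= cpt[c] if x else 1 - cpt[c]
    return p

# Posterior over C by Bayes rule: P(C | x) = P(C, x) / sum_c P(c, x)
xs = (True, False)
z = sum(joint(c, xs) for c in p_c)
print({c: round(joint(c, xs) / z, 4) for c in p_c})  # {'spam': 0.8615, 'ham': 0.1385}
```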

Page 13

More Complicated Bayesian Networks and an Example

Page 14

Causal Structure

We will learn the semantics of Bayesian networks (BNs) and relate them to the independence assumptions encoded by the graph

Suppose we know the following:

The flu (F) causes sinus inflammation (S)

Allergies (A) cause sinus inflammation

Sinus inflammation causes a runny nose (N)

Sinus inflammation causes headaches (H)

How are these variables connected?

[Figure: BN with edges Flu → Sinus, Allergy → Sinus, Sinus → Nose, Sinus → Headache]

Page 15

Possible queries

Probabilistic inference

E.g., $P(A = t \mid H = t, N = f)$

Most probable explanation

$\max_{F, A, S} P(F, A, S \mid H = t, N = t)$

Active data collection

What is the next best variable to observe?

[Figure: the Flu network]

Page 16

Car starts BN

18 binary variables

$P(A, F, L, \dots, Spark)$

Inference: $P(BatteryAge \mid Starts = f)$ requires summing $P(A, F, L, \dots)$ over all the other, unobserved variables

Sum over $2^{16}$ terms; why is the BN so fast?

Use the sparse graph structure

For the HailFinder BN in JavaBayes: more than $3^{54} = 58149737003040059690390169$ terms

Page 17

Factored joint distribution (preview)

$P(F, A, S, H, N) = P(F)\, P(A)\, P(S \mid F, A)\, P(H \mid S)\, P(N \mid S)$

[Figure: the Flu network annotated with CPTs $P(F)$, $P(A)$, $P(S \mid F, A)$, $P(H \mid S)$, $P(N \mid S)$]
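A sketch of the factored joint in Python. The $P(S \mid F, A)$, $P(F)$, and $P(A)$ numbers are taken from later slides in this lecture; the $P(H \mid S)$ and $P(N \mid S)$ numbers are made up for illustration.

```python
from itertools import product

p_f = 0.1                                      # P(F = t), from the slides
p_a = 0.3                                      # P(A = t), from the slides
p_s = {(True, True): 0.9, (True, False): 0.7,  # P(S = t | F, A), from the slides
       (False, True): 0.8, (False, False): 0.2}
p_h = {True: 0.8, False: 0.1}                  # P(H = t | S): assumed numbers
p_n = {True: 0.7, False: 0.05}                 # P(N = t | S): assumed numbers

def bern(p, v):
    return p if v else 1 - p

def joint(f, a, s, h, n):
    # P(F, A, S, H, N) = P(F) P(A) P(S | F, A) P(H | S) P(N | S)
    return (bern(p_f, f) * bern(p_a, a) * bern(p_s[(f, a)], s)
            * bern(p_h[s], h) * bern(p_n[s], n))

# The 10 CPT entries define all 2^5 joint probabilities, and they sum to 1.
assert abs(sum(joint(*v) for v in product([True, False], repeat=5)) - 1) < 1e-12
print(joint(True, False, True, True, False))   # P(F=t, A=f, S=t, H=t, N=f)
```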

Page 18

Number of parameters

The Bayesian network (BN) has 10 parameters: 1 for $P(F)$, 1 for $P(A)$, 4 for $P(S \mid F, A)$, 2 for $P(H \mid S)$, and 2 for $P(N \mid S)$.

The full probability table $P(F, A, S, H, N)$ explicitly has $2^5 - 1 = 31$ parameters.

$P(S \mid F, A)$:

    F, A    tt    tf    ft    ff
    S = t   0.9   0.7   0.8   0.2
    S = f   0.1   0.3   0.2   0.8

[Figure: the Flu network annotated with the CPTs and their parameter counts]

Page 19

Key: Independence assumptions

$\neg(H \perp N)$, but $H \perp N \mid S$

$\neg(A \perp N)$, but $A \perp N \mid S$

Knowing Sinus separates the symptom variables from each other

[Figure: the Flu network]

Page 20

Marginal independence

Flu and Allergy are (marginally) independent

$F \perp A$

$P(F, A) = P(F)\, P(A)$

More generally: $(\mathcal{X} \perp \mathcal{Y})$ for all disjoint subsets $\mathcal{X} \subseteq \{X_1, \dots, X_n\}$, $\mathcal{Y} \subseteq \{X_1, \dots, X_n\}$

$P(X_1, \dots, X_n) = \prod_i P(X_i)$

$P(F)$: Flu = t: 0.1, Flu = f: 0.9

$P(A)$: Al = t: 0.3, Al = f: 0.7

$P(F, A)$:

             Flu = t             Flu = f
    Al = t   0.1 × 0.3 = 0.03    0.9 × 0.3 = 0.27
    Al = f   0.1 × 0.7 = 0.07    0.9 × 0.7 = 0.63

[Figure: the Flu network]

Page 21

Conditional independence

Flu and headache are not (marginally) independent

$\neg(F \perp H)$: $P(F \mid H) \neq P(F)$, $P(F, H) \neq P(F)\, P(H)$

Flu and headache are independent given Sinus infection

$F \perp H \mid S$: $P(F, H \mid S) = P(F \mid S)\, P(H \mid S)$, $P(F \mid H, S) = P(F \mid S)$

More generally, if $X_1, X_2, \dots, X_n$ are independent of each other given $C$:

$P(X_1, X_2, \dots, X_n \mid C) = P(X_1 \mid C)\, P(X_2, \dots, X_n \mid C) = \dots = \prod_i P(X_i \mid C)$

[Figure: the Flu network]

Page 22

The conditional independence assumption

Local Markov assumption: a variable $X$ is independent of its non-descendants given its parents, and only its parents:

$X \perp NonDescendants_X \mid Pa_X$

Flu: $Pa_{Flu} = \emptyset$, $NonDescendants_{Flu} = \{A\}$, so $F \perp A$

Nose: $Pa_{Nose} = \{S\}$, $NonDescendants_{Nose} = \{F, A, H\}$, so $N \perp \{F, A, H\} \mid S$

Sinus: $Pa_{Sinus} = \{F, A\}$, $NonDescendants_{Sinus} = \emptyset$, so no assumption of the form $S \perp \cdot \mid F, A$

[Figure: the Flu network]

Page 23

Explaining away

Local Markov assumption: a variable $X$ is independent of its non-descendants given its parents, and only its parents: $X \perp NonDescendants_X \mid Pa_X$

The local Markov assumption does not imply $F \perp A \mid S$:

XOR: $P(A = t) = 0.5$, $P(F = t) = 0.5$, $S = F \text{ XOR } A$

$S = t, A = t \Rightarrow F = f$, i.e., $P(F = t \mid S = t, A = t) = 0$

Another example: $P(F = t) = 0.2$, $P(F = t \mid S = t) = 0.5$, $P(F = t \mid S = t, A = t) = 0.3$

$P(F = t) \leq P(F = t \mid S = t, A = t) \leq P(F = t \mid S = t)$

Knowing $A = t$ lowers the probability of $F = t$ ($A = t$ explains away $F = t$)

This depends on $P(S \mid F, A)$; it could instead be that $P(F = t \mid S = t, A = t) \geq P(F = t \mid S = t)$

[Figure: the Flu network]
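The XOR case can be checked mechanically. A small sketch, using the uniform joint over $F$ and $A$ defined above (the helper names are ours):

```python
from itertools import product

# Uniform prior: P(F = f, A = a) = 0.25 for each of the four combinations,
# and S = F XOR A deterministically.
states = [(f, a) for f, a in product([True, False], repeat=2)]

def cond_f_true(evidence):
    # P(F = t | evidence), where evidence is a predicate on (f, a) pairs
    num = sum(0.25 for f, a in states if evidence(f, a) and f)
    den = sum(0.25 for f, a in states if evidence(f, a))
    return num / den

print(cond_f_true(lambda f, a: f != a))          # P(F=t | S=t)       -> 0.5
print(cond_f_true(lambda f, a: (f != a) and a))  # P(F=t | S=t, A=t)  -> 0.0
```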

Page 24

Chain rule and Local Markov Assumption

Pick a topological ordering of the nodes

Interpret the BN using the chain rule in that particular order

Then use the conditional independence assumptions:

$P(N, H, S, A, F) = P(F)\, P(A \mid F)\, P(S \mid F, A)\, P(H \mid S, F, A)\, P(N \mid S, F, A, H)$

$P(N, H, S, A, F) = P(F)\, P(A)\, P(S \mid F, A)\, P(H \mid S)\, P(N \mid S)$

using $A \perp F$, then $H \perp \{F, A\} \mid S$, then $N \perp \{F, A, H\} \mid S$

Why can we decompose the joint distribution this way?

[Figure: the Flu network with a topological numbering of its nodes]

Page 25

General Bayesian Networks

Page 26

A general Bayes net

Set of random variables $\{X_1, \dots, X_n\}$

Directed acyclic graph (DAG): undirected loops are OK, but no directed cycles

Local Markov assumption: a variable $X$ is independent of its non-descendants given its parents, and only its parents $(X \perp NonDescendants_X \mid Pa_X)$

Conditional probability tables (CPTs): $P(X_i \mid Pa_{X_i})$ for each $X_i$

Joint distribution: $P(X_1, \dots, X_n) = \prod_i P(X_i \mid Pa_{X_i})$

Page 27

Questions

What distributions can be represented by a Bayesian network?

What Bayesian networks can represent a given distribution?

What are the independence assumptions encoded by a BN, in addition to the local Markov assumption?

[Figure: chain $A \to B \to C \to D$]

Local Markov: $A \perp C \mid B$, $D \perp \{A, B\} \mid C$

Derived independence: $A \perp D \mid B$

Page 28

Conditional independence in the problem and in the BN

World, data, reality: the true distribution $P$ contains conditional independence assertions, $I(P)$.

Bayesian network: the graph $G$ encodes local independence assumptions, $I_\ell(G)$.

Key representational assumption: $I_\ell(G) \subseteq I(P)$

Page 29

The representation theorem – true conditional independence ⇒ BN factorization

The BN encodes local conditional independence assumptions $I_\ell(G)$.

If the local conditional independencies of the BN are a subset of the conditional independencies in $P$, i.e., $I_\ell(G) \subseteq I(P)$,

then we obtain that the joint probability $P$ can be written as

$P(X_1, \dots, X_n) = \prod_i P(X_i \mid Pa_{X_i})$

Page 30

The representation theorem – BN factorization ⇒ true conditional independence

The BN encodes local conditional independence assumptions $I_\ell(G)$.

If the joint probability $P$ can be written as

$P(X_1, \dots, X_n) = \prod_i P(X_i \mid Pa_{X_i})$,

then we obtain that the local conditional independencies of the BN are a subset of the conditional independencies in $P$: $I_\ell(G) \subseteq I(P)$.

Page 31

Example: naïve Bayes – true conditional independence ⇒ BN factorization

Independence assumptions: the $X_i$ are independent of each other given $C$.

That is, $(\mathcal{X} \perp \mathcal{Y} \mid C)$ for all disjoint subsets $\mathcal{X}, \mathcal{Y}$ of $\{X_1, \dots, X_n\}$: $(\{X_1, X_3\} \perp \{X_2, X_4\} \mid C)$, $(X_1 \perp \{X_7, X_8, X_9\} \mid C)$, …

This is the same as the local conditional independence.

Prove: $P(C, X_1, \dots, X_n) = P(C) \prod_i P(X_i \mid C)$

Use the chain rule and the local Markov property:

$P(C, X_1, \dots, X_n) = P(C)\, P(X_1 \mid C)\, P(X_2 \mid C, X_1) \cdots P(X_n \mid C, X_1, \dots, X_{n-1})$

$P(X_2 \mid C, X_1) = P(X_2 \mid C)$ since $X_1 \perp X_2 \mid C$

$P(X_n \mid C, X_1, \dots, X_{n-1}) = P(X_n \mid C)$ since $X_n \perp \{X_1, \dots, X_{n-1}\} \mid C$

[Figure: naïve Bayes structure, $C$ with children $X_1, X_2, \dots, X_n$]

Page 32

Example: naïve Bayes – BN factorization ⇒ true conditional independence

Assume $P(C, X_1, \dots, X_n) = P(C) \prod_i P(X_i \mid C)$.

Prove the independence assumptions: the $X_i$ are independent of each other given $C$, that is, $(\mathcal{X} \perp \mathcal{Y} \mid C)$ for all disjoint subsets $\mathcal{X}, \mathcal{Y}$ of $\{X_1, \dots, X_n\}$: $(\{X_1, X_3\} \perp \{X_2, X_4\} \mid C)$, $(X_1 \perp \{X_7, X_8, X_9\} \mid C)$, …

E.g., for $n = 4$:

$P(X_1, X_2 \mid C) = \frac{P(X_1, X_2, C)}{P(C)} = \frac{\sum_{x_3, x_4} P(X_1, X_2, x_3, x_4, C)}{P(C)}$

$= \frac{1}{P(C)} \sum_{x_3, x_4} P(C)\, P(X_1 \mid C)\, P(X_2 \mid C)\, P(x_3 \mid C)\, P(x_4 \mid C)$

$= P(X_1 \mid C)\, P(X_2 \mid C) \left( \sum_{x_3} P(x_3 \mid C) \right) \left( \sum_{x_4} P(x_4 \mid C) \right)$

$= P(X_1 \mid C)\, P(X_2 \mid C) \cdot 1 \cdot 1$

[Figure: naïve Bayes structure, $C$ with children $X_1, X_2, \dots, X_n$]

Page 33

How about the general case?

The BN encodes local conditional independence assumptions $I_\ell(G)$.

If $I_\ell(G) \subseteq I(P)$, i.e., the BN is an I-map of $P$, then the joint probability can be written as $P(X_1, \dots, X_n) = \prod_i P(X_i \mid Pa_{X_i})$: $P$ factorizes according to the BN.

Conversely, if $P(X_1, \dots, X_n) = \prod_i P(X_i \mid Pa_{X_i})$, then $I_\ell(G) \subseteq I(P)$.

Consequences: every $P$ has at least one BN structure $G$, and independencies of $P$ can be read from the BN structure $G$.

Page 34

Proof from I-map to factorization

Topological ordering of $X_1, X_2, \dots, X_n$: number the variables such that

a parent has a lower number than its children, i.e., $X_i \to X_j \Rightarrow i < j$

a variable has a lower number than all of its descendants

Use the chain rule:

$P(X_1, \dots, X_n) = P(X_1)\, P(X_2 \mid X_1) \cdots P(X_n \mid X_1, \dots, X_{n-1})$

For each factor $P(X_i \mid X_1, \dots, X_{i-1})$:

$Pa_{X_i} \subseteq \{X_1, \dots, X_{i-1}\}$, and there is no descendant of $X_i$ in $\{X_1, \dots, X_{i-1}\}$,

so by the local Markov assumption ($X \perp NonDescendants_X \mid Pa_X$): $P(X_i \mid X_1, \dots, X_{i-1}) = P(X_i \mid Pa_{X_i})$

[Figure: the Flu network with a topological numbering of its nodes]
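As a sketch, a topological numbering can be computed with Kahn's algorithm. Below it is run on the Flu network; the edge list comes from the slides, the code is ours.

```python
from collections import deque

# Kahn's algorithm on the Flu network: F -> S, A -> S, S -> H, S -> N.
edges = [("F", "S"), ("A", "S"), ("S", "H"), ("S", "N")]
nodes = {n for e in edges for n in e}
indeg = {n: 0 for n in nodes}
children = {n: [] for n in nodes}
for p, c in edges:
    children[p].append(c)
    indeg[c] += 1

queue = deque(sorted(n for n in nodes if indeg[n] == 0))
order = []
while queue:
    n = queue.popleft()            # n has no unprocessed parents left
    order.append(n)
    for c in children[n]:
        indeg[c] -= 1
        if indeg[c] == 0:
            queue.append(c)

print(order)  # ['A', 'F', 'S', 'H', 'N']: every parent precedes its children
```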

Page 35

Let $G$ be an I-map for $P$; then any DAG $G'$ that includes the same directed edges as $G$ is also an I-map of $P$.

$G'$ is strictly more expressive than $G$: if $G$ is an I-map for $P$, then adding edges still results in an I-map.

E.g., consider $P(N \mid S, A)$: if $P(N \mid S, A = t) = P(N \mid S, A = f)$,

then $P(N \mid S) = P(N \mid S, A)$,

which is exactly $N \perp A \mid S$.

Adding edges doesn't hurt.

[Figure: the Flu network]

Page 36

Minimal I-maps

$G$ is a minimal I-map for $P$ if deleting any edge from $G$ makes it no longer an I-map.

E.g., if $\neg(A \perp B)$ but $A \perp B \mid C$, a minimal I-map is $A \to C \to B$: if we remove an edge, it is no longer an I-map.

To obtain a minimal I-map given a set of variables and conditional independence assertions (see the sketch after this list):

Choose an ordering of the variables $X_1, \dots, X_n$

For $i = 1$ to $n$:

Add $X_i$ to the network

Define the parents of $X_i$, $Pa_{X_i}$, as the minimal subset of $\{X_1, \dots, X_{i-1}\}$ such that the local Markov assumption holds

Define/learn the CPT $P(X_i \mid Pa_{X_i})$
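A sketch of this procedure, assuming an independence oracle is_indep(x, ys, zs) that answers conditional-independence queries against the true distribution $P$; how the oracle is implemented (exact computation, statistical tests, ...) is outside this slide's scope, and the function names are ours.

```python
from itertools import combinations

def minimal_imap(order, is_indep):
    """order: variables X1..Xn; is_indep(x, ys, zs) tests x ⊥ ys | zs in P.
    Returns a parent set for each variable (a minimum-size choice)."""
    parents = {}
    for i, x in enumerate(order):
        prev = order[:i]
        parents[x] = set(prev)          # fallback: all predecessors
        done = False
        for k in range(len(prev) + 1):  # try smaller candidate sets first
            for pa in combinations(prev, k):
                rest = set(prev) - set(pa)
                # local Markov: x ⊥ (predecessors \ Pa) | Pa
                if not rest or is_indep(x, rest, set(pa)):
                    parents[x] = set(pa)
                    done = True
                    break
            if done:
                break
    return parents
```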

Page 37

Read conditional independence from BN structure G

Page 38

Conditional Independence encoded in BN

Local Markov assumption

$X \perp NonDescendants_X \mid Pa_X$

There are other conditional independencies, e.g., derived ones such as explaining away.

What other derived conditional independencies are there?

[Figure: V-structure $F \to S \leftarrow A$]: $A \perp F$, but $\neg(A \perp F \mid S)$

[Figure: chain $A \to B \to C \to D$]

Local Markov: $A \perp C \mid B$, $D \perp \{A, B\} \mid C$

Derived independence: $A \perp D \mid B$

Page 39

3-node cases

Causal effect (chain) $A \to S \to H$: $\neg(A \perp H)$, but $A \perp H \mid S$

Common cause $N \leftarrow S \to H$: $\neg(N \perp H)$, but $N \perp H \mid S$

Common effect (V-structure) $F \to S \leftarrow A$: $F \perp A$, but $\neg(F \perp A \mid S)$

Page 40

How about other derived relations?

$F \perp \{B, E, G, J\}$?  Yes.

$\{A, C, F, I\} \perp \{B, E, G, J\}$?  Yes.

$B \perp J \mid E$?  Yes.

$E \perp F \mid K$?  No.

$E \perp F \mid \{K, I\}$?  Yes.

$F \perp G \mid D$?  No.

$F \perp G \mid H$?  No.

$F \perp G \mid \{H, K\}$?  No.

$F \perp G \mid \{H, A\}$?  Yes.

[Figure: example DAG over nodes $A$ through $K$]

Page 41

Active trails

A trail $X_1 - X_2 - \cdots - X_k$ is an active trail when variables $O \subseteq \{X_1, \dots, X_n\}$ are observed if, for each consecutive triplet in the trail:

$X_{i-1} \to X_i \to X_{i+1}$ and $X_i$ is not observed ($X_i \notin O$), or

$X_{i-1} \leftarrow X_i \leftarrow X_{i+1}$ and $X_i$ is not observed ($X_i \notin O$), or

$X_{i-1} \leftarrow X_i \to X_{i+1}$ and $X_i$ is not observed ($X_i \notin O$), or

$X_{i-1} \to X_i \leftarrow X_{i+1}$ (V-structure) and $X_i$ or one of its descendants is observed.
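A sketch of this test in Python, by brute-force enumeration of trails (fine for graphs this small). The Flu network and the example queries come from earlier slides; the function names are ours, and the sketch assumes the two query nodes themselves are not observed.

```python
# Edges of the Flu network: F -> S, A -> S, S -> H, S -> N.
parents = {"F": set(), "A": set(), "S": {"F", "A"}, "H": {"S"}, "N": {"S"}}
children = {v: {c for c, ps in parents.items() if v in ps} for v in parents}

def descendants(v):
    # all nodes reachable from v along directed edges
    out, stack = set(), [v]
    while stack:
        for c in children[stack.pop()]:
            if c not in out:
                out.add(c)
                stack.append(c)
    return out

def trails(src, dst, path=None):
    # simple paths in the undirected skeleton
    path = path or [src]
    if src == dst:
        yield path
        return
    for nbr in (parents[src] | children[src]) - set(path):
        yield from trails(nbr, dst, path + [nbr])

def active(trail, observed):
    for a, b, c in zip(trail, trail[1:], trail[2:]):
        if a in parents[b] and c in parents[b]:
            # V-structure a -> b <- c: needs b or a descendant of b observed
            if not ({b} | descendants(b)) & observed:
                return False
        elif b in observed:
            # chain or common cause: blocked when b is observed
            return False
    return True

def d_separated(x, y, observed):
    return not any(active(t, observed) for t in trails(x, y))

print(d_separated("F", "A", set()))   # True:  F ⊥ A
print(d_separated("F", "A", {"S"}))   # False: explaining away, ¬(F ⊥ A | S)
print(d_separated("H", "N", {"S"}))   # True:  H ⊥ N | S
```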

Page 42

Active trails and conditional independence

Variables $X_i$ and $X_j$ are independent given $Z \subseteq \{X_1, \dots, X_n\}$ if there is no active trail between $X_i$ and $X_j$ when the variables in $Z$ are observed: $(X_i \perp X_j \mid Z)$.

We then say that $X_i$ and $X_j$ are d-separated given $Z$ (dependency separation).

E.g., $F \perp G$

[Figure: example DAG over nodes $A$ through $K$]

Page 43

Soundness of d-separation

Given a BN with structure $G$, the set of conditional independence assertions obtained by d-separation is

$I(G) = \{ (X \perp Y \mid Z) : \text{d-sep}_G(X; Y \mid Z) \}$

Soundness of d-separation: if $P$ factorizes over $G$, then $I(G) \subseteq I(P)$ (not only $I_\ell(G) \subseteq I(P)$).

Interpretation: d-separation only captures true conditional independencies.

For most $P$'s that factorize over $G$, $I(G) = I(P)$ ($G$ is then a P-map).

Page 44

Bayesian Networks are not enough

Page 45

Nonexistence of P-maps, example 1

XOR: $A = B \text{ XOR } C$

$A \perp B$, but $\neg(A \perp B \mid C)$

$A \perp C$, but $\neg(A \perp C \mid B)$

$B \perp C$, but $\neg(B \perp C \mid A)$

Minimal I-map: $A \to C \leftarrow B$, which encodes $A \perp B$.

Not a P-map: cannot read $B \perp C$ or $A \perp C$ from the graph.

Page 46

Nonexistence of P-maps, example 2

Swinging couples of variables $(X_1, Y_1)$ and $(X_2, Y_2)$:

$X_1 \perp Y_1$

$X_2 \perp Y_2$

$X_1 \perp X_2 \mid \{Y_1, Y_2\}$

$Y_1 \perp Y_2 \mid \{X_1, X_2\}$

No Bayesian network is a P-map for this set of independencies.

Need undirected graphical models!

[Figure: variables $X_1, X_2, Y_1, Y_2$]