Directed Graphical Models or Bayesian Networks
Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012
Le Song
Bayesian Networks
One of the most exciting recent advances in statistical AI
Compact representation for exponentially-large probability distributions
Fast marginalization algorithm
Exploit conditional independencies
Difference from undirected graphical models!
[Figure: DAG with edges Flu → Sinus, Allergy → Sinus, Sinus → Nose, Sinus → Headache]
Handwriting recognition
[Figure: chain of character variables X1 - X2 - X3 - X4 - X5]
Summary: basic concepts for R.V.
Outcome: assign x1, …, xn to X1, …, Xn
Conditional probability: P(X, Y) = P(X) P(Y|X)
Bayes rule: P(Y|X) = P(X|Y) P(Y) / P(X)
Chain rule:
P(X1, …, Xn) = P(X1) P(X2|X1) … P(Xn|X1, …, Xn−1)
Summary: conditional independence
𝑋 is independent of 𝑌 given 𝑍 if 𝑃(𝑋 = 𝑥|𝑌 = 𝑦, 𝑍 = 𝑧) = 𝑃(𝑋 = 𝑥| 𝑍 = 𝑧) ∀𝑥 ∈ 𝑉𝑎𝑙 𝑋 , 𝑦 ∈ 𝑉𝑎𝑙 𝑌 , 𝑧 ∈ 𝑉𝑎𝑙 𝑍
Shorthand:
(𝑋 ⊥ 𝑌 | 𝑍)
For (𝑋 ⊥ 𝑌| ∅), write 𝑋 ⊥ 𝑌
Proposition: (𝑋 ⊥ 𝑌 | 𝑍) if and only if 𝑃(𝑋, 𝑌|𝑍) = 𝑃(𝑋|𝑍)𝑃(𝑌|𝑍)
Representation of Bayesian Networks
Consider P(Xi)
Assign a probability to each xi ∈ Val(Xi)
Assume |Val(Xi)| = k; how many independent parameters?
k − 1
Consider P(X1, …, Xn)
How many independent parameters if |Val(Xi)| = k for every i?
k^n − 1
Bayesian Networks can represent the same joint probability with fewer parameters
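The two counts above can be checked with a short script (single variable: k − 1 free parameters; full joint table: k^n − 1):

```python
# Independent parameters needed to specify discrete distributions.
def single_var_params(k):
    # P(X): k outcomes, probabilities sum to 1 -> k - 1 free parameters
    return k - 1

def full_joint_params(n, k):
    # P(X1,...,Xn) as one big table: k^n outcomes, minus the sum-to-1 constraint
    return k**n - 1

print(single_var_params(2))     # 1
print(full_joint_params(5, 2))  # 31 -- grows exponentially in n
```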
Simple Bayesian Networks
What if variables are independent?
What if all variables are independent?
Is it enough to have Xi ⊥ Xj, ∀i ≠ j?
Not enough!!!
Must assume that 𝒳 ⊥ 𝒴 for all disjoint subsets 𝒳, 𝒴 of {X1, …, Xn}
E.g., {X1, X3} ⊥ {X2, X4}, X1 ⊥ {X7, X8, X9}, …
Can write as
P(X1, …, Xn) = ∏_{i=1…n} P(Xi)
Bayesian networks with no edges
How many independent parameters now?
n(k − 1)
[Figure: BN with no edges over X1, X2, X3, …, Xn]
Conditional parameterization – two nodes
Grade (G) is determined by intelligence (I)
P(I):
  I     VH    H
        0.85  0.15

P(G|I):
  G \ I  VH    H
  A      0.9   0.5
  B      0.1   0.5

P(I = VH, G = B) = P(I = VH) P(G = B | I = VH) = 0.85 × 0.1 = 0.085
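A minimal sketch of this two-node model; `P_I` and `P_G_given_I` are just dictionary encodings of the two tables above:

```python
# Joint P(I, G) = P(I) P(G|I) from the tables on this slide.
P_I = {"VH": 0.85, "H": 0.15}
P_G_given_I = {("A", "VH"): 0.9, ("B", "VH"): 0.1,
               ("A", "H"): 0.5,  ("B", "H"): 0.5}

def joint(i, g):
    return P_I[i] * P_G_given_I[(g, i)]

print(round(joint("VH", "B"), 6))  # 0.085, matching the slide
```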
Conditional parameterization – three nodes
Grade and SAT score are determined by intelligence
(G ⊥ S | I) ⇔ P(S | G, I) = P(S | I)
P(I, G, S) = P(I) P(G|I) P(S|I), why?
Chain rule: P(I, G, S) = P(I) P(G|I) P(S|G, I)
Using conditional independence, we get P(I) P(G|I) P(S|I)
[Figure: DAG I → G, I → S, annotated with CPDs P(I), P(G|I), P(S|I)]
The naïve Bayes model
Class variable: 𝐶
Evidence variables: 𝑋1, …𝑋𝑛
Assume that (𝒳 ⊥ 𝒴 | C) for all disjoint subsets 𝒳, 𝒴 of {X1, …, Xn}
P(C, X1, …, Xn) = P(C) ∏_i P(Xi | C), why?
Chain rule: P(C, X1, …, Xn) = P(C) P(X1|C) P(X2|C, X1) … P(Xn|C, X1, …, Xn−1)
[Figure: naive Bayes DAG, C → X1, C → X2, …, C → Xn]
P(X2 | C, X1) = P(X2 | C) since X1 ⊥ X2 | C
P(Xn | C, X1, …, Xn−1) = P(Xn | C) since Xn ⊥ {X1, …, Xn−1} | C
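The naive Bayes factorization can be sketched directly; the CPT numbers below are made up for illustration (the slide does not give any):

```python
import itertools

# Naive Bayes joint: P(C, X1,...,Xn) = P(C) * prod_i P(Xi | C).
p_c = {True: 0.4, False: 0.6}    # P(C), invented numbers
p_x_given_c = [                  # P(Xi = True | C), one dict per Xi, invented
    {True: 0.9, False: 0.2},
    {True: 0.7, False: 0.5},
    {True: 0.1, False: 0.8},
]

def joint(c, xs):
    p = p_c[c]
    for cpt, x in zip(p_x_given_c, xs):
        p_true = cpt[c]
        p *= p_true if x else 1.0 - p_true
    return p

# Sanity check: the factored joint sums to 1 over all assignments.
total = sum(joint(c, xs) for c in (True, False)
            for xs in itertools.product((True, False), repeat=3))
print(round(total, 10))  # 1.0
```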
More Complicated Bayesian Networks and an Example
Causal Structure
We will learn the semantics of Bayesian Networks (BNs), relate them to independence assumptions encoded by the graph
Suppose we know the following:
The flu (F) causes sinus inflammation (S)
Allergies (A) cause sinus inflammation
Sinus inflammation causes a runny nose (N)
Sinus inflammation causes headaches (H)
How are these variables connected?
[Figure: DAG with edges Flu → Sinus, Allergy → Sinus, Sinus → Nose, Sinus → Headache]
Possible queries
Probabilistic Inference
E.g., P(A = t | H = t, N = f)
Most probable explanation
max over F, A, S of P(F, A, S | H = t, N = t)
Active data collection
What is the next best variable to observe?
[Figure: DAG with edges Flu → Sinus, Allergy → Sinus, Sinus → Nose, Sinus → Headache]
Car starts BN
18 binary variables
P(A, F, L, …, Spark)
Inference
P(BatteryAge | Starts = f)
= Σ over the 16 unobserved variables of P(A, F, L, …)
Sum over 2^16 terms; why is the BN so fast?
Use the sparse graph structure
For the HailFinder BN in JavaBayes:
more than 3^54 = 58,149,737,003,040,059,690,390,169 terms
Factored joint distribution--preview
P(F, A, S, H, N) = P(F) P(A) P(S|F, A) P(H|S) P(N|S)
[Figure: flu DAG annotated with CPDs P(F), P(A), P(S|F, A), P(N|S), P(H|S)]
The Bayesian network (BN) needs only 10 parameters
The full probability table P(F, A, S, H, N) has 2^5 − 1 = 31 independent parameters
Number of parameters
[Figure: flu DAG annotated with per-node parameter counts: P(F): 1, P(A): 1, P(S|F, A): 4, P(N|S): 2, P(H|S): 2]

P(S | F, A):
  S \ (F, A)   tt    tf    ft    ff
  t            0.9   0.7   0.8   0.2
  f            0.1   0.3   0.2   0.8
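A sketch of the factored joint for the flu network. P(S|F, A) is the table above and the Flu/Allergy priors (0.1 and 0.3) are taken from the marginal-independence slide of this lecture; P(H|S) and P(N|S) are invented numbers, since the slides do not give them:

```python
import itertools

# Joint P(F,A,S,H,N) = P(F) P(A) P(S|F,A) P(H|S) P(N|S) for the flu network.
p_f = {True: 0.1, False: 0.9}                   # P(F), from the lecture
p_a = {True: 0.3, False: 0.7}                   # P(A), from the lecture
p_s = {(True, True): 0.9, (True, False): 0.7,   # P(S=t | F, A), table above
       (False, True): 0.8, (False, False): 0.2}
p_h = {True: 0.8, False: 0.1}                   # P(H=t | S), made up
p_n = {True: 0.7, False: 0.05}                  # P(N=t | S), made up

def bern(p_true, val):
    return p_true if val else 1.0 - p_true

def joint(f, a, s, h, n):
    return (bern(p_f[True], f) * bern(p_a[True], a) *
            bern(p_s[(f, a)], s) * bern(p_h[s], h) * bern(p_n[s], n))

total = sum(joint(*vals) for vals in itertools.product((True, False), repeat=5))
print(round(total, 10))  # 1.0 -- 10 CPT parameters recover a 31-parameter table
```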
Key: Independence assumptions
¬(H ⊥ N)
H ⊥ N | S
¬(A ⊥ N)
A ⊥ N | S
Knowing sinus separates the symptom variables from each other
[Figure: DAG with edges Flu → Sinus, Allergy → Sinus, Sinus → Nose, Sinus → Headache]
Marginal independence
Flu and Allergy are (marginally) independent
𝐹 ⊥ 𝐴
𝑃 𝐹, 𝐴 = 𝑃 𝐹 𝑃 𝐴
More generally:
For all disjoint subsets 𝒳, 𝒴 of {X1, …, Xn}: 𝒳 ⊥ 𝒴
P(X1, …, Xn) = ∏_i P(Xi)
P(Flu):  Flu = t: 0.1,  Flu = f: 0.9
P(Al):   Al = t: 0.3,   Al = f: 0.7

P(Flu, Al) = P(Flu) P(Al):
            Flu = t            Flu = f
  Al = t    0.1 × 0.3 = 0.03   0.9 × 0.3 = 0.27
  Al = f    0.1 × 0.7 = 0.07   0.9 × 0.7 = 0.63
[Figure: DAG with edges Flu → Sinus, Allergy → Sinus, Sinus → Nose, Sinus → Headache]
Conditional independence
Flu and headache are not (marginally) independent
¬(F ⊥ H): P(F|H) ≠ P(F), P(F, H) ≠ P(F) P(H)
Flu and headache are independent given sinus infection
F ⊥ H | S: P(F, H | S) = P(F|S) P(H|S)
P(F | H, S) = P(F | S)
More generally:
X1, X2, …, Xn are independent of each other given C
P(X1, X2, …, Xn | C) = P(X1 | C) P(X2, …, Xn | C)
P(X1, X2, …, Xn | C) = ∏_i P(Xi | C)
[Figure: DAG with edges Flu → Sinus, Allergy → Sinus, Sinus → Nose, Sinus → Headache]
The conditional independence assumption
Local Markov Assumption: a variable X is independent of its non-descendants given its parents, and only its parents
𝑋 ⊥ 𝑁𝑜𝑛𝐷𝑒𝑠𝑐𝑒𝑛𝑑𝑎𝑛𝑡𝑠𝑋 | 𝑃𝑎𝑋
Flu: Pa_Flu = ∅, NonDescendants_Flu = {A}
F ⊥ A
Nose: Pa_Nose = {S}, NonDescendants_Nose = {F, A, H}
N ⊥ {F, A, H} | S
Sinus: Pa_Sinus = {F, A}, NonDescendants_Sinus = ∅
The local Markov assumption makes no claim of the form S ⊥ ⋯ | F, A
[Figure: DAG with edges Flu → Sinus, Allergy → Sinus, Sinus → Nose, Sinus → Headache]
Explaining away
The local Markov assumption does not imply F ⊥ A | S
XOR example: P(A = t) = 0.5, P(F = t) = 0.5, S = F XOR A
S = t, A = t ⇒ F = f, i.e., P(F = t | S = t, A = t) = 0
Another example: P(F = t) = 0.2, P(F = t | S = t) = 0.5, P(F = t | S = t, A = t) = 0.3
P(F = t) ≤ P(F = t | S = t, A = t) ≤ P(F = t | S = t)
Knowing A = t lowers the probability of F = t (A = t explains away F = t)
This depends on P(S | F, A); it could instead be that P(F = t | S = t, A = t) ≥ P(F = t | S = t)
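The deterministic XOR example on this slide can be verified numerically; `joint` encodes P(F) P(A) 1[S = F xor A]:

```python
# Explaining away in the XOR example: P(A=t) = P(F=t) = 0.5, S = F XOR A.
def joint(f, a, s):
    # P(F) P(A) times the indicator that S equals F xor A
    return 0.25 if s == (f != a) else 0.0

def cond(f_val, s_val, a_val):
    # P(F = f_val | S = s_val, A = a_val), by summing the joint
    num = joint(f_val, a_val, s_val)
    den = sum(joint(f, a_val, s_val) for f in (True, False))
    return num / den

print(cond(True, True, True))   # 0.0 -- observing A=t fully explains away F=t
print(cond(False, True, True))  # 1.0
```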
[Figure: DAG with edges Flu → Sinus, Allergy → Sinus, Sinus → Nose, Sinus → Headache]
Chain rule and Local Markov Assumption
Pick a topological ordering of nodes
Interpret the BN using a particular chain-rule order
Use the conditional independence assumption
P(N, H, S, A, F) = P(F) P(A|F) P(S|F, A) P(H|S, F, A) P(N|S, F, A, H)
P(N, H, S, A, F) = P(F) P(A) P(S|F, A) P(H|S) P(N|S)
Why can we decompose the joint distribution this way?
[Figure: flu DAG with topological ordering 1: Flu, 2: Allergy, 3: Sinus, 4: Headache, 5: Nose; annotations: P(A) since A ⊥ F, P(S|F, A), P(H|S) since H ⊥ {F, A} | S, P(N|S) since N ⊥ {F, A, H} | S]
General Bayesian Networks
Set of random variables 𝑋1, … , 𝑋𝑛
Directed acyclic graph (DAG)
Loops (cycles in the underlying undirected graph) are OK
But no directed cycles
Local Markov Assumptions
A variable 𝑋 is independent of its non-descendants given its parents and only its parents (𝑋 ⊥ 𝑁𝑜𝑛𝐷𝑒𝑠𝑐𝑒𝑛𝑑𝑎𝑛𝑡𝑠𝑋 | 𝑃𝑎𝑋)
Conditional probability tables (CPTs): P(Xi | Pa_Xi) for each Xi
Joint distribution: P(X1, …, Xn) = ∏_i P(Xi | Pa_Xi)
A general Bayes net
What distributions can be represented by a Bayesian Networks?
What Bayesian Networks can represent a distribution?
What are the independence assumptions encoded by a BN?
In addition to the local Markov assumption
Question?
[Figure: chain A → B → C → D]
Local Markov: A ⊥ C | B, D ⊥ {A, B} | C
Derived independence: A ⊥ D | B
Conditional Independence in Problem
World, data, reality: the true distribution P contains conditional independence assertions I(P)
Bayesian network: the graph G encodes local independence assumptions I_l(G)
Key representational assumption: I_l(G) ⊆ I(P)
The representation theorem – True conditional independence => BN factorization
BN encodes local conditional independence assumptions 𝐼𝑙 𝐺
BN encodes local conditional independence assumptions I_l(G)
If the local conditional independencies of the BN are a subset of the conditional independencies in P, i.e., I_l(G) ⊆ I(P),
then the joint probability P can be written as
P(X1, …, Xn) = ∏_i P(Xi | Pa_Xi)
The representation theorem – BN factorization => True conditional independence
BN encodes local conditional independence assumptions 𝐼𝑙 𝐺
BN encodes local conditional independence assumptions I_l(G)
If the joint probability P can be written as
P(X1, …, Xn) = ∏_i P(Xi | Pa_Xi),
then the local conditional independencies of the BN are a subset of the conditional independencies in P, i.e., I_l(G) ⊆ I(P)
Example: naïve Bayes – True conditional independence => BN Factorization
Independence assumptions:
𝑋𝑖 are independent of each other given 𝐶
That is, (𝒳 ⊥ 𝒴 | C) for all disjoint subsets 𝒳, 𝒴 of {X1, …, Xn}
E.g., {X1, X3} ⊥ {X2, X4} | C, X1 ⊥ {X7, X8, X9} | C, …
Same as the local conditional independence assumptions
Prove: P(C, X1, …, Xn) = P(C) ∏_i P(Xi | C)
Use the chain rule and the local Markov property: P(C, X1, …, Xn) = P(C) P(X1|C) P(X2|C, X1) … P(Xn|C, X1, …, Xn−1)
[Figure: naive Bayes DAG, C → X1, C → X2, …, C → Xn]
P(X2 | C, X1) = P(X2 | C) since X1 ⊥ X2 | C
P(Xn | C, X1, …, Xn−1) = P(Xn | C) since Xn ⊥ {X1, …, Xn−1} | C
Example: naïve Bayes – BN Factorization => True conditional independence
Assume P(C, X1, …, Xn) = P(C) ∏_i P(Xi | C)
Prove the independence assumptions:
Xi are independent of each other given C
That is, (𝒳 ⊥ 𝒴 | C) for all disjoint subsets 𝒳, 𝒴 of {X1, …, Xn}
E.g., {X1, X3} ⊥ {X2, X4} | C, X1 ⊥ {X7, X8, X9} | C, …
E.g., n = 4:
P(X1, X2 | C) = P(X1, X2, C) / P(C)
= [Σ_{x3, x4} P(X1, X2, x3, x4, C)] / P(C)
= (1 / P(C)) Σ_{x3, x4} P(C) P(X1|C) P(X2|C) P(x3|C) P(x4|C)
= P(X1|C) P(X2|C) [Σ_{x3} P(x3|C)] [Σ_{x4} P(x4|C)]
= P(X1|C) P(X2|C) ⋅ 1 ⋅ 1
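The n = 4 derivation above can be checked numerically; all CPT numbers below are invented for illustration:

```python
import itertools

# If P(C, X1..X4) = P(C) * prod_i P(Xi|C), then P(X1, X2 | C) = P(X1|C) P(X2|C).
p_c = 0.3                       # P(C = t), invented
p_x = [0.9, 0.6, 0.2, 0.7]      # P(Xi = t | C = t), invented
q_x = [0.4, 0.1, 0.5, 0.8]      # P(Xi = t | C = f), invented

def joint(c, xs):
    p = p_c if c else 1 - p_c
    for pt, qt, x in zip(p_x, q_x, xs):
        t = pt if c else qt
        p *= t if x else 1 - t
    return p

c = True
# P(X1=t, X2=t | C=t): marginalize x3, x4 out of the joint, divide by P(C=t)
num = sum(joint(c, (True, True, x3, x4))
          for x3, x4 in itertools.product((True, False), repeat=2))
lhs = num / p_c
rhs = p_x[0] * p_x[1]
print(abs(lhs - rhs) < 1e-12)  # True
```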
[Figure: naive Bayes DAG, C → X1, C → X2, …, C → Xn]
How about the general case?
BN encodes local conditional independence assumptions I_l(G)
If I_l(G) ⊆ I(P), i.e., this BN is an I-map of P,
then the joint probability P can be written as P(X1, …, Xn) = ∏_i P(Xi | Pa_Xi), i.e., P factorizes according to the BN
Conversely, if P factorizes this way, then I_l(G) ⊆ I(P)
Every P has at least one BN structure G
Read independencies of P from the BN structure G
Proof from I-map to factorization
Topological ordering of X1, X2, …, Xn:
Number the variables such that
a parent has a lower number than its children, i.e., Xi → Xj ⇒ i < j
a variable has a lower number than all its descendants
Use the chain rule:
P(X1, …, Xn) = P(X1) P(X2|X1) … P(Xn|X1, …, Xn−1)
For each factor P(Xi | X1, …, Xi−1):
Pa_Xi ⊆ {X1, …, Xi−1},
and there is no descendant of Xi in {X1, …, Xi−1}
so by the local Markov assumption, P(Xi | X1, …, Xi−1) = P(Xi | Pa_Xi)
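The topological ordering used in this proof can be computed with Kahn's algorithm; this is a generic sketch, tested on the flu network:

```python
from collections import deque

# Topological ordering of a DAG (Kahn's algorithm): every parent gets a
# lower position than its children, as the proof above requires.
def topological_order(nodes, edges):
    children = {v: [] for v in nodes}
    indegree = {v: 0 for v in nodes}
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1
    queue = deque(v for v in nodes if indegree[v] == 0)
    order = []
    while queue:
        v = queue.popleft()
        order.append(v)
        for c in children[v]:
            indegree[c] -= 1
            if indegree[c] == 0:
                queue.append(c)
    if len(order) != len(nodes):
        raise ValueError("graph has a directed cycle")
    return order

# Flu network: F -> S <- A, S -> H, S -> N
edges = [("F", "S"), ("A", "S"), ("S", "H"), ("S", "N")]
order = topological_order(["F", "A", "S", "H", "N"], edges)
pos = {v: i for i, v in enumerate(order)}
print(order)  # ['F', 'A', 'S', 'H', 'N'] -- parents always precede children
```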
[Figure: flu DAG with topological ordering 1: Flu, 2: Allergy, 3: Sinus, 4: Headache, 5: Nose]
Local Markov Assumption: X ⊥ NonDescendants_X | Pa_X
Let G be an I-map for P; then any DAG G′ that includes all the directed edges of G is also an I-map of P
G′ is strictly more expressive than G
If G is an I-map for P, then adding edges still results in an I-map
E.g., with an extra edge A → N, the CPT P(N | S, A) can still represent P:
if P(N | S, A = t) = P(N | S, A = f)
⇒ P(N | S) = P(N | S, A)
⇔ N ⊥ A | S
Adding edges doesn't hurt
[Figure: DAG with edges Flu → Sinus, Allergy → Sinus, Sinus → Nose, Sinus → Headache]
Minimal I-maps
G is a minimal I-map for P if deleting any edge from G makes it no longer an I-map
E.g., ¬(A ⊥ B), A ⊥ B | C: a minimal I-map is A → C → B; if we remove an edge, it is no longer an I-map
To obtain a minimal I-map given a set of variables and conditional independence assertions:
Choose an ordering of the variables X1, …, Xn
For i = 1 to n
Add Xi to the network
Define the parents of Xi, Pa_Xi, as the minimal subset of {X1, …, Xi−1} such that the local Markov assumption holds
Define/learn the CPT P(Xi | Pa_Xi)
Read conditional independence from BN structure G
Conditional Independence encoded in BN
Local Markov assumption
𝑋 ⊥ 𝑁𝑜𝑛𝐷𝑒𝑠𝑐𝑒𝑛𝑑𝑎𝑛𝑡𝑠𝑋 | 𝑃𝑎𝑋
There are other conditional independencies
E.g., explaining away, derived independencies
What other derived conditional independencies are there?
[Figure: V-structure Allergy → Sinus ← Flu]
A ⊥ F, but ¬(A ⊥ F | S)
[Figure: chain A → B → C → D]
Local Markov: A ⊥ C | B, D ⊥ {A, B} | C
Derived independence: A ⊥ D | B
3-node cases
Causal effect (chain): Allergy → Sinus → Headache; ¬(A ⊥ H), but A ⊥ H | S
Common cause: Nose ← Sinus → Headache; ¬(N ⊥ H), but N ⊥ H | S
Common effect (V-structure): Allergy → Sinus ← Flu; A ⊥ F, but ¬(A ⊥ F | S)
How about other derived relations (in the example DAG below)?
? F ⊥ {B, E, G, J}:  yes, F ⊥ {B, E, G, J}
? {A, C, F, I} ⊥ {B, E, G, J}:  yes, {A, C, F, I} ⊥ {B, E, G, J}
? B ⊥ J | E:  yes, B ⊥ J | E
? E ⊥ F | K:  no, ¬(E ⊥ F | K)
? E ⊥ F | {K, I}:  yes, E ⊥ F | {K, I}
? F ⊥ G | D:  no, ¬(F ⊥ G | D)
? F ⊥ G | H:  no, ¬(F ⊥ G | H)
? F ⊥ G | {H, K}:  no, ¬(F ⊥ G | {H, K})
? F ⊥ G | {H, A}:  yes, F ⊥ G | {H, A}
[Figure: example DAG over nodes A, B, C, D, E, F, G, H, I, J, K; edges not recoverable from the transcript]
Active trails
A trail X1 - X2 - ⋯ - Xk is an active trail when variables O ⊆ {X1, …, Xn} are observed if, for each consecutive triplet in the trail:
Xi−1 → Xi → Xi+1 and Xi is not observed (Xi ∉ O)
Xi−1 ← Xi ← Xi+1 and Xi is not observed (Xi ∉ O)
Xi−1 ← Xi → Xi+1 and Xi is not observed (Xi ∉ O)
Xi−1 → Xi ← Xi+1 and Xi or one of its descendants is observed (V-structure)
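The four rules above can be transcribed directly into a brute-force checker, suitable only for small graphs: enumerate every simple trail between two nodes and test each consecutive triplet. The flu network serves as the example:

```python
# Brute-force active-trail check: two nodes are independent given the observed
# set iff no trail between them is active under the four triplet rules.
def descendants(node, edges):
    out, stack = set(), [node]
    while stack:
        v = stack.pop()
        for p, c in edges:
            if p == v and c not in out:
                out.add(c)
                stack.append(c)
    return out

def is_active(trail, edges, observed, edge_set):
    for i in range(1, len(trail) - 1):
        a, b, c = trail[i - 1], trail[i], trail[i + 1]
        if (a, b) in edge_set and (b, c) in edge_set:    # a -> b -> c
            ok = b not in observed
        elif (b, a) in edge_set and (c, b) in edge_set:  # a <- b <- c
            ok = b not in observed
        elif (b, a) in edge_set and (b, c) in edge_set:  # a <- b -> c
            ok = b not in observed
        else:                                            # a -> b <- c, V-structure
            ok = b in observed or bool(descendants(b, edges) & observed)
        if not ok:
            return False
    return True

def d_separated(x, y, observed, nodes, edges):
    edge_set = set(edges)
    neighbors = {v: set() for v in nodes}
    for p, c in edges:
        neighbors[p].add(c)
        neighbors[c].add(p)
    stack = [[x]]                       # depth-first enumeration of simple trails
    while stack:
        trail = stack.pop()
        if trail[-1] == y:
            if is_active(trail, edges, observed, edge_set):
                return False
            continue
        for nxt in neighbors[trail[-1]]:
            if nxt not in trail:
                stack.append(trail + [nxt])
    return True

nodes = ["F", "A", "S", "H", "N"]
edges = [("F", "S"), ("A", "S"), ("S", "H"), ("S", "N")]
print(d_separated("H", "N", {"S"}, nodes, edges))  # True: H ⊥ N | S
print(d_separated("F", "A", set(), nodes, edges))  # True: F ⊥ A
print(d_separated("F", "A", {"S"}, nodes, edges))  # False: explaining away
```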
Active trails and conditional independence
Variables Xi and Xj are independent given Z ⊆ {X1, …, Xn} if there is no active trail between Xi and Xj when the variables in Z are observed: (Xi ⊥ Xj | Z)
We say that Xi and Xj are d-separated given Z (dependency separation)
E.g., F ⊥ G
[Figure: example DAG over nodes A, B, C, D, E, F, G, H, I, J, K]
Soundness of d-separation
Given a BN with structure G
The set of conditional independence assertions obtained by d-separation:
I(G) = {X ⊥ Y | Z : d-sep_G(X; Y | Z)}
Soundness of d-separation:
if P factorizes over G, then I(G) ⊆ I(P) (not only I_l(G) ⊆ I(P))
Interpretation: d-separation only captures true conditional independencies
For most P's that factorize over G, I(G) = I(P) (such a G is a P-map)
Bayesian Networks are not enough
Inexistence of P-maps, example 1
XOR: A = B XOR C
A ⊥ B, but ¬(A ⊥ B | C)
B ⊥ C, but ¬(B ⊥ C | A)
C ⊥ A, but ¬(C ⊥ A | B)
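A numeric check of the XOR independencies above, with B and C as independent fair bits and A = B xor C:

```python
import itertools

# XOR distribution: B, C fair and independent, A = B xor C.
# A is pairwise independent of B, but becomes a function of B once C is known.
def joint(a, b, c):
    return 0.25 if a == (b != c) else 0.0

def marg(**fixed):
    # Marginal probability of the fixed assignment, summing out the rest.
    vals = (True, False)
    return sum(joint(a, b, c) for a in vals for b in vals for c in vals
               if all({"a": a, "b": b, "c": c}[k] == v for k, v in fixed.items()))

# Marginal independence: P(A=t, B=t) == P(A=t) P(B=t)
print(marg(a=True, b=True) == marg(a=True) * marg(b=True))  # True
# But given C=t, A is determined by B: P(A=t | B=t, C=t) = 0
p_given = joint(True, True, True) / marg(b=True, c=True)
print(p_given)  # 0.0
```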
[Figure: minimal I-map A → C ← B]
Minimal I-map: encodes A ⊥ B
Not a P-map: cannot read B ⊥ C or A ⊥ C from the graph
Inexistence of P-maps, example 2
Swinging couples of variables 𝑋1, 𝑌1 and 𝑋2, 𝑌2
X1 ⊥ Y1
X2 ⊥ Y2
X1 ⊥ X2 | {Y1, Y2}
Y1 ⊥ Y2 | {X1, X2}
No Bayesian network is a P-map for this set of independencies
Need undirected graphical models!
[Figure: four variables X1, X2, Y1, Y2 arranged in a square]