conditional probability distributions eran segal weizmann institute
TRANSCRIPT
Conditional Probability Distributions
Eran Segal Weizmann Institute
Last Time Local Markov assumptions – basic BN independencies d-separation – all independencies via graph structure G is an I-Map of P if and only if P factorizes over G I-equivalence – graphs with identical independencies Minimal I-Map
All distributions have I-Maps (sometimes more than one) Minimal I-Map does not capture all independencies in P
Perfect Map – not every distribution P has one PDAGs
Compact representation of I-equivalence graphs Algorithm for finding PDAGs
CPDs Thus far we ignored the representation of CPDs Today we will cover the range of CPD
representations Discrete Continuous Sparse Deterministic Linear
Table CPDs Entry for each joint assignment of X and Pa(X) For each pax: Most general representation Represents every discrete CPD
Limitations Cannot model continuous RVs Number of parameters exponential in |Pa(X)| Cannot model large in-degree dependencies Ignores structure within the CPD
S
I s0 s1
i0 0.95
0.05
i1 0.2 0.8
I
i0 i1
0.7 0.3
P(S|I)P(I)
I
S
)(
1)|(XValx
xpaxP
Structured CPDs Key idea: reduce parameters by modeling P(X|
PaX) without explicitly modeling all entries of the joint
Lose expressive power (cannot represent every CPD)
Deterministic CPDs There is a function f: Val(PaX) Val(X) such
that
Examples OR, AND, NAND functions Z = Y+X (continuous variables)
otherwise0
)(1)|( xx
pafxpaxP
Deterministic CPDs
S
T1 T2 s0 s1
t0 t0 0.95 0.05
t0 t1 0.2 0.8
t1 t0 0.2 0.8
t1 t1 0.2 0.8
T1
S
T2 T1
T
T2
S
S
T s0 s1
t0 0.95
0.05
t1 0.2 0.8
T
T1 T2 t0 t1
t0 t0 1 0
t0 t1 0 1
t1 t0 0 1
t1 t1 0 1
Replace spurious dependencies with deterministic CPDs
Need to make sure that deterministic CPD is compactly stored
Deterministic CPDs Induce additional conditional independencies Example: T is any deterministic function of
T1,T2
T1
T
T2
S1
Ind(S1;S2 | T1,T2)
S2
Deterministic CPDs Induce additional conditional independencies Example: C is an XOR deterministic function of
A,B
D
C
A
E
BInd(D;E | B,C)
Deterministic CPDs Induce additional conditional independencies Example: T is an OR deterministic function of
T1,T2
T1
T
T2
S1
Ind(S1;S2 | T1=t1)
S2
Context specific independencies
Context Specific Independencies
Let X,Y,Z be pairwise disjoint RV sets Let C be a set of variables and cVal(C) X and Y are contextually independent given Z and c,
denoted (XcY | Z,c) if: 0),,(whenever),|(),,|( cPcPcP ZYZXZYX
Tree CPDs
D
A B C d0 d1
a0 b0 c0 0.2 0.8
a0 b0 c1 0.2 0.8
a0 b1 c0 0.2 0.8
a0 b1 c1 0.2 0.8
a1 b0 c0 0.9 0.1
a1 b0 c1 0.7 0.3
a1 b1 c0 0.4 0.6
A1 b1 C1 0.4 0.6
A
D
B C A
D
B C
A
B
C
(0.2,0.8)
(0.4,0.6)
(0.7,0.3)(0.9,0.1)
a0 a1
b1b0
c1c0
4 parameters8 parameters
Activator Repressor
Regulated gene
Activator Repressor
Regulated gene
Activator
Regulated gene
Repressor
State 1
Act
ivat
or
State 2
Act
ivat
or
Repressor
State 3
Gene Regulation: Simple Example
Regulated gene
DNA Microarray
Regulators
DNA Microarray
Regulators
truefalse
truefalse
Regulation Tree
Activator?
Repressor?
State 1 State 2 State 3
true Regulation
program
Module
genes
Activator expressio
n
Repressor expressio
n
Segal et al., Nature Genetics ’03
Respiration Module
Regulation
program
Module genes
Hap4+Msn4 known to regulate module genes
Module genes known targets of predicted regulators?Predicted regulator
Segal et al., Nature Genetics ’03
Rule CPDs A rule r is a pair (c;p) where c is an assignment
to a subset of variables C and p[0,1]. Let Scope[r]=C
A rule-based CPD P(X|Pa(X)) is a set of rules R s.t. For each rule rR Scope[r]{X}Pa(X) For each assignment (x,u) to {X}Pa(X) we have
exactly one rule (c;p)R such that c is compatible with (x,u).Then, we have P(X=x | Pa(X)=u) = p
Rule CPDs Example
Let X be a variable with Pa(X) = {A,B,C} r1: (a1, b1, x0; 0.1) r2: (a0, c1, x0; 0.2) r3: (b0, c0, x0; 0.3) r4: (a1, b0, c1, x0; 0.4) r5: (a0, b1, c0; 0.5) r6: (a1, b1, x1; 0.9) r7: (a0, c1, x1; 0.8) r8: (b0, c0, x1; 0.7) r9: (a1, b0, c1, x1; 0.6)
Note: each assignment maps to exactly one rule Rules cannot always be represented compactly
within tree CPDs
Tree CPDs and Rule CPDs Can represent every discrete function Can be easily learned and dealt with in
inference But, some functions are not represented
compactly XOR in tree CPDs: cannot split in one step on a0,b1
and a1,b0
Alternative representations exist Complex logical rules
Context Specific Independencies
A
D
B C
C
(0.2,0.8) (0.4,0.6)(0.7,0.3)(0.9,0.1)
a0 a1
b1b0c1c0
B
A
A
D
B C
A=a1 Ind(D,C | A=a1)
A
D
B C
A=a0 Ind(D,B | A=a0)
Reasoning by cases implies that Ind(B,C | A,D)
Independence of Causal Influence
X1
Y
X2 Xn... Causes: X1,…Xn
Effect: Y
General case: Y has a complex dependency on X1,…Xn
Common case Each Xi influences Y separately Influence of X1,…Xn is combined to an overall
influence on Y
Example 1: Noisy OR Two independent effects X1, X2
Y=y1 cannot happen unless one of X1, X2 occurs P(Y=y0 | X1=x1
0 , X2=x20) = P(X1=x1
0)P(X2=x20)
Y
X1 X2 y0 y1
x10 x2
0 1 0
x10 x2
1 0.2 0.8
x11 x2
0 0.1 0.9
x11 x2
1 0.02 0.98
X1
Y
X2
Noisy OR: Elaborate Representation
Y
X’1 X’2 y0 y1
x10 x2
0 1 0
x10 x2
1 0 1
x11 x2
0 0 1
x11 x2
1 0 1
X’1
Y
X’2
X1 X2
Deterministic OR
X’1
X1 x10 x1
1
x10 1 0
x11 0.1 0.9
Noisy CPD 1
X’2
X2 x20 x2
1
x20 1 0
x21 0.2 0.8
Noisy CPD 2Noise parameter X1
=0.9 Noise parameter X1=0.8
Noisy OR: Elaborate Representation
Decomposition results in the same distribution
),|( 122
111
0 xXxXyYP
02.0
08.09.002.09.008.01.012.01.0
)','|()|'()|'(
),,','|(),,'|'(),|'(
),|',',(
21
21
21
','21
01222
1111
','
122
11121
0122
11112
122
1111
','
122
11121
0
XX
XX
XX
XXyYPxXXPxXXP
xXxXXXyYPxXxXXXPxXxXXP
xXxXXXyYP
Noisy OR: General Case Y is a binary variable with k binary parents
X1,...Xn
CPD P(Y | X1,...Xn) is a noisy OR if there are k+1 noise parameters 0,1,...,n such that
)1()1(),...,|(1:
010
ixXi
k
ii
XXyYP
)1()1(1),...,|(1:
011
ixXi
k
ii
XXyYP
Noisy OR Independencies
X’1
Y
X’2
X1 X2
X’n
Xn
...
ij: Ind(Xi Xj | Y=yo)
Generalized Linear Models Model is a soft version of a linear threshold
function Example: logistic function
Binary variables X1,...Xn,Y
k
iiio
k
Xww
XXyYP
1
11
)1(exp1
1),...,|(
1
General Formulation Let Y be a random variable with parents X1,...Xn
The CPD P(Y | X1,...Xn) exhibits independence of causal influence (ICI) if it can be described by
The CPD P(Z | Z1,...Zn) is deterministic
Z1
Z
Z2
X1 X2
Y
Zn
Xn...
... Noisy OR
Zi has noise model
Z is an OR function
Y is the identity CPD
Logistic
Zi = wi1(Xi=1)
Z = Zi
Y = logit (Z)
General Formulation
Key advantage: O(n) parameters
As stated, not all that useful as any complex CPD can be represented through a complex deterministic CPD
Continuous Variables One solution: Discretize
Often requires too many value states Loses domain structure
Other solution: use continuous function for P(X|Pa(X)) Can combine continuous and discrete variables,
resulting in hybrid networks Inference and learning may become more difficult
Gaussian Density Functions Among the most common continuous
representations Univariate case: 2
2
12
2
1)(),(~)(
x
eXpNXP if
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
-4 -2 0 2 4
Gaussian Density Functions A multivariate Gaussian distribution over
X1,...Xn has Mean vector nxn positive definite covariance matrix
positive definite: Joint density function:
i=E[Xi] ii=Var[Xi] ij=Cov[Xi,Xj]=E[XiXj]-E[Xi]E[Xj] (ij)
0: xxx Tn
)()(
2
1exp
||)2(
1)( 1
2/12/
xxxp T
n
Gaussian Density Functions Marginal distributions are easy to compute
Independencies can be determined from parameters If X=X1,...Xn have a joint normal distribution N(;)
then Ind(Xi Xj) iff ij=0 Does not hold in general for non-Gaussian distributions
YYYX
XYXX
Y
XYX ;),(
NP
XXXX ;)( NP
Linear Gaussian CPDs Y is a continuous variable with parents X1,...Xn
Y has a linear Gaussian model if it can be described using parameters 0,...,n and 2 such that Vector notation:
Pros Simple Captures many interesting dependencies
Cons Fixed variance (variance cannot depend on parents
values)
);...(),...,|( 21101 nnn xxNxxYP
);()|( 20 xx TNYP
Linear Gaussian Bayesian Network
A linear Gaussian Bayesian network is a Bayesian network where All variables are continuous All of the CPDs are linear Gaussians
Key result: linear Gaussian models are equivalent to multivariate Gaussian density functions
Equivalence Theorem Y is a linear Gaussian of its parents X1,...Xn:
Assume that X1,...Xn are jointly Gaussian with N(;)
Then: The marginal distribution of Y is Gaussian with
N(Y;Y2)
The joint distribution over {X,Y} is Gaussian where
);()|( 20 xx TNYP
TY
TY
220 μ
n
j ijji YXCov1
];[
Linear Gaussian BNs define a joint Gaussian distribution
Converse Equivalence Theorem If {X,Y} have a joint Gaussian distribution then
Implications of equivalence Joint distribution has compact representation: O(n2) We can easily transform back and forth between
Gaussian distributions and linear Gaussian Bayesian networks
Representations may differ in parameters Example:
Gaussian distribution has full covariance matrix Linear Gaussian
);()|( 20 XX TNYP
XnX2X1 ...
Hybrid Models Models of continuous and discrete variables
Continuous variables with discrete parents Discrete variables with continuous parents
Conditional Linear Gaussians Y continuous variable X = {X1,...,Xn} continuous parents U = {U1,...,Um} discrete parents
A Conditional Linear Bayesian network is one where Discrete variables have only discrete parents Continuous variables have only CLG CPDs
21 ,0, ;),|(: uuu xxuUu
n
i iiaaNYP
Hybrid Models Continuous parents for discrete children
Threshold models
Linear sigmoid
otherwise05.0
109.0)|( 1 xxyYP
k
iiio
k
xww
xxyYP
1
11
)exp1
1),...,|(
Summary: CPD Models Deterministic functions Context specific dependencies Independence of causal influence
Noisy OR Logistic function
CPDs capture additional domain structure