conditional probability distributions eran segal weizmann institute

Conditional Probability Distributions

Eran Segal Weizmann Institute

Last Time
Local Markov assumptions: basic BN independencies
d-separation: all independencies via graph structure
G is an I-Map of P if and only if P factorizes over G
I-equivalence: graphs with identical independencies
Minimal I-Map
  All distributions have I-Maps (sometimes more than one)
  Minimal I-Map does not capture all independencies in P
Perfect Map: not every distribution P has one
PDAGs
  Compact representation of I-equivalence graphs
  Algorithm for finding PDAGs

CPDs
Thus far we ignored the representation of CPDs
Today we will cover the range of CPD representations
  Discrete
  Continuous
  Sparse
  Deterministic
  Linear

Table CPDs
Entry for each joint assignment of X and Pa(X)
For each pa_X: $\sum_{x \in Val(X)} P(x \mid pa_X) = 1$
Most general representation
Represents every discrete CPD

Limitations
  Cannot model continuous RVs
  Number of parameters exponential in |Pa(X)|
  Cannot model large in-degree dependencies
  Ignores structure within the CPD

Example network: I -> S, with joint P(S|I)P(I)

P(I)
  i0     i1
  0.7    0.3

P(S | I)
  I     s0      s1
  i0    0.95    0.05
  i1    0.2     0.8
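The table CPD above can be sketched in a few lines of Python. This is only an illustration: the dictionary layout and the helper names (`check_normalized`, `joint`, `num_table_entries`) are assumptions made for the example, not part of the slides. It checks the normalization constraint for each parent assignment and shows how the number of table entries grows with the number of parents.

```python
from itertools import product

# P(I) and the table CPD P(S | I), with the values from the slide.
p_i = {"i0": 0.7, "i1": 0.3}
p_s_given_i = {
    "i0": {"s0": 0.95, "s1": 0.05},
    "i1": {"s0": 0.2, "s1": 0.8},
}

def check_normalized(cpd, tol=1e-9):
    """Each row (one parent assignment) must sum to 1."""
    return all(abs(sum(row.values()) - 1.0) < tol for row in cpd.values())

def joint(i, s):
    """Chain rule for the two-node network I -> S: P(i, s) = P(i) P(s | i)."""
    return p_i[i] * p_s_given_i[i][s]

def num_table_entries(child_card, parent_cards):
    """Entries in a full table CPD: |Val(X)| times the product of parent cardinalities."""
    n = child_card
    for card in parent_cards:
        n *= card
    return n

# The joint over (I, S) sums to 1.
total = sum(joint(i, s) for i, s in product(p_i, ("s0", "s1")))

# A binary child with 10 binary parents already needs 2048 entries.
entries_10_parents = num_table_entries(2, [2] * 10)
```

With these hypothetical helpers, `num_table_entries(2, [2] * k)` doubles with every added binary parent, which is the exponential growth listed under Limitations.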

Structured CPDs
Key idea: reduce parameters by modeling P(X | Pa_X) without explicitly modeling all entries of the joint
Lose expressive power (cannot represent every CPD)

Deterministic CPDs
There is a function f: Val(Pa_X) -> Val(X) such that
  $P(x \mid pa_X) = \begin{cases} 1 & x = f(pa_X) \\ 0 & \text{otherwise} \end{cases}$
Examples
  OR, AND, NAND functions
  Z = Y+X (continuous variables)

Deterministic CPDs

Original network: T1 -> S <- T2

P(S | T1, T2)
  T1    T2    s0      s1
  t0    t0    0.95    0.05
  t0    t1    0.2     0.8
  t1    t0    0.2     0.8
  t1    t1    0.2     0.8

Network with a deterministic node: T1 -> T <- T2, T -> S

P(S | T)
  T     s0      s1
  t0    0.95    0.05
  t1    0.2     0.8

P(T | T1, T2)
  T1    T2    t0    t1
  t0    t0    1     0
  t0    t1    0     1
  t1    t0    0     1
  t1    t1    0     1

Replace spurious dependencies with deterministic CPDs
Need to make sure that deterministic CPD is compactly stored
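A small Python check illustrates why the decomposition is safe: summing out the deterministic node T recovers the original table P(S | T1, T2). The function names (`det_or`, `p_t`, `p_s1`) are hypothetical, and storing T as a function rather than a table is one possible way to keep the deterministic CPD compact.

```python
from itertools import product

# P(S = s1 | T), from the slide's two-row table.
p_s1_given_t = {"t0": 0.05, "t1": 0.8}

def det_or(t1, t2):
    """Deterministic CPD for T stored as a function: T = t1 if either parent is t1."""
    return "t1" if "t1" in (t1, t2) else "t0"

def p_t(t, t1, t2):
    """P(T = t | T1, T2): 1 for the value f(T1, T2), 0 otherwise."""
    return 1.0 if t == det_or(t1, t2) else 0.0

def p_s1(t1, t2):
    """P(S = s1 | T1, T2), obtained by summing out the intermediate node T."""
    return sum(p_t(t, t1, t2) * p_s1_given_t[t] for t in ("t0", "t1"))

# Original four-row table P(S = s1 | T1, T2) from the slide.
original = {
    ("t0", "t0"): 0.05,
    ("t0", "t1"): 0.8,
    ("t1", "t0"): 0.8,
    ("t1", "t1"): 0.8,
}

recovered = {pa: p_s1(*pa) for pa in product(("t0", "t1"), repeat=2)}
```

Here `recovered` agrees with `original` on every parent assignment, while the stochastic part of the model shrinks from four rows to two.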

Deterministic CPDs
Induce additional conditional independencies
Example: T is any deterministic function of T1,T2
  (Figure: network over T1, T2, T, S1, S2)
  Ind(S1;S2 | T1,T2)

Deterministic CPDs
Induce additional conditional independencies
Example: C is an XOR deterministic function of A,B
  (Figure: network over A, B, C, D, E)
  Ind(D;E | B,C)

Deterministic CPDs
Induce additional conditional independencies
Example: T is an OR deterministic function of T1,T2
  (Figure: network over T1, T2, T, S1, S2)
  Ind(S1;S2 | T1=t1)
Context specific independencies
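The OR example can be checked numerically. The sketch below assumes the structure T1 -> T <- T2 with S1 and S2 as children of T, and all probabilities are made up for illustration. It shows that S1 and S2 factorize in the context T1 = t1 (where T is forced to t1) but not in the context T1 = t0.

```python
from itertools import product

VALS = (0, 1)

# Hypothetical parameters, not taken from the slides.
p_t2 = {0: 0.6, 1: 0.4}                                   # prior on T2 (a root, independent of T1)
p_s1_given_t = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.3, 1: 0.7}}
p_s2_given_t = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.25, 1: 0.75}}

def cond_joint(t1):
    """P(S1, S2 | T1 = t1): sum out T2, with T = OR(T1, T2) deterministic."""
    out = {}
    for s1, s2 in product(VALS, VALS):
        out[(s1, s2)] = sum(
            p_t2[t2] * p_s1_given_t[t1 | t2][s1] * p_s2_given_t[t1 | t2][s2]
            for t2 in VALS
        )
    return out

def is_independent(pj, tol=1e-12):
    """True if the joint over (S1, S2) equals the product of its marginals."""
    m1 = {a: sum(pj[(a, b)] for b in VALS) for a in VALS}
    m2 = {b: sum(pj[(a, b)] for a in VALS) for b in VALS}
    return all(abs(pj[(a, b)] - m1[a] * m2[b]) < tol
               for a, b in product(VALS, VALS))

indep_given_t1_true = is_independent(cond_joint(1))    # context T1 = t1
indep_given_t1_false = is_independent(cond_joint(0))   # context T1 = t0
```

The independence holds only for one value of T1, which is what makes it context specific rather than an ordinary conditional independence.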

Context Specific Independencies

Let X,Y,Z be pairwise disjoint RV sets
Let C be a set of variables and $c \in Val(C)$
X and Y are contextually independent given Z and c, denoted $(X \perp_c Y \mid Z, c)$, if:
  $P(X \mid Y, Z, c) = P(X \mid Z, c)$ whenever $P(Y, Z, c) > 0$

Tree CPDs

Network: A, B, C -> D

Table CPD P(D | A, B, C): 8 parameters
  A     B     C     d0     d1
  a0    b0    c0    0.2    0.8
  a0    b0    c1    0.2    0.8
  a0    b1    c0    0.2    0.8
  a0    b1    c1    0.2    0.8
  a1    b0    c0    0.9    0.1
  a1    b0    c1    0.7    0.3
  a1    b1    c0    0.4    0.6
  a1    b1    c1    0.4    0.6

Tree CPD: 4 parameters
  A = a0: (0.2,0.8)
  A = a1:
    B = b0:
      C = c0: (0.9,0.1)
      C = c1: (0.7,0.3)
    B = b1: (0.4,0.6)
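As a rough sketch, the tree CPD can be stored as nested nodes and queried by walking from the root. The nested-tuple encoding and the names `lookup` and `count_leaves` are assumptions made for this example. The check confirms that the 4-leaf tree reproduces all 8 rows of the table.

```python
from itertools import product

# Internal node: (variable, {value: subtree}); leaf: (P(d0), P(d1)).
TREE = ("A", {
    "a0": (0.2, 0.8),
    "a1": ("B", {
        "b0": ("C", {"c0": (0.9, 0.1), "c1": (0.7, 0.3)}),
        "b1": (0.4, 0.6),
    }),
})

# Full table CPD P(D | A, B, C) from the slide, keyed by (a, b, c).
TABLE = {
    ("a0", "b0", "c0"): (0.2, 0.8), ("a0", "b0", "c1"): (0.2, 0.8),
    ("a0", "b1", "c0"): (0.2, 0.8), ("a0", "b1", "c1"): (0.2, 0.8),
    ("a1", "b0", "c0"): (0.9, 0.1), ("a1", "b0", "c1"): (0.7, 0.3),
    ("a1", "b1", "c0"): (0.4, 0.6), ("a1", "b1", "c1"): (0.4, 0.6),
}

def lookup(tree, assignment):
    """Follow the branch selected by the assignment until a leaf is reached."""
    while isinstance(tree[1], dict):
        var, children = tree
        tree = children[assignment[var]]
    return tree

def count_leaves(tree):
    """Number of leaf distributions, i.e. free parameters for a binary child."""
    if not isinstance(tree[1], dict):
        return 1
    return sum(count_leaves(child) for child in tree[1].values())

matches = all(
    lookup(TREE, {"A": a, "B": b, "C": c}) == TABLE[(a, b, c)]
    for a, b, c in product(("a0", "a1"), ("b0", "b1"), ("c0", "c1"))
)
```

The lookup never reads B or C when A = a0, which is the structure the table representation ignores.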

Gene Regulation: Simple Example
(Figure: an activator and a repressor control a regulated gene, which is shown in three states: State 1, State 2 and State 3. Regulators are measured with DNA microarrays.)

Regulation Tree
(Figure: regulation program of a module, drawn as a tree that splits on "Activator?" and "Repressor?" with true/false branches and leaves State 1, State 2 and State 3. Also shown: module genes, activator expression, repressor expression.)
Segal et al., Nature Genetics '03

Respiration Module
(Figure: regulation program and module genes, with the predicted regulator marked.)
Hap4+Msn4 known to regulate module genes
Module genes known targets of predicted regulators?
Segal et al., Nature Genetics '03

Rule CPDs
A rule r is a pair (c;p) where c is an assignment to a subset of variables C and $p \in [0,1]$. Let Scope[r]=C
A rule-based CPD P(X|Pa(X)) is a set of rules R s.t.
  For each rule $r \in R$: $Scope[r] \subseteq \{X\} \cup Pa(X)$
  For each assignment (x,u) to $\{X\} \cup Pa(X)$ we have exactly one rule $(c;p) \in R$ such that c is compatible with (x,u). Then, we have P(X=x | Pa(X)=u) = p

Rule CPDs Example
Let X be a variable with Pa(X) = {A,B,C}
  r1: (a1, b1, x0; 0.1)
  r2: (a0, c1, x0; 0.2)
  r3: (b0, c0, x0; 0.3)
  r4: (a1, b0, c1, x0; 0.4)
  r5: (a0, b1, c0; 0.5)
  r6: (a1, b1, x1; 0.9)
  r7: (a0, c1, x1; 0.8)
  r8: (b0, c0, x1; 0.7)
  r9: (a1, b0, c1, x1; 0.6)
Note: each assignment maps to exactly one rule
Rules cannot always be represented compactly within tree CPDs
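The claim that each assignment maps to exactly one rule can be verified mechanically. The sketch below encodes rules r1 to r9 as (partial assignment, probability) pairs; the encoding with 0/1 values and the helper names `matching_rules` and `prob` are assumptions for this example.

```python
from itertools import product

# Rules r1..r9 from the slide: (partial assignment c, probability p).
RULES = [
    ({"A": 1, "B": 1, "X": 0}, 0.1),          # r1
    ({"A": 0, "C": 1, "X": 0}, 0.2),          # r2
    ({"B": 0, "C": 0, "X": 0}, 0.3),          # r3
    ({"A": 1, "B": 0, "C": 1, "X": 0}, 0.4),  # r4
    ({"A": 0, "B": 1, "C": 0}, 0.5),          # r5 (applies to both x0 and x1)
    ({"A": 1, "B": 1, "X": 1}, 0.9),          # r6
    ({"A": 0, "C": 1, "X": 1}, 0.8),          # r7
    ({"B": 0, "C": 0, "X": 1}, 0.7),          # r8
    ({"A": 1, "B": 0, "C": 1, "X": 1}, 0.6),  # r9
]

def matching_rules(assignment):
    """All rules whose context c is compatible with the full assignment."""
    return [(c, p) for c, p in RULES
            if all(assignment[var] == val for var, val in c.items())]

def prob(x, a, b, c):
    """P(X = x | A = a, B = b, C = c), read from the single matching rule."""
    matches = matching_rules({"X": x, "A": a, "B": b, "C": c})
    if len(matches) != 1:
        raise ValueError("rule set is not a valid CPD for this assignment")
    return matches[0][1]

# Every one of the 16 full assignments should match exactly one rule.
match_counts = [
    len(matching_rules({"X": x, "A": a, "B": b, "C": c}))
    for x, a, b, c in product((0, 1), repeat=4)
]
```

Nine rules cover all 16 entries of the table here, and for every parent assignment the two probabilities for x0 and x1 sum to 1.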

Tree CPDs and Rule CPDs
Can represent every discrete function
Can be easily learned and dealt with in inference
But, some functions are not represented compactly
  XOR in tree CPDs: cannot split in one step on a0,b1 and a1,b0
Alternative representations exist
  Complex logical rules

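Context Specific Independencies

Network: A, B, C -> D

Tree CPD for D: the root splits on A; the a0 branch splits on C (c0, c1) and the a1 branch splits on B (b0, b1). Leaf distributions shown: (0.2,0.8), (0.4,0.6), (0.7,0.3), (0.9,0.1)

Context A=a1: Ind(D,C | A=a1)
Context A=a0: Ind(D,B | A=a0)

Reasoning by cases implies that Ind(B,C | A,D)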

Independence of Causal Influence
(Figure: X1, X2, ..., Xn -> Y)
Causes: X1,...,Xn
Effect: Y
General case: Y has a complex dependency on X1,...,Xn
Common case
  Each Xi influences Y separately
  Influence of X1,...,Xn is combined to an overall influence on Y

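Example 1: Noisy OR
Two independent effects X1, X2
Y=y1 cannot happen unless one of X1, X2 occurs
The probabilities that each cause fails to trigger Y multiply:
  $P(Y=y^0 \mid x_1^1, x_2^1) = P(Y=y^0 \mid x_1^1, x_2^0)\, P(Y=y^0 \mid x_1^0, x_2^1) = 0.1 \cdot 0.2 = 0.02$

Network: X1 -> Y <- X2

P(Y | X1, X2)
  X1      X2      y0      y1
  x1^0    x2^0    1       0
  x1^0    x2^1    0.2     0.8
  x1^1    x2^0    0.1     0.9
  x1^1    x2^1    0.02    0.98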

Noisy OR: Elaborate Representation

Network: X1 -> X'1, X2 -> X'2, X'1 -> Y <- X'2

Noisy CPD 1: P(X'1 | X1), noise parameter $\lambda_{X_1} = 0.9$
  X1      x'1^0    x'1^1
  x1^0    1        0
  x1^1    0.1      0.9

Noisy CPD 2: P(X'2 | X2), noise parameter $\lambda_{X_2} = 0.8$
  X2      x'2^0    x'2^1
  x2^0    1        0
  x2^1    0.2      0.8

Deterministic OR: P(Y | X'1, X'2)
  X'1      X'2      y0    y1
  x'1^0    x'2^0    1     0
  x'1^0    x'2^1    0     1
  x'1^1    x'2^0    0     1
  x'1^1    x'2^1    0     1

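Noisy OR: Elaborate Representation
Decomposition results in the same distribution

$$
\begin{aligned}
P(Y=y^0 \mid X_1=x_1^1, X_2=x_2^1)
&= \sum_{X'_1, X'_2} P(Y=y^0, X'_1, X'_2 \mid X_1=x_1^1, X_2=x_2^1) \\
&= \sum_{X'_1, X'_2} P(X'_1 \mid X_1=x_1^1, X_2=x_2^1)\, P(X'_2 \mid X'_1, X_1=x_1^1, X_2=x_2^1)\, P(Y=y^0 \mid X'_1, X'_2, X_1=x_1^1, X_2=x_2^1) \\
&= \sum_{X'_1, X'_2} P(X'_1 \mid X_1=x_1^1)\, P(X'_2 \mid X_2=x_2^1)\, P(Y=y^0 \mid X'_1, X'_2) \\
&= 0.1 \cdot 0.2 \cdot 1 + 0.1 \cdot 0.8 \cdot 0 + 0.9 \cdot 0.2 \cdot 0 + 0.9 \cdot 0.8 \cdot 0 \\
&= 0.02
\end{aligned}
$$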

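Noisy OR: General Case
Y is a binary variable with k binary parents X1,...,Xk
CPD P(Y | X1,...,Xk) is a noisy OR if there are k+1 noise parameters $\lambda_0, \lambda_1, \ldots, \lambda_k$ such that
  $P(Y=y^0 \mid X_1,\ldots,X_k) = (1-\lambda_0) \prod_{i:\, X_i = x_i^1} (1-\lambda_i)$
  $P(Y=y^1 \mid X_1,\ldots,X_k) = 1 - (1-\lambda_0) \prod_{i:\, X_i = x_i^1} (1-\lambda_i)$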
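The general noisy-OR formula is short enough to sketch directly. The function name `noisy_or_p_y0` and the encoding of parent values as 0/1 are assumptions for this example; the noise parameters 0.9 and 0.8 with no leak (lambda_0 = 0) are those of the two-parent example earlier in the slides.

```python
def noisy_or_p_y0(x, lambdas, leak=0.0):
    """P(Y = y0 | x) = (1 - lambda_0) * product over active parents of (1 - lambda_i).

    x       -- tuple of 0/1 parent values
    lambdas -- noise parameters lambda_1..lambda_k, one per parent
    leak    -- lambda_0, the probability that Y fires with no active cause
    """
    p = 1.0 - leak
    for xi, lam in zip(x, lambdas):
        if xi == 1:
            p *= 1.0 - lam
    return p

def noisy_or_p_y1(x, lambdas, leak=0.0):
    """P(Y = y1 | x) is the complement of P(Y = y0 | x)."""
    return 1.0 - noisy_or_p_y0(x, lambdas, leak)

# Rebuild the y0 column of the two-parent table from k + 1 = 3 parameters.
LAMBDAS = (0.9, 0.8)
y0_column = {x: noisy_or_p_y0(x, LAMBDAS) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]}
```

The resulting values match the y0 column of the table (1, 0.2, 0.1, 0.02) up to floating-point rounding, using k+1 parameters instead of 2^k rows.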

Noisy OR Independencies
(Figure: X1 -> X'1, X2 -> X'2, ..., Xn -> X'n, with X'1, ..., X'n -> Y)
For all i ≠ j: Ind(Xi; Xj | Y=y0)

Generalized Linear Models
Model is a soft version of a linear threshold function
Example: logistic function
  Binary variables X1,...,Xk,Y
  $P(Y=y^1 \mid X_1,\ldots,X_k) = \frac{1}{1+\exp\left(-\left(w_0 + \sum_{i=1}^{k} w_i \mathbf{1}(X_i = 1)\right)\right)}$
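A minimal sketch of a logistic CPD over binary parents follows. It assumes the usual sigmoid convention 1 / (1 + exp(-z)) with z = w0 + sum of w_i over active parents; the function name `logistic_cpd` and the example weights are hypothetical.

```python
import math

def logistic_cpd(x, w0, w):
    """P(Y = y1 | x) for binary parents x (0/1), bias w0 and weights w.

    Only parents with x_i = 1 contribute their weight, matching the
    indicator 1(X_i = 1) in the formula.
    """
    z = w0 + sum(wi for wi, xi in zip(w, x) if xi == 1)
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights: two parents push Y up, one pushes it down.
W0 = 0.0
W = (1.0, 2.0, -1.5)

p_none = logistic_cpd((0, 0, 0), W0, W)   # z = 0
p_first = logistic_cpd((1, 0, 0), W0, W)  # z = 1
p_all = logistic_cpd((1, 1, 1), W0, W)    # z = 1.5

# The CPD needs k + 1 numbers, against 2^k rows for a table CPD.
num_params = len(W) + 1
```

With no active parent and zero bias, the output is exactly 0.5; positive weights raise the probability and negative weights lower it.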

General Formulation
Let Y be a random variable with parents X1,...,Xn
The CPD P(Y | X1,...,Xn) exhibits independence of causal influence (ICI) if it can be described by
  (Figure: Xi -> Zi for i = 1,...,n; Z1,...,Zn -> Z; Z -> Y)
  The CPD P(Z | Z1,...,Zn) is deterministic

Noisy OR
  Zi has noise model
  Z is an OR function
  Y is the identity CPD

Logistic
  Zi = wi 1(Xi=1)
  Z = $\sum_i Z_i$
  Y = logit(Z)

General Formulation

Key advantage: O(n) parameters

As stated, not all that useful as any complex CPD can be represented through a complex deterministic CPD

Continuous Variables
One solution: Discretize
  Often requires too many value states
  Loses domain structure
Other solution: use continuous function for P(X|Pa(X))
  Can combine continuous and discrete variables, resulting in hybrid networks
  Inference and learning may become more difficult

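Gaussian Density Functions
Among the most common continuous representations
Univariate case: $P(X) \sim N(\mu, \sigma^2)$ if
  $p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$
(Figure: plot of a Gaussian density for x from -4 to 4; vertical axis from 0 to 0.4)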
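The univariate density is easy to evaluate directly. The sketch below uses only the standard library; the function names `gaussian_pdf` and `integrate` are assumptions, and the trapezoid rule is used here just as a quick sanity check that the density integrates to about 1.

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma)

def integrate(f, lo, hi, steps=4000):
    """Trapezoid-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, steps):
        total += f(lo + i * h)
    return total * h

peak = gaussian_pdf(0.0)                    # height of the standard normal at its mean
area = integrate(gaussian_pdf, -8.0, 8.0)   # should be very close to 1
```

For the standard normal the peak is 1/sqrt(2*pi), roughly 0.399, which is consistent with a plotted curve topping out just under 0.4.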

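Gaussian Density Functions
A multivariate Gaussian distribution over X1,...,Xn has
  Mean vector $\mu$
  n x n positive definite covariance matrix $\Sigma$
    positive definite: $\forall x \in \mathbb{R}^n,\ x \neq 0:\ x^T \Sigma x > 0$
  $\mu_i = E[X_i]$
  $\Sigma_{ii} = Var[X_i]$
  $\Sigma_{ij} = Cov[X_i, X_j] = E[X_i X_j] - E[X_i]E[X_j]$ $(i \neq j)$
  Joint density function:
    $p(x) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$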

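Gaussian Density Functions
Marginal distributions are easy to compute
  $P(X, Y) = N\left(\begin{pmatrix}\mu_X \\ \mu_Y\end{pmatrix}; \begin{pmatrix}\Sigma_{XX} & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_{YY}\end{pmatrix}\right)$
  $P(X) = N(\mu_X; \Sigma_{XX})$
Independencies can be determined from parameters
  If X=X1,...,Xn have a joint normal distribution $N(\mu; \Sigma)$ then Ind(Xi; Xj) iff $\Sigma_{ij} = 0$
  Does not hold in general for non-Gaussian distributions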

Linear Gaussian CPDs
Y is a continuous variable with parents X1,...,Xn
Y has a linear Gaussian model if it can be described using parameters $\beta_0, \ldots, \beta_n$ and $\sigma^2$ such that
  $P(Y \mid x_1,\ldots,x_n) = N(\beta_0 + \beta_1 x_1 + \ldots + \beta_n x_n; \sigma^2)$
Vector notation:
  $P(Y \mid x) = N(\beta_0 + \beta^T x; \sigma^2)$
Pros
  Simple
  Captures many interesting dependencies
Cons
  Fixed variance (variance cannot depend on parents' values)

Linear Gaussian Bayesian Network

A linear Gaussian Bayesian network is a Bayesian network where
  All variables are continuous
  All of the CPDs are linear Gaussians

Key result: linear Gaussian models are equivalent to multivariate Gaussian density functions

Equivalence Theorem
Y is a linear Gaussian of its parents X1,...,Xn:
  $P(Y \mid x) = N(\beta_0 + \beta^T x; \sigma^2)$
Assume that X1,...,Xn are jointly Gaussian with $N(\mu; \Sigma)$
Then:
  The marginal distribution of Y is Gaussian with $N(\mu_Y; \sigma_Y^2)$, where
    $\mu_Y = \beta_0 + \beta^T \mu$
    $\sigma_Y^2 = \sigma^2 + \beta^T \Sigma \beta$
  The joint distribution over {X,Y} is Gaussian where
    $Cov[X_i; Y] = \sum_{j=1}^{n} \beta_j \Sigma_{ij}$
Linear Gaussian BNs define a joint Gaussian distribution
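The three formulas of the theorem translate directly into code. The sketch below uses plain Python lists; the function name `linear_gaussian_to_joint` and all numeric values are made up for illustration. It extends the mean vector and covariance matrix of X with the entries for Y.

```python
def linear_gaussian_to_joint(mu, Sigma, beta0, beta, sigma2):
    """Joint Gaussian over (X1..Xn, Y) given X ~ N(mu; Sigma) and
    P(Y | x) = N(beta0 + beta^T x; sigma2)."""
    n = len(mu)
    # mu_Y = beta0 + beta^T mu
    mu_y = beta0 + sum(beta[i] * mu[i] for i in range(n))
    # Cov[X_i; Y] = sum_j beta_j Sigma_ij
    cov_xy = [sum(beta[j] * Sigma[i][j] for j in range(n)) for i in range(n)]
    # sigma_Y^2 = sigma^2 + beta^T Sigma beta
    var_y = sigma2 + sum(beta[i] * cov_xy[i] for i in range(n))
    joint_mu = list(mu) + [mu_y]
    joint_Sigma = [list(Sigma[i]) + [cov_xy[i]] for i in range(n)]
    joint_Sigma.append(cov_xy + [var_y])
    return joint_mu, joint_Sigma

# Hypothetical two-parent example.
MU = [1.0, 2.0]
SIGMA = [[2.0, 0.5],
         [0.5, 1.0]]
joint_mu, joint_Sigma = linear_gaussian_to_joint(MU, SIGMA, beta0=0.5, beta=[1.0, -2.0], sigma2=0.3)
```

For these numbers the result is mu_Y = -2.5, Cov[X1;Y] = 1.0, Cov[X2;Y] = -1.5 and sigma_Y^2 = 4.3. Applying the function once per node in topological order is one way to turn a whole linear Gaussian network into a single joint Gaussian.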

Converse Equivalence Theorem
If {X,Y} have a joint Gaussian distribution then
  $P(Y \mid X) = N(\beta_0 + \beta^T X; \sigma^2)$
Implications of equivalence
  Joint distribution has compact representation: O(n^2)
  We can easily transform back and forth between Gaussian distributions and linear Gaussian Bayesian networks
  Representations may differ in parameters
    Example: (Figure: network over X1, X2, ..., Xn)
      Gaussian distribution has full covariance matrix
      Linear Gaussian
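The slide does not spell out how the parameters are obtained in the converse direction. A sketch using the standard Gaussian conditioning formulas (an addition here, not taken from the slides) is: beta solves Sigma_XX beta = Sigma_XY, beta_0 = mu_Y - beta^T mu_X, and sigma^2 = Sigma_YY - beta^T Sigma_XY. The helper names `solve` and `joint_to_linear_gaussian` and the example numbers are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def joint_to_linear_gaussian(mu, Sigma):
    """Treat the last variable as Y and return (beta0, beta, sigma2) of P(Y | X)."""
    n = len(mu) - 1
    Sigma_xx = [row[:n] for row in Sigma[:n]]
    Sigma_xy = [Sigma[i][n] for i in range(n)]
    beta = solve(Sigma_xx, Sigma_xy)
    beta0 = mu[n] - sum(beta[i] * mu[i] for i in range(n))
    sigma2 = Sigma[n][n] - sum(beta[i] * Sigma_xy[i] for i in range(n))
    return beta0, beta, sigma2

# Hypothetical joint Gaussian over (X1, X2, Y).
JOINT_MU = [1.0, 2.0, -2.5]
JOINT_SIGMA = [[2.0, 0.5, 1.0],
               [0.5, 1.0, -1.5],
               [1.0, -1.5, 4.3]]
beta0, beta, sigma2 = joint_to_linear_gaussian(JOINT_MU, JOINT_SIGMA)
```

For this joint the recovered CPD has beta = (1, -2), beta_0 = 0.5 and sigma^2 = 0.3, so Y is again a linear Gaussian of X1 and X2.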

Hybrid Models
Models of continuous and discrete variables
  Continuous variables with discrete parents
  Discrete variables with continuous parents

Conditional Linear Gaussians
Y continuous variable
X = {X1,...,Xn} continuous parents
U = {U1,...,Um} discrete parents
  $\forall u \in Val(U):\ P(Y \mid u, x) = N\left(a_{u,0} + \sum_{i=1}^{n} a_{u,i} x_i; \sigma_u^2\right)$
A Conditional Linear Gaussian (CLG) Bayesian network is one where
  Discrete variables have only discrete parents
  Continuous variables have only CLG CPDs
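One way to picture a CLG CPD is as a separate linear Gaussian for every assignment to the discrete parents. The sketch below assumes a single discrete parent U with two values and two continuous parents; the dictionary layout, the names `clg_params` and `sample_y`, and all numbers are hypothetical.

```python
import math
import random

# One set of linear Gaussian parameters per discrete assignment u:
# intercept a_{u,0}, coefficients a_{u,1..n}, and variance sigma_u^2.
CLG = {
    "u0": {"a0": 1.0, "a": (0.5, -1.0), "var": 0.25},
    "u1": {"a0": -2.0, "a": (2.0, 0.0), "var": 4.0},
}

def clg_params(u, x):
    """Mean and variance of P(Y | u, x) under the CLG CPD."""
    p = CLG[u]
    mean = p["a0"] + sum(ai * xi for ai, xi in zip(p["a"], x))
    return mean, p["var"]

def sample_y(u, x, rng=random):
    """Draw Y from N(mean; var); random.gauss takes a standard deviation."""
    mean, var = clg_params(u, x)
    return rng.gauss(mean, math.sqrt(var))

params_u0 = clg_params("u0", (2.0, 1.0))  # mean 1.0 + 1.0 - 1.0, variance 0.25
params_u1 = clg_params("u1", (2.0, 1.0))  # mean -2.0 + 4.0 + 0.0, variance 4.0
```

Unlike a plain linear Gaussian, the variance can change with the discrete context u, although it still cannot depend on the continuous parents.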

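Hybrid Models
Continuous parents for discrete children
  Threshold models
    $P(Y=y^1 \mid x)$ is 0.9 on one side of the threshold x = 10 and 0.05 otherwise (the direction of the inequality is missing in the source)
  Linear sigmoid
    $P(Y=y^1 \mid x_1,\ldots,x_k) = \frac{1}{1+\exp\left(-\left(w_0 + \sum_{i=1}^{k} w_i x_i\right)\right)}$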

Summary: CPD Models
Deterministic functions
Context specific dependencies
Independence of causal influence
  Noisy OR
  Logistic function
CPDs capture additional domain structure
