TRANSCRIPT
Introduction to Probabilistic Graphical Models
Eran Segal
Weizmann Institute
Logistics
Staff:
  Instructor: Eran Segal ([email protected], room 149)
  Teaching Assistant: Ohad Manor ([email protected], room 125)
Course information: http://www.weizmann.ac.il/math/pgm
Course book: “Bayesian Networks and Beyond”, Daphne Koller (Stanford) & Nir Friedman (Hebrew U.)
Course structure
One weekly meeting: Sun 9am-11am
Homework assignments: 2 weeks to complete each, 40% of final grade
Final exam: 3-hour class exam (date will be announced), 60% of final grade
Probabilistic Graphical Models
Tool for representing complex systems and performing sophisticated reasoning tasks
Fundamental notion: Modularity. Complex systems are built by combining simpler parts.
Why have a model?
  Compact and modular representation of complex systems
  Ability to execute complex reasoning patterns
  Make predictions
  Generalize from the particular problem
Probabilistic Graphical Models
Increasingly important in Machine Learning
Many classical probabilistic problems in statistics, information theory, pattern recognition, and statistical mechanics are special cases of the formalism
Graphical models provide a common framework
Advantage: specialized techniques developed in one field can be transferred between research communities
Representation: Graphs
Intuitive data structure for modeling highly-interacting sets of variables
Explicit model for modularity
Data structure that allows for design of efficient general-purpose algorithms
Reasoning: Probability Theory
Well-understood framework for modeling uncertainty:
  Partial knowledge of the state of the world
  Noisy observations
  Phenomena not covered by our model
  Inherent stochasticity
Clear semantics
Can be learned from data
Probabilistic Reasoning
In this course we will learn:
  Semantics of probabilistic graphical models (PGMs): Bayesian networks, Markov networks
  Answering queries in a PGM (“inference”)
  Learning PGMs from data (“learning”)
  Modeling temporal processes with PGMs: Hidden Markov Models (HMMs) as a special case
Course Outline
Week | Topic                                          | Reading
1    | Introduction, Bayesian network representation  | 1-3
2    | Bayesian network representation cont.          | 1-3
3    | Local probability models                       | 5
4    | Undirected graphical models                    | 4
5    | Exact inference                                | 9, 10
6    | Exact inference cont.                          | 9, 10
7    | Approximate inference                          | 12
8    | Approximate inference cont.                    | 12
9    | Learning: Parameters                           | 16, 17
10   | Learning: Parameters cont.                     | 16, 17
11   | Learning: Structure                            | 18
12   | Partially observed data                        | 19
13   | Learning undirected graphical models           | 20
14   | Template models                                | 6
15   | Dynamic Bayesian networks                      | 15
A Simple Example
We want to model whether our neighbor will inform us of the alarm being set off
The alarm can be set off if:
  There is a burglary
  There is an earthquake
Whether our neighbor calls depends on whether the alarm is set off
A Simple Example
Variables:
Earthquake (E), Burglary (B), Alarm (A), NeighborCalls (N)
E B A N Prob.
F F F F 0.01
F F F T 0.04
F F T F 0.05
F F T T 0.01
F T F F 0.02
F T F T 0.07
F T T F 0.2
F T T T 0.1
T F F F 0.01
T F F T 0.07
T F T F 0.13
T F T T 0.04
T T F F 0.06
T T F T 0.05
T T T F 0.1
T T T T 0.05
2^4 - 1 = 15 independent parameters
A Simple Example
Network structure: Earthquake → Alarm ← Burglary, Alarm → NeighborCalls

P(E): E=F 0.9, E=T 0.1
P(B): B=F 0.7, B=T 0.3
P(A | E, B):
  E B | A=F  A=T
  F F | 0.99 0.01
  F T | 0.1  0.9
  T F | 0.3  0.7
  T T | 0.01 0.99
P(N | A):
  A | N=F N=T
  F | 0.9 0.1
  T | 0.2 0.8

8 independent parameters (1 for E, 1 for B, 4 for A given E and B, 2 for N given A)
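To make the factorization concrete, here is a minimal Python sketch (the helper names are ours; the CPD values are the ones in the tables above) that computes the joint P(E, B, A, N) as P(E) P(B) P(A | E, B) P(N | A):

```python
# Joint probability of the alarm example via the BN factorization
# P(E, B, A, N) = P(E) * P(B) * P(A | E, B) * P(N | A)
from itertools import product

P_E = {True: 0.1, False: 0.9}                 # P(Earthquake = T/F)
P_B = {True: 0.3, False: 0.7}                 # P(Burglary = T/F)
P_A = {                                       # P(Alarm = T | Earthquake, Burglary)
    (False, False): 0.01, (False, True): 0.9,
    (True, False): 0.7,   (True, True): 0.99,
}
P_N = {False: 0.1, True: 0.8}                 # P(NeighborCalls = T | Alarm)

def joint(e, b, a, n):
    p_a = P_A[(e, b)] if a else 1 - P_A[(e, b)]
    p_n = P_N[a] if n else 1 - P_N[a]
    return P_E[e] * P_B[b] * p_a * p_n

# Burglary, no earthquake, alarm goes off, neighbor calls:
print(joint(e=False, b=True, a=True, n=True))   # 0.9 * 0.3 * 0.9 * 0.8 = 0.1944

# Sanity check: the 16 joint entries sum to 1
print(sum(joint(e, b, a, n) for e, b, a, n in product([False, True], repeat=4)))
```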
Example Bayesian Network
The “ALARM” network for monitoring intensive care patients
37 variables, 509 parameters (full joint: 2^37)
[Figure: the ALARM network graph over its 37 variables, including PCWP, CO, HRBP, HREKG, HRSAT, SAO2, EXPCO2, ARTCO2, VENTALV, VENTLUNG, DISCONNECT, MINVOLSET, PAP, SHUNT, ANAPHYLAXIS, PVSAT, FIO2, PRESS, LVFAILURE, HYPOVOLEMIA, CVP, and BP]
Application: Clustering Users
Input: TV shows that each user watches
Output: TV show “clusters”
Assumption: shows watched by the same users are similar

Class 1: Power rangers, Animaniacs, X-men, Tazmania, Spider man
Class 2: Young and restless, Bold and the beautiful, As the world turns, Price is right, CBS eve news
Class 3: Tonight show, Conan O’Brien, NBC nightly news, Later with Kinnear, Seinfeld
Class 4: 60 minutes, NBC nightly news, CBS eve news, Murder she wrote, Matlock
Class 5: Seinfeld, Friends, Mad about you, ER, Frasier
App.: Recommendation Systems
Given user preferences, suggest recommendations
Example: Amazon.com
Input: movie preferences of many users
Solution: model correlations between movie features
  Users that like comedy often like drama
  Users that like action often do not like cartoons
  Users that like Robert De Niro films often like Al Pacino films
Given user preferences, can predict the probability that new movies match the preferences
Diagnostic Systems
Diagnostic indexing for the home health site at Microsoft
  Enter symptoms → recommend multimedia content
Online troubleshooters
App.: Finding Regulatory Networks
Expression level in each module is a function of the expression of its regulators: P(Level | Module, Regulators)
[Figure: module network over the variables Experiment, Gene, Module, Regulator1, Regulator2, Regulator3, and expression Level; the model asks questions such as “What module does gene g belong to?” and “What is the expression level of Regulator1 in this experiment?”, with example regulators BMH1, GIC2, HAP4, CMK1]
App.: Finding Regulatory Networks
[Figure: inferred regulatory network over numbered modules, grouped into amino acid metabolism, energy and cAMP signaling, and DNA and RNA processing; regulators shown include Hap4, Xbp1, Yap6, Gat1, Ime4, Lsg1, Msn4, Gac1, Gis1, Not3, Sip2, Kin82, Cmk1, Tpk1, Ppt1, Tpk2, Pph3, Bmh1, and Gcn20, together with enriched cis-regulatory motifs]
Legend:
  Regulator (signaling molecule)
  Regulator (transcription factor)
  Inferred regulation
  Regulation supported in literature
  Experimentally tested regulator
  Module (number)
  Enriched cis-regulatory motif
Prerequisites
Probability theory: conditional probabilities, joint distributions, random variables
Information theory
Function optimization
Graph theory
Computational complexity
Probability Theory
A probability distribution P over (Ω, S) is a mapping from events in S such that:
  P(α) ≥ 0 for all α ∈ S
  P(Ω) = 1
  If α, β ∈ S and α ∩ β = ∅, then P(α ∪ β) = P(α) + P(β)
Conditional Probability: P(α | β) = P(α ∩ β) / P(β)
Chain Rule: P(α ∩ β) = P(α) P(β | α)
Bayes Rule: P(α | β) = P(β | α) P(α) / P(β)
Conditional Independence (α ⊥ β | γ): P(α | β ∩ γ) = P(α | γ)
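As a quick illustration (the joint table below is our own toy example, not from the slides), the chain rule and Bayes rule can be checked numerically:

```python
# Toy joint distribution over two binary events a and b
P_ab = {(True, True): 0.1, (True, False): 0.2,
        (False, True): 0.3, (False, False): 0.4}

P_a = sum(p for (a, b), p in P_ab.items() if a)     # marginal P(a)
P_b = sum(p for (a, b), p in P_ab.items() if b)     # marginal P(b)
P_a_given_b = P_ab[(True, True)] / P_b              # P(a | b)
P_b_given_a = P_ab[(True, True)] / P_a              # P(b | a)

# Chain rule: P(a ∩ b) = P(a) P(b | a)
assert abs(P_ab[(True, True)] - P_a * P_b_given_a) < 1e-12
# Bayes rule: P(a | b) = P(b | a) P(a) / P(b)
assert abs(P_a_given_b - P_b_given_a * P_a / P_b) < 1e-12
```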
Random Variables & Notation
Random variable: a function from Ω to a value
  Categorical / Ordinal / Continuous
Val(X): the set of possible values of RV X
Upper case letters denote RVs (e.g., X, Y, Z)
Upper case bold letters denote sets of RVs (e.g., X, Y)
Lower case letters denote RV values (e.g., x, y, z)
Lower case bold letters denote RV set values (e.g., x)
Values of a categorical RV with |Val(X)| = k: x1, x2, ..., xk
Marginal distribution over X: P(X)
Conditional independence: X is independent of Y given Z in P if
  P(X=x | Y=y, Z=z) = P(X=x | Z=z) for all x ∈ Val(X), y ∈ Val(Y), z ∈ Val(Z)
Expectation
Discrete RVs: E_P[X] = Σ_{x ∈ Val(X)} x P(x)
Continuous RVs: E_P[X] = ∫ x p(x) dx
Linearity of expectation:
  E[X + Y] = Σ_{x,y} (x + y) P(x, y)
           = Σ_{x,y} x P(x, y) + Σ_{x,y} y P(x, y)
           = Σ_x x P(x) + Σ_y y P(y)
           = E[X] + E[Y]
Expectation of products (when X ⊥ Y in P):
  E[XY] = Σ_{x,y} x y P(x, y)
        = Σ_{x,y} x y P(x) P(y)    (independence assumption)
        = (Σ_x x P(x)) (Σ_y y P(y))
        = E[X] E[Y]
Variance
Variance of an RV:
  Var_P[X] = E[(X - E[X])^2]
           = E[X^2 - 2 X E[X] + E[X]^2]
           = E[X^2] - 2 E[X] E[X] + E[X]^2
           = E[X^2] - E[X]^2
If X and Y are independent: Var[X + Y] = Var[X] + Var[Y]
Var[aX + b] = a^2 Var[X]
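A quick Monte Carlo check of these identities (our own illustration; any two independent RVs would do):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(0, 4, size=1_000_000)     # X uniform on {0,1,2,3}
y = rng.integers(0, 3, size=1_000_000)     # Y uniform on {0,1,2}, drawn independently of X

print(np.mean(x + y), np.mean(x) + np.mean(y))   # E[X+Y] ≈ E[X] + E[Y]
print(np.mean(x * y), np.mean(x) * np.mean(y))   # E[XY] ≈ E[X] E[Y]  (X ⊥ Y)
print(np.var(x + y), np.var(x) + np.var(y))      # Var[X+Y] ≈ Var[X] + Var[Y]
print(np.var(3 * x + 5), 9 * np.var(x))          # Var[aX+b] = a^2 Var[X]
```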
Information Theory
Entropy: H_P(X) = -Σ_x P(x) log P(x)
  We use log base 2 to interpret entropy as bits of information
  Entropy of X is a lower bound on the average number of bits needed to encode values of X
  0 ≤ H_P(X) ≤ log |Val(X)| for any distribution P(X)
Conditional entropy:
  H_P(X | Y) = H_P(X, Y) - H_P(Y) = -Σ_{x,y} P(x, y) log P(x | y)
Information only helps: H_P(X | Y) ≤ H_P(X)
Mutual information:
  I_P(X; Y) = H_P(X) - H_P(X | Y) = Σ_{x,y} P(x, y) log [P(x | y) / P(x)]
  0 ≤ I_P(X; Y) ≤ H_P(X)
  Symmetry: I_P(X; Y) = I_P(Y; X)
  I_P(X; Y) = 0 iff X and Y are independent
Chain rule of entropies:
  H_P(X1, ..., Xn) = H_P(X1) + H_P(X2 | X1) + ... + H_P(Xn | X1, ..., Xn-1)
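These quantities are easy to compute from a joint table; a small sketch (the joint values are our own toy numbers):

```python
import numpy as np

P_xy = np.array([[0.3, 0.2],    # joint P(X, Y), binary X (rows) and Y (columns)
                 [0.1, 0.4]])
P_x, P_y = P_xy.sum(axis=1), P_xy.sum(axis=0)

def H(p):
    return -np.sum(p * np.log2(p))          # entropy in bits (assumes all entries > 0)

H_x, H_y, H_xy = H(P_x), H(P_y), H(P_xy)
H_x_given_y = H_xy - H_y                    # H(X | Y) = H(X, Y) - H(Y)
I_xy = H_x - H_x_given_y                    # I(X; Y) = H(X) - H(X | Y)

print(H_x, H_x_given_y, I_xy)               # information only helps: H(X|Y) <= H(X)
print(np.isclose(I_xy, H_x + H_y - H_xy))   # symmetry: I(X;Y) = I(Y;X)
```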
Distances Between Distributions
Relative Entropy (KL divergence):
  D(P||Q) = Σ_x P(x) log [P(x) / Q(x)]
  D(P||Q) ≥ 0, and D(P||Q) = 0 iff P = Q
  Not a distance metric (no symmetry and no triangle inequality)
L1 distance: ||P - Q||_1 = Σ_x |P(x) - Q(x)|
L2 distance: ||P - Q||_2 = (Σ_x (P(x) - Q(x))^2)^(1/2)
L∞ distance: ||P - Q||_∞ = max_x |P(x) - Q(x)|
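Computing these distances for two small distributions (the values are illustrative, not from the slides):

```python
import numpy as np

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.4, 0.4, 0.2])

kl   = np.sum(P * np.log2(P / Q))        # D(P||Q); assumes Q(x) > 0 wherever P(x) > 0
l1   = np.sum(np.abs(P - Q))             # L1 distance
l2   = np.sqrt(np.sum((P - Q) ** 2))     # L2 distance
linf = np.max(np.abs(P - Q))             # L-infinity distance

print(kl, np.sum(Q * np.log2(Q / P)))    # D(P||Q) != D(Q||P): relative entropy is not symmetric
print(l1, l2, linf)
```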
Optimization Theory
Find values θ1, ..., θn such that f(θ1, ..., θn) = max_{θ'1, ..., θ'n} f(θ'1, ..., θ'n)
Optimization strategies:
  Solve the gradient analytically and verify a local maximum
  Gradient search: guess initial values, and improve iteratively
    Gradient ascent
    Line search
    Conjugate gradient
  Lagrange multipliers: solve a maximization problem with constraints C: c_j(θ) = 0 for j = 1, ..., n
    Maximize the Lagrangian L(θ, λ) = f(θ) + Σ_j λ_j c_j(θ)
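A minimal gradient-ascent sketch on a toy concave objective (the function and step size are our own choices, only to illustrate the “guess and improve iteratively” strategy):

```python
def grad_f(t):
    return -2.0 * (t - 3.0)      # derivative of f(t) = -(t - 3)^2, maximized at t = 3

t, step = 0.0, 0.1               # initial guess and learning rate
for _ in range(100):             # improve the guess iteratively
    t += step * grad_f(t)

print(t)                         # converges to the maximizer t = 3
```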
Graph Theory
Undirected graph
Directed graph
Complete graph (every two nodes connected)
Acyclic graph
Partially directed acyclic graph (PDAG)
Induced graph
Sub-graph
Graph algorithms: shortest path from node X1 to all other nodes (BFS)
Representing Joint Distributions
Random variables: X1, ..., Xn
P is a joint distribution over X1, ..., Xn
If X1, ..., Xn are binary, we need 2^n parameters to describe P
Can we represent P more compactly?
  Key: exploit independence properties
Independent Random Variables
Two variables X and Y are independent if
  P(X=x | Y=y) = P(X=x) for all values x, y
  Equivalently, knowing Y does not change predictions of X
If X and Y are independent then: P(X, Y) = P(X|Y) P(Y) = P(X) P(Y)
If X1, ..., Xn are independent then: P(X1, ..., Xn) = P(X1) ... P(Xn)
  O(n) parameters
  All 2^n probabilities are implicitly defined
  Cannot represent many types of distributions
Conditional Independence
X and Y are conditionally independent given Z if
  P(X=x | Y=y, Z=z) = P(X=x | Z=z) for all values x, y, z
  Equivalently, if we know Z, then knowing Y does not change predictions of X
Notation: Ind(X; Y | Z) or (X ⊥ Y | Z)
Conditional Parameterization
S = score on test, Val(S) = {s0, s1}
I = intelligence, Val(I) = {i0, i1}

Joint parameterization, P(I,S):
  I  S  | P(I,S)
  i0 s0 | 0.665
  i0 s1 | 0.035
  i1 s0 | 0.06
  i1 s1 | 0.24
  3 independent parameters

Conditional parameterization, P(I,S) = P(I) P(S|I):
  P(I): i0 0.7, i1 0.3
  P(S|I):
    I  | s0   s1
    i0 | 0.95 0.05
    i1 | 0.2  0.8
  3 independent parameters

Alternative parameterization: P(S) and P(I|S)
Conditional Parameterization
S = score on test, Val(S) = {s0, s1}
I = intelligence, Val(I) = {i0, i1}
G = grade, Val(G) = {g0, g1, g2}
Assume that G and S are independent given I
Joint parameterization: 2·2·3 - 1 = 11 independent parameters
Conditional parameterization: P(I,S,G) = P(I) P(S|I) P(G|I,S) = P(I) P(S|I) P(G|I)
  P(I): 1 independent parameter
  P(S|I): 2·1 = 2 independent parameters
  P(G|I): 2·2 = 4 independent parameters
  Total: 7 independent parameters
Naïve Bayes Model
Class variable C, Val(C) = {c1, ..., ck}
Evidence variables X1, ..., Xn
Naïve Bayes assumption: evidence variables are conditionally independent given C
  P(C, X1, ..., Xn) = P(C) Π_{i=1..n} P(Xi | C)
Applications in medical diagnosis, text classification
Used as a classifier:
  P(C=c1 | x1, ..., xn) / P(C=c2 | x1, ..., xn) = [P(C=c1) / P(C=c2)] Π_i [P(xi | C=c1) / P(xi | C=c2)]
Problem: double counting of correlated evidence
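A minimal sketch of such a classifier (toy prior and CPDs of our own choosing, binary evidence variables):

```python
import numpy as np

prior = {"c1": 0.6, "c2": 0.4}                    # P(C)
likelihood = {"c1": np.array([0.8, 0.3, 0.9]),    # P(Xi = 1 | C = c1)
              "c2": np.array([0.2, 0.6, 0.5])}    # P(Xi = 1 | C = c2)

def posterior(x):
    """P(C | x1, ..., xn), computed from P(C) * prod_i P(xi | C) and normalized."""
    x = np.asarray(x)
    scores = {}
    for c in prior:
        p_x_given_c = np.where(x == 1, likelihood[c], 1 - likelihood[c])
        scores[c] = prior[c] * np.prod(p_x_given_c)
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

print(posterior([1, 0, 1]))                       # most of the posterior mass goes to c1
```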
Bayesian Network (Informal)
Directed acyclic graph G
  Nodes represent random variables
  Edges represent direct influences between random variables
Local probability models
[Figure: Example 1 (I → S), Example 2 (I → S, I → G), and the Naïve Bayes structure (C → X1, X2, ..., Xn)]
Bayesian Network (Informal)
Represents a joint distribution
  Specifies the probability P(X=x)
  Specifies the probability P(X=x | E=e)
Allows for reasoning patterns
  Prediction (e.g., intelligent → high scores)
  Explanation (e.g., low score → not intelligent)
  Explaining away (different causes for the same effect interact)
[Figure: Example 2 (I → S, I → G)]
Bayesian Network Structure
Directed acyclic graph G
  Nodes X1, ..., Xn represent random variables
G encodes the local Markov assumptions:
  Xi is independent of its non-descendants given its parents
  Formally: (Xi ⊥ NonDesc(Xi) | Pa(Xi))
[Figure: example DAG over nodes A, B, C, D, E, F, G, illustrating (E ⊥ {A, C, D, F} | B)]
Independency Mappings (I-Maps)
Let P be a distribution over X
Let I(P) be the set of independencies (X ⊥ Y | Z) that hold in P
A Bayesian network structure G is an I-map (independency mapping) of P if I(G) ⊆ I(P)

Example 1: graph with no edge between I and S, so I(G) = {I ⊥ S}
  I  S  | P(I,S)
  i0 s0 | 0.25
  i0 s1 | 0.25
  i1 s0 | 0.25
  i1 s1 | 0.25
  Here I(P) = {I ⊥ S}, so G is an I-map of P

Example 2: graph with an edge I → S, so I(G) = ∅
  I  S  | P(I,S)
  i0 s0 | 0.4
  i0 s1 | 0.3
  i1 s0 | 0.2
  i1 s1 | 0.1
  Here I(P) = ∅ (I and S are dependent), so G is an I-map of P
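Whether a given joint table satisfies (I ⊥ S) can be checked directly; a small sketch using the two tables above (the helper function is ours):

```python
import numpy as np

def independent(joint, tol=1e-9):
    """True if the 2D joint table equals the product of its marginals."""
    p_i = joint.sum(axis=1, keepdims=True)
    p_s = joint.sum(axis=0, keepdims=True)
    return np.allclose(joint, p_i * p_s, atol=tol)

P1 = np.array([[0.25, 0.25], [0.25, 0.25]])   # uniform example: I(P) = {I ⊥ S}
P2 = np.array([[0.4, 0.3], [0.2, 0.1]])       # second example: I and S are dependent

print(independent(P1))   # True
print(independent(P2))   # False
```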
Factorization Theorem
If G is an I-map of P, then P(X1, ..., Xn) = Π_{i=1..n} P(Xi | Pa(Xi))
Proof:
  wlog X1, ..., Xn is an ordering consistent with G
  By the chain rule: P(X1, ..., Xn) = Π_{i=1..n} P(Xi | X1, ..., Xi-1)
  From the ordering assumption: Pa(Xi) ⊆ {X1, ..., Xi-1} and {X1, ..., Xi-1} ⊆ Pa(Xi) ∪ NonDesc(Xi)
  Since G is an I-map: (Xi ⊥ NonDesc(Xi) | Pa(Xi)) ∈ I(P)
  Hence P(Xi | X1, ..., Xi-1) = P(Xi | Pa(Xi))
Factorization Implies I-Map
If P(X1, ..., Xn) = Π_{i=1..n} P(Xi | Pa(Xi)), then G is an I-map of P
Proof:
  Need to show (Xi ⊥ NonDesc(Xi) | Pa(Xi)) ∈ I(P), i.e., that P(Xi | NonDesc(Xi)) = P(Xi | Pa(Xi))
  wlog X1, ..., Xn is an ordering consistent with G in which NonDesc(Xi) = {X1, ..., Xi-1} (the descendants of Xi are placed after it and marginalize out)
  P(Xi | NonDesc(Xi)) = P(Xi, NonDesc(Xi)) / P(NonDesc(Xi))
                      = [Π_{k=1..i} P(Xk | Pa(Xk))] / [Π_{k=1..i-1} P(Xk | Pa(Xk))]
                      = P(Xi | Pa(Xi))
Bayesian Network Definition
A Bayesian network is a pair (G, P)
  P factorizes over G
  P is specified as a set of CPDs associated with G's nodes
Parameters
Joint distribution: 2^n
Bayesian network (bounded in-degree k): n·2^k
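For a feel of the gap, assuming binary variables and taking n = 37 (as in the ALARM example) with an illustrative in-degree bound of k = 4:

```python
n, k = 37, 4                 # k = 4 is an assumed bound, for illustration only

joint_params = 2 ** n        # full joint over n binary variables
bn_params = n * 2 ** k       # Bayesian network bound with in-degree <= k

print(joint_params)          # 137438953472
print(bn_params)             # 592
```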
Bayesian Network Design
Variable considerations
  Clarity test: can an omniscient being determine its value?
  Hidden variables?
  Irrelevant variables
Structure considerations
  Causal order of variables
  Which independencies (approximately) hold?
Probability considerations
  Zero probabilities
  Orders of magnitude
  Relative values