TRANSCRIPT
Introduction to Probabilistic Graphical Models
Eran Segal
Weizmann Institute
Logistics
Staff:
  Instructor: Eran Segal ([email protected], room 149)
  Teaching Assistant: Ohad Manor ([email protected], room 125)
Course information: http://www.weizmann.ac.il/math/pgm
Course book: “Bayesian Networks and Beyond”, Daphne Koller (Stanford) & Nir Friedman (Hebrew U.)
Course structure
One weekly meeting: Sun 9am-11am
Homework assignments: 2 weeks to complete each, 40% of final grade
Final exam: 3-hour class exam (date will be announced), 60% of final grade
Probabilistic Graphical Models
Tool for representing complex systems and performing sophisticated reasoning tasks
Fundamental notion: Modularity. Complex systems are built by combining simpler parts.
Why have a model?
  Compact and modular representation of complex systems
  Ability to execute complex reasoning patterns
  Make predictions
  Generalize from the particular problem
Probabilistic Graphical Models
Increasingly important in Machine Learning
Many classical probabilistic problems in statistics, information theory, pattern recognition, and statistical mechanics are special cases of the formalism
Graphical models provide a common framework
Advantage: specialized techniques developed in one field can be transferred between research communities
Representation: Graphs
Intuitive data structure for modeling highly-interacting sets of variables
Explicit model for modularity
Data structure that allows for design of efficient general-purpose algorithms
Reasoning: Probability Theory
Well-understood framework for modeling uncertainty:
  Partial knowledge of the state of the world
  Noisy observations
  Phenomena not covered by our model
  Inherent stochasticity
Clear semantics
Can be learned from data
Probabilistic Reasoning
In this course we will learn:
  Semantics of probabilistic graphical models (PGMs): Bayesian networks, Markov networks
  Answering queries in a PGM (“inference”)
  Learning PGMs from data (“learning”)
  Modeling temporal processes with PGMs: Hidden Markov Models (HMMs) as a special case
Course Outline
Week | Topic                                          | Reading
1    | Introduction, Bayesian network representation  | 1-3
2    | Bayesian network representation cont.          | 1-3
3    | Local probability models                       | 5
4    | Undirected graphical models                    | 4
5    | Exact inference                                | 9, 10
6    | Exact inference cont.                          | 9, 10
7    | Approximate inference                          | 12
8    | Approximate inference cont.                    | 12
9    | Learning: Parameters                           | 16, 17
10   | Learning: Parameters cont.                     | 16, 17
11   | Learning: Structure                            | 18
12   | Partially observed data                        | 19
13   | Learning undirected graphical models           | 20
14   | Template models                                | 6
15   | Dynamic Bayesian networks                      | 15
A Simple Example
We want to model whether our neighbor will inform us of the alarm being set off
The alarm can be set off if:
  There is a burglary
  There is an earthquake
Whether our neighbor calls depends on whether the alarm is set off
A Simple Example
Variables:
Earthquake (E), Burglary (B), Alarm (A), NeighborCalls (N)
E B A N Prob.
F F F F 0.01
F F F T 0.04
F F T F 0.05
F F T T 0.01
F T F F 0.02
F T F T 0.07
F T T F 0.2
F T T T 0.1
T F F F 0.01
T F F T 0.07
T F T F 0.13
T F T T 0.04
T T F F 0.06
T T F T 0.05
T T T F 0.1
T T T T 0.05
2^4 - 1 = 15 independent parameters
A Simple Example
Network structure: Earthquake → Alarm ← Burglary, Alarm → NeighborCalls

P(E): E=F 0.9, E=T 0.1
P(B): B=F 0.7, B=T 0.3
P(A | E, B):
  E B | A=F  A=T
  F F | 0.99 0.01
  F T | 0.1  0.9
  T F | 0.3  0.7
  T T | 0.01 0.99
P(N | A):
  A | N=F N=T
  F | 0.9 0.1
  T | 0.2 0.8

8 independent parameters (1 for E, 1 for B, 4 for A given E and B, 2 for N given A)
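To make the factorization concrete, here is a minimal Python sketch (the helper names are ours; the CPD values are the ones in the tables above) that computes the joint P(E, B, A, N) as P(E) P(B) P(A | E, B) P(N | A):

```python
# Joint probability of the alarm example via the BN factorization
# P(E, B, A, N) = P(E) * P(B) * P(A | E, B) * P(N | A)
from itertools import product

P_E = {True: 0.1, False: 0.9}                 # P(Earthquake = T/F)
P_B = {True: 0.3, False: 0.7}                 # P(Burglary = T/F)
P_A = {                                       # P(Alarm = T | Earthquake, Burglary)
    (False, False): 0.01, (False, True): 0.9,
    (True, False): 0.7,   (True, True): 0.99,
}
P_N = {False: 0.1, True: 0.8}                 # P(NeighborCalls = T | Alarm)

def joint(e, b, a, n):
    p_a = P_A[(e, b)] if a else 1 - P_A[(e, b)]
    p_n = P_N[a] if n else 1 - P_N[a]
    return P_E[e] * P_B[b] * p_a * p_n

# Burglary, no earthquake, alarm goes off, neighbor calls:
print(joint(e=False, b=True, a=True, n=True))   # 0.9 * 0.3 * 0.9 * 0.8 = 0.1944

# Sanity check: the 16 joint entries sum to 1
print(sum(joint(e, b, a, n) for e, b, a, n in product([False, True], repeat=4)))
```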
Example Bayesian Network
The “ALARM” network for monitoring intensive care patients
37 variables, 509 parameters (full joint: 2^37)
[Figure: the ALARM network graph over its 37 variables, including PCWP, CO, HRBP, HREKG, HRSAT, SAO2, EXPCO2, ARTCO2, VENTALV, VENTLUNG, DISCONNECT, MINVOLSET, PAP, SHUNT, ANAPHYLAXIS, PVSAT, FIO2, PRESS, LVFAILURE, HYPOVOLEMIA, CVP, and BP]
Application: Clustering Users
Input: TV shows that each user watches
Output: TV show “clusters”
Assumption: shows watched by the same users are similar

Class 1: Power rangers, Animaniacs, X-men, Tazmania, Spider man
Class 2: Young and restless, Bold and the beautiful, As the world turns, Price is right, CBS eve news
Class 3: Tonight show, Conan O’Brien, NBC nightly news, Later with Kinnear, Seinfeld
Class 4: 60 minutes, NBC nightly news, CBS eve news, Murder she wrote, Matlock
Class 5: Seinfeld, Friends, Mad about you, ER, Frasier
App.: Recommendation Systems
Given user preferences, suggest recommendations
Example: Amazon.com
Input: movie preferences of many users
Solution: model correlations between movie features
  Users that like comedy often like drama
  Users that like action often do not like cartoons
  Users that like Robert De Niro films often like Al Pacino films
Given user preferences, can predict the probability that new movies match the preferences
Diagnostic Systems
Diagnostic indexing for the home health site at Microsoft
  Enter symptoms → recommend multimedia content
Online troubleshooters
App.: Finding Regulatory Networks
Expression level in each module is a function of the expression of its regulators: P(Level | Module, Regulators)
[Figure: module network over the variables Experiment, Gene, Module, Regulator1, Regulator2, Regulator3, and expression Level; the model asks questions such as “What module does gene g belong to?” and “What is the expression level of Regulator1 in this experiment?”, with example regulators BMH1, GIC2, HAP4, CMK1]
App.: Finding Regulatory Networks
[Figure: inferred regulatory network over numbered modules, grouped into amino acid metabolism, energy and cAMP signaling, and DNA and RNA processing; regulators shown include Hap4, Xbp1, Yap6, Gat1, Ime4, Lsg1, Msn4, Gac1, Gis1, Not3, Sip2, Kin82, Cmk1, Tpk1, Ppt1, Tpk2, Pph3, Bmh1, and Gcn20, together with enriched cis-regulatory motifs]
Legend:
  Regulator (signaling molecule)
  Regulator (transcription factor)
  Inferred regulation
  Regulation supported in literature
  Experimentally tested regulator
  Module (number)
  Enriched cis-regulatory motif
Prerequisites
Probability theory: conditional probabilities, joint distributions, random variables
Information theory
Function optimization
Graph theory
Computational complexity
Probability Theory
A probability distribution P over (Ω, S) is a mapping from events in S such that:
  P(α) ≥ 0 for all α ∈ S
  P(Ω) = 1
  If α, β ∈ S and α ∩ β = ∅, then P(α ∪ β) = P(α) + P(β)
Conditional Probability: P(α | β) = P(α ∩ β) / P(β)
Chain Rule: P(α ∩ β) = P(α) P(β | α)
Bayes Rule: P(α | β) = P(β | α) P(α) / P(β)
Conditional Independence (α ⊥ β | γ): P(α | β ∩ γ) = P(α | γ)
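As a quick illustration (the joint table below is our own toy example, not from the slides), the chain rule and Bayes rule can be checked numerically:

```python
# Toy joint distribution over two binary events a and b
P_ab = {(True, True): 0.1, (True, False): 0.2,
        (False, True): 0.3, (False, False): 0.4}

P_a = sum(p for (a, b), p in P_ab.items() if a)     # marginal P(a)
P_b = sum(p for (a, b), p in P_ab.items() if b)     # marginal P(b)
P_a_given_b = P_ab[(True, True)] / P_b              # P(a | b)
P_b_given_a = P_ab[(True, True)] / P_a              # P(b | a)

# Chain rule: P(a ∩ b) = P(a) P(b | a)
assert abs(P_ab[(True, True)] - P_a * P_b_given_a) < 1e-12
# Bayes rule: P(a | b) = P(b | a) P(a) / P(b)
assert abs(P_a_given_b - P_b_given_a * P_a / P_b) < 1e-12
```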
Random Variables & Notation
Random variable: a function from Ω to a value
  Categorical / Ordinal / Continuous
Val(X): the set of possible values of RV X
Upper case letters denote RVs (e.g., X, Y, Z)
Upper case bold letters denote sets of RVs (e.g., X, Y)
Lower case letters denote RV values (e.g., x, y, z)
Lower case bold letters denote RV set values (e.g., x)
Values of a categorical RV with |Val(X)| = k: x1, x2, ..., xk
Marginal distribution over X: P(X)
Conditional independence: X is independent of Y given Z in P if
  P(X=x | Y=y, Z=z) = P(X=x | Z=z) for all x ∈ Val(X), y ∈ Val(Y), z ∈ Val(Z)
Expectation
Discrete RVs: E_P[X] = Σ_{x ∈ Val(X)} x P(x)
Continuous RVs: E_P[X] = ∫ x p(x) dx
Linearity of expectation:
  E[X + Y] = Σ_{x,y} (x + y) P(x, y)
           = Σ_{x,y} x P(x, y) + Σ_{x,y} y P(x, y)
           = Σ_x x P(x) + Σ_y y P(y)
           = E[X] + E[Y]
Expectation of products (when X ⊥ Y in P):
  E[XY] = Σ_{x,y} x y P(x, y)
        = Σ_{x,y} x y P(x) P(y)    (independence assumption)
        = (Σ_x x P(x)) (Σ_y y P(y))
        = E[X] E[Y]
Variance
Variance of an RV:
  Var_P[X] = E[(X - E[X])^2]
           = E[X^2 - 2 X E[X] + E[X]^2]
           = E[X^2] - 2 E[X] E[X] + E[X]^2
           = E[X^2] - E[X]^2
If X and Y are independent: Var[X + Y] = Var[X] + Var[Y]
Var[aX + b] = a^2 Var[X]
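A quick Monte Carlo check of these identities (our own illustration; any two independent RVs would do):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(0, 4, size=1_000_000)     # X uniform on {0,1,2,3}
y = rng.integers(0, 3, size=1_000_000)     # Y uniform on {0,1,2}, drawn independently of X

print(np.mean(x + y), np.mean(x) + np.mean(y))   # E[X+Y] ≈ E[X] + E[Y]
print(np.mean(x * y), np.mean(x) * np.mean(y))   # E[XY] ≈ E[X] E[Y]  (X ⊥ Y)
print(np.var(x + y), np.var(x) + np.var(y))      # Var[X+Y] ≈ Var[X] + Var[Y]
print(np.var(3 * x + 5), 9 * np.var(x))          # Var[aX+b] = a^2 Var[X]
```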
Information Theory
Entropy: H_P(X) = -Σ_x P(x) log P(x)
  We use log base 2 to interpret entropy as bits of information
  Entropy of X is a lower bound on the average number of bits needed to encode values of X
  0 ≤ H_P(X) ≤ log |Val(X)| for any distribution P(X)
Conditional entropy:
  H_P(X | Y) = H_P(X, Y) - H_P(Y) = -Σ_{x,y} P(x, y) log P(x | y)
Information only helps: H_P(X | Y) ≤ H_P(X)
Mutual information:
  I_P(X; Y) = H_P(X) - H_P(X | Y) = Σ_{x,y} P(x, y) log [P(x | y) / P(x)]
  0 ≤ I_P(X; Y) ≤ H_P(X)
  Symmetry: I_P(X; Y) = I_P(Y; X)
  I_P(X; Y) = 0 iff X and Y are independent
Chain rule of entropies:
  H_P(X1, ..., Xn) = H_P(X1) + H_P(X2 | X1) + ... + H_P(Xn | X1, ..., Xn-1)
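These quantities are easy to compute from a joint table; a small sketch (the joint values are our own toy numbers):

```python
import numpy as np

P_xy = np.array([[0.3, 0.2],    # joint P(X, Y), binary X (rows) and Y (columns)
                 [0.1, 0.4]])
P_x, P_y = P_xy.sum(axis=1), P_xy.sum(axis=0)

def H(p):
    return -np.sum(p * np.log2(p))          # entropy in bits (assumes all entries > 0)

H_x, H_y, H_xy = H(P_x), H(P_y), H(P_xy)
H_x_given_y = H_xy - H_y                    # H(X | Y) = H(X, Y) - H(Y)
I_xy = H_x - H_x_given_y                    # I(X; Y) = H(X) - H(X | Y)

print(H_x, H_x_given_y, I_xy)               # information only helps: H(X|Y) <= H(X)
print(np.isclose(I_xy, H_x + H_y - H_xy))   # symmetry: I(X;Y) = I(Y;X)
```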
Distances Between Distributions
Relative Entropy (KL divergence):
  D(P||Q) = Σ_x P(x) log [P(x) / Q(x)]
  D(P||Q) ≥ 0, and D(P||Q) = 0 iff P = Q
  Not a distance metric (no symmetry and no triangle inequality)
L1 distance: ||P - Q||_1 = Σ_x |P(x) - Q(x)|
L2 distance: ||P - Q||_2 = (Σ_x (P(x) - Q(x))^2)^(1/2)
L∞ distance: ||P - Q||_∞ = max_x |P(x) - Q(x)|
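Computing these distances for two small distributions (the values are illustrative, not from the slides):

```python
import numpy as np

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.4, 0.4, 0.2])

kl   = np.sum(P * np.log2(P / Q))        # D(P||Q); assumes Q(x) > 0 wherever P(x) > 0
l1   = np.sum(np.abs(P - Q))             # L1 distance
l2   = np.sqrt(np.sum((P - Q) ** 2))     # L2 distance
linf = np.max(np.abs(P - Q))             # L-infinity distance

print(kl, np.sum(Q * np.log2(Q / P)))    # D(P||Q) != D(Q||P): relative entropy is not symmetric
print(l1, l2, linf)
```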
Optimization Theory
Find values θ1, ..., θn such that f(θ1, ..., θn) = max_{θ'1, ..., θ'n} f(θ'1, ..., θ'n)
Optimization strategies:
  Solve the gradient analytically and verify a local maximum
  Gradient search: guess initial values, and improve iteratively
    Gradient ascent
    Line search
    Conjugate gradient
  Lagrange multipliers: solve a maximization problem with constraints C: c_j(θ) = 0 for j = 1, ..., n
    Maximize the Lagrangian L(θ, λ) = f(θ) + Σ_j λ_j c_j(θ)
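A minimal gradient-ascent sketch on a toy concave objective (the function and step size are our own choices, only to illustrate the “guess and improve iteratively” strategy):

```python
def grad_f(t):
    return -2.0 * (t - 3.0)      # derivative of f(t) = -(t - 3)^2, maximized at t = 3

t, step = 0.0, 0.1               # initial guess and learning rate
for _ in range(100):             # improve the guess iteratively
    t += step * grad_f(t)

print(t)                         # converges to the maximizer t = 3
```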
Graph Theory
Undirected graph
Directed graph
Complete graph (every two nodes connected)
Acyclic graph
Partially directed acyclic graph (PDAG)
Induced graph
Sub-graph
Graph algorithms: shortest path from node X1 to all other nodes (BFS)
Representing Joint Distributions
Random variables: X1, ..., Xn
P is a joint distribution over X1, ..., Xn
If X1, ..., Xn are binary, we need 2^n parameters to describe P
Can we represent P more compactly?
  Key: exploit independence properties
Independent Random Variables
Two variables X and Y are independent if
  P(X=x | Y=y) = P(X=x) for all values x, y
  Equivalently, knowing Y does not change predictions of X
If X and Y are independent then: P(X, Y) = P(X|Y) P(Y) = P(X) P(Y)
If X1, ..., Xn are independent then: P(X1, ..., Xn) = P(X1) ... P(Xn)
  O(n) parameters
  All 2^n probabilities are implicitly defined
  Cannot represent many types of distributions
Conditional Independence
X and Y are conditionally independent given Z if
  P(X=x | Y=y, Z=z) = P(X=x | Z=z) for all values x, y, z
  Equivalently, if we know Z, then knowing Y does not change predictions of X
Notation: Ind(X; Y | Z) or (X ⊥ Y | Z)
Conditional Parameterization
S = score on test, Val(S) = {s0, s1}
I = intelligence, Val(I) = {i0, i1}

Joint parameterization, P(I,S):
  I  S  | P(I,S)
  i0 s0 | 0.665
  i0 s1 | 0.035
  i1 s0 | 0.06
  i1 s1 | 0.24
  3 independent parameters

Conditional parameterization, P(I,S) = P(I) P(S|I):
  P(I): i0 0.7, i1 0.3
  P(S|I):
    I  | s0   s1
    i0 | 0.95 0.05
    i1 | 0.2  0.8
  3 independent parameters

Alternative parameterization: P(S) and P(I|S)
Conditional Parameterization
S = score on test, Val(S) = {s0, s1}
I = intelligence, Val(I) = {i0, i1}
G = grade, Val(G) = {g0, g1, g2}
Assume that G and S are independent given I
Joint parameterization: 2·2·3 - 1 = 11 independent parameters
Conditional parameterization: P(I,S,G) = P(I) P(S|I) P(G|I,S) = P(I) P(S|I) P(G|I)
  P(I): 1 independent parameter
  P(S|I): 2·1 = 2 independent parameters
  P(G|I): 2·2 = 4 independent parameters
  Total: 7 independent parameters
Naïve Bayes Model
Class variable C, Val(C) = {c1, ..., ck}
Evidence variables X1, ..., Xn
Naïve Bayes assumption: evidence variables are conditionally independent given C
  P(C, X1, ..., Xn) = P(C) Π_{i=1..n} P(Xi | C)
Applications in medical diagnosis, text classification
Used as a classifier:
  P(C=c1 | x1, ..., xn) / P(C=c2 | x1, ..., xn) = [P(C=c1) / P(C=c2)] Π_i [P(xi | C=c1) / P(xi | C=c2)]
Problem: double counting of correlated evidence
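A minimal sketch of such a classifier (toy prior and CPDs of our own choosing, binary evidence variables):

```python
import numpy as np

prior = {"c1": 0.6, "c2": 0.4}                    # P(C)
likelihood = {"c1": np.array([0.8, 0.3, 0.9]),    # P(Xi = 1 | C = c1)
              "c2": np.array([0.2, 0.6, 0.5])}    # P(Xi = 1 | C = c2)

def posterior(x):
    """P(C | x1, ..., xn), computed from P(C) * prod_i P(xi | C) and normalized."""
    x = np.asarray(x)
    scores = {}
    for c in prior:
        p_x_given_c = np.where(x == 1, likelihood[c], 1 - likelihood[c])
        scores[c] = prior[c] * np.prod(p_x_given_c)
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

print(posterior([1, 0, 1]))                       # most of the posterior mass goes to c1
```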
Bayesian Network (Informal)
Directed acyclic graph G
  Nodes represent random variables
  Edges represent direct influences between random variables
Local probability models
[Figure: Example 1 (I → S), Example 2 (I → S, I → G), and the Naïve Bayes structure (C → X1, X2, ..., Xn)]
Bayesian Network (Informal)
Represents a joint distribution
  Specifies the probability P(X=x)
  Specifies the probability P(X=x | E=e)
Allows for reasoning patterns
  Prediction (e.g., intelligent → high scores)
  Explanation (e.g., low score → not intelligent)
  Explaining away (different causes for the same effect interact)
[Figure: Example 2 (I → S, I → G)]
Bayesian Network Structure
Directed acyclic graph G
  Nodes X1, ..., Xn represent random variables
G encodes the local Markov assumptions:
  Xi is independent of its non-descendants given its parents
  Formally: (Xi ⊥ NonDesc(Xi) | Pa(Xi))
[Figure: example DAG over nodes A, B, C, D, E, F, G, illustrating (E ⊥ {A, C, D, F} | B)]
Independency Mappings (I-Maps)
Let P be a distribution over X
Let I(P) be the set of independencies (X ⊥ Y | Z) that hold in P
A Bayesian network structure G is an I-map (independency mapping) of P if I(G) ⊆ I(P)

Example 1: graph with no edge between I and S, so I(G) = {I ⊥ S}
  I  S  | P(I,S)
  i0 s0 | 0.25
  i0 s1 | 0.25
  i1 s0 | 0.25
  i1 s1 | 0.25
  Here I(P) = {I ⊥ S}, so G is an I-map of P

Example 2: graph with an edge I → S, so I(G) = ∅
  I  S  | P(I,S)
  i0 s0 | 0.4
  i0 s1 | 0.3
  i1 s0 | 0.2
  i1 s1 | 0.1
  Here I(P) = ∅ (I and S are dependent), so G is an I-map of P
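Whether a given joint table satisfies (I ⊥ S) can be checked directly; a small sketch using the two tables above (the helper function is ours):

```python
import numpy as np

def independent(joint, tol=1e-9):
    """True if the 2D joint table equals the product of its marginals."""
    p_i = joint.sum(axis=1, keepdims=True)
    p_s = joint.sum(axis=0, keepdims=True)
    return np.allclose(joint, p_i * p_s, atol=tol)

P1 = np.array([[0.25, 0.25], [0.25, 0.25]])   # uniform example: I(P) = {I ⊥ S}
P2 = np.array([[0.4, 0.3], [0.2, 0.1]])       # second example: I and S are dependent

print(independent(P1))   # True
print(independent(P2))   # False
```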
Factorization Theorem
If G is an I-map of P, then P(X1, ..., Xn) = Π_{i=1..n} P(Xi | Pa(Xi))
Proof:
  wlog X1, ..., Xn is an ordering consistent with G
  By the chain rule: P(X1, ..., Xn) = Π_{i=1..n} P(Xi | X1, ..., Xi-1)
  From the ordering assumption: Pa(Xi) ⊆ {X1, ..., Xi-1} and {X1, ..., Xi-1} ⊆ Pa(Xi) ∪ NonDesc(Xi)
  Since G is an I-map: (Xi ⊥ NonDesc(Xi) | Pa(Xi)) ∈ I(P)
  Hence P(Xi | X1, ..., Xi-1) = P(Xi | Pa(Xi))
Factorization Implies I-Map
If P(X1, ..., Xn) = Π_{i=1..n} P(Xi | Pa(Xi)), then G is an I-map of P
Proof:
  Need to show (Xi ⊥ NonDesc(Xi) | Pa(Xi)) ∈ I(P), i.e., that P(Xi | NonDesc(Xi)) = P(Xi | Pa(Xi))
  wlog X1, ..., Xn is an ordering consistent with G in which NonDesc(Xi) = {X1, ..., Xi-1} (the descendants of Xi are placed after it and marginalize out)
  P(Xi | NonDesc(Xi)) = P(Xi, NonDesc(Xi)) / P(NonDesc(Xi))
                      = [Π_{k=1..i} P(Xk | Pa(Xk))] / [Π_{k=1..i-1} P(Xk | Pa(Xk))]
                      = P(Xi | Pa(Xi))
Bayesian Network Definition
A Bayesian network is a pair (G, P)
  P factorizes over G
  P is specified as a set of CPDs associated with G's nodes
Parameters
Joint distribution: 2^n
Bayesian network (bounded in-degree k): n·2^k
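For a feel of the gap, assuming binary variables and taking n = 37 (as in the ALARM example) with an illustrative in-degree bound of k = 4:

```python
n, k = 37, 4                 # k = 4 is an assumed bound, for illustration only

joint_params = 2 ** n        # full joint over n binary variables
bn_params = n * 2 ** k       # Bayesian network bound with in-degree <= k

print(joint_params)          # 137438953472
print(bn_params)             # 592
```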
Bayesian Network Design
Variable considerations
  Clarity test: can an omniscient being determine its value?
  Hidden variables?
  Irrelevant variables
Structure considerations
  Causal order of variables
  Which independencies (approximately) hold?
Probability considerations
  Zero probabilities
  Orders of magnitude
  Relative values