INTRODUCTION TO GRAPHICAL MODELS
Slide credits: Kevin Murphy, Mark Pashkin, Zoubin Ghahramani and Jeff Bilmes
CS188: Computational Models of Human Behavior
Reasoning under uncertainty
• In many settings, we need to understand what is going on in a system when we have imperfect or incomplete information
• For example, we might deploy a burglar alarm to detect intruders
– But the sensor could be triggered by other events, e.g., an earthquake
• Probabilities quantify the uncertainties regarding the occurrence of events
Probability spaces
• A probability space represents our uncertainty regarding an experiment
• It has two parts:
– a sample space Ω, which is the set of outcomes
– the probability measure P, which is a real-valued function of the subsets of Ω
• A set of outcomes A ⊆ Ω is called an event. P(A) represents how likely it is that the experiment’s actual outcome will be a member of A
An example
• If our experiment is to deploy a burglar alarm and see if it works, then there could be four outcomes:
Ω = {(alarm, intruder), (no alarm, intruder), (alarm, no intruder), (no alarm, no intruder)}
• Our choice of P has to obey these simple rules …
The three axioms of probability theory
• P(A) ≥ 0 for all events A
• P(Ω) = 1
• P(A ∪ B) = P(A) + P(B) for disjoint events A and B
Some consequences of the axioms
Example
• Let’s assign a probability to each outcome ω
• These probabilities must be non-negative and sum to one
              intruder   no intruder
  alarm        0.002        0.003
  no alarm     0.001        0.994
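As a quick sanity check, the sketch below (in Python, with names of my own choosing) stores this table and verifies the axioms: every entry is non-negative and the total probability is one.

    # Joint probability table for the burglar-alarm experiment
    P = {
        ("alarm", "intruder"): 0.002,
        ("alarm", "no intruder"): 0.003,
        ("no alarm", "intruder"): 0.001,
        ("no alarm", "no intruder"): 0.994,
    }

    assert all(p >= 0 for p in P.values())     # P(A) >= 0 for all events
    assert abs(sum(P.values()) - 1.0) < 1e-12  # P(Omega) = 1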
Conditional Probability
• The conditional probability of A given B is P(A|B) = P(A ∩ B) / P(B), defined whenever P(B) > 0
Marginal probability
• Marginal probability is the unconditional probability P(A) of the event A; that is, the probability of A regardless of whether event B did or did not occur
• For example, if B′ is the complement of B (exactly one of the two must occur), then
– P(A) = P(A ∩ B) + P(A ∩ B′)
• This is called marginalization
Example
• If P is defined by

              intruder   no intruder
  alarm        0.002        0.003
  no alarm     0.001        0.994

then

  P({(alarm, intruder)} | {(alarm, intruder), (alarm, no intruder)})
    = P({(alarm, intruder)}) / P({(alarm, intruder), (alarm, no intruder)})
    = 0.002 / (0.002 + 0.003)
    = 0.4
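The same computation in Python, reusing the table P sketched earlier (illustrative, not library code); the denominator is exactly the marginalization step from the previous slide.

    # Event B = "the alarm went off"; event A = "alarm and intruder"
    B = [("alarm", "intruder"), ("alarm", "no intruder")]
    A = [("alarm", "intruder")]

    p_B = sum(P[w] for w in B)                  # P(B) = 0.002 + 0.003 = 0.005
    p_A_and_B = sum(P[w] for w in A if w in B)  # P(A ∩ B) = 0.002
    print(p_A_and_B / p_B)                      # P(A|B) = 0.4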
The product rule
• The probability that A and B both happen is the probability that A happens, times the probability that B happens given that A has occurred: P(A ∩ B) = P(A) P(B|A)
The chain rule
• Applying the product rule repeatedly:
P(A1,A2,…,Ak) = P(A1) P(A2|A1)P(A3|A2,A1)…P(Ak|Ak-1,…,A1)
• where P(A3|A2,A1) denotes P(A3 | A2 ∩ A1)
Bayes’ rule
• Use the product rule both ways on P(A ∩ B):
– P(A ∩ B) = P(A) P(B|A)
– P(A ∩ B) = P(B) P(A|B)
• Equating the two and dividing by P(B) gives Bayes’ rule: P(A|B) = P(A) P(B|A) / P(B)
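To see Bayes’ rule at work on the running alarm example, the sketch below recovers P(intruder | alarm) = 0.4 from the prior, the likelihood, and the marginal (all numbers taken from the table above):

    # Bayes' rule: P(intruder | alarm) = P(intruder) P(alarm | intruder) / P(alarm)
    p_intruder = 0.002 + 0.001                   # marginal P(intruder)
    p_alarm = 0.002 + 0.003                      # marginal P(alarm)
    p_alarm_given_intruder = 0.002 / p_intruder  # from the product rule
    print(p_intruder * p_alarm_given_intruder / p_alarm)  # 0.4, as before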
Random variables and densities
Inference
• One of the central problems of computational probability theory
• Many problems can be formulated in these terms. Examples:
– The probability that there is an intruder given that the alarm went off is pI|A(true, true)
• Inference requires manipulating densities
Probabilistic graphical models
• Combination of graph theory and probability theory
– Graph structure specifies which parts of the system are directly dependent
– Local functions at each node specify how the different parts interact
• Bayesian Networks = Probabilistic Graphical Models based on a directed acyclic graph
• Markov Networks = Probabilistic Graphical Models based on an undirected graph
Some broad questions
Bayesian Networks
• Nodes are random variables
• Edges represent dependence (no directed cycles allowed)
• P(X1:N) = P(X1) P(X2|X1) P(X3|X1,X2) ⋯ = ∏_{i=1}^{N} P(Xi | X1:i−1) = ∏_{i=1}^{N} P(Xi | Xπi), where πi are the parents of Xi in the graph

[Figure: example DAG over nodes x1–x7]
Example
• Water sprinkler Bayes net
P(C,S,R,W) = P(C) P(S|C) P(R|C,S) P(W|C,S,R)   (chain rule)
           = P(C) P(S|C) P(R|C) P(W|C,S,R)     (since R ⊥ S | C)
           = P(C) P(S|C) P(R|C) P(W|S,R)       (since W ⊥ C | R,S)
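A minimal sketch of this factorization in Python; the structure matches the slide, but the CPT numbers below are illustrative placeholders rather than values from the lecture.

    from itertools import product

    # Cloudy -> Sprinkler, Cloudy -> Rain, {Sprinkler, Rain} -> WetGrass
    P_C = {True: 0.5, False: 0.5}                    # P(C)
    P_S = {True: 0.1, False: 0.5}                    # P(S=true | C)
    P_R = {True: 0.8, False: 0.2}                    # P(R=true | C)
    P_W = {(True, True): 0.99, (True, False): 0.9,
           (False, True): 0.9, (False, False): 0.0}  # P(W=true | S, R)

    def joint(c, s, r, w):
        """P(C,S,R,W) = P(C) P(S|C) P(R|C) P(W|S,R)."""
        p = P_C[c]
        p *= P_S[c] if s else 1 - P_S[c]
        p *= P_R[c] if r else 1 - P_R[c]
        p *= P_W[(s, r)] if w else 1 - P_W[(s, r)]
        return p

    # Sanity check: the joint sums to 1 over all 16 assignments
    assert abs(sum(joint(*a) for a in product([True, False], repeat=4)) - 1) < 1e-12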
Inference
Naïve inference
Problem with naïve representation of the joint probability
• Problems with working directly with the joint probability:
– Representation: a big table of numbers is hard to understand
– Inference: computing a marginal P(Xi) takes O(2N) time
– Learning: there are O(2N) parameters to estimate
• Graphical models solve the above problems by providing a structured representation for the joint
• Graphs encode conditional independence properties and represent families of probability distributions that satisfy these properties
Bayesian networks provide a compact representation of the joint probability
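To make the savings concrete, here is a back-of-the-envelope count for the four-variable sprinkler net above (a sketch; one free parameter per Bernoulli CPT row):

    N = 4
    full_joint = 2**N - 1        # 15 free parameters in the raw joint table
    factored = 1 + 2 + 2 + 4     # P(C): 1, P(S|C): 2, P(R|C): 2, P(W|S,R): 4
    print(full_joint, factored)  # 15 vs. 9 -- and the gap grows exponentially in N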
Conditional probabilities
Another example: medical diagnosis (classification)
Approach: build a Bayes’ net and use Bayes’ rule to get the class probability
A very simple Bayes’ net: Naïve Bayes
Naïve Bayes classifier for medical diagnosis
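A minimal naïve Bayes sketch for diagnosis, assuming binary symptoms that are conditionally independent given the class; every probability below is a hypothetical placeholder.

    # Naive Bayes: P(class | symptoms) ∝ P(class) * prod_i P(symptom_i | class)
    prior = {"flu": 0.1, "healthy": 0.9}
    likelihood = {                                  # P(symptom present | class)
        "flu":     {"fever": 0.9,  "cough": 0.8},
        "healthy": {"fever": 0.05, "cough": 0.1},
    }

    def posterior(symptoms):                        # symptoms: name -> bool
        scores = {}
        for c in prior:
            p = prior[c]
            for s, present in symptoms.items():
                p *= likelihood[c][s] if present else 1 - likelihood[c][s]
            scores[c] = p
        z = sum(scores.values())                    # normalize (Bayes' rule)
        return {c: p / z for c, p in scores.items()}

    print(posterior({"fever": True, "cough": True}))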
Another commonly used Bayes’ net: Hidden Markov Model (HMM)
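An HMM factors as P(Z1:T, X1:T) = P(Z1) ∏t P(Zt|Zt−1) ∏t P(Xt|Zt). The sketch below runs the standard forward recursion for filtering on a two-state model; all matrices are hypothetical.

    import numpy as np

    pi = np.array([0.6, 0.4])                # P(Z_1)
    A = np.array([[0.7, 0.3], [0.4, 0.6]])   # A[i, j] = P(Z_t = j | Z_{t-1} = i)
    B = np.array([[0.9, 0.1], [0.2, 0.8]])   # B[i, k] = P(X_t = k | Z_t = i)

    def forward(obs):
        """Return P(Z_t | X_{1:t}) for each t (normalized forward messages)."""
        alpha = pi * B[:, obs[0]]
        alphas = [alpha / alpha.sum()]
        for x in obs[1:]:
            alpha = (alphas[-1] @ A) * B[:, x]
            alphas.append(alpha / alpha.sum())
        return alphas

    print(forward([0, 1, 1]))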
Conditional independence properties of Bayesian networks: chains
Conditional independence properties of Bayesian networks: common cause
Conditional independence properties of Bayesian networks: explaining away
Global Markov properties of DAGs
Bayes ball algorithm
Example
Undirected graphical models
Parameterization
Clique potentials
Interpretation of clique potentials
Examples
Joint distribution of an undirected graphical model
Complexity scales exponentially, as 2^n for n binary random variables, if we use a naïve approach to computing the partition function
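The cost is easy to see in code: a brute-force partition function for a chain of n binary variables with pairwise potentials enumerates all 2^n joint states (the potential values below are hypothetical).

    import itertools

    n = 10
    psi = {(0, 0): 2.0, (0, 1): 0.5,         # pairwise potential psi(x_i, x_{i+1})
           (1, 0): 0.5, (1, 1): 2.0}

    Z = 0.0
    for x in itertools.product([0, 1], repeat=n):   # 2**n terms
        score = 1.0
        for i in range(n - 1):
            score *= psi[(x[i], x[i + 1])]
        Z += score
    print(Z)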
Max clique vs. sub-clique
Log-linear models
Summary
From directed to undirected graphs
Example of moralization
Comparing directed and undirected models
Expressive power
[Figure: two example graphs, an undirected one over x, y, w, z and a directed one over x, y, z]
Coming back to inference
Belief propagation in trees
Learning
Parameter Estimation
Maximum-likelihood Estimation (MLE)
Example: 1-D Gaussian
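For a 1-D Gaussian the MLE is available in closed form: mu_hat = (1/N) Σ x_n and var_hat = (1/N) Σ (x_n − mu_hat)². A quick check on made-up data:

    import numpy as np

    x = np.array([1.2, 0.7, 2.1, 1.5, 0.9])  # made-up observations
    mu_hat = x.mean()                         # MLE of the mean
    var_hat = ((x - mu_hat) ** 2).mean()      # MLE of the variance (1/N, biased)
    print(mu_hat, var_hat)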
MLE for Bayes’ Net
MLE for Bayes’ Net with Discrete Nodes
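With fully observed discrete nodes, the MLE of each CPT entry is a normalized count: P̂(Xi = x | parents = u) = N(x, u) / N(u). A sketch for the sprinkler’s P(S|C) on made-up samples:

    from collections import Counter

    # Fully observed (cloudy, sprinkler) samples; made-up data
    data = [(True, False), (True, False), (True, True),
            (False, True), (False, True), (False, False)]
    counts = Counter(data)

    for c in (True, False):
        n_c = sum(v for (ci, _), v in counts.items() if ci == c)
        print(f"P(S=true | C={c}) = {counts[(c, True)] / n_c:.2f}")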
Parameter Estimation with Hidden Nodes
[Figure: network with hidden nodes Z1–Z6 and Z]
Why is learning harder?
Where do hidden variables come from?
EM
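With hidden nodes there is no closed-form MLE; EM alternates between inferring the hidden variables given the current parameters (E-step) and re-estimating parameters as if those soft assignments were observed (M-step). A minimal sketch for a two-component 1-D Gaussian mixture; data and initialization are made up.

    import numpy as np

    x = np.array([-2.1, -1.8, -2.4, 1.9, 2.2, 1.7])  # made-up data
    mu = np.array([-1.0, 1.0])                       # initial means
    var = np.array([1.0, 1.0])
    w = np.array([0.5, 0.5])                         # mixing weights

    for _ in range(20):
        # E-step: responsibilities r[n, k] = P(z_n = k | x_n)
        r = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(var)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted MLE, treating responsibilities as soft counts
        Nk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / Nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
        w = Nk / len(x)

    print(mu, var, w)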
Different Learning Conditions
               Observability
Structure      Full             Partial
Known          Closed form      EM
Unknown        Local search     Structural EM