graphical models - department of systems engineering and...
TRANSCRIPT
Source: seem5680/lecture/graphical-models-brief-2018.pdf
Graphical Models: A Brief Introduction
Reference: Pattern Recognition and Machine Learning, by C.M. Bishop, Springer, Chapter 8.2
https://www.microsoft.com/en-us/research/wp-content/uploads/2016/05/Bishop-PRML-sample.pdf
Probabilistic Model <-> Real World Data
P(Data | Parameters): Generative Model, Probability
P(Parameters | Data): Inference, Statistics
Notation and Definitions
• X is a random variable
– Lower-case x is some possible value for X
– "X = x" is a logical proposition: that X takes value x
– There is uncertainty about the value of X
• e.g., X is the Hang Seng index at 5pm tomorrow
• p(X = x) is the probability that the proposition X = x is true
– often shortened to p(x)
• If the set of possible x's is finite, we have a probability distribution and Σx p(x) = 1
• If the set of possible x's is infinite, p(x) is a density function, and p(x) integrates to 1 over the range of X
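Both normalization facts can be checked numerically; the following is an illustrative NumPy sketch (not part of the original slides; the die and Gaussian examples are arbitrary choices):

```python
import numpy as np

# Finite case: a distribution over the 6 faces of a fair die sums to 1.
p_die = np.full(6, 1.0 / 6.0)
total = p_die.sum()

# Infinite case: a standard Gaussian density integrates to 1 over the
# range of X (approximated here by a Riemann sum on a wide grid).
x = np.linspace(-10.0, 10.0, 100_001)
density = np.exp(-x**2 / 2.0) / np.sqrt(2.0 * np.pi)
integral = float((density * (x[1] - x[0])).sum())
```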
Multiple Variables
• p(x, y, z)
– Probability that X=x AND Y=y AND Z=z
– Possible values: cross-product of the ranges of X, Y, Z
– e.g., X, Y, Z each take 10 possible values
• (x,y,z) can take 10^3 possible values
• p(x,y,z) is a 3-dimensional array/table
– Defines 10^3 probabilities
• Note the exponential increase as we add more variables
– e.g., X, Y, Z are all real-valued
• (x,y,z) lives in a 3-dimensional vector space
• p(x,y,z) is a positive function defined over this space that integrates to 1
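The 3-dimensional table described above is easy to build explicitly; a NumPy sketch (the random seed and values are arbitrary, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)

# Joint distribution over X, Y, Z, each taking 10 possible values:
# a 3-dimensional table holding 10**3 = 1000 probabilities.
joint = rng.random((10, 10, 10))
joint /= joint.sum()        # normalize so the whole table sums to 1

n_entries = joint.size      # 1000 numbers must be specified
```

Adding a fourth 10-valued variable would multiply the table size by 10, which is the exponential blow-up the slide warns about.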
Conditional Probability
• p(x | y, z)
– Probability of x given that Y=y and Z=z
– Could be
• hypothetical, e.g., "if Y=y and if Z=z"
• observational, e.g., we observed values y and z
– can also have p(x, y | z), etc.
– "all probabilities are conditional probabilities"
• Computing conditional probabilities is the basis of many prediction and learning problems, e.g.,
– p(DJI tomorrow | DJI index last week)
– expected value of [DJI tomorrow | DJI index last week]
– most likely value of a parameter given observed data
Computing Conditional Probabilities
• Variables A, B, C, D
– All distributions of interest relating A, B, C, D can be computed from the full joint distribution p(a,b,c,d)
• Examples, using the Law of Total Probability
– p(a) = Σ{b,c,d} p(a, b, c, d)
– p(c,d) = Σ{a,b} p(a, b, c, d)
– p(a,c | d) = Σ{b} p(a, b, c | d), where p(a, b, c | d) = p(a,b,c,d) / p(d)
• These are standard probability manipulations; however, we will see how to use them to make inferences about parameters and unobserved variables, given data
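These manipulations map directly onto array operations, with marginalization as summation over axes and conditioning as division. A hedged NumPy sketch (variable arities and values chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(1)

# Full joint p(a, b, c, d) over four variables with 3 values each.
joint = rng.random((3, 3, 3, 3))
joint /= joint.sum()

# Law of Total Probability: sum out the unwanted variables.
p_a = joint.sum(axis=(1, 2, 3))            # p(a) = sum_{b,c,d} p(a,b,c,d)
p_cd = joint.sum(axis=(0, 1))              # p(c,d) = sum_{a,b} p(a,b,c,d)
p_d = joint.sum(axis=(0, 1, 2))            # p(d)

# Conditioning: p(a,b,c | d) = p(a,b,c,d) / p(d), then sum out b.
p_abc_given_d = joint / p_d                # broadcasts over the last axis (d)
p_ac_given_d = p_abc_given_d.sum(axis=1)   # p(a,c | d) = sum_b p(a,b,c | d)
```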
Two Practical Problems
(Assume for simplicity each variable takes K values)
• Problem 1: Computational Complexity
– Conditional probability computations scale as O(K^N), where N is the number of variables being summed over
• Problem 2: Model Specification
– To specify a joint distribution we need a table of O(K^N) numbers
– Where do these numbers come from?
Two Key Ideas
• Problem 1: Computational Complexity
– Idea: Graphical models
• Structured probability models lead to tractable inference
• Problem 2: Model Specification
– Idea: Probabilistic learning
• General principles for learning from data
Conditional Independence
• A is conditionally independent of B given C iff
p(a | b, c) = p(a | c)
(this also implies that B is conditionally independent of A given C)
• In words, B provides no information about A if the value of C is known
• Example:
– a = "reading ability"
– b = "height"
– c = "age"
• Note that conditional independence does not imply marginal independence
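Both claims can be verified numerically by constructing a joint of the form p(c) p(a|c) p(b|c), under which A and B are conditionally independent given C but generally dependent marginally. An illustrative NumPy sketch with arbitrary random tables:

```python
import numpy as np

rng = np.random.default_rng(2)

# Build a joint in which A and B are conditionally independent given C:
# p(a, b, c) = p(c) p(a | c) p(b | c), all variables binary.
p_c = np.array([0.3, 0.7])
p_a_given_c = rng.random((2, 2)); p_a_given_c /= p_a_given_c.sum(axis=0)
p_b_given_c = rng.random((2, 2)); p_b_given_c /= p_b_given_c.sum(axis=0)

joint = np.einsum('c,ac,bc->abc', p_c, p_a_given_c, p_b_given_c)

# p(a | b, c) = p(a,b,c) / p(b,c) should not depend on b.
p_bc = joint.sum(axis=0)
p_a_given_bc = joint / p_bc
ci_holds = np.allclose(p_a_given_bc[:, 0, :], p_a_given_bc[:, 1, :])

# But marginally, A and B are (in general) dependent.
p_ab = joint.sum(axis=2)
p_a = p_ab.sum(axis=1)
p_b = p_ab.sum(axis=0)
marginally_independent = np.allclose(p_ab, np.outer(p_a, p_b))
```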
Graphical Models
• Represent dependency structure with a directed graph
– Node <-> random variable
– Edges encode dependencies
• Absence of an edge -> conditional independence
– Directed and undirected versions
• Why is this useful?
– A language for communication
– A language for computation
Examples of 3-way Graphical Models
• Nodes A, B, C with no edges
• Marginal Independence: p(A,B,C) = p(A) p(B) p(C)
Examples of 3-way Graphical Models
• Edges A -> B and A -> C
• Conditionally independent effects: p(A,B,C) = p(B|A) p(C|A) p(A)
• B and C are conditionally independent given A
• e.g., A is a disease, and we model B and C as conditionally independent symptoms given A
Examples of 3-way Graphical Models
• Edges A -> C and B -> C
• Independent Causes: p(A,B,C) = p(C|A,B) p(A) p(B)
Examples of 3-way Graphical Models
• Edges A -> B and B -> C
• Markov dependence: p(A,B,C) = p(C|B) p(B|A) p(A)
Directed Graphical Models
• Example: edges A -> C and B -> C, giving p(A,B,C) = p(C|A,B) p(A) p(B)
• Probability model has a simple factored form
• Directed edges => direct dependence
• Absence of an edge => conditional independence
• Also known as belief networks, Bayesian networks, causal networks
• In general, p(X1, X2, ..., XN) = Πi p(Xi | parents(Xi))
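The factored form can be assembled directly from the conditional probability tables. A small NumPy sketch of the v-structure A -> C <- B with made-up binary tables:

```python
import numpy as np

# p(A, B, C) = p(C | A, B) p(A) p(B), all variables binary.
p_a = np.array([0.6, 0.4])
p_b = np.array([0.1, 0.9])
# p_c_given_ab[a, b, c], each slice normalized over c.
p_c_given_ab = np.array([[[0.9, 0.1], [0.5, 0.5]],
                         [[0.3, 0.7], [0.2, 0.8]]])

# Joint assembled from the factored form p(Xi | parents(Xi)).
joint = p_a[:, None, None] * p_b[None, :, None] * p_c_given_ab
```

Summing the joint over C recovers p(a) p(b), confirming that A and B are marginally independent in this graph.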
Reminders from Probability
• Law of Total Probability
P(a) = Σb P(a, b) = Σb P(a | b) P(b)
– Conditional version:
P(a | c) = Σb P(a, b | c) = Σb P(a | b, c) P(b | c)
• Factorization or Chain Rule
– P(a, b, c, d) = P(a | b, c, d) P(b | c, d) P(c | d) P(d), or
= P(b | a, c, d) P(c | a, d) P(d | a) P(a), or
= ...
Probability Calculations on Graphs
• General algorithms exist, beyond trees
– Complexity is typically O(m^(number of parents)) (where m = arity of each node)
– If single parents (e.g., a tree) -> O(m)
– The sparser the graph, the lower the complexity
• Technique can be "automated"
– i.e., a fully general algorithm for arbitrary graphs
– For continuous variables: replace the sum with an integral
– For identification of most likely values: replace the sum with a max operator
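The gain from exploiting graph structure can be seen on the Markov chain A -> B -> C: pushing each sum inside the product replaces one O(m^3) computation with two O(m^2) steps. An illustrative NumPy sketch (random tables, arity m = 4):

```python
import numpy as np

rng = np.random.default_rng(3)
m = 4  # arity of each node

# Markov chain A -> B -> C: p(a,b,c) = p(a) p(b|a) p(c|b).
p_a = rng.random(m); p_a /= p_a.sum()
p_b_a = rng.random((m, m)); p_b_a /= p_b_a.sum(axis=1, keepdims=True)  # [a, b]
p_c_b = rng.random((m, m)); p_c_b /= p_c_b.sum(axis=1, keepdims=True)  # [b, c]

# Brute force: build the full m**3 joint, then sum out a and b. O(m**3).
joint = p_a[:, None, None] * p_b_a[:, :, None] * p_c_b[None, :, :]
p_c_brute = joint.sum(axis=(0, 1))

# Variable elimination: push each sum inside the product.
# p(c) = sum_b p(c|b) sum_a p(b|a) p(a)  -- two O(m**2) steps, never O(m**3).
p_b = p_a @ p_b_a
p_c_elim = p_b @ p_c_b
```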
The Likelihood Function
• Likelihood = p(data | parameters) = p(D | θ) = L(θ)
• The likelihood tells us how likely the observed data are, conditioned on a particular setting of the parameters θ
• Details
– Constants that do not involve θ can be dropped in defining L(θ)
– Often easier to work with log L(θ)
Comments on the Likelihood Function
• Constructing a likelihood function L(θ) is the first step in probabilistic modeling
• The likelihood function implicitly assumes an underlying probabilistic model M with parameters θ
• L(θ) connects the model to the observed data
• Graphical models provide a useful language for constructing likelihoods
Binomial Likelihood
• Binomial model
– n memoryless trials, 2 outcomes
– θ = probability of success at each trial
• Observed data
– r successes in n trials
– Defines a likelihood:
L(θ) = p(D | θ) = p(successes) p(non-successes) = θ^r (1-θ)^(n-r)
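This likelihood can be evaluated on a grid of θ values; its maximum falls at the familiar estimate r/n. An illustrative sketch (the counts and grid resolution are arbitrary choices, not from the slides):

```python
import numpy as np

# Binomial likelihood L(theta) = theta**r * (1 - theta)**(n - r),
# evaluated on a grid via the log-likelihood for numerical stability.
n, r = 20, 7
theta = np.linspace(0.001, 0.999, 999)
log_L = r * np.log(theta) + (n - r) * np.log(1 - theta)
theta_ml = theta[np.argmax(log_L)]   # lands at r / n = 0.35
```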
Binomial Likelihood Examples
Graphical Models
• Represent using a graphical model:
• Left: data points are conditionally independent given θ
• Right: plate notation (same model as the left); repeated nodes are inside a box (plate), and the number in its lower right-hand corner specifies the number of repetitions of the node
Graphical Models
• Assume each data case was generated independently, but from the same distribution
• Data cases are only independent conditional on the parameters θ
• Marginally, the data cases are dependent
• The order in which the data cases arrive makes no difference to the beliefs about θ (all orderings have the same sufficient statistics): the data are exchangeable
Plate Notation
• Avoid visual clutter: use a form of syntactic sugar, called plates
• Draw a little box around the repeated variables
• Convention: nodes within the box are repeated when the model is unrolled
• Bottom right corner of the box: number of copies or repetitions
• The corresponding joint distribution has the form:
p(θ, w1, ..., wn) = p(θ) Πi p(wi | θ)
Multinomial Likelihood
• Multinomial model
– n memoryless trials, K outcomes
– θ = probability vector for the outcomes at each trial
• Observed data
– nj successes of outcome j in n trials
– Defines a likelihood:
L(θ) = p(D | θ) = Πj θj^nj
Graphical Model for Multinomial
• Graph: θ is the parent of each observed node w1, w2, ..., wn
• Parameters: θ = [p(w1), p(w2), ..., p(wK)]
• Observed data: w1, ..., wn
"Plate" Notation
• Graph: θ -> wi, with wi inside a plate labeled i=1:n
• Data = D = {w1,...,wn}; θ = model parameters
• A plate (rectangle) indicates replicated nodes in a graphical model
• Variables within a plate are conditionally independent given the parent
Learning in Graphical Models
• Graph: θ -> wi, with wi inside a plate labeled i=1:n
• Data = D = {w1,...,wn}; θ = model parameters
• Can view learning in a graphical model as computing the most likely value of the parameter node θ given the data nodes
Maximum Likelihood (ML) Principle
• Graph: θ -> wi, with wi inside a plate labeled i=1:n
• Data = {w1,...,wn}; θ = model parameters
• L(θ) = p(Data | θ) = Πi p(wi | θ)
• Maximum Likelihood: θ_ML = arg max { L(θ) }
• Select the parameters that make the observed data most likely
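The principle can be illustrated by simulation (an assumed setup, not from the slides: Bernoulli trials with a made-up true θ). For the Bernoulli model the arg max of L(θ) has a closed form, the sample mean, which recovers the generating parameter:

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulate the model: n coin flips w_i drawn with true theta = 0.7,
# then recover theta by maximizing L(theta) = prod_i p(w_i | theta).
theta_true = 0.7
w = rng.random(10_000) < theta_true   # observed data w_1..w_n (booleans)

# For the Bernoulli model the ML estimate is the sample mean.
theta_ml = w.mean()
```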
The Bayesian Approach to Learning
• Graph: θ -> wi, with wi inside a plate labeled i=1:n
• Fully Bayesian: p(θ | Data) = p(Data | θ) p(θ) / p(Data)
• Maximum A Posteriori: θ_MAP = arg max { Likelihood(θ) x Prior(θ) }
• Prior(θ) = p(θ)
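As a concrete instance of the MAP rule, one conventional choice (an assumption here, not specified on the slide) is a conjugate Beta(α, β) prior for the binomial model, which gives closed forms for the posterior mode and mean:

```python
# Bayesian update for the binomial model with a Beta(alpha, beta) prior.
# Posterior is Beta(alpha + r, beta + n - r); its mode is the MAP estimate.
alpha, beta = 2.0, 2.0
n, r = 20, 7

theta_map = (alpha + r - 1) / (alpha + beta + n - 2)   # posterior mode
theta_posterior_mean = (alpha + r) / (alpha + beta + n)
theta_ml = r / n                                       # for comparison
```

The MAP estimate sits between the ML estimate and the prior mean (0.5 here), showing how the prior regularizes the estimate.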
Summary of Bayesian Learning
• Can use graphical models to describe relationships between parameters and data
• P(data | parameters) = likelihood function
• P(parameters) = prior
– In applications such as text mining, the prior can be "uninformative", i.e., flat
– The prior can also be optimized for prediction (e.g., on validation data)
• We can compute P(parameters | data, prior), or a "point estimate" (e.g., the posterior mode or mean)
• Computation of posterior estimates can be computationally intractable; Monte Carlo techniques are often used
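One simple Monte Carlo scheme (an illustrative choice, not necessarily what the slides have in mind) is importance sampling with the prior as the proposal: sample candidate parameters from the prior and weight each by its likelihood. For the binomial model with a flat prior the exact posterior is known, so the estimate can be checked:

```python
import numpy as np

rng = np.random.default_rng(5)

# Importance sampling: draw theta from the flat prior on (0, 1),
# weight by the binomial likelihood theta**r * (1 - theta)**(n - r).
n, r = 20, 7
theta_samples = rng.random(200_000)
weights = theta_samples**r * (1 - theta_samples)**(n - r)
posterior_mean_mc = np.average(theta_samples, weights=weights)

# Exact answer for comparison: a flat prior gives a Beta(r+1, n-r+1)
# posterior, whose mean is (r + 1) / (n + 2).
posterior_mean_exact = (r + 1) / (n + 2)
```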