Learning Bayesian networks
Most Slides by Nir Friedman
Some by Dan Geiger
2
Known Structure -- Incomplete Data
[Figure: an Inducer takes the specified network structure (E and B are parents of A) together with the incomplete data and outputs the CPT P(A | E, B): the table whose entries are unknown (?) for each parent configuration (e,b), (e,¬b), (¬e,b), (¬e,¬b) is filled in with estimates such as .9/.1, .7/.3, .99/.01, .8/.2.]

Network structure is specified.
Data contains missing values, e.g. records over (E, B, A): <Y,N,N>, <Y,?,Y>, <N,N,Y>, <N,Y,?>, ..., <?,Y,Y>.
We consider assignments to the missing values.
3
Learning Parameters from Incomplete Data
Incomplete data: the posterior distributions over parameters can become interdependent.
Consequence:
ML parameters cannot be computed separately for each multinomial.
The posterior is not a product of independent posteriors.
[Figure: plate model with parameters θX, θY|X=H, θY|X=T and replicated observed nodes X[m], Y[m].]
4
Learning Parameters from Incomplete Data (cont.)
In the presence of incomplete data, the likelihood can have multiple global maxima.
Example: we can rename the values of a hidden variable H. If H has two values, the likelihood has two global maxima.
Similarly, local maxima are also replicated; many hidden variables pose a serious problem.
[Figure: a small network over H (hidden) and Y.]
5
Expectation Maximization (EM)
A general-purpose method for learning from incomplete data.
Intuition: if we had access to counts, then we could estimate parameters.
However, missing values do not allow us to perform the counts.
"Complete" the counts using the current parameter assignment.
6
Expectation Maximization (EM)
[Figure: a data table over X, Y, Z with H/T values in which some entries are missing, alongside the current model over X, Y, Z and its parameters. Under the current model, e.g. P(Y=H | X=H, Z=T, θ) = 0.3 and P(Y=H | X=T, Z=T, θ) = 0.4, and these posteriors yield expected (fractional) counts N(X, Y) such as 1.3, 0.4, 1.7, 1.6. These numbers are placed for illustration; they have not been computed.]
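To make the E-step concrete, here is a minimal sketch (my own, not the slides' code) of how expected counts are accumulated: a fully observed record contributes whole counts, while a record with a missing Y contributes fractional counts weighted by the posterior of Y under the current parameters. The `posterior_y` lookup, the structure of `theta`, and the toy records are illustrative assumptions.

```python
# A minimal sketch of the E-step "expected counts" idea: each record with a
# missing Y contributes fractional counts weighted by P(Y | observed, theta).
from collections import defaultdict

def posterior_y(x, z, theta):
    """P(Y='H' | X=x, Z=z, theta): here simply a lookup in the current parameters."""
    return theta[(x, z)]

def expected_counts(data, theta):
    """Accumulate expected counts N(X, Y) from records (x, y, z); y may be None."""
    counts = defaultdict(float)
    for x, y, z in data:
        if y is not None:                      # observed: a full count of 1
            counts[(x, y)] += 1.0
        else:                                  # missing: split the count
            p_h = posterior_y(x, z, theta)
            counts[(x, 'H')] += p_h
            counts[(x, 'T')] += 1.0 - p_h
    return dict(counts)

# Current parameters P(Y=H | X, Z) and a few records with missing Y values.
theta = {('H', 'T'): 0.3, ('T', 'T'): 0.4, ('H', 'H'): 0.5, ('T', 'H'): 0.5}
data = [('H', 'T', 'T'), ('T', None, 'T'), ('H', None, 'T'), ('T', 'H', 'H')]
print(expected_counts(data, theta))
```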
7
EM (cont.)
Training data over X1, X2, X3, Y1, Y2, Y3 (H is hidden).
Initial network (G, Θ0): X1, X2, X3 are parents of H, and H is the parent of Y1, Y2, Y3.
E-step (expected counts computation): N(X1), N(X2), N(X3), N(H, X1, X2, X3), N(Y1, H), N(Y2, H), N(Y3, H).
M-step (reparameterize): yields the updated network (G, Θ1).
Reiterate.
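The E-step/M-step loop above can be written out end to end for a much smaller model. The sketch below is my own toy, not the slide's network: a single hidden binary variable H with two observed binary children Y1 and Y2. The E-step computes expected counts by exact inference under the current parameters; the M-step re-estimates the CPTs from those counts.

```python
# Toy EM: hidden binary H with observed binary children Y1, Y2.
import random

def em(data, iters=50, seed=0):
    rng = random.Random(seed)
    # Parameters: P(H=1), P(Y1=1|H), P(Y2=1|H); start from a random guess.
    p_h = rng.random()
    p_y1 = {0: rng.random(), 1: rng.random()}
    p_y2 = {0: rng.random(), 1: rng.random()}
    for _ in range(iters):
        # E-step: expected counts N(H), N(Y1, H), N(Y2, H).
        n_h = {0: 0.0, 1: 0.0}
        n_y1 = {(y, h): 0.0 for y in (0, 1) for h in (0, 1)}
        n_y2 = {(y, h): 0.0 for y in (0, 1) for h in (0, 1)}
        for y1, y2 in data:
            # Joint P(H=h, y1, y2) for each h, then normalize to get P(H=h | y1, y2).
            joint = {}
            for h in (0, 1):
                prior = p_h if h == 1 else 1 - p_h
                lik1 = p_y1[h] if y1 == 1 else 1 - p_y1[h]
                lik2 = p_y2[h] if y2 == 1 else 1 - p_y2[h]
                joint[h] = prior * lik1 * lik2
            z = joint[0] + joint[1]
            for h in (0, 1):
                w = joint[h] / z
                n_h[h] += w
                n_y1[(y1, h)] += w
                n_y2[(y2, h)] += w
        # M-step: reparameterize from the expected counts.
        p_h = n_h[1] / (n_h[0] + n_h[1])
        p_y1 = {h: n_y1[(1, h)] / n_h[h] for h in (0, 1)}
        p_y2 = {h: n_y2[(1, h)] / n_h[h] for h in (0, 1)}
    return p_h, p_y1, p_y2

data = [(1, 1), (1, 1), (0, 0), (0, 0), (1, 0), (0, 1)] * 10
print(em(data))
```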
8
MLE from Incomplete Data
Finding MLE parameters: a nonlinear optimization problem.
[Figure: the likelihood L(Θ|D) as a function of Θ.]
Expectation Maximization (EM): use the "current point" to construct an alternative function (which is "nice").
Guarantee: the maximum of the new function scores better than the current point.
9
EM in Practice
Initial parameters: a random parameter setting, or a "best" guess from another source.
Stopping criteria: small change in the likelihood of the data, or small change in parameter values.
Avoiding bad local maxima: multiple restarts (a sketch follows below), with early "pruning" of unpromising ones.
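A minimal sketch of the multiple-restarts advice, with `run_em` as a stand-in placeholder (an assumption, not a real EM routine): run from several seeds and keep the run with the best likelihood.

```python
# Multiple restarts: keep the best-scoring EM run.
import random

def run_em(data, seed):
    """Stand-in for an EM run: returns (parameters, final log-likelihood)."""
    rng = random.Random(seed)
    theta = rng.random()                  # pretend this is the converged parameter
    return theta, -((theta - 0.7) ** 2)   # pretend this is its data log-likelihood

def best_of_restarts(data, n_restarts=10):
    runs = [run_em(data, seed) for seed in range(n_restarts)]
    return max(runs, key=lambda r: r[1])  # keep the run with the best likelihood

print(best_of_restarts(data=[]))
```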
10
The setup of the EM algorithm
We start with a likelihood function parameterized by θ.
The observed quantity is denoted X = x. It is often a vector x1,…,xL of observations (e.g., evidence for some nodes in a Bayesian network).
The hidden quantity is a vector Y = y (e.g., states of unobserved variables in a Bayesian network). The quantity y is defined such that, if it were known, the likelihood of the completed data point P(x,y|θ) is easy to maximize.
The log-likelihood of an observation x has the form:
log P(x|θ) = log P(x,y|θ) − log P(y|x,θ)
(because P(x,y|θ) = P(x|θ) P(y|x,θ)).
11
The goal of the EM algorithm
The log-likelihood of an observation x has the form:
log P(x|θ) = log P(x,y|θ) − log P(y|x,θ)
The goal: starting with a current parameter vector θ', EM's goal is to find a new vector θ such that P(x|θ) > P(x|θ') with the highest possible difference.
The result: after enough iterations, EM reaches a local maximum of the likelihood P(x|θ).
For independent points (xi, yi), i = 1,…,m, we can similarly write:
Σi log P(xi|θ) = Σi log P(xi,yi|θ) − Σi log P(yi|xi,θ)
We will stick to one observation in our derivation, recalling that all derived equations can be modified by summing over the observations.
12
The Mathematics involved
Recall that the expectation of a random variable Y with a pdf p(y) is given by E[Y] = Σy y p(y).
The expectation of a function L(Y) is given by E[L(Y)] = Σy L(y) p(y).
A somewhat harder example: Eθ'[log p(x,y|θ)] = Σy p(y|x,θ') log p(x,y|θ). This quantity is denoted Q(θ|θ').
The expectation operator E is linear: for two random variables X, Y and constants a, b, the following holds:
E[aX + bY] = a E[X] + b E[Y]
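A tiny numeric illustration (numbers invented here) of that last expectation for a single observation x with a binary hidden y: Q(θ|θ') is just log p(x,y|θ) averaged over y, with the posterior under the old parameters θ' as the weights.

```python
# Q(theta | theta') for one observation x and binary y, with made-up numbers.
import math

p_y_given_x_old = {0: 0.25, 1: 0.75}   # p(y | x, theta')  (posterior weights)
p_xy_new = {0: 0.10, 1: 0.30}          # p(x, y | theta)   (evaluated at the new theta)

q = sum(p_y_given_x_old[y] * math.log(p_xy_new[y]) for y in (0, 1))
print(q)   # this weighted sum is exactly Q(theta | theta') for this single x
```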
13
The Mathematics involved (cont.)
Starting with log P(x|θ) = log P(x,y|θ) − log P(y|x,θ), multiplying both sides by P(y|x,θ'), and summing over y, yields
log P(x|θ) = Σy P(y|x,θ') log P(x,y|θ) − Σy P(y|x,θ') log P(y|x,θ)
The first term is Eθ'[log p(x,y|θ)] = Q(θ|θ'). We now observe that
Δ = log P(x|θ) − log P(x|θ') = Q(θ|θ') − Q(θ'|θ') + Σy P(y|x,θ') log [P(y|x,θ') / P(y|x,θ)]
where the last term is ≥ 0 (a relative entropy). So choosing θ* = argmaxθ Q(θ|θ') maximizes the difference Δ, and repeating this process leads to a local maximum of log P(x|θ).
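The "≥ 0" step can be spelled out in one more line (added here, assuming y ranges over all its values so that P(y|x,θ) sums to one): the term is a relative entropy, and Jensen's inequality applied to the concave logarithm gives

```latex
% Gibbs' inequality: the relative-entropy term is non-negative.
\[
\sum_y P(y \mid x,\theta') \log \frac{P(y \mid x,\theta')}{P(y \mid x,\theta)}
  = -\sum_y P(y \mid x,\theta') \log \frac{P(y \mid x,\theta)}{P(y \mid x,\theta')}
  \;\ge\; -\log \sum_y P(y \mid x,\theta') \,\frac{P(y \mid x,\theta)}{P(y \mid x,\theta')}
  = -\log \sum_y P(y \mid x,\theta) = -\log 1 = 0 .
\]
```

Equality holds only when P(y|x,θ) = P(y|x,θ') for all y.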
14
The EM algorithm itself
Input: a likelihood function p(x,y|θ) parameterized by θ.
Initialization: fix an arbitrary starting value θ'.
Repeat:
E-step: compute Q(θ|θ') = Eθ'[log P(x,y|θ)]
M-step: θ' ← argmaxθ Q(θ|θ')
Until Δ = log P(x|θ) − log P(x|θ') < ε
Comment: at the M-step one can actually choose any θ' as long as Δ > 0. This change yields the so-called Generalized EM algorithm. It is important when the argmax is hard to compute.
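For the Bayesian-network setting of the earlier slides it may help to spell out what Q(θ|θ') looks like. The following is my own unpacking (not on the slide), writing θ_{xi|pai} for the CPT entries and o[m], y[m] for the observed and hidden parts of the m-th record:

```latex
% Q for a Bayesian network with multinomial CPTs decomposes into expected counts.
\[
Q(\theta \mid \theta')
  = \sum_m \mathrm{E}_{\theta'}\!\bigl[\log P(o[m], y[m] \mid \theta) \,\big|\, o[m]\bigr]
  = \sum_i \sum_{x_i,\,pa_i} \bar N(x_i, pa_i)\,\log \theta_{x_i \mid pa_i},
\qquad
\bar N(x_i, pa_i) = \sum_m P(x_i, pa_i \mid o[m], \theta').
\]
```

Maximizing this subject to Σ_{xi} θ_{xi|pai} = 1 gives θ_{xi|pai} = N̄(xi, pai) / Σ_{x'i} N̄(x'i, pai), which is exactly the "reparameterize from expected counts" M-step of the earlier slides.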
16
Expectation Maximization (EM)
In practice, EM converges rather quickly at the start but slowly near the (possibly local) maximum.
Hence, EM is often run for a few iterations and then gradient-ascent steps are applied.
17
MLE from Incomplete Data
Finding MLE parameters: a nonlinear optimization problem.
Gradient ascent: follow the gradient of the likelihood w.r.t. the parameters.
[Figure: the likelihood L(Θ|D) as a function of Θ.]
18
MLE from Incomplete Data
Both ideas:
Find local maxima only.
Require multiple restarts to find an approximation to the global maximum.
19
Gradient Ascent
Main result (Theorem GA):

∂ log P(D|Θ) / ∂θ_{xi,pai} = (1 / θ_{xi,pai}) Σm P(xi, pai | o[m], Θ)

This requires computing P(xi, pai | o[m], Θ) for all i and m.
Inference replaces taking derivatives.
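As a sanity check of Theorem GA (my own construction, not from the slides), the snippet below builds a two-node network X → Y with binary variables, treats every CPT entry as a free parameter, and compares a finite-difference derivative of log P(D|Θ) with the theorem's expression (1/θ_{y,x}) Σm P(Y=y, X=x | o[m], Θ).

```python
# Numerical check of Theorem GA on a two-node network X -> Y with toy numbers.
import math

def record_prob(theta, x, y):
    """P(o[m] | theta) for one record (x, y); y may be None (missing)."""
    px = theta['x'][x]
    if y is None:
        return px * (theta['y|x'][(0, x)] + theta['y|x'][(1, x)])
    return px * theta['y|x'][(y, x)]

def loglik(theta, data):
    return sum(math.log(record_prob(theta, x, y)) for x, y in data)

def grad_theorem_ga(theta, data, y, x):
    """(1 / theta_{y|x}) * sum over records of P(Y=y, X=x | o[m], theta)."""
    total = 0.0
    for xm, ym in data:
        if xm != x or (ym is not None and ym != y):
            continue                      # posterior P(y, x | o[m]) is zero here
        total += theta['x'][x] * theta['y|x'][(y, x)] / record_prob(theta, xm, ym)
    return total / theta['y|x'][(y, x)]

theta = {'x': {0: 0.6, 1: 0.4},
         'y|x': {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8}}
data = [(0, 1), (1, None), (1, 1), (0, None), (1, 0)]

# Finite-difference derivative w.r.t. the single free entry theta_{Y=1 | X=1}.
eps = 1e-6
bumped = {'x': dict(theta['x']), 'y|x': dict(theta['y|x'])}
bumped['y|x'][(1, 1)] += eps
numeric = (loglik(bumped, data) - loglik(theta, data)) / eps
print(numeric, grad_theorem_ga(theta, data, y=1, x=1))  # the two should agree closely
```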
20
Gradient Ascent (cont)
Proof:

∂ log P(D|Θ) / ∂θ_{xi,pai}
= Σm ∂ log P(o[m]|Θ) / ∂θ_{xi,pai}
= Σm (1 / P(o[m]|Θ)) ∂P(o[m]|Θ) / ∂θ_{xi,pai}

How do we compute ∂P(o[m]|Θ) / ∂θ_{xi,pai} ?
21
Gradient Ascent (cont)
Since P(o|Θ) = Σ_{x'i,pa'i} P(x'i, pa'i, o | Θ), we have

∂P(o|Θ) / ∂θ_{xi,pai}
= Σ_{x'i,pa'i} ∂P(x'i, pa'i, o | Θ) / ∂θ_{xi,pai}
= Σ_{x'i,pa'i} P(o | x'i, pa'i, Θ) P(pa'i | Θ) ∂P(x'i | pa'i, Θ) / ∂θ_{xi,pai}
= P(o | xi, pai, Θ) P(pai | Θ)

because ∂P(x'i | pa'i, Θ) / ∂θ_{xi,pai} = 1 when x'i = xi and pa'i = pai, and 0 otherwise. Finally,

P(o | xi, pai, Θ) P(pai | Θ) = P(xi, pai, o | Θ) / P(xi | pai, Θ) = P(xi, pai, o | Θ) / θ_{xi,pai}
22
Gradient Ascent (cont)
Putting it all together, we get

∂ log P(D|Θ) / ∂θ_{xi,pai}
= Σm (1 / P(o[m]|Θ)) ∂P(o[m]|Θ) / ∂θ_{xi,pai}
= Σm P(xi, pai, o[m] | Θ) / ( P(o[m]|Θ) θ_{xi,pai} )
= (1 / θ_{xi,pai}) Σm P(xi, pai | o[m], Θ)