
Learning Bayesian networks

Most Slides by Nir Friedman

Some by Dan Geiger

2

Known Structure -- Incomplete Data

[Figure: the Inducer is given the fixed network structure E -> A <- B, a CPT P(A | E,B) whose entries are unknown ("?"), and a data set over (E, B, A) containing missing values, e.g. <Y,N,N>, <Y,?,Y>, <N,N,Y>, <N,Y,?>, ..., <?,Y,Y>. It outputs the filled-in CPT P(A | E,B), illustrated with entries such as .9/.1, .7/.3, .8/.2, .99/.01.]

Network structure is specified.
Data contains missing values.
We consider assignments to missing values.

3

Learning Parameters from Incomplete Data

Incomplete data: posterior distributions can become interdependent.

Consequence:
ML parameters cannot be computed separately for each multinomial.
The posterior is not a product of independent posteriors.

[Figure: plate model with parameter nodes θ_X, θ_Y|X=H, θ_Y|X=T and, inside the plate over m, the nodes X[m] -> Y[m].]

4

Learning Parameters from Incomplete Data (cont.)

In the presence of incomplete data, the likelihood can have multiple global maxima.

Example: we can rename the values of a hidden variable H. If H has two values, the likelihood has two global maxima.

Similarly, local maxima are also replicated. Many hidden variables make this a serious problem.

[Figure: network H -> Y with H hidden.]
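To make the relabeling argument concrete, here is a minimal Python check, with assumed toy parameters (not taken from the slides), that swapping the two values of H in the network H -> Y leaves the observed-data likelihood unchanged, so the two parameter settings score identically:

```python
import math

def loglik(p_h, p_y_given_h, ys):
    """Log-likelihood of observed Y values; the hidden H is marginalized out."""
    total = 0.0
    for y in ys:
        p_y1 = p_h * p_y_given_h[1] + (1 - p_h) * p_y_given_h[0]  # P(Y=1)
        total += math.log(p_y1 if y == 1 else 1 - p_y1)
    return total

ys = [1, 0, 1, 1, 0]                          # observed data (Y only; H is hidden)
theta      = (0.3, {0: 0.9, 1: 0.2})          # assumed P(H=1), P(Y=1 | H=h)
theta_swap = (0.7, {0: 0.2, 1: 0.9})          # same model with the labels of H swapped

print(loglik(*theta, ys), loglik(*theta_swap, ys))   # identical values
```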

5

Expectation Maximization (EM)

A general-purpose method for learning from incomplete data.

Intuition:
If we had access to counts, we could estimate the parameters.
However, missing values do not allow us to perform the counts.
So we "complete" the counts using the current parameter assignment.

6

Expectation Maximization (EM)

[Figure: a data set over (X, Y, Z) with H/T values and some entries missing, the current model (a network over X, Y, Z), and the resulting table of expected counts N(X, Y) with fractional entries such as 1.3, 0.4, 1.7, 1.6. Under the current model, P(Y=H | X=H, Z=T, Θ) = 0.3 and P(Y=H | X=T, Z=T, Θ) = 0.4. These numbers are placed for illustration; they have not been computed.]
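A short sketch of how such fractional expected counts arise: each record with a missing Y is split between Y=H and Y=T according to the posterior under the current model. The posteriors are hard-coded here (0.4 and 0.3 are the slide's illustrative numbers; the remaining entries and the data rows are assumptions), since the point is the counting step rather than the inference:

```python
from collections import defaultdict

# Posterior P(Y=H | X, Z) under the current model. The values 0.4 and 0.3 are the
# slide's illustrative numbers; the other two entries are assumptions added so the
# table is complete.
posterior_y_H = {('T', 'T'): 0.4, ('H', 'T'): 0.3, ('T', 'H'): 0.5, ('H', 'H'): 0.5}

# Data rows (X, Y, Z); '?' marks a missing value (for brevity, only Y may be missing).
data = [('T', 'H', 'H'), ('H', '?', 'T'), ('H', 'T', 'T'), ('T', '?', 'T'), ('T', 'H', 'T')]

expected_N = defaultdict(float)          # expected counts N(X, Y)
for x, y, z in data:
    if y == '?':
        p = posterior_y_H[(x, z)]        # "complete" the record with the posterior
        expected_N[(x, 'H')] += p
        expected_N[(x, 'T')] += 1 - p
    else:
        expected_N[(x, y)] += 1.0        # a fully observed record counts as 1

print(dict(expected_N))                  # fractional counts, e.g. N(X=T, Y=H) = 2.4
```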

7

EM (cont.)

[Figure: the EM loop. Start from an initial network (G, Θ0) over X1, X2, X3, H, Y1, Y2, Y3 (with H hidden) and the training data. E-step: compute the expected counts N(X1), N(X2), N(X3), N(H, X1, X2, X3), N(Y1, H), N(Y2, H), N(Y3, H). M-step: reparameterize to obtain the updated network (G, Θ1). Reiterate.]
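Below is a minimal, runnable sketch of this loop, restricted for brevity to the hidden node H and its binary children Y1, Y2, Y3 (the parents X1, X2, X3 are dropped). All names are ours, not the slides'; the E-step accumulates expected counts from the exact posterior over H, and the M-step renormalizes them:

```python
import random

def em_hidden_parent(data, n_iters=20, seed=0):
    """EM for H -> (Y1, Y2, Y3) with H hidden; data = list of (y1, y2, y3) in {0,1}."""
    rng = random.Random(seed)
    p_h = rng.uniform(0.3, 0.7)                                          # initial P(H=1)
    p_y = [[rng.uniform(0.2, 0.8) for _ in range(3)] for _ in range(2)]  # P(Yj=1 | H=h)

    for _ in range(n_iters):
        # E-step: expected counts N(H) and N(Yj=1, H) under the posterior P(H | y).
        n_h = [1e-9, 1e-9]                    # tiny pseudo-counts avoid division by zero
        n_yh = [[0.0] * 3 for _ in range(2)]
        for y in data:
            joint = []
            for h in (0, 1):
                p = p_h if h == 1 else 1.0 - p_h
                for j in range(3):
                    p *= p_y[h][j] if y[j] == 1 else 1.0 - p_y[h][j]
                joint.append(p)
            z = sum(joint)
            for h in (0, 1):
                post = joint[h] / z           # posterior P(H=h | y1, y2, y3)
                n_h[h] += post
                for j in range(3):
                    n_yh[h][j] += post * y[j]
        # M-step: reparameterize from the expected counts.
        p_h = n_h[1] / (n_h[0] + n_h[1])
        p_y = [[n_yh[h][j] / n_h[h] for j in range(3)] for h in (0, 1)]
    return p_h, p_y

data = [(1, 1, 1), (1, 1, 0), (0, 0, 0), (0, 0, 1), (1, 1, 1), (0, 0, 0)]
print(em_hidden_parent(data))
```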

8

MLE from Incomplete Data

Finding the MLE parameters is a nonlinear optimization problem.

[Figure: the likelihood surface L(Θ | D).]

Expectation Maximization (EM): use the "current point" to construct an alternative function (which is "nice").
Guarantee: the maximum of the new function scores better than the current point.

9

EM in Practice

Initial parameters:
Random parameter setting
"Best" guess from another source

Stopping criteria:
Small change in the likelihood of the data
Small change in parameter values

Avoiding bad local maxima:
Multiple restarts (sketched below)
Early "pruning" of unpromising starting points
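A small sketch of the multiple-restart idea; run_em and score are placeholder callables (assumptions, not slide material), where score would typically be the log-likelihood of the data:

```python
def best_of_restarts(run_em, score, n_restarts=10):
    """Run EM from several initializations and keep the best-scoring result.

    run_em(seed) -> fitted parameters; score(params) -> e.g. data log-likelihood.
    Both callables stand in for whatever EM routine is being used.
    """
    best, best_score = None, float('-inf')
    for seed in range(n_restarts):
        params = run_em(seed)
        s = score(params)
        if s > best_score:
            best, best_score = params, s
    return best

# Toy usage with stand-in callables: "EM" just returns its seed, and the score
# prefers values near 3, so the wrapper picks seed 3.
print(best_of_restarts(lambda seed: seed, lambda p: -abs(p - 3)))
```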

10

The setup of the EM algorithm

We start with a likelihood function parameterized by θ.

The observed quantity is denoted X = x. It is often a vector x1,…,xL of observations (e.g., evidence for some nodes in a Bayesian network).

The hidden quantity is a vector Y = y (e.g., states of unobserved variables in a Bayesian network). The quantity y is defined such that, if it were known, the likelihood of the completed data point P(x,y|θ) would be easy to maximize.

The log-likelihood of an observation x has the form:
log P(x|θ) = log P(x,y|θ) – log P(y|x,θ)

(Because P(x,y|θ) = P(x|θ) P(y|x,θ).)
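A quick numerical check of this identity, using an assumed toy joint distribution P(x,y|θ) (the numbers are arbitrary):

```python
import math

# Assumed toy joint P(x, y | theta) over X in {a, b} and Y in {0, 1}.
p_xy = {('a', 0): 0.1, ('a', 1): 0.3, ('b', 0): 0.4, ('b', 1): 0.2}

x, y = 'a', 1
p_x = sum(p for (xi, _), p in p_xy.items() if xi == x)   # P(x) = sum_y P(x, y)
p_y_given_x = p_xy[(x, y)] / p_x                         # P(y | x)

# log P(x) = log P(x, y) - log P(y | x)
assert abs(math.log(p_x) - (math.log(p_xy[(x, y)]) - math.log(p_y_given_x))) < 1e-12
```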

11

The goal of the EM algorithm

The log-likelihood of an observation x has the form:
log P(x|θ) = log P(x,y|θ) – log P(y|x,θ)

The goal: starting with a current parameter vector θ', EM's goal is to find a new vector θ such that P(x|θ) > P(x|θ') with the highest possible difference.

The result: after enough iterations, EM reaches a local maximum of the likelihood P(x|θ).

For independent points (xi, yi), i = 1,…,m, we can similarly write:

Σi log P(xi|θ) = Σi log P(xi,yi|θ) – Σi log P(yi|xi,θ)

We will stick to one observation in our derivation, recalling that all derived equations can be modified by summing over x.

12

The Mathematics involved

Recall that the expectation of a random variable Y with a pdf P(y) is given by E[Y] = Σy y p(y).

The expectation of a function L(Y) is given by E[L(Y)] = Σy L(y) p(y).

A slightly harder example:
Eθ'[log p(x,y|θ)] = Σy p(y|x,θ') log p(x,y|θ) ≡ Q(θ|θ')

The expectation operator E is linear. For two random variables X, Y and constants a, b, the following holds:

E[aX + bY] = a E[X] + b E[Y]
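The snippet below evaluates this expectation, i.e. Q(θ|θ'), numerically for an assumed pair of toy joint distributions, one playing the role of θ' and one of θ:

```python
import math

# Assumed toy joints P(x, y | theta') and P(x, y | theta) over X in {a, b}, Y in {0, 1}.
p_xy_old = {('a', 0): 0.1, ('a', 1): 0.3, ('b', 0): 0.4, ('b', 1): 0.2}   # theta'
p_xy_new = {('a', 0): 0.2, ('a', 1): 0.2, ('b', 0): 0.3, ('b', 1): 0.3}   # theta

x = 'a'
p_x_old = sum(p for (xi, _), p in p_xy_old.items() if xi == x)            # P(x | theta')

# Q(theta | theta') = sum_y P(y | x, theta') * log P(x, y | theta)
q = sum((p_xy_old[(x, y)] / p_x_old) * math.log(p_xy_new[(x, y)]) for y in (0, 1))
print(q)
```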

13

The Mathematics involved (cont.)

Starting with log P(x|θ) = log P(x,y|θ) – log P(y|x,θ), multiplying both sides by P(y|x,θ'), and summing over y, yields

log P(x|θ) = Σy P(y|x,θ') log P(x,y|θ) – Σy P(y|x,θ') log P(y|x,θ)

The first term is Eθ'[log p(x,y|θ)] = Q(θ|θ'). We now observe that

Δ = log P(x|θ) – log P(x|θ')
  = Q(θ|θ') – Q(θ'|θ') + Σy P(y|x,θ') log [P(y|x,θ') / P(y|x,θ)]

The last term is a relative entropy and hence ≥ 0. So choosing θ* = argmaxθ Q(θ|θ') maximizes the difference Δ, and repeating this process leads to a local maximum of log P(x|θ).
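The nonnegativity of that leftover term is just the nonnegativity of relative entropy (KL divergence); here is a toy numerical check with two assumed posteriors over y:

```python
import math

p_old = [0.7, 0.3]     # P(y | x, theta'), an assumed toy posterior
p_new = [0.4, 0.6]     # P(y | x, theta)

# Relative entropy sum_y P(y|x,theta') log [P(y|x,theta') / P(y|x,theta)] is >= 0.
kl = sum(po * math.log(po / pn) for po, pn in zip(p_old, p_new))
print(kl, kl >= 0)
```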

14

The EM algorithm itself

Input: a likelihood function p(x,y|θ) parameterized by θ.

Initialization: fix an arbitrary starting value θ'.

Repeat:
E-step: compute Q(θ|θ') = Eθ'[log P(x,y|θ)]
M-step: θ' ← argmaxθ Q(θ|θ')
Until Δ = log P(x|θ) – log P(x|θ') < ε

Comment: at the M-step one can actually choose any θ' as long as Δ > 0. This change yields the so-called Generalized EM algorithm. It is important when argmax is hard to compute.

16

Expectation Maximization (EM)

In practice, EM converges rather quickly at the start but slowly near the (possibly local) maximum.

Hence, EM is often run for a few iterations and then gradient ascent steps are applied.

17

MLE from Incomplete Data

Finding the MLE parameters is a nonlinear optimization problem.

Gradient Ascent: follow the gradient of the likelihood w.r.t. the parameters.

[Figure: the likelihood surface L(Θ | D) with gradient-ascent steps.]
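A generic gradient-ascent loop, sketched with a placeholder gradient callable; the toy objective in the usage line is ours and is not the likelihood from the slides:

```python
def gradient_ascent(grad, theta0, eta=0.05, n_steps=200):
    """Repeatedly step in the direction of the gradient; grad(theta) -> dL/dtheta."""
    theta = theta0
    for _ in range(n_steps):
        theta = theta + eta * grad(theta)
    return theta

# Toy usage: maximize L(theta) = -(theta - 2)^2, whose gradient is -2*(theta - 2).
print(gradient_ascent(lambda t: -2.0 * (t - 2.0), theta0=0.0))   # approaches 2
```

In the Bayesian-network setting the gradient of each CPT entry is given by Theorem GA on the slides that follow, and the entries must additionally be kept normalized (e.g. by projecting back onto the simplex or reparameterizing), which this sketch ignores.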

18

MLE from Incomplete Data

Both ideas:
Find local maxima only.
Require multiple restarts to find an approximation to the global maximum.

19

Gradient Ascent

Main result

Theorem GA:

∂ log P(D|Θ) / ∂ θ_{x_i, pa_i}  =  (1 / θ_{x_i, pa_i}) · Σ_m P(x_i, pa_i | o[m], Θ)

Requires computing P(x_i, pa_i | o[m], Θ) for all i, m.

Inference replaces taking derivatives.
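A minimal sketch of Theorem GA on an assumed two-node network X -> Y with binary variables: the gradient with respect to one CPT entry is the sum over records of the posterior P(X=1, Y=1 | o[m]), divided by the current value of that entry. Inference is done by brute-force enumeration, which suffices for two nodes; all names and numbers are ours:

```python
from itertools import product

# Parameters of the assumed network: th_x[x] = P(X=x), th_y[x][y] = P(Y=y | X=x).
th_x = [0.6, 0.4]
th_y = [[0.8, 0.2], [0.3, 0.7]]

def joint(x, y):
    """P(X=x, Y=y | Θ)."""
    return th_x[x] * th_y[x][y]

def posterior(x, y, obs):
    """P(X=x, Y=y | o) where obs = (ox, oy) and None marks a missing value."""
    ox, oy = obs
    num = joint(x, y) if (ox in (None, x) and oy in (None, y)) else 0.0
    den = sum(joint(a, b) for a, b in product((0, 1), (0, 1))
              if ox in (None, a) and oy in (None, b))
    return num / den

data = [(1, None), (None, 1), (0, 0), (1, 1)]   # observations o[m]; None = missing

# Theorem GA for the CPT entry θ = P(Y=1 | X=1) = th_y[1][1]:
# d log P(D|Θ) / d θ = (1 / th_y[1][1]) * Σ_m P(X=1, Y=1 | o[m], Θ)
grad = sum(posterior(1, 1, o) for o in data) / th_y[1][1]
print(grad)
```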

20

Gradient Ascent (cont)

Proof:

∂ log P(D|Θ) / ∂ θ_{x_i, pa_i}
    = Σ_m ∂ log P(o[m]|Θ) / ∂ θ_{x_i, pa_i}
    = Σ_m [1 / P(o[m]|Θ)] · ∂ P(o[m]|Θ) / ∂ θ_{x_i, pa_i}

How do we compute ∂ P(o[m]|Θ) / ∂ θ_{x_i, pa_i} ?

21

Gradient Ascent (cont)

Since

P(o|Θ) = Σ_{x'_i, pa'_i} P(x'_i, pa'_i, o | Θ)
       = Σ_{x'_i, pa'_i} P(o | x'_i, pa'_i, Θ) · P(x'_i | pa'_i, Θ) · P(pa'_i | Θ)
       = Σ_{x'_i, pa'_i} P(o | x'_i, pa'_i, Θ) · θ_{x'_i, pa'_i} · P(pa'_i | Θ),

and only the summand with x'_i = x_i and pa'_i = pa_i depends on θ_{x_i, pa_i} (its derivative with respect to θ_{x_i, pa_i} is 1), we get

∂ P(o|Θ) / ∂ θ_{x_i, pa_i} = P(o | x_i, pa_i, Θ) · P(pa_i | Θ) = P(x_i, pa_i, o | Θ) / θ_{x_i, pa_i}

22

Gradient Ascent (cont)

Putting it all together, we get

∂ log P(D|Θ) / ∂ θ_{x_i, pa_i}
    = Σ_m [1 / P(o[m]|Θ)] · ∂ P(o[m]|Θ) / ∂ θ_{x_i, pa_i}
    = Σ_m [1 / P(o[m]|Θ)] · P(x_i, pa_i, o[m] | Θ) / θ_{x_i, pa_i}
    = (1 / θ_{x_i, pa_i}) · Σ_m P(x_i, pa_i | o[m], Θ)
