
Learning Bayesian networks

Most Slides by Nir Friedman

Some by Dan Geiger

2

Known Structure -- Incomplete Data

[Figure: the Inducer is given the fixed network structure E -> A <- B, a CPT P(A | E,B) whose entries are unknown ("?"), and a data set over (E, B, A) containing missing values, e.g. <Y,N,N>, <Y,?,Y>, <N,N,Y>, <N,Y,?>, ..., <?,Y,Y>. It outputs the filled-in CPT P(A | E,B), illustrated with entries such as .9/.1, .7/.3, .8/.2, .99/.01.]

Network structure is specified.
Data contains missing values.
We consider assignments to missing values.

3

Learning Parameters from Incomplete Data

Incomplete data: posterior distributions can become interdependent.

Consequence:
ML parameters cannot be computed separately for each multinomial.
The posterior is not a product of independent posteriors.

[Figure: plate model with parameter nodes θ_X, θ_Y|X=H, θ_Y|X=T and, inside the plate over m, the nodes X[m] -> Y[m].]

4

Learning Parameters from Incomplete Data (cont.)

In the presence of incomplete data, the likelihood can have multiple global maxima.

Example: we can rename the values of a hidden variable H. If H has two values, the likelihood has two global maxima.

Similarly, local maxima are also replicated. Many hidden variables make this a serious problem.

[Figure: network H -> Y with H hidden.]
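To make the relabeling argument concrete, here is a minimal Python check, with assumed toy parameters (not taken from the slides), that swapping the two values of H in the network H -> Y leaves the observed-data likelihood unchanged, so the two parameter settings score identically:

```python
import math

def loglik(p_h, p_y_given_h, ys):
    """Log-likelihood of observed Y values; the hidden H is marginalized out."""
    total = 0.0
    for y in ys:
        p_y1 = p_h * p_y_given_h[1] + (1 - p_h) * p_y_given_h[0]  # P(Y=1)
        total += math.log(p_y1 if y == 1 else 1 - p_y1)
    return total

ys = [1, 0, 1, 1, 0]                          # observed data (Y only; H is hidden)
theta      = (0.3, {0: 0.9, 1: 0.2})          # assumed P(H=1), P(Y=1 | H=h)
theta_swap = (0.7, {0: 0.2, 1: 0.9})          # same model with the labels of H swapped

print(loglik(*theta, ys), loglik(*theta_swap, ys))   # identical values
```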

5

Expectation Maximization (EM)

A general-purpose method for learning from incomplete data.

Intuition:
If we had access to counts, we could estimate the parameters.
However, missing values do not allow us to perform the counts.
So we "complete" the counts using the current parameter assignment.

6

Expectation Maximization (EM)

[Figure: a data set over (X, Y, Z) with H/T values and some entries missing, the current model (a network over X, Y, Z), and the resulting table of expected counts N(X, Y) with fractional entries such as 1.3, 0.4, 1.7, 1.6. Under the current model, P(Y=H | X=H, Z=T, Θ) = 0.3 and P(Y=H | X=T, Z=T, Θ) = 0.4. These numbers are placed for illustration; they have not been computed.]
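A short sketch of how such fractional expected counts arise: each record with a missing Y is split between Y=H and Y=T according to the posterior under the current model. The posteriors are hard-coded here (0.4 and 0.3 are the slide's illustrative numbers; the remaining entries and the data rows are assumptions), since the point is the counting step rather than the inference:

```python
from collections import defaultdict

# Posterior P(Y=H | X, Z) under the current model. The values 0.4 and 0.3 are the
# slide's illustrative numbers; the other two entries are assumptions added so the
# table is complete.
posterior_y_H = {('T', 'T'): 0.4, ('H', 'T'): 0.3, ('T', 'H'): 0.5, ('H', 'H'): 0.5}

# Data rows (X, Y, Z); '?' marks a missing value (for brevity, only Y may be missing).
data = [('T', 'H', 'H'), ('H', '?', 'T'), ('H', 'T', 'T'), ('T', '?', 'T'), ('T', 'H', 'T')]

expected_N = defaultdict(float)          # expected counts N(X, Y)
for x, y, z in data:
    if y == '?':
        p = posterior_y_H[(x, z)]        # "complete" the record with the posterior
        expected_N[(x, 'H')] += p
        expected_N[(x, 'T')] += 1 - p
    else:
        expected_N[(x, y)] += 1.0        # a fully observed record counts as 1

print(dict(expected_N))                  # fractional counts, e.g. N(X=T, Y=H) = 2.4
```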

7

EM (cont.)

[Figure: the EM loop. Start from an initial network (G, Θ0) over X1, X2, X3, H, Y1, Y2, Y3 (with H hidden) and the training data. E-step: compute the expected counts N(X1), N(X2), N(X3), N(H, X1, X2, X3), N(Y1, H), N(Y2, H), N(Y3, H). M-step: reparameterize to obtain the updated network (G, Θ1). Reiterate.]
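Below is a minimal, runnable sketch of this loop, restricted for brevity to the hidden node H and its binary children Y1, Y2, Y3 (the parents X1, X2, X3 are dropped). All names are ours, not the slides'; the E-step accumulates expected counts from the exact posterior over H, and the M-step renormalizes them:

```python
import random

def em_hidden_parent(data, n_iters=20, seed=0):
    """EM for H -> (Y1, Y2, Y3) with H hidden; data = list of (y1, y2, y3) in {0,1}."""
    rng = random.Random(seed)
    p_h = rng.uniform(0.3, 0.7)                                          # initial P(H=1)
    p_y = [[rng.uniform(0.2, 0.8) for _ in range(3)] for _ in range(2)]  # P(Yj=1 | H=h)

    for _ in range(n_iters):
        # E-step: expected counts N(H) and N(Yj=1, H) under the posterior P(H | y).
        n_h = [1e-9, 1e-9]                    # tiny pseudo-counts avoid division by zero
        n_yh = [[0.0] * 3 for _ in range(2)]
        for y in data:
            joint = []
            for h in (0, 1):
                p = p_h if h == 1 else 1.0 - p_h
                for j in range(3):
                    p *= p_y[h][j] if y[j] == 1 else 1.0 - p_y[h][j]
                joint.append(p)
            z = sum(joint)
            for h in (0, 1):
                post = joint[h] / z           # posterior P(H=h | y1, y2, y3)
                n_h[h] += post
                for j in range(3):
                    n_yh[h][j] += post * y[j]
        # M-step: reparameterize from the expected counts.
        p_h = n_h[1] / (n_h[0] + n_h[1])
        p_y = [[n_yh[h][j] / n_h[h] for j in range(3)] for h in (0, 1)]
    return p_h, p_y

data = [(1, 1, 1), (1, 1, 0), (0, 0, 0), (0, 0, 1), (1, 1, 1), (0, 0, 0)]
print(em_hidden_parent(data))
```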

8

MLE from Incomplete Data

Finding the MLE parameters is a nonlinear optimization problem.

[Figure: the likelihood surface L(Θ | D).]

Expectation Maximization (EM): use the "current point" to construct an alternative function (which is "nice").
Guarantee: the maximum of the new function scores better than the current point.

9

EM in Practice

Initial parameters:
Random parameter setting
"Best" guess from another source

Stopping criteria:
Small change in the likelihood of the data
Small change in parameter values

Avoiding bad local maxima:
Multiple restarts (sketched below)
Early "pruning" of unpromising starting points
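A small sketch of the multiple-restart idea; run_em and score are placeholder callables (assumptions, not slide material), where score would typically be the log-likelihood of the data:

```python
def best_of_restarts(run_em, score, n_restarts=10):
    """Run EM from several initializations and keep the best-scoring result.

    run_em(seed) -> fitted parameters; score(params) -> e.g. data log-likelihood.
    Both callables stand in for whatever EM routine is being used.
    """
    best, best_score = None, float('-inf')
    for seed in range(n_restarts):
        params = run_em(seed)
        s = score(params)
        if s > best_score:
            best, best_score = params, s
    return best

# Toy usage with stand-in callables: "EM" just returns its seed, and the score
# prefers values near 3, so the wrapper picks seed 3.
print(best_of_restarts(lambda seed: seed, lambda p: -abs(p - 3)))
```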

10

The setup of the EM algorithm

We start with a likelihood function parameterized by θ.

The observed quantity is denoted X = x. It is often a vector x1,…,xL of observations (e.g., evidence for some nodes in a Bayesian network).

The hidden quantity is a vector Y = y (e.g., states of unobserved variables in a Bayesian network). The quantity y is defined such that, if it were known, the likelihood of the completed data point P(x,y|θ) would be easy to maximize.

The log-likelihood of an observation x has the form:
log P(x|θ) = log P(x,y|θ) – log P(y|x,θ)

(Because P(x,y|θ) = P(x|θ) P(y|x,θ).)
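A quick numerical check of this identity, using an assumed toy joint distribution P(x,y|θ) (the numbers are arbitrary):

```python
import math

# Assumed toy joint P(x, y | theta) over X in {a, b} and Y in {0, 1}.
p_xy = {('a', 0): 0.1, ('a', 1): 0.3, ('b', 0): 0.4, ('b', 1): 0.2}

x, y = 'a', 1
p_x = sum(p for (xi, _), p in p_xy.items() if xi == x)   # P(x) = sum_y P(x, y)
p_y_given_x = p_xy[(x, y)] / p_x                         # P(y | x)

# log P(x) = log P(x, y) - log P(y | x)
assert abs(math.log(p_x) - (math.log(p_xy[(x, y)]) - math.log(p_y_given_x))) < 1e-12
```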

11

The goal of the EM algorithm

The log-likelihood of an observation x has the form:
log P(x|θ) = log P(x,y|θ) – log P(y|x,θ)

The goal: starting with a current parameter vector θ', EM's goal is to find a new vector θ such that P(x|θ) > P(x|θ') with the highest possible difference.

The result: after enough iterations, EM reaches a local maximum of the likelihood P(x|θ).

For independent points (xi, yi), i = 1,…,m, we can similarly write:

Σi log P(xi|θ) = Σi log P(xi,yi|θ) – Σi log P(yi|xi,θ)

We will stick to one observation in our derivation, recalling that all derived equations can be modified by summing over x.

12

The Mathematics involved

Recall that the expectation of a random variable Y with a pdf P(y) is given by E[Y] = Σy y p(y).

The expectation of a function L(Y) is given by E[L(Y)] = Σy L(y) p(y).

A slightly harder example:
Eθ'[log p(x,y|θ)] = Σy p(y|x,θ') log p(x,y|θ) ≡ Q(θ|θ')

The expectation operator E is linear. For two random variables X, Y and constants a, b, the following holds:

E[aX + bY] = a E[X] + b E[Y]
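The snippet below evaluates this expectation, i.e. Q(θ|θ'), numerically for an assumed pair of toy joint distributions, one playing the role of θ' and one of θ:

```python
import math

# Assumed toy joints P(x, y | theta') and P(x, y | theta) over X in {a, b}, Y in {0, 1}.
p_xy_old = {('a', 0): 0.1, ('a', 1): 0.3, ('b', 0): 0.4, ('b', 1): 0.2}   # theta'
p_xy_new = {('a', 0): 0.2, ('a', 1): 0.2, ('b', 0): 0.3, ('b', 1): 0.3}   # theta

x = 'a'
p_x_old = sum(p for (xi, _), p in p_xy_old.items() if xi == x)            # P(x | theta')

# Q(theta | theta') = sum_y P(y | x, theta') * log P(x, y | theta)
q = sum((p_xy_old[(x, y)] / p_x_old) * math.log(p_xy_new[(x, y)]) for y in (0, 1))
print(q)
```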

13

The Mathematics involved (cont.)

Starting with log P(x|θ) = log P(x,y|θ) – log P(y|x,θ), multiplying both sides by P(y|x,θ'), and summing over y, yields

log P(x|θ) = Σy P(y|x,θ') log P(x,y|θ) – Σy P(y|x,θ') log P(y|x,θ)

The first term is Eθ'[log p(x,y|θ)] = Q(θ|θ'). We now observe that

Δ = log P(x|θ) – log P(x|θ')
  = Q(θ|θ') – Q(θ'|θ') + Σy P(y|x,θ') log [P(y|x,θ') / P(y|x,θ)]

The last term is a relative entropy and hence ≥ 0. So choosing θ* = argmaxθ Q(θ|θ') maximizes the difference Δ, and repeating this process leads to a local maximum of log P(x|θ).
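The nonnegativity of that leftover term is just the nonnegativity of relative entropy (KL divergence); here is a toy numerical check with two assumed posteriors over y:

```python
import math

p_old = [0.7, 0.3]     # P(y | x, theta'), an assumed toy posterior
p_new = [0.4, 0.6]     # P(y | x, theta)

# Relative entropy sum_y P(y|x,theta') log [P(y|x,theta') / P(y|x,theta)] is >= 0.
kl = sum(po * math.log(po / pn) for po, pn in zip(p_old, p_new))
print(kl, kl >= 0)
```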

14

The EM algorithm itself

Input: a likelihood function p(x,y|θ) parameterized by θ.

Initialization: fix an arbitrary starting value θ'.

Repeat:
E-step: compute Q(θ|θ') = Eθ'[log P(x,y|θ)]
M-step: θ' ← argmaxθ Q(θ|θ')
Until Δ = log P(x|θ) – log P(x|θ') < ε

Comment: at the M-step one can actually choose any θ' as long as Δ > 0. This change yields the so-called Generalized EM algorithm. It is important when argmax is hard to compute.

16

Expectation Maximization (EM)

In practice, EM converges rather quickly at the start but slowly near the (possibly local) maximum.

Hence, EM is often run for a few iterations and then gradient ascent steps are applied.

17

MLE from Incomplete Data

Finding the MLE parameters is a nonlinear optimization problem.

Gradient Ascent: follow the gradient of the likelihood w.r.t. the parameters.

[Figure: the likelihood surface L(Θ | D) with gradient-ascent steps.]
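A generic gradient-ascent loop, sketched with a placeholder gradient callable; the toy objective in the usage line is ours and is not the likelihood from the slides:

```python
def gradient_ascent(grad, theta0, eta=0.05, n_steps=200):
    """Repeatedly step in the direction of the gradient; grad(theta) -> dL/dtheta."""
    theta = theta0
    for _ in range(n_steps):
        theta = theta + eta * grad(theta)
    return theta

# Toy usage: maximize L(theta) = -(theta - 2)^2, whose gradient is -2*(theta - 2).
print(gradient_ascent(lambda t: -2.0 * (t - 2.0), theta0=0.0))   # approaches 2
```

In the Bayesian-network setting the gradient of each CPT entry is given by Theorem GA on the slides that follow, and the entries must additionally be kept normalized (e.g. by projecting back onto the simplex or reparameterizing), which this sketch ignores.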

18

MLE from Incomplete Data

Both ideas:
Find local maxima only.
Require multiple restarts to find an approximation to the global maximum.

19

Gradient Ascent

Main result

Theorem GA:

∂ log P(D|Θ) / ∂ θ_{x_i, pa_i}  =  (1 / θ_{x_i, pa_i}) · Σ_m P(x_i, pa_i | o[m], Θ)

Requires computing P(x_i, pa_i | o[m], Θ) for all i, m.

Inference replaces taking derivatives.
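A minimal sketch of Theorem GA on an assumed two-node network X -> Y with binary variables: the gradient with respect to one CPT entry is the sum over records of the posterior P(X=1, Y=1 | o[m]), divided by the current value of that entry. Inference is done by brute-force enumeration, which suffices for two nodes; all names and numbers are ours:

```python
from itertools import product

# Parameters of the assumed network: th_x[x] = P(X=x), th_y[x][y] = P(Y=y | X=x).
th_x = [0.6, 0.4]
th_y = [[0.8, 0.2], [0.3, 0.7]]

def joint(x, y):
    """P(X=x, Y=y | Θ)."""
    return th_x[x] * th_y[x][y]

def posterior(x, y, obs):
    """P(X=x, Y=y | o) where obs = (ox, oy) and None marks a missing value."""
    ox, oy = obs
    num = joint(x, y) if (ox in (None, x) and oy in (None, y)) else 0.0
    den = sum(joint(a, b) for a, b in product((0, 1), (0, 1))
              if ox in (None, a) and oy in (None, b))
    return num / den

data = [(1, None), (None, 1), (0, 0), (1, 1)]   # observations o[m]; None = missing

# Theorem GA for the CPT entry θ = P(Y=1 | X=1) = th_y[1][1]:
# d log P(D|Θ) / d θ = (1 / th_y[1][1]) * Σ_m P(X=1, Y=1 | o[m], Θ)
grad = sum(posterior(1, 1, o) for o in data) / th_y[1][1]
print(grad)
```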

20

Gradient Ascent (cont)

Proof:

∂ log P(D|Θ) / ∂ θ_{x_i, pa_i}
    = Σ_m ∂ log P(o[m]|Θ) / ∂ θ_{x_i, pa_i}
    = Σ_m [1 / P(o[m]|Θ)] · ∂ P(o[m]|Θ) / ∂ θ_{x_i, pa_i}

How do we compute ∂ P(o[m]|Θ) / ∂ θ_{x_i, pa_i} ?

21

Gradient Ascent (cont)

Since

P(o|Θ) = Σ_{x'_i, pa'_i} P(x'_i, pa'_i, o | Θ)
       = Σ_{x'_i, pa'_i} P(o | x'_i, pa'_i, Θ) · P(x'_i | pa'_i, Θ) · P(pa'_i | Θ)
       = Σ_{x'_i, pa'_i} P(o | x'_i, pa'_i, Θ) · θ_{x'_i, pa'_i} · P(pa'_i | Θ),

and only the summand with x'_i = x_i and pa'_i = pa_i depends on θ_{x_i, pa_i} (its derivative with respect to θ_{x_i, pa_i} is 1), we get

∂ P(o|Θ) / ∂ θ_{x_i, pa_i} = P(o | x_i, pa_i, Θ) · P(pa_i | Θ) = P(x_i, pa_i, o | Θ) / θ_{x_i, pa_i}

22

Gradient Ascent (cont)

Putting it all together, we get

∂ log P(D|Θ) / ∂ θ_{x_i, pa_i}
    = Σ_m [1 / P(o[m]|Θ)] · ∂ P(o[m]|Θ) / ∂ θ_{x_i, pa_i}
    = Σ_m [1 / P(o[m]|Θ)] · P(x_i, pa_i, o[m] | Θ) / θ_{x_i, pa_i}
    = (1 / θ_{x_i, pa_i}) · Σ_m P(x_i, pa_i | o[m], Θ)
