580.691 Learning Theory
Reza Shadmehr
EM and expected complete log-likelihood
Mixture of Experts
Identification of a linear dynamical system
The log likelihood of the unlabeled data
The graphical model has a hidden variable z (a three-state multinomial) and a measured variable x. The prior P(z) and the class-conditional densities p(x | z_i = 1, theta_i), i = 1, 2, 3, define the mixture:

$$p(x) = \sum_{i=1}^{3} P(z_i = 1)\, p(x \mid z_i = 1)$$

The unlabeled data:

$$D = \left\{ x^{(1)}, \ldots, x^{(N)} \right\}, \qquad
\theta = \left\{ P(z_1 = 1), \mu_1, \Sigma_1, \ldots, P(z_3 = 1), \mu_3, \Sigma_3 \right\}$$

$$p(D \mid \theta) = \prod_{n=1}^{N} \sum_{i=1}^{3} P(z_i = 1)\, N\!\left(x^{(n)}; \mu_i, \Sigma_i\right)$$

$$l(D \mid \theta) = \sum_{n=1}^{N} \log \sum_{i=1}^{3} P(z_i = 1)\, N\!\left(x^{(n)}; \mu_i, \Sigma_i\right)$$
In the last lecture we assumed that in the M step we knew the posterior probabilities, and found the derivative of the log-likelihood with respect to mu and sigma to maximize the log-likelihood. Today we take a more general approach in which both the E and M steps operate on a single quantity, the expected complete log-likelihood.
A more general formulation of EM: Expected complete log likelihood
The real data is not labeled. But for now, assume that someone labeled it, resulting in the “complete data”.
The complete data and its likelihood:

$$D_c = \left\{ \left(x^{(1)}, z^{(1)}\right), \ldots, \left(x^{(N)}, z^{(N)}\right) \right\}$$

$$p\!\left(x^{(n)}, z^{(n)}\right) = \prod_{i=1}^{3} \left[ P(z_i = 1)\, N\!\left(x^{(n)}; \mu_i, \Sigma_i\right) \right]^{z_i^{(n)}}$$

$$p(D_c \mid \theta) = \prod_{n=1}^{N} \prod_{i=1}^{3} \left[ P(z_i = 1)\, N\!\left(x^{(n)}; \mu_i, \Sigma_i\right) \right]^{z_i^{(n)}}$$

Complete log-likelihood:

$$l_c(D_c \mid \theta) = \sum_{n=1}^{N} \sum_{i=1}^{3} z_i^{(n)} \log\!\left[ P(z_i = 1)\, N\!\left(x^{(n)}; \mu_i, \Sigma_i\right) \right]$$

Expected complete log-likelihood:

$$E\!\left[ l_c(D_c \mid \theta) \right] = \sum_{n=1}^{N} \sum_{i=1}^{3} E\!\left[ z_i^{(n)} \right] \log\!\left[ P(z_i = 1)\, N\!\left(x^{(n)}; \mu_i, \Sigma_i\right) \right], \qquad
E\!\left[ z_i^{(n)} \right] = P\!\left(z_i = 1 \mid x^{(n)}\right)$$
In EM, in the E step we fix theta and try to maximize the expected complete log-likelihood by setting the expected value of the hidden variables z to the posterior probabilities.
In the M step, we fix the expected value of z and try to maximize the expected complete log-likelihood by adjusting the parameters theta.
$$p(x, z) = p(x \mid z)\, p(z)$$
A more general formulation of EM: Expected complete log likelihood
$$E\!\left[ l_c(D_c \mid \theta) \right] = \sum_{n=1}^{N} \sum_{i=1}^{3} P\!\left(z_i = 1 \mid x^{(n)}\right) \log\!\left[ P(z_i = 1)\, N\!\left(x^{(n)}; \mu_i, \Sigma_i\right) \right]
= \sum_{n=1}^{N} \sum_{i=1}^{3} P\!\left(z_i = 1 \mid x^{(n)}\right) \left[ \log P(z_i = 1) + \log N\!\left(x^{(n)}; \mu_i, \Sigma_i\right) \right]$$

Collecting the terms that depend on mu_i:

$$J(\mu_i) = \text{const} - \frac{1}{2} \sum_{n=1}^{N} P\!\left(z_i = 1 \mid x^{(n)}\right) \left( x^{(n)} - \mu_i \right)^T \Sigma_i^{-1} \left( x^{(n)} - \mu_i \right)$$

$$\frac{dJ}{d\mu_i} = \sum_{n=1}^{N} P\!\left(z_i = 1 \mid x^{(n)}\right) \Sigma_i^{-1} \left( x^{(n)} - \mu_i \right) = 0
\qquad\Rightarrow\qquad
\hat{\mu}_i = \frac{\sum_{n=1}^{N} P\!\left(z_i = 1 \mid x^{(n)}\right) x^{(n)}}{\sum_{n=1}^{N} P\!\left(z_i = 1 \mid x^{(n)}\right)}$$
Expected complete log-likelihood
In the M step, we fix the expected value of z and try to maximize the expected complete log-likelihood by adjusting the parameters theta.
$$N(x; \mu, \Sigma) = (2\pi)^{-d/2} \left|\Sigma\right|^{-1/2} \exp\!\left[ -\frac{1}{2} \left( x - \mu \right)^T \Sigma^{-1} \left( x - \mu \right) \right]$$

$$\log N(x; \mu, \Sigma) = -\frac{d}{2} \log 2\pi - \frac{1}{2} \log\left|\Sigma\right| - \frac{1}{2} \left( x - \mu \right)^T \Sigma^{-1} \left( x - \mu \right)$$
Collecting the terms of the expected complete log-likelihood that depend on Sigma_i:

$$J(\Sigma_i) = \text{const} + \sum_{n=1}^{N} P\!\left(z_i = 1 \mid x^{(n)}\right) \left[ \frac{1}{2} \log\left|\Sigma_i^{-1}\right| - \frac{1}{2} \left( x^{(n)} - \mu_i \right)^T \Sigma_i^{-1} \left( x^{(n)} - \mu_i \right) \right]$$

Differentiating with respect to Sigma_i^{-1} and setting the result to zero:

$$\frac{dJ}{d\Sigma_i^{-1}} = \sum_{n=1}^{N} P\!\left(z_i = 1 \mid x^{(n)}\right) \left[ \frac{1}{2} \Sigma_i - \frac{1}{2} \left( x^{(n)} - \mu_i \right) \left( x^{(n)} - \mu_i \right)^T \right] = 0$$

$$\hat{\Sigma}_i = \frac{\sum_{n=1}^{N} P\!\left(z_i = 1 \mid x^{(n)}\right) \left( x^{(n)} - \mu_i \right) \left( x^{(n)} - \mu_i \right)^T}{\sum_{n=1}^{N} P\!\left(z_i = 1 \mid x^{(n)}\right)}$$
Useful identities for differentiating with respect to Sigma_i^{-1}:

$$\log\left|\Sigma_i\right| = -\log\left|\Sigma_i^{-1}\right|, \qquad
\frac{d}{d\Sigma_i^{-1}} \log\left|\Sigma_i^{-1}\right| = \Sigma_i^{T}, \qquad
\frac{d}{d\Sigma_i^{-1}} \left( a^T \Sigma_i^{-1} a \right) = a\, a^T$$

Differentiating with respect to the inverse is enough, because the stationary points coincide (scalar version):

$$\frac{df(x)}{dx^{-1}} = \frac{df(x)}{dx} \frac{dx}{dx^{-1}} = -x^{2} \frac{df(x)}{dx};
\quad \text{if } \frac{df(x)}{dx^{-1}} = 0 \text{ at some } x \neq 0 \text{, then } \frac{df(x)}{dx} = 0 \text{ there as well.}$$
Writing pi_i = P(z_i = 1), the part of the expected complete log-likelihood that depends on the priors is the function to maximize:

$$E\!\left[ l_c(D_c \mid \theta) \right] = \sum_{n=1}^{N} \sum_{i=1}^{3} P\!\left(z_i = 1 \mid x^{(n)}\right) \left[ \log \pi_i + \log N\!\left(x^{(n)}; \mu_i, \Sigma_i\right) \right]$$

$$J(\pi) = \sum_{n=1}^{N} \sum_{i=1}^{3} P\!\left(z_i = 1 \mid x^{(n)}\right) \log \pi_i$$

subject to the constraint that the priors sum to one.
The value of pi_i that maximizes this function alone is one. But that is not the answer, because we also have a constraint: the sum of the priors should be one. So we want to maximize this function subject to the constraint that the pi_i sum to 1.
Constraint:

$$g(\pi) = \sum_{i=1}^{3} \pi_i - 1 = 0$$

Appending the constraint to the function to maximize with a Lagrange multiplier lambda:

$$J'(\pi, \lambda) = \sum_{n=1}^{N} \sum_{i=1}^{3} P\!\left(z_i = 1 \mid x^{(n)}\right) \log \pi_i - \lambda \left( \sum_{i=1}^{3} \pi_i - 1 \right)$$

$$\frac{dJ'}{d\pi_i} = \frac{1}{\pi_i} \sum_{n=1}^{N} P\!\left(z_i = 1 \mid x^{(n)}\right) - \lambda = 0
\qquad\Rightarrow\qquad
\pi_i = \frac{1}{\lambda} \sum_{n=1}^{N} P\!\left(z_i = 1 \mid x^{(n)}\right)$$

We have 3 such equations, one for each pi_i. If we add the equations together and use the constraint, we get:

$$\lambda = \sum_{i=1}^{3} \sum_{n=1}^{N} P\!\left(z_i = 1 \mid x^{(n)}\right) = N
\qquad\Rightarrow\qquad
\hat{\pi}_i = \frac{1}{N} \sum_{n=1}^{N} P\!\left(z_i = 1 \mid x^{(n)}\right)$$
EM algorithm: Summary
We begin with a guess about the mixture parameters:

$$\theta^{(k)} = \left\{ P(z_1 = 1)^{(k)}, \mu_1^{(k)}, \Sigma_1^{(k)}, \ldots, P(z_m = 1)^{(k)}, \mu_m^{(k)}, \Sigma_m^{(k)} \right\}$$

The "E" step: calculate the expected complete log-likelihood. In the mixture example, this reduces to just computing the posterior probabilities:

$$E\!\left[ l_c(D_c \mid \theta) \right] = \sum_{n=1}^{N} \sum_{i=1}^{m} E\!\left[ z_i^{(n)} \right] \log\!\left[ P(z_i = 1)\, N\!\left(x^{(n)}; \mu_i, \Sigma_i\right) \right]
= \sum_{n=1}^{N} \sum_{i=1}^{m} P\!\left(z_i = 1 \mid x^{(n)}, \theta^{(k)}\right) \log\!\left[ P(z_i = 1)\, N\!\left(x^{(n)}; \mu_i, \Sigma_i\right) \right]$$

The "M" step: maximize the expected complete log-likelihood with respect to the model parameters theta:

$$\theta^{(k+1)} = \arg\max_{\theta} E\!\left[ l_c(D_c \mid \theta) \right]$$
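As a concrete illustration of these updates, here is a minimal sketch of EM for a Gaussian mixture in Python (numpy and scipy assumed). The function name, the initialization scheme, and the fixed iteration count are choices of this sketch, not prescriptions from the slides:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gaussian_mixture(X, m=3, n_iters=100, seed=0):
    """EM for an m-component Gaussian mixture on unlabeled data X of shape (N, d)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Initial guess theta^(0): random data points as means, identity covariances, uniform priors.
    mu = X[rng.choice(N, size=m, replace=False)]
    Sigma = np.array([np.eye(d) for _ in range(m)])
    pi = np.full(m, 1.0 / m)

    for _ in range(n_iters):
        # E step: posterior probabilities P(z_i = 1 | x^(n)).
        lik = np.column_stack([
            pi[i] * multivariate_normal.pdf(X, mean=mu[i], cov=Sigma[i])
            for i in range(m)
        ])                                       # shape (N, m)
        post = lik / lik.sum(axis=1, keepdims=True)

        # M step: the updates for priors, means, and covariances derived above.
        Nk = post.sum(axis=0)                    # effective counts per component
        pi = Nk / N
        mu = (post.T @ X) / Nk[:, None]
        for i in range(m):
            diff = X - mu[i]
            Sigma[i] = (post[:, i, None] * diff).T @ diff / Nk[i]

    return pi, mu, Sigma, post
```

Each pass performs one E step (the posterior probabilities) and one M step (the prior, mean, and covariance updates derived above).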
Selecting number of mixture components
A simple idea that helps with selecting the number of mixture components is to form a cost that depends on both the log-likelihood of the data and the number of parameters used in the model. As the number of parameters increases, the log-likelihood increases. We want a cost that balances the improvement in log-likelihood against the cost of adding parameters.
A common technique is to choose the number of mixture components m that minimizes the "description length".
$$DL = -\log p\!\left( x^{(1)}, x^{(2)}, \ldots, x^{(N)} \mid \hat{\theta}_m \right) + \frac{d_m}{2} \log N$$

Here theta_hat_m is the maximum-likelihood estimate of the parameters for m mixture components, d_m is the effective number of parameters in the model, and N is the number of data points. We choose the m that minimizes the description length.
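A small sketch of how this criterion might be computed; the helper fit_mixture is hypothetical and stands for any routine that returns the maximized log-likelihood and the effective parameter count of an m-component fit:

```python
import numpy as np

def description_length(log_likelihood, n_params, n_points):
    """DL = -log p(D | theta_hat_m) + (d_m / 2) * log N; smaller is better."""
    return -log_likelihood + 0.5 * n_params * np.log(n_points)

# Hypothetical usage: pick the number of components with the smallest DL.
# best_m = min(range(1, 6),
#              key=lambda m: description_length(*fit_mixture(X, m), len(X)))
```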
[Figure: scatter plot of the data (x, y); the relationship looks piecewise linear, with one slope for x below 0 and another for x above 0.]

[Network diagram: the input x feeds Expert 1 (y = w_1^T x), Expert 2 (y = w_2^T x), and a Moderator that outputs the conditional probability of choosing each expert; the expert outputs are combined (+) according to those probabilities.]
The data set (x, y) is clearly non-linear, but we can break it up into two linear problems. We will try to switch from one "expert" to another at around x = 0.
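For concreteness, here is one way such a data set could be generated (a sketch only; the slopes, noise level, and sample count are assumptions, not values from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200
x = rng.uniform(-1, 5, size=N)
# Two linear regimes that switch near x = 0 (made-up example slopes).
w1, w2 = -3.0, 2.0
y = np.where(x < 0, w1 * x, w2 * x) + rng.normal(scale=0.5, size=N)
X = np.column_stack([x, np.ones(N)])   # include a bias term for the linear experts
```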
Mixture of Experts

$$\hat{y} = \sum_{i=1}^{2} P\!\left(z_i = 1 \mid x\right) w_i^T x
= P\!\left(z_1 = 1 \mid x\right) w_1^T x + P\!\left(z_2 = 1 \mid x\right) w_2^T x$$
[Figure: the moderator's probabilities P(z_i = 1 | x), for example P(z_2 = 1 | x), plotted against x; the values lie between 0 and 1.]
We have observed a sequence of data points (x, y), and believe that it was generated by the process shown below. Note that y depends on both x (which we can measure) and z, which is hidden from us.

[Graphical model: x points to z and to y, and z points to y. The moderator supplies p(z | x); conditioned on z_i = 1, the i-th expert supplies p(y | x, theta_i), for i = 1, 2, 3.]

The hidden variable z is modeled with a multinomial logit (soft-max):

$$\log \frac{P\!\left(z_i = 1 \mid x\right)}{P\!\left(z_k = 1 \mid x\right)} = \left( v_i - v_k \right)^T x, \qquad
P\!\left(z_i = 1 \mid x\right) = \frac{\exp\!\left(v_i^T x\right)}{\sum_{j} \exp\!\left(v_j^T x\right)}$$
For example, the dependence of y on x might be a simple linear model, but conditioned on z, where z is a multinomial.
The Moderator (gating network)
When there are only two experts, the moderator can be a logistic function:
$$P\!\left(z_1 = 1 \mid x\right) = \frac{1}{1 + \exp\!\left(-v^T x\right)}$$
When there are multiple experts, the moderator can be a soft-max function:
$$P\!\left(z_i = 1 \mid x\right) = \frac{\exp\!\left(v_i^T x\right)}{\sum_{j=1}^{m} \exp\!\left(v_j^T x\right)}$$
Based on our hypothesis, we should have the following distribution of observed data:

$$p(y \mid x) = \sum_{i=1}^{m} P\!\left(z_i = 1 \mid x\right) p\!\left(y \mid z_i = 1, x\right)$$

A key quantity is the posterior probability of the latent variable z:

$$P\!\left(z_i = 1 \mid x, y\right)
= \frac{P\!\left(z_i = 1 \mid x\right) p\!\left(y \mid z_i = 1, x\right)}{p(y \mid x)}
= \frac{P\!\left(z_i = 1 \mid x\right) p\!\left(y \mid z_i = 1, x\right)}{\sum_{j=1}^{m} P\!\left(z_j = 1 \mid x\right) p\!\left(y \mid z_j = 1, x\right)}$$
Note that the posterior probability for the i-th expert is updated based on how probable the observed data y was under that expert. In effect, the expression tells us how strongly, given the observed data y, we should assign it to expert i.
Posterior probability that the observed y “belongs” to the i-th expert.
The moderator has its own parameters (the gating weights v_i), and each expert has its own parameters (w_i and sigma_i):

$$p\!\left(y \mid x, z_i = 1\right) = N\!\left(y;\, w_i^T x,\, \sigma_i^2\right)$$

$$p(y \mid x) = \sum_{i=1}^{m} P\!\left(z_i = 1 \mid x\right) N\!\left(y;\, w_i^T x,\, \sigma_i^2\right)$$
Suppose there are two experts (m = 2). For a given value of x, the two regressions each give us a Gaussian distribution centered at that expert's prediction. Therefore, for each value of x we have a bimodal probability distribution for y: a mixture distribution in the output space y for each input value of x.
[Network diagram labels: parameters of the i-th expert; output of the moderator; output of the i-th expert; output of the whole network.]
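A small sketch of this mixture density in the output space, assuming an m-expert model with given (made-up) parameters; the function name and argument layout are mine:

```python
import numpy as np
from scipy.stats import norm

def predictive_density(y_grid, x, V, W, sigma):
    """p(y | x) = sum_i P(z_i = 1 | x) N(y; w_i^T x, sigma_i^2).
    V, W: (m, d) gating and expert weight matrices; sigma: (m,) expert std devs."""
    logits = V @ x
    gate = np.exp(logits - logits.max())
    gate /= gate.sum()                          # soft-max gating probabilities
    means = W @ x                               # each expert's prediction w_i^T x
    return sum(g * norm.pdf(y_grid, loc=mu, scale=s)
               for g, mu, s in zip(gate, means, sigma))
```

For two well-separated experts this density is bimodal at a given x, exactly as described above.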
$$D = \left\{ \left(x^{(1)}, y^{(1)}\right), \ldots, \left(x^{(N)}, y^{(N)}\right) \right\}$$

$$p(D \mid \theta) = \prod_{n=1}^{N} p\!\left(y^{(n)} \mid x^{(n)}\right)
= \prod_{n=1}^{N} \sum_{j=1}^{m} P\!\left(z_j = 1 \mid x^{(n)}\right) N\!\left(y^{(n)};\, w_j^T x^{(n)},\, \sigma_j^2\right)$$

$$l(D \mid \theta) = \sum_{n=1}^{N} \log \sum_{j=1}^{m} P\!\left(z_j = 1 \mid x^{(n)}\right) N\!\left(y^{(n)};\, w_j^T x^{(n)},\, \sigma_j^2\right)$$
The log-likelihood of the observed data.
The complete log-likelihood for the mixture of experts problem. The "completed" data:

$$D_c = \left\{ \left(x^{(1)}, y^{(1)}, z^{(1)}\right), \ldots, \left(x^{(N)}, y^{(N)}, z^{(N)}\right) \right\}$$

$$p\!\left(y^{(n)}, z^{(n)} \mid x^{(n)}\right)
= \prod_{i=1}^{m} \left[ P\!\left(z_i = 1 \mid x^{(n)}\right) p\!\left(y^{(n)} \mid z_i = 1, x^{(n)}\right) \right]^{z_i^{(n)}}
= \prod_{i=1}^{m} \left[ P\!\left(z_i = 1 \mid x^{(n)}\right) N\!\left(y^{(n)};\, w_i^T x^{(n)},\, \sigma_i^2\right) \right]^{z_i^{(n)}}$$

$$p(D_c \mid \theta) = \prod_{n=1}^{N} \prod_{i=1}^{m} \left[ P\!\left(z_i = 1 \mid x^{(n)}\right) N\!\left(y^{(n)};\, w_i^T x^{(n)},\, \sigma_i^2\right) \right]^{z_i^{(n)}}$$

Complete log-likelihood:

$$l_c(D_c \mid \theta) = \sum_{n=1}^{N} \sum_{i=1}^{m} z_i^{(n)} \log\!\left[ P\!\left(z_i = 1 \mid x^{(n)}\right) N\!\left(y^{(n)};\, w_i^T x^{(n)},\, \sigma_i^2\right) \right]$$

Expected complete log-likelihood (assuming that someone had given us theta):

$$E\!\left[ l_c(D_c \mid \theta) \right]
= \sum_{n=1}^{N} \sum_{i=1}^{m} E\!\left[ z_i^{(n)} \right] \log\!\left[ P\!\left(z_i = 1 \mid x^{(n)}\right) N\!\left(y^{(n)};\, w_i^T x^{(n)},\, \sigma_i^2\right) \right]
= \sum_{n=1}^{N} \sum_{i=1}^{m} P\!\left(z_i = 1 \mid x^{(n)}, y^{(n)}\right) \log\!\left[ P\!\left(z_i = 1 \mid x^{(n)}\right) N\!\left(y^{(n)};\, w_i^T x^{(n)},\, \sigma_i^2\right) \right]$$
In the E step, we begin by assuming that we have theta. To compute the expected complete log-likelihood, all we need are the posterior probabilities.
The E step for the mixture of experts problem:

$$P\!\left(z_i = 1 \mid x^{(n)}, y^{(n)}\right)
= \frac{P\!\left(z_i = 1 \mid x^{(n)}\right) p\!\left(y^{(n)} \mid z_i = 1, x^{(n)}\right)}{\sum_{j=1}^{m} P\!\left(z_j = 1 \mid x^{(n)}\right) p\!\left(y^{(n)} \mid z_j = 1, x^{(n)}\right)}$$

The posterior for each expert depends on the likelihood that the observed data y came from that expert.
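A minimal sketch of this E step (numpy and scipy assumed; the array shapes and names are my choices):

```python
import numpy as np
from scipy.stats import norm

def moe_e_step(X, y, V, W, sigma):
    """Posterior responsibilities h[n, i] = P(z_i = 1 | x^(n), y^(n)).
    X: (N, d) inputs; V, W: (m, d) gating and expert weights; sigma: (m,) std devs."""
    logits = X @ V.T
    gate = np.exp(logits - logits.max(axis=1, keepdims=True))
    gate /= gate.sum(axis=1, keepdims=True)                # P(z_i = 1 | x^(n)), soft-max
    lik = norm.pdf(y[:, None], loc=X @ W.T, scale=sigma)   # N(y^(n); w_i^T x^(n), sigma_i^2)
    h = gate * lik
    return h / h.sum(axis=1, keepdims=True)
```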
The M step for the mixture of experts problem: the moderator
Only the gating terms of the expected complete log-likelihood depend on the moderator's weights. Write the posterior probabilities as h_i^(n) = P(z_i = 1 | x^(n), y^(n)):

$$E\!\left[ l_c \right] = \sum_{n=1}^{N} \sum_{i=1}^{m} h_i^{(n)} \log\!\left[ P\!\left(z_i = 1 \mid x^{(n)}\right) N\!\left(y^{(n)};\, w_i^T x^{(n)},\, \sigma_i^2\right) \right]$$

With m = 2 the moderator is a logistic function:

$$\mu^{(n)} \equiv P\!\left(z_1 = 1 \mid x^{(n)}\right) = \frac{1}{1 + \exp\!\left(-v^T x^{(n)}\right)}, \qquad
P\!\left(z_2 = 1 \mid x^{(n)}\right) = 1 - \mu^{(n)}$$

Keeping only the terms that depend on v:

$$J(v) = \sum_{n=1}^{N} \left[ h_1^{(n)} \log \mu^{(n)} + h_2^{(n)} \log\!\left(1 - \mu^{(n)}\right) \right]
= \sum_{n=1}^{N} \left[ h_1^{(n)} \log \mu^{(n)} + \left(1 - h_1^{(n)}\right) \log\!\left(1 - \mu^{(n)}\right) \right]$$

Exactly the same as the IRLS cost function. We find first and second derivatives and find a learning rule:

$$\frac{dJ}{dv} = \sum_{n=1}^{N} \left( h_1^{(n)} - \mu^{(n)} \right) x^{(n)}, \qquad
\frac{d^2 J}{dv\, dv^T} = -\sum_{n=1}^{N} \mu^{(n)} \left( 1 - \mu^{(n)} \right) x^{(n)} x^{(n)T}$$

$$v^{(k+1)} = v^{(k)} - \left( \frac{d^2 J}{dv\, dv^T} \right)^{-1} \frac{dJ}{dv}$$

The moderator learns from the posterior probability.
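A sketch of this IRLS-style update for the two-expert moderator (the function name and the fixed number of Newton steps are my choices):

```python
import numpy as np

def moderator_m_step(X, h1, v, n_newton=10):
    """Update the gating weights v using the posteriors h1[n] = P(z_1 = 1 | x^(n), y^(n))."""
    for _ in range(n_newton):
        mu = 1.0 / (1.0 + np.exp(-X @ v))       # current gating probabilities
        grad = X.T @ (h1 - mu)                   # first derivative of J(v)
        W = mu * (1.0 - mu)
        hess = -(X.T * W) @ X                    # second derivative of J(v)
        v = v - np.linalg.solve(hess, grad)      # Newton / IRLS step
    return v
```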
The M step for the mixture of experts problem: weights of the expert
Only the Gaussian terms of expert i depend on its weights w_i:

$$J(w_i) = \sum_{n=1}^{N} h_i^{(n)} \log N\!\left(y^{(n)};\, w_i^T x^{(n)},\, \sigma_i^2\right)
= \text{const} - \frac{1}{2\sigma_i^2} \sum_{n=1}^{N} h_i^{(n)} \left( y^{(n)} - w_i^T x^{(n)} \right)^2$$

$$\frac{dJ}{dw_i} = \frac{1}{\sigma_i^2} \sum_{n=1}^{N} h_i^{(n)} \left( y^{(n)} - w_i^T x^{(n)} \right) x^{(n)}$$

The expert i learns from the observed data point y, weighted by the posterior probability that the error came from that expert:

$$w_i^{(k+1)} = w_i^{(k)} + \eta\, h_i^{(n)} \left( y^{(n)} - w_i^{(k)T} x^{(n)} \right) x^{(n)}$$

Setting the derivative to zero instead gives a weighted least-squares problem:

$$\hat{w}_i = \left( \sum_{n=1}^{N} h_i^{(n)}\, x^{(n)} x^{(n)T} \right)^{-1} \sum_{n=1}^{N} h_i^{(n)}\, y^{(n)} x^{(n)}$$
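The closed-form weighted least-squares update, as a short sketch (array names assumed):

```python
import numpy as np

def expert_m_step(X, y, h_i):
    """w_i = (X^T H X)^{-1} X^T H y with H = diag(h_i), the responsibilities of expert i."""
    XtH = X.T * h_i                      # scale each sample by its responsibility
    return np.linalg.solve(XtH @ X, XtH @ y)
```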
The M step for the mixture of experts problem: variance of the expert
Only the Gaussian terms of expert i depend on its variance sigma_i^2:

$$J(\sigma_i) = \sum_{n=1}^{N} h_i^{(n)} \log N\!\left(y^{(n)};\, w_i^T x^{(n)},\, \sigma_i^2\right)
= \text{const} + \sum_{n=1}^{N} h_i^{(n)} \left[ -\frac{1}{2} \log \sigma_i^2 - \frac{1}{2\sigma_i^2} \left( y^{(n)} - w_i^T x^{(n)} \right)^2 \right]$$

$$\frac{dJ}{d\sigma_i^2} = \sum_{n=1}^{N} h_i^{(n)} \left[ -\frac{1}{2\sigma_i^2} + \frac{1}{2\sigma_i^4} \left( y^{(n)} - w_i^T x^{(n)} \right)^2 \right] = 0$$

$$\sigma_i^{2\,(k+1)} = \frac{\sum_{n=1}^{N} h_i^{(n)} \left( y^{(n)} - w_i^T x^{(n)} \right)^2}{\sum_{n=1}^{N} h_i^{(n)}}$$
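And the corresponding sketch of the variance update (names assumed):

```python
import numpy as np

def expert_variance_m_step(X, y, h_i, w_i):
    """Responsibility-weighted variance of the i-th expert's prediction errors."""
    resid = y - X @ w_i
    return np.sum(h_i * resid**2) / np.sum(h_i)
```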
Parameter Estimation for Linear Dynamical Systems using EM
Objective: to find the parameters A, B, C, Q, and R of a linear dynamical system from a set of data that includes inputs u and outputs y. The hidden state x evolves and generates the measurements:

$$x^{(n+1)} = A x^{(n)} + B u^{(n)} + \varepsilon_x, \qquad \varepsilon_x \sim N(0, Q)$$

$$y^{(n)} = C x^{(n)} + \varepsilon_y, \qquad \varepsilon_y \sim N(0, R)$$

$$p\!\left(y^{(n)} \mid x^{(n)}\right) = N\!\left(y^{(n)};\, C x^{(n)},\, R\right), \qquad
p\!\left(x^{(n+1)} \mid x^{(n)}, u^{(n)}\right) = N\!\left(x^{(n+1)};\, A x^{(n)} + B u^{(n)},\, Q\right)$$

The observed data are the outputs (together with the known inputs); the states are hidden, so the completed data include both:

$$D = \left\{ y^{(1)}, y^{(2)}, \ldots, y^{(N)} \right\}, \qquad
D_c = \left\{ x^{(0)}, \ldots, x^{(N)},\; y^{(1)}, \ldots, y^{(N)} \right\}, \qquad
\theta = \left\{ A, B, C, Q, R \right\}$$

We need to find the expected complete log-likelihood:

$$E\!\left[ l_c \right] = E\!\left[ \log p\!\left( \mathbf{x}, \mathbf{y} \mid \mathbf{u}, \theta \right) \right]$$
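To make the generative model concrete, here is a small simulation sketch (the function name, argument layout, and use of numpy's random generator are my choices):

```python
import numpy as np

def simulate_lds(A, B, C, Q, R, U, x0, seed=0):
    """Simulate x^(n+1) = A x^(n) + B u^(n) + eps_x and y^(n) = C x^(n) + eps_y.
    U: (N, input_dim) sequence of inputs; returns the state and output sequences."""
    rng = np.random.default_rng(seed)
    N = len(U)
    d, p = A.shape[0], C.shape[0]
    X = np.zeros((N + 1, d))
    Y = np.zeros((N, p))
    X[0] = x0
    for n in range(N):
        Y[n] = C @ X[n] + rng.multivariate_normal(np.zeros(p), R)
        X[n + 1] = A @ X[n] + B @ U[n] + rng.multivariate_normal(np.zeros(d), Q)
    return X, Y
```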
Two facts about conditional probabilities that we will use repeatedly:

$$p(a, b) = p(a \mid b)\, p(b), \qquad p(a, b \mid c) = p(a \mid b, c)\, p(b \mid c)$$
Applying these rules to the joint density of states and outputs, and using the conditional independencies of the model:

$$p\!\left(x^{(0)}, x^{(1)}, y^{(1)} \mid u^{(0)}\right)
= p\!\left(y^{(1)} \mid x^{(1)}, x^{(0)}, u^{(0)}\right) p\!\left(x^{(1)} \mid x^{(0)}, u^{(0)}\right) p\!\left(x^{(0)}\right)
= p\!\left(y^{(1)} \mid x^{(1)}\right) p\!\left(x^{(1)} \mid x^{(0)}, u^{(0)}\right) p\!\left(x^{(0)}\right)$$

$$p\!\left(x^{(0)}, x^{(1)}, x^{(2)}, y^{(1)}, y^{(2)} \mid u^{(0)}, u^{(1)}\right)
= p\!\left(y^{(2)} \mid x^{(2)}\right) p\!\left(x^{(2)} \mid x^{(1)}, u^{(1)}\right) p\!\left(y^{(1)} \mid x^{(1)}\right) p\!\left(x^{(1)} \mid x^{(0)}, u^{(0)}\right) p\!\left(x^{(0)}\right)$$

In general:

$$p\!\left(\mathbf{x}, \mathbf{y} \mid \mathbf{u}\right)
= p\!\left(x^{(0)}\right) \prod_{n=1}^{N} p\!\left(y^{(n)} \mid x^{(n)}\right) p\!\left(x^{(n)} \mid x^{(n-1)}, u^{(n-1)}\right)$$
The two conditional densities are Gaussians:

$$p\!\left(y^{(n)} \mid x^{(n)}\right) = N\!\left(y^{(n)};\, C x^{(n)},\, R\right)
= (2\pi)^{-p/2} \left|R\right|^{-1/2} \exp\!\left[ -\frac{1}{2} \left( y^{(n)} - C x^{(n)} \right)^T R^{-1} \left( y^{(n)} - C x^{(n)} \right) \right]$$

$$p\!\left(x^{(n+1)} \mid x^{(n)}, u^{(n)}\right) = N\!\left(x^{(n+1)};\, A x^{(n)} + B u^{(n)},\, Q\right)
= (2\pi)^{-k/2} \left|Q\right|^{-1/2} \exp\!\left[ -\frac{1}{2} \left( x^{(n+1)} - A x^{(n)} - B u^{(n)} \right)^T Q^{-1} \left( x^{(n+1)} - A x^{(n)} - B u^{(n)} \right) \right]$$

Taking the log of the joint density (with a Gaussian prior on the initial state, mean x̂_0 and covariance P_0):

$$l_c = \log p\!\left(\mathbf{x}, \mathbf{y} \mid \mathbf{u}\right)
= -\frac{1}{2} \sum_{n=1}^{N} \left( y^{(n)} - C x^{(n)} \right)^T R^{-1} \left( y^{(n)} - C x^{(n)} \right)
- \frac{1}{2} \sum_{n=1}^{N} \left( x^{(n)} - A x^{(n-1)} - B u^{(n-1)} \right)^T Q^{-1} \left( x^{(n)} - A x^{(n-1)} - B u^{(n-1)} \right)$$

$$\qquad - \frac{1}{2} \left( x^{(0)} - \hat{x}_0 \right)^T P_0^{-1} \left( x^{(0)} - \hat{x}_0 \right)
- \frac{N}{2} \log\left|R\right| - \frac{N}{2} \log\left|Q\right| - \frac{1}{2} \log\left|P_0\right| + \text{const}$$
Derivative with respect to C. Expanding the terms of l_c that depend on C:

$$\sum_{n=1}^{N} \left( y^{(n)} - C x^{(n)} \right)^T R^{-1} \left( y^{(n)} - C x^{(n)} \right)
= \sum_{n=1}^{N} \left[ y^{(n)T} R^{-1} y^{(n)} - 2\, y^{(n)T} R^{-1} C x^{(n)} + x^{(n)T} C^T R^{-1} C x^{(n)} \right]$$

Using the matrix derivatives

$$\frac{d}{dX} \left( a^T X b \right) = a\, b^T, \qquad
\frac{d}{dX} \left( a^T X^T M X a \right) = M X a\, a^T + M^T X a\, a^T,$$

we get

$$\frac{dl_c}{dC} = \sum_{n=1}^{N} \left[ R^{-1} y^{(n)} x^{(n)T} - R^{-1} C\, x^{(n)} x^{(n)T} \right]$$

Taking the expectation over the hidden states, using the posterior estimate of the state and its variance (x̂^(n) = E[x^(n) | y, u] and P^(n) = E[x^(n) x^(n)T | y, u], from the Kalman smoother), and setting it to zero:

$$E\!\left[ \frac{dl_c}{dC} \right] = \sum_{n=1}^{N} \left[ R^{-1} y^{(n)} \hat{x}^{(n)T} - R^{-1} C P^{(n)} \right] = 0
\qquad\Rightarrow\qquad
\hat{C}_{new} = \left( \sum_{n=1}^{N} y^{(n)} \hat{x}^{(n)T} \right) \left( \sum_{n=1}^{N} P^{(n)} \right)^{-1}$$
Derivative with respect to R (differentiating with respect to R^{-1}):

$$\frac{dl_c}{dR^{-1}} = \frac{N}{2} R - \frac{1}{2} \sum_{n=1}^{N} \left( y^{(n)} - C x^{(n)} \right) \left( y^{(n)} - C x^{(n)} \right)^T = 0$$

$$E\!\left[ \frac{dl_c}{dR^{-1}} \right] = \frac{N}{2} R - \frac{1}{2} \sum_{n=1}^{N} \left( y^{(n)} y^{(n)T} - C \hat{x}^{(n)} y^{(n)T} - y^{(n)} \hat{x}^{(n)T} C^T + C P^{(n)} C^T \right) = 0$$

$$\hat{R}_{new} = \frac{1}{N} \sum_{n=1}^{N} \left( y^{(n)} y^{(n)T} - C \hat{x}^{(n)} y^{(n)T} - y^{(n)} \hat{x}^{(n)T} C^T + C P^{(n)} C^T \right)$$

With C set to Ĉ_new, the sum of the C P^(n) C^T terms equals the sum of the y^(n) x̂^(n)T C^T terms, so the update reduces to

$$\hat{R}_{new} = \frac{1}{N} \sum_{n=1}^{N} \left( y^{(n)} y^{(n)T} - \hat{C}_{new} \hat{x}^{(n)} y^{(n)T} \right)$$
Derivative with respect to A. Only the state-transition terms of l_c depend on A:

$$l_c = -\frac{1}{2} \sum_{n=1}^{N} \left( x^{(n)} - A x^{(n-1)} - B u^{(n-1)} \right)^T Q^{-1} \left( x^{(n)} - A x^{(n-1)} - B u^{(n-1)} \right) + \cdots$$

$$\frac{dl_c}{dA} = \sum_{n=1}^{N} \left[ Q^{-1} x^{(n)} x^{(n-1)T} - Q^{-1} A\, x^{(n-1)} x^{(n-1)T} - Q^{-1} B\, u^{(n-1)} x^{(n-1)T} \right]$$

Taking the expectation over the hidden states, with P^{(n,n-1)} = E[x^(n) x^(n-1)T | y, u], and setting it to zero:

$$E\!\left[ \frac{dl_c}{dA} \right] = \sum_{n=1}^{N} \left[ Q^{-1} P^{(n,n-1)} - Q^{-1} A P^{(n-1)} - Q^{-1} B\, u^{(n-1)} \hat{x}^{(n-1)T} \right] = 0$$

$$\hat{A}_{new} = \left( \sum_{n=1}^{N} \left[ P^{(n,n-1)} - B\, u^{(n-1)} \hat{x}^{(n-1)T} \right] \right) \left( \sum_{n=1}^{N} P^{(n-1)} \right)^{-1}$$
Derivative with respect to B, using the same state-transition terms:

$$\frac{dl_c}{dB} = \sum_{n=1}^{N} \left[ Q^{-1} x^{(n)} u^{(n-1)T} - Q^{-1} A\, x^{(n-1)} u^{(n-1)T} - Q^{-1} B\, u^{(n-1)} u^{(n-1)T} \right]$$

$$E\!\left[ \frac{dl_c}{dB} \right] = \sum_{n=1}^{N} Q^{-1} \left[ \hat{x}^{(n)} u^{(n-1)T} - \hat{A}\, \hat{x}^{(n-1)} u^{(n-1)T} - B\, u^{(n-1)} u^{(n-1)T} \right] = 0$$

$$\hat{B}_{new} = \left( \sum_{n=1}^{N} \left[ \hat{x}^{(n)} - \hat{A}_{new}\, \hat{x}^{(n-1)} \right] u^{(n-1)T} \right) \left( \sum_{n=1}^{N} u^{(n-1)} u^{(n-1)T} \right)^{-1}$$
Derivative with respect to Q (differentiating with respect to Q^{-1}):

$$l_c = -\frac{N}{2} \log\left|Q\right|
- \frac{1}{2} \sum_{n=1}^{N} \left( x^{(n)} - A x^{(n-1)} - B u^{(n-1)} \right)^T Q^{-1} \left( x^{(n)} - A x^{(n-1)} - B u^{(n-1)} \right) + \cdots$$

$$\frac{dl_c}{dQ^{-1}} = \frac{N}{2} Q - \frac{1}{2} \sum_{n=1}^{N} \left( x^{(n)} - A x^{(n-1)} - B u^{(n-1)} \right) \left( x^{(n)} - A x^{(n-1)} - B u^{(n-1)} \right)^T = 0$$

Taking the expectation over the hidden states and substituting the new estimates Â_new and B̂_new (so that the cross terms vanish by the stationarity conditions for A and B):

$$\hat{Q}_{new} = \frac{1}{N} \sum_{n=1}^{N} \left( P^{(n)} - \hat{A}_{new} P^{(n-1,n)} - \hat{B}_{new}\, u^{(n-1)} \hat{x}^{(n)T} \right)$$

where P^{(n-1,n)} = E[x^{(n-1)} x^{(n)T} | y, u] = (P^{(n,n-1)})^T.
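Putting the five updates together, here is a sketch of the M step given the posterior state moments from the E step (Kalman smoother). The array layout, the helper name, and the use of the current B when updating A are assumptions of this sketch, not prescriptions from the slides:

```python
import numpy as np

def lds_m_step(Y, U, x_hat, P, P_lag, B_old):
    """M-step updates for C, R, A, B, Q from the expected complete log-likelihood.
    Y: (N, p) outputs y^(1..N); U: (N, q) inputs u^(0..N-1);
    x_hat: (N+1, d) posterior means E[x^(n)], n = 0..N;
    P: (N+1, d, d) second moments E[x^(n) x^(n)T];
    P_lag: (N, d, d) cross moments E[x^(n) x^(n-1)T], n = 1..N."""
    N = len(Y)
    x_obs = x_hat[1:]                                 # states paired with y^(1..N)
    # Observation model.
    C = (Y.T @ x_obs) @ np.linalg.inv(P[1:].sum(axis=0))
    R = (Y.T @ Y - C @ x_obs.T @ Y) / N
    # State-transition model (x^(n) regressed on x^(n-1) and u^(n-1)).
    S_lag = P_lag.sum(axis=0)                         # sum of E[x^(n) x^(n-1)T]
    S_prev = P[:-1].sum(axis=0)                       # sum of E[x^(n-1) x^(n-1)T]
    A = (S_lag - B_old @ U.T @ x_hat[:-1]) @ np.linalg.inv(S_prev)
    B = ((x_hat[1:] - x_hat[:-1] @ A.T).T @ U) @ np.linalg.inv(U.T @ U)
    # State noise covariance, using the new A and B.
    Q = (P[1:].sum(axis=0) - A @ S_lag.T - B @ U.T @ x_hat[1:]) / N
    return C, R, A, B, Q
```

In a full EM loop one would alternate this M step with an E step (Kalman filtering and smoothing) that recomputes x_hat, P, and P_lag under the current parameters.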