580.691 Learning Theory
Reza Shadmehr
EM and expected complete log-likelihood
Mixture of Experts
Identification of a linear dynamical system
The log likelihood of the unlabeled data
The graphical model has a hidden variable z (a three-state multinomial) and a measured variable x. The prior P(z) and the class-conditional densities p(x | z_i = 1, theta_i), i = 1, 2, 3, define the mixture:

$$p(x) = \sum_{i=1}^{3} P(z_i = 1)\, p(x \mid z_i = 1)$$

The unlabeled data:

$$D = \left\{ x^{(1)}, \ldots, x^{(N)} \right\}, \qquad
\theta = \left\{ P(z_1 = 1), \mu_1, \Sigma_1, \ldots, P(z_3 = 1), \mu_3, \Sigma_3 \right\}$$

$$p(D \mid \theta) = \prod_{n=1}^{N} \sum_{i=1}^{3} P(z_i = 1)\, N\!\left(x^{(n)}; \mu_i, \Sigma_i\right)$$

$$l(D \mid \theta) = \sum_{n=1}^{N} \log \sum_{i=1}^{3} P(z_i = 1)\, N\!\left(x^{(n)}; \mu_i, \Sigma_i\right)$$
In the last lecture we assumed that in the M step we knew the posterior probabilities, and found the derivative of the log-likelihood with respect to mu and sigma to maximize the log-likelihood. Today we take a more general approach in which both the E and M steps operate on a single quantity, the expected complete log-likelihood.
A more general formulation of EM: Expected complete log likelihood
The real data is not labeled. But for now, assume that someone labeled it, resulting in the “complete data”.
The complete data and its likelihood:

$$D_c = \left\{ \left(x^{(1)}, z^{(1)}\right), \ldots, \left(x^{(N)}, z^{(N)}\right) \right\}$$

$$p\!\left(x^{(n)}, z^{(n)}\right) = \prod_{i=1}^{3} \left[ P(z_i = 1)\, N\!\left(x^{(n)}; \mu_i, \Sigma_i\right) \right]^{z_i^{(n)}}$$

$$p(D_c \mid \theta) = \prod_{n=1}^{N} \prod_{i=1}^{3} \left[ P(z_i = 1)\, N\!\left(x^{(n)}; \mu_i, \Sigma_i\right) \right]^{z_i^{(n)}}$$

Complete log-likelihood:

$$l_c(D_c \mid \theta) = \sum_{n=1}^{N} \sum_{i=1}^{3} z_i^{(n)} \log\!\left[ P(z_i = 1)\, N\!\left(x^{(n)}; \mu_i, \Sigma_i\right) \right]$$

Expected complete log-likelihood:

$$E\!\left[ l_c(D_c \mid \theta) \right] = \sum_{n=1}^{N} \sum_{i=1}^{3} E\!\left[ z_i^{(n)} \right] \log\!\left[ P(z_i = 1)\, N\!\left(x^{(n)}; \mu_i, \Sigma_i\right) \right], \qquad
E\!\left[ z_i^{(n)} \right] = P\!\left(z_i = 1 \mid x^{(n)}\right)$$
In EM, in the E step we fix theta and try to maximize the expected complete log-likelihood by setting the expected value of the hidden variables z to the posterior probabilities.
In the M step, we fix the expected value of z and try to maximize the expected complete log-likelihood by adjusting the parameters theta.
$$p(x, z) = p(x \mid z)\, p(z)$$
A more general formulation of EM: Expected complete log likelihood
$$E\!\left[ l_c(D_c \mid \theta) \right] = \sum_{n=1}^{N} \sum_{i=1}^{3} P\!\left(z_i = 1 \mid x^{(n)}\right) \log\!\left[ P(z_i = 1)\, N\!\left(x^{(n)}; \mu_i, \Sigma_i\right) \right]
= \sum_{n=1}^{N} \sum_{i=1}^{3} P\!\left(z_i = 1 \mid x^{(n)}\right) \left[ \log P(z_i = 1) + \log N\!\left(x^{(n)}; \mu_i, \Sigma_i\right) \right]$$

Collecting the terms that depend on mu_i:

$$J(\mu_i) = \text{const} - \frac{1}{2} \sum_{n=1}^{N} P\!\left(z_i = 1 \mid x^{(n)}\right) \left( x^{(n)} - \mu_i \right)^T \Sigma_i^{-1} \left( x^{(n)} - \mu_i \right)$$

$$\frac{dJ}{d\mu_i} = \sum_{n=1}^{N} P\!\left(z_i = 1 \mid x^{(n)}\right) \Sigma_i^{-1} \left( x^{(n)} - \mu_i \right) = 0
\qquad\Rightarrow\qquad
\hat{\mu}_i = \frac{\sum_{n=1}^{N} P\!\left(z_i = 1 \mid x^{(n)}\right) x^{(n)}}{\sum_{n=1}^{N} P\!\left(z_i = 1 \mid x^{(n)}\right)}$$
Expected complete log-likelihood
In the M step, we fix the expected value of z and try to maximize the expected complete log-likelihood by adjusting the parameters theta.
$$N(x; \mu, \Sigma) = (2\pi)^{-d/2} \left|\Sigma\right|^{-1/2} \exp\!\left[ -\frac{1}{2} \left( x - \mu \right)^T \Sigma^{-1} \left( x - \mu \right) \right]$$

$$\log N(x; \mu, \Sigma) = -\frac{d}{2} \log 2\pi - \frac{1}{2} \log\left|\Sigma\right| - \frac{1}{2} \left( x - \mu \right)^T \Sigma^{-1} \left( x - \mu \right)$$
Collecting the terms of the expected complete log-likelihood that depend on Sigma_i:

$$J(\Sigma_i) = \text{const} + \sum_{n=1}^{N} P\!\left(z_i = 1 \mid x^{(n)}\right) \left[ \frac{1}{2} \log\left|\Sigma_i^{-1}\right| - \frac{1}{2} \left( x^{(n)} - \mu_i \right)^T \Sigma_i^{-1} \left( x^{(n)} - \mu_i \right) \right]$$

Differentiating with respect to Sigma_i^{-1} and setting the result to zero:

$$\frac{dJ}{d\Sigma_i^{-1}} = \sum_{n=1}^{N} P\!\left(z_i = 1 \mid x^{(n)}\right) \left[ \frac{1}{2} \Sigma_i - \frac{1}{2} \left( x^{(n)} - \mu_i \right) \left( x^{(n)} - \mu_i \right)^T \right] = 0$$

$$\hat{\Sigma}_i = \frac{\sum_{n=1}^{N} P\!\left(z_i = 1 \mid x^{(n)}\right) \left( x^{(n)} - \mu_i \right) \left( x^{(n)} - \mu_i \right)^T}{\sum_{n=1}^{N} P\!\left(z_i = 1 \mid x^{(n)}\right)}$$
Useful identities for differentiating with respect to Sigma_i^{-1}:

$$\log\left|\Sigma_i\right| = -\log\left|\Sigma_i^{-1}\right|, \qquad
\frac{d}{d\Sigma_i^{-1}} \log\left|\Sigma_i^{-1}\right| = \Sigma_i^{T}, \qquad
\frac{d}{d\Sigma_i^{-1}} \left( a^T \Sigma_i^{-1} a \right) = a\, a^T$$

Differentiating with respect to the inverse is enough, because the stationary points coincide (scalar version):

$$\frac{df(x)}{dx^{-1}} = \frac{df(x)}{dx} \frac{dx}{dx^{-1}} = -x^{2} \frac{df(x)}{dx};
\quad \text{if } \frac{df(x)}{dx^{-1}} = 0 \text{ at some } x \neq 0 \text{, then } \frac{df(x)}{dx} = 0 \text{ there as well.}$$
Writing pi_i = P(z_i = 1), the part of the expected complete log-likelihood that depends on the priors is the function to maximize:

$$E\!\left[ l_c(D_c \mid \theta) \right] = \sum_{n=1}^{N} \sum_{i=1}^{3} P\!\left(z_i = 1 \mid x^{(n)}\right) \left[ \log \pi_i + \log N\!\left(x^{(n)}; \mu_i, \Sigma_i\right) \right]$$

$$J(\pi) = \sum_{n=1}^{N} \sum_{i=1}^{3} P\!\left(z_i = 1 \mid x^{(n)}\right) \log \pi_i$$

subject to the constraint that the priors sum to one.
The value of pi_i that maximizes this function alone is one. But that is not the answer, because we also have a constraint: the sum of the priors should be one. So we want to maximize this function subject to the constraint that the pi_i sum to 1.
Constraint:

$$g(\pi) = \sum_{i=1}^{3} \pi_i - 1 = 0$$

Appending the constraint to the function to maximize with a Lagrange multiplier lambda:

$$J'(\pi, \lambda) = \sum_{n=1}^{N} \sum_{i=1}^{3} P\!\left(z_i = 1 \mid x^{(n)}\right) \log \pi_i - \lambda \left( \sum_{i=1}^{3} \pi_i - 1 \right)$$

$$\frac{dJ'}{d\pi_i} = \frac{1}{\pi_i} \sum_{n=1}^{N} P\!\left(z_i = 1 \mid x^{(n)}\right) - \lambda = 0
\qquad\Rightarrow\qquad
\pi_i = \frac{1}{\lambda} \sum_{n=1}^{N} P\!\left(z_i = 1 \mid x^{(n)}\right)$$

We have 3 such equations, one for each pi_i. If we add the equations together and use the constraint, we get:

$$\lambda = \sum_{i=1}^{3} \sum_{n=1}^{N} P\!\left(z_i = 1 \mid x^{(n)}\right) = N
\qquad\Rightarrow\qquad
\hat{\pi}_i = \frac{1}{N} \sum_{n=1}^{N} P\!\left(z_i = 1 \mid x^{(n)}\right)$$
EM algorithm: Summary
We begin with a guess about the mixture parameters:

$$\theta^{(k)} = \left\{ P(z_1 = 1)^{(k)}, \mu_1^{(k)}, \Sigma_1^{(k)}, \ldots, P(z_m = 1)^{(k)}, \mu_m^{(k)}, \Sigma_m^{(k)} \right\}$$

The "E" step: calculate the expected complete log-likelihood. In the mixture example, this reduces to just computing the posterior probabilities:

$$E\!\left[ l_c(D_c \mid \theta) \right] = \sum_{n=1}^{N} \sum_{i=1}^{m} E\!\left[ z_i^{(n)} \right] \log\!\left[ P(z_i = 1)\, N\!\left(x^{(n)}; \mu_i, \Sigma_i\right) \right]
= \sum_{n=1}^{N} \sum_{i=1}^{m} P\!\left(z_i = 1 \mid x^{(n)}, \theta^{(k)}\right) \log\!\left[ P(z_i = 1)\, N\!\left(x^{(n)}; \mu_i, \Sigma_i\right) \right]$$

The "M" step: maximize the expected complete log-likelihood with respect to the model parameters theta:

$$\theta^{(k+1)} = \arg\max_{\theta} E\!\left[ l_c(D_c \mid \theta) \right]$$
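As a concrete illustration of these updates, here is a minimal sketch of EM for a Gaussian mixture in Python (numpy and scipy assumed). The function name, the initialization scheme, and the fixed iteration count are choices of this sketch, not prescriptions from the slides:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gaussian_mixture(X, m=3, n_iters=100, seed=0):
    """EM for an m-component Gaussian mixture on unlabeled data X of shape (N, d)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Initial guess theta^(0): random data points as means, identity covariances, uniform priors.
    mu = X[rng.choice(N, size=m, replace=False)]
    Sigma = np.array([np.eye(d) for _ in range(m)])
    pi = np.full(m, 1.0 / m)

    for _ in range(n_iters):
        # E step: posterior probabilities P(z_i = 1 | x^(n)).
        lik = np.column_stack([
            pi[i] * multivariate_normal.pdf(X, mean=mu[i], cov=Sigma[i])
            for i in range(m)
        ])                                       # shape (N, m)
        post = lik / lik.sum(axis=1, keepdims=True)

        # M step: the updates for priors, means, and covariances derived above.
        Nk = post.sum(axis=0)                    # effective counts per component
        pi = Nk / N
        mu = (post.T @ X) / Nk[:, None]
        for i in range(m):
            diff = X - mu[i]
            Sigma[i] = (post[:, i, None] * diff).T @ diff / Nk[i]

    return pi, mu, Sigma, post
```

Each pass performs one E step (the posterior probabilities) and one M step (the prior, mean, and covariance updates derived above).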
Selecting number of mixture components
A simple idea that helps with selecting the number of mixture components is to form a cost that depends on both the log-likelihood of the data and the number of parameters used in the model. As the number of parameters increases, the log-likelihood increases. We want a cost that balances the improvement in log-likelihood against the cost of adding parameters.
A common technique is to choose the number of mixture components m that minimizes the "description length".
$$DL = -\log p\!\left( x^{(1)}, x^{(2)}, \ldots, x^{(N)} \mid \hat{\theta}_m \right) + \frac{d_m}{2} \log N$$

Here theta_hat_m is the maximum-likelihood estimate of the parameters for m mixture components, d_m is the effective number of parameters in the model, and N is the number of data points. We choose the m that minimizes the description length.
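A small sketch of how this criterion might be computed; the helper fit_mixture is hypothetical and stands for any routine that returns the maximized log-likelihood and the effective parameter count of an m-component fit:

```python
import numpy as np

def description_length(log_likelihood, n_params, n_points):
    """DL = -log p(D | theta_hat_m) + (d_m / 2) * log N; smaller is better."""
    return -log_likelihood + 0.5 * n_params * np.log(n_points)

# Hypothetical usage: pick the number of components with the smallest DL.
# best_m = min(range(1, 6),
#              key=lambda m: description_length(*fit_mixture(X, m), len(X)))
```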
[Figure: scatter plot of the data (x, y); the relationship looks piecewise linear, with one slope for x below 0 and another for x above 0.]

[Network diagram: the input x feeds Expert 1 (y = w_1^T x), Expert 2 (y = w_2^T x), and a Moderator that outputs the conditional probability of choosing each expert; the expert outputs are combined (+) according to those probabilities.]
The data set (x, y) is clearly non-linear, but we can break it up into two linear problems. We will try to switch from one "expert" to another at around x = 0.
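For concreteness, here is one way such a data set could be generated (a sketch only; the slopes, noise level, and sample count are assumptions, not values from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200
x = rng.uniform(-1, 5, size=N)
# Two linear regimes that switch near x = 0 (made-up example slopes).
w1, w2 = -3.0, 2.0
y = np.where(x < 0, w1 * x, w2 * x) + rng.normal(scale=0.5, size=N)
X = np.column_stack([x, np.ones(N)])   # include a bias term for the linear experts
```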
Mixture of Experts

$$\hat{y} = \sum_{i=1}^{2} P\!\left(z_i = 1 \mid x\right) w_i^T x
= P\!\left(z_1 = 1 \mid x\right) w_1^T x + P\!\left(z_2 = 1 \mid x\right) w_2^T x$$
[Figure: the moderator's probabilities P(z_i = 1 | x), for example P(z_2 = 1 | x), plotted against x; the values lie between 0 and 1.]
We have observed a sequence of data points (x, y), and believe that it was generated by the process shown below. Note that y depends on both x (which we can measure) and z, which is hidden from us.

[Graphical model: x points to z and to y, and z points to y. The moderator supplies p(z | x); conditioned on z_i = 1, the i-th expert supplies p(y | x, theta_i), for i = 1, 2, 3.]

The hidden variable z is modeled with a multinomial logit (soft-max):

$$\log \frac{P\!\left(z_i = 1 \mid x\right)}{P\!\left(z_k = 1 \mid x\right)} = \left( v_i - v_k \right)^T x, \qquad
P\!\left(z_i = 1 \mid x\right) = \frac{\exp\!\left(v_i^T x\right)}{\sum_{j} \exp\!\left(v_j^T x\right)}$$
For example, the dependence of y on x might be a simple linear model, but conditioned on z, where z is a multinomial.
The Moderator (gating network)
When there are only two experts, the moderator can be a logistic function:
$$P\!\left(z_1 = 1 \mid x\right) = \frac{1}{1 + \exp\!\left(-v^T x\right)}$$
When there are multiple experts, the moderator can be a soft-max function:
$$P\!\left(z_i = 1 \mid x\right) = \frac{\exp\!\left(v_i^T x\right)}{\sum_{j=1}^{m} \exp\!\left(v_j^T x\right)}$$
Based on our hypothesis, we should have the following distribution of observed data:

$$p(y \mid x) = \sum_{i=1}^{m} P\!\left(z_i = 1 \mid x\right) p\!\left(y \mid z_i = 1, x\right)$$

A key quantity is the posterior probability of the latent variable z:

$$P\!\left(z_i = 1 \mid x, y\right)
= \frac{P\!\left(z_i = 1 \mid x\right) p\!\left(y \mid z_i = 1, x\right)}{p(y \mid x)}
= \frac{P\!\left(z_i = 1 \mid x\right) p\!\left(y \mid z_i = 1, x\right)}{\sum_{j=1}^{m} P\!\left(z_j = 1 \mid x\right) p\!\left(y \mid z_j = 1, x\right)}$$
Note that the posterior probability for the i-th expert is updated based on how probable the observed data y was under that expert. In effect, the expression tells us how strongly, given the observed data y, we should assign it to expert i.
Posterior probability that the observed y “belongs” to the i-th expert.
The moderator has its own parameters (the gating weights v_i), and each expert has its own parameters (w_i and sigma_i):

$$p\!\left(y \mid x, z_i = 1\right) = N\!\left(y;\, w_i^T x,\, \sigma_i^2\right)$$

$$p(y \mid x) = \sum_{i=1}^{m} P\!\left(z_i = 1 \mid x\right) N\!\left(y;\, w_i^T x,\, \sigma_i^2\right)$$
Suppose there are two experts (m = 2). For a given value of x, the two regressions each give us a Gaussian distribution centered at that expert's prediction. Therefore, for each value of x we have a bimodal probability distribution for y: a mixture distribution in the output space y for each input value of x.
[Network diagram labels: parameters of the i-th expert; output of the moderator; output of the i-th expert; output of the whole network.]
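A small sketch of this mixture density in the output space, assuming an m-expert model with given (made-up) parameters; the function name and argument layout are mine:

```python
import numpy as np
from scipy.stats import norm

def predictive_density(y_grid, x, V, W, sigma):
    """p(y | x) = sum_i P(z_i = 1 | x) N(y; w_i^T x, sigma_i^2).
    V, W: (m, d) gating and expert weight matrices; sigma: (m,) expert std devs."""
    logits = V @ x
    gate = np.exp(logits - logits.max())
    gate /= gate.sum()                          # soft-max gating probabilities
    means = W @ x                               # each expert's prediction w_i^T x
    return sum(g * norm.pdf(y_grid, loc=mu, scale=s)
               for g, mu, s in zip(gate, means, sigma))
```

For two well-separated experts this density is bimodal at a given x, exactly as described above.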
$$D = \left\{ \left(x^{(1)}, y^{(1)}\right), \ldots, \left(x^{(N)}, y^{(N)}\right) \right\}$$

$$p(D \mid \theta) = \prod_{n=1}^{N} p\!\left(y^{(n)} \mid x^{(n)}\right)
= \prod_{n=1}^{N} \sum_{j=1}^{m} P\!\left(z_j = 1 \mid x^{(n)}\right) N\!\left(y^{(n)};\, w_j^T x^{(n)},\, \sigma_j^2\right)$$

$$l(D \mid \theta) = \sum_{n=1}^{N} \log \sum_{j=1}^{m} P\!\left(z_j = 1 \mid x^{(n)}\right) N\!\left(y^{(n)};\, w_j^T x^{(n)},\, \sigma_j^2\right)$$
The log-likelihood of the observed data.
The complete log-likelihood for the mixture of experts problem. The "completed" data:

$$D_c = \left\{ \left(x^{(1)}, y^{(1)}, z^{(1)}\right), \ldots, \left(x^{(N)}, y^{(N)}, z^{(N)}\right) \right\}$$

$$p\!\left(y^{(n)}, z^{(n)} \mid x^{(n)}\right)
= \prod_{i=1}^{m} \left[ P\!\left(z_i = 1 \mid x^{(n)}\right) p\!\left(y^{(n)} \mid z_i = 1, x^{(n)}\right) \right]^{z_i^{(n)}}
= \prod_{i=1}^{m} \left[ P\!\left(z_i = 1 \mid x^{(n)}\right) N\!\left(y^{(n)};\, w_i^T x^{(n)},\, \sigma_i^2\right) \right]^{z_i^{(n)}}$$

$$p(D_c \mid \theta) = \prod_{n=1}^{N} \prod_{i=1}^{m} \left[ P\!\left(z_i = 1 \mid x^{(n)}\right) N\!\left(y^{(n)};\, w_i^T x^{(n)},\, \sigma_i^2\right) \right]^{z_i^{(n)}}$$

Complete log-likelihood:

$$l_c(D_c \mid \theta) = \sum_{n=1}^{N} \sum_{i=1}^{m} z_i^{(n)} \log\!\left[ P\!\left(z_i = 1 \mid x^{(n)}\right) N\!\left(y^{(n)};\, w_i^T x^{(n)},\, \sigma_i^2\right) \right]$$

Expected complete log-likelihood (assuming that someone had given us theta):

$$E\!\left[ l_c(D_c \mid \theta) \right]
= \sum_{n=1}^{N} \sum_{i=1}^{m} E\!\left[ z_i^{(n)} \right] \log\!\left[ P\!\left(z_i = 1 \mid x^{(n)}\right) N\!\left(y^{(n)};\, w_i^T x^{(n)},\, \sigma_i^2\right) \right]
= \sum_{n=1}^{N} \sum_{i=1}^{m} P\!\left(z_i = 1 \mid x^{(n)}, y^{(n)}\right) \log\!\left[ P\!\left(z_i = 1 \mid x^{(n)}\right) N\!\left(y^{(n)};\, w_i^T x^{(n)},\, \sigma_i^2\right) \right]$$
In the E step, we begin by assuming that we have theta. To compute the expected complete log-likelihood, all we need are the posterior probabilities.
The E step for the mixture of experts problem:

$$P\!\left(z_i = 1 \mid x^{(n)}, y^{(n)}\right)
= \frac{P\!\left(z_i = 1 \mid x^{(n)}\right) p\!\left(y^{(n)} \mid z_i = 1, x^{(n)}\right)}{\sum_{j=1}^{m} P\!\left(z_j = 1 \mid x^{(n)}\right) p\!\left(y^{(n)} \mid z_j = 1, x^{(n)}\right)}$$

The posterior for each expert depends on the likelihood that the observed data y came from that expert.
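A minimal sketch of this E step (numpy and scipy assumed; the array shapes and names are my choices):

```python
import numpy as np
from scipy.stats import norm

def moe_e_step(X, y, V, W, sigma):
    """Posterior responsibilities h[n, i] = P(z_i = 1 | x^(n), y^(n)).
    X: (N, d) inputs; V, W: (m, d) gating and expert weights; sigma: (m,) std devs."""
    logits = X @ V.T
    gate = np.exp(logits - logits.max(axis=1, keepdims=True))
    gate /= gate.sum(axis=1, keepdims=True)                # P(z_i = 1 | x^(n)), soft-max
    lik = norm.pdf(y[:, None], loc=X @ W.T, scale=sigma)   # N(y^(n); w_i^T x^(n), sigma_i^2)
    h = gate * lik
    return h / h.sum(axis=1, keepdims=True)
```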
The M step for the mixture of experts problem: the moderator
Only the gating terms of the expected complete log-likelihood depend on the moderator's weights. Write the posterior probabilities as h_i^(n) = P(z_i = 1 | x^(n), y^(n)):

$$E\!\left[ l_c \right] = \sum_{n=1}^{N} \sum_{i=1}^{m} h_i^{(n)} \log\!\left[ P\!\left(z_i = 1 \mid x^{(n)}\right) N\!\left(y^{(n)};\, w_i^T x^{(n)},\, \sigma_i^2\right) \right]$$

With m = 2 the moderator is a logistic function:

$$\mu^{(n)} \equiv P\!\left(z_1 = 1 \mid x^{(n)}\right) = \frac{1}{1 + \exp\!\left(-v^T x^{(n)}\right)}, \qquad
P\!\left(z_2 = 1 \mid x^{(n)}\right) = 1 - \mu^{(n)}$$

Keeping only the terms that depend on v:

$$J(v) = \sum_{n=1}^{N} \left[ h_1^{(n)} \log \mu^{(n)} + h_2^{(n)} \log\!\left(1 - \mu^{(n)}\right) \right]
= \sum_{n=1}^{N} \left[ h_1^{(n)} \log \mu^{(n)} + \left(1 - h_1^{(n)}\right) \log\!\left(1 - \mu^{(n)}\right) \right]$$

Exactly the same as the IRLS cost function. We find first and second derivatives and find a learning rule:

$$\frac{dJ}{dv} = \sum_{n=1}^{N} \left( h_1^{(n)} - \mu^{(n)} \right) x^{(n)}, \qquad
\frac{d^2 J}{dv\, dv^T} = -\sum_{n=1}^{N} \mu^{(n)} \left( 1 - \mu^{(n)} \right) x^{(n)} x^{(n)T}$$

$$v^{(k+1)} = v^{(k)} - \left( \frac{d^2 J}{dv\, dv^T} \right)^{-1} \frac{dJ}{dv}$$

The moderator learns from the posterior probability.
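A sketch of this IRLS-style update for the two-expert moderator (the function name and the fixed number of Newton steps are my choices):

```python
import numpy as np

def moderator_m_step(X, h1, v, n_newton=10):
    """Update the gating weights v using the posteriors h1[n] = P(z_1 = 1 | x^(n), y^(n))."""
    for _ in range(n_newton):
        mu = 1.0 / (1.0 + np.exp(-X @ v))       # current gating probabilities
        grad = X.T @ (h1 - mu)                   # first derivative of J(v)
        W = mu * (1.0 - mu)
        hess = -(X.T * W) @ X                    # second derivative of J(v)
        v = v - np.linalg.solve(hess, grad)      # Newton / IRLS step
    return v
```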
The M step for the mixture of experts problem: weights of the expert
Only the Gaussian terms of expert i depend on its weights w_i:

$$J(w_i) = \sum_{n=1}^{N} h_i^{(n)} \log N\!\left(y^{(n)};\, w_i^T x^{(n)},\, \sigma_i^2\right)
= \text{const} - \frac{1}{2\sigma_i^2} \sum_{n=1}^{N} h_i^{(n)} \left( y^{(n)} - w_i^T x^{(n)} \right)^2$$

$$\frac{dJ}{dw_i} = \frac{1}{\sigma_i^2} \sum_{n=1}^{N} h_i^{(n)} \left( y^{(n)} - w_i^T x^{(n)} \right) x^{(n)}$$

The expert i learns from the observed data point y, weighted by the posterior probability that the error came from that expert:

$$w_i^{(k+1)} = w_i^{(k)} + \eta\, h_i^{(n)} \left( y^{(n)} - w_i^{(k)T} x^{(n)} \right) x^{(n)}$$

Setting the derivative to zero instead gives a weighted least-squares problem:

$$\hat{w}_i = \left( \sum_{n=1}^{N} h_i^{(n)}\, x^{(n)} x^{(n)T} \right)^{-1} \sum_{n=1}^{N} h_i^{(n)}\, y^{(n)} x^{(n)}$$
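The closed-form weighted least-squares update, as a short sketch (array names assumed):

```python
import numpy as np

def expert_m_step(X, y, h_i):
    """w_i = (X^T H X)^{-1} X^T H y with H = diag(h_i), the responsibilities of expert i."""
    XtH = X.T * h_i                      # scale each sample by its responsibility
    return np.linalg.solve(XtH @ X, XtH @ y)
```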
The M step for the mixture of experts problem: variance of the expert
Only the Gaussian terms of expert i depend on its variance sigma_i^2:

$$J(\sigma_i) = \sum_{n=1}^{N} h_i^{(n)} \log N\!\left(y^{(n)};\, w_i^T x^{(n)},\, \sigma_i^2\right)
= \text{const} + \sum_{n=1}^{N} h_i^{(n)} \left[ -\frac{1}{2} \log \sigma_i^2 - \frac{1}{2\sigma_i^2} \left( y^{(n)} - w_i^T x^{(n)} \right)^2 \right]$$

$$\frac{dJ}{d\sigma_i^2} = \sum_{n=1}^{N} h_i^{(n)} \left[ -\frac{1}{2\sigma_i^2} + \frac{1}{2\sigma_i^4} \left( y^{(n)} - w_i^T x^{(n)} \right)^2 \right] = 0$$

$$\sigma_i^{2\,(k+1)} = \frac{\sum_{n=1}^{N} h_i^{(n)} \left( y^{(n)} - w_i^T x^{(n)} \right)^2}{\sum_{n=1}^{N} h_i^{(n)}}$$
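And the corresponding sketch of the variance update (names assumed):

```python
import numpy as np

def expert_variance_m_step(X, y, h_i, w_i):
    """Responsibility-weighted variance of the i-th expert's prediction errors."""
    resid = y - X @ w_i
    return np.sum(h_i * resid**2) / np.sum(h_i)
```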
Parameter Estimation for Linear Dynamical Systems using EM
Objective: to find the parameters A, B, C, Q, and R of a linear dynamical system from a set of data that includes inputs u and outputs y. The hidden state x evolves and generates the measurements:

$$x^{(n+1)} = A x^{(n)} + B u^{(n)} + \varepsilon_x, \qquad \varepsilon_x \sim N(0, Q)$$

$$y^{(n)} = C x^{(n)} + \varepsilon_y, \qquad \varepsilon_y \sim N(0, R)$$

$$p\!\left(y^{(n)} \mid x^{(n)}\right) = N\!\left(y^{(n)};\, C x^{(n)},\, R\right), \qquad
p\!\left(x^{(n+1)} \mid x^{(n)}, u^{(n)}\right) = N\!\left(x^{(n+1)};\, A x^{(n)} + B u^{(n)},\, Q\right)$$

The observed data are the outputs (together with the known inputs); the states are hidden, so the completed data include both:

$$D = \left\{ y^{(1)}, y^{(2)}, \ldots, y^{(N)} \right\}, \qquad
D_c = \left\{ x^{(0)}, \ldots, x^{(N)},\; y^{(1)}, \ldots, y^{(N)} \right\}, \qquad
\theta = \left\{ A, B, C, Q, R \right\}$$

We need to find the expected complete log-likelihood:

$$E\!\left[ l_c \right] = E\!\left[ \log p\!\left( \mathbf{x}, \mathbf{y} \mid \mathbf{u}, \theta \right) \right]$$
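To make the generative model concrete, here is a small simulation sketch (the function name, argument layout, and use of numpy's random generator are my choices):

```python
import numpy as np

def simulate_lds(A, B, C, Q, R, U, x0, seed=0):
    """Simulate x^(n+1) = A x^(n) + B u^(n) + eps_x and y^(n) = C x^(n) + eps_y.
    U: (N, input_dim) sequence of inputs; returns the state and output sequences."""
    rng = np.random.default_rng(seed)
    N = len(U)
    d, p = A.shape[0], C.shape[0]
    X = np.zeros((N + 1, d))
    Y = np.zeros((N, p))
    X[0] = x0
    for n in range(N):
        Y[n] = C @ X[n] + rng.multivariate_normal(np.zeros(p), R)
        X[n + 1] = A @ X[n] + B @ U[n] + rng.multivariate_normal(np.zeros(d), Q)
    return X, Y
```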
Two facts about conditional probabilities that we will use repeatedly:

$$p(a, b) = p(a \mid b)\, p(b), \qquad p(a, b \mid c) = p(a \mid b, c)\, p(b \mid c)$$
Applying these rules to the joint density of states and outputs, and using the conditional independencies of the model:

$$p\!\left(x^{(0)}, x^{(1)}, y^{(1)} \mid u^{(0)}\right)
= p\!\left(y^{(1)} \mid x^{(1)}, x^{(0)}, u^{(0)}\right) p\!\left(x^{(1)} \mid x^{(0)}, u^{(0)}\right) p\!\left(x^{(0)}\right)
= p\!\left(y^{(1)} \mid x^{(1)}\right) p\!\left(x^{(1)} \mid x^{(0)}, u^{(0)}\right) p\!\left(x^{(0)}\right)$$

$$p\!\left(x^{(0)}, x^{(1)}, x^{(2)}, y^{(1)}, y^{(2)} \mid u^{(0)}, u^{(1)}\right)
= p\!\left(y^{(2)} \mid x^{(2)}\right) p\!\left(x^{(2)} \mid x^{(1)}, u^{(1)}\right) p\!\left(y^{(1)} \mid x^{(1)}\right) p\!\left(x^{(1)} \mid x^{(0)}, u^{(0)}\right) p\!\left(x^{(0)}\right)$$

In general:

$$p\!\left(\mathbf{x}, \mathbf{y} \mid \mathbf{u}\right)
= p\!\left(x^{(0)}\right) \prod_{n=1}^{N} p\!\left(y^{(n)} \mid x^{(n)}\right) p\!\left(x^{(n)} \mid x^{(n-1)}, u^{(n-1)}\right)$$
The two conditional densities are Gaussians:

$$p\!\left(y^{(n)} \mid x^{(n)}\right) = N\!\left(y^{(n)};\, C x^{(n)},\, R\right)
= (2\pi)^{-p/2} \left|R\right|^{-1/2} \exp\!\left[ -\frac{1}{2} \left( y^{(n)} - C x^{(n)} \right)^T R^{-1} \left( y^{(n)} - C x^{(n)} \right) \right]$$

$$p\!\left(x^{(n+1)} \mid x^{(n)}, u^{(n)}\right) = N\!\left(x^{(n+1)};\, A x^{(n)} + B u^{(n)},\, Q\right)
= (2\pi)^{-k/2} \left|Q\right|^{-1/2} \exp\!\left[ -\frac{1}{2} \left( x^{(n+1)} - A x^{(n)} - B u^{(n)} \right)^T Q^{-1} \left( x^{(n+1)} - A x^{(n)} - B u^{(n)} \right) \right]$$

Taking the log of the joint density (with a Gaussian prior on the initial state, mean x̂_0 and covariance P_0):

$$l_c = \log p\!\left(\mathbf{x}, \mathbf{y} \mid \mathbf{u}\right)
= -\frac{1}{2} \sum_{n=1}^{N} \left( y^{(n)} - C x^{(n)} \right)^T R^{-1} \left( y^{(n)} - C x^{(n)} \right)
- \frac{1}{2} \sum_{n=1}^{N} \left( x^{(n)} - A x^{(n-1)} - B u^{(n-1)} \right)^T Q^{-1} \left( x^{(n)} - A x^{(n-1)} - B u^{(n-1)} \right)$$

$$\qquad - \frac{1}{2} \left( x^{(0)} - \hat{x}_0 \right)^T P_0^{-1} \left( x^{(0)} - \hat{x}_0 \right)
- \frac{N}{2} \log\left|R\right| - \frac{N}{2} \log\left|Q\right| - \frac{1}{2} \log\left|P_0\right| + \text{const}$$
Derivative with respect to C. Expanding the terms of l_c that depend on C:

$$\sum_{n=1}^{N} \left( y^{(n)} - C x^{(n)} \right)^T R^{-1} \left( y^{(n)} - C x^{(n)} \right)
= \sum_{n=1}^{N} \left[ y^{(n)T} R^{-1} y^{(n)} - 2\, y^{(n)T} R^{-1} C x^{(n)} + x^{(n)T} C^T R^{-1} C x^{(n)} \right]$$

Using the matrix derivatives

$$\frac{d}{dX} \left( a^T X b \right) = a\, b^T, \qquad
\frac{d}{dX} \left( a^T X^T M X a \right) = M X a\, a^T + M^T X a\, a^T,$$

we get

$$\frac{dl_c}{dC} = \sum_{n=1}^{N} \left[ R^{-1} y^{(n)} x^{(n)T} - R^{-1} C\, x^{(n)} x^{(n)T} \right]$$

Taking the expectation over the hidden states, using the posterior estimate of the state and its variance (x̂^(n) = E[x^(n) | y, u] and P^(n) = E[x^(n) x^(n)T | y, u], from the Kalman smoother), and setting it to zero:

$$E\!\left[ \frac{dl_c}{dC} \right] = \sum_{n=1}^{N} \left[ R^{-1} y^{(n)} \hat{x}^{(n)T} - R^{-1} C P^{(n)} \right] = 0
\qquad\Rightarrow\qquad
\hat{C}_{new} = \left( \sum_{n=1}^{N} y^{(n)} \hat{x}^{(n)T} \right) \left( \sum_{n=1}^{N} P^{(n)} \right)^{-1}$$
Derivative with respect to R (differentiating with respect to R^{-1}):

$$\frac{dl_c}{dR^{-1}} = \frac{N}{2} R - \frac{1}{2} \sum_{n=1}^{N} \left( y^{(n)} - C x^{(n)} \right) \left( y^{(n)} - C x^{(n)} \right)^T = 0$$

$$E\!\left[ \frac{dl_c}{dR^{-1}} \right] = \frac{N}{2} R - \frac{1}{2} \sum_{n=1}^{N} \left( y^{(n)} y^{(n)T} - C \hat{x}^{(n)} y^{(n)T} - y^{(n)} \hat{x}^{(n)T} C^T + C P^{(n)} C^T \right) = 0$$

$$\hat{R}_{new} = \frac{1}{N} \sum_{n=1}^{N} \left( y^{(n)} y^{(n)T} - C \hat{x}^{(n)} y^{(n)T} - y^{(n)} \hat{x}^{(n)T} C^T + C P^{(n)} C^T \right)$$

With C set to Ĉ_new, the sum of the C P^(n) C^T terms equals the sum of the y^(n) x̂^(n)T C^T terms, so the update reduces to

$$\hat{R}_{new} = \frac{1}{N} \sum_{n=1}^{N} \left( y^{(n)} y^{(n)T} - \hat{C}_{new} \hat{x}^{(n)} y^{(n)T} \right)$$
Derivative with respect to A. Only the state-transition terms of l_c depend on A:

$$l_c = -\frac{1}{2} \sum_{n=1}^{N} \left( x^{(n)} - A x^{(n-1)} - B u^{(n-1)} \right)^T Q^{-1} \left( x^{(n)} - A x^{(n-1)} - B u^{(n-1)} \right) + \cdots$$

$$\frac{dl_c}{dA} = \sum_{n=1}^{N} \left[ Q^{-1} x^{(n)} x^{(n-1)T} - Q^{-1} A\, x^{(n-1)} x^{(n-1)T} - Q^{-1} B\, u^{(n-1)} x^{(n-1)T} \right]$$

Taking the expectation over the hidden states, with P^{(n,n-1)} = E[x^(n) x^(n-1)T | y, u], and setting it to zero:

$$E\!\left[ \frac{dl_c}{dA} \right] = \sum_{n=1}^{N} \left[ Q^{-1} P^{(n,n-1)} - Q^{-1} A P^{(n-1)} - Q^{-1} B\, u^{(n-1)} \hat{x}^{(n-1)T} \right] = 0$$

$$\hat{A}_{new} = \left( \sum_{n=1}^{N} \left[ P^{(n,n-1)} - B\, u^{(n-1)} \hat{x}^{(n-1)T} \right] \right) \left( \sum_{n=1}^{N} P^{(n-1)} \right)^{-1}$$
Derivative with respect to B, using the same state-transition terms:

$$\frac{dl_c}{dB} = \sum_{n=1}^{N} \left[ Q^{-1} x^{(n)} u^{(n-1)T} - Q^{-1} A\, x^{(n-1)} u^{(n-1)T} - Q^{-1} B\, u^{(n-1)} u^{(n-1)T} \right]$$

$$E\!\left[ \frac{dl_c}{dB} \right] = \sum_{n=1}^{N} Q^{-1} \left[ \hat{x}^{(n)} u^{(n-1)T} - \hat{A}\, \hat{x}^{(n-1)} u^{(n-1)T} - B\, u^{(n-1)} u^{(n-1)T} \right] = 0$$

$$\hat{B}_{new} = \left( \sum_{n=1}^{N} \left[ \hat{x}^{(n)} - \hat{A}_{new}\, \hat{x}^{(n-1)} \right] u^{(n-1)T} \right) \left( \sum_{n=1}^{N} u^{(n-1)} u^{(n-1)T} \right)^{-1}$$
Derivative with respect to Q (differentiating with respect to Q^{-1}):

$$l_c = -\frac{N}{2} \log\left|Q\right|
- \frac{1}{2} \sum_{n=1}^{N} \left( x^{(n)} - A x^{(n-1)} - B u^{(n-1)} \right)^T Q^{-1} \left( x^{(n)} - A x^{(n-1)} - B u^{(n-1)} \right) + \cdots$$

$$\frac{dl_c}{dQ^{-1}} = \frac{N}{2} Q - \frac{1}{2} \sum_{n=1}^{N} \left( x^{(n)} - A x^{(n-1)} - B u^{(n-1)} \right) \left( x^{(n)} - A x^{(n-1)} - B u^{(n-1)} \right)^T = 0$$

Taking the expectation over the hidden states and substituting the new estimates Â_new and B̂_new (so that the cross terms vanish by the stationarity conditions for A and B):

$$\hat{Q}_{new} = \frac{1}{N} \sum_{n=1}^{N} \left( P^{(n)} - \hat{A}_{new} P^{(n-1,n)} - \hat{B}_{new}\, u^{(n-1)} \hat{x}^{(n)T} \right)$$

where P^{(n-1,n)} = E[x^{(n-1)} x^{(n)T} | y, u] = (P^{(n,n-1)})^T.
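Putting the five updates together, here is a sketch of the M step given the posterior state moments from the E step (Kalman smoother). The array layout, the helper name, and the use of the current B when updating A are assumptions of this sketch, not prescriptions from the slides:

```python
import numpy as np

def lds_m_step(Y, U, x_hat, P, P_lag, B_old):
    """M-step updates for C, R, A, B, Q from the expected complete log-likelihood.
    Y: (N, p) outputs y^(1..N); U: (N, q) inputs u^(0..N-1);
    x_hat: (N+1, d) posterior means E[x^(n)], n = 0..N;
    P: (N+1, d, d) second moments E[x^(n) x^(n)T];
    P_lag: (N, d, d) cross moments E[x^(n) x^(n-1)T], n = 1..N."""
    N = len(Y)
    x_obs = x_hat[1:]                                 # states paired with y^(1..N)
    # Observation model.
    C = (Y.T @ x_obs) @ np.linalg.inv(P[1:].sum(axis=0))
    R = (Y.T @ Y - C @ x_obs.T @ Y) / N
    # State-transition model (x^(n) regressed on x^(n-1) and u^(n-1)).
    S_lag = P_lag.sum(axis=0)                         # sum of E[x^(n) x^(n-1)T]
    S_prev = P[:-1].sum(axis=0)                       # sum of E[x^(n-1) x^(n-1)T]
    A = (S_lag - B_old @ U.T @ x_hat[:-1]) @ np.linalg.inv(S_prev)
    B = ((x_hat[1:] - x_hat[:-1] @ A.T).T @ U) @ np.linalg.inv(U.T @ U)
    # State noise covariance, using the new A and B.
    Q = (P[1:].sum(axis=0) - A @ S_lag.T - B @ U.T @ x_hat[1:]) / N
    return C, R, A, B, Q
```

In a full EM loop one would alternate this M step with an E step (Kalman filtering and smoothing) that recomputes x_hat, P, and P_lag under the current parameters.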