Source: rshamir/algmb/presentations/em-bw-ron-16.pdf
TRANSCRIPT
CG 1
Expectation-Maximization & Baum-Welch
Slides: Roded Sharan, Jan 15; revised by Ron Shamir, Nov 15
• Input: incomplete data originating from a probability distribution with some unknown parameters
• Want to find the parameter values that maximize the likelihood
• EM – an approach that helps when the maximum likelihood solution cannot be computed directly.
• Seeks a local maximum by iteratively solving two easier subproblems
The goal
2
• Coins A, B with unknown heads probabilities θ_A, θ_B
• Goal: estimate θ = (θ_A, θ_B)
• Experiment: repeat 5 times: choose A or B with prob. 1/2, flip the chosen coin 10 times, record the results.
Coin flipping: complete data
3
x = (x1,…,x5): number of heads in sets 1,…,5
y = (y1,…,y5): coin used in sets 1,…,5
Do & Batzoglou, NBT 08
• Natural guess: θ_i = fraction of heads in the flips of coin i
• This is actually the ML solution: it maximizes P(x,y|θ) (ex.)
• What if we do not know which coin was used in each round?
Coin flipping: complete data
4
• Now y = (y1,…,y5) are hidden / latent variables.
• Cannot compute the heads probability for each coin
• If we guessed y correctly – we could.
• Idea: guess initial θ_A^0, θ_B^0
– Use θ_A^t, θ_B^t to compute the most likely coin for each set, obtaining a new y
– Use the resulting y to recompute θ_A^{t+1}, θ_B^{t+1} using ML
– Repeat till convergence
• EM: use probabilities rather than the single most likely completion y
Coin flipping: incomplete data
5
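The iteration above (in its "soft" EM form, weighting each set by its posterior probability) can be sketched in Python. The heads counts and starting values follow the worked example in Do & Batzoglou (NBT 08); the function name and iteration count are ours:

```python
import numpy as np

def coin_em(counts, n_flips, theta_a, theta_b, n_iters=50):
    """EM for the two-coin model: counts[i] = number of heads in set i."""
    counts = np.asarray(counts, dtype=float)
    for _ in range(n_iters):
        # E-step: posterior that each set came from coin A
        # (the 1/2 prior on the coin choice cancels in the ratio)
        like_a = theta_a**counts * (1 - theta_a)**(n_flips - counts)
        like_b = theta_b**counts * (1 - theta_b)**(n_flips - counts)
        w_a = like_a / (like_a + like_b)      # P(y_i = A | x_i, theta^t)
        w_b = 1.0 - w_a
        # M-step: weighted ML estimates = expected heads / expected flips
        theta_a = (w_a @ counts) / (w_a.sum() * n_flips)
        theta_b = (w_b @ counts) / (w_b.sum() * n_flips)
    return theta_a, theta_b

# The 5 sets of 10 flips from Do & Batzoglou: heads counts 5, 9, 8, 4, 7
theta_a, theta_b = coin_em([5, 9, 8, 4, 7], 10, 0.6, 0.5)
```

Starting from θ_A = 0.6, θ_B = 0.5, the estimates converge near (0.80, 0.52), as in the paper's figure.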
The probabilistic setting
Input: data x coming from a probabilistic model with hidden information y
Goal: learn the model's parameters θ so that the likelihood P(x|θ) is maximized.
Mixture of two Gaussians
P(y_i = 1) = p ;  P(y_i = 2) = 1 − p
P(x_i | y_i = j) = (1/√(2πσ²)) · exp( −(x_i − μ_j)² / (2σ²) )
Kalai et al., Disentangling Gaussians, CACM 2012
Our input generates the black distribution. We want to color each sample red/blue and find the parameters θ = (μ1, μ2, p) of the two distributions that maximize the data probability (assume σ is known).
The likelihood function
P(y_i = 1) = p1 ;  P(y_i = 2) = p2 = 1 − p1
P(x_i | y_i = j) = (1/√(2πσ²)) · exp( −(x_i − μ_j)² / (2σ²) )
L(θ) = Π_i P(x_i | θ) = Π_i Σ_j P(x_i, y_i = j | θ)
log L(θ) = Σ_i log Σ_j (p_j / √(2πσ²)) · exp( −(x_i − μ_j)² / (2σ²) )
To be continued…
KL divergence
Def: The Kullback-Leibler divergence (aka relative entropy) of discrete probability distributions P and Q:
D_KL(P || Q) = Σ_i P(x_i) · log( P(x_i) / Q(x_i) )
(sum over x s.t. P(x) > 0; require Q(x) = 0 ⇒ P(x) = 0; define 0·log 0 = 0)
Lemma: the KL divergence is nonnegative, with equality iff P ≡ Q.
Proof: log(x) ≤ x − 1 for all x > 0, with equality iff x = 1. Hence
−D_KL(P || Q) = Σ_i P(x_i) · log( Q(x_i) / P(x_i) ) ≤ Σ_i P(x_i) · ( Q(x_i)/P(x_i) − 1 ) = Σ_i Q(x_i) − Σ_i P(x_i) ≤ 0
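The lemma is easy to check numerically; a small sketch (function and variable names ours):

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i P(x_i) log(P(x_i)/Q(x_i)), summing only where P > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Nonnegativity holds for every pair of random distributions ...
rng = np.random.default_rng(0)
for _ in range(1000):
    p = rng.random(5); p /= p.sum()
    q = rng.random(5); q /= q.sum()
    assert kl_divergence(p, q) >= 0
# ... with equality when P == Q
assert kl_divergence([0.5, 0.5], [0.5, 0.5]) == 0.0
```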
The EM algorithm (i)
Goal: max_θ log P(x|θ) = log Σ_y P(x,y|θ)
Strategy: guess an initial θ and iteratively adjust it, making sure that the likelihood always improves. Assume we have a model θ^t that we wish to improve to a new value θ.
Bayes rule: P(x|θ) = P(x,y|θ) / P(y|x,θ)
Take logs, multiply both sides by P(y|x,θ^t) and sum over y (using Σ_y P(y|x,θ^t) = 1):
log P(x|θ) = Σ_y P(y|x,θ^t) · log P(x,y|θ) − Σ_y P(y|x,θ^t) · log P(y|x,θ)
The EM algorithm (ii)
log P(x|θ) = Σ_y P(y|x,θ^t) · log P(x,y|θ) − Σ_y P(y|x,θ^t) · log P(y|x,θ)
log P(x|θ^t) = Σ_y P(y|x,θ^t) · log P(x,y|θ^t) − Σ_y P(y|x,θ^t) · log P(y|x,θ^t)
Want P(x|θ) ≥ P(x|θ^t).
Define Q(θ|θ^t) = Σ_y P(y|x,θ^t) · log P(x,y|θ)
Subtracting the two lines above:
log P(x|θ) − log P(x|θ^t) = Q(θ|θ^t) − Q(θ^t|θ^t) + Σ_y P(y|x,θ^t) · log( P(y|x,θ^t) / P(y|x,θ) )
The last term is a KL divergence, hence ≥ 0. So any θ with Q(θ|θ^t) ≥ Q(θ^t|θ^t) guarantees P(x|θ) ≥ P(x|θ^t).
The EM algorithm (iii)
Main component:
Q(θ|θ^t) = Σ_y P(y|x,θ^t) · log P(x,y|θ)
log P(x,y|θ) is called the complete log likelihood function; Q is the expectation of the complete log likelihood over the distribution of y given the current parameters θ^t.
The algorithm: repeat
• E-step: calculate the Q function Q(θ|θ^t)
• M-step: maximize Q(θ|θ^t) with respect to θ
• Stopping criterion: improvement in log likelihood ≤ ε
Note: a local optimum is guaranteed to be reached, not a global one. The starting point matters! Try many.
14
Back to the Gaussian mixture model
Q(θ|θ^t) = Σ_y P(y|x,θ^t) · log P(x,y|θ)
Introduce indicators: y_ij = 1 if y_i = j, 0 otherwise. Then:
P(x,y|θ) = Π_i P(x_i, y_i|θ) = Π_i Π_j P(x_i, y_i = j|θ)^{y_ij}
log P(x,y|θ) = Σ_i Σ_j y_ij · log P(x_i, y_i = j|θ)
Q(θ|θ^t) = Σ_y P(y|x,θ^t) Σ_i Σ_j y_ij · log P(x_i, y_i = j|θ)
         = Σ_i Σ_j [ Σ_y P(y|x,θ^t) · y_ij ] · log P(x_i, y_i = j|θ)
15
Application (cont.)
Q(θ|θ^t) = Σ_i Σ_j P(y_ij = 1 | x, θ^t) · log P(x_i, y_i = j|θ)
w_ij^t := P(y_ij = 1 | x, θ^t) = P(x_i, y_i = j|θ^t) / Σ_j' P(x_i, y_i = j'|θ^t)
Q(θ|θ^t) = Σ_i Σ_j w_ij^t · [ log p_j − log √(2π) − log σ − (x_i − μ_j)² / (2σ²) ]
Now write the derivatives and equate to zero to get the optimal parameters θ^{t+1} = (μ1^{t+1}, μ2^{t+1}, p1^{t+1})
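Carrying out those derivatives gives w-weighted means for μ_j and the average weight for p1. A minimal sketch of the resulting EM loop for the known-σ model (data, names, and iteration count ours):

```python
import numpy as np

def gmm_em(x, mu1, mu2, p1, sigma=1.0, n_iters=100):
    """EM for a two-component Gaussian mixture with known, shared sigma."""
    x = np.asarray(x, float)
    for _ in range(n_iters):
        # E-step: w_i1 = P(y_i = 1 | x_i, theta^t)
        # (the common 1/sqrt(2*pi*sigma^2) factor cancels in the ratio)
        g1 = p1 * np.exp(-(x - mu1)**2 / (2 * sigma**2))
        g2 = (1 - p1) * np.exp(-(x - mu2)**2 / (2 * sigma**2))
        w1 = g1 / (g1 + g2)
        # M-step: maximize Q -> weighted means and mixing weight
        mu1 = (w1 * x).sum() / w1.sum()
        mu2 = ((1 - w1) * x).sum() / (1 - w1).sum()
        p1 = w1.mean()
    return mu1, mu2, p1

# Illustrative data: 30% of samples around -2, 70% around +3
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 700)])
mu1, mu2, p1 = gmm_em(x, -1.0, 1.0, 0.5)
```

With well-separated components, the estimates land close to the generating values (μ1 ≈ −2, μ2 ≈ 3, p1 ≈ 0.3).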
17
Reminder: HMM
path π = (π1,…,πM)
Given sequence X = (x1,…,xM):
• a_kl = P(π_i = l | π_{i−1} = k), Markovian transition probability
• e_k(b) = P(x_i = b | π_i = k), emission probability
Hidden states π_i; observed output symbols x_i
Model = (Σ, Q, Θ): alphabet Σ, states Q, transition/emission probabilities Θ
P(X,π) = a_{0,π1} · Π_{i=1…M} e_{πi}(x_i) · a_{πi,πi+1}
Goal: find the path π* maximizing P(X,π)
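As a sanity check on the definition of P(X,π), summing it over all paths by brute force must equal P(X) from the forward algorithm. A small sketch with a toy 2-state model (all numbers invented for illustration):

```python
import numpy as np

# Toy 2-state HMM over alphabet {0, 1}; parameter values are illustrative only
a0 = np.array([0.5, 0.5])                  # initial probabilities a_{0,k}
a  = np.array([[0.9, 0.1], [0.2, 0.8]])   # transition probabilities a_kl
e  = np.array([[0.7, 0.3], [0.1, 0.9]])   # emission probabilities e_k(b)

def joint_prob(x, pi):
    """P(X, pi) = a_{0,pi_1} * e_{pi_1}(x_1) * prod_{i>1} a_{pi_{i-1},pi_i} * e_{pi_i}(x_i)."""
    p = a0[pi[0]] * e[pi[0], x[0]]
    for i in range(1, len(x)):
        p *= a[pi[i-1], pi[i]] * e[pi[i], x[i]]
    return p

def forward_prob(x):
    """P(X) by the forward algorithm: f_l(i) = e_l(x_i) * sum_k f_k(i-1) * a_kl."""
    f = a0 * e[:, x[0]]
    for b in x[1:]:
        f = (f @ a) * e[:, b]
    return f.sum()

x = [0, 0, 1]
# Sum P(X, pi) over all 2^3 paths
total = sum(joint_prob(x, (i, j, k))
            for i in range(2) for j in range(2) for k in range(2))
```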
Max likelihood in HMM
• y = π, θ = (a_kl, e_k(b))
The log likelihood is:
log P(x|θ) = log Σ_π P(x,π|θ)
And the Q function is:
Q(θ|θ^t) = Σ_π P(π|x,θ^t) · log P(x,π|θ)
18
Computing Q
P(x,π|θ) = Π_{k=1…M} Π_b [e_k(b)]^{E_k(π,b)} · Π_{k=1…M} Π_{l=1…M} [a_kl]^{A_kl(π)}
where:
• e_k(b): emission probability of character b in state k
• a_kl: transition probability from state k to state l
• E_k(π,b): number of times we saw b emitted from state k in path π
• A_kl(π): number of k→l transitions in path π
Number of transitions from k to l in path π
Computing Q (ii)
Q(θ|θ^t) = Σ_π P(π|x,θ^t) · [ Σ_k Σ_b E_k(π,b) · log e_k(b) + Σ_k Σ_l A_kl(π) · log a_kl ]
         = Σ_k Σ_b E_k(b) · log e_k(b) + Σ_k Σ_l A_kl · log a_kl
where
A_kl = Σ_π P(π|x,θ^t) · A_kl(π)
E_k(b) = Σ_π P(π|x,θ^t) · E_k(π,b)
(value × probability = expectation: A_kl and E_k(b) are the expected counts)
Computing Q (iii)
• So we want to find a set of parameters θ^{t+1} that maximizes:
Σ_k Σ_b E_k(b) · log e_k(b) + Σ_k Σ_l A_kl · log a_kl
• For maximization, select:
a_kl = A_kl / Σ_{l'} A_{kl'} ,  e_k(b) = E_k(b) / Σ_{b'} E_k(b')
• E_k(b), A_kl can be computed using forward/backward:
f_k(i) = P(x_1,…,x_i, π_i = k)
b_k(i) = P(x_{i+1},…,x_M | π_i = k)
P(π_i = k, π_{i+1} = l | x, θ^t) = [1/P(x)] · f_k(i) · a_kl · e_l(x_{i+1}) · b_l(i+1)
A_kl = [1/P(x)] · Σ_i f_k(i) · a_kl · e_l(x_{i+1}) · b_l(i+1)
similarly, E_k(b) = [1/P(x)] · Σ_{i: x_i = b} f_k(i) · b_k(i)
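A compact sketch of these forward/backward computations of the expected counts (toy parameters invented; indices 0-based in code). A quick consistency check: the expected transition counts must total M−1 and the expected emission counts must total M:

```python
import numpy as np

a0 = np.array([0.5, 0.5])                  # initial probabilities (toy values)
a  = np.array([[0.9, 0.1], [0.2, 0.8]])   # a_kl
e  = np.array([[0.7, 0.3], [0.1, 0.9]])   # e_k(b)
x  = [0, 0, 1, 1, 0]                      # observed sequence over {0, 1}
M  = len(x)

# Forward: f[k, i] = P(x_1..x_i, pi_i = k)
f = np.zeros((2, M))
f[:, 0] = a0 * e[:, x[0]]
for i in range(1, M):
    f[:, i] = (f[:, i-1] @ a) * e[:, x[i]]

# Backward: b[k, i] = P(x_{i+1}..x_M | pi_i = k)
b = np.ones((2, M))
for i in range(M - 2, -1, -1):
    b[:, i] = a @ (e[:, x[i+1]] * b[:, i+1])

px = f[:, -1].sum()                        # P(x)

# A_kl = (1/P(x)) * sum_i f_k(i) * a_kl * e_l(x_{i+1}) * b_l(i+1)
A = sum(np.outer(f[:, i], e[:, x[i+1]] * b[:, i+1]) for i in range(M - 1)) * a / px
# E_k(b) = (1/P(x)) * sum_{i: x_i = b} f_k(i) * b_k(i)
E = np.zeros((2, 2))
for i in range(M):
    E[:, x[i]] += f[:, i] * b[:, i] / px
```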
Baum-Welch: EM for HMM
Maximize:
Σ_k Σ_b E_k(b) · log e_k(b) + Σ_k Σ_l A_kl · log a_kl
Chosen parameters:
a_kl^chosen = A_kl / Σ_{l'} A_{kl'} ,  e_k(b) = E_k(b) / Σ_{b'} E_k(b')
Difference between the chosen set and some other set of parameters (multiplying and dividing by the same factor Σ_{l'} A_{kl'}):
Σ_k Σ_l A_kl · log a_kl^chosen − Σ_k Σ_l A_kl · log a_kl^other
= Σ_k ( Σ_{l'} A_{kl'} ) · Σ_l a_kl^chosen · log( a_kl^chosen / a_kl^other )
Each inner sum is a KL divergence, hence always nonnegative, so the chosen parameters maximize the expression (and similarly for e_k(b)).
23
Summary: Parameter Estimation in HMM When States are Unknown
Input: X^1,…,X^n independent training sequences
Baum-Welch algorithm (1972):
(1) Expectation:
• compute the expected number of k→l state transitions:
P(π_i = k, π_{i+1} = l | X^j, θ) = [1/P(X^j)] · f_k^j(i) · a_kl · e_l(x^j_{i+1}) · b_l^j(i+1)
A_kl = Σ_j [1/P(X^j)] · Σ_i f_k^j(i) · a_kl · e_l(x^j_{i+1}) · b_l^j(i+1)
• compute the expected number of symbol b appearances in state k:
E_k(b) = Σ_j [1/P(X^j)] · Σ_{i: x^j_i = b} f_k^j(i) · b_k^j(i) (ex.)
(2) Maximization:
• re-compute the new parameters from A, E using maximum likelihood.
repeat (1)+(2) until the improvement is small.
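Putting the pieces together, the full iteration can be sketched over several training sequences (toy data and names ours; a real implementation would also need scaling or log-space arithmetic for long sequences):

```python
import numpy as np

def baum_welch(seqs, a, e, a0, n_iters=20):
    """Accumulate expected counts A, E over all training sequences with
    forward/backward, then re-estimate a_kl and e_k(b) by normalizing."""
    K, B = e.shape
    for _ in range(n_iters):
        A, E = np.zeros((K, K)), np.zeros((K, B))
        for x in seqs:
            M = len(x)
            f = np.zeros((K, M)); f[:, 0] = a0 * e[:, x[0]]
            for i in range(1, M):
                f[:, i] = (f[:, i-1] @ a) * e[:, x[i]]
            b = np.ones((K, M))
            for i in range(M - 2, -1, -1):
                b[:, i] = a @ (e[:, x[i+1]] * b[:, i+1])
            px = f[:, -1].sum()
            for i in range(M - 1):
                A += np.outer(f[:, i], e[:, x[i+1]] * b[:, i+1]) * a / px
            for i in range(M):
                E[:, x[i]] += f[:, i] * b[:, i] / px
        a = A / A.sum(axis=1, keepdims=True)   # a_kl = A_kl / sum_l' A_kl'
        e = E / E.sum(axis=1, keepdims=True)   # e_k(b) = E_k(b) / sum_b' E_k(b')
    return a, e

# Three toy training sequences over alphabet {0, 1}
a0 = np.array([0.5, 0.5])
a_new, e_new = baum_welch(
    seqs=[[0, 0, 0, 1, 1], [1, 1, 1, 0, 0], [0, 0, 1, 1, 1]],
    a=np.array([[0.6, 0.4], [0.3, 0.7]]),
    e=np.array([[0.8, 0.2], [0.2, 0.8]]),
    a0=a0)
```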
24
Lloyd Welch, USC Electrical Engineering; Leonard Baum, many years after the IDA
GE © Ron Shamir
de novo motif discovery using EM
Slides sources:
Chaim Linhart, Danit Wider, Katherina Kechris
25
Transcription Factors
• A transcription factor (TF) is a protein that regulates a gene by binding to a binding site (BS) in its vicinity, specific to the TF.
• Binding sites vary in their sequences. Their sequence pattern is called a motif.
26
Motif profile
Alignment:
a G g t a c T t
C c A t a c g t
a c g t T A g t
a c g t C c A t
C c g t a c g G
_________________
Profile:
A 3 0 1 0 3 1 1 0
C 2 4 0 0 1 4 0 0
G 0 1 4 0 0 0 3 1
T 0 0 0 5 1 0 1 4
_________________
Consensus: A C G T A C G T
• Line up the patterns by their start indexes
s = (s1, s2, …, st)
• Construct matrix profile with frequencies of each nucleotide in columns
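The profile construction is a straightforward per-column count; a sketch using the five aligned sites from the slide (helper name ours):

```python
# The five aligned binding sites from the slide
alignment = ["aGgtacTt", "CcAtacgt", "acgtTAgt", "acgtCcAt", "CcgtacgG"]

def make_profile(sites):
    """Per-column nucleotide counts (case-insensitive) and majority consensus."""
    cols = ["".join(s[i].upper() for s in sites) for i in range(len(sites[0]))]
    counts = {b: [col.count(b) for col in cols] for b in "ACGT"}
    consensus = "".join(max("ACGT", key=col.count) for col in cols)
    return counts, consensus

counts, consensus = make_profile(alignment)
```

This reproduces the profile matrix and the consensus ACGTACGT shown above.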
Motif finding: Given a set of co-regulated genes,
find a recurrent motif in their promoter regions. 27
An example: Implanting Motif AAAAAAAAGGGGGGG
atgaccgggatactgatAAAAAAAAGGGGGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataAAAAAAAAGGGGGGGa tgagtatccctgggatgacttAAAAAAAAGGGGGGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga gctgagaattggatgAAAAAAAAGGGGGGGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAAAAAAAAGGGGGGGcttatag gtcaatcatgttcttgtgaatggatttAAAAAAAAGGGGGGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa cggttttggcccttgttagaggcccccgtAAAAAAAAGGGGGGGcaattatgagagagctaatctatcgcgtgcgtgttcat aacttgagttAAAAAAAAGGGGGGGctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatAAAAAAAAGGGGGGGaccgaaagggaag ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttAAAAAAAAGGGGGGGa
28
Where is the Implanted Motif? (*)
atgaccgggatactgataaaaaaaagggggggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataaaaaaaaaggggggga tgagtatccctgggatgacttaaaaaaaagggggggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga gctgagaattggatgaaaaaaaagggggggtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaataaaaaaaagggggggcttatag gtcaatcatgttcttgtgaatggatttaaaaaaaaggggggggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa cggttttggcccttgttagaggcccccgtaaaaaaaagggggggcaattatgagagagctaatctatcgcgtgcgtgttcat aacttgagttaaaaaaaagggggggctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcataaaaaaaagggggggaccgaaagggaag ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttaaaaaaaaggggggga
29
Implanting Motif AAAAAAAAGGGGGGG with Four Mutations
atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat aacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaag ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa
30
Where is the Motif???
atgaccgggatactgatagaagaaaggttgggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacaataaaacggcggga tgagtatccctgggatgacttaaaataatggagtggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga gctgagaattggatgcaaaaaaagggattgtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatataataaaggaagggcttatag gtcaatcatgttcttgtgaatggatttaacaataagggctgggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa cggttttggcccttgttagaggcccccgtataaacaaggagggccaattatgagagagctaatctatcgcgtgcgtgttcat aacttgagttaaaaaatagggagccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatactaaaaaggagcggaccgaaagggaag ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttactaaaaaggagcgga
31
MEME Multiple EM for Motif Elicitation
[Bailey, Elkan ISMB ’94]
Goal: Given a set of sequences, find a motif (PWM) that maximizes the expected likelihood of the data
Technique: EM (Expectation Maximization)
(based on [Lawrence, Reilly ’90])
32
The Mixture Model
Data: X = (X1,…,Xn):
all (overlapping) l-mers in the input sequences
Assume Xi’s were generated by a two-component mixture model - θ = (θ1 , θ2 ) :
Model #1: θ1 = motif model:
fi,b = prob. of base b at pos i in motif, 1 ≤ i ≤ l
Model #2: θ2 = background (BG) model:
f0,b = prob. of base b
Mixing parameter: λ = (λ1 , λ2 )
λj = prob. that model #j is used (λ1+λ2=1)
Assume independence between l-mers
33
Log Likelihood
Missing data: Z = (Z1,…,Zn):
Zi = (Zi1, Zi2); Zij = 1 if Xi is from model #j, 0 otherwise
Complete Likelihood of model given data:
L (θ, λ | X, Z) = p (X, Z | θ, λ)
= Πi=1…n p (Xi, Zi | θ, λ)
p (Xi, Zi | θ, λ) = p (Xi | Zi ,θ, λ) p (Zi) =
= λ1 p(Xi|θ1) if Zi1=1; λ2 p(Xi|θ2) if Zi2=1
log L = Σi=1…n Σj=1,2 Zij log (λj p(Xi|θj))
34
MEME: Algorithm
Goal: Maximize E[log L]
Outline of EM algorithm:
• Choose starting θ, λ
• Repeat until convergence of θ:
– E-step: Re-estimate Z from θ, λ, X
– M-step: Re-estimate θ, λ from X, Z
• Repeat all of the above for various starting θ, λ …
35
E-step
Compute expectation of log L over Z:
E[log L] = Σi=1…n Σj=1,2 Z’ij log (λj p(Xi|θj))
where:
Z’ij = p(Zij=1| θ,λ,Xi) =
= p(Zij=1, Xi| θ,λ) / p(Xi| θ,λ) =
= p(Zij=1, Xi| θ,λ) / Σk=1,2 p(Zik=1, Xi| θ,λ) =
= λj p(Xi|θj) / Σk=1,2 λk p(Xi|θk)
36
M-step
Find θ, λ that maximize E[log L] = Q(θ,λ | θᵗ,λᵗ):
E[log L] = Σi=1…n Σj=1,2 Z’ij log (λj p(Xi|θj))
Finding λ:
Suffices to maximize L1= Σi=1…n Σj=1,2 Z’ij log λj
Since λ1+λ2=1: L1 = Σi=1…n (Z’i1 log λ1 + Z’i2 log (1-λ1))
dL1/dλ1 = Σi=1…n (Z’i1 / λ1 – Z’i2 / (1-λ1))
37
MEME: Algorithm
M-step (cont.):
dL1/dλ1 = Σi=1…n (Z’i1 / λ1 – Z’i2 / (1-λ1)) = 0
λ1 Σi=1…n Z’i2 = (1-λ1) Σi=1…n Z’i1
λ1 ( Σi=1…n (Z’i1+Z’i2) ) = Σi=1…n Z’i1
λ1 = ( Σi=1…n Z’i1 ) / n
λ2 = 1- λ1 = ( Σi=1…n Z’i2 ) / n
Finding θ: …
38
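The E- and M-steps can be sketched end to end. The data, names, and the PWM re-estimation for θ below are our own simplifications; MEME itself adds pseudocount priors, motif erasing, and many starting points:

```python
import numpy as np

def meme_em(lmers, f_motif, f_bg, lam1, n_iters=50):
    """Two-component mixture EM over l-mers: f_motif[i, b] is the PWM,
    f_bg[b] the background frequencies, lam1 the motif mixing weight."""
    X = np.array(lmers)          # n x l matrix of base indices (A,C,G,T = 0..3)
    n, l = X.shape
    for _ in range(n_iters):
        # E-step: Z'_i1 = lam1 p(X_i|theta1) / sum_j lam_j p(X_i|theta_j)
        p1 = lam1 * np.prod(f_motif[np.arange(l), X], axis=1)
        p2 = (1 - lam1) * np.prod(f_bg[X], axis=1)
        z1 = p1 / (p1 + p2)
        # M-step: lam1 = (sum_i Z'_i1) / n; PWM from Z'-weighted base counts
        lam1 = z1.mean()
        for i in range(l):
            for b in range(4):
                f_motif[i, b] = z1[X[:, i] == b].sum() + 1e-9
            f_motif[i] /= f_motif[i].sum()
    return f_motif, lam1

# Illustrative data: 30 copies of the 4-mer "AAGG" among 70 random 4-mers
rng = np.random.default_rng(0)
planted = [0, 0, 2, 2]
lmers = [planted] * 30 + [list(rng.integers(0, 4, 4)) for _ in range(70)]
pwm, lam1 = meme_em(lmers, np.full((4, 4), 0.25), np.full(4, 0.25), 0.5)
```

Starting from a uniform PWM, the motif component locks onto the planted pattern, so the column-wise argmax of the PWM recovers "AAGG".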
Tim Bailey, Charles Elkan
• Tim Bailey: Senior Research Fellow, Institute for Molecular Bioscience, University of Queensland, Brisbane, Australia
• Charles Elkan: Professor, Department of Computer Science and Engineering, University of California, San Diego
40