(Regulatory-) Motif Finding
Clustering of Genes
Find binding sites responsible for
common expression patterns
Chromatin Immunoprecipitation + Microarray chip: “ChIP-chip”
Finding Regulatory Motifs
Given a collection of genes with common expression,
Find the TF-binding motif in common
Characteristics of Regulatory Motifs
• Tiny
• Highly Variable
• ~Constant size, because a constant-size transcription factor binds
• Often repeated
• Low-complexity-ish
Essentially a Multiple Local Alignment
• Find “best” multiple local alignment
• Alignment score defined differently in probabilistic/combinatorial cases
Algorithms
• Combinatorial: CONSENSUS, TEIRESIAS, SP-STAR, others
• Probabilistic
1. Expectation Maximization: MEME
2. Gibbs Sampling: AlignACE, BioProspector
Combinatorial Approaches to Motif Finding
Discrete Formulations
Given sequences S = {x1, …, xn}
• A motif W is a consensus string w1…wK
• Find motif W* with “best” match to x1, …, xn
Definition of “best”:
d(W, xi) = min Hamming distance between W and any K-long word in xi
d(W, S) = Σi d(W, xi)
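The scoring functions above can be written down directly; the following is a minimal sketch (the sequences and motif used are made-up examples):

```python
# Sketch of d(W, x) and d(W, S) using plain Hamming distance
# over every K-long window of each sequence.

def hamming(a, b):
    """Number of mismatching positions between equal-length strings."""
    return sum(c1 != c2 for c1, c2 in zip(a, b))

def d_word(W, x):
    """d(W, x): min Hamming distance between W and any |W|-long word in x."""
    K = len(W)
    return min(hamming(W, x[i:i+K]) for i in range(len(x) - K + 1))

def d_total(W, S):
    """d(W, S): sum over sequences of the per-sequence minimum."""
    return sum(d_word(W, x) for x in S)

S = ["TTACGTAA", "CCACGTCC", "GGACCTGG"]
print(d_total("ACGT", S))  # exact match in first two, 1 mismatch in third -> 1
```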
Approaches
• Exhaustive Searches
• CONSENSUS
• MULTIPROFILER
• TEIRESIAS, SP-STAR, WINNOWER
Exhaustive Searches
1. Pattern-driven algorithm:
For W = AA…A to TT…T (4^K possibilities)
Find d( W, S )
Report W* = argmin( d(W, S) )
Running time: O( K N 4^K )
(where N = Σi |xi|)
Advantage: Finds provably “best” motif W
Disadvantage: Time
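The pattern-driven enumeration can be sketched as follows; this is feasible only for small K, and the test sequences are invented for illustration:

```python
# Sketch of the pattern-driven exhaustive search: enumerate all 4^K
# candidate motifs and keep the argmin of d(W, S).
from itertools import product

def hamming(a, b):
    return sum(c1 != c2 for c1, c2 in zip(a, b))

def d_total(W, S):
    K = len(W)
    return sum(min(hamming(W, x[i:i+K]) for i in range(len(x) - K + 1))
               for x in S)

def pattern_driven(S, K):
    best = None
    for letters in product("ACGT", repeat=K):   # 4^K candidates
        W = "".join(letters)
        score = d_total(W, S)
        if best is None or score < best[1]:
            best = (W, score)
    return best

print(pattern_driven(["TTACG", "CACGT", "ACGAA"], 3))  # ('ACG', 0)
```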
Exhaustive Searches
2. Sample-driven algorithm:
For W = any K-long word occurring in some xi
Find d( W, S )
Report W* = argmin( d( W, S ) ), or report a local improvement of W*
Running time: O( K N^2 )
Advantage: Time
Disadvantage: If the true motif is weak and does not occur exactly in the data,
then a random word may score better than any instance of the true motif
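The sample-driven variant restricts the candidates to words that actually occur in S (at most N of them), which is what makes it fast; a minimal sketch on invented data:

```python
# Sketch of the sample-driven search: candidates are only the K-long
# words occurring in S, giving O(K N^2) instead of O(K N 4^K).

def hamming(a, b):
    return sum(c1 != c2 for c1, c2 in zip(a, b))

def d_total(W, S):
    K = len(W)
    return sum(min(hamming(W, x[i:i+K]) for i in range(len(x) - K + 1))
               for x in S)

def sample_driven(S, K):
    candidates = {x[i:i+K] for x in S for i in range(len(x) - K + 1)}
    return min(((W, d_total(W, S)) for W in candidates),
               key=lambda t: t[1])

print(sample_driven(["TTACG", "CACGT", "ACGAA"], 3))  # ('ACG', 0)
```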
MULTIPROFILER
• Extended sample-driven approach
Given a K-long word W, define:
Nα(W) = words W’ in S s.t. d(W, W’) ≤ α
Idea:
Assume W is occurrence of true motif W*
Will use Nα(W) to correct “errors” in W
MULTIPROFILER
Assume W differs from true motif W* in at most L positions
Define:
A wordlet G of W is an L-long pattern with blanks, differing from W, where L is smaller than the word length K
Example:
K = 7; L = 3
W = ACGTTGA
G = --A--CG
MULTIPROFILER
Algorithm:
For each W in S:
  For L = 1 to Lmax:
    1. Find the α-neighbors of W in S → Nα(W)
    2. Find all “strong” L-long wordlets G in Nα(W)
    3. For each wordlet G,
       a. Modify W by the wordlet G → W’
       b. Compute d(W’, S)
Report W* = argmin d(W’, S)
Step 2 above: a smaller motif-finding problem; use exhaustive search
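The α-neighborhood Nα(W) at the core of MULTIPROFILER is simple to compute; a sketch on invented sequences:

```python
# Sketch of N_alpha(W): all K-long words in S whose Hamming
# distance to W is at most alpha.

def hamming(a, b):
    return sum(c1 != c2 for c1, c2 in zip(a, b))

def neighbors(W, S, alpha):
    K = len(W)
    return [x[i:i+K]
            for x in S
            for i in range(len(x) - K + 1)
            if hamming(W, x[i:i+K]) <= alpha]

print(neighbors("ACGT", ["AAGTTT", "ACGTAC"], 1))  # ['AAGT', 'ACGT']
```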
CONSENSUS
Algorithm:
Cycle 1:
For each word W in S (of fixed length!)
For each word W’ in S
Create alignment (gap free) of W, W’
Keep the C1 best alignments, A1, …, AC1
ACGGTTG , CGAACTT , GGGCTCT …
ACGCCTG , AGAACTA , GGGGTGT …
CONSENSUS
Algorithm:
Cycle t:
For each word W in S
For each alignment Aj from cycle t-1
Create alignment (gap free) of W, Aj
Keep the Ct best alignments A1, …, ACt
ACGGTTG , CGAACTT , GGGCTCT …
ACGCCTG , AGAACTA , GGGGTGT …
… … …
ACGGCTC , AGATCTT , GGCGTCT …
CONSENSUS
• C1, …, Cn are user-defined heuristic constants
(N is the sum of sequence lengths; n is the number of sequences)
Running time:
O(N^2) + O(N C1) + O(N C2) + … + O(N Cn)
= O( N^2 + N Ctotal )
where Ctotal = Σi Ci, typically O(nC), where C is a big constant
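The greedy cycles above amount to a beam search over gap-free alignments. The sketch below is a simplification: it scores an alignment by column agreement (the count of the most frequent letter per column), whereas the real CONSENSUS program uses an information-content score; the data is invented:

```python
# Simplified sketch of the CONSENSUS idea: grow gap-free alignments one
# sequence at a time, keeping only the C best partial alignments.

def words(x, K):
    """All K-long windows of sequence x."""
    return [x[i:i+K] for i in range(len(x) - K + 1)]

def score(alignment):
    """Sum over columns of the count of the most frequent letter
    (a stand-in for the information-content score)."""
    return sum(max(col.count(c) for c in set(col))
               for col in zip(*alignment))

def consensus(S, K, C):
    beams = [[w] for w in words(S[0], K)]      # cycle 1: seed alignments
    for x in S[1:]:                            # cycle t: extend with seq x
        beams = sorted((a + [w] for a in beams for w in words(x, K)),
                       key=score, reverse=True)[:C]
    return beams[0]

print(consensus(["TTACGT", "GACGTA", "ACGTCC"], 4, 10))
```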
Expectation Maximization in Motif Finding
All K-long words → motif or background
Expectation Maximization
Algorithm (sketch):
1. Given genomic sequences, find all K-long words
2. Assume each word is either motif or background
3. Find the likeliest motif model, background model, and
classification of words into either motif or background
Expectation Maximization
• Given sequences x1, …, xN,
• Find all K-long words X1, …, Xn
• Define motif model:
M = (M1,…, MK)
Mi = (Mi1, …, Mi4) (assume alphabet {A, C, G, T})
where Mij = Prob[ letter j occurs in motif position i ]
• Define background model:
B = (B1, …, B4)
where Bj = Prob[ letter j in background sequence ]
Expectation Maximization
• Define
Zi1 = { 1, if Xi is motif;
0, otherwise }
Zi2 = { 0, if Xi is motif;
1, otherwise }
• Given a word Xi = x[1]…x[K],
P[ Xi, Zi1=1 ] = λ M1,x[1] … MK,x[K]
P[ Xi, Zi2=1 ] = (1 – λ) Bx[1] … Bx[K]
Let λ1 = λ; λ2 = (1 – λ)
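The two joint probabilities above are straightforward products; a sketch with an invented 2-position motif model (the matrix values are illustrative assumptions, not from the slides):

```python
# Sketch of P[Xi, Zi1=1] and P[Xi, Zi2=1]: M is a K x 4 matrix of
# position-specific letter probabilities, B a length-4 background vector.

IDX = {"A": 0, "C": 1, "G": 2, "T": 3}

def p_motif(X, M, lam):
    """P[X, Z1=1] = lambda * prod_i M[i][x[i]]"""
    p = lam
    for i, ch in enumerate(X):
        p *= M[i][IDX[ch]]
    return p

def p_background(X, B, lam):
    """P[X, Z2=1] = (1 - lambda) * prod_i B[x[i]]"""
    p = 1.0 - lam
    for ch in X:
        p *= B[IDX[ch]]
    return p

M = [[0.97, 0.01, 0.01, 0.01],   # position 1 strongly favors A
     [0.01, 0.97, 0.01, 0.01]]   # position 2 strongly favors C
print(p_motif("AC", M, 0.1), p_background("AC", [0.25] * 4, 0.1))
```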
Expectation Maximization
Define:
Parameter space θ = (M, B)
θ1: Motif; θ2: Background
Objective:
Maximize log likelihood of model:

log P(X1,…,Xn, Z | θ, λ) = Σi=1..n Σj=1,2 Zij log P(Xi | θj) + Σi=1..n Σj=1,2 Zij log λj
Expectation Maximization
• Maximize expected likelihood, in an iteration of two steps:
Expectation:
Find the expected value of the log likelihood:
E[ log P(X1,…,Xn, Z | θ, λ) ]
Maximization:
Maximize the expected value over θ, λ
Expectation Maximization: E-step
Expectation:
Find the expected value of the log likelihood:
E[ log P(X1,…,Xn, Z | θ, λ) ] = Σi=1..n Σj=1,2 E[Zij] log P(Xi | θj) + Σi=1..n Σj=1,2 E[Zij] log λj
where the expected values of Z can be computed as follows:
Z*ij = E[Zij] = λj P(Xi | θj) / ( λ P(Xi | θ1) + (1 – λ) P(Xi | θ2) )
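The E-step formula for Z* can be sketched directly; the per-word likelihoods are passed in as plain numbers here, and the values used are invented:

```python
# Sketch of the E-step: Z*_i1 = lam * P(Xi|motif) /
# ( lam * P(Xi|motif) + (1 - lam) * P(Xi|background) ).

def e_step(p_m, p_b, lam):
    """Given per-word motif/background likelihoods, return Z*_i1 values."""
    z = []
    for pm, pb in zip(p_m, p_b):
        num = lam * pm
        z.append(num / (num + (1.0 - lam) * pb))
    return z

print(e_step([0.9, 0.1], [0.1, 0.9], 0.5))  # ~[0.9, 0.1]
```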
Expectation Maximization: M-step
Maximization:
Maximize the expected value over θ and λ independently
For λ, this is easy:
λNEW = argmaxλ Σi=1..n ( Z*i1 log λ + Z*i2 log(1 – λ) ) = (1/n) Σi=1..n Z*i1
• For θ = (M, B), define
cjk = E[ # times letter k appears in motif position j ]
c0k = E[ # times letter k appears in background ]
• The cjk values are calculated easily from the Z* values
It easily follows:
MjkNEW = cjk / Σk=1..4 cjk
BkNEW = c0k / Σk=1..4 c0k
To not allow any 0’s, add pseudocounts
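The M-step updates can be sketched as expected counts followed by normalization; the pseudocount handling and test data are illustrative assumptions:

```python
# Sketch of the M-step: accumulate expected letter counts weighted by Z*,
# normalize to get M_new and B_new, and set lam_new to the mean Z*.

IDX = {"A": 0, "C": 1, "G": 2, "T": 3}

def m_step(words, z, pseudo=1.0):
    K = len(words[0])
    c_motif = [[pseudo] * 4 for _ in range(K)]   # c_jk, with pseudocounts
    c_back = [pseudo] * 4                        # c_0k
    for X, zi in zip(words, z):
        for j, ch in enumerate(X):
            c_motif[j][IDX[ch]] += zi            # motif weight
            c_back[IDX[ch]] += 1.0 - zi          # background weight
    M = [[c / sum(row) for c in row] for row in c_motif]
    B = [c / sum(c_back) for c in c_back]
    lam = sum(z) / len(z)                        # lam_new = (1/n) sum Z*_i1
    return M, B, lam
```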
Initial Parameters Matter!
Consider the following “artificial” example:
x1, …, xN contain:
2^12 patterns on {A, T}: A…A, A…AT, …, T…T
2^12 patterns on {C, G}: C…C, C…CG, …, G…G
D << 2^12 occurrences of the 12-mer ACTGACTGACTG
Some local maxima:
λ = ½; B = ½C, ½G; Mi = ½A, ½T, i = 1, …, 12
λ = D/2^(k+1); B = ¼A, ¼C, ¼G, ¼T;
M1 = 100% A, M2 = 100% C, M3 = 100% T, etc.
Overview of EM Algorithm
1. Initialize parameters θ = (M, B), λ: Try different values of λ from N^(–1/2) up to 1/(2K)
2. Repeat:
a. Expectation
b. Maximization
3. Until change in θ = (M, B), λ falls below ε
4. Report results for several “good” λ
Overview of EM Algorithm
• One iteration running time: O(NK)
• Usually need < N iterations for convergence, and < N starting points
• Overall complexity: unclear – typically O(N^2 K) to O(N^3 K)
• EM is a local optimization method
• Initial parameters matter
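The full scheme described above can be put together end-to-end on toy data; this is a sketch of the two-component mixture EM the slides describe, not MEME itself, and the starting values and sequences are arbitrary assumptions:

```python
# End-to-end sketch: EM over K-long words with a motif model M, a
# background B, and a mixture weight lam, iterating E- and M-steps.

IDX = {"A": 0, "C": 1, "G": 2, "T": 3}

def likelihoods(X, M, B):
    """Return (P(X | motif), P(X | background))."""
    pm, pb = 1.0, 1.0
    for i, ch in enumerate(X):
        pm *= M[i][IDX[ch]]
        pb *= B[IDX[ch]]
    return pm, pb

def em(words, K, lam=0.5, iters=50, pseudo=0.1):
    M = [[0.25] * 4 for _ in range(K)]
    M[0] = [0.7, 0.1, 0.1, 0.1]          # break symmetry (arbitrary choice)
    B = [0.25] * 4
    for _ in range(iters):
        # E-step: expected motif membership Z*_i1 per word
        z = []
        for X in words:
            pm, pb = likelihoods(X, M, B)
            num = lam * pm
            z.append(num / (num + (1.0 - lam) * pb))
        # M-step: expected counts (with pseudocounts), then normalize
        cM = [[pseudo] * 4 for _ in range(K)]
        cB = [pseudo] * 4
        for X, zi in zip(words, z):
            for j, ch in enumerate(X):
                cM[j][IDX[ch]] += zi
                cB[IDX[ch]] += 1.0 - zi
        M = [[c / sum(row) for c in row] for row in cM]
        B = [c / sum(cB) for c in cB]
        lam = sum(z) / len(z)
    return M, B, lam

# Toy data: a planted motif ACG among random-looking background words.
words = ["ACG"] * 5 + ["TTT", "GGA", "CTC", "TAG", "GCT"]
M, B, lam = em(words, 3)
```

With the planted motif repeated, the learned M concentrates on A, C, G at positions 1 to 3, illustrating how the local optimum depends on the starting point.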
MEME: Bailey and Elkan, ISMB 1994.