Page 1: Regulatory motif discovery 6.095/6.895 - Computational Biology: Genomes, Networks, Evolution Lecture 10Oct 12, 2005

Regulatory motif discovery

6.095/6.895 - Computational Biology: Genomes, Networks, Evolution

Lecture 10 Oct 12, 2005

The course so far…

     Fundamental bio problem        Fundamental comp. tool    Lect.
1    Introduction                                             1
2    Sequence alignment             Dynamic programming       2,3
3    Database search                Hashing                   4,5
4    Modeling biological signals    HMMs                      6,7
5    Microarray analysis            Clustering                8,9
6    Regulatory motifs              Gibbs Sampling/EM         10
7    Biological networks            Graph algorithms          11
8    Phylogenetic trees             Evolutionary modeling     12

Challenges in Computational Biology

[Figure: numbered challenges arranged around DNA (TCATGCTATTCGTGATAATGAGGATATTTATCATATTTATGATTT) and its RNA transcript]

1 Gene Finding
2 Sequence alignment
3 Database lookup
4 Genome Assembly
5 Regulatory motif discovery
6 Comparative Genomics
7 Evolutionary Theory
8 Gene expression analysis
9 Cluster discovery
10 Gibbs sampling
11 Protein network analysis
12 Regulatory network inference
13 Emerging network properties

Challenges in Computational Biology

[Figure labels: DNA; Regulatory motif discovery; Group of co-regulated genes; Common subsequence]

Last time: Discover groups of co-regulated genes.

Today: Find motifs within them.

Overview

Introduction
• Bio review: Where do ambiguities come from?
• Computational formulation of the problem

Combinatorial solutions
• Exhaustive search
• Greedy motif clustering
• Wordlets and motif refinement

Probabilistic solutions
• Expectation maximization
• Gibbs sampling

Regulatory motif discovery

[Figure: the GAL1 gene (ATGACTAAATCTCATTCAGAAGAAGTGA) with upstream binding sites labeled Gal4 and Mig1; motif labels as extracted from the slide: CCCCWCGG CCG, CGG CCG]

• Regulatory motifs
– Genes are turned on / off in response to changing environments
– No direct addressing: subroutines (genes) contain sequence tags (motifs)
– Specialized proteins (transcription factors) recognize these tags

• What makes motif discovery hard?
– Motifs are short (6-8 bp), sometimes degenerate
– Can contain any set of nucleotides (no ATG or other rules)
– Act at variable distances upstream (or downstream) of target gene

PS1: Evaluating genome-wide motif conservation

• Genome-wide conservation reveals real motifs
– Count conserved instances → Not informative
– Count total instances → Not informative
– Evaluate conserved/total instances → Real motifs!

• Motif enumeration
– Perfectly conserved motif instances
– Each motif is a fully-specified 6-mer

• Algorithmic speed-up
– Do not search entire genome for every motif
– Scan genome once, fill in table of motif instances
– Content-based indexing

• What’s missing?

[Figure: the genome sequence ATTAGCCAGTAGCGCAGTGCATGCATGCACGACTGCAAGTGCATGCATGCTAGCTACGTAGCTAGCCGCGCATGCTGTGACTGCTAG is scanned once to fill a table with one row per 6-mer (AGTGCA, AGTGAT, AGTGAG, AGTGAC, AGTGCG, AGTGCC, AGTGCT, AGTGAA) and columns labeled total, cons and ratio. Count pairs recovered from the slide: 50 and 200 in one row, 20 and 4000 in each of the other seven.]

Problem Definition

Given a collection of promoter sequences s1,…, sN of genes with common expression

Combinatorial

Motif M: m1…mW, some of the mi’s blank

• Find M that occurs in all si with ≤ k differences

• Or, find M with smallest total Hamming distance

Probabilistic

Motif: Mij; 1 ≤ i ≤ W, 1 ≤ j ≤ 4

Mij = Prob[ letter j, pos i ]

Find best M, and positions p1,…, pN in sequences

Discrete Formulations

Given sequences S = {x1, …, xn}

• A motif W is a consensus string w1…wK

• Find motif W* with “best” match to x1, …, xn

Definition of “best”:

d(W, xi) = min Hamming distance between W and any word in xi

d(W, S) = Σi d(W, xi)
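The two distances above can be sketched in a few lines of Python. The function names (`hamming`, `d_seq`, `d_total`) are illustrative choices, not names used in the lecture, and every sequence is assumed to be at least as long as the motif.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length words differ."""
    if len(a) != len(b):
        raise ValueError("words must have the same length")
    return sum(ca != cb for ca, cb in zip(a, b))


def d_seq(w: str, x: str) -> int:
    """d(W, xi): minimum Hamming distance between W and any |W|-long word of x."""
    k = len(w)
    return min(hamming(w, x[j:j + k]) for j in range(len(x) - k + 1))


def d_total(w: str, seqs: list[str]) -> int:
    """d(W, S): sum of d(W, xi) over all sequences in S."""
    return sum(d_seq(w, x) for x in seqs)
```

For example, with S = {ACGTACGT, TTACGATT, GGGACGTG} the word ACGT occurs exactly in the first and third sequences and with one mismatch (ACGA) in the second, so d(ACGT, S) = 1.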

Exhaustive Searches

1. Pattern-driven algorithm:

For W = AA…A to TT…T (4^K possibilities)

Find d( W, S )

Report W* = argmin( d(W, S) )

Running time: O( K N 4^K )

(where N = Σi |xi|)

Advantage: Finds provably “best” motif W

Disadvantage: Time
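A minimal sketch of the pattern-driven search, assuming the d(W, S) definition from the Discrete Formulations slide. The helper `d_total` and the function name `pattern_driven_search` are illustrative; ties are broken by keeping the first candidate in alphabetical order, which is a choice made here and not stated on the slide.

```python
from itertools import product


def d_total(w, seqs):
    """d(W, S): sum over sequences of the best Hamming match of w."""
    k = len(w)
    return sum(
        min(sum(a != b for a, b in zip(w, x[j:j + k]))
            for j in range(len(x) - k + 1))
        for x in seqs
    )


def pattern_driven_search(seqs, k):
    """Try all 4^k words and return (W*, d(W*, S))."""
    best_w, best_d = None, None
    for letters in product("ACGT", repeat=k):
        w = "".join(letters)
        d = d_total(w, seqs)
        if best_d is None or d < best_d:
            best_w, best_d = w, d
    return best_w, best_d
```

The loop body runs 4^K times, which is why this approach is only practical for short motifs.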

Exhaustive Searches

2. Sample-driven algorithm:

For W = any K-long word occurring in some xi

Find d( W, S )

Report W* = argmin( d( W, S ) )
or, report a local improvement of W*

Running time: O( K N^2 )

Advantage: Time

Disadvantage: If the true motif is weak and does not occur in the data, then a random motif may score better than any instance of the true motif
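The sample-driven variant only changes where candidates come from: words that actually occur in the data. The sketch below is illustrative (function names are assumptions) and omits the optional local-improvement step.

```python
def d_total(w, seqs):
    """d(W, S): sum over sequences of the best Hamming match of w."""
    k = len(w)
    return sum(
        min(sum(a != b for a, b in zip(w, x[j:j + k]))
            for j in range(len(x) - k + 1))
        for x in seqs
    )


def sample_driven_search(seqs, k):
    """Score only the K-long words present in some xi; return (d, W*)."""
    candidates = {x[j:j + k] for x in seqs for j in range(len(x) - k + 1)}
    # Sorting makes tie-breaking deterministic (alphabetical order).
    return min((d_total(w, seqs), w) for w in sorted(candidates))
```

There are at most N candidate words and each is scored in O(K N) time, which gives the O(K N^2) bound.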

Greedy motif clustering (CONSENSUS)

Algorithm:

Cycle 1:

For each word W in S (of fixed length!)

For each word W’ in S

Create alignment (gap free) of W, W’

Keep the C1 best alignments, A1, …, AC1

ACGGTTG , CGAACTT , GGGCTCT …

ACGCCTG , AGAACTA , GGGGTGT …

Greedy motif clustering (CONSENSUS)

Algorithm:

Cycle t:

For each word W in S

For each alignment Aj from cycle t-1

Create alignment (gap free) of W, Aj

Keep the Ct best alignments A1, …, A_{Ct}

ACGGTTG , CGAACTT , GGGCTCT …
ACGCCTG , AGAACTA , GGGGTGT …
…         …         …
ACGGCTC , AGATCTT , GGCGTCT …

Greedy motif clustering (CONSENSUS)

• C1, …, Cn are user-defined heuristic constants
– N is sum of sequence lengths
– n is the number of sequences

Running time:

O(N^2) + O(N C1) + O(N C2) + … + O(N Cn)

= O( N^2 + N Ctotal )

where Ctotal = Σi Ci, typically O(nC), where C is a big constant
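The greedy cycles can be illustrated with a simplified sketch. It departs from the slides in two labeled ways: sequences are added in a fixed order (one word per sequence per cycle), and an alignment is scored by the sum over columns of the most frequent letter's count, a stand-in for whatever scoring CONSENSUS actually uses. The `beam` parameter plays the role of the constants Ct.

```python
from collections import Counter


def words(x, k):
    """All K-long words of sequence x."""
    return [x[j:j + k] for j in range(len(x) - k + 1)]


def score(alignment):
    """Assumed score: per column, count of the most common letter, summed."""
    return sum(max(Counter(col).values()) for col in zip(*alignment))


def greedy_consensus(seqs, k, beam=50):
    """Grow gap-free alignments one sequence at a time, keeping the best `beam`."""
    alignments = [(w,) for w in words(seqs[0], k)]
    for x in seqs[1:]:
        extended = [a + (w,) for a in alignments for w in words(x, k)]
        extended.sort(key=score, reverse=True)
        alignments = extended[:beam]
    return alignments[0]
```

Because only the top alignments survive each cycle, a weak true motif can be pruned early; larger `beam` values trade time for a lower chance of that happening.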

Motif Refinement and wordlets (MULTIPROFILER)

• Extended sample-driven approach

Given a K-long word W, define:

Nα(W) = words W’ in S s.t. d(W,W’) ≤ α

Idea:

Assume W is occurrence of true motif W*

Will use Nα(W) to correct “errors” in W

Motif Refinement and wordlets (MULTIPROFILER)

Assume W differs from true motif W* in at most L positions

Define:

A wordlet G of W is an L-long pattern with blanks, differing from W
– L is smaller than the word length K

Example:

K = 7; L = 3

W = ACGTTGA

G = --A--CG

Motif Refinement and wordlets (MULTIPROFILER)

Algorithm:

For each W in S:
  For L = 1 to Lmax
    1. Find the α-neighbors of W in S → Nα(W)
    2. Find all “strong” L-long wordlets G in Nα(W)
    3. For each wordlet G,
       1. Modify W by the wordlet G → W’
       2. Compute d(W’, S)

Report W* = argmin d(W’, S)

Step 1 above: Smaller motif-finding problem; Use exhaustive search

Where do ambiguous bases come from?

• Protein-DNA interactions
– Proteins read DNA by “feeling” the chemical properties of the bases
– Without opening DNA (not by base complementarity)

• Sequence specificity
– Topology of 3D contact dictates sequence specificity of binding
– Some positions are fully constrained; other positions are degenerate
– “Ambiguous / degenerate” positions are loosely contacted by the transcription factor

Representing motif ambiguities

• Entropy at pos’n i, H(i) = – Σ{letter x} freq(x, i) log2 freq(x, i)
• Height of x at pos’n i, L(x, i) = freq(x, i) (2 – H(i))
– Examples:
  • freq(A, i) = 1; H(i) = 0; L(A, i) = 2
  • A: ½; C: ¼; G: ¼; H(i) = 1.5; L(A, i) = ¼; L(not T, i) = ¼

entropy - n
1: (communication theory) a numerical measure of the uncertainty of an outcome; "the signal contained thousands of bits of information" [information, selective information]
2: (thermodynamics) a thermodynamic quantity representing the amount of energy in a system that is no longer available for doing mechanical work; "entropy increases as matter and energy in the universe degrade to an ultimate state of inert uniformity" [randomness]
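The entropy and letter-height formulas can be checked numerically with a short sketch. The function names are illustrative, and the convention 0 · log2 0 = 0 is assumed for letters with zero frequency.

```python
import math


def entropy(freqs):
    """H(i) = -sum_x freq(x, i) * log2 freq(x, i), skipping zero frequencies."""
    return -sum(f * math.log2(f) for f in freqs.values() if f > 0)


def logo_heights(freqs):
    """L(x, i) = freq(x, i) * (2 - H(i)) for each letter x."""
    h = entropy(freqs)
    return {x: f * (2 - h) for x, f in freqs.items()}
```

With A: ½, C: ¼, G: ¼ this gives H = 1.5 and a height of ¼ for A, matching the example above; a fully constrained position gives H = 0 and a height of 2 bits.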

Motifs and Profile Matrices

• given a set of aligned sequences, it is straightforward to construct a profile matrix characterizing a motif of interest

[Figure: a shared motif marked in a set of sequences]

     sequence positions
     1    2    3    4    5    6    7    8
A   0.1  0.1  0.3  0.2  0.1  0.3  0.1  0.3
C   0.1  0.5  0.2  0.1  0.1  0.2  0.1  0.2
G   0.6  0.2  0.2  0.5  0.6  0.1  0.7  0.2
T   0.2  0.2  0.3  0.2  0.2  0.4  0.1  0.3

• given a profile matrix characterizing the motif, it is straightforward to find the set of aligned sequences
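The first direction (aligned sequences to profile matrix) amounts to counting letters per column. The sketch below is a minimal illustration; the function name and the optional `pseudocount` argument are assumptions, not part of the slides.

```python
def profile_matrix(aligned, alphabet="ACGT", pseudocount=0.0):
    """Column-wise letter frequencies for equal-length aligned sequences."""
    width = len(aligned[0])
    if any(len(s) != width for s in aligned):
        raise ValueError("aligned sequences must have equal length")
    profile = {c: [] for c in alphabet}
    for k in range(width):
        column = [s[k] for s in aligned]
        total = len(column) + pseudocount * len(alphabet)
        for c in alphabet:
            profile[c].append((column.count(c) + pseudocount) / total)
    return profile
```

For the four aligned words ACG, ACT, AGG, TCG the first column is A: 0.75, T: 0.25, and each column sums to 1.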

Motifs and Profile Matrices

• Construct the profile if the sequences aren’t aligned
– in the typical case we don’t know what the motif looks like or where the starting positions are

Use expectation maximization

The EM Approach

• EM is a family of algorithms for learning probabilistic models in problems that involve hidden state

• In our problem, the hidden state is where the motif starts in each training sequence

Representing Motifs

• a motif is assumed to have a fixed width, W

• a motif is represented by a matrix of probabilities: $p_{c,k}$ represents the probability of character c in column k

• example: DNA motif with W=3

p =
      1    2    3
A   0.1  0.5  0.2
C   0.4  0.2  0.1
G   0.3  0.1  0.6
T   0.2  0.2  0.1

Representing Motifs

• we will also represent the “background” (i.e. outside the motif) probability of each character

• $p_{c,0}$ represents the probability of character c in the background

• example:

$p_0$ =
A  0.26
C  0.24
G  0.23
T  0.27

Basic EM Approach

• the element $Z_{ij}$ of the matrix $Z$ represents the probability that the motif starts in position j in sequence i

• example: given 4 DNA sequences of length 6, where W=3

Z =
         1    2    3    4
seq1   0.1  0.1  0.2  0.6
seq2   0.4  0.2  0.1  0.3
seq3   0.3  0.1  0.5  0.1
seq4   0.1  0.5  0.1  0.3

Basic EM Approach

given: length parameter W, training set of sequences

set initial values for p

do

    re-estimate Z from p (E-step)

    re-estimate p from Z (M-step)

until change in p < ε

return: p, Z

Basic EM Approach

• we’ll need to calculate the probability of a training sequence given a hypothesized starting position:

$$\Pr(X_i \mid Z_{ij}=1, p) = \underbrace{\prod_{k=1}^{j-1} p_{c_k,0}}_{\text{before motif}} \; \underbrace{\prod_{k=j}^{j+W-1} p_{c_k,\,k-j+1}}_{\text{motif}} \; \underbrace{\prod_{k=j+W}^{L} p_{c_k,0}}_{\text{after motif}}$$

$X_i$ is the ith sequence

$Z_{ij}$ is 1 if motif starts at position j in sequence i

$c_k$ is the character at position k in sequence i

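Example

$X_i$ = G C T G T A G

p =
      0     1    2    3
A   0.25  0.1  0.5  0.2
C   0.25  0.4  0.2  0.1
G   0.25  0.3  0.1  0.6
T   0.25  0.2  0.2  0.1

$$\Pr(X_i \mid Z_{i3}=1, p) = p_{G,0} \times p_{C,0} \times p_{T,1} \times p_{G,2} \times p_{T,3} \times p_{A,0} \times p_{G,0} = 0.25 \times 0.25 \times 0.2 \times 0.1 \times 0.1 \times 0.25 \times 0.25$$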
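The worked example can be reproduced with a small sketch. The data layout is an assumption made here: `P[c][k]` holds $p_{c,k}$ with column 0 as the background, and start positions are 1-based as on the slides.

```python
# Example matrix from the slide: index 0 is background, 1..3 are motif columns.
P = {
    "A": [0.25, 0.1, 0.5, 0.2],
    "C": [0.25, 0.4, 0.2, 0.1],
    "G": [0.25, 0.3, 0.1, 0.6],
    "T": [0.25, 0.2, 0.2, 0.1],
}


def seq_prob_given_start(x, j, p, w):
    """Pr(X_i | Z_ij = 1, p) for a motif of width w starting at 1-based position j."""
    prob = 1.0
    for pos, c in enumerate(x, start=1):
        if j <= pos < j + w:
            prob *= p[c][pos - j + 1]   # inside the motif: column pos - j + 1
        else:
            prob *= p[c][0]             # before or after the motif: background
    return prob
```

Calling `seq_prob_given_start("GCTGTAG", 3, P, 3)` multiplies the same seven factors as the example, 0.25 × 0.25 × 0.2 × 0.1 × 0.1 × 0.25 × 0.25.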

The E-step: Estimating Z

• to estimate the starting positions in Z at step t

$$Z_{ij}^{(t)} = \frac{\Pr(X_i \mid Z_{ij}=1, p^{(t)}) \, \Pr(Z_{ij}=1)}{\sum_{k=1}^{L-W+1} \Pr(X_i \mid Z_{ik}=1, p^{(t)}) \, \Pr(Z_{ik}=1)}$$

• this comes from Bayes’ rule applied to $\Pr(Z_{ij}=1 \mid X_i, p^{(t)})$

The E-step: Estimating Z

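• assume that it is equally likely that the motif will start in any position

$$Z_{ij}^{(t)} = \frac{\Pr(X_i \mid Z_{ij}=1, p^{(t)}) \, \Pr(Z_{ij}=1)}{\sum_{k=1}^{L-W+1} \Pr(X_i \mid Z_{ik}=1, p^{(t)}) \, \Pr(Z_{ik}=1)}$$

With equal priors, the $\Pr(Z_{ij}=1)$ terms cancel:

$$Z_{ij}^{(t)} = \frac{\Pr(X_i \mid Z_{ij}=1, p^{(t)})}{\sum_{k=1}^{L-W+1} \Pr(X_i \mid Z_{ik}=1, p^{(t)})}$$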

E-step: Estimating Z Example

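$X_i$ = G C T G T A G

p =
      0     1    2    3
A   0.25  0.1  0.5  0.2
C   0.25  0.4  0.2  0.1
G   0.25  0.3  0.1  0.6
T   0.25  0.2  0.2  0.1

$$Z_{i1} = 0.3 \times 0.2 \times 0.1 \times 0.25 \times 0.25 \times 0.25 \times 0.25$$

$$Z_{i2} = 0.25 \times 0.4 \times 0.2 \times 0.6 \times 0.25 \times 0.25 \times 0.25$$

...

• then normalize so that $\sum_{j=1}^{L-W+1} Z_{ij} = 1$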
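A sketch of the E-step for one sequence under the equal-prior assumption: compute the unnormalized value for every start position, then normalize. The layout (`P[c][k]` with column 0 as background, 1-based starts) and the function names are assumptions made for illustration.

```python
# Example matrix from the slide: index 0 is background, 1..3 are motif columns.
P = {
    "A": [0.25, 0.1, 0.5, 0.2],
    "C": [0.25, 0.4, 0.2, 0.1],
    "G": [0.25, 0.3, 0.1, 0.6],
    "T": [0.25, 0.2, 0.2, 0.1],
}


def seq_prob_given_start(x, j, p, w):
    """Pr(X_i | Z_ij = 1, p) with a 1-based start position j."""
    prob = 1.0
    for pos, c in enumerate(x, start=1):
        prob *= p[c][pos - j + 1] if j <= pos < j + w else p[c][0]
    return prob


def e_step_row(x, p, w):
    """Row Z_i: normalized probability of each of the L - W + 1 start positions."""
    weights = [seq_prob_given_start(x, j, p, w) for j in range(1, len(x) - w + 2)]
    total = sum(weights)
    return [wt / total for wt in weights]
```

For GCTGTAG the row has five entries, and the second start position is eight times as likely as the first (0.4 × 0.2 × 0.6 versus 0.3 × 0.2 × 0.1, with the background factors cancelling).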

The M-step: Estimating the motif p

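• recall $p_{c,k}$ represents the probability of character c in position k; values for position 0 represent the background

$$p_{c,k}^{(t+1)} = \frac{n_{c,k} + d_{c,k}}{\sum_{b} (n_{b,k} + d_{b,k})}$$

$$n_{c,k} = \begin{cases} \sum_{i} \sum_{\{j \,\mid\, X_{i,j+k-1} = c\}} Z_{ij} & k > 0 \\ n_c - \sum_{j=1}^{W} n_{c,j} & k = 0 \end{cases}$$

$d_{c,k}$: pseudo-counts

$n_c$: total # of c’s in data set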

M-step example: Estimating p

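$X_1$ = A C A G C A

$X_2$ = A G G C A G

$X_3$ = T C A G T C

$Z_{1,1} = 0.1,\; Z_{1,2} = 0.7,\; Z_{1,3} = 0.1,\; Z_{1,4} = 0.1$

$Z_{2,1} = 0.4,\; Z_{2,2} = 0.1,\; Z_{2,3} = 0.1,\; Z_{2,4} = 0.4$

$Z_{3,1} = 0.2,\; Z_{3,2} = 0.6,\; Z_{3,3} = 0.1,\; Z_{3,4} = 0.1$

$$p_{A,1} = \frac{Z_{1,1} + Z_{1,3} + Z_{2,1} + Z_{3,3} + 1}{Z_{1,1} + Z_{1,2} + \ldots + Z_{3,3} + Z_{3,4} + 4}$$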
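The M-step update can be sketched as expected counts plus pseudo-counts. Assumptions made for this illustration: Z rows are lists indexed by 0-based start position, a single pseudo-count value is used for every letter and column (the example uses 1), and the sequences are taken in the order A C A G C A, A G G C A G, T C A G T C so that the A's in motif column 1 line up with $Z_{1,1}$, $Z_{1,3}$, $Z_{2,1}$ and $Z_{3,3}$.

```python
def m_step(seqs, z, w, alphabet="ACGT", pseudocount=1.0):
    """Re-estimate p[c][k]; k = 0 is the background, k = 1..w the motif columns."""
    n = {c: [0.0] * (w + 1) for c in alphabet}
    for x, z_row in zip(seqs, z):
        for j, z_ij in enumerate(z_row):          # j: 0-based start position
            for k in range(1, w + 1):
                n[x[j + k - 1]][k] += z_ij        # expected count in motif column k
    for c in alphabet:
        total_c = sum(x.count(c) for x in seqs)   # n_c: all c's in the data set
        n[c][0] = total_c - sum(n[c][1:])         # remainder goes to the background
    p = {c: [0.0] * (w + 1) for c in alphabet}
    for k in range(w + 1):
        denom = sum(n[b][k] + pseudocount for b in alphabet)
        for c in alphabet:
            p[c][k] = (n[c][k] + pseudocount) / denom
    return p
```

With the Z values of the example, the estimate for A in motif column 1 is (0.1 + 0.1 + 0.4 + 0.1 + 1) / (3 + 4) = 1.7 / 7.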

The EM Algorithm

• EM converges to a local maximum in the likelihood of the data given the model:

$$\prod_{i} \Pr(X_i \mid p)$$

• usually converges in a small number of iterations
• sensitive to initial starting point (i.e. values in p)

Overview

Introduction
• Bio review: Where do ambiguities come from?
• Computational formulation of the problem

Combinatorial solutions
• Exhaustive search
• Greedy motif clustering
• Wordlets and motif refinement

Probabilistic solutions
• Expectation maximization
• MEME extensions
• Gibbs sampling

MEME Enhancements to the Basic EM Approach

• MEME builds on the basic EM approach in the following ways:
– trying many starting points
– not assuming that there is exactly one motif occurrence in every sequence
– allowing multiple motifs to be learned
– incorporating Dirichlet prior distributions

Starting Points in MEME

• for every distinct subsequence of length W in the training set
– derive an initial p matrix from this subsequence
– run EM for 1 iteration

• choose motif model (i.e. p matrix) with highest likelihood

• run EM to convergence

Using Subsequences as Starting Points for EM

• set values corresponding to letters in the subsequence to X

• set other values to (1-X)/(M-1) where M is the size of the alphabet

• example: for the subsequence TAT with X=0.5

p =
       1     2     3
A   0.17  0.5   0.17
C   0.17  0.17  0.17
G   0.17  0.17  0.17
T   0.5   0.17  0.5
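The starting-point rule is small enough to state as code. This sketch builds only the motif columns (no background column), and the function name is an illustrative choice.

```python
def initial_p_from_subsequence(sub, x=0.5, alphabet="ACGT"):
    """Initial motif matrix: probability x for the observed letter in each column,
    (1 - x) / (M - 1) for every other letter, where M is the alphabet size."""
    other = (1 - x) / (len(alphabet) - 1)
    return {c: [x if s == c else other for s in sub] for c in alphabet}
```

For TAT with X = 0.5 the non-matching entries are 0.5 / 3 ≈ 0.17, as in the table above.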

The ZOOPS Model (zero-or-one-occurrence-per-sequence)

• the approach as we’ve outlined it assumes that each sequence has exactly one motif occurrence; this is the OOPS model

• the ZOOPS model assumes zero or one occurrences per sequence

E-step in the ZOOPS Model

• we need to consider another alternative: the ith sequence doesn’t contain the motif

• we add another parameter $\lambda$ (and its relative $\gamma$)

$$\gamma = (L - W + 1)\,\lambda$$

– $\lambda$: prior prob that any position in a sequence is the start of a motif
– $\gamma$: prior prob of a sequence containing a motif

E-step in the ZOOPS Model

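$$Z_{ij}^{(t)} = \frac{\Pr(X_i \mid Z_{ij}=1, p^{(t)}) \, \lambda^{(t)}}{\Pr(X_i \mid Q_i=0, p^{(t)}) \, (1-\gamma^{(t)}) + \sum_{k=1}^{L-W+1} \Pr(X_i \mid Z_{ik}=1, p^{(t)}) \, \lambda^{(t)}}$$

• here $Q_i$ is a random variable that takes on 0 to indicate that the sequence doesn’t contain a motif occurrence

$$Q_i = \sum_{j=1}^{L-W+1} Z_{i,j}$$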

M-step in the ZOOPS Model

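• update p same as before

• update $\lambda$ (and $\gamma$) as follows

$$\lambda^{(t+1)} = \frac{\gamma^{(t+1)}}{L-W+1} = \frac{1}{n\,(L-W+1)} \sum_{i=1}^{n} \sum_{j=1}^{m} Z_{i,j}^{(t)}$$

• average of $Z_{i,j}^{(t)}$ across all sequences, positions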

The TCM Model (two-component mixture model)

• the TCM (two-component mixture model) assumes zero or more motif occurrences per sequence

Likelihood in the TCM Model

• the TCM model treats each length W subsequence independently

• to determine the likelihood of such a subsequence:

assuming a motif starts there:

$$\Pr(X_{i,j} \mid Z_{ij}=1, p) = \prod_{k=j}^{j+W-1} p_{c_k,\,k-j+1}$$

assuming a motif doesn’t start there:

$$\Pr(X_{i,j} \mid Z_{ij}=0, p) = \prod_{k=j}^{j+W-1} p_{c_k,0}$$

E-step in the TCM Model

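$$Z_{ij}^{(t)} = \frac{\Pr(X_{i,j} \mid Z_{ij}=1, p^{(t)}) \, \lambda^{(t)}}{\Pr(X_{i,j} \mid Z_{ij}=0, p^{(t)}) \, (1-\lambda^{(t)}) + \Pr(X_{i,j} \mid Z_{ij}=1, p^{(t)}) \, \lambda^{(t)}}$$

In the denominator, the first term covers the case where the subsequence isn’t a motif and the second the case where it is.

• M-step same as before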
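Because the TCM model scores each length-W window on its own, its E-step reduces to a two-way comparison per window. The sketch below assumes the `P[c][k]` layout used in the earlier example (column 0 is background) and a mixing weight `lam` standing for the prior probability that a window is a motif occurrence; both names are illustrative.

```python
# Example matrix from the earlier slides: index 0 is background, 1..3 motif columns.
P = {
    "A": [0.25, 0.1, 0.5, 0.2],
    "C": [0.25, 0.4, 0.2, 0.1],
    "G": [0.25, 0.3, 0.1, 0.6],
    "T": [0.25, 0.2, 0.2, 0.1],
}


def tcm_z(word, p, lam):
    """Posterior probability that a length-W window is a motif occurrence."""
    motif = 1.0
    background = 1.0
    for k, c in enumerate(word, start=1):
        motif *= p[c][k]        # Pr(window | motif starts here)
        background *= p[c][0]   # Pr(window | no motif starts here)
    return motif * lam / (background * (1 - lam) + motif * lam)
```

For the window CTG and a prior of 0.1, the motif term is 0.048 × 0.1 and the background term is 0.25³ × 0.9, so the posterior is roughly 0.25.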

Finding Multiple Motifs

• basic idea: discount the likelihood that a new motif starts in a given position if this motif would overlap with a previously learned one

• when re-estimating $Z_{ij}$, multiply by $\Pr(V_{ij}=1)$

$$V_{ij} = \begin{cases} 1, & \text{no previous motifs in } [X_{i,j}, \ldots, X_{i,j+w-1}] \\ 0, & \text{otherwise} \end{cases}$$

• $V_{ij}$ is estimated using $Z_{ij}$ values from previous passes of motif finding

Gibbs Sampling

• a general procedure for sampling from the joint distribution of a set of random variables $\Pr(U_1 \ldots U_n)$ by iteratively sampling from $\Pr(U_j \mid U_1 \ldots U_{j-1}, U_{j+1} \ldots U_n)$ for each j

• application to motif finding: Lawrence et al. 1993

• can view it as a stochastic analog of EM for this task

• less susceptible to local maxima than EM

Gibbs Sampling Approach

• in the EM approach we maintained a distribution $Z_i$ over the possible motif starting points for each sequence

• in the Gibbs sampling approach, we’ll maintain a specific starting point $a_i$ for each sequence but we’ll keep resampling these

Gibbs Sampling Approach

given: length parameter W, training set of sequences

choose random positions for a

do

    pick a sequence $X_i$

    estimate p given current motif positions a (update step)
    (using all sequences but $X_i$)

    sample a new motif position $a_i$ for $X_i$ (sampling step)

until convergence

return: p, a

Sampling New Motif Positions

• for each possible starting position, $a_i = j$, compute a weight

$$A_j = \frac{\prod_{k=j}^{j+W-1} p_{c_k,\,k-j+1}}{\prod_{k=j}^{j+W-1} p_{c_k,0}}$$

• randomly select a new starting position $a_i$ according to these weights

Gibbs Sampling (AlignACE)

• Given:
– x1, …, xN,
– motif length K,
– background B,

• Find:
– Model M
– Locations a1,…, aN in x1, …, xN

Maximizing log-odds likelihood ratio:

$$\sum_{i=1}^{N} \sum_{k=1}^{K} \log \frac{M(k, x^i_{a_i+k})}{B(x^i_{a_i+k})}$$

Gibbs Sampling (AlignACE)

• AlignACE: first statistical motif finder
• BioProspector: improved version of AlignACE

Algorithm (sketch):
1. Initialization:
   a. Select random locations in sequences x1, …, xN
   b. Compute an initial model M from these locations
2. Sampling Iterations:
   a. Remove one sequence xi
   b. Recalculate model
   c. Pick a new location of motif in xi according to probability the location is a motif occurrence

Gibbs Sampling (AlignACE)

Initialization:
• Select random locations a1,…, aN in x1, …, xN
• For these locations, compute M:

$$M_{kj} = \frac{1}{N} \sum_{i=1}^{N} (x_{a_i+k} = j)$$

• That is, Mkj is the number of occurrences of letter j in motif position k, over the total

Gibbs Sampling (AlignACE)

Predictive Update:

• Select a sequence x = xi

• Remove xi, recompute model:

$$M_{kj} = \frac{1}{(N-1) + B} \left( \beta_j + \sum_{s=1,\, s \neq i}^{N} (x_{a_s+k} = j) \right)$$

where $\beta_j$ are pseudocounts to avoid 0s, and $B = \sum_j \beta_j$

Gibbs Sampling (AlignACE)

Sampling:

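For every K-long word $x_j, \ldots, x_{j+k-1}$ in x:

$Q_j$ = Prob[ word | motif ] = $M(1, x_j) \times \cdots \times M(k, x_{j+k-1})$

$P_j$ = Prob[ word | background ] = $B(x_j) \times \cdots \times B(x_{j+k-1})$

Let

$$A_j = \frac{Q_j / P_j}{\sum_{j=1}^{|x|-k+1} Q_j / P_j}$$

Sample a random new position $a_i$ according to the probabilities $A_1, \ldots, A_{|x|-k+1}$.

[Figure: Prob plotted against position in x, from 0 to |x|]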
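The initialization, predictive update and sampling steps fit together as in the sketch below. It is a simplified illustration, not AlignACE itself: the background is assumed uniform, every pseudocount $\beta_j$ is set to the same value `beta`, the loop runs for a fixed number of iterations instead of testing convergence, and positions are 0-based.

```python
import random


def build_model(seqs, starts, k, skip, alphabet="ACGT", beta=1.0):
    """Predictive update: model[pos][letter] from all sequences except index `skip`."""
    counts = [{c: beta for c in alphabet} for _ in range(k)]
    for s, (x, a) in enumerate(zip(seqs, starts)):
        if s == skip:
            continue
        for pos in range(k):
            counts[pos][x[a + pos]] += 1
    denom = (len(seqs) - 1) + beta * len(alphabet)   # (N - 1) + B
    return [{c: col[c] / denom for c in alphabet} for col in counts]


def sample_position(x, model, background, k, rng):
    """Sampling step: draw a start position with probability proportional to Q_j / P_j."""
    weights = []
    for j in range(len(x) - k + 1):
        ratio = 1.0
        for pos in range(k):
            c = x[j + pos]
            ratio *= model[pos][c] / background[c]
        weights.append(ratio)
    return rng.choices(range(len(weights)), weights=weights)[0]


def gibbs_sampler(seqs, k, iterations=2000, seed=0):
    """Return one start position per sequence after a fixed number of updates."""
    rng = random.Random(seed)
    background = {c: 0.25 for c in "ACGT"}           # assumed uniform background
    starts = [rng.randrange(len(x) - k + 1) for x in seqs]
    for _ in range(iterations):
        i = rng.randrange(len(seqs))
        model = build_model(seqs, starts, k, skip=i)
        starts[i] = sample_position(seqs[i], model, background, k, rng)
    return starts
```

Since the result depends on the random seed, the sampler is usually run from several initializations and the motifs that recur are reported.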

Gibbs Sampling (AlignACE)

Running Gibbs Sampling:

1. Initialize

2. Run until convergence

3. Repeat 1,2 several times, report common motifs

Advantages / Disadvantages

• Very similar to EM

Advantages:
• Easier to implement
• Less dependent on initial parameters
• More versatile, easier to enhance with heuristics

Disadvantages:
• More dependent on all sequences to exhibit the motif
• Less systematic search of initial parameter space

Repeats, and a Better Background Model

• Repeat DNA can be mistaken for a motif
– Especially low-complexity CACACA… AAAAA, etc.

Solution:

more elaborate background model

0th order: B = { pA, pC, pG, pT }
1st order: B = { P(A|A), P(A|C), …, P(T|T) }
…
Kth order: B = { P(X | b1…bK); X, bi ∈ {A,C,G,T} }

Has been applied to EM and Gibbs (up to 3rd order)
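A Kth-order background model can be estimated by counting which letter follows each K-long context. The sketch below is illustrative: the function name, the dictionary layout (`model[context][letter]`) and the add-one pseudocount are assumptions, and contexts never seen in the training data are simply absent from the result.

```python
from collections import defaultdict


def train_background(seqs, order, alphabet="ACGT", pseudocount=1.0):
    """Estimate P(X | b1...bK) for K = order from the given sequences."""
    counts = defaultdict(lambda: {c: pseudocount for c in alphabet})
    for x in seqs:
        for i in range(order, len(x)):
            counts[x[i - order:i]][x[i]] += 1   # context is the preceding K letters
    model = {}
    for context, row in counts.items():
        total = sum(row.values())
        model[context] = {c: v / total for c, v in row.items()}
    return model
```

Trained on a CACACA… repeat, a 1st-order model assigns high probability to A after C and to C after A, so the repeat no longer looks surprising relative to the background and is less likely to be reported as a motif.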

Example Application: Motifs in Yeast

Group:

Tavazoie et al. 1999, G. Church’s lab, Harvard

Data:

• Microarrays on 6,220 mRNAs from yeast Affymetrix chips (Cho et al.)

• 15 time points across two cell cycles

Processing of Data

1. Selection of 3,000 genes
• Genes with most variable expression were selected

2. Clustering according to common expression
• K-means clustering
• 30 clusters, 50-190 genes/cluster
• Clusters correlate well with known function

3. AlignACE motif finding
• 600-long upstream regions
• 50 regions/trial

Motifs in Periodic Clusters

Motifs in Non-periodic Clusters

Limits of Motif Finders

• Given upstream regions of coregulated genes:

– Increasing length makes motif finding harder – random motifs clutter the true ones

– Decreasing length makes motif finding harder – true motif missing in some sequences

Limits of Motif Finders

A (k,d)-motif is a k-long motif with d random differences per copy

Motif Challenge problem:

Find a (15,4) motif in N sequences of length L

CONSENSUS, MEME, AlignACE, & most other programs fail for N = 20, L = 1000
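A test instance of the challenge problem can be generated with a short sketch. It assumes a uniform random background, one planted copy per sequence, and exactly d substituted positions per copy (each changed to a different letter); the function name and defaults are illustrative.

```python
import random


def plant_motif(n=20, length=1000, k=15, d=4, seed=0):
    """Return (motif, sequences, start positions) for a planted (k, d)-motif."""
    rng = random.Random(seed)
    motif = "".join(rng.choice("ACGT") for _ in range(k))
    seqs, starts = [], []
    for _ in range(n):
        copy = list(motif)
        for pos in rng.sample(range(k), d):      # d distinct positions to mutate
            copy[pos] = rng.choice([c for c in "ACGT" if c != copy[pos]])
        background = [rng.choice("ACGT") for _ in range(length)]
        start = rng.randrange(length - k + 1)
        background[start:start + k] = copy       # overwrite with the mutated copy
        seqs.append("".join(background))
        starts.append(start)
    return motif, seqs, starts
```

Two planted copies can differ from each other in up to 2d = 8 of 15 positions, which is part of why pairwise or sample-driven comparisons struggle on this problem.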

Summary

Introduction
• Bio review: Where do ambiguities come from?
• Computational formulation of the problem

Combinatorial solutions
• Exhaustive search
• Greedy motif clustering
• Wordlets and motif refinement

Probabilistic solutions
• Expectation maximization
• Gibbs sampling