Setting Up a Replica Exchange Approach to Motif Discovery in DNA

Jeffrey Goett

Advisor: Professor Sengupta

Protein Synthesis from DNA

[Figure: schematic of gene expression. RNA polymerase transcribes a gene; binding proteins attached at binding sites regulate transcription; the transcript is then translated into proteins.]

Binding Sites

[Figure: two double-stranded sequences, each with a region coding for a protein and a binding site recognized by binding protein "A".]

Sequence A:   A-A-C-G-A-C        T-T-C-A-A-C-C-A
              T-T-G-C-T-G        A-A-G-T-T-G-G-T

Sequence B:   A-A-G-G-A-C        C-G-T-T-G-C-T-C
              T-T-C-C-T-G        G-C-A-A-C-G-A-G

Discovering New Binding Motifs

…ATCG GCTCAG CTAG…
…CACT GATCAG AGTA…
…TTCC GCTCTG TAAC…
…GCTA GCTCAA ATCG…

Motif: GCTCAG

Motif Probability Model (letter frequencies at each of the six motif positions):

      1     2     3     4     5     6
A     0    .25    0     0    .75   .25
T     0     0     1     0    .25    0
C     0    .75    0     1     0     0
G     1     0     0     0     0    .75
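As a rough illustration (not part of the original slides), the motif probability model above can be reproduced by counting each letter at every position of the aligned motif instances and dividing by the number of instances. The sketch below is Python; the function name and alphabet ordering are mine.

```python
from collections import Counter

ALPHABET = "ATCG"

def motif_probability_model(instances):
    """Column-wise letter frequencies for equal-length motif instances."""
    width = len(instances[0])
    columns = []
    for j in range(width):
        counts = Counter(seq[j] for seq in instances)
        columns.append({rho: counts[rho] / len(instances) for rho in ALPHABET})
    return columns

# The four instances embedded in the sequences above.
instances = ["GCTCAG", "GATCAG", "GCTCTG", "GCTCAA"]
for j, column in enumerate(motif_probability_model(instances), start=1):
    print(j, column)  # e.g. position 2 -> {'A': 0.25, 'T': 0.0, 'C': 0.75, 'G': 0.0}
```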

Modeling Motifs in Sequences

ATATCCGTA
AATCGAGAC
TCGATGTGT
CCACCTGCA

Assume:
The data is broken into N sequences.
Each sequence contains one instance of the motif, embedded in a random background.
The motif varies by point mutation only, not by insertion or deletion.
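A hypothetical generator matching these assumptions is sketched below in Python; the function name, default lengths, and mutation rate are illustrative and not taken from the project.

```python
import random

ALPHABET = "ATCG"

def plant_motif(motif, n_sequences=4, length=9, mutation_rate=0.1, rng=random):
    """Embed one (possibly point-mutated) copy of the motif in each random sequence."""
    sequences, positions = [], []
    for _ in range(n_sequences):
        background = [rng.choice(ALPHABET) for _ in range(length)]
        copy = [rng.choice(ALPHABET) if rng.random() < mutation_rate else c
                for c in motif]                       # point mutations only
        start = rng.randrange(length - len(motif) + 1)
        background[start:start + len(motif)] = copy   # no insertions or deletions
        sequences.append("".join(background))
        positions.append(start + 1)                   # 1-based, like the alignment below
    return sequences, positions

seqs, true_alignment = plant_motif("ATC")
```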

Modeling Motifs in Sequences

AT ATC CGTA
A ATC GAGAC
TCG ATG TGT
CC ACC TGCA

The "alignment": the starting position of the motif in each sequence,
$\vec{x} = \{x_1, x_2, x_3, \ldots, x_N\}$, e.g. $\vec{x} = \{3, 2, 4, 3\}$.

The "motif probability distribution" $p_{j,\rho}$: the probability of each letter $\rho$ occurring at each motif position $j$, e.g.

      1     2     3
A     1     0     0
T     0    .75    0
C     0    .25   .75
G     0     0    .25
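Both ingredients can be computed directly from the data; the Python sketch below (illustrative, not the project's code) recovers the $p_{j,\rho}$ table above from the example alignment.

```python
ALPHABET = "ATCG"

def motif_distribution(sequences, alignment, width):
    """Estimate p_{j,rho} from the motif windows selected by a 1-based alignment."""
    windows = [s[x - 1:x - 1 + width] for s, x in zip(sequences, alignment)]
    return [{rho: sum(w[j] == rho for w in windows) / len(windows)
             for rho in ALPHABET}
            for j in range(width)]

sequences = ["ATATCCGTA", "AATCGAGAC", "TCGATGTGT", "CCACCTGCA"]
p = motif_distribution(sequences, [3, 2, 4, 3], width=3)
# p[0]['A'] == 1.0, p[1]['T'] == 0.75, p[2]['C'] == 0.75, ...
```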

Scoring a Model

$$p(\vec{x}, p_{j,\rho} \mid S) = \frac{p(S \mid \vec{x}, p_{j,\rho})\; p(\vec{x})\; p(p_{j,\rho})}{p(S)}$$

"Log-likelihood" score:

$$p(\vec{x}, p_{j,\rho} \mid S) \;\longmapsto\; \log\!\left(\frac{p(S \mid \vec{x}, p_{j,\rho})\; p(p_{j,\rho})}{p(S \mid p_\rho^{0})}\right) + \log p(\vec{x}) + \log p(S) = \frac{1}{N} \sum_{j=1}^{w} \sum_{\rho \in \Sigma} n_{j,\rho} \log\!\left(\frac{\hat{p}_{j,\rho}}{p_\rho^{0}}\right) + \text{constant}$$

$p(S \mid \vec{x}, p_{j,\rho})$: the probability of the sequences given the model. Each position inside a motif window contributes its motif probability $p_{j,\rho}$, and every other position contributes its background probability $p_\rho^{0}$. For the four example sequences with motif width $w = 3$:

ATATCCGTA:  $p_A^0 \cdot p_{1,T}\, p_{2,A}\, p_{3,T} \cdot p_C^0\, p_C^0\, p_G^0\, p_T^0\, p_A^0$
AATCGAGAC:  $p_A^0\, p_A^0\, p_T^0\, p_C^0\, p_G^0 \cdot p_{1,A}\, p_{2,G}\, p_{3,A} \cdot p_C^0$
TCGATGTGT:  $p_T^0\, p_C^0\, p_G^0 \cdot p_{1,A}\, p_{2,T}\, p_{3,G} \cdot p_T^0\, p_G^0\, p_T^0$
CCACCTGCA:  $p_{1,C}\, p_{2,C}\, p_{3,A} \cdot p_C^0\, p_C^0\, p_T^0\, p_G^0\, p_C^0\, p_A^0$
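Under simple assumptions (a uniform background $p_\rho^{0} = 1/4$, the per-sequence 1/N normalization written above, and the additive constant dropped), the score can be sketched in Python as follows; only relative values are meaningful.

```python
import math

ALPHABET = "ATCG"

def log_likelihood_score(sequences, alignment, width, p0=0.25):
    """Score an alignment: (1/N) * sum_j sum_rho n_{j,rho} * log(p_hat / p0)."""
    N = len(sequences)
    windows = [s[x - 1:x - 1 + width] for s, x in zip(sequences, alignment)]
    score = 0.0
    for j in range(width):
        for rho in ALPHABET:
            n = sum(w[j] == rho for w in windows)   # n_{j,rho}
            if n:                                   # n = 0 terms contribute nothing
                score += n * math.log((n / N) / p0)
    return score / N
```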

Example Models

Model 1: alignment $\vec{x} = \{3, 2, 4, 3\}$

AT ATC CGTA
A ATC GAGAC
TCG ATG TGT
CC ACC TGCA

$p_{j,\rho}$:
      1     2     3
A     1     0     0
T     0    .75    0
C     0    .25   .75
G     0     0    .25

$L(S \mid \vec{x}, p_{j,\rho}, p^{0}) \approx 3$

Model 2: alignment $\vec{x} = \{2, 4, 7, 3\}$

A TAT CCGTA
AAT CGA GAC
TCGATG TGT
CC ACC TGCA

$p_{j,\rho}$:
      1     2     3
A    .25   .25   .25
T    .5     0    .5
C    .25   .25   .25
G     0    .5     0

$L(S \mid \vec{x}, p_{j,\rho}, p^{0}) \approx 1.1$
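For what it's worth, plugging these two models into the scoring sketch from the previous slide gives values consistent with the $L \approx 3$ and $L \approx 1.1$ quoted here, up to rounding and the exact normalization convention:

```python
sequences = ["ATATCCGTA", "AATCGAGAC", "TCGATGTGT", "CCACCTGCA"]
print(log_likelihood_score(sequences, [3, 2, 4, 3], width=3))  # roughly 3.0
print(log_likelihood_score(sequences, [2, 4, 7, 3], width=3))  # roughly 1.0
```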

The Gibbs Sampler

We want to find the $p_{j,\rho}$ that maximizes $p(p_{j,\rho} \mid S)$, where

$$p(p_{j,\rho} \mid S) = \int p(p_{j,\rho}, \vec{x} \mid S)\, d\vec{x}$$

The Gibbs sampler explores pairs $(p_{j,\rho}, \vec{x})$ using the score $L(p_{j,\rho}, \vec{x} \mid S)$.

The Gibbs Sampler

[Figure: the sampler alternates between the two sets of variables, repeatedly drawing a new motif distribution $p_{j,\rho}$ given the current alignment $\vec{x}$ and a new alignment $\vec{x}$ given the current $p_{j,\rho}$, so that the pair is sampled from $p(p_{j,\rho}, \vec{x} \mid S)$.]
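One way to realize this alternation is sketched below in Python, under assumptions: a uniform background, a small pseudocount, and resampling each $x_i$ against the model built from the full current alignment (the classic site sampler instead holds out the sequence being updated). It reuses motif_distribution() from the earlier sketch.

```python
import random

def window_weight(window, p, p0=0.25, pseudo=0.01):
    """Relative weight of one candidate motif window under the current model."""
    weight = 1.0
    for j, rho in enumerate(window):
        weight *= (p[j][rho] + pseudo) / p0   # pseudocount avoids zero weights
    return weight

def gibbs_sweep(sequences, alignment, width, rng=random):
    """One alternation: re-estimate p_{j,rho}, then resample every start x_i."""
    p = motif_distribution(sequences, alignment, width)
    new_alignment = []
    for s in sequences:
        starts = list(range(1, len(s) - width + 2))
        weights = [window_weight(s[x - 1:x - 1 + width], p) for x in starts]
        new_alignment.append(rng.choices(starts, weights=weights)[0])
    return new_alignment, p
```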

The Gibbs Sampler

[Figure: histogram of how many times each region of $p_{j,\rho}$ space has been visited.]

Over time, the frequency distribution of visited states approaches $p(p_{j,\rho} \mid S)$.

Optimization Technique

If we assume that the regions where $p(p_{j,\rho}, \vec{x} \mid S)$ is locally maximized contribute the most during the "integration" to the local maxima of $p(p_{j,\rho} \mid S)$, then biasing our search toward these regions may discover the maximizing $p_{j,\rho}$ values faster.

Multiple Gibbs Samplers

By combining results from Gibbs samplers started at random positions, we can find the $p_{j,\rho}$ that maximizes $p(p_{j,\rho} \mid S)$ sooner.
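A sketch of this multi-start strategy in Python, reusing gibbs_sweep() and log_likelihood_score() from the earlier sketches (the chain and sweep counts are arbitrary):

```python
import random

def multi_start_search(sequences, width, n_chains=10, n_sweeps=200, rng=random):
    """Run several independent samplers from random alignments; keep the best model."""
    best_score, best_alignment = float("-inf"), None
    for _ in range(n_chains):
        alignment = [rng.randrange(1, len(s) - width + 2) for s in sequences]
        for _ in range(n_sweeps):
            alignment, _ = gibbs_sweep(sequences, alignment, width, rng)
            score = log_likelihood_score(sequences, alignment, width)
            if score > best_score:
                best_score, best_alignment = score, alignment
    return best_alignment, best_score
```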

Replica Exchange/Parallel Tempering

"Low-sensitivity" samplers, which scout out broad areas of the space, periodically swap states with "high-sensitivity" samplers, which are good at focused searches, whenever a swap appears promising.
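The swap decision can be written with the standard parallel-tempering Metropolis criterion; the Python sketch below assumes the $e^{\beta L}$ sampling distribution introduced on the next slide and is not necessarily the project's exact rule.

```python
import math
import random

def maybe_swap(beta_i, beta_j, L_i, L_j, rng=random):
    """Accept an exchange of states between two replicas with Metropolis probability."""
    log_accept = (beta_i - beta_j) * (L_j - L_i)
    return log_accept >= 0 or rng.random() < math.exp(log_accept)

# Example: a focused (large-beta) replica and a broad (small-beta) replica
# trade alignments when the broad one has found a higher-scoring region.
if maybe_swap(2.0, 0.5, L_i=1.2, L_j=2.8):
    pass  # exchange the two replicas' alignments (and scores) here
```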

Controlling Sensitivity

Adjust the relative probability of sampling an $x_i$ by introducing a new parameter $\beta$ into the distribution:

$$\tilde{p}(x_i \mid p_{j,\rho}, S) = e^{\beta L(x_i, p_{j,\rho} \mid S)}$$

Small $\beta$: broad search of the space.  Large $\beta$: focused search of a region.
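A Python sketch of this tempered resampling step; here p is a motif distribution as in the earlier sketches, and the pseudocount and uniform background are assumptions rather than the project's choices.

```python
import math
import random

def sample_start(sequence, p, width, beta, rng=random):
    """Sample a motif start x_i with probability proportional to exp(beta * L(x_i))."""
    starts = list(range(1, len(sequence) - width + 2))
    log_L = [sum(math.log((p[j][sequence[x - 1 + j]] + 0.01) / 0.25)
                 for j in range(width))
             for x in starts]                                 # per-window log score
    peak = max(log_L)
    weights = [math.exp(beta * (L - peak)) for L in log_L]    # shift for numerical stability
    # beta -> 0 flattens the weights (broad search); large beta concentrates
    # them on the best-scoring windows (focused search).
    return rng.choices(starts, weights=weights)[0]
```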

Testing the Sensitivity

Running on randomly generated sequences to see which motifs are found: samplers with different sensitivities converge to different scores.

[Figure: score traces for samplers run at several β values.]

Predicting Convergence Score

Measure of similarity: magnetization

$$m = \frac{1}{N} \sum_{i=1}^{N} s_i$$

"Configuration score": energy

$$E = -\frac{1}{2} \sum_{i=1}^{N} \sum_{\substack{j=1 \\ j \ne i}}^{N} J\, s_i s_j$$

[Figure: example configurations of N = 4 spins, each weighted by $p \approx e^{-\beta E}$:
  m = 1    E = -6J   p ≈ e^{6βJ}
  m = .5   E = 0     p ≈ e^{-β·0}
  m = 0    E = 2J    p ≈ e^{-2βJ}   (three such configurations shown)]
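A quick Python check of these quantities for the N = 4 examples (taking J = 1):

```python
def magnetization(spins):
    """m = (1/N) * sum_i s_i"""
    return sum(spins) / len(spins)

def energy(spins, J=1.0):
    """E = -(1/2) * J * sum over i != j of s_i * s_j"""
    total = sum(si * sj for i, si in enumerate(spins)
                        for j, sj in enumerate(spins) if i != j)
    return -0.5 * J * total

print(magnetization([1, 1, 1, 1]), energy([1, 1, 1, 1]))      # m = 1,  E = -6J
print(magnetization([1, 1, 1, -1]), energy([1, 1, 1, -1]))    # m = .5, E = 0
print(magnetization([1, 1, -1, -1]), energy([1, 1, -1, -1]))  # m = 0,  E = 2J
```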

Alignment Analogue

[Figure: candidate alignments A, B, and C mapped onto the spin picture. A fully consistent alignment has m = 1, E = -9J, p ≈ e^{9βJ}; partially consistent alignments have m ≈ .77, E = -5J, p ≈ e^{5βJ}.]

Test Results

[Figure: results for the case L < |alphabet|^w.]

Test Results

[Figure: results for the case L > |alphabet|^w.]

Test Results

[Figures: further test results.]

Hidden Motifs: Gibbs Sampler

[Figure: six Gibbs sampler runs at β = .1, .5, .9, 1.3, 1.7, and 2, with motif width w = 5 and sequence length l = 500.]

Hidden Motifs: Replica Exchange

[Figure: replica exchange run with β values .9, .93, .961, .8, and 1.5.]
