Identification of Distinguishing Motifs

Zhanyong WANG (Master's Degree Student)

Dept. of Computer Science, City University of Hong Kong

E-mail: [email protected]

Joint work with WangSen FENG and Lusheng WANG

Outline

• The Definitions of Problems
• Applications
• Previous work
• Our work
• Algorithm for Single Group
• Algorithm for Two Groups
• Simulation Results for Single Group
• Simulation Results for Two Groups

Motif Identification

• Two versions

1. Single Group

2. Two Groups

Single Group

• Instance: a group of n sequences.

• Objective: find a length-L motif that appears in each of the given sequences such that the occurrences of the motif are similar to one another

Two Groups

• Instance: two groups of sequences:

B (Bad) and G (Good)

• Objective: find a length-L motif that appears in every sequence in group B and does not appear anywhere in the sequences of G

The occurrences of the motif may contain errors

Applications

1. Finding Targets for Potential Drugs

(T. Jiang, C. Trendall, S. Wang, T. Wareham, X. Zhang, 1998) (K. Lanctot, M. Li, B. Ma, S. Wang, and L. Zhang, 1999)

-- bad strings in B are from bacteria.

-- good strings in G are from humans.

-- find a substring s of length L that is conserved in all bad strings, but not conserved in good strings.

-- use s to screen chemicals.

-- the selected chemicals can then be tested as potential broad-range antibiotics.

Applications

2. Creating Diagnostic Probes for Bacterial Infection

(T. Brown, G.A. Leonard, E.D. Booth, G. Kneale, 1990)

-- a group of closely related pathogenic bacteria

-- find a substring that occurs in each of the bacterial sequences (with as few substitutions as possible) and does not occur in the human sequences

Applications

3. Locating binding sites and regulatory signals

4. Creating Universal PCR Primers

5. Creating Unbiased Consensus Sequences

6. Anti-sense Drug Design

Previous work

• The closest substring problem was proved to be NP-hard. So are the single-group and two-group problems

(K. Lanctot, M. Li, B. Ma, S. Wang, and L. Zhang 1999)

• Polynomial time approximation schemes

-- theoretical results

-- too slow to solve practical instances

Previous Programs

• Bailey and Elkan: MEME (1994) uses a modified EM algorithm and allows the motif to be absent in some of the given sequences

• Waterman: extended sample-driven approach (1984)

• Keich and Pevzner: two programs (2002)

• Buhler and Tompa: Projection (2002) combines EM and random projection

• Price, Ramabhadran and Pevzner: PatternBranching (2003) uses branching from sample strings and is faster than the previously best known program, Projection

Previous Programs (continued)

• Do not allow indels

• Only for the one group problem

• Some algorithms can handle one gap

Our work

• An extension of the EM approach

• A randomized algorithm for the single group problem which can handle indels

• We give an algorithm for the two groups problem

Representation of motifs

• Consensus pattern: choose the letter that appears most often in each of the L columns (Figure a)

• Profile: a 4×L matrix W (rows A, C, G, T); each cell W(i,j) is the occurrence rate of letter i in column j (Figure b)

• Use the profile representation in the early stage of the EM algorithm

• Use the consensus pattern representation to improve the accuracy

(a) Motif occurrences and consensus patterns:

caaccca
caacccc
catcccg
catccct
cacccca
-------
caaccca   consensus pattern
catccca   another consensus pattern

(b) Profile:

     1    2    3    4    5    6    7
A    0    1    0.4  0    0    0    0.4
C    1    0    0.2  1    1    1    0.2
G    0    0    0.0  0    0    0    0.2
T    0    0    0.4  0    0    0    0.2
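The two representations can be reproduced from the five occurrences in Figure (a). The following is a minimal sketch, not the authors' code; the function names `build_profile` and `consensus` are hypothetical, and ties between equally frequent letters are broken by alphabet order, which is an arbitrary choice.

```python
from collections import Counter

ALPHABET = "acgt"

def build_profile(occurrences):
    """Return {letter: [occurrence rate per column]} for equal-length strings."""
    length = len(occurrences[0])
    profile = {b: [0.0] * length for b in ALPHABET}
    for col in range(length):
        counts = Counter(s[col] for s in occurrences)
        for b in ALPHABET:
            profile[b][col] = counts[b] / len(occurrences)
    return profile

def consensus(profile):
    """Pick the most frequent letter per column (ties: first in ALPHABET)."""
    length = len(profile["a"])
    return "".join(
        max(ALPHABET, key=lambda b: profile[b][col]) for col in range(length)
    )

occurrences = ["caaccca", "caacccc", "catcccg", "catccct", "cacccca"]
profile = build_profile(occurrences)
print(consensus(profile))
```

Column 3 contains a and t equally often, which is why Figure (a) lists two valid consensus patterns; this sketch reports only the first one.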

Computing the single group problem

The EM (Expectation Maximization) Algorithm (Wang, L., Dong, L. and Fan, H. 2004)

Input:
– n sequences S1, S2, ..., Sn
– a 4×L matrix W (the initial guess of the motif)

Output:
– a new matrix W that is a locally maximal solution

Example of W for L = 3:

A 0.25 0.0 1.0
C 0.25 1.0 0.0
G 0.25 0.0 0.0
T 0.25 0.0 0.0

Step 1: An L-mer Sij is the length-L substring of sequence Si starting at position j. For each L-mer Sij, calculate the likelihood that Sij is an occurrence of the motif:

$$P(i,j) = \prod_{x=1}^{L} W(S_{ij}(x), x)$$

To avoid zero weights, a fixed small number (0.1) is added to W(i,j).

Step 2: Normalize the likelihood:

$$P'(i,j) = \frac{P(i,j)}{\sum_{x=1}^{m-L+1} P(i,x)}$$

so that $\sum_{j=1}^{m-L+1} P'(i,j) = 1$.

Example: Sij = c a a

W =
a 0.25 0 1
c 0.25 1 0
g 0.25 0 0
t 0.25 0 0

P(i,j) = 0.25 × 0.1 × 1 = 0.025
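Steps 1 and 2 can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the names `lmer_likelihood` and `normalized_likelihoods` are hypothetical, and, following the worked example (where the zero entry for `a` in column 2 contributes 0.1), a zero weight is counted as the small number 0.1 while nonzero weights are used unchanged.

```python
def lmer_likelihood(lmer, W, floor=0.1):
    """Step 1: product of the matrix weights along the L-mer."""
    p = 1.0
    for x, letter in enumerate(lmer):
        w = W[letter][x]
        # Assumption taken from the slide example: a zero weight counts as `floor`.
        p *= w if w > 0 else floor
    return p

def normalized_likelihoods(seq, W, L, floor=0.1):
    """Step 2: likelihoods of all L-mers of one sequence, scaled to sum to 1."""
    raw = [lmer_likelihood(seq[j:j + L], W, floor)
           for j in range(len(seq) - L + 1)]
    total = sum(raw)
    return [p / total for p in raw]

# The 4 x 3 example matrix from the slide.
W = {
    "a": [0.25, 0.0, 1.0],
    "c": [0.25, 1.0, 0.0],
    "g": [0.25, 0.0, 0.0],
    "t": [0.25, 0.0, 0.0],
}
print(lmer_likelihood("caa", W))           # 0.25 * 0.1 * 1, as in the example
print(normalized_likelihoods("gcaat", W, 3))
```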

Step 3: Re-estimate the motif matrix W:

$$W = \sum_{i=1}^{n} \sum_{j=1}^{m-L+1} W_{ij}$$

where Wij is constructed from Sij.

Example: Sij = c a a

W =
a 0.25 0 1
c 0.25 1 0
g 0.25 0 0
t 0.25 0 0

P(i,j) = 0.25 × 0.1 × 1 = 0.025

        Sij(1)  Sij(2)  Sij(3)
Sij =   c       a       a

Wij =
a   0      0.025  0.025
c   0.025  0      0
g   0      0      0
t   0      0      0

Step 4: Normalize W:

$$W'(b,x) = \frac{W(b,x)}{\sum_{b' \in \{A,C,G,T\}} W(b',x)}$$

Replace W with W'.

Step 5

Steps 1 to 4 are called a cycle. If W changes very little from the last cycle, then EM has converged and the algorithm ends. Otherwise, go to Step 1 and start the next cycle.

Determine the amount of change:

$$\max_{b,x} |W_q(b,x) - W_{q-1}(b,x)| < \varepsilon$$

where W_q is the matrix after cycle q. Set ε = 0.05 so that the algorithm stops within a few cycles.
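Putting Steps 1 to 5 together, one possible rendering of the cycle is sketched below. It is not the authors' program: `em_cycle` and `run_em` are hypothetical names, the normalized likelihood P'(i,j) is assumed to be the weight each L-mer adds to W in Step 3, zero weights are counted as 0.1 as in the worked example, and `max_cycles` is an added safety cap.

```python
ALPHABET = "acgt"

def em_cycle(seqs, W, L, floor=0.1):
    """One cycle (Steps 1-4): returns the re-estimated, normalized matrix."""
    new_W = {b: [0.0] * L for b in ALPHABET}
    for seq in seqs:
        # Step 1: likelihood of every L-mer (zero weights count as `floor`).
        raw = []
        for j in range(len(seq) - L + 1):
            p = 1.0
            for x in range(L):
                w = W[seq[j + x]][x]
                p *= w if w > 0 else floor
            raw.append(p)
        total = sum(raw)
        # Steps 2-3: each L-mer adds its normalized likelihood to the cells
        # it touches (assumption: P' rather than P is used as the weight).
        for j, p in enumerate(raw):
            for x in range(L):
                new_W[seq[j + x]][x] += p / total
    # Step 4: make every column sum to 1.
    for x in range(L):
        col = sum(new_W[b][x] for b in ALPHABET)
        for b in ALPHABET:
            new_W[b][x] /= col
    return new_W

def run_em(seqs, W, L, eps=0.05, max_cycles=100):
    """Step 5: repeat cycles until the largest cell change drops below eps."""
    for _ in range(max_cycles):
        new_W = em_cycle(seqs, W, L)
        change = max(abs(new_W[b][x] - W[b][x])
                     for b in ALPHABET for x in range(L))
        W = new_W
        if change < eps:
            break
    return W

W0 = {"a": [0.25, 0.0, 1.0], "c": [0.25, 1.0, 0.0],
      "g": [0.25, 0.0, 0.0], "t": [0.25, 0.0, 0.0]}
W = run_em(["ggcaatg", "tcaagtt", "acaaccg"], W0, 3)
```

Starting from the slide's example matrix, the three toy sequences all contain an L-mer of the form "?ca", so column 2 of the result stays dominated by c.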

Our Algorithm for Single Group (with indels)

The general framework is the same as that of the previous algorithm:

1. Get an initial guess of the motif W

2. With W as the initial value, use the new EM algorithm to update W

3. Repeat steps 1–2 several (Maxtrials) times and choose the best result.

Incorporating Indels

• We add the “space” as a letter, so the matrix for the EM algorithm becomes 5×L

• K: the maximum total number of indels

• For each starting position, consider all length L+h substrings, h = 0, 1, -1, …, K, -K, where h is the net change in length caused by the indels.

• For each length L+h substring, align it with the matrix

Align a length L+h string with a 5×L matrix

• Dynamic programming

• Similar to pairwise string alignment

• d[i, j] is the score of aligning the first i columns in the matrix with the first j letters in the string

$$d[i,j] = \max\{\, d[i-1,j-1] \times W[x,i],\;\; d[i-1,j] \times W[\triangle,i],\;\; d[i,j-1] \times e \,\}$$

where x is the j-th letter of the string and △ denotes the space.

Bottom-up order; the final score is d[L, L+h]

Best alignment (with indel)
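The recurrence can be sketched as a small table-filling routine. Several details here are assumptions and not stated on the slide: the space is stored under the key `"-"`, `e` is treated as a fixed weight (0.05) for a string letter that matches no column, the boundary cells d[0][j] and d[i][0] are filled with runs of `e` and space weights, and `toy_matrix` is a hypothetical helper that builds a 5×L matrix for testing.

```python
SPACE = "-"  # stands in for the space symbol of the 5 x L matrix

def align_score(W, s, e=0.05):
    """Best score of aligning string s with all L columns of W."""
    L = len(W[SPACE])
    n = len(s)
    d = [[0.0] * (n + 1) for _ in range(L + 1)]
    d[0][0] = 1.0
    for j in range(1, n + 1):            # leading letters matched to no column
        d[0][j] = d[0][j - 1] * e
    for i in range(1, L + 1):
        d[i][0] = d[i - 1][0] * W[SPACE][i - 1]   # leading columns get a space
        for j in range(1, n + 1):
            x = s[j - 1]
            d[i][j] = max(
                d[i - 1][j - 1] * W[x][i - 1],    # letter x aligned to column i
                d[i - 1][j] * W[SPACE][i - 1],    # column i aligned to a space
                d[i][j - 1] * e,                  # letter x matched to no column
            )
    return d[L][n]

def toy_matrix(consensus):
    """Hypothetical 5 x L matrix: 0.6 for the consensus letter, 0.1 otherwise."""
    W = {b: [] for b in "acgt" + SPACE}
    for letter in consensus:
        for b in W:
            W[b].append(0.6 if b == letter else 0.1)
    return W

W = toy_matrix("cat")
print(align_score(W, "cat"))    # no indel
print(align_score(W, "ct"))     # one deletion: column 2 takes a space
print(align_score(W, "cgat"))   # one insertion: g matches no column
```

Scores are multiplied as on the slide; for long motifs a log-space version would avoid underflow, at the cost of departing from the slide's notation.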

Continued

After calculating the motif W (profile representation: a matrix), we use W to find the occurrence of the motif in each sequence

Find the motif occurrences

• Find the occurrence of the motif in each string by scoring each L-mer:

$$\sum_{i=1}^{L} W(a_i, i)$$

where a1a2a3…aL is a length-L substring (L-mer) and W is the matrix for the motif
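A minimal sketch of this scan is given below. It assumes that the occurrence reported for a string is the L-mer with the highest score, which the slide does not state explicitly; `best_occurrence` is a hypothetical name.

```python
def best_occurrence(seq, W, L):
    """Return (start, score) of the L-mer with the largest sum of weights."""
    best_j, best_score = 0, float("-inf")
    for j in range(len(seq) - L + 1):
        score = sum(W[seq[j + i]][i] for i in range(L))
        if score > best_score:      # keeps the leftmost L-mer on ties
            best_j, best_score = j, score
    return best_j, best_score

# The 4 x 3 example matrix used earlier in the slides.
W = {
    "a": [0.25, 0.0, 1.0],
    "c": [0.25, 1.0, 0.0],
    "g": [0.25, 0.0, 0.0],
    "t": [0.25, 0.0, 0.0],
}
start, score = best_occurrence("ggcaatg", W, 3)
print(start, score)   # the L-mer "gca" starts at index 1 (0-based)
```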

Algorithm for the Two Groups (no indels)

• We follow the basic steps of EM method

• Modify the formula to re-construct W

• Re-estimate the matrix W from both group B and G

Main idea

When the motif represented by the matrix W is too close to some L-mer Sij from group G (p(i,j) > ave), we scoop that pattern out of the matrix by subtracting the corresponding matrix Wij.
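One way to read this idea is sketched below. It is only an illustration: the slides do not define how `ave` is computed, so it is passed in as a parameter; the raw likelihood p(i,j) is used as the weight of Wij, as in the Step 3 example; and clamping negative cells to zero before normalizing is an added assumption. The name `reestimate_two_groups` is hypothetical.

```python
ALPHABET = "acgt"

def lmer_likelihood(lmer, W, floor=0.1):
    """Product of weights along the L-mer; zero weights count as `floor`."""
    p = 1.0
    for x, letter in enumerate(lmer):
        w = W[letter][x]
        p *= w if w > 0 else floor
    return p

def reestimate_two_groups(B, G, W, L, ave, floor=0.1):
    """Add Wij for L-mers of B; subtract Wij for L-mers of G with p > ave."""
    new_W = {b: [0.0] * L for b in ALPHABET}
    for group, sign in ((B, +1.0), (G, -1.0)):
        for seq in group:
            for j in range(len(seq) - L + 1):
                lmer = seq[j:j + L]
                p = lmer_likelihood(lmer, W, floor)
                if sign < 0 and p <= ave:
                    continue            # this G L-mer is not close to the motif
                for x, letter in enumerate(lmer):
                    new_W[letter][x] += sign * p
    for x in range(L):
        for b in ALPHABET:              # assumption: negative cells become 0
            new_W[b][x] = max(new_W[b][x], 0.0)
        col = sum(new_W[b][x] for b in ALPHABET)
        for b in ALPHABET:
            new_W[b][x] = new_W[b][x] / col if col > 0 else 0.25
    return new_W

W0 = {"a": [0.25, 0.0, 1.0], "c": [0.25, 1.0, 0.0],
      "g": [0.25, 0.0, 0.0], "t": [0.25, 0.0, 0.0]}
B = ["ggcaatg", "tcaagtt"]
G = ["gacaagg"]
W = reestimate_two_groups(B, G, W0, 3, ave=0.1)
```

In this toy input the G sequence contains "aca", which scores above `ave`, so the weight of a in column 1 is removed while g and t (from "gca" and "tca" in B) keep theirs.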

Experiment Results (Single Group)

• Input:

(1) Randomly generate sequences: n = 20, m = 600

(2) Insert the motif into the sequences:
– Choose a center string s (length L)
– Mutate d positions (insertion, deletion, mutation)
– Implant the mutated copy into the sequences

• Output: use our program to find the implanted pattern.
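The input generation can be sketched as follows. The values n = 20 and m = 600 come from the slide; everything else is an assumption for illustration, including the defaults L = 15 and d = 2, the uniform choice among the three edit types, overwriting the background so that each sequence keeps length m, and the names `mutate` and `make_instance`.

```python
import random

ALPHABET = "acgt"

def random_seq(length, rng):
    return "".join(rng.choice(ALPHABET) for _ in range(length))

def mutate(center, d, rng):
    """Apply d random edits: substitution, insertion or deletion."""
    s = list(center)
    for _ in range(d):
        op = rng.choice(("sub", "ins", "del"))
        if op == "ins":
            s.insert(rng.randrange(len(s) + 1), rng.choice(ALPHABET))
        elif op == "del" and len(s) > 1:
            del s[rng.randrange(len(s))]
        else:
            pos = rng.randrange(len(s))
            s[pos] = rng.choice([b for b in ALPHABET if b != s[pos]])
    return "".join(s)

def make_instance(n=20, m=600, L=15, d=2, seed=0):
    """Return (center, sequences, implant positions) for one test instance."""
    rng = random.Random(seed)
    center = random_seq(L, rng)
    seqs, positions = [], []
    for _ in range(n):
        background = random_seq(m, rng)
        copy = mutate(center, d, rng)
        pos = rng.randrange(m - len(copy) + 1)
        # Overwrite part of the background so the sequence keeps length m.
        seqs.append(background[:pos] + copy + background[pos + len(copy):])
        positions.append(pos)
    return center, seqs, positions

center, seqs, positions = make_instance()
```

Keeping the implant positions makes it possible to check afterwards whether a motif finder recovered the implanted pattern.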


Table 1: 15 sequences: no indel; 5 sequences: one deletion

Table 2: 10 sequences: no indel; 5 sequences: one deletion; 5 sequences: one insertion

In Table 2, the running time increases significantly and the accuracy in many cases is slightly worse than in Table 1.


• Table 3: 5 sequences: one deletion; 5 sequences: two deletions; 10 sequences: no indel

• Table 4: 5 sequences: one insertion; 5 sequences: two insertions; 10 sequences: no indel

The results in Table 4 are slightly better than those in Table 3. The reason might be that the case in Table 4 needs to insert two columns in the matrix for the motif, whereas the case in Table 3 needs to insert two spaces in the motif sequences.


• Table 5, the mixed case. Probability of each kind of implanted copy:

one insertion: 1/8
one deletion: 1/8
two insertions: 1/8
two deletions: 1/8
one insertion and one deletion: 1/8
no indel: 3/8

Experiment Results (Two Groups)

• Centers (m = 600):

c1: the center for group B, a random sequence

c2: the center for group G, obtained by randomly mutating 200 positions of c1

• Generate the two groups:

n = 10

Randomly mutate 200 positions of the center
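The two-group test data could be generated along these lines. This is a sketch under assumptions: n = 10 is read as the size of each group, every mutated position is forced to change to a different letter (so c1 and c2 differ in exactly `k` positions here, whereas the tables report other average distances, which suggests the original scheme differed), and the names `mutate_positions` and `make_two_groups` are hypothetical.

```python
import random

ALPHABET = "acgt"

def mutate_positions(center, k, rng):
    """Change k distinct, randomly chosen positions to a different letter."""
    s = list(center)
    for pos in rng.sample(range(len(s)), k):
        s[pos] = rng.choice([b for b in ALPHABET if b != s[pos]])
    return "".join(s)

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def make_two_groups(m=600, n=10, k=200, seed=0):
    """Return (c1, c2, B, G): two centers and the groups derived from them."""
    rng = random.Random(seed)
    c1 = "".join(rng.choice(ALPHABET) for _ in range(m))
    c2 = mutate_positions(c1, k, rng)
    B = [mutate_positions(c1, k, rng) for _ in range(n)]
    G = [mutate_positions(c2, k, rng) for _ in range(n)]
    return c1, c2, B, G

c1, c2, B, G = make_two_groups()
print(hamming(c1, c2))
```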

Experiment Results (Two Groups)

From Table 6, we can see that it is easier to find a motif that distinguishes the two groups when L is large.

Comparing Table 7 with Table 6, we can see that it is easier to find a distinguishing motif when the distance between the two centers is large.

Table 7 shows the results when the average Hamming distance between c1 and c2 is about 175.

Table 6 shows the results when the average Hamming distance between c1 and c2 is about 128.

Summary

• An algorithm for the single group problem that can handle indels

• An algorithm for the two groups problem
