identification of distinguishing motifs

Identification of Distinguishing Motifs. Zhanyong WANG (Master Degree Student), Dept. of Computer Science, City University of Hong Kong. E-mail: [email protected] Joint work with WangSen FENG and Lusheng WANG.


Post on 08-Jan-2016

Page 1: Identification of Distinguishing Motifs

Identification of Distinguishing Motifs

Zhanyong WANG (Master Degree Student)

Dept. of Computer Science, City University of Hong Kong
E-mail: [email protected]

Joint work with WangSen FENG and Lusheng WANG

Page 2: Identification of Distinguishing Motifs

Outline

• The Definitions of Problems
• Applications
• Previous work
• Our work
• Algorithm for Single Group
• Algorithm for Two Groups
• Simulation Results for Single Group
• Simulation Results for Two Groups

Page 3: Identification of Distinguishing Motifs

Motif Identification

• Two versions

1. Single Group

2. Two Groups

Page 4: Identification of Distinguishing Motifs

Single Group

• Instance: a group of n sequences.

• Objective: find a length-L motif that occurs in every given sequence, such that the occurrences are similar to one another.

Page 5: Identification of Distinguishing Motifs

Two Groups

• Instance: two groups of sequences:

B (Bad) and G (Good)

• Objective: find a motif of length L that appears in every sequence in group B and does not appear anywhere in the sequences of group G.

The occurrences of the motif may contain errors.

Page 6: Identification of Distinguishing Motifs

Applications

1. Finding Targets for Potential Drugs

(T. Jiang, C. Trendall, S. Wang, T. Wareham, X. Zhang, 1998) (K. Lanctot, M. Li, B. Ma, S. Wang, and L. Zhang, 1999)

-- bad strings in B are from bacteria
-- good strings in G are from humans

-- find a substring s of length L that is conserved in all bad strings, but not conserved in good strings

-- use s to screen chemicals
-- the selected chemicals can then be tested as potential broad-range antibiotics

Page 7: Identification of Distinguishing Motifs

Applications

2. Creating Diagnostic Probes for Bacterial Infection

(T. Brown, G.A. Leonard, E.D. Booth, G. Kneale, 1990)

-- a group of closely related pathogenic bacteria

-- find a substring that occurs in each of the bacterial sequences (with as few substitutions as possible) and does not occur in the human sequences

Page 8: Identification of Distinguishing Motifs

Applications

3. Locating binding sites and regulatory signals

4. Creating Universal PCR Primers

5. Creating Unbiased Consensus Sequences

6. Anti-sense Drug Design

Page 9: Identification of Distinguishing Motifs

Previous work

• The closest substring problem was proved to be NP-hard. So are the single group and two groups problems

(K. Lanctot, M. Li, B. Ma, S. Wang, and L. Zhang 1999)

• Polynomial time approximation schemes
-- theoretical results
-- too slow to solve practical instances

Page 10: Identification of Distinguishing Motifs

Previous Programs

• Bailey and Elkan: MEME (1994) uses a modified EM algorithm and allows the motif to be absent in some of the given sequences
• Waterman: extended sample-driven approach (1984)
• Keich and Pevzner: two programs (2002)
• Buhler and Tompa: Projection (2002) combines EM and random projection
• Price, Ramabhadran and Pevzner: PatternBranching (2003) uses branching from sample strings and is faster than the previously best known program, Projection

Page 11: Identification of Distinguishing Motifs

Previous Programs (continued)

• Do not allow indels

• Only for the one group problem

• Some algorithms can handle one gap

Page 12: Identification of Distinguishing Motifs

Our work

• An extension of the EM approach

• A randomized algorithm for the single group problem which can handle indels

• We give an algorithm for the two groups problem

Page 13: Identification of Distinguishing Motifs

Representation of motifs

• Consensus pattern: choose the letter that appears most often in each of the L columns (Figure a)
• Profile: a 4×L matrix W (rows A, C, G, T); each cell W(i,j) is a number indicating the occurrence rate of letter i in column j (Figure b)
• Use the profile representation in the early stage of the EM algorithm
• Use the consensus pattern representation to improve the accuracy

(a) Occurrences of the motif:

caaccca
caacccc
catcccg
catccct
cacccca
--------------------
consensus pattern: caaccca
another consensus pattern: catccca

(b) Profile:

     1   2   3    4   5   6   7
A    0   1   0.4  0   0   0   0.4
C    1   0   0.2  1   1   1   0.2
G    0   0   0.0  0   0   0   0.2
T    0   0   0.4  0   0   0   0.2
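The two representations can be computed directly from the five occurrences in Figure (a). The following is a minimal sketch, not the authors' code; the function names `profile` and `consensus` and the tie-breaking rule (first letter in a, c, g, t order) are assumptions made for illustration.

```python
ALPHABET = "acgt"

def profile(occurrences):
    """Return {letter: [rate in column 1, ..., rate in column L]}."""
    n = len(occurrences)
    length = len(occurrences[0])
    counts = {b: [0] * length for b in ALPHABET}
    for s in occurrences:
        for x, b in enumerate(s):
            counts[b][x] += 1
    return {b: [c / n for c in counts[b]] for b in ALPHABET}

def consensus(occurrences):
    """Most frequent letter per column; ties go to the first letter in ALPHABET (assumption)."""
    W = profile(occurrences)
    length = len(occurrences[0])
    return "".join(max(ALPHABET, key=lambda b: W[b][x]) for x in range(length))

occ = ["caaccca", "caacccc", "catcccg", "catccct", "cacccca"]
W = profile(occ)
pattern = consensus(occ)
```

Column 3 has a tie between a and t, which is why the slide lists both caaccca and catccca as consensus patterns; this sketch returns the first one.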

Page 14: Identification of Distinguishing Motifs

Computing the single group problem

The EM (Expectation Maximization) Algorithm (Wang, L., Dong, L. and Fan, H. 2004)

Input:
– n sequences S1, S2, ..., Sn
– a 4×L matrix W (the initial guess of the motif)

Output:
– a new matrix W that is a locally maximal solution

Example of W (L = 3):

A 0.25 0.0 1.0
C 0.25 1.0 0.0
G 0.25 0.0 0.0
T 0.25 0.0 0.0

Page 15: Identification of Distinguishing Motifs

Step 1: An L-mer Sij is a length-L substring, the one starting at position j of sequence Si. For each L-mer Sij, calculate the likelihood that Sij is the occurrence of the motif:

\[ P(i,j) = \prod_{x=1}^{L} W(S_{ij}(x), x) \]

To avoid zero weights, a fixed small number (0.1) is added to W(i,j).

Step 2: Normalize the likelihood:

\[ P'(i,j) = \frac{P(i,j)}{\sum_{x=1}^{m-L+1} P(i,x)} \quad \text{so that} \quad \sum_{j=1}^{m-L+1} P'(i,j) = 1 \]

Example: Sij = c a a

W =
a 0.25 0 1
c 0.25 1 0
g 0.25 0 0
t 0.25 0 0

P(i,j) = 0.25 × 0.1 × 1 = 0.025
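Steps 1 and 2 can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: following the worked example (0.25 × 0.1 × 1), the small number 0.1 is used only in place of zero entries, and the names `likelihood` and `normalized_likelihoods` are hypothetical.

```python
PSEUDO = 0.1  # the fixed small number from the slide

def likelihood(lmer, W):
    """P(i,j): product over columns x of W(lmer[x], x), with zeros replaced by PSEUDO."""
    p = 1.0
    for x, b in enumerate(lmer):
        w = W[b][x]
        p *= w if w > 0 else PSEUDO  # assumption: only zero cells are adjusted
    return p

def normalized_likelihoods(seq, W, L):
    """P'(i,j) for every start j in one sequence; the values sum to 1."""
    raw = [likelihood(seq[j:j + L], W) for j in range(len(seq) - L + 1)]
    total = sum(raw)
    return [p / total for p in raw]

W = {"a": [0.25, 0.0, 1.0],
     "c": [0.25, 1.0, 0.0],
     "g": [0.25, 0.0, 0.0],
     "t": [0.25, 0.0, 0.0]}
p_caa = likelihood("caa", W)
probs = normalized_likelihoods("gcaat", W, 3)  # windows: gca, caa, aat
```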

Page 16: Identification of Distinguishing Motifs

Step 3: Re-estimate the motif matrix W.

\[ W = \sum_{i=1}^{n} \sum_{j=1}^{m-L+1} W_{ij} \]

where Wij is constructed from Sij: in column x, the row of letter Sij(x) holds P(i,j) and all other rows hold 0.

Example: Sij = c a a

W =
a 0.25 0 1
c 0.25 1 0
g 0.25 0 0
t 0.25 0 0

P(i,j) = 0.25 × 0.1 × 1 = 0.025

      Sij(1)  Sij(2)  Sij(3)
Sij =   c       a       a

Wij =
a     0       0.025   0.025
c     0.025   0       0
g     0       0       0
t     0       0       0
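The construction of Wij and the summation in Step 3 can be sketched as below. The helper names `lmer_matrix` and `reestimate` are hypothetical, and the weights are passed in as plain numbers so the block stands alone.

```python
ALPHABET = "acgt"

def lmer_matrix(lmer, weight):
    """Wij: column x holds `weight` in the row of lmer[x] and 0 elsewhere."""
    Wij = {b: [0.0] * len(lmer) for b in ALPHABET}
    for x, b in enumerate(lmer):
        Wij[b][x] = weight
    return Wij

def reestimate(weighted_lmers, L):
    """Sum the Wij matrices of all (lmer, weight) pairs into a new, unnormalized W."""
    W = {b: [0.0] * L for b in ALPHABET}
    for lmer, weight in weighted_lmers:
        Wij = lmer_matrix(lmer, weight)
        for b in ALPHABET:
            for x in range(L):
                W[b][x] += Wij[b][x]
    return W

Wij = lmer_matrix("caa", 0.025)
W = reestimate([("caa", 0.025), ("cat", 0.5)], 3)
```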

Page 17: Identification of Distinguishing Motifs

Step 4

Normalize W

\[ W'(b,x) = \frac{W(b,x)}{\sum_{b' \in \{A,C,G,T\}} W(b',x)} \]

Replace W with W'
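A small sketch of the column normalization in Step 4. The function name `normalize_columns` is hypothetical, and falling back to a uniform column when a column sums to zero is an assumption; the slides do not say what happens in that case.

```python
def normalize_columns(W):
    """Divide every entry by its column sum so that each column of W sums to 1."""
    L = len(next(iter(W.values())))
    out = {b: [0.0] * L for b in W}
    for x in range(L):
        total = sum(W[b][x] for b in W)
        for b in W:
            # assumption: an all-zero column becomes uniform
            out[b][x] = W[b][x] / total if total > 0 else 1.0 / len(W)
    return out

W = {"a": [1.0, 0.0], "c": [1.0, 2.0], "g": [1.0, 0.0], "t": [1.0, 2.0]}
W_norm = normalize_columns(W)
```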

Page 18: Identification of Distinguishing Motifs

Step 5

Steps 1 to 4 are called a cycle. If W changes very little from the last cycle, then EM has converged and the algorithm ends. Otherwise, go to Step 1 and start the next cycle.

Determine the amount of change:

\[ \max_{b,x} |W_q(b,x) - W_{q-1}(b,x)| < \epsilon \]

where Wq is the matrix after cycle q. Set ε = 0.05 so that the algorithm stops within a few cycles.
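The stopping rule can be sketched independently of how a cycle is computed. In the sketch below, `cycle` is any function that maps W to the next W (Steps 1 to 4); the names `max_change` and `run_until_converged` and the `max_cycles` safety cap are assumptions, not part of the slides.

```python
def max_change(W_new, W_old):
    """Largest absolute difference between corresponding entries of two matrices."""
    return max(abs(W_new[b][x] - W_old[b][x])
               for b in W_new for x in range(len(W_new[b])))

def run_until_converged(W, cycle, eps=0.05, max_cycles=100):
    """Apply `cycle` until W changes by less than eps; return (W, cycles used)."""
    for q in range(1, max_cycles + 1):
        W_next = cycle(W)
        if max_change(W_next, W) < eps:
            return W_next, q
        W = W_next
    return W, max_cycles  # cap reached without meeting the threshold

# Toy cycle for demonstration only: move every entry halfway towards a fixed target.
target = {"a": [0.0], "c": [1.0]}
toy_cycle = lambda W: {b: [(v + t) / 2 for v, t in zip(W[b], target[b])] for b in W}
W_final, cycles = run_until_converged({"a": [1.0], "c": [0.0]}, toy_cycle)
```

With this toy cycle the change halves each time (0.5, 0.25, 0.125, 0.0625, 0.03125), so the loop stops at the fifth cycle.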

Page 19: Identification of Distinguishing Motifs

Our Algorithm For Single Group (with indels)

The general framework is the same as in the previous algorithm:

1. We get an initial guess of the motif W

2. With W as initial value, use the new EM algorithm to update W

3. Repeat 1–2 several (Maxtrials) times and choose the best result.

Page 20: Identification of Distinguishing Motifs

Incorporating Indels

• We add the “space” as a letter, so the matrix for the EM algorithm becomes 5×L

• K: the maximum total number of indels

• For each starting position, consider all substrings of length L+h, for h = 0, 1, -1, …, K, -K, where h is the change in length caused by the indels.

• For each length L+h substring, align it with the matrix

Page 21: Identification of Distinguishing Motifs

Align a length L+h string with a 5×L matrix

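• Dynamic programming
• Similar to pairwise string alignment
• d[i, j] is the score of aligning the first i columns in the matrix with the first j letters in the string

\[ d[i,j] = \max \{\, d[i-1,j-1] \times W[x,i],\;\; d[i-1,j] \times W[\triangle,i],\;\; d[i,j-1] \times e \,\} \]

where x is the j-th letter of the string, △ is the space letter, and e is the weight for a letter of the string that is aligned with no column.

Bottom-up order; d[L, L+h] is the score of the best alignment (with indels).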
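The recurrence can be sketched as a standard bottom-up table fill. This is an illustration only: the boundary values (d[0][0] = 1, the first row built from e, the first column built from the space row), the use of "-" for the space letter, and the example matrix and value of e are assumptions not given on the slides.

```python
GAP = "-"  # stands for the space letter

def align_score(s, W, e):
    """Score of the best alignment of string s with a 5 x L matrix W (rows a, c, g, t, GAP)."""
    L = len(W[GAP])
    n = len(s)
    d = [[0.0] * (n + 1) for _ in range(L + 1)]
    d[0][0] = 1.0
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] * e                 # leading letters matched to no column
    for i in range(1, L + 1):
        d[i][0] = d[i - 1][0] * W[GAP][i - 1]     # leading columns matched to spaces
        for j in range(1, n + 1):
            d[i][j] = max(
                d[i - 1][j - 1] * W[s[j - 1]][i - 1],  # letter j aligned with column i
                d[i - 1][j] * W[GAP][i - 1],           # column i aligned with a space
                d[i][j - 1] * e,                       # letter j aligned with no column
            )
    return d[L][n]

# Hypothetical 5 x 3 matrix whose motif is close to "cat".
W = {"a": [0.10, 0.80, 0.10],
     "c": [0.70, 0.05, 0.10],
     "g": [0.10, 0.05, 0.10],
     "t": [0.05, 0.05, 0.60],
     GAP: [0.05, 0.05, 0.10]}
exact = align_score("cat", W, 0.01)      # h = 0
deletion = align_score("ct", W, 0.01)    # h = -1
insertion = align_score("cagt", W, 0.01) # h = +1
```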

Page 22: Identification of Distinguishing Motifs

Continued

After calculating the motif W (profile representation: a matrix), we use the matrix W to find the occurrence of the motif in each sequence.

Page 23: Identification of Distinguishing Motifs

Find the motif occurrences

• Find the occurrence of the motif in each string using the score

\[ \sum_{i=1}^{L} W(a_i, i) \]

where a1a2a3…aL is a length-L substring (L-mer) and W is the matrix for the motif.
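A minimal sketch of the scan, assuming that the occurrence reported for a string is the L-mer with the highest score (the slide gives the score but does not state the selection rule). The function name `best_occurrence` is hypothetical.

```python
def best_occurrence(seq, W, L):
    """Return (start, lmer, score) of the L-mer maximizing the sum of W(a_i, i)."""
    best = (0, seq[:L], float("-inf"))
    for j in range(len(seq) - L + 1):
        score = sum(W[seq[j + x]][x] for x in range(L))
        if score > best[2]:
            best = (j, seq[j:j + L], score)
    return best

W = {"a": [0.25, 0.0, 1.0],
     "c": [0.25, 1.0, 0.0],
     "g": [0.25, 0.0, 0.0],
     "t": [0.25, 0.0, 0.0]}
start, lmer, score = best_occurrence("ggtcagg", W, 3)
```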

Page 24: Identification of Distinguishing Motifs

Algorithm for the two Groups (no indels)

• We follow the basic steps of EM method

• Modify the formula to re-construct W

• Re-estimate the matrix W from both group B and G

Page 25: Identification of Distinguishing Motifs

Main idea

When the motif represented by the matrix W is too close to some L-mer Sij from group G (P(i,j) > ave), we remove that pattern from the matrix by subtracting the corresponding matrix Wij.
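The subtraction can be sketched as follows. This is only an illustration of the idea: the slides do not define how `ave` is computed, so it is passed in as a parameter, and clamping entries at zero after subtraction is an assumption. The name `subtract_good_lmers` is hypothetical.

```python
def subtract_good_lmers(W, good_hits, ave):
    """Subtract Wij for every group-G L-mer whose likelihood p exceeds ave.

    W         : unnormalized matrix accumulated from group B
    good_hits : list of (lmer, p) pairs taken from group G
    ave       : threshold (its definition is not given on the slides)
    """
    out = {b: list(row) for b, row in W.items()}
    for lmer, p in good_hits:
        if p > ave:
            for x, b in enumerate(lmer):
                out[b][x] = max(0.0, out[b][x] - p)  # assumption: never go below zero
    return out

W = {"a": [0.0, 0.5, 0.5], "c": [0.5, 0.0, 0.0],
     "g": [0.0, 0.0, 0.0], "t": [0.0, 0.0, 0.0]}
W_adj = subtract_good_lmers(W, [("caa", 0.2), ("cgt", 0.01)], ave=0.1)
```

Only the first L-mer exceeds the threshold here, so the entries supporting "caa" shrink while the rest of the matrix is unchanged.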

Page 26: Identification of Distinguishing Motifs

Experiment Results (Single Group)

• Input:

(1) Randomly generate sequences: n = 20, m = 600

(2) Insert the motif into the sequences:
-- choose a center string s (length L)
-- mutate d positions (insertion, deletion, mutation)
-- implant the mutated copy into the sequences

• Output: use our program to find the implanted pattern.
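A simplified sketch of generating such a test instance. It covers substitutions only (no insertions or deletions), overwrites background letters so that every sequence keeps length m, and uses L = 15 and d = 4 purely as placeholder values; none of these choices come from the slides.

```python
import random

ALPHABET = "acgt"

def random_dna(length, rng):
    return "".join(rng.choice(ALPHABET) for _ in range(length))

def substitute(center, d, rng):
    """Change exactly d distinct positions of center to a different letter."""
    letters = list(center)
    for pos in rng.sample(range(len(letters)), d):
        letters[pos] = rng.choice([b for b in ALPHABET if b != letters[pos]])
    return "".join(letters)

def planted_instance(n=20, m=600, L=15, d=4, seed=0):
    """Return (center, sequences, start positions of the implanted copies)."""
    rng = random.Random(seed)
    center = random_dna(L, rng)
    sequences, positions = [], []
    for _ in range(n):
        background = random_dna(m, rng)
        start = rng.randrange(m - L + 1)
        copy = substitute(center, d, rng)
        sequences.append(background[:start] + copy + background[start + L:])
        positions.append(start)
    return center, sequences, positions

center, sequences, positions = planted_instance()
```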

Page 27: Identification of Distinguishing Motifs

Experiment Results (Single Group)

Table 1: 15 sequences: no indel; 5 sequences: one deletion

Table 2: 10 sequences: no indel; 5 sequences: one deletion; 5 sequences: one insertion

In Table 2, the running time increases significantly and the accuracy in many cases is slightly worse than in Table 1.

Page 28: Identification of Distinguishing Motifs

Experiment Results (Single Group)

• Table 3: 5 sequences: one deletion; 5 sequences: two deletions; 10 sequences: no indel

• Table 4: 5 sequences: one insertion; 5 sequences: two insertions; 10 sequences: no indel

The results in Table 4 are slightly better than those in Table 3. The reason might be that the case in Table 4 needs to insert two columns in the matrix for the motif, whereas the case in Table 3 needs to insert two spaces in the motif sequences.

Page 29: Identification of Distinguishing Motifs

Experiment Results (Single Group)

•Table 5, the mixed case:

Probability:

one insertion : 1/8 one deletion : 1/8

two insertions : 1/8 two deletions: 1/8

one insertion and one deletion: 1/8

no indel: 3/8

Page 30: Identification of Distinguishing Motifs

Experiment Results (Two Groups)

• Centers (m = 600):

c1: the center for group B, a random sequence

c2: the center for group G, obtained by randomly mutating 200 positions of c1

• Generate two groups (n = 10): randomly mutate 200 positions from the center
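The two-group test data can be sketched as below. Two points are assumptions: n = 10 is read as the size of each group, and a "mutated" position is redrawn uniformly from the four letters, so it may keep its original letter and the Hamming distance to the center can be smaller than 200.

```python
import random

ALPHABET = "acgt"

def mutate(center, k, rng):
    """Redraw k distinct positions of center uniformly from ALPHABET."""
    letters = list(center)
    for pos in rng.sample(range(len(letters)), k):
        letters[pos] = rng.choice(ALPHABET)  # may reproduce the original letter
    return "".join(letters)

def hamming(s, t):
    return sum(a != b for a, b in zip(s, t))

def two_groups(m=600, n=10, k=200, seed=0):
    """Return (c1, c2, B, G): two centers and the groups derived from them."""
    rng = random.Random(seed)
    c1 = "".join(rng.choice(ALPHABET) for _ in range(m))
    c2 = mutate(c1, k, rng)
    B = [mutate(c1, k, rng) for _ in range(n)]
    G = [mutate(c2, k, rng) for _ in range(n)]
    return c1, c2, B, G

c1, c2, B, G = two_groups()
```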

Page 31: Identification of Distinguishing Motifs

Experiment Results (Two Groups)

From Table 6, we can see that it is easy to find a motif that can distinguish the two groups when L is large

Comparing Table 7 with Table 6, we can see that it is easy to find a distinguishing motif when the distance between the two centers is large

Table 7 shows the results when the average Hamming distance between c1 and c2 is about 175

Table 6 shows the results when the average Hamming distance between c1 and c2 is about 128

Page 32: Identification of Distinguishing Motifs

Summary

• An algorithm for the single group problem that can handle indels

• An algorithm for the two groups problem