Parallelized Multiple Sequence Alignment on the Public Cloud

Presented by: Dr. G. Sudha Sadasivam, Professor, Dept of CSE, PSG College of Technology, Coimbatore

Co-authors: Mr B. Vijayan, Mr S. Arul Prakash, Mr K.V. Hari Babu (Students, BE (CSE), Dept of CSE, PSG College of Technology, Coimbatore)

Agenda

- Sequence alignment
- Introduction to Clouds
- Approaches for MSA
- Problem statement
- System Architecture
- Illustration of working of the system
- Analysis
- Experimental results
- Conclusion

What is Sequence Alignment?

The procedure of comparing two or more sequences by searching for a series of individual characters or character patterns that are in the same order in the sequences.

Uses:

- Sequence similarity
- Phylogenetic tree analysis

Factors: accuracy and speed

Cloud computing

Provides scalable, on-demand, real-time (RT) computing services.

Suitability of cloud for Sequence Alignment

- On-demand scalability of the cloud makes it suitable for the dynamic nature of MSA.
- Low cost in maintenance of infrastructure for applications.
- Data and compute parallelism in clouds through the map-reduce paradigm facilitates energy-efficient and fast MSA.

Types of Sequence Alignment

Pair-wise Alignment: alignment of two sequences

Global, using the Needleman-Wunsch algorithm:

L G P S S K Q T G K G S _ S R A W D N
|         |     | | |       |     |
L N _ A T K S A G K G A I M R L G D A

Local, using the Smith-Waterman algorithm:

_ _ _ _ _ _ _ _ _ T G K G _ _ _ _ _ _ _ _ _ _
                    | | |
_ _ _ _ _ _ _ _ _ A G K G _ _ _ _ _ _ _ _ _ _

Multiple Sequence Alignment: alignment of more than two sequences

MSA methods

- Dynamic programming (n-dimensional matrix): accurate and exhaustive, but computationally complex, $O(N^n)$.
- Progressive approximation (aligns the closest sequences first, using heuristics): fast, but the alignment cannot be modified, it can get stuck in local maxima, and it is less accurate. Examples: ClustalW, MAFFT.
- Iterative, probabilistic/stochastic (random): slow and less accurate. Examples: GA and HMM.

$N$: sequence length; $n$: number of sequences

MSA in cloud

- CloudBurst (RMAP): does not split sequences to load in the cloud environment; not for MSA; no automatic scale up/down of clusters.
- CLUE: proposal from Maryland University.
- VM cloning: Snowflock with MPIs.

Problem statement

A time-efficient approach to sequence alignment with quality (accuracy) in the cloud.

Using the Hadoop framework:

- Dynamic approach: accuracy
- Data and compute parallelism in Hadoop: speed
- Blocking and scalability of Hadoop

Parallel transfer of sequence splits over the network to remote clusters.

Automated scale up/down of clusters based on the computational needs of the environment.

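Needleman-Wunsch Algorithm

Initialization:

$$F(0,0) = 0, \qquad F(i,0) = -i\,d, \qquad F(0,j) = -j\,d$$

Main iteration: for each $i = 1, \dots, M$ and $j = 1, \dots, N$,

$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j) & \text{case 1} \\ F(i-1,j) - d & \text{case 2} \\ F(i,j-1) - d & \text{case 3} \end{cases}$$

$$\mathrm{Ptr}(i,j) = \begin{cases} \text{DIAG} & \text{if case 1} \\ \text{UP} & \text{if case 2} \\ \text{LEFT} & \text{if case 3} \end{cases}$$

- Case 1: $x_i$ aligns to $y_j$
- Case 2: $x_i$ aligns to a gap
- Case 3: $y_j$ aligns to a gap

Scoring: $s(x_i, y_j) = +1$ for a match and $-1$ for a mismatch.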
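The recurrence above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `needleman_wunsch`, the tie-breaking order (DIAG, then UP, then LEFT) and the use of `_` for a gap are assumptions made for the example.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, d=1):
    """Global alignment of x (rows, index i) and y (columns, index j).

    Returns the score matrix F and the two aligned strings.
    Ties are broken in the order DIAG, UP, LEFT (an assumption).
    """
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    ptr = [[None] * (n + 1) for _ in range(m + 1)]

    # Initialization: first column and first row hold pure gap penalties.
    for i in range(1, m + 1):
        F[i][0], ptr[i][0] = -i * d, "UP"
    for j in range(1, n + 1):
        F[0][j], ptr[0][j] = -j * d, "LEFT"

    # Main iteration: take the best of the three cases for every cell.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            candidates = [
                (F[i - 1][j - 1] + s, "DIAG"),  # case 1: x_i aligns to y_j
                (F[i - 1][j] - d, "UP"),        # case 2: x_i aligns to a gap
                (F[i][j - 1] - d, "LEFT"),      # case 3: y_j aligns to a gap
            ]
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])

    # Traceback from the bottom-right cell, following the stored pointers.
    ax, ay = [], []
    i, j = m, n
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "DIAG":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "UP":
            ax.append(x[i - 1]); ay.append("_"); i -= 1
        else:
            ax.append("_"); ay.append(y[j - 1]); j -= 1
    return F, "".join(reversed(ax)), "".join(reversed(ay))


F, ax, ay = needleman_wunsch("ATA", "AGTA")
print(F[-1][-1], ax, ay)  # 2 A_TA AGTA
```

With "ATA" as the row sequence and "AGTA" as the column sequence, the sketch gives a bottom-right score of 2 and the alignment A_TA / AGTA.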

Needleman-Wunsch Algorithm: worked example

Scoring: $s(x_i, y_j) = +1$ for a match, $-1$ for a mismatch, and $d = 1$. Rows are indexed by $i$ (sequence ATA) and columns by $j$ (sequence AGTA).

| $F(i,j)$ | $j=0$ | $j=1$ (A) | $j=2$ (G) | $j=3$ (T) | $j=4$ (A) |
|---|---|---|---|---|---|
| $i=0$ | 0 | -1 | -2 | -3 | -4 |
| $i=1$ (A) | -1 | 1 | 0 | -1 | -2 |
| $i=2$ (T) | -2 | 0 | 0 | 1 | 0 |
| $i=3$ (A) | -3 | -1 | -1 | 0 | 2 |

Two cells worked out:

$$F(1,1) = \max\{F(0,0) + s(x_1,y_1) = 1,\; F(0,1) - 1 = -2,\; F(1,0) - 1 = -2\} = 1 \quad \text{(case 1, DIAG)}$$

$$F(1,2) = \max\{F(0,1) + s(x_1,y_2) = -2,\; F(0,2) - 1 = -3,\; F(1,1) - 1 = 0\} = 0 \quad \text{(case 3, LEFT)}$$

Optimal Alignment:

A_TA
AGTA

Multiple Sequence Alignment

A multiple sequence alignment is a sequence alignment of three or more biological sequences, generally protein, DNA, or RNA.

The input is a set of query sequences that are assumed to have an evolutionary relationship by which they share a lineage and are descended from a common ancestor.

From the resulting multiple sequence alignment, phylogenetic analysis can be conducted to assess the sequences' shared evolutionary origins.

MSA Approaches

- Dynamic programming
- Progressive alignment
- Iterative approach

Dynamic Programming

- Direct method for MSA that identifies the globally optimal alignment.
- Computational complexity: an n-dimensional equivalent of the pairwise alignment matrix is formed. The search space increases exponentially with increasing $n$ and is strongly dependent on sequence length $N$, giving $O(N^n)$.

Progressive Alignment

Heuristic search: builds up a final MSA by combining pairwise alignments, beginning with the most similar pair and progressing to the most distantly related.

Stages:

- The relationships between the sequences are represented as a tree, called a guide tree (built from pairwise alignment scores).
- The MSA is built by adding the sequences sequentially to the growing MSA according to the guide tree.

Example with seq 1, seq 2, seq 3 and seq 4. According to the guide tree:

1) Align seq 1 and seq 2.
2) Align seq 3 with respect to seq 1 and 2.
3) Align seq 4 to the alignment of seq 1, 2 and 3.
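The ordering step of progressive alignment can be sketched as follows. This is a simplified illustration under stated assumptions: it replaces a real guide tree with a greedy ordering (the best-scoring pair first, then the sequence most similar to any sequence already included), and the helper names `nw_score` and `progressive_order` are hypothetical.

```python
from itertools import combinations


def nw_score(x, y, match=1, mismatch=-1, d=1):
    """Needleman-Wunsch score only, keeping two rows of the matrix."""
    prev = [-j * d for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [-i * d]
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] - d, cur[j - 1] - d))
        prev = cur
    return prev[-1]


def progressive_order(seqs):
    """Return sequence indices in the order they would join the alignment."""
    pairs = list(combinations(range(len(seqs)), 2))
    scores = {}
    for a, b in pairs:
        scores[(a, b)] = scores[(b, a)] = nw_score(seqs[a], seqs[b])

    # Start with the most similar pair.
    order = list(max(pairs, key=lambda p: scores[p]))
    remaining = set(range(len(seqs))) - set(order)

    # Greedily add the sequence closest to any sequence already placed.
    while remaining:
        nxt = max(sorted(remaining),
                  key=lambda r: max(scores[(r, o)] for o in order))
        order.append(nxt)
        remaining.remove(nxt)
    return order


print(progressive_order(["GAT", "AGTA", "ATA"]))  # [1, 2, 0]
```

In the usage line, "AGTA" and "ATA" score highest against each other, so they are aligned first and "GAT" joins last. A production tool such as ClustalW builds an actual tree rather than a flat order.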

Drawbacks

- The primary problem is that errors made at any stage in growing the MSA are propagated through to the final result. Random/iterative approaches are used to counter this.
- Performance is also particularly bad when all of the sequences in the set are rather distantly related.

System Architecture

[Figure: system architecture. A client-side virtual environment splits the input sequences (AGT….CG) into sequence fragments and sends them to the head server (VM) of the server-side Hadoop cluster, where new VMs are forked or deleted.]

Steps shown in the figure:

1. Create virtual environment (client side)
2. Split the sequences and transmit the fragments in parallel over the Internet
3. Copy to HDFS
4. Forking VMs / deleting VMs
5. Perform alignment
6. Report the result
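The sequence-splitting step on the client side can be sketched as below. The function name `split_sequence` and the choice to tag each fragment with its block index are assumptions for illustration; the slides do not specify the fragment format.

```python
def split_sequence(seq, block_size):
    """Cut a sequence into fixed-size blocks, tagged with their block index.

    The last block may be shorter than block_size.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [
        (start // block_size, seq[start:start + block_size])
        for start in range(0, len(seq), block_size)
    ]


fragments = split_sequence("AGTACGAGTT", 4)
print(fragments)  # [(0, 'AGTA'), (1, 'CGAG'), (2, 'TT')]
```

Keeping the block index with each fragment lets the server side reassemble or pair up blocks regardless of the order in which they arrive.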

Map reduce Architecture

[Figure: MapReduce data flow across Map Task 1, Map Task 2 and Map Task 3, followed by Reduce Task 1 and Reduce Task 2.]

- Input blocks, one map (M) each: D1,B1 D2,B1 D1,B2 D1,B3 D3,B1 D2,B2 D3,B2
- Map output pairs:
  - C1: K1, K2, K3
  - C2: K2, K5, K3
  - C3: K6, K3, K4
  - C4: K5, K2, K4
  - C5: K4, K1, K6
  - C6: K6, K3, K1
  - C7: K5, K6, K4
- Sort and Group (D1): K1,[C1] K2,[C1,C4] K3,[C1,C3] K4,[C4,C3] K5,[C4] K6,[C3]
- Sort and Group (D2): K1,[C6] K2,[C2] K3,[C2,C6] K5,[C2] K6,[C6]
- Reduce output, one reduce (R) per key: K1,I K2,I K3,I K4,I K5,I K6,I for D1 and K1,I K2,I K3,I K5,I K6,I for D2
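The sort-and-group stage of the figure can be imitated in plain Python. This is only a single-process sketch of the data flow, not Hadoop code: `MAP_OUTPUT_D1` copies the map output of the three D1 blocks from the figure, while `sort_and_group`, `reduce_all` and the counting reducer are hypothetical stand-ins.

```python
from collections import defaultdict

# Map output for dataset D1, copied from the figure:
# each combination (C1, C3, C4) emitted three keys.
MAP_OUTPUT_D1 = {
    "C1": ["K1", "K2", "K3"],
    "C3": ["K6", "K3", "K4"],
    "C4": ["K5", "K2", "K4"],
}


def sort_and_group(map_output):
    """Collect, for every key, the combinations that emitted it."""
    grouped = defaultdict(list)
    for combination in sorted(map_output):
        for key in map_output[combination]:
            grouped[key].append(combination)
    return dict(sorted(grouped.items()))


def reduce_all(grouped, reducer):
    """Apply one reducer call per key, as a reduce task would."""
    return {key: reducer(key, values) for key, values in grouped.items()}


grouped = sort_and_group(MAP_OUTPUT_D1)
print(grouped)
# Placeholder reducer: count the combinations seen for each key.
print(reduce_all(grouped, lambda key, values: len(values)))
```

The grouped result matches the "Sort and Group (D1)" row of the figure, apart from the order of values inside a list.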

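A single Combination: An illustration

S1 = “AGTA”; S2 = “ATA”; S3 = “GAT”

1. ALIGNMENT OF S1 & S2

| | | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|---|
| | | | A | G | T | A |
| 0 | | 0 | -1 | -2 | -3 | -4 |
| 1 | A | -1 | 1 | 0 | -1 | -2 |
| 2 | T | -2 | 0 | 0 | 1 | 0 |
| 3 | A | -3 | -1 | -1 | 0 | 2 |

SCORE: 4

A1S1: “AGTA”; A1S2: “A_TA”

2. ALIGNMENT OF A1S1 & S3

| | | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|---|
| | | | A | G | T | A |
| 0 | | 0 | -1 | -2 | -3 | -4 |
| 1 | G | -1 | -1 | 0 | -1 | -2 |
| 2 | A | -2 | 0 | -1 | 1 | 0 |
| 3 | T | -3 | -1 | -1 | 0 | -1 |

SCORE: -5

A2S1: “AG_TA”; A1S3: “_GAT_”

3. ALIGNMENT OF A1S2 & A1S3

| | | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|---|
| | | | A | _ | T | A | _ |
| 0 | | 0 | -1 | -2 | -3 | -4 | -5 |
| 1 | _ | -1 | 0 | 0 | -1 | -2 | -3 |
| 2 | G | -2 | -1 | -1 | -1 | -2 | -2 |
| 3 | A | -3 | -1 | -1 | -2 | 0 | -1 |
| 4 | T | -4 | -2 | -1 | 0 | -1 | 0 |
| 5 | _ | -5 | -3 | -1 | -1 | 0 | 0 |

SCORE: -3

A2S2: “A__TA_”; A2S3: “_GAT__”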

Analysis

| Complexity measure | Proposed method | Conventional method |
|---|---|---|
| Score calculation | $O(N)$ | $O(n \cdot N)$ |
| Pairwise alignment | $O(K^2)$ | $O(N^2)$ |
| MSA | $O\left[K^2 \cdot n(n-1)/2\right]$ | $O(N^n)$ |

- $n$: number of sequences
- $N$: average length of a sequence
- $k$: average number of blocks in a sequence
- $K$: size of one block
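To get a feel for the gap between the two MSA bounds, the expressions can be evaluated directly. The sketch below only plugs numbers into the formulas as stated on the slide; the function names are hypothetical, and the inputs (30 KB sequences read as N = 30000, 2 KB splits read as K = 2000, n = 5) are borrowed from the experimental section as an assumption about units. The results are operation-count orders, not measured times.

```python
from math import comb


def conventional_msa_cost(N, n):
    """Order of work for n-dimensional dynamic programming: N^n."""
    return N ** n


def proposed_msa_cost(K, n):
    """Order of work stated for the proposed method: K^2 * n(n-1)/2."""
    return K ** 2 * comb(n, 2)


N, K, n = 30_000, 2_000, 5
print(f"conventional: {conventional_msa_cost(N, n):.3e}")
print(f"proposed:     {proposed_msa_cost(K, n):.3e}")
```

`comb(n, 2)` equals n(n-1)/2, the number of sequence pairs, which is where the pairwise term in the proposed bound comes from.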

2. Parallelised data transfer

- $T$: time for serial sequence transfer; $k$: number of blocks transferred in parallel
- $T/k$: time for sequence transfer in parallel

3. Dynamic cluster creation

- Advantage: computation power of the remote cluster is used optimally and not wasted
- Disadvantage: time to set up the cluster
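The parallel transfer idea (blocks sent concurrently rather than one after another) can be sketched with a thread pool. Everything here is a stand-in: `send_block` is a hypothetical placeholder that only reports how many characters it would send, and the worker count of 3 is an arbitrary assumption, not a value from the slides.

```python
from concurrent.futures import ThreadPoolExecutor


def send_block(block):
    """Hypothetical stand-in for a network send.

    A real client would open a connection to the head server here;
    this placeholder just returns the block index and its length.
    """
    index, data = block
    return index, len(data)


def transfer_parallel(blocks, workers=3):
    """Send all blocks concurrently and collect the per-block results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(send_block, blocks))


blocks = [(0, "AGTA"), (1, "CGAG"), (2, "TT")]
print(transfer_parallel(blocks))  # {0: 4, 1: 4, 2: 2}
```

Threads suit this step because a network send spends most of its time waiting on I/O; the ideal T/k figure is an upper bound on the gain, since the shared link and per-connection overhead still cost time.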

Experimental Setup

- Core 2 Duo processors, 2.8 GHz, 160 GB HD, 2 GB RAM
- LAN: 100 Mbps
- OS: RHEL v5
- Client virtual environment: 4 VMs
- Server cluster: 5 machines
- Hadoop DFS in fully distributed mode
- OpenVZ was used for virtualization

Effect of parallel file transfer

| File size (MB) | File transfer (sec) | Split time (sec) | Merge time (sec) | C1 (sec) | T1 (sec) | C2 (sec) | T2 (sec) |
|---|---|---|---|---|---|---|---|
| 100 | 6.23 | 0.02 | 0.03 | 2.13 | 2.18 | 0.73 | 0.78 |
| 200 | 9.32 | 0.23 | 0.43 | 2.96 | 3.62 | 1.23 | 1.89 |
| 300 | 11.43 | 0.85 | 1.64 | 3.84 | 6.33 | 1.16 | 3.65 |

- C1: communication time from 3 client VMs to the server without multithreading
- C2: communication time from 3 client VMs to the server with multithreading
- T1: total time for file transfer from client to server without multithreading
- T2: total time for file transfer from client to server with multithreading
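In the table, each total appears to equal split time plus merge time plus the corresponding communication time. That relationship is inferred from the numbers rather than stated on the slide; the short sketch below checks it and derives the T1/T2 ratio for each file size. The names `ROWS` and `total_time` are illustrative.

```python
# (file size MB, split s, merge s, C1 s, C2 s), copied from the table.
ROWS = [
    (100, 0.02, 0.03, 2.13, 0.73),
    (200, 0.23, 0.43, 2.96, 1.23),
    (300, 0.85, 1.64, 3.84, 1.16),
]


def total_time(split, merge, comm):
    """Total transfer time, assuming total = split + merge + communication."""
    return round(split + merge + comm, 2)


for size, split, merge, c1, c2 in ROWS:
    t1 = total_time(split, merge, c1)  # without multithreading
    t2 = total_time(split, merge, c2)  # with multithreading
    print(f"{size} MB: T1={t1} s, T2={t2} s, T1/T2={t1 / t2:.2f}")
```

The ratio shrinks as files grow because split and merge time are paid in both cases and take up a larger share of the total for bigger files.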

Time to start virtual machines

[Figure: time in sec (0 to 120) against number of VMs (1 to 4).]

Parallelised starting of VMs can be done to reduce time.

Cluster performance with respect to number of VMs

30 KB sequences with 2 KB splits, up to 5 sequences

[Figure: time in sec (0 to 350) against number of sequences (1 to 10), with one series for 4 slave VMs and one for 6 slave VMs.]

When the number of sequences is less than 6, a five-node Hadoop cluster is sufficient.

Dynamic scaling up/down of clusters

Block size: 10 MB. Static VM creation is based on the predicted application load (maps + reduces); dynamic VM creation is based on the actual application load (maps + reduces).

| File size (GB) | Static: time (min-sec) | Static: VMs | Dynamic: time (min-sec) | Dynamic: new VMs added |
|---|---|---|---|---|
| 1 | 5-36 | 2 | 3-16 | 1 |
| 2 | 5-52 | 3 | 5-40 | 1 |
| 3 | 8-27 | 4 | 5-48 | 2 |
| 5 | 12-13 | 5 | 6-39 | 9 |

- VMs are instantiated based on the number of Map-Reduce tasks.
- The number of tasks was checked dynamically; new VMs were started and tasks were reallocated.
- Old VMs were destroyed if not used.
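The scale up/down decision described above can be sketched as a simple rule. This is a hypothetical policy, not the one used in the experiments: the function name `scaling_action`, the assumption of a fixed number of task slots per VM, and the default values are all illustrative.

```python
from math import ceil


def scaling_action(current_vms, pending_tasks, slots_per_vm=2, min_vms=1):
    """Return how many VMs to start (positive) or destroy (negative).

    Assumes every VM can run slots_per_vm map or reduce tasks at once,
    and that at least min_vms VMs are always kept alive.
    """
    if slots_per_vm <= 0:
        raise ValueError("slots_per_vm must be positive")
    target = max(min_vms, ceil(pending_tasks / slots_per_vm))
    return target - current_vms


print(scaling_action(current_vms=2, pending_tasks=10))  # 3: start three VMs
print(scaling_action(current_vms=5, pending_tasks=4))   # -3: destroy three VMs
```

A real controller would also weigh VM start-up time against the remaining work, since starting a VM is not free.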

Conclusion

1) The proposed MSA improves computation time and also maintains accuracy. Sequence alignment is parallelised at three levels.

- Hadoop data grids: data and compute parallelism, and scalability
- Dynamic programming: accuracy

2) Complexity is reduced from $O(N^n)$ to $O\left[K^2 \cdot n(n-1)/2\right]$.

- Combining progressive and dynamic approaches
- Blocking in Hadoop

3) Enhancements (using clouds for MSA)

- Automatic configuration of the cloud environment based on the computational needs
- Efficient upload of data into HDFS by parallel transfer of sequence fragments over the Internet

Acknowledgements

The research has been carried out as a result of the PSG-Yahoo Research programme on Grid and Cloud computing.

Sincere thanks to

1) Dr R Rudramoorthy, Principal, PSG College of Technology, Coimbatore.

2) Mr K V Chidambaran, Director, Grid and Cloud Systems Group, Yahoo, Bangalore.

THANK YOU

QUESTIONS?

REFERENCES

- Apache, (2002), Hadoop Documentation, retrieved on September 20, 2009, from http://hadoop.apache.org/core/docs/r0.17.2/.
- Tahir, N., Imitaz, S. and Shaftab, A., “Parallel Needleman-Wunsch Algorithm for Grid”, retrieved on January 19, 2009 from http://www.gridbus.org/~alchemi/files/Parallel%20Needleman%20Algo.pdf
- Michael, C., (2009), “Cloud Burst: highly sensitive read mapping with MapReduce”, Bioinformatics, 25(11), 1363-1369.
- Lee, T., “A genomic CluE for Cloud Computing”, retrieved on January 13, 2009 from http://www.eurekalert.org/pub_releases/2009-04/uom-agc042309.php
- Yongli, H. and Shen, J., “Sequence analysis scale up and acceleration using Grid and Cloud Computing yield efficient analyses of HIV-1 variants and other viruses”, retrieved on February 15, 2009 from www.iscb.org/uploaded/css/43/12056.pdf.
- Philip, P., Andres, L., Eyal, L. and Michael, B., “Adding the easy button to the cloud with SnowFlock and MPI”, in Proceedings of 3rd ACM workshop in system level virtualization for HPC (2009), 122-127.
