Improving Performance of Multiple Sequence Alignment Analysis in Multi-client Environments

Use of Inexpensive Storage as Grid Cache

Umit Catalyurek, Mike Gray, Eric Stahlberg, Renato Ferreira, Tahsin Kurc, Joel Saltz

Department of Biomedical Informatics, The Ohio State University

Ohio Supercomputer Center

March 2, 2004, BMI 731 - Biomedical Data Management


Outline

• Multiple Sequence Alignment
• CLUSTALW
• Sequence Analysis in Multiple Client Environment
  – Caching Intermediate Results
  – Deployment on SMP Machine
  – Deployment on Distributed Memory Machine
• Experimental Results
• Conclusion


Sequence Alignment

• An alignment is a mutual arrangement of two sequences
  – It shows where the two sequences are similar, and where they differ

Sequence s:    AAT    AGCAA    AGCACACA
Sequence t:    TAA    ACATA    ACACACTA
Hamming Dist:  2      3        6
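The Hamming distances listed for the three sequence pairs can be reproduced with a few lines of Python. This is a minimal sketch; the function name `hamming_distance` is chosen here for illustration and does not come from the slides.

```python
def hamming_distance(s: str, t: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(a != b for a, b in zip(s, t))


# The three (s, t) pairs from the slide.
pairs = [("AAT", "TAA"), ("AGCAA", "ACATA"), ("AGCACACA", "ACACACTA")]
distances = [hamming_distance(s, t) for s, t in pairs]
print(distances)  # [2, 3, 6]
```

Hamming distance compares position by position only, so it cannot account for insertions or deletions.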


Edit Distance

Unit Cost:

s: AGCACAC-A        AG-CACACA
t: A-CACACTA   or   ACACACT-A
   cost 2           cost 4

distance(s, t) = 2
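The unit-cost edit distance for this example can be checked with the standard dynamic-programming recurrence. The sketch below assumes that every insertion, deletion, and substitution costs 1, as on the slide; the function name is illustrative.

```python
def edit_distance(s: str, t: str) -> int:
    """Unit-cost edit distance, computed one dynamic-programming row at a time."""
    prev = list(range(len(t) + 1))  # distance from the empty prefix of s
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a (gap in t)
                curr[j - 1] + 1,         # insert b (gap in s)
                prev[j - 1] + (a != b),  # match (cost 0) or substitute (cost 1)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("AGCACACA", "ACACACTA"))  # 2
```

The result matches the cheaper of the two alignments shown (cost 2), which is why distance(s, t) = 2.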


Multiple Sequence Alignment

    VTISCTGSSSNIGAG-NHVKWYQQLPG
    VTISCTGTSSNIGS--ITVNWYQQLPG
    LRLSCSSSGFIFSS--YAMYWVRQAPG
    LSLTCTVSGTSFDD--YYSTWVRQPPG
    PEVTCVVVDVSHEDPQVKFNWYVDG--
    ATLVCLISDFYPGA--VTVAWKADS--
    AALGCLVKDYFPEP--VTVSWNSG---
    VSLTCLVKGFYPSD--IAVEWESNG--

or

    VTISCTGSSSNIG-AGNHVKWYQQLPG
    VTISCTGTSSNIG--SITVNWYQQLPG
    LRLSCSSSGFIFS--SYAMYWVRQAPG
    LSLTCTVSGTSFD--DYYSTWVRQPPG
    PEVTCVVVDVSHEDPQVKFNW--YVDG
    ATLVCLISDFYPG--AVTVAW--KADS
    AALGCLVKDYFPE--PVTVSW--NS-G
    VSLTCLVKGFYPS--DIAVEW--ESNG

Optimal: $O(2^n \prod_i |s_i|)$
• 6 sequences of length 100, if the constant is $10^{-9}$ seconds: running time $6.4 \times 10^4$ seconds
• Add 2 sequences: running time $2.6 \times 10^9$ seconds
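The running-time estimates for optimal multiple alignment follow directly from the cost expression: 2 to the power n, times the product of the sequence lengths, times the per-step constant. The short sketch below redoes that arithmetic; the function name and the `seconds_per_step` parameter are labels introduced here.

```python
import math


def optimal_msa_seconds(lengths, seconds_per_step=1e-9):
    """Estimate 2^n * prod(|s_i|) * constant for n sequences of the given lengths."""
    n = len(lengths)
    return (2 ** n) * math.prod(lengths) * seconds_per_step


six = optimal_msa_seconds([100] * 6)    # 2^6 * 100^6 * 1e-9
eight = optimal_msa_seconds([100] * 8)  # 2^8 * 100^8 * 1e-9
print(f"{six:.1e} s, {eight:.1e} s")  # 6.4e+04 s, 2.6e+09 s
```

In everyday units, that is roughly 18 hours for 6 sequences against roughly 81 years for 8, which is why heuristics such as progressive alignment are used instead.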


CLUSTAL W

• Based on Higgins & Sharp CLUSTAL [Gene88]
• Progressive alignment-based strategy
  – Pairwise Alignment ($n^2 l^2$)
    • A distance matrix is computed using either an approximate method (fast) or dynamic programming (more accurate, slower)
  – Computation of Guide Tree ($n^3$): phylogenetic tree
    • Computed from the distance matrix
    • Iteratively selecting aligned pairs and linking them
  – Progressive Alignment ($n l^2$)
    • A series of pairwise alignments computed using full dynamic programming to align larger and larger groups of sequences
    • The order in the Guide Tree determines the ordering of sequence alignments
    • At each step, either two sequences are aligned, or a new sequence is aligned with a group, or two groups are aligned
• n: number of sequences in the query
• l: average sequence length
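The three phases can be sketched in a toy form to show how the distance matrix drives the merge order. This is not CLUSTAL W's code. The distance function is a crude stand-in for a real pairwise alignment score, and average linkage is used as a simple stand-in for the tree-building step. All names are illustrative.

```python
from itertools import combinations


def toy_distance(s, t):
    """Stand-in distance: mismatched positions plus length gap, normalized."""
    mismatches = sum(a != b for a, b in zip(s, t)) + abs(len(s) - len(t))
    return mismatches / max(len(s), len(t))


def distance_matrix(seqs):
    """Phase 1: one independent computation per unordered pair, n(n-1)/2 in total."""
    return {(i, j): toy_distance(seqs[i], seqs[j])
            for i, j in combinations(range(len(seqs)), 2)}


def guide_tree_merges(n, dist):
    """Phase 2: repeatedly link the two closest clusters (average linkage)."""
    clusters = {i: (i,) for i in range(n)}  # cluster id -> member indices

    def average_distance(a, b):
        pairs = [(min(x, y), max(x, y)) for x in clusters[a] for y in clusters[b]]
        return sum(dist[p] for p in pairs) / len(pairs)

    merges, next_id = [], n
    while len(clusters) > 1:
        a, b = min(combinations(sorted(clusters), 2),
                   key=lambda ab: average_distance(*ab))
        merges.append((clusters[a], clusters[b]))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return merges


def step_kind(left, right):
    """Phase 3: classify a merge as one of the three step types on the slide."""
    groups = (len(left) > 1) + (len(right) > 1)
    return ["two sequences", "sequence with group", "two groups"][groups]


seqs = ["AGCACACA", "ACACACTA", "AGCACACT", "TTTTGGGG"]
dist = distance_matrix(seqs)
merges = guide_tree_merges(len(seqs), dist)
for left, right in merges:
    print(left, right, step_kind(left, right))
```

The merge list is the order in which a progressive aligner would perform its pairwise alignments. Only phase 1 depends on nothing but the individual sequence pairs, which is what makes its results cacheable across queries.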


Sequence Analysis in Multiple Client Environment

• Many gene and protein databases can be accessed over the Internet
  – Multiple requests by multiple clients
• Data Caching
  – Cache pairwise alignments
    • Most expensive phase
    • Computations are independent
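Because each pairwise alignment depends only on its two sequences, a score computed for one client's query can be reused by any later query that contains the same pair. A minimal in-memory sketch of that idea follows. The class name, the symmetric-score assumption, and the placeholder scoring function are all illustrative.

```python
from itertools import combinations


class PairwiseScoreCache:
    """Memoize pairwise scores across queries (assumes score(s, t) == score(t, s))."""

    def __init__(self, align_fn):
        self._align = align_fn
        self._scores = {}
        self.hits = 0
        self.misses = 0

    def score(self, s, t):
        key = (s, t) if s <= t else (t, s)  # unordered pair
        if key in self._scores:
            self.hits += 1
        else:
            self.misses += 1
            self._scores[key] = self._align(*key)
        return self._scores[key]

    def all_pairs(self, seqs):
        """Scores for every unordered pair in one query."""
        return {(s, t): self.score(s, t) for s, t in combinations(seqs, 2)}


def placeholder_score(s, t):
    """Stand-in for a real pairwise alignment score: matching positions."""
    return float(sum(a == b for a, b in zip(s, t)))


cache = PairwiseScoreCache(placeholder_score)
cache.all_pairs(["AGCA", "ACCA", "TGCA"])  # first client: 3 misses
cache.all_pairs(["ACCA", "TGCA", "GGGG"])  # second client: 1 hit, 2 misses
print(cache.hits, cache.misses)  # 1 5
```

The second query pays only for the pairs that involve its new sequence.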


Data Caching

• Low-cost, high-performance, high-capacity commodity hardware
  – Disks are cheap: 100GB EIDE disks around $250
  – A PC costs around $700-$1000
    • no monitor
    • no high-end graphics card
    • moderate size memory (128MB-512MB)
  – Switched fast ethernet
    • Better performance with channel bonding
  – In 2001: 6 Pentium III PCs, 1 TB of disk storage < $10,000
  – In 2002: 5 Pentium 4 PCs, 2.5TB of disk storage < $9,000
  – BMI Storage Cluster: 7.2TB, 24 PCs = $50,000-$55,000
  – UMD Storage Cluster: 9.5 TB, 50 PCs


Caching Pairwise Alignment Scores

• Sequence -> Unique ID (UID):
  – Use a hash (tested 10 hash functions including MD5; 4 of them give results similar to MD5)
  – Resolve collisions and assign a UID to each sequence
    • For more than 1 million sequences from GenBank, the maximum number of collisions per hash value was 3: constant time
• For each pairwise alignment, store two UIDs and a float score
  – B-Tree: used GIST B-Tree implementation


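Sequence -> Unique ID (UID):

[Figure: a sequence is hashed to entry i of the hash table; each entry points to a collision array, and j is the sequence's position in that array. j uses c bits, so a collision array holds up to 2^c elements.]

Unique ID = (i << c) | j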
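The UID construction can be sketched as follows: the hash selects table entry i, the sequence's position in that entry's collision array is j, and the two are packed into one integer with i in the high bits and j in the low c bits. The class name, the bit widths, and the use of a truncated MD5 digest as the table index are assumptions made for this illustration.

```python
import hashlib


class SequenceIdTable:
    """Assign each distinct sequence a UID built from (hash index i, collision slot j)."""

    def __init__(self, hash_bits=20, collision_bits=4):
        self.hash_bits = hash_bits
        self.c = collision_bits
        self.buckets = {}  # hash index i -> collision array (list of sequences)

    def _hash_index(self, seq):
        digest = hashlib.md5(seq.encode("ascii")).digest()
        return int.from_bytes(digest[:8], "big") % (1 << self.hash_bits)

    def uid(self, seq):
        i = self._hash_index(seq)
        collisions = self.buckets.setdefault(i, [])
        if seq not in collisions:
            if len(collisions) >= (1 << self.c):
                raise OverflowError("collision array full for this hash value")
            collisions.append(seq)
        j = collisions.index(seq)
        return (i << self.c) | j

    def lookup(self, uid):
        """Recover the sequence from its UID by unpacking i and j."""
        i, j = uid >> self.c, uid & ((1 << self.c) - 1)
        return self.buckets[i][j]


# A 1-bit hash forces collisions so the collision arrays are exercised.
table = SequenceIdTable(hash_bits=1, collision_bits=4)
sequences = ["AGCA", "ACCA", "TGCA", "GGGG", "TTTT"]
uids = [table.uid(s) for s in sequences]
print(len(set(uids)) == len(sequences))  # True: every sequence has its own UID
```

With short collision arrays, as reported for the GenBank test, the scan inside `uid` stays constant time.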


Deployment on SMP Machine

• A hash table is used to associate a sequence with a unique integer ID (UID)

• Partitioned B-Tree stores pairwise alignment results

• Cache partition chosen by min(UID1, UID2) % #Partitions

• Multiple threads for pairwise alignment computation
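The partitioning rule and the multi-threaded computation can be sketched together. Here plain dictionaries stand in for the B-Tree partitions, one lock guards each partition, and the scoring function is a placeholder; all names are illustrative. In standard CPython the threads show the structure rather than a real speedup for CPU-bound scoring.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


class PartitionedCache:
    """Pairwise-score cache split into partitions by min(UID1, UID2) % #partitions."""

    def __init__(self, num_partitions):
        self.parts = [dict() for _ in range(num_partitions)]  # stand-in for B-Trees
        self.locks = [threading.Lock() for _ in range(num_partitions)]

    def partition_of(self, uid1, uid2):
        return min(uid1, uid2) % len(self.parts)

    def get_or_compute(self, uid1, uid2, compute):
        key = (min(uid1, uid2), max(uid1, uid2))
        p = self.partition_of(uid1, uid2)
        with self.locks[p]:
            if key in self.parts[p]:
                return self.parts[p][key]
        score = compute(*key)  # computed outside the lock
        with self.locks[p]:
            return self.parts[p].setdefault(key, score)


def toy_score(a, b):
    """Placeholder for a real pairwise alignment of the sequences with UIDs a, b."""
    return float(a * 100 + b)


cache = PartitionedCache(num_partitions=4)
pairs = list(combinations(range(10), 2))  # 45 pairs over UIDs 0..9
with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(lambda p: cache.get_or_compute(p[0], p[1], toy_score), pairs))
print(sum(len(part) for part in cache.parts))  # 45
```

Using the smaller UID of the pair makes the partition independent of the order in which the two sequences are given.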


DataCutter

• Component Framework for Combined Task/Data Parallelism
• Core Services
  – Indexing Service: Multilevel hierarchical indexes based on R-tree indexing method
  – Filtering Service: Distributed C++ component framework
• User defines sequence of pipelined components (filters and filter groups)
  – Pleasingly Parallel
  – Generalized Reduction
• User directive tells preprocessor/runtime system to generate and instantiate copies of filters
• Stream based communication
• Multiple filter groups can be active simultaneously
• Flow control between transparent filter copies
  – Replicated individual filters
  – Transparent: single stream illusion

[Figure: Combined Data/Task Parallelism. Filter copies R0-R2, Ra0-Ra2, E0-EN, and M placed on host1-host5 across Cluster 1, Cluster 2, and Cluster 3.]

http://www.datacutter.org
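The filter-stream model can be illustrated with threads and queues: each filter stage runs several copies that read from one shared input stream, so upstream filters see a single stream regardless of how many copies exist. This is a Python sketch of the concept only and does not use or reproduce the DataCutter C++ API. `END`, `start_copies`, and `run_pipeline` are names invented here, and the unbounded queues omit the flow control mentioned on the slide.

```python
import queue
import threading

END = object()  # end-of-stream marker (convention of this sketch)


def start_copies(fn, inbox, outbox, copies):
    """Start `copies` transparent copies of one filter on a shared input stream."""
    def worker():
        while True:
            item = inbox.get()
            if item is END:
                inbox.put(END)  # pass the marker on to sibling copies
                break
            outbox.put(fn(item))

    threads = [threading.Thread(target=worker) for _ in range(copies)]
    for t in threads:
        t.start()
    return threads


def run_pipeline(items, stages):
    """Run items through stages, given as a list of (function, number_of_copies)."""
    first = queue.Queue()
    inbox, started = first, []
    for fn, copies in stages:
        outbox = queue.Queue()
        started.append((start_copies(fn, inbox, outbox, copies), outbox))
        inbox = outbox
    for item in items:
        first.put(item)
    first.put(END)
    for threads, outbox in started:
        for t in threads:
            t.join()
        outbox.put(END)  # all copies of this stage are done: close its output
    results = []
    while True:
        item = inbox.get()  # inbox is now the last stage's output stream
        if item is END:
            break
        results.append(item)
    return results


# Two copies of an upper-casing filter feed three copies of a length filter.
results = run_pipeline(["acgt", "ac", "a"], [(str.upper, 2), (len, 3)])
print(sorted(results))  # [1, 2, 4]
```

Results can arrive in any order because the copies run concurrently, which is why the example sorts them before printing.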


Deployment on Distributed Memory Machine

DataCutter version of ClustalW – v1

• Hash Filter
  – Stores/computes sequence to unique IDs mapping
  – Partitioned (declustered) hash
• Cache Filter
  – Partitioned (declustered) cache
  – Computes pairwise alignment if it doesn’t exist in the cache
    • Owner computes: computational imbalance
• CLUSTALW Filter
  – Performs guide tree generation and progressive alignment

[Figure: filters CLUSTALW, Hash (UniqueID), and Cache & Compute]


Deployment on Distributed Memory Machine

DataCutter version of ClustalW – v2

DC-ClustalW-v1 +
• Separate Pairwise Alignment Filter
  – Cache misses computed in Pairwise Align
  – Balanced computation
• Handles multiple queries
  – Multiple copies of CLUSTALW filter

[Figure: filters CLUSTALW, Hash (UniqueID), Cache, and Pairwise Align]
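The load imbalance of "owner computes" and the benefit of a separate alignment stage can be illustrated by counting how many pairwise alignments each worker would perform when every lookup is a cache miss. This sketch rests on two assumptions that the slides do not state for the distributed versions: that the owner of a pair is chosen by min(UID1, UID2) % #partitions, as in the SMP deployment, and that v2 hands out misses round-robin. UIDs are simply 0..39 for a 40-sequence query.

```python
from collections import Counter
from itertools import combinations


def owner_computes_load(uids, workers):
    """v1-style: the cache partition that owns a pair also computes it."""
    return Counter(min(a, b) % workers for a, b in combinations(uids, 2))


def round_robin_load(uids, workers):
    """v2-style stand-in: misses are dealt out evenly to alignment filters."""
    return Counter(k % workers for k, _ in enumerate(combinations(uids, 2)))


v1 = owner_computes_load(range(40), 8)
v2 = round_robin_load(range(40), 8)
print(max(v1.values()), min(v1.values()))  # 115 80
print(max(v2.values()), min(v2.values()))  # 98 97
```

Under these assumptions the busiest owner handles 115 of the 780 alignments while the least busy handles 80, whereas an even split caps every worker at 98. Since a query finishes only when its slowest worker does, the maximum is the number that matters.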


Deployment on Distributed Memory Machine

DataCutter version of ClustalW – v2

Multiple Query Processing

• QueryManager Filter (QM)
• ClustalW Filter (CW)
• Hash Filter (H)
• Cache Filter (C)
• Pairwise Alignment Filter (P)

[Figure: QM on Host-0; CW copies on Host-1 through Host-n; H, C, and P copies on Host-n+1 through Host-2n]


Experimental Setup

1. Pentium III 650 MHz, 768MB Memory
   • 1000 random sequences from GPCR
   • Average length 450 amino acids per sequence
2. 24-Processor Sun Fire 6800, 750MHz, 24GB Memory
   • 350 MSA queries from GPCR; from 2 sequences per query to over 200 sequences per query
3. 16 Pentium III 933MHz, 512MB Memory, 3x100GB IDE disk
   • 64 queries, each consisting of 40 unique protein sequences from GPCR
   • Average length 450 amino acids per sequence


Experiment 1 – Execution Time of CLUSTAL W

[Figure: Execution Time of CLUSTAL W. x-axis: number of GPCR sequences (25 to 1000); y-axis: execution time in seconds, log scale (1 to 100,000).]

[Figure: Breakdown of CLUSTAL W Execution Time on PIII-650MHz. x-axis: number of GPCR sequences (25 to 1000); y-axis: time fractions (0% to 100%); series: pairwise, guidetree, prog-align.]

Pentium III 650 MHz, 768MB Memory
• 1000 random sequences from GPCR
• Average length 450 amino acids per sequence


Experiment 2 - SMP Results

[Figure: Breakdown of Average Execution Time on 1-processor. x-axis: cache hit ratio (NOCACHE, 12%, 25%, 50%, 75%, 100%); y-axis: execution time in seconds (0 to 60); series: pairwise, guide tree, prog.align.]

[Figure: SMP: 64 Queries Total Execution Time. x-axis: number of processors (1, 2, 4, 8); y-axis: execution time in seconds (0 to 3500); series: no cache, directio, no directio.]

24-Processor Sun Fire 6800, 750MHz, 24GB Memory
• 350 MSA queries from GPCR; from 2 sequences per query to over 200 sequences per query


Experiment 3 – Distributed Memory

DataCutter version of ClustalW – v1

[Figure: Average Query Execution Time - v1. x-axis: # processors (1, 2, 4, 8); y-axis: time in seconds (0 to 25); series: no-cache, 0%, 25%, 50%, 75%, and 100% hit ratio.]

[Figure: Breakdown of CLUSTALW Execution Time on 1-processor. x-axis: cache hit ratio (no-cache, 0%, 25%, 50%, 75%, 100%); y-axis: time in seconds (0 to 25); series: pair align, tree gen, prog align.]

16 Pentium III 933MHz, 512MB Memory, 3x100GB IDE disk
• 64 queries, each consisting of 40 unique protein sequences from GPCR
• Average length 450 amino acids per sequence


[Figure: time fractions (0% to 100%) by cache hit ratio (no-cache, 0%, 25%, 50%, 75%, 100%); series: search, compute, insert, tree gen, prog align.]

[Figure: time in seconds (0 to 9) by cache hit ratio (no-cache, 0%, 25%, 50%, 75%, 100%); series: pair align, tree gen, prog align.]




[Figure: Average Query Execution Time - v2 (load balanced). x-axis: number of processors (1, 2, 4, 8); y-axis: time in seconds (0 to 25); series: 0%, 25%, 50%, 75%, and 100% hit ratio.]

[Figure: Speedup of DataCutter version of CLUSTALW. x-axis: # processors (0 to 9); y-axis: speedup (0 to 9); series: linear (ideal speedup), v2 - total, v1 - total, v2 - pair align, v1 - pair align.]

1 ClustalW filter: intra-query parallelization




[Figure: 64 Queries Total Execution Time. x-axis: number of copies of each filter (ClustalW, Hash, Cache, PairAlign): 1, 2, 4, 8; y-axis: time in seconds (0 to 1400); series: 0%, 25%, 50%, 75%, and 100% hit ratio.]

Multiple ClustalW filters: inter-query parallelization

16 Pentium III 933MHz, 512MB Memory, 3x100GB IDE disk
• 8 running a copy of Hash, Cache and PairAlign, 8 running ClustalW
• 64 queries, each consisting of 40 unique protein sequences from GPCR
• Average length 450 amino acids per sequence


Conclusion

• Caching intermediate results
  – Computation-intensive application → data-intensive application
• SMP
• Distributed Memory implementation with DataCutter
