acceleration of ungapped extension in mercury blast

9
Acceleration of ungapped extension in Mercury BLAST Joseph Lancaster a, * , Jeremy Buhler a , Roger D. Chamberlain a,b a Department of Computer Science and Engineering, Washington University in St. Louis, St. Louis, MO, USA b BECS Technology, Inc., St. Louis, MO, USA article info Article history: Available online 20 February 2009 Keywords: BLAST Biosequence analysis Sequence alignment FPGA acceleration abstract The amount of biosequence data being produced each year is growing exponentially. Extracting useful information from this massive amount of data efficiently is becoming an increasingly difficult task. There are many available software tools that molecular biologists use for comparing genomic data. This paper focuses on accelerating the most widely used such tool, BLAST. Mercury BLAST takes a streaming approach to the BLAST computation by offloading the performance-critical sections to specialized hard- ware. This hardware is then used in combination with the processor of the host system to deliver BLAST results in a fraction of the time of the general-purpose processor alone. This paper presents the design of the ungapped extension stage of Mercury BLAST. The architecture of the ungapped extension stage is described along with the context of this stage within the Mercury BLAST system. The design is compact and runs at 100 MHz on available FPGAs, making it an effective and pow- erful component for accelerating biosequence comparisons. The performance of this stage is 25 that of the standard software distribution, yielding close to 50 performance improvement on the complete BLAST application. The sensitivity is essentially equivalent to that of the standard distribution. Ó 2009 Elsevier B.V. All rights reserved. 1. Introduction Databases of genomic DNA and protein sequences are an essen- tial resource for modern molecular biology. Computational search of these databases can show that a DNA sequence acquired in the lab is similar to other sequences of known biological function, revealing both its role in the cell and its history over evolutionary time. A decade of improvement in DNA sequencing technology has driven exponential growth of biosequence databases such as NCBI GenBank [1], which has doubled in size every 12–16 months for the last decade and now stands at over 59 billion characters. Tech- nological gains have also generated more novel sequences, includ- ing entire mammalian genomes [2,3], to keep search engines busy. The most widely used software for efficiently comparing biose- quences to a database is BLAST, the Basic Local Alignment Search Tool [4–6]. BLAST compares a query sequence to a biosequence database to find other sequences that differ from it by a small number of edits (single-character insertions, deletions, or substitu- tions). Because direct measurement of edit distance between se- quences is computationally expensive, BLAST uses a variety of heuristics to identify small portions of a large database that are worth comparing carefully to the query. BLAST is a pipeline of computations that filter a stream of char- acters (the database) to identify meaningful matches to a query. To keep pace with growing databases and queries, this stream must be filtered at increasingly higher rates. One path to higher perfor- mance is to develop a specialized processor that offloads part of BLAST’s computation from a general-purpose CPU. Past examples of processors that accelerate or replace BLAST include the ASIC- based Paracel GeneMatcher TM [7] and the FPGA-based TimeLogic DecypherBLAST TM engine [8]. Recently, we have developed a new accelerator design, the FPGA-based Mercury BLAST engine [9]. Mercury BLAST exploits fine-grained parallelism in BLAST’s algo- rithms and the high I/O bandwidth of current commodity comput- ing systems to deliver 1–2 orders of magnitude speedup over software BLAST on a card suitable for deployment in a laboratory desktop. Mercury BLAST is a multistage pipeline, parts of which are implemented in FPGA hardware. This work describes a key part of the pipeline, ungapped extension, that sifts through pattern matches between query and database and decides whether to per- form a more accurate but computationally expensive comparison between them. Our design illustrates a fruitful approach to acceler- ating variable-length string matching that is robust to character substitutions. The implementation is compact, runs at high clock rates, and can process one pattern match every clock cycle. The rest of the paper is organized as follows. Section 2 gives a ful- ler account of the BLAST computation and illustrates the need to accelerate ungapped extension. Section 3 describes our accelerator 0141-9331/$ - see front matter Ó 2009 Elsevier B.V. All rights reserved. doi:10.1016/j.micpro.2009.02.007 * Corresponding author. E-mail address: [email protected] (J. Lancaster). Microprocessors and Microsystems 33 (2009) 281–289 Contents lists available at ScienceDirect Microprocessors and Microsystems journal homepage: www.elsevier.com/locate/micpro

Upload: joseph-lancaster

Post on 21-Jun-2016

218 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Acceleration of ungapped extension in Mercury BLAST

Microprocessors and Microsystems 33 (2009) 281–289

Contents lists available at ScienceDirect

Microprocessors and Microsystems

journal homepage: www.elsevier .com/ locate/micpro

Acceleration of ungapped extension in Mercury BLAST

Joseph Lancaster a,*, Jeremy Buhler a, Roger D. Chamberlain a,b

a Department of Computer Science and Engineering, Washington University in St. Louis, St. Louis, MO, USAb BECS Technology, Inc., St. Louis, MO, USA

a r t i c l e i n f o a b s t r a c t

Article history:Available online 20 February 2009

Keywords:BLASTBiosequence analysisSequence alignmentFPGA acceleration

0141-9331/$ - see front matter � 2009 Elsevier B.V. Adoi:10.1016/j.micpro.2009.02.007

* Corresponding author.E-mail address: [email protected] (J. Lancaster)

The amount of biosequence data being produced each year is growing exponentially. Extracting usefulinformation from this massive amount of data efficiently is becoming an increasingly difficult task. Thereare many available software tools that molecular biologists use for comparing genomic data. This paperfocuses on accelerating the most widely used such tool, BLAST. Mercury BLAST takes a streamingapproach to the BLAST computation by offloading the performance-critical sections to specialized hard-ware. This hardware is then used in combination with the processor of the host system to deliver BLASTresults in a fraction of the time of the general-purpose processor alone.

This paper presents the design of the ungapped extension stage of Mercury BLAST. The architecture ofthe ungapped extension stage is described along with the context of this stage within the Mercury BLASTsystem. The design is compact and runs at 100 MHz on available FPGAs, making it an effective and pow-erful component for accelerating biosequence comparisons. The performance of this stage is 25� that ofthe standard software distribution, yielding close to 50� performance improvement on the completeBLAST application. The sensitivity is essentially equivalent to that of the standard distribution.

� 2009 Elsevier B.V. All rights reserved.

1. Introduction

Databases of genomic DNA and protein sequences are an essen-tial resource for modern molecular biology. Computational searchof these databases can show that a DNA sequence acquired in thelab is similar to other sequences of known biological function,revealing both its role in the cell and its history over evolutionarytime. A decade of improvement in DNA sequencing technology hasdriven exponential growth of biosequence databases such as NCBIGenBank [1], which has doubled in size every 12–16 months forthe last decade and now stands at over 59 billion characters. Tech-nological gains have also generated more novel sequences, includ-ing entire mammalian genomes [2,3], to keep search engines busy.

The most widely used software for efficiently comparing biose-quences to a database is BLAST, the Basic Local Alignment SearchTool [4–6]. BLAST compares a query sequence to a biosequencedatabase to find other sequences that differ from it by a smallnumber of edits (single-character insertions, deletions, or substitu-tions). Because direct measurement of edit distance between se-quences is computationally expensive, BLAST uses a variety ofheuristics to identify small portions of a large database that areworth comparing carefully to the query.

ll rights reserved.

.

BLAST is a pipeline of computations that filter a stream of char-acters (the database) to identify meaningful matches to a query. Tokeep pace with growing databases and queries, this stream mustbe filtered at increasingly higher rates. One path to higher perfor-mance is to develop a specialized processor that offloads part ofBLAST’s computation from a general-purpose CPU. Past examplesof processors that accelerate or replace BLAST include the ASIC-based Paracel GeneMatcherTM [7] and the FPGA-based TimeLogicDecypherBLASTTM engine [8]. Recently, we have developed a newaccelerator design, the FPGA-based Mercury BLAST engine [9].Mercury BLAST exploits fine-grained parallelism in BLAST’s algo-rithms and the high I/O bandwidth of current commodity comput-ing systems to deliver 1–2 orders of magnitude speedup oversoftware BLAST on a card suitable for deployment in a laboratorydesktop.

Mercury BLAST is a multistage pipeline, parts of which areimplemented in FPGA hardware. This work describes a key partof the pipeline, ungapped extension, that sifts through patternmatches between query and database and decides whether to per-form a more accurate but computationally expensive comparisonbetween them. Our design illustrates a fruitful approach to acceler-ating variable-length string matching that is robust to charactersubstitutions. The implementation is compact, runs at high clockrates, and can process one pattern match every clock cycle.

The rest of the paper is organized as follows. Section 2 gives a ful-ler account of the BLAST computation and illustrates the need toaccelerate ungapped extension. Section 3 describes our accelerator

Page 2: Acceleration of ungapped extension in Mercury BLAST

282 J. Lancaster et al. / Microprocessors and Microsystems 33 (2009) 281–289

design and details its hardware architecture. Section 4 evaluatesthe sensitivity and throughput of our implementation. Section 5describes related work, and Section 6 concludes.

2. Background: the BLAST computation

BLAST’s search computation is organized as a three-stage pipe-line, illustrated in Fig. 1. The pipeline is initialized with a query se-quence, after which a database is streamed through it to identifymatches to that query. This paper focuses on BLASTN, the form ofBLAST used to compare DNA sequences; however, many of the de-tails described here also apply to BLASTP, the form used for proteinsequences.

The first pipeline stage, word matching, detects substrings offixed length w in the stream that perfectly match a substring ofthe query; typically, w ¼ 11 for DNA. We refer to these shortmatches as w-mers. Each matching w-mer is forwarded to the sec-ond stage, ungapped extension, which extends the w-mer to eitherside to identify a longer pair of sequences around it that matchwith at most a small number of mismatched characters. Theselonger matches are high-scoring segment pairs (HSPs), or ungappedalignments. Finally, every HSP that has both enough matches andsufficiently few mismatches is passed to the third stage, gappedextension, which uses the Smith–Waterman dynamic programmingalgorithm [10] to extend it into a gapped alignment, a pair of similarregions that may differ by arbitrary edits. BLAST reports only gap-ped alignments with many matches and few edits.

To quantify the computational cost of each stage of BLASTN, weprofiled the standard BLASTN software published by the NationalCenter for Biological Information (NCBI), v2.3.2, on a comparisonof 30 randomly selected 25,000 base queries from the human gen-ome to a database containing the nonrepetitive fraction of themouse genome (1:16� 109 characters). NCBI BLAST was profiledon a 2.8 GHz Intel P4 workstation running Linux. The search spent

w−mers ungappematching

wordextensio

stage 1 stage 2

sequencesdatabase

Fig. 1. NCBI BLA

−33−3−

C

5Query

C A G T G A A C G A T G T G A AT

2 11 1 1 1

1 3 6 5111Comparison Score

Running Score

NCBI BLAST

CQuery

A G T G A T A C G A T G T G A A1 1 1 1Comparison Score

Mercury BLAST

A T C C T G T C G A T C G G T A C

C C T G A T G A T C G G T A C

1

A TSubject

Subject

X−drop = 10

A

CC

0 43−3 3

133

2− −−−1

−−−−4

Fig. 2. Examples of NCBI and Mercury ungapped extension. The parameters used here abegins at the end of the w-mer and extends left. The extension stops when the runningcomputation is then performed in the other direction. The final substring is the concaungapped extension begins at the leftmost base of the window (indicated by brackets) anthe algorithms gave the same result, but this is not necessarily the case in general.

83.9% of time in word matching, 15.9% in ungapped extension, andthe remaining time in gapped extension. While execution time var-ied with query length (roughly linearly), the per-stage percent exe-cution times are relatively invariant with query length.

Our profile illustrates that, to achieve more than about a 6�speedup of NCBI BLASTN on large genome comparisons, one mustaccelerate both word matching and ungapped extension. MercuryBLASTN therefore accelerates both these stages, leaving stage 3to NCBI’s software. Our previous work [9] described how we accel-erate word matching. This work describes Mercury BLAST’s ap-proach to ungapped extension.

Profiling of typical BLASTP computations reveals that ungappedextension accounts for an even larger fraction of the overall com-putational cost. The design described here for BLASTN ports withminimal changes to become an efficient stage 2 for BLASTP as well.

3. Design description

The purpose of extending a w-mer is to determine, as quicklyand accurately as possible, whether the w-mer arose by chancealone, or whether it may indicate a significant match. Ungappedextension must decide whether each w-mer emitted from wordmatching is worth inspecting by the more computationally inten-sive gapped extension. It is important to distinguish spurious w-mers as early as possible in the BLAST pipeline because later stagesare increasingly more complex. There exists a delicate balance be-tween the stringency of the filter and its sensitivity, i.e. the num-ber of biologically significant alignments that are found. A highlystringent filter is needed to minimize time spent in fruitless gap-ped extension, but the filter must not throw out w-mers that legit-imately identify long query-database matches with fewdifferences. For FPGA implementation, this filtering computationmust also be parallelizable and simple enough to fit in a limitedarea.

dn extension

gappedHSPs

stage 3

finalalignments

ST pipeline.

(score = 8)

−3 −3

maximal scoring substring(score = 8)

AGA

−mer

T C G C A T T T C A C A C A ATA T G

2 11111

11

21 1 1

21

AGA T

5−mer

maximal scoring substring

C A T G C A T T T C A C A G C A T A1 1 1 1 1 1 1

AG T C T T G C A A A T A G T T C

AGA T C T T G C A A A G T C A A G T G T C

A G C A G

1 1

34 0 1 4 7− 101−− −−3 3 3 3− − − −

1−−3

re Lw ¼ 19, w ¼ 5, a ¼ 1, b ¼ �3, and X-drop ¼ 10. NCBI BLAST ungapped extensionscore drops 10 below the maximum score (as indicated by the arrows). The same

tenation of the best substrings from the left and right extensions. Mercury BLASTd moves right, calculating the best-scoring substring in the window. In this example,

Page 3: Acceleration of ungapped extension in Mercury BLAST

Fig. 3. Mercury BLASTN Extension algorithm pseudocode.

J. Lancaster et al. / Microprocessors and Microsystems 33 (2009) 281–289 283

Mercury BLASTN implements stage 2 guided in part by lessonslearned from the implementation of stage 1 described in [9]. It de-ploys an FPGA ungapped extension stage that acts as a prefilter infront of NCBI BLASTN’s software ungapped extension. This designexploits the speed of FPGA implementation to greatly reduce thenumber of w-mers passed to software while retaining the flexibil-ity of the software implementation on those w-mers that pass. Aw-mer must pass both hardware and software ungapped extensionbefore being released to gapped extension. Here, we are exploitingthe streaming nature of the application as a whole. Since the per-formance focus is on overall throughput, not latency, adding anadditional processing stage that is deployed on dedicated hard-ware is a useful technique. In the next sections, we briefly describeNCBI BLASTN’s software ungapped extension stage, then describeMercury BLASTN’s hardware stage. Fig. 2 shows an example illus-trating the two different approaches to ungapped extension.

3.1. NCBI BLASTN ungapped extension

NCBI BLASTN’s ungapped extension of a w-mer into an HSP runsin two steps. The w-mer is extended back toward the beginnings ofthe two sequences, then forward toward their ends. As the HSP ex-tends over each character pair, that pair receives a reward þa if thecharacters match or a penalty �b if they mismatch. An HSP’s scoreis the sum of these rewards and penalties over all its pairs. The endof the HSP in each direction is chosen to maximize the total scoreof that direction’s extension. If the final HSP scores above a user-defined threshold, it is passed on to gapped extension.

For long sequences, it is useful to terminate extension beforereaching the ends of the sequences, especially if no high-scoringHSP is likely to be found. BLASTN implements early terminationby an X-drop mechanism. The algorithm tracks the highest scoreachieved by any extension of the w-mer thus far; if the currentextension scores at least X below this maximum, further extensionin that direction is terminated.

Ungapped extension with X-dropping allows BLASTN to recoverHSPs of arbitrary length while limiting the average search space fora given w-mer. However, because the regions of extension can inprinciple be quite long, this heuristic is not as suitable for fastimplementation in an FPGA. Note that even though extension inboth directions can be done in parallel, this was not sufficient par-allelism to achieve the speedups we desired.

3.2. Mercury BLASTN’s approach

Mercury BLASTN takes a different, more FPGA-friendly ap-proach to ungapped extension. Extension for a given w-mer is per-formed in a single forward pass over a fixed-size window. Thesefeatures of our approach simplify hardware implementation andexpose opportunities to exploit fine-grain parallelism and pipelin-ing that are not easily accessed in NCBI BLASTN’s algorithm. Ourextension algorithm is given as pseudocode in Fig. 3.

Extension begins by calculating the limits of a fixed window oflength Lw, centered on the w-mer, in both query and databasestream. The appropriate substrings of the query and the streamare fetched into buffers. Once these substrings are buffered, theextension algorithm begins.

Extension implements a dynamic programming recurrence thatsimultaneously computes the start and end of the best HSP in thewindow. First, the score contribution of each character pair in thewindow is computed, using the same bonus þa and penalty �b asthe software implementation. These contributions can be calcu-lated independently in parallel for each pair. Then, for each posi-tion i of the window, the recurrence computes the score ci of thebest (highest-scoring) HSP that terminates at i, along with the po-sition Bi at which this HSP begins. These values can be updated for

each i in constant time. The algorithm also tracks Ci, the score ofthe best HSP ending at or before i, along with its endpoints Bmax

and Emax. Note that CLw is the score of the best HSP in the entirewindow. If CLw is greater than a user-defined score threshold, thew-mer passes the prefilter and is forwarded to software ungappedextension.

Two subtleties of Mercury BLASTN’s algorithm should be ex-plained. First, our recurrence requires that the HSP found by thealgorithm pass through its original matching w-mer; a higher-scor-ing HSP in the window that does not contain this w-mer is ignored.This constraint ensures that, if two distinct biological features ap-pear in a single window, the w-mers generated from each have achance to generate two independent HSPs. Otherwise, both w-mersmight identify only the feature with the higher-scoring HSP, caus-ing the other feature to be ignored. Second, if the best HSP inter-sects the bounds of the window, it is passed on to softwareregardless of its score. This heuristic ensures that HSPs that mightextend well beyond the window boundaries are properly found bysoftware, which has no fixed-size window limits, rather than beingprematurely eliminated.

3.3. Implementation

As the name implies, Mercury BLAST has been targeted to theMercury system [11]. The Mercury system is a prototyping infra-structure designed to accelerate disk-based computations. Its orga-nization is illustrated in Fig. 4. Data flows off the disks into an FPGA.The FPGA provides reconfigurable logic that has its function speci-fied via HDL. Results of the processing performed on the FPGA aredelivered to the processor. By delivering the high-volume data di-rectly to the FPGA, the processor can be relieved of the requirement

Page 4: Acceleration of ungapped extension in Mercury BLAST

processordiskcontroller

diskdata

toprocessor

configurationstore

reconfigurablelogic

network

Fig. 4. Mercury system organization.

284 J. Lancaster et al. / Microprocessors and Microsystems 33 (2009) 281–289

of handling the bulk of the original data set. Associated with thereconfigurable logic is a configuration store that maintains a num-ber of FPGA configurations.

matchingword

retliferpretliferpextensionmatching

word

stage 1a stage 1b stage 2a

SRAM

FPGA

ungappeddatabase

Fig. 5. Overview of Mercury BLASTN

w−mers /commands

ModuleLookupWindowExtension

Controllerdatabase

Fig. 6. Ungapped extens

..

.

query

query offsetwindowcompute

start

database ...

db offset

database bu

query buff

Fig. 7. Top-level diagram of the window lookup module. The query is streamed in at the bbuffer which holds the necessary portion of the stream needed for extension. The contrrequesting the window from the buffer modules, and finally constructing the final wind

Fig. 5 shows the organization of the application pipeline forBLASTN. The input comes from the disk, is delivered to the hard-ware word matching module (which also employs a prefilter),and then passes into the ungapped extension prefilter. The outputof the prefilter goes to the processor for the remainder of stage 2(ungapped extension) and stage 3 (gapped extension). The prefilteralgorithm lends itself to hardware implementation despite thesequential expression of the computation in Fig. 3.

The ungapped extension prefilter design is fully pipelined inter-nally and accepts one match per clock. Commands are supported toconfigure parameters such as match score, mismatch score, andcutoff thresholds. Thus, the trade-off between sensitivity andthroughput is left to the discretion of the user. Since the w-mermatching stage generates more output than input, two indepen-dent data paths are used for input into stage 2. The w-mers andcommands are sent on one path, and the database is sent on theother. The module is organized as three pipelined stages as illus-trated in Fig. 6.

extension extension

stage 2b stage 3

processor

gapped alignmentsungapped

hardware/software deployment.

rotarapmoCrotarapmoCThresholdBase

. . .

Scoring Stages

Scoring Module

ion prefilter design.

seed out

outputformat query window

db windowffer

er

eginning of each BLAST search. A portion of the database stream flows into a circularoller takes in a w-mer and is responsible for calculating the bounds of the window,ow from the raw output of the buffer.

Page 5: Acceleration of ungapped extension in Mercury BLAST

J. Lancaster et al. / Microprocessors and Microsystems 33 (2009) 281–289 285

The controller parses the input to demultiplex the shared com-mand and data stream. All w-mer matches and the database flowthrough the controller into the window lookup module. This mod-ule is responsible for fetching the appropriate substrings of thestream and the query. Fig. 7 shows the overall structure of themodule. The query is stored on chip using the dual-ported Block-RAMs on the FPGA. The query is loaded once at the beginning ofthe BLAST computation and is fixed until the end of the databaseis reached. The size of the query is limited to the amount of Block-RAM that is allocated for buffering it. In the current implementa-tion, a maximum query size of 65,536 is supported using 8BlockRAMs. The subset of the database stream needed for scoringis retained in a circular buffer also constructed using BlockRAMs.Since the w-mer generation is performed in the first hardwarestage, only a relatively small amount of the database stream needsto be buffered to accommodate extension requests. The databasebuffer is large enough to accommodate w-mers whose positions

αTo AuxiliaryData Registers

G A A G T C C

A C A T G T A

CMPCMP . . .−β

Fig. 8. Illustration of the base comparator for BLASTN. The base comparator stagecomputes the scores of every base pair in the window in parallel. These scores arestored in registers which are fed as input to the systolic array of scorer stages.

Score Table Lookup

A P G T R R E

A P G T L Q G

. . . to scoringstages

. . .

−34

2

−1−2−1

base score10−bits5−bits

5−bits

Fig. 9. Illustration of the first stage of the scoring for BLASTP. The amino acid pairsare used to index a set of parallel lookup tables that retain the pair score.

Bas

e C

ompa

rato

rIn

itial

izat

ion

PipelineScoring

PipelineScoring

Stage

Stage

Fig. 10. Depiction of the systolic array of scoring stages. The dark registers hold data thatcycle. The light registers are the pipeline calculation registers used to transfer the state ocontains a different w-mer in the pipeline.

in the stream are out of order by up to a fixed amount. Whilenot critical for BLASTN, this extra capacity is important in theBLASTP implementation, since the w-mers from the word match-ing stage can be out of order.

The window lookup module is organized as a 6-stage pipeline.The first stage of the pipeline calculates the beginning of the queryand database windows based on the incoming w-mer and a config-urable window size. The offset of the beginning of the window isthen passed to the buffer modules, which begin the task of readinga superset of the window from the BlockRAMs. One extra word ofdata must be retrieved from the BlockRAMs because there is noguarantee that the window boundaries will fall on a word bound-ary. Hence, one extra word is fetched on each lookup so that theexact window can be constructed from a temporary buffer holdinga window size worth of data plus the extra word. The next fourstages of the pipeline move the input data in lock step with thebuffer lookup process and ensure that pipeline stalls are handledin a correct fashion. In the final stage, the superset of the queryand database windows arrive to the format output module. Thecorrect window of the buffers is extracted and registered as theoutput.

After the window is fetched, it is passed into the scoring mod-ule. The scoring module implements the recurrence of the exten-sion algorithm. Since the computation is too complex to be donein a single cycle, the scorer is extensively pipelined.

Fig. 8 illustrates the first stage of the scoring pipeline. Thisstage, the base comparator, assigns a comparison score to each basepair in the window. For BLASTN, the base comparator assigns a re-ward a to each matching base pair and a penalty �b to each mis-matching pair. All comparison scores are calculated in a singlecycle, using Lw comparators. After the scores are calculated, theyare stored for use in later stages of the pipeline.

The score computation is the same for BLASTP, except that thereare many more choices for the score. In BLASTP, the alphabet con-sists of 20 amino acids rather than the 4 bases of BLASTN. As a re-sult, amino acids are represented by 5 bits each. The a and �b arereplaced with a value retrieved from a lookup table that is indexedby a pair of of amino acids. This is shown in Fig. 9. The remainder ofthe stage 2 design is common to both BLASTN and BLASTP.

The scoring module is arranged as a classic systolic array. Thedata from the previous stage are read on each clock, and resultsare output to the following stage on the next clock. As Fig. 6 shows,storage for comparison scores in successive pipeline stagesdecreases in every stage. This decrease is possible because the

To ThresholdComparator

PipelineScoring

PipelineScoring

Stage

Stage

is independent of the state of the recurrence. The data flow left to right on each clockf the recurrence from a previous scoring stage to the next. Each column of registers

Page 6: Acceleration of ungapped extension in Mercury BLAST

Pipeline Calculation

B

Γ

Bmax

Emax

Pipeline CalculationRegisters Out

Pipeline Scoring

Stage i

γ

Emax

Bmax

Γ

γ

OffsetDB

OffsetQuery

Auxiliary Data Registers

Registers In

B

α / β

from

sco

ring

stag

e i−

1

to s

corin

g st

age

i+1

Fig. 11. Detailed view of an individual scoring stage.

Table 1Performance model parameters. Query size is 25 kbases (double stranded), and thepass fractions for stages 2a and 2b are with the most permissive cutoff score of 16.

Parameter Value Units Meaning

p1 0.0205 Matches/base Stage 1 pass fraction [9]p2a 0.0043 Alignments/match Stage 2a pass fractionp2b 0.0133 Alignments out/alignment

inStage 2b pass fraction

t2b 0.265 ls/alignment input Stage 2b execution time [9]t3 60.4 ls/alignment input Stage 3 execution time [9]Tput1 1.4 Gbases/s Stage 1 throughput [9]Tput2a 4.9 Gbases/s Stage 2a throughputTput2b3 10.6 Gbases/s Processor throughputTputoverall 1.4 Gbases/s Overall pipeline

throughput

286 J. Lancaster et al. / Microprocessors and Microsystems 33 (2009) 281–289

comparison score for window position i is consumed in the ith pipe-line stage and may then be discarded, since later stages inspect onlywindow positions >i. This structure of data movement is shown inmore detail in Fig. 10. The darkened registers hold the necessarycomparison scores for the w-mer being processed in each pipelinestage. Note that for ease of discussion, Fig. 10 shows a single com-parison score being dropped for each scorer stage; however, theactual implementation consumes two comparison scores per stage.

Fig. 11 shows the interface of an individual scoring stage. Thevalues shown entering the top of the scoring stage are the stateof the dynamic programming recurrence as propagated from theprevious scoring stage. These values are read as input to combina-tional logic, and the results are stored in the output registersshown on the right. Each scoring stage in the pipeline containscombinational logic to implement the dynamic programmingrecurrence shown in lines 13–22 of the algorithm described inFig. 3. The data entering from the left of the module are the com-parison scores and the database and query positions for a givenw-mer, which are independent of the state of the recurrence. In or-der to sustain a high clock frequency, each scoring stage computesonly two iterations of the loop per clock cycle, resulting in Lw/2scoring stages for a complete calculation. Hence, there are Lw/2independent w-mers being processed simultaneously in the scor-ing stages of the processor when the pipe is full.

The final pipeline stage of the scoring module is the thresholdcomparator. The comparator takes the fully scored segment andmakes a decision to discard or keep the w-mer. This decision isbased on the score of the alignment relative to a user-definedthreshold, T, as well as the position of the highest-scoring sub-string. If the maximum score is above the threshold, the w-meris passed on. Additionally, if the maximal scoring substring inter-sects either boundary of the window, the w-mer is also passedon, regardless of the score. This second condition ensures that ahigh-scoring alignment that extends beyond the window boundaryis not spuriously discarded. If neither of the above conditionsholds, the w-mer is discarded.

In the current design, the window size, Lw, is set at the time thedesign is synthesized, while the scoring parameters, a and �b, andscore threshold, T, are set at run time.

4. Results

There are many aspects to the performance of the ungappedextension stage of the BLAST pipeline. First is the individual stage

throughput. The ungapped extension prefilter must run fast en-ough not to be a bottleneck in the overall pipeline. Second, theungapped extension stage must effectively filter as many w-mersas possible, since downstream stages are even more computation-ally expensive. Finally, the above must be achieved without inad-vertently dropping a large percentage of the significantalignments (i.e., the false negative rate must be limited). We willdescribe throughput performance first for the ungapped extensionprefilter alone, then for the entire BLASTN application. Afterdescribing throughput, we discuss the sensitivity of the pipelineto significant sequence alignments.

4.1. Performance

The throughput of the Mercury BLASTN ungapped extensionprefilter is a function of the data input rate. The ungapped exten-sion stage accepts one w-mer per clock and runs at 100 MHz.Hence the maximum throughput of the prefilter is 1 inputmatch/cycle � 100 MHz ¼ 100 Mmatches/s. This gives a speedupof 25� over the software ungapped extension executed on thebaseline system described earlier.

To explore the impact that stage 2a performance has on theoverall streaming application, we use the following mean-valueperformance model. Overall pipeline throughput for the deploy-ment of Fig. 5 is

Tputoverall ¼ minðTput1; Tput2a; Tput2b3Þ;

where Tput1 is the maximum throughput of stage 1 (both 1a and1b) executing on the FPGA, Tput2a is the maximum throughput ofstage 2a executing on the FPGA (concurrently with stage 1), andTput2b3 is the maximum throughput of stages 2b and 3 executingon the processor.

For the above expression to be correct, all of the throughputsmust be normalized to the same units; we will normalize to inputDNA bases per unit time. This normalization can be accomplishedwith knowledge of the fractions of input bases that survive each ofthe stages of filtering. Call the w-mers from stage 1 ‘‘matches” andthe HSPs from stages 2a and 2b ‘‘alignments.” We define p1 as thepass fraction from stage 1 (matches out per base in), p2a as the passfraction from stage 2a (alignments out per match in), and p2b as thepass fraction from stage 2b (alignments out per alignment in).With this information, we compute the normalized throughput ofstage 2a as Tput2a ¼(100 Mmatches/s)/p1. Finally, we need the timerequired to process alignments in software (both in stage 2b and instage 3). We define t2b as the execution time per input alignmentfor the stage 2b software and t3 as the execution time per inputalignment for the stage 3 software. The normalized throughputof the stages executing in software can then be expressed as

Tput2b3 ¼1

p1p2a t2b þ p2bt3ð Þ :

Page 7: Acceleration of ungapped extension in Mercury BLAST

99.0099.1099.2099.3099.4099.5099.6099.7099.8099.90

100.00

15 16 17 18 19 20 21

Score Threshold

%Fo

und

6496128256

Fig. 12. Mercury BLAST sensitivity. The four curves represent window lengths of 64,96, 128, and 256 bases.

J. Lancaster et al. / Microprocessors and Microsystems 33 (2009) 281–289 287

Table 1 provides the above parameters and their values for a 25-kbase, double-stranded query. The values of p1, t2b, t3, and Tput1

come from [9]. Clearly, the overall throughput is limited by thecapacity of stage 1.

It is important to note that performance of stage 1 in the currentimplementation of Mercury BLASTN is limited by the input rate ofthe I/O subsystem. Hence, as newer interconnect technologies,such as PCI Express, become more readily available, the throughputof our system will increase significantly. It is likely that the nextgeneration of the Mercury system will use PCI Express to delivereven higher throughput to the hardware accelerator. Since thedownstream pipeline stages are clearly capable of sustaining high-er throughputs, any improvement in stage 1 throughput will trans-late directly into greater overall throughput. Even with the I/Olimitations of the existing infrastructure, Mercury BLASTN has aspeedup over software NCBI BLASTN of approximately 48� (for a25-kbase query, NCBI BLASTN has a throughput of 29 Mbases/s[9] vs. Mercury BLASTN throughput of 1.4 Gbases/s).

4.2. Quality of results: sensitivity

The quality of BLAST’s results is measured primarily by its sen-sitivity, the number of statistically significant gapped alignmentsthat it discovers. Because we are trying to improve the perfor-mance of BLAST without sacrificing sensitivity, we compare Mer-cury BLAST’s sensitivity to that of the NCBI BLAST software,taking the latter as our standard of correctness.

Formally, sensitivity is defined as follows:

Sensitivity ¼ #New Alignments#Original Alignments

;

where ‘‘# New Alignments” is the number of statistically significantgapped alignments discovered by Mercury BLAST, and ‘‘# OriginalAlignments” is the number of such alignments returned from NCBIBLAST given the same measure of significance. Measurements ofsensitivity vary depending on how stringently the user chooses tofilter NCBI BLAST’s output. The numbers reported here correspondto a BLAST E-value of 10�5, which is reasonably permissive forDNA similarity search.

Two parameters of the ungapped extension stage affect thequality of its output. First, the score cutoff threshold used affectsthe number of alignments that are produced. If the cutoff thresholdis set too high, the filter will incorrectly reject a large number ofstatistically significant alignments. Conversely, if the threshold isset too low, the filter will generate many false positives and willnegatively affect system throughput. Second, the length of the win-dow can affect the number of false negatives that are produced. Inparticular, alignments that have enough mismatches before thewindow boundary to fall below the score threshold but have manymatches immediately outside this boundary will be incorrectly re-jected. The larger the window size, the higher the score thresholdthat can be used without diminishing the quality of results.

We measured the sensitivity of BLASTN with our ungappedextension design using a modified version of the NCBI BLASTN codebase. A software emulator of the new ungapped extension algo-rithm (stage 2a) was placed in front of the standard NCBI ungappedextension stage (stage 2b). We used BLAST with this emulator tocompare sequences extracted from the human and mouse gen-omes. The queries were statistically significant samples of varioussizes (10 kbase, 100 kbase, and 1 Mbase). The database stream wasthe mouse genome with low-complexity and repetitive sequencesremoved.

To compare the results of NCBI BLAST to those of MercuryBLAST, a Perl script was written to quantify how many gappedalignments produced by each implementation were also producedby the other. The test for whether an alignment exists in both sets

of results is more complicated than a simple equality check, sincethe two BLASTs can produce highly similar (and equally useful) butnon-identical alignments. Gapped alignments were therefore com-pared using the following overlap metric. For each alignment A out-put from one BLAST run, the overlap metric determines if anyalignment A0 from the other run overlaps A in at least a fraction fof its bases in each sequence. If one gapped alignment overlaps an-other by more than f, it is considered to be the same alignment forpurposes of sensitivity measurement. In our experiments, f ¼ 1=2,an arbitrary but reasonable fraction given that in practice almostall overlaps are complete (100% overlap).

Fig. 12 illustrates the sensitivity of the Mercury BLAST systemwhen the hardware ungapped filter is combined with the NCBIungapped extension filter to make up the full stage 2 (i.e., the con-figuration of Fig. 5). Using window sizes of 64 to 256 bases, thecombined filters yielded sensitivity in excess of 99.6%; in absoluteterms, the worst observed false negative rate was only 36 falsenegatives out of 11616 significant alignments. Larger window sizesresulted in higher sensitivity, up to 100%; however, increasing thethe window size above 96 yielded little benefit for the additionallogic consumed. A window size of 64 reduced sensitivity below99.9%, but this loss can be compensated by lowering the scorethreshold. We conclude that using the hardware prefilter in thisconfiguration does not noticeably degrade the quality of results.

4.3. Efficiency of filtration: specificity

Specificity measures how effectively the ungapped extensionstage discards insignificant or chance matches from its input. Highspecificity is desirable for computational efficiency, since fewermatches out of ungapped extension lowers the computational bur-den on software gapped extension. An effective filter exhibits bothhigh sensitivity and high specificity.

For BLASTN ungapped extension, specificity is measured asfollows:

Specificity ¼ 1� ð#alignments out=#matches inÞ;

which for stage 2a is 1� p2a and for the complete stage 2 is1� p2ap2b. To quantify the specificity of our implementation, wegathered statistics during the aforementioned experiments onhow many w-mers (matches) arrived at the ungapped extensionstage, and how many of these produced ungapped alignments thatpassed the stage’s score threshold.

Fig. 13 shows the specificity of Mercury BLASTN ungappedextension for various score thresholds and window lengths. In thisgraph, the ungapped extension stage consists of the hardware filter

Page 8: Acceleration of ungapped extension in Mercury BLAST

9999.199.299.399.499.599.699.799.899.9100

15 16 17 18 19 20 21Score Threshold

%Fi

ltere

d

6496128256Original

Fig. 13. Mercury BLAST specificity for stage 2a alone. The four curves representwindow lengths of 64, 96, 128, and 256 bases. The individual point represents thevalue for NCBI BLAST stage 2.

99.9900

99.9920

99.9940

99.9960

99.9980

100.0000

15 16 17 18 19 20 21

Score Threshold

%Fi

ltere

d

6496128256

Fig. 14. Mercury BLAST specificity for complete stage 2. The four curves representwindow lengths of 64, 96, 128, and 256 bases.

288 J. Lancaster et al. / Microprocessors and Microsystems 33 (2009) 281–289

alone (i.e., stage 2a), without NCBI BLAST’s software ungapped fil-ter. As the score threshold increases, the hardware passes fewerw-mers, and so the specificity of the filter increases. Specificity isnot strongly influenced by window size. At a score threshold of18, the hardware prefilter is approximately as specific as the origi-nal NCBI BLASTN software’s ungapped extension stage.

Fig. 14 shows the specificity of the combined hardware filterand software ungapped extension filter. As expected, the specific-ity is essentially constant, with only a miniscule increase at thehighly stringent score threshold of 20.

4.4. Resource utilization

As mentioned earlier, the Mercury ungapped extension filter isparameterizable and can be configured for different window

Table 2FPGA resource usage and utilization of the hardware ungapped extension stage inisolation. The three rows show the resource usage for window sizes of 64, 96, and 128bases on a Xilinx Virtex-II 6000 FPGA.

Window size Slices used (% Utilized) BlockRAMs used (% Utilized)

64 9174 (27%) 13 (9%)96 11700 (35%) 18 (12%)128 15226 (45%) 18 (12%)

lengths. Currently, the filter exists with window lengths of 64,96, and 128 bases. Table 2 gives resource usage information foreach of these design points. For comparison, the full MercuryBLASTN design, including both stages 1 and 2a, utilizes approxi-mately 54% of the logic cells and 134 BlockRAMs of our FPGA plat-form with a stage 2a window size of 64.

5. Related work

Several research groups have implemented systems to acceler-ate similarity search, both in software alone and in hardware.

Software tools exist that seek to accelerate BLASTN-like compu-tations through algorithmic improvements. MegaBLAST [12] isused by NCBI as a faster alternative to BLASTN; it explicitly sacri-fices substantial sensitivity relative to BLASTN in exchange for im-proved running time. The SSAHA [13] and BLAT [14] packagesachieve higher throughput than BLASTN by requiring that the en-tire database be indexed offline before being used for searches.By eliminating the need to scan the database, these tools canachieve more than an order of magnitude speedup versus BLASTN;however, they must trade off between sensitivity and space fortheir indices and so in practice are less sensitive. In contrast, Mer-cury BLASTN aims for at least BLASTN-equivalent sensitivity.

Other software approaches, such as DASH [15] and PatternHun-ter II [16], achieve both faster search and higher sensitivity com-pared to BLASTN using alternative forms of pattern matching anddynamic programming extension. DASH’s reported speedup overBLASTN is less than 10-fold for queries of 1500 bases, and it isnot clear how it performs at our query sizes, which are an orderof magnitude larger. DASH’s authors have also reported on a preli-minary FPGA design for their algorithm [17], while PatternHunterII achieves only a two-fold reported speedup relative to BLASTN, al-beit with substantially greater sensitivity.

In hardware, numerous implementations of the Smith–Water-man dynamic programming algorithm have been reported in theliterature, using both non-reconfigurable ASIC logic [18,19] andreconfigurable logic [20–22]. These implementations focus onaccelerating gapped alignment, which is heavily loaded in proteo-mic BLAST comparisons but takes only a small fraction of runningtime in genomic BLASTN computations. Our work instead focuseson accelerating the bottleneck stages of the BLASTN pipeline,which reduces the data sent to later stages to the point thatSmith–Waterman acceleration is not necessary.

While one could in principle dispense with the pattern match-ing and ungapped extension stages of BLASTN given a sufficientlyfast Smith–Waterman implementation, no such implementationis likely to be feasible with current hardware. The projected datarate of 1.4 Gbases/s for a 25-kbase query, if achieved by a Smith–Waterman implementation, would imply computation of around1014 dynamic programming matrix cells per second. In contrast,existing FPGA implementations report rates of less than 1010 cellsper second.

High-end commercial systems have been developed to acceler-ate or replace BLAST. The Paracel GeneMatcherTM [7] relies on non-reconfigurable ASIC logic, which is inflexible in its application andcannot easily be updated to exploit technology improvements. Incontrast, FPGA-based systems can be reprogrammed to tackle di-verse applications and can be redeployed on newer, faster FPGAswith minimal additional design work. RDisk [23] is one suchFPGA-based approach, which claims 60 Mbases/s throughput forstage 1 of BLAST using a single disk.

Two commercial products that do not rely on ASIC technologyare BLASTMachine2TM from Paracel [7] and DeCypherBLASTTM fromTimeLogic [8]. The highest-end 32-CPU Linux cluster BLASTMa-chine2TM performs BLASTN with a throughput of 2.93 Mbases/sfor a 2.8-Mbase query. The DeCypherBLASTTM solution uses an

Page 9: Acceleration of ungapped extension in Mercury BLAST

J. Lancaster et al. / Microprocessors and Microsystems 33 (2009) 281–289 289

FPGA-based approach to improve the performance of BLASTN. Thissolution has throughput rate of 213 kbases/s for a 16-Mbase query.

6. Conclusion

Biosequence similarity search can be accelerated practically bya pipeline designed to filter high-speed streams of character data.We have described a portion of our Mercury BLASTN search accel-erator, focusing on our design for its performance-critical ungap-ped extension stage. Our highly parallel and pipelinedimplementation yields results comparable to those obtained fromsoftware BLASTN while running 25� faster and enabling the entireaccelerator to run close to 50� faster. Currently, a functionallycomplete implementation of the accelerator has been deployedon the Mercury system and is undergoing performance tuning.

Our ungapped extension design is suitable not only for theDNA-focused BLASTN but also for other forms of BLAST, particu-larly the BLASTP algorithm used on proteins. We are using thisstage in our in-progress design for Mercury BLASTP and expect thatit will prove similarly successful in that application.

Acknowledgements

This research is supported by NIH/NGHRI grant 1 R42HG003225-01 and NSF grants DBI-0237902 and CCF-0427794. R.Chamberlain is a principal in BECS Technology, Inc. This work isan extension of the previously published paper, J. Lancaster, J. Buh-ler, and R. Chamberlain, ‘‘Acceleration of Ungapped Extension inMercury BLAST,” in Proc. of 7th Workshop on Media and StreamingProcessors, November 2005.

References

[1] National Center for Biological Information, Growth of GenBank, 2002. <http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html>.

[2] E.S. Lander et al., Initial sequencing and analysis of the human genome, Nature409 (2001) 860–921.

[3] R.H. Waterston et al., Initial sequencing and comparative analysis of the mousegenome, Nature 420 (2002) 520–562.

[4] S.F. Altschul, W. Gish, Local alignment statistics, Methods: A Companion toMethods in Enzymology 266 (1996) 460–480.

[5] S.F. Altschul, W. Gish, W. Miller, E.W. Myers, et al., Basic local alignment searchtool, Journal of Molecular Biology 215 (1990) 403–410.

[6] S.F. Altschul, T.L. Madden, A.A. Schäffer, J. Zhang, Z. Zhang, W. Miller, D.J.Lipman, Gapped BLAST and PSI-BLAST: a new generation of protein databasesearch programs, Nucleic Acids Research 25 (1997) 3389–3402.

[7] Paracel, Paracel, Inc., 2006. <http://www.paracel.com>.[8] Timelogic, TimeLogic Corporation, 2006. <http://www.timelogic.com>.[9] P. Krishnamurthy, J. Buhler, R.D. Chamberlain, M.A. Franklin, K. Gyang, J.

Lancaster, Biosequence similarity search on the Mercury system, in:Proceedings of the 15th IEEE International Conference on Application-Specific Systems, Architectures, and Processors, 2004, pp. 365–375.

[10] T.F. Smith, M.S. Waterman, Identification of common molecular subsequences,Journal of Molecular Biology 147 (1) (1981) 195–197.

[11] R.D. Chamberlain, R.K. Cytron, M.A. Franklin, R.S. Indeck, The mercury system:exploiting truly fast hardware for data search, in: Proceedings of InternationalWorkshop on Storage Network Architecture and Parallel I/Os, 2003, pp. 65–72.

[12] Z. Zhang, S. Schwartz, L. Wagner, W. Miller, A greedy algorithm for aligningDNA sequences, Journal of Computational Biology 7 (2000) 203–214.

[13] Z. Ning, A.J. Cox, J.C. Mullikin, SSAHA: a fast search method for large DNAdatabases, Genome Research 11 (2001) 1725–1729.

[14] W.J. Kent, BLAT: the BLAST-like alignment tool, Genome Research 12 (2002)656–664.

[15] G. Knowles, P. Gardner-Stephen, DASH: localizing dynamic programming fororder of magnitude faster, accurate sequence alignment, in: Proceedings of theIEEE Computational Systems Bioinformatics Conference, 2004, pp. 732–735.

[16] M. Li, B. Ma, D. Kisman, J. Tromp, Patternhunter II: highly sensitive and fasthomology search, Journal of Bioinformatics and Computational Biology 2(2004) 417–439.

[17] G. Knowles, P. Gardner-Stephen, A new hardware architecture for genomic andproteomic sequence alignment, in: Proceedings of the IEEE ComputationalSystems Bioinformatics Conference, 2004, pp. 730–731.

[18] R.K. Singh, et al., BioSCAN: a dynamically reconfigurable systolic array forbiosequence analysis, in: Proceedings of CERCS’96, 1996.

[19] J.D. Hirschberg, R. Hughley, K. Karplus, Kestrel: a programmable array forsequence analysis, in: Proceedings of IEEE International Conference onApplication-Specific Systems, Architecture, and Processors, 1996, pp. 23–34.

[20] D.T. Hoang, Searching genetic databases on Splash 2, in: IEEE Workshop onFPGAs for Custom Computing Machines, 1995, pp. 185–191.

[21] N. Pappas, Searching Biological Sequence Databases Using DistributedAdaptive Computing, Master’s Thesis, Virginia Polytechnic Institute and StateUniversity, 2003.

[22] Y. Yamaguchi, T. Maruyama, A. Konagaya, High speed homology search withFPGAs, in: Pacific Symposium on Biocomputing, 2002, pp. 271–282.

[23] D. Lavenier, S. Guytant, S. Derrien, S. Rubin, A reconfigurable parallel disksystem for filtering genomic banks, in: Proceedings of InternationalConference on Engineering of Reconfigurable Systems and Algorithms, 2003,pp. 154–166.