
Parallel Computing 34 (2008) 681–691

Integrating FPGA acceleration into HMMer

Tim Oliver a, Leow Yuan Yeow a, Bertil Schmidt b,*

a Progeniq Pte Ltd., 8 Prince George's Park, Singapore 118407, Singapore
b School of Computer Engineering, Nanyang Technological University, Singapore 639798, Singapore


Article history: Received 24 April 2007; Received in revised form 3 June 2008; Accepted 21 August 2008; Available online 20 September 2008

Keywords: Bioinformatics; Reconfigurable computing; Hidden Markov models; Viterbi algorithm; Accelerators

doi:10.1016/j.parco.2008.08.003

* Corresponding author. Tel.: +65 6790 6107. E-mail addresses: [email protected] (T. Oliver), [email protected] (L.Y. Yeow), [email protected] (B. Schmidt).

HMMer is a commonly used package for biological sequence database searching with profile hidden Markov models (HMMs). It allows researchers to compare HMMs to sequence databases or sequences to HMM databases. However, such searches often take many hours on traditional computer architectures. These runtime requirements are likely to become even more severe due to the rapid growth in size of both sequence and model databases. We present a new reconfigurable architecture to accelerate the two HMMer database search procedures hmmsearch and hmmpfam, and describe how it leads to significant runtime savings on off-the-shelf field-programmable gate arrays (FPGAs).

© 2008 Elsevier B.V. All rights reserved.

1. Introduction

A hidden Markov model (HMM) is a statistical model that was originally developed for speech recognition tasks. Nowadays, it is also frequently used in the area of molecular biology. Its most popular use in molecular biology is as a probabilistic profile of a protein family or protein domain, which is called a profile HMM [6,8,11]. The comparison (or alignment) of a profile HMM to a protein sequence is used as a building block for common database searching tasks. For example, a protein sequence can be compared to a whole database of profile HMMs representing known protein families/domains. A significant match to one of these profile HMMs can identify the queried sequence and determine its function. Similarly, a profile HMM representing a protein family/domain can be used as a query to search a protein sequence database to see if any other known sequences possess this domain.

HMMer [7] is an open-source implementation of profile HMM algorithms, which is widely adopted for the described database searching tasks. There are two important search procedures in HMMer: hmmsearch and hmmpfam. hmmsearch can be used to scan a protein sequence database with a query profile HMM, and hmmpfam scans a profile HMM database with a number of query protein sequences. This is summarized in Table 1.

Both search procedures frequently employ the comparison of a sequence to a profile HMM. This comparison determines the probability that the given sequence is generated by the given profile HMM using the dynamic programming based Viterbi algorithm [18]. Due to the quadratic time complexity of the Viterbi algorithm, the search procedure can take hours, days, or even weeks depending on database size, query size, and hardware used. A usage scenario where hardware acceleration of HMMer is a necessity is protein exploration in metagenomics sequencing studies of different environments such as soil, air, or sea. An example is the Sorcerer II Global Ocean Sampling (GOS) expedition [21]. Yooseph et al. [21] compared all



Table 1
HMMer programs for database searching

HMMer procedure | Description | Application
hmmpfam | Search a set of query sequences against an HMM database | Annotate various kinds of domains in the query sequence
hmmsearch | Search a sequence database with a query profile HMM | Find additional homologues of a modeled family


generated GOS sequences to the PFAM [3] and TIGRFAM [9] HMM databases to identify and annotate new proteins and protein families. This procedure took 327 h on a large hardware-accelerated HMMer system.

Several parallel solutions for HMMer database searching have been developed on coarse-grained architectures, such as clusters [4,22], as well as on fine-grained architectures, such as network processors [20] and graphics cards [10]. In this paper, we show how reconfigurable field-programmable gate array (FPGA)-based hardware platforms can be integrated into HMMer to accelerate database searching by one to two orders of magnitude in a cost-effective way. Since there is a large overall FPGA market, this approach has a relatively small price/unit and also facilitates regular upgrading to FPGAs based on state-of-the-art technology.

This paper is organized as follows. In Section 2, we review the profile HMM architecture used in HMMer and the Viterbi algorithm for aligning a sequence to a profile HMM. In Section 3, we describe related work in the area of HMMer acceleration. Our reconfigurable hardware design is presented in Section 4 and its performance is evaluated in Section 5. Finally, Section 6 concludes the paper.

2. Background

2.1. Profile HMMs

Profile HMMs are statistical models of multiple sequence alignments. They capture position-specific information about how conserved each column of the alignment is, and which residues are likely. Fig. 1 illustrates how an ungapped profile can be derived from an ungapped alignment. The derived profile consists of a linear sequence of states. Each state corresponds to one position (column) in the associated alignment. Furthermore, there are two types of probabilities associated with each state: the transition probabilities and the emission probabilities. Since there is only one possible path through the states of an ungapped profile, all the transition probabilities in Fig. 1 are 1.0. Emission probabilities are based on the probability of a given amino acid existing at the corresponding position in the alignment. For Column 3 in Fig. 1, the corresponding emission probabilities in State 3 are 0.625 for Y, 0.125 for A, G, and I, and 0.0 for each of the other 16 amino acids. In order to avoid over-fitting, the determination of emission probabilities usually adds pseudo-counts to the distribution of the observed amino acids. A sketch of this estimation follows Fig. 1 below.

[Fig. 1. An ungapped profile derived from an ungapped multiple sequence alignment. Eight aligned five-residue sequences (VHAEH, VNYED, VDYEH, VTYED, VNIGH, FNGED, INYEH, VEYED) yield five states between Begin and End; each state carries per-amino-acid emission counts (e.g. V: 6/8 in State 1, Y: 5/8 in State 3) and all transition probabilities are 1.0.]
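To make the emission estimation concrete, the following is a minimal sketch in C. The add-one (Laplace) pseudo-count rule and all helper names are our own illustration, not the exact smoothing scheme used by HMMer:

```c
#include <stdio.h>

#define NUM_AA 20   /* standard amino-acid alphabet, in alphabetical order */

/* Estimate emission probabilities for one alignment column from raw counts,
 * with add-one pseudo-counts so that unobserved residues keep a small
 * non-zero probability instead of 0.0.                                    */
static void column_emissions(const int counts[NUM_AA], int num_seqs,
                             double probs[NUM_AA])
{
    for (int a = 0; a < NUM_AA; a++)
        probs[a] = (counts[a] + 1.0) / (num_seqs + NUM_AA);
}

int main(void)
{
    /* Column 3 of Fig. 1: Y observed 5 times; A, G, I once each (8 sequences).
     * Raw frequencies would give 0.625 for Y and 0.0 for 16 residues.        */
    int counts[NUM_AA] = { 0 };
    counts[0]  = 1;  /* A */
    counts[5]  = 1;  /* G */
    counts[7]  = 1;  /* I */
    counts[19] = 5;  /* Y */

    double probs[NUM_AA];
    column_emissions(counts, 8, probs);
    printf("P(Y) = %.3f\n", probs[19]);  /* (5+1)/(8+20) = 0.214 */
    return 0;
}
```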


In order to align a given protein sequence to a profile, gaps need to be considered. There are two types of gaps in an alignment: insertions and deletions. Insertions correspond to regions of the sequence that are not present in the profile. Deletions occur when the profile contains states that do not correspond to amino acids in the sequence. Profile HMMs model insertions and deletions by extending the simple structure of a profile. They use three states at each position k (called node k): a match (Mk), insert (Ik), and delete state (Dk). M-states emit a single residue and correspond to the states described in Fig. 1. Each I-state also emits a single amino acid, while D-states are silent. Transitions are arranged so that at each node, either the M-state or the D-state is used exactly once. I-states have a self-transition, allowing one or more inserted residues to occur between consensus columns. Fig. 2 shows the general transition structure of a profile HMM with four nodes.

A valid alignment of a sequence to a profile HMM can be generated by finding a path from the Begin to the End state that emits the given sequence. Fig. 2 shows an example of an alignment of the protein sequence HEIKQ to a profile HMM. The probability of an alignment is calculated by multiplying the emission and transition probabilities along the path. In practice, multiplication of probabilities is replaced by addition of the corresponding log-odds ratios. The optimal alignment corresponds to the most probable path through the profile HMM generating the given sequence. This path can be computed by the dynamic programming based Viterbi algorithm [18], which is described in the next subsection.
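To make the log-odds reformulation concrete (our notation; HMMer's scores are log-odds relative to a null model of background residue frequencies $q_a$), a path $\pi$ emitting $S$ has a probability that is a product of transition and emission terms, and a score that is the corresponding sum:

$$P(\pi, S \mid H) = \prod_{t} tr(\pi_{t-1}, \pi_t) \prod_{t:\ \pi_t\ \mathrm{emits}} e(\pi_t, S[i_t]),$$

$$\mathrm{score}(\pi, S) = \sum_{t} \log tr(\pi_{t-1}, \pi_t) + \sum_{t:\ \pi_t\ \mathrm{emits}} \log \frac{e(\pi_t, S[i_t])}{q_{S[i_t]}}.$$

Maximizing the score is thus equivalent to maximizing the probability, while avoiding numerical underflow.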

The profile HMM transition structure shown in Fig. 2 only allows for global alignments; i.e. the whole profile HMM aligns to the whole sequence. However, other types of alignments, such as local and multi-hit alignments, are often of greater relevance to biology. Therefore, the popular HMMer software tool uses a more flexible profile HMM architecture called Plan7 (see Fig. 3). In Plan7, the linear sequence of nodes is flanked by a begin state (B) and an end state (E). Furthermore, there are the special states S, N, C, T, and J. They control alignment-specific features of the model; e.g. how likely the model is to generate various sorts of global, local, or even multi-hit alignments.

Traditional pairwise alignment techniques such as the Smith–Waterman algorithm [16], BLAST [1], or the Needleman–Wunsch algorithm [14] use only position-independent scoring parameters; i.e. substitution matrix and gap penalties are fixed for all positions, though more recent extensions such as PSI-BLAST [2] seek to overcome such limitations. Profile HMMs capture important information about the degree of conservation at various positions in the multiple alignments, and the varying degree to which gaps and insertions are permitted. As a consequence, databases of thousands of profile HMMs have been built and applied extensively to whole genome analysis. One such database is Pfam [3]. Pfam covers many common protein domains and families and currently contains 9318 profile HMMs in Plan7 format (version 22.0). The construction and use of Pfam is tied to the HMMer software package [7].

2.2. Viterbi algorithm

One of the major bioinformatics applications of profile HMMs is database searching. HMMer includes two searching options. A researcher may search a database of profile HMMs against a set of input query sequences, or alternatively, may search a sequence database for matches to an input profile HMM. In HMMer these searches are known as hmmpfam and hmmsearch, respectively. In either case, the similarity score sim(H,S) of a profile HMM H and a protein sequence S is used to rank all sequences/HMMs in the queried database. The highest ranked sequences/HMMs with corresponding alignments are then displayed to the user as the top hits identified by the database search.

The key to effective database searching is the accuracy of the similarity score sim(H,S). In HMMer it is computed using the well-known Viterbi algorithm [18]. Computing the similarity score can therefore be recast as finding the Viterbi score of the profile HMM H and the protein sequence S. The Viterbi score is the score of the most probable path through H that generates a sequence equal to S. The Viterbi dynamic programming (DP) algorithm for Plan7 profile HMMs works as follows.

Given a profile HMM H of length k in Plan7 format (see Fig. 3) and a sequence S of length n, the Viterbi algorithm uses three two-dimensional matrices of size (n+1) × (k+1) each: M, I, and D, where

[Fig. 2. General transition structure of a profile HMM with four nodes (match states M1–M4, insert states I0–I4, delete states D1–D4, flanked by Begin and End) and a possible alignment of the protein sequence HEIKQ to the model.]

[Fig. 3. The Plan7 architecture for a profile HMM of length 4: the node chain M1–M4 with its I- and D-states, flanked by B and E, plus the special states S, N, C, T, and the loop state J.]

– M(i,j) denotes the score of the best path emitting the subsequence S[1...i] of S ending with S[i] being emitted in state Mj.
– I(i,j) denotes the score of the best path ending with S[i] being emitted in state Ij.
– D(i,j) denotes the score of the best path ending in state Dj.

Furthermore, there are five one-dimensional matrices of size n+1 each: XN, XE, XJ, XB, and XC, where

– XN(i), XJ(i), and XC(i) denote the score of the best path emitting S[1...i] ending with S[i] being emitted in special state N, J, and C, respectively.
– XE(i) and XB(i) denote the score of the best path emitting S[1...i] ending in E and B, respectively.

The recurrence relations with initial conditions for each of these matrices are as follows:

$$M(i,0) = I(i,0) = D(i,0) = -\infty \quad \text{for } 1 \le i \le n$$

$$M(0,j) = I(0,j) = D(0,j) = -\infty \quad \text{for } 1 \le j \le k$$

$$M(i,j) = e(M_j, S[i]) + \max \begin{cases} M(i-1,j-1) + tr(M_{j-1}, M_j) \\ I(i-1,j-1) + tr(I_{j-1}, M_j) \\ D(i-1,j-1) + tr(D_{j-1}, M_j) \\ XB(i-1) + tr(B, M_j) \end{cases}$$

$$I(i,j) = e(I_j, S[i]) + \max \begin{cases} M(i-1,j) + tr(M_j, I_j) \\ I(i-1,j) + tr(I_j, I_j) \end{cases}$$

$$D(i,j) = \max \begin{cases} M(i,j-1) + tr(M_{j-1}, D_j) \\ D(i,j-1) + tr(D_{j-1}, D_j) \end{cases} \qquad \text{for } 1 \le i \le n,\ 1 \le j \le k$$

$$XN(0) = 0, \qquad XN(i) = XN(i-1) + tr(N,N) \quad \text{for } 1 \le i \le n$$

$$XE(0) = -\infty, \qquad XE(i) = \max_{1 \le j \le k} \{ M(i,j) + tr(M_j, E) \} \quad \text{for } 1 \le i \le n$$

$$XJ(0) = -\infty, \qquad XJ(i) = \max \begin{cases} XJ(i-1) + tr(J,J) \\ XE(i) + tr(E,J) \end{cases} \quad \text{for } 1 \le i \le n$$

$$XB(0) = tr(N,B), \qquad XB(i) = \max \begin{cases} XN(i) + tr(N,B) \\ XJ(i) + tr(J,B) \end{cases} \quad \text{for } 1 \le i \le n$$

$$XC(0) = -\infty, \qquad XC(i) = \max \begin{cases} XC(i-1) + tr(C,C) \\ XE(i) + tr(E,C) \end{cases} \quad \text{for } 1 \le i \le n$$


In the above equations, the profile HMM is described in terms of transitions between two states and emissions of amino acids at particular states. For example, tr(State1, State2) denotes the transition score from State1 to State2. Similarly, e(State1, s) denotes the emission score for emitting s at State1. After the computation of all matrices, the score of the best path emitting the complete sequence S is given by XC(n) + tr(C,T). These matrices and their corresponding dependencies are used for our FPGA implementation (see Section 4).
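The recurrences translate almost line-for-line into software. The sketch below is our own C rendering under stated assumptions: double-precision scores and flat array layouts (HMMer's P7Viterbi routine instead works on scaled integer log-odds with its own data structures). Only one previous row of M, I, and D is kept, which is exactly what each PE's intermediate value storage holds in the design of Section 4:

```c
#include <stdlib.h>

#define NEG_INF (-1e30)

/* Per-node transition types stored for each HMM node j. */
enum { TMM, TIM, TDM, TBM, TMI, TII, TMD, TDD, TME, NTR };

static double max2(double a, double b) { return a > b ? a : b; }

/* n: sequence length; k: model length; S[1..n]: residue indices (0..19).
 * eM[j*20 + a], eI[j*20 + a]: emission scores at node j = 1..k.
 * tr[j*NTR + t]: per-node transition scores; trNN..trCT: special states.
 * Returns sim(H,S) = XC(n) + tr(C,T).                                   */
double plan7_viterbi(int n, int k, const int *S,
                     const double *eM, const double *eI, const double *tr,
                     double trNN, double trNB, double trEJ, double trJJ,
                     double trJB, double trCC, double trEC, double trCT)
{
    /* Only the previous row of M/I/D is kept: the role played by the
     * per-PE intermediate value storage (IVS) in the FPGA design.     */
    double *Mp = malloc((k + 1) * sizeof *Mp), *Mc = malloc((k + 1) * sizeof *Mc);
    double *Ip = malloc((k + 1) * sizeof *Ip), *Ic = malloc((k + 1) * sizeof *Ic);
    double *Dp = malloc((k + 1) * sizeof *Dp), *Dc = malloc((k + 1) * sizeof *Dc);
    double XN = 0.0, XB = trNB, XJ = NEG_INF, XC = NEG_INF;

    for (int j = 0; j <= k; j++) Mp[j] = Ip[j] = Dp[j] = NEG_INF;  /* row 0 */

    for (int i = 1; i <= n; i++) {
        double XBprev = XB;      /* XB(i-1), used by the M recurrence */
        double XE = NEG_INF;     /* running max over j, yielding XE(i) */
        int a = S[i];
        Mc[0] = Ic[0] = Dc[0] = NEG_INF;               /* column 0    */
        for (int j = 1; j <= k; j++) {
            const double *t = &tr[j * NTR];
            Mc[j] = eM[j * 20 + a]
                  + max2(max2(Mp[j-1] + t[TMM], Ip[j-1] + t[TIM]),
                         max2(Dp[j-1] + t[TDM], XBprev  + t[TBM]));
            Ic[j] = eI[j * 20 + a]
                  + max2(Mp[j] + t[TMI], Ip[j] + t[TII]);
            Dc[j] = max2(Mc[j-1] + t[TMD], Dc[j-1] + t[TDD]);
            XE    = max2(XE, Mc[j] + t[TME]);
        }
        /* Special states, updated once per row (i.e. at j == k). */
        XN = XN + trNN;
        XJ = max2(XJ + trJJ, XE + trEJ);
        XB = max2(XN + trNB, XJ + trJB);
        XC = max2(XC + trCC, XE + trEC);
        /* Swap current and previous rows. */
        double *swp;
        swp = Mp; Mp = Mc; Mc = swp;
        swp = Ip; Ip = Ic; Ic = swp;
        swp = Dp; Dp = Dc; Dc = swp;
    }
    double sim = XC + trCT;
    free(Mp); free(Mc); free(Ip); free(Ic); free(Dp); free(Dc);
    return sim;
}
```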

3. Related work

There have been a variety of techniques used to accelerate HMMer searches. They range from typical high performance computing (HPC) strategies such as clustering, to web services, and even extending the core HMMer algorithm to novel processing architectures. In this section, we discuss the various strategies used in both implementing and accelerating HMMer.

Walters et al. [19] describe an SSE2 implementation of the Viterbi algorithm. Using the SIMD instructions of the Pentium 4 and above, multiple iterations of the Viterbi algorithm were processed in parallel. However, the data dependencies that we describe in Section 4 prevented deeper parallelism. Similarly, Lindahl [12] implemented an Altivec-optimized Viterbi algorithm using SIMD Altivec instructions. The advantage of the Altivec implementation over the SSE2 implementation is the larger set of Altivec instructions coupled with additional SIMD registers. With these extra instructions (e.g. 32-bit vector max), most steps of the Viterbi algorithm can be processed within the Altivec registers without requiring additional packing or memory operations.

ClawHMMER [10] implements hmmsearch using highly parallel graphics processors. Unlike traditional general purpose processors, graphics hardware has been optimized to perform the same operation over large streams of input data. This is similar to the SIMD operations of general purpose CPUs, but with greater width and speed. In the case of ClawHMMER, multiple sequences are processed in parallel, rather than parallelizing the Viterbi algorithm itself.

JackHMMer [20] uses network processors in place of a general purpose processor in order to accelerate the core Viterbi algorithm. Specifically, JackHMMer uses the Intel IXP 2850 network processor, a heterogeneous multicore chip consisting of an XScale CPU paired with sixteen 32-bit microengines. The XScale CPU distributes work to the microengines and acts as a coordinator for the worker microengines.

TimeLogic provides an FPGA-based HMM protein characterization solution named DeCypherHMM [17]. The DeCypher engine is deployed as a standard PCI card in an existing machine. Multiple DeCypher engines can be installed in a single machine, which, according to TimeLogic, results in near linear speedup. Other FPGA implementations exist as well [13,15]. The disadvantage of these implementations is that they eliminate portions of the Plan7 model, which can lead to inaccuracies; in exchange, they achieve extremely high speedups.

SledgeHMMER [4] is a web service designed to allow researchers to perform Pfam database searches without having to install HMMer locally. To use the service, a user submits a batch job to the SledgeHMMER website. Upon completion of the job, the results are simply emailed back to the user. SledgeHMMER is a clustered implementation using a UNIX file-locking strategy rather than MPI or PVM.

Other cluster-like implementations exist as well. Zhu et al. [22] discuss a Pfam search implementation for the EARTH multi-threaded architecture. Their implementation achieves near linear speedup with 128 dual-core processors. HMMer itself supports a PVM implementation of the major database search tools, including hmmsearch and hmmpfam.

[Fig. 4. Data dependencies for computing the values M(i,j), I(i,j), and D(i,j) in the Plan7 Viterbi algorithm: cell (i,j) depends directly on cells (i−1,j), (i,j−1), and (i−1,j−1), and indirectly on XB, XE, and XJ of row i−1. Solid lines are used for direct dependencies and dashed lines for indirect dependencies.]


4. Mapping onto an FPGA platform

In order to develop a parallel architecture for the recurrence relations presented in Section 2.2, their dependencies must be accurately described. Fig. 4 shows the data dependencies for computing the cell (i,j) in the DP matrices M, I, and D. It can be seen that the computation of this cell requires the left, upper, and upper-left neighbors. Additionally, it depends on XB(i−1). This value depends on XJ(i−1), which in turn depends on XE(i−1). XE(i−1) then depends on all cells in row i−1 of matrix M. Hence, to satisfy all dependencies, the two-dimensional matrices M, I, and D must be filled one cell at a time, in row-major order. Computing several DP matrix cells in parallel is therefore not possible for a Plan7 Viterbi score calculation, due to the feedback loop induced by the J-state.

Eliminating the J-state is a typical strategy used when implementing the Viterbi algorithm in hardware [13,15]. In doing so, highly efficient parallelism can be achieved with an FPGA, at the cost of the implementation's inability to find multi-hit alignments such as repeat matches of subsequences of S to subsections of H. This can in turn result in a severe loss of sensitivity for database searching within HMMer. In this paper we present an FPGA solution that implements the full Plan7 model.

Fig. 5 shows our design for each individual PE (processing element). It contains registers to store the following temporary DP matrix values: M(i−1,j−1), I(i−1,j−1), D(i−1,j−1), M(i,j−1), I(i,j−1), and D(i,j−1). The DP matrix values M(i,j), I(i,j), and D(i,j) are not stored explicitly; instead, they are the inputs to the M(i,j−1), I(i,j−1), and D(i,j−1) registers, respectively. The PE gets the emission scores (e(Mj,S[i]) and e(Ij,S[i])) and the transition scores tr(Mj−1,Mj), tr(Ij−1,Mj), tr(Dj−1,Mj), tr(Mj,Ij), tr(Ij,Ij), tr(Mj−1,Dj), tr(Dj−1,Dj), and tr(Mj,E) from the internal FPGA RAM (Block RAM). The transition scores tr(B,Mj), tr(N,N), tr(E,J), tr(J,J), tr(J,B), tr(N,B), tr(C,C), and tr(C,T) are stored in registers in the PE. The PE has a four-stage pipeline: Fetch, Compute1, Compute2, and Store. In the Fetch stage, transition, emission, and intermediate DP matrix values are read from the Block RAM. All necessary computations are performed in the two compute stages. Finally, results are written to the Block RAM in the Store stage. The computation of the special state matrices uses intermediate values for XE(i), which are computed as

$$XE(i,j) = \max\{XE(i,j-1),\ M(i,j) + tr(M_j, E)\}$$

The updating of XN, XJ, XB, and XC is only performed at the end of the DP matrix row; i.e. if j = k.

[Fig. 5. HMM processing element (PE) design: a datapath of adders and MAX units implementing the M, I, and D recurrences from the registered inputs M(i−1,j−1), I(i−1,j−1), D(i−1,j−1), M(i,j−1), I(i,j−1), D(i,j−1), XB(i−1), and the emission/transition scores, together with the special-state updates for XE, XJ, XN, XB, and XC, producing SIM(H,S).]


All numbers are represented in 2's complement form. Furthermore, the adders in our PE design use saturation arithmetic. In order to achieve high clock frequencies, fast saturation arithmetic is crucial to our design. Therefore, we have added two tag bits to our number representation. These two tags encode the following cases: number (00), +max (01), −max (10), and not-a-number (NaN) (11). The tags of the result of an addition and of a maximum operation are calculated according to Tables 2 and 3, respectively. Our representation has the advantage that result tags can be computed in a very simple and efficient way: if any of the operands' tags is set in an addition, a simple bit-wise OR operation suffices. Otherwise, the tags will be set according to the overflow bit of the performed addition. A sketch of this scheme in C follows Table 3 below.

As mentioned above, the Plan7 Viterbi algorithm does not allow computing several cells in parallel. Instead of computing the Viterbi algorithm on one database subject at a time, we align different query/subject pairs independently in separate PEs. Our system design with 4 PEs is shown in Fig. 6. Each PE has its own intermediate value storage (IVS). The IVS needs to store one row of previously computed results of the matrices M, I, and D. The PEs are connected to an emission and transition storage. Our design assumes that the same profile HMM has to be aligned to different sequences. All PEs are synchronized to process the same HMM state in every clock cycle. Therefore, the bandwidth requirement for accessing the transition storage is reduced to a single state. Score collect and score buffer are designed to handle cases where PEs produce results in the same clock cycle. The HMM loader transfers emission and transition values into their respective storage. The sequence loader fetches sequence elements from external memory and forwards them to the emission selection multiplexers. The system is connected to the HMMer software running on the host system via a USB interface.

The host software performs two functions: loading/storing FPGA data and post-processing relevant hits. The host portion of hmmsearch and hmmpfam is described in Algorithm 1. As can be seen, the FPGA is used as a first-pass filter. Large database chunks are quickly scanned and narrowed down to a few interesting hits. These hits are then processed in software, according to the high-level algorithm presented in Algorithm 1. The e-value used in Algorithm 1 is a statistical measure of the number of false hits expected within the entire database whose scores are greater than or equal to the current sequence's score. The software HMMer implementation uses these e-values to filter out results whose score indicates a high chance of a false positive. Similarly, the T and E input parameters are user-defined threshold (T) and cutoff (E) values. The function Pack_and_score() in Algorithm 1 packs the protein sequences and the HMM into a buffer to be sent to the FPGA. In order to enable all PEs to process one letter at the same time, the amino acids of different sequences are interleaved. However, due to different sequence lengths, this can cause scores to be returned out of order. Therefore, we build a relative-to-absolute position table when the data is packed. This table is then used to copy the scores to their correct positions when the data is returned to the CPU.

Algorithm 1: FPGA integration into hmmsearch and hmmpfam.

Input: T parameter, E parameter, HMM array hmm[], Sequence array seq[]
Output: Top matches
{Note: number of HMMs = 1 for hmmsearch}
for all current HMMs, hmm[j] do
  repeat
    {Returns scores and sequence arrays for use in post-processing}
    Pack_and_score()
    for all current sequences, seq[i] do
      FPGA_score = score[i] / 1000.0  {scale integer log-odds score}
      if FPGA_score ≥ T and e-value ≤ E then
        dsq = digitize_sequence(seq[i])
        Software_score = P7Viterbi(dsq, H)
        if Software_score ≥ T and e-value ≤ E then
          PostprocessSignificantHit(seq[i])
        end if
      end if
    end for
  until no more sequences
end for  {until no more HMMs}
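A compilable C rendering of this host-side loop is sketched below. Every helper name is our own stand-in for the board's driver API; only P7Viterbi corresponds to a routine HMMer itself provides, and its signature is simplified here:

```c
#include <stdlib.h>

/* Illustrative stand-ins for the board's driver API and HMMer internals. */
typedef struct { char *residues; size_t len; } Seq;
typedef struct Hmm Hmm;                   /* Plan7 model, details omitted */

extern void   fpga_pack_and_score(const Hmm *h, const Seq *seqs,
                                  size_t nseq, long *raw_scores);
extern double evalue_of(double score, const Hmm *h);
extern int   *digitize_sequence(const Seq *s);
extern double P7Viterbi(const int *dsq, const Hmm *h);
extern void   postprocess_significant_hit(const Seq *s, double score);

/* First-pass FPGA filter, second-pass software rescore (cf. Algorithm 1).
 * T is the score threshold, E the e-value cutoff.                        */
void search_one_hmm(const Hmm *h, const Seq *seqs, size_t nseq,
                    double T, double E)
{
    long *raw = malloc(nseq * sizeof *raw);

    /* The pack routine interleaves the amino acids of different sequences
     * so that every PE consumes one letter per clock; a relative-to-
     * absolute position table built at packing time restores the returned
     * scores to sequence order.                                           */
    fpga_pack_and_score(h, seqs, nseq, raw);

    for (size_t i = 0; i < nseq; i++) {
        double fpga_score = raw[i] / 1000.0;    /* scale integer log-odds */
        if (fpga_score >= T && evalue_of(fpga_score, h) <= E) {
            /* Rescore the few surviving hits with the software Viterbi. */
            int *dsq = digitize_sequence(&seqs[i]);
            double sw_score = P7Viterbi(dsq, h);
            if (sw_score >= T && evalue_of(sw_score, h) <= E)
                postprocess_significant_hit(&seqs[i], sw_score);
            free(dsq);
        }
    }
    free(raw);
}
```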

Table 2
Computation of result tags in the case of an addition

add | number (00) | +max (01) | −max (10) | NaN (11)
number (00) | 00 a | 01 | 10 | 11
+max (01) | 01 | 01 | 11 | 11
−max (10) | 10 | 11 | 10 | 11
NaN (11) | 11 | 11 | 11 | 11

a Except the case that the result produces an overflow; then the result tag is 01 (if the MSB is set) or 10 (if the MSB is not set).


Table 3
Computation of result tags in the case of a maximum operation

max | number (00) | +max (01) | −max (10) | NaN (11)
number (00) | 00 | 01 | 00 | 11
+max (01) | 01 | 01 | 01 | 11
−max (10) | 00 | 01 | 10 | 11
NaN (11) | 11 | 11 | 11 | 11
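The following is a minimal C sketch of the tagged saturating operations of Tables 2 and 3. The struct layout and the 16-bit width are illustrative assumptions; in the hardware, the two tag bits travel alongside each register's value:

```c
#include <stdint.h>

/* Two tag bits per value, as in the PE design:
 * number (00), +max (01), -max (10), NaN (11).                     */
enum { NUM = 0x0, PMAX = 0x1, NMAX = 0x2, NAN_T = 0x3 };

typedef struct { uint8_t tag; int16_t val; } SatNum;

static SatNum sat_add(SatNum a, SatNum b)
{
    SatNum r = { (uint8_t)(a.tag | b.tag), 0 };  /* bit-wise OR of tags */
    if (r.tag == NUM) {                          /* both plain numbers  */
        int32_t wide = (int32_t)a.val + (int32_t)b.val;
        if      (wide > INT16_MAX) r.tag = PMAX; /* overflow  -> +max   */
        else if (wide < INT16_MIN) r.tag = NMAX; /* underflow -> -max   */
        else                       r.val = (int16_t)wide;
    }
    return r;  /* +max plus -max ORs to 11 (NaN), matching Table 2 */
}

static SatNum sat_max(SatNum a, SatNum b)
{
    /* Table 3: NaN dominates; +max wins over numbers and -max;
     * -max loses to a plain number; two numbers compare by value. */
    if (a.tag == NAN_T || b.tag == NAN_T) return (SatNum){ NAN_T, 0 };
    if (a.tag == PMAX  || b.tag == PMAX)  return (SatNum){ PMAX, 0 };
    if (a.tag == NMAX) return b;
    if (b.tag == NMAX) return a;
    return (a.val >= b.val) ? a : b;
}
```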

[Fig. 6. Architecture of the HMM system implemented on FPGA: four PEs, each with its own intermediate value storage (IVS), share an emission storage and a transition storage; a sequence loader and an HMM loader feed data from external memory, score collect and score buffer gather the results, and a host interface connects the FPGA system to the host system.]


5. Performance evaluation

We have described our PE design in Verilog and targeted it to two members of the low-cost Xilinx Spartan-3 family, the XC3S1500 and the XC3S4000 (see Table 4), in order to investigate the effect of the amount of logic slices and memory on the scalability of our design.

We have used Xilinx ISE 8.2i for synthesis, mapping, placement, and routing. The size of one PE is 451 logic slices. The amount of memory required is 50 RAM entries per HMM state, comprising 42 emissions and 8 transitions. Furthermore, there are 3 entries per HMM state for each PE's IVS. Thus, the overall number of Block RAM entries required is 50·k + 3·k·N, where k is the HMM length and N is the number of PEs. Table 5 shows the maximum number of PEs that we are able to fit for varying HMM lengths. k = 256 and k = 1024 are the largest power-of-two HMM lengths we are able to support on an XC3S1500 and an XC3S4000, respectively. In both these cases, the number of PEs reported in Table 5 is limited by the number of Block RAMs in the targeted FPGA.

Table 4
Resources of the targeted FPGAs

Spartan-3 | XC3S1500 | XC3S4000
Logic slices | 13,312 | 27,648
Block RAMs | 32 | 96


Table 5
Maximal number of supported PEs (N) for varying HMM lengths (k)

Targeted FPGA | k | N | Mega CUPS
XC3S1500 | 256 | 10 | 700
XC3S4000 | 256 | 30 | 2100
XC3S4000 | 512 | 30 | 2100
XC3S4000 | 1024 | 13 | 910


The number of PEs can therefore be increased for shorter HMM lengths; e.g. for k = 512 it is possible to fit 30 PEs on an XC3S4000. Further improvement over the 512-state version is then limited by the logic slices on the XC3S4000; e.g. for 256 states the maximal PE number is still 30.

A performance measure commonly used in computational biology is cell updates per second (CUPS). A CUPS represents the time for a complete computation of one entry of each of the matrices M, D, and I. The CUPS performance of our implementations can be obtained by multiplying the number of PEs by the clock frequency. The corresponding CUPS performances of our design are also shown in Table 5, based on the achieved clock frequency of 70 MHz. CUPS represents the peak performance that the hardware accelerator can achieve and does not consider data communication time or initialization time. Thus, it should be considered an upper bound.
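As a worked check of the two sizing formulas above (our arithmetic, using the figures from Tables 4 and 5), the XC3S1500 configuration with k = 256 and N = 10 gives

$$\mathrm{peak\ CUPS} = N \times f = 10 \times 70\,\mathrm{MHz} = 700\ \mathrm{MCUPS},$$

$$\mathrm{Block\ RAM\ entries} = 50k + 3kN = 50 \cdot 256 + 3 \cdot 256 \cdot 10 = 20{,}480.$$

The XC3S4000 rows of Table 5 follow the same pattern, e.g. 30 × 70 MHz = 2100 MCUPS.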

We have implemented the HMMer FPGA integration for hmmsearch and hmmpfam as described in Algorithm 1. The acceleration board used for this study is a low-cost Spartan-3 XC3S1500 board with 64 MB SDRAM, Ethernet, USB 2.0, and PS/2 interfaces. The XC3S1500 board can easily be plugged into any host computer system via a USB interface. Tables 6 and 7 show the timings for FPGA-accelerated hmmsearch and hmmpfam executions for a varying number of HMM states and sequences, respectively. All timings include data transfer, initialization, and pre- and post-processing. We have used an AMD Athlon 64 3500+ as the host machine for the board. The speedup of FPGA-accelerated HMMer compared to the non-accelerated version running on the same desktop system is also reported.

The HMMs chosen for the experiments in Table 7 are the top 1554 HMMs of the superfamily database. This equals the maximum amount of HMM data that can be loaded into the SDRAM of the utilized FPGA hardware in a single pass. A larger number of HMMs simply requires running multiple passes. Since our runtime measurements include all overheads, the associated runtimes scale linearly with the number of passes. Similarly, the number of sequences used in hmmpfam can be increased. In each pass, a single HMM is loaded from the SDRAM into the FPGA's Block RAM. Subsequently, all sequences are streamed from the SDRAM into the FPGA and compared to the HMM. The associated runtime is therefore determined by the HMM loading time plus the streaming time. After about 1000 average-sized sequences, the HMM loading time becomes negligible compared to the streaming time, and so the speedup levels off after about 1000 sequences.
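Written as a simple cost model (our notation), the runtime of one hmmpfam pass over n sequences is approximately

$$T_{\mathrm{pass}} \approx T_{\mathrm{load}} + \sum_{s=1}^{n} T_{\mathrm{stream}}(s), \qquad \frac{T_{\mathrm{load}}}{T_{\mathrm{pass}}} \rightarrow 0 \ \text{as}\ n\ \text{grows},$$

which is why the measured speedup in Table 7 flattens out near 1000 sequences.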

Examining Tables 6 and 7, we can see the effect of the number of states within an HMM and of the number of protein sequences on the FPGA implementation as compared to the software-only implementation.

Table 6
Runtime comparison of accelerated hmmsearch on the XC3S1500 board and non-accelerated hmmsearch on an AMD Athlon 64 3500+

HMM name | HMM states | FPGA-accelerated hmmsearch (s) | Non-accelerated hmmsearch (s) | Speedup | Achieved MCUPS | Efficiency (%)
zf-C2H2 | 24 | 14 | 142 | 10.1 | 331 | 47
ig | 45 | 20 | 291 | 14.6 | 435 | 62
Collagen | 60 | 23 | 270 | 11.7 | 504 | 72
Rvp | 112 | 38 | 746 | 19.6 | 569 | 81
Cyto | 222 | 88 | 2386 | 27.1 | 487 | 70
14-3-3 | 244 | 72 | 2247 | 31.2 | 655 | 94

All measurements use a database consisting of 643,552 protein sequences provided by EBI (European Bioinformatics Institute). The achieved MCUPS and the efficiency compared to the theoretical peak performance of 700 MCUPS (see Table 5) are also reported.

Table 7
Runtime comparison of accelerated hmmpfam on the XC3S1500 board and non-accelerated hmmpfam on an AMD Athlon 64 3500+

Number of protein sequences | FPGA-accelerated hmmpfam (s) | Non-accelerated hmmpfam (s) | Speedup | Achieved MCUPS | Efficiency (%)
50 | 21 | 375 | 17.9 | 175 | 25
100 | 31 | 773 | 24.9 | 267 | 38
200 | 47 | 1559 | 33.2 | 351 | 50
400 | 103 | 3369 | 32.7 | 364 | 52
800 | 176 | 6556 | 37.3 | 413 | 59
1000 | 208 | 8127 | 39.1 | 429 | 61

All measurements use a subset of the superfamily database consisting of 1554 HMMs. The proteins used in the benchmarking are GPCR (G protein-coupled receptor) sequences. The achieved MCUPS and the efficiency compared to the theoretical peak performance of 700 MCUPS (see Table 5) are also reported.


Table 8
Performance comparison of different hmmsearch implementations

Machine | Optimization | Time taken (s) | Mega CUPS | Speedup
AMD Athlon 64 3500+ | None | 2247 | 20.96 | 1
Apple G5 | None | 2024 | 23.26 | 1.11
Apple G5 | Altivec | 562 | 83.77 | 4
AMD Athlon 64 3500+ | FPGA | 72 | 655.87 | 31.2

All measurements use the same database as in Table 6 and the 14-3-3 HMM.

Table 9
Performance comparison of different hmmpfam implementations

Machine | Optimization | Time taken (s) | Mega CUPS | Speedup
AMD Athlon 64 3500+ | None | 8135 | 10.98 | 1
Apple G5 | None | 5811 | 15.37 | 1.4
Apple G5 | Altivec | 3165 | 28.21 | 2.57
AMD Athlon 64 3500+ | FPGA | 208 | 429.30 | 39.1

All measurements use the same database as in Table 7 and a set of 1000 GPCR protein sequences.


The speedup generally increases with a larger number of states and sequences, respectively. This is to be expected, as the software implementation of the Viterbi algorithm does not improve in efficiency with a larger number of states or a larger number of sequences. In the case of the FPGA, however, a greater number of states results in more effective use of the resources, while a larger number of sequences reduces the impact of data transfer overheads. Slight variations in runtime are due to the post-processing load on the software, which varies with each dataset. This can be seen in the 222-state test in Table 6, which takes more time than the 244-state test.

The HMMer package is able to take advantage of the Altivec extensions available in the Power Mac G5 processor. A performance comparison between this optimization and the FPGA-accelerated version on the XC3S1500 board is shown in Tables 8 and 9.

6. Conclusion

In this paper we have demonstrated that reconfigurable hardware platforms provide a cost-effective solution to high performance biological sequence database searching with HMMer. We have described a PE design that implements database scanning using the full Plan7 Viterbi algorithm. Our strategy outperforms available sequential desktop implementations of both hmmsearch and hmmpfam by one to two orders of magnitude.

The difference between previous acceleration approaches (such as Kestrel [5]) and ours is that FPGAs allow easy upgrading. This, and our previous work [15], shows the portability inherent in FPGA technology. Our previous HMM accelerator was targeted at the Virtex-II architecture but did not support the full Plan7 Viterbi algorithm. Our new architecture, which supports the full Plan7 Viterbi algorithm, could easily be ported to the higher performance Virtex-4 architecture or the newer Virtex-5 architecture for even higher computing performance (due to the larger number of logic slices and more Block RAM). However, we have opted for the Spartan-3 architecture due to its better price/performance ratio, small form factor, and low power consumption. Furthermore, the low cost of the Spartan-3 devices makes it possible to put an FPGA accelerator in several nodes within a cluster or a grid.

References

[1] S.F. Altschul, W. Gish, W. Miller, E.W. Myers, D.J. Lipman, Basic local alignment search tool, J. Mol. Biol. 215 (1990) 403–410.
[2] S.F. Altschul, T.L. Madden, A.A. Schaffer, J. Zhang, Z. Zhang, W. Miller, D.J. Lipman, Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Res. 25 (17) (1997) 3389–3402.
[3] A. Bateman et al., The PFAM protein families database, Nucleic Acids Res. 32 (2004) 138–141.
[4] G. Chukkapalli, C. Guda, S. Subramaniam, SledgeHMMER: a web server for batch searching the Pfam database, Nucleic Acids Res. 32 (2004) W542–W544.
[5] A. Di Blas et al., The UCSC Kestrel parallel processor, IEEE Trans. Parall. Distrib. Syst. 16 (1) (2005) 80–92.
[6] R. Durbin, S. Eddy, A. Krogh, G. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, 1998.
[7] S.R. Eddy, HMMER: profile HMMs for protein sequence analysis, 2007. <http://hmmer.janelia.org>.
[8] S.R. Eddy, Profile hidden Markov models, Bioinformatics 14 (1998) 755–763.
[9] D.H. Haft, J.D. Selengut, O. White, The TIGRFAMs database of protein families, Nucleic Acids Res. 31 (2003) 371–373.
[10] D.R. Horn, M. Houston, P. Hanrahan, ClawHMMER: a streaming HMMer-search implementation, in: ACM/IEEE Conference on Supercomputing, 2005.
[11] A. Krogh, M. Brown, S. Mian, K. Sjolander, D. Haussler, Hidden Markov models in computational biology: applications to protein modeling, J. Mol. Biol. 235 (1994) 1501–1531.
[12] E. Lindahl, Altivec-accelerated HMM algorithms, 2005. <http://lindahl.sbc.su.se/>.
[13] R.P. Maddimsetty, J. Buhler, R. Chamberlain, M. Franklin, B. Harris, Accelerator design for protein sequence HMM search, in: Proceedings of the 20th ACM International Conference on Supercomputing (ICS06), 2006, pp. 288–296.
[14] S. Needleman, C. Wunsch, A general method applicable to the search for similarities in the amino acid sequence of two proteins, J. Mol. Biol. 48 (3) (1970) 443–453.
[15] T.F. Oliver, B. Schmidt, J. Yanto, D.L. Maskell, Accelerating the Viterbi algorithm for profile hidden Markov models using reconfigurable hardware, Lecture Notes in Computer Science, vol. 3991, Springer, 2006, pp. 522–529.
[16] T.F. Smith, M.S. Waterman, Identification of common molecular subsequences, J. Mol. Biol. 147 (1981) 195–197.
[17] TimeLogic Biocomputing Solutions, DeCypherHMM, 2007. <http://www.timelogic.com/>.
[18] A.J. Viterbi, Error bounds for convolutional codes and an asymptotically optimum decoding algorithm, IEEE Trans. Inform. Theory 13 (2) (1967) 260–269.
[19] J.P. Walters, B. Qudah, V. Chaudhary, Accelerating the HMMER sequence analysis suite using conventional processors, in: Proceedings of the AINA'06, vol. 1, IEEE Computer Society, 2006, pp. 289–294.
[20] B. Wun, J. Buhler, P. Crowley, Exploiting coarse-grained parallelism to accelerate protein motif finding with a network processor, in: Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques (PACT 2005), 2005, pp. 173–184.
[21] S. Yooseph et al., The Sorcerer II Global Ocean Sampling expedition: expanding the universe of protein families, PLoS Biol. (2007).
[22] W. Zhu, Y. Niu, J. Lu, G.R. Gao, Implementing parallel Hmm-Pfam on the EARTH multithreaded architecture, in: Proceedings of the Second IEEE Computer Society Bioinformatics Conference, 2003, pp. 549–550.