
Disclosing the nature of computational tools for the analysis of Next-Generation Sequencing data.

Francesca Cordero1,2, Marco Beccuti1, Susanna Donatelli1 and Raffaele A Calogero2

(1) Department of Computer Science, University of Torino, Torino, Italy; (2) Department of Clinical and Biological Sciences, University of Torino, Torino, Italy

Abstract

Next-generation sequencing (NGS) technologies are rapidly changing the approach to complex genomic studies, opening the way to personalized drug development and personalized medicine. NGS technologies are characterized by a massive throughput of relatively short sequences (30-100 bases), and they are currently the most reliable and accurate method for grouping individuals on the basis of their genetic profiles. The first and crucial step in sequence analysis is the conversion of millions of short sequences (reads) into valuable genetic information by mapping them to a known (reference) genome. New computational methods, specifically designed for the type and amount of data generated by NGS technologies, are replacing earlier widespread genome alignment algorithms, which are unable to cope with such a massive amount of data.

This review provides an overview of the bioinformatics techniques that have been developed for the mapping of NGS data onto a reference genome, with a special focus on polymorphism rate and sequence error detection. The different techniques have been tested on an appropriately defined dataset, to investigate their relative computational costs and usability, as seen from a user perspective.

Since NGS platforms interrogate the genome using either the conventional nucleotide space or the more recent color space, this review considers techniques in both nucleotide and color space, emphasizing similarities and differences.

1 Introduction

The successful sequencing of a wide range of genomes has opened the way to a new approach in drug design and selection. The Sanger enzymatic dideoxy technique, first described in 1977 [10], has been improved continuously over the years, leading to a significant reduction in the overall cost of genome sequencing. In 2006 new sequencing technologies, collectively referred to as Next-Generation Sequencing (NGS), appeared on the market, leading to a revolution in sequencing techniques.

This revolution is currently driven by three commercial platforms: 454 (Roche) [28], Genome Analyzer (Illumina/Solexa) [14] and AB SOLiD (Applied Biosystems) [4]. A technical review of template preparation and sequencing, together with an overview of NGS technologies, instrument performance and cost, is reported in [21], while reviews of the sequencing techniques implemented in these platforms have already been published in [6, 22, 23, 27]. The use of NGS inside appropriate workflows targeting a specific application has been reviewed in [36] for RNA-seq, in [15] for ChIP-seq and in [9] for metagenomics.

All NGS technologies have in common the generation of sequences on an unprecedented scale, without the requirement for DNA cloning, and at a fraction of the cost of traditional sequencing. These features are expected to lead in the near future to personalized genome sequencing at a cost of less than $1000 per sample, a cost that will make it possible to address biological questions that were not economically or logistically practical before [29].

In particular, in the field of molecular and pharmaceutical biotechnology NGS can enable i) the discovery of new drugs or derivatives from natural products, and ii) the development of personalized medicine.

As explained in [19], drug discovery from natural products has suffered a significant slowdown due to the complicated and long framework of high-throughput screening and to the increasing government restrictions on drug approvals. We can expect the first cause to be temporary, since rapid access to inexpensive genome sequencing technologies will simplify the extraction of the genetic information of uncultivated organisms that could potentially lead to new drug discovery. For example, the remarkable study [20], conducted by an international consortium, used Roche 454 sequencing of Mycobacterium tuberculosis to identify targets for a diarylquinoline-based drug that potently inhibits the growth of drug-sensitive and drug-resistant strains of the pathogen. Based on these early reports, it is likely that our understanding of the spectrum of genome variation within clinical isolates will be greatly enhanced in the near future. This knowledge will lead to improved diagnostics, monitoring and treatments [7], as the availability of personal genome sequencing will provide information on patient predisposition to drug response and clearance [31]. New gene-mapping techniques will facilitate the design of diagnostic tests to efficiently highlight the causes of illness, including infections.

A source of potential weakness in NGS data analysis is the definition of a specific bioinformatics workflow [17] and the correct choice of mapping method for the target application. Basic mapping and variant detection tools are provided by the next-generation sequencing platform vendors, sequencing service providers and the academic community. Whichever software is used, it is of paramount importance that its strengths and weaknesses are clearly understood. Indeed, the choice of the best algorithm may require a deep understanding of the peculiarities of each of them. This review aims to provide information usable for the correct choice of a mapping method.

In this paper, we describe and discuss the main characteristics of alignment approaches. With respect to the complete reviews of the algorithms available in the literature [25, 5], here we highlight algorithm peculiarities, as seen from a user perspective, by comparing them over a synthetic dataset encompassing mutations in 35-mer reads in both nucleotide and color space.

The paper is organized as follows. Section 2 introduces the alignment and mapping problems, thus defining the context of this review. Section 3 provides an overview of the distinctive features of color space versus nucleotide space. Section 4 analyzes the peculiar features of the most used mapping algorithms for NGS data: Short Oligonucleotide Color Space (SOCS) [3], Mapping and Assembly with Qualities (MAQ) [12], Periodic Seed Mapping (PerM) [34], SHort Read Mapping Package (SHRiMP) [30], and Bowtie [2]. Section 5 describes the synthetic dataset used in the experiments and the results of the comparison of the selected algorithms, in terms of computational cost and sensitivity. We conclude with a discussion of the obtained results in Section 6.

2 Background

NGS technologies make it possible to sequence more than one billion base pairs (gigabases) in a few days. Roche [28] and Illumina [14] platforms both implement the Sanger methodology, which is based on the principle of sequencing by extension: the DNA is used as a template to generate a complementary fragment considering a single base at a time. The identity of the bases is determined by chemical means; the result is a set of probes formed by 35-100 positions, expressed in the classical nucleotide space. AB SOLiD sequencing technology introduces a new dibase sequencing technique. The system encodes each dimer with one of four available colors using a degenerate coding system: four fluorescent dyes are used to encode the sixteen possible nucleotide pairs, and each nucleotide is interrogated twice in independent reactions. The resulting output is a set of 35-mer probes expressed in color space.

NGS platforms associate with each element of the read a quality score, which is a measure of the likelihood that the nucleotide or dinucleotide (for AB SOLiD) call is correct. In AB SOLiD technology the maximum quality value is 33 (high probability of a correct dimer). Illumina Genome Analyzer and Roche 454 platforms instead use a Phred score [18] encoded as an ASCII character.
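As a minimal sketch of how a Phred score can be read back from its ASCII encoding, the snippet below assumes an offset of 33, which is one common convention; the offset actually used depends on the platform and software version. The function names are illustrative. The only fact relied upon is the Phred definition Q = -10 log10(p), where p is the probability of a wrong call.

```python
def phred_from_char(ch, offset=33):
    """Decode one ASCII quality character (the offset of 33 is an assumption)."""
    return ord(ch) - offset

def error_probability(q):
    """Phred definition: Q = -10 * log10(p), hence p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

quality_string = "II5+"   # made-up quality string for a 4-base read
scores = [phred_from_char(c) for c in quality_string]
print(scores)
print([error_probability(q) for q in scores])
```

A score of 20 therefore corresponds to one expected error in 100 calls, and a score of 40 to one in 10,000.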

The first step in converting the huge volume of sequence data into biologically valuable objects is their alignment on a reference genome. This step requires specifically devised computational tools.

The alignment between two or more biological sequences is a process initially developed to detect their homology rate. In biological terms, alignment has the objective of aligning homologous residues. DNA, RNA and proteins change their structure and evolve mainly under the action of three types of modifications: mutation, insertion and deletion. Mutations correspond to substitutions at some sequence positions (nucleotides or amino acids). If the mutation involves a single position the term Single Nucleotide Polymorphism (SNP) is used; when it involves multiple contiguous positions it is called a SNP locus (of a given size k). Sequences undergoing insertions and/or deletions will instead differ in length, as part of their sequence is removed (deletion) or new stretches of sequence are added (insertion).

Given two sequences S1 and S2, an alignment algorithm must identify the list of operations, expressed by the function f, such that f(S1) = S2. The choice of f must optimize an objective function that represents the best alignment.

Assuming that evolution is parsimonious, when performing an alignment the aim is to minimize the number of evolutionary changes (substitutions or insertions/deletions) that the alignment implies.

We shall call full mapping an alignment algorithm that takes as input two sequences S1 and S2 and returns all the positions i of S2 where S1 is found without mutations (function f is the identity). We call instead up to mapping an alignment algorithm in which a similarity level is given to allow a certain number of mutations or insertions/deletions. For example, up to 1 SNP mapping searches all positions in S2 where sequence S1 is found with at most one SNP mutation.
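The two definitions can be illustrated with a brute-force sketch. It is only meant to make the terms concrete: it handles substitutions only (no insertions or deletions), the function names and the toy sequences are illustrative, and it is exactly the kind of exhaustive scan that real tools avoid.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def up_to_mapping(s1, s2, max_snp=0):
    """Positions i of s2 where s1 occurs with at most max_snp substitutions.

    With max_snp=0 this is the full mapping of the text.
    """
    n = len(s1)
    return [i for i in range(len(s2) - n + 1)
            if hamming(s1, s2[i:i + n]) <= max_snp]

genome = "ACGTACGAACGT"
print(up_to_mapping("ACGT", genome))       # full mapping: [0, 8]
print(up_to_mapping("ACGT", genome, 1))    # up to 1 SNP:  [0, 4, 8]
```

The cost of this scan grows with the product of the genome length and the read length for every read, which is what motivates the indexing techniques discussed in Section 4.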

The mapping algorithms whose input data are the NGS reads (millions of n-mers) and a genome (stored as up to gigabytes of information) are expensive in time and memory, and a brute force approach is usually unfeasible. To minimize these costs, different optimization and/or heuristic techniques have been developed. As with any heuristic, optimality of the results is not guaranteed. In particular, in mapping based on heuristics we can have both true and false positives. A false positive in the algorithm output is a position j not satisfying the given similarity constraints. For example, in an up to 2 SNP algorithm, a reported position is a false positive if strictly more than 2 mutations are needed to obtain from S1 a full match onto the corresponding part of S2. False negatives are also possible: these are all the positions that do not belong to the output set but that are indeed proper alignments.

3 Color space features

Mapping algorithms handle input reads given in color and/or nucleotide space. Dibase sequencing techniques (color space) provide some benefits in terms of read accuracy, since each base in the template is interrogated twice in independent primer rounds. This feature of the color space facilitates the discrimination between erroneous base calls and true genetic variants. Technologies that output results in nucleotide space instead use other techniques to remove likely errors earlier in the processing pipeline.

To avoid confusion, in this paper we call mismatch a single mutation on a sequence in color space, Single Nucleotide Polymorphism (SNP) a single mutation in nucleotide space, and SNP locus multiple contiguous mutations in nucleotide or color space.

Hereafter, we describe the characteristics of the color scheme in detail.

3.1 Color scheme

Given a nucleotide sequence of k bases, the color encoding produces a string of length k: the first position is a nucleotide base and the remaining (k-1) positions are colors. Each color represents four possible combinations of two bases. Fig. 1 reports, for each pair of nucleotides, the corresponding color. The translation from nucleotide sequence to color sequence requires only the straightforward application of the table ColorTab shown in Fig. 1.

Figure 1: ColorTab: coding scheme of the AB SOLiD system
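A minimal sketch of the encoding follows. It assumes the color assignment implied by the transformations described in Section 3.2 (color 0 keeps the base, color 3 complements it, color 1 swaps A with C and G with T, color 2 swaps A with G and C with T). Under that assumption the table lookup can be replaced by a bitwise XOR of 2-bit base codes; this shortcut, the `CODE` mapping and the function names are illustrative choices, not part of the original description.

```python
# 2-bit codes chosen so that color(a, b) = CODE[a] XOR CODE[b] (an assumption of this sketch)
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def color_of(a, b):
    """Color of the dibase ab."""
    return CODE[a] ^ CODE[b]

def encode(seq):
    """Nucleotide string -> first base followed by one color per adjacent pair."""
    return seq[0] + "".join(str(color_of(a, b)) for a, b in zip(seq, seq[1:]))

print(encode("ATCATT"))   # A32130
```

The output has the same length as the input: one leading base plus k-1 colors.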

The decoding of a color sequence into a nucleotide sequence requires the application of the color transformations one by one, as follows. Let f_c(L) be the decoding function for color c applied to base L. Then f_c(L) = L′ if ColorTab[L, L′] = c (f returns the column of row L that holds color c; for instance f_3(A) = T). If we consider a color sequence as a sequence of transformations defined by function f, then to decode a color sequence into a nucleotide sequence it is enough to recursively apply f: at each recursive call a new base is decoded. For example, to decode the sequence A320 we apply f_3(A) to obtain T, to which f_2 is applied, getting f_2(T) = C, and so on until all bases are decoded.

Let t be the sequence of f_c() applications; the decoded sequence then heavily depends on the first base. Consider the decoding of A32130 and C32130:

t(A32130) produces ATCATT

t(C32130) produces CGACGG

The first nucleotide base should be retained in the color sequence: indeed, the same color sequence leads to two completely different nucleotide sequences depending on the initial letter.
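The decoding above can be sketched as follows. The sketch is iterative rather than recursive, and it assumes the same color assignment as the transformations of Section 3.2, implemented here as XOR on base indices; the helper names are illustrative.

```python
BASES = "ACGT"   # index of each base doubles as its 2-bit code (assumption of this sketch)

def f(color, base):
    """Apply color transformation `color` to `base`, e.g. f(3, "A") == "T"."""
    return BASES[BASES.index(base) ^ color]

def decode(color_seq):
    """Decode a string such as "A32130" (leading base + colors) into nucleotides."""
    bases = [color_seq[0]]
    for c in color_seq[1:]:
        bases.append(f(int(c), bases[-1]))   # each color acts on the last decoded base
    return "".join(bases)

print(decode("A32130"))   # ATCATT
print(decode("C32130"))   # CGACGG
```

Because every step depends on the previously decoded base, an error in the leading base, or in any single color, changes every base that follows it.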

As reported in [30], it is mandatory to compare each read with the reference genome before translating color reads into nucleotide space. This is due to two aspects. First, the color coding of every dibase pairing is not unique: a string of colors can represent one of several DNA strings depending on the preceding base. Second, a sequence of matches and mismatches in color space does not map uniquely into letter space.

Figure 2: Example of an isolated single-base variant in color and nucleotide space

For example, as depicted in Fig. 2, a single SNP results in two adjacent color space mismatches. Even though there are 9 possible color pairs, only three of them correspond to a SNP, while the rest lead to DNA sequences that are completely different from the reference genome.
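The count of 9 color pairs, of which 3 are SNPs, can be checked by enumeration. The sketch below decodes every pair of colors that mismatches the reference at both positions and keeps those that change the middle base only. The reference colors (3, 2) and the starting base are arbitrary illustrative values, and the XOR-based decoding assumes the color transformations of Section 3.2.

```python
from itertools import product

BASES = "ACGT"

def decode(first_base, colors):
    """Decode a tuple of colors starting from first_base (XOR-based sketch)."""
    seq = [first_base]
    for c in colors:
        seq.append(BASES[BASES.index(seq[-1]) ^ c])
    return "".join(seq)

def classify_color_pairs(first_base, g2, g3):
    """Split the double mismatches at reference colors (g2, g3) into SNPs and others."""
    ref = decode(first_base, (g2, g3))
    snps, others = [], []
    for r2, r3 in product(range(4), repeat=2):
        if r2 == g2 or r3 == g3:
            continue                      # keep only pairs that mismatch at both positions
        read = decode(first_base, (r2, r3))
        # A SNP alters the middle base only: the base after the pair must be restored.
        (snps if read[2] == ref[2] else others).append((r2, r3))
    return snps, others

snps, others = classify_color_pairs("A", 3, 2)
print(len(snps), len(others))   # 3 6
```

In the six remaining cases the base following the pair differs from the reference, so every later base of the decoded read is shifted as well.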

Having described the main features of the color space and how a color sequence can be transformed into a nucleotide sequence, we now explain how and why up to k similarity in color space can be correctly interpreted in nucleotide space.

3.2 Detecting mismatches and SNP

Any color can also be considered as a specific nucleotide transformation. Color 0 maps a base into itself, color 3 maps a nucleotide into its complement, color 1 exchanges A with C and G with T, and finally color 2 completes all possible transformations by exchanging A with G and C with T. Colors can thus be considered as the transformations listed in Fig. 3.

Figure 3: Color as transformations

Let T be the matrix reported in Fig. 3; then T(i, j) = k means that the application of i followed by j on a base b is equivalent to applying k to b, and we write i ⊕ j = k. In Fig. 3 we have highlighted the case i = 2, j = 1, which results in k = 3.
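The composition matrix can be rebuilt directly from the four transformations described in the text. In the sketch below each color is written as a permutation of the bases and T is computed by composing them; the names `COLOR` and `compose` are illustrative. With the usual numbering of the colors the result coincides with bitwise XOR, which the final check confirms.

```python
# Each color as a permutation of the bases, as described in the text.
COLOR = {
    0: {"A": "A", "C": "C", "G": "G", "T": "T"},   # identity
    1: {"A": "C", "C": "A", "G": "T", "T": "G"},   # A<->C, G<->T
    2: {"A": "G", "G": "A", "C": "T", "T": "C"},   # A<->G, C<->T
    3: {"A": "T", "T": "A", "C": "G", "G": "C"},   # complement
}

def compose(i, j):
    """Return the color k equivalent to applying color i and then color j."""
    for k, perm in COLOR.items():
        if all(COLOR[j][COLOR[i][b]] == perm[b] for b in "ACGT"):
            return k
    raise ValueError("colors are not closed under composition")

T = [[compose(i, j) for j in range(4)] for i in range(4)]
for row in T:
    print(row)

# The composition coincides with bitwise XOR of the color numbers.
assert all(T[i][j] == i ^ j for i in range(4) for j in range(4))
```

The matrix is symmetric and every color is its own inverse, which is why the order of application does not matter in the properties that follow.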

Considering the example reported in Fig. 2, it is possible to observe how a single polymorphism is amplified in color space over two adjacent color positions. When is it the case that k adjacent mismatches correspond to a SNP locus? And what is the size of the locus? To answer, we take into account the operator ⊕, defined above, from pairs of colors to colors, through the matrix reported in Fig. 3. Indeed, the coding scheme of Fig. 1 has been built so that the following properties are met:

• A dibase and its reverse get the same color;

• Two different dibases that have the same first base get different colors;

• Two different dibases that have the same second base get different colors;

• Monodibases (dibases made of two identical bases) get the same color;

• A dibase and its complement get the same color.

The coding scheme reported in Fig. 1 is the only one, up to permutations of the colors, that satisfies the above properties. Indeed, once a row is fixed there is only one possible choice for the other three.

Considering the SNP reported in Fig. 2, the colors g2, g3, r2, r3 are consistent with this variant if they satisfy the following properties:

g2 ≠ r2 (the colors are different);

g2 ⊕ g3 = r2 ⊕ r3 (the color transformation is the same).

Note that the previous properties imply: g3 ≠ r3
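The two properties translate into a one-line predicate. The sketch assumes that ⊕ can be computed as bitwise XOR of the color numbers, which is consistent with the highlighted case 2 ⊕ 1 = 3; the function name is illustrative.

```python
def is_isolated_snp(g2, g3, r2, r3):
    """True if the two color pairs differ at the first color and compose to the same transformation."""
    return g2 != r2 and (g2 ^ g3) == (r2 ^ r3)

print(is_isolated_snp(3, 2, 0, 1))   # True: both colors change, 3^2 == 0^1
print(is_isolated_snp(3, 2, 0, 2))   # False: a single color change

# The implied property g3 != r3 holds for every consistent combination.
assert all(g3 != r3
           for g2 in range(4) for g3 in range(4)
           for r2 in range(4) for r3 in range(4)
           if is_isolated_snp(g2, g3, r2, r3))
```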

Examining the sequences reported in Fig. 4, if only one of the previous properties is not satisfied, then there is only one color change instead of two.

Figure 4: Effect of a mismatch on the nucleotide sequence

In the previous example both color positions are relevant to correctly detect a single mismatch in base space. In the following examples two and three adjacent mismatches are shown. Figure 5 reports an example of two and three contiguous nucleotide variants.


Figure 5: Example of a SNP locus composed of two (A) and three (B) mutations in color and nucleotide space

In the first case, Fig. 5(A), three necessary properties to obtain two adjacent mismatches in the respective nucleotide sequence are:

g1 ≠ r1

g1 ⊕ g2 ≠ r1 ⊕ r2

g1 ⊕ g2 ⊕ g3 = r1 ⊕ r2 ⊕ r3

Two other properties follow from the above:

g2 ⊕ g3 ≠ r2 ⊕ r3

g3 ≠ r3

The colors g2 and r2 could be equal or they could be different as well. Even when the center color positions are not mismatches, the analysis still applies since the example satisfies the above properties.

In the second case, Fig. 5(B), the properties are:

g1 ≠ r1

g1 ⊕ g2 ≠ r1 ⊕ r2

g1 ⊕ g2 ⊕ g3 ≠ r1 ⊕ r2 ⊕ r3

g1 ⊕ g2 ⊕ g3 ⊕ g4 = r1 ⊕ r2 ⊕ r3 ⊕ r4

It is worthwhile to state a general theorem specifying when a color change corresponds to a SNP mutation. As reported in the AB SOLiD white paper, it is necessary to clarify two common misunderstandings: that an isolated mismatch must always correspond to a sequencing error, and that an isolated two-color change always corresponds to an error. Indeed, only when the following theorem is satisfied is it possible to consider a mutation true in both color space and base space.

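Let $G = \langle b_0, g_1, \ldots, g_n \rangle$ and $R = \langle b_0, r_1, \ldots, r_n \rangle$ be two color sequences, where $b_0 \in \{A, C, G, T\}$ and $g_i, r_i \in \{0, 1, 2, 3\}$. Then the k-color sub-string $R_k = \langle r_j, \ldots, r_{k+j} \rangle$ encodes an isolated (k-1) base change with respect to the k-color sub-string $G_k = \langle g_j, \ldots, g_{k+j} \rangle$ iff

• there is a full match in nucleotide and color before j: $g_1, g_2, \ldots, g_{j-1} = r_1, r_2, \ldots, r_{j-1}$

• $\forall l < k: \; \bigoplus_{i=j}^{j+l} r_i \neq \bigoplus_{i=j}^{j+l} g_i$

• $\bigoplus_{i=j}^{j+k} r_i = \bigoplus_{i=j}^{j+k} g_i$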
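The three conditions can be checked mechanically. The sketch below follows the index ranges of the conditions literally (colors j to j+l for the inequalities, j to j+k for the final equality, with 1-based positions as in the text) and assumes that ⊕ is bitwise XOR of the color numbers. The function names and the example sequences are illustrative.

```python
from functools import reduce
from operator import xor

def xor_all(colors):
    """Composition of a run of colors, computed as XOR (assumption of this sketch)."""
    return reduce(xor, colors, 0)

def satisfies_conditions(g, r, j, k):
    """Check the three conditions for color lists g and r (g[0] is g_1, r[0] is r_1)."""
    if g[:j - 1] != r[:j - 1]:                       # full match before position j
        return False
    for l in range(k):                               # every partial composition differs
        if xor_all(r[j - 1:j + l]) == xor_all(g[j - 1:j + l]):
            return False
    # the complete composition over positions j..j+k is the same
    return xor_all(r[j - 1:j + k]) == xor_all(g[j - 1:j + k])

g = [3, 2, 1, 3]                                     # colors of ATCAT
print(satisfies_conditions(g, [3, 1, 2, 3], 2, 1))   # ATGAT, one base changed: True
print(satisfies_conditions(g, [2, 2, 0, 3], 1, 2))   # AGAAT, two bases changed: True
print(satisfies_conditions(g, [3, 1, 1, 3], 2, 1))   # single color change: False
```

The last call is the case of a color change that is not compensated by the following color, so it cannot be explained by a base substitution confined to the window.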

In addition to mismatches, AB SOLiD is able to detect insertions and deletions. It should be noted that a base deleted from or inserted in the read is depicted as a “-”; AB SOLiD has no call corresponding to “-”. This symbol is generated only by aligning the read to the reference.

Figure 6: Example of a deletion (A) and an insertion (B) in color and nucleotide space

Fig. 6(A) reports a single base deletion in the read sequence. In this case, the main property is g3 = r2 ⊕ r3: a single base deletion results in two elementary colors r2 and r3 being replaced by one color g3. A further observation is that all colors involved in the deletion are different, unless one of them is the identity color (0), in which case the other two colors must be identical. This property also applies to multiple bases. The alignment reported in Fig. 6(B) shows the insertion of three nucleotides in the read. The only property that emerges in this case is g2 = r2 ⊕ r3 ⊕ r4 ⊕ r5: the color g2 in the reference sequence must be replaced by the four colors in the read.
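The deletion property can be verified on a toy sequence: removing one base makes the two colors that spanned it collapse into a single color equal to their composition. The sketch assumes XOR-based colors as in Section 3.2; the sequence and the function names are illustrative and do not reproduce Fig. 6.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def colors(seq):
    """Colors (without the leading base) of a nucleotide sequence."""
    return [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]

def delete_base(seq, pos):
    """Remove the base at 0-based position pos."""
    return seq[:pos] + seq[pos + 1:]

full = "ATCAGT"
short = delete_base(full, 2)              # remove the C
c_full, c_short = colors(full), colors(short)
print(c_full, c_short)
# The two colors around the removed base combine into the one color that replaces them.
print(c_short[1] == c_full[1] ^ c_full[2])
```

Read in the opposite direction, the same identity describes an insertion: one color is replaced by a run of colors whose composition equals it.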

Let $G = \langle b_0, g_1, \ldots, g_n \rangle$ and $R = \langle b_0, r_1, \ldots, r_m \rangle$ be two color sequences, where $b_0, b'_0 \in \{A, C, G, T\}$ and $g_i, r_i \in \{0, 1, 2, 3\}$, with $m \geq n + k$. Then R is obtained by a k-insertion from G in position j iff

• there is a full match in nucleotide and color before j: $g_1, g_2, \ldots, g_j = r_1, r_2, \ldots, r_j$

• $\bigoplus_{i=j}^{j+k} g_i = r_{j+1}$

A deletion of k bases from G is the complementary operation: if R is obtained by a k-insertion into G, then G can also be considered as the result of a k-deletion from R.

4 Algorithms comparison

In the last two years the number of available mapping tools designed for NGS output has increased steadily, while earlier alignment tools, like BLAST, are progressively being abandoned due to their computational time and memory requirements. Indeed, BLAST was designed for finding the homologous sequences of a given protein sequence. The adoption of BLAST as a short-read alignment program raises several problems related to both the technology error rate and the high number of mismatches.

This new generation of tools must take into consideration two new aspects: the large amount of data produced, and the number of mismatches to be considered, a value that is driven by the species polymorphism rate and by the technology error rate [25].

All the alignment tools considered follow a similar organization into macro steps:

• Pre-processing, to organize the reference genome (or the set of reads) into one or more index tables;

• Mapping, to find matches in the index tables for all queried subsequences and to locate the regions of potential homology;

• Result Refinement, to examine all potential matches produced in the previous step using the full read-genome substring alignment.

We devote the rest of the section to discussing the main data structures and algorithms used in the first two steps and to presenting the 5 NGS alignment tools considered.

4.1 Data structures and algorithms

All mapping tools have a pre-processing phase in which either the reference genome or the set of reads is organized into data structures that allow quick access to the information. Since these data structures are used for a very large number of alignments, the execution time for their generation is not an issue (it is easily amortized), while the important factors that drive the choice of the data structure are the memory cost, the execution time of a search, and the amount of search results that require further processing.

Seed and hash functions. A hash table is basically an array which uses a hash function to efficiently map an input string, called the key, to an associated value. Each input string, typically large, is encoded by the hash function into a smaller datum, called the hash (e.g. a single integer), which is used as the index of the array element where the associated value is stored. If the hash table stores the position of each substring of 35 nucleotides of a reference genome, and if the same substring appears more than once in the reference, then the hash table is not an array of integers, but an array of sets of integers (usually implemented using the bin data structure). The hash function applied to a read r will return the index of the array element that contains the positions in the reference in which the read is found.

Ideally, the hash function should map each possible key to a different index, but this is not ensured in practice: two or more different keys may have the same hash (hash collision). For genome mapping this has significant consequences, since the set associated with an entry in the hash table may contain positions that match different substrings, so that a further step is required to ensure the mapping correctness (the Result Refinement step introduced before). This affects the efficiency of each search/lookup, even if the choice of an appropriate hash function can reduce the number of hash collisions.

The hash-based alignment algorithms can encode in the hash table either the reference genome or the input reads, so that a full mapping can be performed by looking up a read (or a set of reads) in the hash table of the reference genome, or vice versa. Each approach has its advantages and disadvantages: for instance, hashing the reads has smaller and variable memory requirements, which depend on the number and diversity of the input reads, but it may use more computation time to scan the whole genome when the input reads are few. In both cases, due to hash collisions, all the matches found have to be verified.
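A minimal sketch of the genome-side index, using a Python dictionary as the hash table. The function names and the toy genome are illustrative. Note one simplification: a Python dictionary resolves hash collisions internally by comparing the full keys, so the explicit verification step described above is not needed here, whereas tools that index compact integer hashes do need it.

```python
from collections import defaultdict

def build_index(genome, k):
    """Map every k-mer of the genome to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index

def full_mapping(read, index):
    """All genome positions where the read occurs exactly (read length must equal k)."""
    return index.get(read, [])

genome = "ACGTACGTTACG"
index = build_index(genome, 4)
print(full_mapping("ACGT", index))   # [0, 4]
print(full_mapping("TTTT", index))   # []
```

The index is built once and then queried for every read, which is why its construction time is easily amortized while its memory footprint matters.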

This approach works well only in the full matching case: indeed, finding a read with 1 mutation would require searching for all reads obtained by introducing a single mutation (there are 3 possible substitutions) in each possible position (as many positions as the read size). To deal with mutations the concept of seed [11] has been introduced. A seed is a set of selected positions within a window, which generates fixed-length sub-sequences along a string. Seeds are used to discover a potential match between a read and a genomic portion with up to n mismatches; indeed, it is always possible to divide a read into a set of consecutive seeds so that at least one has to match exactly at a genome position. In this case a potential match is declared and further examination has to be carried out to determine the overall similarity.
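The argument above is a pigeonhole one: if a read is cut into k + 1 consecutive fragments and at most k mismatches are allowed, at least one fragment must match exactly. The sketch below uses that fragment to propose candidate positions and then verifies each candidate on the whole read. For brevity it locates fragments with `str.find` instead of a hash table, and it handles substitutions only; all names and sequences are illustrative.

```python
def split_into_seeds(read, k):
    """Cut the read into k + 1 consecutive fragments, returned as (offset, fragment)."""
    n = len(read)
    bounds = [i * n // (k + 1) for i in range(k + 2)]
    return [(bounds[i], read[bounds[i]:bounds[i + 1]]) for i in range(k + 1)]

def candidate_positions(read, genome, k):
    """Genome positions where at least one fragment matches exactly."""
    candidates = set()
    for offset, fragment in split_into_seeds(read, k):
        start = genome.find(fragment)
        while start != -1:
            pos = start - offset                     # where the whole read would begin
            if 0 <= pos <= len(genome) - len(read):
                candidates.add(pos)
            start = genome.find(fragment, start + 1)
    return sorted(candidates)

def map_up_to(read, genome, k):
    """Verify each candidate against the whole read, allowing up to k substitutions."""
    return [p for p in candidate_positions(read, genome, k)
            if sum(a != b for a, b in zip(read, genome[p:p + len(read)])) <= k]

genome = "TTACGTACGATT"
read = "ACGTTCGA"                      # genome[2:10] with one substitution
print(map_up_to(read, genome, 1))      # [2]
print(map_up_to(read, genome, 0))      # []
```

The verification step is what removes the spurious candidates that a short exact fragment inevitably produces.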

According to this idea, the hash-based techniques can be enriched with consecutive seeds: each sequence is divided into k + m fragments. To ensure full sensitivity to k mismatches, k + m − 1 hash tables are built and $\binom{k+m}{m}$ hashing steps are used to check for exact matches in the different combinations of m fragments. After that, the detected potential matches still have to be verified considering the whole sequence through specialized and accurate alignment algorithms. Increasing the value of m leads to a smaller number of potential matches, but to more hash tables and hash steps.
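The counting argument can be made concrete by enumerating the combinations of m fragments out of k + m and checking that, wherever k mismatches fall, at least one combination is left untouched. The sketch only illustrates the combinatorics; it does not model how the hash tables themselves are organized, and the function names are illustrative.

```python
from itertools import combinations
from math import comb

def seed_templates(k, m):
    """All ways of choosing the m fragments (out of k + m) that must match exactly."""
    return list(combinations(range(k + m), m))

def always_hit(k, m):
    """True if, wherever k mismatched fragments fall, some template avoids all of them."""
    templates = seed_templates(k, m)
    for bad in combinations(range(k + m), k):        # fragments containing a mismatch
        if not any(set(t).isdisjoint(bad) for t in templates):
            return False
    return True

k, m = 2, 2
print(len(seed_templates(k, m)), comb(k + m, m))     # 6 6
print(always_hit(k, m))                              # True
```

With k = 2 and m = 2 there are 6 templates; a larger m makes each exact match longer, and therefore more selective, at the price of more templates to try.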

More recently, spaced seeds have been introduced in [1]. A spaced seed is a set of “care” and “do not care” positions, annotated as “1” and “*” respectively. Usually the length of the seed is the total length of the string, and the weight of the seed is the number of 1s in the string (total number of care positions). It has been proved [16, 35] that spaced seeds are better than consecutive seeds in finding local similarities between two strings. However, the speed of execution depends largely on the spaced seed employed, the number of desired matches and its length. Decreasing the seed weight increases the execution time; on the other hand, spaced seeds with a weight around 16 usually become impractical due to memory limitations.
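A minimal sketch of how a spaced seed is applied: only the characters under “1” positions contribute to the key, so two windows that differ only at “*” positions produce the same key. The seed pattern used here is made up for illustration and is not one of the seeds shipped with any tool.

```python
def apply_seed(seed, window):
    """Keep the characters of `window` that fall under a care ("1") position."""
    return "".join(ch for s, ch in zip(seed, window) if s == "1")

def seed_weight(seed):
    """Weight of a spaced seed: number of care positions."""
    return seed.count("1")

seed = "11*1*11"                        # hypothetical seed: length 7, weight 5
print(seed_weight(seed))                # 5
print(apply_seed(seed, "ACGTACG"))      # ACTCG
print(apply_seed(seed, "ACTTGCG"))      # ACTCG, differs only at the two "*" positions
```

Indexing these keys instead of contiguous substrings lets a lookup tolerate mismatches at the do-not-care positions without enumerating mutated reads.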

Burrows-Wheeler Transformation and suffix array. Other tools for sequence alignment are based on the Burrows-Wheeler Transformation (BWT) paired with a Compressed Suffix Array (CSA) [33]. The BWT is a well-known algorithm used in data compression (it is used for example in bzip2). It transforms an input sequence through a reversible permutation of its characters, which gives a new string that is “easier to compress”: if the original string has several substrings that occur often, then the transformed string has several places where a single character is repeated multiple times.
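A naive sketch of the transformation and of its inverse, useful only to see the permutation at work. It sorts all rotations explicitly, which is far too slow and memory-hungry for a genome, and it appends a `$` end marker that is assumed not to occur in the input and to sort before every other character; both choices are simplifications of this sketch.

```python
def bwt(text, end="$"):
    """Last column of the sorted rotations of text + end marker."""
    s = text + end
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def inverse_bwt(last, end="$"):
    """Rebuild the original text by repeatedly prepending the last column and sorting."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    original = next(row for row in table if row.endswith(end))
    return original[:-1]

print(bwt("banana"))                 # annb$aa
print(inverse_bwt(bwt("banana")))    # banana
```

In the transformed string the repeated characters end up next to each other, which is the property that makes it easier to compress.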

A suffix array is a data structure designed for efficient searching of a large text. It is an array of integers giving the starting positions of the suffixes of a string in lexicographical order. It can be used as an index to quickly locate every occurrence of a substring within the string. Finding every occurrence of the substring is equivalent to finding every suffix that begins with the substring. Thanks to the lexicographical ordering these suffixes will be grouped together in the suffix array, and can be found efficiently using a binary search.
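A small sketch of an uncompressed suffix array and of the binary search over it. For clarity the list of truncated suffixes is rebuilt at every query, which a real implementation would never do; the construction by plain sorting is also only suitable for short strings. Function names are illustrative.

```python
from bisect import bisect_left, bisect_right

def suffix_array(text):
    """Starting positions of the suffixes of text, in lexicographical order."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_all(text, sa, pattern):
    """Every position where pattern occurs, via binary search on the suffix array."""
    # Truncating sorted suffixes to the pattern length keeps them sorted.
    prefixes = [text[i:i + len(pattern)] for i in sa]
    lo = bisect_left(prefixes, pattern)
    hi = bisect_right(prefixes, pattern)
    return sorted(sa[lo:hi])

text = "ACGTACGT"
sa = suffix_array(text)
print(sa)
print(find_all(text, sa, "CGT"))   # [1, 5]
```

All suffixes starting with the pattern occupy one contiguous range of the array, so two binary searches are enough to delimit every occurrence.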

Ferragina and Manzini show in [24] that a suffix array is much more efficient (in both memory and search time) if it is created from a BWT-compressed sequence rather than from the original sequence. A suffix array that works on compressed data is called a Compressed Suffix Array (CSA).

The alignment algorithms based on BWT use a CSA to encode the reference genome and to search for all the matches. Creating the CSA requires two steps: first, the sequence order of the reference genome is modified using the BWT; next, the transformed sequences are compressed and used to create the CSA.

This technique has been extended in [2] to allow alignments with up to n mismatches: the binary search has been modified so that, each time a suffix does not occur, the algorithm selects an already-matched position, replaces its base with another one and starts the search again.

The choice of the positions to be substituted is driven by the quality score, so that the positions with a poor quality score are selected first, a characteristic that may have an impact if the algorithm is instructed to find a single alignment for each read.

4.2 Tools

Hereafter, we describe the peculiar features of the five algorithms which are compared in the next section.

SOCS is a tool that maps AB SOLiD data onto a reference genome. It works in color space and it uses consecutive seeds and hash tables built on the genome. SOCS takes m = 1, therefore for k mismatches it splits the read into k + 1 pieces, so that as the mismatch tolerance increases, the fragments used for each partial hash get smaller, and thus their hashes are less unique, leading to a higher number of “spurious hits”.

Whenever a read matches (with one or more mismatches) in multiple positions, the quality score is used to rank them, also taking into account an approximate distribution of the probability of a given number of sequencing errors in a read.

The results obtained by SOCS are strictly related to color space. A run for k mismatches finds up to k mutations in the color string. As explained in Section 3, a mismatch found in color space does not correspond one to one to a SNP or SNP locus in nucleotide space. We shall come back to this observation when we examine the experimental results in Section 5.

MAQ aligns short reads to the reference genome and infers variants, including SNPs and short insertions/deletions (indels). MAQ works in both color and nucleotide space. The mapping algorithm is based on consecutive seeds, dividing each read into k + m fragments to provide full sensitivity to k mismatches. The value of m is fixed to 2. As a trade-off between sensitivity and efficiency, MAQ does not consider any mapping that has more than two mismatches in the first 28 positions. As we shall see in Section 5, this choice leads to very low sensitivity for the case of three or more mutations.


If a read aligns to multiple positions, MAQ scores the various possibilities through quality scores and compatibility on the complementary strand. MAQ finds SNPs and indels using mate-pair information. All the hits found on the forward strand of the reference sequence are stored in a queue. While examining the reverse strand, if a hit for a read is found, MAQ looks up the queue to check if there is a partial overlap with one of the hits found on the forward strand. A pair of reads is correctly mapped if both ends of the hits are consistent, i.e. in the correct orientation and within the proper distance. If only one can be mapped with confidence, a possible scenario is that an indel occurred in one of the two reads. The classical Smith-Waterman algorithm [32] is then applied on the two reads to check the validity of the alignment with/without indel or SNP. Even though MAQ is no longer under active development, it has been considered in this review since in the past two years it has been one of the most popular mapping algorithms, used for many interesting biological studies. For instance, MAQ was used for the annotation and mapping task in one of the first papers that characterize single-nucleotide polymorphisms [8]: the authors found four million SNPs and four hundred thousand structural variants, many of which were previously unknown. The MAQ research team is now working on the Burrows-Wheeler Aligner (BWA) [13], described below.

PerM was designed to provide full sensitivity to a high number of mismatches in genome-scale mapping. PerM can work in both color and base space. PerM uses periodic seeds to reduce execution time, and it offers a set of maximum-weight periodic seeds that are fully sensitive to k mutations. PerM performs the post-processing required to correctly classify color mismatches into SNP loci, as explained in Section 3, by checking each read that represents a potential mapping against the target positions of the genome. This check is efficiently implemented through a bit-wise procedure.

The choice of the spaced seed is crucial to ensure a good mapping speed and sensitivity of the algorithm. A user can choose from a set of predefined seeds, or can specify their own. Users should be aware that the design of seeds is anything but a trivial task, since the seed greatly influences the performance of the mapping algorithm.
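To make the role of the seed shape concrete, here is a small sketch of spaced-seed matching, in which a seed is a string of '1' (position must match) and '0' (position is ignored). The seed patterns and the function name are hypothetical and chosen only for illustration; they are not PerM's predefined seeds (such as F4) nor its implementation.

```python
def seed_hits(read, ref_window, seed):
    """Return the offsets at which every '1' position of the seed matches.

    read and ref_window are assumed to have the same length; seed is a
    string over {'0', '1'} (illustrative pattern, not a PerM seed).
    """
    span = len(seed)
    hits = []
    for off in range(len(read) - span + 1):
        if all(read[off + i] == ref_window[off + i]
               for i, c in enumerate(seed) if c == "1"):
            hits.append(off)
    return hits


read = "ACGTACGTAC"
ref = "ACGAACGAAC"   # differs from the read at positions 3 and 7

contiguous = seed_hits(read, ref, "11111")   # weight 5, contiguous
spaced = seed_hits(read, ref, "111011")      # weight 5, one ignored position
```

With these two mismatches the contiguous seed of weight 5 never hits, while the spaced seed of the same weight hits at offsets 0 and 4, so the candidate mapping is not lost.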

SHRiMP is a mapping tool that works in color and nucleotide spaces. SHRiMP uses hashing and spaced seeds inside the three-step procedure, with the peculiarity that hashing and mapping are done only over k-mer substrings, with k usually much smaller than the read length; the final refinement step only takes into account reads that have more than a given threshold of hits in the genome portion being examined. A high threshold may lead to low sensitivity (some mappings may be lost), while a low threshold leads to a high computational cost for the refinement. The refinement step uses the Smith-Waterman algorithm: a peculiarity of SHRiMP is that it implements the Smith-Waterman algorithm to work directly in color space.

SHRiMP also provides a confidence for the possible mappings of each read by using two statistics: pchance, the probability that the hit occurred by chance, and pgenome, the probability that the hit was generated by the genome, given the observed rates of the various evolutionary and error events. A good alignment should be characterized by a low pchance and a high pgenome.
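The trade-off between hit threshold and sensitivity can be sketched with a plain k-mer hash index. This is a deliberate simplification: it uses contiguous k-mers rather than spaced seeds, counts hits per candidate start position instead of per genome window, and all function names are hypothetical rather than taken from SHRiMP.

```python
from collections import defaultdict


def build_kmer_index(genome, k):
    """Map every k-mer of the genome to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index


def count_hits(read, index, k):
    """Count k-mer hits for each candidate start position of the read."""
    votes = defaultdict(int)
    for j in range(len(read) - k + 1):
        for pos in index.get(read[j:j + k], ()):
            votes[pos - j] += 1   # where the read would start on the genome
    return votes


def candidates(read, index, k, threshold):
    """Start positions with at least `threshold` hits, to pass to refinement."""
    votes = count_hits(read, index, k)
    return sorted(start for start, n in votes.items() if n >= threshold)


genome = "TTGACCATGCAAGT"
index = build_kmer_index(genome, 3)
mutated_read = "ACCTTGCA"   # genome[3:11] with one substitution

kept = candidates(mutated_read, index, 3, threshold=3)
lost = candidates(mutated_read, index, 3, threshold=4)
```

Here the single substitution destroys three of the six k-mers of the read: with a threshold of 3 the true position (3) still reaches the refinement step, while with a threshold of 4 the mapping is lost.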

Bowtie was specifically designed for the alignment of short sequences. It uses CSA with BWT-based compression to encode the reference genome. This operation is particularly memory-efficient (the human genome is encoded in only 1.3 GB), without incurring a significant execution-time penalty. Bowtie implements a backtracking algorithm that extends the one proposed in [24] to allow mismatches and to favor high-quality alignments. Searching for inexact alignments in Bowtie is performed using a quality-aware, greedy depth-first search through the space of possible alignments. If a valid alignment exists, then Bowtie will find it, but the greedy search can lead to “local solutions”. Indeed, Bowtie stops at the first match found for a read, which may not be the best in terms of number of mismatches and/or quality. The user can set Bowtie to find all possible alignments and to return the best in terms of mismatches, but this choice has an impact on computation time (up to two to three times slower). Bowtie is one of the most used tools based on BWT. Other short-read alignment programs based on BWT and CSA include BWA [13] and SOAP2 [26]. In particular, BWA is a new tool that implements two different algorithms to perform sequence alignment. Both algorithms are based on the BWT: the former is designed for short reads of up to 200 bp with a low error rate, while the latter is designed for long reads with more errors.
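For readers unfamiliar with BWT-based indexing, the sketch below shows exact matching by backward search over a Burrows-Wheeler transform. It is a toy version under clear assumptions: the suffix array is built by naive sorting, occurrence counts are stored uncompressed, and no mismatches or backtracking are handled, so it does not reflect the actual implementation of Bowtie, BWA or SOAP2.

```python
def bwt_index(text):
    """Build a toy index: suffix array, C table and occurrence table."""
    text += "$"                                   # unique end marker
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)        # char preceding each suffix
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    # C[c]: number of characters in the text strictly smaller than c.
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return sa, C, occ


def backward_search(pattern, sa, C, occ):
    """Return the sorted start positions of exact occurrences of pattern."""
    lo, hi = 0, len(sa)
    for ch in reversed(pattern):                  # extend the match leftwards
        if ch not in C:
            return []
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))


sa, C, occ = bwt_index("ACGTACGA")
positions = backward_search("ACG", sa, C, occ)
```

Each pattern character narrows an interval of the suffix array, so the cost of a query depends on the read length rather than on the genome length; tolerating mismatches requires exploring alternative characters at each step, which is where backtracking comes in.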

5 Dataset generation and tool comparison

In this section we compare the behavior of the five tools considered. Table 1 lists the websites from which the tools were downloaded, the version used in the experiments, and whether color space, base space, or both are available. Table 2 reports, for each tool, the specific run command used in the experiments.

The comparison is based on three “synthetic” datasets built from the Chromosome 22 genome, and only takes into account mismatches, while insertions and deletions (indels) are not considered. The use of a synthetic dataset yields, for each choice of allowed mismatches, a known number of mappings, thus simplifying the computation of the sensitivity of the runs.

Dataset generation. To generate the three datasets we have implemented a new tool that takes as input a Specific Set (SS) of sequences that are certainly present in the reference genome, and a Background Set (BS) of sequences that are certainly not present in it. If SS is used as input for a mapping algorithm against the reference genome, then each read should be mapped in at least one position. If BS is used instead, no mapping should be found. To allow testing algorithms also in the presence of mismatches and SNPs, the tool outputs a set of sequences obtained by introducing 0, 1, . . . , n mutations into the sequences of BS and SS, where n is an input parameter. Introducing mutations is also a way to generate large sets of sequences that can be used to test the efficiency of the alignment tools in terms of time and space.

The dataset generation tool consists of two separate programs. The first one generates a new set from an input set of sequences, introducing at most n mutations (n is an input parameter defined by the user); the second one takes two sets of sequences and merges them into a single set according to a user-specified policy (e.g. random order, all the sequences of the first input file first, all the sequences of the second input file first, . . .).
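The first program can be approximated by the sketch below, which enumerates every choice of at most n positions and substitutes a different, randomly chosen base at each of them. How the original tool selects positions and replacement bases is not stated here, so the one-variant-per-position-set policy, the 0-based positions, and all function names are assumptions.

```python
import itertools
import math
import random

BASES = "ACGT"


def mutate(seq, positions, rng):
    """Substitute the base at each 0-based position with a different base."""
    out = list(seq)
    for p in positions:
        out[p] = rng.choice([b for b in BASES if b != seq[p]])
    return "".join(out)


def variants(seq, n, seed=0):
    """Yield (positions, mutated_seq) for every choice of at most n positions.

    Assumption: one variant per set of positions, with a random replacement.
    """
    rng = random.Random(seed)
    for m in range(n + 1):
        for positions in itertools.combinations(range(len(seq)), m):
            yield positions, mutate(seq, positions, rng)


def n_variants(length, n):
    """Number of variants produced by `variants` for a sequence of given length."""
    return sum(math.comb(length, m) for m in range(n + 1))


example = list(variants("ACGT", 2))   # 1 + 4 + 6 = 11 sequences
```

Under this assumption a sequence of length 35 yields 7,176 variants for n = 3 and 59,536 for n = 4. A background set of 2788 sequences with n = 3 would then give 20,006,688 reads, close to the figure of about 20,000,000 reported below, whereas 54 specific sequences with n = 4 would give 3,214,944, somewhat below the reported figure of about 4,000,000, so the original tool may well enumerate mutations differently.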

For each newly created sequence the tool maintains in the sequence identifier all the information required to evaluate the mapping algorithms, leading to the following naming scheme:

〈Id_source_seq〉 ‘-’ 〈nm〉 (‘*’ 〈pos〉)^nm,

where nm is the number of mutations inserted in the sequence identified by Id_source_seq, and the group (‘*’ 〈pos〉) is repeated nm times. For example, SEQ_IDi-2*3*17 is the identifier of a sequence that has been derived from the source sequence of identifier SEQ_IDi by introducing 2 mutations: one in position 3 and one in position 17.
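The naming scheme lends itself to simple encoding and decoding helpers, sketched below. The function names are hypothetical, the source identifier is assumed to be written with underscores (e.g. SEQ_IDi), and the code makes no assumption on whether positions are 0-based or 1-based: it only carries the integers around.

```python
def make_id(source_id, positions):
    """Build '<source_id>-<nm>*<pos>*<pos>...' from a list of positions."""
    return f"{source_id}-{len(positions)}" + "".join(f"*{p}" for p in positions)


def parse_id(read_id):
    """Split an identifier into (source_id, nm, positions).

    The last '-' is used as separator, so source identifiers may
    themselves contain '-' characters.
    """
    source_id, _, tail = read_id.rpartition("-")
    fields = tail.split("*")
    n_mut = int(fields[0])
    positions = [int(p) for p in fields[1:]]
    if len(positions) != n_mut:
        raise ValueError(f"expected {n_mut} positions in {read_id!r}")
    return source_id, n_mut, positions


read_id = make_id("SEQ_IDi", [3, 17])
parsed = parse_id(read_id)
```

Keeping the mutation count and positions in the identifier means that the output of a mapping tool can be checked against the ground truth without any auxiliary file.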


| Name | Download site | Color/base space | Version |
|---|---|---|---|
| PerM | http://code.google.com/p/perm/ | Yes/Yes | 0.2.3 |
| SOCS | http://socs.biology.gatech.edu/ | Yes/No | 1.2.1 |
| SHRiMP | http://compbio.cs.toronto.edu/shrimp | Yes/Yes | 1.3.2 |
| MAQ | http://maq.sourceforge.net | Yes/Yes | 0.7.1 |
| Bowtie | http://bowtie-bio.sourceforge.net/index.shtml | Yes/Yes | 0.12.3 |

Table 1: Mapping tools

| Tool Name | Parameters |
|---|---|
| PerM | -B v 4 seed F4 |
| SOCS | 3 100000 16 false true |
| SHRiMP | -o 1 -n 1 -M 35bp -M fast |
| MAQ | -C 1 n 3 |
| Bowtie | -t -p 16 -k 1 n 3 |

Table 2: Mapping parameters

Figure 7: Derivation of the three datasets

Figure 7 illustrates the generation steps of the three datasets used for the experiments. Dataset1 has been generated starting from SS and BS sets of 54 and 2788 nucleotide sequences of length 35, respectively. These cardinalities have been chosen so as to produce from SS about 4,000,000 sequences (using nm = 4, i.e. at most 4 mutations per sequence), and from BS about 20,000,000 sequences (using nm = 3). The two generated sets have been merged in random order, resulting in a dataset containing about 24,000,000 sequences in nucleotide space, a number comparable with the reads produced by an AB SOLiD experiment.

Dataset2 is derived from Dataset1 by translating all its output sequences into color space; the resulting dataset contains about 24,000,000 sequences in color space. Observe that the translation leads to color sequences that may contain more than 4 mutations, as discussed in Sec. 3.
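The translation into color space can be sketched as follows, using the standard SOLiD two-base encoding in which each color is the XOR of the 2-bit codes of two adjacent bases (A = 0, C = 1, G = 2, T = 3). The function names are hypothetical, and the leading primer base that real SOLiD reads carry is omitted for simplicity.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def to_colors(seq):
    """Translate a nucleotide sequence into its color string (no primer base)."""
    return "".join(str(CODE[a] ^ CODE[b]) for a, b in zip(seq, seq[1:]))


def color_mismatches(seq_a, seq_b):
    """Return the color positions at which two equal-length sequences differ."""
    colors_a, colors_b = to_colors(seq_a), to_colors(seq_b)
    return [i for i, (x, y) in enumerate(zip(colors_a, colors_b)) if x != y]


reference = "ACGTAC"
with_snp = "ACTTAC"          # single substitution G -> T at base position 2

ref_colors = to_colors(reference)                   # "13131"
snp_colors = to_colors(with_snp)                    # "12031"
changed = color_mismatches(reference, with_snp)     # [1, 2]
```

A single base substitution changes the two colors that overlap it, which is why a nucleotide sequence with at most 4 mutations can translate into a color sequence with more than 4 mismatches.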

Dataset3 also consists of about 24,000,000 sequences and is built like Dataset1, but translating SS and BS into color space first. In this dataset no sequence has more than 4 color mutations, which can nevertheless correspond to more than 4 mutations in nucleotide space.

Of the three datasets, Dataset1 and Dataset2 share the same input data from a biological point of view, while Dataset3 can be used to check color space mapping algorithms per se, independently of their ability to also correctly detect the SNP locus using color coding.

The dataset generation tool will soon be publicly available as a web service accessible at: http://www6.unito.it/dataset/seqmdd/program.php

Comparison. The three datasets produced have been used to exercise the five considered tools on mapping, without considering insertions and deletions. Although these algorithms share a certain similarity in the mapping process, the presence of specific main data structures, of different algorithms to assess similarity, and of different (and non-trivial) optimization techniques to speed up comparison may have a strong impact on computation time and sensitivity, especially when the number of mismatches increases.

The tool parameters were set up using publicly available software documentation, trying to keep the parameter settings as homogeneous as possible across the five tools (see Table 2).

Readers should be aware that, since each of the inspected tools is characterized by many tuning parameters, it is possible that some of the results we have obtained can be further improved by specific parameter tuning. The comparison is far from being complete: rather than trying to establish which tool is “the best” (a question whose answer is in most cases strongly dependent on the application at hand), we have tried to reproduce a situation in which an “informed” user runs the tools to map reads allowing SNPs in base space (and mismatches in color space).

A peculiarity of certain tools is that, depending on an input parameter, they may or may not output reads that match in more than one position of the genome. Therefore, letting the tools tolerate mismatches might result in a significant number of reads being discarded. To simplify the evaluation of the mapping results, we have instructed all tools not to discard reads with possible multiple alignments; nevertheless, all tools were instructed to produce only one alignment per read, usually the best one, to simplify the comparative analysis of sensitivity. Since the quality scores of the dataset are all set to the same value (the maximum allowed), there are cases in which the “best” option still produces more than one match, which may induce an overestimation of the sensitivity of the tools.

Table 3 reports the results of the runs on Dataset1 (for base space) and Dataset2 (for color space). For each tool the table reports (column “Space”) whether base space (nt) or color space (cs) was used. Column “Detection rate” reports the sensitivity of the algorithm, as the fraction of the sequences generated from SS that have been correctly detected with 0, 1, 2, and 3 mutations. The computation of the detection rate has required a post-processing of the output of each tool, to check the results against the known characteristics of the input dataset. Column “Time” is the execution time, while “Allowed mismatches/seed” is the number of allowed mismatches given as input for the runs, whenever this was allowed, and/or the seed used. We would have liked to perform our runs with 4 allowed mutations, but this was not possible or meaningful for the tools: SOCS, MAQ and Bowtie already have a very low sensitivity with 3 mutations.
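The post-processing behind the detection rate can be sketched as below, relying on the number of mutations stored in each read identifier. This is a simplified illustration with hypothetical function names: it only checks whether a read generated from SS appears among the mapped reads, whereas a full check of a "correct" detection would also compare the reported genome position with the expected one.

```python
from collections import Counter


def n_mutations(read_id):
    """Extract nm from an identifier of the form '<source>-<nm>*<pos>*...'."""
    return int(read_id.rpartition("-")[2].split("*")[0])


def detection_rate(ss_read_ids, mapped_ids):
    """Fraction of SS-derived reads that were mapped, grouped by nm."""
    mapped = set(mapped_ids)
    total, found = Counter(), Counter()
    for rid in ss_read_ids:
        k = n_mutations(rid)
        total[k] += 1
        if rid in mapped:
            found[k] += 1
    return {k: found[k] / total[k] for k in sorted(total)}


ss_reads = ["s1-0", "s1-1*3", "s1-1*5", "s1-2*3*7"]
mapped_reads = ["s1-0", "s1-1*3"]
rates = detection_rate(ss_reads, mapped_reads)   # {0: 1.0, 1: 0.5, 2: 0.0}
```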

A number of observations are possible even on this limited comparison. All tools that work in both base and color space perform better in base space; moreover, execution times are quite comparable, except for SHRiMP. SHRiMP is significantly slower but, as we shall discuss next, it has the highest sensitivity.

For a correct interpretation of the results we must emphasize that we are dealing with sequences in which up to 4 mutations have been introduced in base space, and readers should be aware that 4 SNPs in a sequence can lead to 8 mismatches when the sequence is translated into colors.

To clarify the difference between the sensitivity results of each tool, we should go a bit deeper into the computation of the results included in the detection rate columns, for k = 0, 1, 2, 3.

The column for k = 0 reports the percentage of sequences with 0 mutations found. Since 0 SNPs result in 0 color mismatches, we expect no difference in sensitivity between base and color space, which is indeed the case. As expected, all tools do a perfect job in finding fully matching alignments (sensitivity of 1).

The column for k = 1 reports the percentage of sequences with a single nucleotide mutation that have been found while working in base space (row nt) on Dataset1 or in color space on Dataset2 (which is a mere color translation of Dataset1). Since a single SNP results in two adjacent mismatches in the corresponding color sequence, any tool that can find up to 2 mismatches in color space is able to detect all sequences that contain 1 SNP. Indeed, for k = 1 all five tools reach a sensitivity of 100%.

| Tool | Space | Detection rate, k = 0 | k = 1 | k = 2 | k = 3 | Time | Allowed mismatches/seed |
|---|---|---|---|---|---|---|---|
| PerM | cs | 1 | 1 | 0.99 | 0.17 | 4m 30s | F4 |
| | nt | 1 | 1 | 1 | 0.81 | 3m | |
| SOCS | cs | 1 | 1 | 0.18 | 0.02 | 50m | 3 |
| SHRiMP | cs | 1 | 1 | 1 | 0.69 | 125m | – |
| | nt | 1 | 1 | 1 | 1 | 119m | |
| MAQ | cs | 1 | 1 | 0.04 | 0 | 6m | 3 |
| | nt | 1 | 1 | 1 | 0.01 | 6m | |
| Bowtie | cs | 1 | 1 | 0.09 | 0.01 | 6m | 3 |
| | nt | 1 | 1 | 0.84 | 0.0 | 5m 45s | |

Table 3: Mapping tools’ results on Dataset1 and 2, run on 16 CPUs (4 x Quad-Core Intel Xeon E7320 processor 2.13GHz) with a total of 100 GB RAM.

Things get a bit trickier when k = 2. If the two SNPs are adequately spaced, then the corresponding color sequence contains exactly 4 color mismatches, adjacent two by two, and only the tools that correctly detect 4 mismatches can find them. If the two SNPs are adjacent, then the corresponding color sequence contains at most 3 consecutive mismatches (the first and third colors surely differ, while the second may or may not be a mismatch). Note that the case of two SNPs separated by a matching position also leads to at most 4 mismatches. The five tools have rather diverse detection rates. In base space PerM, SHRiMP, and MAQ have full sensitivity, while Bowtie does not correctly align 16% of the input reads. In color space, SHRiMP is the only tool with full sensitivity; PerM loses 1%, which should not be the case since PerM is run with an F4 seed, which should ensure full sensitivity to 4 mutations. We could not find a good explanation for this behavior. MAQ and Bowtie exhibit a very low sensitivity, which is not surprising considering the low sensitivity that they have in base space with k = 3, which implies that pairs of isolated SNPs are almost never detected. SOCS does not do a good job for k = 2 either: again, 3 color mismatches only allow finding 18% of the reads that could be aligned through mutations of 2 base positions (presumably the ones in which the two SNPs are adjacent). SOCS nevertheless shows an improvement over MAQ and Bowtie, albeit a quite limited one considering that SOCS has been developed for color space.
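The case analysis for two SNPs can be checked by brute force. The sketch below enumerates every pair of substitutions at two given interior positions of a sequence and collects the resulting numbers of color mismatches, using the standard SOLiD two-base encoding (XOR of 2-bit base codes). The function names and the test sequence are illustrative assumptions.

```python
import itertools

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def to_colors(seq):
    """Translate a nucleotide sequence into its list of colors."""
    return [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]


def color_mismatch_counts(seq, i, j):
    """Set of color-mismatch counts over all substitutions at positions i and j."""
    original = to_colors(seq)
    counts = set()
    for bi, bj in itertools.product("ACGT", repeat=2):
        if bi == seq[i] or bj == seq[j]:
            continue                      # both positions must really change
        mutated = list(seq)
        mutated[i], mutated[j] = bi, bj
        colors = to_colors("".join(mutated))
        counts.add(sum(x != y for x, y in zip(original, colors)))
    return counts


seq = "ACGTACGTACGT"
adjacent = color_mismatch_counts(seq, 4, 5)     # {2, 3}
one_apart = color_mismatch_counts(seq, 4, 6)    # {4}
well_spaced = color_mismatch_counts(seq, 2, 8)  # {4}
```

For interior positions, adjacent SNPs give 2 or 3 color mismatches depending on whether the shared color happens to be preserved, while SNPs separated by at least one matching base always give 4. A tool limited to 3 color mismatches can therefore recover only reads whose two SNPs are adjacent.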

If we consider instead k = 3, in base space only SHRiMP reaches full sensitivity, while PerM loses 19% of the reads. Again, it is unclear why this happens in PerM, considering that we are using the F4 seed. MAQ and Bowtie have very low sensitivity, presumably because they both follow the policy of considering only sequences with up to 2 mutations in the first 28 positions of the read. As for color space, if the 3 SNPs are well spaced then the corresponding color sequence contains exactly 6 color mismatches, and only the tools that correctly detect 6 mismatches can detect them (and, according to the obtained results, this is never the case). If the 3 SNPs are instead adjacent, then the corresponding color sequence contains at most 4 consecutive mismatches (the first and fourth surely differ, while the second and third positions may or may not be mismatches). If a tool is limited to 3 mismatches, it surely cannot detect the cases that result in a string of 4 adjacent color mismatches, while how the other cases are handled may depend on the specific tool. Again SHRiMP has the best sensitivity, which could be a consequence of its full sensitivity in base space for k = 3, although only 69% of the reads are correctly aligned if the alignment requires 3 SNPs. PerM detects a low 17% of the sequences, while the other three tools detect only a few percent.

Dataset3 has been used only for PerM, which achieves a sensitivity of 1 for k = 0, 1, 2 and a sensitivity of 0.99 for the case of 3 mismatches. Again, it is unclear why there is such a discrepancy between the sensitivity for k = 3 in base space for Dataset1 and in color space for Dataset3. A reason could be that PerM is “color-aware”, and further investigation is needed to understand how single isolated mismatches are treated.

6 Conclusions

Next Generation Sequencing (NGS) is becoming increasingly used in genomics. In the last 5 years we have seen an enormous increase in NGS data throughput and in read length, associated with a dramatic drop in cost per gigabase. NGS is opening unprecedented opportunities for generating new knowledge and translating it into applications that enhance human health. We are now in the early phase of a new big jump in biological knowledge acquisition, which is likely to be bigger than the one produced by the parallelization of transcription analysis. This could lead to an expansion of expression studies of enzymes involved in novel biosynthetic pathways, which are a priority in drug discovery from natural products.

In NGS a big effort has been made to develop computational tools to discover the maximum number of correct alignments of reads over a reference genome. New tools have been optimized for this task, and in this review we have described their algorithmic structure and given a basic view of their strengths and weaknesses when applied to NGS data encoded in nucleotide or color space. The results have been obtained and presented from a user perspective. They show a large discrepancy in the quality of the results between color and nucleotide mapping in the presence of mutations; to explain why this is so, this review has devoted special attention to the relationships between mutations in color space and mutations in base space, a topic for which little coverage exists in the literature.

7 Acknowledgments

This study was funded under the auspices of EUCAAD 200755. The project EUCAAD has received research funding from the European Community’s Seventh Framework Programme. This work was also supported by grants from the Italian Association for Cancer Research, the Italian Ministero dell’Università e della Ricerca, and the University of Torino.

References

[1] Califano A and Rigoutsos I. Flash: a fast look-up algorithm for string homology. In Proceedings CVPR, IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 353–359, 1993.

[2] Langmead B, Trapnell C, Pop M, and Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol., 10, 2009.

[3] Ondov BD, Varadarajan A, Passalacqua KD, and Bergman NH. Efficient mapping of Applied Biosystems SOLiD sequence data to a reference genome for functional genomic applications. Bioinformatics, 24:2776–2777, 2008.

[4] ABI-SOLiD (Applied Biosystems). http://www.appliedbiosystems.com.

[5] Horner DS, Pavesi G, Castrignanò T, De Meo PD, Liuni S, Sammeth M, Picardi E, and Pesole G. Bioinformatics approaches for genomics and post genomics applications of next-generation sequencing. Brief. Bioinform., 11:181–197, 2010.

[6] Pettersson E, Lundeberg J, and Ahmadian A. Generations of sequencing technologies. Genomics, 93:105–111, 2009.

[7] Mardis ER. The impact of next-generation sequencing technology on genetics. Trends Genet., 24:133–141, 2008.

[8] Bentley DR et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature, 456:53–59, 2008.

[9] Petrosino JF, Highlander S, Luna RA, Gibbs RA, and Versalovic J. Metagenomic pyrosequencing and microbial identification. Clin. Chem., 55:856–866, 2009.

[10] Sanger F, Nicklen S, and Coulson AR. DNA sequencing with chain-terminating inhibitors. Proc. Natl. Acad. Sci., 74:5463–5467, 1977.

[11] Myers G. A fast bit-vector algorithm for approximate string matching based on dynamic programming. J. ACM, 46:395–415, 1999.

[12] Li H, Ruan J, and Durbin R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res., 18:1851–1858, 2008.

[13] Li H and Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25:1754–1760, 2009.

[14] Genome Analyzer (Illumina/Solexa). http://www.illumina.com.

[15] Park PJ. ChIP-seq: advantages and challenges of a maturing technology. Nature Reviews Genetics, 10:669–680, 2009.

[16] Xu J, Brown D, Li M, and Ma B. Optimizing multiple spaced seeds for homology search. Journal of Computational Biology, 13:1355–1368, 2006.

[17] McPherson JD. Next-generation gap. Nat. Methods, 6, 2009.

[18] Bonfield JK and Staden R. The application of numerical estimates of base calling accuracy to DNA sequencing projects. Nucleic Acids Research, 23:1406–1410, 1995.

[19] Li JW and Vederas JC. Drug discovery and natural products: end of an era or an endless frontier? Science, 325:161–165, 2009.

[20] Andries K, Verhasselt P, Guillemont J, Göhlmann HWH, Neefs JM, Winkler H, Van Gestel J, Timmerman P, Zhu M, Lee E, Williams P, De Chaffoy D, Huitric E, Hoffner S, Cambau E, Truffot-Pernot C, Lounis N, and Jarlier V. A diarylquinoline drug active on the ATP synthase of Mycobacterium tuberculosis. Science, 307:223–227, 2005.

[21] Metzker ML. Sequencing technologies - the next generation. Nature Reviews Genetics, 11:31–46, 2010.

[22] Rusk N and Kiermer V. Primer: sequencing: the next generation. Nat. Methods, 5, 2008.

[23] Morozova O and Marra MA. Applications of next-generation sequencing technologies in functional genomics. Genomics, 92:255–264, 2008.

[24] Ferragina P and Manzini G. An experimental study of an opportunistic index. In Proceedings of the Twelfth Annual ACM-SIAM Symposium on Discrete Algorithms, Washington, DC: Society for Industrial and Applied Mathematics, pages 269–278, 2001.

[25] Flicek P and Birney E. Sense from sequence reads: methods for alignment and assembly. Nat. Methods, 6:S6–S12, 2009.


[26] Li R, Yu C, Li Y, Lam TW, Yiu SM, Kristiansen K, and Wang J. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics, 25:1966–1967, 2009.

[27] Strausberg RL, Levy S, and Rogers YH. Emerging DNA sequencing technologies for human genomic medicine. Drug Discov. Today, 13:569–577, 2008.

[28] 454 (Roche). http://www.454.com.

[29] Marguerat S, Wilhelm BT, and Bähler J. Next-generation sequencing: applications beyond genomes. Biochem. Soc. Trans., 36:1091–1096, 2008.

[30] Rumble SM, Lacroute P, Dalca AV, Fiume M, Sidow A, and Brudno M. SHRiMP: accurate mapping of short color-space reads. PLoS Comput. Biol., 5, 2009.

[31] Lee SS and Mudaliar A. Racing forward: the genomics and personalized medicine act. Science, 323:342, 2009.

[32] Smith TF and Waterman MS. Identification of common molecular subsequences. J. Mol. Biol., 147:195–197, 1981.

[33] Manber U and Myers G. Suffix arrays: a new method for on-line string searches. SIAM J. Comput., 22(5):935–948, 1993.

[34] Chen Y, Souaiaia T, and Chen T. PerM: efficient mapping of short sequencing reads with periodic full sensitive spaced seeds. Bioinformatics, 25:2514–2521, 2009.

[35] Sun Y and Buhler J. Designing seeds for similarity search in genomic DNA. Journal of Computer and System Sciences, pages 67–75, 2003.

[36] Wang Z, Gerstein M, and Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics, 10:57–63, 2009.
