genome sequence assembly: algorithms and issues · 2005. 10. 16. · ach cell of a living organism...

89
Model PT-6 FSS Fertilizer and Lime Spreader SERIAL # __________________ WORK ORDER # ___________

Upload: others

Post on 22-Jan-2021

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Genome Sequence Assembly: Algorithms and Issues · 2005. 10. 16. · ach cell of a living organism contains chro-mosomes composed of a sequence of DNA base pairs. This sequence, the

0018-9162/02/$17.00 © 2002 IEEE July 2002 47

C O V E R F E A T U R E

Genome Sequence Assembly: Algorithms and Issues

E ach cell of a living organism contains chro-mosomes composed of a sequence of DNAbase pairs. This sequence, the genome, rep-resents a set of instructions that controlsthe replication and function of each organ-

ism. The automated DNA sequencer gave birth togenomics, the analytic and comparative study ofgenomes, by allowing scientists to decode entiregenomes.

Although genomes vary in size from millions ofnucleotides in bacteria to billions of nucleotides inhumans and most animals and plants, the chemicalreactions researchers use to decode the DNA basepairs are accurate for only about 600 to 700nucleotides at a time.

The process of sequencing begins by physicallybreaking the DNA into millions of random frag-ments, which are then “read” by a DNA sequenc-ing machine. Next, a computer program called anassembler pieces together the many overlappingreads and reconstructs the original sequence. Thisgeneral technique, called shotgun sequencing, wasintroduced by Fred Sanger in 1982.1 The techniquetook a quantum leap forward in 1995, when a teamled by Craig Venter and Robert Fleischmann of TheInstitute for Genomic Research (TIGR) andHamilton Smith of Johns Hopkins University usedit on a large scale to sequence the 1.83 million basepair (Mbp) genome of the bacterium Haemophilusinfluenzae.2

Much like a large jigsaw puzzle, the DNA readsthat shotgun sequencing produces must be assem-bled into a complete picture of the genome. This

seemingly simple process is not without technicalchallenges. For one thing, the data contains errors—some from limitations in sequencing technology andothers from human mistakes during laboratorywork. Even in the absence of errors, DNAsequences have features that complicate the assem-bly process—most notably, repetitive sections calledrepeats. The human genome, for example, includessome repeats that occur in more than 100,000copies each. Similar to pieces of sky in jigsaw puz-zles, reads belonging to repeats are difficult to posi-tion correctly. Further complicating assembly, someDNA fragments from each genome are impossibleto sequence, resulting in gaps in coverage.

The resolution of these problems entails an addi-tional finishing phase involving a large amount ofhuman intervention. Finishing is very costly, as itrequires specialized laboratory techniques andhighly trained personnel. Assembly programs candramatically reduce this cost by taking into accountadditional information obtained during finishing,yet most current assemblers disregard this infor-mation and generate the best possible assemblysolely from the initial shotgun reads. Advances inassembly algorithms must include features that helpfinishing efforts.

WHOLE GENOME SHOTGUN SEQUENCINGWhile shotgun sequencing remains the basic

strategy for all genome sequencing projects, itsapplicability to large genomes has been controver-sial. Until recently it was applied only at the end ofa hierarchical process. This process first breaks the

Algorithms that can assemble millions of small DNA fragments into genesequences underlie the current revolution in biotechnology, helpingresearchers build the growing database of complete genomes.

Mihai PopSteven L.SalzbergMartinShumwayThe Institute forGenomic Research

Page 2: Genome Sequence Assembly: Algorithms and Issues · 2005. 10. 16. · ach cell of a living organism contains chro-mosomes composed of a sequence of DNA base pairs. This sequence, the

48 Computer

DNA of large genomes into a set of large piecescalled bacterial artificial chromosomes. The BACsare then mapped to the genome to obtain a tilingpath, after which the shotgun method is used tosequence each BAC in the tiling path separately.

In contrast, whole-genome shotgun sequencing(WGSS) assembles the genome from the initial frag-ments without using a BAC map. This requiresenormous computational resources. The sheer sizeof the data argued against WGSS for large projects.So did the presence of repeats. If positioning thelong repeat stretches correctly is difficult, automat-ing the process is even harder.

In 2000, however, Eugene Myers and colleaguesput most doubts to rest when they published awhole-genome assembly of the fruit fly Drosophilamelanogaster.3 Using a new assembler built specif-ically for very large genomes, the Myers team suc-cessfully sequenced and assembled the 135-Mbpgenome. The project was 25 times larger than anyprevious WGSS project, and the team went on toapply the WGSS strategy to sequence and assem-ble the draft human genome in 2001.

Clones and coverageA WGSS project begins in the laboratory, where

ultrasound or a high-pressure air stream randomlyshatters the DNA into pieces that researchers theninsert into cloning vectors, or clones, as illustratedin Figure 1a. The clone in this case is a circular pieceof DNA called a plasmid. It has a known sequenceof base pairs and can accept a clone insert of foreignDNA. The bacterium Escherichia coli is then usedto multiply the plasmid, thus amplifying the cloneinsert.

In most projects, researchers sequence both endsof each clone insert, yielding a set of sequencingreads that defines the clone-pairing data for thatinsert. This process links each read from a cloneinsert to its clone mate from the opposite end of theinsert. The resulting clone-pairing data is extremelyvaluable not only in guiding the assembly processbut also in correctly ordering the contiguoussequences, or contigs, resulting from assembly.Figure 1b shows the process for three contigs.

The ultimate goal of sequencing is to determine allthe base pairs contained in the DNA. In practice,

however, we try to achieve the goal of having morethan 99 percent of the genome covered by readsafter the initial shotgun phase. To achieve this goalwe need to sequence clones until the reads (averag-ing 600 to 700 base pairs) provide an eightfold (8X)oversampling of the genome. For example, a 2-Mbpbacterial genome sequenced to 8X coverage requires16 Mbp, or approximately 27,000 reads.

Researchers choose the inserts from among sev-eral “libraries” of clone collections generated in thelaboratory. The insert size specifies the average dis-tance separating each pair of clone mates, and sizesvary from one library to the next. Typical projectscontain at least two insert libraries of sizes 2 to 3kbp and 8 to 10 kbp, respectively, and may includeothers, such as BAC libraries of 100 to 150 kbp.The sequenced portion of each insert averages1,200 bp out of 3,000 bp total, so the clone insertsof a 3-kbp library sequenced to 8X cover 2.5 timesas much distance as the sequences themselves.

These libraries provide a “clone coverage” ofmore than 20-fold, meaning that, on average, 20clones span each of the genome’s bases, thus offer-ing the theoretical guarantee that each base is con-tained in at least one of the clones. This guaranteeassumes uniformly random-sampled clones fromthe genome. In practice, this requirement is seldomperfectly satisfied. Cloning biases lead to a non-random clone distribution, causing areas of thegenome to remain unsequenced regardless of theamount of sequencing performed.

AssemblyA WGSS assembler’s task is to combine all the

reads into contigs based on sequence similaritybetween the individual reads. The basic principleis that two overlapping reads—that is, reads wherea suffix of one is a prefix of another—presumablyoriginate from the same region of the genome andcan be assembled together. This assumption isinvalid, however, for repetitive sequences, where itis impossible to distinguish reads from two or moredistinct places in the genome.

Figure 2 shows how an assembler can incorrectlycombine the reads from two copies (rpt1A, rpt1B) ofa repeat, producing a misassembled contig andthrowing out the unique region between the two

Clone insert

Sequencing reads

Clone

Links

(a) (b)

Contigs

Figure 1. Clone andscaffold. (a) Cloneinserts are se-quenced from bothends, yieldingmated sequencereads. (b) A scaffolduses linking infor-mation provided bythe clone-pairingdata to order andorient contiguoussequences, or con-tigs, in the genomeunder assembly.

Page 3: Genome Sequence Assembly: Algorithms and Issues · 2005. 10. 16. · ach cell of a living organism contains chro-mosomes composed of a sequence of DNA base pairs. This sequence, the

July 2002 49

repeat copies as a separate contig. Repeats representa major challenge to assembly software. An assem-bler’s utility depends in large part on detecting andcorrectly resolving repeat regions. Resolving misas-semblies in the finishing phases can be costly.

Information about clone mates, combined withknowledge about the distribution of clone sizes, mayhelp assembly programs to put some classes ofrepeats together correctly. If a repeat is shorter thanthe length of a clone insert, mate-pair informationis enough to separate the individual repeat copiesbecause each read within the repeat has an anchor-ing clone mate in the nearby nonrepetitive region.

FinishingIn practice, imperfect coverage, repeats, and

sequencing errors cause the assembler to producenot one but hundreds or even thousands of contigs.The task of closing the gaps between contigs andobtaining a complete molecule is called finishing.First, a program called a scaffolder uses clone-mateinformation to order and orient the contigs withrespect to each other into larger structures calledscaffolds. Within a scaffold, pairs of reads span-ning the gaps between contigs determine the orderand orientation of contigs, as Figure 1b shows.Note that the physical DNA molecule has an eas-ily determined direction, even though the textualrepresentation of DNA as a string of A, C, T, or Gcharacters appears to be directionless.

The gaps between contigs belonging to the samescaffold are called sequence gaps. Although theyrepresent genuine gaps in the sequence, researcherscan retrieve the original clone inserts spanning thegap and use a straightforward “walking” techniqueto fill in the sequence.

Determining the order and orientation of thescaffolds with respect to each other is more diffi-cult. The gaps between scaffolds are called physi-cal gaps because the physical DNA that would spanthem is either not present in the clone inserts orindeterminable due to misassemblies. Filling thesegaps involves a large amount of manual labor andcomplex laboratory techniques.

ASSEMBLY ALGORITHMSResearchers first approximated the shotgun

sequence assembly problem as one of finding theshortest common superstring of a set of sequences:Given a set of input strings {s1, s2, ...}, find the short-est string T such that every si is a substring of T.

While this problem has been shown to be NP-hard,there is an efficient approximation algorithm. Thisgreedy algorithm starts by computing all possibleoverlaps between the strings and assigning a score toeach potential overlap. The algorithm then mergesstrings in an iterative fashion by combining thosestrings whose overlap has the highest score. This pro-cedure continues until no more strings can be merged.

While it can be argued that the shortest super-string problem does not correctly model the assem-bly problem, the first successful assembly algo-rithms applied the greedy merging heuristic in theirdesign. For example, TIGR Assembler,4 Phrap,5 andCAP36 followed this paradigm.

Greedy algorithms are relatively easy to imple-ment, but they are inherently local in nature andignore long-range relationships between reads,which could be useful in detecting and resolvingrepeats. In addition, all current implementations ofthe greedy method require up to one gigabyte ofRAM for each megabase of assembled sequence,assuming the genome was sequenced at 8X cover-age. This limits their applicability on currentlyavailable hardware to organisms with genomes of32 Mbp or less. Such organisms include bacteriaand a few single-celled eukaryotes, but not plants,mammals, or other multicellular organisms.

These limitations spurred the development ofnew algorithms. Two approaches exploit tech-niques developed in the field of graph theory: onethat represents the sequence reads as graph nodesand another that represents them as edges.

Overlap-layout-consensusThe first approach, overlap-layout-consensus,7

constructs a graph in which nodes represent reads,and edges indicate that the corresponding readsoverlap. Each contig is represented as a simple

rpt1Brpt1A

I

I II

II

III

III

Figure 2. Repeatsequence. The toprepresents the cor-rect layout of threeDNA sequences. Thebottom shows arepeat collapsed ina misassembly.

Page 4: Genome Sequence Assembly: Algorithms and Issues · 2005. 10. 16. · ach cell of a living organism contains chro-mosomes composed of a sequence of DNA base pairs. This sequence, the

50 Computer

path—that is, a path through the graph thatcontains each node at most once.

An assembler following this paradigmmust first build the graph by computing allpossible alignments between the reads. A sec-ond stage cleans up the graph by removingtransitive edges and resolving ambiguities.The output of this stage comprises a set ofnonintersecting simple paths in this refinedgraph, each such path corresponding to acontig. A final step generates a consensussequence for each contig by constructing themultiple alignment of the reads that is con-sistent with the chosen path.

Full information about each read in theinput is only necessary for the overlap and the con-sensus stages. The graph-refinement stage storesonly a limited amount of information about eachoverlap, such as its coordinates and length. Thisallows a memory-efficient implementation.

Not surprisingly, recent WGSS assemblers usethis approach.4,8 The overlap-layout-consensustechnique has the additional value of encodingother relationships between reads, such as clone-mate information, which an assembler can use incorrectly assembling repetitive areas.

Eulerian pathThe second graph-theoretical approach to shot-

gun sequence assembly uses a sequencing-by-hybridization (SBH) technique.9 The idea is to createa virtual SBH problem by breaking the reads intooverlapping n-mers, where an n-mer is a substringof length n from the original sequence. Next, theassembler builds a directed deBruijn graph in whicheach edge corresponds to an n-mer from one of theoriginal sequence reads. The source and destinationnodes correspond respectively to the n − 1 prefixand n − 1 suffix of the corresponding n-mer. Forexample, an edge connecting the nodes ACTTA andCTTAG represents the 6-mer ACTTAG. Under thisformulation, the problem of reconstructing the orig-inal DNA molecule corresponds to finding a paththat uses all the edges—that is, an Eulerian path.

In theory, the Eulerian path approach is com-putationally far more efficient than the overlap-layout-consensus approach because the assemblercan find Eulerian paths in linear time while theproblems associated with the overlap-layout-con-sensus paradigm are NP-complete.7 Despite thisdramatic theoretical difference, the actual perfor-mance of existing algorithms indicates that over-lap-layout-consensus is just as fast as the SBH-based approach.

HANDLING REPEATSIf genomic data included no repeats, an assem-

bler could use any assembly algorithm to put allthe pieces together correctly, even in the presence ofsequencing errors. The repeats found in realgenomes can, however, prohibit correct automatedassembly, at least solely from information con-tained in the original reads.

For example, a large tandem repeat found in thebacterium Streptococcus pneumoniae consists of a24-bp unit that is repeated in identical and nearlyidentical copies, in tandem, for a stretch coveringapproximately 14,000 bp. Given that individualreads have an average length of 600 bp and that allreads obtained from this region are identical, noassembler can determine a unique tiling across thisrepeat. The resulting assembly is likely to contain areconstruction of a 600-bp section of the repeat inwhich all the reads have collapsed on top of eachother—similar to the situation in Figure 2. In thisparticular case, clone mates also fail to resolve theproblem because the largest clone insert for this pro-ject covered only 10 kbp.

This example highlights several issues thatassembly programs must address. First, they mustidentify repeats, preferably during the assemblyprocess, to avoid mistakes caused by overcollaps-ing repeat copies. Detecting such misassemblies ismuch more difficult after assembly is completed,and the misassemblies can lead to incorrect genomereconstructions.

Second, assemblers must attempt to correctlyassemble as many repeats as possible to reduce theamount of human labor involved in completing thegenome. For short repeats, this step can be as sim-ple as using anchored reads, meaning those havingmates in the unique areas surrounding the repeat.In more complex repeats, the assembler must beable to use additional information obtainedthrough laboratory experiments.

Detecting repeatsA simple solution to the repeat-detection prob-

lem identifies the pileup caused by a misassembly.Because the reads come from a random sampling ofthe genomic DNA, typically with 8X coverage ofthe genome, areas covered by a significantly largenumber of reads indicate an over-collapsed repeat.Most assemblers use variations on this simple idea.

Although the idea is useful, it assumes that thereads are sampled uniformly at random from thegenome. In reality, certain areas tend to be poorlyrepresented or absent from the sample—for exam-ple, if the insert is toxic to the laboratory organism

Detecting misassemblies is

more difficult afterassembly is

completed, and the misassemblies

can lead to incorrect genomereconstructions.

Page 5: Genome Sequence Assembly: Algorithms and Issues · 2005. 10. 16. · ach cell of a living organism contains chro-mosomes composed of a sequence of DNA base pairs. This sequence, the

July 2002 51

into which it was cloned—while other areas areoverrepresented. In addition, low-copy repeats,which appear only two to three times in a genome,may escape detection because they do not appear tobe statistically oversampled.

While statistical methods can provide a rough fil-ter, assembly programs must use other techniquesto accurately separate out the repeats. As an exam-ple, the recently developed assembly programEuler9 detects repeats by finding complex areas, ortangles, in the graph constructed during assembly.Researchers can use the information contained inthe tangle to guide experiments to resolve therepeat. Assemblers that simply mask out repeats—another common strategy—lose this informationand must obtain it by other means.

Because the cloning process generates reads inpairs from opposite ends of clone inserts, assemblerscan use information about clone mates to help detectareas that have been incorrectly assembled due torepeats. Such areas usually contain many instancesof clone mates that were assembled either too closeor too far from each other, or whose relative orien-tation is incorrect. This information must be usedwith care, however, since clone length estimates areusually imprecise, especially for larger clones.

The difficult problem here is finding outliers in adata set whose distribution is unknown. The mostreliable information comes from the relative orien-tation of the sequencing reads, which can nearlyalways be tracked correctly. When repeats arewidely separated in the genome, clone-pairing datacan resolve them effectively for reads whose matesare anchored in the neighboring nonrepetitive areas.

Although some repeats are identical, it is morecommon to find some differences in them. Thesedifferences sometimes provide enough informationfor the assembler to distinguish the copies from oneanother. In the absence of sequencing errors, a sin-gle nucleotide difference between two copies of arepeat is enough to distinguish them.

Researchers have developed several techniquesto correct sequencing errors during repeat resolu-tion—for example, see the work of John Kececiogluand Jun Yu.10 All current techniques are based onfinding statistically significant clusters of reads,where the clusters are based on shared differencesin the reads. This approach assumes that sequenc-ing errors are independent and, therefore, that anidentical position difference in multiple reads islikely to be a real difference typical of that copy ofthe repeat. For example, if four reads contain an Ain position 200 and four other reads contain a Gin that position, then the assembler can infer with

high confidence that the first four reads comefrom one copy of the repeat, while the sec-ond four represent a different copy.

One drawback of this approach is the needfor relatively deep coverage to detect true dif-ferences between repeat copies. If a repeatregion is difficult to clone—a common phe-nomenon—the coverage of that repeat willbe low. Moreover, true polymorphisms, suchas those between different copies of nearlyidentical chromosomes—for example, eachhuman chromosome occurs in two copies—or fromnonclonal source DNA further complicate thisproblem.

Unresolved repeatsEven using all these information sources, an

assembler cannot resolve every repeat. Humansmust intervene to finish some complex areas. Thebasic technique for this task is to separate out thereads coming from distinct repeat copies. Indirected sequencing experiments, researchersamplify stretches of DNA anchored in unique areasaround the repeat. If we consider each copy of therepeat in isolation from the others, an assemblyprogram can put the genome together by holdingthese repeat contigs together. Assembling a mixtureof contigs and reads, while guaranteeing that thecontigs will not break up in the process, is knownas a jumpstart assembly. Only TIGR Assembler cur-rently supports this capability.

SCAFFOLDING SOFTWAREThe scaffolding process groups contigs together

into subsets with a known order and orientation.Researchers generally infer relationships betweencontigs from clone-mate information. Most recentassemblers include a scaffolding step.3,8,11 Moreover,the Human Genome Project BAC collections wereordered and oriented through scaffolding.12,13

To reformulate the scaffolding problem in graph-theoretic terms, we can construct a graph in whichthe nodes correspond to contigs, and a directededge links two nodes when mate pairs bridge thegap between them. In this case, each pair of readsimplies a particular orientation and spacing of thecontigs to form a correct pair (see Figure 1b).

The scaffolding program must now solve threeproblems:

• Find all connected components in the definedgraph.

• Find a consistent orientation for all nodes inthe graph, where nodes are connected by two

The difficult problemin using clone matesis finding outliers in

a data set whosedistribution is

unknown.

Page 6: Genome Sequence Assembly: Algorithms and Issues · 2005. 10. 16. · ach cell of a living organism contains chro-mosomes composed of a sequence of DNA base pairs. This sequence, the

52 Computer

edge types: those requiring the two nodes tohave the same orientation and those forcingthe two nodes to have different orientations.We call the latter reversal edges. A consistentorientation of all the nodes is possible only ifall undirected cycles contain an even numberof reversal edges. Because errors in the pair-ing data or misassemblies can invalidate thiscondition, we must solve an optimizationproblem: Find the smallest number of edgesthat must be removed so that no cycle has anodd number of reversal edges. This opti-

mization problem is NP-complete.• Given the length estimates of the edges, embed

the graph on a line or—for some bacterial andarchaeal genomes—on a circle, such that theleast number of constraints is invalidated. Thisproblem is a special case of the optimal lineararrangement problem, which is also NP-com-plete.

While the last two problems are difficult from atheoretical standpoint, simple heuristics can easilyhandle the instances encountered in practice.Moreover, in practice we can relax the optimalitycriteria. During the finishing phase, for example, alinear embedding of the contigs is not necessary. Infact, ambiguities in the graph can highlight possi-ble misassemblies, and finishing teams can use thisinformation in designing experiments to confirm aparticular embedding of the graph.

The complexity of scaffolding stems specificallyfrom the presence of errors in the data. Again, sim-ple heuristics can reduce the effect of such errors.For example, we can reduce errors caused by datatracking problems or misassemblies by requiring atleast two sources of linking information betweencontigs or by ignoring links anchored in repeat areas.

For any but the smallest genomes, it is unlikelythat a single scaffold will hold all contigs. Thus wewill need additional information to order and ori-ent the scaffolds themselves. Two common sourcesof such information are physical maps and com-parisons to related organisms.

Physical mapping encompasses a variety of lab-oratory techniques for characterizing a set of mark-ers along a DNA strand. Markers include knowngenes and short, unique sequences of a few hun-dred nucleotides, called tags, that researchers havefluorescently tagged and mapped to an approxi-mate point on a chromosome. Determining the lay-out of these markers before sequencing provides anindependent information source for scaffoldingsoftware. Using contigs created by an assembler,

researchers can simulate the mapping experimentcomputationally by searching for the tag locationsin the contig sequence. The comparison betweenthe electronic map and the physical map also pro-vides ordering information that the scaffolding pro-gram can use.

The sequence of a closely related organism isanother source of scaffolding information. Forexample, by aligning the scaffolds from a prelimi-nary assembly of the mouse genome to the humangenome, we can obtain the likely order of themouse contigs. Of course, this information will beincorrect where major genome rearrangementshave occurred in the evolutionary divergence of thetwo species. This technique therefore works bestwith a very closely related genome that has beensequenced to completion.

The sources of linking information used to con-struct scaffolds vary in quality. In particular, theerror in determining the length of inserts, and thusthe distance between clone mates, increases withthe insert size. Physical map data is inherently errorprone. Finally, large-scale genome rearrangementscan affect the homology data. TIGR has developedBambus, a scaffolder that factors our confidence inthe linking information into hierarchically con-structed scaffolds. The algorithm first builds a setof scaffolds based on the highest confidence links,then incorporates the lower confidence informa-tion to combine each scaffold into a larger struc-ture. This hierarchical method reduces the effect ofincorrect linking data, while still using all the infor-mation sources.

ASSESSING ASSEMBLY QUALITYCorrecting misassemblies is expensive, especially

if they go undetected until the late stages of asequencing project. Assemblers highlight prob-lematic areas by outputting the confidence level ineach base of the consensus. Because this simplequality-control method is an inherently local mea-sure, it fails to capture larger scale phenomena,such as whole DNA sections that are incorrectlyspliced together. The assembly pipeline must there-fore contain a validation module that uses addi-tional information to determine the contig quality.

Finding errors in assemblies is easy when thecomplete sequence is already known, and we canuse known benchmark data sets to fine-tune assem-bly software. These data sets, either artificially gen-erated or representing real sequencing reads fromcompleted projects, provide both the correct con-sensus sequence and the exact location of all readsin the true DNA sequence. Such information lets

Ambiguities in a graph can

highlight possiblemisassemblies that finishing

teams caninvestigate.

Page 7: Genome Sequence Assembly: Algorithms and Issues · 2005. 10. 16. · ach cell of a living organism contains chro-mosomes composed of a sequence of DNA base pairs. This sequence, the

July 2002 53

us detect both local errors in the consensus basecalls and large-scale rearrangements, such as rever-sals and insertions, in the genome assembly.

We can apply some of these ideas to the challengein real practice: finding assembly errors when thetrue layout is unknown. For example, physicalmaps provide markers that we can use to validatelarge contigs. Similarly, we can use the sequence ofa closely related organism to confirm areas that wedo not expect to have significantly diverged. In theabsence of any other types of information, clonemates have been used to detect assembly errors.14

Areas of the genome that violate the orientationand distance constraints imposed by the clonemates indicate potential misassemblies.

Most reported measures of assembly quality areaggregate measures, such as the number and sizesof contigs. They assume that an assembly consist-ing of a few large contigs is better than one com-posed of many small contigs. This assumption ispartly true, in that the number of contigs indicatesthe number of gaps, which in turn correlates withthe amount of work needed to finish the genome.Aggregate size measures do not, however, accountfor the possibility of misassemblies, and they aretherefore only marginally useful. If anything, anassembler can generate large contig sizes at theexpense of misassemblies.

To demonstrate these concepts, we performed aseries of tests on the genome of Wolbachia, anendosymbiotic bacterium found in the Drosophilafruit fly and other insects. We recently completedthis genome at TIGR, so we had a “true” DNAsequence to which we could compare assemblyresults. We assembled this genome from the origi-nal shotgun reads using Phrap, TIGR Assembler,and Celera Assembler. We ran all assemblers withtheir default settings. We verified the assemblies byaligning the resulting contigs to the finishedsequence. Table 1 summarizes the results.

According to the number and length of contigs,Phrap appears to produce the best output, followedclosely by TIGR Assembler. The Phrap assemblycontains about one-fourth as many contigs asCelera Assembler’s, and its contigs are about fourtimes larger on average. In addition, the total sizeof these contigs (1.26 Mbp) matches the actual sizeof the Wolbachia genome (1.26 Mbp).

In contrast, if we look at the proportion of thesequence covered by correct assemblies, the CeleraAssembler’s output spans more than 99 percent of allbases, while the TIGR Assembler contigs cover justover 93 percent, and Phrap covers barely 36 percent.

These results lead to a couple of conclusions.

First, Phrap and TIGR Assembler appear to havemisassembled some repeats, which explains the lackof coverage. At the same time, the small length ofthe Celera Assembler’s contigs, combined with theirlarge total size, lead us to believe that it failed tocombine many contigs that should have beenassembled together. Closer examination—and sim-ilar experience with many other genomes—indi-cates that this usually results from poor quality dataat the ends of sequences.

To get Celera Assembler to combine more con-tigs, we performed an additional step of moreaggressively trimming poor quality data from theends of the input sequences. The fourth row inTable 1 indicates that this technique closed a num-ber of gaps, yielding larger contigs overall. At thesame time, the coverage of the genome decreased,indicating a potential drawback to the technique.

These observations correlate with our under-standing of the assembly algorithms used by thethree programs. TIGR Assembler and Phrap aremore tolerant of incorrect data at the sequence ends,which allows them to create bigger contigs. At thesame time, their handling of clone-mate informa-tion is less sophisticated than Celera Assembler’s.In particular, TIGR Assembler uses a greedyapproach that lets it walk through a repeat, occa-sionally violating clone-link constraints.

The Phrap program simply does not take theseconstraints into consideration, leading to the over-collapse of repeat regions. Closer analysis verifiedthe hypothesis that all the misassemblies presentedin Table 1 correlated with repeats in the Wolbachiagenome.

T he ultimate goal of genome sequencing is thecomplete DNA sequence of an organism. Agood assembler can aid the human effort

involved in the finishing phase. An assembler

Table 1. Comparison of shotgun sequence data from the Wolbachia genomeproject.

Number Average Percentof contig Total genome Number of

Assembler contigs length size covered misassemblies

Phrap 56 22.4 kbp 1.26 Mbp 36.0 14 TIGR Assembler 76 16.8 kbp 1.28 Mbp 93.1 2 Celera Assembler 220 6.3 kbp 1.39 Mbp 99.1 1 Celera Assembler 101 12.5 kbp 1.26 Mbp 98.4 0 trimmed data

Page 8: Genome Sequence Assembly: Algorithms and Issues · 2005. 10. 16. · ach cell of a living organism contains chro-mosomes composed of a sequence of DNA base pairs. This sequence, the

54 Computer

designed for finishing should use multiple sourcesof information, such as data from finishing exper-iments in the laboratory; it should also let thehuman experts put restrictions on the assembly,such as regions that need to be held together orrepeats that should be kept separate. Better qual-ity-control tools are essential, and defining qualitymeasures that make it possible to evaluate assem-bly algorithms is a first step toward their improve-ment. This issue is particularly critical for in-complete genomes, such as the various and con-stantly changing versions of the draft humangenome sequence. Assemblers that can report“weak” areas in the assembly and highlight poten-tial misassembly sites are essential not only for thesubsequent analysis of assembly data but also forguiding the efforts of finishing experts. Moreover,well-defined objective quality measures will pro-vide an additional level of validation even in thecase of completely finished genomes. ■

AcknowledgmentsThe authors are supported in part by grants from

the National Science Foundation and the NationalInstitutes of Health.

References1. F. Sanger et al., “Nucleotide Sequence of Bacterio-

phage Lambda DNA,” J. Molecular Biology, vol.162, no. 4, 1982, pp. 729-773.

2. R.D. Fleischmann et al., “Whole-Genome RandomSequencing and Assembly of Haemophilus InfluenzaeRd,” Science, vol. 269, no. 5223, 1995, pp. 296-512.

3. E.W. Myers et al., “A Whole-Genome Assembly ofDrosophila,” Science, vol. 287, 2000, pp. 2196-2204.

4. G.G. Sutton et al., “TIGR Assembler: A New Toolfor Assembling Large Shotgun Sequencing Projects,”Genome Science and Technology, 1995, vol. 1, pp.9-19.

5. P. Green, “Phrap Documentation: Algorithms,”Phred/Phrap/Consed System Home Page; http://www.phrap.org (current June 2002).

6. X. Huang and A. Madan, “CAP3: A DNA SequenceAssembly Program,” Genome Research, vol. 9, no. 9,1999, pp. 868-877.

7. J.D. Kececioglu and E.W. Myers, “CombinatorialAlgorithms for DNA Sequence Assembly,” Algorith-mica, vol. 13, 1995, pp. 7-51.

8. S. Batzoglou et al., “Arachne: A Whole-GenomeShotgun Assembler,” Genome Research, vol. 12, no.1, 2002, pp. 177-189.

9. P.A. Pevzner, H. Tang, and M.S. Waterman, “An

Eulerian Path Approach to DNA Fragment Assem-bly,” Proc. Nat’l Academy of Science USA, vol. 98,no. 17, 2001, pp. 9748-9753.

10. J. Kececioglu and J. Yu, “Separating Repeats in DNASequence Assembly,” Proc. 5th Ann. Int’l Conf.Computational Biology (RECOMB), ACM Press,New York, 2001, pp. 176-183.

11. P.A. Pevzner and H. Tang, “Fragment Assembly withDouble-Barreled Data,” Bioinformatics, vol. 17,suppl. 1, 2001, pp. S225-S233.

12. W.J. Kent and D. Haussler, “Assembly of the Work-ing Draft of the Human Genome with GigAssem-bler,” Genome Research, vol. 11, 2001, pp. 1541-1548.

13. D.H. Huson, K. Reinert, and E. Myers, “The GreedyPath-Merging Algorithm for Sequence Assembly,”Proc. 5th Ann. Int’l Conf. Computational Biology(RECOMB), ACM Press, New York, 2001, pp. 157-163.

14. D.H. Huson et al., “Comparing Assemblies UsingFragments and Mate Pairs,” Proc. Workshop Algo-rithms in Bioinformatics, BRICS, Aarhus, Denmark,2001, pp. 294-306.

Mihai Pop is a bioinformatics scientist at the Insti-tute for Genomics Research in Rockville, Mary-land. His research interests include DNA se-quencing and closure techniques and sequenceassembly algorithms and applications. Pop receiveda PhD in computer science from the Johns Hop-kins University. Contact him at [email protected].

Steven L. Salzberg is the senior director of Bioin-formatics at the Institute for Genomic Research.He is also a research professor in both the com-puter science and biology departments at JohnsHopkins University. His research interests includegene finding, genome sequence alignment, genomearchaeology, and evolution. He developed one ofthe first computational gene-finding systems in themid-1990s. Salzberg received a PhD in computerscience from Harvard University. He is a memberof the AAAS and the ACM. Contact him [email protected].

Martin Shumway is a software manager at theInstitute for Genomic Research, with interests inmeasurement. He received an MSc in computer sci-ence from Colorado State University and is a mem-ber of the IEEE. Contact him at [email protected].