The Principles of Shotgun Sequencing and Automated Fragment Assembly

Special excerpt for lecture

© Martti T. Tammi
[email protected]
Center for Genomics and Bioinformatics, Karolinska Institutet, Stockholm, Sweden

April 13, 2003


Contents

1 Introduction
  1.1 The Human Genome Project
    1.1.1 The First Shotgun Projects
  1.2 The Principle of Shotgun Sequencing
    1.2.1 Strategies for Sequencing
    1.2.2 The Hierarchical Sequencing (HS) Strategy or Clone by Clone Sequencing
    1.2.3 The Whole Genome Shotgun (WGS) Strategy
    1.2.4 Combining HS and WGS
    1.2.5 Shotgun Sequencing Methods
  1.3 Development of Shotgun Fragment Assembly Tools
2 Milestones in Genome Sequencing
    2.0.1 First Cellular Organisms
    2.0.2 First Multicellular Organisms
3 Shotgun Fragment Assembly Algorithms
  3.1 Types of Problems in Sequence Assembly
  3.2 Complications in Fragment Assembly
  3.3 Base Calling
  3.4 The Data Quality
  3.5 Error Probabilities
  3.6 Genome Coverage
    3.6.1 Exercises
  3.7 Distance Constraints
  3.8 Fragment Assembly Problem Models
    3.8.1 The Shortest Common Superstring Model
    3.8.2 The Sequence Reconstruction Model
    3.8.3 Graphs
  3.9 Algorithms
    3.9.1 Contamination
    3.9.2 Overlap Detection
    3.9.3 Contig Layout Determination
    3.9.4 Creation of Multiple Alignments
    3.9.5 Computation of the Consensus Sequence
4 Shotgun Fragment Assembly Programs
  4.1 Phrap
    4.1.1 The Phrap Method
    4.1.2 Greedy Algorithm for Layout Construction
    4.1.3 Construction of Contig Sequences

1 Introduction

The secret of life is complementarity. Deoxyribonucleic acid, DNA, is a double helix consisting of alternating phosphate and sugar groups. The two chains are held together by hydrogen bonds between pairs of organic bases, adenine (A) with thymine (T) and guanine (G) with cytosine (C) (Watson & Crick, 1953). In 1962 James Watson, Francis Crick, and Maurice Wilkins jointly received the Nobel Prize in Physiology or Medicine for their determination of the structure of DNA. In 1959, Seymour Benzer presented the idea that DNA and RNA molecules can be seen as linear words over a 4-letter alphabet (Benzer, 1959). It is the order of these letters (bases) that codes for the important genetic information of an organism.

DNA can be thought of as the language of life: the four letters are arranged into words, each word consisting of three letters, codons, that code for amino acids. The codons, or words, are arranged in sentences that code for specific proteins. The language of DNA dictates which proteins are made, as well as when and how much. In the early days the determination of the nucleotide sequence was a strenuous task, and computer programs were implemented to aid in the sequencing process. These programs, and subsequent attempts to decipher the language of DNA, were the seeds that led to the development of the field of bioinformatics. Today vast amounts of data are available for analysis in various databases around the world. The rapid development of bioinformatics was due to the fast development of computer hardware and the need of large genome projects to store, organize and analyze increasing amounts of data. Genome projects were in turn facilitated by the development of improved bioinformatics tools.

1.1 The Human Genome Project

The development of computational solutions to problems in genomics and sequence handling has been crucial to the decision to undertake and successfully carry out the ambitious effort to sequence the whole human genome, approximately three billion base pairs, as well as the sequencing of many other genomes. The Human Genome Project (HGP) began formally in October 1990, when the Department of Energy (DOE) and the National Institutes of Health (NIH) presented a joint five-year plan to the U.S. Congress. The document¹ says: "improvements in technology for almost every aspect of genomics research have taken place. As a result, more specific goals can now be set for the project." This international effort was originally estimated to require 15 years and $3 billion. The plan was revised in 1998, when J. Craig Venter, PE Corporation and The Institute for Genomic Research (TIGR) proposed to launch a joint venture that would sequence the entire human genome in three years (Marshall, 1999). In June of 2000, the publicly sponsored Human Genome Project leaders, the private company Celera Genomics, U.S. President Bill Clinton, and the Prime Minister of Great Britain, Tony Blair, announced the completion of a "working draft" sequence of the human genome. A special issue

¹ Coauthored by DOE and NIH and titled Understanding Our Genetic Inheritance, The U.S. Human Genome Project: The First Five Years (FY 1991-1995)


of Nature, Feb. 15, 2001, contains the initial analyses of the draft sequence of the International Human Genome Sequencing Consortium, while a special issue of Science, Feb. 16, 2001, reported the draft sequence produced by the private company, Celera Genomics. The draft sequence is to be improved to achieve a complete, high-quality DNA reference sequence by 2003, two years earlier than originally projected.

1.1.1 The First Shotgun Projects

In 1965, Robert William Holley and co-workers published the complete sequence of the alanine tRNA from yeast (Holley et al., 1965), a task that took seven years to complete. Since the sequence of only small fragments could be determined, a fragmentation stratagem had to be used. An enzyme obtained from snake venom was used to remove nucleotides one by one from a fragment, to produce a mixture of smaller fragments of all possible intermediate lengths. This mixture was subsequently separated by passing it through a chromatography column. By determining the identity of the terminal nucleotide of each fraction, and knowing its length, it was possible to establish the sequence of the smaller fragments, as well as the sequence of the original large fragment, using basic overlapping logic.

The first three shotgun sequencing projects were the 6,569 base pair human mitochondrial DNA (Anderson et al., 1981), the bacteriophage λ nucleotide sequence of 48,502 base pairs (Sanger et al., 1982) and the Epstein-Barr virus at 172,282 base pairs (Baer et al., 1984). In the shotgun sequencing method (Anderson, 1981; Messing et al., 1981; Deininger, 1983) the sequence is broken into small pieces, since the length of the longest contiguous DNA stretch that can be determined is limited by the length that can be read using an electrophoresis gel. In the bacteriophage λ project, the whole λ genome was sonicated to yield small fragments (Fuhrman et al., 1981) in order to create a random library of fragmented DNA. These fragments were cloned using the single-stranded bacteriophage M13 (Gronenborn & Messing, 1978; Messing et al., 1981) as a vector. This cloning procedure replaced the previous methods. All the sequences were assembled using Rodger Staden's assembly program (Staden, 1982a; Staden, 1982b).

After the introduction of gel electrophoresis based sequencing methods in 1977, by Maxam & Gilbert using chemical degradation of DNA (Maxam & Gilbert, 1977) and by Frederick Sanger and his colleagues using chain termination techniques (Sanger et al., 1977), it became possible to determine longer sequences. These techniques allow a routine determination of approximately 300 consecutive bases of a DNA strand in one reaction.

Until the advent of fluorescent sequencing, polyacrylamide slab gel electrophoresis was used together with radioactive 32P, 33P or 35S labels and autoradiographs. Fluorescent labelling, i.e. dyes that can be detected after excitation using a laser, allowed automatic detection of the base sequence (Smith et al., 1986). Regardless of refinements of the sequencing methods, the average fragment length has not grown significantly. In some conditions and using special care it is possible to


produce reads that are over 1000 bases long, but only average lengths of about 400 to 800 base pairs are feasible in routine production.

Attempts to increase the sequencing speed and throughput, as well as attempts to lower the cost, have led to the development of alternative strategies, such as sequencing by hybridization (Drmanac et al., 1989), Massively Parallel Signature Sequencing (MPSS) (Brenner et al., 2000a; Brenner et al., 2000b), Matrix-Assisted Laser Desorption Ionization Time-of-Flight mass spectrometry (MALDI-TOF-MS) (Karas & Hillenkamp, 1988), pyrosequencing (Ronaghi & Nyren, 1998) and the use of Scanning Tunneling Microscopy (Allison et al., 1990). However, none of these methods has as yet contributed to large-scale genomic sequencing, due to their short read lengths, but many of them are well suited for e.g. Single Nucleotide Polymorphism (SNP) genotyping. Thus, Sanger sequencing remains the method of choice for large-scale sequencing.

1.2 The Principle of Shotgun Sequencing

Although the computational problems involved in fragment assembly have been studied for almost 30 years, they continue to receive attention, mainly because large sequencing projects, such as the human genome project, require efficient algorithms for the precise assembly of long DNA sequences, and because attempts to sequence whole genomes using the whole genome shotgun approach have revealed limitations in the methods currently in use. Several genomes have been published as drafts, but none of the eukaryote genomes is completely finished yet. This is mainly due to ubiquitous repeated sequences that complicate fragment assembly and finishing.

The core of the problem is that it is not routinely possible to sequence contiguous fragments larger than approximately 500 to 800 bases. In some conditions, and with much work, it may be possible to obtain sequence over 1 kb in length, but this is not feasible in normal, automated production. Genomes are much longer than this: the length of the human genome is about 3 Gb, the X chromosome is about 150 Mb, and the human mitochondrial genome alone is 16.5 kb. To determine the base sequence of long DNA stretches we obviously need some stratagem. The shotgun sequencing method has proven to be the method of choice as the solution to this problem. DNA is first cut into small pieces; each fragment is sequenced individually and then puzzled together to recreate the original contiguous sequence, the contig. The advantages of the shotgun sequencing method are easy automation and scalability.
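The sampling logic described above can be illustrated with a toy simulation (a hypothetical sketch: the sequence, read length and read count below are made-up illustration parameters, not data from the text):

```python
import random

random.seed(1)

# Hypothetical target sequence standing in for a genome.
target = "".join(random.choice("ACGT") for _ in range(200))

read_len = 20   # toy stand-in for the ~500-800 base read limit
n_reads = 60

# "Shotgun" the target: sample reads at uniformly random positions.
starts = [random.randrange(len(target) - read_len + 1) for _ in range(n_reads)]
reads = [target[s:s + read_len] for s in starts]

# Fraction of target bases covered by at least one read.
covered = [False] * len(target)
for s in starts:
    for i in range(s, s + read_len):
        covered[i] = True

print(f"coverage c = {n_reads * read_len / len(target):.1f}x")
print(f"fraction covered = {sum(covered) / len(target):.2f}")
```

The assembler's task, in these terms, is to recover `target` from `reads` alone; the positions in `starts` are unknown in a real project.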

1.2.1 Strategies for Sequencing

In one approach, the whole genome of an organism is broken into small pieces, which are then sequenced, often from both ends. This approach is called Whole Genome Shotgun Sequencing (WGS). Another approach to sequencing a genome is more hierarchical. The genome is first broken into more manageable pieces of approximately 100 to 200 kb, which are inserted into bacterial artificial chromosomes,


BACs, to yield a library. The BACs are ordered by Sequence Tagged Site (STS) mapping or by fingerprinting to produce a minimally overlapping tiling path that covers the whole genome. The BACs are in turn sequenced using the shotgun approach. Thus, the shotgun method applies also to the Hierarchical Sequencing (HS) strategy. The most powerful method may be the combination of the two approaches (see below).

A physical map is a set of DNA fragments whose relative positions in the genome are known. The ultimate physical map is the complete sequence of a genome. Clones in a library can be ordered by fingerprinting, hybridization or end-sequencing. The presence of significant repeat content may cause a failure to provide a unique link between segments of a genome. Further, the stochastic method used to sample a library often results in certain segments being over-represented, under-represented or completely absent, due to experimental procedures. For this reason, it is essential to use more than one library and more than one cloning system.

1.2.2 The Hierarchical Sequencing (HS) Strategy or Clone by Clone Sequencing

The HS strategy involves an initial mapping step. A physical map is built using clones with large inserts, such as BACs. The minimum tiling path of clones is selected to cover the whole genome. Each clone is individually sequenced using a shotgun strategy. A mixture of clone types is usually used in sequencing large genomes; in addition, the different chromosomes may be separated and used to create chromosome-specific libraries.

This hierarchical strategy has been used, for example, for bacterial genomes, the nematode C. elegans (C. elegans Sequencing Consortium, 1998), the yeast S. cerevisiae (Goffeau et al., 1996), A. thaliana (The Arabidopsis Genome Initiative, 2000), the currently ongoing project on P. falciparum, the human malaria parasite (P. falciparum Genome Sequencing Consortium, 1996), and also in sequencing the whole human genome by the International Human Genome Sequencing Consortium.

The physical map can also be built as the sequencing progresses. This is called the "map as you go" strategy. Venter et al. developed an efficient method for selecting minimally overlapping clones by sequence-tagged connectors (STCs) (Venter et al., 1996), which is commonly referred to as "BAC-end sequencing". The ends of clones from a large insert library are sequenced, yielding sequence tags. Once a selected clone is sequenced, these tags can be used to select a flanking, minimally overlapping clone to be sequenced next. When no overlapping clone exists, a new seed point is selected.


1.2.3 The Whole Genome Shotgun (WGS) Strategy

In the whole genome shotgun strategy the genome is fragmented randomly, without a prior mapping step. The computer assembly of sequence reads is more demanding than in the HS strategy, due to the lacking positional information. Therefore, the use of varied clone lengths and the sequencing of both ends is essential to build scaffolds (Edwards & Caskey, 1990; Chen et al., 1993; Smith et al., 1994; Kupfer et al., 1995; Roach et al., 1995; Nurminsky & Hartl, 1996), in particular when sequencing large, complex genomes. This strategy has been used to sequence e.g. Haemophilus influenzae Rd. (Fleischmann et al., 1995), the 580,070 bp genome of Mycoplasma genitalium, the smallest known genome of any free-living organism, and the Drosophila melanogaster genome (Adams et al., 2000; Myers et al., 2000). However, a clone-by-clone based strategy has been used to finish the Drosophila genome.

The WGS approach to sequencing the human genome was proposed in 1997 (Weber & Myers, 1997). This started a debate on the feasibility of the WGS strategy for sequencing the entire human genome (Green, 1997; Eichler, 1998; The Sanger Centre, 1998; Waterston et al., 2002).

The method was applied to the whole human genome by the private company Celera Genomics.

1.2.4 Combining HS and WGS

In some cases, when one considers the project type, cost and time involved, it may be advantageous to combine the WGS and HS strategies. Mathematical models are helpful aids in determining an applicable sequencing approach. Several mathematical models have been developed and simulations performed for a variety of sequencing strategies (Clarke & Carbon, 1976; Lander & Waterman, 1988; Idury & Waterman, 1995; Schbath, 1997; Roach et al., 2000; Batzoglou et al., 1999). These models are useful in conjunction with large genome sequencing projects. The models allow the progress of a project to be monitored for deviations from predicted values, such as coverage, that may arise due to biological or technical problems, and thereby aid in taking early measures to correct the problem. These models are also advantageous in planning a sequencing project, estimating project duration and allocating resources, as well as optimizing cost.

A rearrangement (Anderson, 1981) of the Clarke and Carbon expression (Clarke & Carbon, 1976) can be used to monitor how random sequencing progresses:

    f_l = 1 - (1 - r/L)^n

The equation gives the fraction f_l of the target sequence G of length L that is in contigs when n reads of length r have been sequenced.
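The expression can be evaluated directly; as a sketch, the numbers below (500 base reads, a 50 kb target) are hypothetical illustration values:

```python
def fraction_in_contigs(r, L, n):
    """Clarke-Carbon rearrangement: fraction f_l of a target of length L
    represented in contigs after n random reads of length r."""
    return 1.0 - (1.0 - r / L) ** n

# Diminishing returns: each additional read contributes less new sequence.
for n in (50, 100, 500, 1000):
    print(n, round(fraction_in_contigs(500, 50_000, n), 3))
```
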

Lander and Waterman developed a mathematical model for genomic mapping by fingerprinting random clones (Lander & Waterman, 1988), which can be considered


analogous to the well-known Clarke-Carbon formulas. The Lander and Waterman expression can be used to calculate the average number of gaps and contigs, and their average length, in a shotgun project. A target sequence G of length L is redundantly sampled at an average coverage c = nr/L; assuming that the fragments are uniformly distributed along the target sequence, the coverage at a given base b is a Poisson random variable with mean c:

    P(base b covered by k fragments) = e^(-c) · c^k / k!     (1)

For example, the fraction of G that is covered by at least one fragment is 1 - e^(-c).

A comparative analysis of map-based sequencing versus WGS (Wendl et al., 2001) shows that, regardless of library type, random selection initially yields comparable rates of coverage and lower redundancy than map-based sequencing. This advantage shifts to the map-based approach after a certain number of clones have been sequenced. Map-based sequencing essentially yields a linear coverage increase, if a constant redundancy can be maintained, as opposed to the asymptotic process of random clone selection, equation (1).

Considering the number of clones needed with each method to achieve a certain coverage, it is reasonable to combine the approaches so that a random strategy is applied early in the project, which is then finished by a map-based approach. The advantages and disadvantages of each method for sequencing the human genome have been extensively debated in the literature (Weber & Myers, 1997; Green, 1997; Waterston et al., 2002).

The advantage of the hierarchical approach is that it minimizes large misassemblies, since each clone is separately assembled. On the other hand, the WGS method eliminates the time consuming mapping step and relies largely on the performance of fragment assembly algorithms to solve larger and more complicated assembly problems. For example, the first five years of the public human genome project were spent on mapping (National Research Council, 1988; Watson, 1990).

1.2.5 Shotgun Sequencing Methods

Both HS and WGS involve shotgun sequencing. In the shotgun process, the target sequence is broken into small pieces as randomly as possible. The fractionation is usually performed by passing the fragments through a nozzle under regulated pressure, for example by nebulization, by sonication, or by partial digest with restriction enzymes. These processes normally yield a semi-random collection of fragments. In order to remove fragments that are too long or too short, the fragments are loaded onto an agarose gel and separated by electrophoresis. The section of the gel containing the desired size is cut out and purified to yield a set of fragments of similar size. Since there are many copies of the original target sequence, many of these fragments overlap.

To get a sufficient quantity of each unique fragment, these size-selected fragments are inserted into a cloning vector, mostly using a ligation reaction. The vectors


are inserted into 'competent' E. coli cells, and as the colonies grow, the insert is effectively amplified. A sufficient number of colonies or plaques, each containing a separate clone, are selected to achieve the desired coverage of the target sequence, usually about ten-fold.

The vectors used (M13 or plasmid) contain primer sites for reverse and forward primers at the region where the restriction enzyme cuts, to allow sequencing from each end of the insert. Sequencing from both ends yields additional positional information, since the distance by which the read pair is separated is known (Edwards & Caskey, 1990). After a subsequent purification step to remove all bacterial components, media, proteins and other contaminants, the fragments are sequenced using dye primer or dye terminator techniques. The sequence is read using an automatic sequencer, where electropherograms or chromatograms containing the signal sequence read from the sequencing gel are created. The actual determination of the base sequence, the base-calling, is based on this signal sequence. The resulting reads are assembled in a computer to reconstruct the original target sequence. Due to cloning biases and systematic failures in the sequencing chemistry, the random data alone is usually insufficient to yield a complete sequence. Thus, the computer assembly step results in several contiguous sequences, contigs (Staden, 1980). In order to obtain the complete sequence, the remaining gaps must be filled and remaining problems solved. This can be accomplished by directed sequencing after the contigs have been ordered. This finishing process is usually a time consuming and labor intensive step.

1.3 Development of Shotgun Fragment Assembly Tools

In the 1960s, several theories and computer programs were developed to aid sequence reconstruction based on the overlap method (Dayhoff, 1964; Shapiro et al., 1965; Mosimann et al., 1966; Shapiro, 1967). George Hutchinson at the National Institutes of Health (NIH) was the first to apply the mathematical theory of graphs to reconstruct an unknown sequence from small fragments (Hutchinson, 1969).

The new electrophoresis technique devised by Sanger et al. allowed faster sequence determination, which in turn resulted in an increased need for computer programs to handle the larger amount of sequence data. During the years 1977 to 1982, Rodger Staden published a number of papers on constructing sequence fragment assembly and project management software (Staden, 1977; Staden, 1978; Staden, 1979; Staden, 1980; Staden, 1982a; Staden, 1982b). In Staden's original algorithm, the shotgun fragments were overlapped and merged in a sequential manner, so that in each step, a fragment chosen from a database was tested against all merged fragments.

Alignment techniques for sequences were devised in parallel with the development of fragment assembly programs (Needleman & Wunsch, 1970; Smith & Waterman, 1981). These methods allowed comparison and optimal alignment between sequences with insertions and deletions, corresponding to e.g. sequencing errors. The errors in sequence reads are dispersed, but in general the number of errors


increases toward the ends of reads, due to the geometric distribution of the difference between the growing sizes of fragments. Despite the erroneous input data, all the early fragment assembly models were based on the unrealistic assumption that sequence reads are free of errors. Peltola et al. were the first to define the sequence assembly problem involving errors (Peltola et al., 1983), and they also developed the SEQAID program based on their model (Peltola et al., 1984). Earlier programs and models were merely ad hoc, or based on models that allowed no errors in the shotgun data, e.g. (Smetanic & Polozov, 1979; Gingeras et al., 1979; Gallant, 1983). In 1994, Mark J. Miller and John I. Powell compared 11 early sequence assembly programs for the accuracy and reproducibility with which they assemble fragments. Only three of the programs consistently produced consensus sequences of low error rate and high reproducibility, when tested on a sequence that contained no repeats (Miller & Powell, 1994).
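As an illustration of the dynamic-programming alignment methods cited above, here is a minimal Needleman-Wunsch score computation (a sketch with arbitrary unit scores, not the parameterization of any particular assembler):

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score (Needleman & Wunsch, 1970), allowing
    insertions and deletions via the gap term."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i * gap            # a's prefix aligned against gaps
    for j in range(1, n + 1):
        dp[0][j] = j * gap            # b's prefix aligned against gaps
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + s,   # match / mismatch
                           dp[i - 1][j] + gap,     # gap in b
                           dp[i][j - 1] + gap)     # gap in a
    return dp[m][n]

print(nw_score("ACGTT", "ACTT"))  # 4 matches, 1 gap -> score 3
```

Smith-Waterman local alignment uses the same recurrence, but clamps each cell at zero and takes the maximum over the whole table.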

The general outline of most assembly algorithms is: 1. examine all pairs and create a set of candidate overlaps; 2. form an approximate layout of fragments; 3. make multiple alignments of the layout; 4. create the consensus sequence (see section 3.9). All existing methods rely on heuristics, since the fragment assembly problem is NP-hard (section 3.1) (Garey & Johnson, 1979; Gallant et al., 1980; Gallant, 1983; Kececioglu, 1991). Hence, it is not possible to exactly compute the optimal solution if the amount of input data is large enough (section 3.1). Kececioglu and Myers made a great deal of progress by identifying the optimality criteria and developing methods to solve them (Kececioglu, 1991; Kececioglu & Myers, 1995).
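The first two steps of this outline can be sketched, in grossly simplified form, as a greedy merge loop (a toy illustration with made-up error-free reads; real assemblers must additionally handle sequencing errors, repeats and reverse-complement reads):

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a equal to a prefix of b."""
    for l in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:l]):
            return l
    return 0

def greedy_assemble(reads, min_len=3):
    """Toy greedy layout: repeatedly merge the read pair with the
    largest overlap until no pair overlaps; remaining reads are
    returned as separate contigs."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, 0, 1)
        for i in range(len(reads)):
            for j in range(len(reads)):
                if i != j:
                    o = overlap(reads[i], reads[j], min_len)
                    if o > best[0]:
                        best = (o, i, j)
        o, i, j = best
        if o == 0:
            break  # no merge possible: reads stay as separate contigs
        merged = reads[i] + reads[j][o:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads

print(greedy_assemble(["TTACGT", "ACGTAC", "GTACGG"]))  # -> ['TTACGTACGG']
```

The greedy choice is a heuristic in exactly the sense discussed above: it does not guarantee the optimal (e.g. shortest) reconstruction, and repeats can lead it astray.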

In 1990, a pairwise end-sequencing strategy was described (Edwards & Caskey, 1990). This was followed by the development of several variations of the strategy, such as scaffold-building to "automatically" produce a sequence map as a shotgun project proceeds, which was described and simulated by Roach and colleagues (Roach et al., 1995). Variations of the approach were also detailed by other groups (Chen et al., 1993; Smith et al., 1994; Kupfer et al., 1995; Nurminsky & Hartl, 1996). TIGR Assembler (Sutton et al., 1995) was the first assembly program to utilize the clone length information, and it was successfully applied to assemble the Haemophilus influenzae genome by the whole genome shotgun method. Myers engineered an approach to the assembly of larger genomes, using paired whole-genome shotgun reads (Anson & Myers, 1999). Myers' approach was implemented in the Celera Assembler and used to successfully assemble first the Drosophila melanogaster genome (Myers et al., 2000). The algorithm has also been applied to the human genome.

On the realization that the initial, correct determination of overlaps was critical to the overall assembly result, the reads were first conservatively clipped to avoid the lower accuracy regions at the ends, where the overlaps are typically detected. This was first implemented in 1991 (Dear & Staden, 1991) and in 1995 in the GAP program (Bonfield et al., 1995), followed by many others. Base quality estimates were first used to improve the consensus calculation (Bonfield & Staden, 1995). The major improvement in assembly followed from the base calling program Phred (Ewing et al., 1998; Ewing & Green, 1998), which computes error probabilities for base calls using parameters derived from electropherograms. The idea of error


probabilities was earlier described by several groups (Churchill & Waterman, 1992; Giddings et al., 1993; Lawrence & Solovyev, 1994; Lipshutz et al., 1994). Other base calling programs have followed that compute error probabilities for base calls and, in addition, gap probabilities, for example Life Trace (Walther et al., 2001). The Phred error probabilities were utilized in Phrap (http://www.phrap.org) to discriminate pairwise overlaps.

A different approach was used in the design of GigAssembler (Kent & Haussler, 2001). GigAssembler was designed to work as a second-stage assembler and to order initial contigs that were assembled separately by Phrap or another assembly method. The program utilizes several sources of information, e.g. paired plasmid ends, expressed sequence tags (ESTs), bacterial artificial chromosome (BAC) end pairs and fingerprint contigs. The GigAssembler algorithm was used to assemble the public working draft of the human genome.

2 Milestones in Genome Sequencing

2.0.1 First Cellular Organisms

The first bacterial genome, Haemophilus influenzae Rd, 1.83 Mbp (Fleischmann et al., 1995), was sequenced using the whole genome shotgun sequencing approach.

The first eukaryotic organism: Saccharomyces cerevisiae, 13 Mbp (Goffeau et al., 1996).

A bacterium, Escherichia coli, 4.60 Mbp (Blattner et al., 1997), is a preferred model in genetics, molecular biology, and biotechnology. E. coli K-12 was the earliest organism to be suggested as a candidate for whole genome sequencing.

2.0.2 First Multicellular Organisms

A small invertebrate, the nematode or roundworm Caenorhabditis elegans, 97 Mbp (C. elegans Sequencing Consortium, 1998).

A fruit fly, Drosophila melanogaster, 137 Mbp (Adams et al., 2000).

Homo sapiens, 2.9 Gbp: International Human Genome Sequencing Consortium, Nature, Feb. 15, 2001, and Venter et al., Science, Feb. 16, 2001.

Mus musculus, 2.6 Gbp: Celera Genomics, spring 2001, unpublished.

According to the Genomes Online Database (Bernal et al., 2001), as of the 10th of April, 2002, a total of 84 complete genomes had been published, with 271 ongoing prokaryote genome sequencing projects and 178 eukaryotic sequencing projects.

The development of computer hardware and powerful computational tools is central to large-scale genome sequencing. Without fragment assembly software and the automation of gel reading by image-processing tools, it would not have been feasible to assemble all gel reads up to millions of bases of contiguous sequence.


3 Shotgun Fragment Assembly Algorithms

The goal of a shotgun fragment assembly program is to order overlapping reads to recreate the original DNA sequence. In practice this means the construction of the "best guess" of the original clone sequence, BAC, cosmid or, in the case of whole genome shotgun, genome sequence. In order to reach the goal of revealing the original DNA sequence, several problems and complications must be solved.

Many real world problems are computationally intractable. The assembly of shotgun sequences contains such problems in the multiple alignment and fragment layout determination stages. Hence, in many cases it will not be possible or feasible to compute the optimal solution, and therefore there is no guarantee that the resulting consensus sequence is the same as the original DNA sequence.

3.1 Types of Problems in Sequence Assembly

An algorithm is a procedure that specifies a set of well-defined instructions for the solution of a problem in a finite number of steps. Apart from the fact that it is difficult to prove that an algorithm is correct, even when it solves what it is intended to solve, the central question in computer science is the efficiency of an algorithm: how the running time, the number of steps needed to get a solution, depends on the size of the input data. This depends mainly on the problem type at hand. The problems can be divided into two main complexity classes: polynomial and non-polynomial. In general polynomial time problems are efficiently solvable, whereas non-polynomial problems are hard to solve, since the running time of an algorithm grows faster than any power of the number of items required to specify an increasing amount of input data. No practical algorithm that solves a non-polynomial problem exists, since it would take an unreasonably long time to get an answer (Garey & Johnson, 1979).

It is very hard to prove that a problem is non-polynomial, since all possible algorithms would have to be analyzed and proved not to be able to solve the problem in polynomial time. Several problems exist that are believed to be non-polynomial, but this has not yet been proven. An example is the well known Travelling Salesman Problem (Karp, 1972): given the cost of travel between cities, the problem is to find the cheapest tour that passes through all of the cities in the salesman's itinerary and returns to the point of departure. Finding the prime factors of an integer is also believed to be non-polynomial, and many security systems that encrypt e.g. credit card numbers rely on this belief.

In 1971, Steve Cook (Cook, 1971) introduced classes of decision problems, polynomial running time P and nondeterministic polynomial running time NP (2), based on the verification of the solution. P is the class of decision problems for which the solution can be both computed and verified in polynomial time. For the NP class of problems it is only possible to verify a given answer in polynomial time. A problem is in the NP-complete class if it is possible to solve all problems

(2) NP is not the same as non-P.


within the class in polynomial time, using an algorithm that can solve any one of the problems. In other words, any problem within the class can be converted to any other problem within the same class. NP-hard problems are a larger set of problems that includes the NP-complete problems, such that an algorithm that can solve an NP-hard problem can solve any problem in NP with only polynomially more work. Since Cook's theorem, the travelling salesman problem has been proven to be NP-complete, as have a large number of other problems.

Many real world problems are inherently computationally intractable. This fact led to the development of the theory of approximation algorithms, which deals with optimization problems. The objective of optimization is to search for feasible solutions by minimizing or maximizing a certain objective function when the search space is too large to search exhaustively, i.e. when computing the optimal solution would take too long. Different NP optimization problems have different characteristics. Some problems cannot be approximated at all; others differ in how close to the optimum an approximation can get.

Based on empirical evidence, it has been conjectured that all NP-complete problems have regions that are separated by a phase transition, and that the hardest problem instances occur near this phase transition. This seems to be due to the large number of local minima between the two regions, which are either under-constrained or over-constrained (Cheeseman et al., 1991).

3.2 Complications in Fragment Assembly

There are a number of computational problems that a sequence assembly program has to be able to handle. The most important ones are:

1. Repeated regions. Repeats are difficult to separate and often cause the fragment assembly program to assemble reads that come from different locations. Short repeats are easier to deal with than repeats that are longer than an average read length of about 500 bp. Repeats may be identical or contain any number of differences between repeat copies. There are interspersed repeats, which are present at various genomic locations, and tandem repeats, a sequence of varying length repeated one or several times at the same contiguous genomic location. Repeats present the most difficult problem in sequence assembly. It may be close to impossible to correctly separate identical repeats, but fortunately identical repeats are rare, since different copies have accumulated mutations during evolution. This can be used to discriminate between repeat copies (Figure 1).

[Figure 1: A fictitious genome sequence, CTTCGCGTCATCATCACTTGAGTCATCATCACCTCGGA, contains a repeat (TCATCATCA) present in two copies. The reads are shown in the correct layout, together with alternative, incorrect layouts caused by the repeat: a single contig joining reads across the two copies, and two contigs mixing sequence from both copies.]

2. Base-calling errors or sequencing errors. The limitations of current sequencing technology result in varying quality of the sequence data between reads and within each read. In general the quality is lower toward the ends, where the overlapping segment between a read pair occurs. There are a number of reasons for this. The data at the end of a read originates from the longest fragments, and the proportional length difference between long DNA fragments differing by only one base is smaller than that between short fragments. Thus, the random incorporation of ddNTPs results in a geometric distribution of sizes, yielding a weaker fluorescent signal for the large molecules. Molecules also diffuse in the gel, and the longer it takes for fragments to pass the laser detector, the more they diffuse. These factors, among others, effectively prevent reads considerably longer than approximately one kilobase. In routine production the read length varies from about 400 bases to 900 bases. The sequence reads may also have low quality regions in the middle of high quality regions. These can be caused by, for example, simple repeats and "compressions", i.e. some bands in the sequencing gel running closer together than expected due to secondary structure effects. Single stranded DNA can self-anneal to form a hairpin loop structure, which causes it to move faster on the gel than would be expected according to its length. This is a fairly common phenomenon in GC-rich segments and results


in compression of trace peaks. There are four different types of sequencing errors: 1. Substitutions: a base is erroneously reported. 2. Deletions: one or several bases are missing. 3. Insertions: one or several erroneous extra bases are reported. 4. Ambiguous base calls: the base cannot be resolved to A, T, G or C, and an ambiguity code is reported instead. See Table 1.

Symbol  Meaning         Symbol  Meaning
G       G               A       A
T       T or U          C       C
U       U or T          R       G or A
Y       T, U or C       M       A or C
K       G, T or U       S       G or C
W       A, T or U       H       A, C, T or U
B       G, T, U or C    V       G, C or A
D       G, A, T or U    N       G, A, T, U or C

Table 1: IUPAC ambiguity codes

3. Contamination. Sequences from the host organism used to clone the shotgun fragments, e.g. E. coli, that are mixed in with the shotgun sequences.

4. Vector sequences present in reads. If not removed, these may cause false overlaps between reads.

5. Unknown orientation. It is not known from which strand each fragment originates, which increases the complexity of the assembly task: a read may represent one strand or the reverse complement sequence on the other strand. A BAC-sized project may contain over 2000 reads, and since the strand of origin is unknown, a duplicate, complementary set of reads is created by reversing each base sequence and complementing each base. This results in a total of over 4000 reads (Figure 2).
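The doubling of the read set can be sketched in a few lines; this is a minimal illustration, and the reads shown are invented examples.

```python
# Minimal sketch: doubling a read set with reverse complements so that
# both possible orientations of each fragment are available during
# assembly. The reads below are invented examples.

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(read: str) -> str:
    """Reverse the read and complement each base."""
    return read.translate(COMPLEMENT)[::-1]

reads = ["CTTCGCGTCATCATCA", "TCATCATCACTTGA"]
both_orientations = reads + [reverse_complement(r) for r in reads]
# A project with over 2000 reads yields over 4000 candidate sequences.
```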

Figure 2: The orientation of each fragment is initially unknown.

6. Polymorphisms. These complicate the separation of nearly identical repeats. Polymorphisms will usually not be detected in "clone by clone" sequencing, since only one variant of each genomic region is sampled.


7. Incomplete coverage. Coverage varies between target sequence locations due to the nature of random sampling. Theoretically, the coverage has a certain probability of being zero, depending on the average sampling coverage of the target genome; see equation (1). The number of gaps is further increased by bias due to the composition of the genomic sequence, e.g. some sequences may be toxic and interfere with the biology of the host organism.

3.3 Base calling

Commercial companies such as Applied Biosystems, which among other things produces DNA sequencers, have not made their algorithms public. However, public research on base calling from fluorescent data has been performed. Phred (Ewing et al., 1998) is today a popular base calling system that has challenged the ABI base calling. Tibbet, Bowling, and Golden (Adams et al., 1994) have studied base calling using neural networks, and Giddings et al. (Giddings et al., 1993) describe a system that uses an object-oriented filtering system. Dirk Walther et al. (Walther et al., 2001) describe a base calling algorithm reporting lower error rates than Phred.

Base-calling refers to the conversion of raw chromatogram data into the actual sequence of bases (figure 3). In the Sanger sequencing method, extension reactions are performed using cloned or PCR product DNA segments as templates. The extensions generate products of all lengths, as determined by the sequence. When these products are run on an electrophoresis gel, the base sequence can be read starting from the bottom. Different bases can be identified by labelling each base reaction with a different fluorescent dye. In sequencing machines using the slab-gel or capillary technique, the dyes are excited with a laser and the emitted light intensity is usually detected by a CCD camera or a photomultiplier tube. This produces a set of four arrays, called traces, corresponding to the signal intensities of each color over a time line. These traces are further processed by mobility correction, locating start and stop positions, baseline subtraction, spectral separation, and resolution enhancement. Base calling then translates the processed data into an actual nucleotide sequence.

Figure 3: A graphical view of the four traces, A, T, G and C showing the Phredand ABI base-calling.


3.4 The Data Quality

The limitations of current sequencing technology result in varying quality of the sequencing data between reads and within each read. In general the quality is lower towards the ends. There are a number of reasons for this. The data at the end of a read originates from the longest fragments, and the proportional length difference between long DNA fragments differing by only one base is smaller than that between short fragments. Also, the random incorporation of ddNTPs results in a geometric distribution of concentration and size, yielding a weaker fluorescent signal. Then there is the time factor: molecules diffuse in the gel, and the longer it takes until the fragments pass the laser detector, the more they diffuse. These factors, among others, effectively prevent reads considerably longer than approximately one kilobase. In routine production the read length varies from about 400 bases to 800 bases. The sequence reads may also have low quality regions in the middle of high quality regions. These are mainly caused by the characteristics of the polymerase and the sequencing reactions. The base sequence of the DNA itself also influences the data quality. Single stranded DNA can self-anneal to form a hairpin loop structure, which causes it to move faster on the gel than would be expected according to its length. This is a fairly common phenomenon in GC-rich segments and results in compression of trace peaks (Figure 4). Considering the

Figure 4: An example of compressed traces.

above, it is not surprising that there has been a lot of interest in quality assessment in DNA sequencing during the last decade (Ewing & Green, 1998; Richterich, 1998; Li et al., 1997; Bonfield et al., 1995; Naeve et al., 1995; Lawrence & Solovyev, 1994; Lipshutz et al., 1994; Khurshid & Beck, 1993; Chen & Hunkapiller, 1992).

3.5 Error Probabilities

There are several obvious reasons why we would want error free sequence data: the fragment assembly problem would be algorithmically much simpler, and the cost of sequencing would be significantly lower since fewer reads would be needed. Unfortunately, we cannot get perfect data, but have to settle for second best and try to measure the accuracy. In fact, many early base calling algorithms estimated confidence values (Giddings et al., 1993; Giddings et al., 1998; Golden et al., 1993), which is best done by studying electropherogram data. In 1994, Lawrence and Solovyev (Lawrence & Solovyev, 1994) carried out an extensive study defining a large number of trace parameters and an analysis to determine which ones are most effective in distinguishing accurate base calls. After Brent Ewing and Phil Green implemented Phred (Ewing & Green, 1998) in 1998, a base-calling program with the ability to estimate the error probability of each base-call, Phred became one of the most widely used base-calling programs, and Phred quality values became a standard within the sequencing community (Richterich, 1998). These quality values are logarithmic and range from 0 to 99, where an increased value indicates increased quality. They are defined as follows:

q = −10 · log10 P

where q is the quality and P is the estimated error probability of the base-call. Phred quality values have a wide range. For example, a quality value of 10 means that the probability of the base being wrong is 1/10, i.e. 90% accuracy. An accuracy of 99.99% is required for finished shotgun data submitted to databases, which means that the probability of an error is at most 1/10 000.
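The Phred relation can be expressed directly in code; this is a small illustrative sketch, with function names of our own choosing.

```python
import math

# Illustrative sketch of the Phred relation q = -10 * log10(P) and its
# inverse; the function names are our own.

def quality_from_error_prob(p: float) -> float:
    return -10.0 * math.log10(p)

def error_prob_from_quality(q: float) -> float:
    return 10.0 ** (-q / 10.0)

# Quality 10 corresponds to an error probability of 1/10 (90% accuracy);
# quality 40 corresponds to 1/10 000 (99.99% accuracy).
```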

We now know that the root of many of the problems in fragment assembly essentially lies in the error prone base-calling. Ideally, the peaks in all four traces, one for each base, would be Gaussian-shaped and evenly spaced, without any background. As always in real life, several factors disturb the obtainable data quality, resulting in variable peak spacing, uneven peak heights and variable background, and thus in an erroneous final nucleotide sequence. One of the keys to successfully defining predictive error probabilities is the correct characterization of traces. Lawrence and Solovyev (Lawrence & Solovyev, 1994) considered single peaks in traces, but according to the work by Ewing and Green (Ewing & Green, 1998), the measures become more reliable when the peaks in the neighborhood are taken into account: more information is available, since indications of error may be present in nearby peaks but not in the peak under consideration. Ewing and Green found the following parameters to be the most effective for detecting errors in base-calling:

1. Peak spacing, D. The ratio of the largest peak-to-peak distance, Lmax, to the smallest peak-to-peak distance, Lmin: D = Lmax/Lmin. Computed in a window of seven peaks centered on the current peak. This value measures the deviation from the ideal of evenly spaced peaks and has a minimum value of one.

2. Uncalled/called ratio in a window of seven peaks. The ratio of the amplitude of the largest peak not resulting in a base call to the amplitude of the smallest peak resulting in a base call, i.e. a called peak. If the current window lacks an uncalled peak, the largest of the three remaining trace array values at the location of the called peak is used instead. If the called base is an 'N', Phred assigns a value of 100.0. The minimum value is 0, for traces with no uncalled peaks; with good data, one of the four fluorescent signals will clearly dominate over the other three. Computed in a window of seven peaks centered at the base of interest.

3. Uncalled/called ratio in window of three peaks. The same as above, but usinga window of three peaks.

4. Peak resolution. The number of bases between the current base and thenearest unresolved base. Base calls tend to be more accurate if the peaksare well separated and thus do not overlap significantly.

By re-sequencing an already known DNA sequence and using it as a gold standard, it is possible to compare the output of a base-caller to this standard. The errors can then be categorized using the parameter values computed from the traces for each base. This can be used to predict errors in unknown sequence by comparing its parameter values with the known error frequencies. In order to compute error probabilities rapidly for unknown sequences, a lookup table is created in which the four parameter values for each category of error rate are stored. Phred uses a lookup table consisting of 2011 lines to predict the error probability and to set the quality values.
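The lookup-table idea can be illustrated with a toy sketch. This is not Phred's actual table: the parameter bins and error rates below are invented purely for illustration.

```python
import math

# Illustration of the lookup-table idea only: this is NOT Phred's actual
# table. The parameter bins and error rates below are invented.

lookup = {
    # (spacing, ratio7, ratio3, resolution) bins -> observed error rate
    (0, 0, 0, 0): 0.0001,  # clean, well-resolved traces
    (1, 0, 0, 0): 0.001,
    (1, 1, 1, 1): 0.05,    # poor traces
}

def quality_value(bins, default_rate=0.25):
    """Translate binned trace parameters into a Phred-style quality."""
    rate = lookup.get(bins, default_rate)
    return round(-10 * math.log10(rate))
```

Under these invented rates a clean trace would receive quality 40; an unknown parameter combination falls back to the (also invented) default rate.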

The Phred quality values provide only information on the probability of a base being erroneously sequenced. Recall that there are four different types of errors that may occur in base-calling: substitutions, deletions, insertions and ambiguous calls. Phred quality values do not give us the probability of a missing base, i.e. the probability of a deletion. A recently published base-calling program, Life Trace by Dirk Walther et al. (Walther et al., 2001), introduces a new quality value, gap-quality. Pre-compiled versions of Life Trace for several different platforms are available at http://www.incyte.com/software/home.jsp.

The error probabilities can be used in several ways in fragment assembly programs, for example in screening and trimming of reads and in defining pair-wise overlaps of reads. We discuss this more thoroughly in section 3.9.

3.6 Genome Coverage

The benefit of the shotgun method is that several reads usually cover the same region on both DNA strands, and the sought-after consensus sequence can be computed using the information from several reads. Usually, in a genomic sequencing project, the coverage is about 8, often denoted as 8X. Coverage is defined as follows:

C = (N · r) / G     (2)

where C = coverage, N = number of reads, r = mean read length and G = length of the genome. How many reads do we need to get 10X coverage of a BAC sequence of length 125 kb, if the mean read length is 480 bases?

N = (C · G) / r = (10 · 125 000) / 480 ≈ 2604 reads


About 2600 fragments need to be sequenced to get 10X coverage. Remember that we do not know the orientation of any of the fragments; therefore we complement all reads, i.e. compute the reverse complement sequences, and let the assembly determine the orientations. This yields 2 · 2600 = 5200 reads that the assembly program has to deal with.
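Equation (2), rearranged to give the number of reads for a target coverage, can be checked with a few lines of code; the function name is our own.

```python
# Equation (2) rearranged to give the number of reads needed for a
# target coverage; the function name is our own.

def reads_needed(coverage: float, genome_length: float, mean_read_length: float) -> int:
    return int(coverage * genome_length / mean_read_length)

n = reads_needed(10, 125_000, 480)   # the BAC example from the text
total = 2 * n                        # both orientations
```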

Figure 5: A contig assembled with 10X average coverage. The y-axis shows the coverage and the x-axis the position on the contig.

This is the total average coverage, but the actual observed coverage will vary along and between the assembled contigs. The observed coverage is zero in regions that are not sequenced at all. Figure 5 shows a typical coverage variation along a contig sequenced with 10X average coverage. It may be valuable to know the probability of observing a coverage of a certain depth, or the resulting mean number of gaps and their average length. If we know the mean read length r, the average coverage c and the length of the genomic region G, the coverage can be computed by treating C as a binomial random variable. Shotgun sequencing can be seen as randomly selecting an initial base as a starting position within the clone we aim to sequence, and then continuing from that position forward for an average read length number of bases. What is the probability of selecting a given initial base? Let the average coverage be c = 8, the genome length G = 10^5 bases and the average read length r = 500 bases. Using equation (2), we can calculate that we have to sequence 1600 fragments to achieve an average coverage of 8. Thus, we will select a starting base for sequencing 1600 times, which gives the probability of randomly selecting a given base position:

p = (No. of reads) / G = 1600 / 10^5 = 0.016.

Each selection or 'trial' of a starting base results in 'success' with probability p = 0.016, and in 'failure' (the base is not selected) with probability 1 − p = 1 − 0.016 = 0.984. How many trials do we make? This may be a little tricky. The number of 'trials' equals the average read length: we gain one unit of coverage at a position by selecting any one of the r starting positions in the window of one average read length that covers the position. If we regard the coverage C as the number of 'successes', the probability of observing a coverage of at most i can be computed using the binomial distribution function:

P{C ≤ i} = Σ_{k=0}^{i} (r choose k) · p^k · (1 − p)^(r−k),  i = 0, 1, ..., r     (3)


where p is the probability of success in each of the r trials:

p = c / r     (4)

Note that equation (4) can be derived from equation (2): p = N/G = (C · G/r)/G = C/r. Thus C is binomial with parameters (r, p).

Example 1. Let the coverage c = 8, the mean read length r = 500 and the length of the clone G = 10^5. The probability of observing zero coverage is

P{C = 0} = 1 · 1 · (1 − 8/500)^500 ≈ 3.14 · 10^−4,

the probability of observing a coverage C = 12 is

P{C = 12} = (500 choose 12) · (8/500)^12 · (1 − 8/500)^(500−12) ≈ 4.79 · 10^−2,

and the probability of a coverage C ≤ 1 is

P{C ≤ 1} = P{C = 0} + P{C = 1}
         = 3.14 · 10^−4 + (500 choose 1) · (8/500)^1 · (1 − 8/500)^(500−1)
         ≈ 2.87 · 10^−3.

The Poisson random variable can be used to approximate a binomial random variable with parameters (n, p) when p is small and n is large, so that λ = np is of moderate size.

The Poisson probability mass function:

p(i) = P{X = i} = e^−λ · λ^i / i!     (5)

The probability that a base is not sequenced is P0 = e^−C, where C is the coverage. The total gap length is L · e^−C and the average gap size is L/n, where L is the length of the target sequence and n is the number of randomly sequenced segments (Lander & Waterman, 1988; Fleischmann et al., 1995).
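The binomial probabilities of Example 1 and the Poisson approximation can be checked numerically; this sketch uses the values c = 8 and r = 500 from the text.

```python
import math

# Numerical check of Example 1 and the Poisson approximation, using the
# values c = 8 and r = 500 from the text.

def binom_pmf(k: int, n: int, p: float) -> float:
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

r, c = 500, 8
p = c / r                      # equation (4)

p0 = binom_pmf(0, r, p)        # probability of zero coverage
p12 = binom_pmf(12, r, p)      # probability of coverage 12

poisson_p0 = math.exp(-c)      # Poisson approximation of p0
```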

3.6.1 Exercises

1. Average coverage is 3.5, the mean read length 420 bases and the length ofthe BAC clone is 120 kb. Calculate the probability of a base not beingsequenced.


2. Suppose you have a BAC sequence that contains two repeated regions of length 1 kb. The repeat copies differ by 0.8% from each other and the mutations are evenly distributed. You want at least two of the differences to be covered by at least four of the same reads, with a probability of 99.9%. What average coverage should you have?

3.7 Distance Constraints

When clones of DNA are sequenced from both ends, the approximate distance between the end sequences and their relative orientation are known. These sequence pairs from cloned ends are called mate-pairs (Myers et al., 2000) or sequence mapped gaps (Edwards & Caskey, 1990). The distance information can be used both to verify finished assemblies and as an additional source of information in assembly algorithms. The modified shotgun sequencing strategy using these distance constraints has been coined double-barreled. Typically, a whole-genome sequencing project uses a mixture of clones containing inserts of several lengths, e.g. 2 kbp, 5 kbp, 50 kbp and 150 kbp clones. *FIGURE* A usual way to employ this information in fragment assembly programs is to build a so-called scaffold: a set of contigs or reads that are positioned, oriented and ordered according to the distance information provided by the cloned ends. *SCAFFOLD FIG* Only the approximate length of each clone is known, and it is therefore not possible to rely on only one pair of cloned ends to build an assembly scaffold. To acquire reasonable confidence, a certain coverage of mate-pairs is required. This obviously depends on the accuracy of the laboratory procedure in which fragmented DNA segments of a specific size are chosen. For example, in the whole-genome shotgun assembly of Drosophila, a single mate-pair was found to be wrong 0.34% of the time, while with two mate-pairs providing consistent information, the error rate was found to decrease to 10^−15 (Myers et al., 2000).

3.8 Fragment Assembly Problem Models

The sequence reconstruction or fragment assembly problem would be easy if the original DNA sequence were unique, i.e. did not contain repeats. If we find a common sub-sequence of 14 bases between two different reads, the probability of such an identical sequence occurring by random is 4^−14 = 1/268 435 456. This means that if the length of the sequence we are assembling is about 270 Mbp, we would expect to find one such sequence by sheer chance. In a BAC-sized project (~100 000 bp) the probability of a random match is 100 000/268 435 456 ≈ 4 · 10^−4, provided that the DNA sequence is random (3). Solving the fragment assembly problem would then be easy, since we would only need to find all the matching sub-sequences between all the reads, and with high probability almost all the overlaps assigned this way would be correct. All that would be left to do is to assemble the reads into contiguous sequences, contigs, and compute the consensus sequence. Unfortunately, genomes usually are not random and do contain repeats. This tactic is doomed to fail in practice.

(3) NOTE that this is NOT the probability of two reads sharing a common sequence of a certain length.
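The random-match arithmetic above can be verified directly; the variable names are ours, and the uniform-random assumption is the one made in the text.

```python
# The random-match arithmetic from the text, under the (unrealistic)
# assumption of a uniform random DNA sequence.

p_word = 4 ** -14                # probability of one exact 14-mer match
assembly_length = 270_000_000    # ~270 Mbp
expected_hits = assembly_length * p_word   # about one chance occurrence

bac_length = 100_000
p_bac = bac_length * p_word      # roughly 4e-4 for a BAC-sized project
```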

Over the decades, many algorithms have been developed that attempt to solve the fragment assembly problem. One of the keys to good fragment assembly is a successful definition of the overlaps between reads. These overlaps can be of four different types (figure 6): read A can overlap read B either at the

Figure 6: True overlaps between reads can be of four different types. In all casesthe overlap must end at least at the end of one of the reads.

beginning or at the end, read A can be contained in read B, or vice versa. In all cases the overlap has to extend to the end of at least one of the reads to be considered a true candidate overlap. All true overlaps fall into one of these categories. Considering that we are interested in pairing reads that come from the same clone, i.e. the same genomic region, and that should therefore have identical base sequences, it is surprisingly hard to define these overlaps correctly. The reads contain many sequencing errors. In general, most of the errors are at the beginning and at the end of the reads, making it hard to define where exactly a read begins and where it ends (figure 7).

Figure 7: A typical quality distribution of a shotgun read. The beginning and the end of the read have low quality. X-axis: the position (bases) on the read; Y-axis: the Phred quality value.

All fragments usually pass through a trimming process where the low quality ends are trimmed. If trimming is too stringent, information is lost and a greater number of reads is required to reach a given coverage (see section 3.6). On the other


hand, if the trimming is too loose, leaving many sequencing errors, it is difficult to determine whether the sequences come from the same region or are just very similar. Since many of the defined overlaps between reads are wrong, it is very hard to rebuild the original genomic sequence. There are extremely many ways to put this puzzle together; so many, in fact, that it is impossible to try out all the possibilities. This gives rise to a combinatorial explosion. This type of problem is very hard to solve; in computer science it is called non-deterministic polynomial complete, NP-complete, and such problems are the hardest in NP. Thousands of problems have been proven to be NP-complete; one of the most famous is the Traveling Salesman Problem. Many computer scientists are working to find general solutions for this problem type. The simplest algorithm for approximating a solution is the greedy algorithm.

3.8.1 The Shortest Common Superstring Model

The fragment assembly problem can be abstracted into the shortest common superstring problem (SCS). The object is to find the shortest possible superstring S that can be constructed from a given set of strings F, so that S contains each string in F as a substring. SCS can be reduced to the traveling salesman problem and also has applications in data compression.

Maier and Storer showed that SCS is NP-complete (Maier & Storer, 1977). Later, in 1980, a more elegant proof was described (Gallant et al., 1980). Since SCS is NP-complete, no feasible polynomial algorithm is known that solves this problem, but polynomial approximation algorithms exist. Moreover, SCS was proven to be MAX-SNP-hard in 1994 (Blum et al., 1994), meaning that no polynomial-time algorithm can approximate the optimal solution within an arbitrary constant, only within some fixed constant (section 3.1). A simple greedy algorithm can be used to approximate the solution by repeatedly merging a pair of strings with maximum overlap in the set F until only one string is left. The greedy algorithm was first used in fragment assembly in 1979 (Staden, 1979).
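The greedy heuristic just described can be sketched in a few lines. This is a minimal illustration assuming error-free fragments and exact overlaps; the function names are our own, not taken from any assembler.

```python
# Minimal sketch of the greedy superstring heuristic: repeatedly merge
# the pair of strings with the largest exact suffix-prefix overlap.
# Error-free fragments are assumed; the function names are our own.

def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_superstring(fragments):
    # Discard fragments contained in another fragment.
    frags = [f for f in fragments if not any(f != g and f in g for g in fragments)]
    while len(frags) > 1:
        n, a, b = max((overlap(a, b), a, b)
                      for a in frags for b in frags if a != b)
        frags.remove(a)
        frags.remove(b)
        frags.append(a + b[n:])  # merge the best pair
    return frags[0]
```

For example, greedy_superstring(["GTGAT", "GATGC", "TGCTGG"]) reconstructs "GTGATGCTGG".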

Tarhio & Ukkonen and Turner established performance guarantees for the greedy algorithm, which imply that the resulting superstring S is at most half the total, concatenated length of the substrings in F (Tarhio & Ukkonen, 1988; Turner, 1989). Blum et al. were the first to describe an approximation algorithm for SCS with an upper bound of three times optimal (Blum et al., 1991). The approximation methods have been continuously refined: Kosaraju et al. established the bound 2 50/63 (Kosaraju et al., 1994), Armen and Stein the bounds 2 3/4 in 1994 and 2 2/3 in 1996 (Armen & Stein, 1994; Armen & Stein, 1996), and Sweedyk has reduced the bound to 2 1/2 (Sweedyk, 1995). It has been conjectured that greedy is at most two times optimal.

SCS is a sub-optimal model for fragment assembly, mainly because it tends tocollapse repeated regions and it accounts for neither sequencing errors (i.e. sub-stitutions, insertions and deletions) nor unknown orientation.


3.8.2 The Sequence Reconstruction Model

The problem model derived from SCS that accounts for sequencing errors and for fragment orientation is called the Sequence Reconstruction Problem (SRP). The minimum number of edit operations (substitutions, insertions and deletions) required to convert a sequence a into a sequence b is defined as the edit distance between the sequences, denoted d(a, b). Given a set of sequence fragments F and a sequencing error rate ε ∈ [0, 1], the object is to find the shortest sequence S such that for every fi ∈ F there is a substring Ssub of S such that

min{d(Ssub, fi), d(Ssub, fi^r)} ≤ ε · |Ssub|,

where fi^r is the reverse complement of fi.
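The SRP acceptance criterion above can be illustrated with a small sketch that computes the edit distance by standard dynamic programming and tests a fragment in both orientations; the function names are our own.

```python
# Sketch of the SRP acceptance criterion: a fragment f matches a
# candidate region s_sub if either orientation is within edit distance
# eps * |s_sub|. The function names are our own.

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: substitutions, insertions and deletions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def reverse_complement(f: str) -> str:
    return f.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def explains(s_sub: str, f: str, eps: float) -> bool:
    return min(edit_distance(s_sub, f),
               edit_distance(s_sub, reverse_complement(f))) <= eps * len(s_sub)
```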

Since the sequence reconstruction problem with zero error rate reduces to the NP-complete SCS, SRP is also NP-complete. The proof was given in 1996 (Waterman, 1996).

Peltola et al. (Peltola et al., 1984) were the first to include errors and fragment orientation in the problem model, together with graph theory, and they used the greedy algorithm to approximate the fragment layout. Huang also included the error criterion in the CAP assembly program (Huang, 1992). Kececioglu and Myers describe an extension of SRP by introducing discrimination of false overlaps based on error likelihood, including an extensive treatment of the unknown orientation problem using interval graphs (Kececioglu, 1991; Kececioglu & Myers, 1995). Phrap uses Phred quality values in a likelihood model to discriminate false overlaps (sections 3.9.2 and 3.9.3).

The sequence reconstruction model matches reality better than the shortest common superstring model.

3.8.3 Graphs

An article by L. Euler started the branch of mathematics called graph theory (Euler, 1736; N. Biggs & Wilson, 1976). A graph is a collection of vertices connected by edges. Since each read should form an interval on the target sequence, it is natural to model this problem as an interval graph problem, with reads corresponding to vertices and overlaps to edges. The directed graph Gov(V, E) in figure 8 for the sequence S = GTGATGCTGG has the vertex set V = {GTG, TGA, GAT, ATG, TGC, GCT, CTG, TGG}. Each edge in E represents an overlap of two bases with an adjacent sequence. A Hamiltonian path in the graph is a path that passes through every vertex exactly once. However, a Hamiltonian path does not necessarily pass through all the edges. A Hamiltonian cycle or circuit is a Hamiltonian path that ends in the same vertex in which it started. Each edge in the graph can be weighted by a score that reflects the quality of an overlap, and the traveling salesman problem formulation can be used to find the Hamiltonian path that yields the best score. This problem is, however, NP-complete. If the amount of input data is large, and in particular in the presence of similar repeats


Figure 8: Directed graph Gov for the sequence GTGATGCTGG with the vertexset V . There is one Hamiltonian path that visits all vertices.

and sequencing errors, it might be impossible to find the optimal solution withina reasonable time.

An alternative data structure can be used to find Euler paths. An Euler path in a graph is a path that travels along every edge exactly once, but may pass through individual vertices of the graph more than once. An Euler path that begins and ends in the same vertex is an Euler circuit or tour. The redundant shotgun data can be converted into an Euler path problem by producing k-tuple data. This is done by extracting all overlapping k-tuples from all shotgun reads and then collapsing all edges with consistent data into single edges, so that only unique k-tuples remain. The graph is constructed on (k − 1)-tuples, so that each edge has a length of one: for each k-tuple f in S, an edge is directed from the vertex labeled with the left-most k − 1 bases of f to the vertex labeled with the right-most k − 1 bases of f. Edges may be weighted by, for example, the number of fragments associated with an edge. An example of such a directed graph, Geu, constructed for the sequence S with k = 3, is shown in figure 9.

The fragment assembly problem is now recast into the problem of finding an Euler path in the graph, and there are efficient algorithms for finding such paths. The reduction of sequencing by hybridization (SBH) to a question of finding Euler paths was originally developed by Pevzner (Pevzner, 1989), and the hybrid algorithm, based on SBH, was described by Idury and Waterman (Idury & Waterman, 1995). This problem is not NP-complete. However, this gain is partly paid for by some loss of information in the process of fragmenting the original input data into a number of k-tuples. There may be several Euler paths in one graph, as shown in figure 9, that start at the vertex labeled GT and result in the sequences shown in figure 10.
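The construction and traversal can be sketched in a few lines of Python (an illustration of the idea, not the Idury–Waterman algorithm itself): build one edge per k-tuple between its flanking (k − 1)-tuples, then recover an Euler path with Hierholzer's algorithm.

```python
from collections import defaultdict

def euler_path(adj, start):
    """Hierholzer's algorithm on a directed multigraph (adjacency lists)."""
    graph = {u: list(vs) for u, vs in adj.items()}
    stack, path = [start], []
    while stack:
        v = stack[-1]
        if graph.get(v):
            stack.append(graph[v].pop())   # follow an unused edge
        else:
            path.append(stack.pop())       # dead end: emit vertex
    return path[::-1]

reads = ["GTG", "TGA", "GAT", "ATG", "TGC", "GCT", "CTG", "TGG"]
adj = defaultdict(list)
for f in reads:                  # one edge per k-tuple, k = 3
    adj[f[:2]].append(f[1:])     # left (k-1)-tuple -> right (k-1)-tuple

path = euler_path(adj, "GT")     # GT is the only vertex with no in-edge
sequence = path[0] + "".join(v[-1] for v in path[1:])
print(sequence)                  # one of the two solutions of figure 10
```

Which of the two solutions is found depends on the (arbitrary) order in which unused edges are followed.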

One approach to obtaining the set of all Euler paths in a graph is based on the fact that all spanning trees can be enumerated and that there is a relationship between the set of all spanning trees and the set of all Euler paths in a graph (Even, 1979). Another approach is to transform any compatible string by a series of simple string operations, so that each string produced is itself compatible with the input data. This is based on a conjecture described by Ukkonen and proved by Pevzner (Ukkonen, 1992; Pevzner, 1994).

Figure 9: The graph Geu for the sequence S = GTGATGCTGG.

GTG         GTG
TGA         TGC
GAT         GCT
ATG         CTG
TGC         TGA
GCT         GAT
CTG         ATG
TGG         TGG
----------  ----------
1. GTGATGCTGG   2. GTGCTGATGG

Figure 10: Euler path solutions for the graph Geu.

The Euler path method produces single sequences as output. A multiple alignment is required to evaluate the underlying read data. This can be accomplished by aligning all the initial reads against the Euler sequence(s).

Idealized versions of problem models are rarely applicable in practice. Sequencing errors, in particular together with nearly identical repeats, are a complicating factor. Overlaps are based on exact sequence identity, so existing overlaps may be lost because the identities are distorted by sequencing errors. The input data may be trimmed to be of very high quality, yet this results in shorter read lengths. However, it is possible to try to correct the errors prior to the construction of the graph, as is the approach used in the EULER assembler (Pevzner et al., 2001).


3.9 Algorithms

The fragment reconstruction problem based on overlaps can in turn be divided into a number of sub-problems:

1. Discarding contamination, i.e. sequences from the vector and the host organism, as well as chimeric reads.

2. Determination of all overlaps between reads.

3. Computation of the layout of the contigs, i.e. the orientation and relative position of the reads in contigs.

4. Creation of multiple alignments of all reads.

5. Computation of the consensus sequences.

3.9.1 Contamination

Contaminating reads can easily be discarded by screening the reads against a database consisting of the host genome sequences, as well as the vector sequences. Since chimeric reads are composites, consisting of at least two sequences originating from different locations of the target genome, they can be recognized by their pattern of overlap with other reads. One such algorithm was described in 1996 and implemented in CAP2 (Huang, 1996).

3.9.2 Overlap Detection.

In order to determine all pairwise overlaps, an all-versus-all comparison of reads has to be performed. For example, 2000 reads would require (2 · 2000)² = 16,000,000 pair-wise comparisons, since each possible orientation is tested. The classical solution to this problem of approximate string matching is a dynamic programming method (Sellers, 1980). The disadvantage of this approach is the quadratic running time, O(nm), where n and m are the lengths of the reads. There are several approaches that improve the running time by only computing a part of the dynamic programming matrix, for example Ukkonen's algorithm, which reduces the time to O(nk), where k is the edit distance (Ukkonen, 1985). The performance can be further improved by pre-processing the database using indexing techniques, such as Patricia trees (Knuth, 1973), suffix trees (McCreight, 1976) and suffix arrays (Manber et al., 1990). These structures, which can be constructed in linear time, allow a query time proportional to the length of the query string and independent of the size of the database. The time to construct the database can be amortized by performing a large number of queries.
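The banded idea behind the O(nk) bound can be illustrated with a small sketch (a simplified illustration, not Ukkonen's exact algorithm): only cells within k diagonals of the main diagonal are filled in, which gives the exact distance whenever the true edit distance is at most k.

```python
def banded_edit_distance(a, b, k):
    """Edit distance computed only inside a diagonal band of half-width k.

    Cells outside the band are treated as unreachable, which gives the
    exact distance whenever the true distance is at most k.
    """
    INF = float("inf")
    n, m = len(a), len(b)
    if abs(n - m) > k:
        return None                        # distance necessarily exceeds k
    prev = [j if j <= k else INF for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [INF] * (m + 1)
        if i <= k:
            cur[0] = i
        for j in range(max(1, i - k), min(m, i + k) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost,   # substitution / match
                         prev[j] + 1,          # deletion from a
                         cur[j - 1] + 1)       # insertion into a
        prev = cur
    return prev[m] if prev[m] <= k else None
```

Only O(nk) cells are touched instead of the full O(nm) matrix, which is the point of the banded approach.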

There are many different approaches to solving the all-versus-all comparison problem (Gusfield, 1997; Chen & Skiena, 1997). One solution, used in the TIGR Assembler, was to locate all k-tuples shared between read pairs, similar to the initial processing in BLAST (Altschul et al., 1990). A more practical solution, that can be applied to shotgun reads, is to use a combination of suffix arrays and indexes built from the database sequences (Myers, 1995a). A suffix array can be constructed by indexing all instances of short words, typically 10 to 14 bases, that occur in the database. All instances of words are stored in an array. By assigning pointers to their locations in reads, the lists of suffixes in each array position can be sorted in lexicographical order. In this way, all similar reads will be grouped close to each other and it is possible to rapidly extract matching suffixes. The candidate alignments are created by dynamic programming, where only a band in the dynamic programming matrix is computed (Chao et al., 1992), or a variation of the Smith-Waterman algorithm (SWA) (Smith & Waterman, 1981) is used. This method is used in the Phrap fragment assembly program, and similar methods with some variation are used in many other assembly programs, e.g. in FALCON (Gryan, 1994) and ARACHNE (Batzoglou et al., 2002). Approaches based on hybridization and techniques based on the sampling of large numbers of segments, sub-words (Kim & Segre, 1999), have not yet been shown to be successful in practice, mainly due to the presence of ubiquitous repeats. Each of the pair-wise alignments is scored in order to evaluate the quality of overlaps. The total score is obtained by dynamic programming using a scoring scheme with a specified match score, and mismatch and gap penalties (Pearson & Lipman, 1988; Altschul et al., 1990), possibly together with some estimate of the total sequencing error, as in the assembly programs CAP and CAP2 (Huang, 1992; Huang, 1996). A more accurate approach is to utilize quality values, i.e. the probability that a base is erroneously called. Most fragment assembly programs that use error probabilities use quality values generated by Phred, which is the most widely used base calling program. Several programs use the error probabilities merely as weights for the match score and the mismatch and gap penalties, e.g. STROLL (Chen & Skiena, 2000) and CAP (Huang & Madan, 1999). Phrap uses Phred quality values in a likelihood model to discriminate false overlaps. Candidate overlaps are initially filtered by a minimum SWA score, and then all alignments are re-scored by computing a log likelihood ratio (LLR), which is used in the construction of contigs.
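The word-index seeding step can be sketched as follows (a hypothetical, simplified illustration: the function names are invented, real assemblers use words of 10 to 14 bases rather than the toy k = 4, and sorted suffix arrays rather than a plain hash index). Read pairs that share no word are never passed to the expensive alignment stage.

```python
from collections import defaultdict

def kmer_index(reads, k):
    """Map every k-word to its list of (read_id, offset) occurrences."""
    index = defaultdict(list)
    for rid, seq in enumerate(reads):
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((rid, i))
    return index

def candidate_pairs(reads, k=4):
    """Read pairs sharing at least one k-word: candidates for alignment."""
    pairs = set()
    for hits in kmer_index(reads, k).values():
        for x in range(len(hits)):
            for y in range(x + 1, len(hits)):
                if hits[x][0] != hits[y][0]:
                    pairs.add((hits[x][0], hits[y][0]))
    return pairs

reads = ["ACGTACGGA", "TACGGATTC", "GGGGCCCCAA"]
print(candidate_pairs(reads))  # → {(0, 1)}: reads 0 and 1 share e.g. TACG
```

The shared-word offsets also suggest which diagonal the subsequent banded alignment should be centered on.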

The pair-wise overlaps can be used to generate overlap graphs, where each read and its reverse complement are nodes and edges represent the overlaps between reads. Edges are weighted by some measure that reflects the strength of an overlap, such as pair-wise alignment scores combined with the lengths of clone inserts.

3.9.3 Contig Layout Determination.

The problem of creating the fragment layout in contigs, i.e. the problem of determining the relative order and offset of all reads, is based on the reconstruction model and is NP-complete (Gallant et al., 1980; Gallant, 1983; Kececioglu, 1991). Hence, all methods rely mostly on heuristics to combine reads. Several sophisticated methods to approximate the optimal solution have been described, for example, relaxation based on sorting overlaps by decreasing score or spanning forests (Kececioglu, 1991), methods based on simulated annealing (Burks et al., 1994), reduction to a greedy Eulerian tour (see section 3.8.3), using genetic algorithms (Parsons & E., 1995), or a series of reductions to an overlap graph that reduce the number of edges and vertices without changing the space of potential solutions (Myers, 1995b).

Most of the approximation algorithms perform well on non-repeated data, but have trouble with the determination of the correct layout in the presence of repeats. This is true in particular when repeats are highly similar and longer than the average read length. In order to avoid confusing assembly algorithms, several assembly strategies screen the input data set for repeats against a database prior to assembly, in order to assemble reads originating from non-repeated regions of the target sequence into a number of contigs. These contigs may then be ordered using clone length constraints given by the sequencing of both ends of clone inserts. This is the approach used in the Celera Assembler (Anson & Myers, 1999; Myers et al., 2000; Huson et al., 2002). This leaves most of the long repeated regions unordered. Other ways to detect reads originating from repeated regions of the target sequence are the computation of a probability of observed coverage, which is not a precise method, and the detection of conflicting paths in an overlap graph (Myers, 1995b).

Phrap uses the greedy method, in which the overlaps are first sorted by their LLR-scores and are joined together into contigs in descending LLR-score order. Additional information, such as coverage on both strands, is used to confirm read positions. Moreover, the layout may be broken at weak overlaps and then re-computed. The sequencing of both ends of an insert yields distance constraints that can be used both to confirm an assembly and to aid in the ordering of the pair-wise overlaps, as in e.g. the TIGR Assembler, CAP3 and ARACHNE, as well as in the Celera Assembler.

3.9.4 Creation of Multiple Alignments.

In order to get the final result, the consensus sequence, a multiple alignment of the reads in each contig has to be computed. There are several multiple alignment methods, which differ mostly in the criteria for selecting pairs of alignments and thereby the order in which pairs are merged (Gusfield, 1997). However, shotgun read fragments that are aligned into a contig are assumed to represent the same region of the target sequence. Therefore the global alignments are most commonly created by the simple approach of repeatedly aligning the next read with the current alignment, as for example in CAP3. This simple approach results in multiple alignments that are not locally optimized. In most cases, local deviations in the multiple alignments will only have a small effect on the quality of the computed consensus sequence. The multiple alignments can be locally optimized by an algorithm described by Anson and Myers, where each sequence is iteratively aligned to a consensus structure (Anson & Myers, 1997). After each round of alignments, the score is computed and compared to the previous score. The iterations are continued until no lower score can be obtained.

3.9.5 Computation of the Consensus Sequence.

The consensus sequence can be computed using simple majority voting. When quality values for bases are available, more sophisticated methods may be applied. The scheme used in CAP3 assumes that the errors are independently distributed between the strands. For each column of the alignment, the quality values for the base types are partitioned into two groups, one from each strand. The quality values in each group are sorted in decreasing order, and the highest quality value in each group is given the weight of 1.0 and the rest the weight of 0.5. The weighted sum of quality values is computed for each base type. The base with the largest sum is taken as the consensus base for the column, and the quality of the consensus base is the sum of the quality values for that base minus the sum of the quality values for the other bases.
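One reading of this scheme can be sketched as follows (an illustration under the stated weighting assumptions; the function name and input format are invented, and CAP3's actual implementation may differ in details):

```python
def column_consensus(calls):
    """calls: list of (base, quality, strand) tuples for one alignment column.

    Within each strand group, the highest quality value for a base type gets
    weight 1.0 and the remaining values weight 0.5, as described above.
    """
    weighted = {}
    for strand in ("+", "-"):
        for base in "ACGT":
            quals = sorted((q for b, q, s in calls
                            if b == base and s == strand), reverse=True)
            w = sum(q * (1.0 if i == 0 else 0.5) for i, q in enumerate(quals))
            weighted[base] = weighted.get(base, 0.0) + w
    best = max(weighted, key=weighted.get)
    # Consensus quality: winner's weighted sum minus the competing sums.
    quality = weighted[best] - sum(v for b, v in weighted.items() if b != best)
    return best, quality

column = [("A", 30, "+"), ("A", 20, "+"), ("A", 25, "-"), ("C", 10, "+")]
print(column_consensus(column))  # → ('A', 55.0)
```

Discrepant bases with real support on both strands thus depress the consensus quality, while redundant confirmations on one strand are down-weighted.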

Phrap uses an approach where an interval of the best quality read is chosen to represent the consensus sequence of that interval of the multiple alignment. A weighted directed graph is constructed, in which the nodes represent selected read positions, bidirectional edges of weight zero represent aligned positions in two overlapping reads, and unidirectional edges between positions in the same read are weighted by the sum of the quality values of the bases between the positions. The contig sequence is the path with the highest weight in the graph. Hence, the method used in Phrap does not yield a consensus sequence in the proper sense, but is instead a mosaic of the highest quality regions.

4 Shotgun Fragment Assembly Programs

4.1 Phrap

Several programs are needed to perform a complete assembly: Phred (Ewing et al., 1998; Ewing & Green, 1998), cross-match, RepeatMasker, Phrap (http://www.phrap.org) and Consed (Gordon et al., 1998). Phred reads the trace files produced by a sequencing machine, does base calling and assigns quality values, i.e. error rates for each base. These quality values reflect the estimated probability that a base is erroneously called. Cross-match is useful in finding vector sequences in reads.

The assembly is performed by Phrap, which uses the quality information provided by Phred (see section 3.5) to make pair-wise comparisons of reads. This method has the advantage that it can use almost the whole read length without any significant trimming, since low quality base calls can be ignored. Phrap makes use of the information provided by both DNA strands by assuming that the base calls are independent between the opposite strands. Phrap may confirm a base call by a call on the opposite strand, summing its score with the score of the confirming call. The same is true for calls performed using different chemistries. Phrap uses the Phred quality scores to discriminate putative repeats by likelihood ratios. If mismatches occur along the alignment at positions where base qualities are high, they are assumed to be due to repeats and the reads are not overlapped. If mismatches occur at low quality positions, they are assumed to be due to base calling errors.

RepeatMasker is a program developed to screen sequences for known repeats, to avoid false overlaps due to repeated regions in the sequenced genomic region. Detected repeats are masked and not used during the assembly process.

Phrap computes the final sequence, or consensus sequence, from the highest quality sections of the reads in assembled contigs. Consed is used to view the results of the assembly. Figure 11 shows a screenshot of Consed.

Figure 11: A screenshot of Consed windows: main window, contig view and trace viewer.

4.1.1 The Phrap Method

The hardest step in the assembly process is to determine the correct pattern of how all the reads align relative to each other (the layout). Phrap uses the greedy algorithm together with log likelihood ratios, LLR scores for overlaps, to determine the layout of reads. LLR scores are computed using Phred quality values, or error probabilities. The sums of LLR scores are then used to decide if discrepancies in overlaps are real, or if they are due to base-calling errors.

Candidate overlaps are obtained by dynamic programming and scored initially without quality values. The scores for these candidate overlaps are then re-computed using error probabilities. The log likelihood ratio H1/H0 is computed for each base pair in an overlap, and the score of the overlap is the sum of the LLR scores for each base pair. H1 is the probability of the observed data assuming the overlap is real; H0 is the probability of observing the data assuming the reads come from 95% similar repeats. The figure 95% is found to work well in practice.

The following assumptions are made:

1. Alignment positions are independent of each other

2. When a base-calling error occurs at a read position, the base calls in the two reads disagree at that position. (The correction required to avoid this assumption is very small.)

3. Errors in reads are independent of each other.


The base pairs in an alignment may match or mismatch. This yields two cases, if possible insertions and deletions are not counted.

Case 1: Bases match.

Prob(agree | H1(overlap)) = (1 − pi)(1 − pj)
Prob(agree | H0(repeat)) = (1 − pi)(1 − pj) · 0.95

LLR = log(1 / 0.95)

Case 2: Discrepancy between base calls.

Prob(discrep | H1(overlap)) = pi + pj − pi · pj
Prob(discrep | H0(repeat)) = 0.05 + 0.95(pi + pj − pi · pj)

LLR ≈ log((pi + pj − pi · pj) / (0.05 + 0.95(pi + pj − pi · pj)))

where pi is the probability that the base in sequence i is erroneous and pj is the probability that the base in sequence j is erroneous. See figure 12.

Figure 12: Two aligned reads i and j. Case 1: bases in the alignment match. Case 2: bases in the alignment mismatch.

4.1.2 Greedy Algorithm for Layout Construction

Once all overlaps between reads are computed, the layout, or the order of how the reads align relative to each other, can be determined. Phrap maintains the layout of reads in a graph, where the orientation and offset with respect to the start of a contig are stored. Vertices in this graph represent reads and edges represent overlaps. Thus, a contig is a connected component in the graph.

Reads are added to a contig using the greedy algorithm. Initially, each read forms a separate contig. Each read pair is then processed in decreasing LLR score order. Each time a read pair is chosen, a potential merge between different contigs is considered. No negatively scoring read pairs are allowed into a contig, and read pairs must have a consistent offset in the candidate contig.
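The core of the greedy step can be sketched with a union-find structure (a simplified illustration: it omits the orientation and offset-consistency checks that Phrap performs before accepting a merge):

```python
class DisjointSet:
    """Union-find over read ids; each set is one growing contig."""
    def __init__(self, n):
        self.parent = list(range(n))
    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x
    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False          # already in the same contig
        self.parent[rb] = ra
        return True

def greedy_layout(n_reads, scored_overlaps):
    """scored_overlaps: [(llr_score, read_a, read_b), ...].

    Process overlaps in decreasing score order; accept a merge only when
    the score is positive and the reads lie in different contigs.
    """
    contigs = DisjointSet(n_reads)
    accepted = []
    for score, a, b in sorted(scored_overlaps, reverse=True):
        if score > 0 and contigs.union(a, b):
            accepted.append((a, b))
    return accepted

overlaps = [(10.0, 0, 1), (8.0, 1, 2), (5.0, 0, 2), (-3.0, 2, 3)]
print(greedy_layout(4, overlaps))  # → [(0, 1), (1, 2)]; read 3 stays alone
```

The negatively scoring overlap to read 3 is rejected, so read 3 remains a singleton contig, mirroring the rule stated above.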

The greedy algorithm together with LLR scores will on average result in correct assemblies. Phrap enhances this result by breaking the layout in positions where reads have positive scores in alternative locations. This time an exhaustive computation is performed, i.e. all possible ways of joining the read fragments are considered and the highest scoring alternative is chosen.

4.1.3 Construction of Contig Sequences

The final sequence is not strictly a consensus sequence, since the contig sequence is constructed from the highest quality regions of the reads. The column positions in a contig are never evaluated. The main steps are as follows:

1. Construct a weighted directed graph:

- Nodes are selected read positions.

- Edges are of two types: 1) Bidirectional: these edges connect aligned positions in two overlapping reads; their weight is zero. 2) Unidirectional: these connect positions in the same read in the 3' to 5' direction; their weight is the sum of the Phred quality values for the bases between the positions.

2. Determine the contig sequence from the read segments that lie on the highest weight path through the graph. The highest weight path can be obtained using Tarjan's algorithm with modifications to deal with non-trivial cycles.
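Under the simplifying assumption that the cycles have already been collapsed, so that the remaining graph is a DAG whose nodes are available in topological order, the highest weight path step can be sketched as follows (names invented; Phrap itself relies on Tarjan-style machinery to handle the cycles):

```python
from collections import defaultdict

def highest_weight_path(edges, topo_order):
    """Highest-weight path in a weighted DAG, nodes in topological order.

    A simplification: Phrap's real graph also contains the zero-weight
    bidirectional edges described above, i.e. cycles, handled separately.
    """
    best = defaultdict(float)   # best path weight ending at each node
    back = {}                   # predecessor on that best path
    for u in topo_order:
        for v, w in edges.get(u, []):
            if best[u] + w > best[v]:
                best[v] = best[u] + w
                back[v] = u
    end = max(topo_order, key=lambda v: best[v])
    path = [end]
    while path[-1] in back:
        path.append(back[path[-1]])
    return path[::-1], best[end]

# Toy graph: two read segments ("a"->"b" and "x") competing to cover a region.
edges = {"a": [("b", 5.0)], "x": [("c", 4.0)], "b": [("c", 2.0)]}
print(highest_weight_path(edges, ["a", "x", "b", "c"]))  # → (['a', 'b', 'c'], 7.0)
```

The path of total quality 7.0 wins over the 4.0 alternative, so the contig sequence would be stitched from the read segments along it.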

References

Adams, M., Celniker, S., Holt, R., Evans, C., Gocayne, J., Amanatides, P., Scherer, S., Li, P., Hoskins, R., Galle, R., George, R., Lewis, S., Richards, S., Ashburner, M., Henderson, S., Sutton, G., Wortman, J., Yandell, M., Zhang, Q., Chen, L., Brandon, R., Rogers, Y., Blazej, R., Champe, M., Pfeiffer, B., Wan, K., Doyle, C., Baxter, E., Helt, G., Nelson, C., Gabor, G., Abril, J., Agbayani, A., An, H., Andrews-Pfannkoch, C., Baldwin, D., Ballew, R., Basu, A., Baxendale, J., Bayraktaroglu, L., Beasley, E., Beeson, K., Benos, P., Berman, B., Bhandari, D., Bolshakov, S., Borkova, D., Botchan, M., Bouck, J., Brokstein, P., Brottier, P., Burtis, K., Busam, D., Butler, H., Cadieu, E., Center, A., Chandra, I., Cherry, J., Cawley, S., Dahlke, C., Davenport, L., Davies, P., de Pablos, B., Delcher, A., Deng, Z., Mays, A., Dew, I., Dietz, S., Dodson, K., Doup, L., Downes, M., Dugan-Rocha, S., Dunkov, B., Dunn, P., Durbin, K., Evangelista, C., Ferraz, C., Ferriera, S., Fleischmann, W., Fosler, C., Gabrielian, A., Garg, N., Gelbart, W., Glasser, K., Glodek, A., Gong, F., Gorrell, J., Gu, Z., Guan, P., Harris, M., Harris, N., Harvey, D., Heiman, T., Hernandez, J., Houck, J., Hostin, D., Houston, K., Howland, T., Wei, M., Ibegwam, C., Jalali, M., Kalush, F., Karpen, G., Ke, Z., Kennison, J., Ketchum, K., Kimmel, B., Kodira, C., Kraft, C., Kravitz, S., Kulp, D., Lai, Z., Lasko, P., Lei, Y., Levitsky, A., Li, J., Li, Z., Liang, Y., Lin, X., Liu, X., Mattei, B., McIntosh, T., McLeod, M., McPherson, D., Merkulov, G., Milshina, N., Mobarry, C., Morris, J., Moshrefi, A., Mount, S., Moy, M., Murphy, B., Murphy, L., Muzny, D., Nelson, D., Nelson, D., Nelson, K., Nixon, K., Nusskern, D., Pacleb, J., Palazzolo, M., Pittman, G., Pan, S., Pollard, J., Puri, V., Reese, M., Reinert, K., Remington, K., Saunders, R., Scheeler, F., Shen, H., Shue, B., Siden-Kiamos, I., Simpson, M., Skupski, M., Smith, T., Spier, E., Spradling, A., Stapleton, M., Strong, R., Sun, E., Svirskas, R., Tector, C., Turner, R., Venter, E., Wang, A., Wang, X., Wang, Z., Wassarman, D., Weinstock, G., Weissenbach, J., Williams, S., Woodage, T., Worley, K., Wu, D., Yang, S., Yao, Q., Ye, J., Yeh, R., Zaveri, J., Zhan, M., Zhang, G., Zhao, Q., Zheng, L., Zheng, X., Zhong, F., Zhong, W., Zhou, X., Zhu, S., Zhu, X., Smith, H., Gibbs, R., Myers, E., Rubin, G. & Venter, J. (2000) The genome sequence of Drosophila melanogaster. Science, 287 (5461), 2185–95.

Adams, M., Fields, C. & Venter, J., eds (1994) Neural networks for automated basecalling of gel-based DNA sequencing ladders. Academic Press, San Diego, CA.

Allison, D., Thompson, J., Jacobson, K., Warmack, R. & Ferrel, T. (1990) Scanning tunneling microscopy and spectroscopy of plasmid DNA. Scanning Microscience, 4, 517–522.

Altschul, S., Gish, W., Miller, W., Myers, E. & Lipman, D. (1990) Basic local alignment search tool. J. Mol. Biol., 215 (3), 403–10.

Anderson, S. (1981) Shotgun DNA sequencing using cloned DNase I-generated fragments. Nucleic Acids Res., 9 (13), 3015–27.

Anderson, S., Bankier, A., Barrell, B., de Bruijn, M., Coulson, A., Drouin, J., Eperon, I., Nierlich, D., Roe, B., Sanger, F., Schreier, P., Smith, A., Staden, R. & Young, I. (1981) Sequence and organization of the human mitochondrial genome. Nature, 290 (5806), 457–65.

Anson, E. & Myers, E. (1997) ReAligner: a program for refining DNA sequence multialignments. J. Comp. Biol., 4 (3), 369–383.

Anson, E. & Myers, G. (1999) Algorithms for whole genome shotgun sequencing. In Proc. RECOMB '99 pp. 1–9, Lyon, France.

Armen, C. & Stein, C. (1994) A 2.75 approximation algorithm for the shortest superstring problem. In DIMACS Workshop on Sequencing and Mapping.

Armen, C. & Stein, C. (1996) A 2 2/3-approximation algorithm for the shortest superstring problem. In Combinatorial Pattern Matching pp. 87–101.

Baer, R., Bankier, A., Biggin, M., Deininger, P., Farrell, P., Gibson, T., Hatfull, G., Hudson, G., Satchwell, S., Séguin, C., Tuffnell, P. & Barrell, B. (1984) DNA sequence and expression of the B95-8 Epstein-Barr virus genome. Nature, 310 (5974), 207–11.

Batzoglou, S., Berger, B., Mesirov, J. & Lander, E. S. (1999) Sequencing a genome by walking with clone-end sequences: a mathematical analysis. Genome Research, 9 (12), 1163–74.


Batzoglou, S., Jaffe, D. B., Stanley, K., Butler, J., Gnerre, S., Mauceli, E., Berger, B., Mesirov, J. P. & Lander, E. S. (2002) ARACHNE: a whole-genome shotgun assembler. Genome Research, 12, 177–189.

Benzer, S. (1959) On the topology of the genetic fine structure. Proc. Natl. Acad. Sci., 45, 1607–1620.

Bernal, A., Ear, U. & Kyrpides, N. (2001) Genomes OnLine Database (GOLD): a monitor of genome projects world-wide. Nucleic Acids Res., 29 (1), 126–127.

Blattner, F. R., Plunkett III, G., Bloch, C. A., Perna, N. T., Burland, V., Riley, M., Collado-Vides, J., Glasner, J. D., Rode, C. K., Mayhew, G. F., Gregor, J., Davis, N. W., Kirkpatrick, H. A., Goeden, M. A., Rose, D. J., Mau, B. & Shao, Y. (1997) The complete genome sequence of Escherichia coli K-12. Science, 277 (5331), 1453–74.

Blum, A., Jiang, T., Li, M., Tromp, J. & Yannakakis, M. (1991) Linear approximation of shortest superstrings. In Proceedings of the 23rd ACM Symposium on Theory of Computing pp. 328–336.

Blum, A., Jiang, T., Li, M., Tromp, J. & Yannakakis, M. (1994) Linear approximation of shortest superstrings. J. of the ACM, 41, 634–47.

Bonfield, J., Smith, K. & Staden, R. (1995) A new DNA sequence assembly program. Nucleic Acids Res., 23 (24), 4992–9.

Bonfield, J. & Staden, R. (1995) The application of numerical estimates of base calling accuracy to DNA sequencing projects. Nucleic Acids Res., 23, 1406–1410.

Brenner, S., Johnson, M., Bridgham, J., Golda, G., Lloyd, D., Johnson, D., Luo, S., McCurdy, S., Foy, M., Ewan, M., Roth, R., George, D., Eletr, S., Albrecht, G., Vermaas, E., Williams, S., Moon, K., Burcham, T., Pallas, M., DuBridge, R., Kirchner, J., Fearon, K., Mao, J. & Corcoran, K. (2000a) Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays. Nat. Biotechnol., 18 (6), 630–4.

Brenner, S., Williams, S. R., Vermaas, E., Storck, T., Moon, K., McCollum, C., Mao, J., Luo, S., Kirchner, J., Eletr, S., DuBridge, R., Burcham, T. & Albrecht, G. (2000b) In vitro cloning of complex mixtures of DNA on microbeads: physical separation of differentially expressed cDNAs. Proc. Natl. Acad. Sci., 97 (4), 1665–70.

Burks, C., Engle, M., Forrest, S., Parsons, R., Soderlund, C. & Stolorz, P. (1994) Automated DNA Sequencing and Analysis. Academic Press, New York pp. 249–259.

C. elegans Sequencing Consortium (1998) Genome sequence of the nematode C. elegans: a platform for investigating biology. Science, 282, 2012–2018.

Chao, K., Pearson, W. & Miller, W. (1992) Aligning two sequences within a specified diagonal band. Comput. Appl. Biosci., 8 (5), 481–7.


Cheeseman, P., Kanefsky, B. & Taylor, W. M. (1991) Where the really hard problems are. In Proceedings of the Twelfth International Joint Conference on Artificial Intelligence, IJCAI-91, Sydney, Australia pp. 331–337.

Chen, E., Schlessinger, D. & Kere, J. (1993) Ordered shotgun sequencing, a strategy for integrated mapping and sequencing of YAC clones. Genomics, 17, 651–656.

Chen, T. & Skiena, S. (1997) Trie-based data structures for sequence assembly. In The Eighth Symposium on Combinatorial Pattern Matching pp. 206–223.

Chen, T. & Skiena, S. S. (2000) A case study in genome-level fragment assembly. Bioinformatics, 16 (6), 494–500.

Chen, W. & Hunkapiller, T. (1992) Sequence accuracy of large DNA sequencing projects. DNA Sequence – Journal of DNA Sequences and Mapping, 2, 335–342.

Churchill, G. & Waterman, M. (1992) The accuracy of DNA sequences: estimating sequence quality. Genomics, 14, 89–98.

Clarke, L. & Carbon, J. (1976) A colony bank containing synthetic ColE1 hybrid plasmids representative of the entire E. coli genome. Cell, 9, 91–101.

Cook, S. (1971) The complexity of theorem-proving procedures. In Proc. 3rd Ann. ACM Symp. on Theory of Computing pp. 151–158.

Dayhoff, M. (1964) Computer aids to protein sequence determination. J. Theor. Biol., 8, 97–112.

Dear, S. & Staden, R. (1991) A sequence assembly and editing program for efficient management of large projects. Nucleic Acids Res., 19, 3907–3911.

Deininger, P. (1983) Approaches to rapid DNA sequence analysis. Anal. Biochem., 135 (2), 247–63.

Drmanac, R., Labat, I., Brukner, I. & Crkvenjakov, R. (1989) Sequencing of megabase plus DNA by hybridisation: theory of the method. Genomics, 4, 114–128.

Edwards, A. & Caskey, C. (1990) Closure strategies for random DNA sequencing. Methods: A Companion to Methods in Enzymology, 3, 41–47.

Eichler, E. (1998) Masquerading repeats: paralogous pitfalls of the human genome. Genome Research, 8, 758–762.

Euler, L. (1736) Solutio problematis ad geometriam situs pertinentis. Commentarii Academiae Scientiarum Imperialis Petropolitanae, 8, 128–140.

Even, S. (1979) Graph Algorithms. Computer Science Press, Mill Valley, CA.

Ewing, B. & Green, P. (1998) Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Research, 8, 186–194.


Ewing, B., Hillier, L., Wendl, M. & Green, P. (1998) Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Research, 8, 175–185.

Fleischmann, R., Adams, M., White, O., Clayton, R., Kirkness, E. & Kerlavage, A. (1995) Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science, 269, 496–512.

Fuhrman, S., Deininger, P., LaPorte, P., Friedmann, T. & Geiduschek, E. (1981) Analysis of transcription of the human Alu family ubiquitous repeating element by eukaryotic RNA polymerase III. Nucleic Acids Res., 9 (23), 6439–56.

Gallant, J. (1983) The complexity of the overlap method for sequencing biopolymers. Journal of Theoretical Biology, 101, 1–17.

Gallant, J., Maier, D. & Storer, J. (1980) On finding minimal length superstrings. Journal of Computer and System Sciences, 20, 50–58.

Garey, M. R. & Johnson, D. S. (1979) Computers and Intractability: A Guide to the Theory of NP-Completeness. Freeman, New York.

Giddings, M., Brumley Jr., R., Haker, M. & Smith, L. (1993) An adaptive, object-oriented strategy for base calling in DNA sequence analysis. Nucleic Acids Res., 21, 4530–4540.

Giddings, M., Severin, J., Westphall, M., Wu, J. & Smith, L. (1998) A software system for data analysis in automated DNA sequencing. Nucleic Acids Res., 8, 644–665.

Gingeras, T., Milazzo, J., Sciaky, D. & Roberts, R. (1979) Computer programs for the assembly of DNA sequences. Nucleic Acids Res., 7, 529–545.

Goffeau, A., Barrell, B., Bussey, H., Davis, R., Dujon, B., Feldmann, H., Galibert, F., Hoheisel, J., Jacq, C. & Johnston, M. (1996) Life with 6000 genes. Science, 274 (546), 563–567.

Golden, J., Torgersen, D. & Tibbets, C. (1993) Pattern recognition for automated DNA sequencing, I: on-line signal conditioning and feature extraction for base-calling. In Proceedings of the First International Conference on Intelligent Systems for Molecular Biology, (Hunter, L. et al., eds), AAAI Press, Menlo Park, CA.

Gordon, D., Abajian, C. & Green, P. (1998) Consed: a graphical tool for sequence finishing. Genome Research, 8, 195–202.

Green, P. (1997) Against a whole-genome shotgun. Genome Research, 7, 410–417.

Gronenborn, B. & Messing, J. (1978) Methylation of single-stranded DNA in vitro introduces new restriction endonuclease cleavage sites. Nature, 272 (5651), 375–377.

Gryan, G. (1994) Faster sequence assembly software for megabase shotgun assemblies. In Genome Sequencing and Analysis Conference VI.


Gusfield, D. (1997) Algorithms on strings, trees, and sequences: computer scienceand computational biology. Press Syndicate of the University of Cambridge,New York, NY.

Holley, R. W., Apgar, J., Everett, G. A., Madison, J. T., Marquisee, M., Merrill,S. H., Penswick, J. R. & Zamir, A. (1965) Structure of a ribonucleic acid.Science, 147, 1462–1465.

Huang, X. (1992) A contig assembly program based on sensitive detection of frag-ment overlaps. Genomics, 14, 18–25.

Huang, X. (1996) An improved sequence assembly program. Genomics, 33, 21–31.

Huang, X. & Madan, A. (1999) CAP3: a DNA sequence assembly program.Genome Research, 9, 868–877.

Huson, D., Reinert, K. & Myers, E. (2002). The greedy path–merging algorithmfor sequence assembly. J. Comp. Biol. submitted.

Hutchinson, G. (1969) Evaluation of polymer sequence fragment data using graphtheory. Bulletin of Mathematical Biophysics, 31, 541–562.

Idury, R. & Waterman, M. (1995) A new algorithm for shotgun sequencing. J.Comp. Biol., 2, 291–306.

Karas, M. & Hillenkamp, F. (1988) Laser desorption ionisation of proteins withmolecular masses exceeding 10,000 daltons. Anal. Chem., 60, 2299–301.

Karp, R. (1972) Complexity of Computer Computations. Plenum Press, New Yorkpp. 85–103.

Kececioglu, J. (1991). Exact and approximate algorithms for DNA sequence recon-struction. Ph.D. thesis, technical report 91–26 Dept. of Computer Science, U.of Arizona, Tucson, AZ 85721.

Kececioglu, J. & Myers, E. (1995) Combinatorial algorithms for DNA sequenceassembly. Algorithmica, 13, 7–51.

Kent, W. J. & Haussler, D. (2001) Assembly of the working draft of the human genome with GigAssembler. Genome Research, 11, 1541–1548.

Khurshid, F. & Becks, S. (1993) Error analysis in manual and automated DNA sequencing. Analytical Biochemistry, 208, 138–143.

Kim, S. & Segre, A. (1999) AMASS: a structured pattern matching approach to shotgun sequence assembly. J. Comp. Biol., 6, 163–186.

Knuth, D. (1973) The Art of Computer Programming (Volume III): Sorting and Searching. Addison-Wesley, Reading, MA.

Kosaraju, S. R., Park, J. K. & Stein, C. (1994) Long tours and short superstrings. In Proceedings of the 35th Annual Symposium on Foundations of Computer Science (FOCS).

Kupfer, K., Smith, M., Quackenbush, J. & Evans, G. (1995) Physical mapping of complex genomes by sampled sequencing: a theoretical analysis. Genomics, 27, 90–100.

Lander, E. & Waterman, M. (1988) Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics, 2, 231–239.

Lawrence, C. & Solovyev, V. (1994) Assignment of position-specific error probability to primary DNA sequence data. Nucleic Acids Res., 22, 1272–1280.

Li, P., Kupfer, K., Davies, C., Burbee, D., Evans, G. & Garner, H. (1997) PRIMO: a primer design program that applies base quality statistics for automated large-scale DNA sequencing. Genomics, 40, 476–485.

Lipshutz, R., Taverner, F., Hennessy, K., Hartzell, G. & Davis, R. (1994) DNA sequence confidence estimation. Genomics, 19, 417–424.

Maier, D. & Storer, J. (1977) A note on the complexity of the superstring problem. Technical report 233, Dept. of Electrical Engineering and Computer Science, Princeton University.

Manber, U. & Myers, G. (1990) Suffix arrays: a new method for on-line string searches. In Proceedings of the 1st Annual ACM-SIAM Symposium on Discrete Algorithms pp. 319–327.

Marshall, E. (1999) A high stakes gamble on genome sequencing. Science, 284,1906–1909.

Maxam, A. & Gilbert, W. (1977) A new method for sequencing DNA. Proc. Natl. Acad. Sci., 74, 560–564.

McCreight, E. (1976) A space-economical suffix tree construction algorithm. J. of the ACM, 23 (2), 262–272.

Messing, J., Crea, R. & Seeburg, P. (1981) A system for shotgun DNA sequencing. Nucleic Acids Res., 9 (2), 309–321.

Miller, M. J. & Powell, J. I. (1994) A quantitative comparison of DNA sequence assembly programs. J. Comp. Biol., 1 (4), 257–269.

Mosimann, J., Shapiro, M., Merril, C., Bradley, D. & Vinton, J. (1966) Reconstruction of protein and nucleic acid sequences: IV. The algebra of free monoids and the fragmentation stratagem. Bull. Math. Biophysics, 28, 235–260.

Myers, E. (1995a) A sublinear algorithm for approximate keyword matching. Algorithmica, 12 (4–5), 345–374.

Myers, E. (1995b) Toward simplifying and accurately formulating fragment assembly. J. Comp. Biol., 2 (2), 275–290.

Myers, E., Sutton, G., Delcher, A., Dew, I., Fasulo, D., Flanigan, M., Kravitz, S., Mobarry, C., Reinert, K., Remington, K., Anson, E., Bolanos, R., Chou, H.-H., Jordan, C., Halpern, A., Lonardi, S., Beasley, E., Brandon, R., Chen, L., Dunn, P., Lai, Z., Liang, Y., Nusskern, D., Chan, M., Zhang, Q., Zheng, X., Rubin, G., Adams, M. & Venter, J. (2000) A whole-genome assembly of Drosophila. Science, 287, 2196–2204.

Biggs, N., Lloyd, E. K. & Wilson, R. (1976) Graph Theory 1736–1936. Clarendon Press, Oxford.

Naeve, C., Buck, G., Niece, R., Pon, R., Roberson, M. & Smith, A. (1995) Accuracy of automated DNA sequencing: multi-laboratory comparison of sequencing results. BioTechniques, 19, 448–453.

National Research Council (1988) Mapping and Sequencing the Human Genome. Natl. Academy Press, Washington DC.

Needleman, S. & Wunsch, C. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48, 443–453.

Nurminsky, D. & Hartl, D. (1996) Sequence scanning: a method for rapid sequence acquisition from large-fragment DNA clones. Proc. Natl. Acad. Sci., 93, 1694–1698.

Parsons, R. J. & Johnson, M. E. (1995) DNA fragment assembly and genetic algorithms: new results and puzzling insights. In Third International Conference on Intelligent Systems for Molecular Biology pp. 277–284, CA: AAAI Press.

Pearson, W. & Lipman, D. (1988) Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci., 85, 2444–2448.

Peltola, H., Soderlund, H., Tarhio, J. & Ukkonen, E. (1983) Algorithms for some string matching problems arising in molecular genetics. In Proceedings of the 9th IFIP World Computer Congress pp. 59–64.

Peltola, H., Soderlund, H. & Ukkonen, E. (1984) SEQAID: a DNA sequence assembly program based on a mathematical model. Nucleic Acids Res., 12, 307–321.

Pevzner, P. (1989) l-tuple DNA sequencing: computer analysis. J. Biomol. Structure Dynamics, 7, 63–73.

Pevzner, P. (1994) DNA physical mapping and alternating Eulerian cycles in colored graphs. Algorithmica, 12, 77–105.

Pevzner, P. A., Tang, H. & Waterman, M. S. (2001) An Eulerian path approach to DNA fragment assembly. Proc. Natl. Acad. Sci., 98 (17), 9748–9753.

P. falciparum Genome Sequencing Consortium (1996). http://plasmodb.org.

Richterich, P. (1998) Estimation of errors in "raw" DNA sequence: a validation study. Genome Research, 8, 251–259.

Roach, J., Boysen, C., Wang, K. & Hood, L. (1995) Pairwise end sequencing: a unified approach to genomic mapping and sequencing. Genomics, 26, 345–353.

Roach, J., Thorsson, V. & Siegel, A. (2000) Parking strategies for genome sequencing. Genome Research, 10 (7), 1020–1030.

Ronaghi, M., Uhlén, M. & Nyrén, P. (1998) A sequencing method based on real-time pyrophosphate. Science, 281 (5375), 363–365.

Sanger, F., Coulson, A., Hong, G., Hill, D. & Petersen, G. (1982) Nucleotide sequence of bacteriophage λ DNA. J. of Mol. Biol., 162.

Sanger, F., Nicklen, S. & Coulson, A. (1977) DNA sequencing with chain-terminating inhibitors. Proc. Natl. Acad. Sci., 74, 5463–5467.

Schbath, S. (1997) Coverage processes in physical mapping by anchoring random clones. J. Comp. Biol., 4 (1), 61–82.

Sellers, P. (1980) The theory and computations of evolutionary distances: pattern recognition. J. Algorithms, 1, 359–373.

Shapiro, M. (1967) An algorithm for reconstructing protein and RNA sequences. Journal of the Association for Computing Machinery, 14, 720–731.

Shapiro, M., Merril, C., Bradley, D. & Mosimann, J. (1965) Reconstruction of protein and nucleic acid sequences: alanine transfer ribonucleic acid. Science, 150 (12), 918–921.

Smetanic, Y. & Polozov, R. (1979) On the algorithms for determining the primary structure of biopolymers. Bulletin of Mathematical Biology, 41, 1–20.

Smith, L., Sanders, J., Kaiser, R., Hughes, P., Dodd, C., Connell, C., Heiner, C., Kent, S. & Hood, L. (1986) Fluorescence detection in automated DNA sequence analysis. Nature, 321 (6071), 674–679.

Smith, M., Holmsen, A., Wei, Y., Peterson, M. & Evans, G. (1994) Genomic sequence sampling: a strategy for high resolution sequence-based physical mapping of complex genomes. Nature Genet., 7, 40–47.

Smith, T. & Waterman, M. (1981) Identification of common molecular subsequences. Journal of Molecular Biology, 147, 195–197.

Staden, R. (1977) Sequence data handling by computer. Nucleic Acids Res., 4,4037–4051.

Staden, R. (1978) Further procedures for sequence analysis by computer. Nucleic Acids Res., 5, 1013–1015.

Staden, R. (1979) A strategy of DNA sequencing employing computer programs.Nucleic Acids Res., 6, 2601–2610.

Staden, R. (1980) A new computer method for the storage and manipulation ofDNA gel reading data. Nucleic Acids Res., 8, 3673–3694.

Staden, R. (1982a) Automation for the computer handling of gel reading data produced by the shotgun method of DNA sequencing. Nucleic Acids Res., 10, 4731–4751.

Staden, R. (1982b) An interactive graphics program for comparing and aligning nucleic acid and amino acid sequences. Nucleic Acids Res., 10, 2951–2961.

Sutton, G., White, O., Adams, M. & Kerlavage, A. (1995) TIGR assembler: a new tool for assembling large shotgun sequencing projects. Genome Science & Technology, 1, 9–19.

Sweedyk, E. (1995) A 2 1/2 approximation algorithm for shortest common superstring. Ph.D. thesis, Univ. of Calif., Berkeley, Dept. of Computer Science.

Tarhio, J. & Ukkonen, E. (1988) A greedy approximation algorithm for constructing shortest common superstrings. Theoretical Computer Science, 57, 131–145.

The Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature, 408, 796–815.

The Sanger Centre (1998) Toward a complete human genome sequence. Genome Research, 8 (11), 1097–1108.

Turner, J. (1989) Approximation algorithms for the shortest common superstring problem. Information and Computation, 83, 1–20.

Ukkonen, E. (1985) Finding approximate patterns in strings. J. Algorithms, 6,132–137.

Ukkonen, E. (1992) Approximate string matching with q-grams and maximal matches. Theor. Comp. Sci., 92, 191–211.

Venter, J. C., Smith, H. O. & Hood, L. (1996) A new strategy for genome sequencing. Nature, 381.

Walther, D., Bartha, G. & Morris, M. (2001) Basecalling with LifeTrace. Genome Research, 11, 875–888.

Waterman, M. S. (1996) Introduction to Computational Biology. Chapman & Hall,London, UK.

Waterston, R. H., Lander, E. S. & Sulston, J. E. (2002) On the sequencing of the human genome. Proc. Natl. Acad. Sci., 99 (6), 3712–3716.

Watson, J. (1990) The human genome project: past, present, and future. Science, 248 (4951), 44–49.

Watson, J. & Crick, F. (1953) Molecular structure of nucleic acids. A structure for deoxyribose nucleic acid. Nature, 171, 737–738.

Weber, J. L. & Myers, E. W. (1997) Human whole-genome shotgun sequencing. Genome Research, 7, 401–409.

Wendl, M. C., Marra, M. A., Hillier, L. W., Chinwalla, A. T., Wilson, R. K. & Waterston, R. (2001) Theories and applications for sequencing randomly selected clones. Genome Research, 11, 274–280.
