Genome Sequencing

Bacteriophage φX174, the first genome to be sequenced, is a viral genome with only 5,386 base pairs (bp). Fred Sanger invented "shotgun" sequencing, a strategy in which random pieces of DNA from the target genome are isolated, cloned, and sequenced individually. The resulting sequence reads are assembled in silico by their overlapping regions to form contiguous sequences (otherwise known as contigs). The final step uses custom primers to close the gaps between contigs ("walking"), giving the completely sequenced genome. Five years later, Sanger used "shotgun" sequencing to complete the sequence of bacteriophage λ, which was significantly larger (about 48 Kbp). This method allowed sequencing projects to proceed at a much faster rate, expanding the scope of realistic sequencing ventures. Since then, other viral and organellar genomes have been sequenced using similar techniques.

The success of viral genome sequencing stemmed from the small size of viral genomes. In 1989, André Goffeau set up a European consortium to sequence the much larger genome of Saccharomyces cerevisiae (12.5 Mb).

Human Genome Project

The U.S. Human Genome Project (HGP) is a joint effort of the Department of Energy and the National Institutes of Health, designed as a three-step program to produce genetic maps, physical maps, and finally the complete nucleotide sequence map of the human chromosomes. In the wake of its announcement, three projects were started to elucidate the sequences of smaller model organisms similar to S. cerevisiae in their academic utility: Escherichia coli, Mycoplasma capricolum, and Caenorhabditis elegans. It was hoped that these projects would increase the efficiency of sequencing, but unfortunately they fell short of this goal.

The original goal was to finish by September 30, 2005.

A team headed by J. Craig Venter from the Institute for Genomic Research (TIGR) and Hamilton Smith of Johns Hopkins University sequenced a 1.8 Mb bacterium, Haemophilus influenzae, with new computational methods developed at TIGR. Previous sequencing projects had been limited by the lack of adequate computational approaches to assemble the large number of random sequences produced by "shotgun" sequencing. Venter's team took a more comprehensive approach by "shotgunning" the entire H. influenzae genome. Previously, such an approach would have failed because software did not exist to assemble the information accurately. The TIGR Assembler was up to the task, reassembling approximately 24,000 DNA fragments. The TIGR Assembler software required approximately 30 hours of central processing unit time on a SPARCenter 2000 containing half a gigabyte of RAM. Venter's H. influenzae project failed to win NIH funding because the panel did not believe that this approach could sequence a 1.8 Mb genome accurately. Venter proved everyone wrong and succeeded in sequencing the genome in 13 months at a cost of 50 cents per base, half the cost of conventional sequencing and much faster.

Map Based Sequencing

Two alternative methods were used to sequence the human genome. The BAC-to-BAC method, employed by the DOE- and NIH-funded HGP, is slow because it depends on mapping the genome to be sequenced and obtaining sets of partially ordered, overlapping BACs. Also referred to as the map-based method, it was developed from procedures used in individual labs in the late 1980s and 1990s.

BAC to BAC Sequencing

The BAC to BAC approach first creates a crude physical map of the whole genome before sequencing the DNA. Constructing a map requires cutting the chromosomes into large pieces and figuring out the order of these big chunks of DNA before taking a closer look and sequencing all the fragments.

VENTER’S SHOTGUN

Whole genome shotgun sequencing is a much faster approach and enabled researchers to speed up the timetable for sequencing enormously. The whole genome shotgun method was developed by J. Craig Venter and his associates in 1996, when he was at the Institute for Genomic Research (TIGR).

Whole Genome Shotgun Sequencing

Shotgun sequencing is based on the strategy of sequencing a large number of fragments of a chromosome without reference to a physical map. It is much faster, but it places heavy demands on the ability to align the overlapping fragments and reconstruct the chromosome. The people in charge of the original sequencing project did not think it could work.

BAC to BAC Sequencing vs. Whole Genome Shotgun Sequencing

© 2002 The Center for the Advancement of Genomics (TCAG).

BAC to BAC: each 150,000 bp fragment of the genome is inserted into a BAC (bacterial artificial chromosome). A BAC can replicate inside a bacterial cell. A set of BACs containing an entire human genome is called a BAC library.

Whole genome shotgun: multiple copies of the genome are randomly broken into 10,000 bp long segments by squeezing the DNA through a syringe. This is done a second time to generate pieces that are 2,000 bp long. Each 2,000 and 10,000 bp fragment is inserted into a plasmid, which is a piece of DNA that can replicate in bacteria. The two collections of plasmids containing 2 and 10 Kbp chunks of human DNA are known as plasmid libraries.



MAP BASED

The 150 Kbp fragments are fingerprinted to give each piece a unique identification tag that is used to determine the order of the fragments. Fingerprinting involves cutting each BAC fragment with a single enzyme and finding common sequence landmarks in overlapping fragments that determine the location of each BAC along the chromosome. Overlapping BACs with markers every 100,000 bp then form a map of each chromosome.
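The fingerprint comparison described above can be sketched in a few lines of Python. This is only a toy illustration, not the procedure actually used by the HGP: the recognition site GAATTC, the function names, and the made-up sequences are all assumptions, and real fingerprinting compares fragment sizes measured on a gel (with some tolerance) rather than exact lengths.

```python
from collections import Counter

def digest(seq, site="GAATTC", cut_offset=1):
    """Return the fragment lengths produced by cutting seq at every site.

    The site and cut offset are illustrative assumptions.
    """
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = seq.find(site, pos + 1)
    bounds = [0] + cuts + [len(seq)]
    return [end - start for start, end in zip(bounds, bounds[1:])]

def shared_fragments(fingerprint_a, fingerprint_b):
    """Count fragment sizes present in both fingerprints (multiset intersection)."""
    return sum((Counter(fingerprint_a) & Counter(fingerprint_b)).values())

# A made-up 60 bp "chromosome" with four cut sites.
SITE = "GAATTC"
chromosome = "A" * 5 + SITE + "C" * 8 + SITE + "T" * 12 + SITE + "A" * 7 + SITE + "C" * 4

clone_a = chromosome[0:45]    # overlaps clone_b
clone_b = chromosome[15:60]
clone_c = chromosome[0:25]    # does not overlap clone_d
clone_d = chromosome[36:60]

fp_a, fp_b = digest(clone_a), digest(clone_b)
fp_c, fp_d = digest(clone_c), digest(clone_d)

print("a vs b shared fragments:", shared_fragments(fp_a, fp_b))
print("c vs d shared fragments:", shared_fragments(fp_c, fp_d))
```

Clones that overlap share the internal fragment lying between two cut sites in the overlapping region, which is the signal used to place them next to each other on the map.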

http://www.genomenewsnetwork.org/articles/06_00/sixth.shtml


Whole Genome Shotgun Sequencing

This mapping step is not needed in shotgun sequencing.






Sequence all fragments (both methods)

~500 bp from both ends
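Reading ~500 bp from both ends of a cloned fragment can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and the toy insert are hypothetical, and the second read is reverse-complemented on the assumption that it is read from the opposite strand of the double-stranded insert.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def end_reads(insert, read_len=500):
    """Return (forward, reverse) reads taken from the two ends of an insert."""
    forward = insert[:read_len]
    reverse = reverse_complement(insert[-read_len:])
    return forward, reverse

def unsequenced_gap(insert_len, read_len=500):
    """Number of bases in the middle of the insert that neither end read covers."""
    return max(0, insert_len - 2 * read_len)

# Toy 20 bp insert with 5 bp reads so the result is easy to inspect.
toy_insert = "ACGTTGCAAGGCTTAACCGG"
forward, reverse = end_reads(toy_insert, read_len=5)
print(forward, reverse)

# With the fragment sizes quoted in the text and ~500 bp reads:
print(unsequenced_gap(2000), unsequenced_gap(10000))
```

For a 2,000 bp insert the two end reads leave about 1,000 bp unread in the middle, and for a 10,000 bp insert about 9,000 bp. The reads are still useful because their approximate separation is known from the insert size.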



Base-called with Phred, then assembled with Phrap

Phred

The phred software reads DNA sequencing trace files, calls bases, and assigns a quality value to each base. The quality value is a log-transformed error probability, specifically

Q = -10 log10(Pe)

where Q and Pe are respectively the quality value and error probability of a particular base call.

Phred can use the quality values to perform sequence trimming. Phred works well with trace files from most manufacturers' sequencing machines. Phred is distributed as C source code, so a C compiler is needed to build it. See the phred documentation and the note "Phred Quality Values and ABI 3700 Data" for additional information.

References:
Ewing B, Hillier L, Wendl M, Green P: Basecalling of automated sequencer traces using phred. I. Accuracy assessment. Genome Research 8:175-185 (1998).
Ewing B, Green P: Basecalling of automated sequencer traces using phred. II. Error probabilities. Genome Research 8:186-194 (1998).
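The relationship between quality value and error probability can be checked with a short Python sketch. The two conversion functions follow the formula Q = -10 log10(Pe) directly. The trimming function is only a hypothetical threshold rule added for illustration; it is not phred's own trimming algorithm, and the cutoff of 20 is an assumption.

```python
import math

def phred_quality(p_error):
    """Quality value Q for a base call with error probability p_error."""
    return -10 * math.log10(p_error)

def error_probability(q):
    """Inverse of phred_quality: error probability for quality value q."""
    return 10 ** (-q / 10)

def trim_ends(qualities, min_q=20):
    """Return (start, end) indices after dropping low-quality bases at both ends.

    Hypothetical rule for illustration only, not phred's trimming method.
    """
    start, end = 0, len(qualities)
    while start < end and qualities[start] < min_q:
        start += 1
    while end > start and qualities[end - 1] < min_q:
        end -= 1
    return start, end

print(phred_quality(0.001))      # an error rate of 1 in 1,000
print(error_probability(20))     # a quality value of 20

read_qualities = [5, 10, 25, 30, 40, 12, 35, 8]
print(trim_ends(read_qualities))
```

Each step of 10 in Q corresponds to a tenfold drop in error probability, so Q = 20 means one expected error in 100 calls and Q = 30 means one in 1,000.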

http://www.phrap.org/phredphrap/phred.html

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=9521922

Phrap

phrap is a program for assembling shotgun DNA sequence data. Among other features, it allows use of the entire read and not just the trimmed high-quality part; it uses a combination of user-supplied and internally computed data quality information to improve assembly accuracy in the presence of repeats; it constructs the contig sequence as a mosaic of the highest-quality read segments rather than a consensus; it provides extensive assembly information to assist in troubleshooting assembly problems; and it handles large datasets. See the phrap/cross_match/swat documentation and phrap documentation for additional information.

http://www.phrap.org/phredphrap/general.html


http://www.phrap.org/phredphrap/phrap.html


Swat is a program for searching one or more DNA or protein query sequences, or a query profile, against a sequence database, using an efficient implementation of the Smith-Waterman or Needleman-Wunsch algorithms with linear (affine) gap penalties. For each match, an empirical measure of statistical significance derived from the observed score distribution is computed. See the phrap/cross_match/swat documentation and swat documentation for additional information.

cross_match is a general-purpose utility for comparing any two DNA sequence sets using a "banded" version of swat. For example, it can be used to compare a set of reads to a set of vector sequences and produce vector-masked versions of the reads, a set of cDNA sequences to a set of cosmids, contig sequences found by two alternative assembly procedures (for example, phrap and xbap) to each other, or phrap contigs to the final edited cosmid sequence. It is slower but more sensitive than BLAST. See the phrap/cross_match/swat documentation and phrap documentation for additional information.
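To make the Smith-Waterman algorithm with affine gap penalties concrete, here is a minimal Python sketch that computes only the best local alignment score. It is not swat's implementation: the function name and the scoring values (match 2, mismatch -1, gap open -3, gap extend -1) are arbitrary assumptions, and no banding or significance estimate is included.

```python
def smith_waterman_affine(a, b, match=2, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Best local alignment score of a and b with affine gap penalties.

    A gap of length k scores gap_open + (k - 1) * gap_extend.
    Scoring values are illustrative assumptions.
    """
    neg_inf = float("-inf")
    n, m = len(a), len(b)
    # H: best score of an alignment ending at (i, j); floored at 0 (local).
    # E: best score ending with a gap that consumes a character of b.
    # F: best score ending with a gap that consumes a character of a.
    H = [[0] * (m + 1) for _ in range(n + 1)]
    E = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    F = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i][j] = max(H[i][j - 1] + gap_open, E[i][j - 1] + gap_extend)
            F[i][j] = max(H[i - 1][j] + gap_open, F[i - 1][j] + gap_extend)
            diagonal = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diagonal, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best

print(smith_waterman_affine("ACGT", "ACGT"))            # identical sequences
print(smith_waterman_affine("ACGTACGT", "ACGTTTACGT"))  # one 2-base gap needed
print(smith_waterman_affine("AAAA", "TTTT"))            # nothing in common
```

With these values, aligning ACGTACGT against ACGTTTACGT pairs all eight bases (8 x 2 = 16) and pays for one gap of length two (-3 + -1 = -4). Under an affine scheme a single long gap costs less than several short ones of the same total length.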



Consed/Autofinish

Consed/Autofinish is a tool for viewing, editing, and finishing sequence assemblies created with phrap. Finishing capabilities include allowing the user to pick primers and templates, suggesting additional sequencing reactions to perform, and facilitating checking the accuracy of the assembly using digest and forward/reverse pair information. See the consed page for additional information.

References:
Gordon, David. "Viewing and Editing Assembled Sequences Using Consed", in Current Protocols in Bioinformatics, A. D. Baxevanis and D. B. Davison, eds, New York: John Wiley & Co., 2004, 11.2.1-11.2.43.

TIGR Assembler (Graham Sutton)

The TIGR Assembler is a sequence fragment assembly program that builds contigs from small sequence reads. It is versatile, offering a wide variety of options for tuning the assembly process and analyzing sequence data. The current assembly engine uses a greedy algorithm and heuristics to build contigs, find repeat regions, and target alignment regions. Sequence overlaps are detected and scored using a 32-mer hash. Sequence alignment and merging are done using a Smith-Waterman algorithm. Gap penalties and score values corresponding to the bases and their quality values are predefined and hard-coded into the program.
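The general idea of greedy assembly with a k-mer hash can be sketched in Python as below. This is a toy under loud assumptions, not the TIGR Assembler itself: it uses k = 4 instead of 32-mers, requires exact suffix-prefix matches instead of Smith-Waterman alignment, ignores quality values and repeats, and all function names are hypothetical.

```python
from collections import defaultdict

def suffix_prefix_overlap(a, b, min_len):
    """Length of the longest suffix of a equal to a prefix of b, or 0 if < min_len."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0

def candidate_pairs(reads, k):
    """Use a k-mer hash table to find ordered read pairs sharing at least one k-mer."""
    index = defaultdict(set)
    for read_id, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            index[read[pos:pos + k]].add(read_id)
    pairs = set()
    for read_ids in index.values():
        for i in read_ids:
            for j in read_ids:
                if i != j:
                    pairs.add((i, j))
    return pairs

def greedy_assemble(reads, k=4, min_overlap=4):
    """Repeatedly merge the pair with the longest overlap until none remains.

    Assumes min_overlap >= k so the k-mer filter cannot miss a valid overlap.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, j in sorted(candidate_pairs(contigs, k)):
            length = suffix_prefix_overlap(contigs[i], contigs[j], min_overlap)
            if length > best_len:
                best_len, best_i, best_j = length, i, j
        if best_len == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)

# Made-up 22 bp "genome" cut into three overlapping reads, given out of order.
genome = "ATGGCGTACCTTAGGCAATCGG"
reads = [genome[12:22], genome[0:10], genome[6:16]]
print(greedy_assemble(reads))
```

The hash table limits the expensive overlap check to reads that already share a k-mer, which is why this style of indexing matters when tens of thousands of fragments must be compared. A greedy merge can be misled by repeated sequence, which a toy like this does nothing to detect.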

INPUT: .seq files from called sequence reads

OUTPUT: set of assemblies formed by alignment of overlapped fragments
