Shotgun Sequencing (SGS)

Download Shotgun Sequencing (SGS)

Post on 11-Apr-2015




0 download

Embed Size (px)


Shotgun Sequencing (SGS) Jun Yu, Gane Ka-Shu Wong, Jian Wang, and Huanming Yang Beijing Genomics Institute, Chinese Academy of Sciences, Beijing, China James D. Watson Institute of Genome Sciences, Zhejiang University, Hangzhou, China 1. Introduction 3 1.1 A Brief History of DNA Sequencing 1.2 Overview of SGS Methodology and Application 2. Essentials for SGS 10

2.1 Unbiased Clone Coverage 2.2 Sequence Quality Control 2.3 Repetitive Contents 2.4 Physical and Sequence Gaps 2.5 Validation of Sequence Assembly 2.6 Integration of Genomic Data 3. Technical and Experimental Basics 14 3.1 Genomic DNA Libraries 3.1.1 Source DNA 3.1.2 Quality Assessment 3.2 Sequencing Data Acquisition 3.2.1 WG-SGS and CBC-SGC 3.2.2 Directed Cloning and Random SGS 3.2.3 Process and Data Management 3.3 Sequence Assembly 3.3.1 Overview of General Procedures 3.3.2 Classification of Repetitive Sequences 3.3.3 Software Packages for Sequence Assembly 3.3.4. Building Contigs and Scaffolds 3.4 Sequence Finishing 3.4.1 Improving Data Quality 3.4.2 Primer-walking and End-sequencing 3.4.3 Closing Sequence Gaps 3.4.4. Physical Mapping and Clone-fingerprinting 3.5 Genome Annotations and Analysis 3.5.1 Genome Annotation 3.5.2 Compositional and Structural Dynamics 3.5.3 Sequence Variations 3.5.4 Genome Evolution 4 Other Useful Resources for SGS


4.1 ESTs and Full-length cDNAs 4.2 Physical and Genetic Maps 4.3 Comparative Genome Analysis 5 Perspectives 28

5.1 Large Genomes 5.2 Genomes with High Repetitive Content 5.3 Polyploidy: Ancient and New 5.4 Mixtures of Genomes: Metagenomes Bibliography Keywords Shotgun Sequencing A DNA sequencing technique used to achieve sequence contiguity by acquiring random sequences from clone libraries made of the target DNA. Physical Map DNA sequences or chromosome segments delineated by a set of ordered and oriented markers or fingerprinted clones. Large-insert Clones Clones made as cosmids, fosmids, BACs or YACs, with an insert size around or over 40 Kb. Sequence Contig A contiguous sequence assembly composed of overlapping sequences. Scaffold A group of ordered and oriented contigs with physical but without sequence contiguity. Minimal-Tiling-Path Clones Physically mapped clones cover their source DNA molecule with a single path and least amount of overlaps. Sequence Gap Gaps not covered by sequences in a sequence assembly. Physical Gap Gaps not covered by available clones in a clone assembly. 30

Abstract Shotgun sequencing (SGS) is primarily a large-scale sequencing (LSS) technique that does not rely on precise guiding information about the target DNA that includes large-insert clones and single genomes, ranging from thousands of basepairs (Kb) to billions of basepairs. A mixture of smaller genomes can also be sequenced in a similar way when retrospective means are available to assemble and distinguish them. It provides a fast and cost-effective way for sequencing large genomes regardless whether the project as a whole takes whole-genome (WG) or clone-by-clone (CBC) approach. The success of SGS essentially depends on random sampling, high-quality data, sufficient sequence coverage, effectiveassembly, and gap closing procedures. SGS has already demonstrated its tremendous power

not only in sequencing microbial genomes but also large eukaryotic genomes, such as those of the cultivated rice and the laboratory mouse. Together with improved sequencing technologies and computing tools, SGS is expected to play a central role in the field of genomics, especially when it has to constantly face some of the major challenges: scientific, managerial, and political. Some of the scientific challenges involve in effectively sequencing large, polyploid, and mixed genomes, as well as those with highly repetitive sequence contents. A key managerial challenge is for the operators to consistently produce high-quality data while increasing throughput and reducing cost. It is always a tough decision to chooseWG or CBC for a steering committee organizing a genome project but SGS is always the

basic method of choice. 1 Introduction DNA sequencing has become one of the essential laboratory methods for molecular biology, together with DNA cloning, DNA microarray, polymerase-chain-reaction (PCR), electrophoresis, chromatography and centrifugation. It has become not only a fundamental method for determining genome sequences but also a basic tool for identifying DNA methylation sites and sequence polymorphisms. In both scales and scopes, DNA sequencing dwarfs all other molecular technologies in acquisition of basic genomic information, paving a way for other genome-scale, technology-oriented disciplines or, in contemporary terminology, the numerous -omics (transcriptomics and proteomics being well-accepted examples) leading biologists into the future decades, providing new data, ample information, fresh knowledge, and innovative insights. At the leading edge of genomics, principal ideas of DNA sequencing are not fundamentally different from technical innovations demonstrated in the early days of technology development for molecular biology. When a cloned DNA molecule is larger than one (from one end) or two (from both ends) sequencing readouts (often referred as sequencing reads; screenshots of a sequence trace and a sequence alignment are shown in Figure 1) from an electrophoresis-based instrument, its full-length sequence has to be obtained with one of the following two ways: (1) a step-wise primer-directed walking strategy, based on newly-acquired sequences; (2) a parallel subcloning strategy, based on directed or shotgun cloning techniques by breaking the target DNA into smaller pieces so

that a collection of sequencing reads from overlapping subclones sufficiently covers the entire target length (Figure 2). Obviously, the choice of approaches depends on capability and throughput of a sequencing operation. One is managed serially thus limited in scale, and the other is paralleling apt for high-throughput, large-scale operations. Since current capability of a typical sequencing instrument allows to acquire sequence of a single run in a length of 500-2,000 basepairs (bp), aside from instrumental solutions (such as parallelization and incremental increase of read length), the choice of approaches for any large-scale sequencing (LSS) project has to take advantage of SGS for either large-insert clones over tens of kilobasepairs (Kb) or genomes in a size range from a few kilobases (such as large viral genomes) to a few magabasepairs (Mb; such as bacterial genomes), and even a few hundred to thousand Mb in length (such as higher plants and animals). SGS has been successfully applied to sequence many large genomes at two essential scales, targeting clones or genomes. Large genome sequencing projects are often initiated by large-insert clone-based physical mapping, especially for important model organisms from representative taxonomic groups; the Human Genome Project (HGP) is the most famous example. These physical mapping techniques include YAC (yeast artificial chromosome)-based or BAC (bacterium artificial chromosome)-based STS (sequence-tagged site) mapping, and restriction-digest mapping with different types of large-insert clones. These clones are subsequently sequenced by SGS method, guided by a collection of clones that cover the targeted region with least overlaps, often termed as minimal tiling path (MTP). Physical maps providing anchored clones for sequencing are often referred as sequence-ready maps. In a map-directed process, sequence data are captured at clone level and merged after data from overlapping clones are completed individually. Although SGS is utilized at clone level, such an approach is often referred as a clone-by-clone SGS or CBC-SGS, as opposed to a whole genome SGS, or WG-SGS. In the second scenario, when sufficient information (including estimated genome size, number of chromosomes, GC-content, repeat contents, and genetic homogeneity) about a genome has been acquired from a closely related or the same organism, WG-SGS approach becomes feasible and often the method of choice. In most of the cases, SGS has its sweet range in sequencing genomes, from a few Mb to a few hundred Mb. However, the quality and completeness of a final assembly habitually depends on repetitive sequence content, clonability, and complexity of the genome, coupled with other factors such as funding, project management, and homogeneity of the source DNA (sequence variations among individuals that donate DNA sometimes are known to withhold a quality sequence assembly for months or even years). Before long, SGS will have to face even greater challenges, being applied to genomes in a size range around thousands of Mb, polyploid genomes, and even a mixture of genomes, and novel approaches, techniques, and computing tools will have to be gradually implemented, making SGS a more efficient and robust technology. Two major driving forces are of significance: ever-increasing desires for more sequence information and incremental cost reductions evidenced in history of genomics over the past decade. 1.1 A Brief History of DNA Sequencing

The concept and technology of DNA sequencing have been evolving over the past thirty years. The initial phase of DNA sequencing was born with development of the basic technology, fundamentally a series of technology advancements in DNA sequencing and molecular cloning around early 1970s. A starting point of the second phase began with proposals of the Human Genome Project in 1983 and 1984. The following decade, before LSS of HGP had launched in 1995, is the most critical period for special technology and concept development; most notably are those of physical mapping and DNA sequencing. Thepast ten years has welcomed the Large-scale Sequencing Era, which is undoubtedly going to

last at least a decade or even longer. In this Era, together with add