

www.pacb.com

Application Note Low DNA Input

Low DNA Input Workflow Considerations for De Novo Genome Assembly

Introduction

Obtaining plant and animal genomes with the highest accuracy and contiguity is extremely important when exploring the functional impact of genetic diversity. A comprehensive view of the genome provides power to capture undetected SNVs, fully intact genes, and regulatory regions embedded in complex structures that fragmented draft genomes often miss. Single Molecule, Real-Time (SMRT®) Sequencing has become the gold standard for easy and affordable generation of high-quality de novo genome assemblies of even the most complex plant and animal genomes.

With our new low DNA input procedure, it is now possible to generate high-quality genome assemblies from as-low-as 150 ng of input genomic DNA (gDNA). This low-input approach puts PacBio® genome assemblies in reach for small, highly heterozygous organisms that comprise much of the diversity of life.

The low DNA input workflow supports sequencing and assembly of genomes up to 300 Mb from only 150 ng of input gDNA with a modified SMRTbell® library construction protocol that eliminates DNA shearing and size selection steps. Because genomes and gDNA samples vary in genetic complexity and quality, this note provides general recommendations, practical guidance, and workflow considerations for using the low DNA input protocol.

Figure 1 – Overview of low DNA input workflow for genome assembly on the Sequel System.


Experimental Design Considerations

We recommend considering the genome assembly project as a whole, from DNA extraction to bioinformatics, to establish your experimental design.

While the ability to generate high-quality de novo genome assemblies from as little as 150 ng of gDNA opens up many opportunities, it is important to note that with such low DNA input there are some constraints on sample preparation when designing a project.

Sample preparation constraints:

• The DNA size selection step frequently used with traditional genome projects may be eliminated, making the quality (size and integrity) of the input gDNA the primary factor in predicting the success of a project.

• The size of the genome to be sequenced and assembled is limited by the amount of gDNA input into the experiment. For genomes larger than 300 Mb, it may be necessary to increase the amount of gDNA input in proportion to the genome size to obtain the required data amount for genome assembly.
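
The proportional scaling described above can be sketched as a small planning helper. This is only an illustration: the function name is hypothetical, and strictly linear scaling from the 150 ng per 300 Mb figure is an assumption rather than a validated rule (the rice example in this note used 150 ng for a ~387 Mb genome).

```python
BASE_INPUT_NG = 150.0    # minimum gDNA input stated in this note
BASE_GENOME_MB = 300.0   # genome size supported by that input


def recommended_input_ng(genome_size_mb: float) -> float:
    """Estimate gDNA input, assuming linear scaling above 300 Mb.

    Hypothetical helper: never returns less than the 150 ng floor.
    """
    if genome_size_mb <= 0:
        raise ValueError("genome size must be positive")
    scale = max(1.0, genome_size_mb / BASE_GENOME_MB)
    return BASE_INPUT_NG * scale


print(recommended_input_ng(266))  # 150.0
print(recommended_input_ng(600))  # 300.0
```

Treat the result as a starting point for planning; actual requirements also depend on gDNA quality and library yield.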

For best results, we recommend preparing multiple DNA extractions for evaluation prior to library construction. Key considerations include starting with gDNA that is 20-100 kb and following best practices for handling high molecular weight DNA to avoid damage which may reduce read lengths.

Another consideration is that PacBio sequencing can generate multiple subreads for a single template molecule. This is the result of the polymerase making multiple passes around the circular library molecule within a reaction well. The number of passes depends on the movie length and the length of the insert, among other factors. For de novo genome assembly, we recommend selecting only a single subread per reaction well for the assembly process. This reduces the rate of chimerism/misassembly in the resulting contigs. However, we recommend using all subreads when polishing your contigs in order to get the highest base qualities.
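
One way to keep a single subread per reaction well is sketched below. It assumes subread names follow the `<movie>/<hole number>/<start>_<end>` convention and keeps the longest subread per well; choosing the longest is an assumption made for illustration, since this note does not specify which subread to keep. The movie name and coordinates are made up.

```python
from typing import Dict, Iterable, Tuple


def parse_subread_name(name: str) -> Tuple[str, int, int]:
    """Split '<movie>/<zmw>/<start>_<end>' into (movie, zmw, subread length)."""
    movie, zmw, span = name.split("/")
    start, end = span.split("_")
    return movie, int(zmw), int(end) - int(start)


def longest_subread_per_zmw(names: Iterable[str]) -> Dict[Tuple[str, int], str]:
    """Return one subread name per (movie, zmw), keeping the longest."""
    best: Dict[Tuple[str, int], Tuple[int, str]] = {}
    for name in names:
        movie, zmw, length = parse_subread_name(name)
        key = (movie, zmw)
        if key not in best or length > best[key][0]:
            best[key] = (length, name)
    return {key: name for key, (_, name) in best.items()}


# Hypothetical read names for illustration only.
names = [
    "m54001_190101_000000/101/0_12000",
    "m54001_190101_000000/101/12050_25500",
    "m54001_190101_000000/202/0_8000",
]
selected = longest_subread_per_zmw(names)
print(sorted(selected.values()))
```

The selected names would feed the assembly step, while the full set of subreads would still be used for polishing, as recommended above.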

When planning for a genome assembly project, you should consider two types of coverage:

1. Total coverage: the total number of bases sequenced relative to the genome size of the organism of interest, calculated by dividing the total bases by the genome size.
   a. Example: 30 Gb of raw sequence data divided by a 1 Gb genome size is 30-fold total coverage.

2. Unique Molecular Coverage (UMC): the number of unique bases sequenced relative to the genome size of the organism of interest, calculated by dividing the Unique Molecular Yield (UMY) by the genome size.
   a. Example: 30 Gb of raw data resulting in 12 Gb of UMY, so 12 Gb divided by a 1 Gb genome size is 12-fold UMC.

For a high-quality de novo genome assembly that is both contiguous and has high consensus accuracy, we recommend aiming for 30-fold UMC per haplotype for your genome of interest.
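
The two coverage definitions and the 30-fold target can be written out as a short calculation. This is a minimal sketch: the function names are hypothetical, and treating "per haplotype" as a simple multiplier on the haploid genome size is an assumption.

```python
def fold_coverage(bases_gb: float, genome_size_gb: float) -> float:
    """Coverage = bases sequenced / genome size (both in Gb)."""
    if genome_size_gb <= 0:
        raise ValueError("genome size must be positive")
    return bases_gb / genome_size_gb


def required_umy_gb(genome_size_gb: float,
                    target_umc: float = 30.0,
                    haplotypes: int = 1) -> float:
    """Unique Molecular Yield needed to reach the target UMC.

    Assumption: the per-haplotype target scales linearly with
    the number of haplotypes to be resolved.
    """
    return genome_size_gb * target_umc * haplotypes


# Worked examples from the definitions above (1 Gb genome).
print(fold_coverage(30.0, 1.0))            # 30.0 (total coverage)
print(fold_coverage(12.0, 1.0))            # 12.0 (UMC)
print(required_umy_gb(1.0, haplotypes=2))  # 60.0
```

Planning against UMY rather than total yield matters because, as in the example above, only part of the raw data comes from unique molecules.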

*Note: Unique Molecular Yield is not currently reported for a sequencing run. To calculate it, please see instructions in our pb-assembly FAQs. UMY is scheduled to be added to the Run QC report in the next software release.

Sample Preparation: Start with High-quality gDNA

High-quality gDNA is critical to achieving the long reads needed to span repetitive regions in complex genomes. We recommend starting with gDNA that is predominantly between 20 kb and 100 kb with few fragments <20 kb present and NanoDrop® purity readings of A260/280: 1.8-2.0 and A260/230: ≥2.0.

To determine the gDNA size distribution, we recommend using the Femto Pulse System from Agilent to enable gDNA size analysis from only 500 pg of input material. If you have sufficient gDNA (100-150 ng for instrument usage + 150 ng for library preparation) you may be able to evaluate the size distribution via another method (CHEF Mapper® System from BioRad, Pippin Pulse™ System from Sage Science, or Fragment Analyzer from Agilent).

If the NanoDrop purity readings (A260/280 and A260/230) are out of the range specified above, we recommend performing a 1X AMPure® PB bead purification step followed by a gDNA sample quantity and purity re-assessment using Qubit® fluorometer and NanoDrop, respectively.
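
The acceptance criteria in this section can be expressed as a simple screening function. The sketch below is illustrative only: the `GdnaSample` class and its fields are hypothetical, and summarizing a sizing trace as a single dominant fragment size is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GdnaSample:
    name: str
    a260_280: float          # NanoDrop purity ratio
    a260_230: float          # NanoDrop purity ratio
    dominant_size_kb: float  # assumed summary of the sizing trace


def qc_flags(sample: GdnaSample) -> List[str]:
    """Return the reasons a sample falls outside the recommended ranges."""
    flags: List[str] = []
    if not 1.8 <= sample.a260_280 <= 2.0:
        flags.append("A260/280 outside 1.8-2.0")
    if sample.a260_230 < 2.0:
        flags.append("A260/230 below 2.0")
    if sample.dominant_size_kb < 20:
        flags.append("dominant fragment size below 20 kb")
    return flags


# Hypothetical readings for illustration only.
good = GdnaSample("sample_1", a260_280=1.85, a260_230=2.1, dominant_size_kb=45)
poor = GdnaSample("sample_2", a260_280=1.6, a260_230=1.4, dominant_size_kb=12)
print(qc_flags(good))  # []
print(qc_flags(poor))
```

A sample flagged only on purity is a candidate for the 1X AMPure PB bead purification described above; a sample flagged on size is more likely to need a fresh extraction.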


Figure 2 – Quality of gDNA for library construction. Femto Pulse gel images and traces of four gDNA samples. Samples A and B contain fragments with a majority of gDNA >20 kb, with minimal fragments <20 kb, and are considered suitable for generating long reads for de novo genome assembly. However, Sample B also has a smear of DNA <10 kb, putting some risk on continuing into library preparation. Samples C and D are too fragmented with a majority of gDNA <20 kb and would not be recommended for library construction or sequencing.

Library Preparation: Reduce DNA Damage and Optional Size Selection

For samples without enough DNA to size select for longer fragments, it is critically important to follow best practices for reducing DNA shearing and damage during library preparation. While the low DNA input library preparation is an additive, single-tube workflow that minimizes DNA damage, there are other measures you can take to prevent inadvertent shearing. These include:

• Minimizing or eliminating freeze/thaw cycles of the gDNA to reduce DNA damage

• Allowing sufficient time for thawing aliquots of DNA, as partially frozen DNA is prone to shearing

• Only using wide bore pipette tips when handling DNA and pipetting very slowly to reduce shearing

• Eliminating high-speed vortexing and using gentle mixing techniques such as slow inversion

For samples that do not have the required 3 µg of DNA for use with the standard Express Template Preparation, but have enough DNA to size select for longer fragments and still retain ≥150 ng of DNA for library preparation with the low DNA input protocol, we encourage using the BluePippin™ System from Sage Science. Eliminating degraded DNA by selecting for DNA >10 kb will significantly increase the quality of the resulting sequencing and assembly.


SMRT Sequencing: Optimize for Sufficient Unique Molecular Yield

In addition to loading optimization for sequencing performance, we recommend running low DNA input libraries with 2-hour pre-extensions and 10-hour collection times. This will allow any damaged DNA molecules, still present in the library, to be eliminated before sequencing.

The final insert size in your SMRTbell library will have some effect on the maximum number of SMRT Cells that can be sequenced from that library. Two library preparations from the same gDNA input amount, but with different insert sizes, will result in differing maximum numbers of SMRT Cells that can be sequenced.

Example Library  gDNA Input Amount (ng)  Insert Size of Final Library (bp)  Max No. of SMRT Cells (at 5 pM)
1                150                     18,000                             11
2                150                     19,000                             10
3                150                     39,000                             5
4                150                     42,000                             4

Table 1 – Comparison of SMRT Cell yield between four different size libraries. With similar input (150 ng gDNA) and library yield (50-60%), libraries with larger inserts have a lower maximum number of SMRT Cells that can be sequenced from each library preparation. In addition, it is a requirement to use the Sample Setup Calculator for binding and annealing unless you have updated to software v7.0 or later.
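
The trend in Table 1 follows from the fact that, for a fixed mass of library, the number of molecules falls as the insert size grows. The sketch below illustrates this with a mass-to-moles conversion. It assumes an average of 660 g/mol per base pair of double-stranded DNA (a common approximation), a 55% library yield (the midpoint of the range quoted in the caption), and ignores the mass of the SMRTbell adapters. It does not attempt to predict SMRT Cell counts, which also depend on loading volume and concentration.

```python
AVG_BP_MASS_G_PER_MOL = 660.0  # assumption: common approximation for dsDNA


def library_fmol(mass_ng: float, insert_bp: int) -> float:
    """Convert a library mass (ng) to femtomoles for a given insert size."""
    if insert_bp <= 0:
        raise ValueError("insert size must be positive")
    return mass_ng * 1e6 / (insert_bp * AVG_BP_MASS_G_PER_MOL)


input_ng = 150.0
library_yield = 0.55  # assumed midpoint of the 50-60% range

for insert_bp in (18_000, 19_000, 39_000, 42_000):
    fmol = library_fmol(input_ng * library_yield, insert_bp)
    print(f"{insert_bp:>6} bp insert: {fmol:.2f} fmol of library")
```

Under these assumptions, the 18 kb library contains roughly twice as many molecules as the 39 kb library, which is in line with the roughly two-fold difference in maximum SMRT Cells shown in the table.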

Genome Assembly: Data Analysis for Any User

Assembly of sequence data generated with the low DNA input protocol is no different from a standard de novo genome assembly project analysis. You can generate highly accurate and contiguous genome assemblies from 30-fold UMC with bioinformatics tools developed and optimized for SMRT Sequencing data. Options for assembly include:

• Push-button assembly with HGAP4 using SMRT Analysis

• Phased assembly at the command line with FALCON and FALCON-Unzip from PacBio DevNet

• Network of analysis partners for platform or full-service bioinformatics

Example 1: Mosquito Genome (~266 Mb)

A high-quality de novo genome assembly was generated for a single mosquito (Anopheles coluzzii) using the low DNA input protocol. From only 100 ng of input gDNA, >20 Gb of sequence data was generated on each of three SMRT Cells. This data produced 12.9 Gb of UMY, or 45-fold UMC of the ~266 Mb genome. The data was sufficient to produce a 251 Mb genome assembly with a contig N50 of 3.47 Mb and a complete BUSCO score of 98.1%.

Figure 3 – SMRTbell library preparation workflow using low DNA input protocol.


Loading Conc. (pM)  Movie Time (PE)  Total Yield (Gb)  Unique Mol. Yield (Gb)  N50 Polymerase Read Length (bp)  N50 Subread Length (bp)  P0     P1     P2
5                   20 hr (2 hr)     24.1              4.5                     116,615                          12,978                   26.0%  60.1%  13.9%
5                   20 hr (2 hr)     23.6              4.5                     114,807                          13,132                   27.1%  59.0%  14.0%
6                   20 hr (2 hr)     25.0              3.9                     122,898                          12,751                   35.3%  53.1%  11.7%

Table 2 – Run statistics for the three SMRT Cells run using the An. coluzzii library. Sequel System Chemistry 3.0, Software v6.0, 20-hour movies with 2-hour pre-extension (PE). A total of 45-fold UMC was generated.

Genome Assembly Results

                         PacBio Sequencing  Sanger Sequencing
Primary Contigs
  Total Length           251 Mb             224 Mb
  No. Contigs            206                27,063
  Contig N50             3.47 Mb            0.025 Mb
Alternate Haplotigs
  Total Length           89.2 Mb            Unresolved
  No. Contigs            830                N/A
  Contig N50             0.199 Mb           N/A
BUSCO (diptera n=2,799)
  Complete               98.1%              87.5%
  Duplicated             2.4%               0.1%
  Fragmented             0.9%               6.8%
  Missing                1.0%               5.7%

Table 3 – Comparative assembly statistics for An. coluzzii de novo genome assemblies. PacBio assembly statistics compared with the previous Sanger sequencing-based assembly (GCA_000150765.1) for this species. BUSCO was run on the PacBio assembly primary contigs after curation with Purge Haplotigs.
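
For readers less familiar with the contig N50 statistic reported here, the sketch below shows the standard definition: the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly length. The contig lengths in the example are toy values, not data from this note.

```python
from typing import Sequence


def n50(lengths: Sequence[int]) -> int:
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first contig that brings the running sum to >= 50%.
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


# Toy contig lengths for illustration only.
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))  # 8
```

A higher N50 for a similar total length indicates that the assembly is made of fewer, longer contigs.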

Figure 4 – An. coluzzii input DNA and final SMRTbell library. The majority of the DNA is >20 kb and the final SMRTbell library is 17 kb; 100 ng of input DNA was used for library preparation.


Example 2: Rice Genome (~387 Mb)

A high-quality de novo genome assembly was generated for the cultivated rice (Oryza sativa subsp. indica) line Minghui 63 (MH63) using the low DNA input protocol. From only 150 ng of input gDNA, >9 Gb of sequence data was generated on each of six SMRT Cells. The data produced 19.8 Gb of UMY, or 51-fold UMC of the ~387 Mb genome.

The data were sub-sampled down to 31-fold UMC and were sufficient to produce a 390 Mb genome assembly with a contig N50 of 1.01 Mb and a complete BUSCO score of 98%.

Loading Conc. (pM)  Movie Time (PE)  Total Yield (Gb)  Unique Mol. Yield (Gb)  N50 Polymerase Read Length (bp)  N50 Subread Length (bp)  P0     P1     P2
5                   10 hr (2 hr)     10.25             3.71                    60,243                           12,977                   40.9%  48.1%  10.9%
5                   10 hr (2 hr)     9.41              3.11                    67,161                           13,275                   53.0%  39.6%  7.4%
5                   10 hr (2 hr)     9.36              3.01                    68,971                           13,222                   53.6%  38.6%  7.8%
5                   10 hr (2 hr)     9.01              3.10                    63,754                           13,002                   51.9%  40.3%  7.7%
5                   10 hr (2 hr)     9.38              3.43                    60,323                           13,105                   46.8%  44.0%  9.2%
5                   10 hr (2 hr)     9.65              3.41                    62,579                           13,069                   47.2%  44.0%  8.8%

Table 4 – Run statistics for the six SMRT Cells run from the O. sativa subsp. indica library. Sequel System Chemistry 3.0, Software v6.0, 10-hour movies with 2-hour pre-extension (PE). A total of 51-fold UMC was generated.
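
The per-cell values in Table 4 can be combined to reproduce the totals quoted in the text for this example (19.8 Gb of UMY, 51-fold UMC of the ~387 Mb genome). The sketch below does that and also reports the fraction of each cell's total yield that is unique; the variable names are illustrative.

```python
# (total yield Gb, unique molecular yield Gb) for the six SMRT Cells in Table 4
cells = [
    (10.25, 3.71),
    (9.41, 3.11),
    (9.36, 3.01),
    (9.01, 3.10),
    (9.38, 3.43),
    (9.65, 3.41),
]
GENOME_SIZE_GB = 0.387  # ~387 Mb rice genome, as stated in the text

total_umy = sum(umy for _, umy in cells)
umc = total_umy / GENOME_SIZE_GB

for total, umy in cells:
    print(f"total {total:5.2f} Gb, unique {umy:4.2f} Gb, "
          f"unique fraction {umy / total:.0%}")

print(f"UMY {total_umy:.1f} Gb -> {umc:.0f}-fold UMC")
# UMY 19.8 Gb -> 51-fold UMC
```

In this run roughly a third of the total yield per cell comes from unique molecules, which is why project planning should be based on UMY rather than total yield.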

Figure 5 – O. sativa subsp. indica input DNA and final SMRTbell library. The majority of the DNA is >20 kb and the final SMRTbell library is 19 kb; 150 ng of input DNA was used for library preparation.


Genome Assembly Results

Data
  No. SMRT Cells         4
  UMC                    31-fold
Primary Contigs
  Total Length           390 Mb
  No. Contigs            1,099
  Contig N50             1.01 Mb
BUSCO (diptera n=2,799)
  Complete               98%
  Duplicated             0.8%
  Fragmented             0.4%
  Missing                1.6%

Table 5 – Assembly statistics of the PacBio O. sativa subsp. indica de novo genome assembly. Full data were sub-sampled to 31-fold UMC and assembled using FALCON.

Conclusions

This new low DNA input approach puts high-quality genome assemblies in reach for small, highly heterozygous organisms that comprise much of the diversity of life, and for samples for which gDNA is limited. The method described here can be applied to samples with input gDNA amounts ≥150 ng per 300 Mb of genome size. We have demonstrated that the quality and size of the input gDNA, together with obtaining a minimum of 30-fold UMC of the genome, determine how robustly this workflow delivers high-quality genome assemblies.

PacBio Consumable Part Numbers:

Part Number  Item
100-938-900  SMRTbell Express Template Prep Kit 2.0
100-265-900  AMPure PB

For Research Use Only. Not for use in diagnostic procedures. © Copyright 2019, Pacific Biosciences of California, Inc. All rights reserved. Information in this document is subject to change without notice. Pacific Biosciences assumes no responsibility for any errors or omissions in this document. Certain notices, terms, conditions and/or use restrictions may pertain to your use of Pacific Biosciences products and/or third party products. Please refer to the applicable Pacific Biosciences Terms and Conditions of Sale and to the applicable license terms at https://www.pacb.com/legal-and-trademarks/terms-and-conditions-of-sale/ Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, Iso-Seq, and Sequel are trademarks of Pacific Biosciences in the United States and/or certain other countries. All other trademarks are the sole property of their respective owners. PN 101-769-600 Version 01 (April 2019)