From Swab to Publication



A Comprehensive Workflow for Microbial Genome Sequencing: From Swab to Publication

Madison I. Dunitz¹, David A. Coil¹, Jenna M. Lang¹, Guillaume Jospin¹, Aaron E. Darling², Jonathan A. Eisen¹

UC Davis Genome Center
¹University of California, Davis; ²ithree Institute, University of Sydney, Australia

The sequencing and de novo assembly of microbial genomes have already yielded enormous scientific insight, revolutionizing fields from epidemiology to ecology. Over the past two decades, advances in DNA sequencing technology have produced a wide variety of options for DNA library preparation, sequencing, and assembly. Each option has its own advantages and disadvantages in terms of complexity, expense, computing power, time, and experience required. This workflow was designed to democratize microbial sequencing and de novo assembly, making them accessible to labs with limited funding and even to classrooms, on a massive scale.

Bench Work
• Swab
• Plate
• Dilution Streak (X2)
• Overnight Culture
• DNA Extraction
• 16S PCR

Sanger Sequence Processing
• SeqTrace
• Geneious
• Custom Script

Identify/Choose Microbe
• BLAST 16S
• Interpret the Results
• GOLD
• Align the 16S Sequences using Align Sequences Nucleotide BLAST

Library Preparation and Sequencing
• Library Preparation
• Kit Options
• Considerations in Library Preparation
• Multiplexing

Assembly
• Download and Install A5
• Running A5

Verification of the Assembly
• Interpretation of A5 stats
• Verification of 16S Sequence
• Phylosift

Annotation
• Options
• RAST

Making a Phylogenetic Tree
• Obtain the Full-Length 16S Sequence
• Create an RDP Alignment
• Building the Tree
• Viewing the Tree

GenBank Submission
• Create a BioProject at NCBI
• FASTA2AGP (custom script)
• Create a .sbt Template
• Tbl2asn
• Create a Whole Genome Shotgun Submission
• Submitting Raw Reads

Introduction

The objective of the present study was to design, test, troubleshoot, and publish a comprehensive workflow that takes a researcher from a swab to a microbial genome publication, and that even a lab with limited resources and bioinformatics experience can perform.


Results

I. Bench Work

We assume a starting point of wanting to isolate an organism from a particular environment and needing to identify it. Users starting with a known organism should proceed to "Library Preparation and Sequencing". Here we cover the steps necessary to take a sample through plating, dilution streaking, overnight growth, creating a glycerol stock, 16S PCR, and preparation for Sanger sequencing.

II. Sanger Sequence Processing

In this section we identify several software programs that allow the researcher to view and edit the sequence. We detail the advantages and disadvantages of particular programs and explain how to quality-trim the reads, reverse complement one read, align the forward and reverse reads, and generate a consensus sequence. This is easiest to do visually with a chromatogram, which lets the user see the trace and process the sequences manually.
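As a rough illustration of the operations named above (not the custom script from the workflow), the following Python sketch trims low-quality ends, reverse complements a read, and builds a per-position consensus. It assumes the two reads are already aligned and of equal length, and the quality threshold of 20 is an illustrative choice, not a value taken from the poster.

```python
# Minimal sketch of Sanger read processing; all names and thresholds are illustrative.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def quality_trim(seq, quals, min_q=20):
    """Trim bases from both ends until a base with quality >= min_q is reached."""
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end], quals[start:end]


def consensus(seq_a, qual_a, seq_b, qual_b):
    """Per-position consensus of two ALREADY ALIGNED, equal-length reads.

    At each position the base with the higher quality score is kept.
    """
    if not (len(seq_a) == len(seq_b) == len(qual_a) == len(qual_b)):
        raise ValueError("reads and quality lists must have equal length")
    return "".join(
        a if qa >= qb else b
        for a, qa, b, qb in zip(seq_a, qual_a, seq_b, qual_b)
    )
```

In practice the reverse read would be reverse complemented first and then aligned to the forward read; tools such as SeqTrace or Geneious handle that alignment step, which this sketch leaves out.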

III. Identify/Choose Microbe

In a classroom or undergraduate research setting, the researchers may not have a particular bacterial species in mind. In this case it is necessary to screen the 16S Sanger sequencing results for possible genome project candidates. We recommend starting with the BLAST results, then continuing on to the Genomes Online Database (GOLD) and a simple Google search. In many cases it will be handy to first build a phylogenetic tree to aid in identification since the 16S results may not be

IV. Library Preparation and Sequencing

The choice of sequencing technology and of library preparation method for genome sequencing is ever-changing and much-debated. Here we recommend using Illumina MiSeq for reasons of cost, depth of coverage, and length of reads. Furthermore, the assembly pipeline, A5-miseq, requires Illumina data and is optimized for the longer reads from the MiSeq.
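Depth of coverage and multiplexing can be estimated with simple arithmetic before a run is planned. The sketch below assumes paired-end reads of uniform length; the function names and the example numbers are hypothetical and are not specifications of any particular instrument or kit.

```python
# Back-of-the-envelope coverage estimates; all inputs are user-supplied assumptions.

def expected_coverage(read_pairs, read_length, genome_size):
    """Average depth of coverage for paired-end reads of uniform length."""
    return 2 * read_pairs * read_length / genome_size


def max_samples_per_run(run_read_pairs, read_length, genome_size, target_coverage):
    """How many genomes of a given size fit in one run at the target coverage."""
    bases_per_run = 2 * run_read_pairs * read_length
    bases_per_sample = genome_size * target_coverage
    return int(bases_per_run // bases_per_sample)


# Example with made-up numbers: 1 million read pairs of 250 bp on a 5 Mbp genome.
print(expected_coverage(1_000_000, 250, 5_000_000))
```

Real yields vary from run to run and reads are never split perfectly evenly between multiplexed samples, so a safety margin on the target coverage is sensible.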

V. Assembly

Genome assembly typically consists of data cleaning (quality filtering and adaptor removal), error correction, contig assembly, scaffolding, and verification of scaffolds/contigs. A large array of programs, both commercial and open-source, can perform some or most of these steps.

For this workflow we recommend the open-source A5 assembly pipeline, which automates all of the steps described above with a single command.
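The single command can be wrapped in a short Python helper for batch use. This sketch assumes the A5-miseq entry point is a script named `a5_pipeline.pl` that takes the two paired-end FASTQ files followed by an output prefix; that invocation should be checked against the README of the installed version. The file names are placeholders.

```python
import shutil
import subprocess


def build_a5_command(read1, read2, out_prefix, executable="a5_pipeline.pl"):
    """Build the argument list for an A5 run.

    Assumed argument order: read 1 FASTQ, read 2 FASTQ, output prefix.
    """
    return [executable, read1, read2, out_prefix]


def run_a5(read1, read2, out_prefix, executable="a5_pipeline.pl"):
    """Run the pipeline, failing early if the executable is not on PATH."""
    cmd = build_a5_command(read1, read2, out_prefix, executable)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} not found on PATH")
    return subprocess.run(cmd, check=True)


# Placeholder file names, for illustration only.
print(" ".join(build_a5_command("reads_R1.fastq", "reads_R2.fastq", "my_genome")))
```

Building the command as a list rather than a shell string avoids quoting problems when file names contain spaces.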

VI. Verification of the Assembly

Verification of a genome assembly has three parts. The first is the overall "quality" of the assembly, assessed by examining the stats provided by A5 (e.g., number of contigs and contig N50). The second is confirming that the organism sequenced is the organism of interest, simply by checking the 16S sequence with BLAST. The third is the "completeness" of the genome, which is difficult to measure except in cases where a close reference (a representative example of the species' genome) is available.
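To make the contig N50 statistic concrete, here is a small Python sketch that computes contig count and N50 from FASTA text. N50 is the length of the contig at which the running total, taken from longest to shortest, first reaches half of the total assembly length. The function names are illustrative, and A5 reports these numbers itself; the code only shows what they mean.

```python
def contig_lengths(fasta_text):
    """Return the length of each sequence in a FASTA-formatted string."""
    lengths = []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            lengths.append(0)          # start a new contig
        elif lengths:
            lengths[-1] += len(line)   # add sequence line to current contig
    return lengths


def n50(lengths):
    """Contig N50: length at which the cumulative sum reaches half the total."""
    if not lengths:
        raise ValueError("no contigs given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
```

For example, contigs of 500, 400, 300, 200, and 100 bp total 1500 bp; the two longest sum to 900 bp, which passes the 750 bp halfway mark, so the N50 is 400.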

VII. Annotation

A number of pipelines are available for the annotation of bacterial genomes, including Prokka, IMG, RAST, PGAP, and others. Genomic annotation involves identifying protein-coding regions and attempting to predict the genes they encode and their biological functions.

VIII. Making a Phylogenetic Tree

There are two points during the workflow where making a 16S phylogenetic tree may be useful. The first is after identification of candidate organisms by Sanger sequencing and the second is after assembly of the genome. The process is identical in both cases, but the additional length and improved quality of the post-assembly 16S sequence may generate a better tree. The tree can be used for identification of the candidate (e.g., does the candidate fall within a single-species clade?), for naming of the candidate (does it fall in a clade containing only members of that species, with no other members of the species found outside that clade?), and for placement of the organism into a phylogenetic context.

In outline, this step uses the Ribosomal Database Project (RDP) to generate an alignment of the sequence with close relatives and an outgroup, followed by cleanup of the RDP headers, tree-building with FastTree, and viewing/analysis of the tree in Dendroscope.
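Header cleanup matters because characters such as spaces, parentheses, colons, commas, and semicolons have special meaning in Newick tree files and can break labels downstream. The Python sketch below is one hypothetical way to sanitize FASTA headers before tree-building. It is not the authors' cleanup script, and the example header is invented rather than copied from real RDP output.

```python
import re


def clean_header(header):
    """Replace runs of characters that are unsafe in tree labels with '_'."""
    name = header.lstrip(">").strip()
    name = re.sub(r"[^A-Za-z0-9._-]+", "_", name)
    return ">" + name.strip("_")


def clean_fasta(text):
    """Sanitize every header line in FASTA text; sequence lines are unchanged."""
    return "\n".join(
        clean_header(line) if line.startswith(">") else line
        for line in text.splitlines()
    )


# Invented header, for illustration only.
print(clean_header(">S000001 Bacillus subtilis; strain (T); X1"))
```

If two headers collapse to the same cleaned name, they should be made unique (for example by keeping the accession at the front) before the alignment is passed to a tree builder.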

IX. GenBank Submission

This section describes how to submit contigs and scaffolds (if applicable) as a Whole Genome Shotgun (WGS) submission to GenBank. We also recommend allowing GenBank to annotate the genome itself, since submitting RAST annotations to GenBank can be prohibitively complicated. The genomes are automatically shared with the DNA Data Bank of Japan (DDBJ) and EMBL. In addition, genomes from GenBank are automatically pulled into IMG and are annotated there as well.