computational assembly for prokaryotic sequencing projects

Overview of Vibrio vulnificus and V. navarrensis

Computational assembly for prokaryotic sequencing projects
Lee Katz, Ph.D.
Bioinformatician, Enteric Diseases Laboratory Branch
January 15, 2014

Disclaimers
The findings and conclusions in this presentation have not been formally disseminated by the Centers for Disease Control and Prevention and should not be construed to represent any agency determination or policy.
The findings and conclusions in this [report/presentation] are those of the author(s) and do not necessarily represent the official position of CDC.

Partners in Public Health

Graduated Oct 2010
CDC 2010 - present

Lee Katz, Present
Currently in the National Enteric Reference Laboratory
Vibrio, Campylobacter, Escherichia, Shigella, Yersinia, Salmonella
Focusing on Listeria and Vibrio

One of my projects is #2 on CDC's list of accomplishments for 2013!

#2
http://www.cdc.gov/features/endofyear/

Outline
Sequencing
  1st gen
  2nd gen
  3rd gen
Reads
  Quality control (Q/C)
  Read metrics
  Read-cleaning
Assembly
  Algorithms
  Assembly metrics

Prokaryotic Sequencing Projects
Stages
Sequencing
Assembly
Feature prediction
Functional annotation
Analysis
Display (Genome Browser)

Examples
Haemophilus influenzae
Neisseria meningitidis
Bordetella bronchiseptica
Vibrio cholerae
Listeria monocytogenes

Fleischmann et al. (1995) Whole-Genome Random Sequencing and Assembly of Haemophilus influenzae Rd. Science 269:5223
Kislyuk et al. (2010) A computational genomics pipeline for prokaryotic sequencing projects. Bioinformatics 26:159

Out with the old; in with the new:
Two new technologies to the compgenomics class!
454
Illumina single end reads
Illumina paired end reads
PacBio

Sanger Sequencing (1st gen)

Sequencing: first generation
Margulies et al. (2005) Genome sequencing in open microfabricated high density picoliter reactors. Nature 437:705712

Sanger sequencing output
Usually .ab1/.scf file format

454 Sequencing (2nd Gen)

454 Pyrosequencing
Mix DNA library & capture beads (limited dilution)

Break micro-reactors
Isolate DNA-containing beads

Create water-in-oil emulsion (+ PCR reagents + emulsion oil)
Perform emulsion PCR

454 Pyrosequencing
Load beads into PicoTiterPlate
Load enzyme beads

PicoTiterPlate
Diameter = 44 μm
Depth = 55 μm
Well size = 75 pl
Well density = 480 wells mm^-2
1.6 million wells per slide

454 Pyrosequencing
Reagent flow
Sequencing by synthesis
Photons generated are captured by CCD camera

Margulies et al., 2005

454 sequencing output
Flowgram (.sff file format)
Flow order: TACG
Signal levels: 1-mer, 2-mer, 3-mer, 4-mer
KEY (TCAG)
Measures the presence or absence of each nucleotide at any given position

Illumina sequencing (2nd Gen)

The following animations are courtesy of Illumina, Inc.

[Figure labels: region complementary to P5 grafting primer, Index 2, P5 primer, DNA insert, P7 primer, Index 1, P7 grafting primer, P5 grafting primer, flow cell surface]

Animation steps:
SBS sequencing primer hybridization
Sequence (Cycle 1)
Index 1 seq primer hybridization
Index 1 read: 8 cycles
Unblock
P5 grafting primer: 7 dark cycles
Index 2 index read: 8 cycles
Extension (original strand, new strand)
Linearization

Illumina sequencing video
http://www.youtube.com/watch?v=womKfikWlxM

PacBio sequencing* (3rd Gen)
*Pacific Biosciences

http://www.youtube.com/watch?v=NHCJ8PtYCFc
Eid et al., Science, January 2009, doi:10.1126/science.1162986

Thanks to PacBio for donating some slide materials in this section

SMRT Bell
Zero-mode waveguide (ZMW), a very fancy and very small well

PacBio video
http://www.youtube.com/watch?v=NHCJ8PtYCFc

Reads
Q/C + cleaning + metrics

Q/C
You need to know if your data are good!
Example software
FastQC
Computational Genomics Pipeline (CG-Pipeline)
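One way to check whether the data are good is to summarize read lengths and base qualities. The sketch below computes that kind of summary from FASTQ text. It is an illustration only, not the FastQC or CG-Pipeline implementation: the function names, the Phred+33 quality encoding, and the definition of avgQuality as the mean per-base Phred score are all assumptions.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) from a FASTQ stream with 4 lines per record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # skip blank lines between records
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line
        qual = handle.readline().strip()
        yield header.strip()[1:], seq, qual


def read_metrics(handle, phred_offset=33):
    """Summarize read lengths and mean per-base quality (Phred+33 is assumed)."""
    lengths = []
    qual_sum = 0
    for _name, seq, qual in read_fastq(handle):
        lengths.append(len(seq))
        qual_sum += sum(ord(c) - phred_offset for c in qual)
    if not lengths or sum(lengths) == 0:
        raise ValueError("no bases found in input")
    total = sum(lengths)
    return {
        "numReads": len(lengths),
        "totalBases": total,
        "avgReadLength": total / len(lengths),
        "minReadLength": min(lengths),
        "maxReadLength": max(lengths),
        "avgQuality": qual_sum / total,
    }


# Two made-up reads: 'I' encodes Phred 40 and '5' encodes Phred 20.
example = "@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACGT\n+\n5555\n"
metrics = read_metrics(io.StringIO(example))
print(metrics)
```

For a real file, the same function can be given an open handle, for example one returned by gzip.open(path, "rt") for a compressed FASTQ.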

Quality Control
FastQC output

Quality Control bioinformatics
FastQC output

The CG-Pipeline way
run_assembly_readMetrics.pl
File       avgReadLength  totalBases  minReadLength  maxReadLength  avgQuality
tmp.fastq  80.00          177777760   80             80             35.39

Read cleaning
Read cleaning with CG-Pipeline (not validated; please use with caution)
http://sourceforge.net/projects/cg-pipeline/
[Figure: %ACGT and Phred plots for F. Read, R. Read, and Read. Graphs made with FastqQC (AMOS)]

1. Trimming low-qual ends
run_assembly_trimLowQualEnds.pl
http://sourceforge.net/projects/cg-pipeline/
[Figure: 1A. %ACGT and 1B. Phred plots for F. Read, R. Read, and Read. Graphs made with FastqQC (AMOS)]

2a. Removing duplicate reads
2b. Sometimes: downsampling
run_assembly_removeDuplicateReads.pl
http://sourceforge.net/projects/cg-pipeline/

Trimmed reads
3. Trimming and filtering
run_assembly_trimClean.pl
3A. trimming
3B. filtering
Min length
Min avg. quality
http://sourceforge.net/projects/cg-pipeline/

More Software
Fastx toolkit http://hannonlab.cshl.edu/fastx_toolkit/
EA-utils https://code.google.com/p/ea-utils/
AMOS (amos: SourceForge.net)
and more is out there!

Evaluation
Fabbro et al 2013, An extensive evaluation of read trimming effects on Illumina NGS data analysis
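The three read-cleaning steps listed above (trimming low-quality ends, removing duplicate reads, then filtering on minimum length and minimum average quality) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the CG-Pipeline scripts themselves: the function names, the Phred+33 encoding, the thresholds (Phred 20, length 4), and the choice to call two reads duplicates when their trimmed sequences are identical are all hypothetical.

```python
def phred(qual, offset=33):
    """Convert a quality string to a list of Phred scores (Phred+33 assumed)."""
    return [ord(c) - offset for c in qual]


def trim_low_qual_ends(seq, qual, min_q=20):
    """Step 1: strip bases below min_q from both ends of the read."""
    scores = phred(qual)
    start, end = 0, len(scores)
    while start < end and scores[start] < min_q:
        start += 1
    while end > start and scores[end - 1] < min_q:
        end -= 1
    return seq[start:end], qual[start:end]


def remove_duplicates(reads):
    """Step 2a: keep the first read seen for each distinct sequence."""
    seen = set()
    unique = []
    for seq, qual in reads:
        if seq not in seen:
            seen.add(seq)
            unique.append((seq, qual))
    return unique


def passes_filter(seq, qual, min_length=4, min_avg_q=20):
    """Step 3: require a minimum length and a minimum average quality."""
    if not seq or len(seq) < min_length:
        return False
    scores = phred(qual)
    return sum(scores) / len(scores) >= min_avg_q


def clean_reads(reads):
    """Run trimming, duplicate removal, and filtering in that order."""
    trimmed = [trim_low_qual_ends(seq, qual) for seq, qual in reads]
    unique = remove_duplicates(trimmed)
    return [(seq, qual) for seq, qual in unique if passes_filter(seq, qual)]


# Made-up reads: '#' is Phred 2, '5' is Phred 20, 'I' is Phred 40.
raw_reads = [
    ("ACGTACGT", "##IIII##"),  # low-quality ends get trimmed
    ("ACGTACGT", "##IIII##"),  # exact duplicate of the first read
    ("TTTT", "5555"),          # passes unchanged
    ("GGGGGG", "######"),      # trimmed to nothing, then filtered out
]
cleaned = clean_reads(raw_reads)
print(cleaned)
```

Doing duplicate removal after trimming is a design choice made here for simplicity; a real pipeline may compare the untrimmed reads or read pairs instead.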

Assembly
Algorithms + metrics

Whole genome sequencing: WGS
Large pieces and de novo assembly

Business dog http://www.buzzfeed.com/tiad/business-dog

Whole genome sequencing: WGS
Small pieces and reference assembly

Business cat http://www.quickmeme.com/Business-Cat/

Assembly
Overlaps between reads

Generate contigs (contiguous sequences)

Generate scaffolds

Derive consensus sequence
Slide adapted from Andrey Kislyuk, http://www.compgenomics2009.biology.gatech.edu/images/1/12/2009-01-14-compgenomics-kislyuk.pdf

TAGATTACACAGATTACTGA-TTGATGGCGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGCGTAAACTA
TAG-TTACACAGATTATTGACTTCATGGCGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGGGTAA-CTA

TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA
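The consensus row can be reproduced from the five aligned reads with a per-column vote. The sketch below is a minimal illustration, not the method of any particular assembler: the function name is hypothetical, and it assumes one weight per read (defaulting to equal weights), whereas a real assembler would more likely weight each base by its quality score.

```python
from collections import defaultdict


def weighted_consensus(rows, weights=None):
    """Pick, for each alignment column, the symbol with the largest total weight."""
    if weights is None:
        weights = [1.0] * len(rows)  # assumption: every read counts equally
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all aligned rows must have the same length")
    consensus = []
    for col in range(width):
        votes = defaultdict(float)
        for row, weight in zip(rows, weights):
            votes[row[col]] += weight
        # Ties are broken alphabetically so the result is deterministic.
        consensus.append(max(sorted(votes), key=votes.get))
    return "".join(consensus)


# The five aligned reads from the slide ('-' marks a gap).
reads = [
    "TAGATTACACAGATTACTGA-TTGATGGCGTAA-CTA",
    "TAGATTACACAGATTACTGACTTGATGGCGTAAACTA",
    "TAG-TTACACAGATTATTGACTTCATGGCGTAA-CTA",
    "TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA",
    "TAGATTACACAGATTACTGACTTGATGGGGTAA-CTA",
]
consensus = weighted_consensus(reads)
print(consensus)
```

With equal weights, each disagreement in a single read is outvoted four to one, so the result matches the consensus shown on the slide. Giving one read a much larger weight changes the outcome at the columns where that read disagrees.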

Derive each consensus base by weighted voting

Recap of assembly
Scaffold
Contigs
Paired end reads
Reads

CG-Pipeline way for Illumina
run_assembly reads.fastq.gz -o assembly.fasta
No module yet in CGP for PacBio, unfortunately
Be on the lookout for several papers that compare Illumina assemblers.

PacBio Assembly
The following slides are courtesy of PacBio

Finishing Genomes Using Only PacBio Reads
Utilizes all PacBio data from single, long-insert library
Longest reads for continuity
All reads for high consensus accuracy
Now available through SMRT Portal in SMRT Analysis v2.0.1

Hierarchical Genome Assembly Process (HGAP)

Chin et al. (2013), Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nature Methods. doi:10.1038/nmeth.2474

Hierarchical Genome Assembly Process significantly advances our understanding of microbial genomes using only PacBio reads. High-quality and high-accuracy microbial genomes can be obtained from genomic DNA to final assembly in a few days.

Hierarchical Genome Assembly Process (HGAP)

Start with long seed reads
Align other reads
Build consensus
Construct accurate (>99%) pre-assembled reads

HGAP Example - Meiothermus ruber
Pre-assembly
Celera Assembler
Polish, Quiver

[Figure: read length profile; 250 Mb of data, seed reads >5 kb]

Collaboration with A. Clum, A. Copeland (Joint Genome Institute)

In a collaboration with the Joint Genome Institute, we demonstrated that the HGAP assembly method could de novo assemble the M. ruber genome in three SMRT Cells.

First, a single large-insert library (10 kb) was generated. From that, 3 SMRT Cells were run on a PacBio RS instrument using C2-C2 chemistry, which was current at the time. This generated 250 Mb of data, with the read length profile shown on the right.
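As a rough check on how much data this is, coverage can be estimated as total sequenced bases divided by genome size. The sketch below uses the 250 Mb figure from this slide and the M. ruber assembly size of 3,098,781 bases reported later in the talk. Treating 1 Mb as exactly 1,000,000 bases is an assumption, and the variable names are illustrative.

```python
# Figures taken from the talk; 1 Mb = 1e6 bases is an assumption.
total_bases = 250e6        # data generated from 3 SMRT Cells
genome_size = 3_098_781    # M. ruber assembly size in bases

coverage = total_bases / genome_size
print(f"Estimated coverage: {coverage:.1f}x")
```

The result is roughly 80x, which falls inside the 75-100X PacBio coverage range recommended for HGAP at the end of this talk.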

HGAP Example - Meiothermus ruber
Pre-assembly
Celera Assembler
Polish, Quiver


The PacBio reads >5 kb were selected as the seed reads. All the other reads were aligned to these long reads in a pre-assembly step.
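The seed selection itself is a simple length cutoff. The sketch below only partitions reads into seeds and non-seeds; it does not perform the alignment or consensus steps of HGAP. The function name, the dictionary layout, and the synthetic reads are hypothetical, and 5 kb is taken to mean 5,000 bases.

```python
def split_seed_reads(reads, min_seed_length=5000):
    """Partition reads into seed reads (longer than the cutoff) and all others."""
    seeds = {}
    others = {}
    for name, seq in reads.items():
        if len(seq) > min_seed_length:
            seeds[name] = seq
        else:
            others[name] = seq
    return seeds, others


# Synthetic reads used only to show the partitioning.
example_reads = {
    "read_a": "A" * 7200,
    "read_b": "C" * 5000,   # exactly 5 kb: not a seed, since the cutoff is ">5 kb"
    "read_c": "G" * 1800,
    "read_d": "T" * 12000,
}
seeds, others = split_seed_reads(example_reads)
print(sorted(seeds), sorted(others))
```

In the pre-assembly step described above, the reads in others would then be aligned to the reads in seeds to build a consensus for each seed.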

HGAP Example - Meiothermus ruber
Pre-assembly
Celera Assembler
Polish, Quiver


Following pre-assembly to the 5 kb seed reads, the alignment identity of the >5 kb reads improved to approximately 99%. These pre-assembled long reads are the input to the assembly algorithms.

HGAP Example - Meiothermus ruber
Pre-assembly
Celera Assembler
Minimus2
Quiver
1 contig

Collaboration with A. Clum, A. Copeland (Joint Genome Institute)
Single-contig assembly
99.99965% concordance with reference
99.3% genes predicted

In a collaboration with the Joint Genome Institute, we demonstrated that the HGAP assembly method could de novo assemble the M. ruber genome in three SMRT Cells. From a single, large-insert library, a single-contig assembly was generated with >99.999% concordance with JGI's reference.

Polish with Quiver for High Accuracy

Organism: Meiothermus ruber
Assembly size (bases): 3,098,781
Differences with Sanger reference: 11
Concordance with Sanger reference: 99.99965%
Nominal QV: 54.5
SNPs validated as correct PacBio calls: 8
Remaining differences: 1 (3)
QV: 60

[Figure: M. ruber Sanger reference compared with PacBio reads]
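The concordance and nominal QV reported for M. ruber follow from the difference count and the assembly size, using the usual Phred-style definition QV = -10 * log10(error rate). The sketch below reproduces them. The function names are illustrative, and reading the "(3)" in the remaining-differences entry as a worst-case count of three remaining differences is an assumption.

```python
import math


def concordance_percent(differences, assembly_size):
    """Percentage of assembly bases that agree with the reference."""
    return 100.0 * (1.0 - differences / assembly_size)


def quality_value(differences, assembly_size):
    """Phred-scaled quality: -10 * log10(differences per base)."""
    return -10.0 * math.log10(differences / assembly_size)


assembly_size = 3_098_781  # M. ruber assembly size in bases

# 11 differences against the Sanger reference before validation.
nominal_concordance = concordance_percent(11, assembly_size)
nominal_qv = quality_value(11, assembly_size)

# Assumption: at most 3 differences remain after targeted Sanger validation.
final_qv = quality_value(3, assembly_size)

print(f"concordance {nominal_concordance:.5f}%, nominal QV {nominal_qv:.1f}, final QV {final_qv:.1f}")
```

Under that assumption the final value comes out just above 60, consistent with the statement that the final QV was at least 60.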

Targeted Sanger validation

To characterize the 11 differences that remained between the HGAP assembly and the original Sanger reference, targeted Sanger validation was done on the sequenced M. ruber sample. Of the 9 clones that could be amplified, eight were validated as being correct in the PacBio consensus sequence, and one was different. The remaining three could not be validated. The final QV for the PacBio consensus sequences was at least 60.

Estimated Coverage Targets for Finishing Smaller Genomes

Hierarchical
  SMRT Analysis implementation of HGAP (uses Celera Assembler 7.0)
    Recommended PacBio coverage: 75-100X PacBio CLR
    Additional data sets: None
    Genome size constraints: < 10 MB (SMRT Portal), < 130 MB (Command Line)
  Celera Assembler via PacBiotoCA (recent compilation), see Koren et al (2013) http://arxiv.org/abs/1304.3752
    Recommended PacBio coverage: 75-100X PacBio CLR
    Additional data sets: None
    Genome size constraints: Similar to above

Hybrid
  Celera Assembler 7.0 with PacBiotoCA (SMRT Analysis)
    Recommended PacBio coverage: 20-50X PacBio CLR
    Additional data sets: 50X short reads
  ALLPATHS-LG
    Recommended PacBio coverage: 50X PacBio 3 kb CLR
    Additional data sets: 50X Illumina PE, 50X Illumina jumping libraries
    Genome size constraints: 20 MB
  MIRA (with PacBiotoCA)
    Recommended PacBio coverage: 20-50X PacBio CLR
    Additional data sets: 50X short reads

Scaffolding
  AHA (SMRT Analysis)
    Recommended PacBio coverage: 10X PacBio CLR
    Additional data sets: High-confidence contigs
