![Page 1: ASM 2013 Metagenomic Assembly Workshop Slides](https://reader034.vdocuments.site/reader034/viewer/2022052522/554e8062b4c905f66a8b547f/html5/thumbnails/1.jpg)
Adina Howe
Michigan State University, Adjunct
Argonne National Laboratory, Postdoc
ASM Workshop, May 2013
Visual Complexity http://www.flickr.com/photos/maisonbisson
![Page 2: ASM 2013 Metagenomic Assembly Workshop Slides](https://reader034.vdocuments.site/reader034/viewer/2022052522/554e8062b4c905f66a8b547f/html5/thumbnails/2.jpg)
MSU Lab: Titus Brown, Jim Tiedje, Jason Pell, Qingpeng Zhang, Jordan Fish, Eric McDonald, Chris Welcher, Aaron Garoutte, Jiarong Guo
Collaborators: Janet Jansson, Susannah Tringe
![Page 3: ASM 2013 Metagenomic Assembly Workshop Slides](https://reader034.vdocuments.site/reader034/viewer/2022052522/554e8062b4c905f66a8b547f/html5/thumbnails/3.jpg)
I will upload this on slideshare (adinachuanghowe)
Khmer documentation
github.com/ged-lab/khmer/
https://khmer.readthedocs.org/en/latest/guide.html
Manuscripts
Scaling metagenome sequence assembly with probabilistic de Bruijn graphs
http://www.pnas.org/content/early/2012/07/25/1121464109
A reference-free algorithm for computational normalization of shotgun sequencing data
http://arxiv.org/abs/1203.4802
Assembling large, complex metagenomes
http://arxiv.org/abs/1212.2832
![Page 4: ASM 2013 Metagenomic Assembly Workshop Slides](https://reader034.vdocuments.site/reader034/viewer/2022052522/554e8062b4c905f66a8b547f/html5/thumbnails/4.jpg)
A few gotchas of sequencing:
Errors / Artifacts (confusion)
Diversity / Complexity (scale)
![Page 5: ASM 2013 Metagenomic Assembly Workshop Slides](https://reader034.vdocuments.site/reader034/viewer/2022052522/554e8062b4c905f66a8b547f/html5/thumbnails/5.jpg)
1. Digital normalization (lossy compression)
2. Partitioning
3. Enabling usage of current, previously unusable assembly tools
![Page 6: ASM 2013 Metagenomic Assembly Workshop Slides](https://reader034.vdocuments.site/reader034/viewer/2022052522/554e8062b4c905f66a8b547f/html5/thumbnails/6.jpg)
Reduces data for analysis
Longer sequences (increased accuracy of annotation)
Gene order
Does not rely on known references, access to unknowns
Creates new references
Lots of assembly tools available
But…
![Page 7: ASM 2013 Metagenomic Assembly Workshop Slides](https://reader034.vdocuments.site/reader034/viewer/2022052522/554e8062b4c905f66a8b547f/html5/thumbnails/7.jpg)
High memory requirements
Depends on good (~10x) sequencing coverage
![Page 8: ASM 2013 Metagenomic Assembly Workshop Slides](https://reader034.vdocuments.site/reader034/viewer/2022052522/554e8062b4c905f66a8b547f/html5/thumbnails/8.jpg)
“Coverage” is simply the average number of reads that overlap each true base in the genome.
Here, the coverage is ~10: just draw a line straight down from the top through all of the reads.
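As a rough worked example of that definition, average coverage can be estimated as (number of reads × read length) / genome length. The read count, read length, and genome size below are hypothetical values chosen to land on ~10x; they are not from the slides.

```python
def average_coverage(num_reads: int, read_length: int, genome_length: int) -> float:
    """Average number of reads overlapping each base: N * L / G."""
    return num_reads * read_length / genome_length


# Hypothetical numbers: 500,000 reads of 100 bp from a 5 Mbp genome.
coverage = average_coverage(num_reads=500_000, read_length=100, genome_length=5_000_000)
print(coverage)  # 10.0
```

This is only an average; the depth at any one base varies with random sampling.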
![Page 9: ASM 2013 Metagenomic Assembly Workshop Slides](https://reader034.vdocuments.site/reader034/viewer/2022052522/554e8062b4c905f66a8b547f/html5/thumbnails/9.jpg)
Note that k-mer abundance is not properly represented here! Each blue k-mer will be present around 10 times.
![Page 10: ASM 2013 Metagenomic Assembly Workshop Slides](https://reader034.vdocuments.site/reader034/viewer/2022052522/554e8062b4c905f66a8b547f/html5/thumbnails/10.jpg)
Each single base error generates ~k new k-mers.
Generally, erroneous k-mers show up only once: errors are random.
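A minimal sketch of why one substitution creates about k new k-mers: every k-mer window that spans the bad base changes. The sequence, the error position, and k = 5 are made up for illustration.

```python
def kmers(seq: str, k: int) -> set:
    """Return the set of all k-mers (substrings of length k) in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


k = 5
true_read = "ACGTTGCATGGCTAAGCTTAGC"  # hypothetical error-free read

# Introduce a single substitution (G -> T) in the middle of the read.
pos = 10
error_read = true_read[:pos] + "T" + true_read[pos + 1:]

# k-mers that exist only because of the error: at most k, one per window
# that overlaps the erroneous base.
novel = kmers(error_read, k) - kmers(true_read, k)
print(len(novel), sorted(novel))
```

An error near the end of a read is covered by fewer windows and so creates fewer than k new k-mers, which is why the slide says "~k".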
![Page 11: ASM 2013 Metagenomic Assembly Workshop Slides](https://reader034.vdocuments.site/reader034/viewer/2022052522/554e8062b4c905f66a8b547f/html5/thumbnails/11.jpg)
![Page 12: ASM 2013 Metagenomic Assembly Workshop Slides](https://reader034.vdocuments.site/reader034/viewer/2022052522/554e8062b4c905f66a8b547f/html5/thumbnails/12.jpg)
![Page 13: ASM 2013 Metagenomic Assembly Workshop Slides](https://reader034.vdocuments.site/reader034/viewer/2022052522/554e8062b4c905f66a8b547f/html5/thumbnails/13.jpg)
Low-abundance peak (errors)
![Page 14: ASM 2013 Metagenomic Assembly Workshop Slides](https://reader034.vdocuments.site/reader034/viewer/2022052522/554e8062b4c905f66a8b547f/html5/thumbnails/14.jpg)
High-abundance peak (true k-mers)
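The two peaks can be reproduced with a small simulation. This is a sketch under assumed parameters (a random 5 kbp "genome", 50 bp reads at ~10x coverage, 1% substitution errors, k = 15), none of which come from the slides, and it uses an exact in-memory counter rather than khmer.

```python
import random
from collections import Counter


def simulate_reads(genome, num_reads, read_len, error_rate, rng):
    """Sample reads uniformly from genome and add random substitution errors."""
    reads = []
    for _ in range(num_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        bases = list(genome[start:start + read_len])
        for i in range(read_len):
            if rng.random() < error_rate:
                bases[i] = rng.choice([b for b in "ACGT" if b != bases[i]])
        reads.append("".join(bases))
    return reads


def kmer_spectrum(reads, k):
    """Return (k-mer -> count, abundance -> number of distinct k-mers)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts, Counter(counts.values())


rng = random.Random(42)
genome = "".join(rng.choice("ACGT") for _ in range(5000))
reads = simulate_reads(genome, num_reads=1000, read_len=50, error_rate=0.01, rng=rng)
counts, histogram = kmer_spectrum(reads, k=15)

# Abundance 1 is dominated by erroneous k-mers; true k-mers cluster at
# higher abundance.
for abundance in sorted(histogram):
    print(abundance, histogram[abundance])
```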
![Page 15: ASM 2013 Metagenomic Assembly Workshop Slides](https://reader034.vdocuments.site/reader034/viewer/2022052522/554e8062b4c905f66a8b547f/html5/thumbnails/15.jpg)
Suppose you have a dilution factor of A (10) to B (1), that is, A is 10 times as abundant as B. To get 10x of B you need to get 100x of A! Overkill!!
This 100x will consume disk space and, because of errors, memory.
We can discard it for you…
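The arithmetic behind the 100x figure, as a small sketch. The function name is hypothetical, and the calculation assumes reads are sampled in proportion to abundance.

```python
def required_coverage(abundance_ratio: float, target_coverage: float) -> float:
    """Coverage the abundant genome reaches by the time the rare genome
    reaches target_coverage, assuming reads are sampled in proportion
    to abundance."""
    return abundance_ratio * target_coverage


target = 10  # desired coverage of the rare genome B
coverage_a = required_coverage(abundance_ratio=10, target_coverage=target)
surplus_a = coverage_a - target  # coverage of A beyond what assembly needs

print(coverage_a, surplus_a)  # 100 90
```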
![Page 16: ASM 2013 Metagenomic Assembly Workshop Slides](https://reader034.vdocuments.site/reader034/viewer/2022052522/554e8062b4c905f66a8b547f/html5/thumbnails/16.jpg)
![Page 17: ASM 2013 Metagenomic Assembly Workshop Slides](https://reader034.vdocuments.site/reader034/viewer/2022052522/554e8062b4c905f66a8b547f/html5/thumbnails/17.jpg)
![Page 18: ASM 2013 Metagenomic Assembly Workshop Slides](https://reader034.vdocuments.site/reader034/viewer/2022052522/554e8062b4c905f66a8b547f/html5/thumbnails/18.jpg)
![Page 19: ASM 2013 Metagenomic Assembly Workshop Slides](https://reader034.vdocuments.site/reader034/viewer/2022052522/554e8062b4c905f66a8b547f/html5/thumbnails/19.jpg)
![Page 20: ASM 2013 Metagenomic Assembly Workshop Slides](https://reader034.vdocuments.site/reader034/viewer/2022052522/554e8062b4c905f66a8b547f/html5/thumbnails/20.jpg)
![Page 21: ASM 2013 Metagenomic Assembly Workshop Slides](https://reader034.vdocuments.site/reader034/viewer/2022052522/554e8062b4c905f66a8b547f/html5/thumbnails/21.jpg)
![Page 22: ASM 2013 Metagenomic Assembly Workshop Slides](https://reader034.vdocuments.site/reader034/viewer/2022052522/554e8062b4c905f66a8b547f/html5/thumbnails/22.jpg)
A digital analog to cDNA library normalization, diginorm:
Is reference free;
Is single pass: looks at each read only once;
Does not “collect” the majority of errors;
Keeps all low-coverage reads;
Smooths out coverage of regions.
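The properties above can be illustrated with a minimal single-pass sketch: estimate each read's coverage as the median count of its k-mers seen so far, and keep the read only if that estimate is below a cutoff. This is not khmer's implementation or API. It uses an exact `Counter` where khmer uses a memory-efficient probabilistic counting structure, it ignores reverse complements, and the function name and default values of `k` and `cutoff` are illustrative assumptions.

```python
from collections import Counter
from statistics import median


def kmers(read, k):
    """List all k-mers in a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def digital_normalization(reads, k=20, cutoff=20):
    """Single pass: keep a read only if its estimated coverage is below cutoff."""
    counts = Counter()  # assumption: exact counts; fine only for small data
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k
        # Median k-mer count is robust to the few novel k-mers an error adds.
        estimated_coverage = median(counts[km] for km in read_kmers)
        if estimated_coverage < cutoff:
            kept.append(read)
            counts.update(read_kmers)  # only kept reads contribute counts
    return kept


# Hypothetical data: one sequence sampled 50 times, another sampled once.
abundant = "ACGTTGCATGGCTAAGCTTAGC"
rare = "TTGACCGATCCAGGTCAATG"
kept = digital_normalization([abundant] * 50 + [rare], k=5, cutoff=3)
print(len(kept))  # 4: three copies of the abundant read plus the rare read
```

Because discarded reads never update the counts, most of the erroneous k-mers in high-coverage regions are never stored, while the low-coverage read is retained.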
![Page 23: ASM 2013 Metagenomic Assembly Workshop Slides](https://reader034.vdocuments.site/reader034/viewer/2022052522/554e8062b4c905f66a8b547f/html5/thumbnails/23.jpg)
Digital normalization produces “good” metagenome assemblies.
Smooths out abundance variation, strain variation.
Reduces computational requirements for assembly.
It also kinda makes sense :)
![Page 24: ASM 2013 Metagenomic Assembly Workshop Slides](https://reader034.vdocuments.site/reader034/viewer/2022052522/554e8062b4c905f66a8b547f/html5/thumbnails/24.jpg)
Split reads into “bins” belonging to different source species.
Can do this based almost entirely on connectivity of sequences.
“Divide and conquer”
Memory-efficient implementation helps to scale assembly.
Pell et al., 2012, PNAS
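A toy sketch of connectivity-based partitioning: reads that share any k-mer are merged into the same bin with a union-find structure. This simplifies the approach cited above in several labeled ways. It stores k-mers in an exact dictionary rather than a probabilistic graph representation, it links reads through shared k-mers instead of traversing a de Bruijn graph, and it ignores reverse complements. The function name and the example sequences are hypothetical.

```python
from collections import defaultdict


def partition_reads(reads, k):
    """Group reads into connected components linked by shared k-mers."""
    parent = list(range(len(reads)))

    def find(i):
        # Find the component root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # k-mer -> index of the first read containing it
    for idx, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in first_seen:
                a, b = find(idx), find(first_seen[kmer])
                if a != b:
                    parent[a] = b  # merge the two components
            else:
                first_seen[kmer] = idx

    partitions = defaultdict(list)
    for idx, read in enumerate(reads):
        partitions[find(idx)].append(read)
    return list(partitions.values())


# Hypothetical overlapping reads drawn from two unrelated sequences.
genome_a = "ACGTTGCATGGCTAAGCTTAGC"
genome_b = "TTGACCGATCCAGGTCAATG"
reads = [genome_a[0:12], genome_b[0:12], genome_a[6:18], genome_b[6:20], genome_a[12:22]]
partitions = partition_reads(reads, k=5)
print(sorted(len(p) for p in partitions))  # [2, 3]
```

Each partition can then be assembled on its own, which is where the memory savings come from.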
![Page 25: ASM 2013 Metagenomic Assembly Workshop Slides](https://reader034.vdocuments.site/reader034/viewer/2022052522/554e8062b4c905f66a8b547f/html5/thumbnails/25.jpg)
![Page 26: ASM 2013 Metagenomic Assembly Workshop Slides](https://reader034.vdocuments.site/reader034/viewer/2022052522/554e8062b4c905f66a8b547f/html5/thumbnails/26.jpg)
![Page 27: ASM 2013 Metagenomic Assembly Workshop Slides](https://reader034.vdocuments.site/reader034/viewer/2022052522/554e8062b4c905f66a8b547f/html5/thumbnails/27.jpg)
![Page 28: ASM 2013 Metagenomic Assembly Workshop Slides](https://reader034.vdocuments.site/reader034/viewer/2022052522/554e8062b4c905f66a8b547f/html5/thumbnails/28.jpg)
Low coverage is the dominant problem blocking assembly of your soil metagenome.
![Page 29: ASM 2013 Metagenomic Assembly Workshop Slides](https://reader034.vdocuments.site/reader034/viewer/2022052522/554e8062b4c905f66a8b547f/html5/thumbnails/29.jpg)
In order to build assemblies, each assembler makes choices – uses heuristics – to reach a conclusion.
These heuristics may not be appropriate for your sample!
High polymorphism?
Mixed population vs clonal?
Genomic vs metagenomic vs mRNA
Low coverage drives differences in assembly.
![Page 30: ASM 2013 Metagenomic Assembly Workshop Slides](https://reader034.vdocuments.site/reader034/viewer/2022052522/554e8062b4c905f66a8b547f/html5/thumbnails/30.jpg)
![Page 31: ASM 2013 Metagenomic Assembly Workshop Slides](https://reader034.vdocuments.site/reader034/viewer/2022052522/554e8062b4c905f66a8b547f/html5/thumbnails/31.jpg)
We can assemble virtually anything but soil ;).
Genomes, transcriptomes, MDA, mixtures, etc.
Repeat resolution will be fundamentally limited by sequencing technology (insert size; sampling depth).
Strain variation confuses assembly, but does not prevent useful results.
Diginorm is a systematic strategy to enable assembly.
Banfield has shown how to deconvolve strains at differential abundance.
Kostas K.'s results suggest that there will be a species gap sufficient to prevent contig misassembly.
![Page 32: ASM 2013 Metagenomic Assembly Workshop Slides](https://reader034.vdocuments.site/reader034/viewer/2022052522/554e8062b4c905f66a8b547f/html5/thumbnails/32.jpg)
Most metagenomes require 50-150 GB of RAM.
Many people don’t have access to computers of that size.
Amazon Web Services (aws.amazon.com) will happily rent you such computers for $1-2/hr.
http://ged.msu.edu/angus/2013-hmp-assembly-webinar/index.html
![Page 33: ASM 2013 Metagenomic Assembly Workshop Slides](https://reader034.vdocuments.site/reader034/viewer/2022052522/554e8062b4c905f66a8b547f/html5/thumbnails/33.jpg)
Optimizing our programs => faster.
Building an evaluation framework for metagenome assemblers.
Error correction!
![Page 34: ASM 2013 Metagenomic Assembly Workshop Slides](https://reader034.vdocuments.site/reader034/viewer/2022052522/554e8062b4c905f66a8b547f/html5/thumbnails/34.jpg)
Achieving one or more assemblies is fairly straightforward.
Each assembly is a hypothesis, however. Evaluating assemblies is challenging, and it is where you should be thinking hardest.
There are relatively few pipelines available for analyzing assembled metagenomic data.
![Page 35: ASM 2013 Metagenomic Assembly Workshop Slides](https://reader034.vdocuments.site/reader034/viewer/2022052522/554e8062b4c905f66a8b547f/html5/thumbnails/35.jpg)
Questions?
![Page 36: ASM 2013 Metagenomic Assembly Workshop Slides](https://reader034.vdocuments.site/reader034/viewer/2022052522/554e8062b4c905f66a8b547f/html5/thumbnails/36.jpg)
How do we study complexity? Interactions? Diversity? Communities? Evolution? Our environment?
Visual Complexity http://www.flickr.com/photos/maisonbisson
• Major efforts of data collection
• Open mind for discoveries
• Willingness to adjust to change
• Multiple efforts
• Well-designed experiments
Workshop example: Illumina deep sequencing and scaling large datasets on soil metagenomes
![Page 37: ASM 2013 Metagenomic Assembly Workshop Slides](https://reader034.vdocuments.site/reader034/viewer/2022052522/554e8062b4c905f66a8b547f/html5/thumbnails/37.jpg)
We receive Gb of sequences
Generally, my data is…
Split by barcodes
Untrimmed
Adapters are present
Two paired-end FASTQ files
Underestimation of computational requirements:
Quality control steps usually require 2-3 times the amount of hard drive space
Similarity comparison against known databases impractical (soil metagenome ~50 years to BLAST)
Home Alone Scream
My first slide graphic that I’m scared may date me.
![Page 38: ASM 2013 Metagenomic Assembly Workshop Slides](https://reader034.vdocuments.site/reader034/viewer/2022052522/554e8062b4c905f66a8b547f/html5/thumbnails/38.jpg)
Two ways to reduce the onslaught:
Cluster into known observances (annotate, bin)
Assembly
Some mix of the above
![Page 39: ASM 2013 Metagenomic Assembly Workshop Slides](https://reader034.vdocuments.site/reader034/viewer/2022052522/554e8062b4c905f66a8b547f/html5/thumbnails/39.jpg)
Ten of you upload 1 HiSeq flowcell into MG-RAST
![Page 40: ASM 2013 Metagenomic Assembly Workshop Slides](https://reader034.vdocuments.site/reader034/viewer/2022052522/554e8062b4c905f66a8b547f/html5/thumbnails/40.jpg)
Illumina short reads from soil metagenome (~100 bp)
454 short reads from soil metagenome (~368 bp)
Assembled contigs (Illumina) from soil metagenome (~491 bp)
Read length will increase… computational requirements? Assembly is a great way to reduce data.