Metagenome sequence assembly (CABBIO 20150629, Buenos Aires)
Bas E. Dutilh
Bacteriophages: Basic and Molecular Aspects. Biotechnological Applications
Buenos Aires, June 29th 2015
Shotgun sequence assembly
Sequencing specs*

Method       Read length (bp)   Accuracy   Million reads   Time        Cost per M
454          100-700            99%        1               1 day       $10
Illumina     50-300             98%        3,000           1-2 days    $0.10
IonTorrent   100-400            98%        40-80           2 hours     $1
PacBio       1,000-30,000       87%        0.05            2 hours     $1
Sanger       400-1,200          99.9%      n/a             2 hours     $2,400
SOLiD        50                 99.9%      1,200           1-2 weeks   $0.13

* these numbers change all the time!
Lengths of reads and genomes
NGS technologies provide reads of 50 to at most 30,000 bp, but most genomes are much longer
Gago, Science 2009
Nucleotide codes

Code   Description              Bases     Number of bases
A      Adenine                  A         1
C      Cytosine                 C         1
G      Guanine                  G         1
T      Thymine                  T         1
U      Uracil                   U         1
W      Weak                     A T       2
S      Strong                   C G       2
M      aMino                    A C       2
K      Keto                     G T       2
R      puRine                   A G       2
Y      pYrimidine               C T       2
B      not A (B after A)        C G T     3
D      not C (D after C)        A G T     3
H      not G (H after G)        A C T     3
V      not T (V after T/U)      A C G     3
N      aNy base (not a gap)     A C G T   4
-      Gap (no nucleotide)                0
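A minimal Python sketch of the code table, assuming all we need is a lookup from each code to the bases it allows. The names `IUPAC`, `matches`, and `expand` are illustrative and do not come from the slides.

```python
from itertools import product

# Codes and bases as listed in the table; the gap symbol maps to no base.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "W": "AT", "S": "CG", "M": "AC", "K": "GT", "R": "AG", "Y": "CT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT", "-": "",
}


def matches(code, base):
    """True if the (possibly ambiguous) code allows this base."""
    return base.upper() in IUPAC[code.upper()]


def expand(sequence):
    """List every unambiguous sequence an IUPAC string can stand for (gaps are skipped)."""
    choices = [IUPAC[c] for c in sequence.upper() if c != "-"]
    return ["".join(combo) for combo in product(*choices)]


print(expand("AYG"))  # Y stands for C or T
```

Note that the number of expansions multiplies with every ambiguous position, so `expand` is only practical for short strings.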
Sequence File Formats
• Different file formats for different uses
• Competing formats developed in parallel
• Some easy to read, some easy to parse
Fasta
• Simplest sequence file format
• Unique identifiers!
• “Fasta wide” format has the whole sequence on one line
• Even easier to parse in a computer script

>identifier1 [optional information]
CCGATCATATGACTAGCATGCATCGATCGATCGACTAGCATTTAGAGCTACGATCAGCACTACACGCTTTGTATGATTGGCGGCGGCTATTATATTGGGA
>identifier2 [optional information]
GAGAGCTACGATCAGAGCTACGATCAGCACTACACGCTTTGTATGATTGGCCCCCTATATTGGGACACGATCAGCACTACACGCTTTGTATGATTGGCGGCGGCTATCCGATCAT
Fastq
• Based on Fasta format
• Contains information about quality of each nucleotide
• Quality estimated by sequencing machine
• Four lines per sequence:
1. Identifier line starting with @
2. DNA sequence on one line
3. Second identifier line starting with + (identifier optional)
4. String of quality scores on one line

@SRR014849.1 EIXKN4201CFU84 length=93
GGAAATGGAGTACGGATCGATTTTGTTTGGAACCGAAAGGGTC
+
hhhhhhhhhhhhhhhh7F@71,'";C?,B;?6B;:EA1EA1E%
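The four-line layout makes Fastq simple to read in a script. The sketch below is one possible reader in Python, assuming strictly four lines per record (no wrapped sequences); `FastqRecord` and `read_fastq` are hypothetical names chosen for this example.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record per four lines: @identifier, sequence, +, quality string."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a trailing blank line)
            break
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("record does not follow the four-line Fastq layout")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality string differ in length")
        yield FastqRecord(header[1:], sequence, quality)


# The record shown on the slide (sequence and qualities as displayed there).
example = """@SRR014849.1 EIXKN4201CFU84 length=93
GGAAATGGAGTACGGATCGATTTTGTTTGGAACCGAAAGGGTC
+
hhhhhhhhhhhhhhhh7F@71,'";C?,B;?6B;:EA1EA1E%
"""

records = list(read_fastq(io.StringIO(example)))
print(records[0].identifier, len(records[0].sequence))
```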
Quality scores
• Phred 10: $10^{-1}$ chance that the base is wrong (90% accuracy; 10% error rate)
• Phred 20: $10^{-2}$ chance that the base is wrong (99% accuracy; 1% error rate)
• Phred 30: $10^{-3}$ chance that the base is wrong (99.9% accuracy; 0.1% error rate)
• Etcetera
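The pattern in these examples is Q = -10 · log10(p), where p is the chance that the base is wrong. A short Python sketch of the conversion in both directions (the function names are illustrative):

```python
import math


def phred_to_error_probability(q):
    """Phred score Q -> probability that the base call is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Probability that the base call is wrong -> Phred score Q."""
    return -10 * math.log10(p)


for q in (10, 20, 30):
    p = phred_to_error_probability(q)
    print(f"Phred {q}: error rate {p:.4f}, accuracy {1 - p:.2%}")
```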
ASCII character codes
• Fastq quality score: Phred score + 33, converted to ASCII text
• Note: old Illumina format was different!

@SRR014849.1 EIXKN4201CFU84 length=93
GGAAATGGAGTACGGATCGATTTTGTTTGGAACCGAAAGGGTC
+
hhhhhhhhhhhhhhhh7F@71,'";C?,B;?6B;:EA1EA1E%
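Decoding a quality string is the reverse of the rule above: take each character's ASCII code and subtract 33. A minimal Python sketch, where `decode_quality` and its `offset` parameter are names made up for this example (the offset is left adjustable because, as noted, older Illumina files used a different encoding):

```python
def decode_quality(quality, offset=33):
    """Turn a Fastq quality string into a list of Phred scores."""
    return [ord(character) - offset for character in quality]


# Quality string of the record shown on the slide.
quality = "hhhhhhhhhhhhhhhh7F@71,'\";C?,B;?6B;:EA1EA1E%"
scores = decode_quality(quality)
print(scores[:3], scores[-3:])
```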
Random genome, random coverage
• Average depth:
– Genome size G
– Base depth B=40x
– Read length L=100 bp
– K-mer size K=25 bp
$$C = B \cdot \frac{L - K + 1}{L}$$
• Uncovered bases:
$$u = G \cdot e^{-C}$$
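A worked example of the two quantities in Python, using the values on the slide (B = 40x, L = 100 bp, K = 25 bp) and assuming the uncovered-bases estimate has the usual exponential form u = G · e^(-C). The genome size of 5 Mbp is a hypothetical value picked only for illustration.

```python
import math


def kmer_coverage(base_depth, read_length, k):
    """C = B * (L - K + 1) / L: depth expressed in k-mers instead of bases."""
    return base_depth * (read_length - k + 1) / read_length


def expected_uncovered(genome_size, coverage):
    """u = G * exp(-C): expected number of positions with no coverage."""
    return genome_size * math.exp(-coverage)


GENOME_SIZE = 5_000_000  # hypothetical genome size in bp, not from the slides

c = kmer_coverage(base_depth=40, read_length=100, k=25)
u = expected_uncovered(GENOME_SIZE, c)
print(f"C = {c:.1f}x, expected uncovered positions = {u:.2e}")
```

At this depth the expected number of uncovered positions is far below one; lowering `base_depth` to a few x shows how quickly gaps appear.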
CTAGTATTATTGCTGCTCATAAAGTAGCTCCAGCTCATCTTGATACTAATGCTTTTTGTAATCTTATTGGTTGGCTTAAACCTAAAAGAGTTGAAGTTAA
Sequence assembly
• Reads
• Scaffold
– Order and orientation of contigs
– Sizes of the gaps between contigs (filled with NNN)
• Contigs
– Consensus sequence of assembled reads
– Includes alignment of all reads
Assembly of shotgun sequences
• Human genome project (2000)
– 1-2 kb Sanger reads
– < 10x coverage
– Low error rate
• High-throughput (meta-)genomics (now)
– Millions/billions of ~100-400 bp reads
– Mix of genomes with different coverage
– Biases and sequencing errors
  • Quality drops towards the end of reads
  • Homopolymers may be miscalled in 454 or Ion Torrent
Assembly strategies
• Reference-guided assembly
– Align reads to a (database of) reference genome(s)
– Cannot discover:
  • Larger genomic mutations
    – Insertions, deletions, rearrangements
  • Distantly related species
  • Most viruses
• De novo assembly
– Requires sufficient coverage x depth
– Breaks on repeats and low-coverage regions
– Algorithms
  • Greedy assembly (only to illustrate)
  • Overlap-layout-consensus
  • De Bruijn graph
Reference-guided assembly
• Illumina sequencing of community DNA
• Same-species genome available (2.8M nt)
• Sometimes, only a minority of the reads can be mapped/aligned
Distant reference
• Natural diversity of community
– “Species” share >94% average nucleotide identity
– Consensus = “average” of the species
[Figure: reference and consensus within genome space]
Konstantinidis and Tiedje, PNAS 2004
Iterative mapping and assembly
• The assembly is a better representation of the community
• Can we further approach the consensus genome by re-mapping the reads against this first assembly?
[Figure: reference, first assembly, and consensus within genome space]
Dutilh et al. Bioinformatics 2009
Greedy assembly
1. Sequences (reads)
2. Pairwise all-vs-all similarities
3. Find best matching pair
4. Collapse/assemble
• Works well for few, long reads (Sanger)
– All-vs-all calculations are expensive
– One clear best match
• Does not work for high throughput NGS datasets
– Many reads -> expensive to calculate
– Low coverage requires graph approach
(reads/contigs)
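The greedy loop can be sketched in a few lines of Python. This is a toy illustration under strong assumptions: overlaps are exact suffix-prefix matches, reads are error-free and all on the same strand, and no read is contained in another. The five reads and the 100 bp sequence are the example used elsewhere in these slides; the function names and `min_overlap=10` are choices made for this sketch.

```python
READS = [
    "TTTGTAATCTTATTGGTTGGCTTAAACCTAAAAGAGTTGAAGTTAA",
    "CTAGTATTATTGCTGCTCATAAAGTAGCTCCAGCTCATCTTGATACTAAT",
    "CTCATAAAGTAGCTCCAGCTCATCTTGATACTAATGCTTTTTGTAATCT",
    "GTAGCTCCAGCTCATCTTGATACTAATGCTTTTTGTAATCTTATTGGTT",
    "CTTGATACTAATGCTTTTTGTAATCTTATTGGTTGGCTTAAAC",
]
GENOME = (
    "CTAGTATTATTGCTGCTCATAAAGTAGCTCCAGCTCATCTTGATACTAAT"
    "GCTTTTTGTAATCTTATTGGTTGGCTTAAACCTAAAAGAGTTGAAGTTAA"
)


def overlap_length(a, b, min_overlap):
    """Longest suffix of a that is also a prefix of b (exact match), or 0."""
    for length in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def greedy_assemble(reads, min_overlap=10):
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        # Steps 2 and 3: all-vs-all comparison, keep the best matching pair.
        best_length, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    length = overlap_length(a, b, min_overlap)
                    if length > best_length:
                        best_length, best_i, best_j = length, i, j
        if best_length == 0:
            break  # nothing left to join
        # Step 4: collapse the pair into one contig and repeat.
        merged = contigs[best_i] + contigs[best_j][best_length:]
        contigs = [c for n, c in enumerate(contigs) if n not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


contigs = greedy_assemble(READS)
print(len(contigs), contigs[0] == GENOME)
```

Every round repeats the all-vs-all comparison, which is why this approach does not scale to millions of reads.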
Repetitive sequences
• Reads A-D are from a region with two long repeats
• Greedy approach would first join A-D with the largest overlap, and place B-C in a separate contig
• Resolving this requires a global view of all the possibilities before joining two reads: a graph
[Figure: reads A, B, C, and D across a region with two repeats, and the contigs that result]
Assembly as a “graph” problem
• Overlap-layout-consensus
• De Bruijn graph
• A graph contains nodes and edges
[Figure: a node and an edge]
Overlap-layout-consensus
1. Identify all overlaps between reads
– Use cutoffs: minimum overlap and percent identity
2. Make graph of overlap connections
– Nodes: reads
– Edges: overlaps
3. Find Hamiltonian path
– Path that contains every node once
– No efficient algorithm available
4. Determine consensus at each position

[Figure: five example reads, drawn as nodes J, K, L, M, and N of an overlap graph, above the sequence they come from]
Reads:
CTAGTATTATTGCTGCTCATAAAGTAGCTCCAGCTCATCTTGATACTAAT
CTCATAAAGTAGCTCCAGCTCATCTTGATACTAATGCTTTTTGTAATCT
GTAGCTCCAGCTCATCTTGATACTAATGCTTTTTGTAATCTTATTGGTT
CTTGATACTAATGCTTTTTGTAATCTTATTGGTTGGCTTAAAC
TTTGTAATCTTATTGGTTGGCTTAAACCTAAAAGAGTTGAAGTTAA
Sequence:
CTAGTATTATTGCTGCTCATAAAGTAGCTCCAGCTCATCTTGATACTAATGCTTTTTGTAATCTTATTGGTTGGCTTAAACCTAAAAGAGTTGAAGTTAA
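The overlap-layout-consensus steps can be tried on the five example reads with a small Python sketch. Several simplifications are assumed here: overlaps are exact (no percent-identity cutoff), the Hamiltonian path is found by brute force over all read orders (only feasible for a handful of reads, in line with the lack of an efficient algorithm), and the "consensus" is a plain merge along the path rather than a per-position vote. All function names and the `min_overlap=10` cutoff are illustrative.

```python
from itertools import permutations

READS = [
    "TTTGTAATCTTATTGGTTGGCTTAAACCTAAAAGAGTTGAAGTTAA",
    "CTAGTATTATTGCTGCTCATAAAGTAGCTCCAGCTCATCTTGATACTAAT",
    "CTCATAAAGTAGCTCCAGCTCATCTTGATACTAATGCTTTTTGTAATCT",
    "GTAGCTCCAGCTCATCTTGATACTAATGCTTTTTGTAATCTTATTGGTT",
    "CTTGATACTAATGCTTTTTGTAATCTTATTGGTTGGCTTAAAC",
]
GENOME = (
    "CTAGTATTATTGCTGCTCATAAAGTAGCTCCAGCTCATCTTGATACTAAT"
    "GCTTTTTGTAATCTTATTGGTTGGCTTAAACCTAAAAGAGTTGAAGTTAA"
)


def overlap_length(a, b, min_overlap):
    """Longest suffix of a that is also a prefix of b (exact match), or 0."""
    for length in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def overlap_graph(reads, min_overlap=10):
    """Steps 1 and 2: nodes are read indices, edges (i, j) carry the overlap length."""
    edges = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                length = overlap_length(a, b, min_overlap)
                if length:
                    edges[(i, j)] = length
    return edges


def hamiltonian_paths(n_reads, edges):
    """Step 3, by brute force: every read order in which consecutive reads overlap."""
    return [
        order
        for order in permutations(range(n_reads))
        if all(pair in edges for pair in zip(order, order[1:]))
    ]


def merge_along_path(path, reads, edges):
    """Step 4, simplified: append the non-overlapping part of each next read."""
    sequence = reads[path[0]]
    for a, b in zip(path, path[1:]):
        sequence += reads[b][edges[(a, b)]:]
    return sequence


edges = overlap_graph(READS)
paths = hamiltonian_paths(len(READS), edges)
assembled = merge_along_path(paths[0], READS, edges)
print(paths, assembled == GENOME)
```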
De Bruijn graph
1. Find every word of length k (k-mer) in every read
– K-mer should be long enough to be quite unique, but
– … short enough to not break on polymorphisms/errors

[Figure: the example reads (nodes J, K, L, M, and N) above their source sequence, with one read split into its overlapping k-mers]
Read:
CTTGATACTAATGCTTTTTGTAATCTTATTGGTTGGCTTAAAC
K-mers:
CTTGATACTAATGCTTTTTGTAATCTTAT
 TTGATACTAATGCTTTTTGTAATCTTATT
  TGATACTAATGCTTTTTGTAATCTTATTG
   GATACTAATGCTTTTTGTAATCTTATTGG
    ATACTAATGCTTTTTGTAATCTTATTGGT
     TACTAATGCTTTTTGTAATCTTATTGGTT
      ACTAATGCTTTTTGTAATCTTATTGGTTG
       CTAATGCTTTTTGTAATCTTATTGGTTGG
        TAATGCTTTTTGTAATCTTATTGGTTGGC
         AATGCTTTTTGTAATCTTATTGGTTGGCT
          ATGCTTTTTGTAATCTTATTGGTTGGCTT
           TGCTTTTTGTAATCTTATTGGTTGGCTTA
            GCTTTTTGTAATCTTATTGGTTGGCTTAA
             CTTTTTGTAATCTTATTGGTTGGCTTAAA
              TTTTTGTAATCTTATTGGTTGGCTTAAAC
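Step 1 is a sliding window over each read. A minimal Python sketch, using the 43 bp example read from the slide and the k-mer length of 29 seen in its k-mer list (`kmers` is an illustrative name):

```python
def kmers(read, k):
    """All overlapping words of length k in a read, in order of position."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


read = "CTTGATACTAATGCTTTTTGTAATCTTATTGGTTGGCTTAAAC"
words = kmers(read, 29)
print(len(words))  # a read of length n yields n - k + 1 k-mers
for offset, word in enumerate(words[:3]):
    print(" " * offset + word)
```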
De Bruijn graph
2. Make graph of sequential k-mers in sequence
– Nodes: k-mers
– Edges: sequential presence of k-mers in reads
[Figure: the example reads and k-mers, with nodes labeled J, K, L, M, and N]
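Step 2 can be sketched in Python by counting, for every pair of consecutive k-mers, how many reads contain it. Nodes are k-mers and edges link k-mers that follow each other in a read, as defined above; the count per edge corresponds to read depth. The value k = 15 is an arbitrary choice for this toy example, and `de_bruijn_edges` is an illustrative name.

```python
from collections import Counter

READS = [
    "TTTGTAATCTTATTGGTTGGCTTAAACCTAAAAGAGTTGAAGTTAA",
    "CTAGTATTATTGCTGCTCATAAAGTAGCTCCAGCTCATCTTGATACTAAT",
    "CTCATAAAGTAGCTCCAGCTCATCTTGATACTAATGCTTTTTGTAATCT",
    "GTAGCTCCAGCTCATCTTGATACTAATGCTTTTTGTAATCTTATTGGTT",
    "CTTGATACTAATGCTTTTTGTAATCTTATTGGTTGGCTTAAAC",
]


def de_bruijn_edges(reads, k):
    """Count each edge (k-mer, next k-mer); the count is the number of linking reads."""
    edges = Counter()
    for read in reads:
        for i in range(len(read) - k):
            edges[(read[i:i + k], read[i + 1:i + k + 1])] += 1
    return edges


edges = de_bruijn_edges(READS, k=15)
nodes = {kmer for edge in edges for kmer in edge}
print(len(nodes), "nodes,", len(edges), "edges, max depth", max(edges.values()))
```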
De Bruijn graph
3. Find Eulerian path
– Path that contains every edge once
– Efficient algorithm available
• In an optimal sequencing run of a repeat-less genome, there is one path connecting all nodes
• In practice (especially in metagenomes) there are many possible structures in the graph
• Edge width represents the number of linking reads (depth)

CTAGTATTATTGCTGCTCATAAAGTAGCTCCAGCTCATCTTGATACTAATGCTTTTTGTAATCTTATTGGTTGGCTTAAACCTAAAAGAGTTGAAGTTAA
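One efficient way to find an Eulerian path is Hierholzer's algorithm, sketched below in Python on the five example reads. Assumptions of this sketch: k = 15 is an arbitrary choice, each distinct edge is stored once (so a repeat that must be traversed several times is not handled), and the graph is assumed to contain an Eulerian path, which holds for this repeat-less toy sequence. Function names are illustrative.

```python
from collections import defaultdict

READS = [
    "TTTGTAATCTTATTGGTTGGCTTAAACCTAAAAGAGTTGAAGTTAA",
    "CTAGTATTATTGCTGCTCATAAAGTAGCTCCAGCTCATCTTGATACTAAT",
    "CTCATAAAGTAGCTCCAGCTCATCTTGATACTAATGCTTTTTGTAATCT",
    "GTAGCTCCAGCTCATCTTGATACTAATGCTTTTTGTAATCTTATTGGTT",
    "CTTGATACTAATGCTTTTTGTAATCTTATTGGTTGGCTTAAAC",
]
GENOME = (
    "CTAGTATTATTGCTGCTCATAAAGTAGCTCCAGCTCATCTTGATACTAAT"
    "GCTTTTTGTAATCTTATTGGTTGGCTTAAACCTAAAAGAGTTGAAGTTAA"
)


def de_bruijn_graph(reads, k):
    """Map each k-mer to the set of k-mers that directly follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            graph[read[i:i + k]].add(read[i + 1:i + k + 1])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm: walk unused edges, backtrack when stuck."""
    unused = {node: sorted(targets) for node, targets in graph.items()}
    in_degree = defaultdict(int)
    for targets in unused.values():
        for target in targets:
            in_degree[target] += 1
    # Start where out-degree exceeds in-degree by one, if such a node exists.
    start = next(iter(unused))
    for node, targets in unused.items():
        if len(targets) - in_degree.get(node, 0) == 1:
            start = node
            break
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if unused.get(node):
            stack.append(unused[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path):
    """Consecutive k-mers overlap by k - 1, so each step adds one base."""
    return path[0] + "".join(kmer[-1] for kmer in path[1:])


path = eulerian_path(de_bruijn_graph(READS, k=15))
print(len(path), spell(path) == GENOME)
```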
Possible structures in De Bruijn graphs
• Cycle: path converges on itself
– Repeated region on the same contig
• Frayed rope: paths converge then diverge
– Repeated region on different contigs
• Bubble: paths diverge then converge
– Sequencing error in the middle of a read
– Polymorphisms
• Spur: short dead-ends
– Sequencing error at the end of a read
– Zero coverage shortly after end of repeat
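These structures show up as nodes with more than one successor or predecessor, or with no successor at all. The Python sketch below only counts degrees, so it flags candidate branch points and dead ends rather than classifying full cycles or bubbles. It uses two of the example reads and introduces a hypothetical sequencing error in the last base of the first one to produce a spur; k = 15 and all names are choices made for this illustration.

```python
from collections import defaultdict

CORRECT = "CTAGTATTATTGCTGCTCATAAAGTAGCTCCAGCTCATCTTGATACTAAT"
WITH_ERROR = CORRECT[:-1] + "G"  # hypothetical miscall of the final base (T -> G)
OTHER = "CTCATAAAGTAGCTCCAGCTCATCTTGATACTAATGCTTTTTGTAATCT"


def de_bruijn_graph(reads, k):
    """Map each k-mer to the set of k-mers that directly follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            graph[read[i:i + k]].add(read[i + 1:i + k + 1])
    return graph


def classify_nodes(graph):
    """Return (branching, merging, dead_ends) based on node degrees only."""
    in_degree = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            in_degree[target] += 1
    nodes = set(graph) | set(in_degree)
    branching = {n for n in nodes if len(graph.get(n, ())) > 1}  # paths diverge
    merging = {n for n in nodes if in_degree.get(n, 0) > 1}      # paths converge
    dead_ends = {n for n in nodes if len(graph.get(n, ())) == 0}  # spur tips or true ends
    return branching, merging, dead_ends


graph = de_bruijn_graph([WITH_ERROR, OTHER], k=15)
branching, merging, dead_ends = classify_nodes(graph)
print(len(branching), len(merging), len(dead_ends))
```

Here the erroneous k-mer is a dead end one step after the branch point, which is the spur pattern; the other dead end is simply the end of the second read.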
Examples of De Bruijn graphs
[Figure: De Bruijn graphs showing sequencing errors in a random circular sequence; panels: 1%, 5%, 10%, 15%] Pell PNAS 2012
[Figure: De Bruijn graph of five E. coli subspecies] Peng Bioinformatics 2011
Random versus real sequences
• Biological sequences are not random
– Genes, operons, promoters, etcetera
– Biased nucleotide usage (GC content)
– Biased oligonucleotide usage (k-mers)
• Repeated sequences in (meta-)genomes
– Low-complexity regions
– Conserved protein domains
– Duplicated genes, horizontal transfers
– “Selfish” elements (e.g. transposons, prophages)
– Polymorphic repeats (haplotypes, strains)
– …etcetera
Repeats have multiple sinks/sources
16s
Salmonella has 7 rrn operons
Salmonella recombines at rrn operons
Helm and Maloy
Genome versus metagenome
Genome:
• Depending on coverage
– Expect single sequence
– Contiguous sequence
– Even read depth
• Clonal sequence
• Identify sequencing errors by low coverage
• Repeats consist of duplicated genes and conserved domains
Metagenome:
• Depending on diversity
– Expect many sequences
– Fragmented sequences
– Varying read depth
• Natural microdiversity
• Sequencing errors or natural diversity?
• Repeats also include closely related strains, conserved genes, etc.
Chimerization in metagenome assembly
• Both OLC and DBG include “chimera protection”
– Break contigs at ambiguities
– Works if depth/coverage is high enough
[Figure: contig1, contig2, contig3, contig4, and contig5]
• Assess final result with different parameters
– High versus low stringency assembly
• Chimerization is more frequent between closely related strains
Assembly strategies
• Reference-guided assembly
– Align reads to a (database of) reference genome(s)
– Cannot discover:
  • Larger genomic mutations
    – Insertions, deletions, rearrangements
  • Distantly related species
  • Most viruses
• De novo assembly
– Requires sufficient read lengths, depth, and coverage
– Breaks on long repeats and low-coverage regions
– Algorithms
  • Greedy assembly (only to illustrate)
  • Overlap-layout-consensus
  • De Bruijn graph