de-novo assembly
DESCRIPTION
De-novo Assembly. Day 4. The Challenge. Given many (millions or billions) of reads, produce a linear (or perhaps circular) genome Issues : Coverage Errors in reads Reads vary from very short (35bp) to quite long (800bp), and are double-stranded Non-uniqueness of solution - PowerPoint PPT PresentationTRANSCRIPT
De-novo Assembly
Day 4
The Challenge
• Given many (millions or billions) of reads, produce a linear (or perhaps circular) genome
• Issues: – Coverage– Errors in reads– Reads vary from very short (35bp) to quite long (800bp),
and are double-stranded– Non-uniqueness of solution– Running time and memory
In an ideal world…
• Reads have no error
• Reads are long enough and each appears once
• Each read given in the same orientation (all 5’ to 3’, for example)
• Contiguous reads have enough overlap so that they can easily be assembled
DNA target sampleSHEAR & SIZE
End Reads / Mate Pairs
550bp
10,000bp
Shotgun DNA Sequencing
Not all sequencing technologies produce mate-pairs. Different error modelsDifferent read lengths
Terminology
• Reads are what you start with (35bp-800bp)
• Fragmented assemblies produce contigs
• Contigs can be put together into scaffolds
Reference-based vs de novo assembly
• Comparative assembly means you have a “reference” genome– Same OR related species– Can get weird in bacteria and viruses due to recombination
• De novo assembly means you do everything from scratch
Comparative Assembly
Much easier than de novo!
Basic idea: • Take the reads and map them onto the
reference (allowing for small mismatches)• Collect all overlapping reads, produce a
multiple sequence alignment, and produce consensus sequence
Comparative Assembly
• Fast
• Short reads can map to several places (especially if they have errors)
• Needs close reference genome
• Repeats are problematic
• Can be highly accurate even when reads have errors
De Novo assembly
• Much easier to do with long reads
• Need very good coverage
• Generally produces fragmented assemblies
• Necessary when you don’t have a closely related (and correctly assembled) reference genome
Conceptual steps in de novo assembly
1. Find reads that overlap by a specified number of bases (the k-mer size)
2. Merge overlapping, “good” reads into longer contigs
3. Link contigs to form scaffolds using paired-end information
Diagrams from Serafim Batzoglou, Stanford
11
De Novo Assembly paradigms
• Overlap-layout-consensus methods
• k-mer graph (especially useful for assembly from short reads)– aka de Bruijn graph
de Bruijn graph hask-mers as nodes connected by reads;
assembly involves finding Eulerian path through graph
Diagram from Michael Schatz, Cold Spring Harbor
de Bruijn graph• Vertices are k-mers that appear in some read, and edges defined by
overlap of k-1 nucleotides
• Small values of k produce small graphs
• Does not require all-pairs overlap calculation!
• But: loss of information about reads can lead to “chimeric” contigs, and incorrect assemblies
• Also produces fragmented assemblies (even shorter contigs)
de Bruijn Graph use
• Create the de Bruijn graph for the following string, using k=5– ACATAGGATTCAC
• Find the Eulerian path
• Is the Eulerian path unique?
• Reconstruct the sequence from this path
But…
• Because of – Errors in reads– Repeats– Insufficient coveragede Bruijn graphs generally don’t Eulerian paths/circuits
• This means the first step doesn’t completely assemble the genome
Velvet and SOAPdenovo are two leading assemblers
• Velvet is from EMBL-EBI– http://www.ebi.ac.uk/~zerbino/velvet
• SOAPdenovo is from BGI– http://soap.genomics.org.cn/soapdenovo.html
• k-mer size is adjustable parameter– Typically it is adjusted to maximize N50 length of scaffolds or contigs– N50 length is central measure of distribution weighted by lengths
NOTE: QC step slightly different
1. You still do standard checks with something like FASTQC
2. Read Trimming is different– Algorithms/tools can deal with different read lengths– Finding overlaps give longer contigs– So we never want to sacrifice good reads
3. Solution: 1. remove sequencing adapters 2. trim individual reads as needed
• http://www.usadellab.org/cms/index.php?page=trimmomatic
Three useful measures for optimization of k-mer length
• Mean length: the usual average of the lengths
• Median length: the length for which half the sequences are shorter & half are longer
• N50 length: the length that splits the total bases in half, after the lengths are ordered
• Example values for distribution of contig lengths:– Mean length: 627– Median length: 200– N50 length: 1,718
• We’ll look at using N50 in in the practical
Practical prep
• Download and install– TRIMMOMATIC:
http://www.usadellab.org/cms/index.php?page=trimmomatic
– VELVET: http://www.ebi.ac.uk/~zerbino/velvet/• Use make ’MAXKMERLENGTH=70’ when compiling
• Grab practical dataset from course page