Celera Assembler
Arthur L. Delcher, Senior Research Scientist
CBCB, University of Maryland
Slides by Art Delcher, Mike Schatz, and Adam Phillippy
Center for Bioinformatics and Computational Biology
Univ. of Maryland
SIZE SELECT
e.g., 10Kbp ± 8% std.dev.
SHEAR
Shotgun DNA Sequencing (Technology)
DNA target sample
Vector
LIGATE & CLONE
Primer
End Reads (Mates)
SEQUENCE
550bp
Whole Genome Shotgun Sequencing
– Early simulations showed that if repeats were considered black boxes, one could still cover 99.7% of the genome unambiguously.
– Collect 10X sequence in a 1-to-1 ratio of two types of read pairs: ~35 million reads for Human.
Short Long
2Kbp 10Kbp
– Collect another 20X in clone coverage of 50Kbp end sequence pairs: ~1.2 million pairs for Human.
BAC 5’ BAC 3’
+ single highly automated process
+ only three library constructions
– assembly is much more difficult
Physical Mapping
Clone-by-Clone Genome Sequencing
Target
Minimum Tiling Set (~33,000 BACs for human)
Shotgun Assembly
Celera’s Sequencing Factory
300 ABI 3700 DNA Sequencers
50 Production Staff
20,000 sq. ft. of wet lab
20,000 sq. ft. of sequencing space
800 tons of A/C (160,000 cfm)
$1 million / year for electrical service
$10 million / month for reagents
Celera’s Sequencing Factory (circa 2001)
Collected 27.27 million reads = 5.11X coverage
21.04 million are paired (77%) = 10.52 million pairs
2Kbp     5.045 M    98.6% true    <6% std.dev.
10Kbp    4.401 M    98.6% true    <8% std.dev.
50Kbp    1.071 M    90.0% true    <15% std.dev.
Validated against finished Chrom. 21 sequence
The clones cover the genome 38.7X
Data is from 5 individuals (one at roughly 3X, 4 others at 0.5X)
Human Data (April 2000)
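The coverage figures above are simple ratios of bases collected to genome length. A minimal sketch of that arithmetic follows. The genome size of 2.78 Gbp is an assumption (the slide does not state the value used), and the function names are illustrative only. With that assumed size, the clone coverage comes out close to the 38.7X quoted on the slide.

```python
# Library table from the slide: (insert size in bp, number of mate pairs).
LIBRARIES = [
    (2_000, 5.045e6),
    (10_000, 4.401e6),
    (50_000, 1.071e6),
]

# ASSUMPTION: the slide does not state the genome size behind its coverage figures.
ASSUMED_GENOME_BP = 2.78e9


def clone_coverage(libraries, genome_bp):
    """Total bases spanned by all clone inserts, divided by genome length."""
    spanned = sum(insert_bp * pairs for insert_bp, pairs in libraries)
    return spanned / genome_bp


def implied_mean_read_length(n_reads, coverage, genome_bp):
    """Mean read length needed for n_reads to reach the stated sequence coverage."""
    return coverage * genome_bp / n_reads


cov = clone_coverage(LIBRARIES, ASSUMED_GENOME_BP)
mean_len = implied_mean_read_length(27.27e6, 5.11, ASSUMED_GENOME_BP)
print(f"clone coverage ~ {cov:.1f}X, implied mean read length ~ {mean_len:.0f} bp")
```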
Consensus (15-30Kbp)
Reads
Contig
Assembly without pairs results in contigs whose order and orientation are not known.
?
Pairs, especially groups of corroborating ones, link the contigs into scaffolds where the size of gaps is well characterized.
2-pair
Mean & std. dev. is known
Scaffold
Pairs Give Order & Orientation
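One way to see why corroborating pairs characterize a gap well: each pair that spans the gap gives an independent estimate of its size, and combining them shrinks the uncertainty. The sketch below is an illustration under stated assumptions, not the assembler's actual code. It assumes each estimate inherits the library's standard deviation and combines estimates by inverse-variance weighting. All function and parameter names are hypothetical.

```python
import math


def gap_from_pair(insert_mean, insert_std, dist_to_end_a, dist_from_start_b):
    """Gap estimate from one mate pair that spans contigs A and B.

    dist_to_end_a: bases from the outer end of the read in A to the end of A.
    dist_from_start_b: bases from the start of B to the outer end of the read in B.
    Returns (gap estimate, std dev); the uncertainty is that of the insert size.
    """
    return insert_mean - dist_to_end_a - dist_from_start_b, insert_std


def combine_gap_estimates(estimates):
    """Inverse-variance weighted mean of independent (mean, std) gap estimates."""
    weights = [1.0 / (std * std) for _, std in estimates]
    total = sum(weights)
    mean = sum(w * m for w, (m, _) in zip(weights, estimates)) / total
    return mean, math.sqrt(1.0 / total)


# Two 10Kbp pairs (std. dev. 8% = 800bp) linking the same two contigs.
pairs = [
    gap_from_pair(10_000, 800, 4_000, 5_000),
    gap_from_pair(10_000, 800, 3_500, 5_700),
]
gap_mean, gap_std = combine_gap_estimates(pairs)
print(f"gap ~ {gap_mean:.0f} bp, std dev ~ {gap_std:.0f} bp")
```

With two equally weighted pairs the standard deviation drops by a factor of sqrt(2), which is why a "2-pair" link is more trustworthy than a single pair.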
Chromosome
STS
STS-mapped Scaffolds
Contig
Gap (mean & std. dev. known)
Read pair (mates)
Consensus
Reads (of several haplotypes)
SNPs
External “Reads”
Anatomy of a WGS Assembly
WGS Sequencing
WGS Assembly
Performance
Detect repeats and so avoid being misled by them; leave them for last.
Make first-order use of mate pairs: first to circumnavigate and later to fill in repeats.
Make all the sure moves first.
Use tiered phases that get progressively more aggressive.
Output a complete audit trail of the evidence for assembly.
Assembler Design Philosophy
Repeat Rez I, II
Assembly Pipeline (circa 2006)
Overlapper
Unitiger
Scaffolder
Trim & Screen
Reads (typically 800bp) are quality-trimmed so that the average error rate is 0.5%, with 1 in 1000 having more than 2% error. Average trim length is 500-900bp, depending on the genome (590bp for human in year 2000).
Contaminant and vector sequence is removed.
Repeat screening makes run time and overlap graph size reasonable, e.g. 10^6 overlaps per Alu read must be avoided.
Now we dynamically limit repetitive overlaps in the overlap phase.
A gatekeeper program vets inputs and assigns IDs.
Reads are stored in a compressed, random-access binary store.
Repeat Rez I, II
Assembly Pipeline
Overlapper
Unitiger
Scaffolder
[Figure: an overlap between reads A and B implies either a TRUE overlap or a REPEAT-INDUCED one]
Find all overlaps of at least 40bp, allowing 6% mismatch.
Trim & Screen
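To make the overlap criterion concrete, here is a deliberately naive sketch: it tests every suffix of one read against the matching prefix of another and accepts the longest one within the mismatch budget. It is an illustration only. It counts substitutions but not insertions or deletions, it is quadratic per read pair, and it assumes the 40bp figure is a minimum length. A production overlapper would use seeds and banded alignment instead. The function name and parameters are hypothetical.

```python
def best_dovetail_overlap(a, b, min_len=40, max_mismatch=0.06):
    """Longest suffix of `a` that matches a prefix of `b` within the error budget.

    Returns the overlap length, or 0 if no overlap of at least `min_len` bases
    has a mismatch fraction of at most `max_mismatch`. Substitutions only.
    """
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        suffix, prefix = a[-length:], b[:length]
        mismatches = sum(x != y for x, y in zip(suffix, prefix))
        if mismatches <= max_mismatch * length:
            return length
    return 0


# Two synthetic reads sharing a 50bp stretch at the end of `a` / start of `b`.
shared = "AC" * 25
a = "T" * 60 + shared
b = shared + "G" * 60
print(best_dovetail_overlap(a, b))
```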
Repeat Rez I, II
Assembly Pipeline
Compute all “overlap consistent” sub-assemblies: Unitigs (Uniquely Assembled Contigs)
Overlapper
Unitiger
Scaffolder
Trim & Screen
OVERLAP GRAPH
Edge Types:
[Figure: pairs of reads A and B illustrating each edge type]
Regular Dovetail
Prefix Dovetail
Suffix Dovetail
E.g.: Edges are annotated with deltas of overlaps
The Unitig Reduction
1. Remove “Transitively Inferrable” Overlaps:
[Figure: overlapping reads A, B, C before and after removing the inferable A-to-C overlap]
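A minimal sketch of this step on a plain directed graph: an edge A to C is dropped whenever some B gives A to B and B to C. This is a simplification. It ignores overlap lengths and read orientation, which the real reduction must check, and the dictionary representation is an assumption made for illustration.

```python
def remove_transitive_edges(graph):
    """Drop every edge A->C for which some B gives A->B and B->C.

    `graph` maps each read to the set of reads it overlaps on its right side.
    Returns a new dict; the input is left unchanged.
    """
    reduced = {node: set(succs) for node, succs in graph.items()}
    for a, succs in graph.items():
        for b in succs:
            for c in graph.get(b, ()):
                if c != b:  # ignore self-loops
                    reduced[a].discard(c)
    return reduced


overlaps = {"A": {"B", "C"}, "B": {"C"}, "C": set()}
print(remove_transitive_edges(overlaps))
```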
The Unitig Reduction
2. Collapse “Unique Connector” Overlaps:
[Figure: reads A and B joined by a unique connector overlap are collapsed into one node; edge labels 412, 352, 45]
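The collapse step can be sketched as chain-finding: an edge A to B is a unique connector when B is A's only successor and A is B's only predecessor, and maximal runs of such edges become one node. The sketch below assumes a transitively reduced directed graph stored as a dictionary of successor sets, with sortable read names. Both are illustrative choices, not the assembler's data structures.

```python
def collapse_unique_connectors(graph):
    """Merge chains in which A's only successor is B and B's only predecessor is A.

    `graph` maps each read to the set of reads it overlaps on its right side.
    Returns a list of chains (lists of reads). Purely circular chains are skipped.
    """
    preds = {node: set() for node in graph}
    for a, succs in graph.items():
        for b in succs:
            preds.setdefault(b, set()).add(a)

    def unique_next(a):
        succs = graph.get(a, set())
        if len(succs) == 1:
            (b,) = succs
            if len(preds[b]) == 1:
                return b
        return None

    def unique_prev(b):
        ps = preds.get(b, set())
        if len(ps) == 1:
            (a,) = ps
            if len(graph.get(a, set())) == 1:
                return a
        return None

    chains = []
    for node in sorted(preds):
        if unique_prev(node) is not None:
            continue  # interior of a chain, not a chain start
        chain = [node]
        nxt = unique_next(node)
        while nxt is not None:
            chain.append(nxt)
            nxt = unique_next(nxt)
        chains.append(chain)
    return chains


reduced = {"A": {"B"}, "B": {"C"}, "C": {"D", "E"}, "D": set(), "E": set()}
print(collapse_unique_connectors(reduced))
```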
Unitigs: Definition
Chordal subgraph with no conflicting edges.
[Figure: two examples marked “Conflicting edge”]
Uniquely Assemble-able Contig
Unitig Theorem (Myers, JCB ‘95)
(1) Remove contained fragments
(2) Remove transitively inferred edges
(3) Collapse into unitigs
(*) Restore t.i. edges between unitig ends.
THM: Shortest Common Superstring of unitigs = Shortest Common Superstring of reads
Caveat: SCS is not the right objective for assembly.
Revised Unitigger Algorithm
The preceding algorithm is computationally expensive.
The current unitigger finds the “best” overlap on each end of each read: its “best buddy”.
Unitigs are chains of mutually unique best buddies: adjacent reads are best buddies of each other and of no other read.
This takes time and space linear in the number of reads.
In rare cases the results differ from graph reduction.
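The best-buddy rule can be sketched in a few lines. The version below assumes the best overlap on each end has already been chosen and is given as two dictionaries. It ignores read orientation and skips purely circular chains. These are simplifications for illustration, and the names are hypothetical.

```python
from collections import Counter


def best_buddy_unitigs(best_right, best_left):
    """Chain reads that are mutually unique best buddies.

    best_right[r] is the read with the best overlap on r's right end (or None);
    best_left[r] is the same for r's left end.
    """
    claimed_as_right = Counter(v for v in best_right.values() if v is not None)
    claimed_as_left = Counter(v for v in best_left.values() if v is not None)

    def buddy_on_right(r):
        nxt = best_right.get(r)
        if nxt is None or best_left.get(nxt) != r:
            return None  # not mutual
        if claimed_as_right[nxt] != 1 or claimed_as_left[r] != 1:
            return None  # some other read also picks one of them
        return nxt

    def buddy_on_left(r):
        prv = best_left.get(r)
        return prv if prv is not None and buddy_on_right(prv) == r else None

    reads = list(dict.fromkeys(list(best_right) + list(best_left)))
    unitigs, seen = [], set()
    for read in reads:
        if read in seen or buddy_on_left(read) is not None:
            continue  # start chains only at reads with no buddy on the left
        chain = [read]
        seen.add(read)
        nxt = buddy_on_right(read)
        while nxt is not None and nxt not in seen:
            chain.append(nxt)
            seen.add(nxt)
            nxt = buddy_on_right(nxt)
        unitigs.append(chain)
    return unitigs


# D also claims C as its best right-hand overlap, so B and C are not unique buddies.
best_right = {"A": "B", "B": "C", "C": None, "D": "C"}
best_left = {"A": None, "B": "A", "C": "B", "D": None}
print(best_buddy_unitigs(best_right, best_left))
```

Each read is visited a constant number of times, which matches the linear time and space stated above.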
Branch Point Extension
A repeat boundary reflected on an underlying sequence read.
[Figure: reads A, B, C, D placed on the genome, showing the peers of A]
Compare peers to detect branch points.
Consider the graph without repeat-full edges and recompute unitigs.
This ensures that you get a read length into each repeat-induced gap (most Alu-sized elements are resolved).
Bubble Smoothing
[Figure: bubble in the unitig graph; edge labels 412, 352, 245, 486]
Arrival Intervals
Arrival rate statistic (A-stat) is the log-odds ratio of the probability that a unitig is unique DNA versus 2-copy DNA.
A-stat scale: below -10, definitely repetitive; between -10 and +10, don’t know; above +10, definitely unique.
[Figure: A-stat distributions for unique and for repetitive unitigs; read arrival intervals in a unique DNA unitig versus a repetitive DNA unitig]
Identifying Unique DNA Stretches
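The slide gives no formula, so the following is a sketch under an explicit assumption: read starts arrive as a Poisson process at rate F/G (F total reads, G genome length). For a unitig of length rho containing k reads, the log of P(k | unique) / P(k | 2-copy) then simplifies to rho*F/G - k*ln(2). The natural-log base, the parameter names, and the handling of values exactly at the threshold are choices made here for illustration.

```python
import math


def a_stat(unitig_len_bp, n_reads_in_unitig, total_reads, genome_bp):
    """Log-odds (natural log) that a unitig is unique rather than a 2-copy collapse.

    Assumes read starts arrive as a Poisson process at rate total_reads / genome_bp.
    """
    expected = unitig_len_bp * total_reads / genome_bp
    return expected - n_reads_in_unitig * math.log(2)


def classify(a, threshold=10.0):
    """Map an A-stat value onto the three regions of the slide's scale."""
    if a >= threshold:
        return "unique"
    if a <= -threshold:
        return "repetitive"
    return "unknown"


# Illustrative numbers only: 1 read start per 100bp on average, 5Kbp unitigs.
for k in (50, 65, 100):
    a = a_stat(5_000, k, 1_000_000, 100_000_000)
    print(k, round(a, 2), classify(a))
```

A unitig with the expected number of reads scores well above +10, while one with twice the expected number falls below -10, which is the separation the slide's scale describes.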
Repeat Rez I, II
Assembly Pipeline
Overlapper
Unitiger
Scaffolder
Fill repeat gaps with doubly anchored positive unitigs
Unitig > 0
Trim & Screen
Repeat Rez I, II
Assembly Pipeline
Overlapper
Unitiger
Scaffolder
Fill repeat gaps with assembled, singly anchored reads
Stones
Trim & Screen
Surrogates
Stones containing more than 1 read are added to contigs as consensus sequence only, without underlying reads. These are called “surrogates”.
This allows repeat unitigs to be put in multiple positions in the assembly, but leaves regions without underlying read coverage.
We later attempt to resolve surrogates by assigning reads from the original repeat unitig to the separate surrogate copies, based on mate pairs.