Celera Assembler: Arthur L. Delcher, Senior Research Scientist, CBCB, University of Maryland



Page 1: Celera Assembler Arthur L. Delcher Senior Research Scientist CBCB University of Maryland

Celera Assembler

Arthur L. Delcher, Senior Research Scientist

CBCB, University of Maryland

Page 2: Slides by Art Delcher, Mike Schatz, and Adam Phillippy

Center for Bioinformatics and Computational Biology, Univ. of Maryland

Page 3: Shotgun DNA Sequencing (Technology)

DNA target sample

SHEAR

SIZE SELECT (e.g., 10Kbp ± 8% std. dev.)

LIGATE & CLONE (Vector)

SEQUENCE (Primer; End Reads (Mates); 550bp)

Page 4: Whole Genome Shotgun Sequencing

– Collect 10x sequence in a 1-to-1 ratio of two types of read pairs, Short (2Kbp) and Long (10Kbp): ~35 million reads for Human.

– Collect another 20X in clone coverage of 50Kbp end sequence pairs (BAC 5’, BAC 3’): ~1.2 million pairs for Human.

– Early simulations showed that if repeats were considered black boxes, one could still cover 99.7% of the genome unambiguously.

+ single highly automated process

+ only three library constructions

– assembly is much more difficult
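The pair count for the 50Kbp library follows from simple coverage arithmetic. The sketch below checks it, assuming a human genome of about 3 Gbp (the slide does not state the size it used); the function name is only illustrative.

```python
def pairs_for_clone_coverage(coverage, genome_bp, insert_bp):
    """Number of clone inserts (read pairs) whose inserts span the genome `coverage` times."""
    return coverage * genome_bp / insert_bp


GENOME_BP = 3_000_000_000  # assumed human genome size, not taken from the slide

bac_pairs = pairs_for_clone_coverage(20, GENOME_BP, 50_000)
print(f"{bac_pairs / 1e6:.1f} million pairs")  # prints "1.2 million pairs"
```

Under that assumption, 20X clone coverage from 50Kbp inserts needs about 1.2 million pairs, which matches the figure on the slide.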

Page 5: Clone-by-Clone Genome Sequencing

Target

Physical Mapping

Minimum Tiling Set (~33,000 BACs for human)

Shotgun Assembly

Page 6: Celera’s Sequencing Factory

Page 7: Celera’s Sequencing Factory (circa 2001)

300 ABI 3700 DNA Sequencers

50 Production Staff

20,000 sq. ft. of wet lab

20,000 sq. ft. of sequencing space

800 tons of A/C (160,000 cfm)

$1 million / year for electrical service

$10 million / month for reagents

Page 8: Human Data (April 2000)

Collected 27.27 million reads = 5.11X coverage

21.04 million are paired (77%) = 10.52 million pairs

Insert size   Pairs     True    Std. dev.
2Kbp          5.045 M   98.6%   <6%
10Kbp         4.401 M   98.6%   <8%
50Kbp         1.071 M   90.0%   <15%

Validated against finished Chrom. 21 sequence

The clones cover the genome 38.7X

Data is from 5 individuals (one at roughly 3X, 4 others at 0.5X)
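The figures on this slide can be cross-checked against each other. The sketch below recomputes the pair total and the paired fraction, and backs out the genome size implied by the 38.7X clone coverage. It treats the nominal insert sizes as exact means, which is an assumption, and the implied genome size is derived here rather than stated on the slide.

```python
# (insert size in bp, number of pairs), as listed on the slide
libraries = {
    "2Kbp": (2_000, 5.045e6),
    "10Kbp": (10_000, 4.401e6),
    "50Kbp": (50_000, 1.071e6),
}
total_reads = 27.27e6
paired_reads = 21.04e6
clone_coverage = 38.7

# Sum of the three libraries; the slide rounds this to 10.52 million pairs.
total_pairs = sum(pairs for _, pairs in libraries.values())

# Fraction of all reads that have a mate; the slide reports 77%.
paired_fraction = paired_reads / total_reads

# Total bases spanned by clone inserts, then the genome size implied by 38.7X.
spanned_bp = sum(size * pairs for size, pairs in libraries.values())
implied_genome_bp = spanned_bp / clone_coverage

print(f"pairs: {total_pairs / 1e6:.3f} M")
print(f"paired fraction: {paired_fraction:.1%}")
print(f"implied genome size: {implied_genome_bp / 1e9:.2f} Gbp")
```

The implied size comes out a little under 3 Gbp, so the clone coverage figure is consistent with the library table.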

Page 9: Pairs Give Order & Orientation

Reads, Contig, Consensus (15-30Kbp)

Assembly without pairs results in contigs whose order and orientation are not known.

Pairs, especially groups of corroborating ones, link the contigs into scaffolds where the size of gaps is well characterized.

Scaffold: contigs joined by a 2-pair link; the gap's mean & std. dev. are known.
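To illustrate why corroborating pairs give a well-characterized gap, here is a minimal sketch of one way to estimate a gap from mate links. It assumes each library's insert length is roughly normal with known mean and standard deviation, and combines the per-pair estimates by inverse-variance weighting. The `MateLink` type and its field names are hypothetical; this is not the Celera Assembler's actual code.

```python
import math
from typing import NamedTuple


class MateLink(NamedTuple):
    insert_mean: float  # library mean insert length (bp)
    insert_std: float   # library standard deviation (bp)
    left_span: int      # bp from the read's outer end to the left contig's gap-side end
    right_span: int     # bp from the right contig's gap-side end to its read's outer end


def estimate_gap(links):
    """Combine per-pair gap estimates by inverse-variance weighting.

    Each pair alone implies gap = insert_mean - left_span - right_span,
    with the library's standard deviation as its uncertainty.
    """
    weights = [1.0 / link.insert_std ** 2 for link in links]
    estimates = [link.insert_mean - link.left_span - link.right_span for link in links]
    total = sum(weights)
    mean = sum(w * e for w, e in zip(weights, estimates)) / total
    return mean, math.sqrt(1.0 / total)


# Two corroborating 2Kbp pairs (6% std. dev. = 120bp) spanning the same gap.
links = [MateLink(2000, 120, 700, 800), MateLink(2000, 120, 900, 700)]
gap_mean, gap_std = estimate_gap(links)
print(f"gap = {gap_mean:.0f} bp, std. dev. = {gap_std:.1f} bp")
```

With two pairs from the same library, the combined standard deviation shrinks by a factor of the square root of 2, which is why groups of corroborating pairs are more informative than single ones.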

Page 10: Anatomy of a WGS Assembly

Chromosome

STS

STS-mapped Scaffolds

Contig

Gap (mean & std. dev. known)

Read pair (mates)

Consensus

Reads (of several haplotypes)

SNPs

External “Reads”

Page 11: WGS Sequencing, WGS Assembly, Performance

Page 12: Assembler Design Philosophy

Make all the sure moves first: tiered phases that get progressively more aggressive.

Detect repeats and so avoid being misled by them; leave them for last.

Make 1st order use of mate-pairs: first to circumnavigate and later to fill in repeats.

Output a complete audit trail of the evidence for assembly.

Page 13: Assembly Pipeline (circa 2006)

Pipeline: Trim & Screen -> Overlapper -> Unitiger -> Scaffolder -> Repeat Rez I, II

Trim & Screen

Reads (typically 800bp) are quality-trimmed so that average error rate is 0.5%, with 1 in 1000 having more than 2% error. Average trim length is 500-900bp, depending on the genome (590bp for human in year 2000).

Contaminant and vector sequence is removed.

Repeat screening makes run time and overlap graph size reasonable, e.g. 10^6 overlaps per Alu read must be avoided. Now we dynamically limit repetitive overlaps in the overlap phase.

gatekeeper program to vet inputs/assign IDs.

Reads stored in compressed, random-access binary store.
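The slide gives the quality-trimming target (about 0.5% average error) but not the algorithm. As a rough illustration only, the sketch below keeps the longest window of a read whose mean per-base error probability, derived from Phred quality scores, stays under that target. The function name and the quadratic search are assumptions made for clarity, not the assembler's actual trimmer.

```python
def trim_read(quals, max_mean_error=0.005):
    """Return (start, end) of the longest window with mean error <= max_mean_error.

    `quals` are Phred scores, so the error probability of a base is 10 ** (-q / 10).
    The search is quadratic in read length, which is acceptable for ~800bp reads.
    """
    errors = [10 ** (-q / 10) for q in quals]
    prefix = [0.0]
    for e in errors:
        prefix.append(prefix[-1] + e)  # prefix[i] = summed error of the first i bases

    best = (0, 0)
    n = len(errors)
    for start in range(n):
        # Only try windows strictly longer than the best one found so far.
        for end in range(n, start + (best[1] - best[0]), -1):
            if prefix[end] - prefix[start] <= max_mean_error * (end - start):
                best = (start, end)
                break
    return best


# Low-quality ends (Q5) around a high-quality core (Q30).
quals = [5] * 10 + [30] * 50 + [5] * 10
print(trim_read(quals))  # prints (10, 60)
```

In the example, adding even one Q5 base to the Q30 core pushes the mean error above 0.5%, so the window is clipped exactly at the core.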

Page 19: Assembly Pipeline

Pipeline: Trim & Screen -> Overlapper -> Unitiger -> Scaffolder -> Repeat Rez I, II

Overlapper

An overlap between reads A and B implies either a TRUE overlap or a REPEAT-INDUCED overlap.

Find all overlaps of at least 40bp, allowing 6% mismatch.
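To make the overlap criterion concrete, here is a brute-force sketch that finds the longest suffix-prefix overlap between two reads using the slide's thresholds as defaults (40bp minimum, 6% mismatch). It counts substitutions only and ignores indels and reverse complements, so it is a simplification rather than the assembler's overlapper, which would use something far more efficient than trying every length.

```python
def suffix_prefix_overlap(a, b, min_len=40, max_mismatch=0.06):
    """Length of the longest overlap of a suffix of `a` with a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` bases has a mismatch
    fraction of at most `max_mismatch`. Substitutions only; no indels.
    """
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        suffix, prefix = a[-length:], b[:length]
        mismatches = sum(x != y for x, y in zip(suffix, prefix))
        if mismatches <= max_mismatch * length:
            return length
    return 0


# A 50bp shared region, with two substitutions in read b's copy (4% mismatch).
core = "ACGTACGTAC" * 5
b_core = list(core)
b_core[10] = "G"
b_core[30] = "G"
read_a = "T" * 60 + core
read_b = "".join(b_core) + "G" * 60
print(suffix_prefix_overlap(read_a, read_b))  # prints 50
```

Note that the function cannot tell a true overlap from a repeat-induced one; that distinction is left to the later pipeline stages.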

Page 20: Assembly Pipeline

Pipeline: Trim & Screen -> Overlapper -> Unitiger -> Scaffolder -> Repeat Rez I, II

Unitiger

Compute all “overlap consistent” sub-assemblies: Unitigs (Uniquely Assembled Contigs)

Page 21: OVERLAP GRAPH

Edge Types:

Regular Dovetail

Prefix Dovetail

Suffix Dovetail

(Each edge type is illustrated with a pair of reads, A and B.)

E.g.: Edges are annotated with deltas of overlaps.

Page 22: The Unitig Reduction

1. Remove “Transitively Inferrable” Overlaps:

(Diagram: reads A, B, and C before and after the reduction.)
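A minimal sketch of this reduction step, assuming the overlap graph is simplified to a directed graph in which an edge a -> b means a suffix of read a overlaps a prefix of read b. An edge a -> c is dropped when some b gives a -> b and b -> c. A real implementation would also check that the overlap lengths are consistent before dropping an edge; that check is omitted here.

```python
def remove_transitive_edges(graph):
    """Return a copy of `graph` without transitively inferable edges.

    `graph` maps each read to the set of reads it overlaps on its right.
    Inference always uses the original graph, so the result does not
    depend on the order in which edges are examined.
    """
    reduced = {node: set(successors) for node, successors in graph.items()}
    for a, successors in graph.items():
        for b in successors:
            for c in graph.get(b, ()):
                if c in successors:
                    reduced[a].discard(c)  # a -> c is implied by a -> b -> c
    return reduced


overlaps = {"A": {"B", "C"}, "B": {"C"}, "C": set()}
print(remove_transitive_edges(overlaps))
```

For the three-read example, the edge from A to C is removed and the chain A, B, C remains.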

Page 23: The Unitig Reduction

2. Collapse “Unique Connector” Overlaps:

(Diagram: reads A and B before and after collapsing; labels 412, 352, 45.)
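The collapsing step can be sketched on the same simplified directed graph (edge u -> v means u overlaps v on its right). Here an edge is treated as a unique connector when it is u's only outgoing edge and v's only incoming edge; that reading of the slide, and the function name, are assumptions. The sketch also assumes a linear layout, so a cycle made entirely of unique connectors would not be reported.

```python
def collapse_unique_connectors(graph):
    """Merge maximal chains of unique-connector edges into lists of reads.

    `graph` maps every read (each must appear as a key) to the set of its
    right-hand overlap partners.
    """
    indegree = {node: 0 for node in graph}
    for successors in graph.values():
        for v in successors:
            indegree[v] = indegree.get(v, 0) + 1

    def unique_next(u):
        successors = graph.get(u, ())
        if len(successors) == 1:
            (v,) = successors
            if indegree[v] == 1:
                return v
        return None

    # A read starts a chain unless some predecessor reaches it by a unique connector.
    continued = {unique_next(u) for u in graph} - {None}
    chains = []
    for node in graph:
        if node in continued:
            continue
        chain = [node]
        while (nxt := unique_next(chain[-1])) is not None:
            chain.append(nxt)
        chains.append(chain)
    return chains


reduced_graph = {"A": {"B"}, "B": {"C"}, "C": {"D", "E"}, "D": set(), "E": set()}
print(collapse_unique_connectors(reduced_graph))
```

In the example, A, B, and C collapse into one chain, while the branch at C leaves D and E as separate single-read chains.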

Page 24: Unitigs: Definition

Uniquely Assemble-able Contig

Chordal subgraph with no conflicting edges.

(Diagram marks two conflicting edges.)

Page 25: Unitig Theorem (Myers, JCB ‘95)

(1) Remove contained fragments

(2) Remove transitively inferred edges

(3) Collapse into unitigs

(*) Restore t.i. edges between unitig ends.

THM: Shortest Common Superstring of unitigs = Shortest Common Superstring of reads

Caveat: SCS is not the right objective for assembly.

Page 26: Revised Unitigger Algorithm

Preceding algorithm is computationally expensive.

Current unitigger finds the “best” overlap on each end of each read, its “best buddy”.

Unitigs are chains of mutually unique best buddies: adjacent reads are best buddies of each other and of no other read.

This takes time and space linear in the number of reads.

In rare cases results are different from graph reduction.
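A minimal sketch of the best-buddy idea, under simplifying assumptions: reads have a single orientation, "best" means the longest overlap, and overlaps are given as (left read, right read, overlap length) triples. The function name and input format are hypothetical, and chains that close into a cycle are not reported.

```python
from collections import Counter


def best_buddy_unitigs(reads, overlaps):
    """Chain reads whose best overlaps point at each other and at no other read."""
    best_right, best_left = {}, {}  # read -> (overlap length, partner)
    for a, b, length in overlaps:
        if length > best_right.get(a, (0, None))[0]:
            best_right[a] = (length, b)
        if length > best_left.get(b, (0, None))[0]:
            best_left[b] = (length, a)
    right = {a: b for a, (_, b) in best_right.items()}
    left = {b: a for b, (_, a) in best_left.items()}
    named_as_right = Counter(right.values())  # how many reads pick b on their right
    named_as_left = Counter(left.values())    # how many reads pick a on their left

    def buddy(a):
        b = right.get(a)
        if (b is not None and left.get(b) == a
                and named_as_right[b] == 1 and named_as_left[a] == 1):
            return b
        return None

    continued = {buddy(a) for a in reads} - {None}
    unitigs = []
    for read in reads:
        if read in continued:
            continue
        chain = [read]
        while (nxt := buddy(chain[-1])) is not None:
            chain.append(nxt)
        unitigs.append(chain)
    return unitigs


reads = ["A", "B", "C", "D", "R"]
overlaps = [("A", "B", 300), ("B", "C", 250), ("A", "C", 100),
            ("C", "R", 80), ("D", "R", 90)]
print(best_buddy_unitigs(reads, overlaps))
```

Each overlap is examined once and each read is placed in exactly one chain, so the work is linear in the input. In the example, read R is the best right-hand partner of both C and D, so it is nobody's unique buddy and the chain A, B, C stops before it.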

Page 29: Branch Point Extension

A repeat boundary reflected on an underlying sequence read.

Compare peers to detect branch pts.

Consider graph without repeat-full edges and recompute unitigs.

Makes sure you get a read-length into each repeat-induced gap (most Alu-sized elements are resolved).

(Diagram labels: Genome; reads A, B, C, D; peers of A.)

Page 30: Bubble Smoothing

(Diagram labels: 412, 352, 245, 486.)

Page 31: Identifying Unique DNA Stretches

Arrival Intervals (shown for a unique DNA unitig and a repetitive DNA unitig)

Arrival rate statistic (A-stat) is the log-odds ratio of the probability that a unitig is unique DNA versus 2-copy DNA.

Scale, with the distributions for repetitive and unique unitigs: below -10, Definitely Repetitive; around 0, Don’t Know; above +10, Definitely Unique.
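One way to compute such a log-odds score is sketched below, assuming read start positions arrive as a Poisson process. If a unitig of length L holds k reads and the genome-wide arrival rate is F/G reads per base, the Poisson likelihoods for one copy and for two collapsed copies give a natural-log ratio of L*F/G - k*ln 2. The natural-log scale, the use of the full unitig length, and the classification cutoffs at +10 and -10 (read off the slide's axis) are assumptions; the assembler's own scaling may differ.

```python
import math


def a_stat(unitig_len_bp, reads_in_unitig, total_reads, genome_bp):
    """Natural-log odds that a unitig is unique rather than a 2-copy collapse.

    Poisson model: ln[P(k | rate) / P(k | 2 * rate)] = rate * length - k * ln 2.
    """
    expected_reads = unitig_len_bp * total_reads / genome_bp
    return expected_reads - reads_in_unitig * math.log(2)


def classify(score, cutoff=10.0):
    """Three-way call; the cutoff of 10 is taken from the slide's axis labels."""
    if score >= cutoff:
        return "unique"
    if score <= -cutoff:
        return "repetitive"
    return "unknown"


TOTAL_READS = 27.27e6  # from the Human Data slide
GENOME_BP = 2.9e9      # assumed genome size, not stated in the slides

for length, k in [(10_000, 94), (10_000, 188), (1_000, 9)]:
    score = a_stat(length, k, TOTAL_READS, GENOME_BP)
    print(length, k, round(score, 1), classify(score))
```

The short unitig in the last case lands in the "unknown" band: with only a handful of reads there is too little evidence either way, which is why the statistic is useful mainly for longer unitigs.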

Page 34: Assembly Pipeline

Pipeline: Trim & Screen -> Overlapper -> Unitiger -> Scaffolder -> Repeat Rez I, II

Fill repeat gaps with doubly anchored positive unitigs (Unitig>0)

Page 35: Assembly Pipeline

Pipeline: Trim & Screen -> Overlapper -> Unitiger -> Scaffolder -> Repeat Rez I, II

Fill repeat gaps with assembled, singly anchored reads (Stones)

Page 36: Surrogates

Stones containing more than 1 read are added to contigs as consensus sequence only, without underlying reads. These are called “surrogates”.

Allows repeat unitigs to be put in multiple positions in the assembly, but leaves regions without underlying read coverage.

We later attempt to resolve surrogates by assigning reads from the original repeat unitig to the separate surrogate copies, based on mate pairs.
