Genome sequence assembly

Assembly concepts and methods

(some slides courtesy of Mihai Pop)

Building a library

• Break DNA into random fragments (8-10x coverage)

Actual situation

Building a library

• Break DNA into random fragments (8-10x coverage)

• Sequence the ends of the fragments

– Amplify the fragments in a vector

– Sequence 800-1000 (500-700) bases at each end of the fragment

Assembling the fragments

Forward-reverse constraints

• The sequenced ends are facing towards each other

• The distance between the two sequenced ends is known (within certain experimental error)
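A minimal sketch of how a forward-reverse constraint could be checked once both end reads of a fragment have been placed on the same contig. The `Placement` record, the library mean and standard deviation, and the tolerance of three standard deviations are illustrative assumptions, not something the slides specify.

```python
from dataclasses import dataclass

@dataclass
class Placement:
    start: int      # leftmost coordinate of the read on the contig
    end: int        # rightmost coordinate of the read (exclusive)
    forward: bool   # True if the read lies on the forward strand

def mates_consistent(a, b, mean_insert, sd_insert, k=3.0):
    """Return True if two mated reads face each other at a plausible distance."""
    left, right = (a, b) if a.start <= b.start else (b, a)
    # Facing towards each other: the left read is forward, the right read is reverse.
    if not (left.forward and not right.forward):
        return False
    # Implied insert size, measured from outer edge to outer edge.
    implied = right.end - left.start
    # Assumed tolerance: within k standard deviations of the library mean.
    return abs(implied - mean_insert) <= k * sd_insert

read_a = Placement(start=100, end=700, forward=True)
read_b = Placement(start=2500, end=3100, forward=False)
print(mates_consistent(read_a, read_b, mean_insert=3000, sd_insert=300))
```

A pair that fails this test is either too close, too far apart, or mis-oriented, which is the kind of evidence scaffolding and mis-assembly detection rely on.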

Insert

Building Scaffolds

• Break DNA into random fragments (8-10x coverage)

• Sequence the ends of the fragments

• Assemble the sequenced ends

• Build scaffolds

Assembly gaps

sequencing gap - we know the order and orientation of the contigs and have at least one clone spanning the gap

physical gap - no information known about the adjacent contigs, nor about the DNA spanning the gap

Sequencing gaps

Physical gaps

Unifying view of assembly

Assembly

Scaffolding

Shotgun sequencing statistics

Typical contig coverage

[Figure: read coverage (1 to 6) along a contig]

Imagine raindrops on a sidewalk

Lander-Waterman statistics

L = read length
T = minimum detectable overlap
G = genome size
N = number of reads
c = coverage (NL / G)
σ = 1 – T/L

E(#islands) = N e^{-cσ}
E(island size) = L((e^{cσ} – 1) / c + 1 – σ)

contig = island with 2 or more reads
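The two expectations can be evaluated directly. The sketch below is a straightforward transcription of the formulas, run with the parameters of the slides' example (1 Mbp genome, 600-base reads, 40-base detectable overlap); the function name and its return layout are illustrative choices.

```python
import math

def lander_waterman(G, L, T, N):
    """Expected island count and mean island size for a shotgun project."""
    c = N * L / G          # coverage
    sigma = 1 - T / L      # fraction of a read not needed to detect an overlap
    islands = N * math.exp(-c * sigma)
    island_size = L * ((math.exp(c * sigma) - 1) / c + 1 - sigma)
    return c, islands, island_size

for N in (1667, 5000, 8334, 13334):
    c, islands, size = lander_waterman(G=1_000_000, L=600, T=40, N=N)
    print(f"c={c:.2f}  N={N}  E(#islands)={islands:.1f}  E(island size)={size:.0f}")
```

The island count falls off exponentially with coverage, which is why a few extra fold of coverage closes most gaps.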

Example

c   N        #islands   #contigs   bases not in any read   bases not in contigs
1   1,667    655        614        698                     367,806
3   5,000    304        250        121                     49,787
5   8,334    78         57         20                      6,735
8   13,334   7          5          1                       335

Genome size: 1 Mbp; read length: 600; detectable overlap: 40

Experimental data

X coverage   # ctgs   % > 2X   avg ctg size (L-W)   max ctg size   # ORFs
1            284      54       1,234 (1,138)        3,337          526
3            597      67       1,794 (4,429)        9,589          1,092
5            548      79       2,495 (21,791)       17,977         1,398
8            495      85       3,294 (302,545)      64,307         1,762
complete     1        100      1.26 M               1.26 M         1,329

Caveat: numbers based on artificially chopping up the genome of Wolbachia pipientis dMel

Read coverage vs. Clone coverage

Read coverage = 8X

Clone (insert) coverage = 16X

2X coverage in BAC-ends implies 100X coverage by BACs (1 BAC clone = approx. 100 kbp)
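The relation between the two coverages follows from simple counting: each clone contributes two end reads, so N clones give read coverage 2·N·(read length)/G and clone coverage N·(insert length)/G. A small sketch of that arithmetic follows; the 1,000-base read length used for the BAC example is an assumption, since the slide does not state it.

```python
def clone_coverage(read_coverage, insert_length, read_length):
    """Clone (insert) coverage implied by paired-end read coverage.

    Each clone yields two reads, so
    clone coverage = read coverage * insert_length / (2 * read_length).
    """
    return read_coverage * insert_length / (2 * read_length)

# BAC-end example: 2X read coverage, 100 kbp inserts,
# ASSUMED read length of 1,000 bases (not stated on the slide).
print(clone_coverage(2, 100_000, 1_000))  # 100.0
```

By the same formula, 8X read coverage gives 16X clone coverage whenever the insert is four times the read length.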

Assembly paradigms

• Overlap-layout-consensus

– greedy (TIGR Assembler, phrap, CAP3...)

– graph-based (Celera Assembler, Arachne)

• Eulerian path (especially useful for short read sequencing)

TIGR Assembler/phrap

Greedy

• Build a rough map of fragment overlaps

• Pick the highest-scoring overlap

• Merge the two fragments

• Repeat until no more merges can be done
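The greedy loop above can be sketched in a few lines. This is a toy illustration, not how TIGR Assembler or phrap are implemented: it assumes error-free reads on a single strand and uses the length of an exact suffix-prefix match as the overlap score, whereas real assemblers score inexact alignments.

```python
def overlap_len(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of fragments with the longest exact overlap."""
    frags = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)
        # Rough overlap map: score every ordered pair of fragments.
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap_len(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            return frags  # no more merges can be done
        merged = frags[i] + frags[j][n:]
        frags = [f for k, f in enumerate(frags) if k not in (i, j)] + [merged]

print(greedy_assemble(["ATGGCGTG", "GCGTGCAA", "TGCAATCC"]))
```

Because each merge is chosen locally, a repeat longer than the reads can lead this strategy to join fragments that are not adjacent in the genome.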

Overlap-layout-consensus

Main entity: read

Relationship between reads: overlap

[Figure: reads 1-9 and their overlap relationships; example sequence ACCTGAACCTGAAGCTGAACCAGA]

Paths through graphs and assembly

• Hamiltonian circuit: visit each node (city) exactly once, returning to the start

Genome

Implementation details

Overlap between two sequences

…AGCCTAGACCTACAGGATGCGCGGACACGTAGCCAGGAC
       CAGTACTTGGATGCGCTGACACGTAGCTTATCCGGT…

overlap (19 bases), overhang (6 bases)

overlap - region of similarity between the sequences
overhang - unaligned ends of the sequences

The assembler screens merges based on:
• length of overlap
• % identity in overlap region
• maximum overhang size

% identity = 18/19 = 94.7%
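The three screening criteria can be sketched as a single predicate. The threshold values below (`min_len`, `min_identity`, `max_overhang`) are hypothetical defaults chosen for illustration; the slides do not give the values any particular assembler uses. The two strings are the 19-base overlap region from the example above.

```python
def percent_identity(x, y):
    """Percent identity of two equal-length aligned strings ('-' marks a gap)."""
    if len(x) != len(y) or not x:
        raise ValueError("aligned regions must be non-empty and equal in length")
    matches = sum(1 for p, q in zip(x, y) if p == q and p != "-")
    return 100.0 * matches / len(x)

def accept_overlap(x, y, overhang, min_len=40, min_identity=94.0, max_overhang=10):
    """Screen a candidate merge on overlap length, % identity, and overhang size.

    All three thresholds are illustrative assumptions.
    """
    return (len(x) >= min_len
            and percent_identity(x, y) >= min_identity
            and overhang <= max_overhang)

a = "GGATGCGCGGACACGTAGC"  # overlap region in the first sequence
b = "GGATGCGCTGACACGTAGC"  # overlap region in the second sequence
print(round(percent_identity(a, b), 1))  # 94.7
print(accept_overlap(a, b, overhang=6, min_len=15))
```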

All pairs alignment

• Needed by the assembler
• Try all pairs – must consider ~ n^2 pairs
• Smarter solution: only n × coverage (e.g. 8) pairs are possible
– Build a table of k-mers contained in sequences (single pass through the genome)
– Generate the pairs from k-mer table (single pass through k-mer table)
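A minimal sketch of the k-mer table idea: one pass over the reads fills the table, and one pass over the table yields the candidate pairs that are worth aligning. The value k = 8 and the function name are arbitrary choices for illustration, and reverse complements are ignored to keep the example short.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(reads, k=8):
    """Return pairs of read indices that share at least one k-mer."""
    table = defaultdict(set)
    # Pass 1: record which reads contain each k-mer.
    for idx, seq in enumerate(reads):
        for i in range(len(seq) - k + 1):
            table[seq[i:i + k]].add(idx)
    # Pass 2: every pair of reads listed under the same k-mer is a candidate.
    pairs = set()
    for ids in table.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

reads = ["ATGGCGTGCAAT", "GCGTGCAATCCG", "TTTTTTTTTTTT"]
print(candidate_pairs(reads))
```

Only the candidate pairs are passed on to the full alignment step. A k-mer that occurs in very many reads would generate a quadratic number of pairs on its own, which is one practical reason to skip frequent k-mers.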

REPEATS

RptA RptB

Non-repetitive overlap graph

Handling repeats

1. Repeat detection
– pre-assembly: find fragments that belong to repeats
  • statistically (most existing assemblers)
  • repeat database (RepeatMasker)
– during assembly: detect "tangles" indicative of repeats (Pevzner, Tang, Waterman 2001)
– post-assembly: find repetitive regions and potential mis-assemblies
  • REPuter, RepeatMasker
  • "unhappy" mate-pairs (too close, too far, mis-oriented)

2. Repeat resolution
– find DNA fragments belonging to the repeat
– determine correct tiling across the repeat

Statistical repeat detection

Significant deviations from average coverage flagged as repeats.
- frequent k-mers are ignored
- “arrival” rate of reads in contigs compared with theoretical value (e.g., 800 bp reads & 8x coverage - reads "arrive" every 100 bp)

Problem 1: assumption of uniform distribution of fragments - leads to false positives
- non-random libraries
- poor clonability regions

Problem 2: repeats with low copy number are missed - leads to false negatives
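The arrival-rate comparison can be sketched as follows. Under uniform sampling, reads start every (read length / coverage) bases, so a contig of a given length has an expected read count. The Poisson-style threshold of z standard deviations used to call a deviation "significant" is an assumption made for this sketch; the slides do not say which test existing assemblers apply.

```python
import math

def arrival_interval(read_length, coverage):
    """Average spacing between read starts under uniform sampling."""
    return read_length / coverage

def looks_repetitive(n_reads, contig_length, read_length, coverage, z=3.0):
    """Flag a contig whose read count is well above the uniform expectation.

    Assumes read starts follow a Poisson process, so the standard
    deviation of the count is sqrt(expected).
    """
    expected = contig_length / arrival_interval(read_length, coverage)
    return n_reads > expected + z * math.sqrt(expected)

print(arrival_interval(800, 8))  # 100.0
print(looks_repetitive(n_reads=120, contig_length=5000, read_length=800, coverage=8))
```

Both problems listed above show up directly in this test: a region that is over-sampled for library reasons exceeds the threshold without being a repeat, while a two-copy repeat in a short contig may stay below it.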

Mis-assembled repeats

[Figure: mis-assembly classes caused by repeats: collapsed tandem, excision, rearrangement; annotations: “chimeras”, “mates”]

How do you align the green pieces?

ribosomal RNA repeats, Ames Porton strain

An assembly puzzle: contradictory data

(discovered after publication)

Puzzle solution

Tandem duplication

Reference: Ames ‘ancestor’ strain

Ames Porton Down strain

Anthrax attack strain history

[Figure: strain lineage diagram. Labels: Ft Detrick, (Ames ancestor), Porton Down, Porton strain, plasmids cured, Porton1, Porton2, UC Berkeley, Lab B, Lab C, Lab D, “Ames” isolate, Florida isolate, victim, attack strain; dates: 1998, 2001, 2001]

Cut of assembly 67452 from GBX0130.contig (11 bases)
Cut at ungapped consensus offset 15986 (from 1), +/- 5 positions:
Cut at gapped consensus offset 16009 (from 1) 11 positions

Ungapped consensus TGAATGCACAC
Gapped consensus   TGAATGCACAC

                          T     G     A     A     T     G     C     A     C     A     C
Covering reads:
GBZEI27TF TGAATGCACAC    26    30    34    36    33    36    36    37    36    36    36
GBXEZ08TR TGAATGCACAC    27    30    33    35    41    37    36    23    36    36    36
GBZDA09TF TGAATGCACAC    26    18    35    31    26    20    29    19    36    36    36

Summary info:
P-value (10^q)         -7.9  -7.8 -10.2 -10.2 -10.0  -9.3 -10.1  -7.9 -10.8 -10.8 -10.8
Quality Class             5     3     3     3     3     3     3     3     3     3     3
Coverage depth            3     3     3     3     3     3     3     3     3     3     3
Homogeneity             1.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0

Cut of assembly 6264 from GBA0117.contig (11 bases)
Cut at ungapped consensus offset 725687 (from 1), +/- 5 positions:
Cut at gapped consensus offset 730468 (from 1) 11 positions

Ungapped consensus TGAATACACAC
Gapped consensus   TGAATACACAC

                          T     G     A     A     T     A     C     A     C     A     C
Covering reads:
GBIFW80TF TGAATA-ACAC    34    33    15    16    13    09    00    11    11    11    15
GBICA33TR TGAATACACAC    34    35    34    35    35    34    34    36    36    36    36
GBIFQ32TR TGAATACACAC    10    13    12    18    24    13    21    12    13    14    13
GBICH40TF TGAATACACAC    36    36    36    32    32    32    32    32    31    31    36
GBICU19TR TGAATACACAC    21    30    33    18    19    24    21    11    36    10    29

Summary info:
P-value (10^q)        -42.8 -45.4 -45.5 -45.3 -46.7 -44.4 -46.7 -43.9 -46.8 -45.1 -42.7
Quality Class             1     1     1     1     1     1     1     1     1     1     1
Coverage depth           16    16    16    16    16    16    15    16    16    16    16
Homogeneity             1.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0

GBX(g)

GBA(a)

Not shown

Probability of base-calling error

B. anthracis SNP CS-1
