Assembling genomes using ABySS

Shaun Jackman, BC Genome Sciences Centre. Assembling genomes using ABySS. dnGASP 2011

Upload: shaun-jackman

Post on 10-May-2015

TRANSCRIPT

Page 1: Assembling genomes using ABySS

Shaun Jackman

BC Genome Sciences Centre

Assembling genomes using ABySS

dnGASP 2011

Page 2: Assembling genomes using ABySS

An assembly in two stages

● Stage 1: Sequence assembly algorithm

● Stage 2: Paired-end assembly algorithm

Page 3: Assembling genomes using ABySS

Stage 1: Sequence assembly algorithm

● Load the reads, breaking each read into k-mers

● Find adjacent k-mers, which overlap by k-1 bases

● Remove k-mers resulting from read errors

● Remove variant sequences

● Generate contigs

Load k-mers → Find overlaps → Prune tips → Pop bubbles → Generate contigs

Page 4: Assembling genomes using ABySS

Load the reads

● For each input read of length l, (l - k + 1) k-mers are generated by sliding a window of length k over the read

Read (l = 12): ATCATACATGAT

k-mers (k = 9): ATCATACAT TCATACATG CATACATGA ATACATGAT

● Each k-mer is a vertex of the de Bruijn graph

● Two adjacent k-mers, which overlap by k-1 bases, form an edge of the de Bruijn graph
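The sliding window on this slide can be sketched in a few lines of Python. This is a minimal illustration rather than ABySS's own code; the function name `kmers` is hypothetical.

```python
def kmers(read: str, k: int) -> list[str]:
    """Return the (l - k + 1) k-mers of a read of length l, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


read = "ATCATACATGAT"  # the read from the slide, l = 12
for kmer in kmers(read, 9):
    print(kmer)
```

With the slide's read and k = 9 this yields the four k-mers listed above, since 12 - 9 + 1 = 4.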

Page 5: Assembling genomes using ABySS

De Bruijn Graph

● A simple graph for k = 5

● Two reads

– GGACATC

– GGACAGA

GGACA → GACAT → ACATC

GGACA → GACAG → ACAGA
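The small graph on this slide can be rebuilt with the sketch below. It is an illustration only: the name `de_bruijn` is hypothetical, and the sketch ignores reverse complements, which a real assembler has to handle.

```python
def de_bruijn(reads, k):
    """Map each k-mer (vertex) to the set of k-mers that can follow it.

    Two k-mers are adjacent when the last k-1 bases of the first
    equal the first k-1 bases of the second.
    """
    vertices = {read[i:i + k] for read in reads
                for i in range(len(read) - k + 1)}
    return {v: {v[1:] + base for base in "ACGT" if v[1:] + base in vertices}
            for v in sorted(vertices)}


graph = de_bruijn(["GGACATC", "GGACAGA"], k=5)
for vertex, successors in graph.items():
    print(vertex, "->", sorted(successors))
```

The vertex GGACA has two successors, GACAT and GACAG, which is the branch drawn on the slide.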

Page 6: Assembling genomes using ABySS

Pruning tips

● Read errors cause tips

Page 7: Assembling genomes using ABySS

Pruning tips

● Read errors cause tips

● Pruning tips removes the erroneous k-mers from the assembly
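A tip is a short dead-end branch hanging off the main path. The sketch below is one simple way to prune such branches, not ABySS's actual procedure. The function name `prune_tips` and the threshold `max_tip_len` are hypothetical, every vertex is assumed to appear as a key of `graph`, and only branches that end without a successor are handled (the mirror case, branches with no predecessor, is omitted for brevity).

```python
def prune_tips(graph, max_tip_len):
    """Remove dead-end branches of at most max_tip_len vertices.

    graph maps each vertex to the set of its successors.
    """
    graph = {v: set(succs) for v, succs in graph.items()}  # work on a copy
    preds = {v: set() for v in graph}
    for v, succs in graph.items():
        for s in succs:
            preds[s].add(v)

    dead_ends = [v for v in graph if not graph[v]]
    for end in dead_ends:
        # Walk backwards from the dead end along the unbranched chain.
        tip, v = [], end
        while (len(tip) < max_tip_len and len(preds[v]) == 1
               and len(graph[v]) <= 1):
            tip.append(v)
            v = next(iter(preds[v]))
        # Prune only if the chain hangs off a vertex that branches.
        if tip and len(graph[v]) > 1:
            graph[v].discard(tip[-1])
            for t in tip:
                del graph[t]
                del preds[t]
    return graph


# A five-k-mer path plus one erroneous k-mer (GACAG) branching off GGACA.
graph = {
    "GGACA": {"GACAT", "GACAG"},
    "GACAT": {"ACATC"},
    "ACATC": {"CATCG"},
    "CATCG": {"ATCGT"},
    "ATCGT": set(),
    "GACAG": set(),
}
pruned = prune_tips(graph, max_tip_len=1)
print(sorted(pruned))
```

The threshold matters: the end of the true path is also a dead end, and it survives here only because it is longer than `max_tip_len`.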

Page 8: Assembling genomes using ABySS

Popping bubbles

● Variant sequences cause bubbles

● Popping bubbles removes the variant sequence from the assembly

● Repeat sequences with small differences also cause bubbles
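A bubble is a pair of branches that leave one vertex and rejoin at another. The sketch below shows one possible way to pop bubbles, under stated assumptions: vertices are abstract labels rather than k-mers, the branch with the lower mean coverage is the one discarded, and `pop_bubbles`, `coverage` and `max_len` are hypothetical names. The slides do not say how ABySS chooses which branch to keep.

```python
def pop_bubbles(graph, coverage, max_len):
    """Collapse simple bubbles, keeping the best-covered branch."""
    graph = {v: set(succs) for v, succs in graph.items()}  # work on a copy
    preds = {v: set() for v in graph}
    for v, succs in graph.items():
        for s in succs:
            preds[s].add(v)

    for fork in list(graph):
        if fork not in graph or len(graph[fork]) < 2:
            continue
        by_join = {}  # vertex where branches rejoin -> list of branches
        for start in graph[fork]:
            branch, v = [], start
            while (len(branch) < max_len and len(preds[v]) == 1
                   and len(graph[v]) == 1):
                branch.append(v)
                v = next(iter(graph[v]))
            if branch and len(preds[v]) > 1:
                by_join.setdefault(v, []).append(branch)
        for join, branches in by_join.items():
            if len(branches) < 2:
                continue
            branches.sort(key=lambda b: sum(coverage[x] for x in b) / len(b),
                          reverse=True)
            for branch in branches[1:]:  # drop all but the best branch
                graph[fork].discard(branch[0])
                preds[join].discard(branch[-1])
                for x in branch:
                    del graph[x]
                    del preds[x]
    return graph


graph = {"S": {"a1", "b1"}, "a1": {"a2"}, "a2": {"T"},
         "b1": {"b2"}, "b2": {"T"}, "T": {"U"}, "U": set()}
coverage = {"S": 30, "a1": 28, "a2": 29, "b1": 3, "b2": 2, "T": 31, "U": 30}
popped = pop_bubbles(graph, coverage, max_len=3)
print(sorted(popped))
```

After popping, the graph is a single unbranched path from S to U through the well-covered branch.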

Page 9: Assembling genomes using ABySS

Assemble contigs

● Remove ambiguous edges

● Output contigs in FASTA format
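These two steps can be sketched as follows. An edge is treated here as unambiguous when its source has exactly one successor and its target has exactly one predecessor; that definition, the function names, and the numeric contig IDs are assumptions for illustration. Contigs that form a pure cycle are not reported by this sketch.

```python
from collections import Counter


def generate_contigs(graph):
    """Merge k-mers along unambiguous edges into contig sequences."""
    indegree = Counter(s for succs in graph.values() for s in succs)
    # Keep only edges whose source has one successor and whose
    # target has one predecessor.
    nxt = {}
    for v, succs in graph.items():
        if len(succs) == 1:
            (s,) = succs
            if indegree[s] == 1:
                nxt[v] = s
    has_prev = set(nxt.values())
    contigs = []
    for start in sorted(graph):
        if start in has_prev:
            continue  # not the first k-mer of a contig
        seq, v = start, start
        while v in nxt:
            v = nxt[v]
            seq += v[-1]  # adjacent k-mers overlap by k-1 bases
        contigs.append(seq)
    return contigs


def to_fasta(contigs):
    """Format contigs as FASTA records with numeric IDs."""
    return "\n".join(f">{i}\n{seq}" for i, seq in enumerate(contigs))


graph = {"GGACA": {"GACAT", "GACAG"}, "GACAT": {"ACATC"},
         "GACAG": {"ACAGA"}, "ACATC": set(), "ACAGA": set()}
print(to_fasta(generate_contigs(graph)))
```

On the unpruned two-read graph from the earlier slide, the branch at GGACA is ambiguous, so the sketch emits three contigs instead of one.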

Page 10: Assembling genomes using ABySS

Stage 2: Paired-end assembly algorithm

● Align the reads to the contigs of the first stage

● Generate an empirical fragment-size distribution using the paired reads that align to the same contig

● Estimate the distance between contigs using the paired reads that align to different contigs

Page 11: Assembling genomes using ABySS

Align the reads to the contigs: KAligner

● Every k-mer in the single-end assembly is unique

● KAligner can map reads with k consecutive correct bases

● ABySS may use other aligners, including BWA and bowtie

Page 12: Assembling genomes using ABySS

Empirical fragment-size distribution: ParseAligns

● Generate an empirical fragment-size distribution using the paired reads that align to the same contig
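As a rough sketch of this step, the code below builds a normalized histogram of fragment sizes from pairs whose two reads align to the same contig. The input layout (a contig name plus the fragment's start and end coordinates) and the function name are hypothetical; this is not ParseAligns itself.

```python
from collections import Counter


def fragment_size_distribution(pairs):
    """Return {fragment size: probability} from same-contig read pairs.

    Each pair is ((contig, fragment_start), (contig, fragment_end)).
    Pairs whose reads align to different contigs are skipped.
    """
    sizes = Counter(end - start
                    for (c1, start), (c2, end) in pairs if c1 == c2)
    total = sum(sizes.values())
    return {size: n / total for size, n in sorted(sizes.items())}


pairs = [
    (("c1", 100), ("c1", 600)),
    (("c1", 250), ("c1", 750)),
    (("c1", 40), ("c1", 530)),
    (("c1", 900), ("c2", 80)),  # spans two contigs, so it is ignored here
]
dist = fragment_size_distribution(pairs)
print(dist)
```

Pairs that span two contigs carry no direct size information, which is why they are set aside here and used for distance estimation instead.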

Page 13: Assembling genomes using ABySS

Estimate distances between contigs: DistanceEst

● Estimate the distance between contigs using the paired reads that align to different contigs

[Figure: contigs joined by estimated distances d = 25 ± 8, d = 3 ± 5, d = 6 ± 5 and d = 4 ± 3]

Page 14: Assembling genomes using ABySS

Maximum likelihood estimator: DistanceEst

● Use the empirical paired-end size distribution

● Maximize the likelihood function

● Find the most likely distance between the two contigs
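The idea can be sketched as a grid search over candidate distances. This is a simplified model, not the DistanceEst implementation: for each pair, `spans` is assumed to hold the part of the fragment covered by the two contigs, so the fragment size is that span plus the unknown gap d. The names `estimate_distance`, `spans`, `candidates` and the pseudo-probability `floor` are all hypothetical.

```python
import math


def estimate_distance(dist, spans, candidates, floor=1e-9):
    """Return the candidate gap d that maximizes the log-likelihood.

    dist maps fragment size to probability (the empirical distribution).
    Sizes never observed get the small probability `floor`.
    """
    def log_likelihood(d):
        return sum(math.log(dist.get(span + d, floor)) for span in spans)

    return max(candidates, key=log_likelihood)


# Toy empirical distribution centered on 500 bp.
dist = {498: 0.2, 499: 0.2, 500: 0.3, 501: 0.2, 502: 0.1}
spans = [475, 476, 474, 475]  # bases covered on the two contigs, per pair
best = estimate_distance(dist, spans, candidates=range(0, 51))
print(best)
```

In this toy example a gap of 25 places the implied fragment sizes on the peak of the distribution, so it is the estimate returned.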

Page 15: Assembling genomes using ABySS

Paired-end algorithm, continued

● Find paths through the contig adjacency graph that agree with the distance estimates

● Merge overlapping paths

● Merge the contigs in these paths and output the FASTA file

Generate paths → Merge paths → Generate contigs

Page 16: Assembling genomes using ABySS

Find consistent paths: SimpleGraph

● Find paths through the contig adjacency graph that agree with the distance estimates

d = 4 ± 3

Actual distance = 3
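One way to picture this search is a depth-first walk that accepts a path only when its length agrees with the estimate. The sketch below assumes that adjacent contigs overlap by k-1 bases, so the gap along a path is the summed length of the intermediate contigs minus one overlap per edge. The function name, the toy graph and the contig lengths are hypothetical.

```python
def consistent_paths(adj, lengths, start, goal, estimate, error, overlap):
    """Return paths from start to goal whose gap is within estimate ± error."""
    results = []

    def walk(v, path, gap):
        # gap is the distance from start to the next contig after v.
        for nxt in adj.get(v, ()):
            if nxt == goal:
                if abs(gap - estimate) <= error:
                    results.append(path + [goal])
            elif nxt not in path:
                new_gap = gap + lengths[nxt] - overlap
                if new_gap <= estimate + error:  # longer paths only get worse
                    walk(nxt, path + [nxt], new_gap)

    walk(start, [start], -overlap)
    return results


adj = {"A": ["X", "Y"], "X": ["B"], "Y": ["B"]}
lengths = {"A": 100, "X": 11, "Y": 30, "B": 80}
# k = 5, so neighboring contigs overlap by 4 bases.
paths = consistent_paths(adj, lengths, "A", "B", estimate=4, error=3, overlap=4)
print(paths)
```

The path through X leaves a gap of 11 - 4 - 4 = 3, which lies within 4 ± 3, while the path through Y leaves a gap of 22 and is rejected.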

Page 17: Assembling genomes using ABySS

Merge overlapping paths: MergePaths

● Merge paths that overlap

Page 18: Assembling genomes using ABySS

Generate the FASTA output

● Merge the contigs in these paths

● Output the FASTA file

G A T T T T T G G A C G T C T T G A T C T T C A C G T A T T G C T A T T

Page 19: Assembling genomes using ABySS

Assembly process

● Stage 1 completed in 3.5 hours

● Used 72 processors on six machines

● Peak memory usage of 180 GB of RAM

● Stage 2 completed in 9 hours

● Used 12 processors on one machine

● Peak memory usage of 48 GB of RAM

● Assembly parameters k=64 s=200 n=10

Page 20: Assembling genomes using ABySS

Assembly results

Level 1: 500-bp paired-end reads

● Assembled half the genome in 7,676 contigs at least as long as the N50 of 50,612 bp

● Assembled 1.81 Gbp in 170,407 contigs larger than 200 bp

● The largest contig is 1,158,576 bp

● Removed 1,296,819 variant sequences
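For readers unfamiliar with the N50 statistic quoted above, the sketch below computes it for a toy list of contig lengths. By default the target is half the summed contig length; the optional `total` argument is an assumption that lets a known genome size be used instead, since the slide speaks of "half the genome". The function name `n50` is hypothetical.

```python
def n50(lengths, total=None):
    """Return (N50, number of contigs at least that long).

    Contigs are taken from longest to shortest until they cover
    at least half of `total` (default: the sum of all lengths).
    """
    if total is None:
        total = sum(lengths)
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if 2 * running >= total:
            return length, count
    return None  # the contigs never reach half of `total`


value, count = n50([80, 70, 50, 40, 30, 20])
print(value, count)
```

In the toy list the two longest contigs (80 + 70 = 150) cover at least half of the 290-base total, so the N50 is 70.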

Page 21: Assembling genomes using ABySS

Alignments to the reference

● Aligned the 170,407 contigs longer than 200 bp

● 96.2% align over at least 99% of their length

● 1.2% align over between 90% and 99% of their length

● 2.5% align over less than 90% of their length

[Pie chart legend: >99%, 90-99%, <90%]

Page 22: Assembling genomes using ABySS

Works in progress

● Replace complex variant sequences with Ns

● Scaffold over gaps and simple repeat sequence using large-fragment mate-pair reads

● Fill in gaps with sequence using localized microassembly

Page 23: Assembling genomes using ABySS

ABySS Publications

● IEEE InfoVis 2009

Page 24: Assembling genomes using ABySS

Acknowledgments

Supervisors

● İnanç Birol

● Steven Jones

Team

● Readman Chiu

● Rod Docking

● Karen Mungall

● Jenny Qian
