SeRC: de novo assembly workshop. Francesco Vezzi

Posted on 22-Apr-2015

DESCRIPTION

De novo assembly, a multi-technology approach: Illumina, PacBio, and OpGen. A multi-technology perspective on de novo assembly projects.

TRANSCRIPT

De novo assembly, a multi-technology approach: Illumina, PacBio, and OpGen

Francesco Vezzi, PhD
Senior Bioinformatician, NGI-Stockholm

NGI Equipment

Both Stockholm and Uppsala nodes

Illumina HiSeq 2000/2500                   16
Illumina MiSeq                              3
Life Technologies SOLiD 5500xl              4
Life Technologies SOLiD 5500wildfire        2
Life Technologies Ion Torrent               2
Life Technologies Ion Proton                6
Life Technologies Sanger ABI3730            2
Pacific Biosciences RSII                    1
Argus Whole Genome Mapping System           1

One of the 3 best-equipped sequencing sites in Europe

In this talk

Illumina (Stockholm):
• 100/150 bp paired reads (low error rate)
• 900/200 Gbp in 6/2 day(s)

PacBio (Uppsala):
• 8.5 Kbp reads (max 30 Kbp, high error rate)
• 375 Mbp (1 SMRT Cell) in 10 hours

OpGen Argus System (Stockholm):
• ~300 Kbp maps
• 10 Gbp in ~1 day

Optical Maps

• Restriction Map
◦ Representation of the cut sites on a given DNA molecule to provide spatial information on genetic loci
• An enzyme is selected and used to cut the molecules. This provides a 2D representation of the molecule structure
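The idea of a restriction map can be sketched in silico: given a sequence and a recognition site, the map is the ordered list of fragment lengths between cut sites. This is a minimal illustration only. The function name `restriction_map`, the toy sequence, and the site `GGATCC` are assumptions for the example, not the enzyme or software used with the Argus system, and the cut is simplified to fall at the start of each site.

```python
def restriction_map(sequence, site):
    """Return the ordered fragment lengths obtained by cutting `sequence`
    at the start of every occurrence of `site` (a simplification: real
    enzymes cut at a fixed offset inside or near the site)."""
    cuts = []
    position = sequence.find(site)
    while position != -1:
        cuts.append(position)
        position = sequence.find(site, position + 1)
    boundaries = [0] + cuts + [len(sequence)]
    # Skip zero-length fragments (e.g. a site at position 0).
    return [end - start
            for start, end in zip(boundaries, boundaries[1:])
            if end > start]


# Toy molecule with two sites (hypothetical data).
molecule = "AAGGATCCTTTTGGATCCA"
print(restriction_map(molecule, "GGATCC"))  # [2, 10, 7]
```

An optical map records this kind of ordered fragment-length pattern for a physical molecule, which is why it can later be matched against the pattern predicted from assembled contigs.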

Optical Maps: workflow

Steps                                     Time
DNA extraction directly from culture      3-8h
Quality control of extracted material     1h
Prepare a chip                            1.5h
Run Argus System                          1h
Data assembly                             2-8h

Closing genomes with Optical Maps

De novo reconstructs parts missing in the reference strain

Correctly assembles long tandem repeats

De novo assembly (Illumina, PacBio): set of unordered and unoriented contigs

[Figure labels: Optical Map, Contigs]

Case Study: Combining all the technologies

~15 Mbp genome sequenced at high coverage with:
• Illumina HiSeq:
◦ 500X PE libraries (180 bp and 650 bp insert)
◦ 150X MP library (3 Kbp)
◦ 150X MP library (7 Kbp)
• PacBio:
◦ 50/60X with reads longer than 2 Kbp
• OpGen:
◦ 3 chips (only one worked really well)
◦ 300X coverage
◦ Average map length 320 Kbp

Assembly Strategy
https://github.com/vezzi/de_novo_scilife

Semi-automated pipeline for de novo assembly:
• Global configuration file: tools and system configuration
• Sample configuration file: samples description

3 modules:
1. QC-module (Illumina only):
• Adaptor removal, kmer-analysis, fastqc, (insert size estimation)
2. Assemble-module (Illumina only):
• Runs specified assemblers and outputs executed commands
3. Validation-module:
• FRCbam, coverage analysis, GC-analysis, (N50)

I NEED USERS/FEEDBACK/CONTRIBUTIONS

QC-Module

Kmer analysis:
• Sample complexity
• Error rate
• Heterozygosity

FASTQC
Adaptor removal
Alignment (partial assembly)
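The k-mer analysis step can be illustrated with a toy k-mer spectrum: count every k-mer in the reads, then tabulate how many distinct k-mers occur once, twice, and so on. In real data, k-mers seen only once or twice are often sequencing errors, and the shape of the main peaks gives hints about heterozygosity and repeats. This is a small pure-Python sketch with made-up reads and hypothetical function names; it is not the tool the pipeline runs, and production k-mer counters are far more memory-efficient.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_spectrum(counts):
    """Map multiplicity -> number of distinct k-mers with that multiplicity."""
    return Counter(counts.values())


# Toy reads (hypothetical): the third read differs by one base.
reads = ["ACGTAC", "ACGTAC", "ACGTTC"]
spectrum = kmer_spectrum(kmer_counts(reads, 4))
for multiplicity in sorted(spectrum):
    print(multiplicity, spectrum[multiplicity])
```

Here the single mismatching base creates two k-mers with multiplicity 1, which is the kind of low-count signal a k-mer spectrum makes visible.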

Assemble-Module

Illumina only:
• SOAPdenovo
• MaSuRCA
• Allpaths-LG

PacBio only:
• HGAP
• CABOG

Hybrid:
• PB-jelly (HAH)

>5000         #scaffolds  totalLength  maxContigLength  N50     N80     percentageNs
Allpaths-LG   227         14513103     596012           139364  57619   15%
MASURCA       163         18549484     1188669          526519  282507  2%
HGAP          290         14399273     763592           142483  37117   0%
PB-Jelly      179         14718213     747750           195225  85127   13%

• Trial-and-error process
• Automated pipeline developed in order to streamline these analyses
• MASURCA surprisingly the “best” assembler
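For readers unfamiliar with the N50 and N80 columns, the sketch below shows the common definition: sort sequences from longest to shortest and report the length at which the running total first reaches 50% (or 80%) of the total assembly length. The function name and the toy lengths are assumptions for illustration; the pipeline's own statistics may differ in detail, for example by filtering short scaffolds first.

```python
def nx(lengths, percent):
    """Length of the sequence at which the cumulative sum, taken from
    longest to shortest, first reaches `percent` % of the total length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length
    return 0  # empty input


# Toy scaffold lengths (hypothetical), total = 300.
scaffolds = [80, 70, 50, 40, 30, 20, 10]
print(nx(scaffolds, 50))  # 70
print(nx(scaffolds, 80))  # 40
```

N80 is never larger than N50 for the same assembly, which matches every row of the table.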

[Figure panels: MaSuRCA, HGAP, PB-Jelly (HAH)]

Validation-Module

FRCbam

PacBio-only assembly is clearly outperforming the others

Optical Maps

PacBio produces the best assembly; however, 290 contigs are produced.

Optical maps made it possible to obtain the 2D representation of the 7 chromosomes. N.B. chromosome number was one of the biological questions of this project!

But much more can be done!!!

Incredible tool to finish (or almost finish) genomes

Assemblies                       % contigs placed  Total size of placed contigs  % size placed contigs  % genome covered
PacBio+OpGen                     94.12             11578995                      97%                    77.05
Allpaths+OpGen                   71.88             10692027                      84%                    52.88
Allpaths+Masurca+OpGen           80.65             27506424                      92%                    69.64
Allpaths+PacBio+OpGen            82.32             22271022                      91%                    83.05
Masurca+PacBio+OpGen             94.44             28393392                      98%                    83.79
Allpaths+Masurca+PacBio+OpGen    85.42             39085419                      94%                    87.39

Combining all the technologies

Conclusions – Take home message

Attempt to automate de novo assembly process:
• https://github.com/vezzi/de_novo_scilife
• Not 100% automated

Illumina, PacBio, Hybrid assemblies:
• PacBio alone seems to produce the best assemblies
• Hybrid assembly does not seem able to correct merged-assembly problems

Mixing technologies is always a good idea:
• Possibility to compensate for technological biases
• Makes it possible to produce better assemblies

Thanks
https://github.com/vezzi/de_novo_scilife
