novel multi-platform next generation assembly methods for mammalian genomes

9
Novel multi-platform next generation assembly methods for mammalian genomes The Baylor College of Medicine, Australian Government and University of Connecticut teamed together (under the KanGO consortium) as part of an international effort to generate a de novo genome sequence for marsupial model species, the tammar wallaby (M. eugenii). • An important model organism is part of the Metatherian mammals and harbors unique life history traits and genome features. • Genome sequencing was done at several institutions , using several technologies, over several years. • Developed a pipeline to integrate long and short read sequences and existing assemblies using well known mapping and assembly tools such as Bowtie and Phrap.

Upload: sibyl

Post on 15-Jan-2016

26 views

Category:

Documents


0 download

DESCRIPTION

Novel multi-platform next generation assembly methods for mammalian genomes. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Novel multi-platform next generation assembly methods for mammalian genomes

Novel multi-platform next generation assembly methods for mammalian genomes

The Baylor College of Medicine, Australian Government and University of Connecticut teamed together (under the KanGO consortium) as part of an international effort to generate a de novo genome sequence for marsupial model species, the tammar wallaby (M. eugenii).

• An important model organism is part of the Metatherian mammals and harbors unique life history traits and genome features.• Genome sequencing was done at several institutions , using several technologies, over several years.• Developed a pipeline to integrate long and short read sequences and existing assemblies using well known mapping and assembly tools such as Bowtie and Phrap.

Page 2: Novel multi-platform next generation assembly methods for mammalian genomes

Overview of the DataTechnology Institution Read Length # Reads # Bases

Sanger BCM Avg. 915 bp 9,924,136 9,088,748,105

454 AGRF, UCONN Avg. 160 bp 1,530,592 275,951,386

Illumina AGRF 100 bp 271,875,064 27,187,506,400

Solid BCM 25 bp 710,427,490 18,471,114,740

• All reads used in reassembly were “paired”.• Read orientation not considered.• The Solid reads had insert size of 1.395 kb.• The Illumina reads had insert size of 3 and 8 kb in roughly equal proportions.• The 454 reads had insert size of 8, 12, 20, and 30 kb with the majority of them split between 20 and 30 kb.

Read ReadInsert

“Paired” Read Overview:

Page 3: Novel multi-platform next generation assembly methods for mammalian genomes

Local Assembly Pipeline

Initial AssemblySanger reads assembled using Atlas (BCM). Initial scaffolding done using Solid mate pairs.

Map Short ReadsTo scaffolds using Bowtie, there are two cases:• Orphaned, when one mate is unmapped.• Complete, when both are mapped.

Reads which map to multiple locations (non unique) are not considered.

Key: 454, illumina, sanger, unmapped read, dotted line – estimated distance

Initial Assembly

Map Short Reads

Scaffold Assembly

Final Mapping

Gap Re-estimation

Quality Calling

Finished Assembly

Page 4: Novel multi-platform next generation assembly methods for mammalian genomes

Pipeline continued

Final MappingMap all data to the final contigs.

Gap RestimationOf all gap distances in scaffolds using complete pairs mapped on different contigs.

;;;;;;;;;;;7;;;;;-;;;3;83;;;;;;;;;;;9;7;;.7;393333;;;;;;;;;;;9;7;;.7;393333 ;;;;;;;;;;;7;;;;;-;;;3;83;;3;;;;;;;7;;;;;;;88383;;;34 ;;3;;;;;;7;;;;;;;88383;;;77

Quality CallingMap all reads and calculate a quality of each base.

Scaffold AssemblyFeed contigs, complete and orphaned pairs to Phrap. Re-assemble.

Output: Contigs, Quality Scores, Scaffold (agp file).

Page 5: Novel multi-platform next generation assembly methods for mammalian genomes

Local Assembly Close-up

• Screen shot taken from Codon Code aligner which uses Phrap to map reads.• Red and blue denote orientation.

• A: Two contigs may be fused using short reads.• B: Contig will be extended with short reads at one end.• C: Single nucleotide and small errors are corrected using the short reads higher coverage. • We are confident of changes since even at < 10x each read is itself uniquely mapped, or its approximate position supported by a uniquely mapped read.

A B

C

Page 6: Novel multi-platform next generation assembly methods for mammalian genomes

Assembly Comparison and Validation

RIKEN BACs Total Reads Total Bases Recovered Reads

Recovered Bases

1.1 (original)

BAC 147312 113232882 34734 25662069

FOSMID 31250 18777544 4335 2263696

1.2 (updated)

BAC 147312 113232882 33328 24555624

FOSMID 31250 18777544 4294 2241059

Category Meug 1.0 Meug 1.1 Meug 1.2

Contigs (10^6) 1.21 1.17 1.101

N50 (10^3) 2.5 2.6 2.91

Bases (10^6) 2546 2536 2574

scaffolds 616418 277711 277711

Gaps (10^6) NA 539 614

Page 7: Novel multi-platform next generation assembly methods for mammalian genomes

Gap Estimation Methods

• Step1: Maximization: compute gap estimate (x), let the mean insertion length of N pairs equal μ (initial value is library average).

• Step 2: Sampling, given x, and the length of contigs, sample μ from completely mapped reads spanning gaps.

• Gap between two contigs estimated using an expectation maximization algorithm. Steps are repeated until estimated parameters do not change.

Page 8: Novel multi-platform next generation assembly methods for mammalian genomes

Gap Estimation Results

Simulation study of EM algorithm accuracy.

ctg len \ gap 1000 2000 3000 4000 5000

10 -19 -7 2 -1 -15

100 66 87 92 88 74

200 161 188 192 188 173

500 446 492 492 487 469

800 733 795 794 784 764

1000 939 997 995 982 956

1200 1153 1196 1196 1177 1152

1500 1501 1493 1496 1469 1436

• When using libraries with different insert size and std deviation it is necessary to bundle the estimates.• The following is an example of how two libraries are bundled: • nx = # reads in libx, ex = libx estimate, sx = lib std dev.

Page 9: Novel multi-platform next generation assembly methods for mammalian genomes

Conclusion

• This method is a viable way of improving existing draft genomes with short read technologies at limited (<10x depth) coverage.

• This method is robust and easily parallelized so it is practical for large mammalian genomes.

• Better results may be obtained through multiple iterations.

• Re-scaffolding of the contigs should be done between iterations.

• A contig aware assembly algorithm could improve local assembly performance.

Future Work