thoughts on the feasibility of an assemblathon 3 contest

Upload: keith-bradnam

Post on 14-Jul-2015

Thoughts on the feasibility of…

Ian Korf

Please note: this is a draft version of a talk, i.e. these are slides I prepared for Ian Korf to use at the Genome 10K meeting. His final version of this talk will no doubt add or remove much material.

Keith Bradnam 2015-03-04

DNA sequencers keep on getting smaller…

…the challenges of genome assembly seem to keep getting bigger.

flickr.com/incrediblehow/

Let the people speak…

*If* there was to be an Assemblathon 3, what suggestions or ideas would you have for it?

Please tweet them using hashtag #A3wishlist

"Hybrid-approaches with PacBio, Nanopore, and Illumina data; non-model systems; egalitarian genomics"

"Polyploid assembly and haplotype reconstruction"

"Give people assemblies + reads, competition on best prediction which assembly is ‘best’ on different metrics"

"I vote axolotl or lungfish for Assemblathon 3! Big, repetitive, interesting, useful genomes"

"Large/complex/repetitive marine genomes; high heterozygosity (no inbred lines); crustacean/sharks; Illumina 250 bp paired ends + PacBio + optical maps"

"PacBio vs Illumina assemblies, Illumina with low PacBio coverage to fill gaps + correcting PacBio errors with Illumina"

"Polyploid (highly heterozygous) genome assembly challenge with emphasis on sub-genome (haplotype) deconvolution"

A lot of people seem to want to assemble something really difficult! This presumes that we have already mastered assembly of haploid, low-repeat-content, average-sized genomes.

Problems with Assemblathon 2

Too many species!

Community effort was diluted across different species (only 2 teams assembled all 3 genomes). Multiple species presented more data management issues.

One species?

285x coverage of parrot genome

Unrealistic amounts of sequence data available

It is not typical to generate so much sequence data for a genome assembly. Most researchers cannot afford to pay for that much sequencing.

Make the assembly challenge representative of a real world scenario

Give teams a virtual budget and let them buy sequencing resources

Budget Team Illumina Moleculo PacBio Oxford Nanopore

$5,000

Team A 20x 5x

Team B 40x

Team C 10x 10x

$50,000

Team A 30x 20x 5x 2x

Team B 50x 10x

Team C 75x 30x 10x

Could allow teams to 'buy' sequences from a mix of platforms

Could potentially have two different budgets available (budgets here are just for illustrative reasons)

Fictional example to show different teams could use different strategies
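
The virtual-budget idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the per-1x prices in `PRICE_PER_X` are invented placeholders (the talk gives no prices), and the names `plan_cost` and `within_budget` are hypothetical helpers, not part of any contest rules.

```python
# Hypothetical cost of 1x coverage on each platform. These numbers are
# placeholders for illustration only; they are not real prices and do
# not come from the talk.
PRICE_PER_X = {
    "Illumina": 100,
    "Moleculo": 400,
    "PacBio": 500,
    "Oxford Nanopore": 800,
}


def plan_cost(plan, prices=PRICE_PER_X):
    """Return the total cost of a plan mapping platform -> coverage (in x)."""
    unknown = set(plan) - set(prices)
    if unknown:
        raise ValueError(f"unknown platform(s): {sorted(unknown)}")
    return sum(prices[platform] * coverage for platform, coverage in plan.items())


def within_budget(plan, budget, prices=PRICE_PER_X):
    """True if the plan can be bought without exceeding the virtual budget."""
    return plan_cost(plan, prices) <= budget


# A single-platform plan against the smaller illustrative budget.
illumina_only = {"Illumina": 40}
# A four-platform mix against the larger illustrative budget.
mixed = {"Illumina": 30, "Moleculo": 20, "PacBio": 5, "Oxford Nanopore": 2}

small_ok = within_budget(illumina_only, 5_000)
large_ok = within_budget(mixed, 50_000)
```

With real price lists, organizers could validate each team's requested data mix before releasing any reads.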

Low amounts of useful validation data

PacBio data could have been held back to validate assemblies but wasn't, and it was then used by only a few teams. No good transcript data. Had Fosmids and optical maps (but not for all species).

• More Fosmid and/or BAC sequences?

• Transcript(ome) data?

• Long read sequence data?

• Synteny information?

• Tools such as Irys from BioNano Genomics?

Documentation for how assemblies were made was often poor or missing altogether

• Require reproducible assembly instructions at the time of submission

• Request better information relating to computer architecture used to make assembly
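
One way to picture these two requests is a small manifest submitted alongside each assembly. The sketch below is an assumption about what such a record might hold, not a format proposed in the talk: the field names, the `build_manifest` helper, and the `example_assembler` command line are all hypothetical.

```python
import json
import os
import platform


def build_manifest(team, entry, commands):
    """Bundle the exact commands that were run with basic machine facts."""
    return {
        "team": team,
        "entry": entry,
        # The commands needed to reproduce the assembly, in order.
        "commands": list(commands),
        # Facts about the computer the assembly was made on.
        "machine": {
            "os": platform.system(),
            "architecture": platform.machine(),
            "cpu_count": os.cpu_count(),
            "python": platform.python_version(),
        },
    }


manifest = build_manifest(
    team="Team A",
    entry="assembly_1a.fasta",
    # Placeholder command; 'example_assembler' is not a real tool.
    commands=["example_assembler --reads reads.fastq --out assembly_1a.fasta"],
)
manifest_json = json.dumps(manifest, indent=2)
```

Memory size and assembler version strings are not captured here; a real manifest would probably need both.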

Other considerations for Assemblathon 3

Two different sequence file formats have been developed that can represent haplotype variation in a genome assembly

GFA FASTG

Neither format seems to have been widely adopted… plus there are no (?) downstream bioinformatics tools that work with these formats. Would requiring either format deter participation?
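
To make "representing haplotype variation" concrete, here is a minimal sketch of a single heterozygous site written as a bubble in GFA-style records. It assumes the tab-separated `S` (segment) and `L` (link) record layout as I understand it from the GFA 1 specification. The segment names and sequences are made up, and the parser is a toy that ignores strand orientation and overlaps.

```python
# A toy graph: a shared left flank, two alternative alleles, and a shared
# right flank. All names and sequences are invented for illustration.
GFA_TEXT = "\n".join([
    "H\tVN:Z:1.0",
    "S\tleft\tACGTAC",
    "S\thapA\tG",
    "S\thapB\tT",
    "S\tright\tTTGACA",
    "L\tleft\t+\thapA\t+\t0M",
    "L\tleft\t+\thapB\t+\t0M",
    "L\thapA\t+\tright\t+\t0M",
    "L\thapB\t+\tright\t+\t0M",
])


def parse_gfa(text):
    """Collect segment sequences and (from, to) links; other records are skipped."""
    segments, links = {}, []
    for line in text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S":
            segments[fields[1]] = fields[2]
        elif fields[0] == "L":
            # Orientation (fields 2 and 4) and overlap (field 5) are ignored:
            # this sketch assumes every link is +/+ with a 0M overlap.
            links.append((fields[1], fields[3]))
    return segments, links


def spell_paths(segments, links, start, end):
    """Spell the sequence of every path from start to end (acyclic graphs only)."""
    out_edges = {}
    for src, dst in links:
        out_edges.setdefault(src, []).append(dst)

    def walk(node):
        if node == end:
            return [segments[node]]
        return [
            segments[node] + tail
            for nxt in out_edges.get(node, [])
            for tail in walk(nxt)
        ]

    return walk(start)


segments, links = parse_gfa(GFA_TEXT)
haplotypes = spell_paths(segments, links, "left", "right")
```

Each path through the bubble spells one haplotype, which is information a single linear FASTA sequence cannot carry.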

Encourage multiple entries per team?

assembly_1a.fasta assembly_1b.fasta

Some of the better assemblies in Assemblathon 2 were the 'experimental' entries.

What species?

How about an endangered species?

Assemblathon 3 could become a shining example of conservation genomics, and choosing an endangered species might help attract more community support. Also good PR!

California Condor (Gymnogyps californianus)

Image from http://www.manataka.org/

Critically endangered. BAC resources may be available.

Tuatara (Sphenodon punctatus)

Image from https://student.societyforscience.org/

A 'living fossil'. Low risk of extinction. BAC libraries and partial transcriptome exist.

Spiny rat (Tokudaia spp.)

Image from https://wikimedia.org/

Endangered. Transcriptome available.

But does it have to be a Genome 10K species?

If the species is eukaryotic and has a large genome, it would still be useful for assessing assemblers that could be used for other Genome 10K species.

White abalone (Haliotis sorenseni)

Image from https://wikimedia.org/

Estimated genome size: 1.7–2.0 Gbp. Native to California and Mexico. Critically endangered — first marine invertebrate to be listed under the Endangered Species Act.

Successfully bred the first white abalone in captivity in 2012.

Gary Cherr, Director, Bodega Marine Laboratory

Principal Investigator for abalone captive breeding program

"The restoration of the white abalone in the wild — the first time this would ever have been attempted for a listed marine — may depend on the genome being sequenced."

Gary Cherr, Director, Bodega Marine Laboratory

Principal Investigator for abalone captive breeding program

"There’s probably a few thousand left in the wild. But because they’re so far apart, they’re effectively sterile. Their population could be effectively extinct already."

Kristin Aquilino, Manager of abalone captive breeding program

Summary

• People seem to want very different things out of a possible Assemblathon 3 contest

• Trying to please everyone — rather than focusing on something achievable and helpful to the ultimate users of genome assembly software — might not be the most productive strategy

From Wikimedia commons

Three months later…

From http://flickr.com/markturner/