Thoughts on the feasibility of an Assemblathon 3 contest
TRANSCRIPT
Please note: this is a draft version of a talk, i.e., these are slides I prepared for Ian Korf to use at the Genome 10K meeting. His final version of this talk will no doubt add or remove much material.
Keith Bradnam 2015-03-04
DNA sequencers keep on getting smaller…
…the challenges of genome assembly seem to keep getting bigger.
*If* there were to be an Assemblathon 3, what suggestions or ideas would you have for it?
Please tweet them using hashtag #A3wishlist
"Hybrid-approaches with PacBio, Nanopore, and Illumina data; non-model systems; egalitarian genomics"
"Polyploid assembly and haplotype reconstruction"
"Give people assemblies + reads, competition on best prediction which assembly is ‘best’ on different metrics"
"I vote axolotl or lungfish for Assemblathon 3! Big, repetitive, interesting, useful genomes"
"Large/complex/repetitive marine genomes; high heterozygosity (no inbred lines); crustacean/sharks; Illumina 250 bp paired ends + PacBio + optical maps"
"PacBio vs Illumina assemblies, Illumina with low PacBio coverage to fill gaps + correcting PacBio errors with Illumina"
"Polyploid (highly heterozygous) genome assembly challenge with emphasis on sub-genome (haplotype) deconvolution"
A lot of people seem to want to assemble something really difficult! This presumes that we have already mastered assembly of haploid, low-repeat-content, average-sized genomes.
Community effort was diluted across different species (only 2 teams assembled all 3 genomes). Multiple species presented more data management issues.
285x
Unrealistic amounts of sequence data available
It is not typical to generate so much sequence data for a genome assembly. Most researchers cannot afford to pay for that much sequencing.
Budget Team Illumina Moleculo PacBio Oxford Nanopore
$5,000
Team A 20x 5x
Team B 40x
Team C 10x 10x
$50,000
Team A 30x 20x 5x 2x
Team B 50x 10x
Team C 75x 30x 10x
Could allow teams to 'buy' sequences from a mix of platforms
Could potentially have two different budgets available (budgets here are just for illustrative reasons)
Fictional example to show different teams could use different strategies
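The "buy your own data" idea comes down to simple arithmetic: the coverage a team gets from a platform is the money spent on it, divided by the price per gigabase, divided by the genome size. The sketch below is only an illustration of that arithmetic. The per-gigabase prices are made-up placeholders, not real pricing, and the function name `coverage_for_budget` is hypothetical.

```python
# Hypothetical prices in dollars per gigabase of sequence.
# These numbers are placeholders for illustration, not real pricing.
PRICE_PER_GBP = {
    "Illumina": 50.0,
    "PacBio": 500.0,
}


def coverage_for_budget(budget, genome_size_gbp, spend_fraction, prices=PRICE_PER_GBP):
    """Return the fold-coverage bought on each platform.

    budget          -- total money available (dollars)
    genome_size_gbp -- estimated genome size in gigabases
    spend_fraction  -- dict of platform name -> fraction of budget spent on it
    """
    if abs(sum(spend_fraction.values()) - 1.0) > 1e-9:
        raise ValueError("spend fractions must sum to 1")
    coverage = {}
    for platform, fraction in spend_fraction.items():
        gbp_bought = budget * fraction / prices[platform]
        coverage[platform] = gbp_bought / genome_size_gbp
    return coverage


# A team splitting a $5,000 budget evenly for a 2.0 Gbp genome.
plan = coverage_for_budget(5000, 2.0, {"Illumina": 0.5, "PacBio": 0.5})
```

With these placeholder prices, different splits of the same budget give very different coverage mixes, which is the point of letting teams choose their own strategy.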
Low amounts of useful validation data
PacBio data could have been held back to validate assemblies, but it wasn't, and it was then used by only a few teams. No good transcript data. Fosmids and optical maps were available (but not for all species).
• More Fosmid and/or BAC sequences?
• Transcript(ome) data?
• Long read sequence data?
• Synteny information?
• Tools such as Irys from BioNano Genomics?
• Require reproducible assembly instructions at the time of submission
• Request better information relating to computer architecture used to make assembly
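One possible way to collect both items at submission time is a small machine-readable record that travels with the assembly. The sketch below is an assumption about what such a record could look like, not a proposed standard: the field names, the function `build_manifest`, and the assembler name and command line are all hypothetical. It uses only the Python standard library, so details such as installed memory are not captured.

```python
import json
import os
import platform


def build_manifest(assembler, version, command):
    """Collect the assembly command and basic machine details in one record."""
    return {
        "assembler": assembler,            # hypothetical field names throughout
        "assembler_version": version,
        "command": command,                # exact command line used for the assembly
        "os": platform.system(),
        "os_release": platform.release(),
        "cpu_architecture": platform.machine(),
        "logical_cpus": os.cpu_count(),    # may be None if it cannot be determined
        "python_version": platform.python_version(),
    }


# Placeholder assembler name and command, for illustration only.
manifest = build_manifest(
    "example-assembler", "0.1", "example-assembler --reads reads.fastq"
)
manifest_json = json.dumps(manifest, indent=2)
```

The JSON text in `manifest_json` could be submitted alongside each assembly file.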
Two different sequence file formats have been developed that can represent haplotype variation in a genome assembly
GFA FASTG
Neither format seems to have been widely adopted… plus there are no (?) downstream bioinformatics tools that work with these formats. Would requiring either format deter participation?
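For anyone who has not seen a graph format, the sketch below shows how a single heterozygous site might be written as a "bubble" in GFA, and how little code is needed to read it. The segment names and sequences are invented. The record layout (tab-separated H, S and L lines) follows the GFA 1 format as we understand it, so check the current specification before relying on it. The helper names `parse_gfa` and `branch_points` are hypothetical.

```python
# A tiny invented graph: two alleles (G or T) between shared flanking sequence.
GFA_TEXT = "\n".join([
    "H\tVN:Z:1.0",
    "S\tleft\tACGTACGT",
    "S\talleleA\tG",
    "S\talleleB\tT",
    "S\tright\tTTGACC",
    "L\tleft\t+\talleleA\t+\t0M",
    "L\tleft\t+\talleleB\t+\t0M",
    "L\talleleA\t+\tright\t+\t0M",
    "L\talleleB\t+\tright\t+\t0M",
])


def parse_gfa(text):
    """Return (segments, links) from GFA text, ignoring other record types."""
    segments = {}
    links = []
    for line in text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S":
            segments[fields[1]] = fields[2]        # name -> sequence
        elif fields[0] == "L":
            links.append((fields[1], fields[3]))   # (from segment, to segment)
    return segments, links


def branch_points(links):
    """Return segments with more than one successor, i.e. where paths diverge."""
    successors = {}
    for src, dst in links:
        successors.setdefault(src, set()).add(dst)
    return {src: sorted(dsts) for src, dsts in successors.items() if len(dsts) > 1}


segments, links = parse_gfa(GFA_TEXT)
bubbles = branch_points(links)
```

Here `bubbles` reports that the `left` segment leads to both alleles, which is the kind of haplotype variation a flat FASTA file cannot express.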
Encourage multiple entries per team?
assembly_1a.fasta assembly_1b.fasta
Some of the better assemblies in Assemblathon 2 were the 'experimental' entries.
How about an endangered species?
Assemblathon 3 could become a shining example of conservation genomics, and choosing an endangered species might help attract more community support. Also good PR!
California Condor (Gymnogyps californianus)
Image from http://www.manataka.org/
Critically endangered. BAC resources may be available.
Tuatara (Sphenodon punctatus)
Image from https://student.societyforscience.org/
A 'living fossil'. Low risk of extinction. BAC libraries and partial transcriptome exist.
Spiny rat (Tokudaia spp.)
Image from https://wikimedia.org/
Endangered. Transcriptome available.
But does it have to be a Genome 10K species?
If the species is eukaryotic and has a large genome, this would still be useful to assess assemblers that could be used for other Genome 10K species.
White abalone (Haliotis sorenseni)
Image from https://wikimedia.org/
Estimated genome size: 1.7–2.0 Gbp. Native to California and Mexico. Critically endangered — first marine invertebrate to be listed under the Endangered Species Act.
Gary Cherr, Director, Bodega Marine Laboratory
Principal Investigator for abalone captive breeding program
"The restoration of the white abalone in the wild — the first time this would ever have been attempted for a listed marine — may depend on the genome being sequenced."
Gary Cherr, Director, Bodega Marine Laboratory
Principal Investigator for abalone captive breeding program
"There’s probably a few thousand left in the wild. But because they’re so far apart, they’re effectively sterile. Their population could be effectively extinct already."
Kristin Aquilino, Manager of abalone captive breeding program
• Trying to please everyone — rather than focusing on something achievable and helpful to the ultimate users of genome assembly software — might not be the most productive strategy