2014 toronto-torbug
TRANSCRIPT
![Page 2: 2014 toronto-torbug](https://reader036.vdocuments.site/reader036/viewer/2022070315/554e743fb4c905f66a8b4cb7/html5/thumbnails/2.jpg)
Hello!
Assistant Professor; Microbiology; Computer Science; etc.
More information at:
• ged.msu.edu/
• github.com/ged-lab/
• ivory.idyll.org/blog/
• @ctitusbrown
![Page 3: 2014 toronto-torbug](https://reader036.vdocuments.site/reader036/viewer/2022070315/554e743fb4c905f66a8b4cb7/html5/thumbnails/3.jpg)
Introducing k-mers
CCGATTGCACTGGACCGA (<- read)
CCGATTGCAC CGATTGCACT GATTGCACTG ATTGCACTGG TTGCACTGGA TGCACTGGAC GCACTGGACC CACTGGACCG ACTGGACCGA
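A minimal sketch of how the k-mers on this slide can be produced: slide a window of length k along the read. The choice k = 10 is read off the k-mers shown; the function name `kmers` is only illustrative. A read of length L has L - k + 1 k-mers, so this 18-base read has 9.

```python
def kmers(read, k):
    """Return every length-k substring of read, in left-to-right order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


read = "CCGATTGCACTGGACCGA"
ten_mers = kmers(read, 10)

# Print each k-mer indented by its offset, so the overlaps line up under the read.
print(read)
for offset, kmer in enumerate(ten_mers):
    print(" " * offset + kmer)
```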
![Page 4: 2014 toronto-torbug](https://reader036.vdocuments.site/reader036/viewer/2022070315/554e743fb4c905f66a8b4cb7/html5/thumbnails/4.jpg)
K-mers give you an implicit alignment
CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC
CATGGACCGATTGCACTGGACCGATGCACGGTACCG
![Page 5: 2014 toronto-torbug](https://reader036.vdocuments.site/reader036/viewer/2022070315/554e743fb4c905f66a8b4cb7/html5/thumbnails/5.jpg)
K-mers give you an implicit alignment
CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC
CATGGACCGATTGCACTGGACCGATGCACGGTACCG
CATGGACCGATTGCACTGGACCGATGCACGGACCG
(with no accounting for mismatches or indels)
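One way to picture "implicit alignment" is to let shared k-mers vote on the offset between two reads. This is a sketch, not how any particular assembler does it. The two reads are the slide's sequence split at an assumed boundary, k = 10 is an assumption, and `kmer_positions` and `implied_offset` are hypothetical names.

```python
from collections import Counter


def kmer_positions(seq, k):
    """Map each k-mer to the first position at which it occurs in seq."""
    positions = {}
    for i in range(len(seq) - k + 1):
        positions.setdefault(seq[i:i + k], i)
    return positions


def implied_offset(read_a, read_b, k):
    """Most common (position in read_a - position in read_b) over shared k-mers, or None."""
    index = kmer_positions(read_a, k)
    votes = Counter()
    for j in range(len(read_b) - k + 1):
        kmer = read_b[j:j + k]
        if kmer in index:
            votes[index[kmer] - j] += 1
    if not votes:
        return None  # no exact k-mer in common: no implicit alignment
    return votes.most_common(1)[0][0]


read_a = "CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC"
read_b = "CATGGACCGATTGCACTGGACCGATGCACGGTACCG"
print(implied_offset(read_a, read_b, 10))
```

A negative result means read_b starts before read_a. Only exact k-mer matches vote, which is the slide's point about mismatches and indels: a k-mer spanning one never matches.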
![Page 6: 2014 toronto-torbug](https://reader036.vdocuments.site/reader036/viewer/2022070315/554e743fb4c905f66a8b4cb7/html5/thumbnails/6.jpg)
De Bruijn graphs – assemble on overlaps
J.R. Miller et al. / Genomics (2010)
![Page 7: 2014 toronto-torbug](https://reader036.vdocuments.site/reader036/viewer/2022070315/554e743fb4c905f66a8b4cb7/html5/thumbnails/7.jpg)
The problem with k-mers
CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC
CATGGACCGATTGCACTCGACCGATGCACGGTACCG
Each sequencing error results in k novel k-mers!
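A quick check of that claim, as a sketch. The error-free read below is an assumption: it is the slide's second read with the single substituted base (G to C) reverted. k = 10 is also assumed. An error at least k bases from either end of the read is covered by exactly k windows; errors near the ends give fewer novel k-mers.

```python
def kmer_set(seq, k):
    """Return the set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


k = 10
true_read = "CATGGACCGATTGCACTGGACCGATGCACGGTACCG"
error_read = "CATGGACCGATTGCACTCGACCGATGCACGGTACCG"  # one G->C substitution

# k-mers that exist only because of the error
novel = kmer_set(error_read, k) - kmer_set(true_read, k)
print(len(novel), "novel k-mers from one substitution")
```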
![Page 8: 2014 toronto-torbug](https://reader036.vdocuments.site/reader036/viewer/2022070315/554e743fb4c905f66a8b4cb7/html5/thumbnails/8.jpg)
Conway T.C., Bromage A.J., Bioinformatics 2011;27:479-486
© The Author 2011. Published by Oxford University Press.
Assembly graphs scale with data size, not information.
![Page 9: 2014 toronto-torbug](https://reader036.vdocuments.site/reader036/viewer/2022070315/554e743fb4c905f66a8b4cb7/html5/thumbnails/9.jpg)
Practical memory measurements (soil)
Velvet measurements (Adina Howe)
![Page 10: 2014 toronto-torbug](https://reader036.vdocuments.site/reader036/viewer/2022070315/554e743fb4c905f66a8b4cb7/html5/thumbnails/10.jpg)
Counting k-mers efficiently (RAM)
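The figure for this slide did not survive, so the following is an assumption about the kind of structure meant, not khmer's actual implementation: a Count-Min Sketch-style counter that uses a fixed amount of memory however many distinct k-mers arrive. Counts can be overestimated when k-mers collide, but never underestimated. The class name, table sizes, and hashing scheme are all illustrative.

```python
import hashlib


class CountMinSketch:
    """Fixed-memory approximate counter: may overcount, never undercounts."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _slots(self, item):
        # One hash per table, made different by salting with the table index.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode(), salt=row.to_bytes(8, "little")
            ).digest()
            yield row, int.from_bytes(digest[:8], "little") % self.width

    def add(self, item):
        for row, col in self._slots(item):
            self.tables[row][col] += 1

    def count(self, item):
        # The smallest counter is the one least inflated by collisions.
        return min(self.tables[row][col] for row, col in self._slots(item))


sketch = CountMinSketch(width=1000, depth=4)
read = "CCGATTGCACTGGACCGA"
for i in range(len(read) - 10 + 1):
    sketch.add(read[i:i + 10])
sketch.add("CCGATTGCAC")  # see the first k-mer a second time
print(sketch.count("CCGATTGCAC"))
```

Memory here is width x depth counters, fixed up front. The trade-off is a false-positive rate that rises as the tables fill.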
![Page 11: 2014 toronto-torbug](https://reader036.vdocuments.site/reader036/viewer/2022070315/554e743fb4c905f66a8b4cb7/html5/thumbnails/11.jpg)
This leads to good things.
![Page 12: 2014 toronto-torbug](https://reader036.vdocuments.site/reader036/viewer/2022070315/554e743fb4c905f66a8b4cb7/html5/thumbnails/12.jpg)
Data structures & algorithms papers
• “These are not the k-mers you are looking for…”, Zhang et al., arXiv 1309.2975, in review.
• “Scaling metagenome sequence assembly with probabilistic de Bruijn graphs”, Pell et al., PNAS 2012.
• “A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data”, Brown et al., arXiv 1203.4802, under revision.
![Page 13: 2014 toronto-torbug](https://reader036.vdocuments.site/reader036/viewer/2022070315/554e743fb4c905f66a8b4cb7/html5/thumbnails/13.jpg)
Data analysis papers
• “Tackling soil diversity with the assembly of large, complex metagenomes”, Howe et al., PNAS, 2014.
• Assembling novel ascidian genomes & transcriptomes, Lowe et al., in prep.
• A de novo lamprey transcriptome from large scale multi-tissue mRNAseq, Scott et al., in prep.
![Page 14: 2014 toronto-torbug](https://reader036.vdocuments.site/reader036/viewer/2022070315/554e743fb4c905f66a8b4cb7/html5/thumbnails/14.jpg)
Lab approach – not intentional, but working out.
![Page 15: 2014 toronto-torbug](https://reader036.vdocuments.site/reader036/viewer/2022070315/554e743fb4c905f66a8b4cb7/html5/thumbnails/15.jpg)
This leads to good things.
(khmer software)
![Page 16: 2014 toronto-torbug](https://reader036.vdocuments.site/reader036/viewer/2022070315/554e743fb4c905f66a8b4cb7/html5/thumbnails/16.jpg)
Current research
(khmer software)
![Page 17: 2014 toronto-torbug](https://reader036.vdocuments.site/reader036/viewer/2022070315/554e743fb4c905f66a8b4cb7/html5/thumbnails/17.jpg)
How is this feasible?!
Representative half-arsed lab software development
![Page 18: 2014 toronto-torbug](https://reader036.vdocuments.site/reader036/viewer/2022070315/554e743fb4c905f66a8b4cb7/html5/thumbnails/18.jpg)
A not-insane way to do software development
![Page 19: 2014 toronto-torbug](https://reader036.vdocuments.site/reader036/viewer/2022070315/554e743fb4c905f66a8b4cb7/html5/thumbnails/19.jpg)
A not-insane way to do software development
![Page 20: 2014 toronto-torbug](https://reader036.vdocuments.site/reader036/viewer/2022070315/554e743fb4c905f66a8b4cb7/html5/thumbnails/20.jpg)
Testing & version control – the not so secret sauce
• High test coverage - grown over time.
• Stupidity driven testing – we write tests for bugs after we find them and before we fix them.
• Pull requests & continuous integration – does your proposed merge break tests?
• Pull requests & code review – does new code meet our minimal coding etc. requirements?
  o Note: spellchecking!!!
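A sketch of what "stupidity driven testing" can look like in practice. The bug here is hypothetical, not one from khmer: lowercase reads were being counted separately from uppercase ones. The test is written first and seen to fail, then the one-line fix goes in and the test stays in the suite. `count_kmers` is an illustrative name.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count k-mers across reads, treating bases case-insensitively."""
    counts = Counter()
    for read in reads:
        read = read.upper()  # the fix: added after the test below had failed
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def test_lowercase_reads_are_counted_with_uppercase():
    # Regression test for the (hypothetical) bug report: without the
    # .upper() call above, this returns {"ACGT": 1, "acgt": 1}.
    assert count_kmers(["ACGT", "acgt"], 4) == {"ACGT": 2}


test_lowercase_reads_are_counted_with_uppercase()
```

With continuous integration on pull requests, a test like this runs on every proposed merge, so the same bug cannot quietly come back.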
![Page 21: 2014 toronto-torbug](https://reader036.vdocuments.site/reader036/viewer/2022070315/554e743fb4c905f66a8b4cb7/html5/thumbnails/21.jpg)
Integration testing
• khmer is designed to work with other packages.
• For releases >= 1.0, we have now added acceptance tests to make sure that khmer works OK with other packages.
• These acceptance tests are based on integration tests, which in turn come from an education & documentation effort…
![Page 22: 2014 toronto-torbug](https://reader036.vdocuments.site/reader036/viewer/2022070315/554e743fb4c905f66a8b4cb7/html5/thumbnails/22.jpg)
khmer-protocols
![Page 23: 2014 toronto-torbug](https://reader036.vdocuments.site/reader036/viewer/2022070315/554e743fb4c905f66a8b4cb7/html5/thumbnails/23.jpg)
khmer-protocols:
• Provide standard “cheap” assembly protocols for the cloud.
• Entirely copy/paste; ~2-6 days from raw reads to assembly, annotations, and differential expression analysis. ~$150 per data set (on Amazon rental computers)
• Open, versioned, forkable, citable….
![Page 24: 2014 toronto-torbug](https://reader036.vdocuments.site/reader036/viewer/2022070315/554e743fb4c905f66a8b4cb7/html5/thumbnails/24.jpg)
Literate testing
• Our shell-command tutorials for bioinformatics can now be executed in an automated fashion – commands are extracted automatically into shell scripts.
• See: github.com/ged-lab/literate-resting/.
• Tremendously improves peace of mind and confidence moving forward!
Leigh Sheneman
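The extraction step could look roughly like this. It is a sketch under assumptions, not the literate-resting code: the tutorials are assumed to be reStructuredText, with commands in indented literal blocks after a line ending in `::`. The tutorial text, the example.com URL, and `extract_commands` are made up for illustration.

```python
def extract_commands(rst_text):
    """Collect the lines of literal blocks (indented text after a line ending in '::')."""
    commands = []
    in_block = False  # True while inside, or about to enter, a literal block
    for line in rst_text.splitlines():
        stripped = line.strip()
        if in_block:
            if not stripped:
                continue  # blank lines do not end a literal block
            if line[0] in " \t":
                commands.append(stripped)  # indented line: a command
                continue
            in_block = False  # first unindented line ends the block
        if stripped.endswith("::"):
            in_block = True
    return commands


tutorial = """\
Download the data::

    curl -O http://example.com/reads.fq.gz
    gunzip reads.fq.gz

Now look at the first read::

    head -4 reads.fq

That's it.
"""

# "set -e" makes the generated script stop at the first failing command.
script = "#!/bin/bash\nset -e\n" + "\n".join(extract_commands(tutorial)) + "\n"
print(script)
```

Running the generated script on a fresh machine then tests the tutorial itself: if a documented command breaks, the script fails.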
![Page 25: 2014 toronto-torbug](https://reader036.vdocuments.site/reader036/viewer/2022070315/554e743fb4c905f66a8b4cb7/html5/thumbnails/25.jpg)
Doing things right => #awesomesauce
![Page 26: 2014 toronto-torbug](https://reader036.vdocuments.site/reader036/viewer/2022070315/554e743fb4c905f66a8b4cb7/html5/thumbnails/26.jpg)
Benchmarking protocols
Data subset; AWS m1.xlarge
~1 hour
(See PyCon 2014 talk; video and blog post.)
![Page 27: 2014 toronto-torbug](https://reader036.vdocuments.site/reader036/viewer/2022070315/554e743fb4c905f66a8b4cb7/html5/thumbnails/27.jpg)
Benchmarking protocols
Complete data; AWS m1.xlarge
~40 hours
(See PyCon 2014 talk; video and blog post.)
![Page 28: 2014 toronto-torbug](https://reader036.vdocuments.site/reader036/viewer/2022070315/554e743fb4c905f66a8b4cb7/html5/thumbnails/28.jpg)
Current research
![Page 29: 2014 toronto-torbug](https://reader036.vdocuments.site/reader036/viewer/2022070315/554e743fb4c905f66a8b4cb7/html5/thumbnails/29.jpg)
Genomic intervals shared between data sets
Qingpeng Zhang
* Assembly free!
![Page 30: 2014 toronto-torbug](https://reader036.vdocuments.site/reader036/viewer/2022070315/554e743fb4c905f66a8b4cb7/html5/thumbnails/30.jpg)
Error correction via graph alignment
Jason Pell and Jordan Fish
![Page 31: 2014 toronto-torbug](https://reader036.vdocuments.site/reader036/viewer/2022070315/554e743fb4c905f66a8b4cb7/html5/thumbnails/31.jpg)
Error correction on simulated E. coli data
1% error rate, 100x coverage.
Jordan Fish and Jason Pell
| | TP (corrected) | FP (mistakes) | TN (OK) | FN (missed) |
|---|---|---|---|---|
| ideal | 3,469,834 (99.1%) | 8,186 | 460,655,449 | 31,731 (0.9%) |
| 1-pass | 2,827,839 (80.8%) | 30,254 | 460,633,381 | 673,726 (19.2%) |
| 1.2-pass | 3,403,171 (97.2%) | 8,764 | 460,654,871 | 98,394 (2.8%) |
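The percentages in the table appear to be fractions of the true errors, i.e. TP / (TP + FN) and FN / (TP + FN). That reading is an inference from the numbers, not something the slide states. A short check, with the counts copied from the table:

```python
rows = {
    "ideal":    {"tp": 3_469_834, "fp": 8_186,  "tn": 460_655_449, "fn": 31_731},
    "1-pass":   {"tp": 2_827_839, "fp": 30_254, "tn": 460_633_381, "fn": 673_726},
    "1.2-pass": {"tp": 3_403_171, "fp": 8_764,  "tn": 460_654_871, "fn": 98_394},
}

for name, r in rows.items():
    true_errors = r["tp"] + r["fn"]  # errors actually present in the reads
    corrected = r["tp"] / true_errors
    missed = r["fn"] / true_errors
    print(f"{name:9s} errors={true_errors:,} corrected={corrected:.1%} missed={missed:.1%}")
```

TP + FN comes to the same total in every row, consistent with all three methods being run on the same simulated errors.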
![Page 32: 2014 toronto-torbug](https://reader036.vdocuments.site/reader036/viewer/2022070315/554e743fb4c905f66a8b4cb7/html5/thumbnails/32.jpg)
Single pass, reference free, tunable, streaming online variant calling.
Streaming, online variant calling.
See NIH BIG DATA grant, http://ged.msu.edu/.
![Page 33: 2014 toronto-torbug](https://reader036.vdocuments.site/reader036/viewer/2022070315/554e743fb4c905f66a8b4cb7/html5/thumbnails/33.jpg)
Novelty… to what power?
• “Novelty” requirements for “high impact publishing”:
  o Must do novel algorithm development
  o …and apply to novel and interesting data sets.
  o (See Josh Bloom, https://medium.com/tech-talk/dd88857f662)
• We’ve taken on the additional challenge of trying to develop and maintain a core set of functionality in research software: novelty cubed? :)
![Page 34: 2014 toronto-torbug](https://reader036.vdocuments.site/reader036/viewer/2022070315/554e743fb4c905f66a8b4cb7/html5/thumbnails/34.jpg)
Reproducibility
Scientific progress relies on reproducibility of analysis. (Aristotle, Nature, 322 BCE.)
All our papers now have:
• Source hosted on github;
• Data hosted there or on AWS;
• Long running data analysis => ‘make’
• Graphing and data digestion => IPython Notebook (also in github)
Qingpeng Zhang
![Page 35: 2014 toronto-torbug](https://reader036.vdocuments.site/reader036/viewer/2022070315/554e743fb4c905f66a8b4cb7/html5/thumbnails/35.jpg)
Concluding thoughts
• API is destiny – without online counting, diginorm & streaming approaches would not have been possible.
• Tackle the hard problems – engineering optimization would not have gotten us very far.
• Testing lets us scale development & process – which means when something works, we can run with it.
![Page 36: 2014 toronto-torbug](https://reader036.vdocuments.site/reader036/viewer/2022070315/554e743fb4c905f66a8b4cb7/html5/thumbnails/36.jpg)
Caveats
• Expense and effort – you can spend an infinite amount of time on infrastructure & process!
  o Advice: choose techniques that address actual pain points.
  o (See: “Best Practices in Scientific Computing”, Wilson et al., 2014)
• Funders and reviewers just don’t care – adopt good software practices for yourself, not others.
  o Advice: briefly mention keywords in grants, papers.
• Advisors just don’t care – see above.
  o These are 90% true statements :>
![Page 37: 2014 toronto-torbug](https://reader036.vdocuments.site/reader036/viewer/2022070315/554e743fb4c905f66a8b4cb7/html5/thumbnails/37.jpg)
Can we crowdsource bioinformatics?
We already are! Bioinformatics is already a tremendously open and collaborative endeavor. (Let’s take advantage of it!)
“It’s as if somewhere, out there, is a collection of totally free software that can do a far better job than ours can, with open, published methods, great support networks and fantastic tutorials. But that’s madness – who on Earth would create such an amazing resource?”
- http://thescienceweb.wordpress.com/2014/02/21/bioinformatics-software-companies-have-no-clue-why-no-one-buys-their-products/
![Page 38: 2014 toronto-torbug](https://reader036.vdocuments.site/reader036/viewer/2022070315/554e743fb4c905f66a8b4cb7/html5/thumbnails/38.jpg)
Thanks!
![Page 39: 2014 toronto-torbug](https://reader036.vdocuments.site/reader036/viewer/2022070315/554e743fb4c905f66a8b4cb7/html5/thumbnails/39.jpg)
Prospective: sequencing tumor cells
• Goal: phylogenetically reconstruct causal “driver mutations” in face of passenger mutations.
• 1000 cells x 3 Gbp x 20 coverage: 60 Tbp of sequence.
• Most of this data will be redundant and not useful.
• Developing diginorm-based algorithms to eliminate data while retaining variant information.
![Page 40: 2014 toronto-torbug](https://reader036.vdocuments.site/reader036/viewer/2022070315/554e743fb4c905f66a8b4cb7/html5/thumbnails/40.jpg)
Where are we taking this?
• Streaming online algorithms only look at data ~once.
• Diginorm is streaming, online…
• Conceptually, can move many aspects of sequence analysis into streaming mode.
=> Extraordinary potential for computational efficiency.
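To make "streaming, online" concrete, here is a sketch of the digital normalization idea in one pass. It follows the general approach of the diginorm preprint listed earlier, with simplifications that are assumptions: an exact in-memory counter instead of a probabilistic one, and illustrative values for `k` and `cutoff`. Each read is looked at once; it is kept only if the median count of its k-mers so far is below the cutoff.

```python
from collections import Counter
from statistics import median


def diginorm(reads, k=10, cutoff=3):
    """Yield reads whose median k-mer count (so far) is below cutoff; one pass."""
    counts = Counter()
    for read in reads:  # each read is seen once, in arrival order
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not kmers:
            continue  # read shorter than k: nothing to estimate coverage from
        if median(counts[km] for km in kmers) < cutoff:
            counts.update(kmers)  # only kept reads add to the counts
            yield read


# Ten copies of one read plus one unrelated read: the redundancy is discarded.
reads = ["CCGATTGCACTGGACCGA"] * 10 + ["TTTTACGTACGGGCATAT"]
kept = list(diginorm(reads, k=10, cutoff=3))
print(len(kept), "of", len(reads), "reads kept")
```

Because the keep/discard decision uses only the counts seen so far, the reads can be a generator over a file or a network stream; nothing needs a second look at the data.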