2014 toronto-torbug

Building khmer, a platform for research in scalable sequence analysis

C. Titus Brown

Hello!

Assistant Professor; Microbiology; Computer Science; etc.

More information at:

• ged.msu.edu/
• github.com/ged-lab/
• ivory.idyll.org/blog/
• @ctitusbrown

Introducing k-mers

CCGATTGCACTGGACCGA (<- read)

CCGATTGCAC CGATTGCACT GATTGCACTG ATTGCACTGG TTGCACTGGA TGCACTGGAC GCACTGGACC CACTGGACCG ACTGGACCGA
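The decomposition above can be sketched in a few lines of Python. This is a minimal illustration, not khmer's implementation; the function name `kmers` is made up for the example. A read of length L yields L - k + 1 k-mers, so the 18-base read gives nine 10-mers.

```python
def kmers(read, k):
    """Return every length-k substring of `read`, in left-to-right order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


read = "CCGATTGCACTGGACCGA"  # the read from the slide
for kmer in kmers(read, 10):
    print(kmer)
```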

K-mers give you an implicit alignment

      CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC
CATGGACCGATTGCACTGGACCGATGCACGGTACCG

K-mers give you an implicit alignment

      CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC
CATGGACCGATTGCACTGGACCGATGCACGGTACCG
CATGGACCGATTGCACTGGACCGATGCACGGACCG

(with no accounting for mismatches or indels)

De Bruijn graphs – assemble on overlaps

J.R. Miller et al. / Genomics (2010)
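A minimal sketch of the idea, assuming the common formulation in which nodes are (k-1)-mers and each k-mer contributes one edge from its prefix to its suffix. The function name `de_bruijn` and the two toy reads are illustrative only, and reverse complements are ignored to keep it short.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that directly follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # One edge per k-mer: prefix node -> suffix node.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


# Two overlapping toy reads (hypothetical data for the example).
reads = ["CCGATTGCACTGG", "TGCACTGGACCGA"]
graph = de_bruijn(reads, 5)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Reads that overlap share nodes and edges, so the overlap is found without ever aligning the reads against each other.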

The problem with k-mers

      CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC
CATGGACCGATTGCACTCGACCGATGCACGGTACCG

Each sequencing error results in k novel k-mers!
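This is easy to check on the example reads. The sketch below assumes the second read in the examples is `CATGG...TACCG` (36 bases) and that the error is the single G -> C substitution shown above; `kmer_set` is a name made up for the illustration. A substitution in the interior of a read touches k overlapping k-mers; one within k - 1 bases of a read end touches fewer.

```python
def kmer_set(read, k):
    """Return the set of distinct k-mers in `read`."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


K = 10
correct = "CATGGACCGATTGCACTGGACCGATGCACGGTACCG"
with_error = "CATGGACCGATTGCACTCGACCGATGCACGGTACCG"  # one G -> C substitution

# k-mers that exist only because of the error
novel = kmer_set(with_error, K) - kmer_set(correct, K)
print(len(novel), "novel k-mers from one substitution")
```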

Conway T.C., Bromage A.J., Bioinformatics 2011;27:479-486


Assembly graphs scale with data size, not information.

Practical memory measurements (soil)

Velvet measurements (Adina Howe)

Counting k-mers efficiently (RAM)
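The slide itself is a figure, so the details are not given here. One well-known way to count in fixed memory is a Count-Min Sketch; the sketch below is a generic illustration under that assumption, not a description of khmer's internals. The class name, the table sizes, and the use of SHA-256 as the hash are all choices made for the example.

```python
import hashlib


class CountMinSketch:
    """Fixed-memory approximate counter: may overcount, never undercounts."""

    def __init__(self, width=1000, depth=4):
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _slots(self, kmer):
        # One slot per table, each derived from a differently salted hash.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{kmer}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, kmer):
        for row, col in self._slots(kmer):
            self.tables[row][col] += 1

    def count(self, kmer):
        # Collisions can only inflate a slot, so the minimum is the best estimate.
        return min(self.tables[row][col] for row, col in self._slots(kmer))


cms = CountMinSketch()
for _ in range(3):
    cms.add("CCGATTGCAC")
print(cms.count("CCGATTGCAC"))
```

Memory use is set by `width * depth`, not by the number of distinct k-mers, which is the property that matters when the data are much larger than RAM.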

This leads to good things.

Data structures & algorithms papers

• “These are not the k-mers you are looking for…”, Zhang et al., arXiv 1309.2975, in review.

• “Scaling metagenome sequence assembly with probabilistic de Bruijn graphs”, Pell et al., PNAS 2012.

• “A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data”, Brown et al., arXiv 1203.4802, under revision.

Data analysis papers

• “Tackling soil diversity with the assembly of large, complex metagenomes”, Howe et al., PNAS, 2014.

• Assembling novel ascidian genomes & transcriptomes, Lowe et al., in prep.

• A de novo lamprey transcriptome from large scale multi-tissue mRNAseq, Scott et al., in prep.

Lab approach – not intentional, but working out.

This leads to good things.

(khmer software)

Current research

(khmer software)

How is this feasible?!

Representative half-arsed lab software development

A not-insane way to do software development

Testing & version control – the not so secret sauce

• High test coverage - grown over time.

• Stupidity driven testing – we write tests for bugs after we find them and before we fix them.

• Pull requests & continuous integration – does your proposed merge break tests?

• Pull requests & code review – does new code meet our minimal coding etc requirements?
  o Note: spellchecking!!!
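As an illustration of the "test for the bug first, then fix it" workflow, here is a hypothetical example. The function `count_kmers` and the lowercase-handling bug are invented for this sketch and are not taken from khmer. The test is written when the bug is found, fails against the buggy code, and passes once the one-line fix is in.

```python
from collections import Counter


def count_kmers(read, k):
    """Count k-mers in a read, treating lowercase and uppercase bases alike."""
    read = read.upper()  # the fix: before this line, 'acgt' and 'ACGT' were counted apart
    return Counter(read[i:i + k] for i in range(len(read) - k + 1))


def test_lowercase_reads_are_counted_with_uppercase():
    # Regression test: written after the bug was found and before it was fixed.
    assert count_kmers("acgtACGT", 4)["ACGT"] == 2


test_lowercase_reads_are_counted_with_uppercase()
```

Once the test is in the suite, continuous integration catches any later change that reintroduces the bug.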

Integration testing

• khmer is designed to work with other packages.

• For releases >= 1.0, we have added acceptance tests to make sure that khmer works OK with other packages.

• These acceptance tests are based on integration tests, which in turn come from an education & documentation effort…

khmer-protocols

khmer-protocols:

• Provide standard “cheap” assembly protocols for the cloud.

• Entirely copy/paste; ~2-6 days from raw reads to assembly, annotations, and differential expression analysis. ~$150 per data set (on Amazon rental computers)

• Open, versioned, forkable, citable….

Literate testing

• Our shell-command tutorials for bioinformatics can now be executed in an automated fashion – commands are extracted automatically into shell scripts.

• See: github.com/ged-lab/literate-resting/.

• Tremendously improves peace of mind and confidence moving forward!
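The extraction step can be sketched roughly as follows. This is not the code in the repository above; it assumes the tutorials mark commands as indented blocks after a line ending in `::` (the reStructuredText literal-block convention), and the function name and sample tutorial text are invented for the example.

```python
def extract_commands(text):
    """Collect indented lines that follow a line ending in '::'."""
    commands = []
    in_block = False
    for line in text.splitlines():
        if in_block:
            if line.startswith((" ", "\t")):
                commands.append(line.strip())
                continue
            if not line.strip():
                continue  # blank lines do not end a literal block
            in_block = False  # first unindented text line closes the block
        if line.rstrip().endswith("::"):
            in_block = True
    return commands


# Hypothetical tutorial text, written as it would appear in a docs file.
TUTORIAL = (
    "First, make a working directory::\n"
    "\n"
    "    mkdir work\n"
    "\n"
    "Then move into it::\n"
    "\n"
    "    cd work\n"
    "\n"
    "That is all.\n"
)

# 'set -e' makes the generated script stop at the first failing command.
script = "\n".join(["set -e"] + extract_commands(TUTORIAL))
print(script)
```

Running the generated script on a schedule turns the tutorial itself into a test: if a documented command stops working, the run fails.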

Leigh Sheneman

Doing things right => #awesomesauce

Benchmarking protocols

Data subset; AWS m1.xlarge

~1 hour

(See PyCon 2014 talk; video and blog post.)

Benchmarking protocols

Complete data; AWS m1.xlarge

~40 hours

(See PyCon 2014 talk; video and blog post.)

Current research

Genomic intervals shared between data sets

Qingpeng Zhang

* Assembly free!

Error correction via graph alignment

Jason Pell and Jordan Fish

Error correction on simulated E. coli data

1% error rate, 100x coverage.

Jordan Fish and Jason Pell

           TP (corrected)       FP (mistakes)   TN (OK)        FN (missed)
ideal      3,469,834 (99.1%)    8,186           460,655,449    31,731 (0.9%)
1-pass     2,827,839 (80.8%)    30,254          460,633,381    673,726 (19.2%)
1.2-pass   3,403,171 (97.2%)    8,764           460,654,871    98,394 (2.8%)
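The percentages in the table are consistent with TP / (TP + FN), i.e. the fraction of true errors that were corrected. That reading is inferred from the numbers rather than stated on the slide; the short check below recomputes it from the counts.

```python
results = {
    "ideal":    {"TP": 3_469_834, "FP": 8_186,  "TN": 460_655_449, "FN": 31_731},
    "1-pass":   {"TP": 2_827_839, "FP": 30_254, "TN": 460_633_381, "FN": 673_726},
    "1.2-pass": {"TP": 3_403_171, "FP": 8_764,  "TN": 460_654_871, "FN": 98_394},
}


def sensitivity(row):
    """Fraction of true errors that were corrected: TP / (TP + FN)."""
    return row["TP"] / (row["TP"] + row["FN"])


for name, row in results.items():
    print(f"{name}: {sensitivity(row):.1%} corrected, {1 - sensitivity(row):.1%} missed")
```

In every row TP + FN comes to 3,501,565, the total number of errors in the simulated data.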

Single pass, reference free, tunable, streaming online variant calling.

Streaming, online variant calling.

See NIH BIG DATA grant, http://ged.msu.edu/.

Novelty… to what power?

• “Novelty” requirements for “high impact publishing”:
  o Must do novel algorithm development
  o …and apply to novel and interesting data sets.
  o (See Josh Bloom, https://medium.com/tech-talk/dd88857f662)

• We’ve taken on the additional challenge of trying to develop and maintain a core set of functionality in research software: novelty cubed? :)




Reproducibility

Scientific progress relies on reproducibility of analysis. (Aristotle, Nature, 322 BCE.)

All our papers now have:

• Source hosted on github;
• Data hosted there or on AWS;
• Long running data analysis => ‘make’
• Graphing and data digestion => IPython Notebook (also in github)

Qingpeng Zhang

Concluding thoughts

• API is destiny – without online counting, diginorm & streaming approaches would not have been possible.

• Tackle the hard problems – engineering optimization would not have gotten us very far.

• Testing lets us scale development & process – which means when something works, we can run with it.

Caveats

• Expense and effort – you can spend an infinite amount of time on infrastructure & process!
  o Advice: choose techniques that address actual pain points.
  o (See: “Best Practices in Scientific Computing”, Wilson et al., 2014)

• Funders and reviewers just don’t care – adopt good software practices for yourself, not others.
  o Advice: briefly mention keywords in grants, papers.

• Advisors just don’t care – see above.
  o These are 90% true statements :>

Can we crowdsource bioinformatics?

We already are! Bioinformatics is already a tremendously open and collaborative endeavor. (Let’s take advantage of it!)

“It’s as if somewhere, out there, is a collection of totally free software that can do a far better job than ours can, with open, published methods, great support networks and fantastic tutorials. But that’s madness – who on Earth would create such an amazing resource?”

- http://thescienceweb.wordpress.com/2014/02/21/bioinformatics-software-companies-have-no-clue-why-no-one-buys-their-products/

Thanks!

Prospective: sequencing tumor cells

• Goal: phylogenetically reconstruct causal “driver mutations” in face of passenger mutations.

• 1000 cells x 3 Gbp x 20 coverage: 60 Tbp of sequence.

• Most of this data will be redundant and not useful.

• Developing diginorm-based algorithms to eliminate data while retaining variant information.

Where are we taking this?

• Streaming online algorithms only look at data ~once.

• Diginorm is streaming, online…

• Conceptually, can move many aspects of sequence analysis into streaming mode.

=> Extraordinary potential for computational efficiency.
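To make "streaming, online" concrete, here is a minimal sketch of digital normalization as described in the Brown et al. normalization paper cited earlier: estimate each read's coverage as the median count of its k-mers seen so far, keep the read only if that estimate is below a cutoff, and update the counts only for reads that are kept. The sketch uses an exact `Counter` where a fixed-memory counter would be used at scale; the function name and default parameters are choices made for the example.

```python
from collections import Counter
from statistics import median


def diginorm(reads, k=10, cutoff=5):
    """Yield reads whose estimated coverage so far is below `cutoff`."""
    counts = Counter()
    for read in reads:  # each read is examined exactly once
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not kmers:
            continue  # read shorter than k
        if median(counts[km] for km in kmers) < cutoff:
            counts.update(kmers)  # only kept reads contribute to the counts
            yield read


# 50 identical copies of one read stand in for highly redundant data.
reads = ["CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC"] * 50
kept = list(diginorm(reads))
print(len(kept), "of", len(reads), "reads kept")
```

Because the decision for each read depends only on what has been seen so far, the input never needs to be held in memory or read twice.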
