2014 nicta-reproducibility

Openness and reproducibility in computational science: tools, approaches, and thought patterns. C. Titus Brown [email protected] October 16, 2014

Upload: ctitusbrown

Post on 27-Nov-2014

DESCRIPTION

talk at NICTA on reproducibility

TRANSCRIPT

Page 1: 2014 nicta-reproducibility

Openness and reproducibility in computational science: tools, approaches, and thought patterns.

C. Titus Brown
[email protected]

October 16, 2014

Page 2: 2014 nicta-reproducibility

Hello!

Assistant Professor @ MSU; Microbiology; Computer Science; etc.
=> UC Davis VetMed in 2015.

More information at:

• ged.msu.edu/

• github.com/ged-lab/

• ivory.idyll.org/blog/

• @ctitusbrown

Page 3: 2014 nicta-reproducibility

The challenges of non-model sequencing

• Missing or low quality genome reference.

• Evolutionarily distant.

• Most extant computational tools focus on model organisms:
  o Assume low polymorphism (internal variation)
  o Assume reference genome
  o Assume somewhat reliable functional annotation
  o More significant compute infrastructure

…and cannot easily or directly be used on critters of interest.

Page 4: 2014 nicta-reproducibility

Shotgun sequencing & assembly

http://eofdreams.com/library.html
http://www.theshreddingservices.com/2011/11/paper-shredding-services-small-business/
http://schoolworkhelper.net/charles-dickens%E2%80%99-tale-of-two-cities-summary-analysis/

Page 5: 2014 nicta-reproducibility

Shotgun sequencing analysis goals:

• Assembly (what is the text?)
  o Produces new genomes & transcriptomes.
  o Gene discovery for enzymes, drug targets, etc.

• Counting (how many copies of each book?)
  o Measure gene expression levels, protein-DNA interactions

• Variant calling (how does each edition vary?)
  o Discover genetic variation: genotyping, linkage studies…
  o Allele-specific expression analysis.

Page 6: 2014 nicta-reproducibility

Assembly

It was the best of times, it was the wor
, it was the worst of times, it was the
isdom, it was the age of foolishness
mes, it was the age of wisdom, it was th

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness

…but for lots and lots of fragments!
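The shredded-book picture can be made concrete with a toy greedy overlap merger. This is only a sketch for illustration: the fragment boundaries and the `min_len` parameter are assumptions, and real assemblers (Velvet, IDBA, SPAdes) use graph-based methods rather than this pairwise approach.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    frags = list(fragments)
    while len(frags) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing left to join; the contigs stay separate
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for n, f in enumerate(frags) if n not in (best_i, best_j)]
        frags.append(merged)
    return frags


# Illustrative fragments, deliberately given out of order.
reads = [
    "mes, it was the age of wisdom, it was th",
    "It was the best of times, it was the wor",
    "isdom, it was the age of foolishness",
    ", it was the worst of times, it was the",
]
contigs = greedy_assemble(reads)
print(contigs[0])
```

Note the repeated phrase "it was the": with shorter fragments the greedy choice could join the wrong pieces, which is one reason repeats make real assembly hard.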

Page 7: 2014 nicta-reproducibility

Shared low-level fragments may not reach the threshold for assembly.

Lamprey mRNAseq:

Pooling all your data is important

Page 8: 2014 nicta-reproducibility

Conway TC, Bromage AJ. Bioinformatics 2011;27:479-486

Assembly graphs scale with data size, not information.

Page 9: 2014 nicta-reproducibility

Practical memory measurements (soil)

Velvet measurements (Adina Howe)

Page 10: 2014 nicta-reproducibility

Data set size and cost

• $1000 gets you ~200m “reads”, or about 20-80 GB of data, in about a week.

• > 1000 labs doing this regularly.

• Each data set analysis is ~custom.

• Analyses are data intensive and memory intensive.

Page 11: 2014 nicta-reproducibility

Efficient data structures & algorithms

Page 12: 2014 nicta-reproducibility

Shotgun sequencing is massively redundant; can we eliminate redundancy while retaining information?

Analog: JPEG lossy compression
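One way to read "eliminate redundancy while retaining information" is the digital normalization idea from the Brown et al. preprint cited later in the talk: keep a read only if the k-mers in it have not already been seen often. A minimal sketch follows, assuming exact counting with a `Counter` (khmer itself uses a memory-efficient probabilistic counter) and toy values for `k` and `cutoff`.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize(reads, k=4, cutoff=2):
    """Keep a read only while its median k-mer count is below `cutoff`."""
    counts = Counter()
    kept = []
    for read in reads:
        words = kmers(read, k)
        if not words:
            continue  # read is shorter than k
        if median(counts[w] for w in words) < cutoff:
            kept.append(read)
            counts.update(words)  # only kept reads contribute to the counts
    return kept


# Five copies of one read plus one unrelated read (made-up sequences).
reads = ["ACGTTGCAAG"] * 5 + ["TTTTGGGGCC"]
kept = normalize(reads, k=4, cutoff=2)
print(len(reads), "reads in,", len(kept), "reads kept")
```

Like JPEG, this is lossy: redundant copies are discarded, but every distinct sequence is still represented in the output.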

Page 13: 2014 nicta-reproducibility

Sparse collections of k-mers can be stored efficiently in Bloom filters

Pell et al., 2012, PNAS; doi: 10.1073/pnas.1121464109
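For readers who have not met a Bloom filter, here is a minimal sketch of storing k-mers in one. The sizes, the SHA-256-based hashing, and the example sequence are all assumptions made for illustration; this is not khmer's implementation.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array; membership tests may give false positives,
    but never false negatives."""

    def __init__(self, n_bits=1024, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        # Derive n_hashes bit positions by salting one hash function.
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


bf = BloomFilter(n_bits=1024, n_hashes=3)
sequence = "ATGGCGTACGTTAGC"  # made-up sequence
for kmer in kmers(sequence, 5):
    bf.add(kmer)
print(all(kmer in bf for kmer in kmers(sequence, 5)))
```

The memory used is fixed by `n_bits` (128 bytes here) no matter how many k-mers are added; the price is a false-positive rate that rises as the filter fills.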

Page 14: 2014 nicta-reproducibility

Data structures & algorithms papers

• “These are not the k-mers you are looking for…”, Zhang et al., PLoS One, 2014.

• “Scaling metagenome sequence assembly with probabilistic de Bruijn graphs”, Pell et al., PNAS 2012.

• “A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data”, Brown et al., arXiv 1203.4802.

Page 15: 2014 nicta-reproducibility

Data analysis papers

• “Tackling soil diversity with the assembly of large, complex metagenomes”, Howe et al., PNAS, 2014.

• Assembling novel ascidian genomes & transcriptomes, Stolfi et al. (eLife 2014), Lowe et al. (in prep)

• A de novo lamprey transcriptome from large scale multi-tissue mRNAseq, Scott et al., in prep.

Page 16: 2014 nicta-reproducibility

Lab approach: not intentional, but working out.

Page 17: 2014 nicta-reproducibility

This leads to good things.

(khmer software)

Page 18: 2014 nicta-reproducibility

Current research

(khmer software)

Page 19: 2014 nicta-reproducibility

Testing & version control: the not so secret sauce

• High test coverage - grown over time.

• Stupidity driven testing – we write tests for bugs after we find them and before we fix them.

• Pull requests & continuous integration – does your proposed merge break tests?

• Pull requests & code review: does new code meet our minimal coding etc. requirements?
  o Note: spellchecking!!!
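As a hedged illustration of "stupidity driven testing", the sketch below shows the shape of such a regression test. The function and the lowercase-input bug are hypothetical, invented for this example rather than taken from khmer.

```python
# Hypothetical scenario: reverse_complement() once mangled lowercase input.
# The test below was (in this story) written when the bug was found and
# before the fix, so it failed first and now guards against a regression.

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse-complement a DNA string, preserving case."""
    return seq.translate(_COMPLEMENT)[::-1]


def test_reverse_complement_uppercase():
    assert reverse_complement("AACG") == "CGTT"


def test_reverse_complement_lowercase_regression():
    # The input that exposed the hypothetical bug.
    assert reverse_complement("aacg") == "cgtt"
    assert reverse_complement("AaCg") == "cGtT"
```

Run under a test runner such as pytest, these functions are picked up by name; wired into continuous integration, they answer the "does your proposed merge break tests?" question automatically.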

Page 20: 2014 nicta-reproducibility

Our “novel research” enables this:

• Novel data structures and algorithms;

• Permit low(er) memory data analysis;

• Liberate analyses from specialized hardware.

Page 21: 2014 nicta-reproducibility

Running entirely w/in cloud

Complete data; AWS m1.xlarge

~40 hours

(See PyCon 2014 talk; video and blog post.)

MEMORY

Page 22: 2014 nicta-reproducibility

On the “novel research” side:

• Novel data structures and algorithms;

• Permit low(er) memory data analysis;

• Liberate analyses from specialized hardware.

This last bit? => reproducibility.

Page 23: 2014 nicta-reproducibility

Reproducibility!

Scientific progress relies on reproducibility of analysis. (Aristotle, Nature, 322 BCE.)

“There is no such thing as ‘reproducible science’. There is only ‘science’, and ‘not science.’” - someone on Twitter (Fernando Perez?)

Page 24: 2014 nicta-reproducibility

Disclaimer

Not a researcher of reproducibility!

Merely a practitioner.

Please take my points below as an argument and not as research conclusions.

(But I’m right.)

Page 25: 2014 nicta-reproducibility

Replication vs reproducibility

• I will not clearly distinguish.

• There are important differences.
  o Replication: someone using same data, same tools => same results.
  o Reproduction: someone using different data and/or different tools => same result.

• The former is much easier.

• The latter is much stronger.

• Science is failing even mere replication!?

• So, mostly I will talk about how we make our analyses replicable.

Page 26: 2014 nicta-reproducibility

My usual intro: We practice open science!

Everything discussed here:

• Code: github.com/ged-lab/ ; BSD license

• Blog: http://ivory.idyll.org/blog (‘titus brown blog’)

• Twitter: @ctitusbrown

• Grants on Lab Web site: http://ged.msu.edu/research.html

• Preprints available.

Everything is > 80% reproducible.

Page 28: 2014 nicta-reproducibility

My lab & the diginorm paper.

• All our code was already on github;

• Much of our data analysis was already in the cloud;

• Our figures were already made in IPython Notebook

• Our paper was already in LaTeX

Page 29: 2014 nicta-reproducibility

IPython Notebook: data + code =>

Page 30: 2014 nicta-reproducibility

My lab & the diginorm paper.

• All our code was already on github;

• Much of our data analysis was already in the cloud;

• Our figures were already made in IPython Notebook

• Our paper was already in LaTeX

…why not push a bit more and make it easily reproducible?

This involved writing a tutorial. And that’s it.

Page 31: 2014 nicta-reproducibility

To reproduce our paper:

git clone <khmer> && python setup.py install
git clone <pipeline>
cd pipeline
wget <data> && tar xzf <data>
make && cd ../notebook && make
cd ../ && make

Page 32: 2014 nicta-reproducibility

Now standard in lab

Our papers now have:

• Source hosted on github;

• Data hosted there or on AWS;

• Long running data analysis => ‘make’

• Graphing and data digestion => IPython Notebook (also in github)

Qingpeng Zhang

Page 33: 2014 nicta-reproducibility

Research process

Page 34: 2014 nicta-reproducibility

Literate graphing & interactive exploration

Page 35: 2014 nicta-reproducibility

The process

• We start with pipeline reproducibility

• Baked into lab culture; default “use git; write scripts”

Community of practice!

• Use standard open source approaches, so OSS developers learn it easily.

• Enables easy collaboration w/in lab

• Valuable learning tool!

Page 36: 2014 nicta-reproducibility

Growing & refining the process

• Now moving to Ubuntu Long-Term Support + install instructions.

• Everything is as automated as is convenient.

• Students expected to communicate with me in IPython Notebooks.

• Trying to avoid building (or even using) new repro tools.

• Avoid maintenance burden as much as possible.

Page 37: 2014 nicta-reproducibility

1. Use standard OS; provide install instructions

• Providing install and execute instructions for Ubuntu Long-Term Support release 14.04: supported through 2017 and beyond.

• Avoid pre-configured virtual machines! They:
  o Lock you into specific cloud homes.
  o Challenge remixability and extensibility.

Page 38: 2014 nicta-reproducibility

2. Automate

• Literate graphing now easy with knitr and IPython Notebook.

• Build automation with make, or whatever. To first order, it does not matter what tools you use.

• Explicit is better than implicit. Make it easy to understand what you’re doing and how to extend it.
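To show how little machinery "build automation with make, or whatever" needs, here is a sketch of make's core rule in plain Python: rerun a step only when its output is missing or older than its inputs. The file names and the line-counting step are hypothetical stand-ins for a real analysis; a Makefile expresses the same idea declaratively.

```python
import os
import tempfile
import time


def is_stale(target, sources):
    """True if `target` is missing or older than any of its `sources`."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_mtime for src in sources)


def build(target, sources, action):
    """Run `action` only when `target` is out of date; report whether it ran."""
    if is_stale(target, sources):
        action()
        return True
    return False


with tempfile.TemporaryDirectory() as workdir:
    raw = os.path.join(workdir, "reads.txt")      # hypothetical input file
    result = os.path.join(workdir, "counts.txt")  # hypothetical output file
    with open(raw, "w") as fh:
        fh.write("ACGT\nACGA\n")

    def count_lines():
        # Stand-in for a long-running analysis step.
        with open(raw) as src, open(result, "w") as out:
            out.write(f"{sum(1 for _ in src)}\n")

    first = build(result, [raw], count_lines)    # output missing: step runs
    second = build(result, [raw], count_lines)   # output up to date: skipped
    future = time.time() + 60
    os.utime(raw, (future, future))              # pretend the input changed
    third = build(result, [raw], count_lines)    # output stale: step reruns
    with open(result) as fh:
        n_lines = fh.read().strip()

print(first, second, third, n_lines)
```

The explicit list of sources for each target doubles as documentation of what depends on what, which is the "explicit is better than implicit" point.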

Page 39: 2014 nicta-reproducibility

k-mer counting paper

(Ubuntu 14.04, git, make, IPython Notebook, latex)

Page 40: 2014 nicta-reproducibility

Time from publication of KAnalyze to our 100% reproducible re-evaluation?

~8 hours.

Page 41: 2014 nicta-reproducibility

3. Protocols, not pipelines.

STOP HIDING THE ANALYSIS STEPS.

Page 42: 2014 nicta-reproducibility

Write down what you’re doing…

https://khmer-protocols.readthedocs.org/

Page 43: 2014 nicta-reproducibility

…and add automated end-to-end tests.

c.f. “literate ReSTing”

Page 44: 2014 nicta-reproducibility

4. Drive sustainable software development with use cases.

Page 45: 2014 nicta-reproducibility

…that are explicit…

Page 46: 2014 nicta-reproducibility

…versioned…

Page 47: 2014 nicta-reproducibility

…and automated.

Page 48: 2014 nicta-reproducibility

5. Invest in automated, reproducible workflows

Genome Reference

          Quality Filtered   Diginorm   Partition   Reinflation
Velvet    -                  80.90      83.64       84.57
IDBA      90.96              91.38      90.52       88.80
SPAdes    90.42              90.35      89.57       90.02

Mis-assembled Contig Length

          Quality Filtered   Diginorm   Partition   Reinflation
Velvet    -                  52071358   44730449    45381867
IDBA      21777032           20807513   17159671    18684159
SPAdes    28238787           21506019   14247392    18851571

Kalamazoo metagenome protocol run on mock data from Shakya et al., 2013

Also! Tip o’ the hat to Michael Barton, nucleotid.es

Page 49: 2014 nicta-reproducibility

Automation enables super fun paper reviews!

• “What a nice new transcriptome assembler! Interesting how it doesn’t perform that well on my 10 test data sets.”

• “Hey, so you make these claims, but I ran your code, and…”

• “Fun fact! Your source code has a syntax error in it – even Perl has standards! You’re still sure that’s the script you used?”

• “Here, use our evaluation pipeline, since you clearly need something better.”

The Brown Lab: taking passive aggression to a whole new level!

Page 50: 2014 nicta-reproducibility

Myths of reproducible research

(Opinions from personal experience.)

Page 51: 2014 nicta-reproducibility

Myth 1: Partial reproducibility is hard.

“Here’s my script.” => Methods

More generally,

• Many scientists cannot replicate any part of their analysis without a lot of manual work.

• Automating this is a win for reasons that have nothing to do with reproducibility… efficiency!

See: Software Carpentry.

Page 52: 2014 nicta-reproducibility

Myth 2: Incomplete reproducibility is useless

Paraphrase: “We can’t possibly reproduce the experimental data exactly, so we shouldn’t bother with anything else, either.”

(Analogous arg re software testing & code coverage.)

• …I really have a hard time arguing the paraphrase honestly…

• Being able to reanalyze your raw data? Interesting.

• Knowing how you made your figures? Really useful.

Page 53: 2014 nicta-reproducibility

Myth 3: We need new platforms

• Techies always want to build something (which is fun!) but don’t want to do science (which is hard!)

• We probably do need new platforms, but stop thinking that building them does a service.

• Platforms need to be use driven. Seriously.

• If you write good software for scientific inquiry and make it easy to use reproducibly, that will drive virtuousity.

Page 54: 2014 nicta-reproducibility

Myth 4. Virtual Machine reproducibility is an end solution.

• Good start! Better than nothing!

But:

• Limits understanding & reuse.

• Limits remixing: often cannot install other software!

• “Chinese Room” argument: could be just a lookup table.

…what about Docker?

Page 55: 2014 nicta-reproducibility

Myth 5: We can use GUIs for reproducible research

(OK, this is partly just to make people think ;)

• Almost all data analysis takes place within a larger pipeline; the GUI must consume the entire pipeline in order to be reproducible.

• IFF GUI wraps command line, that’s a decent compromise (e.g. Galaxy) but handicaps researchers using novel approaches.

• By the time it’s in a GUI, it’s no longer research. But it can be useful for research…

Page 56: 2014 nicta-reproducibility

Our current efforts?

• Semantic versioning of our own code: stable command-line interface.

• Writing easy-to-teach tutorials and protocols for common analysis pipelines.

• Automate ‘em for testing purposes.

• Encourage their use, inclusion, and adaptation by others.

Page 57: 2014 nicta-reproducibility

Literate testing

• Our shell-command tutorials for bioinformatics can now be executed in an automated fashion: commands are extracted automatically into shell scripts.

• See: github.com/ged-lab/literate-resting/.

• Tremendously improves peace of mind and confidence moving forward!
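The slides do not spell out how literate-resting marks commands, so the sketch below assumes a simple hypothetical convention: every indented line in a reStructuredText literal block (a paragraph ending in `::`) is a shell command. It illustrates the extraction idea only and is not the tool's actual behaviour.

```python
def extract_commands(rst_text):
    """Collect indented lines that follow a paragraph ending in '::'."""
    commands = []
    in_block = False
    for line in rst_text.splitlines():
        stripped = line.strip()
        if in_block and stripped:
            if line[0] in " \t":
                commands.append(stripped)
                continue
            in_block = False  # unindented text closes the literal block
        if not in_block and stripped.endswith("::"):
            in_block = True
    return commands


def to_shell_script(commands):
    """Join commands into a script that stops at the first failure."""
    return "set -e\n" + "\n".join(commands) + "\n"


# A made-up two-step tutorial.
TUTORIAL = """\
Make a working directory::

   mkdir -p work
   cd work

Then check that it is empty::

   ls -a
"""

commands = extract_commands(TUTORIAL)
script = to_shell_script(commands)
print(script)
```

Because the script is generated from the tutorial text itself, a failing run means the documentation is out of date, which is what turns a tutorial into an end-to-end test.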

Leigh Sheneman

Page 58: 2014 nicta-reproducibility

Doing things right => #awesomesauce

Page 59: 2014 nicta-reproducibility

What bits should people adopt?

• Version control!

• Literate graphing - IPython Notebook/knitr!

• Automated “build” from data => results!

• Make data available as early in your pipeline as possible.

Page 60: 2014 nicta-reproducibility

Our approaches

• We are not doing anything particularly neat on the computational side... No “magic sauce.”

• Much of our effort is now driven by sheer utility:
  o Automation reduces our maintenance burden.
  o Extensibility makes revisions much easier!
  o Explicit instructions are great for training.

• Some effort needed at the beginning, but once practices are established, “virtuous cycle” takes over.

Page 61: 2014 nicta-reproducibility

New science vs reproducibility

• Nobody would care that we were doing things reproducibly if our science wasn’t decent.

• Make sure students realize that faffing about on infrastructure isn’t science.

• Research is about doing science. Reproducibility (like other good practices) is much easier to proselytize if you can link it to progress in science.

Page 62: 2014 nicta-reproducibility

Is there a reproducibility crisis?

• Mina Bissell: maybe, but science is hard and we should not overly focus on replicating published results vs doing new research.

Bissell, 2013.

• “But we can’t even get the software in the first place!”

Collberg et al., 2014.

Computational science should be the easiest thing to replicate… but it’s not!?

Page 63: 2014 nicta-reproducibility

“Replication debt”

• Can we borrow the idea of “technical debt” from software engineering?

• Semi-independent replication after initial exploratory phase, followed by articulation of protocols and independent replication.

Image from blog.crisp.se

Page 64: 2014 nicta-reproducibility

“Replication debt”

• Semi-independent replication after initial exploratory phase, followed by articulation of protocols and independent replication.

• Public acknowledgement of debt is important.

Image from blog.crisp.se

Page 65: 2014 nicta-reproducibility

Biology & sequence analysis is in a perfect place for reproducibility

We are lucky! A good opportunity!

• Big Data: laptops are too small;

• Excel doesn’t scale any more;

• Few tools in use; most of them are $$ or UNIX;

• Little in the way of entrenched research practice;

Page 66: 2014 nicta-reproducibility

Thanks!

Talk will soon be on slideshare: slideshare.net/c.titus.brown

E-mail or tweet me: [email protected], @ctitusbrown

Talk at ANU, 3:30pm today