reproducibility in scientific data analysis - bioscience seminar
Reproducibility in Scientific Data Analysis
Samuel Lampa @smllmp
PhD Student, Pharmaceutical Bioinformatics at pharmb.io
with Assoc. Prof. Ola Spjuth @ola_spjuth
Dept. of Pharm. Biosci. / Uppsala University
Farmbio BioScience Seminar – Dec 16 2016
Structure of this talk
Reproducibility in Scientific Data Analysis …
● What is it?
● Why is it important?
● Why is it a problem?
● What can we do about it?
● What does pharmb.io do about it?
What is it?
“it” = reproducibility in scientific data analysis
reproducible ≠ replicable
reproducible ≠ correct
Why is it important?
● Data generation is increasingly automated → more and more focus on data analysis
● Culture of replicability not (yet) as established in computational as in classical disciplines
● “it is the only thing that an investigator can guarantee about a study”
simplystatistics.org/2014/06/06/the-real-reason-reproducible-research-is-important
Why is it a problem?
wet lab data analysis?
● Complexity of computing environment
– Software versions, Data versions ...
● More black box components
● Assumptions on computing environment often left out
● Manual steps often left out
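One small way to make assumptions about the computing environment explicit is to write them down automatically. The following is a minimal sketch, not something shown in the talk: the function name `environment_snapshot` and the package names are placeholders chosen for illustration.

```python
import json
import platform
import sys
from importlib import metadata


def environment_snapshot(packages):
    """Return a dict describing the interpreter, OS, and package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }


if __name__ == "__main__":
    # "numpy" and "pandas" are placeholder package names (an assumption).
    print(json.dumps(environment_snapshot(["numpy", "pandas"]), indent=2))
```

Saving this output next to the results gives a later reader at least the software versions that were in use, even if it does not capture the whole environment.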
What can we do about it?
Utopia: Infrastructure for all data and computations to be inspected and re-run with other data and parameters by anyone
But: We can’t wait for that
In the meantime: Even small steps towards reproducibility will help. Start today!
General themes
Know exactly what data and results mean
Know exactly how results were obtained
Be able to get same result independently
More concretely ...
Know exactly what data and results mean
– Open standards, Ontologies, Data formats
Know exactly how results were obtained
– Keeping track of manual steps, parameters, versions of software and data ...
– Version control
– Automation (scripts)
Be able to get same result independently
– Code, data, and scripts … make it all available!
Sandve GK, Nekrutenko A, Taylor J, Hovig E. Ten Simple Rules for Reproducible Computational Research. PLoS Comput Biol. 2013;9(10):1-4. dx.doi.org/10.1371/journal.pcbi.1003285
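Keeping track of parameters and data versions can be automated with a few lines of script. The sketch below is an illustration under stated assumptions, not the speaker's tooling: the helper names `sha256_of_file` and `write_run_record`, the record layout, and the `code_version` string (for example a version-control commit ID) are all hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_run_record(record_path, input_paths, parameters, code_version):
    """Write a JSON record of what went into one analysis run."""
    record = {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "code_version": code_version,  # e.g. a commit ID (assumption)
        "parameters": parameters,
        "inputs": {path: sha256_of_file(path) for path in input_paths},
    }
    with open(record_path, "w") as handle:
        json.dump(record, handle, indent=2, sort_keys=True)
    return record
```

The checksum identifies exactly which version of each input file was used, so a changed data file shows up as a changed digest in the record.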
FAIR Principles for data and metadata
F - Findable
A - Accessible
I - Interoperable
R - Reusable
Wilkinson MD, Dumontier M, Aalbersberg IjJ, et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data. 2016;3:160018. doi:10.1038/sdata.2016.18.
What does pharmb.io do about it?
● Open data, open source, open standards
– Promoting and using as much as possible
● BioImg.org
– Store Virtual Machines & Containers
● Semantic Data Technologies
– Machine readability - Avoiding ambiguity
● Re-runnable computational experiments
– Via workflows, containers, infrastructure as code
O’Boyle NM, Guha R, Willighagen EL, et al. Open Data, Open Source and Open Standards in chemistry: The Blue Obelisk five years on. J Cheminform. 2011;3(10):1-16. doi:10.1186/1758-2946-3-37
BioImg.org
Dahlö M, Haziza F, Kallio A, Korpelainen E, Bongcam-Rudloff E, Spjuth O. BioImg.org: A catalog of virtual machine images for the life sciences. Bioinform Biol Insights. 2015;9:125-128. doi:10.4137/BBI.S28636.
Martin Dahlö
Semantic Data Technologies
Lampa S, Willighagen E, Kohonen P, King A, Vrandečić D, Grafström R, Spjuth O. RDFIO: Extending Semantic MediaWiki for interoperable biomedical data management. J Biomed Sem. Submitted.
Re-runnable experiments via containers
(and infrastructure as code)
Marco Capuccini
github.com/kubenow/KubeNow
github.com/mcapuccini/SparkNow
Re-runnable experiments via workflows
Lampa S, Alvarsson J, Spjuth O. Towards agile large-scale predictive modelling in drug discovery with flow-based programming design principles. J Cheminform. 2016;8(1):67. doi:10.1186/s13321-016-0179-6.
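To illustrate what "re-runnable via workflows" can mean in practice, here is a minimal sketch of a workflow step that declares its output and its upstream steps, and is skipped when its output already exists. It is a generic toy example, not the workflow system described in the cited paper: the `Task` class and the two example actions are hypothetical names.

```python
import os


class Task:
    """One workflow step: produces `output` from the outputs of `deps`."""

    def __init__(self, output, action, deps=()):
        self.output = output
        self.action = action  # callable(input_paths, output_path)
        self.deps = list(deps)

    def run(self):
        """Run upstream tasks first, then this one unless its output exists."""
        for dep in self.deps:
            dep.run()
        if os.path.exists(self.output):
            return False  # already done; nothing to recompute
        self.action([dep.output for dep in self.deps], self.output)
        return True


def write_numbers(_inputs, output):
    """Example action: create a small input file."""
    with open(output, "w") as handle:
        handle.write("1\n2\n3\n")


def sum_numbers(inputs, output):
    """Example action: sum the integers in the first input file."""
    with open(inputs[0]) as source:
        total = sum(int(line) for line in source if line.strip())
    with open(output, "w") as handle:
        handle.write(f"{total}\n")
```

With `raw = Task("numbers.txt", write_numbers)` and `total = Task("total.txt", sum_numbers, deps=[raw])`, calling `total.run()` executes both steps in order; calling it again does nothing because both outputs are already present. Because every step is code, no manual step can be silently left out.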