2017-11-03 provenance and research object
Partners Funding
bioexcel.eu
Provenance and Research Object
Stian Soiland-Reyes
eScience Lab, The University of Manchester
2017-11-03, Aix-en-Provence
CESAB workshop: Reproducible Workflows
orcid.org/0000-0001-9842-9718 @soilandreyes
This work is licensed under a Creative Commons Attribution 4.0 International License.
https://view.commonwl.org/
http://doi.org/10.7490/f1000research.1114375.1
Copyright © 2013 W3C® (MIT, ERCIM, Keio, Beihang), All Rights Reserved.
http://www.w3.org/TR/prov-overview/
Core PROV model
Entity – A “thing” in the world
  Document, Excel file, database row, molecule, LEGO structure, house, …
Activity – Something that happened
  Usually defined start/end time
  May use and generate entities
Agent – Someone/something participating in activities
  Person, SoftwareAgent, Organization
Key principles:
  Provenance statements point backwards in time
  Any PROV document is one particular view on history
  More than one entity can describe same “thing”
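The three core PROV types and the "pointing backwards in time" principle can be sketched in a few lines of plain Python. This is only an illustration: the class names mirror the slide, but the `ex:` identifiers, the `Statement` class and the `to_provn` helper are assumptions made for this example, not part of any PROV library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entity:
    """A "thing" in the world, e.g. a document or a sample."""
    id: str


@dataclass(frozen=True)
class Activity:
    """Something that happened, usually with a start and end time."""
    id: str
    started: Optional[str] = None
    ended: Optional[str] = None


@dataclass(frozen=True)
class Agent:
    """Someone or something participating in activities."""
    id: str
    kind: str = "Person"  # or "SoftwareAgent", "Organization"


@dataclass(frozen=True)
class Statement:
    """One provenance statement; it points from the later thing to the earlier one."""
    relation: str
    subject: str
    object: str


def to_provn(statement):
    """Render a statement in a PROV-N-like functional notation."""
    return f"{statement.relation}({statement.subject}, {statement.object})"


# Hypothetical identifiers, chosen only for illustration.
sample = Entity("ex:sample")
metagenome = Entity("ex:metagenome")
sequencing = Activity("ex:sequencing")
alice = Agent("ex:alice")

statements = [
    Statement("used", sequencing.id, sample.id),
    Statement("wasGeneratedBy", metagenome.id, sequencing.id),
    Statement("wasAssociatedWith", sequencing.id, alice.id),
]

for s in statements:
    print(to_provn(s))
```

Each statement reads from the later item back to the earlier one: the metagenome points back to the sequencing activity that generated it, which in turn points back to the sample it used.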
Attribution
Who collected this sample? Who helped?
Which lab performed the sequencing?
Who did the data analysis?
Who wrote the analysis workflow?
Who made the data set used by analysis?
Who curated the results?
[Diagram: Data, Alice, The lab; relations wasAttributedTo, actedOnBehalfOf]
Why do I need this?
i. To be recognized for my work
ii. Who should I give credit to?
iii. Who should I complain to?
iv. Can I trust them?
v. Who should I make friends with?
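The "who should I give credit to?" question can be answered by following attribution and then delegation. The sketch below assumes one reading of the slide's diagram labels (Data wasAttributedTo Alice, Alice actedOnBehalfOf The lab); the `ex:` identifiers and the function name are hypothetical.

```python
# Assumed reading of the diagram: Data wasAttributedTo Alice,
# and Alice actedOnBehalfOf The lab. Identifiers are illustrative.
was_attributed_to = {"ex:data": ["ex:alice"]}
acted_on_behalf_of = {"ex:alice": ["ex:lab"]}


def who_to_credit(entity):
    """Return the agents an entity is attributed to, followed by
    everyone those agents (transitively) acted on behalf of."""
    credited = []
    queue = list(was_attributed_to.get(entity, []))
    while queue:
        agent = queue.pop(0)
        if agent in credited:
            continue  # already visited; also guards against cycles
        credited.append(agent)
        queue.extend(acted_on_behalf_of.get(agent, []))
    return credited


print(who_to_credit("ex:data"))
```

With the two statements above, the result lists Alice first and the lab second, since the lab is only reached through Alice's delegation.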
Derivation
Which sample was this metagenome sequenced from?
Which metagenome was this sequence extracted from?
Which sequence was the basis for the results?
What is the previous revision of the new results?
[Diagram: Sample, Metagenome, Sequence, Old results, New results; relations wasDerivedFrom, wasQuotedFrom, wasRevisionOf, wasInfluencedBy]
Why do I need this?
i. To verify consistency (did I use the correct sequence?)
ii. To find the latest revision
iii. To backtrack where a diversion appeared after a change
iv. To credit work I depend on
v. Auditing and defence for peer review
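Two of the uses above, backtracking through derivations and finding the latest revision, are simple graph walks. The sketch below assumes a linear chain (new results from sequence, sequence from metagenome, metagenome from sample) as one plausible reading of the diagram. Real PROV allows an entity to be derived from several others, so the single-parent dictionaries here are a deliberate simplification, and all identifiers are hypothetical.

```python
# Simplified, single-parent derivation chain (an assumption for this sketch).
was_derived_from = {
    "ex:new-results": "ex:sequence",
    "ex:sequence": "ex:metagenome",
    "ex:metagenome": "ex:sample",
}
was_revision_of = {"ex:new-results": "ex:old-results"}


def trace_back(entity):
    """Follow wasDerivedFrom backwards in time until no source is recorded."""
    chain = [entity]
    while chain[-1] in was_derived_from:
        source = was_derived_from[chain[-1]]
        if source in chain:
            break  # cycle guard; should not occur in valid provenance
        chain.append(source)
    return chain


def latest_revision(entity):
    """Follow wasRevisionOf forwards to the newest known revision."""
    newer = {old: new for new, old in was_revision_of.items()}
    seen = {entity}
    while entity in newer and newer[entity] not in seen:
        entity = newer[entity]
        seen.add(entity)
    return entity


print(trace_back("ex:new-results"))
print(latest_revision("ex:old-results"))
```

Because provenance statements point backwards, finding the latest revision requires inverting the `wasRevisionOf` mapping first.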
Activities
What happened? When? Who?
What was used and generated?
Why was this workflow started?
Which workflow ran? Where?
[Diagram: entities Sample, Metagenome, Results; activities Sequencing, Workflow run; agents Alice, Workflow server; plan Workflow definition; role Lab technician; relations used, wasGeneratedBy, wasAssociatedWith, wasInformedBy, wasStartedBy, hadPlan, hadRole, wasStartedAt "2012-06-21"]
Why do I need this?
i. To see which analysis was performed
ii. To find out who did what
iii. What was the metagenome used for?
iv. To understand the whole process (“make me a Methods section”)
v. To track down inconsistencies
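The "make me a Methods section" idea can be illustrated by turning activity records into sentences. The records below are an assumed reading of the diagram labels (which agent, role and plan belong to which activity is a guess), and the dictionary layout and function name are hypothetical.

```python
# Assumed reading of the diagram; the pairing of agents, roles and plans
# with activities is a guess made for this illustration.
activities = [
    {
        "id": "Sequencing",
        "used": ["Sample"],
        "generated": ["Metagenome"],
        "agent": "Alice",
        "role": "Lab technician",
    },
    {
        "id": "Workflow run",
        "used": ["Metagenome"],
        "generated": ["Results"],
        "agent": "Workflow server",
        "plan": "Workflow definition",
    },
]


def methods_section(acts):
    """Render one sentence per activity from its used/generated/association data."""
    lines = []
    for act in acts:
        line = (
            f"{act['id']} used {', '.join(act['used'])} "
            f"and generated {', '.join(act['generated'])}"
        )
        line += f"; associated with {act['agent']}"
        if act.get("role"):
            line += f" (role: {act['role']})"
        if act.get("plan"):
            line += f", following plan {act['plan']}"
        lines.append(line + ".")
    return lines


for sentence in methods_section(activities):
    print(sentence)
```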
Typical (?) workflow structure
[Diagram: Workflow containing Input ports, Processors and Output ports, connected by Data links]
http://taverna.incubator.apache.org/
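The workflow structure named on this slide (input ports, processors, output ports, data links) maps naturally onto a small data structure. The sketch below is an assumption-laden illustration, not Taverna's or wfdesc's actual model: the class names, the `processor.port` naming scheme and the `dangling_links` check are all invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Processor:
    """A workflow step with its own input and output ports."""
    name: str
    inputs: list
    outputs: list


@dataclass
class Workflow:
    name: str
    input_ports: list
    output_ports: list
    processors: list = field(default_factory=list)
    data_links: list = field(default_factory=list)  # (source, sink) pairs

    def known_ports(self):
        """All port names; processor ports are qualified as 'processor.port'."""
        ports = set(self.input_ports) | set(self.output_ports)
        for proc in self.processors:
            ports |= {f"{proc.name}.{p}" for p in proc.inputs + proc.outputs}
        return ports

    def dangling_links(self):
        """Data links whose source or sink is not a known port."""
        known = self.known_ports()
        return [
            link for link in self.data_links
            if link[0] not in known or link[1] not in known
        ]


wf = Workflow(
    name="hello",
    input_ports=["name"],
    output_ports=["greeting"],
    processors=[Processor("concat", inputs=["in"], outputs=["out"])],
    data_links=[("name", "concat.in"), ("concat.out", "greeting")],
)
print(wf.dangling_links())
```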
Workflow description (wfdesc)
http://purl.org/wf4ever/wfdesc#
Workflow run provenance (wfprov)
http://purl.org/wf4ever/wfprov#
Workflow Run Bundle
ZIP folder structure (RO Bundle)
mimetype
  application/vnd.wf4ever.robundle+zip
.ro/manifest.json
workflowrun.prov.ttl
workflow
input/X.txt
output/A.txt
output/B/
  1.txt
  2.txt
  3.txt
output/C.jpg
intermediates/
  de/def2e58b-50e2-4949-9980-fd310166621a.txt
Also labelled: URI references, attribution, execution environment
https://doi.org/10.5281/zenodo.51314
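The ZIP layout above can be sketched with Python's standard `zipfile` module. This is a minimal illustration under stated assumptions, not the Taverna implementation: the function name `write_bundle` is hypothetical, and the manifest keys (`@context`, `id`, `aggregates`) and the rule that `mimetype` is stored first and uncompressed follow my reading of the RO Bundle specification, which should be checked before relying on this.

```python
import io
import json
import zipfile

MIMETYPE = "application/vnd.wf4ever.robundle+zip"


def write_bundle(target, files):
    """Write a minimal RO-Bundle-like ZIP.

    `files` maps bundle paths (e.g. "output/A.txt") to bytes.
    """
    manifest = {
        # Manifest keys assumed from the RO Bundle specification.
        "@context": ["https://w3id.org/bundle/context"],
        "id": "/",
        "aggregates": [{"uri": "/" + path} for path in sorted(files)],
    }
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as zf:
        # The mimetype entry goes first and is stored without compression,
        # so tools can sniff the media type from the start of the file.
        zf.writestr("mimetype", MIMETYPE, compress_type=zipfile.ZIP_STORED)
        zf.writestr(".ro/manifest.json", json.dumps(manifest, indent=2))
        for path in sorted(files):
            zf.writestr(path, files[path])


# Usage: build a bundle in memory and list its entries.
buffer = io.BytesIO()
write_bundle(buffer, {"input/X.txt": b"x", "output/A.txt": b"a"})
with zipfile.ZipFile(buffer) as zf:
    print(zf.namelist())
```

A real workflow run bundle would also carry the workflow definition and `workflowrun.prov.ttl`, and would describe them in the manifest; those are left out here to keep the sketch short.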
https://doi.org/10.1016/j.websem.2015.01.003
Research Object Bundle
application/vnd.wf4ever.robundle+zip
http://www.researchobject.org/
A Research Object bundles and relates digital resources of a scientific experiment/investigation + context
Data used and results produced in experimental study
Methods employed to produce and analyse that data
Provenance and settings for the experiments
People involved in the investigation
Annotations about these resources, to improve understanding and interpretation
Standards-based metadata framework for bundling embedded and referenced resources with context
Citable Reproducible Packaging
researchobject.org
Systems Biology Research Objects: exchange, portability and maintenance
Components packaged into various containers
ISA-TAB, checksum
Common Workflow Language Viewer
Download as a Research Object Bundle
Snapshots evolving CWL files in GitHub
Permalink to snapshot the workflow identifier for RO
CWL files packaged in a RO
CWL RO + added richness
Lift out parts into the manifest
Artist's Impression
https://osf.io/h59uh/ https://doi.org/10.1101/191783
identifiers.org
PROV
JSON
https://doi.org/10.1109/BigData.2016.7840618
manifest.json
Provenance from cwltool
Farah Z Khan:
Modify cwltool reference implementation to capture provenance
Generates BagIt Research Object
Mints identifiers for data and run
Capture intermediate values
Workflow activities as PROV
wfdesc, OPMW, ProvONE
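The "mints identifiers for data and run" step can be illustrated as follows. This is a hypothetical sketch and not how the modified cwltool does it: it assumes a random UUID URN for each run and a content-hash URN for each data value, and the `urn:hash::sha1:` prefix and the function names are choices made for this example.

```python
import hashlib
import uuid


def mint_run_id():
    """Mint a fresh identifier for one workflow run (random UUID URN)."""
    return uuid.uuid4().urn


def mint_data_id(content):
    """Mint a content-derived identifier, so identical bytes share one id.

    The URN prefix is an assumption for this sketch.
    """
    return "urn:hash::sha1:" + hashlib.sha1(content).hexdigest()


def describe_run(outputs):
    """Pair a new run id with identifiers for each named output value."""
    run_id = mint_run_id()
    return {
        "run": run_id,
        "generated": {name: mint_data_id(data) for name, data in outputs.items()},
    }


record = describe_run({"A.txt": b"hello", "copy-of-A.txt": b"hello"})
print(record["run"])
print(record["generated"])
```

Content-derived identifiers mean that intermediate values captured more than once resolve to the same id, while each run still gets a distinct one.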
http://doi.org/10.7490/f1000research.1114781.1
Acknowledgements
Farah Z Khan
Carole Goble
Michael R. Crusoe
Apache Taverna
BioExcel
Common Workflow Language
Research Object
W3C PROV WG
https://www.slideshare.net/StianSoilandReyes/