the lifecycle of reproducible science data and what provenance has got to do with it
![Page 1: The lifecycle of reproducible science data and what provenance has got to do with it](https://reader030.vdocuments.site/reader030/viewer/2022011722/58eaa2f01a28abe5728b5ce5/html5/thumbnails/1.jpg)
The lifecycle of reproducible science data and what provenance has got to do with it
Paolo Missier
School of Computing Science
Newcastle University, UK
Alan Turing Institute
Symposium On Reproducibility for Data-Intensive Research
Oxford, April 6, 2016
With material contributed by:
• Yang Cao, Bertram Ludäscher, Tim McPhillips, Dave Vieglais, Matt Jones and the DataONE CyberInfrastructure group
• Rawaa Qasha at Newcastle University
• Carole Goble at the University of Manchester
![Page 2: The lifecycle of reproducible science data and what provenance has got to do with it](https://reader030.vdocuments.site/reader030/viewer/2022011722/58eaa2f01a28abe5728b5ce5/html5/thumbnails/2.jpg)
P. Missier, ATI Symposium on Reproducibility, Oxford, April 6th, 2016
(Yet another) Data Lifecycle picture
[Diagram: data lifecycle loop. Stages: search / discover; package / publish; deploy; compute; compare.]
Diagram labels: <D, P, dep, spec(P), prov(D)>; D, D1, D’; P, P’; dep, dep’; spec(P), spec(P’); prov(D), prov(D’); Deploy P’; Env, Env(dep’); Compare(P, P’, D, D’)
![Page 3: The lifecycle of reproducible science data and what provenance has got to do with it](https://reader030.vdocuments.site/reader030/viewer/2022011722/58eaa2f01a28abe5728b5ce5/html5/thumbnails/3.jpg)
Reproducibility: working. reporting
submit article and move on…
publish article
Research Environment
Publication Environment
Peer Review
![Page 4: The lifecycle of reproducible science data and what provenance has got to do with it](https://reader030.vdocuments.site/reader030/viewer/2022011722/58eaa2f01a28abe5728b5ce5/html5/thumbnails/4.jpg)
Re-what?
Re-*
• ReRun: vary experiment and setup, same lab (P → P’, D → D’, dep → dep’)
• Repeat: same experiment, setup, lab (P, D, dep, env(dep))
• Replicate: same experiment, setup, different lab (P, D, dep, env’(dep))
• Reproduce: vary experiment and setup, different lab (P → P’, D → D’, dep → dep’, env(dep) → env’(dep’))
• Reuse: different experiment (D, P → Q)
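The Re-* distinctions above can be read as a small decision table over two questions: did the experiment and setup change, and did the lab change? The sketch below is only an illustration of that reading, not something from the talk. The `Run` record and its field names are hypothetical, and Reuse is left out because it concerns a different experiment altogether.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One execution: process P, data D, dependencies dep, and the lab that ran it.

    All field names are hypothetical, chosen to mirror the slide's symbols.
    """
    process: str
    data: str
    deps: str
    lab: str


def classify(original: Run, new: Run) -> str:
    """Name the Re-* relationship between two runs of the same experiment."""
    same_setup = (original.process, original.data, original.deps) == (
        new.process, new.data, new.deps)
    same_lab = original.lab == new.lab
    if same_setup:
        return "Repeat" if same_lab else "Replicate"
    return "ReRun" if same_lab else "Reproduce"


base = Run("P", "D", "dep", "lab-A")
for other in (base,
              Run("P", "D", "dep", "lab-B"),
              Run("P'", "D'", "dep'", "lab-A"),
              Run("P'", "D'", "dep'", "lab-B")):
    print(classify(base, other))
```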
![Page 5: The lifecycle of reproducible science data and what provenance has got to do with it](https://reader030.vdocuments.site/reader030/viewer/2022011722/58eaa2f01a28abe5728b5ce5/html5/thumbnails/5.jpg)
Mapping the reproducibility space
Goal: to help scientists understand the effect of workflow / data / dependencies evolution on workflow execution results
Approach: compare provenance traces generated during the runs: PDIFF
P. Missier, S. Woodman, H. Hiden, P. Watson. Provenance and data differencing for workflow reproducibility analysis, Concurrency Computat.: Pract. Exper., 2013.
![Page 6: The lifecycle of reproducible science data and what provenance has got to do with it](https://reader030.vdocuments.site/reader030/viewer/2022011722/58eaa2f01a28abe5728b5ce5/html5/thumbnails/6.jpg)
Workflow evolution
Each of the elements in an execution may evolve (semi) independently from the others:
Repeatability: can tr_t be computed again at some time t’ > t? This requires saving ED_t, which may be impractical (e.g. large DB state).
![Page 7: The lifecycle of reproducible science data and what provenance has got to do with it](https://reader030.vdocuments.site/reader030/viewer/2022011722/58eaa2f01a28abe5728b5ce5/html5/thumbnails/7.jpg)
Reproducibility
Can a new version tr_t’ of tr_t be computed at some later time t’ > t, after one or more of the elements has changed?
Potential issues:
• W_i may not run with the new ED_j’
• W_i may not run with wfms_k’
• W_i’ may not run with d_h’
• ...
![Page 8: The lifecycle of reproducible science data and what provenance has got to do with it](https://reader030.vdocuments.site/reader030/viewer/2022011722/58eaa2f01a28abe5728b5ce5/html5/thumbnails/8.jpg)
(Yet another) Data Lifecycle picture
[Diagram: the same data lifecycle loop as on the earlier slide (search / discover, package / publish, deploy, compute, compare), annotated with supporting tools]
• Research Objects
• DataONE: federated research data repositories
• Matlab provenance recorder (DataONE)
• TOSCA-based virtualisation
• Pdiff: differencing provenance
• YesWorkflow
• Workflow provenance
• NoWorkflow
• ReproZip
![Page 9: The lifecycle of reproducible science data and what provenance has got to do with it](https://reader030.vdocuments.site/reader030/viewer/2022011722/58eaa2f01a28abe5728b5ce5/html5/thumbnails/9.jpg)
You are here
• Data packaging: Research Objects
• DataONE: data packaging, publication, search and discovery, hosting
  • R provenance recorder
  • Process-as-a-dataflow: YesWorkflow
• Process virtualisation using TOSCA
• Provenance recorders
  • Workflow provenance: Taverna, eScience Central, Kepler, Pegasus, VisTrails…
  • NoWorkflow: provenance recording for Python
• Pdiff: provenance differencing for understanding workflow differences
![Page 10: The lifecycle of reproducible science data and what provenance has got to do with it](https://reader030.vdocuments.site/reader030/viewer/2022011722/58eaa2f01a28abe5728b5ce5/html5/thumbnails/10.jpg)
Computational Workflow Runs
workflowrun.prov.ttl (RDF)
outputA.txt
outputC.jpg
outputB/
intermediates/
1.txt
2.txt
3.txt
de/def2e58b-50e2-4949-9980-fd310166621a.txt
inputA.txt
workflow attribution
execution environment
Aggregating in Research Object
ZIP folder structure (RO Bundle)
mimetype: application/vnd.wf4ever.robundle+zip
.ro/manifest.json
URI references
Exchange
Reproducibility: same data, same code
Systematic and extensible meta-data collection
Workflow Annotation Profile
Wf4Ever Project
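The ZIP layout named on this slide (a `mimetype` entry plus `.ro/manifest.json`) can be sketched with the Python standard library. This is a minimal illustration, not the Wf4Ever tooling: the function name `build_ro_bundle` is hypothetical, and the manifest keys (`@context`, `id`, `aggregates`) follow my reading of the RO Bundle specification and should be checked against it before use.

```python
import io
import json
import zipfile

# Media type taken from the slide.
MIMETYPE = "application/vnd.wf4ever.robundle+zip"


def build_ro_bundle(files: dict) -> bytes:
    """Pack files (name -> bytes) into an in-memory RO-Bundle-style ZIP."""
    manifest = {
        # Assumed context URL and keys; verify against the RO Bundle spec.
        "@context": ["https://w3id.org/bundle/context"],
        "id": "/",
        "aggregates": [{"uri": "/" + name} for name in sorted(files)],
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as bundle:
        # The mimetype entry goes first and is stored uncompressed.
        bundle.writestr("mimetype", MIMETYPE, compress_type=zipfile.ZIP_STORED)
        bundle.writestr(".ro/manifest.json", json.dumps(manifest, indent=2))
        for name in sorted(files):
            bundle.writestr(name, files[name])
    return buffer.getvalue()


payload = build_ro_bundle({
    "inputA.txt": b"input",
    "outputA.txt": b"output",
    "workflowrun.prov.ttl": b"# provenance trace would go here\n",
})
with zipfile.ZipFile(io.BytesIO(payload)) as bundle:
    print(bundle.namelist())
```

Keeping the manifest separate from the container is what lets the same description travel in other packaging formats.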
![Page 11: The lifecycle of reproducible science data and what provenance has got to do with it](https://reader030.vdocuments.site/reader030/viewer/2022011722/58eaa2f01a28abe5728b5ce5/html5/thumbnails/11.jpg)
Manifests and Containers
Container
• Packaging: Zip files, Docker images, BagIt, …
• Catalogues & Commons Platforms: FAIRDOM SEEK, Farr Commons CKAN, STELAR eLab, myExperiment
Manifest
• Metadata: describes the aggregated resources, their annotations and their provenance
![Page 12: The lifecycle of reproducible science data and what provenance has got to do with it](https://reader030.vdocuments.site/reader030/viewer/2022011722/58eaa2f01a28abe5728b5ce5/html5/thumbnails/12.jpg)
Manifest Metadata
Manifest Construction
• Identification – id, title, creator, status…
• Aggregates – list of ids/links to resources
• Annotations – list of annotations about resources
Manifest Description
• Checklists – what should be there
• Provenance – where it came from
• Versioning – its evolution
• Dependencies – what else is needed
![Page 13: The lifecycle of reproducible science data and what provenance has got to do with it](https://reader030.vdocuments.site/reader030/viewer/2022011722/58eaa2f01a28abe5728b5ce5/html5/thumbnails/13.jpg)
![Page 14: The lifecycle of reproducible science data and what provenance has got to do with it](https://reader030.vdocuments.site/reader030/viewer/2022011722/58eaa2f01a28abe5728b5ce5/html5/thumbnails/14.jpg)
Components for a flexible, scalable, sustainable network
Cyberinfrastructure Component 2Member Nodes
www.dataone.org/member-nodes
Coordinating Nodes
• retain complete metadata catalog
• indexing for search
• network-wide services
• ensure content availability (preservation)
• replication services
Member Nodes
• diverse institutions
• serve local community
• provide resources for managing their data
• retain copies of data
![Page 15: The lifecycle of reproducible science data and what provenance has got to do with it](https://reader030.vdocuments.site/reader030/viewer/2022011722/58eaa2f01a28abe5728b5ce5/html5/thumbnails/15.jpg)
Cyberinfrastructure
[Diagram labels: Data Services (extraction, sub-setting, etc.); Provenance Semantics-enabled Discovery; ontology; annotation; System Metadata; Science Data; Science Metadata; Provenance; Search API; Replicate; Metadata Index]
![Page 16: The lifecycle of reproducible science data and what provenance has got to do with it](https://reader030.vdocuments.site/reader030/viewer/2022011722/58eaa2f01a28abe5728b5ce5/html5/thumbnails/16.jpg)
Data Holdings
![Page 17: The lifecycle of reproducible science data and what provenance has got to do with it](https://reader030.vdocuments.site/reader030/viewer/2022011722/58eaa2f01a28abe5728b5ce5/html5/thumbnails/17.jpg)
What input data went into this study?
What methods were used?
… with what parameter settings, calibrations, …?
Can we trust the data and methods?
Provenance (lineage): track the origin and processing history of data. This supports trust and data quality, and acts as an audit trail for attribution and credit.
Discovery of data, methodologies, experiments
Use Provenance for Transparency, Reproducibility
![Page 18: The lifecycle of reproducible science data and what provenance has got to do with it](https://reader030.vdocuments.site/reader030/viewer/2022011722/58eaa2f01a28abe5728b5ce5/html5/thumbnails/18.jpg)
W3C has published the ‘PROV’ standard
W3C PROV model
• Core types: Entity, Activity, Agent
• Activity used Entity
• Entity wasGeneratedBy Activity
• Activity wasAssociatedWith Agent
• Entity wasAttributedTo Agent
See w3.org/TR/prov-o/
![Page 19: The lifecycle of reproducible science data and what provenance has got to do with it](https://reader030.vdocuments.site/reader030/viewer/2022011722/58eaa2f01a28abe5728b5ce5/html5/thumbnails/19.jpg)
Using a common model
Example: Scientific workflow
map image
R script Execution
Scientist
wasAssociatedWith
wasAttributedTo
wasGeneratedBy
![Page 20: The lifecycle of reproducible science data and what provenance has got to do with it](https://reader030.vdocuments.site/reader030/viewer/2022011722/58eaa2f01a28abe5728b5ce5/html5/thumbnails/20.jpg)
Using a common model
Example: Scientific workflow
map image
R script Execution
Scientist
CSV data
wasAssociatedWith
wasAttributedTo
wasGeneratedBy
used
wasDerivedFrom
![Page 21: The lifecycle of reproducible science data and what provenance has got to do with it](https://reader030.vdocuments.site/reader030/viewer/2022011722/58eaa2f01a28abe5728b5ce5/html5/thumbnails/21.jpg)
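Using a common model
Example: Scientific workflow
• Entities: “map image”, “CSV data”
• Activity: “R script Execution”
• Agent: “Scientist”
• < “map image” wasGeneratedBy “R script Execution” >
• < “R script Execution” used “CSV data” >
• < “R script Execution” wasAssociatedWith “Scientist” >
• < “map image” wasAttributedTo “Scientist” >
• < “map image” wasDerivedFrom “CSV data” >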
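One way to make the example concrete is to hold the PROV statements as plain subject / relation / object triples and query them. The sketch below uses only the standard library rather than a PROV toolkit. The `Statement` type and helper names are hypothetical, and the direction of each statement follows the PROV typing of the relations (entities are generated by activities, activities use entities). Note that PROV does not entail a derivation from a used / wasGeneratedBy pair, so the second helper only lists candidate sources.

```python
from typing import List, NamedTuple


class Statement(NamedTuple):
    """A single provenance assertion (hypothetical minimal representation)."""
    subject: str
    relation: str
    obj: str


TRACE = [
    Statement("map image", "wasGeneratedBy", "R script Execution"),
    Statement("R script Execution", "used", "CSV data"),
    Statement("R script Execution", "wasAssociatedWith", "Scientist"),
    Statement("map image", "wasAttributedTo", "Scientist"),
    Statement("map image", "wasDerivedFrom", "CSV data"),
]


def objects(trace: List[Statement], subject: str, relation: str) -> List[str]:
    """All objects linked to `subject` by `relation`."""
    return [s.obj for s in trace if s.subject == subject and s.relation == relation]


def candidate_sources(trace: List[Statement], entity: str) -> List[str]:
    """Entities used by the activities that generated `entity`."""
    found = []
    for activity in objects(trace, entity, "wasGeneratedBy"):
        found.extend(objects(trace, activity, "used"))
    return found


print(candidate_sources(TRACE, "map image"))
print(objects(TRACE, "map image", "wasDerivedFrom"))
```

Here the candidate sources agree with the explicitly asserted derivation, which is the situation the slide depicts.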
![Page 22: The lifecycle of reproducible science data and what provenance has got to do with it](https://reader030.vdocuments.site/reader030/viewer/2022011722/58eaa2f01a28abe5728b5ce5/html5/thumbnails/22.jpg)
ProvONE Motivation: Different Kinds of Provenance
• Prospective provenance: method/workflow description (“workflow-land”)
• Retrospective provenance: runtime provenance tracking (“trace-land”)
Better together!
![Page 23: The lifecycle of reproducible science data and what provenance has got to do with it](https://reader030.vdocuments.site/reader030/viewer/2022011722/58eaa2f01a28abe5728b5ce5/html5/thumbnails/23.jpg)
ProvONE extends PROV for science!
“Trace-Land”
“Workflow-Land”
“Data-Land”
http://purl.dataone.org/provone-v1-dev
![Page 24: The lifecycle of reproducible science data and what provenance has got to do with it](https://reader030.vdocuments.site/reader030/viewer/2022011722/58eaa2f01a28abe5728b5ce5/html5/thumbnails/24.jpg)
DataONE data packages: Provenance inside!
[Diagram: a data package is described by a resource map (OAI-ORE with ProvONE trace) that aggregates science metadata, science data, figures and software; the resource map and each aggregated item carry their own system metadata]
![Page 25: The lifecycle of reproducible science data and what provenance has got to do with it](https://reader030.vdocuments.site/reader030/viewer/2022011722/58eaa2f01a28abe5728b5ce5/html5/thumbnails/25.jpg)
Provenance… of Figures
![Page 26: The lifecycle of reproducible science data and what provenance has got to do with it](https://reader030.vdocuments.site/reader030/viewer/2022011722/58eaa2f01a28abe5728b5ce5/html5/thumbnails/26.jpg)
Provenance… of Data
![Page 27: The lifecycle of reproducible science data and what provenance has got to do with it](https://reader030.vdocuments.site/reader030/viewer/2022011722/58eaa2f01a28abe5728b5ce5/html5/thumbnails/27.jpg)
# @begin CreateGulfOfAlaskaMaps
# @in hcdb @as Total_Aromatic_Alkanes_PWS.csv
# @in world @as RWorldMap
# @out map @as Map_Of_Sampling_Locations.png
# @out detailMap @as Detailed_Map_Of_SamplingLocations.png
... mapping code is here ...
# @end CreateGulfOfAlaskaMaps
YesWorkflow (YW): Scripts as prospective provenance
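Because the annotations live in ordinary comments, a prospective-provenance view of the script can be recovered without running it. The sketch below is not the YesWorkflow tool: it is a small hand-written parser, with hypothetical names, that handles only a single block and only the four keywords shown on the slide.

```python
import re

SCRIPT = """\
# @begin CreateGulfOfAlaskaMaps
# @in hcdb @as Total_Aromatic_Alkanes_PWS.csv
# @in world @as RWorldMap
# @out map @as Map_Of_Sampling_Locations.png
# @out detailMap @as Detailed_Map_Of_SamplingLocations.png
# ... mapping code is here ...
# @end CreateGulfOfAlaskaMaps
"""

BEGIN = re.compile(r"#\s*@begin\s+(\S+)")
PORT = re.compile(r"#\s*@(in|out)\s+(\S+)(?:\s+@as\s+(\S+))?")


def parse_block(script: str) -> dict:
    """Collect the block name and its input/output ports from comment annotations."""
    block = {"name": None, "in": {}, "out": {}}
    for line in script.splitlines():
        begin = BEGIN.search(line)
        port = PORT.search(line)
        if begin:
            block["name"] = begin.group(1)
        elif port:
            kind, variable, alias = port.groups()
            # Without an @as alias, fall back to the variable name itself.
            block[kind][variable] = alias or variable
    return block


block = parse_block(SCRIPT)
print(block["name"])
print(block["in"])
print(block["out"])
```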
![Page 28: The lifecycle of reproducible science data and what provenance has got to do with it](https://reader030.vdocuments.site/reader030/viewer/2022011722/58eaa2f01a28abe5728b5ce5/html5/thumbnails/28.jpg)
YesWorkflow (YW): Scripts as prospective provenance
• MATLAB, R, Python … scripts
• Script + @YW-annotation → workflow-land & trace-land
• Combine provenance:
  • Prospective (workflow)
  • Retrospective (runtime trace)
  • Reconstructed (logs, files, …)
• User can query own data & provenance prior to sharing
• Incentive: accelerate work!
“Provenance for Self”
![Page 29: The lifecycle of reproducible science data and what provenance has got to do with it](https://reader030.vdocuments.site/reader030/viewer/2022011722/58eaa2f01a28abe5728b5ce5/html5/thumbnails/29.jpg)
When a user cites a pub, we know:
• Which data produced it
• What software produced it
• What was derived from it
• Who to credit down the attribution stack
Katz & Smith. 2014. Implementing Transitive Credit with JSON-LD. arXiv:1407.5117
Missier, Paolo. “Data Trajectories: Tracking Reuse of Published Data for Transitive Credit Attribution.” 11th Intl. Data Curation Conference (IDCC). Amsterdam, 2016. (Best Paper Award)
Transitive Credit
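The idea of crediting contributors "down the attribution stack" can be sketched as a weighted walk over a dependency graph. The example below is an assumption-laden illustration rather than the method of either cited paper: the credit maps, names and weights are invented, each product is assumed to split a total of 1.0 among its direct contributors, intermediate products are assumed to pass all of their share on, and the graph is assumed to be acyclic.

```python
from collections import defaultdict

# Hypothetical credit maps: product -> {direct contributor: share of credit}.
CREDIT = {
    "paper": {"author": 0.6, "dataset": 0.3, "software": 0.1},
    "dataset": {"collector": 1.0},
    "software": {"developer": 0.8, "library": 0.2},
    "library": {"maintainer": 1.0},
}


def transitive_credit(product, credit_map, weight=1.0, totals=None):
    """Push `weight` down the attribution stack until it reaches people."""
    totals = defaultdict(float) if totals is None else totals
    for contributor, share in credit_map.get(product, {}).items():
        if contributor in credit_map:
            # The contributor is itself a product: recurse with the scaled weight.
            transitive_credit(contributor, credit_map, weight * share, totals)
        else:
            totals[contributor] += weight * share
    return dict(totals)


for name, credit in sorted(transitive_credit("paper", CREDIT).items()):
    print(f"{name}: {credit:.2f}")
```

Citing the paper therefore also credits the data collector and the software developers, in proportion to the product of the shares along each path.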
![Page 30: The lifecycle of reproducible science data and what provenance has got to do with it](https://reader030.vdocuments.site/reader030/viewer/2022011722/58eaa2f01a28abe5728b5ce5/html5/thumbnails/30.jpg)
Provenance today: Important but hard
“This report is the result of a three-year analytical effort by a team of over 300 experts, overseen by a broadly constituted Federal Advisory Committee of 60 members. It was developed from information and analyses gathered in over 70 workshops and listening sessions held across the country.”
![Page 31: The lifecycle of reproducible science data and what provenance has got to do with it](https://reader030.vdocuments.site/reader030/viewer/2022011722/58eaa2f01a28abe5728b5ce5/html5/thumbnails/31.jpg)
Provenance today: Important but hard
data and “code” / method linked
alt formats
![Page 32: The lifecycle of reproducible science data and what provenance has got to do with it](https://reader030.vdocuments.site/reader030/viewer/2022011722/58eaa2f01a28abe5728b5ce5/html5/thumbnails/32.jpg)
Yaxing’s script with inputs & output products
YesWorkflow model
Christopher using Yaxing’s outputs as inputs for his script
Christopher’s results can be traced back all the way to Yaxing’s input
Provenance in action
![Page 33: The lifecycle of reproducible science data and what provenance has got to do with it](https://reader030.vdocuments.site/reader030/viewer/2022011722/58eaa2f01a28abe5728b5ce5/html5/thumbnails/33.jpg)
![Page 34: The lifecycle of reproducible science data and what provenance has got to do with it](https://reader030.vdocuments.site/reader030/viewer/2022011722/58eaa2f01a28abe5728b5ce5/html5/thumbnails/34.jpg)
TOSCA
• Topology and Orchestration Specification of Cloud Applications
![Page 35: The lifecycle of reproducible science data and what provenance has got to do with it](https://reader030.vdocuments.site/reader030/viewer/2022011722/58eaa2f01a28abe5728b5ce5/html5/thumbnails/35.jpg)
Use Case: e-Science Central Workflow
http://www.esciencecentral.co.uk
![Page 36: The lifecycle of reproducible science data and what provenance has got to do with it](https://reader030.vdocuments.site/reader030/viewer/2022011722/58eaa2f01a28abe5728b5ce5/html5/thumbnails/36.jpg)
TOSCA-based mapping of an e-SC Workflow
• Workflow components as Node Types
• Block dependencies as Relationship Types
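The mapping in the two bullets above can be pictured as a typed graph: each workflow block becomes a node of some node type, and each block dependency becomes a typed relationship between two nodes. The sketch below is not TOSCA syntax and not the e-SC mapping itself. The block names, the `WorkflowBlock` and `DependsOn` type names, and the helper function are all hypothetical; it only shows how a deployment order falls out of such a topology.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical workflow blocks, each mapped to a node type.
NODES = {
    "import_csv": "WorkflowBlock",
    "clean": "WorkflowBlock",
    "plot": "WorkflowBlock",
}

# (source, relationship type, target): the source block needs the target first.
RELATIONSHIPS = [
    ("clean", "DependsOn", "import_csv"),
    ("plot", "DependsOn", "clean"),
]


def deployment_order(nodes, relationships):
    """Order the nodes so that every block is deployed after its dependencies."""
    sorter = TopologicalSorter({name: set() for name in nodes})
    for source, _relationship_type, target in relationships:
        sorter.add(source, target)
    return list(sorter.static_order())


print(deployment_order(NODES, RELATIONSHIPS))
```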
![Page 37: The lifecycle of reproducible science data and what provenance has got to do with it](https://reader030.vdocuments.site/reader030/viewer/2022011722/58eaa2f01a28abe5728b5ce5/html5/thumbnails/37.jpg)
e-SC Workflow Service Template
![Page 38: The lifecycle of reproducible science data and what provenance has got to do with it](https://reader030.vdocuments.site/reader030/viewer/2022011722/58eaa2f01a28abe5728b5ce5/html5/thumbnails/38.jpg)
![Page 39: The lifecycle of reproducible science data and what provenance has got to do with it](https://reader030.vdocuments.site/reader030/viewer/2022011722/58eaa2f01a28abe5728b5ce5/html5/thumbnails/39.jpg)
Data divergence analysis using provenance
All work done with reference to the e-Science Central WFMS.
Assumption: workflow WFj (the new version) runs to completion:
• thus it produces a new provenance trace
• however, it may be dysfunctional relative to WFi (the original)
Example: only the input data changes: d != d’, WFj == WFi
Note: results may diverge even when the input datasets are identical, for example when one or more of the services exhibits non-deterministic behaviour, or depends on external state that has changed between executions.
![Page 40: The lifecycle of reproducible science data and what provenance has got to do with it](https://reader030.vdocuments.site/reader030/viewer/2022011722/58eaa2f01a28abe5728b5ce5/html5/thumbnails/40.jpg)
Provenance traces for two runs
used
genBy
![Page 41: The lifecycle of reproducible science data and what provenance has got to do with it](https://reader030.vdocuments.site/reader030/viewer/2022011722/58eaa2f01a28abe5728b5ce5/html5/thumbnails/41.jpg)
Delta graphs
A delta graph is obtained by “diffing” two traces. It can be used to explain observed differences in workflow outputs in terms of differences throughout the two executions.
This is the simplest possible delta “graph”!
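For the simple case where only the input data changes, the diff can be sketched as a lockstep walk backwards from the two final outputs. The code below is a heavily simplified illustration, not the published PDIFF algorithm. The trace representation (each data item recording its value, the service that generated it, and that service's inputs) and all names are assumptions made for the example.

```python
def delta_graph(trace_a, trace_b, item_a, item_b):
    """Walk two traces upstream in lockstep and collect the points where they differ."""
    a, b = trace_a[item_a], trace_b[item_b]
    if a["value"] == b["value"]:
        return []  # identical data: nothing upstream needs explaining
    edges = [(item_a, item_b, "data differs")]
    if a["service"] != b["service"]:
        # The traces diverge structurally, so stop pairing inputs here.
        edges.append((a["service"], b["service"], "service differs"))
        return edges
    for in_a, in_b in zip(a["inputs"], b["inputs"]):
        edges.extend(delta_graph(trace_a, trace_b, in_a, in_b))
    return edges


# Two runs of the same two-step workflow (S1 then S2) on different inputs.
RUN_A = {
    "d": {"value": 1, "service": None, "inputs": []},
    "x": {"value": 10, "service": "S1", "inputs": ["d"]},
    "y": {"value": 100, "service": "S2", "inputs": ["x"]},
}
RUN_B = {
    "d'": {"value": 2, "service": None, "inputs": []},
    "x'": {"value": 20, "service": "S1", "inputs": ["d'"]},
    "y'": {"value": 200, "service": "S2", "inputs": ["x'"]},
}

for edge in delta_graph(RUN_A, RUN_B, "y", "y'"):
    print(edge)
```

The resulting chain of differing pairs leads from the diverging outputs back to the changed input, which is the explanation the delta graph is meant to give.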
![Page 42: The lifecycle of reproducible science data and what provenance has got to do with it](https://reader030.vdocuments.site/reader030/viewer/2022011722/58eaa2f01a28abe5728b5ce5/html5/thumbnails/42.jpg)
More involved workflow differences
WA
WB
sv2
![Page 43: The lifecycle of reproducible science data and what provenance has got to do with it](https://reader030.vdocuments.site/reader030/viewer/2022011722/58eaa2f01a28abe5728b5ce5/html5/thumbnails/43.jpg)
The corresponding traces
![Page 44: The lifecycle of reproducible science data and what provenance has got to do with it](https://reader030.vdocuments.site/reader030/viewer/2022011722/58eaa2f01a28abe5728b5ce5/html5/thumbnails/44.jpg)
Delta graph computed by PDIFF
![Page 45: The lifecycle of reproducible science data and what provenance has got to do with it](https://reader030.vdocuments.site/reader030/viewer/2022011722/58eaa2f01a28abe5728b5ce5/html5/thumbnails/45.jpg)
References
Research Objects: www.researchobject.org
Bechhofer, Sean, Iain Buchan, David De Roure, Paolo Missier, J. Ainsworth, J. Bhagat, P. Couch, et al. “Why Linked Data Is Not Enough for Scientists.” Future Generation Computer Systems (2011). doi:10.1016/j.future.2011.08.004.
DataONE: dataone.org
Cuevas-Vicenttín, Víctor, Parisa Kianmajd, Bertram Ludäscher, Paolo Missier, Fernando Chirigati, Yaxing Wei, David Koop, and Saumen Dey. “The PBase Scientific Workflow Provenance Repository.” In Procs. 9th International Digital Curation Conference, 9:28–38. San Francisco, CA, USA, 2014. doi:10.2218/ijdc.v9i2.332.
Process Virtualisation using TOSCA
Qasha, Rawaa, Jacek Cala, and Paul Watson. “Towards Automated Workflow Deployment in the Cloud Using TOSCA.” In 2015 IEEE 8th International Conference on Cloud Computing, 1037–1040. New York, 2015. doi:10.1109/CLOUD.2015.146.
NoWorkflow: provenance recording for Python
Murta, Leonardo, Vanessa Braganholo, Fernando Chirigati, David Koop, and Juliana Freire. “noWorkflow: Capturing and Analyzing Provenance of Scripts.” In Procs. IPAW’14. Cologne, Germany: Springer, 2014.
Pdiff: provenance differencing for understanding workflow differences
Missier, Paolo, Simon Woodman, Hugo Hiden, and Paul Watson. “Provenance and Data Differencing for Workflow Reproducibility Analysis.” Concurrency and Computation: Practice and Experience (2013). doi:10.1002/cpe.3035.