Repro PDIFF talk (invited, Humboldt University, Berlin)
DESCRIPTION
See paper: Missier, Paolo, Simon Woodman, Hugo Hiden, and Paul Watson. "Provenance and Data Differencing for Workflow Reproducibility Analysis." Concurrency and Computation: Practice and Experience. In press (2013).
TRANSCRIPT
Provenance and data differencing for workflow reproducibility analysis
Paolo Missier
School of Computing Science, Newcastle University, UK
Humboldt University, Berlin
March 4, 2013
Provenance Metadata
Provenance is a description of how things came to be, and how they came to be in the state they are in today (*)
Provenance refers to the sources of information, including entities and processes, involved in producing or delivering an artifact (*)
(*) Definitions proposed by the W3C Incubator Group on provenance: http://www.w3.org/2005/Incubator/prov/wiki/What_Is_Provenance
Why does it matter?
• To establish quality, relevance, trust
• To track information attribution through complex transformations
• To describe one’s experiment to others, for understanding / reuse
• To provide evidence in support of scientific claims
• To enable post hoc process analysis for debugging, improvement, evolution
A colourful provenance graph
[Figure: a PROV-style provenance graph in two panels, "Editing phase" and "Publishing phase", spanning the remote past to the recent past. Entities (draft v1, draft comments, draft v2, paper3, publication draft v2, pub guidelines v1/v2, WD1, with attributes such as distribution=internal, status=draft, version=0.1) are linked to activities (drafting, commenting, editing, reading, guideline update) and to agents (Alice, Bob and his specializations Bob-1/Bob-2, Charlie, w3c:consortium, with roles such as author, editor, jr_editor, main_editor, headOfPublication, issuer) via used, wasGeneratedBy, wasDerivedFrom, wasAssociatedWith, wasAttributedTo, actedOnBehalfOf, and specializationOf relations.]
Motivation: Reproducibility in e-science
• Setting: Collaborative, Open Science
– Increasing rate of data sharing in science
• The stick: both journals and funders demand that data be uploaded
– Multiple data journals and data repositories are emerging
• The carrot: data is given a DOI and is citable, so scientists get credit
• Thomson’s Data Citation Index
• Dryad data repository for biosciences (*)
• The DataBib repository of research data
• NSF Data Preservation projects: DataONE
• best practices document: notebooks.dataone.org/bestpractices/
• ... and many others
(*) As of Jan 27, 2013, Dryad contains 2585 data packages and 7097 data files, associated with articles in 187 journals.
General problems
• Quality assurance
– from non-malicious errors in method or data, all the way to fraud
– ... leading to retractions in scientific publications
• see e.g. http://retractionwatch.wordpress.com/
• Repeatability
– If I replicate your experiment / repeat your process on the same data, will I get the same results?
• Reproducibility -- a more general notion
The ability for a third party who has access to the description of the original experiment and its results to reproduce those results, using a possibly different setting, with the goal to confirm or dispute the original experimenter's claims.
Specifically, in e-science...
• Experimental method → scripts, programs, workflows
• Publication = results + {program, workflow} + evidence of results
• Repeatability, reproducibility
– will I be able to run (my version of) your workflow on (my version of) your input and compare my results to yours?
• Evidence of result: provenance of {program, workflow} execution
• Side note: portability issues are out of scope
– VMs often solve the problem, with some limitations
• not when workflows depend on third-party services
• only for limited-size data dependencies
Main issue: Workflow evolution and decay
Mapping the reproducibility space
[Figure: map of the reproducibility space. Experimental variations span method variation (wf → wf'), data variation (d → d'), and combined data and method variation; together with environmental variations (ED → ED', wfms → wfms'), they separate results confirmation from repeatability and, more generally, reproducibility. The environmental-variation side contains the decay region: dysfunctional workflows (caused by service updates and state changes, addressed by divergence analysis) and non-functioning workflows (addressed by exception analysis and debugging).]
Goal: to help scientists understand the effect of workflow / data / dependency evolution on workflow execution results.
Approach: compare the provenance traces generated during the runs: PDIFF.
P. Missier, S. Woodman, H. Hiden, P. Watson. Provenance and data differencing for workflow reproducibility analysis. Concurrency and Computation: Practice and Experience, 2013. In press.
Decay
• Workflows that have external dependencies are harder to maintain
– they may become dysfunctional or break altogether
Zhao J, Gomez-Perez J, Belhajjame K, Klyne G, et al. Why Workflows Break - Understanding and Combating Decay in Taverna Workflows. In Procs. e-Science Conference, Chicago, 2012.
Workflows and provenance traces
Workflow (structure): directed graph W = (T, E)
T: set of tasks (computational units)
P: set of (input, output) ports associated with each task t ∈ T
E ⊂ T × T: graph edges representing data dependencies
⟨ti.pA, tj.pB⟩ ∈ E: data produced by ti on port pA ∈ P is routed to port pB ∈ P of tj
Execution trace: tr = exec(W, d, ED, wfms)
A: activities
D: data items
R = {used, genBy} relations: used ⊂ A × D × P, genBy ⊂ D × A × P
Workflow inputs: tr.I = {d ∈ D | ∀a ∈ A, p ∈ P: (d, a, p) ∉ genBy}
Workflow outputs: tr.O = {d ∈ D | ∀a ∈ A, p ∈ P: (a, d, p) ∉ used}
A trace produced at time t is written tr_t = exec_t(W_i, ED_j, d_h, wfms_k), with i, j, h, k < t, where the indices denote the versions of each element in effect at time t.
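To make these definitions concrete, here is a minimal sketch in Java (the class and method names are illustrative, not e-Science Central's actual data model) of a trace with its used and genBy relations, computing tr.I and tr.O exactly as defined above:

```java
// Minimal, illustrative trace model (hypothetical names, Java 16+ records):
// activities A, data items D, and the used / genBy relations over ports.
import java.util.*;

class Trace {
    record Used(String activity, String data, String port) {}   // used  ⊂ A × D × P
    record GenBy(String data, String activity, String port) {}  // genBy ⊂ D × A × P

    final Set<String> activities = new HashSet<>();
    final Set<String> data = new HashSet<>();
    final List<Used> used = new ArrayList<>();
    final List<GenBy> genBy = new ArrayList<>();

    // tr.I: data items that no activity generated (the workflow inputs)
    Set<String> inputs() {
        Set<String> generated = new HashSet<>();
        for (GenBy g : genBy) generated.add(g.data());
        Set<String> in = new HashSet<>(data);
        in.removeAll(generated);
        return in;
    }

    // tr.O: data items that no activity used (the workflow outputs)
    Set<String> outputs() {
        Set<String> consumed = new HashSet<>();
        for (Used u : used) consumed.add(u.data());
        Set<String> out = new HashSet<>(data);
        out.removeAll(consumed);
        return out;
    }
}
```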
Workflow evolution
Each of the elements in an execution may evolve semi-independently from the others.
Repeatability:
• Can tr_t be computed again at some time t' > t?
• This requires saving ED_t, which may be impractical (e.g. a large DB state)
[Figure: evolution timeline over t1..t5. W, ED, d, and wfms each acquire new versions (W1 → W2, ED1 → ED3 → ED5, d1 → d3, wfms1 → wfms4), giving rise to the executions:]
tr_1 = exec_1(W_1, ED_1, d_1, wfms_1)
tr_2 = exec_2(W_2, ED_1, d_1, wfms_1)
tr_3 = exec_3(W_2, ED_3, d_3, wfms_1)
tr_4 = exec_4(W_2, ED_3, d_3, wfms_4)
tr_5 = exec_5(W_2, ED_5, d_3, wfms_4)
Reproducibility
Given tr_t = exec_t(W_i, ED_j, d_h, wfms_k), with i, j, h, k < t, can a new version tr_t' = exec_t'(W_i', ED_j', d_h', wfms_k') be computed at some later time t' > t, after one or more of the elements has changed?
Potential issues:
• W_i may not run with the new ED_j'
• W_i may not run with wfms_k'
• W_i' may not run with d_h'
• ...
Data divergence analysis using provenance
• All work done with reference to the e-Science Central WFMS
• Assumption: workflow WF_j (the new version) runs to completion
– thus it produces a new provenance trace
– however, it may be dysfunctional relative to WF_i (the original)
• Example: only the input data changes: d ≠ d', WF_j = WF_i
tr_t = exec_t(W, ED, d, wfms), tr_t' = exec_t'(W, ED, d', wfms)
[Figure: example workflow consisting of services S0..S4.]
Note: results may diverge even when the input datasets are identical, for example when one or more of the services exhibits non-deterministic behaviour, or depends on external state that has changed between executions.
Reproducibility requires comparing datasets
• Experimenters may validate results by deliberately altering the experimental settings (W_i', d_j')
• The outcomes will not be identical, but
– are they similar enough to the original to conclude that the experiment was successfully reproduced?
Δ_D(tr_t.O, tr_t'.O)
• Data comparison is, in general, type- and format-dependent
• Example:
– workflow output: a classification model computed using model builders
– two models may be different but statistically equivalent
• e-Science Central accommodates user-defined data diff blocks (see the sketch below)
– these are just Java-based workflow blocks
if Δ_D(tr_t.O, tr_t'.O) > threshold: why are the results diverging?
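As an illustration, a user-defined diff block could be as simple as the following (a hypothetical sketch: the class name and the numeric Δ_D are invented for this writeup and do not reflect e-Science Central's actual block API):

```java
// Hypothetical user-defined data diff block: compares two workflow outputs
// (here, plain numeric vectors) and reports divergence beyond a tolerance.
public class NumericDiffBlock {
    private final double threshold;

    public NumericDiffBlock(double threshold) { this.threshold = threshold; }

    // Delta_D for numeric vectors: mean absolute difference of the entries.
    public double delta(double[] out1, double[] out2) {
        if (out1.length != out2.length)
            return Double.POSITIVE_INFINITY;  // structurally different outputs
        double sum = 0;
        for (int i = 0; i < out1.length; i++)
            sum += Math.abs(out1[i] - out2[i]);
        return sum / out1.length;
    }

    // if this returns true, PDIFF is invoked to explain why results diverge
    public boolean diverges(double[] out1, double[] out2) {
        return delta(out1, out2) > threshold;
    }
}
```

For richer output types the diff logic would differ accordingly, e.g. a statistical equivalence test when the outputs are classification models.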
Provenance traces for two runs
[Figure: (i) Trace A and (ii) Trace B for two runs of the same workflow (services S0..S4, ports P0/P1), drawn as alternating used / genBy edges. The runs start from different inputs (d1 vs d1'), which propagate to different intermediate data (d2 vs d2', w vs w', y vs y', with x, z, and d3 unchanged) and finally to different outputs (df vs df').]
Delta graphs
A delta graph is a graph obtained as the result of a "diff" over two traces, which can be used to explain observed differences in workflow outputs in terms of differences throughout the two executions.
This is the simplest possible delta "graph":
[Figure: (i) Trace A and (ii) Trace B as above, with (iii) the delta tree: the chain of divergent pairs ⟨df, df'⟩, ⟨y, y'⟩, ⟨w, w'⟩, ⟨d2, d2'⟩ leading from the outputs back towards the inputs.]
More involved workflow differences:
• S0 is followed by S0' in WA but not in WB;
• S3 is preceded by S3' in WB but not in WA;
• S2 in WA is replaced by a new version, S2v2, in WB;
• S1 in WA is replaced by S5 in WB.
tr_t = exec_t(W, ED, d, wfms), tr_t' = exec_t'(W', ED', d, wfms)
[Figure: workflows WA and WB, in which service S is replaced by a new version Sv2, followed by the corresponding traces (i) Trace A and (ii) Trace B over services S0..S5, ports P0/P1, and data items d0, d1, d2, h, k, w, x, y, z and their primed counterparts.]
Delta graph computed by PDIFF
[Figure: the delta graph computed by PDIFF. Starting from the diverging outputs ⟨x, x'⟩ and following the P0 and P1 branches of S4 and of S2, the graph records the divergent data pairs ⟨y, y'⟩, ⟨z, z'⟩, ⟨w, w'⟩, ⟨k, k'⟩, ⟨h, h'⟩, and ⟨d1, d1'⟩, the structural differences involving S0' and S3', the service replacement ⟨S1, S5⟩, the version changes ⟨S2, S2v2⟩ and ⟨S, Sv2⟩, and the pair ⟨S0, S0'⟩.]
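The traversal behind such a delta graph can be sketched as follows (a simplified illustration reusing the hypothetical Trace class above; the published PDIFF algorithm additionally matches workflow structure, handles added and removed services, and delegates data comparison to type-specific Δ_D functions rather than plain identity):

```java
import java.util.*;

// Simplified sketch of PDIFF's backward comparison of two traces (not the
// full published algorithm; data items are compared by identity only).
class PDiffSketch {
    // each delta-graph node is a divergent pair <A-side, B-side>
    final List<String[]> deltaNodes = new ArrayList<>();
    private final Set<String> visited = new HashSet<>();

    // walk upstream from a pair of corresponding outputs of traces a and b
    void compare(Trace a, Trace b, String dA, String dB) {
        if (dA.equals(dB) || !visited.add(dA + "|" + dB))
            return;                                // converged, or already seen
        deltaNodes.add(new String[]{dA, dB});      // record data-level divergence
        String actA = generatorOf(a, dA), actB = generatorOf(b, dB);
        if (actA == null || actB == null) return;  // reached the workflow inputs
        if (!actA.equals(actB))                    // service replaced / new version
            deltaNodes.add(new String[]{actA, actB});
        // recurse on the data the two activities consumed, matched by port
        for (Trace.Used uA : a.used) {
            if (!uA.activity().equals(actA)) continue;
            for (Trace.Used uB : b.used)
                if (uB.activity().equals(actB) && uB.port().equals(uA.port()))
                    compare(a, b, uA.data(), uB.data());
        }
    }

    // the activity that generated d, or null if d is a workflow input
    String generatorOf(Trace t, String d) {
        for (Trace.GenBy g : t.genBy)
            if (g.data().equals(d)) return g.activity();
        return null;
    }
}
```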
Summary
• Setting:
– scientific results computed using workflows
– openness / data sharing has the potential to accelerate science
– but this requires results validation and reproducibility
• Problem: reproducibility is hard to achieve
– workflow decay
– evolution of data, workflow spec, dependencies, wf engine
• Goal: support divergence analysis
• Approach: PDIFF -- comparing provenance traces generated during the runs
Selected references
P. Missier, S. Woodman, H. Hiden, P. Watson. Provenance and data differencing for workflow reproducibility analysis. Concurrency and Computation: Practice and Experience, 2013. In press.
Zhao J, Gomez-Perez J, Belhajjame K, Klyne G, et al. Why Workflows Break - Understanding and Combating Decay in Taverna Workflows. In Procs. e-Science Conference, Chicago, 2012.
Cohen-Boulakia S, Leser U. Search, adapt, and reuse: the future of scientific workflows. SIGMOD Rec. Sep 2011; 40(2):6–16, doi:http://doi.acm.org/10.1145/2034863.2034865.
Peng RD, Dominici F, Zeger SL. Reproducible Epidemiologic Research. American Journal of Epidemiology 2006; 163(9):783–789, doi:10.1093/aje/kwj093.
Drummond C. Replicability is not Reproducibility: Nor is it Good Science. Procs. 4th Workshop on Evaluation Methods for Machine Learning, in conjunction with ICML 2009, Montreal, Canada, 2009.
Peng R. Reproducible Research in Computational Science. Science Dec 2011; 334(6060):1226–1227.
Schwab M, Karrenbach M, Claerbout J. Making Scientific Computations Reproducible. Computing in Science & Engineering 2000; 2(6):61–67.
Mesirov J. Accessible Reproducible Research. Science 2010; 327.