Repro pdiff-talk (invited, Humboldt University, Berlin)

Provenance and data differencing for workflow reproducibility analysis
Paolo Missier, School of Computing Science, Newcastle University, UK
Humboldt University, Berlin, March 4, 2013


DESCRIPTION

See paper: Missier, Paolo, Simon Woodman, Hugo Hiden, and Paul Watson. “Provenance and Data Differencing for Workflow Reproducibility Analysis.” Concurrency and Computation: Practice and Experience. In Press (2013).

TRANSCRIPT

Page 1

Provenance and data differencing for workflow reproducibility analysis

Paolo Missier, School of Computing Science

Newcastle University, UK

Humboldt University, Berlin
March 4, 2013

Page 2

Provenance Metadata

(*) Definitions proposed by the W3C Incubator Group on provenance: http://www.w3.org/2005/Incubator/prov/wiki/What_Is_Provenance

Provenance is a description of how things came to be, and how they came to be in the state they are in today (*)

Provenance refers to the sources of information, including entities and processes, involved in producing or delivering an artifact (*)

Why does it matter?

• To establish quality, relevance, trust

• To track information attribution through complex transformations

• To describe one’s experiment to others, for understanding / reuse

• To provide evidence in support of scientific claims

• To enable post hoc process analysis for debugging, improvement, evolution

Page 3

A colourful provenance graph

[Figure: a PROV-style provenance graph. In an editing phase (remote past), drafting, commenting, and editing activities generate draft v1, draft comments, and draft v2, connected by used, wasGeneratedBy, and wasDerivedFrom edges; a reading activity used paper3. The agents Bob (with specializations Bob-1 and Bob-2) and Alice are attached via wasAssociatedWith, wasAttributedTo, and actedOnBehalfOf, with role annotations such as main_editor, jr_editor, editor, and author. In a publishing phase (recent past), a guideline update and the publication of draft v2 involve pub guidelines v1 and v2, Charlie (role headOfPublication), Alice, and the w3c:consortium (type institution, role issuer). Entity annotations record distribution, status, and version attributes.]

Page 4

Motivation: Reproducibility in e-science

• Setting: Collaborative, Open Science
– Increasing rate of data sharing in science
• The stick: both journals and funders demand that data be uploaded
– Multiple data journals and data repositories emerging
• The carrot: data is given a DOI and is citable, so scientists get credit


• Thomson’s Data Citation Index

• Dryad data repository for biosciences (*)

• The DataBib repository of research data

• NSF Data Preservation projects: DataONE

• best practices document: notebooks.dataone.org/bestpractices/

• ... and many others

(*) As of Jan 27, 2013, Dryad contains 2585 data packages and 7097 data files, associated with articles in 187 journals.

Page 5

General problems

• Quality assurance
– from non-malicious errors in method or data, all the way to fraud
– ... leading to retractions in scientific publications
• see e.g. http://retractionwatch.wordpress.com/
• Repeatability
– If I replicate your experiment / repeat your process on the same data, will I get the same results?
• Reproducibility -- a more general notion


The ability for a third party who has access to the description of the original experiment and its results to reproduce those results, using a possibly different setting, with the goal of confirming or disputing the original experimenter’s claims.

Page 6

Specifically, in e-science...

• Experimental method → scripts, programs, workflows
• Publication = results + {program, workflow} + evidence of results
• Repeatability, reproducibility
– will I be able to run (my version of) your workflow on (my version of) your input and compare my results to yours?
• Evidence of result: provenance of {program, workflow} execution
• Side note: portability issues are out of scope
– VMs often solve the problem, with some limitations
• not when workflows depend on third-party services
• only for limited-size data dependencies


Main issue: Workflow evolution and decay

Page 7

Mapping the reproducibility space

[Figure: a map of the reproducibility space. Experimental variations span four cases: same data and workflow (d, wf): results confirmation / repeatability; method variation (d, wf → wf′); data variation (d → d′, wf); and data and method variation (d → d′, wf → wf′), moving from repeatability towards reproducibility. Environmental variations (ED → ED′, wfms → wfms′), e.g. service updates and state changes, define a decay region where the workflow becomes dysfunctional or non-functioning, calling for exception analysis / debugging and divergence analysis.]

Goal: to help scientists understand the effect of workflow / data / dependencies evolution on workflow execution results.
Approach: compare the provenance traces generated during the runs: PDIFF.

P. Missier, S. Woodman, H Hiden, P. Watson. Provenance and data differencing for workflow reproducibility analysis, Concurrency Computat.: Pract. Exper., 2013. In press.


Page 9

Decay

• Workflows that have external dependencies are harder to maintain
– they may become dysfunctional or break altogether


Zhao, Jun, Jose Gomez-Perez, Khalid Belhajjame, Graham Klyne, et al. “Why Workflows Break: Understanding and Combating Decay in Taverna Workflows.” In Procs. e-Science Conference, Chicago, 2012.


Page 11

Workflows and provenance traces


Workflow (structure): a directed graph W = (T, E)
– T: set of tasks (computational units)
– P: set of (input, output) ports associated with each task t ∈ T
– E ⊂ T × T: graph edges representing data dependencies

⟨ti.pA, tj.pB⟩ ∈ E: data produced by ti on port pA ∈ P is routed to port pB ∈ P of tj

Execution trace: tr = exec(W, d, ED, wfms)
– A: activities
– D: data items
– R = {used, genBy} relations: used ⊂ A × D × P, genBy ⊂ D × A × P

Workflow inputs: tr.I = {d ∈ D | ∀a ∈ A, ∀p ∈ P: (d, a, p) ∉ genBy}
Workflow outputs: tr.O = {d ∈ D | ∀a ∈ A, ∀p ∈ P: (a, d, p) ∉ used}
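To make these definitions concrete, here is a minimal Python sketch (not e-Science Central code; all names are illustrative) of a trace as two relations, with inputs and outputs derived exactly as defined above:

from dataclasses import dataclass, field

@dataclass
class Trace:
    """Execution trace: data items plus the used and genBy relations."""
    data: set = field(default_factory=set)     # D
    used: set = field(default_factory=set)     # tuples (a, d, p) in A x D x P
    gen_by: set = field(default_factory=set)   # tuples (d, a, p) in D x A x P

    def inputs(self):
        """tr.I: data items not generated by any activity."""
        return self.data - {d for (d, a, p) in self.gen_by}

    def outputs(self):
        """tr.O: data items not used by any activity."""
        return self.data - {d for (a, d, p) in self.used}

For trace A on page 16, inputs() would return {d1, d2, d3} and outputs() would return {df}.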

Page 12

Workflow evolution

Each of the elements in an execution may evolve (semi-)independently of the others:

tr = exec(W, d, ED, wfms)
tr_t = exec_t(W_i, ED_j, d_h, wfms_k), with i, j, h, k < t

Repeatability:
• Can tr_t be computed again at some time t′ > t?
• Requires saving ED_t, but this may be impractical (e.g. large DB state)

[Figure: a version timeline for W, ED, d, and wfms, with executions tr1 = exec1(W1, ED1, d1, wfms1), tr2 = exec2(W2, ED1, d1, wfms1), tr3 = exec3(W2, ED3, d3, wfms1), tr4 = exec4(W2, ED3, d3, wfms4), and tr5 = exec5(W2, ED5, d3, wfms4).]
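As a sketch of the bookkeeping this implies (hypothetical names, not e-Science Central code), each run can record the version of every element it used, and repeatability at a later time reduces to whether those versions, ED in particular, are still available:

from dataclasses import dataclass

@dataclass(frozen=True)
class ExecRecord:
    """tr_t = exec_t(W_i, ED_j, d_h, wfms_k): version of each element."""
    t: int      # execution time
    w: int      # workflow version i
    ed: int     # external-dependency snapshot j
    d: int      # input data version h
    wfms: int   # workflow engine version k

def repeatable(tr, archived_eds):
    """A run can be repeated at t' > t only if the ED snapshot it used
    was saved -- often impractical, e.g. for large database states."""
    return tr.ed in archived_eds

tr3 = ExecRecord(t=3, w=2, ed=3, d=3, wfms=1)   # tr3 = exec3(W2, ED3, d3, wfms1)
print(repeatable(tr3, archived_eds={1}))        # False: ED3 was never archived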

Page 13

Reproducibility


tr_t = exec_t(W_i, ED_j, d_h, wfms_k), with i, j, h, k < t
tr_t′ = exec_t′(W_i′, ED_j′, d_h′, wfms_k′)

Can a new version tr_t′ of tr_t be computed at some later time t′ > t, after one or more of the elements has changed?

Potential issues:
• W_i may not run with the new ED_j′
• W_i may not run with wfms_k′
• W_i′ may not run with d_h′
• ...

[Figure: the same version timeline and executions tr1...tr5 as on page 12.]

Page 14

Data divergence analysis using provenance

• All work done with reference to the e-Science Central WFMS
• Assumption: workflow WFj (the new version) runs to completion
– thus it produces a new provenance trace
– however, it may be dysfunctional relative to WFi (the original)
• Example: only the input data changes: d ≠ d′, WFj == WFi

tr_t = exec_t(W, ED, d, wfms), tr_t′ = exec_t′(W, ED, d′, wfms)

[Figure: the example workflow, services S0...S4.]

Note: results may diverge even when the input datasets are identical, for example when one or more of the services exhibits non-deterministic behaviour, or depends on external state that has changed between executions.

Page 15

Reproducibility requires comparing datasets

• Experimenters may validate results by deliberately altering the experimental settings (W_i′, d_j′)
• The outcomes will not be identical, but
– are they similar enough to the original to conclude that the experiment was successfully reproduced?

Δ_D(tr_t.O, tr_t′.O)

• Data comparison is type- and format-dependent in general

• Example:
– workflow output: a classification model computed using model builders
– two models may be different but statistically equivalent
• e-Science Central accommodates user-defined data diff blocks
– these are just Java-based workflow blocks

if Δ_D(tr_t.O, tr_t′.O) > threshold: why are results diverging?
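For illustration only, a user-defined diff for this example might compare two classification models by held-out accuracy rather than by their parameters, flagging divergence only beyond a tolerance. In e-Science Central such a diff would be a Java workflow block; this Python sketch and its names are assumptions, not the system's API:

def model_diff(model_a, model_b, X_test, y_test, tol=0.02):
    """Two models count as statistically equivalent here if their
    held-out accuracies differ by at most tol, even when their
    internal parameters differ."""
    acc_a = (model_a.predict(X_test) == y_test).mean()
    acc_b = (model_b.predict(X_test) == y_test).mean()
    delta = abs(acc_a - acc_b)
    return delta, delta > tol   # (delta_D, diverging?)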

Page 16

Provenance traces for two runs

[Figure: provenance traces for two runs of the same workflow (services S0...S4, ports P0 and P1), drawn with used and genBy edges. (i) Trace A: inputs d1, d2, d3 flow through S0...S3, producing intermediate data z, w, x, y and final output df at S4. (ii) Trace B: with changed inputs d1′ and d2′, the run produces z, w′, x, y′ and final output df′.]

Page 17

Delta graphs


A graph obtained as the result of a “diff” over two traces, which can be used to explain observed differences in workflow outputs in terms of differences throughout the two executions.

This is the simplest possible delta “graph”: for traces A and B above, the diff yields the chain ⟨df, df′⟩ → ⟨y, y′⟩ → ⟨w, w′⟩ → ⟨d2, d2′⟩, tracing the divergence in the final outputs back through the intermediate data to the changed input d2.

[Figure: traces A and B from the previous page shown side by side, with (iii) the delta tree.]
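A minimal sketch of the backward comparison that produces such a delta tree, reusing the Trace structure sketched on page 11 (this illustrates only the data-mismatch case; the full PDIFF algorithm in the paper also records service replacements, version changes, and structural workflow differences, as on the following pages):

def delta_tree(tr_a, tr_b, d_a, d_b, differ):
    """Explain why d_a (trace A) and d_b (trace B) diverge: emit the
    mismatched pair, then recurse on the inputs, paired port by port,
    of the activities that generated them."""
    if not differ(d_a, d_b):
        return None                          # values match: nothing to explain
    node = {"pair": (d_a, d_b), "causes": []}
    gen_a = {p: a for (d, a, p) in tr_a.gen_by if d == d_a}
    gen_b = {p: a for (d, a, p) in tr_b.gen_by if d == d_b}
    for port in gen_a.keys() & gen_b.keys():
        in_a = {p: d for (a, d, p) in tr_a.used if a == gen_a[port]}
        in_b = {p: d for (a, d, p) in tr_b.used if a == gen_b[port]}
        for p in in_a.keys() & in_b.keys():
            child = delta_tree(tr_a, tr_b, in_a[p], in_b[p], differ)
            if child:
                node["causes"].append(child)
    return node

On traces A and B from page 16, starting from the pair (df, df′) this recursion yields exactly the chain ⟨df, df′⟩ → ⟨y, y′⟩ → ⟨w, w′⟩ → ⟨d2, d2′⟩.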

Page 18

More involved workflow differences

• S0 is followed by S0′ in WA but not in WB;
• S3 is preceded by S3′ in WB but not in WA;
• S2 in WA is replaced by a new version, S2v2, in WB;
• S1 in WA is replaced by S5 in WB.

[Figure: the two workflow graphs WA and WB, including the service version change S → Sv2.]

Page 19

The corresponding traces

tr_t = exec_t(W, ED, d, wfms), tr_t′ = exec_t′(W′, ED′, d, wfms)

[Figure: (i) Trace A for WA: input d1 flows through S0, S0′, S1, S2, S3 into S4, producing intermediate data h, k, w, x, y, z and output d2. (ii) Trace B for WB: input d1′ flows through S0, S3′, S5, S2v2, S3 into S4, producing h′, k′, w′, x′, y′, z′ and output d2. The version change S → Sv2 is marked on the corresponding service.]

Page 20

Delta graph computed by PDIFF

[Figure: the delta graph computed by PDIFF. The mismatched data pairs ⟨x, x′⟩, ⟨y, y′⟩, ⟨z, z′⟩, ⟨w, w′⟩, ⟨k, k′⟩, ⟨h, h′⟩, and ⟨d1, d1′⟩, arranged along the P0 and P1 branches of S4 and S2, are explained by the service replacement ⟨S1, S5⟩, the version changes ⟨S2, S2v2⟩ and ⟨S, Sv2⟩, and the services S0′ and S3′ present in only one workflow.]

Page 21

Summary

• Setting:
– scientific results computed using workflows
– openness / data sharing has the potential to accelerate science
– but requires results validation and reproducibility
• Problem: reproducibility is hard to achieve
– workflow decay
– evolution of data, workflow spec, dependencies, wf engine
• Goal: support divergence analysis
• Approach: PDIFF -- comparing provenance traces generated during the runs


Page 22

Selected references


Missier P, Woodman S, Hiden H, Watson P. Provenance and data differencing for workflow reproducibility analysis. Concurrency and Computation: Practice and Experience, 2013. In press.

Zhao J, Gomez-Perez J, Belhajjame K, Klyne G, et al. Why workflows break: understanding and combating decay in Taverna workflows. Procs. e-Science Conference, Chicago, 2012.

Cohen-Boulakia S, Leser U. Search, adapt, and reuse: the future of scientific workflows. SIGMOD Rec. Sep 2011; 40(2):6–16. doi:10.1145/2034863.2034865

Peng RD, Dominici F, Zeger SL. Reproducible Epidemiologic Research. American Journal of Epidemiology 2006; 163(9):783–789. doi:10.1093/aje/kwj093

Drummond C. Replicability is not Reproducibility: Nor is it Good Science. Procs. 4th Workshop on Evaluation Methods for Machine Learning, in conjunction with ICML 2009, Montreal, Canada, 2009.

Peng R. Reproducible Research in Computational Science. Science Dec 2011; 334(6060):1226–1227

Schwab M, Karrenbach M, Claerbout J. Making Scientific Computations Reproducible. Computing in Science & Engineering 2000; 2(6):61–67

Mesirov J. Accessible Reproducible Research. Science 2010; 327