Manish Kumar Anand [email protected] Eighth Biennial Ptolemy Miniconference Berkeley, California A Provenance Framework to Capture, Store, Query, and Browse Data Lineage in Kepler



2

Scientific Workflows

• Discoveries achieved via complex computations
• Workflows replacing traditional scripting approaches
• Enable automation, reproducibility, sharing, provenance

[Figure: a Perl script shown next to a scientific workflow system]

3

Provenance

• A record of processes, inputs/outputs, dependencies
• Supports reproducibility, interpretation, verification

[Figure: a workflow with the actors AlignWarp, Reslice, Softmean, Slicer, and Convert, with its inputs and outputs, and the provenance graph of one run linking the data items to the invocations alignWarp:1 to alignWarp:4, reslice:1 to reslice:4, softmean:1, slicer:1 to slicer:3, and convert:1 to convert:3]

4

Outline

• Capturing Provenance
• Storing Provenance
• Querying Provenance
• Browsing Provenance

5

Conventional Provenance Models

• Records
  – Inputs/outputs of invocations
• Infers
  – Data dependency
  – Invocation dependency
• Assumptions:
  – Data is atomic
  – Invocations consume all inputs and produce new outputs
  – Every output depends on all inputs

Recorded facts: input(a, s1), output(a, s2), input(b, s2), input(c, s2), …

[Figure: workflow execution graph, with the data dependencies and invocation dependencies inferred from it]
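The inference step above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It reads the slide's `input(a, s1)` as "invocation a read data item s1", which is an assumption about the notation, and the function name `infer_dependencies` is made up for the example.

```python
from collections import defaultdict

# Recorded facts in the slide's notation, read as (invocation, data item).
inputs = [("a", "s1"), ("b", "s2"), ("c", "s2")]
outputs = [("a", "s2")]


def infer_dependencies(inputs, outputs):
    """Infer dependencies, assuming every output depends on all inputs."""
    read_by_invocation = defaultdict(set)
    for invocation, item in inputs:
        read_by_invocation[invocation].add(item)

    producer = {item: invocation for invocation, item in outputs}

    # Data dependency: each output of an invocation depends on each of its inputs.
    data_deps = {
        (item, source)
        for invocation, item in outputs
        for source in read_by_invocation[invocation]
    }
    # Invocation dependency: a consumer depends on the producer of what it reads.
    invocation_deps = {
        (invocation, producer[item])
        for invocation, item in inputs
        if item in producer
    }
    return data_deps, invocation_deps


data_deps, invocation_deps = infer_dependencies(inputs, outputs)
print(sorted(data_deps))        # [('s2', 's1')]
print(sorted(invocation_deps))  # [('b', 'a'), ('c', 'a')]
```

Because the model treats data as atomic and makes every output depend on every input, it cannot express an invocation that passes part of its input through unchanged.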

6

Challenges in Modeling Provenance

Many scientific workflow systems also support:
a) Both data “transformers” and “pass-through”
b) Processes with different dependency patterns
c) Structured data (XML)

• Models of provenance must consider these factors

[Figure: three example traces, (a), (b), and (c), each showing an invocation a over data items s1, s2, s3, …]

7

Unified Provenance Model

8

Efficient Provenance Representation

• Instead of storing each version
  – Only store a single combined version
• Along with a set of updates (Δ's)
  – Updates and dependencies represented as annotations

[Figure: a nested collection over nodes 1 to 6 before and after invocation a, and the single combined version annotated with +a (insertion) and -a (deletion), in condensed and in expanded form]

Condensed: Δa = {ins(5,a), dep(5,2), del(3,a)}

Expanded: Δa = {ins(5,a), dep(5,2), del(3,a), ins(6,a), dep(5,3), dep(5,4), dep(6,2), dep(6,3), dep(6,4)}

9

Expanding and Condensing Traces

[Figure: the annotated tree over nodes 1 to 6, with its +a and -a annotations, shown in expanded form and in condensed form]

10

Trace Views

[Figure: trace views of a run through alignwarp, reslicewarp, softmean, slicer, and convert, showing the nested collection structures S1 to S6 (Images, AnatomyImage, RefImage, WarpParamSet, ReslicedImage, AtlasImage, AtlasSlice, AtlasGraphic, each with Image and Header children where present) and node identifiers 1 to 19]

Condensed Trace

Using a postorder (i.e., bottom-up, left-to-right) traversal:
• Remove annotations from a node n
  (i) dep(n,c) if dep(n,p) and child(p,c)
  (ii) dep(n,d) if child(p,n) and dep(p,d)
  (iii) ins(n,x) if child(p,n) and ins(p,x)
  (iv) del(n,y) if child(p,n) and del(p,y)
• Remove invocation order annotations
  – Those implied according to rules (3-8)

Expanded Trace

Uses three distinct preorder (i.e., top-down, left-to-right) traversals:
• Pass 1: rules (1-2) and rules (3-5)
  – Infers insertion and deletion annotations
  – Infers invocation order from nodes and parent-child relationships
• Pass 2: rules (6-8)
  – Infers remaining invocation precedence relationships
• Pass 3: rules (9-10)
  – Expands dependency sets and propagates dependencies to child nodes
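Removal rules (i) to (iv) and the propagation they undo can be tried out on the update set from the Efficient Provenance Representation slide. The sketch below is set-based rather than the traversal-based algorithms the slide describes, and it leaves out the invocation-order rules, which the slides do not list. The tree shape (nodes 2 and 5 under root 1, nodes 3 and 4 under 2, node 6 under 5) is an assumption read off the example, and the names `PARENT`, `subtree`, `expand`, and `condense` are illustrative.

```python
# Assumed tree for the slide's example: node -> parent (node 1 is the root).
PARENT = {2: 1, 5: 1, 3: 2, 4: 2, 6: 5}


def subtree(node):
    """Return the node together with all of its descendants."""
    nodes = {node}
    changed = True
    while changed:
        changed = False
        for child, parent in PARENT.items():
            if parent in nodes and child not in nodes:
                nodes.add(child)
                changed = True
    return nodes


def expand(annotations):
    """Propagate ins/del to descendants and dep to both subtrees."""
    expanded = set()
    for kind, node, target in annotations:
        for n in subtree(node):
            if kind == "dep":
                expanded.update(("dep", n, d) for d in subtree(target))
            else:  # "ins" or "del": target names the invocation
                expanded.add((kind, n, target))
    return expanded


def condense(annotations):
    """Drop every annotation that rules (i)-(iv) mark as implied."""
    def implied(kind, node, target):
        parent = PARENT.get(node)
        # Rules (ii), (iii), (iv): the parent of the node carries the same annotation.
        if parent is not None and (kind, parent, target) in annotations:
            return True
        # Rule (i): the node already depends on the parent of the target.
        if kind == "dep":
            target_parent = PARENT.get(target)
            return target_parent is not None and ("dep", node, target_parent) in annotations
        return False

    return {a for a in annotations if not implied(*a)}


condensed = {("ins", 5, "a"), ("dep", 5, 2), ("del", 3, "a")}
expanded = expand(condensed)
print(len(expanded))                    # 9
print(condense(expanded) == condensed)  # True
```

On this example the expanded set has the same nine annotations as the expanded Δa on the slide, and condensing it returns the three-annotation condensed form.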

11

Outline

• Capturing Provenance
• Storing Provenance
• Querying Provenance
• Browsing Provenance

12

Storage Strategies

Use standard relational DBMS and minimize storage size, update time and query time

• Store immediate and transitive dependencies
  – Faster query execution
• Reduction techniques
  – Represent dependencies in reduced form

13

Storage Strategies

• 5 storage strategies
  – NC: Naive Collapsed
  – NE: Naive Expanded
  – SE: Simple Expanded
  – RE: Reduced Expanded
  – RC: Reduced Collapsed
• Compare:
  – Storage size, update time, query time

[Figure: what each strategy stores. NC: collapsed trace. NE: expanded trace. SE: expanded trace and transitive dependencies. RE: expanded trace and transitive dependencies, reduced. RC: collapsed trace and transitive dependencies, reduced. The reduced forms are produced by reduction algorithms.]

14

Analysis of Storage Strategies

• SE: Worst storage size and update time
• RC: Very expensive query time
• RE: Recommended storage strategy

Storage size: RC < RE < NC < NE < SE
Update time: RC < RE < NC < NE < SE
Query time: SE < RE < NE < RC < NC

[Figures (mixed pattern), each plotted against the number of trace nodes, 0 to 6000: transitive dependency selection time in seconds for NE, SE, RE, and RC; storage size in thousands of cells for NE, NC, SE, RE, and RC; update time in seconds for NE, NC, SE, RE, and RC]

15

Outline

• Capturing Provenance
• Storing Provenance
• Querying Provenance
• Browsing Provenance

16

Querying Provenance can be Expensive

• Queries are often recursive
  – Complex to formulate
  – Expensive to evaluate
• Standard querying approaches
  – Tied to storage representation
  – Require query language expertise
• Need to query across structures, lineage, or both
• How to express provenance queries easily and execute them efficiently?

(Q) Select lineage path that derived all children of AtlasImage created by slicer

[Figure: trace of the run alignwarp:1, reslicewarp:1, softmean:1, slicer:1, convert:1 over the collection structures S1 to S6, with its two dimensions labeled Structures and Lineage]

17

To Express this Query …

SQL (e.g., transitive dependencies)

    select t.runId, t2.nodeId, t.nodeId as depNodeId
    from (select d1.runId, d1.pDep, d1.nodeId
          from dependency d1
          where runId = runId_in
          union
          select p1.runId, p1.fromPointer as pDep, d2.nodeId
          from dependency d2, depSubsetPointer p1
          where p1.runId = runId_in
            and d2.runId = runId_in
            and d2.pDep = p1.toPointer) as t,
         depMinMaxPointer p2,
         (select t.runId, r1.nodeId, t.pDep
          from (select dc1.runId, dc1.pDepC, dc1.pDep
                from depCdepPointer dc1
                where runId = runId_in
                union
                select p1.runId, p1.fromPointer as pDepC, dc2.pDep
                from depCdepPointer dc2, depCSubsetPointer p1
                where p1.runId = runId_in
                  and dc2.runId = runId_in
                  and dc2.pDepC = p1.toPointer) as t,
               depCMinMaxPointer p2,
               runCollData r1,
               runItemProv rp1
          where p2.runId = runId_in
            and r1.runId = runId_in
            and rp1.runId = runId_in
            and r1.nodeId = nodeId_in
            and r1.pointer = rp1.pointer
            and rp1.pDep = p2.fromPointer
            and t.pDepC = p2.toPointer
            and t.pDep between p2.depMin and p2.depMax
          union
          ……

• Hard for domain scientists (… and SQL experts)
• Optimization depends on SQL engine [He et al. SIGMOD 08]
• Need for higher-level provenance query language

SQL (stored procedures)

    create procedure depc(in runId_in varchar(255), in nodeId_in integer)
    begin
      declare finished integer default 0;
      declare cur_1 cursor for
        select depNodeId from dependency
        where runId = runId_in and itemNodeId = nodeId_tmp;
      set nodeId_tmp = nodeId_in;
      set depCnt = (select count(*) from dependency
                    where runId = runId_in and itemNodeId = nodeId_tmp);
      if (depCnt is not null) then
        -- Seed depcT with the immediate dependencies of the start node.
        open cur_1;
        get_cur_1: loop
          fetch cur_1 into depNodeId_tmp;
          if finished then
            leave get_cur_1;
          end if;
          insert into depcT (nodeId) values (depNodeId_tmp);
        end loop get_cur_1;
        close cur_1;
        set cnt = 1;
        -- Visit the collected nodes in turn and add dependencies not yet in depcT.
        while (cnt <= depCnt) do
          set nodeId_tmp = (select nodeId from depcT where no = cnt);
          set row_limit = (select count(*) from dependency
                           where itemNodeId = nodeId_tmp and runId = runId_in);
          set row_cnt = 0;
          open cur_1;
          get_cur_1: loop
            fetch cur_1 into depNodeId_tmp;
            set flag = (select 1 from depcT where nodeId = depNodeId_tmp);
            if (flag is null) then
              insert into depcT (nodeId) values (depNodeId_tmp);
            end if;
            if (row_cnt > row_limit) then
              leave get_cur_1;
            end if;
            set row_cnt = row_cnt + 1;
            …
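For comparison, the transitive-dependency lookup that the stored procedure builds by hand can also be written as a recursive common table expression. The sketch below is not the approach taken in the talk. It runs against an in-memory SQLite database through Python's `sqlite3` module, reuses the `dependency` table and its `runId`, `itemNodeId`, and `depNodeId` columns from the procedure above, and the rows and the helper name `transitive_dependencies` are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dependency (runId TEXT, itemNodeId INTEGER, depNodeId INTEGER)"
)
# Hypothetical immediate dependencies: 18 <- 16 <- 14 <- 12, and 14 <- 13.
conn.executemany(
    "INSERT INTO dependency VALUES (?, ?, ?)",
    [
        ("run1", 18, 16),
        ("run1", 16, 14),
        ("run1", 14, 12),
        ("run1", 14, 13),
        ("run2", 18, 99),
    ],
)

# UNION (not UNION ALL) discards duplicates, so the recursion also ends on cycles.
TRANSITIVE_DEPS = """
WITH RECURSIVE depc(nodeId) AS (
    SELECT depNodeId FROM dependency
    WHERE runId = :run AND itemNodeId = :node
    UNION
    SELECT d.depNodeId FROM dependency d
    JOIN depc ON d.itemNodeId = depc.nodeId
    WHERE d.runId = :run
)
SELECT nodeId FROM depc ORDER BY nodeId
"""


def transitive_dependencies(run_id, node_id):
    """Return every node that node_id depends on, directly or indirectly."""
    rows = conn.execute(TRANSITIVE_DEPS, {"run": run_id, "node": node_id})
    return [row[0] for row in rows]


print(transitive_dependencies("run1", 18))  # [12, 13, 14, 16]
```

The query is shorter than the procedure, but it is still tied to the table layout and still has to be written by someone who knows SQL, which is the point the slide makes.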

18

QLP Constructs

First Provenance Challenge Queries Formulated in QLP

Query 1 *..//AtlasXGraphic

Query 2 #softmean..//AtlasXGraphic

Query 3 #softmean..#slicer..#convert..//AtlasXGraphic

Query 4 invocations(#align_warp[m=“12”, dateofExecution="Monday"])

Query 5 outputs(//AnatomyHeaders[maximum=“4096”]..//AtlasGraphic)

Query 6 outputs(#align_warp[-m=“12”]..#softmean)

Query 7 #convert..*, #pgmtoppm..*

Query 8 outputs(//AnatomyImages[center=“Uchicago”].#align_warp)

Query 9 //AtlasGraphic[studyModality=“speech” | “visual” | “audio”]/@*

19

Querying Multiple Dimensions

1. Obtain structures from @in and @out version operators
2. Apply XPath expressions to structure
3. Apply lineage queries to each resulting node

(Q) Select lineage path that derived all children of AtlasImage created by slicer

QLP: * derived //AtlasImage/* @out slicer

Evaluation on the example trace:
• @out slicer yields the output structure of slicer, S5 (Images 1, AtlasImage 15, AtlasSlice 18)
• //AtlasImage/* selects node 18 (AtlasSlice)
• * derived 18 returns the lineage paths that derived node 18

[Figure: trace of the run alignwarp:1, reslicewarp:1, softmean:1, slicer:1, convert:1 over the collection structures S1 to S6, with its two dimensions labeled Structures and Lineage]
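The three evaluation steps can be mimicked over a toy trace. This sketch only illustrates the order of the steps and is not a QLP implementation. The structure follows S5 from the slide, while the lineage edges and the names `STRUCTURES`, `DERIVED_FROM`, `children_of`, and `lineage_edges` are hypothetical.

```python
# Step 1 data: the structure produced by slicer (S5), as node -> (label, parent).
STRUCTURES = {
    ("slicer", "out"): {
        1: ("Images", None),
        15: ("AtlasImage", 1),
        18: ("AtlasSlice", 15),
    }
}

# Hypothetical immediate lineage: (node, invocation, node it was derived from).
DERIVED_FROM = [
    (18, "slicer:1", 16),
    (16, "softmean:1", 14),
    (14, "reslicewarp:1", 11),
]


def children_of(structure, label):
    """Step 2: evaluate //label/* over one structure."""
    return sorted(
        node
        for node, (_, parent) in structure.items()
        if parent is not None and structure[parent][0] == label
    )


def lineage_edges(node):
    """Step 3: evaluate `* derived node` as the edges of all paths leading to node."""
    edges = set()
    frontier = [node]
    while frontier:
        current = frontier.pop()
        for target, invocation, source in DERIVED_FROM:
            edge = (source, invocation, target)
            if target == current and edge not in edges:
                edges.add(edge)
                frontier.append(source)
    return sorted(edges)


structure = STRUCTURES[("slicer", "out")]        # step 1: @out slicer
selected = children_of(structure, "AtlasImage")  # step 2: //AtlasImage/*
result = {node: lineage_edges(node) for node in selected}  # step 3: * derived

print(selected)    # [18]
print(result[18])  # [(11, 'reslicewarp:1', 14), (14, 'softmean:1', 16), (16, 'slicer:1', 18)]
```

Keeping the structural step and the lineage step separate lets each use the representation that suits it: a tree for the XPath part and a graph for the derivation part.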

20

Outline

• Capturing Provenance
• Storing Provenance
• Querying Provenance
• Browsing Provenance

21

Provenance Browser

• Browse different views of a trace
• Data dependencies, collection structure, actor invocations
• Move “forward” and “backward” through execution

22

Collection History

• Collection and invocation view
• Incrementally step through execution history

23

Conclusion

• Capture
  – Supports nested data collections, explicit data dependency, update semantics
• Storage
  – Reduce update time, storage size and query time
• Query
  – A high-level provenance query language (QLP)
    • Query structures with lineage graphs
    • Formulate queries easily and concisely
• Browse/Visualize
  – Provenance Browser, a visualization tool to view and navigate across provenance views

24

References

• M. K. Anand, S. Bowers, T. McPhillips, B. Ludäscher. Exploring Scientific Workflow Provenance using Hybrid Queries over Nested Data and Lineage Graphs. SSDBM 2009

• M. K. Anand, S. Bowers, T. McPhillips, B. Ludäscher. Efficient Provenance Storage over Nested Data Collections. EDBT 2009

• S. Bowers, T. McPhillips, S. Riddle, M. K. Anand, B. Ludäscher. Kepler/pPOD: Scientific Workflow and Provenance Support for Assembling the Tree of Life. IPAW 2008