A Provenance Framework to Capture, Store, Query, and Browse Data Lineage in Kepler. Manish Kumar Anand (maanand@ucdavis), Eighth Biennial Ptolemy Miniconference, Berkeley, California.


Page 1

A Provenance Framework to Capture, Store, Query, and Browse Data Lineage in Kepler

Manish Kumar Anand, maanand@ucdavis

Eighth Biennial Ptolemy Miniconference, Berkeley, California

Page 2

Scientific Workflows
• Discoveries achieved via complex computations
• Workflows replacing traditional scripting approaches
• Enable automation, reproducibility, sharing, provenance

[Figure: a Perl script shown next to a scientific workflow system]

Page 3

Provenance
• A record of processes, inputs/outputs, dependencies
• Supports reproducibility, interpretation, verification

[Figure: example workflow with actors AlignWarp, Reslice, Softmean, Slicer, Convert, and its provenance graph from inputs to outputs. Invocations: alignWarp:1–4, reslice:1–4, softmean:1, slicer:1–3, convert:1–3. Data labels include AI/AH, RI/RH, WP, AXS/AYS/AZS, and AXG/AYG/AZG.]

Page 4

Outline
• Capturing Provenance
• Storing Provenance
• Querying Provenance
• Browsing Provenance

Page 5

Conventional Provenance Models
• Records
– Inputs/outputs of invocations
• Infers
– Data dependency
– Invocation dependency
• Assumptions:
– Data is atomic
– Invocations consume all inputs and produce new outputs
– Every output depends on all inputs

Example records: input(a, s1), output(a, s2), input(b, s2), input(c, s2), …

[Figure: workflow execution graph showing data dependency and invocation dependency]
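The inference step on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming that the first argument of input/output names the invocation and the second names the data item; the function name `infer_dependencies` and the tuple layout are hypothetical and not part of Kepler.

```python
def infer_dependencies(inputs, outputs):
    """Infer dependencies from recorded (invocation, data) pairs.

    Applies the conventional assumption from the slide: every output of an
    invocation depends on all of that invocation's inputs.
    """
    inputs_of = {}
    for invocation, data in inputs:
        inputs_of.setdefault(invocation, set()).add(data)

    # Which invocation produced each data item.
    producer = {data: invocation for invocation, data in outputs}

    # Data dependency: (output item, input item it depends on).
    data_deps = {
        (out, src)
        for invocation, out in outputs
        for src in inputs_of.get(invocation, ())
    }
    # Invocation dependency: (consumer, producer) linked by a shared item.
    invocation_deps = {
        (invocation, producer[data])
        for invocation, data in inputs
        if data in producer
    }
    return data_deps, invocation_deps


records_in = [("a", "s1"), ("b", "s2"), ("c", "s2")]
records_out = [("a", "s2")]
data_deps, invocation_deps = infer_dependencies(records_in, records_out)
```

For the example records on the slide, this yields that s2 depends on s1 and that b and c each depend on a.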

Page 6

Challenges in Modeling Provenance

Many scientific workflow systems also support:
a) Both data “transformers” and “pass-through”
b) Processes with different dependency patterns
c) Structured data (XML)

• Models of provenance must consider these factors

[Figure: panels (a), (b), and (c) illustrating the three cases with an actor a and data items s1–s5]

Page 7

Unified Provenance Model

Page 8

Efficient Provenance Representation
• Instead of storing each version
– Only store a single combined version
• Along with a set of updates (Δ's)
– Updates and dependencies represented as annotations

[Figure: a collection tree before invocation a (nodes 1, 2, 3, 4) and after it (nodes 1, 2, 4, 5, 6), and the single combined tree (nodes 1–6) annotated with +a and -a in expanded and condensed form]

Condensed: Δa = {ins(5,a), dep(5,2), del(3,a)}

Expanded: Δa = {ins(5,a), dep(5,2), del(3,a), ins(6,a), dep(5,3), dep(5,4), dep(6,2), dep(6,3), dep(6,4)}
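To make the "single combined version plus updates" idea concrete, the following sketch recovers the versions before and after an invocation from one combined tree and its ins/del annotations. It is an illustration only: the parent map (2, 3, 4 under 1 and 2; 5 under 1; 6 under 5) is inferred from the annotations on the slide, and the tuple encoding and function name `version` are assumptions.

```python
# Combined tree as child -> parent; node 1 is the root (inferred shape).
PARENT = {2: 1, 3: 2, 4: 2, 5: 1, 6: 5}

# Condensed update set for invocation "a" as (kind, node, argument) tuples.
DELTA_A = {("ins", 5, "a"), ("dep", 5, 2), ("del", 3, "a")}


def version(parent, delta, invocation, after):
    """Return the node ids visible before or after `invocation` ran.

    Before the invocation, subtrees it inserted are hidden; after it,
    subtrees it deleted are hidden.
    """
    hidden_kind = "del" if after else "ins"
    hidden_roots = {
        node for kind, node, arg in delta
        if kind == hidden_kind and arg == invocation
    }

    def is_hidden(node):
        # A node is hidden if it or any ancestor is a hidden subtree root.
        while node is not None:
            if node in hidden_roots:
                return True
            node = parent.get(node)
        return False

    nodes = set(parent) | set(parent.values())
    return sorted(n for n in nodes if not is_hidden(n))


before_a = version(PARENT, DELTA_A, "a", after=False)
after_a = version(PARENT, DELTA_A, "a", after=True)
```

Under these assumptions, the version before a contains nodes 1 to 4 and the version after a contains nodes 1, 2, 4, 5, and 6, so neither version needs to be stored separately.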

Page 9

Expanding and Condensing Traces

[Figure: the annotated collection tree (nodes 1–6) in expanded form (annotations +a, +a, -a) and in condensed form (annotations +a, -a)]

Page 10

Trace Views

[Figure: trace of the pipeline alignwarp, reslicewarp, softmean, slicer, convert over nested collection structures S1–S6, with numbered nodes 1–19. Collections shown: Images, AnatomyImage and RefImage (each with Image and Header), WarpParamSet, ReslicedImage, AtlasImage, AtlasSlice, AtlasGraphic.]

Condensed Trace

Using a postorder (i.e., bottom-up, left-to-right) traversal:
• Remove annotations from a node n
(i) dep(n,c) if dep(n,p) and child(p,c)
(ii) dep(n,d) if child(p,n) and dep(p,d)
(iii) ins(n,x) if child(p,n) and ins(p,x)
(iv) del(n,y) if child(p,n) and del(p,y)
• Remove invocation order annotations
– Those implied according to rules (3–8)

Expanded Trace

Uses three distinct preorder (i.e., top-down, left-to-right) traversals:
• Pass 1: rules (1–2) and rules (3–5)
– Infers insertion and deletion annotations
– Infers invocation order from nodes and parent-child relationships
• Pass 2: rules (6–8)
– Infers remaining invocation precedence relationships
• Pass 3: rules (9–10)
– Expands dependency sets and propagates dependencies to child nodes
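Removal rules (i) to (iv) can be tried out on the Δa example from Page 8. The sketch below is a simplified, set-based illustration rather than the postorder traversal the slide describes: it checks each annotation of a fully expanded set against the rules and keeps only those that are not implied. The tree shape, the tuple encoding, and the name `condense` are assumptions, and invocation order annotations are not modeled.

```python
# Combined tree as child -> parent (shape inferred from the Page 8 example).
PARENT = {2: 1, 3: 2, 4: 2, 5: 1, 6: 5}

EXPANDED = {
    ("ins", 5, "a"), ("dep", 5, 2), ("del", 3, "a"), ("ins", 6, "a"),
    ("dep", 5, 3), ("dep", 5, 4), ("dep", 6, 2), ("dep", 6, 3), ("dep", 6, 4),
}


def condense(parent, annotations):
    """Drop every annotation that rules (i)-(iv) mark as implied."""
    kept = set(annotations)
    for annotation in annotations:
        kind, n, x = annotation
        p = parent.get(n)  # parent of the annotated node, None for the root
        if kind == "dep":
            # (i) dep(n,c) is implied by dep(n,p) when c is a child of p.
            if ("dep", n, parent.get(x)) in annotations:
                kept.discard(annotation)
            # (ii) dep(n,d) is implied by dep(p,d) when n is a child of p.
            if p is not None and ("dep", p, x) in annotations:
                kept.discard(annotation)
        elif p is not None and (kind, p, x) in annotations:
            # (iii)/(iv) ins or del on n is implied by the same on its parent.
            kept.discard(annotation)
    return kept


condensed = condense(PARENT, EXPANDED)
```

Applied to the expanded set from Page 8, this returns ins(5,a), dep(5,2), and del(3,a), which matches the condensed set shown there.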

Page 11

Outline
• Capturing Provenance
• Storing Provenance
• Querying Provenance
• Browsing Provenance

Page 12

Storage Strategies

Use a standard relational DBMS and minimize storage size, update time, and query time
• Store immediate and transitive dependencies
– Faster query execution
• Reduction techniques
– Represent dependencies in reduced form
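The trade-off behind storing transitive dependencies can be illustrated with SQLite from Python's standard library. This is a sketch under stated assumptions: the table names `dep` and `dep_closure`, the two-column layout, and the helper names are hypothetical and do not reflect the schema used in the talk, and no reduction technique is applied.

```python
import sqlite3


def build_store(edges):
    """Store immediate dependencies and materialize their transitive closure.

    `edges` holds (node, dep_node) pairs meaning node depends on dep_node.
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE dep (node INTEGER, dep_node INTEGER)")
    con.executemany("INSERT INTO dep VALUES (?, ?)", edges)
    # Pay the recursion cost once at update time; UNION removes duplicates.
    con.execute("""
        CREATE TABLE dep_closure AS
        WITH RECURSIVE reach(node, dep_node) AS (
            SELECT node, dep_node FROM dep
            UNION
            SELECT r.node, d.dep_node
            FROM reach r JOIN dep d ON d.node = r.dep_node
        )
        SELECT node, dep_node FROM reach
    """)
    return con


def lineage(con, node):
    """Answer a lineage query with a plain, non-recursive lookup."""
    rows = con.execute(
        "SELECT dep_node FROM dep_closure WHERE node = ? ORDER BY dep_node",
        (node,),
    )
    return [row[0] for row in rows]


store = build_store([(4, 3), (3, 2), (2, 1)])
ancestors_of_4 = lineage(store, 4)
```

Queries become simple lookups, at the price of a larger table and more work on each update, which is the tension the storage strategies on the following slides try to balance.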

Page 13

Storage Strategies
• 5 storage strategies
– NC: Naive Collapsed
– NE: Naive Expanded
– SE: Simple Expanded
– RE: Reduced Expanded
– RC: Reduced Collapsed
• Compare:
– Storage size, update time, query time

[Figure: strategy diagram labeled "Reduction Algorithms". NC: trace collapsed. NE: trace expanded. SE: trace expanded, transitive dependencies. RE: reduced, trace expanded, transitive dependencies. RC: reduced, trace collapsed, transitive dependencies.]

Page 14

Analysis of Storage Strategies
• SE: Worst storage size and update time
• RC: Very expensive query time
• RE: Recommended storage strategy

Storage size: RC < RE < NC < NE < SE

Update time: RC < RE < NC < NE < SE

Query time: SE < RE < NE < RC < NC

[Plot: Storage Size (Mixed Pattern). x-axis: Trace Nodes (0–6000); y-axis: Cells (1000); series NE, NC, SE, RE, RC]

[Plot: Update Time (Mixed Pattern). x-axis: Trace Nodes (0–6000); y-axis: Time (s); series NE, NC, SE, RE, RC]

[Plot: Query Time, Transitive Dependency Selection (Mixed Pattern). x-axis: Trace Nodes (0–6000); y-axis: Time (s); series NE, SE, RE, RC]

Page 15

Outline
• Capturing Provenance
• Storing Provenance
• Querying Provenance
• Browsing Provenance

Page 16

Querying Provenance can be Expensive
• Queries are often recursive
– Complex to formulate
– Expensive to evaluate
• Standard querying approaches
– Tied to storage representation
– Query language expertise
• Need to query across structures, lineage, or both
• How to express provenance queries easily and execute them efficiently?

(Q) Select lineage path that derived all children of AtlasImage created by slicer

[Figure: trace of invocations alignwarp:1, reslicewarp:1, softmean:1, slicer:1, convert:1 over collection structures S1–S6 (nodes 1–19), with one axis labeled Structures and the other Lineage]

Page 17

To Express this Query …

• Hard for domain scientists (… and SQL experts)
• Optimization depends on SQL engine [He et al. SIGMOD 08]
• Need for higher-level provenance query language

SQL (e.g., transitive dependencies)

select t.runId, t2.nodeId, t.nodeId as depNodeId
from (select d1.runId, d1.pDep, d1.nodeId
      from dependency d1
      where runId=runId_in
      union
      select p1.runId, p1.fromPointer as pDep, d2.nodeId
      from dependency d2, depSubsetPointer p1
      where p1.runId=runId_in
      and d2.runId=runId_in
      and d2.pDep=p1.toPointer) as t,
     depMinMaxPointer p2,
     (select t.runId, r1.nodeId, t.pDep
      from (select dc1.runId, dc1.pDepC, dc1.pDep
            from depCdepPointer dc1
            where runId=runId_in
            union
            select p1.runId, p1.fromPointer as pDepC, dc2.pDep
            from depCdepPointer dc2, depCSubsetPointer p1
            where p1.runId=runId_in
            and dc2.runId=runId_in
            and dc2.pDepC=p1.toPointer) as t,
           depCMinMaxPointer p2, runCollData r1, runItemProv rp1
      where p2.runId = runId_in
      and r1.runId=runId_in
      and rp1.runId=runId_in
      and r1.nodeId=nodeId_in
      and r1.pointer=rp1.pointer
      and rp1.pDep = p2.fromPointer
      and t.pDepC=p2.toPointer
      and t.pDep BETWEEN p2.depMin AND p2.depMax
      union
      ……

SQL (stored procedures)

create procedure depc(in runId_in varchar(255), in nodeId_in Integer)
begin
  DECLARE finished integer default 0;
  declare cur_1 cursor for
    select depNodeId from dependency
    where runId=runId_in and itemNodeId=nodeId_tmp;
  set nodeId_tmp = nodeId_in;
  set depCnt = (select count(*) from dependency
                where runId=runId_in and itemNodeId=nodeId_tmp);
  if (depCnt is not null) then
    open cur_1;
    get_cur_1: loop
      fetch cur_1 into depNodeId_tmp;
      if finished then leave get_cur_1;
      end if;
      insert into depcT (nodeId) values(depNodeId_tmp);
    end LOOP get_cur_1;
    close cur_1; set cnt = 1;
    while (cnt <= depCnt) do
      set nodeId_tmp = (select nodeId from depcT where no=cnt);
      set row_limit = (select count(*) from dependency where itemnodeId=nodeId_tmp and runId=runId_in);
      set row_cnt =0;
      open cur_1;
      get_cur_1: loop
        fetch cur_1 into depNodeId_tmp;
        set flag = (select 1 from depcT where nodeId = depNodeId_tmp);
        if (flag is null) then
          insert into depcT (nodeId) values(depNodeId_tmp);
        end if;
        if (row_cnt > row_limit) then
          leave get_cur_1;
        end if;
        set row_cnt = row_cnt + 1;

[excerpt ends here on the slide]
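The worklist logic that the stored procedure spells out with cursors can be expressed compactly outside SQL. The sketch below is not a translation of the procedure; it only shows the same idea (collect direct dependencies, then repeatedly follow them while skipping nodes already seen). The dictionary layout, the node ids, and the name `transitive_dependencies` are hypothetical.

```python
from collections import deque


def transitive_dependencies(dependency, node_id):
    """Collect everything `node_id` depends on, directly or indirectly.

    `dependency` maps an item node to the nodes it directly depends on.
    Results are returned in the order they are first discovered.
    """
    found = []
    seen = set()
    queue = deque(dependency.get(node_id, ()))
    while queue:
        dep = queue.popleft()
        if dep in seen:
            continue  # plays the role of the "flag is null" check
        seen.add(dep)
        found.append(dep)
        queue.extend(dependency.get(dep, ()))
    return found


# Illustrative dependency data, not taken from the slides.
example = {19: [18], 18: [17, 15], 17: [14], 15: [14]}
result = transitive_dependencies(example, 19)
```

Even in this short form the query is procedural and tied to one representation of the data, which is the motivation the slide gives for a higher-level provenance query language.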

Page 18

QLP Constructs

First Provenance Challenge Queries Formulated in QLP

Query 1: *..//AtlasXGraphic

Query 2: #softmean..//AtlasXGraphic

Query 3: #softmean..#slicer..#convert..//AtlasXGraphic

Query 4: invocations(#align_warp[m="12", dateofExecution="Monday"])

Query 5: outputs(//AnatomyHeaders[maximum="4096"]..//AtlasGraphic)

Query 6: outputs(#align_warp[-m="12"]..#softmean)

Query 7: #convert..*, #pgmtoppm..*

Query 8: outputs(//AnatomyImages[center="Uchicago"].#align_warp)

Query 9: //AtlasGraphic[studyModality="speech" | "visual" | "audio"]/@*

Page 19

Querying Multiple Dimensions
1. Obtain structures from @in and @out version operators
2. Apply XPath expressions to structure
3. Apply lineage queries to each resulting node

(Q) Select lineage path that derived all children of AtlasImage created by slicer

QLP: * derived //AtlasImage/* @out slicer

[Figure: trace of invocations alignwarp:1, reslicewarp:1, softmean:1, slicer:1, convert:1 over structures S1–S6, with axes Structures and Lineage. @out slicer selects structure S5 (Images 1, AtlasImage 15, AtlasSlice 18); //AtlasImage/* selects node 18; the query reduces to * derived 18.]
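The three evaluation steps can be mimicked with the Python standard library. This is a rough sketch, not the QLP engine: the XML encoding of structure S5, the `STRUCTURES` lookup standing in for the @in/@out operators, the dependency edges, and the function name `derived` are all assumptions made for illustration. Only the node numbers 1, 15, and 18 come from the slide.

```python
import xml.etree.ElementTree as ET

# Step 1 stand-in: structure versions keyed by (invocation, "in" or "out").
STRUCTURES = {
    ("slicer", "out"):
        '<Images id="1"><AtlasImage id="15">'
        '<AtlasSlice id="18"/></AtlasImage></Images>',
}

# Illustrative immediate dependencies (node -> nodes it was derived from).
DEPS = {18: [16], 16: [12], 12: [6, 11]}


def derived(invocation, version, xpath, deps):
    """Evaluate a query of the form: * derived <xpath> @<version> <invocation>."""
    # Step 1: obtain the structure selected by the version operator.
    root = ET.fromstring(STRUCTURES[(invocation, version)])
    # Step 2: apply the XPath expression to that structure.
    targets = [int(element.get("id")) for element in root.findall(xpath)]
    # Step 3: run a lineage query for each resulting node.
    result = {}
    for target in targets:
        seen, stack = set(), [target]
        while stack:
            for dep in deps.get(stack.pop(), ()):
                if dep not in seen:
                    seen.add(dep)
                    stack.append(dep)
        result[target] = sorted(seen)
    return result


lineage_of_children = derived("slicer", "out", ".//AtlasImage/*", DEPS)
```

With the sample data, the XPath step selects node 18, so the remaining work is the lineage query "* derived 18", as on the slide.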

Page 20

Outline
• Capturing Provenance
• Storing Provenance
• Querying Provenance
• Browsing Provenance

Page 21

Provenance Browser
• Browse different views of a trace
• Data dependencies, collection structure, actor invocations
• Move “forward” and “backward” through execution

Page 22

Collection History
• Collection and invocation view
• Incrementally step through execution history

Page 23

Conclusion
• Capture
– Supports nested data collections, explicit data dependency, update semantics
• Storage
– Reduce update time, storage size, and query time
• Query
– A high-level provenance query language (QLP)
  • Query structures with lineage graphs
  • Formulate queries easily and concisely
• Browse/Visualize
– Provenance Browser, a visualization tool to view and navigate across provenance views

Page 24

References

• M. K. Anand, S. Bowers, T. McPhillips, B. Ludäscher. Exploring Scientific Workflow Provenance using Hybrid Queries over Nested Data and Lineage Graphs. SSDBM 2009

• M. K. Anand, S. Bowers, T. McPhillips, B. Ludäscher. Efficient Provenance Storage over Nested Data Collections. EDBT 2009

• S. Bowers, T. McPhillips, S. Riddle, M. K. Anand, B. Ludäscher. Kepler/pPOD: Scientific Workflow and Provenance Support for Assembling the Tree of Life. IPAW 2008