Manish Kumar Anand
maanand@ucdavis.edu
Eighth Biennial Ptolemy Miniconference
Berkeley, California
A Provenance Framework to Capture, Store, Query, and Browse Data Lineage in Kepler

Scientific Workflows
• Discoveries achieved via complex computations
• Workflows replacing traditional scripting approaches
• Enable automation, reproducibility, sharing, provenance
[Figure: a Perl script alongside a scientific workflow system]

Provenance
• A record of processes, inputs/outputs, dependencies
• Supports reproducibility, interpretation, verification
[Figure: example workflow with stages AlignWarp, Reslice, Softmean, Slicer, Convert, its inputs and outputs, and the recorded invocations alignWarp:1-4, reslice:1-4, softmean:1, slicer:1-3, convert:1-3]

Conventional Provenance Models
• Records
  – Inputs/outputs of invocations
• Infers
  – Data dependency
  – Invocation dependency
• Assumptions:
  - Data is atomic
  - Invocations consume all inputs and produce new outputs
  - Every output depends on all inputs
[Figure: workflow execution graph with the derived data dependency and invocation dependency graphs; recorded facts: input(a, s1), output(a, s2), input(b, s2), input(c, s2), …]
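The inference step of the conventional model can be sketched in a few lines of Python. This is only an illustration under the slide's three assumptions; the first four records follow the slide, while the outputs of b and c (s3, s4) and the function name `infer_dependencies` are made up for the example.

```python
from collections import defaultdict

# Recorded facts: input(invocation, data) and output(invocation, data).
# output(b, s3) and output(c, s4) are hypothetical additions.
inputs = [("a", "s1"), ("b", "s2"), ("c", "s2")]
outputs = [("a", "s2"), ("b", "s3"), ("c", "s4")]


def infer_dependencies(inputs, outputs):
    """Infer dependencies assuming every output depends on all inputs."""
    ins_of = defaultdict(set)
    for inv, data in inputs:
        ins_of[inv].add(data)

    # Which invocation produced each data item.
    producer = {data: inv for inv, data in outputs}

    # Data dependency: (output, input) pairs of the same invocation.
    data_deps = {(out, src) for inv, out in outputs for src in ins_of[inv]}

    # Invocation dependency: an invocation depends on the producer of its input.
    inv_deps = {(inv, producer[d]) for inv, d in inputs if d in producer}
    return data_deps, inv_deps


data_deps, inv_deps = infer_dependencies(inputs, outputs)
print(sorted(data_deps))
print(sorted(inv_deps))
```

Because the model treats data as atomic and every output as dependent on every input, nothing finer-grained than these pairs can be recovered from the records.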

Challenges in Modeling Provenance
Many scientific workflow systems also support:
a) Both data “transformers” and “pass-through”
b) Processes with different dependency patterns
c) Structured data (XML)
• Models of provenance must consider these factors
[Figure: three panels (a), (b), (c) showing invocation a over structures s1-s5]

Efficient Provenance Representation
• Instead of storing each version
  – Only store a single combined version
• Along with a set of updates (Δ's)
  – Updates and dependencies represented as annotations
[Figure: versions of a tree over nodes 1-6 before and after invocation a, and the combined tree annotated with +a and -a]
Condensed: Δa = {ins(5,a), dep(5,2), del(3,a)}
Expanded: Δa = {ins(5,a), dep(5,2), del(3,a), ins(6,a), dep(5,3), dep(5,4), dep(6,2), dep(6,3), dep(6,4)}
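A minimal sketch of how individual versions might be recovered from the single combined version plus its insertion and deletion annotations. The tree shape (child to parent) is an assumption read from the slide's figure, and the two helper functions are hypothetical; the sketch only handles one invocation, so no invocation ordering is needed.

```python
# Combined tree, child -> parent (assumed shape: 2 and 5 under 1,
# 3 and 4 under 2, 6 under 5).
PARENT = {1: None, 2: 1, 3: 2, 4: 2, 5: 1, 6: 5}

# ins/del annotations from the expanded update set of invocation "a".
INSERTED = {5: "a", 6: "a"}  # ins(5,a), ins(6,a)
DELETED = {3: "a"}           # del(3,a)


def nodes_before(invocation):
    """Nodes visible just before `invocation`: hide what it inserted."""
    return {n for n in PARENT if INSERTED.get(n) != invocation}


def nodes_after(invocation):
    """Nodes visible just after `invocation`: hide what it deleted."""
    return {n for n in PARENT if DELETED.get(n) != invocation}


print(sorted(nodes_before("a")))
print(sorted(nodes_after("a")))
```

Only one tree is stored; each version is a filter over it, which is where the storage saving comes from.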

Trace Views
[Figure: trace of a workflow run across invocations alignwarp, reslicewarp, softmean, slicer, and convert, showing the nested Images collection at stages S1-S6 (RefImage, AnatomyImage, WarpParamSet, ReslicedImage, AtlasImage, AtlasSlice, AtlasGraphic, each with Image and Header items)]

Condensed Trace
Using a postorder (i.e., bottom-up, left-to-right) traversal
• Remove annotations from a node n
  (i) dep(n,c) if dep(n,p) and child(p,c)
  (ii) dep(n,d) if child(p,n) and dep(p,d)
  (iii) ins(n,x) if child(p,n) and ins(p,x)
  (iv) del(n,y) if child(p,n) and del(p,y)
• Remove invocation order annotations
  - Those implied according to rules (3-8)

Expanded Trace
Uses three distinct preorder (i.e., top-down, left-to-right) traversals
• Pass 1: rules (1-2) and rules (3-5)
  - Infers insertion and deletion annotations
  - Infers invocation order from nodes and parent-child relationships
• Pass 2: rules (6-8)
  - Infers remaining invocation precedence relationships
• Pass 3: rules (9-10)
  - Expands dependency sets and propagates dependencies to child nodes
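The removal rules (i)-(iv) can be illustrated with a small Python sketch. It covers only the ins/del/dep annotations (not invocation order), uses a set-based check and a fixpoint loop in place of the postorder and preorder traversals described above, and assumes a tree shape for the Δa example from the earlier slide. All names here are hypothetical.

```python
# Tree shape assumed for the example, child -> parent.
PARENT = {1: None, 2: 1, 3: 2, 4: 2, 5: 1, 6: 5}
CHILDREN = {}
for child, parent in PARENT.items():
    CHILDREN.setdefault(parent, []).append(child)

CONDENSED = {("ins", 5, "a"), ("dep", 5, 2), ("del", 3, "a")}


def implied(ann, anns):
    """True if `ann` follows from another annotation by rules (i)-(iv)."""
    kind, n, arg = ann
    p = PARENT.get(n)
    if kind == "dep":
        # (i) dep on the target's parent, or (ii) parent of n has the same dep
        return ("dep", n, PARENT.get(arg)) in anns or ("dep", p, arg) in anns
    # (iii)/(iv): the parent of n carries the same ins/del annotation
    return (kind, p, arg) in anns


def condense(anns):
    """Drop every annotation that is implied by another one."""
    return {a for a in anns if not implied(a, anns)}


def expand(anns):
    """Apply the rules in the forward direction until nothing new appears."""
    result = set(anns)
    changed = True
    while changed:
        changed = False
        for kind, n, arg in list(result):
            # ins/del/dep on n propagate to the children of n
            derived = {(kind, c, arg) for c in CHILDREN.get(n, [])}
            if kind == "dep":
                # a dep on a node extends to that node's children
                derived |= {("dep", n, c) for c in CHILDREN.get(arg, [])}
            if not derived <= result:
                result |= derived
                changed = True
    return result


EXPANDED = expand(CONDENSED)
print(len(CONDENSED), len(EXPANDED))
```

Under these assumptions, expanding the three condensed annotations yields the nine expanded ones listed for Δa, and condensing them again returns the original three.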

Storage Strategies
Use a standard relational DBMS and minimize storage size, update time, and query time
• Store immediate and transitive dependencies
  – Faster query execution
• Reduction techniques
  – Represent dependencies in reduced form
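The trade-off behind storing transitive dependencies can be sketched as follows: materializing the closure makes a lineage query a single lookup, at the cost of more stored rows. The node names and the `transitive_closure` helper are hypothetical, and the sketch assumes the dependency graph is acyclic.

```python
def transitive_closure(immediate):
    """Return {node: set of all nodes it transitively depends on}."""
    closure = {}

    def visit(node):
        if node in closure:
            return closure[node]
        closure[node] = set()
        for dep in immediate.get(node, ()):
            closure[node].add(dep)
            closure[node] |= visit(dep)
        return closure[node]

    for node in list(immediate):
        visit(node)
    return closure


# Immediate dependencies only: d4 <- d3 <- d2 <- d1 (made-up chain).
immediate = {"d2": {"d1"}, "d3": {"d2"}, "d4": {"d3"}}
closure = transitive_closure(immediate)

immediate_rows = sum(len(v) for v in immediate.values())
closure_rows = sum(len(v) for v in closure.values())
print(closure["d4"], immediate_rows, closure_rows)
```

Even on this short chain the closure doubles the row count, which is the growth the reduction techniques aim to contain.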

Storage Strategies
• 5 storage strategies
  – NC: Naive Collapsed
  – NE: Naive Expanded
  – SE: Simple Expanded
  – RE: Reduced Expanded
  – RC: Reduced Collapsed
• Compare:
  – Storage size, update time, query time
[Figure: contents of each strategy, with a Reduction Algorithms step]
  NC: Trace Collapsed
  NE: Trace Expanded
  SE: Trace Expanded, Transitive Dep.
  RE: Reduced, Trace Expanded, Transitive Dep.
  RC: Reduced, Trace Collapsed, Transitive Dep.

Analysis of Storage Strategies
• SE: Worst storage size and update time
• RC: Very expensive query time
• RE: Recommended storage strategy
Storage size: RC < RE < NC < NE < SE
Update time: RC < RE < NC < NE < SE
Query time: SE < RE < NE < RC < NC
[Figure: Storage Size (Mixed Pattern); x-axis: Trace Nodes (0-6000); y-axis: Cells (1000), 0-500; series NE, NC, SE, RE, RC]
[Figure: Update Time (Mixed Pattern); x-axis: Trace Nodes (0-6000); y-axis: Time (s), 0-140; series NE, NC, SE, RE, RC]
[Figure: Transitive Dependency Selection (Mixed Pattern), panel label Query Time; x-axis: Trace Nodes (0-6000); y-axis: Time (s), 0-50; series NE, SE, RE, RC]

Querying Provenance can be Expensive
• Queries are often recursive
  – Complex to formulate
  – Expensive to evaluate
• Standard querying approaches
  – Tied to storage representation
  – Query language expertise
• Need to query across structures, lineage, or both
• How to express provenance queries easily and execute them efficiently?
(Q) Select lineage path that derived all children of AtlasImage created by slicer
[Figure: workflow trace across invocations alignwarp:1, reslicewarp:1, softmean:1, slicer:1, convert:1 (stages S1-S6), with its two dimensions labeled Structures and Lineage]

To Express this Query …
• Hard for domain scientists (… and SQL experts)
• Optimization depends on SQL engine [He et al. SIGMOD 08]
• Need for higher-level provenance query language

SQL (e.g., transitive dependencies):
select t.runId, t2.nodeId, t.nodeId as depNodeId
from (select d1.runId, d1.pDep, d1.nodeId
      from dependency d1
      where runId=runId_in
      union
      select p1.runId, p1.fromPointer as pDep, d2.nodeId
      from dependency d2, depSubsetPointer p1
      where p1.runId=runId_in
        and d2.runId=runId_in
        and d2.pDep=p1.toPointer) as t,
     depMinMaxPointer p2,
     (select t.runId, r1.nodeId, t.pDep
      from (select dc1.runId, dc1.pDepC, dc1.pDep
            from depCdepPointer dc1
            where runId=runId_in
            union
            select p1.runId, p1.fromPointer as pDepC, dc2.pDep
            from depCdepPointer dc2, depCSubsetPointer p1
            where p1.runId=runId_in
              and dc2.runId=runId_in
              and dc2.pDepC=p1.toPointer) as t,
           depCMinMaxPointer p2, runCollData r1, runItemProv rp1
      where p2.runId = runId_in
        and r1.runId=runId_in
        and rp1.runId=runId_in
        and r1.nodeId=nodeId_in
        and r1.pointer=rp1.pointer
        and rp1.pDep = p2.fromPointer
        and t.pDepC=p2.toPointer
        and t.pDep BETWEEN p2.depMin AND p2.depMax
      union
      ……

SQL (stored procedures):
create procedure depc(in runId_in varchar(255), in nodeId_in Integer)
begin
  DECLARE finished integer default 0;
  …
  declare cur_1 cursor for
    select depNodeId from dependency
    where runId=runId_in and itemNodeId=nodeId_tmp;
  set nodeId_tmp = nodeId_in;
  set depCnt = (select count(*) from dependency
                where runId=runId_in and itemNodeId=nodeId_tmp);
  if (depCnt is not null) then
    open cur_1;
    get_cur_1: loop
      fetch cur_1 into depNodeId_tmp;
      if finished then leave get_cur_1;
      end if;
      insert into depcT (nodeId) values(depNodeId_tmp);
    end LOOP get_cur_1;
    close cur_1; set cnt = 1;
    while (cnt <= depCnt) do
      set nodeId_tmp = (select nodeId from depcT where no=cnt);
      set row_limit = (select count(*) from dependency
                       where itemnodeId=nodeId_tmp and runId=runId_in);
      set row_cnt =0;
      open cur_1;
      get_cur_1: loop
        fetch cur_1 into depNodeId_tmp;
        set flag = (select 1 from depcT where nodeId = depNodeId_tmp);
        if (flag is null) then
          insert into depcT (nodeId) values(depNodeId_tmp);
        end if;
        if (row_cnt > row_limit) then
          leave get_cur_1;
        end if;
        set row_cnt = row_cnt + 1;
        …
  …
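For comparison, the same kind of transitive-dependency lookup can also be written as a recursive common table expression. The sketch below is an assumption-laden illustration, not the talk's implementation: it uses SQLite through Python's standard `sqlite3` module, reuses only the `dependency(runId, itemNodeId, depNodeId)` columns that appear in the stored procedure, and fills the table with made-up rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dependency (runId TEXT, itemNodeId INTEGER, depNodeId INTEGER)"
)
# Hypothetical rows: in run r1, node 4 depends on 3, 3 on 2, 2 on 1.
rows = [("r1", 4, 3), ("r1", 3, 2), ("r1", 2, 1), ("r2", 4, 9)]
conn.executemany("INSERT INTO dependency VALUES (?, ?, ?)", rows)

QUERY = """
WITH RECURSIVE depc(nodeId) AS (
    -- base case: immediate dependencies of the starting node
    SELECT depNodeId FROM dependency
    WHERE runId = :run AND itemNodeId = :node
    UNION
    -- recursive case: dependencies of nodes already found
    SELECT d.depNodeId FROM dependency d
    JOIN depc ON d.itemNodeId = depc.nodeId
    WHERE d.runId = :run
)
SELECT nodeId FROM depc ORDER BY nodeId
"""


def transitive_deps(run_id, node_id):
    """All nodes that node_id transitively depends on within one run."""
    cur = conn.execute(QUERY, {"run": run_id, "node": node_id})
    return [row[0] for row in cur]


print(transitive_deps("r1", 4))
```

This is shorter than the cursor loop, but it is still recursive SQL tied to one storage layout, so it does not remove the case for a higher-level query language.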

QLP Constructs
First Provenance Challenge Queries Formulated in QLP
Query 1   *..//AtlasXGraphic
Query 2   #softmean..//AtlasXGraphic
Query 3   #softmean..#slicer..#convert..//AtlasXGraphic
Query 4   invocations(#align_warp[m="12", dateofExecution="Monday"])
Query 5   outputs(//AnatomyHeaders[maximum="4096"]..//AtlasGraphic)
Query 6   outputs(#align_warp[-m="12"]..#softmean)
Query 7   #convert..*, #pgmtoppm..*
Query 8   outputs(//AnatomyImages[center="Uchicago"].#align_warp)
Query 9   //AtlasGraphic[studyModality="speech" | "visual" | "audio"]/@*

Querying Multiple Dimensions
(Q) Select lineage path that derived all children of AtlasImage created by slicer
1. Obtain structures from @in and @out version operators
2. Apply XPath expressions to structure
3. Apply lineage queries to each resulting node
QLP: * derived //AtlasImage/* @out slicer
  @out slicer: structure S5 (Images 1, AtlasImage 15, AtlasSlice 18)
  //AtlasImage/*: node 18 (AtlasSlice)
  * derived 18
[Figure: workflow trace (stages S1-S6) with its Structures and Lineage dimensions labeled]
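The three evaluation steps can be mimicked in a short Python sketch. It is not QLP: the structure lookup, the path step, and the lineage walk are stand-ins built on `xml.etree.ElementTree`, whose `findall` supports only a subset of XPath. Node ids 1, 15, and 18 follow the slide; the lineage edges and all function names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Step 1 input: the structure written by each invocation (@out).
# Only the slicer output (S5) is modelled here.
OUT_STRUCTURES = {
    "slicer": ET.fromstring(
        '<Images id="1"><AtlasImage id="15">'
        '<AtlasSlice id="18"/></AtlasImage></Images>'
    )
}

# Hypothetical lineage edges: node id -> ids it was directly derived from.
DERIVED_FROM = {18: [16], 16: [13], 13: [6]}


def lineage(node_id):
    """All node ids that node_id transitively derives from."""
    seen, stack = set(), [node_id]
    while stack:
        for src in DERIVED_FROM.get(stack.pop(), []):
            if src not in seen:
                seen.add(src)
                stack.append(src)
    return seen


def hybrid_query(invocation, path):
    structure = OUT_STRUCTURES[invocation]   # 1. obtain the @out structure
    matches = structure.findall(path)        # 2. apply the path expression
    ids = [int(m.get("id")) for m in matches]
    return {i: lineage(i) for i in ids}      # 3. lineage query per node


result = hybrid_query("slicer", ".//AtlasImage/*")
print(result)
```

The path step narrows the structure to node 18 before any lineage is followed, which mirrors the rewrite from `//AtlasImage/*` to `* derived 18`.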

Provenance Browser
• Browse different views of a trace
• Data dependencies, collection structure, actor invocations
• Move “forward” and “backward” through execution

Collection History
• Collection and invocation view
• Incrementally step through execution history

Conclusion
• Capture
  – Supports nested data collections, explicit data dependency, update semantics
• Storage
  – Reduce update time, storage size and query time
• Query
  – A high-level provenance query language (QLP)
    • Query structures with lineage graphs
    • Formulate queries easily and concisely
• Browse/Visualize
  – Provenance Browser, a visualization tool to view and navigate across provenance views
References
• M. K. Anand, S. Bowers, T. McPhillips, B. Ludäscher. Exploring Scientific Workflow Provenance using Hybrid Queries over Nested Data and Lineage Graphs. SSDBM 2009
• M. K. Anand, S. Bowers, T. McPhillips, B. Ludäscher. Efficient Provenance Storage over Nested Data Collections. EDBT 2009
• S. Bowers, T. McPhillips, S. Riddle, M. K. Anand, B. Ludäscher. Kepler/pPOD: Scientific Workflow and Provenance Support for Assembling the Tree of Life. IPAW 2008