A graph model for data and workflow provenance
Umut Acar, Peter Buneman, James Cheney, Natalia Kwasnikowska,
Jan van den Bussche, & Stijn Vansummeren
TaPP 2010
Provenance in ...
• Databases
• Mainly for (nested) relational model
• Where-provenance ("source location")
• Lineage, why ("witnesses")
• How/semiring model
• Relatively formal
• Workflows
• Many different systems
• Many different models
• (converging on OPM?)
• Graphs/DAGs
• Relatively informal
This talk
• Relate database & workflow "styles"
• Develop a common graph formalism
• Need a common, expressive language that
• supports many database queries
• describes some (simple) workflows
Previous work
• Dataflow calculus (DFL), based on nested relational calculus (NRC)
• Provenance "run" model by Kwasnikowska & Van den Bussche (DILS 07, IPAW 08)
• "Provenance trace" model for NRC
• by (Acar, Ahmed & Cheney '08)
• Open Provenance Model (bipartite graphs)
• (Moreau et al. 2008-9), used in many WF systems
NRC/DFL background
• A very simple, functional language:
• basic functions +, *,... & constants 0,1,2,3...
• variables x,y,z
• pair/record types (A:e,...,B:e), πA (e)
• collection (set) types
• {e,...}, e ∪ e, {e | x in e'}, ∪e
An example
• Suppose R = {(1,2,3), (4,5,6), (9,8,7)}
sum { x * y | (x,y,z) in R, x < y}
= sum { x * y | (x,y,z) in {(1,2,3), (4,5,6)}}
= sum {1 * 2, 4 * 5}
= sum {2,20}
= 22
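The evaluation steps above can be mirrored with an ordinary set comprehension. This is only a minimal Python sketch, not DFL itself; the variable names `R`, `products` and `total` are chosen here for illustration.

```python
# R is the input relation from the slide, as a set of triples.
R = {(1, 2, 3), (4, 5, 6), (9, 8, 7)}

# { x * y | (x, y, z) in R, x < y }
# The result is a set, so equal products would collapse into one element.
products = {x * y for (x, y, z) in R if x < y}

# sum over the resulting set
total = sum(products)
print(sorted(products), total)
```

Only the triples with x < y contribute, which is why (9,8,7) drops out before any multiplication happens.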
Another example
• In DFL, built-in functions / constants can be whole programs & files,
• as in Provenance Challenge 1 workflow:
let WarpParams := {align_warp(img,hdr)
                   | (img,hdr) in Inputs} in
let Reslices := {reslice(wp)
                 | wp in WarpParams} in
softmean(Reslices)
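To see the data flow of this workflow without the real programs, it can be sketched in Python with stand-in functions. The three functions below are hypothetical stubs that only wrap their arguments, and the input file names are invented for illustration.

```python
# Hypothetical stand-ins for the external programs named on the slide.
# Each stub just wraps its arguments so the data flow stays visible.
def align_warp(img, hdr):
    return ("warp", img, hdr)

def reslice(wp):
    return ("resliced", wp)

def softmean(reslices):
    # Sort so the result does not depend on set iteration order.
    return ("atlas", tuple(sorted(reslices)))

# Assumed input: a set of (image, header) file-name pairs.
inputs = {("a1.img", "a1.hdr"), ("a2.img", "a2.hdr")}

# let WarpParams := {align_warp(img,hdr) | (img,hdr) in Inputs} in ...
warp_params = {align_warp(img, hdr) for (img, hdr) in inputs}
# let Reslices := {reslice(wp) | wp in WarpParams} in ...
reslices = {reslice(wp) for wp in warp_params}
# softmean(Reslices)
atlas = softmean(reslices)
```

Each `let` becomes an ordinary assignment and each comprehension becomes a set comprehension, so the nesting of the result mirrors the order in which the programs ran.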
Goal: Define "provenance graphs" for DFL
First step: values
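[Figure: the forms a value node v can take: a constant c; a record <> with edges A1, ..., An to values; a set {} with elem edges to values; or a copy of another value.]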
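One possible encoding of these value nodes is sketched below in Python. The class names `Const`, `Record`, `SetNode` and `Copy` and the helper `plain` are assumptions made for this sketch; they are not taken from the paper or its Haskell code.

```python
from dataclasses import dataclass
from typing import Dict, List, Union

@dataclass
class Const:
    value: object                  # a constant c

@dataclass
class Record:
    fields: Dict[str, "Value"]     # one edge per field label A1..An

@dataclass
class SetNode:
    elems: List["Value"]           # one "elem" edge per member

@dataclass
class Copy:
    source: "Value"                # the value this node was copied from

Value = Union[Const, Record, SetNode, Copy]

def plain(v):
    """Forget the graph structure and return an ordinary Python value."""
    if isinstance(v, Const):
        return v.value
    if isinstance(v, Record):
        return {name: plain(field) for name, field in v.fields.items()}
    if isinstance(v, SetNode):
        return [plain(e) for e in v.elems]   # a list stands in for a set here
    if isinstance(v, Copy):
        return plain(v.source)
    raise TypeError(v)

example = SetNode([Record({"A": Const(1), "B": Const(2)}),
                   Copy(Record({"A": Const(1), "B": Const(3)}))])
print(plain(example))
```

Keeping `Copy` as its own node type is what lets a graph remember where a value came from, even though `plain` sees straight through it.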
Example value
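[Figure: an example value graph built from a set node {} with elem edges, record nodes <> with fields A and B, and the constants 1, 2, 3.]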
Next step: evaluation nodes ("process")
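[Figure: evaluation nodes for constants and primitive functions (c; f with argument edges 1, ..., n to subexpressions e) and for variables and temporary bindings (x; let x with head and body edges).]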
Pairing
[Figure: evaluation nodes for record building (<> with edges A1, ..., An to subexpressions e) and field lookup (πA applied to e).]
Conditionals
[Figure: evaluation nodes for conditionals: an if node with test and then edges, or with test and else edges.]
Note: Only taken branch is recorded
Sets: basic operations
[Figure: evaluation nodes for the empty set (∅), singleton ({} applied to e), and union (∪ with argument edges 1 and 2).]
Sets: complex operations
[Figure: evaluation nodes for flattening (∪ applied to e) and iteration (for x with a head edge and body edges).]
Provenance graphs
• are graphs with "both value and evaluation structure"
A bigger example
Value structure
[Figure: the bigger example with its value structure highlighted; value node labels include {}, <>, 1, 2, T, F and C.]
Input values
Return value
Expression structure
[Figure: the bigger example with its expression structure highlighted; expression node labels include let R, let S, for x, for y, if, empty, =, +, fst, snd, ∪ and {}.]
Building provenance graphs
• is complicated
• Here we'll use high-level "graph rewrite rule" formalism
• Mostly because it is nicer to look at than formal version
[Figure: graph rewrite rules for constants (c), primitive functions (f applied to v1, ..., vn yields f(v1,...,vn)), let x (the body result is copied), field lookup πAi (copies vi out of the record <>), record building, conditionals (test True takes the then branch, test False takes the else branch; the branch result is copied), empty? (False on a set with elements, True on an empty {}), the empty set ∅, singleton {}, and union ∪.]
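The rules for constants, primitive functions, let and conditionals can be approximated by a small evaluator that appends one graph node per evaluation step. The following is a rough Python sketch under assumptions made here: expressions are encoded as tuples, nodes are dictionaries in a list, and the names `evaluate` and `PRIMS` are hypothetical. It is not the formal definition from the paper.

```python
import operator

# Assumed table of primitive functions.
PRIMS = {"+": operator.add, "*": operator.mul, "<": operator.lt}

def evaluate(expr, env, graph):
    """Evaluate a tuple-encoded expression, appending one node per step.

    Returns the id (list index) of the node that holds the result.
    """
    kind = expr[0]
    if kind == "const":                      # ("const", c)
        node = {"op": "const", "args": [], "value": expr[1]}
    elif kind == "var":                      # ("var", x): copy the bound value
        src = env[expr[1]]
        node = {"op": "copy", "args": [src], "value": graph[src]["value"]}
    elif kind == "prim":                     # ("prim", f, e1, ..., en)
        args = [evaluate(e, env, graph) for e in expr[2:]]
        value = PRIMS[expr[1]](*(graph[a]["value"] for a in args))
        node = {"op": expr[1], "args": args, "value": value}
    elif kind == "let":                      # ("let", x, head, body)
        head = evaluate(expr[2], env, graph)
        body = evaluate(expr[3], {**env, expr[1]: head}, graph)
        node = {"op": "let", "args": [head, body],
                "value": graph[body]["value"]}
    elif kind == "if":                       # ("if", test, then, else)
        test = evaluate(expr[1], env, graph)
        label = "then" if graph[test]["value"] else "else"
        taken = expr[2] if label == "then" else expr[3]
        branch = evaluate(taken, env, graph)  # the other branch is never visited
        node = {"op": "if", "args": [test, branch], "branch": label,
                "value": graph[branch]["value"]}
    else:
        raise ValueError(f"unknown expression kind: {kind}")
    graph.append(node)
    return len(graph) - 1

graph = []
program = ("let", "x", ("const", 2),
           ("if", ("prim", "<", ("var", "x"), ("const", 3)),
                  ("prim", "+", ("var", "x"), ("const", 1)),
                  ("const", 0)))
root = evaluate(program, {}, graph)
print(graph[root]["value"], len(graph))
```

Because the else expression is never evaluated, no node for the constant 0 appears in `graph`, matching the convention that only the taken branch is recorded.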
OK, take a deep breath!
[Figure: graph rewrite rule for iteration: for x over a set {} with elements v1, ..., vn instantiates the body e once per element, with x bound to a copy of that element, and combines the resulting sets with ∪.]
An example
[Figure: evaluating a for x node over the set {1, 2} with body x + 1, step by step. The body is instantiated once per element, with x bound to a copy (C) of that element and +1 expanded to + with the constant 1; the two + nodes produce 2 and 3, the elements of the result set {}.]
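The iteration example can be replayed in a few lines of Python. This sketch simplifies the model: lists stand in for sets, nodes are dictionaries, and the names `trace_for` and `plus_one` are invented for illustration.

```python
def trace_for(elements, body):
    """Record a for-iteration: one body instance per element of the head set."""
    graph = []

    def add(op, args, value):
        graph.append({"op": op, "args": list(args), "value": value})
        return len(graph) - 1            # node id = index in the list

    elems = [add("const", [], v) for v in elements]
    head = add("{}", elems, list(elements))
    bodies = []
    for e in elems:
        x = add("copy", [e], graph[e]["value"])   # x is a copy of the element
        bodies.append(body(add, graph, x))
    result = add("{}", bodies, [graph[b]["value"] for b in bodies])
    for_node = add("for", [head] + bodies, graph[result]["value"])
    return graph, for_node

def plus_one(add, graph, x):
    """The body x + 1, recorded as a + node over x and the constant 1."""
    one = add("const", [], 1)
    return add("+", [x, one], graph[x]["value"] + 1)

graph, root = trace_for([1, 2], plus_one)
print(graph[root]["value"])
```

Each element gets its own copy node and its own + node, so the graph grows with the size of the head set rather than with the size of the program text.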
Graphs can "lie" (inconsistency)
[Figure: three inconsistent graphs. A + node with inputs 2 and 2 records the result 5. An if node whose test is True records only an else branch, copying 2. A for x graph involving the values 1, 2, 3, 4 is "locally" but not "globally" consistent.]
Graph queries
• Many possible approaches
• In paper: some Datalog
• Maybe overkill, seems fragile
• In code: some "annotation propagation" traversals
• Seems to handle where, "explanations", "summaries"
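As a rough illustration of an "annotation propagation" traversal for where-provenance, the Python sketch below pushes labels forward along copy edges only. The graph encoding, the function name `propagate` and the label "R.A" are assumptions made for this sketch, not the queries from the paper or its code.

```python
def propagate(graph, order, annotations):
    """Propagate source annotations along copy edges.

    graph: node id -> {"op": ..., "args": [...]}
    order: node ids listed so that arguments come before their users
    annotations: node id -> label, given for some input nodes
    """
    ann = dict(annotations)
    for node_id in order:
        node = graph[node_id]
        if node["op"] == "copy":
            source = node["args"][0]
            if source in ann:
                ann[node_id] = ann[source]   # a copy keeps its source location
        # computed values (for example +) get no where-annotation
    return ann

graph = {
    "src": {"op": "const", "args": []},
    "c1":  {"op": "copy",  "args": ["src"]},
    "c2":  {"op": "copy",  "args": ["c1"]},
    "one": {"op": "const", "args": []},
    "p":   {"op": "+",     "args": ["c2", "one"]},
}
ann = propagate(graph, ["src", "c1", "c2", "one", "p"], {"src": "R.A"})
print(ann)
```

The label survives any number of copies but stops at the + node, which reflects the idea that where-provenance tracks copied, not computed, values.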
Explaining
Note: Smallest consistent subgraph (NOT transitive closure!)
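The word "consistent" in the note can be made concrete for primitive-function nodes: a node is consistent when its recorded value equals the function applied to its recorded argument values. The Python sketch below checks only this local condition and does not compute explanations; the names `inconsistent_nodes` and `PRIMS` and the dictionary encoding are assumptions of the sketch.

```python
import operator

# Assumed table of primitive functions.
PRIMS = {"+": operator.add, "*": operator.mul}

def inconsistent_nodes(graph):
    """Return ids of primitive-function nodes whose recorded value
    differs from the function applied to the recorded argument values."""
    bad = []
    for node_id, node in graph.items():
        f = PRIMS.get(node["op"])
        if f is None:
            continue                 # constants and other nodes are not checked
        args = [graph[a]["value"] for a in node["args"]]
        if f(*args) != node["value"]:
            bad.append(node_id)
    return bad

lying = {
    "a":   {"op": "const", "args": [], "value": 2},
    "b":   {"op": "const", "args": [], "value": 2},
    "sum": {"op": "+", "args": ["a", "b"], "value": 5},   # claims 2 + 2 = 5
}
print(inconsistent_nodes(lying))
```

A check of this kind looks at one node at a time, so a graph can pass it everywhere and still fail to correspond to any actual run.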
Summarizing
Graphs are partially "replayable"
• If we change a value node, can try to "readjust" to recover consistency
• Formalized in (Acar, Ahmed, Cheney 08)
[Figure: a + node with inputs 2 and 2 and result 4; changing one input to 17 readjusts the result to 19. An if node that recorded only its else branch (test False) cannot be readjusted when the test changes to True, because the then branch was never recorded.]
Stuck!
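The readjustment idea, including the stuck case, can be sketched in Python as follows. This is an illustration under assumptions made here (dictionary-encoded nodes, a hypothetical `replay` function and `Stuck` exception), not the formalization cited above.

```python
import operator

# Assumed table of primitive functions.
PRIMS = {"+": operator.add, "<": operator.lt}

class Stuck(Exception):
    """Raised when replay needs a branch that the graph never recorded."""

def replay(graph, order):
    """Recompute recorded values in dependency order after an input was edited."""
    for node_id in order:
        node = graph[node_id]
        args = [graph[a]["value"] for a in node["args"]]
        if node["op"] in PRIMS:
            node["value"] = PRIMS[node["op"]](*args)
        elif node["op"] == "copy":
            node["value"] = args[0]
        elif node["op"] == "if":
            needed = "then" if args[0] else "else"
            if needed != node["branch"]:
                raise Stuck(f"node {node_id}: {needed} branch was not recorded")
            node["value"] = args[1]
        # constants keep whatever value they currently hold
    return graph

add_graph = {
    "a":   {"op": "const", "args": [], "value": 2},
    "b":   {"op": "const", "args": [], "value": 2},
    "sum": {"op": "+", "args": ["a", "b"], "value": 4},
}
add_graph["b"]["value"] = 17                 # edit an input value
replay(add_graph, ["a", "b", "sum"])         # sum is readjusted

if_graph = {
    "t":  {"op": "const", "args": [], "value": False},
    "e":  {"op": "const", "args": [], "value": 2},
    "if": {"op": "if", "args": ["t", "e"], "branch": "else", "value": 2},
}
if_graph["t"]["value"] = True                # flip the test
try:
    replay(if_graph, ["t", "e", "if"])
    outcome = "replayed"
except Stuck:
    outcome = "stuck"
print(add_graph["sum"]["value"], outcome)
```

Replay succeeds as long as the edit leaves control flow unchanged; once a different branch is needed, the graph alone has nothing to re-run.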
Implementation in Haskell
• Summarized in paper, full code on request
• roughly 250 LOC for basic evaluator
• another 300 for Graphviz translation, basic queries, examples
• Point?
• No claim of efficiency/scalability but easy to understand, experiment
• Elucidates some tricky details that pictures hide
• Similar "lightweight modeling" might be valuable for understanding/relating other WF/DB models
Related work
• This work synthesizes/rearranges ideas from several previous works & "folklore"
• traces (Acar, Ahmed, Cheney 2008)
• runs (Kwasnikowska, van den Bussche, DILS 2007, IPAW 2008)
• OPM graphs (Moreau et al. IPAW 2008 etc.)
• and many workflow systems
• More can be done to relate DB & workflow models
Future work
• This is work in progress
• Next steps:
• Extending to understand/model other workflow features
• Better grasp of "real" queries and features needed
• Implementa(tion|ability)?
• Optimization?
Conclusions
• DB & WF provenance have much in common
• We develop a common graph model
• with both intuitive & precise presentations
• Still much to do to relate and integrate DB & WF models
• let alone integrate models at scale in real systems