Annotated XML: Queries and Provenance Nate Foster TJ Green Val Tannen University of Pennsylvania Symposium on Database Provenance University of Edinburgh May 21, 2008


Annotated XML: Queries and Provenance

Nate Foster, TJ Green, Val Tannen
University of Pennsylvania

Symposium on Database Provenance
University of Edinburgh

May 21, 2008

Need to Track XML Provenance

• For scientific data processing [Buneman+ 01]
  – Tree-structured data, heterogeneous sources
  – XML is the natural data model
  – Data annotated with source info; annotations need to be propagated during query processing
• For incomplete/probabilistic data [Sen.&Abit. 06]
  – Query output annotated with Boolean formulas
  – Annotations indicate correlations between source data and output data
• For data warehousing [Cui+ 00]
  – Even when data is relational, often have XML views

Provenance for Relational Algebra Views

Source R:

A  B  C
a  b  c
d  b  e
f  g  e

\( V := \pi_{AB}((\pi_{AC}(R) \bowtie \pi_{C}(R)) \cup (\pi_{AB}(R) \bowtie \pi_{BC}(R))) \)

View V (provenance: ?):

A  B
a  c
a  e
d  c
d  e
f  e

Semiring-Annotated Relations [PODS07]

• Associate each tuple in database with an annotation from a commutative semiring (K, +, ·, 0, 1)
• Combine and propagate annotations during (positive) relational query processing
  – ⋈, ×, ∩ combine annotations using ·
  – π, ∪ combine annotations using +
  – σ multiplies annotations by 0 or 1
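The propagation rules for annotated relations can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the `Semiring` record, the dict-based encoding of a K-relation, and the function names are all assumptions made for illustration.

```python
from dataclasses import dataclass
from itertools import product
from typing import Any, Callable

@dataclass(frozen=True)
class Semiring:
    zero: Any
    one: Any
    add: Callable[[Any, Any], Any]
    mul: Callable[[Any, Any], Any]

# Two instances: bag semantics (naturals) and set semantics (Booleans).
NAT = Semiring(0, 1, lambda a, b: a + b, lambda a, b: a * b)
BOOL = Semiring(False, True, lambda a, b: a or b, lambda a, b: a and b)

# A K-relation is a dict {row tuple: annotation}; absent rows are annotated 0.

def union(K, r1, r2):
    """Union adds the annotations of equal rows."""
    out = dict(r1)
    for row, k in r2.items():
        out[row] = K.add(out.get(row, K.zero), k)
    return out

def project(K, rel, cols):
    """Projection adds the annotations of rows that collapse together."""
    out = {}
    for row, k in rel.items():
        key = tuple(row[i] for i in cols)
        out[key] = K.add(out.get(key, K.zero), k)
    return out

def cross(K, r1, r2):
    """Cartesian product multiplies annotations."""
    return {a + b: K.mul(k1, k2)
            for (a, k1), (b, k2) in product(r1.items(), r2.items())}

def select(K, rel, pred):
    """Selection multiplies each annotation by 1 (keep) or 0 (drop)."""
    out = {}
    for row, k in rel.items():
        k2 = K.mul(k, K.one if pred(row) else K.zero)
        if k2 != K.zero:
            out[row] = k2
    return out

R = {("a", "b"): 2, ("a", "c"): 3}
print(project(NAT, R, [0]))                      # {('a',): 5}
print(cross(NAT, R, R)[("a", "b", "a", "c")])    # 6
```

Swapping `NAT` for `BOOL` in the same calls gives ordinary set semantics, which is the point of keeping the operators parametric in K.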

Annotated Relations Example

Source R:

A  B  C  annotation
a  b  c  p
d  b  e  r
f  g  e  s

\( V := \pi_{AB}((\pi_{AC}(R) \bowtie \pi_{C}(R)) \cup (\pi_{AB}(R) \bowtie \pi_{BC}(R))) \)

View V:

A  B  annotation
a  c  2p²
a  e  pr
d  c  pr
d  e  2r² + rs
f  e  2s² + rs
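The annotations in V can be recomputed mechanically with provenance polynomials. The sketch below is an illustration under stated assumptions: polynomials are encoded as frozensets of (monomial, coefficient) pairs, rows as frozensets of (attribute, value) pairs, and the query is taken to be π_AC((π_AB R ⋈ π_BC R) ∪ (π_AC R ⋈ π_BC R)). That query shape is an assumption, chosen because it reproduces the annotations listed for V; it reads the second column of V as the C attribute.

```python
from collections import Counter
from itertools import product

# A polynomial is a frozenset of (monomial, coefficient) pairs; a monomial is
# a sorted tuple of variable names, e.g. ('p', 'p') stands for p^2.
ZERO = frozenset()

def var(name):
    return frozenset({((name,), 1)})

def p_add(f, g):
    out = Counter(dict(f))
    for m, c in g:
        out[m] += c
    return frozenset(out.items())

def p_mul(f, g):
    out = Counter()
    for (m1, c1), (m2, c2) in product(f, g):
        out[tuple(sorted(m1 + m2))] += c1 * c2
    return frozenset(out.items())

def row(**cells):
    return frozenset(cells.items())

def project(rel, attrs):
    out = {}
    for r, k in rel.items():
        key = frozenset((a, v) for a, v in r if a in attrs)
        out[key] = p_add(out.get(key, ZERO), k)      # projection uses +
    return out

def union(r1, r2):
    out = dict(r1)
    for r, k in r2.items():
        out[r] = p_add(out.get(r, ZERO), k)          # union uses +
    return out

def join(r1, r2):
    out = {}
    for (t1, k1), (t2, k2) in product(r1.items(), r2.items()):
        d1, d2 = dict(t1), dict(t2)
        if all(d1[a] == d2[a] for a in d1.keys() & d2.keys()):
            key = frozenset({**d1, **d2}.items())
            out[key] = p_add(out.get(key, ZERO), p_mul(k1, k2))  # join uses *
    return out

p, r, s = var("p"), var("r"), var("s")
R = {row(A="a", B="b", C="c"): p,
     row(A="d", B="b", C="e"): r,
     row(A="f", B="g", C="e"): s}

AB, BC, AC = {"A", "B"}, {"B", "C"}, {"A", "C"}
V = project(union(join(project(R, AB), project(R, BC)),
                  join(project(R, AC), project(R, BC))), AC)

# 2r^2 + rs for the tuple (d, e)
print(V[row(A="d", C="e")] == frozenset({(("r", "r"), 2), (("r", "s"), 1)}))
```

Evaluating the same polynomials in another semiring (for example p = r = s = 1 in the naturals) recovers the annotation that semiring would have produced directly.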

Semiring Bestiary

• (𝔹, ∨, ∧, ⊥, ⊤) Set semantics
• (ℕ, +, ·, 0, 1) Bag semantics
• (PosBool(B), ∨, ∧, ⊥, ⊤) Incomplete dbs
• (P(Ω), ∪, ∩, ∅, Ω) Probabilistic dbs
• (P(P(X)), ∪, ⋓, ∅, {∅}) Why-provenance, where A ⋓ B := {a ∪ b : a ∈ A, b ∈ B}
• (C, min, max, absent, public) Security clearances
• (ℕ[X], +, ·, 0, 1) Prov. polynomials
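That two of the less familiar entries really are commutative semirings can be spot-checked by brute force. The sketch below is illustrative only: the helper name `is_commutative_semiring`, the integer encoding of clearance levels, and the small sample of why-provenance values are assumptions, and a check on a finite sample is evidence rather than a proof.

```python
from itertools import product

def is_commutative_semiring(elems, zero, one, add, mul):
    """Check the commutative-semiring laws on every combination of elems."""
    for a, b, c in product(elems, repeat=3):
        if add(add(a, b), c) != add(a, add(b, c)):
            return False
        if mul(mul(a, b), c) != mul(a, mul(b, c)):
            return False
        if mul(a, add(b, c)) != add(mul(a, b), mul(a, c)):
            return False
    for a, b in product(elems, repeat=2):
        if add(a, b) != add(b, a) or mul(a, b) != mul(b, a):
            return False
    return all(add(a, zero) == a and mul(a, one) == a and mul(a, zero) == zero
               for a in elems)

# Security clearances, encoded 0..4 from "public" (one) up to "absent" (zero).
LEVELS = range(5)
security_ok = is_commutative_semiring(LEVELS, zero=4, one=0, add=min, mul=max)

# Why-provenance over X = {x, y}: sets of sets of variables.
def pairwise_union(A, B):
    return frozenset(a | b for a in A for b in B)

x, y = frozenset({"x"}), frozenset({"y"})
SAMPLE = [frozenset(), frozenset({frozenset()}), frozenset({x}),
          frozenset({y}), frozenset({x, y}), frozenset({x | y})]
why_ok = is_commutative_semiring(
    SAMPLE, zero=frozenset(), one=frozenset({frozenset()}),
    add=lambda A, B: A | B, mul=pairwise_union)

print(security_ok, why_ok)
```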

Our Contribution: Annotated XML

• We show how to decorate unordered XML data with semiring annotations: K-UXML
• We propagate the annotations for K-UXQuery (based on a large fragment of positive XQuery)
• We do this by generalizing the semantics of Nested Relational Calculus (NRC) to handle annotated values and to incorporate a recursive tree type and structural recursion on trees
• We prove a commutation with homomorphisms theorem, and show that it enables applications in security and incomplete databases

K-UXML

• No attributes, no text values, no repeated children (inessential); no order (essential!)

• Each node decorated with a value k from semiring K (1 “neutral,” 0 “not present”)

• K-collection: a finite set of elements annotated with values from K

• Formally, the children of a node form a K-collection of subtrees (to annotate root, also have a top-level K-collection)


Example: XPath on K-UXML

Source, $T:

a
  b^{x1}
    a
      c^{y3}
      d
  c^{y1}
    d
      a
        c^{y2}
        b^{x2}

Query: element r { $T//c }

Answer:

r
  c^{x1·y3 + y1·y2}
  c^{y1}
    d
      a
        c^{y2}
        b^{x2}

Omitted annotations are 1 (and omitted subtrees have annotation 0)

Example: For-Loops in K-UXQuery

Source, $S:

a^{z}
  b^{x1}
    d^{y1}
  c^{x2}
    d^{y2}
    e^{y3}

Query: element p { for $t in $S return for $x in ($t)/* return ($x)/* }
(i.e., element p { $S/*/* })

Answer:

p
  d^{z·x1·y1 + z·x2·y2}
  e^{z·x2·y3}
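The for-loop example can be replayed on a small encoding of K-UXML. This is a sketch under assumptions, not the paper's formal semantics: a tree is a pair (label, children), children and forests are collections of (tree, annotation) pairs, and annotations are polynomials encoded as frozensets of (monomial, coefficient) pairs. The helper names are hypothetical.

```python
from collections import Counter
from itertools import product

# Polynomials: frozenset of (monomial, coefficient); monomial = sorted tuple.
ZERO = frozenset()

def var(name):
    return frozenset({((name,), 1)})

def p_add(f, g):
    out = Counter(dict(f))
    for m, c in g:
        out[m] += c
    return frozenset(out.items())

def p_mul(f, g):
    out = Counter()
    for (m1, c1), (m2, c2) in product(f, g):
        out[tuple(sorted(m1 + m2))] += c1 * c2
    return frozenset(out.items())

def tree(label, *kids):
    """A tree is (label, frozenset of (subtree, annotation) pairs)."""
    return (label, frozenset(kids))

def children(forest):
    """One child step ($F/*) on a forest {tree: annotation}.

    Each child inherits its parent's annotation by multiplication; equal
    children reached along different paths have their annotations added.
    """
    out = {}
    for (_, kids), k in forest.items():
        for child, kc in kids:
            out[child] = p_add(out.get(child, ZERO), p_mul(k, kc))
    return out

z, x1, x2, y1, y2, y3 = map(var, ["z", "x1", "x2", "y1", "y2", "y3"])
d, e = tree("d"), tree("e")
S = {tree("a", (tree("b", (d, y1)), x1),
               (tree("c", (d, y2), (e, y3)), x2)): z}

answer = children(children(S))      # $S/*/*
p = tree("p", *answer.items())      # element p { ... }
```

The two `d` leaves are the same value, so they merge into one child of `p` whose annotation is the sum of the two path products.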

Outline of Technical Approach

• Extend NRC with a recursive tree type
  – satisfies: tree = label × { tree }
  and an operation for structural recursion on trees (srt) [Robertson+ 07]
  – apply to each child subtree, collect results using NRC big union
• Generalize NRC + srt to handle semiring-annotated complex values ⇒ NRC_K + srt
• Define semantics of K-UXQuery by translation to NRC_K + srt

Semantics of Small Union

• Sums annotations

  \( [\![ e_1 \cup e_2 ]\!]_K(x) := [\![ e_1 ]\!]_K(x) + [\![ e_2 ]\!]_K(x) \)

• Example:

Query: return ($S, $T) (in NRC: $S ∪ $T)

Source: the forests $S and $T contain, between them, a^{x}[ b^{y} ], a^{x}[ b^{y} ], and a^{x}[ b^{z} ]

Answer: a^{2x}[ b^{y} ], a^{x}[ b^{z} ]

Semantics of Big Union

• Sums and multiplies annotations

  \[ [\![ \textstyle\bigcup (x \in e_1)\; e_2 ]\!]_K(y) := \sum_{i=1}^{n} [\![ e_1 ]\!]_K(a_i) \cdot [\![ e_2 ]\!]_K[x := a_i](y) \]

  where the support (the set of elements with non-zero annotations) of \( [\![ e_1 ]\!]_K \) is \( \{a_1, \ldots, a_n\} \)

Big Union Example With K = ℕ

Query: return $T/*/* (in NRC: ∪(x ∈ $T) ∪(y ∈ x) { y })

Source, $T: b^{2}[ c^{3} ], b[ c ] ≡ b[ c, c, c ], b[ c, c, c ], b[ c ]

Answer: c^{7} ≡ c, c, c, c, c, c, c
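The big-union rule with K = ℕ can be sketched directly on nested bags. The encoding is an assumption made for illustration: a bag is a dict from element to multiplicity, and an inner bag is frozen into a frozenset of (element, multiplicity) pairs so that it can itself be an element. The sample data follows the numbers in the example (multiplicities 2 and 3, plus one singly nested c), read as a bag of bags.

```python
def big_union(coll, body):
    """NRC big union over bags: sum over a in coll of coll[a] * body(a)[y]."""
    out = {}
    for a, k in coll.items():
        for y, ky in body(a).items():
            out[y] = out.get(y, 0) + k * ky
    return out

def frozen(bag):
    """Freeze an inner bag so it can be used as a dict key."""
    return frozenset(bag.items())

# Two copies of a bag holding three c's, and one copy of a bag holding one c.
T = {frozen({"c": 3}): 2, frozen({"c": 1}): 1}

# The NRC query: big union over x in T of (big union over y in x of {y}).
flat = big_union(T, lambda x: big_union(dict(x), lambda y: {y: 1}))
print(flat)   # {'c': 7}
```

The result multiplicity is 2·3 + 1·1 = 7: products along each path, summed over paths.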

XPath Descendant Operator Uses srt

• //* applied to forest $T translates to

    ∪(x ∈ $T) π1((srt(b, s). f) x)

  where

    f := let self = Tree(b, ∪(x ∈ s) {π2(x)}) in
         let matches = ∪(x ∈ s) {π1(x)} in
         (matches ∪ {self}, self)

• //a, similar to above
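The structural-recursion translation can be sketched in Python as follows. Everything here is an assumption made for illustration: the tree and polynomial encodings, the helper names, and the reading of `matches` as the big union that flattens the first components of the recursive results. The tree built at the end is an assumed reading of the source tree from the earlier XPath example, used only to exercise the code.

```python
from collections import Counter
from itertools import product

ZERO = frozenset()
ONE = frozenset({((), 1)})

def var(name):
    return frozenset({((name,), 1)})

def p_add(f, g):
    out = Counter(dict(f))
    for m, c in g:
        out[m] += c
    return frozenset(out.items())

def p_mul(f, g):
    out = Counter()
    for (m1, c1), (m2, c2) in product(f, g):
        out[tuple(sorted(m1 + m2))] += c1 * c2
    return frozenset(out.items())

def big_union(coll, body):
    """Annotated big union: multiply along the path, add across paths."""
    out = {}
    for a, k in coll.items():
        for y, ky in body(a).items():
            out[y] = p_add(out.get(y, ZERO), p_mul(k, ky))
    return out

def tree(label, *kids):
    return (label, frozenset(kids))

def srt(f, t):
    """Structural recursion: apply f to the label and the annotated
    collection of results obtained from the child subtrees."""
    label, kids = t
    s = {}
    for child, k in kids:
        r = srt(f, child)
        s[r] = p_add(s.get(r, ZERO), k)
    return f(label, s)

def f(label, s):
    # Each element of s is a pair (matches of a child, the child itself).
    self_tree = (label, frozenset(big_union(s, lambda x: {x[1]: ONE}).items()))
    matches = big_union(s, lambda x: dict(x[0]))
    matches[self_tree] = p_add(matches.get(self_tree, ZERO), ONE)
    return (frozenset(matches.items()), self_tree)

def descendants(forest):
    """All subtrees of the trees in forest (each tree itself included)."""
    return big_union(forest, lambda t: dict(srt(f, t)[0]))

x1, x2, y1, y2, y3 = map(var, ["x1", "x2", "y1", "y2", "y3"])
c, d, b = tree("c"), tree("d"), tree("b")
c1 = tree("c", (tree("d", (tree("a", (c, y2), (b, x2)), ONE)), ONE))
root = tree("a", (tree("b", (tree("a", (c, y3), (d, ONE)), ONE)), x1), (c1, y1))

# //c: keep only the matches whose root label is c.
cs = {t: k for t, k in descendants({root: ONE}).items() if t[0] == "c"}
```

Filtering on the label after the recursion is one simple way to get //c from //*; the leaf `c` ends up annotated with x1·y3 + y1·y2 and the larger `c` subtree with y1.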

Application: Security Clearances

• Data annotated with clearance levels from total order C : P < C < S < T < 0
• Joint use of data (·) requires access to both (max of clearances); alternative use of data (+) requires access to either (min of clearances)
• (C, min, max, 0, P) is a commutative semiring

Source, $S:

a^{P}
  b^{C}
    d^{C}
  c^{C}
    d^{S}
    e^{T}

Query: element p { $S/*/* }

Answer:

p
  d^{min(max(P,C,C), max(P,C,S))}
  e^{max(P,C,T)}

which simplifies to

p
  d^{C}
  e^{T}
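The clearance computation can be checked with a short sketch. The integer encoding of the levels and the helper names are assumptions made for illustration; `ABSENT` stands for the element written 0 on the slide.

```python
# Clearance levels in increasing order: P < C < S < T < 0 (here ABSENT).
P, C, S, T, ABSENT = range(5)

def tree(label, *kids):
    """A tree is (label, frozenset of (subtree, clearance) pairs)."""
    return (label, frozenset(kids))

def children(forest):
    """One child step in the semiring (C, min, max, 0, P):
    joint use along a path takes max, alternative paths take min."""
    out = {}
    for (_, kids), k in forest.items():
        for child, kc in kids:
            out[child] = min(out.get(child, ABSENT), max(k, kc))
    return out

d, e = tree("d"), tree("e")
source = {tree("a", (tree("b", (d, C)), C),
                    (tree("c", (d, S), (e, T)), C)): P}

answer = children(children(source))   # $S/*/*
print(answer[d] == C, answer[e] == T)  # True True
```

The `d` leaf is reachable through `b` at level C and through `c` at level S, so a reader cleared for C already sees it.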

Security Condition: Non-Interference

• For any given clearance level (e.g., C), want the following diagram to commute:

  a^{P}[ b^{C}[ d^{C} ], c^{C}[ d^{S}, e^{T} ] ]  --query-->  p^{P}[ d^{C}, e^{T} ]
        |                                                       |
     erase > C                                               erase > C
        v                                                       v
  a^{P}[ b^{C}[ d^{C} ], c^{C} ]                  --query-->  p^{P}[ d^{C} ]

Application: Incomplete XML

• Data annotated with Boolean expressions; tree T represents set of possible worlds Mod(T)

[Figure: an incomplete tree T whose c nodes carry the Boolean variables y1, y2, y3, shown next to Mod(T), the set of its possible worlds]

7 possible worlds

Correctness: Possible Worlds

• For every incomplete tree T, and every UXQuery query q, want this diagram to commute:

  T     --Mod-->  Mod(T)
  |                 |
  q                 q
  v                 v
  q(T)  --Mod-->  q(Mod(T)) = Mod(q(T))

Commutation with Homomorphisms

• Theorem: Let h : K1 → K2 be a semiring homomorphism. Then for any UXQuery query q, and for any K1-UXML document D, we have h(q(D)) = q(h(D)).
• Ex: security clearances
  h_c : C → C    h_c(k) := if k ≤ c then k else 0
• Ex: incomplete dbs
  ν : B → 𝔹    Eval_ν : PosBool(B) → 𝔹
• Ex: duplicate elimination
  δ : ℕ → 𝔹    δ(k) := if k = 0 then ⊥ else ⊤
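The commutation property can be exercised on a tiny instance with the duplicate-elimination homomorphism. This is a sketch under assumptions, not a proof: semirings are encoded as (zero, add, mul) triples, the query q is a single child step, and the helper names (`children`, `map_forest`, `delta`) are hypothetical.

```python
NAT = (0, lambda a, b: a + b, lambda a, b: a * b)
BOOL = (False, lambda a, b: a or b, lambda a, b: a and b)

def tree(label, *kids):
    return (label, frozenset(kids))

def children(K, forest):
    """The query q: one child step, in the semiring K = (zero, add, mul)."""
    zero, add, mul = K
    out = {}
    for (_, kids), k in forest.items():
        for child, kc in kids:
            out[child] = add(out.get(child, zero), mul(k, kc))
    return out

def map_forest(h, K2, forest):
    """Apply h to every annotation, at every depth. Trees that become equal
    after relabelling are merged by adding their images in K2."""
    zero, add, _ = K2
    out = {}
    for (label, kids), k in forest.items():
        t = (label, frozenset(map_forest(h, K2, dict(kids)).items()))
        out[t] = add(out.get(t, zero), h(k))
    return {t: k for t, k in out.items() if k != zero}

def delta(k):
    """Duplicate elimination: 0 maps to false, anything else to true."""
    return k != 0

c = tree("c")
D = {tree("b", (c, 3)): 2, tree("b", (c, 1)): 1}

lhs = map_forest(delta, BOOL, children(NAT, D))    # h(q(D))
rhs = children(BOOL, map_forest(delta, BOOL, D))   # q(h(D))
print(lhs == rhs)   # True
```

Note that applying δ first makes the two `b` trees equal, so they merge before the query runs; the two sides still agree.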

Related Work

• Bag semantics for NRC [Libkin&Wong 97]

• Incomplete XML [Kanza+ 99, Abiteboul+ 06]

• Probabilistic XML [Nierman&Jagadish 02, van Keulen+ 05, Abit.&Senellart 06, Sen.&Abit. 07, Hung+ 07]

• XML provenance [Buneman+ 01]

• NRC provenance [Hidders+ 07]

• Semiring-annotated XPath [Grahne+ 07]

• Negation, expressiveness of RA_K [Geerts&Poggi 08]


Conclusion

• We showed how to annotate unordered XML trees (complex values) with values from a commutative semiring K, and propagate those annotations in queries for a large, positive fragment of XQuery (NRC + srt)

• We saw novel applications in security and incomplete dbs, made possible by a fundamental property of our framework, commutation with homomorphisms


Future Work

• Practical applications based on framework
  – Security clearances
  – Jointly recording provenance, security, multiplicities, uncertainty, etc. (product of semirings is also a semiring!)
• Query optimization: containment/equivalence wrt annotated semantics depends on K
  – In paper, we show K-equivalence for UXQuery is the same as 𝔹-equivalence when K is a distributive lattice


K-UXQuery Syntax
