an annotation management system for relational databases laura chiticariu university of california,...

35
An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew Tan, Gaurav Vijayvargiya

Upload: alize-plasters

Post on 30-Mar-2015

217 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew

An Annotation Management System for Relational Databases

Laura ChiticariuUniversity of California, Santa Cruz

Joint work with Deepavali Bhagwat, Wang-Chiew Tan, Gaurav Vijayvargiya

Page 2: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew

2

A system that is able to propagate meta-data along with the data as the data is being moved around

Main motivation To trace the provenance and flow of data

Many other uses

Annotation Management System

transformationa2a1 a2

a1

b2

b1

b3

b2b1 b3a1 a2

a3

transformation step: a query, an ETL rule, etc.

transformation

Page 3: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew

3

Restaurant Cost Type

Peacock Alley

Bull & Bear

PacificaSoho Kitchen & Bar

$$$ French

$$$ Seafood

$ Chinese$ American

Restaurant Cost Type

PacificaSoho Kitchen & Bar

$ Chinese$ American

All Restaurants Cheap Restaurants

Yummy chicken curry!!

NYRestaurants

Restaurant Cost Type

Peacock Alley

Bull & Bear

PacificaSoho Kitchen & Bar

Zip

$$$ French 10022

$$$ Seafood 10022

$ Chinese 10013$ American10022

Serves fine French Cuisine in elegant setting. Formal attire.

Extensive wine list!

Our Vision

Page 4: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew

4

Other Applications

Keep information that cannot be otherwise stored in the current database design

Highlight wrong data Erroneous data may be copied around but the

comment that it is wrong goes along with it

Security and quality metric Annotate security or quality levels of data items

Page 5: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew

5

Some Related Work

Idea is not new though propagation of annotations was never explicitly stated as provenance-based: Wang & Madnick [VLDB 90], Lee, Bressan & Madnick [WIDM 98], Bernstein & Bergstraesser [IEEE Data Eng. 99]

Superimposed Information. Maier and Delcambre [WebDB 99]

Annotations of Web documents Annotations on genomic sequences Why-Provenance

Cui, Widom, & Wiener [CWW00]

Page 6: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew

6

Outline

pSQL queries Semantics

CUSTOM propagation schemeDEFAULT propagation schemeDEFAULT-ALL propagation scheme

ImplementationSystem architectureExperimental results

Page 7: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew

7

pSQL – an extension of SQL

A pSQL fragment:

SELECT DISTINCT selectlist

FROM fromlist

WHERE wherelist

PROPAGATE DEFAULT

| DEFAULT-ALL

| r1.A1 TO B1, …, rn.An TO Bn

A pSQL query is a union of pSQL fragments

Page 8: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew

8

The CUSTOM Scheme

SELECT DISTINCT AFROM R rPROPAGATE r.A TO A

UNION

SELECT DISTINCT A FROM R rPROPAGATE r.B TO A

A B

1 2

2 3

3 5

R

a

b

c

h

1

2

3

Result

Propagate annotations according to user specification

1

2

3

Result1

a

b

1

2

3

Result2

c

h

annotationUNION

a

b

c

h

Page 9: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew

9

The DEFAULT Scheme

Propagate annotations according to where data is copied from

r.B TO B

s.B TO B

A B

1 2

2 3

3 5

R

a

b

c

h

2

3

4

5

Result

c

e

f

h

A B

4 2

5 3

6 4

S

d

f

g

e

g

SELECT DISTINCT BFROM R r PROPAGATE DEFAULT

UNION

SELECT DISTINCT BFROM S sPROPAGATE DEFAULT

natural semantics for tracing the provenance of data

Page 10: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew

10

SELECT DISTINCT r.A, r.B, s.CFROM R r, S sWHERE r.B = s.BPROPAGATE DEFAULT

versus

SELECT DISTINCT *FROM R NATURAL JOIN SPROPAGATE DEFAULT

=a

Annotation Propagation under the DEFAULT Scheme

A B

1 2

R

a

1 2 3

Ans1

B C

2 3

S

b

a

1 2 3

Ans2

a b

equivalent queries,

but different

annotated output

Q1:

Q2:

Page 11: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew

11

The DEFAULT-ALL scheme

Propagate annotations according to where data is copied from according to all equivalent formulations of the given query

User Query Q:

Compute the results of Q on a database D – idea: E(Q) denotes the set of all queries that are equivalent to Q

(more precisely, (*)). Execute each query in E(Q) on the database D under the

DEFAULT scheme, then combine the results under a.

SELECT DISTINCT r.A, s.B, s.CFROM R r, S sWHERE r.B = s.BPROPAGATE DEFAULT-ALL

(*) the SQL query corresponding to Q

Page 12: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew

12

Computing the results of a DEFAULT-ALL query

Question:Given a pSQL query Q with DEFAULT-ALL propagation scheme and a database D, can we compute the result of Q(D)?

Problem: There are infinitely many queries in E(Q). It is therefore impossible to execute every query in E(Q) in order to obtain the result of Q(D).

Solution: Compute a finite basis of E(Q) first

Page 13: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew

13

A Query Basis of Q

A query basis of Q, denoted as B(Q), is a finite set of pSQL queries (with default propagation scheme) such that:

Ua q(D) =a Ua q(D)

Given B(Q), we can execute each query in B(Q) and combine the results to obtain the result of Q(D).

Question: Given Q, does B(Q) always exist and how can we compute B(Q)?

qB(Q) qE(Q)

Page 14: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew

14

Generating a Query Basis of Q Given R(A,B) and S(B,C) User query Q:

Representative Query Q0 :

Propagations under the default propagation scheme

Additional propagation due to the equality

r.B = s.B

Ans(x,y,z) :- R(x,y), S(y,z).

The representative query propagates annotations according to where data is copied from or equivalently copied from.

SELECT DISTINCT r.A, s.B, s.CFROM R r, S sWHERE r.B = s.BPROPAGATE DEFAULT-ALL

SELECT DISTINCT r.A, s.B, s.CFROM R r, S sWHERE r.B = s.BPROPAGATE r.A TO A, s.B TO B, s.C TO C, r.B TO B

Page 15: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew

15

Generating a Query Basis of Q

Auxiliary Queries:

Q1:

Q2:

Ans(x,y,z) :- R(x,y), S(y,z), R(x,w).

SELECT DISTINCT r.A, s.B, s.CFROM R r, S s, R r’WHERE r.B = s.B, r’.A = r.APROPAGATE r.A TO A, s.B TO B, s.C TO C, r.B TO B, r’.A TO A

Ans(x,y,z) :- R(x,y), S(y,z), S(w,z).

SELECT DISTINCT r.A, s.B, s.CFROM R r, S s, S s’WHERE r.B = s.B, s’.C = s.CPROPAGATE r.A TO A, s.B TO B, s.C TO C, r.B TO B, s’.C TO C

Page 16: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew

16

Generating a Query Basis of Q

Auxiliary Queries:

Q3:

Q4:

Ans(x,y,z) :- R(x,y), S(y,z), R(w,y).

SELECT DISTINCT r.A, s.B, s.CFROM R r, S s, R r’WHERE r.B = s.B, r’.B = r.BPROPAGATE r.A TO A, s.B TO B, s.C TO C, r.B TO B, r’.B TO B

Ans(x,y,z) :- R(x,y), S(y,z), S(y,w).

SELECT DISTINCT r.A, s.B, s.CFROM R r, S s, S s’WHERE r.B = s.B, s’.B = s.BPROPAGATE r.A TO A, s.B TO B, s.C TO C, r.B TO B, s’.B TO B

Page 17: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew

17

Correctness of the Algorithm

For the example, a query basis of Q consists of Q0, Q1, Q2, Q3, and Q4.

Theorem:

Given a pSQL query Q with DEFAULT-ALL propagation scheme, the algorithm generates a query basis of Q.

Proof Idea: Every query in B(Q) is an equivalent query of Q

Every equivalent query of Q is annotation-contained in Ua q(D) qB(Q)

Page 18: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew

18

Outline

pSQL queries

SemanticsCUSTOM propagation schemeDEFAULT propagation schemeDEFAULT-ALL propagation scheme

ImplementationSystem architectureExperimental results

Page 19: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew

19

System Architecture

Translator Module Input: a pSQL query Q Output: an SQL query Q’ written against the naïve storage

scheme

Q’ is sent to the RDBMS and executed

Postprocessor Module Input: sorted tuples (returned by the RDBMS) Output: An annotated set of tuples.

Annotations for the same output location are collected together Duplicate tuples are removed

PostprocessorTranslatorUSER pSQL

querySQL

querysortedtuples

finalresultRDBMS

Page 20: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew

20

For every attribute of every relation there is an additional attribute for storing the annotations

Conceivably, there are other possible storage schemes

A Naïve Storage Scheme

A B

1 2

3 4

a

b

cd

R A A’ B B’

1 a 2 c

1 d 2 -

3 b 4 -

R’

Page 21: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew

21

The Translator module

Generate a Query Basis

pSQL querydefault-all scheme

set of pSQL querieswith custom scheme

Translatedefault pSQL

to custom pSQL

pSQL querydefault scheme

pSQL querycustom scheme

Translatecustom pSQL

to SQL

SQL query

SELECT DISTINCT r.A AS A, r.B AS BFROM R rPROPAGATE DEFAULT

SELECT DISTINCT r.A AS A, r.B AS BFROM R rPROPAGATE r.A TO A, r.B TO B

default pSQL query custom pSQL query

Page 22: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew

22

Experiments

Goals

compare the performance of pSQL queries under different propagation schemes (DEFAULT, DEFAULT-ALL, or no propagation scheme)

compare the performance of pSQL queries when the number of annotations in a database is varied

Page 23: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew

23

Experimental setup

Implemented on top of Oracle 9i Datasets

100MB, 500MB, 1GB TPCH database Unannotated database on original schema 30%, 60%, 100% annotations on naïve schema buffer size: 256Mb

Test queries SPJ queries Varied the number of joins (0 to 4 joins) Varied the number of selected attributes (1,3 or 5 attributes)

Page 24: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew

24

100MB dataset – 100% annotated

Qi(j) denotes a query with i joins and j output attributes.

SQL vs. pSQL DEFAULT vs. pSQL DEFAULT-ALL

0.01

0.1

1

10

100

1000

Q0(1) Q1(1) Q2(1) Q3(1) Q4(1) Q0(3) Q1(3) Q2(3) Q3(3) Q4(3) Q0(5) Q1(5) Q2(5) Q3(5) Q4(5)

seco

nds

(log

sca

le)

SQL on Unannotated DB pSQL DEFAULT pSQL DEFAULT-ALL

Page 25: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew

25

500MB dataset – 100% annotated

Qi(j) denotes a query with i joins and j output attributes.

SQL vs. pSQL DEFAULT vs. pSQL DEFAULT-ALL

0.1

1

10

100

1000

10000

100000

Q0(1) Q1(1) Q2(1) Q3(1) Q4(1) Q0(3) Q1(3) Q2(3) Q3(3) Q4(3) Q0(5) Q1(5) Q2(5) Q3(5) Q4(5)

seco

nds

(log

sca

le)

SQL on Unannotated DB pSQL DEFAULT pSQL DEFAULT-ALL

Page 26: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew

26

1GB dataset – 100% annotated

Qi(j) denotes a query with i joins and j output attributes.

SQL vs. pSQL DEFAULT vs. pSQL DEFAULT-ALL

0.1

1

10

100

1000

10000

100000

Q0(1) Q1(1) Q2(1) Q3(1) Q4(1) Q0(3) Q1(3) Q2(3) Q3(3) Q4(3) Q0(5) Q1(5) Q2(5) Q3(5) Q4(5)

seco

nds

(log

sca

le)

SQL on Unannotated DB pSQL DEFAULT pSQL DEFAULT-ALL

Page 27: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew

27

100MB dataset annotated in various degrees

pSQL Default on 30%, 60% 100% annotated DBs

0.01

0.1

1

10

100

1000

Q0(1) Q1(1) Q2(1) Q3(1) Q4(1) Q0(3) Q1(3) Q2(3) Q3(3) Q4(3) Q0(5) Q1(5) Q2(5) Q3(5) Q4(5)

seco

nds

(log

sca

le)

SQL on Unannotated BD 30% Annotated DB 60% Annotated DBt 100% Annotated DB

Qi(j) denotes a query with i joins and j output attributes.

Page 28: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew

28

Contributions

an annotation management system for carrying annotations along as data is being transformed

based on provenance

pSQL query language for propagation annotations CUSTOM – user defined DEFAULT – where data was copied from? DEFAULT-ALL – invariant under equivalent queries

Generate-Query-Basis algorithm

an initial implementation

Page 29: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew

29

Future work

Performance of our annotation management system on other storage schemes

pSQL extensions Aggregates Bag Queries

Page 30: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew

30

END

Page 31: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew

31

The CUSTOM Scheme - Example

SELECT DISTINCT BFROM R rPROPAGATE r.A TO B, r.B TO B

A B

1 2

2 3

3 5

R

a

b

c

h

2

3

5

Result

a

b

c

h

Page 32: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew

32

Terminology

A location is a triple (R, t, A)

Definition:

A query Q1 is annotation contained in a query Q2 if:• Q1 Q2

• for every database D, the set of annotations attached to every output location in Q1(D) is a subset of the set of annotations associated with the same location in the output of Q2(D).

A B

1 2

R

aThe annotation “a”

is attached to the

location (R,(1,2),B)

Page 33: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew

33

Ans(x,y,z) :- R(x,y), S(y’,z), y = y’. { x ! 1, y ! 2, y’ ! 2, z ! 3 }Ans(x,y,z) :- R(x,y), S(y,z). { x ! 1, y ! 2, z ! 3 }

Annotations of values that reside in different source locations but are bound to the same variable are unioned together.

Ans(y) :- R(x,y).Ans(y) :- S(y,z).

Ans(2 ).

Annotations that belong to the same output location are unioned together.

In a More Concise Notation

a b

a b

a b

Page 34: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew

34

Containment vs. annotation-containment

A B C

1 2 3

1 4 5

1 8 4

8 9 5

R

a

b c

d

1 5

Ans1

c

1 5

Ans2

c d

a b

b

Q1

Ans(x,v) :- R(x,y,u), R(x,z,v), R(t,w,z). Q2

Ans(x,v) :- R(p,q,v), R(x,z,v), R(t,w,z).

Q1 Q2 but…Q1 a Q2 and Q2 a Q1

Page 35: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew

35

Translating a CUSTOM pSQL to SQL

Q1:SELECT r.A, NULL, s.B, s.B’, s.C, s.C’FROM R r, S sWHERE r.B = s.B

Q2:SELECT r.A, NULL, s.B, r.B’, s.C, NULLFROM R r, S sWHERE r.B = s.B

SELECT DISTINCT *FROM ( Q1 UNION Q2 ) tORDER BY t.A, t.B, t.C

SELECT DISTINCT r.A, s.B, s.CFROM R r, S sWHERE r.B = s.BPROPAGATE s.B TO B, s.C TO C, r.B TO B

custom pSQL query:

SQL query: