The Rise and Fall and Rise of Dependency Theory Part II: The Rise from the Ashes Ronald Fagin IBM Almaden Research Center

Page 1: The Rise and Fall and Rise of Dependency Theory Part II: The Rise from the Ashes

The Rise and Fall and Rise of Dependency Theory

Part II: The Rise from the Ashes

Ronald Fagin IBM Almaden Research Center

Page 2

Dependencies were Considered Harmful

Dependencies were undesirable
• Except for keys and referential integrity constraints

Database normalization eliminated dependencies
• BCNF: each FD is a logical consequence of keys
• 4NF: each MVD is a logical consequence of keys
• 5NF: each JD is a logical consequence of keys

Page 3

But then:

Dependencies took on a new, very positive role!

Page 4

Data Integration and Data Exchange

Data integration:

Describe data in a global schema in terms of data in local schemas

Data exchange:

Describe data in a target schema in terms of data in a source schema, and actually produce the target database

Page 5

Data Integration and Data Exchange

These are old, but recurrent, database problems

Phil Bernstein (2003): “Data exchange is the oldest database problem”

EXPRESS (IBM San Jose Research Lab, 1977): a system for transforming data between hierarchical databases

The universal relation model is an early case of data integration

We will focus mainly on data exchange

Page 6

Schema Mappings & Data Exchange

Schema Mapping M = (S, T, Σ)
• Source schema S, target schema T
• High-level, declarative assertions Σ that specify the relationship between S and T

Data Exchange via the schema mapping M = (S, T, Σ):

Transform a given source instance I to a target instance J, so that <I, J> satisfy the specifications Σ of M

[Diagram: source instance I over Source S and target instance J over Target T, related by Σ]

Page 7

Schema Mapping Specification Language

The relationship between source and target is typically given by source-to-target tgds

∀x (φ(x) → ∃y ψ(x, y)), where

φ(x) is a conjunction of atoms over the source

ψ(x, y) is a conjunction of atoms over the target

(Student(s) ∧ Enrolls(s,c)) → ∃t ∃g (Teaches(t,c) ∧ Grade(s,c,g))

There may also be target tgds and egds:

(Grade(s,c,g) ∧ Grade(s,c,g’)) → (g = g’)

Page 8

New Role of Dependencies

In data exchange, dependencies play a crucial role in describing how to transform data from one format to another

Page 9

Solutions in Schema Mappings

Definition: Schema Mapping M = (S, T, Σ)

If I is a source instance, then a solution for I is a target instance J such that <I, J> satisfy Σ

Fact: In general, for a given source instance I, there may be no solutions at all, or there may be multiple solutions; in fact, there may be infinitely many solutions
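
As a rough illustration of this definition, the sketch below checks whether a candidate target instance J is a solution for a source instance I under the Student/Enrolls mapping used as the example on the schema mapping language slide. The dictionary-of-sets encoding and the function names (`satisfies_st_tgd`, `satisfies_egd`, `is_solution`) are assumptions made for this sketch and do not come from the slides.

```python
def satisfies_st_tgd(I, J):
    """Student(s) AND Enrolls(s,c) -> EXISTS t, g: Teaches(t,c) AND Grade(s,c,g)."""
    students = {s for (s,) in I["Student"]}
    for (s, c) in I["Enrolls"]:
        if s not in students:
            continue  # the premise does not hold for this pair
        has_teacher = any(c2 == c for (_t, c2) in J["Teaches"])
        has_grade = any((s2, c2) == (s, c) for (s2, c2, _g) in J["Grade"])
        if not (has_teacher and has_grade):
            return False
    return True


def satisfies_egd(J):
    """Grade(s,c,g) AND Grade(s,c,g') -> g = g'."""
    grade_of = {}
    for (s, c, g) in J["Grade"]:
        if grade_of.setdefault((s, c), g) != g:
            return False
    return True


def is_solution(I, J):
    """J is a solution for I if the pair satisfies every dependency."""
    return satisfies_st_tgd(I, J) and satisfies_egd(J)


I = {"Student": {("ann",)}, "Enrolls": {("ann", "db")}}
J1 = {"Teaches": {("_t1", "db")}, "Grade": {("ann", "db", "_g1")}}  # placeholder values
J2 = {"Teaches": {("bob", "db")}, "Grade": {("ann", "db", "A")}}    # concrete values
J3 = {"Teaches": set(), "Grade": set()}                              # required facts missing

print(is_solution(I, J1), is_solution(I, J2), is_solution(I, J3))  # True True False
```

J1 and J2 are both solutions for the same I, which is the "multiple solutions" case stated in the Fact above.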

Page 10

Universal Solutions in Data Exchange

[Fagin, Kolaitis, Miller, Popa – ICDT 2003] introduced universal solutions as the “best” solutions in data exchange

By definition, a solution is universal if it has homomorphisms to all other solutions

Thus, it is a “most general” solution

Constants: entries in source instances

Variables (labeled nulls): entries besides constants in target instances

Homomorphism h: J1 → J2 between target instances:
• h(c) = c, if c is a constant
• If P(a1,…,am) is in J1, then P(h(a1),…,h(am)) is in J2
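
The homomorphism condition can be tested by brute force on small instances. The sketch below is only illustrative: it assumes that instances are sets of facts of the form `(relation_name, tuple_of_values)`, that all values are strings, and that labeled nulls are exactly the strings starting with `_`. These conventions and the name `find_homomorphism` are not from the slides. The search is exponential in the number of nulls.

```python
from itertools import product


def is_null(value):
    """Assumed convention: labeled nulls are strings that start with '_'."""
    return value.startswith("_")


def find_homomorphism(J1, J2):
    """Return a dict h on the nulls of J1 that maps every fact of J1 into J2, or None.

    Constants are left unchanged, which matches h(c) = c in the definition.
    """
    nulls = sorted({v for (_rel, args) in J1 for v in args if is_null(v)})
    candidates = sorted({v for (_rel, args) in J2 for v in args})
    for image in product(candidates, repeat=len(nulls)):
        h = dict(zip(nulls, image))
        mapped = {(rel, tuple(h.get(v, v) for v in args)) for (rel, args) in J1}
        if mapped <= J2:
            return h
    return None


J_general = {("Teaches", ("_t1", "db")), ("Grade", ("ann", "db", "_g1"))}
J_specific = {("Teaches", ("bob", "db")), ("Grade", ("ann", "db", "A"))}

print(find_homomorphism(J_general, J_specific))  # maps _t1 to bob and _g1 to A
print(find_homomorphism(J_specific, J_general))  # None: constants cannot be renamed
```

The instance with nulls maps into the one with constants but not the other way around, which is the sense in which it is "more general".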

Page 11

How to Obtain a Universal Solution?

Answer: Use our old friend the chase!

Theorem [Fagin, Kolaitis, Miller, Popa – ICDT 2003]: If there is a solution, then the chase produces a universal solution
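
A minimal sketch of a chase for s-t tgds and target egds follows, under stated assumptions: facts are `(relation_name, tuple_of_values)` pairs, atoms in dependencies contain only variables, all values are strings, and labeled nulls are strings starting with `_`. Target tgds are not handled. The second tgd in the example (`extra_tgd`) is hypothetical and is added only so that the egd has something to unify; it is not in the slides.

```python
from itertools import count


def is_null(value):
    """Assumed convention: labeled nulls are strings that start with '_'."""
    return value.startswith("_")


def match(atoms, facts, binding=None):
    """Yield each assignment of variables that maps every atom to a fact."""
    binding = binding or {}
    if not atoms:
        yield dict(binding)
        return
    (rel, args), rest = atoms[0], atoms[1:]
    for (fact_rel, fact_args) in facts:
        if fact_rel != rel or len(fact_args) != len(args):
            continue
        extended = dict(binding)
        if all(extended.setdefault(a, v) == v for a, v in zip(args, fact_args)):
            yield from match(rest, facts, extended)


def chase(source, st_tgds, egds):
    """Apply s-t tgds, then egds; return a target instance, or None on failure."""
    fresh = count(1)
    target = set()
    for body, head in st_tgds:
        for binding in match(body, sorted(source)):
            for (_rel, args) in head:
                for var in args:
                    if var not in binding:  # existential variable: invent a null
                        binding[var] = f"_N{next(fresh)}"
            target |= {(rel, tuple(binding[v] for v in args)) for rel, args in head}
    changed = True
    while changed:
        changed = False
        for body, (x, y) in egds:
            for binding in match(body, sorted(target)):
                u, v = binding[x], binding[y]
                if u == v:
                    continue
                if not is_null(u) and not is_null(v):
                    return None  # two distinct constants equated: no solution
                old, new = (u, v) if is_null(u) else (v, u)
                target = {(r, tuple(new if a == old else a for a in args))
                          for r, args in target}
                changed = True
                break
            if changed:
                break
    return target


tgd = ([("Student", ("s",)), ("Enrolls", ("s", "c"))],
       [("Teaches", ("t", "c")), ("Grade", ("s", "c", "g"))])
extra_tgd = ([("Enrolls", ("s", "c"))], [("Grade", ("s", "c", "g"))])  # hypothetical
egd = ([("Grade", ("s", "c", "g")), ("Grade", ("s", "c", "h"))], ("g", "h"))

I = {("Student", ("ann",)), ("Enrolls", ("ann", "db"))}
J = chase(I, [tgd, extra_tgd], [egd])
print(sorted(J))
```

The two Grade facts produced by the tgds carry different nulls, and the egd step merges them into one. If an egd would force two distinct constants to be equal, the sketch returns `None`, which corresponds to the case where no solution exists.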

Page 12

Standard schema mappings

[Fagin, Kolaitis, Miller, Popa – ICDT 2003] define a weakly acyclic set of tgds

[Deutsch, Tannen - ICDT 2003] have a slightly more restrictive notion

Let a standard schema mapping be one specified by s-t tgds, target egds, and a weakly acyclic set of target tgds.

Theorem [Fagin, Kolaitis, Miller, Popa – ICDT 2003]:

For standard schema mappings, the chase runs in polynomial time (data complexity)

Page 13

Query Answering in Data Exchange

[Diagram: source instance I over Schema S and target instance J over Schema T, related by Σ; query q is posed over T]

Question: What is the semantics of target query answering?

Definition: The certain answers of a query q over T on I

certain(q,I) = ∩ { q(J): J is a solution for I }

Note: It is the standard semantics in data integration

Page 14

Computing the Certain Answers

Theorem [Fagin, Kolaitis, Miller, Popa – ICDT 2003]: Assume a standard schema mapping. Let q be a union of conjunctive queries over the target.

If I is a source instance and J is a universal solution for I:

certain(q,I) = the set of all “null-free” tuples in q(J).

Hence, certain(q,I) is computable in polynomial time

1. Compute a universal solution J, using the chase, in polynomial time

2. Evaluate q(J) and remove tuples with nulls
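
Step 2 can be sketched as follows, assuming a universal solution J has already been computed. The encoding is an assumption of this sketch: facts are `(relation_name, tuple_of_values)` pairs, labeled nulls are strings starting with `_`, and a union of conjunctive queries is a list of `(head_variables, body_atoms)` pairs whose atoms contain only variables. The instance J below is a hand-written example, not output taken from the slides.

```python
def is_null(value):
    """Assumed convention: labeled nulls are strings that start with '_'."""
    return value.startswith("_")


def match(atoms, facts, binding=None):
    """Yield each assignment of variables that maps every atom to a fact."""
    binding = binding or {}
    if not atoms:
        yield dict(binding)
        return
    (rel, args), rest = atoms[0], atoms[1:]
    for (fact_rel, fact_args) in facts:
        if fact_rel != rel or len(fact_args) != len(args):
            continue
        extended = dict(binding)
        if all(extended.setdefault(a, v) == v for a, v in zip(args, fact_args)):
            yield from match(rest, facts, extended)


def certain_answers(ucq, J):
    """Evaluate a union of conjunctive queries on J and keep the null-free tuples."""
    facts = sorted(J)
    answers = set()
    for head_vars, body in ucq:
        for binding in match(body, facts):
            answers.add(tuple(binding[v] for v in head_vars))
    return {row for row in answers if not any(is_null(v) for v in row)}


# A universal solution for I = {Student(ann), Enrolls(ann, db)} (example data).
J = {("Teaches", ("_N1", "db")), ("Grade", ("ann", "db", "_N2"))}

who_takes_what = [(("s", "c"), [("Grade", ("s", "c", "g"))])]
who_teaches = [(("t",), [("Teaches", ("t", "c"))])]

print(certain_answers(who_takes_what, J))  # {('ann', 'db')}
print(certain_answers(who_teaches, J))     # set(): the teacher is only a null
```

The second query returns nothing because the only teacher in J is a labeled null, so no particular teacher is guaranteed across all solutions.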

Page 15

Composing Schema Mappings

Given M12 = (S1, S2, Σ12) and M23 = (S2, S3, Σ23), derive a schema mapping M13 = (S1, S3, Σ13) that is “equivalent” to the sequence M12 and M23

[Diagram: M12 goes from Schema S1 to Schema S2, M23 goes from Schema S2 to Schema S3, and M13 goes directly from Schema S1 to Schema S3]

What does it mean for M13 to be “equivalent” to the composition of M12 and M23?

Page 16

Semantics of Composition

Σ13 has to have the property that:

<I1,I3> ⊨ Σ13 if and only if there exists I2 such that <I1,I2> ⊨ Σ12 and <I2,I3> ⊨ Σ23
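
Read semantically, this says that the composed mapping is the relational composition of the two sets of instance pairs. The toy sketch below illustrates only that reading: it assumes each mapping is given as a small finite set of `(instance, instance)` pairs, with instances encoded as frozensets of facts. Real schema mappings usually have infinitely many such pairs, so this is not a composition algorithm, and the relation names and the helper `inst` are made up for the example.

```python
def compose(M12, M23):
    """Pairs (I1, I3) for which some I2 has (I1, I2) in M12 and (I2, I3) in M23."""
    return {(i1, i3) for (i1, i2) in M12 for (j2, i3) in M23 if i2 == j2}


def inst(*facts):
    """Hypothetical helper: build a hashable instance from facts."""
    return frozenset(facts)


I1 = inst(("Emp", "ann"))
I2a = inst(("Emp1", "ann"))
I2b = inst(("Emp1", "ann"), ("Emp1", "bob"))
I3 = inst(("Mgr", "ann", "carol"))

M12 = {(I1, I2a), (I1, I2b)}
M23 = {(I2b, I3)}

print(compose(M12, M23) == {(I1, I3)})  # True: I2b is the witnessing middle instance
```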

Page 17

Result of the composition

Question: If M12 and M23 are each specified by s-t tgds, what language is needed for specifying the composition of M12 and M23?

Answer: [Fagin, Kolaitis, Popa, Tan – PODS 2004]:

second-order tgds

Page 18

Second-Order Tgds

Definition: Let S be a source schema and T a target schema.

A second-order tuple-generating dependency (SO-tgd) is a formula of the form:

∃f1 … ∃fm ( (∀x1 (φ1 → ψ1)) ∧ … ∧ (∀xn (φn → ψn)) ),

where

each fi is a function symbol

each φi is a conjunction of atoms over S and equalities of terms

each ψi is a conjunction of atoms from T

Example: ∃f ( ∀e (Emp(e) → Mgr(e, f(e))) ∧ ∀e ((Emp(e) ∧ (e = f(e))) → SelfMgr(e)) )
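
To make the existential quantification over f concrete, the sketch below searches by brute force for an interpretation of f that witnesses the Emp/Mgr/SelfMgr example on small instances. It is only a sketch under assumptions: relations are plain Python sets, only the values of f on Emp are searched (the two clauses constrain nothing else), and the name `find_witness_function` is invented here.

```python
from itertools import product


def find_witness_function(emp, mgr, self_mgr):
    """Return a dict f on Emp satisfying both clauses of the example, or None.

    Clause 1: Emp(e) implies Mgr(e, f(e)).
    Clause 2: Emp(e) and e = f(e) imply SelfMgr(e).
    """
    employees = sorted(emp)
    values = sorted(set(emp) | {v for pair in mgr for v in pair})
    for image in product(values, repeat=len(employees)):
        f = dict(zip(employees, image))
        clause1 = all((e, f[e]) in mgr for e in employees)
        clause2 = all(e in self_mgr for e in employees if e == f[e])
        if clause1 and clause2:
            return f
    return None


emp = {"alice", "bob"}
mgr = {("alice", "carol"), ("bob", "bob")}

print(find_witness_function(emp, mgr, {"bob"}))  # alice -> carol, bob -> bob
print(find_witness_function(emp, mgr, set()))    # None: bob manages himself but SelfMgr is empty
```

The second call fails because the only manager recorded for bob is bob himself, which forces e = f(e) and therefore requires SelfMgr(bob).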

Page 19

Composition and SO-Tgds

Theorem [Fagin, Kolaitis, Popa, Tan – PODS 2004]:

The composition of any finite sequence of schema mappings specified by s-t tgds can be specified by an SO-tgd

Conversely, every SO-tgd specifies the composition of a finite sequence of mappings that are each specified by s-t tgds.

Recently [Arenas, Fagin, Nash – ICDT 2010] showed that the sequence need only be of size 2

Page 20

Composition with Target Constraints

[Arenas, Fagin, Nash – ICDT 2010] defined s-t SO dependencies, which generalize SO tgds by allowing not only target atoms but also equalities in the conclusion

Theorem [Arenas, Fagin, Nash – ICDT 2010]:
• The composition of any finite sequence of standard schema mappings can be specified by an s-t SO dependency (along with target egds and target tgds)
• Conversely, every s-t SO dependency specifies the composition of a finite sequence of standard schema mappings
– In fact, again, the sequence need only be of size 2

The chase procedure can be extended to schema mappings specified by s-t SO dependencies, so that it produces universal solutions in polynomial time (data complexity)

Page 21

Conclusions

Dependencies now play a crucial role in data integration and data exchange

We even have second-order dependencies, which have in fact been implemented in IBM InfoSphere Data Architect.

Dependency theory is alive and well!

Page 22

Extra slides

Page 23

The Smallest Universal Solution

Fact: Universal solutions need not be unique

Question: Is there a “best” universal solution?

Answer: [Fagin, Kolaitis, Popa – PODS 2003] took a “small is beautiful” approach:

There is a smallest universal solution (if solutions exist); hence, the most compact one to materialize

Definition: The core of an instance J is the smallest subinstance J’ that is homomorphically equivalent to J

Fact:
• Every finite relational structure has a core
• The core is unique up to isomorphism
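
For very small instances the core can be found by exhaustive search. The sketch below relies on the observation that the image of J under any homomorphism from J into itself is a subinstance homomorphically equivalent to J, so a smallest such image is a core. It is a brute-force illustration that is exponential in the number of nulls, not one of the polynomial-time algorithms from the cited papers. It assumes facts are `(relation_name, tuple_of_values)` pairs with string values and that labeled nulls are strings starting with `_`.

```python
from itertools import product


def is_null(value):
    """Assumed convention: labeled nulls are strings that start with '_'."""
    return value.startswith("_")


def core(J):
    """Return a smallest image of J under a homomorphism from J into J."""
    nulls = sorted({v for (_rel, args) in J for v in args if is_null(v)})
    values = sorted({v for (_rel, args) in J for v in args})
    best = set(J)
    for image in product(values, repeat=len(nulls)):
        h = dict(zip(nulls, image))  # constants stay fixed; only nulls are remapped
        mapped = {(rel, tuple(h.get(v, v) for v in args)) for (rel, args) in J}
        if mapped <= J and len(mapped) < len(best):
            best = mapped
    return best


J = {
    ("Teaches", ("_t1", "db")),  # redundant: subsumed by the next fact
    ("Teaches", ("bob", "db")),
    ("Grade", ("ann", "db", "_g1")),
}
print(sorted(core(J)))
```

Here the fact with the null teacher folds onto the fact with the constant teacher, while the Grade fact must stay because nothing else can absorb its null.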

Page 24

Core: The smallest universal solution

Theorem [Fagin, Kolaitis, Popa – PODS 2003] :

All universal solutions have the same core

The core of the universal solutions is the smallest universal solution

If the target constraints are egds, then the core is polynomial-time computable (data complexity)

Theorem [Gottlob and Nash – PODS 2006]:

If the target constraints are egds and a weakly acyclic set of tgds, then the core is polynomial-time computable

Page 25

Old Conclusions

Dependencies now play a crucial role in data integration and data exchange

We even have second-order dependencies, which have in fact been implemented in practice!

Lately, even probabilistic dependencies have been studied:
• [Dong, Halevy, Yu – VLDB 2007]
• [Das Sarma, Dong, Halevy – SIGMOD 2008]
• [Fagin, Kimelfeld, Kolaitis – ICDT 2010]

Probabilistic dependencies on probabilistic databases

Dependency theory is alive and well!