1 a brief history of db theory functional dependencies --- normalization more dependencies:...

47
1 A Brief History of DB Theory • Functional dependencies --- Normalization • More dependencies: multivalued, general • Universal relations • Acyclic hypergraphs • Logical query languages: Datalog

Upload: deasia-tipton

Post on 14-Dec-2015

238 views

Category:

Documents


3 download

TRANSCRIPT

1

A Brief History of DB Theory• Functional dependencies ---

Normalization• More dependencies: multivalued,

general• Universal relations• Acyclic hypergraphs• Logical query languages: Datalog

2

Functional Dependencies• Setting: A relation = table with

column headers (schema ) and a finite number of rows (tuples ).

A B C0 1 20 3 4

3

FD X ->Y

• A statement about which instances (finite sets of tuples) are “legal” for a given relation.

• If two tuples agree in all the attributes of set X , then then must also agree in all attributes of set Y.– Common case: X is the key of the

relation, Y is the other attributes.

4

Issue #1: Inference

• Armstrong’s Axioms– X ->Y if Y is a subset of X– X ->Y implies XZ ->YZ (XZ , etc. =

“union”)– X ->Y and YZ ->W implies XZ ->W

• Closure test (Bernstein): given FD’s and a possible consequent X ->Y , use the FD’s to see what X determines, and then see if Y is contained therein.

5

Issue #2: Redundancy

• Certain combinations of FD’s lead to redundancy and other “anomalies.”

• Example: B ->C is the only FD:A B C0 1 23 1 ?

• The FD lets us predict the value of ?.

6

Issue #3: Normalization

• Eliminate redundancy by splitting schemas.

• A FD X ->Y allows us to split schema XYZ into XY and XZ .

• Without the FD, the decomposition would lack a lossless join : the ability to reconstruct the original XYZ relation from the decomposed relations.

7

How I Met Moshe & Learned Database Theory

Tsichritzis

Bernstein Beeri

Ullman Vardi

8

Why I Like Working With Vardi• Taste, selection of issues.• Best inventor of constructions.• Name comes after mine in

alphabet.

9

Lossless Joins

• When relations are decomposed, do the pieces allow reconstruction of the original?– Only way: join the projected relations.– You always get back what you started

with.– Bad case is when you get more.

10

ExampleA B C0 1 23 1 4

A B A B C B C0 1 0 1 4 1 23 1 3 1 2 1 4

0 1 23 1 4

11

More Kinds of Dependencies• In essence, a dependency is any

predicate that tells whether a given set of tuples is OK for a given schema.

• The language chosen determines what constraints can be expressed and what we can decide about relations.– FD’s are just one, simple example

12

Multivalued Dependencies

• These occur when a relation tries to connect one class of objects to independent sets from two other classes..

• Notation: X ->->Y means:– If two tuples agree on X , then we may

swap the Y components and get two tuples that are also in the relation.

13

Example

• EmpID ->-> PhoneEmpID Addr Phone Project

If: 123 a1 p1 j1123 a2 p2 j2

Then: 123 a1 p2 j1123 a2 p1 j2

14

MVD Origins

• Independent work of Delobel, Fagin, Zaniolo

• Inference of MVD’s + FD’s: Beeri, Fagin, and Howard.

15

Generalized Dependencies• Equality-generating dependencies:

– If tuples with this pattern of equal symbols appear in a relation instance, then certain symbols must also be equal.

– Generalizes FD’s.

• Tuple-generating dependencies:– If tuples with this pattern of equal symbols

appear in a relation instance, then a certain tuple must also appear.

– Generalizes MVD’s.

16

Example: EGD

• FD A ->B in schema ABC :A B Ca b1 c1

(Hypotheses)a b2 c2 b1 = b2

(Conclusion)

17

Example: TGD

• MVD A ->->B in schema ABC :A B Ca b1 c1

(Hypotheses)a b2 c2a b1 c2 (Conclusion)

18

Full Versus Embedded GD’s• Full GD’s have a conclusion with only

symbols that appear in the hypotheses.• Embedded GD’s may have new

symbols in the conclusion.– Existentially quantified.

• Embedded EGD makes no sense, but embedded TGD’s are quite interesting.

19

The Chase for Inferring GD’s• Test whether a GD G follows from given

GD’s.• Start with R = the hypotheses of G .• Apply GD H by mapping the hypotheses

of H to some rows of R .– If so, infer the (mapped) conclusion of H ---

equate two symbols of R or add a tuple to R .

• If you eventually infer the conclusion of G , then G follows, else it does not.

20

Example: If A ->B Then A ->->B• R has tuples (a,b1,c1) and

(a,b2,c2). Does it have (a,b1,c2)?• Apply A ->B .

– It’s hypotheses are the same as the two tuples of R , so surely they map.

– Lets us conclude b1=b2.– Thus, since R has (a,b2,c2), It surely

has (a,b1,c2).

21

Decision Properties of GD’s• If all dependencies are full, no new

symbols ever appear, so the chase must terminate.– Either we prove the desired

conclusion, or R becomes a relation that provides a counterexample --- it satisfies all the given GD’s, but not the one we were trying to infer.

22

Undecidability of Embedded GD’s• First undecidability results involved the

inference of untyped GD’s (symbols may appear in several columns).– Beeri and Vardi– Chandra, Lewis, and Makoswky

• Tough result is undecidability even when GD’s are typed (one column per symbol).– Vardi– Gurevich and Lewis

23

History of GD’s

• Similar ideas were developed independently by several people:– Yannakakis and Papadimitriou– Beeri and Vardi– Fagin– Sadri and Ullman

24

The Universal Relation

• Idea: attributes carry information regardless of how they are placed in relation schemas.

• One of the cool things about Moshe Vardi is the collection he keeps of re-inventions of this universal relation concept.– Typically touted by some HCI type as a

“wonderful new interface to databases.”

25

UR Query Systems

Query

UR

StoredRelations

26

Example• Stored relations: ES(Emp,Sal),

ED(Emp,Dept), DM(Dept,Mgr)• Universal relation: U(Emp,Sal, Dept,Mgr)• UR query: SELECT Mgr WHERE Emp=a• Translated to: SELECT Mgr FROM ED, DM WHERE Emp=a AND ED.Dept=DM.Dept

• U != join(ES, ED, DM); empty ES proves it.

27

Theory of Universal Relations• Some were quite upset by the UR idea.

– Objection: the informal or ad-hoc way queries were translated from UR to stored relations.

• Codd: “Neither the collection of all base relations nor the collection of all views should be cast by the DBMS in the form of a `universal relation’ (in the Stanford University sense) [Vardi, 1988].”

28

Some Early Approaches to the UR• Selected joins of stored relations.• Representative instance = pad stored

relations with nulls in missing attributes; then chase with given dependencies.– Contributions by Vassiliou, Honeyman,

Mendelzon, Sagiv, Yannakakis.

• Maximal objects = union of joins within “acyclic hypergraphs.”– Maier, Ullman

29

Window Functions• Two stage process:

1. If query involves set of attributes X , compute the window function [X ] = some relation with schema X derived from the UR.2. Apply the query to [X ].

• Different UR definitions give different window functions, e.g., join-based, rep.-instance, maximal-object.

• From Maier, Ullman, Vardi [1984].

30

Acyclic Hypergraphs

• Several different applications led to the identification of a class of database schemas (collection of relation schemas) with useful properties.– Sensible connections among stored relations

in maximal-object theory of UR.– Joins of many relations in time polynomial in

the size of the input and output (Yannakakis).

– Etc., etc.

31

Hypergraphs

• Nodes (typically attributes)• (Hyper)edges = sets of any

number of nodes.

A B C D E F

32

GYO Reduction• Graham, Yu-Ozsuyoglu, independently

defined acyclic hypergraphs thusly: • Two transformations:

1. Delete a node that is in only one hyperedge.2. Delete a hyperedge contained in another.

• Hypergraph is acyclic iff the result is a single, empty edge.– Limit is unique for any hypergraph.

33

Example

• GYO steps: delete C; delete D; delete {B}; delete F; delete {B,E}; delete A; delete B; delete E;

A B C D E F

34

History• Acyclic hypergraph idea from Bernstein

and Goodman.– But the definition is rather different.

• Hypergraph version from Fagin, Mendelzon, Beeri, Maier, others.– And don’t forget the GYO folks.

• Fagin defined several related-but-distinct concepts of “acyclic.”– This one is “alpha-acyclic.”

35

Logical Query Languages

• Collections of Horn-clause rules without function symbols behave almost like SQL, but can do recursion.

• Conjunctive queries = single Horn-clause rules w/o function symbols have decidable containment (Chandra and Merlin).

36

Conjunctive Queries

• Example: path(X,Y) :- edge(X,Z) & edge(Z,Y)

• Atoms in the body (hypothesis) refer to stored relations (EDB or “Extensional Database”).

• Atom in head (conclusion) is the result.• Q1 is contained in Q2 if for every

database D, Q1(D ) is a subset of Q2(D ).

37

Containment Mappings

• Mapping from the variables of Q2 to the variables of Q1 that:

1. Turns the head of Q2 into the head of Q1.2. Turns every atom in the body of Q2 into some atom in the body of Q1.

• Containment mapping exists iff Q1 is contained in Q2.

38

Example

Q1: answer(X,Y) :- e(X,Y) & e(Y,X)

Q2: answer(X,Y) :- e(X,W) & e(W,Z) & e(Z,Y)

39

Datalog Programs

• Collection of conjunctive queries.• Some predicates are EDB (stored).• Other predicates are IDB

(intensional database; defined by the CQ’s only).

• One IDB predicate is the answer = least fixedpoint of the CQ’s.

40

Example

path(X,Y) :- edge(X,Y)path(X,Y) :- path(X,Z) & edge(Z,Y)

• Value of path is all pairs of nodes such that there is a path from the first to the second according to EDB predicate edge .

41

Optimization of Datalog Programs• Problem: often the query asks for

only a fraction of the answer.– e.g., find path(0,Y) for fixed node 0.

• Bottom-up methods (“seminaive”) evaluate the whole least-fixedpoint, throw most away.

• Top-down (goal seeking) can get stuck in left-recursion.

42

Linear-Recursion Methods

• Recursion is linear if at most one IDB atom in the body.– Henschen-Naqvi.– Magic-sets for linear (Bancilhon, Maier,

Sagiv, Ullman).– Left- and right-linear special cases

(Naughton).– Conversion of nonlinear to linear

(Ramakrishnan, Sagiv, Ullman, Vardi).

43

Magic Sets• Several similar techniques for rewriting or

executing Datalog programs in a way that avoids generation of useless facts.– Rohmer, Lescoeur, and Kerisit (earliest

exposition).– Beeri and Ramakrishnan (reordering of

atoms).– Sacca and Zaniolo (simplification of rules).– Dietrich and Warren (tabulation).– Vielle (query-subquery).

44

Bounded Recursion

• Sometimes a recursive Datalog program is equivalent to a nonrecursive program.

• Sufficient conditions (Naughton, Ioannidis, Naughton and Sagiv).

• Undecidability (Gaifman, Mairson, Sagiv, and Vardi).

• Decidable cases (Vardi, Cosmodakis, Gaifman, Kanellakis, and Vardi)

45

Containment for Datalog

• Undecidable if one Datalog program is contained in another (Shmueli).

• NP-complete whether a CQ is contained in a Datalog program.

• Triply exponential whether a Datalog program is contained in a CQ (Chaudhuri and Vardi).

46

What Is It All Good For?

• Dependencies and normalization are now familiar to most CS graduates.– Difference between BCNF and 3NF.– Tradeoffs: lossless joins versus maintaining

dependencies.

• Magic-sets, recursion used in IBM’s DB/2.

• Universal-relation interfaces despite Codd.

47

More “What Is It All Good For?”• Information integration . While

logical query languages have not caught on, logic has been used to specify how legacy databases are combined into a uniform whole.– Tsimmis (Papakonstantinou, Vassalos).– Information Manifold (Levy).– Infomaster (Duschka, Genesereth).