CS848 Presentation Heng YU (Henry) h3yu@hopper.uwaterloo.ca

Upload: jade-washington

Post on 20-Jan-2016


Page 1: CS848 Presentation Heng YU (Henry) h3yu@hopper.uwaterloo.ca

CS848 Presentation

Heng YU (Henry)

h3yu@hopper.uwaterloo.ca


Paper to present

Answering queries using views:

A survey

by A. Y. Halevy

VLDB Journal 10: pp. 270-294

2001


Outline

• Introduction with examples
• Formal problem definitions
• Conditions of view usability
• Using materialized views in query optimization
• Answering queries using views in data integration
• Theoretical results
• Extensions
• Conclusion and challenges


Introduction


Problems (informal)

Given a query Q and a set of views V1, ..., Vn over a database schema:

• Is it possible to answer Q using only the answers to V1, ..., Vn?

• What is the maximal set of tuples in the answer of Q that we can get from V1, ..., Vn?

• If we can access both the views and the database relations, what is the cheapest query execution plan for answering Q?


Fields of applications

• Query optimization

• Physical data independence

• Data integration

• Others, e.g. semantic caching


Example: a university schema

Prof(name, area)
Course(c-number, title)
Teaches(prof, c-number, quarter)
Registered(student, c-number, quarter)
Major(student, dept)
Works(prof, dept)
Advises(prof, student)

Keys: Prof(name), Course(c-number)

Graduate course: c-number ≥ 400
Ph.D. course: c-number ≥ 500


Query Optimization

Suppose we have a view for graduate course registration info:

create view Graduate as
select Registered.student, Course.title, Course.c-number, Registered.quarter
from Registered, Course
where Registered.c-number = Course.c-number and
      Course.c-number ≥ 400


Query optimization (cont.)

We want to find the students registered in Ph.D.-level courses taught by a professor who is interested in the DB area:

select Registered.student, Course.title
from Teaches, Prof, Registered, Course
where Prof.name = Teaches.prof and
      Teaches.c-number = Registered.c-number and
      Teaches.quarter = Registered.quarter and
      Registered.c-number = Course.c-number and
      Course.c-number ≥ 500 and Prof.area = ‘DB’


Query
select Registered.student, Course.title
from Teaches, Prof, Registered, Course
where Prof.name = Teaches.prof and
      Teaches.c-number = Registered.c-number and
      Teaches.quarter = Registered.quarter and
      Registered.c-number = Course.c-number and
      Course.c-number ≥ 500 and Prof.area = ‘DB’

View
create view Graduate as
select Registered.student, Course.title, Course.c-number, Registered.quarter
from Registered, Course
where Registered.c-number = Course.c-number and
      Course.c-number ≥ 400


Query optimization (cont.)

Result of query rewriting

select Graduate.student, Graduate.title
from Teaches, Prof, Graduate
where Prof.name = Teaches.prof and
      Teaches.c-number = Graduate.c-number and
      Teaches.quarter = Graduate.quarter and
      Graduate.c-number ≥ 500 and Prof.area = ‘DB’
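The equivalence of the original query and its rewriting over the Graduate view can be checked on a small instance. The following is a minimal sketch using Python's sqlite3 module; the sample rows are invented for illustration, and the column name c_number stands in for the slides' c-number because hyphens are not valid in unquoted SQL identifiers.

```python
import sqlite3

# Toy instance of the university schema (rows are hypothetical).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Prof(name TEXT, area TEXT);
CREATE TABLE Course(c_number INTEGER, title TEXT);
CREATE TABLE Teaches(prof TEXT, c_number INTEGER, quarter TEXT);
CREATE TABLE Registered(student TEXT, c_number INTEGER, quarter TEXT);

INSERT INTO Prof VALUES ('Ann', 'DB'), ('Bob', 'AI');
INSERT INTO Course VALUES (350, 'Intro DB'), (450, 'DB Systems'),
                          (550, 'Adv DB'), (560, 'Adv AI');
INSERT INTO Teaches VALUES ('Ann', 450, 'F15'), ('Ann', 550, 'W16'),
                           ('Bob', 560, 'W16');
INSERT INTO Registered VALUES ('s1', 450, 'F15'), ('s1', 550, 'W16'),
                              ('s2', 550, 'W16'), ('s3', 560, 'W16');

-- The view from the slides: registrations in graduate courses.
CREATE VIEW Graduate AS
SELECT Registered.student AS student, Course.title AS title,
       Course.c_number AS c_number, Registered.quarter AS quarter
FROM Registered, Course
WHERE Registered.c_number = Course.c_number AND Course.c_number >= 400;
""")

# Original query over the base relations.
ORIGINAL = """
SELECT Registered.student, Course.title
FROM Teaches, Prof, Registered, Course
WHERE Prof.name = Teaches.prof
  AND Teaches.c_number = Registered.c_number
  AND Teaches.quarter = Registered.quarter
  AND Registered.c_number = Course.c_number
  AND Course.c_number >= 500 AND Prof.area = 'DB'
"""

# Rewriting that replaces the Registered/Course join with the view.
REWRITTEN = """
SELECT Graduate.student, Graduate.title
FROM Teaches, Prof, Graduate
WHERE Prof.name = Teaches.prof
  AND Teaches.c_number = Graduate.c_number
  AND Teaches.quarter = Graduate.quarter
  AND Graduate.c_number >= 500 AND Prof.area = 'DB'
"""

original_rows = sorted(conn.execute(ORIGINAL).fetchall())
rewritten_rows = sorted(conn.execute(REWRITTEN).fetchall())
print(original_rows)
print(rewritten_rows)
```

Agreement on one instance does not prove equivalence; it only illustrates why the view is usable here: its predicate (c-number ≥ 400) is weaker than the query's (c-number ≥ 500), and it keeps c-number and quarter in its output so the remaining predicates can still be applied.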


Maintaining physical data independence

• Relational database systems rely on a 1-1 mapping between relations and files.

• In object-oriented and semistructured databases, the logical model is more redundant and does not reflect the optimal physical design.

• Physical storage can be described as views over the logical model, e.g. GMAP (Tsatalos et al. 96).


Maintaining physical data independence (cont.)

GMAP (generalized multi-level access paths)

def.gmap G1 as b+-tree by
    given Student.name
    select Department
    where Student major Department

def.gmap G2 as b+-tree by
    given Student.name
    select Course.c-number
    where Student registered Course

def.gmap G3 as b+-tree by
    given Course.c-number
    select Department
    where Student registered Course and
          Student major Department


Maintaining physical data independence (cont.)

Query:
select Student.name, Department
where Student registered Course and
      Student major Department and
      Course.c-number ≥ 500

Plans:
1. π_{Student.name, Department}(σ_{Course.c-number ≥ 500}(G1 ⋈_{Student.name} G2))
2. σ_{Course.c-number ≥ 500}(G3) ⋈_{Course.c-number} G2


Data integration

• Providing a uniform query interface to a multitude of autonomous, heterogeneous data sources.

• Giving users a mediated schema.

• Local as View: specifying data source descriptions as views over the mediated schema.


Data integration (cont.)

Example:

Prof(name, area)
Course(c-number, title, univ)
Teaches(prof, c-number, quarter, univ)
Registered(student, c-number, quarter)
Major(student, dept)
Works(prof, dept)
Advises(prof, student)


Data integration (cont.)

Suppose we have only 2 views available:

create view DB-courses as
select Course.title, Teaches.prof, Course.c-number, Course.univ
from Teaches, Course
where Teaches.c-number = Course.c-number and
      Teaches.univ = Course.univ and Course.title = “Database Systems”

create view UW-phd-courses as
select Course.title, Teaches.prof, Course.c-number, Course.univ
from Teaches, Course
where Teaches.c-number = Course.c-number and
      Course.univ = ‘UW’ and Teaches.univ = ‘UW’ and
      Course.c-number ≥ 500


Data integration (cont.)

• Find who teaches database courses at UW:

select prof
from DB-courses
where univ = ‘UW’

• Find all graduate courses at UW:

select title, c-number
from DB-courses
where univ = ‘UW’ and c-number ≥ 400
UNION
select title, c-number
from UW-phd-courses
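The second query illustrates a rewriting that is contained in the intended query but not equivalent to it: the two sources cannot see every graduate course at UW. A minimal sketch with Python's sqlite3 module follows; the rows are invented, identifiers use underscores in place of hyphens, and the mediated relations are materialized only so that the intended answer can be computed for comparison (in a real integration setting only the sources would be accessible).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Mediated-schema relations (hypothetical rows, used here as ground truth).
CREATE TABLE Course(c_number INTEGER, title TEXT, univ TEXT);
CREATE TABLE Teaches(prof TEXT, c_number INTEGER, quarter TEXT, univ TEXT);
INSERT INTO Course VALUES
  (448, 'Database Systems', 'UW'),
  (480, 'Compilers', 'UW'),
  (848, 'Advanced Databases', 'UW'),
  (500, 'Database Systems', 'MIT');
INSERT INTO Teaches VALUES
  ('Ann', 448, 'F15', 'UW'), ('Bob', 480, 'F15', 'UW'),
  ('Cat', 848, 'W16', 'UW'), ('Dan', 500, 'F15', 'MIT');

-- The two sources, described as views over the mediated schema.
CREATE VIEW DB_courses AS
SELECT Course.title AS title, Teaches.prof AS prof,
       Course.c_number AS c_number, Course.univ AS univ
FROM Teaches, Course
WHERE Teaches.c_number = Course.c_number AND Teaches.univ = Course.univ
  AND Course.title = 'Database Systems';

CREATE VIEW UW_phd_courses AS
SELECT Course.title AS title, Teaches.prof AS prof,
       Course.c_number AS c_number, Course.univ AS univ
FROM Teaches, Course
WHERE Teaches.c_number = Course.c_number
  AND Course.univ = 'UW' AND Teaches.univ = 'UW'
  AND Course.c_number >= 500;
""")

# Rewriting that uses only the two sources.
from_views = set(conn.execute("""
SELECT title, c_number FROM DB_courses WHERE univ = 'UW' AND c_number >= 400
UNION
SELECT title, c_number FROM UW_phd_courses
""").fetchall())

# Intended answer over the mediated schema: all graduate courses at UW.
truth = set(conn.execute(
    "SELECT title, c_number FROM Course WHERE univ = 'UW' AND c_number >= 400"
).fetchall())

print(sorted(from_views))
print(sorted(truth - from_views))  # courses no source can report
```

On this instance the 400-level Compilers course is missed, since it is neither a "Database Systems" course nor a Ph.D.-level course, which is why data integration settles for contained rewritings.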


Comparison of the two applications

                     | Query optimization and physical design        | Data integration
Output               | Query execution plan Q’                       | Query Q’
Equivalence with Q   | Q’ must be equivalent to Q                    | Q’ can be equivalent to or contained in Q
Data accessed        | Original relational data + materialized views | Only views
# of views           | Modest                                        | Huge
View completeness    | Yes                                           | No
Rewriting reasoning  | Logical correctness + cost model              | Logical correctness


Formal Problem Definition


Query containment and equivalence

Definition: A query Q1 is said to be contained in a query Q2, denoted by Q1 ⊑ Q2, if for all database instances D, the set of tuples computed for Q1 is a subset of those computed for Q2, i.e., Q1(D) ⊆ Q2(D). The two queries are equivalent if Q1 ⊑ Q2 and Q2 ⊑ Q1.
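For conjunctive queries without comparison predicates (under set semantics), containment can be decided without enumerating database instances: Q1 ⊑ Q2 holds exactly when there is a containment mapping from Q2 into Q1, a classical result that the slide does not spell out. The sketch below searches for such a mapping by backtracking; the query encoding, the function names, and the convention that uppercase strings are variables are all assumptions made for this illustration.

```python
def is_var(term):
    # Convention of this sketch only: variables are strings that start
    # with an uppercase letter; everything else is a constant.
    return isinstance(term, str) and term[:1].isupper()


def extend(mapping, src_terms, dst_terms):
    """Extend a variable mapping so src_terms map onto dst_terms, else None."""
    mapping = dict(mapping)
    for s, d in zip(src_terms, dst_terms):
        if is_var(s):
            if mapping.setdefault(s, d) != d:
                return None  # variable already mapped to something else
        elif s != d:
            return None      # constants must map to themselves
    return mapping


def contained_in(q1, q2):
    """True if q1 is contained in q2.

    A query is (head_terms, body), with body a list of (relation, terms).
    Looks for a mapping of q2's variables that sends head(q2) to head(q1)
    and every body atom of q2 to some body atom of q1.
    """
    head1, body1 = q1
    head2, body2 = q2
    if len(head1) != len(head2):
        return False
    start = extend({}, head2, head1)
    if start is None:
        return False

    def search(i, mapping):
        if i == len(body2):
            return True
        rel, terms = body2[i]
        for rel1, terms1 in body1:
            if rel1 == rel and len(terms1) == len(terms):
                nxt = extend(mapping, terms, terms1)
                if nxt is not None and search(i + 1, nxt):
                    return True
        return False

    return search(0, start)


# q_db(S)  :- Registered(S,C,Q), Teaches(P,C,Q), Prof(P,'db')
# q_any(S) :- Registered(S,C,Q), Teaches(P,C,Q)
q_db = (("S",), [("Registered", ("S", "C", "Q")),
                 ("Teaches", ("P", "C", "Q")),
                 ("Prof", ("P", "db"))])
q_any = (("S",), [("Registered", ("S", "C", "Q")),
                  ("Teaches", ("P", "C", "Q"))])

print(contained_in(q_db, q_any))   # True: q_db only adds a restriction
print(contained_in(q_any, q_db))   # False: q_any has no Prof atom to map onto
```

The search is exponential in the worst case, which is consistent with the problem being NP-complete for conjunctive queries.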


Equivalent rewritings

Definition: Let Q be a query and V = {V1, V2, …, Vm} be a set of view definitions. The query Q’ is an equivalent rewriting of Q using V if:
• Q’ refers only to the views in V;
• Q’ is equivalent to Q.


Maximally-contained rewritings

Definition: Let Q be a query, V = {V1, V2, …, Vm} be a set of view definitions, and L be a query language. The query Q’ is a maximally-contained rewriting of Q using V w.r.t. L if:
• Q’ is a query in L that refers only to the views in V;
• Q’ is contained in Q;
• there is no rewriting Q1 ∈ L such that Q’ ⊑ Q1 ⊑ Q and Q1 is not equivalent to Q’.


Certain Answers

• Problem: finding all the answers to a query given a set of views.

• Not equivalent to finding a maximally-contained rewriting, because maximal containment is defined relative to a query language.

• Formalized by certain answers (Abiteboul et al. 98).

• A tuple α is a certain answer of Q w.r.t. a set of view definitions {Vi} and their extensions {vi} if α is in Q(D) for every possible database instance D such that Vi(D) = vi (closed-world assumption, CWA) or Vi(D) ⊇ vi (open-world assumption, OWA).


Conditions of view usability


View usability conditions

For an SPJ view V to be usable in an equivalent rewriting of an SPJ query Q under bag semantics:

1. There is a mapping ψ from the occurrences of tables mentioned in the from clause of V to those mentioned in the from clause of Q, mapping every table name to itself. For bag semantics, ψ must be 1-1.

2. V must either apply the join and selection predicates in Q on the attributes of the tables in the domain of ψ, or apply a logically weaker selection and select the attributes on which the predicates still need to be applied.

3. V must not project out any attributes of the tables in the domain of ψ that are needed in the selection of Q.


Using materialized views in query optimization


System-R style optimization

Step                     | Traditional optimizer                                                                  | Optimizer using views
Single table access path | Access paths on all tables                                                             | Also consider usable materialized views
Combining partial plans  | The predicates of the two partial plans are known, and the cheapest is considered.     | Consider joining partial plans with several alternative join predicates.
Pruning of plans         | Save the cheapest plan of each equivalence class.                                      | Compare any pair of plans, and discard one if another, cheaper plan dominates it.
Termination testing      | Has the equivalence class including all relations in the query been considered?        | Have all partial plans been examined?


System-R style (cont.)


Queries with grouping and aggregation

Example:

View:
create view V as
select c-number, year, Max(evaluation) as maxeval, Count(*) as offerings
from Teaches
where c-number ≥ 400
group by c-number, year

Query:
select year, Count(*), Max(evaluation)
from Teaches
where c-number ≥ 500
group by year


Queries with grouping and aggregation (cont.)

The query can be rewritten to:
select year, Sum(offerings), Max(maxeval)
from V
where c-number ≥ 500
group by year

Comments:
• Rewriting is subject to more restrictions when grouping and aggregation are involved.
• The grouping in the view must be at least as fine as that in the query.
• The aggregations in the query must be recoverable from the output fields and aggregations in the view.
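The recoverability condition can be seen concretely: Count(*) over the coarser grouping is the sum of the view's per-group counts, and Max(evaluation) is the maximum of the view's per-group maxima. The following is a minimal sketch with Python's sqlite3 module; the Teaches rows (including the year and evaluation columns used on this slide) are invented, and c_number replaces c-number.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical offerings with an evaluation score per offering.
CREATE TABLE Teaches(prof TEXT, c_number INTEGER, year INTEGER, evaluation REAL);
INSERT INTO Teaches VALUES
  ('Ann', 450, 2000, 4.0),
  ('Ann', 550, 2000, 3.5),
  ('Bob', 550, 2000, 4.5),
  ('Bob', 560, 2000, 3.0),
  ('Ann', 550, 2001, 4.2),
  ('Bob', 350, 2001, 5.0);

-- The view from the slide: finer grouping (c_number, year).
CREATE VIEW V AS
SELECT c_number AS c_number, year AS year,
       MAX(evaluation) AS maxeval, COUNT(*) AS offerings
FROM Teaches
WHERE c_number >= 400
GROUP BY c_number, year;
""")

# Query evaluated directly on the base table.
direct = conn.execute("""
SELECT year, COUNT(*), MAX(evaluation)
FROM Teaches WHERE c_number >= 500
GROUP BY year ORDER BY year
""").fetchall()

# Rewriting: re-aggregate the view's partial aggregates.
via_view = conn.execute("""
SELECT year, SUM(offerings), MAX(maxeval)
FROM V WHERE c_number >= 500
GROUP BY year ORDER BY year
""").fetchall()

print(direct)
print(via_view)
```

An aggregate such as Avg would not be recoverable from this particular view, because neither the per-group sum nor the per-group average is among its output columns.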


Answering queries using views for data integration


Main approaches

• Using datalog query representation for both Q and V.

• Algorithms:
  – Bucket algorithm (Levy et al. 96)
  – Inverse rules algorithm (Qian et al. 96)
  – MiniCon algorithm (Pottinger et al. 00)


Bucket algorithm

• Create a bucket for each non-comparison subgoal g in Q. For each subgoal g’ in a view V, if there is a unifier θ for g and g’ such that, after unification,
  1) the comparison predicates in Q and V are simultaneously satisfiable;
  2) every variable that appears in both head(Q) and the subgoal g of the query has its corresponding variable in g’ appear in head(V),
  then add θ(head(V)) to the bucket of g.

• Enumerate candidate conjunctive rewritings, each a conjunction that takes one conjunct from each bucket. A candidate is a conjunctive rewriting if either
  1) the conjunction is contained in Q, or
  2) it is possible to add comparison atoms such that the resulting conjunction is contained in Q.


Bucket algorithm example

V1(student, c-number, quarter, title) :-
    Registered(student, c-number, quarter), Course(c-number, title),
    c-number ≥ 500, quarter ≥ Aut98.

V2(student, prof, c-number, quarter) :-
    Registered(student, c-number, quarter), Teaches(prof, c-number, quarter).

V3(student, c-number) :-
    Registered(student, c-number, quarter), quarter ≤ Aut94.

V4(prof, c-number, title, quarter) :-
    Registered(student, c-number, quarter), Course(c-number, title),
    Teaches(prof, c-number, quarter), quarter ≤ Aut97.


Bucket algorithm example (cont.)

Query:
Q(S,C,P) :- Teaches(P,C,Q), Registered(S,C,Q), Course(C,T), C ≥ 300, Q ≥ Aut95.

Buckets:

Teaches(P,C,Q) | Registered(S,C,Q) | Course(C,T)
V2(S’,P,C,Q)   | V1(S,C,Q,T’)      | V1(S’,C,Q’,T)
V4(P,C,T’,Q)   | V2(S,P’,C,Q)      | V4(P’,C,T,Q’)


Bucket algorithm example (cont.)

Result of rewriting:
q’(S,C,P) :- V2(S’,P,C,Q), V1(S,C,Q,T’)
q’(S,C,P) :- V4(P,C,T’,Q), V1(S,C,Q,T’), V4(P’,C,T,Q’)
q’(S,C,P) :- V2(S,P,C,Q), V4(P,C,T’,Q)

The second query is empty (V4 requires Q ≤ Aut97 while V1 requires Q ≥ Aut98), so the result is the union of the first and the third conjunctive queries.


Bucket algorithm comments

• Advantages
  – Prunes a significant number of candidate rewritings.
  – Returns the maximally-contained rewriting when the query has no comparison predicates.

• Disadvantages
  – The Cartesian product of the buckets can still be large.
  – Testing query containment is costly: it is Π^p_2-complete.


Inverse-rules algorithm

• Construct a set of rules that invert the view definitions.

• Idea: each tuple in the head of a view definition is a witness for tuples in the relations corresponding to the subgoals in the body.

• Assign one Skolem function symbol to each existential variable in the view definition.


Inverse-rules algorithm example

Example:

View definition:
V3(dept, c-number) :- Major(student, dept), Registered(student, c-number)

Inverse rules:
Major(f1(dept, X), dept) :- V3(dept, X)
Registered(f1(Y, c-number), c-number) :- V3(Y, c-number)


Inverse-rule algorithm example (cont.)

Query:
q(dept) :- Major(student, dept), Registered(student, 444)

V3 has tuples: {(CS, 444), (EE, 444), (CS, 333)}

Applying inverse rules:

Major: {(f1(CS, 444), CS), (f1(EE, 444), EE), (f1(CS, 333), CS)}

Registered: {(f1(CS, 444), 444), (f1(EE, 444), 444), (f1(CS, 333), 333)}

Answer: {EE, CS}
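The example can be replayed in a few lines. The sketch below applies the two inverse rules to the V3 tuples, representing each Skolem term f1(dept, c-number) as a Python tuple (an encoding chosen for this illustration), and then evaluates the query over the reconstructed relations.

```python
# Extension of the view V3(dept, c-number), taken from the example.
v3 = {("CS", 444), ("EE", 444), ("CS", 333)}

# Inverse rules: one Skolem term f1(dept, c_number) per view tuple stands
# for the unknown student that witnesses it.
#   Major(f1(dept, X), dept)      :- V3(dept, X)
#   Registered(f1(Y, c), c)       :- V3(Y, c)
major = {(("f1", dept, c), dept) for dept, c in v3}
registered = {(("f1", dept, c), c) for dept, c in v3}

# q(dept) :- Major(student, dept), Registered(student, 444)
answer = {
    dept
    for student, dept in major
    for student2, c in registered
    if student == student2 and c == 444
}

print(sorted(answer))  # ['CS', 'EE']
```

The tuple (CS, 333) contributes nothing: its Skolem student is registered in 333, not 444, and distinct Skolem terms never join with each other.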


Inverse-rule algorithm comments

• Advantages
  – Simplicity and modularity.
  – Returns the maximally-contained rewriting.

• Disadvantages
  – Keeps more non-contributing views than the bucket algorithm.
  – Requires recomputing the relations from the views, so the benefit of using precomputed materialized views is lost.


MiniCon algorithm

• An improvement on the bucket algorithm.

• Aims to eliminate more views that are useless to the query.

• When we find a unification between a subgoal g’ in V and a subgoal g in Q, all other subgoals that join with g in Q are examined. V must either have the join attribute in its head, or contain the corresponding joined subgoals in its body.

• For each view, compute a MiniCon consisting of all the subgoals in the query that the view covers.


MiniCon example

Example:
q(D) :- Major(S, D), Registered(S, 444, Q), Advises(P, S)

V1(dept) :- Major(student, dept), Registered(student, 444, quarter).

V2(prof, dept, area) :- Advises(prof, student), Prof(name, area).

V3(dept, c-number) :- Major(student, dept), Registered(student, c-number, quarter), Advises(prof, student).

MiniCon(V1) = ∅, MiniCon(V2) = ∅,

MiniCon(V3) = {Major, Registered, Advises}


Theoretical results (very selective)


Completeness

• Question: given a query Q and a set of views V, will the algorithm find an equivalent rewriting of Q using V whenever one exists?

• When a conjunctive query (CQ) Q has no comparison predicates and has n subgoals, there exists an equivalent conjunctive rewriting of Q using V only if there is a rewriting with at most n subgoals. The problem is NP-hard (Levy et al. 1995).


Recursive rewriting

• Goal: a maximally-contained rewriting whose evaluation also gives the set of all certain answers.

• Recursive query rewriting is necessary when:
  – The query is recursive.
  – Database relations have functional dependencies.
  – There exist access pattern limitations on the views.
  – Views have unions.
  – Additional semantic information about class hierarchies on objects is expressed in DL.


Recursive rewriting (example with fd)

Relation: schedule(Airline, Flight_no, Date, Pilot, Aircraft)

FDs: Pilot → Airline, Aircraft → Airline

View: v(D, P, C) :- schedule(A, N, D, P, C)

Query:
Q(P) :- schedule(A, N, D, ‘mike’, C), schedule(A, N’, D’, P, C’)

Rewriting:
relevantPilot(‘mike’)
relevantAircraft(C) :- v(D, ‘mike’, C)
relevantAircraft(C) :- v(D, P, C), relevantPilot(P)
relevantPilot(P) :- relevantPilot(P1), relevantAircraft(C), v(D1, P1, C), v(D2, P, C)
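Because of the two functional dependencies, a pilot who flew an aircraft also flown by a relevant pilot must work for the same airline, and this chain can be arbitrarily long, hence the recursion. The sketch below evaluates the recursive rules bottom-up to a fixpoint; the view tuples and the function name relevant_pilots are invented for illustration, and reading the final answer off relevantPilot is an assumption, since the slide does not show the rule that defines the query predicate.

```python
def relevant_pilots(v, seed="mike"):
    """Bottom-up (naive) evaluation of the recursive rewriting.

    v is a set of (date, pilot, aircraft) tuples, the extension of the view.
    """
    pilots = {seed}      # relevantPilot('mike')
    aircraft = set()
    while True:
        # relevantAircraft(C) :- v(D, P, C), relevantPilot(P)
        # (this also covers the rule with the constant 'mike')
        new_aircraft = {c for (_, p, c) in v if p in pilots}
        # relevantPilot(P) :- relevantPilot(P1), relevantAircraft(C),
        #                     v(D1, P1, C), v(D2, P, C)
        new_pilots = {
            p
            for (_, p1, c1) in v
            for (_, p, c) in v
            if p1 in pilots and c1 in new_aircraft and c1 == c
        }
        if new_aircraft <= aircraft and new_pilots <= pilots:
            return pilots  # fixpoint: nothing new was derived
        aircraft |= new_aircraft
        pilots |= new_pilots


# Hypothetical view extension: mike and ann share aircraft a1, ann and bob
# share a2, and eve flies an unrelated aircraft.
v = {
    ("d1", "mike", "a1"),
    ("d2", "ann", "a1"),
    ("d3", "ann", "a2"),
    ("d4", "bob", "a2"),
    ("d5", "eve", "a9"),
}

print(sorted(relevant_pilots(v)))  # ['ann', 'bob', 'mike']
```

Reaching bob takes two rounds of the loop (mike to ann via a1, then ann to bob via a2), and longer chains take more rounds, which is why no fixed union of conjunctive queries over v can capture all the answers.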


Finding certain answers

• Open-world assumption: polynomial in most practical cases. NP-hard (in the size of the view extensions) if unions are allowed in the view definitions or inequality predicates are allowed in the query language.

• Closed-world assumption: co-NP-hard even if both views and queries are CQs and have no comparison predicates (cf. GAV: polynomial).

• When views can contain incorrect tuples (assuming no comparison predicates in the views or the query):
  – If all views are complete, or all views may have incorrect tuples: polynomial in the size of the view extensions.
  – Otherwise: co-NP-hard.


Extensions


Extensions

• Object query languages (OQL) (Florescu 96)
  – More semantic information about class hierarchies and attributes.
  – OQL does not clearly separate the select and where clauses; both can contain path navigation.

• Access pattern limitations (Rajaraman 95)
  – Restricted parameterized queries on views, e.g. CitationDB^bf(X,Y) :- Cites(X,Y)
  – Finite rewriting requires recursion.


Conclusion and challenges


Conclusion and challenges

• Answering queries using views plays a significant role in query optimization, physical data independence, and data integration.

• New fields to explore:
  – Consider new query languages.
  – Consider integration constraints.
  – Bridge the gap between query optimization and data integration.
  – Facilitate data warehouse queries: query result reuse, incremental computation.
  – Decide which views to materialize in the first place.


Thank you