
CS848 Presentation

Heng YU (Henry)

[email protected]


Paper to present

Answering queries using views:

A survey

by A. Y. Halevy

VLDB Journal 10: pp. 270-294

2001

Outline

• Introduction with examples
• Formal problem definitions
• Conditions of view usability
• Using materialized views in query optimization
• Answering queries using views in data integration
• Theoretical results
• Extensions
• Conclusion and challenges

Introduction

Problems (informal)

Given a query Q and a set of views V1, ..., Vn over a database schema:

• Is it possible to answer Q using only the answers to V1, ..., Vn?

• What is the maximal set of tuples in the answer of Q that we can get from V1, ..., Vn?

• If we can access both the views and the database relations, what is the cheapest query execution plan for answering Q?

Fields of applications

• Query optimization

• Physical data independence

• Data integration

• More: e.g. semantic cache

Example: a university schema
  Prof(name, area)
  Course(c-number, title)
  Teaches(prof, c-number, quarter)
  Registered(student, c-number, quarter)
  Major(student, dept)
  Works(prof, dept)
  Advises(prof, student)

Keys: Prof(name), Course(c-number)

Graduate courses: c-number ≥ 400
Ph.D. courses: c-number ≥ 500

Query Optimization

Suppose we have a view with graduate course registration info:

  create view Graduate as
  select Registered.student, Course.title, Course.c-number, Registered.quarter
  from Registered, Course
  where Registered.c-number = Course.c-number and
        Course.c-number ≥ 400

We want to find the students registered in Ph.D.-level courses taught by a professor who is interested in the DB area:

  select Registered.student, Course.title
  from Teaches, Prof, Registered, Course
  where Prof.name = Teaches.prof and
        Teaches.c-number = Registered.c-number and
        Teaches.quarter = Registered.quarter and
        Registered.c-number = Course.c-number and
        Course.c-number ≥ 500 and Prof.area = ‘DB’

Query optimization (cont.)

Query
  select Registered.student, Course.title
  from Teaches, Prof, Registered, Course
  where Prof.name = Teaches.prof and
        Teaches.c-number = Registered.c-number and
        Teaches.quarter = Registered.quarter and
        Registered.c-number = Course.c-number and
        Course.c-number ≥ 500 and Prof.area = ‘DB’

View
  create view Graduate as
  select Registered.student, Course.title, Course.c-number, Registered.quarter
  from Registered, Course
  where Registered.c-number = Course.c-number and
        Course.c-number ≥ 400


Query optimization (cont.)

Result of query rewriting

  select Graduate.student, Graduate.title
  from Teaches, Prof, Graduate
  where Prof.name = Teaches.prof and
        Teaches.c-number = Graduate.c-number and
        Teaches.quarter = Graduate.quarter and
        Graduate.c-number ≥ 500 and Prof.area = ‘DB’
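
The equivalence of the original query and its rewriting over Graduate can be checked on a toy instance. The sketch below uses Python's built-in sqlite3 module. The sample rows and quarter labels are made up for illustration, and c-number is spelled c_number because hyphens are not valid in unquoted SQL identifiers. SQLite views are not materialized, so this only illustrates that both queries return the same answers, not the cost benefit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table Prof(name text primary key, area text);
create table Course(c_number integer primary key, title text);
create table Teaches(prof text, c_number integer, quarter text);
create table Registered(student text, c_number integer, quarter text);

-- Made-up sample data (an assumption of this sketch).
insert into Prof values ('Ann', 'DB'), ('Bob', 'AI');
insert into Course values (350, 'Intro DB'), (448, 'DB Systems'),
                          (548, 'Adv DB'), (570, 'ML');
insert into Teaches values ('Ann', 548, 'F01'), ('Ann', 448, 'F01'),
                           ('Bob', 570, 'F01');
insert into Registered values ('sam', 548, 'F01'), ('kim', 448, 'F01'),
                              ('lee', 570, 'F01'), ('sam', 350, 'F01');

create view Graduate as
select Registered.student, Course.title, Course.c_number, Registered.quarter
from Registered, Course
where Registered.c_number = Course.c_number and Course.c_number >= 400;
""")

original = """
select Registered.student, Course.title
from Teaches, Prof, Registered, Course
where Prof.name = Teaches.prof and
      Teaches.c_number = Registered.c_number and
      Teaches.quarter = Registered.quarter and
      Registered.c_number = Course.c_number and
      Course.c_number >= 500 and Prof.area = 'DB'
"""

rewritten = """
select Graduate.student, Graduate.title
from Teaches, Prof, Graduate
where Prof.name = Teaches.prof and
      Teaches.c_number = Graduate.c_number and
      Teaches.quarter = Graduate.quarter and
      Graduate.c_number >= 500 and Prof.area = 'DB'
"""

orig_rows = sorted(conn.execute(original).fetchall())
new_rows = sorted(conn.execute(rewritten).fetchall())
print(orig_rows, new_rows)
```

On this instance both queries return the single row for student sam in the 548 course. The rewriting is safe because the view's predicate (c-number ≥ 400) is weaker than the query's (c-number ≥ 500) and the view keeps c-number in its output, so the stronger selection can still be applied.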

Maintaining physical data independence

• Relational database systems rely on a 1-1 mapping between relations and files.

• In object-oriented and semistructured databases, the logical model is more redundant and does not reflect the optimal physical design.

• Physical storage can be described as views over the logical model.

e.g. GMAP (Tsatalos et al. 96)

Maintaining physical data independence (cont.)

GMAP (generalized multi-level access paths)

  def.gmap G1 as b+-tree by
    given Student.name
    select Department
    where Student major Department

  def.gmap G2 as b+-tree by
    given Student.name
    select Course.c-number
    where Student registered Course

  def.gmap G3 as b+-tree by
    given Course.c-number
    select Department
    where Student registered Course and
          Student major Department

Maintaining physical data independence (cont.)

Query:
  select Student.name, Department
  where Student registered Course and
        Student major Department and
        Course.c-number ≥ 500

Plans (π = projection, σ = selection, ⋈ = join):
  1. π[Student.name, Department](σ[Course.c-number ≥ 500](G1 ⋈[Student.name] G2))
  2. σ[Course.c-number ≥ 500](G3) ⋈[Course.c-number] G2

Data integration

• Providing a uniform query interface to a multitude of autonomous heterogeneous data sources.

• Giving users a mediated schema.

• Local as View (LAV): specifying each data source description as a view over the mediated schema.

Data integration (cont.)

Example:

  Prof(name, area)
  Course(c-number, title, univ)
  Teaches(prof, c-number, quarter, univ)
  Registered(student, c-number, quarter)
  Major(student, dept)
  Works(prof, dept)
  Advises(prof, student)

Data integration (cont.)

Suppose we have only 2 views available:

  create view DB-courses as
  select Course.title, Teaches.prof, Course.c-number, Course.univ
  from Teaches, Course
  where Teaches.c-number = Course.c-number and
        Teaches.univ = Course.univ and
        Course.title = ‘Database Systems’

  create view UW-phd-courses as
  select Course.title, Teaches.prof, Course.c-number, Course.univ
  from Teaches, Course
  where Teaches.c-number = Course.c-number and
        Course.univ = ‘UW’ and Teaches.univ = ‘UW’ and
        Course.c-number ≥ 500

Data integration (cont.)

• Query: who teaches database courses at UW?

  select prof
  from DB-courses
  where univ = ‘UW’

• Query: all graduate courses at UW. The two views cannot supply every such course, so the best we can do is a contained rewriting:

  select title, c-number
  from DB-courses
  where univ = ‘UW’ and c-number ≥ 400
  UNION
  select title, c-number
  from UW-phd-courses

Comparison of the two applications

                      | Query optimization and physical design        | Data integration
  Output              | Query execution plan Q’                       | Query Q’
  Equivalence with Q  | Q’ must be equivalent to Q                    | Q’ can be equivalent to or contained in Q
  Data accessed       | Original relational data + materialized views | Only views
  # of views          | Modest                                        | Huge
  View completeness   | Yes                                           | No
  Rewriting reasoning | Logical correctness + cost model              | Logical correctness

Formal Problem Definition

Query containment and equivalence

Definition: A query Q1 is said to be contained in a query Q2, denoted by Q1 ⊆ Q2, if for all database instances D, the set of tuples computed for Q1 is a subset of those computed for Q2, i.e., Q1(D) ⊆ Q2(D). The two queries are equivalent if Q1 ⊆ Q2 and Q2 ⊆ Q1.
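
For conjunctive queries without comparison predicates, containment can be decided by searching for a containment mapping: Q1 ⊆ Q2 holds exactly when the variables of Q2 can be mapped to terms of Q1 so that the head of Q2 maps to the head of Q1 and every subgoal of Q2 maps to some subgoal of Q1. The sketch below is a brute-force version of that test. The tuple encoding of queries and the convention that capitalized strings are variables are assumptions of this example, not notation from the survey.

```python
def is_var(term):
    # Assumption of this sketch: variables are capitalized strings,
    # everything else (numbers, lowercase strings) is a constant.
    return isinstance(term, str) and term[:1].isupper()

def extend(mapping, src_args, dst_args):
    """Try to extend a variable mapping so that src_args maps onto dst_args."""
    mapping = dict(mapping)
    for s, d in zip(src_args, dst_args):
        if is_var(s):
            if mapping.setdefault(s, d) != d:
                return None          # variable already mapped elsewhere
        elif s != d:
            return None              # constants must match exactly
    return mapping

def contained_in(q1, q2):
    """Return True if q1 is contained in q2 (both are (head, body) pairs)."""
    head1, body1 = q1
    head2, body2 = q2
    if len(head1) != len(head2):
        return False
    start = extend({}, head2, head1)
    if start is None:
        return False

    def search(i, mapping):
        # Map the i-th subgoal of q2 onto some subgoal of q1, with backtracking.
        if i == len(body2):
            return True
        pred, args = body2[i]
        for pred1, args1 in body1:
            if pred1 == pred and len(args1) == len(args):
                nxt = extend(mapping, args, args1)
                if nxt is not None and search(i + 1, nxt):
                    return True
        return False

    return search(0, start)

# q1(X) :- Teaches(X, C, Q), Course(C, T)   (professors teaching a listed course)
# q2(X) :- Teaches(X, C, Q)                 (professors teaching anything)
q1 = (("X",), [("Teaches", ("X", "C", "Q")), ("Course", ("C", "T"))])
q2 = (("X",), [("Teaches", ("X", "C", "Q"))])
print(contained_in(q1, q2), contained_in(q2, q1))
```

Here q1 ⊆ q2 holds but not the converse, so the two queries are not equivalent. The search is exponential in the worst case, which is consistent with containment of conjunctive queries being NP-complete.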

Equivalent rewritings

Definition: Let Q be a query and V = {V1, V2, …, Vm} be a set of view definitions. The query Q’ is an equivalent rewriting of Q using V if:
• Q’ refers only to the views in V;
• Q’ is equivalent to Q.

Maximally-contained rewritings

Definition: Let Q be a query, V = {V1, V2, …, Vm} be a set of view definitions, and L be a query language. The query Q’ is a maximally-contained rewriting of Q using V w.r.t. L if:
• Q’ is a query in L that refers only to the views in V;
• Q’ is contained in Q;
• there is no rewriting Q1 ∈ L such that Q’ ⊆ Q1 ⊆ Q and Q1 is not equivalent to Q’.

Certain Answers

• Problem: finding all the answers to a query given a set of views.

• Not the same problem as finding a maximally-contained rewriting, because maximal containment is defined relative to a particular query language.

• Formalized by certain answers (Abiteboul et al. 98).

• A tuple α is a certain answer of Q w.r.t. a set of view definitions {Vi} and their extensions {vi} if α is in Q(D) for every possible database instance D such that Vi(D) = vi (closed-world assumption, CWA) or Vi(D) ⊇ vi (open-world assumption, OWA).

Conditions of view usability

View usability conditions

For an SPJ view V to be usable in an equivalent rewriting of an SPJ query Q under bag semantics:

1. There is a mapping ψ from the occurrences of tables mentioned in the from clause of V to those mentioned in the from clause of Q, mapping every table name to itself. For bag semantics, ψ must be 1-1.

2. V must either apply the join and selection predicates of Q on the attributes of the tables in the domain of ψ, or apply a logically weaker selection and select the attributes on which predicates still need to be applied.

3. V must not project out any attributes of the tables in the domain of ψ that are needed in the selection of Q.

Using materialized view in query optimization

System-R style optimization

Single table access path
  Traditional optimizer: access paths on all tables.
  Optimizer using views: also considers usable materialized views.

Combining partial plans
  Traditional optimizer: the predicates of the two partial plans are known, and the cheapest join is considered.
  Optimizer using views: considers joining partial plans with several alternative join predicates.

Pruning of plans
  Traditional optimizer: saves the cheapest plan of each equivalence class.
  Optimizer using views: compares any pair of plans and discards one if a cheaper plan dominates it.

Termination testing
  Traditional optimizer: has the equivalence class that includes all relations in the query been considered?
  Optimizer using views: have all partial plans been examined?

System-R style (cont.)

Queries with grouping and aggregation

Example:

View:
  create view V as
  select c-number, year, Max(evaluation) as maxeval, Count(*) as offerings
  from Teaches
  where c-number ≥ 400
  group by c-number, year

Query:
  select year, Count(*), Max(evaluation)
  from Teaches
  where c-number ≥ 500
  group by year

Queries with grouping and aggregation (cont.)

The query can be rewritten to:
  select year, Sum(offerings), Max(maxeval)
  from V
  where c-number ≥ 500
  group by year

Comments:
• There are more restrictions when grouping and aggregation are involved.
• The grouping in the view must be finer than that in the query.
• The aggregates in the query must be recoverable from the output fields and aggregates of the view.
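
The aggregate rewriting can be replayed with Python's sqlite3 module. In this sketch the Teaches table is assumed to carry year and evaluation columns (as the view definition implies), the rows are invented, and c-number is written c_number. Count(*) over the base table is recovered as the sum of the view's per-group counts, and the overall maximum as the maximum of the per-group maxima.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table Teaches(prof text, c_number integer, year integer, evaluation real);

-- Invented sample rows (an assumption of this sketch).
insert into Teaches values
  ('Ann', 548, 2000, 4.5), ('Bob', 548, 2000, 3.0),
  ('Ann', 570, 2000, 4.8), ('Ann', 448, 2000, 5.0),
  ('Bob', 570, 2001, 2.5);

create view V as
select c_number, year, max(evaluation) as maxeval, count(*) as offerings
from Teaches
where c_number >= 400
group by c_number, year;
""")

direct = conn.execute("""
select year, count(*), max(evaluation)
from Teaches
where c_number >= 500
group by year
order by year
""").fetchall()

via_view = conn.execute("""
select year, sum(offerings), max(maxeval)
from V
where c_number >= 500
group by year
order by year
""").fetchall()

print(direct)
print(via_view)
```

Both statements return the same rows here. The rewriting works because the view groups by (c-number, year), which is finer than the query's grouping by year, and because count and max can be combined from partial results. An aggregate such as avg would need both a sum and a count in the view to be recoverable.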

Answering queries using views for data integration

Main approaches

• Using datalog query representation for both Q and V.

• Algorithms:
  – Bucket algorithm (Levy et al. 96)
  – Inverse rules algorithm (Qian et al. 96)
  – MiniCon algorithm (Pottinger et al. 00)

Bucket algorithm

• Create a bucket for each non-comparison subgoal g in Q. For each view V and each subgoal g’ in V, if there is a unifier θ for g and g’ such that, after unification,
  1) the comparison predicates in Q and V are simultaneously satisfiable, and
  2) whenever a variable appears both in head(Q) and in subgoal g of the query, the corresponding variable in g’ also appears in head(V),
  then add θ(head(V)) to the bucket of g.

• Build candidate rewritings: each candidate is a conjunctive query that takes one conjunct from each bucket. A candidate is a conjunctive rewriting if either
  1) the conjunction is contained in Q, or
  2) it is possible to add comparison atoms such that the resulting conjunction is contained in Q.

Bucket algorithm example

  V1(student, c-number, quarter, title) :-
      Registered(student, c-number, quarter), Course(c-number, title),
      c-number ≥ 500, quarter ≥ Aut98.

  V2(student, prof, c-number, quarter) :-
      Registered(student, c-number, quarter), Teaches(prof, c-number, quarter).

  V3(student, c-number) :-
      Registered(student, c-number, quarter), quarter ≤ Aut94.

  V4(prof, c-number, title, quarter) :-
      Registered(student, c-number, quarter), Course(c-number, title),
      Teaches(prof, c-number, quarter), quarter ≤ Aut97.

Bucket algorithm example (cont.)

Query:
  Q(S,C,P) :- Teaches(P,C,Q), Registered(S,C,Q), Course(C,T), C ≥ 300, Q ≥ Aut95.

Buckets:

  Teaches(P,C,Q) | Registered(S,C,Q) | Course(C,T)
  V2(S’,P,C,Q)   | V1(S,C,Q,T’)      | V1(S’,C,Q’,T)
  V4(P,C,T’,Q)   | V2(S,P’,C,Q)      | V4(P’,C,T,Q’)
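
The bucket-creation step for this example can be sketched in Python. This is a simplified illustration under several assumptions that are not in the slides: queries and views are encoded as dictionaries, quarters such as Aut95 are encoded as the integer 1995, comparisons are limited to one lower and upper bound per variable, subgoals contain only distinct variables (so unification is positional), and each bucket records only the view name rather than θ(head(V)).

```python
def satisfiable(b1, b2):
    """Check that two optional (low, high) intervals on one variable overlap."""
    lows = [b[0] for b in (b1, b2) if b and b[0] is not None]
    highs = [b[1] for b in (b1, b2) if b and b[1] is not None]
    return not lows or not highs or max(lows) <= min(highs)

def build_buckets(query, views):
    buckets = {}
    for pred, q_args in query["body"]:
        bucket = []
        for name, view in views.items():
            for v_pred, v_args in view["body"]:
                if v_pred != pred or len(v_args) != len(q_args):
                    continue
                theta = list(zip(q_args, v_args))  # positional unifier
                # Condition 2: head variables of Q must be exposed by the view.
                head_ok = all(v in view["head"]
                              for q, v in theta if q in query["head"])
                # Condition 1 (simplified): bounds on unified variables overlap.
                bounds_ok = all(satisfiable(query["bounds"].get(q),
                                            view["bounds"].get(v))
                                for q, v in theta)
                if head_ok and bounds_ok:
                    bucket.append(name)
                    break
        buckets[pred] = bucket
    return buckets

reg = ("Registered", ("student", "c_number", "quarter"))
course = ("Course", ("c_number", "title"))
teaches = ("Teaches", ("prof", "c_number", "quarter"))

views = {
    "V1": {"head": ["student", "c_number", "quarter", "title"],
           "body": [reg, course],
           "bounds": {"c_number": (500, None), "quarter": (1998, None)}},
    "V2": {"head": ["student", "prof", "c_number", "quarter"],
           "body": [reg, teaches], "bounds": {}},
    "V3": {"head": ["student", "c_number"],
           "body": [reg], "bounds": {"quarter": (None, 1994)}},
    "V4": {"head": ["prof", "c_number", "title", "quarter"],
           "body": [reg, course, teaches],
           "bounds": {"quarter": (None, 1997)}},
}

query = {"head": ["S", "C", "P"],
         "body": [("Teaches", ("P", "C", "Q")),
                  ("Registered", ("S", "C", "Q")),
                  ("Course", ("C", "T"))],
         "bounds": {"C": (300, None), "Q": (1995, None)}}

buckets = build_buckets(query, views)
print(buckets)
```

The sketch reproduces the bucket contents of the example: V3 is rejected for Registered because its quarter bound (≤ Aut94) contradicts the query's (≥ Aut95), and V4 is rejected for Registered because it does not expose student in its head.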

Bucket algorithm example (cont.)

Result of rewriting:

  q’(S,C,P) :- V2(S’,P,C,Q), V1(S,C,Q,T’)

  q’(S,C,P) :- V4(P,C,T’,Q), V1(S,C,Q,T’), V4(P’,C,T,Q’)

  q’(S,C,P) :- V2(S,P,C,Q), V4(P,C,T’,Q)

The second query is empty, because V1 requires quarter ≥ Aut98 while V4 requires quarter ≤ Aut97. The result is therefore the union of the first and the third conjunctive queries.

Bucket algorithm comments

• Advantages
  – Prunes a significant number of candidate query rewritings.
  – Returns a maximally-contained rewriting when the query has no comparison predicates.

• Disadvantages
  – The Cartesian product of the buckets can still be large.
  – Testing query containment is costly: with comparison predicates it is Π^p_2-complete.

Inverse-rules algorithm

• Construct a set of rules that invert the view definitions.

• Idea: each tuple in the extension of a view is a witness for tuples in the relations that correspond to the subgoals in the body of the view definition.

• Assign one Skolem function symbol to each existential variable in the view definition.

Inverse-rules algorithm example

Example:

View definition:
  V3(dept, c-number) :- Major(student, dept), Registered(student, c-number)

Inverse rules:
  Major(f1(dept, X), dept) :- V3(dept, X)
  Registered(f1(Y, c-number), c-number) :- V3(Y, c-number)

Inverse-rule algorithm example (cont.)

Query:
  q(dept) :- Major(student, dept), Registered(student, 444)

V3 has tuples:
  {(CS, 444), (EE, 444), (CS, 333)}

Applying inverse rules:

  Major: {(f1(CS, 444), CS), (f1(EE, 444), EE), (f1(CS, 333), CS)}

  Registered: {(f1(CS, 444), 444), (f1(EE, 444), 444), (f1(CS, 333), 333)}

Answer: {EE, CS}
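
The example can be replayed in a few lines of Python. In this sketch the Skolem term f1(dept, c-number) is modeled as a tagged tuple, Major holds (student, dept) pairs, and Registered holds (student, c-number) pairs. The encoding is an assumption made for illustration only.

```python
def f1(dept, c_number):
    # Skolem term standing for the unknown student witnessed by a V3 tuple.
    return ("f1", dept, c_number)

v3 = {("CS", 444), ("EE", 444), ("CS", 333)}

# Inverse rules: each V3 tuple yields one Major tuple and one Registered tuple
# that share the same Skolem term in the student position.
major = {(f1(dept, c), dept) for dept, c in v3}
registered = {(f1(dept, c), c) for dept, c in v3}

# q(dept) :- Major(student, dept), Registered(student, 444)
answer = {dept
          for student, dept in major
          for student2, c in registered
          if student == student2 and c == 444}
print(answer)
```

The join succeeds only where the two relations share the same Skolem term, which is how the inverse rules preserve the fact that both tuples were witnessed by one unknown student. Answers that still contain Skolem terms would be discarded; here the head contains only dept, so none arise.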

Inverse-rule algorithm comments

• Advantages
  – Simplicity and modularity.
  – Returns a maximally-contained rewriting.

• Disadvantages
  – Keeps more views that do not contribute to the query than the bucket algorithm does.
  – Requires recomputing the relations from the views, so the benefit of using precomputed materialized views is lost.

MiniCon algorithm

• An improvement on the bucket algorithm.
• Aims to eliminate more views that are useless to the query.
• When we find a unification between a subgoal g’ in V and a subgoal g in Q, all other subgoals that join with g in Q are examined. V must either have the join attribute in its head, or contain the corresponding joined subgoals in its body.
• For each view, compute a MiniCon consisting of all the subgoals in the query to which the view contributes.

MiniCon example

Example:
  q(D) :- Major(S, D), Registered(S, 444, Q), Advises(P, S)

  V1(dept) :- Major(student, dept), Registered(student, 444, quarter).

  V2(prof, dept, area) :- Advises(prof, student), Prof(name, area).

  V3(dept, c-number) :- Major(student, dept), Registered(student, c-number, quarter), Advises(prof, student).

  MiniCon(V1) = ∅, MiniCon(V2) = ∅,
  MiniCon(V3) = {Major, Registered, Advises}

V1 and V2 contribute nothing because neither exposes student in its head, so their subgoals cannot be joined with the remaining query subgoals.

Theoretical results(very selective)

Completeness

• Question: given a query Q and a set of views V, will the algorithm find an equivalent rewriting of Q using V whenever one exists?

• When a conjunctive query (CQ) Q has no comparison predicates and has n subgoals, there exists an equivalent conjunctive rewriting of Q using V only if there is a rewriting with at most n subgoals. Deciding whether such a rewriting exists is NP-hard. (Levy et al. 1995)

Recursive rewriting

• Goal: a maximally-contained rewriting that also yields the set of all certain answers.

• Recursive query rewriting is necessary when:
  – The query is recursive.
  – Database relations have functional dependencies.
  – There are access pattern limitations on the views.
  – Views have unions.
  – Additional semantic information about class hierarchies on objects is expressed in description logics (DL).

Recursive rewriting (example with fd)

Relation: schedule(Airline, Flight_no, Date, Pilot, Aircraft)

FDs: Pilot -> Airline, Aircraft -> Airline

View: v(D, P, C) :- schedule(A, N, D, P, C)

Query:
  Q(P) :- schedule(A, N, D, ‘mike’, C), schedule(A, N’, D’, P, C’)

Rewriting:
  relevantPilot(‘mike’)
  relevantAircraft(C) :- v(D, ‘mike’, C)
  relevantAircraft(C) :- v(D, P, C), relevantPilot(P)
  relevantPilot(P) :- relevantPilot(P1), relevantAircraft(C), v(D1, P1, C), v(D2, P, C)
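
The recursive program can be evaluated bottom-up until no new facts appear. The sketch below does this naively in Python over an invented view extension; the dates, pilot names and aircraft identifiers are made up, and reading the final answer off relevantPilot is an assumption of this example. The recursion is needed because the airline is hidden by the view: the functional dependencies imply that two pilots who flew the same aircraft work for the same airline, and such chains can be arbitrarily long.

```python
def relevant_pilots(v, start="mike"):
    """Naive fixpoint evaluation of the relevantPilot / relevantAircraft rules.

    v is a set of (date, pilot, aircraft) tuples, the extension of the view.
    """
    pilots = {start}           # relevantPilot('mike')
    aircraft = set()
    changed = True
    while changed:
        # relevantAircraft(C) :- v(D, P, C), relevantPilot(P)
        new_aircraft = {c for (_d, p, c) in v if p in pilots}
        # relevantPilot(P) :- relevantPilot(P1), relevantAircraft(C),
        #                     v(D1, P1, C), v(D2, P, C)
        new_pilots = {
            p for (_d, p, c) in v
            if c in (aircraft | new_aircraft)
            and any(p1 in pilots and c1 == c for (_d1, p1, c1) in v)
        }
        changed = not (new_aircraft <= aircraft and new_pilots <= pilots)
        aircraft |= new_aircraft
        pilots |= new_pilots
    return pilots

# Invented view extension (an assumption of this sketch).
v = {
    ("d1", "mike", "c1"),
    ("d2", "ann", "c1"),   # ann shares aircraft c1 with mike
    ("d3", "ann", "c2"),
    ("d4", "bob", "c2"),   # bob shares aircraft c2 with ann
    ("d5", "zed", "c9"),   # zed is not connected to mike
}

result = relevant_pilots(v)
print(sorted(result))
```

On this extension the fixpoint reaches ann through aircraft c1 and then bob through aircraft c2, while zed is never derived. No non-recursive rewriting of fixed size could follow chains of every length, which is why the rewriting has to be recursive.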

Finding certain answers

• Open-world assumption: polynomial in most practical cases. NP-hard (in the size of the view extensions) if unions are allowed in the view definitions or inequality predicates are allowed in the query language.

• Closed-world assumption: co-NP-hard even if both views and queries are CQs and have no comparison predicates. Cf. GAV: polynomial.

• When views can contain incorrect tuples:
  – Assume no comparison predicates in the views or the query.
  – If all views are complete, or all views may have incorrect tuples: polynomial in the size of the view extensions.
  – Otherwise: co-NP-hard.

Extensions

Extensions

• Object query languages (OQL) (Florescu 96)
  – More semantic information about the class hierarchy and attributes.
  – OQL does not clearly separate the select and where clauses; both can contain path navigation.

• Access pattern limitations (Rajaraman 95)
  – Restricted parameterized queries on views, e.g. CitationDB^bf(X,Y) :- Cites(X,Y), where the first argument must be bound and the second is free.
  – Obtaining a finite rewriting requires recursion.

Conclusion and challenges

Conclusion and challenges

• Answering queries using views plays a significant role in query optimization, physical data independence, and data integration.

• New directions to explore:
  – Consider new query languages.
  – Consider integrity constraints.
  – Bridge the gap between query optimization and data integration.
  – Facilitate data warehouse queries: query result reuse, incremental computation.
  – Decide which views to materialize first.

Thank you
