CS848 presentation: Heng Yu (Henry)
Paper to present
Answering queries using views:
A survey
by A. Y. Halevy
VLDB Journal 10: pp. 270-294
2001
Outline
• Introduction with examples
• Formal problem definitions
• Conditions of view usability
• Using materialized views in query optimization
• Answering queries using views in data integration
• Theoretical results
• Extensions
• Conclusion and challenges
Introduction
Problems (informal)
Given a query Q and a set of views V1, ..., Vn over a database schema:
• Is it possible to answer Q using only the answers to V1, ..., Vn?
• What is the maximal set of tuples in the answer of Q that we can get from V1, ..., Vn?
• If we can access both the views and the database relations, what is the cheapest query execution plan for answering Q?
Fields of applications
• Query optimization
• Physical data independence
• Data integration
• More: e.g. semantic cache
Example: a university schema
Prof(name, area)
Course(c-number, title)
Teaches(prof, c-number, quarter)
Registered(student, c-number, quarter)
Major(student, dept)
Works(prof, dept)
Advises(prof, student)
Keys: Prof(name), Course(c-number)
Graduate courses: c-number ≥ 400
Ph.D. courses: c-number ≥ 500
Query Optimization
Suppose we have a view for graduate course registration info:
create view Graduate as
select Registered.student, Course.title, Course.c-number, Registered.quarter
from Registered, Course
where Registered.c-number = Course.c-number and
      Course.c-number ≥ 400
We want to find the students registered in Ph.D.-level courses taught by a professor who is interested in the DB area:
select Registered.student, Course.title
from Teaches, Prof, Registered, Course
where Prof.name = Teaches.prof and
      Teaches.c-number = Registered.c-number and
      Teaches.quarter = Registered.quarter and
      Registered.c-number = Course.c-number and
      Course.c-number ≥ 500 and Prof.area = ‘DB’
Query optimization (cont.)
Query
select Registered.student, Course.title
from Teaches, Prof, Registered, Course
where Prof.name = Teaches.prof and
      Teaches.c-number = Registered.c-number and
      Teaches.quarter = Registered.quarter and
      Registered.c-number = Course.c-number and
      Course.c-number ≥ 500 and Prof.area = ‘DB’
View
create view Graduate as
select Registered.student, Course.title, Course.c-number, Registered.quarter
from Registered, Course
where Registered.c-number = Course.c-number and
      Course.c-number ≥ 400
Query optimization (cont.)
Result of query rewriting
select Graduate.student, Graduate.title
from Teaches, Prof, Graduate
where Prof.name = Teaches.prof and
      Teaches.c-number = Graduate.c-number and
      Teaches.quarter = Graduate.quarter and
      Graduate.c-number ≥ 500 and Prof.area = ‘DB’
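The equivalence of the original query and its rewriting over the Graduate view can be checked on a small instance. The sketch below uses Python's built-in sqlite3 module; the sample rows are invented for illustration, and c-number is spelled c_number because hyphens are not valid in unquoted SQL identifiers.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table Prof(name text primary key, area text);
create table Course(c_number integer primary key, title text);
create table Teaches(prof text, c_number integer, quarter text);
create table Registered(student text, c_number integer, quarter text);

-- Hypothetical sample data, not taken from the slides.
insert into Prof values ('Ann', 'DB'), ('Bob', 'AI');
insert into Course values (350, 'Intro DB'), (450, 'DB Systems'),
                          (550, 'Query Processing'), (560, 'Planning');
insert into Teaches values ('Ann', 450, 'Aut98'), ('Ann', 550, 'Aut98'),
                           ('Bob', 560, 'Aut98');
insert into Registered values ('s1', 450, 'Aut98'), ('s2', 550, 'Aut98'),
                              ('s3', 560, 'Aut98'), ('s4', 550, 'Win99');

create view Graduate as
select Registered.student as student, Course.title as title,
       Course.c_number as c_number, Registered.quarter as quarter
from Registered, Course
where Registered.c_number = Course.c_number and Course.c_number >= 400;
""")

# Original query over the base relations.
original = """
select Registered.student, Course.title
from Teaches, Prof, Registered, Course
where Prof.name = Teaches.prof
  and Teaches.c_number = Registered.c_number
  and Teaches.quarter = Registered.quarter
  and Registered.c_number = Course.c_number
  and Course.c_number >= 500 and Prof.area = 'DB'
"""

# Rewriting that replaces the Registered-Course join with the view.
rewritten = """
select Graduate.student, Graduate.title
from Teaches, Prof, Graduate
where Prof.name = Teaches.prof
  and Teaches.c_number = Graduate.c_number
  and Teaches.quarter = Graduate.quarter
  and Graduate.c_number >= 500 and Prof.area = 'DB'
"""

original_rows = sorted(con.execute(original).fetchall())
rewritten_rows = sorted(con.execute(rewritten).fetchall())
print(original_rows == rewritten_rows, original_rows)
```

Agreement on one instance does not prove equivalence; it only illustrates what the rewriting is meant to preserve.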
Maintaining physical data independence
• Relational database systems rely on a 1-1 mapping between relations and files.
• In object-oriented and semistructured databases, the logical model is more redundant and does not reflect the optimal physical design.
• Physical storage can be described as views over the logical model.
e.g. GMAP (Tsatalos et al. 96)
Maintaining physical data independence (cont.)
GMAP (generalized multi-level access paths)
def.gmap G1 as b+-tree by
  given Student.name
  select Department
  where Student major Department
def.gmap G2 as b+-tree by
  given Student.name
  select Course.c-number
  where Student registered Course
def.gmap G3 as b+-tree by
  given Course.c-number
  select Department
  where Student registered Course and
        Student major Department
Maintaining physical data independence (cont.)
Query:
select Student.name, Department
where Student registered Course and
      Student major Department and
      Course.c-number ≥ 500
Plans:
1. π_{Student.name, Department}(σ_{Course.c-number ≥ 500}(G1 ⋈_{Student.name} G2))
2. σ_{Course.c-number ≥ 500}(G3) ⋈_{Course.c-number} G2
Data integration
• Providing a uniform query interface to a multitude of autonomous heterogeneous data sources.
• Giving users a mediated schema.
• Local as View: specifying data source descriptions as views over the mediated schema.
Data integration (cont.)
Example:
Prof(name, area)
Course(c-number, title, univ)
Teaches(prof, c-number, quarter, univ)
Register(student, c-number, quarter)
Major(student, dept)
Works(prof, dept)
Advises(prof, student)
Data integration (cont.)
Suppose we have only 2 views available:
create view DB-courses as
select Course.title, Teaches.prof, Course.c-number, Course.univ
from Teaches, Course
where Teaches.c-number = Course.c-number and
      Teaches.univ = Course.univ and Course.title = “Database Systems”
create view UW-phd-courses as
select Course.title, Teaches.prof, Course.c-number, Course.univ
from Teaches, Course
where Teaches.c-number = Course.c-number and
      Course.univ = ‘UW’ and Teaches.univ = ‘UW’ and
      Course.c-number ≥ 500
Data integration (cont.)
• Query: who teaches database courses at UW?
select prof
from DB-courses
where univ = ‘UW’
• Query: all graduate courses at UW:
select title, c-number
from DB-courses
where univ = ‘UW’ and c-number ≥ 400
UNION
select title, c-number
from UW-phd-courses
Comparison of the two applications
Aspect | Query optimization and physical design | Data integration
Output | Query execution plan Q’ | Query Q’
Equivalence with Q | Q’ must be equivalent to Q | Q’ can be equivalent to or contained in Q
Data accessed | Original relational data + materialized views | Only views
# of views | Modest | Huge
View completeness | Yes | No
Rewriting reasoning | Logical correctness + cost model | Logical correctness
Formal Problem Definition
Query containment and equivalence
Definition A query Q1 is said to be contained in a query Q2, denoted by Q1 ⊆ Q2, if for all database instances D, the set of tuples computed for Q1 is a subset of those computed for Q2, i.e., Q1(D) ⊆ Q2(D). The two queries are equivalent if Q1 ⊆ Q2 and Q2 ⊆ Q1.
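For conjunctive queries without comparison predicates, containment can be decided by searching for a containment mapping: Q1 is contained in Q2 exactly when the variables of Q2 can be mapped to terms of Q1 so that the head of Q2 maps to the head of Q1 and every subgoal of Q2 maps to a subgoal of Q1. The brute-force sketch below assumes a hypothetical encoding in which a query is a pair (head, body), variables are strings starting with an uppercase letter, and everything else is a constant.

```python
from itertools import product

def is_var(term):
    # Assumed convention: variables are capitalized strings.
    return isinstance(term, str) and term[:1].isupper()

def apply(mapping, terms):
    # Map variables through the mapping; constants stay unchanged.
    return tuple(mapping.get(t, t) for t in terms)

def contained_in(q1, q2):
    """True if q1 is contained in q2 (conjunctive queries, no comparisons)."""
    head1, body1 = q1
    head2, body2 = q2
    if len(head1) != len(head2):
        return False
    vars2 = sorted({t for _, args in body2 for t in args if is_var(t)}
                   | {t for t in head2 if is_var(t)})
    terms1 = sorted({t for _, args in body1 for t in args} | set(head1), key=str)
    atoms1 = set(body1)
    # Try every mapping from the variables of q2 to the terms of q1.
    for image in product(terms1, repeat=len(vars2)):
        mapping = dict(zip(vars2, image))
        if apply(mapping, head2) != tuple(head1):
            continue
        if all((pred, apply(mapping, args)) in atoms1 for pred, args in body2):
            return True
    return False

# q_a(S) :- Registered(S, C, Q), Teaches(P, C, Q)
q_a = (("S",), [("Registered", ("S", "C", "Q")), ("Teaches", ("P", "C", "Q"))])
# q_b(S) :- Registered(S, C, Q)
q_b = (("S",), [("Registered", ("S", "C", "Q"))])

print(contained_in(q_a, q_b), contained_in(q_b, q_a))
```

The search is exponential in the number of variables of Q2, which is consistent with containment of conjunctive queries being NP-complete.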
Equivalent rewritings
Definition Let Q be a query and V = {V1, V2, …, Vm} be a set of view definitions. The query Q’ is an equivalent rewriting of Q using V if:
• Q’ refers only to the views in V;
• Q’ is equivalent to Q.
Maximally-contained rewritings
Definition Let Q be a query, V = {V1, V2, …, Vm} be a set of view definitions, and L be a query language. The query Q’ is a maximally-contained rewriting of Q using V w.r.t. L if:
• Q’ is a query in L that refers only to the views in V;
• Q’ is contained in Q;
• there is no rewriting Q1 ∈ L such that Q’ ⊆ Q1 ⊆ Q and Q1 is not equivalent to Q’.
Certain Answers
• Problem: finding all the answers to a query given a set of views.
• Not equivalent to finding a maximally-contained rewriting, because maximal containment is defined relative to a query language.
• Formalized by certain answers (Abiteboul et al. 98).
• A tuple α is a certain answer of Q w.r.t. a set of view definitions {Vi} and their extensions {vi} if α is in Q(D) for every database instance D such that Vi(D) = vi (closed-world assumption, CWA) or Vi(D) ⊇ vi (open-world assumption, OWA).
Conditions of view usability
View usability conditions
For an SPJ view V to be usable in an equivalent rewriting of an SPJ query Q under bag semantics:
1. There is a mapping ψ from occurrences of tables mentioned in the from clause of V to those mentioned in the from clause of Q, mapping every table name to itself. For bag semantics, ψ must be 1-1.
2. V must either apply the join and selection predicates in Q on the attributes of the tables in the domain of ψ, or apply a logically weaker selection to them and select the attributes on which predicates still need to be applied.
3. V must not project out any attributes of the tables in the domain of ψ that are needed in the selection of Q.
Using materialized views in query optimization
System-R style optimization
Step | Traditional optimizer | Optimizer using views
Single table access path | Access paths on all tables | Also consider usable materialized views
Combining partial plans | The predicates of the two partial plans are known, and the cheapest is considered. | Consider joining partial plans with several alternative join predicates.
Pruning of plans | Save the cheapest plan of each equivalence class. | Compare every pair of plans, and discard a plan if a cheaper plan dominates it.
Termination testing | Has the equivalence class including all relations in the query been considered? | Have all partial plans been examined?
System-R style (cont.)
Queries with grouping and aggregation
Example:
View:
create view V as
select c-number, year, Max(evaluation) as maxeval, Count(*) as offerings
from Teaches
where c-number ≥ 400
group by c-number, year
Query:
select year, Count(*), Max(evaluation)
from Teaches
where c-number ≥ 500
group by year
Queries with grouping and aggregation (cont.)
The query can be rewritten to:
select year, Sum(offerings), Max(maxeval)
from V
where c-number ≥ 500
group by year
Comment:
• More limitations apply when grouping and aggregation are involved.
• The grouping in the view must be finer than that in the query.
• Aggregations in the query must be recoverable from the output fields and aggregations in the view.
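The recoverability condition can be illustrated in plain Python: a count per year is the sum of the per-(c-number, year) counts stored in the view, and a maximum per year is the maximum of the per-group maxima. The Teaches rows below are hypothetical, and the function names are chosen only for this sketch.

```python
from collections import defaultdict

# Hypothetical Teaches rows: (c_number, year, evaluation).
teaches = [
    (450, 1998, 3.9), (520, 1998, 4.1), (520, 1998, 4.6),
    (530, 1998, 3.5), (520, 1999, 4.0), (610, 1999, 4.8),
]

def view_v(rows):
    # View V: group by (c_number, year) for c_number >= 400.
    groups = defaultdict(list)
    for c_number, year, evaluation in rows:
        if c_number >= 400:
            groups[(c_number, year)].append(evaluation)
    # Output rows: (c_number, year, maxeval, offerings).
    return [(c, y, max(evals), len(evals)) for (c, y), evals in groups.items()]

def query_direct(rows):
    # Original query: Count(*) and Max(evaluation) per year for c_number >= 500.
    groups = defaultdict(list)
    for c_number, year, evaluation in rows:
        if c_number >= 500:
            groups[year].append(evaluation)
    return {year: (len(evals), max(evals)) for year, evals in groups.items()}

def query_over_view(v_rows):
    # Rewriting: Sum(offerings) and Max(maxeval) per year, computed from V only.
    result = {}
    for c_number, year, maxeval, offerings in v_rows:
        if c_number >= 500:
            count, best = result.get(year, (0, float("-inf")))
            result[year] = (count + offerings, max(best, maxeval))
    return result

direct = query_direct(teaches)
via_view = query_over_view(view_v(teaches))
print(direct == via_view, direct)
```

The selection c-number ≥ 500 can still be applied over V because c-number is one of the view's grouping columns.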
Answering queries using views for data integration
Main approaches
• Use a datalog query representation for both Q and V.
• Algorithms:
– Bucket algorithm (Levy et al. 96)
– Inverse rules algorithm (Qian et al. 96)
– MiniCon algorithm (Pottinger et al. 00)
Bucket algorithm
• Create a bucket for each non-comparison subgoal g in Q:
For each subgoal g’ of each view V, if there is a unifier θ for g and g’ such that, after unification,
1) the comparison predicates in Q and V are simultaneously satisfiable;
2) whenever a variable appears both in head(Q) and in subgoal g of the query, the corresponding variable in g’ also appears in head(V),
then add θ(head(V)) to the bucket of g.
• Form candidate conjunctive queries by taking one conjunct from each bucket. A candidate is a conjunctive rewriting if either
1) the conjunction is contained in Q, or
2) it is possible to add comparison predicates such that the resulting conjunction is contained in Q.
Bucket algorithm example
V1(student, c-number, quarter, title) :- Registered(student, c-number, quarter), Course(c-number, title), c-number ≥ 500, quarter ≥ Aut98.
V2(student, prof, c-number, quarter) :- Registered(student, c-number, quarter), Teaches(prof, c-number, quarter).
V3(student, c-number) :- Registered(student, c-number, quarter), quarter ≤ Aut94.
V4(prof, c-number, title, quarter) :- Registered(student, c-number, quarter), Course(c-number, title), Teaches(prof, c-number, quarter), quarter ≤ Aut97.
Bucket algorithm example (cont.)
Query:
Q(S,C,P) :- Teaches(P,C,Q), Registered(S,C,Q), Course(C,T), C ≥ 300, Q ≥ Aut95.
Buckets:
Teaches(P,C,Q) | Registered(S,C,Q) | Course(C,T)
V2(S’,P,C,Q) | V1(S,C,Q,T’) | V1(S’,C,Q’,T)
V4(P,C,T’,Q) | V2(S,P’,C,Q) | V4(P’,C,T,Q’)
Bucket algorithm example (cont.)
Result of rewriting:
q’(S,C,P) :- V2(S’,P,C,Q), V1(S,C,Q,T’)
q’(S,C,P) :- V4(P,C,T’,Q), V1(S,C,Q,T’), V4(P’,C,T,Q’)
q’(S,C,P) :- V2(S,P,C,Q), V4(P,C,T’,Q)
The second query is empty, because V1 requires quarter ≥ Aut98 while V4 requires quarter ≤ Aut97 on the shared variable Q. The result is therefore the union of the first and the third conjunctive queries.
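The emptiness test on the shared quarter variable amounts to checking whether a set of lower and upper bounds can hold at once. The sketch below is a simplification, not the full bucket algorithm: it assumes quarters are encoded as two-digit years (Aut95 as 95, and so on) and only handles ≥ and ≤ constraints on a single variable.

```python
def satisfiable(bounds):
    """Check whether all (op, value) bounds on one variable can hold together."""
    low = max((v for op, v in bounds if op == ">="), default=float("-inf"))
    high = min((v for op, v in bounds if op == "<="), default=float("inf"))
    return low <= high

# Constraints on the quarter variable Q, with AutNN encoded as NN (assumption).
query_bound = (">=", 95)   # Q >= Aut95 in the query
v1_bound = (">=", 98)      # quarter >= Aut98 in V1
v4_bound = ("<=", 97)      # quarter <= Aut97 in V4

first = satisfiable([query_bound, v1_bound])             # V2 and V1
second = satisfiable([query_bound, v4_bound, v1_bound])  # V4 and V1 share Q
third = satisfiable([query_bound, v4_bound])             # V2 and V4

print(first, second, third)
```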
Bucket algorithm comments
• Advantage
– Prunes a significant number of candidate query rewritings.
– Returns a maximally-contained rewriting when the query does not have comparison predicates.
• Disadvantage
– The Cartesian product of the buckets is still large.
– Testing query containment is costly: it is Π₂ᵖ-complete.
Inverse-rules algorithm
• Construct a set of rules that invert the view definitions.
• Idea: each tuple in the head of view definition query is a witness of tuples in relations corresponding to subgoals in the body.
• Assign one Skolem function symbol to each existential variable in the view definition.
Inverse-rules algorithm example
Example:
View definition:
V3(dept, c-number) :- Major(student, dept), Registered(student, c-number)
Inverse rules:
Major(f1(dept, X), dept) :- V3(dept, X)
Registered(f1(Y, c-number), c-number) :- V3(Y, c-number)
Inverse-rule algorithm example (cont.)
Query:
q(dept) :- Major(student, dept), Registered(student, 444)
V3 has tuples: {(CS, 444), (EE, 444), (CS, 333)}
Applying inverse rules:
Major: {(f1(CS, 444), CS), (f1(EE, 444), EE), (f1(CS, 333), CS)}
Registered: {(f1(CS, 444), 444), (f1(EE, 444), 444), (f1(CS, 333), 333)}
Answer: {EE, CS}
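The same example can be replayed in a few lines of Python. This is a minimal sketch in which the Skolem term f1(dept, c-number) is modeled as a tagged tuple standing for the unknown student; the representation is an assumption made for illustration.

```python
# Extension of the view V3(dept, c_number), as given in the example.
v3 = {("CS", 444), ("EE", 444), ("CS", 333)}

def f1(dept, c_number):
    # Skolem term: a placeholder for the student that V3 does not expose.
    return ("f1", dept, c_number)

# Inverse rules:
#   Major(f1(dept, X), dept)      :- V3(dept, X)
#   Registered(f1(Y, c), c)       :- V3(Y, c)
major = {(f1(dept, c), dept) for dept, c in v3}
registered = {(f1(dept, c), c) for dept, c in v3}

# Query: q(dept) :- Major(student, dept), Registered(student, 444)
answer = {
    dept
    for student, dept in major
    for other_student, c_number in registered
    if student == other_student and c_number == 444
}

print(sorted(answer))
```

The join succeeds only when both atoms carry the same Skolem term, which is why (CS, 333) contributes nothing to the answer.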
Inverse-rule algorithm comments
• Advantage
– Simplicity and modularity
– Returns a maximally-contained rewriting
• Disadvantage
– Keeps more non-contributing views than the bucket algorithm
– Requires recomputing the relations from the views, so the benefit of using precomputed materialized views is lost.
MiniCon algorithm
• Improvement on the bucket algorithm.
• Aims to eliminate more views that are useless to the query.
• When we find a unification between a subgoal g’ in V and a subgoal g in Q, all other subgoals that join with g in Q are examined. V must either have the join attribute in its head, or contain the corresponding joined subgoals in its body.
• For each view, compute a MiniCon consisting of all the subgoals in the query to which the view contributes.
MiniCon example
Example:
q(D) :- Major(S, D), Registered(S, 444, Q), Advises(P, S)
V1(dept) :- Major(student, dept), Registered(student, 444, quarter).
V2(prof, dept, area) :- Advises(prof, student), Prof(name, area).
V3(dept, c-number) :- Major(student, dept), Registered(student, c-number, quarter), Advises(prof, student).
MiniCon(V1) = ∅, MiniCon(V2) = ∅,
MiniCon(V3) = {Major, Registered, Advises}
Theoretical results (very selective)
Completeness
• Question: given a query Q and a set of views V, will the algorithm find an equivalent rewriting of Q using V when one exists?
• When a conjunctive query (CQ) Q has no comparison predicates and has n subgoals, an equivalent conjunctive rewriting of Q using V exists only if there is one with at most n subgoals. The problem is NP-hard. (Levy et al. 1995)
Recursive rewriting
• Goal: when we apply maximally-contained rewriting, we can also get the set of all certain answers.
• Recursive query rewriting is necessary when:
– The query is recursive.
– Database relations have functional dependencies.
– There exist access pattern limitations on the views.
– Views have unions.
– Additional semantic information about class hierarchies on objects is expressed in DL.
Recursive rewriting (example with fd)
Relation: schedule(Airline, Flight_no, Date, Pilot, Aircraft)
FDs: Pilot -> Airline, Aircraft->Airline
View: v(D, P, C) :- schedule(A, N, D, P, C)
Query:
Q(P) :- schedule(A, N, D, ‘mike’, C), schedule(A, N’, D’, P, C’)
Rewriting:
relevantPilot(‘mike’)
relevantAircraft(C) :- v(D, ‘mike’, C)
relevantAircraft(C) :- v(D, P, C), relevantPilot(P)
relevantPilot(P) :- relevantPilot(P1), relevantAircraft(C),
v(D1, P1, C), v(D2, P, C)
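The query asks for pilots who work for the same airline as ‘mike’. Because both Pilot and Aircraft determine Airline, two pilots who flew the same aircraft must share an airline, and this can chain through any number of pilots and aircraft, which is why the rewriting is recursive. The sketch below evaluates the rules by naive fixpoint iteration over a hypothetical extension of v; the tuples and the function name are invented for illustration.

```python
def relevant_pilots(v, start="mike"):
    """Naive fixpoint evaluation of the recursive rewriting over v(date, pilot, aircraft)."""
    pilots, aircraft = {start}, set()   # relevantPilot('mike')
    while True:
        # relevantAircraft(C) :- v(D, P, C), relevantPilot(P)
        new_aircraft = {c for _, p, c in v if p in pilots}
        known_aircraft = aircraft | new_aircraft
        # relevantPilot(P) :- relevantPilot(P1), relevantAircraft(C),
        #                     v(D1, P1, C), v(D2, P, C)
        new_pilots = {
            p
            for _, p, c in v
            if c in known_aircraft
            and any(p1 in pilots and c1 == c for _, p1, c1 in v)
        }
        if new_aircraft <= aircraft and new_pilots <= pilots:
            return pilots, aircraft     # nothing new was derived
        aircraft |= new_aircraft
        pilots |= new_pilots

# Hypothetical view extension: (date, pilot, aircraft).
v = [
    ("d1", "mike", "a1"),
    ("d2", "ann", "a1"),
    ("d3", "ann", "a2"),
    ("d4", "bob", "a2"),
    ("d5", "carl", "a3"),
]

pilots, aircraft = relevant_pilots(v)
print(sorted(pilots), sorted(aircraft))
```

Here bob is reached only through ann and aircraft a2, a chain that no fixed-length conjunctive rewriting over v would cover for arbitrary chain lengths.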
Finding certain answers
• Open-world assumption: polynomial in most practical cases. co-NP-hard (in the size of the view extensions) if unions are allowed in view definitions or inequality predicates are allowed in the query language.
• Closed-world assumption: co-NP-hard even if both views and queries are CQs without comparison predicates. Cf. GAV: polynomial.
• In case views can contain incorrect tuples:
– Assume no comparison predicates in views or query.
– If all views are complete, or all views may have incorrect tuples: polynomial in the size of the view extensions.
– Otherwise: co-NP-hard.
Extensions
Extensions
• Object query languages (OQL) (Florescu 96)
– More semantic information for class hierarchies and attributes.
– OQL does not clearly separate the select and where clauses; both can contain path navigation.
• Access pattern limitations (Rajaraman 95)
– Restricted parameterized queries on views, e.g. CitationDB^bf(X,Y) :- Cites(X,Y)
– Finite rewriting requires recursion.
Conclusion and challenges
Conclusion and challenges
• Answering queries using views plays a significant role in query optimization, physical data independence, and data integration.
• New fields to explore:
– Consider new query languages
– Consider integration constraints
– Bridge the gap between query optimization and data integration
– Facilitate data warehouse queries: query result reuse, incremental computation
– Decide which views to materialize first
Thank you