
Mediators, Wrappers, etc.

Based on the TSIMMIS project at Stanford. Concepts used in several other related projects.

• Goal: integrate information in heterogeneous data sources, including flat files, spreadsheets, …
• Key idea: write “wrappers” for data sources that export relation-like (or similarly high-level) views.
• BUT, remember: sources != DBs.
• Exported views = sets of heterogeneous “lightweight objects”.

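II (information integration) architecture.

[Figure: architecture diagram with boxes labelled mediator (twice) and source (twice), and arrows labelled query.]

• No predefined hierarchy.
• A mediator talks to sources via translators, and to other mediators.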

What data model is appropriate?

Remember the role played by the data model now. In DB design, you model the application data first, develop a schema, create tables and populate them.

Here, you are trying to abstract existing data and/or applications using wrappers, and would like to leverage the abstraction for querying (i.e., II) via mediators.

So, you don’t get to preach here! The model should:

• be as expressive as possible,
• yet be as flexible as possible,
• handle missing, repeated (nested), and heterogeneous data,
• support meta-data.

What are the architectural requirements?

Facilitate easy joining of new mediators and “registration” of new sources. Need for a mediator generator and a wrapper generator.

What sort of query model/language is appropriate?

Must understand and be in sync with the expressive but permissive data model we sketched. TSIMMIS uses LOREL, but we will keep our discussion more general. In principle, can use SchemaSQL, XQuery, etc.

More on data model

Lightweight object model (OEM): an OEM object = OID: <label, type, value>.

• Self-descriptive (i.e., the schema comes along with the data, and for every data item!).
• Value: atomic or set-valued.

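An example OEM database

[Figure: OEM object graph. A root object labelled guide (o1) has three resto sub-objects (o2, o3, o4). Recoverable edge labels: c, n, a, near, address, s, c, z. Recoverable atomic values: gourmet, Three amigos, 1650 stecatherine, montreal, H3G 1M7, westmont.]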

•Not every resto may have address of same type.

•Indeed, some may have no address!
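The two points above can be made concrete with a minimal sketch, assuming a hypothetical Python class `OEM` that mirrors the <label, type, value> triple. The OIDs below o4, the label names (name, address, street, city), the placement of the values, and the names "resto B" and "resto C" are illustrative assumptions, not taken from TSIMMIS.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class OEM:
    """Hypothetical OEM object: an OID plus a <label, type, value> triple."""
    oid: str
    label: str
    type: str                                # "set" or an atomic type such as "string"
    value: Union[str, int, List["OEM"]]

    def children(self, label: str) -> List["OEM"]:
        """Sub-objects carrying the given label (none for atomic objects)."""
        if self.type != "set":
            return []
        return [c for c in self.value if c.label == label]


# Illustrative data only: three restos with a structured, an atomic, and no address.
r1 = OEM("o2", "resto", "set", [
    OEM("o5", "name", "string", "Three amigos"),
    OEM("o6", "address", "set", [
        OEM("o7", "street", "string", "1650 stecatherine"),
        OEM("o8", "city", "string", "montreal"),
    ]),
])
r2 = OEM("o3", "resto", "set", [
    OEM("o9", "name", "string", "resto B"),
    OEM("o10", "address", "string", "westmont"),
])
r3 = OEM("o4", "resto", "set", [OEM("o11", "name", "string", "resto C")])
guide = OEM("o1", "guide", "set", [r1, r2, r3])

# The type of each resto's address, discovered from the data itself (no fixed schema).
address_types = [[a.type for a in r.children("address")]
                 for r in guide.children("resto")]
```

Because every object carries its own label and type, a query has to inspect `address_types` at run time rather than rely on a schema declared up front.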

TSIMMIS Query model

Each mediator describes its concepts (whatever it can garner from the sources it talks to) using some logical rules. TSIMMIS uses MSL, but we will see that SchemaLog can express it easily.

Information Manifold Approach

Two models. The first is Local as View (LAV):

• World view = global predicates (like base relations, but they do not exist).
• Each source = a description of what info. it can contribute to the global predicates = a view over the global predicates (derived relations).
• Query the global predicates.
• Answer using the views (which are the only ones that hold the data!).

IM approach

Alternative model, Global as View (GAV):

• Global predicates are exported by sources as a view of the data they actually store.
• Query the global predicates.
• Answer by expanding the query using the view defs.

IM follows LAV.
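To contrast with LAV, here is a minimal sketch of GAV-style answering by view expansion. Everything in it is an assumption made for illustration: the global predicates phone and office, the source names src_a, src_b, src_c, and the convention that constants are lowercase strings and variables start with an uppercase letter.

```python
from itertools import count, product

# Hypothetical GAV definitions: global predicate -> list of (head variables, source atom).
GAV = {
    "phone": [(("E", "P"), ("src_a", ("E", "P", "M"))),
              (("E", "P"), ("src_c", ("E", "P")))],
    "office": [(("E", "O"), ("src_b", ("E", "O", "D")))],
}


def unfold(query_body):
    """Return every expansion of the query body into source atoms."""
    fresh = count(1)
    choices = []
    for pred, args in query_body:
        alternatives = []
        for head, (src, src_args) in GAV[pred]:
            sub = dict(zip(head, args))
            # Source arguments not bound by the head get a fresh variable name.
            atom = (src, tuple(sub[a] if a in sub else f"{a}{next(fresh)}"
                               for a in src_args))
            alternatives.append(atom)
        choices.append(alternatives)
    # One expansion per way of picking a definition for each subgoal.
    return [list(combo) for combo in product(*choices)]


q = [("phone", ("mary", "P")), ("office", ("mary", "O"))]
rewrites = unfold(q)
```

Under GAV the rewriting step is plain substitution; the harder search problem only appears under LAV, where the views are defined over the global predicates rather than the other way round.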

LAV example

Global predicates: emp(E), phone(E,P), office(E,O), mgr(E,M), dept(E,D) (remember, they DON’T exist!)

source1(E,P,M) <- emp(E), phone(E,P), mgr(E,M).
source2(E,O,D) <- emp(E), office(E,O), dept(E,D).
source3(E,P) <- emp(E), phone(E,P), dept(E,'toy').

Points to remember:

• Views are descriptive, not prescriptive.
• Completeness not guaranteed.
• Consistency across sources not guaranteed.

Example query: q1(O,P) <- phone(mary,P), office(mary,O).
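The source descriptions can be written down as plain data. The sketch below assumes a hypothetical encoding (atoms as `(predicate, arguments)` tuples, lowercase strings as constants) and a hypothetical helper `expand_atom` that replaces one source atom by the global-predicate atoms it stands for.

```python
# LAV descriptions from the example: source name -> (head variables, body atoms).
SOURCES = {
    "source1": (("E", "P", "M"),
                [("emp", ("E",)), ("phone", ("E", "P")), ("mgr", ("E", "M"))]),
    "source2": (("E", "O", "D"),
                [("emp", ("E",)), ("office", ("E", "O")), ("dept", ("E", "D"))]),
    "source3": (("E", "P"),
                [("emp", ("E",)), ("phone", ("E", "P")), ("dept", ("E", "toy"))]),
}


def expand_atom(name, args):
    """Expand one source atom into atoms over the global predicates.

    Every body variable of these three sources also occurs in the head,
    so substituting the head variables is enough; descriptions with extra
    body variables would additionally need fresh names for them.
    """
    head, body = SOURCES[name]
    sub = dict(zip(head, args))
    return [(pred, tuple(sub.get(t, t) for t in terms)) for pred, terms in body]


# What does a tuple source3(mary, P) tell us about the global predicates?
expansion = expand_atom("source3", ("mary", "P"))
```

Here `expansion` lists emp(mary), phone(mary,P) and dept(mary,toy): source3 only ever contributes phone numbers of employees in the toy department.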

Query answering

How can we answer such a query?

• Must get all relevant info. from the views, i.e., rewrite the query using ONLY source/view predicates.
• More than one possible way.
• Want ALL possible rewrites (to ensure (near) completeness).

Rewritten q1 (s1, s2, s3 abbreviate source1, source2, source3):

r1q1(O,P) <- s1(mary,P,M), s2(mary,O,D).
r2q1(O,P) <- s3(mary,P), s2(mary,O,D).

There are other rewrites too (e.g., join all three sources), but they are contained in one of the above. So, the above rewrites are all the “minimal” answers.

Compare expanded r1q1 and r2q1 with q1 (w.r.t. containment). What can you say?
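One way to explore this question is to search for a containment mapping by brute force. The sketch below is an illustration under stated assumptions, not the slides' algorithm: atoms are `(predicate, arguments)` tuples, variables start with an uppercase letter, and the expansion of r1q1 is typed in by hand with the employee argument bound to mary.

```python
def is_var(term):
    """Convention assumed here: variables start with an uppercase letter."""
    return term[:1].isupper()


def find_mapping(q_head, q_body, e_head, e_body):
    """Search for a containment mapping from query q to query e.

    Returns a dict from q's variables to e's terms, or None. If a mapping
    exists, e is contained in q.
    """
    def unify(src, dst, h):
        h = dict(h)
        for s, d in zip(src, dst):
            if is_var(s):
                if h.setdefault(s, d) != d:   # variable already mapped elsewhere
                    return None
            elif s != d:                       # constants must map to themselves
                return None
        return h

    def search(i, h):
        if i == len(q_body):
            return h
        pred, args = q_body[i]
        for e_pred, e_args in e_body:
            if e_pred == pred and len(e_args) == len(args):
                h2 = unify(args, e_args, h)
                if h2 is not None:
                    found = search(i + 1, h2)
                    if found is not None:
                        return found
        return None

    h0 = unify(q_head, e_head, {})
    return None if h0 is None else search(0, h0)


q1_head, q1_body = ("O", "P"), [("phone", ("mary", "P")), ("office", ("mary", "O"))]

# E(r1q1) with the employee bound to mary.
e_bound = [("emp", ("mary",)), ("phone", ("mary", "P")), ("mgr", ("mary", "M")),
           ("emp", ("mary",)), ("office", ("mary", "O")), ("dept", ("mary", "D"))]
# The same expansion with the employee left as a free variable E.
e_free = [(p, tuple("E" if t == "mary" else t for t in a)) for p, a in e_bound]

mapping_bound = find_mapping(q1_head, q1_body, ("O", "P"), e_bound)
mapping_free = find_mapping(q1_head, q1_body, ("O", "P"), e_free)
```

`mapping_bound` is a mapping, so the expansion is contained in q1; the expansion has extra subgoals (emp, mgr, dept), so the containment is generally proper. `mapping_free` is `None`: with E unbound, the constant mary in q1 has nothing to map to.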

How do we get minimal rewrites?

q: the original query given (a CQ over global predicates). r: a candidate rewrite.

• r is valid provided its expansion E(r) (obtained by expanding the source defs.) is contained in q.
• A rewrite r is minimal if E(r) is NOT contained in E(r’) for any other rewrite r’.

What does minimality really mean? Example:

s1(X,Y) <- a(X,Y).
s2(X,Y) <- a(X,Y).
query: q(X,Y) <- a(X,Y).

r1q(X,Y) <- s1(X,Y) as well as r2q(X,Y) <- s2(X,Y) are needed to answer it. Why? s1 and s2 do NOT necessarily provide the same set of tuples, even though E(r1q) and E(r2q) are identical. Rules are descriptions, NOT prescriptions!

How many rewrites should we try?
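A tiny sketch of why both rewrites are needed, using made-up source contents (the tuples below are purely illustrative): both sources are described as a(X,Y), yet each holds only part of it.

```python
# Hypothetical extents: each source stores SOME tuples of the (virtual) relation a.
s1 = {("a1", "b1"), ("a2", "b2")}
s2 = {("a2", "b2"), ("a3", "b3")}


def r1q():
    """r1q(X,Y) <- s1(X,Y)."""
    return set(s1)


def r2q():
    """r2q(X,Y) <- s2(X,Y)."""
    return set(s2)


# The best available answer to q(X,Y) <- a(X,Y) is the union of all rewrites.
answer = r1q() | r2q()
missed_by_r1q = answer - r1q()
```

Dropping either rewrite loses tuples, here ("a3", "b3") if only r1q is evaluated.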

Levy et al. Theorem

Thm.: if a rewrite r of query q has more subgoals than q, then r can’t be minimal.

Proof: assume r is valid (or it’s useless). So E(r) is contained in q. Let h be the containment mapping (c.m.) from q to E(r). If r has more subgoals than q, there must be a subgoal p in r s.t. h doesn’t map any subgoal of q to any subgoal in E(p).

Then get rid of all such subgoals to obtain a modified rewrite r’. r’ contains r (trivially). But E(r’) is still contained in q (just use the original c.m. h). QED

Given a q, only consider those sources whose body contains >= 1 global predicate appearing in q. Still an exponential # of choices, but not too terrible in practice.
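The two pruning ideas (keep only relevant sources, use at most as many subgoals as q) can be sketched as a candidate enumerator. This is a simplified illustration under assumptions: it only enumerates WHICH sources to join, not how their arguments are bound, and each candidate would still have to be checked for validity by a containment test. The function names are hypothetical.

```python
from itertools import combinations_with_replacement

# Global predicates mentioned in the body of each source description (LAV example).
SOURCE_PREDS = {
    "source1": {"emp", "phone", "mgr"},
    "source2": {"emp", "office", "dept"},
    "source3": {"emp", "phone", "dept"},
}


def relevant_sources(query_preds):
    """Sources whose body shares at least one global predicate with the query."""
    return sorted(name for name, preds in SOURCE_PREDS.items()
                  if preds & set(query_preds))


def candidate_joins(query_preds):
    """Multisets of relevant sources with at most as many subgoals as the query."""
    rel = relevant_sources(query_preds)
    n = len(query_preds)
    return [combo
            for k in range(1, n + 1)
            for combo in combinations_with_replacement(rel, k)]


# q1 has two subgoals: phone and office.
q1_candidates = candidate_joins(["phone", "office"])
# A one-subgoal query over mgr only needs to look at source1.
mgr_candidates = candidate_joins(["mgr"])
```

The count grows combinatorially with the number of relevant sources and query subgoals, which is the exponential blow-up the slide refers to.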

Example revisited & expanded.

Suppose source 1 instead exported s1(E,P) and source 2 exported s2(E,O).

• Is q1 answerable using the views?
• What about q2(E) <- emp(E), mgr(E,'john')?
• What about q3(E1,E2) <- phone(E1,P), phone(E2,P)?
• What about q4(E,M) <- emp(E), dept(E,'toy'), mgr(E,M)?

QAV (AQUV) – general story

Why is QAV a worthwhile problem?

• Speed up query processing with materialized views: can I answer this query using stored view(s)?
• Information integration: sources store some data, and *describe* (usu. using rules) how local data relates to the global schema (i.e., what are their contributions?).
• Can I answer this query using the available source data (i.e., views)?
• How best can I answer?

QAV – two models

• Classic query optimization context: equivalent rewriting. Used extensively in data warehousing/OLAP.
• Information integration: maximally contained (also called minimal, maximally sound) rewriting.

Excellent survey: Alon Y. Halevy. Answering queries using views: a survey. VLDB Jl. 2001.
