Query Translation of Web Database Integration: Issues, Advances and Directions Fangjiao Jiang

Page 1: Query Translation of Web Database Integration: Issues, Advances and Directions

Query Translation of Web Database Integration: Issues,

Advances and Directions

Fangjiao Jiang

Page 2: Query Translation of Web Database Integration: Issues, Advances and Directions

Outline

- Query translation in web database integration
  - Introduction
  - Problems
  - A simple framework
- Survey of the current work
- The challenges and opportunities of query translation in Web DB integration
- Our future work on query translation

Page 4: Query Translation of Web Database Integration: Issues, Advances and Directions

Introduction of query translation

[Figure: architecture of a web database integration system. Labels recoverable from the diagram: World Wide Web; Web Database Discovery; WDB List; WDB Interface Schema Extraction; WDB Clustering by Domain (WDB Cluster 1 ... WDB Cluster n); Interface Integration (Integrated Interface 1 ... Integrated Interface n); Integrated Interface Generation Module; Integrated Interface; User query; Database Selection; query translation; Query Dispatch; WDB 1 ... WDB m; Result Extraction; Result Annotation; Entity Identification; Domain Mapping; Result Merging; Result; Query Processing Module.]

Query translation: a user's query submitted to the integrated interface must be translated automatically into queries on the individual web database interfaces.

Page 5: Query Translation of Web Database Integration: Issues, Advances and Directions

Problems

Problem 1: Should we translate the query to every web database?

- Is it necessary? Is it costly? Is it redundant?
- What: which web databases should we select to translate the user's query to? (Database selection)

[Figure: Q-Web DB. A query Q is sent to the web databases DB1, DB2, DB3, ……, DBn.]

Page 6: Query Translation of Web Database Integration: Issues, Advances and Directions

Problems

Problem 2: How to translate a query from the integrated interface to a web database interface? (Attribute matching)

Matches between the integrated interface and a local web DB interface can be 1:1 matches or complex matches:

- {Depart City} = {leaving from} (1:1)
- {Destination} = {Going to} (1:1)
- ……
- {Adult, Child} = {Passengers} (n:1)

[Figure: the integrated interface beside a local web DB interface.]
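A minimal sketch of how such matches might be applied once they are known. The table names, the function name, and the rule that an n:1 match sums its values are assumptions made for illustration; the slides do not say how the values of {Adult, Child} are combined into {Passengers}.

```python
# Hypothetical sketch: all names and the "sum" combination rule are assumptions.

# 1:1 matches: integrated attribute -> local attribute.
ONE_TO_ONE = {
    "Depart City": "leaving from",
    "Destination": "Going to",
}

# n:1 matches: several integrated attributes -> one local attribute,
# together with a function that combines their values (assumed: a sum).
N_TO_ONE = [
    (("Adult", "Child"), "Passengers", sum),
]


def translate_attributes(query):
    """Rewrite {integrated attribute: value} into {local attribute: value}."""
    local = {}
    for attr, value in query.items():
        if attr in ONE_TO_ONE:
            local[ONE_TO_ONE[attr]] = value
    for sources, target, combine in N_TO_ONE:
        present = [query[a] for a in sources if a in query]
        if present:
            local[target] = combine(present)
    return local


example = translate_attributes(
    {"Depart City": "Beijing", "Destination": "Paris", "Adult": 2, "Child": 1}
)
print(example)
```

Attributes with no match are silently dropped here; how to handle them is the rewriting question raised under Problem 4.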

Page 7: Query Translation of Web Database Integration: Issues, Advances and Directions

Problems

Problem 3: How to translate a query from the integrated interface to a web database interface? (Constraint mapping)

- Integrated interface: Title contains "red storm". Candidate local constraints: Title contains "red storm" (any words), or Title contains "red storm" (all words), or Title contains "storm" (any words).
- Integrated interface: Price < $35. Candidate local constraint: Price < $25 ∪ $25 < Price < $45.
- ……

[Figure: the integrated interface beside local web DB interface 1 and local web DB interface 2.]

Page 8: Query Translation of Web Database Integration: Issues, Advances and Directions

Problems

Problem 4: How to translate a query from the integrated interface to a web database interface? (Capability-based query rewriting)

- On the integrated interface, Author and Title can be queried together. On a local web DB interface, only one of Author, Title, and Subject can be queried at a time.
- The integrated interface has the constraint Class = Economy. No attribute in the local web DB interface matches the attribute "Class", so should the constraint be rewritten as Class = true?

[Figures: in each case, the integrated interface beside a local web DB interface.]

Page 9: Query Translation of Web Database Integration: Issues, Advances and Directions

Problems

Problem 5: How to filter the returned results to get the correct results? (Result filter)

It is unavoidable that some of the returned results are incorrect.

[Figure: the correct results as a subset of the returned results.]
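One simple way to picture a result filter is to re-apply the user's original constraints to whatever the local database returns. The sketch below assumes results arrive as dictionaries and constraints as (attribute, operator, value) triples; both representations and all names are hypothetical, and the slides do not prescribe a filtering method.

```python
import operator

# Supported comparison operators (an assumed, deliberately small set).
OPS = {
    "<": operator.lt,
    "<=": operator.le,
    "=": operator.eq,
    ">": operator.gt,
    ">=": operator.ge,
}


def filter_results(results, constraints):
    """Keep only the records that satisfy every original constraint."""
    kept = []
    for record in results:
        if all(
            attr in record and OPS[op](record[attr], value)
            for attr, op, value in constraints
        ):
            kept.append(record)
    return kept


# The local interface could only express Price < $45, so it over-returns.
returned = [
    {"title": "Red Storm Rising", "price": 30},
    {"title": "Storm Front", "price": 40},
]
correct = filter_results(returned, [("price", "<", 35)])
print(correct)
```

This only works when the returned records expose the attributes the original constraints refer to, which is itself an assumption about result extraction.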

Page 10: Query Translation of Web Database Integration: Issues, Advances and Directions

The simple framework of query translation

- Pre-processing: database selection
- Core processing: attribute matching, constraint mapping, query rewriting
- Post-processing: result filter

Page 11: Query Translation of Web Database Integration: Issues, Advances and Directions

Outline

- Query translation in web database integration
  - Introduction
  - Problems
  - A simple framework
- Survey of the current work
- The challenges and opportunities of query translation in Web DB integration
- Our future work on query translation

Page 12: Query Translation of Web Database Integration: Issues, Advances and Directions

Survey the current work (Ⅰ)

Database selection

- Zaiqing Nie and Subbarao Kambhampati. A Frequency-based Approach for Mining Coverage Statistics in Data Integration. In Proceedings of the 20th ICDE, 2004.
- A. Y. Levy, A. Rajaraman, and J. J. Ordille. Querying heterogeneous information sources using source descriptions. In VLDB Conference, 1996.
- A. Y. Halevy. Answering queries using views: A survey. The VLDB Journal, 10(4):270–294, 2001.
- ……

Attribute matching

- E. Rahm and P. A. Bernstein. A survey of approaches to automatic schema matching. VLDB Journal, 10(4):334–350, 2001.
- Bin He. Discovering Complex Matchings across Web Query Interfaces: A Correlation Mining Approach. KDD, 2004.
- B. He and K. C.-C. Chang. Statistical schema matching across web query interfaces. In SIGMOD Conference, 2003.
- Jiying Wang, Ji-Rong Wen, Fred Lochovsky, Wei-Ying Ma. Instance-based Schema Matching for Web Databases by Domain-specific Query Probing. Proceedings of the 30th VLDB Conference, 2004.
- Bin He, Kevin Chen-Chuan Chang. Making holistic schema matching robust: an ensemble approach. KDD 2005, 429-438.
- ……

Page 13: Query Translation of Web Database Integration: Issues, Advances and Directions

Survey the current work (Ⅱ)

Constraint mapping

- Z. Zhang, B. He, and K. C.-C. Chang. Light-weight domain-based form assistant: Querying Web Databases On the Fly. In VLDB Conference, 2005.
- K. C.-C. Chang and H. García-Molina. Approximate query mapping: Accounting for translation closeness. VLDB Journal, 2001.
- K. C.-C. Chang, H. García-Molina, and A. Paepcke. Boolean Query Mapping Across Heterogeneous Information Sources. IEEE Transactions on Knowledge and Data Engineering, 1996.
- K. C.-C. Chang, H. García-Molina. Mind Your Vocabulary: Query Mapping Across Heterogeneous Information Sources. Proceedings of the 1999 ACM SIGMOD Conference.
- ……

Query rewriting

- C. Li, R. Yerneni, V. Vassalos, H. Garcia-Molina, Y. Papakonstantinou, J. Ullman, and M. Valiveti. Capability based mediation in TSIMMIS. SIGMOD Conference, 1998.
- Y. Papakonstantinou, A. Gupta, and L. Haas. Capabilities-based query rewriting in mediator systems. In International Conference on Parallel and Distributed Information Systems, 1996.
- Y. Papakonstantinou, A. Gupta, H. Garcia-Molina, and J. D. Ullman. A query translation scheme for rapid implementation of wrappers. In International Conference on Deductive and Object-Oriented Databases, 1995.
- ……

Result filter

- ……

Page 14: Query Translation of Web Database Integration: Issues, Advances and Directions

Related works (1), database selection: A Frequency-based Approach for Mining Coverage Statistics in Data Integration (ICDE 2004)

Scenario: BibFinder, a publicly fielded computer science bibliography mediator. It integrates several online computer science bibliography sources, such as CSB, DBLP, ACM Digital Library, and CiteSeer.

Approach: use coverage and overlap statistics to rank sources.

- Ranking requires knowing the coverage of each source S with respect to the query Q.
- Learn statistics only with respect to a smaller set of frequently asked queries, so only the coverage of each source S with respect to each frequent query Q is needed.
- Store statistics with respect to query classes. A new query that is not in the query list can be mapped to the most similar query classes.
- Query classes are built from AV hierarchies.

Page 15: Query Translation of Web Database Integration: Issues, Advances and Directions

Query List (columns in the original table: Query; Frequency; Distinct Answers; Overlap (Coverage))

Query: Author="andy king". Frequency: 106. Distinct answers: 46. Overlap (coverage) per source set:

- DBLP: 35
- CSB: 23
- CSB, DBLP: 12
- DBLP, Science: 3
- Science: 3
- CSB, DBLP, Science: 1
- CSB, Science: 1

Query: Author="fayyad", Title="data mining". Frequency: 1. Distinct answers: 27. Overlap (coverage) per source set:

- CSB: 16
- DBLP: 16
- CSB, DBLP: 7
- ACMdl: 5
- ACMdl, CSB: 3
- ACMdl, DBLP: 3
- ACMdl, CSB, DBLP: 2
- Science: 1

Query List: the mediator maintains an XML log of all user queries, along with their access frequency, the total number of distinct answers obtained, and the number of answers from each source set that has answers for the query.

Page 16: Query Translation of Web Database Integration: Issues, Advances and Directions

AV Hierarchies and Query Classes

[Figure: AV Hierarchy for the Year attribute. Root RT with children 2001 and 2002.]

[Figure: AV Hierarchy for the Conference attribute. Root RT with children DB (SIGMOD, ICDE) and AI (AAAI, ECP).]

[Figure: Query Class Hierarchy. Nodes shown include RT,RT; DB,RT; AI,RT; RT,01; RT,02; SIGMOD,RT; ICDE,RT; AAAI,RT; ECP,RT; DB,01; DB,02; AI,01; SIGMOD01; ICDE01; ICDE02; AAAI01; ECP01; ……]

Attribute-Value Hierarchy: an AV Hierarchy is a classification of the values of a particular attribute of the mediator relation. Leaf nodes in the hierarchy correspond to concrete values bound in a query.

Query Class: queries are grouped into classes by computing Cartesian products over the AV Hierarchies. A query class is a set of queries that all share a set of assignments of particular attributes to specific values.
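The Cartesian-product construction of query classes can be sketched as follows. The parent maps reproduce the two small hierarchies as read from the figures on this slide; the function names and the dictionary representation are assumptions for illustration, not the paper's data structures.

```python
from itertools import product

# Child -> parent maps; "RT" is the root of each AV hierarchy.
# The grouping (DB: SIGMOD, ICDE; AI: AAAI, ECP) is read from the slide's figure.
CONFERENCE = {
    "SIGMOD": "DB", "ICDE": "DB",
    "AAAI": "AI", "ECP": "AI",
    "DB": "RT", "AI": "RT",
}
YEAR = {"2001": "RT", "2002": "RT"}


def nodes(hierarchy):
    """All nodes of a hierarchy, including the root."""
    return set(hierarchy) | set(hierarchy.values())


def ancestors_or_self(value, hierarchy):
    """The chain from a node up to the root, most specific first."""
    chain = [value]
    while chain[-1] in hierarchy:
        chain.append(hierarchy[chain[-1]])
    return chain


def all_query_classes():
    """Every query class: the Cartesian product of the hierarchies' nodes."""
    return set(product(nodes(CONFERENCE), nodes(YEAR)))


def classes_covering(conference, year):
    """Query classes that contain a concrete query, most specific first."""
    return list(product(ancestors_or_self(conference, CONFERENCE),
                        ancestors_or_self(year, YEAR)))


print(len(all_query_classes()))
print(classes_covering("SIGMOD", "2001"))
```

A new query would then be mapped to the least general of its covering classes for which statistics have been stored.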

Page 17: Query Translation of Web Database Integration: Issues, Advances and Directions

Using Coverage and Overlap Statistics to Rank Sources

1. A new user query is mapped to a set of least general query classes.

2. The mediator estimates the statistics for the query using a weighted sum of the statistics of the mapped classes.

3. Data sources are ranked and called in order of relevance using the estimated statistics. In particular:

- The most relevant source has the highest coverage.
- The next best source has the highest residual coverage.

As a result, the maximum number of tuples is obtained while the fewest sources are called.

[Figure: overlapping answer sets of DBLP, CSB, and ACMDL.]

Example: here, CSB has the highest coverage, followed by DBLP. However, since ACMDL has higher residual coverage than DBLP, the top 2 sources that would be called are CSB and ACMDL.
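The ranking rule can be illustrated with a greedy selection over explicit answer sets. This is a simplification: the mediator described here works from estimated coverage and overlap statistics rather than from the actual tuples, and the tuple ids and function name below are made up for the example.

```python
def rank_sources(answers, k):
    """Greedily pick up to k sources by residual coverage.

    answers maps a source name to the set of tuple ids it would return.
    The first pick has the highest coverage; each later pick adds the
    most tuples not already covered by earlier picks.
    """
    remaining = dict(answers)
    covered = set()
    order = []
    while remaining and len(order) < k:
        # sorted() makes tie-breaking deterministic (alphabetical).
        best = max(sorted(remaining), key=lambda s: len(remaining[s] - covered))
        order.append(best)
        covered |= remaining.pop(best)
    return order


# Hypothetical answer sets mirroring the slide's example.
answers = {
    "CSB": set(range(0, 60)),     # highest coverage
    "DBLP": set(range(10, 60)),   # large, but fully overlaps CSB
    "ACMDL": set(range(50, 80)),  # smaller, but contributes 20 new tuples
}
top2 = rank_sources(answers, 2)
print(top2)
```

With these sets DBLP has higher coverage than ACMDL but zero residual coverage after CSB, so ACMDL is called second.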

Page 18: Query Translation of Web Database Integration: Issues, Advances and Directions

Related works (2), database selection: Querying heterogeneous information sources using source descriptions (VLDB 1996)

Information Manifold (IM):

- An implemented system that provides uniform access to a heterogeneous collection of more than 100 information sources on the WWW.
- IM contains declarative descriptions of the contents of the information sources.

Example: Q: Get the price and review of cars for sale that were manufactured no earlier than 1992.

Page 19: Query Translation of Web Database Integration: Issues, Advances and Directions

Use the relational model, augmented with certain object-oriented features to describe the content of information sources.

Page 20: Query Translation of Web Database Integration: Issues, Advances and Directions

Related works(3): constraint mapping: Light-weight domain-based form assistant: Querying Web Databases On the Fly (VLDB 2005)

Page 21: Query Translation of Web Database Integration: Issues, Advances and Directions

Semantic closeness

Definition 1: Given a source query Qs and a target query form T, a query Qt* is a minimal subsuming translation w.r.t. T if:

1. Qt* is a valid query w.r.t. T;
2. Qt* subsumes Qs, i.e., for any database instance Di, Qs(Di) ⊆ Qt*(Di);
3. Qt* is minimal, i.e., there is no query Qt such that Qt satisfies (1) and (2) and Qt* subsumes Qt.

Approach:

- 37 template patterns were found in 150 sources.
- Two predicate templates have a mapping correspondence only if there exists a concept expressed with these two templates in different sources. In the correspondence matrix CM, CM(i, j) denotes the number of concepts that are expressed using both templates Pi and Pj.
- As Figure 5 indicates, mappings mostly happen only within certain clusters of templates: Datetime type, Numeric type, and Text type.

Page 22: Query Translation of Web Database Integration: Issues, Advances and Directions

The predicate mapper consists of two components: a type recognizer and a type handler.

The predicate mapper takes a source predicate s and a matched target predicate template P as input, and outputs the closest target translation t* for s.

A type handler needs three key components: search space, closeness estimation, and search strategy.

Page 23: Query Translation of Web Database Integration: Issues, Advances and Directions

Closeness Estimation

- Given the search space (P) covering all possible mappings, finding a Cmin mapping for the numeric type and the Datetime type is an easy task.
- For the text type, inferring the subsumption relationship is not trivial, since it essentially needs logical reasoning.
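To see why the numeric case is easy, consider a target form that offers a fixed list of price ranges. A minimal sketch, assuming half-open intervals (low, high) and non-overlapping target ranges; the function name and the interval convention are assumptions and are not taken from the paper.

```python
def closest_range_mapping(source, target_ranges):
    """Pick the target ranges whose union minimally subsumes `source`.

    Intervals are half-open pairs (low, high). Target ranges are assumed
    not to overlap, so every range that intersects the source is needed
    and no other range is. Returns None when the target ranges leave part
    of the source uncovered (no subsuming translation exists).
    """
    low, high = source
    chosen = [(a, b) for a, b in sorted(target_ranges) if a < high and b > low]
    reach = low
    for a, b in chosen:
        if a > reach:  # a gap inside the source interval
            return None
        reach = max(reach, b)
    return chosen if reach >= high else None


# Source constraint Price < $35 against a form offering three price bands.
bands = [(0, 25), (25, 45), (45, 65)]
mapping = closest_range_mapping((0, 35), bands)
print(mapping)
```

The result corresponds to the Price < $25 ∪ $25 < Price < $45 translation from Problem 3; the extra results between $35 and $45 are what a result filter would later remove.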

Page 24: Query Translation of Web Database Integration: Issues, Advances and Directions

Text Type Handler:

- The question is: which database instance can be used to reliably test the subsumption relationship?
- We construct the database using words from Ws plus some additional random words. The database is composed of all possible combinations of the words (for testing membership) with all possible orders (for testing sequence).
- Figure 8: t5 is the Cmin mapping.
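The test-database idea can be sketched on the "red storm" example from Problem 3. The extra word "blue", the three predicate builders, and the rule of choosing the subsuming candidate with the fewest matches on the test database are illustrative assumptions; the paper's own closeness estimation may differ.

```python
from itertools import permutations


def build_test_database(words):
    """All non-empty subsets of the words, in every possible order."""
    docs = []
    for size in range(1, len(words) + 1):
        docs.extend(permutations(words, size))
    return docs


def any_words(*terms):
    return lambda doc: any(t in doc for t in terms)


def all_words(*terms):
    return lambda doc: all(t in doc for t in terms)


def phrase(*terms):
    n = len(terms)
    return lambda doc: any(doc[i:i + n] == terms for i in range(len(doc) - n + 1))


def closest_subsuming(source, candidates, docs):
    """Name of the subsuming candidate that matches the fewest test documents."""
    subsuming = {
        name: pred for name, pred in candidates.items()
        if all(pred(d) for d in docs if source(d))
    }
    if not subsuming:
        return None
    return min(subsuming, key=lambda name: sum(1 for d in docs if subsuming[name](d)))


docs = build_test_database(["red", "storm", "blue"])  # "blue" is the random word
candidates = {
    "any words: red storm": any_words("red", "storm"),
    "all words: red storm": all_words("red", "storm"),
    "any words: storm": any_words("storm"),
}
best = closest_subsuming(phrase("red", "storm"), candidates, docs)
print(best)
```

All three candidates subsume the phrase query on this database, and the all-words form matches the fewest documents, so it is selected as the closest translation.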

Page 25: Query Translation of Web Database Integration: Issues, Advances and Directions

Related works (4), constraint mapping: Mind Your Vocabulary: Query Mapping Across Heterogeneous Information Sources (SIGMOD 1999)

Method: manually provide mapping rules to translate query constraints from one source to another.

- Consider one-to-one constraint mappings.
- Consider inter-dependencies among constraints.
- Explore relaxations into the closest supported version.

Page 26: Query Translation of Web Database Integration: Issues, Advances and Directions

Related works (5), query rewriting: Querying heterogeneous information sources using source descriptions (VLDB 1996)

Information Manifold (IM):

- An implemented system that provides uniform access to a heterogeneous collection of more than 100 information sources on the WWW.
- IM contains declarative descriptions of the contents and capabilities of the information sources.
- It uses the source descriptions to prune the set of information sources for a given query and to generate executable query plans.

Page 27: Query Translation of Web Database Integration: Issues, Advances and Directions

Capability records describe the capabilities of an information source.

Every source relation is associated with one capability record of the form (Sin, Sout, Ssel, min, max):

- Sin: the input parameters; the source must be given bindings for at least min elements of Sin
- Sout: the parameters returned from the information source
- Ssel: the parameters on which the source can apply selections
- min: the minimum number of inputs allowed
- max: the maximum number of inputs allowed
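A capability record can be modeled as a small data structure with a feasibility check. The class, field, and method names below are assumptions chosen for this sketch, and the check covers only the input-binding part of the record (Sin, min, max).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CapabilityRecord:
    """Hypothetical encoding of (Sin, Sout, Ssel, min, max)."""
    s_in: frozenset
    s_out: frozenset
    s_sel: frozenset
    min_inputs: int
    max_inputs: int

    def accepts(self, bound):
        """True if the bound attributes are legal inputs in a legal number."""
        bound = set(bound)
        return (bound <= self.s_in
                and self.min_inputs <= len(bound) <= self.max_inputs)


# A source where Author, Title, and Subject can only be queried one at a
# time, as in Problem 4 (the output attributes are invented for the example).
one_at_a_time = CapabilityRecord(
    s_in=frozenset({"author", "title", "subject"}),
    s_out=frozenset({"author", "title", "price"}),
    s_sel=frozenset(),
    min_inputs=1,
    max_inputs=1,
)

print(one_at_a_time.accepts({"author"}))
print(one_at_a_time.accepts({"author", "title"}))
```

A query that binds both Author and Title is rejected by this record, so a rewriter would have to send one binding to the source and apply the other afterwards.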

Page 28: Query Translation of Web Database Integration: Issues, Advances and Directions

Related works (6), query rewriting: Capability based Mediation in TSIMMIS (SIGMOD 1998)

Mediators must keep track of the capabilities of sources to answer queries. Otherwise they may generate plans involving source queries that the sources cannot answer.

TSIMMIS system:

- The mediator encodes the relationship between the user views and the source views with a set of view definitions.
- It uses the Mediator Specification Language (MSL) to define user views. MSL is a logic-based language with object-oriented features.

For example, the user view paper is defined as follows:

<paper {<title T><author A><abs B><conf C>}> :- <entry {<title T><author A><abs B>}>@s1, <entry {<title T><conf C>}>@s2

Suppose the user wants to find the title and abstract of each paper written by ‘Smith’ in ‘SIGMOD-97’. The user formulates the following query, based on the user view paper:

<ans {<title T><abs B>}> :- <paper {<title T><author ‘Smith’><abs B><conf ‘SIGMOD-97’>}>

Page 29: Query Translation of Web Database Integration: Issues, Advances and Directions

When the user query arrives at the mediator, the mediator uses the view definitions to translate the query on the user views into a logical plan. The following is the logical plan for the example user query:

<ans {<title T><abs B>}> :- <entry {<title T><author ‘Smith’><abs B>}>@s1, <entry {<title T><conf ‘SIGMOD-97’>}>@s2

Three possible physical plans for the logical plan of the example user query are:

- P1: Send query <entry {<title T><author ‘Smith’><abs B>}> to s1; send query <entry {<title T><conf ‘SIGMOD-97’>}> to s2; join the results of these source queries on the title attribute.
- P2: Send query <entry {<title T><author ‘Smith’><abs B>}> to s1; for each returned title, send query <entry {<title T><conf ‘SIGMOD-97’>}> to s2, with T bound.
- P3: Send <entry {<title T><conf ‘SIGMOD-97’>}> to s2; for each returned title, send <entry {<title T><author ‘Smith’><abs B>}> to s1, with T bound.

Whether each of these physical plans is feasible depends on the query capabilities of the sources.

Page 30: Query Translation of Web Database Integration: Issues, Advances and Directions

In order to describe the capabilities of sources, the TSIMMIS system uses templates to represent sets of queries that can be processed by each source.

Suppose s1 and s2 only have the following templates:

- T11: X :- X:<entry {<title $T><author A><abs B>}>@s1
- T21: X :- X:<entry {<title T><conf $C>}>@s2
- T22: X :- X:<entry {<title $T><conf C>}>@s2

The three physical plans to check against these templates:

- P1: Send query <entry {<title T><author ‘Smith’><abs B>}> to s1; send query <entry {<title T><conf ‘SIGMOD-97’>}> to s2; join the results of these source queries on the title attribute.
- P2: Send query <entry {<title T><author ‘Smith’><abs B>}> to s1; for each returned title, send query <entry {<title T><conf ‘SIGMOD-97’>}> to s2, with T bound.
- P3: Send <entry {<title T><conf ‘SIGMOD-97’>}> to s2; for each returned title, send <entry {<title T><author ‘Smith’><abs B>}> to s1, with T bound.

Page 31: Query Translation of Web Database Integration: Issues, Advances and Directions

Outline

- Query translation in web database integration
  - Introduction
  - Problems
  - A simple framework
- Survey of the current work
- The challenges and opportunities of query translation in Web DB integration
- Our future work on query translation

Page 32: Query Translation of Web Database Integration: Issues, Advances and Directions

Challenges

How can we translate the query from the uniform integrated interface to web database interfaces?

- The number of web databases we can access is very large, even within one domain.
- The meta-information about the web databases is very difficult to access:
  - Logical source contents (books, new cars)
  - Source capabilities (can answer the query)
  - Source completeness (has all books)
  - Statistics about the data (like in an RDBMS)
  - Source reliability
  - Update frequency
- The web databases are heterogeneous, with heterogeneous schemas.
- The web databases are autonomous:
  - No central administration
  - Uncontrolled source content overlap
- The web databases are dynamic.
- Approximate query translations will be unavoidable and complex.
- Manual rule-based constraint mapping will be replaced by automatic rule-based mapping.

Page 33: Query Translation of Web Database Integration: Issues, Advances and Directions

Opportunities

- The aggregate schema vocabulary of sources in the same domain tends to converge at a relatively small size.
- The distribution of attribute frequencies is non-uniform and Zipf-like.
- There are 25 constraint patterns overall.
- The distribution of constraint patterns is Zipf-like, too.
- The data model is simple.
- Related work on schema matching should be useful for query translation.

Page 34: Query Translation of Web Database Integration: Issues, Advances and Directions

Outline

- Query translation in web database integration
  - Introduction
  - Problems
  - A simple framework
- Survey of the current work
- The differences and challenges of query translation in Web DB integration
- Our future work on query translation

Page 35: Query Translation of Web Database Integration: Issues, Advances and Directions

Our framework of query translation

- Database selection: description of the contents, completeness, capabilities, and reliability of each web database
- Attribute matching: create a hierarchy relationship with respect to semantic matching
- Constraint mapping: generate mapping rules automatically according to the types
- Query rewriting: description of the capabilities of each web database
- Result filter

[Figure: the framework annotated with the labels Probing Queries, Statistics information, Semantic matching, Generate mapping rules, and Description capabilities.]

Page 36: Query Translation of Web Database Integration: Issues, Advances and Directions

Questions

How do we characterize, get and exploit source content, completeness, reliability, coverage and overlap?

How to create a hierarchy relationship with respect to the semantic mapping of attributes?

How to generate constraints mapping rules automatically according to the semantic mapping and type of attributes?

How to get and describe the capabilities of the local interfaces? How to rewrite query based on them?

Page 37: Query Translation of Web Database Integration: Issues, Advances and Directions

Thanks!

Questions

Page 38: Query Translation of Web Database Integration: Issues, Advances and Directions

Main references

- A. Y. Levy, A. Rajaraman, and J. J. Ordille. Querying heterogeneous information sources using source descriptions. In VLDB Conference, 1996.
- A. Y. Halevy. Answering queries using views: A survey. The VLDB Journal, 10(4):270–294, 2001.
- C. Li, R. Yerneni, V. Vassalos, H. Garcia-Molina, Y. Papakonstantinou, J. Ullman, and M. Valiveti. Capability based mediation in TSIMMIS. SIGMOD Conference, 1998.
- Y. Papakonstantinou, A. Gupta, and L. Haas. Capabilities-based query rewriting in mediator systems. In International Conference on Parallel and Distributed Information Systems, 1996.
- Y. Papakonstantinou, A. Gupta, H. Garcia-Molina, and J. D. Ullman. A query translation scheme for rapid implementation of wrappers. In International Conference on Deductive and Object-Oriented Databases, 1995.
- Vasilis Vassalos, Yannis Papakonstantinou. Describing and Using Query Capabilities of Heterogeneous Sources. VLDB 1997: 256-265.
- K. C.-C. Chang and H. García-Molina. Approximate query mapping: Accounting for translation closeness. VLDB Journal, 2001.
- K. C.-C. Chang, H. García-Molina, and A. Paepcke. Boolean Query Mapping Across Heterogeneous Information Sources. IEEE Transactions on Knowledge and Data Engineering, 1996.
- K. C.-C. Chang, H. García-Molina. Mind Your Vocabulary: Query Mapping Across Heterogeneous Information Sources. Proceedings of the 1999 ACM SIGMOD Conference.
- Z. Zhang, B. He, and K. C.-C. Chang. Light-weight domain-based form assistant: Querying Web Databases On the Fly. In VLDB Conference, 2005.
- Zaiqing Nie and Subbarao Kambhampati. A Frequency-based Approach for Mining Coverage Statistics in Data Integration. In Proceedings of the 20th International Conference on Data Engineering (ICDE 2004).

Page 39: Query Translation of Web Database Integration: Issues, Advances and Directions

Main references

- K. C.-C. Chang and H. García-Molina. Conjunctive constraint mapping for data translation. ACM ICDL, 1998.
- Bin He, Kevin Chen-Chuan Chang. Making holistic schema matching robust: an ensemble approach. KDD 2005, 429-438.
- Wensheng Wu, AnHai Doan, Clement Yu. WebIQ: Learning from the Web to Match Deep-Web Query Interfaces. ICDE, 2006.
- B. He, K. C.-C. Chang, and J. Han. Automatic complex schema matching across web query interfaces: A correlation mining approach. Technical Report UIUCDCS-R-2003-2388, Dept. of Computer Science, UIUC, July 2003.
- K. C.-C. Chang, B. He, C. Li, M. Patel, and Z. Zhang. Structured databases on the web: Observations and implications. SIGMOD Record, 2004.
- Palopoli L, Sacca D, Ursino D. Semi-automatic, semantic discovery of properties from database schemas. IDEAS, 1998.
- Milo T, Zohar S. Using schema matching to simplify heterogeneous data translation. Proc. 24th VLDB.
- Bin He. Discovering Complex Matchings across Web Query Interfaces: A Correlation Mining Approach. KDD, 2004.
- Jiying Wang, Ji-Rong Wen, Fred Lochovsky, Wei-Ying Ma. Instance-based Schema Matching for Web Databases by Domain-specific Query Probing. Proceedings of the 30th VLDB Conference, 2004.
- Hai He, Weiyi Meng, Clement Yu, and Zonghuan Wu. Automatic Integration of Web Search Interfaces with WISE-Integrator. VLDB 2003.
- ……