
Department of Computer Science and Engineering
University of Texas at Arlington

Arlington, TX 76019

Information Integration across Heterogeneous Domains: Current Scenario, Challenges and the InfoMosaic Approach

Aditya Telang and Sharma Chakravarthy

Technical Report CSE-2007


Information Integration across Heterogeneous Domains:

Current Scenario, Challenges and The InfoMosaic Approach

Aditya Telang and Sharma Chakravarthy
IT Laboratory & Department of Computer Science & Engineering

The University of Texas at Arlington, Arlington, TX 76019.
{telang, sharma}@cse.uta.edu

Abstract

Today, information retrieval and integration has assumed a far more complex connotation than it once had. The advent of the Internet, the proliferation of information sources, and the presence of structured, semi-structured, and unstructured data have all added new dimensions to the problem of information retrieval and integration as it was known earlier. From the era of distributed databases leading to heterogeneous, federated, and multi-databases, retrieval and integration of heterogeneous information has been an important problem for which a complete solution has eluded researchers for decades. Techniques such as global schemas, schema integration, dealing with multiple schemas, domain-specific wrappers, and global transactions have produced significant steps but never reached the stage of maturity needed for large-scale deployment and usage. Currently, the problem is even more complicated as repositories exist in various formats (HTML, XML, Web databases with query interfaces, ...) and schemas, and both the content and the structure change autonomously. In this survey paper, we describe the general problem of information retrieval and integration and discuss the challenges that need to be addressed to deal with it as it pertains to the data sources we encounter today. As the number of repositories/sources will increase in an uncontrolled manner, there is no option but to find a solution for integrating information from different autonomous sources as needed for a search/query whose (partial) answers have to be retrieved and integrated from multiple sources. We survey the current work, identify challenges that need to be addressed, and present InfoMosaic, a framework we are currently designing to handle the challenge of multi-domain integration of data on the web.

1 Introduction

The volume of information accessible via the web is staggeringly large and growing rapidly. The coverage of information available from web sources is difficult to match by any other means. Although the quality of information provided by these sources tends to vary from one to another (sometimes drastically), it is easy to search these repositories and obtain information of interest. Hence, over the last decade or so, search engines (e.g., Google, Yahoo, etc.) [1] have become extremely popular and have enabled users to quickly and effortlessly retrieve information from individual sources. Conceptually, search engines only perform the equivalent of a simple lookup operation on one or more keywords, followed by a sophisticated ranking of the results [2]. There exist several advanced search mechanisms (or meta-search engines) that post-process the output of normal search engines to organize and classify the sources in a meaningful manner (e.g., Vivisimo [3]). Additionally, question-answering frameworks (e.g., START [4]), which are language-based retrieval mechanisms, address the issue of providing detailed information in response to queries on varied concepts (e.g., geography, science, etc.) expressed in English (or another) language. Furthermore, several commercial web-portals (e.g., http://www.expedia.com, http://www.amazon.com, etc.) have been designed to operate on individual domains (e.g., airlines, book purchase, hotels, etc.) or a set of pre-determined sources, searching multiple repositories belonging to the same domain (1) of interest and providing results based on some user criteria, such as cost, schedule, and proximity. In summary, efficient and effective search and meta-search engines are available that do specific tasks very well.

(1) We realize that the notion of a domain is subjective. In the context of this survey, a domain indicates a collection of sources providing information of similar interest, such as travel, books, literature, entertainment, etc.

These retrieval mechanisms are widely used mainly because of their simplicity and the absence of a learning curve as compared to other forms of information retrieval. This simplicity, on the other hand, makes it difficult to specify queries that require extraction of data from multiple repositories across diverse domains or results that are relevant to a context. Currently, there are no retrieval mechanisms that support framing queries that retrieve results from multiple documents and sources and combining them in meaningful ways to produce the desired result. For example, consider the query: "Retrieve all castles within 2 hours by train from London". Although all the information for answering different parts of this query is available on the Web, it is currently not possible to frame it as a single query and get a comprehensive set of relevant answers. The above example underlines the "Tower of Babel" problem of integrating information to answer a query that requires data from multiple independent sources to be combined intelligently. The islands of information that we are experiencing now are not very different from the islands of automation seen earlier. This gap needs to be bridged to move from current search and meta-search approaches to true information integration.

It is important to understand that information integration is not a new problem. It existed in the form of querying distributed, heterogeneous, multiple, and federated databases. What has really changed in the last decade is the complexity of the problem (types/models of data sources, number of data sources, infeasibility of schema integration) and the kind of solution that is being sought. In this survey paper, we analyze the problem of information integration as it pertains to extracting and combining heterogeneous data from autonomous Web sources in response to user queries spanning several domains.

The rest of the survey is organized as follows. In Section 2, we elaborate on the challenges encountered in heterogeneous information integration. Section 3 explains the dimensions along which existing information integration systems are measured and differentiated. Section 4 elaborates some of the salient approaches embraced by the research community for addressing these challenges. In Section 5, we analyze the existing frameworks, mechanisms, and techniques that have been proposed to solve some of the intricate problems that are encountered in data integration. Section 6 elucidates our approach in the context of InfoMosaic, a framework proposed for web-based multi-domain information integration. Section 7 concludes the survey.

2 Challenges in Heterogeneous Information Integration

The challenge posed by information integration is further exemplified by the following query example posed by Lesk [5]: "Find kitchen furniture and a place to buy it which is within walking distance of a metro stop in the Washington DC area". Again, all the individual pieces of information for answering this request are available on the web. However, it takes considerable effort to correlate this diverse data (including spatial data) to arrive at the answers for the above query. Below, we list additional illustrative queries that indicate the gravity of the problem:

Query 1: Retrieve all castles within 2 hours by train from London.

Query 2: Retrieve flights from Dallas for VLDB 2007 Conference that have fares under $1000


Query 3: List 3-bedroom houses in Austin within 2 miles of a school and within 5 miles of a highway, priced under $250,000

Query 4: Retrieve movie theaters near Parks Mall in Arlington, Texas showing the Simpsons movie

Query 5: Retrieve attorney details in Colorado specializing in immigration and speaking Spanish, along with their experience in years

Although the gist of information integration has not changed and this topic has been investigated for over two decades, the problem at hand is quite different and far more complex than the one attempted earlier. Although the techniques developed earlier (global schemas, schema integration, dealing with multiple schemas, domain-specific wrappers, and global transactions) produced significant steps, they never reached the stage of maturity for deployment and usage. On the other hand, current meta-search engines (that integrate information from multiple sources of a single domain) have had more success. The multi-domain problem is more complicated as repositories exist in various formats (HTML, XML, Web databases with query interfaces, etc.), with and without schemas, and both the content and the structure change autonomously. As the number of repositories/sources will increase steadily, there is no option but to find a solution for integrating information from different autonomous sources as needed for a search/query whose (partial) answers have to be retrieved and integrated from multiple domains and sources. As the enumeration of challenges below indicates, existing techniques from multiple domains need to be eclectically combined, and new solutions developed, in order to address this problem.

2.1 Capturing Imprecise Intent

One of the primary challenges is to provide a mechanism for the user to express his/her intent in an intuitive, easy-to-describe form. As elaborated by Query-1, user queries are complex and are difficult to express using existing query specification formats. In an ideal scenario, the user should be able to express the query in a natural language. This is one of the primary reasons for the popularity of search engines, since there is no query language to learn. However, unlike a search engine operation (which involves a simple lookup of a word, phrase, or expression in existing document repositories), information integration is a complex process of retrieving and combining data from different sources for the different sub-queries embedded in the given user query. An alternate option is the use of DBMS-style query languages (e.g., SQL) that allow users to specify a query in a pre-defined format. However, in sharp contrast to database models (which assume the user knows what to access and from where to access it), the anonymity of the sources and the complexity of the query involved in a data integration scenario make it difficult to express the intent using the hard semantics of these data models.

Existing frameworks (e.g., Ariadne [6], TSIMMIS [7], and Whirl [8]) extend database querying models using combinations of templates or menu-based forms to incorporate queries that are restricted to a single domain (or a set of domains). Other frameworks (such as Havasu [9]) employ an interface similar to search engines that takes relevant keywords (associated with a concept) from the user and retrieves information for this particular concept from a range of sources. However, as the domains for querying established by these systems are fixed (although the sources within the domain might change), the problem of designing a querying mechanism is simplified to a great extent. When a more involved query needs to be posed, users may not know how to unambiguously express their needs and may formulate queries that lead to unsatisfactory results. Moreover, providing a rigid specification format may restrict the user from providing complete information about his/her intent.
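
To make this discussion concrete, the sketch below (our illustration, not taken from any of the surveyed systems) shows one way the imprecise intent behind Query-1 might be captured as a structured request before any source is chosen; all field names and values are assumptions.

```python
# Hypothetical, illustrative representation of the user intent behind Query 1:
# "Retrieve all castles within 2 hours by train from London".
# Nothing here is tied to a specific framework; the field names are assumptions.
query_intent = {
    "domains": ["tourist-attractions", "rail-travel"],     # concepts the query touches
    "keywords": {"tourist-attractions": ["castle"],
                 "rail-travel": ["train", "London"]},
    "constraints": [
        # soft, cross-domain condition relating the two sub-queries
        {"type": "temporal", "expr": "travel_time <= 2 hours",
         "between": ("rail-travel", "tourist-attractions")},
    ],
    "ranking": None,          # optional user-specified ranking metric
}

if __name__ == "__main__":
    for domain in query_intent["domains"]:
        print(domain, query_intent["keywords"][domain])
```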


Additionally, most of these frameworks fail to capture queries that involve a combination of spatial, temporal, and spatio-temporal conditions. A few systems (e.g., Hermes [10], TerraWorld [11], etc.) allow a limited set of spatial operations (such as close to, travel time) through their push-button, listing-based interfaces or form-based interfaces. Currently, centralized web-based mapping interfaces (e.g., Google Maps and Virtual Earth) allow searching and overlaying spatial layers (e.g., all hotels and metro stations in the current window or a given geo-region) to examine the relationships among them visually. However, these user interfaces are not expressive enough and restrict users from specifying their intent in a flexible manner.

2.2 Mapping Imprecise Intent into Precise Queries

The next challenge is to transform the user intent into an appropriate query format that can be represented using a variant of relational algebra (or similar established mechanisms). Since the queries in the context of information integration are complex and involve a myriad of conditions, the existing formalisms of relational algebra alone may not be sufficient.

Over the past decade, several query languages have been designed that extend the basics of relational algebra and allow access to structured data (SQL, OOQL [12], Whirl [8], etc.), semi-structured data (SemQL [13], CARIN [14], StruQL [15], etc.), and vague (or unstructured) data (VAGUE [16]). These languages have, with limited success, incorporated imprecise user queries posed on a single domain (or a fixed set of multiple domains). Additionally, several frameworks have deployed customized models that translate the user query to a query format supported by an internal global schema (that provides an interface to the underlying sources). Briefly, Havasu's QPIAD [17] maps imprecise user queries to a more generic query using a combination of data-mining techniques. Similarly, Ariadne [6] interprets the user-specified conditions as a sequence of LOOM statements that are combined to generate a single query. MetaQuerier's form assistant [18] consists of built-in type handlers that aid the query translation process with moderate human effort.

However, existing mechanisms will prove insufficient to represent complex intent spanning several domains. Hence, it becomes necessary to use domain-related taxonomies/ontologies and source-related semantics to disambiguate the user intent as well as generate multiple potential queries from it. A feedback and learning mechanism may be appropriate to learn user intent from the combinations of concepts provided, based on user feedback. If multiple queries are generated (which is very much possible on account of the ambiguity of natural language and the volume of concepts involved in the domains of integration), an ordering mechanism may be useful to obtain valuable feedback from the user. Once the query is finalized, a canonical representation can be used to further transform the query into its components and their elaboration.
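
As an illustration of such a canonical representation, the following sketch (ours, not a published algorithm) models a query as a graph whose nodes are single-domain sub-queries and whose edges carry the conditions that tie them together; the class and attribute names are assumptions.

```python
# A minimal sketch of a canonical representation: the query is a graph whose
# nodes are single-domain sub-queries and whose edges carry the conditions
# that relate the sub-queries to each other.
from dataclasses import dataclass, field

@dataclass
class SubQuery:
    domain: str
    conditions: list            # e.g., ["origin = 'London'"]

@dataclass
class QueryGraph:
    nodes: dict = field(default_factory=dict)      # name -> SubQuery
    edges: list = field(default_factory=list)      # (name1, name2, join condition)

    def add(self, name, subquery):
        self.nodes[name] = subquery

    def connect(self, a, b, condition):
        self.edges.append((a, b, condition))

# Query 1 decomposed into two dependent sub-queries.
g = QueryGraph()
g.add("castles", SubQuery("tourist-attractions", ["type = 'castle'"]))
g.add("trains", SubQuery("rail-travel", ["origin = 'London'", "duration <= 120"]))
g.connect("castles", "trains", "castles.nearest_station = trains.destination")
print(g.edges)
```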

2.3 Discovery of Domain and Source Semantics

As elucidated by Query-1, user queries inherently consist of multiple sub-queries posed on distinct domains (or concepts). Gathering appropriate knowledge about the domains and the corresponding sources within these domains is vital to the success of heterogeneous integration of information. In order to relate various parts of a user query to appropriate domains (or concepts), the meaning of information that is interchanged across the system has to be understood.

Over the past decade, several customized techniques have been adopted by different frameworks that focus on capturing such meta-data about concepts and sources, facilitating easy mapping of queries over the global schema and/or the underlying sources. Havasu's attribute-valued hierarchies [9] maintain a classification of the attributes of the data sources over which the user queries are formed. Ariadne uses an independent domain model [6] for each application that integrates the information from the underlying sources and provides a single terminology for querying. This model is represented using the LOOM knowledge representation system [19]. TSIMMIS adopts an Object Exchange Model (OEM) [7], a self-describing (tagged) object model in which objects are identified by labels, types, values, and an optional identifier. Information Manifold's CARIN [14] proposes a method for representing local-source completeness and an algorithm for exploiting source information in query processing. This is an important feature for integration systems since, in most scenarios, data sources may be incomplete for the domain they are covering. Furthermore, it suggests the use of probabilistic reasoning for the ordering of data sources that appear relevant to answer a given query. InfoMaster's knowledge base [20] is responsible for the storage of all the rules and constraints required to describe heterogeneous data sources and their relationships with each other. In Tukwila, the metadata obtained from several sources is stored in a single data source catalog [21], which holds different types of information about the data sources, such as semantic descriptions of the contents of the data sources, overlap information about pairs of data sources, and key statistics about the data, such as the cost of accessing each source, the sizes of the relations in the sources, and selectivity information. Additionally, the use of ontologies for modeling implicit and hidden knowledge has been considered as a possible technique to overcome the problem of semantic heterogeneity by a number of frameworks such as KRAFT [22], SIMS [23], OntoBroker [24], etc.

The proliferation of data on the Internet has ensured that within each domain there exist a vast number of sources providing adequate yet similar information. For instance, portals such as Expedia, Travelocity, Orbitz, etc. provide information for the domain of air travel. Similarly, sources such as Google Scholar, DBLP, CiteSeer, etc. generate adequate and similar results for the domain of publications and literature. Thus, the next logical challenge is to automate the currently manual process of identifying appropriate sources associated with individual domains. Semantic discovery of sources, which involves a combination of web crawling, interface extraction, source clustering, semantic matching, and source classification, has been extensively researched by the Semantic Web community [25]. Currently, a significant and increasing amount of information obtained from the web is hidden behind the query interfaces of searchable databases. The potential of integrating data from such hidden data sources [26] is enormous. The MetaQuerier project [27] addresses the challenges of integrating these deep-web sources, such as discovering and integrating sources automatically, finding an appropriate mechanism for mapping independent user queries to source-specific sub-queries, and developing mass collaboration techniques for the management, description, and rating of such sources.

An ideal archetype would be to design a global taxonomy (that models all the heterogeneous domains across which user queries might be posed) and a domain taxonomy (that models all the sources belonging to a domain and orders them based on distinct criteria specified by the integration system). The construction of such a multi-level ontology requires extensive efforts in the areas of domain knowledge aggregation, deep-web exploration, and statistics collection. However, the earlier work on databases (use of equivalences and statistics in centralized databases, use of source schemas for obtaining a global schema) and recent work on information integration (as elaborated earlier) provide adequate reasons to believe that this can be extended to multi-domain queries and computations that include spatial and temporal constraints, which is being addressed in our InfoMosaic framework.
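
The following sketch, under the assumption of a very simple dictionary-based encoding, illustrates what such a two-level structure might look like: a global taxonomy of domains and per-domain source lists ordered by stored statistics. The domain names, sources, and numbers are purely illustrative.

```python
# An illustrative (assumed, not standardized) two-level structure: a global
# taxonomy of domains, and per-domain entries that list and order known sources.
global_taxonomy = {
    "travel":     {"sub_domains": ["air-travel", "rail-travel", "hotels"]},
    "literature": {"sub_domains": ["publications"]},
}

domain_taxonomies = {
    "air-travel": {
        # sources ordered by criteria the integration system cares about
        "sources": [
            {"name": "Expedia",     "coverage": 0.8,  "avg_latency_s": 2.1},
            {"name": "Travelocity", "coverage": 0.7,  "avg_latency_s": 1.6},
            {"name": "Orbitz",      "coverage": 0.6,  "avg_latency_s": 1.2},
        ],
    },
    "publications": {
        "sources": [
            {"name": "DBLP",           "coverage": 0.9,  "avg_latency_s": 0.8},
            {"name": "Google Scholar", "coverage": 0.95, "avg_latency_s": 1.5},
        ],
    },
}

def rank_sources(domain, key="coverage"):
    """Order a domain's sources by one of the stored statistics."""
    return sorted(domain_taxonomies[domain]["sources"],
                  key=lambda s: s[key], reverse=True)

print([s["name"] for s in rank_sources("air-travel")])
```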

2.4 Query Planning and Optimization

Plan generation and optimization in an information integration environment differs from traditional database query processing in several aspects: i) the volume of sources to be integrated is much larger than in a normal database environment, ii) the heterogeneity of the data (legacy database systems, web-sites, web-services, hidden web-data, etc.) makes it difficult to maintain the same processing capability as found in a typical database system (e.g., the ability to perform joins), iii) the query planner and optimizer in information integration has little information about the data since it resides in remote autonomous sources, and iv) unlike relational databases, there can be several restrictions on how an autonomous source can be accessed.

Current frameworks have devised several novel approaches for generating effective plans in the context of data integration. Havasu's StatMiner (in association with the Multi-R Optimizer) [9] provides a guarantee on the cost and coverage of the results generated for a query by approximating appropriate source statistics. Ariadne's Theseus [6] pre-compiles part of the integration model and uses a local search method for generating query plans across a large number of sources. Information Manifold's query-answering approach [14] translates user queries, posed on the mediated schema of data sources, into a format that maps to the actual relations within the data sources. This approach differs from the one adopted by Ariadne and ensures that only the relevant set of data sources is accessed when answering a particular user query. In Tukwila, if the query planner concludes that it does not have enough meta-data with which to reliably compare candidate query execution plans, it chooses to send only a partial plan to the execution engine and takes further action only after the partial plan has been completed.

However, since the domains involved in the user query are pre-determined for these frameworks, generalizing and applying these techniques to autonomous heterogeneous sources is not possible. This is particularly true for techniques that generate their plans based on the type of modeling applied to the underlying data sources. Furthermore, current optimization strategies [9] focus on a restricted set of metrics (such as cost, coverage, and overlap of sources) for optimization. Additional metrics, such as the volume of data retrieved from each source, the number of calls made to and the amount of data sent to each source, the quantity of data processed, and the number of integration queries executed, are currently not considered. It is important to understand that in this problem space, exact values of some of these measures may not be available, and the information available about the ability of the sources and their characteristics may determine how these measures can be used. Thus, effective plan generation and evaluation is significantly more complex than in a traditional system and needs to be investigated thoroughly.
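
A toy example of such multi-objective plan comparison is sketched below; it is not StatMiner or the Multi-R optimizer, and the weights and estimates are invented solely to show how cost, coverage, and call-count metrics could be traded off.

```python
# A toy multi-objective scorer (our sketch) that weighs estimated cost against
# estimated coverage and source calls when comparing candidate plans.
# All numbers and weights are illustrative assumptions.
candidate_plans = [
    {"name": "plan-A", "est_cost_s": 4.0, "est_coverage": 0.9, "calls": 6},
    {"name": "plan-B", "est_cost_s": 1.5, "est_coverage": 0.6, "calls": 2},
]

def score(plan, w_cost=0.4, w_cov=0.5, w_calls=0.1):
    # Lower cost and fewer source calls are better; higher coverage is better.
    return (w_cov * plan["est_coverage"]
            - w_cost * plan["est_cost_s"] / 10.0
            - w_calls * plan["calls"] / 10.0)

best = max(candidate_plans, key=score)
print(best["name"], round(score(best), 3))
```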

2.5 Data Extraction

Typically, in schema-based systems (e.g., RDBMS), the description of data (or meta-data) is available, the query-language syntax is known, and the type and format of results are well-defined; hence they can be retrieved programmatically (e.g., via an ODBC/JDBC connection to a database). However, in the case of web repositories, although a page can be retrieved based on a URL (or by filling forms in the case of the hidden web), or through a standard or non-standard web-service, the output structure of the data is neither pre-determined nor remains the same over extended periods of time. The extracted information needs to be parsed as HTML or XML data types (using the meta-data of the page) and interpreted.
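
For illustration, the following minimal wrapper sketch (not from any surveyed system) shows the kind of low-level extraction involved: pulling rows out of an HTML table so that they can be interpreted as tuples. A real wrapper would first fetch the page (e.g., with an HTTP client) and would be far more robust to layout changes.

```python
# A minimal, illustrative wrapper: it parses an HTML fragment and collects
# <table> rows as tuples, the kind of extraction a hand-written wrapper
# performs before any integration can happen.
from html.parser import HTMLParser

class TableWrapper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(tuple(self._row))
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and data.strip():
            self._row.append(data.strip())

def extract_rows(html_text):
    wrapper = TableWrapper()
    wrapper.feed(html_text)
    return wrapper.rows

# Example on a hard-coded snippet; a real wrapper would fetch the page first.
print(extract_rows("<table><tr><td>Windsor Castle</td><td>Windsor</td></tr></table>"))
```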

Currently, wrappers [6] are typically employed by most frameworks for the extraction of heterogeneous data. However, as the number of data sources on the web and the diversity in their representation formats continue to grow at a rapid rate, manual construction of wrappers proves to be an expensive task. There is a pressing need for automation tools that can design, develop, and maintain wrappers effectively. Even though a number of integration systems have focused on automated wrapper generation (Ariadne's Stalker [28], MetaQuerier [27], TSIMMIS [29], InfoMaster [30], and Tukwila [21]), since the domains (and the corresponding sources) embedded within these systems are known and predefined, the task of generating automated wrappers using mining and learning techniques is simplified to a large extent. There also exist several independent tools based on solid formal foundations that focus on low-level data extraction from autonomous sources, such as Lixto [31], Stalker [28], etc. In the case of spatial data integration (e.g., the eMerges system [32]), ontologies and semantic web-services are defined for integrating spatial objects, in addition to wrappers and mediators. Heracles [33] (part of TerraWorld and derived from the concepts in Ariadne) combines online and geo-spatial data in a single integrated framework for assisting travel arrangement and integrating world events in a common interface. A Storage Resource Broker was proposed in the LTER spatial data workbench [34] to organize data and services for handling distributed datasets.

Information Manifold [14] claimed that the problem of wrapping semi-structured sources would become irrelevant as XML would eliminate the need for wrapper construction tools. This is an optimistic yet unrealistic assumption, since some problems in querying semi-structured data will not disappear, for several reasons: 1) some data applications may not want to actively share their data with anyone who can access their web-page, 2) legacy web applications will continue to exist for many years to come, and 3) within individual domains, XML will greatly simplify access to sources; however, across diverse domains, it is highly unlikely that an agreement on the granularity for modeling the information will be established.

2.6 Data Integration

The most important challenge in the entire integration process involves fusion of the data extracted from multiple repositories. Since most of the existing frameworks are designed for a single domain or a set of predetermined domains, the integration task is simplified such that the data generated by different sources only needs to be "appended" and represented in a homogeneous format. Frameworks such as Havasu support the "one query on multiple sources in a single domain" format, in which the data fetched from multiple sources is checked for overlap, appended, and displayed in a homogeneous format to the user. Others, such as Ariadne, support the "multiple sub-queries on multiple sources in separate domains" format, an extension of the above in which the task of checking data overlap is done at the sub-query level. The non-overlapping results from each sub-query are then appended and displayed.

However, the problem of integration becomes more acute when the sub-queries, although belonging to distinct domains, are dependent on each other for generating a final result-set. For instance, in Query-1, although it is possible to extract data independently for "castles near London" and "train schedules to destinations within 2 hours from London", the final result-set that requires generating "castles that are near London and yet reachable in 2 hours by train" cannot be obtained by simply appending the results of the two sub-queries. For this query (and similarly complex ones), it becomes necessary to perform additional processing on the extracted data, based on the sub-query dependencies, before it can be integrated and displayed.
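
The sketch below illustrates this point on Query-1 with hypothetical sub-query results: the two result sets must be joined on the station attribute and filtered by the 2-hour condition, rather than simply appended.

```python
# A sketch (our illustration) of why dependent sub-queries need a join, not an
# append: results of the "castles" and "train schedules" sub-queries are
# combined on the station attribute and filtered by the 2-hour condition.
castles = [  # hypothetical output of the attractions sub-query
    {"castle": "Windsor Castle", "nearest_station": "Windsor & Eton Central"},
    {"castle": "Leeds Castle",   "nearest_station": "Bearsted"},
]
trains = [   # hypothetical output of the rail sub-query (minutes from London)
    {"destination": "Windsor & Eton Central", "duration_min": 55},
    {"destination": "Bearsted",               "duration_min": 70},
    {"destination": "Edinburgh Waverley",     "duration_min": 260},
]

answer = [
    {**c, "duration_min": t["duration_min"]}
    for c in castles
    for t in trains
    if c["nearest_station"] == t["destination"] and t["duration_min"] <= 120
]
print(answer)
```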

2.7 Other Challenges

In addition to the above challenges, there exist a number of issues that will prove to be significant as integration frameworks move from prototype designs to large-scale commercial systems.

Ranking Integrated Results: Users should be able to access available information; however, this information should be presented in a structured and easy-to-digest format. Returning hundreds or thousands of information snippets will not help the user make sense of the information. An interesting option would be to apply a rank on the final integrated results and provide only a percentage (top-k) of the total answers generated [35].

However, unlike the domains of information retrieval [36] or even databases [37], the computation of rankings in information integration is more complex due to the autonomous nature of the sources, the lack of information about the quality of information from a source, the lack of information about the amount of information (the equivalent of cardinality) for a query on the source, and the lack of support for retrieving results in some order or based on some metrics. To the best of our knowledge, ranking has not been addressed explicitly in any of the major projects on information integration.
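
A toy top-k pass over integrated answers is sketched below; the per-source confidence values are assumed priors standing in for the missing quality and cardinality information, and the scoring function is purely illustrative.

```python
# A toy ranking pass (our sketch) over already-integrated answers. Because exact
# source cardinalities and quality are unknown, each answer is scored with a
# user metric weighted by a rough, assumed per-source confidence; only top-k is kept.
source_confidence = {"source-A": 0.9, "source-B": 0.6}   # assumed priors

integrated_answers = [
    {"castle": "Windsor Castle", "duration_min": 55,  "from": "source-A"},
    {"castle": "Leeds Castle",   "duration_min": 70,  "from": "source-B"},
    {"castle": "Dover Castle",   "duration_min": 110, "from": "source-B"},
]

def rank_score(answer):
    # User metric: prefer shorter journeys; weight by source confidence.
    return source_confidence[answer["from"]] * (1.0 / answer["duration_min"])

def top_k(answers, k=2):
    return sorted(answers, key=rank_score, reverse=True)[:k]

print([a["castle"] for a in top_k(integrated_answers)])
```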

Decentralized Data Sharing: Current data integration systems employ a centralized mediation approach [38] for answering user queries that access multiple sources. A centralized schema accepts user queries and reformulates them over the schemas of the different sources. However, the design, construction, and maintenance of such a mediated schema is often hard to agree upon. For instance, data sources providing castle information and train schedules are independent, belong to separate domains, and are governed by separate companies. To expect these data sources to be under the control of a single mediator is an unrealistic assumption.

Naming Inconsistencies: Entities (such as places, countries, companies, ...) are always consistent within a single data source. However, across heterogeneous sources, the same entity might be referred to by different names and in different contexts. To make sense of data that spans multiple sources, an integration system must be able to recognize and resolve these differences. For instance, in a query requiring access to sources providing air-travel information, one source may list Departure City and Arrival City as the two input locations for querying, whereas another source might use From and To as its query input locations. Even though these inputs indicate the same concept in the domain of travel, resolving this discrepancy in an integration environment is a difficult task.
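
The following sketch shows one simple way such naming inconsistencies could be resolved: a mapping table (assumed to be maintained in a knowledge base) that normalizes source-specific field names to the integration system's shared vocabulary.

```python
# A minimal sketch of resolving naming inconsistencies across sources: each
# source's input fields are mapped onto a shared vocabulary before queries are
# issued. The mapping table is an assumed knowledge-base artifact.
CANONICAL = {"departure city": "origin", "from": "origin",
             "arrival city": "destination", "to": "destination"}

def normalize_fields(source_fields):
    """Map source-specific field names onto the integration system's concepts."""
    return {CANONICAL.get(f.lower(), f.lower()): f for f in source_fields}

print(normalize_fields(["Departure City", "Arrival City"]))  # -> origin/destination
print(normalize_fields(["From", "To"]))                      # -> origin/destination
```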

Security and Privacy: Existing information integration systems extracting data from autonomous sources assume that the information in each source can be retrieved and shared without any security restrictions [39]. However, there is an increasing need for sharing information across autonomous entities in a manner such that no data apart from the answer to the query is revealed. There exist several intricate challenges in specifying and implementing processes for ensuring security and privacy measures before data from diverse sources can be integrated.

3 Dimensions for Integration

Existing information integration systems (elaborated in Section 5) tend to be designed along several dimensions, such as:

Goal of Integration: It indicates the overall goal of the integration framework. Some systems are portal-oriented (e.g., Whirl [8], Ariadne [40] [41], InfoMaster [42]) in that they aim to support an integrated browsing experience for the user. Others are more ambitious in that they take user queries and return the results of running those queries on the diverse yet appropriate sources (e.g., Havasu [9] [43]).

Data Representation: Refers to the design assumptions made by the integration system with regard to the syntactic nature of the data being exported by the sources. Some systems assume the existence of structured data models (e.g., SIMS [44], Havasu [9]). However, since most integration frameworks perform a web-based fusion of data, they assume the co-existence of semi-structured and unstructured data (e.g., Ariadne [40], TSIMMIS [45]).

Source Structure: Illustrates the assumptions made about the inter-relationship between sources. Most systems assume that the sources they are integrating are complementary (horizontal integration) in that they export different parts of the schema (e.g., Ariadne [41]). Others consider the possibility that sources may be overlapping (vertical integration) (e.g., Havasu [46], Tukwila [47]), in which case aggregation of information is required, as opposed to pure integration of information.


Domain and Source Dynamics: Refers to the extent to which the user has control over and/or is required to specify the particular sources that need to be used in answering the query. Some systems (e.g., Ariadne [33], BioKleisli [48]) require the user to select the appropriate sources to be used. Others (e.g., TSIMMIS [7], Havasu [43]) circumvent this problem by hard-wiring specific parts of the integrated schema to specific sources.

User Expertise: Indicates the type of users that the system is directed towards. Systems that primarily support browsing need to assume very rudimentary expertise on the part of users. In contrast, systems that support user queries need to assume some level of expertise on the user's part in formulating queries. Some systems might require user queries to be formulated in specific languages, while others might provide significant interactive support for users in formulating their query.

4 Approaches for Integration

Over the past two decades, various approaches have been suggested and adopted in the pursuit of achieving an ideal information integration system. In this section, we provide a brief description of the prominent approaches that form the basis for the design of existing integration frameworks:

Mediator: This is one of the most notable approaches, adopted by many integration frameworks (Ariadne [6], TSIMMIS [7], Havasu [9], ...). A mediator (in the information integration context) is a system responsible for reformulating user queries (formed on a single mediated schema) into queries on the local schemas of the underlying data sources. The sources contain the actual data, while the global schema provides a reconciled, integrated, and virtual view of the underlying sources. Modeling the relationship between the sources and the global schema is therefore a crucial aspect. The two distinct approaches for establishing the mapping between each source schema and the centralized global schema are: i) Global-as-view (GAV), which requires the global schema to be represented in terms of the underlying data sources, and ii) Local-as-view (LAV), which requires the global schema to be defined independently from the sources, with the relationships between them established by defining every source as a view over the global schema.
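
The contrast between the two mappings can be illustrated with a small sketch (textbook-style, not tied to any one system) for a hypothetical global relation Flight(origin, destination, fare) and two hypothetical sources.

```python
# GAV vs. LAV for a hypothetical global relation Flight(origin, destination, fare)
# and two hypothetical sources; both the sources and the rules are assumptions.

# GAV: the global relation is defined as a view over the sources, so a query over
# Flight is answered by unfolding this definition.
def flight_gav(expedia_rows, orbitz_rows):
    for origin, destination, fare in expedia_rows:
        yield (origin, destination, fare)
    for origin, destination, fare in orbitz_rows:
        yield (origin, destination, fare)

# LAV: each source is described as a view over the global schema; queries must be
# rewritten using these descriptions (shown here only as declarative metadata).
LAV_DESCRIPTIONS = {
    "ExpediaFlights": "ExpediaFlights(o, d, f) :- Flight(o, d, f), f < 1000",
    "OrbitzFares":    "OrbitzFares(o, d, f)    :- Flight(o, d, f)",
}

expedia = [("DFW", "VIE", 950)]
orbitz = [("DFW", "VIE", 1020)]
print(list(flight_gav(expedia, orbitz)))
```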

Warehousing: This approach [15] derives its basis from traditional data warehousing techniques. Data from heterogeneous distributed information sources is gathered, mapped to a common structure, and stored in a centralized location. Warehousing emphasizes data translation, as opposed to query translation in mediator-based integration [38]. In fact, warehousing requires that all the data loaded from the sources be converted through data mapping to a standard unique format before it is stored locally. In order to ensure that the information in the warehouse reflects the current contents of the individual sources, it is necessary to periodically update the warehouse. In the case of large information repositories, this is not feasible unless the individual information sources support mechanisms for detecting and retrieving changes in their contents. This is an inordinate expectation in the case of autonomous information sources spread across a number of heterogeneous domains.

Ontological: In the last decade, semantics (which are an important component of data integration) gained popularity, leading to the inception of the celebrated ontology-based integration approach [49]. The Semantic Web research community [50], [51], [52], [53] has focused extensively on the problem of semantic integration [54] and the use of ontologies for blending heterogeneous schemas across multiple domains. Their pioneering efforts have provided a new dimension for researchers to investigate the challenges in information integration. A number of frameworks designed using ontology-based integration approaches [49] have evolved in the past few years.

Federated: This approach [55] is developed on the premise that the information needed to answer a query is gathered directly from the data sources in response to the posted query. Hence, the results are up-to-date with respect to the contents of the data sources at the time the query is posted. More importantly, the database federation approach [56] lends itself to being more readily adapted to applications that require users to be able to impose their own ontologies on data from distributed autonomous information sources. The federated approach is preferred in scenarios where the data sources are autonomous (e.g., Indus [55]) and support for multiple ontologies is needed. However, this approach fails in situations where the querying frequency is much higher than the frequency of changes to the underlying sources.

Navigational: Also known as the link-based approach [57], it is based on the fact that an increasing number of sources on the web require users to manually browse through several web-pages and data sources in order to obtain the desired information. In fact, the major premise and motive justifying this approach is that some sources provide users with pages that would be difficult to access without point-and-click navigation (e.g., the hidden web [26]).

5 Current Integration Techniques and Frameworks

Currently, a number of techniques and frameworks have tried to address several challenges of heterogeneous data integration in a delimited context. In this section, we elaborate on some of the prominent frameworks.

Havasu: A multi-objective query processing framework comprising multiple functional modules, Havasu [9] addresses the challenges of imprecise query specification [58], query optimization [59], and source-statistics collection [60] in single-domain web integration. The AIMQ module provides a query-independent solution for efficiently handling imprecise user queries using a combination of source schema collection, attribute dependency mining [61], and source similarity mining [46]. Unlike traditional data integration systems that aim at minimizing the cost of query processing, Havasu's StatMiner [62] provides a guarantee on the cost as well as the coverage of the results generated for a given user query by approximating the coverage and overlap statistics of the data sources based on attribute-valued hierarchies. The Multi-R optimizer [9] module uses these source statistics for generating a multi-objective (cost and coverage) query optimization plan. The Havasu framework has been applied to the design of BibFinder [43], a publicly available computer science literature retriever that integrates several autonomous and partially overlapping bibliography sources.

MetaQuerier: It is made up of two distinct components that address the challenges in exploration and integration of deep-web sources [26]. In contrast to traditional approaches (e.g., Wise-Integrator [63]) that aim at integrating web data sources based on the assumption that query interfaces can be extracted perfectly, MetaQuerier tries to perform source integration by extracting query interfaces from raw HTML pages with subsequent schema matching. Hence, in essence, MetaQuerier strives to achieve data mining for information integration [64], i.e., it mines the observable information to discover the underlying semantics of web data sources. The MetaExplorer [27] is responsible for dynamic source discovery [65] and on-the-fly integration [66] for the discovery, modeling, and structuring of web databases, to build a searchable source repository. On the other hand, the MetaIntegrator [67] focuses on the issues of on-line source integration such as source selection, query mediation, and schema integration. In contrast to traditional integration systems, MetaIntegrator is dynamic, i.e., new sources may be added as and when they are discovered.

Ariadne: It extended the information integration approach adopted by the SIMS mediator architecture [68] to include semi-structured and unstructured data sources (e.g., web data) instead of simple databases, by using specially designed wrappers [28]. These wrappers, built around individual web sources, allowed querying for data in a database-like manner (for example, using SQL). The wrappers were generated semi-automatically using a machine learning approach [69]. Ariadne also constructed an independent domain model [70] for each application that integrates the information from the sources and provides a single terminology for querying. This model was represented using the LOOM knowledge representation system [19]. The SIMS query planner did not scale effectively when the number of sources increased beyond a certain threshold. Ariadne solved this problem by embracing an approach [71] that is capable of efficiently constructing large query plans, by pre-compiling part of the integration model and using a local search method for query planning.

Since its inception, Ariadne has been divided into a number of individual projects, such as Apollo [72], Prometheus [73], and Mercury [74], for addressing separate issues and challenges in heterogeneous information integration. The early applications of Ariadne included a Country Information Agent [40], which integrated data related to countries from a variety of information sources. Ariadne was also used to build TheaterLoc [41], a system that integrated data about restaurants and movie theaters and placed the information on a map, providing efficient access to naive users. After this split, three individual and diverse application projects have been developed under the Ariadne project: Heracles [33] (an interactive data-driven constraint-based integration system), TerraWorld [11] (a geospatial data integration system), and Poseidon [75] (which focuses on the composition, optimization, and execution of query plans for bioinformatics web-services).

Information Manifold: It focused on efficient query processing [14] by accessing sources that are capable of providing an appropriate answer to the queries posed by users. In order to facilitate efficient user query formulation, as well as to represent background knowledge of the mediated schema relations (designed over several autonomous sources), Information Manifold proposed an expressive language, CARIN [76], modeled using a combination of Datalog (a database query language) [77] and Description Logics (a knowledge representation language) [78]. Information Manifold also proposed algorithms [79] for translating user queries, posed on the mediated schema of data sources, into a format that maps to the actual relations within the data sources. The query-answering approach adopted by Manifold ensured that only the relevant set of data sources is accessed when answering a particular user query. This approach differed from the one adopted by Ariadne, which uses a general-purpose planner for query translation. In addition, Information Manifold proposed a method [80] for representing local-source completeness and an algorithm [81] for exploiting source information in query processing. This is an important feature for integration systems since, in most scenarios, data sources may be incomplete for the domain they are covering. Furthermore, Information Manifold suggested the use of probabilistic reasoning [82] for the ordering of data sources that appear relevant to answer a given query. Such an ordering depends on the overlap between the sources and the query, and on the coverage of the sources.

TSIMMIS: It addressed the challenges in integrating heterogeneous data extracted from structured as well as unstructured sources [7]. TSIMMIS adopted a schema-less approach (i.e., no single global database or schema contained all the information needed for integration) for retrieving information from dynamic sources (i.e., when source contents changed frequently). Each information source was covered by a wrapper that logically converted the underlying data objects to a common information model. This logical translation was done by converting queries into requests, based on the information in the model, that the source could execute, and converting the data returned by the source into the common model. TSIMMIS adopted an Object Exchange Model (OEM) [83], a self-describing (tagged) object model in which objects were identified by labels, types, values, and an optional identifier. These objects were requested with the aid of OEM-QL [84], a query language specifically designed for OEM. The mediators [83] were software modules that refined information from one or more sources [85] and accepted OEM-QL queries as inputs to generate OEM objects. This approach allowed the mediators to access new sources transparently in order to process and refine relevant information efficiently. End users accessed information either by writing applications that requested OEM objects or by using generic browsing tools supported by TSIMMIS, such as MOBIE (MOsaic Based Information Explorer) [7]. The query was expressed as an interactive World Wide Web page or in a menu-selected format. The answer was received as a hypertext document.
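
As a concrete illustration of the model just described, the sketch below shows what a self-describing OEM-style object might look like; the structure follows the label/type/value/identifier description above, while the actual values are our own assumptions.

```python
# An illustrative OEM-style object (values are assumptions): every component is
# tagged with a label and a type, so no external schema is needed to interpret it.
oem_object = {
    "oid": "obj-042",                      # optional identifier
    "label": "conference",
    "type": "set",
    "value": [
        {"label": "name",     "type": "string",  "value": "VLDB 2007"},
        {"label": "location", "type": "string",  "value": "Vienna"},
        {"label": "year",     "type": "integer", "value": 2007},
    ],
}

def labels(obj):
    """Walk a (possibly nested) OEM-style object and collect its labels."""
    result = [obj["label"]]
    if obj["type"] == "set":
        for child in obj["value"]:
            result.extend(labels(child))
    return result

print(labels(oem_object))
```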

Garlic [86], a sister project of TSIMMIS, was developed at IBM to enable large-scale integration of multimedia information. It was applied in the late 1990s for multimedia data integration in the fields of medicine, home applications, and business agencies.

InfoMaster: InfoMaster [42] was a framework that provided integrated access to structured information sources. Its architecture consisted of a query facilitator, a knowledge base, and translation rules. The facilitator analyzed the sources containing relevant information needed to answer user queries, generated query plans to access these sources, and mapped these query plans to the sources using a self-generated wrapper [87]. The knowledge base was responsible for the storage of all the rules and constraints [20] required to describe heterogeneous data sources and their relationships with each other. The translation rules described how each distinct source could relate to the reference schema. InfoMaster was developed and deployed at Stanford University for searching housing rentals in the San Francisco Bay Area and for scheduling rooms at Stanford's Housing Department [88]. It was also the basis for the Stanford Information Network (SIN) project [89], which integrated numerous structured information sources on the Stanford campus. InfoMaster was commercialized by Epistemics in 1997.

Tukwila: An adaptive integration framework, Tukwila [47] tackles the challenges encountered during the integration of autonomous sources on the web, namely source statistics, data arrival, and source overlap and redundancy. The major contributions of the Tukwila framework are interleaved query planning and execution, adaptive operators, and a data source catalog. Query processing in Tukwila does not require the creation of a complete query execution plan before the query evaluation step [21]. If the query optimizer concludes that it does not have enough meta-data with which to reliably compare candidate query execution plans, it chooses to send only a partial plan to the execution engine and takes further action only after the partial plan has been completed. Tukwila incorporates operators that are well suited for adaptive execution and for minimizing the time required to obtain the first answer to a query. In addition, the Tukwila execution engine includes a collector operator [21] that efficiently integrates data from a large set of possibly overlapping or redundant sources. The metadata obtained from several sources is stored in a single data source catalog. The metadata holds different types of information about the data sources, such as semantic descriptions of the contents of the data sources, overlap information about pairs of data sources, and key statistics about the data, such as the cost of accessing each source, the sizes of the relations in the sources, and selectivity information.

Whirl: The primary contribution of Whirl has been the Whirl Query Language [8]. This language, based on concepts from Database Systems (the syntax is based on soft semantics [90] and is very similar to SQL), Information Retrieval (it uses an inverted-indexing technique for generating top-k answers), and Artificial Intelligence (it uses a combination of A* search [91]), has been used for incorporating single-domain queries across structured and semi-structured static sources. The Whirl framework was applied to the design of two applications: i) Children's Computer Games, which integrated information from 16 similar web-sites, and ii) North American Birds, which contained a set of integrated databases with a large number of tuples, the majority of which pointed to external web-pages.

Ontology-based Integration Systems: Semantics play an important role during integration of heterogeneous data. In order to achieve semantic interoperability, the meaning of information that is interchanged across the system has to be understood. The use of ontologies for the extraction of implicit and hidden knowledge has been considered as a possible technique to overcome the problem of semantic heterogeneity by a number of frameworks such as KRAFT [22], SIMS [44], OntoBroker [24], and InfoSleuth [92]. Other ontology-based systems, such as PICSEL [93], OBSERVER [94], and BUSTER [25], did not propose any methods for the creation of ontologies. The literature associated with these systems makes it clear that there is a lack of a real methodology for ontology-based development.

GeoSpatial Integration Systems: Vast amounts of data available from the web contain spatial information, explicitly or implicitly. Implicit spatial information includes the location of an event. For example, many news readers today use the GeoRSS format, which extends the RSS XML feed format to include location tags. Explicit spatial objects, such as road networks and disease distributions over different parts of the earth, are commonly provided by different organizations in different data formats through different interfaces. To this end, several systems have been proposed to address the issues of geospatial integration.

Heracles [33] has tried to combine online and geospatial data in a single integrated framework for assisting travel arrangement and integrating world events into a common space. The spatial mediated approaches (systems derived using the concepts from Ariadne) [95] combine spatial information using wrappers and mediators. A Storage Resource Broker was used in the LTER spatial data workbench [34] to organize data and services to manage distributed datasets as a single collection. Query evaluation plan generation is addressed within a spatial mediation context in [96]. As generic models, XML/GML and web services are also widely used in other spatial integration systems [97], [98], [99], [100], [101]. The eMerges system [32] further defines ontologies for spatial objects and uses semantic web services for integration.

Another line of work on spatial data integration [102] focused on studying the conflation of raster images and vector data. The process involves the discovery of control point pairs, which are a set of conjugate point pairs between two datasets. An extension integrating satellite images with vector maps was investigated in [103]. A framework to combine spatial information based on a mediator that takes metadata on the information needs of the user was proposed in [95]. This infrastructure uses the view agent approach as the method of communication and is validated in mobile data settings. A grid-based architecture that utilizes a service-oriented view to unify various geographic resources was briefly introduced in [97].

Biological Integration Systems: In recent years, integration of heterogeneous biological and genomic data sources [104] has gained immense popularity. A number of prominent systems such as Indus [55], SRS [105], K2/BioKleisli [48], TAMBIS [106], DiscoveryLink [107], BioHavasu [104], Entrez [108], and BioSift Radia [104] have been designed for resolving some of the intricate challenges in integrating assorted biological data.

Figure 1: InfoMosaic Architecture

6 The InfoMosaic Approach

6.1 InfoMosaic Architecture and Dataflow

The architecture of our InfoMosaic framework is shown in Figure 1. The user query is accepted in an intuitive manner using domain names and keywords of interest, and elaborated/refined using domain knowledge in the form of taxonomies and dictionaries (synonyms, etc.). Requests are refined and presented to the user for feedback (Query Refinement module). User feedback is accumulated and used for elaborating/disambiguating future queries. Once the query is finalized, it is represented in a canonical form (e.g., query graphs) and transformed into a query plan using a two-phase process: i) generation of logical plans using domain characteristics, and ii) generation of physical plans using source semantics. The plan is further optimized by applying several metrics (Query Planner and Optimizer module). The Query Execution and Data Extraction module generates the actual source queries that are used by the extractor to retrieve the requisite data. It also determines whether a previously retrieved answer can be reused by checking the data repositories (XML and PostGIS) for cached data. We maintain an XML Repository that stores extracted results from each source in a system-determined format, and a separate PostGIS Repository for storing spatial data extracted from sources. The Data Integrator formulates XQueries (with external functions for handling the spatial component) on these repositories to compute the final answers and format them for the user. Ranking is applied at different stages (sub-query execution phase, extraction phase, or integration phase) depending on the user-ranking metrics, the selected sources, and the corresponding query plan. The Knowledge-base (broadly consisting of domain knowledge and source semantics) blends all the pieces together in terms of the information used by various modules. The adaptive capability of the system is based on the ability of the InfoMosaic components to update these knowledge-bases at runtime.
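
A minimal sketch of this dataflow is given below; the module and repository names mirror Figure 1, but every class, method, and signature is a placeholder invented for illustration rather than actual InfoMosaic code.

    # Schematic of the InfoMosaic dataflow; every class and method below is a
    # placeholder standing in for a module from Figure 1, not actual system code.
    class InfoMosaicPipeline:
        def __init__(self, refiner, planner, executor, integrator, ranker, knowledge_base):
            self.refiner = refiner          # Query Refinement module
            self.planner = planner          # Query Planner and Optimizer
            self.executor = executor        # Query Execution and Data Extraction
            self.integrator = integrator    # Data Integrator (XQuery over XML/PostGIS)
            self.ranker = ranker            # ranking applied at one or more stages
            self.kb = knowledge_base        # domain knowledge + source semantics

        def answer(self, user_request):
            # 1. Elaborate/disambiguate the request using taxonomies and dictionaries,
            #    collecting user feedback for future queries.
            query = self.refiner.refine(user_request, self.kb)
            # 2. Two-phase planning: a logical plan from domain characteristics and a
            #    physical plan from source semantics, followed by optimization.
            logical_plan = self.planner.logical(query, self.kb)
            physical_plan = self.planner.physical(logical_plan, self.kb)
            # 3. Execute source queries, reusing cached results from the XML and
            #    PostGIS repositories whenever possible.
            sub_results = self.executor.run(physical_plan)
            # 4. Integrate sub-query results, then rank and format the final answers.
            answers = self.integrator.combine(sub_results, physical_plan)
            return self.ranker.rank(answers, query.ranking_metrics)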

6.2 Challenges Addressed by InfoMosaic

Capturing and Mapping Imprecise Intent into Precise Query: We address the query specification challenge in a multi-domain environment by combining and enhancing techniques from natural language processing, database query specification, and information retrieval to incorporate the following characteristics: i) specification of soft semantics instead of hard queries, ii) ability to accept a minimal specification and refine it to meet user intent, collecting feedback for future usage in the process, iii) support for queries that include spatial, temporal, spatio-temporal, and cost-based conditions in addition to regular query conditions, iv) acceptance of optional ranking metrics based on user-specified criteria, and v) support for query approximation and query relaxation so that approximate answers can be retrieved instead of exact answers.

Moreover, instead of designing a new language that supports all the query conditions, we are currently extending the capabilities of SQL to incorporate soft semantics and conditions based on domains rather than sources. In particular, we are trying to enhance the semantics of SQL-based spatial query languages for easy specification of spatial relations, including metric, topological, and directional relationships pertaining to heterogeneous datasets from the web.
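
To make the three kinds of spatial relations concrete, the sketch below evaluates a metric, a topological, and a directional predicate between two hypothetical web-extracted objects using the open-source shapely library; it only illustrates the predicates such an extended SQL would need to express, not our query language itself.

    from shapely.geometry import Point, Polygon

    # Hypothetical objects extracted from two different web sources (lon, lat).
    hotel = Point(-97.11, 32.73)
    park = Polygon([(-97.13, 32.72), (-97.10, 32.72), (-97.10, 32.75), (-97.13, 32.75)])

    # Metric relation: distance from the hotel to the park's centroid.
    within_half_degree = hotel.distance(park.centroid) < 0.5

    # Topological relation: is the hotel located inside the park polygon?
    inside_park = hotel.within(park)

    # Directional relation: a simple "north of" test on the latitude component.
    north_of_park = hotel.y > park.centroid.y

    print(within_half_degree, inside_park, north_of_park)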

Plan Generation and Optimization: We view this challenge as an intelligent query optimization problem involving two stages: logical and physical. In the logical phase, we identify the individual domain sub-queries and how they come together as a larger query by using appropriate domain knowledge. In the physical phase, various source semantics and characteristics are used to generate effective plans for each individual sub-query. In addition, we are also investigating query optimization techniques for handling spatial, temporal, and spatio-temporal conditions.
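
A toy rendering of this two-phase decomposition is sketched below; the domains, source catalog, cost figures, and dictionary-based query representation are invented purely for illustration.

    # Phase 1 (logical): split a multi-domain request into per-domain sub-queries
    # plus the cross-domain join conditions. All names and data here are illustrative.
    def logical_plan(query):
        return {"sub_queries": dict(query["conditions"]), "joins": query["joins"]}

    # Phase 2 (physical): bind each sub-query to a concrete source, picking the
    # cheapest candidate according to statistics kept in the knowledge base.
    def physical_plan(lplan, source_stats):
        bindings = {}
        for domain, sub_query in lplan["sub_queries"].items():
            candidates = source_stats[domain]              # {source: estimated cost}
            bindings[domain] = (min(candidates, key=candidates.get), sub_query)
        return {"bindings": bindings, "joins": lplan["joins"]}

    # A hypothetical tourist query touching the hotel and event domains.
    query = {
        "conditions": {"hotels": {"city": "Arlington", "max_price": 120},
                       "events": {"city": "Arlington", "month": "May"}},
        "joins": [("hotels.city", "events.city")],
    }
    stats = {"hotels": {"hotel_site_a": 1.2, "hotel_site_b": 0.7},
             "events": {"event_site_x": 0.9}}
    print(physical_plan(logical_plan(query), stats))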

Data Extraction: We use Lixto [31], a powerful data extraction engine, for programmatically extracting portions of an HTML page (based on the need) and converting the result into a specific format. It is based on monadic query languages over trees (based on monadic second-order logic), and automatically generates Elog [31] (a variant of Datalog) programs for data extraction. For handling extraction of spatial data (which is larger in size and hence difficult to extract in a short time), we are planning to use a combination of: i) building a local spatial data repository by dynamically downloading related spatial files (using data clearing houses such as Map Bureau, etc.) for data that is relatively static, and ii) querying spatial web-services for fetching data that tends to change on a more frequent basis.
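
The split between a pre-populated local repository for relatively static spatial layers and on-demand web-service calls for volatile data could be organized roughly as follows (the volatility flag, cache layout, age threshold, and callback functions are assumptions made for this sketch, not the actual implementation):

    import os
    import time

    CACHE_DIR = "spatial_cache"          # local repository of downloaded spatial files
    MAX_AGE_SECONDS = 30 * 24 * 3600     # treat cached layers older than ~30 days as stale

    def fetch_spatial_layer(layer_name, is_volatile, download_fn, service_fn):
        """Return a spatial layer, preferring the local repository for static data
        and a spatial web service for frequently changing data."""
        if is_volatile:
            # Volatile data (e.g., live traffic or event locations): query the service.
            return service_fn(layer_name)
        path = os.path.join(CACHE_DIR, layer_name + ".gml")
        fresh = os.path.exists(path) and (time.time() - os.path.getmtime(path)) < MAX_AGE_SECONDS
        if not fresh:
            # Static data (e.g., county boundaries): refresh the local copy from a
            # clearing house once, then reuse it for subsequent queries.
            os.makedirs(CACHE_DIR, exist_ok=True)
            with open(path, "w") as f:
                f.write(download_fn(layer_name))
        with open(path) as f:
            return f.read()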

Data Integration: Extraction of spatial and non-spatial data is done with respect to separate PostGIS and XML repositories, respectively. We then generate and execute queries (XQuery for XML, and spatial queries for PostGIS whose results are converted to GML for further processing) to integrate this extracted and processed data. The generation of these queries is based on the DTD (generated from the logical query plan) of the stored sub-query results and the attributes that need to be joined/combined from different sources. The join can be an arbitrary join (not necessarily equality) on multiple attributes. Our approach involves generating XQueries for each sub-query and combining them into a larger query using FLWOR expressions. It is possible that the results of some sub-queries are already integrated during the execution and extraction phase. This information, based on the physical query plan, is taken into consideration for generating the required XQuery.
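
For instance, a join between cached hotel results (XML repository) and park results (PostGIS output converted to GML) on a shared city attribute could be assembled into a single FLWOR expression along the following lines; the document names, element names, and join attribute are hypothetical, and the sketch emits a plain equality join rather than the arbitrary joins mentioned above.

    def build_join_xquery(left_doc, right_doc, left_elem, right_elem, join_attr):
        """Compose an XQuery FLWOR expression joining two stored sub-query results;
        for simplicity this emits an equality join on a single attribute."""
        return (
            f'for $l in doc("{left_doc}")//{left_elem},\n'
            f'    $r in doc("{right_doc}")//{right_elem}\n'
            f'where $l/{join_attr} = $r/{join_attr}\n'
            f'return <match>{{ $l, $r }}</match>'
        )

    # Hypothetical sub-query results cached by the extractor.
    print(build_join_xquery("hotels.xml", "parks.gml", "hotel", "park", "city"))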

Result Ranking: We are currently addressing this challenge [35] by investigating the application of ranking at different stages in the integration process (i.e., at the sub-query execution phase, before the integration phase, after the integration phase, etc.).

Representation of Domain Knowledge and Source Semantics: We address this challenge by adopting a global taxonomy (that models all the heterogeneous domains across which user queries might be posed), and a domain taxonomy (that models all the sources belonging to a domain and orders them based on distinct criteria specified by the integration system). The construction of such a multi-level ontology is done by combining and enhancing the extensive work carried out in the areas of knowledge representation, domain knowledge aggregation, deep-web exploration, and statistics collection. The earlier work on databases (use of equivalences and statistics in centralized databases, use of source schemas for obtaining a global schema) and recent work on information integration (as elaborated earlier) provide adequate reasons to believe that this can be extended to multi-domain queries and computations that include spatial and temporal constraints, which is the approach adopted in our InfoMosaic framework.
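
A deliberately simplified rendering of this two-level organization, with invented domains, sources, synonym lists, and ordering criteria, is shown below:

    # Global taxonomy: the domains the system knows about, each pointing to a
    # domain taxonomy that orders its sources by system-specified criteria.
    # All domain/source names and scores below are illustrative placeholders.
    global_taxonomy = {
        "travel": {
            "synonyms": ["trip", "tourism"],
            "sources": [
                {"name": "hotel_site_a", "coverage": 0.8, "freshness": 0.9},
                {"name": "hotel_site_b", "coverage": 0.6, "freshness": 0.7},
            ],
        },
        "geography": {
            "synonyms": ["maps", "spatial"],
            "sources": [
                {"name": "county_gis_portal", "coverage": 0.95, "freshness": 0.4},
            ],
        },
    }

    def ranked_sources(domain, criterion="coverage"):
        """Order a domain's sources by one of the criteria kept in its domain taxonomy."""
        return sorted(global_taxonomy[domain]["sources"],
                      key=lambda s: s[criterion], reverse=True)

    print([s["name"] for s in ranked_sources("travel")])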

6.3 Novelty of Our Approach

As seen in Figure 2, although several challenges in the information integration problem have been addressed in a delimited context by a number of projects, a large number of challenges still need to be tackled in the context of heterogeneous data integration on the Web. To the best of our knowledge, we are the first to address multi-domain information extraction and integration of results in conjunction with spatial and temporal data, which is intended to push the state-of-the-art in functionality. We believe that it is important to establish the feasibility of the functionality before addressing performance and scalability issues. Some of the novel aspects of our approach are:

– We are formulating the problem of multi-domain information integration as an intelligent query processing and optimization problem with some fundamental differences from conventional ones. InfoMosaic considers many additional statistics, semantics, domain & source knowledge, equivalences, and inferencing for plan generation and optimization. We will extend conventional optimization techniques to do this by building upon techniques from databases, deductive databases, taxonomies, semantic information, and inferencing where appropriate. The thrust is to develop new techniques as well as to identify and use existing knowledge.

– We plan on choosing a few communities (e.g., tourists, real-estate agents, museum visitors, etc.), each needing information from several domains, and will address the problem in a real-world context. The crux of the problem here is to clearly identify the information needed (from sources, ontologies, statistics, QoS, etc.) along with the rules and inferencing techniques, and to develop algorithms and techniques for their usage. We will address the problem using actual domains and web sources rather than making assumptions on data sources, using artificial sources (as in Havasu [9]), or using a small number of pre-determined sources (as in Infomaster [42] or Ariadne [6]); however, our techniques will build upon and extend current approaches.

– Incorporating spatial and temporal data is unique to our proposal as, to the best of our knowledge, this has not been addressed in the literature on information integration.2

2 Retrieval of images and location data has been attempted. Currently, mashing up locations on a map in Web 2.0 is very popular, as evidenced by various projects; but combining a map with information from multiple domains is still the province of custom systems.


Figure 2: Integration Frameworks and Challenges Addressed

– Extensibility of the system and the ability to incrementally add functionality will be a key aspect of our approach. That is, if we identify the information and techniques for representative communities, it should be possible to add other communities and domains without major modifications to the framework and modules. This is similar to the approach taken for DBMS extensibility (by adding blades, cartridges, and extenders).

– We believe that in order for this system to be acceptable, user input should be intuitive (if not in natural language). We intend to develop a feedback-centric user input mechanism that can compete with the simplicity of a keyword-based search request.

– Adaptability and learning from feedback and actions taken by the system will be central to the whole project. The entire knowledge base of various types of information will be updated to improve the system (in terms of accuracy, coverage, information content, etc.) on a continuous basis.

7 Conclusion

As this survey shows, the research community has witnessed significant progress on many aspects of data integration over the past two decades. As elaborated in Sections 3, 4 and 5, flexible architectures for data integration, powerful methods for mediating between disparate data sources, tools for rapid wrapping of data sources, and methods for optimizing queries across multiple data sources are a few of the important advances achieved towards solving the problem of information integration.



However, extensive work is needed on the higher levels of the system, including managing semantic heterogeneity in a more scalable fashion, the use of domain knowledge in various parts of the system, transforming these systems from query-only tools to more active data-sharing scenarios, and easy management of data integration systems.

Furthermore, the emergence of web-databases and related technologies has completely changed the landscape of the problem of information integration. First, web-databases provide access to many valuable structured data sources at a scale not seen before, and the standards underlying web services greatly facilitate sharing of data among corporations. Instead of being an option, data integration has become a necessity. Second, business practices are changing to rely on information integration: in order to stay competitive, corporations must employ tools for business intelligence, and those, in turn, must glean data from multiple sources. Third, recent events have underscored the need for data sharing among government agencies, and the life sciences have reached the point where data sharing is crucial in order to make sustained progress. Fourth, personal information management (PIM) is starting to receive significant attention from both the research community and the commercial world. A significant key to effective PIM is the ability to integrate data from multiple sources.

Finally, we believe that information integration is an inherently hard problem that cannot be solved by just a few years of research. In terms of research style, the development of benchmarks, theoretical research on the foundations of information integration, as well as system and toolkit building, is needed. Further progress in this area will be significantly accelerated by combining expertise from the areas of Database Systems, Artificial Intelligence, and Information Retrieval.


Bibliography

[1] S. Brin and L. Page, “The Anatomy of a Large-Scale Hypertextual Web Search Engine.” Computer Networks, vol. 30, no. 1-7, pp. 107–117, 1998.

[2] M. Sahami, V. O. Mittal, S. Baluja, and H. A. Rowley, “The Happy Searcher: Challenges in Web Information Retrieval.” in PRICAI, 2004, pp. 3–12.

[3] S. M. zu Eissen and B. Stein, “Analysis of Clustering Algorithms for Web-Based Search.” in PAKM, 2002, pp. 168–178.

[4] B. Katz, J. J. Lin, and D. Quan, “Natural Language Annotations for the Semantic Web.” in CoopIS/DOA/ODBASE, 2002, pp. 1317–1331.

[5] M. Lesk, D. R. Cutting, J. O. Pedersen, T. Noreault, and M. B. Koll, “Real Life Information Retrieval: Commercial Search Engines (Panel).” in SIGIR, 1997, p. 333.

[6] C. A. Knoblock, “Planning, Executing, Sensing, and Replanning for Information Gathering.” in IJCAI, 1995, pp. 1686–1693.

[7] S. S. Chawathe, H. Garcia-Molina, J. Hammer, K. Ireland, Y. Papakonstantinou, J. D. Ullman, and J. Widom, “The TSIMMIS Project: Integration of Heterogeneous Information Sources.” in IPSJ, 1994, pp. 7–18.

[8] W. W. Cohen, “Integration of Heterogeneous Databases Without Common Domains Using Queries Based on Textual Similarity.” in SIGMOD Conference, 1998, pp. 201–212.

[9] S. Kambhampati, U. Nambiar, Z. Nie, and S. Vaddi, “Havasu: A Multi-Objective, Adaptive Query Processing Framework for Web Data Integration.” Arizona State University, Tech. Rep., 2002.

[10] V. Subrahmanian, “A heterogeneous reasoning and mediator system,” http://www.cs.umd.edu/projects/hermes/.

[11] M. Michalowski and C. A. Knoblock, “A Constraint Satisfaction Approach to Geospatial Reasoning.” in AAAI, 2005, pp. 423–429.

[12] L. Liu, C. Pu, and Y. Lee, “An Adaptive Approach to Query Mediation Across Heterogeneous Information Sources.” in CoopIS, 1996, pp. 144–156.

[13] J.-O. Lee and D.-K. Baik, “SemQL: A Semantic Query Language for Multidatabase Systems.” in CIKM, 1999, pp. 259–266.

[14] A. Y. Levy, “Information Manifold Approach to Data Integration.” IEEE Intelligent Systems, pp. 1312–1316, 1998.

[15] D. Florescu, A. Levy, and A. Mendelzon, “Database techniques for the world-wide web: A survey.” SIGMOD Record, vol. 27, no. 3, pp. 59–74, 1998.

[16] A. Motro, “VAGUE: A User Interface to Relational Databases that Permits Vague Queries.” ACM Trans. Information Systems, vol. 6, no. 3, pp. 187–214, 1988.

[17] J. Fan, H. Khatri, Y. Chen, and S. Kambhampati, “QPIAD: Query processing over Incomplete Autonomous Databases.” Arizona State University, Tech. Rep., 2006.


[18] Z. Zhang, B. He, and K. C.-C. Chang, “Light-weight Domain-based Form Assistant: Querying Web Databases On the Fly.” in VLDB, 2005, pp. 197–208.

[19] R. M. MacGregor, “Inside the LOOM Description Classifier.” SIGART Bulletin, vol. 2, no. 3, pp. 88–92, 1991.

[20] O. M. Duschka and M. R. Genesereth, “Infomaster: An Information Integration Tool.” in Intelligent Information Integration, 1997.

[21] Z. G. Ives, D. Florescu, M. Friedman, A. Levy, and D. S. Weld, “Adaptive Query Processing for Internet Applications.” in IEEE Computer Society Technical Committee on Data Engineering, 1999, pp. 19–26.

[22] P. M. D. Gray, A. D. Preece, N. J. Fiddian, W. A. Gray, T. J. M. Bench-Capon, M. J. R. Shave, N. Azarmi, M. Wiegand, M. Ashwell, M. D. Beer, Z. Cui, B. M. Diaz, S. M. Embury, K.-Y. Hui, A. C. Jones, D. M. Jones, G. J. L. Kemp, E. W. Lawson, K. Lunn, P. Marti, J. Shao, and P. R. S. Visser, “KRAFT: Knowledge Fusion from Distributed Databases and Knowledge Bases.” in DEXA Workshop, 1997, pp. 682–691.

[23] C.-N. Hsu and C. A. Knoblock, “Reformulating Query Plans for Multidatabase Systems.” in CIKM, 1993, pp. 423–432.

[24] S. Decker, M. Erdmann, D. Fensel, and R. Studer, “Ontobroker: Ontology Based Access to Distributed and Semi-Structured Information.” in DS-8, 1999, pp. 351–369.

[25] H. Stuckenschmidt and H. Wache, “Context Modelling and Transformation for Semantic Interoperability.” in KRDB, 2000.

[26] B. He, M. Patel, C.-C. Chang, and Z. Zhang, “Accessing the Deep Web: A Survey.” University of Illinois, Urbana-Champaign, Tech. Rep., 2004.

[27] K. C.-C. Chang, B. He, and Z. Zhang, “Toward Large Scale Integration: Building a MetaQuerier over Databases on the Web.” in CIDR, 2005, pp. 44–55.

[28] I. Muslea, S. Minton, and C. A. Knoblock, “Hierarchical Wrapper Induction for Semistructured Information Sources.” in Autonomous Agents and Multi-Agent Systems, 2001.

[29] J. Hammer, H. Garcia-Molina, S. Nestorov, R. Yerneni, M. Breunig, and V. Vassalos, “Template-based wrappers in the TSIMMIS system.” in SIGMOD Conference, 1997, pp. 532–535.

[30] O. M. Duschka and M. R. Genesereth, “Query Planning in Infomaster.” in Selected Areas in Cryptography, 1997, pp. 109–111.

[31] R. Baumgartner, S. Flesca, and G. Gottlob, “Visual Web Information Extraction with Lixto,” in VLDB, 2001, pp. 119–128.

[32] L. Tanasescu, A. Gugliotta, J. Domingue, L. G. Villaras, R. Davies, M. Rowlatt, M. Richardson, and S. Stincic, “Spatial Integration of Semantic Web Services: the e-Merges Approach.” in International Semantic Web Conference, 2006.

[33] J. L. Ambite, C. A. Knoblock, S. Minton, and M. Muslea, “Heracles II: Conditional constraint networks for interleaved planning and information gathering.” IEEE Intelligent Systems, vol. 20, no. 2, pp. 25–33, 2005.


[34] LTER Network Office, the San Diego Supercomputer Center, and the Northwest Alliance for Computational Science and Engineering, “The Spatial Data Workbench.” http://www.lternet.edu/technology/sdw/.

[35] A. Telang, R. Mishra, and S. Chakravarthy, “Ranking Issues for Information Integration,” in First International Workshop on Ranking in Databases (in conjunction with ICDE 2007), 2007.

[36] D. L. Lee, H. Chuang, and K. E. Seamons, “Document Ranking and the Vector-Space Model.” IEEE Software, vol. 14, no. 2, pp. 67–75, 1997.

[37] S. Chaudhuri, G. Das, V. Hristidis, and G. Weikum, “Probabilistic Ranking of Database Query Results.” in VLDB, 2004, pp. 888–899.

[38] M. Lenzerini, “Data Integration: A Theoretical Perspective.” in PODS, 2002, pp. 233–246.

[39] R. Agrawal, A. V. Evfimievski, and R. Srikant, “Information Sharing Across Private Databases.” in SIGMOD Conference, 2003, pp. 86–97.

[40] C. A. Knoblock, S. Minton, J. L. Ambite, N. Ashish, I. Muslea, A. Philpot, and S. Tejada, “The Ariadne Approach to Web-Based Information Integration.” Int. J. Cooperative Inf. Syst., vol. 10, no. 1-2, pp. 145–169, 2001.

[41] G. Barish, C. A. Knoblock, Y.-S. Chen, S. Minton, A. Philpot, and C. Shahabi, “The TheaterLoc Virtual Application.” in AAAI/IAAI, 2000, pp. 980–987.

[42] M. R. Genesereth, A. M. Keller, and O. M. Duschka, “Infomaster: An Information Integration System.” in SIGMOD Conference, 1997, pp. 539–542.

[43] Z. Nie, S. Kambhampati, and T. Hernandez, “BibFinder/StatMiner: Effectively Mining and Using Coverage and Overlap Statistics in Data Integration.” in VLDB, 2003, pp. 1097–1100.

[44] Y. Arens, C. Y. Chee, C.-N. Hsu, and C. A. Knoblock, “Retrieving and Integrating Data from Multiple Information Sources.” Int. J. Cooperative Inf. Syst., vol. 2, no. 2, pp. 127–158, 1993.

[45] J. Hammer, H. Garcia-Molina, K. Ireland, Y. Papakonstantinou, J. D. Ullman, and J. Widom, “Integrating and Accessing Heterogeneous Information Sources in TSIMMIS.” in AAAI, 1995, pp. 61–64.

[46] T. Hernandez and S. Kambhampati, “Improving Text Collection Selection with Coverage and Overlap Statistics.” in WWW (Special interest tracks and posters), 2005, pp. 1128–1129.

[47] Z. G. Ives, D. Florescu, M. Friedman, A. Levy, and D. S. Weld, “An Adaptive Query Execution System for Data Integration,” in SIGMOD Record, 1999.

[48] S. B. Davidson, J. Crabtree, B. P. Brunk, J. Schug, V. Tannen, G. C. Overton, and C. J. S. Jr., “K2/Kleisli and GUS: Experiments in integrated access to genomic data sources.” IBM Systems Journal, vol. 40, no. 2, pp. 512–531, 2001.

[49] H. Wache, T. Vogele, U. Visser, H. Stuckenschmidt, G. Schuster, H. Neumann, and S. Hubner, “Ontology-Based Information Integration: A Survey,” IJCAI, 2001.

[50] A. Doan and A. Y. Halevy, “Semantic-Integration Research in the Database Community,” AI Magazine, vol. 26, no. 1, pp. 83–94, 2005.


[51] N. F. Noy, “Semantic Integration: A Survey Of Ontology-Based Approaches.” SIGMOD Record, vol. 33, no. 4, pp. 65–70, 2004.

[52] M. Doerr, J. Hunter, and C. Lagoze, “Towards a Core Ontology for Information Integration.” Journal of Digital Information, vol. 4, no. 1, 2003.

[53] J. A. Hendler and E. A. Feigenbaum, “Knowledge Is Power: The Semantic Web Vision.” in Web Intelligence, 2001, pp. 18–29.

[54] D. K. W. Chiu and H.-F. Leung, “Towards Ubiquitous Tourist Service Coordination and Integration: A Multi-agent and Semantic Web approach.” in ICEC, 2005, pp. 574–581.

[55] J. Reinoso, A. Silvescu, D. Caragea, J. Pathak, and V. Honavar, “Information Extraction and Integration from Heterogeneous, Distributed, Autonomous Information Sources: A Federated Ontology-Driven Query-Centric Approach.” in IRI, 2003, pp. 183–191.

[56] J. Reinoso-Castillo, “Ontology-driven information extraction and integration from heterogeneous distributed autonomous data sources: A federated query centric approach.” Ph.D. dissertation, Artificial Intelligence Research Laboratory, Department of Computer Science, Iowa State University, 2002.

[57] M. Friedman, A. Y. Levy, and T. D. Millstein, “Navigational Plans For Data Integration.” in AAAI/IAAI, 1999, pp. 67–73.

[58] U. Nambiar and S. Kambhampati, “Mining Approximate Functional Dependencies and Concept Similarities to Answer Imprecise Queries.” in WebDB, 2004, pp. 73–78.

[59] Z. Nie and S. Kambhampati, “Joint optimization of cost and coverage of query plans in data integration.” in CIKM, 2001, pp. 223–230.

[60] Z. Nie, S. Kambhampati, U. Nambiar, and S. Vaddi, “Mining source coverage statistics for data integration.” in WIDM, 2001, pp. 1–8.

[61] U. Nambiar and S. Kambhampati, “Answering Imprecise Queries over Autonomous Web Databases.” in ICDE, 2006, p. 45.

[62] Z. Nie and S. Kambhampati, “A Frequency-based Approach for Mining Coverage Statistics in Data Integration.” in ICDE, 2004, pp. 387–398.

[63] H. He, W. Meng, C. T. Yu, and Z. Wu, “WISE-Integrator: An Automatic Integrator of Web Search Interfaces for E-Commerce.” in VLDB, 2003, pp. 357–368.

[64] B. He, Z. Zhang, and K. C.-C. Chang, “Towards Building a MetaQuerier: Extracting and Matching Web Query Interfaces.” in ICDE, 2005, pp. 1098–1099.

[65] G. Kabra, C. Li, and K. C.-C. Chang, “Query Routing: Finding Ways in the Maze of the Deep Web.” in International Workshop on Challenges in Web Information Retrieval and Integration, 2005, pp. 64–73.

[66] B. He, Z. Zhang, and K. C.-C. Chang, “MetaQuerier: Querying Structured Web Sources On-the-fly.” in SIGMOD Conference, 2005, pp. 927–929.

[67] B. He and K. C.-C. Chang, “A Holistic Paradigm for Large Scale Schema Matching.” SIGMOD Record, vol. 33, no. 4, pp. 20–25, 2004.


[68] Y. Arens, C. Y. Chee, C.-N. Hsu, and C. A. Knoblock, “Retrieving and Integrating Data from Multiple Information Sources.” Int. J. Cooperative Inf. Syst., vol. 2, no. 2, pp. 127–158, 1993.

[69] K. Lerman, S. Minton, and C. A. Knoblock, “Wrapper Maintenance: A Machine Learning Approach.” JAIR, vol. 18, pp. 149–181, 2003.

[70] S. Tejada, C. A. Knoblock, and S. Minton, “Learning domain-independent string transformation weights for high accuracy object identification.” in ACM SIGKDD, 2002, pp. 350–359.

[71] G. Barish and C. A. Knoblock, “An efficient and expressive language for information gathering on the web.” in AIPS, 2002.

[72] M. Michelson and C. A. Knoblock, “Learning Blocking Schemes for Record Linkage.” in AAAI, 2006.

[73] M. Michalowski, J. L. Ambite, C. A. Knoblock, S. Minton, S. Thakkar, and R. Tuchinda, “Retrieving and Semantically Integrating Heterogeneous Data from the Web.” IEEE Intelligent Systems, vol. 19, no. 3, pp. 72–79, 2004.

[74] K. Lerman, A. Plangprasopchok, and C. A. Knoblock, “Automatically Labeling the Inputs and Outputs of Web Services.” in AAAI, 2006.

[75] S. Thakkar, J. L. Ambite, and C. A. Knoblock, “Composing, Optimizing, and Executing Plans for Bioinformatics Web Services.” VLDB Journal of Data Management, Analysis and Mining for Life Sciences, vol. 14, no. 3, pp. 330–353, 2005.

[76] A. Y. Levy and M.-C. Rousset, “CARIN: A Representation Language Combining Horn Rules and Description Logics.” in ECAI, 1996, pp. 323–327.

[77] “Datalog and Logic-Based Databases,” http://cs.wwc.edu/~aabyan/415/Datalog.html.

[78] “Description Logics,” http://dl.kr.org/.

[79] A. Y. Levy, A. Rajaraman, and J. J. Ordille, “Query-Answering Algorithms for Information Agents.” in AAAI/IAAI, vol. 1, 1996, pp. 40–47.

[80] A. Y. Levy, “Obtaining Complete Answers from Incomplete Databases.” in VLDB, 1996, pp. 402–412.

[81] A. Y. Levy, A. Rajaraman, and J. J. Ordille, “Querying Heterogeneous Information Sources Using Source Descriptions.” in VLDB, 1997, pp. 251–262.

[82] D. Florescu, D. Koller, and A. Y. Levy, “Using Probabilistic Information in Data Integration.” in VLDB, 1997, pp. 216–225.

[83] Y. Papakonstantinou, H. Garcia-Molina, and J. Widom, “Object Exchange Across Heterogeneous Information Sources.” in ICDE, 1995, pp. 251–260.

[84] C. Li, R. Yerneni, V. Vassalos, H. Garcia-Molina, Y. Papakonstantinou, J. Ullman, and M. Valiveti, “Capability based mediation in TSIMMIS.” in SIGMOD Conference, 1998, pp. 564–566.


[85] H. Garcia-Molina, Y. Papakonstantinou, D. Quass, A. Rajaraman, Y. Sagiv, J. D. Ullman, V. Vassalos, and J. Widom, “The TSIMMIS Approach to Mediation: Data Models and Languages.” Intelligent Information Systems, vol. 8, no. 2, pp. 117–132, 1997.

[86] W. F. Cody, L. M. Haas, W. Niblack, M. Arya, M. J. Carey, R. Fagin, M. Flickner, D. Lee, D. Petkovic, P. M. Schwarz, J. T. II, M. T. Roth, J. H. Williams, and E. L. Wimmers, “Querying Multimedia Data from Multiple Repositories by Content: The Garlic Project.” in VDB, 1995, pp. 17–35.

[87] O. M. Duschka, “Query Planning and Optimization in Information Integration.” Ph.D. dissertation, Stanford University, 1997.

[88] A. M. Keller and M. R. Genesereth, “Using Infomaster to Create a Housewares Virtual Catalog.” International Journal of Electronic Markets, vol. 7, no. 4, pp. 41–45, 1997.

[89] ——, “Multivendor Catalogs: Smart Catalogs and Virtual Catalogs.” Journal of Electronic Commerce, vol. 9, no. 3, 1996.

[90] W. W. Cohen, “Data Integration using Similarity Joins and a Word-based Information Representation Language.” ACM Transactions on Information Systems, vol. 18, no. 3, pp. 288–321, 2000.

[91] ——, “Recognizing Structure in Web Pages using Similarity Queries.” in AAAI/IAAI, 1999, pp. 59–66.

[92] T. Ksiezyk, G. Martin, and Q. Jia, “InfoSleuth: Agent-Based System for Data Integration and Analysis.” in COMPSAC, 2001, p. 474.

[93] F. Goasdoue, V. Lattes, and M.-C. Rousset, “The Use of CARIN Language and Algorithms for Information Integration: The PICSEL System.” Int. J. Cooperative Inf. Syst., vol. 9, no. 4, pp. 383–401, 2000.

[94] E. Mena, V. Kashyap, A. P. Sheth, and A. Illarramendi, “OBSERVER: An Approach for Query Processing in Global Information Systems based on Interoperation across Pre-existing Ontologies.” in CoopIS, 1996, pp. 14–25.

[95] C. Shahabi, M. R. Kolahdouzan, S. Thakkar, J. L. Ambite, and C. A. Knoblock, “Efficiently Querying Moving Objects with pre-defined paths in a Distributed Environment.” in ACM International Symposium on Advances in Geographic Information Systems, 2001.

[96] A. Gupta, I. Zaslavsky, and R. Marciano, “Generating Query Plans within a Spatial Mediator.” in International Symposium on Spatial Data Handling, 2000.

[97] O. Boucelma, M. Essid, and Z. Lacroix, “A WFS-based mediation system for GIS interoperability.” in Advances in Geographic Information Systems, 2002.

[98] A. Gupta, R. Marciano, I. Zaslavsky, and C. K. Baru, “Integrating GIS and Imagery Through XML-Based Information Mediation.” in Integrated Spatial Databases, Digital Images and GIS, 1999.

[99] I. Zaslavsky, R. Marciano, A. Gupta, and C. Baru, “XML-based Spatial Data Mediation Infrastructure for Global Interoperability.” in Spatial Data Infrastructure Conference, 2000.

[100] X. Ma, Q. Pan, and M. Li, “Integration and Share of Spatial Data Based on Web Service.” in Parallel and Distributed Computing Applications and Technologies, 2005, pp. 328–332.


[101] C. Baru, A. Gupta, I. Zaslavsky, Y. Papakonstantinou, and P. Joftis, “I2T: An Information Integration Testbed for Digital Government.” in Digital Government Research, 2004, pp. 1–2.

[102] C.-C. Chen, C. A. Knoblock, C. Shahabi, Y.-Y. Chiang, and S. Thakkar, “Automatically and Accurately Conflating Orthoimagery and Street Maps.” in ACM International Symposium on Advances in Geographic Information Systems, 2004.

[103] C.-C. Chen, C. A. Knoblock, C. Shahabi, and S. Thakkar, “Automatically and Accurately Conflating Satellite Imagery and Maps.” in International Workshop on Next Generation Geospatial Information, 2003.

[104] U. Nambiar and S. Kambhampati, “Answering imprecise database queries: a novel approach.” in WIDM, 2003, pp. 126–133.

[105] N. W. Paton and C. A. Goble, “Information Management for Genome Level Bioinformatics.” in VLDB, 2001.

[106] R. Stevens, P. G. Baker, S. Bechhofer, G. Ng, A. Jacoby, N. W. Paton, C. A. Goble, and A. Brass, “TAMBIS: Transparent Access to Multiple Bioinformatics Information Sources.” Bioinformatics, vol. 16, no. 2, pp. 184–186, 2000.

[107] L. M. Haas, P. M. Schwarz, P. Kodali, E. Kotlar, J. E. Rice, and W. C. Swope, “DiscoveryLink: A system for integrated access to life sciences data sources.” IBM Systems Journal, vol. 40, no. 2, pp. 489–511, 2001.

[108] R. C. Geer and E. W. Sayers, “Tutorial Section: Entrez: Making Use of Its Power.” Briefings in Bioinformatics, vol. 4, no. 2, pp. 179–184, 2000.