multi-data source fusion

15
Multi-data source fusion Gilles Nachouki a, * , Mohamed Quafafou b a LINA, Faculte ´ des Sciences et des Techniques, 2, rue de la Houssinie ´re, F-44322 Nantes Cedex 03, France b LSIS-UMR CNRS 6168, Domaine universitaire de St Jerome, F-13397 Marseille Cedex 20, France Received 19 April 2006; received in revised form 28 October 2007; accepted 11 December 2007 Available online 23 December 2007 Abstract This paper describes a new approach of heterogeneous data source fusion. Data sources are either static or active: static data sources can be structured or semi-structured, whereas active sources are services. In order to develop data sources fusion systems in dynamic contexts, we need to study all issues raised by the matching paradigms. This challenging problem becomes crucial with the dominating role of the internet. Classical approaches of data integration, based on schemas mediation, are not suitable to the World Wide Web (WWW) environment where data is frequently modified or deleted. Therefore, we develop a loosely integrated approach that takes into consideration both conflict management and semantic rules which must be enriched in order to integrate new data sources. Moreover, we introduce an XML-based Multi-data source Fusion Language (MFL) that aims to define and retrieve conflicting data from multiple data sources. The system, which is developed according to this approach, is called MDSManager (Multi-Data Source Manager). The benefit of the proposed framework is shown through a real world application based on web data sources fusion which is dedicated to online markets indices tracking. Finally, we give an evaluation of our MFL language. The results show that our language improves significantly the XQuery language especially considering its expressiveness power and its performances. Ó 2007 Elsevier B.V. All rights reserved. Keywords: Xquery; Web; Information extraction; Semantic conflicts; Data fusion 1. Introduction Interoperability is the problem of interconnecting and accessing heterogeneous data sources. As organizations have evolved, this problem has become an important field of research in both academic and industry. In the past, intensive researches related to data integra- tion were focused on databases. A federated database (FDB) [30] is a collection of cooperating and autonomous database systems (DBs). A federated database manage- ment system (FDBMS) provides a controlled and coordi- nated manipulation of DBMS component. FDBMSs are characterized by the following three dimensions: distribution, autonomy and heterogeneity. Distribution concerns data which may be distributed among multiple databases that are stored on one or several computers. Autonomy refers to different DBs which have independent control. Heterogeneity means technical and semantic differences in database management systems (DBMSs) – technical differences include several data mod- els and languages; semantic differences represent several meaning, interpretation or use of the same related data. FDBMSs are divided into two main categories: tightly cou- pled and loosely coupled [30]. Tightly coupled FDBMS requires the DB component to be created, maintained and access-controlled by its administrator(s). It is always static because the creation of a federated schema is similar to databases schema integration process. In a loosely cou- pled FDBMSs, the user controls the creation and mainte- nance of DB component. Furthermore, loosely coupled FDBMS and its administrator(s) do not enforce access control to the DB. It is considered as dynamic because the federated schema may be managed on the fly: it can be easily created, changed and dropped. Each user is the 1566-2535/$ - see front matter Ó 2007 Elsevier B.V. All rights reserved. doi:10.1016/j.inffus.2007.12.001 * Corresponding author. E-mail addresses: [email protected] (G. Nachouki), [email protected] (M. Quafafou). www.elsevier.com/locate/inffus Available online at www.sciencedirect.com Information Fusion 9 (2008) 523–537

Upload: deep

Post on 13-Nov-2014

105 views

Category:

Documents


1 download

TRANSCRIPT

Available online at www.sciencedirect.com

www.elsevier.com/locate/inffus

Information Fusion 9 (2008) 523–537

Multi-data source fusion

Gilles Nachouki a,*, Mohamed Quafafou b

a LINA, Faculte des Sciences et des Techniques, 2, rue de la Houssiniere, F-44322 Nantes Cedex 03, Franceb LSIS-UMR CNRS 6168, Domaine universitaire de St Jerome, F-13397 Marseille Cedex 20, France

Received 19 April 2006; received in revised form 28 October 2007; accepted 11 December 2007Available online 23 December 2007

Abstract

This paper describes a new approach of heterogeneous data source fusion. Data sources are either static or active: static data sourcescan be structured or semi-structured, whereas active sources are services. In order to develop data sources fusion systems in dynamiccontexts, we need to study all issues raised by the matching paradigms. This challenging problem becomes crucial with the dominatingrole of the internet. Classical approaches of data integration, based on schemas mediation, are not suitable to the World Wide Web(WWW) environment where data is frequently modified or deleted. Therefore, we develop a loosely integrated approach that takes intoconsideration both conflict management and semantic rules which must be enriched in order to integrate new data sources. Moreover, weintroduce an XML-based Multi-data source Fusion Language (MFL) that aims to define and retrieve conflicting data from multiple datasources. The system, which is developed according to this approach, is called MDSManager (Multi-Data Source Manager). The benefitof the proposed framework is shown through a real world application based on web data sources fusion which is dedicated to onlinemarkets indices tracking. Finally, we give an evaluation of our MFL language. The results show that our language improves significantlythe XQuery language especially considering its expressiveness power and its performances.� 2007 Elsevier B.V. All rights reserved.

Keywords: Xquery; Web; Information extraction; Semantic conflicts; Data fusion

1. Introduction

Interoperability is the problem of interconnecting andaccessing heterogeneous data sources. As organizationshave evolved, this problem has become an important fieldof research in both academic and industry.

In the past, intensive researches related to data integra-tion were focused on databases. A federated database(FDB) [30] is a collection of cooperating and autonomousdatabase systems (DBs). A federated database manage-ment system (FDBMS) provides a controlled and coordi-nated manipulation of DBMS component.

FDBMSs are characterized by the following threedimensions: distribution, autonomy and heterogeneity.Distribution concerns data which may be distributed

1566-2535/$ - see front matter � 2007 Elsevier B.V. All rights reserved.

doi:10.1016/j.inffus.2007.12.001

* Corresponding author.E-mail addresses: [email protected] (G. Nachouki),

[email protected] (M. Quafafou).

among multiple databases that are stored on one or severalcomputers. Autonomy refers to different DBs which haveindependent control. Heterogeneity means technical andsemantic differences in database management systems(DBMSs) – technical differences include several data mod-els and languages; semantic differences represent severalmeaning, interpretation or use of the same related data.FDBMSs are divided into two main categories: tightly cou-pled and loosely coupled [30]. Tightly coupled FDBMSrequires the DB component to be created, maintainedand access-controlled by its administrator(s). It is alwaysstatic because the creation of a federated schema is similarto databases schema integration process. In a loosely cou-pled FDBMSs, the user controls the creation and mainte-nance of DB component. Furthermore, loosely coupledFDBMS and its administrator(s) do not enforce accesscontrol to the DB. It is considered as dynamic becausethe federated schema may be managed on the fly: it canbe easily created, changed and dropped. Each user is the

524 G. Nachouki, M. Quafafou / Information Fusion 9 (2008) 523–537

administrator of his/her own federation: he/she has tounderstand the semantics of the federated schema datasources and to solve all ambiguity problems raised.

With the advent of the Web, data management hasmoved away from the traditional framework to matchthe variety of information available on the Web [29]. Now-adays, web documents are largely used for storing dataespecially over the web, whilst XQuery language [33] hasbecome the standard data retrieval in such documents.The main architecture of data integration systems on theWeb is based on the mediator approach [32] which consistson defining an interface between an agent (user or applica-tion program) and a set of data sources. It is the role of theagent to submit queries over the mediation schema. Thisprocess is transparent to the user: it seems as he/she manip-ulates a single homogeneous and centralized data source.Two interesting characteristics of the Web are to be noted:the growing number of data sources and their frequentupdates and volatility. Major data sources are freely acces-sible on the Web and users often need to integrate themquickly without any help. Classical approaches based onmediator’s reasoning do not facilitate the user’s task sinceit is hard to unify data sources in a dynamic way. Rather,they assume a global mediated schema to model datasources. As such, these approaches often require an admin-istrator to control the mediated schema.

Our main objective is to provide a flexible integration ofdata sources on dynamic environments which is the webcase. In our system, we are using a loosely coupledapproach where there is no integration of data sourcesschema into a global one. We assume that the user, whois his/her own integrated schema’s administrator, is ableto understand, build, manage these schemas and formulateseparate queries over them. We have chosen this type ofmodel because the construction of a global schema is a verydifficult task, particularly on the web, where data sourcesare frequently updated or deleted.

The only remaining problem is to solve conflicts thatoften take place between different data sources (the usermust, in this context, solve these conflicts manually or byusing specific functions or services). For this objective, wepropose a Multi-data source Fusion Language (MFL) thatallows defining and retrieving data from conflicting datasources or multi-data sources. A multi-data source schemais defined in a specific structure. MFL is a simple and pow-erful language that allows users to formulate their need in asingle query. It facilitates query’s expansion over conflict-ing data sources and controls the semantics expressed inusers’ queries: For each query submitted, MFL will searchfor conflicts in the query’s body. If no conflict is detected,the query will be accepted and executed. Otherwise, threecases may arise: (1) Conflicts are solved through query pro-cessing; in this case, the query will be accepted and exe-cuted. (2) Conflicts cannot be solved; in this case, thequery will be rejected (e.g. case of homonyms conflicts).(3) Some conflicts are solved where others not; in this case,a part of the query related to the solved conflicts will be

executed and results will be returned to the user with awarning message informing him/her of the detected conflictnature (e.g. case of scale conflicts).

In this paper, we demonstrate the usability of ourapproach and describe the MFL language. However, wedo not study the following issues: 1. How conflicts betweendata sources are discovered. 2. How information isextracted from the web efficiently. Some references to thesetwo issues, which are related to well-known researchdomains, are presented here only to clarify our proposi-tion. The paper is organized as follows: Section 2 describesrelated work concerning data integration. Section 3 definesMFL language. Section 4 shows an implementation of ourapproach with MDSManager system. Section 5 illustratesour proposal with an application based on web datasources fusion and dedicated to online markets indicestracking. Finally, Section 6 discusses and summarizes somefuture research directions.

2. Related work

MSQL [18] and MDSL [19], are two multi-database lan-guages developed in the late 70th for data integration in aloosely coupled approach. Both languages were designed toaccess structured data only (e.g. relational databases).Many researches in data integration’s domain based onmediators use the mapping method between mediatedschema and integrated data sources schemas. Two mainapproaches are proposed: Global As View (GAV) andLocal As View (LAV) [12,16,17]. The GAV approachcomes from the federated databases’ process presentedabove. It involves the definition of a mediated schema asglobal view over local schemas. This view is created byassembling together different local views (called also localschemas). The LAV approach is a dual one. It involvesthe definition of local schemas as views over a given med-iated schema.

Many researches have been developed in order to inte-grate data sources in Peer-To-Peer (P2P) context wherethe definition of a unique mediated schema is unrealisticbecause of the autonomy and volatility of peers. Most ofP2P systems avoid the maintenance of a single globalschema and their approaches can be classified in three cat-egories: mediation between two peers; mediation between asmall group of peers; mediation between super-peers.

Many projects and systems are developed according to acentralized or P2P approaches. Among these systems, (thelist is not exhaustive), we quote TSIMMIS [10], DISCO[31], GARLIC [5], MIX [2], MOMIS [3], AGORA[21,22], INFORMATION MANIFOLD [15], C-WEB [1]and XYLEME [9]. Other systems like AXML [4], PIAZZA[7] and SENPEER [6] are developed in a P2P context. Inorder to show the diversity of these systems, we presentsome of them by giving the data model used, the languageand the type of mapping (GAV, LAV).

TSIMMIS follows a GAV approach. This system inte-grates structured and unstructured data sources. The

G. Nachouki, M. Quafafou / Information Fusion 9 (2008) 523–537 525

common model is an Object-based information ExchangeModel (OEM). The objective of GARLIC (following theGAV approach) is the integration of multimedia datasources (image, video, text etc.). The common model isobject oriented and the query language is an extension ofSQL called GQL (Garlic Query Language).

Given the popularity of XML as a data description for-mat, more and more DBMS vendors have added to theirsystems the capability to export relational or object schemato an XML format. In this direction, AGORA uses XMLas user interface format, while all data inside the query pro-cessor consist of relational tuples. AGORA follows LAVapproach. Queries are expressed using an XML query lan-guage and the results are formatted as XML documents,making the underlying relational engine transparent touser. C-Web proposes an ontology-based integrationapproach of XML web resources. C-WEB, like AGORAstarts from a global virtual schema (data lies in some exter-nal sources) expressed with a given model. The virtualschema allows users to formulate queries without beingaware of the source’s specific structure. Xyleme project,provides a semantic integration model for heterogeneousXML schemas. This model uses views that connect(through path-to-path mappings) a virtual tree-like schema(called abstract DTD) to the heterogeneous schemas (calledconcrete DTDs). A query on the view (abstract query) istranslated into several concrete queries and the results arereturned according to the view schema.

In P2P context, AXML (Active XML) is an XML doc-ument that may include some special elements such as thetag < sc > (for service web call). This tag contains calls toweb services (SOAP, WSDL) for data integration. Some ofthe data are given explicitly in the AXML documentswhereas the other ones are returned in these documentsby web services calls. The use of calls to web services,embedded in XML documents, allows to integrate orupdate data included in these documents in a dynamicway. PIAZZA integrates sources that are semi-structureddata based on XML data model and XQuery language.The peers, interested by exchange data, establish semanticlinks between them. PIAZZA combines the twoapproaches of mediation LAV/GAV. SENPEER is a P2Psystem where data sources are expressed using variousdata models (e.g. Relational, XML etc.). SENPEERassumes an unstructured P2P where peers are connectedto super-peers according to their semantic domainsexpressed with an ontology. Super-peers exchange mes-sages to discover semantic links between them. A queryformulated with the language of a peer is expressed withthe query exchange format SQUEL and sent to its processowner for execution.

Most of these systems use an architecture with single glo-bal schema (or ontology). This approach is interesting if allsources have similar attributes or concepts; otherwise, asource cannot be changed without reconsidering the globalschema and the other sources, to make sure that there areno conflicts. For the other systems, which are based on mul-

tiple schemas (or ontology), it is hard to make a comparisonbetween them since no common vocabulary were defined.Moreover, mapping between several couples of ontologyneeds to be defined. Unlike these systems, our approach usesa multi-data source language, based on XQuery, to datafusion: we can smoothly add or delete data sourcessince no global schema is required. Compared to existinglanguages and approaches, our work demonstrates goodsystem achievements when the user frequently formulatesqueries.

In the following section, we describe our Multi-datasource Fusion Language (MFL) for data sources fusion.

3. The multi-data source fusion language

MFL provides two sub-languages: the Multi-datasource Definition Language (MDL) – used to define themulti-data source – and the Multi-data source RetrievalLanguage (MRL) – used to retrieve data from multi-datasource. One characteristic of MDL resides on the simplicityof the multi-data source’s definition: users can give a collec-tive name called multi-data source name to some datasources. A collective name simplifies the expression ofqueries; users declare inter-sources conflicts between ele-ments composing the multi-data source. It is useful todeclare these conflicts for some types of queries describedbelow. MRL extends XQuery language in order to accessmultiple conflicting data sources. Furthermore, in conflict-ing data sources, user’s need is expressed through a singlequery. With MRL, it is easy to smooth out all semanticdata differences which often exist in autonomous datasources.

3.1. Multi-data source definition language

The purpose of MDL is to define the multi-data sourceschema starting from a set of data sources schemas.

3.1.1. Multi-data source description

Definition 1 (Multi-data source). A multi-data sourceschema is a collection of data sources’ schemas or multi-data sources according to the syntax of a Document TypeDefinition (DTD) of XML:

< !ELEMENT multisources ðsource j multisourcesÞ þ >< !ATTLIST multisources name CDATA #REQUIRED >< !ELEMENT source ðfeatureÞ þ >< !ATTLIST source name CDATA #REQUIRED >< !ATTLIST source url CDATA #REQUIRED >< !ELEMENT feature ð#PCDATAÞ >< !ATTLIST feature name CDATA #REQUIRED >

In this schema, < multisources > and < source > referto a specific multi-data source or a data source. < name >designates the name of a multi-data source or simply a

526 G. Nachouki, M. Quafafou / Information Fusion 9 (2008) 523–537

property in a data source. < url > describes the path toreach this data source. < feature > contains the propertiesof a data source. The following example shows how to inte-grate many data sources in a multi-data source.

Example 1 (Multi-data source creation). Consider themulti-data source given in Fig. 1 describing a bankingdomain. In this example, the three (static) data sourcesBnp, Cio and Cl represent a multi-data source calledBank2. Similarly, Bank2 with bank Sg form a multi-datasource called Bank1 and Bank is the root of the multi-datasource. Conflicts between these data sources are given in aspecific data source called Conflicts (described below).Services represent a multi-data source which describes two(active) data sources called Dollar2Euro and Euro2Dollar.The first service converts currencies from Dollar to Eurowhereas the second one converts currencies from Euro toDollar. In Fig. 1, the name of a static or active data source(e.g. Bnp) represents the root of a document (e.g.Bnp.dtd)that describes the schema of this source. In this example,the schema of data sources Bnp, Cl and Sg are expressed inFrench and the data source Cio in English. The structure ofmulti-data source Conflicts is detailed in the remaining ofthis section.

3.1.2. Conflicts description

Data sources present heterogeneity that consists of differ-ences in names, data structures, types, scale etc. Several per-ceptions of the same real world lead to different datasources. To integrate all these data sources together, we needfirst to solve their conflicts. Conflicts are classified as follows[14,30]: data models conflicts, data conflicts, structural con-flicts, descriptive conflicts, abstraction conflicts and seman-tic conflicts. Conflicts between data models appear whendata sources are designed using distinct data models (rela-tional, XML etc.). Data conflicts refer to differences amongdefinitions, such as attribute types, formats, or precision.Structural conflicts consist on describing the same idea indifferent ways and in different data sources. For example,

Fig. 1. A part of a multi-data sourc

a concept is defined as an attribute in a relational schemaSch1 and as a relation in another schema Sch2 etc. Descrip-tive conflicts concern different names given to a same entity(e.g. homonymous, synonymous), identity conflicts (e.g. aperson is identified by a number in Sch1 and social-securitynumber in Sch2), scale conflicts (e.g. salaries are given inDollar and Euro respectively in Sch1 and Sch2) etc. Abstrac-tion conflicts concern generalization (e.g. the concept ofemployee in Sch1 generalizes the concept of teacher inSch2) and aggregation (e.g. date of birth in Sch1 is a stringwhile in Sch2 it is composed of three fields month, dayand year). Semantic conflicts refer to differences and similar-ities in the meaning of elements in the data sources. Manyworks have studied semantic conflicts in the literature, wequote: [8,28]. In [8], authors propose an algorithm calledH-MATCH which takes two schemas as input and returnsthe mappings that identify corresponding concepts in thetwo schemas, namely the concepts with the same or the clos-est meaning. In [28], authors provide a survey of differentapproaches to automatic schema matching. In this paper,we consider the most important conflicts detailed above,such as data models conflicts, descriptive conflicts andsemantic conflicts, but we do not consider how to detectthem. We assume that they are pointed out either manuallyor by a program. In order to solve conflicts between datamodels, we express data sources schemas with DTD formal-ism which represents in our case a common data model of allintegrated data sources. Another model like XML schemacan be chosen. Semantic conflicts between elements aredetected using contexts of elements. A context of an elementE is the set of elements connected directly to E and on whichit depends semantically. In other words, if a node E2 is achild of a node E1, the term E2 has to be interpreted inthe scope of E1’s meaning. Thus, the different occurrencesof a same term do not have the same meaning (e.g. Namecan appear different times, and may represent the name ofdifferent entities).

We suppose the following descriptive conflicts betweenelements of two data sources: synonymous, equivalence,

e describing a banking domain.

Fig. 2. The structure of conflicts data source.

G. Nachouki, M. Quafafou / Information Fusion 9 (2008) 523–537 527

homonymous, disjoint and scale conflicts. Two elementsare considered semantically synonymous if their namesare synonymous and have the same context (i.e. synony-mous conflicts). Two elements are considered semanticallyequivalent if their names are the same and have the samecontext (i.e. equivalence conflicts). Such elements are con-nected between them by a similarity link in the Conflictsdata source (given above). Two elements are linked by adissimilarity (or difference) link if their names are the sameor synonymous and have two distinct contexts (i.e. homon-ymous conflicts). Finally, two elements are linked by a dis-similarity link if their names are distinct or notsynonymous (i.e. disjoint conflicts). A scale link describestwo elements which are similar and each one uses a specificscale unit (i.e. scale conflicts). The data source conflicts isgiven in Fig. 2.

In this figure each < Similar > tag is composed of sev-eral < Node > tags. A < Node > tag is composed of threeelements < Elt >, < Path > and < Unit >. The < Unit >tag is optional and is used when scale conflicts are raised.The < Elt > tag allows to declare the name of an element

Fig. 3. A fraction of conflicts of

and the < Path > tag shows the path to reach this elementin a specific data source. Each < Dissimilar > tag in thedata source conflicts combines an < Elt > tag and two< Path > tags. The < Elt > tag allows to declare the nameof an element and the two tags < Path > show the paths toreach the elements which are dissimilar in specific datasources. A < Scale > tag describes the scale unit of an ele-ment. The < Scale > tag has an attribute type whichdescribes the domain in which this scale is applied (e.g. cur-rency). The structure of the scale conflict consists on the< Node > tag mentioned above. The < Services > tagdescribes the services that can be used to solve the scaleconflicts between two elements (e.g. to convert the valueof an element from a scale unit to another one). This tagis made up of several tags called < Service >. Each< Service > tag specifies a service of conversion from anorigin unit denoted by < Unit1 > to a destination unitdenoted < Unit2 >. The two tags < Name > and< Path > respectively indicate the name of the serviceand its location in the multi-data source schema.

Example 2 (Conflicts between banking sources). Considerthe multi-data source of banking field given in Fig. 1. Afraction of the conflicts between data sources composingthis multi-data source is described in Fig. 3. In this figure,the first tag < Dissimilar > designates dissimilaritybetween the branch name and the user name. The secondtag < Similar > indicates a similarity between the twofeatures: the name of a branch and the ’nom’ of an ’agence’

given in French language. The tag < Scale > with typeCurrency means that there is a currency conflict betweenthe salary of a user and the ’salaire’ of an ’employe’ (inFrench). The last tag < Services > with type Currencydeclares all services (in the tag < Service >) that areimplied on currencies. The service Euro2Dollar convertscurrency from Euro to Dollar.

To integrate data, in the following section, we introducethe Multi-data source Retrieval Language MRL. Thislanguage is based on the structure of multi-data source.

the integrated data sources.

Fig. 4. Classification of MRL queries.

528 G. Nachouki, M. Quafafou / Information Fusion 9 (2008) 523–537

3.2. Multi-data source retrieval language

Query form: MRL query is defined as follows:

Use ðmulti�Þdatasource1 name1

½; ðmulti�Þdatasourcej namej��Allow $ < semantic� variables >(E)XQuery query

Close name1½; namej��

The clauses Use, XQuery query and Close are mandatoryin MRL Queries, whereas the clause Allow is optional. Theclause Use determines the scope of the query and connectsto data sources for processing whilst Close disconnectsfrom data sources. The name is a given alias for either adata source or multi-data source and the clause Allow per-mits the declaration of semantic variables. Through thesevariables, the user declares his/her intention to access data,in a given query, semantically similar and differentlynamed. The (E)XQuery query can be formulated like aquery w.r.t. to XQuery language [33] or as an EXQueryquery given in [24] that allows an active data source tobe called.

In MRL, we distinguish two categories of queries: ele-mentary query and semantic query illustrated in Fig. 4.

3.2.1. Kinds of data

Typically, a data source is built with one data model.However, such focus on one data model makes hard to dealwith more complex applications which need different datamodels and query languages. Data sources are oftenexpressed with relational or XML data models. We referto Kinds of data all objects expressed, one or many times,in one or multiple data sources schema, within the samedata model or distinct data models. For example, an objectnamed Car can be modeled, many times, within XML orrelational data model, in several data sources schemas.We say that the name Car refers to several kinds of data.But if the name Car appears only one time in a single datasource schema, we say that it refers to only one kind ofdata.

a

Fig. 5. (a) Cinema-theater, (b) undergrou

3.2.2. Elementary query

Elementary queries are used when various data sourcesrepresent distinct contexts, used conjointly in queries, likein the example of the Leisure given in Fig. 5.

Definition 2 (Unique identifiers). A unique identifier is adesignator (i.e. name) that refers to only one kind of data.

An MRL query is an elementary query, if all its designa-tors are unique identifiers in its scope. We distinguish twotypes of elementary queries: mono-data source and multi-data source. An elementary mono-data source queryinvolves only one data source (static or active). An elemen-tary multi-data source query involves several data sources.In the next example we show the situation in which we for-mulate an elementary mono-data or multi-data source query.

Example 3 (unique identifiers). In this example, we con-sider the multi-data source given in Fig. 5c. It is composedof one multi-data source called Leisure (the root). Leisure isalso composed of two data sources: the undergroundstation and the cinema theater (XML and Sql), whichrespectively model the following two contexts: Cinema

(Fig. 5a) and Underground (Fig. 5b).

If a user wants to retrieve the names of the movie the-aters in Saint-Michel street, the corresponding MRL query,in this case, is an elementary mono-data source querybecause it concerns one data source and unique identifiers(e.g. cin-street and name):

Q1: use cinema cfor $a in document(‘‘my mds”)/leisurewhere $a/�/cin-street/text()=‘‘Saint-Michel”return < result >$a/�/name/text()< =result >close c

If the user wants to retrieve the names of the movie the-aters and the underground stations that are in the samestreet, we formulate the following query:

Q2: use leisure lfor $a in document(‘‘my mds”)/lwhere $a/�/cin-street/text() = $a/�/und-street/text()return< results >< cinema >$a/�/name/text()< =cinema >,< underground >$a/�/station/text()< =underground >< =results >close l

b c

nd and (c) multi-data source Leisure.

G. Nachouki, M. Quafafou / Information Fusion 9 (2008) 523–537 529

The scope of Q2 is the multi-data source Leisure. Thisquery is an elementary multi-data source query since itjoins two unique identifiers in both data sources Cinema

and Underground.

3.2.3. Semantic query

Semantic queries are used when various data sourcesrepresent the same universe like the banks example givenabove (Fig. 1). Semantic queries are also called broadcastqueries [20] because a user may have to broadcast the sameretrieval query to several data sources. XQuery language,in its current form, does not easily allow to express such sit-uations. With XQuery, the user needs to formulate as manyqueries as there are data sources. In contrast, semantic que-ries allow to broadcast the user’s intention in a singlequery. This is a considerable simplification, specially fora larger scope. Syntactically, semantic queries are basicallyformulated as elementary queries. In MRL, semantic que-ries are formulated according to the following concepts:multiple identifiers and semantic variables.

Definition 3 (Multiple identifiers). A semantic query withmultiple identifier contains at least one designator thatrepresents more than one kind of data in its scope.

A semantic query q with multiple identifiers representsthe set of all elementary queries Sq1,Sq2; . . . ; Sqn, alsocalled subqueries, resulting from all substitutions of multi-ple identifiers by the corresponding unique ones. Each des-ignator becomes a unique identifier in a subquerySqi; i ¼ 1; . . . ; n. This choice of unique identifiers may leadto a subquery that cannot be executed. We say that anexecutable subquery is a pertinent subquery. The resultof a such query q is the set of the results returned by allits pertinent subqueries. We consider the example thatdescribes the Banking multi-data source given in Fig. 1.We show an example of a semantic query with multipleidentifiers.

Example 4 (multiple identifier and conflict). Select thenames of all the branches of the banks Bnp, Cl and Sg.

� If a user chooses to use XQuery language to answer thisquestion he/she would have to formulate three distinctsubqueries (Sq1, Sq2 and Sq3). Each subquery is exe-cuted over the corresponding data source. The unionof the three subqueries results constitutes the final result.

Sq1: for $a in document(‘‘sg”)/sg, $b in $a/agence/nomreturn < result >$b/text()< =result >

Sq2: for $a in document(‘‘cl”)/cl, $b in $a/agence/nomreturn < result >$b/text()< =result >

Sq3: for $a in document(‘‘bnp”)/bnp, $b in $a/agence/nomreturn < result >$b/text()< =result >

� If this user chooses to use the Multi-data source Retrie-val Language MRL to answer this question he/shewould have to formulate only one query (Q1) as follows:

Q1: use bnp b, cl c, sg sfor $a in document(‘‘my mds”)/bank/bank1, $b in$a/�/nomreturn < result >$b/text()< =result >close b,c,s

This example highlights the simplicity of our languageMRL. This query is a semantic query. The scope of thisquery is Bnp, Cl and Sg. Starting from the followingpath:/bank/bank1 in the multi-data source my mds(For $a in document(‘‘my mds”)/bank/bank1) and foreach data source (Bnp, Cl and Sg) reached from thispath, we return the names of the branches ($b/text()).In this query, the attribute ‘nom’ is a multiple identifiersince it designates the three attributes ‘nom’ in Bnp, Sg

and Cl. The designator ‘nom’ in the scope of this querydoes not present any conflict of dissimilarity in the Con-flicts data source (Fig. 3).

Let us consider another example, instead of selecting in(Q1) the names of all the branches of the banks Bnp, Cl andSg, we want to select the names of the banks in Cio. Thefollowing MRL query (Q2) is formulated:

Q2: use cio cfor $a in document(‘‘my mds”)/bank/bank1/bank2, $bin $a/�/namereturn < result >$b/text()< =result >close c

In this query, the designator name is a multiple identifiersince it designates the branch name and the user name inthe data source Cio. Dissimilarity (homonymous conflict)is detected in the scope of this query. In this case, a warningmessage informs the user of this conflict.

In addition, the branches name in Bnp and Cio arewritten in two distinct languages: French (‘nom’) andEnglish (name). We cannot use a multiple identifier to referthe branches names in these two data sources. In such acase we use a semantic variable whose domain is ‘nom’and name as follows:

Definition 4 (Semantic variables). A semantic variable is avariable whose domain are the names of data types whichare semantically similar. A query can invoke one or severalsemantic variables. The aim of these variables is to enablethe user to broadcast his/her intention over differentlynamed data types which are semantically linked betweenthem by similar links in the conflicts data source (e.g. nameand ‘nom’).

A semantic query with semantic variables is consideredas the set of pertinent elementary subqueries resulting fromall possible substitutions of semantic variables and multipleidentifiers by unique identifiers. The syntax of the clauseAllow is given as follows:

530 G. Nachouki, M. Quafafou / Inform

Allow $< semantic� variable >¼< designator >½; < designator > �þ< semantic� variable >::¼< simple� variable > j< composed � variable >< composed � variable >::¼< simple� variable >½: < simple� variable > �þ< simple� variable >::¼< string >< designator >::¼< string > ½: < string > �þ

Consider the following example where the clause Allow

appears:Allow $x1 � x2 . . . xk ¼ n1;1 � n2;1 . . . nk;1; . . . ; n1;m � n2;m . . .

nk;m.Each x is a semantic variable. Each n is a designator.

The domain of the semantic variable xi isni;j; for j ¼ 1 . . . m. To preserve semantics in the query,we assume that the data type names ni;j, for j = 1 . . .m

are semantically similar. The ith subquery corresponds tothe simultaneous substitutions of nj;i to xj, for j = 1. . .k.For example, the semantic variables $x1 � x2 . . . xk, for thefirst subquery, are replaced by the unique or multiple iden-tifiers n1;1 � n2;1 . . . nk;1.

The following query is a semantic query with one seman-tic variable (x) and multiple identifiers (name). This querycontains conflicts in its scope, including similarity (i.e. syn-onymous) and dissimilarity (i.e. homonymous) conflicts.

Example 5 (One semantic variable). Select the name of thebranches in the two data sources Bnp and Cio.

Q1: use bnp b,cio callow $x=nom,namefor $a in document(‘‘my mds”)/bank/bank1/bank2, $bin $a/�/$xreturn < result >$b/text()< =result >close b,c

The clause Allow declares one semantic variable x. Thedomain of this variable is the pair of values (’nom’, name).For each value of x a subquery is generated. Moreover, thedesignator name is a multiple identifier since it designatesthe name of a branch and the name of a user in the datasource Cio. Similarity and dissimilarity conflicts aredetected through query processing. An example of conflictsresolution is shown here: the meaning of ’nom’ of an’agence’ in Bnp is found to be different from the name of auser in Cio but it is similar to the name of a branch. In thiscase, we keep only the possibility for the variable x to takethe two features: ’nom’ of an ’agence’ and name of a branch.This query is considered as a set of two pertinentelementary subqueries (Sq1 and Sq2) resulting from thesubstitution of the semantic variable (x) by its correspond-ing unique identifiers:

Sq1: use cio cfor $a in document(‘‘my mds”)/bank/bank1/bank2, $bin $a/�/branch/name

return < result > $b/text()< =result >close c

ation Fusion 9 (2008) 523–537

Sq2: use bnp bfor $a in document(‘‘my mds”)/bank/bank1/bank2, $bin $a/�/nomreturn < result > $b/text()< =result >close b

The first subquery Sq1 corresponds to the substitution ofname (name of a branch) in Cio to the semantic variable x.For the second subquery Sq2 ’nom’ is related to the variablex. The following query shows a semantic query with twosemantic variables. Each semantic variable in this querydesignates two different names. Each name may be a uniqueor a multiple identifier in the scope of the query. This querycontains two conflicts in its scope: similarity and dissimi-larity conflicts. This query leads to two subqueries formu-lated differently, one per data source in the scope.

Example 6 (Several semantic variables). Select thebranches names in the two data sources Cio and Sg.

Q1: use cio c,sg sallow $x.y=branch.name, agence.nomfor $a in document(‘‘my mds”)/bank/bank1,$b in $a/�/$x/$yreturn < result >$b/text()< =result >close c,s

Here, the clause Allow declares a couple of semanticvariables (x,y), where x and y designate respectively thetwo pairs of values (branch, ’agence’) and (name, ’nom’).The feature name is a multiple identifier as shown above.To resolve the conflicts detected in this query, as inExample 5, the feature name is substituted by the corre-sponding unique one (the branch name). After that, wesearch for similar pair of values in the data sources conflicts(Fig. 3). This query is equivalent to two pertinent elemen-tary subqueries (Sq1 and Sq2):

Sq1: use cio cfor $a in document(‘‘my mds”)/bank/bank1,$b in $a/�/branch/namereturn < result >$b/text()< =result >close c

Sq2: use sg sfor $a in document(‘‘my mds”)/bank/bank1,$b in $a/�/agence/nomreturn < result >$b/text()< =result >close s

The first subquery corresponds to the substitution ofbranch and name respectively to the two semantic variables

G. Nachouki, M. Quafafou / Information Fusion 9 (2008) 523–537 531

x and y. In the same perspective, ’agence’ and ’nom’ arerelated respectively to x and y, in the second subquery. Thefollowing example, like the previous one, presents asemantic query with scale conflicts in its scope. This queryuses the clause Allow with one semantic variable.

Example 7 (Scale conflict). Select the salaries of employeesof Bnp and Cio.

Q1: use bnp b,cio callow $x=salaire,salaryfor $a in document(‘‘my mds”)/bank/bank1/bank2, $din $a/�/$xreturn < result >$d/text()< =result >close b,c

In the scope of this query a scale conflict is detectedbetween the two elements ‘‘salaire’ and salary. Afterexecution, this query returns to the user a result with awarning message that points out the unit scale of each valueselected by this query. In this case, the user has thepossibility to accept the result, to convert all the salariesinto the same unit of scale (e.g. Dollar or Euro) or to choosea part of the salaries (e.g. only salaries expressed with Euro).

To achieve this goal, we use the clause Service that callsan active data source.

Definition 5 (EXQuery). EXQuery (Extended XQuery) isan extension of XQuery language in which we have inte-grated the clause Service [24]. The syntax of this clause isgiven below:

<service>::¼service” (‘‘< interpreter>”,‘‘< name>”,‘‘< argument>”, ‘‘<locate>”, ‘‘<display>”)”< interpreter >::¼< string > j(empty)< name >::¼< string >< argument >::¼< arg > ½; < arg > �þ< arg >::¼< string > j < number >< locate >::¼< ch > ð< = >< ch >Þ�< ch >::¼< string >< display >::¼ 0j1

The first argument is optional and is specified only ifan interpreter software is used to run the servicerequired. The second argument is the service name.The third argument provides a list of parametersrequired to execute the service. The next argument spec-ifies the location of the service in order to reach it. Dis-play is optional; if this option is used, the result isdisplayed on the user’s standard output. Otherwise, itis included into the query result.

For each active data source an XML document [34] isdefined according to the document’s structure given asfollows:

< !ELEMENTserviceðinterpreter; name;parameter; locateÞ >

< !ELEMENT interpreterð#PCDATAÞ >< !ELEMENT nameð#PCDATAÞ >< !ELEMENT parameterðpara�Þ >< !ELEMENT parað#PCDATAÞ >< !ELEMENT locateð#PCDATAÞ >

This document provides information about a servicesuch as its name, interpreter, parameters and location.The following query uses the clause Service that calls anactive data source.

Example 8 (Service). Convert the salaries of employees ofbank cio into Euro.

Q1: use cio c, dollar2euro dfor $a in document(‘‘my mds”)/bank/bank1/bank2, $bin $a/�/salarylet $c :¼ document(‘‘my mds”)/servicesreturn< result > service($c/�/interpreter/text(),$c/�/name/text(),$b/text(),$c/�/locate/text(),display) < =result >close c,d

Query Q1, is an elementary multi-data source query thatuses the clause Service in order to convert the salaries ofCio’ employees in Euros currency using the serviceDollar2Euro. In this query, $b designates the salaries ofCio’ employees, and $c designates the multi-data sourceServices. The salaries selected from Cio are passed asarguments to the clause Service (e.g. $b/text()).

In conclusion, MFL gives the possibility to users tointegrate, dynamically, heterogeneous data sources, inparticular, data sources coming from the Web. Datasources are integrated online without the administratorsintervention. The proposed MFL language takes intoaccount semantic and scale conflicts between data sources.

The next section describes an implementation of ourproposed data source fusion approach.

4. MDSManager architecture

This section describes the design of MDSManager(Multi-Data Source Manager) system developed accordingto our approach [23,25]. The design of this system is givenin Fig. 6. MDSManager is composed of five principal com-ponents: Wrapper, Mediator, Extractor, Petri-Nets Engine(PN-Engine) and Interfaces. The Wrapper componentoffers two main modules which include Data Source Serverand T-Engine. The Data Source Server (DSS) manages sev-eral data sources (static or active). A DSS represents a datasource for the outer world. It represents the interface of thissource. There is no way to reach data sources directly inthis model. All communications toward data sources haveto be done through the DSS. This design provides the pos-sibility to reach all data sources of any type and to commu-nicate with them using the same interface. The DSS hides

Fig. 6. Design of MDSManager system.

532 G. Nachouki, M. Quafafou / Information Fusion 9 (2008) 523–537

the implementation differences between data source typesand provides a common prototype for data retrieval andquery. In contrast, T-Engine provides specific classes (e.g.java classes) which describe how to communicate with datasources.

For static data sources, we have developed T-Wrapperwhich corresponds to the most important sources of infor-mation existing over the web (e.g. SQL databases and XMLdocuments). The role of the T-Wrapper is to extract DTDsfrom data sources and to register them beside the mediator.The T-Wrapper executes also the XQuery queries on thedata sources and returns the results to the mediator.

For the active data sources, we have developed S-Wrap-per with main objective to extract the DTD from a givenXML document that describes an active data source. TheS-Wrapper accepts EXQuery queries, which integrate theclause Service in their bodies. The S-Wrapper runsthe active data source specified in Service and returnsresults to the mediator.

The mediator component offers users views of multi-data sources expressed with their DTDs. A mediatorreceives MRL or XQuery queries from user interfaces, itprocesses queries and returns the results to the user asXML documents. The mediator has the right informationto forward a specific query to the relevant DSS server.Two main modules are distinguished in this component:Data Source Repository and Query Server.

The goal of the Data Source Repository (DSR) is to col-lect information about data sources (references, addresses,names, descriptions and the DTD of the associated datasource, which is the most important information). All theDSS have to register themselves at start-up at the centralDSR, providing all the needed information pieces aboutthe requested data source. Therefore, a repository containsinformation of all main details about the connected datasources. It has the possibility also to share this informationwith the clients in a centralized way.

The Query Server (QS) receives MRL or XQuery queriesand processes them. Processing an MRL query consists in

analyzing this query, eventually substituting semantic vari-ables and multiple identifiers with unique ones, and gener-ating (E)XQuery pertinent subqueries. The QScommunicates with both DSR and Wrapper in order toobtain more information about data sources and to executesubqueries respectively.

The Extractor module extracts information from webpages. We only require the user to give a set of exampleinstances of the relation that he/she wants to extract. Ourmethod generates a wrapper (a set of patterns) correspond-ing to a web page. This wrapper allows the user to dailyupdate the extracted data without running the extractormodule completely. A wrapper returns an XML documentcontaining information, which is extracted from web pages.

More details about the extractor algorithm are given in[26]. In this paper, we do not consider how to extract effi-cient information from the Web. Many works have studiedthis problem and several solutions have been proposed.These works are illustrated in a survey of differentapproaches, which are found in Ref. [11].

The main objective of the Petri-Nets Engine (PN-Engine) is to build a fully automated process of dataextraction from the web and to query multi-data sources.To achieve this goal, there is a need to specify a set of basicoperators and to coordinate these operators. For example,an extractor operator takes as input a web page and returnsthe list of results extracted from the page. A parsing oper-ator takes as input an XML or HTML document, parses itand returns a DOM (Document Object Model) object. Afilter operator takes as input a predicate in order to refinethe results returned by a resource and so on.

In order to build automatically a complete informationextraction and fusion, it is important to coordinate thebasic operators. In our work the component PN-Enginesupports the composition and the interpretation of scenar-ios (i.e. advanced operators) which are expressed withPetri-Nets model [27]. While based on our approach andlanguage, these scenarios are built starting from basic oper-ators and constructs. Constructors can be Sequence, Alter-native, Iterative, Parallel etc. These constructors are givenin more details in [13]. PN-Engine, takes as input a scenariowhich is expressed with XML, interprets it and returns theresult to the user. This component communicates with themediator in order to retrieve data from multi-data sources.

The interface modules provide an administrator inter-face and a user interface. The administrator interfacedefines mainly the structure of the future multi-data sourceand describes conflicts existing between them. For eachdata source involved in the multi-data source, T-Engineextracts and stores, using the DSR module, all informationabout the data source (in particular its DTD, referencesetc.). The administrator interface extracts data from Webpages, stores data in XML documents and generates corre-sponding wrappers.

User interface allows expert users to submit their queriesdirectly. In other hand, it allows users, who are not familiarwith MRL or XQuery languages, to generate their queries

G. Nachouki, M. Quafafou / Information Fusion 9 (2008) 523–537 533

and then to submit them to the mediator. After analysis ofthese queries, the Query Server (QS) module generates a setof (E)XQuery subqueries and asks each Data Source Server(DSS) responsible of a data source to execute the corre-sponding subquery. Each DSS, in its turn, sends the sub-query to the Wrapper able to run it. The Wrappermodule executes the subquery on the data source and thefinal result is returned to the user.

Finally, the user interface allows to formulate severalqueries and to specify an order to execute them using thePetri-Nets specification. To interpret this specification,PN-Engine generates an XML document and uses QSmodule to run, for example, a sequence of queries.

The following section shows the fusion process of theinformation extracted from the web through a real-lifeapplication.

5. Online fusion of web data sources

In this application, we have chosen to compare on adaily-basis online markets indices given in two websites at the following addresses: Boursorama (http://www.bourso-rama.com/indices/indices_intern.phtml) and the

Fig. 7. Description of static

Nasdaq Stock Market (http://quotes.nasdaq/aspx/marke-tindices.aspx). The first step in this process is the extractionof information from the two web sites while respecting twogiven relations (expressed with DTD format in Fig. 7)called Boursorama.dtd and Nasdaq.dtd. A wrapper is thengenerated [26,11] for each site in order to select the new val-ues of indices on a daily basis. A fraction of the extracteddata from these two sites is given in Fig. 7 (they are respec-tively called Boursorama.xml and Nasdaq.xml).

The following steps create the data source Conflicts.Fig. 8 shows a fraction of the generated data source Con-

flicts. In this figure, the two elements Mindices=indices=nasdaq=indice=name and Mindices=indices=boursorama=indice=nom are similar. We create then the multi-datasource schema where Mindices (Fig. 9) represents the rootof the document. Mindices is defined as follows: it containsthe multi-data source schema called Indices, the multi-datasource called Services and the data source Conflicts. Themulti-data source Indices declares the Boursorama andNasdaq data sources. The multi-data source Services con-tain a service called CompIndice. The description of thisservice is given in an XML document called CompIndic-es.xml according to its DTD (showed in Fig. 7).

and active data sources.

Fig. 8. A fraction of the conflicts data source.

534 G. Nachouki, M. Quafafou / Information Fusion 9 (2008) 523–537

The final step of this process involves the activation ofthe multi-data source Mindices for future manipulations.Every day, after running the generated wrappers for thesetwo sites, the user can select the value of some indices, orcompare the values of two indices, etc. Example 9, showsan MRL query which returns the names and the valuesof all the indices.

Example 9. Select daily the names and the values of all themarket indices from the Web.

Q1: use indices iallow $x.y=nom.valeur,name.valuefor $a in document(‘‘my mds”)/mindices/ireturn< result >< name > $a/�/$x/text()< =name >,< value >$a/�/$y/text()< =value > < =result >close i

This query is a semantic query. The clause Allow

declares two semantic variables x and y. In this query,designators are unique identifiers and there are no conflicts

Fig. 9. A fraction of the mu

between the elements in data source conflicts. This query isequivalent to two elementary subqueries Sq1 and Sq2:

Sq1: use boursorama bfor $a in document(‘‘my mds”)/mindices/indicesreturn< result >< name >$a/�/nom/text()< =name >,< value >$a/�/valeur/ text()< =value > < =result >close b

2

for $a in document(‘‘my mds”)/mindices/indices

Sq : use nasdaq n

return< result >< name >$a/�/name/text()< =name >,< value >$a/�/value/ text()< =value > < =result >close n

In Example 10, we illustrate an MRL query which comparesthe values of the two market indices ‘‘NASDAQ100” and‘‘Amex Composite” updated daily on both sites.

Example 10. Compares daily values of two market indicesfrom the Web.

Q1: use indices i, compindices cfor $a in document(‘‘my mds”)/mindiceslet $b :¼ $a/�/valeur where $a/�/nom/text()=‘‘NASDAQ100”

returnlet $c :¼ $a/�/value where $a/i/�/name/text()=‘‘AmexComposite”return< result > serviceð$a= � =interpreter=textðÞ;$a= � =c=name=textðÞ; $b=textðÞ; $c=textðÞ;$a= � =locate=textðÞ; displayÞ < =result >close i,c

This query is an elementary multi-data source query. Itinvolves the multi-data source Indices and the active datasource CompIndices. All designators in this query (value,‘valeur’, name and ‘nom’) are unique identifiers. Query Q1 is

lti-data source Mindices.

G. Nachouki, M. Quafafou / Information Fusion 9 (2008) 523–537 535

decomposed into three elementary mono-data sourcesubqueries Sq1; Sq2; Sq3 as follows:

Sq1: use boursorama bfor $a in document(‘‘my mds”)/mindiceslet $b :¼ $a/�/valeur where $a/�/nom/text()=‘‘NASDAQ100”

return< resultb > $b=textðÞ < =resultb >close b

2

for $a in document(‘‘my mds”)/mindices

Sq :use nasdaq n

let $c :¼ $a/�/value where $b/�/name/text()=‘‘AmexComposite”

return< resultn > $c=textðÞ < =resultn >close n

3

for $a in document(‘‘my mds”)/mindices

Sq : use compindices c

return< result > serviceð$a= � =interpreter=textðÞ;$a= � =name=textðÞ; resultb; resultn;$a= � =locate=textðÞ; displayÞ < =result >close c

Sq1 and Sq2 involve respectively Boursorama and Nas-

daq while Sq3 contains the call of the service CompIndices.Each subquery (Sq1; Sq2) is translated into XQuery and issent to the DSS server in charge of Boursorama and Nasdaq

in order to obtain the value of the market indice from it.When the QS server receives the pair of values of the twomarket indices (i.e. 1549.80, 1474.82), it replaces respec-tively in the clause service of the query Sq3 the two results(resultb, resultn)) with the resulting pair of values. Next, theQS server generates the EXQuery query (corresponding toSq3) and sends it to the DSS server responsible of theservice CompIndices. The S-Wrapper runs this service andreturns the result to the mediator component which returnsit to the user (1548.80).

6. Discussion and future works

An increasing number of users are facing the problem ofaccessing multiple and autonomous data sources availableon the web. The approach, described in this paper, inte-grates online heterogeneous and conflicting data sourcesunlike classical approaches of data integration. Our pro-posal is a loosely integrated approach which takes into con-sideration both conflict management and semantic rules,which must be enriched in order to take into account addi-tional data source. Our approach is characterized by theabsence of a global schema because data sources integra-tion is a very hard task, especially on the web where datasources schema are frequently updated. This approach isvalidated by a prototype called MDSManager.

Two other main approaches have been discussed in therelated work section: GAV and LAV. In the GAVapproach, the global schema is defined in terms of sourcesand the mapping provide direct information about whichdata satisfies the elements of the global schema. Conse-quently, the global schema needs to be rebuilt each timea source is added or deleted, but query rewriting is straight-forward: a simple rule unfolding process is needed torewrite the query in terms of the source relations. In theLAV approach, data sources are described in terms ofthe global schema. In this case, the mapping does not pro-vide direct information about which data satisfies the glo-bal schema. Consequently, there is no need to modify theglobal schema when sources disappear, but it is harder torewrite queries in terms of data sources views: the systemhas to control the sources interaction and their data com-bination in order to answer the query (the Bucket algo-rithm) [16]. The MRL contribution is beyond GAV/LAVapproaches. In fact, our approach looks like P2P as wehave no global schema. Our rules for conflicting manage-ment and semantic may be considered as a subset of asser-tions that define mapping in P2P context.

We compare the performance of MRL with XQuery (ourbaseline language). The performance evaluation is doneusing a Java program and the test platform is a PC underWindows XP, a processor running at 1.06 GHz, 250 MOof RAM. The aim of our test is to show MRL benefits inthe context of heterogeneous data sources. We consider amulti-data source context composed of 100 data sources.Each data source is an XML document that contains at leasteight elements x1; . . . ; x8 linked together with similar links(i.e. two elements xi and xj, i, j = 1 . . .8 i 6¼ j are similar).To clarify our proposal, the elements (x1; . . . ; x8Þ representseveral products (e.g. tomato, lemon, apple etc.) in a catalogand the values of these elements represent the prices of theseproducts expressed with several currencies (e.g. Euro, Dol-lar, Rouble, Yen etc.). Each data source expresses its priceswith the same currency and 10% of data sources (100 datasources in all) express their prices with the same currency.Consequently, for the ten sources which use the same cur-rency (e.g. Euro), the elements x1; . . . ; x8 are linked togetherwith similar links in the conflicts data source (e.g. tomatoesexpressed in Euro in data sources are linked together withsimilar links). On the other hand, the elements x1; . . . ; x8

of sources which use distinct currencies to express the pricesof their products are linked together with scale conflicts (e.g.tomatoes expressed in Euro and Rouble in two distinct datasources are linked together with scale conflict link).

The comparison tests of both XQuery and MRL, consistin selecting from the multi-data source the prices of prod-ucts given in Euro only. We proceed as follows: we selectthe values of the element x1 (e.g. the prices of tomato givenin Euro in the multi-data source), the two elements (x1; x2Þ(e.g. tomato and lemon) and so on until we select the pricesof the eight elements (e.g. x1; . . . ; x8).

We consider two setting: (1) the first one is our baseline.It is used when a user chooses to formulate his/her queries

Fig. 11. Data sources contains up to 100 various prices by product.

536 G. Nachouki, M. Quafafou / Information Fusion 9 (2008) 523–537

manually with XQuery language. In this setting, in order toselect the values of the first product (e.g. tomato), the usermust formulate and submit 100 XQuery queries on therespective data sources (the user does not have informationabout the conflicts of prices given in the multi-data source).Similarly, if he/she selects the values of two products fromthe multi-data source (e.g. tomato and lemon) he/she mustformulate 200 XQuery queries and so on until he/sheselects eight products (800 XQuery queries). In this setting,the total response time is the time to run queries in the cor-responding data sources; (2) in the second setting, the userchooses to use MRL language. In this case, he/she submitsonly one MRL query whatever the number of products toselect. Each MRL query contains a semantic variable S.The domain of this variable is varying at each time. Forexample, for the first query, S takes only the productTomato and for the last query (eight queries in total) thedomain of S is the set of products (eight products). In addi-tion, each product is a multiple identifier in the MRLquery’s scope. Based on conflict management and semanticrules cited above, the first MRL query (select the prices ofTomatoes given in Euro) generates only 10 pertinentXQuery queries (instead of 100 XQuery queries in setting1), the second MRL query generates 20 XQuery queriesand the last generates 80 queries. The total response timeis the sum of: (a) the time to generate pertinent XQueryqueries (e.g. the research of eventual conflicts between theproducts and the generation of queries) and (b) the timeto run these queries in the corresponding data sources.

Figs. 10 and 11 show the time needed to obtain answersto queries using XQuery and MRL. The abscise representsthe total number of products to select from the multi-datasource and the range represents the total response time (inMilliseconds) to return the answers to the user. In our firstexperiment (Fig. 10), we have considered that each datasource contains very few data (e.g. each product providesless than five prices). In the following experiments, we haveincreased the volume of each data source. In Fig. 11, eachproduct can deal with up to 100 various prices. We observethat, if the data sources contain very few data, the response

Fig. 10. Data sources contains less than five various prices by product.

time needed, using MRL, is greater than the time for theexecution of XQuery queries (Fig. 10). This is due to thefact that the generation of pertinent XQuery queries isfairly costly.

If we increase weakly the size of data sources, Fig. 11shows that, MRL becomes immediately advantageous eachtime the scope of the MRL query (i.e. the number of prod-ucts) increases. Compared to the XQuery language, MRLleads to cost-effectiveness.

In the future, we plan to study query processing and itsefficiency over large heterogeneous and autonomous infor-mation sources. Query optimization should take intoaccount scale issues to avoid performance degradationwhich could lead to inefficient query execution plans. Ourobjectives are to study algorithms to discover in a dynamicway, similarities and dissimilarities between elements, tointegrate web services as active data sources and to studythis system’s migration in a Peer-To-Peer environment.

Acknowledgements

We would like to thank Omar Boucelma, Professor atUniversity of Aix-Marseille (U3), for many discussionsand comments about this work. We are also grateful tothe editor and the anonymous referees whose detailed com-ments helped greatly to improve the presentation of thispaper.

References

[1] B. Amann, C. Beeri, I. Fundulaki, M. Scholl, Ontology-basedintegration of XML Web resources, in: Proceedings of the Interna-tional Semantic Web Conference (ISWC), 2002, pp. 117–131.

[2] C. Baru, A. Gupta, B. Ludaesscher, R. Marciano, Y. Papakonstan-tinou, P. Velikhov, V. Chu, XML-based information mediation withMIX, in: Proceedings of the International Conference on Manage-ment of Data, 1999, pp. 597–599.

[3] D. Beneventano, S. Bergamaschi, The MOMIS methodology forintegrating heterogeneous data sources, in: IFIP Congress TopicalSessions, 2004, pp. 19–24.

[4] O. Benjelloun, Active XML: a data-centric perspective on Webservices, Ph.d. thesis of Paris XI University, Orsay, France, 2004.

G. Nachouki, M. Quafafou / Information Fusion 9 (2008) 523–537 537

[5] M.J. Carey, L.M. Haas, P.M. Schwarz, M. Arya, W.F. Cody, R.Fagin, M. Flickner, A.W. Luniewski, W. Niblack, D. Petkovic,J.Thomas, J.H. Williams, E.L. Wimmers, Towards heterogeneousmultimedia information systems: the Garlic approach, in: Proceedingsof the 5th International Workshop on Research Issues in DataEngineering-Distributed Object Management (RIDE-DOM’95),1995, pp. 161–173.

[6] D.C. Faye, G. Nachouki, P. Valduriez, SenPeer: un systeme pair-a-pair de mediation de donnees, International Journal ARIMA 4 (2006)24–48.

[7] A.Y. Halvey, Z.G. Ives, P. Mork, I. Tartarinov, Piazza: datamanagement infrastructure for semantic Web applications, in: Pro-ceedings of the Twelfth International Conference on World WideWeb, 2003, pp. 556–567.

[8] S. Castano, A. Ferrara, S. Montanelli, C. Quix, H-MATCH: analgorithm for dynamically matching ontologies in peer-based systems,in: Proceedings of the VLDB International Workshop on SemanticWeb and Databases (SWDB), 2003, pp. 231–250.

[9] C. Delobel, C. Reynaud, M.C. Rousset, J.P. Sirot, D. Vodislav,Semantic integration in xyleme: a uniform tree-based approach,Journal of Data and Knowledge Engineering 44 (3) (2003) 267–298.

[10] H. Garcia-Molina, J. Hammer, K. Ireland, Y. Papakonstantinou, J.Ullman, J. Widom, Integrating and accessing heterogeneous infor-mation sources in TSIMMIS, in: Proceedings of the AAAI Sympo-sium on Information Gathering, 1995, pp. 61–64.

[11] B. Habegger, M. Quafafou, Building Web information extractiontasks, in: Proceedings of the IEEE/WIC/ACM International Confer-ence on Web Intelligence (WI), 2004, pp. 349–355.

[12] A.Y. Halevy, Answering queries using views: a survey, Journal of theVLDB 10 (4) (2001) 270–294.

[13] R. Hamadi, B. Benatallah, A petri net-based model for Web servicecomposition, Proceedings of the Australasian Database Conference(ADC) 17 (2003) 191–200.

[14] V. Kashyap, A. Sheth, Semantic and schematic similarities betweendatabase objects: a context-based approach, Journal of the VLDB 5(4) (1996) 276–304.

[15] T. Kirk, A.Y. Levy, Y. Sagiv, D. Srivastava, The informationmanifold, in: The American Association for Artificial Intelligence(AAAI) Press, 1995, pp. 85–91.

[16] A.Y. Levy, Logic-based techniques in data integration, in: JackMinker (Ed.), Logic-Based Artificial Intelligence, Kluwer AcademicPublisher, 2000, pp. 1–27.

[17] M. Lenzerini, Data integration: a theoretical perspective, in: Pro-ceedings of the ACM Symposium on Principles Of Database Systems(PODS), 2002, pp. 233–246.

[18] W. Litwin, A. Abdellatif, A. Zeroual, B. Nicolas, Ph. Vigier, MSQL:a multidatabase language, Journal on Information Sciences 49 (1–3)(1989) 59–101.

[19] W. Litwin, A. Abdellatif, An overview of the multidatabase manip-ulation language MDSL (Invited paper), Proceedings of the IEEE 75(5) (1987) 621–632.

[20] P. Lyngbaek, D. McLeod, An approach to object sharing indistributed database systems, in: Proceedings of the InternationalConference on Very Large Databases (VLDB), 1983, pp. 364–375.

[21] I. Manolescu, D. Florescu, D. Kossmann, D. Olteanu, F. Xhumari,Agora: living with XML and relational, in: Proceedings of theInternational Conference on Very Large Databases (VLDB), 2000,pp. 623–626.

[22] I. Manolescu, D. Florescu, D. Kossmann, Answering XML queriesover heterogeneous data sources, in: Proceedings of the Interna-tional Conference on Very Large Databases (VLDB), 2001, pp.241–250.

[23] G. Nachouki, M.P. Chastang, On-line analysis of a Web datawarehouse, in: Proceedings of the International Workshop ofDatabases in Networked Information Systems (DNIS), 2003, pp.112–121.

[24] G. Nachouki, Integration of static and active data sources, in:Proceedings of the Eighteenth International Symposium on Com-puter and Information Sciences (ISCIS), 2003, pp. 147–154.

[25] G. Nachouki, M. Quafafou, M.P. Chastang, MDSManager: a systembased on multidatasource approach for data integration, in: 2005IEEE/WIC/ACM International Conference on Web Intelligence(WI), IEEE Computer Society, 2005, pp. 438–441.

[26] G. Nachouki, A method for information extraction from the Web,in: Proceedings of the IEEE International Conference on Informa-tion and Communication Technologies (ICTTA), 2006, pp. 517–521.

[27] J. Peterson, Petri Net Theory and the Modeling of Systems, PrenticeHall PTR, 1981.

[28] E. Rahm, P.A. Bernstein, A survey of approaches to automaticschema matching, Journal of the VLDB 10 (4) (2001) 334–350.

[29] D. Suciu, G. Vossen, Report on the international workshop on theWeb and databases (WebDb), ACM SIGKDD 2 (1) (2000) 80–83.

[30] A.P. Sheth, J.A. Larson, Federated database systems for managingdistributed, heterogeneous, and autonomous databases, ACM Com-puting Surveys (CSUR) 22 (3) (1990) 183–236.

[31] A. Tomasic, L. Raschid, P. Valduriez, Scaling heterogeneousdatabases and the design of Disco, in: Proceedings of the DistributedComputing Systems Conference (DCSC), 1996, pp. 449–459.

[32] G. Wiederhold, Mediators in the architecture of future informationsystems, IEEE Computer 25 (3) (1992) 38–49.

[33] xquery 1.0: An XML Query Language, http://www.w3.org/TR/xpath-datamodel.

[34] XML: Extensible Markup Language, http://www.w3.org/TR/REC-xml.