TU/e eindhoven university of technology
/faculty of mathematics and informatics
Technology of Information Systems (TIS)
Lecture 3
Contents
• Introduction, 30/11
• Web engineering & Web information systems, 7/12
• Data transformation & data integration, 14/12
• ERP, Smulders (Deloitte), 21/12 + 11/1
• Flower, Berens (Pallas Athena), 25/1 + 1/2
• Biztalk, van den Boom (Microsoft), 15+22/2
Data Transformation & Data Integration
Philippe Thiran, Computer Science Department
Technische Universiteit Eindhoven, The Netherlands
Data Transformation & Integration
• Agenda
  – Problem Statement
    • Existing database systems
    • Heterogeneity, distribution, autonomy
  – Data Transformation
    • Schema conversion
    • Query conversion: Wrapper
  – Data Integration
    • Schema integration
    • Query processing: Multidatabase and Federation
Problem Statement
Existing database systems
Heterogeneity, distribution, autonomy
Problem Statement: Existing Database Systems
• Existing Database Systems
  – Data are recorded in existing database systems
  – Existing database systems are:
    • Mission critical (essential to the organization's business)
    • Required to be operational at all times
    • Inflexible
  – Typically, existing database systems are:
    • Very large (millions of lines of code)
    • Old (often more than 10 years old)
    • Written in old programming languages such as COBOL, PL/1, SQL!
    • Built around an old DBMS
Problem Statement: Existing Database Systems
• Existing Database Systems
  – Data are recorded in existing database systems
  – They were built to answer old requirements
• New functions and services
• New user requirements
• New technology (Web)
• Communication among them?
Problem Statement: Existing Database Systems
• Existing Systems: New Services
  – How to deal with existing database systems?
    • Abandon the existing systems: migration to a new system
    • Keep and modify the existing systems
    • Keep the existing systems and wrap them: autonomy
• Existing Systems: Communication
  – How to integrate existing database systems?
Problem Statement: Data Integration
• Data Integration Problems
  – Integrating database systems is very hard and costly
  – Three main dimensions of the problem:
    • Distribution
    • Autonomy
    • Heterogeneity
[Figure: three axes (Distribution, Autonomy, Heterogeneity) locating centralized DBMSs and distributed databases]
Problem Statement: Data Integration
• Autonomy
  – Autonomy refers to the distribution of control
  – Four dimensions of autonomy:
    • Design: each system has its own data models and its own transaction management technique
    • Communication: a system knows neither of the existence of other systems nor how to communicate with them
    • Execution: each system executes independently of the other systems
    • Association: each system decides how much of its data and processing capabilities it will share with the other systems
[Figure: the three axes Distribution, Autonomy, Heterogeneity]
Problem Statement: Data Integration
• Heterogeneity
  – Heterogeneity may exist at three basic levels:
    • DBMS level. Data are managed by a variety of DBMSs based on different data models and data languages
      – Data models: relational model, hierarchical model and file model
      – Data languages: SQL, DL/1, COBOL programs
    • Platform level. Different hardware, different network protocols
    • Semantic level. Different designer viewpoints in modelling the same objects of the application domain. Incompatible design specifications lead to different naming, types or integrity constraints
[Figure: the three axes Distribution, Autonomy, Heterogeneity]
Data Integration: Generic Integration Architecture
• Schema Hierarchy
[Figure: schema hierarchy. Local models: Database Schema 1 (DB1, relational DBMS), Database Schema 2 (DB2, OO DBMS), Data Schema 3 (file system). Common model: Export Schemas 1-3, Import Schemas 1-3, Integrated Schema.]
Annotations in the figure:
  – Unifies data models
  – View on export schema available for non-local access
  – Homogenizes and unions import schemas
Data Integration: Generic Integration Architecture
• Schema Hierarchy
[Figure: the same schema hierarchy (database schemas, export schemas, import schemas, integrated schema), with two regions labelled "Data and Schema Transformation" and "Data and Schema Integration"]
Data Transformation
Schema Conversion
Query Conversion: Wrapper
Data Transformation: Schema Conversion
• Introduction
  – Schema conversion
  – Query/Data conversion
[Figure: Data Source 1 and Data Source 2, each with a Database Schema (local data models) and an Export Schema (common data model); the query pairs Query1/Query1', Query2/Query2' and the data pairs Data1/Data1', Data2/Data2' are converted between the two levels]
Data Transformation: Schema Conversion
• Schema Conversion
  – Schema transformation
    • Transformation of a schema expressed in a data model (Ms) into an equivalent schema expressed in another data model (Mt)
    • Examples
      – ER model → Relational model (lecture ISO)
      – Relational model → XML Schema (see later)
    • Schema transformation operators
    • Schema conversion consists in applying the relevant transformations to the relevant constructs of the schema expressed in Ms in such a way that the final result complies with Mt
Data Transformation: Schema Conversion
• Schema Conversion
  – Schema transformation
    • A (schema) transformation is basically an operator by which a source data structure C is replaced with a target structure C'.
    • Example of a semantics-preserving transformation: transforming a relationship type into an attribute
[Figure: entity types A (A1) and B (B1, B2; id: B1) linked by relationship type R (cardinalities 1-1 and 0-N), transformed into A (A1, B1; ref: B1) and B (B1, B2; id: B1)]
RT-FK: Transforming a binary relationship type into a foreign key.
Data Transformation: Schema Conversion
• Schema Conversion
  – 2 main schema transformations for ER model → Relational model
RT-ET: Transforming a relationship type into an entity type. Inverse: ET-RT
[Figure: relationship type R (0-N, 0-N) between A (A1) and B (B1), transformed into an entity type R (id: rB.B, rA.A) linked to A and B by relationship types rA and rB (cardinalities 1-1 and 0-N)]
RT-FK: Transforming a binary relationship type into a foreign key. Inverse: FK-RT
[Figure: relationship type R (1-1, 0-N) between A (A1) and B (B1, B2; id: B1), transformed into A (A1, B1; ref: B1) and B (B1, B2; id: B1)]
Data Transformation: Schema Conversion
• Schema Conversion
  – Exercise: From ER model → Relational model
[Figure: ER schema
  CUSTOMER (Code, Description; id: Code)
  ORDER (Code; id: Code)
  STOCK (Code, Name, Level; id: Code)
  purchase (Tot): relationship type with cardinalities 0-N and 0-N
  place: relationship type with cardinalities 0-N and 1-1
  details (Order-qty): relationship type with cardinalities 0-N and 0-N]
Data Transformation: Schema Conversion
• Schema Conversion
  – Exercise: From ER model → Relational model
[Figure: intermediate schema, in which purchase and details have become entity types
  CUSTOMER (Code, Description; id: Code)
  ORDER (Code; id: Code)
  STOCK (Code, Name, Level; id: Code)
  purchase (Tot; id: pur_CUS.CUSTOMER, pur_STO.STOCK), linked by pur_CUS (1-1, 0-N) and pur_STO (1-1, 0-N)
  details (Order-qty; id: det_ORD.ORDER, det_STO.STOCK), linked by det_ORD (1-1, 0-N) and det_STO (1-1, 0-N)
  place: relationship type with cardinalities 0-N and 1-1]
Data Transformation: Schema Conversion
• Schema Conversion
  – Exercise: From ER model → Relational model
[Figure: resulting relational schema
  CUSTOMER (Code, Description; id: Code)
  ORDER (Code, Cus_Code; id: Code; ref: Cus_Code)
  STOCK (Code, Name, Level; id: Code)
  purchase (P_C_Code, Code, Tot; id: P_C_Code, Code; ref: Code; ref: P_C_Code)
  details (D_O_Code, Code, Order-qty; id: D_O_Code, Code; ref: Code; ref: D_O_Code)]
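The relational schema obtained in the exercise can be written down as SQL DDL. The following is a minimal sketch using Python's built-in sqlite3 module; the column types, the NOT NULL constraints, and the helper name build_schema are assumptions that the slides do not specify.

```python
import sqlite3

# Column types and NOT NULL constraints are assumptions; the slide only gives
# attribute names, identifiers (id) and references (ref).
DDL = """
CREATE TABLE CUSTOMER (Code TEXT PRIMARY KEY, Description TEXT);
CREATE TABLE STOCK (Code TEXT PRIMARY KEY, Name TEXT, Level INTEGER);
CREATE TABLE "ORDER" (
    Code TEXT PRIMARY KEY,
    Cus_Code TEXT NOT NULL REFERENCES CUSTOMER(Code)  -- "place" turned into a foreign key
);
CREATE TABLE purchase (
    P_C_Code TEXT NOT NULL REFERENCES CUSTOMER(Code),
    Code TEXT NOT NULL REFERENCES STOCK(Code),
    Tot REAL,
    PRIMARY KEY (P_C_Code, Code)
);
CREATE TABLE details (
    D_O_Code TEXT NOT NULL REFERENCES "ORDER"(Code),
    Code TEXT NOT NULL REFERENCES STOCK(Code),
    "Order-qty" INTEGER,
    PRIMARY KEY (D_O_Code, Code)
);
"""

def build_schema():
    """Create the five tables in an in-memory database and return the connection."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces references only when asked
    conn.executescript(DDL)
    return conn
```

ORDER and Order-qty are quoted because ORDER is an SQL keyword and Order-qty contains a hyphen.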
Data Transformation: Wrappers
• Definition
  – A wrapper controls a (legacy) data source
  – Basically, a wrapper is a software component that offers a homogeneous query interface based on a common data model (XML for the Web)
  – It converts data and queries between the common data model and a local data model
  – It offers an adequate way of solving the DBMS heterogeneity that appears when one wants to integrate existing, heterogeneous data systems
[Figure: a Wrapper sits between a Data Source with its Database Schema (local data models) and the Export Schema (common data model, common query language)]
Data Transformation: Wrappers
• Definition (ctd)
  – A data wrapper is basically defined as a converter of data and queries
  – That is, a wrapper:
    • Offers an export schema in the common data model
    • Accepts queries against the export schema
    • Translates them into queries understandable by the data system
    • Transforms the results of the local queries into a format understood by the application
[Figure: queries and data flow through the Wrapper between the Export Schema (common data model, common query language) and the Database Schema of the Data Source (local data model, local query language)]
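The four wrapper duties listed above can be sketched in a few lines. This is a hypothetical illustration, not a reference design: the class name RelationalWrapper, its methods, and the choice of SQLite as the local source are all assumptions; XML stands in for the common data model.

```python
import sqlite3
import xml.etree.ElementTree as ET

class RelationalWrapper:
    """Hypothetical two-way wrapper: XML-facing interface over a SQL source."""

    def __init__(self, conn):
        self.conn = conn

    def export_schema(self):
        # Offer an export schema: table name -> list of column names.
        tables = [r[0] for r in self.conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
        return {t: [c[1] for c in self.conn.execute(f'PRAGMA table_info("{t}")')]
                for t in tables}

    def query(self, table, **equals):
        # Accept a simple equality selection against the export schema ...
        schema = self.export_schema()
        if table not in schema or any(k not in schema[table] for k in equals):
            raise ValueError("query does not match the export schema")
        # ... translate it into the local query language (SQL) ...
        sql = f'SELECT * FROM "{table}"'
        if equals:
            sql += " WHERE " + " AND ".join(f'"{k}" = ?' for k in equals)
        rows = self.conn.execute(sql, tuple(equals.values())).fetchall()
        # ... and transform the local result into the common data model (XML).
        result = ET.Element(table.lower())
        for row in rows:
            row_el = ET.SubElement(result, "row")
            for col, val in zip(schema[table], row):
                ET.SubElement(row_el, col.lower()).text = str(val)
        return result
```

Names are checked against the export schema before being placed in the SQL text, and values are passed as parameters, so the translation step does not splice user input into the query.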
Data Transformation: Wrappers
• Categories of Wrappers
  – There is no standard approach to building wrappers
  – Functionality
    • One-way: only transformation of data (e.g., for data warehouses)
    • Two-way: transformation of requests and data
  – Development
    • Hard-wired wrappers, for specific data sources
    • Semi-automated generation: wrapper development tools
    • Automatically generated wrappers
  – Availability
    • Standalone programs (data conversion, data migration)
    • Components of a federation (see later)
    • Database interface for foreign data
Data Transformation: Wrappers
• Wrappers and the Web
  – Wrapper interface
    • Data format: XML
    • Common data model: XML DTD and Schema
    • Common query language: XPath, XQuery, none
  – Wrapper mapping
    • Generally between relational data and XML
    • Two translation types
      – Automated
      – Defined by the user
    • XML- or SQL-oriented query language
Data Transformation: Wrappers
• XML Views of Relational Databases
  – Automated translation

Order
Id  Custname  Custnum
10  Philips   7734
9   Unilever  7725

Item
Oid  Desc       Cost
10   Ship       24000
10   Generator  8000

Payement
Oid  Due      Amt
10   1/10/01  20000
9    6/10/01  12000

<db>
  <order>
    <row><id>10</id><custname>Philips</custname><custnum>7734</custnum></row>
    <row><id>9</id><custname>Unilever</custname><custnum>7725</custnum></row>
  </order>
  <item>
    <row><oid>10</oid><desc>Ship</desc><cost>24000</cost></row>
    <row><oid>10</oid><desc>Generator</desc><cost>8000</cost></row>
  </item>
  <payement>
    similar to <order> and <item>
  </payement>
</db>
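The automated translation follows a fixed rule: one element per table, one <row> per tuple, one child element per column. A minimal sketch of that rule, assuming an SQLite source and a hypothetical function name database_to_xml:

```python
import sqlite3
import xml.etree.ElementTree as ET

def database_to_xml(conn):
    """Map every table to an element, every tuple to <row>, every column to a child."""
    root = ET.Element("db")
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    for table in tables:
        table_el = ET.SubElement(root, table.lower())
        cursor = conn.execute(f'SELECT * FROM "{table}" ORDER BY rowid')
        columns = [d[0].lower() for d in cursor.description]
        for row in cursor:
            row_el = ET.SubElement(table_el, "row")
            for col, val in zip(columns, row):
                ET.SubElement(row_el, col).text = str(val)
    return root
```

In this sketch tables are emitted in alphabetical order, which differs from the order on the slide; the mapping rule itself is the same.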
Data Transformation: Wrappers
• XML Views of Relational Databases
  – User-defined translation
[Tables: the same Order, Item and Payement tables as for the automated translation]

<order id='10'>
  <custname> Philips </custname>
  <items>
    <item description="Ship"> <cost> 24000 </cost> </item>
    <item description="Generator"> <cost> 8000 </cost> </item>
  </items>
  <payements>
    <payement due='1/10/01'> <amount> 20000 </amount> </payement>
  </payements>
</order>
<order id='9'>…</order>
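A user-defined translation nests the items and payments of an order inside the order element instead of keeping one flat element per table. The sketch below builds one such nested element; the function name order_view and the use of SQLite are assumptions, while the table, column and element names follow the slide.

```python
import sqlite3
import xml.etree.ElementTree as ET

def order_view(conn, order_id):
    """Build one nested <order> element from the Order, Item and Payement tables."""
    (custname,) = conn.execute(
        'SELECT Custname FROM "Order" WHERE Id = ?', (order_id,)).fetchone()
    order = ET.Element("order", id=str(order_id))
    ET.SubElement(order, "custname").text = custname
    items = ET.SubElement(order, "items")
    for desc, cost in conn.execute(
            'SELECT "Desc", Cost FROM Item WHERE Oid = ? ORDER BY rowid', (order_id,)):
        item = ET.SubElement(items, "item", description=desc)
        ET.SubElement(item, "cost").text = str(cost)
    payements = ET.SubElement(order, "payements")
    for due, amt in conn.execute(
            "SELECT Due, Amt FROM Payement WHERE Oid = ? ORDER BY rowid", (order_id,)):
        pay = ET.SubElement(payements, "payement", due=due)
        ET.SubElement(pay, "amount").text = str(amt)
    return order
```

The joins on Oid are what the automated translation leaves to the reader of the XML; here they are resolved once, inside the mapping.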
Data Transformation: Wrappers
• XML Views of Relational Databases
  – Exercises
    • What is the XML document of this relational database?
[Figure: relational schema
  Product (Reference, Label[0-1], UnitPrice, Supplier; id: Reference)
  Order (OrderID, Customer, Date, Total[0-1]; id: OrderID)
  Detail (OrderID, Reference, Quantity, Amount; id: OrderID, Reference; ref: OrderID; ref: Reference)]
Data Transformation: Wrappers
• XML Views of Relational Databases
  – Exercises
    • What is the XML document of this relational database?
[Figure: relational schema
  Product (Reference, Label[0-1], UnitPrice, Supplier; id: Reference)
  Order (OrderID, Customer, Date, Total[0-1]; id: OrderID)
  Detail (OrderID, Reference, Quantity, Amount; id: OrderID, Reference; ref: OrderID; ref: Reference)]

<!ELEMENT Catalog (Order*, Product*)>
<!ELEMENT Order (Customer, Date, Total?, Detail+)>
<!ATTLIST Order OrderID ID #REQUIRED>
<!ELEMENT Customer ANY>
<!ELEMENT Date (#PCDATA)>
<!ELEMENT Total (#PCDATA)>
<!ELEMENT Detail (Quantity, Amount)>
<!ATTLIST Detail Product IDREF #REQUIRED>
<!ELEMENT Quantity (#PCDATA)>
<!ELEMENT Amount (#PCDATA)>
<!ELEMENT Product (Supplier+)>
<!ATTLIST Product Reference ID #REQUIRED
                  Label CDATA #IMPLIED
                  UnitPrice CDATA #REQUIRED>
<!ELEMENT Supplier ANY>
Data Transformation: Wrappers
• XML Views of Existing Relational Databases
  – Mapping definition
    • SQL-oriented query language

for $b in SQL("select * from Order where Custname='" + $x + "'")
return <order> {$b/Id} <Custname>{$x}</Custname> </order>

Order
Id  Custname  Custnum
10  Philips   7734
9   Unilever  7725

[Figure: the Order table is mapped to an Order element with children Id and Custname]
Data Transformation: Wrappers
• XML Views of Existing Relational Databases
  – XML View definition
    • Bottom-up (from the relational schema)
    • Top-down (from a given XML schema)
  – Mappings between XML views and relational schemas
    • Automated (algorithm)
    • Manual (defined by the user)
Data Transformation: Wrappers
• XML Views of Existing Relational Databases
  – Examples
Columns: Product Name; SQL-written Mapping; XML-written Mapping; XML Schema; Query over views
Xperanto (IBM); no; yes (XQuery); XML Schema; yes (XQuery); update
Microsoft's SQL Server; yes (FOR XML clause); no; XDR Schema; yes (XPath)
DB2 (IBM); no; yes (subset of XQuery); yes (XQuery); no
Oracle9i; yes; no; no
SilkRoute (AT&T); no; yes (XQuery); XML Schema; yes (XQuery); update
Data Integration
Generic Integration Architecture
Schema Integration
Query Processing: multidatabase and federation
Data Integration
Generic Integration Architecture
Schema Integration
Data Integration: Generic Integration Architecture
• Component Architecture
[Figure: Applications 1-3 access a Mediator (common DDL/DML; integrated schema, import schemas), which accesses one Wrapper per source (export schema); each Wrapper sits on a local DBMS 1-3 with its database DB1-DB3 (local DDL/DML; database schema)]
  – Wrapper
    • Controls a local data source
    • Offers a homogeneous query interface based on a common data model
  – Mediator
    • Offers an abstract integrated view of sources
    • Reconciles independent data structures to yield a unique, coherent view of the data
Data Integration: Generic Integration Architecture
• Aspects to Consider for Integration
  – General issues
    • Bottom-up vs. top-down engineering
      – From existing schemas to an integrated schema, or vice versa
      – Schema integration vs. schema matching
    • Virtual vs. materialized integration
    • Read-only vs. read-write access
    • Transparency
      – Language, schema, location
  – Data model related issues
    • Types of sources
      – Structured, semi-structured, unstructured
    • Common data model of integrated system
    • Tight vs. loose integration
      – Use of a global schema
    • Query model
Data Integration: Schema Integration
• Methodology
  – Bottom-up process
  – Four main steps
    • Preparing the local schemas
    • Detecting what is common between the components of local schemas
      – Correspondence (what is common)
    • Solving the conflicts
      – Conflict (what is incompatible)
    • Integrating the different schemas according to the correspondences and conflicts detected in the previous steps
Data Integration: Schema Integration
• Concept of Correspondence
  – Two complementary views of correspondence:
    • Structural correspondence (schema level: concepts)
    • Instance correspondence (instance level: data)
  – Structural correspondence
    • Five types of structural correspondence:
      – Identity
      – Independence
      – Complementarity
      – Subtyping
      – Common supertype
Data Integration: Schema Integration
• Concept of Correspondence
  – Instance correspondence
    • Four types of instance correspondence:
      – Disjoint: the instance sets of the classes are disjoint
      – Inclusion: the instance set of one class is included in that of another class
      – Equivalence: the classes contain the same instances
      – Overlapping: the classes share some instances but not all
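The four instance correspondences are ordinary set relations, so they can be checked directly on the identifiers of the instances. A minimal sketch, with a hypothetical function name and with instances represented as plain Python values:

```python
def instance_correspondence(a, b):
    """Classify the relation between the instance sets of two classes."""
    a, b = set(a), set(b)
    if a == b:
        return "equivalence"
    if a.isdisjoint(b):
        return "disjoint"
    if a < b or b < a:          # strict subset in either direction
        return "inclusion"
    return "overlapping"
```

For example, instance_correspondence({1, 2}, {2, 3}) returns "overlapping", because the sets share 2 but neither contains the other.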
Data Integration: Schema Integration
• Concept of Conflict
  – Conflicts occur in four possible ways: syntactic (naming conflicts), structural, semantic or instance
  – Syntactic conflicts (resolution: use of an ontology)
    • Synonyms. Two identical objects (entities, attributes, relationships) that have different names are synonyms
    • Homonyms. Two different objects that have identical names are homonyms
  – Structural conflicts (resolution: mapping function or transformation)
    • Domain. Two identical objects have different domains (differences in dimension, units and scales)
    • Structure. The same concept is represented by different data structures (e.g., different attributes)
Data Integration: Schema Integration
• Concept of Conflict
  – Structural conflict
    • In the left-hand schema, Address is a compound attribute, whereas in the right-hand one, Address is represented by an entity type
    • Resolution: transformation
[Figure: Site 1: CUSTOMER (CUSTID, NAME, ADDRESS (STREET, ZIP CODE, CITY); id: CUSTID). Site 2: CUSTOMER (CUSTID, NAME; id: CUSTID) linked by relationship type lives (1-1, 1-1) to ADDRESS (STREET, ZIP CODE, CITY)]
Data Integration: Schema Integration
• Concept of Conflict
  – Semantic conflicts
    • A semantic conflict arises when two representations A and B of the same application domain concept, or two integrity constraints, contradict each other (resolution?)
    • Example
      – In the left-hand schema, Customer is identified by CustId, whereas in the right-hand one, it is identified by Name
[Figure: Site 1: CUSTOMER (CUSTID, NAME, ADDRESS (STREET, ZIP CODE, CITY); id: CUSTID). Site 2: CUSTOMER (CUSTID, NAME, ADDRESS (STREET, ZIP CODE, CITY); id: NAME)]
Data Integration: Schema Integration
• Concept of Conflict
  – Instance conflicts
    • Instance conflicts are specific to existing data
    • Modelling constructs A and B that are recognized as corresponding can cover sets with different scopes
    • Examples
      – ZIP codes of addresses can be written like "NL-5600 MB" or "56oo MB" or "5600"
      – Different ZIP codes can be recorded for the same address (encoding errors)
      – Resolution: data transformation… cleaning?
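The ZIP code variants in the example can be brought to one form by a small cleaning function. The sketch below is only one possible set of rules, and each rule is an assumption: the NL- prefix is dropped, a letter O among the first four characters is treated as a mistyped zero, and the two trailing letters are optional. The function name normalize_zip is hypothetical.

```python
import re

def normalize_zip(raw):
    """Return (digits, letters) for a Dutch-style ZIP code, or None if unparseable."""
    text = raw.strip().upper()
    if text.startswith("NL-"):
        text = text[3:]
    # Assumed cleaning rule: a letter O in the numeric part is a mistyped zero.
    text = text[:4].replace("O", "0") + text[4:]
    match = re.fullmatch(r"(\d{4})\s*([A-Z]{2})?", text)
    if match is None:
        return None
    return match.group(1), match.group(2)
```

With these rules "NL-5600 MB" and "56oo MB" both normalize to ("5600", "MB"), while "5600" yields ("5600", None), which makes the missing letters explicit rather than guessing them. The second example on the slide, different codes recorded for the same address, cannot be repaired by reformatting alone.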
Data Integration
Query Processing: multidatabase and federation
Data Integration: Integration Architecture
• Three Classical Architectures
  – Multidatabases
    • No integrated schema
    • Integrated access to different relational DBMS
  – Federated Databases
    • Integrated schema
    • Integrated access to different DBMS
    • Integrated access to different data sources (on the Web)
  – Data Warehouses
    • Materialized integrated data sources
    • Not covered here
Data Integration: Query Processing
• Classical Architecture: Multidatabase
  – Enables transparent access to multiple (relational) databases
    • Hides distribution and different SQL variants
    • Processes queries and updates against multiple databases (2-phase commit)
    • Does not provide any type of global schema (does not hide the different database schemas)
    • Example: IBM DataJoiner
[Figure: DataJoiner connects through Sybase Open Client and Oracle SQL*Net, over a TCP/IP network, to a Sybase server and an Oracle server]
Data Integration: Query Processing
• Classical Architecture: Multidatabase
  – Multidatabase schema
[Figure:
  Source 1 (Sybase): Authors (ANR, Title, Name, Affiliation; id: ANR); Publications (PNR, Title, Author, Journal; id: PNR)
  Source 2 (Oracle): Writer (FirstName, LastName, NRofPublications; id: FirstName, LastName); Papers (Number, Title, Writer, Published; id: Number)
  Multidatabase schema: Sybase.Authors, Sybase.Publications, Oracle.Writer, Oracle.Papers, each with the attributes and identifier of its source table]
Data Integration: Query Processing
• Classical Architecture: Multidatabase
  – Query processing
Query against the multidatabase schema:
SELECT p2.title
FROM Sybase.PUBLICATIONS p1, Oracle.PAPERS p2
WHERE p1.title = p2.title
Subquery sent to Source 1 (Sybase), returning the Sybase data:
SELECT title FROM PUBLICATIONS
Subquery sent to Source 2 (Oracle), returning the Oracle data:
SELECT title FROM PAPERS
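The multidatabase idea, several schemas reachable through one connection with source-qualified table names and no integrated schema, can be imitated on a small scale with SQLite's ATTACH DATABASE. This is only an analogy under stated assumptions: both "sources" are in-memory SQLite databases named Sybase and Oracle, the column types are guesses, and the rows are made-up sample data.

```python
import sqlite3

def build_multidatabase():
    """One connection standing in for the multidatabase layer over two sources."""
    conn = sqlite3.connect(":memory:")
    # Each ATTACH of ':memory:' creates a separate, independent database.
    conn.execute("ATTACH DATABASE ':memory:' AS Sybase")
    conn.execute("ATTACH DATABASE ':memory:' AS Oracle")
    conn.execute("CREATE TABLE Sybase.PUBLICATIONS "
                 "(PNR INTEGER PRIMARY KEY, Title TEXT, Author TEXT, Journal TEXT)")
    conn.execute("CREATE TABLE Oracle.PAPERS "
                 "(Number INTEGER PRIMARY KEY, Title TEXT, Writer TEXT, Published TEXT)")
    # Made-up sample rows, for illustration only.
    conn.executemany("INSERT INTO Sybase.PUBLICATIONS VALUES (?, ?, ?, ?)",
                     [(1, "Schema Integration", "A. Author", "Journal X"),
                      (2, "Wrappers", "B. Author", "Journal Y")])
    conn.executemany("INSERT INTO Oracle.PAPERS VALUES (?, ?, ?, ?)",
                     [(10, "Schema Integration", "A. Author", "2001"),
                      (11, "Mediators", "C. Writer", "2000")])
    return conn

def common_titles(conn):
    """Run the slide's query: titles present in both sources."""
    rows = conn.execute(
        "SELECT p2.Title FROM Sybase.PUBLICATIONS p1, Oracle.PAPERS p2 "
        "WHERE p1.Title = p2.Title")
    return [title for (title,) in rows]
```

As on the slide, the user still has to know that PUBLICATIONS.Title and PAPERS.Title mean the same thing; the layer only hides where the tables live.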
Data Integration: Query Processing
• Classical Architecture: Multidatabase
  – Main properties
    • Transparency
      – Low level of transparency provided to the user (the user is responsible for finding the relevant information, understanding each database schema, detecting and resolving the semantic conflicts and, finally, building the required view of the data in the sources)
    • Autonomy
      – Does not intrude on the autonomy of the data sources
      – Suitable when component systems are strongly autonomous
    • Methodology
      – Simple, since there is no schema integration
    • Maintenance and evolution
      – No integrated schema to maintain
Data Integration: Query Processing
• Classical Architecture: Federation
  – Integrated schema(s) and unique interface
    • Hides the semantic and location heterogeneity
    • Wrapper/Mediator hierarchy
      – Wrapper
        » Controls a local data source
        » Offers a homogeneous query interface based on a common data model
      – Mediator
        » Offers an abstract integrated view of several sources
        » Reconciles independent data structures to yield a unique, coherent view of the data
  – Research projects
    • Tsimmis (Stanford)
    • Garlic (IBM)
    • Oasis (Dublin University)
Data Integration: Query Processing
• Classical Architecture: Federation
  – Typical example
[Figure: a Mediator (views, integrated schema, import schemas) on top of two Wrappers, each providing an export schema, over an Oracle SQL DBMS and an XML DBMS]
Schema fragments shown in the figure:

<complexType name="Book">
  <element name="title" type="string"/>
  <element name="authors" type="string"/>
  <element name="pages" type="string"/>
</complexType>

<complexType name="Book">
  <element name="title" type="string"/>
  <element name="author" type="string"/>
</complexType>

<!ELEMENT Book (title, author)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>

Relational schema shown in the figure:
  Authors (ANR, Title, FirstName, Surname, Affiliation; id: ANR)
  Publication (PNR, Title, Authors, Journal, Pages; id: PNR)
Data Integration: Query Processing
• Classical Architecture: Federation
  – Typical example
Schema fragments shown in the figure, in order:

<complexType name="Book">
  <element name="title" type="string"/>
  <element name="authors" type="string"/>
  <element name="pages" type="string"/>
</complexType>

<complexType name="Book">
  <element name="title" type="string"/>
  <element name="author" type="string"/>
</complexType>

<complexType name="Book">
  <element name="title" type="string"/>
  <element name="authors" type="string"/>
</complexType>
Views

<complexType name="Book">
  <element name="title" type="string"/>
  <element name="author" type="string"/>
</complexType>
Import schema DB1, Import schema DB2

<complexType name="Book">
  <element name="title" type="string"/>
  <element name="author" type="string"/>
</complexType>
Integrated schema
Data Integration: Query Processing: Federation
Submit query Q
Q = FOR $b IN //Book RETURN $b/author
Q1 = FOR $b IN //Book RETURN $b/authors
Q2 = FOR $b IN //book RETURN $b/author
Q1' = SELECT a.name FROM AUTHORS A
Q2' = //book/author
[Figure: Q is split into Q1 and Q2; Q1 is translated into Q1' for the Oracle SQL DBMS and Q2 into Q2' for the XML DBMS; the answers A1 and A2 flow back up]
A1 = {<authors> … </authors>}
A2 = {<author> … </author>}
A1' = {<author> … </author>}
A = A1' ∪ A2
Return result A
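The flow on this slide (decompose Q, let each wrapper translate its subquery, rename the first answer, then combine) can be sketched as follows. Everything here is a simplified stand-in: the function names are hypothetical, SQLite plays the role of the Oracle source, an in-memory XML tree plays the role of the XML DBMS, and list concatenation stands in for the union.

```python
import sqlite3
import xml.etree.ElementTree as ET

def oracle_wrapper(conn):
    """Q1 -> Q1': answer the authors query in SQL and return <authors> elements (A1)."""
    answers = []
    for (name,) in conn.execute("SELECT a.name FROM AUTHORS a ORDER BY a.name"):
        el = ET.Element("authors")
        el.text = name
        answers.append(el)
    return answers

def xml_wrapper(document):
    """Q2 -> Q2': evaluate //book/author on the XML source (A2)."""
    return document.findall(".//book/author")

def mediator(conn, document):
    """Answer Q = //Book/author over the integrated schema."""
    a1 = oracle_wrapper(conn)
    a2 = xml_wrapper(document)
    a1_prime = []
    for el in a1:                      # A1 -> A1': rename <authors> to <author>
        renamed = ET.Element("author")
        renamed.text = el.text
        a1_prime.append(renamed)
    return a1_prime + a2               # A = A1' combined with A2 (no duplicate removal)
```

The renaming step is where the mediator resolves the naming conflict between the two import schemas; the user of Q only ever sees author elements.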
Data Integration: Query Processing
• Classical Architecture: Federation
  – Main properties
    • Transparency
      – High level of transparency provided to the user. The user is not aware of the distribution and the heterogeneity of the integrated data sources
    • Autonomy
      – Each local data source has control over its sharable information
    • Methodology
      – Problems of defining an integrated schema
  – Web as Loosely Coupled Federation
    • Many different, widely distributed information systems
    • Heterogeneity
      – Structurally homogeneous: XML
      – Semantically heterogeneous: no explicit schemas (ontology?)
    • Autonomy
      – Runtime autonomy: pages change on average every 4 weeks, dangling links
    • Distribution
      – Replication (proxies) and caching frequently used