semantic rdf based integration framework for heterogeneous xml data sources

30
Semantic RDF Semantic RDF Based Based Integration Integration Framework Framework for for Heterogeneous Heterogeneous XML XML Data Data Sources Sources Deniz KILINÇ Deniz KILINÇ [email protected] [email protected]

Upload: deniz-kilinc

Post on 18-Dec-2014

369 views

Category:

Technology


7 download

DESCRIPTION

An approach brings a semantic RDF-based solution to XML integration problem...

TRANSCRIPT

Page 1: Semantic RDF based integration framework for heterogeneous XML data sources

““Semantic RDF Semantic RDF Based Based Integration Integration Framework Framework for for Heterogeneous Heterogeneous XMLXML Data Data SourcesSources”” Deniz KILINÇDeniz KILINÇ

[email protected]@hotmail.com

Page 2: Semantic RDF based integration framework for heterogeneous XML data sources

AgendaAgenda

IntroductionIntroduction IntegrationIntegration Related WorksRelated Works Description of MethodDescription of Method A Real Life ScenarioA Real Life Scenario Scenario Design Using Our Approach Scenario Design Using Our Approach ConclusionConclusion ReferencesReferences

Page 3: Semantic RDF based integration framework for heterogeneous XML data sources

Introduction Introduction (I)(I)

Different database servers like MSSQL, Different database servers like MSSQL, Oracle and DB2-400 can store and manipulate Oracle and DB2-400 can store and manipulate datadata

When data is spread across various databases When data is spread across various databases then there is some integration to perform. then there is some integration to perform.

For integration, a common standard is For integration, a common standard is required.required.

World Wide Web Consortium (W3C) World Wide Web Consortium (W3C) announced Extensible Markup Language announced Extensible Markup Language (XML) technology as a universal data (XML) technology as a universal data exchange formatexchange format

Page 4: Semantic RDF based integration framework for heterogeneous XML data sources

Introduction Introduction (II)(II)

XMLXML easy to learn, implement, read, and testeasy to learn, implement, read, and test effective, portable and easily customized data effective, portable and easily customized data

format that can easily sent over through any format that can easily sent over through any protocolprotocol

As a point of Business-to Business (B2B) view As a point of Business-to Business (B2B) view XML is a shortened product development time XML is a shortened product development time for most XML-related and data exchange. for most XML-related and data exchange.

In B2B, if you exchange dataIn B2B, if you exchange data, , you are you are probably using some form of Electronic Data probably using some form of Electronic Data Interchange (EDI) which is costly.Interchange (EDI) which is costly.

Page 5: Semantic RDF based integration framework for heterogeneous XML data sources

Integration Integration (I)(I)

Difficulty of integration depends on being Difficulty of integration depends on being homogenous or heterogeneous of XML data. homogenous or heterogeneous of XML data.

IntegratingIntegrating data which is homogenous is data which is homogenous is simple.simple. EEach document’s schema matches each otherach document’s schema matches each other

IIn real world, different XML source elements n real world, different XML source elements and attributes may have different names, and attributes may have different names, orders and types in their schemas, because of orders and types in their schemas, because of changes in requirements. changes in requirements.

TThe integration will not be as simple as like in he integration will not be as simple as like in homogenous data integration.homogenous data integration.

Page 6: Semantic RDF based integration framework for heterogeneous XML data sources

Integration Integration (II)(II)

To solve integration problemTo solve integration problem,, we offerwe offer To construct SemanticTo construct Semantic RDF (Resource RDF (Resource

Description Framework) based Description Framework) based ““regular regular expressionsexpressions” ” which define the whole which define the whole structure semantically instead of using structure semantically instead of using XPath (XML Path Language) XPath (XML Path Language) andand XSLT XSLT (Extensible Stylesheet Language)(Extensible Stylesheet Language)

easier and more effective compared to other easier and more effective compared to other works in use works in use because it is related because it is related logical logical vocabulary instead of interfering XML vocabulary instead of interfering XML document directlydocument directly..

Page 7: Semantic RDF based integration framework for heterogeneous XML data sources

Related Work Related Work (I)(I)

Researchers try to solve integration Researchers try to solve integration problem, using a predefined global schema problem, using a predefined global schema for the heterogeneous XML data sources for the heterogeneous XML data sources which have their own local schemas. which have their own local schemas.

Their solutions require much complicated Their solutions require much complicated rules for mapping from local schemas to rules for mapping from local schemas to agree upon a global schema. agree upon a global schema.

Mappings are really too complex for IT Mappings are really too complex for IT administrators and they have to know administrators and they have to know XPath and XSLT in detail. XPath and XSLT in detail.

Page 8: Semantic RDF based integration framework for heterogeneous XML data sources

Related Work Related Work (II)(II)

Mappings may not be one-to-one match Mappings may not be one-to-one match from local schema to global-one, from local schema to global-one, because an element in local schema because an element in local schema may not exist in global-one, vice-versa. may not exist in global-one, vice-versa.

They They solve this problem by adding solve this problem by adding empty XML elements into local data empty XML elements into local data sources that changes the original sources that changes the original structure. structure.

Hence too much redundant storage Hence too much redundant storage space is spent.space is spent.

Page 9: Semantic RDF based integration framework for heterogeneous XML data sources

Description of Method Description of Method (I)(I)

Our approach brings a semantic RDF-based Our approach brings a semantic RDF-based solution to XML integration problemsolution to XML integration problem

RDF is acronym of Resource Description RDF is acronym of Resource Description FrameworkFramework

An RDF document abstracts the details of An RDF document abstracts the details of document by presenting a semantic layerdocument by presenting a semantic layer

An RDF document has three partsAn RDF document has three parts identifieridentifier propertyproperty of propertyof property

Page 10: Semantic RDF based integration framework for heterogeneous XML data sources

Description of Method Description of Method (II)(II)

<rdf: RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-<rdf: RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">syntax-ns#">

<rdf: Description about="<rdf: Description about="http://www.cs.deu.edu.trhttp://www.cs.deu.edu.tr">">

<<AuthorAuthor>>Sibel KILINÇSibel KILINÇ</</AuthorAuthor>>

</rdf: Description></rdf: Description>

</rdf: RDF></rdf: RDF>

IdentifierIdentifier,, “http://www.cs.deu.edu.tr”“http://www.cs.deu.edu.tr” property, “Author” property, “Author” object which is the value of property, object which is the value of property,

“Sibel KILINÇ”“Sibel KILINÇ”

Page 11: Semantic RDF based integration framework for heterogeneous XML data sources

Description of Method Description of Method (III)(III)

Sample Regular ExpressionSample Regular Expression

STOCK-NAME (xS, xN)STOCK-NAME (xS, xN):= := Source/STOCK xT, xT/ID xS, xT/NAME xNSource/STOCK xT, xT/ID xS, xT/NAME xN

Source is the name of local schemaSource is the name of local schema Each STOCK element of Source is assigned to a Each STOCK element of Source is assigned to a

temporal variable, xT. temporal variable, xT. Each xT has two properties named ID and Each xT has two properties named ID and

NAME. NAME. xS is an identifier and holds the ID of STOCK. xS is an identifier and holds the ID of STOCK. xN is the name of property and value of xN holds xN is the name of property and value of xN holds

NAME of STOCKNAME of STOCK

Page 12: Semantic RDF based integration framework for heterogeneous XML data sources

Description of Method Description of Method (IV)(IV)

Semantic RDF-based Integration Semantic RDF-based Integration Framework has two major Framework has two major components;components;

Regular Expression Generator Tool Regular Expression Generator Tool (REGT)(REGT)

Integrator Tool Box (ITB)Integrator Tool Box (ITB)

Page 13: Semantic RDF based integration framework for heterogeneous XML data sources

Semantic Semantic RDF-RDF-basedbased

IntegratioIntegrationn

FramewoFrameworkrk

Page 14: Semantic RDF based integration framework for heterogeneous XML data sources

Description of Method Description of Method (V)(V)

Regular Expression Generator Tool (REGT)Regular Expression Generator Tool (REGT),, takes takes N local XML data sources N local XML data sources and and Global Global

Semantic Vocabulary that consists of common Semantic Vocabulary that consists of common XML elements in all local XML data sources with XML elements in all local XML data sources with an identifier and a property like in RDF. The left an identifier and a property like in RDF. The left part of “STOCK-NAME” description is an part of “STOCK-NAME” description is an example element of vocabulary. example element of vocabulary.

produces N Regular Expression Sets with produces N Regular Expression Sets with respect to local data sources. The right part respect to local data sources. The right part “STOCK-NAME” description is an example “STOCK-NAME” description is an example output of REGT. The upper part of framework output of REGT. The upper part of framework presents the processes done at local.presents the processes done at local.

Page 15: Semantic RDF based integration framework for heterogeneous XML data sources

Description of Method Description of Method (VI)(VI)

Integrator Tool Box (ITB)Integrator Tool Box (ITB) combines right sides of the regular combines right sides of the regular

expression set by the guide of expression set by the guide of vocabulary elements using some vocabulary elements using some iteration expressions. iteration expressions.

Finally, Global XML Data source has Finally, Global XML Data source has produced without changing the produced without changing the structure of local XML data sources, structure of local XML data sources, unlike related works.unlike related works.

Page 16: Semantic RDF based integration framework for heterogeneous XML data sources

A Real Life Scenario A Real Life Scenario (I)(I)

ScenarioScenario tthere is a corporation which has a branch here is a corporation which has a branch

office and a center office. office and a center office. ccenter office is in Ankaraenter office is in Ankara branch office is in İzmirbranch office is in İzmir eeach office holds different XML schemas ach office holds different XML schemas

for daily ordersfor daily orders managermanager of the corporation wants to see of the corporation wants to see

some audit reports, so that all orders must some audit reports, so that all orders must be brought togetherbe brought together

aat this point, an integration t this point, an integration process is process is neededneeded

Page 17: Semantic RDF based integration framework for heterogeneous XML data sources

A Real Life Scenario A Real Life Scenario (II)(II) General solution steps to XML integration General solution steps to XML integration

problemproblem A global definition must be agreed on.A global definition must be agreed on. For each source, different mappings must be used For each source, different mappings must be used

to transform local schemas into global-one.to transform local schemas into global-one. All transformed sources must be joined.All transformed sources must be joined.

Related Related Works in use are concentrated on Works in use are concentrated on XML based technologies. XML based technologies. A global XML Schema is defined.A global XML Schema is defined. XPath or XPath or XQueryXQuery mappings are used for mappings are used for

transformation step by changing the structure of transformation step by changing the structure of original local schemas. original local schemas.

Complex and redundant XSL-Transformations are Complex and redundant XSL-Transformations are use for joining step.use for joining step.

Page 18: Semantic RDF based integration framework for heterogeneous XML data sources

A Real Life Scenario A Real Life Scenario (III)(III)

Although integration problem is solved by Although integration problem is solved by using these steps, new two problems arise. using these steps, new two problems arise. First, since XML elements in local data sources First, since XML elements in local data sources

may not exist in global-one, matching becomes a may not exist in global-one, matching becomes a problem. problem. Related WorksRelated Works solve this problem by solve this problem by adding empty XML elements into local data adding empty XML elements into local data sources that changes the original structure. As a sources that changes the original structure. As a result too much redundant storage space is spent. result too much redundant storage space is spent.

Second, to apply transformation and joining steps, Second, to apply transformation and joining steps, IT administrators of each branch must know IT administrators of each branch must know complex XPath mappings and XSLT complex XPath mappings and XSLT transformations.transformations.

Page 19: Semantic RDF based integration framework for heterogeneous XML data sources

A Real Life Scenario A Real Life Scenario (IV)(IV)

In our framework, these problems are solved. In our framework, these problems are solved. A common vocabulary is agreed on. A common vocabulary is agreed on. Instead of adding new elements to schema Instead of adding new elements to schema

structures, RDF-based regular expressions are structures, RDF-based regular expressions are generated from local data sources by using REGT generated from local data sources by using REGT that that facilitates the transformation process instead of facilitates the transformation process instead of formulating the complex XPath queries. formulating the complex XPath queries.

Regular expressions are integrated and Regular expressions are integrated and processed by using ITB that processed by using ITB that facilitates the facilitates the joining processjoining process. Thanks to ITB and REGT, IT . Thanks to ITB and REGT, IT administrators do not need to know XSLT and administrators do not need to know XSLT and XPath technology.XPath technology.

Page 20: Semantic RDF based integration framework for heterogeneous XML data sources

Scenario Design Using Our Scenario Design Using Our Approach Approach (I)(I)

The The predicate rules used in this predicate rules used in this scenarioscenario are; are; STOCK-ORDER(xS,xO), STOCK-ORDER(xS,xO), STOCK-NAME(xS,xN), STOCK-NAME(xS,xN), STOCK-QUANTITY(xS,xQ), STOCK-QUANTITY(xS,xQ), ORDER-DATE(xO,xD) ORDER-DATE(xO,xD) ORDER-CUST(xO,xC) ORDER-CUST(xO,xC)

These predicate rules are the elements These predicate rules are the elements of common vocabulary which is also of common vocabulary which is also known as global definition.known as global definition.

Page 21: Semantic RDF based integration framework for heterogeneous XML data sources

Scenario Design Using Our Scenario Design Using Our Approach Approach (II)(II)

<schema> <element name="ORDERLIST"> <complexType> <sequence> <elemen tname="ORDER“ type="typeOrder"/> </sequence> </complexType> <complexType name="typeOrder"> <element name="NUMBER" type="string"/> <element name="DATE" type="DateTime"/> <element name="CUSTID“ type="string"/> <element name="STOCK" type="typeStock"/> </complexType> <complexType name="typeStock"> <element name="ID" type="string"/> <element name="NAME" type="string"/> <element name="QUANTITY" type="integer"/> </complexType> </element></schema>

<ORDERLIST> <ORDER> <NUMBER>AN001</NUMBER> <DATE>01/03/2004</DATE> <CUSTID>0001</CUSTID> <STOCK> <ID>S0001</ID> <NAME>CANON</NAME> <QUANTITY> 3</QUANTITY> </STOCK> </ORDER> <ORDER> <NUMBER>AN002</NUMBER> <DATE>01/03/2004</DATE> <CUSTID>0002</CUSTID> <STOCK> <ID>S0005</ID> <NAME>PANASONIC</NAME> <QUANTITY>10</QUANTITY> </STOCK> </ORDER></ORDERLIST>

Local schema and a sample XML document of center office, AnkaraLocal schema and a sample XML document of center office, Ankara

Page 22: Semantic RDF based integration framework for heterogeneous XML data sources

Scenario Design Using Our Scenario Design Using Our Approach Approach (III)(III)

REGT takes common semantic vocabulary and local schema of REGT takes common semantic vocabulary and local schema of Ankara as input and produces the following regular expressionAnkara as input and produces the following regular expression set 1.set 1.

* * STOCK-ORDER(xS,xO)STOCK-ORDER(xS,xO) ::= ::= Ankara/ORDERLIST/ORDER xT, xT/STOCK/ID xS, xT/NUMBER Ankara/ORDERLIST/ORDER xT, xT/STOCK/ID xS, xT/NUMBER

xOxO

* * STOCK-NAME(xS,xN) ::= STOCK-NAME(xS,xN) ::= Ankara/ORDERLIST/ORDER/STOCK xT, xT/ID xS, xT/NAME xNAnkara/ORDERLIST/ORDER/STOCK xT, xT/ID xS, xT/NAME xN

* * STOCK-QUANTITY(xS,xQ) ::= STOCK-QUANTITY(xS,xQ) ::= Ankara/ORDERLIST/ORDER/STOCK xT, xT/ID xS, xT/QUANTITY Ankara/ORDERLIST/ORDER/STOCK xT, xT/ID xS, xT/QUANTITY

xQxQ

* * ORDER-DATE (xO,xD)::= ORDER-DATE (xO,xD)::= Ankara/ORDERLIST/ORDER xT, xT/NUMBER xO, xT/DATE xDAnkara/ORDERLIST/ORDER xT, xT/NUMBER xO, xT/DATE xD

* * ORDER-CUST(xO,xC)::= ORDER-CUST(xO,xC)::= Ankara/ORDERLIST/ORDER xT, xT/NUMBER xO, xT/CUSTID xCAnkara/ORDERLIST/ORDER xT, xT/NUMBER xO, xT/CUSTID xC

Page 23: Semantic RDF based integration framework for heterogeneous XML data sources

Scenario Design Using Our Scenario Design Using Our Approach Approach (IV)(IV)

<schema>

<element name="ORDLIST">

<complexType>

<sequence>

<element name="ORD" type="typeOrd"/>

</sequence>

</complexType>

<complexType name="typeOrd">

<element name="ORDID" type="string"/>

<element name="CUSTID" type="string"/>

<element name="DATE" type="DateTime"/>

<element name="STKLIST" type="typeStList"/>

</complexType>

<complexType name="typeStList">

<element name="STK" type="typeSt"/>

</complexType>

<complexType name="typeSt">

<element name="ID" type="string"/>

<element name="QUANT" type="integer"/>

<element name="NAME" type="string"/>

</complexType></element></schema>

<ORDLIST>

<ORD>

<ORDID>IZ001</ORDID>

<CUSTID>0005</CUSTID>

<DATE>01/03/2004</DATE>

<STKLIST>

<STK>

<ID>S0006</ID>

<QUANT>2</QUANT>

<NAME>NIKON</NAME>

</STK>

</STKLIST>

</ORD>

<ORD>

<ORDID>IZ005</ORDID>

<CUSTID>0008</CUSTID>

<DATE>01/03/2004</DATE>

<STKLIST>

<STK><ID>S0005</ID>

<QUANT>5</QUANT>

<NAME>PANASONIC</NAME>

</STK> </STKLIST> </ORD></ORDERLIST>

Local schema and a sample XML document of Local schema and a sample XML document of branchbranch office, office, İzmirİzmir

Page 24: Semantic RDF based integration framework for heterogeneous XML data sources

Scenario Design Using Our Scenario Design Using Our Approach Approach (V)(V)

REGT takes common semantic vocabulary and local REGT takes common semantic vocabulary and local schema of schema of İzmirİzmir as input and produces the following as input and produces the following regular expressionregular expression set 2. set 2.

* * STOCK-ORDER(xS,xO) ::= STOCK-ORDER(xS,xO) ::= İzmir/ORDLIST/ORD xT, xT/STKLIST/STK/ID xS, xT/ORDID xOİzmir/ORDLIST/ORD xT, xT/STKLIST/STK/ID xS, xT/ORDID xO

* * STOCK-NAME(xS,xN) ::= STOCK-NAME(xS,xN) ::= İzmir /ORDLIST/ORD/STKLIST/STK xT, xT/ID xS, xT/NAME xNİzmir /ORDLIST/ORD/STKLIST/STK xT, xT/ID xS, xT/NAME xN

* * STOCK-QUANTITY(xS,xQ) ::= STOCK-QUANTITY(xS,xQ) ::= İzmir/ORDLIST/ORD/STKLIST/STK xT, xT/ID xS, xT/QUANT xQİzmir/ORDLIST/ORD/STKLIST/STK xT, xT/ID xS, xT/QUANT xQ

* * ORDER-DATE(xO,xD)::= ORDER-DATE(xO,xD)::= İzmir /ORDLIST/ORD xT, xT /ORDID xO, xT/DATE xDİzmir /ORDLIST/ORD xT, xT /ORDID xO, xT/DATE xD

* * ORDER-CUST(xO,xC)::= ORDER-CUST(xO,xC)::= İzmir /ORDLIST/ORD xT, xT/ORDID xO, xT/CUSTID xCİzmir /ORDLIST/ORD xT, xT/ORDID xO, xT/CUSTID xC

Page 25: Semantic RDF based integration framework for heterogeneous XML data sources

Scenario Design Using Our Scenario Design Using Our Approach Approach (VI)(VI)

Regular expression set 1 and 2 have Regular expression set 1 and 2 have been prepared to be joined and been prepared to be joined and processed by ITB. processed by ITB.

Manager wants to report the orders for Manager wants to report the orders for ‘PANASONIC’ stock in each office. ‘PANASONIC’ stock in each office.

In report stock name, quantity and In report stock name, quantity and order id will be displayed. order id will be displayed.

Using ITB, the following code section Using ITB, the following code section is produced and run.is produced and run.

Page 26: Semantic RDF based integration framework for heterogeneous XML data sources

Scenario Design Using Our Scenario Design Using Our Approach Approach (VII)(VII)

OutXML(<ORDERLIST>); For $xO in each (STOCK-ORDER(xS,xO)) If ($xS = STOCK-NAME(xS,xN)) And ($xN = ‘PANASONIC’) And ($xS = STOCK-QUANTITY(xS,xQ)) Then OutXML (<ORDEREDSTOCK> <ORDID>$xO</ORDID> <NAME>$xN</NAME> <QUANT>$xQ</QUANT> </ORDEREDSTOCK>); End If NextOutXML(</ORDERLIST>);

<ORDERLIST> <ORDEREDSTOCK> <ORDID>AN002</ORDID> <NAME> PANASONIC </NAME> <QUANT>10</QUANT> </ORDEREDSTOCK> <ORDEREDSTOCK> <ORDID>IZ005</ORDID> <NAME> PANASONIC </NAME> <QUANT>5</QUANT> </ORDEREDSTOCK></ORDERLIST>

Page 27: Semantic RDF based integration framework for heterogeneous XML data sources

ConclusionConclusion A A new approach to integrate XML data new approach to integrate XML data

sources which have different structure is sources which have different structure is presentedpresented

New tools for schema transformation and New tools for schema transformation and regular expression integration are also regular expression integration are also suggestedsuggested

Has advantageHas advantage of RDF based logical semantic of RDF based logical semantic layer with preserving original data sources layer with preserving original data sources during integration processduring integration process

This system guarantees that the global data This system guarantees that the global data source includes all the elements and attributes source includes all the elements and attributes which have defined as a parameter in global which have defined as a parameter in global semantic vocabulary. This vocabulary can be semantic vocabulary. This vocabulary can be changed according to requests of corporation.changed according to requests of corporation.

Page 28: Semantic RDF based integration framework for heterogeneous XML data sources

References References (I)(I)

Bray, T., Paoli, J., Spenger-McQueen, C.M.: Bray, T., Paoli, J., Spenger-McQueen, C.M.: Extensible Markup Language (XML) 1.0. Extensible Markup Language (XML) 1.0. W3C Recommendation, February 1998W3C Recommendation, February 1998

Clarke, J., XSL Transformations (XSLT) Clarke, J., XSL Transformations (XSLT) version 1.0. W3C Recommendation, version 1.0. W3C Recommendation, November 1999November 1999

Halevy, A., Etzioni, O., Doan, A., Ives, Z., Halevy, A., Etzioni, O., Doan, A., Ives, Z., Madhavan, J., McDowell, L., Tatarinov, I.: Madhavan, J., McDowell, L., Tatarinov, I.: Crossing the Structure Chasm. In CIDT’03 Crossing the Structure Chasm. In CIDT’03 (2003)(2003)

Harold, E.R., XML Bible, IDG Books Harold, E.R., XML Bible, IDG Books Worldwide Inc., United States of America, Worldwide Inc., United States of America, 19991999

Kılınc, D., Kut, A.: Xml Teknolojisine Kılınc, D., Kut, A.: Xml Teknolojisine Gerçekçi Yaklaşım. Inet-tr’09 (2003) Gerçekçi Yaklaşım. Inet-tr’09 (2003)

Page 29: Semantic RDF based integration framework for heterogeneous XML data sources

References References (II)(II) Kılınc, D., Kut, A.: XSL Dönüşümleri ve Kılınc, D., Kut, A.: XSL Dönüşümleri ve

Biçimleme. AB’03 (2004) Biçimleme. AB’03 (2004) Popa, L., Velegrakis, Y., Miller, R.J., Popa, L., Velegrakis, Y., Miller, R.J.,

Hernandez, M.A., Fagin, R. : Translating Hernandez, M.A., Fagin, R. : Translating Web Data. Proceedings of the 28th VLDB Web Data. Proceedings of the 28th VLDB Conference (2002) 598-609Conference (2002) 598-609

XML schema. Available at XML schema. Available at http://www.w3.org/XML/Schema..

XPath 2.0. W3C Working Draft, November XPath 2.0. W3C Working Draft, November 20022002

XQuery 1.0: An XML Query Language. XQuery 1.0: An XML Query Language. W3C Working Draft, November 2002W3C Working Draft, November 2002

Page 30: Semantic RDF based integration framework for heterogeneous XML data sources

<TheEnd><TheEnd>

<Thanks>ToAll</<Thanks>ToAll</Thanks>Thanks>

</TheEnd></TheEnd>