A Hybrid Match Algorithm for XML Schemas

K. Claypool, V. Hegde, N. Tansalarak, UMass – Lowell, ICDE ‘06

Ray Dos Santos, Aug 21, 2009



2

XML integration

A hybrid match algorithm that provides a framework for analyzing and exploiting semantic and structural information inherent in XML schemas

Objective: find corresponding entities.

3

XML integration

An XML match taxonomy: categorizes the structural and semantic overlap between two given XML schemas (qualitative)

A weight-based match model: evaluates the quality of match, assigning it an absolute numeric value (quantitative)

4

QoM Classification

Four Axes:

Label: atomic

Properties: atomic

Children: many values

Nesting Level: atomic

5

Atomic Values Exact Match:

the value v1 of the axis, where axis is either the label, properties, or level axis, in schema S1 is identical to the value v2 of the same axis in schema S2.


6

Atomic Values Relaxed Match:

the value v1 of the axis, where axis is either the label, properties, or level axis, in schema S1 has some degree of match (but not exact) to the value v2 of the same axis in schema S2.


Need a linguistic match algorithm

7

Atomic Values Relaxed Match:

Level: values are identical

Properties: decided individually. The property value of the source is a specialization or generalization of the target.

Ex: minOccurs, maxOccurs, and type; minOccurs=0 vs. minOccurs=1
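The exact and relaxed cases for the atomic axes can be sketched as follows. This is only an illustration: the `SYNONYMS` table is a hypothetical stand-in for a real linguistic match algorithm, and the names `AxisMatch`, `match_label`, and `match_min_occurs` are not taken from the paper.

```python
from enum import Enum


class AxisMatch(Enum):
    EXACT = "exact"
    RELAXED = "relaxed"
    NONE = "none"


# Hypothetical synonym table standing in for a linguistic matcher.
SYNONYMS = {
    frozenset({"po", "purchaseorder"}),
    frozenset({"purchasedate", "date"}),
}


def match_label(l1: str, l2: str) -> AxisMatch:
    """Exact if the labels are identical, relaxed if they are linguistically related."""
    if l1 == l2:
        return AxisMatch.EXACT
    a, b = l1.lower(), l2.lower()
    if a == b or frozenset({a, b}) in SYNONYMS:
        return AxisMatch.RELAXED
    return AxisMatch.NONE


def match_min_occurs(src: int, tgt: int) -> AxisMatch:
    """Assumption: two different minOccurs values are always related by
    specialization/generalization (e.g. 0 generalizes 1), so they match relaxed."""
    return AxisMatch.EXACT if src == tgt else AxisMatch.RELAXED
```

For instance, `match_label("PurchaseDate", "Date")` falls into the relaxed case, while `match_min_occurs(0, 1)` is relaxed because an optional element generalizes a required one.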

8

Set-Valued Elements (children axis) Coverage Match:

Total: all children (sub-elements and attributes) of the source element have a match with some child of the target element

9

Set-Valued Elements (children axis) Coverage Match:

Partial: some but not all the children of the source element have a match with children of the target element
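A minimal sketch of the total/partial coverage test, assuming children are represented by their labels and that a caller-supplied predicate `is_match` (hypothetical) decides whether two children match:

```python
from typing import Callable, Sequence


def coverage(
    source_children: Sequence[str],
    target_children: Sequence[str],
    is_match: Callable[[str, str], bool],
) -> str:
    """Return "total", "partial", or "none" for the children axis."""
    # A source child is covered if it matches at least one target child.
    covered = [
        c for c in source_children
        if any(is_match(c, t) for t in target_children)
    ]
    if not source_children or not covered:
        return "none"
    return "total" if len(covered) == len(source_children) else "partial"
```

With plain label equality as the predicate, `["OrderNo", "PurchaseDate"]` against `["OrderNo", "Date"]` is a partial match; a predicate that also accepts `PurchaseDate`/`Date` would make it total.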

10

XML Match Taxonomy

Leaf Match: A match between two leaf elements E1 and E2 is said to be exact, E1 = E2, if both their labels and their sets of properties match exactly

A match between two leaf elements E1 and E2 is said to be relaxed, if either the label or any of the properties of element E1 have a relaxed match with the label and the properties of E2 respectively.

11

XML Match Taxonomy

Subtree Match (intermediate node):

(1) the number of children matches;

(2) the quality of match of the children;

(3) the quality of match along the atomic valued axes of the root node (of the sub-tree).

Children axis:

Total exact: all children to all children

Total relaxed: all children to some children

Partial exact: some children to some children

Partial relaxed: some children to some children

12

Combining the Axes

Total exact: exact match along the label, properties and level axes, and a total exact match along the children axis

Total relaxed: there is one or more relaxed match along any one of the atomic valued axes or a total relaxed match along the children axis

Partial exact: implies an exact match along all atomic valued axes and a partial exact match along the children axis

Partial relaxed: relaxed match along one or more atomic valued axes and/or a partial relaxed match along the children axis

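The four combined classifications can be read as a simple rule: the coverage word (total or partial) comes from the children axis, and the result is exact only if every axis is exact. The sketch below encodes that reading; the function name `combine` and the string encoding of the classes are assumptions made for illustration.

```python
def combine(label: str, properties: str, level: str, children: str) -> str:
    """Combine per-axis results into one classification.

    label, properties, level: "exact" or "relaxed"
    children: "total exact", "total relaxed", "partial exact", or "partial relaxed"
    """
    coverage, child_quality = children.split()
    all_atomic_exact = all(a == "exact" for a in (label, properties, level))
    # Any relaxed axis makes the combined match relaxed.
    quality = "exact" if all_atomic_exact and child_quality == "exact" else "relaxed"
    return f"{coverage} {quality}"
```

For example, relaxed label and properties axes combined with a total relaxed children axis give `total relaxed`, and a single relaxed atomic axis is enough to turn a partial exact children match into `partial relaxed`.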

13

Tree Match

The two root elements PO and Purchase Order have a relaxed match along the label and properties axes.

The PO root has three children; Purchase Order has five children.

There is an exact match between the leaf children nodes labeled OrderNo, and a relaxed match between the children nodes PurchaseDate and Date.

Match the sub-tree rooted at PurchaseInfo with all sub-trees in the Purchase Order.

PurchaseInfo and Purchase Order have a relaxed match along the label and properties axes.

14

Tree Match

The children (leaf nodes) BillingAddr and ShippingAddr have a relaxed match with the leaf nodes BillTo and ShipTo in the Purchase Order.

The sub-trees rooted at nodes Lines and Items, i.e., the two non-leaf nodes Lines and Items, have a total relaxed match.

Combining the matches along the different axes, the QoM for the match between the PO and Purchase Order root nodes is said to be total relaxed.

15

Weight-based match model

A match is classified based on the QoM of four axes: label, properties, children, level

Assign a weight to each individual axis.

The highest match classification, total exact, will always result in QoM(n1, n2) = 1.

Leaf Match: uses the label and properties axes.

Subtree Match: uses all 4 axes. The QoM of node N along the children axis is built from:

The subtree weight: the normalized sum of the QoM of the children

The cardinality ratio: the number of children matched to the number of children
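The slide's formulas did not survive in this transcript, so the following is only a sketch. It assumes a plain weighted sum whose weights add up to 1 (which makes a total exact match score 1), and it assumes the children-axis score is the product of the subtree weight and the cardinality ratio. The weight values and function names are illustrative, not taken from the paper.

```python
def children_qom(matched_child_qoms: list[float], n_children: int) -> float:
    """Children-axis score: normalized sum of child QoMs times the cardinality ratio.

    Assumption: the two quantities are multiplied; the original formula is not shown.
    """
    if n_children == 0 or not matched_child_qoms:
        return 0.0
    normalized_sum = sum(matched_child_qoms) / len(matched_child_qoms)
    cardinality_ratio = len(matched_child_qoms) / n_children
    return normalized_sum * cardinality_ratio


def weighted_qom(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of per-axis scores in [0, 1]; weights must sum to 1."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("axis weights must sum to 1")
    return sum(weights[axis] * scores[axis] for axis in weights)


# Hypothetical weights, chosen only for illustration.
SUBTREE_WEIGHTS = {"label": 0.3, "properties": 0.2, "level": 0.1, "children": 0.4}
LEAF_WEIGHTS = {"label": 0.6, "properties": 0.4}
```

Under these assumptions a leaf match passes `LEAF_WEIGHTS` and only the label and properties scores, while a subtree match passes `SUBTREE_WEIGHTS` with the children score computed by `children_qom`.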

16

Hybrid Match Algorithm

Recursive, depth-first search

Match the roots

Calculate the children axis (QoMc)

Calculate the atomic-valued axes (QoMl, QoMh, QoMp)

Compute the final QoM match
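The steps above can be sketched as a small recursive function. This is not the authors' implementation: the `Node` class, the weights, the synonym table, and the scoring helpers are all hypothetical, and for brevity each source child is only compared with the direct children of the target node.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    props: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


# Hypothetical weights and synonym table, for illustration only.
WEIGHTS = {"label": 0.3, "props": 0.2, "level": 0.1, "children": 0.4}
LEAF_WEIGHTS = {"label": 0.6, "props": 0.4}
SYNONYMS = {frozenset({"po", "purchaseorder"})}


def label_score(a: str, b: str) -> float:
    """1.0 for identical labels, 0.5 for a crude linguistic (relaxed) match."""
    if a == b:
        return 1.0
    x, y = a.lower(), b.lower()
    if x in y or y in x or frozenset({x, y}) in SYNONYMS:
        return 0.5
    return 0.0


def props_score(p1: dict, p2: dict) -> float:
    """Fraction of properties with identical values."""
    keys = set(p1) | set(p2)
    if not keys:
        return 1.0
    return sum(p1.get(k) == p2.get(k) for k in keys) / len(keys)


def qom(n1: Node, n2: Node, level1: int = 0, level2: int = 0) -> float:
    ql = label_score(n1.label, n2.label)
    if ql == 0.0:
        return 0.0  # assumption: unrelated labels never match
    qp = props_score(n1.props, n2.props)
    if not n1.children and not n2.children:
        # Leaf match: label and properties axes only.
        return LEAF_WEIGHTS["label"] * ql + LEAF_WEIGHTS["props"] * qp
    qh = 1.0 if level1 == level2 else 0.0
    # Depth-first: score each source child against its best target child.
    best = [
        max((qom(c1, c2, level1 + 1, level2 + 1) for c2 in n2.children), default=0.0)
        for c1 in n1.children
    ]
    qc = sum(best) / len(n1.children) if n1.children else 0.0
    return (WEIGHTS["label"] * ql + WEIGHTS["props"] * qp
            + WEIGHTS["level"] * qh + WEIGHTS["children"] * qc)
```

Comparing a tree with an identical copy of itself yields 1 under these weights, which mirrors the rule that a total exact match always has QoM = 1.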

17

Experiment

XML schemas from XML Benchmark: http://db.uwaterloo.ca/ddbms/projects/xbench/

Inventory, books, and protein

Compared 3 algorithms: linguistic, structural, and hybrid

18

Experiment

R = real matches; P = matches found by the algorithm
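The measures computed from R and P are not shown in this transcript. Assuming they are the standard precision and recall used in schema-matching evaluations, they could be computed as below; the function name and the representation of a match as a pair of labels are illustrative choices.

```python
def precision_recall(real: set, found: set) -> tuple[float, float]:
    """Standard precision and recall over sets of matches.

    real  (R): the manually determined, correct matches
    found (P): the matches returned by the algorithm
    """
    correct = real & found
    precision = len(correct) / len(found) if found else 0.0
    recall = len(correct) / len(real) if real else 0.0
    return precision, recall
```

For example, if the algorithm returns three matches of which two are among three real matches, both precision and recall are 2/3.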

19

Conclusion

Combined structural matching and linguistic matching into a hybrid algorithm.

Provided a matching taxonomy and a weighted formula applied along the labels, children, properties, and levels of XML elements.

Combined them into an algorithm to determine the highest QoM between two schemas.