

Efficient Processing of Ordered XML Twig Pattern

by Jiaheng Lu, Tok Wang Ling, Tian Yu, Changqing Li, Wei Ni

Presented by: Tian Yu

23 Aug 2005

Outline

Introduction and motivation
Background
  XML tree and twig pattern matching
  Previous two algorithms: TwigStack and TwigStackList
Our Ordered Twig Algorithms
  Ordered Children Extension (for short OCE)
  A generalized holistic matching algorithm: OrderedTJ
Experiments
Conclusion


Introduction

XML data representation is rapidly increasing in popularity.

XML documents are modeled as ordered trees.

XML queries specify patterns of selection predicates on multiple elements that have some structural relationships (parent-child, ancestor-descendant).

What is a Twig Pattern?

A twig pattern is a small tree whose nodes are tags, attributes or text values, and whose edges are either Parent-Child (P-C) edges or Ancestor-Descendant (A-D) edges.

E.g. Query description: select Figure elements that are descendants of Paragraph elements, which in turn are children of Section elements that have a child element Title.

Twig pattern:

Section
  Title (child of Section)
  Paragraph (child of Section)
    Figure (descendant of Paragraph)
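A twig pattern like this one can be written down as a small tree of query nodes. The Python sketch below is only an illustration: the class name `QueryNode`, the edge labels "PC" and "AD", and the helper `root_to_leaf_paths` are assumptions made for this example and do not come from the slides.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class QueryNode:
    """One node of a twig pattern (hypothetical helper, not from the slides)."""
    tag: str
    # Edge to the parent query node: "PC" (parent-child) or "AD" (ancestor-descendant).
    edge: str = "AD"
    children: List["QueryNode"] = field(default_factory=list)


def root_to_leaf_paths(node, prefix=()):
    """Enumerate the root-to-leaf paths of the pattern as tuples of tags."""
    path = prefix + (node.tag,)
    if not node.children:
        return [path]
    paths = []
    for child in node.children:
        paths.extend(root_to_leaf_paths(child, path))
    return paths


# Section has children Title and Paragraph; Figure is a descendant of Paragraph.
twig = QueryNode("Section", children=[
    QueryNode("Title", edge="PC"),
    QueryNode("Paragraph", edge="PC", children=[QueryNode("Figure", edge="AD")]),
])

print(root_to_leaf_paths(twig))
```

The two root-to-leaf paths printed here are the units that the two-phase algorithms discussed later output first and join afterwards.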

Motivation

Since XML documents are modeled as ordered trees, it is natural to have ordered queries.

Four ordered axes: following-sibling, preceding-sibling, following, preceding.

Example:

ordered query: //book/title/following-sibling::chapter

unordered query: //book/title/chapter

Order axis

Four axes: following-sibling, preceding-sibling, following, and preceding.

In the sample document, set the context node to be f.

[Figure: sample XML document, a tree with the nodes a, b, c, d, e, f, g, h, i, j]

Context node: f
Following of f: i and j
Preceding of f: b, c and e
Following-sibling of f: i
Preceding-sibling of f: e

Following-sibling of f = the following nodes of f that share the same parent with f
Preceding-sibling of f = the preceding nodes of f that share the same parent with f
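The relationship between the four axes can be sketched in a few lines of Python. The tree shape of the slide's sample document is not recoverable from the transcript, so the example uses a small hypothetical document (r, x, y, w, z); the function names and the position numbering are also assumptions made for this illustration.

```python
def label(tree, parent=None, counter=None, out=None):
    """Assign (start, end, parent) to every node of a nested (name, children) tree."""
    if counter is None:
        counter, out = [0], {}
    name, children = tree
    counter[0] += 1
    start = counter[0]
    for child in children:
        label(child, name, counter, out)
    counter[0] += 1
    out[name] = (start, counter[0], parent)
    return out


# Hypothetical document (NOT the tree from the slide):
# r has children x, y, z; y has child w.
doc = ("r", [("x", []), ("y", [("w", [])]), ("z", [])])
info = label(doc)


def following(ctx):
    """Nodes that start after the context node ends (descendants excluded)."""
    return sorted(n for n, (s, e, p) in info.items() if s > info[ctx][1])


def preceding(ctx):
    """Nodes that end before the context node starts (ancestors excluded)."""
    return sorted(n for n, (s, e, p) in info.items() if e < info[ctx][0])


def following_sibling(ctx):
    """Following nodes that share the parent of the context node."""
    return [n for n in following(ctx) if info[n][2] == info[ctx][2]]


def preceding_sibling(ctx):
    """Preceding nodes that share the parent of the context node."""
    return [n for n in preceding(ctx) if info[n][2] == info[ctx][2]]


print(following("x"), following_sibling("x"))
print(preceding("w"), preceding_sibling("w"))
```

In this hypothetical document, w follows x but is not a following sibling of x, because its parent is y rather than r.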

Ordered Twig Pattern

//chapter[title=“related work”]/following::section

Intuitive meaning: search for all the sections that appear after (but are not descendants of) chapter elements with the title “related work” in the XML document.

The query node Book is ordered.

If the twig pattern is unordered: section1, section2, and section3 are all matching elements.

But for the ordered query, section1 and section2 are not in the solution. How do we determine that in our method?

Motivation

Naïve method:
  Use the existing algorithm to output the intermediate path solutions for each individual root-leaf query path.
  Merge the path solutions so that the final solutions are guaranteed to satisfy the order predicates of the query.

Disadvantage of the naïve method: many intermediate results may not contribute to final answers.

Our solution: efficient processing of ordered XML twig patterns.


XML Twig Pattern Matching

An XML document is commonly modeled as a rooted, ordered and tagged tree.

[Figure: example tree rooted at book, with children preface, chapter and chapter; the tree also contains section, title and paragraph elements and text values such as “Intro”, “XML” and “Data”]

Region Coding

Node label [1]: (startPos, endPos, LevelNum)

E.g. labels in the example book tree:
  book (1,21,1)
  preface (2,4,2), with a text child at (3,3,3)
  chapter (5,12,2), with children (a section and a title) at (6,8,3) and (9,11,3), whose text children are at (7,7,4) and (10,10,4)
  chapter (13,20,2), with children (a section and a title) at (14,16,3) and (17,19,3), whose text children are at (15,15,4) and (18,18,4)

Given elements e1, e2:
  e1 is an ancestor of e2 iff e1.start < e2.start and e1.end > e2.end.
  e1 is the parent of e2 iff e1.start < e2.start and e1.end > e2.end, and e1.level + 1 = e2.level.

1. M.P. Consens and T. Milo. Optimizing queries on files. In Proceedings of ACM SIGMOD, 1994.
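The ancestor and parent tests on region labels translate directly into code. The following Python sketch uses labels that appear on the slide; the names `Label`, `is_ancestor` and `is_parent` are assumptions chosen for this illustration.

```python
from typing import NamedTuple


class Label(NamedTuple):
    """Region label (startPos, endPos, LevelNum) of one element."""
    start: int
    end: int
    level: int


def is_ancestor(e1: Label, e2: Label) -> bool:
    """e1 is an ancestor of e2 iff e1's region strictly contains e2's region."""
    return e1.start < e2.start and e1.end > e2.end


def is_parent(e1: Label, e2: Label) -> bool:
    """e1 is the parent of e2 iff it is an ancestor exactly one level above."""
    return is_ancestor(e1, e2) and e1.level + 1 == e2.level


book = Label(1, 21, 1)
first_chapter = Label(5, 12, 2)
second_chapter = Label(13, 20, 2)
inner = Label(6, 8, 3)   # a level-3 element under the first chapter
text = Label(7, 7, 4)    # its text child

print(is_ancestor(book, text), is_parent(book, text))
print(is_parent(first_chapter, inner), is_ancestor(second_chapter, inner))
```

Both tests need only the two labels being compared, which is why the algorithms can decide structural relationships without walking the document tree.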


Previous work: TwigStack

TwigStack [2]: a holistic approach

Two-phase algorithm:
  Phase 1 (TwigJoin): intermediate root-leaf path solutions are output.
  Phase 2 (Merge): the intermediate paths are merged to get the final results.

2. N. Bruno, D. Srivastava, and N. Koudas. Holistic twig joins: optimal XML pattern matching. In Proceedings of ACM SIGMOD, 2002.

Sub-optimality of TwigStack

TwigStack is optimal when the query contains only ancestor-descendant relationships.

If the query contains any parent-child relationship, TwigStack may output some intermediate path solutions that cannot contribute to final results.

We say that TwigStack is sub-optimal for queries with parent-child relationships.

TwigStackList

The main problem of TwigStack is that it assumes all edges are ancestor-descendant relationships in the first phase. So it is not efficient for queries with parent-child relationships.

Improved method: TwigStackList [3] (CIKM 2004)
  There is an additional list structure for each query node to cache elements that are likely to participate in final solutions.
  TwigStackList improves on TwigStack, since it considers parent-child relationships in the first phase.
  TwigStackList is optimal when there is no P-C edge for branching nodes (a branching node is a node with more than one descendant or child).

3. J. Lu, T. Chen, and T. W. Ling. Efficient processing of XML twig patterns with parent child edges: a look-ahead approach. In CIKM, pages 533-542, 2004.

TwigStackList vs. TwigStack

TwigStack outputs the “useless” path solution <s1, t1>, since it does not check for the parent-child relationship.

TwigStackList has no useless output: <s1, t1> is not in the output.

[Figure: twig pattern (section, title, paragraph, figure) beside an XML tree with a root and the elements s1, s2, t1, t2, t3, p1, p2, p3, f1, f2; the path <s1, t1> is highlighted. Caption: no parent-child relationship for the branching node]


Ordered Children Extension (OCE)

Definition: an element e_n (of type n) has an OCE if:

1) In the query Q, for every A-D child n’ of n (if any), there is an element e_n’ (with tag n’) that is a descendant of e_n, and e_n’ also has an OCE; and

2) In the query Q, for every P-C child n’ of n (if any), there is an element e’ (with tag n) in the path from e_n to e_n’ such that e’ is the parent of e_n’, and e_n’ also has an OCE; and

3) For each child (or descendant) n’ of n, if there is a node m that is the immediate right sibling of n’, there are elements e_n’ and e_m such that e_n’ is a child (or descendant) of element e_n, e_n’.end < e_m.start, and both e_n’ and e_m have an OCE.

The first two conditions are guaranteed by TwigStackList. Our main focus is on the third condition.

Ordered Children Extension (OCE)

Definition, condition 3): for each child (or descendant) n’ of n, if there is a node m that is the immediate right sibling of n’, there are elements e_n’ and e_m such that e_n’ is a child (or descendant) of element e_n, e_n’.end < e_m.start, and both e_n’ and e_m have an OCE.

[Figure: ordered XML query (node n, marked “>”, with children n’ and m) beside an XML document (element e_n with e_n’ before e_m)]

Ordered Children Extension (OCE)

In an ordered XML query:

If node n is an ordered node: in order to find its OCE, all three previous conditions must be checked.

If node n is an unordered node: in order to find its OCE, only the first two conditions need to be checked. The last condition does not apply.

Ordered Children Extension: Example 1

[Figure: query with the ordered node a (marked “>”) and the children b, c, d; document tree with the elements a1, b1, c1, d1, e1, e2]

a1 has an OCE:

1) a1 has descendants b1 and d1, and child c1 (fulfills conditions 1 and 2 of the OCE definition).

2) b1 has a right sibling element c1, and c1 has a right sibling element d1 (fulfills condition 3 of the OCE definition).

Ordered Children Extension: Example 2

[Figure: query with the ordered node a (marked “>”) and the children b, c, d; document tree with the elements a1, b1, c1, d1, e1]

a1 doesn’t have any OCE:

1) a1 has descendants b1 and d1, and child c1 (fulfills conditions 1 and 2 of the OCE definition).

2) b1 has a right sibling node c1 (fulfills condition 3 of the OCE definition).

3) However, c1 only has d1 as a descendant. There is no element with the label d that is a right sibling of element c1 (doesn’t satisfy condition 3 of the OCE definition).
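The order test of condition 3 can be sketched for the simple case in which the children of the ordered node are leaves. The Python sketch below is a simplified illustration, not the algorithm from the slides: it ignores the P-C versus A-D distinction, the region numbers are hypothetical (the figures give none), and the tree shapes are an assumed reading of the two example documents.

```python
def satisfies_order(parent, candidates_per_child):
    """Check the order condition for one ordered node.

    parent: (start, end) region of the element e_n.
    candidates_per_child: for each query child, in query order, a list of
        (start, end) regions of elements with that child's tag.
    Returns True if one element per child can be picked inside `parent`
    such that every picked element ends before the next one starts.
    """
    p_start, p_end = parent
    prev_end = p_start  # the first pick only has to lie inside the parent
    for candidates in candidates_per_child:
        inside = [(s, e) for (s, e) in candidates if s > prev_end and e < p_end]
        if not inside:
            return False
        # Greedy choice: the earliest-ending candidate leaves the most room.
        prev_end = min(e for (_, e) in inside)
    return True


# Example 1 (assumed shape): a1 has children e1, c1, e2; b1 is under e1, d1 under e2.
a1_ex1 = (1, 12)
ex1 = [[(3, 4)], [(6, 7)], [(9, 10)]]      # candidates for b, c, d

# Example 2 (assumed shape): a1 has children e1, c1; b1 is under e1, d1 under c1.
a1_ex2 = (1, 10)
ex2 = [[(3, 4)], [(6, 9)], [(7, 8)]]       # candidates for b, c, d

print(satisfies_order(a1_ex1, ex1))
print(satisfies_order(a1_ex2, ex2))
```

In the second call the only d candidate starts inside c1, so no d element starts after c1 ends and the check fails, which mirrors the argument given for Example 2.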


Data structure

Each node n in the twig query has: Stream, List, and Stack.

Data stream: Tn
  We partition an XML document into streams.
  All elements in a stream are of the same tag and are ordered by their start position.
  The elements in each stream are read only once, from head to tail.

E.g. for a query with the nodes a, b, c, d (with an order mark “>”):
  Ta: a1, a2, a3
  Tb: b1, b2
  Tc: c1, c2
  Td: d1, d2, d3

[Figure: example document, a four-level tree with the elements a1, a2, a3, b1, b2, c1, c2, d1, d2, d3]
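Partitioning a document into per-tag streams that are read once, from head to tail, could look roughly like the Python sketch below. The element list, the `(tag, start, end)` layout and the names `build_streams` and `Cursor` are assumptions made for this illustration.

```python
from collections import defaultdict


def build_streams(elements):
    """Group (tag, start, end) elements into per-tag streams sorted by start position."""
    streams = defaultdict(list)
    for tag, start, end in elements:
        streams[tag].append((start, end))
    for tag in streams:
        streams[tag].sort()
    return dict(streams)


class Cursor:
    """Forward-only reader over one stream: each element is visited once."""

    def __init__(self, stream):
        self._stream = stream
        self._pos = 0

    def eof(self):
        return self._pos >= len(self._stream)

    def head(self):
        return self._stream[self._pos]

    def advance(self):
        self._pos += 1


# Hypothetical elements, deliberately given out of document order.
elements = [("b", 3, 4), ("a", 2, 7), ("c", 5, 6), ("a", 1, 10), ("b", 8, 9)]
streams = build_streams(elements)

cursor = Cursor(streams["a"])
visited = []
while not cursor.eof():
    visited.append(cursor.head())
    cursor.advance()

print(streams)
print(visited)
```

The cursor offers no way to move backwards, which reflects the read-once constraint stated above.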

Data structure

Each node n in the twig query has: Stream, List, and Stack.

List: Ln
  The elements in the lists help to check for P-C relationships.
  Elements in each list Ln are strictly nested from the first to the last, i.e. in the XML document, each element is an ancestor or parent of the following element.

E.g. for the query nodes a, b, c, d:
  La: a1, a2…
  Lb: b1 ..
  Lc: c1
  Ld: d1, d3

Data structure

Each node n in the twig query has: Stream, List, and Stack.

Stack: Sn
  Stacks are used to store elements that have at least one OCE.
  Elements in the stacks are potential solutions of the XML query.
  When we insert a new element into a stack, the top element of the stack is popped if it does not have an A-D relationship with the new element.

Each of the query nodes a, b, c, d has its own stack: Sa, Sb, Sc, Sd.
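The stack rule can be sketched as follows. This Python fragment is an illustration under assumptions: elements are plain `(start, end)` regions, the function name `push` is made up, and the pop test is repeated until an ancestor is on top (the slide only mentions the top element).

```python
def push(stack, elem):
    """Push elem = (start, end); first pop entries that are not ancestors of it."""
    start, end = elem
    # Assumption: keep popping until the top of the stack contains the new element.
    while stack and not (stack[-1][0] < start and stack[-1][1] > end):
        stack.pop()
    stack.append(elem)


stack = []
push(stack, (1, 10))   # stack: [(1, 10)]
push(stack, (2, 5))    # nested in (1, 10), so both stay
push(stack, (6, 9))    # (2, 5) is not an ancestor of (6, 9) and is popped
print(stack)
push(stack, (11, 12))  # nothing on the stack contains (11, 12)
print(stack)
```

After every push the stack holds a chain of nested elements, so each entry is an ancestor of the entries above it.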

A holistic matching algorithm: OrderedTJ

We propose a general algorithm, OrderedTJ, that computes answers to an ordered query twig.

Our key focus is to check the ordered nodes in the query and find elements that have at least one OCE.

Main function OrderedTJ

The main function operates in two phases.

Phase 1: Parts of the query root-leaf paths are output. The ordering requirements of the ordered query are checked.

Phase 2: These solutions are merge-joined to compute the answers to the whole query.

[Figure: the main function, annotated with “Phase 1”, “Phase 2” and “Important function”]

getNext(n)

It gets the next stream to be processed and advanced.

[Figure: getNext(n), annotated with “Check Order” and “Check P-C”]

An example of OrderedTJ algorithm

Query: a twig over the nodes Book, Chapter, Title, “Related work” and Section, with an order mark “>”. Chapter, Title and “Related work” lie on one root-leaf path; Section lies on the other.

Streams:
  Book: b1
  Chapter: c1, c2, c3
  Title: t1, t2, t3
  Section: s1, s2, s3
  “Related work”: “related work”

[Figure: document tree rooted at b1 with the chapters c1, c2, c3, the titles t1, t2, t3, the sections s1, s2, s3 and the text values “Introduction”, “Related work” and “Algorithm”]

Actions, in the order shown on the slides:

1) Partition the XML document into streams.
2) Show lists for nodes with a P-C child.
3) Show stacks of every node in the query.
4) advance(Title): t1 has no descendant “related work”.
5) Insert t2 into the list of Title: t2 has the descendant “related work”.
6) advance(Chapter): c1 has no descendant title that has the child “related work”.
7) Insert c2 into the list of Chapter: c2 has a descendant t2 that has the child “related work”.
8) advance(Section): s1 is not a following element of c2.
9) advance(Section): s2 is not a following element of c2.
10) Push b1 into the stack of Book: b1 has an OCE.
11) Push c2 into the stack of Chapter: c2 has an OCE.
12) Push t2 into the stack of Title: t2 has an OCE.
13) Push “r…” into the stack of “Related work”: “r…” is the leaf node.
14) Output: b1, c2, t2, “r…”. A path is found.
15) Push s3 into the stack: s3 is a leaf node and follows element c2.
16) Output: b1, s3. A path is found.
17) Join the output paths (b1, c2, t2, “r…”) and (b1, s3): a match is found.
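The final join of the two output paths can be sketched in Python as below. This is a simplified nested-loop stand-in for the merge-join of phase 2, not the implementation from the slides; the dictionary layout, the key name "Text" and the function name `merge_join` are assumptions made for the illustration.

```python
def merge_join(paths_a, paths_b):
    """Join two lists of path solutions (dicts: query node -> element id).

    Two paths are combined when they agree on every query node they share
    (in this example, the shared node is Book).
    """
    results = []
    for pa in paths_a:
        for pb in paths_b:
            shared = pa.keys() & pb.keys()
            if all(pa[q] == pb[q] for q in shared):
                results.append({**pa, **pb})
    return results


# The two path solutions output in the walk-through above.
chapter_paths = [{"Book": "b1", "Chapter": "c2", "Title": "t2", "Text": "r…"}]
section_paths = [{"Book": "b1", "Section": "s3"}]

matches = merge_join(chapter_paths, section_paths)
print(matches)
```

Because the order requirement was already checked in phase 1 (s1 and s2 were skipped), the join itself only has to compare the shared element b1.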

Optimality of OrderedTJ

TwigStack doesn’t consider P-C relationships; therefore, it produces more intermediate results than TwigStackList. Therefore, we compare the optimality of our OrderedTJ with TwigStackList.

Example: we match the ordered Query 1 against XML Document 1 using the two algorithms, TwigStackList and OrderedTJ.

[Figure: Document 1, a tree with the elements a1, a2, b1, c1; Query 1, with the ordered node a (marked “>”) and the children b and c; Query 2, the same twig without the order mark]

TwigStackList can only solve an ordered XML query with the naïve method. Therefore, it converts Query 1 to Query 2 by removing the ordered sign in the twig pattern.

Sub-optimality of TwigStackList: when there is a P-C relationship at the branching node, there could be redundant intermediate output. In this example, the elements in the streams are read only once, from head to tail. Therefore, when TwigStackList processes the elements a1, c1, and b1, there is no way to decide whether there is an element b2 that is a child of a1. Therefore, the algorithm outputs the useless solution <a1, c1>.

Optimality of OrderedTJ: it allows the existence of a parent-child relationship in the first branching edge for the ordered node. In this example, when OrderedTJ processes the elements a1, c1, and b1, there is no element with tag name b before c1, so condition 3 in the definition of OCE is not satisfied and c1 does not contribute to any final answer. Therefore, the algorithm doesn’t output the useless solution <a1, c1>.

Optimality of OrderedTJ

TwigStack: optimal for A-D only queries.

TwigStackList: optimal for queries that only have A-D edges for branching nodes. The other edges in the query can be P-C edges.

OrderedTJ: it allows the existence of a parent-child relationship in the first branching edge for the ordered nodes.

[Figure: nested query classes. TwigStack optimality: A-D only. TwigStackList optimality: A-D for branching node. OrderedTJ optimality: P-C for 1-Branch of ordered node. Each class is illustrated with a small example query over the nodes A, B, C, D, E]


Experiments

Algorithms for comparison:
  straightforward-TwigStack (STW for short)
  straightforward-TwigStackList (STWL)
  our proposed OrderedTJ

Benchmarks:
  XMark: synthetic data. Size: 115 MB, factor: 1.0
  TreeBank: real data from the Wall Street Journal. Size: 82 MB, nodes: 2.5 million

Experiments

Testing queries: Q1, Q2, Q3 for XMark; Q4, Q5, Q6 for TreeBank

Evaluation metrics:
  Number of intermediate path solutions
  Total running time

Experiments: Execution Time

OrderedTJ outputs fewer intermediate results; therefore, it has a shorter execution time.

Experiments: Intermediate result

OrderedTJ has the smallest intermediate results.

Query  Dataset   STW    STWL   OrderedTJ  Useful solutions
Q1     XMark     71956  71956  44382      44382
Q2     XMark     65940  65940  10679      10679
Q3     XMark     71522  71522  23959      23959
Q4     TreeBank  2237   1502   381        302
Q5     TreeBank  92705  92705  83635      79941
Q6     TreeBank  10663  11     5          5

Table 1. The number of intermediate path solutions
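The amount of wasted work per algorithm follows from Table 1 by subtracting the useful solutions from each intermediate count. The short Python sketch below does this arithmetic; the numbers are copied from Table 1, while the variable and function names are assumptions made for the illustration.

```python
# (STW, STWL, OrderedTJ, useful solutions) per query, copied from Table 1.
table1 = {
    "Q1": (71956, 71956, 44382, 44382),
    "Q2": (65940, 65940, 10679, 10679),
    "Q3": (71522, 71522, 23959, 23959),
    "Q4": (2237, 1502, 381, 302),
    "Q5": (92705, 92705, 83635, 79941),
    "Q6": (10663, 11, 5, 5),
}


def redundant(query):
    """Intermediate path solutions that do not contribute to a final answer."""
    stw, stwl, otj, useful = table1[query]
    return {"STW": stw - useful, "STWL": stwl - useful, "OrderedTJ": otj - useful}


for q in table1:
    print(q, redundant(q))
```

By this measure OrderedTJ produces no redundant path solutions for Q1, Q2, Q3 and Q6, and a nonzero remainder only for Q4 and Q5.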

For all queries, OrderedTJ has the smallest intermediate results.

Query 1: only A-D edges; therefore, STW and STWL output the same intermediate results.

However, OrderedTJ has fewer intermediate results, since it also considers the ordering relationship.

[Figure: Query 1, a twig over the nodes test, bold and keyword, with an order mark “>”]

Query 4: it has P-C edges for non-branching nodes. Therefore, STWL outputs fewer intermediate results than STW.

OrderedTJ outputs even fewer intermediate results, since it also considers the ordering relationship.

OrderedTJ still has redundant intermediate results compared with the final useful results. This is because there are P-C edges on the second branch of the ordered node PP.

[Figure: Query 4, a twig over the nodes S, VP, PP, IN, NP and VBN, with an order mark “>” on PP]

Query 6: STWL outputs fewer intermediate results than STW, since there is a P-C edge in the query.

OrderedTJ outputs no redundant intermediate results compared with the final useful results. This is because it only has a P-C edge on the first branch of the ordered node PP.

OrderedTJ is optimal in this case.

[Figure: Query 6, a twig over the nodes S, DT and PRP_DOLLAR_, with an order mark “>”]


Conclusions

We developed a new algorithm, OrderedTJ, to solve the problem of ordered twig pattern matching.

Our algorithm OrderedTJ can identify a larger query class to guarantee I/O optimality.

Experimental results showed the effectiveness, scalability, and efficiency of our algorithm.

Future work: implement a more efficient indexing method, e.g. B-tree or R-tree, to skip XML elements.

References

[1] M.P. Consens and T. Milo. Optimizing queries on files. In Proceedings of ACM SIGMOD, 1994. (Node label: region encoding.)

[2] N. Bruno, D. Srivastava, and N. Koudas. Holistic twig joins: optimal XML pattern matching. In SIGMOD Conference, pages 310-321, 2002. (Proposes the TwigStack algorithm.)

[3] J. Lu, T. Chen, and T. W. Ling. Efficient processing of XML twig patterns with parent child edges: a look-ahead approach. In CIKM, pages 533-542, 2004. (Proposes the TwigStackList algorithm.)

[4] Y. Chen, S. B. Davidson, and Y. Zheng. BLAS: An efficient XPath processing system. In Proc. of SIGMOD, pages 47-58, 2004. (Proposes a new algorithm for XPath queries.)

[5] J. Lu, T. W. Ling, C. Y. Chan, and T. Chen. From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching. In VLDB, 2005. (Proposes a new twig pattern matching algorithm based on a proposed prefix labeling scheme.)


END

Thank you!

Q & A
