On Boosting Holism in XML Twig Pattern Matching Using Structural Indexing Techniques Ting Chen, Jiaheng Lu, Tok Wang Ling

Upload: jasmine-mcgregor

Post on 26-Mar-2015

Page 1: On Boosting Holism in XML Twig Pattern Matching Using Structural Indexing Techniques Ting Chen, Jiaheng Lu, Tok Wang Ling

On Boosting Holism in XML Twig Pattern Matching Using Structural Indexing Techniques

Ting Chen, Jiaheng Lu, Tok Wang Ling

Page 2: On Boosting Holism in XML Twig Pattern Matching Using Structural Indexing Techniques Ting Chen, Jiaheng Lu, Tok Wang Ling

Outline

• Background
    • XML Twig Pattern Query
    • Previous Twig Join algorithms
    • Limit of the original holistic method TwigStack
• Our holistic Twig Pattern Matching algorithms
    • Two Refined Indexing Schemes: Tag+Level and PPS
    • A generalized holistic matching theory
    • iTwigJoin: a generalized holistic matching algorithm
• Experiments
• Conclusion

Page 3: On Boosting Holism in XML Twig Pattern Matching Using Structural Indexing Techniques Ting Chen, Jiaheng Lu, Tok Wang Ling

Background: XML and Region coding

• An XML document is modeled as a tree in our work
• Region Coding for XML document tree
    • <start, end, level> label for each element
    • Containment Property: a.start < b.start AND a.end > b.end if and only if a is an ancestor of b
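The containment property can be checked with two integer comparisons. The following is a minimal sketch, not code from the talk: the `Region` class, the sample labels and the `is_parent` helper are illustrative assumptions. `is_parent` uses the common convention that a parent sits exactly one level above its child, which the slide does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Hypothetical <start, end, level> label of one element."""
    tag: str
    start: int
    end: int
    level: int


def is_ancestor(a: Region, b: Region) -> bool:
    # Containment property from the slide: a's region strictly encloses b's.
    return a.start < b.start and a.end > b.end


def is_parent(a: Region, b: Region) -> bool:
    # Assumption: parent-child is containment plus a level difference of one.
    return is_ancestor(a, b) and a.level + 1 == b.level


# Made-up labels for the tree a1( b1( c1 ), d1 ).
a1 = Region("a", 1, 8, 1)
b1 = Region("b", 2, 5, 2)
c1 = Region("c", 3, 4, 3)
d1 = Region("d", 6, 7, 2)

print(is_ancestor(a1, c1), is_parent(a1, c1), is_parent(a1, b1), is_ancestor(b1, d1))
```

Because both checks need only the labels, structural relationships can be tested without walking the document tree.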

Page 4: On Boosting Holism in XML Twig Pattern Matching Using Structural Indexing Techniques Ting Chen, Jiaheng Lu, Tok Wang Ling

Background: XML twig pattern queries

An XML twig query is a small tree whose edges represent either parent-child or ancestor-descendant relationships.

Given an XML document D and an XML twig query Q, our problem is to find all occurrences of Q in D.
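To make the problem statement concrete, here is a brute-force sketch that enumerates every occurrence of a twig query in a region-coded document. It is not one of the holistic algorithms discussed in this talk; the `Elem` and `QNode` classes, the sample document and both queries are hypothetical and only illustrate what "an occurrence of Q" means.

```python
from dataclasses import dataclass, field
from itertools import product


@dataclass(frozen=True)
class Elem:
    name: str
    tag: str
    start: int
    end: int
    level: int


@dataclass
class QNode:
    tag: str
    axis: str = "//"  # edge from the parent: "/" (P-C) or "//" (A-D); unused for the root
    children: list = field(default_factory=list)


def related(anc, desc, axis):
    inside = anc.start < desc.start and anc.end > desc.end
    if axis == "/":
        return inside and anc.level + 1 == desc.level
    return inside


def matches(q, e, doc):
    """All matches of the sub-twig rooted at q that bind q to e (names in query pre-order)."""
    if e.tag != q.tag:
        return []
    per_child = []
    for child in q.children:
        found = [m for d in doc if related(e, d, child.axis) for m in matches(child, d, doc)]
        if not found:
            return []  # one query branch cannot be satisfied below e
        per_child.append(found)
    return [(e.name,) + tuple(n for part in combo for n in part)
            for combo in product(*per_child)]


def twig_matches(q, doc):
    return [m for e in doc for m in matches(q, e, doc)]


# Made-up document: a1( b1, a2( c1, b2 ) )
DOC = [
    Elem("a1", "a", 1, 10, 1),
    Elem("b1", "b", 2, 3, 2),
    Elem("a2", "a", 4, 9, 2),
    Elem("c1", "c", 5, 6, 3),
    Elem("b2", "b", 7, 8, 3),
]

Q_AD = QNode("a", children=[QNode("b", "//"), QNode("c", "//")])
Q_PC = QNode("a", children=[QNode("b", "/"), QNode("c", "/")])

print(twig_matches(Q_AD, DOC))
print(twig_matches(Q_PC, DOC))
```

This enumeration compares every element with every other element, which is exactly the cost that structural and holistic joins are designed to avoid.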

Page 5: On Boosting Holism in XML Twig Pattern Matching Using Structural Indexing Techniques Ting Chen, Jiaheng Lu, Tok Wang Ling

Previous XML Twig Join algorithms

Techniques

• Edge Based
    • Binary Structural Join [Al-Khalifa et al ICDE02]
    • Join Order Selection [Wu et al ICDE03]
• Path Based
    • BLAS [Chen et al SIGMOD04]
• Tree (Holistic) Based
    • TwigStack [Bruno et al SIGMOD02]
    • TwigStackList [Lu et al CIKM04]
• Index Based
    • B tree [Chien et al VLDB02]
    • XR tree [Jiang et al ICDE02]
    • TSGeneric+ [Jiang et al VLDB03]

Page 6: On Boosting Holism in XML Twig Pattern Matching Using Structural Indexing Techniques Ting Chen, Jiaheng Lu, Tok Wang Ling

Holistic Twig Matching

• TwigStack [Bruno et al SIGMOD02]: a holistic twig join algorithm
    • E.g., for query A[.//C]//B, there may be many matches to A//B alone, but TwigStack only outputs results for A elements that have both B and C descendants.
    • No join order selection required
• TwigStack is optimal only for twig patterns whose edges are all ancestor-descendant.
• Reordering of elements in a stream does not help. [Choi et al DEXA03]

Page 7: On Boosting Holism in XML Twig Pattern Matching Using Structural Indexing Techniques Ting Chen, Jiaheng Lu, Tok Wang Ling

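Sub-optimality of TwigStack

• Not optimal for twigs with parent-child edges

[Figure: example document tree with elements a1 … an, b1 … bn and c1 … cn; the tag streams a1 a2 … an, b1 b2 … bn and c1 c2 … cn; and the twig query with root A and children B and C]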

Page 8: On Boosting Holism in XML Twig Pattern Matching Using Structural Indexing Techniques Ting Chen, Jiaheng Lu, Tok Wang Ling

Two Refined Streaming Schemes (1)

• To enlarge the class of queries for which TwigStack is optimal, our paper proposes two refined streaming schemes.
• Tag+Level: elements with the same tag and level are grouped together

[Figure: the example document (elements a1 … an, b1 … bn, c1 … cn) partitioned into Tag+Level streams, and the twig query with root A and children B and C]
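The difference between plain tag streaming and Tag+Level streaming can be sketched as a change of grouping key. The snippet below is an illustration under stated assumptions: the document `DOC` and the tuple layout are made up, and each stream is assumed to be kept in document order (sorted by start position).

```python
from collections import defaultdict

# (name, tag, start, end, level): a hypothetical region-coded document
# with the shape a1( b1, a2( b2, c1 ), c2 ).
DOC = [
    ("a1", "a", 1, 12, 1),
    ("b1", "b", 2, 3, 2),
    ("a2", "a", 4, 9, 2),
    ("b2", "b", 5, 6, 3),
    ("c1", "c", 7, 8, 3),
    ("c2", "c", 10, 11, 2),
]


def tag_streams(doc):
    """One stream per tag."""
    streams = defaultdict(list)
    for name, tag, start, end, level in sorted(doc, key=lambda e: e[2]):
        streams[tag].append(name)
    return dict(streams)


def tag_level_streams(doc):
    """One stream per (tag, level) pair: the Tag+Level scheme."""
    streams = defaultdict(list)
    for name, tag, start, end, level in sorted(doc, key=lambda e: e[2]):
        streams[(tag, level)].append(name)
    return dict(streams)


print(tag_streams(DOC))
print(tag_level_streams(DOC))
```

With the level in the key, a parent-child edge only needs to pair streams whose levels differ by one, so streams at other levels can be skipped.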

Page 9: On Boosting Holism in XML Twig Pattern Matching Using Structural Indexing Techniques Ting Chen, Jiaheng Lu, Tok Wang Ling

Two Refined Streaming Schemes (1)

• For this query, the Tag+Level streaming scheme can guarantee optimality.

[Figure: the same document, Tag+Level streams and query (root A with children B and C) as on the previous slide]

Page 10: On Boosting Holism in XML Twig Pattern Matching Using Structural Indexing Techniques Ting Chen, Jiaheng Lu, Tok Wang Ling

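Two Refined Streaming Schemes (1)

• But given a more complex query and document, Tag+Level cannot guarantee optimality. For example:

[Figure: document with elements a1, a2, b1, b2, c1, c2, d1, d2, d3 and e1; its Tag+Level streams, in which d2 and d3 share a stream and c1 and c2 share a stream; and a twig query over nodes A, D, B and C]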

Page 11: On Boosting Holism in XML Twig Pattern Matching Using Structural Indexing Techniques Ting Chen, Jiaheng Lu, Tok Wang Ling

Two Refined Streaming Schemes (2)

• Prefix Path Streaming (PPS): elements with the same root-to-node path are grouped together

[Figure: document D with elements a1, a2, b1, b2, c1, c2, d1, d2, d3 and e1, and its PPS streams]

Every element in the document is stored as an individual stream in this example.
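The PPS grouping rule can be sketched as a traversal that keys each element by the sequence of tags from the root down to it. This is only an illustration: the nested-tuple document `DOC` is made up and is not the document shown on the slide.

```python
from collections import defaultdict


def pps_streams(tree):
    """Group element names by root-to-node tag path.

    `tree` is a hypothetical nested tuple (name, tag, [children]).
    Elements are visited in document (pre-order) order.
    """
    streams = defaultdict(list)

    def visit(node, prefix):
        name, tag, children = node
        path = prefix + (tag,)
        streams[path].append(name)
        for child in children:
            visit(child, path)

    visit(tree, ())
    return dict(streams)


# Made-up document: a1( b1( c1 ), a2( b2( c2 ) ), b3 )
DOC = ("a1", "a", [
    ("b1", "b", [("c1", "c", [])]),
    ("a2", "a", [("b2", "b", [("c2", "c", [])])]),
    ("b3", "b", []),
])

for path, names in pps_streams(DOC).items():
    print("/".join(path), names)
```

Here b1 and b3 share the stream for path a/b, while b2 lands in a separate stream for a/a/b. Tag streaming would put all three in one stream.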

Page 12: On Boosting Holism in XML Twig Pattern Matching Using Structural Indexing Techniques Ting Chen, Jiaheng Lu, Tok Wang Ling

Two Refined Streaming Schemes (2)

• PPS is optimal for the following example.

[Figure: document with elements a1, a2, b1, b2, c1, c2, d1, d2, d3 and e1; its PPS streams; and the twig query over nodes A, D, B and C]

d1, d2, c1 and c2 are separated into different streams.

Page 13: On Boosting Holism in XML Twig Pattern Matching Using Structural Indexing Techniques Ting Chen, Jiaheng Lu, Tok Wang Ling

Two Refined Streaming Schemes (2)

• A natural question: can PPS guarantee optimality for all queries and data?

Page 14: On Boosting Holism in XML Twig Pattern Matching Using Structural Indexing Techniques Ting Chen, Jiaheng Lu, Tok Wang Ling

Two Refined Streaming Schemes (2)

• A natural question: can PPS guarantee optimality for all queries and data?
• The answer is NO. For example:

[Figure: document with elements a1 … a4, b1 … b5, c1, c2, d1, d2, e1 and e2, with head elements marked; and a twig query over nodes A, C, B, E and D]

c1 and c2 are in the same stream. Similarly, e1 and e2 are also in the same stream.

Page 15: On Boosting Holism in XML Twig Pattern Matching Using Structural Indexing Techniques Ting Chen, Jiaheng Lu, Tok Wang Ling

A general algorithm: iTwigJoin

• We propose a general algorithm, called iTwigJoin, which can be used with various data streaming schemes.
• Our key idea is to classify all current head elements into three classes:
    • Subtree-matching
    • Useless
    • Blocked

Page 16: On Boosting Holism in XML Twig Pattern Matching Using Structural Indexing Techniques Ting Chen, Jiaheng Lu, Tok Wang Ling

Classifying Head Elements

• Subtree-Matching Element
    • An element e of tag E is called a subtree-matching element for query Q if e is in a match to QE (QE is the sub-tree of Q rooted at E) and e is NOT in any future match to QP, where P is the parent of E in Q.
• Useless Element
    • An element e is called a useless element if e is not in any future match to QE.
• Blocked Element
    • An element which is neither subtree-matching nor useless

Page 17: On Boosting Holism in XML Twig Pattern Matching Using Structural Indexing Techniques Ting Chen, Jiaheng Lu, Tok Wang Ling

Example: Classifying Head Elements (Tag+Level Streaming)

[Figure: document D with elements a1, a2, b1, b2, c1, c2, d1, d2, d3 and e1; its Tag+Level streams with head elements marked; and query Q1 over nodes A, D, B and C]

Page 18: On Boosting Holism in XML Twig Pattern Matching Using Structural Indexing Techniques Ting Chen, Jiaheng Lu, Tok Wang Ling

Example: Classifying Head Elements (Tag+Level Streaming)

[Figure: document D with elements a1, a2, b1, b2, c1, c2, d1, d2, d3 and e1; its Tag+Level streams with head elements marked; and query Q1 over nodes A, D, B and C]

Head elements for Q1 (so far):
• Blocked: d1

Page 19: On Boosting Holism in XML Twig Pattern Matching Using Structural Indexing Techniques Ting Chen, Jiaheng Lu, Tok Wang Ling

Example: Classifying Head Elements (Tag+Level Streaming)

[Figure: document D with elements a1, a2, b1, b2, c1, c2, d1, d2, d3 and e1; its Tag+Level streams with head elements marked; and query Q1 over nodes A, D, B and C]

Head elements for Q1 (so far):
• Blocked: d1, c1

Page 20: On Boosting Holism in XML Twig Pattern Matching Using Structural Indexing Techniques Ting Chen, Jiaheng Lu, Tok Wang Ling

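Example: Classifying Head Elements (Tag+Level Streaming)

[Figure: document D with elements a1, a2, b1, b2, c1, c2, d1, d2, d3 and e1; its Tag+Level streams with head elements marked; and query Q1 over nodes A, D, B and C]

Head elements for Q1:
• Subtree-matching: -
• Useless: -
• Blocked: d1, c1, a1, a2, b2, b1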

Page 21: On Boosting Holism in XML Twig Pattern Matching Using Structural Indexing Techniques Ting Chen, Jiaheng Lu, Tok Wang Ling

Example: Classifying Head Elements (Tag+Level Streaming)

[Figure: document D with its Tag+Level streams (head elements marked), query Q1 and a second query Q2, both over nodes A, D, B and C]

Head elements for Q1:
• Subtree-matching: -
• Useless: -
• Blocked: d1, c1, a1, a2, b2, b1

Head elements for Q2: not yet classified

Page 22: On Boosting Holism in XML Twig Pattern Matching Using Structural Indexing Techniques Ting Chen, Jiaheng Lu, Tok Wang Ling

Example: Classifying Head Elements (Tag+Level Streaming)

[Figure: document D with its Tag+Level streams (head elements marked), and queries Q1 and Q2, both over nodes A, D, B and C]

Head elements for Q1:
• Subtree-matching: -
• Useless: -
• Blocked: d1, c1, a1, a2, b2, b1

Head elements for Q2 (so far):
• Subtree-matching: d1

Page 23: On Boosting Holism in XML Twig Pattern Matching Using Structural Indexing Techniques Ting Chen, Jiaheng Lu, Tok Wang Ling

Example: Classifying Head Elements (Tag+Level Streaming)

[Figure: document D with its Tag+Level streams (head elements marked), and queries Q1 and Q2, both over nodes A, D, B and C]

Head elements for Q1:
• Subtree-matching: -
• Useless: -
• Blocked: d1, c1, a1, a2, b2, b1

Head elements for Q2 (so far):
• Subtree-matching: d1
• Useless: a1, b2

Page 24: On Boosting Holism in XML Twig Pattern Matching Using Structural Indexing Techniques Ting Chen, Jiaheng Lu, Tok Wang Ling

Example: Classifying Head Elements (Tag+Level Streaming)

[Figure: document D with its Tag+Level streams (head elements marked), and queries Q1 and Q2, both over nodes A, D, B and C]

Head elements for Q1:
• Subtree-matching: -
• Useless: -
• Blocked: d1, c1, a1, a2, b2, b1

Head elements for Q2 (so far):
• Subtree-matching: d1
• Useless: a1, b2
• Blocked: c1

Page 25: On Boosting Holism in XML Twig Pattern Matching Using Structural Indexing Techniques Ting Chen, Jiaheng Lu, Tok Wang Ling

Example: Classifying Head Elements (Tag+Level Streaming)

[Figure: document D with its Tag+Level streams (head elements marked), and queries Q1 and Q2, both over nodes A, D, B and C]

Head elements for Q1:
• Subtree-matching: -
• Useless: -
• Blocked: d1, c1, a1, a2, b2, b1

Head elements for Q2:
• Subtree-matching: d1
• Useless: a1, b2
• Blocked: c1, b1, a2

Page 26: On Boosting Holism in XML Twig Pattern Matching Using Structural Indexing Techniques Ting Chen, Jiaheng Lu, Tok Wang Ling

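Example: Classifying Head Elements (Tag+Level Streaming)

[Figure: document D with elements a1, a2, b1, b2, c1, c2, d1, d2, d3 and e1, and the two queries over nodes A, D, B and C]

Head elements for the first query:
• Subtree-matching: -
• Useless: -
• Blocked: a1, a2, b1, b2, c1, d1

Head elements for the second query:
• Subtree-matching: d1
• Useless: a1, b2
• Blocked: a2, b1, c1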

• Useless elements can be discarded safely
• Subtree-matching elements are pushed to the corresponding stack
• Blocked elements cause problems
    • They CANNOT be discarded, because that may cause loss of results
    • They CANNOT be pushed to the stack, because that may cause useless results
• When all head elements are blocked, optimal holistic matching CANNOT be guaranteed

Page 27: On Boosting Holism in XML Twig Pattern Matching Using Structural Indexing Techniques Ting Chen, Jiaheng Lu, Tok Wang Ling

iTwigJoin

• In our algorithm, in order to output all correct answers, we push blocked elements onto the stack, which may result in useless intermediate results in some cases.

[Figure: document D with elements a1, a2, b1, b2, c1, c2, d1, d2, d3 and e1, and query Q1 over nodes A, D, B and C, under Tag+Level Streaming]

Page 28: On Boosting Holism in XML Twig Pattern Matching Using Structural Indexing Techniques Ting Chen, Jiaheng Lu, Tok Wang Ling

iTwigJoin

• In our algorithm, in order to output all correct answers, we push blocked elements onto the stack, which may result in useless intermediate results in some cases.

[Figure: document D with elements a1, a2, b1, b2, c1, c2, d1, d2, d3 and e1, and query Q1 over nodes A, D, B and C, under Tag+Level Streaming]

Since all head elements are blocked, we have to push a1 to the stack and output one path solution (a1,d1).

Page 29: On Boosting Holism in XML Twig Pattern Matching Using Structural Indexing Techniques Ting Chen, Jiaheng Lu, Tok Wang Ling

iTwigJoin

• In our algorithm, in order to output all correct answers, we push blocked elements onto the stack, which may result in useless intermediate results in some cases.

[Figure: document D with elements a1, a2, b1, b2, c1, c2, d1, d2, d3 and e1, and query Q1 over nodes A, D, B and C, under Tag+Level Streaming]

Since all head elements are blocked, we have to push a1 to the stack and output one path solution (a1,d1).

If there is no c2, then (a1,d1) is a useless path solution.

Page 30: On Boosting Holism in XML Twig Pattern Matching Using Structural Indexing Techniques Ting Chen, Jiaheng Lu, Tok Wang Ling

iTwigJoin

[Figure: the Stream Manager reading the element streams, and the Temporary Storage holding stacks SA, SB and SC]

Two Main Components
• Stream Manager: controls the advance operation of streams and sends elements to temporary storage
• Temporary Storage: pushes elements to stacks and outputs intermediate paths

Page 31: On Boosting Holism in XML Twig Pattern Matching Using Structural Indexing Techniques Ting Chen, Jiaheng Lu, Tok Wang Ling

Flowchart of iTwigJoin

1. Label current head elements as either Subtree-matching, Useless or Blocked
2. If a useless element is found: discard useless elements and return to step 1
3. Select a subtree-matching or blocked element e
4. Pop some elements from the stack
5. Push e to the stack and output intermediate paths if e is the leaf
6. If not all streams end, return to step 1

Page 32: On Boosting Holism in XML Twig Pattern Matching Using Structural Indexing Techniques Ting Chen, Jiaheng Lu, Tok Wang Ling

Optimal classes of iTwigJoin for three streaming schemes

Streaming scheme    Optimal class
Tag Streaming       A-D only pattern

[Figure: example A-D only twig pattern with root A and children B and C]

Page 33: On Boosting Holism in XML Twig Pattern Matching Using Structural Indexing Techniques Ting Chen, Jiaheng Lu, Tok Wang Ling

Optimal classes of iTwigJoin for three streaming schemes

Streaming scheme       Optimal class
Tag Streaming          A-D only pattern
Tag+Level Streaming    A-D/P-C only pattern

[Figure: example twig patterns (root A with children B and C) for the A-D only and A-D/P-C only classes]

Page 34: On Boosting Holism in XML Twig Pattern Matching Using Structural Indexing Techniques Ting Chen, Jiaheng Lu, Tok Wang Ling

Optimal classes of iTwigJoin for three streaming schemes

Streaming scheme         Optimal class
Tag Streaming            A-D only pattern
Tag+Level Streaming      A-D/P-C only pattern
Prefix Path Streaming    A-D/P-C only or 1-Branch node

[Figure: example twig patterns (root A with children B and C) for the A-D only, A-D/P-C only, and A-D/P-C only or 1-Branch classes]

Page 35: On Boosting Holism in XML Twig Pattern Matching Using Structural Indexing Techniques Ting Chen, Jiaheng Lu, Tok Wang Ling

Optimal classes of iTwigJoin for three streaming schemes

Streaming scheme         Optimal class
Tag Streaming            A-D only pattern
Tag+Level Streaming      A-D/P-C only pattern
Prefix Path Streaming    A-D/P-C only or 1-Branch node

[Figure: example twig patterns (root A with children B and C) for the A-D only, A-D/P-C only, and A-D/P-C only or 1-Branch classes]

More refined streaming scheme: larger optimal class

Page 36: On Boosting Holism in XML Twig Pattern Matching Using Structural Indexing Techniques Ting Chen, Jiaheng Lu, Tok Wang Ling

Experiments

• Benchmarks
    • XMark: Synthetic Data
    • Treebank: Real Data from Wall Street Journal

Page 37: On Boosting Holism in XML Twig Pattern Matching Using Structural Indexing Techniques Ting Chen, Jiaheng Lu, Tok Wang Ling

Experiments: I/O Performance

[Figure: bar chart of elements scanned (axis from 0 to 14,000,000) by TwigStack, TwigStackList, Tag+Level and Prefix for queries Tree1 to Tree5]

Query types: Tree1: A-D only; Tree2: P-C only; Tree3: P-C only; Tree4: 1-branch node; Tree5: 1-branch node

By pruning irrelevant streams, PPS usually scans the fewest elements.

Page 38: On Boosting Holism in XML Twig Pattern Matching Using Structural Indexing Techniques Ting Chen, Jiaheng Lu, Tok Wang Ling

Experiments: Number of Intermediate Paths

[Figure: bar chart of intermediate paths output (logarithmic axis from 1 to 100,000) by TwigStack, TwigStackList, Tag+Level and Prefix for queries Tree1 to Tree5]

Query types: Tree1: A-D only; Tree2: P-C only; Tree3: P-C only; Tree4: 1-branch node; Tree5: 1-branch node

For Treebank query Tree5 there are no matching results, so Tag+Level and PPS do not output any intermediate results.

Page 39: On Boosting Holism in XML Twig Pattern Matching Using Structural Indexing Techniques Ting Chen, Jiaheng Lu, Tok Wang Ling

Experiments: Running Time

[Figure: bar chart of execution time in seconds (axis from 0 to 14) for TwigStack, TwigStackList, Tag+Level and Prefix on queries XMark1 to XMark5]

Query types: XMark1: path pattern; XMark2: A-D only; XMark3: P-C only; XMark4: 1-branch node; XMark5: non-optimal

Tag+Level and PPS perform better than TwigStack and TwigStackList on XMark data.

Page 40: On Boosting Holism in XML Twig Pattern Matching Using Structural Indexing Techniques Ting Chen, Jiaheng Lu, Tok Wang Ling

Experiments: Summary

• Both PPS and Tag+Level help to reduce I/O costs, and PPS saves more.
• PPS may result in too many streams for deep XML data; Tag+Level seems to be a good compromise.
• PPS and Tag+Level completely avoid the output of redundant intermediate paths in all cases we tested, though they cannot guarantee optimality in theory.

Page 41: On Boosting Holism in XML Twig Pattern Matching Using Structural Indexing Techniques Ting Chen, Jiaheng Lu, Tok Wang Ling

Conclusions

• We develop a general algorithm to perform holistic twig join on Tag+Level and PPS streaming schemes.
• We identify two I/O optimal classes for Tag+Level and PPS streaming schemes.
• Since our experiments show that the Tag+Level streaming scheme produces very few useless intermediate results in most cases, we recommend using the Tag+Level scheme for efficient XML twig pattern matching.

Page 42: On Boosting Holism in XML Twig Pattern Matching Using Structural Indexing Techniques Ting Chen, Jiaheng Lu, Tok Wang Ling

END

Thank you! Q & A

Page 43: On Boosting Holism in XML Twig Pattern Matching Using Structural Indexing Techniques Ting Chen, Jiaheng Lu, Tok Wang Ling

Backup: iTwigJoin Algorithm

While (not all streams end)
  1. Label current head elements as either Matching, Useless or Blocked
  2. If any head element is Useless, discard it and continue
  3. Let e1 be the matching element with the smallest startPos; let e2 be the blocked element with the smallest endPos
  4. If e2.endPos < e1.startPos, let e be the blocked element with the smallest startPos; else let e be e1
  5. Advance the stream e belongs to
  6. Pop out elements from e's stack whose endPos < e.startPos
  7. Push e into its stack if e has a parent/ancestor in the temporary storage system
  8. Output all paths involving e if the tag of e is a leaf node in Q
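Steps 3, 4 and 6 of the loop can be sketched in isolation as follows. This is an illustrative reading of the pseudocode, not the authors' implementation. The `Head` class, the string labels and both function names are hypothetical, and labels are assumed to have been assigned already by step 1, which is not modeled here. The slide does not say what to do when no matching element exists; falling back to the blocked element with the smallest startPos is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Head:
    """Hypothetical head element with its region and its step-1 label."""
    name: str
    start: int
    end: int
    label: str  # "matching" or "blocked"; useless heads are assumed discarded already


def select_next(heads: List[Head]) -> Optional[Head]:
    """Steps 3 and 4: choose the element e to process next."""
    matching = [h for h in heads if h.label == "matching"]
    blocked = [h for h in heads if h.label == "blocked"]
    e1 = min(matching, key=lambda h: h.start, default=None)
    e2 = min(blocked, key=lambda h: h.end, default=None)
    first_blocked = min(blocked, key=lambda h: h.start, default=None)
    if e1 is None:
        # Assumption: with no matching head, take the earliest blocked head.
        return first_blocked
    if e2 is not None and e2.end < e1.start:
        return first_blocked
    return e1


def pop_stale(stack: List[Head], e: Head) -> None:
    """Step 6: drop stack entries that end before e starts.

    Assumes the stack holds a chain of nested elements, so the entries
    to drop are exactly those at the top.
    """
    while stack and stack[-1].end < e.start:
        stack.pop()


heads = [Head("a2", 10, 20, "matching"), Head("b1", 2, 3, "blocked"), Head("c1", 1, 6, "blocked")]
print(select_next(heads).name)

stack = [Head("a1", 1, 20, "matching"), Head("a0", 2, 5, "matching")]
pop_stale(stack, Head("b9", 7, 8, "matching"))
print([h.name for h in stack])
```

In the first call b1 ends before a2 starts, so the blocked element with the smallest startPos (c1) is selected instead of a2.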