engineering succinct dom - university of leicester · 2011-10-11 · xml “bloat” •verbose...

Post on 22-Jun-2020

2 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Engineering Succinct DOMEngineering Succinct DOM

O’Neil Delpratt, Rajeev Raman, Naila Rahman

XML “Bloat”

• Verbose data representation.

▫ interoperability

▫ human-readability

IAB 2008

2

▫ human-readability

• XML processing uses much more memory.

▫ benefits?

• “Premature optimisation is the root of all evil”

▫ SOAP packets ~ 1MB

DOM

book

#doc

7

<book catalogue=“java”><author>Selman</author><title>Java3d programming</title><year>2000</year>

</book>

cat “java”

IAB 2008

3

titleauthor year1

4 6

7

“[cr][sp]”

“Selman” “2000”

“[cr][sp]” “[cr][sp]”“[cr][sp]”

title year

“Java3d programming”

DOM (Document Object Model) – basic in-memory representation.

Navigate tree; retrieve information associated with node.

Standard DOM implementations: pointer-based.

In-memory representation = 300% to 800% of file size

For static documents, is there a “pointerless” representation, with full DOM functionality? Succinct DOM

Succinct DOM• Supports most static methods of DOM Core L3▫ document changes not supported.

• Supports x R y, R є {ancestor, descendant,preceding, following,

IAB 2008

4

descendant,preceding, following,

preceding/following-sibling}

• Good worst-case behaviour:▫ navigation, x R y etc. in O(1) time.▫ good upper bounds on space usage.

• “Very fast” with “reasonable compression”.• Leicester Research Archive (Google title)▫ SDOM library (linux/i686), test harness, Valgrind

3.5x faster to 7x slower than Xerces-C

(i) significantly smaller than original XML file or (ii) comparable to “query-

friendly” XML compressors

Previous Work• Many XML compressors: (XMill, Xgrind, XPress, XCQ, Xbzip…) ▫ Query-friendly (Xgrind, Xpress, XCQ, Xbzip …) focus on XPath queries, not navigation

▫ Navigation typically slow – order of milliseconds.▫ SDOM/(-CT)

� (at worst) few times slower than memory access.

IAB 2008

5

� (at worst) few times slower than memory access.� no direct support for Xpath queries.

• DOM implementations▫ SEDOM –basic operations slow (ms), CR between SDOM and SDOM-CT (allows updates)

▫ DDOM* – space better than Xerces (allows updates)▫ TinyTree – space much better than Xerces (static)▫ ISX (WWW’07):

� similar approach to ours, seems 5x-10x slower from paper, significantly worse CR, but fast updates.

• Summary: Succinct DOM occupies new point in Space/Time tradeoff, first to surpass performance of Xerces with good compression.

Succinct DOM

• Based on Succinct Data Structures.

▫ “Information-theoretic” optimality.

▫ Succinctness != compression.

IAB 2008

6

▫ Succinctness != compression.

• Highly optimised components [GRRR’06, DRR’07]

• Space usage?▫ Naïve: ≥ 2 pointers/node

▫ Succinct:

Tree Structure

1

221log

n

n

n

)(log2 nOn −=

IAB 2008

)(log2 nOn −=

7

Tree Structure

• Parenthesis sequence:

▫ findclose/open, enclose

Xerces: 5 ptrs/node(160 bits/node)

Succinct: 2 bits/node

a b c d e f g h i j

8

( ( ) ( ( ) ( ) ) ( ( ( ) ( ) ) ) ( ) )

▫ findclose/open, enclose

� enclose(10) = 1

• O(1) time

▫ 2.86n bits [GRRR ‘06]

IAB 2008

Tree Nodes

• Numbered in document order

▫ Text nodes

� numbered consecutively 1..t in “document order”

IAB 2008

9

Bit-vector data structure with Rank operation

� numbered consecutively 1..t in “document order”

▫ Non-text tree nodes

� numbered consecutively 1..e in “document order”

� index into array of “short codes”, where each entry is log(p + 12) bits.

Number of distinct

element names

Specify other kinds of nodes e.g. processing-instruction.

Text Nodes<book catalogue=“java”><author>Selman</author><title>Java3d programming</title><year>2000</year></book>

book

#doc

7

cat “java”

• “String value” of text node

titleauthor year1

4 6

7

“[cr][sp]”

“Selman” “2000”

“[cr][sp]” “[cr][sp]”“[cr][sp]”

title year

“Java3d programming”

IAB 2008

10

Text Nodes

• Text nodes numbered 1..t

S e l m a n c

r

s

pJ a v a 3 D … m m i n g c

r

s

p2 0 0

IAB 2008

11

Text data “array”

• Naïve: 4-8 bytes per text node.

▫ Average text node: 10-12 characters – uncompressed.

• Offset array compression [DRR ‘07]: ~ 1 byte/node.

S e l m a nr p

J a v a 3 D … m m i n gr p

2 0 0

Offset “array”

Text data “array”

SDOM SDOM-CT (Q&D)

12

IAB 2008

(All text nodes concatenated in document order; offsets given by offset “array”)

• Array of uncompressed characters

▫ text node value is null-terminated: can return pointer into array.

▫ Pure succinct representation.

1. Bzip-compress 16KB blocks, decompress on demand.

o cache 4 uncompressed blocks.

2. Store array in FM-index or Compressed Suffix Array.

o seems slower than (1)

� Use (1) if locality in text access; (2) otherwise.

� Other alternatives e.g. hashing.

Node class

• Most important class in DOM for navigation.▫ represents a node in the DOM tree.▫ SDOM Node: 2 ints + ptr to Document object.

• SDOM initially has no objects

IAB 2008

13

• SDOM initially has no Node objects▫ Created upon traversal; no check to see if a tree node has an existing Node object:� y = x.getFirstChild();

� z = x.getFirstChild(); /* z != y */

• No garbage collection…• Better use TreeWalker class▫ NextNode/PreviousNodemethods are very fast.

IAB 2008

14

Files used (UW repository)

File Size Nodes

Max.

DepthMax node degree

#text

chars

#attr

Chars

Orders 5MB 300,004 3 30001 1MB NEG

Lineitem 32MB 2,045,954 3 120351 6MB 0

IAB 2008

15

Lineitem 32MB 2,045,954 3 120351 6MB 0

XPATH 50MB 2,522,571 5 42075 13MB NEG

Treebank_e 82MB 7,312,613 36 112769 57MB NEG

SwissProt 110MB 10,599,084 5 100000 35MB 13MB

DBLP 128MB 10,595,379 6 657717 64MB 7MB

XCDNA 594MB 25,221,153 7 82237 256MB 0

Also Xmark files: not shown. Machine: Pentium 2GB RAM. Ubuntu. g++ 3.3.

SDOM space usage

File Size SDOM Xerces Saxon

Orders 5MB 37% 451% 157%

Lineitem 32MB 28% 399% 161%

• DOM implementations (with uncompressed text)

• Percentages relative to file sizes.

IAB 2008

16

Lineitem 32MB 28% 399% 161%

XPath 50MB 33% 383% 137%

Treebank 82MB 84% 866% 266%

SwissPrt 110MB 60% 704% 272%

DBLP 128MB 68% 737% 240%

XCDNA 594MB 50% 491% 136%

sizes.

• Textual data left uncompressed!

SDOM-CT space usage

• Comparison with:▫ Query-Friendly XML compressors

IAB 2008

17

� XBZipIndex, XPRESS,

� XQZip, XGRIND.

▫ XMill

�SDOM-CT is a “middle-of-the-road” compressor.

�about 25% of SDOM-CT’s space is purely to support fast navigation.

Speed test

• Basic test: several kinds of traversals▫ Document-order▫ Reverse document-order▫ Upward path enumeration

IAB 2008

18

▫ Upward path enumeration• Full test:▫ Document-order traversal plus:

� traverse all attributes and values� search for a (uncommon) substring in all text nodes.

• Compare traversals using Node and TreeWalkerNextNode methods.• All XML documents (except XCDNA) fit comfortably in main memory even for Xerces.

Basic Test (Document/RDO)

IAB 2008

19

20

25

30

0

5

10

15

orders lineitem XPATH treebank_e SwissProt dblp XCDNA

Tim

e (Secs)

Xerces-DocOrder SDOM-DocOrder Xerces-ReverseDocOrder SDOM-ReverseDocOrder

Upward Path Enumeration

IAB 2008

20

100

120

0

20

40

60

80

orders lineitem XPATH treebank_e SwissProt dblp XCDNA

Tim

e (Secs)

Xerces SDOM

Full Test

IAB 2008

21

20

25

30

0

5

10

15

20

treebank_e SwissProt dblp XCDNA

Tim

e (Secs)

Xerces-TreeNav Xerces-NextNode SDOM-TreeNav SDOM-NextNode SDOM-CT-TreeNav

Text compression

libBzip2-blocks FMIndex

Files Text path-order doc-order doc-order

Orders 1MB 22% 30% 29%

IAB 2008

22

Lineitem 6MB 21% 31% 26%

XPATH 13MB 9% 13% 13%

Treebank 57MB 42% 45% 56%

SwissProt 49MB 17% 29% 20%

DBLP 71MB 28% 35% 30%

Difference between path-order and doc-order is fairly small for bzip2 (cf. Xmill).

PARSING

“Spike” caused by use of vector for accumulating Textual data as it is

IAB 2008

23

2,000

2,500

3,000

Memory usage (MB)

DOM parsers memory usage -XCDNA.xml (594MB)

Textual data as it is processed by SAX.

Easy fix for much “preprocessing” space usage.

-

500

1,000

1,500

Memory usage (MB)

Memory usage snap-shot

SDOM Xerces-c

Conclusion/Future Work

• Succinct DOM: very fast, reasonable compression, and robust performance. ▫ Incorporate into higher-level application

IAB 2008

24

▫ Incorporate into higher-level application (Xalan/Saxon/OpenOffice)? ▫ Optimise parsing further?▫ Succinctness + secondary storage?

• Commercialisation?:▫ Saxon £250 license▫ Intel high-performance XSLT library $500▫ IBM XML parsing hardware ~ $5k.

top related