Combined Static and Dynamic Analysis for Effective Buffer Minimization in Streaming XQuery Evaluation
Michael Schmidt Stefanie Scherzinger Christoph Koch
Saarland University Database Group, Saarbrücken, Germany
2007 IEEE 23rd International Conference on Data Engineering - April 17, 2007

Outline
I. Streaming XQuery Evaluation
– Motivation and Requirements
– Desiderata for Streaming and In-Memory XQuery Engines
– Existing Approaches
II. Combining Static and Dynamic Buffer Minimization
– Query Normalization
– The Concept of Roles
– Active Garbage Collection
– System Architecture
– Optimizations
III. The GCX XQuery Engine
– Prototype Implementation
– Benchmark Results
IV. Summary

Motivation and Requirements
– Streaming XML processing grows in importance with the proliferation of the WWW
– Streams may arrive at very high rates
– Storing incoming data to disk is often infeasible
– A main-memory DOM tree representation of an XML document is very space-consuming
– Buffer management becomes the key prerequisite to performance
– The problem becomes even more urgent when evaluating (powerful fragments of) XQuery rather than simple filters on data streams
– Streaming techniques are also very useful for in-memory XQuery engines

Desiderata for In-Memory XQuery Engines
(1) Only buffer data that is relevant for query evaluation
(2) Avoid multiple copies of the data in main memory
(3) Do not keep data buffered longer than necessary
Claim: A combination of static and dynamic analysis is required to satisfy all desiderata

Existing Approaches (1)
(1) Only buffer data that is relevant for query evaluation
Document Projection
– Static query analysis
– Detect the parts of the document that are relevant to query evaluation
– Project away the parts of the document that are not relevant to query evaluation
A. Marian and J. Siméon. “Projecting XML Documents”. In Proc. VLDB’03, pages 213–224, 2003.
S. Bréssan, B. Catania, Z. Lacroix, Y. G. Li, and A. Maddalena. “Accelerating Queries by Pruning XML Documents”. TKDE, 54(2):211–240, 2005.
V. Benzaken, G. Castagna, D. Colazzo, and K. Nguyen. “Type-Based XML Projection”. In Proc. VLDB’06, 2006.

Existing Approaches (2)
Document Projection
XQuery:
<q> {
  for $b in /bib/book
  where ($b/author="A. Turing" and fn:exists($b/price))
  return $b/title
} </q>
Projection paths (dos := descendant-or-self):
{ /bib/book,
  /bib/book/author/dos::node(),
  /bib/book/price,
  /bib/book/title/dos::node() }
[Figure: XML document tree rooted at bib, with book and article subtrees; node labels include author, price, title, and isbn]
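
The projection step above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the algorithm of any of the cited papers: paths consist of child steps with an optional trailing dos::node(), nodes are identified by their root-to-node tag path, and the helper names `steps` and `is_relevant` are hypothetical.

```python
# Projection paths from the slide; "dos::node()" abbreviates descendant-or-self::node().
PROJECTION_PATHS = [
    "/bib/book",
    "/bib/book/author/dos::node()",
    "/bib/book/price",
    "/bib/book/title/dos::node()",
]

def steps(path):
    """Split a path such as /bib/book into its steps."""
    return [s for s in path.split("/") if s]

def is_relevant(node_path, projection_paths=PROJECTION_PATHS):
    """Return True if the node must be kept, False if it can be projected away."""
    node = steps(node_path)
    for p in projection_paths:
        proj = steps(p)
        keeps_subtree = proj[-1] == "dos::node()"
        if keeps_subtree:
            proj = proj[:-1]
        # The node is a projected node or lies on the way to one.
        if node == proj[:len(node)]:
            return True
        # The node lies below a path that keeps its whole subtree.
        if keeps_subtree and node[:len(proj)] == proj:
            return True
    return False
```

With these paths, /bib/book/author and everything below it is kept, /bib/book/price is kept without its subtree, and /bib/book/isbn and /bib/article are projected away.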

Existing Approaches (3)
(2) Avoid multiple copies of the data in main memory
(3) Do not keep data buffered longer than necessary
Hard to satisfy both requirements in combination
XQuery:
<q> {
  for $x1 in //book return
    for $x2 in //* return
      for $x3 in //article return <node/>
} </q>
Two approaches:
(1) Single DOM tree
(2) Buffers for variables

The Big Picture
[Figure: system overview. Components: XQuery, normalized XQuery, projection tree, roles, rewritten XQuery (role updates), buffer (nodes annotated with roles), evaluator, input stream, output stream. Arrow types: transformation and extraction; input and output; communication. Arrow labels: variable bindings; role removals, active garbage collection.]

Query Normalization
(1) Rewriting where-expressions to if-statements
(2) Pushing down if-statements
Before:
<r> {
  for $b in /bib
  where (fn:exists($b/book))
  return <books>{ $b/book }</books>
} </r>
After:
<r> {
  for $b in /bib return (
    if (fn:exists($b/book)) then <books> else (),
    if (fn:exists($b/book)) then $b/book else (),
    if (fn:exists($b/book)) then </books> else () )
} </r>
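
The two normalization steps can be mimicked on a toy expression tree. The sketch below is a hedged illustration only: the classes `For`, `If`, and `Seq` and the functions `normalize` and `push_down` are hypothetical names, conditions and leaves are plain strings, every if has an implicit empty else-branch, and conditions are pushed through sequences only.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class For:
    var: str
    path: str
    body: object
    where: Optional[str] = None

@dataclass(frozen=True)
class If:
    cond: str
    then: object  # the else-branch is always () in this sketch

@dataclass(frozen=True)
class Seq:
    items: Tuple[object, ...]

def push_down(cond, expr):
    """Step (2): distribute a condition over the items of a sequence."""
    if isinstance(expr, Seq):
        return Seq(tuple(push_down(cond, e) for e in expr.items))
    return If(cond, expr)

def normalize(expr):
    """Step (1): turn where-clauses into if-expressions, then push them down."""
    if isinstance(expr, For):
        body = normalize(expr.body)
        if expr.where is not None:
            body = push_down(expr.where, body)
        return For(expr.var, expr.path, body)
    if isinstance(expr, Seq):
        return Seq(tuple(normalize(e) for e in expr.items))
    if isinstance(expr, If):
        return push_down(expr.cond, normalize(expr.then))
    return expr  # leaves (strings) stay unchanged

# The query from the slide, with the constructed element split into three output items.
query = For("$b", "/bib",
            Seq(("<books>", "$b/book", "</books>")),
            where="fn:exists($b/book)")
normalized = normalize(query)
```

After normalization the for-loop has no where-clause, and each of the three output items is guarded by its own copy of the condition, as in the rewritten query on the slide.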

Deriving Roles
<r> {
  for $bib in /bib return (
    for $x in $bib/* return
      if (not(fn:exists($x/price))) then $x else (),
    for $b in $bib/book return $b/title )
} </r>
[Figure: projection tree with node labels /, /bib, /*, /book, /price[1], dos::node(), and /title/dos::node()]
Role  Path                          Derived from
r1    /
r2    /bib                          $bib
r3    /bib/*                        $x
r4    /bib/*/price[1]               $x/price
r5    /bib/*/dos::node()            $x
r6    /bib/book                     $b
r7    /bib/book/title/dos::node()   $b/title

Assigning Roles
– Matching document nodes get assigned roles when projected into the buffer
– Roles are assigned on the fly while reading the input
– Nodes without roles and without role-carrying ancestors need not be buffered (projection)
XML document, annotated with roles:
bib           { r2 }
  book        { r3, r5, r6 }
    author    { r5 }
    title     { r5, r7 }
Roles:
r1    /
r2    /bib
r3    /bib/*
r4    /bib/*/price[1]
r5    /bib/*/dos::node()
r6    /bib/book
r7    /bib/book/title/dos::node()
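
As a rough illustration of role assignment, the following Python sketch matches a node against the seven role paths of the running example. It is an assumption-laden simplification, not the GCX implementation: a node is given as a list of (tag, position among same-named siblings) pairs from the root down, roles are kept in a plain set, and `step_matches` and `roles_for` are hypothetical helper names.

```python
# Role paths of the running example.
ROLES = {
    "r1": "/",
    "r2": "/bib",
    "r3": "/bib/*",
    "r4": "/bib/*/price[1]",
    "r5": "/bib/*/dos::node()",
    "r6": "/bib/book",
    "r7": "/bib/book/title/dos::node()",
}

def step_matches(step, tag, pos):
    """Match one path step (tag, wildcard, or tag[1]) against one node."""
    if step == "*":
        return True
    if step.endswith("[1]"):
        return step[:-3] == tag and pos == 1
    return step == tag

def roles_for(node):
    """Return the set of roles whose path matches the given node."""
    assigned = set()
    for role, path in ROLES.items():
        steps = [s for s in path.split("/") if s]
        subtree = bool(steps) and steps[-1] == "dos::node()"
        if subtree:
            steps = steps[:-1]
            length_ok = len(node) >= len(steps)  # the node itself or any descendant
        else:
            length_ok = len(node) == len(steps)  # exactly this node
        if length_ok and all(step_matches(s, t, p) for s, (t, p) in zip(steps, node)):
            assigned.add(role)
    return assigned
```

For the document on the slide this yields {r2} for bib, {r3, r5, r6} for book, {r5} for author, and {r5, r7} for title.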

Inserting Role Updates
Before:
<r> {
  for $bib in /bib return (
    for $x in $bib/* return
      if (not(fn:exists($x/price))) then $x else (),
    for $b in $bib/book return $b/title )
} </r>
After:
<r> {
  for $bib in /bib return (
    for $x in $bib/* return (
      if (not(exists($x/price))) then $x else (),
      signOff($x,r3),
      signOff($x/price[1],r4),
      signOff($x/dos::node(),r5) ),
    for $b in $bib/book return (
      $b/title,
      signOff($b,r6),
      signOff($b/title/dos::node(),r7) ),
    signOff($bib,r2) )
} </r>
Role  Path                          Derived from
r1    /
r2    /bib                          $bib
r3    /bib/*                        $x
r4    /bib/*/price[1]               $x/price
r5    /bib/*/dos::node()            $x
r6    /bib/book                     $b
r7    /bib/book/title/dos::node()   $b/title

Active Garbage Collection
[Animation: the rewritten query with signOff-statements (see Inserting Role Updates) is evaluated over the input stream <bib><book><title/><author/></book>… The output stream receives <r><book><title/><author/></book>. The buffer initially holds bib {r2}, book {r3, r5, r6}, title {r5, r7}, and author {r5}. Later frames show the shrinking role sets {r5, r6}, {r7}, {}, and {r6}.]
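
The interplay of signOff and garbage collection can be sketched as follows. This is a minimal model under stated assumptions rather than the GCX buffer manager: roles are stored in a set (no multiplicities), a node is reclaimed as soon as it has no roles and no buffered children, and `Node`, `sign_off`, `collect`, and `buffered_tags` are hypothetical names.

```python
class Node:
    """A buffered document node annotated with roles."""
    def __init__(self, tag, roles, parent=None):
        self.tag = tag
        self.roles = set(roles)
        self.parent = parent
        self.children = []
        self.alive = True
        if parent is not None:
            parent.children.append(self)

def collect(node):
    """Remove the node if it has no roles and no buffered children, then retry on its parent."""
    while node is not None and not node.roles and not node.children:
        node.alive = False
        parent = node.parent
        if parent is not None:
            parent.children.remove(node)
        node = parent

def sign_off(node, role):
    """Take one role away from a node and trigger garbage collection."""
    node.roles.discard(role)
    collect(node)

def buffered_tags(root):
    """Tags of all nodes still in the buffer, in document order."""
    if not root.alive:
        return []
    tags = [root.tag]
    for child in root.children:
        tags += buffered_tags(child)
    return tags

# Buffer contents from the slide.
bib = Node("bib", {"r2"})
book = Node("book", {"r3", "r5", "r6"}, bib)
title = Node("title", {"r5", "r7"}, book)
author = Node("author", {"r5"}, book)

# End of the $x iteration: signOff($x,r3) and signOff($x/dos::node(),r5).
sign_off(book, "r3")
for n in (book, title, author):
    sign_off(n, "r5")
after_x_loop = buffered_tags(bib)

# End of the $b iteration: signOff($b,r6) and signOff($b/title/dos::node(),r7).
sign_off(book, "r6")
sign_off(title, "r7")
after_b_loop = buffered_tags(bib)

# End of the $bib iteration: signOff($bib,r2).
sign_off(bib, "r2")
after_bib_loop = buffered_tags(bib)
```

In this run author is reclaimed after the first loop, title and book after the second, and bib once its own role r2 is signed off.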

Optimizations
– Rewrite path steps to for-expressions
– Use aggregated roles
– Remove redundant roles
Query:
<r> { for $bib in /bib return $bib/book } </r>
With role updates:
<r> { for $bib in /bib return ( $bib/book, signOff($bib,r1), signOff($bib/book/dos::node(),r2) ) } </r>
Path steps rewritten to for-expressions:
<r> { for $bib in /bib return for $_1 in $bib/book return $_1/book } </r>
With role updates:
<r> { for $bib in /bib return ( for $_1 in $bib/book return ( $_1/book, signOff($_1/book/dos::node(),r2) ), signOff($bib,r1) ) } </r>

The GCX XQuery Engine
Garbage Collected XQuery (GCX)
Implemented in C++ for a fragment of composition-free XQuery
– Arbitrarily nested single-step for-loops
– FWR-expressions
– Child and descendant axes
– Node tests for tags, wildcards, node(), text()
– If-expressions with and, or, not, fn:exists
– Let/some-expressions and aggregations not yet supported
– No support for attributes (no restriction)
Open source (Berkeley Software Distribution License)
GCX project page: http://www.infosys.uni-sb.de/projects/streams/gcx/index.php
GCX download page: http://www.infosys.uni-sb.de/software/gcx/

Benchmark Results (1)
– Time and memory consumption
– Queries and documents from the XMark benchmark
– Queries and documents modified to match the supported fragment
– 3 GHz Intel Pentium IV CPU with 2 GB RAM
– SuSE Linux 10.0, J2RE v1.4.2 for Java-based systems
– Time limit: 1 hour
Benchmarks against the following systems:
– FluX: Java in-memory engine for streaming XQuery evaluation.
– MonetDB v4.12.0/XQuery v0.12.0: a secondary storage engine written in C++. Loading of the document is included in time measurements.
– QizX/open v1.1: free in-memory XQuery engine written in Java.
– Saxon v8.7.1: free in-memory XQuery engine written in Java.

Benchmark Results (2)
XMark Q1:
<query1> {
  for $s in /site return
    for $p in $s/people return
      for $pe in $p/person return
        if ($pe/person_id="person0")
        then <result>{ $pe/name }</result>
        else ()
} </query1>
[Chart: running time (s), axis from 0 to 16, for XMark Q1 on 10MB, 50MB, 100MB, and 200MB documents; systems: GCX, FluxQuery, MonetDB, Saxon, Qizx/open]

Benchmark Results (3)
[Chart: memory consumption (MB), axis from 0 to 1000, for XMark Q1 on 10MB, 50MB, 100MB, and 200MB documents; systems: GCX, FluxQuery, MonetDB, Saxon, Qizx/open]

Benchmark Results (4)
XMark Q8:
<query8> {
  for $root in (/) return
    for $site in $root/site return
      for $people in $site/people return
        for $person in $people/person return
          <item> { (
            <person>{ $person/name }</person>,
            <items_bought> {
              for $site2 in $root/site return
                for $cas in $site2/closed_auctions return
                  for $ca in $cas/closed_auction return
                    for $buyer in $ca/buyer return
                      if ($buyer/buyer_person=$person/person_id)
                      then <result> { $ca } </result>
                      else ()
            } </items_bought>
          ) } </item>
} </query8>

Benchmark Results (5)
[Charts for XMark Q8 on 10MB, 50MB, 100MB, and 200MB documents: running time (s) and memory consumption (MB); systems: GCX, FluxQuery, MonetDB, Saxon, Qizx/open]
Failure for 100MB: MonetDB. Failure for 200MB: GCX, FluxQuery, MonetDB.

Summary
– Combination of static and dynamic buffer minimization
– Roles are derived from the XQuery and assigned to matching document nodes in the preprojection phase
– The XQuery expression is statically rewritten: at runtime, signOff-statements cause buffered nodes to lose roles
– An active garbage collection mechanism removes nodes that have lost their last role from the buffer
– Document projection is integrated in the role concept
– The technique behaves very well for composition-free XQuery w.r.t. execution time and memory consumption
– Applicable in streaming contexts, but also useful for common in-memory XQuery engines

Thank you for your attention!

Z. Bar-Yossef, M. Fontoura, and V. Josifovski. “On the Memory Requirements of XPath Evaluation over XML Streams”. In Proc. PODS’04, pages 177–188, 2004.
M. Benedikt, W. Fan, and F. Geerts. “XPath Satisfiability in the Presence of DTDs”. In Proc. PODS, pages 25–36, 2005.
V. Benzaken, G. Castagna, D. Colazzo, and K. Nguyen. “Type-Based XML Projection”. In Proc. VLDB’06, 2006.
S. Bréssan, B. Catania, Z. Lacroix, Y. G. Li, and A. Maddalena. “Accelerating Queries by Pruning XML Documents”. TKDE, 54(2):211–240, 2005.
L. Fegaras, R. Dash, and Y. Wang. “A Fully Pipelined XQuery Processor”. In XIME-P, 2006.
L. Fegaras, D. Levine, S. Bose, and V. Chaluvadi. “Query Processing of Streamed XML Data”. In Proc. CIKM 2002, pages 126–133, 2002.
T. J. Green, G. Miklau, M. Onizuka, and D. Suciu. “Processing XML Streams with Deterministic Automata”. In Proc. ICDT’03, pages 173–189, 2003.
C. Koch. “On the complexity of nonrecursive XQuery and functional query languages on complex values”. ACM Transactions on Database Systems, 31(4), 2006.
C. Koch, S. Scherzinger, N. Schweikardt, and B. Stegmaier. “Schema-based Scheduling of Event Processors and Buffer Minimization for Queries on Structured Data Streams”. In Proc. VLDB’04, pages 228–239, 2004.
X. Li and G. Agrawal. “Efficient evaluation of XQuery over streaming data”. In Proc. VLDB’05, pages 265–276, 2005.
A. Marian and J. Siméon. “Projecting XML Documents”. In Proc. VLDB’03, pages 213–224, 2003.
D. Olteanu, H. Meuss, T. Furche, and F. Bry. “XPath: Looking Forward”. In EDBT 02: Proceedings of the Workshops XMLDM, MDDE, and YRWS on XML-Based Data Management and Multimedia Engineering-Revised Papers, pages 109–127, 2002.
D. Olteanu, T. Kiesling, and F. Bry. “An Evaluation of Regular Path Expressions with Qualifiers against XML Streams”. In Proc. ICDE’03, page 702, 2003.
H. Su, E. A. Rundensteiner, and M. Mani. “Semantic Query Optimization for XQuery over XML Streams”. In Proc. VLDB, pages 277–288, 2005.
P. R. Wilson. “Uniprocessor Garbage Collection Techniques”. In Proc. IWMM’92, pages 1–42, 1992.

Additional Resources

Full Benchmark Results
Each cell gives running time / memory consumption.
Query | Size | GCX | FluxQuery | Galax | MonetDB | Saxon | Qizx/open
Q1 | 10MB | 0.18s / 1.2MB | 1.59s / 50MB | 5.45s / 186MB | 0.86s / 30MB | 1.48s / 80MB | 1.20s / 38MB
Q1 | 50MB | 0.92s / 1.2MB | 3.96s / 111MB | 42.33s / 880MB | 3.69s / 98MB | 4.29s / 292MB | 3.74s / 195MB
Q1 | 100MB | 1.87s / 1.2MB | 6.94s / 111MB | 02:07m / 1.8GB | 7.19s / 225MB | 7.96s / 547MB | 6.56s / 285MB
Q1 | 200MB | 3.53s / 1.2MB | 12.27s / 111MB | timeout | 13.60s / 244MB | 14.30s / 973MB | 11.82s / 480MB
Q6 | 10MB | 0.34s / 1.2MB | n/a | 7.66s / 240MB | 0.98s / 29MB | 1.73s / 82MB | 1.56s / 33MB
Q6 | 50MB | 1.68s / 1.2MB | n/a | 57.98s / 1.2GB | 5.06s / 111MB | 5.78s / 292MB | 6.13s / 169MB
Q6 | 100MB | 3.33s / 1.2MB | n/a | 05:08m / 2GB | 9.94s / 253MB | 10.85s / 622MB | 11.74s / 484MB
Q6 | 200MB | 6.42s / 1.2MB | n/a | timeout | 19.95s / 337MB | 20.14s / 1.2GB | 20.33s / 805MB
Q8 | 10MB | 13.15s / 9.8MB | 18.04s / 128MB | 01:04m / 377MB | 02:56m / 407MB | 6.61s / 145MB | 9.89s / 148MB
Q8 | 50MB | 05:13m / 43MB | 06:51m / 169MB | 33:08m / 1.8GB | 03:26m / 1.35GB | 02:02m / 352MB | 03:38m / 265MB
Q8 | 100MB | 22:07m / 86MB | 27:01m / 216MB | timeout | - | 08:39m / 650MB | 14:27m / 397MB
Q8 | 200MB | timeout | timeout | timeout | - | 32:43m / 1.15GB | 52:05m / 636MB
Q13 | 10MB | 0.17s / 1.2MB | 1.60s / 52MB | 5.92s / 182MB | 0.80s / 31MB | 1.53s / 48MB | 1.26s / 28MB
Q13 | 50MB | 0.85s / 1.2MB | 3.98s / 111MB | 43.91s / 899MB | 3.64s / 98MB | 4.45s / 292MB | 3.85s / 195MB
Q13 | 100MB | 1.69s / 1.2MB | 7.00s / 111MB | 02:04m / 1.8GB | 7.34s / 224MB | 8.35s / 547MB | 6.81s / 285MB
Q13 | 200MB | 3.24s / 1.2MB | 12.33s / 111MB | timeout | 13.52s / 271MB | 15.02s / 1.05GB | 12.30s / 480MB
Q20 | 10MB | 0.25s / 1.2MB | 1.65s / 48MB | 6.95s / 215MB | 0.85s / 34MB | 1.65s / 62MB | 1.43s / 39MB
Q20 | 50MB | 1.24s / 1.2MB | 4.19s / 111MB | 53.08s / 1.5GB | 4.17s / 120MB | 4.90s / 292MB | 4.18s / 195MB
Q20 | 100MB | 2.48s / 1.2MB | 7.37s / 111MB | 03:14m / 2GB | 8.47s / 247MB | 9.13s / 622MB | 8.71s / 350MB
Q20 | 200MB | 4.74s / 1.2MB | 13.14s / 111MB | timeout | 16.40s / 296MB | 16.58s / 1.15GB | 15.80s / 628MB

Benchmark Queries (1)
<query1> {
  for $s in /site return
    for $p in $s/people return
      for $pe in $p/person return
        if ($pe/person_id="person0")
        then <result>{ $pe/name }</result>
        else ()
} </query1>
<query6> {
  for $site in //site return
    for $regions in $site/regions return
      $regions//item
} </query6>

Benchmark Queries (2)
<query8> {
  for $root in (/) return
    for $site in $root/site return
      for $people in $site/people return
        for $person in $people/person return
          <item> { (
            <person>{ $person/name }</person>,
            <items_bought> {
              for $site2 in $root/site return
                for $cas in $site2/closed_auctions return
                  for $ca in $cas/closed_auction return
                    for $buyer in $ca/buyer return
                      if ($buyer/buyer_person=$person/person_id)
                      then <result> { $ca } </result>
                      else ()
            } </items_bought>
          ) } </item>
} </query8>

Benchmark Queries (3)
<query13> {
  for $site in /site return
    for $regions in $site/regions return
      for $australia in $regions/australia return
        for $item in $australia/item return
          <item> { (
            <name> { $item/name } </name>,
            <desc> { $item/description } </desc>
          ) } </item>
} </query13>

Benchmark Queries (4)
<query20> {
  for $site in /site return
    for $people in $site/people return
      for $person in $people/person return
        if (fn:not(fn:exists($person/person_income)))
        then $person
        else ()
} </query20>

Buffer Plot (1)
[Plot: buffer plot for XMark Q6 (see Benchmark Queries (1)) on the 10MB input document]
According to the DTD, all regions occur at the beginning of the document.

Buffer Plot (2)
[Plot: buffer plot for XMark Q8 (see Benchmark Queries (2)) on the 10MB input document. Annotations: first partition of join partners: persons; second partition of join partners: buyers]

Buffer Plot (3)
XQuery:
<r> {
  for $bib in /bib return (
    for $x in $bib/* return
      if (not(exists($x/price))) then $x else (),
    for $b in $bib/book return $b/title )
} </r>
[Figure: document structure with root bib, content (book|article)*, and node labels title, author, price]
[Plots: buffer plots for two inputs, one with 9 x article + 1 x book and one with 9 x book + 1 x article]

The GCX Runtime Engine
[Figure: runtime components StreamPreprojector, BufferManager, Buffer, and Evaluator. Inputs and outputs: XQuery, input stream, output stream. Labels on the connections: nodes/roles; node lookup, garbage collection; nextNode() returning node/eos; getNext($x/π) returning node/NULL; signOff($x/π,r) returning OK.]

System Architecture
[Figure: components XQuery, normalized XQuery, rewritten XQuery (role updates), projection paths, roles, projection DFA (constructed lazily, assigns roles), stream preprojector, buffer (nodes & roles), evaluator, input stream, output stream. Labels on the connections: role updates; input.]