dependable cardinality forecast for xquery
TRANSCRIPT
Dependable Cardinality Forecasts for XQuery
Jens Teubner, ETH (formerly IBM Research)Torsten Grust, U Tubingen (formerly TUM)
Sebastian Maneth, UNSW and NICTASherif Sakr, UNSW and NICTA
c© Systems Group — Department of Computer Science — ETH Zurich August 26, 2008
Cardinality Estimation for XQueryThe feature-richness and semantics of the language makecardinality estimation for XQuery notoriously hard.
for $d in doc ("forecast.xml")/descendant::day
let $day := $d/@t
let $ppcp := data ($d/descendant::ppcp)
return
if ($ppcp > 50)
then ("rain likely on", $day,
"chance of precipitation:", $ppcp)
else ("no rain on", $day)
I for iterationI conditionals (if-then-else)I existential quantification
I sequence constructionI . . .(XPath is not a focus of this work.)
→ Goal: Compute subexpression-level cardinalities cardi.
August 26, 2008 Systems Group — Department of Computer Science — ETH Zurich 2
Cardinality Estimation for XQueryThe feature-richness and semantics of the language makecardinality estimation for XQuery notoriously hard.
card1 = ?
card2 = ?card3 = ?
card4 = ?
for $d in doc ("forecast.xml")/descendant::day
let $day := $d/@t
let $ppcp := data ($d/descendant::ppcp)
return
if ($ppcp > 50)
then ("rain likely on", $day,
"chance of precipitation:", $ppcp)
else ("no rain on", $day)
I for iterationI conditionals (if-then-else)I existential quantification
I sequence constructionI . . .(XPath is not a focus of this work.)
→ Goal: Compute subexpression-level cardinalities cardi.
August 26, 2008 Systems Group — Department of Computer Science — ETH Zurich 2
Idea: Perform cardinality estimation on relational planequivalents for XQuery.
relational cardinality estimation (System R)+ existing work on XPath estimation+ histograms for value predicates
= cardinality estimation for XQuery
I Build on Pathfinder’s XQuery-to-relational algebra compiler.1
→ tuple count≡ XQuery item countI Cardinality information for each subexpression.
1http://www.pathfinder-xquery.org/
August 26, 2008 Systems Group — Department of Computer Science — ETH Zurich 3
iter1
πiter,item:"forecast.xml"
docitem
item:descendant::day(item)
%pos:〈item〉‖iter
%inner:〈iter,pos〉
πiter:inner,itemπinner,outer:iter,
ord:pos
item:descendant::ppcp(item)
πiter1:iter,pos1:1,item1:50
item:attribute::t(item)
%pos:〈item〉‖iter%pos:〈item〉‖iter
πiter1:iter,pos,item
item
πiter1:iter,pos,item
oniter1=iter
=res:(item,item1)
σres
πiter
δ\
πiter,pos:1,ord:1,item:"rain..."
oniter1=iter
πiter,pos,item,ord:2
·∪
·∪
·∪
oniter=iter1
πiter,pos:1,ord:3,item:"chance..."
πiter,pos,item,ord:4
%pos1:〈ord,pos〉‖iter
πiter,pos:pos1,item
oniter1=iter
πiter,pos:1,ord:1,item:"no rain..."
πiter,pos,item,ord:2
·∪
%pos1:〈ord,pos〉‖iter
πiter,pos:pos1,item
·∪
oniter=inner
%pos1:〈ord,pos〉‖outer
πiter:outer,pos:pos1,item
πiter
1©
1©
2©
2©
3©
3©
4©
4©for $d in doc ("forecast.xml")/descendant::day
let $day := $d/@t
let $ppcp := data ($d/descendant::ppcp)
return
if ($ppcp > 50)
then ("rain likely on", $day,
"chance of precipitation:", $ppcp)
else ("no rain on", $day)
Example (Plan details not of interest today.)
I Pathfinder compiles XQuerywith arbitrary nesting.
I Maintain correspondence tooriginal query if back-end isnot relational.
I Derive estimates based on aninference rule set.
Relational XQuery Cardinality EstimationApply System R-style estimation to relational XQuery plans, e.g.,
Disjoint union:|q1 ·∪ q2| = |q1|+ |q2|
Cartesian product:|q1 × q2| = |q1| · |q2|
Equi-join:
|q1 ona=b
q2| =
|q1| · |q2|max {|a|idx , |b|idx}
if there are indexes onboth join columns,
?
|q1| · |q2||a|idx
if there is only an indexon column a,
?
|q1| · |q2| · 1/10 otherwise
|c|idx: Number of unique values in index on column c.
I Our joins typically operate over computed relations.
August 26, 2008 Systems Group — Department of Computer Science — ETH Zurich 5
Relational XQuery Cardinality EstimationApply System R-style estimation to relational XQuery plans, e.g.,
Disjoint union:|q1 ·∪ q2| = |q1|+ |q2|
Cartesian product:|q1 × q2| = |q1| · |q2|
Equi-join:
|q1 ona=b
q2| =
|q1| · |q2|max {|a|idx , |b|idx}
if there are indexes onboth join columns, ?
|q1| · |q2||a|idx
if there is only an indexon column a, ?
|q1| · |q2| · 1/10 otherwise
|c|idx: Number of unique values in index on column c.
I Our joins typically operate over computed relations.
August 26, 2008 Systems Group — Department of Computer Science — ETH Zurich 5
Abstract Domain Identifiers
A simple form of data flow analysis provides the informationneeded.
I Introduce abstract domain identifiers α, β, . . . as placeholdersfor the active runtime domain for each column c.(Read cα as “column c contains values from domain α.”)
I Estimate the size ‖α‖ of each domain α, e.g.,2
dom(%a:〈b1,...,bn〉(q)
)⊇ dom (q) ∪
{aα ∧ ‖α‖ =! |q|
}.
I Identify inclusion relationships α v β between domains, e.g.,
aα ∈ dom (q) ∧ aβ ∈ dom (σ···(q))⇒ β v α .
2Operator %a:〈b1,...,bn〉 introduces a new key column (holding row numbers).August 26, 2008 Systems Group — Department of Computer Science — ETH Zurich 6
Abstract Domain IdentifiersUse abstract domain information for cardinality estimation.
E.g., “foreign key” join:
aα ∈ dom (q1) bβ ∈ dom (q2) α v β
|q1 ona=b
q2| =|q1| · |q2|‖β‖
I Domain inclusion guarantees that each tuple in q1 finds(at least one) join partner in q2.
Other examples:I |q1 \ q2| = |q1| − |q2| if q2 is a subset of q1.I |q1 \ q2| = 0 if q1 is a subset of q2.I |q1 \ q2| = |q1| if q1 and q2 are disjoint.
August 26, 2008 Systems Group — Department of Computer Science — ETH Zurich 7
Interfacing with XPath—Projection Paths
Track XPath navigation by means of projection paths3
a
b c d
e
a b1 γa2 γb3 γd
a b c1 γa γb1 γa γc1 γa γd3 γd γe
c:child::*(b)
q1 q2
I b⇒p ∈ path (q1) ⇒ c⇒p/child::* ∈ path (q2)
I Step operator makes XPath navigation explicit in relationalplans (compiles to join on SQL back-ends).
3A. Marian and J. Simeon. Projecting XML Documents. VLDB 2003.August 26, 2008 Systems Group — Department of Computer Science — ETH Zurich 8
Interfacing with XPath—Cardinality Inference
a
b c d
e
a b1 γa2 γb3 γd
a b c1 γa γb1 γa γc1 γa γd3 γd γe
c:child::*(b)
q1 q2
bα bα′
α′ v α
Cardinality:|q2| = |q1| · fn:count (p/child::*)
fn:count (p) = |q1| · Prchild::* (p)︸ ︷︷ ︸fanout (here: 4/3)
Domain Sizes:‖α′‖ = ‖α‖ · fn:count (p [ child::* ])
fn:count (p) = ‖α‖ · Pr[child::*] (p)︸ ︷︷ ︸selectivity (here: 2/3)
Any XPath estimator that provides Prp2 (p1) and Pr[p2] (p1) will do.I Our prototype uses a simple Data Guide-based implementation.
August 26, 2008 Systems Group — Department of Computer Science — ETH Zurich 9
iter1
πiter,item:"forecast.xml"
docitem
item:descendant::day(item)
%pos:〈item〉‖iter
%inner:〈iter,pos〉
πiter:inner,itemπinner,outer:iter,
ord:pos
item:descendant::ppcp(item)
πiter1:iter,pos1:1,item1:50
item:attribute::t(item)
%pos:〈item〉‖iter%pos:〈item〉‖iter
πiter1:iter,pos,item
item
πiter1:iter,pos,item
oniter1=iter
=res:(item,item1)
σres
πiter
δ\
πiter,pos:1,ord:1,item:"rain..."
oniter1=iter
πiter,pos,item,ord:2
·∪
·∪
·∪
oniter=iter1
πiter,pos:1,ord:3,item:"chance..."
πiter,pos,item,ord:4
%pos1:〈ord,pos〉‖iter
πiter,pos:pos1,item
oniter1=iter
πiter,pos:1,ord:1,item:"no rain..."
πiter,pos,item,ord:2
·∪
%pos1:〈ord,pos〉‖iter
πiter,pos:pos1,item
·∪
oniter=inner
%pos1:〈ord,pos〉‖outer
πiter:outer,pos:pos1,item
πiter
1
2
34
Back to our Example Plan
πiter:inner,item
item:descendant::ppcp(item)
πiter1:iter,pos1:1,item1:50
%pos:〈item〉‖iter
item
oniter1=iter
=res:(item,item1)
πiter
a: Retrieve typed values for node identifiersin column a (atomization).
iter1
πiter,item:"forecast.xml"
docitem
item:descendant::day(item)
%pos:〈item〉‖iter
%inner:〈iter,pos〉
πiter:inner,itemπinner,outer:iter,
ord:pos
item:descendant::ppcp(item)
πiter1:iter,pos1:1,item1:50
item:attribute::t(item)
%pos:〈item〉‖iter%pos:〈item〉‖iter
πiter1:iter,pos,item
item
πiter1:iter,pos,item
oniter1=iter
=res:(item,item1)
σres
πiter
δ\
πiter,pos:1,ord:1,item:"rain..."
oniter1=iter
πiter,pos,item,ord:2
·∪
·∪
·∪
oniter=iter1
πiter,pos:1,ord:3,item:"chance..."
πiter,pos,item,ord:4
%pos1:〈ord,pos〉‖iter
πiter,pos:pos1,item
oniter1=iter
πiter,pos:1,ord:1,item:"no rain..."
πiter,pos,item,ord:2
·∪
%pos1:〈ord,pos〉‖iter
πiter,pos:pos1,item
·∪
oniter=inner
%pos1:〈ord,pos〉‖outer
πiter:outer,pos:pos1,item
πiter
1
2
34
Back to our Example Plan
Projection path:item⇒···/descendant::ppcp∈ path
(...(q)
)Cardinality (uses fanout):∣∣ item:ax::nt(item)(q)
∣∣ = |q|·Prax::nt (· · · )
Domain inclusion:iterα
′∈ dom
(...(q)
); α′ v α
Domain size (uses step selectivity):‖α′‖ = ‖α‖ · Pr[ax::nt] (· · · )
iterα
iterα′
iterα
iterα′iter1α
πiter:inner,item
item:descendant::ppcp(item)
πiter1:iter,pos1:1,item1:50
%pos:〈item〉‖iter
item
oniter1=iter
=res:(item,item1)
πiter
a: Retrieve typed values for node identifiersin column a (atomization).
iter1
πiter,item:"forecast.xml"
docitem
item:descendant::day(item)
%pos:〈item〉‖iter
%inner:〈iter,pos〉
πiter:inner,itemπinner,outer:iter,
ord:pos
item:descendant::ppcp(item)
πiter1:iter,pos1:1,item1:50
item:attribute::t(item)
%pos:〈item〉‖iter%pos:〈item〉‖iter
πiter1:iter,pos,item
item
πiter1:iter,pos,item
oniter1=iter
=res:(item,item1)
σres
πiter
δ\
πiter,pos:1,ord:1,item:"rain..."
oniter1=iter
πiter,pos,item,ord:2
·∪
·∪
·∪
oniter=iter1
πiter,pos:1,ord:3,item:"chance..."
πiter,pos,item,ord:4
%pos1:〈ord,pos〉‖iter
πiter,pos:pos1,item
oniter1=iter
πiter,pos:1,ord:1,item:"no rain..."
πiter,pos,item,ord:2
·∪
%pos1:〈ord,pos〉‖iter
πiter,pos:pos1,item
·∪
oniter=inner
%pos1:〈ord,pos〉‖outer
πiter:outer,pos:pos1,item
πiter
1
2
34
Back to our Example Plan
Projection path:item⇒···/descendant::ppcp∈ path
(...(q)
)Cardinality (uses fanout):∣∣ item:ax::nt(item)(q)
∣∣ = |q|·Prax::nt (· · · )
Domain inclusion:iterα
′∈ dom
(...(q)
); α′ v α
Domain size (uses step selectivity):‖α′‖ = ‖α‖ · Pr[ax::nt] (· · · )
iterα
iterα′
iterα
iterα′iter1α
πiter:inner,item
item:descendant::ppcp(item)
πiter1:iter,pos1:1,item1:50
%pos:〈item〉‖iter
item
oniter1=iter
=res:(item,item1)
πiter
a: Retrieve typed values for node identifiersin column a (atomization).
iter1
πiter,item:"forecast.xml"
docitem
item:descendant::day(item)
%pos:〈item〉‖iter
%inner:〈iter,pos〉
πiter:inner,itemπinner,outer:iter,
ord:pos
item:descendant::ppcp(item)
πiter1:iter,pos1:1,item1:50
item:attribute::t(item)
%pos:〈item〉‖iter%pos:〈item〉‖iter
πiter1:iter,pos,item
item
πiter1:iter,pos,item
oniter1=iter
=res:(item,item1)
σres
πiter
δ\
πiter,pos:1,ord:1,item:"rain..."
oniter1=iter
πiter,pos,item,ord:2
·∪
·∪
·∪
oniter=iter1
πiter,pos:1,ord:3,item:"chance..."
πiter,pos,item,ord:4
%pos1:〈ord,pos〉‖iter
πiter,pos:pos1,item
oniter1=iter
πiter,pos:1,ord:1,item:"no rain..."
πiter,pos,item,ord:2
·∪
%pos1:〈ord,pos〉‖iter
πiter,pos:pos1,item
·∪
oniter=inner
%pos1:〈ord,pos〉‖outer
πiter:outer,pos:pos1,item
πiter
1
2
34
Back to our Example Plan
“Foreign key” join:∣∣q1 oniter1=iter
q2∣∣ = |q1|·|q2|
‖α‖ (since α′ v α)
Projection path:item⇒···/descendant::ppcp∈ path
(...(q)
)Cardinality (uses fanout):∣∣ item:ax::nt(item)(q)
∣∣ = |q|·Prax::nt (· · · )
Domain inclusion:iterα
′∈ dom
(...(q)
); α′ v α
Domain size (uses step selectivity):‖α′‖ = ‖α‖ · Pr[ax::nt] (· · · )
iterα
iterα′
iterα
iterα′iter1α
πiter:inner,item
item:descendant::ppcp(item)
πiter1:iter,pos1:1,item1:50
%pos:〈item〉‖iter
item
oniter1=iter
=res:(item,item1)
πiter
a: Retrieve typed values for node identifiersin column a (atomization).
iter1
πiter,item:"forecast.xml"
docitem
item:descendant::day(item)
%pos:〈item〉‖iter
%inner:〈iter,pos〉
πiter:inner,itemπinner,outer:iter,
ord:pos
item:descendant::ppcp(item)
πiter1:iter,pos1:1,item1:50
item:attribute::t(item)
%pos:〈item〉‖iter%pos:〈item〉‖iter
πiter1:iter,pos,item
item
πiter1:iter,pos,item
oniter1=iter
=res:(item,item1)
σres
πiter
δ\
πiter,pos:1,ord:1,item:"rain..."
oniter1=iter
πiter,pos,item,ord:2
·∪
·∪
·∪
oniter=iter1
πiter,pos:1,ord:3,item:"chance..."
πiter,pos,item,ord:4
%pos1:〈ord,pos〉‖iter
πiter,pos:pos1,item
oniter1=iter
πiter,pos:1,ord:1,item:"no rain..."
πiter,pos,item,ord:2
·∪
%pos1:〈ord,pos〉‖iter
πiter,pos:pos1,item
·∪
oniter=inner
%pos1:〈ord,pos〉‖outer
πiter:outer,pos:pos1,item
πiter
1
2
34
card1: 990
card2: 4104
card3: 3540 card4: 564Back to our Example Plan
“Foreign key” join:∣∣q1 oniter1=iter
q2∣∣ = |q1|·|q2|
‖α‖ (since α′ v α)
Projection path:item⇒···/descendant::ppcp∈ path
(...(q)
)Cardinality (uses fanout):∣∣ item:ax::nt(item)(q)
∣∣ = |q|·Prax::nt (· · · )
Domain inclusion:iterα
′∈ dom
(...(q)
); α′ v α
Domain size (uses step selectivity):‖α′‖ = ‖α‖ · Pr[ax::nt] (· · · )
iterα
iterα′
iterα
iterα′iter1α
πiter:inner,item
item:descendant::ppcp(item)
πiter1:iter,pos1:1,item1:50
%pos:〈item〉‖iter
item
oniter1=iter
=res:(item,item1)
πiter
a: Retrieve typed values for node identifiersin column a (atomization).
Forecasting New Zealand’s WeatherThe obtained cardinalities can be mapped back to predict itemcounts for corresponding XQuery expressions:4
card1 = 990/990
card2 = 4104/3402card3 = 3540/2370
card4 = 564/1032
for $d in doc ("forecast.xml")/descendant::day
let $day := $d/@t
let $ppcp := data ($d/descendant::ppcp)
return
if ($ppcp > 50)
then ("rain likely on", $day,
"chance of precipitation:", $ppcp)
else ("no rain on", $day)
I Value statistics based on histograms (↗ paper)I Inaccuracy is mainly due to correlations in the data.
I Rain in the morning likely means rain in the afternoon, too.
4estimated/observed, based on data taken for New Zealand two weeks ago.August 26, 2008 Systems Group — Department of Computer Science — ETH Zurich 11
More Realistic Queries: W3C XQuery Use CasesI Prototype implementation
based on PathfinderI For each subexpression:
estimated cardinalityobserved cardinality
I Diameter indicates datapoint “stacking”
I Plan root: filled circle
I Applicable to “real” queriesI Recovery from intermediate
mis-estimationsI e.g., existential semantics
XMP Q4XMP Q5XMP Q6XMP Q8XMP Q9XMP Q10XMP Q11SGML Q3SGML Q4SGML Q8aSGML Q8bR Q2R Q3R Q10R Q11R Q13R Q14R Q15R Q17
0.01
0.01
0.1
0.1
1
1
10
10estimated cardinality/observed cardinality
August 26, 2008 Systems Group — Department of Computer Science — ETH Zurich 12
Wrap-UpI Cardinality estimation framework for XQueryI Subexpression-level estimates for arbitrary XQuery expressionsI Based on Pathfinder’s XQuery-to-relational algebra compiler
relational cardinality estimation (System R)+ existing work on XPath estimation+ histograms for value predicates
= cardinality estimation for XQuery
I High-quality estimates for realistic XQuery workloadsI Robust with respect to intermediate errors
I Pluggable and extensibleI e.g., XPath estimation subsystem, positional predicates
August 26, 2008 Systems Group — Department of Computer Science — ETH Zurich 13