multi-join query evaluation on big data lecture...

Background Skew Multi-round Open Problems

Multi-join Query Evaluation on Big DataLecture 4

Dan Suciu

March, 2015

Dan Suciu Multi-Joins – Lecture 4 March, 2015 1 / 29

Multi-join Query Evaluation – Outline

Part 1 Optimal Sequential Algorithms.

Part 2 Lower bounds for Parallel Algorithms.

Part 3 Optimal Parallel Algorithms.

Part 3 Data Skew.

Summary so far

Q(x) = R1(x1), . . . ,R`(x`) ∣R1∣ = m1, . . . , ∣R`∣ = m`

Sequential World Cost: output size of QUpper bound mρ∗ ; general: mu1

1 ⋯mu`` . Fractional edge cover.

Lower bound (tightness): fractional vertex packingGeneric-join algorithm.

Parallel World Cost: communication.1-round, skew-free, equal-cardinalities.

Lower bound m/p1/τ∗ ; general: (mu11 ⋯mu`

` /p)1/∑uj Fractional edgepacking.Upper bound: fractional vertex cover.HyperCube algorithm.

Outline of Lecture 4

Background: Hash-Based Partition

Coping with Skew

Multi-rounds (very short)

Open Problems

Will consider only databases without skew

Understanding Skew (Review from Lecture 2)

Given: R(x , y) with m items, p servers.Send each tuple R(x , y) to server h(y), where h = random function.

Claim 1 For each server u, its expected load is E[Lu] = m/p. Why?

Claim 2 We say that R is skewed if some value y occurs more thanm/p times in R. Then some server has load > m/p.

Claim 3 If R is not skewed, the maximum load of all servers isO(m/p) with high probability. (Details in lecture 4.)

Take-away: we assume no skew, meaning every frequency is ≤ m/p, then:

Max-load = O(Expected load)

Hash-Based Partition

A hash function is a function from some domain D to a range ofintegers [p]. E.g. h ∶ char(30)→ {1, . . . ,p}

A random family of hash functions is a set of hash functions, fromwhich we select one at random.

A strongly universal family of hash function is a set with the propertythat for any distinct values x1, . . . , xn ∈ D, and any outputs(u1, . . . ,un) ∈ [p]n,

P(h(x1) = u1 ∧⋯ ∧ h(xn) = un) =1

Let R be a bag (“multi-set”) with m elements.

Partition R into p bins, using a hash function h: send element x ∈ R to binh(x) ∈ [p].

Denote Lu = number of elements in bin u ∈ [p] (a random variable).Example: h(x) = [(x −′ b′) mod 3] + 1

x ∶ a a a b c c d e e eh(x) ∶ 3 3 3 1 2 2 3 1 1 1

L1 = 4,L2 = 2,L3 = 4

Q: What is E[Lu], for a fixed u? A: E[Lu] = m/pQ: What is E[maxu Lu]? A: May be as large as m.

x ∶ a a a b c c d e e eh(x) ∶ 3 3 3 1 2 2 3 1 1 1

L1 = 4,L2 = 2,L3 = 4

x ∶ a a a b c c d e e eh(x) ∶ 3 3 3 1 2 2 3 1 1 1

L1 = 4,L2 = 2,L3 = 4

Q: What is E[Lu], for a fixed u?

A: E[Lu] = m/pQ: What is E[maxu Lu]? A: May be as large as m.

x ∶ a a a b c c d e e eh(x) ∶ 3 3 3 1 2 2 3 1 1 1

L1 = 4,L2 = 2,L3 = 4

Q: What is E[Lu], for a fixed u? A: E[Lu] = m/p

Q: What is E[maxu Lu]? A: May be as large as m.

x ∶ a a a b c c d e e eh(x) ∶ 3 3 3 1 2 2 3 1 1 1

L1 = 4,L2 = 2,L3 = 4

Q: What is E[Lu], for a fixed u? A: E[Lu] = m/pQ: What is E[maxu Lu]?

A: May be as large as m.

x ∶ a a a b c c d e e eh(x) ∶ 3 3 3 1 2 2 3 1 1 1

L1 = 4,L2 = 2,L3 = 4

Let dx denote the number of occurrences of the value x in R.

Theorem

Let α > 0 be a constant such that dx ≤ mα⋅p , for all x . Then, for all δ ≥ 0:

∀u ∈ [p] ∶P(Lu > (1 + δ)mp) < 1

2α⋅δP(max

uLu > (1 + δ)m

p) < p

2α⋅δ

Exercise on the Problem Set: prove it (hint: use Bennett’s theorem).

Application: if αδ ≥ (1 + c) log p for some constant c > 0, then

P (maxu Lu > (1 + δ)mp ) < 1/pc

Summary on Hash-Based Partition

Basic Partition Sent item R(x) to bin h(x).If R has no skew (degrees ≤ m/p) thenMax-Size = O(Expected-Size) = O(m/p), w.h.p. (hiding log p factor)

HyperCube Partition Send R(x1, . . . , xk) to bin (h1(x1), . . . ,hk(xk))If R has no skew then Max-Size = O(Expected-Size) = O(m/p) w.h.p.

Given shares p1p2⋯pk = p, no skew means:∀S ⊆ [r], every value of (xi)i∈S occurs ≤ m/∏i∈S pi times:

x1 = a occurs ≤ m/p1 timesx2 = b occurs ≤ m/p2 times(x1 = a, x2 = b) occurs ≤ m/p1p2 times, etc.

The hidden log p factor becomes a poly-log factor.

Heavy Hitters

Definition

(Informal) A value (or tuple of values) is a heavy hitter if it occurs moreoften than its skew threshold.

Example:

Basic partition of R(x , y) into p bins: x is heavy hitter if dx > m/pHypercube partition of R(x , y , z) into a cube p1p2p3:

x is a heavy hitter if dx > m/p1.y is a heavy hitter if dy > m/p2.A pair (x , y) is a heavy hitter if dxy > m/p1p2.A triple (x , y , z) is never heavy. (Why?)etc

There are at most O(p) heavy hitters. (Why?)

Heavy Hitters

Two ways to cope with heavy hitters:

Unknown Heavy Hitters Design the algorithm to be robust to heavyhitters.

Known Heavy Hitters Find all heavy hitters (only O(p)) and treat themspecially. This is by far preferred in practice.

Unknown Heavy Hitters

Consider a join: Q(x , y , z) = R(x , y),S(y , z).

Algorithm 1 Use shares px = 1,py = p,pz = 1 (standard hash-join).

If data has no skew: L = m/p.If data is skewed: L = m. Sensitive to skew

Algorithm 2 Use shares px = py = pz = 1/3.

If data has no skew: L = m/p2/3.If data is skewed: L = m/p1/3. Resilient to skew

This observation generalizes: for any query, we can compute shares thatminimize the load under the worst skew.

Known Heavy Hitters

Let HH be the set of all heavy hitter values. ∣HH ∣ = O(p)For each relation Rj(xj), variables z ⊆ xj , constants v ∈ Domainz, let:

mj ,z[v] = number of tuples in Rj that have z = v

In particular, mj ,∅[()] = mj

Definition

Given a database instance R1, . . . ,R`, its statistics are the set of numbers:

Σ = {mj ,z[v] ∣ j = 1, `, z ⊆ xj ,v ⊆ HHz}

Note: ∣Σ∣ = O(p), small enough that we can broadcast to all servers.

Open Problem design a Σ-optimal query evaluation algorithms (provablyoptimal for the set of statistics Σ).

Known Heavy Hitters

Definition

Σ = {mj ,z[v] ∣ j = 1, `, z ⊆ xj ,v ⊆ HHz}

Known Heavy Hitters

Definition

Σ = {mj ,z[v] ∣ j = 1, `, z ⊆ xj ,v ⊆ HHz}

Discussion

A Σ-optimal algorithm represents the sweet spot between worst-casealgorithm, and instance-optimal algorithms [Ngo’14].

In practice, systems often compute on the fly some statistics in Σ inorder to avoid significant skew, e.g. skew-join in PigLatin.

No Σ-optimal algorithm for arbitrary queries is known to date.[Beame’14] describe an algorithm that is optimal only within apoly-log p factor (due to the algorithm, not to the hash function).

Σ-optimal algorithms are known for a few special cases, in particularfor a join [Beame’14]. We will discuss next.

Σ-Optimal Join AlgorithmQ(x , y , z) = R(x , y),S(y , z) ∣R ∣ = mR , ∣S ∣ = mS .

∀v ∈ Domain mR[v] =degree of v in R

mS[v] =degree of v in S

(Thus: ∑v mR[v] = mR and ∑v mS [v] = mS )

Heavy hitters:

HHR ={v ∣ mR[v] ≥ mR/p}HHS ={v ∣ mS[v] ≥ mS/p}HH =HHR ∪HHS

Statistics:

Σ = {mR ,mS} ∪ {mR[v] ∣ v ∈ HHR} ∪ {mS[v] ∣ v ∈ HHS}

Heavy hitters:

Statistics:

Heavy hitters:

Statistics:

Heavy hitters:

Statistics:

Cartesian Product Revisited

A heavy hitter transforms the join into a cartesian product, lets revisit it

Q(x , z) = R(x),S(z).

The load is: L = maxu (m

p )1/(u1+u2)

u L(u) Shares px ,pz

(1,1) (mRmS/p)1/2√

mS,√

(1,0) mR/p p,1(0,1) mS/p 1,p

L = max((mRmS/p)1/2,mR/p,mS/p)

Σ-Optimal Join Algorithm – Intuition

Q(x , y , z) = R(x , y),S(y , z)

∀v ∈ HH a cartesian product: Qv(x , z) = Rv(x),Sv(z),where Rv(x) = R(x , v), Sv(z) = S(v , y).

Qv = residual query; ∣Rv ∣ = mR[v], ∣Sv ∣ = mS[v].

Allocate pv ≤ p servers to compute Qv .Load Lv = (mR[v] ⋅mS[v]/pv)1/2 (assuming > mR[v]/p,mS[v]/p)

Determine the number of servers pv such that ∑v∈HH pv = p, andmaxv Lv is minimized.

Optimal solution: pv = p mR[v]mS [v]

∑w∈HH mR[w]mS [w]

Optimal load L = (∑w∈HH mR[w]ms[w]

p )1/2

= Lv , for all v ∈ HH.

Q(x , y , z) = R(x , y),S(y , z)

p )1/2

Q(x , y , z) = R(x , y),S(y , z)

p )1/2

Q(x , y , z) = R(x , y),S(y , z)

p )1/2

Q(x , y , z) = R(x , y),S(y , z)

p )1/2

Q(x , y , z) = R(x , y),S(y , z)

p )1/2

Algorithm HyperSkew

Assume HHR = HHS = HH

Light Hitters Run HyperCube on the light hitters.Load: max(mR/p,mS/p)

Heavy Hitters In parallel, for each v ∈ HH:

Compute Qv using HyperCube on pv = p mR[v]mS [v]

∑w∈HH mR[w]mS [w]servers.

Load: L = (∑w∈HH mR[w]ms[w]

p )1/2

Exercise: generalize to HHR ≠ HHS (Hint: set mR[v] = 1 for v ∈ HHS −HHR etc).

The total load is:

Llower = max(mR/p,mS/p,(∑w∈HH mR[w]ms[w]

One Sided Heavy Hitters

The contribution of one-sided heavy hitters is negligible:

(∑w∈HHR−HHSmR[w]ms[w]p

≤ max(mR ,mS)/p

Proof.

∑w∈HHR−HHSmR[w]ms[w] ≤ (∑w∈HHR−HHS

mR[w])msp ≤ mRmS

Thus, the load due to one sided heavy hitters will not exceed the load dueto light hitters, and it’s OK to approximate the missing degree with 1.

Discussion

Skew may affect significantly the communication cost for joins: thespeedup decreases from 1/p to 1/p1/2

Adapting the algorithm to the statistics Σ is also important: if allheavy hitters are one sided, then the extra cost can be avoidedcompletely.

Multiple Rounds

Basic idea is very simple: generate a query plan for Q, then computethe plan bottom up, each level is one round.

Goal: reduce the load by having multiple rounds.

Challenge (major!): intermediate results may be much bigger, and arehard to estimate.

Two approaches:

In [Beame’13] each operator in the query plan is a conjunctive query(rather than just a single join), and is evaluated using 1-roundHyperCube. No control over the intermediate results.

In [Afrati’14] the GYM algorithm computes conjunctive queries onlyat the leaves s.t. the residual query is acyclic, then use Yannakakis’semi-join reduction to control the size of the intermediate results.

Grand Summary

Big Data Analytics needs to run complex queries, on big data.

Traditional query processing: one join at a time.Challenge: large and unpredictable intermediate results.

Novel worst-case optimal sequential algorithm: runtime = AGMbound.

Novel parallel algorithm: communication cost = provably optimal.

Many open problems (next)

Open Problem 1: AGM Bounds for Given Statistics

Generalize the AGM bound to databases with known statistics Σ.

Simpler problems:

Generalize the AGM bound for databases with bounded degrees. E.g.Q = R(x , y),S(y , z), normal AGM upper bound is m2, if all degrees ≤ d ,upper bound is dm.

Generalize the AGM bound to functional dependencies. This immediatelyproves the previous item (for details, send me email)

Open Problem 2: Σ-Optimal Sequential Algorithm

Design a sequential algorithm that is worst-case optimal for instancessatisfying given statistics Σ.

E.g. Q = R(x , y),S(y , z),T (z , x). Suppose ∀y , degree of y in S is ≤ 5.Then compute Q in time O(m). What if we know one HH y0 with degreem/2?

Open Problem 3: Lower Bounds for Multi-Round ParallelAlgorithm

Prove lower bounds for 2 or more rounds, assuming servers can sendmessages encoding arbitrary information.

Example: prove that R(x , y),S(y , z),T (z ,u),K(u, v),L(v ,w) cannot becomputed in 2 rounds, with load m/p.

Note: [Beame’13] proved such lower bounds, but for a weaker model,when message consists of tuples, not of arbitrary bits.

Open Problem 4: Optimal Multi-round Algorithms

Find the exact tradeoff between the load/round L and the number ofrounds r for a general query.

Note: emphasis here is on algorithm, not lower bounds. For lower boundsit’s OK to give the proof for a weaker models (as in [Beame’13]).

Open Problem 5: Σ-Optimal Parallel Algorithms

Generalize HyperSkew from joins to an arbitrary queries. Seemsstraightforward how to generalize it, but it’s open whether this is optimal.

Conjectured tight bound:

Llower = maxu,z

⎛⎝∑v∈Domainz m

u11,z[v]⋯mu`

`,z[v]p

⎞⎠

1/∑j uj

where z ⊆ {x1, . . . , xk}, and u is fractional edge packing of the residualquery qz that covers all variables in z.This formula is known to be a lower bound [Beame’14], but not known iftight.

Final Remark

Please solve the 7 problems on the Problem Set. Some are quite easy,some more challenging.

If you work on any of the open problems and make progress, pleaselet me know.

Thank you!

multi-join query evaluation on big data lecture...

Documents

from semistructured data to xml dan suciu at&t labs...

1 lecture 5: transactions concurrency control wednesday...

1 lecture 11: bloom filters, final review december 7, 2011...

author:elena danciu coordinators:assistant professor...

#noagile - dan suciu

lecture 15 nosql even more slides than usual borrowed from...

matrice animation giurgiu suciu

lecture 9 query optimization november 24, 2010 dan suciu --...

1 lecture 5 transactions wednesday october 27 th, 2010 dan...

pdi lucretia suciu

1 lecture 02: relational query languages and conceptual...

proiect ms - alexandra suciu

1 lecture 7: indexes and database tuning wednesday, november...

lecture 20: query optimization (2) - university of...

principles of database systems cse 544p lecture #1 september...

lecture 03: normal forms, constraints, views wednesday,...

managing xml and semistructured data lecture 1:...

1 lecture 02: conceptual design wednesday, october 6, 2010...

lecture 10: nosql wednesday, december 1 st, 2011 dan suciu...

principles of database systems cse 544p lecture #1 january 6...