Background Skew Multi-round Open Problems
Multi-join Query Evaluation on Big Data
Lecture 4
Dan Suciu
March, 2015
Dan Suciu Multi-Joins – Lecture 4 March, 2015 1 / 29
Multi-join Query Evaluation – Outline
Part 1 Optimal Sequential Algorithms.
Part 2 Lower bounds for Parallel Algorithms.
Part 3 Optimal Parallel Algorithms.
Part 4 Data Skew.
Summary so far
Q(x) = R_1(x_1), ..., R_ℓ(x_ℓ), with |R_1| = m_1, ..., |R_ℓ| = m_ℓ
Sequential World. Cost: output size of Q.
Upper bound: $m^{\rho^*}$; general: $m_1^{u_1} \cdots m_\ell^{u_\ell}$. Fractional edge cover.
Lower bound (tightness): fractional vertex packing.
Generic-join algorithm.
Parallel World. Cost: communication. 1-round, skew-free, equal cardinalities.
Lower bound: $m / p^{1/\tau^*}$; general: $(m_1^{u_1} \cdots m_\ell^{u_\ell} / p)^{1/\sum_j u_j}$. Fractional edge packing.
Upper bound: fractional vertex cover.
HyperCube algorithm.
Outline of Lecture 4
Background: Hash-Based Partition
Coping with Skew
Multi-rounds (very short)
Open Problems
Will consider only databases without skew
Understanding Skew (Review from Lecture 2)
Given: R(x, y) with m items, p servers.
Send each tuple R(x, y) to server h(y), where h = random function.
Claim 1 For each server u, its expected load is E[L_u] = m/p. Why?
Claim 2 We say that R is skewed if some value y occurs more than m/p times in R. Then some server has load > m/p.
Claim 3 If R is not skewed, the maximum load of all servers is O(m/p) with high probability. (Details in lecture 4.)
Take-away: we assume no skew, meaning every frequency is ≤ m/p; then:
Max-load = O(Expected load)
Hash-Based Partition
A hash function is a function from some domain D to a range of integers [p]. E.g. h : char(30) → {1, ..., p}
A random family of hash functions is a set of hash functions, from which we select one at random.
A strongly universal family of hash functions is a set with the property that for any distinct values x_1, ..., x_n ∈ D, and any outputs (u_1, ..., u_n) ∈ [p]^n,
$P(h(x_1) = u_1 \wedge \cdots \wedge h(x_n) = u_n) = 1/p^n$
Hash-Based Partition
Let R be a bag (“multi-set”) with m elements.
Partition R into p bins, using a hash function h: send element x ∈ R to bin h(x) ∈ [p].
Denote L_u = number of elements in bin u ∈ [p] (a random variable).
Example: h(x) = [(x − 'b') mod 3] + 1
x:    a a a b c c d e e e
h(x): 3 3 3 1 2 2 3 1 1 1
L_1 = 4, L_2 = 2, L_3 = 4
Q: What is E[L_u], for a fixed u? A: E[L_u] = m/p
Q: What is E[max_u L_u]? A: May be as large as m.
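The example on this slide can be replayed in a few lines of Python. This is only a sketch: `h` mirrors the slide's example hash, while `bin_loads` is a hypothetical helper name introduced here for illustration.

```python
from collections import Counter

def h(x: str, p: int = 3) -> int:
    """The slide's example hash: ((x - 'b') mod p) + 1, bins numbered 1..p."""
    return (ord(x) - ord("b")) % p + 1

def bin_loads(items, p: int = 3) -> dict:
    """Return L_u, the number of items sent to each bin u in 1..p."""
    counts = Counter(h(x, p) for x in items)
    return {u: counts.get(u, 0) for u in range(1, p + 1)}

R = list("aaabccdeee")              # the bag from the slide, m = 10
loads = bin_loads(R)
print(loads)                        # {1: 4, 2: 2, 3: 4}

# A fully skewed bag: every element is 'a', so one bin receives all m items.
worst = bin_loads(list("a" * 10))
print(worst)                        # {1: 0, 2: 0, 3: 10}
```

The second call illustrates the answer to the last question: with a single repeated value, the maximum load equals m no matter which hash function is chosen.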
Hash-Based Partition
Let d_x denote the number of occurrences of the value x in R.
Theorem
Let α > 0 be a constant such that d_x ≤ m/(α·p), for all x. Then, for all δ ≥ 0:
$\forall u \in [p]: \; P(L_u > (1+\delta)\, m/p) < 1/2^{\alpha\delta}$
$P(\max_u L_u > (1+\delta)\, m/p) < p/2^{\alpha\delta}$
Exercise on the Problem Set: prove it (hint: use Bennett’s theorem).
Application: if αδ ≥ (1 + c) log p for some constant c > 0, then
$P(\max_u L_u > (1+\delta)\, m/p) < 1/p^c$
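The application follows from the theorem by direct substitution. The short derivation below assumes the logarithm is base 2, which the slide does not state explicitly; the bound on the maximum comes from a union bound over the p bins.

```latex
\begin{align*}
P\Big(\max_u L_u > (1+\delta)\tfrac{m}{p}\Big)
  &\le \sum_{u \in [p]} P\Big(L_u > (1+\delta)\tfrac{m}{p}\Big)
   < \frac{p}{2^{\alpha\delta}} \\
  &\le \frac{p}{2^{(1+c)\log_2 p}}
   = \frac{p}{p^{1+c}}
   = \frac{1}{p^{c}} .
\end{align*}
```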
Summary on Hash-Based Partition
Basic Partition Send item R(x) to bin h(x).
If R has no skew (degrees ≤ m/p) then Max-Size = O(Expected-Size) = O(m/p), w.h.p. (hiding a log p factor)
HyperCube Partition Send R(x_1, ..., x_k) to bin (h_1(x_1), ..., h_k(x_k)).
If R has no skew then Max-Size = O(Expected-Size) = O(m/p) w.h.p.
Given shares p_1 p_2 ⋯ p_k = p, no skew means: ∀S ⊆ [k], every value of (x_i)_{i∈S} occurs ≤ m/∏_{i∈S} p_i times:
x_1 = a occurs ≤ m/p_1 times
x_2 = b occurs ≤ m/p_2 times
(x_1 = a, x_2 = b) occurs ≤ m/(p_1 p_2) times, etc.
The hidden log p factor becomes a poly-log factor.
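A minimal sketch of the HyperCube partition and of the no-skew condition stated above. All names here (`hash_to_share`, `hypercube_bin`, `has_no_skew`) are hypothetical, and SHA-256 merely stands in for the random hash functions h_1, ..., h_k; bins are numbered from 0 for convenience.

```python
import hashlib
from collections import Counter
from itertools import combinations
from math import prod

def hash_to_share(value, share: int, salt: str) -> int:
    """Map a value into range(share); the salt selects the hash function h_i."""
    digest = hashlib.sha256(f"{salt}:{value!r}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % share

def hypercube_bin(tup, shares):
    """Bin coordinates (h_1(x_1), ..., h_k(x_k)) for one tuple."""
    return tuple(
        hash_to_share(x, s, salt=f"h{i}")
        for i, (x, s) in enumerate(zip(tup, shares))
    )

def has_no_skew(R, shares) -> bool:
    """True if every value of every attribute subset S occurs
    at most m / prod_{i in S} p_i times."""
    m, k = len(R), len(shares)
    for size in range(1, k + 1):
        for S in combinations(range(k), size):
            threshold = m / prod(shares[i] for i in S)
            freq = Counter(tuple(t[i] for i in S) for t in R)
            if max(freq.values(), default=0) > threshold:
                return False
    return True

shares = (2, 2)                                  # p = 4 servers
uniform = [(i, i % 4) for i in range(16)]        # m = 16, no value too frequent
skewed = [(i, 0) for i in range(16)]             # x_2 = 0 occurs 16 > 16/2 times
print(has_no_skew(uniform, shares), has_no_skew(skewed, shares))  # True False
```

Checking every subset S costs time exponential in k, which is acceptable here because k is the arity of a single relation.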
Heavy Hitters
Definition
(Informal) A value (or tuple of values) is a heavy hitter if it occurs more often than its skew threshold.
Example:
Basic partition of R(x, y) into p bins: x is a heavy hitter if d_x > m/p.
HyperCube partition of R(x, y, z) into a cube p_1 p_2 p_3:
x is a heavy hitter if d_x > m/p_1.
y is a heavy hitter if d_y > m/p_2.
A pair (x, y) is a heavy hitter if d_xy > m/(p_1 p_2).
A triple (x, y, z) is never heavy. (Why?)
etc.
Fact
There are at most O(p) heavy hitters. (Why?)
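One way to see the O(p) bound for the basic partition: the degrees sum to m, so fewer than p values can each have degree above m/p. The sketch below finds the heavy hitters of a single column; `heavy_hitters` is a hypothetical helper name, not something defined in the lecture.

```python
from collections import Counter

def heavy_hitters(values, p: int) -> dict:
    """Return {value: degree} for every value whose degree exceeds m/p."""
    m = len(values)
    degrees = Counter(values)
    return {v: d for v, d in degrees.items() if d > m / p}

ys = ["a"] * 6 + ["b"] * 3 + ["c", "d", "e"]    # m = 12
hh = heavy_hitters(ys, p=4)                      # threshold m/p = 3
print(hh)                                        # {'a': 6}
```

Here "b" has degree exactly 3, which does not exceed the threshold, so only "a" is heavy.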
Heavy Hitters
Two ways to cope with heavy hitters:
Unknown Heavy Hitters Design the algorithm to be robust to heavy hitters.
Known Heavy Hitters Find all heavy hitters (only O(p)) and treat them specially. This is by far preferred in practice.
Unknown Heavy Hitters
Consider a join: Q(x , y , z) = R(x , y),S(y , z).
Algorithm 1 Use shares p_x = 1, p_y = p, p_z = 1 (standard hash-join).
If data has no skew: L = m/p.
If data is skewed: L = m. Sensitive to skew.
Algorithm 2 Use shares p_x = p_y = p_z = p^{1/3}.
If data has no skew: L = m/p^{2/3}.
If data is skewed: L = m/p^{1/3}. Resilient to skew.
This observation generalizes: for any query, we can compute shares that minimize the load under the worst skew.
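The four loads quoted for the two algorithms can be reproduced with a small simulation. This is a sketch under simplifying assumptions: it counts only the tuples of R (S is symmetric), uses p = 27 so that p^{1/3} = 3, and replaces random hash functions with modulo hashing on integers. The function name `max_load_R` is hypothetical.

```python
from collections import Counter

def max_load_R(R, shares):
    """Max number of R(x, y) tuples received by any server of the cube.

    R(x, y) goes to servers (hx(x), hy(y), j) for every j in range(pz),
    i.e. it is replicated along the z dimension.
    """
    px, py, pz = shares
    load = Counter()
    for x, y in R:
        for j in range(pz):
            load[(x % px, y % py, j)] += 1
    return max(load.values())

m = 81
no_skew = [(i, i // 3) for i in range(m)]   # every y value has degree 3 = m/p
skewed = [(i, 0) for i in range(m)]         # a single y value with degree m

hash_join = (1, 27, 1)                      # Algorithm 1
cube = (3, 3, 3)                            # Algorithm 2

print(max_load_R(no_skew, hash_join))       # 3  = m/p
print(max_load_R(skewed, hash_join))        # 81 = m
print(max_load_R(no_skew, cube))            # 9  = m/p^(2/3)
print(max_load_R(skewed, cube))             # 27 = m/p^(1/3)
```

Under skew the y coordinate is the same for every tuple, so only the x dimension of the cube spreads the data; that is why the load degrades to m/p_x rather than to m.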
Known Heavy Hitters
Let HH be the set of all heavy hitter values. |HH| = O(p)
For each relation R_j(x_j), variables z ⊆ x_j, constants v ∈ Domain^z, let:
m_{j,z}[v] = number of tuples in R_j that have z = v
In particular, m_{j,∅}[()] = m_j
Definition
Given a database instance R_1, ..., R_ℓ, its statistics are the set of numbers:
Σ = {m_{j,z}[v] | j = 1, ..., ℓ, z ⊆ x_j, v ∈ HH^z}
Note: |Σ| = O(p), small enough that we can broadcast to all servers.
Open Problem Design a Σ-optimal query evaluation algorithm (provably optimal for the set of statistics Σ).
Discussion
A Σ-optimal algorithm represents the sweet spot between worst-case algorithms and instance-optimal algorithms [Ngo’14].
In practice, systems often compute on the fly some statistics in Σ in order to avoid significant skew, e.g. skew-join in PigLatin.
No Σ-optimal algorithm for arbitrary queries is known to date. [Beame’14] describe an algorithm that is optimal only within a poly-log p factor (due to the algorithm, not to the hash function).
Σ-optimal algorithms are known for a few special cases, in particular for a join [Beame’14]. We will discuss next.
Σ-Optimal Join Algorithm
Q(x, y, z) = R(x, y), S(y, z), |R| = m_R, |S| = m_S.
∀v ∈ Domain: m_R[v] = degree of v in R, m_S[v] = degree of v in S
(Thus: ∑_v m_R[v] = m_R and ∑_v m_S[v] = m_S)
Heavy hitters:
HH_R = {v | m_R[v] ≥ m_R/p}
HH_S = {v | m_S[v] ≥ m_S/p}
HH = HH_R ∪ HH_S
Statistics:
Σ = {m_R, m_S} ∪ {m_R[v] | v ∈ HH_R} ∪ {m_S[v] | v ∈ HH_S}
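As a rough illustration, the statistics Σ for this join can be collected in one pass over each relation. The function name `join_statistics` and the dictionary layout of the result are assumptions made for this sketch; R is taken to be a list of (x, y) pairs and S a list of (y, z) pairs.

```python
from collections import Counter

def join_statistics(R, S, p: int) -> dict:
    """Collect m_R, m_S and the degrees of the heavy hitters on the join attribute y."""
    m_R, m_S = len(R), len(S)
    deg_R = Counter(y for _, y in R)      # m_R[v] for every v
    deg_S = Counter(y for y, _ in S)      # m_S[v] for every v
    hh_R = {v: d for v, d in deg_R.items() if d >= m_R / p}
    hh_S = {v: d for v, d in deg_S.items() if d >= m_S / p}
    return {"m_R": m_R, "m_S": m_S, "HH_R": hh_R, "HH_S": hh_S}

# m_R = 12: y = 0 has degree 6, the values 1..6 have degree 1.
R = [(i, 0) for i in range(6)] + [(i, i) for i in range(1, 7)]
# m_S = 12: y = 0 and y = 7 have degree 4, the values 1..4 have degree 1.
S = [(0, j) for j in range(4)] + [(7, j) for j in range(4)] + [(k, 0) for k in range(1, 5)]

stats = join_statistics(R, S, p=4)        # thresholds m_R/p = m_S/p = 3
print(stats["HH_R"], stats["HH_S"])       # {0: 6} {0: 4, 7: 4}
```

In this instance the value 7 is a one-sided heavy hitter: it is heavy in S but does not occur in R at all.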
Cartesian Product Revisited
A heavy hitter transforms the join into a cartesian product; let's revisit it.
Q(x , z) = R(x),S(z).
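The load is: $L = \max_u \left( m_1^{u_1} m_2^{u_2} / p \right)^{1/(u_1+u_2)}$ (here m_1 = m_R, m_2 = m_S)
u       L(u)                  Shares p_x, p_z
(1,1)   (m_R m_S / p)^{1/2}   (p m_R / m_S)^{1/2}, (p m_S / m_R)^{1/2}
(1,0)   m_R / p               p, 1
(0,1)   m_S / p               1, p
L = max((m_R m_S / p)^{1/2}, m_R / p, m_S / p)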
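The table can be turned into a small calculator that picks the dominating packing u and returns the matching load and shares. This is an illustrative sketch; `cartesian_load` is a hypothetical name, and the shares are returned as real numbers without rounding to integers.

```python
from math import sqrt

def cartesian_load(m_R: float, m_S: float, p: float):
    """Return (u, load, (p_x, p_z)) for the packing u that maximizes L(u)."""
    rows = {
        (1, 1): (sqrt(m_R * m_S / p), (sqrt(p * m_R / m_S), sqrt(p * m_S / m_R))),
        (1, 0): (m_R / p, (p, 1)),
        (0, 1): (m_S / p, (1, p)),
    }
    u = max(rows, key=lambda k: rows[k][0])
    load, shares = rows[u]
    return u, load, shares

# Balanced sizes: the square-root term dominates.
print(cartesian_load(100, 100, 4))    # ((1, 1), 50.0, (2.0, 2.0))

# Very unbalanced sizes (m_R > p * m_S): hashing R alone is best.
print(cartesian_load(1600, 1, 4))     # ((1, 0), 400.0, (4, 1))
```

The (1,0) row takes over exactly when m_R > p·m_S, since m_R/p > (m_R m_S/p)^{1/2} is equivalent to that inequality.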
Σ-Optimal Join Algorithm – Intuition
Q(x , y , z) = R(x , y),S(y , z)
∀v ∈ HH a cartesian product: Q_v(x, z) = R_v(x), S_v(z), where R_v(x) = R(x, v), S_v(z) = S(v, z).
Q_v = residual query; |R_v| = m_R[v], |S_v| = m_S[v].
Allocate p_v ≤ p servers to compute Q_v.
Load L_v = (m_R[v] · m_S[v] / p_v)^{1/2} (assuming > m_R[v]/p, m_S[v]/p)
Determine the number of servers p_v such that ∑_{v∈HH} p_v = p, and max_v L_v is minimized.
Optimal solution: $p_v = p \cdot m_R[v]\, m_S[v] / \sum_{w \in HH} m_R[w]\, m_S[w]$
Optimal load: $L = \left( \sum_{w \in HH} m_R[w]\, m_S[w] / p \right)^{1/2} = L_v$, for all v ∈ HH.
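A short numeric check of the optimal allocation: giving each heavy hitter a share of servers proportional to m_R[v]·m_S[v] equalizes all the loads L_v. The names `allocate_servers`, `deg_R` and `deg_S` are hypothetical, and the sketch ignores the rounding of p_v to whole servers.

```python
from math import sqrt

def allocate_servers(deg_R: dict, deg_S: dict, p: int):
    """deg_R[v], deg_S[v] are the degrees of heavy hitter v.
    Return ({v: p_v}, common load L)."""
    total = sum(deg_R[v] * deg_S[v] for v in deg_R)
    alloc = {v: p * deg_R[v] * deg_S[v] / total for v in deg_R}
    return alloc, sqrt(total / p)

deg_R = {"a": 40, "b": 10}
deg_S = {"a": 10, "b": 20}
alloc, L = allocate_servers(deg_R, deg_S, p=6)
print(alloc, L)                       # {'a': 4.0, 'b': 2.0} 10.0

# Each residual cartesian product Q_v then has the same load L_v = L.
for v, p_v in alloc.items():
    print(v, sqrt(deg_R[v] * deg_S[v] / p_v))   # a 10.0, then b 10.0
```

Equal loads are what makes the allocation optimal: moving a server from one heavy hitter to another would raise the load of the one that loses it.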
Algorithm HyperSkew
Assume HHR = HHS = HH
Light Hitters Run HyperCube on the light hitters.
Load: max(m_R/p, m_S/p)
Heavy Hitters In parallel, for each v ∈ HH:
Compute Q_v using HyperCube on $p_v = p \cdot m_R[v]\, m_S[v] / \sum_{w \in HH} m_R[w]\, m_S[w]$ servers.
Load: $L = \left( \sum_{w \in HH} m_R[w]\, m_S[w] / p \right)^{1/2}$
Exercise: generalize to HH_R ≠ HH_S (Hint: set m_R[v] = 1 for v ∈ HH_S − HH_R, etc.).
The total load is:
$L_{lower} = \max\left( m_R/p,\; m_S/p,\; \left( \sum_{w \in HH} m_R[w]\, m_S[w] / p \right)^{1/2} \right)$
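The total load formula is easy to evaluate directly. The sketch below assumes HH_R = HH_S as on the slide, with `deg_R` and `deg_S` (hypothetical names) holding the degrees of the heavy hitters; `hyperskew_load` is likewise a name chosen for illustration.

```python
from math import sqrt

def hyperskew_load(m_R: int, m_S: int, deg_R: dict, deg_S: dict, p: int) -> float:
    """max(m_R/p, m_S/p, sqrt(sum_w m_R[w] * m_S[w] / p))."""
    light = max(m_R / p, m_S / p)
    heavy = sqrt(sum(deg_R[w] * deg_S[w] for w in deg_R) / p)
    return max(light, heavy)

# One heavy hitter covering half of each relation: the heavy part does not hurt.
print(hyperskew_load(100, 100, {"a": 50}, {"a": 50}, p=4))     # 25.0

# One heavy hitter covering everything: the load rises from m/p to m/p^(1/2).
print(hyperskew_load(100, 100, {"a": 100}, {"a": 100}, p=4))   # 50.0
```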
One Sided Heavy Hitters
The contribution of one-sided heavy hitters is negligible:
Lemma
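$\left( \sum_{w \in HH_R - HH_S} m_R[w]\, m_S[w] / p \right)^{1/2} \le \max(m_R, m_S)/p$
Proof.
$\sum_{w \in HH_R - HH_S} m_R[w]\, m_S[w] \le \left( \sum_{w \in HH_R - HH_S} m_R[w] \right) m_S/p \le m_R m_S / p$
The first step uses m_S[w] < m_S/p for w ∉ HH_S. Dividing by p and taking the square root gives (m_R m_S)^{1/2}/p ≤ max(m_R, m_S)/p.
Thus, the load due to one-sided heavy hitters will not exceed the load due to light hitters, and it’s OK to approximate the missing degree with 1.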
Discussion
Skew may significantly affect the communication cost for joins: the load factor degrades from 1/p to 1/p^{1/2}.
Adapting the algorithm to the statistics Σ is also important: if all heavy hitters are one-sided, then the extra cost can be avoided completely.
Multiple Rounds
Basic idea is very simple: generate a query plan for Q, then compute the plan bottom up; each level is one round.
Goal: reduce the load by having multiple rounds.
Challenge (major!): intermediate results may be much bigger, and are hard to estimate.
Two approaches:
In [Beame’13] each operator in the query plan is a conjunctive query (rather than just a single join), and is evaluated using 1-round HyperCube. No control over the intermediate results.
In [Afrati’14] the GYM algorithm computes conjunctive queries only at the leaves s.t. the residual query is acyclic, then uses Yannakakis’ semi-join reduction to control the size of the intermediate results.
Grand Summary
Big Data Analytics needs to run complex queries, on big data.
Traditional query processing: one join at a time.
Challenge: large and unpredictable intermediate results.
Novel worst-case optimal sequential algorithm: runtime = AGM bound.
Novel parallel algorithm: communication cost = provably optimal.
Many open problems (next)
Open Problem 1: AGM Bounds for Given Statistics
Generalize the AGM bound to databases with known statistics Σ.
Simpler problems:
Generalize the AGM bound for databases with bounded degrees. E.g. Q = R(x, y), S(y, z): the normal AGM upper bound is m^2; if all degrees are ≤ d, the upper bound is dm.
Generalize the AGM bound to functional dependencies. This immediately proves the previous item (for details, send me email).
Open Problem 2: Σ-Optimal Sequential Algorithm
Design a sequential algorithm that is worst-case optimal for instances satisfying given statistics Σ.
E.g. Q = R(x, y), S(y, z), T(z, x). Suppose ∀y, the degree of y in S is ≤ 5. Then compute Q in time O(m). What if we know one HH y_0 with degree m/2?
Open Problem 3: Lower Bounds for Multi-Round Parallel Algorithms
Prove lower bounds for 2 or more rounds, assuming servers can send messages encoding arbitrary information.
Example: prove that R(x, y), S(y, z), T(z, u), K(u, v), L(v, w) cannot be computed in 2 rounds with load m/p.
Note: [Beame’13] proved such lower bounds, but for a weaker model, where a message consists of tuples, not of arbitrary bits.
Open Problem 4: Optimal Multi-round Algorithms
Find the exact tradeoff between the load per round L and the number of rounds r for a general query.
Note: the emphasis here is on the algorithm, not on lower bounds. For lower bounds it’s OK to give the proof for a weaker model (as in [Beame’13]).
Open Problem 5: Σ-Optimal Parallel Algorithms
Generalize HyperSkew from joins to arbitrary queries. It seems straightforward how to generalize it, but it’s open whether this is optimal.
Conjectured tight bound:
$L_{lower} = \max_{u, z} \left( \sum_{v \in Domain^z} m_{1,z}[v]^{u_1} \cdots m_{\ell,z}[v]^{u_\ell} / p \right)^{1/\sum_j u_j}$
where z ⊆ {x_1, ..., x_k}, and u is a fractional edge packing of the residual query q_z that covers all variables in z.
This formula is known to be a lower bound [Beame’14], but it is not known whether it is tight.
Final Remark
Please solve the 7 problems on the Problem Set. Some are quite easy, some more challenging.
If you work on any of the open problems and make progress, please let me know.
Thank you!