top k knapsack joins and closure early results

Top k Knapsack Joins and Closure Early Results Witold LITWIN & Thomas Schwarz U. Paris Dauphine, France [email protected] Santa Clara U., CA, [email protected] 1


Post on 22-Feb-2016



TRANSCRIPT

Page 1: Top  k  Knapsack Joins and Closure  Early Results

1

Top k Knapsack Joins and Closure Early Results

Witold LITWIN & Thomas Schwarz

U. Paris Dauphine, France, [email protected]

Santa Clara U., CA, [email protected]

Page 2: Top  k  Knapsack Joins and Closure  Early Results

2

Knapsack Join (KS-Join)

• The join defined by the sum of the join attributes being at most some constant

• Father of 4 kids wishing to buy toys for at most 100€ total

• A person wishing to buy a computer tower, a screen, a printer and a desk for at most 1000 €

• ….

Page 3: Top  k  Knapsack Joins and Closure  Early Results

3

Knapsack Join (KS-Join)

• Traditional join
  – R1 Join R2 on c1 = c2

• KS-join
  – R1 Join R2 on c1 + c2 ≤ C

• Syntax legal for FROM clause in Access, SQL Server…

Page 4: Top  k  Knapsack Joins and Closure  Early Results

4

Top k Knapsack Join (KS-Join)

• The top k results in descending order of the sum, i.e., those closest to the constant

• Usually, only the few results closest to the constant are of interest

• Select TOP 1 * from Toys T1, Toys T2, Toys T3, Toys T4
  Where T1.Price + T2.Price + T3.Price + T4.Price <= 100
  and T1.Id < T2.Id and T2.Id < T3.Id and T3.Id < T4.Id
  Order by T1.Price + T2.Price + T3.Price + T4.Price Desc;
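The query above can be tried on a small data set. The following is a minimal sketch using Python's built-in `sqlite3` module; it assumes a `Toys(Id, Price)` table as in the slide, and uses `LIMIT` because SQLite does not support `TOP`. The function name `top_k_toy_bundles` and the sample data are illustrative only.

```python
import sqlite3

def top_k_toy_bundles(toys, budget, k):
    """Return the k priciest 4-toy bundles whose total price is <= budget.

    `toys` is a list of (id, price) pairs. Table and column names follow
    the slide's example; everything else is an assumption of this sketch.
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE Toys (Id INTEGER PRIMARY KEY, Price INTEGER)")
    con.executemany("INSERT INTO Toys VALUES (?, ?)", toys)
    rows = con.execute(
        """
        SELECT T1.Id, T2.Id, T3.Id, T4.Id,
               T1.Price + T2.Price + T3.Price + T4.Price AS Total
        FROM Toys T1, Toys T2, Toys T3, Toys T4
        WHERE T1.Price + T2.Price + T3.Price + T4.Price <= ?
          AND T1.Id < T2.Id AND T2.Id < T3.Id AND T3.Id < T4.Id
        ORDER BY Total DESC
        LIMIT ?
        """,
        (budget, k),
    ).fetchall()
    con.close()
    return rows

# Six hypothetical toys; the last column of each row is the bundle total.
toys = [(1, 10), (2, 20), (3, 25), (4, 30), (5, 40), (6, 45)]
best = top_k_toy_bundles(toys, 100, 1)
```

The `Id` ordering in the `WHERE` clause keeps one representative per set of four toys, so no bundle appears twice in a different order.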

Page 5: Top  k  Knapsack Joins and Closure  Early Results

5

Top k Knapsack Join (KS-Join)

• Top k Knapsack joins are of obvious interest
• How do DBMSs deal with them?
• Nested loop
  – To the best of our knowledge
• Result: execution time makes the SQL capability useless for a larger data set
  – Consider our example for just 1000 toys to choose from
  – FYI, a 1K-tuple table & 3-way KS-join killed SQL Server

Page 6: Top  k  Knapsack Joins and Closure  Early Results

6

Our Goal : Optimizing Top k KS-Joins

• Algorithms provably faster than the usual nested loop
  – Formulate the algorithm
  – Prove the complexity, storage & processing costs

• KS-optimized Nested Loop
• Self-join Nested Loop
• Sort Merge
• KS-Join Indices
• Distributed KS-Join Indices

Page 7: Top  k  Knapsack Joins and Closure  Early Results

7

Our Goal : Optimizing Top k KS-Joins

• Early Results
• Only for Top k KS-Joins (TkKS-Joins)
• Only the formal analysis as yet
• Many variants of TkKS-Join queries left for future work
  – See the paper

Page 8: Top  k  Knapsack Joins and Closure  Early Results

8

Knapsack Problem (KP)

• NP-hard optimization problem
• Among the most studied
• Input:
  – A set O of objects {o1,...,on}
  – An m-d space called the knapsack K, with values bi, 1 ≤ i ≤ m, each representing the capacity of the knapsack in the i-th dimension
  – A vector c whose entry cj represents the benefit of object j if it is in the knapsack

Page 9: Top  k  Knapsack Joins and Closure  Early Results

9

Knapsack Problem (KP)

• Input (continued):
  – The knapsack's constraints matrix with entries ai,j ; 1 ≤ i ≤ m ; 1 ≤ j ≤ n
  – Each entry stores the constraint value for object j in dimension i (price, size, volume...)

• Output:
  – A set O' of objects stored in the knapsack

Page 10: Top  k  Knapsack Joins and Closure  Early Results

10

Knapsack Problem (KP)

• Binary variable xj ∈ {0, 1} indicates the selection of object j into the knapsack
  – xj = 1 if object j is in the knapsack and xj = 0 otherwise
  – xj is a 0–1 decision variable

Page 11: Top  k  Knapsack Joins and Closure  Early Results

11

Knapsack Problem (KP)

• Select the elements of O' that maximize the total profit of the selected objects

• Provided the knapsack constraints are satisfied

Page 12: Top  k  Knapsack Joins and Closure  Early Results

12

Knapsack Problem (KP)

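• Formally, maximize:

$$\sum_{j=1}^{n} c_j x_j$$

• Subject to:

$$\sum_{j=1}^{n} a_{i,j} x_j \le b_i, \quad 1 \le i \le m; \qquad x_j \in \{0, 1\}, \quad 1 \le j \le n$$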
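The 0–1 formulation built up on the preceding slides (binary xj, benefits cj, constraint entries ai,j, capacities bi) can be checked on tiny instances by exhaustive search. The sketch below is only an illustration of the definition, not a practical solver: it enumerates all 2^n selections, and the function name `solve_kp` and the sample numbers are assumptions of this example. It also assumes non-negative values, so the empty selection is always feasible.

```python
from itertools import product

def solve_kp(c, a, b):
    """Exhaustively solve a small 0-1 (multidimensional) knapsack instance.

    c[j]    : benefit of object j
    a[i][j] : constraint value of object j in dimension i
    b[i]    : capacity of the knapsack in dimension i
    Returns (best_profit, x) where x is the tuple of 0-1 decisions.
    """
    n = len(c)
    best_profit, best_x = 0, (0,) * n
    for x in product((0, 1), repeat=n):
        # Check every dimension's capacity constraint.
        feasible = all(
            sum(a[i][j] * x[j] for j in range(n)) <= b[i]
            for i in range(len(b))
        )
        profit = sum(c[j] * x[j] for j in range(n))
        if feasible and profit > best_profit:
            best_profit, best_x = profit, x
    return best_profit, best_x

# 1-d example: three objects, one capacity constraint of 50.
profit, x = solve_kp(c=[60, 100, 120], a=[[10, 20, 30]], b=[50])
```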

Page 13: Top  k  Knapsack Joins and Closure  Early Results

13

Knapsack Problem (KP)

• The most frequently investigated case is the 1-d one
  – I.e., m = 1, so i = 1 only
• Often, perhaps most often, the term KP implicitly designates this case
• Frequently, one also sets cj = aj for every j
• We assume both conditions below
  – Unless we state otherwise
• The m-d case is then referred to, if needed, as the multidimensional KP (MKP)

Page 14: Top  k  Knapsack Joins and Closure  Early Results

14

Knapsack Problem (KP)

• The general research orientation for KP and MKP
• Find a heuristic providing an acceptable approximate result
  – For the largest possible data set
  – In the fastest time
  – Or acceptable time
  – Given necessary constraints on the computer system used

Page 15: Top  k  Knapsack Joins and Closure  Early Results

15

KP / TkKS-Join

• Our research orientation follows the database approach
• Find an exact result
  – For a reasonably practical problem subspace
  – For database-size data
    • Say, 1K tuples per table at least

Page 16: Top  k  Knapsack Joins and Closure  Early Results

16

KP / TkKS-Join

• Find an exact result (continued)
  – In the fastest time
  – Or acceptable time
    • Minutes at most
  – Given necessary constraints on the computer system used
    • Mainly storage cost

Page 17: Top  k  Knapsack Joins and Closure  Early Results

17

Knapsack Problem (KP)

• Our reasonably practical problem subspace at present:
  – As we already stated, cj = aj
  – 1-d space
  – Fixed # of objects for the knapsack
    • Join instead of closure

Page 18: Top  k  Knapsack Joins and Closure  Early Results

18

Knapsack Problem (KP)

• Our reasonably practical problem subspace at present (continued):
  – One tuple = one potential selection
  – One object = one tuple with a distinct ID
    • No object selected twice in a tuple for the knapsack
• Closure, MKP… left for the future

Page 19: Top  k  Knapsack Joins and Closure  Early Results

19

Nested loop TkKS-Join

• Basic cost for tables with n1…nm tuples
  – O(n1*…*nm)
• To speed up the computation, start with:
  – Evaluation of the restrictions ti < C
  – Evaluation of ti ≤ C – (Min1+…+Minj+…+Minm)
    • Summing Minj over every j ≠ i, where Minj is the minimal join attribute value in table j
    • The DBMS may easily maintain the Minj statistics
    • The cost of the bound can be O(m) or even only O(1)
  – Similarly for C ≥ Max1+…+Maxm?
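The pre-filter described above can be sketched as follows. This is an illustrative reading of the slide, not the authors' code: each table is reduced to a plain list of join attribute values, every table is assumed non-empty, and the names `prefilter` and `pruned_nested_loop` are hypothetical.

```python
import heapq
from itertools import product

def prefilter(tables, cap):
    """Drop values that cannot appear in any result.

    A value v of table i survives only if v <= cap minus the sum of the
    minima of all other tables (the Min statistics of the slide).
    """
    mins = [min(t) for t in tables]
    total_min = sum(mins)
    return [
        [v for v in t if v <= cap - (total_min - mins[i])]
        for i, t in enumerate(tables)
    ]

def pruned_nested_loop(tables, cap, k):
    """Top-k KS-join by nested loop over the pre-filtered tables."""
    filtered = prefilter(tables, cap)
    # product() plays the role of the m nested loops.
    feasible = (combo for combo in product(*filtered) if sum(combo) <= cap)
    return heapq.nlargest(k, feasible, key=sum)

tables = [[10, 60, 95], [20, 30, 99], [5, 50, 80]]
kept = prefilter(tables, 100)
top2 = pruned_nested_loop(tables, 100, 2)
```

In this example each table shrinks from three values to two, so the loop visits 8 combinations instead of 27.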

Page 20: Top  k  Knapsack Joins and Closure  Early Results

20

Nested loop TkKS-Join

• Self-join of a table with its copies
• Since the KS-join is commutative, one may avoid duplicates
  – E.g., if we have the tuple (t1, t2), then we should not have the tuple (t2, t1)
  – In general, we need only one tuple from all its permutations
• This optimization cuts the complexity and computation time by half, at least
• Final word: we may have
  – O(n1*…*nm / S), where S ≥ 1
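To get a feeling for the factor S in the self-join case, one can count candidates directly. The sketch below assumes the duplicate elimination is done with strictly increasing IDs, as in the toy query earlier; under that assumption the candidate count drops from n^m to the binomial coefficient C(n, m), so S is at least m!. The function name is illustrative.

```python
from itertools import combinations, product
from math import comb

def self_join_search_space(n, m):
    """Return (ordered, unordered) candidate counts for an m-way
    self-join of an n-tuple table.

    ordered   : all m-tuples, as a plain nested loop would visit them
    unordered : one representative per set of m distinct tuples
    """
    return n ** m, comb(n, m)

# Sanity check by explicit enumeration on a tiny table of 6 tuples.
ids = range(6)
ordered_small = sum(1 for _ in product(ids, repeat=3))
unordered_small = sum(1 for _ in combinations(ids, 3))

# The 4-way example with 1000 toys.
ordered, unordered = self_join_search_space(1000, 4)
S = ordered / unordered
```

For the 1000-toy example this gives roughly 4.1 × 10^10 candidates instead of 10^12, which is still far too many for a plain nested loop.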

Page 21: Top  k  Knapsack Joins and Closure  Early Results

21

Sort-Merge TkKS-Join

• 2-way join

[Figure: 2-way sort-merge example with C = 150]
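Since the figure did not survive, here is a minimal sketch of one way to do a 2-way top-k KS-join by sort-merge. It is an assumption-laden illustration, not necessarily the authors' algorithm: both inputs are plain lists of join attribute values, a single backward pointer finds the best partner for each value of the first table, and a heap then yields results in descending order of the sum. The name `top_k_ks_join` is hypothetical.

```python
import heapq

def top_k_ks_join(r1, r2, cap, k):
    """Return up to k pairs (a, b) with a + b <= cap, largest sums first."""
    a_sorted = sorted(r1)
    b_sorted = sorted(r2)
    heap = []
    j = len(b_sorted) - 1
    # Merge phase: as a grows, the index of the largest feasible b
    # can only move left, so the pointer j never moves back.
    for i, a in enumerate(a_sorted):
        while j >= 0 and a + b_sorted[j] > cap:
            j -= 1
        if j < 0:
            break  # no partner for this a, hence none for larger a
        heapq.heappush(heap, (-(a + b_sorted[j]), i, j))
    result = []
    # Output phase: pop the best remaining pair, then offer the next
    # smaller partner of the same a as a new candidate.
    while heap and len(result) < k:
        _, i, j = heapq.heappop(heap)
        result.append((a_sorted[i], b_sorted[j]))
        if j > 0:
            heapq.heappush(heap, (-(a_sorted[i] + b_sorted[j - 1]), i, j - 1))
    return result

r1 = [10, 50, 100, 120]
r2 = [20, 40, 90, 140]
top3 = top_k_ks_join(r1, r2, 150, 3)
```

After sorting, the merge phase touches each input element at most once, which is where the linear cost in n1 + n2 comes from; the heap adds a logarithmic factor per returned result.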

Page 22: Top  k  Knapsack Joins and Closure  Early Results

22

Sort-Merge TkKS-Join

• Processing cost of 2-way TkKS-Join
  – O(n1 + n2) in general
  – O((n1 + n2)/2) for self-join
• For m-way TkKS-Join
  – O(nm*…*n3*(n2 + n1)) in general
  – For self-join?
• E.g., for 16K-tuple tables R1 and R2, the 2-way join accelerates 8K times
  – 1 sec instead of 2+ hours
• See the paper
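The 8K speed-up can be reproduced from the cost formulas on this slide, reading O(n1 + n2) as the cost for two tables and comparing it with the nested loop cost n1*n2. The sketch below is only this arithmetic; it ignores constants and the cost of sorting, and the function names are illustrative.

```python
from math import prod

def nested_loop_cost(sizes):
    """Nested loop cost model: n1 * ... * nm."""
    return prod(sizes)

def sort_merge_cost(sizes):
    """Sort-merge cost model from the slide: nm * ... * n3 * (n2 + n1)."""
    n1, n2, *rest = sizes
    return prod(rest) * (n1 + n2)

def speedup(sizes):
    return nested_loop_cost(sizes) / sort_merge_cost(sizes)

two_way = speedup([16 * 1024, 16 * 1024])   # two 16K-tuple tables
hours = two_way / 3600                       # if sort-merge takes 1 second
```

With two 16K-tuple tables the ratio is 16K * 16K / 32K = 8K, and 8K seconds is a little over two hours, in line with the figures on the slide.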

Page 23: Top  k  Knapsack Joins and Closure  Early Results

23

KS-Join Index

• A relational table IKS with at least the attributes (C, t1.Id,…, tm.Id)
  – Here C = t1.c+…+tm.c
  – Also t1.Id <… < tm.Id

• Can also be seen as a materialized view
• Some or all ti.c should be useful as well
  – E.g., for queries with additional restrictions on individual prices

Page 24: Top  k  Knapsack Joins and Closure  Early Results

24

KS-Join Index

• IKS should be implemented as a file sorted on C first
  – Then, on other key or non-key attributes of interest
  – E.g., a B-tree or trie…
• Storage cost:
  – O(n1*…*nm) in general
  – Half of it or less for copies of the same table
• 3-way indices may be in RAM
• Larger ones should typically be on flash or disk
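A KS-join index for a self-join can be mimicked in memory with a sorted list and binary search. The sketch below is a simplification under stated assumptions: a Python list stands in for the B-tree or trie, the table is a list of (Id, c) pairs with distinct IDs, and the names `build_ks_index` and `top_k` are hypothetical.

```python
from bisect import bisect_right
from itertools import combinations

def build_ks_index(rows, m):
    """Build the index entries (C, id_1, ..., id_m), sorted on C first.

    rows: list of (id, c) pairs. IDs inside an entry are ascending, so
    each set of m tuples appears exactly once.
    """
    entries = []
    for combo in combinations(sorted(rows), m):
        total = sum(c for _, c in combo)
        entries.append((total, *(i for i, _ in combo)))
    entries.sort()
    return entries

def top_k(index, cap, k):
    """Return up to k entries with C <= cap, largest C first."""
    # (cap, inf) sorts after every entry whose C equals cap.
    hi = bisect_right(index, (cap, float("inf")))
    return index[max(0, hi - k):hi][::-1]

rows = [(1, 10), (2, 20), (3, 25), (4, 30), (5, 40), (6, 45)]
iks = build_ks_index(rows, 3)
best2 = top_k(iks, 60, 2)
```

The lookup is a single binary search followed by a short scan, while the index itself holds one entry per combination, which illustrates the storage cost listed above.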

Page 25: Top  k  Knapsack Joins and Closure  Early Results

25

KS-Join Index

• Processing cost
  – O(log_p(n1*…*nm)) or less, according to the storage cost, where p is the tree fan-out
• Expected practical figures
  – ms for RAM, e.g., 3-way KS-Join index for 1K-tuple tables
  – under 10 ms for flash
  – under 100 ms for the disk, e.g., 4-way KS-Join index for our 1K-tuple tables

Page 26: Top  k  Knapsack Joins and Closure  Early Results

26

KS-Join Index

• Maintenance cost
  – High processing cost
  – E.g., 1 insert into our 1K tables generates 1M new entries
• Main drawback of KS-Indices at present
• Efficient processing of updates is an open problem
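The 1M figure can be reproduced with a simple count, assuming it refers to a 3-way index over distinct 1K-tuple tables: a new tuple in one table pairs with every combination of tuples from the other tables. The sketch below encodes that counting model only; the function names are illustrative, and the self-join variant assumes entries with strictly increasing IDs.

```python
from math import comb, prod

def new_entries_general(sizes, target):
    """Index entries added when one tuple is inserted into table `target`
    of an m-way KS-join index over distinct tables of the given sizes."""
    return prod(n for i, n in enumerate(sizes) if i != target)

def new_entries_self_join(n, m):
    """Same count for an m-way self-join index over one n-tuple table:
    the new tuple combines with every set of m - 1 existing tuples."""
    return comb(n, m - 1)

three_way = new_entries_general([1000, 1000, 1000], target=0)
three_way_self = new_entries_self_join(1000, 3)
```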

Page 27: Top  k  Knapsack Joins and Closure  Early Results

27

Composing KS-Join Indices

• TkKS-Join computation can compose existing KS-Indices
• m-way & n-way indices may speed up an (m+n)-way TkKS-Join
• Through the sort-merge algorithm applied to both indices
• Seconds may suffice for up to 6-way joins
  – E.g., for our 1K relations
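Composition can be sketched as a merge over two indices that are already sorted on C. The example below finds only the top-1 result and makes several assumptions: each index is a list of (C, id, ...) tuples in ascending order of C, the two indices cover different tables (so no ID can repeat across them), and the name `best_composed` is hypothetical.

```python
def best_composed(index_a, index_b, cap):
    """Return (total, ids_a, ids_b) with the largest total <= cap,
    or None if no pair of entries fits.

    Both indices must be sorted ascending on their first field C.
    """
    best = None
    j = len(index_b) - 1
    for entry_a in index_a:
        # Move left in index_b until the combined sum fits.
        while j >= 0 and entry_a[0] + index_b[j][0] > cap:
            j -= 1
        if j < 0:
            break  # larger entries of index_a cannot fit either
        total = entry_a[0] + index_b[j][0]
        if best is None or total > best[0]:
            best = (total, entry_a[1:], index_b[j][1:])
    return best

# A hypothetical 2-way index and a 1-way index (a sorted table).
index_a = [(30, 1, 2), (55, 1, 3), (80, 2, 3)]
index_b = [(20, 7), (45, 8), (70, 9)]
best = best_composed(index_a, index_b, 100)
```

Each index is scanned once, so the cost is linear in the total number of index entries rather than in their product.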

Page 28: Top  k  Knapsack Joins and Closure  Early Results

28

Scalable-Distributed TkKS-Join Index

• Speeds up the computation of even larger joins
• Using parallel distributed processing
• Dozens of seconds may suffice for an 8-way join
  – Over our favorite 1K-tuple relations
  – With two 4-way KS-Indices
  – Each being distributed over 1K nodes
  – Through, e.g., RP* SDDS
• Maintenance speeds up as well

Page 29: Top  k  Knapsack Joins and Closure  Early Results

29

Scalable-Distributed TkKS-Join Index

• C = 900 ; arrows show nodes to join in parallel

[Figure: two rows of index nodes, labeled 100, 350, 800, 9900 and 10, 50, 450, 700]

Page 30: Top  k  Knapsack Joins and Closure  Early Results

30

Conclusion

• TkKS-Joins are potentially useful
• Our optimizations may speed up the processing by orders of magnitude
• Queries with TkKS-Joins then become practical
• With all the usual disclaimers, the results appear ready for mainstream DBMSs

Page 31: Top  k  Knapsack Joins and Closure  Early Results

31

Future Work

• Deeper formal analysis
• Experiments
• More TkKS-Join query types
  – See the paper

Thank You for Your Attention