top k knapsack joins and closure early results

1

Top k Knapsack Joins and Closure Early Results

Witold LITWIN & Thomas SchwarzU. Paris Dauphine, France [email protected]

Santa Clara U., CA, [email protected]

mailto:[email protected]

mailto:[email protected]

2

Knapsack Join (KS-Join)

• The join defined by the sum of the join attributes being at most some constant

• Father of 4 kids wishing to buy toys for at most 100€ total

• A person wishing to buy a computer tower, a screen, a printer and a desk for at most 1000 €

• ….

3

Knapsack Join (KS-Join)

• Traditional join–R1 Join R2 on c1 = c2

• KS - join–R1 Join R2 on c1 + c2 ≤ C

• Syntax legal for FROM clause in Access, SQL Server…

4

Top k Knapsack Join (KS-Join)

• Top k items with respect to the descending order on the constant

• Usually, only a few items the most close to the constant are of interest

• Select TOP 1 * from Toys T1, Toys T2, Toys T3, Toys T4Where T1.Price + T2.Price + T3.Price + T4.Price ≤ 100and T1.Id < T2.Id and T2.Id < T3.Id and T3.Id < T4.IdOrder by T1.Price + T2.Price + T3.Price + T4.Price Desc;

5

Top k Knapsack Join (KS-Join)• Top k Knapsack joins are of obvious interest• How DBMSs deal with ?• Nested loop– To our best knowledge

• Result: execution time makes the SQL capability useless for a larger data set– Consider our example for just 1000 toys to choose

from– FYI, 1K-tuple table & 3-way KS-join killed SQL Server

6

Our Goal : Optimizing Top k KS-Joins

• Algorithms provably faster than usual nested loop– Formulate the algorithm– Prove the complexity, storage & processing costs

• KS-optimized Nested Loop • Self-join Nested Loop• Sort Merge• KS – Join Indices• Distributed KS – Join Indices

7

Our Goal : Optimizing Top k KS-Joins

• Early Results• Only for Top k KS-Joins (TkKS-Joins) • Only the formal analysis as yet• Many variants of TkKS-Join queries left

for future work– See the paper

8

Knapsack Problem (KP)

• NP-hard optimization problem • Among most studied• Input: – A set O of objects {o1,...on} – An m-d subspace called knapsack K with– values bi , 1 ≤ i ≤ m, represent each the i-th

dimension's capacity of the knapsack– Vector cj represents the benefit of the object j if in

the knapsack

9


• Input (continued): – The knapsack's constraints matrix with

entries ai,j ; 1 ≤ j ≤ n ; – Each entry stores the constraint value for

each object j in each dimension i (price, size, volume...).

• Output:– A set O' of objects stored in the knapsack.

10


• Binary variable xj ; xj {0, 1}, indicates the selection of the object j into the knapsack –(x j= 1) for object j in and (xj = 0)

otherwise– xj is 0–1 decision variable

11


• Select the elements of O’ which maximize the total profit of the selected objects

• Provided the match of the knapsack constraints

12


• Formally, maximize:

• Subject to:

13

Knapsack Problem (KP)• The most frequently investigated case is the 1-d one– I.e., i = 1

• Often, or perhaps even the most often, the KP concept designates implicitly this case.

• Frequently, in addition, one also sets every cj to cj = aj.

• Both conditions are ours below– unless we state otherwise

• The m-d one is referred to then, if needed, as multidimensional (MKP).

14

Knapsack Problem (KP)• The general research orientation for KP and

MKP • Find a heuristic providing acceptable

approximate result– For the possibly largest data set – In the fastest time, –Or acceptable time–Given necessary constraints on the

computer system used.

15

KP / TkKS -Join• Our research orientation follows the

database approach • Find an exact result–For a reasonably practical problem

subspace–For a database size data •Say, 1Ktuples per table at least

16

KP / TkKS -Join• Find an exact result (continued)–In the fastest time–Or acceptable time•Minutes at most

–Given necessary constraints on the computer system used•Mainly storage cost

17

Knapsack Problem (KP)• Our reasonably practical problem

subspace at present:–As we already stated cj = aj –1-d space– Fixed # of objects for the knapsack• Join instead of closure

18

Knapsack Problem (KP)• Our reasonably practical problem subspace

at present (continued):–One tuple = one potential selection– One object = one tuple with distinct ID•No objects selected twice in a tuple for

the knapsack• Closure, MKP… left for the future

19

Nested loop TkKS-Join• Basic cost for tables with n1…nm tuples– O (n1*…*nm)

• To accelerate the calculus start with:– Evaluation of the restrictions ti < C – Evaluation of ti ≤ C – (Min1+…+Minj+…+Minm) • for any j ≠ i • DBMS may easily maintain the Minj statistics• Cost can be O(m) or even O (1) only

– Idem for C ≥ Max1+…+Maxm ?

20

Nested loop TkKS-Join• Self-joint of a table with its copies• Since KS-join is commutative one may avoid doubles – E.g. if we have tuple (t1, t2) then we should not have the

tuple tuples (t2, t1) – In general, we need only one tuple from all its

permutations • The optimizing cuts the complexity and calculus time

by half, at least• Final word: we may have – O (n1*…*nm /S), where S ≥ 1

21

Sort-Merge TkKS-Join• 2-way join

C =150

150

150

22

Sort-Merge TkKS-Join• Processing cost of 3-way TkKS-Join– O (n1+n2 ) in general– O ((n1+n2 )/2) for self-join

• For n-way TkKS-Join– O (nm*…n3(n2 + n1)) in general– For self-join ?

• E.g. For 16K-tuple R1 and R2 tables m-way join accelerates 8K times

– 1sec instead of 2+ hours• See the paper

23

KS-Join Index

• A relational table IKS with at least the attributes(C, t1.Id,…, tm.Id) – Here C = t1.c+…+tm.c

– Also t1.Id <… < tm.Id

• Can be also seen as a materialized view• Some or all ti.c should be useful as well• E.g. for queries with additional restrictions on

individual prices

24

KS-Join Index• IKS should be implemented as file sorted on C first– Then, on other key or non-key attributes of interest– E.g., a B-tree or trie…

• Storage cost:– O (n1*…*nm) in general– Half of it or less for copies of the same table

• 3-way indices may be in RAM • More should be typically on flash or disk

25

KS-Join Index• Processing cost– O (Log p (n1*…*nm) ) or less, according to the

storage cost, where p is the tree fan-out• Expected practical figures– ms for RAM, e.g., 3-way KS-Join index for 1K-tuple

tables – under 10 ms for flash– under 100 ms for the disk, e.g., 4-way KS-Join

index for our 1K-tuple tables

26

KS-Join Index• Maintainance cost – High processing cost– E.g., 1 insert into our 1K tables generates 1M new

entries • Main drawback of KS-Indices at present• Efficient processing is an open problem

27

Composing KS-Join Indices• TkKS-Join calculus can compound existing KS-

Indices• m-way & n-way indices may speed up (m+n)-

way TkKS Join• Through the sort-merge algorithm applied to

both indices• Seconds may suffice for up to 6-way joins– E.g., for our 1K relations

28

Scalable-Distributed TkKS-Join Index• Speeds up the calculus of even larger joins• Using the parallel distributed processing• Dozens of seconds may suffice for an 8-way

join– Over our favorite 1Ktuple relations–With two 4-way KS-Indices– Each being distributed over 1K nodes– Through, e.g., RP* SDDS

• Maintainance time speeds-up as well

29

Scalable-Distributed TkKS-Join Index

• C = 900 ; arrows show nodes to join in parallel

100 350 800 9900

10 50 450 700

30

Conclusion

• TkKS-Joins are potentially useful• Our optimizations may speed up the

processing by orders of magnitude• Queries with TkKS-Joins become then

practical• With all the usual disclaimers, the results

appear ready for mainstream DBMSs

31

Future Work

• Deeper formal analysis• Experiments• More TkKS-Join query types– See the paper

Thank You for Your Attention

top k knapsack joins and closure early results

Documents

knapsack joins

knapsack xj

knapsack constraints

knapsack problem kpselect

knapsack problem kpinput

knapsack problem kpformally

price t3

price t2