top k knapsack joins and closure early results
DESCRIPTION
Top k Knapsack Joins and Closure Early Results. Witold LITWIN & Thomas Schwarz U. Paris Dauphine, France [email protected] Santa Clara U., CA, [email protected]. Knapsack Join (KS-Join). The join defined by the sum of the join attributes being at most some constant - PowerPoint PPT PresentationTRANSCRIPT
1
Top k Knapsack Joins and Closure Early Results
Witold LITWIN & Thomas SchwarzU. Paris Dauphine, France [email protected]
Santa Clara U., CA, [email protected]
2
Knapsack Join (KS-Join)
• The join defined by the sum of the join attributes being at most some constant
• Father of 4 kids wishing to buy toys for at most 100€ total
• A person wishing to buy a computer tower, a screen, a printer and a desk for at most 1000 €
• ….
3
Knapsack Join (KS-Join)
• Traditional join–R1 Join R2 on c1 = c2
• KS - join–R1 Join R2 on c1 + c2 ≤ C
• Syntax legal for FROM clause in Access, SQL Server…
4
Top k Knapsack Join (KS-Join)
• Top k items with respect to the descending order on the constant
• Usually, only a few items the most close to the constant are of interest
• Select TOP 1 * from Toys T1, Toys T2, Toys T3, Toys T4Where T1.Price + T2.Price + T3.Price + T4.Price ≤ 100and T1.Id < T2.Id and T2.Id < T3.Id and T3.Id < T4.IdOrder by T1.Price + T2.Price + T3.Price + T4.Price Desc;
5
Top k Knapsack Join (KS-Join)• Top k Knapsack joins are of obvious interest• How DBMSs deal with ?• Nested loop– To our best knowledge
• Result: execution time makes the SQL capability useless for a larger data set– Consider our example for just 1000 toys to choose
from– FYI, 1K-tuple table & 3-way KS-join killed SQL Server
6
Our Goal : Optimizing Top k KS-Joins
• Algorithms provably faster than usual nested loop– Formulate the algorithm– Prove the complexity, storage & processing costs
• KS-optimized Nested Loop • Self-join Nested Loop• Sort Merge• KS – Join Indices• Distributed KS – Join Indices
7
Our Goal : Optimizing Top k KS-Joins
• Early Results• Only for Top k KS-Joins (TkKS-Joins) • Only the formal analysis as yet• Many variants of TkKS-Join queries left
for future work– See the paper
8
Knapsack Problem (KP)
• NP-hard optimization problem • Among most studied• Input: – A set O of objects {o1,...on} – An m-d subspace called knapsack K with– values bi , 1 ≤ i ≤ m, represent each the i-th
dimension's capacity of the knapsack– Vector cj represents the benefit of the object j if in
the knapsack
9
Knapsack Problem (KP)
• Input (continued): – The knapsack's constraints matrix with
entries ai,j ; 1 ≤ j ≤ n ; – Each entry stores the constraint value for
each object j in each dimension i (price, size, volume...).
• Output:– A set O' of objects stored in the knapsack.
10
Knapsack Problem (KP)
• Binary variable xj ; xj {0, 1}, indicates the selection of the object j into the knapsack –(x j= 1) for object j in and (xj = 0)
otherwise– xj is 0–1 decision variable
11
Knapsack Problem (KP)
• Select the elements of O’ which maximize the total profit of the selected objects
• Provided the match of the knapsack constraints
12
Knapsack Problem (KP)
• Formally, maximize:
• Subject to:
13
Knapsack Problem (KP)• The most frequently investigated case is the 1-d one– I.e., i = 1
• Often, or perhaps even the most often, the KP concept designates implicitly this case.
• Frequently, in addition, one also sets every cj to cj = aj.
• Both conditions are ours below– unless we state otherwise
• The m-d one is referred to then, if needed, as multidimensional (MKP).
14
Knapsack Problem (KP)• The general research orientation for KP and
MKP • Find a heuristic providing acceptable
approximate result– For the possibly largest data set – In the fastest time, –Or acceptable time–Given necessary constraints on the
computer system used.
15
KP / TkKS -Join• Our research orientation follows the
database approach • Find an exact result–For a reasonably practical problem
subspace–For a database size data •Say, 1Ktuples per table at least
16
KP / TkKS -Join• Find an exact result (continued)–In the fastest time–Or acceptable time•Minutes at most
–Given necessary constraints on the computer system used•Mainly storage cost
17
Knapsack Problem (KP)• Our reasonably practical problem
subspace at present:–As we already stated cj = aj –1-d space– Fixed # of objects for the knapsack• Join instead of closure
18
Knapsack Problem (KP)• Our reasonably practical problem subspace
at present (continued):–One tuple = one potential selection– One object = one tuple with distinct ID•No objects selected twice in a tuple for
the knapsack• Closure, MKP… left for the future
19
Nested loop TkKS-Join• Basic cost for tables with n1…nm tuples– O (n1*…*nm)
• To accelerate the calculus start with:– Evaluation of the restrictions ti < C – Evaluation of ti ≤ C – (Min1+…+Minj+…+Minm) • for any j ≠ i • DBMS may easily maintain the Minj statistics• Cost can be O(m) or even O (1) only
– Idem for C ≥ Max1+…+Maxm ?
20
Nested loop TkKS-Join• Self-joint of a table with its copies• Since KS-join is commutative one may avoid doubles – E.g. if we have tuple (t1, t2) then we should not have the
tuple tuples (t2, t1) – In general, we need only one tuple from all its
permutations • The optimizing cuts the complexity and calculus time
by half, at least• Final word: we may have – O (n1*…*nm /S), where S ≥ 1
21
Sort-Merge TkKS-Join• 2-way join
C =150
150
150
22
Sort-Merge TkKS-Join• Processing cost of 3-way TkKS-Join– O (n1+n2 ) in general– O ((n1+n2 )/2) for self-join
• For n-way TkKS-Join– O (nm*…n3(n2 + n1)) in general– For self-join ?
• E.g. For 16K-tuple R1 and R2 tables m-way join accelerates 8K times
– 1sec instead of 2+ hours• See the paper
23
KS-Join Index
• A relational table IKS with at least the attributes(C, t1.Id,…, tm.Id) – Here C = t1.c+…+tm.c
– Also t1.Id <… < tm.Id
• Can be also seen as a materialized view• Some or all ti.c should be useful as well• E.g. for queries with additional restrictions on
individual prices
24
KS-Join Index• IKS should be implemented as file sorted on C first– Then, on other key or non-key attributes of interest– E.g., a B-tree or trie…
• Storage cost:– O (n1*…*nm) in general– Half of it or less for copies of the same table
• 3-way indices may be in RAM • More should be typically on flash or disk
25
KS-Join Index• Processing cost– O (Log p (n1*…*nm) ) or less, according to the
storage cost, where p is the tree fan-out• Expected practical figures– ms for RAM, e.g., 3-way KS-Join index for 1K-tuple
tables – under 10 ms for flash– under 100 ms for the disk, e.g., 4-way KS-Join
index for our 1K-tuple tables
26
KS-Join Index• Maintainance cost – High processing cost– E.g., 1 insert into our 1K tables generates 1M new
entries • Main drawback of KS-Indices at present• Efficient processing is an open problem
27
Composing KS-Join Indices• TkKS-Join calculus can compound existing KS-
Indices• m-way & n-way indices may speed up (m+n)-
way TkKS Join• Through the sort-merge algorithm applied to
both indices• Seconds may suffice for up to 6-way joins– E.g., for our 1K relations
28
Scalable-Distributed TkKS-Join Index• Speeds up the calculus of even larger joins• Using the parallel distributed processing• Dozens of seconds may suffice for an 8-way
join– Over our favorite 1Ktuple relations–With two 4-way KS-Indices– Each being distributed over 1K nodes– Through, e.g., RP* SDDS
• Maintainance time speeds-up as well
29
Scalable-Distributed TkKS-Join Index
• C = 900 ; arrows show nodes to join in parallel
100 350 800 9900
10 50 450 700
30
Conclusion
• TkKS-Joins are potentially useful• Our optimizations may speed up the
processing by orders of magnitude• Queries with TkKS-Joins become then
practical• With all the usual disclaimers, the results
appear ready for mainstream DBMSs
31
Future Work
• Deeper formal analysis• Experiments• More TkKS-Join query types– See the paper
Thank You for Your Attention