A Unified Approach for Computing Top-k Pairs in Multidimensional Space
Presented by: Muhammad Aamir Cheema (1)
Joint work with Xuemin Lin (1), Haixun Wang (2), Jianmin Wang (3), Wenjie Zhang (1)
(1) University of New South Wales, Australia
(2) Microsoft Research Asia
(3) Tsinghua University, China
Introduction
Top-k Pairs Query: Given a scoring function f() that computes the score of
a pair of objects, return k pairs of objects with smallest scores.
[Figure: objects o1 to o5 plotted in the plane; axes: x-axis, y-axis]
Examples:
k-closest pairs
f(ou,ov) = dist(ou,ov)
Answer (k=1) = (o1,o2)
k-furthest pairs
f(ou,ov) = - dist(ou,ov)
Answer (k=1) = (o2,o4)
f(ou,ov) = (ou.x +ov.x) + (ou.y +ov.y)
Answer (k=1) = (o4,o5)
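The three examples can be reproduced with a brute-force sketch. The coordinates below are hypothetical (the figure's actual positions are not recoverable from the transcript); they were picked only so that the answers agree with the ones listed above. The helper name top_k_pairs is likewise an illustration, not part of the presented work.

```python
from itertools import combinations
from math import dist

# Hypothetical coordinates, chosen only so the answers match the slide.
objects = {
    "o1": (2, 8),
    "o2": (3, 9),
    "o3": (6, 6),
    "o4": (5, 1),
    "o5": (1, 3),
}


def top_k_pairs(objs, score, k):
    """Return the k pairs with the smallest scores (brute force over all pairs)."""
    pairs = combinations(sorted(objs), 2)
    return sorted(pairs, key=lambda p: score(objs[p[0]], objs[p[1]]))[:k]


def closest(u, v):
    # k-closest pairs: f(ou, ov) = dist(ou, ov)
    return dist(u, v)


def furthest(u, v):
    # k-furthest pairs: f(ou, ov) = -dist(ou, ov)
    return -dist(u, v)


def coord_sum(u, v):
    # f(ou, ov) = (ou.x + ov.x) + (ou.y + ov.y)
    return (u[0] + v[0]) + (u[1] + v[1])


print(top_k_pairs(objects, closest, 1))
print(top_k_pairs(objects, furthest, 1))
print(top_k_pairs(objects, coord_sum, 1))
```

Only the scoring function changes between the three queries, which is the observation the unified approach builds on.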
Related Work
K-Closest Pairs Queries
• Computational geometry [M. Smid, Handbook on Comp. Geometry]
• Database community [Hjaltason et al., SIGMOD 1998] [Corral et al., SIGMOD 2000] [Yang et al., IDEAS 2002] [Shan et al., SSTD 2003]
K-Furthest Pairs Queries
[Supowit, SODA 1990] [Katoh et al., IJCGA 1995] [Corral et al., DKE 2004]
Top-k Queries
• Fagin’s Algorithm [Fagin, PODS 1996]
• Threshold Algorithm [Fagin, JCSS 1999], [Nepal et al., ICDE 1999], [Güntzer et al., VLDB 2000]
• No Random Access Algorithm [Fagin, JCSS 1999], [Mamoulis et al., TODS 2007]
Motivation
SELECT a.id, b.id
FROM AGENT a, AGENT b
WHERE a.id < b.id
ORDER BY |a.sold - b.sold| - |a.salary - b.salary|
LIMIT k;
• Other Lp distances (e.g., Manhattan distance)?
• More general scoring functions
• Chromatic queries
• No existing work for more general queries
SELECT a.id, b.id
FROM AGENT a, AGENT b
WHERE a.id < b.id
AND a.manager <> b.manager
ORDER BY |a.sold - b.sold| - |a.salary - b.salary|
LIMIT k;
• No existing unified algorithm
• One framework that answers a broad class of top-k pairs queries
Problem Definition (Preliminaries)
• Monotonic function
f() is monotonic if f(x1,…,xN) ≤ f(y1,…,yN) whenever xi ≤ yi for every 1 ≤ i ≤ N
Examples:
f(x1,…,xN) = x1 + x2 + … + xN (summation)
f(x1,…,xN) = (x1 + x2 + … + xN) / N (average)
Problem Definition (Preliminaries)
• Loose monotonic function
s() takes two parameters and is loose monotonic if both of the following hold for every fixed value x:
1. for every y > x, s(x,y) either monotonically increases or monotonically decreases as y increases
2. for every y < x, s(x,y) either monotonically increases or monotonically decreases as y decreases
Example (x = 1):
y = 2: s1(x,y) = |x - y| = 1, s2(x,y) = (x + y) = 3
y = 5: s1(x,y) = |x - y| = 4, s2(x,y) = (x + y) = 6
[Figure: number line from -∞ through 0 to ∞ marking x = 1 and the sample values y = 2, 5 and y = -3, -2]
Loose monotonic functions are more general than the monotonic functions
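The definition can be probed numerically. The sketch below is an empirical check on sample points only, so passing it is not a proof; it treats "monotonically increases" as non-strict, which is an assumption. The helper names and the counter-example s3 are hypothetical.

```python
def is_monotone(values):
    """True if the sequence is non-decreasing or non-increasing."""
    steps = list(zip(values, values[1:]))
    return all(a <= b for a, b in steps) or all(a >= b for a, b in steps)


def looks_loose_monotonic(s, xs, ys):
    """Check both conditions of the definition on the sample grid xs x ys."""
    ys = sorted(ys)
    for x in xs:
        right = [s(x, y) for y in ys if y > x]            # y moves up, away from x
        left = [s(x, y) for y in reversed(ys) if y < x]   # y moves down, away from x
        if not (is_monotone(right) and is_monotone(left)):
            return False
    return True


def s1(x, y):
    return abs(x - y)


def s2(x, y):
    return x + y


def s3(x, y):
    # Hypothetical counter-example: for y > x it first falls, then rises.
    return (y - x - 2) ** 2


xs = [1, 0, -4]
ys = list(range(-10, 11))
print(looks_loose_monotonic(s1, xs, ys))
print(looks_loose_monotonic(s2, xs, ys))
print(looks_loose_monotonic(s3, xs, ys))
```

Note that s1 is not monotonic in the ordinary sense (with x = 1 it falls as y rises toward 1 and rises afterwards), yet it satisfies both conditions on each side of x.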
Problem Definition
SELECT a.id, b.id
FROM AGENT a, AGENT b
WHERE a.id < b.id
ORDER BY |a.sold - b.sold| - |a.salary - b.salary|
LIMIT k;
• Return k pairs of objects with smallest scores.
SCORE(a,b) = f(s1(a,b), …, sd(a,b))
si() is called the local scoring function and can be any loose monotonic function of the user’s choice.
f() is called the global scoring function and can be any monotonic function that involves an arbitrary set of attributes.
Example:
s1(a,b) = |a.sold - b.sold|
s2(a,b) = -|a.salary - b.salary|
f() = s1(a,b) + s2(a,b)
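A minimal sketch of how SCORE composes local and global scoring functions, evaluated by brute force. The AGENT rows are made up for illustration; only the column names come from the query. The function names score and top_k are assumptions.

```python
from itertools import combinations

# Hypothetical AGENT rows; column names follow the query on the slide.
agents = [
    {"id": 1, "sold": 50, "salary": 3000},
    {"id": 2, "sold": 52, "salary": 9000},
    {"id": 3, "sold": 90, "salary": 3100},
    {"id": 4, "sold": 51, "salary": 5000},
]


def s1(a, b):
    # local scoring function on "sold"
    return abs(a["sold"] - b["sold"])


def s2(a, b):
    # local scoring function on "salary"
    return -abs(a["salary"] - b["salary"])


def f(local_scores):
    # global scoring function: summation (monotonic)
    return sum(local_scores)


def score(a, b, local_fns=(s1, s2), global_fn=f):
    """SCORE(a,b) = f(s1(a,b), ..., sd(a,b))."""
    return global_fn([s(a, b) for s in local_fns])


def top_k(rows, k):
    """Brute-force equivalent of the query: k pairs with the smallest SCORE."""
    ordered = sorted(rows, key=lambda r: r["id"])   # combinations then gives a.id < b.id
    ranked = sorted(combinations(ordered, 2), key=lambda p: score(*p))
    return [(a["id"], b["id"]) for a, b in ranked[:k]]


print(top_k(agents, 2))
```

The top pairs are agents with similar sales but very different salaries, which is what the ORDER BY expression asks for.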
Problem Definition
SELECT a.id, b.id
FROM AGENT a, AGENT b
WHERE a.id < b.id
ORDER BY |a.sold - b.sold| - |a.salary - b.salary|
LIMIT k;
• Return k pairs of objects with smallest scores among the valid pairs.
Let each object be assigned a color.
Chromatic Queries:
• Homochromatic Queries: pairs containing objects of the same color
• Heterochromatic Queries: pairs containing objects of different colors
Heterochromatic example (manager as the color):
SELECT a.id, b.id
FROM AGENT a, AGENT b
WHERE a.id < b.id
AND a.manager <> b.manager
ORDER BY |a.sold - b.sold| - |a.salary - b.salary|
LIMIT k;
Homochromatic example (manager as the color):
SELECT a.id, b.id
FROM AGENT a, AGENT b
WHERE a.id < b.id
AND a.manager = b.manager
ORDER BY |a.sold - b.sold| - |a.salary - b.salary|
LIMIT k;
Contributions
Unified algorithm (internal and external)
• k-closest pairs, k-furthest pairs and variants (any Lp distance)
• queries involving any arbitrary subset of attributes
• chromatic and non-chromatic queries
• skyline pairs queries and rank based top-k pairs queries
No pre-built indexes required
• efficiently builds a simple data structure on-the-fly
• can answer queries involving filtering conditions on objects
SELECT a.id, b.id
FROM AGENT a, AGENT b
WHERE a.id < b.id
AND a.age > 40 AND b.age > 40
ORDER BY |a.sold - b.sold| - |a.salary - b.salary|
LIMIT k;
Known memory requirement
• existing R-tree based approaches may require arbitrarily large heaps
• our algorithm requires O(k) space + 2d buffer pages
Efficient
• Theoretically optimal for d ≤ 2
• Experimentally
Framework
[Figure: d sources, one per local scoring function s1(a,b), s2(a,b), …, sd(a,b), feed a top-k algorithm (e.g., FA, TA, NRA), which combines them with f(s1(a,b), s2(a,b), …, sd(a,b))]
How to efficiently create and maintain these sources?
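To make the framework concrete, here is a sketch in the style of the Threshold Algorithm (TA), adapted to return the k smallest scores. It is an illustration under assumptions, not the presented implementation: the function name threshold_top_k is hypothetical, the sample data is made up, and the sources are built naively by sorting all pairs so that the focus stays on how a top-k algorithm consumes them.

```python
import heapq
from itertools import combinations


def threshold_top_k(sources, local_fns, f, k):
    """TA-style sketch: sources[i] yields pairs in ascending order of local_fns[i].

    f must be monotonic, otherwise the stopping test is not valid.
    """
    iters = [iter(src) for src in sources]
    last = [None] * len(iters)   # last local score read from each source
    seen = set()
    best = []                    # holds (-score, pair); best[0] is the worst kept pair
    while True:
        progressed = False
        for i, it in enumerate(iters):
            pair = next(it, None)
            if pair is None:
                continue
            progressed = True
            last[i] = local_fns[i](*pair)
            if pair not in seen:
                seen.add(pair)
                total = f([s(*pair) for s in local_fns])   # "random access"
                heapq.heappush(best, (-total, pair))
                if len(best) > k:
                    heapq.heappop(best)                    # drop the worst pair
        if not progressed:
            break                                          # all sources exhausted
        if len(best) == k and None not in last:
            threshold = f(last)    # lower bound on the score of any unseen pair
            if -best[0][0] <= threshold:
                break
    return sorted((-neg, pair) for neg, pair in best)


# Hypothetical objects as (sold, salary) tuples.
objs = [(50, 3000), (52, 9000), (90, 3100), (51, 5000)]


def s1(a, b):
    return abs(a[0] - b[0])


def s2(a, b):
    return -abs(a[1] - b[1])


pairs = list(combinations(objs, 2))
sources = [sorted(pairs, key=lambda p: s(*p)) for s in (s1, s2)]
result = threshold_top_k(sources, [s1, s2], sum, 2)
print(result)
```

On this sample the loop stops after three rounds of sorted access, before the sources are exhausted, because the threshold computed from the last scores read already exceeds the second-best score found.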
Creating/maintaining sources
Naïve approach
• Create all possible pairs: O(N^2)
• Sort them according to their local scores: O(N^2 log N)
• Space requirement: O(N^2)
Features of our approach
• Optimal internal memory algorithm
  • requires O(N) space
  • returns first pair in O(N log N)
  • each next best pair is returned in O(log N)
• Optimal external memory algorithm
  • B = number of elements that can be stored in one disk page
  • M = internal memory used (minimum M = 2B)
  • returns first pair in O((N/B) log_{M/B} (N/B))
  • each next best pair is returned in O(log_{M/B} (N/B))
Creating/maintaining sources
[Figure: objects o1 to o6 on a line at values 6, 12, 14, 15, 20, 30; gaps between neighbours 6, 2, 1, 5, 10]
s(x,y) = |x - y|
Initialize
• sort the objects
• for each object ou
  • create its best pair (ou,ov)
  • insert (ou,ov) in heap
getNextPair()
• report the top pair (ou,ov) of heap
• create next best pair of ou
• enheap the new pair and delete (ou,ov)
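The two routines can be sketched for s(x,y) = |x - y| as follows. One assumption is added that the slide does not state: each object is paired only with objects to its right in sorted order, so that every unordered pair is produced exactly once. The class name AbsDiffSource is hypothetical.

```python
import heapq


class AbsDiffSource:
    """Returns pairs in ascending order of s(x, y) = |x - y|.

    Assumption: an object is only paired with objects to its right in
    sorted order, which avoids reporting (ou, ov) and (ov, ou) twice.
    """

    def __init__(self, values):
        self.vals = sorted(values)                 # sort the objects
        self.heap = []
        for u in range(len(self.vals) - 1):
            self._push(u, u + 1)                   # best pair of ou: its right neighbour

    def _push(self, u, v):
        heapq.heappush(self.heap, (self.vals[v] - self.vals[u], u, v))

    def get_next_pair(self):
        if not self.heap:
            return None
        score, u, v = heapq.heappop(self.heap)     # report and delete the top pair
        if v + 1 < len(self.vals):
            self._push(u, v + 1)                   # next best pair of ou
        return score, (self.vals[u], self.vals[v])


src = AbsDiffSource([6, 12, 14, 15, 20, 30])
first_three = [src.get_next_pair() for _ in range(3)]
print(first_three)
```

The heap never holds more than one pair per object, so the space stays O(N); initialization costs a sort plus N heap insertions, and each call to get_next_pair costs one pop and at most one push.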
Homochromatic Queries
[Figure: objects o1 to o6 on a line at values 6, 12, 14, 15, 20, 30]
Heterochromatic Queries
[Figure: objects o1 to o6 on a line at values 6, 12, 14, 15, 20, 30]
• Let (ou,ov) be the pair
• ox = the object next to ov
• If ou and ox have different colors
  • (ou,ox) is the next best pair
• else
  • oy = the adjacent object of ox
  • (ou,oy) is the next best pair
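A sketch of the heterochromatic step for s(x,y) = |x - y|. It reads "the adjacent object of ox" as the nearest object to the right of ox that has a different color, found through a precomputed pointer array; that reading, the array next_diff, the class name, and the colors in the sample data are all assumptions. As in a one-directional scan, each object is paired only with objects to its right.

```python
import heapq


class HeteroAbsDiffSource:
    """Yields different-colored pairs in ascending order of |x - y| (sketch)."""

    def __init__(self, objects):
        self.objs = sorted(objects)                # (value, color), sorted by value
        n = len(self.objs)
        # next_diff[i]: nearest index right of i whose color differs from i's (n if none)
        self.next_diff = [n] * n
        for i in range(n - 2, -1, -1):
            if self.objs[i + 1][1] != self.objs[i][1]:
                self.next_diff[i] = i + 1
            else:
                self.next_diff[i] = self.next_diff[i + 1]
        self.heap = []
        for u in range(n):
            self._push(u, self.next_diff[u])       # best valid pair of ou

    def _push(self, u, v):
        if v < len(self.objs):
            heapq.heappush(self.heap, (self.objs[v][0] - self.objs[u][0], u, v))

    def get_next_pair(self):
        if not self.heap:
            return None
        score, u, v = heapq.heappop(self.heap)     # (ou, ov) is the reported pair
        x = v + 1                                  # ox = the object next to ov
        if x < len(self.objs) and self.objs[x][1] == self.objs[u][1]:
            x = self.next_diff[x]                  # oy: jump past objects of ou's color
        self._push(u, x)
        return score, (self.objs[u], self.objs[v])


# Values from the figure; the colors are hypothetical.
data = [(6, "red"), (12, "blue"), (14, "blue"), (15, "red"), (20, "red"), (30, "blue")]
src = HeteroAbsDiffSource(data)
print(src.get_next_pair())
print(src.get_next_pair())
```

Because the pointer lookup is constant time, producing the next valid pair of ou costs no more than in the non-chromatic case.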
Experiments
K-closest pairs queries [Corral et al., SIGMOD 2000]
• Data size: two datasets, each containing 100K objects
• k: 10
Experiments
• Naive: join the dataset with itself using a nested loop (block nested loop for the external memory algorithm)
• Scoring function:
  • Local scoring function is either sum or absolute difference (chosen randomly)
  • Global scoring function is a weighted aggregate (weights are chosen randomly and negative weights are allowed)
[Figures: results for varying number of objects, number of attributes (d), value of k, and number of colors]
Thanks
Complexity
Internal memory algorithm =
External memory algorithm =
d = number of local scoring functions involved
N = total number of objects
V = total number of valid pairs (at most N^2)
M = internal memory used by the algorithm
B = number of entries one disk page can store