![Page 1: Efficient Skyline Computation on Vertically Partitioned Datasets Dimitris Papadias, David Yang, Georgios Trimponias CSE Department, HKUST, Hong Kong](https://reader035.vdocuments.site/reader035/viewer/2022062413/5a4d1b1a7f8b9ab05999336a/html5/thumbnails/1.jpg)
Efficient Skyline Computation on Vertically Partitioned Datasets
Dimitris Papadias, David Yang, Georgios Trimponias
CSE Department, HKUST, Hong Kong
Outline
• Problem Statement
• Skyline Computation on Vertically Partitioned Datasets using Balke’s Algorithm
• Algorithms for Top-k Query Processing
• FM Sketches
• Putting Everything Together
A Motivating Example
Consider a database containing information about hotels. The y-dimension represents the price of the room, whereas the x-dimension captures the distance of the room from the beach.
[Figure: hotel rooms plotted by distance (x-axis) and price (y-axis), marking the skyline objects, a point p, the dominance region of p, and the borders of p’s dominance region.]
Skyline Preliminaries [ICDE, 2001]
• Skylines constitute a very useful tool in numerous disciplines, such as multidimensional decision making and data mining.
• Given a set of d-dimensional objects p1, …, pN, the skyline operator retrieves all objects that are not dominated by any other object in the set.
• An object pi dominates another object pj if it is not worse than pj in all dimensions and better than pj in at least one dimension.
Properties:
• The top-1 tuple according to any monotone preference function that assigns scores to tuples is in the skyline.
• Conversely, for any skyline tuple, there exists a monotone preference function according to which it is the top-1.
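The dominance definition above can be sketched directly in code. The following is a minimal quadratic-time illustration, not an efficient skyline algorithm; the hotel names and coordinates are hypothetical, and smaller values are assumed to be better in every dimension.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, ...]

def dominates(p: Point, q: Point) -> bool:
    """True if p is not worse than q in all dimensions and better in at least one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points: Dict[str, Point]) -> List[str]:
    """Names of the points that no other point dominates (naive pairwise scan)."""
    return sorted(
        name for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    )

# Hypothetical hotels: (distance to the beach, price)
hotels = {
    "h1": (0.5, 200.0),
    "h2": (1.0, 120.0),
    "h3": (2.0, 80.0),
    "h4": (2.5, 150.0),  # dominated by h2 and h3
    "h5": (1.5, 130.0),  # dominated by h2
}

print(skyline(hotels))  # ['h1', 'h2', 'h3']
```

Each skyline hotel is the best choice for some trade-off between distance and price, which is what the two properties above express.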
4 Common Data Distributions
Problem Definition
• Compute the skyline when the dataset is vertically decomposed among a set of N servers.
• Goal: minimize the data that must be retrieved from each server.
• We assume wireless environments, where communication overhead constitutes the dominant factor in battery consumption.
• Mobile phone applications are real-world examples.
First Observations
[Figure: points A to H plotted in (a) subspace D1 (dimensions d1, d2) at server N1 and (b) subspace D2 (dimensions d3, d4) at server N2; the local skyline is marked in each subspace.]
• The global skyline may contain points that do not appear in the local skylines.
• Instead of transmitting all records over the network, avoid sending out points that are guaranteed to be dominated globally by an anchor point.
Balke’s Algorithm [EDBT, 2004]
• Assume that the d-dimensional database is vertically partitioned into d lists, one for each dimension, assigned to different servers. The lists contain values in ascending order.
• Idea: perform sorted accesses on the d lists in a round-robin manner until some point p (the anchor) has been reached in every list.
• Points that have not shown up in any list by that moment can be safely pruned, since they are dominated by the anchor.
Example
Consider a 2-dimensional database with the following two lists:

L1  Point: a  b  d  m  g  c  …
    Value: 1  2  3  5  5  6  …
L2  Point: c  d  e  k  a  b  …
    Value: 1  2  3  4  5  7  …
The first point to be retrieved from both lists is d, which becomes the anchor. Points that have not appeared in either list by that moment cannot be part of the skyline.
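The sorted-access phase of the example can be traced with a short sketch. It uses only the visible prefixes of the two lists (the slide elides the rest), checks for an anchor after every single access, and ignores ties between list values; the function name `find_anchor` is hypothetical.

```python
from typing import List, Set, Tuple

def find_anchor(lists: List[List[str]]) -> Tuple[str, Set[str]]:
    """Round-robin sorted access until one point has been seen in every list.

    Returns the anchor and the set of all points seen up to that moment.
    """
    seen = [set() for _ in lists]
    depth = 0
    while any(depth < len(lst) for lst in lists):
        for i, lst in enumerate(lists):
            if depth >= len(lst):
                continue
            point = lst[depth]           # next sorted access on list i
            seen[i].add(point)
            if all(point in s for s in seen):
                return point, set().union(*seen)
        depth += 1
    raise ValueError("no point appears in every list")

# Visible prefixes of the example lists, in ascending value order.
L1 = ["a", "b", "d", "m", "g", "c"]   # values 1 2 3 5 5 6
L2 = ["c", "d", "e", "k", "a", "b"]   # values 1 2 3 4 5 7

anchor, candidates = find_anchor([L1, L2])
print(anchor, sorted(candidates))  # d ['a', 'b', 'c', 'd']
```

Only the candidates seen so far can still be skyline points; every unseen point lies after d in both lists and is therefore dominated by it.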
Further Improvement
• Efficiency can be improved if, instead of visiting the lists in a round-robin manner, we access the most promising list with random accesses.
• As a result, each list is expanded only as far as necessary.
[Figure: point P located in lists L1 and L2, with the annotation "avoid visiting these points".]
Setting
• Let N1, …, Nm be m servers storing the same dataset DB.
• For each record P ∈ DB, every server Ni maintains a local score si(P) and sorts all records in decreasing order of their local scores.
• A client wishes to obtain the k records of DB with the maximum global score s.
• The score is computed using a monotonic function f on the local scores, i.e., s(P) = f(s1(P), …, sm(P)).
• Goal: minimize the required number of accesses.
Fagin’s Algorithm [PODS, 2001]
• Each server Ni performs sorted accesses in a round-robin manner, sending the client the next record and its local score.
• When the first record Panc has been encountered at all servers, the client terminates the sorted accesses.
• Then, it obtains the missing local scores of the other encountered records through random accesses.
• The candidate with the highest global score is the top-1 result.
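The steps of FA for a top-1 query can be sketched as follows. This is an illustrative single-process simulation, assuming f is the sum of the local scores; the two score tables and the function name `fagin_top1` are hypothetical, and random accesses are modelled as dictionary lookups.

```python
from typing import Dict, List, Tuple

def fagin_top1(scores: List[Dict[str, int]]) -> Tuple[str, int, int]:
    """Top-1 record under f = sum. Returns (record, global score, sorted accesses)."""
    # Each server keeps its records in decreasing order of local score.
    sorted_lists = [sorted(s, key=lambda r: -s[r]) for s in scores]
    seen = [set() for _ in scores]
    accesses = 0
    for depth in range(len(sorted_lists[0])):
        for i, lst in enumerate(sorted_lists):
            seen[i].add(lst[depth])      # one sorted access per server and round
            accesses += 1
        if set.intersection(*seen):      # some record was seen at every server
            break
    candidates = set().union(*seen)
    # Random accesses fill in the missing local scores of every candidate.
    totals = {r: sum(s[r] for s in scores) for r in candidates}
    best = max(totals, key=totals.get)
    return best, totals[best], accesses

# Hypothetical local scores at two servers.
servers = [
    {"a": 9, "b": 8, "c": 7, "d": 1},
    {"c": 9, "d": 8, "b": 6, "a": 2},
]

print(fagin_top1(servers))  # ('c', 16, 6)
```

Any record never seen under sorted access scores no higher than Panc at every server, so by monotonicity of f it cannot beat the best candidate.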
Threshold Algorithm [PODS, 2001]
• It utilizes an upper bound TA on the global score to terminate earlier than FA.
• The client retrieves the local scores of newly encountered points with random accesses at the remaining servers, computes their global scores, and keeps the best score sbest.
• The threshold TA is equal to the sum of the local thresholds at each server, where the local threshold is the last local score seen there under sorted access.
• As long as TA > sbest, the algorithm continues the sorted accesses and keeps updating TA.
• Eventually, the top-1 point is returned.
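A minimal sketch of the threshold test, again assuming f is the sum and simulating random accesses with dictionary lookups. The score tables and the name `threshold_top1` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

def threshold_top1(scores: List[Dict[str, int]]) -> Tuple[Optional[str], int, int]:
    """Top-1 record under f = sum. Returns (record, global score, sorted accesses)."""
    sorted_lists = [sorted(s, key=lambda r: -s[r]) for s in scores]
    best: Optional[str] = None
    s_best = -1                      # local scores are assumed non-negative
    computed = set()
    accesses = 0
    for depth in range(len(sorted_lists[0])):
        last_seen = []               # local thresholds of this round
        for i, lst in enumerate(sorted_lists):
            record = lst[depth]      # sorted access at server i
            accesses += 1
            last_seen.append(scores[i][record])
            if record not in computed:
                computed.add(record)
                total = sum(s[record] for s in scores)   # random accesses elsewhere
                if total > s_best:
                    best, s_best = record, total
        threshold = sum(last_seen)   # TA: upper bound for every unseen record
        if threshold <= s_best:      # stop as soon as TA > s_best no longer holds
            break
    return best, s_best, accesses

# Hypothetical local scores at two servers.
servers = [
    {"a": 9, "b": 8, "c": 7, "d": 1},
    {"c": 9, "d": 8, "b": 6, "a": 2},
]

print(threshold_top1(servers))  # ('c', 16, 4)
```

On this input the threshold drops from 18 to 16 after the second round, so the scan stops after 4 sorted accesses, whereas FA would need a third round to see a common record.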
Example for FA and TA
Best Position Algorithm [VLDB, 2007]
• It further improves TA by utilizing a tighter threshold.
• Let bpi be the greatest position at server Ni such that all points up to bpi have been encountered through sorted or random accesses.
• The global threshold BP is equal to the sum of the local thresholds at bpi.
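The best-position threshold can be illustrated with a small sketch. Positions are 1-based, f is assumed to be the sum, and the score lists and seen-position sets are hypothetical; in the scenario below, positions 1 and 2 were reached by sorted access and the remaining ones by random access.

```python
from typing import List, Set

def best_position(seen: Set[int]) -> int:
    """Greatest position bp such that positions 1..bp have all been seen (0 if none)."""
    bp = 0
    while bp + 1 in seen:
        bp += 1
    return bp

def bpa_threshold(local_scores: List[List[int]], seen: List[Set[int]]) -> int:
    """Sum of the local scores found at each server's best position."""
    total = 0
    for scores, positions in zip(local_scores, seen):
        bp = best_position(positions)
        if bp == 0:
            raise ValueError("position 1 has not been seen at some server")
        total += scores[bp - 1]
    return total

# Hypothetical local scores in decreasing order at two servers.
local_scores = [[9, 8, 7, 5, 1], [9, 8, 6, 4, 2]]
# Positions seen so far through sorted or random accesses.
seen = [{1, 2, 3}, {1, 2, 5}]

ta_threshold = local_scores[0][1] + local_scores[1][1]   # sorted depth 2 at both servers
bp_threshold = bpa_threshold(local_scores, seen)
print(ta_threshold, bp_threshold)  # 16 15
```

Server 1 benefits from the random access at position 3, which extends its best position beyond the sorted depth; the isolated position 5 at server 2 does not help until positions 3 and 4 are seen.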
Flajolet / Martin sketches [JCSS ’85]
• Goal: estimate the number of distinct objects from a small-space representation of a set.
• The sketch of a union of items is the OR of their sketches. Insertion order and duplicates don’t matter!
• Prerequisite: let h be a random, binary hash function.
• Sketch of an item: for each unique item with ID x, for each integer 1 ≤ i ≤ k in turn, compute h(x, i). Stop when h(x, i) = 1, and set bit i.
X      0 0 1 0 0
Z      1 0 0 0 0
X ∪ Z  1 0 1 0 0
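The construction above can be sketched in a few lines. The binary hash h(x, i) is simulated here with one bit of a SHA-256 digest, which is an assumed choice, as are the sketch length K and the function names; slide bit i is stored as integer bit i - 1.

```python
import hashlib
from typing import Iterable

K = 32  # sketch length in bits (assumed)

def h(x: object, i: int, seed: int = 0) -> int:
    """Pseudo-random binary hash h(x, i), derived from SHA-256 (assumed choice)."""
    digest = hashlib.sha256(f"{seed}:{x}:{i}".encode()).digest()
    return digest[0] & 1

def item_sketch(x: object, seed: int = 0) -> int:
    """Sketch of one item: set bit i for the first i in 1..K with h(x, i) = 1."""
    for i in range(1, K + 1):
        if h(x, i, seed) == 1:
            return 1 << (i - 1)
    return 0  # every hash bit was 0, which happens with probability 2**-K

def set_sketch(items: Iterable[object], seed: int = 0) -> int:
    """Sketch of a set: bitwise OR of the item sketches."""
    sketch = 0
    for x in items:
        sketch |= item_sketch(x, seed)
    return sketch

# Union is OR; insertion order and duplicates do not change the result.
print(set_sketch(["x", "z"]) == item_sketch("x") | item_sketch("z"))  # True
print(set_sketch(["x", "x", "z"]) == set_sketch(["z", "x"]))          # True
```

Because OR is idempotent and commutative, sketches built at different sites can be merged without double counting.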
Flajolet / Martin sketches (cont.)
Estimating COUNT
• Take the sketch of a set of N items.
• Let j be the position of the leftmost zero in the sketch.
• j is an estimator of log2(0.77 N).
Fixable drawbacks:
• The estimate has a faint bias.
• The variance of the estimate is large.
[Figure: a sketch S with bits 1 1 0 1 1, for which j = 3; best guess: COUNT ~ 11.]
Flajolet / Martin sketches (cont.)
• Standard variance reduction methods apply.
• Compute m independent sketches in parallel.
• Compute m independent estimates of N.
• Take the mean of the estimates.
• Provable tradeoffs between m and variance of the estimator.
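A rough sketch of the estimation step with m independent sketches follows. Several choices here are assumptions: bit positions are counted from 0, the hash is simulated with SHA-256 and a per-sketch seed, and the m estimates 2^j / 0.77 are averaged as the bullets describe (averaging the positions j before exponentiating is a common alternative). No particular accuracy is claimed.

```python
import hashlib
from typing import Iterable

K = 32      # sketch length in bits (assumed)
PHI = 0.77  # constant taken from the slide

def item_sketch(x: object, seed: int) -> int:
    """Sketch of one item; bit i (counted from 0) is set with probability 2**-(i + 1)."""
    for i in range(K):
        digest = hashlib.sha256(f"{seed}:{x}:{i}".encode()).digest()
        if digest[0] & 1:            # binary hash equals 1: stop and set bit i
            return 1 << i
    return 0

def set_sketch(items: Iterable[object], seed: int) -> int:
    """Sketch of a set: bitwise OR of the item sketches."""
    sketch = 0
    for x in items:
        sketch |= item_sketch(x, seed)
    return sketch

def lowest_zero(sketch: int) -> int:
    """Position j (counted from 0) of the first zero bit of the sketch."""
    j = 0
    while sketch & (1 << j):
        j += 1
    return j

def estimate_count(items: Iterable[object], m: int = 64) -> float:
    """Mean of m independent estimates 2**j / PHI, one per seeded sketch."""
    items = list(items)
    estimates = [2 ** lowest_zero(set_sketch(items, seed)) / PHI for seed in range(m)]
    return sum(estimates) / m

print(estimate_count(range(1000)))
print(estimate_count(list(range(1000)) * 3))  # duplicates give the same estimate
```

Larger m smooths the estimate at the cost of m times the sketch space, which is the tradeoff mentioned above.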
Application to COUNT in Sensor Databases
• Each sensor computes k independent sketches of itself (using its unique ID x).
  – The sensor computes a sketch of its value.
• Use a robust routing algorithm to route sketches up to the sink.
• Aggregate the k sketches via union (OR) en route.
• The sink then estimates the count.
[Figure: sensors S1 to S4 route their sketches toward the sink; partial unions such as S1∪S2∪S3 are formed en route, and the sink receives S1∪S2∪S3∪S4.]
Problem Characteristics
• Each vertical decomposition has arbitrary dimensionality, contrary to Balke’s setting.
• Anchor selection substantially determines the total amount of transmitted data.
• VPS adopts sorting on the local dominance. In particular, the local dominance count domi(P) of a point P with respect to subspace Di is the number of points dominated by P in Di.
• Balke selects as the anchor the data point P with the maximal domSUM(P).
• We utilize a tighter upper bound for dom(P): domMIN(P), the minimum among all local dominance counts of P.
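The relation between these quantities can be checked on a toy dataset. In this sketch domSUM is assumed to be the sum of the local dominance counts (the slides do not define it), the four points and the split into two 2-dimensional subspaces are hypothetical, and attribute values are distinct so that global dominance implies local dominance in every subspace.

```python
from typing import Dict, Sequence, Tuple

Point = Tuple[int, ...]

def dominates(p: Point, q: Point, dims: Sequence[int]) -> bool:
    """p dominates q when restricted to the dimensions in dims (smaller is better)."""
    return all(p[d] <= q[d] for d in dims) and any(p[d] < q[d] for d in dims)

def dom_count(name: str, data: Dict[str, Point], dims: Sequence[int]) -> int:
    """Number of points that data[name] dominates in the given dimensions."""
    return sum(dominates(data[name], q, dims) for other, q in data.items() if other != name)

# Hypothetical 4-dimensional points; D1 = dimensions (0, 1), D2 = dimensions (2, 3).
data = {
    "p1": (1, 1, 4, 4),
    "p2": (2, 2, 1, 1),
    "p3": (3, 3, 2, 2),
    "p4": (4, 4, 3, 3),
}
subspaces = [(0, 1), (2, 3)]
all_dims = (0, 1, 2, 3)

local = {n: [dom_count(n, data, dims) for dims in subspaces] for n in data}
dom_sum = {n: sum(counts) for n, counts in local.items()}
dom_min = {n: min(counts) for n, counts in local.items()}
dom = {n: dom_count(n, data, all_dims) for n in data}

print(dom_sum)  # {'p1': 3, 'p2': 5, 'p3': 3, 'p4': 1}
print(dom_min)  # {'p1': 0, 'p2': 2, 'p3': 1, 'p4': 0}
print(dom)      # {'p1': 0, 'p2': 2, 'p3': 1, 'p4': 0}
```

Here p1 dominates three points in D1 but none globally; domMIN reflects that, while the sum of the local counts still ranks p1 level with p3.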
Anchor Selection
[Figure: points A to H plotted in (a) subspace D1 (dimensions d1, d2) at server N1 and (b) subspace D2 (dimensions d3, d4) at server N2; the local skyline is marked in each subspace.]
• C: optimal anchor point
• A: has maximal domMIN
• B: has maximal domSUM
Our algorithm on the previous example
1st Optimization: Multiple Anchor Points
• The previous algorithm performs pruning with a single anchor Panc. Specifically, a point P that is locally dominated by Panc in all subspaces is not sent to the client.
• On the other hand, if P is incomparable with Panc even in a single subspace Di, it will be transmitted by the corresponding server Ni.
• We suggest that multiple anchor points can often achieve more effective pruning.
Pruning with 2 points
2nd Optimization: Integration of Sketches
• So far, we have estimated the (expected) global dominance dom(P) of a point P using domMIN(P).
• This approach is biased towards points that have high local dominance counts in all subspaces but dominate few records globally (such as A).
• Thus, we propose an unbiased approach that directly estimates the global dominance counts using sketches, which approximately count the number of distinct objects.
• We assume that each server Ni has a local dominance sketch ski(P) for every point P, which aggregates all points that P dominates locally in Di.
Experiments
Thank you!