Computer Science
Spatio-Temporal Aggregation Using Sketches
Yufei Tao, George Kollios, Jeffrey Considine, Feifei Li, Dimitris Papadias
Department of Computer Science
City University of Hong Kong, Boston University, Hong Kong University of Science and Technology
March 18, 2004
Outline
• Applications and motivation
• Preliminaries – Aggregate trees and sketch techniques
• Distinct spatio-temporal aggregation
• Performance study
• Extensions
• Conclusion
Spatio-Temporal Aggregate Query – Applications
• Traffic Supervision Systems
– Monitoring the number of vehicles in a district; the information can be used to identify traffic jam areas, etc.
• Mobile Computing Applications
– Allocating bandwidth depending on the usage of each region
Example: A wireless company would like to know the number of cell phone users in a particular region during a specified period. It is also interesting to know the total number of phone calls made by all users who qualify for the first query.
Spatio-Temporal Aggregate Query
• Spatio-temporal applications require the retrieval of summarized information about moving objects
• Given a query region qr (a rectangle) and a query interval qt, a spatio-temporal aggregate query retrieves information about the objects that appeared in qr during qt
– Spatio-Temporal Count
• Returns the total number of qualifying objects
– Spatio-Temporal Sum
• Each object is associated with a measure; the query outputs the sum of the measures of the qualifying objects
Existing Approach: multi-tree structures based on R-trees and B-trees
– Problem: If an object remains in the query region for several timestamps during the query interval, it will be counted (or summed) multiple times in the result.
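To make the double-counting problem concrete, here is a minimal sketch in Python. The report format (timestamp, region, object id) and the sample data are assumptions made only for illustration; they do not come from the slides.

```python
# Hypothetical reports: (timestamp, region, object_id).
reports = [
    (1, "r1", "car_a"),
    (2, "r1", "car_a"),   # car_a stays in r1 for a second timestamp
    (2, "r1", "car_b"),
    (3, "r2", "car_c"),   # outside the query region used below
]

def per_timestamp_count(reports, regions, t_start, t_end):
    """Add up the per-timestamp counts, as a pre-aggregated index would."""
    return sum(1 for t, r, _ in reports
               if r in regions and t_start <= t <= t_end)

def distinct_count(reports, regions, t_start, t_end):
    """Count each qualifying object once, however long it stays."""
    return len({o for t, r, o in reports
                if r in regions and t_start <= t <= t_end})

print(per_timestamp_count(reports, {"r1"}, 1, 3))  # 3: car_a is counted twice
print(distinct_count(reports, {"r1"}, 1, 3))       # 2
```

The second function needs the individual object ids, which is exactly the information a pre-aggregated index no longer has.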
Spatio-Temporal Aggregate Query (cont.)
Motivation: Distinct Spatio-Temporal Aggregate Query
• Enables a much richer range of decision-making queries
• But: there is no way to exactly summarize distinct objects substantially better than by simply enumerating all of them
Solution:
• Spatio-Temporal Aggregation Index Trees
• Sketch Techniques
How to answer a “Distinct Aggregate Query”? E.g., how many cars are present in a district?
[Figure: district map with a stadium]
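Example
The query retrieves the aggregate sum (during time T1-T3) of all rectangles that intersect it.
[Figure: regions r1 to r4, grouped under R-tree nodes R1 and R2, with the query rectangle qr; aggregate value of each region per timestamp:]
region   t=1   t=2   t=3   t=4   t=5
r1       150   150   145   135   130
r2        75    80    85    90    90
r3        12    12    12    12    12
r4       132   127   125   127   127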
Preliminaries – Aggregate RB-tree
In the aRB-tree, the extents of all regions (in this case r1, r2, …, r4) are stored in an R-tree. Each (leaf/non-leaf) entry of the R-tree is associated with a pointer to a B-tree that stores historical aggregate data about the entry.
[Figure: R-tree over the spatial dimensions (entries R1, R2, r1 to r4) and one B-tree per entry. B-tree entries as <timestamp, aggregate>:]
B-tree for r1: leaf <1,150> <3,145> <4,135> <5,130>; non-leaf <1,445> <4,265>
B-tree for r2: leaf <1,75> <2,80> <3,85> <4,90>; non-leaf <1,155> <3,265>
B-tree for r3: <1,12>
B-tree for r4: leaf <1,132> <2,127> <3,125> <4,127>; non-leaf <1,259> <3,379>
B-tree for R1: leaf <1,225> <2,230> <4,225> <5,220>; non-leaf <1,685> <4,445>
B-tree for R2: leaf <1,144> <2,139> <3,137> <4,139>; non-leaf <1,283> <3,405>
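The way a B-tree of <timestamp, aggregate> entries answers an interval sum can be sketched as follows. This is only an illustration: a sorted Python list stands in for each B-tree, there is no R-tree, and the function names are hypothetical. The values for r1 and r2 are the leaf entries read from the figure on this slide, where an entry is stored only when the aggregate changes.

```python
from bisect import bisect_right

# Leaf entries <timestamp, aggregate> for r1 and r2, as read from the figure.
leaf_entries = {
    "r1": [(1, 150), (3, 145), (4, 135), (5, 130)],
    "r2": [(1, 75), (2, 80), (3, 85), (4, 90)],
}

def value_at(region, t):
    """Aggregate of `region` at timestamp t: the latest entry at or before t."""
    entries = leaf_entries[region]
    times = [ts for ts, _ in entries]
    idx = bisect_right(times, t) - 1
    return entries[idx][1] if idx >= 0 else 0

def interval_sum(regions, t_start, t_end):
    """Sum of the per-timestamp aggregates over the regions and the interval."""
    return sum(value_at(r, t)
               for r in regions
               for t in range(t_start, t_end + 1))

print(interval_sum(["r1"], 1, 3))        # 150 + 150 + 145 = 445
print(interval_sum(["r1", "r2"], 1, 3))  # 445 + (75 + 80 + 85) = 685
```

The two results agree with the non-leaf entries <1,445> of r1 and <1,685> of R1 in the figure. Note that an object staying in r1 for three timestamps contributes three times to such a sum.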
Preliminaries – Flajolet-Martin Sketches
• Goal: Small-space representation of a set of items.
• Sketch of a union of items is the OR of their bitmaps.
Prerequisite: Let h be a random, binary hash function.
Sketch of an item
For each unique item with ID x,
For each integer 1 ≤ i ≤ k in turn,
Compute h (x, i).
Stop when h (x, i) = 1, and set bit i.
X      0 0 1 0 0
Z      1 0 0 0 0
X ∪ Z  1 0 1 0 0
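The insertion rule and the OR property can be tried out with a short Python sketch. The hash function is an assumption: the slides only require a random binary hash h(x, i), and here bit i of a SHA-256 digest plays that role. The bitmap length K = 32 is also assumed.

```python
import hashlib

K = 32  # bits per bitmap (an assumed size)

def h(x, i):
    """Binary hash h(x, i): here, bit i of a SHA-256 digest of x (an assumption)."""
    digest = int.from_bytes(hashlib.sha256(str(x).encode()).digest(), "big")
    return (digest >> i) & 1

def sketch_item(x, k=K):
    """Bitmap of one item: set bit i for the first i in 1..k with h(x, i) = 1."""
    for i in range(1, k + 1):
        if h(x, i) == 1:
            return 1 << (i - 1)   # bit i is stored at integer position i - 1
    return 0                      # no bit set (happens with probability 2**-k)

def sketch_set(items, k=K):
    """Bitmap of a set: the OR of the bitmaps of its items."""
    bitmap = 0
    for x in items:
        bitmap |= sketch_item(x, k)
    return bitmap

a = sketch_set(["x", "y"])
b = sketch_set(["y", "z"])
print((a | b) == sketch_set(["x", "y", "z"]))  # True: OR gives the union's sketch
```

Because an item always sets the same bit, inserting it twice changes nothing, which is what makes the sketch suitable for distinct counting.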
Preliminaries – Flajolet-Martin Sketches (cont.)
Estimating COUNT
Take the bitmap of a set of N items.
Let j be the position of the leftmost zero in the bitmap.
j is an estimator of log2 (0.77 N)
Fixable drawbacks:
• Variance in the estimate is large.
[Figure: example bitmap S with j = 3]
Best guess: COUNT ≈ 11
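Inverting the relation j ≈ log2(0.77 N) gives N ≈ 2^j / 0.77. The sketch below applies it to a single bitmap. It assumes that positions are counted from 0, so that j equals the number of consecutive 1 bits before the first zero; the function names and the sample bitmap are illustrative.

```python
PHI = 0.77  # constant from the slide

def first_zero(bits):
    """0-based position of the leftmost zero in a list of 0/1 bits."""
    for j, bit in enumerate(bits):
        if bit == 0:
            return j
    return len(bits)  # no zero at all: every position is set

def estimate_count(bits):
    """Invert j ~ log2(0.77 N) to get N ~ 2**j / 0.77."""
    return 2 ** first_zero(bits) / PHI

bitmap = [1, 1, 1, 0, 1]          # hypothetical bitmap, leftmost bit first
print(first_zero(bitmap))         # 3
print(estimate_count(bitmap))     # 2**3 / 0.77
```

Since j can only take integer values, a single bitmap yields estimates that jump by factors of two, which is the large variance noted above.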
Preliminaries – Flajolet-Martin Sketches (cont.)
Standard variance reduction methods apply.
• Compute m independent bitmaps in parallel.
• Generate m independent estimates of N.
• Take the mean of the estimates.
Provable tradeoffs between m and variance of the estimator.
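A rough Python sketch of the variance reduction follows. Several choices are assumptions: the m "independent" hash functions are obtained by seeding SHA-256, positions are counted from 0, and the m zero positions are averaged before exponentiating, which is one common way to combine the bitmaps (the slide only states that m estimates are averaged).

```python
import hashlib

PHI = 0.77  # constant from the slides

def bit_index(x, seed, k=32):
    """0-based index of the bit that item x sets in bitmap number `seed`."""
    digest = int.from_bytes(
        hashlib.sha256(f"{seed}:{x}".encode()).digest(), "big")
    for i in range(k):
        if (digest >> i) & 1:
            return i
    return k - 1

def build_sketch(items, m, k=32):
    """m bitmaps, one per seed; duplicates in `items` have no effect."""
    bitmaps = [0] * m
    for x in items:
        for seed in range(m):
            bitmaps[seed] |= 1 << bit_index(x, seed, k)
    return bitmaps

def first_zero(bitmap):
    """0-based position of the lowest zero bit."""
    j = 0
    while (bitmap >> j) & 1:
        j += 1
    return j

def estimate(bitmaps):
    """Average the zero positions over the m bitmaps, then invert."""
    mean_j = sum(first_zero(b) for b in bitmaps) / len(bitmaps)
    return 2 ** mean_j / PHI

sketch = build_sketch(range(2000), m=64)
print(round(estimate(sketch)))  # an approximation of 2000, not the exact value
```

Larger m costs proportionally more space and insertion time but tightens the estimate, which is the tradeoff the slide refers to.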
Distinct Spatio-Temporal Aggregation
Exact Solution
If n is the number of distinct objects and T is the total number of timestamps in history, the exact solution requires space on the order of n∙T.
Existing Aggregation Approach
The aRB-tree stores only summarized data; information about individual objects is lost, so the problem cannot be solved.
Our Solution
• Combine the aRB-tree with the FM sketch technique: for each region ri and every timestamp t, we maintain a sketch si(t) that captures the (ids of the) objects in ri at t.
• Requires space on the order of m∙R∙T∙log n, where R is the number of regions and m is an adjustable constant specifying the number of bitmaps used by one sketch (it determines the tradeoff between overhead and approximation accuracy).
System Architecture
[Figure: objects in regions r1, r2, r3 send object ids or weights to sketch producers; the resulting sketches are stored in the database, which answers aggregate queries with approximate results]
[Figure: one sketch per region (r1 to r4) and timestamp (1 to 5)]
The sketches can be stored in a two dimensional array
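As an illustration of the two-dimensional layout, the sketch below keeps one bitmap per (region, timestamp) cell and answers a distinct query by ORing the cells covered by qr and qt. It is a simplification under stated assumptions: one bitmap per sketch (m = 1), a SHA-256 based hash, and hypothetical report tuples (timestamp, region, object id).

```python
import hashlib

def item_bit(x, k=32):
    """Bitmap with the single bit that item x sets (hash choice is an assumption)."""
    digest = int.from_bytes(hashlib.sha256(str(x).encode()).digest(), "big")
    for i in range(k):
        if (digest >> i) & 1:
            return 1 << i
    return 1 << (k - 1)

def build_array(reports, regions, timestamps):
    """sketches[region][t] is the OR of the bitmaps of all ids reported there."""
    sketches = {r: {t: 0 for t in timestamps} for r in regions}
    for t, r, obj in reports:
        sketches[r][t] |= item_bit(obj)
    return sketches

def query_sketch(sketches, regions, t_start, t_end):
    """OR together every cell covered by the query region and interval."""
    result = 0
    for r in regions:
        for t in range(t_start, t_end + 1):
            result |= sketches[r][t]
    return result

reports = [(1, "r1", "a"), (2, "r1", "a"), (2, "r2", "b"), (3, "r2", "c")]
array = build_array(reports, ["r1", "r2"], range(1, 4))
rs = query_sketch(array, ["r1", "r2"], 1, 2)
print(rs == (item_bit("a") | item_bit("b")))  # True: object a is not double counted
```

The result sketch equals the sketch of the set of distinct ids, so reading it with the COUNT estimator approximates the distinct count. Scanning the array touches every covered cell, which is what the index structures on the next slide aim to avoid.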
Sketch Indexing Structures
Each B-tree entry has the form <time, sketch>.
The sketch of a non-leaf entry in a B-tree equals the OR of all the sketches in its sub-tree.
[Figure: R-tree over the spatial dimensions (R1, R2, r1 to r4) with one sketch B-tree per entry (nodes N1 to N4), and a query with rectangle qr and interval qt = (1,4)]
Query Processing
• Similar to the query processing technique of the aRB-tree.
Basic Idea: The spatial and temporal search conditions are applied alternately. The result sketch is incrementally updated.
• Can be improved by applying pruning techniques.
Heuristic 1: Let RS be the current result sketch, and e a non-leaf B-tree entry whose associated sketch is se. Then the sub-tree of e can be pruned if (se OR RS) = RS.
Heuristic 2: Given a set of entries that cannot be pruned by Heuristic 1, we visit their child nodes in descending order of the number of 1’s in their sketches.
And more heuristics!
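Heuristics 1 and 2 can be sketched on a flat list of entries. This is a simplified illustration, not the index traversal itself: bitmaps are plain integers, the entry names are hypothetical, and instead of descending into a child node the code ORs the entry's sketch directly into RS (valid only when the whole sub-tree falls inside the query).

```python
def popcount(sketch):
    """Number of 1 bits in a bitmap stored as an integer."""
    return bin(sketch).count("1")

def can_prune(se, rs):
    """Heuristic 1: the entry adds no new bit if (se OR RS) = RS."""
    return (se | rs) == rs

def process_entries(entries, rs=0):
    """Visit entries by descending number of 1s (Heuristic 2), pruning on the way.

    entries: list of (name, sketch). Returns (result sketch, visited names).
    """
    visited = []
    for name, se in sorted(entries, key=lambda e: popcount(e[1]), reverse=True):
        if can_prune(se, rs):
            continue              # nothing below this entry can change RS
        visited.append(name)
        rs |= se                  # stand-in for descending into the child node
    return rs, visited

entries = [("a", 0b10000), ("b", 0b11100), ("c", 0b01100), ("d", 0b00011)]
rs, visited = process_entries(entries)
print(bin(rs), visited)  # 0b11111 ['b', 'd']
```

Visiting the entry with the most 1s first fills RS quickly, so entries "c" and "a" are pruned without being read.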
Query Processing – Supporting Distinct Sum Query
Extending FM sketches
• FM sketches can handle this:
– to insert a value of 500, perform 500 distinct item insertions
• Our observation: We can simulate a large number of insertions into an FM sketch more efficiently.
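The straightforward approach from the first bullet can be sketched as follows; the faster simulation mentioned in the second bullet is not described on the slide and is not attempted here. The pairing of the object id with a counter to form distinct items, the hash function, and the single bitmap are all assumptions.

```python
import hashlib

def item_bit(x, k=32):
    """Bitmap with the single bit that item x sets (hash choice is an assumption)."""
    digest = int.from_bytes(hashlib.sha256(str(x).encode()).digest(), "big")
    for i in range(k):
        if (digest >> i) & 1:
            return 1 << i
    return 1 << (k - 1)

def insert_weighted(bitmap, obj_id, value):
    """Insert `value` distinct items (obj_id, 1) ... (obj_id, value)."""
    for j in range(1, value + 1):
        bitmap |= item_bit((obj_id, j))
    return bitmap

s = insert_weighted(0, "bus_1", 5)
s = insert_weighted(s, "bus_2", 3)
print(insert_weighted(s, "bus_1", 5) == s)  # True: a repeated report adds nothing
```

The number of distinct items in the sketch is the sum of the measures of the distinct objects, so the COUNT estimator applied to it approximates the distinct sum. The cost of this version grows linearly with the inserted value, which motivates the more efficient simulation.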
Performance
• Dataset settings
– Number of cities = 10,000
– Number of buses = 100,000
– History length = 1,00 timestamps
– Number of passengers for each bus = [200,300]
– At each timestamp, bus reports to its nearest city, <time t, city c, bus b, passenger # a>
• Each query has 2 parameters: spatial extent and interval length
• A count query retrieves the number of distinct buses that report to cities in qr during qt, while a sum query returns the sum of these buses’ passengers
• Compare the sketch-index to the relational approach: index the 4-tuple table <t,c,b,a> using a B-tree on the time t column
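The relational baseline can be approximated with Python's built-in sqlite3 module. This is only a sketch of the idea: the table and index names, the sample rows, and the assumption that a bus reports the same passenger number at every timestamp are all illustrative and not taken from the experiments.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reports (t INTEGER, c INTEGER, b INTEGER, a INTEGER)")
conn.execute("CREATE INDEX idx_reports_t ON reports (t)")  # B-tree on time
conn.executemany("INSERT INTO reports VALUES (?, ?, ?, ?)", [
    (1, 10, 100, 250),
    (2, 10, 100, 250),   # bus 100 reports again to city 10
    (2, 11, 101, 210),
    (3, 12, 102, 300),   # city 12 is outside the query below
    (5, 10, 103, 220),   # timestamp 5 is outside the query below
])

def distinct_count_and_sum(conn, cities, t_start, t_end):
    """Exact distinct bus count and passenger sum for cities in qr during qt."""
    placeholders = ",".join("?" for _ in cities)
    return conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(a), 0) FROM ("
        "  SELECT b, MAX(a) AS a FROM reports"
        f"  WHERE t BETWEEN ? AND ? AND c IN ({placeholders})"
        "  GROUP BY b)",
        [t_start, t_end, *cities],
    ).fetchone()

print(distinct_count_and_sum(conn, [10, 11], 1, 3))  # (2, 460)
```

The query is exact, but it has to read every tuple in the time range and group by bus id.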
Results (Space Consumption)
[Figure: size (megabytes) vs. number of bitmaps per sketch (8, 16, 32), compared with the database size]
Size of sketch index could be further reduced by applying simple compression techniques!
Results (Sketch Pruning in Query)
[Figure (a): number of disk accesses vs. query rectangle length (qtlen = 10); series: sketch-pruning, naive, relational]
[Figure (b): number of disk accesses vs. query interval length (qrlen = 0.15); series: sketch-pruning, naive, relational]
Results (Accuracy of Approximate Results)
[Figure (a): relative error vs. query rectangle length (qtlen = 10, count); series: 32-bitmap, 16-bitmap, 8-bitmap]
[Figure (b): relative error vs. query rectangle length (qtlen = 10, sum); series: 32-bitmap, 16-bitmap, 8-bitmap]
Results (Costs of Indexes)
[Figure (a): number of disk accesses vs. query rectangle length (qtlen = 10); series: 32-bitmap, 16-bitmap, 8-bitmap]
[Figure (b): number of disk accesses vs. query interval length (qrlen = 0.15); series: 32-bitmap, 16-bitmap, 8-bitmap]
Extensions
• Approximating general moving data
Problem: Each object o reports its location <x,y> at each timestamp t, so the size of the database grows continuously, on the order of n∙T.
• Solution: Impose a res×res regular grid over the data space and apply the sketch index, treating the grid cells as the finest aggregate granularity. Space: O(res²∙T∙log n) [or O(T∙log n) when res is a constant]
[Figure: multi-level grid (Level 0, Level 1, ..., Level L), with a B-tree per cell]
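Mapping locations to grid cells might look like the sketch below. It assumes that the data space is the unit square and that reports are tuples (timestamp, object id, x, y); both are illustrative choices, as are the function names.

```python
def cell_of(x, y, res):
    """Grid cell (column, row) of a point in the unit square [0, 1] x [0, 1]."""
    col = min(int(x * res), res - 1)   # clamp so that x = 1.0 falls in the last column
    row = min(int(y * res), res - 1)
    return col, row

def group_by_cell(reports, res):
    """Map (timestamp, cell) -> set of object ids; each set would feed one sketch."""
    cells = {}
    for t, obj, x, y in reports:
        cells.setdefault((t, cell_of(x, y, res)), set()).add(obj)
    return cells

reports = [(1, "o1", 0.26, 0.74), (1, "o2", 0.30, 0.70), (2, "o1", 0.90, 0.10)]
print(cell_of(0.26, 0.74, 4))            # (1, 2)
print(len(group_by_cell(reports, 4)))    # 2 occupied (timestamp, cell) pairs
```

Each grid cell then plays the role of a region ri, so the number of sketches per timestamp is bounded by res², independent of the number of objects.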
Conclusion
• We propose a sketch index that integrates traditional approximate counting techniques with spatio-temporal indexes for efficient distinct aggregate query processing in spatio-temporal databases.
• The sketch index consumes less space and processes queries an order of magnitude faster than a conventional database, with low aggregate error.
• Extensions and Future work
– Other possible sketches
– More sophisticated algorithms for mining association rules