probabilistic threshold range aggregate query processing over uncertain data

34
Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data Wenjie Zhang University of New South Wales & NICTA, Australia Joint work: Shuxiang Yang, Ying Zhang, Xuemin Lin (UNSW & NICTA)

Upload: terah

Post on 12-Jan-2016

49 views

Category:

Documents


0 download

DESCRIPTION

Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data. Wenjie Zhang University of New South Wales & NICTA, Australia. Joint work: Shuxiang Yang, Ying Zhang, Xuemin Lin (UNSW & NICTA). Outline. Background and Preliminaries Probabilistic Threshold Range Aggregate Query - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data

Probabilistic Threshold Range Aggregate Query

Processing over Uncertain Data

Wenjie Zhang

University of New South Wales & NICTA, AustraliaJoint work:

Shuxiang Yang, Ying Zhang, Xuemin Lin (UNSW & NICTA)

Page 2: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data

Outline

DB@UNSW

2

Background and Preliminaries Probabilistic Threshold Range Aggregate

Query Exact query processing Approximate query processing: Simple

Sampling & Double Sampling Experiments

Conclusion

Page 3: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data

Applications

DB@UNSW

3

Many applications involve data that is imperfect due to data randomness and incompleteness limitation of equipment delay or lose in data transfer … …

Applications Sensor networks Environmental surveillance Moving objects Data cleaning and integration … …

Page 4: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data

Applications

DB@UNSW

4

Sensor Networks: Sensor readings are often imprecise due to equipment

limitation and periodical reporting mechanism. (figures are borrowed from Jian et al, SIGMOD08)

Page 5: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data

Applications

DB@UNSW

5

Mobile Equipments / Moving Objects A mobile object reports its location periodically, the

exact location is often uncertain.

Page 6: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data

Applications

DB@UNSW

6

Satellite data

Page 7: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data

Applications

DBG @ UNSW

Data Quality Social Data Collection: Errors and estimation

inherent in customer surveys and sampling

7

Page 8: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data

Outline

DB@UNSW

8

Background and Preliminaries Modeling Uncertainty & Related Work

Probabilistic Threshold Range Query Conclusion

Page 9: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data

Modeling Uncertainty ( cont. )

DB@UNSW

9

Uncertain Objects Model1. Continuous case: described using a probability

density function (PDF) fU such that . E.g., uniform distribution, normal distribution.

Uu U duuf 1)(

Page 10: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data

Modeling Uncertainty ( cont. )

DB@UNSW

10

Uncertain Objects Model2. Discrete case : described using a set of

instances each instance u has an occurrence probability pu

1 Uu up

Page 11: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data

Possible World Semantics

DB@UNSW

11

Given a set of uncertain objects U1,U2, ..., Un, a possible world W = u1,u2, .., un is a set of n instances --- one instance per uncertain object

The probability of a possible worlds is

P(W) =

Let Ω be the set of all possible world, clearly,

n

i iuP1 )(

1)( WWP

Page 12: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data

Probabilistic Queries:

DB@UNSW

12

Query Evaluation [CKP03, CXPSV04, DS04, DS05, DS07, SD07]

Aggregate Queries [BDJR05, MJ07, CG07]

Join Queries [CSP06, AW07]

Top-k queries [SIC07, YLSK08, RDS07, HJZL08]

Nearest Neighbor Queries [KKR07, CCMC08]

Skyline Queries [PJLY07]

… …

Page 13: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data

Range query

DBG @ UNSW

13

Uncertain objects, exact query Probability threshold is often assigned

Page 14: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data

Related Work

DB@UNSW

14

Range Queries [TCXNKP05, BPS06, AY08]

Given a rectangle r and a probabilistic threshold t , find all objects that appear in r with probability at least t.

Appearance probability

r

o .reg ion

rregionoxdxxpdfo

.)(.

Page 15: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data

U-tree

DB@UNSW

15

Probabilistically Constrained Region ( PCR ) [TCXNKP05]

PCR (0.2) Multi PCRs

Page 16: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data

Outline

DB@UNSW

16

Introduction Modeling Uncertainty & Related Work Probabilistic Threshold Range Aggregate

Query (PTRA) Conclusion

Page 17: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data

Contribution

DB@UNSW

17

Formally define PTRA query aU-Tree structure for exact PTRA query singleSample and doubleSample

techniques for approximate answer.

Page 18: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data

Problem Statement

DB@UNSW

18

Given a set of uncertain objects and query q , return the number of uncertain objects with appearance probability no less than threshold pq

Page 19: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data

Problem Definition

DB@UNSW

19

Assume threshold = 0.5, if the appearance probability computed for b is > 0.5 and for c is < 0.5, then the aggregate returned is 2 (a & b)

Page 20: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data

Exact Query Processing ( aU-Tree)

DB@UNSW

20

Main idea: add aggregate information on U-tree Advantage: stop at intermediate level if

pruned or fully covered by the query Disadvantage: otherwise, still need to drill

down to the leaf nodes. For a large portion of uncertain objects,

appearance probability needs to be computed Expensive for a massive number of instances

per object!

Page 21: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data

Exact Query Processing ( aU-Tree)

DB@UNSW

21

Page 22: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data

singleSample

DB@UNSW

22

Sampling the instances of the uncertain objects. If m’ out of m sampled instances are inside query

region, then the approximate appearance probability is m’/m

Page 23: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data

singleSample ( cont. )

DB@UNSW

23

An immediate application of Chernoff-Hoeffding bound

Page 24: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data

doubleSample

DB@UNSW

24

Single Sampling is expensive when there is a massive number of objects!

Sampling the uncertain objects as well. Naive : uniform sampling objects from all

uncertain objects.

Page 25: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data

doubleSample: Accuracy

DB@UNSW

25

•Note: “ appearance probability” of each object follows uniform distribution means spatial location is uniformly distributed.•Using Chernoff-Hoeffding bound.

Page 26: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data

doubleSample: Our Approach

DB@UNSW

26

Skew! Aim: select K disjoint groups covering all objects

with the minimum “skew”; i.e. objects in each group with “uniform” distribution. (Then do uniform sampling of objects in each group.)

The optimization problem is NP-hard. Observation:

Min-skew is a good heuristic to conduct such a group.

aU-tree groups objects with a similar principle to the min-skew.

Page 27: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data

doubleSample: Our Approach

DB@UNSW

27

Step 1: choose K subtrees to cover all objects with the total minimum skew. NP-hard! Find a level L such that the number of nodes at level

L is smaller than K but the number of nodes at level L-1 is larger than K.

Feed the min-skew algorithm with the subtrees at level L.

(note: if at a level L, the number of nodes = K, then these K subtrees are chosen.)

Step 2: sample objects in each subtree. Step 3. sample instances in each sampled object.

Page 28: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data

Experiments

DB@UNSW

28

Algorithms:

exact, singleSample, doubleSample

Data set:

LB : 53k objects at long beach country

CA : 62k objects at California

Synthetic aircraft dataset in 3D

10k instances for each points follow Uniform or constrained-Gaussian

Setting : C++, P4 2.8GHz , 2G memory, Debian linux, Page size 8K

Page 29: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data

Efficiency

DB@UNSW

29

Page 30: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data

Accuracy

DB@UNSW

30

Page 31: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data

Accuracy ( cont. )

DB@UNSW

31

Page 32: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data

Conclusion

DB@UNSW

32

Definition of PTRA aU-Tree technique Sampling technique Future work. Any approach with

theoretic guarantee?

Page 33: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data

DB@UNSW

33

Thanks

Page 34: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data

Min-Skew technique

DB@UNSW

34