The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries


Post on 28-Jan-2016


Page 1: The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries

The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries

Yufei Tao U. Hong Kong

Christos Faloutsos CMU

Dimitris Papadias Hong Kong UST

Page 2: The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries

Tao, Faloutsos, Papadias 2

Roadmap

- Problem – motivation
- Survey
- Proposed method – main idea
- Proposed method – details
- Experiments
- Conclusions

Page 3: The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries

Target query types

DB = set of m-dimensional points.

- Range search (RS)
- k nearest neighbor (KNN)
- Regional distance (self-) join (RDJ), e.g. in Louisiana, find all pairs of music stores closer than 1 mi to each other
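
The three query types can be pinned down with a brute-force sketch. This is only an illustration of what each query returns, not the indexed (R*-tree) evaluation used in the talk. The function names, the rectangular region for the join, and the use of "within distance t" as the join predicate are assumptions made for this example.

```python
import numpy as np

def lp_dist(a, b, p=2.0):
    """L_p distance between a and b (row-wise if a is 2-D); p=np.inf gives L-infinity."""
    diff = np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))
    if np.isinf(p):
        return diff.max(axis=-1)
    return (diff ** p).sum(axis=-1) ** (1.0 / p)

def range_search(db, q, r, p=2.0):
    """RS: indices of all points within distance r of q."""
    return np.flatnonzero(lp_dist(db, q, p) <= r)

def knn(db, q, k, p=2.0):
    """KNN: indices of the k points closest to q."""
    return np.argsort(lp_dist(db, q, p), kind="stable")[:k]

def regional_distance_join(db, region_lo, region_hi, t, p=2.0):
    """RDJ (self-join): pairs (i, j), i < j, both inside the (assumed rectangular)
    region and within distance t of each other."""
    inside = np.flatnonzero(np.all((db >= region_lo) & (db <= region_hi), axis=1))
    pairs = []
    for a, i in enumerate(inside):
        for j in inside[a + 1:]:
            if lp_dist(db[i], db[j], p) <= t:
                pairs.append((int(i), int(j)))
    return pairs
```

Selectivity is the size of each result (relative to the dataset or to the number of candidate pairs), which is the quantity the estimation method tries to predict without running the query.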

Page 4: The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries

Target problem

Estimate
- Query selectivity
- Query (I/O) cost

for any Lp metric, using a single method

Page 5: The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries

Target Problem

for any Lp metric, using a single method

        RS        KNN       RDJ
Sel.              XXXX
I/O

Page 6: The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries

Roadmap

- Problem – motivation
- Survey
- Proposed method – main idea
- Proposed method – details
- Experiments
- Conclusions

Page 7: The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries

Older query estimation approaches

Vast literature: sampling, kernel estimation, singular value decomposition, compressed histograms, sketches, maximal independence, Euler formula, etc.

BUT: they target specific cases (mostly range search selectivity under the L∞ norm), and their extensions to other problems are unclear.

Page 8: The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries

Main competitors

- Local method
  - Representative methods: histograms
- Global method
  - Provides a single estimate corresponding to the average selectivity/cost of all queries, independently of their locations
  - Representative methods: fractal and power law

Page 9: The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries

Rationale and problems of histograms

Partition the data space into a set of buckets and assume (local) uniformity.

[Figure: buckets b1, b2, b3 and a query q with its vicinity circle]

Problems:
- the uniformity assumption
- tricky/slow estimations, for all but the L∞ norm

Page 10: The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries

Roadmap

- Problem – motivation
- Survey
- Proposed method – main idea
- Proposed method – details
- Experiments
- Conclusions

Page 11: The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries

Inherent defect of histograms

Density trap – what is the density in the vicinity of q?

[Figure: points along a line; query point q with radius r; labels 1 and 10]

- diameter=10: 10/100 = 0.1
- diameter=100: 100/10,000 = 0.01

Q: What is going on?

Page 12: The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries

Inherent defect of histograms

Density trap – what is the density in the vicinity of q?

[Figure: points along a line; query point q with radius r; labels 1 and 10]

- diameter=10: 10/100 = 0.1
- diameter=100: 100/10,000 = 0.01

Q: What is going on?

A: we ask a silly question: ~ “what is the area of a line?”
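
The arithmetic on this slide can be reproduced with a small sketch. It assumes a toy dataset of points on a straight line at unit spacing and a square (L∞) vicinity; the dataset size, the query position, and the helper name are choices made for this example only.

```python
import numpy as np

# Assumed toy dataset: 1001 points on the x-axis, one per unit length.
xs = np.arange(-500, 501, dtype=float)
points = np.column_stack([xs, np.zeros_like(xs)])

def count_in_square(points, q, diameter):
    """Number of points inside the axis-aligned square of side `diameter` centered at q."""
    half = diameter / 2.0
    return int(np.all(np.abs(points - q) <= half, axis=1).sum())

q = np.array([0.5, 0.0])  # query point on the line, midway between two data points

for diameter in (10, 100):
    count = count_in_square(points, q, diameter)
    # "Density" = count / area changes by 10x when the diameter changes by 10x.
    print(f"diameter={diameter}: count={count}, density={count / diameter**2}")

# Slope of log(count) against log(diameter): this stays fixed as the vicinity grows.
c10 = count_in_square(points, q, 10)
c100 = count_in_square(points, q, 100)
slope = (np.log(c100) - np.log(c10)) / (np.log(100) - np.log(10))
print("log-log slope:", slope)
```

The density estimate depends on the size of the vicinity chosen, whereas the log-log slope (1 for a line) does not.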

Page 13: The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries

“Density Trap”

- Not caused by a mathematical oddity like the Hilbert curve, but by a line, a perfectly well-behaved Euclidean object!
- This ‘trap’ will appear for any non-uniform dataset
- Almost ALL real point-sets are non-uniform -> the trap is real

Page 14: The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries

“Density Trap”

In short: count_of_neighbors / area is meaningless.

What should we do instead?

Page 15: The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries

“Density Trap”

In short: count_of_neighbors / area is meaningless.

What should we do instead?

A: log(count_of_neighbors) vs log(area)

Page 16: The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries

Local power law

In more detail, the ‘local power law’:

$nb_p(r) = c_p \cdot r^{n_p}$

- $nb_p(r)$: # neighbors of point p, within radius r
- $c_p$: ‘local constant’
- $n_p$: ‘local exponent’ (= local intrinsic dimensionality)

[Figure: points along a line; query point q with radius r]
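
One plausible way to obtain the local constant and exponent of a point is a least-squares line fit in log-log space, since the local power law nb_p(r) = c_p · r^(n_p) becomes log nb_p(r) = log c_p + n_p · log r. The sketch below assumes L∞ neighborhoods, a caller-supplied list of radii, and that p itself is in the dataset (so every count is at least 1); the slides do not spell out the fitting procedure, so treat this as an illustration rather than the authors' exact method.

```python
import numpy as np

def fit_local_power_law(db, p, radii):
    """Fit log nb_p(r) = log c_p + n_p * log r and return (c_p, n_p).

    Neighborhoods are L-infinity balls (squares); counts include p itself
    when p belongs to db.
    """
    radii = np.asarray(radii, dtype=float)
    dist = np.abs(db - p).max(axis=1)
    counts = np.array([(dist <= r).sum() for r in radii], dtype=float)
    keep = counts > 0  # log(0) is undefined, so skip empty neighborhoods
    n_p, log_c = np.polyfit(np.log(radii[keep]), np.log(counts[keep]), 1)
    return float(np.exp(log_c)), float(n_p)

# Usage: points on a line should give a local exponent close to 1.
xs = np.linspace(-1.0, 1.0, 20001)
line = np.column_stack([xs, np.zeros_like(xs)])
c_line, n_line = fit_local_power_law(line, np.array([0.0, 0.0]), np.geomspace(0.01, 0.5, 10))
```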

Page 17: The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries

Local power law

Intuitively: to avoid the ‘density trap’, use $n_p$, the local intrinsic dimensionality, instead of density.

[Figure: points along a line; query point q with radius r]

Page 18: The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries

Does LPL make sense?

For point ‘q’, LPL gives

$nb_q(r) = \langle \text{constant} \rangle \cdot r^{1}$

(no need for ‘density’, nor uniformity)

[Figure: points along a line; query point q with radius r; labels 1 and 10]

- diameter=10: 10/100 = 0.1
- diameter=100: 100/10,000 = 0.01

Page 19: The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries

Local power law and Lx

If a point obeys the LPL under L∞, it also obeys it under any other Lx metric, with the same ‘local exponent’:

$nb_p^{(x)}(r) = c_p \cdot \left( \frac{VolSphere_x(1)}{VolSphere_\infty(1)} \right)^{n_p/m} \cdot r^{n_p}$

-> LPL works easily, for ANY Lx metric
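
A small numeric sketch of this conversion follows. It assumes the slide's relation reads c_p(Lx) = c_p(L∞) · (VolSphere_x(1) / VolSphere_∞(1))^(n_p/m), with m the dimensionality of the data space, and it uses the standard closed form for the volume of a unit L_x ball. The function names are hypothetical.

```python
import math

def unit_ball_volume(m, x):
    """Volume of the unit-radius L_x ball in m dimensions.

    Standard closed form: (2 * Gamma(1/x + 1))^m / Gamma(m/x + 1).
    For x = infinity the ball is a cube of side 2.
    """
    if math.isinf(x):
        return 2.0 ** m
    return (2.0 * math.gamma(1.0 / x + 1.0)) ** m / math.gamma(m / x + 1.0)

def convert_constant(c_inf, n_p, m, x):
    """Local constant under L_x, given the L-infinity constant (assumed relation, see lead-in)."""
    ratio = unit_ball_volume(m, x) / unit_ball_volume(m, math.inf)
    return c_inf * ratio ** (n_p / m)

# Usage: a point with n_p = 2 in 2-D (locally uniform data), moving from L-infinity to L2.
# The neighbor count scales by area(circle) / area(square) = pi / 4, as expected.
c_l2 = convert_constant(100.0, 2.0, 2, 2.0)
```

Only the constant changes between metrics; the exponent n_p is reused as is, which is why one set of stored coefficients can serve every Lx metric.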

Page 20: The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries

Examples

[Figure: a dataset with two marked points p1 and p2, and a log-log plot of #neighbors(<=r), nb_p(r), against radius r (0.001 to 0.1; counts from 1 to 10k) for p1 and p2]

p1 has higher ‘local exponent’ = ‘local intrinsic dimensionality’ than p2

Page 21: The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries

Roadmap

- Problem – motivation
- Survey
- Proposed method – main idea
- Proposed method – details
- Experiments
- Conclusions

Page 22: The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries

Proposed method

Main idea: if we know (or can approximate) the $c_p$ and $n_p$ of every point p, we can solve all the problems:

Page 23: The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries

Target Problem

for any Lp metric, using a single method

        RS        KNN       RDJ
Sel.              XXXX
I/O

Page 24: The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries

Target Problem

for any Lp metric (Lemma 3.2), using a single method

        RS        KNN       RDJ
Sel.    Thm 3.1   XXXX      Thm 3.2
I/O     Thm 3.3   Thm 3.4   Thm 3.5
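
To make the main idea concrete, here is a minimal sketch of two estimates that follow directly from the local power law nb_p(r) = c_p · r^(n_p) once c_p and n_p are known. These are straightforward consequences of the law, not transcriptions of Thm 3.1 to 3.5, whose exact statements are not on the slides. All function names are hypothetical.

```python
def estimate_range_count(c_p, n_p, r):
    """Expected number of points within radius r of the query, straight from the LPL."""
    return c_p * r ** n_p

def estimate_range_selectivity(c_p, n_p, r, cardinality):
    """Range-search selectivity: estimated result size over dataset size, capped at 1."""
    return min(1.0, estimate_range_count(c_p, n_p, r) / cardinality)

def estimate_knn_radius(c_p, n_p, k):
    """Radius expected to contain k neighbors: solve c_p * r^n_p = k for r."""
    return (k / c_p) ** (1.0 / n_p)

# Usage with made-up coefficients for a query point on a line-like region (n_p = 1).
count = estimate_range_count(200.0, 1.0, 0.05)
radius = estimate_knn_radius(200.0, 1.0, 10)
```

Cost estimation needs more than this (it depends on how the index nodes intersect the search region), so the sketch covers only the result-size side.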

Page 25: The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries

Theoretical results

Interesting observation (Thm 3.4): the cost of a kNN query q depends only on the ‘local exponent’, and NOT on the ‘local constant’, nor on the cardinality of the dataset.

Page 26: The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries

Implementation

Given a query point q, we need its local exponent and constant to perform estimation,

but: too expensive to store, for every point. Q: What to do?

Page 27: The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries

Implementation

Given a query point q, we need its local exponent and constant to perform estimation,

but: too expensive to store, for every point. Q: What to do?

A: exploit locality:

Page 28: The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries

Implementation

Nearby points usually have similar local constants and exponents. Thus, one solution:

‘anchors’: pre-compute the LPLaw for a set of representative points (anchors), and use the nearest ‘anchor’ to q
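
The anchor scheme can be sketched as follows. The sketch assumes anchors are picked by uniform random sampling from the dataset, that each anchor's coefficients come from a log-log least-squares fit over L∞ neighborhoods, and that the nearest anchor is found by a linear scan. These are illustrative choices, and the function names are hypothetical.

```python
import numpy as np

def fit_lpl(db, p, radii):
    """Fit log nb_p(r) = log c_p + n_p * log r over L-infinity neighborhoods of p.

    Assumes p is a member of db, so every count is at least 1.
    """
    dist = np.abs(db - p).max(axis=1)
    counts = np.array([(dist <= r).sum() for r in radii], dtype=float)
    n_p, log_c = np.polyfit(np.log(radii), np.log(counts), 1)
    return np.exp(log_c), n_p

def build_anchors(db, num_anchors, radii, seed=0):
    """Sample anchors from db and pre-compute (c_p, n_p) for each one."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(db), size=num_anchors, replace=False)
    coeffs = np.array([fit_lpl(db, db[i], radii) for i in idx])
    return db[idx], coeffs  # coeffs[:, 0] holds c_p, coeffs[:, 1] holds n_p

def lookup(anchors, coeffs, q):
    """Return the (c_p, n_p) of the anchor nearest to q (L-infinity distance)."""
    nearest = np.argmin(np.abs(anchors - q).max(axis=1))
    c_p, n_p = coeffs[nearest]
    return c_p, n_p
```

At query time only the small anchor table is consulted, so the per-point coefficients never need to be stored.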

Page 29: The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries

Implementation

Choose anchors with sampling, DBS, or any other method.

Page 30: The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries

Implementation

(In addition to ‘anchors’, we also tried to use ‘patches’ of near-constant $c_p$ and $n_p$; it gave similar accuracy, for a more complicated implementation)

Page 31: The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries

Experiments - Settings

- Datasets
  - SC, which contains 40k points representing the coastlines of Scandinavia
  - LB, which includes 53k points corresponding to locations in Long Beach county
- Structure: R*-tree
- Compare Power method to
  - Minskew
  - Global method (fractal)

Page 32: The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries

Experiments - Settings

- The LPLaw coefficients of each anchor point are computed using L∞ 0.05-neighborhoods
- Queries: biased (following the data distribution)
- A query workload contains 500 queries
- We report the average error $\sum_i |act_i - est_i| / \sum_i act_i$
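
The reported metric can be computed as below, assuming it reads as the sum of absolute differences between actual and estimated values over the workload, divided by the sum of actual values. The function name is hypothetical.

```python
import numpy as np

def workload_error(actual, estimated):
    """Average relative error of a workload: sum_i |act_i - est_i| / sum_i act_i."""
    actual = np.asarray(actual, dtype=float)
    estimated = np.asarray(estimated, dtype=float)
    return float(np.abs(actual - estimated).sum() / actual.sum())

# Usage with three made-up queries: total absolute error 4 over total actual 60.
err = workload_error([10, 20, 30], [12, 18, 30])
```

Normalizing by the total actual value, rather than averaging per-query ratios, keeps queries with very small results from dominating the figure.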

Page 33: The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries

Target Problem

for any Lp metric (Lemma 3.2), using a single method

        RS        KNN       RDJ
Sel.    Thm 3.1   XXXX      Thm 3.2
I/O     Thm 3.3   Thm 3.4   Thm 3.5

Page 34: The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries

Range search selectivity

[Figure: two panels of estimation error (%) versus r (0 to 0.1); series: minskew, power, global]

The LPL method wins

Page 35: The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries

Target Problem

for any Lp metric (Lemma 3.2), using a single method

        RS        KNN       RDJ
Sel.    Thm 3.1   XXXX      Thm 3.2
I/O     Thm 3.3   Thm 3.4   Thm 3.5

Page 36: The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries

Regional distance join selectivity

- No known global method in this case
- The LPL method wins, with higher margin

[Figure: two panels of estimation error (%) versus t (0 to 0.01); series: minskew, power]

Page 37: The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries

Target Problem

for any Lp metric (Lemma 3.2), using a single method

        RS        KNN       RDJ
Sel.    Thm 3.1   XXXX      Thm 3.2
I/O     Thm 3.3   Thm 3.4   Thm 3.5

Page 38: The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries

Range search query cost

[Figure: two panels of estimation error (%) versus r (0 to 0.1); series: minskew, power, global]

Page 39: The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries

k nearest neighbor cost

[Figure: two panels of estimation error (%) versus k (1 to 100); series: local uniformity, power, global]

Page 40: The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries

Regional distance join cost

[Figure: two panels of estimation error (%) versus t (0 to 0.01); series: minskew, power]

Page 41: The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries

Conclusions

- We spot the “density trap” problem of the local uniformity assumption (<- histograms)
- We show how to resolve it, using the ‘local intrinsic dimension’ instead (-> ‘Local Power Law’)
- and we solve all posed problems:

[Figure: points along a line; query point q with radius r]

Page 42: The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries

Conclusions – cont’d

for any Lp metric, using a single method

        RS        KNN       RDJ
Sel.              XXXX
I/O

Page 43: The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries

Conclusions – cont’d

for any Lp metric (Lemma 3.2), using a single method (LPL & ‘anchors’)

        RS        KNN       RDJ
Sel.    Thm 3.1   XXXX      Thm 3.2
I/O     Thm 3.3   Thm 3.4   Thm 3.5