Optimal Dimension Order: A Generic Technique for the Similarity Join
Christian Böhm (1), Florian Krebs (2), and Hans-Peter Kriegel (2)
(1) University for Health Informatics and Technology, Innsbruck; (2) University of Munich


Page 1: Optimal Dimension Order: A Generic Technique for the Similarity Join

Christian Böhm¹, Florian Krebs², and Hans-Peter Kriegel²

¹ University for Health Informatics and Technology, Innsbruck
² University of Munich

Page 2: Feature Based Similarity

Page 3: Simple Similarity Queries

Specify a query object and
• Find similar objects – range query
• Find the k most similar objects – nearest neighbor query
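The two query types can be illustrated with a minimal C sketch. It assumes Euclidean distance, a linear scan, and points stored row by row in a flat array; the names `sq_dist`, `range_query`, and `knn_query` are hypothetical and do not come from the slides.

```c
#include <stddef.h>

/* Squared Euclidean distance between two d-dimensional feature vectors. */
static double sq_dist(const double *p, const double *q, size_t d) {
    double sum = 0.0;
    for (size_t i = 0; i < d; i++) {
        double diff = p[i] - q[i];
        sum += diff * diff;
    }
    return sum;
}

/* Range query: write the indices of all points within distance eps of
 * `query` into `out` (capacity n) and return how many were found.
 * `data` holds n points of d doubles each, stored row by row. */
size_t range_query(const double *data, size_t n, size_t d,
                   const double *query, double eps, size_t *out) {
    size_t found = 0;
    for (size_t i = 0; i < n; i++)
        if (sq_dist(data + i * d, query, d) <= eps * eps)
            out[found++] = i;
    return found;
}

/* k-nearest-neighbor query: fill `out` with the indices of the k closest
 * points in order of increasing distance and return min(k, n).
 * `best` is caller-provided scratch space for k squared distances. */
size_t knn_query(const double *data, size_t n, size_t d,
                 const double *query, size_t k,
                 size_t *out, double *best) {
    size_t m = 0;                        /* results collected so far */
    for (size_t i = 0; i < n; i++) {
        double dist2 = sq_dist(data + i * d, query, d);
        if (m == k && (k == 0 || dist2 >= best[k - 1]))
            continue;                    /* not better than the current k-th */
        size_t pos = (m < k) ? m++ : k - 1;
        /* Shift worse candidates one slot to the right. */
        while (pos > 0 && best[pos - 1] > dist2) {
            best[pos] = best[pos - 1];
            out[pos] = out[pos - 1];
            pos--;
        }
        best[pos] = dist2;
        out[pos] = i;
    }
    return m;
}
```

Both functions scan every point; an index structure would avoid that, but the scan keeps the two query semantics easy to compare.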

Page 4: Join Applications: Catalogue Matching

Catalogue matching
• E.g. astronomy catalogues

[Figure: two catalogues R and S]
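As a baseline for what a similarity join between two catalogues R and S computes, here is a nested-loop sketch in C. It assumes Euclidean distance and a join radius `eps`; the type `match_t` and the function `similarity_join` are hypothetical names, and the algorithms discussed on the following slides replace the exhaustive double loop with partition-based processing.

```c
#include <stddef.h>

/* One result pair: index into R and index into S. */
typedef struct { size_t r; size_t s; } match_t;

/* Nested-loop similarity join: report every pair (r, s) whose Euclidean
 * distance is at most eps. R and S hold nr and ns points of d doubles
 * each, stored row by row. At most `cap` pairs are written to `out`;
 * the return value is the total number of matching pairs. */
size_t similarity_join(const double *R, size_t nr,
                       const double *S, size_t ns,
                       size_t d, double eps,
                       match_t *out, size_t cap) {
    size_t count = 0;
    for (size_t i = 0; i < nr; i++) {
        for (size_t j = 0; j < ns; j++) {
            double dist2 = 0.0;
            for (size_t k = 0; k < d; k++) {
                double diff = R[i * d + k] - S[j * d + k];
                dist2 += diff * diff;
            }
            if (dist2 <= eps * eps) {
                if (count < cap) {
                    out[count].r = i;
                    out[count].s = j;
                }
                count++;
            }
        }
    }
    return count;
}
```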

Page 5: Join Applications: Clustering

Clustering (e.g. DBSCAN)

Similarity self-join
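One way a similarity self-join can feed a density-based clustering method is by producing the size of every point's ε-neighborhood, which is what a DBSCAN-style core-point test needs. The sketch below assumes Euclidean distance and counts each point as its own neighbor; `count_neighbors` and `is_core_point` are hypothetical names, and `min_pts` stands for the usual density threshold.

```c
#include <stddef.h>

/* Similarity self-join that only counts: counts[i] becomes the number of
 * points within distance eps of point i, including point i itself.
 * Each unordered pair is tested once and credited to both points. */
void count_neighbors(const double *data, size_t n, size_t d,
                     double eps, size_t *counts) {
    for (size_t i = 0; i < n; i++)
        counts[i] = 1;                   /* every point neighbors itself */
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {
            double dist2 = 0.0;
            for (size_t k = 0; k < d; k++) {
                double diff = data[i * d + k] - data[j * d + k];
                dist2 += diff * diff;
            }
            if (dist2 <= eps * eps) {
                counts[i]++;
                counts[j]++;
            }
        }
    }
}

/* Core-point test in the DBSCAN sense: enough neighbors within eps. */
int is_core_point(const size_t *counts, size_t i, size_t min_pts) {
    return counts[i] >= min_pts;
}
```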

Page 6: R-Tree Similarity Join

Depth-first traversal of two trees
[Brinkhoff, Kriegel, Seeger: Efficient Processing of Spatial Joins Using R-trees, SIGMOD Conf. 1993]

[Figure: R-trees over R and S]

Page 7: The ε-kdB-Tree

[Shim, Srikant, Agrawal: High-dimensional Similarity Joins, ICDE 1997]

Assumption: 2 adjacent ε-stripes fit in main memory. Unrealistic for large data sets which are

• clustered,
• skewed and
• high-dimensional

Page 8: Epsilon Grid Order

[Böhm, Braunmüller, Krebs, Kriegel: Epsilon Grid Order. SIGMOD Conf. 2001]

Page 9: Common Properties

• Decomposition of data/space into regions
• Regions described by hyper-rectangles

    for each pair (P,Q) of partitions having dist (P,Q) ≤ ε
        for each pair of points (p,q) on (P,Q)
            test dist (p,q) ≤ ε ;

Most CPU effort goes into the distance test between vectors.
Idea: speed up the distance test
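The partition-level filter dist (P,Q) ≤ ε can be sketched as a minimum distance between two hyper-rectangles. This is a minimal illustration assuming Euclidean distance and axis-parallel boxes given by their lower and upper corners; the names `box_min_sq_dist` and `partitions_may_join` are hypothetical.

```c
#include <stddef.h>

/* Squared minimum Euclidean distance between two axis-parallel boxes
 * [p_lo, p_hi] and [q_lo, q_hi] in d dimensions. In each dimension the
 * contribution is the gap between the two intervals, or 0 if they overlap. */
double box_min_sq_dist(const double *p_lo, const double *p_hi,
                       const double *q_lo, const double *q_hi, size_t d) {
    double sum = 0.0;
    for (size_t i = 0; i < d; i++) {
        double gap = 0.0;
        if (q_lo[i] > p_hi[i])
            gap = q_lo[i] - p_hi[i];     /* Q lies entirely above P */
        else if (p_lo[i] > q_hi[i])
            gap = p_lo[i] - q_hi[i];     /* P lies entirely above Q */
        sum += gap * gap;
    }
    return sum;
}

/* Partition filter: only pairs of boxes closer than eps can contain
 * point pairs that satisfy the join condition. */
int partitions_may_join(const double *p_lo, const double *p_hi,
                        const double *q_lo, const double *q_hi,
                        size_t d, double eps) {
    return box_min_sq_dist(p_lo, p_hi, q_lo, q_hi, d) <= eps * eps;
}
```

Partition pairs that pass this filter still need the point-level distance test, which is where the slides locate most of the CPU cost.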

Page 10: Related Work: Plane Sweep for Polygons

[Shin, Moon, Lee: Adaptive Multi-Stage Distance Join Processing, SIGMOD Conf. 2000]

Observations:
• More efficient to use x-axis as sweep direction.
• Projection of polygons to y-axis yields high overlap.
• Decide by projections of the bounding boxes

(integrate a pdf)

Page 11: Feature Vectors in the Similarity Join

Distance computation between feature vectors p, q:

    dist2 = 0 ;
    for (i = 0 ; i < d ; i++) {
        dist2 = dist2 + (p[i] - q[i]) * (p[i] - q[i]) ;   /* add squared difference */
        if (dist2 > eps * eps)                             /* eps is the join radius ε */
            break ;                                        /* p and q cannot be a join pair */
    }

Order dimensions by mating probability (increasing)

[Figure: feature vectors with axes d0 and d1]
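The effect of the dimension order on the early-termination loop can be made visible with a small C sketch. It assumes squared Euclidean distance and a caller-supplied permutation `order` of the dimensions; the function name `within_eps_ordered` and the optional counter `evaluated` are hypothetical additions for illustration.

```c
#include <stddef.h>

/* Distance test with early termination. Dimensions are visited in the
 * sequence given by `order` (a permutation of 0..d-1). Returns 1 if
 * dist(p, q) <= eps and 0 otherwise. If `evaluated` is not NULL, it
 * receives the number of dimensions that were actually processed. */
int within_eps_ordered(const double *p, const double *q,
                       const size_t *order, size_t d, double eps,
                       size_t *evaluated) {
    double dist2 = 0.0;
    size_t i;
    for (i = 0; i < d; i++) {
        size_t dim = order[i];
        double diff = p[dim] - q[dim];
        dist2 += diff * diff;
        if (dist2 > eps * eps) {
            i++;                         /* count the dimension just used */
            break;                       /* the pair is already excluded */
        }
    }
    if (evaluated)
        *evaluated = i;
    return dist2 <= eps * eps;
}
```

For a non-matching pair, visiting the dimension with the largest difference first ends the loop after a single step, whereas the natural order may need all d steps to reach the same answer.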

Page 12: Computation of the Mating Probability

To determine the mating probability for di:
• Project bounding boxes on the di-axis

[Figure: bounding boxes with axes d0 and d1]

Page 13: Computation of the Mating Probability

To determine the mating probability for di:
• Project bounding boxes on the di-axis
• Consider two projections in 2-dimensional space

[Figure: projections of the bounding boxes onto the d0-axis]

Page 14: Computation of the Mating Probability

To determine the mating probability for di:
• Project bounding boxes on the di-axis
• Consider two projections in 2-dimensional space

[Figure: event space with axes d0[P] and d0[Q]; the d0-projection of each point pair is located in this event space]

Page 15: Computation of the Mating Probability

To determine the mating probability for di:
• Project bounding boxes on the di-axis
• Consider two projections in 2-dimensional space

[Figure: event space with axes d0[P] and d0[Q]; the d0-projection of each point pair is located in this event space; mating point pairs lie on the ε-stripe x − ε ≤ y ≤ x + ε]

Page 16: Computation of the Mating Probability

To determine the mating probability for di:
• Project bounding boxes on the di-axis
• Consider two projections in 2-dimensional space

[Figure: event space with axes d0[P] and d0[Q]; the part covered by the ε-stripe gives the mating probability for d0]
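Under the assumption that points are uniformly distributed within the projected intervals, the mating probability for one dimension is the fraction of the event space covered by the ε-stripe. The C sketch below computes that fraction with a closed-form integral rather than the geometric case distinction used in the talk; the interval names `[a, b]` (projection of P) and `[c, d]` (projection of Q) and all function names are hypothetical.

```c
/* Integral from lo to u of h(v) = clamp(hi - v, 0, hi - lo).
 * h(v) is the length of the part of [lo, hi] that lies at or above v. */
static double clamped_integral(double u, double lo, double hi) {
    double w = hi - lo;
    if (u <= lo)
        return w * (u - lo);             /* h is constant w left of lo */
    if (u >= hi)
        return 0.5 * w * w;              /* h is 0 right of hi */
    return w * (u - lo) - 0.5 * (u - lo) * (u - lo);
}

/* Area of {(x, y) in [a,b] x [c,d] : x - y <= t}. */
static double area_below(double t, double a, double b, double c, double d) {
    return clamped_integral(b - t, c, d) - clamped_integral(a - t, c, d);
}

/* Fraction of the event space [a,b] x [c,d] that lies on the stripe
 * |x - y| <= eps. Assumes uniformly distributed points (an assumption of
 * this sketch). For a degenerate interval the function returns 1.0 as a
 * conservative value, so such a dimension is never preferred. */
double mating_probability(double a, double b, double c, double d, double eps) {
    double total = (b - a) * (d - c);
    if (total <= 0.0)
        return 1.0;
    double stripe = area_below(eps, a, b, c, d) - area_below(-eps, a, b, c, d);
    return stripe / total;
}
```

As a sanity check, for two unit intervals [0, 1] and ε = 0.5 the stripe misses two corner triangles of area 0.125 each, so the probability is 0.75.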

Page 17: Optimal Dimension Order

For a given pair (P,Q) of partitions, the optimal dimension order (ODO) is the sequence of dimensions with increasing mating probability.

Algorithm:

    for each pair (P,Q) of partitions having dist (P,Q) ≤ ε
        determine ODO ;
        for each pair of points (p,q) on (P,Q)
            test dist (p,q) ≤ ε using ODO ;
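The step "determine ODO" amounts to sorting the dimensions by increasing mating probability. A minimal C sketch, assuming the per-dimension probabilities for the partition pair have already been computed; `dim_prob_t`, `by_increasing_prob`, and `optimal_dimension_order` are hypothetical names.

```c
#include <stddef.h>
#include <stdlib.h>

/* A dimension index together with its mating probability. */
typedef struct { size_t dim; double prob; } dim_prob_t;

/* qsort comparator: increasing probability, ties broken by index so the
 * resulting order is deterministic. */
static int by_increasing_prob(const void *x, const void *y) {
    const dim_prob_t *a = x;
    const dim_prob_t *b = y;
    if (a->prob < b->prob) return -1;
    if (a->prob > b->prob) return 1;
    return (a->dim > b->dim) - (a->dim < b->dim);
}

/* Fill `order` with the dimensions 0..d-1 sorted by increasing mating
 * probability. `prob[i]` is the mating probability of dimension i for the
 * current partition pair; `scratch` is caller-provided space for d entries. */
void optimal_dimension_order(const double *prob, size_t d,
                             dim_prob_t *scratch, size_t *order) {
    for (size_t i = 0; i < d; i++) {
        scratch[i].dim = i;
        scratch[i].prob = prob[i];
    }
    qsort(scratch, d, sizeof *scratch, by_increasing_prob);
    for (size_t i = 0; i < d; i++)
        order[i] = scratch[i].dim;
}
```

The sort runs once per partition pair, while the resulting order is reused for every point pair of that partition pair, so its cost is spread over many distance tests.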

Page 18: Shape of the Intersection Area

20 different shapes are possible, e.g. 1223, 2233, 2223

Easy proof of completeness and efficient case distinction by assigning codes to the corners:
• 1: Corner is left or above the ε-stripe
• 2: Corner is on the ε-stripe
• 3: Corner is right or below the ε-stripe

Easy formulas (only 45° and 90° angles)
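The corner codes can be sketched in C as follows. The sketch assumes that the horizontal axis x is the projection of P, the vertical axis y is the projection of Q, and that the four digits list the corners in the order upper-left, upper-right, lower-left, lower-right; the slides do not state the corner order, so this ordering and the names `corner_code` and `shape_code` are assumptions.

```c
/* Code of a single corner (x, y) relative to the stripe |x - y| <= eps:
 * 1 = left of or above the stripe, 2 = on the stripe,
 * 3 = right of or below the stripe. */
int corner_code(double x, double y, double eps) {
    if (y > x + eps)
        return 1;
    if (y < x - eps)
        return 3;
    return 2;
}

/* Four-digit shape code of the event space [a,b] x [c,d].
 * Assumed corner order: upper-left, upper-right, lower-left, lower-right. */
int shape_code(double a, double b, double c, double d, double eps) {
    return 1000 * corner_code(a, d, eps)   /* upper-left  */
         +  100 * corner_code(b, d, eps)   /* upper-right */
         +   10 * corner_code(a, c, eps)   /* lower-left  */
         +        corner_code(b, c, eps);  /* lower-right */
}
```

A case distinction on the resulting code can then select the area formula for the corresponding shape.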

Page 19: Experimental Evaluation: R-tree Sim. Join

[Figure: bar chart, vertical axis from 0% to 700%; bars for base technique, ODO-algorithm, and SDO dimension 1 through SDO dimension 8]

8-dimensional data, uniformly distributed

Page 20: Experimental Evaluation: R-tree Sim. Join

16-dimensional data, from CAD-similarity search

Page 21: Experimental Evaluation: Scalability

[Figure panels: MuX, uniform data; Z-RSJ, uniform data]

Page 22: Experimental Evaluation: Scalability

EGO, CAD data

Page 23: Conclusion

Conclusion:
• Similarity join is an important database primitive for knowledge discovery in databases
• Many different basic algorithms
• Most can be accelerated by our optimal dimension order

Future Work:
• New applications of the similarity join
• Further optimization (multi-parameter) of the sim. join
• Parallel and distributed environments