
Post on 18-Jan-2018



Christian Böhm¹, Florian Krebs², and Hans-Peter Kriegel²

¹ University for Health Informatics and Technology, Innsbruck
² University of Munich

Optimal Dimension Order: A Generic Technique for the Similarity Join

Feature-Based Similarity

Simple Similarity Queries

Specify a query object and
• Find similar objects – range query
• Find the k most similar objects – nearest neighbor query
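The two query types can be illustrated with a minimal sketch. It assumes Euclidean distance on feature vectors and a plain linear scan without any index; the function names `range_query` and `knn_query` are hypothetical and not taken from the slides.

```python
import heapq
import math


def dist(p, q):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def range_query(data, query, eps):
    """Return all objects whose distance to the query is at most eps."""
    return [p for p in data if dist(p, query) <= eps]


def knn_query(data, query, k):
    """Return the k objects closest to the query, nearest first."""
    return heapq.nsmallest(k, data, key=lambda p: dist(p, query))


# Illustrative data only (assumption, not from the slides).
data = [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0), (0.5, 0.5)]
in_range = range_query(data, (0.0, 0.0), 1.0)
nearest = knn_query(data, (0.0, 0.0), 2)
```

A similarity join can be read as running such a range query for every object of one set against the other set, which is why the later slides focus on making the distance test cheap.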

Join Applications: Catalogue Matching

Catalogue matching
• E.g. astronomy catalogues

[Figure: the two catalogues R and S]

Join Applications: Clustering

Clustering (e.g. DBSCAN)

Similarity self-join

R-Tree Similarity Join

Depth-first traversal of two trees
[Brinkhoff, Kriegel, Seeger: Efficient Processing of Spatial Joins Using R-trees, SIGMOD Conf. 1993]

[Figure: the two R-trees R and S]

The ε-kdB-Tree

[Shim, Srikant, Agrawal: High-dimensional Similarity Joins, ICDE 1997]

Assumption: 2 adjacent ε-stripes fit in main memory. Unrealistic for large data sets which are ...
• clustered,
• skewed and
• high-dimensional

Epsilon Grid Order

[Böhm, Braunmüller, Krebs, Kriegel: Epsilon Grid Order. SIGMOD Conf. 2001]

Common Properties

Decomposition of data/space into regions
Regions described by hyper-rectangles

for each pair (P,Q) of partitions having dist(P,Q) ≤ ε
    for each pair of points (p,q) on (P,Q)
        test dist(p,q) ≤ ε ;

Most CPU effort is spent in the distance test between vectors.
Idea: speed up the distance test
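The common loop structure can be sketched as follows. This is a simplified illustration, not any of the cited algorithms: partitions are assumed to be plain lists of points, the hyper-rectangles are computed on the fly, and all function names are hypothetical.

```python
import math
from itertools import product


def mbr(points):
    """Minimum bounding hyper-rectangle as (lower corner, upper corner)."""
    dims = range(len(points[0]))
    lo = tuple(min(p[i] for p in points) for i in dims)
    hi = tuple(max(p[i] for p in points) for i in dims)
    return lo, hi


def mbr_dist(box_p, box_q):
    """Smallest possible Euclidean distance between points of two boxes."""
    (lo_p, hi_p), (lo_q, hi_q) = box_p, box_q
    gaps = (max(lq - hp, lp - hq, 0.0)
            for lp, hp, lq, hq in zip(lo_p, hi_p, lo_q, hi_q))
    return math.sqrt(sum(g * g for g in gaps))


def point_dist(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def similarity_join(parts_r, parts_s, eps):
    """Join two lists of partitions (each partition is a list of points)."""
    result = []
    for part_p, part_q in product(parts_r, parts_s):
        if mbr_dist(mbr(part_p), mbr(part_q)) > eps:
            continue  # partition pair cannot contain a matching point pair
        for p, q in product(part_p, part_q):
            if point_dist(p, q) <= eps:  # the CPU-intensive distance test
                result.append((p, q))
    return result


# Illustrative data only (assumption, not from the slides).
parts_r = [[(0.0, 0.0), (0.2, 0.1)], [(5.0, 5.0)]]
parts_s = [[(0.1, 0.0)], [(5.0, 5.3), (9.0, 9.0)]]
pairs = similarity_join(parts_r, parts_s, 0.5)
```

Of the four partition pairs in this example, two are discarded by the hyper-rectangle test; the inner point-pair loop is where the distance test dominates the cost.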

Related Work: Plane Sweep for Polygons

[Shin, Moon, Lee: Adaptive Multi-Stage Distance Join Processing, SIGMOD Conf. 2000]

Observations:
• More efficient to use the x-axis as sweep direction.
• Projection of the polygons to the y-axis yields high overlap
• Decide by projections of the bounding boxes (integrate a pdf)

Feature Vectors in the Similarity Join

Distance computation between feature vectors p, q:

for (i = 0; i < d; i++) {
    dist2 = dist2 + (p[i] - q[i]) * (p[i] - q[i]);
    if (dist2 > eps * eps)   /* eps is the join distance ε */
        break;
}

Order dimensions by mating probability (increasing)

[Figure: axes d0 and d1]
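To make the effect of the dimension order on the early-terminating distance loop concrete, here is a small sketch. The function `within_eps` and its `order` parameter are hypothetical names; the counter of evaluated dimensions is added only to show how much work the loop does.

```python
def within_eps(p, q, eps, order=None):
    """Return (is_match, dimensions_evaluated) for the test dist(p, q) <= eps.

    `order` is an optional sequence of dimension indices; by default the
    dimensions are visited in their natural order 0, 1, ..., d-1.
    """
    if order is None:
        order = range(len(p))
    eps2 = eps * eps
    dist2 = 0.0
    evaluated = 0
    for i in order:
        diff = p[i] - q[i]
        dist2 += diff * diff
        evaluated += 1
        if dist2 > eps2:
            return False, evaluated  # early termination
    return True, evaluated


# Illustrative vectors only: they differ strongly in dimension 2.
p = (0.0, 0.0, 0.0)
q = (0.1, 0.1, 5.0)
natural = within_eps(p, q, 1.0)
reordered = within_eps(p, q, 1.0, order=(2, 0, 1))
```

With the natural order the loop touches all three dimensions before it can reject the pair; visiting the most discriminating dimension first rejects it after a single step.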

Computation of the Mating Probability

To determine the mating probability for dimension di:
• Project the bounding boxes on the di-axis
• Consider the two projections in 2-dimensional space

The d0-projection of each point pair is located in this event space, spanned by d0[P] and d0[Q]. Mating point pairs lie on the ε-stripe x − ε ≤ y ≤ x + ε.

[Figure: event space over the axes d0[P] and d0[Q]; the part covered by the ε-stripe is labelled "Mating probability for d0"]
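One way to write the construction above as a formula is sketched here. It rests on assumptions that the slides do not state explicitly: the projection of P on the di-axis is the interval [a, b], that of Q is [c, d] (with b > a and d > c), and the points are uniformly and independently distributed inside these intervals. The symbol W_i is a hypothetical name for the mating probability of dimension di.

```latex
% Assumption: uniform, independent distribution on [a,b] and [c,d].
% The mating probability is the share of the event space [a,b] x [c,d]
% that is covered by the epsilon-stripe |x - y| <= epsilon.
\[
  W_i(P,Q)
  = \frac{1}{(b-a)(d-c)}
    \int_{a}^{b}
      \max\bigl(0,\; \min(d,\, x+\varepsilon) - \max(c,\, x-\varepsilon)\bigr)\, dx
\]
% Worked example: a = c = 0, b = d = 1, epsilon = 0.5.
% The two corner triangles outside the stripe have total area (1 - epsilon)^2.
\[
  W_i = 1 - (1-\varepsilon)^2 = 1 - 0.25 = 0.75
\]
```

Under these assumptions the integrand is piecewise linear, so the integral can be evaluated in closed form by a case distinction on how the stripe cuts the rectangle.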

Optimal Dimension Order

For a given pair (P,Q) of partitions, the optimal dimension order (ODO) is the sequence of dimensions with increasing mating probability.

Algorithm:
for each pair (P,Q) of partitions having dist(P,Q) ≤ ε
    determine ODO ;
    for each pair of points (p,q) on (P,Q)
        test dist(p,q) ≤ ε using ODO ;
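The algorithm can be sketched for a single partition pair as follows. This is an illustration under assumptions, not the authors' implementation: points are taken to be uniformly distributed inside their hyper-rectangles, every projection has positive extent, and the mating probability is computed as the area share of the ε-stripe in the event space. All function names are hypothetical.

```python
def _h(z):
    """Second antiderivative of the unit step: max(z, 0)^2 / 2."""
    return 0.5 * z * z if z > 0 else 0.0


def _area_below(t, a, b, c, d):
    """Area of {(x, y) in [a, b] x [c, d] : y <= x + t}."""
    return _h(b + t - c) - _h(a + t - c) - _h(b + t - d) + _h(a + t - d)


def mating_probability(a, b, c, d, eps):
    """Share of the event space [a, b] x [c, d] covered by |x - y| <= eps.

    Assumes b > a and d > c (both projections have positive extent).
    """
    stripe = _area_below(eps, a, b, c, d) - _area_below(-eps, a, b, c, d)
    return stripe / ((b - a) * (d - c))


def optimal_dimension_order(box_p, box_q, eps):
    """Dimension indices sorted by increasing mating probability."""
    (lo_p, hi_p), (lo_q, hi_q) = box_p, box_q
    probs = [mating_probability(lo_p[i], hi_p[i], lo_q[i], hi_q[i], eps)
             for i in range(len(lo_p))]
    return sorted(range(len(probs)), key=probs.__getitem__)


def join_partition_pair(part_p, part_q, box_p, box_q, eps):
    """Test all point pairs of (P, Q); the order is determined once per pair."""
    order = optimal_dimension_order(box_p, box_q, eps)
    eps2 = eps * eps
    result = []
    for p in part_p:
        for q in part_q:
            dist2 = 0.0
            for i in order:
                dist2 += (p[i] - q[i]) ** 2
                if dist2 > eps2:
                    break  # early termination, ideally in the first dimension
            else:
                result.append((p, q))
    return result


# Illustrative data only: the boxes overlap fully in dimension 0 and
# are separated by a gap of 0.2 in dimension 1.
box_p = ((0.0, 0.0), (1.0, 1.0))
box_q = ((0.0, 1.2), (1.0, 2.2))
part_p = [(0.5, 0.9), (0.2, 0.1)]
part_q = [(0.6, 1.3), (0.9, 2.0)]
order = optimal_dimension_order(box_p, box_q, 0.5)
matches = join_partition_pair(part_p, part_q, box_p, box_q, 0.5)
```

In this example dimension 1 has the lower mating probability and is tested first, so the three non-matching pairs are rejected after one dimension each. The order is computed once per partition pair and reused for all of its point pairs, which keeps the overhead small compared with the distance tests it shortens.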

Shape of the Intersection Area

20 different shapes are possible, e.g. 1223, 2233, 2223

Easy proof of completeness and efficient case distinction by assigning codes to the corners:
• 1: Corner is left of or above the ε-stripe
• 2: Corner is on the ε-stripe
• 3: Corner is right of or below the ε-stripe

Easy formulas (only 45° and 90° angles)
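The corner coding can be sketched as below. Several details are assumptions, since the slides do not spell them out: the event space is the rectangle [a, b] x [c, d] with x on the horizontal axis, the ε-stripe is x − ε ≤ y ≤ x + ε, and the four digits are listed in the order upper left, lower left, upper right, lower right. The function names are hypothetical.

```python
from itertools import product


def corner_code(x, y, eps):
    """Code of a corner (x, y) relative to the stripe x - eps <= y <= x + eps."""
    if y > x + eps:
        return 1  # left of or above the stripe
    if y < x - eps:
        return 3  # right of or below the stripe
    return 2      # on the stripe


def shape_code(a, b, c, d, eps):
    """Four-digit code of the event space [a, b] x [c, d].

    Corner order (an assumption): upper left, lower left,
    upper right, lower right.
    """
    corners = ((a, d), (a, c), (b, d), (b, c))
    return "".join(str(corner_code(x, y, eps)) for x, y in corners)


# The upper left corner has the smallest x - y and the lower right corner
# the largest, so the first digit cannot exceed the two middle digits and
# these cannot exceed the last digit. Count the codes that satisfy this.
ordered_codes = [code for code in product((1, 2, 3), repeat=4)
                 if code[0] <= min(code[1], code[2])
                 and max(code[1], code[2]) <= code[3]]

example_1223 = shape_code(0, 2, 0, 2, 1)
example_2223 = shape_code(0, 2, 0, 1, 1)
```

Under this corner order exactly 20 codes satisfy the ordering constraint, which agrees with the number of shapes stated above; whether this is the argument the authors use for completeness is not confirmed by the slides.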

Experimental Evaluation: R-tree Sim. Join

[Chart: 8-dimensional data, uniformly distributed. Axis from 0% to 700%; bars for base technique, ODO-algorithm, and SDO dimension 1 to SDO dimension 8]

Experimental Evaluation: R-tree Sim. Join

16-dimensional data, from CAD-similarity search

Experimental Evaluation: Scalability

MuX, uniform data
Z-RSJ, uniform data

Experimental Evaluation: Scalability

EGO, CAD data

Conclusion

Conclusion:
• Similarity join is an important database primitive for knowledge discovery in databases
• Many different basic algorithms
• Most can be accelerated by our optimal dimension order

Future Work:
• New applications of the similarity join
• Further optimization (multi-parameter) of the sim. join
• Parallel and distributed environments
