1/23
Christian Böhm (1), Florian Krebs (2), and Hans-Peter Kriegel (2)
(1) University for Health Informatics and Technology, Innsbruck; (2) University of Munich
Optimal Dimension Order: A Generic Technique for the Similarity Join
2/23 Feature Based Similarity
3/23 Simple Similarity Queries
Specify query object and
• Find similar objects – range query
• Find the k most similar objects – nearest neighbor query
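The two query types can be sketched with a brute-force scan. This is only an illustration: it assumes objects are feature vectors compared by Euclidean distance, and the names `range_query` and `knn_query` are hypothetical, not taken from the slides.

```python
import heapq
import math

def range_query(data, query, eps):
    """Return all objects whose Euclidean distance to the query is at most eps."""
    return [x for x in data if math.dist(x, query) <= eps]

def knn_query(data, query, k):
    """Return the k objects closest to the query, nearest first."""
    return heapq.nsmallest(k, data, key=lambda x: math.dist(x, query))

# Illustrative 2-dimensional feature vectors (made-up data).
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
in_range = range_query(points, (0.0, 0.0), 1.5)
nearest = knn_query(points, (0.0, 0.0), 2)
```

A real system would answer both queries through an index rather than a full scan; the scan only fixes the semantics.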
4/23 Join Applications: Catalogue Matching
Catalogue matching
• E.g. astronomy catalogues
[Figure: catalogues R and S]
5/23 Join Applications: Clustering
Clustering (e.g. DBSCAN)
Similarity self-join
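As a reference point, a similarity self-join can be written as a nested loop over all point pairs. The sketch below assumes Euclidean distance and a join radius `eps`; the function name is hypothetical. Density-based clustering such as DBSCAN needs the eps-neighbourhood of every point, which is exactly what the pairs returned here encode.

```python
import math
from itertools import combinations

def similarity_self_join(points, eps):
    """Return index pairs (i, j), i < j, whose points lie within eps of each other."""
    return [(i, j)
            for (i, p), (j, q) in combinations(enumerate(points), 2)
            if math.dist(p, q) <= eps]

# Two tight groups of made-up points: each group yields one joined pair.
points = [(0.0, 0.0), (0.5, 0.0), (3.0, 3.0), (3.2, 3.1)]
pairs = similarity_self_join(points, 1.0)
```

The quadratic loop is the baseline that the index-based joins on the following slides try to avoid.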
6/23 R-Tree Similarity Join
Depth-first traversal of two trees
[Brinkhoff, Kriegel, Seeger: Efficient Processing of Spatial Joins Using R-trees, SIGMOD Conf. 1993]
[Figure: trees over R and S]
7/23 The ε-kdB-Tree
[Shim, Srikant, Agrawal: High-dimensional Similarity Joins, ICDE 1997]
Assumption: 2 adjacent ε-stripes fit in main memory. Unrealistic for large data sets which are ...
• clustered,
• skewed and
• high-dimensional
8/23 Epsilon Grid Order
[Böhm, Braunmüller, Krebs, Kriegel: Epsilon Grid Order. SIGMOD Conf. 2001]
9/23 Common Properties
• Decomposition of data/space into regions
• Regions described by hyper-rectangles
    for each pair (P,Q) of partitions having dist (P,Q) ≤ ε
        for each pair of points (p,q) on (P,Q)
            test dist (p,q) ≤ ε ;
Most CPU effort goes into the distance test between vectors.
Idea: speed up the distance test
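The common scheme can be sketched as follows, assuming that each partition is a list of vectors, that dist(P,Q) is the minimum Euclidean distance between the two bounding hyper-rectangles, and that partitions are joined by plain nested loops. All function names are hypothetical.

```python
import math

def bounding_box(points):
    """Per-dimension (low, high) bounds of a non-empty list of vectors."""
    return [(min(c), max(c)) for c in zip(*points)]

def box_distance(box_p, box_q):
    """Minimum Euclidean distance between two hyper-rectangles."""
    total = 0.0
    for (p_lo, p_hi), (q_lo, q_hi) in zip(box_p, box_q):
        gap = max(q_lo - p_hi, p_lo - q_hi, 0.0)  # 0 if the projections overlap
        total += gap * gap
    return math.sqrt(total)

def partition_join(partitions_r, partitions_s, eps):
    """Join two partitioned point sets, skipping partition pairs farther apart than eps."""
    result = []
    for P in partitions_r:
        box_p = bounding_box(P)
        for Q in partitions_s:
            if box_distance(box_p, bounding_box(Q)) > eps:
                continue  # no point pair of (P, Q) can be within eps
            for p in P:
                for q in Q:
                    if math.dist(p, q) <= eps:  # the CPU-intensive step
                        result.append((p, q))
    return result

# Made-up partitions: only the first pair of partitions survives the box filter.
parts_r = [[(0.0, 0.0), (1.0, 1.0)], [(10.0, 10.0)]]
parts_s = [[(1.5, 1.0)], [(20.0, 20.0)]]
joined = partition_join(parts_r, parts_s, 1.0)
```

The innermost distance test is the part that the rest of the talk sets out to accelerate.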
10/23 Related Work: Plane Sweep for Polygons
[Shin, Moon, Lee: Adaptive Multi-Stage Distance Join Processing, SIGMOD Conf. 2000]
Observations:
• More efficient to use the x-axis as sweep direction.
• Projecting the polygons onto the y-axis yields high overlap
• Decide by projections of the bounding boxes
(integrate a pdf)
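11/23 Feature Vectors in the Similarity Join
Distance computation between feature vectors p, q:
    dist2 = 0 ;
    for (i = 0 ; i < d ; i++) {
        dist2 = dist2 + (p[i] - q[i]) * (p[i] - q[i]) ;
        if (dist2 > eps * eps)   /* eps is the join distance ε */
            break ;              /* the pair cannot be within ε */
    }
Order dimensions by mating probability (increasing)
[Figure: axes d0 and d1]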
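The effect of the dimension order on this loop can be made visible by counting how many dimensions are evaluated before the test terminates. The sketch below is a Python rendering of the loop with a hypothetical `order` parameter listing the dimension indices in the sequence in which they should be visited.

```python
def within_eps(p, q, eps, order):
    """Test dist(p, q) <= eps, visiting dimensions in the given order.

    Returns (result, number of dimensions evaluated).
    """
    limit = eps * eps
    dist2 = 0.0
    for evaluated, i in enumerate(order, start=1):
        diff = p[i] - q[i]
        dist2 += diff * diff
        if dist2 > limit:
            return False, evaluated  # early termination
    return True, len(order)

# Made-up vectors that differ strongly only in dimension 2.
p = (0.1, 0.2, 0.0)
q = (0.2, 0.1, 5.0)
natural = within_eps(p, q, 1.0, [0, 1, 2])    # excluded after 3 dimensions
reordered = within_eps(p, q, 1.0, [2, 0, 1])  # excluded after 1 dimension
```

Both calls reject the pair, but visiting the discriminating dimension first saves two of the three per-dimension computations.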
12/23 Computation of the Mating Probability
To determine mating probability for di:
• Project bounding boxes on di-axis
[Figure: axes d0 and d1]
13/23 Computation of the Mating Probability
To determine mating probability for di:
• Project bounding boxes on di-axis
• Consider two projections in 2-dimensional space
[Figure: projections of the bounding boxes on the d0-axis]
14/23 Computation of the Mating Probability
To determine mating probability for di:
• Project bounding boxes on di-axis
• Consider two projections in 2-dimensional space
The d0-projection of each point pair is located in this event space
[Figure: event space with axes d0[P] and d0[Q]]
15/23 Computation of the Mating Probability
To determine mating probability for di:
• Project bounding boxes on di-axis
• Consider two projections in 2-dimensional space
The d0-projection of each point pair is located in this event space
Mating point pairs lie on the ε-stripe: y ≥ x − ε and y ≤ x + ε
[Figure: event space with axes d0[P] and d0[Q], crossed by the ε-stripe]
16/23 Computation of the Mating Probability
To determine mating probability for di:
• Project bounding boxes on di-axis
• Consider two projections in 2-dimensional space
Mating probability for d0
[Figure: event space with axes d0[P] and d0[Q]]
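One way to turn this construction into a number is to take the share of the event space (the rectangle spanned by the two projections) that lies on the ε-stripe. The sketch below assumes that point pairs are uniformly distributed over the rectangle, which the slides do not state explicitly, and that both projections have positive extent. The function names are hypothetical. The stripe area is obtained by integrating, over x, the length of the stripe's y-interval inside the rectangle; that length is piecewise linear in x, so the trapezoid rule between its breakpoints is exact.

```python
def stripe_length(x, q_lo, q_hi, eps):
    """Length of the y-interval [q_lo, q_hi] that satisfies |x - y| <= eps."""
    return max(0.0, min(q_hi, x + eps) - max(q_lo, x - eps))

def mating_probability(p_lo, p_hi, q_lo, q_hi, eps):
    """Share of the rectangle [p_lo, p_hi] x [q_lo, q_hi] covered by the eps-stripe.

    Assumes uniformly distributed point pairs (an assumption of this sketch).
    """
    if p_hi <= p_lo or q_hi <= q_lo:
        raise ValueError("both projections must have positive extent")
    # Breakpoints where stripe_length changes slope, restricted to (p_lo, p_hi).
    cuts = {p_lo, p_hi}
    for c in (q_lo - eps, q_lo + eps, q_hi - eps, q_hi + eps):
        if p_lo < c < p_hi:
            cuts.add(c)
    xs = sorted(cuts)
    area = 0.0
    for x0, x1 in zip(xs, xs[1:]):
        y0 = stripe_length(x0, q_lo, q_hi, eps)
        y1 = stripe_length(x1, q_lo, q_hi, eps)
        area += 0.5 * (y0 + y1) * (x1 - x0)  # exact on a linear segment
    return area / ((p_hi - p_lo) * (q_hi - q_lo))

# Unit square with eps = 0.5: two corner triangles of area 0.125 are cut off.
prob = mating_probability(0.0, 1.0, 0.0, 1.0, 0.5)  # 0.75
```

A small value means that the dimension rejects most point pairs on its own, which is why such dimensions are worth testing first.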
17/23 Optimal Dimension Order
For a given pair (P,Q) of partitions, the optimal dimension order (ODO) is the sequence of dimensions with increasing mating probability
Algorithm:
    for each pair (P,Q) of partitions having dist (P,Q) ≤ ε
        determine ODO ;
        for each pair of points (p,q) on (P,Q)
            test dist (p,q) ≤ ε using ODO ;
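The inner part of the algorithm, for one partition pair, might look as follows. The sketch assumes that the mating probability of every dimension has already been computed for the pair (P,Q) and is passed in as a list; the function names are hypothetical.

```python
def optimal_dimension_order(mating_probs):
    """Dimension indices sorted by increasing mating probability."""
    return sorted(range(len(mating_probs)), key=lambda i: mating_probs[i])

def join_partition_pair(P, Q, eps, mating_probs):
    """Join two partitions, testing dimensions in optimal dimension order."""
    order = optimal_dimension_order(mating_probs)  # determined once per (P, Q)
    limit = eps * eps
    result = []
    for p in P:
        for q in Q:
            dist2 = 0.0
            for i in order:
                diff = p[i] - q[i]
                dist2 += diff * diff
                if dist2 > limit:
                    break  # pair excluded, remaining dimensions skipped
            else:
                result.append((p, q))  # loop ran to the end: dist(p, q) <= eps
    return result

# Made-up probabilities: dimension 1 is the most selective, dimension 0 the least.
order = optimal_dimension_order([0.9, 0.1, 0.5])
P = [(0.0, 0.0, 0.0)]
Q = [(0.1, 0.1, 0.1), (0.0, 3.0, 0.0)]
matches = join_partition_pair(P, Q, 0.5, [0.9, 0.1, 0.5])
```

Because the order is computed once per partition pair and reused for every point pair, its cost is amortised over all distance tests of that pair.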
18/23 Shape of the Intersection Area
20 different shapes are possible, e.g. 1223, 2233, 2223
Easy proof of completeness and efficient case distinction by assigning codes to the corners
• 1: Corner is left or above the ε-stripe
• 2: Corner is on the ε-stripe
• 3: Corner is right or below the ε-stripe
Easy formulas (only 45° and 90° angles)
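The corner codes can be illustrated with a small sketch. Several details are assumptions of this sketch rather than statements from the slides: x is the horizontal axis and y the vertical axis of the event space, the stripe boundaries count as "on the stripe", and the four corners are listed in the order upper-left, lower-left, upper-right, lower-right. Under these assumptions, moving right or down can never decrease a corner's code, and enumerating the codes that respect this constraint gives a count that agrees with the 20 shapes named above. The sketch does not check that every such code is geometrically realisable.

```python
from itertools import product

def corner_code(x, y, eps):
    """1: left of or above the stripe, 2: on the stripe, 3: right of or below it."""
    if y > x + eps:
        return 1
    if y < x - eps:
        return 3
    return 2

def shape_code(p_lo, p_hi, q_lo, q_hi, eps):
    """Code of the rectangle [p_lo, p_hi] x [q_lo, q_hi] (assumed corner order)."""
    corners = [(p_lo, q_hi), (p_lo, q_lo), (p_hi, q_hi), (p_hi, q_lo)]
    return "".join(str(corner_code(x, y, eps)) for x, y in corners)

def monotone_codes():
    """All codes in which moving right or down never decreases the corner code."""
    return ["%d%d%d%d" % (ul, ll, ur, lr)
            for ul, ll, ur, lr in product((1, 2, 3), repeat=4)
            if ul <= ll <= lr and ul <= ur <= lr]

code = shape_code(0.0, 4.0, 0.0, 4.0, 1.0)  # "1223"
codes = monotone_codes()                    # 20 entries
```

With a code in hand, the intersection area can be computed by a case distinction on the code, one closed formula per shape.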
19/23 Experimental Evaluation: R-tree Sim. Join
[Bar chart, axis from 0% to 700%. Bars: base technique, ODO-algorithm, SDO dimension 1 to SDO dimension 8]
8-dimensional data, uniformly distributed
20/23 Experimental Evaluation: R-tree Sim. Join
16-dimensional data, from CAD-similarity search
21/23 Experimental Evaluation: Scalability
[Figure panels: MuX, uniform data; Z-RSJ, uniform data]
22/23 Experimental Evaluation: Scalability
EGO, CAD data
23/23 Conclusion
Conclusion:
• Similarity join is an important database primitive for knowledge discovery in databases
• Many different basic algorithms
• Most of them can be accelerated by our optimal dimension order
Future Work:
• New applications of the similarity join
• Further optimization (multi-parameter) of the similarity join
• Parallel and distributed environments