Pivoting M-tree: A Metric Access Method for Efficient Similarity Search
Tomáš Skopal, tomas.skopal@vsb.cz
Department of Computer Science, VŠB-Technical University of Ostrava
DATESO 2004
Presentation Outline
Similarity search in Metric Spaces
M-tree
PM-tree
  structure
  range queries
  hyper-ring storage
Experimental Results
Similarity search in Metric Spaces
Similarity search: methods for content-based retrieval in multimedia databases (or, respectively, in Information Retrieval)
Similarity is modelled by a metric d (a distance function that is reflexive, non-negative, symmetric, and satisfies the triangular inequality)
Restricting similarity to a metric yields a paradigmatic discrepancy with several similarity theories; nevertheless, the triangular inequality is the basic tool for constructing metric regions, which leads to efficient similarity search
Metric queries
  range query (specified by a query object Q and a query radius rQ)
  k-NN query (specified by a query object Q and the number of nearest neighbours k)
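The two query types can be illustrated with a plain sequential scan, which is the baseline that any metric access method tries to beat. This is only a minimal sketch: the function names and the sample points are made up for illustration, and the L2 distance stands in for an arbitrary metric d.

```python
import heapq
import math

def l2(a, b):
    """Euclidean (L2) distance, used here as one example of a metric d."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def range_query(dataset, q, r_q, d=l2):
    """Return every object O with d(Q, O) <= rQ (one distance per object)."""
    return [o for o in dataset if d(q, o) <= r_q]

def knn_query(dataset, q, k, d=l2):
    """Return the k objects closest to Q, nearest first."""
    return heapq.nsmallest(k, dataset, key=lambda o: d(q, o))

# Illustrative data only.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 4.0)]
in_range = range_query(points, (0.0, 0.0), 1.5)
nearest = knn_query(points, (0.0, 0.0), 2)
```

Both functions compute one distance per stored object; the structures discussed on the following slides aim to avoid most of those computations.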
Metric Access Methods
Designed to search metric datasets while keeping the search costs (the number of distance computations) minimal.
When searching large multimedia databases, the I/O search costs also have to be minimized.
Many MAMs have been developed so far: M-tree, GH-tree, GNAT, LAESA, D-index, VP-tree, MVP-tree, SAT, ...
The majority of MAMs are not suitable for similarity search in large datasets (they are either static methods or incur high I/O search costs); only the M-tree and (recently) the D-index are suitable candidates
M-tree
dynamic, balanced, and paged metric tree (like e.g. the B+-tree or R-tree)
the leaves are clusters of objects
routing entries in the inner nodes represent metric regions, recursively bounding the object clusters in the leaves
during query evaluation, the triangular inequality allows irrelevant M-tree branches (i.e. metric regions) to be discarded
[Figure: M-tree range query (Euclidean 2D space)]
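The discarding of irrelevant branches can be sketched as follows. By the triangular inequality, a subtree with routing object O and covering radius r cannot contain a result if d(Q, O) > rQ + r. The `Node` and `RoutingEntry` classes below are hypothetical, heavily simplified stand-ins for M-tree nodes (no paging, no balancing, no stored parent distances); they only show the pruning test.

```python
import math
from dataclasses import dataclass, field

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

@dataclass
class Node:
    # Simplified: a leaf fills `objects`, an inner node fills `entries`.
    objects: list = field(default_factory=list)
    entries: list = field(default_factory=list)

@dataclass
class RoutingEntry:
    center: tuple   # routing object O
    radius: float   # covering radius r of the subtree
    child: Node

def range_search(node, q, r_q, d=l2, stats=None):
    """Collect objects within r_q of q, skipping subtrees that cannot match."""
    result = [o for o in node.objects if d(q, o) <= r_q]
    for e in node.entries:
        # The query ball and the covering ball overlap iff d(Q, O) <= rQ + r.
        if d(q, e.center) <= r_q + e.radius:
            result.extend(range_search(e.child, q, r_q, d, stats))
        elif stats is not None:
            stats["pruned"] = stats.get("pruned", 0) + 1
    return result

# Illustrative two-leaf tree.
left = Node(objects=[(0.0, 0.0), (1.0, 1.0)])
right = Node(objects=[(10.0, 10.0), (11.0, 10.0)])
root = Node(entries=[
    RoutingEntry((0.0, 0.0), 1.5, left),
    RoutingEntry((10.0, 10.0), 1.0, right),
])
stats = {}
hits = range_search(root, (0.5, 0.5), 1.0, stats=stats)
```

In this example the right subtree is skipped after a single distance computation to its routing object, so none of its objects are compared with the query.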
PM-tree, motivation
metric regions in the M-tree are unnecessarily large
indexing of large portions of empty space (the “dead” space)
higher probability of intersection with the query region
less efficient search
reducing the “volume” of a metric region should lead to more effective discarding of irrelevant subtrees
the way to achieve this is to specify a metric region that bounds all the objects more “tightly”
PM-tree, structure
Pivoting M-tree (PM-tree): a combination of the M-tree with pivot-based methods (LAESA-like)
given a fixed set of p pivots Pi (selected from the dataset), a PM-tree region is additionally defined by p hyper-ring regions (Pi, HR[i])
each routing entry contains an array HR of p intervals <HR[i].min, HR[i].max>
each interval HR[i] bounds the distances of objects to the respective pivot Pi
the intersection of the hyper-sphere and the hyper-rings forms a smaller region bounding all the objects
the more pivots, the more tightly bounded the region
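A minimal sketch of how the HR array of one region could be derived from the objects it bounds: for each pivot Pi, take the smallest and largest distance from any object to Pi. The function name `hyper_rings` and the sample data are assumptions made for illustration, not part of the original presentation.

```python
import math

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hyper_rings(objects, pivots, d=l2):
    """Return HR as a list of (min, max) distance intervals, one per pivot."""
    rings = []
    for p in pivots:
        dists = [d(o, p) for o in objects]
        rings.append((min(dists), max(dists)))
    return rings

# Illustrative data: two objects, two pivots.
objects = [(3.0, 4.0), (6.0, 8.0)]
pivots = [(0.0, 0.0), (6.0, 8.0)]
rings = hyper_rings(objects, pivots)
```

Every object of the region lies inside each ring by construction, so intersecting the rings with the covering hyper-sphere can only shrink the region, never lose an object.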
PM-tree, query processing
prior to processing a query (Q, rQ), the distances d(Q, Pi) for all i ≤ p must be computed
a metric region is relevant to a range query only if all the hyper-rings and the hyper-sphere intersect the query region
the more hyper-rings, the lower the probability of intersection with the query
no additional distance computations are needed for the intersection test
[Figure: a PM-tree region and an M-tree region, each shown with a query]
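The relevance test can be sketched as below. A query ball (Q, rQ) overlaps the hyper-ring (Pi, HR[i]) when d(Q, Pi) - rQ <= HR[i].max and d(Q, Pi) + rQ >= HR[i].min, and it overlaps the hyper-sphere when d(Q, O) <= rQ + r. The function names and the sample region are hypothetical; checking the rings before the sphere is a choice made in this sketch, since the ring test only reuses the precomputed d(Q, Pi).

```python
import math

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rings_intersect(q_pivot_dists, r_q, rings):
    """True if the query ball overlaps every hyper-ring (Pi, HR[i])."""
    return all(
        dq - r_q <= hi and dq + r_q >= lo
        for dq, (lo, hi) in zip(q_pivot_dists, rings)
    )

def is_relevant(q, r_q, center, radius, rings, q_pivot_dists, d=l2):
    # Ring test: uses only the distances d(Q, Pi) computed once per query.
    if not rings_intersect(q_pivot_dists, r_q, rings):
        return False
    # Sphere test, the same condition a plain M-tree would apply.
    return d(q, center) <= r_q + radius

# Illustrative region: routing object, covering radius, and HR for two pivots.
pivots = [(0.0, 0.0), (6.0, 8.0)]
center, radius = (4.5, 6.0), 3.0
rings = [(5.0, 10.0), (0.0, 5.0)]

def query(q, r_q):
    q_pivot_dists = [l2(q, p) for p in pivots]  # computed before the search starts
    return is_relevant(q, r_q, center, radius, rings, q_pivot_dists)

near_origin = query((2.5, 3.0), 1.0)  # overlaps the sphere, misses the first ring
inside = query((4.0, 5.0), 1.0)       # overlaps the sphere and both rings
```

The first query would be accepted by the hyper-sphere test alone, so a plain M-tree would descend into the subtree; the hyper-ring around the first pivot rules it out.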
PM-tree, hyper-ring storage
The routing entries of PM-tree nodes are enlarged by the additional pivot-based information stored in HR arrays
To keep the space overhead minimal, a compact storage of the HR[i] intervals is necessary
A distance histogram for each pivot Pi is created, and an interval <d_i^min, d_i^max> is chosen such that e.g. 90% of the distances in the distance histogram fall into that interval
Each value HR[i].min, HR[i].max is scaled to the <d_i^min, d_i^max> interval using a single byte, i.e. each hyper-ring HR[i] takes 2 bytes
Storage of the HR array in a routing entry: Oi, r, ptr(T), ... HR[1], HR[2], ..., HR[p]
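One way the one-byte scaling might look is sketched below. The slide only states that each bound is scaled to the interval and stored in a single byte; the rounding direction (minimum rounded down, maximum rounded up) and the treatment of codes 0 and 255 as open-ended bounds are assumptions of this sketch, chosen so that a decoded ring never becomes narrower than the exact one. The function names are hypothetical, and the interval bounds `d_min` and `d_max` are taken as given.

```python
import math

def _clamp_byte(code):
    return max(0, min(255, code))

def encode_ring(lo, hi, d_min, d_max):
    """Quantize HR[i] = <lo, hi> to 2 bytes relative to <d_min, d_max>."""
    scale = 255.0 / (d_max - d_min)
    lo_code = math.floor((lo - d_min) * scale)  # assumption: round the minimum down
    hi_code = math.ceil((hi - d_min) * scale)   # assumption: round the maximum up
    return bytes([_clamp_byte(lo_code), _clamp_byte(hi_code)])

def decode_ring(codes, d_min, d_max):
    """Recover a ring that contains the original one."""
    step = (d_max - d_min) / 255.0
    lo_code, hi_code = codes
    # Assumption: extreme codes may stand for values outside <d_min, d_max>,
    # so they decode to the widest possible bounds.
    lo = 0.0 if lo_code == 0 else d_min + lo_code * step
    hi = math.inf if hi_code == 255 else d_min + hi_code * step
    return lo, hi

packed = encode_ring(5.3, 9.1, 0.0, 255.0)
unpacked = decode_ring(packed, 0.0, 255.0)
outlier = decode_ring(encode_ring(3.0, 300.0, 10.0, 265.0), 10.0, 265.0)
```

With p pivots, the HR array of a routing entry then occupies 2p bytes, at the price of slightly wider rings and therefore slightly less pruning.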
Experimental results (synthetic)
synthetic dataset of 100,000 30-dimensional tuples distributed within 1000 clusters, L2 distance, query selectivity 50 objects
Experimental results (images)
collection of 10,000 images represented by 256-dimensional vectors (gray histograms), L2 distance, query selectivity 50 objects
Recent results (not included in proceedings)
Cost models for range queries in PM-tree (ADBIS '04)
Experiments on image dataset (ADBIS '04)
Optimal k-NN query algorithm for PM-tree + cost models (to be published...)
References
[1] Skopal T., Pokorný J., Snášel V.: PM-tree: Pivoting Metric Tree for Similarity Search in Multimedia Databases, submitted to ADBIS 2004, Budapest, Hungary
[2] Skopal T.: Pivoting M-tree: A Metric Access Method for Efficient Similarity Search, DATESO 2004, Desná
[3] Skopal T., Pokorný J., Krátký M., Snášel V.: Revisiting M-tree Building Principles. ADBIS 2003, LNCS 2798, Springer-Verlag, Dresden, Germany