Pivoting M-tree: A Metric Access Method for Efficient Similarity Search

Tomáš Skopal

Department of Computer Science, VŠB-Technical University of Ostrava

DATESO 2004

Presentation Outline

Similarity search in Metric Spaces
M-tree
PM-tree
    structure
    range queries
    hyper-ring storage

Experimental Results

Similarity search in Metric Spaces

Similarity search: methods for content-based retrieval in multimedia databases (and, correspondingly, in Information Retrieval)

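Similarity is modelled by a metric d, i.e. a distance function that satisfies reflexivity, positivity, symmetry, and the triangular inequality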

Restricting similarity to a metric yields a paradigmatic discrepancy with several similarity theories. Nevertheless, the triangular inequality is the basic tool for constructing metric regions, which leads to an efficient similarity search

Metric queries
    range query (specified by query object Q and query radius rQ)
    k-NN query (specified by query object Q and number of nearest neighbours k)
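The two query types can be pinned down with a sequential scan, which is the baseline every metric access method tries to beat. The following is only a minimal sketch: the helper names `l2`, `range_query` and `knn_query` are illustrative and do not come from the slides, and the L2 distance merely stands in for an arbitrary metric d.

```python
import heapq
import math


def l2(x, y):
    """Euclidean (L2) distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def range_query(data, d, q, r_q):
    """Return all objects O with d(Q, O) <= rQ (sequential scan)."""
    return [o for o in data if d(q, o) <= r_q]


def knn_query(data, d, q, k):
    """Return the k objects closest to Q, nearest first (sequential scan)."""
    return heapq.nsmallest(k, data, key=lambda o: d(q, o))


data = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
in_range = range_query(data, l2, (0.0, 0.0), 1.5)
nearest = knn_query(data, l2, (0.0, 0.0), 3)
```

Both functions compute one distance per stored object, so their cost grows linearly with the dataset size.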

Metric Access Methods

Designed to search in metric datasets while keeping the search costs (the number of distance computations) minimal.

When searching large multimedia databases, the I/O search costs have to be minimized as well.

Many MAMs developed so far: M-tree, GH-tree, GNAT, LAESA, D-index, VP-tree, MVP-tree, SAT, ...

The majority of the MAMs are not suitable for similarity search in large datasets (they are either static methods or have high I/O search costs); only M-tree and (recently) D-index are suitable candidates
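Since the number of distance computations is the cost measure used here, it can help to count them explicitly when comparing methods. The wrapper below is a hypothetical measurement aid, not something described in the slides; `CountingMetric` and `l2` are names chosen for this sketch.

```python
import math


def l2(x, y):
    """Euclidean (L2) distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


class CountingMetric:
    """Wrap a distance function and count how often it is evaluated."""

    def __init__(self, d):
        self.d = d
        self.calls = 0

    def __call__(self, x, y):
        self.calls += 1
        return self.d(x, y)


data = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
metric = CountingMetric(l2)
# A sequential scan evaluates the metric once per stored object.
hits = [o for o in data if metric((0.0, 0.0), o) <= 1.5]
```

After the scan, `metric.calls` equals the dataset size; an index structure is worthwhile only if it answers the same query with fewer calls.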

M-tree

dynamic, balanced, and paged metric tree (like e.g. B+-tree, R-tree)
the leaves are clusters of objects
routing entries in the inner nodes represent metric regions, recursively bounding the object clusters in the leaves
during query evaluation, the triangular inequality allows discarding of irrelevant M-tree branches (i.e. metric regions)

Figure: range query over M-tree regions (Euclidean 2D space)
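The discarding step can be illustrated on a tiny in-memory tree. This is a simplified sketch under stated assumptions, not the paged M-tree itself: the `Node` class, its field names and the `stats` counter are hypothetical, nodes are not stored in pages, and the parent-distance optimisation of the real M-tree is omitted. A region (center, radius) is skipped when d(Q, center) > radius + rQ, because the triangular inequality then gives d(Q, O) >= d(Q, center) - d(center, O) > rQ for every object O inside the region.

```python
import math
from dataclasses import dataclass, field


def l2(x, y):
    """Euclidean (L2) distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


@dataclass
class Node:
    center: tuple                                  # routing object
    radius: float                                  # covering radius
    children: list = field(default_factory=list)   # sub-regions (inner node)
    objects: list = field(default_factory=list)    # data objects (leaf)


def range_search(node, q, r_q, out, stats):
    """Collect objects within r_q of q, skipping regions that cannot qualify."""
    stats["distances"] += 1
    if l2(q, node.center) > node.radius + r_q:
        return  # region and query ball are disjoint: discard the whole branch
    for o in node.objects:
        stats["distances"] += 1
        if l2(q, o) <= r_q:
            out.append(o)
    for child in node.children:
        range_search(child, q, r_q, out, stats)


leaf_a = Node((0.0, 0.0), 1.5, objects=[(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)])
leaf_b = Node((10.0, 10.0), 1.5, objects=[(10.0, 10.0), (11.0, 10.0)])
root = Node((5.0, 5.0), 9.0, children=[leaf_a, leaf_b])

result, stats = [], {"distances": 0}
range_search(root, (0.5, 0.0), 0.6, result, stats)
```

In this toy run the second leaf is discarded after a single distance computation, so its two objects are never compared with the query.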

PM-tree, motivation

metric regions in M-tree are unnecessarily large
    indexing of large portions of empty space (the “dead” space)
    higher probability of intersection with query region
    less efficient search

reduction of metric region “volume” should lead to more effective discarding of irrelevant subtrees

the way is to specify a metric region bounding all the objects more “tightly”

PM-tree, structure

Pivoting M-tree (PM-tree): a combination of M-tree with the pivot-based methods (LAESA-like)

given a fixed set of p pivots Pi (selected from the dataset), a PM-tree region is additionally defined by p hyper-ring regions (Pi, HR[i])

each routing entry contains an array HR of p intervals <HR[i].min, HR[i].max>

each interval HR[i] bounds the distances of objects to the respective pivot Pi

intersection of the hyper-sphere and the hyper-rings forms a smaller region bounding all the objects

the more pivots, the more tightly bounded the region
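The HR array described above can be sketched as a list of (min, max) pairs, one per pivot. The function names `build_hyper_rings` and `merge_hyper_rings` are assumptions made for this illustration and are not taken from the slides; the merge step reflects the requirement that a routing entry's intervals bound the distances of all objects in its subtree.

```python
import math


def l2(x, y):
    """Euclidean (L2) distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def build_hyper_rings(objects, pivots, d):
    """For each pivot Pi, return (HR[i].min, HR[i].max) over the given objects."""
    rings = []
    for p in pivots:
        dists = [d(p, o) for o in objects]
        rings.append((min(dists), max(dists)))
    return rings


def merge_hyper_rings(child_rings):
    """Combine the HR arrays of child entries into the parent's HR array."""
    return [
        (min(lo for lo, _ in per_pivot), max(hi for _, hi in per_pivot))
        for per_pivot in zip(*child_rings)
    ]


pivots = [(0.0, 0.0), (4.0, 0.0)]
cluster = [(1.0, 0.0), (3.0, 0.0), (0.0, 2.0)]
hr = build_hyper_rings(cluster, pivots, l2)
parent_hr = merge_hyper_rings([[(1.0, 3.0)], [(2.0, 5.0)]])
```

Each interval costs p distance computations per inserted object at build time, which is the price paid for the tighter regions at query time.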

PM-tree, query processing

prior to processing of a query (Q, rQ), the distances d(Q, Pi) for all i ≤ p must be computed

a metric region is relevant to a range query only if all the hyper-rings and the hyper-sphere intersect the range query region

the more hyper-rings, the lower the probability of intersection with the query

no additional distance computations are needed for the intersection test

Figure: a query against an M-tree region and against a PM-tree region
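The relevance test for a PM-tree region can be written down directly from the conditions above. In this sketch the function name and parameter names are hypothetical; `d_q_pivots` is assumed to hold the distances d(Q, Pi) computed once before the query starts, and `d_q_center` is the distance from Q to the routing object, which the M-tree computes anyway. A hyper-ring (Pi, HR[i]) misses the query ball when d(Q, Pi) - rQ > HR[i].max or d(Q, Pi) + rQ < HR[i].min.

```python
def pm_region_intersects_query(d_q_center, radius, d_q_pivots, hr, r_q):
    """True if the hyper-sphere and every hyper-ring intersect the query ball."""
    # Hyper-sphere test, identical to the plain M-tree check.
    if d_q_center > radius + r_q:
        return False
    # Hyper-ring tests: only comparisons, no new distance computations.
    for d_qp, (lo, hi) in zip(d_q_pivots, hr):
        if d_qp - r_q > hi or d_qp + r_q < lo:
            return False
    return True


hr = [(1.0, 3.0)]
outside_ring = pm_region_intersects_query(2.0, 2.0, [5.0], hr, 1.0)
touching_ring = pm_region_intersects_query(2.0, 2.0, [5.0], hr, 2.5)
inside_hole = pm_region_intersects_query(2.0, 2.0, [0.2], hr, 0.5)
```

In the first and third calls the hyper-sphere alone would not discard the region, but the single hyper-ring does.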

PM-tree, hyper-ring storage

The routing entries of PM-tree nodes are enlarged by the additional pivot-based information stored in HR arrays

To keep the space overhead minimal, a compact storage of HR[i] intervals is necessary

A distance histogram for each pivot Pi is created, and an interval <d_i^min, d_i^max> is chosen such that e.g. 90% of distances in the distance histogram fall into that interval

Each value HR[i].min, HR[i].max is scaled to the <d_i^min, d_i^max> interval using a single byte, i.e. each hyper-ring HR[i] takes 2 bytes

Figure (storage of HR array): routing entry layout Oi, r, ptr(T), ... followed by HR[1], HR[2], ..., HR[p]
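One possible way to realise the 2-byte storage is sketched below. Several details are assumptions of this sketch rather than statements from the slides: the interval is taken as the central part of the sorted distance sample, the lower bound is rounded down and the upper bound rounded up so that the stored ring never becomes smaller than the exact one, and the codes 0 and 255 are decoded as "unbounded" to stay safe for values outside the chosen interval. All function names are hypothetical.

```python
import math


def central_interval(distances, fraction=0.9):
    """Pick (d_min, d_max) so that about `fraction` of the distances fall inside."""
    s = sorted(distances)
    cut = round(len(s) * (1.0 - fraction) / 2.0)
    return s[cut], s[len(s) - 1 - cut]


def encode_ring(hr_min, hr_max, d_lo, d_hi):
    """Quantise one hyper-ring to two byte codes, rounding outwards."""
    scale = 255.0 / (d_hi - d_lo)
    lo_code = math.floor((hr_min - d_lo) * scale)
    hi_code = math.ceil((hr_max - d_lo) * scale)
    return max(0, min(255, lo_code)), max(0, min(255, hi_code))


def decode_ring(lo_code, hi_code, d_lo, d_hi):
    """Recover conservative bounds; codes 0 and 255 mean 'unbounded' (assumption)."""
    step = (d_hi - d_lo) / 255.0
    lo = 0.0 if lo_code == 0 else d_lo + lo_code * step
    hi = math.inf if hi_code == 255 else d_lo + hi_code * step
    return lo, hi


d_lo, d_hi = central_interval(range(100))
codes = encode_ring(2.3, 7.1, 0.0, 10.0)
lo, hi = decode_ring(*codes, 0.0, 10.0)
```

The decoded ring is slightly wider than the exact one, so filtering stays correct while each hyper-ring occupies two bytes; the cost is a small loss of pruning power.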

Experimental results (synthetic)

synthetic dataset of 100,000 30-dimensional tuples distributed within 1000 clusters, L2 distance, query selectivity 50 objects

Experimental results (images)

collection of 10,000 images represented by 256-dimensional vectors (gray histograms), L2 distance, query selectivity 50 objects

Recent results (not included in proceedings)

Cost models for range queries in PM-tree (ADBIS'04)

Experiments on image dataset (ADBIS'04)

Optimal k-NN query algorithm for PM-tree + cost models (to be published...)

Reference

[1] Skopal T., Pokorný J., Snášel V.: PM-tree: Pivoting Metric Tree for Similarity Search in Multimedia Databases, submitted to ADBIS 2004, Budapest, Hungary

[2] Skopal T.: Pivoting M-tree: A Metric Access Method for Efficient Similarity Search, DATESO 2004, Desná

[3] Skopal T., Pokorný J., Krátký M., Snášel V.: Revisiting M-tree Building Principles. ADBIS 2003, LNCS 2798, Springer-Verlag, Dresden, Germany
