

DBM-Tree: Trading Height-Balancing for Performance in Metric Access Methods∗

Marcos R. Vieira, Caetano Traina Jr., Fabio J. T. Chino, Agma J. M. Traina
ICMC – Institute of Mathematics and Computer Sciences

USP – University of São Paulo at São Carlos
Avenida do Trabalhador São-Carlense, 400
CEP 13560-970 – São Carlos – SP – Brazil

{mrvieira, caetano, chino, agma}@icmc.usp.br

Abstract

Metric Access Methods (MAM) are employed to accelerate the processing of similarity queries, such as the range and the k-nearest neighbor queries. Current methods, such as the Slim-tree and the M-tree, improve query performance by minimizing the number of disk accesses, keeping a constant height for the structures stored on disk (height-balanced trees). However, the overlap between their nodes has a strong influence on their performance. This paper presents a new dynamic MAM called the DBM-tree (Density-Based Metric tree), which minimizes the overlap between high-density nodes by relaxing the height-balancing of the structure. Thus, the height of the tree is larger in denser regions, in order to keep a tradeoff between breadth-searching and depth-searching. An underpinning for cost estimation on tree structures is their height, so we present a cost model that does not depend on the tree height and can be applied to the DBM-tree. Moreover, an optimization algorithm called Shrink is also presented, which improves the performance of an already built DBM-tree by reorganizing the elements among its nodes. Experiments performed over both synthetic and real-world datasets showed that the DBM-tree is, on average, 50% faster than traditional MAM and reduces the number of distance calculations by up to 72% and disk accesses by up to 66%. After running the Shrink algorithm, the performance improves up to 40% regarding the number of disk accesses for range and k-nearest neighbor queries. In addition, the DBM-tree scales up well, exhibiting linear performance with a growing number of elements in the database.

∗This work has been supported by FAPESP (São Paulo State Research Foundation) under grants 01/11987-3, 01/12536-5 and 02/07318-1 and by CNPq (Brazilian National Council for Supporting Research) under grants 52.1685/98-6, 860.068/00-7, 50.0780/2003-0 and 35.0852/94-4.

Keywords: Metric Access Method, Metric Tree, Indexing, Similarity Queries.

1 Introduction

The volume of data managed by Database Management Systems (DBMS) is continually increasing. Moreover, new complex data types, such as multimedia data (image, audio, video and long text), geo-referenced information, time series, fingerprints, genomic data and protein sequences, among others, have been added to DBMS.

The main technique employed to accelerate data retrieval in DBMS is indexing the data using Access Methods (AM). The data domains used by traditional databases, i.e. numbers and short character strings, have the total ordering property. Every AM used in traditional DBMS to answer both equality (= and ≠) and relational ordering predicates (≤, <, ≥ and >), such as the B-trees, is based on this property.

Unfortunately, the majority of complex data domains do not have the total ordering property. The lack of this property precludes the use of traditional AM to index complex data. Nevertheless, these data domains allow the definition of similarity relations among pairs of objects. Similarity queries are more natural for these data domains. For a given reference object, also called the query center object, a similarity query returns all objects that meet a given similarity criterion. Traditional AM rely on the total ordering relationship only, and are able neither to handle these complex data properly nor to answer similarity queries over such data. These restrictions led to the development of a new class of AM, the Metric Access Methods (MAM), which are well-suited to answer similarity queries over complex data types.

MAM such as the Slim-tree [15, 14] and the M-tree [9] were developed to answer similarity queries based on the similarity relationships among pairs of objects. The similarity (or dissimilarity) relationships are usually represented by distance functions computed over pairs of objects of the data domain. The data domain and the distance function define a metric space or metric domain.

Formally, a metric space is a pair <𝕊, d()>, where 𝕊 is the data domain and d() is a distance function that complies with the following three properties:

1. symmetry: d(s1, s2) = d(s2, s1);
2. non-negativity: 0 < d(s1, s2) < ∞ if s1 ≠ s2, and d(s1, s1) = 0;
3. triangular inequality: d(s1, s2) ≤ d(s1, s3) + d(s3, s2),

∀ s1, s2, s3 ∈ 𝕊. A metric dataset S ⊂ 𝕊 is a set of objects si ∈ 𝕊 currently stored in a database.
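As an illustration, the three properties can be checked programmatically for a concrete distance function. The sketch below is ours, not the paper's (the function names are invented); it verifies the axioms for the Euclidean distance over a small sample of points:

```python
import itertools
import math

def euclidean(a, b):
    """L2 distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def check_metric_axioms(d, points):
    """Check symmetry, non-negativity and the triangular inequality
    of a distance function d over a sample of points."""
    for s1, s2 in itertools.product(points, repeat=2):
        assert d(s1, s2) == d(s2, s1)                     # symmetry
        assert d(s1, s2) >= 0                             # non-negativity
        assert (d(s1, s2) == 0) == (s1 == s2)             # d = 0 iff same object
    for s1, s2, s3 in itertools.product(points, repeat=3):
        assert d(s1, s2) <= d(s1, s3) + d(s3, s2) + 1e-9  # triangular inequality
    return True

points = [(0, 0), (3, 4), (1, 1), (-2, 5)]
print(check_metric_axioms(euclidean, points))  # True
```

The small tolerance on the triangular inequality only absorbs floating-point rounding; mathematically the inequality is exact.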

Vectorial data with an Lp distance function, such as the Euclidean distance (L2), are special cases of metric spaces. The two main types of similarity queries are:

• Range query - Rq: given a query center object sq ∈ 𝕊 and a maximum query distance rq, the query Rq(sq, rq) retrieves every object si ∈ S such that d(si, sq) ≤ rq. An example is: “Select the proteins that are similar to the protein P by up to 5 purine bases”, which is represented as Rq(P, 5);

• k-Nearest Neighbor query - kNNq: given a query center object sq ∈ 𝕊 and an integer value k ≥ 1, the query kNNq(sq, k) retrieves the k objects in S that have the smallest distance from the query object sq, according to the distance function d(). An example is: “Select the 3 proteins most similar to the protein P”, where k=3, which is represented as kNNq(P, 3).
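Over a plain in-memory dataset the two query types reduce to a sequential scan. The sketch below is our illustration of the definitions only, not the paper's tree-based algorithms, which exist precisely to avoid scanning every object:

```python
import heapq
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def range_query(dataset, d, sq, rq):
    """Rq(sq, rq): every object within distance rq of the center sq."""
    return [si for si in dataset if d(si, sq) <= rq]

def knn_query(dataset, d, sq, k):
    """kNNq(sq, k): the k objects with the smallest distance to sq."""
    return heapq.nsmallest(k, dataset, key=lambda si: d(si, sq))

data = [(0, 0), (1, 1), (5, 5), (2, 2), (9, 9)]
print(range_query(data, euclidean, (0, 0), 2.0))  # [(0, 0), (1, 1)]
print(knn_query(data, euclidean, (0, 0), 3))      # [(0, 0), (1, 1), (2, 2)]
```

Both scans compute one distance per stored object; a MAM answers the same queries while pruning most of these calculations.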

This paper presents a new dynamic MAM called DBM-tree (Density-Based Metric tree), which can minimize the overlap of nodes storing objects in high-density regions by relaxing the height-balance of the structure. Therefore, the height of a DBM-tree is larger in higher-density regions, in order to keep a compromise between the number of disk accesses required to breadth-search various subtrees and to depth-search one subtree. As the experiments will show, the DBM-tree presents better performance to answer similarity queries than the rigidly balanced trees. This article also presents an algorithm to optimize DBM-trees, called Shrink, which improves the performance of these structures by reorganizing the elements among the tree nodes.

The experiments performed over synthetic and real datasets show that the DBM-tree outperforms the traditional MAM, such as the Slim-tree and the M-tree. The DBM-tree is, on average, 50% faster than these traditional balanced MAM, reducing by up to 66% the number of disk accesses and by up to 72% the number of distance calculations required to answer similarity queries. The Shrink algorithm helps to achieve improvements of up to 40% in the number of disk accesses to answer range and k-nearest neighbor queries. Moreover, the DBM-tree is scalable, exhibiting linear behavior in the total processing time, the number of disk accesses and the number of distance calculations with respect to the number of indexed elements.

A preliminary version of this paper was presented at SBBD 2004 [20]. Here, we show a new split algorithm for the DBM-tree. Additionally, this paper shows an accurate cost function for the DBM-tree using only information easily derivable from the tree, thus providing a cost function that does not depend upon a constant tree height. A cost function is fundamental to enable the DBM-tree to be employed in real DBMS. Every tree-based AM used in existing DBMS uses the height of the tree as the main parameter to optimize a query plan. As the DBM-tree does not have a reference height, every existing theory about query plan optimization is invalidated when using a DBM-tree. Therefore, the cost function presented in this paper is a fundamental requirement to enable using DBM-trees in a real DBMS.

The remainder of this paper is structured as follows: Section 2 presents the basic concepts and Section 3 summarizes the related works. The new metric access method DBM-tree is presented in Section 4. Section 5 describes the experiments performed and the results obtained. Finally, Section 6 gives the conclusions of this paper and suggests future work.

2 Basic Concepts

Access Methods (AM) are used by DBMS to improve performance on retrieval operations. The use of meaningful properties of the indexed objects is fundamental to achieve these improvements. Using properties of the data domain, it is possible to discard large subsets of the data without comparing every stored object with the query object. For example, consider the case of numeric data, where the total ordering property holds: this property allows dividing the stored numbers in two sets: those that are larger and those that are smaller than or equal to the query reference number. Hence, the fastest way to perform the search is maintaining the numbers sorted. Thus, when a search for a given number is required, comparing this number with a stored one enables discarding further comparisons with the part of the data where the number cannot be.
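The idea of discarding half of the remaining data with each comparison is exactly what binary search over sorted numbers does; a minimal sketch using Python's standard bisect module:

```python
import bisect

def contains(sorted_nums, x):
    """Membership test over a sorted list: each comparison discards half
    of the remaining numbers, exploiting the total ordering property."""
    i = bisect.bisect_left(sorted_nums, x)  # leftmost position where x could be
    return i < len(sorted_nums) and sorted_nums[i] == x

nums = sorted([17, 3, 42, 8, 25, 8])
print(contains(nums, 25))  # True
print(contains(nums, 9))   # False
```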

An important class of AM are the hierarchical structures (trees), which enable recursive processes to index and search the data. In a tree, the objects are divided in blocks called nodes. When a search is needed, the query object is compared with one or more objects in the root node, determining which subtrees need to be traversed, recursively repeating this process for each subtree that may store answers.

Notice that whenever the total ordering property applies, only one subtree at each tree level can hold the answer. If the data domain has only a partial ordering property, then it is possible that more than one subtree needs to be analyzed at each level. As numeric domains possess the total ordering property, trees indexing numbers require the access of only one node in each level of the structure. On the other hand, trees storing spatial coordinates, which have only a partial ordering property, require searches in more than one subtree in each level of the structure. This effect is known as covering, or overlapping between subtrees, and occurs for example in R-trees [12].

Hierarchical structures can be classified as (height-)balanced or unbalanced. In the balanced structures, the height of every subtree is the same, or at most changes by a fixed amount.

The nodes of an AM used in a DBMS are stored on disk using fixed-size registers. Storing the nodes on disk is essential to warrant data persistence and to allow handling any number of objects. However, as disk accesses are slow, it is important to keep the number of disk accesses required to answer queries small. Traditional DBMS build indexes only on data holding the total ordering property, so if a tree grows deeper, more disk accesses are required to traverse it. Therefore it is important to keep every tree as shallow as possible. When a tree is allowed to grow unbalanced, it is possible that it degenerates completely, making it useless. Therefore, only balanced trees have been widely used in traditional DBMS.

A metric tree divides a dataset into regions and chooses objects called representatives or centers to represent each region. Each node stores the representatives, the objects in the covered region, and their distances to the representatives. As the stored objects can be representatives in other nodes, this enables the structure to be organized hierarchically, resulting in a tree. When a query is performed, the query object is first compared with the representatives of the root node. The triangular inequality is then used to prune subtrees, avoiding distance calculations between the query object and objects or subtrees in the pruned subtrees. Distance calculations between complex objects can have a high computational cost. Therefore, to achieve good performance in metric access methods, it is vital to also minimize the number of distance calculations in query operations.
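The triangular-inequality prune can be made concrete with a small sketch. This is our simplified single-node model (the names and the flat entry list are invented): each entry stores its precomputed distance to the node representative, so an object can be discarded without computing its distance to the query center:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def range_search_node(srep, entries, sq, rq, d):
    """entries: list of (obj, d(srep, obj)) pairs stored in the node.
    By the triangular inequality, |d(srep,sq) - d(srep,obj)| > rq implies
    d(obj, sq) > rq, so that object is pruned without a new distance call."""
    d_rep_q = d(srep, sq)          # one distance computation per node
    hits, pruned = [], 0
    for obj, d_rep_obj in entries:
        if abs(d_rep_q - d_rep_obj) > rq:   # cannot be inside the query ball
            pruned += 1
            continue
        if d(obj, sq) <= rq:                # distance computed only when needed
            hits.append(obj)
    return hits, pruned

srep = (0, 0)
entries = [((0, 1), 1.0), ((5, 0), 5.0), ((2, 0), 2.0)]
print(range_search_node(srep, entries, (1, 0), 1.0, euclidean))  # ([(2, 0)], 1)
```

In the example, the object at distance 5 from the representative is discarded by the stored distance alone.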

Metric access methods exhibit the node overlapping effect, so the number of disk accesses depends both on the height of the tree and on the amount of overlap. In this case, it is not worthwhile to reduce the number of levels at the expense of increasing the overlap. Indeed, reducing the number of subtrees that cannot be pruned at each node access can be more important than keeping the tree balanced. As more node accesses also require more distance calculations, increasing the pruning ability of a MAM becomes even more important. However, no published access method has taken this fact into account so far.

The DBM-tree presented in this paper is a dynamic MAM that relaxes the usual rule imposing a rigid height-balancing policy, therefore trading a controlled amount of unbalancing at denser regions of the dataset for a reduced overlap between subtrees. As our experiments show, this tradeoff allows an overall increase in performance when answering similarity queries.

3 Related Works

Plenty of Spatial Access Methods (SAM) were proposed for multidimensional data. A comprehensive survey showing the evolution of SAM and their main concepts can be found in [11]. However, the majority of them cannot index data in metric domains, and they suffer from the dimensionality curse, being efficient to index only low-dimensional datasets. An unbalanced R-tree called CUR-tree (Cost-Based Unbalanced R-tree) was proposed in [16] to optimize query executions. It uses promotion and demotion to move data objects and subtrees around the tree, taking into account a given query distribution and a cost model for their execution. The tree is shallower where the most frequent queries are posed, but it needs to be reorganized every time a query is executed. This technique works only in SAM, making it infeasible for MAM. Considering cost models, a great deal of work was also published regarding SAM [17]. However, they rely on the distribution of the data in the space and other spatial properties, which makes them infeasible for MAM.

The techniques of recursive partitioning of data in metric domains proposed by Burkhard and Keller [5] were the starting point for the development of MAM. The first technique divides the dataset choosing one representative for each subset, grouping the remaining elements according to their distances to the representatives. The second technique divides the original set in a fixed number of subsets, selecting one representative for each subset. Each representative and the biggest distance from the representative to all elements in the subset are stored in the structure to improve nearest-neighbor queries.

The MAM proposed by Uhlmann [19] and the VP-tree (Vantage-Point tree) [21] are examples based on the first technique, where the vantage points are the representatives proposed by [5]. Aiming to reduce the number of distance calculations to answer similarity queries in the VP-tree, Baeza-Yates et al. [1] proposed to use the same representative for every node in the same level. The MVP-tree (Multi-Vantage-Point tree) [2, 3] is an extension of the VP-tree that allows selecting M representatives for each node in the tree. Using many representatives, the MVP-tree requires fewer distance calculations to answer similarity queries than the VP-tree. The GH-tree (Generalized Hyper-plane tree) [19] is another method that recursively partitions the dataset in two groups, selecting two representatives and associating the remaining objects to the nearest representative.

The GNAT (Geometric Near-Neighbor Access tree) [4] can be viewed as a refinement of the second technique presented in [5]. It stores the distances between pairs of representatives, and the biggest distance from each stored object to each representative. The tree uses these data to prune distance calculations using the triangular inequality.

All MAM for metric datasets discussed so far are static, in the sense that the data structure is built at once using the full dataset, and new insertions are not allowed afterward. Furthermore, they only attempt to reduce the number of distance calculations, paying no attention to disk accesses. The M-tree [9] was the first MAM to overcome this deficiency. The M-tree is a height-balanced tree based on the second technique of [5], with the data elements stored in leaf nodes. A cost model based only on the distance distributions of the dataset and information of the M-tree nodes is provided in [8]. The Slim-tree [15] is an evolution of the M-tree, embodying the first published method to reduce the amount of node overlap, called the Slim-Down.

The use of multiple representatives, called “omni-foci”, was proposed in [10] to generate a coordinate system for the objects in the dataset. The coordinates can be indexed using any SAM, ISAM (Indexed Sequential Access Method), or even sequential scanning, generating a family of MAM called the “Omni-family”. Two good surveys on MAM can be found in [7] and [13].

The MAM described so far build height-balanced trees aiming to minimize the tree height, at the expense of little flexibility to reduce node overlap. The DBM-tree proposed in this paper is the first MAM that keeps a tradeoff between breadth-searching and depth-searching, trading height-balancing for overlap reduction to achieve better overall search performance.

4 The MAM DBM-tree

The DBM-tree is a dynamic MAM that grows bottom-up. The objects of the dataset are grouped into fixed-size disk pages, each page corresponding to a tree node. An object can be stored at any level of the tree. The main intent is to organize the objects in a hierarchical structure using a representative object as the center of each minimum bounding region that covers the objects in a subtree. An object can be stored in a node if the covering radius of the representative covers it.

Unlike the Slim-tree and the M-tree, there is only one type of node in the DBM-tree: there is no distinction between leaf and index nodes. Each node has a capacity to hold up to C entries, and it stores a field Ceff that counts how many entries si are effectively stored in that node. An entry can be either a single object or a subtree. A node can have subtree entries, single-object entries, or both. Single objects cannot be covered by any of the subtrees stored in the same node. Each node has one of its entries elected to be a representative. If a subtree is elected, the representative is the center of the root node of the subtree. The representative of a node is copied to its immediate parent node, unless it is already the root node. Entries storing subtrees have: one representative object si, which is the representative of the i-th subtree; the distance between the node representative and the representative of the subtree, d(srep, si); the link Ptri pointing to the node storing that subtree; and the covering radius of the subtree, Ri. Entries storing single objects have: the single object sj, the identifier of this object OIdj, and the distance between the node representative and the object, d(srep, sj). This structure can be represented as:

Node[Ceff, array [1..Ceff] of |< si, d(srep, si), Ptri, Ri > or < sj, OIdj, d(srep, sj) >|]

In this structure, the entry si whose d(srep, si) = 0 holds the representative object srep.
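A possible in-memory rendering of this node layout is sketched below. The field names are ours; the paper specifies only the logical content of each entry:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SubtreeEntry:
    """Entry pointing to a subtree: <si, d(srep, si), Ptri, Ri>."""
    s_i: object        # representative of the i-th subtree
    d_to_rep: float    # distance to this node's representative
    ptr: "Node"        # node storing the subtree (Ptri)
    radius: float      # covering radius Ri of the subtree

@dataclass
class ObjectEntry:
    """Entry holding a single object: <sj, OIdj, d(srep, sj)>."""
    s_j: object
    oid: int
    d_to_rep: float

@dataclass
class Node:
    """A DBM-tree node; subtree and single-object entries may coexist."""
    capacity: int                                  # C
    entries: List[object] = field(default_factory=list)

    @property
    def c_eff(self) -> int:                        # Ceff
        return len(self.entries)

    @property
    def representative(self) -> Optional[object]:
        # the entry whose d(srep, si) == 0 holds srep
        for e in self.entries:
            if e.d_to_rep == 0:
                return e
        return None
```

A node built this way reports its effective occupation through `c_eff` and finds its representative by the zero-distance convention stated above.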

4.1 Building the DBM-tree

The DBM-tree is a dynamic structure, allowing the insertion of new objects at any time after its creation. When the DBM-tree is asked to insert a new object, it searches the structure for one node qualified to store it. A qualifying node is one with at least one subtree that covers the new object. The Insert() algorithm is shown as Algorithm 1. It starts searching in the root node and proceeds searching recursively for a node that qualifies to store the new object. The insertion of the new object can occur at any level of the structure. In each node, the Insert() algorithm uses the ChooseSubtree() algorithm (line 1), which returns the subtree that best qualifies to store the new object. If there is no subtree that qualifies, the new object is inserted in the current node (line 9).

Algorithm 1 Insert()
Input: Ptrt: pointer to the subtree where the new object sn will be inserted; sn: the object to be inserted.
Output: Object sn inserted in the Ptrt subtree.
1: ChooseSubtree(Ptrt, sn)
2: if there is a subtree that qualifies then
3:   Insert(Ptri, sn)
4:   if there is a promotion then
5:     Update the new representatives and their information.
6:     Insert the set of entries left uncovered by the node split into the current node.
7:     for each entry si now covered by the update do
8:       Demote entry si.
9: else if there is space in the current node Ptrt to insert sn then insert the new object sn in node Ptrt.
10: else SplitNode(Ptrt, sn)

The DBM-tree provides two policies for the ChooseSubtree() algorithm:

• Minimum distance that covers the new object (minDist): among the subtrees that cover the new object, choose the one with the smallest distance between the representative and the new object. If no entry qualifies to store the new object, it is inserted in the current node;

• Minimum growing distance (minGDist): similar to minDist, but if there is no subtree that covers the new object, the one whose representative is closest to the new object is chosen, increasing its covering radius accordingly. Therefore, the radius of a subtree is increased only when no other subtree covers the new object.

The policy chosen for the ChooseSubtree() algorithm has a high impact on the resulting tree. The minDist policy tends to build trees with smaller covering radii, but the trees can grow higher than the trees built with the minGDist policy. The minGDist policy tends to produce shallower trees than those produced with the minDist policy, but with higher overlap between the subtrees.
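The two policies can be sketched as follows. This is our simplified model, not the paper's code: each subtree is reduced to a (center, radius) pair, and minGDist enlarges the chosen radius in place:

```python
def choose_subtree(subtrees, s_new, d, policy="minDist"):
    """subtrees: list of (center, radius) pairs, one per child subtree.
    Returns the index of the chosen subtree, or None when the object
    must be kept in the current node (minDist only)."""
    covering = [(d(c, s_new), i) for i, (c, r) in enumerate(subtrees)
                if d(c, s_new) <= r]
    if covering:
        return min(covering)[1]          # closest covering representative
    if policy == "minGDist" and subtrees:
        # no subtree covers s_new: grow the closest one just enough
        dist, i = min((d(c, s_new), i) for i, (c, _) in enumerate(subtrees))
        center, _ = subtrees[i]
        subtrees[i] = (center, dist)     # enlarged covering radius
        return i
    return None                          # minDist: insert in current node

d = lambda a, b: abs(a - b)              # 1-D toy metric
st = [(0, 1), (10, 2)]
print(choose_subtree(st, 3, d))                      # None (no cover, minDist)
print(choose_subtree(st, 3, d, policy="minGDist"))   # 0 (radius grows to 3)
```

The in-place radius growth mirrors the policy's side effect: only the subtree actually chosen has its region enlarged.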

If the node chosen by the Insert() algorithm has no free space to store the new object, then all the existing entries, together with the new object taken as a single object, must be redistributed between one or two nodes, depending on the redistribution option set in the SplitNode() algorithm (line 10). The SplitNode() algorithm deletes the node Ptrt and removes its representative from its parent node. Its former entries are then redistributed between one or two new nodes, and the representatives of the new nodes, together with the set of entries of the former node Ptrt not covered by the new nodes, are promoted and inserted in the parent node (line 6). Notice that the set of entries of the former node that are not covered by any new node can be empty. The DBM-tree has three options to choose the representatives of the new nodes in the SplitNode() algorithm:

• Minimum of the largest radii (minMax): this option distributes the entries into at most two nodes, allowing a possibly null set of entries not covered by these two nodes. To select the representatives of each new node, each pair of entries is considered as a candidate. For each pair, this option tries to insert each remaining entry into the node having the representative closest to it. The chosen representatives are those generating the pair of radii whose largest radius is the smallest among all possible pairs. The computational complexity of the algorithm executing this option is O(C³), where C is the number of entries to be distributed between the nodes;

• Minimum radii sum (minSum): this option is similar to minMax, but the two representatives selected are the pair with the smallest sum of the two covering radii;

• 2-Clusters: this option tries to build at most two groups. These groups are built choosing objects that minimize the distances inside each group, organizing them as a minimal spanning tree. This option is detailed as Algorithm 2. The first step of this algorithm is the creation of C groups, each one with only one entry. The second step joins each group with its nearest group, finishing when only 2 groups remain (line 2). The next step checks whether there is a group with only one object; if so, that object will be inserted in the upper level (line 4). A representative object is chosen (line 5) for each remaining group, and nodes are created to store their objects (line 6). The representatives and all their information are promoted to the next upper level. Figure 1 illustrates this approach applied to a bi-dimensional vector space. The node to be split is presented in Figure 1(a). After building the C groups (Figure 1(b)), the groups are joined to form 2 groups (Figure 1(c)). Figure 1(d) presents the two resulting nodes after the split by the 2-Clusters approach.
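The minMax option can be sketched as a brute-force search over candidate representative pairs. This is our illustration (the uncovered-entry set is omitted for brevity); the triple loop makes the O(C³) cost visible:

```python
import itertools

def min_max_split(entries, d):
    """minMax split sketch: try every pair of entries as representatives,
    assign each remaining entry to the nearer one, and keep the pair whose
    larger covering radius is smallest.  O(C^3) in the number of entries."""
    best = None
    for i, j in itertools.combinations(range(len(entries)), 2):
        groups = {i: [entries[i]], j: [entries[j]]}
        radii = {i: 0.0, j: 0.0}
        for k, e in enumerate(entries):
            if k in (i, j):
                continue
            rep = min((i, j), key=lambda r: d(entries[r], e))  # nearer rep
            groups[rep].append(e)
            radii[rep] = max(radii[rep], d(entries[rep], e))
        largest = max(radii.values())
        if best is None or largest < best[0]:
            best = (largest, groups[i], groups[j])
    return best  # (largest covering radius, group 1, group 2)

# 1-D toy example: the natural split is {0, 1} and {10, 11}, radius 1
print(min_max_split([0, 1, 10, 11], lambda a, b: abs(a - b))[0])  # 1
```

A minSum sketch would differ only in the ranking criterion: `sum(radii.values())` instead of `max(radii.values())`.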

The minimum node occupation is set when the structure is created, and this value must be between one object and at most half of the node capacity C. If the ChooseSubtree policy is set to minGDist, then all the C entries must be distributed between the two new nodes created by the SplitNode() algorithm. After defining the representative of each new node, the remaining entries are inserted in the node with the closest representative. After distributing every entry, if one of the two nodes stores only the representative, then this node is destroyed and its sole entry is inserted in its parent node as a single object. Based on the experiments and on the literature [9], splits leading to an unequal number of entries in the nodes can be better than splits with an equal number of entries in each node, because they tend to minimize overlap between nodes.

If the ChooseSubtree policy is set to minDist and the minimum occupation is set to a value lower than half of the node capacity, then each node is first filled with this minimum number of entries. After this, the remaining entries are inserted into a node only if its covering radius does not increase the overlapping region between the two nodes. The rest of the entries, those not inserted into either of the two nodes, are inserted in the parent node.

Algorithm 2 2-Clusters()
Input: C entries to be redistributed in nodes.
Output: A representative set (RepSet) and an entry set to be inserted in the upper level (PromoSet).
1: Build C groups.
2: Try to join, one by one, the C groups, until only 2 groups remain.
3: for each group that has a unique entry do
4:   Insert the unique entry in PromoSet.
5: Choose a representative object for each group.
6: Create the nodes for the remaining groups.
7: Insert in RepSet the generated representatives.
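One possible reading of the grouping phase of Algorithm 2 is single-linkage agglomerative merging, which is equivalent to building a minimal spanning tree and cutting its largest edge to obtain two clusters. The sketch below is our simplification covering steps 1-4 only (node creation and representative selection are omitted):

```python
import itertools

def two_clusters(entries, d):
    """Start with one group per entry and repeatedly merge the two closest
    groups (closest-pair linkage) until only 2 remain; a group left with a
    single entry is promoted to the upper level."""
    groups = [[e] for e in entries]

    def link(g1, g2):                       # closest-pair distance
        return min(d(a, b) for a in g1 for b in g2)

    while len(groups) > 2:
        i, j = min(itertools.combinations(range(len(groups)), 2),
                   key=lambda ij: link(groups[ij[0]], groups[ij[1]]))
        groups[i] += groups.pop(j)          # j > i, so index i stays valid
    promote = [g[0] for g in groups if len(g) == 1]   # PromoSet
    keep = [g for g in groups if len(g) > 1]          # groups that become nodes
    return keep, promote

print(two_clusters([0, 1, 2, 10, 11], lambda a, b: abs(a - b)))
```

With the 1-D input above the merging naturally separates {0, 1, 2} from {10, 11}, and nothing is promoted.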

Splits promote the representative to the parent node, which in turn can cause other splits. After the split propagation in Algorithm 1 (promotion, line 4) or the update of the representative radii (line 5), it can occur that formerly uncovered single-object entries are now covered by the updated subtree. In this case, each of these entries is removed from the current node and reinserted into the subtree that covers it (demotion, lines 7 and 8).

4.2 Similarity Queries in the DBM-tree

The DBM-tree can answer the two main types of similarity queries: Range query (Rq) and k-Nearest Neighbor query (kNNq). Their algorithms are similar to those of the Slim-tree and the M-tree.

The Rq() algorithm for the DBM-tree is described as Algorithm 3. It receives as input parameters a tree node Ptrt, the query center sq and the query radius rq. All entries in Ptrt are checked against the search condition (line 2). The triangular inequality allows pruning subtrees and single objects that do not pertain to the region defined by the query. The entries that cannot be pruned in this way have their distance to the query object calculated (line 3). Each entry covered by the query (line 4) is then processed. If it is a subtree, it is recursively analyzed by the Rq algorithm (line 5). If the entry is an object, it is added to the answer set (line 6). The end of the process returns the answer set including every object that satisfies the query criteria.

Algorithm 3 Rq()
Input: Ptrt: tree in which to perform the search; sq: the query object; rq: the query radius.
Output: Answer set AnswerSet with all objects satisfying the query conditions.
1: for each si ∈ Ptrt do
2:   if |d(srep, sq) − d(srep, si)| ≤ rq + Ri then
3:     Calculate dist = d(si, sq)
4:     if dist ≤ rq + Ri then
5:       if si is a subtree then Rq(Ptri, sq, rq)
6:       else AnswerSet.Add(si)

The kNNq() algorithm, described as Algorithm 4, is similar to Rq(), but it requires a dynamic radius rk to perform the pruning. At the beginning of the process, this radius is set to a value that covers all the indexed objects (line 1). It is adjusted when the answer set is first filled with k objects, and whenever the answer set changes thereafter (line 12). Another difference is that there is a priority queue to hold the not yet checked subtrees of the nodes. Entries are checked processing the single objects first (lines 4 to 12) and then the subtrees (lines 13 to 18). Among the subtrees, those closer to the query object that intersect the query region are checked first (line 3). When an object closer than the k already found is located (line 8), it substitutes the previous farthest one (line 11) and the dynamic radius is diminished to tighten further pruning (line 12).
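The dynamic-radius idea can be sketched over a flat collection. This is our simplification: the tree traversal, the subtree priority queue, and the stored-distance pruning of Algorithm 4 are omitted, leaving only how rk starts unbounded and shrinks as the candidate set improves:

```python
import heapq

def knn_dynamic_radius(objects, d, sq, k):
    """Dynamic-radius kNN sketch: rk starts 'infinite' (covering every
    indexed object) and shrinks whenever the k-th best answer improves,
    tightening the pruning bound for everything examined later."""
    rk = float("inf")
    candidates = []                  # max-heap of (-dist, obj)
    for si in objects:
        dist = d(si, sq)
        if dist > rk:                # outside the current dynamic radius
            continue
        heapq.heappush(candidates, (-dist, si))
        if len(candidates) > k:
            heapq.heappop(candidates)        # drop the previous farthest
        if len(candidates) == k:
            rk = -candidates[0][0]           # diminish rk: tighter pruning
    return sorted((-nd, s) for nd, s in candidates)

print(knn_dynamic_radius([5, 1, 9, 2, 8], lambda a, b: abs(a - b), 0, 2))
```

In the tree version, the same shrinking rk is compared against subtree bounds taken from the priority queue, so whole branches (not just single objects) are discarded as the answer set tightens.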

Figure 1: Exemplifying a node split using the 2-Clusters() algorithm: (a) before the split, (b) forming C groups with unique nodes, (c) the 2 final groups, and (d) the final nodes created with the chosen representatives.

4.3 The Shrink() optimization algorithm

A special algorithm, called Shrink(), was created to optimize already loaded DBM-trees. This algorithm aims at shrinking the nodes by exchanging entries between nodes, to reduce the amount of overlap between subtrees. Reducing overlap improves the structure, which results in a decreased number of distance calculations, total processing time and number of disk accesses required to answer both Rq and kNNq queries. During the exchange of entries between nodes, some nodes can be left with just one entry; that entry is then promoted and the empty node is deleted from the structure, further improving the performance of the search operations.

The Shrink() algorithm can be called at any time during the evolution of a tree, for example after the insertion of many new objects. This algorithm is described as Algorithm 5.

The algorithm is applied to every node of a DBM-tree. The input parameter is the pointer Ptrt to the subtree to be optimized, and the result is the optimized subtree. The stop condition (line 1) holds in two cases: when there was no entry exchange in the previous iteration, or when the number of exchanges already done is larger than 3 times the number of entries in the node. This latter condition assures that cyclic exchanges cannot lead to an infinite loop. It was experimentally verified that a larger number of exchanges does not improve the results. For each entry sa in node Ptrt (line 2), the entry farthest from the node representative is set as i (line 3). The algorithm then searches for another entry sb in Ptrt that can store the entry i (line 5). If such a node exists, i is removed from sa and reinserted in node sb (line 6). If the exchange makes node sa empty, it is deleted, as well as its entry in node Ptrt (line 7). If this does not generate an empty node, it is only necessary to update the reduced covering radius of entry sa in node Ptrt (line 8). This process is recursively applied over all nodes of the tree (lines 9 and 10). After every entry in Ptrt has been verified, the nodes holding only one entry are deleted and their single entries replace them in Ptrt (line 11).

4.4 A Cost Model for DBM-tree

Cost models for search operations in trees usually rely on the tree height. Such cost models do not apply to the DBM-tree. However, an AM requires a cost model in order to be used in a DBMS. Therefore, we developed a cost model for the DBM-tree based on statistics of each tree node. The proposed approach does not rely on the data distribution, but rather on the distance distribution among objects. The cost model developed assumes that the expected probability


Algorithm 4 kNNq()
Input: root node Ptrroot, the query object sq and the number of objects k.
Output: answer set with all objects satisfying the query conditions.

1: rk = ∞
2: PriorityQueue.Add(Ptrroot, 0)
3: while ((Node = PriorityQueue.First()) ≤ rk) do
4:   for each si ∈ Node do
5:     if si is a single object then
6:       if |d(srep, sq) − d(srep, si)| ≤ rk then
7:         Calculate dist = d(si, sq)
8:         if dist ≤ rk then
9:           AnswerSet.Add(si)
10:          if AnswerSet.Elements() ≥ k then
11:            AnswerSet.Cut(k)
12:            rk = AnswerSet.MaxDistance()
13:  for each si ∈ Node do
14:    if si is a subtree then
15:      if |d(srep, sq) − d(srep, si)| ≤ rk + Ri then
16:        Calculate dist = d(si, sq)
17:        if dist ≤ rk + Ri then
18:          PriorityQueue.Add(si, dist)
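The dynamic-radius search of Algorithm 4 can be sketched in Python using heapq as the priority queue. As before, this is an illustrative sketch rather than the paper's implementation: the Node/Entry layout (redefined here to keep the block self-contained), the stored dist_to_rep field and the Euclidean metric are assumptions.

```python
import heapq
import math

class Entry:
    def __init__(self, obj, dist_to_rep, radius=0.0, node=None):
        self.obj, self.dist_to_rep = obj, dist_to_rep
        self.radius, self.node = radius, node   # node is None for single objects

class Node:
    def __init__(self, rep, entries):
        self.rep, self.entries = rep, entries

def knn_query(root, sq, k, d=math.dist):
    rk = math.inf                      # dynamic query radius (line 1)
    best = []                          # max-heap via (-dist, obj), at most k items
    pq = [(0.0, 0, root)]              # (distance, tie-breaker, node) (line 2)
    tick = 1
    while pq and pq[0][0] <= rk:       # line 3: closest pending subtree first
        _, _, node = heapq.heappop(pq)
        d_rep = d(node.rep, sq)
        for e in node.entries:         # single objects first (lines 4-12)
            if e.node is None and abs(d_rep - e.dist_to_rep) <= rk:
                dist = d(e.obj, sq)
                if dist <= rk:
                    heapq.heappush(best, (-dist, e.obj))
                    if len(best) > k:
                        heapq.heappop(best)      # drop the current farthest
                    if len(best) == k:
                        rk = -best[0][0]         # line 12: tighten the radius
        for e in node.entries:         # then subtrees (lines 13-18)
            if e.node is not None and abs(d_rep - e.dist_to_rep) <= rk + e.radius:
                dist = d(e.obj, sq)
                if dist <= rk + e.radius:
                    heapq.heappush(pq, (dist, tick, e.node))
                    tick += 1
    return [obj for _, obj in sorted((-nd, o) for nd, o in best)]
```

The while condition on the queue head implements the stopping rule: once the nearest pending subtree lies farther than the current k-th neighbor, no remaining subtree can improve the answer.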

P() of a node Ptrt being accessed is equal to the probability that the node radius RPtrt plus the query radius rq is greater than or equal to the distance from the node representative srep of Ptrt to the query object sq. The probability of Ptrt being accessed can therefore be expressed as:

P(Ptrt) = P(RPtrt + rq ≥ d(srep, sq))    (1)

We assume that every object has a distribution of distances to the other objects in the dataset that is, on average, similar to the distribution of the other objects. Thus, Formula (1) can be approximated by a normalized histogram Hist() of the distance distribution, instead of computing the distance from the query object to the node representative. Therefore

P(Ptrt) ≈ Hist(RPtrt + rq)    (2)

Algorithm 5 Shrink()
Input: Ptrt tree to optimize.
Output: Ptrt tree optimized.

1: while the number of exchanges does not exceed 3 times the number of entries in node Ptrt and at least one exchange occurred in the previous iteration do
2:   for each subtree entry sa in node Ptrt do
3:     Set entry i of sa as the farthest from the sa representative.
4:     for each entry sb distinct from sa in Ptrt do
5:       if the entry i of sa is covered by node sb and this node has enough space to store i then
6:         Remove the entry i from sa and reinsert it in sb.
7:     if node sa is empty then delete node sa and delete the entry sa from Ptrt.
8:     else update the radius of entry sa in Ptrt.
9: for each sa subtree in node Ptrt do
10:   Shrink(sa).
11:   if node sa has only one entry then delete node sa and update the entry sa in Ptrt.
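A much-simplified, single-level sketch of the exchange step of Algorithm 5 follows (no recursion, no node deletion). The dict-based node layout, the cap capacity field and the function name are illustrative assumptions, not the paper's structures.

```python
import math

def shrink_once(children, d=math.dist):
    """One pass of Shrink's exchange step over sibling nodes.
    Each child is a dict with: rep (representative), radius (covering
    radius), objs (stored objects) and cap (assumed node capacity)."""
    exchanges = 0
    for a in children:
        if not a["objs"]:
            continue
        # entry farthest from a's representative (Algorithm 5, line 3)
        far = max(a["objs"], key=lambda o: d(a["rep"], o))
        for b in children:
            if b is a:
                continue
            # b covers the entry and has room for it (line 5)
            if d(b["rep"], far) <= b["radius"] and len(b["objs"]) < b["cap"]:
                a["objs"].remove(far)            # line 6: move the entry
                b["objs"].append(far)
                # line 8: shrink a's covering radius to its new farthest object
                a["radius"] = max((d(a["rep"], o) for o in a["objs"]),
                                  default=0.0)
                exchanges += 1
                break
    return exchanges
```

Moving a node's farthest entry into a sibling that already covers it can only shrink the donor's radius, which is exactly what reduces the overlap between the two subtrees.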

where Hist() is an equi-width histogram of the distances among pairs of objects of the dataset.

The histogram can be computed by calculating the average number of distances falling in the range defined by each histogram bin, for every object in the dataset or only for a small unbiased sample of it. Thereafter, to calculate the expected number of disk accesses (DA) for any Rq, it is sufficient to sum the above probabilities over all N nodes of the DBM-tree, as:

DA(Rq(sq, rq)) = Σ_{i=1}^{N} Hist(RPtri + rq)    (3)

The cost to maintain the histogram is low, and it requires only a small amount of main memory. Moreover, if it is computed over a fixed-size sample of the database, the cost of building it grows linearly with the database size, making the approach scalable.
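The histogram-based estimate of Formulas (2) and (3) can be sketched as follows. The equi-width binning scheme, the pairwise-distance sample and all function names are illustrative assumptions; Hist(x) is read as the cumulative fraction of pairwise distances not exceeding x.

```python
import math

def build_histogram(sample, d=math.dist, bins=100):
    """Equi-width histogram of all pairwise distances in an unbiased sample."""
    dists = [d(a, b) for i, a in enumerate(sample) for b in sample[i + 1:]]
    dmax = max(dists)
    width = dmax / bins
    counts = [0] * bins
    for x in dists:
        counts[min(int(x / width), bins - 1)] += 1
    return counts, len(dists), width

def hist(h, x):
    """Hist(x): estimated probability that a pairwise distance is <= x."""
    counts, total, width = h
    full = min(int(x / width), len(counts))   # coarse: counts whole bins only
    return sum(counts[:full]) / total

def estimate_disk_accesses(node_radii, rq, h):
    """Formula (3): DA(Rq(sq, rq)) = sum over the N nodes i of Hist(R_i + rq)."""
    return sum(hist(h, r + rq) for r in node_radii)
```

Each term of the sum is one node's access probability, so the estimate grows with both the node radii and the query radius, as expected from Formula (1).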


5 Experimental Evaluation of the DBM-tree

The performance evaluation of the DBM-tree was done with a large assortment of real and synthetic datasets with varying properties that affect the behavior of a MAM. Among these properties are the embedded dimensionality of the dataset, the dataset size and the distribution of the data in the metric space. Table 1 presents some illustrative datasets used to evaluate the DBM-tree performance. The dataset name is indicated with its total number of objects (# Objs.), the embedding dimensionality of the dataset (E), the page size in KBytes (Pg), and the composition and source description of each dataset. The multi-dimensional datasets use the Euclidean distance L2, and the MedHisto dataset uses the metric-histogram Mhisto distance [18].

The DBM-tree was compared with the Slim-tree and the M-tree, which are the best-known and most used dynamic MAM. The Slim-tree and the M-tree were configured using their best recommended setup: minDist for the ChooseSubtree() algorithm, minMax for the split algorithm and minimal occupation set to 25% of node capacity. The results for the Slim-tree were measured after the execution of the Slim-Down() optimization algorithm.

We tested the DBM-tree considering four distinct configurations, to evaluate its available options. The tested configurations are the following:

• DBM-MM: minDist for the ChooseSubtree() algorithm, minMax for the SplitNode() algorithm and minimal occupation set to 30% of node capacity;

• DBM-MS: equal to DBM-MM, except using the option minSum for the SplitNode() algorithm;

• DBM-GMM: minGDist for ChooseSubtree(), minMax for SplitNode();

• DBM-2CL: minGDist for ChooseSubtree(), 2-Clusters for SplitNode().

All measurements were performed after the execution of the Shrink() algorithm.

The computer used for the experiments had an Intel Pentium III 800MHz processor with 512 MB of RAM and 80 GB of disk space, running the Linux operating system. The DBM-tree, the Slim-tree and the M-tree were implemented in C++ within the Arboretum MAM library (www.gbdi.icmc.usp.br/arboretum), all with the same code optimization, to obtain a fair comparison.

From each dataset, 500 objects were extracted to be used as query centers. They were chosen randomly from the dataset, and half of them (250) were removed from the dataset before creating the trees. The other half were copied to the query set but kept in the set of objects inserted in the trees. Hence, half of the query set belongs to the dataset indexed by the MAM and the other half does not, allowing the evaluation of queries with centers both indexed and not indexed. However, as the query centers are in fact objects of the original dataset, the query set closely follows the queries expected to be posed by a real application. Each dataset was used to build one tree of each type, creating a total of thirty trees. Each tree was built inserting one object at a time, counting the average number of distance calculations, the average number of disk accesses and measuring the total building time (in seconds). In the graphs showing results from the query evaluations, each point corresponds to performing 500 queries with the same parameters but varying query centers. The number k for the kNNq queries varied from 2 to 20 for each measurement, and the radius varied from 0.01% to 10% of the largest distance between pairs of objects in the dataset, because this is the most meaningful range of radii asked when performing similarity queries. The Rq graphs use a log scale for the radius abscissa, to emphasize the most relevant


Table 1: Description of the synthetic and real-world datasets used in the experiments.

Name        # Objs.  E    Pg   Description
Cities      5,507    2    1    Geographical coordinates of the Brazilian cities (www.ibge.gov.br).
ColorHisto  68,040   32   8    Color image histograms from the KDD repository of the University of California at Irvine (http://kdd.ics.uci.edu). The metric returns the distance between two objects in a 32-d Euclidean space.
MedHisto    4,247    -    4    Metric histograms of medical gray-level images. This dataset is adimensional and was generated at GBDI-ICMC-USP. For more details on this dataset and the metric used, see [18].
Synt16D     10,000   16   8    Synthetic clustered dataset of 16-dimensional vectors normally distributed (with σ=0.1) in 10 clusters over the unit hypercube. The process used to generate this dataset is described in [9].
Synt256D    20,000   256  128  Similar to Synt16D, but with 20 clusters (with σ=0.001) in a 256-d hypercube.

part of the graph.

5.1 Evaluating the tree building process

The building time and the maximum height were measured for every tree. The building times of the 6 trees were similar for each dataset. It is interesting to compare the maximum height of the various DBM-tree options and the balanced trees, so they are summarized in Table 2.

The maximum heights of the DBM-MM and DBM-MS trees were greater than those of the balanced trees for every dataset. The biggest difference was on ColorHisto, where they achieved a height of 10 levels, compared to only 4 levels for the Slim-tree and the M-tree. However, as the other experiments show, this greater height does not increase the number of disk accesses. In fact, those DBM-trees performed, on average, fewer disk accesses than the Slim-tree and M-tree, as shown in the next subsection.

It is worth noting that, although the DBM-GMM trees do not enforce height-balance, their maximum height was equal or very close to that of the Slim-tree and the M-tree. This interesting result corroborates our claim that height-balance is not as important for MAM as it is for overlap-free structures.

The data distribution in the levels of a DBM-tree is shown using the Cities dataset. This visualization was generated with the MAMView system [6], a tool to visualize similarity queries and MAM behavior, making it possible to explore metric trees. The visualization is possible because this dataset is in a bi-dimensional Euclidean space. Figure 2 shows the objects indexed in the DBM-MM, with each color representing objects at a different level; darker colors indicate objects in deeper levels. Figure 2(a) shows the objects and the covering radius of each node, and Figure 2(b) shows only the objects. The figure shows that the depth of the tree is larger in higher-density regions and that objects are stored in every level of the structure, as expected. It also shows visually that the depth of the tree is smaller in low-density regions, and that the number of objects in the deepest levels is small, even in high-density regions.

5.2 Performance of query execution

We present the results obtained comparing the DBM-tree with the best setups of the Slim-tree and the M-tree. In this paper we present the results from four meaningful datasets (ColorHisto, MedHisto, Synt16D and Synt256D), which are either high-dimensional or non-dimensional (metric) datasets and give a fair sample of what happened. The main motivation of these experiments


Table 2: Maximum height of the tree for each dataset tested.

Name       Cities  ColorHisto  MedHisto  Synt16D  Synt256D
M-tree     4       4           4         3        3
Slim-tree  4       4           4         3        3
DBM-MM     7       10          9         6        7
DBM-MS     7       10          11        6        6
DBM-GMM    4       4           5         3        3
DBM-2CL    4       4           5         3        4

Figure 2: Visualization of the DBM-MM structure for the Cities dataset: (a) with the covering radii of the nodes; and (b) only the objects. It is possible to verify that the structure is deeper (darker objects) in high-density regions and shallower (lighter objects) in low-density regions.

is to evaluate the DBM-tree performance against its best competitors with respect to the two main similarity query types: range (Rq) and k-nearest neighbors (kNNq).

Figure 4 shows the measurements to answer Rq and kNNq on these 4 datasets. The graphs in the first column (Figures 4(a), (d), (g) and (j)) show the average number of distance calculations. They show that every DBM-tree executed on average fewer distance calculations than the Slim-tree and the M-tree. Among all, the DBM-MS presented the best result for almost every dataset. No DBM-tree executed more distance calculations than the Slim-tree or the M-tree for any dataset.

The graphs also show that the DBM-tree reduces the average number of distance calculations by up to 67% for Rq (graph (g)) and up to 37% for kNNq (graph (j)) when compared to the Slim-tree. When compared to the M-tree, the DBM-tree achieves reductions of up to 72% for Rq (graph (g)) and up to 41% for kNNq (graph (j)).

The graphs in the second column (Figures 4(b), (e), (h) and (k)) show the average number of disk accesses for both Rq and kNNq queries. In every measurement the DBM-trees clearly outperformed the Slim-tree and the M-tree with respect to the number of disk accesses. The graphs show that the DBM-tree reduces the average number of disk accesses by up to 43% for Rq


Figure 3: Visualization of the Slim-tree structure for the Cities dataset: (a) with the covering radii of the nodes; and (b) only the objects. It is possible to verify that the structure has the same number of levels (4) in high-density and low-density regions.

(graph (h)) and up to 53% for kNNq (graph (k)) when compared to the Slim-tree. It is important to note that the Slim-tree in general requires the lowest number of disk accesses among all previously published MAM, and these measurements were taken after the execution of its Slim-Down() algorithm. When compared to the M-tree, the gain is even larger, increasing to up to 54% for Rq (graph (h)) and up to 66% for kNNq (graph (k)).

The results improve as the dimensionality and the number of clusters of the datasets increase (as shown for the Synt16D and Synt256D datasets). The main reason is that traditional MAM produce highly overlapping areas on these datasets, due both to the high dimensionality and to the need to fit the objects in the inter-cluster regions together with the objects in the clusters. The DBM-tree achieves very good performance on high-dimensional datasets and on datasets with non-uniform distribution (a common situation in real-world datasets).

An important observation is that the immediate result of reducing the overlap between nodes is a reduced number of distance calculations. However, the number of disk accesses in a MAM is also related to the overlap between subtrees. An immediate consequence of this fact is that decreasing the overlap reduces both the number of distance calculations and the number of disk accesses to answer both types of similarity queries. These two benefits add up to reduce the total processing time of queries.

The graphs in the third column (Figures 4(c), (f), (i) and (l)) show the total processing time (in seconds). As the four DBM-trees performed fewer distance calculations and disk accesses than both the Slim-tree and the M-tree, they are naturally faster to answer both Rq and kNNq. The importance of comparing query time is that it reflects the total complexity of the algorithms beyond the number of distance calculations and disk accesses. The graphs show that the DBM-tree is up to 44% faster than the Slim-tree to answer Rq and kNNq (graphs (i) and (l)). When compared to the M-tree, the reduction in total query time is even larger, with the DBM-tree being up to 50% faster for Rq and kNNq queries (graphs


Figure 4: Comparison of the average number of distance calculations (first column), average number of disk accesses (second column) and total processing time in seconds (third column) of the DBM-tree, Slim-tree and M-tree, for Rq and kNNq queries on the ColorHisto ((a), (b) and (c), Rq), MedHisto ((d), (e) and (f), kNNq), Synt16D ((g), (h) and (i), Rq) and Synt256D ((j), (k) and (l), kNNq) datasets.


(i) and (l)).

5.3 Experiments regarding the Shrink() algorithm

The experiments to evaluate the improvement achieved by the Shrink() algorithm were performed on the four DBM-trees over all datasets shown in Table 1. As the results for all the datasets were similar, Figure 5 shows only the number of disk accesses on the ColorHisto (Figures 5(a) for Rq and (b) for kNNq) and Synt256D datasets (Figures 5(c) for Rq and (d) for kNNq).

Figure 5 compares the query performance before and after the execution of the Shrink() algorithm for DBM-MM, DBM-MS, DBM-GMM and DBM-2CL, for both Rq and kNNq. Every graph shows that the Shrink() algorithm improves the final trees. The most expressive result occurs for the DBM-GMM indexing Synt256D, which achieved up to 40% fewer disk accesses for kNNq and Rq compared with the same structure before optimization.

5.4 Cost of Disk Accesses in the DBM-tree

This experiment evaluates the cost model used to estimate the number of disk accesses of query operations. Only 10% of the dataset objects were employed to build the histograms Hist, as a larger number of objects only slightly improves the estimation.

Figure 6 shows the predicted values obtained from Formula (3) and the real measurements obtained executing the queries on the tree. Here we show only the experiments for the DBM-MM on MedHisto (Figure 6(a)), the DBM-MS on Synt16D (Figure 6(b)), the DBM-GMM on ColorHisto (Figure 6(c)) and the DBM-2CL on Synt256D (Figure 6(d)), as the others are similar. The real measurements are the average of 500 queries, as before, and the error bars indicate the standard deviation of each measure. The proposed formula is quite accurate, showing errors within 1% of the real measurement for the DBM-GMM and within 20% for the DBM-MS. The estimation is always within the range of the standard deviation.

5.5 Scalability of the DBM-tree

This experiment evaluated the behavior of the DBM-tree with respect to the number of elements stored in the dataset. For the experiment, we generated 20 datasets similar to Synt16D, each one with 50,000 elements. We inserted all 20 datasets in the same tree, totaling 1,000,000 elements. After inserting each dataset we ran the Shrink() algorithm and posed the same sets of 500 similarity queries for each point in the graph, as before. The behavior was equivalent for different values of k and radius, so we present only the results for k=15 and radius=0.1%.

Figure 7 presents the behavior of the four DBM-trees considering the average number of distance calculations for kNNq (a) and for Rq (b), the average number of disk accesses for kNNq (c) and for Rq (d), and the total processing time for kNNq (e) and for Rq (f). As can be seen, the DBM-trees exhibit linear behavior with respect to the number of indexed elements, which makes the method adequate to index very large datasets in any of its configurations.

6 Conclusions and Future Work

This paper presented a new dynamic MAM called the DBM-tree (Density-Based Metric tree) that, in a controlled way, relaxes the height-balancing requirement of access methods, trading a controlled amount of unbalance at denser regions of the dataset for reduced overlap between subtrees. It is the first dynamic MAM that makes it possible to reduce the overlap between nodes by relaxing the rigid balancing of the structure. The height of the tree is larger in denser regions, in order to keep a tradeoff between breadth-searching and depth-searching. The options to define how to construct a tree


Figure 5: Average number of disk accesses to perform Rq and kNNq queries in the DBM-tree before and after the execution of the Shrink() algorithm: (a) Rq on ColorHisto, (b) kNNq on ColorHisto, (c) Rq on Synt256D, (d) kNNq on Synt256D.

and the optimization possibilities of the DBM-tree are larger than those of rigidly balanced trees, because it is possible to adjust the tree according to the data distribution at different regions of the data space. Therefore, this paper also presented a new optimization algorithm, called Shrink, which improves the performance of existing trees by reorganizing the elements among their nodes.

The experiments performed over synthetic and real datasets showed that the DBM-tree outperforms the main balanced structures existing so far: the Slim-tree and the M-tree. On average, it is up to 50% faster than the traditional MAM and reduces the number of required distance calculations by up to 72% when answering similarity queries. The DBM-tree performs fewer disk accesses than the Slim-tree, which until now was the most efficient MAM with respect to disk accesses, requiring up to 66% fewer disk accesses than the balanced trees. After applying the Shrink() algorithm, performance improves by up to 40% for range and k-nearest neighbor queries considering disk accesses. It was also shown that the DBM-tree scales up very well with respect to the number of indexed elements, presenting linear behavior, which makes it well-suited to very large datasets.

Among the future works, we intend to develop a bulk-loading algorithm for the DBM-tree. As the construction possibilities of the DBM-tree are larger than those of the balanced structures, a bulk-loading algorithm can employ strategies that achieve better performance than is possible in other trees. Another future work is to develop an object-deletion algorithm that can really remove objects from the tree. All existing rigidly balanced MAM, such as the Slim-tree and the M-tree, cannot effectively delete objects being used as representatives, so these objects are just marked as removed, without releasing the space they occupy. Moreover, they remain in use in the comparisons required by the search operations. The organizational structure of the DBM-tree enables the effective deletion of objects, making it a completely dynamic MAM.

Figure 6: Comparison of the real and the estimated number of disk accesses for Rq on the (a) MedHisto dataset using a DBM-MM tree, (b) Synt16D using a DBM-MS, (c) ColorHisto using a DBM-GMM and (d) Synt256D using a DBM-2CL.

References

[1] Ricardo A. Baeza-Yates, Walter Cunto, Udi Manber, and Sun Wu. Proximity matching using fixed-queries trees. In 5th Annual Symposium on Combinatorial Pattern Matching (CPM), volume 807 of LNCS, pages 198–212, Asilomar, USA, 1994. Springer Verlag.

[2] Tolga Bozkaya and Meral Özsoyoglu. Distance-based indexing for high-dimensional metric spaces. In Proceedings of the ACM International Conference on Management of Data (SIGMOD), pages 357–368, 1997.

[3] Tolga Bozkaya and Meral Özsoyoglu. Indexing large metric spaces for similarity search queries. ACM Transactions on Database Systems (TODS), 24(3):361–404, sep 1999.

Figure 7: Scalability of the DBM-tree with respect to the dataset size, executing kNNq queries ((a), (c) and (e)) and Rq queries ((b), (d) and (f)), measuring the average number of distance calculations ((a) and (b)), the average number of disk accesses ((c) and (d)) and the total processing time ((e) and (f)). The indexed dataset was Synt16D with 1,000,000 objects.

[4] Sergey Brin. Near neighbor search in large metric spaces. In Proceedings of the International Conference on Very Large Data Bases (VLDB), pages 574–584, Zurich, Switzerland, 1995. Morgan Kaufmann.

[5] W. A. Burkhard and R. M. Keller. Some approaches to best-match file searching. Communications of the ACM, 16(4):230–236, apr 1973.

[6] Fabio J. T. Chino, Marcos R. Vieira, Agma J. M. Traina, and Caetano Traina Jr. MAMView: A visual tool for exploring and understanding metric access methods. In Proceedings of the 20th Annual ACM Symposium on Applied Computing (SAC), 6 pp., Santa Fe, New Mexico, USA, 2005. ACM Press.

[7] Edgar Chávez, Gonzalo Navarro, Ricardo Baeza-Yates, and José Luis Marroquín. Searching in metric spaces. ACM Computing Surveys (CSUR), 33(3):273–321, sep 2001.

[8] P. Ciaccia, M. Patella, and P. Zezula. A cost model for similarity queries in metric spaces. In ACM Symposium on Principles of Database Systems (PODS), pages 59–68, 1998.

[9] Paolo Ciaccia, Marco Patella, and Pavel Zezula. M-tree: An efficient access method for similarity search in metric spaces. In Proceedings of the International Conference on Very Large Data Bases (VLDB), pages 426–435, Athens, Greece, 1997. Morgan Kaufmann.

[10] Roberto F. Santos Filho, Agma J. M. Traina, Caetano Traina Jr., and Christos Faloutsos. Similarity search without tears: The OMNI family of all-purpose access methods. In IEEE International Conference on Data Engineering (ICDE), pages 623–630, Heidelberg, Germany, 2001.

[11] Volker Gaede and Oliver Günther. Multidimensional access methods. ACM Computing Surveys (CSUR), 30(2):170–231, 1998.

[12] A. Guttman. R-trees: A dynamic index structure for spatial searching. In ACM International Conference on Data Management (SIGMOD), pages 47–57, Boston, USA, 1984.

[13] Gisli R. Hjaltason and Hanan Samet. Index-driven similarity search in metric spaces. ACM Transactions on Database Systems (TODS), 28(4):517–580, dec 2003.

[14] Caetano Traina Jr., Agma J. M. Traina, Christos Faloutsos, and Bernhard Seeger. Fast indexing and visualization of metric datasets using slim-trees. IEEE Transactions on Knowledge and Data Engineering (TKDE), 14(2):244–260, 2002.

[15] Caetano Traina Jr., Agma J. M. Traina, Bernhard Seeger, and Christos Faloutsos. Slim-trees: High performance metric trees minimizing overlap between nodes. In International Conference on Extending Database Technology (EDBT), volume 1777 of LNCS, pages 51–65, Konstanz, Germany, 2000. Springer.

[16] K. A. Ross, I. Sitzmann, and P. J. Stuckey. Cost-based unbalanced R-trees. In IEEE International Conference on Scientific and Statistical Database Management (SSDBM), pages 203–212, 2001.

[17] Y. Theodoridis, E. Stefanakis, and T. K. Sellis. Efficient cost models for spatial queries using R-trees. IEEE Transactions on Knowledge and Data Engineering (TKDE), 12(1):19–32, 2000.

[18] Agma J. M. Traina, Caetano Traina Jr., Josiane M. Bueno, and Paulo M. de A. Marques. The metric histogram: A new and efficient approach for content-based image retrieval. In Sixth IFIP Working Conference on Visual Database Systems (VDB), Brisbane, Australia, 2002.

[19] Jeffrey K. Uhlmann. Satisfying general proximity/similarity queries with metric trees. Information Processing Letters, 40(4):175–179, 1991.

[20] Marcos R. Vieira, Caetano Traina Jr., Fabio J. T. Chino, and Agma J. M. Traina. DBM-tree: A dynamic metric access method sensitive to local density data. In XIX Brazilian Symposium on Databases (SBBD), pages 163–177, Brasília, Brazil, 2004.

[21] Peter N. Yianilos. Data structures and algorithms for nearest neighbor search in general metric spaces. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 311–321, Austin, USA, 1993.