Scalable and Distributed Similarity Search in Metric Spaces

Michal Batko
Claudio Gennaro
Pavel Zezula

Presentation contents

Motivation
Metric spaces and similarity searching
GHT*
  Concepts
  Generalized Hyperplane Tree
  Distributed architecture
Experimental results
Conclusions and future work

Motivation

Searching is a fundamental problem
Traditional search
  Numbers or strings
  Based on total linear order of keys
New approach
  Free text, images, audio, video, etc.
  Impossible to structure in keys and records

Alternative

Metric spaces

Similarity searching

Metric space

Set of objects (A)
  any class of objects, which allows distance computing
  for example text, audio or video files
Metric function (d)
  positive: $\forall x, y \in A: d(x, y) \ge 0$
  reflexive: $\forall x \in A: d(x, x) = 0$
  symmetric: $\forall x, y \in A: d(x, y) = d(y, x)$
  triangle inequality: $\forall x, y, z \in A: d(x, y) + d(y, z) \ge d(x, z)$
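The four properties can be checked numerically on a finite sample of objects. The sketch below is only an illustration: the function names `euclidean` and `check_metric`, the tolerance, and the sample points are assumptions, not part of the presentation. Passing the check on a sample does not prove that a function is a metric; failing it shows that it is not.

```python
import itertools
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.dist(x, y)


def check_metric(objects, d, tol=1e-12):
    """Return True if d satisfies the metric properties on the given sample."""
    # Reflexivity: d(x, x) = 0
    for x in objects:
        if abs(d(x, x)) > tol:
            return False
    # Positivity and symmetry: d(x, y) >= 0 and d(x, y) = d(y, x)
    for x, y in itertools.product(objects, repeat=2):
        if d(x, y) < 0 or abs(d(x, y) - d(y, x)) > tol:
            return False
    # Triangle inequality: d(x, y) + d(y, z) >= d(x, z)
    for x, y, z in itertools.product(objects, repeat=3):
        if d(x, z) > d(x, y) + d(y, z) + tol:
            return False
    return True


def squared_euclidean(x, y):
    """Squared Euclidean distance; used here as a non-metric counterexample."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


sample = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0), (-2.0, 5.0)]
print(check_metric(sample, euclidean))
print(check_metric(sample, squared_euclidean))
```

The squared Euclidean distance is included because it is positive, reflexive and symmetric but breaks the triangle inequality, so it cannot be used with index structures that rely on that property for pruning.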

Similarity searching

Range search
  objects at max distance r from object Q
k-nearest neighbor search
  k nearest neighbor objects of object Q

[Figure: a range query of radius r around Q, and the nearest neighbors of Q numbered 1 to 4]
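Both query types can be stated as a linear scan over the data set, which is the baseline any metric index tries to beat. This is a minimal sketch, assuming 2-D points with Euclidean distance; the function names and the sample data are illustrative only.

```python
import heapq
import math


def range_search(objects, q, r, d):
    """Return every object whose distance from q is at most r."""
    return [o for o in objects if d(q, o) <= r]


def knn_search(objects, q, k, d):
    """Return the k objects closest to q, nearest first."""
    return heapq.nsmallest(k, objects, key=lambda o: d(q, o))


data = [(0, 0), (1, 0), (0, 2), (3, 3), (5, 5)]
query = (0, 0)

in_range = range_search(data, query, 2, math.dist)
nearest = knn_search(data, query, 2, math.dist)
print(in_range)  # [(0, 0), (1, 0), (0, 2)]
print(nearest)   # [(0, 0), (1, 0)]
```

Each query costs one distance computation per stored object, which is why the number of distance computations is the natural cost measure for the experiments later in the presentation.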

GHT* – concepts

Data distributed among servers
  Multiple buckets with limited capacity
Clients perform updates and search
Bucket location algorithm
  Based on DDH and DST algorithms
  Exploits Generalized Hyperplane Tree

Generalized Hyperplane Tree

Single-site metric space indexing structure
Allows similarity searching and is scalable
Binary search tree
  Data stored in leaf nodes
  Inner nodes for routing
  Two "pivots" per node

[Figure: example Generalized Hyperplane Tree over objects p1 to p14, with p2 and p5 appearing as pivots]
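The structure described on this slide can be sketched as a small in-memory tree. The code below is an illustration under stated assumptions, not the authors' implementation: the class name `GHTNode`, the bucket capacity, and the pivot selection (first object in the bucket plus the object farthest from it) are all hypothetical choices, and objects are assumed to be distinct. The pruning test in `range_search` is the standard generalized-hyperplane rule derived from the triangle inequality.

```python
import math
import random


class GHTNode:
    """A leaf holds a bucket; an inner node holds two pivots and two children."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.bucket = []      # data, used only while the node is a leaf
        self.pivots = None    # (p1, p2) once the node becomes an inner node
        self.left = None      # objects with d(o, p1) <= d(o, p2)
        self.right = None     # objects with d(o, p1) >  d(o, p2)

    def is_leaf(self):
        return self.pivots is None

    def insert(self, obj, dist):
        # Route through inner nodes by comparing distances to the two pivots.
        node = self
        while not node.is_leaf():
            p1, p2 = node.pivots
            node = node.left if dist(obj, p1) <= dist(obj, p2) else node.right
        node.bucket.append(obj)
        if len(node.bucket) > node.capacity:
            node._split(dist)

    def _split(self, dist):
        # Assumed pivot choice: the first object and the one farthest from it.
        p1 = self.bucket[0]
        p2 = max(self.bucket[1:], key=lambda o: dist(p1, o))
        self.pivots = (p1, p2)
        self.left = GHTNode(self.capacity)
        self.right = GHTNode(self.capacity)
        for o in self.bucket:
            child = self.left if dist(o, p1) <= dist(o, p2) else self.right
            child.bucket.append(o)
        self.bucket = []

    def range_search(self, q, r, dist):
        if self.is_leaf():
            return [o for o in self.bucket if dist(q, o) <= r]
        p1, p2 = self.pivots
        d1, d2 = dist(q, p1), dist(q, p2)
        found = []
        # A subtree is skipped only when the triangle inequality rules it out.
        if d1 - r <= d2 + r:
            found += self.left.range_search(q, r, dist)
        if d1 + r >= d2 - r:
            found += self.right.range_search(q, r, dist)
        return found


rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(200)]
tree = GHTNode(capacity=8)
for p in points:
    tree.insert(p, math.dist)

query, radius = (0.5, 0.5), 0.1
found = tree.range_search(query, radius, math.dist)
brute = [p for p in points if math.dist(query, p) <= radius]
print(sorted(found) == sorted(brute))
```

The final comparison against a linear scan is a sanity check that pruning never discards a qualifying object; the benefit of the tree is that many leaves are never visited.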

GHT* – distributed architecture

GHT is used as search structure
  Leaf node represents a server
    • unique server identifier
    • servers extend the tree with leaf nodes for their local buckets
  Inner nodes store routing information
GHT is replicated
  GHT can be inaccurate
  Update (image adjustment) messages
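The interplay of a replicated, possibly inaccurate tree and image adjustment messages can be illustrated with a toy simulation. The slides do not specify the protocol, so everything here is an assumption: objects are plain numbers with absolute difference as the metric, the names `Leaf`, `Inner`, `locate` and `Client` are hypothetical, a misaddressed request is forwarded exactly once, and the adjustment replaces the client's whole image rather than only the outdated part.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    server_id: int          # leaf node represents a server


@dataclass
class Inner:
    p1: float               # routing information: two pivots
    p2: float
    left: "Node"
    right: "Node"


Node = Union[Leaf, Inner]


def locate(tree, obj, dist):
    """Walk the tree by pivot distances and return the responsible server."""
    node = tree
    while isinstance(node, Inner):
        node = node.left if dist(obj, node.p1) <= dist(obj, node.p2) else node.right
    return node.server_id


class Client:
    def __init__(self, image):
        self.image = image  # the client's replica of the tree, possibly stale

    def send(self, obj, true_tree, dist):
        """Return (server that stores obj, number of forwards needed)."""
        guessed = locate(self.image, obj, dist)
        actual = locate(true_tree, obj, dist)
        forwards = 0 if guessed == actual else 1
        if forwards:
            # Simplified image adjustment: adopt the up-to-date tree.
            self.image = true_tree
        return actual, forwards


def absdiff(a, b):
    return abs(a - b)


# Server 2 has split and handed part of its data to server 3,
# but the client still holds the tree from before the split.
true_tree = Inner(0.0, 10.0, Leaf(1), Inner(10.0, 20.0, Leaf(2), Leaf(3)))
stale_image = Inner(0.0, 10.0, Leaf(1), Leaf(2))

client = Client(stale_image)
first = client.send(19.0, true_tree, absdiff)
second = client.send(18.0, true_tree, absdiff)
print(first, second)  # (3, 1) (3, 0)
```

The first request is misaddressed and costs one forward; after the adjustment the client addresses server 3 directly, which is the "learn from misaddressing" behavior named in the conclusions.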

Experimental results – inserting

Preliminary phase
Tests for vector space with Euclidean distance function

10000 objects         min     max     avg
Occupied buckets      56      68      62.4
Occupied servers      7       9       8.07
Overall bucket load   58.8    71.4    64.3
Maximal tree depth    16      26      20.4
Replication           3.9%    5.9%    5%

Experimental results – searching

[Figure: Range searches, distance computations, client vs. server (axis 0 to 16)]
[Figure: Range searches, messages sent, client vs. server (axis 0 to 3)]

20 range queries with radius 50 points (match approx. 3 objects)

Conclusions

First structure for scalable distributed similarity search
Satisfies properties of SDDS
  Scalability – can expand to new servers through autonomous splits
  No hot-spot – all clients use as precise addressing as possible and learn from misaddressing
  Updates are local and never require updates to multiple clients
Client performs only a few distance computations to locate servers

Future work

More experiments
  Different metric spaces
  More complex evaluation
  Additional evaluated properties
Nearest neighbor search
  Algorithm for parallel processing to better utilize distributed structure
  Experimental evaluation

Questions?