Scalable and Distributed Similarity Search in Metric Spaces Michal Batko Claudio Gennaro Pavel Zezula

Post on 20-Dec-2015


Scalable and Distributed Similarity Search in Metric Spaces

Michal Batko, Claudio Gennaro, Pavel Zezula


Presentation contents

Motivation
Metric spaces and similarity searching
GHT*
  • Concepts
  • Generalized Hyperplane Tree
  • Distributed architecture
Experimental results
Conclusions and future work


Motivation

Searching is a fundamental problem
Traditional search
  • Numbers or strings
  • Based on total linear order of keys
New approach
  • Free text, images, audio, video, etc.
  • Impossible to structure in keys and records


Alternative

Metric spaces

Similarity searching


Metric space

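Set of objects (A)
  • any class of objects which allows distance computing
  • for example text, audio or video files
Metric function (d)
  • positive: $\forall x, y \in A:\ d(x, y) \ge 0$
  • reflexive: $\forall x \in A:\ d(x, x) = 0$
  • symmetric: $\forall x, y \in A:\ d(x, y) = d(y, x)$
  • triangle inequality: $\forall x, y, z \in A:\ d(x, y) + d(y, z) \ge d(x, z)$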
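The four postulates above can be checked numerically on a finite sample. The following is a minimal sketch, not part of the original slides: the function names `euclidean` and `check_metric`, the tolerance `tol`, and the sample points are all illustrative assumptions. Passing the check on a sample does not prove that a function is a metric; failing it proves that it is not.

```python
import itertools
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def check_metric(objects, d, tol=1e-9):
    """Return True if d satisfies the four metric postulates on the sample."""
    for x in objects:
        if abs(d(x, x)) > tol:                     # reflexive: d(x, x) = 0
            return False
    for x, y in itertools.product(objects, repeat=2):
        if d(x, y) < -tol:                         # positive: d(x, y) >= 0
            return False
        if abs(d(x, y) - d(y, x)) > tol:           # symmetric
            return False
    for x, y, z in itertools.product(objects, repeat=3):
        if d(x, y) + d(y, z) < d(x, z) - tol:      # triangle inequality
            return False
    return True

# Hypothetical sample: three collinear points and one off the line.
sample = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (1.0, -2.0)]

def squared(x, y):
    """Squared Euclidean distance, which is not a metric."""
    return euclidean(x, y) ** 2

print(check_metric(sample, euclidean))
print(check_metric(sample, squared))
```

The squared Euclidean distance is included as a counterexample: on the three collinear points it gives 25 + 25 < 100, which violates the triangle inequality.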


Similarity searching

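Range search
  • objects at max distance r from object Q
k-nearest neighbor search
  • k nearest neighbor objects of object Q

[Figure: a range query of radius r around Q, and a k-nearest neighbor query with neighbors 1 to 4 around Q]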
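Both query types can be stated as a linear scan over the data set, which is the baseline any index must improve on. The sketch below is an illustration only: the names `range_search` and `knn_search` and the one-dimensional example data are assumptions, and any metric function `d` could be substituted.

```python
import heapq

def range_search(objects, d, q, r):
    """Return all objects whose distance from q is at most r."""
    return [o for o in objects if d(q, o) <= r]

def knn_search(objects, d, q, k):
    """Return the k objects closest to q, nearest first."""
    return heapq.nsmallest(k, objects, key=lambda o: d(q, o))

def abs_diff(x, y):
    """A simple metric on numbers, used here only for illustration."""
    return abs(x - y)

data = [1, 4, 6, 9, 15]
in_range = range_search(data, abs_diff, q=5.5, r=2)
nearest = knn_search(data, abs_diff, q=5.5, k=3)
print(in_range)   # [4, 6]
print(nearest)    # [6, 4, 9]
```

Each query costs one distance computation per stored object. Reducing that count is what the index structures on the following slides aim at.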


GHT* – concepts

Data distributed among servers
  • Multiple buckets with limited capacity
Clients perform updates and search
Bucket location algorithm
  • Based on DDH and DST algorithms
  • Exploits Generalized Hyperplane Tree


Generalized Hyperplane Tree

Single-site metric space indexing structure
Allows similarity searching and is scalable
Binary search tree
  • Data stored in leaf nodes
  • Inner nodes for routing
  • Two "pivots" per node

[Figure: fourteen objects p1 to p14 partitioned by generalized hyperplanes, and the corresponding binary tree with pivot pairs in the inner nodes and the objects in the leaves]
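A single-site tree of this kind can be sketched in a few dozen lines. The slides state only that the tree is binary, keeps data in the leaves, and routes with two pivots per inner node. Everything else below is an assumption made for illustration: the class names, the bucket `capacity`, the rule of taking the first object and the first object distinct from it as pivots, and the random test data.

```python
import math
import random

class Leaf:
    def __init__(self, objects=None):
        self.objects = list(objects or [])

class Inner:
    def __init__(self, p1, p2, left, right):
        self.p1, self.p2, self.left, self.right = p1, p2, left, right

class GHT:
    def __init__(self, d, capacity=4):
        self.d, self.capacity, self.root = d, capacity, Leaf()

    def insert(self, o):
        parent, side, node = None, None, self.root
        while isinstance(node, Inner):
            # Route to the side of the closer pivot.
            parent = node
            side = "left" if self.d(o, node.p1) <= self.d(o, node.p2) else "right"
            node = getattr(node, side)
        node.objects.append(o)
        if len(node.objects) > self.capacity:
            new = self._split(node)
            if parent is None:
                self.root = new
            else:
                setattr(parent, side, new)

    def _split(self, leaf):
        # Assumed pivot choice: first object and the first one distinct from it.
        p1 = leaf.objects[0]
        p2 = next((o for o in leaf.objects if self.d(p1, o) > 0), None)
        if p2 is None:
            return leaf  # all objects coincide, so the bucket stays overfull
        left = [o for o in leaf.objects if self.d(o, p1) <= self.d(o, p2)]
        right = [o for o in leaf.objects if self.d(o, p1) > self.d(o, p2)]
        return Inner(p1, p2, Leaf(left), Leaf(right))

    def range_search(self, q, r):
        result, stack = [], [self.root]
        while stack:
            node = stack.pop()
            if isinstance(node, Leaf):
                result.extend(o for o in node.objects if self.d(q, o) <= r)
                continue
            d1, d2 = self.d(q, node.p1), self.d(q, node.p2)
            if d1 - r <= d2 + r:    # query ball may reach the left half
                stack.append(node.left)
            if d1 + r >= d2 - r:    # query ball may reach the right half
                stack.append(node.right)
        return result

rng = random.Random(0)
points = [(rng.uniform(0, 100), rng.uniform(0, 100)) for _ in range(200)]
tree = GHT(math.dist, capacity=4)
for p in points:
    tree.insert(p)

query, radius = (50.0, 50.0), 10.0
found = sorted(tree.range_search(query, radius))
expected = sorted(p for p in points if math.dist(query, p) <= radius)
print(found == expected)
```

The two pruning tests follow from the triangle inequality: a subtree is skipped only when no object inside the query ball can be closer to that subtree's pivot than to the other one, so the result always matches a linear scan.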


GHT* – distributed architecture

GHT is used as search structure
Leaf node represents a server
  • unique server identifier
  • servers extend the tree with leaf nodes for their local buckets
Inner nodes store routing information
GHT is replicated
GHT can be inaccurate
  • Update (image adjustment) messages
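The interplay of an inaccurate replica and image adjustment can be illustrated with a toy model. This is a sketch under strong assumptions and does not reproduce the GHT* protocol, whose message formats are not given in the slides: `server_tree` stands in for the up-to-date knowledge held on the server side, the client's `view` is assumed to be an older, shallower version of the same tree, and all names (`ServerLeaf`, `locate`, `graft`, `client_insert`) are hypothetical.

```python
import copy
from dataclasses import dataclass

@dataclass
class ServerLeaf:
    server: int          # unique server identifier

@dataclass
class Inner:
    p1: float            # first pivot
    p2: float            # second pivot
    left: object
    right: object

def d(x, y):
    """Toy metric on numbers, for illustration only."""
    return abs(x - y)

def locate(tree, o):
    """Follow the pivots from the root; return the turns taken and the leaf."""
    path, node = [], tree
    while isinstance(node, Inner):
        turn = "left" if d(o, node.p1) <= d(o, node.p2) else "right"
        path.append(turn)
        node = getattr(node, turn)
    return path, node

def graft(view, path, subtree):
    """Replace the node reached by `path` in the client's view with `subtree`."""
    if not path:
        return subtree
    parent = view
    for turn in path[:-1]:
        parent = getattr(parent, turn)
    setattr(parent, path[-1], subtree)
    return view

def client_insert(view, server_tree, o):
    """Address o with the client's view; adjust the view after misaddressing."""
    path, guess = locate(view, o)
    # The contacted server continues routing from where the client's view ended.
    node = server_tree
    for turn in path:
        node = getattr(node, turn)
    _, actual = locate(node, o)
    forwarded = actual.server != guess.server
    if forwarded:
        # Image adjustment: the client receives the part of the tree it lacked.
        view = graft(view, path, copy.deepcopy(node))
    return view, actual.server, forwarded

# The client still believes server 0 holds everything; a split has added server 1.
view = ServerLeaf(0)
server_tree = Inner(2.0, 8.0, ServerLeaf(0), ServerLeaf(1))

view, first_server, first_forwarded = client_insert(view, server_tree, 9.0)
view, second_server, second_forwarded = client_insert(view, server_tree, 7.0)
print(first_server, first_forwarded)
print(second_server, second_forwarded)
```

In this run the first insert is misaddressed and forwarded, after which the client's view contains the split, so the second insert reaches server 1 directly.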


GHT* – distributed architecture


Experimental results – inserting

Preliminary phase
Tests for vector space with Euclidean distance function

10000 objects          min     max     avg
Occupied buckets       56      68      62.4
Occupied servers       7       9       8.07
Overall bucket load    58.8    71.4    64.3
Maximal tree depth     16      26      20.4
Replication            3.9%    5.9%    5%


Experimental results – searching

[Figure: two bar charts for range searches, each comparing client and server. Left: distance computations (axis 0 to 16). Right: messages sent (axis 0 to 3).]

20 range queries with radius 50 points (match approx. 3 objects)


Conclusions

First structure for scalable distributed similarity search
Satisfies properties of SDDS
  • Scalability – can expand to new servers through autonomous splits
  • No hot-spot – all clients use as precise addressing as possible and learn from misaddressing
  • Updates are local and never require updates to multiple clients
Client performs only a few distance computations to locate servers


Future work

More experiments
  • Different metric spaces
  • More complex evaluation
  • Additional evaluated properties
Nearest neighbor search
  • Algorithm for parallel processing to better utilize distributed structure
  • Experimental evaluation

Questions?