Post on 20-Dec-2015
Scalable and Distributed Similarity Search in Metric Spaces
Michal Batko
Claudio Gennaro
Pavel Zezula
Presentation contents
• Motivation
• Metric spaces and similarity searching
• GHT*
  • Concepts
  • Generalized Hyperplane Tree
  • Distributed architecture
• Experimental results
• Conclusions and future work
Motivation
• Searching is a fundamental problem
• Traditional search
  • Numbers or strings
  • Based on total linear order of keys
• New approach
  • Free text, images, audio, video, etc.
  • Impossible to structure in keys and records
Alternative
Metric spaces
Similarity searching
Metric space
• Set of objects (A)
  • any class of objects that allows distance computing
  • for example text, audio or video files
• Metric function (d)
  • positive: $\forall x, y \in A:\ d(x, y) \ge 0$
  • reflexive: $\forall x \in A:\ d(x, x) = 0$
  • symmetric: $\forall x, y \in A:\ d(x, y) = d(y, x)$
  • triangle inequality: $\forall x, y, z \in A:\ d(x, z) \le d(x, y) + d(y, z)$
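As a small illustration of the four properties, the following sketch checks them by brute force on a finite sample of objects. The helper names (`euclidean`, `satisfies_metric_axioms`) and the sample points are assumptions made for this example, not part of the presentation; passing the check on a sample does not prove that a function is a metric, but a failure does disprove it.

```python
import math
from itertools import product


def euclidean(x, y):
    """Euclidean distance between two equal-length tuples of numbers."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def satisfies_metric_axioms(objects, d, tol=1e-9):
    """Check positivity, reflexivity, symmetry and the triangle
    inequality on every pair and triple from a finite sample."""
    for x in objects:
        if abs(d(x, x)) > tol:                      # reflexive
            return False
    for x, y in product(objects, repeat=2):
        if d(x, y) < 0:                             # positive
            return False
        if abs(d(x, y) - d(y, x)) > tol:            # symmetric
            return False
    for x, y, z in product(objects, repeat=3):
        if d(x, z) > d(x, y) + d(y, z) + tol:       # triangle inequality
            return False
    return True


def squared_euclidean(x, y):
    """Squared Euclidean distance: violates the triangle inequality."""
    return euclidean(x, y) ** 2


sample = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (1.0, 1.0)]
print(satisfies_metric_axioms(sample, euclidean))          # True
print(satisfies_metric_axioms(sample, squared_euclidean))  # False
```

The squared distance fails because, for the collinear points (0, 0), (3, 4) and (6, 8), it gives 100 for the outer pair but only 25 + 25 via the middle point.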
Similarity searching
• Range search
  • objects at max distance r from object Q
• k-nearest neighbor search
  • k nearest neighbor objects of object Q
[Figure: a range query of radius r around Q, and the nearest neighbors of Q numbered 1 to 4]
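Both query types can be written as a linear scan over the data set, which serves as a baseline that any index must agree with. This is a minimal sketch: the function names, the one-dimensional data and the absolute-difference metric are assumptions chosen to keep the example short.

```python
import heapq


def dist(x, y):
    """Assumed metric for the example: absolute difference of numbers."""
    return abs(x - y)


def range_search(objects, q, r, d=dist):
    """Return all objects at distance at most r from the query object q."""
    return [o for o in objects if d(q, o) <= r]


def knn_search(objects, q, k, d=dist):
    """Return the k objects closest to q, nearest first."""
    return heapq.nsmallest(k, objects, key=lambda o: d(q, o))


data = [1, 4, 7, 10, 15, 21]
print(range_search(data, 8, 3))  # [7, 10]
print(knn_search(data, 8, 3))    # [7, 10, 4]
```

Each call computes one distance per stored object, which is the cost that a metric index tries to reduce.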
GHT* – concepts
• Data distributed among servers
• Multiple buckets with limited capacity
• Clients perform updates and search
• Bucket location algorithm
• Based on DDH and DST algorithms
• Exploits Generalized Hyperplane Tree
Generalized Hyperplane Tree
• Single-site metric space indexing structure
• Allows similarity searching and is scalable
• Binary search tree
  • Data stored in leaf nodes
  • Inner nodes for routing
  • Two "pivots" per node
[Figure: objects p1 to p14 partitioned by pivots such as p2 and p5, and the corresponding tree with pivot pairs in inner nodes and objects in leaf buckets]
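The structure described on this slide can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the bucket capacity, the pivot choice (the first two distinct objects of an overflowing bucket) and the one-dimensional metric are all invented for the example. An object goes to the left subtree when it is at least as close to the first pivot as to the second, and a range query skips a subtree only when the triangle inequality rules it out.

```python
def dist(x, y):
    """Assumed metric for the example: absolute difference of numbers."""
    return abs(x - y)


BUCKET_CAPACITY = 4  # assumed limit, chosen only for this example


class Leaf:
    def __init__(self):
        self.bucket = []  # data lives in leaf nodes


class Inner:
    def __init__(self, p1, p2):
        self.p1, self.p2 = p1, p2  # two pivots per routing node
        self.left, self.right = Leaf(), Leaf()


def insert(node, obj):
    """Insert obj and return the (possibly replaced) subtree root."""
    if isinstance(node, Inner):
        if dist(obj, node.p1) <= dist(obj, node.p2):
            node.left = insert(node.left, obj)
        else:
            node.right = insert(node.right, obj)
        return node
    node.bucket.append(obj)
    if len(node.bucket) <= BUCKET_CAPACITY:
        return node
    # Overflow: split the bucket around two distinct pivots.
    p1 = node.bucket[0]
    p2 = next((o for o in node.bucket if dist(o, p1) > 0), None)
    if p2 is None:  # only duplicates, nothing to separate
        return node
    inner = Inner(p1, p2)
    for o in node.bucket:
        insert(inner, o)
    return inner


def range_search(node, q, r, out=None):
    """Collect all stored objects within distance r of q."""
    if out is None:
        out = []
    if isinstance(node, Leaf):
        out.extend(o for o in node.bucket if dist(q, o) <= r)
        return out
    d1, d2 = dist(q, node.p1), dist(q, node.p2)
    if d1 - r <= d2 + r:  # left side may contain matches
        range_search(node.left, q, r, out)
    if d1 + r > d2 - r:   # right side may contain matches
        range_search(node.right, q, r, out)
    return out


root = Leaf()
for value in range(0, 40, 3):  # 14 objects: 0, 3, ..., 39
    root = insert(root, value)
print(sorted(range_search(root, 17, 4)))  # [15, 18, 21]
```

Picking the first two distinct objects as pivots keeps the sketch short but produces an unbalanced tree on sorted input; a practical pivot selection strategy would matter for performance, not for correctness.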
GHT* – distributed architecture
• GHT is used as search structure
• Leaf node represents a server
  • unique server identifier
  • servers extend the tree with leaf nodes for their local buckets
• Inner nodes store routing information
• GHT is replicated
• GHT can be inaccurate
  • Update (image adjustment) messages
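The interplay between an inaccurate replica and an image adjustment message can be illustrated with a toy simulation. Everything here is hypothetical and heavily simplified: the class and method names are invented, messages are plain method calls, every server is assumed to hold a complete and accurate tree, and the adjustment simply returns that whole tree. The actual GHT* protocol is not specified on the slide and may differ in all of these respects.

```python
def dist(x, y):
    """Assumed metric for the example: absolute difference of numbers."""
    return abs(x - y)


class Leaf:
    def __init__(self, server_id):
        self.server_id = server_id  # leaf node represents a server


class Inner:
    def __init__(self, p1, p2, left, right):
        self.p1, self.p2 = p1, p2  # routing information
        self.left, self.right = left, right


def locate(view, obj):
    """Walk a (possibly outdated) tree replica to a server identifier."""
    node = view
    while isinstance(node, Inner):
        closer_to_p1 = dist(obj, node.p1) <= dist(obj, node.p2)
        node = node.left if closer_to_p1 else node.right
    return node.server_id


class Server:
    def __init__(self, server_id, tree, network):
        self.server_id, self.tree, self.network = server_id, tree, network
        self.bucket = []

    def receive(self, obj):
        """Store obj or forward it; return (final server, adjustment)."""
        target = locate(self.tree, obj)
        if target == self.server_id:
            self.bucket.append(obj)
            return target, None
        # Misaddressed request: forward it and send back a fresher tree.
        final, _ = self.network[target].receive(obj)
        return final, self.tree


class Client:
    def __init__(self, view, network):
        self.view, self.network = view, network

    def insert(self, obj):
        """Return (server contacted first, server that stored obj)."""
        first = locate(self.view, obj)
        final, adjustment = self.network[first].receive(obj)
        if adjustment is not None:
            self.view = adjustment  # learn from the misaddressing
        return first, final


network = {}
tree = Inner(0.0, 10.0, Leaf(0), Leaf(1))  # server 0 has already split
network[0] = Server(0, tree, network)
network[1] = Server(1, tree, network)
client = Client(Leaf(0), network)  # outdated view: only server 0 known

print(client.insert(9.0))  # (0, 1): misaddressed, then forwarded
print(client.insert(8.0))  # (1, 1): addressed directly after the update
```

The first insert costs an extra forwarding hop; after the adjustment the client reaches the right server directly, which is the behavior the slide attributes to update messages.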
Experimental results – inserting
• Preliminary phase
• Tests for vector space with Euclidean distance function

10000 objects          min     max     avg
Occupied buckets       56      68      62.4
Occupied servers       7       9       8.07
Overall bucket load    58.8    71.4    64.3
Maximal tree depth     16      26      20.4
Replication            3.9%    5.9%    5%
Experimental results – searching
[Figure: two bar charts for range searches, each comparing Client and Server: distance computations (axis 0 to 16) and messages sent (axis 0 to 3)]
20 range queries with radius 50 points (match approx. 3 objects)
Conclusions
• First structure for scalable distributed similarity search
• Satisfies properties of SDDS
  • Scalability – can expand to new servers through autonomous splits
  • No hot-spot – all clients use as precise addressing as possible and learn from misaddressing
  • Updates are local and never require updates to multiple clients
• Client performs only a few distance computations to locate servers
Future work
• More experiments
  • Different metric spaces
  • More complex evaluation
  • Additional evaluated properties
• Nearest neighbor search
  • Algorithm for parallel processing to better utilize distributed structure
  • Experimental evaluation
Questions?