scalability of similarity searching the mufin approach pavel zezula masaryk university brno, czech...

80
Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Upload: marjory-ross

Post on 11-Jan-2016

219 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Scalabilityof Similarity Searching

the MUFIN approach

Pavel ZezulaMasaryk University

Brno, Czech Republic

Page 2: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 2

Outline of the talk

• On the importance of similarity and searching

• Main memory and disk-oriented index structures• Parallel and distributed implementations

• Future trends:– Findability and retrieval in cloud services– Self-organizing search networks

April 2014

Page 3: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 3

Real-life Similarity

• Are they similar?

April 2014

Page 4: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 4

Real-life Similarity

• Are they similar?

April 2014

Page 5: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 5

Real-life Similarity

• Are they similar?

April 2014

Page 6: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 6

Real-life Similarity

• Are they similar?

April 2014

Page 7: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 7

Real-Life MotivationThe social psychology view

• Any event in the history of organism is, in a sense, unique.

• Recognition, learning, and judgment presuppose an ability to categorize stimuli and classify situations by similarity.

• Similarity (proximity, resemblance, communality, representativeness, psychological distance, etc.) is fundamental to theories of perception, learning, judgment, etc.

• Similarity is subjective a context-dependentApril 2014

Page 8: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 8

Contemporary Networked MediaThe digital data view

• Almost everything that we see, read, hear, write, measure, or observe can be digital.

• Users autonomously contribute to production of global media and the growth is exponential.

• Sites like Flickr, YouTube, Facebook host user contributed content for a variety of events.

• The elements of networked media are related by numerous multi-facet links of similarity.

April 2014

Page 9: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 9

Challenge

• Networked media database is getting close to the human “fact-bases”– the gap between physical and digital world has blurred

• Similarity data management is needed to connect, search, filter, merge, relate, rank, cluster, classify, identify, or categorize objects across various collections.

WHY?It is the similarity which is in the world revealing.

April 2014

Page 10: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 10

State of the art inMetric Searching technology

Hanan SametFoundation of Multidimensional andMetric Data StructuresMorgan Kaufmann, 2006

P. Zezula, G. Amato, V. Dohnal, and M. BatkoSimilarity Search: The Metric Space ApproachSpringer, 2005

Teaching material: http://www.nmis.isti.cnr.it/amato/similarity-search-book/

April 2014

Page 11: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 11

Last Similarity Search Conference

April 2014

Keynote speakers:Ricardo Baeza-YatesJiri Matas

Page 12: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 12

The Big Data

• Loads on a sharp rise – usage on decline• The (3V) problem of: Volume, Variety, Velocity• Issues:

– Acquisition: what to keep and what to discard– Unstructured data: what content to extract– Datafication: render into data many new aspects– Inaccuracy: approximation, imprecision, noise

April 2014

Page 13: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 13

The Big Data problem

• Shifts in thinking: – from some to all (scalability)– from clean to messy (approximate)

• Technological obstacles: heterogeneity, scale, timeliness, complexity, and privacy aspects

• Foundational challenges: scalable and secure data analysis, organization, retrieval, and modeling

April 2014

Page 14: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 14

Similarity & Geometry

• Figures that have the same shape but not necessarily the same size are similar figures:

• Any two line segments are similar:

• Any two circles are similar:

A B C D

April 2014

Page 15: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 15

Similarity & Geometry

• Any two squares are similar:

• Any two equilateral triangles are similar:

April 2014

Page 16: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 16

Similarity & Geometry

Definition:• Two polygons are similar to each other, if:

1. Their corresponding angles are equal2. The lengths of their corresponding sides are

proportional

Two geometric figures are either similaror they are not similar at all

April 2014

Page 17: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 17

Visual Similarity

• MPEG-7 multimedia content desc. standard• Global feature descriptors:

– Color, shape, texture, …

– One high-dimensional vector per image

April 2014

Page 18: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 18

Multiple Visual Aspects

April 2014

Page 19: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 19

Visual Similarity- Local feature descriptors – SIFT (Scale Invariant Feature Transform), SURF, etc.- Invariant to image scaling, small viewpoint change, rotation, noise, illumination

April 2014

Page 20: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 20

Visual Similarity - finding coresspondence

April 2014

Page 21: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 21

Biometric Similarity

• Biometrics:– methods of recognizing a person based on

physiological and behavioral characteristics• Two types of recognition problems:

– Verification – authenticity of a person– Identification – recognition of a person

• Examples:– Finger prints, face, iris, retina, speech, gait, etc.

April 2014

Page 22: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 22

Biometrics: Fingerprint

• Minutiae detection:– Detect ridges (endings and branching)– Represented as a sequence of minutiae

• P=( (r1,e1,θ1), …, (rm,em,θm) )• Point in polar coordinates (r,e) and direction θ

• Matching of two sequences:– Align input sequence with database one– Compute weighted edit distance

• wins,del=620

• wrepl=[0;26] - depending on similarity of two minutiae

April 2014

Page 23: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 23

Biometrics: Hand Recognition

• Hand image analysis– Contour extraction, global registration

• Rotation, translation, normalization

– Finger registration– Contour represented as a set of pixels

F={f1,…,fNF}

• Matching: modified Hausdorff distance

April 2014

FGhGFhGFH ,,,max,

Gg FfGFf GgF

gfN

FGhgfN

GFh min1

,min1

,

Page 24: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 24

Remote Biometrics: Approaches

• Detection, normalization, extraction, recognition• Face recognition

– Methods:• Appearance-based – analyze the face as a whole• Model-based – compare individual features (e.g., eyes, mouth)

• Gait recognition– Less likely to be obscured, low resolution suffices– Methods are based on shape or dynamics of the person:

• Appearance-based – analyze person’s silhouettes• Model-based – compare features (e.g., trajectory, angular velocity)

April 2014

Page 25: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 25

Face similarity

• Face detection• Face recognition• face detection – MPEG-7

April 2014

Page 26: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 26

Search – the goals

1. We search to get results (papers, books, …)2. We ask to find answers (what time … )3. We use filters so that the right staff finds us4. We browse while wandering and way-finding

in typically restricted space

• In reality, we move fluidly between modes of ask, browse, filter, and search

April 2014

Page 27: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 27

Search – some quantitative facts

• 85% of all web traffic comes from search engines

• 450+ million searches/day are performed in North America alone

• 70%+ of all searches are done on Google sites

Search is the most popular application(second to E-mail??)

April 2014

Page 28: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 28

Search – some experience

• 60% of searchers NEVER go past 1st page of search results

• The top three results draw 80% of the attention

• The first few results inordinately influence query reformulation.

April 2014

Page 29: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 29

Search - as an interaction

• When we search, our next actions are reactions to the stimuli of previous search results

• What we find is changing what we seek

• In any case, search must be:

fast, simple, and relevantApril 2014

Page 30: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 30

Search – changes our cognitive habits

1. We are increasingly handing off the job of remembering to search engines

2. When we expect information to be easily found again, we do not remember it well

3. Our original memory of facts is changing to a memory of ways to find the facts

April 2014

Page 31: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 31

Evolution of Search Engine Strategies

Scalabilitydata volumenumber of users (queries)variety of data typesmulti-aspect queries

grad

e

high

low

well established cutting-edge research

peer

-to-

peer

cent

raliz

ed

para

llel

dist

ribut

ed

self-

orga

nize

d

April 2014

Determinismexact match similarity►precise approximate►same answer variable►fixed infrastr. mapping►

Page 32: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 32

The MUFIN Approach

MUFIN: MUlti-Feature Indexing Network

SEARCHdata

& q

uerie

s

infrastructure

index structureScalability

P2P structureExtensibilitymetric space

IndependenceInfrastructure as a service

April 2014

Page 33: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 33

Extensibility: Metric Abstraction of Similarity

• Metric space: M = (D,d)– D – domain– distance function d(x,y)

x,y,z D• d(x,y) > 0 - non-negativity• d(x,y) = 0 x = y - identity• d(x,y) = d(y,x) - symmetry• d(x,y) ≤ d(x,z) + d(z,y) - triangle inequality

April 2014

Page 34: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 34

Examples of Distance Functions

• Lp Minkovski distance (for vectors)• L1 – city-block distance

• L2 – Euclidean distance

• L¥ – infinity

• Edit distance (for strings)• minimal number of insertions, deletions and substitutions• d(‘application’, ‘applet’) = 6

• Jaccard’s coefficient (for sets A,B)

n

iii yxyxL

11 ||),(

n

iii yxyxL

1

22 ),(

ii

n

iyxyxL

max),(

1

BA

BABAd 1,

April 2014

Page 35: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 35

Examples of Distance Functions

• Mahalanobis distance– for vectors with correlated dimensions

• Hausdorff distance– for sets with elements related by another distance

• Earth movers distance– primarily for histograms (sets of weighted features)

• and many others

April 2014

Page 36: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 36

Similarity Search Problem

• For X D in metric space M,pre-process X so that the similarity queriesare executed efficiently.

In metric space: no total ordering exists!

April 2014

Page 37: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 37April 2014

Similarity Range Query

• range query– R(q,r) = { x X | d(q,x) ≤ r }

… all museums up to 2km from my hotel …

rq

Page 38: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 38April 2014

Nearest Neighbor Query

• the nearest neighbor query– NN(q) = x– x X, "y X, d(q,x) ≤ d(q,y)

• k-nearest neighbor query– k-NN(q,k) = A– A X, |A| = k– x A, y X – A, d(q,x) ≤ d(q,y)

… five closest museums to my hotel …

q

k=5

Page 39: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 39April 2014

Basic Partitioning Principles

• Given a set X D in M=(D,d), basic partitioning principles have been defined:

– Ball partitioning– Generalized hyper-plane partitioning– Excluded middle partitioning

Page 40: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 40April 2014

Ball Partitioning

• Inner set: { x X | d(p,x) ≤ dm }

• Outer set: { x X | d(p,x) > dm }

pdm

Page 41: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 41April 2014

Generalized Hyper-plane

• { x X | d(p1,x) ≤ d(p2,x) }

• { x X | d(p1,x) > d(p2,x) }

p2

p1

Page 42: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 42April 2014

Excluded Middle Partitioning

• Inner set: { x X | d(p,x) ≤ dm - }• Outer set: { x X | d(p,x) > dm + }

• Excluded set: otherwise

pdm

2r

pdm

Page 43: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 43April 2014

Clustering

• Cluster data into sets– bounded by a ball region– { x X | d(pi,x) ≤ ri

c }

Page 44: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 44

Scalability: Peer-to-Peer Indexing• Local search: Main memory structures• Native metric techniques: GHT*, VPT*• Transformation techniques: M-CAN, M-Chord

April 2014

Page 45: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 45

Main memory structures1. ball partitioning methods

1. Burkhard-Keller Tree2. Fixed Queries Tree (Array)3. Vantage Point Tree4. Excluded Middle Vantage Point Forest

2. generalized hyper-plane partitioning approaches1. Bisector Tree2. Generalized Hyper-plane Tree

3. exploiting pre-computed distances1. AESA2. Linear AESA3. Other Methods – Shapiro, Spaghettis

April 2014

Page 46: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 46

The M-tree [Ciaccia, Patella, Zezula, VLDB 1997]

1)Paged organization2)Dynamic3) Suitable for arbitrary metric spaces4) I/O and CPU optimization

- computing d can be time-consuming

April 2014

Page 47: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 47

The M-tree Idea

• Depending on the metric, the “shape” of index regions changes

C D E F

A B

B

F D

EA

C

Metric: L2 (Euclidean)

L1 (city-block) L (max-metric) weighted-Euclidean quadratic form

April 2014

Page 48: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 48April 2014

o7

M-tree: Example

o1o6

o10

o3

o2

o5

o4

o9

o8

o11

o1 4.5 -.- o2 6.9 -.-

o1 1.4 0.0 o10 1.2 3.3 o7 1.3 3.8 o2 2.9 0.0 o4 1.6 5.3

o2 0.0 o8 2.9o1 0.0 o6 1.4 o10 0.0 o3 1.2

o7 0.0 o5 1.3 o11 1.0 o4 0.0 o9 1.6

Covering radius

Distance to parentDistance to parentDistance to parent

Distance to parentLeaf entries

Page 49: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 49

M-tree family

• Bulk loading• Slim-tree• Multi-way insertion• PM-tree• M2-tree• etc.

April 2014

Page 50: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 50

D-Index [Dohnal, Gennaro, Zezula, MTA 2002]4 separable buckets at the first level

2 separable buckets at the second level

exclusion bucket of the whole structure

April 2014

Page 51: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 51

D-index: Insertion

April 2014

Page 52: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 52

D-index: Range Search

q

r

q

r

q

r

q

r

q

r

q

r

April 2014

Page 53: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 53

Parallel implementations

• Trees (parallel M-tree):– Single input point – tree root, a possible bottleneck– Inherently sequential processing steps

• Paths from the root to leaves – we never know which node to access before the ancestor is processed

• Hash-based organizations:– All processing steps can be parallelized– Single input point remains

April 2014

Page 54: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 54

Implementation Postulates of Distributed Indexes

• dynamism – nodes can be added and removed

• no hot-spots – no centralized nodes, no flooding by messages (transactions)

• update independence – network update at one site does not require an immediate change propagation to all the other sites

April 2014

Page 55: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 55

DistributedSimilarity Search Structures

• Native metric structures:– GHT* (Generalized Hyperplane Tree)– VPT* (Vantage Point Tree)

• Transformation approaches:– M-CAN (Metric Content Addressable Network)– M-Chord (Metric Chord)

April 2014

Page 56: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 56April 2014

GHT* Address Search Tree

• Based on the Generalized Hyperplane Tree [Uhl91]– two pivots for binary partitioning

p6

p5

p3

p4

p1

p2

p1 p2

p5 p6p3 p4

Page 57: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 57April 2014

GHT* Address Search Tree

• Inner node– two pivots (reference objects)

• Leaf node– BID pointer to a bucket if

data stored on the current peer

– NNID pointer to a peer ifdata stored on a different peer

p1 p2

p5 p6p3 p4

BID1 BID2 BID3 NNID2

Peer 2

Page 58: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 58April 2014

GHT* Address Search Tree

Peer 2 Peer 3

Peer 1

Page 59: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 59April 2014

BID1 BID2 BID3 NNID2

Peer 2

p1 p2

p5 p6p3 p4

Peer 2

BID3 NNID2

p5 p6

p1 p2

GHT* Range Query• Range query R(q,r)

– traverse peer’s own AST– search buckets for all BIDs found– forward query to all NNIDs found

p6

p5

p3

p4

rq

p1

p2

Page 60: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 60April 2014

AST: Logarithmic replication

• Full AST on every peer is space consuming– replication of pivots grows in a linear way

• Store only a part of the AST:– all paths to local buckets

• Deleted sub-trees:– replaced by NNID

of the leftmost peer

p13 p14p11 p12

p5 p6

p1 p2

p3 p4

p7 p8 p9 p10

NNID2 NNID3BID1 NNID4 NNID5 NNID6 NNID7 NNID8

p1 p2

p3 p4

p7 p8

BID1 NNID3 NNID5

Page 61: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 61April 2014

AST: Logarithmic Replication (cont.)

• Resulting tree– replication of pivots grows in a logarithmic way

p1 p2

p3 p4

p7 p8

NNID2

NNID3

BID1

NNID5

p1 p2

p3 p4

p7 p8

BID1

Page 62: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 62April 2014

p1

r1

p3

r3

VPT* Structure• Similar to the GHT* - ball partitioning is used

for AST– Based on the Vantage Point Tree [Yia93]

• inner nodes have one pivot and a radius• different traversing conditions

p2

r2

p1 (r1)

p2 (r2) p3 (r3)

Page 63: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 63

M-Chord: The Metric Chord• Transform metric space to one-dimensional domain

– Use M-Index - a generalized version of the iDistance

• Divide the domain into intervals – assign each interval to a peer

• Use the Chord P2P protocol for navigation• The Skip graphs distributed protocol can be used,

alternatively

April 2014

Page 64: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 64

– range query R(q,r): identify intervals of interest

• Generalization to metric spaces

– select pivots – then partition: Voronoi-style

M-Chord: Indexing the Distance

• iDistance – indexing technique for vector domains

– cluster analysis = centers = reference points pi

– assign iDistance keys to objects iCx

cixpdxiDist i ),()(

},...,{ 0 npp

April 2014

Page 65: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 65

M-Chord: Chord Protocol

• Peer-to-Peer navigation protocol• Peers are responsible for intervals of keys• hops to localize a node storing a key

M-Chord set the iDistance domain make it uniform: function h

Use Chord on this domain

)(log n

)),(()( cixpdhxmchord i

April 2014

Page 66: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 66

M-Chord: Range Query

• Node Nq initiates the search

• Determine intervals– generalized iDistance

• Forward requests to peers on intervals

• Search in the nodes– using local organization

• Merge the received partial answers

April 2014

Page 67: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 67April 2014

M-CAN: The Metric CAN

• Based on the Content-Addressable Network (CAN)– a DHT navigating in an N-dimensional vector space

• The Idea:1. Map the metric space to a vector space

– given N pivots: p1, p2 , … , pN, transform every o into vector F(o)

2. Use CAN to

– distribute the vector space zones among the nodes– navigate in the network

Page 68: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 68April 2014

CAN: Principles & Navigation

• CAN – the principles– the space is divided in zones – each node “owns” a zone– nodes know their neighbors

• CAN – the navigation– greedy routing– in every step, move to the

neighbor closer to the target location

2-d

imensi

onal vect

or

space

1

6 2

53

4

x,y

Page 69: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 69April 2014

M-CAN: Contractiveness & Filtering

• Use the L∞ as a distance measure

– the mapping F is contractive

• More pivots better filtering– but, CAN routing is better for less dimensions

• Additional filtering– some pivots are only used for filtering data (inside

the explored nodes)– they are not used for mapping into CAN vector space

),())(),(( yxdyFxFL

Page 70: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 70

Infrastructure Independence: MESSIF Metric Similarity Search Implementation Framework

Metric space (D,d) Operations Storage

Centralized index structures

Distributed index structures

Communication

NetVectors• Lp and quadratic formStrings • (weighted) edit and

protein sequence

Insert, delete,range query,k-NN query,Incremental k-NN

Volatile memoryPersistent memory

Performance statistics

April 2014

Page 72: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 72

MUFIN demos

• http://mufin.fi.muni.cz/imgsearch/similar• http://www.pixmac.com/• http://mufin.fi.muni.cz/twenga/random• http://mufin.fi.muni.cz/fingerprints/random• http://mufin.fi.muni.cz/subseq/random• http://mufin.fi.muni.cz/mma-faces-extended/• http://mufin.fi.muni.cz/plugins/annotation• http://mufin.fi.muni.cz/motion-retrieval/rand

omApril 2014

Page 73: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 73

Problems with Current Applications

1. Current applications are implemented as complex software projects; it is costly, highly qualified specialists are needed

2. They fitfully need massive infrastructure to build and run multiple indices

3. Applications are much more complex than a search, which is an important supporting service

April 2014

Page 74: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 74

SimilaritySearch

Computingeffectiveness efficiency

stimuli

matching extraction

evaluation executionoperators

Similarity Data Management System

April 2014

findability

retrieval

CloudServices

Application Areaswebenterprisemobilesocial

Page 75: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 75

Similarity Searching in Clouds

• Retrieval – effectiveness, evaluation, operations, execution, and efficiency

• Findability – effectiveness, matching, stimuli, extraction, and efficiency

• Cloud way of computing:– Scalability – must balance load across servers and avoid bottlenecks

– Elasticity – to allow adding (reducing) capacity to a running system

– Availability – to provide high levels of usability and fault tolerance

– Privacy – to safeguard data that is valuable or sensitive against unauthorized access

April 2014

Page 76: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

April 2014 Seminar Santiago da Compostela

Self-organizing Search Systems

• Big data in computer networks:– A very large number of users keeping data within their

devices (computers, mobiles, etc.)– A very high degree of churning users– Solution to storing and searching the data:

• Cloud-type systems – data stored within a single infrastructure• Fully distributed systems – data stored within each user’s device

that brings additional computational power to the system

FacebookYouTube

76/4

Page 77: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

April 2014 Seminar Santiago da Compostela

Self-organizing Search Systems

• A self-organizing system:– Devices are independent and equal in functionality– A set of devices interconnected by semantic links– There is no global control mechanism– Search mechanisms and link management based on social-

like interactions (acquaintance/friend links)• Properties:

– Scalability– Adaptability– Robustness

77/4

Page 78: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 78

Architecture

• Each peer stores its piece of data– It asks and answers similarity queries

• Each peer maintains a query history– An acquaintance tie - the best answer– Friend ties – similar peer

P1

Q1

Q3

P3

Q3

P2

Q1

Q2

Q3

April 2014

Page 79: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

Seminar Santiago da Compostela 79

150 peers organizing 200,000 images Extracted features: 45-dimensional vectors

Random initial state Five queries per peer

Experiments

• Query history– 100 queries limit

April 2014

Page 80: Scalability of Similarity Searching the MUFIN approach Pavel Zezula Masaryk University Brno, Czech Republic

April 2014 Seminar Santiago da Compostela

Self-organizing Search Systems• Our Solution – Metric Social Network

– Resilience to disconnections of peers – initially 2 000– After each 20th test series:

• S1 – 200 random peers were disconnected• S2 – 200 the most knowledgeable peers were disconnected

S1 S2

80/4