LSDS-IR’08, October 30, 2008

Peer-to-Peer Similarity Search over Widely Distributed Document Collections

Christos Doulkeridis (1), Kjetil Nørvåg (2), Michalis Vazirgiannis (1)

(1) Department of Informatics, Athens University of Economics and Business, Greece

(2) Department of Computer Science, Norwegian University of Science and Technology, Norway


Motivation

• Application
– Digital libraries
• Given a document (= query), retrieve similar documents
• e.g., find papers similar to my research paper
• Efficiently locate the subset of peers that store content similar to the query
• Challenge
– Similarity search over widely distributed, high-dimensional data

[Figure: a network of computers, labeled "Distributed Information Retrieval"]


Outline

• Local peer pre-processing
– Feature extraction
– Local clustering
• Semantic overlay network (SON) construction
– Topological zone creation
– Zone clustering
• Super-peer organization of SONs
– Searching
• Experimental evaluation
• Conclusions & future work


Feature Extraction and Local Document Clustering

• Peers store documents
• Tokenization/stemming/stop-word removal
• Each document represented by a feature vector (top-k features)
– Vector Space Model (VSM)
– F_i = {(f_ij, w_ij)}
• Cluster feature vectors
• Result:
– set of initial clusters per peer
• Each cluster represented by a feature vector

[Figure: Peer’s initial clusters]
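The pre-processing steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the stop-word list, the function names `tokenize` and `feature_vector`, and the use of relative term frequency as the weight w_ij are all assumptions, and stemming is left out for brevity.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list used only for illustration.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for"}

def tokenize(text):
    """Lowercase, split on non-letters, and drop stop words (no stemming)."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def feature_vector(text, k=3):
    """Return the top-k features as {feature: weight}.

    The weight is the relative term frequency, an assumed weighting scheme.
    """
    counts = Counter(tokenize(text))
    total = sum(counts.values())
    if total == 0:
        return {}
    # Sort by descending count, then alphabetically, so ties are deterministic.
    top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
    return {term: count / total for term, count in top}
```

Calling `feature_vector(document_text, k)` on each stored document would give a peer the vectors it then clusters locally.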


Overlay Construction

• Multi-phase distributed process
• Starting point: unstructured P2P network
• Recursive application of 3 steps, until global clusters (SONs) are created


Zone Creation

• A certain percentage of peers become initiators
– randomly distributed over the network
• PROBE-based technique
• Partial synchronization
• In case of excessive zone sizes
– zone partitioning

Finally:
• Each initiator
– knows the peer ids in its zone
– knows neighboring initiators
• Each peer knows its initiator

[Figure: peers in the network, with initiators marked]
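One way to picture zone creation is as a multi-source breadth-first search, where each peer joins the zone of the initiator whose probe reaches it first. The slides do not specify the PROBE protocol, so the sketch below is an assumption made for illustration; `create_zones`, `initiator_fraction`, and the adjacency-dict representation are hypothetical, and zone partitioning and synchronization are omitted.

```python
import random
from collections import deque

def create_zones(adjacency, initiator_fraction=0.1, seed=0):
    """Pick random initiators and assign each reachable peer to a zone.

    adjacency maps a peer id to an iterable of neighbor ids.
    Returns (initiators, zone_of), where zone_of[peer] is its initiator.
    """
    rng = random.Random(seed)
    peers = sorted(adjacency)
    count = max(1, round(len(peers) * initiator_fraction))
    initiators = rng.sample(peers, count)

    # Every initiator belongs to its own zone.
    zone_of = {p: p for p in initiators}
    queue = deque(initiators)
    while queue:
        peer = queue.popleft()
        for neighbor in adjacency[peer]:
            # A peer joins the zone of the first probe that reaches it.
            if neighbor not in zone_of:
                zone_of[neighbor] = zone_of[peer]
                queue.append(neighbor)
    return initiators, zone_of
```

Inverting `zone_of` gives each initiator the list of peer ids in its zone, which matches the end state described on the slide.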


Zone Clustering

• Initiators
– collect feature vectors from peers
– perform intra-zone hierarchical clustering
– pick cluster representatives
• Cluster description
– CD_i = (C_i, F_i, {P}, R)
• Remaining challenge
– How to bring together similar (remote) clusters?

[Figure: similar remote clusters]
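Intra-zone hierarchical clustering could look like the agglomerative sketch below, which repeatedly merges the two most similar clusters until no pair is similar enough. The slides do not name a specific algorithm, so the centroid linkage, the `min_similarity` stopping threshold, and the use of the centroid as cluster representative are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {feature: weight}."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def centroid(vectors):
    """Average of sparse vectors; used here as the cluster representative."""
    total = {}
    for vec in vectors:
        for f, w in vec.items():
            total[f] = total.get(f, 0.0) + w
    return {f: w / len(vectors) for f, w in total.items()}

def agglomerate(vectors, min_similarity=0.5):
    """Merge the most similar pair of clusters until none reaches the threshold."""
    clusters = [[v] for v in vectors]
    while len(clusters) > 1:
        best = None  # (similarity, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = cosine(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or s > best[0]:
                    best = (s, i, j)
        if best[0] < min_similarity:
            break
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters
```

The quadratic pair search is only meant to make the idea concrete; a real initiator would likely need something more economical for large zones.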


Inter-zone Clustering

[Figure: zone hierarchy, Level 1 to Level 4]

Advantages:
1) Very large networks
2) Efficient
3) Small individual load


SON Merging

• Create d links among the least-connected peers in merged SONs

[Figure: merging of SON 1 and SON 2 for d=3; label: Super-Peer]
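The linking rule can be illustrated with the sketch below, which reads it as: take the d peers with the fewest links in each SON and connect them pairwise. This reading, the function name `merge_sons`, and the `links` dict of neighbor sets are assumptions; the slide does not say how the d links are distributed.

```python
def merge_sons(links, son_a, son_b, d=3):
    """Add up to d links between the least-connected peers of two SONs.

    links maps a peer id to the set of its current neighbors and is
    updated in place. Returns the list of (peer_a, peer_b) links added.
    """
    def least_connected(son):
        # Fewest neighbors first; peer id breaks ties deterministically.
        return sorted(son, key=lambda p: (len(links[p]), p))[:d]

    # Both candidate lists are computed before any link is added.
    new_links = list(zip(least_connected(son_a), least_connected(son_b)))
    for a, b in new_links:
        links[a].add(b)
        links[b].add(a)
    return new_links
```

Choosing the least-connected peers spreads the new links over peers that carry little load so far.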


Searching

• Inter-SON routing
• Intra-SON routing
• Naïve solution: flooding

[Figure: query Q routed through the network]
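The naïve flooding baseline can be sketched as a breadth-first traversal in which every peer forwards the query to all of its neighbors. The hop limit `ttl` is an assumption added to keep the example bounded; the slide does not mention one.

```python
from collections import deque

def flood(links, start, ttl):
    """Return the peers reached from start within ttl hops, in visit order.

    links maps a peer id to an iterable of neighbor ids.
    """
    seen = {start}
    order = [start]
    queue = deque([(start, 0)])
    while queue:
        peer, hops = queue.popleft()
        if hops == ttl:
            continue  # hop budget spent, do not forward further
        for neighbor in sorted(links[peer]):
            if neighbor not in seen:
                seen.add(neighbor)
                order.append(neighbor)
                queue.append((neighbor, hops + 1))
    return order
```

The length of the returned list corresponds to the number of contacted peers, which is what inter-SON and intra-SON routing aim to keep small.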


Adaptive Clustering

• After global SON creation
– Broadcast final cluster descriptions to all peers
– Use zone hierarchy for efficient broadcasting
• Each peer can then
– Reassign its documents to clusters
– Join the appropriate SONs
• Similar to a feedback mechanism
• Advantages – see experimental results

[Figure: Final organization. Super-peer level and peer level; peers A to J grouped into A’s, D’s, G’s, and J’s clusters]
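The reassignment step can be sketched as follows: once a peer holds the final cluster descriptions, it maps each of its documents to the most similar cluster and joins the SONs of the clusters it ends up with. Representing each cluster description by a single feature vector, and the names `reassign` and `sons_to_join`, are assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {feature: weight}."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def reassign(documents, cluster_vectors):
    """Map each document id to the id of its most similar global cluster.

    documents and cluster_vectors both map an id to a sparse feature vector.
    """
    return {
        doc_id: max(cluster_vectors, key=lambda cid: cosine(vec, cluster_vectors[cid]))
        for doc_id, vec in documents.items()
    }

def sons_to_join(assignment):
    """The SONs a peer joins are those of the clusters its documents fell into."""
    return set(assignment.values())
```

A peer with documents on several topics would join several SONs under this scheme.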


Experimental Setup

• GT-ITM topology generator (1K, 5K peers)
• TREC.GOV2 (1M docs), Reuters (810K docs)
• Random querying peer
• Query: “Given doc X, find the top-k similar docs to X”
• Cosine similarity
• Similarity threshold Ts, to determine which docs match the query
• Metrics
– Recall
– Recall@k
– Precision@k
– #Contacted peers
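The ranked-list metrics can be computed as in the sketch below. It uses the standard information-retrieval definitions, which are assumed here because the slides do not spell them out; the set `relevant` would hold the documents whose cosine similarity to the query is at least Ts.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k returned documents that are relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant)
```

Note that `precision_at_k` divides by k even when fewer than k documents are returned, which penalizes short result lists.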


Clustering Statistics

• Adaptive clustering
– Decreases the average pair-wise similarity of clusters
– Increases the average pair-wise similarity of documents within a cluster (not shown here)


Search Evaluation

• Recall
– Ts=0.2
– Also tried Ts=0.1
• #Contacted Peers


Search Evaluation - GOV2/P5000

[Figure: two plots over Top-N Clusters (1 to 5). Left: Pre@10, Pre@20, Pre@40, Pre@60, Pre@80. Right: Rec@10, Rec@20, Rec@40, Rec@60, Rec@80]


SON-based versus Plain Super-peer


Conclusions

• We presented a novel approach for P2P similarity search
• Peers self-organize into SONs, forming a super-peer network
• We showed how a high-quality searching mechanism can be deployed
• We presented experiments on 2 large document collections (GOV2 and Reuters) to evaluate our approach
• Future work:
– More efficient inter-SON routing
– Semantic similarity search using query expansion
– Use of other clustering algorithms to improve performance


Thank you for your attention!

More info:
http://www.db-net.aueb.gr/
http://www.idi.ntnu.no/grupper/db/
