LSDS-IR’08, October 30, 2008 1
Peer-to-Peer Similarity Search over Widely Distributed Document Collections
Christos Doulkeridis (1), Kjetil Nørvåg (2), Michalis Vazirgiannis (1)
(1) Department of Informatics, Athens University of Economics and Business, Greece
(2) Department of Computer Science, Norwegian University of Science and Technology, Norway
Motivation
• Application
– Digital libraries
• Given a document (= query), retrieve similar documents
• e.g. find similar papers to my research paper
• Efficiently locate subset of peers that store similar content to the query
• Challenge
– Similarity search over widely distributed high-dimensional data
[Figure: eight computers (peers), labeled "Distributed Information Retrieval"]
Outline
• Local peer pre-processing
– Feature extraction
– Local clustering
• Semantic overlay network (SON) construction
– Topological zone creation
– Zone clustering
• Super-peer organization of SONs
– Searching
• Experimental evaluation
• Conclusions & future work
Feature Extraction and Local Document Clustering
• Peers store documents
• Tokenization/stemming/stop-word removal
• Each document represented by a feature vector (top-k features)
– Vector Space Model (VSM)
– F_i = {(f_ij, w_ij)}
• Cluster feature vectors
• Result:
– set of initial clusters per peer
• Each cluster represented by feature vector
[Figure: Peer’s initial clusters]
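A minimal sketch of the per-document feature extraction step, under stated assumptions: weights are raw term frequencies (the slides do not name the weighting scheme), stemming is left out, and `STOP_WORDS` and `feature_vector` are hypothetical names introduced only for illustration.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "over"}


def feature_vector(text, k=3):
    """Return the top-k features of a document as {feature: weight}.

    Assumption: the weight is the raw term frequency. Stemming, which the
    slides list as a pre-processing step, is omitted here for brevity.
    """
    tokens = re.findall(r"[a-z]+", text.lower())        # tokenization
    terms = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    counts = Counter(terms)
    # Highest frequency first; ties broken alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda fw: (-fw[1], fw[0]))
    return dict(ranked[:k])


doc = "Similarity search over peers: peers route the search to similar peers"
print(feature_vector(doc, k=2))
```

Each resulting dictionary plays the role of F_i = {(f_ij, w_ij)}; a peer would then cluster these vectors locally.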
Overlay Construction
• Multi-phase distributed process
• Starting point: unstructured P2P network
• Recursive application of 3 steps, until global clusters (SONs) are created
Zone Creation
• A certain percentage of peers becomes initiators
– randomly distributed over the network
• PROBE-based technique
• Partial synchronization
• In case of excessive zone sizes
– zone partitioning
Finally:
• Each initiator
– knows the peer ids in its zone
– knows neighboring initiators
• Each peer knows its initiator
[Figure: initiators in the network]
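The end state of zone creation can be illustrated with a centralized simulation. This is only a sketch under assumptions: probes are modeled as a multi-source breadth-first search in which a peer joins the first initiator whose probe reaches it, and partial synchronization and zone partitioning are not modeled. The function name `create_zones` and the example topology are hypothetical.

```python
from collections import deque


def create_zones(adjacency, initiators):
    """Assign each reachable peer to the initiator whose probe arrives first.

    adjacency:  dict mapping a peer to the list of its overlay neighbors
    initiators: list of peers that act as initiators
    Returns (zone_of, neighbors_of):
      zone_of      maps every reached peer to its initiator
      neighbors_of maps every initiator to the set of neighboring initiators
    """
    zone_of = {i: i for i in initiators}
    neighbors_of = {i: set() for i in initiators}
    queue = deque(initiators)
    while queue:
        peer = queue.popleft()
        for nxt in adjacency[peer]:
            if nxt not in zone_of:
                # First probe to reach this peer claims it for the zone.
                zone_of[nxt] = zone_of[peer]
                queue.append(nxt)
            elif zone_of[nxt] != zone_of[peer]:
                # The probe met another zone: the two initiators are neighbors.
                neighbors_of[zone_of[peer]].add(zone_of[nxt])
                neighbors_of[zone_of[nxt]].add(zone_of[peer])
    return zone_of, neighbors_of


# Hypothetical line topology 0-1-2-3-4-5 with initiators 0 and 5.
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
zone_of, neighbors_of = create_zones(adjacency, [0, 5])
print(zone_of, neighbors_of)
```

After the run, every peer knows its initiator (`zone_of`) and every initiator knows its neighboring initiators (`neighbors_of`), which matches the "Finally" state listed above.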
Zone Clustering
• Initiators
– collect feature vectors from peers
– perform intra-zone hierarchical clustering
– pick cluster representatives
• Cluster description
– CD_i = (C_i, F_i, {P}, R)
• Remaining challenge
– How to bring together similar (remote) clusters?
[Figure: similar remote clusters]
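A rough sketch of what an initiator might do with the collected feature vectors, assuming agglomerative clustering with cosine similarity and a stopping threshold `min_sim` (the slides only say "hierarchical clustering"). The reading of CD_i as (cluster id, feature vector, set of peers, representative) is an interpretation, and picking the smallest peer id as representative is a placeholder rule, not the authors' method.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {feature: weight}."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def add_vectors(u, v):
    """Sum two sparse vectors; used as the feature vector of a merged cluster."""
    out = dict(u)
    for f, w in v.items():
        out[f] = out.get(f, 0.0) + w
    return out


def zone_clustering(peer_clusters, min_sim=0.5):
    """Merge the most similar pair of clusters until none reaches min_sim.

    peer_clusters: list of (peer_id, feature_vector) pairs from one zone.
    Returns a list of cluster descriptions as dictionaries.
    """
    clusters = [{"features": dict(fv), "peers": {p}} for p, fv in peer_clusters]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = cosine(clusters[i]["features"], clusters[j]["features"])
                if s > best:
                    best, pair = s, (i, j)
        if best < min_sim:
            break
        i, j = pair
        clusters[i] = {
            "features": add_vectors(clusters[i]["features"], clusters[j]["features"]),
            "peers": clusters[i]["peers"] | clusters[j]["peers"],
        }
        del clusters[j]  # j > i, so index i stays valid
    return [
        {"id": n, "features": c["features"], "peers": c["peers"],
         "representative": min(c["peers"])}  # placeholder choice of R
        for n, c in enumerate(clusters)
    ]


collected = [
    ("p1", {"music": 2.0, "jazz": 1.0}),
    ("p2", {"music": 1.0, "jazz": 1.0}),
    ("p3", {"protein": 3.0}),
]
descriptions = zone_clustering(collected)
print(descriptions)
```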
Inter-zone Clustering
[Figure: clustering hierarchy, Level 1 to Level 4]
Advantages:
1) Very large networks
2) Efficient
3) Small individual load
SON Merging
• Create d links among the least-connected peers in merged SONs
[Figure: merging of SON 1 and SON 2 for d=3; label: Super-Peer]
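One possible reading of the merging rule, sketched under assumptions: the d least-connected peers of each SON are paired one-to-one, and the two SONs have disjoint peer ids. The slides do not spell out the exact pairing, so `merge_sons` and its tie-breaking by peer id are hypothetical.

```python
def merge_sons(links_a, links_b, d=3):
    """Join two SONs by adding up to d links between least-connected peers.

    links_a, links_b: dict mapping a peer to the set of its neighbors
                      inside its own SON (peer ids assumed disjoint).
    Returns (merged_links, new_links).
    """
    merged = {p: set(n) for p, n in links_a.items()}
    merged.update({p: set(n) for p, n in links_b.items()})

    def least_connected(links):
        # Fewest neighbors first; ties broken by peer id for determinism.
        return sorted(links, key=lambda p: (len(links[p]), p))[:d]

    new_links = list(zip(least_connected(links_a), least_connected(links_b)))
    for a, b in new_links:
        merged[a].add(b)
        merged[b].add(a)
    return merged, new_links


son1 = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}}
son2 = {"x": {"y"}, "y": {"x", "z"}, "z": {"y"}}
merged, new_links = merge_sons(son1, son2, d=2)
print(new_links)
```

Linking the least-connected peers spreads the extra connections over peers that carry the fewest links so far.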
Searching
• Inter-SON routing
• Intra-SON routing
• Naïve solution: flooding
[Figure: query Q]
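The two routing phases can be sketched as follows, with assumptions stated up front: inter-SON routing ranks SONs by the cosine similarity between the query and each SON's feature vector, and intra-SON routing uses the naïve flooding solution. The names `route_to_sons` and `flood`, the threshold value, and the example data are all hypothetical.

```python
import math
from collections import deque


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {feature: weight}."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def route_to_sons(query, son_features, top_n=1):
    """Inter-SON routing: pick the top_n SONs most similar to the query."""
    ranked = sorted(son_features,
                    key=lambda s: (-cosine(query, son_features[s]), s))
    return ranked[:top_n]


def flood(start, links, docs, query, ts):
    """Intra-SON routing by flooding (breadth-first) from a start peer.

    links: peer -> list of neighbors inside the SON
    docs:  peer -> {doc_id: feature_vector}
    Returns (matching doc ids with similarity >= ts, number of contacted peers).
    """
    seen, queue, matches = {start}, deque([start]), []
    while queue:
        peer = queue.popleft()
        for doc_id, vec in docs.get(peer, {}).items():
            if cosine(query, vec) >= ts:
                matches.append(doc_id)
        for nxt in links.get(peer, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return sorted(matches), len(seen)


query = {"jazz": 1.0}
son_features = {"music": {"jazz": 1.0, "music": 1.0}, "bio": {"protein": 1.0}}
links = {"p1": ["p2"], "p2": ["p1", "p3"], "p3": ["p2"]}
docs = {
    "p1": {"d1": {"jazz": 1.0}},
    "p3": {"d2": {"jazz": 1.0, "blues": 1.0}, "d3": {"rock": 1.0}},
}
chosen = route_to_sons(query, son_features, top_n=1)
matches, contacted = flood("p1", links, docs, query, ts=0.2)
print(chosen, matches, contacted)
```

Flooding contacts every peer of the chosen SON, which is why the number of contacted peers is a natural cost metric for this design.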
Adaptive Clustering
• After global SON creation
– Broadcast final cluster descriptions to all peers
– Use zone hierarchy for efficient broadcasting
• Each peer can then
– Reassign its documents to clusters
– Join the appropriate SONs
• Similar to a feedback mechanism
• Advantages – see experimental results
[Figure: Final organization, showing peers A to J at the peer level, the super-peer level, and the clusters of A, D, G and J]
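The reassignment step each peer performs can be sketched as below, assuming that a document goes to the global cluster with the highest cosine similarity to its feature vector (the slides do not state the assignment rule). `reassign` and the example cluster names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {feature: weight}."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def reassign(doc_vectors, global_clusters):
    """Assign each local document to its most similar global cluster.

    doc_vectors:     {doc_id: feature_vector} held by one peer
    global_clusters: {cluster_name: feature_vector} received via broadcast
    Returns (assignment, sons_to_join).
    """
    names = sorted(global_clusters)  # fixed order makes ties deterministic
    assignment = {}
    for doc_id, vec in doc_vectors.items():
        assignment[doc_id] = max(
            names, key=lambda c: cosine(vec, global_clusters[c]))
    return assignment, set(assignment.values())


global_clusters = {
    "bio": {"protein": 1.0, "cell": 1.0},
    "music": {"jazz": 1.0, "music": 1.0},
}
local_docs = {"d1": {"jazz": 1.0}, "d2": {"protein": 2.0, "cell": 1.0}}
assignment, sons_to_join = reassign(local_docs, global_clusters)
print(assignment, sons_to_join)
```

A peer whose documents map to several clusters joins several SONs, as in this example.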
Experimental Setup
• GT-ITM topology generator (1K, 5K peers)
• TREC.GOV2 (1M docs), Reuters (810K docs)
• Random querying peer
• Query: “Given doc X, find the top-k similar docs to X”
• Cosine similarity
• Similarity threshold Ts, to determine matching docs to query
• Metrics
– Recall
– Recall@k
– Precision@k
– #Contacted peers
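For reference, the listed metrics can be computed as in the sketch below. It assumes the usual definitions and that the relevant set consists of all documents in the collection whose cosine similarity to the query is at least Ts; the slides do not give the formulas, and the function names and data are hypothetical.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the first k returned documents that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for d in ranked[:k] if d in relevant) / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents found among the first k returned."""
    if not relevant:
        return 0.0
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)


def recall(returned, relevant):
    """Fraction of all relevant documents found anywhere in the result."""
    return recall_at_k(returned, relevant, len(returned))


# Hypothetical result list (best first) and relevant set for one query.
ranked = ["d1", "d4", "d2", "d9"]
relevant = {"d1", "d2", "d3"}
print(precision_at_k(ranked, relevant, 2),
      recall_at_k(ranked, relevant, 2),
      recall(ranked, relevant))
```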
Clustering Statistics
• Adaptive clustering
– decreases the average pair-wise similarity of clusters
– increases the average pair-wise similarity of documents within a cluster (not shown here)
Search Evaluation
• Recall
– Ts=0.2
– Also tried Ts=0.1
• #Contacted Peers
Search Evaluation - GOV2/P5000
[Figure: precision versus Top-N Clusters (1 to 5), y-axis from 0 to 0.8; series Pre@10, Pre@20, Pre@40, Pre@60, Pre@80]
[Figure: recall versus Top-N Clusters (1 to 5), y-axis from 0 to 0.35; series Rec@10, Rec@20, Rec@40, Rec@60, Rec@80]
SON-based versus Plain Super-peer
Conclusions
• We presented a novel approach for P2P similarity search
• Peers self-organize into SONs, forming a super-peer network
• We showed how a high-quality searching mechanism can be deployed
• We presented experiments on 2 large document collections (GOV2 and Reuters) to evaluate our approach
• Future work:
– More efficient inter-SON routing
– Semantic similarity search using query expansion
– Use of other clustering algorithms to improve performance
Thank you for your attention!
More info: http://www.db-net.aueb.gr/
http://www.idi.ntnu.no/grupper/db/