
Improving Search in P2P Networks

By Shadi Lahham

Purpose of This Lecture

• General understanding of P2P systems

• Appreciating the need for efficient search

• Applying different search techniques to different scenarios


Table Of Contents

• P2P Basics

– What Is P2P

– Advantages of P2P

– Types of P2P Systems

– Shortcomings

• Search Methods

– The Search Problem

– Current Methods

– Suggested Methods

• Experimental Setup

– Metrics

– Data Collection

– Calculating Costs

• Analysis of Results

• Conclusions

Introduction

P2P Basics


What is P2P

• Distributed system

• Peers (nodes) are servers and clients simultaneously

• Peers are of equal roles

• Resources shared across peers

• No central server needed

• Examples of P2P systems: Napster, Gnutella, KaZaA


P2P Overview

File    Key
file1   f1
file2   f2
file3   f3


Advantages of P2P

• P2P vs. centralized servers

– Distributes disk space / bandwidth

– Inexpensively scalable

– Self-organizing (autonomous)

– Load balancing

– Adaptive / fault tolerant

– Less susceptible to attacks

– Allows for redundancy


Types of P2P Systems

• Hybrid (Napster)

• Pure (Gnutella)

• Super Peers (KaZaA)


Hybrid (Napster)


Pure (Gnutella)


Super Peers (KaZaA)

• Make use of heterogeneity

– Powerful peers serve as super peers

– Weaker peers act as clients

• Super peers index clients’ files

– Requires updates on join/leave/update

• Queries handled at super-peer level

– Saves query costs


Super Peers (KaZaA)


Hybrid - Shortcomings

• High cost on centralized index

• Performance & scalability bottleneck

• Needs maintenance

• Vulnerable: a highly visible target


Pure - Shortcomings

• Inefficient search (flooding)

• Heterogeneity of peers not considered

– Bottlenecks (limited peers)

– Fragmentation


Super Peers - Shortcomings

• Super peers might become bottlenecks for clients

– Requires redundancy

• A bad selection of super peers might cause even worse problems

Search Methods


The Search Problem

• Connected graph

• Might contain cycles

• Individual node doesn’t know structure

• Only knows its neighbors

• No idea where data can be found


The Search Problem

• Goal: find as many occurrences of the data as possible using minimal time and resources

• Solution:

– BFS?

– Bounded BFS?

– (naive approaches)


Bounded BFS Search

[Figure: bounded BFS spreading out from the source; successive hops carry TTL=2, TTL=1, TTL=0]


Bounded BFS Search

• Messages get a global TTL (time to live)

• Algorithm

– Source broadcasts a message to a subset of neighbors

– Neighbors search locally. Results are sent to the source if found

– TTL = TTL – 1

– As long as TTL > 0, nodes forward the message to their neighbors

• Downside: wastes bandwidth / processing
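The algorithm above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the overlay is a plain adjacency dictionary, `has_data` is a hypothetical predicate standing in for the local search, the query is forwarded to all neighbors rather than a subset, and `ttl` counts the number of hops the query may travel. The message counter makes the "wastes bandwidth" downside visible, because duplicate deliveries over cycles are counted too.

```python
from collections import deque

def bounded_bfs(graph, source, ttl, has_data):
    """Flood a query from `source` for at most `ttl` hops.

    Returns (nodes_holding_data, messages_sent). `graph` maps a node to
    its list of neighbors; `has_data` is an assumed local-search predicate.
    """
    seen = {source}
    frontier = deque([(source, ttl)])
    hits, messages = [], 0
    while frontier:
        node, remaining = frontier.popleft()
        if node != source and has_data(node):
            hits.append(node)              # result would be sent back to the source
        if remaining == 0:
            continue                       # TTL exhausted: do not forward
        for neighbor in graph[node]:
            messages += 1                  # every forward costs bandwidth, even a duplicate
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, remaining - 1))
    return hits, messages

# Toy overlay with a cycle A-B-D-C-A (made up for illustration)
overlay = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"],
           "D": ["B", "C", "E"], "E": ["D"]}
holders = {"D", "E"}
print(bounded_bfs(overlay, "A", 2, lambda n: n in holders))
```

On this toy overlay a TTL of 2 reaches only D, while a TTL of 3 also reaches E at the price of more messages, several of them redundant.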


Current Methods

• Gnutella - BFS

– High cost

– Gets complete results (for depth D)

– Relatively short time

• Freenet - DFS

– Poor response time

– Minimizes BW costs


Suggested Methods

• Iterative deepening

• Directed BFS

• Local Indices


Iterative Deepening

• Idea:

– Search at a small depth and increase it if required

– Aims to minimize the cost of BFS without detracting from its ability to satisfy queries

• Notice that, given enough iterations, this method returns 100% of the results of BFS


Iterative Deepening (cont…)

• Elements:

– Policies P={a,b,c,..} define deepening behavior

– BFS is run to depth a and frozen

– If the source is satisfied, it stops the process

– Otherwise it asks the BFS to resume to depth b

– The process is repeated until the source is satisfied or we reach the last policy item


Iterative Deepening (cont…)

• Elements:

– We can specify how long to wait between iterations

– We need a system-wide message ID to identify individual messages


Example P={1,3,4} W=1
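A minimal Python sketch of this policy, under stated assumptions: the overlay is an adjacency dictionary, `has_data` is a hypothetical local-search predicate, `needed` plays the role of the number of results that satisfies the source, and the waiting time W is left out entirely. Keeping the frontier between iterations stands in for the "freeze and resume" behavior, so no node is queried twice.

```python
def iterative_deepening(graph, source, policy, needed, has_data):
    """Expand a BFS to each depth in `policy`; stop once `needed` results exist.

    Returns (results, depth_reached).
    """
    seen = {source}
    frontier = [source]          # the frozen frontier, resumed at the next iteration
    results = []
    depth = 0
    for limit in policy:
        while depth < limit and frontier:
            next_frontier = []
            for node in frontier:
                for neighbor in graph[node]:
                    if neighbor not in seen:
                        seen.add(neighbor)
                        next_frontier.append(neighbor)
                        if has_data(neighbor):
                            results.append(neighbor)
            frontier = next_frontier
            depth += 1
        if len(results) >= needed:
            break                # source is satisfied; deeper nodes are never contacted
    return results, depth

# Toy chain A-B-C-D-E where B and D hold the data (made up for illustration)
chain = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "E"], "E": ["D"]}
print(iterative_deepening(chain, "A", [1, 3, 4], 2, lambda n: n in {"B", "D"}))
```

With P={1,3,4}, a source that needs one result stops at depth 1, one that needs two stops at depth 3, and the last depth is only reached when the query is still unsatisfied.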


Directed BFS

• Idea:

– Choose a subset of neighbors to query

– Neighbors will BFS as usual

– Aims to provide a balance between good response time and results

– Minimize costs of full BFS

• Notice that only a subset of the possible results is returned, so we might fail to satisfy the query


Directed BFS Example

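[Figure: directed BFS from the source through a chosen subset of neighbors; successive hops carry TTL=2, TTL=1, TTL=0]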


Directed BFS (cont…)

• But which neighbors to pick?

– Maintain simple statistics on neighbors to derive heuristics

  • Highest past results

  • Lowest average hops (close to nodes containing useful data)

  • High message count (stable, can handle large flow)

  • Shortest message queue (long implies saturation)

  • More to come …
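One way to picture the heuristics above is as a ranking over per-neighbor statistics. The sketch below is illustrative only: the `NeighborStats` record, its field names, and the sample numbers are assumptions, not part of any real client.

```python
from dataclasses import dataclass

@dataclass
class NeighborStats:
    name: str
    past_results: int    # results returned over recent queries
    avg_hops: float      # average hops taken by those results
    queue_length: int    # current message queue length

def pick_neighbors(stats, k, key, prefer_high=True):
    """Return the names of the k neighbors ranked best by `key`."""
    ranked = sorted(stats, key=key, reverse=prefer_high)
    return [s.name for s in ranked[:k]]

# Made-up statistics for three neighbors
stats = [
    NeighborStats("n1", past_results=12, avg_hops=3.0, queue_length=5),
    NeighborStats("n2", past_results=40, avg_hops=4.5, queue_length=9),
    NeighborStats("n3", past_results=7, avg_hops=2.0, queue_length=1),
]
# Highest past results
print(pick_neighbors(stats, 2, key=lambda s: s.past_results))
# Shortest message queue
print(pick_neighbors(stats, 1, key=lambda s: s.queue_length, prefer_high=False))
```

The source then sends the query only to the selected neighbors, which continue with an ordinary BFS.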


Local Indices

• Idea:

– Nodes hold metadata of all nodes within radius r

– A query is processed at fewer nodes but still returns the same number of results

– Aims to balance satisfaction / costs


Local Indices

• Elements:

– Policies P={a,b,c,..} define the depths at which we search

  • Example P={1,5,6}

  • Nodes at depth 1 process the query

  • Nodes at depth 2,3,4 forward without processing

  • Policy ends at depth 6

– System-wide radius r (small ~ 50K metadata)


Example P={1,4}

[Figure: nodes that process the query vs. nodes that don’t process it; r = ?]


Local Indices (cont…)

– Notice that now there is an overhead

– On join

  • Send a join message with TTL = r

  • Direct exchange of metadata

– On leave / timeout

  • Remove metadata of gone / dead nodes

– On update

  • Send an update message with TTL = r
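The following sketch shows what an index of radius r might contain and why a single node can answer on behalf of its neighborhood. It is an illustration under assumptions: the overlay is an adjacency dictionary, `files` is a hypothetical map from node to the keys it shares, and the index is rebuilt from scratch instead of being maintained through join, leave and update messages.

```python
def nodes_within(graph, start, radius):
    """Return every node at most `radius` hops from `start`, itself included."""
    seen = {start}
    frontier = [start]
    for _ in range(radius):
        next_frontier = []
        for node in frontier:
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return seen

def build_index(graph, files, node, radius):
    """Map each file key to the nodes within `radius` hops of `node` that hold it."""
    index = {}
    for other in nodes_within(graph, node, radius):
        for key in files.get(other, ()):
            index.setdefault(key, set()).add(other)
    return index

# Toy chain A-B-C-D-E with made-up shared files
chain = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "E"], "E": ["D"]}
files = {"C": ["song"], "E": ["song", "doc"]}
print(build_index(chain, files, "D", 1))
```

With r = 1, node D can report the copies held by C and E without either of them processing the query, which is where the savings come from. The price is the join, leave and update traffic needed to keep every index current.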

Experimental Setup


Metrics

• How to compare methods?

1. Costs

2. Results

3. Time


Metrics

1. Costs

– We do not base cost on a specific query but rather calculate the average cost over Qrep, a representative set of real queries submitted

– It makes sense to discuss costs in aggregate (i.e., over all the nodes in the network)

– Therefore our two cost metrics are

  • Average aggregate bandwidth

  • Average aggregate processing cost


Metrics

2. Results quality

– Number of results

– Satisfaction

3. Time to satisfaction


Data Collection

• Data gathered from Gnutella network

• Directly measured

– Iterative deepening

– Directed BFS

• Performance data & analysis

– Local indices


Data Collection

Collected Data

• Number of hops

• Response time

• Results per message

• Source IP

• Etc …


Data Collection

Extracted Data

Symbol    Description
M(Q,n)    # of response messages received for query Q, from n hops away
R(Q,n)    # of results received for query Q, from n hops away
N(Q,n)    # of nodes n hops away that process Q
C(Q,n)    # of redundant edges n hops away


Calculating Costs

• We’ve seen two types of costs

– Bandwidth (BW) costs

– Processing costs

• Calculations should take into account

– Costs of sending a query

– Costs of sending replies

• An example of calculating BW costs


Calculating Costs

\[
BW_{bfs}(Q) = \sum_{n=1}^{D} \Bigl( a(Q) \cdot \bigl( N(Q,n) + C(Q,n) \bigr) + n \cdot \bigl( c \cdot R(Q,n) + d \cdot M(Q,n) \bigr) \Bigr)
\]

Symbol    Description
a(Q)      Size of query Q
c         Size of result record
d         Size of response message
D         Max TTL
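As a worked illustration of the formula, the sketch below sums the query cost and the reply cost hop by hop. The function name, the argument layout, and all of the numbers are made up for the example; only the structure of the sum comes from the slide.

```python
def bw_bfs(query_size, result_size, response_size, per_hop):
    """Aggregate bandwidth of a BFS query.

    per_hop[n-1] = (N, C, R, M) for nodes n hops away:
    processing nodes, redundant edges, results, response messages.
    """
    total = 0
    for n, (nodes, redundant, results, responses) in enumerate(per_hop, start=1):
        # The query is sent once per processing node and once per redundant edge
        total += query_size * (nodes + redundant)
        # Replies travel n hops back to the source
        total += n * (result_size * results + response_size * responses)
    return total

# Hypothetical measurements for D = 2, with a(Q) = 80, c = 70, d = 20 bytes
per_hop = [(4, 0, 2, 1), (10, 3, 5, 2)]
print(bw_bfs(80, 70, 20, per_hop))  # 2300
```

For these numbers the first hop contributes 320 + 160 = 480 and the second 1040 + 780 = 1820, giving 2300 in total. Replies from deeper hops weigh more because each one is relayed n times.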

Analysis of Results

Iterative Deepening


Symbols Used

Symbol    Definition
D         Maximum time-to-live of a message, in terms of hops
Z         Number of results needed to satisfy a query
Qrep      Representative set of queries for the Gnutella network
W         Waiting time (in seconds) between iterations
Ng        Number of neighbors of client (source node)


Results – Iterative Deepening

• Recall that iterative deepening policies P={a,b,c,..} define deepening behavior

• In order to have the same level of satisfaction as BFS, a policy must have D as the last depth

• Also note the degenerate-case policy {D}, which is the bounded BFS we presented earlier



• Variables

– Define:

Pd = { d, d+1, …, D }

P = { Pd for d = 1, 2, …, D }

= { {1,2,…,D}, {2,3,…,D}, …, {D-1,D}, {D} }

– W (waiting time) can take the values 1, 2, 4, 6, 150 (seconds)
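The family of policies defined above is easy to enumerate. The helper below is a small sketch whose name and dictionary layout are assumptions made for illustration.

```python
def deepening_policies(max_depth):
    """Return {d: Pd} where Pd = [d, d+1, ..., D] and D = max_depth."""
    return {d: list(range(d, max_depth + 1)) for d in range(1, max_depth + 1)}

print(deepening_policies(3))  # {1: [1, 2, 3], 2: [2, 3], 3: [3]}
```

The last entry is always the single-depth policy {D}, the bounded BFS case.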



• Fixed values: Z = 50, Ng = 8

– Increasing Z

  • Lower probability of satisfaction

  • Higher costs

  • More results

– Decreasing Ng

  • Slightly lower probability of satisfaction

  • Significantly lower costs



• BW costs for P7 are the same for all values of W

• As d increases, costs increase: the larger d is, the more likely the policy will “overshoot”

• As W decreases, costs increase: with a small W, a premature determination of dissatisfaction again leads to overshooting



• Time to satisfaction is inversely proportional to cost

• Choose a policy that balances average waiting time and cost

• For example: P5 with W = 6

Analysis of Results

Directed BFS


Heuristics - Directed BFS

Symbol    Heuristic
RAND      (Random)
>RES      Returned the greatest number of results*
<TIME     Had the shortest average time to satisfaction*
<HOPS     Had the smallest average number of hops taken by results*
>MSG      Sent our client the greatest number of messages (all types)
<QLEN     Had the shortest message queue
<LAT      Had the shortest latency
>DEG      Had the highest degree (number of neighbors)

* in the past 10 queries


Results – Directed BFS



• Costs in directed BFS are unaffected by Z

• Users are more aware of the quality of results than of BW costs

– We recommend >RES, <TIME

– Still cheaper than full BFS (~65%)

• Summary so far

– Iterative deepening: lowest costs

– Directed BFS: fastest time to satisfaction

Analysis of Results

Local Indices


Results – Local Indices

• Recall that local indices policies P={a,b,c,..} define the depths at which we search

• We choose policies that minimize the number of nodes that process the query



• We consider the following policies



• Also recall that joins / leaves / updates have a BW overhead

• QJR (QueryJoinRatio) gives us the ratio of queries to joins/leaves in the network



[Figure not recoverable; surviving labels: P0 (r=0), 21 MB, 71 KB]



• Time to satisfaction

– Because most Query and Response messages have r fewer hops to travel, the time to forward messages to the outermost depth and back to the source will be shorter than for BFS

– However, because nodes have larger indices, processing the query should take more time



• Summary

– Huge savings in costs

– Time to satisfaction comparable to BFS

– Determining r must take QJR into consideration

• For current QJR values (e.g. Gnutella = 10), r = 1 is a good choice


Relative performance

Technique             Time to   Satisfaction   Number of   Aggregate   Aggregate
                      satisfy   probability    results     bandwidth   processing
Bounded BFS           100%      100%           100%        100%        100%
Iterative deepening   190%      100%           19%         28%         47%
Directed BFS          140%      86%            37%         38%         28%
Local indices         ≈100%     100%           100%        39%         51%


Conclusions

• All 3 methods show significant bandwidth and processing savings

• Methods are simple and easy to implement in current systems

• Methods might be used in conjunction


Bibliography

Yang, Beverly; Garcia-Molina, Hector:

• Improving Search in Peer-to-Peer Systems
  http://newdbpubs.stanford.edu:8090/pub/2002-28

• Improving Search in Peer-to-Peer Systems [extended]

• Designing a Super-peer Network
  http://newdbpubs.stanford.edu:8090/pub/2003-33

Gnutella website
  http://www.gnutella.com/

Thank you
