Gossip-based Search Selection in Hybrid Peer-to-Peer Networks
M. Zaharia and S. Keshav
D.R. Cheriton School of Computer Science, University of Waterloo, Waterloo, ON, Canada
[email protected], [email protected]
IPTPS 2006, Feb 28th 2006
The Search Problem
Decentralized system of nodes, each of which stores copies of documents
Keyword-based search
o Each document is identified by a set of keywords (e.g. song title)
o Queries return lists of documents whose keyword sets are supersets of the query keywords ("AND queries")
Example
o Song: "Here Comes the Sun"; keywords: "Here", "Comes", "The", "Sun"
o Query: "Here" AND "Sun"
o Responses: "Here Comes the Sun", "The Sun is Here"
Metrics
Success rate
o fraction of queries that return a result, conditional on a result being available
Number of results found
o no more than a desired maximum Rmax
Response time
o for the first result, and for the Rmax-th result
Bandwidth cost
o includes the costs of index creation, query propagation, and fetching the result(s)
Key Workload Characteristics
Document popularities follow a Zipfian distribution
o Some documents are more widely copied than others
o These documents are also requested more often
Some nodes have much faster connections and much longer connection durations than others
So…
Retrieve popular documents with the least work
Offload work to better-connected and longer-lived peers
How can we do that?
Hybrid P2P network [Loo, IPTPS 2004]
(architecture diagram: DHT, ultrapeers, peers, and bootstrap nodes)
Flood queries for popular documents
Use the DHT for rare documents
Only publish rare documents to the DHT index
How to know document popularity?
PIERSearch uses
o observations of result size history, keyword frequency, and keyword-pair frequency
o sampling of neighboring nodes
These are all local; global knowledge is better

More on global knowledge
Want a histogram of document popularity
o i.e. the number of ultrapeers that index a document
o we only care about popular documents, so we can truncate the tail
On getting a query, sum the histogram values for all matching document titles and divide by the number of ultrapeers
If this exceeds a threshold, then flood, else use the DHT*
* modulo rare documents with common keywords, see paper
Example
Assume 100 ultrapeers and only two documents
Suppose the title 'Here Comes the Sun' has count 15 (15 ultrapeers index it) and 'You Are My Sun' has count 2
Query 'Sun' has sum (15 + 2)/100 = 0.17
Query 'Are My' has sum 2/100 = 0.02
If the threshold is 0.05, the first query is flooded and, for the second, we use the DHT
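The decision rule above can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the names (RouteChooser, popularityCount) and the substring-based keyword match are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

public class RouteChooser {
    static final int ULTRAPEERS = 100;     // assumed population, as in the example
    static final double THRESHOLD = 0.05;  // example threshold from the slide

    // Approximate histogram: title -> number of ultrapeers indexing it.
    static final Map<String, Integer> popularityCount = new HashMap<>();
    static {
        popularityCount.put("Here Comes the Sun", 15);
        popularityCount.put("You Are My Sun", 2);
    }

    // Sum the histogram values of all titles matching the query, divide by
    // the number of ultrapeers, and flood if the fraction exceeds the threshold.
    static boolean shouldFlood(String query) {
        int sum = 0;
        for (Map.Entry<String, Integer> e : popularityCount.entrySet()) {
            if (matches(e.getKey(), query)) sum += e.getValue();
        }
        return (double) sum / ULTRAPEERS > THRESHOLD;
    }

    // "AND query": every query keyword must appear in the title.
    // (Crude substring match for illustration only.)
    static boolean matches(String title, String query) {
        String t = title.toLowerCase();
        for (String kw : query.toLowerCase().split("\\s+")) {
            if (!t.contains(kw)) return false;
        }
        return true;
    }
}
```

With these counts, 'Sun' yields 17/100 = 0.17 (flood) and 'Are My' yields 2/100 = 0.02 (DHT), matching the worked example.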
How to compute the histogram?
Central server
o centralizes load and introduces a single point of failure
Compute on an induced tree
o brittle to failures
Gossip
o pick a random node and exchange partial histograms
o can result in double counting
Double counting problem
Suppose ultrapeer A indexes titles {a, b}, B indexes {a, c}, and C indexes {a, d}. After A and B gossip and sum their counts, both hold the partial histogram a:2, b:1, c:1. When C then gossips with one of them, it computes a:3, b:1, c:1, d:1. If these partial histograms are summed again in a later exchange, the count for 'a' climbs to 5, even though only three ultrapeers index it: naive summation counts the same ultrapeer's contribution multiple times.
Avoiding double counting
When an ultrapeer indexes a document title it hasn't indexed already, it tosses a coin up to k times and counts the number of heads it sees before the first tail; call this CT
Gossip CT values for all titles with other ultrapeers to compute maxCT
o because max is an extremal value, there is no double counting (Flajolet-Martin)
The number of ultrapeers with the document is roughly 2^maxCT
Example
o 1000 nodes
o chances are good that one node will see 10 consecutive heads
o it gossips '10'
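A minimal sketch of this coin-tossing estimator, in the Flajolet-Martin style. The class and method names are illustrative, not from the paper:

```java
import java.util.Random;

public class CoinCounter {
    // CT = number of heads before the first tail, capped at k tosses.
    static int drawCT(Random rng, int k) {
        int heads = 0;
        while (heads < k && rng.nextBoolean()) heads++;
        return heads;
    }

    // Gossiping a maximum is idempotent: merging the same value twice
    // changes nothing, so double counting cannot occur.
    static int merge(int maxSoFar, int incomingCT) {
        return Math.max(maxSoFar, incomingCT);
    }

    // The population indexing the title is roughly 2^maxCT.
    static long estimate(int maxCT) {
        return 1L << maxCT;
    }
}
```

In the slide's example, a node that saw 10 consecutive heads gossips CT = 10, and estimate(10) gives 2^10 = 1024, close to the 1000 nodes assumed.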
Approximate histograms
Use the coin-flipping trick for each document
o note that there can be up to 50% error
Gossip partial histograms
Concatenate histograms
Truncate low-count documents
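The gossip merge step above can be sketched as taking, per title, the maximum CT value from the two partial histograms, then dropping low-count titles. This is an assumed sketch (names and cutoff are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

public class HistogramMerge {
    // Merge two partial histograms (title -> CT) by per-title maximum.
    // Max is extremal, so repeated merges never double-count.
    static Map<String, Integer> merge(Map<String, Integer> a, Map<String, Integer> b) {
        Map<String, Integer> out = new HashMap<>(a);
        b.forEach((title, ct) -> out.merge(title, ct, Math::max));
        return out;
    }

    // Truncate the tail: only popular documents matter for the flood/DHT
    // decision, so drop titles whose CT is below a cutoff.
    static Map<String, Integer> truncate(Map<String, Integer> h, int minCT) {
        Map<String, Integer> out = new HashMap<>();
        h.forEach((title, ct) -> { if (ct >= minCT) out.put(title, ct); });
        return out;
    }
}
```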
What about the threshold?
If chosen too low, we flood too often; if chosen too high, we flood too rarely
The threshold is time dependent and load dependent
There is no easy way to choose it
Adaptive thresholding
Associate a utility with the performance of a query
The threshold should maximize utility
For some queries, use both flooding and the DHT and compare utilities
This tells us how to move the threshold in the future
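One way to realize this feedback loop, as a sketch: for sampled probe queries run both strategies, compare utilities, and nudge the threshold toward whichever did better. The step size, bounds, and initial value here are assumptions, not values from the paper:

```java
public class AdaptiveThreshold {
    double threshold = 0.05;          // assumed starting point
    static final double STEP = 0.005; // assumed step size

    // Called for probe queries that were executed via both flooding and the DHT.
    void update(double floodUtility, double dhtUtility) {
        if (floodUtility > dhtUtility) {
            // Flooding performed better: lower the threshold so more queries flood.
            threshold = Math.max(0.0, threshold - STEP);
        } else if (dhtUtility > floodUtility) {
            // The DHT performed better: raise the threshold so fewer queries flood.
            threshold = Math.min(1.0, threshold + STEP);
        }
    }
}
```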
Utility function
(figure omitted)
Adaptive thresholding
(figure omitted)
Evaluation
Built an event-driven simulator, in Java, for peer-to-peer search in generic peer-to-peer network architectures
Simulates each query, response, and document download
Uses user lifetime and bandwidth distributions observed in real systems
Generates random exact queries based on the fetch-at-most-once model (Zipfian with a flattened head)
o can also use traces of queries from real systems
Parameters
3 peers join every 4 seconds
Each enters with an average of 20 documents, randomly chosen from a dataset of 20,000 unique documents
Peers emit queries on average once every 300 seconds, requesting at most 25 results
Zipf parameter of 1.0
1.7 million queries over a 22-hour period
Simulation stability
A stable population is achieved at 20,000 seconds
Variance of all results is under 5% and is omitted for clarity
Systems compared
Pure DHT
o All queries looked up in a DHT using an in-network adaptive join method
Simple Hybrid
o Models PIER: queries are first flooded to depth 2, then looked up in a DHT if fewer than 25 results are received from the flood after 2 s
Central Server
o An ideal central server with zero request service time
Metrics
Recall: percentage of queries that found a matching document, given that at least one available document at some peer matches the query
FRT: mean first response time for successful queries
LRT: mean last response time, i.e. the response time for the Rmax-th result, for successful queries
BWC: bandwidth cost in kilobytes per query; includes the cost of publishing and gossiping
Performance (normalized)

System          Recall   FRT    LRT    BWC
Pure DHT        99.9%    1.18   0.62   1.63
Simple Hybrid   99.9%    1.00   1.00   1.00
GAB             99.9%    0.70   0.57   0.51
Central Server  100.0%   0.41   0.21   0.22
Adaptive thresholding
(figure omitted)
Scaling (normalized)

Active Population   FRT    LRT    BWC
7,000               1.00   1.00   1.00
17,500              1.00   0.85   1.06
23,300              1.03   0.80   1.07
Trace-based simulation

System          Recall   FRT    LRT    BWC
Simple Hybrid   87.0%    1.00   1.00   1.00
GAB             87.3%    0.59   0.45   0.67
• Trace of 50 ultrapeers for 3 hours on Sunday October 12, 2003
• ~230,000 distinct queries
• ~200,000 distinct keywords
• ~672,000 distinct documents
Conclusions
Gossip is an effective way to compute global state
Utility functions provide simple 'knobs' to control performance and balance competing objectives
Adaptive algorithms (threshold selection and flooding) reduce the need for external management and "magic constants"
Giving hybrid ultrapeers access to global state reduces overhead by a factor of about two
Questions?
Simulator Speedup
Fast I/O routines
o Java creates temporary objects during string concatenation; a custom, large StringBuffer for string concatenation greatly improves performance
Batch database uploads
o prepared statements turn out to be much less efficient than importing a table from a tab-separated text file
Avoid keyword search for exact queries
Can simulate 20 hours with a population of 7,000 users (~2,300,000 queries) in about 20 minutes
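The StringBuffer trick above can be sketched as reusing one large, preallocated buffer instead of concatenating with '+', which allocates a temporary object per concatenation. The class name, capacity, and record format are illustrative assumptions:

```java
public class FastConcat {
    // One shared buffer, sized generously up front (1 MiB is an assumption).
    private static final StringBuffer BUF = new StringBuffer(1 << 20);

    static String formatEvent(long time, String type, String detail) {
        BUF.setLength(0); // reset length but keep the backing array
        BUF.append(time).append('\t').append(type).append('\t').append(detail);
        return BUF.toString();
    }
}
```

(In modern single-threaded code, an unsynchronized StringBuilder would be the idiomatic choice; the slides predate that being common practice.)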