Gossip-based Search Selection in Hybrid Peer-to-Peer Networks
M. Zaharia and S. Keshav
D.R. Cheriton School of Computer Science, University of Waterloo, Waterloo, ON, Canada
[email protected], [email protected]
IPTPS 2006, Feb 28th 2006
The Search Problem
Decentralized system of nodes, each of which stores copies of documents
Keyword-based search
o Each document is identified by a set of keywords (e.g. song title)
o Queries return lists of documents whose keyword sets are supersets of the query keywords ("AND queries")
Example
o Song: "Here Comes the Sun"; keywords: "Here", "Comes", "The", "Sun"
o Query: "Here" AND "Sun"
o Responses: "Here Comes the Sun", "The Sun is Here"
Metrics
Success rate
o fraction of queries that return a result, conditional on a result being available
Number of results found
o no more than a desired maximum Rmax
Response time
o for the first result, and for the Rmax-th result
Bandwidth cost
o includes the costs of index creation, query propagation, and fetching the result(s)
Key Workload Characteristics
Document popularities follow a Zipfian distribution
o Some documents are more widely copied than others
o These documents are also requested more often
Some nodes have much faster connections and much longer connection durations than others
So…
Retrieve popular documents with the least work
Offload work to better-connected and longer-lived peers
How can we do that?
Hybrid P2P network [Loo, IPTPS 2004]
(architecture diagram: DHT, ultrapeers, peers, and bootstrap nodes)
Flood queries for popular documents
Use the DHT for rare documents
Only publish rare documents to the DHT index
How to know document popularity?
PIERSearch uses
o observations of result size history, keyword frequency, and keyword-pair frequency
o sampling of neighboring nodes
These are all local; global knowledge is better

More on global knowledge
Want a histogram of document popularity
o i.e. the number of ultrapeers that index a document
o we only care about popular documents, so we can truncate the tail
On getting a query, sum the histogram values for all matching document titles and divide by the number of ultrapeers
If this exceeds a threshold, then flood, else use the DHT*
* modulo rare documents with common keywords, see paper
Example
Assume 100 ultrapeers and only two documents
Suppose the title 'Here Comes the Sun' has count 15 (15 ultrapeers index it) and 'You Are My Sun' has count 2
Query 'Sun' has sum (15 + 2)/100 = 0.17
Query 'Are My' has sum 2/100 = 0.02
If the threshold is 0.05, the first query is flooded and, for the second, we use the DHT
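The decision rule above can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the names (RouteChooser, popularityCount) and the substring-based keyword match are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

public class RouteChooser {
    static final int ULTRAPEERS = 100;     // assumed population, as in the example
    static final double THRESHOLD = 0.05;  // example threshold from the slide

    // Approximate histogram: title -> number of ultrapeers indexing it.
    static final Map<String, Integer> popularityCount = new HashMap<>();
    static {
        popularityCount.put("Here Comes the Sun", 15);
        popularityCount.put("You Are My Sun", 2);
    }

    // Sum the histogram values of all titles matching the query, divide by
    // the number of ultrapeers, and flood if the fraction exceeds the threshold.
    static boolean shouldFlood(String query) {
        int sum = 0;
        for (Map.Entry<String, Integer> e : popularityCount.entrySet()) {
            if (matches(e.getKey(), query)) sum += e.getValue();
        }
        return (double) sum / ULTRAPEERS > THRESHOLD;
    }

    // "AND query": every query keyword must appear in the title.
    // (Crude substring match for illustration only.)
    static boolean matches(String title, String query) {
        String t = title.toLowerCase();
        for (String kw : query.toLowerCase().split("\\s+")) {
            if (!t.contains(kw)) return false;
        }
        return true;
    }
}
```

With these counts, 'Sun' yields 17/100 = 0.17 (flood) and 'Are My' yields 2/100 = 0.02 (DHT), matching the worked example.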
How to compute the histogram?
Central server
o centralizes load and introduces a single point of failure
Compute on an induced tree
o brittle to failures
Gossip
o pick a random node and exchange partial histograms
o can result in double counting
Double counting problem
Suppose ultrapeer A indexes titles {a, b}, B indexes {a, c}, and C indexes {a, d}. After A and B gossip and sum their counts, both hold the partial histogram a:2, b:1, c:1. When C then gossips with one of them, it computes a:3, b:1, c:1, d:1. If these partial histograms are summed again in a later exchange, the count for 'a' climbs to 5, even though only three ultrapeers index it: naive summation counts the same ultrapeer's contribution multiple times.
Avoiding double counting
When an ultrapeer indexes a document title it hasn't indexed already, it tosses a coin up to k times and counts the number of heads it sees before the first tail; call this CT
Gossip CT values for all titles with other ultrapeers to compute maxCT
o because max is an extremal value, there is no double counting (Flajolet-Martin)
The number of ultrapeers with the document is roughly 2^maxCT
Example
o 1000 nodes
o chances are good that one node will see 10 consecutive heads
o it gossips '10'
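A minimal sketch of this coin-tossing estimator, in the Flajolet-Martin style. The class and method names are illustrative, not from the paper:

```java
import java.util.Random;

public class CoinCounter {
    // CT = number of heads before the first tail, capped at k tosses.
    static int drawCT(Random rng, int k) {
        int heads = 0;
        while (heads < k && rng.nextBoolean()) heads++;
        return heads;
    }

    // Gossiping a maximum is idempotent: merging the same value twice
    // changes nothing, so double counting cannot occur.
    static int merge(int maxSoFar, int incomingCT) {
        return Math.max(maxSoFar, incomingCT);
    }

    // The population indexing the title is roughly 2^maxCT.
    static long estimate(int maxCT) {
        return 1L << maxCT;
    }
}
```

In the slide's example, a node that saw 10 consecutive heads gossips CT = 10, and estimate(10) gives 2^10 = 1024, close to the 1000 nodes assumed.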
Approximate histograms
Use the coin-flipping trick for each document
o note that there can be up to 50% error
Gossip partial histograms
Concatenate histograms
Truncate low-count documents
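The gossip merge step above can be sketched as taking, per title, the maximum CT value from the two partial histograms, then dropping low-count titles. This is an assumed sketch (names and cutoff are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

public class HistogramMerge {
    // Merge two partial histograms (title -> CT) by per-title maximum.
    // Max is extremal, so repeated merges never double-count.
    static Map<String, Integer> merge(Map<String, Integer> a, Map<String, Integer> b) {
        Map<String, Integer> out = new HashMap<>(a);
        b.forEach((title, ct) -> out.merge(title, ct, Math::max));
        return out;
    }

    // Truncate the tail: only popular documents matter for the flood/DHT
    // decision, so drop titles whose CT is below a cutoff.
    static Map<String, Integer> truncate(Map<String, Integer> h, int minCT) {
        Map<String, Integer> out = new HashMap<>();
        h.forEach((title, ct) -> { if (ct >= minCT) out.put(title, ct); });
        return out;
    }
}
```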
What about the threshold?
If chosen too low, we flood too often; if chosen too high, we flood too rarely
The threshold is time dependent and load dependent
There is no easy way to choose it
Adaptive thresholding
Associate a utility with the performance of a query
The threshold should maximize utility
For some queries, use both flooding and the DHT and compare utilities
This tells us how to move the threshold in the future
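One way to realize this feedback loop, as a sketch: for sampled probe queries run both strategies, compare utilities, and nudge the threshold toward whichever did better. The step size, bounds, and initial value here are assumptions, not values from the paper:

```java
public class AdaptiveThreshold {
    double threshold = 0.05;          // assumed starting point
    static final double STEP = 0.005; // assumed step size

    // Called for probe queries that were executed via both flooding and the DHT.
    void update(double floodUtility, double dhtUtility) {
        if (floodUtility > dhtUtility) {
            // Flooding performed better: lower the threshold so more queries flood.
            threshold = Math.max(0.0, threshold - STEP);
        } else if (dhtUtility > floodUtility) {
            // The DHT performed better: raise the threshold so fewer queries flood.
            threshold = Math.min(1.0, threshold + STEP);
        }
    }
}
```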
Utility function
(figure omitted)
Adaptive thresholding
(figure omitted)
Evaluation
Built an event-driven simulator, in Java, for peer-to-peer search in generic peer-to-peer network architectures
Simulates each query, response, and document download
Uses user lifetime and bandwidth distributions observed in real systems
Generates random exact queries based on the fetch-at-most-once model (Zipfian with a flattened head)
o can also use traces of queries from real systems
Parameters
3 peers join every 4 seconds
Each enters with an average of 20 documents, randomly chosen from a dataset of 20,000 unique documents
Peers emit queries on average once every 300 seconds, requesting at most 25 results
Zipf parameter of 1.0
1.7 million queries over a 22-hour period
Simulation stability
A stable population is achieved at 20,000 seconds
Variance of all results is under 5% and is omitted for clarity
Systems compared
Pure DHT
o All queries looked up in a DHT using an in-network adaptive join method
Simple Hybrid
o Models PIER: queries are first flooded to depth 2, then looked up in a DHT if fewer than 25 results are received from the flood after 2 s
Central Server
o An ideal central server with zero request service time
Metrics
Recall: percentage of queries that found a matching document, given that at least one available document at some peer matches the query
FRT: mean first response time for successful queries
LRT: mean last response time, i.e. the response time for the Rmax-th result, for successful queries
BWC: bandwidth cost in kilobytes per query; includes the cost of publishing and gossiping
Performance (normalized)

System          Recall   FRT    LRT    BWC
Pure DHT        99.9%    1.18   0.62   1.63
Simple Hybrid   99.9%    1.00   1.00   1.00
GAB             99.9%    0.70   0.57   0.51
Central Server  100.0%   0.41   0.21   0.22
Adaptive thresholding
(figure omitted)
Scaling (normalized)

Active Population   FRT    LRT    BWC
7,000               1.00   1.00   1.00
17,500              1.00   0.85   1.06
23,300              1.03   0.80   1.07
Trace-based simulation

System          Recall   FRT    LRT    BWC
Simple Hybrid   87.0%    1.00   1.00   1.00
GAB             87.3%    0.59   0.45   0.67
• Trace of 50 ultrapeers for 3 hours on Sunday October 12, 2003
• ~230,000 distinct queries
• ~200,000 distinct keywords
• ~672,000 distinct documents
Conclusions
Gossip is an effective way to compute global state
Utility functions provide simple 'knobs' to control performance and balance competing objectives
Adaptive algorithms (threshold selection and flooding) reduce the need for external management and "magic constants"
Giving hybrid ultrapeers access to global state reduces overhead by a factor of about two
Questions?
Simulator Speedup
Fast I/O routines
o Java creates temporary objects during string concatenation; a custom, large StringBuffer for string concatenation greatly improves performance
Batch database uploads
o prepared statements turn out to be much less efficient than importing a table from a tab-separated text file
Avoid keyword search for exact queries
Can simulate 20 hours with a population of 7,000 users (~2,300,000 queries) in about 20 minutes
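The StringBuffer trick above can be sketched as reusing one large, preallocated buffer instead of concatenating with '+', which allocates a temporary object per concatenation. The class name, capacity, and record format are illustrative assumptions:

```java
public class FastConcat {
    // One shared buffer, sized generously up front (1 MiB is an assumption).
    private static final StringBuffer BUF = new StringBuffer(1 << 20);

    static String formatEvent(long time, String type, String detail) {
        BUF.setLength(0); // reset length but keep the backing array
        BUF.append(time).append('\t').append(type).append('\t').append(detail);
        return BUF.toString();
    }
}
```

(In modern single-threaded code, an unsynchronized StringBuilder would be the idiomatic choice; the slides predate that being common practice.)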