![Page 1: Gossip-based Search Selection in Hybrid Peer-to-Peer Networks M. Zaharia and S. Keshav D.R.Cheriton School of Computer Science University of Waterloo,](https://reader036.vdocuments.site/reader036/viewer/2022062516/56649d485503460f94a24449/html5/thumbnails/1.jpg)
Gossip-based Search Selection in Hybrid Peer-to-Peer Networks
M. Zaharia and S. Keshav
D. R. Cheriton School of Computer Science
University of Waterloo, Waterloo, ON, Canada
[email protected], [email protected]
IPTPS 2006, Feb 28th 2006
The Search Problem
• Decentralized system of nodes, each of which stores copies of documents
• Keyword-based search
  o Each document is identified by a set of keywords (e.g. song title)
  o Queries return lists of documents whose keyword sets are supersets of the query keywords (“AND queries”)
• Example
  o Song: “Here Comes the Sun”; keywords: “Here”, “Comes”, “The”, “Sun”
  o Query: “Here” AND “Sun”
  o Responses: “Here Comes the Sun”, “The Sun is Here”
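The AND-query semantics above can be sketched in a few lines of Python. This is only an illustration: the helper names `keywords` and `and_query` are hypothetical, and splitting titles on whitespace with case-insensitive matching is an assumption, not something the slides specify.

```python
def keywords(title):
    """Keyword set of a document title (assumed: whitespace split, case-insensitive)."""
    return {word.lower() for word in title.split()}


def and_query(query_terms, titles):
    """Return the titles whose keyword set is a superset of the query terms."""
    wanted = {term.lower() for term in query_terms}
    return [title for title in titles if wanted <= keywords(title)]


titles = ["Here Comes the Sun", "The Sun is Here", "You are my Sun"]
matches = and_query(["Here", "Sun"], titles)
print(matches)
```

The third title contains "Sun" but not "Here", so it is not returned.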
Metrics
• Success rate
  o fraction of queries that return a result, conditional on a result being available
• Number of results found
  o no more than a desired maximum Rmax
• Response time
  o for the first result, and for the Rmax-th result
• Bandwidth cost
  o includes costs of index creation, query propagation, and fetching the result(s)
Key Workload Characteristics
• Document popularities follow a Zipfian distribution
  o Some documents are more widely copied than others
  o These documents are also requested more often
• Some nodes have much faster connections and much longer connection durations than others
So…
• Retrieve popular documents with least work
• Offload work to better-connected and longer-lived peers
• How can we do that?
Hybrid P2P network [Loo, IPTPS 2004]
[Figure: network diagram with labels DHT, Ultrapeers, Peers, Bootstrap Nodes]
• Flood queries for popular documents
• Use DHT for rare documents
• Only publish rare documents to DHT index
How to know document popularity?
• PIERSearch uses
  o Observations of result size history, keyword frequency, and keyword pair frequency
  o Sampling of neighboring nodes
• These are all local
• Global knowledge is better
More on global knowledge
• Want histogram of document popularity
  o i.e. number of ultrapeers that index a document
  o we only care about popular documents, so can truncate the tail
• On getting a query, sum histogram values for all matching document titles and divide by number of ultrapeers
• If this exceeds threshold, then flood, else use DHT*
* modulo rare documents with common keywords, see paper
Example
• Assume 100 ultrapeers and only two documents
• Suppose title ‘Here comes the Sun’ has count 15 (15 ultrapeers index it) and ‘You are my Sun’ has count 2
• Query ‘Sun’ has sum (15 + 2)/100 = 0.17
• Query ‘Are My’ has sum 2/100 = 0.02
• If threshold is 0.05, then the first query is flooded and for the second, we use a DHT
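The arithmetic of this example can be reproduced with a short Python sketch. The function names (`query_popularity`, `choose_method`) are hypothetical, keyword matching is assumed to be a case-insensitive whitespace split, and the special handling of rare documents with common keywords mentioned in the footnote is not modelled.

```python
def keywords(title):
    """Keyword set of a title (assumed: whitespace split, case-insensitive)."""
    return {word.lower() for word in title.split()}


def query_popularity(query_terms, histogram, num_ultrapeers):
    """Sum the counts of all matching titles and divide by the number of ultrapeers."""
    wanted = {term.lower() for term in query_terms}
    total = sum(count for title, count in histogram.items()
                if wanted <= keywords(title))
    return total / num_ultrapeers


def choose_method(popularity, threshold):
    """Flood when the popularity estimate exceeds the threshold, else use the DHT."""
    return "flood" if popularity > threshold else "dht"


# Values taken from the example: 100 ultrapeers, two documents.
histogram = {"Here comes the Sun": 15, "You are my Sun": 2}
p_sun = query_popularity(["Sun"], histogram, 100)
p_are_my = query_popularity(["Are", "My"], histogram, 100)

print(p_sun, choose_method(p_sun, 0.05))
print(p_are_my, choose_method(p_are_my, 0.05))
```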
How to compute the histogram?
• Central server
  o Centralizes load and introduces single point of failure
• Compute on induced tree
  o brittle to failures
• Gossip
  o pick random node and exchange partial histograms
  o can result in double counting
Double counting problem
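[Figure: gossip example with three nodes. A holds documents a, b; B holds a, c; C holds a, d. Histograms shown after successive exchanges: a:2, b:1, c:1 (at both A and B); then a:3, b:1, c:1, d:1; then a:5, b:1, c:1, d:1, where a is overcounted because only three nodes hold it.]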
Avoiding double counting
• When an ultrapeer indexes a document title it hasn’t indexed already, it tosses a coin up to k times and counts the number of heads it sees before the first tail = CT
• Gossip CT values for all titles with other ultrapeers to compute maxCT
  o because max is an extremal value, no double counting
• (Flajolet-Martin) Count of the number of ultrapeers with the document is roughly 2^maxCT
• Example
  o 1000 nodes
  o Chances are good that one will see 10 consecutive heads
  o It gossips ‘10’
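A minimal Python sketch of the coin-tossing estimate follows. It is not the authors' implementation: the names `coin_count` and `merge_max`, the cap k = 32, and the fixed random seed are all assumptions made for illustration. The point it shows is that merging by max is idempotent, so hearing the same value twice cannot inflate the estimate.

```python
import random


def coin_count(rng, k=32):
    """Number of heads seen before the first tail, with at most k tosses."""
    heads = 0
    while heads < k and rng.random() < 0.5:
        heads += 1
    return heads


def merge_max(mine, theirs):
    """Merge two {title: CT} maps by taking the per-title maximum."""
    merged = dict(mine)
    for title, ct in theirs.items():
        merged[title] = max(merged.get(title, 0), ct)
    return merged


rng = random.Random(2006)  # fixed seed, only to make the sketch repeatable

# 1000 hypothetical ultrapeers each index the same title "a".
local = [{"a": coin_count(rng)} for _ in range(1000)]

state = {}
for histogram in local:
    state = merge_max(state, histogram)

# Receiving an already-merged value again changes nothing.
again = merge_max(state, local[0])

estimate = 2 ** state["a"]  # rough count of ultrapeers holding "a"

# Probability that at least one of 1000 nodes sees 10 or more heads in a row.
p_ten_heads = 1 - (1 - 2 ** -10) ** 1000

print(state["a"], estimate, round(p_ten_heads, 2))
```

The estimate is always a power of two, so a single run can be well off the true count of 1000.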
Approximate histograms
• Use coin-flipping trick for each document
  o Note that there can be up to 50% error
• Gossip partial histograms
• Concatenate histograms
• Truncate low-count documents
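One way to picture the concatenate-and-truncate step is the Python sketch below. It assumes that histograms are maps from title to the largest coin count seen so far, that overlapping titles are combined by max, and that truncation keeps a fixed number of top entries; the function name `merge_and_truncate` and the `keep` parameter are hypothetical.

```python
import heapq


def merge_and_truncate(mine, theirs, keep):
    """Combine two partial histograms, then keep only the `keep` largest entries."""
    merged = dict(mine)
    for title, ct in theirs.items():
        merged[title] = max(merged.get(title, 0), ct)
    top = heapq.nlargest(keep, merged.items(), key=lambda item: item[1])
    return dict(top)


mine = {"a": 5, "b": 1, "c": 0}
theirs = {"a": 4, "d": 3, "e": 1}
merged = merge_and_truncate(mine, theirs, keep=2)
print(merged)
```

Truncation bounds the size of each gossip message, at the cost of losing information about the tail, which the earlier slide argues is acceptable because only popular documents matter.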
What about the threshold?
• If chosen too low, flood too often!
• If chosen too high, flood too rarely!
• Threshold is time dependent and load dependent
• No easy way to choose it
Adaptive thresholding
• Associate utility with the performance of a query
• Threshold should maximize utility
• For some queries, use both flooding and DHT and compare utilities
• This will tell us how to move the threshold in the future
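The slides do not give the update rule, so the following Python sketch is purely hypothetical: it assumes a multiplicative step and assumes that the threshold only moves when the current threshold would have picked the method with the lower utility on a probed query. The name `adjust_threshold` and the `step` parameter are inventions for illustration.

```python
def adjust_threshold(threshold, popularity, flood_utility, dht_utility, step=0.1):
    """Nudge the threshold after a query that was sent both ways (hypothetical rule)."""
    flood_was_better = flood_utility > dht_utility
    would_flood = popularity > threshold
    if flood_was_better and not would_flood:
        return threshold * (1 - step)  # lower the threshold: flood more often
    if not flood_was_better and would_flood:
        return threshold * (1 + step)  # raise the threshold: flood less often
    return threshold                   # the current choice was already the better one


# A query with popularity 0.02 was sent both ways and flooding scored higher.
print(adjust_threshold(0.05, 0.02, flood_utility=1.0, dht_utility=0.5))
```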
Utility function
[Figure: utility function; image not recoverable from the transcript]
Adaptive thresholding
[Figure: adaptive thresholding; image not recoverable from the transcript]
Evaluation
• Built an event-driven simulator, in Java, for peer-to-peer search in generic peer-to-peer network architectures.
• Simulates each query, response and document download.
• Uses user lifetime and bandwidth distributions observed in real systems.
• Generates random exact queries based on fetch-at-most-once model (Zipfian with flattened head)
  o can also use traces of queries from real systems.
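For readers unfamiliar with the fetch-at-most-once model, the Python sketch below shows one simple way to generate such requests: draw from a Zipf distribution and reject documents the user has already fetched. This is an illustration rather than the simulator's code (which is in Java); the function names, the 50-document catalogue and the seed are assumptions.

```python
import random
from itertools import accumulate


def zipf_weights(num_docs, s=1.0):
    """Unnormalized Zipf weights: the document of rank r has weight 1 / r**s."""
    return [1.0 / (rank ** s) for rank in range(1, num_docs + 1)]


def next_request(rng, cum_weights, fetched):
    """Draw a document index by Zipf popularity, never repeating one already fetched."""
    num_docs = len(cum_weights)
    if len(fetched) >= num_docs:
        return None  # this user has fetched everything
    while True:
        doc = rng.choices(range(num_docs), cum_weights=cum_weights, k=1)[0]
        if doc not in fetched:
            fetched.add(doc)
            return doc


rng = random.Random(1)
cum = list(accumulate(zipf_weights(50)))
fetched = set()
requests = [next_request(rng, cum, fetched) for _ in range(50)]
print(requests[:10])
```

Because a user never requests the same document twice, the most popular documents are requested less often in aggregate than a pure Zipf law predicts, which is the flattened head mentioned above.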
Parameters
• 3 peers join every 4 seconds
• Each enters with an average of 20 documents, randomly chosen from a dataset of 20,000 unique documents
• Peers emit queries on average once every 300 seconds, requesting at most 25 results
• Zipf parameter of 1.0
• 1.7 million queries over a 22 hour period
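As a rough sanity check on these parameters, the average number of active peers implied by the query volume can be worked out in Python. The calculation assumes every active peer issues queries at the stated mean rate for the whole run, warm-up included, so it is an order-of-magnitude figure only; the variable names are hypothetical.

```python
queries = 1_700_000          # total queries in the run
duration_s = 22 * 3600       # 22 hours in seconds
query_interval_s = 300       # mean time between queries per peer

queries_per_second = queries / duration_s
implied_active_peers = queries_per_second * query_interval_s

print(round(queries_per_second, 1), round(implied_active_peers))
```

Under these assumptions the run averages a little over 21 queries per second, which corresponds to several thousand peers being active at once.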
Simulation stability
• Stable population achieved at 20,000 seconds
• Variance of all results under 5% and removed for clarity
Systems compared
• Pure DHT: All queries looked up in a DHT using an in-network adaptive join method
• Simple Hybrid: Models PIER: queries are first flooded to depth 2, then looked up in a DHT if fewer than 25 results are received from the flood after 2s.
• Central Server: An ideal central server with zero request service time.
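The Simple Hybrid policy described above can be sketched as follows in Python. The callables `flood` and `dht_lookup` stand in for network operations and are hypothetical, as are the stub implementations; appending DHT results to the flood results without duplicates is an assumption, since the slide only says when the DHT is consulted.

```python
def simple_hybrid_search(query, flood, dht_lookup, r_max=25, depth=2, wait_s=2.0):
    """Flood to a fixed depth first; fall back to the DHT if too few results arrive."""
    results = list(flood(query, depth, wait_s))
    if len(results) < r_max:
        seen = set(results)
        for doc in dht_lookup(query):
            if doc not in seen:
                seen.add(doc)
                results.append(doc)
    return results[:r_max]


# Stub network operations, for illustration only.
def fake_flood(query, depth, wait_s):
    return ["doc-1", "doc-2"]


def fake_dht(query):
    return ["doc-2", "doc-3"]


found = simple_hybrid_search("sun", fake_flood, fake_dht)
print(found)
```

Unlike the gossip-based approach in this talk, this policy always pays for the flood before it learns whether the query was for a rare document.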
Metrics
• Recall: Percentage of queries that found a matching document, given that there exists at least one available document at some peer that matches the query
• FRT: Mean first response time for successful queries
• LRT: Mean last response time, i.e. response time for the Rmax-th result, for successful queries
• BWC: Bandwidth cost in kilobytes per query; the cost of publishing and gossiping is also included in this cost
Performance (normalized)
System           Recall   FRT    LRT    BWC
Pure DHT         99.9%    1.18   0.62   1.63
Simple Hybrid    99.9%    1.00   1.00   1.00
GAB              99.9%    0.70   0.57   0.51
Central Server   100.0%   0.41   0.21   0.22
Adaptive thresholding
[Figure: adaptive thresholding results; image not recoverable from the transcript]
Scaling (normalized)
Active Population   FRT    LRT    BWC
7,000               1.00   1.00   1.00
17,500              1.00   0.85   1.06
23,300              1.03   0.80   1.07
Trace-based simulation
System          Recall   FRT    LRT    BWC
Simple Hybrid   87.0%    1.00   1.00   1.00
GAB             87.3%    0.59   0.45   0.67
• Trace of 50 ultrapeers for 3 hours on Sunday October 12, 2003
• ~230,000 distinct queries
• ~200,000 distinct keywords
• ~672,000 distinct documents
Conclusions
• Gossip is an effective way to compute global state
• Utility functions provide simple ‘knobs’ to control performance and balance competing objectives
• Adaptive algorithms (threshold selection and flooding) reduce the need for external management and “magic constants”
• Giving hybrid ultrapeers access to global state reduces overhead by a factor of about two
Questions?
???
Simulator Speedup
• Fast I/O routines
  o Java creates temporary objects during string concatenation.
  o Custom, large StringBuffer for string concatenation greatly improves performance.
• Batch database uploads
  o prepared statements turn out to be much less efficient than importing a table from a tab-separated text file.
• Avoid keyword search for exact queries
• Can simulate 20 hours with a population of 7000 users (~2,300,000 queries) in about 20 minutes