P2P File Sharing Systems
Johnny Wong
Note: Materials in these slides are based on those from the textbook “Computer Networking: A Top-Down Approach Featuring the Internet” by J.F. Kurose and K.W. Ross
P2P file sharing: Example
• Alice runs P2P client application on her notebook computer
• Intermittently connects to Internet; gets new IP address for each connection
• Asks for “Hey Jude”
• Application displays other peers that have copy of Hey Jude.
• Alice chooses one of the peers, Bob.
• File is copied from Bob’s PC to Alice’s notebook: HTTP
• While Alice downloads, other users may be uploading from Alice.
• Alice’s peer is both a Web client and a transient Web server.
All peers are servers = highly scalable!
P2P: centralized directory
original “Napster” design
1) when peer connects, it informs central server:
– IP address
– content
2) Alice queries for “Hey Jude”
3) Alice requests file from Bob
[Diagram: peers connect to the centralized directory server (step 1); Alice queries the server (step 2) and requests the file from Bob (step 3)]
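The three steps above can be sketched as a small in-memory index: content title mapped to the set of peers advertising it. This is a hypothetical Python model for illustration, not Napster's actual wire protocol; the class and peer names are invented.

```python
# Sketch of a Napster-style centralized directory (illustrative names).
# Index maps a content title to the set of (ip, port) peers offering it.

class DirectoryServer:
    def __init__(self):
        self.index = {}                      # title -> {(ip, port), ...}

    def register(self, peer, titles):
        """Step 1: a connecting peer reports its address and content list."""
        for title in titles:
            self.index.setdefault(title, set()).add(peer)

    def query(self, title):
        """Step 2: return every peer currently advertising the title."""
        return sorted(self.index.get(title, set()))

    def unregister(self, peer):
        """Remove a departing peer (Alice connects only intermittently)."""
        for peers in self.index.values():
            peers.discard(peer)

server = DirectoryServer()
server.register(("bob.example", 6699), ["Hey Jude"])
server.register(("carol.example", 6699), ["Hey Jude", "Let It Be"])
print(server.query("Hey Jude"))
```

Step 3, the actual file request, then goes directly to one of the returned peers over HTTP, bypassing the server entirely.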
P2P: problems with centralized directory
• Single point of failure
• Performance bottleneck
• Copyright infringement
file transfer is decentralized, but locating content is highly centralized
P2P: Query flooding
• Gnutella
• no hierarchy
• use bootstrap node to learn about other peers
• join message
• Send query to neighbors
• Neighbors forward query
• If queried peer has object, it sends message back to querying peer
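The join-and-flood behavior above can be modeled in a few lines. This is an illustrative in-process sketch under simplifying assumptions (a single bootstrap node, one link per join, recursive forwarding with a TTL); the `Peer` and `Bootstrap` names are invented for the example.

```python
# Sketch of Gnutella-style join and query flooding (simplified model).

class Bootstrap:
    """Stand-in for the bootstrap node that tells newcomers about peers."""
    def __init__(self):
        self.peers = []
    def add(self, peer):
        self.peers.append(peer)
    def any_peer(self):
        return self.peers[0] if self.peers else None

class Peer:
    def __init__(self, name, files=()):
        self.name = name
        self.files = set(files)
        self.neighbors = []

    def join(self, bootstrap):
        """Learn about an existing peer via the bootstrap node, then link up."""
        other = bootstrap.any_peer()
        if other is not None:
            self.neighbors.append(other)
            other.neighbors.append(self)
        bootstrap.add(self)

    def query(self, title, ttl=3, seen=None):
        """Flood the query to neighbors; return names of peers holding the file."""
        seen = seen if seen is not None else set()
        if self.name in seen:
            return []                        # already visited on this flood
        seen.add(self.name)
        hits = [self.name] if title in self.files else []
        if ttl > 0:
            for neighbor in self.neighbors:
                hits += neighbor.query(title, ttl - 1, seen)
        return hits

boot = Bootstrap()
alice = Peer("alice")
alice.join(boot)
bob = Peer("bob", ["Hey Jude"])
bob.join(boot)
print(alice.query("Hey Jude"))   # → ['bob']
```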
P2P: more on query flooding
Pros:
• peers have similar responsibilities: no group leaders
• highly decentralized
• no peer maintains directory info
Cons:
• excessive query traffic
• query radius: may not find content that is present
• bootstrap node
• maintenance of overlay network
P2P: decentralized directory
• Each peer is either a group leader or assigned to a group leader.
• Group leader tracks the content in all its children.
• Peer queries group leader; group leader may query other group leaders.
[Diagram: overlay network of ordinary peers and group-leader peers; edges show neighboring relationships in the overlay network]
More about decentralized directory

Overlay network:
• peers are nodes
• edges between peers and their group leaders
• edges between some pairs of group leaders
• virtual neighbors

Bootstrap node:
• connecting peer is either assigned to a group leader or designated as a leader

Advantages of approach:
• no centralized directory server
– location service distributed over peers
– more difficult to shut down

Disadvantages of approach:
• bootstrap node needed
• group leaders can get overloaded
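A minimal sketch of the group-leader directory described above: each leader indexes only its own children's content and falls back to asking the other leaders on a miss. Class and peer names are invented for illustration; a real system would add overlay maintenance and leader election.

```python
# Sketch of a decentralized (group-leader) directory, illustrative model.

class GroupLeader:
    def __init__(self):
        self.index = {}                      # title -> set of child peer names
        self.other_leaders = []              # edges to other group leaders

    def add_child(self, peer, titles):
        """Track the content of an ordinary peer assigned to this leader."""
        for title in titles:
            self.index.setdefault(title, set()).add(peer)

    def query(self, title):
        """Answer from the local group first; otherwise ask other leaders."""
        local = self.index.get(title, set())
        if local:
            return sorted(local)
        hits = set()
        for leader in self.other_leaders:
            hits |= leader.index.get(title, set())
        return sorted(hits)

leader_a = GroupLeader()
leader_b = GroupLeader()
leader_a.other_leaders = [leader_b]
leader_b.add_child("carol", ["Hey Jude"])
print(leader_a.query("Hey Jude"))   # → ['carol']
```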
Unstructured P2P File Sharing
• Centralized
– Napster
• Distributed
– Gnutella
– KaZaA
Napster: how does it work?
• Application-level, client-server protocol over point-to-point TCP
• Centralized directory server
• Steps:
– connect to Napster server
– upload your list of files to server
– give server keywords to search the full list with
– select “best” of matching answers
• Ping the candidate servers and pick the one with the smallest RTT
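The final “select best” step can be sketched as picking the candidate with the lowest measured round-trip time. The `measure_rtt` callable is a stand-in for a real ping; here canned measurements are injected so the sketch stays self-contained.

```python
# Sketch of RTT-based peer selection (the "select best" Napster step).

def best_peer(candidates, measure_rtt):
    """Return the candidate with the lowest measured RTT."""
    return min(candidates, key=measure_rtt)

# Canned RTT measurements (milliseconds), standing in for real pings:
rtts = {"bob.example": 40.0, "carol.example": 12.5, "dave.example": 90.0}
print(best_peer(rtts, rtts.get))   # → carol.example
```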
Gnutella
• Open-source
• Links are TCP-connections
• A peer sends a query to each of its neighbors
• Each receiving peer forwards the query to its own neighbors
• Once the file is found, the response follows the search path back to the querier
• File transfer itself is point-to-point
Gnutella (cont’d)
– decentralized searching for files
– central directory server no longer the bottleneck
• more difficult to “pull the plug”
– each application instance serves to:
• store selected files
• route queries from and to its neighboring peers
• respond to queries if file stored locally
• serve files
Gnutella history
– 3/14/00: released by AOL, almost immediately withdrawn
– became open source
– many iterations to fix poor initial design (poor design turned many people off)
– issues:
• how much traffic does one query generate?
• how many hosts can it support at once?
• what is the latency associated with querying?
• is there a bottleneck?
Gnutella: limited scope query
• Searching by flooding:
– if you don’t have the file you want, query 7 of your neighbors
– if they don’t have it, they contact 7 of their neighbors, for a maximum hop count of 10
– reverse path forwarding for responses (not files)
• (useful for saving TCP connections)
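The hop-count limit and reverse-path response can be sketched together: the query carries its path as it floods outward, and a hit retraces that path back to the querier. This is a simplified in-process model under assumed names; real Gnutella runs each hop over a separate TCP connection.

```python
# Sketch of limited-scope flooding with reverse-path responses.

MAX_HOPS = 10   # maximum hop count from the slide

def flood(peer, title, graph, files, hops=0, path=(), seen=None):
    """Return (owner, reverse_path) for the first hit, else None.

    graph maps peer -> list of neighbors; files maps peer -> set of titles.
    """
    seen = seen if seen is not None else set()
    if peer in seen or hops > MAX_HOPS:
        return None
    seen.add(peer)
    path = path + (peer,)
    if title in files.get(peer, ()):
        # The response retraces the query's path back to the querier.
        return peer, tuple(reversed(path))
    for neighbor in graph.get(peer, ()):
        hit = flood(neighbor, title, graph, files, hops + 1, path, seen)
        if hit:
            return hit
    return None

graph = {"alice": ["bob"], "bob": ["alice", "carol"], "carol": ["bob"]}
files = {"carol": {"Hey Jude"}}
print(flood("alice", "Hey Jude", graph, files))
# → ('carol', ('carol', 'bob', 'alice'))
```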
Gnutella in practice
• Gnutella traffic << KaZaA traffic
• Anecdotal:
– Couldn’t find anything
– Downloads wouldn’t complete
• Fixes: do things KaZaA is doing: hierarchy, queue management, parallel download,…
KaZaA: Technology
Software:
• Proprietary
• control data encrypted (including queries/responses)
– KaZaA Web site gives a few hints
– Some studies described on the Web
• Everything in HTTP request and response messages

Architecture:
• hierarchical
• cross between Napster and Gnutella
KaZaA: The service (2)
• User can configure max number of simultaneous uploads and max number of simultaneous downloads
– queue management at server and client
• Frequent uploaders can get priority in server queue
• Keyword search
– User can configure “up to x” responses to keywords
– Responses to keyword queries come in waves; stops when x responses are found
• From user’s perspective, service resembles Google, but provides links to MP3s and videos rather than Web pages
KaZaA: Architecture
• Each peer is either a supernode or an ordinary node (assigned to one supernode)
– Each supernode connected to many other supernodes (supernode overlay)
• Nodes that have more connection bandwidth and are more available are designated as supernodes
KaZaA Supernode
• Each supernode acts as a mini-Napster hub, tracking the content and IP addresses of its descendants
• Guess: supernode has (on average) 200-500 descendants; roughly 10,000 supernodes
KaZaA: Overlay maintenance
• List of potential supernodes included with software download
– New peer goes through list until it finds an operational supernode
– Connects, obtains more up-to-date list
• Node then pings nodes on list and connects with the one with smallest RTT
• If supernode goes down, node obtains updated list and chooses new supernode
KaZaA Queries
– Node first sends query (keyword) to supernode
– Supernode responds with matches
• If x matches found, done.
– Otherwise, supernode forwards query to subset of supernodes
• If total of x matches found, done.
– Otherwise, query further forwarded
– Probably by original supernode rather than recursively?
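The query waves above can be sketched as follows. Since the real protocol is proprietary and encrypted, everything here is an assumption for illustration: the `Supernode` class, the `fanout` parameter, and the way matches accumulate across waves.

```python
# Sketch of KaZaA-style query expansion: answer locally, then widen to
# a subset of other supernodes until x matches are found (illustrative).

class Supernode:
    def __init__(self, index, others=()):
        self.index = index                   # title -> list of peer names
        self.other_supernodes = list(others)

def kazaa_query(supernode, title, x, fanout=2):
    """Return up to x matching peers, widening the search in waves."""
    matches = list(supernode.index.get(title, ()))
    if len(matches) >= x:
        return matches[:x]                   # enough local hits: done
    # Not enough: forward to a subset of other supernodes.
    for other in supernode.other_supernodes[:fanout]:
        matches += other.index.get(title, ())
        if len(matches) >= x:
            break
    return matches[:x]

sn2 = Supernode({"Hey Jude": ["carol"]})
sn1 = Supernode({"Hey Jude": ["bob"]}, others=[sn2])
print(kazaa_query(sn1, "Hey Jude", x=2))   # → ['bob', 'carol']
```

Capping the result at x and stopping the wave early keeps query traffic bounded, in contrast to Gnutella's blind flooding.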