Grapevine: An Exercise in Distributed Computing
Landon Cox
February 16, 2016
Naming other computers
• Low-level interface: provide the destination MAC address (e.g., 00:13:20:2E:1B:ED)
• Middle-level interface: provide the destination IP address (e.g., 152.3.140.183)
• High-level interface: provide the destination hostname (e.g., www.cs.duke.edu)
Translating hostname to IP address
• Hostname → IP address
• Performed by the Domain Name System (DNS)
• Used to be a central server: /etc/hosts at SRI
• What's wrong with this approach?
  • Doesn't scale to the global Internet
DNS
• Centralized naming doesn't scale
  • Server has to learn about all changes
  • Server has to answer all lookups
• Instead, split up the data: use a hierarchical database
  • Hierarchy allows local management of changes
  • Hierarchy spreads lookup work across many computers
Where is www.wikipedia.org?
Example: linux.cs.duke.edu
• nslookup in interactive mode
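Outside of nslookup, the same hostname-to-IP translation is available programmatically. A minimal Python sketch using the standard resolver (which walks the DNS hierarchy, or consults /etc/hosts, on our behalf):

```python
import socket

def resolve(hostname: str) -> str:
    """Return one IPv4 address for the given hostname via the OS resolver."""
    return socket.gethostbyname(hostname)

# Resolving a name from the slides requires network access; "localhost"
# is answered locally from the hosts file.
print(resolve("localhost"))
```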
Translating IP to MAC addresses
• IP address → MAC address
• Performed by the ARP protocol within a LAN
• How does a router know the MAC address of 152.3.140.183?
  • ARP (Address Resolution Protocol)
  • If it doesn't know the mapping, broadcast through the switch: "Whoever has this IP address, please tell me your MAC address"
  • Cache the mapping ("/sbin/arp" shows the cache)
• Why is broadcasting over a LAN OK?
  • The number of computers connected to a switch is relatively small
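The cache-then-broadcast logic can be sketched as a toy model. Here a dict stands in for the LAN broadcast (the IP and MAC are the examples from the slides; this is not a real ARP implementation):

```python
# Toy model of ARP: a per-host cache, with a simulated LAN broadcast on a miss.
lan_hosts = {"152.3.140.183": "00:13:20:2E:1B:ED"}  # who would answer the broadcast
arp_cache = {}                                       # what /sbin/arp would show

def arp_lookup(ip):
    if ip in arp_cache:          # fast path: cached mapping, no broadcast needed
        return arp_cache[ip]
    mac = lan_hosts.get(ip)      # stands in for "whoever has this IP, tell me your MAC"
    if mac is not None:
        arp_cache[ip] = mac      # cache the reply for next time
    return mac
```

Broadcasting on every lookup would be wasteful; caching makes the common case a local table hit.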
Broadcast on local networks
• On a wired Ethernet switch
  • ARP requests/replies are broadcast
  • For the most part, IP communication is not broadcast (with caveats)
• What about on a wireless network?
  • Everything is broadcast
  • Hosts can see all unencrypted traffic
• Why might this be dangerous?
  • Any unencrypted traffic is visible to others
  • Open WiFi access points + non-SSL web requests and pages
  • Many sites send cookie credentials in the clear …
  • Use secure APs and SSL!
High-level network overview
[Diagram: workstations and servers on multiple Ethernet segments, connected to one another by gateways]
Client-server
• Classic and convenient structure for distributed systems
• How do clients and servers differ?
  • Servers have more physical resources (disk, RAM, etc.)
  • Servers are trusted by all clients
• Why are servers more trustworthy?
  • They usually have better, more reliable hardware
  • Servers are better administered (paid staff watch over them)
• Servers are kind of like the kernel of a distributed system
  • Centralized concentration of trust
  • Support coordinated activity of mutually distrusting clients
Client-server
• Why not put everything on one server?
  • Scalability problems (server becomes overloaded)
  • Availability problems (server becomes a single point of failure)
  • Want to retain organizational control of some data (some distrust)
• How do we address these issues?
  • Replicate servers: place multiple copies of the server in the network
  • Allow clients to talk to any server with the appropriate functionality
• What are some drawbacks to replication?
  • Data consistency (need sensible answers from servers)
  • Resource discovery (which server should I talk to?)
Client-server
• Kernels are centralized too: subject to availability and scalability problems
• Does it make sense to replicate kernels?
  • Perhaps for multi-core machines
  • Assign a kernel to each core
  • Separate the address spaces of each kernel
  • Coordinate actions via message passing
  • Multi-core starts to look a lot like a distributed system
Grapevine services
• Message delivery: send data to specified users
• Access control: only allow specified users to access a name
• Resource discovery: where can I find a printer?
• Authentication: how do I know who I am talking to?
Registration servers
• What logical data structure is replicated?
  • The registry: RName → Group entry | Individual entry
• What does an RName look like?
  • Character string F.R
  • F is a name (individual or group)
  • R is a registry corresponding to a data partition
• At what grain is registration data replicated?
  • Servers contain copies of whole registries
  • An individual server is unlikely to have a copy of all registries
RNames
• RName = name.registry
• Group entry: {RName1, …, RNameN}
• Individual entry: authenticator (password), inbox sites, connect site
• What two entities are represented by an individual entry?
  • Users and servers
RNames
• How does an individual entry allow communication with a user?
  • Inbox sites for users
RNames
• How does an individual entry allow communication with a server?
  • Connect site for servers
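As a concrete sketch, the two entry types might be modeled like this. Field names and the registry contents (server and user names) are illustrative, not Grapevine's actual encoding:

```python
from dataclasses import dataclass

@dataclass
class Individual:
    authenticator: str   # password
    inbox_sites: list    # servers holding this user's inboxes (for users)
    connect_site: str = ""  # network address to connect to (for servers)

@dataclass
class Group:
    members: list        # list of RNames; may include other groups

# A tiny registry mapping RName -> Group entry | Individual entry.
# "cabernet.ms" and "landon.pa" are made-up example RNames.
registry = {
    "ms.gv": Group(members=["cabernet.ms", "zinfandel.ms"]),
    "landon.pa": Individual(authenticator="secret",
                            inbox_sites=["cabernet.ms"]),
}
```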
Namespace
• RNames provide a symbolic namespace
  • Similar to a file-system hierarchy or DNS
  • Autonomous control of names within a registry
• What is the most important part of the namespace?
  • *.gv (for Grapevine)
  • *.gv is replicated at every registration server
• Who gets to define the other registries?
  • All other registries must have a group entry under *.gv
  • Owners of *.gv have complete control over other registries
• In what way do file systems and DNS operate similarly?
  • ICANN's root DNS servers decide top-level domains
  • The root user controls the root directory "/"
Resource discovery
• How do clients locate server replicas?
  • Get the list of all registries via "gv.gv"
  • Find the registry name for a service (e.g., "ms")
  • Look up group ms.gv at a registration server
  • ms.gv returns a list of available servers (e.g., *.ms)
• At this point control is transferred to the service
  • The service has autonomous control of its namespace
  • The service can define its own namespace conventions
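The discovery walk can be sketched as a pair of table lookups over replicated registry data. The registry contents and server names here are hypothetical:

```python
# Registration data as a group-name -> member-list map.
# "gv.gv" lists the registries; "ms.gv" lists message servers.
registries = {
    "gv.gv": ["ms.gv", "pa.gv"],                  # hypothetical registry list
    "ms.gv": ["cabernet.ms", "zinfandel.ms"],     # hypothetical server names
}

def discover(service_registry):
    """Return the servers for a service; any one of them will do."""
    # The service's registry must itself be registered under gv.
    assert service_registry in registries["gv.gv"]
    return registries[service_registry]
```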
Implementing services
• Mail servers are replicated
  • Any message server accepts any delivery request
  • All message servers can forward to others
  • An individual may have inboxes on many servers
• How does a client identify a server to send a message to?
  • Find the well-known name "MailDrop.ms" in *.ms
  • MailDrop.ms maps to mail servers
  • Any mail server can accept a message
  • Mail servers forward the message to the servers hosting the user's inboxes
• Note that the mail service makes "MailDrop.ms" special
  • Grapevine only defines the semantics of *.gv
  • Grapevine delegates control of the semantics of *.ms to the mail service
  • Similar to imap.cs.duke.edu or www.google.com
Resource discovery
• Bootstrapping resource discovery
  • Rely on lower-level methods
  • Broadcast to a name lookup server on the Ethernet
  • Broadcast to a registration server on the Ethernet
• What data does the name lookup server store?
  • Simple string-to-internet-address mappings
  • Infrequently updated (minimal consistency issues)
  • The well-known name GrapevineRServer maps to addresses of registration servers
• What does this remind you of on today's networks?
  • Dynamic Host Configuration Protocol (DHCP)
  • Clients broadcast a DHCP request on the Ethernet
  • The DHCP server (usually on the gateway) responds with an IP address and DNS info
Updating replicated servers
• At some point we need to update the registration database
  • Want to add new machines
  • Want to reconfigure server locations
• Why not require updates to be atomic at all servers?
  • Requires that most servers be accessible to even start
  • All kinds of reasons why this might not be true: the trans-Atlantic phone line might be down; servers might be offline for maintenance; servers might be offline due to failure
• Instead, embrace the chaos of eventual consistency
  • Might have transient differences between server states
  • Eventually everything will look the same (probably!)
Updating the database
• Information included in timestamps: time + server address
  • Timestamps are guaranteed to be unique
  • Provides a total order on updates from a server
• Does the entry itself need a timestamp (a version)?
  • Not really; it can just be computed as the max of the item timestamps
  • An entry version is a convenient optimization
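A quick illustration of why (time, server address) pairs give unique, totally ordered timestamps, using Python's lexicographic tuple comparison. The addresses are made up:

```python
# As long as one server never issues two updates at the same local time,
# no two (time, address) pairs collide, and tuples compare by time first,
# then by address as a tie-breaker.
t1 = (100, "192.0.2.1")   # hypothetical server addresses
t2 = (100, "192.0.2.2")   # same time, different server: still ordered
t3 = (101, "192.0.2.1")
assert t1 < t2 < t3
```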
[Diagram: a registration entry holds lists; each list has active items {str1|t1, …, strn|tn} and deleted items {str1|t1, …, strm|tm}, where each item carries its update timestamp]
Updating the database
• Operations on entries
  • Can add/delete items from lists
  • Can merge lists
  • Operations update item timestamps and modify list content
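Under the scheme above, a merge of two replicas' copies of an entry can be sketched as follows. This is my own simplification, not the paper's exact algorithm: each item keeps its newest timestamp, and a deletion wins only while its timestamp is newer than any re-add:

```python
def merge_lists(a, b):
    """Merge two {item: timestamp} maps, keeping the newer timestamp per item."""
    out = dict(a)
    for item, ts in b.items():
        if item not in out or ts > out[item]:
            out[item] = ts
    return out

def merge_entry(e1, e2):
    """Merge two copies of an entry with 'active' and 'deleted' item lists."""
    active = merge_lists(e1["active"], e2["active"])
    deleted = merge_lists(e1["deleted"], e2["deleted"])
    # An item is live only if its active timestamp beats any deletion timestamp.
    live = {i: t for i, t in active.items()
            if i not in deleted or t > deleted[i]}
    return {"active": live, "deleted": deleted}
```

Because the merge takes a per-item maximum, applying it repeatedly or in any order converges to the same state, which is what makes asynchronous propagation workable.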
Updating the database
• How are updates propagated?
  • Asynchronously, via the messaging service (i.e., *.ms)
  • Does not require all servers to be online
  • Updates can be buffered and ordered
Updating the database
• How fast is convergence?
  • Registration servers check their inbox every 30 seconds
  • If all are online, state will converge in ~30 seconds
  • If a server is offline, it may take longer
Updating the database
• What happens if two admins update concurrently?
  • "It is hard to predict which one of them will prevail."
  • "Acceptable" because the admins aren't talking to each other
  • Can anyone make sense of this?
Updating the database
• Why not just use a distributed lock?
  • What if a replica is offline during acquire, but reappears?
  • What if the lock owner crashes?
  • What if the lock maintainer crashes?
Updating the database
• What if clients get different answers from different servers?
  • Clients just have to deal with it (•_•) ( •_•)>⌐■-■ (⌐■_■)
  • Inconsistencies are guaranteed to be transient
  • May not be good enough for some applications
Updating the database
• What happens if a change message is lost during propagation?
  • Could lead to permanent inconsistency
  • Periodic replica comparisons, with merges if needed
  • Not perfect, since partitions can prevent propagation
Updating the database
• What happens if the namespace is modified concurrently?
  • Use timestamps to pick a winner (last writer wins)
• Why is this potentially dangerous?
  • The later update could be trapped on an offline machine
  • Updates to the first namespace accumulate
  • When the offline machine comes online, all work on the first is thrown out
Updating the database
• What was the solution?
  • "Shouldn't happen in practice."
  • Humans should coordinate out-of-band
  • Probably true, but a little unsatisfying
Why read Grapevine?
• Describes many fundamental problems
  • Performance and availability
  • Caching and replication
  • Consistency problems
• We still deal with many of these issues
Keeping replicas consistent
• Requirement: members of the write set agree
  • A write request only returns if the write-set members agree
• Problem: things fall apart
  • What do we do if something fails in the middle?
  • This is why we had multiple replicas in the first place
• Need agreement protocols that are robust to failures
Two-phase commit
• Two phases: a voting phase and a completion phase
• During the voting phase
  • The coordinator proposes a value to the rest of the group
  • Other replicas tentatively apply the update and reply "yes" to the coordinator
• During the completion phase
  • The coordinator tallies the votes
  • Success (entire group votes "yes"): coordinator sends a "commit" message
  • Failure (some "no" votes or no reply): coordinator sends an "abort" message
  • On success, each group member commits the update and sends an "ack" to the coordinator
  • On failure, each group member aborts the update and sends an "ack" to the coordinator
  • The coordinator considers the update aborted/applied once all "acks" have been received
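The voting and completion phases can be sketched as a single-process simulation. This is illustrative only: a real 2PC implementation also needs durable logging, timeouts, and real messaging:

```python
class Replica:
    def __init__(self):
        self.value = None       # committed state
        self.tentative = None   # tentatively applied update

    def propose(self, value):   # phase 1: tentatively apply, then vote
        self.tentative = value
        return "yes"

    def decide(self, commit):   # phase 2: commit or abort, then ack
        if commit:
            self.value = self.tentative
        self.tentative = None
        return "ack"

def two_phase_commit(value, replicas):
    votes = [r.propose(value) for r in replicas]      # voting phase
    commit = all(v == "yes" for v in votes)           # tally: must be unanimous
    acks = [r.decide(commit) for r in replicas]       # completion phase
    return commit and len(acks) == len(replicas)      # done once all acks arrive
```

A single "no" vote (or missing reply) flips `commit` to False, so every replica aborts and no state changes.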
Two-phase commit: normal case
[Diagram sequence: Phase 1 — the coordinator sends a propose message for the update to all three replicas; each replica tentatively applies it and replies "Yes". Phase 2 — the coordinator tallies 3 Yes votes and sends a commit message; the replicas commit the update and reply "ACK".]
Two-phase commit: Phase 1
• What if there are fewer than 3 Yes votes?
  • The coordinator sends an abort message; replicas do not commit
  • A replica that receives no decision times out and assumes the update is aborted
[Diagram: two replicas vote Yes, one votes No; the coordinator sends "Abort" to all replicas]
Two-phase commit: Phase 1
• Why might a replica vote No?
  • It might not be able to acquire its local write lock
  • It might be committing with another coordinator
[Diagram: two replicas vote Yes, one votes No]
Two-phase commit: Phase 2
• What if the coordinator fails after the vote messages, but before the decision message?
  • Replicas will time out and assume the update is aborted
[Diagram: the coordinator crashes after tallying 3 Yes votes]
Two-phase commit: Phase 2
• What if the coordinator fails after the decision messages are sent?
  • Replicas commit the update
[Diagram: the coordinator crashes after sending "Commit" to all replicas]
Two-phase commit: Phase 2
• What if the coordinator fails while the decision messages are being sent?
  • If one replica receives a commit, all must commit
  • If a replica times out, it checks with the other members
  • If any member received a commit, all commit
[Diagram: the coordinator crashes after sending "Commit" to only one replica]
Two-phase commit: Phase 1 or 2
• What if a replica crashes during 2PC?
  • The coordinator removes it from the replica group
  • If the replica recovers, it can rejoin the group later
[Diagram: one replica crashes; the coordinator proceeds with the remaining two]
Two-phase commit
• Anyone detect circular dependencies here?
  • How do we agree on the coordinator?
  • How do we agree on the group membership?
• Need more powerful consensus protocols
  • Can become very complex
  • Protocols vary depending on what a "failure" is
  • Will cover in depth very soon
• Two classes of failures
  • Fail-stop: failed nodes do not respond
  • Byzantine: failed nodes generate arbitrary outputs
Two-phase commit
• What's another problem with this protocol?
  • It's really slow
  • And it's slow even when there are no failures (the common case)
• Consistency often requires taking a performance hit
  • As we saw, it can also undermine availability
  • Can think of an unavailable service as a really slow service
Course administration
• Project 2 questions?
  • Animesh is working on a test suite
• Mid-term exam
  • Friday, March 11
  • Responsible for everything up to that point