Grapevine: An Exercise in Distributed Computing
Landon Cox
February 16, 2016
Naming other computers
• Low-level interface: provide the destination MAC address (e.g., 00:13:20:2E:1B:ED)
• Middle-level interface: provide the destination IP address (e.g., 152.3.140.183)
• High-level interface: provide the destination hostname (e.g., www.cs.duke.edu)
Translating hostname to IP address
• Hostname → IP address
• Performed by the Domain Name System (DNS)
• Used to be a central server: /etc/hosts at SRI
• What's wrong with this approach?
  • Doesn't scale to the global Internet
DNS
• Centralized naming doesn't scale
  • Server has to learn about all changes
  • Server has to answer all lookups
• Instead, split up the data: use a hierarchical database
  • Hierarchy allows local management of changes
  • Hierarchy spreads lookup work across many computers
Where is www.wikipedia.org?
Example: linux.cs.duke.edu
• nslookup in interactive mode
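Outside of nslookup, the same hostname-to-IP translation is available programmatically. A minimal Python sketch using the standard resolver (which walks the DNS hierarchy, or consults /etc/hosts, on our behalf):

```python
import socket

def resolve(hostname: str) -> str:
    """Return one IPv4 address for the given hostname via the OS resolver."""
    return socket.gethostbyname(hostname)

# Resolving a name from the slides requires network access; "localhost"
# is answered locally from the hosts file.
print(resolve("localhost"))
```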
Translating IP to MAC addresses
• IP address → MAC address
• Performed by the ARP protocol within a LAN
• How does a router know the MAC address of 152.3.140.183?
  • ARP (Address Resolution Protocol)
  • If it doesn't know the mapping, broadcast through the switch: "Whoever has this IP address, please tell me your MAC address"
  • Cache the mapping ("/sbin/arp" shows the cache)
• Why is broadcasting over a LAN OK?
  • The number of computers connected to a switch is relatively small
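The cache-then-broadcast logic can be sketched as a toy model. Here a dict stands in for the LAN broadcast (the IP and MAC are the examples from the slides; this is not a real ARP implementation):

```python
# Toy model of ARP: a per-host cache, with a simulated LAN broadcast on a miss.
lan_hosts = {"152.3.140.183": "00:13:20:2E:1B:ED"}  # who would answer the broadcast
arp_cache = {}                                       # what /sbin/arp would show

def arp_lookup(ip):
    if ip in arp_cache:          # fast path: cached mapping, no broadcast needed
        return arp_cache[ip]
    mac = lan_hosts.get(ip)      # stands in for "whoever has this IP, tell me your MAC"
    if mac is not None:
        arp_cache[ip] = mac      # cache the reply for next time
    return mac
```

Broadcasting on every lookup would be wasteful; caching makes the common case a local table hit.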
Broadcast on local networks
• On a wired Ethernet switch
  • ARP requests/replies are broadcast
  • For the most part, IP communication is not broadcast (with caveats)
• What about on a wireless network?
  • Everything is broadcast
  • Hosts can see all unencrypted traffic
• Why might this be dangerous?
  • Any unencrypted traffic is visible to others
  • Open WiFi access points + non-SSL web requests and pages
  • Many sites send cookie credentials in the clear …
  • Use secure APs and SSL!
High-level network overview
[Diagram: workstations and servers on multiple Ethernet segments, connected to one another by gateways]
Client-server
• Classic and convenient structure for distributed systems
• How do clients and servers differ?
  • Servers have more physical resources (disk, RAM, etc.)
  • Servers are trusted by all clients
• Why are servers more trustworthy?
  • They usually have better, more reliable hardware
  • Servers are better administered (paid staff watch over them)
• Servers are kind of like the kernel of a distributed system
  • Centralized concentration of trust
  • Support coordinated activity of mutually distrusting clients
Client-server
• Why not put everything on one server?
  • Scalability problems (server becomes overloaded)
  • Availability problems (server becomes a single point of failure)
  • Want to retain organizational control of some data (some distrust)
• How do we address these issues?
  • Replicate servers: place multiple copies of the server in the network
  • Allow clients to talk to any server with the appropriate functionality
• What are some drawbacks to replication?
  • Data consistency (need sensible answers from servers)
  • Resource discovery (which server should I talk to?)
Client-server
• Kernels are centralized too: subject to availability and scalability problems
• Does it make sense to replicate kernels?
  • Perhaps for multi-core machines
  • Assign a kernel to each core
  • Separate the address spaces of each kernel
  • Coordinate actions via message passing
  • Multi-core starts to look a lot like a distributed system
Grapevine services
• Message delivery: send data to specified users
• Access control: only allow specified users to access a name
• Resource discovery: where can I find a printer?
• Authentication: how do I know who I am talking to?
Registration servers
• What logical data structure is replicated?
  • The registry: RName → Group entry | Individual entry
• What does an RName look like?
  • Character string F.R
  • F is a name (individual or group)
  • R is a registry corresponding to a data partition
• At what grain is registration data replicated?
  • Servers contain copies of whole registries
  • An individual server is unlikely to have a copy of all registries
RNames
• RName = name.registry
• Group entry: {RName1, …, RNameN}
• Individual entry: authenticator (password), inbox sites, connect site
• What two entities are represented by an individual entry?
  • Users and servers
RNames
• How does an individual entry allow communication with a user?
  • Inbox sites for users
RNames
• How does an individual entry allow communication with a server?
  • Connect site for servers
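As a concrete sketch, the two entry types might be modeled like this. Field names and the registry contents (server and user names) are illustrative, not Grapevine's actual encoding:

```python
from dataclasses import dataclass

@dataclass
class Individual:
    authenticator: str   # password
    inbox_sites: list    # servers holding this user's inboxes (for users)
    connect_site: str = ""  # network address to connect to (for servers)

@dataclass
class Group:
    members: list        # list of RNames; may include other groups

# A tiny registry mapping RName -> Group entry | Individual entry.
# "cabernet.ms" and "landon.pa" are made-up example RNames.
registry = {
    "ms.gv": Group(members=["cabernet.ms", "zinfandel.ms"]),
    "landon.pa": Individual(authenticator="secret",
                            inbox_sites=["cabernet.ms"]),
}
```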
Namespace
• RNames provide a symbolic namespace
  • Similar to a file-system hierarchy or DNS
  • Autonomous control of names within a registry
• What is the most important part of the namespace?
  • *.gv (for Grapevine)
  • *.gv is replicated at every registration server
• Who gets to define the other registries?
  • All other registries must have a group entry under *.gv
  • Owners of *.gv have complete control over other registries
• In what way do file systems and DNS operate similarly?
  • ICANN's root DNS servers decide top-level domains
  • The root user controls the root directory "/"
Resource discovery
• How do clients locate server replicas?
  • Get the list of all registries via "gv.gv"
  • Find the registry name for a service (e.g., "ms")
  • Look up group ms.gv at a registration server
  • ms.gv returns a list of available servers (e.g., *.ms)
• At this point control is transferred to the service
  • The service has autonomous control of its namespace
  • The service can define its own namespace conventions
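The discovery walk can be sketched as a pair of table lookups over replicated registry data. The registry contents and server names here are hypothetical:

```python
# Registration data as a group-name -> member-list map.
# "gv.gv" lists the registries; "ms.gv" lists message servers.
registries = {
    "gv.gv": ["ms.gv", "pa.gv"],                  # hypothetical registry list
    "ms.gv": ["cabernet.ms", "zinfandel.ms"],     # hypothetical server names
}

def discover(service_registry):
    """Return the servers for a service; any one of them will do."""
    # The service's registry must itself be registered under gv.
    assert service_registry in registries["gv.gv"]
    return registries[service_registry]
```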
Implementing services
• Mail servers are replicated
  • Any message server accepts any delivery request
  • All message servers can forward to others
  • An individual may have inboxes on many servers
• How does a client identify a server to send a message to?
  • Find the well-known name "MailDrop.ms" in *.ms
  • MailDrop.ms maps to mail servers
  • Any mail server can accept a message
  • Mail servers forward the message to the servers hosting the user's inboxes
• Note that the mail service makes "MailDrop.ms" special
  • Grapevine only defines the semantics of *.gv
  • Grapevine delegates control of the semantics of *.ms to the mail service
  • Similar to imap.cs.duke.edu or www.google.com
Resource discovery
• Bootstrapping resource discovery
  • Rely on lower-level methods
  • Broadcast to a name lookup server on the Ethernet
  • Broadcast to a registration server on the Ethernet
• What data does the name lookup server store?
  • Simple string-to-internet-address mappings
  • Infrequently updated (minimal consistency issues)
  • The well-known name GrapevineRServer maps to addresses of registration servers
• What does this remind you of on today's networks?
  • Dynamic Host Configuration Protocol (DHCP)
  • Clients broadcast a DHCP request on the Ethernet
  • The DHCP server (usually on the gateway) responds with an IP address and DNS info
Updating replicated servers
• At some point we need to update the registration database
  • Want to add new machines
  • Want to reconfigure server locations
• Why not require updates to be atomic at all servers?
  • Requires that most servers be accessible to even start
  • All kinds of reasons why this might not be true: the trans-Atlantic phone line might be down; servers might be offline for maintenance; servers might be offline due to failure
• Instead, embrace the chaos of eventual consistency
  • Might have transient differences between server states
  • Eventually everything will look the same (probably!)
Updating the database
• Information included in timestamps: time + server address
  • Timestamps are guaranteed to be unique
  • Provides a total order on updates from a server
• Does the entry itself need a timestamp (a version)?
  • Not really; it can just be computed as the max of the item timestamps
  • An entry version is a convenient optimization
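A quick illustration of why (time, server address) pairs give unique, totally ordered timestamps, using Python's lexicographic tuple comparison. The addresses are made up:

```python
# As long as one server never issues two updates at the same local time,
# no two (time, address) pairs collide, and tuples compare by time first,
# then by address as a tie-breaker.
t1 = (100, "192.0.2.1")   # hypothetical server addresses
t2 = (100, "192.0.2.2")   # same time, different server: still ordered
t3 = (101, "192.0.2.1")
assert t1 < t2 < t3
```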
[Diagram: a registration entry holds lists; each list has active items {str1|t1, …, strn|tn} and deleted items {str1|t1, …, strm|tm}, where each item carries its update timestamp]
Updating the database
• Operations on entries
  • Can add/delete items from lists
  • Can merge lists
  • Operations update item timestamps and modify list content
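Under the scheme above, a merge of two replicas' copies of an entry can be sketched as follows. This is my own simplification, not the paper's exact algorithm: each item keeps its newest timestamp, and a deletion wins only while its timestamp is newer than any re-add:

```python
def merge_lists(a, b):
    """Merge two {item: timestamp} maps, keeping the newer timestamp per item."""
    out = dict(a)
    for item, ts in b.items():
        if item not in out or ts > out[item]:
            out[item] = ts
    return out

def merge_entry(e1, e2):
    """Merge two copies of an entry with 'active' and 'deleted' item lists."""
    active = merge_lists(e1["active"], e2["active"])
    deleted = merge_lists(e1["deleted"], e2["deleted"])
    # An item is live only if its active timestamp beats any deletion timestamp.
    live = {i: t for i, t in active.items()
            if i not in deleted or t > deleted[i]}
    return {"active": live, "deleted": deleted}
```

Because the merge takes a per-item maximum, applying it repeatedly or in any order converges to the same state, which is what makes asynchronous propagation workable.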
Updating the database
• How are updates propagated?
  • Asynchronously, via the messaging service (i.e., *.ms)
  • Does not require all servers to be online
  • Updates can be buffered and ordered
Updating the database
• How fast is convergence?
  • Registration servers check their inbox every 30 seconds
  • If all are online, state will converge in ~30 seconds
  • If a server is offline, it may take longer
Updating the database
• What happens if two admins update concurrently?
  • "It is hard to predict which one of them will prevail."
  • "Acceptable" because the admins aren't talking to each other
  • Can anyone make sense of this?
Updating the database
• Why not just use a distributed lock?
  • What if a replica is offline during acquire, but reappears?
  • What if the lock owner crashes?
  • What if the lock maintainer crashes?
Updating the database
• What if clients get different answers from different servers?
  • Clients just have to deal with it (•_•) ( •_•)>⌐■-■ (⌐■_■)
  • Inconsistencies are guaranteed to be transient
  • May not be good enough for some applications
Updating the database
• What happens if a change message is lost during propagation?
  • Could lead to permanent inconsistency
  • Periodic replica comparisons, with merges if needed
  • Not perfect, since partitions can prevent propagation
Updating the database
• What happens if the namespace is modified concurrently?
  • Use timestamps to pick a winner (last writer wins)
• Why is this potentially dangerous?
  • The later update could be trapped on an offline machine
  • Updates to the first namespace accumulate
  • When the offline machine comes online, all work on the first is thrown out
Updating the database
• What was the solution?
  • "Shouldn't happen in practice."
  • Humans should coordinate out-of-band
  • Probably true, but a little unsatisfying
Why read Grapevine?
• Describes many fundamental problems
  • Performance and availability
  • Caching and replication
  • Consistency problems
• We still deal with many of these issues
Keeping replicas consistent
• Requirement: members of the write set agree
  • A write request only returns if the write-set members agree
• Problem: things fall apart
  • What do we do if something fails in the middle?
  • This is why we had multiple replicas in the first place
• Need agreement protocols that are robust to failures
Two-phase commit
• Two phases: a voting phase and a completion phase
• During the voting phase
  • The coordinator proposes a value to the rest of the group
  • Other replicas tentatively apply the update and reply "yes" to the coordinator
• During the completion phase
  • The coordinator tallies the votes
  • Success (entire group votes "yes"): coordinator sends a "commit" message
  • Failure (some "no" votes or no reply): coordinator sends an "abort" message
  • On success, each group member commits the update and sends an "ack" to the coordinator
  • On failure, each group member aborts the update and sends an "ack" to the coordinator
  • The coordinator considers the update aborted/applied once all "acks" have been received
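The voting and completion phases can be sketched as a single-process simulation. This is illustrative only: a real 2PC implementation also needs durable logging, timeouts, and real messaging:

```python
class Replica:
    def __init__(self):
        self.value = None       # committed state
        self.tentative = None   # tentatively applied update

    def propose(self, value):   # phase 1: tentatively apply, then vote
        self.tentative = value
        return "yes"

    def decide(self, commit):   # phase 2: commit or abort, then ack
        if commit:
            self.value = self.tentative
        self.tentative = None
        return "ack"

def two_phase_commit(value, replicas):
    votes = [r.propose(value) for r in replicas]      # voting phase
    commit = all(v == "yes" for v in votes)           # tally: must be unanimous
    acks = [r.decide(commit) for r in replicas]       # completion phase
    return commit and len(acks) == len(replicas)      # done once all acks arrive
```

A single "no" vote (or missing reply) flips `commit` to False, so every replica aborts and no state changes.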
Two-phase commit: normal case
[Diagram sequence: Phase 1 — the coordinator sends a propose message for the update to all three replicas; each replica tentatively applies it and replies "Yes". Phase 2 — the coordinator tallies 3 Yes votes and sends a commit message; the replicas commit the update and reply "ACK".]
Two-phase commit: Phase 1
• What if there are fewer than 3 Yes votes?
  • The coordinator sends an abort message; replicas do not commit
  • A replica that receives no decision times out and assumes the update is aborted
[Diagram: two replicas vote Yes, one votes No; the coordinator sends "Abort" to all replicas]
Two-phase commit: Phase 1
• Why might a replica vote No?
  • It might not be able to acquire its local write lock
  • It might be committing with another coordinator
[Diagram: two replicas vote Yes, one votes No]
Two-phase commit: Phase 2
• What if the coordinator fails after the vote messages, but before the decision message?
  • Replicas will time out and assume the update is aborted
[Diagram: the coordinator crashes after tallying 3 Yes votes]
Two-phase commit: Phase 2
• What if the coordinator fails after the decision messages are sent?
  • Replicas commit the update
[Diagram: the coordinator crashes after sending "Commit" to all replicas]
Two-phase commit: Phase 2
• What if the coordinator fails while the decision messages are being sent?
  • If one replica receives a commit, all must commit
  • If a replica times out, it checks with the other members
  • If any member received a commit, all commit
[Diagram: the coordinator crashes after sending "Commit" to only one replica]
Two-phase commit: Phase 1 or 2
• What if a replica crashes during 2PC?
  • The coordinator removes it from the replica group
  • If the replica recovers, it can rejoin the group later
[Diagram: one replica crashes; the coordinator proceeds with the remaining two]
Two-phase commit
• Anyone detect circular dependencies here?
  • How do we agree on the coordinator?
  • How do we agree on the group membership?
• Need more powerful consensus protocols
  • Can become very complex
  • Protocols vary depending on what a "failure" is
  • Will cover in depth very soon
• Two classes of failures
  • Fail-stop: failed nodes do not respond
  • Byzantine: failed nodes generate arbitrary outputs
Two-phase commit
• What's another problem with this protocol?
  • It's really slow
  • And it's slow even when there are no failures (the common case)
• Consistency often requires taking a performance hit
  • As we saw, it can also undermine availability
  • Can think of an unavailable service as a really slow service
Course administration
• Project 2 questions?
  • Animesh is working on a test suite
• Mid-term exam
  • Friday, March 11
  • Responsible for everything up to that point