“Mega-services”: Scale and Consistency
A Lightning Tour, Now with Elastic Scaling!
Jeff Chase, Duke University
A service
[Diagram: a client sends a request to the server and receives a reply. Inside the service: Web Server → App Server → DB Server, backed by a Store.]
The Steve Yegge rant, part 1: Products vs. Platforms
Selectively quoted/clarified from http://steverant.pen.io/, emphasis added. This is an internal Google memorandum that “escaped”. Yegge had moved to Google from Amazon. His goal was to promote service-oriented software structures within Google.
So one day Jeff Bezos [CEO of Amazon] issued a mandate....[to the developers in his company]:
His Big Mandate went something along these lines:
1) All teams will henceforth expose their data and functionality through service interfaces.
2) Teams must communicate with each other through these interfaces.
3) There will be no other form of interprocess communication allowed: no direct linking, no direct reads of another team's data store, no shared-memory model, no back-doors whatsoever. The only communication allowed is via service interface calls over the network.
The Steve Yegge rant, part 2: Products vs. Platforms
4) It doesn't matter what technology they use. HTTP, Corba, PubSub, custom protocols -- doesn't matter. Bezos doesn't care.
5) All service interfaces, without exception, must be designed from the ground up to be externalizable. That is to say, the team must plan and design to be able to expose the interface to developers in the outside world. No exceptions.
6) Anyone who doesn't do this will be fired.
7) Thank you; have a nice day!
Managing overload
• What should we do when a service is in overload?
– Overload: the service is close to saturation, leading to unacceptable response time.
– Work queues grow without bound, increasing memory consumption.
[Figure: throughput X vs. offered load λ. X tracks λ up to the saturation rate λmax; beyond λ > λmax, throughput levels off or degrades.]
Options for overload
1. Thrashing – Keep trying and hope things get better. Accept each request and inject it into the system, then drop requests at random if some queue overflows its memory bound. Note: this drops requests after work has been invested, wasting work and reducing throughput (e.g., congestion collapse).
2. Admission control or load conditioning – Reject requests as needed to keep the system healthy. Reject them early, before they incur processing costs. Choose your victims carefully: e.g., prefer “gold” customers, or reject the most expensive requests.
3. Dynamic provisioning or elastic scaling – E.g., acquire new capacity on the fly from a cloud provider, and shift load over to the new capacity.
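Option 2 can be made concrete with a small sketch. A minimal admission controller, where the `max_queue` bound and the 80% “gold customer” preference policy are illustrative parameters, not from the slides:

```python
import collections

class AdmissionController:
    """Reject requests early, before they incur processing costs.

    max_queue and the 80% non-gold threshold are illustrative
    parameters, not from the slides.
    """
    def __init__(self, max_queue=100):
        self.max_queue = max_queue
        self.queue = collections.deque()

    def offer(self, request, gold=False):
        # "Gold" customers may use the full queue; others are turned
        # away once the queue passes 80% of capacity.
        limit = self.max_queue if gold else int(self.max_queue * 0.8)
        if len(self.queue) >= limit:
            return False          # rejected early: no work invested
        self.queue.append(request)
        return True
```

Rejected requests cost almost nothing, which is exactly what distinguishes load conditioning from thrashing.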
Elastic scaling: “pay as you grow”
EC2 (Elastic Compute Cloud): the canonical public cloud
[Diagram: a client talks to a service running as a guest on a cloud provider's host; the guest is instantiated from a virtual appliance image.]
IaaS: Infrastructure as a Service
[Diagram: client → service; the guest OS runs on a VMM over the physical platform.]
EC2 is a public IaaS cloud (fee-for-service). Deployment of private clouds is growing rapidly with open IaaS cloud software. Hosting performance and isolation are determined by the virtualization layer: virtual machines (VMs) from VMware, KVM, etc.
Native virtual machines (VMs)
• Slide a hypervisor underneath the kernel.
– New OS/TCB layer: the virtual machine monitor (VMM).
• The kernel and its processes run in a virtual machine (VM).
– The VM “looks the same” to the OS as a physical machine.
– The VM is a sandboxed/isolated context for an entire OS.
• Multiple VM instances can run on a shared computer.
Thank you, VMware
Also in the news, the Snowden NSA leak for Hallowe’en 2013. Scary?
Scaling a service
[Diagram: incoming work flows through a dispatcher to a server cluster/farm/cloud/grid in a data center, running on a support substrate.]
Add servers or “bricks” for scale and robustness.
Issues: state storage, server selection, request routing, etc.
Service scaling and bottlenecks
Scale up by adding capacity incrementally?
• “Just add bricks/blades/units/elements/cores”... but that presumes we can parallelize the workload.
• “Service workloads parallelize easily.”
– Many independent requests: spread requests across multiple units.
– Problem: some requests use shared data. Partition the data into chunks and spread them across the units; be sure reads and writes go to a common copy.
• Load must be evenly distributed, or else some unit saturates before the others (a bottleneck).
A bottleneck limits throughput and/or may increase response time for some class of requests.
Coordination for service clusters
• How to assign data and functions among servers?
– To spread the work across an elastic server cluster?
• How to know which server is “in charge” of a given function or data object?
– E.g., to serialize reads/writes on each object, or otherwise ensure consistent behavior: each read sees the value stored by the most recent write (or at least some reasonable value).
• Goals: safe, robust, even, balanced, dynamic, etc.
• Two key techniques:
– Leases (leased locks)
– Hash-based adaptive data distributions
What about failures?
• Systems fail. Here’s a reasonable set of assumptions about failure properties:
• Nodes/servers/replicas/bricks
– Fail-stop or fail-fast fault model.
– Nodes either function correctly or remain silent.
– A failed node may restart, or not.
– A restarted node loses its memory state, and recovers its secondary (disk) state.
– Note: nodes can also fail by behaving in unexpected ways, like sending false messages. These are called Byzantine failures.
• Network messages
– “Delivered quickly most of the time,” but may be dropped.
– Message source and content are safely known (e.g., crypto).
Challenge: data management
• Data volumes are growing enormously.
• Mega-services are “grounded” in data.
• How to scale the data tier?
– Scaling requires dynamic placement of data items across data servers, so we can grow the number of servers.
– Caching helps to reduce load on the data tier.
– Replication helps to survive failures and balance read/write load.
– E.g., alleviate hot spots by spreading read load across multiple data servers.
– Caching and replication require careful update protocols to ensure that servers see a consistent view of the data.
Service-oriented architecture of Amazon’s platform
Dynamo is a scalable, replicated key-value store.
Key-value stores
• Many mega-services are built on key-value stores.
– Store variable-length content objects: think “tiny files” (the value).
– Each object is named by a “key”, usually fixed-size.
– The key is also called a token: not to be confused with a crypto key! Although it may be a content hash (SHA-x or MD5).
– Simple put/get interface with no offsets or transactions (yet).
– Goes back to the literature on Distributed Data Structures [Gribble 1998] and Distributed Hash Tables (DHTs).
[image from Sean Rhea, opendht.org]
Key-value stores
• Data objects are named in a “flat” key space (e.g., “serial numbers”).
• K-V is a simple and clean abstraction that admits a scalable, reliable implementation: a major focus of R&D.
• Is put/get sufficient to implement non-trivial apps?
[Diagram: a distributed application calls put(key, data) and get(key) → data on a distributed hash table spanning many nodes; a lookup service maps lookup(key) → node IP address. Image from Morris, Stoica, Shenker, etc.]
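The put/get abstraction and the lookup step can be sketched with a toy in-process DHT. The `Node`/`DHT` classes and the modulo-hash placement are illustrative, not taken from any particular system:

```python
class Node:
    """One storage node: a tiny in-memory key-value store."""
    def __init__(self):
        self.store = {}
    def put(self, key, data):
        self.store[key] = data
    def get(self, key):
        return self.store.get(key)

class DHT:
    """Route put/get by hashing the flat key to a node (the lookup step)."""
    def __init__(self, nodes):
        self.nodes = nodes
    def lookup(self, key):
        # the "lookup service": key -> responsible node
        return self.nodes[hash(key) % len(self.nodes)]
    def put(self, key, data):
        self.lookup(key).put(key, data)
    def get(self, key):
        return self.lookup(key).get(key)
```

Note the interface offers no offsets or transactions: a key names a whole object, and routing is a pure function of the key.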
Scalable key-value stores
• Can we build massively scalable key/value stores?
– Balance the load: distribute the keys across the nodes.
– Find the “right” server(s) for a given key.
– Adapt to change (growth and “churn”) efficiently and reliably.
– Bound the spread of each object.
• Warning: it’s a consensus problem!
• What is the consistency model for massive stores?
– Can we relax consistency for better scaling? Do we have to?
Memcached is a scalable in-memory key-value cache.
Scaling database access
• Many services are data-driven.
– Multi-tier services: the “lowest” layer is a data tier holding the authoritative copy of service data.
• Data is stored in various stores or databases, some with an advanced query API (e.g., SQL: Structured Query Language).
• Databases are hard to scale.
– Complex data; atomic, consistent, isolated, durable (“ACID”) transactions.
[Diagram: web servers issue SQL through the query API to database servers.]
Caches can help if much of the workload is simple reads.
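The read-caching idea can be sketched as a look-aside pattern; the function names `cache_get`/`cache_update` are hypothetical, and plain dicts stand in for the cache and the database:

```python
def cache_get(key, cache, db):
    """Look-aside read: try the cache first, fall back to the
    authoritative database copy, and fill the cache on a miss."""
    val = cache.get(key)
    if val is None:
        val = db[key]          # authoritative copy in the data tier
        cache[key] = val       # later reads of this key skip the database
    return val

def cache_update(key, value, cache, db):
    """Write the database, then invalidate the cached copy so readers
    don't keep seeing the stale value."""
    db[key] = value
    cache.pop(key, None)
```

Invalidate-on-write is one simple update protocol; later slides show why such protocols need care when caches and replicas multiply.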
Memcached
• “Memory caching daemon”
• It’s just a key/value store
• Scalable cluster service
– array of server nodes
– distribute requests among nodes
– how? distribute the key space
– scalable: just add nodes
• Memory-based
• LRU object replacement
• Many technical issues: multi-core server scaling, M×N communication, replacement, consistency
[Diagram: web servers use the get/put API to memcached servers and the SQL query API to database servers.]
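One common way to “distribute the key space” across memcached-style nodes is consistent hashing. This sketch (class and parameter names are illustrative, not from the slides) shows why adding a node remaps only a fraction of the keys:

```python
import bisect
import hashlib

class HashRing:
    """Consistent hashing: nodes own arcs of a hash ring, so adding or
    removing a node remaps only a small fraction of the keys."""
    def __init__(self, nodes, vnodes=64):
        self.ring = []                        # sorted (point, node) pairs
        for node in nodes:
            for i in range(vnodes):           # virtual nodes smooth the load
                point = self._hash(f"{node}#{i}")
                bisect.insort(self.ring, (point, node))

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key):
        # walk clockwise from the key's point to the first node point
        i = bisect.bisect(self.ring, (self._hash(key), ""))
        return self.ring[i % len(self.ring)][1]
```

With plain modulo hashing, adding a node would remap almost every key; on the ring, roughly 1/N of the keys move, so "just add nodes" stays cheap.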
[From Spark Plug to Drive Train: The Life of an App Engine Request, Alon Levi, 5/27/09]
Consistency and coordination
Very loosely:
• Consistency: each read of a {key, file, object} sees the value stored by the most recent write.
– Or at least writes propagate to readers “eventually”.
– More generally: the service behaves as expected; changes to shared state are properly recorded and observed.
• Coordination: the roles and functions of each element are understood and properly adjusted to respond to changes (e.g., failures, growth, or rebalancing: “churn”).
– E.g., distribution of functions or data among the elements.
Example: mutual exclusion
• It is often necessary to grant some node/process the “right” to “own” some given data or function.
• Ownership rights often must be mutually exclusive.
– At most one owner at any given time.
• How to coordinate ownership?
• Warning: it’s a consensus problem!
One solution: lock service
[Diagram: client A sends acquire to the lock service and receives grant; A executes x=x+1 and sends release. Client B’s acquire waits until A’s release; then B is granted the lock, executes x=x+1, and releases.]
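The diagram's protocol can be mimicked in-process with a toy lock service. A real service exchanges acquire/grant/release messages over the network, which this sketch elides:

```python
import threading

class LockService:
    """Toy in-process stand-in for the lock service in the diagram:
    acquire blocks until the grant; release frees the lock."""
    def __init__(self):
        self._lock = threading.Lock()
    def acquire(self):       # client "acquire" ... service "grant"
        self._lock.acquire()
    def release(self):       # client "release"
        self._lock.release()

svc = LockService()
x = 0

def client(n):
    global x
    for _ in range(n):
        svc.acquire()
        x = x + 1            # critical section, serialized by the service
        svc.release()

a = threading.Thread(target=client, args=(1000,))
b = threading.Thread(target=client, args=(1000,))
a.start(); b.start(); a.join(); b.join()
```

Without the lock, the two clients' x=x+1 updates could interleave and lose increments; with it, at most one owner runs the critical section at a time.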
A lock service in the real world
[Diagram: A acquires and is granted the lock, begins x=x+1, then fails (X) while holding it. B’s acquire is left waiting: the service cannot safely tell whether A failed or is merely slow.]
Solution: leases (leased locks)
• A lease is a grant of ownership or control for a limited time.
• The owner/holder can renew or extend the lease.
• If the owner fails, the lease expires and is free again.
• The lease might end early.
– The lock service may recall or evict.
– The holder may release or relinquish.
A lease service in the real world
[Diagram: A acquires and is granted a lease, begins x=x+1, then fails (X). After A’s lease expires, the service grants the lease to B, which executes x=x+1 and releases.]
Leases and time
• The lease holder and lease service must agree when a lease has expired.
– I.e., that its expiration time is in the past.
– Even if they can’t communicate!
• We all have our clocks, but do they agree?
– Synchronized clocks.
• For leases, it is sufficient for the clocks to have a known bound ε on clock drift: |T(Ci) – T(Cj)| < ε.
– Build slack time > ε into the lease protocols as a safety margin.
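A minimal lease service with the ε safety margin built in might look like this. The `term`/`epsilon` values and the injectable clock are illustrative, not from the slides:

```python
import time

class LeaseService:
    """Leased locks: a grant is good for `term` seconds; the holder may
    renew; if the holder fails, the lease expires and can be regranted.
    `epsilon` models the assumed bound on clock drift: the service waits
    an extra epsilon past expiry, and the holder stops acting epsilon
    before its local expiry time."""
    def __init__(self, term=10.0, epsilon=1.0, clock=time.monotonic):
        self.term, self.epsilon, self.clock = term, epsilon, clock
        self.holder, self.expires = None, 0.0

    def acquire(self, who):
        now = self.clock()
        if self.holder is None or now > self.expires + self.epsilon:
            self.holder, self.expires = who, now + self.term
            return True
        return False        # lease may still be held; try again later

    def renew(self, who):
        if who == self.holder and self.clock() <= self.expires:
            self.expires = self.clock() + self.term
            return True
        return False

    def may_act(self, who):
        # holder-side safety check, with slack for clock drift
        return who == self.holder and self.clock() < self.expires - self.epsilon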
OK, fine, but…
• What if A does not fail, but is instead isolated by a network partition?
A network partition
[Diagram: a crashed router splits the network.]
A network partition is any event that blocks all message traffic between subsets of nodes.
Never two kings at once
[Diagram: A acquires and is granted the lease, but is then partitioned away; its delayed x=x+1 (???) arrives after the lease has expired and must be rejected. The service grants the lease to B only after A’s lease expires; B executes x=x+1 and releases.]
Lease example: network file cache consistency
[From Spark Plug to Drive Train: The Life of an App Engine Request, Alon Levi, 5/27/09]
Google File System (GFS)
Similar: Hadoop HDFS, pNFS, and many other parallel file systems. A master server stores metadata (names, file maps) and acts as a lock server. Clients call the master to open a file, acquire locks, and obtain metadata; then they read/write directly to a scalable array of data servers for the actual data. File data may be spread across many data servers: the maps say where it is.
OK, fine, but…
• What if the manager/master itself fails?
We can replace it, but the nodes must agree on who the new master is: requires consensus.
The Answer
• Replicate the functions of the manager/master.
– Or other coordination service…
• Designate one of the replicas as a primary.
– Or master.
• The other replicas are backup servers.
– Or standby or secondary.
• If the primary fails, use a high-powered consensus algorithm to designate and initialize a new primary.
Consensus: abstraction
[Figure (Coulouris and Dollimore): Step 1, Propose: processes P1, P2, P3 each propose a value vi over an unreliable multicast. Step 2, Decide: the consensus algorithm yields decisions d1, d2, d3.]
Each P proposes a value to the others. All non-faulty P agree on a value in a bounded time.
Fischer-Lynch-Paterson (1985)
• No consensus can be guaranteed in an asynchronous system in the presence of failures.
• Intuition: a “failed” process may just be slow, and can rise from the dead at exactly the wrong time.
• Consensus may occur recognizably, rarely or often.
• Network partition → “split brain”.
C-A-P: choose two
[Diagram: a triangle with vertices Consistency, Availability, and Partition-resilience: the “CAP theorem” (Dr. Eric Brewer).]
CA: available, and consistent, unless there is a partition.
AP: a reachable replica provides service even in a partition, but may be inconsistent.
CP: always consistent, even in a partition, but a reachable replica may deny service if it is unable to agree with the others (e.g., quorum).
Properties for Correct Consensus
• Termination: All correct processes eventually decide.
• Agreement: All correct processes select the same di.
– Or…(stronger) all processes that do decide select the same di, even if they later fail.
• Consensus “must be” both safe and live.
• FLP and CAP say that a consensus algorithm can be safe or live, but not both.
Now what?
• We have to build practical, scalable, efficient distributed systems that really work in the real world.
• But the theory says it is impossible to build reliable computer systems from unreliable components.
• So what are we to do?
http://research.microsoft.com/en-us/um/people/blampson/
Butler Lampson is a Technical Fellow at Microsoft Corporation and an Adjunct Professor at MIT…..He was one of the designers of the SDS 940 time-sharing system, the Alto personal distributed computing system, the Xerox 9700 laser printer, two-phase commit protocols, the Autonet LAN, the SPKI system for network security, the Microsoft Tablet PC software, the Microsoft Palladium high-assurance stack, and several programming languages. He received the ACM Software Systems Award in 1984 for his work on the Alto, the IEEE Computer Pioneer award in 1996 and von Neumann Medal in 2001, the Turing Award in 1992, and the NAE’s Draper Prize in 2004.
Butler W. Lampson
[Lampson 1995]
Summary/preview
• The master coordinates, dictates consensus.
– E.g., a lock service.
– Also called the “primary”.
• Remaining consensus problem: who is the master?
– The master itself might fail or be isolated by a network partition.
– Requires a high-powered distributed consensus algorithm (Paxos) to elect a new master.
– Paxos is safe but not live: in the worst case (multiple repeated failures) it might not terminate.
A Classic Paper
• ACM TOCS: Transactions on Computer Systems
• Submitted: 1990. Accepted: 1998
• Introduced: Paxos (Lamport, “The Part-Time Parliament”)
???
A Paxos Round
[Diagram: leader L self-appoints and runs a round with nodes N:
1a Propose: “Can I lead b?” → 1b Promise: “OK, but” (L waits for a majority; nodes log);
2a Accept: “v?” → 2b Ack: “OK” (L waits for a majority; nodes log);
3 Commit: “v!” (the value is safe).]
Nodes may compete to serve as leader, and may interrupt one another’s rounds. It can take many rounds to reach consensus.
Consensus in Practice
• Lampson: “Since general consensus is expensive, practical systems reserve it for emergencies.”
– E.g., to select a primary/master, e.g., a lock server.
• Centrifuge, GFS master, Frangipani, etc.
• Google Chubby service (“Paxos Made Live”)
• Pick a primary with Paxos. Do it rarely; do it right.
– The primary holds a “master lease” with a timeout.
• Renew by consensus with the primary as leader.
– The primary is “czar” as long as it holds the lease.
– Master lease expires? Fall back to Paxos.
– (Or BFT.)
Coordination services
• Build your cloud apps around a coordination service with consensus (Paxos) at its core.
• This service is a fundamental building block for consistent scalable services.
– Chubby (Google)
– Zookeeper (Yahoo!)
– Centrifuge (Microsoft)
Coordination and Consensus
• If the key to availability and scalability is to decentralize and replicate functions and data, how do we coordinate the nodes?
– data consistency
– update propagation
– mutual exclusion
– consistent global states
– failure notification
– group membership (views)
– group communication
– event delivery and ordering
We stopped here
• But I’ll leave these slides in.
• Hidden slides (e.g., on VMs and network file cache consistency) are not to be tested.
• Obviously we only peeked at some of these topics, but you should understand the problem of failures (lock service).
Caching in the Web
• Web “proxy” caches are servers that cache Web content.
• Reduce traffic to the origin server.
• Deployed by enterprises to reduce external network traffic to serve Web requests of their members.
• Also deployed by third-party companies that sell caching service to Web providers.
– Content Delivery/Distribution Network (CDN)
– Help Web providers serve their clients better.
– Help absorb unexpected load from “flash crowds”.
– Reduce Web server infrastructure costs.
Content Delivery Network (CDN)
DNS Caching
Cache freshness by TTL (“freshness date”), also used by early NFS and HTTP/Web.
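TTL-based freshness can be sketched in a few lines; the injectable clock is just for testing, and the class name is illustrative:

```python
import time

class TTLCache:
    """Freshness by TTL, as in DNS: serve an entry until its freshness
    date passes, then treat it as a miss so the caller refetches it
    from the origin."""
    def __init__(self, ttl=60.0, clock=time.monotonic):
        self.ttl, self.clock = ttl, clock
        self.data = {}

    def put(self, key, value):
        self.data[key] = (value, self.clock() + self.ttl)  # freshness date

    def get(self, key):
        hit = self.data.get(key)
        if hit is None:
            return None
        value, fresh_until = hit
        if self.clock() > fresh_until:
            del self.data[key]     # stale: force a refetch from the origin
            return None
        return value
```

Note the trade-off: no invalidation traffic is needed, but readers may see a value up to one TTL out of date, a deliberately weak consistency model.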