Large-Scale Distributed Systems
Andrew Whitaker
CSE451
Textbook Definition
“A distributed system is a collection of loosely coupled processors interconnected by a communication network”
Typically, the nodes run software that together creates an application or service
- e.g., 1000s of Google nodes work together to build a search engine
Why Not to Build a Distributed System (1)
Must handle partial failures
- System must stay up, even when individual components fail
[Figure: Amazon.com]
Why Not to Build a Distributed System (2)
No global state
- Machines can only communicate with messages
- This makes it difficult to agree on anything
  - “What time is it?”
  - “Which happened first, A or B?”
Theory: consensus is slow and doesn’t work in the presence of failures
- So, we try to avoid needing to agree in the first place
[Figure: A and B]
Reasons to Build a Distributed System (1)
The application or service is inherently distributed
[Figure: Andrew Whitaker and Joan Whitaker]
Reasons to Build a Distributed System (2)
Application requirements
- Must scale to millions of requests / sec
- Must be available despite component failures
This is why Amazon, Google, eBay, etc. are all large distributed systems
Internet Service Requirements
Basic goal: build a site that satisfies every user request
Detailed requirements:
- Handle billions of transactions per day
- Be available 24/7
- Handle load spikes that are 10x normal capacity
- Do it with a random selection of mismatched hardware
An Overview of HotMail (Jim Gray)
- ~7,000 servers
- 100 backend stores with 300 TB (cooked)
- Many data centers
- Links to:
  - Internet Mail gateways
  - Ad-rotator
  - Passport
- ~5 B messages per day
- 350M mailboxes, 250M active
- ~1M new per day
- New software every 3 months (small changes weekly)
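To get a feel for these numbers, the sketch below converts the daily message volume into an average per-second rate. It assumes, purely for illustration, that load is spread evenly over the day and over all servers; in reality traffic is bursty and the servers play different roles.

```python
SECONDS_PER_DAY = 24 * 60 * 60

messages_per_day = 5e9  # "~5 B messages per day" from the slide
servers = 7_000         # "~7,000 servers" from the slide


def per_second(per_day: float) -> float:
    """Convert a daily count into an average per-second rate."""
    return per_day / SECONDS_PER_DAY


avg_rate = per_second(messages_per_day)
# Assumption: every server handles an equal share (not stated in the slides).
avg_rate_per_server = avg_rate / servers

print(f"average messages/sec: {avg_rate:,.0f}")
print(f"average messages/sec per server: {avg_rate_per_server:.1f}")
```

That works out to roughly 58,000 messages per second on average, before accounting for any load spikes.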
Availability Strategy #1: Perfect Hardware
Pay extra $$$ for components that do not fail
People have tried this: “fault-tolerant computing”
This isn’t practical for Amazon / Google:
- It’s impossible to get rid of all faults
- Software and administrative errors still exist
Availability Strategy #2: Over-provision
Step 1: buy enough hardware to handle your workload
Step 2: buy more hardware
Replicate
Benefits of Replication
- Scalability
- Guards against hardware failures
- Guards against software failures (bugs)
Replication Meets Probability
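p is the probability that a single machine fails
Probability that all n replicas fail at the same time (assuming independent failures): p^n
Site availability with n replicas: 1 - p^n
[Figure: site unavailability (log scale, from 1 down to 0.000001) versus number of replicas (0 to 7)]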
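The curve in the figure can be reproduced with a few lines of code. This is a minimal sketch that assumes failures are independent and uses p = 0.1, a value chosen here only because it matches the axis range in the figure; the slides do not state the value of p.

```python
def site_unavailability(p: float, n: int) -> float:
    """Probability that all n replicas are down at once.

    Assumes each replica fails independently with probability p.
    """
    return p ** n


p = 0.1  # assumed per-machine failure probability (not given in the slides)
for n in range(8):
    print(f"{n} replicas: unavailability {site_unavailability(p, n):.7f}")
```

Under this idealized model, every extra replica cuts unavailability by a factor of 1/p.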
Availability in the Real World
Phone network: five 9s
- 99.999% available
ATMs: four 9s
- 99.99% available
What about Internet services?
- Not very good…
2006: typical 97.48% Availability
Source: Jim Gray
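To put these availability percentages in perspective, the sketch below converts each one into expected downtime per year. The function name and the 365-day year are choices made for this illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # ignores leap years


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Expected minutes of downtime per year at the given availability."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


examples = [
    ("Phone network", 99.999),
    ("ATMs", 99.99),
    ("Typical Internet site, 2006", 97.48),
]
for label, pct in examples:
    minutes = downtime_minutes_per_year(pct)
    print(f"{label}: {pct}% -> {minutes:,.1f} min/year ({minutes / 60:,.1f} hours)")
```

Five 9s allows about 5 minutes of downtime per year and four 9s about 53 minutes, while 97.48% corresponds to roughly 9 days.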
Netcraft’s Crisis-of-the-Day
What Gives?
Why isn’t simple redundancy enough to give very high availability?
Failure Modes
Fail-stop failure: a component fails by stopping
- It’s totally dead: doesn’t respond to input or produce output
- Ideally, this happens fast
- Like a light bulb
Byzantine failure: a component fails in an arbitrary way
- Produces unpredictable output
Byzantine Generals
Basic goal: reach consensus in the presence of arbitrary failures
Results:
- More than 2/3 of the nodes must be “loyal”
  - 3t + 1 nodes are needed to tolerate t traitors
- Consensus is possible, but expensive
  - Lots of messages
  - Many rounds of communication
In practice, people assume that failures are fail-stop, and hope for the best…
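The 3t + 1 bound is easy to turn into arithmetic. The helper names below are made up for this illustration; they only restate the bound from the slide and do not implement a consensus protocol.

```python
def min_nodes(traitors: int) -> int:
    """Smallest cluster size that can tolerate the given number of traitors."""
    return 3 * traitors + 1


def max_traitors(nodes: int) -> int:
    """Largest number of traitors a cluster of this size can tolerate."""
    return max(0, (nodes - 1) // 3)


for n in (3, 4, 7, 10):
    print(f"{n} nodes tolerate up to {max_traitors(n)} traitor(s)")
```

Note that three nodes cannot tolerate even a single traitor, which is part of why Byzantine fault tolerance is expensive.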
Example of a non Fail-Stop Failure
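[Figure: Internet, load balancer, and five servers (Amazon.com)]
- The load balancer uses a “Least Connections” policy
- A server fails by returning an HTTP error 400
- Because the failed server answers immediately, it always has the fewest open connections
- Net result: the “failed” server becomes a black hole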
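The black-hole effect can be reproduced with a toy simulation. This is a simplified sketch, not a model of any real load balancer: time advances in discrete ticks, healthy servers hold each connection for a fixed number of ticks, the broken server returns its error instantly, and ties go to the lowest-numbered server. All parameter values are hypothetical.

```python
from collections import Counter


def simulate(num_servers=5, broken=4, requests_per_tick=4,
             service_ticks=3, ticks=100):
    """Return how many requests each server received under least-connections."""
    # open_until[s] lists the tick at which each open connection on s closes.
    open_until = [[] for _ in range(num_servers)]
    received = Counter()
    for now in range(ticks):
        # Close connections whose work has finished.
        for s in range(num_servers):
            open_until[s] = [t for t in open_until[s] if t > now]
        for _ in range(requests_per_tick):
            # Least connections: pick the server with the fewest open connections.
            target = min(range(num_servers), key=lambda s: len(open_until[s]))
            received[target] += 1
            if target != broken:
                open_until[target].append(now + service_ticks)
            # The broken server replies with an error at once, so its
            # connection count never rises above zero.
    return received


received = simulate()
total = sum(received.values())
for s in range(5):
    print(f"server {s}: {received[s]} requests ({received[s] / total:.0%})")
```

In this setup the single broken server ends up with well over half of all requests, even though a fair share would be one fifth.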
Correlated Failures
In practice, components often fail at the same time
- Natural disasters
- Security vulnerabilities
- Correlated manufacturing defects
- Human error…
Human error
Human operator error is the leading cause of dependability problems in many domains
Source: D. Patterson et al. Recovery Oriented Computing (ROC): Motivation, Definition, Techniques, and Case Studies, UC Berkeley Technical Report UCB//CSD-02-1175, March 2002.
Sources of Failure
             Public Switched Telephone Network   Average of 3 Internet Sites
Operator     59%                                 51%
Hardware     22%                                 15%
Software     8%                                  34%
Overload     11%                                 0%
Understanding Human Error
Administrator actions tend to involve many nodes at once:
- Upgrade from Apache 1.3 to Apache 2.0
- Change the root DNS server
- Network / router misconfiguration
This can lead to (highly) correlated failures
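One way to see why correlated failures undercut simple redundancy is to add a common-cause term to the independent-failure model. The model below is an assumption made for this sketch and does not come from the slides: with probability q a single event (for example, a bad configuration push) takes down every replica, and otherwise each replica fails independently with probability p. The values of p and q are hypothetical.

```python
def unavailability(p: float, n: int, q: float = 0.0) -> float:
    """Site unavailability with n replicas.

    q: probability that a common-cause event takes down every replica.
    p: probability that a replica fails on its own, independently.
    """
    return q + (1 - q) * p ** n


p, q = 0.1, 0.001  # hypothetical values, for illustration only
for n in range(1, 7):
    independent = unavailability(p, n)
    correlated = unavailability(p, n, q)
    print(f"{n} replicas: independent {independent:.7f}, with common cause {correlated:.7f}")
```

With these numbers, unavailability stops improving at about q after the third replica: extra hardware cannot buy back failures that hit every replica at once.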
Learning to Live with Failures
If we can’t prevent failures outright, how can we make their impact less severe?
Understanding availability:
- MTTF: mean time to failure
- MTTR: mean time to repair
- Availability = MTTF / (MTTF + MTTR)
- Unavailability = MTTR / (MTTR + MTTF)
  - Approximately MTTR / MTTF when MTTF is much larger than MTTR
Note: recovery time is just as important as failure time!
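A small worked example makes the note concrete. The MTTF and MTTR values below are hypothetical; the point is only that halving repair time buys the same improvement as doubling time to failure.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up."""
    return mttf_hours / (mttf_hours + mttr_hours)


def unavailability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is down."""
    return mttr_hours / (mttf_hours + mttr_hours)


# Hypothetical baseline: fails every 1000 hours, takes 1 hour to repair.
baseline = unavailability(1000, 1)
longer_mttf = unavailability(2000, 1)      # failures half as frequent
faster_repair = unavailability(1000, 0.5)  # repairs twice as fast

print(f"baseline unavailability:   {baseline:.6f}")
print(f"doubling MTTF:             {longer_mttf:.6f}")
print(f"halving MTTR:              {faster_repair:.6f}")
print(f"approximation MTTR / MTTF: {1 / 1000:.6f}")
```

Both improvements cut unavailability by about half, and faster recovery is often the cheaper of the two to engineer.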
Summary
Large distributed systems are built from many flaky components
- Key challenge: don’t let component failures become system failures
- Basic approach: throw lots of hardware at the problem; hope everything doesn’t fail at once
  - Try to decouple failures
  - Try to avoid single points of failure
  - Try to fail fast
Availability is affected as much by recovery time as by error frequency