RUNNING OUR SOFTWARE ON A 2000-CORE CLUSTER: Lessons learnt

Uploaded by eugene-kirpichov, 29-Oct-2014
Category: Technology

DESCRIPTION

Lessons learnt when testing our "embarrassingly parallel" software on a 2000-core cluster.

TRANSCRIPT

Page 1: Lessons learnt on a 2000-core cluster

RUNNING OUR SOFTWARE ON A 2000-CORE CLUSTER

Lessons learnt

Page 2: Lessons learnt on a 2000-core cluster

STRUCTURE

• For each problem:
  • Symptoms
  • Method of investigation
  • Cause
  • Action taken
  • Moral

Page 3: Lessons learnt on a 2000-core cluster

BACKGROUND

• Pretty simple: distributing embarrassingly parallel computations on a cluster
• Distribution fabric is RabbitMQ
  • Publish tasks to a queue
  • Pull results from a queue
  • Computational listeners on cluster nodes
• Tasks are “fast” (~1s CPU time) or “slow” (~15min CPU time)
  • Tasks are split into parts (usually 160)
  • Parts share the same data chunk – it’s stored in memcached and the task input contains the “shared data id” (see the sketch below)
• Requirements: 95% utilization for slow tasks, “as much as we can” for fast ones.
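To make the setup concrete, here is a minimal sketch of the submission pattern, assuming the RabbitMQ .NET client; the host name, queue name and message format are invented for illustration and are not the actual code.

    // Sketch: the shared data chunk goes to memcached once; each task part
    // only carries the shared-data id (names below are illustrative).
    using System.Text;
    using RabbitMQ.Client;

    class TaskSubmitter
    {
        public static void Submit(string sharedDataId, int parts)
        {
            var factory = new ConnectionFactory { HostName = "cluster-head", Port = 5672 };
            using (var connection = factory.CreateConnection())
            using (var channel = connection.CreateModel())
            {
                channel.QueueDeclare("tasks", true, false, false, null);  // durable task queue
                for (int part = 0; part < parts; part++)
                {
                    // The task input references the shared chunk instead of embedding it.
                    var body = Encoding.UTF8.GetBytes(sharedDataId + ":" + part);
                    channel.BasicPublish("", "tasks", null, body);
                }
            }
        }
    }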

Page 4: Lessons learnt on a 2000-core cluster

RabbitMQ starts refusing connections to some clients when there are too many of them.

Page 5: Lessons learnt on a 2000-core cluster

INVESTIGATION

Eventually it turned out that RabbitMQ supports a max of ~400 connections per process on Windows.

Page 6: Lessons learnt on a 2000-core cluster

SOLUTION

In RabbitMQ:
• Establish a cluster of RabbitMQ instances
• 2 “eternal” connections per client, 512 connections per instance, 1600 clients → ~16 instances suffice
• Instances start on the same IP, on consecutive ports (5672, 5673, …)

In code:
• Make both submitter and consumer scan ports until a connection succeeds (see the sketch below)
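A hedged sketch of the port-scanning connect, assuming the RabbitMQ .NET client; the port range and host name are illustrative.

    // Sketch: try consecutive ports until some RabbitMQ instance accepts the connection.
    using System;
    using RabbitMQ.Client;

    static class RabbitConnector
    {
        public static IConnection ConnectScanningPorts(string host, int firstPort, int instanceCount)
        {
            for (int port = firstPort; port < firstPort + instanceCount; port++)
            {
                try
                {
                    var factory = new ConnectionFactory { HostName = host, Port = port };
                    return factory.CreateConnection();      // success: use this connection
                }
                catch (Exception)
                {
                    // This instance is full or down – try the next port.
                }
            }
            throw new Exception("No RabbitMQ instance accepted the connection");
        }
    }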

Page 7: Lessons learnt on a 2000-core cluster

MORAL

• Capacity planning!
• If there’s a resource, plan how much of it you’ll need and with what pattern of usage. Otherwise you’ll exhaust it sooner or later:

• Network bandwidth

• Network latency

• Connections

• Threads

• Memory

• Whatever

Page 8: Lessons learnt on a 2000-core cluster

RabbitMQConsumer uses a legacy component which can’t run concurrent instances in the same directory.

Page 9: Lessons learnt on a 2000-core cluster

SOLUTION

• Create a temporary directory.
• Directory.SetCurrentDirectory() into it at startup.
• Problem: the temp directories pile up.

Page 10: Lessons learnt on a 2000-core cluster

SOLUTION

• At startup, clean up unused temp directories.
• How to know if a directory is unused?
  • Create a lock file in the directory
  • At startup, try removing lock files and dirs
• Problem
  • Races: several instances want to delete the same file
  • All but one crash!

Several solutions with various kinds of races, “fixed” by a try/ignore band-aid…

Just wrap the whole “clean-up” block in a try/ignore! That’s it. (See the sketch below.)
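A minimal sketch of the cleanup-on-startup idea; the directory layout and lock-file name are invented for illustration, not the real code.

    // The whole cleanup is non-critical, so it is wrapped in a single try/ignore.
    using System.IO;

    static class TempDirCleanup
    {
        public static void CleanUpStaleDirs(string root)
        {
            try
            {
                foreach (var dir in Directory.GetDirectories(root, "worker-*"))
                {
                    try
                    {
                        File.Delete(Path.Combine(dir, ".lock"));  // fails if another instance holds it
                        Directory.Delete(dir, true);              // unused – remove the whole directory
                    }
                    catch { /* in use, or another instance beat us to it – ignore */ }
                }
            }
            catch { /* cleanup is best-effort; never let it crash startup */ }
        }
    }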

Page 11: Lessons learnt on a 2000-core cluster

MORAL

If it’s non-critical, wrap the whole thing in a try/ignore.

Even if you think it will never fail:
• It will
• (maybe in the future, after someone changes the code…)
• Thinking “it won’t” is unneeded complexity

Low-probability errors will happen:
• The chance is small, but the occasions are frequent
• With a 0.001 probability of error and 2000 occasions, the chance of at least one failure is 1 − 0.999^2000 ≈ 87%

Page 12: Lessons learnt on a 2000-core cluster

Then the thing started working. Kind of.

We asked for 1000 tasks “in flight”, and got only about 125.

Page 13: Lessons learnt on a 2000-core cluster

Gateway is highly CPU loaded

(perhaps that’s the bottleneck?)

Page 14: Lessons learnt on a 2000-core cluster

SOLUTION

• Eliminate data compression
  • It was unneeded – 160 compressions of <1kb data per task (1 per subtask)!
• Eliminate unneeded deserialization
• Eliminate Guid.NewGuid() per subtask
  • It’s not nearly as cheap as one might think
  • Especially if there are 160 of them per task
• Turn on server GC (see the sketch below)
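Server GC is a host configuration switch rather than a code change; the small sketch below (standard .NET GCSettings API) just reports at startup which GC mode the process actually got, which is handy when verifying the setting made it onto the cluster nodes.

    // App.config needs <runtime><gcServer enabled="true"/></runtime>;
    // this snippet only verifies what the process ended up with.
    using System;
    using System.Runtime;

    static class GcMode
    {
        public static void Report()
        {
            Console.WriteLine("Server GC: {0}, latency mode: {1}",
                              GCSettings.IsServerGC, GCSettings.LatencyMode);
        }
    }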

Page 15: Lessons learnt on a 2000-core cluster

SOLUTION (CTD.)

• There was support for our own throttling and round-robining in code

• We didn’t actually need it! (needed before, but not now)
• Eliminated both

Result:
• Oops, RabbitMQ crashed!

Page 16: Lessons learnt on a 2000-core cluster

CAUSE

• 3 queues per client
  • Remember “capacity planning”?
  • A RabbitMQ queue is an exhaustible resource
• Didn’t even remove unneeded queues
  • Long to explain, but we didn’t actually need them in this scenario
• RabbitMQ is not OK with several thousand queues
  • rabbitmqctl list_queues took an eternity

Page 17: Lessons learnt on a 2000-core cluster

SOLUTION

• Have 2 queues per JOB and no cancellation queues
  • Just purge the request queue
  • OK unless several jobs share their request queue
  • We don’t use this option.

Page 18: Lessons learnt on a 2000-core cluster

AND THEN IT WORKED

Cluster fully loaded

Cluster quickly saturated, and stays saturated

Compute nodes at 100% CPU

Page 19: Lessons learnt on a 2000-core cluster

MORAL

Eliminate bloat – Complexity kills

Even if “We’ve got feature X” sounds cool:
• Round-robining and throttling

• Cancellation queues

• Compression

Page 20: Lessons learnt on a 2000-core cluster

MORAL

Rethink what is CPU-cheap
• O(1) is not enough

• You’re going to compete with 2000 cores

• You’re going to do this “cheap” stuff a zillion times

Page 21: Lessons learnt on a 2000-core cluster

MORAL

Rethink what is CPU-cheap

1 task = avg. 600ms of computation for 2000 cores
• Split into 160 parts

• 160 Guid.NewGuid()

• 160 gzip compressions of 1kb data

• 160 publishes to RabbitMQ

• 160*N serializations/deserializations

• It’s not cheap at all, compared to 600ms

• Esp. compared to 30ms, if you’re aiming at 95% scalability

Page 22: Lessons learnt on a 2000-core cluster

And then we tried short tasks

~1000x shorter

Page 23: Lessons learnt on a 2000-core cluster

Oh well. The tasks are really short, after all…

Page 24: Lessons learnt on a 2000-core cluster

And we started getting a whole lot of memcached misses.

Page 25: Lessons learnt on a 2000-core cluster

INVESTIGATION

• Have we put so much into memcached that it evicted the tasks?

Log:

Key XXX not found

> echo “GET XXX” | telnet 123.45.76.89 11211

YYYYYYYY

Nope, it’s still there.

Page 26: Lessons learnt on a 2000-core cluster

SOLUTION

Retry until OK (with exponential back-off) – see the sketch below.
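A minimal sketch of the retry loop; the Func<byte[]> parameter stands in for whatever memcached get call is being retried, and the delays are illustrative.

    using System;
    using System.Threading;

    static class Retry
    {
        public static byte[] GetWithBackoff(Func<byte[]> get, int maxAttempts = 8)
        {
            var delayMs = 50;
            for (int attempt = 1; ; attempt++)
            {
                var value = get();
                if (value != null || attempt >= maxAttempts)
                    return value;                          // hit, or give up after the last attempt
                Thread.Sleep(delayMs);
                delayMs = Math.Min(delayMs * 2, 5000);     // double the delay, cap at 5 seconds
            }
        }
    }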

Page 27: Lessons learnt on a 2000-core cluster

Oh.

Blue: fetching from memcached. Orange: computing.

Desperately retrying

Page 28: Lessons learnt on a 2000-core cluster

INVESTIGATION

• Memcached can’t be down for that long, right?
  • Right.

Look into the code…

We cached the MemcachedClient objects to avoid creating them per request, because that is oh so slow.

Page 29: Lessons learnt on a 2000-core cluster

INVESTIGATION

• There was a bug in the memcached client library (Enyim)
  • It took too long to discover that a server is back online
• Our “retries” were not actually retrying
  • They were stumbling on Enyim’s cached “server is down”.

Page 30: Lessons learnt on a 2000-core cluster

SOLUTION

Do not cache the MemcachedClient objects

Result:
• That helped. No more misses.

Page 31: Lessons learnt on a 2000-core cluster

MORAL

Eliminate bloat – Complexity kills
• I think we’ve already talked about this one.

• Smart code is bad because you don’t know what it’s actually doing

Page 32: Lessons learnt on a 2000-core cluster

Then we saw that memcached gets take 200ms each

Page 33: Lessons learnt on a 2000-core cluster

INVESTIGATION

Memcached can’t be that slow, right?
• Right.

Then who is slow?
• Who is between us and memcached?
• Right, Enyim.
• Creating those non-cached Client objects.

Page 34: Lessons learnt on a 2000-core cluster

SOLUTION

Write our own fat-free “memcached client” (see the sketch below)
• Just a dozen lines of code
• The protocol is very simple.
• Nothing stands between us and memcached (well, except for the OS TCP stack)
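A sketch of what such a fat-free client can look like for the text protocol’s get command; error handling and the set command are omitted, and it assumes the value fits on one line, so it illustrates how little is needed rather than the code we actually ran.

    using System.IO;
    using System.Net.Sockets;
    using System.Text;

    static class TinyMemcached
    {
        public static string Get(string host, int port, string key)
        {
            using (var client = new TcpClient(host, port))
            using (var stream = client.GetStream())
            {
                var request = Encoding.ASCII.GetBytes("get " + key + "\r\n");
                stream.Write(request, 0, request.Length);

                var reader = new StreamReader(stream, Encoding.ASCII);
                var header = reader.ReadLine();      // "VALUE <key> <flags> <bytes>" or "END"
                if (header == null || !header.StartsWith("VALUE"))
                    return null;                     // miss
                var value = reader.ReadLine();       // the data block
                reader.ReadLine();                   // trailing "END"
                return value;
            }
        }
    }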

Result:• That helped. Now gets took ~2ms.

Page 35: Lessons learnt on a 2000-core cluster

MORAL

Eliminate bloat – Complexity kills
• Should I say more?

Page 36: Lessons learnt on a 2000-core cluster

And this is how well we scaled these short tasks.

About 5 1-second tasks/s. Terrific for a 2000-core cluster.

Page 37: Lessons learnt on a 2000-core cluster

INVESTIGATION

These stripes are almost parallel!
• Because tasks are round-robined to nodes in the same order.
• And this round-robiner is not keeping up.
• Who’s that?
• RabbitMQ.

We must have hit RabbitMQ limits
• ORLY?

• We push 160 messages per 1 task that takes 0.25ms on 2000 cores.

• Capacity planning?

Page 38: Lessons learnt on a 2000-core cluster

INVESTIGATION

And we also have 16 RabbitMQs.

And there’s just 1 queue.

Every queue lives on 1 node.

15/16 = 93.75% of pushes and pulls are indirect.

Page 39: Lessons learnt on a 2000-core cluster

SOLUTION

Don’t split these short tasks into parts.

Result:
• That helped.

• ~76 tasks/s submitted to RabbitMQ.

Page 40: Lessons learnt on a 2000-core cluster

AND THEN THIS

• “An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full.” (during connection to the Gateway)

• Spurious program crashes in Enyim code under load

Page 41: Lessons learnt on a 2000-core cluster

SOLUTION

Update Enyim to latest version.

Result:
• Didn’t help.

Page 42: Lessons learnt on a 2000-core cluster

SOLUTION

Get rid of Enyim completely.

(also implement put() – another 10 LOC)

Result:
• That helped
• No more crashes

In hindsight:
• Actually, I forgot to destroy the Enyim client objects

Page 43: Lessons learnt on a 2000-core cluster

MORAL

Third-party libraries can fail
• They’re written by humans
• Maybe by humans who didn’t test them under these conditions (i.e. a large number of connections occupied by the rest of the program)

YOU can fail too (for example, by misusing a library)
• You’re a human

Do not fear replacing a library with an easy piece of code
• Of course, only if it is easy (for memcached it, luckily, was)
• “Why did they write a complex library?” Because it does more – but maybe not what you need.

Page 44: Lessons learnt on a 2000-core cluster

But we’re still stuck at 76 tasks/s.

Page 45: Lessons learnt on a 2000-core cluster

SOLUTION

A thorough but blind CPU hunt in Client and Gateway.
• Didn’t want to launch a profiler on the cluster nodes because RDP was laggy and I was lazy
• (Most probably this was a mistake)

Page 46: Lessons learnt on a 2000-core cluster

SOLUTION

Fix #1
• Special-case optimization for TO-0 tasks: skip the unneeded deserialization and splitting in the Gateway (don’t split them at all)

Result
• Gateway CPU load drops 2x

• Scalability doesn’t improve

Page 47: Lessons learnt on a 2000-core cluster

SOLUTION

Fix #2
• Eliminate task GUID generation in Client
• Parallelize submission of requests
  • To spread the WCF serialization CPU overhead over cores
• Turn on server GC

Result
• Now it takes 14s instead of 20s to push 1900 tasks to the Gateway (130/s).

Still not quite there.

Page 48: Lessons learnt on a 2000-core cluster

LOOK AT THE CLUSTER LOAD AGAIN

Where do these pauses come from? They appear consistently on every run.

Page 49: Lessons learnt on a 2000-core cluster

WHERE DO THESE PAUSES COME FROM?

What can pause a .NET application?
• The garbage collector
• The OS (swapping in/out)

What’s common between these runs?
• Roughly the same number of tasks in memory at the pauses

Page 50: Lessons learnt on a 2000-core cluster

WHERE DID THE MEMORY GO?

The node running the Client had 98-99% of physical memory occupied.

By whom?
• SQL Server: >4 GB
• MS HPC Server: another few GB

No wonder.

Page 51: Lessons learnt on a 2000-core cluster

SOLUTION

Turn off HPC Server on this node.

Result:
• The pauses got much milder

Page 52: Lessons learnt on a 2000-core cluster

Still don’t know what this is.

About 170 tasks/s. Only using 1248 cores. Why? We don’t know yet.

Page 53: Lessons learnt on a 2000-core cluster

MORAL

• Measure your application. Eliminate interference from others – the interference can be drastic.

• Do not place a latency-sensitive component together with anything heavy (throughput-sensitive) like SQL Server.

Page 54: Lessons learnt on a 2000-core cluster

But scalability didn’t improve much.

Page 55: Lessons learnt on a 2000-core cluster

HOW DO WE UNDERSTAND WHY IT’S SO BAD?

Eliminate interference.

Page 56: Lessons learnt on a 2000-core cluster

WHAT INTERFERENCE IS THERE?

“Normalizing” tasks
• Deserialize
• Extract data to memcached
• Serialize

Let us remove it (prepare tasks up front, then fire them like a machine gun).
• Result: almost the same – 172 tasks/s
• (Unrealistic, but easier for further investigation)

Page 57: Lessons learnt on a 2000-core cluster

SO HOW LONG DOES IT TAKE TO SUBMIT A TASK?

Client: “Oh, quite a lot!” Gateway: “Not much.”

1 track = 1 thread. An orange bar starts right before BeginExecute and ends right after it returns.

(now that it’s the only thing we’re doing)

Page 58: Lessons learnt on a 2000-core cluster

DURATION OF THESE BARS

Client: “Usually, and consistently, about 50ms.”

Gateway: “Usually a couple of ms.”

Page 59: Lessons learnt on a 2000-core cluster

VERY SUSPICIOUS

What are those 50ms? Too round a number.

Perhaps some protocol is enforcing it?

What’s our protocol?

Page 60: Lessons learnt on a 2000-core cluster

What’s our protocol? TCP, right?

var client = new CloudGatewayClient("BasicHttpBinding_ICloudGateway");

Oops.

Page 61: Lessons learnt on a 2000-core cluster

SOLUTION

Change to NetTcpBinding
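For illustration, a hedged sketch of the same generated proxy constructed over net.tcp; the endpoint address is invented, and the equivalent change can be made purely in the WCF client configuration (netTcpBinding instead of basicHttpBinding).

    using System.ServiceModel;

    var binding = new NetTcpBinding();
    var address = new EndpointAddress("net.tcp://gateway-host:9000/CloudGateway");
    var client = new CloudGatewayClient(binding, address);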

Don’t remember which is which :( Still looks strange, but much better.

Page 62: Lessons learnt on a 2000-core cluster

Only using 1083 of >1800 cores! Why? We don’t know yet.

About 340 tasks/s.

Page 63: Lessons learnt on a 2000-core cluster

MORAL

Double-check your configuration.

Measure the “same” thing in several ways.
• E.g. the time to submit a task, from the POV of both client and gateway

Page 64: Lessons learnt on a 2000-core cluster

HERE COMES THE DESSERT.

“Tools matter”
• Already shown how pictures (and drawing tools) matter.

We have a logger. “Greg” = “Global Registrator”.
• Most of the pictures wouldn’t be possible without it.
• Distributed (client/server)
• Accounts for machine clock offset
• Output is sorted on a “global time axis”
• Lots of smart “scalability” tricks inside

Page 65: Lessons learnt on a 2000-core cluster

TOOLS MATTER

And it didn’t work all that well, for quite a long time.

Here’s how it failed:
• Ate 1-2 GB of RAM
• Output was not sorted
• Logged events with a 4-5 minute lag

Page 66: Lessons learnt on a 2000-core cluster

TOOLS MATTER

Here’s how its failures mattered:

• Had to wait several minutes to gather all the events from a run.

• Sometimes not all of them were even gathered

After the problems were fixed, the “experiment roundtrip” (change, run, collect data, analyze) sped up by at least 2x-3x.

Page 67: Lessons learnt on a 2000-core cluster

TOOLS MATTER

Too bad it was on the last day of cluster availability.

Page 68: Lessons learnt on a 2000-core cluster

WHY WAS IT SO BUGGY?

The problem ain’t as easy as it seemed.
• Lots of clients (~2000)

• Lots of messages

• 1 RPC request per message = unacceptable

• Don’t log a message until clock synced with the client machine

• Resync clock periodically

• Log messages in order of global time, not order of arrival

• Anyone might (and does) fail or come back online at any moment

• Must not crash

• Must not overflow RAM

• Must be fast

Page 69: Lessons learnt on a 2000-core cluster

HOW DOES IT WORK?

• Client buffers messages and sends them to server in batches (client initiates).

• Messages marked with client’s local timestamp.

• Server buffers messages from each client.

• Periodically, client and server calibrate clocks (server initiates). Once a client machine is calibrated, its messages go to the global buffer with transformed timestamps.
• Messages stay in the global buffer for 10s (“if a message has been the earliest for 10s, it will remain the earliest”)
• Global buffer(windowSize):
    Add(time, event)
    PopEarliest() : (time, event)

Page 70: Lessons learnt on a 2000-core cluster

SO, THE TRICKS WERE:

• Limit the global buffer (drop messages if it’s full)
  • Log “Dropping message”… “Dropped 10000, 20000… messages”… “Accepting again after dropping N”
• Limit the send buffer on the client
  • Same
• Use compression for batches
  • (actually unused)
• Ignore (but log) errors like failed calibration, failed send, failed receive, failed connect, etc.
  • Retry after a while
• Send records to the server in bounded batches
  • If I’ve got 1 million records to say, I shouldn’t keep the connection busy for a long time (the number of concurrent connections is a resource!). Cut into batches of 10000.
• Prefer polling to blocking because it’s simpler

Page 71: Lessons learnt on a 2000-core cluster

SO, THE TRICKS WERE:

• Prefer a “negative feedback” style
  • Wake up, see what’s wrong, fix it
  • Not “react to every event while preserving invariants” – much harder, sometimes impossible.
• Network performance tricks:
  • TCP NO_DELAY whenever possible
  • Warm up the connection before calibrating
  • Calibrate N times, averaging until a confidence interval is reached
  • (actually, precise calibration is theoretically possible only if network latencies are symmetric, which they aren’t…)

Page 72: Lessons learnt on a 2000-core cluster

AND THE BUGS WERE:

The client called the server even if it had nothing to say.
• Impact: *lots* of unneeded connections.
• Fix: check first; poll.

Page 73: Lessons learnt on a 2000-core cluster

AND THE BUGS WERE:

The “pending records” per-client buffer was unbounded.
• Impact: the server ate memory if it couldn’t sync the clock
• Reason: code duplication. Should have abstracted away a “bounded buffer”.
• Fix: bound it.

Page 74: Lessons learnt on a 2000-core cluster

AND THE BUGS WERE:

If calibration with a client failed at the 1st attempt, it never calibrated.
• Impact: well… especially given the previous bug.
• Reason: try{loop}/ignore instead of loop{try/ignore} (see the sketch below)
• Meta reason: too complex code, mixed levels of abstraction
  • Mixed what’s being “tried” with how it’s being managed (how failures are handled)
• Fix: change to loop{try/ignore}.
• Meta fix: go through all the code, classify methods into “spaghetti” and “flat logic”. Extract logic from spaghetti.
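A minimal sketch of the two shapes; CalibrateOnce() and calibrationPeriod are placeholders, not real names from the code.

    // Buggy shape: try{loop}/ignore – the first exception kills the loop forever.
    try
    {
        while (true) { CalibrateOnce(); Thread.Sleep(calibrationPeriod); }
    }
    catch { /* ignored – but calibration never runs again */ }

    // Fixed shape: loop{try/ignore} – each attempt is independent.
    while (true)
    {
        try { CalibrateOnce(); }
        catch { /* ignore, retry on the next period */ }
        Thread.Sleep(calibrationPeriod);
    }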

Page 75: Lessons learnt on a 2000-core cluster

AND THE BUGS WERE:

No calibration with a machine in scenario “Start client A, start client B, kill client A”

• Impact: Very bad.

• Reason: if a client couldn’t establish a calibration TCP listener, it wouldn’t try again (“someone else is listening, not my job”). Then that guy dies – and whose job is it now?
• Meta reason: one-time global initialization for a globally periodic process (init; loop{action}). Global conditions change, and initialization is needed again.
• Fix: transform to loop{init; action} – periodically (re-)establish the listener, ignoring failures (see the sketch below).
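A sketch of the init-inside-the-loop shape; TryStartCalibrationListener(), RunCalibrationRound() and calibrationPeriod are placeholders.

    // One-time init (init; loop{action}) breaks when global conditions change.
    // Re-initializing inside the loop (loop{init; action}) heals itself.
    TcpListener listener = null;
    while (true)
    {
        try
        {
            if (listener == null)
                listener = TryStartCalibrationListener();  // may fail: someone else is listening
            if (listener != null)
                RunCalibrationRound(listener);
        }
        catch { listener = null; /* drop it and re-establish on the next iteration */ }
        Thread.Sleep(calibrationPeriod);
    }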

Page 76: Lessons learnt on a 2000-core cluster

AND THE BUGS WERE:

Events were not coming out in order.

• Impact: not critical by itself, but it casts doubt on the correctness of everything. If this doesn’t work, how can we be sure that we even get all the messages? All in all, very bad.

• Reason: ???

And they were also coming out with a huge lag.

• Impact: Dramatic (as already said).

Page 77: Lessons learnt on a 2000-core cluster

THE CASE OF THE LAGGING EVENTS

There were many places where they could lag.
• That’s already very bad by itself…
• On the client? (repeatedly failing to connect to the server)
• On the server? (repeatedly failing to read from the client)
• In the per-client buffer? (failing to calibrate / to notice that calibration is done)
• In the global buffer? (failing to notice that an event has “expired” its 10s)

Page 78: Lessons learnt on a 2000-core cluster

THE CASE OF THE LAGGING EVENTS

Meta fix:
• More internal logging

Didn’t help.
• This logging was invisible, because it was done with Trace.WriteLine and viewed with DbgView, which doesn’t work across sessions
• My fault – didn’t deal with this.
• It only failed under heavy load from many machines (the worst kind of error…)

But it could have helped.
• Log/assert everything
• If things were fine where you expect them to be, there’d be no bugs. But there are.

Page 79: Lessons learnt on a 2000-core cluster

THE CASE OF THE LAGGING EVENTS

Investigation by sequential elimination of reasons.

The most suspicious thing was the “time-buffered queue”.
• A complex piece of mud.
• “Kind of” a priority queue that tracks times and sleeps/blocks on “pop”
• Looked right and passed tests, but felt uncomfortable

So I rewrote it.

Page 80: Lessons learnt on a 2000-core cluster

THE CASE OF THE LAGGING EVENTS

Rewrote it.
• Polling instead of blocking: “What’s the earliest event? Has it been here for 10s yet?”
• A classic priority queue “from the book”
• Peek the minimum, check expiry, pop or not.
• That’s it. (See the sketch below.)
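A sketch of the rewritten queue’s shape. It leans on the built-in PriorityQueue from modern .NET for brevity (the original code predates it and used its own heap), and the “timestamp is older than the window” check is a simplification of the “earliest for 10s” rule.

    using System;
    using System.Collections.Generic;

    class TimeBufferedQueue<T>
    {
        private readonly PriorityQueue<T, DateTime> _heap = new PriorityQueue<T, DateTime>();
        private readonly TimeSpan _window;

        public TimeBufferedQueue(TimeSpan window) { _window = window; }

        public void Add(DateTime time, T item) { _heap.Enqueue(item, time); }

        // Called from a polling loop: emit events whose timestamp is at least
        // `window` old; everything younger stays buffered for re-ordering.
        public IEnumerable<T> PopExpired(DateTime now)
        {
            T item; DateTime time;
            while (_heap.TryPeek(out item, out time) && now - time >= _window)
                yield return _heap.Dequeue();
        }
    }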

Now the queue definitely worked correctly.

But events still lagged.

Page 81: Lessons learnt on a 2000-core cluster

THE CASE OF THE LAGGING EVENTS

What remained? Only a walk through the code.

Page 82: Lessons learnt on a 2000-core cluster

THE CASE OF THE LAGGING EVENTS

A while later…

Page 83: Lessons learnt on a 2000-core cluster

THE CASE OF THE LAGGING EVENTS

A client has 3 associated threads.
• (1 per batch of records) A thread that reads the batch into the per-client buffer.
• (1 per client) A thread that pulls from the per-client buffer and writes calibrated events to the global buffer (after calibration is done).
• (1 per machine) A calibration thread.

Page 84: Lessons learnt on a 2000-core cluster

THE CASE OF THE LAGGING EVENTS

A client has 3 associated threads.

And they were created in ThreadPool.

And ThreadPool creates no more than 2 new threads/s.

Page 85: Lessons learnt on a 2000-core cluster

THE CASE OF THE LAGGING EVENTS

So we have 2000 clients on 250 machines.

A couple thousand threads.

Not a big deal, OS can handle more. And they’re all doing IO. That’s what an OS is for.

Created at a rate of 2 per second.

4-5 minutes pass before the pool creates the calibration thread for the last machine!

Page 86: Lessons learnt on a 2000-core cluster

THE CASE OF THE LAGGING EVENTS

Fix: Start a new thread without ThreadPool.
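A minimal sketch of the fix; CalibrationLoop() and machine are placeholders for the real per-machine work.

    using System.Threading;

    // Before: queued to the pool – may sit behind thousands of other work items,
    // since the pool injects new threads at a slow, throttled rate.
    // ThreadPool.QueueUserWorkItem(_ => CalibrationLoop(machine));

    // After: one dedicated background thread per machine, started immediately.
    var thread = new Thread(() => CalibrationLoop(machine)) { IsBackground = true };
    thread.Start();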

And suddenly everything worked.

Page 87: Lessons learnt on a 2000-core cluster

THE CASE OF THE LAGGING EVENTS

Why did it take so long to find?
• Unreproducible on fewer than a dozen machines

• Bad internal debugging tools (Trace.WriteLine)

• And lack of understanding of their importance

• Too complex architecture

• Too many places can fail, need to debug all at once

Page 88: Lessons learnt on a 2000-core cluster

THE CASE OF THE LAGGING EVENTS

Moral:
• Functional abstractions leak in non-functional ways.
  • The thread pool’s functional abstraction is “do something soon”
  • In practice it leaks: “soon, but no sooner than 2 threads/s”
• Know exactly how they leak, or don’t use them.

Page 89: Lessons learnt on a 2000-core cluster

GREG AGAIN

• Rewrote it nearly from scratch
  • Calibration is now also initiated by the client
  • The server only accepts client connections and moves messages around the queues
• Pattern “Move responsibility to the client” – the server now does a lot less calibration-related bookkeeping
• Pattern “Eliminate dependency cycles / feedback loops”
  • Now the server doesn’t care at all about failures of clients
• Pattern “Do one thing and do it well”
  • Just serve requests.
  • Don’t manage the workflow.
  • It’s now easier for the server to throttle the number of concurrent requests of any kind

Page 90: Lessons learnt on a 2000-core cluster

THE GOOD PARTS
OK, LOTS OF THINGS WERE BROKEN. WHICH WEREN’T?

Asynchronous processing
• We’d be screwed if not for the recent “fully asynchronous” rewrite
• “Concurrent synchronous calls” are a very scarce resource

Reliance on a fault-tolerant abstraction: messaging
• We’d be screwed if RabbitMQ didn’t handle the failures for us

Good measurement tools
• We’d be blindfolded without the global clock-synced logging and drawing tools

Good deployment scripts
• We’d be in configuration hell if we did that manually

Reasonably low coupling
• We’d have much longer experiment roundtrips if we ran tests on “the real thing” (Huge Legacy Program + HPC Server + everything)
• It was not hard to do independent performance optimizations of all the component layers involved (and there were not too many layers)

Page 91: Lessons learnt on a 2000-core cluster

Morals

Page 92: Lessons learnt on a 2000-core cluster

MORALS

• Tools matter
  • We would have been helpless without the graphs
  • Would have done much more if the logger had been fixed earlier…
• Capacity planning
  • How much of X will you need for 2000 cores?
• Complexity kills
  • Problems are everywhere, and if they’re also complex, then you can’t fix them
• Rethink “CPU-cheap”
  • Is it cheap compared to what 2000 cores can do?
• Abstractions leak
  • Do not rely on a functional abstraction when you have non-functional requirements
• Everything fails
  • Especially you
  • Planning to have failures is more robust than planning how exactly to fight them
  • There are no “almost improbable” errors: probabilities accumulate
  • Explicitly ignore failures in non-critical code
  • Code that does this is larger, but simpler to understand, than code that doesn’t
• Think about where to put responsibility for what
  • The difference in ease of implementation may be dramatic

Page 93: Lessons learnt on a 2000-core cluster

That’s all.