Does Better Throughput Require Worse Latency?

DESCRIPTION
Presentation by David Ungar. Paper and more information: http://soft.vub.ac.be/races/paper/does-better-throughput-require-worse-latency/

TRANSCRIPT

Does Better Throughput Require Worse Latency?
David Ungar, Doug Kimelman, Sam Adams, Mark Wegman
IBM T. J. Watson Research Center
Monday 7 January 2013
Example: On-Line Transaction Processing
✦ Large “database” (100 GB) of information
✦ Constant stream of incoming updates & queries
✦ Need many cores to handle the work
✦ Cores need to communicate updates
✦ roll-ups sum over many variables
✦ Tricks:
✦ Caching - updates must sync with invalidates
✦ Replication - updates must propagate
Assumptions
✦ Too much computation for one core
✦ Not trivially scalable: needs communication
✦ Inputs constantly changing
✦ No sub-space radio:
✦ communication finite and limiting
Throughput vs. Latency
Throughput ~ Scaling
[Chart: throughput (0–100) vs. number of cores (1, 25, 50, 75, 100), with one line for throughput = 1.0 (perfect scaling) and one for throughput = 0.25]
Latency
✦ Inter-core
✦ Data structure/algorithm level
✦ Time needed for cause (input, computation result) on one core to affect another
[Diagram: Δt from one core to another]
What is best possible latency (on a given platform)?
Measure w/ Ring Counter

Core 1: while (1) A = D;
Core 2: while (1) B = A;
Core 3: while (1) C = B;
Core 4: while (1) D = C + 1;

Latency Baseline ≣ Time / Count / Number-of-Cores
Ring Counter Latency Baselines
[Two charts: latency (ns, 0–100) vs. # threads (1–8; 4 cores, 2-way SMT), each with min and max curves: one for normal loads & stores, one for normal loads & stores + memory barrier]
Other platforms? Signals? Atomics?
The Intuition
✦ After you have optimized:
✦ Suppose relative latency is 10
✦ Relative throughput is 1/4
✦ If you then raise throughput to 1/2
✦ Latency will increase to 20
Space of best algorithms exhibits this trade-off
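The slide's numbers (relative latency 10 at relative throughput 1/4; latency 20 at 1/2) suggest latency growing linearly with throughput along that frontier of best algorithms. The helper below just encodes that reading; the linear form is my interpretation, not a formula stated in the talk.

```c
/* Along the frontier of best algorithms, latency scales with the
 * throughput ratio (an interpretation of the slide's numbers). */
double latency_at(double lat0, double tp0, double tp1) {
    return lat0 * (tp1 / tp0);   /* e.g. 10 * (0.5 / 0.25) = 20 */
}
```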
Variables
# readers
# writers
contention
reading/writing
Which Instructions
Normal loads & stores
Atomic loads & stores
Signals
Memory barriers
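In C11 terms, the instruction choices listed above might look like the following sketch (function and variable names are mine; signals are omitted since they go through the OS rather than the memory system):

```c
/* Sketch of the listed instruction choices for publishing a value
 * from one core to another (C11). Names are illustrative. */
#include <stdatomic.h>

long plain_v;            /* normal loads & stores: compiler and CPU
                            may reorder; a data race if shared */
atomic_long atomic_v;    /* atomic loads & stores */

void publish_plain(long v)  { plain_v = v; }
void publish_fenced(long v) {            /* store + memory barrier */
    plain_v = v;
    atomic_thread_fence(memory_order_release);
}
void publish_atomic(long v) { atomic_store(&atomic_v, v); }
long read_atomic(void)      { return atomic_load(&atomic_v); }
```

The ring-counter charts earlier compare exactly these variants: adding the barrier buys ordering guarantees at the cost of extra latency per hop.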
Shared Counter
From McKenney’s PerfBook

Serial:
  Write code: C += delta
  Read code: C
  Latency: tiny
  Throughput: single-core

Mutex:
  Write code: lock, C += delta, unlock
  Read code: C
  Latency: small, unless writers convoy
  Throughput: higher, but writers have locking & contention overhead

Lock-Free:
  Write code: C +=atomic delta
  Read code: C
  Latency: if contention, writers can starve
  Throughput: higher for low-contention writers

Per-thread:
  Write code: per-thread-C += delta
  Read code: sum(all C’s)
  Latency: high if many cores
  Throughput: higher for writers, lower for readers

Per-thread + cache:
  Write code: per-thread-C += delta; another thread maintains sum
  Read code: read sum
  Latency: higher: summing thread may be idle
  Throughput: high for both readers and writers

Race & Repair:
  Write code: C += delta
  Read code: C
  Latency: higher under contention: lost counts
  Throughput: high for both readers and writers
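The Per-thread row of the table can be sketched in a few lines. This is an illustration of the idea, not McKenney's code; `NTHREADS` and the padding size are my assumptions.

```c
/* Per-thread counter sketch: writers get private, uncontended slots
 * (high throughput); readers must sum every slot (higher latency). */
#include <stdatomic.h>

#define NTHREADS 4
struct slot { atomic_long c; char pad[64 - sizeof(atomic_long)]; };
static struct slot per_thread_c[NTHREADS];  /* one counter per thread,
                                               padded against false
                                               sharing (64B assumed) */

/* Write: per-thread-C += delta, no contention with other writers. */
void counter_add(int tid, long delta) {
    atomic_fetch_add_explicit(&per_thread_c[tid].c, delta,
                              memory_order_relaxed);
}

/* Read: sum(all C's); cost grows with the number of cores, which is
 * exactly the latency entry in the table. */
long counter_read(void) {
    long sum = 0;
    for (int i = 0; i < NTHREADS; i++)
        sum += atomic_load_explicit(&per_thread_c[i].c,
                                    memory_order_relaxed);
    return sum;
}
```

The Per-thread + cache row then trades the read-side sum for a dedicated summing thread, moving the cost (and staleness) off the reader.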
Conclusions
✦ Throughput: how well parallelism gets work done
✦ Latency: how fast one core responds to another
✦ Lots of dimensions: # readers, # writers, contention
✦ Throughput vs Latency:
✦ throughput -> parallel -> distributed/replicated -> more latency