Does Better Throughput Require Worse Latency?

DESCRIPTION
Presentation by David Ungar. Paper and more information: http://soft.vub.ac.be/races/paper/does-better-throughput-require-worse-latency/

TRANSCRIPT

Does Better Throughput Require Worse Latency?
David Ungar, Doug Kimelman, Sam Adams, Mark Wegman
IBM T. J. Watson Research Center
Monday 7 January 2013
Example: On-Line Transaction Processing
✦ Large “database” (100 GB) of information
✦ Constant stream of incoming updates & queries
✦ Need many cores to handle the work
✦ Cores need to communicate updates
✦ roll-ups sum over many variables
✦ Tricks:
✦ Caching - updates must sync with invalidates
✦ Replication - updates must propagate
Assumptions
✦ Too much computation for one core
✦ Not trivially scalable: needs communication
✦ Inputs constantly changing
✦ No sub-space radio:
✦ communication finite and limiting
Throughput vs. Latency
Throughput ~ Scaling
[Chart: throughput (0–100) vs. number of cores (1, 25, 50, 75, 100), with one line for throughput = 1.0 (perfect scaling) and one for throughput = 0.25]
Latency
✦ Inter-core
✦ Data structure/algorithm level
✦ Time needed for cause (input, computation result) on one core to affect another
[Diagram: Δt from one core to another]
What is best possible latency (on a given platform)?
Measure w/ Ring Counter

Core 1: while (1) A = D;
Core 2: while (1) B = A;
Core 3: while (1) C = B;
Core 4: while (1) D = C + 1;

Latency Baseline ≣ Time / Count / Number-of-Cores
Ring Counter Latency Baselines
[Two charts: latency (ns, 0–100) vs. # threads (1–8; 4 cores, 2-way SMT), each with min and max curves: one for normal loads & stores, one for normal loads & stores + memory barrier]
Other platforms? Signals? Atomics?
The Intuition
✦ After you have optimized:
✦ Suppose relative latency is 10
✦ Relative throughput is 1/4
✦ If you then raise throughput to 1/2
✦ Latency will increase to 20
Space of best algorithms exhibits this trade-off
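The slide's numbers (relative latency 10 at relative throughput 1/4; latency 20 at 1/2) suggest latency growing linearly with throughput along that frontier of best algorithms. The helper below just encodes that reading; the linear form is my interpretation, not a formula stated in the talk.

```c
/* Along the frontier of best algorithms, latency scales with the
 * throughput ratio (an interpretation of the slide's numbers). */
double latency_at(double lat0, double tp0, double tp1) {
    return lat0 * (tp1 / tp0);   /* e.g. 10 * (0.5 / 0.25) = 20 */
}
```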
Variables
# readers
# writers
contention
reading/writing
Which Instructions
Normal loads & stores
Atomic loads & stores
Signals
Memory barriers
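In C11 terms, the instruction choices listed above might look like the following sketch (function and variable names are mine; signals are omitted since they go through the OS rather than the memory system):

```c
/* Sketch of the listed instruction choices for publishing a value
 * from one core to another (C11). Names are illustrative. */
#include <stdatomic.h>

long plain_v;            /* normal loads & stores: compiler and CPU
                            may reorder; a data race if shared */
atomic_long atomic_v;    /* atomic loads & stores */

void publish_plain(long v)  { plain_v = v; }
void publish_fenced(long v) {            /* store + memory barrier */
    plain_v = v;
    atomic_thread_fence(memory_order_release);
}
void publish_atomic(long v) { atomic_store(&atomic_v, v); }
long read_atomic(void)      { return atomic_load(&atomic_v); }
```

The ring-counter charts earlier compare exactly these variants: adding the barrier buys ordering guarantees at the cost of extra latency per hop.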
Shared Counter
From McKenney’s PerfBook

Serial:
  Write code: C += delta
  Read code: C
  Latency: tiny
  Throughput: single-core

Mutex:
  Write code: lock, C += delta, unlock
  Read code: C
  Latency: small, unless writers convoy
  Throughput: higher, but writers have locking & contention overhead

Lock-Free:
  Write code: C +=atomic delta
  Read code: C
  Latency: if contention, writers can starve
  Throughput: higher for low-contention writers

Per-thread:
  Write code: per-thread-C += delta
  Read code: sum(all C’s)
  Latency: high if many cores
  Throughput: higher for writers, lower for readers

Per-thread + cache:
  Write code: per-thread-C += delta; another thread maintains sum
  Read code: read sum
  Latency: higher: summing thread may be idle
  Throughput: high for both readers and writers

Race & Repair:
  Write code: C += delta
  Read code: C
  Latency: higher under contention: lost counts
  Throughput: high for both readers and writers
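The Per-thread row of the table can be sketched in a few lines. This is an illustration of the idea, not McKenney's code; `NTHREADS` and the padding size are my assumptions.

```c
/* Per-thread counter sketch: writers get private, uncontended slots
 * (high throughput); readers must sum every slot (higher latency). */
#include <stdatomic.h>

#define NTHREADS 4
struct slot { atomic_long c; char pad[64 - sizeof(atomic_long)]; };
static struct slot per_thread_c[NTHREADS];  /* one counter per thread,
                                               padded against false
                                               sharing (64B assumed) */

/* Write: per-thread-C += delta, no contention with other writers. */
void counter_add(int tid, long delta) {
    atomic_fetch_add_explicit(&per_thread_c[tid].c, delta,
                              memory_order_relaxed);
}

/* Read: sum(all C's); cost grows with the number of cores, which is
 * exactly the latency entry in the table. */
long counter_read(void) {
    long sum = 0;
    for (int i = 0; i < NTHREADS; i++)
        sum += atomic_load_explicit(&per_thread_c[i].c,
                                    memory_order_relaxed);
    return sum;
}
```

The Per-thread + cache row then trades the read-side sum for a dedicated summing thread, moving the cost (and staleness) off the reader.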
Conclusions
✦ Throughput: how well parallelism gets work done
✦ Latency: how fast one core responds to another
✦ Lots of dimensions: # readers, # writers, contention
✦ Throughput vs Latency:
✦ throughput -> parallel -> distributed/replicated -> more latency