TRANSCRIPT
Are We Trading Consistency Too Easily?
A Case for Sequential Consistency
Madan Musuvathi, Microsoft Research
Dan Marino, UCLA
Todd Millstein, UCLA
Abhay Singh, University of Michigan
Satish Narayanasamy, University of Michigan
MEMORY CONSISTENCY MODEL
Abstracts the program runtime (compiler + hardware)
  Hides compiler transformations
  Hides hardware optimizations, cache hierarchy, …
Sequential consistency (SC) [Lamport '79]: "The result of any execution is the same as if the operations were executed in some sequential order, and the operations of each individual thread appear in this sequence in the program order."
SEQUENTIAL CONSISTENCY EXPLAINED
Thread 1:      Thread 2:
X = 1;         t = F;
F = 1;         u = X;

int X = F = 0; // F = 1 implies X is initialized

The six SC interleavings of the four operations, and their outcomes:

X=1; F=1; t=F; u=X;  →  t=1, u=1
X=1; t=F; F=1; u=X;  →  t=0, u=1
X=1; t=F; u=X; F=1;  →  t=0, u=1
t=F; X=1; F=1; u=X;  →  t=0, u=1
t=F; X=1; u=X; F=1;  →  t=0, u=1
t=F; u=X; X=1; F=1;  →  t=0, u=0

t=1 implies u=1
CONVENTIONAL WISDOM
SC is slow
  Disables important compiler optimizations
  Disables important hardware optimizations

Relaxed memory models are faster
CONVENTIONAL WISDOM
SC is slow
  Hardware speculation can hide the cost of SC hardware [Gharachorloo et al. '91, …, Blundell et al. '09]
  Compiler optimizations that break SC provide negligible performance improvement [PLDI '11]

Relaxed memory models are faster
  Need fences for correctness
  Programmers conservatively add more fences than necessary
  Libraries use the strongest fence necessary for all clients
  Fence implementations are slow
  Efficient fence implementations require speculation support
IMPLEMENTING SEQUENTIAL CONSISTENCY EFFICIENTLY

src: t = X;  →  asm: mov eax, [X]

SC-Preserving Compiler [this talk]: every SC behavior of the binary is an SC behavior of the source
SC Hardware: every observed runtime behavior is an SC behavior of the binary
CHALLENGE: IMPORTANT COMPILER OPTIMIZATIONS ARE NOT SC-PRESERVING
Example: Common Subexpression Elimination (CSE)

t, u, v are local variables; X, Y are possibly shared

Before CSE:        After CSE:
L1: t = X*5;       L1: t = X*5;
L2: u = Y;         L2: u = Y;
L3: v = X*5;       L3: v = t;
COMMON SUBEXPRESSION ELIMINATION IS NOT SC-PRESERVING
Thread 1 (before CSE):     Thread 1 (after CSE):
L1: t = X*5;               L1: t = X*5;
L2: u = Y;                 L2: u = Y;
L3: v = X*5;               L3: v = t;

Thread 2 (both versions):
M1: X = 1;
M2: Y = 1;

Init: X = Y = 0;

Before CSE: u == 1 implies v == 5. After CSE: possibly u == 1 && v == 0.
IMPLEMENTING CSE IN A SC-PRESERVING COMPILER
Enable this transformation when X is a local variable, or Y is a local variable
In these cases, the transformation is SC-preserving

Identifying local variables:
  Compiler-generated temporaries
  Stack-allocated variables whose address is not taken

L1: t = X*5;       L1: t = X*5;
L2: u = Y;    →    L2: u = Y;
L3: v = X*5;       L3: v = t;
AN SC-PRESERVING LLVM COMPILER FOR C PROGRAMS
Modify each of ~70 phases in LLVM to be SC-preserving, without any additional analysis

Enable trace-preserving optimizations
  These do not change the order of memory operations
  e.g. loop unrolling, procedure inlining, control-flow simplification, dead-code elimination, …

Enable transformations on local variables

Enable transformations involving a single shared variable
  e.g. t = X; u = X; v = X;  →  t = X; u = t; v = t;
AVERAGE PERFORMANCE OVERHEAD IS ~2%
Baseline: LLVM –O3
Experiments on Intel Xeon, 8 cores, 2 threads/core, 6GB RAM
[Bar chart: percentage overhead over the LLVM -O3 baseline for facesim, bodytrack, barnes, streamcluster, raytrace, canneal, blackscholes, lu, swaptions, radix, fluidanimate, fft, cholesky, freqmine, fmm, water-spatial, water-nsquared, and the average; four configurations: llvm-noopt, llvm+trace-preserving, SC-preserving, SC-preserving + eager loads. Bars clipped at 100% are labeled 480, 373, 154, 132, 200, 116, 159, 173, 237, 298.]
HOW FAR CAN AN SC-PRESERVING COMPILER GO?

Source:

float s, *x, *y;
int i;
s = 0;
for (i = 0; i < n; i++) {
    s += (x[i] - y[i]) * (x[i] - y[i]);
}

Naively compiled, each iteration computes every address and load twice:

float s, *x, *y;
int i;
s = 0;
for (i = 0; i < n; i++) {
    s += (*(x + i*sizeof(float)) - *(y + i*sizeof(float)))
       * (*(x + i*sizeof(float)) - *(y + i*sizeof(float)));
}

Pointer induction variables eliminate the address arithmetic:

float s, *x, *y;
float *px, *py, *e;
s = 0; py = y; e = &x[n];
for (px = x; px < e; px++, py++) {
    s += (*px - *py) * (*px - *py);
}

CSE on the local subtraction removes the remaining duplicate work:

float s, *x, *y;
float *px, *py, *e, t;
s = 0; py = y; e = &x[n];
for (px = x; px < e; px++, py++) {
    t = (*px - *py);
    s += t*t;
}

[Chart: performance of noopt, SC-preserving, and fullopt on this loop.]
WE CAN REDUCE THE FACESIM OVERHEAD (IF WE CHEAT A BIT)
30% of the overhead comes from the inability to perform CSE across function-argument expressions like the matrix product below

But argument evaluation in C is nondeterministic: the specification explicitly allows overlapped evaluation of function arguments

return MATRIX_3X3<T>(
    x[0]*A.x[0] + x[3]*A.x[1] + x[6]*A.x[2],
    x[1]*A.x[0] + x[4]*A.x[1] + x[7]*A.x[2],
    x[2]*A.x[0] + x[5]*A.x[1] + x[8]*A.x[2],
    x[0]*A.x[3] + x[3]*A.x[4] + x[6]*A.x[5],
    x[1]*A.x[3] + x[4]*A.x[4] + x[7]*A.x[5],
    x[2]*A.x[3] + x[5]*A.x[4] + x[8]*A.x[5],
    x[0]*A.x[6] + x[3]*A.x[7] + x[6]*A.x[8],
    x[1]*A.x[6] + x[4]*A.x[7] + x[7]*A.x[8],
    x[2]*A.x[6] + x[5]*A.x[7] + x[8]*A.x[8]);
IMPROVING PERFORMANCE OF SC-PRESERVING COMPILER
Request programmers to reduce shared accesses in hot loops

Use sophisticated static analysis
  Infer more thread-local variables
  Infer data-race-free shared variables

Use program annotations
  Requires changing the programming language
  Minimum annotations sufficient to optimize the hot loops

Perform load optimizations speculatively
  Hardware exposes speculative-load optimization to the software
  Load optimizations reduce the max overhead to 6%
CONCLUSION
Hardware should support strong memory models
  TSO is efficiently implementable [Mark Hill]
  Speculation support for SC over TSO is not currently justifiable
  Can we quantify the programmability cost of TSO?

Compiler optimizations should preserve the hardware memory model

High-level programming models can abstract TSO/SC
  Further enable compiler/hardware optimizations
  Improve programmer productivity, testability, and debuggability
EAGER-LOAD OPTIMIZATIONS
Eagerly perform loads, or use values from previous loads or stores

Common Subexpression Elimination:
L1: t = X*5;       L1: t = X*5;
L2: u = Y;    →    L2: u = Y;
L3: v = X*5;       L3: v = t;

Constant/copy propagation:
L1: X = 2;         L1: X = 2;
L2: u = Y;    →    L2: u = Y;
L3: v = X*5;       L3: v = 10;

Loop-invariant code motion:
L1:                L1: u = X*5;
L2: for(…)    →    L2: for(…)
L3:   t = X*5;     L3:   t = u;
PERFORMANCE OVERHEAD
Allowing eager-load optimizations alone reduces max overhead to 6%
[Bar chart as before; with eager loads enabled, the SC-preserving compiler's maximum overhead drops to 6%.]
CORRECTNESS CRITERIA FOR EAGER-LOAD OPTIMIZATIONS
Eager-load optimizations rely on a variable remaining unmodified in a region of code

Sequential validity: no modifications to X by the current thread in L1-L3
SC-preservation: no modifications to X by any other thread in L1-L3

L1: t = X*5;   // enables the invariant t == 5*X
L2: *p = q;    // must maintain the invariant t == 5*X
L3: v = X*5;   // uses the invariant to transform L3 to v = t;
SPECULATIVELY PERFORMING EAGER-LOAD OPTIMIZATIONS
On monitor.load, hardware starts tracking coherence messages on X's cache line
The interference check fails if X's cache line has been downgraded since the monitor.load
In our implementation, a single instruction checks interference on up to 32 tags

L1: t = X*5;       L1: t = monitor.load(X, tag) * 5;
L2: u = Y;    →    L2: u = Y;
L3: v = X*5;       L3: v = t;
                   C4: if (interference.check(tag))
                   C5:     v = X*5;