TRANSCRIPT
Parallel Performance Optimization: ASD Shared Memory HPC Workshop
Computer Systems Group, ANU
Research School of Computer ScienceAustralian National University
Canberra, Australia
February 13, 2020
Schedule - Day 4
Computer Systems (ANU) Parallel Perf Optimization Feb 13, 2020 2 / 63
NUMA systems
Outline
1 NUMA systems
2 Profiling Codes
3 Intel TBB
4 Lock Free Synchronization
5 Transactional Memory
Non-Uniform Memory Access: Basic Ideas
NUMA means there is some hierarchy in main memory system’s structure
all memory is available to the programmer (single address space), but some memory takes longer to access than others
modular memory systems with interconnects: UMA vs NUMA
(diagrams: a UMA system, where all cores share the memory modules through a common interconnect, vs a NUMA system, where each pair of cores has its own local memory and cache and reaches remote memory over the interconnect)
on a NUMA system, there are two effects that may be important:
thread affinity: once a thread is assigned to a core, ensure that it stays there
NUMA affinity: memory used by a process must only be allocated on the socket of the core that it is bound to
Examples of NUMA Configurations
Intel Xeon 5500, with QPI (courtesy qdpma.com)
4-socket Opteron; note extra NUMA level within a socket! (courtesy qdpma.com)
Examples of NUMA Configurations (II)
8-way ‘glueless’ system (processors are directly connected) (courtesy qdpma.com)
Case Study: Why NUMA Matters
MetUM global atmosphere model, 1024×769×70 grid on an NCI Intel X5570 - Infiniband supercomputer (2011): Effect of Process and NUMA Affinity on Scaling
note differing values for t16!
on the X5570, local:remote memory access is 65:105 cycles
indicates a significant amount of L3$ misses
Case Study: Why NUMA Matters (II)
Time breakdown for no NUMA affinity, 1024 processes (dual socket nodes, 4 cores per socket)
Note spikes in compute times were always in groups of 4 processes (e.g. socket 0)
Process and Thread Affinity
in general, the OS is free to decide which core (virtual CPU) a process or thread (next) runs on
we can restrict which CPUs it will run on by specifying an affinity mask of the CPU ids it may be scheduled to run on
this has 2 benefits (assuming other active processes/threads are excluded from the specified CPUs):
ensure maximum speed for that process/thread
minimize cache / TLB pollution caused by context switches
e.g. on an 8-CPU system, create 8 threads to run on different CPUs:
pthread_t threadHandle[8]; cpu_set_t cpu;
for (int i = 0; i < 8; i++) {
    pthread_create(&threadHandle[i], NULL, threadFunc, NULL);
    CPU_ZERO(&cpu); CPU_SET(i, &cpu);
    pthread_setaffinity_np(threadHandle[i], sizeof(cpu_set_t), &cpu);
}
for a process, it is similar:
sched_setaffinity(getpid(), sizeof(cpu_set_t), &cpu);
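These affinity calls can be exercised end to end; a minimal sketch (Linux-specific; pin_to_one_cpu is a hypothetical helper, not from the slides) pins the calling process to one CPU from its current mask, verifies the result with sched_getaffinity(), and restores the original mask:

```cpp
#include <sched.h>   // sched_getaffinity, sched_setaffinity (Linux)

// Pin the calling process to the first CPU in its current affinity mask,
// check that exactly that CPU remains schedulable, then restore the mask.
// Returns 0 on success, -1 on any failure.
int pin_to_one_cpu() {
    cpu_set_t orig;
    CPU_ZERO(&orig);
    if (sched_getaffinity(0, sizeof(orig), &orig) != 0) return -1;

    int target = -1;                        // first CPU we may run on
    for (int i = 0; i < CPU_SETSIZE; i++)
        if (CPU_ISSET(i, &orig)) { target = i; break; }
    if (target < 0) return -1;

    cpu_set_t one;
    CPU_ZERO(&one);
    CPU_SET(target, &one);                  // mask containing a single CPU id
    if (sched_setaffinity(0, sizeof(one), &one) != 0) return -1;

    cpu_set_t check;
    CPU_ZERO(&check);
    if (sched_getaffinity(0, sizeof(check), &check) != 0) return -1;
    int ok = (CPU_COUNT(&check) == 1) && CPU_ISSET(target, &check);

    sched_setaffinity(0, sizeof(orig), &orig);   // restore original mask
    return ok ? 0 : -1;
}
```

Choosing the first CPU already in the mask (rather than hard-coding CPU 0) keeps the sketch valid inside containers or cpusets that exclude CPU 0.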
NUMActl: Controlling NUMA from the Shell
on a NUMA system, we generally wish to bind a process and its memory image to a particular ‘node’ (= NUMA domain)
the NUMA API provides a way of controlling policies of memory allocation on a per node or per process basis
policies are default, bind, interleave, preferred
run a program on a CPU on node 0, with all memory allocated on node 0:
numactl --membind=0 --cpunodebind=0 ./prog -args
similar, but force to be run on CPU 0 (which must be on node 0):
numactl --physcpubind=0 --membind=0 ./prog -args
optimize bandwidth for a crucial program to utilize multiple memory controllers (at expense of other processes!):
numactl --interleave=all ./memhog ...
numactl --hardware shows available nodes etc.
LibNUMA: Controlling NUMA from within a Program
with libnuma, we can similarly change (the current thread of) an executing process’s node affinity and memory allocation policy
run from now on on a CPU on node 0, with all memory allocated on node 0:
numa_run_on_node(0);
numa_set_preferred(0);
or, binding both CPU and memory via a node mask:
nodemask_t mask;
nodemask_zero(&mask); nodemask_set(&mask, 0);
numa_bind(&mask);
to allow it to run on all nodes again:
numa_run_on_node_mask(&numa_all_nodes);
execute a memory hogging function, with all its (new) memory fully interleaved, and then restore to previous state:
nodemask_t prevmask = numa_get_interleave_mask();
numa_set_interleave_mask(&numa_all_nodes);
memhog(...);
numa_set_interleave_mask(&prevmask);
Hands-on Exercise: NUMA Effects
Objective:
Explore the effects of Non-Uniform Memory Access (NUMA), that is, the general benefit of ensuring a process and its memory are in the same NUMA domain
Profiling Codes
Outline
1 NUMA systems
2 Profiling Codes
3 Intel TBB
4 Lock Free Synchronization
5 Transactional Memory
Profiling: Basics
Profiling is the process of recording information during execution of a program to form an aggregate view of its dynamic behaviour
Compare with tracing, which records an ordered log of events that can be used to reconstruct dynamic behaviour
Used to understand program performance and find bottlenecks
At certain points in execution, record program state (instruction pointer, calling context, hardware performance counters, ...)
Sampling (recurrent event trigger) vs. instrumentation (probes at specific points in program)
Real vs. simulated execution
Sampling
(diagram: a running program (Main, Asterix, Obelix) is interrupted every 10 ms; the CPU program counter is looked up in a function table and per-function hardware counters (cycles, cache misses, flops) are added and reset)
When an event trigger occurs, record the instruction pointer (+ call context) and performance counters: low overhead, but subject to sampling error
Source: EuroMPI’12: Introduction to Performance Engineering
Instrumentation
(diagram: monitor(routine, location) probes injected at the enter and exit of Function Obelix update a function table of counter values (e.g. cache misses) for Main, Asterix and Obelix)
Inject ‘trampolines’ into source or binary code: accurate but higher overhead
Source: EuroMPI’12: Introduction to Performance Engineering
The 80/20 Rule & Life Cycle
Programs typically spend 80% of their time in 20% of the code
Programmers typically spend 20% of their effort to get 80% of the possible speedup → optimize for the common case
(diagram: the performance analysis process cycles through measurement, analysis, ranking and refinement, within a program life cycle of coding, performance analysis, program tuning and production)
Source: EuroMPI’12: Introduction to Performance Engineering
perf and VTune
Perf is a profiler tool for Linux 2.6+ based systems that abstracts away CPU hardware differences in Linux performance measurements and presents a simple command-line interface. Perf is based on the perf_events interface exported by recent versions of the Linux kernel.
Intel’s VTune is a commercial-grade profiling tool for complex applications via the command-line or GUI.
perf Reference Material
http://www.brendangregg.com/perf.html
Perf for User-Space Program Analysis
Perf Wiki
VTune Reference Material
https://software.intel.com/en-us/intel-vtune-amplifier-xe
Documentation URL
perf for Linux
perf is both a kernel syscall interface and a collection of tools to collect, analyze and present hardware performance counter data, either via counting or sampling
perf for Linux
% perf list
branch-instructions OR branches [Hardware event]
branch-misses [Hardware event]
bus-cycles [Hardware event]
cache-misses [Hardware event]
cache-references [Hardware event]
cpu-cycles OR cycles [Hardware event]
instructions [Hardware event]
ref-cycles [Hardware event]
...
branch-load-misses [Hardware cache event]
branch-loads [Hardware cache event]
dTLB-load-misses [Hardware cache event]
dTLB-loads [Hardware cache event]
dTLB-store-misses [Hardware cache event]
dTLB-stores [Hardware cache event]
iTLB-load-misses [Hardware cache event]
iTLB-loads [Hardware cache event]
VTune
Hands-on Exercise: Perf and VTune
Objective:
Use perf to measure performance of matrix multiply code
Use VTune to both measure and improve code performance
Intel TBB
Outline
1 NUMA systems
2 Profiling Codes
3 Intel TBB
4 Lock Free Synchronization
5 Transactional Memory
Intel Threading Building Blocks (TBB)
Template library extending C++ for parallelism using tasks
Focus on divide-and-conquer algorithms
Thread-safe data structures
Work stealing scheduler
Efficient low-level atomic operations
Scalable memory allocation
Free software under GPLv2
Content adapted from https://software.intel.com/sites/default/files/IntelAcademic Parallel 08 TBB.pdf
Using TBB - Task Based Approach
TBB provides C++ constructs that allow you to express parallel solutions in terms of task objects
Task scheduler manages thread pool
Task scheduler avoids common performance problems ofprogramming with threads
Oversubscription: one scheduler thread per hardware thread
Fair scheduling: non-preemptive unfair scheduling
High overhead: programmer specifies tasks, not threads
Load imbalance: work-stealing balances load
Using a Task-Based Approach
Fibonacci calculation example: the function fibTBB calculates the nth Fibonacci number using a TBB task_group.
int fibTBB(int n) {
    if (n < 10) {
        return fibSerial(n);             // serial evaluation below the cut-off
    } else {
        int x, y;
        tbb::task_group g;
        g.run([&]{ x = fibTBB(n-1); });  // spawn a task
        g.run([&]{ y = fibTBB(n-2); });  // spawn another task
        g.wait();                        // wait for both tasks to complete
        return x+y;
    }
}
Content adapted from https://software.intel.com/sites/default/files/m/d/4/1/d/8/1-6-AppThr - Using Tasks Instead of Threads.pdf
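TBB may not be installed everywhere; the same spawn/wait pattern can be sketched with standard C++ futures (fib_async and fib_serial are hypothetical stand-ins for fibTBB and fibSerial):

```cpp
#include <future>   // std::async, std::future

// Plain recursion below the cut-off, mirroring fibSerial in the slide.
static int fib_serial(int n) {
    return n < 2 ? n : fib_serial(n - 1) + fib_serial(n - 2);
}

// Fork-join: spawn one half as a task, compute the other inline, then join.
int fib_async(int n) {
    if (n < 10) return fib_serial(n);   // cut-off limits task overhead
    std::future<int> x =
        std::async(std::launch::async, fib_async, n - 1);  // spawn a task
    int y = fib_async(n - 2);           // this thread does the other half
    return x.get() + y;                 // wait for the spawned task
}
```

As with the task_group version, correctness does not depend on which thread runs the spawned half; the cut-off plays the same role as a TBB grain size.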
Using Tasks
Developers express the logical parallelism with tasks
Runtime library schedules these tasks on to internal pool of worker threads
Tasks are much lighter weight than threads. Hence it is possible to express parallelism at a much finer granularity.
Apart from a task interface, TBB also provides high-level algorithms that implement some of the most common task patterns, such as parallel_invoke, parallel_for, parallel_reduce, etc.
TBB Algorithms
parallel_for, parallel_for_each: load-balanced parallel execution of loop iterations where iterations are independent
parallel_reduce: load-balanced parallel execution of independent loop iterations that perform reduction (e.g. summation of array elements)
parallel_scan: load-balanced computation of parallel prefix
parallel_do: load-balanced parallel execution of independent loop iterations with ability to add more work during its execution
parallel_sort: parallel sort
parallel_invoke: parallel execution of function objects or pointers tofunctions
parallel_for
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

template <typename Range, typename Func>
Func parallel_for(const Range& range, const Func& f
                  [, task_group_context& group]);

template <typename Index, typename Func>
Func parallel_for(Index first, Index last [, Index step],
                  const Func& f [, task_group_context& group]);

The template function parallel_for recursively divides the loop range
Partitions original range into subranges, and deals out subranges to worker threads in a way that:
balances load
uses cache efficiently
scales
Loop Splitting
blocked_range<T> is a splittable type representing a 1D iteration space over type T
Similarly blocked_range2d<T> for 2D block splitting
Separate loop body as an object or lambda expression
void ParallelDecrement(float* a, size_t n) {
    parallel_for(blocked_range<size_t>(0, n),
        [=](const blocked_range<size_t>& r) {
            for (size_t i = r.begin(); i != r.end(); ++i)
                a[i]--;
        }
    );
}
An Example using parallel_for
Independent iterations and fixed/known bounds
Serial code:
const int N = 100000;
void change_array(float *array, int M) {
    for (int i = 0; i < M; i++) {
        array[i] *= 2;
    }
}
int main() {
    float A[N];
    initialize_array(A);
    change_array(A, N);
    return 0;
}
An Example using parallel_for
Using parallel_for:
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

using namespace tbb;

void parallel_change_array(float* array, size_t M) {
    parallel_for(blocked_range<size_t>(0, M, IdealGrainSize),
        [=](const blocked_range<size_t>& r) -> void {
            for (size_t i = r.begin(); i != r.end(); i++)
                array[i] *= 2;
        }
    );
}
Generic Programming vs Lambda functions
Generic Programming:
class ChangeArrayBody {
    float *array;
public:
    ChangeArrayBody(float *a): array(a) {}
    void operator()(const blocked_range<size_t>& r) const {
        for (size_t i = r.begin(); i != r.end(); i++) {
            array[i] *= 2;
        }
    }
};
void parallel_change_array(float *array, size_t M) {
    parallel_for(blocked_range<size_t>(0, M, IdealGrainSize),
                 ChangeArrayBody(array));
}
Lambda functions:
void parallel_change_array(float *array, size_t M) {
    parallel_for(blocked_range<size_t>(0, M, IdealGrainSize),
        [=](const blocked_range<size_t>& r) -> void {
            for (size_t i = r.begin(); i != r.end(); i++)
                array[i] *= 2;
        });
}
Mutual Exclusion in TBB
Multiple tasks computing the minimum value in an array using parallel_for:
void ParallelMin(int* a, int n) {
    parallel_for(blocked_range<int>(0, n),
        [=](const blocked_range<int>& r) {
            for (int i = r.begin(); i != r.end(); ++i)
                if (a[i] < min) min = a[i];   // unsynchronized update of shared min: a data race!
        }
    );
}
Mutex Flavours
TBB provides several flavours of mutex:
spin_mutex: non-scalable, unfair, fast
queuing_mutex: scalable, fair, slower
spin_rw_mutex, queuing_rw_mutex: as above, with reader locks
mutex and recursive_mutex: wrappers around native implementation (e.g. Pthreads)
Avoid locks wherever possible
Mutual Exclusion in TBB
Scoped mutex inside critical section
typedef spin_mutex ReductionMutex;
ReductionMutex minMutex;

void ParallelMin(int* a, int n) {
    parallel_for(blocked_range<int>(0, n),
        [=](const blocked_range<int>& r) {
            for (int i = r.begin(); i != r.end(); ++i) {
                ReductionMutex::scoped_lock lock(minMutex);
                if (a[i] < min) min = a[i];
            }
        }
    );
}
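In line with "avoid locks wherever possible", this reduction can be done without a mutex by keeping the running minimum in an atomic and retrying with compare-exchange; a sketch using plain std::thread in place of parallel_for (atomic_min is a hypothetical name, not a TBB API):

```cpp
#include <algorithm>
#include <atomic>
#include <limits>
#include <thread>
#include <vector>

// Lock-free minimum: each thread scans a chunk and publishes candidates
// with a compare-exchange retry loop instead of a critical section.
int atomic_min(const int* a, int n, int nthreads = 4) {
    std::atomic<int> m(std::numeric_limits<int>::max());
    auto worker = [&](int lo, int hi) {
        for (int i = lo; i < hi; ++i) {
            int cur = m.load(std::memory_order_relaxed);
            // retry while a[i] is still smaller than the published minimum;
            // on failure, compare_exchange_weak reloads cur for us
            while (a[i] < cur && !m.compare_exchange_weak(cur, a[i])) {}
        }
    };
    std::vector<std::thread> ts;
    int chunk = (n + nthreads - 1) / nthreads;
    for (int t = 0; t < nthreads; ++t)
        ts.emplace_back(worker, t * chunk, std::min(n, (t + 1) * chunk));
    for (auto& th : ts) th.join();
    return m.load();
}
```

Under contention every update is still an atomic, but uncontended reads of an already-smaller minimum cost no atomic at all.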
parallel_reduce in TBB
#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>

template <typename Range, typename Value,
          typename Func, typename ReductionFunc>
Value parallel_reduce(const Range& range, const Value& identity,
                      const Func& func, const ReductionFunc& reductionFunc
                      [, partitioner [, task_group_context& group]]);

parallel_reduce partitions the original range into subranges like parallel_for
The function Func is applied on these subranges; the returned result is then merged with the others (or identity if there is none) using the function reductionFunc.
parallel_reduce in TBB - Serial Example
#include <limits>

// Find index of smallest element in a[0...n-1]
size_t serialMinIndex(const float a[], size_t n) {
    float value_of_min = numeric_limits<float>::max();
    size_t index_of_min = 0;
    for (size_t i = 0; i < n; ++i) {
        float value = a[i];
        if (value < value_of_min) {
            value_of_min = value;
            index_of_min = i;
        }
    }
    return index_of_min;
}
parallel_reduce in TBB - Parallel Version
#include <limits>
#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>

size_t parallelMinIndex(const float a[], size_t n) {
    return parallel_reduce(
        blocked_range<size_t>(0, n, 10000),
        size_t(0),
        [=](const blocked_range<size_t>& r, size_t index_of_min) -> size_t {
            float value_of_min = a[index_of_min];
            for (size_t i = r.begin(); i != r.end(); ++i) {
                float value = a[i];
                if (value < value_of_min) {
                    value_of_min = value;   // accumulate result
                    index_of_min = i;
                }
            }
            return index_of_min;
        },
        [=](size_t i1, size_t i2) {
            return (a[i1] < a[i2]) ? i1 : i2;   // reduction operator
        }
    );
}
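The same split/merge structure can be sketched without TBB using std::async: each half returns a local min index (the role of Func) and the two are merged with the comparison (the role of ReductionFunc). min_index_par and min_index_range are hypothetical stand-ins for parallelMinIndex:

```cpp
#include <cstddef>
#include <future>

// Func: serial scan of one subrange, threading through the best index so far.
static size_t min_index_range(const float a[], size_t lo, size_t hi, size_t best) {
    for (size_t i = lo; i < hi; ++i)
        if (a[i] < a[best]) best = i;
    return best;
}

// Recursive range split; the final comparison is the reduction operator.
size_t min_index_par(const float a[], size_t lo, size_t hi, size_t grain) {
    if (hi - lo <= grain)
        return min_index_range(a, lo, hi, lo);
    size_t mid = lo + (hi - lo) / 2;
    auto left = std::async(std::launch::async,
                           min_index_par, a, lo, mid, grain);  // left half as a task
    size_t r = min_index_par(a, mid, hi, grain);               // right half inline
    size_t l = left.get();
    return (a[l] < a[r]) ? l : r;                              // merge subresults
}
```

grain plays the role of the blocked_range grain size (10000 in the slide).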
Hands-on Exercise: Programming with TBB
Objective:
Implement a parallel heat stencil application using TBB and profile it using VTune
Lock Free Synchronization
Outline
1 NUMA systems
2 Profiling Codes
3 Intel TBB
4 Lock Free Synchronization
5 Transactional Memory
‘Lock-free’ Data Structures: Motivation
consider the atomic test-and-set operation:
atomic int testAndSet(volatile int *Lock) { int lv = *Lock; *Lock = 1; return lv; }
synchronizes the whole memory system (down to LLC), costs ≈ 50 cycles, degrades memory access progress for all
not scalable: energy and time costs are O(N^2) (N is number of cores)
mutual exclusion via test-and-set can be modelled by:
volatile int lock = 0; // 0..1; 1=locked, 0=unlocked

// thread 0
while (1) {
    // non-critical section 0
    ...
    while (testAndSet(&lock) != 0)
        { /*spin*/ }
    // critical section 0
    ...
    lock = 0;
}

// thread 1
while (1) {
    // non-critical section 1
    ...
    while (testAndSet(&lock) != 0)
        { /*spin*/ }
    // critical section 1
    ...
    lock = 0;
}

with p threads under heavy contention (time in non-critical becomes small), there will be O(p^2) atomics!! Can we do better?
Improving Locked Access to Data Structures
only try atomics when we see lock is free: reduces #atomics to O(p)
do {
    while (lock == 1) { /*spin*/ }
} while (testAndSet(&lock) != 0);
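This test-and-test-and-set pattern can be sketched with C++ atomics (ttas_demo is a hypothetical harness, not from the slides): the inner load spins in the local cache, and the atomic exchange plays the role of testAndSet only once the lock looks free:

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Increment a shared counter under a test-and-test-and-set spin lock and
// return the final count (nthreads * iters if mutual exclusion holds).
long ttas_demo(int nthreads, int iters) {
    std::atomic<int> lock_word(0);
    long counter = 0;                       // protected by lock_word
    auto lock = [&] {
        do {                                // spin on plain loads first...
            while (lock_word.load(std::memory_order_relaxed) == 1) {}
        } while (lock_word.exchange(1, std::memory_order_acquire) != 0);
    };                                      // ...atomic only when it looks free
    auto unlock = [&] { lock_word.store(0, std::memory_order_release); };
    std::vector<std::thread> ts;
    for (int t = 0; t < nthreads; ++t)
        ts.emplace_back([&] {
            for (int i = 0; i < iters; ++i) { lock(); ++counter; unlock(); }
        });
    for (auto& th : ts) th.join();
    return counter;
}
```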
use adaptive spin-locks: if lock is not acquired after a certain time, the thread yields
Pthreads mutexes should do this; especially useful if p > N
use algorithms requiring a (small) constant number of atomic operations per access of the critical region
can be a general mutex algorithm, or data structure-specific
better still (possibly), use algorithms using no atomic operations
Lamport’s Bakery Algorithm: still has O(p^2) costs, assumes order of updates made by one thread is seen by other threads
Atomic Operation-based Bakery Algorithm
analogous to the protocol of buying seafood at Woolworths
uses atomic int fetchAndIncrement(volatile int *v) { int old = *v; (*v)++; return old; }
initialize: ticketNumber = screenNumber = 0;
acquire: myTicket = fetchAndIncrement(&ticketNumber);
wait until (myTicket == screenNumber)
release: screenNumber++; (does not need to be atomic)
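The acquire/release steps above map directly onto a ticket lock with C++ atomics; fetch_add plays fetchAndIncrement (TicketLock and ticket_demo are hypothetical names for this sketch):

```cpp
#include <atomic>
#include <thread>
#include <vector>

struct TicketLock {
    std::atomic<unsigned> ticketNumber{0};  // next ticket to hand out
    std::atomic<unsigned> screenNumber{0};  // ticket now being served
    void acquire() {
        unsigned myTicket = ticketNumber.fetch_add(1);   // one atomic per acquire
        while (screenNumber.load(std::memory_order_acquire) != myTicket) {}
    }
    void release() {                        // only the holder writes, so a
        screenNumber.store(                 // plain increment suffices
            screenNumber.load(std::memory_order_relaxed) + 1,
            std::memory_order_release);
    }
};

// FIFO mutual exclusion check: final count is nthreads * iters.
long ticket_demo(int nthreads, int iters) {
    TicketLock lk;
    long counter = 0;
    std::vector<std::thread> ts;
    for (int t = 0; t < nthreads; ++t)
        ts.emplace_back([&] {
            for (int i = 0; i < iters; ++i) { lk.acquire(); ++counter; lk.release(); }
        });
    for (auto& th : ts) th.join();
    return counter;
}
```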
performance: good if H/W has direct support of fetchAndIncrement(), else may require unbounded number of compare-and-swaps
problems under contention: large amount of cache line invalidations due to all threads accessing ‘scoreboard’ variable screenNumber
can we restrict the # threads accessing their ‘scoreboard’ to just 2?
MCS lock: Woolworths analogy: instead of watching a screen, the person just ahead tells you when they are served
requires atomic dual word copy on lock acquire and atomic compare-and-swap on release
CLH lock: requires only a single fetch-and-store atomic
Synchronization: Barriers
wait for all to arrive at same point in computation before proceeding (none may leave until all have arrived)
central barrier with p threads (initialize globals ctr to p, sense to 0; each thread initializes its own threadSense to 0)
threadSense = !threadSense;
if (fetchAndDecrement(&ctr) == 1)
    ctr = p, sense = !sense;   // last to reach toggles global sense
wait until (sense == threadSense)
sense is required for repeated barriers
caution: deadlock occurs if 1 thread does not participate!
problems:
most processors do not support atomic decrement of a memory location
not (very) scalable: p atomic decrements per barrier
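The central barrier steps can be sketched with C++ atomics; fetch_sub plays fetchAndDecrement, and the test checks that no thread leaves round r before all p arrivals (CentralBarrier and barrier_demo are hypothetical names):

```cpp
#include <atomic>
#include <thread>
#include <vector>

struct CentralBarrier {
    std::atomic<int> ctr;        // arrivals outstanding this round
    std::atomic<int> sense{0};   // global sense flag
    const int p;
    explicit CentralBarrier(int nthreads) : ctr(nthreads), p(nthreads) {}
    void wait(int& threadSense) {
        threadSense = !threadSense;
        if (ctr.fetch_sub(1) == 1) {   // last to reach:
            ctr.store(p);              // reset counter first,
            sense.store(threadSense);  // then toggle global sense (releases all)
        } else {
            while (sense.load() != threadSense) { /* spin */ }
        }
    }
};

// Each round, every thread bumps phase_sum then waits; after the barrier all
// p contributions of that round must be visible to every thread.
bool barrier_demo(int p, int rounds) {
    CentralBarrier b(p);
    std::atomic<int> phase_sum{0};
    std::atomic<bool> ok{true};
    std::vector<std::thread> ts;
    for (int t = 0; t < p; ++t)
        ts.emplace_back([&] {
            int threadSense = 0;
            for (int r = 0; r < rounds; ++r) {
                phase_sum.fetch_add(1);
                b.wait(threadSense);
                if (phase_sum.load() < p * (r + 1)) ok = false;
            }
        });
    for (auto& th : ts) th.join();
    return ok && phase_sum.load() == p * rounds;
}
```

Resetting ctr before flipping sense is what makes the barrier safely reusable for the next round.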
Synchronization: Combining Tree Barrier
each pair of threads points to a leaf node in a tree
each node has a ctr (init. to 2), and a sense flag (init. to 0)
algorithm: begins on all leaf nodes (each thread has threadSense=1 flag):
if (fetchAndDecrement(&ctr) == 1)
    call algorithm on parent node
ctr = 2, sense = !sense
wait until threadSense == sense
then, as leaving the barrier, threadSense = !threadSense
notes:
last thread to reach each node continues up the tree
the thread that reaches root begins the ‘wakeup’ (reversing sense)
upon wakeup, a thread releases its siblings at each node along path
performance: 2× the atomic operations, but can distribute memory locations of node data (e.g. across different LLC banks)
atomics can be avoided by the scalable tree barrier! (simply replace ctr with 2 adjacent byte flags, one for each thread)
Single Reader, Single Writer Bounded Buffer
Strategy: simply put the burden of condition synchronization (retry) on the client
class LFBoundedBuffer {
    int N, in, out, *buf;
public:
    LFBoundedBuffer(int Nc) { N = Nc; in = out = 0; buf = new int[N]; }
    bool put(int v) {
        int inNxt = (in+1) % N;
        if (inNxt == out)   // buffer full
            return false;
        buf[in] = v; in = inNxt;
        return true;
    }
    bool get(int *v) {
        if (in == out)      // buffer empty
            return false;
        *v = buf[out]; out = (out+1) % N;
        return true;
    }
};
Why does this work? Threads do not operate on the same variable. Consider N=2. Need to assume updates on in/out don’t overtake the add/remove from buf on other thread.
Unbounded: use linked lists, reader advances a read pointer, leaves writer to remove previously read nodes
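The "don't overtake" assumption can be made explicit with acquire/release atomics on in and out; a single-producer/single-consumer sketch (SpscBuffer and spsc_demo are hypothetical names):

```cpp
#include <atomic>
#include <thread>

template <int N>
class SpscBuffer {
    int buf[N];
    std::atomic<int> in{0}, out{0};
public:
    bool put(int v) {                               // writer thread only
        int i = in.load(std::memory_order_relaxed);
        int nxt = (i + 1) % N;
        if (nxt == out.load(std::memory_order_acquire)) return false; // full
        buf[i] = v;
        in.store(nxt, std::memory_order_release);   // publish after the write
        return true;
    }
    bool get(int* v) {                              // reader thread only
        int o = out.load(std::memory_order_relaxed);
        if (o == in.load(std::memory_order_acquire)) return false;    // empty
        *v = buf[o];
        out.store((o + 1) % N, std::memory_order_release);
        return true;
    }
};

// Push 1..items through a 4-slot buffer; the reader sums what it receives.
long spsc_demo(int items) {
    SpscBuffer<4> q;
    long sum = 0;
    std::thread writer([&] {
        for (int i = 1; i <= items; ++i)
            while (!q.put(i)) { /* retry: full */ }
    });
    std::thread reader([&] {
        int v, got = 0;
        while (got < items)
            if (q.get(&v)) { sum += v; ++got; }
    });
    writer.join(); reader.join();
    return sum;    // items*(items+1)/2 if nothing was lost or duplicated
}
```

The release store on in guarantees the buf[i] write is visible before the reader can observe the new index, which is exactly the ordering assumption the slide states informally.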
‘Lock-Free’ Stack
class LFStack {
    int N, top, nPop, *buf;
public:
    LFStack(int Nc) { N = Nc; nPop = top = 0; buf = new int[N]; }
    bool push(int v) {
        while (1) {
            int oldTop = top;
            if (oldTop == N) return false;   // caller must try again
            if (dcas(&top, oldTop, oldTop+1,
                     &buf[oldTop], buf[oldTop], v))
                return true;                 // oldTop remained a valid stack top
        }
    }
    bool pop(int *v) {
        while (1) {
            int oldTop = top, oldNPop = nPop;
            if (oldTop == 0) return false;   // caller must try again
            *v = buf[oldTop-1];
            if (dcas(&top, oldTop, oldTop-1,
                     &nPop, oldNPop, oldNPop+1))
                return true;                 // oldTop remained a valid stack top
        }
    }
}; // note: dcas(&top, oldTop, oldTop-1, v, *v, buf[oldTop]) won’t work!
dcas() is a ‘double’ compare-and-swap operation:
in this case we use the op on the 1st word to ensure safety, and the 2nd to ensure the data is added / removed atomically
Summary: Lock Free Synchronization
Use fine-grained locking to reduce contention in operations on shared data structures
Non-blocking solutions avoid the overheads due to locks, but still require appropriate memory fence operations
Lock free design does not eliminate contention
Transactional Memory
Outline
1 NUMA systems
2 Profiling Codes
3 Intel TBB
4 Lock Free Synchronization
5 Transactional Memory
Transactional Memory - Motivations
or, problems with lock-based solutions:
convoying: a thread holding a lock is de-scheduled, possibly holding up other threads waiting for a lock
priority inversion: lower-priority thread is pre-empted while holding a lock needed by a higher priority thread
deadlock: two threads attempt to acquire the same locks but do so in different order
inherently unsafe: relationship between data objects and locks is entirely by convention
lacks the composition property: difficult to combine 2 concurrent operations using locks into a single larger operation
pessimistic: in low-contention situations, their protection is only needed < 1% of the time!
recall overheads of acquiring a lock (even if free)
Transactions - Basic Ideas
a transaction is a series of reads and writes (to global data) made by a single thread which execute atomically, i.e.:
if it succeeds, no other thread sees any of the writes until it completes
if it fails, no change to (global) data can be observed
it must also execute with consistency:
the thread does not see any (interfering) writes from other threads
it has the following steps:
begin; reads and writes to (transactional) variables; end
if consistency is met, the end results in a commit; otherwise it aborts
a condition that meets this is serializability:
if transactions cannot execute concurrently, consistency is assured
note: this assumes that (transactional) variables are only written (in some cases, read) by transactions
can also achieve this if no other thread accesses the affected variables during the transaction
Transactions - Programming Model
e.g. to add an item newItem to a queue with tail pointer tail:
atomic {
    struct node *newTail = malloc(sizeof(struct node));
    newTail->item = newItem;
    newTail->next = tail;
    // tail must not be changed between these 2 points!
    tail = newTail;
}
here, tail is the single read and write transactional variable
the atomic block is executed atomically (it can be assumed that the underlying system will re-execute it until it can commit (what about the malloc?))
e.g. code from the Unstructured Adaptive benchmark of the NAS suite:
for (ije1 = 0; ije1 < nnje; ije1++) { ...
    for (ije2 = 1; ije2 < nnje; ije2++)
        for (col = 1-shift; col < lx1-shift; col++) {
            ig = idmo[v_end[ije2], col, ije1, ije2, iface, ie];
            #pragma omp atomic
            tmor[ig] += temp[v_end[ije2],col,ije1] * 0.5;
            ... }
}
Limited Compare-and-Swap Implementation
the atomic compare-and-swap operation may be expressed as:
typedef unsigned long uint64;
atomic uint64 cas(uint64 *x, uint64 x_old, uint64 x_new) {
    uint64 x_now = *x;
    if (x_now == x_old) *x = x_new;
    return (x_now);
}
(1st 2 lines are implemented by a single cas instruction)
we can implement the (essentially atomic part of) the code for queue insertion by:
do {
    newTail->next = tail;
} while (cas(&tail, newTail->next, newTail) != newTail->next);
and for the Unstructured Adaptive benchmark:
do {
    x_old = tmor[ig];
    x_new = temp[v_end[ije2],col,ije1] * 0.5 + x_old;
} while (cas(&tmor[ig], x_old, x_new) != x_old);
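The cas() retry loop for tmor[ig] can be reproduced with C++ atomics; this sketch (atomic_add and accumulate_demo are hypothetical names, not the NAS code itself) adds into a shared double via compare_exchange_weak:

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Retry until our read of x is still current when we try to install x_old+inc;
// on failure compare_exchange_weak reloads x_old and we go around again.
void atomic_add(std::atomic<double>& x, double inc) {
    double x_old = x.load(std::memory_order_relaxed);
    while (!x.compare_exchange_weak(x_old, x_old + inc)) {}
}

// nthreads * iters additions of 0.5 (exactly representable in binary
// floating point, so the result is deterministic despite the interleaving).
double accumulate_demo(int nthreads, int iters) {
    std::atomic<double> total{0.0};
    std::vector<std::thread> ts;
    for (int t = 0; t < nthreads; ++t)
        ts.emplace_back([&] {
            for (int i = 0; i < iters; ++i) atomic_add(total, 0.5);
        });
    for (auto& th : ts) th.join();
    return total.load();
}
```

Like the slide's loop, each update costs one atomic when uncontended but an unbounded number of retries under heavy contention.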
note: recent OpenMP runtime systems implemented atomic sections using a global lock (why does this perform abysmally?)
Hardware TM: Basis on Coherency Protocols
idea: tentative values can be held in cache lines (assuming each processor has a separate cache)
review the standard MESI cache coherency protocol
cache line states:
Modified: only this cache holds the data; it is ‘dirty’
Exclusive: as before, but not ‘dirty’
Shared: > 1 caches hold data, not ‘dirty’
Invalid: this line holds no data
key transitions:
(a) A: load x (miss; then line in E state)
(b) B: load x (miss; then lines are in S state (A and B))
(c) B: store x (hit; then lines in M state (B) and I state (A))
(d) A: load x (miss; then B copies back line to memory and lines in S state (A and B))
when a line is evicted (due to a conflict with a new address needing to be loaded), the data is copied-back to memory
Computer Systems (ANU) Parallel Perf Optimization Feb 13, 2020 57 / 63
Transactional Memory
Hardware Transactional Memory (Bounded)
add a transactional bit (T) for the state of each cache line
T=1 if entry is placed in cache on behalf of a transaction (o.w. 0)
simply extend the MESI protocol as follows:
if a line with T=1 is invalidated or evicted, abort (with no copy-back)
note: can record fact that transaction will later abort instead
abort causes lines with T=1 to be invalidated (no copy-back, if dirty)
a commit requires all dirty lines with T=1 to be written atomically to memory (T bit is cleared)
requires transactional variants of load and store instructions, plus an (attempt to) commit instruction (which clears all T bits)
why this works:
if a line with T=1 is invalidated, we have a R-W or W-W conflict
if a line with T=1 is evicted, the transaction cannot complete
note: size of transaction is limited by cache size (in practice, smaller)
upon abort, hardware needs to indicate whether to retry (synch. conflict) or not (error or SW-detected resource exhaustion)
HTM in the Intel Haswell/Broadwell
has new instructions to begin, end and abort a transaction
xbegin addr: address of retry loop, in case of abort
xend: all loads/stores in between are transactional; and xabort
no guarantee of progress! (indefinitely repeated aborts)
e.g. xbegin; load x; load y; store z;
cache state after the above (columns: T flag, MESI state, tag):
0 S u | 1 E x | 1 S y | 0 M v | 1 M z | 0 I -
if x’s line becomes invalidated (remote core writes to x) before xend, abort; transactional lines are dropped:
0 S u | 0 I x | 0 I y | 0 M v | 0 I z | 0 I -
if v’s line becomes invalidated (remote core writes to v) before xend, commit (v is not transactional); T bits are cleared:
0 S u | 0 E x | 0 S y | 0 M v | 0 E z | 0 I -
Uses of HTM
may be used to create custom atomic operations
e.g. the atomic double compare-and-swap operation may be expressed as:
typedef unsigned long uint64;
int dcas(uint64 *x, uint64 x_old, uint64 x_new,
         uint64 *y, uint64 y_old, uint64 y_new) {
    int swapped = 0;
    xbegin abortaddr;
    uint64 x_now = *x, y_now = *y;
    if (x_now == x_old && y_now == y_old)
        *x = x_new, *y = y_new, swapped = 1;
    xend;
    return (swapped);   // success
abortaddr:
    return (0);         // failure
}
note: x86 provides a cmpxchg16b instruction which can do the above if x and y are in consecutive memory locations
Summary: Transactional Memory
Atomic construct: aims to increase simplicity of synchronization without significantly sacrificing performance
Implementation: many variants that differ in versioning policy (eager/lazy), conflict detection (pessimistic/optimistic), detection granularity
Hardware transactional memory: versioned data kept in caches, conflict detection as part of coherence protocol
Hands-on Exercise: Lock and Barrier Performance
Objective:
Understand performance of atomic operations in various implementations of locks and barriers
Summary
Topics covered today - Parallel Performance Optimization:
Non-uniform memory access hardware
Profiling using VTune
Intel TBB
Lock free data structures and transactional memory
Tomorrow - Parallel Software Design!