
Parallel Performance Optimization
ASD Shared Memory HPC Workshop

Computer Systems Group, ANU

Research School of Computer Science
Australian National University

Canberra, Australia

February 13, 2020


Outline - Day 4

1 NUMA systems

2 Profiling Codes

3 Intel TBB

4 Lock Free Synchronization

5 Transactional Memory


NUMA systems

Non-Uniform Memory Access: Basic Ideas

NUMA means there is some hierarchy in the main memory system's structure

all memory is available to the programmer (single address space), but some memory takes longer to access than other memory

modular memory systems with interconnects: UMA vs NUMA

[Diagrams: UMA — all cores (each with its own cache) reach a row of memory modules through one shared interconnect, so every access costs the same; NUMA — each group of cores has its own local memory, and the groups communicate over the interconnect, so remote accesses cost more]

on a NUMA system, there are two effects that may be important:

thread affinity: once a thread is assigned to a core, ensure that it stays there

NUMA affinity: memory used by a process must only be allocated on the socket of the core that it is bound to


Examples of NUMA Configurations

Intel Xeon 5500, with QPI (courtesy qdpma.com)

4-socket Opteron; note extra NUMA level within a socket! (courtesy qdpma.com)


Examples of NUMA Configurations (II)

8-way ‘glueless’ system (processors are directly connected)(courtesy qdpma.com)


Case Study: Why NUMA Matters

MetUM global atmosphere model, 1024 × 769 × 70 grid on an NCI Intel X5570 / InfiniBand supercomputer (2011): Effect of Process and NUMA Affinity on Scaling

note differing values for t16!

on the X5570, local:remote memory access is 65:105 cycles

indicates a significant amount of L3$ misses


Case Study: Why NUMA Matters (II)

Time breakdown for no NUMA affinity, 1024 processes (dual socket nodes, 4 cores per socket)

Note spikes in compute times were always in groups of 4 processes (e.g. socket 0)


Process and Thread Affinity

in general, the OS is free to decide which core (virtual CPU) a process or thread (next) runs on

we can restrict which CPUs it will run on by specifying an affinity mask of the CPU ids it may be scheduled to run on

this has 2 benefits (assuming other active processes/threads are excluded from the specified CPUs):

ensure maximum speed for that process/thread

minimize cache / TLB pollution caused by context switches

e.g. on an 8-CPU system, create 8 threads to run on different CPUs:

pthread_t threadHandle[8]; cpu_set_t cpu;
for (int i = 0; i < 8; i++) {
    pthread_create(&threadHandle[i], NULL, threadFunc, NULL);
    CPU_ZERO(&cpu); CPU_SET(i, &cpu);
    pthread_setaffinity_np(threadHandle[i], sizeof(cpu_set_t), &cpu);
}

for a process, it is similar:

sched_setaffinity(getpid(), sizeof(cpu_set_t), &cpu);


NUMActl: Controlling NUMA from the Shell

on a NUMA system, we generally wish to bind a process and its memory image to a particular 'node' (= NUMA domain)

the NUMA API provides a way of controlling policies of memory allocation on a per node or per process basis

policies are default, bind, interleave, preferred

run a program on a CPU on node 0, with all memory allocated on node 0:

numactl --membind=0 --cpunodebind=0 ./prog -args

similar, but force it to be run on CPU 0 (which must be on node 0):

numactl --physcpubind=0 --membind=0 ./prog -args

optimize bandwidth for a crucial program to utilize multiple memory controllers (at the expense of other processes!):

numactl --interleave=all ./memhog ...

numactl --hardware shows the available nodes, their CPUs and memory sizes, and the inter-node distances

LibNUMA: Controlling NUMA from within a Program

with libnuma, we can similarly change (the current thread of) an executing process's node affinity and memory allocation policy

run from now on on a CPU on node 0, with all memory allocated on node 0:

numa_run_on_node(0);
numa_set_preferred(0);

or:

nodemask_t mask;
nodemask_zero(&mask); nodemask_set(&mask, 0);
numa_bind(&mask);

to allow it to run on all nodes again:

numa_run_on_node_mask(&numa_all_nodes);

execute a memory hogging function, with all its (new) memory fully interleaved, and then restore the previous state:

nodemask_t prevmask = numa_get_interleave_mask();
numa_set_interleave_mask(&numa_all_nodes);
memhog(...);
numa_set_interleave_mask(&prevmask);


Hands-on Exercise: NUMA Effects

Objective:

Explore the effects of Non-Uniform Memory Access (NUMA), that is, the general benefit of ensuring a process and its memory are in the same NUMA domain
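As a concrete starting point, a minimal libnuma sketch (not from the workshop material; assumes at least 2 NUMA nodes and linking with -lnuma) that times a local versus a remote sweep over a large array:

#include <numa.h>
#include <stdio.h>
#include <time.h>

#define N (1L << 26)   /* 64 Mi ints: large enough to defeat the caches */

/* run on cpuNode, then allocate and sweep an array whose pages live on memNode */
static double sweep(int cpuNode, int memNode) {
    numa_run_on_node(cpuNode);                      /* bind this thread */
    int *buf = numa_alloc_onnode(N * sizeof(int), memNode);
    for (long i = 0; i < N; i++) buf[i] = (int)i;   /* touch every page */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    volatile long sum = 0;
    for (long i = 0; i < N; i++) sum += buf[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    numa_free(buf, N * sizeof(int));
    return (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
}

int main(void) {
    if (numa_available() < 0) { printf("no NUMA support\n"); return 1; }
    printf("local:  %.3f s\n", sweep(0, 0));
    printf("remote: %.3f s\n", sweep(0, 1));
    return 0;
}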


Profiling Codes

Profiling: Basics

Profiling is the process of recording information during execution of a program to form an aggregate view of its dynamic behaviour

Compare with tracing, which records an ordered log of events that can be used to reconstruct dynamic behaviour

Used to understand program performance and find bottlenecks

At certain points in execution, record program state (instruction pointer, calling context, hardware performance counters, ...)

Sampling (recurrent event trigger) vs. instrumentation (probes at specific points in program)

Real vs. simulated execution


Sampling

[Figure: a recurrent trigger (e.g. a timer interrupt every 10 ms) samples the CPU's program counter and hardware counters (cycles, cache misses, flops); the counts are accumulated against the currently executing function (Main, Asterix, Obelix) in a function table]

When the event trigger occurs, record the instruction pointer (+ call context) and performance counters: low overhead, but subject to sampling error

Source: EuroMPI'12: Introduction to Performance Engineering


Instrumentation

[Figure: monitoring probes are injected at each function's entry and exit, e.g.

Function Obelix (...)
    call monitor("Obelix", "enter")
    ...
    call monitor("Obelix", "exit")
end Obelix

where monitor() reads the hardware counters and accumulates per-function totals in a function table]

Inject 'trampolines' into source or binary code: accurate, but higher overhead

Source: EuroMPI'12: Introduction to Performance Engineering


The 80/20 Rule & Life Cycle

Programs typically spend 80% of their time in 20% of the code

Programmers typically spend 20% of their effort to get 80% of the possible speedup → optimize for the common case

[Figure: the performance analysis process is an iterative cycle — coding, then measurement, analysis, ranking of problems, and refinement, feeding back into coding — repeated during program tuning until the code is ready for production]

Source: EuroMPI'12: Introduction to Performance Engineering


perf and VTune

perf is a profiler tool for Linux 2.6+ based systems that abstracts away CPU hardware differences in Linux performance measurements and presents a simple command-line interface. perf is based on the perf_events interface exported by recent versions of the Linux kernel.

Intel's VTune is a commercial-grade profiling tool for complex applications, usable via the command line or a GUI.


perf Reference Material

http://www.brendangregg.com/perf.html

Perf for User-Space Program Analysis

Perf Wiki


VTune Reference Material

https://software.intel.com/en-us/intel-vtune-amplifier-xe

Documentation URL


perf for Linux

perf is both a kernel syscall interface and a collection of tools to collect, analyze and present hardware performance counter data, either via counting or sampling
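Typical usage, as a sketch (counting with perf stat, sampling with perf record):

% perf stat ./prog                                  # count events over the whole run
% perf stat -e cache-misses,cache-references ./prog
% perf record -g ./prog                             # sample, recording call graphs
% perf report                                       # browse where the samples landed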


% perf list

branch-instructions OR branches [Hardware event]

branch-misses [Hardware event]

bus-cycles [Hardware event]

cache-misses [Hardware event]

cache-references [Hardware event]

cpu-cycles OR cycles [Hardware event]

instructions [Hardware event]

ref-cycles [Hardware event]

...

branch-load-misses [Hardware cache event]

branch-loads [Hardware cache event]

dTLB-load-misses [Hardware cache event]

dTLB-loads [Hardware cache event]

dTLB-store-misses [Hardware cache event]

dTLB-stores [Hardware cache event]

iTLB-load-misses [Hardware cache event]

iTLB-loads [Hardware cache event]


Hands-on Exercise: Perf and VTune

Objective:

Use perf to measure performance of matrix multiply code

Use VTune to both measure and improve code performance


Intel TBB

Intel Threading Building Blocks (TBB)

Template library extending C++ for parallelism using tasks

Focus on divide-and-conquer algorithms

Thread-safe data structures

Work stealing scheduler

Efficient low-level atomic operations

Scalable memory allocation

Free software under GPLv2

Content adapted from https://software.intel.com/sites/default/files/IntelAcademic Parallel 08 TBB.pdf
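For instance, the scalable memory allocator mentioned above can be dropped into standard containers; a minimal sketch (not from the slides):

#include <vector>
#include <tbb/scalable_allocator.h>

// a std::vector whose storage comes from TBB's thread-scalable allocator,
// reducing contention on the global heap under multithreaded allocation
std::vector<int, tbb::scalable_allocator<int>> v;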


Using TBB - Task Based Approach

TBB provides C++ constructs that allow you to express parallel solutions in terms of task objects

Task scheduler manages thread pool

Task scheduler avoids common performance problems of programming with threads:

Oversubscription - one scheduler thread per hardware thread
Fair scheduling - non-preemptive unfair scheduling
High overhead - programmer specifies tasks, not threads
Load imbalance - work-stealing balances load


Using the Task Based Approach

Fibonacci calculation example: the function fibTBB calculates the nth Fibonacci number using a TBB task_group.

#include <tbb/task_group.h>

int fibTBB(int n) {
    if (n < 10) {
        return fibSerial(n);            // below the cut-off, recurse serially
    } else {
        int x, y;
        tbb::task_group g;
        g.run([&]{ x = fibTBB(n-1); }); // spawn a task
        g.run([&]{ y = fibTBB(n-2); }); // spawn another task
        g.wait();                       // wait for both tasks to complete
        return x + y;
    }
}

Content adapted from https://software.intel.com/sites/default/files/m/d/4/1/d/8/1-6-AppThr - Using Tasks Instead of Threads.pdf


Using Tasks

Developers express the logical parallelism with tasks

Runtime library schedules these tasks onto an internal pool of worker threads

Tasks are much lighter weight than threads, hence it is possible to express parallelism at a much finer granularity

Apart from the task interface, TBB also provides high-level algorithms that implement some of the most common task patterns, such as parallel_invoke (sketched below), parallel_for, parallel_reduce, etc.
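A minimal parallel_invoke sketch (not from the slides):

#include <tbb/parallel_invoke.h>

void example() {
    int a = 0, b = 0;
    tbb::parallel_invoke(
        [&]{ a = 1; },   // these two lambdas
        [&]{ b = 2; });  // may run in parallel
}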


TBB Algorithms

parallel_for, parallel_for_each: load-balanced parallel execution of loop iterations where iterations are independent

parallel_reduce: load-balanced parallel execution of independent loop iterations that perform a reduction (e.g. summation of array elements)

parallel_scan: load-balanced computation of a parallel prefix

parallel_do: load-balanced parallel execution of independent loop iterations, with the ability to add more work during its execution

parallel_sort: parallel sort

parallel_invoke: parallel execution of function objects or pointers to functions


parallel_for

#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

template <typename Range, typename Func>
Func parallel_for(const Range& range, const Func& f
                  [, task_group_context& group]);

template <typename Index, typename Func>
Func parallel_for(Index first, Index last [, Index step],
                  const Func& f [, task_group_context& group]);

Template function parallel_for recursively divides the loop's range

Partitions original range into subranges, and deals out subranges to worker threads in a way that:

Balances load
Uses cache efficiently
Scales


Loop Splitting

blocked_range<T> is a splittable type representing a 1D iteration space over type T

Similarly blocked_range2d<T> for 2D block splitting

Separate loop body as an object or lambda expression

void ParallelDecrement(float* a, size_t n) {
    parallel_for(blocked_range<size_t>(0, n),
        [=](const blocked_range<size_t>& r) {
            for (size_t i = r.begin(); i != r.end(); ++i)
                a[i]--;
        }
    );
}


An Example using parallel_for

Independent iterations and fixed/known bounds

Serial code:

const int N = 100000;

void change_array(float* array, int M) {
    for (int i = 0; i < M; i++) {
        array[i] *= 2;
    }
}

int main() {
    float A[N];
    initialize_array(A);
    change_array(A, N);
    return 0;
}


Parallel version using parallel_for:

#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

using namespace tbb;

void parallel_change_array(float* array, size_t M) {
    parallel_for(blocked_range<size_t>(0, M, IdealGrainSize),
        [=](const blocked_range<size_t>& r) -> void {
            for (size_t i = r.begin(); i != r.end(); i++)
                array[i] *= 2;
        }
    );
}


Generic Programming vs Lambda functions

Generic Programming:

class ChangeArrayBody {
    float *array;
public:
    ChangeArrayBody(float *a): array(a) {}
    void operator()(const blocked_range<size_t>& r) const {
        for (size_t i = r.begin(); i != r.end(); i++) {
            array[i] *= 2;
        }
    }
};

void parallel_change_array(float *array, size_t M) {
    parallel_for(blocked_range<size_t>(0, M, IdealGrainSize),
                 ChangeArrayBody(array));
}

Lambda functions:

void parallel_change_array(float *array, size_t M) {
    parallel_for(blocked_range<size_t>(0, M, IdealGrainSize),
        [=](const blocked_range<size_t>& r) -> void {
            for (size_t i = r.begin(); i != r.end(); i++)
                array[i] *= 2;
        });
}


Mutual Exclusion in TBB

Multiple tasks computing the minimum value in an array using parallel_for:

int min = INT_MAX;   // shared global

void ParallelMin(int* a, int n) {
    parallel_for(blocked_range<int>(0, n),
        [=](const blocked_range<int>& r) {
            for (int i = r.begin(); i != r.end(); ++i)
                if (a[i] < min) min = a[i];   // race: unsynchronized update of min!
        }
    );
}


Mutex Flavours

TBB provides several flavours of mutex:

spin_mutex: non-scalable, unfair, fast
queuing_mutex: scalable, fair, slower
spin_rw_mutex, queuing_rw_mutex: as above, with reader locks
mutex and recursive_mutex: wrappers around the native implementation (e.g. Pthreads)

Avoid locks wherever possible; where a lock would only guard a simple update, an atomic operation often suffices (see the sketch below)
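A lock-free running minimum via a compare-and-swap retry loop; a sketch using std::atomic (not from the slides):

#include <atomic>
#include <climits>

std::atomic<int> minVal{INT_MAX};

// atomically lower minVal to v if v is smaller
void updateMin(int v) {
    int cur = minVal.load();
    while (v < cur && !minVal.compare_exchange_weak(cur, v))
        ;   // on failure, cur is reloaded; retry while v is still smaller
}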


Mutual Exclusion in TBB (II)

Scoped mutex inside critical section

typedef spin_mutex ReductionMutex;
ReductionMutex minMutex;
int min = INT_MAX;

void ParallelMin(int* a, int n) {
    parallel_for(blocked_range<int>(0, n),
        [=](const blocked_range<int>& r) {
            for (int i = r.begin(); i != r.end(); ++i) {
                ReductionMutex::scoped_lock lock(minMutex);  // released at end of scope
                if (a[i] < min) min = a[i];
            }
        }
    );
}


parallel_reduce in TBB

#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>

template <typename Range, typename Value,
          typename Func, typename ReductionFunc>
Value parallel_reduce(const Range& range, const Value& identity,
                      const Func& func, const ReductionFunc& reductionFunc
                      [, partitioner [, task_group_context& group]]);

parallel_reduce partitions the original range into subranges, like parallel_for

The function func is applied on these subranges; each returned result is then merged with the others (or with identity if there is none) using the function reductionFunc


parallel_reduce in TBB - Serial Example

#include <limits>

// find index of smallest element in a[0...n-1]
size_t serialMinIndex(const float a[], size_t n) {
    float value_of_min = std::numeric_limits<float>::max();
    size_t index_of_min = 0;
    for (size_t i = 0; i < n; ++i) {
        float value = a[i];
        if (value < value_of_min) {
            value_of_min = value;
            index_of_min = i;
        }
    }
    return index_of_min;
}


parallel_reduce in TBB - Parallel Version

#include <limits>
#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>

size_t parallelMinIndex(const float a[], size_t n) {
    return parallel_reduce(
        blocked_range<size_t>(0, n, 10000),
        size_t(0),                                // identity
        [=](const blocked_range<size_t>& r, size_t index_of_min) -> size_t {
            float value_of_min = a[index_of_min];
            for (size_t i = r.begin(); i != r.end(); ++i) {
                float value = a[i];
                if (value < value_of_min) {
                    value_of_min = value;         // accumulate result
                    index_of_min = i;
                }
            }
            return index_of_min;
        },
        [=](size_t i1, size_t i2) {
            return (a[i1] < a[i2]) ? i1 : i2;     // reduction operator
        }
    );
}


Hands-on Exercise: Programming with TBB

Objective:

Implement a parallel heat stencil application using TBB and profile it using VTune
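A possible starting point, sketching one Jacobi sweep of a 2D heat stencil with parallel_for over rows (an illustration, not the workshop's provided code; grid dimensions nx × ny are assumed):

#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

// one sweep: each interior point becomes the average of its four neighbours
void heat_step(const float* oldT, float* newT, size_t nx, size_t ny) {
    tbb::parallel_for(tbb::blocked_range<size_t>(1, nx - 1),
        [=](const tbb::blocked_range<size_t>& r) {
            for (size_t i = r.begin(); i != r.end(); ++i)
                for (size_t j = 1; j < ny - 1; ++j)
                    newT[i*ny + j] = 0.25f * (oldT[(i-1)*ny + j] + oldT[(i+1)*ny + j]
                                            + oldT[i*ny + j - 1] + oldT[i*ny + j + 1]);
        });
}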


Lock Free Synchronization

‘Lock-free’ Data Structures: Motivation

consider the atomic test-and-set operation:

atomic int testAndSet(volatile int *Lock) { int lv = *Lock; *Lock = 1; return lv; }

synchronizes the whole memory system (down to the LLC), costs ≈ 50 cycles, and degrades memory access progress for everyone

not scalable: energy and time costs are O(N²) (N is the number of cores)

mutual exclusion via test-and-set can be modelled by (each of the two threads executes the same loop):

volatile int lock = 0; // 0..1; 1=locked, 0=unlocked

while (1) {
    // non-critical section
    ...
    while (testAndSet(&lock) != 0)
        { /*spin*/ }
    // critical section
    ...
    lock = 0;
}

with p threads under heavy contention (time in the non-critical section becomes small), there will be O(p²) atomics!! Can we do better?


Improving Locked Access to Data Structures

only try atomics when we see the lock is free (test-and-test-and-set): reduces the number of atomics to O(p)

do {
    while (lock == 1) { /*spin*/ }
} while (testAndSet(&lock) != 0);

use adaptive spin-locks: if the lock is not acquired after a certain time, the thread yields

Pthreads mutexes should do this; especially useful if p > N

use algorithms requiring a (small) constant number of atomic operations per access of the critical region

can be a general mutex algorithm, or data structure-specific

better still (possibly), use algorithms using no atomic operations

Lamport's Bakery Algorithm: still has O(p²) costs, and assumes the order of updates made by one thread is seen by other threads


Atomic Operation-based Bakery Algorithm

analogous to the protocol of buying seafood at Woolworths

uses: atomic int fetchAndIncrement(volatile int *v) { int old = *v; (*v)++; return old; }

initialize: ticketNumber = screenNumber = 0;

acquire: myTicket = fetchAndIncrement(&ticketNumber);

wait until (myTicket == screenNumber)

release: screenNumber++; (does not need to be atomic; see the sketch below)
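A minimal C11 rendering of this ticket lock (a sketch, not from the slides):

#include <stdatomic.h>

typedef struct { atomic_int ticketNumber, screenNumber; } TicketLock; // both start at 0

void acquire(TicketLock *l) {
    int myTicket = atomic_fetch_add(&l->ticketNumber, 1);   // fetchAndIncrement
    while (atomic_load(&l->screenNumber) != myTicket)
        ;                                                   // spin on the 'scoreboard'
}

void release(TicketLock *l) {
    // only the lock holder writes screenNumber, so a plain increment suffices
    // logically; the atomic store publishes the critical section's writes
    atomic_store(&l->screenNumber, atomic_load(&l->screenNumber) + 1);
}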

performance: good if the H/W has direct support for fetchAndIncrement(), else it may require an unbounded number of compare-and-swaps

problems under contention: a large amount of cache line invalidations, due to all threads accessing the 'scoreboard' variable screenNumber

can we restrict the # threads accessing their 'scoreboard' to just 2?

MCS lock: Woolworths analogy: instead of watching a screen, the person just ahead tells you when they are served

requires an atomic dual word copy on lock acquire and an atomic compare-and-swap on release

CLH lock: requires only a single fetch-and-store atomic


Synchronization: Barriers

wait for all to arrive at the same point in the computation before proceeding (none may leave until all have arrived)

central barrier with p threads (initialize globals ctr to p, sense to 0; each thread initializes its own threadSense to 0):

threadSense = !threadSense;
if (fetchAndDecrement(&ctr) == 1)
    ctr = p, sense = !sense;    // last to reach toggles global sense
wait until (sense == threadSense)

sense reversal is required for repeated barriers

caution: deadlock occurs if 1 thread does not participate!

problems:

most processors do not support atomic decrement of a memory location
not (very) scalable: p atomic decrements per barrier

(a C11 rendering of this barrier is sketched below)
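A minimal C11 sketch of the central sense-reversing barrier (not from the slides):

#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_int ctr; atomic_bool sense; int p; } CentralBarrier;
// initialize: ctr = p, sense = false
_Thread_local bool threadSense = false;   // per-thread private sense

void barrier_wait(CentralBarrier *b) {
    threadSense = !threadSense;                    // toggle private sense on entry
    if (atomic_fetch_sub(&b->ctr, 1) == 1) {       // last thread to arrive
        atomic_store(&b->ctr, b->p);               // reset counter for reuse
        atomic_store(&b->sense, threadSense);      // toggle global sense: release all
    } else {
        while (atomic_load(&b->sense) != threadSense)
            ;                                      // spin until released
    }
}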


Synchronization: Combining Tree Barrier

each pair of threads points to a leaf node in a tree

each node has a ctr (init. to 2) and a sense flag (init. to 0)

algorithm: begins on all leaf nodes (each thread has a threadSense=1 flag):

if (fetchAndDecrement(&ctr) == 1)
    call algorithm on parent node
    ctr = 2, sense = !sense
wait until (threadSense == sense)

then, on leaving the barrier, threadSense = !threadSense

notes:

the last thread to reach each node continues up the tree
the thread that reaches the root begins the 'wakeup' (reversing sense)
upon wakeup, a thread releases its siblings at each node along its path

performance: 2× the atomic operations, but can distribute the memory locations of node data (e.g. across different LLC banks)

atomics can be avoided by the scalable tree barrier! (simply replace ctr with 2 adjacent byte flags, one for each thread)


Single Reader, Single Writer Bounded Buffer

Strategy: simply put the burden of condition synchronization (retry) on the client

class LFBoundedBuffer {
    int N, in, out, *buf;
public:
    LFBoundedBuffer(int Nc) { N = Nc; in = out = 0; buf = new int[N]; }
    bool put(int v) {
        int inNxt = (in + 1) % N;
        if (inNxt == out)            // buffer full
            return false;
        buf[in] = v; in = inNxt;
        return true;
    }
    bool get(int *v) {
        if (in == out)               // buffer empty
            return false;
        *v = buf[out]; out = (out + 1) % N;
        return true;
    }
};

Why does this work? The two threads never update the same variable. Consider N=2. We must assume that updates to in/out do not overtake the corresponding add/remove on buf, as seen from the other thread; in C++11 terms this is exactly acquire/release ordering on in and out, as sketched below.
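A C++11 rendering of the same buffer with the required ordering made explicit (a sketch, not from the slides):

#include <atomic>

class LFBoundedBufferAtomic {
    int N, *buf;
    std::atomic<int> in{0}, out{0};
public:
    explicit LFBoundedBufferAtomic(int Nc) : N(Nc), buf(new int[Nc]) {}
    bool put(int v) {
        int i = in.load(std::memory_order_relaxed);
        int inNxt = (i + 1) % N;
        if (inNxt == out.load(std::memory_order_acquire)) return false;   // full
        buf[i] = v;                                        // fill the slot first...
        in.store(inNxt, std::memory_order_release);        // ...then publish it
        return true;
    }
    bool get(int *v) {
        int o = out.load(std::memory_order_relaxed);
        if (o == in.load(std::memory_order_acquire)) return false;        // empty
        *v = buf[o];                                       // read the slot first...
        out.store((o + 1) % N, std::memory_order_release); // ...then release it
        return true;
    }
};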

Unbounded: use linked lists; the reader advances a read pointer, leaving the writer to remove previously read nodes


‘Lock-Free’ Stack

class LFStack {
    int N, top, nPop, *buf;
public:
    LFStack(int Nc) { N = Nc; nPop = top = 0; buf = new int[N]; }
    bool push(int v) {
        while (1) {
            int oldTop = top;
            if (oldTop == N) return false;     // stack full: caller must try again
            if (dcas(&top, oldTop, oldTop+1,
                     &buf[oldTop], buf[oldTop], v))
                return true;                   // oldTop remained a valid stack top
        }
    }
    bool pop(int *v) {
        while (1) {
            int oldTop = top, oldNPop = nPop;
            if (oldTop == 0) return false;     // stack empty: caller must try again
            *v = buf[oldTop-1];
            if (dcas(&top, oldTop, oldTop-1,
                     &nPop, oldNPop, oldNPop+1))
                return true;                   // oldTop remained a valid stack top
        }
    }
}; // note: dcas(&top, oldTop, oldTop-1, v, *v, buf[oldTop]) won't work!

dcas() is a 'double' compare-and-swap operation:

in this case we use the op on the 1st word to ensure safety, and on the 2nd to ensure the data is added / removed atomically


Summary: Lock Free Synchronization

Use fine-grained locking to reduce contention in operations on shared data structures

Non-blocking solutions avoid the overheads of locks, but still require appropriate memory fence operations

Lock free design does not eliminate contention


Transactional Memory

Transactional Memory - Motivations

or, problems with lock-based solutions:

convoying: a thread holding a lock is de-scheduled, possibly holding up other threads waiting for the lock

priority inversion: a lower-priority thread is pre-empted while holding a lock needed by a higher-priority thread

deadlock: two threads attempting to acquire the same locks do so in different orders

inherently unsafe: the relationship between data objects and locks is entirely by convention

lacks the composition property: it is difficult to combine 2 concurrent operations using locks into a single larger operation

pessimistic: in low-contention situations, their protection is only needed < 1% of the time!

recall the overheads of acquiring a lock (even if free)


Transactions - Basic Ideas

a transaction is a series of reads and writes (to global data) made by a single thread which execute atomically, i.e.:

if it succeeds, no other thread sees any of the writes until it completes
if it fails, no change to (global) data can be observed

it must also execute with consistency: the thread does not see any (interfering) writes from other threads

it has the following steps: begin, reads and writes to (transactional) variables, end

if consistency is met, the end results in a commit; otherwise it aborts

a condition that meets this is serializability:

if transactions cannot execute concurrently, consistency is assured
note: this assumes that (transactional) variables are only written (in some cases, read) by transactions

can also achieve this if no other thread accesses the affected variables during the transaction


Transactions - Programming Model

e.g. to add an item newItem to a queue with tail pointer tail:

atomic {
    struct node *newTail = malloc(sizeof(struct node));
    newTail->item = newItem;
    newTail->next = tail;
    // tail must not be changed between these 2 points!
    tail = newTail;
}

here, tail is the single transactional variable that is both read and written

the atomic block is executed atomically; it can be assumed that the underlying system will re-execute it until it can commit (what about the malloc?)

e.g. code from the Unstructured Adaptive benchmark of the NAS suite:

for (ije1 = 0; ije1 < nnje; ije1++) { ...
    for (ije2 = 1; ije2 < nnje; ije2++)
        for (col = 1-shift; col < lx1-shift; col++) {
            ig = idmo[v_end[ije2], col, ije1, ije2, iface, ie];
            #pragma omp atomic
            tmor[ig] += temp[v_end[ije2], col, ije1] * 0.5;
            ... }
}


Limited Compare-and-Swap Implementation

the atomic compare-and-swap operation may be expressed as:

typedef unsigned long uint64;
atomic uint64 cas(uint64 *x, uint64 x_old, uint64 x_new) {
    uint64 x_now = *x;
    if (x_now == x_old) *x = x_new;
    return x_now;
}

(the load and conditional store are implemented by a single cas instruction)

we can implement the (essentially atomic part of) the code for queue insertion by:

do {
    newTail->next = tail;
} while (cas(&tail, newTail->next, newTail) != newTail->next);

and for the Unstructured Adaptive benchmark:

do {
    x_old = tmor[ig];
    x_new = temp[v_end[ije2], col, ije1] * 0.5 + x_old;
} while (cas(&tmor[ig], x_old, x_new) != x_old);

note: recent OpenMP runtime systems implemented atomic sections using a global lock (why does this perform abysmally?)


Hardware TM: Basis on Coherency Protocols

idea: tentative values can be held in cache lines (assuming each processor has a separate cache)

review the standard MESI cache coherency protocol

cache line states:

Modified: only this cache holds the data; it is 'dirty'
Exclusive: as before, but not 'dirty'
Shared: > 1 caches hold the data, not 'dirty'
Invalid: this line holds no data

key transitions:

(a) A: load x (miss; then line in E state)
(b) B: load x (miss; then lines are in S state (A and B))
(c) B: store x (hit; then lines in M state (B) and I state (A))
(d) A: load x (miss; then B copies back line to memory and lines in S state (A and B))

when a line is evicted (due to a conflict with a new address needing to be loaded), the data is copied back to memory


Hardware Transactional Memory (Bounded)

add a transactional bit (T) for the state of each cache line

T=1 if the entry is placed in the cache on behalf of a transaction (o.w. 0)

simply extend the MESI protocol as follows:

if a line with T=1 is invalidated or evicted, abort (with no copy-back)
note: can instead record the fact that the transaction will later abort

an abort causes lines with T=1 to be invalidated (no copy-back, even if dirty)

a commit requires all dirty lines with T=1 to be written atomically to memory (the T bit is cleared)

requires transactional variants of load and store instructions, plus an (attempt to) commit instruction (which clears all T bits)

why this works:

if a line with T=1 is invalidated, we have a R-W or W-W conflict
if a line with T=1 is evicted, the transaction cannot complete

note: the size of a transaction is limited by the cache size (in practice, smaller)

upon abort, the hardware needs to indicate whether to retry (synch. conflict) or not (error or SW-detected resource exhaustion)


HTM in the Intel Haswell/Broadwell

has new instructions to begin, end and abort a transaction

xbegin addr: addr is the address of the retry/fallback code, jumped to in case of abort

xend: all loads/stores in between are transactional; xabort aborts explicitly

no guarantee of progress! (indefinitely repeated aborts are possible)

e.g. xbegin; load x; load y; store z;

cache state after the above (columns: T bit, MESI state, tag):

0 S u | 1 E x | 1 S y | 0 M v | 1 M z | 0 I -

if x's line becomes invalidated (a remote core writes to x) before the xend, the transaction aborts; all T=1 lines are invalidated:

0 S u | 0 I x | 0 I y | 0 M v | 0 I z | 0 I -

if v's line becomes invalidated (a remote core writes to v) before the xend, the transaction still commits; the T bits are cleared and dirty transactional lines are written back:

0 S u | 0 E x | 0 S y | 0 M v | 0 E z | 0 I -


Uses of HTM

may be used to create custom atomic operations

e.g. the atomic double compare-and-swap operation may be expressed as:

typedef unsigned long uint64;

int dcas(uint64 *x, uint64 x_old, uint64 x_new,
         uint64 *y, uint64 y_old, uint64 y_new) {
    int swapped = 0;
    xbegin abortaddr;                  // start transaction; jump to abortaddr on abort
    uint64 x_now = *x, y_now = *y;
    if (x_now == x_old && y_now == y_old)
        *x = x_new, *y = y_new, swapped = 1;
    xend;                              // attempt to commit
    return swapped;                    // success
abortaddr:
    return 0;                          // failure
}

note: x86 provides a cmpxchg16b instruction which can do the above if x and y are in consecutive memory locations
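In compiled code, the xbegin/xend pseudo-instructions above correspond to the RTM intrinsics from <immintrin.h>; a minimal sketch (not from the slides; assumes RTM-capable hardware and compilation with -mrtm, and real code needs a non-transactional fallback path since a transaction may always abort):

#include <immintrin.h>

int dcas_rtm(unsigned long *x, unsigned long x_old, unsigned long x_new,
             unsigned long *y, unsigned long y_old, unsigned long y_new) {
    if (_xbegin() == _XBEGIN_STARTED) {        // transaction starts here
        int swapped = 0;
        if (*x == x_old && *y == y_old) {
            *x = x_new; *y = y_new; swapped = 1;
        }
        _xend();                               // attempt to commit
        return swapped;
    }
    return 0;                                  // aborted: control resumes at _xbegin()
}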


Summary: Transactional Memory

Atomic construct: aims to increase the simplicity of synchronization without significantly sacrificing performance

Implementation: many variants that differ in versioning policy (eager/lazy), conflict detection (pessimistic/optimistic), and detection granularity

Hardware transactional memory: versioned data kept in caches, conflict detection as part of the coherence protocol


Hands-on Exercise: Lock and Barrier Performance

Objective:

Understand the performance of atomic operations in various implementations of locks and barriers


Summary

Topics covered today - Parallel Performance Optimization:

Non-uniform memory access hardware

Profiling using VTune

Intel TBB

Lock free data structures and transactional memory

Tomorrow - Parallel Software Design!
