Parallel Programming with OpenMP
Ing. Andrea Marongiu
The Multicore Revolution is Here!
More instruction-level parallelism is hard to find
– Very complex designs needed for small gain
– Thread-level parallelism appears alive and well
Clock frequency scaling is slowing drastically
– Too much power and heat when pushing the envelope
Cannot communicate across the chip fast enough
– Better to design small local units with short paths
Effective use of billions of transistors
– Easier to reuse a basic unit many times
Potential for very easy scaling
– Just keep adding processors/cores for higher (peak) performance
The Road Here
One processor used to require several chips
Then one processor could fit on one chip
Now many processors fit on a single chip
– This changes everything, since single processors are no longer the default
Multiprocessing is Here
Multiprocessor and multicore systems are the future
Some current examples
Vocabulary in the Multi Era
AMP, Asymmetric MP: Each processor has local memory, tasks statically allocated to one processor
SMP, Shared-Memory MP: Processors share memory, tasks dynamically scheduled to any processor
Heterogeneous: Specialization among processors. Often different instruction sets. Usually AMP design.
Homogeneous: all processors have the same instruction set, can run any task, usually SMP design.
Keeping shared data consistent
Cache coherency:
– Fundamental technology for shared memory
– Local caches for each processor in the system
– Multiple copies of shared data can be present in caches
– To maintain correct function, caches have to be coherent
When one processor changes shared data, no other cache will contain old data (eventually)
Future Embedded Systems
The Software becomes the Problem
The arrival of MPSoCs on the market created the need for standard parallel programming models.
Parallelism is required to gain performance
– Parallel hardware is “easy” to design
– Parallel software is (very) hard to write
True concurrency is fundamentally hard to grasp
– Especially in complex software environments
Existing software assumes a single processor
– Might break in new and interesting ways
– Software that multitasks correctly on one processor is not guaranteed to run correctly on a multiprocessor
Exploiting parallelism has historically required significant programmer effort
– This could be alleviated by a programming model that exposes the mechanisms that are useful for the programmer to control, while hiding the details of their implementation.
Message-Passing & Shared-Memory
Message passing: local memory for each task, explicit messages for communication
Shared memory: all tasks can access the same memory, communication is implicit
Programming Parallel Machines
Synchronize & coordinate execution
Communicate data & status between tasks
Ensure parallelism-safe access to shared data
Components of the shared-memory solution:
– All tasks see the same memory
– Locks to protect shared data accesses
– Synchronization primitives to coordinate execution
Programming Model: MPI
Message-passing API
– Explicit messages for communication
– Explicit distribution of data to each thread for work
– Shared memory not visible in the programming model
Best scaling for large systems (1000s of CPUs)
– Quite hard to program
– Well-established in HPC
Programming model: POSIX Threads
Standard API
Explicit operations
Strong programmer control, arbitrary work in each thread
Create & manipulate
– Locks
– Mutexes
– Threads
– etc.
Programming model: OpenMP
De-facto standard for the shared memory programming model
A collection of compiler directives, library routines and environment variables
Easy to specify parallel execution within a serial code
Requires special support in the compiler
Generates calls to threading libraries (e.g. pthreads)
Focus on loop-level parallel execution
Popular in high-end embedded
Fork/Join Parallelism
Initially only the master thread is active
The master thread executes sequential code
Fork: the master thread creates or awakens additional threads to execute parallel code
Join: at the end of parallel code the created threads die or are suspended
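The fork/join sequence can be illustrated with a minimal sketch. The function name team_size_demo is an assumption made for this example only; the code simply counts how many threads run the parallel region, and it falls back to a single thread when compiled without OpenMP support.

```c
#include <stdio.h>

/* Hypothetical helper: counts how many threads execute the parallel region. */
int team_size_demo(void)
{
    int count = 0;                       /* shared by the whole team */

    printf("master: sequential code before the fork\n");

#pragma omp parallel shared(count)       /* fork: the team starts here */
    {
#pragma omp atomic
        count++;                         /* every thread adds itself once */
    }                                    /* join: implicit barrier, team ends */

    printf("master: sequential code after the join, team of %d\n", count);
    return count;
}
```

Calling team_size_demo() from main and compiling with an OpenMP-enabled compiler (for example gcc -fopenmp) should report a team larger than one on a multicore machine; the exact size depends on the runtime configuration.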
Pragmas
Pragma: a compiler directive in C or C++
Stands for “pragmatic information”
A way for the programmer to communicate with the compiler
The compiler is free to ignore pragmas: the original sequential semantics are not altered
Syntax:
#pragma omp <rest of pragma>
Components of OpenMP
Directives
– Parallel regions: #pragma omp parallel
– Work sharing: #pragma omp for, #pragma omp sections
– Synchronization: #pragma omp barrier, #pragma omp critical, #pragma omp atomic
Clauses
– Data scope attributes: private, shared, reduction
Runtime Library
– Thread forking/joining: omp_parallel_start(), omp_parallel_end()
– Number of threads: omp_get_num_threads()
– Thread IDs: omp_get_thread_num()
#pragma omp parallel
Most important directive
Enables parallel execution through a call to pthread_create
Code within its scope is replicated among threads
#include <stdio.h>
int main()
{
#pragma omp parallel
  {
    printf("\nHello world!");
  }
}
A sequential program... is easily parallelized
int main()
{
  omp_parallel_start(&parfun, …);
  parfun();
  omp_parallel_end();
}
int parfun(…)
{
  printf("\nHello world!");
}
Code originally contained within the scope of the pragma is outlined to a new function within the compiler
The #pragma construct in the main function is replaced with function calls to the runtime library
First we call the runtime to fork new threads, and pass them a pointer to the function to execute in parallel
Then the master itself calls the parallel function
Finally we call the runtime to synchronize threads with a barrier and suspend them
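The three steps above (fork, master runs the outlined function, join) can be mimicked by hand with POSIX threads. This is only a sketch under stated assumptions: sketch_parallel_start and sketch_parallel_end are hypothetical stand-ins for the runtime entry points named on the slides, the team size is fixed at compile time, and worker threads are joined rather than parked behind a barrier as a real runtime would likely do.

```c
#include <pthread.h>
#include <stdio.h>

#define NUM_WORKERS 3                    /* assumed: team of 4 including the master */

static pthread_t workers[NUM_WORKERS];
static int hello_count = 0;              /* how many threads ran the region */
static pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;

/* Outlined body of the parallel region. */
static void *parfun(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&count_lock);
    hello_count++;
    pthread_mutex_unlock(&count_lock);
    printf("Hello world!\n");
    return NULL;
}

/* Hypothetical stand-in for omp_parallel_start: fork the worker threads. */
static void sketch_parallel_start(void *(*fn)(void *), void *data)
{
    for (int t = 0; t < NUM_WORKERS; t++)
        pthread_create(&workers[t], NULL, fn, data);
}

/* Hypothetical stand-in for omp_parallel_end: wait for all workers. */
static void sketch_parallel_end(void)
{
    for (int t = 0; t < NUM_WORKERS; t++)
        pthread_join(workers[t], NULL);
}

/* What main would look like after the transformation. */
int run_hello(void)
{
    hello_count = 0;
    sketch_parallel_start(parfun, NULL); /* fork */
    parfun(NULL);                        /* the master runs the region too */
    sketch_parallel_end();               /* join */
    return hello_count;
}
```

With three workers plus the master, run_hello() executes the outlined function four times.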
#pragma omp parallel: Data scope attributes
A slightly more complex example
int main()
{
  int id;
  int a = 5;
#pragma omp parallel
  {
    id = omp_get_thread_num();
    if (id == 0) {
      a = a * 2;
      printf("Master: a = %d.", a);
    }
    else
      printf("Slave: a = %d.", a);
  }
}
Call runtime to get thread ID: every thread sees a different value
Master and slave threads access the same variable a
How to inform the compiler about these different behaviors?
int main()
{
  int id;
  int a = 5;
#pragma omp parallel shared(a) private(id)
  {
    id = omp_get_thread_num();
    if (id == 0) {
      a = a * 2;
      printf("Master: a = %d.", a);
    }
    else
      printf("Slave: a = %d.", a);
  }
}
shared(a): insert code to retrieve the address of the shared object from within each parallel thread
private(id): allow symbol privatization, each thread contains a private copy of this variable
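A runnable variant of the shared/private example is sketched below. The function name scope_demo and the explicit barrier are additions made for this illustration: the barrier ensures every thread prints the value of a after the master has doubled it, which the slide code does not guarantee. The fallback branch lets the code build without OpenMP.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: returns the final value of the shared variable a. */
int scope_demo(void)
{
    int id = -1;    /* private inside the region: one copy per thread */
    int a = 5;      /* shared: one copy for the whole team */

#pragma omp parallel shared(a) private(id)
    {
#ifdef _OPENMP
        id = omp_get_thread_num();
#else
        id = 0;     /* no OpenMP: the single thread acts as master */
#endif
        if (id == 0) {
            a = a * 2;                   /* only the master writes a */
        }
#pragma omp barrier                      /* added: wait for the master's update */
        printf("Thread %d sees a = %d\n", id, a);
    }
    return a;
}
```

Because a is shared, the update made by the master is visible to the whole team and to the code after the region; the private id of each thread is discarded when the region ends.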
#pragma omp for
The parallel pragma instructs every thread to execute all of the code inside the block
If we encounter a for loop that we want to divide among threads, we use the for pragma
#pragma omp for
#pragma omp for: Loop partitioning algorithm
#pragma omp parallel for
for (i=0; i<10; i++)
  a[i] = i;
E.g. $N = 10$, $N_{thr} = 4$
DATA CHUNK: $C = \lceil N / N_{thr} \rceil = 3$ elements
LOWER BOUND: $LB = C \cdot TID$
UPPER BOUND: $UB = \min\{ C \cdot (TID + 1),\ N \}$
Thread ID (TID): 0 1 2 3
LB: 0 3 6 9
UB: 3 6 9 10
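The partitioning rule can be written as a small C function. The names chunk_t and chunk_bounds are assumptions for this sketch, and the clamp of the lower bound to N is an addition (not on the slide) so that threads beyond the end of the iteration space get an empty range.

```c
/* Half-open iteration range [lb, ub) assigned to one thread. */
typedef struct {
    int lb;
    int ub;
} chunk_t;

/* Hypothetical helper implementing C = ceil(N / Nthr), LB = C * TID,
   UB = min(C * (TID + 1), N). */
chunk_t chunk_bounds(int n, int nthr, int tid)
{
    chunk_t r;
    int c = (n + nthr - 1) / nthr;       /* integer form of ceil(n / nthr) */

    r.lb = c * tid;
    r.ub = c * (tid + 1);
    if (r.ub > n)
        r.ub = n;                        /* UB = min(..., N) */
    if (r.lb > n)
        r.lb = n;                        /* added: empty range for surplus threads */
    return r;
}
```

For N = 10 and four threads this gives the ranges [0,3), [3,6), [6,9) and [9,10), matching the table above.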
int main()
{
#pragma omp parallel for
  for (i=0; i<10; i++)
    a[i] = i;
}
int main()
{
  omp_parallel_start(&parfun, …);
  parfun();
  omp_parallel_end();
}
int parfun(…)
{
  for (i=LB; i<UB; i++)
    a[i] = i;
}
The code of the for loop is moved inside the parallel function. Every thread works on a different subset of the iteration space
Lower and upper boundaries (LB, UB) are computed based on the thread ID, according to the previously described algorithm
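The same division of work can be written by hand inside a plain parallel region, which is roughly what the outlined function does. This is a sketch, not the compiler's actual output: fill_by_hand is a hypothetical name, and a block distribution with chunk size ceil(N / Nthr) is assumed.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

#define N 10

/* Hypothetical helper: each thread fills its own block of a[]. */
void fill_by_hand(int a[N])
{
#pragma omp parallel
    {
#ifdef _OPENMP
        int tid  = omp_get_thread_num();
        int nthr = omp_get_num_threads();
#else
        int tid = 0, nthr = 1;           /* no OpenMP: one thread does it all */
#endif
        int c  = (N + nthr - 1) / nthr;  /* chunk size, ceil(N / nthr) */
        int lb = c * tid;                /* lower bound for this thread */
        int ub = c * (tid + 1);          /* upper bound for this thread */
        if (ub > N)
            ub = N;

        for (int i = lb; i < ub; i++)    /* empty when lb >= ub */
            a[i] = i;
    }
}
```

Every index is written by exactly one thread, so no synchronization is needed inside the loop.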
#pragma omp sections
The for pragma exploits data parallelism in loops
OpenMP also provides a directive to exploit task parallelism
#pragma omp sections
Task Parallelism Example
int main()
{
#pragma omp parallel
  {
#pragma omp sections
    {
      v = alpha();
#pragma omp section
      w = beta();
    }
#pragma omp sections
    {
      x = gamma(v, w);
#pragma omp section
      y = delta();
    }
  }
  printf("%f\n", epsilon(x, y));
}
The dataflow graph of this application shows some parallelism (i.e. functions whose execution doesn’t depend on the outcomes of other functions)
alpha and beta execute in parallel
gamma and delta execute in parallel
A barrier is implied at the end of the sections pragma that prevents gamma from consuming v and w before they are produced
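A self-contained version of the sections example might look as follows. The function bodies are placeholders invented for this sketch (the slides do not define them), gamma is renamed gamma_fn to avoid clashing with the math library, and run_pipeline is a hypothetical wrapper name.

```c
/* Placeholder stages: the values are arbitrary, chosen only for the demo. */
static double alpha(void)                   { return 1.0; }
static double beta(void)                    { return 2.0; }
static double gamma_fn(double v, double w)  { return v + w; }
static double delta(void)                   { return 4.0; }
static double epsilon(double x, double y)   { return x * y; }

/* Hypothetical wrapper around the two sections constructs. */
double run_pipeline(void)
{
    double v = 0.0, w = 0.0, x = 0.0, y = 0.0;

#pragma omp parallel shared(v, w, x, y)
    {
#pragma omp sections
        {
#pragma omp section
            v = alpha();                 /* may run in parallel with beta */
#pragma omp section
            w = beta();
        }                                /* implied barrier: v and w are ready */

#pragma omp sections
        {
#pragma omp section
            x = gamma_fn(v, w);          /* may run in parallel with delta */
#pragma omp section
            y = delta();
        }
    }
    return epsilon(x, y);                /* sequential, after the join */
}
```

With the placeholder values, gamma_fn sees v = 1 and w = 2 only because of the implied barrier between the two sections constructs; adding nowait to the first one would make that read a race.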
#pragma omp barrier
Most important synchronization mechanism in shared memory fork/join parallel programming
All threads participating in a parallel region wait until everybody has finished before computation flows on
This prevents later stages of the program from working with inconsistent shared data
It is implied at the end of parallel constructs, as well as for and sections (unless a nowait clause is specified)
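An explicit barrier between a producing phase and a consuming phase can be sketched as below. The names barrier_demo, produced and TEAM are assumptions for this example; num_threads(TEAM) is used only to bound the team size so the array cannot be overrun.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

#define TEAM 4   /* assumed upper bound on the team size */

/* Hypothetical helper: returns 1 if every thread found its neighbour's
   value in place after the barrier, 0 otherwise. */
int barrier_demo(void)
{
    int produced[TEAM] = {0};
    int ok = 1;

#pragma omp parallel num_threads(TEAM) shared(produced, ok)
    {
#ifdef _OPENMP
        int tid  = omp_get_thread_num();
        int nthr = omp_get_num_threads();
#else
        int tid = 0, nthr = 1;
#endif
        int next = (tid + 1) % nthr;     /* neighbour whose data we consume */

        produced[tid] = tid + 1;         /* phase 1: produce own value */

#pragma omp barrier                      /* wait until everybody has produced */

        if (produced[next] != next + 1) {   /* phase 2: consume neighbour's value */
#pragma omp critical
            ok = 0;
        }
    }
    return ok;
}
```

Without the barrier, a fast thread could read its neighbour's slot before it has been written and see the initial zero.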
#pragma omp critical
Critical Section: a portion of code that only one thread at a time may execute
We denote a critical section by putting the pragma
#pragma omp critical
in front of a block of C code
π-finding code example
double area = 0.0, pi, x;
int i, n;
#pragma omp parallel for private(x) shared(area)
for (i=0; i<n; i++) {
  x = (i + 0.5)/n;
  area += 4.0/(1.0 + x*x);
}
pi = area/n;
We must synchronize accesses to shared variable area to avoid inconsistent results. If we don’t..
Race condition
…we set up a race condition in which one thread may “race ahead” of another and not see its change to the shared variable area
π-finding code example
double area = 0.0, pi, x;
int i, n;
#pragma omp parallel for private(x) shared(area)
for (i=0; i<n; i++) {
  x = (i + 0.5)/n;
#pragma omp critical
  area += 4.0/(1.0 + x*x);
}
pi = area/n;
#pragma omp critical protects the code within its scope by acquiring a lock before entering the critical section and releasing it after execution
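Wrapped in a function, the critical-section version becomes directly runnable. The name pi_critical and the parameter n (number of integration intervals) are assumptions for this sketch; it compiles and gives the same result with or without OpenMP.

```c
/* Hypothetical helper: midpoint-rule estimate of pi with n intervals. */
double pi_critical(int n)
{
    double area = 0.0;   /* shared accumulator */
    double x;
    int i;

#pragma omp parallel for private(x) shared(area)
    for (i = 0; i < n; i++) {
        x = (i + 0.5) / n;
#pragma omp critical
        area += 4.0 / (1.0 + x * x);     /* one thread at a time updates area */
    }
    return area / n;
}
```

The result is correct, but every iteration takes the lock, so the threads spend most of their time waiting for each other.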
Correctness, not performance!
As a matter of fact, using locks makes execution sequential
To reduce this effect we should use fine-grained locking (i.e. make critical sections as small as possible)
A single statement that computes the value of area in the previous example is translated by the compiler into many simpler instructions!
The programmer is not aware of the real granularity of the critical section
[Figure: dump of the intermediate representation of the program within the compiler]
This is the way the compiler represents the instruction
area += 4.0/(1.0 + x*x);
– call runtime to acquire lock
– lock-protected operations (critical section)
– call runtime to release lock
THIS IS DONE AT EVERY LOOP ITERATION!
A programming pattern such as area += 4.0/(1.0 + x*x); in which we:
– Fetch the value of an operand
– Add a value to it
– Store the updated value
is called a reduction, and is so common that OpenMP provides direct support for it
OpenMP takes care of storing partial results in private variables and combining partial results after the loop
double area = 0.0, pi, x;
int i, n;
#pragma omp parallel for private(x) reduction(+:area)
for (i=0; i<n; i++) {
  x = (i + 0.5)/n;
  area += 4.0/(1.0 + x*x);
}
pi = area/n;
The reduction clause instructs the compiler to create private copies of the area variable for every thread. At the end of the loop partial sums are combined on the shared area variable
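A runnable form of the reduction version is sketched below. The function name pi_reduction is an assumption; the loop is the one from the slide, with area listed only in the reduction clause, since a variable may appear in just one data-sharing clause of a directive.

```c
/* Hypothetical helper: midpoint-rule estimate of pi using a reduction. */
double pi_reduction(int n)
{
    double area = 0.0;   /* reduction variable: private copies start at 0 */
    double x;
    int i;

#pragma omp parallel for private(x) reduction(+:area)
    for (i = 0; i < n; i++) {
        x = (i + 0.5) / n;
        area += 4.0 / (1.0 + x * x);     /* updates the thread's private copy */
    }                                    /* private copies are summed here */
    return area / n;
}
```

No lock is taken inside the loop, so the iterations of different threads proceed independently.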
UNOPTIMIZED CODE: the shared variable is updated at every iteration. Execution of this critical section is SERIALIZED
LOOP: each thread computes partial sums on private copies of the reduction variable
The shared variable is only updated at the end of the loop, when the partial sums are combined
The final update of the shared variable is a single atomic write. The target architecture may provide such an instruction
__sync_fetch_and_add(&.omp_data_i->area, area);
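What the reduction clause does can also be spelled out by hand: a private partial sum per thread, then one atomic update of the shared variable per thread. This is a sketch of the idea, not the code a particular compiler emits; pi_manual_reduction and partial are names chosen for this example.

```c
/* Hypothetical helper: hand-written equivalent of reduction(+:area). */
double pi_manual_reduction(int n)
{
    double area = 0.0;                   /* shared result */

#pragma omp parallel shared(area)
    {
        double partial = 0.0;            /* private partial sum */
        int i;

#pragma omp for
        for (i = 0; i < n; i++) {
            double x = (i + 0.5) / n;
            partial += 4.0 / (1.0 + x * x);   /* no synchronization needed */
        }

#pragma omp atomic
        area += partial;                 /* one atomic update per thread */
    }
    return area / n;
}
```

The number of synchronized updates drops from one per iteration to one per thread.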
Customizing OpenMP for Efficient Exploitation of the Memory Hierarchy
Memory latency is well recognized as a severe performance bottleneck
MPSoCs feature a complex memory hierarchy, with multiple cache levels, private and shared on-chip and off-chip memories
Using the memory hierarchy efficiently is of the utmost importance to exploit the computational power of MPSoCs, but...
...it is a difficult task, requiring deep understanding of the application and its memory access pattern
The OpenMP standard doesn’t provide any facilities to deal with data placement and partitioning
Customization of the programming interface would bring the advantages of OpenMP to the MPSoC world
Data Placement
#pragma omp parallel sections
{
#pragma omp section
for (i = 0; i < n; i++)
A[i][rand()] = foo ();
#pragma omp section
for (j = 0; j < n; j++)
B[j] = goo ();
}
[Figure: CPUs, each with a local scratchpad memory (SPM), connected through an INTERCONNECT to a SHARED MEMORY]
OpenMP provides means to map parallel tasks to different processors..
Data Placement: Scenario 1 – Off-chip shared memory
..but not to specify the physical placement of data. This can have a great impact on performance!
[Figure: arrays A and B placed in the off-chip SHARED MEMORY]
Memory reference cost = Bus latency + Off-chip memory latency (70 cycles)
Data Placement: Scenario 2 – On-chip remote scratchpad
Memory reference cost = Bus latency + On-chip memory latency (2 cycles)
Data Placement: Scenario 3 – On-chip local scratchpad
Memory reference cost = Local memory latency (1 cycle)
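The three cost formulas can be compared side by side with a small helper. The enum and function names are invented for this sketch, and the bus latency is left as a parameter because the slides do not give a value for it.

```c
/* Where a referenced datum lives (names are assumptions for this sketch). */
typedef enum {
    OFF_CHIP_SHARED,      /* scenario 1 */
    ON_CHIP_REMOTE_SPM,   /* scenario 2 */
    ON_CHIP_LOCAL_SPM     /* scenario 3 */
} placement_t;

/* Cost in cycles of one memory reference, following the three formulas.
   bus_latency is not specified on the slides and must be supplied. */
int reference_cost(placement_t where, int bus_latency)
{
    switch (where) {
    case OFF_CHIP_SHARED:    return bus_latency + 70;
    case ON_CHIP_REMOTE_SPM: return bus_latency + 2;
    case ON_CHIP_LOCAL_SPM:  return 1;   /* no bus traversal */
    }
    return -1;                           /* unknown placement */
}
```

With an assumed bus latency of 10 cycles the three scenarios cost 80, 12 and 1 cycles per reference, which is why placement matters so much for arrays accessed inside loops.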
Data Placement
Dealing with efficient data placement is a difficult and time-consuming activity
Why not have the compiler take care of that?
1. How to drive data allocation to specific memory regions?
– Through the extension of the OpenMP model
2. How to discover the best mapping of data to memories?
– Static code analysis is not sufficient
– Application profiling
Extending OpenMP to support Data Distribution
We need a custom directive that enables specific code analysis and transformation
When static code analysis can’t tell how to distribute data we must rely on profiling
The runtime is responsible for exploiting this information to efficiently map arrays to memories
The entire process is driven by the custom #pragma omp distributed directive
{
int A[m];
float B[n];
#pragma omp distributed (A, B)
…
}
{
int *A;
float *B;
A = distributed_malloc (m);
B = distributed_malloc (n);
…
}
Originally stack-allocated arrays are transformed into pointers to allow for their explicit placement throughout the memory hierarchy within the program
The transformed program invokes the runtime to retrieve profile information, which drives data placement
When no profile information is found, the distributed_malloc returns a pointer to the shared memory
Annotated code is compiled with custom OpenMP translator (GCC 4.3.2)
A first simulation run takes place. Arrays are placed in the shared memory
Every access to arrays declared with #pragma omp distributed is monitored
At the end of the program, a map of the access counts from each processor to each array (location) is retrieved
An allocation algorithm is executed that fills in a metadata structure based on profile information
A second simulation run takes place: Metadata is available to distributed_malloc to figure out the most efficient data mapping
Data partitioning
OpenMP model is focused on loop parallelism
In this parallelization scheme multiple threads may access different sections (discontiguous addresses) of shared arrays
Data partitioning is the process of tiling data arrays and placing the tiles in memory such that a maximum number of accesses are satisfied from local memory
Most obvious implementation of this concept is the data cache, but..
– Inter-array discontiguity often causes cache conflicts
– Embedded systems impose constraints on energy, predictability, real-time that often make caches not suitable
A simple example
#pragma omp parallel for
for (i = 0; i < 4; i++)
for (j = 0; j < 6; j++)
A[ i ][ j ] = 1.0;
ITERATION SPACE
3,0 3,1 3,2 3,3 3,4 3,5
2,0 2,1 2,2 2,3 2,4 2,5
1,0 1,1 1,2 1,3 1,4 1,5
0,0 0,1 0,2 0,3 0,4 0,5
The iteration space is partitioned between the processors
DATA SPACE: A(i,j)
Array is accessed with the loop induction variables
Data space overlaps with iteration space. Each processor accesses a different tile
![Page 64: Parallel Programming with OpenMP](https://reader036.vdocuments.site/reader036/viewer/2022062301/568152ad550346895dc0cf94/html5/thumbnails/64.jpg)
A simple example
#pragma omp parallel for
for (i = 0; i < 4; i++)
for (j = 0; j < 6; j++)
A[ i ][ j ] = 1.0;
[Figure: the 4×6 grid (i = 0..3, j = 0..5) shown as both iteration space and data space; four CPUs, each with a local SPM, connected through an interconnect to a shared memory]
The compiler can actually split the matrix into four smaller arrays and allocate them onto the scratchpads
No access to remote memories through the bus is needed, since the data is allocated locally
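A minimal hand-written sketch of such a split, not the output of any actual compiler. The names `tile`, `NPROC`, `ROWS`, `COLS` and `fill_tiles` are illustrative assumptions. On a real MPSoC each `tile[p]` would be placed in the scratchpad of processor p; plain C cannot express that placement, so here the tiles are simply separate sub-arrays.

```c
#include <assert.h>

#define ROWS  4   /* iteration space: i = 0..3 */
#define COLS  6   /* iteration space: j = 0..5 */
#define NPROC 4   /* assumed: one tile of A per processor */

/* A split into NPROC tiles; tile[p] stands in for the SPM of processor p. */
static float tile[NPROC][ROWS / NPROC][COLS];

void fill_tiles(void)
{
    int p;
    assert(ROWS % NPROC == 0);  /* tiles must divide the rows evenly */

    /* One outer iteration per tile: each thread writes only its own tile. */
    #pragma omp parallel for
    for (p = 0; p < NPROC; p++)
        for (int r = 0; r < ROWS / NPROC; r++)
            for (int j = 0; j < COLS; j++)
                tile[p][r][j] = 1.0f;  /* was A[p * (ROWS / NPROC) + r][j] */
}
```

Because the loop over `p` matches the tiling exactly, no index translation is needed at run time. The code also compiles without OpenMP support, in which case the pragma is ignored and the loop runs sequentially.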
![Page 65: Parallel Programming with OpenMP](https://reader036.vdocuments.site/reader036/viewer/2022062301/568152ad550346895dc0cf94/html5/thumbnails/65.jpg)
Another example
#pragma omp parallel for private(j)
for (i = 0; i < 4; i++)
    for (j = 0; j < 6; j++) {
        /* hist is shared: make the increment atomic */
        #pragma omp atomic
        hist[A[i][j]]++;
    }
[Figure: 4×6 iteration space (i = 0..3, j = 0..5) next to array A, whose values index into hist; four CPUs, each with a local SPM, connected through an interconnect to a shared memory]
Contents of A as drawn on the slide:
3 7 7 5 4 4
3 2 2 3 0 0
1 1 0 2 3 5
1 1 4 5 5 4
Different locations within array hist are accessed by many processors
![Page 66: Parallel Programming with OpenMP](https://reader036.vdocuments.site/reader036/viewer/2022062301/568152ad550346895dc0cf94/html5/thumbnails/66.jpg)
Another example
In this case static code analysis cannot tell anything about the access pattern of hist, since the index A[i][j] is only known at run time
How to decide the most efficient partitioning?
– Split the array into as many tiles as there are processors
– Use access count information to map each tile to the processor that accesses it most
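One way to obtain such access counts is a profiling pass over the input data. The sketch below is only an illustration under stated assumptions: the function name `count_tile_accesses` is hypothetical, row r of A is assumed to be executed by processor r, and hist is assumed to be cut into tiles of `tile_size` consecutive elements.

```c
#include <assert.h>

/*
 * Count how often each processor touches each tile of hist.
 * Assumptions (not from the slides): row r of A is executed by processor r,
 * and tile t covers hist[t * tile_size] .. hist[(t + 1) * tile_size - 1].
 * counts must hold rows * ntiles entries; counts[r * ntiles + t] is the
 * number of accesses from processor r to tile t.
 */
void count_tile_accesses(const int *a, int rows, int cols,
                         int tile_size, int ntiles, int *counts)
{
    assert(tile_size > 0 && ntiles > 0);

    for (int k = 0; k < rows * ntiles; k++)
        counts[k] = 0;

    for (int r = 0; r < rows; r++) {
        for (int j = 0; j < cols; j++) {
            int t = a[r * cols + j] / tile_size;  /* tile holding hist[A[r][j]] */
            assert(t >= 0 && t < ntiles);
            counts[r * ntiles + t]++;
        }
    }
}
```

The pass runs sequentially and only inspects the index values, so it can be executed once on representative input before the tiles are mapped.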
![Page 67: Parallel Programming with OpenMP](https://reader036.vdocuments.site/reader036/viewer/2022062301/568152ad550346895dc0cf94/html5/thumbnails/67.jpg)
Another example
[Figure: 4×6 iteration space and array A, with the same contents as on the previous slide, indexing into hist; four CPUs, each with a local SPM, connected through an interconnect to a shared memory]
Now processors need to access remote scratchpads, since they work on multiple tiles!
Access count of each hist tile, per processor:

| Tile   | PROC1 | PROC2 | PROC3 | PROC4 |
|--------|-------|-------|-------|-------|
| TILE 1 | 1     | 2     | 3     | 2     |
| TILE 2 | 1     | 4     | 2     | 0     |
| TILE 3 | 3     | 0     | 1     | 4     |
| TILE 4 | 2     | 0     | 0     | 1     |
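The mapping rule (each tile goes to the processor that accesses it most) can be sketched as a per-tile maximum search over the access counts. The function name `map_tiles`, the constants and the tie-breaking rule are assumptions made for this illustration; the slides do not specify them.

```c
#include <assert.h>

#define NTILES 4
#define NPROCS 4

/*
 * For every tile pick the processor with the highest access count.
 * Ties go to the lowest-numbered processor (an arbitrary choice).
 * owner[t] receives the 0-based index of the processor whose
 * scratchpad should hold tile t.
 */
void map_tiles(const int counts[NTILES][NPROCS], int owner[NTILES])
{
    for (int t = 0; t < NTILES; t++) {
        int best = 0;
        for (int p = 1; p < NPROCS; p++)
            if (counts[t][p] > counts[t][best])
                best = p;
        owner[t] = best;
    }
}
```

With the access counts listed above, this rule places tile 1 on PROC3, tile 2 on PROC2, tile 3 on PROC4 and tile 4 on PROC1. The sketch does not check scratchpad capacity or prevent several tiles from landing on the same processor; a real allocator would have to handle both.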
![Page 68: Parallel Programming with OpenMP](https://reader036.vdocuments.site/reader036/viewer/2022062301/568152ad550346895dc0cf94/html5/thumbnails/68.jpg)
Problem with data partitioning
If iteration space and data space do not overlap, a processor may need to access several different tiles
In this case data partitioning introduces addressing difficulties, because the data tiles can become discontiguous in physical memory
How can efficient code be generated to access data when performing both loop and data partitioning?
We can further extend the OpenMP programming interface to deal with that!
The programmer only has to specify the intention of partitioning an array throughout the memory hierarchy, and the compiler does the necessary instrumentation
![Page 69: Parallel Programming with OpenMP](https://reader036.vdocuments.site/reader036/viewer/2022062301/568152ad550346895dc0cf94/html5/thumbnails/69.jpg)
Code Instrumentation
In general, the steps for addressing an array element using tiling are:
– Compute the offset w.r.t. the base address
– Identify the tile to which the element belongs
– Re-compute the index relative to that tile
– Load the tile base address from a metadata array
This metadata array is populated during the memory allocation step of the tool-flow, which relies on access count information to find the most efficient mapping of array tiles to memories.
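The four steps can be sketched as a small address-translation helper. This is an illustration only: the name `tiled_address`, its parameters and the assumption of equally sized tiles laid out in row-major order are not taken from the slides.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Translate A[i][j] into an address inside a tiled copy of A.
 * tiles[t] is the base address of tile t (the metadata array),
 * cols is the row length of A, tile_size the number of elements per tile.
 */
int *tiled_address(int *const tiles[], size_t cols, size_t tile_size,
                   size_t i, size_t j)
{
    assert(tile_size > 0);

    size_t offset = i * cols + j;        /* 1. offset w.r.t. the base of A  */
    size_t tile   = offset / tile_size;  /* 2. tile the element belongs to  */
    size_t index  = offset % tile_size;  /* 3. index relative to that tile  */
    int   *base   = tiles[tile];         /* 4. tile base from the metadata  */

    return &base[index];
}
```

Since the base address is looked up at run time, the tiles can live anywhere in memory; only the metadata array has to be updated when the mapping changes.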
![Page 70: Parallel Programming with OpenMP](https://reader036.vdocuments.site/reader036/viewer/2022062301/568152ad550346895dc0cf94/html5/thumbnails/70.jpg)
Extending OpenMP to support data partitioning
#pragma omp parallel tiled(A)
{
    …
    /* Access memory */
    A[i][j] = foo();
    …
}
The compiler instruments the region as follows:
{
    /* Compute offset, tile and index for distributed array */
    int offset = …;
    int tile = …;
    int index = …;
    /* Read tile base address */
    int *base = tiles[dvar][tile];
    /* Access memory */
    base[index] = foo();
    …
}
The instrumentation process is driven by the custom tiled clause, which can be coupled with every parallel and work-sharing construct.
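Since tiled is a custom clause and not part of standard OpenMP, a standard compiler will not accept it. The sketch below writes by hand what the instrumentation might produce for the histogram update hist[A[i][j]]++. The names `hist_tile`, `tiles`, `tiled_histogram` and the tile sizes are assumptions; the arrays are ordinary static storage standing in for scratchpads.

```c
#include <assert.h>

#define HIST_SIZE 8
#define NTILES    4
#define TILE_SIZE (HIST_SIZE / NTILES)

/* Each tile stands in for a chunk of hist placed in one scratchpad. */
static int hist_tile[NTILES][TILE_SIZE];

/* Metadata array: base address of every tile. */
static int *const tiles[NTILES] = {
    hist_tile[0], hist_tile[1], hist_tile[2], hist_tile[3]
};

/* Hand-written equivalent of hist[a[k]]++ on the tiled array. */
void tiled_histogram(const int *a, int n)
{
    int k;
    #pragma omp parallel for
    for (k = 0; k < n; k++) {
        assert(a[k] >= 0 && a[k] < HIST_SIZE);
        int offset = a[k];                /* hist is 1-D: offset is the index */
        int tile   = offset / TILE_SIZE;  /* tile holding the bin */
        int index  = offset % TILE_SIZE;  /* position inside the tile */
        int *base  = tiles[tile];         /* read tile base address */
        #pragma omp atomic
        base[index]++;                    /* several threads may hit one bin */
    }
}
```

The atomic update is needed because different threads can increment the same bin. Without OpenMP support the pragmas are ignored and the function still produces the same histogram sequentially.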