OpenMP + NUMA
vuduc.org/cse6230/slides/cse6230-fa14--04-omp.pdf

TRANSCRIPT

OpenMP + NUMA
CSE 6230: HPC Tools & Apps
Fall 2014 — September 5

Based in part on the LLNL tutorial at https://computing.llnl.gov/tutorials/openMP/. See also the textbook, Chapters 6 & 7.
OpenMP

๏ Programmer identifies serial and parallel regions, not threads
๏ Library + directives (requires compiler support)
๏ Official website: http://www.openmp.org
๏ Also: https://computing.llnl.gov/tutorials/openMP/
Simple example

```c
#include <stdio.h>

int main () {
  printf ("hello, world!\n"); // Goal: execute this in parallel
  return 0;
}
```
Simple example

```c
#include <stdio.h>
#include <omp.h>

int main () {
  omp_set_num_threads (16); // OPTIONAL — can also use the
                            // OMP_NUM_THREADS environment variable
  #pragma omp parallel
  {
    printf ("hello, world!\n"); // Execute in parallel
  } // Implicit barrier/join
  return 0;
}
```
Simple example

```c
#include <stdio.h>
#include <omp.h>

int main () {
  omp_set_num_threads (16); // OPTIONAL — can also use the
                            // OMP_NUM_THREADS environment variable
  #pragma omp parallel num_threads(8) // Restrict team size locally
  {
    printf ("hello, world!\n"); // Execute in parallel
  } // Implicit barrier/join
  return 0;
}
```
Simple example

```c
#include <stdio.h>
#include <omp.h>

int main () {
  omp_set_num_threads (16); // OPTIONAL — can also use the
                            // OMP_NUM_THREADS environment variable
  #pragma omp parallel
  {
    printf ("hello, world!\n"); // Execute in parallel
  } // Implicit barrier/join
  return 0;
}
```

Compiling:

```
gcc -fopenmp …
icc -openmp …   (newer icc versions: -qopenmp)
```
Simple example

```c
#include <stdio.h>
#include <omp.h>

int main () {
  omp_set_num_threads (16); // OPTIONAL — can also use the
                            // OMP_NUM_THREADS environment variable
  #pragma omp parallel
  {
    printf ("hello, world!\n"); // Execute in parallel
  } // Implicit barrier/join
  return 0;
}
```

Output:

```
hello, world!
hello, world!
hello, world!
…
```
Parallel loops

```c
for (i = 0; i < n; ++i) {
  a[i] += foo (i);
}
```
Parallel loops

```c
for (i = 0; i < n; ++i) {
  a[i] += foo (i);
}
```

```c
#pragma omp parallel shared (a,n) private (i) // Activates the team of threads
{
  #pragma omp for // Declares a work-sharing loop
  for (i = 0; i < n; ++i) {
    a[i] += foo (i);
  } // Implicit barrier
} // Implicit barrier/join
```
Parallel loops

```c
for (i = 0; i < n; ++i) {
  a[i] += foo (i);
}
```

```c
#pragma omp parallel
{
  update_all (a, n);
} // Implicit barrier/join

void update_all (item* a, int n) {
  int i;
  #pragma omp for private (i)
  for (i = 0; i < n; ++i) {
    a[i] += foo (i);
  } // Implicit barrier
}
```

Note: the omp for directive here is “orphaned” — it sits outside the lexical extent of the parallel region. If update_all() is called outside any parallel region, the loop simply runs serially.
Parallel loops

```c
for (i = 0; i < n; ++i) {
  a[i] += foo (i);
}
```

```c
#pragma omp parallel for default (none) shared (a,n) private (i)
for (i = 0; i < n; ++i) {
  a[i] += foo (i);
} // Implicit barrier/join
```

Combining omp parallel and omp for is just a convenient shorthand for a common idiom.
“If” clause

```c
for (i = 0; i < n; ++i) {
  a[i] += foo (i);
}
```

```c
const int B = …; // parallelize only when the loop is big enough
#pragma omp parallel for if (n > B) default (none) shared (a,n) private (i)
for (i = 0; i < n; ++i) {
  a[i] += foo (i);
} // Implicit barrier/join
```
Parallel loops
๏ You must check dependencies

```c
s = 0;
for (i = 0; i < n; ++i)
  s += x[i];
```
Parallel loops
๏ You must check dependencies

```c
s = 0;
for (i = 0; i < n; ++i)
  s += x[i];
```

```c
#pragma omp parallel for shared(s)
for (i = 0; i < n; ++i)
  s += x[i]; // Data race!
```
Parallel loops
๏ You must check dependencies

```c
s = 0;
for (i = 0; i < n; ++i)
  s += x[i];
```

```c
#pragma omp parallel for reduction (+:s)
for (i = 0; i < n; ++i)
  s += x[i];
```

```c
#pragma omp parallel for shared(s)
for (i = 0; i < n; ++i)
  #pragma omp critical
  s += x[i]; // Correct, but serializes every update
```
Removing implicit barriers: nowait

```c
#pragma omp parallel default (none) shared (a,b,n) private (i)
{
  #pragma omp for nowait
  for (i = 0; i < n; ++i)
    a[i] = foo (i);

  #pragma omp for nowait
  for (i = 0; i < n; ++i)
    b[i] = bar (i);
}
```

Contrast with _Cilk_for, which does not have such a “feature.”
Single thread

```c
#pragma omp parallel default (none) shared (a,b,n) private (i)
{
  #pragma omp single [nowait]
  for (i = 0; i < n; ++i) {
    a[i] = foo (i);
  } // Implied barrier unless “nowait” specified

  #pragma omp for
  for (i = 0; i < n; ++i)
    b[i] = bar (i);
}
```

Only one thread from the team will execute the first loop. Use single with nowait to let the other threads proceed to the second loop while that one thread executes the first.
Master thread

```c
#pragma omp parallel default (none) shared (a,b,n) private (i)
{
  #pragma omp master
  for (i = 0; i < n; ++i) {
    a[i] = foo (i);
  } // No implied barrier!

  #pragma omp for
  for (i = 0; i < n; ++i)
    b[i] = bar (i);
}
```
Synchronization primitives

๏ Critical sections — no explicit locks:
  #pragma omp critical
  { … }

๏ Barriers:
  #pragma omp barrier

๏ Explicit locks — may require flushing:
  omp_set_lock (l);
  …
  omp_unset_lock (l);

๏ Single-thread regions, inside parallel regions:
  #pragma omp single
  { /* executed once */ }
Loop scheduling

What are all these scheduling things?

๏ Static: k iterations per thread, assigned statically
  #pragma omp parallel for schedule(static, k) …
๏ Dynamic: k iters / thread, using a logical work queue
  #pragma omp parallel for schedule(dynamic, k) …
๏ Guided: k iters / thread initially, reduced with each allocation
  #pragma omp parallel for schedule(guided, k) …
๏ Run-time (schedule(runtime)): use the value of the environment variable OMP_SCHEDULE
Loop scheduling strategies for load balance

Centralized scheduling (task queue):
๏ Dynamic, on-line approach
๏ Good for a small number of workers
๏ Independent tasks, known

For loops: self-scheduling
๏ Task = subset of iterations
๏ Loop body has unpredictable running time
๏ Tang & Yew (ICPP ’86)

[Figure: a central task queue feeding worker threads]
Self-scheduling trade-off

Unit of work to grab: balance vs. contention

Some variations:
๏ Grab fixed-size chunks
๏ Guided self-scheduling
๏ Tapering
๏ Weighted factoring, adaptive factoring, distributed trapezoid
๏ Self-adapting, gap-aware, …

[Figure: a central task queue feeding worker threads]
Work queue

Example: 23 iterations scheduled on P = 3 workers. Ideal finish time: 23 / 3 ≈ 7.67.

Finish times by chunking strategy:
๏ Fixed k=1: 12.5
๏ Fixed k=3: 11
๏ Tapered, k0=3: 11
๏ Tapered, k0=4: 10.5
Summary: Loop scheduling

๏ Static: k iterations per thread, assigned statically
  #pragma omp parallel for schedule(static, k) …
๏ Dynamic: k iters / thread, using a logical work queue
  #pragma omp parallel for schedule(dynamic, k) …
๏ Guided: k iters / thread initially, reduced with each allocation
  #pragma omp parallel for schedule(guided, k) …
๏ Run-time: use the value of the environment variable OMP_SCHEDULE
Tasking (OpenMP 3.0+)

For reference, the Cilk Plus version:

```c
int fib (int n) { // G == tuning parameter
  if (n <= G) return fib__seq (n);
  int f1, f2;
  f1 = _Cilk_spawn fib (n-1);
  f2 = fib (n-2);
  _Cilk_sync;
  return f1 + f2;
}
```

See also: https://iwomp.zih.tu-dresden.de/downloads/omp30-tasks.pdf
Tasking (OpenMP 3.0+)

The OpenMP equivalent:

```c
int fib (int n) { // G == tuning parameter
  if (n <= G) return fib__seq (n);
  int f1, f2;
#pragma omp task default(none) shared(n,f1)
  f1 = fib (n-1);
  f2 = fib (n-2);
#pragma omp taskwait
  return f1 + f2;
}
```

```c
// At the call site:
#pragma omp parallel
#pragma omp single nowait
  answer = fib (n);
```

See also: https://iwomp.zih.tu-dresden.de/downloads/omp30-tasks.pdf
[Figure: strip plot of speedup relative to mean sequential time (y-axis 0–14) for repeated Sequential, Cilk Plus, and OpenMP Tasking runs of the fib code above.]

Note: Parallel run-times are highly variable! For your assignments and projects, you should gather and report suitable statistics.
Tasking (OpenMP 3.0+)

```c
// Computes: f2 (A (), f1 (B (), C ()))
int a, b, c, x, y;

a = A ();
b = B ();
c = C ();
x = f1 (b, c);
y = f2 (a, x);
```

```c
// Assume a parallel region
#pragma omp task shared(a)
a = A ();
#pragma omp task if (0) shared (b, c, x)
{
  #pragma omp task shared(b)
  b = B ();
  #pragma omp task shared(c)
  c = C ();
  #pragma omp taskwait
}
x = f1 (b, c);
#pragma omp taskwait
y = f2 (a, x);
```

See also: https://iwomp.zih.tu-dresden.de/downloads/omp30-tasks.pdf
Check: What does this code do?

```c
// At some call site:
#pragma omp parallel
#pragma omp single nowait
foo ();
```

```c
void foo () {
#pragma omp task
  bar ();
  baz ();
}

void bar () {
#pragma omp for
  for (int i = 1; i < 100; ++i)
    boo ();
}
```

See also: https://iwomp.zih.tu-dresden.de/downloads/omp30-tasks.pdf
Performance tuning tip: Controlling the number of threads

```c
#pragma omp parallel num_threads(4) default (none) shared (a,n) private (i)
{
  #pragma omp for
  for (i = 0; i < n; ++i) {
    a[i] += foo (i);
  }
}
```
Performance tuning tip: Exploit non-uniform memory access (NUMA)

[Figure: two sockets, each with four cores and its own attached DRAM.]

Example: two quad-core CPUs with logically shared but physically distributed memory.
![Page 38: OpenMP + NUMAvuduc.org/cse6230/slides/cse6230-fa14--04-omp.pdfSummary: Loop scheduling Static: k iterations per thread, assigned statically Dynamic: k iters / thread, using logical](https://reader035.vdocuments.site/reader035/viewer/2022070108/6041c67f77038e26661acaf4/html5/thumbnails/38.jpg)
jinx9 $ /nethome/rvuduc3/local/jinx/likwid/2.2.1-ICC/bin/likwid-topology -g
------------------------------------------------------------- CPU type: Intel Core Westmere processor ************************************************************* Hardware Thread Topology ************************************************************* Sockets: 2 Cores per socket: 6 Threads per core: 2 ------------------------------------------------------------- HWThread Thread Core Socket 0 0 0 0 1 0 0 1 2 0 8 0 3 0 8 1 4 0 2 0 5 0 2 1 6 0 10 0 7 0 10 1 8 0 1 0 9 0 1 1 10 0 9 0 11 0 9 1 12 1 0 0 13 1 0 1 14 1 8 0 15 1 8 1 16 1 2 0 17 1 2 1 18 1 10 0 19 1 10 1 20 1 1 0 21 1 1 1 22 1 9 0 23 1 9 1 ------------------------------------------------------------- Socket 0: ( 0 12 8 20 4 16 2 14 10 22 6 18 ) Socket 1: ( 1 13 9 21 5 17 3 15 11 23 7 19 ) -------------------------------------------------------------
************************************************************* NUMA domains: 2 ------------------------------------------------------------- Domain 0: Processors: 0 2 4 6 8 10 12 14 16 18 20 22 Memory: 10988.6 MB free of total 12277.8 MB ------------------------------------------------------------- Domain 1: Processors: 1 3 5 7 9 11 13 15 17 19 21 23 Memory: 10986.1 MB free of total 12288 MB ------------------------------------------------------------- !************************************************************* Graphical: ************************************************************* Socket 0: +-------------------------------------------------------------------+ | +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ | | | 0 12 | | 8 20 | | 4 16 | | 2 14 | | 10 22 | | 6 18 | | | +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ | | +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ | | | 32kB | | 32kB | | 32kB | | 32kB | | 32kB | | 32kB | | | +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ | | +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ | | | 256kB | | 256kB | | 256kB | | 256kB | | 256kB | | 256kB | | | +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ | | +---------------------------------------------------------------+ | | | 12MB | | | +---------------------------------------------------------------+ | +-------------------------------------------------------------------+ Socket 1: +-------------------------------------------------------------------+ | +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ | | | 1 13 | | 9 21 | | 5 17 | | 3 15 | | 11 23 | | 7 19 | | | +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ | | +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ | | | 32kB | | 32kB | | 32kB | | 32kB | | 32kB | | 32kB | | | +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ |
Exploiting NUMA: Linux “first-touch” policy

```c
a = /* … allocate buffer … */;
#pragma omp parallel for … schedule(static)
for (i = 0; i < n; ++i) {
  a[i] = /* … initial value … */;
}
```

```c
#pragma omp parallel for …
for (i = 0; i < n; ++i) {
  a[i] += foo (i);
}
```
Exploiting NUMA: Linux “first-touch” policy

```c
a = /* … allocate buffer … */;
#pragma omp parallel for … schedule(static)
for (i = 0; i < n; ++i) {
  a[i] = /* … initial value … */;
}
```

```c
#pragma omp parallel for … schedule(static)
for (i = 0; i < n; ++i) {
  a[i] += foo (i);
}
```

Using the same static schedule in both loops means each thread computes on the same iterations it initialized, so it works on data resident in the NUMA domain where it first touched the pages.
Thread binding

Key environment variables:
๏ OMP_NUM_THREADS: number of OpenMP threads
๏ GOMP_CPU_AFFINITY: specify thread-to-core binding

Consider a 2-socket × 6-core system, where the main thread initializes the data and p OpenMP threads operate on it:

```
env OMP_NUM_THREADS=6 GOMP_CPU_AFFINITY="0 2 4 6 … 22" ./run-program …
  (shorthand: GOMP_CPU_AFFINITY="0-22:2")
env OMP_NUM_THREADS=6 GOMP_CPU_AFFINITY="1 3 5 7 … 23" ./run-program …
  (shorthand: GOMP_CPU_AFFINITY="1-23:2")
```
[Figure: effective bandwidth (GB/s, y-axis 0–25) for the “Triad” kernel, c[i] ← a[i] + s*b[i], comparing: Sequential; OpenMP × 6 with master-thread initialization, reading from socket 0; OpenMP × 6 with master-thread initialization, reading from socket 1; OpenMP × 12 with master-thread initialization, reading from both sockets; and OpenMP × 12 with first-touch initialization.]
Backup
Expressing task parallelism (I)

```c
#pragma omp parallel sections // [… e.g., nowait]
{
  #pragma omp section
  foo ();

  #pragma omp section
  bar ();
}
```
Expressing task parallelism (II): Tasks (new in OpenMP 3.0)
Like Cilk’s spawn construct

```c
void foo () { // A & B independent
  A ();
  B ();
}
```

```c
// Idiom for tasks
void foo () {
#pragma omp parallel
  {
    #pragma omp single nowait
    {
      #pragma omp task
      A ();
      #pragma omp task
      B ();
    }
  }
}
```

http://wikis.sun.com/display/openmp/Using+the+Tasking+Feature
Expressing task parallelism (II): Tasks (new in OpenMP 3.0)
Like Cilk’s spawn construct

```c
void foo () { // A & B independent
  A ();
  B ();
}
```

```c
// Idiom for tasks
void foo () {
#pragma omp parallel
  {
    #pragma omp single nowait
    {
      #pragma omp task
      A ();
      // Or, let the parent run B itself
      B ();
    }
  }
}
```

http://wikis.sun.com/display/openmp/Using+the+Tasking+Feature
Variation 1: Fixed chunk size

Kruskal and Weiss (1985) give a model for computing the optimal chunk size:
๏ Independent subtasks
๏ Assumed distributions of running time for each subtask (e.g., IFR)
๏ Overhead for extracting a task, also random
๏ Limitation: must know the distribution, though ‘n / p’ works (~0.8× optimal for large n/p)

Ref: “Allocating independent subtasks on parallel processors”
Variation 2: Guided self-scheduling

Idea:
๏ Large chunks at first to avoid overhead
๏ Small chunks near the end to even out finish times
๏ Chunk size Ki = ceil(Ri / p), where Ri = number of remaining tasks

Ref: Polychronopoulos & Kuck (1987), “Guided self-scheduling: A practical scheduling scheme for parallel supercomputers”
Variation 3: Tapering

Idea:
๏ Chunk size Ki = f(μ, σ, Ri / p, h), where (μ, σ) are estimated from history, h = selection overhead, and a minimum chunk size is enforced
๏ High variance ⇒ small chunk size
๏ Low variance ⇒ larger chunks OK
๏ A little better than guided self-scheduling

Ref: S. Lucco (1994), “Adaptive parallel programs.” PhD thesis.
Variation 4: Weighted factoring

What if the hardware is heterogeneous?

Idea: divide task cost by the computational power of the requesting node.

Ref: Hummel, Schmidt, Uma, Wein (1996), “Load-sharing in heterogeneous systems using weighted factoring.” In SPAA.
When self-scheduling is useful

๏ Task cost unknown
๏ Locality not important
๏ Shared memory or “small” numbers of processors
๏ Tasks without dependencies (it can be used with them, but most analysis ignores this)