Scalable and composable shared memory parallelism with tasks for multicore and manycore
Thomas Guillet, Intel
Marc Tchiboukdjian, UVSQ
TERATEC 2012 Forum, Palaiseau, 27-28 June 2012
Some application challenges on an Exascale node
1 Exascale node:
• Energy dominated by data movements
• O(1000) cores / node
• Growing impact of machine jitter
• Algorithmic load balancing becomes a critical issue
• How can we reduce data movements in applications?
• How far can SPMD take us in terms of scalability?

Shared memory programming matters:
• Expose as much parallelism as possible
• Benefit from dynamic load balancing
• Use locality-aware algorithms

Already benefits increasingly parallel architectures:
• Multicore: O(10) cores
• Intel MIC: > 50 cores

In this talk: an illustration of cache-friendly task-based parallelism on a kernel for unstructured FEM meshes
Tasks with Intel® Cilk™ Plus

What are tasks?
• A way of expressing opportunities for independent computations (function calls, code blocks, …)
• No explicit reference to threads

Intel® Cilk™ Plus: C/C++ language extensions for tasks in shared memory
• spawn to create a task
• sync to wait for the completion of tasks

Cilk Plus features:
• Automatic scheduling and load balancing
• Available in Intel compilers, and as open source for GCC 4.7 (http://cilkplus.org)
• Same source code targets multicore and MIC
• Parallelism is introduced recursively: well suited for divide-and-conquer (D&C) algorithms

D&C algorithms:
• Expose fine-grain task parallelism
• Are inherently cache-friendly
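The spawn/sync pattern can be illustrated without a Cilk-enabled compiler. The sketch below is an analogue in plain C++, not Cilk Plus itself: std::async stands in for spawn and future::get() for sync, on a recursive array sum. The recursive structure is the same one the Cilk runtime schedules with work stealing.

```cpp
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Recursive divide-and-conquer sum. In Cilk Plus the first recursive call
// would be `cilk_spawn sum_dc(...)` and the join a `cilk_sync`; here
// std::async / future::get() play those roles as a portable sketch.
long sum_dc(const std::vector<long>& v, std::size_t lo, std::size_t hi) {
    if (hi - lo <= 1024)                        // small enough: run serially
        return std::accumulate(v.begin() + lo, v.begin() + hi, 0L);
    std::size_t mid = lo + (hi - lo) / 2;
    auto left = std::async(std::launch::async,  // "spawn": left half may run in parallel
                           sum_dc, std::cref(v), lo, mid);
    long right = sum_dc(v, mid, hi);            // keep working on the right half
    return left.get() + right;                  // "sync": wait for the spawned task
}
```

Note that no thread count appears anywhere: the recursion only exposes opportunities for parallelism, which is the point the slide makes.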
Why task parallelism in applications?
With MPI and OpenMP, parallelism is explicit:
• Parallelism is mandatory and relies on a fixed number of participating workers
• Only one level of coarse-grain parallelism
• This breaks nested parallelism: either no additional parallelism, or possible oversubscription

Tasks offer composable parallelism:
• Parallelism can be expressed anywhere: in the application, in libraries, …
• The runtime manages work decomposition dynamically
• Expressing parallelism is low overhead
• Nested parallelism is only used when needed/possible
• Parallel slack provides load balancing
Elastic Forces Kernel in SPECFEM3D
SPECFEM3D_GLOBE [Komatitsch et al.]
• Open-source seismology package for globe-scale earthquake propagation
• Elastic wave propagation with rich physics (anisotropy, …)
• Proven petascale scalability; highly optimized application
• Very local numerical method: the mass matrix is diagonal
Discrete volumes: hexahedral mesh elements (weak form: ∫Ω ⋯ ∙ w dΩ)

For all elements:
1. Gather displacements from mesh points
2. Compute forces for the element
3. Accumulate forces back to the mesh

Studied here: a stand-alone version of this kernel
Concurrency issue for shared memory parallelization
For each spectral element: gather displacements from the global grid points, compute forces for the element, and accumulate (+=) the forces back to the mesh. Neighboring spectral elements share global grid points, so the accumulation step produces concurrent writes to the global mesh.
Resolving Concurrent Writes: Mesh Coloring
• Partition the spectral elements into a number of "colors"
• No two spectral elements of the same color may contain the same global mesh point
• Elements within the same color can therefore be processed independently
Classical algorithm, well-suited to OpenMP parallel loops
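The coloring idea can be sketched with a generic greedy algorithm. This is illustrative only: the element/point layout and the color_elements function below are made up, not SPECFEM3D's data structures. Each element takes the smallest color not yet used by any element it shares a mesh point with.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Greedy coloring: elements sharing a mesh point must get different colors,
// so all elements of one color can be processed in parallel without write
// conflicts. elems[e] lists the global point indices of element e
// (an illustrative layout).
std::vector<int> color_elements(const std::vector<std::vector<int>>& elems,
                                int n_points) {
    std::vector<int> color(elems.size(), -1);
    // colors_at[p]: colors already taken by elements touching point p
    std::vector<std::vector<int>> colors_at(n_points);
    for (std::size_t e = 0; e < elems.size(); ++e) {
        // collect colors used by neighbors through shared points
        std::vector<int> used;
        for (int p : elems[e])
            used.insert(used.end(), colors_at[p].begin(), colors_at[p].end());
        int c = 0;                               // smallest color not in use
        while (std::find(used.begin(), used.end(), c) != used.end()) ++c;
        color[e] = c;
        for (int p : elems[e]) colors_at[p].push_back(c);
    }
    return color;
}
```

An OpenMP parallel loop then runs over the elements of one color at a time, with a barrier between colors.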
Coloring Algorithm Performance with OpenMP
• OpenMP static schedule: bad load balancing
• OpenMP dynamic schedule restores scalability, but the coloring algorithm causes bad data locality: a 25% efficiency loss
[Figure: speedup and parallel efficiency vs. number of cores (up to 32) for coloring (static), coloring (dynamic), and ideal]
15% imbalance at 32 cores with OpenMP static scheduling; reduced memory locality in the order of element updates.
Test platform: 4 sockets × 8 cores (Intel Nehalem-EX)
D&C Parallel Algorithm
[Task diagram: a mesh task spawns a left task and a right task (both recursive), syncs, then runs the separator task]

Recursively subdivide domains using a static kd-tree:
• D&C retains good dynamic load balancing with good locality
• Parallelism is introduced recursively

[Figure: order of element updates, D&C vs. coloring]
Cilk D&C Pseudocode
```
process_dc(mesh) {
    if (mesh is small enough) then
        process_seq(mesh)
    else
        left, right, sep = split(mesh)
        spawn process_dc(left)
        process_dc(right)
        sync
        process_seq(sep)
}
```
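A runnable analogue of this pseudocode in plain C++, under stated assumptions: the mesh is reduced to an index range, split() is modeled by cutting out a tiny middle band as the "separator", and std::async/future::get() stand in for spawn/sync. The work itself is just a counter here.

```cpp
#include <atomic>
#include <functional>
#include <future>

// Analogue of the slide's process_dc() on a plain index range [lo, hi).
// The two halves run as parallel tasks; the separator band is processed
// sequentially only after the sync, exactly as in the pseudocode, so no
// element of it can race with either half.
void process_dc(long lo, long hi, std::atomic<long>& work) {
    if (hi - lo <= 64) {                        // mesh is small enough
        work += hi - lo;                        // process_seq(mesh)
        return;
    }
    long mid = lo + (hi - lo) / 2;
    long sep_lo = mid - 1, sep_hi = mid + 1;    // toy 2-element separator
    auto left = std::async(std::launch::async,  // spawn process_dc(left)
                           process_dc, lo, sep_lo, std::ref(work));
    process_dc(sep_hi, hi, work);               // process_dc(right)
    left.get();                                 // sync
    work += sep_hi - sep_lo;                    // process_seq(sep)
}
```

Every element is processed exactly once: the left, right, and separator ranges partition [lo, hi) at each level of the recursion.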
Cilk D&C Algorithm Performance
[Figure: speedup and parallel efficiency vs. number of cores (up to 32) for coloring (dynamic), d&c (seq. sep.), and ideal]
Efficiency at small core counts is good, since locality is as good as in the sequential algorithm. But scaling is sublinear: why?
Work & Depth Analysis
• W = work (#operations of all tasks)
• D = depth (#operations of tasks on the critical path)
• Cilk work-stealing scheduler guarantees
T_p = W/p + O(D)

This generalizes Amdahl's law for Cilk programs. (Example task graph: W = 21, D = 10.)

We need a small depth to achieve linear speedup.
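As a worked illustration of the bound (with the constant hidden in the O(D) term arbitrarily set to 1, a simplifying assumption):

```cpp
// Lower bound on parallel time from the work-stealing guarantee,
// taking T_p ≈ W/p + D, and the speedup cap it implies.
double time_bound(double W, double D, double p) { return W / p + D; }
double speedup_cap(double W, double D, double p) { return W / time_bound(W, D, p); }
```

For the slide's example graph (W = 21, D = 10), the cap approaches W/D = 2.1 no matter how large p grows: depth, not core count, limits the speedup.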
What is the depth in our case?
[Task diagram: a mesh task of size N spawns a left task and a right task, each of size N/2, syncs, then processes a separator of size O(N^(2/3))]

Depth of the separator processing:
• Sequential separators: D_N = D_(N/2) + O(N^(2/3)), hence D_N = O(N^(2/3))
• Parallel separators: D_N = D_(N/2) + D_(N^(2/3)), hence D_N = O(log N)

Parallel D&C processing of the separator planes significantly reduces the depth: parallelize the computation of the separators using the same recursive technique.
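The two recurrences can be compared numerically. This is a sketch with all constants and base costs set to 1, chosen for illustration only:

```cpp
#include <cmath>

// Depth recurrence with sequential separator processing:
//   D(N) = D(N/2) + N^(2/3)
double d_seq(double n) {
    if (n <= 2) return 1;
    return d_seq(n / 2) + std::pow(n, 2.0 / 3.0);
}

// Depth recurrence with parallel (recursive) separator processing:
//   D(N) = D(N/2) + D(N^(2/3))
double d_par(double n) {
    if (n <= 2) return 1;
    return d_par(n / 2) + d_par(std::pow(n, 2.0 / 3.0));
}
```

For N around 10^9, d_seq reaches into the millions while d_par stays far smaller, consistent with the O(N^(2/3)) versus O(log N) asymptotics above.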
Overall Performance
[Figures: DRAM traffic in GB (read/write) for serial, replicated, coloring, and d&c; speedup vs. number of cores for coloring (static), coloring (dynamic), d&c (seq. sep.), d&c (par. sep.), and ideal]
Cilk D&C parallel kernel is 1.2x faster than the dynamic OpenMP coloring kernel, with a 1.9x DRAM traffic reduction
Conclusion
Algorithms will matter more and more towards Exascale:
• Expose lots of parallelism
• Be locality-aware

Tasks enable efficient shared-memory algorithms:
• Composable parallelism: expose parallelism at all levels
• Implicit parallelism: the runtime does the scheduling and load balancing
• Can leverage the inherent locality of D&C algorithms

What about full applications?
• Efficient hybrid MPI+OpenMP programming is hard
• Recent efforts combine tasks and communications in distributed memory (e.g. asynchronous PGAS, MPI+StarSs, …)
• Will likely need to be a joint effort with numerical methods, e.g. communication-avoiding algorithms
Exascale Computing Research Contacts
• Address: UVSQ, 45 Av. des Etats-Unis, Buffon building, 5th floor, 78000 Versailles, France
• Web site: www.exascale-computing.eu
• Team: William Jalby, CT, [email protected]; Marie-Christine Sawley, Co-design, [email protected]; Bettina Krammer, Tools, [email protected]
• Collaboration partners
References
• SPECFEM3D: Komatitsch and Tromp, Geophys. J. Int. 149 (2002), 390–412; Komatitsch et al., J. Comp. Phys. 229 (2010), 7692–7714
• Cache-oblivious algorithms: Frigo et al., FOCS 1999
• Cilk & Intel® Cilk™ Plus: Frigo et al., PLDI 1998; http://cilkplus.org
• MPI/SMPSs: Marjanovic et al., ICS 2010
• Asynchronous PGAS languages: X10, Chapel, UPC Tasks