
Page 1: Directive-based approach to Heterogeneous Computing

Directive-based approach to heterogeneous computing

Ruyman Reyes Castro

High Performance Computing Group, University of La Laguna

December 19, 2012

Page 2: Directive-based approach to Heterogeneous Computing

TOP500 Performance Development List


Page 3: Directive-based approach to Heterogeneous Computing

Applications Used in HPC Centers

Usage of HECToR by Area of Expertise


Page 4: Directive-based approach to Heterogeneous Computing

Real HPC Users

Most Used Applications in HECToR

Application          % of total jobs   Language   Prog. Model
VASP                 17%               Fortran    MPI+OpenMP
CP2K                 7%                Fortran    MPI+OpenMP
Unified Model (UM)   7%                Fortran    MPI
GROMACS              4%                C++        MPI+OpenMP

- Large code-bases
- Complex algorithms implemented
- Mixture of different Fortran flavours


Page 5: Directive-based approach to Heterogeneous Computing

Knowledge of Programming

Survey conducted in the Swiss National Supercomputing Centre (2011)


Page 6: Directive-based approach to Heterogeneous Computing

Are application developers using the proper tools?


Page 7: Directive-based approach to Heterogeneous Computing

Complexity Arises (I)


Page 8: Directive-based approach to Heterogeneous Computing

Directives: Enhancing Legacy Code (I)

OpenMP Example

...
#pragma omp parallel for default(shared) private(i, j) \
        firstprivate(rmass, dt)
for (i = 0; i < np; i++) {
    for (j = 0; j < nd; j++) {
        pos[i][j] = pos[i][j] + vel[i][j]*dt + 0.5*dt*dt*a[i][j];
        vel[i][j] = vel[i][j] + 0.5*dt*(f[i][j]*rmass + a[i][j]);
        a[i][j] = f[i][j] * rmass;
    }
}
...


Page 9: Directive-based approach to Heterogeneous Computing

Complexity Arises (II)


Page 10: Directive-based approach to Heterogeneous Computing

Re-compiling the code is no longer enough to continue improving the performance


Page 11: Directive-based approach to Heterogeneous Computing

Porting Applications To New Architectures

Programming CUDA (Host Code)

float a_host[n], b_host[n];
float *a, *b;    /* device pointers */
float c = 2.0f;  /* scalar kernel argument (see the kernel on the next slide) */
// Allocate
cudaMalloc((void **) &a, n * sizeof(float));
cudaMalloc((void **) &b, n * sizeof(float));
// Transfer
cudaMemcpy(a, a_host, n * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(b, b_host, n * sizeof(float), cudaMemcpyHostToDevice);
// Define grid shape
int blocks = 100;
int threads = 128;
// Execute
kernel<<<blocks, threads>>>(a, b, c);
// Copy back the result (the kernel writes b)
cudaMemcpy(b_host, b, n * sizeof(float), cudaMemcpyDeviceToHost);
// Clean
cudaFree(a);
cudaFree(b);


Page 12: Directive-based approach to Heterogeneous Computing

Porting Applications To New Architectures

Programming CUDA (Kernel Source)

// Kernel code
__global__ void kernel(float *a, float *b, float c)
{
    // Get the index of this thread
    unsigned int index = (blockIdx.x * blockDim.x) + threadIdx.x;
    // Do the computation
    b[index] = a[index] * c;
    // Wait for all threads in the block to finish
    __syncthreads();
}


Page 13: Directive-based approach to Heterogeneous Computing

Programmers need faster ways to migrate existing code


Page 14: Directive-based approach to Heterogeneous Computing

Why not use directive-based approaches for these new heterogeneous architectures?


Page 15: Directive-based approach to Heterogeneous Computing

Overview of Our Work

We can't solve problems by using the same kind of thinking we used when we created them.

Albert Einstein

The field is undergoing rapid changes: we have to adapt to them

1. Hybrid MPI+OpenMP (2008)
   → Usage of directives in cluster environments
2. OpenMP extensions (2009)
   → Extensions of OpenMP/La Laguna C (llc) for heterogeneous architectures
3. Directives for accelerators (2011)
   → Specific accelerator-oriented directives
   → OpenACC (December 2011)


Page 16: Directive-based approach to Heterogeneous Computing

Outline

Hybrid MPI+OpenMP
    llc and llCoMP
    Hybrid llCoMP
    Computational Results
    Technical Drawbacks

OpenMP-to-GPU

Directives for Accelerators

Conclusions

Future Work and Final Remarks

Page 17: Directive-based approach to Heterogeneous Computing

La Laguna C: llc

What it is

- Directive-based approach to distributed memory environments
- OpenMP compatible
- Additional set of extensions to address particular features
- Implements FORALL loops, Pipelines, Farms, ...

Reference
[48] Dorta, A. J. Extensión del modelo de OpenMP a memoria distribuida. PhD Thesis, Universidad de La Laguna, December 2008.


Page 18: Directive-based approach to Heterogeneous Computing

Chronological Perspective (Late 2008)

Cores per Socket - System Share | Accelerator - System Share


Page 19: Directive-based approach to Heterogeneous Computing

A Hybrid OpenMP+MPI Implementation

Same llc code, extended llCoMP implementation

- Directives are replaced by a set of parallel patterns (sketched below)
- Improved performance on multicore systems
  → Better usage of inter-core memories (i.e. cache)
  → Lower memory requirements when using replicated memory on MPI

Translation
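To make the translation concrete, here is a minimal sketch (not llCoMP's actual output) of the hybrid pattern such a loop is lowered to: MPI block-distributes the iteration space, OpenMP threads each block, and the replicated array is reconciled with an all-gather. The function name and the divisibility assumption are illustrative.

#include <mpi.h>

/* Minimal sketch of the hybrid target pattern (illustrative only):
   assumes n is divisible by the number of MPI processes. */
void forall_scale(double *v, int n)
{
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int chunk = n / size;            /* block owned by this process */
    int lo = rank * chunk;

    #pragma omp parallel for         /* threads within the node */
    for (int i = lo; i < lo + chunk; i++)
        v[i] = 2.0 * v[i];

    /* reconcile the replicated array: gather every block everywhere */
    MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                  v, chunk, MPI_DOUBLE, MPI_COMM_WORLD);
}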


Page 20: Directive-based approach to Heterogeneous Computing

llc Code Example

llc Implementation of the Mandelbrot Set Computation

...
#pragma omp parallel for default(shared) reduction(+:numoutside) \
        private(i, j, ztemp, z) shared(nt, c)
#pragma llc reduction_type (int)
for (i = 0; i < npoints; i++) {
    z.creal = c[i].creal; z.cimag = c[i].cimag;
    for (j = 0; j < MAXITER; j++) {
        ztemp = (z.creal*z.creal) - (z.cimag*z.cimag) + c[i].creal;
        z.cimag = z.creal * z.cimag * 2 + c[i].cimag;
        z.creal = ztemp;
        if (z.creal * z.creal + z.cimag * z.cimag > THRESOLD) {
            numoutside++;
            break;
        }
    }
}
...


Page 21: Directive-based approach to Heterogeneous Computing

Hybrid MPI+OpenMP performance


Page 22: Directive-based approach to Heterogeneous Computing

Technical Drawbacks

llCoMP

- The original design of llCoMP (a source-to-source translator) was not flexible enough
- Traditional two-pass compiler
- Excessive effort to implement new features
- More advanced features were needed to implement GPU code generation


Page 23: Directive-based approach to Heterogeneous Computing

Back to the Drawing Board


Page 24: Directive-based approach to Heterogeneous Computing

Outline

Hybrid MPI+OpenMP

OpenMP-to-GPU
    Related Work
    Yet Another Compiler Framework (YaCF)
    Computational Results
    Technical Drawbacks

Directives for Accelerators

Conclusions

Future Work and Final Remarks

Page 25: Directive-based approach to Heterogeneous Computing

Chronological Perspective (Late 2009)

Cores per Socket - System Share | Accelerator - System Share


Page 26: Directive-based approach to Heterogeneous Computing

Related Work

Other OpenMP-to-GPU translators: OpenMPC
[82] Lee, S., and Eigenmann, R. OpenMPC: Extended OpenMP programming and tuning for GPUs. In SC'10: Proceedings of the 2010 ACM/IEEE Conference on Supercomputing. IEEE Computer Society, pp. 1-11.

Other compiler frameworks: Cetus, LLVM
[84] Lee, S., Johnson, T. A., and Eigenmann, R. Cetus - an extensible compiler infrastructure for source-to-source transformation. In Languages and Compilers for Parallel Computing, 16th Intl. Workshop, College Station, TX, USA, volume 2958 of LNCS (2003), pp. 539-553.

[81] Lattner, C., and Adve, V. LLVM: A compilation framework for lifelong program analysis & transformation. In Proceedings of the International Symposium on Code Generation and Optimization: feedback-directed and runtime optimization, CGO'04. IEEE Computer Society, pp. 75-86.


Page 27: Directive-based approach to Heterogeneous Computing

YaCF: Yet Another Compiler Framework

Application programmer writes llc code
- Focus on data and algorithm
- Architecture independent
- Only needs to specify where the parallelism is

System engineer writes template code
- Focus on non-functional code
- Can reuse code from different patterns (i.e. inheritance)


Page 28: Directive-based approach to Heterogeneous Computing

YaCF Software Architecture


Page 29: Directive-based approach to Heterogeneous Computing

Main Software Design Patterns

Implementing search and replacement in the IR

- Filter: looks for a specific pattern in the IR
  → e.g. looks for a pragma omp parallel construct
- Mutator: looks for a node and transforms the IR
  → e.g. applies loop transformations (nesting, flattening, ...)
  → e.g. replaces a pragma omp for by a CUDA kernel call
- Filters and Mutators can be composed to solve more complex problems (see the sketch below)
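A language-agnostic sketch of the composition idea (YaCF itself is written in a dynamic language; this C rendering and all of its names are invented for illustration):

#include <stddef.h>

/* Illustrative only: a Filter is a predicate over IR nodes, a Mutator is
   a rewrite applied where the Filter matches; composing the two walks
   the whole IR. */
typedef struct Node {
    int kind;                        /* e.g. a pragma-omp-for marker */
    struct Node **children;
    size_t nchildren;
} Node;

typedef int  (*filter_fn)(const Node *);  /* matches a pattern         */
typedef void (*mutator_fn)(Node *);       /* transforms a matched node */

static void apply(Node *n, filter_fn match, mutator_fn rewrite)
{
    if (match(n))
        rewrite(n);                  /* e.g. pragma -> CUDA kernel call */
    for (size_t i = 0; i < n->nchildren; i++)
        apply(n->children[i], match, rewrite);
}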


Page 30: Directive-based approach to Heterogeneous Computing

Dynamic Language and Tools

Key Idea: Features Should Require Only a Few Lines of Code


Page 31: Directive-based approach to Heterogeneous Computing

Template Patterns

Ease back-end implementation

<%def name="initialization(var_list, prefix = '', suffix = '')">
%for var in var_list:
  cudaMalloc((void **) (&${prefix}${var.name}${suffix}),
             ${var.numelems} * sizeof(${var.type}));
  cudaMemcpy(${prefix}${var.name}${suffix}, ${var.name},
             ${var.numelems} * sizeof(${var.type}),
             cudaMemcpyHostToDevice);
%endfor
</%def>
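For instance, instantiated for a variable v of n floats with prefix "d_" (illustrative values), the template above expands to roughly:

cudaMalloc((void **) (&d_v), n * sizeof(float));
cudaMemcpy(d_v, v, n * sizeof(float), cudaMemcpyHostToDevice);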


Page 32: Directive-based approach to Heterogeneous Computing

CUDA Back-end

Generates a CUDA kernel and memory transfers from the information obtained during the analysis

Supported syntax

- parallel, for and their condensed form are implemented
- New directives to support manual optimizations (e.g. interchange)
- Syntax taken from an OpenMP proposal by BSC, UJI and others (#pragma omp target)
- copy_in, copy_out enable users to provide memory-transfer information
- Generated code is human-readable


Page 33: Directive-based approach to Heterogeneous Computing

Example

Update Loop from the Molecular Dynamics Code

...
#pragma omp target device(cuda) copy(pos, vel, f) copy_out(a)
#pragma omp parallel for default(shared) private(i, j) \
        firstprivate(rmass, dt)
for (i = 0; i < np; i++) {
    for (j = 0; j < nd; j++) {
        pos[i][j] = pos[i][j] + vel[i][j]*dt + 0.5*dt*dt*a[i][j];
        vel[i][j] = vel[i][j] + 0.5*dt*(f[i][j]*rmass + a[i][j]);
        a[i][j] = f[i][j] * rmass;
    }
}


Page 34: Directive-based approach to Heterogeneous Computing

Translation process


Page 35: Directive-based approach to Heterogeneous Computing

The Jacobi Iterative Method

error = 0.0;

{
    for (i = 0; i < m; i++)
        for (j = 0; j < n; j++)
            uold[i][j] = u[i][j];

    for (i = 0; i < (m - 2); i++) {
        for (j = 0; j < (n - 2); j++) {
            resid = ...
            error += resid * resid;
        }
    }
}
k++;
error = sqrt(error) / (double) (n * m);


Page 36: Directive-based approach to Heterogeneous Computing

Jacobi OpenMP Source

error = 0.0;

#pragma omp parallel shared(uold, u, ...) private(i, j, resid)
{
    #pragma omp for
    for (i = 0; i < m; i++)
        for (j = 0; j < n; j++)
            uold[i][j] = u[i][j];
    #pragma omp for reduction(+:error)
    for (i = 0; i < (m - 2); i++) {
        for (j = 0; j < (n - 2); j++) {
            resid = ...
            error += resid * resid;
        }
    }
}
k++;
error = sqrt(error) / (double) (n * m);


Page 37: Directive-based approach to Heterogeneous Computing

Jacobi llCoMP v1

error = 0.0;
#pragma omp target device(cuda)
#pragma omp parallel shared(uold, u, ...) private(i, j, resid)
{
    #pragma omp for
    for (i = 0; i < m; i++)
        for (j = 0; j < n; j++)
            uold[i][j] = u[i][j];
    #pragma omp for reduction(+:error)
    for (i = 0; i < (m - 2); i++) {
        for (j = 0; j < (n - 2); j++) {
            resid = ...
            error += resid * resid;
        }
    }
}
k++;
error = sqrt(error) / (double) (n * m);


Page 38: Directive-based approach to Heterogeneous Computing

Jacobi llCoMP v2

error = 0.0;
#pragma omp target device(cuda) copy_in(u, f) copy_out(f)
#pragma omp parallel shared(uold, u, ...) private(i, j, resid)
{
    #pragma omp for
    for (i = 0; i < m; i++)
        for (j = 0; j < n; j++)
            uold[i][j] = u[i][j];
    #pragma omp for reduction(+:error)
    for (i = 0; i < (m - 2); i++) {
        for (j = 0; j < (n - 2); j++) {
            resid = ...
            error += resid * resid;
        }
    }
}
k++;
error = sqrt(error) / (double) (n * m);


Page 39: Directive-based approach to Heterogeneous Computing

Jacobi Iterative Method


Page 40: Directive-based approach to Heterogeneous Computing

Technical Drawbacks

Limited to Compile-time Optimizations

- Some features require runtime information
  → Kernel grid configuration
- Orphaned directives were not possible
  → Would require an inter-procedural analysis module
- Some templates were too complex
  → And would need to be replicated to support OpenCL


Page 41: Directive-based approach to Heterogeneous Computing

Back to the Drawing Board


Page 42: Directive-based approach to Heterogeneous Computing

Outline

Hybrid MPI+OpenMP

OpenMP-to-GPU

Directives for Accelerators
    Related Work
    OpenACC
    Accelerator ULL (accULL)
    Results

Conclusions

Future Work and Final Remarks

Page 43: Directive-based approach to Heterogeneous Computing

Chronological Perspective (2011)

Cores per Socket - System Share | Accelerator - System Share


Page 44: Directive-based approach to Heterogeneous Computing

Related Work (I)

hiCUDA

- Translates each directive into a CUDA call
- It is able to use the GPU shared memory
- Only works with NVIDIA devices
- The programmer still needs to know hardware details

Code example:

...
#pragma hicuda global alloc c[*][*] copyin

#pragma hicuda kernel mxm tblock(N/16, N/16) thread(16, 16)
#pragma hicuda loop_partition over_tblock over_thread
for (i = 0; i < N; i++) {
    #pragma hicuda loop_partition over_tblock over_thread
    for (j = 0; j < N; j++) {
        double sum = 0.0;
        ...

Page 45: Directive-based approach to Heterogeneous Computing

Related Work (II)

PGI Accelerator Model

- Higher-level (directive-based) approach
- Fortran and C are supported

Code example:

#pragma acc data copyin(b[0:n*l], c[0:m*l]) copy(a[0:n*m])
{
    #pragma acc region
    for (j = 0; j < n; j++)
        for (i = 0; i < l; i++) {
            double sum = 0.0;
            for (k = 0; k < m; k++)
                sum += b[i + k * l] * c[k + j * m];
            a[i + j * l] = sum;
        }
}


Page 46: Directive-based approach to Heterogeneous Computing

Our Ongoing Work at that Time: llcl

- Extending llc with support for heterogeneous platforms
- Compiler + runtime implementation
  → The compiler generates runtime code
  → The runtime handles memory coherence and drives execution
- Compiler optimizations directed by an XML file
- More generic, higher-level approach: not tied to GPUs


Page 47: Directive-based approach to Heterogeneous Computing

llcl: Directives

double *a, *b, *c;
...
#pragma llc context name("mxm") copy_in(a[n * l], b[l * m], \
        c[m * n], l, m, n) copy_out(a[n * l])
{
    int i, j, k;
    #pragma llc for shared(a, b, c, l, m, n) private(i, j, k)
    for (i = 0; i < l; i++)
        for (j = 0; j < n; j++) {
            a[i + j * l] = 0.0;
            for (k = 0; k < m; k++)
                a[i + j * l] = a[i + j * l] + b[i + k * l] * c[k + j * m];
        }
}
...


Page 48: Directive-based approach to Heterogeneous Computing

llcl: XML Platform Description File

<xml>
  <platform name="default">
    <region name="compute">
      <element name="compute_1" class="loop">
        <mutator name="Loop.LoopInterchange"/>
        <target device="cuda"/>
        <target device="opencl"/>
      </element>
    </region>
  </platform>
</xml>


Page 49: Directive-based approach to Heterogeneous Computing

OpenACC Announcement


Page 50: Directive-based approach to Heterogeneous Computing

OpenACC Announcement


Page 51: Directive-based approach to Heterogeneous Computing

OpenACC: Directives

double *a, *b, *c;
...
#pragma acc data copy_in(a[n * l], b[l * m], c[m * n], l, m, n) \
        copy_out(a[n * l])
{
    int i, j, k;
    #pragma acc kernels loop private(i, j, k)
    for (i = 0; i < l; i++)
        for (j = 0; j < n; j++) {
            a[i + j * l] = 0.0;
            for (k = 0; k < m; k++)
                a[i + j * l] = a[i + j * l] + b[i + k * l] * c[k + j * m];
        }
}
...


Page 52: Directive-based approach to Heterogeneous Computing

Related Work

OpenACC Implementations (after the announcement)

- PGI: released in February 2012
- CAPS: released in March 2012
- Cray: to be released
  → Access to a beta release was available

We had a first experimental implementation in January 2012


Page 53: Directive-based approach to Heterogeneous Computing

accULL: Our OpenACC Implementation

accULL = YaCF + Frangollo

It is a two-layer implementation: compiler + runtime library


Page 54: Directive-based approach to Heterogeneous Computing

Frangollo: the Runtime

Implementation

- Lightweight
- Standard C++ and STL code
- CUDA component written using the CUDA Driver API
- OpenCL component written using the C OpenCL interface
- Experimental features can be enabled/disabled at compile time

Handles

1. Device discovery, initialization, ...
2. Memory coherence (registered variables)
3. Kernel execution (including grid shape)


Page 55: Directive-based approach to Heterogeneous Computing

Frangollo Layered Structure


Page 56: Directive-based approach to Heterogeneous Computing

Memory Management

// Create a context to handle memory coherence
ctxt_id = FRG__createContext("name", ...);
...
// Register a variable within the context
FRG__registerVar(ctxt_id, &ptr, offset, size, constraints, ...);
...
// Execute the kernel
FRG__kernelLaunch(ctxt_id, "kernel", param_list, ...);
...
// Finish the context and reconcile variables
FRG__destroyContext(ctxt_id);


Page 57: Directive-based approach to Heterogeneous Computing

Kernel Execution

Loading the kernel

- A context may have from zero to N named kernels associated
- The runtime loads different versions of the kernel for each device (e.g. via the CUDA Driver API; see the sketch below)
- The kernel is loaded depending on the platform where it is executed
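A rough sketch of what such a per-platform load can look like with the CUDA Driver API (in the spirit of Frangollo's CUDA component; the file name "kernel.ptx" and kernel name "kernel" are invented, and error handling is omitted):

#include <cuda.h>

/* Illustrative Driver-API kernel load: pick the kernel image that
   matches this platform and look the kernel up by name. */
int load_and_run(void)
{
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUfunction fn;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuModuleLoad(&mod, "kernel.ptx");
    cuModuleGetFunction(&fn, mod, "kernel");
    /* ... set up arguments and launch with cuLaunchKernel(fn, ...) ... */
    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}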

Grid shape

- Grid shape is estimated using the compute intensity (CI): CI = Nmem / (Cost × Nflops)
  → e.g. Fermi: 512 GFlop/s double precision, 144 GB/s memory bandwidth, Cost = 512/144 ≈ 3.5
- Low CI → favors memory accesses
- High CI → favors computation (see the sketch below)
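A minimal sketch of how such an estimate might drive the block-size choice (the CI formula and the Fermi figures are from the slide; the 1.0 threshold and the 128/256 block sizes are invented for illustration):

#include <stdio.h>

/* CI = Nmem / (Cost * Nflops), as defined above. The threshold and the
   two candidate block sizes are hypothetical illustration values. */
static int estimate_block_size(double nmem, double nflops, double cost)
{
    double ci = nmem / (cost * nflops);
    return (ci > 1.0) ? 256 : 128;   /* bias the grid by kernel character */
}

int main(void)
{
    /* e.g. a kernel with 2 memory accesses per floating-point operation
       on a Fermi-like device (Cost = 3.5) */
    printf("block size: %d\n", estimate_block_size(2.0, 1.0, 3.5));
    return 0;
}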


Page 58: Directive-based approach to Heterogeneous Computing

Implementing OpenACC

Putting it all together

1. The compiler driver generates Frangollo interface calls from OpenACC directives
   → Converts data region directives into context creation
   → Generates host and device synchronization
2. Extracts the kernel code
3. Frangollo implements the OpenACC API calls
   → acc_init, acc_malloc/acc_free
4. Implements some optimizations
   → Compiler: loop invariant, skewing, strip-mining, interchange
   → Kernel extraction: divergence reduction, (basic) data-dependency analysis
   → Runtime: grid shape estimation, optimized reduction kernels


Page 59: Directive-based approach to Heterogeneous Computing

Building an OpenACC Code with accULL


Page 60: Directive-based approach to Heterogeneous Computing

Compliance with the OpenACC Standard

Table: Compliance with the OpenACC 1.0 standard (directives)

Construct                        Supported by
kernels                          PGI, HMPP, accULL
loop                             PGI, HMPP, accULL
kernels loop                     PGI, HMPP, accULL
parallel                         PGI, HMPP
update                           Implemented
copy, copyin, copyout, ...       PGI, HMPP, accULL
pcopy, pcopyin, pcopyout, ...    PGI, HMPP, accULL
async                            PGI
deviceptr clause                 PGI
host                             accULL
collapse                         accULL

Table: Compliance with the OpenACC 1.0 standard (API)

API Call          Supported by
acc_init          PGI, HMPP, accULL
acc_set_device    PGI, HMPP, accULL (no effect)
acc_get_device    PGI, HMPP, accULL


Page 61: Directive-based approach to Heterogeneous Computing

Experimental Platforms

Garoe: a desktop computer
- Intel Core i7 930 processor (2.80 GHz), 4 GB RAM
- 2 GPU devices attached:
    - Tesla C1060
    - Tesla C2050 (Fermi)

Peco: a cluster node
- 2 quad-core Intel Xeon E5410 (2.25 GHz) processors, 24 GB RAM
- Tesla C2050 (Fermi) attached

Drago: a shared-memory system
- 4 Intel Xeon E7 4850 CPUs, 6 GB RAM
- Accelerator platform: Intel OpenCL SDK 1.5, running on the CPU

Page 62: Directive-based approach to Heterogeneous Computing

Software

Compiler versions (pre-OpenACC)
- PGI Compiler Toolkit 12.2 with the PGI Accelerator Programming Model 1.3
- hiCUDA 0.9

Compiler versions (OpenACC)
- PGI Compiler Toolkit 12.6
- CAPS HMPP 3.2.3


Page 63: Directive-based approach to Heterogeneous Computing

Matrix Multiplication (M ×M) (I)

#pragma acc data name("mxm") copy(a[L*N]) copyin(b[L*M], c[M*N])
{
    #pragma acc kernels loop private(i, j) collapse(2)
    for (i = 0; i < L; i++)
        for (j = 0; j < N; j++)
            a[i * L + j] = 0.0;
    /* Iterates over blocks */
    for (ii = 0; ii < L; ii += tile_size)
        for (jj = 0; jj < N; jj += tile_size)
            for (kk = 0; kk < M; kk += tile_size) {
                /* Iterates inside a block */
                #pragma acc kernels loop collapse(2) private(i, j, k)
                for (j = jj; j < min(N, jj + tile_size); j++)
                    for (i = ii; i < min(L, ii + tile_size); i++)
                        for (k = kk; k < min(M, kk + tile_size); k++)
                            a[i*L+j] += (b[i*L+k] * c[k*M+j]);
            }
}


Page 64: Directive-based approach to Heterogeneous Computing

Floating Point Performance for M×M in Peco


Page 65: Directive-based approach to Heterogeneous Computing

M×M (II)

#pragma acc data copy(a[L*N]) copyin(b[L*M], c[M*N])
{
    #pragma acc kernels loop private(i)
    for (i = 0; i < L; i++)
        #pragma acc loop private(j)
        for (j = 0; j < N; j++)
            a[i * L + j] = 0.0;
    /* Iterates over blocks */
    for (ii = 0; ii < L; ii += tile_size)
        for (jj = 0; jj < N; jj += tile_size)
            for (kk = 0; kk < M; kk += tile_size) {
                /* Iterates inside a block */
                #pragma acc kernels loop private(j)
                for (j = jj; j < min(N, jj + tile_size); j++)
                    #pragma acc loop private(i)
                    for (i = ii; i < min(L, ii + tile_size); i++)
                        for (k = kk; k < min(M, kk + tile_size); k++)
                            a[i * L + j] += (b[i * L + k] * c[k * M + j]);
            }
}


Page 66: Directive-based approach to Heterogeneous Computing

M×M (III)

#pragma acc data copy(a[L*N]) copyin(b[L*M], c[M*N] ...)
{
    #pragma acc kernels loop private(i) gang(32)
    for (i = 0; i < L; i++)
        #pragma acc loop private(j) worker(32)
        for (j = 0; j < N; j++)
            a[i * L + j] = 0.0;
    /* Iterates over blocks */
    for (ii = 0; ii < L; ii += tile_size)
        for (jj = 0; jj < N; jj += tile_size)
            for (kk = 0; kk < M; kk += tile_size) {
                /* Iterates inside a block */
                #pragma acc kernels loop private(j) gang(32)
                for (j = jj; j < min(N, jj + tile_size); j++)
                    #pragma acc loop private(i) worker(32)
                    for (i = ii; i < min(L, ii + tile_size); i++)
                        for (k = kk; k < min(M, kk + tile_size); k++)
                            a[i*L+j] += (b[i*L+k] * c[k*M+j]);
            }
}


Page 67: Directive-based approach to Heterogeneous Computing

About Grid Shape and Loop Scheduling Clauses

Optimal gang/worker (i.e. grid shape) values vary

- Among OpenACC implementations
- Among platforms (Fermi vs. Kepler? NVIDIA vs. ATI?)
- What happens if we implement a non-GPU accelerator?
- Our implementation ignores gang/worker and leaves the decision to the runtime
  → The user can influence the decision with an environment variable
- It is possible to enable the gang/worker clauses in our implementation
  → gang/worker feed a strip-mining transformation forcing blocks/threads (WIP)


Page 68: Directive-based approach to Heterogeneous Computing

Effect of Varying Gang/Worker


Page 69: Directive-based approach to Heterogeneous Computing

OpenMP vs Frangollo+OpenCL in Drago


Page 70: Directive-based approach to Heterogeneous Computing

Needleman-Wunsch (NW)

- NW is a nonlinear global optimization method for DNA sequence alignments
- The potential pairs of sequences are organized in a 2D matrix
- The method uses dynamic programming to find the optimum alignment (recurrence sketched below)
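For reference, the core of the dynamic-programming fill follows the textbook recurrence below (a generic sketch, not the benchmark code used in the experiments; names are illustrative). Each cell depends on its north-west, north and west neighbours, which is what gives the parallelization its diagonal-wavefront shape.

static int max3(int a, int b, int c)
{
    int m = a > b ? a : b;
    return m > c ? m : c;
}

/* Needleman-Wunsch score-matrix fill (illustrative) */
void nw_fill(int **F, int **sim, int rows, int cols, int penalty)
{
    for (int i = 1; i < rows; i++)
        for (int j = 1; j < cols; j++)
            F[i][j] = max3(F[i-1][j-1] + sim[i][j],  /* match/mismatch   */
                           F[i-1][j] - penalty,      /* gap in one seq   */
                           F[i][j-1] - penalty);     /* gap in the other */
}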


Page 71: Directive-based approach to Heterogeneous Computing

Performance Comparison of NW in Garoe


Page 72: Directive-based approach to Heterogeneous Computing

Overall Comparison


Page 73: Directive-based approach to Heterogeneous Computing

Outline

Hybrid MPI+OpenMP

OpenMP-to-GPU

Directives for Accelerators

Conclusions

Future Work and Final Remarks

Page 74: Directive-based approach to Heterogeneous Computing

Directive-based Programming

- Support for accelerators may be added to the OpenMP standard in the future
  → In the meantime, OpenACC can be used to port codes to GPUs
  → It is possible to combine OpenACC with OpenMP (see the sketch below)
- Generated code does not always match native-code performance
  → But it leverages the development effort while providing enough performance
- accULL is an interesting research-oriented implementation of OpenACC
  → The first non-commercial OpenACC implementation
  → A flexible framework to explore optimizations, new platforms, ...
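A minimal sketch of such a combination (invented for illustration; assumes a compiler that accepts both pragma families): one OpenMP section keeps work on the host threads while the other offloads a loop with OpenACC.

void combined(float *a, float *b, int n)
{
    /* Illustrative only: host and accelerator work proceed in parallel,
       OpenMP managing the host side, OpenACC the offloaded loop. */
    #pragma omp parallel sections
    {
        #pragma omp section
        for (int i = 0; i < n; i++)             /* host part */
            a[i] *= 2.0f;

        #pragma omp section
        {
            #pragma acc kernels loop copy(b[0:n])  /* device part */
            for (int i = 0; i < n; i++)
                b[i] += 1.0f;
        }
    }
}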


Page 75: Directive-based approach to Heterogeneous Computing

Outline

Hybrid MPI+OpenMP

OpenMP-to-GPU

Directives for Accelerators

Conclusions

Future Work and Final Remarks

Page 76: Directive-based approach to Heterogeneous Computing

Back to the Drawing Board?


Page 77: Directive-based approach to Heterogeneous Computing

accULL Still Has Some Opportunities

- Study support for multiple devices (either transparently or in OpenACC)
- Design an MPI component for the runtime
- Integration with other projects
- Improve the performance of the generated code (e.g. using polyhedral models)
- Enhance the support for Extrae/Paraver (experimental tracing already built in)


Page 78: Directive-based approach to Heterogeneous Computing

Re-use our Know-how

Integrate OpenACC and OmpSs?

- The current OmpSs implementation does not automatically generate kernel code
- Integrating OpenACC syntax within tasks would enable automatic code generation
- Improve portability on accelerator platforms
- Leverage development effort


Page 79: Directive-based approach to Heterogeneous Computing

Contributions

- Reyes, R., and de Sande, F. Automatic code generation for GPUs in llc. The Journal of Supercomputing 58, 3 (Mar. 2011), pp. 349-356.
- Reyes, R., and de Sande, F. Optimization strategies in different CUDA architectures using llCoMP. Microprocessors and Microsystems - Embedded Hardware Design 36, 2 (Mar. 2012), pp. 78-87.
- Reyes, R., Fumero, J. J., Lopez, I., and de Sande, F. accULL: an OpenACC implementation with CUDA and OpenCL support. In Euro-Par 2012 Parallel Processing - 18th International Conference, vol. 7484 of LNCS, pp. 871-882.
- Reyes, R., Fumero, J. J., Lopez, I., and de Sande, F. A Preliminary Evaluation of OpenACC Implementations. The Journal of Supercomputing (in press).


Page 80: Directive-based approach to Heterogeneous Computing

Other contributions

- accULL has been released as an open-source project
  → http://cap.pcg.ull.es/accull
- accULL is currently being evaluated by Vector Fabrics
- Provided feedback to CAPS, which seems to be used in their current version
- Contacted by members of the OpenACC committee
- Two HPC-Europa2 visits by our team's master students


Page 81: Directive-based approach to Heterogeneous Computing

Acknowledgements

- Spanish MEC, Plan Nacional de I+D+i, contracts TIN2008-06570-C04-03 and TIN2011-24598
- Canary Islands Government (ACIISI), contract SolSubC200801000285
- TEXT Project (FP7-261580)
- HPC-EUROPA2 (project number 228398)
- Universitat Jaume I de Castellon
- Universidad de La Laguna
- All members of GCAP


Page 82: Directive-based approach to Heterogeneous Computing

Thank you for your attention!


Page 83: Directive-based approach to Heterogeneous Computing

Directive-based approach to heterogeneous computing

Ruyman Reyes Castro

High Performance Computing Group, University of La Laguna

December 19, 2012