
Page 1: Directive-based approach to Heterogeneous Computing

Directive-based approach to heterogeneous computing

Ruyman Reyes Castro

High Performance Computing Group, University of La Laguna

December 19, 2012

Page 2: Directive-based approach to Heterogeneous Computing

TOP500 Performance Development List


Page 3: Directive-based approach to Heterogeneous Computing

Applications Used in HPC Centers

Usage of HECToR by Area of Expertise


Page 4: Directive-based approach to Heterogeneous Computing

Real HPC Users

Most Used Applications in HECToR

Application          % of total jobs   Language   Prog. Model
VASP                 17%               Fortran    MPI+OpenMP
CP2K                 7%                Fortran    MPI+OpenMP
Unified Model (UM)   7%                Fortran    MPI
GROMACS              4%                C++        MPI+OpenMP

- Large code-bases
- Complex algorithms implemented
- Mixture of different Fortran flavours


Page 5: Directive-based approach to Heterogeneous Computing

Knowledge of Programming

Survey conducted in the Swiss National Supercomputing Centre (2011)


Page 6: Directive-based approach to Heterogeneous Computing

Are application developers using the proper tools?


Page 7: Directive-based approach to Heterogeneous Computing

Complexity Arises (I)


Page 8: Directive-based approach to Heterogeneous Computing

Directives: Enhancing Legacy Code (I)

OpenMP Example

...
#pragma omp parallel for default(shared) private(i, j) \
        firstprivate(rmass, dt)
for (i = 0; i < np; i++) {
    for (j = 0; j < nd; j++) {
        pos[i][j] = pos[i][j] + vel[i][j]*dt + 0.5*dt*dt*a[i][j];
        vel[i][j] = vel[i][j] + 0.5*dt*(f[i][j]*rmass + a[i][j]);
        a[i][j] = f[i][j] * rmass;
    }
}
...


Page 9: Directive-based approach to Heterogeneous Computing

Complexity Arises (II)


Page 10: Directive-based approach to Heterogeneous Computing

Re-compiling the code is no longer enough to continue improving the performance


Page 11: Directive-based approach to Heterogeneous Computing

Porting Applications To New Architectures

Programming CUDA (Host Code)

float a_host[n], b_host[n];
float *a, *b;    /* device pointers */
float c = 2.0f;  /* scalar kernel argument (see the kernel on the next slide) */
// Allocate
cudaMalloc((void **) &a, n * sizeof(float));
cudaMalloc((void **) &b, n * sizeof(float));
// Transfer
cudaMemcpy(a, a_host, n * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(b, b_host, n * sizeof(float), cudaMemcpyHostToDevice);
// Define grid shape
int blocks = 100;
int threads = 128;
// Execute
kernel<<<blocks, threads>>>(a, b, c);
// Copy back the result (the kernel writes b)
cudaMemcpy(b_host, b, n * sizeof(float), cudaMemcpyDeviceToHost);
// Clean
cudaFree(a);
cudaFree(b);


Page 12: Directive-based approach to Heterogeneous Computing

Porting Applications To New Architectures

Programming CUDA (Kernel Source)

// Kernel code
__global__ void kernel(float *a, float *b, float c)
{
    // Get the index of this thread
    unsigned int index = (blockIdx.x * blockDim.x) + threadIdx.x;
    // Do the computation
    b[index] = a[index] * c;
    // Wait for all threads in the block to finish
    __syncthreads();
}


Page 13: Directive-based approach to Heterogeneous Computing

Programmers need faster ways to migrate existing code


Page 14: Directive-based approach to Heterogeneous Computing

Why not use directive-based approaches for these new heterogeneous architectures?


Page 15: Directive-based approach to Heterogeneous Computing

Overview of Our Work

We can't solve problems by using the same kind of thinking we used when we created them.

Albert Einstein

The field is undergoing rapid changes: we have to adapt to them

1. Hybrid MPI+OpenMP (2008)
   → Usage of directives in cluster environments
2. OpenMP extensions (2009)
   → Extensions of OpenMP/La Laguna C (llc) for heterogeneous architectures
3. Directives for accelerators (2011)
   → Specific accelerator-oriented directives
   → OpenACC (December 2011)


Page 16: Directive-based approach to Heterogeneous Computing

Outline

Hybrid MPI+OpenMP
    llc and llCoMP
    Hybrid llCoMP
    Computational Results
    Technical Drawbacks

OpenMP-to-GPU

Directives for Accelerators

Conclusions

Future Work and Final Remarks

Page 17: Directive-based approach to Heterogeneous Computing

La Laguna C: llc

What it is

- Directive-based approach to distributed memory environments
- OpenMP compatible
- Additional set of extensions to address particular features
- Implements FORALL loops, Pipelines, Farms, ...

Reference
[48] Dorta, A. J. Extensión del modelo de OpenMP a memoria distribuida. PhD Thesis, Universidad de La Laguna, December 2008.


Page 18: Directive-based approach to Heterogeneous Computing

Chronological Perspective (Late 2008)

Cores per Socket - System Share | Accelerator - System Share


Page 19: Directive-based approach to Heterogeneous Computing

A Hybrid OpenMP+MPI Implementation

Same llc code, extended llCoMP implementation

- Directives are replaced by a set of parallel patterns (sketched below)
- Improved performance on multicore systems
  → Better usage of inter-core memories (i.e. cache)
  → Lower memory requirements when using replicated memory on MPI

Translation
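To make the translation concrete, here is a minimal sketch (not llCoMP's actual output) of the hybrid pattern such a loop is lowered to: MPI block-distributes the iteration space, OpenMP threads each block, and the replicated array is reconciled with an all-gather. The function name and the divisibility assumption are illustrative.

#include <mpi.h>

/* Minimal sketch of the hybrid target pattern (illustrative only):
   assumes n is divisible by the number of MPI processes. */
void forall_scale(double *v, int n)
{
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int chunk = n / size;            /* block owned by this process */
    int lo = rank * chunk;

    #pragma omp parallel for         /* threads within the node */
    for (int i = lo; i < lo + chunk; i++)
        v[i] = 2.0 * v[i];

    /* reconcile the replicated array: gather every block everywhere */
    MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                  v, chunk, MPI_DOUBLE, MPI_COMM_WORLD);
}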


Page 20: Directive-based approach to Heterogeneous Computing

llc Code Example

llc Implementation of the Mandelbrot Set Computation

...
#pragma omp parallel for default(shared) reduction(+:numoutside) \
        private(i, j, ztemp, z) shared(nt, c)
#pragma llc reduction_type (int)
for (i = 0; i < npoints; i++) {
    z.creal = c[i].creal; z.cimag = c[i].cimag;
    for (j = 0; j < MAXITER; j++) {
        ztemp = (z.creal*z.creal) - (z.cimag*z.cimag) + c[i].creal;
        z.cimag = z.creal * z.cimag * 2 + c[i].cimag;
        z.creal = ztemp;
        if (z.creal * z.creal + z.cimag * z.cimag > THRESOLD) {
            numoutside++;
            break;
        }
    }
}
...


Page 21: Directive-based approach to Heterogeneous Computing

Hybrid MPI+OpenMP performance


Page 22: Directive-based approach to Heterogeneous Computing

Technical Drawbacks

llCoMP

- The original design of llCoMP (a source-to-source translator) was not flexible enough
- Traditional two-pass compiler
- Excessive effort to implement new features
- More advanced features were needed to implement GPU code generation


Page 23: Directive-based approach to Heterogeneous Computing

Back to the Drawing Board


Page 24: Directive-based approach to Heterogeneous Computing

Outline

Hybrid MPI+OpenMP

OpenMP-to-GPU
    Related Work
    Yet Another Compiler Framework (YaCF)
    Computational Results
    Technical Drawbacks

Directives for Accelerators

Conclusions

Future Work and Final Remarks

Page 25: Directive-based approach to Heterogeneous Computing

Chronological Perspective (Late 2009)

Cores per Socket - System Share | Accelerator - System Share


Page 26: Directive-based approach to Heterogeneous Computing

Related Work

Other OpenMP-to-GPU translators: OpenMPC
[82] Lee, S., and Eigenmann, R. OpenMPC: Extended OpenMP programming and tuning for GPUs. In SC'10: Proceedings of the 2010 ACM/IEEE Conference on Supercomputing. IEEE Computer Society, pp. 1-11.

Other compiler frameworks: Cetus, LLVM
[84] Lee, S., Johnson, T. A., and Eigenmann, R. Cetus - an extensible compiler infrastructure for source-to-source transformation. In Languages and Compilers for Parallel Computing, 16th Intl. Workshop, College Station, TX, USA, volume 2958 of LNCS (2003), pp. 539-553.

[81] Lattner, C., and Adve, V. LLVM: A compilation framework for lifelong program analysis & transformation. In Proceedings of the International Symposium on Code Generation and Optimization: feedback-directed and runtime optimization, CGO'04. IEEE Computer Society, pp. 75-86.


Page 27: Directive-based approach to Heterogeneous Computing

YaCF: Yet Another Compiler Framework

Application programmer writes llc code
- Focus on data and algorithm
- Architecture independent
- Only needs to specify where the parallelism is

System engineer writes template code
- Focus on non-functional code
- Can reuse code from different patterns (i.e. inheritance)


Page 28: Directive-based approach to Heterogeneous Computing

YaCF Software Architecture


Page 29: Directive-based approach to Heterogeneous Computing

Main Software Design Patterns

Implementing search and replacement in the IR

- Filter: looks for a specific pattern in the IR
  → e.g. looks for a pragma omp parallel construct
- Mutator: looks for a node and transforms the IR
  → e.g. applies loop transformations (nesting, flattening, ...)
  → e.g. replaces a pragma omp for by a CUDA kernel call
- Filters and Mutators can be composed to solve more complex problems (see the sketch below)
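A language-agnostic sketch of the composition idea (YaCF itself is written in a dynamic language; this C rendering and all of its names are invented for illustration):

#include <stddef.h>

/* Illustrative only: a Filter is a predicate over IR nodes, a Mutator is
   a rewrite applied where the Filter matches; composing the two walks
   the whole IR. */
typedef struct Node {
    int kind;                        /* e.g. a pragma-omp-for marker */
    struct Node **children;
    size_t nchildren;
} Node;

typedef int  (*filter_fn)(const Node *);  /* matches a pattern         */
typedef void (*mutator_fn)(Node *);       /* transforms a matched node */

static void apply(Node *n, filter_fn match, mutator_fn rewrite)
{
    if (match(n))
        rewrite(n);                  /* e.g. pragma -> CUDA kernel call */
    for (size_t i = 0; i < n->nchildren; i++)
        apply(n->children[i], match, rewrite);
}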


Page 30: Directive-based approach to Heterogeneous Computing

Dynamic Language and Tools

Key Idea: Features Should Require Only a Few Lines of Code


Page 31: Directive-based approach to Heterogeneous Computing

Template Patterns

Ease back-end implementation

<%def name="initialization(var_list, prefix = '', suffix = '')">
%for var in var_list:
  cudaMalloc((void **) (&${prefix}${var.name}${suffix}),
             ${var.numelems} * sizeof(${var.type}));
  cudaMemcpy(${prefix}${var.name}${suffix}, ${var.name},
             ${var.numelems} * sizeof(${var.type}),
             cudaMemcpyHostToDevice);
%endfor
</%def>
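For instance, instantiated for a variable v of n floats with prefix "d_" (illustrative values), the template above expands to roughly:

cudaMalloc((void **) (&d_v), n * sizeof(float));
cudaMemcpy(d_v, v, n * sizeof(float), cudaMemcpyHostToDevice);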


Page 32: Directive-based approach to Heterogeneous Computing

CUDA Back-end

Generates a CUDA kernel and memory transfers from the information obtained during the analysis

Supported syntax

- parallel, for and their condensed form are implemented
- New directives to support manual optimizations (e.g. interchange)
- Syntax taken from an OpenMP proposal by BSC, UJI and others (#pragma omp target)
- copy_in, copy_out enable users to provide memory-transfer information
- Generated code is human-readable


Page 33: Directive-based approach to Heterogeneous Computing

Example

Update Loop from the Molecular Dynamics Code

...
#pragma omp target device(cuda) copy(pos, vel, f) copy_out(a)
#pragma omp parallel for default(shared) private(i, j) \
        firstprivate(rmass, dt)
for (i = 0; i < np; i++) {
    for (j = 0; j < nd; j++) {
        pos[i][j] = pos[i][j] + vel[i][j]*dt + 0.5*dt*dt*a[i][j];
        vel[i][j] = vel[i][j] + 0.5*dt*(f[i][j]*rmass + a[i][j]);
        a[i][j] = f[i][j] * rmass;
    }
}


Page 34: Directive-based approach to Heterogeneous Computing

Translation process


Page 35: Directive-based approach to Heterogeneous Computing

The Jacobi Iterative Method

error = 0.0;

{
    for (i = 0; i < m; i++)
        for (j = 0; j < n; j++)
            uold[i][j] = u[i][j];

    for (i = 0; i < (m - 2); i++) {
        for (j = 0; j < (n - 2); j++) {
            resid = ...
            error += resid * resid;
        }
    }
}
k++;
error = sqrt(error) / (double) (n * m);


Page 36: Directive-based approach to Heterogeneous Computing

Jacobi OpenMP Source

error = 0.0;

#pragma omp parallel shared(uold, u, ...) private(i, j, resid)
{
    #pragma omp for
    for (i = 0; i < m; i++)
        for (j = 0; j < n; j++)
            uold[i][j] = u[i][j];
    #pragma omp for reduction(+:error)
    for (i = 0; i < (m - 2); i++) {
        for (j = 0; j < (n - 2); j++) {
            resid = ...
            error += resid * resid;
        }
    }
}
k++;
error = sqrt(error) / (double) (n * m);


Page 37: Directive-based approach to Heterogeneous Computing

Jacobi llCoMP v1

error = 0.0;
#pragma omp target device(cuda)
#pragma omp parallel shared(uold, u, ...) private(i, j, resid)
{
    #pragma omp for
    for (i = 0; i < m; i++)
        for (j = 0; j < n; j++)
            uold[i][j] = u[i][j];
    #pragma omp for reduction(+:error)
    for (i = 0; i < (m - 2); i++) {
        for (j = 0; j < (n - 2); j++) {
            resid = ...
            error += resid * resid;
        }
    }
}
k++;
error = sqrt(error) / (double) (n * m);


Page 38: Directive-based approach to Heterogeneous Computing

Jacobi llCoMP v2

error = 0.0;
#pragma omp target device(cuda) copy_in(u, f) copy_out(f)
#pragma omp parallel shared(uold, u, ...) private(i, j, resid)
{
    #pragma omp for
    for (i = 0; i < m; i++)
        for (j = 0; j < n; j++)
            uold[i][j] = u[i][j];
    #pragma omp for reduction(+:error)
    for (i = 0; i < (m - 2); i++) {
        for (j = 0; j < (n - 2); j++) {
            resid = ...
            error += resid * resid;
        }
    }
}
k++;
error = sqrt(error) / (double) (n * m);


Page 39: Directive-based approach to Heterogeneous Computing

Jacobi Iterative Method


Page 40: Directive-based approach to Heterogeneous Computing

Technical Drawbacks

Limited to Compile-time Optimizations

- Some features require runtime information
  → Kernel grid configuration
- Orphaned directives were not possible
  → Would require an inter-procedural analysis module
- Some templates were too complex
  → And would need to be replicated to support OpenCL


Page 41: Directive-based approach to Heterogeneous Computing

Back to the Drawing Board


Page 42: Directive-based approach to Heterogeneous Computing

Outline

Hybrid MPI+OpenMP

OpenMP-to-GPU

Directives for Accelerators
    Related Work
    OpenACC
    Accelerator ULL (accULL)
    Results

Conclusions

Future Work and Final Remarks

Page 43: Directive-based approach to Heterogeneous Computing

Chronological Perspective (2011)

Cores per Socket - System Share | Accelerator - System Share


Page 44: Directive-based approach to Heterogeneous Computing

Related Work (I)

hiCUDA

- Translates each directive into a CUDA call
- It is able to use the GPU shared memory
- Only works with NVIDIA devices
- The programmer still needs to know hardware details

Code example:

...
#pragma hicuda global alloc c[*][*] copyin

#pragma hicuda kernel mxm tblock(N/16, N/16) thread(16, 16)
#pragma hicuda loop_partition over_tblock over_thread
for (i = 0; i < N; i++) {
    #pragma hicuda loop_partition over_tblock over_thread
    for (j = 0; j < N; j++) {
        double sum = 0.0;
        ...

Page 45: Directive-based approach to Heterogeneous Computing

Related Work (II)

PGI Accelerator Model

- Higher-level (directive-based) approach
- Fortran and C are supported

Code example:

#pragma acc data copyin(b[0:n*l], c[0:m*l]) copy(a[0:n*m])
{
    #pragma acc region
    for (j = 0; j < n; j++)
        for (i = 0; i < l; i++) {
            double sum = 0.0;
            for (k = 0; k < m; k++)
                sum += b[i + k * l] * c[k + j * m];
            a[i + j * l] = sum;
        }
}


Page 46: Directive-based approach to Heterogeneous Computing

Our Ongoing Work at that Time: llcl

- Extending llc with support for heterogeneous platforms
- Compiler + runtime implementation
  → The compiler generates runtime code
  → The runtime handles memory coherence and drives execution
- Compiler optimizations directed by an XML file
- More generic, higher-level approach: not tied to GPUs


Page 47: Directive-based approach to Heterogeneous Computing

llcl: Directives

double *a, *b, *c;
...
#pragma llc context name("mxm") copy_in(a[n * l], b[l * m], \
        c[m * n], l, m, n) copy_out(a[n * l])
{
    int i, j, k;
    #pragma llc for shared(a, b, c, l, m, n) private(i, j, k)
    for (i = 0; i < l; i++)
        for (j = 0; j < n; j++) {
            a[i + j * l] = 0.0;
            for (k = 0; k < m; k++)
                a[i + j * l] = a[i + j * l] + b[i + k * l] * c[k + j * m];
        }
}
...


Page 48: Directive-based approach to Heterogeneous Computing

llcl: XML Platform Description File

<xml>
  <platform name="default">
    <region name="compute">
      <element name="compute_1" class="loop">
        <mutator name="Loop.LoopInterchange"/>
        <target device="cuda"/>
        <target device="opencl"/>
      </element>
    </region>
  </platform>
</xml>


Page 49: Directive-based approach to Heterogeneous Computing

OpenACC Announcement


Page 50: Directive-based approach to Heterogeneous Computing

OpenACC Announcement


Page 51: Directive-based approach to Heterogeneous Computing

OpenACC: Directives

double *a, *b, *c;
...
#pragma acc data copy_in(a[n * l], b[l * m], c[m * n], l, m, n) \
        copy_out(a[n * l])
{
    int i, j, k;
    #pragma acc kernels loop private(i, j, k)
    for (i = 0; i < l; i++)
        for (j = 0; j < n; j++) {
            a[i + j * l] = 0.0;
            for (k = 0; k < m; k++)
                a[i + j * l] = a[i + j * l] + b[i + k * l] * c[k + j * m];
        }
}
...


Page 52: Directive-based approach to Heterogeneous Computing

Related Work

OpenACC Implementations (after the announcement)

- PGI: released in February 2012
- CAPS: released in March 2012
- Cray: to be released
  → Access to a beta release was available

We had a first experimental implementation in January 2012


Page 53: Directive-based approach to Heterogeneous Computing

accULL: Our OpenACC Implementation

accULL = YaCF + Frangollo

It is a two-layer implementation: compiler + runtime library


Page 54: Directive-based approach to Heterogeneous Computing

Frangollo: the Runtime

Implementation

- Lightweight
- Standard C++ and STL code
- CUDA component written using the CUDA Driver API
- OpenCL component written using the C OpenCL interface
- Experimental features can be enabled/disabled at compile time

Handles

1. Device discovery, initialization, ...
2. Memory coherence (registered variables)
3. Kernel execution (including grid shape)


Page 55: Directive-based approach to Heterogeneous Computing

Frangollo Layered Structure


Page 56: Directive-based approach to Heterogeneous Computing

Memory Management

// Create a context to handle memory coherence
ctxt_id = FRG__createContext("name", ...);
...
// Register a variable within the context
FRG__registerVar(ctxt_id, &ptr, offset, size, constraints, ...);
...
// Execute the kernel
FRG__kernelLaunch(ctxt_id, "kernel", param_list, ...);
...
// Finish the context and reconcile variables
FRG__destroyContext(ctxt_id);


Page 57: Directive-based approach to Heterogeneous Computing

Kernel Execution

Loading the kernel

- A context may have from zero to N named kernels associated
- The runtime loads different versions of the kernel for each device (e.g. via the CUDA Driver API; see the sketch below)
- The kernel is loaded depending on the platform where it is executed
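A rough sketch of what such a per-platform load can look like with the CUDA Driver API (in the spirit of Frangollo's CUDA component; the file name "kernel.ptx" and kernel name "kernel" are invented, and error handling is omitted):

#include <cuda.h>

/* Illustrative Driver-API kernel load: pick the kernel image that
   matches this platform and look the kernel up by name. */
int load_and_run(void)
{
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUfunction fn;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuModuleLoad(&mod, "kernel.ptx");
    cuModuleGetFunction(&fn, mod, "kernel");
    /* ... set up arguments and launch with cuLaunchKernel(fn, ...) ... */
    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}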

Grid shape

- Grid shape is estimated using the compute intensity (CI): CI = Nmem / (Cost × Nflops)
  → e.g. Fermi: 512 GFlop/s double precision, 144 GB/s memory bandwidth, Cost = 512/144 ≈ 3.5
- Low CI → favors memory accesses
- High CI → favors computation (see the sketch below)
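A minimal sketch of how such an estimate might drive the block-size choice (the CI formula and the Fermi figures are from the slide; the 1.0 threshold and the 128/256 block sizes are invented for illustration):

#include <stdio.h>

/* CI = Nmem / (Cost * Nflops), as defined above. The threshold and the
   two candidate block sizes are hypothetical illustration values. */
static int estimate_block_size(double nmem, double nflops, double cost)
{
    double ci = nmem / (cost * nflops);
    return (ci > 1.0) ? 256 : 128;   /* bias the grid by kernel character */
}

int main(void)
{
    /* e.g. a kernel with 2 memory accesses per floating-point operation
       on a Fermi-like device (Cost = 3.5) */
    printf("block size: %d\n", estimate_block_size(2.0, 1.0, 3.5));
    return 0;
}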


Page 58: Directive-based approach to Heterogeneous Computing

Implementing OpenACC

Putting it all together

1. The compiler driver generates Frangollo interface calls from OpenACC directives
   → Converts data region directives into context creation
   → Generates host and device synchronization
2. Extracts the kernel code
3. Frangollo implements the OpenACC API calls
   → acc_init, acc_malloc/acc_free
4. Implements some optimizations
   → Compiler: loop invariant, skewing, strip-mining, interchange
   → Kernel extraction: divergence reduction, (basic) data-dependency analysis
   → Runtime: grid shape estimation, optimized reduction kernels


Page 59: Directive-based approach to Heterogeneous Computing

Building an OpenACC Code with accULL


Page 60: Directive-based approach to Heterogeneous Computing

Compliance with the OpenACC Standard

Table: Compliance with the OpenACC 1.0 standard (directives)

Construct                        Supported by
kernels                          PGI, HMPP, accULL
loop                             PGI, HMPP, accULL
kernels loop                     PGI, HMPP, accULL
parallel                         PGI, HMPP
update                           Implemented
copy, copyin, copyout, ...       PGI, HMPP, accULL
pcopy, pcopyin, pcopyout, ...    PGI, HMPP, accULL
async                            PGI
deviceptr clause                 PGI
host                             accULL
collapse                         accULL

Table: Compliance with the OpenACC 1.0 standard (API)

API Call          Supported by
acc_init          PGI, HMPP, accULL
acc_set_device    PGI, HMPP, accULL (no effect)
acc_get_device    PGI, HMPP, accULL


Page 61: Directive-based approach to Heterogeneous Computing

Experimental Platforms

Garoe: a desktop computer
- Intel Core i7 930 processor (2.80 GHz), 4 GB RAM
- 2 GPU devices attached:
    - Tesla C1060
    - Tesla C2050 (Fermi)

Peco: a cluster node
- 2 quad-core Intel Xeon E5410 (2.25 GHz) processors, 24 GB RAM
- Tesla C2050 (Fermi) attached

Drago: a shared-memory system
- 4 Intel Xeon E7 4850 CPUs, 6 GB RAM
- Accelerator platform: Intel OpenCL SDK 1.5, running on the CPU

Page 62: Directive-based approach to Heterogeneous Computing

Software

Compiler versions (pre-OpenACC)
- PGI Compiler Toolkit 12.2 with the PGI Accelerator Programming Model 1.3
- hiCUDA 0.9

Compiler versions (OpenACC)
- PGI Compiler Toolkit 12.6
- CAPS HMPP 3.2.3


Page 63: Directive-based approach to Heterogeneous Computing

Matrix Multiplication (M ×M) (I)

#pragma acc data name("mxm") copy(a[L*N]) copyin(b[L*M], c[M*N])
{
    #pragma acc kernels loop private(i, j) collapse(2)
    for (i = 0; i < L; i++)
        for (j = 0; j < N; j++)
            a[i * L + j] = 0.0;
    /* Iterates over blocks */
    for (ii = 0; ii < L; ii += tile_size)
        for (jj = 0; jj < N; jj += tile_size)
            for (kk = 0; kk < M; kk += tile_size) {
                /* Iterates inside a block */
                #pragma acc kernels loop collapse(2) private(i, j, k)
                for (j = jj; j < min(N, jj + tile_size); j++)
                    for (i = ii; i < min(L, ii + tile_size); i++)
                        for (k = kk; k < min(M, kk + tile_size); k++)
                            a[i*L+j] += (b[i*L+k] * c[k*M+j]);
            }
}


Page 64: Directive-based approach to Heterogeneous Computing

Floating Point Performance for M×M in Peco


Page 65: Directive-based approach to Heterogeneous Computing

M×M (II)

#pragma acc data copy(a[L*N]) copyin(b[L*M], c[M*N])
{
    #pragma acc kernels loop private(i)
    for (i = 0; i < L; i++)
        #pragma acc loop private(j)
        for (j = 0; j < N; j++)
            a[i * L + j] = 0.0;
    /* Iterates over blocks */
    for (ii = 0; ii < L; ii += tile_size)
        for (jj = 0; jj < N; jj += tile_size)
            for (kk = 0; kk < M; kk += tile_size) {
                /* Iterates inside a block */
                #pragma acc kernels loop private(j)
                for (j = jj; j < min(N, jj + tile_size); j++)
                    #pragma acc loop private(i)
                    for (i = ii; i < min(L, ii + tile_size); i++)
                        for (k = kk; k < min(M, kk + tile_size); k++)
                            a[i * L + j] += (b[i * L + k] * c[k * M + j]);
            }
}


Page 66: Directive-based approach to Heterogeneous Computing

M×M (III)

#pragma acc data copy(a[L*N]) copyin(b[L*M], c[M*N] ...)
{
    #pragma acc kernels loop private(i) gang(32)
    for (i = 0; i < L; i++)
        #pragma acc loop private(j) worker(32)
        for (j = 0; j < N; j++)
            a[i * L + j] = 0.0;
    /* Iterates over blocks */
    for (ii = 0; ii < L; ii += tile_size)
        for (jj = 0; jj < N; jj += tile_size)
            for (kk = 0; kk < M; kk += tile_size) {
                /* Iterates inside a block */
                #pragma acc kernels loop private(j) gang(32)
                for (j = jj; j < min(N, jj + tile_size); j++)
                    #pragma acc loop private(i) worker(32)
                    for (i = ii; i < min(L, ii + tile_size); i++)
                        for (k = kk; k < min(M, kk + tile_size); k++)
                            a[i*L+j] += (b[i*L+k] * c[k*M+j]);
            }
}


Page 67: Directive-based approach to Heterogeneous Computing

About Grid Shape and Loop Scheduling Clauses

Optimal gang/worker (i.e. grid shape) values vary

- Among OpenACC implementations
- Among platforms (Fermi vs. Kepler? NVIDIA vs. ATI?)
- What happens if we implement a non-GPU accelerator?
- Our implementation ignores gang/worker and leaves the decision to the runtime
  → The user can influence the decision with an environment variable
- It is possible to enable the gang/worker clauses in our implementation
  → gang/worker feed a strip-mining transformation forcing blocks/threads (WIP)


Page 68: Directive-based approach to Heterogeneous Computing

Effect of Varying Gang/Worker


Page 69: Directive-based approach to Heterogeneous Computing

OpenMP vs Frangollo+OpenCL in Drago


Page 70: Directive-based approach to Heterogeneous Computing

Needleman-Wunsch (NW)

- NW is a nonlinear global optimization method for DNA sequence alignments
- The potential pairs of sequences are organized in a 2D matrix
- The method uses dynamic programming to find the optimum alignment (recurrence sketched below)
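For reference, the core of the dynamic-programming fill follows the textbook recurrence below (a generic sketch, not the benchmark code used in the experiments; names are illustrative). Each cell depends on its north-west, north and west neighbours, which is what gives the parallelization its diagonal-wavefront shape.

static int max3(int a, int b, int c)
{
    int m = a > b ? a : b;
    return m > c ? m : c;
}

/* Needleman-Wunsch score-matrix fill (illustrative) */
void nw_fill(int **F, int **sim, int rows, int cols, int penalty)
{
    for (int i = 1; i < rows; i++)
        for (int j = 1; j < cols; j++)
            F[i][j] = max3(F[i-1][j-1] + sim[i][j],  /* match/mismatch   */
                           F[i-1][j] - penalty,      /* gap in one seq   */
                           F[i][j-1] - penalty);     /* gap in the other */
}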


Page 71: Directive-based approach to Heterogeneous Computing

Performance Comparison of NW in Garoe


Page 72: Directive-based approach to Heterogeneous Computing

Overall Comparison


Page 73: Directive-based approach to Heterogeneous Computing

Outline

Hybrid MPI+OpenMP

OpenMP-to-GPU

Directives for Accelerators

Conclusions

Future Work and Final Remarks

Page 74: Directive-based approach to Heterogeneous Computing

Directive-based Programming

- Support for accelerators may be added to the OpenMP standard in the future
  → In the meantime, OpenACC can be used to port codes to GPUs
  → It is possible to combine OpenACC with OpenMP (see the sketch below)
- Generated code does not always match native-code performance
  → But it leverages the development effort while providing enough performance
- accULL is an interesting research-oriented implementation of OpenACC
  → The first non-commercial OpenACC implementation
  → A flexible framework to explore optimizations, new platforms, ...
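A minimal sketch of such a combination (invented for illustration; assumes a compiler that accepts both pragma families): one OpenMP section keeps work on the host threads while the other offloads a loop with OpenACC.

void combined(float *a, float *b, int n)
{
    /* Illustrative only: host and accelerator work proceed in parallel,
       OpenMP managing the host side, OpenACC the offloaded loop. */
    #pragma omp parallel sections
    {
        #pragma omp section
        for (int i = 0; i < n; i++)             /* host part */
            a[i] *= 2.0f;

        #pragma omp section
        {
            #pragma acc kernels loop copy(b[0:n])  /* device part */
            for (int i = 0; i < n; i++)
                b[i] += 1.0f;
        }
    }
}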


Page 75: Directive-based approach to Heterogeneous Computing

Outline

Hybrid MPI+OpenMP

OpenMP-to-GPU

Directives for Accelerators

Conclusions

Future Work and Final Remarks

Page 76: Directive-based approach to Heterogeneous Computing

Back to the Drawing Board?


Page 77: Directive-based approach to Heterogeneous Computing

accULL Still Has Some Opportunities

- Study support for multiple devices (either transparently or in OpenACC)
- Design an MPI component for the runtime
- Integration with other projects
- Improve the performance of the generated code (e.g. using polyhedral models)
- Enhance the support for Extrae/Paraver (experimental tracing already built in)


Page 78: Directive-based approach to Heterogeneous Computing

Re-use our Know-how

Integrate OpenACC and OmpSs?

- The current OmpSs implementation does not automatically generate kernel code
- Integrating OpenACC syntax within tasks would enable automatic code generation
- Improve portability on accelerator platforms
- Leverage development effort


Page 79: Directive-based approach to Heterogeneous Computing

Contributions

- Reyes, R., and de Sande, F. Automatic code generation for GPUs in llc. The Journal of Supercomputing 58, 3 (Mar. 2011), pp. 349-356.
- Reyes, R., and de Sande, F. Optimization strategies in different CUDA architectures using llCoMP. Microprocessors and Microsystems - Embedded Hardware Design 36, 2 (Mar. 2012), pp. 78-87.
- Reyes, R., Fumero, J. J., Lopez, I., and de Sande, F. accULL: an OpenACC implementation with CUDA and OpenCL support. In Euro-Par 2012 Parallel Processing - 18th International Conference, vol. 7484 of LNCS, pp. 871-882.
- Reyes, R., Fumero, J. J., Lopez, I., and de Sande, F. A Preliminary Evaluation of OpenACC Implementations. The Journal of Supercomputing (in press).


Page 80: Directive-based approach to Heterogeneous Computing

Other contributions

- accULL has been released as an open-source project
  → http://cap.pcg.ull.es/accull
- accULL is currently being evaluated by Vector Fabrics
- Provided feedback to CAPS, which seems to be used in their current version
- Contacted by members of the OpenACC committee
- Two HPC-Europa2 visits by our team's master students


Page 81: Directive-based approach to Heterogeneous Computing

Acknowledgements

- Spanish MEC, Plan Nacional de I+D+i, contracts TIN2008-06570-C04-03 and TIN2011-24598
- Canary Islands Government (ACIISI), contract SolSubC200801000285
- TEXT Project (FP7-261580)
- HPC-EUROPA2 (project number 228398)
- Universitat Jaume I de Castellon
- Universidad de La Laguna
- All members of GCAP


Page 82: Directive-based approach to Heterogeneous Computing

Thank you for your attention!


Page 83: Directive-based approach to Heterogeneous Computing

Directive-based approach to heterogeneous computing

Ruyman Reyes Castro

High Performance Computing Group, University of La Laguna

December 19, 2012