S6240 - High-Level GPU Programming Using OpenMP 4.5 and Clang/LLVM. Arpith Jacob, Alexandre Eichenberger, Samuel Antao, Carlo Bertolli, Tong Chen, Zehra Sura, Hyojin Sung, Georgios Rokos, Kevin O'Brien. IBM T. J. Watson Research Center.


Page 1

S6240 - High-Level GPU Programming Using OpenMP 4.5 and Clang/LLVM

Arpith Jacob, Alexandre Eichenberger, Samuel Antao, Carlo Bertolli, Tong Chen, Zehra Sura, Hyojin Sung, Georgios Rokos, Kevin O'Brien
IBM T. J. Watson Research Center

Page 2

• IBM is building heterogeneous systems with POWER + GPU
• Advocating the use of the OpenMP programming model
• IBM Research is contributing OpenMP support for NVIDIA GPUs in Clang/LLVM
• Upstreaming in progress. Download at: ibm.biz/ykt-omp


Overview

Page 3

[Figure: heterogeneous node. Processing: a POWER8 socket has 12 cores, each running several SMT hardware threads; a Kepler/Maxwell GPU has many SMX streaming multiprocessors (SP CUDA cores, DP units, etc.). Storage: the CPU has on-chip L1/L2/L3 caches, off-chip L4 and 2 TB+ of DRAM; the GPU has constant/read-only caches, L1+SMEM, L2, and 12 GB of off-chip DRAM. The P8 sockets connect to the GPUs via NVLink/PCI-E.]

POWER8: latency sensitive
• High single-thread performance
• Hide latency via memory prefetch, or
• Cache hierarchy for spatial and temporal locality

Kepler/Maxwell: throughput optimized
• Optimized for multi-threaded code
• Low-overhead context switch
• Hide memory latency with threads

Exploiting Heterogeneous Node Resources

Page 4

• Applications must exploit heterogeneous resources in a performance-portable manner
• Use vendor-specific languages and directives?
• Compiler-specific pragmas?
• Mix of programming models? OpenMP, OpenACC, CUDA

Programmability Challenge

Page 5

[Figure: Fortran/C/C++ and frameworks such as RAJA, KOKKOS, OP2, OPS, ... layered over OpenMP, targeting a two-socket P8 node with Kepler/Maxwell GPUs over NVLink/PCI-E.]

• OpenMP is widely used to program CPUs; the latest specs support accelerators
• Write performance-portable code using flexible parallelism models
• Industry-wide acceptance: IBM, Intel, PathScale, Cray, PGI, Oracle, MS

Programming Overview

Page 6

[Figure: a node with two CPU sockets (cores with per-core L2 caches) and a GPU (SMXs with caches), all above shared node memory.]

OpenMP Memory Model

Page 7

[Figure: the CPU side alone: two sockets of cores with per-core L2 caches and a shared L3, above node memory.]

• CPU threads may have their own view of shared variables
• Relaxed consistency
• Explicit flush operations on the host are required
• Unsynchronized accesses may lead to data races

OpenMP Memory Model

Page 8

[Figure: the GPU side alone: SMXs with caches above a separate device memory, alongside node memory.]

• OpenMP 4 extends views to target devices
• map: control data views
• target data, target enter/exit data
• target update
• Unsynchronized accesses may lead to data races

OpenMP Memory Model

Page 9

[Figure: the combined picture: two CPU sockets (cores, L2 caches, L3) and the GPU (SMXs, caches, device memory), with data moving between node memory and device memory.]

OpenMP Memory Model

Page 10

How do we use OpenMP offload?

#pragma omp target map(to: cls, len, compression[0:len]) \
                   map(from: bvc[0:len])
for (int i = 0; i < len; i++) {
  bvc[i] = cls * (compression[i] + 1.0);
}

OpenMP Offload Model

Page 11

#pragma omp target map(to: cls, len, compression[0:len]) \
                   map(from: bvc[0:len])
for (int i = 0; i < len; i++) {
  bvc[i] = cls * (compression[i] + 1.0);
}

[Figure: the target construct ships the region to the GPU; map() copies compression[], cls, and len to the device and bvc[] back to the host; the host thread is inactive during device execution.]

OpenMP Offload Model

Page 12

Loop work-sharing

#pragma omp parallel for
for (i = 0; i < M; i++)
  for (j = 0; j < N; j++)
    A[i][j] += u1[i] * v1[j] + u2[i] * v2[j];

[Figure: threads placed on two 8-core sockets. Affinity: spread threads to maximize bandwidth, or pack threads to reuse cache locality.]

Exploit Hardware Threads on POWER CPUs
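The two placement policies in the figure above correspond to the standard OpenMP 4 affinity environment variables; a minimal sketch (the binary invocation is a placeholder):

```shell
# Spread worker threads across both sockets to maximize memory bandwidth:
export OMP_PLACES=cores
export OMP_PROC_BIND=spread

# ...or pack them onto neighbouring cores to reuse cache locality:
export OMP_PROC_BIND=close

# then run the OpenMP binary as usual, e.g.:  ./a.out
```

OMP_PLACES defines the units threads may bind to; OMP_PROC_BIND chooses how threads are distributed over those places.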

Page 13

Loop work-sharing on GPUs with a target task

[Figure: a CUDA grid of blocks, Block(0,0)..Block(2,1). Hardware SMs, warps, and lanes correspond to CUDA grids, blocks, and threads, which correspond to OpenMP teams, threads, and simd lanes. Within Team 0, the parallel region's master and threads Th0..Th63 run on CUDA threads T0..T63, i.e. Warp 0 (lanes L0..L31) and Warp 1 (lanes L0..L31).]

#pragma omp target teams distribute
#pragma omp parallel for
for (i = 0; i < M; i++)
  for (j = 0; j < N; j++)
    A[i][j] += u1[i] * v1[j] + u2[i] * v2[j];

Exploit Streaming Multiprocessors on GPUs

Page 14

SIMD and other OpenMP forms supported on the GPU

[Figure: within one CUDA block, Team 0's parallel threads Th0 and Th1 each drive a simd region whose lanes map onto CUDA threads T0..T63, i.e. Warp 0 and Warp 1 (lanes L0..L31).]

#pragma omp target teams distribute
#pragma omp parallel for
for (i = 0; i < M; i++)
  #pragma omp simd
  for (j = 0; j < N; j++)
    A[i][j] += u1[i] * v1[j] + u2[i] * v2[j];

Exploit Streaming Multiprocessors on GPUs

Page 15

Task Parallelism

#pragma omp parallel
#pragma omp single
{
  #pragma omp task depend(out: a)
  TraverseForward(A);
  #pragma omp task depend(in: a)
  TraverseReverse(B);
  …
}

• Tasks are well suited for parallelism that is dynamically uncovered, e.g. searches, graph processing
• Tasks are load-balanced between the threads in the parallel region
• A task is fired once all the tasks it depends on have completed

Exploit Hardware Threads on POWER CPUs

Page 16

• Target constructs are implicit tasks
• A host thread may initiate several target tasks asynchronously
• Target tasks may have dependencies
• Dependencies between target tasks are resolved completely on the GPU without host intervention

[Figure: timeline interleaving host tasks and target tasks.]

CPU & GPU Parallelism using Tasks

Page 17

Concurrency in a node
• Host threads and device threads
• Multiple GPUs in a node
• Overlap device computation and communication
• Concurrent target tasks on a GPU with task dependencies

CPU & GPU Parallelism using Tasks

Page 18

▪ CLANG
•  Front-end to parse source code and generate LLVM IR
•  Modified to generate code for OpenMP device constructs
•  Produces two copies of code for target regions
•  Inserts calls to standardized OpenMP runtime interface functions
•  Compiler driver modified to process the code copies through different backends

▪ NVPTX backend
•  Produces PTX code, which is then processed through ptxas to generate the CUDA binary

Collaborating with the wider community and industry partners (LLVM open source, OpenMP standards)

[Figure: compilation pipeline: C/C++ with OpenMP 4.5 → CLANG (outlining/duplication) → LLVM Power backend and LLVM NVPTX backend → host OpenMP library and GPU OpenMP library → device linker → host linker → executable.]

LLVM Compiler Schematic
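The pipeline above is driven by a single compiler invocation. The exact flags for the prototype on these slides may differ; upstream clang uses flags of this shape for OpenMP offload, and the source file name here is a placeholder:

```shell
# Hypothetical one-step build: clang compiles the host copy with the
# Power backend and the target-region copy with the NVPTX backend,
# then links both with the host and GPU OpenMP libraries.
CLANG_FLAGS="-O3 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda"
# clang $CLANG_FLAGS lulesh.c -o lulesh
echo "$CLANG_FLAGS"
```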

Page 19

•  Compiler responsible for thread activation and thread coordination

OpenMP Codegen Internals

Page 20

•  LULESH: proxy for hydrodynamics code (https://codesign.llnl.gov/lulesh.php)

•  S6513 - GPU Optimization of the Kripke Neutral-Particle Transport Mini-App, Thursday, 15:30 at Marriott Salon 3

Kernel                                         | CUDA (us) | OpenMP 4.0 (us)
Acceleration Calculation                       | 3.2       | 4.3
Apply Boundary Acceleration                    | 5.1       | 4.8
Position and Velocity Calculation              | 3.2, 4.8  | 4.1
Kinematics and Monotonic Gradient Calculation  | 17, 6.5   | 58, 40
Monotonic Region Calculation                   | 11        | 15
Apply Material Properties to Regions           | 92        | 102.8

Performance Analysis of OpenMP on a GPU Using a CORAL Proxy Application, Bercea et al., PMBS '15.

Page 21

•  Open source: download and installation instructions at ibm.biz/ykt-omp

•  Currently supports OpenMP 4.0, with offload to GPU
•  Open-source host runtime based on the Intel-contributed KMPC library
•  Open-source GPU runtime developed and contributed by IBM Research

•  Working on upstreaming the 4.5 implementation to Clang/LLVM

•  Contact: [email protected]

Compiler Availability and Roadmap

This work is partially supported by the CORAL project, LLNS Subcontract No. B604142.