
Page 1: Advanced Programming of ManyCore Systems

Advanced Programming of ManyCore Systems
CAPS OpenACC Compilers
Stéphane BIHAN
NVIDIA GTC, March 20, 2013, San Jose, California

Page 2: Advanced Programming of ManyCore Systems

About CAPS

Page 3: Advanced Programming of ManyCore Systems

CAPS Mission and Offer

Mission: help you break the parallel wall and harness the power of manycore computing.

§ Software programming tools
o CAPS Manycore Compilers
o CAPS Workbench: set of development tools

§ CAPS engineering services
o Port applications to manycore systems
o Fine-tune parallel applications
o Recommend machine configurations
o Diagnose application parallelism

§ Training sessions
o OpenACC, CUDA, OpenCL
o OpenMP/pthreads

Based in Rennes, France. Created in 2002. More than 10 years of expertise. About 30 employees.

Page 4: Advanced Programming of ManyCore Systems

CAPS Worldwide

Headquarters - FRANCE
Immeuble CAP Nord, 4A Allée Marie Berhaut
35000 Rennes, France
Tel.: +33 (0)2 22 51 16 00
[email protected]

CAPS - USA
26 O'Farell St, Suite 500
San Francisco, CA 94108
[email protected]

CAPS - CHINA
Suite E2, 30/F, JuneYao International Plaza
789 Zhaojiabang Road, Shanghai 200032
Tel.: +86 21 3363 0057
[email protected]

Page 5: Advanced Programming of ManyCore Systems

CAPS Ecosystem

They trust us. Resellers and partners:
ü Main chip makers
ü Hardware manufacturers
ü Major software leaders

Page 6: Advanced Programming of ManyCore Systems

Many-Core Technology Insights


Page 7: Advanced Programming of ManyCore Systems

Various Many-Core Paths

AMD Discrete GPU
• Large number of small cores
• Data parallelism is key
• PCIe connection to the CPU

AMD APU
• Integrated CPU+GPU cores
• Targets power-efficient devices at this stage
• Shared memory system with partitions

Intel Many Integrated Cores (MIC)
• 50+ x86 cores
• Supports conventional programming
• Vectorization is key
• Runs as an accelerator or standalone

NVIDIA GPU
• Large number of small cores
• Data parallelism is key
• Supports nested and dynamic parallelism
• PCIe to host CPU or low-power ARM CPU (CARMA)

Page 8: Advanced Programming of ManyCore Systems

Many-Core Programming Models

Applications can build on three levels:
• Libraries: just use as is
• Directives (OpenACC, OpenHMPP): standard and high level
• Programming languages (CUDA, OpenCL, …): maximum performance, but also maximum maintenance cost

CAPS Compilers: portability and performance.

Page 9: Advanced Programming of ManyCore Systems

Directive-based Programming

§ OpenACC: consortium created in 2011 by CAPS, CRAY, NVIDIA and PGI
o New members in 2012: Allinea, Georgia Tech, ORNL, TU-Dresden, University of Houston, Rogue Wave
o Momentum with broad adoption
o Version 2 to be released

§ OpenHMPP: created by CAPS in 2007 and adopted by PathScale
o Advanced features
• Multiple devices
• Native and accelerated libraries with directives and proxy
• Tuning directives for loop optimizations, use of device features, …
• Ground for innovation in CAPS
o Can interoperate with OpenACC

§ OpenMP 4 with extensions for accelerators
o Release candidate 2 available
o To be released in the coming months

Page 10: Advanced Programming of ManyCore Systems

CAPS Compilers

§ Source-to-source compilers
o Use your preferred native x86 and hardware vendor compilers

[Compilation flow: OpenHMPP/OpenACC-annotated source code goes through a preprocessor and the CAPS Manycore Compiler. The application source code is built by a standard compiler (Intel, GNU, Absoft, …) into the host application, while back-end generators produce NVIDIA CUDA or OpenCL target source code that a target compiler turns into the accelerated parts. The CAPS Runtime and the target driver tie both together on the many-core architecture (NVIDIA/AMD GPUs, Intel MIC).]

Page 11: Advanced Programming of ManyCore Systems

OpenACC and OpenHMPP Directive-based Programming


Page 12: Advanced Programming of ManyCore Systems

Simply Accelerate with OpenACC

#pragma acc kernels
{
  #pragma acc loop independent
  for (int i = 0; i < n; ++i){
    for (int j = 0; j < n; ++j){
      for (int k = 0; k < n; ++k){
        B[i][j*k%n] = A[i][j*k%n];
      }
    }
  }
}

#pragma acc parallel
{
  #pragma acc loop
  for (int i = 0; i < n; ++i){
    #pragma acc loop
    for (int j = 0; j < n; ++j){
      B[i][j] = A[i][j];
    }
  }
  for (int k = 0; k < n; k++){
    #pragma acc loop
    for (int i = 0; i < n; ++i){
      #pragma acc loop
      for (int j = 0; j < n; ++j){
        C[k][i][j] = B[k-1][i+1][j] + …;
      }
    }
  }
}

Each loop nest in a kernels construct becomes a different kernel, and the kernels are launched in order on the device. Add the independent clause to disambiguate pointers.

The entire parallel construct becomes a single kernel. Add loop directives to distribute the work amongst the threads (execution is redundant otherwise).

Page 13: Advanced Programming of ManyCore Systems

#pragma acc kernels
{
  #pragma acc loop independent
  for (int i = 0; i < n; ++i){
    for (int j = 0; j < n; ++j){
      for (int k = 0; k < n; ++k){
        B[i][j*k%n] = A[i][j*k%n];
      }
    }
  }
  #pragma acc loop gang(NB) worker(NT)
  for (int i = 0; i < n; ++i){
    #pragma acc loop vector(NI)
    for (int j = 0; j < m; ++j){
      B[i][j] = i * j * A[i][j];
    }
  }
}

Either let the compiler map the parallel loops onto the accelerator or configure the mapping manually.

Gang, Workers and Vector in a Grid
• Gang: the grid contains 'gang' blocks
• Worker: each block contains 'worker' threads
• Vector: the number of inner loop iterations executed by each thread
(illustrated below)
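To make the mapping concrete, here is a minimal sketch (the sizes 128/4/32 are illustrative placeholders, not tuned values) that requests the grid configuration explicitly on a parallel construct:

/* Illustrative sizes only: 128 gangs (blocks), 4 workers per gang and a
   vector length of 32; the i loop is spread across gangs, the j loop
   across workers and vector lanes within each gang. */
#pragma acc parallel num_gangs(128) num_workers(4) vector_length(32)
{
    #pragma acc loop gang
    for (int i = 0; i < n; ++i) {
        #pragma acc loop worker vector
        for (int j = 0; j < m; ++j) {
            B[i][j] = A[i][j];
        }
    }
}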

Page 14: Advanced Programming of ManyCore Systems

OpenHMPP Codelet/Callsite Directives

§ Accelerate function calls only

#pragma hmpp sgemm codelet, target=CUDA:OPENCL, args[vout].io=inout, &
#pragma hmpp & args[*].transfer=auto
extern void sgemm( int m, int n, int k, float alpha,
                   const float vin1[n][n], const float vin2[n][n],
                   float beta, float vout[n][n] );

int main(int argc, char **argv) {
    /* . . . */
    for( j = 0 ; j < 2 ; j++ ) {
        #pragma hmpp sgemm callsite
        sgemm( size, size, size, alpha, vin1, vin2, beta, vout );
    }
    /* . . . */
}

The codelet directive declares CUDA and OpenCL codelets; the callsite directive makes a synchronous codelet call.

Page 15: Advanced Programming of ManyCore Systems

Managing Data

#pragma acc data copyin(A[1:N-2]), copyout(B[N])
{
  #pragma acc kernels
  {
    #pragma acc loop independent
    for (int i = 0; i < N; ++i){
      A[i][0] = ...;
      A[i][M - 1] = 0.0f;
    }
    ...
  }
  #pragma acc update host(A)
  ...
  #pragma acc kernels
  for (int i = 0; i < n; ++i){
    B[i] = ...;
  }
}

The data construct defines a region with in/out clauses; the update directive performs an explicit copy from device to host. Use the create clause in a data construct for accelerator-local arrays (a minimal sketch follows).
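For example, an accelerator-local temporary array could be handled like this (an illustrative sketch, not from the talk; the array tmp and the loop bodies are made up):

float *tmp = malloc(N * sizeof(float));
#pragma acc data copyin(A[0:N]) copyout(B[0:N]) create(tmp[0:N])
{
    #pragma acc kernels
    {
        #pragma acc loop independent
        for (int i = 0; i < N; ++i)
            tmp[i] = 2.0f * A[i];      /* tmp exists only on the device */
        #pragma acc loop independent
        for (int i = 0; i < N; ++i)
            B[i] = tmp[i] + A[i];      /* no host copy of tmp is transferred */
    }
}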

Page 16: Advanced Programming of ManyCore Systems

What's New in OpenACC 2.0

§ Clarifications of some OpenACC 1.0 features
§ Improved asynchronism
o WAIT clause on PARALLEL, KERNELS, UPDATE and WAIT directives
o ASYNC clause on WAIT so queues can wait for each other in a non-blocking way for the host
§ ENTER DATA and EXIT DATA directives
o Equivalent to the DATA directive but without syntactic constraints
§ New TILE clause on ACC LOOP to decompose loops into tiles before worksharing
§ Lots of new API calls
§ Support for multiple types of devices
§ ROUTINE directive for function calls within PARALLEL and KERNELS regions
§ Simplification of the use of global data within routines
§ Nested parallelism: a parallel or kernels region may contain another parallel or kernels region

A sketch of two of these additions follows.
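The following is a minimal, illustrative sketch (not taken from the talk; the function names are hypothetical) of unstructured data lifetimes with ENTER DATA/EXIT DATA and of a device-callable function marked with ROUTINE:

#pragma acc routine seq
float scale(float x, float alpha) { return alpha * x; }

void upload(float *A, int n)
{
    /* Unstructured data lifetime: A stays resident on the device
       until the matching exit data, across function boundaries. */
    #pragma acc enter data copyin(A[0:n])
}

void compute(float *A, int n, float alpha)
{
    #pragma acc parallel loop present(A[0:n])
    for (int i = 0; i < n; ++i)
        A[i] = scale(A[i], alpha);   /* call permitted by the routine directive */
}

void download(float *A, int n)
{
    #pragma acc exit data copyout(A[0:n])
}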


Page 17: Advanced Programming of ManyCore Systems

OpenHMPP Advanced Programming


Page 18: Advanced Programming of ManyCore Systems

What's in OpenHMPP that is not in OpenACC?

§ Use specialized libraries with directives
§ Zero-copy transfers on APUs
§ Tuning directives
§ Use native kernels

Page 19: Advanced Programming of ManyCore Systems

Using Accelerated Libraries with OpenHMPP

§ A single OpenHMPP directive before the call to a library routine
o Preserves the original source code
o Makes CPU and specialized libraries co-exist through a proxy
o Uses the most efficient library depending on data size, for instance

[Diagram: application calls such as 'call CPU_blas(…)' and 'call CPU_fft(…)', preceded by a #pragma hmppalt directive, are routed through the CAPS proxy, which dispatches to either the CPU library or the GPU library.]

Page 20: Advanced Programming of ManyCore Systems

Zero-copy Transfers on AMD APU

§ AMD APUs have a single system memory with a GPU partition
o Avoid duplication of memory objects between CPU and GPU
o Avoid the copies required for memory coherency between CPU and GPU
o Minimizing data movement is no longer required to get performance

#pragma hmpp replacealloc="cl_mem_alloc_host_ptr"
  float * A = malloc(N * sizeof(float));
  float * B = malloc(N * sizeof(float));
#pragma hmpp replacealloc="host"

void foo(int n, float* A, float* B){
  int i;
#pragma acc kernels, pcopy(A[:n],B[:n])
#pragma acc loop independent
  for (i = 0; i < n; ++i)
  {
#pragma hmppcg memspace openclalloc (A,B)
    A[i] = -i + A[i];
    B[i] = 2*i + A[i];
  }
}

Page 21: Advanced Programming of ManyCore Systems

OpenHMPP Tuning Directives

§ Apply loop transformations
o Loop unrolling, blocking, tiling, permutation, …
§ Map computations onto accelerators
o Control gridification

#pragma hmppcg(CUDA) gridify(j,i), blocksize "64x1"
#pragma hmppcg(CUDA) unroll(8), jam, split, noremainder
for( i = 0 ; i < n; i++ ) {
    int j;
#pragma hmppcg(CUDA) unroll(4), jam(i), noremainder
    for( j = 0 ; j < n; j++ ) {
        int k;
        double prod = 0.0f;
        for( k = 0 ; k < n; k++ ) {
            prod += VA(k,i) * VB(j,k);
        }
        VC(j,i) = alpha * prod + beta * VC(j,i);
    }
}

The gridify directive with blocksize "64x1" requests a 1D gridification using 64 threads per block; the unroll, jam, split and noremainder directives apply loop transformations.

Page 22: Advanced Programming of ManyCore Systems

Native CUDA/OpenCL Functions Support

§ Integration of handwritten CUDA and OpenCL code with directives

kernel.c:

float f1( float x, float y, float alpha, float pow )
{
    float res;
    res = alpha * cosf(x);
    res += powf(y, pow);
    return sqrtf(res);
}

#pragma hmpp mycdlt codelet, …
void myfunc( int size, float alpha, float pow,
             float *v1, float *v2, float *res)
{
    int i;
    float t[size], temp;
#pragma hmppcg include (<native.cu>)
#pragma hmppcg native (f1)
    for( i = 0 ; i < size ; i++ )
        res[i] = f1( v1[i], v2[i], alpha, pow );
}
…

native.cu:

__device__ float f1( float x, float y, float alpha, float pow )
{
    float res;
    res = alpha * __cosf(x);
    res += __powf(y, pow);
    return __fsqrt_rn(res);
}

The include directive pulls in the file containing the native function definition, and the native directive declares 'f1' as a native function. The CUDA version uses GPU intrinsics (e.g. __fsqrt_rn for the square root).

Page 23: Advanced Programming of ManyCore Systems

Auto Tuning with CAPS Compilers


Page 24: Advanced Programming of ManyCore Systems

Auto Tuning Principles

§ Make applications dynamically adapt to various accelerator architectures
o Find an efficient mapping of kernels onto the computational grid
o Different types of accelerators (NVIDIA/AMD GPUs, Intel MIC, …)
o Different versions of an architecture (Fermi, Kepler)

§ Explore the optimization space
o Determine the architecture-efficient number of gangs, workers and vector size
o Select the optimum kernel variant

Page 25: Advanced Programming of ManyCore Systems

CAPS Auto-tuning Mechanism

§ Auto-tuning plugin attached to the CAPS manycore application
o Selects different numbers of gangs/workers or kernel variants at runtime
o Executes and records kernel performance within the application
o Once exploration has completed, uses the best values found for subsequent executions (a minimal sketch of this explore-then-exploit behaviour follows below)

[Chart: kernel execution time over successive runs on CARMA and on Fermi and Kepler GPUs, showing an exploration phase followed by a steady phase; the best configurations found were Gang: 14 / Worker: 16 and Gang: 256 / Worker: 128.]
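Here is a minimal sketch of this explore-then-exploit behaviour in plain C (not the CAPS plugin itself; launch_kernel and the candidate table are hypothetical): each candidate gang/worker pair is timed once, then the fastest is reused for the remaining runs.

#include <stdio.h>
#include <time.h>

#define NVARIANTS 4

/* Hypothetical gang/worker candidates to explore. */
static const int gangs[NVARIANTS]   = { 14, 28, 128, 256 };
static const int workers[NVARIANTS] = { 16, 16,  64, 128 };

/* Stand-in for launching the accelerated kernel with a given mapping. */
extern void launch_kernel(int num_gangs, int num_workers);

void run_autotuned(int total_runs)
{
    double best_time = 1e30;
    int best = 0;

    for (int run = 0; run < total_runs; ++run) {
        /* Exploration phase: one timed run per candidate variant;
           steady phase: keep reusing the fastest variant found. */
        int v = (run < NVARIANTS) ? run : best;

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        launch_kernel(gangs[v], workers[v]);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double elapsed = (t1.tv_sec - t0.tv_sec)
                       + (t1.tv_nsec - t0.tv_nsec) * 1e-9;

        if (run < NVARIANTS && elapsed < best_time) {
            best_time = elapsed;
            best = v;
        }
        printf("run %d: gangs=%d workers=%d time=%.3f s\n",
               run, gangs[v], workers[v], elapsed);
    }
}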

Page 26: Advanced Programming of ManyCore Systems

Dynamic Mapping of Kernels on a Grid

size_t gangsVariant[]   = { 14, 14, 14, 14, 14, 14, 14, 14, 28, 28, 28, 28, 28, 28, 28, 28 };
size_t workersVariant[] = {  2,  4,  6,  8, 10, 12, 14, 16,  2,  4,  6,  8, 10, 12, 14, 16 };

int filterVariantSelector = variantSelectorState(__FILE__ ":32", 15);

int gangs   = gangsVariant[filterVariantSelector];
int workers = workersVariant[filterVariantSelector];

#pragma acc parallel                                                          \
    copy(h_vvs[0:spp * spp]),                                                 \
    copyin( h_sitevalues[0:spp * endsite * 4], weightrat[0:endsite])          \
    num_gangs(gangs), num_workers(workers), vector_length(32)
{
    int i, j, x, y;
    TYPE sitevalue10;
    TYPE sitevalue11;
    TYPE sitevalue12;
    ...

The gangsVariant/workersVariant arrays hold the gang and worker values to explore; variantSelectorState retrieves the variant to run, and the num_gangs/num_workers clauses set that variant's values for the kernel launch.

Page 27: Advanced Programming of ManyCore Systems

Gang and Vector Auto-tuning Results

[Charts: performance (lower is better) vs. run # on NVIDIA Kepler, AMD Trinity APU, Intel MIC and AMD Radeon; the best configurations found include Gangs: 300 / Vectors: 128, Gangs: 100 / Vectors: 256, Gangs: 200 / Vectors: 256 and Gangs: 200 / Vectors: 16.]

Page 28: Advanced Programming of ManyCore Systems

Conclusion

Page 29: Advanced Programming of ManyCore Systems

Conclusion

§ Many-core becomes ubiquitous
o Various accelerator architectures
o Still co-processors, but moving toward on-die integration

§ Programming models converge
o Momentum behind the OpenACC directive-based standard
o The key point is portability

§ New auto-tuning mechanisms
o A way to achieve performance together with portability (not portable performance!)
o Portable performance is a trade-off: how much performance you are willing to lose versus the effort to fine-tune

Page 30: Advanced Programming of ManyCore Systems

CAPS Compilers

                                           | Workstation License                                            | Server License
                                           | Starter Edition | Medium Edition    | Full Edition             | Full Edition
Front ends (C/Fortran/C++)                 | C or Fortran    | C, Fortran or C++ | All                      | All
Programming directives (OpenHMPP/OpenACC)  | OpenACC         | All               | All                      | All
Targets (CUDA/OpenCL)                      | CUDA or OpenCL  | CUDA or OpenCL    | All                      | All
License type                               | User-based, node-locked | User-based, node-locked | User-based, node-locked | Floating
Support                                    | Bronze          | Silver            | Gold                     | Gold
Price                                      | 199,00 €        | 399,00 €          | 999,00 €                 | 2990,00 € per token (1290,00 € EDU)

Available on store.caps-entreprise.com. Contact: stephane.bihan@caps-entreprise.com
Visit the CAPS knowledge base: http://kb.caps-entreprise.com/

Page 31: Advanced Programming of ManyCore Systems

Free 30-day CAPS OpenACC Trial Version


www.nvidia.eu/openacc  

Page 32: Advanced Programming of ManyCore Systems

Keywords: accelerator programming model, directive-based programming, parallel computing, OpenHMPP, OpenACC, GPGPU, many-core programming, parallelization, HPC, OpenCL, code speedup, NVIDIA CUDA, high performance computing, CAPS Compilers, CAPS Workbench, portability, performance

Visit the CAPS website: www.caps-entreprise.com