
Page 1:

Recent Activities in Programming Models and Runtime Systems at ANL

Rajeev Thakur, Deputy Director

Mathematics and Computer Science Division

Argonne National Laboratory

Joint work with Pavan Balaji, Darius Buntinas, Jim Dinan, Dave Goodell, Bill Gropp, Rusty Lusk, Marc Snir

Page 2:

Programming Models and Runtime Systems Efforts at Argonne

MPI and Low-level Data Movement Runtime Systems

High-level Tasking and Global Address Programming Models

Accelerators and Heterogeneous Computing Systems

2  

Page 3:

MPI at Exascale

We believe that MPI has a role to play even at exascale, particularly in the context of an MPI+X model

The investment in application codes and libraries written using MPI is on the order of hundreds of millions of dollars

Until a viable alternative to MPI for exascale is available, and applications have been ported to the new model, MPI must evolve to run as efficiently as possible on future systems

Argonne has long-term strength in all things related to MPI

3  

Page 4:

MPI and Low-level Runtime Systems

MPICH implementation of MPI
– Scalability of MPI to very large systems
• Scalability improvements to the MPI implementation
• Scalability to large numbers of threads
– MPI-3 algorithms and interfaces
• Remote memory access
• Fault tolerance
– Research on features for future MPI standards
• Interaction with accelerators
• Interaction with other programming models

Data movement models on complex processor and memory architectures
– Data movement for heterogeneous memory (accelerator memory, NVRAM)

4  

Page 5:

Pavan Balaji wins DOE Early Career Award

Pavan Balaji was recently awarded the highly competitive and prestigious DOE Early Career Award ($500K/yr for 5 years)

“Exploring Efficient Data Movement Strategies for Exascale Systems with Deep Memory Hierarchies”

5  

Page 6:

MPICH-based Implementations of MPI

IBM MPI for the Blue Gene/Q
– IBM has successfully scaled the LAMMPS application to over 3 million MPI ranks and submitted a special Gordon Bell prize entry
• So it's not true that MPI won't scale!
– Will be used on Livermore (Sequoia), Argonne (Mira), and other major BG/Q installations

Unified IBM implementation for BG and Power systems

Cray MPI for XE/XK-6
– On Blue Waters, Oak Ridge (Jaguar, Titan), NERSC (Hopper), HLRS Stuttgart, and other major Cray installations

Intel MPI for clusters

Microsoft MPI

Myricom MPI

MVAPICH2 from Ohio State

6  

Page 7:

11/17/10  7  

MPICH Collaborators/Partners

Core MPICH developers
– IBM, INRIA, Microsoft, Intel, University of Illinois, University of British Columbia

Derivative implementations
– Cray, Myricom, Ohio State University

Other collaborators
– Absoft, Pacific Northwest National Laboratory, QLogic, Queen's University (Canada), Totalview Technologies, University of Utah

Page 8:

Accelerators and Heterogeneous Computing Systems

Accelerator-augmented data movement models
– Integrated data movement models that can move data from any memory region of a given process to any other memory region of another process
• E.g., move data from the accelerator memory of one process to that of another
– Internal runtime optimizations for efficient data movement
– External programming model improvements for improved productivity

Virtualized accelerators
– Allow applications to (semi-)transparently utilize lightweight virtual accelerators instead of physical accelerators
– Backend improvements for performance, power, fault tolerance, etc.

8  

Page 9:

Recent Work

1. MPI-ACC (ANL, NCSU, VT)
– Integrate awareness of accelerator memory into MPICH2
– Productivity and performance benefits
– To be presented at the HPCC 2012 Conference, Liverpool, UK, June 2012

2. Virtual OpenCL (ANL, VT, SIAT CAS)
– OpenCL implementation that allows a program to use remote accelerators
– One-to-many model, better resource usage, load balancing, FT, …
– Published at CCGrid 2012, Ottawa, Canada, May 2012

3. Scioto-ACC (ANL, OSU, PNNL)
– Task parallel programming model, scalable runtime system
– Coscheduling of CPU and GPU, automatic data movement

9  

Page 10:

Current MPI+GPU Programming

MPI operates on data in host memory only

Manual copies between host and GPU memory serialize the PCIe and interconnect transfers
– Can do better than this, but will incur protocol overheads multiple times

Productivity: manual data movement

Performance: inefficient, unless there is a large, non-portable investment in tuning
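To make the staging pattern concrete, here is a minimal hedged sketch (illustrative names, assuming CUDA and a host staging buffer; not code from the slides) of how GPU-resident data is typically sent today:

#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

/* Manual host staging: copy device data to a host buffer, then hand the
 * host buffer to MPI. The PCIe copy and the network transfer run back to
 * back rather than overlapping. */
void send_gpu_buffer(const void *dev_buf, size_t bytes, int dest, MPI_Comm comm)
{
    void *host_buf = malloc(bytes);
    cudaMemcpy(host_buf, dev_buf, bytes, cudaMemcpyDeviceToHost);
    MPI_Send(host_buf, (int)bytes, MPI_BYTE, dest, 0 /* tag */, comm);
    free(host_buf);
}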

10  

Page 11:

MPI-ACC Interface: Passing GPU Buffers to MPI

[Figure: measured overhead vs. data size for the candidate designs – basic MPI, MPI with an explicit parameter check, MPI with a datatype attribute check, and MPI with automatic detection.]

Unified Virtual Address (UVA) space
– Allows device pointers to be passed to MPI routines directly
– Currently supported only by CUDA and newer NVIDIA GPUs
– Query cost is high and is added to every operation (CPU–CPU)

Explicit interface
– e.g., MPI_CUDA_Send(…)

MPI Datatypes
– Compatible with MPI and many accelerator models

11  

[Figure: UVA lookup overhead (CPU–CPU), and an MPI datatype carrying accelerator attributes such as CL_Context, CL_Mem, CL_Device_ID, and CL_Cmd_queue.]
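As an illustration of the datatype-attribute idea (a hedged sketch, not the MPI-ACC implementation; buffer_attrs_t and attach_cl_attrs are hypothetical names), OpenCL buffer information can be cached on a datatype with standard MPI attribute keyvals, so the library can recognize a device buffer without a per-call UVA query:

#include <mpi.h>
#include <CL/cl.h>
#include <stdlib.h>

/* Illustrative attribute record describing an OpenCL buffer. */
typedef struct {
    cl_context       ctx;
    cl_mem           mem;
    cl_device_id     dev;
    cl_command_queue queue;
} buffer_attrs_t;

static int buffer_keyval = MPI_KEYVAL_INVALID;

/* Attach accelerator information to a datatype so that an MPI library
 * aware of this keyval could recognize a device buffer cheaply. */
void attach_cl_attrs(MPI_Datatype type, const buffer_attrs_t *attrs)
{
    if (buffer_keyval == MPI_KEYVAL_INVALID)
        MPI_Type_create_keyval(MPI_TYPE_NULL_COPY_FN, MPI_TYPE_NULL_DELETE_FN,
                               &buffer_keyval, NULL);

    buffer_attrs_t *copy = malloc(sizeof(*copy));
    *copy = *attrs;
    MPI_Type_set_attr(type, buffer_keyval, copy);
}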

Page 12:

MPI-ACC: Integrated, Optimized Data Movement

Use MPI for all data movement
– Support multiple accelerators and programming models (CUDA, OpenCL, …)
– Allow the application to portably leverage system-specific optimizations

Inter-node data movement (a pipelining sketch follows this list):
– Pipelining: fully utilize PCIe and network links
– GPUDirect (CUDA): multi-device pinning eliminates data copying
– Handle caching (OpenCL): avoid expensive command queue creation

Intra-node data movement:
– Shared memory protocol
• Sender and receiver drive independent DMA transfers
– Direct DMA protocol
• GPU–GPU DMA transfer (CUDA IPC)
– Both protocols are needed; PCIe limitations serialize DMA across I/O hubs
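A minimal sketch of the pipelining idea, assuming CUDA, pinned staging buffers, and an illustrative 1 MiB chunk size (this is not the MPICH internal code): the device-to-host copy of one chunk overlaps the network send of the previous chunk.

#include <mpi.h>
#include <cuda_runtime.h>

#define CHUNK (1 << 20)   /* 1 MiB pipeline stage (illustrative) */

/* Overlap PCIe copies with network transfers by double buffering. */
void pipelined_send(const char *dev_buf, size_t bytes, int dest, MPI_Comm comm)
{
    char *stage[2];
    cudaStream_t stream[2];
    MPI_Request req[2] = { MPI_REQUEST_NULL, MPI_REQUEST_NULL };

    for (int i = 0; i < 2; i++) {
        cudaMallocHost((void **)&stage[i], CHUNK);   /* pinned staging buffers */
        cudaStreamCreate(&stream[i]);
    }

    for (size_t off = 0, n = 0; off < bytes; off += CHUNK, n++) {
        int b = n % 2;
        size_t len = (bytes - off < CHUNK) ? bytes - off : CHUNK;

        MPI_Wait(&req[b], MPI_STATUS_IGNORE);        /* is buffer b free again? */
        cudaMemcpyAsync(stage[b], dev_buf + off, len,
                        cudaMemcpyDeviceToHost, stream[b]);
        cudaStreamSynchronize(stream[b]);            /* chunk is now on the host */
        MPI_Isend(stage[b], (int)len, MPI_BYTE, dest, 0, comm, &req[b]);
    }

    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    for (int i = 0; i < 2; i++) {
        cudaStreamDestroy(stream[i]);
        cudaFreeHost(stage[i]);
    }
}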

12  

Page 13:

Integrated Support for User-Defined Datatypes

MPI_Send(buffer, count, datatype, dest, …)
MPI_Recv(buffer, count, datatype, source, …)

What if the datatype is noncontiguous?

CUDA doesn't support arbitrary noncontiguous transfers

Pack data on the GPU
– Flatten the datatype tree representation
– Packing kernel that can saturate the memory bus/banks
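For reference, a hedged example (sizes and offsets are illustrative, not taken from the benchmark below) of describing a 4D subarray with a standard MPI derived datatype; MPI-ACC's contribution is packing such a type directly on the GPU:

#include <mpi.h>

/* Describe a 16x16x16x16 block of a 32x32x32x32 array of doubles.
 * Sizes and starting offsets here are illustrative. */
MPI_Datatype make_4d_subarray(void)
{
    int sizes[4]    = {32, 32, 32, 32};
    int subsizes[4] = {16, 16, 16, 16};
    int starts[4]   = { 8,  8,  8,  8};
    MPI_Datatype subarray;

    MPI_Type_create_subarray(4, sizes, subsizes, starts,
                             MPI_ORDER_C, MPI_DOUBLE, &subarray);
    MPI_Type_commit(&subarray);
    return subarray;   /* usable directly in MPI_Send/MPI_Recv */
}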

13  

[Figure: GPU packing time (microseconds) vs. packed buffer size (2^6 to 2^22 bytes) for a 4D subarray, comparing the packing kernel (Pack), manual packing (Manual), and CUDA.]

Page 14:

Inter-node GPU-GPU Latency

[Figure: inter-node GPU–GPU latency (us) vs. data size (16 bytes to 64 KB).]

Pipelining (PCIe + IB) pays off for large messages – roughly 2/3 of the latency

14  

Page 15:

GPUs as a Service: Virtual OpenCL

Clusters and cloud systems provide GPUs on a subset of nodes

OpenCL provides a powerful abstraction
– Kernels are compiled on the fly, i.e., at the device
– Enables transparent virtualization, even across different devices

Support GPUs as a service
– One-to-many: one process can drive many GPUs
– Resource utilization: share a GPU across applications, use hybrid nodes
– System management, fault tolerance: transparent migration

15  

Page 16:

Virtual OpenCL (VOCL) Framework Components

VOCL library (local) and proxy process (system service)

API and ABI compatible with OpenCL – transparent to the application

Utilize both local and remote GPUs
– Local GPUs: call native OpenCL functions
– Remote GPUs: the runtime uses MPI to forward function calls to the proxy
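Because VOCL preserves the OpenCL API and ABI, an unmodified host program such as the following sketch (standard OpenCL calls only; whether the selected GPU is local or remote is decided by the VOCL runtime, not by the code) can run on remote GPUs:

#include <CL/cl.h>

/* Standard OpenCL discovery and context creation; when linked against the
 * VOCL library instead of the native OpenCL library, the same calls may be
 * forwarded over MPI to a proxy on a GPU-equipped node. */
int init_first_gpu(cl_context *ctx_out, cl_command_queue *q_out)
{
    cl_platform_id platform;
    cl_device_id device;
    cl_int err;

    err = clGetPlatformIDs(1, &platform, NULL);
    if (err != CL_SUCCESS) return -1;
    err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    if (err != CL_SUCCESS) return -1;

    *ctx_out = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    if (err != CL_SUCCESS) return -1;
    *q_out = clCreateCommandQueue(*ctx_out, device, 0, &err);
    return (err == CL_SUCCESS) ? 0 : -1;
}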

[Figure: VOCL framework – on the local node, the application issues OpenCL API calls to the VOCL library; local GPUs are served through the native OpenCL library, while calls targeting virtual GPUs (VGPUs) are forwarded to a proxy process on the remote node that drives the physical GPUs.]

16  

Page 17:

High-level Tasking and Global Address Programming Models

Led by Jim Dinan

High-level tasking execution model for performing computations in dynamic environments
– Programming model interface and implementation
– Multiple programming model implementations for this tasking model to work with MPI, PGAS models, etc.

Interoperable global address space and tasking models
– Unified runtime infrastructure

Benchmarking
– Unbalanced Tree Search family of benchmarks

17  

Page 18:

Scioto-ACC: Accelerated, Scalable Collections of Task Objects

Express computation as a collection of tasks
– Tasks operate on data in the global address space (GA, MPI-RMA, …)
– Executed in collective task-parallel phases

The Scioto runtime system manages task execution
– Load balancing, locality optimization, fault resilience
– Mapping to accelerator/CPU, data movement

[Figure: execution on Process 0 … Process n alternates between SPMD phases and collective task-parallel phases, each task-parallel phase ending with termination.]

Led  by  Jim  Dinan  

Page 19:

Other Recent Papers

“Scalable Distributed Consensus to Support MPI Fault Tolerance”
– Darius Buntinas, IPDPS 2012, Shanghai, China, May 2012
– Scalable algorithm for a set of MPI processes to collectively determine which subset of processes among them have failed

“Supporting the Global Arrays PGAS Model Using MPI One-Sided Communication”
– James Dinan, Pavan Balaji, Jeff Hammond, Sriram Krishnamoorthy, Vinod Tipparaju, IPDPS 2012, Shanghai, China, May 2012
– Implements the ARMCI interface over MPI one-sided communication

“Efficient Multithreaded Context ID Allocation in MPI”
– James Dinan, David Goodell, William Gropp, Rajeev Thakur, Pavan Balaji, submitted to EuroMPI 2012
– Clever algorithm for multithreaded context id generation for MPI_Comm_create_group (new function in MPI-3)

19  

Page 20:

Non-Collective Communicator Creation

Create a communicator collectively only on its new members

Global Arrays process groups
– Past: collectives implemented using MPI Send/Recv

Overhead reduction
– Multi-level parallelism
– Small communicators when the parent is large

Recovery from failures
– Not all ranks in the parent can participate

Load balancing

20  

 

Page 21:

Non-Collective Communicator Creation Algorithm (using MPI 2.2 functionality)

21  

[Figure: starting from MPI_COMM_SELF (an intracommunicator) on each of ranks 0–7, adjacent groups are repeatedly connected with MPI_Intercomm_create (forming an intercommunicator) and combined with MPI_Intercomm_merge, doubling the group size each round until one intracommunicator covers all members.]
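A hedged C sketch of this log-step construction (function and variable names are illustrative, not from the slides): each member starts from MPI_COMM_SELF and repeatedly merges with the adjacent subgroup through the parent communicator.

#include <mpi.h>

/* "ranks" is the ordered list of parent-communicator ranks forming the new
 * group; "me" is this process's index in that list; "tag" keeps concurrent
 * creations separate. Returns an intracommunicator over the group. */
MPI_Comm group_create(MPI_Comm parent, const int *ranks, int n, int me, int tag)
{
    MPI_Comm comm, inter, merged;
    MPI_Comm_dup(MPI_COMM_SELF, &comm);

    for (int span = 1; span < n; span *= 2) {
        int group   = me / span;                 /* index of my current subgroup   */
        int first   = (group / 2) * 2 * span;    /* start of the left subgroup     */
        int is_left = (group % 2 == 0);
        int partner = is_left ? first + span : first;  /* partner group's leader   */

        if (partner >= n)                        /* no right-hand partner this round */
            continue;

        /* Rank 0 of "comm" is the lowest-index member of my subgroup; the remote
         * leader is identified by its rank in the parent communicator. */
        MPI_Intercomm_create(comm, 0, parent, ranks[partner], tag, &inter);
        MPI_Intercomm_merge(inter, is_left ? 0 : 1, &merged);
        MPI_Comm_free(&inter);
        MPI_Comm_free(&comm);
        comm = merged;
    }
    return comm;
}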

Page 22:

Non-Collective Algorithm in Detail


22  

[Figure: in each step, group ranks are translated to an ordered list of ranks on the parent communicator, each process calculates its group ID, and the left group is merged with the right group.]

Page 23:

Evaluation: Microbenchmark

[Figure: communicator creation time (msec) vs. group size (1 to 4096 processes), comparing the group-collective algorithm with MPI_Comm_create.]

23  

Annotated complexities: O(log² g) for the group-collective algorithm and O(log p) for MPI_Comm_create.

Page 24:

New MPI-3 Function MPI_Comm_create_group

MPI_Comm_create_group(MPI_Comm in, MPI_Group grp, int tag, MPI_Comm *out)
– Collective only over the processes in “grp”
– The tag allows for safe communication and distinction between concurrent calls

Eliminates the O(log g) factor by a direct implementation
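A short usage sketch (the helper name and the choice of MPI_COMM_WORLD as the parent are illustrative): only the listed ranks call the function, and the tag separates concurrent creations.

#include <mpi.h>

/* Create a communicator over an arbitrary subset of MPI_COMM_WORLD without
 * involving the excluded ranks. Only processes listed in "members" call this. */
MPI_Comm subset_comm(const int *members, int n, int tag)
{
    MPI_Group world_grp, sub_grp;
    MPI_Comm newcomm;

    MPI_Comm_group(MPI_COMM_WORLD, &world_grp);
    MPI_Group_incl(world_grp, n, members, &sub_grp);
    MPI_Comm_create_group(MPI_COMM_WORLD, sub_grp, tag, &newcomm);

    MPI_Group_free(&sub_grp);
    MPI_Group_free(&world_grp);
    return newcomm;
}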

24  

Page 25:

Deadlock scenario in existing multithreaded context id generation algorithm in MPICH2

25  

Page 26:

Avoiding the deadlock

Simple solution: add a barrier on entry to the algorithm

However, this adds synchronization overhead

Better solution (a sketch follows the figure below)
– Use the synchronization to attempt to acquire a context id
– Partition the context id space into two segments: eager and base
– Instead of the barrier, do an Allreduce on the eager segment
– If eager allocation fails, the allreduce acts like a barrier, and the context id is acquired in the second call to allreduce
– If eager allocation succeeds (which in most cases it will), the overhead of an additional barrier is avoided

26  

[Figure: the context id space is partitioned into an eager segment and a base segment.]
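A hedged sketch of the eager scheme, simplified to a single 32-bit mask per segment and to process-level allocation (the real MPICH2 algorithm also coordinates threads within a process; names are illustrative):

#include <mpi.h>
#include <stdint.h>

static int lowest_set_bit(uint32_t mask)
{
    for (int i = 0; i < 32; i++)
        if (mask & (1u << i)) return i;
    return -1;
}

/* Each process contributes a bitmask of the eager-segment context ids it has
 * free; a bitwise-AND allreduce yields the ids free on every process. If one
 * exists, allocation finishes in this single allreduce. Otherwise the call has
 * merely acted as the barrier of the simple solution, and a second allreduce
 * over the base segment completes the allocation. */
int alloc_context_id(MPI_Comm comm, uint32_t my_free_eager, uint32_t my_free_base)
{
    uint32_t common;

    MPI_Allreduce(&my_free_eager, &common, 1, MPI_UINT32_T, MPI_BAND, comm);
    if (common != 0)
        return lowest_set_bit(common);           /* eager allocation succeeded */

    MPI_Allreduce(&my_free_base, &common, 1, MPI_UINT32_T, MPI_BAND, comm);
    return (common != 0) ? 32 + lowest_set_bit(common) : -1;  /* -1: exhausted */
}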

Page 27:

Performance

27  

Page 28:

Global Arrays over MPI One-Sided

28  

Page 29:

Global Arrays, a Global-View Data Model

Distributed, shared multidimensional arrays
– Aggregate the memory of multiple nodes into a global data space
– The programmer controls data distribution and can exploit locality

One-sided data access: Get/Put({i, j, k}…{i', j', k'})

NWChem data management: large coefficient tables (100 GB+)
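A hedged usage sketch assuming the Global Arrays C bindings (ga.h); the array shape and index ranges are illustrative:

#include <ga.h>

/* Create a 2-D global array of doubles and access a block of it one-sidedly.
 * Any process can Get/Put any patch, regardless of where it is stored. */
void ga_example(void)
{
    int dims[2]  = {1000, 1000};
    int chunk[2] = {-1, -1};                 /* let GA choose the distribution   */
    int g_a = NGA_Create(C_DBL, 2, dims, (char *)"coeffs", chunk);

    double buf[10][10];
    int lo[2] = {100, 200}, hi[2] = {109, 209}, ld[1] = {10};

    NGA_Get(g_a, lo, hi, buf, ld);           /* one-sided read of a 10x10 patch  */
    /* ... update buf locally ... */
    NGA_Put(g_a, lo, hi, buf, ld);           /* one-sided write back             */

    GA_Destroy(g_a);
}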

29  

[Figure: a shared global address space holds a multidimensional array X[M][M][N] distributed across Proc0, Proc1, …, Procn, each of which also has private memory; any process can access a patch such as X[1..9][1..9][1..9].]

Page 30:

ARMCI: The Aggregate Remote Memory Copy Interface

GA runtime system
– Manages shared memory
– Provides portability
– Native implementation

One-sided communication
– Get, put, accumulate, …
– Load/store on local data
– Noncontiguous operations

Mutexes, atomics, collectives, processor groups, …

Location-consistent data access
– I see my operations in issue order

30  

[Figure: a GA_Put({x,y},{x',y'}) on the global array is translated into ARMCI_PutS(rank, addr, …) operations on the processes (0–3) that own the affected patches.]

Page 31:

Implementing ARMCI

ARMCI support
– Natively implemented
– Sparse vendor support
– Implementations lag systems

MPI is ubiquitous
– Has supported one-sided communication for 15 years

Goal: use MPI RMA to implement ARMCI
1. Portable one-sided communication for NWChem users
2. MPI-2: drive implementation performance, one-sided tools
3. MPI-3: motivate features
4. Interoperability: increase the resources available to the application
• ARMCI and MPI share progress, buffer pinning, and network and host resources

Challenge: mismatch between MPI RMA and ARMCI (a sketch of the underlying RMA operations follows this list)
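As a hedged illustration of the MPI-2 RMA operations this layering builds on (not the actual ARMCI-MPI code; window creation is omitted and the helper name is illustrative), a blocking put can be expressed as lock–put–unlock:

#include <mpi.h>

/* One-sided put into another process's window using MPI-2 RMA.
 * The shared lock permits concurrent accesses from different origins; the
 * unlock completes the transfer at both origin and target, giving the
 * blocking, location-consistent behavior ARMCI callers expect. */
void rma_put(MPI_Win win, int target_rank, MPI_Aint target_disp,
             const void *src, int count)
{
    MPI_Win_lock(MPI_LOCK_SHARED, target_rank, 0, win);
    MPI_Put((void *)src, count, MPI_BYTE, target_rank, target_disp,
            count, MPI_BYTE, win);
    MPI_Win_unlock(target_rank, win);
}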

31  

[Figure: two software stacks for NWChem over Global Arrays – the native stack, where ARMCI sits directly on the native network layer, and the ARMCI-MPI stack, where ARMCI is layered on MPI.]

Page 32:

ARMCI/MPI-RMA Mismatch

1. Shared data segment representation
– MPI: <window, displacement>, window rank
– ARMCI: <address>, absolute rank
→ Perform translation (see the sketch below)

2. Data consistency model
– ARMCI: relaxed, location consistent for RMA, concurrent conflicting accesses (CCA) undefined
– MPI: explicit (lock and unlock), CCA is an error
→ Explicitly maintain consistency, avoid CCA
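A hedged sketch of the translation step (the lookup structure is illustrative, not the ARMCI-MPI implementation): each allocation records its MPI window plus every rank's base address, so an ARMCI-style <absolute rank, address> pair can be mapped to MPI's <window, displacement>.

#include <mpi.h>
#include <stddef.h>

/* One allocation: an MPI window plus the base address and size that each
 * rank registered for it (gathered at allocation time). */
typedef struct {
    MPI_Win   win;
    MPI_Aint *base;    /* base[rank] = start address on that rank   */
    size_t   *size;    /* size[rank] = length of that rank's segment */
    int       nranks;
} alloc_t;

/* Map an ARMCI-style (rank, address) pair to the (window, displacement)
 * form required by MPI RMA. Returns 0 on success, -1 if out of range. */
int translate(const alloc_t *a, int rank, MPI_Aint addr,
              MPI_Win *win_out, MPI_Aint *disp_out)
{
    if (rank < 0 || rank >= a->nranks)
        return -1;
    if (addr < a->base[rank] || addr >= a->base[rank] + (MPI_Aint)a->size[rank])
        return -1;                        /* address not in this allocation */

    *win_out  = a->win;
    *disp_out = addr - a->base[rank];     /* displacement from the window base */
    return 0;
}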

32  

Page 33:

NWChem Performance (XE6)

33  

[Figure: NWChem CCSD and (T) execution time (minutes) on a Cray XE6 from 744 to 5952 cores, comparing ARMCI-MPI with native ARMCI (series: ARMCI-MPI CCSD, ARMCI-Native CCSD, ARMCI-MPI (T), ARMCI-Native (T)).]

Page 34:

NWChem Performance (BG/P)

34  

[Figure: NWChem CCSD execution time (minutes) on Blue Gene/P up to 1024 nodes, comparing ARMCI-MPI CCSD with ARMCI-Native CCSD.]

Page 35:

An idea and a proposal in need of funding

35  

Page 36:

[Figure: proposed exascale software architecture. Domain-Specific Language code, old library codes, old MPI application codes, and new codes flow through translators (leveraged technology): Eclipse refactoring plug-ins, DSL-to-PM translation (ROSE), PM-to-DDM translation (ROSE, Orio), and DDM-to-DEE translation (DAGuE). Major new components include the DXS runtime (Charm++, DAGuE), the DXS library, and low-level system services for power management and resilience. Supporting elements include Swift, DPLASMA, Nemesis, cost models (Pbound), performance tools (TAU, Exposé), miniapps, instrumented hardware (real or emulated), and commercial compilers.]

36  

ANL, UIUC, UTexas, UTK, UOregon, LLNL