
Page 1:

Porting LAMMPS to the Titan Supercomputer

W. Michael Brown
National Center for Computational Sciences
Oak Ridge National Laboratory

Titan Users and Developers Workshop (West Coast), January 31, 2013

Page 2:

LAMMPS

• Large-scale Atomic/Molecular Massively Parallel Simulator
  o http://lammps.sandia.gov
• General-purpose classical molecular dynamics software
  o Simulations for biology, materials science, granular, mesoscale, etc.
• Open source (GPL), with a user community
  o 1700 citations
  o 100K downloads, 25K mailing-list messages
  o 100 contributors to the code base
• Flexible and extensible:
  o 80% of the code base is add-ons by developers and users
  o styles: atom, pair, fix, compute, etc.

Page 3:

Force Fields

• Biomolecules: CHARMM, AMBER, OPLS, COMPASS (class 2), GROMACS, long-range Coulombics via PPPM, point dipoles, ...
• Polymers: all-atom, united-atom, coarse-grain (bead-spring FENE), bond-breaking, bond-forming, ...
• Materials: EAM/MEAM for metals, Buckingham, Morse, Yukawa, Tersoff, COMB, AIREBO, ReaxFF, GAP, ...
• Mesoscale: granular, DPD, SPH, PD, colloidal, aspherical, triangulated, ...

Page 4:

Hybrid Models

• Water/proteins on a metal/silica surface
• Metal-semiconductor interface
• Metal islands on an amorphous (LJ) substrate
• Specify 2 (or more) pair potentials:
  – A-A, B-B, A-B, etc.
• Hybrid in two ways:
  – potentials (pair, bond, etc.)
  – atom style (bio, metal, etc.)

Page 5:

Spatial Decomposition

• Physical domain divided into 3d boxes, one per processor
• Each proc computes forces on atoms in its box
  – using info from nearby procs
• Atoms "carry along" molecular topology
  – as they migrate to new procs
• Communication via
  – nearest-neighbor 6-way stencil
• Advantages:
  – communication scales sub-linearly, as (N/P)^(2/3), for large problems
  – memory is optimal, N/P

Page 6:

Developing a Strategy for Porting LAMMPS

• The primary developer of LAMMPS is Steve Plimpton at SNL
• The goal is to get efficient acceleration without modifying the LAMMPS core
  – Acceleration will be maintained with the main LAMMPS package
  – Compatibility with all LAMMPS features

"Debugging is at least twice as hard as writing the program in the first place. So if your code is as clever as you can possibly make it, then by definition you're not smart enough to debug it." -- Brian Kernighan

Page 7:

Classical Molecular Dynamics

• Time evolution of a system of particles
• Calculate the force on a particle due to other particles in the system as the gradient of the potential energy
  – 2-body and many-body potentials
  – Short-range forces are calculated using only particles within a spherical cutoff
  – For some models, additional long-range calculations must be performed
  – For molecular models, additional forces due to bonds
• Apply constraints, thermostats, barostats
• Do statistics calculations, I/O
• Time integration to find the new positions of particles at the next time step
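For concreteness (an illustrative standard form, not taken from the slide), the short-range pair contribution is often the Lennard-Jones 12-6 potential, and the force on particle i is the negative gradient of the potential energy summed over neighbors inside the cutoff r_c:

$$U_{\mathrm{LJ}}(r) = 4\varepsilon\left[\left(\frac{\sigma}{r}\right)^{12} - \left(\frac{\sigma}{r}\right)^{6}\right],\qquad
\mathbf{F}_i = -\nabla_{\mathbf{r}_i}\sum_{j \neq i,\ r_{ij}<r_c} U_{\mathrm{LJ}}(r_{ij})
= \sum_{j}\frac{24\varepsilon}{r_{ij}^{2}}\left[2\left(\frac{\sigma}{r_{ij}}\right)^{12}-\left(\frac{\sigma}{r_{ij}}\right)^{6}\right]\mathbf{r}_{ij}$$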

Page 8:

PPPM (Particle-Mesh Ewald)

• Long-range Coulombics are needed for many systems: charged polymers (polyelectrolytes), organic and biological molecules, ionic solids, oxides
• Hockney & Eastwood, Computer Simulation Using Particles (1988)
• Darden et al., J. Chem. Phys., 98, p. 10089 (1993)
• Same as Ewald, except the reciprocal-space part is evaluated via:
  • interpolate atomic charge to a 3d mesh
  • solve Poisson's equation on the mesh (FFTs)
  • interpolate E-fields back to atoms

Page 9:

Molecular Force Fields (Class 1 & 2)

• Higher-order cross terms
• Originated by
  – BioSym -> MSI -> Accelrys
  – www.accelrys.com
• Parameters available in their commercial MD code

$E(r) = K(r - r_0)^2$
$E(\theta) = K(\theta - \theta_0)^2$
$E(\phi) = K[1 + d\cos(n\phi)]$
$E(\psi) = K(\psi - \psi_0)^2$

Page 10:

Rhodopsin Simulation

# Rhodopsin model

units           real
neigh_modify    delay 5 every 1

atom_style      full
bond_style      harmonic
angle_style     charmm
dihedral_style  charmm
improper_style  harmonic
pair_style      lj/charmm/coul/long 8.0 10.0
pair_modify     mix arithmetic
kspace_style    pppm 1e-4

read_data       data.rhodo

fix             1 all shake 0.0001 5 0 m 1.0 a 232
fix             2 all npt temp 300.0 300.0 100.0 &
                z 0.0 0.0 1000.0 mtk no pchain 0 tchain 1

special_bonds   charmm

thermo          50
thermo_style    multi
timestep        2.0

run             100

[Chart: CPU loop time (s), 0-350, broken down by Pair, Bond, Neigh, Comm, Output, Other, Charge Spread, Field Solve, Force Interp]

Page 11:

Geryon Library

• Allows the same code to compile with the CUDA Runtime, CUDA Driver, and OpenCL APIs
• Simple classes allow for more succinct code than the CUDA Runtime
  – Change from one API to another by simply changing the namespace
  – Use multiple APIs in the same code
  – Lightweight (only include files, no build required)
  – Manage device query and selection
  – Simple vector and matrix containers
  – Simple routines for data copy and type casting
  – Simple routines for data I/O
  – Simple classes for managing device timing
  – Simple classes for managing kernel compilation and execution

http://users.nccs.gov/~wb8/geryon/index.htm
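The "switch backends by changing the namespace" idea can be illustrated with a tiny self-contained mock-up. This is not the Geryon API: the backend_* namespaces and the device_vector type below are hypothetical stand-ins that only demonstrate the pattern.

// Hypothetical mock-up of the namespace-selection pattern (not the Geryon API):
// switching the alias below retargets every call site to another backend.
#include <cstdio>
#include <cstddef>

namespace backend_cuda {
  struct device_vector {                    // would wrap cudaMalloc/cudaMemcpy
    void alloc(std::size_t n) { std::printf("CUDA alloc of %zu elements\n", n); }
  };
}
namespace backend_opencl {
  struct device_vector {                    // would wrap clCreateBuffer, etc.
    void alloc(std::size_t n) { std::printf("OpenCL alloc of %zu elements\n", n); }
  };
}

namespace gpu = backend_cuda;               // change one line to switch APIs

int main() {
  gpu::device_vector x;
  x.alloc(1024);                            // identical call site for either backend
  return 0;
}

In Geryon itself the same effect comes from selecting one of its API namespaces; the mock namespaces above only show why a single call site can serve several backends.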

[Chart: time relative to total CUDA time (0-1.4) for CUDA vs. OpenCL, broken down by Short Range, Force Interpolation, Charge Assignment, Particle Map, Data Transfer]

Page 12:

Accelerating Force Calculation

• Add a new "pair_style" in the same manner that new force models are added (e.g. lj/cut/gpu vs. lj/cut)
• Copy the atom data and neighbor list from the host
• Assign each particle to a thread; the thread accumulates the force over all of its neighbors
• Repack the neighbor list on the GPU for contiguous memory access by threads
• On the CPU, the force only needs to be calculated once for both particles in a pair
  – For shared-memory parallelism, this requires handling memory collisions
  – Collisions can be avoided by calculating the force twice, once for each particle in the pair
    • Doubles the amount of computation for the force
    • Doubles the size of the neighbor list
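The kernel below is a minimal sketch of this thread-per-atom scheme with a full (doubled) neighbor list, not the actual lj/cut/gpu kernel. It assumes a single atom type and a row-per-atom neighbor layout (nbor, numj, maxn), folds the 48*eps*sigma^12 / 24*eps*sigma^6 prefactors into lj1/lj2 (the usual LAMMPS convention), packs positions into a float4, and accumulates in double, anticipating the optimizations on the next slides.

#include <cuda_runtime.h>

// One thread per atom i; nbor[i*maxn + k] holds the k-th neighbor of i.
// Full (doubled) list: both i->j and j->i are stored, so no write collisions.
__global__ void lj_force_thread_per_atom(const float4 *x,   // xyz + type in .w (unused: single type)
                                         const int *nbor, const int *numj,
                                         int maxn, int nlocal,
                                         float cutsq, float lj1, float lj2,
                                         double3 *f)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= nlocal) return;

  float4 xi = x[i];
  double fx = 0.0, fy = 0.0, fz = 0.0;          // mixed precision: double accumulators

  for (int k = 0; k < numj[i]; ++k) {
    int j = nbor[i * maxn + k];
    float4 xj = x[j];
    float dx = xi.x - xj.x, dy = xi.y - xj.y, dz = xi.z - xj.z;
    float rsq = dx*dx + dy*dy + dz*dz;
    if (rsq < cutsq) {
      float r2i = 1.0f / rsq;
      float r6i = r2i * r2i * r2i;
      float fpair = r6i * (lj1 * r6i - lj2) * r2i;  // lj1 = 48*eps*sigma^12, lj2 = 24*eps*sigma^6
      fx += fpair * dx; fy += fpair * dy; fz += fpair * dz;
    }
  }
  f[i].x = fx; f[i].y = fy; f[i].z = fz;          // only thread i writes f[i]: no collisions
}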

Page 13:

Accelerating Force Calculation

• Random memory access is expensive!
  – Store the atom type with the position in a 4-vector
    • Better data alignment; reduces the number of random-access fetches
• Periodically sort data by location in the simulation box in order to improve data locality (see the sketch below)
• Mixed precision: perform computations in single precision with all accumulation and answer storage in double precision
• 50x M2070 speedup versus a single Istanbul core
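One hedged way to perform the periodic spatial sort (a sketch, not the GPU package's actual routine) is to key each atom by a flattened cell index and sort with Thrust; the bin size, bounds, and layout here are illustrative assumptions.

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/sequence.h>

// Compute a flattened cell index for each atom from its position.
__global__ void cell_index(const float4 *x, int n, float3 lo, float inv_bin,
                           int3 nbins, int *key)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  int cx = min(max((int)((x[i].x - lo.x) * inv_bin), 0), nbins.x - 1);
  int cy = min(max((int)((x[i].y - lo.y) * inv_bin), 0), nbins.y - 1);
  int cz = min(max((int)((x[i].z - lo.z) * inv_bin), 0), nbins.z - 1);
  key[i] = (cz * nbins.y + cy) * nbins.x + cx;
}

// Host side: sort atom indices by cell so nearby atoms become adjacent in memory.
void spatial_sort(const float4 *d_x, int n, float3 lo, float bin, int3 nbins,
                  thrust::device_vector<int> &order)
{
  thrust::device_vector<int> key(n);
  order.resize(n);
  thrust::sequence(order.begin(), order.end());      // 0, 1, ..., n-1
  cell_index<<<(n + 255) / 256, 256>>>(d_x, n, lo, 1.0f / bin, nbins,
                                       thrust::raw_pointer_cast(key.data()));
  thrust::sort_by_key(key.begin(), key.end(), order.begin());
  // 'order' now holds a locality-improving permutation to apply to the atom arrays.
}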

Page 14:

Accelerating Force Calculation

• 512 compute cores on Fermi
• Want many more threads than this to hide latencies
• With a thread-per-atom decomposition, that means lots of atoms per GPU
• Increase parallelism with multiple threads per atom
  – Tradeoff between increased parallelism and increased computation/synchronization
• For NVIDIA and AMD accelerators, synchronization can be avoided for n threads per atom if n is a power of 2 and n < C
• n depends on the model and the number of neighbors; LAMMPS will guess this for you
• Need to change the neighbor-list storage in order to get contiguous memory access for an arbitrary number of threads
  – Can increase the time spent on neighbor-list work, but is generally a win (see the reduction sketch below)
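A sketch of the multiple-threads-per-atom variant, with K lanes cooperating on each atom and a warp-level reduction of the partial forces. The __shfl_down_sync intrinsic shown is the modern (CUDA 9+) form rather than what was available in 2013, and the neighbor-list layout is the same assumed row-per-atom format as in the earlier sketch.

#include <cuda_runtime.h>

// K threads cooperate on each atom; K is a power of 2 and K <= warpSize,
// so the reduction can stay within a warp (no __syncthreads needed).
template <int K>
__global__ void lj_force_multi_thread(const float4 *x, const int *nbor,
                                      const int *numj, int maxn, int nlocal,
                                      float cutsq, float lj1, float lj2,
                                      double3 *f)
{
  int tid  = blockIdx.x * blockDim.x + threadIdx.x;
  int i    = tid / K;                 // atom handled by this thread group
  int lane = tid % K;                 // position within the group
  if (i >= nlocal) return;

  float4 xi = x[i];
  double fx = 0.0, fy = 0.0, fz = 0.0;

  // Neighbors of atom i are strided so the K lanes read contiguous memory.
  for (int k = lane; k < numj[i]; k += K) {
    float4 xj = x[nbor[i * maxn + k]];
    float dx = xi.x - xj.x, dy = xi.y - xj.y, dz = xi.z - xj.z;
    float rsq = dx*dx + dy*dy + dz*dz;
    if (rsq < cutsq) {
      float r2i = 1.0f / rsq, r6i = r2i * r2i * r2i;
      float fpair = r6i * (lj1 * r6i - lj2) * r2i;
      fx += fpair * dx; fy += fpair * dy; fz += fpair * dz;
    }
  }
  // Reduce the K partial forces within the warp via shuffles.
  unsigned mask = __activemask();
  for (int off = K / 2; off > 0; off >>= 1) {
    fx += __shfl_down_sync(mask, fx, off, K);
    fy += __shfl_down_sync(mask, fy, off, K);
    fz += __shfl_down_sync(mask, fz, off, K);
  }
  if (lane == 0) { f[i].x = fx; f[i].y = fy; f[i].z = fz; }
}

With 8 threads per atom, for example, it would be launched as lj_force_multi_thread<8><<<(8 * nlocal + 255) / 256, 256>>>(...).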

[Chart: wall time (log scale, 0.0625-128) vs. GPUs/CPU cores (1-32) for 1 thread per atom, 8 threads per atom, and CPU]

Page 15:

Accelerated Non-bonded Short-Range Potentials

• Single, mixed, and double precision support for:
  lj/cut, lj96/cut, lj/expand, lj/cut/coul/cut, lj/cut/coul/long, lj/cut/coul/debye, lj/cut/coul/dsf, lj/charmm/coul/long, lj/class2, lj/class2/coul/long, morse, cg/cmm, cg/cmm/coul/long, coul/dsf, coul/long, gayberne, resquared, gauss, buck/cut, buck/coul/cut, buck/coul/long, eam, eam/fs, eam/alloy, table, yukawa, born, born/coul/long, born/coul/wolf, colloid, dipole/cut, dipole/sf

Page 16:

Rhodopsin Simulation

# Rhodopsin model
newton off            # For GPU, we double the compute work
                      # to avoid memory collisions

units           real
neigh_modify    delay 5 every 1

atom_style      full
bond_style      harmonic
angle_style     charmm
dihedral_style  charmm
improper_style  harmonic
pair_style      lj/charmm/coul/long/gpu 8.0 10.0
pair_modify     mix arithmetic
kspace_style    pppm 1e-4

read_data       data.rhodo

fix             1 all shake 0.0001 5 0 m 1.0 a 232
fix             2 all npt temp 300.0 300.0 100.0 &
                z 0.0 0.0 1000.0 mtk no pchain 0 tchain 1

special_bonds   charmm

thermo          50
thermo_style    multi
timestep        2.0

run             100

[Chart: CPU loop time (s), 0-350, broken down by Pair, Bond, Neigh, Comm, Output, Other, Charge Spread, Field Solve, Force Interp]

Page 17:

Neighbor List Builds

• CPU time for neighbor-list builds can now become dominant with GPU acceleration
• Build neighbor lists on the GPU
  – In LAMMPS, neighbor lists are requested by the "pair_style"; just don't request a neighbor list and you are free to build it on the GPU
  – This is optional for several reasons
• First, calculate a cell list on the GPU
  – Requires binning: atomic operations or a sort (see the binning sketch below)
  – LAMMPS uses a radix sort to get determinism, or optionally performs the binning only on the CPU
• Calculate the Verlet list from the cell list
• Copy and transpose data for 1-2, 1-3, 1-4 interactions
• 8.4x M2070 speedup vs. a single Istanbul core (memory-bound)
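The atomic-operation binning path can be sketched as follows. This is not the actual GPU-package build (which prefers a radix sort when determinism matters, as noted above), and the fixed per-cell capacity cell_cap is a simplifying assumption.

#include <cuda_runtime.h>

// Count-then-place binning with atomics. The ordering within a cell depends on
// thread scheduling, which is why a radix sort is used when reproducibility matters.
__global__ void bin_atoms(const float4 *x, int n, float3 lo, float inv_bin,
                          int3 nbins, int cell_cap,
                          int *cell_count, int *cell_atoms)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  int cx = min(max((int)((x[i].x - lo.x) * inv_bin), 0), nbins.x - 1);
  int cy = min(max((int)((x[i].y - lo.y) * inv_bin), 0), nbins.y - 1);
  int cz = min(max((int)((x[i].z - lo.z) * inv_bin), 0), nbins.z - 1);
  int cell = (cz * nbins.y + cy) * nbins.x + cx;

  int slot = atomicAdd(&cell_count[cell], 1);   // non-deterministic slot order
  if (slot < cell_cap)                          // assume a fixed per-cell capacity
    cell_atoms[cell * cell_cap + slot] = i;
}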

[Chart: CPU loop time (s), 0-350, broken down by Pair, Bond, Neigh, Comm, Output, Other, Charge Spread, Field Solve, Force Interp]

Page 18:

Particle-Particle Particle-Mesh

• Most of the time is spent in charge spreading and force interpolation
• The field solve is a 3D FFT
  – Communication intensive
  – Can run with an arbitrary FFT library, including cuFFT
  – Generally, there is not enough work for GPU FFTs to be worth it
• Charge spreading
  – Can't use the CPU algorithm (efficiently)
  – First, map the particles to mesh points
    • Requires atomic operations to count the number of atoms at each point
    • A problem, since the particles are sorted spatially
  – Reindex the particles accessed by each thread to reduce the probability of memory collisions
    • Reduces kernel time by up to 75%

[Chart: CPU loop time (s), 0-350, broken down by Pair, Bond, Neigh, Comm, Output, Other, Charge Spread, Field Solve, Force Interp]

Page 19:

Particle-Particle Particle-Mesh

• Second, spread the charges onto the mesh using splines
  – A difficult stencil problem
    • Can't tile due to shared/local memory limitations
    • Use a pencil decomposition (on the GPU, within each subdomain assigned to the process)
  – Avoids atomic operations for charge spreading (atomics are still slower even with hardware support for floating-point atomics)
  – Contiguous memory access for the particle positions and charges mapped to the mesh
  – The tradeoff is that a significant amount of recomputation is required compared to the naive and CPU algorithms
• Assign multiple pencils to each multiprocessor to improve strong-scaling performance
  – This requires an increase in the number of global memory fetches, so there are limits
• 13.7x M2070 speedup versus a single Istanbul core (small number of mesh points)

Page 20:

Particle-Particle Particle-Mesh

• Force interpolation
  – Use the same algorithm as the CPU
    • Thread per atom
    • No memory collisions, etc.
• 27.8x speedup versus 1 Istanbul core
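A much-simplified sketch of the thread-per-atom gather: each thread only reads the mesh and writes its own atom's force, so no atomics are needed. Nearest-grid-point weights stand in for the spline interpolation PPPM actually uses, and the grid layout and field arrays are assumptions.

#include <cuda_runtime.h>

// Gather the electric field from the mesh back to each atom (NGP weights for
// brevity; real PPPM interpolates with the same spline stencil used for
// charge spreading). ex/ey/ez are the field components on an nx*ny*nz grid.
__global__ void field_to_atoms(const float4 *x, const float *q, int nlocal,
                               const float *ex, const float *ey, const float *ez,
                               float3 lo, float inv_h, int3 ngrid,
                               double3 *f)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= nlocal) return;

  int gx = min(max((int)((x[i].x - lo.x) * inv_h), 0), ngrid.x - 1);
  int gy = min(max((int)((x[i].y - lo.y) * inv_h), 0), ngrid.y - 1);
  int gz = min(max((int)((x[i].z - lo.z) * inv_h), 0), ngrid.z - 1);
  int g  = (gz * ngrid.y + gy) * ngrid.x + gx;

  // Read-only mesh access, private per-atom write: no memory collisions.
  f[i].x += q[i] * ex[g];
  f[i].y += q[i] * ey[g];
  f[i].z += q[i] * ez[g];
}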

Page 21:

Rhodopsin Simulation

# Rhodopsin model
newton off

units           real
neigh_modify    delay 5 every 1

atom_style      full
bond_style      harmonic
angle_style     charmm
dihedral_style  charmm
improper_style  harmonic
pair_style      lj/charmm/coul/long/gpu 8.0 10.0
pair_modify     mix arithmetic
kspace_style    pppm/gpu 1e-4

read_data       data.rhodo

fix             1 all shake 0.0001 5 0 m 1.0 a 232
fix             2 all npt temp 300.0 300.0 100.0 &
                z 0.0 0.0 1000.0 mtk no pchain 0 tchain 1

special_bonds   charmm

thermo          50
thermo_style    multi
timestep        2.0

run             100

[Chart: CPU loop time (s), 0-350, broken down by Pair, Bond, Neigh, Comm, Output, Other, Charge Spread, Field Solve, Force Interp]

Page 22:

CPU/GPU Concurrency

• The short-range, bond, angle, dihedral, improper, and k-space contributions to the forces, energies, and virials can be computed independently of each other
• Run the calculations not ported to the accelerator simultaneously on the CPU
  – In some cases, there is no possible gain from porting additional routines
• Can also divide the force calculation between the CPU and the accelerator with dynamic load balancing
• Can run hybrid models with one model calculated on the CPU and the other on the accelerator
• Add a new "fix" to LAMMPS with a hook to block and move accelerator data to the host before proceeding with time integration, etc.
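The concurrency pattern can be illustrated with a small host-side CUDA sketch (hypothetical kernel and routine names, not the LAMMPS fix itself): launch the accelerated pair computation asynchronously, do the unported bonded work on the CPU in the meantime, then block before time integration.

#include <cuda_runtime.h>

__global__ void pair_forces_gpu(double3 *f, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;   // placeholder: a real pair style
  if (i < n) { f[i].x = 0.0; f[i].y = 0.0; f[i].z = 0.0; }  // would accumulate forces here
}

void bond_angle_dihedral_cpu() { /* unported routines run on the host */ }

void timestep_force_phase(double3 *d_f, int nlocal, cudaStream_t stream)
{
  // 1. Kick off the accelerated short-range force kernel asynchronously.
  pair_forces_gpu<<<(nlocal + 255) / 256, 256, 0, stream>>>(d_f, nlocal);

  // 2. While the GPU works, compute the non-ported contributions on the CPU.
  bond_angle_dihedral_cpu();

  // 3. Block and bring accelerator data back before time integration
  //    (this is the role of the "fix" hook described above).
  cudaStreamSynchronize(stream);
  // cudaMemcpyAsync(...) of forces/virials would follow here.
}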

Page 23:

Rhodopsin Simulation

# Rhodopsin model
newton off
fix gpu force/neigh 0 0 1

units           real
neigh_modify    delay 5 every 1

atom_style      full
bond_style      harmonic
angle_style     charmm
dihedral_style  charmm
improper_style  harmonic
pair_style      lj/charmm/coul/long/gpu 8.0 10.0
pair_modify     mix arithmetic
kspace_style    pppm/gpu 1e-4

read_data       data.rhodo

fix             1 all shake 0.0001 5 0 m 1.0 a 232
fix             2 all npt temp 300.0 300.0 100.0 &
                z 0.0 0.0 1000.0 mtk no pchain 0 tchain 1

special_bonds   charmm

thermo          50
thermo_style    multi
timestep        2.0

run             100

[Chart: CPU loop time (s), 0-350, broken down by Pair, Bond, Neigh, Comm, Output, Other, Charge Spread, Field Solve, Force Interp]

Page 24:

8.6x

But we have 16 cores on the Interlagos chip and LAMMPS scales well...

[Chart: CPU vs. GPU loop time (s), 0-350, broken down by Pair (50x speedup), Bond, Neigh (8.4x speedup), Comm, Output, Other, Charge Spread (13.7x speedup), Field Solve, Force Interp (27.8x speedup), Data Cast, Data Transfer; overall 8.6x]

Page 25:

Need to parallelize code not ported to the accelerator on the CPU cores:

• Example: Rhodopsin benchmark on a box with 12 CPU cores and 1 GPU
  – Here, only the non-bonded "pair" force calculation is accelerated, as an example
  – Speedup on 1 CPU core with a GPU is only 2.6x with this approach (M2070 vs. Istanbul)
  – Speedup vs. 12 CPU cores is still > 2x because we use all 12 cores in addition to the GPU
• MPI processes share the GPU

[Chart: loop time (s), 0-250, for CPU vs. GPU+CPU at 1, 2, 4, 8, and 12 CPU cores, broken down by Pair, Bond, Kspace, Neigh, Comm, Output, Other]

Page 26:

Host-Accelerator Concurrency (without PPPM acceleration), not to scale

[Timeline diagram: the GPU performs Nbor (cores 1-4), Data Transfer In, Pair (cores 1-4), and Data Transfer Out, while each CPU core concurrently runs Pair, Bond, Long Range Electrostatics, and Other]

• Neighboring is not performed every timestep, only about 1 in 10-20.
• The split of the non-bonded force calculation between host and accelerator is adjusted automatically, or a fixed split can be specified (see the sketch below).
• Force models not ported to the GPU can run concurrently with models ported to the GPU (e.g. solvent on the GPU).
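A hedged host-side sketch of how such a dynamic split might be adjusted from measured per-step timings; the formula, damping factor, and bounds are illustrative and this is not the GPU package's actual balancing code.

#include <cstdio>

// Choose the fraction 'split' of the pair work sent to the accelerator so that
// its measured time matches the CPU's time for the remaining share.
double update_split(double split, double gpu_time, double cpu_time)
{
  const double relax = 0.25;                           // damping to avoid oscillation
  const double eps   = 1e-12;
  double rate_gpu = gpu_time / (split + eps);          // time per unit of pair work
  double rate_cpu = cpu_time / (1.0 - split + eps);
  double ideal    = rate_cpu / (rate_gpu + rate_cpu + eps);
  double next     = (1.0 - relax) * split + relax * ideal;
  if (next < 0.05) next = 0.05;                        // keep both devices busy
  if (next > 0.95) next = 0.95;
  return next;
}

int main() {
  double split = 0.5;                                  // start with an even split
  // Toy feedback loop with fixed measured timings (GPU faster than CPU here):
  for (int step = 0; step < 5; ++step)
    split = update_split(split, /*gpu_time=*/0.02, /*cpu_time=*/0.08);
  std::printf("converged split (fraction on GPU): %.3f\n", split);
  return 0;
}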

Page 27:

Host-Accelerator Concurrency (with PPPM acceleration)

[Timeline diagram: the GPU performs Nbor (cores 1-4), Data Transfer In, Pair (cores 1-4), Charge Spread (cores 1-4), Force Interp (cores 1-4), and Data Transfer Out, while each CPU core concurrently runs Pair, Bond, Other, and the Poisson Solve]

• Multiple kernels can run at the same time on some accelerators.

Page 28:

Minimizing the Code Burden

• Focus porting efforts on the routines that dominate the computational time and have high potential for speedup with accelerators
• Use concurrent CPU/GPU calculations where possible and overlap host-accelerator data transfer with computation
• Parallelize routines that are not ported on the CPU
• Allow a legacy code to take advantage of accelerators without rewriting the entire thing
• In some cases, there is no advantage to porting
  – Multi-core CPUs can be competitive with GPU accelerators for some routines (latency-bound, thread divergence, etc.), especially at lower particle counts
  – If concurrent CPU/GPU calculations are used effectively, there can be no advantage to porting certain routines
  – For some simulations, the upper bound for performance improvements from porting additional routines and removing data transfer is < 5%.

Page 29:

Optimizations to CPU Code for the XK6

• Using MPI-only on multicore nodes can impact performance at scale
  – Increased MPI communication times
    • MPI Cartesian topologies (e.g. MPI_Cart_create) are designed to allow vendors to optimize process mapping for grids, but typically do nothing to optimize for multicore nodes
    • Implemented a two-level domain factorization algorithm to minimize off-node communications
  – Increased K-space times for P3M calculations
    • Many processes are involved in effectively all-to-all communications for the 3D FFTs
    • Use separate process partitions for the K-space calculation (see the sketch below)
      – Implemented by Yuxing Peng and Chris Knight from the Voth Group at the University of Chicago
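A minimal sketch of the generic MPI mechanism for carving out a separate long-range partition: MPI_Comm_split with an illustrative 3:1 ratio of real-space to K-space ranks. The actual LAMMPS partitioning (the verlet/split run style) differs in detail.

#include <mpi.h>
#include <cstdio>

// Split COMM_WORLD into a "real-space" partition and a smaller "K-space"
// partition; the 1-in-4 ratio here is purely illustrative.
int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  int is_kspace = (rank >= (3 * size) / 4);        // last quarter does K-space
  MPI_Comm part;
  MPI_Comm_split(MPI_COMM_WORLD, is_kspace, rank, &part);

  if (is_kspace)
    std::printf("rank %d: long-range (P3M/FFT) partition\n", rank);
  else
    std::printf("rank %d: real-space (pair/bond) partition\n", rank);

  MPI_Comm_free(&part);
  MPI_Finalize();
  return 0;
}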

Page 30:

X2090 Benchmark Results

• The Jaguar upgrade to Titan consisted of:
  1. Upgrade XT5 to XK6 blades (AMD Interlagos processors and Gemini interconnect)
  2. Install the Titan development partition with Fermi+ GPUs on 10 cabinets
  3. Upgrade to XK7 nodes (installation of K20X Kepler II GPUs on all nodes)
• XK6 upgrade and Titan development results are presented first
• Early results with Kepler on Titan are presented at the end of the talk
• Benchmarks with acceleration used mixed precision, compared against double precision
• Left: strong scaling for fixed-size simulations of approximately 256K particles
• Right: weak scaling with ~32K particles/node
• XK6 NP results use a new MPI process-to-grid mapping and a separate partition for long-range electrostatics (as do the GPU results)

Brown, W. M., Nguyen, T. D., Fuentes-Cabrera, M., Fowlkes, J. D., Rack, P. D., Berger, M., Bland, A. S. An Evaluation of Molecular Dynamics Performance on the Hybrid Cray XK6 Supercomputer. Proceedings of the ICCS Workshop on Large Scale Computational Physics. 2012. Published in Procedia Computer Science, 2012. 9: p. 186-195.

Page 31:

Atomic Fluid

• Atomic fluid: microcanonical ensemble, Lennard-Jones potential, reduced density 0.8442, neighbor skin 0.3σ, cutoffs of 2.5σ and 5.0σ

XK6 single node: 1.17x; XK6 GPU single node: 3.03x, 5.68x
XK6 single node: 1.04x; XK6 GPU single node: 1.92x, 4.33x; XK6 GPU 512 nodes: 2.92x, 5.11x; NP gives up to 20% improvement

[Charts: time (s) vs. nodes for XT5, XK6, XK6 NP, and XK6 GPU]

Page 32:

Bulk Copper

• Copper metallic solid in the microcanonical ensemble. The force cutoff is 4.95 Å with a neighbor skin of 1.0 Å.
  – Requires an additional ghost exchange during the force calculation for electron densities

XK6 single node: 1.07x; XK6 GPU single node: 3.05x
XK6 single node: 1.0x; XK6 GPU single node: 2.12x; XK6 GPU 512 nodes: 2.90x; NP gives up to 12% improvement

[Charts: time (s) vs. nodes for XT5, XK6, XK6 NP, and XK6 GPU]

Page 33:

Protein

• All-atom rhodopsin protein in a solvated lipid bilayer with the CHARMM force field. Isothermal-isobaric ensemble, SHAKE constraints, 2.0 fs timestep. Counter-ions, water, 8 Å / 10 Å cutoff
  – Long-range electrostatics are calculated with P3M.

XK6 single node: 1.1x; XK6 GPU single node: 3.3x
XK6 single node: 1.1x; XK6 GPU single node: 2.6x; XK6 GPU 512 nodes: 8x; NP gives up to 27% improvement

[Charts: time (s) vs. nodes for XT5, XK6, XK6 NP, and XK6 GPU]

Page 34:

Liquid Crystal

• Liquid-crystal mesogens are represented with biaxial ellipsoid particles, Gay-Berne potential, isotropic phase, isothermal-isobaric ensemble, 4σ cutoff with a 0.8σ neighbor skin
  – High arithmetic intensity

XK6 single node: 0.9x; XK6 GPU single node: 6.23x
XK6 single node: 0.9x; XK6 GPU single node: 5.82x; XK6 GPU 512 nodes: 5.65x

[Charts: time (s) vs. nodes for XT5, XK6, XK6 NP, and XK6 GPU]

Page 35:

XK6 vs. XK6+GPU Benchmarks

Speedup with acceleration on XK6 nodes (1 node = 32K particles; 900 nodes = 29M particles):

                      Atomic Fluid    Atomic Fluid    Bulk Copper   Protein   Liquid Crystal
                      (cutoff 2.5σ)   (cutoff 5.0σ)
Speedup (1 node)      1.92            4.33            2.12          2.6       5.82
Speedup (900 nodes)   1.68            3.96            2.15          1.56      5.60

Page 36:

Example Science Problems 1

• Controlling the movement of nanoscale objects is a significant goal of nanotechnology.
• Pulsed laser melting offers a unique opportunity to dictate materials assembly, where rapid heating and cooling rates and ns melt lifetimes are achievable.
• Using both experiment and theory, we have investigated ways of controlling how the breakage occurs so as to control the assembly of metallic nanoparticles.
• Here, we illustrate MD simulation to investigate the evolution of the Rayleigh-Plateau liquid instability for copper lines deposited on a graphite substrate.
• Simulations are performed with GPU acceleration on Jaguar at the same scales as experiment.

Page 37:

11.4M Cu Atom Simulations on a Graphitic Substrate

[Image: 100 nm scale bar]

2.7x faster than 512 XK6 nodes without accelerators

Page 38:

Example Science Problems 2

• Membrane fusion, which involves the merging of two biological membranes in a controlled manner, is an integral part of the normal life cycle of all living organisms.
• Viruses responsible for human disease employ membrane fusion as an essential part of their reproduction cycle.
• Membrane fusion is a critical step in the function of the nervous system.
• Correct fusion dynamics requires realistic system sizes.
• Kohlmeyer / Klein INCITE Award: 39M-particle liposome system, 2.7x faster than 900 XK6 nodes without accelerators

Page 39:

The "Porting Wall"

• Although you can get performance improvements by porting existing models/algorithms and running simulations using traditional parameterizations...
  – There is a limit to the amount of parallelism that can be exposed to decrease the time-to-solution
  – It is increasingly desirable to re-evaluate computational physics methods and models with an eye towards approaches that allow for increased concurrency and data locality
    • Parameterize simulations to shift work towards routines well suited for the accelerator
    • Methods/models with increased computational requirements can perform better if they increase concurrency, allow for larger time steps, etc.
• Computational scientists will play a critical role in exploiting the performance of current and next-generation architectures
• Some very basic examples follow...

Page 40:

Electrostatics Example with Polyelectrolyte Brushes

• Electrostatics are typically solved by splitting the Coulomb potential into a short-range potential that decays to zero at the cutoff and a long-range potential that converges rapidly in k-space
  – The long-range component is typically solved with discretization on a mesh
    • Poisson solve with 3D FFTs
    • Communications bottleneck
  – The traditional parameterizations that work well on older high-performance computers are not necessarily optimal for many simulations
  – Shift the work towards the short-range component
    • Disproportionate performance improvements on the accelerator with larger short-range cutoffs
    • Benefits from acceleration improve at larger node counts!

[Chart: time (s), 0-10, vs. nodes (4-512) for p3m and p3m cpu]

Nguyen, T. D., Carrillo, J.-M., Dobrynin, A. V., Brown, W. M. A Case Study of Truncated Electrostatics for Simulation of Polyelectrolyte Brushes on GPU Accelerators. Journal of Chemical Theory and Computation, in press.

Page 41:

Electrostatics Example with Polyelectrolyte Brushes

• Eliminate the long-range component?
  – In condensed-phase systems, charges can be effectively short-range due to screening from other charged particles in the system
  – Use a large short-range cutoff, correct for issues due to non-neutral spherical cutoff regions, and modify the cell-list calculation to be efficient for large cutoffs in parallel simulations
  – Accurate results with faster time-to-solution for some simulations

Nguyen, T. D., Carrillo, J.-M., Dobrynin, A. V., Brown, W. M. A Case Study of Truncated Electrostatics for Simulation of Polyelectrolyte Brushes on GPU Accelerators. Journal of Chemical Theory and Computation, in press.
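For reference (standard form, not quoted from the slide), the damped shifted force (DSF) Coulomb pair potential that replaces the long-range solve in this approach is

$$E_{ij}(r) = q_i q_j\left[\frac{\operatorname{erfc}(\alpha r)}{r}-\frac{\operatorname{erfc}(\alpha r_c)}{r_c}+\left(\frac{\operatorname{erfc}(\alpha r_c)}{r_c^{2}}+\frac{2\alpha}{\sqrt{\pi}}\,\frac{e^{-\alpha^{2} r_c^{2}}}{r_c}\right)(r-r_c)\right],\qquad r<r_c,$$

so both the energy and the force go smoothly to zero at the cutoff r_c, with unit prefactors omitted.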

[Plots: brush velocity vs. shear stress comparing DSF (rc = 10σ, 20σ) with P3M at two normal pressures; time (s) vs. nodes for DSF GPU (rc = 10σ, 20σ, 30σ, 40σ), P3M GPU, and P3M CPU]

Page 42:

Alternative Potential Energy Models?

• Implementation of new models that can exploit high peak accelerator performance to improve accuracy and sampling
  – The computational cost of a single ellipsoid-ellipsoid interaction can be 15x that of Lennard-Jones on the CPU
  – With GPU acceleration, it is more competitive
    • Higher arithmetic intensity, so better speedup when compared to LJ acceleration
    • Better parallel efficiency relative to the CPU: still get performance with fewer threads

Morriss-Andrews, A., Rottler, J., Plotkin, S. S., JCP 132, 035105, 2010.
Mergell, B., Ejtehadi, M. R., Everaers, R., PRE 68, 021911 (2003).
Orsi, M., et al., J. Phys. Chem. B, Vol. 112, No. 3, 2008.

Page 43:

Moore's Law for Potentials

[Chart: cost (core-sec/atom-timestep, 10^-6 to 10^-2) vs. year published (1980-2010) for CHARMM, EAM, Stillinger-Weber, Tersoff, AIREBO, MEAM, ReaxFF, ReaxFF/C, eFF, COMB, EIM, REBO, BOP, GPT, GAP; the trend grows as O(2^(T/2))]

Page 44:

More Creative Examples...

• Advanced time integrators, time parallelism, etc.

Page 45:

Kepler II vs. Fermi+

• More cores
  – Efficiently handle larger problems; better performance for complex models
• Improved thread performance
  – Faster time-to-solution for the same problem
• Larger register file
  – Better performance for complex models such as the liquid crystal case
• Warp shuffle
  – Reduced shared memory usage for reductions across threads
• Hyper-Q
  – MPI processes sharing the GPU can share a context
    • Reduced memory overhead per process
    • Concurrent kernel execution from multiple processes

Page 46:

Early Kepler Benchmarks on Titan

Atomic Fluid and Bulk Copper

[Charts: time (s) vs. nodes (1 to 16384) for XK7+GPU, XK6, and XK6+GPU]

Page 47:

Early Kepler Benchmarks on Titan

Protein and Liquid Crystal

[Charts: time (s) vs. nodes (1 to 16384) for XK7+GPU, XK6, and XK6+GPU]

Page 48:

Early Titan XK6/XK7 Benchmarks

Speedup with acceleration on XK6/XK7 nodes (1 node = 32K particles; 900 nodes = 29M particles):

                   Atomic Fluid    Atomic Fluid    Bulk Copper   Protein   Liquid Crystal
                   (cutoff 2.5σ)   (cutoff 5.0σ)
XK6 (1 node)       1.92            4.33            2.12          2.6       5.82
XK7 (1 node)       2.90            8.38            3.66          3.36      15.70
XK6 (900 nodes)    1.68            3.96            2.15          1.56      5.60
XK7 (900 nodes)    2.75            7.48            2.86          1.95      10.14

Page 49:

Ongoing Work

• Implementation and evaluation of alternative algorithms for long-range electrostatics
  – Multigrid(-like) methods for the Poisson solve, etc.
    • O(N), no FFTs
• Implementation of complex models well suited for accelerators
• Improvements driven by specific science problems

Page 50:

Publications

• Brown, W. M., Wang, P., Plimpton, S. J., Tharrington, A. N. Implementing Molecular Dynamics on Hybrid High Performance Computers – Short Range Forces. Computer Physics Communications. 2011. 182: p. 898-911.
• Brown, W. M., Kohlmeyer, A., Plimpton, S. J., Tharrington, A. N. Implementing Molecular Dynamics on Hybrid High Performance Computers – Particle-Particle Particle-Mesh. Computer Physics Communications. 2012. 183: p. 449-459.
• Brown, W. M., Nguyen, T. D., Fuentes-Cabrera, M., Fowlkes, J. D., Rack, P. D., Berger, M., Bland, A. S. An Evaluation of Molecular Dynamics Performance on the Hybrid Cray XK6 Supercomputer. Proceedings of the ICCS Workshop on Large Scale Computational Physics. 2012. Published in Procedia Computer Science, 2012. 9: p. 186-195.
• Nguyen, T. D., Carrillo, J.-M., Dobrynin, A. V., Brown, W. M. A Case Study of Truncated Electrostatics for Simulation of Polyelectrolyte Brushes on GPU Accelerators. Journal of Chemical Theory and Computation. In press.

Page 51:

Acknowledgements

• LAMMPS
  – Steve Plimpton (SNL) and many others
• LAMMPS Accelerator Library
  – W. Michael Brown (ORNL), Trung Dac Nguyen (ORNL), Peng Wang (NVIDIA), Axel Kohlmeyer (Temple), Steve Plimpton (SNL), Inderaj Bains (NVIDIA)
• Geryon Library
  – W. Michael Brown (ORNL), Manuel Rodriguez Rodriguez (ORNL), Axel Kohlmeyer (Temple)
• K-Space Partitions
  – Yuxing Peng and Chris Knight (University of Chicago)
• Metallic Nanoparticle Dewetting
  – Miguel Fuentes-Cabrera (ORNL), Trung Dac Nguyen (ORNL), W. Michael Brown (ORNL), Jason Fowlkes (ORNL), Philip Rack (UT, ORNL)
• Lipid Vesicle Science and Simulations
  – Vincenzo Carnevale (Temple), Axel Kohlmeyer (Temple), Michael Klein (Temple), et al.
• Enhanced Truncation/Bottlebrush Simulations
  – Trung Dac Nguyen (ORNL), Jan-Michael Carrillo (UC), Andrew Dobrynin (UC), W. Michael Brown (ORNL)
• Multilevel Summation
  – Arnold Tharrington (ORNL)
• NVIDIA Support
  – Carl Ponder, Mark Berger