investigating the performance portability …...investigating the performance portability...

21
Investigating the Performance Portability Capabilities of OpenMP 4.0, Kokkos and RAJA Using TeaLeaf and other mini-apps to assess the performance portability of modern parallel programming models Ma# Mar&neau - UoB (m.mar&[email protected]) Simon McIntosh-Smith - UoB ([email protected]) Wayne Gaudin – UK Atomic Weapons Establishment

Upload: others

Post on 20-Jun-2020

14 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Investigating the Performance Portability …...Investigating the Performance Portability Capabilities of OpenMP 4.0, Kokkos and RAJA Using TeaLeaf and other mini-apps to assess the

Investigating the Performance Portability Capabilities of OpenMP 4.0, Kokkos and RAJA

UsingTeaLeafandothermini-appstoassesstheperformanceportabilityofmodernparallelprogrammingmodels

Ma#Mar&neau-UoB(m.mar&[email protected])SimonMcIntosh-Smith-UoB([email protected])WayneGaudin–UKAtomicWeaponsEstablishment

Page 2: Investigating the Performance Portability …...Investigating the Performance Portability Capabilities of OpenMP 4.0, Kokkos and RAJA Using TeaLeaf and other mini-apps to assess the

Agenda

•  Whichmini-apps?

•  Howdoyouprogramwitheachmodel?

•  Whichmodelsperformbest?

•  Conclusions

Matt Martineau, Simon McIntosh-Smith, Wayne Gaudin

Page 3: Investigating the Performance Portability …...Investigating the Performance Portability Capabilities of OpenMP 4.0, Kokkos and RAJA Using TeaLeaf and other mini-apps to assess the

Mini-apps

Matt Martineau, Simon McIntosh-Smith, Wayne Gaudin

TeaLeaf (2d) Implicit,sparse,matrix-freesolversforheatconduc&onequa&ononstructuredgrid,memorybandwidthboundSolvers:ConjugateGradient(CG),Chebyshev,Precondi&onedPolynomialCG(PPCG) CloverLeaf (2d) Lagrangian-Eulerian hydrodynamics – explicit solver on structured grid, memory bandwidth bound

Bristol University Docking Engine (BUDE) Molecular docking benchmark that uses an Evolutionary Monte Carlo technique, compute bound

Page 4: Investigating the Performance Portability …...Investigating the Performance Portability Capabilities of OpenMP 4.0, Kokkos and RAJA Using TeaLeaf and other mini-apps to assess the

•  Newportswithemergingparallelprogrammingmodels:§  TeaLeaf:Kokkos,RAJA,OpenMP4.0,OpenACC§  CloverLeaf:OpenACC,OpenMP4.0§  BUDE:OpenACC,OpenMP4.0

•  Developedoru&lisedexis&ngportstogaugeperformancebounds:§  OpenCL,CUDA,OpenMP3.0

Matt Martineau, Simon McIntosh-Smith, Wayne Gaudin

The Porting Process

Page 5: Investigating the Performance Portability …...Investigating the Performance Portability Capabilities of OpenMP 4.0, Kokkos and RAJA Using TeaLeaf and other mini-apps to assess the

Matt Martineau, Simon McIntosh-Smith, Wayne Gaudin

CUDA Code Sample

Page 6: Investigating the Performance Portability …...Investigating the Performance Portability Capabilities of OpenMP 4.0, Kokkos and RAJA Using TeaLeaf and other mini-apps to assess the

Matt Martineau, Simon McIntosh-Smith, Wayne Gaudin

OpenMP 4.0 Code Sample

Page 7: Investigating the Performance Portability …...Investigating the Performance Portability Capabilities of OpenMP 4.0, Kokkos and RAJA Using TeaLeaf and other mini-apps to assess the

Matt Martineau, Simon McIntosh-Smith, Wayne Gaudin

OpenMP 4.0 Alternatives

Page 8: Investigating the Performance Portability …...Investigating the Performance Portability Capabilities of OpenMP 4.0, Kokkos and RAJA Using TeaLeaf and other mini-apps to assess the

Matt Martineau, Simon McIntosh-Smith, Wayne Gaudin

RAJA Code Sample

Page 9: Investigating the Performance Portability …...Investigating the Performance Portability Capabilities of OpenMP 4.0, Kokkos and RAJA Using TeaLeaf and other mini-apps to assess the

Matt Martineau, Simon McIntosh-Smith, Wayne Gaudin

Kokkos Code Sample

Page 10: Investigating the Performance Portability …...Investigating the Performance Portability Capabilities of OpenMP 4.0, Kokkos and RAJA Using TeaLeaf and other mini-apps to assess the

•  PerformancetestedonCPU,GPU,andKNC

•  Singlenodeonly(mul&-nodescalingproven)

•  Allportswereop&misedasmuchaspossible,whilstaimingforperformanceportability

•  Solved4096x4096problem,thepointofmeshconvergence,forsingleiteraIon

The Performance Experiment

Matt Martineau, Simon McIntosh-Smith, Wayne Gaudin

Page 11: Investigating the Performance Portability …...Investigating the Performance Portability Capabilities of OpenMP 4.0, Kokkos and RAJA Using TeaLeaf and other mini-apps to assess the

Matt Martineau, Simon McIntosh-Smith, Wayne Gaudin

TeaLeaf - CPU

At most 12% runtime penalty for modern Intel CPU, nearly identical relative performance seen with Ivy Bridge

Intel Compilers 16.0.1 (OpenMP, Kokkos, RAJA), and PGI Compilers 15.10 (OpenACC)

Page 12: Investigating the Performance Portability …...Investigating the Performance Portability Capabilities of OpenMP 4.0, Kokkos and RAJA Using TeaLeaf and other mini-apps to assess the

Matt Martineau, Simon McIntosh-Smith, Wayne Gaudin

TeaLeaf – Power8

Results generally good, and particularly for optimised Kokkos Nested and RAJA Box versions

GCC 4.9.1 (OpenMP, Kokkos, RAJA)

Page 13: Investigating the Performance Portability …...Investigating the Performance Portability Capabilities of OpenMP 4.0, Kokkos and RAJA Using TeaLeaf and other mini-apps to assess the

Matt Martineau, Simon McIntosh-Smith, Wayne Gaudin

TeaLeaf – GPU

Performance bug with CG again present in some cases Shows that all portable models have a small runtime penalty

CUDA Toolkit 7.5 (All), PGI 15.10 (OpenACC), CCE 8.4.4 (OpenMP 4.0)

Page 14: Investigating the Performance Portability …...Investigating the Performance Portability Capabilities of OpenMP 4.0, Kokkos and RAJA Using TeaLeaf and other mini-apps to assess the

Matt Martineau, Simon McIntosh-Smith, Wayne Gaudin

TeaLeaf – KNC

Achieving good performance is more challenging on KNC Vectorisation was a very important factor

Intel Compilers 16.0.1 (OpenMP, RAJA, Kokkos), Intel OpenCL (OpenCL)

Page 15: Investigating the Performance Portability …...Investigating the Performance Portability Capabilities of OpenMP 4.0, Kokkos and RAJA Using TeaLeaf and other mini-apps to assess the

Matt Martineau, Simon McIntosh-Smith, Wayne Gaudin

BUDE – CPU & GPU

Required at least 2.2x runtime, shared optimisation increased this even further

Page 16: Investigating the Performance Portability …...Investigating the Performance Portability Capabilities of OpenMP 4.0, Kokkos and RAJA Using TeaLeaf and other mini-apps to assess the

Matt Martineau, Simon McIntosh-Smith, Wayne Gaudin

CloverLeaf – CPU & GPU

At most 1.3x performance penalty, re-emphasises good performance of directive-based models for such problems

Page 17: Investigating the Performance Portability …...Investigating the Performance Portability Capabilities of OpenMP 4.0, Kokkos and RAJA Using TeaLeaf and other mini-apps to assess the

Conclusions

•  Therearealreadylotsofmaturingmodels

•  Youcanbalanceperformanceandcomplexity

•  Theperformanceportablemodelscanachieveperformancethatisclosetolow-levelAPIs

•  Yourbestchoicemightdependon:•  Produc&vityandpoten&alfortuning•  Thedominantlanguageofyourapplica&ons

Matt Martineau, Simon McIntosh-Smith, Wayne Gaudin

Page 18: Investigating the Performance Portability …...Investigating the Performance Portability Capabilities of OpenMP 4.0, Kokkos and RAJA Using TeaLeaf and other mini-apps to assess the

Acknowledgements

•  DavidBeckingsale,JeffKeasler,RichHornung,LLNL,forhelpwithRAJAport

•  CarterEdwards,Chris&anTro#,Sandia,forhelpwithKokkosport

•  TheIntelParallelCompu&ngCenter(IPCC)atBristol

•  AlistairHartandCrayInc.forsupportandprovisionoftheCrayXC40Swansupercomputer

•  EPSRCforfundingtheresearch

Matt Martineau, Simon McIntosh-Smith, Wayne Gaudin

Page 19: Investigating the Performance Portability …...Investigating the Performance Portability Capabilities of OpenMP 4.0, Kokkos and RAJA Using TeaLeaf and other mini-apps to assess the

Any Questions?

Evaluating OpenMP 4.0's Effectiveness as a Heterogeneous Parallel Programming Model Martineau, M., McIntosh-Smith, S. & Gaudin, W. Presenting at HIPS Workshop (IPDPS), Chicago, May 2016

Matt Martineau, Simon McIntosh-Smith, Wayne Gaudin

Mini-apps including TeaLeaf, CloverLeaf, SNAP (for GPUs), and BUDE https://github.com/UK-MAC/ https://github.com/UoB-HPC/

Optimising Sparse Iterative Solvers for Many-Core Computer Architectures Boulton, M., McIntosh-Smith S., Gaudin, W., & Garrett, P. Presented at UKMAC’14, Cambridge, December 2014

Assessing the Performance Portability of Modern Parallel Programming Models using TeaLeaf Martineau, M., McIntosh-Smith, S. & Gaudin, W. Submitted to Concurrency and Computation: Practice and Experience (April 2016)

Page 20: Investigating the Performance Portability …...Investigating the Performance Portability Capabilities of OpenMP 4.0, Kokkos and RAJA Using TeaLeaf and other mini-apps to assess the

Extra Slides

Matt Martineau, Simon McIntosh-Smith, Wayne Gaudin

Page 21: Investigating the Performance Portability …...Investigating the Performance Portability Capabilities of OpenMP 4.0, Kokkos and RAJA Using TeaLeaf and other mini-apps to assess the

Matt Martineau, Simon McIntosh-Smith, Wayne Gaudin

SNAP

•  High dimensionality and sweep presents an interesting difficulty for many models

•  We were not able to accelerate with CCE on GPU because of use of indirection arrays

•  OpenACC using PGI does successfully achieve within 50% runtime of hand-optimised CUDA implementation