Space Charge with PyHEADTAIL and PyPIC on the GPU
Stefan Hegglin and Adrian Oeftiger
Space Charge Working Group meeting – 29.10.2015


TRANSCRIPT

Overview
1. PIC: Reminder
2. Implementation / Parallelisation Approach
3. Results

Motivation
- self-consistent space charge models: the particle-in-cell (PIC) algorithm is the dominating time consumer in simulations
- parallelisation is challenging (PIC is a memory-bound algorithm, i.e. few FLOP/byte)

Output
- we parallelised PIC on the GPU (graphics processing unit)
- PyPIC: PIC algorithms in a shared Python library
- 2.5D (slice-by-slice transverse) and full 3D model
- much higher resolution possible, suppressing noise issues
  (courtesy: F. Kesting, GSI, https://eventbooking.stfc.ac.uk/uploads/spacecharge15/numericalnoisekesting.pdf)
- example: on a 128x128 mesh, artificial emittance growth is reduced when using more macro-particles

How to Approach Noise Issues?
- less noise means longer applicability/validity of simulations
- e.g. SPS injection plateau: 10.8 seconds, i.e. 500000 turns! -- impossible; instead we typically gain O(10000 turns) of validity for a simulation time scale of O(1 week) with the current software
- procedure: choose the grid resolution (according to the physics), 10 macro-particles per grid cell, fix the total number of macro-particles, evaluate the emittance growth (convergence study)

New Available Parameter Space
- 1000000 macro-particles, 20 slices, 128 x 128 mesh size
- 152 ms / 134 ms / 110 ms per kick

Poisson Solving with PIC
- particle-in-cell algorithm: the standard in the accelerator physics domain
- solve the Poisson equation: finite differences; Hockney: FFT with (integrated) Green's function for open boundaries
- alternatives: FMM, particle-particle -- see Ji Qiang's talks in the PyHEADTAIL meeting and the Space Charge WG meeting: https://indico.cern.ch/event/433371/

PIC in 3 Steps
1) particles to mesh: deposit the charges onto the mesh nodes
2) solve the Poisson equation on the mesh (Hockney's algorithm)
3) mesh to particles: interpolate the mesh fields to the particles

Hockney's Algorithm
- solve the Poisson equation on a structured grid
- Green's function: analytical solution for open boundaries
- formal solution: convolution of the Green's function with the charge density, O(n^2) if evaluated directly
- trick: implement the convolution using FFTs on a domain of twice the size (see the NumPy sketch after the GPU slides below)

Integrated Green's Function
- the Green's function approach has problems when the mesh has a large aspect ratio (the numerical integration uses a constant function value per cell)
- integrated Green's function, main idea: integrate the Green's function analytically over each mesh cell, then sum over all cells
- (figure: error of Ex along x, comparing IGF and GF for an aspect ratio of 1:5; Abell et al., PAC'07)

GPUs
- GPU = graphics processing unit: threads running massively in parallel, one concurrent instruction on >1000 cores, large data arrays, expensive global memory access
- resources for ABP simulations:
  CERN: LIU-PS-GPU server, 4x NVIDIA Tesla C2075 cards (mid 2011)
  CNAF (Bologna): high-performance cluster, 7x NVIDIA Tesla K20m (early 2013), 8x Tesla K40m (late 2013)

How to Use the GPU
- script: minimal changes for the GPU
- how to submit a GPU job (CNAF)
- Python: GPU data introspection works as flexibly as on the CPU (print(), calculations with GPUArrays, ...)
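As an illustration of the Hockney's Algorithm slide above, here is a minimal, self-contained NumPy sketch of the doubled-domain FFT solve for a single 2D slice (as used per slice in the 2.5D model). It uses the plain Green's function rather than the integrated variant; all names, the regularisation at r = 0 and the test charge are illustrative assumptions, not PyPIC code:

```python
import numpy as np

def hockney_solve_2d(rho, dx, dy):
    """Open-boundary 2D Poisson solve (laplacian(phi) = -rho/eps0) via
    Hockney's doubled-domain FFT trick: zero-pad rho to (2nx, 2ny), sample
    the free-space Green's function on the doubled grid, multiply in
    Fourier space, transform back and keep the original nx x ny block."""
    eps0 = 8.854187817e-12
    nx, ny = rho.shape

    # Green's function G(r) = -ln(r) / (2 pi eps0), sampled with the usual
    # mirrored layout so that the cyclic convolution reproduces the
    # open-boundary result on the physical mesh
    x = dx * np.concatenate((np.arange(nx + 1), np.arange(nx - 1, 0, -1)))
    y = dy * np.concatenate((np.arange(ny + 1), np.arange(ny - 1, 0, -1)))
    X, Y = np.meshgrid(x, y, indexing='ij')
    r = np.hypot(X, Y)
    r[0, 0] = 0.5 * min(dx, dy)   # crude regularisation of the r = 0 singularity
    G = -np.log(r) / (2 * np.pi * eps0)

    # zero-padded charge density on the doubled domain
    rho_pad = np.zeros((2 * nx, 2 * ny))
    rho_pad[:nx, :ny] = rho

    # convolution theorem: one forward/backward FFT pair instead of the
    # O(n^2) direct convolution sum
    phi_pad = np.fft.irfft2(np.fft.rfft2(G) * np.fft.rfft2(rho_pad),
                            s=(2 * nx, 2 * ny))
    return phi_pad[:nx, :ny] * dx * dy   # cell area turns the sum into the integral


# illustrative usage: potential of a Gaussian test charge on a 128 x 128 mesh
nx = ny = 128
dx = dy = 1e-4
xs = (np.arange(nx) - nx / 2) * dx
ys = (np.arange(ny) - ny / 2) * dy
XX, YY = np.meshgrid(xs, ys, indexing='ij')
rho = np.exp(-(XX**2 + YY**2) / (2 * (10 * dx)**2))
phi = hockney_solve_2d(rho, dx, dy)
```

The padding to twice the domain size is what turns the cyclic FFT convolution into the correct open-boundary convolution; the integrated Green's function variant would replace the point-sampled G by its analytical average over each mesh cell.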
Parallelisation Approach
- iterative workflow: profiling, identify the bottleneck, optimise the code, verify the functionality

Different Bottlenecks: CPU vs. GPU
- CPU: FFT solving is the bottleneck (FFT: O(nx log nx), p2m: O(nx))
- GPU: particle-to-mesh deposition is the bottleneck

Implementation of 3 Steps
- particles to mesh (p2m): 1) atomicAdd, one thread per particle; 2) parallel sort, one thread per cell
- solve: cuFFT (parallel FFT)
- mesh to particles (m2p): one thread per particle

Variant 1 of p2m
- one thread per particle
- race condition: atomicAdd properly serialises the memory updates -- slow but correct

Variant 2 of p2m
- one thread per node
- sort the particles by node index (optimise memory access!)
- avoids the race condition (no concurrent writes)
- (a NumPy sketch of both variants is given at the end of this transcript)

Different Numerical Models
- 2.5D: slice the bunch into n slices and solve n independent 2D Poisson equations; approximation: the bunch is very long; CPU: serial; GPU: all slices computed simultaneously
- 3D: solve the full 3D bunch on a 3D grid; CPU: not implemented (very slow); GPU: large memory requirements due to Hockney's algorithm

Numeric Parameter Scans: Fixed nx
- (plots for fixed mesh sizes of 256x256 and 512x512, with speed-up annotations x4 and x2)

Timing: Fully Loaded GPU Parameters
- the 2.5D model works well at high particle numbers, i.e. at low numbers the GPU is far from fully exploited!
- different slope of CPU vs. GPU (characteristic behaviour)
- the new hardware at CNAF is more efficient (x1.8)

Timing: CUDA 6 vs. CUDA
- speedup of up to x1.5 due to a faster implementation of the sorting algorithm (thrust 1.8) and a better cuFFT (2D, CNAF)

Summary
- PyHEADTAIL now offers 2.5D (slice-by-slice transverse) and 3D self-consistent direct space charge models (on CPU and GPU); the 3D model allows cross-checking the 2.5D approximations
- the GPU speeds up by 13x for large meshes and particle numbers
- wide numeric parameter spaces are available now!
  - larger resolutions help to mitigate noise effects (artefacts such as numerical emittance blow-up)
  - improved validity for long simulations (real machine time)
- next steps: SPS simulations (resonances)

Specifications of Used GPU Machines
- available machines at CNAF:

Specification of Used CPU Machine
- LIU-PS-GPU CPU:

PyPIC on GPU
- standalone Python module
- GPU interfacing via PyCUDA/Thrust
- flexible 2D/3D (integrated) Green's function
- cuFFT (new interface under branch: new_pypic_cpu_and_gpu)

Timing: Fully Loaded GPU Parameters II
- on the GPU, particle-to-mesh deposition dominates
- for a fixed mesh size, more macro-particles deposited onto the same grid induce memory bandwidth limitations on the speed-up
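For reference, the two deposition strategies from the "Variant 1 / Variant 2 of p2m" slides can be sketched in plain NumPy as follows. This is not the PyPIC/CUDA implementation: the weighting schemes, argument names and the usage example are illustrative assumptions. Variant 1 mirrors the one-thread-per-particle/atomicAdd approach (np.add.at serialises colliding updates much like atomicAdd), variant 2 mirrors sorting the particles by cell index before accumulating per cell:

```python
import numpy as np

def p2m_scatter(x, y, q, x0, y0, dx, dy, nx, ny):
    """Variant 1: one update per particle (GPU analogue: one thread per
    particle, atomicAdd on the surrounding mesh nodes).
    Cloud-in-cell (bilinear) weighting; assumes all particles lie strictly
    inside the mesh, so ix+1 < nx and iy+1 < ny."""
    fx, ix = np.modf((x - x0) / dx)
    fy, iy = np.modf((y - y0) / dy)
    ix, iy = ix.astype(int), iy.astype(int)
    rho = np.zeros((nx, ny))
    # np.add.at serialises duplicate indices, as atomicAdd does on the GPU
    np.add.at(rho, (ix,     iy    ), q * (1 - fx) * (1 - fy))
    np.add.at(rho, (ix + 1, iy    ), q * fx       * (1 - fy))
    np.add.at(rho, (ix,     iy + 1), q * (1 - fx) * fy)
    np.add.at(rho, (ix + 1, iy + 1), q * fx       * fy)
    return rho / (dx * dy)

def p2m_sorted(x, y, q, x0, y0, dx, dy, nx, ny):
    """Variant 2: sort the particles by cell index first (GPU analogue:
    a parallel sort, then one thread per node accumulates its own
    particles -- no concurrent writes, better memory access).
    Nearest-grid-point weighting, to keep the sketch short."""
    ix = ((x - x0) / dx).astype(int)
    iy = ((y - y0) / dy).astype(int)
    cell = ix * ny + iy                   # flattened cell index
    order = np.argsort(cell)              # the sort step of variant 2
    # per-cell sums over the (sorted) particles; in NumPy the sort is not
    # strictly needed for bincount, it stands in for the GPU access pattern
    rho = np.bincount(cell[order], weights=q[order], minlength=nx * ny)
    return rho.reshape(nx, ny) / (dx * dy)


# illustrative usage: 10000 macro-particles on a 128 x 128 mesh
rng = np.random.default_rng(0)
x = np.clip(rng.normal(0.0, 1e-3, 10000), -4.9e-3, 4.9e-3)
y = np.clip(rng.normal(0.0, 1e-3, 10000), -4.9e-3, 4.9e-3)
q = np.full(10000, 1e-15)                 # macro-particle charges in C
rho1 = p2m_scatter(x, y, q, -5e-3, -5e-3, 1e-4, 1e-4, 128, 128)
rho2 = p2m_sorted(x, y, q, -5e-3, -5e-3, 1e-4, 1e-4, 128, 128)
```

Variant 2 trades the cost of the sort against avoiding serialised atomic updates and improving memory access, which matches the "optimise memory access!" note on the slide.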