icecube simulation with ppc on gpus dmitry chirkin, uw madison photon propagation code graphics...

IceCube simulation with PPC on GPUs

Dmitry Chirkin, UW Madison

photon propagation codegraphics processing unit

IceCube simulation with PPC

Photonics: 2000 – up to now Photon propagation code PPC: 2009 - now

Photonics: conventional on CPU

• First, run photonics to fill space with photons, tabulate the result

• Create such tables for nominal light sources: cascade and uniform half-muon

• Simulate photon propagation by looking up photon density in tabulated distributions

Table generation is slow Simulation suffers from a wide range of binning artifacts Simulation is also slow! (most time is spent loading the tables)

Direct photon tracking with PPC

• simulating flasher/standard candle photons• same code for muon/cascade simulation

• using precise scattering function: linear combination of HG+SAM• using tabulated (in 10 m depth slices) layered ice structure• employing 6-parameter ice model to extrapolate in wavelength

• tilt in the ice layer structure is properly taken into account

• transparent folding of acceptance and efficiencies• precise tracking through layers of ice, no interpolation needed

• precise simulation of the longitudinal development of cascades and• angular distribution of particles emitting Cherenkov photons

photon propagation code

Approximation to Mie scattering

fSL

Simplified Liu:

Henyey-Greenstein:

Mie:

Describes scattering on acid, mineral, salt, and soot with concentrations and radii at SP

Dependence on g=<cos()> and fSL

g=<cos()> fSL

0.8 00.9 00.95 0

0.9 0.30.9 0.50.9 1.0

flashing 63-50 64-50

Ice tilt in ppc

Measured with dust loggers (Ryan Bay)

Photon angular profile

from thesis of Christopher Wiebusch

PPC simulation on GPUgraphics processing unit

execution threads

propagation steps(between scatterings)

photon absorbed

new photon created(taken from the pool)

threads completetheir execution(no more photons)

Running on an NVidia GTX 295 CUDA-capable card,ppc is configured with:

448 threads in 30 blocks (total of 13440 threads)average of ~ 1024 photons per thread (total of 1.38 . 107 photons per call)

Photon Propagation Code: PPCThere are 5 versions of the ppc:

• original c++• "fast" c++• in Assembly• for CUDA GPU• icetray module

All versions verified to produce identical results

comparison with i3mcmlhttp://icecube.wisc.edu/~dima/work/WISC/ppc/

ppc icetray module

• at http://code.icecube.wisc.edu/svn/projects/ppc/trunk/

• uses a wrapper: private/ppc/i3ppc.cxx, compiled by cmake system into the libppc.so

• additional library libxppc.so is compiled by cmakeSet GPU_XPPC:BOOL=ON or OFF

• or can also be compiled by running make in private/ppc/gpu: “make glib” compiles gpu-accelerated version (needs cuda tools) “make clib” compiles cpu version (from the same sources!)

http://code.icecube.wisc.edu/svn/projects/ppc/trunk/

ppc example script run.pyif(len(sys.argv)!=6): print "Use: run.py [corsika/nugen/flasher] [gpu] [seed] [infile/num of flasher events] [outfile]" sys.exit()…det = "ic86"detector = False…os.putenv("PPCTABLESDIR", expandvars("$I3_BUILD/ppc/resources/ice/mie"))…if(mode == "flasher"): … str=63 dom=20 nph=8.e9

tray.AddModule("I3PhotoFlash", "photoflash")(…)

os.putenv("WFLA", "405") # flasher wavelength; set to 337 for standard candles os.putenv("FLDR", "-1") # direction of the first flasher LED … # Set FLDR=x+(n-1)*360, where 0<=x<360 and n>0 to simulate n LEDs in a # symmetrical n-fold pattern, with first LED centered in the direction x. # Negative or unset FLDR simulates a symmetric in azimuth pattern of light.

tray.AddModule("i3ppc", "ppc")( ("gpu", gpu), ("bad", bad), ("nph", nph*0.1315/25), # corrected for efficiency and DOM oversize factor; eff(337)=0.0354 ("fla", OMKey(str, dom)), # set str=-str for tilted flashers, str=0 and dom=1,2 for SC1 and 2 )

else:

ppc-pick and ppc-eff

ppc-pick: restrict to primaries below MaxEpri

load("libppc-pick")

tray.AddModule("I3IcePickModule<I3EpriFilt>","emax")( ("DiscardEvents", True), ("MaxEpri", 1.e9*I3Units.GeV) )

ppc-eff: reduce efficiency from 1.0 to eff

load("libppc-eff")

tray.AddModule("AdjEff", "eff")( ("eff", eff) )

ppc homepage

http://icecube.wisc.edu/~dima/work/WISC/ppc

GPU scalingOriginal: 1/2.08 1/2.70CPU c++: 1.00 1.00Assembly: 1.25 1.37GTX 295: 147 157GTX/Ori: 307 424C1060: 104 112C2050: 157 150GTX 480: 210 204

On GTX 295: 1.296 GHzRunning on 30 MPs x 448 threadsKernel uses: l=0 r=35 s=8176 c=62400

On GTX 480: 1.401 GHzRunning on 15 MPs x 768 threadsKernel uses: l=0 r=40 s=3960 c=62400

On C1060: 1.296 GHzRunning on 30 MPs x 448 threadsKernel uses: l=0 r=35 s=3992 c=62400

On C2050: 1.147 GHzRunning on 14 MPs x 768 threadsKernel uses: l=0 r=41 s=3960 c=62400

Uses cudaGetDeviceProperties() to get the number of multiprocessors,Uses cudaFuncGetAttributes() to get the maximum number of threads

GTX 480 vs. GTX 295

• GTX 295 has 2 GPUs• 240 MPs in 30 cores

• 8 MPs per core• 2 single-precision sFPUs 60 sFPUs per GPU

480 cores per card 120 sFPUs per card

Shared memory:• 16Kb per core• 960 Kb per card

• GTX 480 has 1 GPU• 480 MPs in 15 cores

• 32 MPs per core• 4 single-precision sFPUs 60 sFPUs per GPU

480 cores per card 60 sFPUs per card (!)

shared memory• up to 48Kb of per core• up to 720 Kb per card

Why is ppc not a factor 2 faster on GTX 480 GPU than on GTX 295 GPU?

Kernel time calculationRun 3232 (corsika) IC86 processing on cuda002 (per file):

GTX 295: Device time: 1123741.1 (in-kernel: 1115487.9...1122539.1) [ms]GTX 480: Device time: 693447.8 (in-kernel: 691775.9...693586.2) [ms]

If more than 1 thread is running using same GPU:

Device time: 1417203.1 (in-kernel: 1072643.6...1079405.0) [ms]

3 counters: 1. time difference before/after kernel launch in host code2. in-kernel, using cycle counter: min thread time3. max thread time

Also, real/user/sys times of top:

gpus 6cpus 1cores 8files 693Real 749m4.693sUser 3456m10.888ssys 39m50.369sDevice time: 245312940.1 216887330.9 218253017.2 [ms]

files: 693 real: 64.8553 user: 37.8357 gpu: 58.9978 kernel: 52.4899 [seconds]

81%-91% GPU utilization

Concurrent execution

time

CPU GPU CPU GPUThread 1:

CPU GPU CPUGPUThread 2:

CPU

GPU

CPU

GPU

CPU

GPU

CPU

GPUOne thread:

Create track segments

Copy track segments to GPU

Process photon hits

Copy photon hits from GPU

Need 2 buffers for track segments and photon hits

However: have 2 buffers:1 on host and 1 on GPU!Just need to synchronize

before the buffers are re-used

Typical run times

corsika: run 3232: 10493 10.0345 sec filesic86/spx/3232 on cuda00[123] (53.4 seconds per job)1.2 days of real detector time in 6.5 days

nugen: run 2972: 9993 200000-event files; E^-2 weightedic86/spx/2972 on cudatest (25.1 seconds per job)entire 10k set of files in 2.9 days this is enough for an atmnu/diffuse analysis!

Considerations:

• Maximize GPU utilization by running only mmc+ppc parts on the GPU nodes• still, IC40 mmc+ppc+detector was run with ~80% GPU utilization

• run with 100% DOM efficiency, save all ppc events with at least 1 MC hit• apply a range of allowed efficiencies (70-100%) later with ppc-eff module

Use in analysis

PPC run on GPUs was already used in several analyses already published or in progress.

The ease of changing the ice parameters facilitated propagation of ice uncertainties through the analysis, as all “systematics” simulation sets are simulated in roughly the same amount of time, with no extra overhead.

A similar quality uncertainty analysis based on photonics simulation would have taken much longer because of the large CPU cost of the initial table generation.

OSG Summer School ‘10

DAG (Directed Acyclical Graph) -based simulation

• Separate simulation segments into tasks

• Assign task to a node in DAG

DedicatedDedicatedGPU clusterGPU cluster

GPU-based simulation

• We have recently began experimenting

with GPU-based implementation of

portions of IceCube simulation.

• DAG assigns separate tasks to different

compute nodes

• Execution of photon propagation

simulation on dedicated GPU nodes.

• For many simulations GPU segment of

chain is much faster than the rest of the

simulation.

• Small number GPU-enabled machines

can consume the data from large pool lf

CPU cores.

PPCPPC

generatorgenerator generatorgeneratorgeneratorgenerator generatorgenerator

DetectorDetector DetectorDetector

GPU-based simulation

• Optimal DAG differ depending on the

specific simulation

Current Status of GPU-based Simulation Production

• NuGen and CORSIKA simulation currently running on Madison cluster: NPX3+CUDA

• Testing optimal DAG configurations to take advantage of GPUs

• Current Condor queue has option for selecting machines with GPUs

• There are multiple cores and multiple GPUs on each machine

• Condor assigns environment variable ${_CONDOR_SLOT} which is used as parameter to select a GPU on PPC.

• SGE, PBS IceProd plugins being writtend to support DAGs in order to incorporate other non-Condor sites:

• NERSC Dirac and Tesla and clusters in Dortmund, DESY, Alberta

• Other IceCube sites report to have GPUs available and will be incorporated into production.

GPU/PPC production: coincident neutrino-CR muons

Our initial GPU cluster

4 computers:• 1 cudatest• 3 cuda nodes (cuda001-3)

Each has4-core CPU3 GPU cards, each with 2 GPUs (i.e. 6 GPUs per computer)

Each computer is ~ $3500

Our initial cluster

cudatest:

Found 6 devices, driver 2030, runtime 20300(1.3): GeForce GTX 295 1.296 GHz G(939261952) S(16384) C(65536) R(16384) W(32)

l1 o1 c0 h1 i0 m30 a256 M(262144) T(512: 512,512,64) G(65535,65535,1)1(1.3): GeForce GTX 295 1.296 GHz G(939261952) S(16384) C(65536) R(16384) W(32)





l0 o1 c0 h1 i0 m30 a256 M(262144) T(512: 512,512,64) G(65535,65535,1)

3 GTX 295 cards, each with 2 GPUs

PSU

0 and 14 and 52 and 3

nvidia-smi -lsa

GPU 0:Product Name : GeForce GTX 295Serial : 1803836293359PCI ID : 5eb10deTemperature : 87 C






BAD multiprocessors (MPs)clistcudatest 0 1 2 3 4 5cuda001 0 1 2 3 4 5cuda002 0 1 2 3 4 5cuda003 0 1 2 3 4 5

#badmpscuda001 3 22cuda002 2 20cuda002 4 10

Disable 3 bad GPUs out of 24: 12.5%Disable 3 bad MPs out of 720: 0.4%!

Configured: xR=5 eff=0.95 sf=0.2 g=0.943Loaded 12 angsens coefficientsLoaded 6x170 dust layer pointsLoaded 16028 random multipliersLoaded 42 wavelenth pointsLoaded 171 ice layersLoaded 3540 DOMs (19x19)Processing f2k muons from stdin on device 2Total GPU memory usage: 83053520photons: 13762560 hits: 991Error: TOT was a nan or an inf 1 times! Bad MP #20photons: 13762560 hits: 393photons: 13762560 hits: 570photons: 13762560 hits: 501photons: 13762560 hits: 832photons: 13762560 hits: 717CUDA Error: unspecified launch failure

Total GPU memory usage: 83053520photons: 13762560 hits: 938Error: TOT was a nan or an inf 9 times! Bad MP #20 #20 #20 #20photons: 13762560 hits: 442photons: 13762560 hits: 627CUDA Error: unspecified launch failure

[dima@cuda002 gpu]$ cat mmc.1.f2k | BADMP=20 ./ppc 2 > /dev/nullConfigured: xR=5 eff=0.95 sf=0.2 g=0.943Loaded 12 angsens coefficientsLoaded 6x170 dust layer pointsLoaded 16028 random multipliersLoaded 42 wavelenth pointsLoaded 171 ice layersLoaded 3540 DOMs (19x19)Processing f2k muons from stdin on device 2Not using MP #20Total GPU memory usage: 83053520photons: 13762560 hits: 871…photons: 1813560 hits: 114

Device time: 31970.7 (in-kernel: 31725.6...31954.8) [ms]

Failure rates:

• 3 x DELL PowerEdge C410x– 48 Tesla M2070 GPGPU– 21504 GPU cores– 48 TFlops single precision– 24 TFlops double precision

• 6 x DELL PowerEdge C6145– 24 AMD Opteron™ 6100 Processors– 288 CPU cores– 7 TFlops double precision

• QDR Infiniband Interconnect– Allows high speed MPI applications

• ~ $ 200,000

The IceWave Cluster

•DELL PowerEdge C410x– 16 PCIe Expansion chassis – Use of C2050 TESLA GPUs ()– Flexible assignment of GPUs toServers

Allows 1-4 GPUs per server

Basic Elements

•DELL PowerEdge C6145– 2 x 4 CPU Servers– AMD Opteron™ 6100 (Magny-Cours)– 12 cores per processor– 48-96 cores per 2U server– 192 GB per 2U server

Basic Elements

Concluding remarks

• PPC (photon propagation code) is used by IceCube for photon tracking

• Precise treatment of the photon propagation, includes all known effects (longitudinal development of particle cascades, ice tilt, etc.)

• PPC can be run on CPUs or GPUs; running on a GPU is 100s of times faster than on a CPU core

• We use DAG to run PPC routinely on GPUs for mass production of simulated data

• GPU computers can be assembled with NVidia or AMD video cards however, some problems exist in consumer video cards bad MPs can be worked around in PPC computing-grade hardware can be used instead

Extra slides

Oversize DOM treatmentOversized DOM treatment (designed for minimum bias compared to oversize=1):

oversize only in direction perpendicular to the photon time needed to reach the nominal (non-oversized) DOM surface is added re-use the photon after it hits a DOM and ensure the causality in the flasher simulation

The oversize model was chosen carefully to produce the best possible agreement with the nominal x1 case (see next slide).

nominal DOMoversized DOM

oversized ~ 5 times

phot

on

This is a crucial optimization, however:Some bias is unavoidable since DOMs occupy larger space:

x1: diameter of 33 cmx5: 1.65 mx16: 5.3 m

This could lead to ~5-10% variation of the individual DOM simulated rates.

Timing of oversized DOM MC

xR=1defaultdo not track back to detected DOMdo not track after detectionno ovesize delta correction!do not check causalitydel=(sqrtf(b*b+(1/(e.zR*e.zR-1)*c)-D)*e.zR-hdel=e.R-OMR

Flashing 63-50 63-48

64-48

64-52

xR=1

default

ice density: 0.9216 mwe

handbook of chemistry and physics

T.Gow's data of density near the surface

T=221.5-0.00045319*d+5.822e-6*d2-273.15 (fit to AMANDA data)

Fit to (1-p1*exp(-p2*d))*f(T(d))*(1+0.94e-12*9.8*917*d)

Device enumeration

cuda002:

Found 5 devices, driver 3010, runtime 30100(2.0): GeForce GTX 480 1.401 GHz G(1610285056) S(49152) C(65536) R(32768) W(32)





l0 o1 c0 h1 i0 m30 a256 M(2147483647) T(512: 512,512,64) G(65535,65535,1)

2 GTX 295 cards, 1 GTX 480 card

PSU

1 and 23 and 4

0

nvidia-smi -a

GPU 0:Product Name : GeForce GTX 295PCI ID : 5eb10deTemperature : 68 C


GPU 2:Product Name : GeForce GTX 480PCI ID : 6c010deTemperature : 106 C



0 and 13 and 4

2

Fermi vs. Teslacudatest: Found 6 devices, driver 2030, runtime 20300(1.3): GeForce GTX 295 1.296 GHz G(939261952) S(16384) C(65536) R(16384) W(32)

l1 o1 c0 h1 i0 m30 a256 M(262144) T(512: 512,512,64) G(65535,65535,1)

tesla: Found 1 devices, driver 3000, runtime 30000(1.3): Tesla C1060 1.296 GHz G(4294770688) S(16384) C(65536) R(16384) W(32)

l1 o1 c0 h1 i0 m30 a256 M(2147483647) T(512: 512,512,64) G(65535,65535,1)

fermi: Found 1 devices, driver 3000, runtime 30000(2.0): Tesla C2050 1.147 GHz G(2817982464) S(49152) C(65536) R(32768) W(32)

l1 o1 c0 h1 i0 m14 a512 M(2147483647) T(1024: 1024,1024,64) G(65535,65535,1)

beta: Found 1 devices, driver 3010, runtime 30100(2.0): Tesla C2050 1.147 GHz G(2817982464) S(49152) C(65536) R(32768) W(32)

l1 o1 c0 h1 i0 m14 a512 M(2147483647) T(1024: 1024,1024,64) G(65535,65535,1)

11: arch=11 make gpu12: arch=12 make gpu (default/best)1x: arch=12 make gpu with -ftz=true -prec-div=false -prec-sqrt=false20: arch=20 make gpu2x: arch=20 make gpu with -ftz=true -prec-div=false -prec-sqrt=false

Flasher f2kmuon

35587.8 13797.034246.7 10990.8

40989.2 13563.240423.2 11514.9

29114.5 11361.427343.8 8760.027346.6 8755.5

29024.8 11429.227631.5 8833.227630.9 8833.328950.1 9079.528955.1 9073.2

Other

Consider:

• building production computers with only 2 cards, leaving a space in between• using 6-core CPUs if paired with 3 GPU cards

• 4-way Tesla GPU-only servers a possible solution• Consumer GTX card much faster than Tesla/Fermi cards

GTX 295 was so far found to be a better choice than GTX 480• but: no longer available!

Reliability:• 0.4% loss of advertised capacity in GTX 295 cards• however: 2 of 3 affected cards were “refurbished”• do cards deteriorate over time? The failed MPs did not change in ~3 months

icecube simulation with ppc on gpus dmitry chirkin, uw madison photon propagation code graphics...

Documents