blasting sand with mpm - gpu technology...
Blasting Sand with CUDA:
MPM Sand Simulation for VFX
Gergely Klár
DreamWorks Animation
[Figures: time step from tn to tn+1]
Grid influence
Naïve Particles-to-Grid
Gather Particles-to-Grid
Our Solution
• Each particle is read only once,
• We efficiently use shared memory for the grids,
• We significantly reduce the number of atomic operations,
• And our secret sauce: a special data structure for particle queries.
[Diagram: grid tiles, each labeled "1 CUDA Block"]
[Diagram: CellBins → ParticleIDs → Actual particle data]
[Diagram: TileBins → CellBins → ParticleIDs → Actual particle data]
• In each block/tile:
  – Get blockIdx
  – Cells in the tile are TileBins[blockIdx-1]..TileBins[blockIdx]-1
  – Get a cellId for each warp from this list
    • Each thread works on two affected grid nodes
    • Particles of a cell are CellBins[cellId-1]..CellBins[cellId]-1
    • Compute the contribution from the particle
    • Store in shared memory
  – Write back to global memory
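The traversal order in the list above can be illustrated with a serial CPU sketch. This is not the CUDA kernel from the talk: the function name `massPerCell`, the helper `binBegin`, and the choice to accumulate plain particle mass per cell (instead of weighted contributions to grid nodes) are assumptions made only to keep the example small. The bins are treated as inclusive running sums, so the bin before index 0 is taken to start at 0.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Start of bin i in an inclusive running sum: bins[i-1], with an implicit 0
// before the first bin (the slides' TileBins[blockIdx-1] for blockIdx == 0).
inline std::uint32_t binBegin(const std::vector<std::uint32_t>& bins, std::size_t i) {
    return i == 0 ? 0u : bins[i - 1];
}

// Hypothetical stand-in for the particles-to-grid pass: sums particle mass per
// non-empty cell, visiting tiles, then cells, then particles.
inline std::vector<float> massPerCell(const std::vector<std::uint32_t>& tileBins,
                                      const std::vector<std::uint32_t>& cellBins,
                                      const std::vector<std::uint32_t>& particleIds,
                                      const std::vector<float>& mass) {
    std::vector<float> global(cellBins.size(), 0.0f);  // stand-in for global memory
    for (std::size_t tile = 0; tile < tileBins.size(); ++tile) {  // one CUDA block per tile
        const std::uint32_t first = binBegin(tileBins, tile);
        const std::uint32_t last = tileBins[tile];
        std::vector<float> shared(last - first, 0.0f);  // stand-in for shared memory
        for (std::uint32_t cell = first; cell < last; ++cell) {  // one warp per cell
            for (std::uint32_t p = binBegin(cellBins, cell); p < cellBins[cell]; ++p) {
                shared[cell - first] += mass[particleIds[p]];  // each particle read once
            }
        }
        for (std::uint32_t cell = first; cell < last; ++cell) {
            global[cell] = shared[cell - first];  // write the tile's result back
        }
    }
    return global;
}
```

In a GPU version the two outer loops would presumably map to blocks and warps, which is what lets the per-tile accumulator live in shared memory.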
Tile & Cell Keys
● Particle coordinates: (px, py, pz)
● Cell coordinates: (ci, cj, ck) = ⌊(px, py, pz)/Δx⌋
● Tile and in-tile coordinates: (ci, cj, ck) = (ti, tj, tk)∙TILE_SIZE + (ri, rj, rk)
● Key layout, packed into a 32 bit unsigned integer: tj (7 bits), ti (7 bits), tk (7 bits), rj (3 bits), rk (3 bits), ri (3 bits)
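The key construction described above can be sketched in a few lines of C++. The names `Cell`, `cellOf`, `makeKey` and `tileOf` are hypothetical. The sketch assumes TILE_SIZE is 8 (implied by 3 bits per in-tile axis), assumes non-negative coordinates, and takes the field order from the slide as extracted, which may not match the original figure exactly. The property that matters is that all tile bits sit above all in-tile bits.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

constexpr int IN_TILE_BITS = 3;                  // 3 bits per axis inside a tile
constexpr int TILE_BITS = 7;                     // 7 bits per axis for the tile index
constexpr int TILE_SIZE = 1 << IN_TILE_BITS;     // assumed: 8 cells per tile edge

struct Cell { int i, j, k; };

// Cell coordinates: floor(p / dx) per component.
inline Cell cellOf(float px, float py, float pz, float dx) {
    return { static_cast<int>(std::floor(px / dx)),
             static_cast<int>(std::floor(py / dx)),
             static_cast<int>(std::floor(pz / dx)) };
}

// Packs a cell into 30 bits of a 32-bit key, high to low:
// tj | ti | tk | rj | rk | ri  (field order assumed from the slide text).
inline std::uint32_t makeKey(Cell c) {
    assert(c.i >= 0 && c.j >= 0 && c.k >= 0);    // assumption: origin-shifted domain
    const std::uint32_t ti = static_cast<std::uint32_t>(c.i / TILE_SIZE);
    const std::uint32_t tj = static_cast<std::uint32_t>(c.j / TILE_SIZE);
    const std::uint32_t tk = static_cast<std::uint32_t>(c.k / TILE_SIZE);
    const std::uint32_t ri = static_cast<std::uint32_t>(c.i % TILE_SIZE);
    const std::uint32_t rj = static_cast<std::uint32_t>(c.j % TILE_SIZE);
    const std::uint32_t rk = static_cast<std::uint32_t>(c.k % TILE_SIZE);
    const std::uint32_t tileLimit = 1u << TILE_BITS;
    assert(ti < tileLimit && tj < tileLimit && tk < tileLimit);
    return (tj << 23) | (ti << 16) | (tk << 9) | (rj << 6) | (rk << 3) | ri;
}

// Drops the 9 in-tile bits, leaving the 21 tile bits.
inline std::uint32_t tileOf(std::uint32_t key) {
    return key >> (3 * IN_TILE_BITS);
}
```

Because the tile bits occupy the high part of the key, two cells in the same tile always agree on `tileOf(key)`, for example cells (8, 0, 16) and (15, 7, 23).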
[Diagram: Initial Particle IDs → sort → Particle IDs; RLE → inc. sum → Cell Bins; masked RLE → inc. sum → Tile Bins]
Tile & Cell Keys
● When sorted as uint32s, keys of the same tile will be consecutive
● RLE encoding counts the number of particles per cell
● The running sum of the counts gives the offsets to particles
● RLE encoding with a mask for the tile bits counts the number of non-empty cells per tile
● The running sum of these counts gives the offsets to cells
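The sort, RLE and running-sum steps above can be followed on the CPU with standard-library calls. This is a serial sketch rather than the GPU implementation. The names `Bins`, `runLengths`, `buildBins` and `TILE_MASK` are hypothetical, and the mask assumes the 9 in-tile bits occupy the low end of the key.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

struct Bins {
    std::vector<std::uint32_t> particleIds;  // particle indices sorted by key
    std::vector<std::uint32_t> cellKeys;     // key of each non-empty cell, ascending
    std::vector<std::uint32_t> cellBins;     // inclusive sum of particles per cell
    std::vector<std::uint32_t> tileBins;     // inclusive sum of non-empty cells per tile
};

constexpr std::uint32_t ALL_BITS = 0xFFFFFFFFu;
constexpr std::uint32_t TILE_MASK = 0xFFFFFE00u;  // assumed: clears the 9 in-tile bits

// Run-length encodes `values` compared under `mask`; returns the run lengths and
// optionally appends the first value of each run to `firsts`.
inline std::vector<std::uint32_t> runLengths(const std::vector<std::uint32_t>& values,
                                             std::uint32_t mask,
                                             std::vector<std::uint32_t>* firsts = nullptr) {
    std::vector<std::uint32_t> counts;
    for (std::size_t i = 0; i < values.size(); ++i) {
        if (i == 0 || (values[i] & mask) != (values[i - 1] & mask)) {
            counts.push_back(0);
            if (firsts) firsts->push_back(values[i]);
        }
        ++counts.back();
    }
    return counts;
}

inline Bins buildBins(const std::vector<std::uint32_t>& keys) {
    Bins b;
    // Sort particle IDs by key (initial IDs are 0, 1, 2, ...).
    b.particleIds.resize(keys.size());
    std::iota(b.particleIds.begin(), b.particleIds.end(), 0u);
    std::stable_sort(b.particleIds.begin(), b.particleIds.end(),
                     [&keys](std::uint32_t a, std::uint32_t c) { return keys[a] < keys[c]; });
    std::vector<std::uint32_t> sortedKeys(keys.size());
    for (std::size_t i = 0; i < keys.size(); ++i) sortedKeys[i] = keys[b.particleIds[i]];

    // RLE over full keys: particles per cell; running sum: offsets to particles.
    b.cellBins = runLengths(sortedKeys, ALL_BITS, &b.cellKeys);
    std::partial_sum(b.cellBins.begin(), b.cellBins.end(), b.cellBins.begin());

    // Masked RLE over the unique cell keys: non-empty cells per tile;
    // running sum: offsets to cells.
    b.tileBins = runLengths(b.cellKeys, TILE_MASK);
    std::partial_sum(b.tileBins.begin(), b.tileBins.end(), b.tileBins.begin());
    return b;
}
```

For keys {66569, 5, 66569, 66570, 5, 512} this yields four non-empty cells in three tiles. On a GPU each of the three steps would presumably be a parallel primitive (sort by key, run-length encode, inclusive scan), which is why the structure needs no per-particle atomics to build.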
Results
Overall
[Chart: GPU vs. CPU; x-axis: # of particles (262K, 884K, 2,097K, 7,000K); y-axis: 0 to 1000]
Milliseconds per time step. Smaller is better.
GPU: NVIDIA Quadro K5200
CPU: Intel Xeon CPU E5-2697 v3 @ 2.60GHz w/ 28 cores
Particles to Grids
[Chart: x-axis: # of particles (262K, 884K, 2,097K, 7,000K); y-axis: 0 to 600]
Grids to Particles
[Chart: x-axis: # of particles (262K, 884K, 2,097K, 7,000K); y-axis: 0 to 600]
Milliseconds per time step. Smaller is better.
Summary
• Particle binning with sort-RLE-scan
• Breaking the domain into tiles that fit in shared memory
• Processing the particles of a cell with a single warp
Special thanks to:
• Ken Museth
• Stephen Jones
• Jeff Budsberg
• Lawrence Lee
• Rob Tesdahl
• David Tonnesen
• Ibrahim Sani
Thank you!