
Page 1:

cudaMPI and glMPI: Message Passing for GPGPU Clusters

Orion Sky Lawlor, lawlor@alaska.edu
U. Alaska Fairbanks
2009-10-13
http://lawlor.cs.uaf.edu/

Page 2:

Where is Fairbanks, Alaska?

Page 3:

Fairbanks in summer: 80F

Page 4:

Fairbanks in winter: -20F

Page 5:

CSAR Retrospective

CSAR 2001-2004 work:
• Charm++ and FEM Framework
• Mesh Modification
• Solution Transfer

Page 6:

Charm++ FEM Framework
• Handles parallel details in the runtime
• Leaves physics and numerics to the user
• Presents a clean, “almost serial” interface: one call to update cross-processor boundaries
• Not just for Finite Element computations! (renamed “ParFUM”)
• Now supports ghost cells
• Builds on top of AMPI or native MPI
• Allows use of advanced Charm++ features: adaptive parallel computation; dynamic, automatic load balancing
• Other libraries: collision, adaptation, visualization, …

Page 7:

FEM Mesh Partitioning: Serial to Parallel

Page 8:

FEM Mesh Partitioning: Communication

Summing forces from other processors only takes one call:

FEM_Update_field

Can also access values in ghost elements

Page 9:

Charm++ FEM Framework Users

Rocflu fluids solver, a part of GENX [Haselbacher]

Frac3D mechanics/fracture code [Geubelle]

DG, metal solidification/dendritic growth implicit solver [Danzig]

CPSD spacetime meshing code [Haber/Erickson]

Page 10:

Charm++ Collision Detection
• Detect collisions (intersections) between objects scattered across processors
• Built on Charm++ Arrays
• Overlay a regular, sparse 3D grid of voxels (boxes)
• Send objects to all voxels they touch
• Collide voxels independently and collect results
• Leaves collision response to the caller
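The voxel step above boils down to mapping each object's bounding box onto the range of grid cells it overlaps. A minimal C++ sketch of that mapping, assuming axis-aligned bounding boxes and a hypothetical grid spacing voxel_size (names are illustrative, not the library's API):

    #include <cmath>
    struct Box3 { double min[3], max[3]; };   // axis-aligned bounding box

    // Compute the inclusive range of voxel indices the box touches.
    void voxelRange(const Box3 &b, double voxel_size, int lo[3], int hi[3]) {
        for (int axis = 0; axis < 3; axis++) {
            lo[axis] = (int)std::floor(b.min[axis] / voxel_size);
            hi[axis] = (int)std::floor(b.max[axis] / voxel_size);
        }
    }
    // The object is then sent to every voxel (i,j,k) in that range;
    // each voxel collides its local objects independently and reports results.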

Page 11:

Collision Detection Algorithm: Sparse 3D voxel grid

Page 12:

Remeshing
• As the solids burn away, the domain changes dramatically
  ALE: fluids mesh expands; ALE: solids mesh contracts
• This distorts the elements of the mesh
• We need to be able to fix the deformed mesh

Page 13:

Initial mesh consists of small, good quality tetrahedra

Page 14:

As solids burn away, fluids domain becomes more and more stretched

Pages 15-21: (the same caption repeated for successive frames as the domain stretches further)

Page 22:

Significant degradation in element quality compared to original!

Page 23:

Remeshing: Solution Transfer
• Can use existing (off-the-shelf) tools to remesh our domain
• Must also handle solution data: density, velocity, displacement fields; gas pressure/temperature; boundary conditions!
• Accurate transfer of solution data is a difficult mathematical problem
• Solution data (and mesh) are scattered across processors

Page 24:

Remeshing and Solution Transfer
• FEM: reassemble a serial boundary mesh
• Call serial remeshing tools: YAMS, TetMesh™
• FEM: partition new serial mesh
• Collision Library: match up old and new volume meshes
• Transfer Library: conservative, accurate volume-to-volume data transfer using the Jiao method

Page 25:

Original mesh after deformation

Page 26:

Closeup: before remeshing-- note stretched elements on boundary

Page 27:

After remeshing-- better element size and shape

Page 28:

Remeshing restores element size and quality

Page 29:

Fluids Gas Velocity Transfer
(figures: deformed mesh and new mesh)

Page 30:

Remeshing: Continue the Simulation
• Theory: just start the simulation using the new mesh, solution, and boundaries!
• In practice: not so easy with the real code (genx)
  Lots of little pieces: fluids, solids, combustion, interface, ...
  Each has its own set of needs and input file formats!
• Prototype: treat remeshing like a restart
  Remeshing system writes mesh, solution, boundaries to ordinary restart files
  Integrated code thinks this is an ordinary restart
  Few changes needed inside genx

Page 31:

Can now continue simulation using new mesh (prototype)

Pages 32-33: (the same caption repeated for successive frames)

Page 34:

Remeshed simulation resolves boundary better than old!

Page 35:

Remeshing: Future Work
• Automatically decide when to remesh (currently manual)
• Remesh the solids domain (currently only Rocflu is supported)
• Remesh during a real parallel run (currently only a serial data format is supported)
• Remesh without using restart files
• Remesh only part of the domain, e.g. a burning crack (currently remeshes the entire domain at once)
• Remesh without using serial tools (currently YAMS and TetMesh are completely serial)

Page 36:

Rocrem: Remeshing For Rocstar
• Fully parallel mesh search & solution transfer
• Built on the Charm++ parallel collision detection framework
• Scalable to millions of elements -- full 3.8M-tet Titan IV mesh
• Supports node- and element-centered data
• Supports node- and face-centered boundary conditions, preserved during remeshing and data transfer
• Unique method to support transfer along boundaries: boundary extrusion
• Integrated with Rocket simulation data formats

(figure: Titan IV boundary conditions)

Page 37:

Parallel Online Mesh Modification
• 2D triangle mesh refinement (figure: 1pe, 2pe, 4pe); 3D on the way!
• Longest-edge bisection [Rivara et al]
• Fully parallel implementation
• Works with real application code [Geubelle]
• Refines solution data
• Updates parallel communication and external boundary conditions

Page 38:

New Work: GPGPU
• Modern GPU Hardware
• Quantitative GPU Performance
• GPU-Network Interfacing
• cudaMPI and glMPI Performance
• MPIglut: Powerwall Parallelization
• Conclusions & Future Work

Page 39:

Modern GPU Hardware

Page 40:

Latency, or Bandwidth?
time = latency + size / bandwidth

To minimize latency:
• Caching (memory)
• Speculation (branches, memory)

To maximize bandwidth:
• Clock rate (since 1970)
• Pipelining (since 1985)
• Parallelism (since 2004)

For software, everything's transparent except parallelism!

Page 41:

CPU vs GPU Design Targets
Typical CPU optimizes for instruction execution latency:
• Millions of instructions / thread
• Less than 100 software threads
• Hardware caches, speculation

Typical GPU optimizes for instruction execution bandwidth:
• Thousands of instructions / thread
• Over 1 million software threads
• Hardware pipelining, queuing, request merging

Page 42:

Modern GPUs: Good Things
• Fully programmable: branches, loops, functions, etc
• IEEE 32-bit float storage
• Huge parallelism in hardware
  Typical card has 100 cores
  Each core threaded 10-100 ways
  Hardware thread switching (SMT)
• E.g., $250 GTX 280 has 240 cores, >900 gigaflops peak
• GPU is best in flops, flops/watt, flops/$ [Datta ... Yelick 2008]

Page 43:

Modern GPUs: Bad Things
• Programming is really difficult
  Branch & load divergence penalties
  Non-IEEE NaNs; 64-bit much slower
• Huge parallelism in software
  Need millions of separate threads!
  Usual problems: decomposition, synchronization, load balance
• E.g., $250 GTX 280 has *only* 1GB of storage, 4GB/s host RAM
• GPU is definitely a mixed blessing for real applications

Page 44:

Modern GPU Programming Model
“Hybrid” GPU-CPU model:
• GPU cranks out the flops
• CPU does everything else
  • Run non-GPU legacy or sequential code
  • Disk, network I/O
  • Interface with OS and user
• Supported by CUDA, OpenCL, OpenGL, DirectX, ...
• Likely to be the dominant model for the foreseeable future
  CPU may eventually “wither away”

Page 45:

GPU Programming Interfaces
• OpenCL (2008)
  Hardware independent (in theory)
  SDK still difficult to find
• NVIDIA's CUDA (2007)
  Mixed C++/GPU source files
  Mature driver, widespread use
• OpenGL GLSL, DirectX HLSL (2004)
  Supported on NVIDIA and ATI on a wide range of hardware
  Graphics-centric interface (pixels)

Page 46:

NVIDIA CUDA
• Fully programmable (loops, etc)
• Read and write memory anywhere
  • Memory accesses must be “coalesced” for high performance
  • Zero protection against multithreaded race conditions
  • “Texture cache” can only be accessed via restrictive CUDA arrays
• Manual control over a small shared memory region near the ALUs
• CPU calls a GPU “kernel” with a “block” of threads
• Only runs on new NVIDIA hardware
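To make the “block of threads” and shared-memory bullets concrete, here is a minimal CUDA sketch of my own (not from the talk): each block of 128 threads stages its inputs in the small on-chip shared memory and reduces them to one partial sum.

    // Illustrative only: assumes blockDim.x == 128 (a power of two).
    __global__ void block_sum(const float *in, float *out) {
        __shared__ float s[128];              // manually managed on-chip memory
        int tid = threadIdx.x;
        s[tid] = in[tid + blockIdx.x * blockDim.x];
        __syncthreads();                      // the whole block sees s[]
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (tid < stride) s[tid] += s[tid + stride];
            __syncthreads();
        }
        if (tid == 0) out[blockIdx.x] = s[0]; // one partial sum per block
    }
    // Launched from the CPU as, e.g.: block_sum<<<nblocks, 128>>>(in, out);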

Page 47:

OpenGL: GL Shading Language
• Fully programmable (loops, etc)
• Can only read “textures”, only write to the “framebuffer” (2D arrays)
  • Reads go through the “texture cache”, so performance is good (iff locality)
  • Attach a non-read texture to the framebuffer
  • So cannot have synchronization bugs!
• Today, greyscale GL_LUMINANCE texture pixels store floats, not GL_RGBA 4-vectors (zero penalty!)
• Rich selection of texture filtering (array interpolation) modes
• GLSL runs on nearly every GPU

Page 48:

Modern GPU Fast Stuff
• Divide, sqrt, rsqrt, exp, pow, sin, cos, tan, dot: all sub-nanosecond
  Fast like arithmetic, not a library call
• Branches are fast iff:
  They're short (a few instructions)
  Or they're coherent across threads
• GPU memory accesses are fast iff:
  They're satisfied by the “texture cache”
  Or they're coalesced across threads (contiguous and aligned)
• GPU “coherent” ≈ MPI “collective”
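A small illustration of the coalescing rule (these two kernels are mine, not the talk's): in the first, adjacent threads touch adjacent addresses and the hardware merges them into wide transactions; in the second, a stride keeps the same accesses from coalescing.

    // Coalesced: thread i reads element i -- contiguous and aligned.
    __global__ void read_fast(const float *in, float *out) {
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        out[i] = in[i];
    }
    // Noncoalesced: adjacent threads read addresses 'stride' floats apart.
    __global__ void read_slow(const float *in, float *out, int stride) {
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        out[i] = in[i * stride];
    }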

Page 49:

Modern GPU Slow Stuff
• Not enough work per kernel
  >4us latency for kernel startup
  Need >>1M flops / kernel!
• Divergence: costs up to 97%!
  • Noncoalesced memory accesses: adjacent threads fetching distant regions of memory
  • Divergent branches: adjacent threads running different long batches of code
• Copies to/from host: >10us latency, <4GB/s bandwidth

Page 50:

Quantitative GPU Performance

Page 51:

CUDA: Memory Output Bandwidth
Measure speed for various array sizes and number of threads per block.

Build a trivial CUDA test harness:

    __global__ void write_arr(float *arr) {
        arr[threadIdx.x + blockDim.x*blockIdx.x] = 0.0;
    }
    int main() {
        ... start timer
        write_arr<<<nblocks,threadsPerBlock>>>(arr);
        ... stop timer
    }
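One possible completion of that harness, filling in the elided timers with CUDA events; the launch configuration below is a placeholder assumption, not the sweep used for the table that follows.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void write_arr(float *arr) {
        arr[threadIdx.x + blockDim.x*blockIdx.x] = 0.0f;
    }
    int main() {
        int threadsPerBlock = 128, nblocks = 8192;   // placeholder sizes
        int n = threadsPerBlock * nblocks;
        float *arr = 0;
        cudaMalloc((void**)&arr, n * sizeof(float));
        cudaEvent_t start, stop;
        cudaEventCreate(&start); cudaEventCreate(&stop);
        cudaEventRecord(start, 0);                   // start timer
        write_arr<<<nblocks, threadsPerBlock>>>(arr);
        cudaEventRecord(stop, 0);                    // stop timer
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%d floats in %.3f ms -> %.2f Gfloats/s\n", n, ms, n / (ms * 1.0e6));
        cudaFree(arr);
        return 0;
    }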

Page 52:

CUDA: Memory Output Bandwidth
NVIDIA GeForce GTX 280
Output Bandwidth (Gigafloats/second), by Threads per Block (rows) and Number of Floats/Kernel (columns):

Threads    1K    10K   100K    1M    10M
    1      0.1   0.2    err    err    err
    2      0.1   0.3    0.4    err    err
    4      0.1   0.4    0.8    err    err
    8      0.1   0.5    1.4    err    err
   16      0.1   0.6    2.4    3.3    err
   32      0.1   0.7    3.6    6.4    err
   64      0.1   0.7    4.8   11.7    err
  128      0.1   0.7    5.4   16.4    err
  256      0.1   0.7    5.3   16.4   20.7
  512      0.1   0.7    5.2   16.3   20.7

Page 53: Over 64K blocks needed
Page 54: Kernel startup time dominates
Page 55: Block startup time dominates
Page 56: Good performance

(Pages 53-56 repeat the table above with these annotations.)

Page 57:

CUDA: Memory Output Bandwidth

NVIDIA GeForce GTX 280, fixed 128 threads per block

Page 58:

CUDA: Memory Output Bandwidth
NVIDIA GeForce GTX 280, fixed 128 threads per block
Kernel startup latency: 4us
Kernel output bandwidth: 80 GB/s

Pages 59-61: (the same plot, with the fit rewritten step by step)
t = 4us / kernel + bytes / 80 GB/s
t = 4us / kernel + bytes * 0.0125 ns / byte
t = 4000ns / kernel + bytes * 0.0125 ns / byte

Page 62:

CUDA: Memory Output Bandwidth
NVIDIA GeForce GTX 280, fixed 128 threads per block
t = 4000ns / kernel + bytes * 0.0125 ns / byte

WHAT!? time/kernel = 320,000 x time/byte!?
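Worked out from the fit above: the two terms balance when bytes = 4000 ns / (0.0125 ns/byte) = 320,000 bytes, so a kernel has to write well over 320 KB of output before its startup latency stops dominating.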

Page 63:

CUDA: Device-to-Host Memcopy

NVIDIA GeForce GTX 280, pinned memory

Copy startup latency: 10us

Copy output bandwidth: 1.7 GB/s

t = 10000ns / copy + bytes * 0.59 ns / byte

Page 64:

CUDA: Device-to-Host Memcopy

NVIDIA GeForce GTX 280, pinned memory

t = 10000ns / copy + bytes * 0.59 ns / byte

Page 65:

CUDA: Memory Output Bandwidth

NVIDIA GeForce GTX 280, fixed 128 threads per block

t = 4000ns / kernel + bytes * 0.0125 ns / byte

Page 66:

Performance Analysis Take-Home
• GPU kernels: high latency, and high bandwidth
• GPU-CPU copy: high latency, but low bandwidth
• High latency means use big batches (not many small ones)
• Low copy bandwidth means you MUST leave most of your data on the GPU (can't copy it all out)
• Similar lessons as network tuning! (Message combining, local storage.)

Page 67:

Why GPGPU clusters?

Page 68:

Parallel GPU Advantages
Multiple GPUs: more compute & storage bandwidth

Page 69:

Parallel GPU Communication
Shared memory, or messages?
• History: on the CPU, shared memory won for small machines; messages won for big machines
• Shared-memory GPUs: Tesla
• Message-passing GPUs: clusters
  • Easy to make in a small department
  • E.g., $2,500 for ten GTX 280's (8 TF!)
• Or both, like NCSA Lincoln
• Messages are the lowest common denominator: run easily on shared memory (not vice versa!)

Page 70:

Parallel GPU Application Interface
• On the CPU, for message passing the Message Passing Interface (MPI) is standard and good
• Design by committee means if you need it, it's already in there!
  • E.g., noncontiguous derived datatypes, communicator splitting, buffer mgmt
• So let's steal that! Explicit send and receive calls to communicate GPU data on the network

Page 71:

Parallel GPU Implementation
• Should we send and receive from GPU code, or CPU code?
  GPU code seems more natural for GPGPU (data is on the GPU already)
  CPU code makes it easy to call MPI
• Network and GPU-CPU data copy have high latency, so message combining is crucial!
  Hard to combine efficiently on the GPU (atomic list addition?) [Owens]
  But easy to combine data on the CPU

Page 72:

Parallel GPU Send and Receives
• cudaMPI_Send is called on the CPU with a pointer to GPU memory
  cudaMemcpy into a (pinned) CPU communication buffer
  MPI_Send the communication buffer
• cudaMPI_Recv is similar
  MPI_Recv into the (pinned) CPU buffer
  cudaMemcpy off to GPU memory
• cudaMPI_Bcast, Reduce, Allreduce, etc are all similar: copy data to the CPU, call MPI (easy!)
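A hedged sketch of that recipe; the real cudaMPI interface may differ (for instance by taking an MPI datatype and count), and host_buf is an assumed preallocated pinned staging buffer.

    #include <mpi.h>
    #include <cuda_runtime.h>
    static void *host_buf;   /* assumed: allocated once with cudaMallocHost, large enough */

    int cudaMPI_Send(const void *gpu_ptr, int bytes, int dest, int tag, MPI_Comm comm) {
        cudaMemcpy(host_buf, gpu_ptr, bytes, cudaMemcpyDeviceToHost); /* GPU -> pinned CPU */
        return MPI_Send(host_buf, bytes, MPI_BYTE, dest, tag, comm);  /* CPU -> network */
    }
    int cudaMPI_Recv(void *gpu_ptr, int bytes, int src, int tag,
                     MPI_Comm comm, MPI_Status *status) {
        MPI_Recv(host_buf, bytes, MPI_BYTE, src, tag, comm, status);  /* network -> CPU */
        cudaMemcpy(gpu_ptr, host_buf, bytes, cudaMemcpyHostToDevice); /* CPU -> GPU */
        return MPI_SUCCESS;
    }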

Page 73:

Asynchronous Send and Receives
• cudaMemcpy, MPI_Send, and MPI_Recv all block the CPU
• cudaMemcpyAsync, MPI_Isend, and MPI_Irecv don't: they return a handle you occasionally poll instead
• Sadly, cudaMemcpyAsync has much higher latency than the blocking version (40us vs 10us)
• cudaMPI_Isend and Irecv exist, but usually are not worth using!
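For reference, a hedged guess at the nonblocking path, reusing the assumed host_buf staging buffer from the sketch above; the real cudaMPI_Isend may be organized differently.

    #include <mpi.h>
    #include <cuda_runtime.h>
    extern void *host_buf;   /* the same assumed pinned staging buffer as above */

    void cudaMPI_Isend_sketch(const void *gpu_ptr, int bytes, int dest, int tag,
                              MPI_Comm comm, cudaStream_t stream, MPI_Request *req) {
        cudaMemcpyAsync(host_buf, gpu_ptr, bytes, cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream); /* or poll cudaStreamQuery(stream) to overlap work */
        MPI_Isend(host_buf, bytes, MPI_BYTE, dest, tag, comm, req);
        /* the caller later polls MPI_Test(req, ...) or calls MPI_Wait(req, ...) */
    }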

Page 74:

OpenGL Communication
• It's easy to write corresponding glMPI_Send and glMPI_Recv
  glMPI_Send: glReadPixels, MPI_Send
  glMPI_Recv: MPI_Recv, glDrawPixels
• Sadly, glDrawPixels is very slow: only 250MB/s!
• glReadPixels is over 1GB/s
• glTexSubImage2D from a pixel buffer object is much faster (>1GB/s)
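A hedged sketch of the two calls; the real glMPI signatures may differ, host_buf is an assumed CPU staging buffer, and GL_LUMINANCE/GL_FLOAT matches the greyscale-float texture convention mentioned earlier.

    #include <mpi.h>
    #include <GL/gl.h>
    static float *host_buf;   /* assumed: large enough for w*h floats */

    int glMPI_Send(int x, int y, int w, int h, int dest, int tag, MPI_Comm comm) {
        glReadPixels(x, y, w, h, GL_LUMINANCE, GL_FLOAT, host_buf);  /* framebuffer -> CPU */
        return MPI_Send(host_buf, w*h, MPI_FLOAT, dest, tag, comm);
    }
    int glMPI_Recv(int x, int y, int w, int h, int src, int tag,
                   MPI_Comm comm, MPI_Status *status) {
        MPI_Recv(host_buf, w*h, MPI_FLOAT, src, tag, comm, status);
        glRasterPos2i(x, y);                                         /* destination corner */
        glDrawPixels(w, h, GL_LUMINANCE, GL_FLOAT, host_buf);        /* CPU -> framebuffer */
        return MPI_SUCCESS;
    }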

Page 75:

CUDA or OpenGL?
• cudaMemcpy requires a contiguous memory buffer
• glReadPixels takes an arbitrary rectangle of texture data
  Rows, columns, squares all equally fast!
• Still, there's no way to simulate MPI's rich noncontiguous derived data types

Page 76:

Noncontiguous Communication
(figure: network data buffer scattered into the GPU target buffer)
Scatter kernel? Fold into read?
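One way to answer the "scatter kernel?" question is a small CUDA kernel like this hedged sketch, which unpacks a contiguous network buffer of nrows rows into a strided GPU target buffer (all names are hypothetical):

    // One thread per float in the packed network buffer.
    __global__ void scatter_rows(const float *packed, float *target,
                                 int row_len, int target_pitch, int nrows) {
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        if (i < row_len * nrows) {
            int row = i / row_len, col = i % row_len;
            target[row * target_pitch + col] = packed[i];   // strided destination
        }
    }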

Page 77:

cudaMPI and glMPI Performance

Page 78:

Parallel GPU Performance
(diagram: two nodes, each with CPU, GPU, and GPU RAM)
• Graphics card memory: 80 GB/s
• GPU-CPU copy: 1.7 GB/s
• Gigabit Ethernet: 0.1 GB/s
Networks are really low bandwidth!

Page 79:

Parallel GPU Performance: CUDA

Ten GTX 280 GPUs over gigabit Ethernet; CUDA

Page 80:

Parallel GPU Performance: OpenGL

Ten GTX 280 GPUs over gigabit Ethernet; OpenGL

Page 81:

Conclusions

Page 82:

Conclusions
• Consider OpenGL or DirectX instead of CUDA or OpenCL
• Watch out for latency on the GPU! Combine data for kernels & copies
• Use cudaMPI and glMPI for free: http://www.cs.uaf.edu/sw/
• Future work:
  More real applications!
  GPU-JPEG network compression
  Load balancing by migrating pixels
  DirectXMPI? cudaSMP?

Page 83:

Related work: MPIglut

Page 84:

MPIglut: Motivation
● All modern computing is parallel
  Multi-Core CPUs, Clusters
  • Athlon 64 X2, Intel Core2 Duo
  Multiple Multi-Unit GPUs
  • nVidia SLI, ATI CrossFire
  Multiple Displays, Disks, ...
● But languages and many existing applications are sequential
  Software problem: run existing serial code on a parallel machine
  Related: easily write parallel code

Page 85:

What is a “Powerwall”?
● A powerwall has:
  Several physical display devices
  One large virtual screen, i.e. “parallel screens”
● UAF CS/Bioinformatics Powerwall
  Twenty LCD panels
  9000 x 4500 pixels combined resolution
  35+ Megapixels

Page 86:

Sequential OpenGL Application

Page 87:

Parallel Powerwall Application

Page 88:

MPIglut: The basic idea
● Users compile their OpenGL/glut application using MPIglut, and it “just works” on the powerwall
● MPIglut's version of glutInit runs a separate copy of the application for each powerwall screen
● MPIglut intercepts glutInit, glViewport, and broadcasts user events over the network
● MPIglut's glViewport shifts to render only the local screen
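A hedged sketch of the kind of interception involved (not MPIglut's actual source): the library's glViewport subtracts this backend's screen origin before calling the real GL entry point. mpiglut_originX/Y and real_glViewport are hypothetical names.

    #include <GL/gl.h>
    extern GLint mpiglut_originX, mpiglut_originY;  /* this backend's offset in the global screen */
    extern void real_glViewport(GLint x, GLint y, GLsizei width, GLsizei height);

    /* Sketch only: shift global powerwall coordinates into this backend's local screen. */
    void glViewport(GLint x, GLint y, GLsizei width, GLsizei height) {
        real_glViewport(x - mpiglut_originX, y - mpiglut_originY, width, height);
    }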

Page 89:

MPIglut Conversion: Original Code

    #include <GL/glut.h>
    void display(void) {
        glBegin(GL_TRIANGLES); ... glEnd();
        glutSwapBuffers();
    }
    void reshape(int x_size,int y_size) {
        glViewport(0,0,x_size,y_size);
        glLoadIdentity();
        gluLookAt(...);
    }
    ...
    int main(int argc,char *argv[]) {
        glutInit(&argc,argv);
        glutCreateWindow("Ello!");
        glutMouseFunc(...);
        ...
    }

Page 90:

MPIglut: Required Code Changes
(same code as above, except the include line)

    #include <GL/mpiglut.h>

This is the only source change. Or, you can just copy mpiglut.h over your old glut.h header!

Page 91:

MPIglut Runtime Changes: Init
(same code as above)
MPIglut starts a separate copy of the program (a “backend”) to drive each powerwall screen

Page 92:

MPIglut Runtime Changes: Events
(same code as above)
Mouse and other user input events are collected and sent across the network. Each backend gets identical user events (collective delivery)

Page 93:

MPIglut Runtime Changes: Sync
(same code as above)
Frame display is (optionally) synchronized across the cluster

Page 94:

MPIglut Runtime Changes: Coords
(same code as above)
User code works only in global coordinates, but MPIglut adjusts OpenGL's projection matrix to render only the local screen

Page 95:

MPIglut Runtime Non-Changes
(same code as above)
MPIglut does NOT intercept or interfere with rendering calls, so programmable shaders, vertex buffer objects, framebuffer objects, etc all run at full performance

Page 96:

MPIglut Assumptions/Limitations
● Each backend app must be able to render its part of its screen
  Does not automatically imply a replicated database, if the application uses matrix-based view culling
● Backend GUI events (redraws, window changes) are collective
  All backends must stay in synch
  Automatic for applications that are a deterministic function of events
  • Non-synchronized: files, network, time

Page 97:

MPIglut: Bottom Line
● Tiny source code change
● Parallelism hidden inside MPIglut
  Application still “feels” sequential
● Fairly major runtime changes
  Serial code now runs in parallel (!)
  Multiple synchronized backends running in parallel
  User input events go across the network
  OpenGL rendering coordinate system adjusted per-backend
  But rendering calls are left alone

Page 98:

Load Balancing a Powerwall
● Problem: Sky really easy; terrain really hard
● Solution: Move the rendering for load balance, but you've got to move the finished pixels back for display!

Page 99:

Future Work: Load Balancing
● AMPIglut: principle of persistence should still apply
● But need a cheap way to ship back finished pixels every frame
● Exploring GPU JPEG compression
  DCT + quantize: really easy
  Huffman/entropy: really hard
  Probably need a CPU/GPU split
  • 10000+ MB/s inside GPU
  • 1000+ MB/s on CPU
  • 100+ MB/s on network