[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting


DESCRIPTION

http://cs264.org
http://goo.gl/1K2fI

TRANSCRIPT

–  Data-independent tasks

–  Tasks with statically-known data dependences

–  SIMD divergence

–  Lacking fine-grained synchronization

–  Lacking writeable, coherent caches

DEVICE                               32-bit key-value sorting    Keys-only sorting
                                     (10^6 pairs/sec)            (10^6 keys/sec)
NVIDIA GTX 480                       775                         1005
NVIDIA GTX 280                       449  (3.8x speedup*)        534  (2.9x speedup*)
NVIDIA 8800 GT                       129                         171
Intel Knight's Ferry MIC 32-core**   -                           560
Intel Core i7 quad-core**            -                           240
Intel Core 2 quad-core**             -                           138

* vs. Satish et al., "Designing efficient sorting algorithms for manycore GPUs," IPDPS '09
** Satish et al., "Fast Sort on CPUs, GPUs and Intel MIC Architectures," Intel Tech Report 2010

 

–  Each output is dependent upon a finite subset of the input
•  Threads are decomposed by output element
•  The output (and at least one input) index is a static function of thread-id

[Diagram: threads map statically from input elements to output elements]

–  Each output element has dependences upon any / all input elements

–  E.g., sorting, reduction, compaction, duplicate removal, histogram generation, map-reduce, etc.

[Diagram: the input-to-output mapping is not known a priori]

–  Threads are decomposed by output element

–  Repeatedly iterate over recycled input streams

–  Output stream size is statically known before each pass

–  O(n) global work from passes of pairwise-neighbor reduction

–  Static dependences, uniform output

–  Repeated pairwise swapping
•  Bubble sort is O(n^2)
•  Bitonic sort is O(n log^2 n)
–  Need partitioning: dynamic, cooperative allocation

–  Repeatedly check each vertex or edge
•  Breadth-first search becomes O(V^2)
•  O(V+E) is work-optimal
–  Need a queue: dynamic, cooperative allocation

–  Variable output per thread

–  Need dynamic, cooperative allocation


•  Where do I put something in a list?
–  Duplicate removal
–  Sorting
–  Histogram compilation

•  Where do I enqueue something?
–  Search space exploration
–  Graph traversal
–  General work queues


• For 30,000 producers and consumers?

–  Locks serialize everything

–  O(n) work

–  For allocation: use scan results as a scattering vector

–  Popularized by Blelloch et al. in the '90s

–  Merrill et al. Parallel Scan for Stream Architectures. Technical Report CS2009-14, University of Virginia, 2009

Input (& allocation requirement):   2   1   0   3   2
Result of prefix scan (sum):        0   2   3   3   6
Output positions:                   0   1   2   3   4   5   6   7

Each thread scatters its outputs into the shared output array starting at its scanned offset.
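To make the allocation idiom concrete, here is a minimal sketch using Thrust (the library that, as noted later, adopted this sorting work): an exclusive prefix sum turns per-element allocation counts into scatter offsets, so many producers can reserve output space without locks. The variable names are illustrative, not from the talk's code.

#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <cstdio>

int main()
{
    // Per-element allocation requirements, from the example above: 2 1 0 3 2
    int h_counts[] = {2, 1, 0, 3, 2};
    thrust::device_vector<int> counts(h_counts, h_counts + 5);
    thrust::device_vector<int> offsets(5);

    // Exclusive prefix sum: 0 2 3 3 6
    thrust::exclusive_scan(counts.begin(), counts.end(), offsets.begin());

    // Element i may write its outputs at positions
    // [offsets[i], offsets[i] + counts[i])
    int total = offsets[4] + counts[4];   // 6 + 2 = 8 outputs in all
    printf("total outputs: %d\n", total);
    return 0;
}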

Binary split on one radix digit (0s bin, then 1s bin):

Key sequence:              1110 1010 1100 1000 0011 0111 0101 0001
Allocation requirements:   one flag vector per digit value (0s flags, 1s flags)
Scanned allocations:       bin relocation offsets (prefix scan across the flag vectors)
Adjusted allocations:      global relocation offsets (1s offsets shifted past the 0s total)
Output key sequence:       1110 1100 0011 0111 1010 1000 0101 0001

 

Un-fused:

The host program launches a separate kernel for each step, staging intermediate results in global device memory between launches:
•  Determine allocation size
•  CUDPP scan (itself multiple kernel passes)
•  Distribute output

Un-fused vs. Fused:

Fused: the same work becomes a short sequence of scan kernels, with "determine allocation" fused into the first and "distribute output" fused into the last, so live data makes far fewer round trips through global device memory.

1.  Heavy SMT (over-threading) yields usable "bubbles" of free computation

2.  Propagate live data between steps in fast registers / smem

3.  Use scan (or a variant) as a "runtime" for everything

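A minimal single-CTA sketch of the fused organization (illustrative, not the talk's actual kernels): allocation sizing, the scan, and output distribution all happen in one launch, with live data held in registers and smem instead of round-tripping through global device memory.

// Fused stream compaction for one 256-thread CTA (toy predicate: keep evens)
__global__ void FusedCompact(const int *in, int *out, int *out_count, int n)
{
    __shared__ int offsets[256];
    int tid = threadIdx.x;

    // Step 1: determine allocation (one flag per thread)
    int x    = (tid < n) ? in[tid] : 0;
    int flag = (tid < n && (x % 2 == 0)) ? 1 : 0;
    offsets[tid] = flag;
    __syncthreads();

    // Step 2: inclusive scan in smem (simple Hillis-Steele sketch)
    for (int d = 1; d < 256; d <<= 1) {
        int t = (tid >= d) ? offsets[tid - d] : 0;
        __syncthreads();
        offsets[tid] += t;
        __syncthreads();
    }
    int excl = offsets[tid] - flag;   // convert inclusive -> exclusive

    // Step 3: distribute output using the scanned offsets
    if (flag) out[excl] = x;
    if (tid == 255) *out_count = offsets[tid];
}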

Device           Memory bandwidth   Compute throughput       Memory wall     Memory wall
                 (10^9 bytes/s)     (10^9 thread-cycles/s)   (bytes/cycle)   (instrs/word)
GTX 480          169.0              672.0                    0.251           15.9
GTX 285          159.0              354.2                    0.449           8.9
GTX 280          141.7              311.0                    0.456           8.8
Tesla C1060      102.0              312.0                    0.327           12.2
9800 GTX+        70.4               235.0                    0.300           13.4
8800 GT          57.6               168.0                    0.343           11.7
9800 GT          57.6               168.0                    0.343           11.7
8800 GTX         86.4               172.8                    0.500           8.0
Quadro FX 5600   76.8               152.3                    0.504           7.9
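The two memory-wall columns follow directly from the first two, assuming 4-byte words; e.g., for the GTX 480: 169.0 / 672.0 ≈ 0.251 bytes/cycle, and 4 / 0.251 ≈ 15.9 thread-instructions per word.

\[
\text{wall}_{\text{bytes/cycle}} = \frac{\text{memory bandwidth}}{\text{compute throughput}},
\qquad
\text{wall}_{\text{instrs/word}} = \frac{4\ \text{bytes/word}}{\text{wall}_{\text{bytes/cycle}}}
\]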

[Plot: thread-instructions per 32-bit scan element vs. problem size (millions). The GTX 285 read+write memory wall sits at 17.8 instructions per input word; a data-movement skeleton and our scan kernel both run just below it, leaving headroom to "insert work here".]

–  Increase granularity / redundant computation

• ghost cells

• radix bits

–  Orthogonal kernel fusion


[Plot: thread-instructions per 32-bit scan element vs. problem size (millions), comparing the CUDPP scan kernel against our scan kernel.]

–  Partially-coalesced writes

–  2x write overhead

–  4 total concurrent scan operations (radix 16)

[Plot: thread-instructions per 32-bit scan element vs. problem size (millions). Our scan kernel runs near the GTX 285 scan kernel wall; the GTX 285 radix scatter kernel wall sits well above it, leaving room to "insert work here".]

–  Need kernels with tunable local (or redundant) work

• ghost cells

• radix bits

[Plot: thread-instructions per 32-bit word vs. problem size (millions), showing the GTX 285 and GTX 480 radix scatter kernel walls.]

 

–  Virtual processors abstract a diversity of hardware configurations

–  Leads to a host of inefficiencies

–  E.g., only several hundred CTAs

Grid A: grid-size = (N / tilesize) CTAs (one threadblock per tile)
Grid B: grid-size = 150 CTAs (or other small constant), each threadblock iterating over many tiles
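A minimal sketch of the Grid B style (names illustrative): a small, fixed number of CTAs each loop over many tiles, decoupling grid size from problem size.

// Each CTA strides through the tile space: CTA b processes tiles
// b, b + gridDim.x, b + 2*gridDim.x, ...
// Assumes blockDim.x == tile_size and N is a multiple of tile_size.
__global__ void ProcessTiles(const float *in, float *out,
                             int num_tiles, int tile_size)
{
    for (int tile = blockIdx.x; tile < num_tiles; tile += gridDim.x) {
        int idx = tile * tile_size + threadIdx.x;
        out[idx] = in[idx] * 2.0f;   // stand-in for the real per-tile work
    }
}

// Launch with a small constant grid, e.g.:
// ProcessTiles<<<150, tile_size>>>(d_in, d_out, N / tile_size, tile_size);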

–  Thread-dependent predicates

–  Setup and initialization code (notably for smem)

–  Offset calculations (notably for smem)

–  Common values are hoisted and kept live

–  Spills are really bad

–  O(N / tilesize) gmem accesses

–  2-4 instructions per access (offset calcs, load, store)

–  GPU is least efficient here: get it over with as quickly as possible

log_tilesize(N)-level tree vs. two-level tree

[Plot: thread-instructions per element vs. grid size (# of threadblocks, up to ~9000), comparing the compute load against the GTX 285 scan kernel wall.]

C = number of CTAs, N = problem size, T = tile size, B = tiles per CTA

–  Naive: 16.1M / 150 CTAs / 1024 = 109.91 tiles per CTA
•  conditional evaluation
•  singleton loads

–  Better: floor(16.1M / (1024 * 150)) = 109 whole tiles per CTA, leaving 16.1M % (1024 * 150) = 136.4 extra tiles

–  Even-share: 109 + 1 = 110 tiles per CTA for 136 CTAs, 109 tiles per CTA for the remaining 14 CTAs, leaving only 0.4 of a tile extra
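The even-share arithmetic above, as a minimal host-side sketch (names illustrative); with these inputs it reproduces the slide's 109 / 110 tiles and 136 / 14 CTA split.

#include <cstdio>

int main()
{
    long long N = 16882074;   // ~16.1M (binary millions) elements
    int T = 1024;             // tile size
    int C = 150;              // number of CTAs

    long long full_tiles = N / T;          // 16486 whole tiles
    long long partial    = N % T;          // 410 leftover elements (~0.4 tile)
    long long B          = full_tiles / C; // 109 base tiles per CTA
    long long big_ctas   = full_tiles % C; // 136 CTAs take B+1 tiles

    printf("%lld CTAs process %lld tiles, %lld CTAs process %lld tiles\n",
           big_ctas, B + 1, C - big_ctas, B);
    printf("partial last tile: %lld elements\n", partial);
    return 0;
}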

 

–  If you breathe on your code, run it through the Visual Profiler
•  Kernel runtimes
•  Instruction counts

–  Indispensable for tuning
•  Host-side timing requires too many iterations
•  Only 1-2 cudaprof iterations are needed for consistent counter-based perf data

–  Write tools to parse the output
•  "Dummy" kernels are useful for demarcation

[Plot: keys-only sorting rate (10^6 keys/sec) vs. problem size (millions) for GTX 480, C2050 (no ECC), GTX 285, C2050 (ECC), GTX 280, C1060, and 9800 GTX+.]

[Plot: key-value sorting rate (10^6 pairs/sec) vs. problem size (millions) for GTX 480, C2050 (no ECC), GTX 285, GTX 280, C2050 (ECC), C1060, and 9800 GTX+.]

[Plot: kernel bandwidth (GiBytes/sec) vs. problem size (millions) for merrill_tree Reduce and merrill_rts Scan.]

[Plot: kernel bandwidth (10^9 bytes/sec) vs. problem size (millions) for merrill_linear Reduce and merrill_linear Scan.]

–  Implement a device "memcpy" for tile-processing
•  Optimize for "full tiles"

–  Specialize for different SM versions, input types, etc.

–  Use templated code to generate the various instances

–  Run with cudaprof env vars to collect data
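A minimal sketch of such a templated tile-copy kernel (illustrative, not Back40's actual code): tile size and loads-per-thread are compile-time parameters, so one source generates many specialized instances that can each be timed under the profiler, and full tiles skip per-load bounds checks.

template <typename T, int CTA_THREADS, int LOADS_PER_THREAD>
__global__ void TileCopy(const T *in, T *out, int num_elements)
{
    const int TILE = CTA_THREADS * LOADS_PER_THREAD;
    int base = blockIdx.x * TILE;

    if (base + TILE <= num_elements) {
        // "Full tile": unrolled, unguarded loads/stores
        #pragma unroll
        for (int i = 0; i < LOADS_PER_THREAD; ++i) {
            int idx = base + i * CTA_THREADS + threadIdx.x;
            out[idx] = in[idx];
        }
    } else {
        // Partial last tile: guarded
        for (int i = 0; i < LOADS_PER_THREAD; ++i) {
            int idx = base + i * CTA_THREADS + threadIdx.x;
            if (idx < num_elements) out[idx] = in[idx];
        }
    }
}

// e.g., TileCopy<unsigned int, 128, 4><<<grid, 128>>>(d_in, d_out, n);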

[Plot: bandwidth (GiBytes/sec) vs. words copied (millions) for a 128-thread CTA with 64B loads: single, double, and quad loads, each with and without overlap.]

[Plot: bandwidth (GiBytes/sec) vs. words copied (millions) for a 128-thread CTA with 128B ld/st: single, double, and quad variants (with and without overlap), the intrinsic copy, and one-way vs. two-way cudaMemcpy() for reference.]

 

–  SIMD lanes are wasted on O(n)-work Brent-Kung (left), but it does less work when n > warp size

–  Kogge-Stone (right) is O(n log n)-work, but faster when n ≤ warp size

[Diagram: scan networks over x0..x7. Left: Brent-Kung, an upsweep and downsweep of pairwise ⊕ combinations. Right: Kogge-Stone, where every lane combines with a neighbor at doubling distances each step.]

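A minimal sketch of a warp-synchronous Kogge-Stone inclusive scan in the idiom of this hardware generation (illustrative): within a warp the lanes execute in lock-step SIMD, so no __syncthreads() is needed, and padding the smem buffer with identity elements avoids per-step bounds checks. (On current architectures this pattern additionally needs __syncwarp() or warp shuffles.)

__device__ int WarpScanInclusive(volatile int *smem, int lane, int value)
{
    // smem is a 64-entry region per warp; the low half is seeded with
    // the identity (0) so reads at [lane - offset] never underflow
    smem[lane] = 0;
    smem += 32;
    smem[lane] = value;

    // log2(32) = 5 Kogge-Stone steps, no barriers within the warp
    smem[lane] = value = value + smem[lane - 1];
    smem[lane] = value = value + smem[lane - 2];
    smem[lane] = value = value + smem[lane - 4];
    smem[lane] = value = value + smem[lane - 8];
    smem[lane] = value = value + smem[lane - 16];
    return value;
}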

Tree-based vs. raking-based:

[Diagram: a tree-based CTA scan requires a barrier between every level of the tree, while a raking-based scan needs only two barriers bracketing a serial raking phase performed by a small subset of threads over the partials deposited by all T threads.]

–  Barriers make O(n) code O(n log n)

–  Only the raking warps do real work; the rest are "DMA engine" threads

–  Use threadblocks to cover pipeline latencies, e.g., for Fermi:

•  2 worker warps per CTA

•  6-7 CTAs
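A minimal raking-style reduction sketch (illustrative): every thread deposits one partial, two barriers bracket the serial raking, and threads outside the raking set act only as "DMA engines" that load data and then wait.

#define CTA_THREADS 256
#define WORKERS      64                      // 2 warps of raking workers
#define SEG (CTA_THREADS / WORKERS)          // 4 partials per worker

__global__ void RakingReduce(const int *in, int *out)
{
    __shared__ int partials[CTA_THREADS];
    int tid = threadIdx.x;

    // All threads "DMA" one element into smem
    partials[tid] = in[blockIdx.x * CTA_THREADS + tid];
    __syncthreads();                         // barrier #1

    if (tid < WORKERS) {
        // Serial raking: O(SEG) work per worker, no extra barriers
        int sum = 0;
        for (int i = 0; i < SEG; ++i)
            sum += partials[tid * SEG + i];
        partials[tid] = sum;
    }
    __syncthreads();                         // barrier #2

    // A warp-synchronous reduction (as sketched earlier) would normally
    // finish the job; a serial loop keeps this sketch short
    if (tid == 0) {
        int total = 0;
        for (int i = 0; i < WORKERS; ++i) total += partials[i];
        out[blockIdx.x] = total;
    }
}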


 

–  Different SMs (varied local storage: registers/smem)

–  Different input types (e.g., sorting chars vs. ulongs)

–  The # of steps for each algorithm phase is configuration-driven

–  Template expansion + constant propagation + static loop unrolling + preprocessor macros

–  The compiler produces target assembly that is well-tuned for the specific hardware and problem
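A minimal sketch of this meta-programming idea (names illustrative): phase parameters live in a compile-time configuration type, so template expansion and constant propagation turn shifts, masks, and loop bounds into literals the compiler can fully unroll.

template <typename KeyType, int RADIX_BITS, int SM_ARCH>
struct SortingConfig
{
    static const int BITS          = RADIX_BITS;
    static const int DIGITS        = 1 << RADIX_BITS;
    // e.g., smaller tiles where local storage is scarce (pre-Fermi)
    static const int TILE_ELEMENTS = (SM_ARCH >= 200) ? 1024 : 512;
    static const int PASSES        =
        (sizeof(KeyType) * 8 + RADIX_BITS - 1) / RADIX_BITS;
};

// After expansion, BITS and DIGITS are literals: the shift and mask
// below constant-fold, and loops bounded by them statically unroll
template <typename Config>
__global__ void DecodeDigits(const unsigned int *keys, int *digits, int pass)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    digits[idx] = (keys[idx] >> (pass * Config::BITS)) & (Config::DIGITS - 1);
}

// e.g., DecodeDigits< SortingConfig<unsigned int, 4, 200> ><<<grid, cta>>>(...);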

[Diagram: local multi-scan for a radix-4 digit. Pipeline: serial reduction (regs) → serial reduction (shmem) → SIMD Kogge-Stone scan (shmem) → serial scan (shmem) → serial scan (regs) → digit decoding. In the example, the digit carry-ins are 9 (0s), 25 (1s), 33 (2s), 49 (3s); after the scatter (SIMD Kogge-Stone scan, smem exchange) the carry-outs are 18 (0s), 32 (1s), 42 (2s), 56 (3s).]

[Diagram: scanning the per-thread digit flag vectors yields local exchange offsets; seeding those with the digit carry-ins yields global scatter offsets.]

[Worked example: a 32-key subsequence of radix-4 keys
211 122 302 232 300 021 022 021 123 330 023 130 220 020 112 221 023 322 003 012 022 010 121 323 020 101 212 220 013 301 130 333
is decoded on its least-significant digit into four digit flag vectors (0s, 1s, 2s, 3s), with digit totals 9, 7, 9, and 7. Scanning the flag vectors gives each key its local rank within its digit bin, and the keys are exchanged through smem into digit order: all keys ending in 0, then 1, then 2, then 3.]
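A minimal sketch of the flag-vector encode/decode (illustrative): for a 2-bit (radix-4) digit, each key contributes a 1 in the byte lane of its digit, so one 32-bit word carries all four digit counters, and a single prefix scan over the packed words multi-scans the four bins at once (each byte lane can absorb up to 255 flags before overflowing).

// Pack a flag for this key's digit into one of four byte lanes
__host__ __device__ inline unsigned int EncodeFlag(unsigned int key,
                                                   int shift)
{
    unsigned int digit = (key >> shift) & 0x3;   // 2-bit digit: 0..3
    return 1u << (digit * 8);                    // 1 in byte lane 'digit'
}

// After scanning the packed words, extract this key's local rank
// within its digit bin
__host__ __device__ inline unsigned int DecodeRank(unsigned int packed,
                                                   unsigned int digit)
{
    return (packed >> (digit * 8)) & 0xFF;
}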


•  Resource-allocation as runtime

1.  Kernel memory-wall analysis (kernel fusion)

2.  Algorithm serialization

3.  Tune for data-movement

4.  Warp-synchronous programming

5.  Flexible granularity via meta-programming

–  Back40Computing (a Google Code project)
•  http://code.google.com/p/back40computing/

–  Default sorting method for Thrust
•  http://code.google.com/p/thrust/

–  A single host-side procedure call launches a kernel that performs orthogonal program steps

MyUberKernel<<<grid_size, num_threads>>>(d_device_storage);

–  No existing public repositories of kernel "subroutines" for scavenging

–  Callbacks, iterators, visitors, functors, etc.:
ReduceKernel<<<grid_size, num_threads>>>(CountingIterator(100));

–  E.g., the fused kernel at left can't be composed using a callback-based pattern

Fused radix sorting kernel:
•  Digit extraction
•  Local prefix scan
•  Scatter accordingly

Per-tile steps (keys, then values):
GATHER (key)
Extract radix digit
Encode flag bit (into flag vectors)
LOCAL MULTI-SCAN (flag vectors)
Decode local rank (from flag vectors)
EXCHANGE (key)
Extract radix digit (again)
Update global radix digit partition offsets
SCATTER (key)
GATHER (value)
EXCHANGE (value)
SCATTER (value)

–  Compiled libraries suffer from code bloat
•  The CUDPP primitives library is 100s of MBs, yet still doesn't support all built-in numeric types
•  Specializing for device configurations makes it even worse

–  The alternative is to ship source for #include'ing
•  You have to be willing to share source

–  Need a way to fit meta-programming in at the JIT / bytecode level to help avoid expansion / mismatch-by-omission

–  Can leverage fundamentally different algorithms for different phases
•  How do we teach the compiler to do this?
