[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)
DESCRIPTION
http://cs264.org
http://goo.gl/1K2fI
TRANSCRIPT
[Slide 1]
[Slide 2]
– Data-independent tasks
– Tasks with statically-known data dependences
– SIMD divergence
– Lacking fine-grained synchronization
– Lacking writeable, coherent caches
[Slide 4]
[Slide 5]

| Device | 32-bit key-value sorting (10^6 pairs/sec) | Keys-only sorting (10^6 keys/sec) |
| --- | --- | --- |
| NVIDIA GTX 280 | 449 (3.8x speedup*) | 534 (2.9x speedup*) |

* Satish et al., "Designing efficient sorting algorithms for manycore GPUs," in IPDPS '09
[Slide 6]

| Device | 32-bit key-value sorting (10^6 pairs/sec) | Keys-only sorting (10^6 keys/sec) |
| --- | --- | --- |
| NVIDIA GTX 480 | 775 | 1005 |
| NVIDIA GTX 280 | 449 | 534 |
| NVIDIA 8800 GT | 129 | 171 |
[Slide 7]

| Device | 32-bit key-value sorting (10^6 pairs/sec) | Keys-only sorting (10^6 keys/sec) |
| --- | --- | --- |
| NVIDIA GTX 480 | 775 | 1005 |
| NVIDIA GTX 280 | 449 | 534 |
| NVIDIA 8800 GT | 129 | 171 |
| Intel Knight's Ferry MIC 32-core* | - | 560 |
| Intel Core i7 quad-core* | - | 240 |
| Intel Core 2 quad-core* | - | 138 |

* Satish et al., "Fast Sort on CPUs, GPUs and Intel MIC Architectures," Intel Tech Report 2010.
[Slide 8]
[Slide 9]
– Each output is dependent upon a finite subset of the input
  • Threads are decomposed by output element
  • The output (and at least one input) index is a static function of thread-id
[Diagram: threads mapping input elements to output elements one-to-one.]
[Slide 10]
– Each output element has dependences upon any / all input elements
– E.g., sorting, reduction, compaction, duplicate removal, histogram generation, map-reduce, etc.
[Diagram: input elements whose output locations are unknown ("?").]
[Slide 11]
– Threads are decomposed by output element
– Repeatedly iterate over recycled input streams
– Output stream size is statically known before each pass
[Diagram: successive passes of threads over recycled streams.]
[Slide 12]
– O(n) global work from passes of pairwise-neighbor reduction
– Static dependences, uniform output
[Diagram: pairwise additions forming a reduction tree.]
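The pairwise-neighbor reduction above can be sketched serially (a minimal Python sketch of the idea, with one list pass standing in for one kernel launch; not the GPU code):

```python
def pairwise_reduce(values, op=lambda a, b: a + b):
    """Reduce by repeated pairwise-neighbor combination: O(n) total work."""
    while len(values) > 1:
        # Each pass halves the sequence; an odd leftover carries to the next pass.
        nxt = [op(values[i], values[i + 1]) for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:
            nxt.append(values[-1])
        values = nxt
    return values[0]

print(pairwise_reduce([3, 1, 4, 1, 5, 9, 2, 6]))  # 31
```

Each pass touches every surviving element once, so the total work is n/2 + n/4 + ... = O(n).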
[Slide 13]
– Repeated pairwise swapping
  • Bubble sort is O(n²)
  • Bitonic sort is O(n log² n)
– Need partitioning: dynamic, cooperative allocation
– Repeatedly check each vertex or edge
  • Breadth-first search becomes O(V²)
  • O(V+E) is work-optimal
– Need queue: dynamic, cooperative allocation
[Slide 15]
– Variable output per thread
– Need dynamic, cooperative allocation
[Diagram: points scattered across the plane, a variable number produced per thread.]
[Slide 16]
• Where do I put something in a list?
  – Duplicate removal
  – Sorting
  – Histogram compilation
• Where do I enqueue something?
  – Search space exploration
  – Graph traversal
  – General work queues
[Diagram: many threads contending to scatter input elements to unknown output locations.]
[Slide 17]
• For 30,000 producers and consumers?
– Locks serialize everything
[Slide 20]
– O(n) work
– For allocation: use scan results as a scattering vector
– Popularized by Blelloch et al. in the '90s
– Merrill et al. Parallel Scan for Stream Architectures. Technical Report CS2009-14, University of Virginia. 2009

Input (& allocation requirement): 2 1 0 3 2
Result of prefix scan (sum):      0 2 3 3 6
Output indices:                   0 1 2 3 4 5 6 7
(one thread per input element)
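The slide's numbers can be reproduced with a serial exclusive prefix sum (a minimal Python sketch of the idea, not the GPU kernel):

```python
def exclusive_scan(requirements):
    """Exclusive prefix sum: offsets[i] = sum(requirements[:i]).
    The offsets double as a scattering vector for dynamic allocation."""
    total, offsets = 0, []
    for r in requirements:
        offsets.append(total)
        total += r
    return offsets, total  # total = overall allocation size

offsets, total = exclusive_scan([2, 1, 0, 3, 2])
print(offsets, total)  # [0, 2, 3, 3, 6] 8
```

Thread i writes its r_i outputs starting at offsets[i], so all writes land contiguously and without conflict.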
[Slide 21]
Split into 0s and 1s:
Key sequence:        1110 1010 1100 1000 0011 0111 0101 0001
Output key sequence: 1110 1100 0011 0111 1010 1000 0101 0001
[Slides 22-23]
[Figure: the split in detail for the same key sequence: per-key allocation-requirement flags for the 0s and 1s bins, scanned allocations (bin relocation offsets), adjusted allocations (global relocation offsets), and the resulting output key sequence.]
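A serial sketch of the split primitive behind the figure: an LSD radix sort that stably partitions keys by one bit per pass (plain Python; the simple counters here play the role of the scanned relocation offsets):

```python
def split_by_bit(keys, bit):
    """Stable split: keys with the bit unset precede keys with it set."""
    flags = [(k >> bit) & 1 for k in keys]
    num_zeros = flags.count(0)           # where the 1s bin begins
    out = [None] * len(keys)
    zeros_before = ones_before = 0       # running exclusive-scan counters
    for k, f in zip(keys, flags):
        if f == 0:
            out[zeros_before] = k        # 0s bin: offset within bin
            zeros_before += 1
        else:
            out[num_zeros + ones_before] = k  # 1s bin: bin base + offset
            ones_before += 1
    return out

def radix_sort(keys, bits=32):
    """LSD radix sort: one stable split per bit, low bit first."""
    for b in range(bits):
        keys = split_by_bit(keys, b)
    return keys

print(radix_sort([0b1110, 0b1010, 0b1100, 0b1000,
                  0b0011, 0b0111, 0b0101, 0b0001], bits=4))
```

Because each split is stable, earlier (lower-bit) orderings survive later passes, which is what makes the digit-at-a-time approach correct.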
[Slide 24]
Un-fused Fused
GPU
Globa
l Device Mem
ory
Host P
rogram
Determine allocaCon size
CUDPP scan
CUDPP scan
Distribute output
CUDPP Scan
Host
Host P
rogram
Globa
l Device Mem
ory Scan
Scan
Scan
Determine allocaCon
Distribute output
GPU Host
[Slide 27]
1. Heavy SMT (over-threading) yields usable “bubbles” of free computation
2. Propagate live data between steps in fast registers / smem
3. Use scan (or variant) as a “runtime” for everything
[Slide 29]

| Device | Memory bandwidth (10^9 bytes/s) | Compute throughput (10^9 thread-cycles/s) | Memory wall (bytes/cycle) | Memory wall (instrs/word) |
| --- | --- | --- | --- | --- |
| GTX 480 | 169.0 | 672.0 | 0.251 | 15.9 |
| GTX 285 | 159.0 | 354.2 | 0.449 | 8.9 |
| GTX 280 | 141.7 | 311.0 | 0.456 | 8.8 |
| Tesla C1060 | 102.0 | 312.0 | 0.327 | 12.2 |
| 9800 GTX+ | 70.4 | 235.0 | 0.300 | 13.4 |
| 8800 GT | 57.6 | 168.0 | 0.343 | 11.7 |
| 9800 GT | 57.6 | 168.0 | 0.343 | 11.7 |
| 8800 GTX | 86.4 | 172.8 | 0.500 | 8.0 |
| Quadro FX 5600 | 76.8 | 152.3 | 0.504 | 7.9 |
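The two "memory wall" columns follow from the first two: bytes/cycle = bandwidth / throughput, and thread-instructions per 32-bit word = 4 / (bytes/cycle). A quick check in Python:

```python
def memory_wall(bw_gbytes_per_s, compute_gcycles_per_s):
    """Derive the memory-wall ratios from peak rates.
    Returns (bytes per thread-cycle, instructions per 32-bit word)."""
    bytes_per_cycle = bw_gbytes_per_s / compute_gcycles_per_s
    instrs_per_word = 4.0 / bytes_per_cycle  # a 32-bit word is 4 bytes
    return bytes_per_cycle, instrs_per_word

bpc, ipw = memory_wall(169.0, 672.0)  # GTX 480 row
print(round(bpc, 3), round(ipw, 1))   # 0.251 15.9
```

The higher the instrs/word figure, the more computation a kernel can afford per word moved before it stops being memory-bound, which is why the GTX 480 row leaves the most "room" for work.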
[Slide 34]
– Increase granularity / redundant computation
  • ghost cells
  • radix bits
– Orthogonal kernel fusion
[Chart: thread-instructions per 32-bit scan element vs. problem size (millions), comparing the data-movement skeleton and our scan kernel against the GTX 285 r+w memory wall (17.8 instructions per input word); the headroom is labeled "Insert work here".]
![Page 35: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)](https://reader034.vdocuments.site/reader034/viewer/2022051609/547bc1c0b37959492b8b4e2b/html5/thumbnails/35.jpg)
CUDPP Scan Kernel
Our Scan Kernel
0
5
10
15
20
25
0 20 40 60 80 100 120
Thread
-‐InstrucOon
s / 32-‐bit scan
element
Problem Size (millions)
![Page 36: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)](https://reader034.vdocuments.site/reader034/viewer/2022051609/547bc1c0b37959492b8b4e2b/html5/thumbnails/36.jpg)
– Partially-coalesced writes
– 2x write overhead
– 4 total concurrent scan operations (radix 16)
GTX285 Scan Kernel Wall
Our Scan Kernel
GTX285 Radix Scader Kernel Wall
0
5
10
15
20
25
30
35
0 16 32 48 64 80 96 112
Thread
-‐InstrucOon
s / 32-‐bit scan
element
Problem Size (millions)
Insert work here
[Slide 37]
– Need kernels with tunable local (or redundant) work
  • ghost cells
  • radix bits
[Chart: thread-instructions per 32-bit word vs. problem size (millions), showing the GTX 285 and GTX 480 radix scatter kernel walls.]
[Slide 38]
[Slide 39]
– Virtual processors abstract a diversity of hardware configurations
– Leads to a host of inefficiencies
– E.g., only several hundred CTAs
![Page 40: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)](https://reader034.vdocuments.site/reader034/viewer/2022051609/547bc1c0b37959492b8b4e2b/html5/thumbnails/40.jpg)
– Virtual processors abstract a diversity of hardware configurations
– Leads to a host of inefficiencies
– E.g., only several hundred CTAs
[Slide 41]
[Diagram: Grid A launches grid-size = (N / tilesize) CTAs, one tile per threadblock; Grid B launches grid-size = 150 CTAs (or another small constant), each threadblock iterating over many tiles.]
[Slide 44]
– Thread-dependent predicates
– Setup and initialization code (notably for smem)
– Offset calculations (notably for smem)
– Common values are hoisted and kept live
– Spills are really bad
[Slide 45]
– O(N / tilesize) gmem accesses
– 2-4 instructions per access (offset calcs, load, store)
– GPU is least efficient here: get it over with as quickly as possible
– log_tilesize(N)-level tree vs. two-level tree
[Slide 47]
[Chart: thread-instructions per element vs. grid size (# of threadblocks): compute load against the GTX 285 scan kernel wall.]
[Slide 48]
– 16.1M / 150 CTAs / 1024 = 109.91 tiles per CTA
– conditional evaluation
– singleton loads

C = number of CTAs, N = problem size, T = tile size, B = tiles per CTA
[Slide 51]
– floor(16.1M / (1024 × 150)) = 109 tiles per CTA (14 CTAs)
– 109 + 1 = 110 tiles per CTA (136 CTAs)
– 16.1M % (1024 × 150) = 0.4 extra tiles
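The even-share schedule can be sketched as follows (a host-side Python sketch; the slide's own figures imply a slightly different N or tile size than the literal 16.1M × 1024 × 150, so treat the concrete counts as illustrative):

```python
def even_share(N, C, T):
    """Even-share decomposition: C CTAs over ceil(N / T) tiles of T elements.
    Returns (B, extra): 'extra' CTAs process B + 1 tiles, the rest process B."""
    total_tiles = (N + T - 1) // T   # ceiling division: last tile may be partial
    B = total_tiles // C             # baseline tiles per CTA
    extra = total_tiles % C          # leftover tiles, one each to 'extra' CTAs
    return B, extra

B, extra = even_share(N=16_100_000, C=150, T=1024)
print(B, extra)
```

Every CTA gets within one tile of the same workload, so no CTA needs per-tile conditional evaluation beyond its own fixed loop bound.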
[Slide 52]
[Slide 53]
– If you breathe on your code, run it through the VP
  • Kernel runtimes
  • Instruction counts
– Indispensable for tuning
  • Host-side timing requires too many iterations
  • Only 1-2 cudaprof iterations for consistent counter-based perf data
– Write tools to parse the output
  • "Dummy" kernels are useful for demarcation
[Slide 54]
[Chart: keys-only sorting rate (10^6 keys/sec) vs. problem size (millions) for GTX 480, C2050 (no ECC), GTX 285, C2050 (ECC), GTX 280, C1060, and 9800 GTX+.]
[Slide 55]
[Chart: key-value sorting rate (millions of pairs/sec) vs. problem size (millions) for GTX 480, C2050 (no ECC), GTX 285, GTX 280, C2050 (ECC), C1060, and 9800 GTX+.]
[Slide 56]
[Chart: kernel bandwidth (GiB/sec) vs. problem size (millions) for merrill_tree Reduce and merrill_rts Scan.]
[Slide 57]
[Chart: kernel bandwidth (10^9 bytes/sec) vs. problem size (millions) for merrill_linear Reduce and merrill_linear Scan.]
[Slide 58]
– Implement a device "memcpy" for tile processing
  • Optimize for "full tiles"
– Specialize for different SM versions, input types, etc.
[Slide 59]
[Slide 60]
– Use templated code to generate various instances
– Run with cudaprof env vars to collect data
[Slide 61]
[Charts: copy bandwidth (GiB/sec) vs. words copied (millions) for a 128-thread CTA with 64B loads and with 128B loads/stores; series include single-, double-, and quad-word variants (with and without overlap), an intrinsic copy, one-way and two-way traffic, and cudaMemcpy().]
[Slide 62]
[Slide 63]
– SIMD lanes are wasted on the O(n)-work Brent-Kung network (left), but it performs less work when n > warp size
– The Kogge-Stone network (right) is O(n log n)-work, but faster when n ≤ warp size
[Diagram: Brent-Kung (left) and Kogge-Stone (right) scan networks over eight inputs x0..x7, showing per-step partial sums ⊕(xi..xj) and the threads t0..t5 that compute them.]
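Serial Python models of the two networks (each array element stands in for a SIMD lane; illustrative sketches, not the kernel code):

```python
def kogge_stone_scan(x, op=lambda a, b: a + b):
    """Inclusive scan, O(n log n) work, log2(n) steps (Kogge-Stone)."""
    x = list(x)
    d = 1
    while d < len(x):
        # every lane i >= d combines with its neighbor d lanes to the left
        x = [x[i] if i < d else op(x[i - d], x[i]) for i in range(len(x))]
        d *= 2
    return x

def brent_kung_scan(x, op=lambda a, b: a + b):
    """Inclusive scan, O(n) work: upsweep reduction, then downsweep.
    Assumes len(x) is a power of two."""
    x = list(x)
    n = len(x)
    d = 1
    while d < n:                          # upsweep: build pairwise partial sums
        for i in range(2 * d - 1, n, 2 * d):
            x[i] = op(x[i - d], x[i])
        d *= 2
    d //= 2
    while d >= 1:                         # downsweep: propagate partials down
        for i in range(2 * d - 1, n - d, 2 * d):
            x[i + d] = op(x[i], x[i + d])
        d //= 2
    return x

print(kogge_stone_scan([1] * 8))  # [1, 2, 3, 4, 5, 6, 7, 8]
print(brent_kung_scan([1] * 8))   # [1, 2, 3, 4, 5, 6, 7, 8]
```

Brent-Kung does roughly 2n operations across 2 log2(n) steps, while Kogge-Stone does n log2(n) operations in log2(n) steps, which is exactly the warp-size trade-off on the slide.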
![Page 65: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)](https://reader034.vdocuments.site/reader034/viewer/2022051609/547bc1c0b37959492b8b4e2b/html5/thumbnails/65.jpg)
t1 t2 t3 t2 t3
t3
t0 t1 t0
t0
t2 t1 t0
t3 t2 t1
t3 t3 t3
t2 t2 t2
t1 t1 t1
t0 t0 t0
… … … … tT -‐ 1 tT/2 + 1
tT/2 tT/2 + 2
tT/4 + 1
tT/4 tT/4 + 2
tT/2 -‐ 1 t1 t0 t2 tT/4 -‐1 t3T/4+1 t3T/4 t3T/4+2 t3T/4 -‐1
barrier
Tree-‐based:
vs. raking-‐based:
t1 t2 t3 t2 t3
t3
t0 t1 t0
t0
t2 t1 t0
t3 t2 t1
barrier
barrier
barrier
![Page 67: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)](https://reader034.vdocuments.site/reader034/viewer/2022051609/547bc1c0b37959492b8b4e2b/html5/thumbnails/67.jpg)
– Barriers make O(n) code O(n log n)
– The rest are “DMA engine” threads
– Use threadblocks to cover pipeline latencies, e.g., for Fermi:
• 2 worker warps per CTA
• 6-7 CTAs
[Figure: raking scan in which a few "Worker" threads perform the serial raking while the remaining "DMA" threads only stage data]
![Page 68: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)](https://reader034.vdocuments.site/reader034/viewer/2022051609/547bc1c0b37959492b8b4e2b/html5/thumbnails/68.jpg)
![Page 69: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)](https://reader034.vdocuments.site/reader034/viewer/2022051609/547bc1c0b37959492b8b4e2b/html5/thumbnails/69.jpg)
– Different SMs (varied local storage: registers/smem)
– Different input types (e.g., sorting chars vs. ulongs)
– # of steps for each algorithm phase is configuration-driven
– Template expansion + Constant-propagation + Static loop unrolling + Preprocessor Macros
– The compiler produces target assembly that is well-tuned for the specific hardware and problem
![Page 70: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)](https://reader034.vdocuments.site/reader034/viewer/2022051609/547bc1c0b37959492b8b4e2b/html5/thumbnails/70.jpg)
[Figure: fused scan-and-scatter of a tile of radix-4 keys — each key's 2-bit digit is encoded into one of four digit flag vectors (0s/1s/2s/3s); serial reduction in registers and shared memory plus a SIMD Kogge-Stone scan (shmem) turn the digit totals (9, 7, 9, 7) and incoming digit carry-ins (9, 25, 33, 49) into carry-outs (18, 32, 42, 56); decoding the scanned flag vectors yields each key's local exchange offset and, combined with the global digit partition offsets, its global scatter offset for the exchanged keys]
![Page 71: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)](https://reader034.vdocuments.site/reader034/viewer/2022051609/547bc1c0b37959492b8b4e2b/html5/thumbnails/71.jpg)
• Resource allocation at runtime
1. Kernel memory-wall analysis (kernel fusion)
2. Algorithm serialization
3. Tune for data-movement
4. Warp-synchronous programming
5. Flexible granularity via meta-programming
![Page 72: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)](https://reader034.vdocuments.site/reader034/viewer/2022051609/547bc1c0b37959492b8b4e2b/html5/thumbnails/72.jpg)
– Back40Computing (a Google Code project)
• http://code.google.com/p/back40computing/
– Default sorting method for Thrust
• http://code.google.com/p/thrust/
![Page 73: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)](https://reader034.vdocuments.site/reader034/viewer/2022051609/547bc1c0b37959492b8b4e2b/html5/thumbnails/73.jpg)
![Page 74: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)](https://reader034.vdocuments.site/reader034/viewer/2022051609/547bc1c0b37959492b8b4e2b/html5/thumbnails/74.jpg)
– A single host-side procedure call launches a kernel that performs orthogonal program steps
MyUberKernel<<<grid_size, num_threads>>>(d_device_storage);
– No existing public repositories of kernel “subroutines” for scavenging
![Page 75: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)](https://reader034.vdocuments.site/reader034/viewer/2022051609/547bc1c0b37959492b8b4e2b/html5/thumbnails/75.jpg)
– Callbacks, iterators, visitors, functors, etc.
– ReduceKernel<<<grid_size, num_threads>>>(CountingIterator(100));
– E.g., the fused kernel below can't be composed using a callback-based pattern
Fused radix sorting kernel (digit extraction, local prefix scan, scatter accordingly):
• GATHER (key)
• Extract radix digit
• Encode flag bit (into flag vectors)
• LOCAL MULTI-SCAN (flag vectors)
• Decode local rank (from flag vectors)
• EXCHANGE (key), smem exchange
• GATHER (value), EXCHANGE (value)
• Extract radix digit (again)
• Update global radix digit partition offsets
• SCATTER (key), SCATTER (value)
![Page 76: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)](https://reader034.vdocuments.site/reader034/viewer/2022051609/547bc1c0b37959492b8b4e2b/html5/thumbnails/76.jpg)
– Compiled libraries suffer from code bloat
• The CUDPP primitives library is 100s of MBs, yet still doesn't support all built-in numeric types
• Specializing for device configurations makes it even worse
– The alternative is to ship source for #include'ing
• Have to be willing to share source
– Need a way to fit meta-programming in at the JIT / bytecode level to help avoid expansion / mismatch-by-omission
– Can leverage fundamentally different algorithms for different phases
• How to teach the compiler to do this?