[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting


DESCRIPTION

http://cs264.org
http://goo.gl/1K2fI

TRANSCRIPT

–  Data-independent tasks

–  Tasks with statically-known data dependences

–  SIMD divergence

–  Lacking fine-grained synchronization

–  Lacking writeable, coherent caches

DEVICE                               32-bit key-value sorting    Keys-only sorting
                                     (10^6 pairs/sec)            (10^6 keys/sec)
NVIDIA GTX 480                       775                         1005
NVIDIA GTX 280                       449  (3.8x speedup*)        534  (2.9x speedup*)
NVIDIA 8800 GT                       129                         171
Intel Knight's Ferry MIC 32-core**   -                           560
Intel Core i7 quad-core**            -                           240
Intel Core 2 quad-core**             -                           138

* vs. Satish et al., "Designing efficient sorting algorithms for manycore GPUs," IPDPS '09
** Satish et al., "Fast Sort on CPUs, GPUs and Intel MIC Architectures," Intel Tech Report 2010

 

–  Each output is dependent upon a finite subset of the input
•  Threads are decomposed by output element
•  The output (and at least one input) index is a static function of thread-id

[Diagram: threads map statically from input elements to output elements]

–  Each output element has dependences upon any / all input elements

–  E.g., sorting, reduction, compaction, duplicate removal, histogram generation, map-reduce, etc.

[Diagram: the input-to-output mapping is not known a priori]

–  Threads are decomposed by output element

–  Repeatedly iterate over recycled input streams

–  Output stream size is statically known before each pass

–  O(n) global work from passes of pairwise-neighbor reduction

–  Static dependences, uniform output

–  Repeated pairwise swapping
•  Bubble sort is O(n^2)
•  Bitonic sort is O(n log^2 n)
–  Need partitioning: dynamic, cooperative allocation

–  Repeatedly check each vertex or edge
•  Breadth-first search becomes O(V^2)
•  O(V+E) is work-optimal
–  Need a queue: dynamic, cooperative allocation

–  Variable output per thread

–  Need dynamic, cooperative allocation


•  Where do I put something in a list?
–  Duplicate removal
–  Sorting
–  Histogram compilation

•  Where do I enqueue something?
–  Search space exploration
–  Graph traversal
–  General work queues


• For 30,000 producers and consumers?

–  Locks serialize everything

–  O(n) work

–  For allocation: use scan results as a scattering vector

–  Popularized by Blelloch et al. in the '90s

–  Merrill et al. Parallel Scan for Stream Architectures. Technical Report CS2009-14, University of Virginia, 2009

Input (& allocation requirement):   2   1   0   3   2
Result of prefix scan (sum):        0   2   3   3   6
Output positions:                   0   1   2   3   4   5   6   7

Each thread scatters its outputs into the shared output array starting at its scanned offset.
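To make the allocation idiom concrete, here is a minimal sketch using Thrust (the library that, as noted later, adopted this sorting work): an exclusive prefix sum turns per-element allocation counts into scatter offsets, so many producers can reserve output space without locks. The variable names are illustrative, not from the talk's code.

#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <cstdio>

int main()
{
    // Per-element allocation requirements, from the example above: 2 1 0 3 2
    int h_counts[] = {2, 1, 0, 3, 2};
    thrust::device_vector<int> counts(h_counts, h_counts + 5);
    thrust::device_vector<int> offsets(5);

    // Exclusive prefix sum: 0 2 3 3 6
    thrust::exclusive_scan(counts.begin(), counts.end(), offsets.begin());

    // Element i may write its outputs at positions
    // [offsets[i], offsets[i] + counts[i])
    int total = offsets[4] + counts[4];   // 6 + 2 = 8 outputs in all
    printf("total outputs: %d\n", total);
    return 0;
}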

Binary split on one radix digit (0s bin, then 1s bin):

Key sequence:              1110 1010 1100 1000 0011 0111 0101 0001
Allocation requirements:   one flag vector per digit value (0s flags, 1s flags)
Scanned allocations:       bin relocation offsets (prefix scan across the flag vectors)
Adjusted allocations:      global relocation offsets (1s offsets shifted past the 0s total)
Output key sequence:       1110 1100 0011 0111 1010 1000 0101 0001

 

Un-fused:

The host program launches a separate kernel for each step, staging intermediate results in global device memory between launches:
•  Determine allocation size
•  CUDPP scan (itself multiple kernel passes)
•  Distribute output

Un-fused vs. Fused:

Fused: the same work becomes a short sequence of scan kernels, with "determine allocation" fused into the first and "distribute output" fused into the last, so live data makes far fewer round trips through global device memory.

1.  Heavy SMT (over-threading) yields usable "bubbles" of free computation

2.  Propagate live data between steps in fast registers / smem

3.  Use scan (or a variant) as a "runtime" for everything

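A minimal single-CTA sketch of the fused organization (illustrative, not the talk's actual kernels): allocation sizing, the scan, and output distribution all happen in one launch, with live data held in registers and smem instead of round-tripping through global device memory.

// Fused stream compaction for one 256-thread CTA (toy predicate: keep evens)
__global__ void FusedCompact(const int *in, int *out, int *out_count, int n)
{
    __shared__ int offsets[256];
    int tid = threadIdx.x;

    // Step 1: determine allocation (one flag per thread)
    int x    = (tid < n) ? in[tid] : 0;
    int flag = (tid < n && (x % 2 == 0)) ? 1 : 0;
    offsets[tid] = flag;
    __syncthreads();

    // Step 2: inclusive scan in smem (simple Hillis-Steele sketch)
    for (int d = 1; d < 256; d <<= 1) {
        int t = (tid >= d) ? offsets[tid - d] : 0;
        __syncthreads();
        offsets[tid] += t;
        __syncthreads();
    }
    int excl = offsets[tid] - flag;   // convert inclusive -> exclusive

    // Step 3: distribute output using the scanned offsets
    if (flag) out[excl] = x;
    if (tid == 255) *out_count = offsets[tid];
}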

Device           Memory bandwidth   Compute throughput       Memory wall     Memory wall
                 (10^9 bytes/s)     (10^9 thread-cycles/s)   (bytes/cycle)   (instrs/word)
GTX 480          169.0              672.0                    0.251           15.9
GTX 285          159.0              354.2                    0.449           8.9
GTX 280          141.7              311.0                    0.456           8.8
Tesla C1060      102.0              312.0                    0.327           12.2
9800 GTX+        70.4               235.0                    0.300           13.4
8800 GT          57.6               168.0                    0.343           11.7
9800 GT          57.6               168.0                    0.343           11.7
8800 GTX         86.4               172.8                    0.500           8.0
Quadro FX 5600   76.8               152.3                    0.504           7.9
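The two memory-wall columns follow directly from the first two, assuming 4-byte words; e.g., for the GTX 480: 169.0 / 672.0 ≈ 0.251 bytes/cycle, and 4 / 0.251 ≈ 15.9 thread-instructions per word.

\[
\text{wall}_{\text{bytes/cycle}} = \frac{\text{memory bandwidth}}{\text{compute throughput}},
\qquad
\text{wall}_{\text{instrs/word}} = \frac{4\ \text{bytes/word}}{\text{wall}_{\text{bytes/cycle}}}
\]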

[Plot: thread-instructions per 32-bit scan element vs. problem size (millions). The GTX 285 read+write memory wall sits at 17.8 instructions per input word; a data-movement skeleton and our scan kernel both run just below it, leaving headroom to "insert work here".]

–  Increase granularity / redundant computation

• ghost cells

• radix bits

–  Orthogonal kernel fusion


[Plot: thread-instructions per 32-bit scan element vs. problem size (millions), comparing the CUDPP scan kernel against our scan kernel.]

–  Partially-coalesced writes

–  2x write overhead

–  4 total concurrent scan operations (radix 16)

[Plot: thread-instructions per 32-bit scan element vs. problem size (millions). Our scan kernel runs near the GTX 285 scan kernel wall; the GTX 285 radix scatter kernel wall sits well above it, leaving room to "insert work here".]

–  Need kernels with tunable local (or redundant) work

• ghost cells

• radix bits

[Plot: thread-instructions per 32-bit word vs. problem size (millions), showing the GTX 285 and GTX 480 radix scatter kernel walls.]

 

–  Virtual processors abstract a diversity of hardware configurations

–  Leads to a host of inefficiencies

–  E.g., only several hundred CTAs

Grid A: grid-size = (N / tilesize) CTAs (one threadblock per tile)
Grid B: grid-size = 150 CTAs (or other small constant), each threadblock iterating over many tiles
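A minimal sketch of the Grid B style (names illustrative): a small, fixed number of CTAs each loop over many tiles, decoupling grid size from problem size.

// Each CTA strides through the tile space: CTA b processes tiles
// b, b + gridDim.x, b + 2*gridDim.x, ...
// Assumes blockDim.x == tile_size and N is a multiple of tile_size.
__global__ void ProcessTiles(const float *in, float *out,
                             int num_tiles, int tile_size)
{
    for (int tile = blockIdx.x; tile < num_tiles; tile += gridDim.x) {
        int idx = tile * tile_size + threadIdx.x;
        out[idx] = in[idx] * 2.0f;   // stand-in for the real per-tile work
    }
}

// Launch with a small constant grid, e.g.:
// ProcessTiles<<<150, tile_size>>>(d_in, d_out, N / tile_size, tile_size);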

–  Thread-dependent predicates

–  Setup and initialization code (notably for smem)

–  Offset calculations (notably for smem)

–  Common values are hoisted and kept live

–  Spills are really bad

–  O(N / tilesize) gmem accesses

–  2-4 instructions per access (offset calcs, load, store)

–  GPU is least efficient here: get it over with as quickly as possible

log_tilesize(N)-level tree vs. two-level tree

[Plot: thread-instructions per element vs. grid size (# of threadblocks, up to ~9000), comparing the compute load against the GTX 285 scan kernel wall.]

C = number of CTAs, N = problem size, T = tile size, B = tiles per CTA

–  Naive: 16.1M / 150 CTAs / 1024 = 109.91 tiles per CTA
•  conditional evaluation
•  singleton loads

–  Better: floor(16.1M / (1024 * 150)) = 109 whole tiles per CTA, leaving 16.1M % (1024 * 150) = 136.4 extra tiles

–  Even-share: 109 + 1 = 110 tiles per CTA for 136 CTAs, 109 tiles per CTA for the remaining 14 CTAs, leaving only 0.4 of a tile extra
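The even-share arithmetic above, as a minimal host-side sketch (names illustrative); with these inputs it reproduces the slide's 109 / 110 tiles and 136 / 14 CTA split.

#include <cstdio>

int main()
{
    long long N = 16882074;   // ~16.1M (binary millions) elements
    int T = 1024;             // tile size
    int C = 150;              // number of CTAs

    long long full_tiles = N / T;          // 16486 whole tiles
    long long partial    = N % T;          // 410 leftover elements (~0.4 tile)
    long long B          = full_tiles / C; // 109 base tiles per CTA
    long long big_ctas   = full_tiles % C; // 136 CTAs take B+1 tiles

    printf("%lld CTAs process %lld tiles, %lld CTAs process %lld tiles\n",
           big_ctas, B + 1, C - big_ctas, B);
    printf("partial last tile: %lld elements\n", partial);
    return 0;
}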

 

–  If you breathe on your code, run it through the Visual Profiler
•  Kernel runtimes
•  Instruction counts

–  Indispensable for tuning
•  Host-side timing requires too many iterations
•  Only 1-2 cudaprof iterations are needed for consistent counter-based perf data

–  Write tools to parse the output
•  "Dummy" kernels are useful for demarcation

[Plot: keys-only sorting rate (10^6 keys/sec) vs. problem size (millions) for GTX 480, C2050 (no ECC), GTX 285, C2050 (ECC), GTX 280, C1060, and 9800 GTX+.]

[Plot: key-value sorting rate (10^6 pairs/sec) vs. problem size (millions) for GTX 480, C2050 (no ECC), GTX 285, GTX 280, C2050 (ECC), C1060, and 9800 GTX+.]

[Plot: kernel bandwidth (GiBytes/sec) vs. problem size (millions) for merrill_tree Reduce and merrill_rts Scan.]

[Plot: kernel bandwidth (10^9 bytes/sec) vs. problem size (millions) for merrill_linear Reduce and merrill_linear Scan.]

–  Implement a device "memcpy" for tile-processing
•  Optimize for "full tiles"

–  Specialize for different SM versions, input types, etc.

–  Use templated code to generate the various instances

–  Run with cudaprof env vars to collect data
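A minimal sketch of such a templated tile-copy kernel (illustrative, not Back40's actual code): tile size and loads-per-thread are compile-time parameters, so one source generates many specialized instances that can each be timed under the profiler, and full tiles skip per-load bounds checks.

template <typename T, int CTA_THREADS, int LOADS_PER_THREAD>
__global__ void TileCopy(const T *in, T *out, int num_elements)
{
    const int TILE = CTA_THREADS * LOADS_PER_THREAD;
    int base = blockIdx.x * TILE;

    if (base + TILE <= num_elements) {
        // "Full tile": unrolled, unguarded loads/stores
        #pragma unroll
        for (int i = 0; i < LOADS_PER_THREAD; ++i) {
            int idx = base + i * CTA_THREADS + threadIdx.x;
            out[idx] = in[idx];
        }
    } else {
        // Partial last tile: guarded
        for (int i = 0; i < LOADS_PER_THREAD; ++i) {
            int idx = base + i * CTA_THREADS + threadIdx.x;
            if (idx < num_elements) out[idx] = in[idx];
        }
    }
}

// e.g., TileCopy<unsigned int, 128, 4><<<grid, 128>>>(d_in, d_out, n);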

[Plot: bandwidth (GiBytes/sec) vs. words copied (millions) for a 128-thread CTA with 64B loads: single, double, and quad loads, each with and without overlap.]

[Plot: bandwidth (GiBytes/sec) vs. words copied (millions) for a 128-thread CTA with 128B ld/st: single, double, and quad variants (with and without overlap), the intrinsic copy, and one-way vs. two-way cudaMemcpy() for reference.]

 

–  SIMD lanes are wasted on O(n)-work Brent-Kung (left), but it does less work when n > warp size

–  Kogge-Stone (right) is O(n log n)-work, but faster when n ≤ warp size

[Diagram: scan networks over x0..x7. Left: Brent-Kung, an upsweep and downsweep of pairwise ⊕ combinations. Right: Kogge-Stone, where every lane combines with a neighbor at doubling distances each step.]

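A minimal sketch of a warp-synchronous Kogge-Stone inclusive scan in the idiom of this hardware generation (illustrative): within a warp the lanes execute in lock-step SIMD, so no __syncthreads() is needed, and padding the smem buffer with identity elements avoids per-step bounds checks. (On current architectures this pattern additionally needs __syncwarp() or warp shuffles.)

__device__ int WarpScanInclusive(volatile int *smem, int lane, int value)
{
    // smem is a 64-entry region per warp; the low half is seeded with
    // the identity (0) so reads at [lane - offset] never underflow
    smem[lane] = 0;
    smem += 32;
    smem[lane] = value;

    // log2(32) = 5 Kogge-Stone steps, no barriers within the warp
    smem[lane] = value = value + smem[lane - 1];
    smem[lane] = value = value + smem[lane - 2];
    smem[lane] = value = value + smem[lane - 4];
    smem[lane] = value = value + smem[lane - 8];
    smem[lane] = value = value + smem[lane - 16];
    return value;
}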

Tree-based vs. raking-based:

[Diagram: a tree-based CTA scan requires a barrier between every level of the tree, while a raking-based scan needs only two barriers bracketing a serial raking phase performed by a small subset of threads over the partials deposited by all T threads.]

–  Barriers make O(n) code O(n log n)

–  Only the raking warps do real work; the rest are "DMA engine" threads

–  Use threadblocks to cover pipeline latencies, e.g., for Fermi:

•  2 worker warps per CTA

•  6-7 CTAs
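A minimal raking-style reduction sketch (illustrative): every thread deposits one partial, two barriers bracket the serial raking, and threads outside the raking set act only as "DMA engines" that load data and then wait.

#define CTA_THREADS 256
#define WORKERS      64                      // 2 warps of raking workers
#define SEG (CTA_THREADS / WORKERS)          // 4 partials per worker

__global__ void RakingReduce(const int *in, int *out)
{
    __shared__ int partials[CTA_THREADS];
    int tid = threadIdx.x;

    // All threads "DMA" one element into smem
    partials[tid] = in[blockIdx.x * CTA_THREADS + tid];
    __syncthreads();                         // barrier #1

    if (tid < WORKERS) {
        // Serial raking: O(SEG) work per worker, no extra barriers
        int sum = 0;
        for (int i = 0; i < SEG; ++i)
            sum += partials[tid * SEG + i];
        partials[tid] = sum;
    }
    __syncthreads();                         // barrier #2

    // A warp-synchronous reduction (as sketched earlier) would normally
    // finish the job; a serial loop keeps this sketch short
    if (tid == 0) {
        int total = 0;
        for (int i = 0; i < WORKERS; ++i) total += partials[i];
        out[blockIdx.x] = total;
    }
}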


 

–  Different SMs (varied local storage: registers/smem)

–  Different input types (e.g., sorting chars vs. ulongs)

–  The # of steps for each algorithm phase is configuration-driven

–  Template expansion + constant propagation + static loop unrolling + preprocessor macros

–  The compiler produces target assembly that is well-tuned for the specific hardware and problem
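A minimal sketch of this meta-programming idea (names illustrative): phase parameters live in a compile-time configuration type, so template expansion and constant propagation turn shifts, masks, and loop bounds into literals the compiler can fully unroll.

template <typename KeyType, int RADIX_BITS, int SM_ARCH>
struct SortingConfig
{
    static const int BITS          = RADIX_BITS;
    static const int DIGITS        = 1 << RADIX_BITS;
    // e.g., smaller tiles where local storage is scarce (pre-Fermi)
    static const int TILE_ELEMENTS = (SM_ARCH >= 200) ? 1024 : 512;
    static const int PASSES        =
        (sizeof(KeyType) * 8 + RADIX_BITS - 1) / RADIX_BITS;
};

// After expansion, BITS and DIGITS are literals: the shift and mask
// below constant-fold, and loops bounded by them statically unroll
template <typename Config>
__global__ void DecodeDigits(const unsigned int *keys, int *digits, int pass)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    digits[idx] = (keys[idx] >> (pass * Config::BITS)) & (Config::DIGITS - 1);
}

// e.g., DecodeDigits< SortingConfig<unsigned int, 4, 200> ><<<grid, cta>>>(...);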

[Diagram: local multi-scan for a radix-4 digit. Pipeline: serial reduction (regs) → serial reduction (shmem) → SIMD Kogge-Stone scan (shmem) → serial scan (shmem) → serial scan (regs) → digit decoding. In the example, the digit carry-ins are 9 (0s), 25 (1s), 33 (2s), 49 (3s); after the scatter (SIMD Kogge-Stone scan, smem exchange) the carry-outs are 18 (0s), 32 (1s), 42 (2s), 56 (3s).]

[Diagram: scanning the per-thread digit flag vectors yields local exchange offsets; seeding those with the digit carry-ins yields global scatter offsets.]

[Worked example: a 32-key subsequence of radix-4 keys
211 122 302 232 300 021 022 021 123 330 023 130 220 020 112 221 023 322 003 012 022 010 121 323 020 101 212 220 013 301 130 333
is decoded on its least-significant digit into four digit flag vectors (0s, 1s, 2s, 3s), with digit totals 9, 7, 9, and 7. Scanning the flag vectors gives each key its local rank within its digit bin, and the keys are exchanged through smem into digit order: all keys ending in 0, then 1, then 2, then 3.]
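A minimal sketch of the flag-vector encode/decode (illustrative): for a 2-bit (radix-4) digit, each key contributes a 1 in the byte lane of its digit, so one 32-bit word carries all four digit counters, and a single prefix scan over the packed words multi-scans the four bins at once (each byte lane can absorb up to 255 flags before overflowing).

// Pack a flag for this key's digit into one of four byte lanes
__host__ __device__ inline unsigned int EncodeFlag(unsigned int key,
                                                   int shift)
{
    unsigned int digit = (key >> shift) & 0x3;   // 2-bit digit: 0..3
    return 1u << (digit * 8);                    // 1 in byte lane 'digit'
}

// After scanning the packed words, extract this key's local rank
// within its digit bin
__host__ __device__ inline unsigned int DecodeRank(unsigned int packed,
                                                   unsigned int digit)
{
    return (packed >> (digit * 8)) & 0xFF;
}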


•  Resource-allocation as runtime

1.  Kernel memory-wall analysis (kernel fusion)

2.  Algorithm serialization

3.  Tune for data-movement

4.  Warp-synchronous programming

5.  Flexible granularity via meta-programming

–  Back40Computing (a Google Code project)
•  http://code.google.com/p/back40computing/

–  Default sorting method for Thrust
•  http://code.google.com/p/thrust/

–  A single host-side procedure call launches a kernel that performs orthogonal program steps

MyUberKernel<<<grid_size, num_threads>>>(d_device_storage);

–  No existing public repositories of kernel "subroutines" for scavenging

–  Callbacks, iterators, visitors, functors, etc.:
ReduceKernel<<<grid_size, num_threads>>>(CountingIterator(100));

–  E.g., the fused kernel at left can't be composed using a callback-based pattern

Fused radix sorting kernel:
•  Digit extraction
•  Local prefix scan
•  Scatter accordingly

Per-tile steps (keys, then values):
GATHER (key)
Extract radix digit
Encode flag bit (into flag vectors)
LOCAL MULTI-SCAN (flag vectors)
Decode local rank (from flag vectors)
EXCHANGE (key)
Extract radix digit (again)
Update global radix digit partition offsets
SCATTER (key)
GATHER (value)
EXCHANGE (value)
SCATTER (value)

–  Compiled libraries suffer from code bloat
•  The CUDPP primitives library is 100s of MBs, yet still doesn't support all built-in numeric types
•  Specializing for device configurations makes it even worse

–  The alternative is to ship source for #include'ing
•  You have to be willing to share source

–  Need a way to fit meta-programming in at the JIT / bytecode level to help avoid expansion / mismatch-by-omission

–  Can leverage fundamentally different algorithms for different phases
•  How do we teach the compiler to do this?
