IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT / 6.963
Supercomputing on your desktop: Programming the next generation of cheap and massively parallel hardware using CUDA
Lecture 07
CUDA Advanced #2-
Nicolas Pinto (MIT)
Friday, January 23, 2009
During this course, we'll try to reuse existing material ;-)
"…" adapted for 6.963
Today, yay!!
Wanna Play with The Big Guys?
Here are the keys to High-Performance in CUDA
Warning!
To optimize or not to optimize
Hoare said (and Knuth restated):
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil."
Yet 3% of the time we really should worry about small efficiencies
(every 33rd code line)
Applied Mathematics 23/53, slide by Johan Seland
Strategy
Memory Optimizations
Execution Optimizations
IAP09 CUDA@MIT / 6.963
CUDA Performance Strategies
IAP09 CUDA@MIT / 6.963
Optimization goals
We should strive to reach GPU peak performance
We must know the GPU performance: vendor specifications, synthetic benchmarks
Choose a performance metric: memory bandwidth or GFLOPS?
Use clock() to measure
Experiment and profile!
Applied Mathematics 25/53, slide by Johan Seland
Strategy
© NVIDIA Corporation 2006
Programming Model
A kernel is executed as a grid of thread blocks
A thread block is a batch of threads that can cooperate with each other by:
Sharing data through shared memory
Synchronizing their execution
Threads from different blocks cannot cooperate
[Figure: the host launches Kernel 1 on Grid 1, a 3x2 arrangement of blocks (0,0) through (2,1), then Kernel 2 on Grid 2; Block (1,1) is shown as a 5x3 arrangement of threads (0,0) through (4,2)]
Threading
© NVIDIA Corporation 2008
Data Movement in a CUDA Program
Host Memory
Device Memory
[Shared Memory]
COMPUTATION
[Shared Memory]
Device Memory
Host Memory
Memory
Optimize Algorithms for the GPU
Maximize independent parallelism
Maximize arithmetic intensity (math/bandwidth)
Sometimes it's better to recompute than to cache
GPU spends its transistors on ALUs, not memory
Do more computation on the GPU to avoid costly data transfers
Even low parallelism computations can sometimes be faster than transferring back and forth to host
Perf
Optimize Memory Coherence
Coalesced vs. non-coalesced = order of magnitude
Global/Local device memory
Optimize for spatial locality in cached texture memory
In shared memory, avoid high-degree bank conflicts
Perf
Take Advantage of Shared Memory
Hundreds of times faster than global memory
Threads can cooperate via shared memory
Use one / a few threads to load / compute data shared by all threads
Use it to avoid non-coalesced access
Stage loads and stores in shared memory to re-order non-coalesceable addressing
Matrix transpose example later
Perf
Use Parallelism Efficiently
Partition your computation to keep the GPU multiprocessors equally busy
Many threads, many thread blocks
Keep resource usage low enough to support multiple active thread blocks per multiprocessor
Registers, shared memory
Perf
Memory Optimizations
IAP09 CUDA@MIT / 6.963
Memory Optimizations
Optimizing memory transfers
Coalescing global memory accesses
Using shared memory effectively
Memory
Data Transfers
Device-to-host memory bandwidth much lower than device-to-device bandwidth
4 GB/s peak (PCIe x16) vs. 80 GB/s peak (Quadro FX 5600)
8 GB/s for PCIe 2.0
Minimize transfers
Intermediate data structures can be allocated, operated on, and deallocated without ever copying them to host memory
Group transfers
One large transfer much better than many small ones
Memory
Page-Locked Memory Transfers
cudaMallocHost() allows allocation of page-locked host memory
Enables highest cudaMemcpy performance: 3.2 GB/s common on PCI-express (x16)
~4 GB/s measured on nForce 680i motherboards (overclocked PCIe)
See the "bandwidthTest" CUDA SDK sample
Use with caution: allocating too much page-locked memory can reduce overall system performance
Test your systems and apps to learn their limits
Memory
Global Memory Reads/Writes
Highest latency instructions: 400-600 clock cycles
Likely to be performance bottleneck
Optimizations can greatly increase performance
Coalescing: up to 10x speedup
Latency hiding: up to 2.5x speedup
gmem
Accessing global memory
4 cycles to issue a memory fetch
but 400-600 cycles of latency (the equivalent of ~100 MADs)
Likely to be a performance bottleneck
Order-of-magnitude speedups possible: coalesce memory access
Use shared memory to re-order non-coalesced addressing
Applied Mathematics 32/53, slide by Johan Seland
gmem
Coalescing
A coordinated read by a half-warp (16 threads)
A contiguous region of global memory:
64 bytes - each thread reads a word: int, float, ...
128 bytes - each thread reads a double-word: int2, float2, ...
256 bytes - each thread reads a quad-word: int4, float4, ...
Additional restrictions on G8X architecture:
Starting address for a region must be a multiple of region size
The kth thread in a half-warp must access the kth element in a block being read
Exception: not all threads must be participating
Predicated access, divergence within a half-warp
gmem
Coalesced Access: Reading floats
[Figure: threads t0 through t15 read consecutive 4-byte addresses 128 through 188; the access coalesces both when all threads participate and when some threads do not participate]
gmem
Uncoalesced Access: Reading floats
[Figure: two broken patterns over addresses 128 through 188: permuted access by threads, and a misaligned starting address (not a multiple of 64)]
gmem
Coalescing: Timing Results
Experiment on G80:
Kernel: read a float, increment, write back
3M floats (12MB)
Times averaged over 10K runs
12K blocks x 256 threads:
356 μs - coalesced
357 μs - coalesced, some threads don't participate
3,494 μs - permuted/misaligned thread access
gmem
Coalescing:
Structures of size != 4, 8, or 16 bytes: use a Structure of Arrays (SoA) instead of an Array of Structures (AoS)
If SoA is not viable:
Force structure alignment: __align(X), where X = 4, 8, or 16
Use SMEM to achieve coalescing
[Figure: a Point structure (x y z); AoS stores xyz xyz xyz; SoA stores xxx yyy zzz]
gmem
Coalescing: Summary
Coalescing greatly improves throughput
Critical to memory-bound kernels
Reading structures of size other than 4, 8, or 16 bytes will break coalescing:
Prefer Structures of Arrays over AoS
If SoA is not viable, read/write through SMEM
Additional resources: Aligned Types SDK Sample
gmem
Parallel Memory Architecture
In a parallel machine, many threads access memory
Therefore, memory is divided into banks
Essential to achieve high bandwidth
Each bank can service one address per cycle
A memory can service as many simultaneous accesses as it has banks
Multiple simultaneous accesses to a bank result in a bank conflict
Conflicting accesses are serialized
[Figure: shared memory banks 0 through 15]
smem
Bank Addressing Examples
No bank conflicts: linear addressing, stride == 1
No bank conflicts: random 1:1 permutation
[Figure: in both cases threads 0 through 15 map one-to-one onto banks 0 through 15]
smem
Bank Addressing Examples
2-way bank conflicts: linear addressing, stride == 2
8-way bank conflicts: linear addressing, stride == 8
[Figure: with stride 2 the threads hit only the even banks, two per bank; with stride 8 the threads pile up eight deep on banks 0 and 8]
smem
How addresses map to banks on G80
Bandwidth of each bank is 32 bits per 2 clock cycles
Successive 32-bit words are assigned to successive banks
G80 has 16 banks
So bank = address % 16
Same as the size of a half-warp: no bank conflicts between different half-warps, only within a single half-warp
smem
Shared memory bank conflicts
Shared memory is as fast as registers if there are no bank conflicts
The fast case:
If all threads of a half-warp access different banks, there is no bank conflict
If all threads of a half-warp read the identical address, there is no bank conflict (broadcast)
The slow case:
Bank conflict: multiple threads in the same half-warp access the same bank
Must serialize the accesses
Cost = max # of simultaneous accesses to a single bank
smem
Use the right kind of memory
Constant memory: quite small (~20K); as fast as register access if all threads in a warp access the same location
Texture memory: spatially cached; optimized for 2D locality; neighboring threads should read neighboring addresses; no need to think about coalescing
Constraint: these memories can only be updated from the CPU
Applied Mathematics 31/53, slide by Johan Seland
Strategy
Memory optimizations roundup
CUDA memory handling is complex, and I have not covered all topics...
Using memory correctly can lead to huge speedups; at least CUDA exposes the memory hierarchy, unlike CPUs
Get your algorithm up and running first, then optimize
Use shared memory to let threads cooperate
Be wary of "data ownership": a thread does not have to read/write the data it calculates
Applied Mathematics 41/53, slide by Johan Seland
Strategy
Conflicts, Coalescing, Warps... I hate growing up.
!"#$%$&'#$()*+,'%"-./*0'#1$,*21')3"(3.
Example
Friday, January 23, 2009
Matrix Transpose
SDK Sample ("transpose")
Illustrates:
Coalescing
Avoiding SMEM bank conflicts
Speedups for even small matrices

 1  2  3  4        1  5  9 13
 5  6  7  8   ->   2  6 10 14
 9 10 11 12        3  7 11 15
13 14 15 16        4  8 12 16
Example
Uncoalesced transpose

__global__ void transpose_naive(float *odata, float *idata, int width, int height)
{
    unsigned int xIndex = blockDim.x * blockIdx.x + threadIdx.x;
    unsigned int yIndex = blockDim.y * blockIdx.y + threadIdx.y;
    if (xIndex < width && yIndex < height)
    {
        unsigned int index_in  = xIndex + width * yIndex;
        unsigned int index_out = yIndex + height * xIndex;
        odata[index_out] = idata[index_in];
    }
}
Example
Uncoalesced transpose
Reads input from GMEM: stride = 1, coalesced
Writes output to GMEM: stride = 16, uncoalesced
[Figure: the input tile is read along rows, but the output is written along columns of GMEM]
Example
Coalesced Transpose
Conceptually partition the input matrix into square tiles
Threadblock (bx, by):
Reads the (bx,by) input tile, stores into SMEM
Writes the SMEM data to the (by,bx) output tile
Transpose the indexing into SMEM
Thread (tx,ty):
Reads element (tx,ty) from input tile
Writes element (tx,ty) into output tile
Coalescing is achieved if:
Block/tile dimensions are multiples of 16
Example
Coalesced Transpose
[Figure: phase 1 reads from GMEM and writes to SMEM; phase 2 reads from SMEM and writes to GMEM; both GMEM accesses proceed along rows and coalesce]
Example
SMEM Optimization
Threads read SMEM with stride = 16
Bank conflicts
Solution
Allocate an "extra" column
Read stride = 17
Threads read from consecutive banks
Example
Coalesced transpose

__global__ void transpose(float *odata, float *idata, int width, int height)
{
    __shared__ float block[(BLOCK_DIM+1)*BLOCK_DIM];

    unsigned int xBlock = blockDim.x * blockIdx.x;
    unsigned int yBlock = blockDim.y * blockIdx.y;
    unsigned int xIndex = xBlock + threadIdx.x;
    unsigned int yIndex = yBlock + threadIdx.y;
    unsigned int index_out, index_transpose;

    if (xIndex < width && yIndex < height)
    {
        unsigned int index_in = width * yIndex + xIndex;
        unsigned int index_block = threadIdx.y * (BLOCK_DIM+1) + threadIdx.x;
        block[index_block] = idata[index_in];
        index_transpose = threadIdx.x * (BLOCK_DIM+1) + threadIdx.y;
        index_out = height * (xBlock + threadIdx.y) + yBlock + threadIdx.x;
    }
    __syncthreads();
    if (xIndex < width && yIndex < height)
        odata[index_out] = block[index_transpose];
}
Example
Coalesced transpose: Source code

__global__ void
transpose( float *out, float *in, int width, int height ) {
    __shared__ float block[BLOCK_DIM*BLOCK_DIM];

    unsigned int xBlock = blockDim.x * blockIdx.x;
    unsigned int yBlock = blockDim.y * blockIdx.y;
    unsigned int xIndex = xBlock + threadIdx.x;
    unsigned int yIndex = yBlock + threadIdx.y;
    unsigned int index_out, index_transpose;

    if ( xIndex < width && yIndex < height ) {
        unsigned int index_in = width * yIndex + xIndex;
        unsigned int index_block = threadIdx.y * BLOCK_DIM + threadIdx.x;
        block[index_block] = in[index_in];
        index_transpose = threadIdx.x * BLOCK_DIM + threadIdx.y;
        index_out = height * (xBlock + threadIdx.y) + yBlock + threadIdx.x;
    }
    __syncthreads();

    if ( xIndex < width && yIndex < height ) {
        out[index_out] = block[index_transpose];
    }
}
Applied Mathematics 39/53
Example
slide by Johan Seland
Coalesced transpose: Source code

__global__ void
transpose( float *out, float *in, int width, int height ) {
    __shared__ float block[BLOCK_DIM*BLOCK_DIM];

    unsigned int xBlock = blockDim.x * blockIdx.x;
    unsigned int yBlock = blockDim.y * blockIdx.y;
    unsigned int xIndex = xBlock + threadIdx.x;
    unsigned int yIndex = yBlock + threadIdx.y;
    unsigned int index_out, index_transpose;

    if ( xIndex < width && yIndex < height ) {
        unsigned int index_in = width * yIndex + xIndex;
        unsigned int index_block = threadIdx.y * BLOCK_DIM + threadIdx.x;
        block[index_block] = in[index_in];
        index_transpose = threadIdx.x * BLOCK_DIM + threadIdx.y;
        index_out = height * (xBlock + threadIdx.y) + yBlock + threadIdx.x;
    }
    __syncthreads();

    if ( xIndex < width && yIndex < height ) {
        out[index_out] = block[index_transpose];
    }
}

Allocate shared memory.
Set up indexing.
Check that we are within domain, calculate more indices.
Write to shared memory.
Calculate output indices.
Synchronize. NB: outside if-clause.
Write to global mem. Different index.
Applied Mathematics 39/53, slide by Johan Seland
Example
Transpose timings
Was it worth the trouble?

Grid Size    Coalesced  Non-coalesced  Speedup
128 x 128    0.011 ms   0.022 ms       2.0x
512 x 512    0.07 ms    0.33 ms        4.5x
1024 x 1024  0.30 ms    1.92 ms        6.4x
1024 x 2048  0.79 ms    6.6 ms         8.4x

For me, this is a clear yes.
Applied Mathematics 40/53, slide by Johan Seland
Example
Execution Optimizations
IAP09 CUDA@MIT / 6.963
Know the arithmetic cost of operations
4 clock cycles: floating point add, multiply, fused multiply-add; integer add, bitwise operations, compare, min, max
16 clock cycles: reciprocal, reciprocal square root, log(x), 32-bit integer multiplication
32 clock cycles: sin(x), cos(x) and exp(x)
36 clock cycles: floating point division (24-bit version in 20 cycles)
Particularly costly: integer division, modulo; remedy: replace with shifting whenever possible
Double precision (when available) will perform at half the speed
Applied Mathematics 28/53, slide by Johan Seland
Exec
Occupancy
Thread instructions are executed sequentially, so executing other warps is the only way to hide latencies and keep the hardware busy
Occupancy = number of warps running concurrently on a multiprocessor divided by maximum number of warps that can run concurrently
Limited by resource usage:
Registers
Shared memory
Exec
Grid/Block Size Heuristics
# of blocks > # of multiprocessors
So all multiprocessors have at least one block to execute
# of blocks / # of multiprocessors > 2
Multiple blocks can run concurrently in a multiprocessor
Blocks that aren't waiting at a __syncthreads() keep the hardware busy
Subject to resource availability: registers, shared memory
# of blocks > 100 to scale to future devices
Blocks executed in pipeline fashion
1000 blocks per grid will scale across multiple generations
Exec
Register Dependency
Read-after-write register dependency
Instruction's result can be read ~22 cycles later
Scenarios:   CUDA:                PTX:
x = y + 5;                        add.f32 $f3, $f1, $f2
z = x + 3;                        add.f32 $f5, $f3, $f4
s_data[0] += 3;                   ld.shared.f32 $f3, [$r31+0]
                                  add.f32 $f3, $f3, $f4
To completely hide the latency:
Run at least 192 threads (6 warps) per multiprocessor
At least 25% occupancy
Threads do not have to belong to the same thread block
Exec
Register Pressure
Hide latency by using more threads per SM
Limiting factors:
Number of registers per kernel
8192 per SM, partitioned among concurrent threads
Amount of shared memory
16KB per SM, partitioned among concurrent threadblocks
Check .cubin file for # registers / kernel
Use -maxrregcount=N flag to NVCC
N = desired maximum registers / kernel
At some point "spilling" into LMEM may occur
Reduces performance: LMEM is slow
Check .cubin file for LMEM usage
Exec
Determining resource usage
Use the "-cubin" option to nvcc to determine register usage
Open the .cubin file with a text editor and look for the "code" section:

architecture {sm_10}
abiversion {0}
modname {cubin}
code {
  name = BlackScholesGPU
  lmem = 0      // per thread local memory
  smem = 68     // per thread block shared memory
  reg = 20      // per thread registers
  bar = 0
  bincode {
    0xa0004205 0x04200780 0x40024c09 0x00200780
    …
Exec
CUDA Occupancy Calculator
Exec
Optimizing threads per block
Choose threads per block as a multiple of warp size
Avoid wasting computation on under-populated warps
More threads per block == better memory latency hiding
But, more threads per block == fewer registers per thread
Kernel invocations can fail if too many registers are used
Heuristics
Minimum: 64 threads per block
Only if multiple concurrent blocks
192 or 256 threads a better choice
Usually still enough regs to compile and invoke successfully
This all depends on your computation, so experiment!
Exec
Occupancy != Performance
Increasing occupancy does not necessarily increase performance
BUT…
Low-occupancy multiprocessors cannot adequately hide latency on memory-bound kernels
(It all comes down to arithmetic intensity and available parallelism)
Exec
Parameterize Your Application
Parameterization helps adaptation to different GPUs
GPUs vary in many ways: # of multiprocessors
Memory bandwidth
Shared memory size
Register file size
Threads per block
You can even make apps self-tuning (like FFTW and ATLAS)
"Experiment" mode discovers and saves optimal configuration
Exec
Loop unrolling
Sometimes we know some kernel parameters at compile time: # of loop iterations, degrees of polynomials, number of data elements
If we could "tell" this to the compiler, it can unroll loops and optimize register usage
We need to be generic: avoid code duplication, sizes unknown at compile time
Templates to the rescue: the same trick can be used for regular C++ sources
Applied Mathematics 43/53, slide by Johan Seland
Exec
Example: de Casteljau algorithm
A standard algorithm for evaluating polynomials in Bernstein form
Recursively defined:
f(x) = b^d_{0,0}
b^k_{i,j} = x * b^{k-1}_{i+1,j} + (1-x) * b^{k-1}_{i,j+1}
b^0_{i,j} are coefficients
[Figure: triangular scheme; b^d_{00} is combined from b^{d-1}_{10} and b^{d-1}_{01}, which in turn come from b^{d-2}_{20}, b^{d-2}_{11}, b^{d-2}_{02}, with edges weighted x and 1-x]
Applied Mathematics 44/53, slide by Johan Seland
Exec
Implementation
The de Casteljau algorithm is usually implemented as nested for-loops
Coefficients are overwritten for each iteration

float deCasteljau( float* c, float x, int d )
{
    for ( uint i = 1; i <= d; ++i ) {
        for ( uint j = 0; j <= d-i; ++j )
            c[j] = (1.0f-x)*c[j] + x*c[j+1];
    }
    return c[0];
}

[Figure: the same triangular scheme as the previous slide, with c in place of b]
Applied Mathematics 45/53, slide by Johan Seland
Exec
Template loop unrolling
We make d a template parameter:

template<int d>
float deCasteljau( float* c, float x ) {
    for ( uint i = 1; i <= d; ++i ) {
        for ( uint j = 0; j <= d-i; ++j )
            c[j] = (1.0f-x)*c[j] + x*c[j+1];
    }
    return c[0];
}

Kernel is called as:

switch ( d ) {
case 1:
    deCasteljau<1><<<dimGrid, dimBlock>>>( c, x ); break;
case 2:
    deCasteljau<2><<<dimGrid, dimBlock>>>( c, x ); break;
...
case MAXD:
    deCasteljau<MAXD><<<dimGrid, dimBlock>>>( c, x ); break;
}
Applied Mathematics 46/53, slide by Johan Seland
Exec
Results
For the de Casteljau algorithm we see a relatively small speedup, ~1.2x (20%...)
Very easy to implement
Can lead to long compile times
Conclusion:
Probably worth it near end of development cycle
Applied Mathematics 47/53, slide by Johan Seland
Exec
Conclusion
Understand CUDA performance characteristics
Memory coalescing
Divergent branching
Bank conflicts
Latency hiding
Use peak performance metrics to guide optimization
Understand parallel algorithm complexity theory
Know how to identify type of bottleneck
e.g. memory, core computation, or instruction overhead
Optimize your algorithm, then unroll loops
Use template parameters to generate optimal code
Exec
The CUDA Visual Profiler
Helps measure and find potential performance problems
GPU and CPU timing for all kernel invocations and memcpys
Time stamps
Access to hardware performance counters
Profiling
Signals
Events are tracked with hardware counters on signals in the chip:
timestamp
gld_incoherent
gld_coherent
gst_incoherent
gst_coherent
local_load
local_store
branch
divergent_branch
instructions - instruction count
warp_serialize - thread warps that serialize on address conflicts to shared or constant memory
cta_launched - executed thread blocks
Global memory loads/stores are coalesced (coherent) or non-coalesced (incoherent)
Total branches and divergent branches taken by threads
Local loads/stores
Profiling
Interpreting profiler counters
Values represent events within a thread warp
Only targets one multiprocessor
Values will not correspond to the total number of warps launched for a particular kernel.
Launch enough thread blocks to ensure that the target multiprocessor is given a consistent percentage of the total work.
Values are best used to identify relative performance differences between unoptimized and optimized code
In other words, try to reduce the magnitudes of gld/gst_incoherent, divergent_branch, and warp_serialize
Profiling
Performance for 4M element reduction

                                        Time (2^22 ints)  Bandwidth     Step speedup  Cumulative
Kernel 1: interleaved addressing
  with divergent branching              8.054 ms          2.083 GB/s
Kernel 2: interleaved addressing
  with bank conflicts                   3.456 ms          4.854 GB/s    2.33x         2.33x
Kernel 3: sequential addressing         1.722 ms          9.741 GB/s    2.01x         4.68x
Kernel 4: first add during global load  0.965 ms          17.377 GB/s   1.78x         8.34x
Kernel 5: unroll last warp              0.536 ms          31.289 GB/s   1.8x          15.01x
Kernel 6: completely unrolled           0.381 ms          43.996 GB/s   1.41x         21.16x
Kernel 7: multiple elements per thread  0.268 ms          62.671 GB/s   1.42x         30.04x

Kernel 7 on 32M elements: 73 GB/s!
Example
Build your own!
© 2008 NVIDIA Corporation. Slide by David Kirk.
Thank you!
Back Pocket Slides
slide by David Cox
Misc
IAP09 CUDA@MIT / 6.963
M02: High Performance Computing with CUDA
Tesla C1060 Computing Processor

Processor          1x Tesla T10P
Core GHz           1.33 GHz
Form factor        Full ATX: 4.736" (H) x 10.5" (L), dual slot wide
On-board memory    4 GB
System I/O         PCIe x16 gen2
Memory I/O         512-bit, 800 MHz DDR; 102 GB/s peak bandwidth
Display outputs    None
Typical power      160 W
Tesla S1070 1U System

Processors               4 x Tesla T10P
Core GHz                 1.5 GHz
Form factor              1U for an EIA 19" 4-post rack
Total 1U system memory   16 GB (4.0 GB per GPU)
System I/O               2 PCIe x16
Memory I/O per processor 512-bit, 800 MHz GDDR; 102 GB/s peak bandwidth
Display outputs          None
Typical power            700 W
Chassis dimensions       1.73" H x 17.5" W x 28.5" D
Double Precision Floating Point

                                   NVIDIA GPU                  SSE2                        Cell SPE
Precision                          IEEE 754                    IEEE 754                    IEEE 754
Rounding modes for FADD and FMUL   All 4 IEEE: round to        All 4 IEEE: round to        Round to
                                   nearest, zero, inf, -inf    nearest, zero, inf, -inf    zero/truncate only
Denormal handling                  Full speed                  Supported, costs 1000's     Flush to zero
                                                               of cycles
NaN support                        Yes                         Yes                         No
Overflow and Infinity support      Yes                         Yes                         No infinity,
                                                                                           clamps to max norm
Flags                              No                          Yes                         Some
FMA                                Yes                         No                          Yes
Square root                        Software with low-latency   Hardware                    Software only
                                   FMA-based convergence
Division                           Software with low-latency   Hardware                    Software only
                                   FMA-based convergence
Reciprocal estimate accuracy       24 bit                      12 bit                      12 bit
Reciprocal sqrt estimate accuracy  23 bit                      12 bit                      12 bit
log2(x) and 2^x estimates accuracy 23 bit                      No                          No