IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT / 6.963
Supercomputing on your desktop: Programming the next generation of cheap and massively parallel hardware using CUDA
Lecture 07
CUDA Advanced #2-
Nicolas Pinto (MIT)
Friday, January 23, 2009
During this course, we'll try to reuse existing material ;-)
"…" adapted for 6.963
Today, yay!!
Wanna Play with The Big Guys?
Here are the keys to High-Performance in CUDA
Warning!
To optimize or not to optimize
Hoare said (and Knuth restated):
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil."
Yet 3% of the time we really should worry about small efficiencies
(every 33rd code line)
Applied Mathematics 23/53, slide by Johan Seland
Strategy
Memory Optimizations
Execution Optimizations
IAP09 CUDA@MIT / 6.963
CUDA Performance Strategies
IAP09 CUDA@MIT / 6.963
Optimization goals
We should strive to reach GPU peak performance
We must know the GPU performance: vendor specifications, synthetic benchmarks
Choose a performance metric: memory bandwidth or GFLOPS?
Use clock() to measure
Experiment and profile!
Applied Mathematics 25/53, slide by Johan Seland
Strategy
© NVIDIA Corporation 2006
Programming Model
A kernel is executed as a grid of thread blocks
A thread block is a batch of threads that can cooperate with each other by:
Sharing data through shared memory
Synchronizing their execution
Threads from different blocks cannot cooperate
[Figure: the host launches Kernel 1 on Grid 1, a 3x2 arrangement of blocks (0,0) through (2,1), then Kernel 2 on Grid 2; Block (1,1) is shown as a 5x3 arrangement of threads (0,0) through (4,2)]
Threading
© NVIDIA Corporation 2008
Data Movement in a CUDA Program
Host Memory
Device Memory
[Shared Memory]
COMPUTATION
[Shared Memory]
Device Memory
Host Memory
Memory
Optimize Algorithms for the GPU
Maximize independent parallelism
Maximize arithmetic intensity (math/bandwidth)
Sometimes it's better to recompute than to cache
GPU spends its transistors on ALUs, not memory
Do more computation on the GPU to avoid costly data transfers
Even low parallelism computations can sometimes be faster than transferring back and forth to host
Perf
Optimize Memory Coherence
Coalesced vs. non-coalesced = order of magnitude
Global/Local device memory
Optimize for spatial locality in cached texture memory
In shared memory, avoid high-degree bank conflicts
Perf
Take Advantage of Shared Memory
Hundreds of times faster than global memory
Threads can cooperate via shared memory
Use one / a few threads to load / compute data shared by all threads
Use it to avoid non-coalesced access
Stage loads and stores in shared memory to re-order non-coalesceable addressing
Matrix transpose example later
Perf
Use Parallelism Efficiently
Partition your computation to keep the GPU multiprocessors equally busy
Many threads, many thread blocks
Keep resource usage low enough to support multiple active thread blocks per multiprocessor
Registers, shared memory
Perf
Memory Optimizations
IAP09 CUDA@MIT / 6.963
Memory Optimizations
Optimizing memory transfers
Coalescing global memory accesses
Using shared memory effectively
Memory
Data Transfers
Device-to-host memory bandwidth much lower than device-to-device bandwidth
4 GB/s peak (PCIe x16) vs. 80 GB/s peak (Quadro FX 5600)
8 GB/s for PCIe 2.0
Minimize transfers
Intermediate data structures can be allocated, operated on, and deallocated without ever copying them to host memory
Group transfers
One large transfer much better than many small ones
Memory
Page-Locked Memory Transfers
cudaMallocHost() allows allocation of page-locked host memory
Enables highest cudaMemcpy performance: 3.2 GB/s common on PCI-express (x16)
~4 GB/s measured on nForce 680i motherboards (overclocked PCIe)
See the "bandwidthTest" CUDA SDK sample
Use with caution: allocating too much page-locked memory can reduce overall system performance
Test your systems and apps to learn their limits
Memory
Global Memory Reads/Writes
Highest latency instructions: 400-600 clock cycles
Likely to be performance bottleneck
Optimizations can greatly increase performance
Coalescing: up to 10x speedup
Latency hiding: up to 2.5x speedup
gmem
Accessing global memory
4 cycles to issue a memory fetch
but 400-600 cycles of latency (the equivalent of ~100 MADs)
Likely to be a performance bottleneck
Order-of-magnitude speedups possible: coalesce memory access
Use shared memory to re-order non-coalesced addressing
Applied Mathematics 32/53, slide by Johan Seland
gmem
Coalescing
A coordinated read by a half-warp (16 threads)
A contiguous region of global memory:
64 bytes - each thread reads a word: int, float, ...
128 bytes - each thread reads a double-word: int2, float2, ...
256 bytes - each thread reads a quad-word: int4, float4, ...
Additional restrictions on G8X architecture:
Starting address for a region must be a multiple of region size
The kth thread in a half-warp must access the kth element in a block being read
Exception: not all threads must be participating
Predicated access, divergence within a half-warp
gmem
Coalesced Access: Reading floats
[Figure: threads t0 through t15 read consecutive 4-byte addresses 128 through 188; the access coalesces both when all threads participate and when some threads do not participate]
gmem
Uncoalesced Access: Reading floats
[Figure: two broken patterns over addresses 128 through 188: permuted access by threads, and a misaligned starting address (not a multiple of 64)]
gmem
Coalescing: Timing Results
Experiment on G80:
Kernel: read a float, increment, write back
3M floats (12MB)
Times averaged over 10K runs
12K blocks x 256 threads:
356 μs - coalesced
357 μs - coalesced, some threads don't participate
3,494 μs - permuted/misaligned thread access
gmem
Coalescing:
Structures of size != 4, 8, or 16 bytes: use a Structure of Arrays (SoA) instead of an Array of Structures (AoS)
If SoA is not viable:
Force structure alignment: __align(X), where X = 4, 8, or 16
Use SMEM to achieve coalescing
[Figure: a Point structure (x y z); AoS stores xyz xyz xyz; SoA stores xxx yyy zzz]
gmem
Coalescing: Summary
Coalescing greatly improves throughput
Critical to memory-bound kernels
Reading structures of size other than 4, 8, or 16 bytes will break coalescing:
Prefer Structures of Arrays over AoS
If SoA is not viable, read/write through SMEM
Additional resources: Aligned Types SDK Sample
gmem
Parallel Memory Architecture
In a parallel machine, many threads access memory
Therefore, memory is divided into banks
Essential to achieve high bandwidth
Each bank can service one address per cycle
A memory can service as many simultaneous accesses as it has banks
Multiple simultaneous accesses to a bank result in a bank conflict
Conflicting accesses are serialized
[Figure: shared memory banks 0 through 15]
smem
Bank Addressing Examples
No bank conflicts: linear addressing, stride == 1
No bank conflicts: random 1:1 permutation
[Figure: in both cases threads 0 through 15 map one-to-one onto banks 0 through 15]
smem
Bank Addressing Examples
2-way bank conflicts: linear addressing, stride == 2
8-way bank conflicts: linear addressing, stride == 8
[Figure: with stride 2 the threads hit only the even banks, two per bank; with stride 8 the threads pile up eight deep on banks 0 and 8]
smem
How addresses map to banks on G80
Bandwidth of each bank is 32 bits per 2 clock cycles
Successive 32-bit words are assigned to successive banks
G80 has 16 banks
So bank = address % 16
Same as the size of a half-warp: no bank conflicts between different half-warps, only within a single half-warp
smem
Shared memory bank conflicts
Shared memory is as fast as registers if there are no bank conflicts
The fast case:
If all threads of a half-warp access different banks, there is no bank conflict
If all threads of a half-warp read the identical address, there is no bank conflict (broadcast)
The slow case:
Bank conflict: multiple threads in the same half-warp access the same bank
Must serialize the accesses
Cost = max # of simultaneous accesses to a single bank
smem
Use the right kind of memory
Constant memory: quite small (~20K); as fast as register access if all threads in a warp access the same location
Texture memory: spatially cached; optimized for 2D locality; neighboring threads should read neighboring addresses; no need to think about coalescing
Constraint: these memories can only be updated from the CPU
Applied Mathematics 31/53, slide by Johan Seland
Strategy
Memory optimizations roundup
CUDA memory handling is complex, and I have not covered all topics...
Using memory correctly can lead to huge speedups; at least CUDA exposes the memory hierarchy, unlike CPUs
Get your algorithm up and running first, then optimize
Use shared memory to let threads cooperate
Be wary of "data ownership": a thread does not have to read/write the data it calculates
Applied Mathematics 41/53, slide by Johan Seland
Strategy
Conflicts, Coalescing, Warps... I hate growing up.
!"#$%$&'#$()*+,'%"-./*0'#1$,*21')3"(3.
Example
Friday, January 23, 2009
Matrix Transpose
SDK Sample ("transpose")
Illustrates:
Coalescing
Avoiding SMEM bank conflicts
Speedups for even small matrices

 1  2  3  4        1  5  9 13
 5  6  7  8   ->   2  6 10 14
 9 10 11 12        3  7 11 15
13 14 15 16        4  8 12 16
Example
Uncoalesced transpose

__global__ void transpose_naive(float *odata, float *idata, int width, int height)
{
    unsigned int xIndex = blockDim.x * blockIdx.x + threadIdx.x;
    unsigned int yIndex = blockDim.y * blockIdx.y + threadIdx.y;
    if (xIndex < width && yIndex < height)
    {
        unsigned int index_in  = xIndex + width * yIndex;
        unsigned int index_out = yIndex + height * xIndex;
        odata[index_out] = idata[index_in];
    }
}
Example
Uncoalesced transpose
Reads input from GMEM: stride = 1, coalesced
Writes output to GMEM: stride = 16, uncoalesced
[Figure: the input tile is read along rows, but the output is written along columns of GMEM]
Example
Coalesced Transpose
Conceptually partition the input matrix into square tiles
Threadblock (bx, by):
Reads the (bx,by) input tile, stores into SMEM
Writes the SMEM data to the (by,bx) output tile
Transpose the indexing into SMEM
Thread (tx,ty):
Reads element (tx,ty) from input tile
Writes element (tx,ty) into output tile
Coalescing is achieved if:
Block/tile dimensions are multiples of 16
Example
Coalesced Transpose
[Figure: phase 1 reads from GMEM and writes to SMEM; phase 2 reads from SMEM and writes to GMEM; both GMEM accesses proceed along rows and coalesce]
Example
SMEM Optimization
Threads read SMEM with stride = 16
Bank conflicts
Solution
Allocate an "extra" column
Read stride = 17
Threads read from consecutive banks
Example
Coalesced transpose

__global__ void transpose(float *odata, float *idata, int width, int height)
{
    __shared__ float block[(BLOCK_DIM+1)*BLOCK_DIM];

    unsigned int xBlock = blockDim.x * blockIdx.x;
    unsigned int yBlock = blockDim.y * blockIdx.y;
    unsigned int xIndex = xBlock + threadIdx.x;
    unsigned int yIndex = yBlock + threadIdx.y;
    unsigned int index_out, index_transpose;

    if (xIndex < width && yIndex < height)
    {
        unsigned int index_in = width * yIndex + xIndex;
        unsigned int index_block = threadIdx.y * (BLOCK_DIM+1) + threadIdx.x;
        block[index_block] = idata[index_in];
        index_transpose = threadIdx.x * (BLOCK_DIM+1) + threadIdx.y;
        index_out = height * (xBlock + threadIdx.y) + yBlock + threadIdx.x;
    }
    __syncthreads();
    if (xIndex < width && yIndex < height)
        odata[index_out] = block[index_transpose];
}
Example
Coalesced transpose: Source code

__global__ void
transpose( float *out, float *in, int width, int height ) {
    __shared__ float block[BLOCK_DIM*BLOCK_DIM];

    unsigned int xBlock = blockDim.x * blockIdx.x;
    unsigned int yBlock = blockDim.y * blockIdx.y;
    unsigned int xIndex = xBlock + threadIdx.x;
    unsigned int yIndex = yBlock + threadIdx.y;
    unsigned int index_out, index_transpose;

    if ( xIndex < width && yIndex < height ) {
        unsigned int index_in = width * yIndex + xIndex;
        unsigned int index_block = threadIdx.y * BLOCK_DIM + threadIdx.x;
        block[index_block] = in[index_in];
        index_transpose = threadIdx.x * BLOCK_DIM + threadIdx.y;
        index_out = height * (xBlock + threadIdx.y) + yBlock + threadIdx.x;
    }
    __syncthreads();

    if ( xIndex < width && yIndex < height ) {
        out[index_out] = block[index_transpose];
    }
}
Applied Mathematics 39/53
Example
slide by Johan Seland
Coalesced transpose: Source code

__global__ void
transpose( float *out, float *in, int width, int height ) {
    __shared__ float block[BLOCK_DIM*BLOCK_DIM];

    unsigned int xBlock = blockDim.x * blockIdx.x;
    unsigned int yBlock = blockDim.y * blockIdx.y;
    unsigned int xIndex = xBlock + threadIdx.x;
    unsigned int yIndex = yBlock + threadIdx.y;
    unsigned int index_out, index_transpose;

    if ( xIndex < width && yIndex < height ) {
        unsigned int index_in = width * yIndex + xIndex;
        unsigned int index_block = threadIdx.y * BLOCK_DIM + threadIdx.x;
        block[index_block] = in[index_in];
        index_transpose = threadIdx.x * BLOCK_DIM + threadIdx.y;
        index_out = height * (xBlock + threadIdx.y) + yBlock + threadIdx.x;
    }
    __syncthreads();

    if ( xIndex < width && yIndex < height ) {
        out[index_out] = block[index_transpose];
    }
}

Allocate shared memory.
Set up indexing.
Check that we are within domain, calculate more indices.
Write to shared memory.
Calculate output indices.
Synchronize. NB: outside if-clause.
Write to global mem. Different index.
Applied Mathematics 39/53, slide by Johan Seland
Example
Transpose timings
Was it worth the trouble?

Grid Size    Coalesced  Non-coalesced  Speedup
128 x 128    0.011 ms   0.022 ms       2.0x
512 x 512    0.07 ms    0.33 ms        4.5x
1024 x 1024  0.30 ms    1.92 ms        6.4x
1024 x 2048  0.79 ms    6.6 ms         8.4x

For me, this is a clear yes.
Applied Mathematics 40/53, slide by Johan Seland
Example
Execution Optimizations
IAP09 CUDA@MIT / 6.963
Know the arithmetic cost of operations
4 clock cycles: floating point add, multiply, fused multiply-add; integer add, bitwise operations, compare, min, max
16 clock cycles: reciprocal, reciprocal square root, log(x), 32-bit integer multiplication
32 clock cycles: sin(x), cos(x) and exp(x)
36 clock cycles: floating point division (24-bit version in 20 cycles)
Particularly costly: integer division, modulo; remedy: replace with shifting whenever possible
Double precision (when available) will perform at half the speed
Applied Mathematics 28/53, slide by Johan Seland
Exec
Occupancy
Thread instructions are executed sequentially, so executing other warps is the only way to hide latencies and keep the hardware busy
Occupancy = number of warps running concurrently on a multiprocessor divided by maximum number of warps that can run concurrently
Limited by resource usage:
Registers
Shared memory
Exec
Grid/Block Size Heuristics
# of blocks > # of multiprocessors
So all multiprocessors have at least one block to execute
# of blocks / # of multiprocessors > 2
Multiple blocks can run concurrently in a multiprocessor
Blocks that aren't waiting at a __syncthreads() keep the hardware busy
Subject to resource availability: registers, shared memory
# of blocks > 100 to scale to future devices
Blocks executed in pipeline fashion
1000 blocks per grid will scale across multiple generations
Exec
Register Dependency
Read-after-write register dependency
Instruction's result can be read ~22 cycles later
Scenarios:   CUDA:                PTX:
x = y + 5;                        add.f32 $f3, $f1, $f2
z = x + 3;                        add.f32 $f5, $f3, $f4
s_data[0] += 3;                   ld.shared.f32 $f3, [$r31+0]
                                  add.f32 $f3, $f3, $f4
To completely hide the latency:
Run at least 192 threads (6 warps) per multiprocessor
At least 25% occupancy
Threads do not have to belong to the same thread block
Exec
Register Pressure
Hide latency by using more threads per SM
Limiting factors:
Number of registers per kernel
8192 per SM, partitioned among concurrent threads
Amount of shared memory
16KB per SM, partitioned among concurrent threadblocks
Check .cubin file for # registers / kernel
Use -maxrregcount=N flag to NVCC
N = desired maximum registers / kernel
At some point "spilling" into LMEM may occur
Reduces performance: LMEM is slow
Check .cubin file for LMEM usage
Exec
Determining resource usage
Use the "-cubin" option to nvcc to determine register usage
Open the .cubin file with a text editor and look for the "code" section:

architecture {sm_10}
abiversion {0}
modname {cubin}
code {
  name = BlackScholesGPU
  lmem = 0      // per thread local memory
  smem = 68     // per thread block shared memory
  reg = 20      // per thread registers
  bar = 0
  bincode {
    0xa0004205 0x04200780 0x40024c09 0x00200780
    …
Exec
CUDA Occupancy Calculator
Exec
Optimizing threads per block
Choose threads per block as a multiple of warp size
Avoid wasting computation on under-populated warps
More threads per block == better memory latency hiding
But, more threads per block == fewer registers per thread
Kernel invocations can fail if too many registers are used
Heuristics
Minimum: 64 threads per block
Only if multiple concurrent blocks
192 or 256 threads a better choice
Usually still enough regs to compile and invoke successfully
This all depends on your computation, so experiment!
Exec
Occupancy != Performance
Increasing occupancy does not necessarily increase performance
BUT…
Low-occupancy multiprocessors cannot adequately hide latency on memory-bound kernels
(It all comes down to arithmetic intensity and available parallelism)
Exec
Parameterize Your Application
Parameterization helps adaptation to different GPUs
GPUs vary in many ways: # of multiprocessors
Memory bandwidth
Shared memory size
Register file size
Threads per block
You can even make apps self-tuning (like FFTW and ATLAS)
"Experiment" mode discovers and saves optimal configuration
Exec
Loop unrolling
Sometimes we know some kernel parameters at compile time: # of loop iterations, degrees of polynomials, number of data elements
If we could "tell" this to the compiler, it can unroll loops and optimize register usage
We need to be generic: avoid code duplication, sizes unknown at compile time
Templates to the rescue: the same trick can be used for regular C++ sources
Applied Mathematics 43/53, slide by Johan Seland
Exec
Example: de Casteljau algorithm
A standard algorithm for evaluating polynomials in Bernstein form
Recursively defined:
f(x) = b^d_{0,0}
b^k_{i,j} = x * b^{k-1}_{i+1,j} + (1-x) * b^{k-1}_{i,j+1}
b^0_{i,j} are coefficients
[Figure: triangular scheme; b^d_{00} is combined from b^{d-1}_{10} and b^{d-1}_{01}, which in turn come from b^{d-2}_{20}, b^{d-2}_{11}, b^{d-2}_{02}, with edges weighted x and 1-x]
Applied Mathematics 44/53, slide by Johan Seland
Exec
Implementation
The de Casteljau algorithm is usually implemented as nested for-loops
Coefficients are overwritten for each iteration

float deCasteljau( float* c, float x, int d )
{
    for ( uint i = 1; i <= d; ++i ) {
        for ( uint j = 0; j <= d-i; ++j )
            c[j] = (1.0f-x)*c[j] + x*c[j+1];
    }
    return c[0];
}

[Figure: the same triangular scheme as the previous slide, with c in place of b]
Applied Mathematics 45/53, slide by Johan Seland
Exec
Template loop unrolling
We make d a template parameter:

template<int d>
float deCasteljau( float* c, float x ) {
    for ( uint i = 1; i <= d; ++i ) {
        for ( uint j = 0; j <= d-i; ++j )
            c[j] = (1.0f-x)*c[j] + x*c[j+1];
    }
    return c[0];
}

Kernel is called as:

switch ( d ) {
case 1:
    deCasteljau<1><<<dimGrid, dimBlock>>>( c, x ); break;
case 2:
    deCasteljau<2><<<dimGrid, dimBlock>>>( c, x ); break;
...
case MAXD:
    deCasteljau<MAXD><<<dimGrid, dimBlock>>>( c, x ); break;
}
Applied Mathematics 46/53, slide by Johan Seland
Exec
Results
For the de Casteljau algorithm we see a relatively small speedup, ~1.2x (20%...)
Very easy to implement
Can lead to long compile times
Conclusion:
Probably worth it near end of development cycle
Applied Mathematics 47/53, slide by Johan Seland
Exec
Conclusion
Understand CUDA performance characteristics
Memory coalescing
Divergent branching
Bank conflicts
Latency hiding
Use peak performance metrics to guide optimization
Understand parallel algorithm complexity theory
Know how to identify type of bottleneck
e.g. memory, core computation, or instruction overhead
Optimize your algorithm, then unroll loops
Use template parameters to generate optimal code
Exec
The CUDA Visual Profiler
Helps measure and find potential performance problems
GPU and CPU timing for all kernel invocations and memcpys
Time stamps
Access to hardware performance counters
Profiling
Signals
Events are tracked with hardware counters on signals in the chip:
timestamp
gld_incoherent
gld_coherent
gst_incoherent
gst_coherent
local_load
local_store
branch
divergent_branch
instructions - instruction count
warp_serialize - thread warps that serialize on address conflicts to shared or constant memory
cta_launched - executed thread blocks
Global memory loads/stores are coalesced (coherent) or non-coalesced (incoherent)
Total branches and divergent branches taken by threads
Local loads/stores
Profiling
Interpreting profiler counters
Values represent events within a thread warp
Only targets one multiprocessor
Values will not correspond to the total number of warps launched for a particular kernel.
Launch enough thread blocks to ensure that the target multiprocessor is given a consistent percentage of the total work.
Values are best used to identify relative performance differences between unoptimized and optimized code
In other words, try to reduce the magnitudes of gld/gst_incoherent, divergent_branch, and warp_serialize
Profiling
Performance for 4M element reduction

                                        Time (2^22 ints)  Bandwidth     Step speedup  Cumulative
Kernel 1: interleaved addressing
  with divergent branching              8.054 ms          2.083 GB/s
Kernel 2: interleaved addressing
  with bank conflicts                   3.456 ms          4.854 GB/s    2.33x         2.33x
Kernel 3: sequential addressing         1.722 ms          9.741 GB/s    2.01x         4.68x
Kernel 4: first add during global load  0.965 ms          17.377 GB/s   1.78x         8.34x
Kernel 5: unroll last warp              0.536 ms          31.289 GB/s   1.8x          15.01x
Kernel 6: completely unrolled           0.381 ms          43.996 GB/s   1.41x         21.16x
Kernel 7: multiple elements per thread  0.268 ms          62.671 GB/s   1.42x         30.04x

Kernel 7 on 32M elements: 73 GB/s!
Example
Build your own!
© 2008 NVIDIA Corporation. Slide by David Kirk.
Thank you!
Back Pocket Slides
slide by David Cox
Misc
IAP09 CUDA@MIT / 6.963
M02: High Performance Computing with CUDA
Tesla C1060 Computing Processor

Processor          1x Tesla T10P
Core GHz           1.33 GHz
Form factor        Full ATX: 4.736" (H) x 10.5" (L), dual slot wide
On-board memory    4 GB
System I/O         PCIe x16 gen2
Memory I/O         512-bit, 800 MHz DDR; 102 GB/s peak bandwidth
Display outputs    None
Typical power      160 W
Tesla S1070 1U System

Processors               4 x Tesla T10P
Core GHz                 1.5 GHz
Form factor              1U for an EIA 19" 4-post rack
Total 1U system memory   16 GB (4.0 GB per GPU)
System I/O               2 PCIe x16
Memory I/O per processor 512-bit, 800 MHz GDDR; 102 GB/s peak bandwidth
Display outputs          None
Typical power            700 W
Chassis dimensions       1.73" H x 17.5" W x 28.5" D
Double Precision Floating Point

                                   NVIDIA GPU                  SSE2                        Cell SPE
Precision                          IEEE 754                    IEEE 754                    IEEE 754
Rounding modes for FADD and FMUL   All 4 IEEE: round to        All 4 IEEE: round to        Round to
                                   nearest, zero, inf, -inf    nearest, zero, inf, -inf    zero/truncate only
Denormal handling                  Full speed                  Supported, costs 1000's     Flush to zero
                                                               of cycles
NaN support                        Yes                         Yes                         No
Overflow and Infinity support      Yes                         Yes                         No infinity,
                                                                                           clamps to max norm
Flags                              No                          Yes                         Some
FMA                                Yes                         No                          Yes
Square root                        Software with low-latency   Hardware                    Software only
                                   FMA-based convergence
Division                           Software with low-latency   Hardware                    Software only
                                   FMA-based convergence
Reciprocal estimate accuracy       24 bit                      12 bit                      12 bit
Reciprocal sqrt estimate accuracy  23 bit                      12 bit                      12 bit
log2(x) and 2^x estimates accuracy 23 bit                      No                          No