Computer Organization and Assembly Language Final Project Report
GPU Architecture and Programming: From GPGPU to CUDA
Contents

Abstract
I. Evolution of GPU Architecture
  i. Traditional Graphics Pipeline and GPU Architecture
  ii. Programmable Graphics Pipeline and GPU Evolution
  iii. From Graphics to General Computation: CPU and GPU
II. Evolution of GPU Programming Model
  i. GPGPU: General Purpose Graphic Processing Unit
  ii. The Architecture of Unified Shader GPU
  iii. CUDA: Compute Unified Device Architecture
  iv. The Programming Model of CPU, GPGPU, and CUDA
III. The Programming Model of CUDA
  i. Detailed Architecture of CUDA Programming
  ii. Hardware Implementation
  iii. Application Interface
  iv. Programming Example: Matrix Multiplication
IV. References
Abstract
In recent years the graphics processing unit (GPU) has grown from a fixed-function device dedicated to real-time rendering into a massively parallel processor. Because rendering applies the same operations to enormous numbers of vertices and pixels, GPU designers steadily increased the number of parallel execution units and, generation by generation, made those units programmable. For data-parallel workloads the arithmetic throughput of a modern GPU now far exceeds that of a CPU, which makes the GPU an attractive platform for general-purpose parallel computing, provided the programming model makes that power accessible.

This report first reviews the evolution of GPU architecture: the traditional fixed-function graphics pipeline, the programmable pipeline, and the importance of the GPU for general data processing. It then traces the corresponding evolution of the GPU programming model, from GPGPU (General Purpose GPU) programming through graphics interfaces, to the unified shader architecture, to the new parallel programming model it enabled, CUDA (Compute Unified Device Architecture). Finally, the report examines the CUDA programming model in detail, including its thread organization, memory hierarchy, hardware implementation, and C-based application interface, and illustrates this new model of parallel computation with a worked matrix multiplication example.
Evolution of GPU Architecture

I. Evolution of GPU Architecture
The GPU (Graphics Processing Unit) is a processor dedicated to the heavy arithmetic of graphics rendering. Rendering a scene in real time demands far more computation than a CPU can supply while also running the rest of the system, so graphics processing moved onto specialized hardware. Where a CPU fetches one instruction at a time and applies it to one piece of data before moving on, a GPU applies the same instruction to many data elements at once. The GPU's way of raising throughput is therefore not faster sequential execution but the simultaneous execution of one instruction over many data elements, that is, parallel computing (Parallel Computing).

The following sections describe the stages required to render an image and the early GPU architectures built around them.
i. Traditional Graphics Pipeline and GPU Architecture
Rendering an image proceeds through a fixed sequence of stages, the graphics pipeline:

Figure 1: The traditional graphics rendering pipeline

1. Application input: the running application supplies the geometry data and rendering commands to the GPU.
[Figure: Application Input → Geometry Processing → Rasterization → Fragment Processing → Hidden Surface Removal → Output Image, all operating over the GPU memory buffer]
2. Geometry processing: also called vertex processing, this stage operates on the vertex data supplied by the application. The 3D coordinates of each vertex of the model are transformed into the 2D coordinates of the frame being rendered, producing a single transformed vertex stream as input to the next stage. This stage also performs lighting, computing the color contribution at each vertex from its surface normal and the positions of the light sources.
3. Rasterization: determines which screen pixels each 2D primitive covers, converting every primitive into a set of per-pixel fragments.
4. Fragment processing: applies texture data to each rasterized fragment to determine its color.
5. Pixel processing: compares each fragment's z-buffer (depth) value against the value already stored for that pixel; only the nearest fragment survives and is drawn to the output image. This removes hidden surfaces and completes the rendered frame.
An early GPU built around this fixed pipeline is shown below:

Figure 2: Kurt Akeley. RealityEngine Graphics. In Proc. SIGGRAPH '93. ACM Press, 1993.
Two features of this early design are worth noting:
1. It already exploits parallelism: the RealityEngine contains multiple Geometry Engines and Fragment Generators, so the pipeline itself is built from parallel hardware.
2. It is fixed function: each stage applies one predetermined operation to the data flowing through it. The hardware is parallel, but because the operations cannot be changed, this generation of GPU cannot take over general computation from the CPU.

The limits of fixed-function processing led to the next generation of GPUs:
ii. Programmable Graphics Pipeline and GPU Evolution
The key change in this generation was to replace the fixed-function vertex and pixel processing units with programmable units, called shaders, which execute developer-supplied programs. This change also pushed GPUs toward large arrays of general ALUs, laying the groundwork for the General Purpose GPU. The programmable pipeline reorganizes the stages as follows:
1. Vertex computation is split into Vertex Generating (VG) and Vertex Processing (VP). VG assembles the application's lines, points, and triangles into vertex records and passes them to VP. VP runs a program over each vertex record, transforming its coordinates toward 2D screen space and emitting a new record downstream. This programmable stage performs the projection and the model and view transforms, as well as per-vertex lighting.
2. Primitive Generating and Primitive Processing handle the geometry stage: the processed vertices of each primitive are gathered into a group and passed on.
3. Fragment Generating (FG) is the rasterization stage: it samples each primitive against the screen and emits one record per covered pixel to Fragment Processing (FP). FP fetches texture data from memory and combines it with the lighting results computed during vertex processing; because this stage is programmable, per-pixel techniques such as Phong shading can produce much smoother lighting.
4. The final pixel operations, including hidden surface removal, remain as before.
[Figure: the programmable pipeline: Application Input → Vertex Generating → Vertex Processing → Primitive Generating → Primitive Processing → Fragment Generating → Fragment Processing → Hidden Surface Removal → Output Image, over the GPU memory buffer]
The internal architecture of the GPU changed accordingly, and with programmability the GPU acquired far more general data-processing capability.
iii. From Graphics to General Computation: CPU and GPU
Given this processing power, the GPU can take on computing tasks that traditionally belonged to the CPU; the mapping can be seen from the comparison below.

Table 1: Correspondence between a CPU computing task and a GPU pixel shader task

In this mapping, the GPU performs general computation by treating its graphics resources as their CPU counterparts: the memory buffer that normally holds textures holds the data arrays; reading data corresponds to a texture lookup; the body of the loop over the data becomes a shader program written in a shading language (OpenGL, Cg, and so on); and the results are written to the frame buffer, from which they can be read back as the final output. This correspondence is what launched General Purpose GPU computing.
CPU              GPU
Data Array       Texture
Memory Read      Texture Lookup
Loop Body        Shader Program
Memory Write     Render to Frame Buffer
Evolution of GPU Programming Model

II. Evolution of GPU Programming Model
i. GPGPU: General Purpose Graphic Processing Unit
(1) The idea of GPGPU is to use the GPU for parallel computation by disguising general tasks as graphics work, supplementing the CPU's computation. Concretely, the program drives the GPU through its graphics interface, and each element of the computation is carried out by a GPU thread as if it were pixel work. A GPGPU program uses video memory the way a CPU program uses DRAM, and it must be written in a graphics-oriented language such as OpenGL or Cg.
(2) Programming and execution flow: the figure below shows how data moves between the CPU and the GPU while a GPGPU program runs. glReadPixel() is the OpenGL call used to copy results from the GPU back into DRAM.
(3) Drawbacks of GPGPU:
(a) Programs must be written in a shader language. For programmers without a graphics background the learning curve is steep, so in practice only developers already familiar with graphics programming could use it, which limited GPGPU adoption.
(b) As the flow above shows, data transfers are frequent, and all data must travel between the GPU and DRAM. Since DRAM is where the CPU collects the results, the traffic between the GPU's memory and the CPU becomes a serious bottleneck.

Because of these drawbacks, using the GPU for general computation was not yet attractive enough to displace the CPU, since the cost of reaching the GPU's power was too high. Nevertheless, GPGPU demonstrated the GPU's potential, and the subsequent evolution of GPU hardware and GPU programming models opened the door to mainstream parallel computing.

ii. The Architecture of Unified Shader GPU
(1) The unified shader design aims to raise the efficiency of the programmable GPU. In the programmable pipeline each stage has its own dedicated processors, so one stage can finish its share of a frame and sit idle while waiting for another stage to catch up. In practice this produces the following situation:
[Figure (GPGPU programming flow from section i): the CPU places data in DRAM; on the GPU, the shader reads texture memory and writes the framebuffer and depth buffer; glReadPixel() copies the output back to DRAM for the CPU]
[Figure: with dedicated vertex shaders and pixel shaders, part of the shader capacity sits idle while the rest is fully loaded]
When rendering different kinds of frames the balance of work shifts, but because the number of shaders assigned to each stage is fixed, the hardware cannot adapt: a workload heavy in one kind of processing leaves the other stage's shaders idle, wasting capacity. The remedy is to remove the distinction between shader types entirely: make every shader identical so that the idle capacity of one stage can absorb the load of another, with a scheduler assigning shaders to whatever work is pending. The design that grew out of this idea is the Unified Shader. Since any shader can run either vertex processing or pixel processing, the same hardware handles both of the skewed workloads shown below:
[Figure: a unified shader pool splits dynamically between vertex processing and pixel processing, covering both the heavy-geometry and the heavy-pixel case]
Balancing the two kinds of work in this way keeps the hardware busy and avoids wasting idle capacity.
(2) Graphics Pipeline of Unified Shader GPU
All of the programmable stages of the pipeline now execute on the unified shaders, while the non-programmable stages are still carried out by their original dedicated units.
(3) The resulting hardware organization of the GPU is shown below:
[Figure: unified-shader GPU: Application Input → Geometry Processing → Rasterization → Fragment Processing → Hidden Surface Removal → Output Image; the programmable stages all run on the Unified Shader, over the GPU memory buffer]
iii. CUDA: Compute Unified Device Architecture
(1) The idea of CUDA is to expose the unified-shader GPU directly as a device for general-purpose computation. In the CUDA model the CPU is the host and remains the main processor, while the GPU plays the role of a coprocessor, the device. The portions of an application that apply the same computation to large amounts of data, that is, the data-parallel, compute-intensive parts, are packaged by the host and dispatched to execute on the device.
(2) Compared with the earlier GPGPU approach, CUDA adds several key features:
a) Shared memory: the device provides fast on-chip memory that its processors share, so intermediate results can be exchanged without round trips through device memory. This is a direct consequence of the unified shader design, in which the shaders are uniform processors able to cooperate on the same data, and it is essential for efficient parallel computing.
b) Hierarchy of Thread Groups: threads are organized into threads, thread blocks, and grids of thread blocks, so each shader can be handed cooperating groups of work and programs scale naturally across the available parallel hardware.
c) A C language extension: the biggest obstacle to GPGPU was that programs had to be written in graphics shading languages. CUDA instead supplies a runtime library and a small set of extensions integrated into C, which greatly enlarges the pool of developers who can use the GPU and, with it, the range of applications.
iv. The Programming Model of CPU, GPGPU, and CUDA
A CPU program executes as a single thread, one instruction at a time. Its data must ultimately come from DRAM, but the CPU sits behind a cache: through prefetching and its memory-access policies it pulls data into the cache ahead of use, so that most accesses during execution are fast.

[Figure: CPU programming architecture]
A GPGPU computation instead runs as a sequence of passes that read their data from video memory, with the output of one pass feeding the next. Within a pass every data element is processed by the same program: each instruction is executed by many ALU components under a common control unit, one data element per ALU, much like a large number of very simple CPUs running side by side. The drawback is that the video memory serving these ALUs is not on the same chip as the execution units, so memory latency and bandwidth limit the achievable speed.

CUDA combines the massive parallelism of GPGPU with the caching idea of the CPU. A large number of threads execute the same program over different data, and to speed up their memory accesses the GPU provides an on-chip Parallel Data Cache, the shared memory. Because the threads of one program work on overlapping data, a value loaded once can be read by many threads, and results produced by one thread can be consumed by others directly from this cache rather than from device memory, so the threads share data efficiently.

[Figure: GPGPU programming architecture]
[Figure: CUDA programming architecture]
The Programming Model of CUDA

III. The Programming Model of CUDA
i. Detailed Architecture of CUDA Programming
a. Thread Batching
i. Thread block: the threads of one block can share data through the block's shared memory and can synchronize their execution to coordinate their accesses to that shared data. Cooperating in this way lets the threads of a block reuse loaded data efficiently.
ii. Grid of thread blocks: a single block can contain only a limited number of threads, and each block executes one instance of the kernel. Multiple blocks of the same shape and size can, however, be batched together into a grid of thread blocks, so the total number of threads launched for one kernel can be very large.
[Figure: the host launches Kernel1 on Grid1 and Kernel2 on Grid2 of the device; each grid is a 2D array of blocks, Block(0,0) through Block(2,1), and each block, e.g. Block(1,1), is a 2D array of threads, thread(0,0) through thread(4,2)]
b. Memory model
(1) Shared memory: shared by the threads of one block; very fast to access.
(2) Global memory: shared across blocks, i.e. by every thread of a grid; both the host and the kernels can read and write this memory.
(3) Constant memory and texture memory: readable by every thread of a grid but not writable by them; only the host can write these regions, which the kernels then access read-only.
ii. Hardware Implementation: the device is built as an array of multiprocessors, and each multiprocessor has a SIMD (Single Instruction Multiple Data) architecture: in any given clock cycle, every processor of the multiprocessor executes the same instruction, but each operates on different data.
[Figure: CUDA memory model: a grid contains blocks such as Block(0,0) and Block(1,0); each block has its own shared memory; each thread, e.g. Thread(0,0) and Thread(1,0), has its own registers and local memory; all blocks access the grid-wide global memory, constant memory, and texture memory]
iii. Application Interface: the interface is based on the C language, keeping the amount of new syntax a programmer must learn to a minimum. The main extensions are:
a. Function type qualifiers: declare whether a function executes on the host, on the device, or is a kernel; in the program text these are written __host__, __device__, and __global__. Functions that execute on the device are restricted: recursion is not supported, static variables cannot be declared inside them, and they cannot take a variable number of arguments. A declaration looks like:
    __device__ void fun() {
        /* function body */
    }
b. Variable type qualifiers: declare where in device memory a variable resides, using __device__, __constant__, or __shared__. For example:
    extern __shared__ float shared[];
declares an array named shared[] that lives in shared memory and whose size is supplied externally at launch time.
c. Execution configuration: a function declared __global__ must be invoked with an extra configuration clause. For a kernel declared as
    __global__ void Func(float* parameter);
the call takes the form
    Func<<< Dg, Db, Ns >>>(parameter);
where Dg specifies the dimensions of the grid of thread blocks, Db the dimensions of each thread block, and Ns the number of bytes of shared memory dynamically allocated to each block for this call.
d. Built-in variables give each thread its position: the grid dimension (gridDim), block index (blockIdx), block dimension (blockDim), and thread index (threadIdx).
iv. Programming Example: Matrix Multiplication

// Matrix multiplication on the device: C = A * B
// hA is the height of A
// wA is the width of A (and the height of B)
// wB is the width of B

// Thread block size; hA and wB are assumed to be multiples of it
#define BLOCK_SIZE 16

__global__ void Muld(float* A, float* B, int wA, int wB, float* C);

void Mul(const float* A, const float* B, int hA, int wA, int wB, float* C)
{
    // Load the matrices A and B into device memory
    float* Ad;
    size_t size = hA * wA * sizeof(float);
    cudaMalloc((void**)&Ad, size);
    cudaMemcpy(Ad, A, size, cudaMemcpyHostToDevice);
    float* Bd;
    size = wA * wB * sizeof(float);
    cudaMalloc((void**)&Bd, size);
    cudaMemcpy(Bd, B, size, cudaMemcpyHostToDevice);

    // Allocate space on the device for the result matrix C
    float* Cd;
    size = hA * wB * sizeof(float);
    cudaMalloc((void**)&Cd, size);

    // Compute the block dimension and grid dimension
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(wB / dimBlock.x, hA / dimBlock.y);

    // Invoke the device function
    Muld<<<dimGrid, dimBlock>>>(Ad, Bd, wA, wB, Cd);

    // Copy the result C from the device back to the host
    cudaMemcpy(C, Cd, size, cudaMemcpyDeviceToHost);

    // Free the device memory
    cudaFree(Ad);
    cudaFree(Bd);
    cudaFree(Cd);
}

__global__ void Muld(float* A, float* B, int wA, int wB, float* C)
{
    // Block index: which block this thread belongs to
    int bx = blockIdx.x;
    int by = blockIdx.y;

    // Thread index within the block
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Index of the first sub-matrix of A processed by this block
    int aBegin = wA * BLOCK_SIZE * by;

    // Index of the last sub-matrix of A processed by this block
    int aEnd = aBegin + wA - 1;

    // Step size used to iterate through the sub-matrices of A
    int aStep = BLOCK_SIZE;

    // Index of the first sub-matrix of B processed by this block
    int bBegin = BLOCK_SIZE * bx;

    // Step size used to iterate through the sub-matrices of B
    int bStep = BLOCK_SIZE * wB;

    // The element of the block sub-matrix computed by this thread
    float Csub = 0;

    // Loop over all the sub-matrices of A and B needed for this block
    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) {
        // Shared memory for the sub-matrix of A
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];

        // Shared memory for the sub-matrix of B
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        // Load the sub-matrices from global memory into shared memory;
        // each thread loads one element of each sub-matrix
        As[ty][tx] = A[a + wA * ty + tx];
        Bs[ty][tx] = B[b + wB * ty + tx];

        // Synchronize to make sure both sub-matrices are fully loaded
        __syncthreads();

        // Multiply the two sub-matrices together;
        // each thread accumulates one element of the block sub-matrix
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += As[ty][k] * Bs[k][tx];

        // Synchronize again so that faster threads do not overwrite the
        // shared sub-matrices before slower threads finish reading them
        __syncthreads();
    }

    // Write the block sub-matrix back to global memory;
    // each thread writes one element
    int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
    C[c + wB * ty + tx] = Csub;
}
References

IV. References
1. David Luebke (NVIDIA Research) and Greg Humphreys (University of Virginia). How GPUs Work. IEEE Computer, 2008.
2. Kayvon Fatahalian and Mike Houston. GPUs: A Closer Look. ACM Queue, 2008.
3. NVIDIA CUDA Compute Unified Device Architecture Programming Guide, Version 1.1. NVIDIA Corporation, 2007.
4. John Nickolls, Ian Buck, and Michael Garland (NVIDIA), and Kevin Skadron (University of Virginia). Scalable Parallel Programming with CUDA. ACM Queue, 2008.
5. John D. Owens, David Luebke, Naga Govindaraju, Mark Harris, Jens Krüger, Aaron E. Lefohn, and Timothy J. Purcell. A Survey of General-Purpose Computation on Graphics Hardware. Computer Graphics Forum, 26(1):80–113, March 2007.
6. Emmett Kilgariff and Randima Fernando. GPU Gems 2, Chapter 30: The GeForce 6 Series GPU Architecture. NVIDIA Corporation, 2005.
7. Ian Buck. GPU Computing with NVIDIA CUDA. NVIDIA Corporation, SIGGRAPH 2007 lecture.
8. John Owens. GPU Architecture Overview. UC Davis, SIGGRAPH 2007 GPGPU lecture.
9. CUDA, Wikipedia, http://en.wikipedia.org/wiki/CUDA