Computer Organization and Assembly Language Final Project Report
GPU Architecture and Programming: From GPGPU to CUDA
Contents

Abstract
I. Evolution of GPU Architecture
  i. Traditional Graphics Pipeline and GPU Architecture
  ii. Programmable Graphics Pipeline and GPU Evolution
  iii. From Graphics to General Computation: CPU and GPU
II. Evolution of GPU Programming Model
  i. GPGPU: General Purpose Graphic Processing Unit
  ii. The Architecture of Unified Shader GPU
  iii. CUDA: Compute Unified Device Architecture
  iv. The Programming Model of CPU, GPGPU, and CUDA
III. The Programming Model of CUDA
  i. Detailed Architecture of CUDA Programming
  ii. Hardware Implementation
  iii. Application Interface
  iv. Programming Example: Matrix Multiplication
IV. References
Abstract
In recent years the graphics processing unit (GPU) has grown from a fixed-function device dedicated to real-time rendering into a massively parallel processor. Because rendering applies the same operations to enormous numbers of vertices and pixels, GPU designers steadily increased the number of parallel execution units and, generation by generation, made those units programmable. For data-parallel workloads the arithmetic throughput of a modern GPU now far exceeds that of a CPU, which makes the GPU an attractive platform for general-purpose parallel computing, provided the programming model makes that power accessible.

This report first reviews the evolution of GPU architecture: the traditional fixed-function graphics pipeline, the programmable pipeline, and the importance of the GPU for general data processing. It then traces the corresponding evolution of the GPU programming model, from GPGPU (General Purpose GPU) programming through graphics interfaces, to the unified shader architecture, to the new parallel programming model it enabled, CUDA (Compute Unified Device Architecture). Finally, the report examines the CUDA programming model in detail, including its thread organization, memory hierarchy, hardware implementation, and C-based application interface, and illustrates this new model of parallel computation with a worked matrix multiplication example.
Evolution of GPU Architecture

I. Evolution of GPU Architecture
The GPU (Graphics Processing Unit) is a processor dedicated to the heavy arithmetic of graphics rendering. Rendering a scene in real time demands far more computation than a CPU can supply while also running the rest of the system, so graphics processing moved onto specialized hardware. Where a CPU fetches one instruction at a time and applies it to one piece of data before moving on, a GPU applies the same instruction to many data elements at once. The GPU's way of raising throughput is therefore not faster sequential execution but the simultaneous execution of one instruction over many data elements, that is, parallel computing (Parallel Computing).

The following sections describe the stages required to render an image and the early GPU architectures built around them.
i. Traditional Graphics Pipeline and GPU Architecture
Rendering an image proceeds through a fixed sequence of stages, the graphics pipeline:

Figure 1: The traditional graphics rendering pipeline

1. Application input: the running application supplies the geometry data and rendering commands to the GPU.
[Figure: Application Input → Geometry Processing → Rasterization → Fragment Processing → Hidden Surface Removal → Output Image, all operating over the GPU memory buffer]
2. Geometry processing: also called vertex processing, this stage operates on the vertex data supplied by the application. The 3D coordinates of each vertex of the model are transformed into the 2D coordinates of the frame being rendered, producing a single transformed vertex stream as input to the next stage. This stage also performs lighting, computing the color contribution at each vertex from its surface normal and the positions of the light sources.
3. Rasterization: determines which screen pixels each 2D primitive covers, converting every primitive into a set of per-pixel fragments.
4. Fragment processing: applies texture data to each rasterized fragment to determine its color.
5. Pixel processing: compares each fragment's z-buffer (depth) value against the value already stored for that pixel; only the nearest fragment survives and is drawn to the output image. This removes hidden surfaces and completes the rendered frame.
An early GPU built around this fixed pipeline is shown below:

Figure 2: Kurt Akeley. RealityEngine Graphics. In Proc. SIGGRAPH '93. ACM Press, 1993.
Two features of this early design are worth noting:
1. It already exploits parallelism: the RealityEngine contains multiple Geometry Engines and Fragment Generators, so the pipeline itself is built from parallel hardware.
2. It is fixed function: each stage applies one predetermined operation to the data flowing through it. The hardware is parallel, but because the operations cannot be changed, this generation of GPU cannot take over general computation from the CPU.

The limits of fixed-function processing led to the next generation of GPUs:
ii. Programmable Graphics Pipeline and GPU Evolution
The key change in this generation was to replace the fixed-function vertex and pixel processing units with programmable units, called shaders, which execute developer-supplied programs. This change also pushed GPUs toward large arrays of general ALUs, laying the groundwork for the General Purpose GPU. The programmable pipeline reorganizes the stages as follows:
1. Vertex computation is split into Vertex Generating (VG) and Vertex Processing (VP). VG assembles the application's lines, points, and triangles into vertex records and passes them to VP. VP runs a program over each vertex record, transforming its coordinates toward 2D screen space and emitting a new record downstream. This programmable stage performs the projection and the model and view transforms, as well as per-vertex lighting.
2. Primitive Generating and Primitive Processing handle the geometry stage: the processed vertices of each primitive are gathered into a group and passed on.
3. Fragment Generating (FG) is the rasterization stage: it samples each primitive against the screen and emits one record per covered pixel to Fragment Processing (FP). FP fetches texture data from memory and combines it with the lighting results computed during vertex processing; because this stage is programmable, per-pixel techniques such as Phong shading can produce much smoother lighting.
4. The final pixel operations, including hidden surface removal, remain as before.
[Figure: the programmable pipeline: Application Input → Vertex Generating → Vertex Processing → Primitive Generating → Primitive Processing → Fragment Generating → Fragment Processing → Hidden Surface Removal → Output Image, over the GPU memory buffer]
The internal architecture of the GPU changed accordingly, and with programmability the GPU acquired far more general data-processing capability.
iii. From Graphics to General Computation: CPU and GPU
Given this processing power, the GPU can take on computing tasks that traditionally belonged to the CPU; the mapping can be seen from the comparison below.

Table 1: Correspondence between a CPU computing task and a GPU pixel shader task

In this mapping, the GPU performs general computation by treating its graphics resources as their CPU counterparts: the memory buffer that normally holds textures holds the data arrays; reading data corresponds to a texture lookup; the body of the loop over the data becomes a shader program written in a shading language (OpenGL, Cg, and so on); and the results are written to the frame buffer, from which they can be read back as the final output. This correspondence is what launched General Purpose GPU computing.
CPU              GPU
Data Array       Texture
Memory Read      Texture Lookup
Loop Body        Shader Program
Memory Write     Render to Frame Buffer
Evolution of GPU Programming Model

II. Evolution of GPU Programming Model
i. GPGPU: General Purpose Graphic Processing Unit
(1) The idea of GPGPU is to use the GPU for parallel computation by disguising general tasks as graphics work, supplementing the CPU's computation. Concretely, the program drives the GPU through its graphics interface, and each element of the computation is carried out by a GPU thread as if it were pixel work. A GPGPU program uses video memory the way a CPU program uses DRAM, and it must be written in a graphics-oriented language such as OpenGL or Cg.
(2) Programming and execution flow: the figure below shows how data moves between the CPU and the GPU while a GPGPU program runs. glReadPixel() is the OpenGL call used to copy results from the GPU back into DRAM.
(3) Drawbacks of GPGPU:
(a) Programs must be written in a shader language. For programmers without a graphics background the learning curve is steep, so in practice only developers already familiar with graphics programming could use it, which limited GPGPU adoption.
(b) As the flow above shows, data transfers are frequent, and all data must travel between the GPU and DRAM. Since DRAM is where the CPU collects the results, the traffic between the GPU's memory and the CPU becomes a serious bottleneck.

Because of these drawbacks, using the GPU for general computation was not yet attractive enough to displace the CPU, since the cost of reaching the GPU's power was too high. Nevertheless, GPGPU demonstrated the GPU's potential, and the subsequent evolution of GPU hardware and GPU programming models opened the door to mainstream parallel computing.

ii. The Architecture of Unified Shader GPU
(1) The unified shader design aims to raise the efficiency of the programmable GPU. In the programmable pipeline each stage has its own dedicated processors, so one stage can finish its share of a frame and sit idle while waiting for another stage to catch up. In practice this produces the following situation:
[Figure (GPGPU programming flow from section i): the CPU places data in DRAM; on the GPU, the shader reads texture memory and writes the framebuffer and depth buffer; glReadPixel() copies the output back to DRAM for the CPU]
[Figure: with dedicated vertex shaders and pixel shaders, part of the shader capacity sits idle while the rest is fully loaded]
When rendering different kinds of frames the balance of work shifts, but because the number of shaders assigned to each stage is fixed, the hardware cannot adapt: a workload heavy in one kind of processing leaves the other stage's shaders idle, wasting capacity. The remedy is to remove the distinction between shader types entirely: make every shader identical so that the idle capacity of one stage can absorb the load of another, with a scheduler assigning shaders to whatever work is pending. The design that grew out of this idea is the Unified Shader. Since any shader can run either vertex processing or pixel processing, the same hardware handles both of the skewed workloads shown below:
[Figure: a unified shader pool splits dynamically between vertex processing and pixel processing, covering both the heavy-geometry and the heavy-pixel case]
Balancing the two kinds of work in this way keeps the hardware busy and avoids wasting idle capacity.
(2) Graphics Pipeline of Unified Shader GPU
All of the programmable stages of the pipeline now execute on the unified shaders, while the non-programmable stages are still carried out by their original dedicated units.
(3) The resulting hardware organization of the GPU is shown below:
[Figure: unified-shader GPU: Application Input → Geometry Processing → Rasterization → Fragment Processing → Hidden Surface Removal → Output Image; the programmable stages all run on the Unified Shader, over the GPU memory buffer]
iii. CUDA: Compute Unified Device Architecture
(1) The idea of CUDA is to expose the unified-shader GPU directly as a device for general-purpose computation. In the CUDA model the CPU is the host and remains the main processor, while the GPU plays the role of a coprocessor, the device. The portions of an application that apply the same computation to large amounts of data, that is, the data-parallel, compute-intensive parts, are packaged by the host and dispatched to execute on the device.
(2) Compared with the earlier GPGPU approach, CUDA adds several key features:
a) Shared memory: the device provides fast on-chip memory that its processors share, so intermediate results can be exchanged without round trips through device memory. This is a direct consequence of the unified shader design, in which the shaders are uniform processors able to cooperate on the same data, and it is essential for efficient parallel computing.
b) Hierarchy of Thread Groups: threads are organized into threads, thread blocks, and grids of thread blocks, so each shader can be handed cooperating groups of work and programs scale naturally across the available parallel hardware.
c) A C language extension: the biggest obstacle to GPGPU was that programs had to be written in graphics shading languages. CUDA instead supplies a runtime library and a small set of extensions integrated into C, which greatly enlarges the pool of developers who can use the GPU and, with it, the range of applications.
iv. The Programming Model of CPU, GPGPU, and CUDA
A CPU program executes as a single thread, one instruction at a time. Its data must ultimately come from DRAM, but the CPU sits behind a cache: through prefetching and its memory-access policies it pulls data into the cache ahead of use, so that most accesses during execution are fast.

[Figure: CPU programming architecture]
A GPGPU computation instead runs as a sequence of passes that read their data from video memory, with the output of one pass feeding the next. Within a pass every data element is processed by the same program: each instruction is executed by many ALU components under a common control unit, one data element per ALU, much like a large number of very simple CPUs running side by side. The drawback is that the video memory serving these ALUs is not on the same chip as the execution units, so memory latency and bandwidth limit the achievable speed.

CUDA combines the massive parallelism of GPGPU with the caching idea of the CPU. A large number of threads execute the same program over different data, and to speed up their memory accesses the GPU provides an on-chip Parallel Data Cache, the shared memory. Because the threads of one program work on overlapping data, a value loaded once can be read by many threads, and results produced by one thread can be consumed by others directly from this cache rather than from device memory, so the threads share data efficiently.

[Figure: GPGPU programming architecture]
[Figure: CUDA programming architecture]
The Programming Model of CUDA

III. The Programming Model of CUDA
i. Detailed Architecture of CUDA Programming
a. Thread Batching
i. Thread block: the threads of one block can share data through the block's shared memory and can synchronize their execution to coordinate their accesses to that shared data. Cooperating in this way lets the threads of a block reuse loaded data efficiently.
ii. Grid of thread blocks: a single block can contain only a limited number of threads, and each block executes one instance of the kernel. Multiple blocks of the same shape and size can, however, be batched together into a grid of thread blocks, so the total number of threads launched for one kernel can be very large.
[Figure: the host launches Kernel1 on Grid1 and Kernel2 on Grid2 of the device; each grid is a 2D array of blocks, Block(0,0) through Block(2,1), and each block, e.g. Block(1,1), is a 2D array of threads, thread(0,0) through thread(4,2)]
b. Memory model
(1) Shared memory: shared by the threads of one block; very fast to access.
(2) Global memory: shared across blocks, i.e. by every thread of a grid; both the host and the kernels can read and write this memory.
(3) Constant memory and texture memory: readable by every thread of a grid but not writable by them; only the host can write these regions, which the kernels then access read-only.
ii. Hardware Implementation: the device is built as an array of multiprocessors, and each multiprocessor has a SIMD (Single Instruction Multiple Data) architecture: in any given clock cycle, every processor of the multiprocessor executes the same instruction, but each operates on different data.
[Figure: CUDA memory model: a grid contains blocks such as Block(0,0) and Block(1,0); each block has its own shared memory; each thread, e.g. Thread(0,0) and Thread(1,0), has its own registers and local memory; all blocks access the grid-wide global memory, constant memory, and texture memory]
iii. Application Interface: the interface is based on the C language, keeping the amount of new syntax a programmer must learn to a minimum. The main extensions are:
a. Function type qualifiers: declare whether a function executes on the host, on the device, or is a kernel; in the program text these are written __host__, __device__, and __global__. Functions that execute on the device are restricted: recursion is not supported, static variables cannot be declared inside them, and they cannot take a variable number of arguments. A declaration looks like:
    __device__ void fun() {
        /* function body */
    }
b. Variable type qualifiers: declare where in device memory a variable resides, using __device__, __constant__, or __shared__. For example:
    extern __shared__ float shared[];
declares an array named shared[] that lives in shared memory and whose size is supplied externally at launch time.
c. Execution configuration: a function declared __global__ must be invoked with an extra configuration clause. For a kernel declared as
    __global__ void Func(float* parameter);
the call takes the form
    Func<<< Dg, Db, Ns >>>(parameter);
where Dg specifies the dimensions of the grid of thread blocks, Db the dimensions of each thread block, and Ns the number of bytes of shared memory dynamically allocated to each block for this call.
d. Built-in variables give each thread its position: the grid dimension (gridDim), block index (blockIdx), block dimension (blockDim), and thread index (threadIdx).
iv. Programming Example: Matrix Multiplication

// Matrix multiplication on the device: C = A * B
// hA is the height of A
// wA is the width of A (and the height of B)
// wB is the width of B

// Thread block size; hA and wB are assumed to be multiples of it
#define BLOCK_SIZE 16

__global__ void Muld(float* A, float* B, int wA, int wB, float* C);

void Mul(const float* A, const float* B, int hA, int wA, int wB, float* C)
{
    // Load the matrices A and B into device memory
    float* Ad;
    size_t size = hA * wA * sizeof(float);
    cudaMalloc((void**)&Ad, size);
    cudaMemcpy(Ad, A, size, cudaMemcpyHostToDevice);
    float* Bd;
    size = wA * wB * sizeof(float);
    cudaMalloc((void**)&Bd, size);
    cudaMemcpy(Bd, B, size, cudaMemcpyHostToDevice);

    // Allocate space on the device for the result matrix C
    float* Cd;
    size = hA * wB * sizeof(float);
    cudaMalloc((void**)&Cd, size);

    // Compute the block dimension and grid dimension
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(wB / dimBlock.x, hA / dimBlock.y);

    // Invoke the device function
    Muld<<<dimGrid, dimBlock>>>(Ad, Bd, wA, wB, Cd);

    // Copy the result C from the device back to the host
    cudaMemcpy(C, Cd, size, cudaMemcpyDeviceToHost);

    // Free the device memory
    cudaFree(Ad);
    cudaFree(Bd);
    cudaFree(Cd);
}

__global__ void Muld(float* A, float* B, int wA, int wB, float* C)
{
    // Block index: which block this thread belongs to
    int bx = blockIdx.x;
    int by = blockIdx.y;

    // Thread index within the block
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Index of the first sub-matrix of A processed by this block
    int aBegin = wA * BLOCK_SIZE * by;

    // Index of the last sub-matrix of A processed by this block
    int aEnd = aBegin + wA - 1;

    // Step size used to iterate through the sub-matrices of A
    int aStep = BLOCK_SIZE;

    // Index of the first sub-matrix of B processed by this block
    int bBegin = BLOCK_SIZE * bx;

    // Step size used to iterate through the sub-matrices of B
    int bStep = BLOCK_SIZE * wB;

    // The element of the block sub-matrix computed by this thread
    float Csub = 0;

    // Loop over all the sub-matrices of A and B needed for this block
    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) {
        // Shared memory for the sub-matrix of A
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];

        // Shared memory for the sub-matrix of B
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        // Load the sub-matrices from global memory into shared memory;
        // each thread loads one element of each sub-matrix
        As[ty][tx] = A[a + wA * ty + tx];
        Bs[ty][tx] = B[b + wB * ty + tx];

        // Synchronize to make sure both sub-matrices are fully loaded
        __syncthreads();

        // Multiply the two sub-matrices together;
        // each thread accumulates one element of the block sub-matrix
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += As[ty][k] * Bs[k][tx];

        // Synchronize again so that faster threads do not overwrite the
        // shared sub-matrices before slower threads finish reading them
        __syncthreads();
    }

    // Write the block sub-matrix back to global memory;
    // each thread writes one element
    int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
    C[c + wB * ty + tx] = Csub;
}
References

IV. References
1. David Luebke (NVIDIA Research) and Greg Humphreys (University of Virginia). How GPUs Work. IEEE Computer, 2008.
2. Kayvon Fatahalian and Mike Houston. GPUs: A Closer Look. ACM Queue, 2008.
3. NVIDIA CUDA Compute Unified Device Architecture Programming Guide, Version 1.1. NVIDIA Corporation, 2007.
4. John Nickolls, Ian Buck, and Michael Garland (NVIDIA), and Kevin Skadron (University of Virginia). Scalable Parallel Programming with CUDA. ACM Queue, 2008.
5. John D. Owens, David Luebke, Naga Govindaraju, Mark Harris, Jens Krüger, Aaron E. Lefohn, and Timothy J. Purcell. A Survey of General-Purpose Computation on Graphics Hardware. Computer Graphics Forum, 26(1):80–113, March 2007.
6. Emmett Kilgariff and Randima Fernando. GPU Gems 2, Chapter 30: The GeForce 6 Series GPU Architecture. NVIDIA Corporation, 2005.
7. Ian Buck. GPU Computing with NVIDIA CUDA. NVIDIA Corporation, SIGGRAPH 2007 lecture.
8. John Owens. GPU Architecture Overview. UC Davis, SIGGRAPH 2007 GPGPU lecture.
9. CUDA, Wikipedia, http://en.wikipedia.org/wiki/CUDA