
Computer Organization and Assembly Language Final Project Report

GPU Architecture and Programming: From GPGPU to CUDA


Contents

Abstract

I. Evolution of GPU Architecture
   i. Traditional Graphics Pipeline and GPU Architecture
   ii. Programmable Graphics Pipeline and GPU Evolution
   iii. From Graphics to General Computation: CPU and GPU

II. Evolution of GPU Programming Model
   i. GPGPU: General Purpose Graphics Processing Unit
   ii. The Architecture of Unified Shader GPU
   iii. CUDA: Compute Unified Device Architecture
   iv. The Programming Model of CPU, GPGPU, and CUDA

III. The Programming Model of CUDA
   i. Detailed Architecture of CUDA Programming
   ii. Hardware Implementation
   iii. Application Interface
   iv. Programming Example: Matrix Multiplication

IV. References


Abstract


Computer graphics requires applying the same operations to enormous amounts of data, so the GPU, the processor dedicated to rendering, evolved into a massively parallel machine whose computing power makes it attractive for work beyond graphics. This report first reviews the evolution of GPU architecture, from the traditional fixed-function graphics pipeline to the programmable GPU, and explains how the resemblance between GPU rendering and ordinary CPU computation gave rise to the General Purpose GPU (GPGPU). It then describes the unified shader GPU architecture and the parallel programming model built on top of it, CUDA (Compute Unified Device Architecture). Finally, it examines CUDA programming in detail and uses an example program to show the advantages of this new parallel computing model.


Evolution of GPU Architecture


I. Evolution of GPU Architecture

The GPU (Graphics Processing Unit) is, as its name suggests, the dedicated hardware responsible for rendering images. Because its goal is real-time rendering, its evolution has concentrated on raising the throughput of data processing. Whereas a CPU fetches one instruction at a time and applies it to a small amount of data, the GPU is built so that the same instruction is applied to many data elements at once; raising its performance therefore means raising the number of data elements that a single instruction can process. This style of processing is what we call parallel computing (Parallel Computing).

The following subsections describe the processing steps that rendering requires and the early GPU architecture built around them.

i. Traditional Graphics Pipeline and GPU Architecture

To render a frame, the data passes through the pipeline stages listed below; Figure 1 shows the traditional graphics rendering pipeline.

1. Application input: the application supplies the geometry data and rendering commands to the GPU.

[Figure 1: Traditional graphics pipeline inside the GPU, consisting of Application Input → Geometry Processing → Rasterization → Fragment Processing → Hidden Surface Removal → Output Image, backed by the GPU memory buffer.]


2. Geometry processing: essentially vertex processing, since the data being handled are the vertices that define the geometry. The vertices describing the model's triangles are transformed from 3D model space into the 2D screen space in which the frame will finally be displayed, becoming the input of the next stage; lighting is also evaluated at the vertices from the surface orientation and the light positions and passed along with the data.

3. Rasterization: because the image is ultimately displayed pixel by pixel, each 2D triangle from the previous stage is converted into the set of pixels it covers, producing a fragment for every covered pixel.

4. Fragment processing: takes the fragments produced by rasterization and applies the texture to each covered pixel.

5. Pixel processing: uses the data in the z-buffer to decide visibility: a newly computed pixel is kept and written to the frame buffer for display only if it is closer to the viewer than the pixel already stored there. This removes hidden surfaces and yields the final image (a small sketch of this depth test follows the list).
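As a hedged illustration of the depth test in step 5 (the names below are illustrative, not from the report), the logic each incoming fragment goes through is roughly:

    // Keep an incoming fragment only if it is closer to the viewer than the
    // value already stored for that pixel (hidden surface removal).
    void depth_test(float z, float color,
                    float* zbuffer, float* framebuffer, int pixel) {
        if (z < zbuffer[pixel]) {        // smaller z means closer to the camera
            zbuffer[pixel] = z;          // remember the new closest depth
            framebuffer[pixel] = color;  // this fragment becomes the visible one
        }
    }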

These requirements led to early GPU architectures such as the RealityEngine:

Figure 2: Kurt Akeley, "RealityEngine Graphics," in Proc. SIGGRAPH '93, ACM Press, 1993.

Two points stand out in this early design:

1. Parallel computation is already used to gain speed: the GPU is built out of many replicated units (Geometry Engines, Fragment Generators) working side by side.

2. The stages are fixed-function: each unit can only carry out its predetermined operation on the data. The obvious drawback is that, although the design is parallel, only those fixed operations can be performed, so this generation of GPU still has to rely on the CPU for everything else.

The limits of this processing model led to the next generation of GPUs:


ii. Programmable Graphics Pipeline and GPU Evolution

The key change in this generation is that the fixed-function vertex and pixel processing units are replaced by programmable units, called shaders, so that the programmer can decide what operations they perform. This change also greatly increased the number of ALUs inside the GPU and laid the groundwork for the General Purpose GPU. The pipeline stages were refined as follows:

1. Vertex handling is split into Vertex Generating (VG) and Vertex Processing (VP). VG assembles the input lines, points and triangles into vertex records and hands the records to VP. VP transforms every vertex into its 2D screen position and emits a new record for the next stage; since this stage is programmable, the projection, model and view transforms (and the per-vertex lighting) are computed here.

2. Primitive Generating and Primitive Processing take over the geometry stage: the transformed vertices are grouped into primitives, which are processed and passed on.

3. Fragment Generating (FG) corresponds to rasterization: each primitive is sampled on the screen, and for every covered pixel a new record is produced and sent to Fragment Processing (FP). FP reads the texture data from memory and combines it with the lighting results computed earlier in vertex processing to color every pixel of the primitive; because the stage is programmable, per-pixel techniques such as Phong shading become possible (a small sketch follows the pipeline figure below).

4. The final pixel operations perform hidden surface removal.

[Figure: Programmable graphics pipeline inside the GPU, consisting of Application Input → Vertex Generating → Vertex Processing → Primitive Generating → Primitive Processing → Fragment Generating → Fragment Processing → Hidden Surface Removal → Output Image, backed by the GPU memory buffer.]
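As an illustration of the kind of per-pixel operation a programmable fragment stage makes possible, here is a small C-style sketch of the Phong lighting computation mentioned above (the function and parameter names are illustrative, not taken from the report):

    #include <math.h>

    // Phong shading for one pixel: ambient + diffuse + specular contributions.
    // n_dot_l is the dot product of the surface normal and the light direction,
    // r_dot_v the dot product of the reflected light direction and the view
    // direction; ka, kd, ks and shininess are material parameters.
    float phong_intensity(float ka, float kd, float ks, float shininess,
                          float n_dot_l, float r_dot_v) {
        float ambient  = ka;
        float diffuse  = kd * fmaxf(n_dot_l, 0.0f);
        float specular = ks * powf(fmaxf(r_dot_v, 0.0f), shininess);
        return ambient + diffuse + specular;
    }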


The internal architecture of the GPU changed accordingly; as a result, the GPU came to contain more and more processing units.

iii. From Graphics to General Computation: CPU and GPU

Having accumulated so many programmable processing units, the GPU can also take over computing tasks that are normally handled by the CPU. Table 1 maps the elements of a CPU computing task onto those of a GPU pixel shader task.

!

Y.ÀxÄ%D1'GPUÚ~Bef+LM'Ä%@<CPULM

¸¹+ºo»'',®·¾�¼½textureLM+memory bufferÄ%

@&¼½data arrayX[\¢'LM<<texture look up~B+¾*Ì

='Ú~B+LM+noPh¡<@shader language(openGL, cg©

©)Þm+shader program¸¹'Z%¿Ò+LM¡Ä%Þ'frame

buffer'©=IÞÀ�Áh+¾*XRÓ!>+̹¬'Ê�T

Genernal purpose GPU+EF4lX

Table 1: Mapping between a CPU computing task and a GPU pixel shader task

    CPU              GPU
    Data Array       Texture
    Memory Read      Texture Lookup
    Loop Body        Shader Program
    Memory Write     Render to Frame Buffer
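As a rough sketch of this mapping (it uses the CUDA syntax introduced later in this report, and the function names are illustrative), the body of a CPU loop becomes the per-element program that the GPU runs once for every data element:

    // CPU version: one thread walks the whole array; the loop body does a
    // memory read, a computation, and a memory write.
    void addOneCPU(float* data, int n) {
        for (int i = 0; i < n; ++i)
            data[i] = data[i] + 1.0f;
    }

    // GPU version: the former loop body becomes the program run per element
    // (a shader program in GPGPU terms, a kernel in CUDA terms); the data
    // array plays the role a texture plays in graphics.
    __global__ void addOneGPU(float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] = data[i] + 1.0f;
    }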


Evolution of GPU Programming Model

II. Evolution of GPU Programming Model

i. General purpose GPU

(1) The idea is to use the GPU as a parallel processing engine for ordinary data: the data to be processed is disguised as graphics data so that the GPU handles it the way the CPU normally would. Concretely, the GPU's graphics commands are used to spawn the threads that carry out the computation inside the GPU. Early GPGPU uses video memory in the role that DRAM plays for the CPU, and a GPGPU program has to be written through graphics-oriented languages and APIs such as OpenGL and Cg.

(2) Programming model: the figure below shows how the program and its data move between the GPU and the CPU; for example, glReadPixels() is the OpenGL call used to read the results back into the CPU's DRAM.

[Figure: GPGPU programming model. The CPU issues commands to the GPU; the shaders read from texture memory and write to the framebuffer and depth buffer, and glReadPixels() copies the output from the GPU back to DRAM on the CPU side.]

(3) Drawbacks of GPGPU:

(a) The computation must be expressed in a shader language. Because these languages are graphics-specific, programmers without a graphics background find them hard to use, and programs that have nothing to do with graphics are awkward to write; this is the biggest obstacle to GPGPU adoption.

(b) From the architectural point of view, data movement is inefficient: the data has to travel back and forth between the GPU and DRAM, but DRAM is designed to serve the CPU, so returning results computed on the GPU to the CPU easily becomes a performance bottleneck.

Because of these drawbacks, using the GPU for general data processing remained far less convenient than using the CPU. The developments described next, the unified shader GPU architecture and a new GPU programming model, opened a new chapter for parallel computing on the GPU.

ii. The Architecture of Unified Shader GPU

(1) Unified Shader GPU: its main goal is to raise the efficiency of the programmable GPU. Because the graphics pipeline works stage by stage, one stage can only start once the previous stage has produced its results, so with a fixed partition of processing units the hardware can end up in the situation shown below:



[Figure: With a fixed split into vertex shaders and pixel shaders, one kind of shader capacity sits idle while the other kind is fully used.]

The amount of work each kind of shader receives varies greatly from scene to scene, so with a fixed partition the hardware cannot adapt: while one kind of unit is saturated, the other sits idle, and performance is wasted. If instead the hardware is free to decide which shaders execute which kind of work, the idle capacity can absorb the extra load, utilization rises, and overall performance improves. The design born from this idea is the Unified Shader: the GPU contains a large number of shaders, and every one of them can perform either vertex processing or pixel processing, so whatever the workload looks like it can be scheduled as illustrated below:

[Figure: With a unified shader pool, shaders are assigned to vertex or pixel processing as needed, so both geometry-heavy and pixel-heavy workloads keep the hardware fully busy.]


This balances the load across the processing units and avoids leaving capacity unused.

(2) Graphics pipeline of the unified shader GPU: every programmable stage of the pipeline now runs on the unified shaders, while the non-programmable stages are still carried out by their original dedicated units.

(3) The resulting GPU pipeline and internal hardware organization are illustrated below:

[Figure: Unified shader graphics pipeline. Application Input → Geometry/Vertex Processing → Rasterization → Fragment Processing → Hidden Surface Removal → Output Image; all programmable stages execute on the Unified Shader array, backed by the GPU memory buffer.]


iii. CUDA: Compute Unified Device Architecture

(1) CUDA builds directly on the unified shader GPU architecture and treats the GPU as a processor for general data-parallel computation. In the CUDA model the CPU is the host and runs the main program, while the GPU plays the role of a coprocessor, called the device. Whenever part of the program is a highly parallel data computation rather than complex control logic, the host can hand that part to the device for execution.

(2) Compared with the earlier GPGPU approach, CUDA adds several features:

a) Shared memory: the host and the device each have their own memory, and on the device the processing units are given a fast memory they can share. This is a direct benefit of the unified shader hardware: in the unified shader design the shaders already need a memory through which they can exchange intermediate results, and CUDA exposes it for parallel computation.

b) Hierarchy of thread groups: threads are organized into threads, thread blocks and grids of thread blocks, matching the groups of shaders that can cooperate on a computation; this hierarchy is what makes the parallelism easy to scale.

c) An extension of the C language: whereas GPGPU forced programmers to use graphics-specific shading languages, which was a major obstacle, CUDA provides a runtime library of functions and folds these facilities into the familiar C language, which makes programming far more flexible and far more convenient.

iv. The Programming Model of CPU, GPGPU, and CUDA

A CPU program is executed by a single thread, one instruction at a time. The data has to come from DRAM, but because the CPU has a cache it can use prefetching and its memory-access policy to keep data in the cache, so that later accesses are much faster.

CPU programming architecture


GPGPU execution has to fetch its data from video memory and write the results back there. Many instructions can run at the same time, each with its own ALU and control unit, as if many small CPUs were working side by side. The drawback is that the video memory and the ALUs doing the work are not tightly coupled, so the transfers between them limit both the access speed and the amount of data that can be kept in flight.

CUDA combines the parallel execution of GPGPU with the idea of the CPU cache: a large number of threads execute the same program on different pieces of data, and the Parallel Data Cache built into the GPU speeds up their data accesses. Because the threads all run the same program, one piece of data is often needed by many threads, and the results produced by one thread are often consumed by others; the cache therefore also serves as the place where threads share data.

GPGPU programming architecture

CUDA programming architecture


The Programming Model of CUDA

III. The Programming Model of CUDA

i. Detailed Architecture of CUDA Programming

a. Thread Batching

i. Thread block: a group of threads that cooperate by sharing data placed in shared memory while they run the same program over the same region of data; their execution can be synchronized so that every thread sees a consistent view of that data.

±(=e"kernelÐpB'ʤ�=>V�<z�+thread

blocksÄ%�SÒgrid of thread blocks'!>e&'e"kernel

Ä%�@+thread+Ç8äR5�X

[Figure: The host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2 of the device. Each grid is a 2-D array of thread blocks, e.g. Block(0,0) through Block(2,1), and each block, such as Block(1,1), is itself a 2-D array of threads, thread(0,0) through thread(4,2).]
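As a hedged sketch (the kernel and parameter names are illustrative, not from the report), each thread can combine the built-in block and thread indices, described in the Application Interface section below, to find the element it is responsible for:

    // Hypothetical 2-D example: every thread of the grid handles one element
    // of a width x height array, mirroring the block/thread layout in the figure.
    __global__ void Fill(float* out, int width, int height) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;  // column handled by this thread
        int y = blockIdx.y * blockDim.y + threadIdx.y;  // row handled by this thread
        if (x < width && y < height)
            out[y * width + x] = (float)(x + y);        // one element per thread
    }

    // Launched with a 2-D grid of 2-D blocks, for example:
    //   dim3 block(5, 3);                              // 5 x 3 threads per block
    //   dim3 grid((width + 4) / 5, (height + 2) / 3);  // enough blocks to cover the array
    //   Fill<<<grid, block>>>(devOut, width, height);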


b. Memory model

(1) Shared memory: data shared by the threads within one block; access to it is very fast.

(2) Global memory: data shared by the threads of every block in the grid; the kernel launched from the host can access any data stored here.

(3) Constant memory and texture memory: also visible to the threads of every block in the grid, but the device can only read them; the host prepares their contents for the kernel to access. (A declaration sketch follows this list.)
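A minimal declaration sketch (the variable and kernel names are illustrative, not from the report) showing where each kind of memory is declared and how a kernel touches it:

    // Global memory: allocated by the host (or declared with __device__),
    // readable and writable by every thread of the grid.
    __device__ float globalBuf[1024];

    // Constant memory: written by the host, e.g. with cudaMemcpyToSymbol(),
    // and read-only for the device.
    __constant__ float coeff[16];

    // Assumes a launch of exactly 4 blocks of 256 threads, e.g.
    //   Demo<<<4, 256>>>(devResult);
    __global__ void Demo(float* result) {
        __shared__ float tile[256];                   // per-block shared memory
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = globalBuf[i] * coeff[0];  // read global and constant memory
        __syncthreads();                              // the whole block's tile is now filled
        result[i] = tile[threadIdx.x];                // write the result to global memory
    }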

ii. Hardware Implementation: the device is built as a set of multiprocessors, and each multiprocessor has a SIMD (Single Instruction, Multiple Data) architecture: in a given clock cycle, every processor of the multiprocessor executes the same instruction, but each operates on different data.

[Figure: CUDA memory hierarchy. Each thread has its own registers and local memory; the threads of a block, e.g. Thread(0,0) and Thread(1,0) of Block(0,0), share that block's shared memory; all blocks of the grid access the global, constant and texture memory.]


iii. Application Interface: CUDA takes the C language as its base and adds a small number of extensions, described below.

a. Function type qualifiers: declare whether a function runs on the host, runs on the device, or is a kernel; the function is prefixed with __host__, __device__ or __global__ accordingly. Functions that run on the device (including kernels) have several restrictions: they do not support recursion, cannot declare static variables, and cannot take a variable number of arguments. A declaration looks like:

    __device__ void fun() {
        // function body
    }

b. Variable type qualifiers: declare in which device memory space a variable resides; the qualifiers are __device__, __constant__ and __shared__. For example,

    extern __shared__ float shared[];

declares a float array shared[] that lives in shared memory and whose size is supplied from outside the kernel (at launch time).

c. A function declared __global__ (a kernel) must be launched with a dedicated execution-configuration syntax. For example, a kernel declared as

    __global__ void Func(float* parameter);

is launched with

    Func<<< Dg, Db, Ns >>>(parameter);

where Dg gives the dimensions of the grid of thread blocks, Db gives the dimensions of each thread block, and Ns gives the number of bytes of shared memory allocated per block for the externally sized (extern __shared__) arrays.

d. Built-in variables give each thread its context: the grid dimension (gridDim), the block index (blockIdx), the block dimension (blockDim) and the thread index (threadIdx). A small example combining these extensions follows.
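To tie these extensions together, here is a minimal sketch (not from the original report; the kernel name and sizes are illustrative) of a kernel that uses the built-in variables, an externally sized shared array, and the execution-configuration syntax:

    // Hypothetical kernel: scale N floats by a factor, staging them in shared memory.
    __global__ void Scale(float* data, float factor, int N) {
        extern __shared__ float buf[];                  // size given by Ns at launch time
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global index from built-in variables
        if (i < N)
            buf[threadIdx.x] = data[i];                 // each thread stages one element
        __syncthreads();                                // the whole block has finished loading
        if (i < N)
            data[i] = buf[threadIdx.x] * factor;        // write the scaled value back
    }

    // Host-side launch: Dg blocks of Db threads, with Ns bytes of dynamic shared
    // memory per block for the extern __shared__ array:
    //   Scale<<< (N + 255) / 256, 256, 256 * sizeof(float) >>>(devData, 2.0f, N);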

iv. Programming Example: Matrix Multiplication

// Matrix multiplication: C = A * B
// hA: height of A
// wA: width of A
// wB: width of B
// (BLOCK_SIZE is assumed to be #defined elsewhere, e.g. as 16.)
void Mul(const float* A, const float* B, int hA, int wA, int wB,


float* C) {
    // Load the data of matrices A and B onto the device
    float* Ad;
    int size = hA * wA * sizeof(float);
    cudaMalloc((void**)&Ad, size);
    cudaMemcpy(Ad, A, size, cudaMemcpyHostToDevice);
    float* Bd;
    size = wA * wB * sizeof(float);
    cudaMalloc((void**)&Bd, size);
    cudaMemcpy(Bd, B, size, cudaMemcpyHostToDevice);

    // Allocate space on the device for matrix C
    float* Cd;
    size = hA * wB * sizeof(float);
    cudaMalloc((void**)&Cd, size);

    // Compute the block dimension and the grid dimension
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(wB / dimBlock.x, hA / dimBlock.y);

    // Call the device function
    Muld<<<dimGrid, dimBlock>>>(Ad, Bd, wA, wB, Cd);

    // Copy the content of C computed by the device back to the host
    cudaMemcpy(C, Cd, size, cudaMemcpyDeviceToHost);

    // Free the memory used on the device
    cudaFree(Ad);
    cudaFree(Bd);
    cudaFree(Cd);
}

__global__ void Muld(float* A, float* B, int wA, int wB, float* C) {
    // Block index: which block this thread belongs to
    int bx = blockIdx.x;
    int by = blockIdx.y;

    // Thread index within the block
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Index of the first sub-matrix of A processed by this block
    int aBegin = wA * BLOCK_SIZE * by;

    // Index of the last sub-matrix of A processed by this block
    int aEnd = aBegin + wA - 1;

    // Step size used to iterate through the sub-matrices of A
    int aStep = BLOCK_SIZE;

    // Index of the first sub-matrix of B processed by this block
    int bBegin = BLOCK_SIZE * bx;

    // Step size used to iterate through the sub-matrices of B
    int bStep = BLOCK_SIZE * wB;


    // The element of the block sub-matrix computed by this thread
    float Csub = 0;

    // Loop over all the sub-matrices of A and B needed for this block
    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) {

        // Shared memory for the sub-matrix of A
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];

        // Shared memory for the sub-matrix of B
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        // Load the matrices from global memory into shared memory;
        // each thread loads one element of each sub-matrix
        As[ty][tx] = A[a + wA * ty + tx];
        Bs[ty][tx] = B[b + wB * ty + tx];

        // Synchronize to make sure both sub-matrices are fully loaded
        __syncthreads();

        // Multiply the two sub-matrices;
        // each thread computes one element of the block sub-matrix
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += As[ty][k] * Bs[k][tx];

        // Synchronize again so that no thread overwrites the shared
        // sub-matrices before every thread has finished using them
        __syncthreads();
    }

    // Write the block sub-matrix back to global memory;
    // each thread writes one element
    int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
    C[c + wB * ty + tx] = Csub;
}
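The listing assumes that BLOCK_SIZE has been #defined before Mul() and Muld() (the CUDA Programming Guide uses 16) and that the matrix dimensions are multiples of BLOCK_SIZE. A hypothetical host-side caller, not part of the original listing, might look like:

    #include <stdlib.h>

    // Multiply two 256 x 256 matrices (dimensions are multiples of BLOCK_SIZE).
    int main(void) {
        const int N = 256;
        float* A = (float*)malloc(N * N * sizeof(float));
        float* B = (float*)malloc(N * N * sizeof(float));
        float* C = (float*)malloc(N * N * sizeof(float));
        for (int i = 0; i < N * N; ++i) { A[i] = 1.0f; B[i] = 2.0f; }

        Mul(A, B, N, N, N, C);   // C = A * B computed on the GPU

        // Every element of C should now equal 2.0f * N (here 512.0f).
        free(A); free(B); free(C);
        return 0;
    }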


References

IV. References

1. David Luebke and Greg Humphreys, "How GPUs Work," IEEE Computer, 2007.

2. Kayvon Fatahalian and Mike Houston, "GPUs: A Closer Look," ACM Queue, 2008.

3. NVIDIA Corporation, NVIDIA CUDA Compute Unified Device Architecture Programming Guide, Version 1.1, 2007.

4. John Nickolls, Ian Buck, Michael Garland (NVIDIA), and Kevin Skadron (University of Virginia), "Scalable Parallel Programming with CUDA," ACM Queue, 2008.

5. John D. Owens, David Luebke, Naga Govindaraju, Mark Harris, Jens Krüger, Aaron E. Lefohn, and Timothy J. Purcell, "A Survey of General-Purpose Computation on Graphics Hardware," Computer Graphics Forum, 26(1):80-113, March 2007.

6. Emmett Kilgariff and Randima Fernando (NVIDIA Corporation), "Chapter 30: The GeForce 6 Series GPU Architecture," in GPU Gems 2, 2005.

7. Ian Buck (NVIDIA Corporation), "GPU Computing with NVIDIA CUDA," SIGGRAPH 2007 lecture.

8. John Owens (UC Davis), "GPU Architecture Overview," SIGGRAPH 2007 GPGPU course lecture.

9. "CUDA," Wikipedia, http://en.wikipedia.org/wiki/CUDA
