cyy/courses/assembly/08fall

Computer Organization and Assembly Language Final Project Report

GPU Architecture and Programming: From GPGPU to CUDA



Contents

Abstract

I. Evolution of GPU Architecture
   i. Traditional Graphics Pipeline and GPU Architecture
   ii. Programmable Graphics Pipeline and GPU Evolution
   iii. From Graphics to General Computation: CPU and GPU

II. Evolution of GPU Programming Model
   i. GPGPU: General Purpose Graphic Processing Unit
   ii. The Architecture of Unified Shader GPU
   iii. CUDA: Compute Unified Device Architecture
   iv. The Programming Model of CPU, GPGPU, and CUDA

III. The Programming Model of CUDA
   i. Detailed Architecture of CUDA Programming
   ii. Hardware Implementation
   iii. Application Interface
   iv. Programming Example: Matrix Multiplication

IV. References


Abstract

This report begins with the original purpose of the GPU: rendering computer graphics in real time, a task whose enormous arithmetic demands shaped a processor very different from the CPU. We first trace the evolution of GPU architecture, from the traditional fixed-function graphics pipeline, through the introduction of programmable vertex and pixel shaders, to the observation that the GPU's many parallel units can carry out general computation. We then follow the evolution of the programming model: GPGPU (General Purpose GPU) techniques that disguise general computation as graphics work, the unified shader architecture that removed the split between shader types, and finally CUDA (Compute Unified Device Architecture), NVIDIA's C-based programming model for the GPU. The last part examines the CUDA programming model in detail, covering its thread batching and memory hierarchy, its hardware implementation, its application interface, and a complete matrix multiplication example.

Evolution of GPU Architecture

I. Evolution of GPU Architecture

The GPU (Graphics Processing Unit) began as dedicated hardware with a single purpose: rendering computer graphics in real time. Real-time rendering demands enormous arithmetic throughput, so the GPU's design differs fundamentally from the CPU's. Where a CPU fetches one instruction at a time and applies it to one piece of data, a GPU applies the same instruction simultaneously to many data elements spread across many processing units. This style of computation is what we call parallel computing.

The sections below walk through the steps required to render an image and the GPU architecture built around them.

i. Traditional Graphics Pipeline and GPU Architecture

Rendering one frame requires the pipeline stages below (Figure 1):

1. Application input: the application running on the host sends the geometry data to be drawn, together with the rendering commands, to the GPU.

Application Input

Geometry Processing

Rasterization

Fragment Processing

Hidden surface Removal

Output image

GPU

memory buffer


2. Geometry processing: also called vertex processing, because it operates on the vertices of the geometry. The 3D coordinates of each model's vertices are transformed into the 2D screen coordinates at which they will appear, and lighting is computed per vertex from the surface normals and the light positions, ready for the next stage.

3. Rasterization: determines, in a fixed manner, which pixels each 2D primitive (such as a triangle) covers on screen, turning every primitive into a set of pixel-sized fragments carrying interpolated vertex data.

4. Fragment processing: applies texture to the rasterized fragments, fetching the texture color for each fragment.

5. Pixel processing: uses the depth (z) buffer to compare each incoming fragment against the closest fragment recorded so far for that pixel, discarding hidden surfaces so that only the visible fragments form the final output image.

This gave rise to the early GPU architecture shown in Figure 2.

Figure 2: Kurt Akeley. RealityEngine Graphics. In Proc. SIGGRAPH '93. ACM Press, 1993.

Two points stand out in this early design:

1. Parallelism from the start: the early GPU already consisted of many replicated units (Geometry Engines, Fragment Generators) operating side by side.

2. Fixed-function stages: each stage performed a hard-wired operation. The design brought parallelism, but because the operations were fixed, anything outside the built-in pipeline still had to fall back on the CPU.

The demand for more flexible computation motivated the next generation of GPUs:


ii. Programmable Graphics Pipeline and GPU Evolution

The defining change of the next GPU generation was to replace the fixed-function vertex and pixel processing units with programmable units, called shaders, whose operations the programmer specifies. This change gave the GPU general ALU capability and planted the seed of the General Purpose GPU. The pipeline's stages changed as follows:

1. The vertex stage splits into Vertex Generating (VG) and Vertex Processing (VP). VG assembles the input lines, points, and triangles into vertex records and hands each record to VP. VP, which is programmable, transforms each vertex into its 2D screen position (the model, view, and projection transforms), computes per-vertex lighting, and emits a new record for the next stage.

2. Primitive Generating and Primitive Processing group the processed vertices back into primitives and pass them on.

3. Fragment Generating (FG) performs the rasterization of the previous generation: it samples each primitive into per-pixel records and sends them to Fragment Processing (FP). FP, also programmable, fetches texture data from memory and can repeat the lighting computation of the vertex stage, now per fragment; because every covered pixel gets its own computation, techniques such as Phong shading produce far better lighting.

4. The final pixel operations stage performs hidden surface removal.

[Figure: the programmable pipeline inside the GPU: Application Input → Vertex Generating → Vertex Processing → Primitive Generating → Primitive Processing → Fragment Generating → Fragment Processing → Hidden Surface Removal → Output Image, backed by the GPU memory buffer]


The GPU hardware was organized accordingly. From this generation on, the GPU's architecture contained genuinely programmable compute capability.

iii. From Graphics to General Computation: CPU and GPU

Compared with how a CPU computes, the GPU can take over a general computing task through the following correspondence between computing concepts and graphics concepts:

Table 1: correspondence between CPU computing concepts and GPU pixel-shader concepts

CPU             GPU
Data Array      Texture
Memory Read     Texture Lookup
Loop Body       Shader Program
Memory Write    Render to Frame Buffer

The table shows how the GPU can perform general computation in a way analogous to the CPU: the memory buffer holding texture data plays the role of the data array; a memory read becomes a texture lookup; the body of a loop is written as a shader program in a shading language (OpenGL, Cg, etc.); and the final result is written to the frame buffer, the counterpart of a memory write. This similarity is what gave rise to the General Purpose GPU.


Evolution of GPU Programming Model

II. Evolution of GPU Programming Model

i. General purpose GPU

(1) The idea of GPGPU is to use the GPU's parallel compute units for non-graphics work by disguising the computation as graphics: the input data are uploaded as if they were graphics data, and issuing a rendering pass launches the threads that carry out the computation on the GPU. A GPGPU program uses video memory in the role that DRAM plays for the CPU, and it must be written with graphics-oriented languages and APIs such as OpenGL and Cg.

(2) Program execution model: the figure below shows how data and programs flow between the GPU side and the CPU side in this scheme. glReadPixels(), for example, is the OpenGL call used to copy results from the GPU back into DRAM.

(3) Drawbacks of GPGPU:

(a) Computation must be expressed in a shader language, so a programmer first faces a steep learning curve built on graphics concepts. Programmers outside computer graphics therefore rarely adopted it, and this became GPGPU's biggest obstacle.

(b) In this model the cost of moving data is severe. Data must travel between the GPU's video memory and DRAM, and DRAM is where the CPU reads and writes its data, so every exchange between the GPU and the CPU turns into a bottleneck.

Because of these drawbacks, using the GPU for general data computation remained a niche pursuit: relative to the CPU it offered too little convenience. Yet the potential it demonstrated drove the evolution of both GPU hardware and GPU programming models, and it lit the spark of parallel computing.

ii. Unified Shader GPU:

(1) The main goal of the unified shader design is to raise the utilization of the programmable GPU. When the pipeline's programmable stages all run on the same pool of compute units, work from one stage can fill units another stage leaves idle. Without it, utilization suffers in the following way:

[Figure: GPGPU program flow: the shader on the GPU reads texture memory and writes to the framebuffer and depth buffer; output returns through glReadPixels() to DRAM on the CPU side]


[Figure: with separate vertex shaders and pixel shaders, the shader capacity built into the hardware rarely matches the capacity each frame actually uses]

When successive frames carry different workloads, a fixed partition of shader units causes trouble: the number of units of each kind cannot change, so whenever the mix of vertex and pixel work differs from the hardware's split, some units sit idle and performance is wasted. The remedy is to erase the distinction between shader kinds, letting the idle capacity of one stage serve another and keeping more units busy. The resulting design is the Unified Shader: each shader unit can perform either vertex processing or pixel processing, so any workload can use all of the hardware, as the execution patterns below show:

[Figure: a unified shader pool shifts its units between vertex processing and pixel processing as a frame is geometry-heavy or pixel-heavy]


This raises the utilization of every compute unit and avoids wasting capacity.

(2) Graphics pipeline of the unified shader GPU: all of the programmable stages execute on the unified shaders, while the non-programmable stages still run on their own dedicated units.

(3) The resulting GPU hardware organization:

[Figure: the unified shader GPU pipeline: Application Input → Geometry (Vertex) Processing → Rasterization → Fragment Processing → Hidden Surface Removal → Output Image; the programmable stages all execute on the Unified Shader pool, backed by the GPU memory buffer]


iii. CUDA: Compute Unified Device Architecture

(1) CUDA builds on the unified shader GPU architecture and treats the GPU as a compute device for general data. In the CUDA model the CPU, called the host, remains the main processor, while the GPU plays the role of a coprocessor, called the device. A portion of a program that performs the same data-parallel computation over a large data set, and that needs heavy arithmetic, can be moved from the host to the device for execution.

(2) Compared with the earlier GPGPU approach, CUDA introduces several features:

a) Shared memory: the host and the device each have their own memory, and the device's memory is served by the units on the device itself. This rests on the unified shader hardware: in a unified shader design, the shader units can share fast memory and exchange the intermediate results of their computations, which is precisely what parallel computation needs.

b) Hierarchy of thread groups: threads are organized into threads, thread blocks, and grids of thread blocks; each level maps onto a group of shader units that can share computation, which helps a parallel program scale.

c) An extension language based on C: where GPGPU forced the use of graphics shading languages, a significant barrier, CUDA packages its runtime library functions and extensions inside the C language, lowering the barrier to adoption and widening the circle of programmers who can use it.

iv. The Programming Model of CPU, GPGPU, and CUDA

A CPU program executes in a single thread, one instruction at a time. Every instruction's data must ultimately come from DRAM, but the CPU hides that latency with its cache: prefetching and memory-access policies pull data into the cache ahead of use, so that later execution finds its operands quickly.

[Figure: CPU programming architecture]


A GPGPU program likewise reads its data, from video memory, and writes its results back after computing. Within one pass many instances of the program execute at once; each executing instruction stream has an ALU component and a control unit of its own, so the device behaves like a machine containing many small CPUs. The drawback is that the video memory and the ALUs doing the work are not on the same chip, which costs transfer speed between memory and units.

CUDA combines GPGPU's parallel execution with the benefit a CPU gets from its cache. Many threads run the same program in parallel, and to cut down data traffic the GPU adds a Parallel Data Cache that accelerates access to data. Because the many threads of one program operate on the same data set, one load can serve many threads, and a result produced by one thread can be consumed by others through the cache; in effect the cache lets threads share data without round trips to memory.

[Figure: GPGPU programming architecture]

[Figure: CUDA programming architecture]


The Programming Model of CUDA

III. The Programming Model of CUDA

i. Detailed Architecture of CUDA Programming

a. Thread Batching

i. Thread block: a group of threads that cooperate: they share data placed in shared memory and synchronize their accesses to the same memory locations, so the threads of one block coordinate their execution and raise the reuse of the data they load.

ii. Grid of thread blocks: the number of threads one block may contain is limited, but blocks of identical dimensions executing the same kernel can be assembled into a grid of thread blocks; in this way the total number of threads one kernel launch uses can be very large.

[Figure: the host launches Kernel1 onto Grid1 and Kernel2 onto Grid2 on the device; each grid is a 2D arrangement of blocks, Block(0,0) through Block(2,1), and each block, here Block(1,1), is a 2D arrangement of threads, thread(0,0) through thread(4,2)]


b. Memory model

(1) Shared memory: shared among the threads of one block; access is very fast.

(2) Global memory: readable and writable by the threads of every block in a grid; the host can also access this memory around kernel launches.

(3) Constant memory and texture memory: readable by the threads of every block in a grid, but read-only on the device; the host can both read and write them.

ii. Hardware Implementation: the device is built as a set of multiprocessors, each of which follows the SIMD (Single Instruction, Multiple Data) design: within a clock cycle, every processor of the multiprocessor executes the same instruction, but each operates on different data.

[Figure: CUDA memory model: within a grid, each block (Block(0,0), Block(1,0)) has its own shared memory; each thread (Thread(0,0), Thread(1,0)) has its own registers and local memory; all blocks share the grid-wide global memory, constant memory, and texture memory]


iii. Application Interface: the interface extends the C language, so the learning curve for programmers who already know C is minimal.

a. Function type qualifiers: declare whether a function executes on the host or on the device, and whether it is a kernel; the qualifiers written in front of the declaration are __host__, __device__, and __global__. Code compiled for the device (including kernels) faces several restrictions: recursion is not supported, static variables cannot be declared inside the function, and a variable number of arguments is not allowed. A declaration looks like:

__device__ void func() {
    /* function body */
}

b. Variable type qualifiers: declare in which memory space on the device a variable resides; the three qualifiers are __device__, __constant__, and __shared__. For example:

extern __shared__ float shared[];

declares an array shared[] residing in shared memory, whose size is set at kernel launch time.

c. A function declared __global__ (a kernel) must be called with an execution configuration. A kernel declared as

__global__ void Func(float* parameter);

is launched with

Func<<< Dg, Db, Ns >>>(parameter);

where Dg gives the dimensions of the grid of thread blocks, Db the dimensions of each thread block, and Ns the number of bytes of shared memory dynamically allocated per block for this call.

d. Built-in variables supply each thread with the grid dimension (gridDim), its block index (blockIdx), the block dimension (blockDim), and its thread index (threadIdx).

iv. Programming Example: Matrix Multiplication

// Computes C = A * B
// hA is the height of A
// wA is the width of A (and the height of B)
// wB is the width of B
// BLOCK_SIZE is assumed to be defined, with hA, wA, wB multiples of it

__global__ void Muld(float* A, float* B, int wA, int wB, float* C);

void Mul(const float* A, const float* B, int hA, int wA, int wB,
         float* C)
{
    // Load the data of matrices A and B onto the device
    float* Ad;
    int size = hA * wA * sizeof(float);
    cudaMalloc((void**)&Ad, size);
    cudaMemcpy(Ad, A, size, cudaMemcpyHostToDevice);
    float* Bd;
    size = wA * wB * sizeof(float);
    cudaMalloc((void**)&Bd, size);
    cudaMemcpy(Bd, B, size, cudaMemcpyHostToDevice);

    // Allocate space on the device for matrix C
    float* Cd;
    size = hA * wB * sizeof(float);
    cudaMalloc((void**)&Cd, size);

    // Compute the block dimension and the grid dimension
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(wB / dimBlock.x, hA / dimBlock.y);

    // Launch the device computation
    Muld<<<dimGrid, dimBlock>>>(Ad, Bd, wA, wB, Cd);

    // Copy the computed C from the device back to the host
    cudaMemcpy(C, Cd, size, cudaMemcpyDeviceToHost);

    // Free the memory used on the device
    cudaFree(Ad);
    cudaFree(Bd);
    cudaFree(Cd);
}

__global__ void Muld(float* A, float* B, int wA, int wB, float* C)
{
    // Block index: which sub-matrix of C this block computes
    int bx = blockIdx.x;
    int by = blockIdx.y;

    // Thread index: which element of the sub-matrix this thread computes
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Index of the first sub-matrix of A processed by this block
    int aBegin = wA * BLOCK_SIZE * by;
    // Index of the last sub-matrix of A processed by this block
    int aEnd = aBegin + wA - 1;
    // Step size used to iterate through the sub-matrices of A
    int aStep = BLOCK_SIZE;

    // Index of the first sub-matrix of B processed by this block
    int bBegin = BLOCK_SIZE * bx;
    // Step size used to iterate through the sub-matrices of B
    int bStep = BLOCK_SIZE * wB;

    // The element of the block sub-matrix computed by this thread
    float Csub = 0;

    // Loop over all the sub-matrices of A and B required to
    // compute the block sub-matrix
    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep,
         b += bStep) {

        // Shared memory for the sub-matrix of A
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        // Shared memory for the sub-matrix of B
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        // Load the matrices from global memory into shared memory;
        // each thread loads one element of each matrix
        As[ty][tx] = A[a + wA * ty + tx];
        Bs[ty][tx] = B[b + wB * ty + tx];

        // Synchronize to make sure both sub-matrices are loaded
        __syncthreads();

        // Multiply the two sub-matrices together;
        // each thread computes one element of the block sub-matrix
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += As[ty][k] * Bs[k][tx];

        // Synchronize so every thread finishes its computation before
        // the next iteration overwrites the shared sub-matrices
        __syncthreads();
    }

    // Write the block sub-matrix back to global memory;
    // each thread writes one element
    int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
    C[c + wB * ty + tx] = Csub;
}


References

IV. References

1. How GPUs Work, David Luebke (NVIDIA Research) and Greg Humphreys (University of Virginia), IEEE Computer, 2007.

2. GPUs: A Closer Look, Kayvon Fatahalian and Mike Houston, ACM Queue, 2008.

3. NVIDIA CUDA Compute Unified Device Architecture Programming Guide, Version 1.1, NVIDIA Corporation, 2007.

4. Scalable Parallel Programming with CUDA, John Nickolls, Ian Buck, and Michael Garland (NVIDIA), and Kevin Skadron (University of Virginia), ACM Queue, 2008.

5. A Survey of General-Purpose Computation on Graphics Hardware, John D. Owens, David Luebke, Naga Govindaraju, Mark Harris, Jens Krüger, Aaron E. Lefohn, and Timothy J. Purcell, Computer Graphics Forum, 26(1):80–113, March 2007.

6. GPU Gems 2, Chapter 30: The GeForce 6 Series GPU Architecture, Emmett Kilgariff and Randima Fernando, NVIDIA Corporation, 2005.

7. GPU Computing with NVIDIA CUDA, Ian Buck, NVIDIA Corporation, SIGGRAPH 2007 lecture.

8. GPU Architecture Overview, John Owens, UC Davis, SIGGRAPH 2007 GPGPU lecture.

9. CUDA, Wikipedia, http://en.wikipedia.org/wiki/CUDA