ECE 8823A
GPU Architectures
Module 4: Memory Model and Locality
© David Kirk/NVIDIA and Wen-mei Hwu, 2007-2012 ECE408/CS483/ECE498al, University of Illinois, Urbana-Champaign
Reading Assignment
• Kirk and Hwu, “Programming Massively Parallel Processors: A Hands-on Approach,” Chapter 5
• CUDA Programming Guide: http://docs.nvidia.com/cuda/cuda-c-programming-guide/
Objective
• To understand the different memory spaces in the CUDA programming model
• To learn to efficiently use the important levels of the CUDA memory hierarchy
– Registers, shared memory, global memory
– Tiled algorithms and barrier synchronization
The Von-Neumann Model
[Figure: block diagram of the Von-Neumann model: Memory, Processing Unit (ALU, Reg File), Control Unit (PC, IR), I/O]
Consequences of limited global memory bandwidth
Going back to the program
• Every instruction needs to be fetched from memory, decoded, then executed.
– The decode stage typically accesses the register file
• Instructions come in three flavors: Operate, Data Transfer, and Program Control Flow.
• An example instruction cycle is the following:
Fetch | Decode | Execute | Memory
Operate Instructions
• Example of an operate instruction:
ADD R1, R2, R3
• Instruction cycle for an operate instruction:
Fetch | Decode | Execute | Memory
Register Space
Memory Access Instructions
• Examples of memory access instructions:
LDR R1, R2, offset
STR R1, R2, offset
• Instruction cycle for a memory access instruction:
Fetch | Decode | Execute | Memory
Global Address Space
Registers vs Memory
• Registers are “free”
– No additional memory access instruction
– Very fast to use; however, there are very few of them
– More energy efficient
• Memory is expensive (slow), but very large
• Additional levels
– Cache/scratch pad
[Figure: Von-Neumann block diagrams (Memory, Control Unit, I/O, ALU, Reg File, PC, IR)]
Programmer View of CUDA Memories
• Each thread can:
– Read/write per-thread registers (~1 cycle)
– Read/write per-block shared memory (~5 cycles)
– Read/write per-grid global memory (~500 cycles)
– Read only per-grid constant memory (~5 cycles with caching)
[Figure: CUDA memory model. The Grid contains Block (0, 0) and Block (1, 0); each block has its own Shared Memory, and each of its threads (Thread (0, 0), Thread (1, 0)) has its own Registers. Global Memory and Constant Memory belong to the whole grid and connect to the Host.]
Relationship to the programming model?
Automatic Variables
[Figure: CUDA memory model diagram with the labels “Automatic variables” and “Automatic array variables”]
• Private version created for each thread
Matrix Multiplication Revisited
__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
{
  // Calculate the row index of the d_P element and d_M
  int Row = blockIdx.y*blockDim.y+threadIdx.y;
  // Calculate the column index of d_P and d_N
  int Col = blockIdx.x*blockDim.x+threadIdx.x;
  if ((Row < Width) && (Col < Width)) {
    float Pvalue = 0;
    // each thread computes one element of the block sub-matrix
    for (int k = 0; k < Width; ++k)
      Pvalue += d_M[Row*Width+k] * d_N[k*Width+Col];
    d_P[Row*Width+Col] = Pvalue;
  }
}
[Figure annotations: “private to each thread”; Register File (128 KB), L1 (16 KB), Shared Memory (48 KB)]
Shared Memory
[Figure: CUDA memory model diagram with the label “Shared by threads in a block”]
• Private version of a shared variable created for each thread block
• Parallel access
Shared Memory in CUDA
• A special type of memory whose contents are explicitly declared and used in the source code
– Located in the processor
– Accessed at much higher speed (in both latency and throughput)
– Still accessed by memory access instructions
– Commonly referred to as scratchpad memory in computer architecture
– Shared by threads in a block
• Algorithmic optimizations to use shared memory
Global Memory
[Figure: CUDA memory model diagram]
• Visible to all threads
• Persistent
• Means for threads to collaborate across thread blocks
• Note: no synchronization between thread blocks
• Availability of atomics and fences (used in some circumstances)
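The last two bullets can be illustrated with a minimal sketch: threads in different blocks cannot wait for each other inside a kernel, but they can still combine results through an atomic update to a location in global memory. The kernel name count_positive, the host helper count_positive_on_device, and the launch configuration are assumptions made for this illustration; error checking is omitted for brevity.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Each thread inspects one element. Threads from any block may update the
// same global counter, so the update has to be atomic.
__global__ void count_positive(const int *data, int n, int *counter) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] > 0) {
        atomicAdd(counter, 1);
    }
}

// Hypothetical host helper: copies the input to the device, launches the
// kernel over several blocks, and returns the final counter value.
int count_positive_on_device(const int *h_data, int n) {
    int *d_data = nullptr, *d_counter = nullptr;
    int h_counter = 0;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMalloc(&d_counter, sizeof(int));
    cudaMemcpy(d_data, h_data, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_counter, &h_counter, sizeof(int), cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    count_positive<<<blocks, threads>>>(d_data, n, d_counter);

    cudaMemcpy(&h_counter, d_counter, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_counter);
    return h_counter;
}

int main() {
    const int n = 1000;
    int h_data[n];
    for (int i = 0; i < n; ++i) h_data[i] = (i % 2 == 0) ? 1 : -1;  // 500 positives
    int count = count_positive_on_device(h_data, n);
    printf("positive elements: %d\n", count);
    assert(count == 500);
    return 0;
}
```

Because global memory is persistent, the counter keeps its value after the kernel finishes, which is what lets the host read it back.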
Constant Memory
[Figure: CUDA memory model diagram]
• Visible to all threads
• Persistent
• Declaration of constants across all threads
• Cached for faster access
Use of Constants: 1D Convolution
[Figure: input array N[] (elements 0 to 15, with ghost nodes beyond both ends), mask M[] (elements 0 to 4, a constant array), and output array P[] (elements 0 to 15)]
__global__ void convolution_1D_basic_kernel(float *N, float *M, float *P, int Mask_Width, int Width) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  float Pvalue = 0;
  int N_start = i - (Mask_Width/2);
  // Walk over the mask; skip the ghost nodes outside N[]
  for (int j = 0; j < Mask_Width; j++) {
    if (N_start + j >= 0 && N_start + j < Width) {
      Pvalue += N[N_start + j] * M[j];
    }
  }
  if (i < Width) P[i] = Pvalue;
}
[Figure annotations: “Unique variable per thread”; “Constant across threads”]
Use of Constant Memory
• Consider
– Size of M[] is typically small
– M[] is constant
– All threads access the same elements at the same time
[Figure: N[] (elements 0 to 15, with ghost nodes), M[] (elements 0 to 4, constant array), P[] (elements 0 to 15)]
Management
[Figure: CUDA memory model diagram]
• Constant memory must be explicitly managed
– Allocation: constants are treated as global variables
– Copying: copied into global memory and can be cached into a separate, dedicated cache
• Kernel functions access constants as global variables
– Cached in the constant cache
– Broadcast capability to all threads in a warp
Modified: 1D Convolution
[Figure: N[] (elements 0 to 15, with ghost nodes), M[] (elements 0 to 4, constant array), P[] (elements 0 to 15)]
__global__ void convolution_1D_basic_kernel(float *N, float *P, int Mask_Width, int Width) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  float Pvalue = 0;
  int N_start = i - (Mask_Width/2);
  // M[] is not a parameter; it is accessed as a global (constant) array
  for (int j = 0; j < Mask_Width; j++) {
    if (N_start + j >= 0 && N_start + j < Width) {
      Pvalue += N[N_start + j] * M[j];
    }
  }
  if (i < Width) P[i] = Pvalue;
}
• M[] not passed in as a parameter
• Accessed as a global array
• Transfer using a different API function: cudaMemcpyToSymbol(dest, src, size)
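A minimal end-to-end sketch of this pattern follows, assuming a mask of at most MAX_MASK_WIDTH elements. MAX_MASK_WIDTH, the kernel name convolution_1D_const_kernel, and the host helper run_convolution are names introduced only for this illustration; error handling is reduced to a single check.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

#define MAX_MASK_WIDTH 5  // assumed upper bound on the mask size

// The mask lives in constant memory and is declared at file scope.
__constant__ float M[MAX_MASK_WIDTH];

__global__ void convolution_1D_const_kernel(const float *N, float *P,
                                            int Mask_Width, int Width) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= Width) return;
    float Pvalue = 0.0f;
    int N_start = i - (Mask_Width / 2);
    for (int j = 0; j < Mask_Width; j++) {
        // Ghost nodes (indices outside N[]) contribute nothing.
        if (N_start + j >= 0 && N_start + j < Width) {
            Pvalue += N[N_start + j] * M[j];
        }
    }
    P[i] = Pvalue;
}

// Hypothetical host helper: Mask_Width must not exceed MAX_MASK_WIDTH.
bool run_convolution(const float *h_N, const float *h_M, float *h_P,
                     int Mask_Width, int Width) {
    float *d_N = nullptr, *d_P = nullptr;
    size_t bytes = Width * sizeof(float);
    cudaMalloc(&d_N, bytes);
    cudaMalloc(&d_P, bytes);
    cudaMemcpy(d_N, h_N, bytes, cudaMemcpyHostToDevice);
    // The mask is copied with cudaMemcpyToSymbol rather than cudaMemcpy.
    cudaMemcpyToSymbol(M, h_M, Mask_Width * sizeof(float));

    int threads = 256;
    int blocks = (Width + threads - 1) / threads;
    convolution_1D_const_kernel<<<blocks, threads>>>(d_N, d_P, Mask_Width, Width);

    cudaError_t err = cudaMemcpy(h_P, d_P, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_N);
    cudaFree(d_P);
    return err == cudaSuccess;
}

int main() {
    const int Width = 16, Mask_Width = 5;
    float h_N[Width], h_P[Width];
    float h_M[Mask_Width] = {1.0f, 1.0f, 1.0f, 1.0f, 1.0f};
    for (int i = 0; i < Width; ++i) h_N[i] = 1.0f;

    assert(run_convolution(h_N, h_M, h_P, Mask_Width, Width));
    // With all-ones input and mask, each output counts the non-ghost
    // neighbours: 3 at the edges, 4 next to them, 5 in the interior.
    assert(h_P[0] == 3.0f && h_P[1] == 4.0f && h_P[2] == 5.0f);
    for (int i = 0; i < Width; ++i) printf("%g ", h_P[i]);
    printf("\n");
    return 0;
}
```

Every thread in a warp reads the same M[j] in the same loop iteration, which is the access pattern the constant cache broadcast is suited to.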
CUDA Variable Type Qualifiers
| Variable declaration | Memory | Scope | Lifetime |
|---|---|---|---|
| int LocalVar; | register | thread | thread |
| __device__ __shared__ int SharedVar; | shared | block | block |
| __device__ int GlobalVar; | global | grid | application |
| __device__ __constant__ int ConstantVar; | constant | grid | application |
• __device__ is optional when used with __shared__ or __constant__
• Automatic variables without any qualifier reside in a register
– Except per-thread arrays, which reside in local memory (physically located in global memory)
• Note programmer-controlled placement
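The four declarations in the table can be exercised together in a short sketch. The variable names follow the table; the kernel qualifier_demo, the host helper run_qualifier_demo, and the values used are assumptions for illustration only, and error checking is omitted.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

__device__ int GlobalVar;        // global memory, grid scope, application lifetime
__constant__ int ConstantVar;    // constant memory, grid scope, application lifetime

__global__ void qualifier_demo(int *out) {
    int LocalVar = threadIdx.x;  // automatic scalar: normally a register, private to the thread
    __shared__ int SharedVar;    // shared memory: one copy per thread block

    if (threadIdx.x == 0) {
        SharedVar = GlobalVar + ConstantVar + blockIdx.x;
    }
    __syncthreads();             // make SharedVar visible to the whole block

    out[blockIdx.x * blockDim.x + threadIdx.x] = SharedVar + LocalVar;
}

// Hypothetical host helper: fills h_out with blocks * threads results.
void run_qualifier_demo(int *h_out, int blocks, int threads) {
    int h_global = 100, h_constant = 10;
    int *d_out = nullptr;
    size_t bytes = blocks * threads * sizeof(int);

    // File-scope device variables are set from the host by symbol.
    cudaMemcpyToSymbol(GlobalVar, &h_global, sizeof(int));
    cudaMemcpyToSymbol(ConstantVar, &h_constant, sizeof(int));
    cudaMalloc(&d_out, bytes);

    qualifier_demo<<<blocks, threads>>>(d_out);
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_out);
}

int main() {
    const int blocks = 2, threads = 4;
    int h_out[blocks * threads];
    run_qualifier_demo(h_out, blocks, threads);
    for (int b = 0; b < blocks; ++b)
        for (int t = 0; t < threads; ++t)
            assert(h_out[b * threads + t] == 110 + b + t);
    printf("all %d values match\n", blocks * threads);
    return 0;
}
```

Each block computes its own SharedVar, so the results differ by block even though GlobalVar and ConstantVar are the same for the whole grid.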
Refining the Memory Hierarchy
• For best performance, the index into the constant cache should not be a function of the thread ID!
• Separate read-only cache that the compiler uses, distinct from the constant cache
– Accessed with an additional qualifier (__restrict__)
[Figure: CUDA memory model diagram]
Where to Declare Variables?
[Figure: decision diagram with a “yes” branch and a “no” branch]
• yes: global, constant
• no: register (automatic), shared, local
__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
{
  __shared__ float ds_M[TILE_WIDTH][TILE_WIDTH];
  __shared__ float ds_N[TILE_WIDTH][TILE_WIDTH];
A Common Programming Strategy
• Global memory resides in device memory (DRAM): slow access
• So, a profitable way of performing computation on the device is to tile input data to take advantage of fast shared memory:
– Partition data into subsets that fit into shared memory
– Handle each data subset with one thread block by:
  • Loading the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism
  • Performing the computation on the subset from shared memory; each thread can efficiently make multiple passes over any data element
  • Copying results from shared memory to global memory
Matrix-Matrix Multiplication using Shared Memory
Base Matrix Multiplication Kernel
__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
{
  // Calculate the row index of the d_P element and d_M
  int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
  // Calculate the column index of d_P and d_N
  int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;

  float Pvalue = 0;
  // each thread computes one element of the block sub-matrix
  for (int k = 0; k < Width; ++k)
    Pvalue += d_M[Row*Width+k] * d_N[k*Width+Col];

  d_P[Row*Width+Col] = Pvalue;
}
How about performance on Fermi?
[Figure: CUDA memory model diagram]
• All threads access global memory for their input matrix elements
– Two memory accesses (8 bytes) per floating-point multiply-add
– 4 B/s of memory bandwidth per FLOPS
– 4 * 1,000 = 4,000 GB/s required to achieve the peak FLOPS rating
– 150 GB/s limits the code at 37.5 GFLOPS
• The actual code runs at about 25 GFLOPS
• Need to drastically cut down memory accesses to get closer to the peak 1,000 GFLOPS
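The arithmetic behind these bullets can be written out as follows, assuming 4-byte floats and counting one multiply-add as two floating-point operations (which is what the 4 B per FLOP figure implies):

$$
\begin{aligned}
\frac{2 \times 4\ \text{B}}{2\ \text{FLOP}} &= 4\ \text{B/FLOP} \\
1{,}000\ \text{GFLOPS} \times 4\ \text{B/FLOP} &= 4{,}000\ \text{GB/s} \\
\frac{150\ \text{GB/s}}{4\ \text{B/FLOP}} &= 37.5\ \text{GFLOPS}
\end{aligned}
$$

The last line is an upper bound set by memory bandwidth alone; the measured rate of about 25 GFLOPS is below it.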
Shared Memory Blocking Basic Idea
[Figure: two panels. In the first, Thread 1, Thread 2, … each access the data (“in”) directly in Global Memory. In the second, the data in Global Memory is first placed in On-chip Memory, and Thread 1, Thread 2, … access it there.]
Basic Concept of Blocking/Tiling
• In a congested traffic system, a significant reduction in the number of vehicles can greatly improve the delay seen by all vehicles
– Carpooling for commuters
– Blocking/tiling for global memory accesses
  • drivers = threads
  • cars = data
Some computations are more challenging to block/tile than others.
• Some carpools may be easier than others
– More efficient if neighbors are also classmates or co-workers
– Some vehicles may be more suitable for carpooling
• Similar variations exist in blocking/tiling
Carpools need synchronization.
• Good – when people have similar schedules
• Bad – when people have very different schedules
[Figure: two timelines for Worker A and Worker B. In the first, both go through sleep, work, and dinner. In the second, the schedules differ (labels: sleep, work, dinner, party).]
Same with Blocking/Tiling
• Good – when threads have similar access timing
• Bad – when threads have very different timing
[Figure: timelines of Thread 1 and Thread 2 over time for the two cases]
Outline of Technique
• Identify a block/tile of global memory content that is accessed by multiple threads
• Load the block/tile from global memory into on-chip memory
• Have the multiple threads access their data from the on-chip memory
• Move on to the next block/tile
Idea: Use Shared Memory to reuse global memory data
• Each input element is read by WIDTH threads.
• Load each element into shared memory and have several threads use the local version to reduce the memory bandwidth
– Tiled algorithms
[Figure: WIDTH × WIDTH matrices M, N, and P, with thread indices tx and ty]
Work for Block (0,0) in a TILE_WIDTH = 2 Configuration
Col = 0 * 2 + threadIdx.x
Row = 0 * 2 + threadIdx.y
[Figure: 4 × 4 matrices M, N, and P (elements M0,0 to M3,3, N0,0 to N3,3, P0,0 to P3,3). Block (0,0) covers Row = 0, 1 and Col = 0, 1. In the index expressions above, the 0 is blockIdx.x (blockIdx.y) and the 2 is blockDim.x (blockDim.y).]
Tiled Multiply
• Break up the execution of the kernel into phases so that the data accesses in each phase are focused on one subset (tile) of Md and Nd
[Figure: WIDTH × WIDTH matrices Md, Nd, and Pd divided into TILE_WIDTH × TILE_WIDTH tiles; block indices bx, by and thread indices tx, ty (0 to TILE_WIDTH-1) select the Pdsub tile and the element within it]
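A rough count of global memory loads per thread shows why the phases help. This is a sketch under two assumptions that the slide does not state explicitly: the tiles are square with side TILE_WIDTH, and WIDTH is a multiple of TILE_WIDTH. Each thread loads one Md element and one Nd element per phase.

$$
\begin{aligned}
\text{loads per thread, untiled} &= 2 \cdot \text{WIDTH} \\
\text{loads per thread, tiled} &= 2 \cdot \frac{\text{WIDTH}}{\text{TILE\_WIDTH}} \\
\frac{\text{untiled}}{\text{tiled}} &= \text{TILE\_WIDTH}
\end{aligned}
$$

Under these assumptions, global memory traffic falls by a factor of TILE_WIDTH, since each loaded element is reused by TILE_WIDTH threads of the block from shared memory.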
Loading a Tile
• All threads in a block participate
– Each thread loads one Md element and one Nd element in the basic tiled code
• Assign the loaded element to each thread such that the accesses within each warp are coalesced (more later).
Work for Block (0,0)
[Figure sequence over five consecutive slides: the 4 × 4 matrices M, N, and P, with a tile of M (M0,0, M0,1, M1,0, M1,1) and a tile of N (N0,0, N0,1, N1,0, N1,1) copied into shared memory (SM)]
Barrier Synchronization
• An API function call in CUDA
– __syncthreads()
• All threads in the same block must reach the __syncthreads() before any can move on
• Best used to coordinate tiled algorithms
– To ensure that all elements of a tile are loaded
– To ensure that all elements of a tile are consumed
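A small sketch of the first use (all elements loaded before any is read) is a block-wide array reversal through shared memory. This example is not from the slides; BLOCK_SIZE, the kernel reverse_in_block, and the host helper reverse_on_device are names assumed for illustration, and the kernel is launched with exactly one block of BLOCK_SIZE threads.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK_SIZE 64  // assumed block size; the array has exactly this many elements

__global__ void reverse_in_block(int *d) {
    __shared__ int s[BLOCK_SIZE];
    int t = threadIdx.x;

    s[t] = d[t];        // each thread loads one element into shared memory
    __syncthreads();    // no thread reads s[] until every element is loaded

    d[t] = s[BLOCK_SIZE - 1 - t];  // read an element written by another thread
}

// Hypothetical host helper: reverses h[0 .. BLOCK_SIZE-1] in place on the device.
void reverse_on_device(int *h) {
    int *d = nullptr;
    size_t bytes = BLOCK_SIZE * sizeof(int);
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    reverse_in_block<<<1, BLOCK_SIZE>>>(d);
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
}

int main() {
    int h[BLOCK_SIZE];
    for (int i = 0; i < BLOCK_SIZE; ++i) h[i] = i;
    reverse_on_device(h);
    for (int i = 0; i < BLOCK_SIZE; ++i) assert(h[i] == BLOCK_SIZE - 1 - i);
    printf("first = %d, last = %d\n", h[0], h[BLOCK_SIZE - 1]);
    return 0;
}
```

Without the barrier, a thread could read s[BLOCK_SIZE - 1 - t] before the thread responsible for that slot has written it.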
Barriers and Thread Behaviors
[Figure: timelines of Thread 0, Thread 1, …, Thread N-1 over Time]
Loading an Input Tile
[Figure: Md, Nd, and Pd divided into TILE_WIDTH × TILE_WIDTH tiles, indexed by bx, by, tx, and ty; the labels m and k mark positions along Md and Nd; Pdsub is the output tile]
Row = by * TILE_WIDTH + ty
Accessing tile 0 in 2D indexing:
M[Row][tx]
N[ty][Col]
Accessing tile 1 in 2D indexing:
M[Row][1*TILE_WIDTH+tx]
N[1*TILE_WIDTH+ty][Col]
©Wen-mei W. Hwu and David Kirk/NVIDIA, Urbana, August 13-17, 2012
Loading Input Tile m
[Figure: matrices d_M, d_N, and d_P, each WIDTH x WIDTH, divided into TILE_WIDTH x TILE_WIDTH tiles. Input tile m starts at offset m*TILE_WIDTH in d_M and in d_N; Row and Col locate the output element inside Pdsub.]
However, M and N are dynamically allocated and can only use 1D indexing:
M[Row][m*TILE_WIDTH+tx] -> M[Row*Width + m*TILE_WIDTH + tx]
N[m*TILE_WIDTH+ty][Col] -> N[(m*TILE_WIDTH+ty) * Width + Col]
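The 2D-to-1D mapping for input tile m can be checked on the host. The following is a minimal sketch in plain C++ (no GPU needed), assuming row-major storage of a Width x Width matrix. The helper names indexM and indexN are hypothetical and do not appear in the slides.

```cpp
#include <cassert>
#include <cstddef>

// Linear offset of M[Row][m*TILE_WIDTH + tx] in a row-major Width x Width array.
// (hypothetical helper, for illustration only)
inline std::size_t indexM(int Row, int m, int tx, int TILE_WIDTH, int Width) {
    assert(Row < Width && m * TILE_WIDTH + tx < Width);  // stay inside the matrix
    return static_cast<std::size_t>(Row) * Width + (m * TILE_WIDTH + tx);
}

// Linear offset of N[m*TILE_WIDTH + ty][Col] in a row-major Width x Width array.
// (hypothetical helper, for illustration only)
inline std::size_t indexN(int Col, int m, int ty, int TILE_WIDTH, int Width) {
    assert(Col < Width && m * TILE_WIDTH + ty < Width);  // stay inside the matrix
    return static_cast<std::size_t>(m * TILE_WIDTH + ty) * Width + Col;
}
```

For example, with Width = 8 and TILE_WIDTH = 4, indexM(5, 1, 3, 4, 8) is 5*8 + 4 + 3 = 47 and indexN(6, 1, 2, 4, 8) is (4+2)*8 + 6 = 54.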
Tiled Matrix Multiplication Kernel
#define TILE_WIDTH 16  // assumed value; blockDim.x and blockDim.y must both equal TILE_WIDTH

__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
{
  __shared__ float ds_M[TILE_WIDTH][TILE_WIDTH];
  __shared__ float ds_N[TILE_WIDTH][TILE_WIDTH];

  int bx = blockIdx.x;  int by = blockIdx.y;
  int tx = threadIdx.x; int ty = threadIdx.y;

  // Identify the row and column of the d_P element to work on
  int Row = by * TILE_WIDTH + ty;
  int Col = bx * TILE_WIDTH + tx;
  float Pvalue = 0;

  // Loop over the d_M and d_N tiles required to compute the d_P element
  // (the loop bound assumes Width is a multiple of TILE_WIDTH)
  for (int m = 0; m < Width/TILE_WIDTH; ++m) {
    // Collaborative loading of d_M and d_N tiles into shared memory
    ds_M[ty][tx] = d_M[Row*Width + m*TILE_WIDTH + tx];
    ds_N[ty][tx] = d_N[(m*TILE_WIDTH + ty)*Width + Col];
    __syncthreads();  // the whole tile must be loaded before anyone reads it

    for (int k = 0; k < TILE_WIDTH; ++k)
      Pvalue += ds_M[ty][k] * ds_N[k][tx];
    __syncthreads();  // everyone must finish with the tile before it is overwritten
  }
  d_P[Row*Width + Col] = Pvalue;
}
© David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois, 2007-2011
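The two-phase structure of the tiled kernel (load a tile, barrier, compute, barrier) can be emulated sequentially on the host to confirm that it produces the same result as the direct triple loop. This is only a sketch in plain C++: the function names naiveMatMulHost and tiledMatMulHost are hypothetical, the "threads" of a block are ordinary loop iterations, and finishing a loop nest stands in for __syncthreads(). It says nothing about GPU performance.

```cpp
#include <cassert>
#include <vector>

// Reference result: each output element reads its inputs directly.
std::vector<float> naiveMatMulHost(const std::vector<float>& M,
                                   const std::vector<float>& N, int Width) {
    std::vector<float> P(Width * Width, 0.0f);
    for (int Row = 0; Row < Width; ++Row)
        for (int Col = 0; Col < Width; ++Col)
            for (int k = 0; k < Width; ++k)
                P[Row * Width + Col] += M[Row * Width + k] * N[k * Width + Col];
    return P;
}

// Sequential emulation of the tiled kernel (hypothetical helper).
std::vector<float> tiledMatMulHost(const std::vector<float>& M,
                                   const std::vector<float>& N,
                                   int Width, int TILE) {
    assert(Width % TILE == 0);  // same assumption as the kernel's loop bound
    std::vector<float> P(Width * Width, 0.0f);
    std::vector<float> ds_M(TILE * TILE), ds_N(TILE * TILE);  // stand-ins for shared memory
    for (int by = 0; by < Width / TILE; ++by) {
        for (int bx = 0; bx < Width / TILE; ++bx) {
            std::vector<float> Pvalue(TILE * TILE, 0.0f);  // one accumulator per "thread"
            for (int m = 0; m < Width / TILE; ++m) {
                // Load phase: each "thread" (ty, tx) copies one element of each input tile.
                for (int ty = 0; ty < TILE; ++ty)
                    for (int tx = 0; tx < TILE; ++tx) {
                        int Row = by * TILE + ty, Col = bx * TILE + tx;
                        ds_M[ty * TILE + tx] = M[Row * Width + m * TILE + tx];
                        ds_N[ty * TILE + tx] = N[(m * TILE + ty) * Width + Col];
                    }
                // End of the load loops plays the role of the first __syncthreads().
                // Compute phase: each "thread" consumes a row of ds_M and a column of ds_N.
                for (int ty = 0; ty < TILE; ++ty)
                    for (int tx = 0; tx < TILE; ++tx)
                        for (int k = 0; k < TILE; ++k)
                            Pvalue[ty * TILE + tx] += ds_M[ty * TILE + k] * ds_N[k * TILE + tx];
                // End of the compute loops plays the role of the second __syncthreads().
            }
            for (int ty = 0; ty < TILE; ++ty)
                for (int tx = 0; tx < TILE; ++tx)
                    P[(by * TILE + ty) * Width + bx * TILE + tx] = Pvalue[ty * TILE + tx];
        }
    }
    return P;
}
```

Calling tiledMatMulHost(M, N, 4, 2) on small integer-valued inputs should return exactly the same vector as naiveMatMulHost(M, N, 4), since both accumulate the products for each element in the same order of k.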
Compare with the Base Kernel
__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
{
  // Calculate the row index of the d_P element and d_M
  int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
  // Calculate the column index of d_P and d_N
  int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;

  float Pvalue = 0;
  // Each thread computes one element of the block sub-matrix
  for (int k = 0; k < Width; ++k)
    Pvalue += d_M[Row*Width+k] * d_N[k*Width+Col];

  d_P[Row*Width+Col] = Pvalue;
}
First-order Size Considerations
• Each thread block should have many threads
  – TILE_WIDTH of 16 gives 16*16 = 256 threads
  – TILE_WIDTH of 32 gives 32*32 = 1024 threads
• For 16, in each tile phase a block performs 2*256 = 512 float loads from global memory for 256 * (2*16) = 8,192 mul/add operations (16 operations per load).
• For 32, in each tile phase a block performs 2*1024 = 2,048 float loads from global memory for 1024 * (2*32) = 65,536 mul/add operations (32 operations per load).
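The per-phase counts above follow directly from the kernel: each thread loads two floats and executes TILE_WIDTH multiplies and TILE_WIDTH adds. A small host-side sketch of that arithmetic, in plain C++; the struct and function names are hypothetical.

```cpp
#include <cassert>

// Per-block cost of one tile phase of the tiled kernel (hypothetical type).
struct TilePhaseCost {
    int threads;      // TILE_WIDTH * TILE_WIDTH
    int globalLoads;  // 2 float loads per thread
    int mulAddOps;    // TILE_WIDTH multiplies + TILE_WIDTH adds per thread
    int opsPerLoad;   // arithmetic operations per global load
};

inline TilePhaseCost tilePhaseCost(int tileWidth) {
    assert(tileWidth > 0);
    TilePhaseCost c;
    c.threads = tileWidth * tileWidth;
    c.globalLoads = 2 * c.threads;
    c.mulAddOps = c.threads * (2 * tileWidth);
    c.opsPerLoad = c.mulAddOps / c.globalLoads;  // simplifies to tileWidth
    return c;
}
```

tilePhaseCost(16) gives 256 threads, 512 loads, and 8,192 operations; tilePhaseCost(32) gives 1,024 threads, 2,048 loads, and 65,536 operations. The operations-per-load ratio equals TILE_WIDTH, which is why a larger tile reduces pressure on global memory bandwidth.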
Shared Memory and Threading
• Each SM in Fermi has 16KB or 48KB shared memory*
  – Shared memory size is implementation dependent!
  – For TILE_WIDTH = 16, each thread block uses 2*256*4B = 2KB of shared memory.
  – Can potentially have up to 8 thread blocks actively executing
    • This allows up to 8*512 = 4,096 pending loads. (2 per thread, 256 threads per block)
  – The next TILE_WIDTH, 32, would lead to 2*32*32*4B = 8KB shared memory usage per thread block, allowing 2 (with 16KB) or 6 (with 48KB) thread blocks active at the same time, as far as shared memory is concerned
• Using 16x16 tiling, we reduce the accesses to the global memory by a factor of 16
  – The 86.4GB/s bandwidth can now support (86.4/4)*16 = 345.6 GFLOPS!
*Configurable vs L1, total 64KB
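The shared-memory arithmetic on this slide can be reproduced with a few lines of host-side C++. This is a sketch under stated assumptions: it considers only shared memory plus a cap on resident blocks per SM that the caller supplies (the slide's "up to 8"), and it ignores register and thread-count limits. All function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Shared memory used by one block: two TILE_WIDTH x TILE_WIDTH float tiles.
inline std::size_t sharedBytesPerBlock(int tileWidth) {
    assert(tileWidth > 0);
    return 2 * static_cast<std::size_t>(tileWidth) * tileWidth * sizeof(float);
}

// Blocks that fit in one SM's shared memory, capped by maxBlocksPerSM
// (the cap is an assumption passed in by the caller, not queried from hardware).
inline int blocksBySharedMemory(std::size_t sharedBytesPerSM, int tileWidth,
                                int maxBlocksPerSM) {
    std::size_t fit = sharedBytesPerSM / sharedBytesPerBlock(tileWidth);
    return fit < static_cast<std::size_t>(maxBlocksPerSM)
               ? static_cast<int>(fit) : maxBlocksPerSM;
}

// Upper bound on GFLOPS when each float loaded from global memory is reused
// tileWidth times: (GB/s divided by 4 bytes per float) * tileWidth.
inline double bandwidthBoundGflops(double gbPerSec, int tileWidth) {
    return gbPerSec / sizeof(float) * tileWidth;
}
```

With these helpers, sharedBytesPerBlock(16) is 2,048 bytes and sharedBytesPerBlock(32) is 8,192 bytes; blocksBySharedMemory(16*1024, 32, 8) is 2 and blocksBySharedMemory(48*1024, 32, 8) is 6; bandwidthBoundGflops(86.4, 16) is about 345.6.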
Resource Constraints
• The number of available registers also limits the number of thread blocks that can execute concurrently
  – The reduction happens at the granularity of a whole thread block
  – Impact on utilization can be significant
• Auto-tune based on querying of device properties
  – E.g., fix tile size based on available shared memory size
  – Need to change the preceding tiled MM code
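One way the auto-tuning idea could look is sketched below in plain C++. The two parameters mirror the cudaDeviceProp fields sharedMemPerBlock and maxThreadsPerBlock, but are passed in as ordinary values so the sketch runs without a GPU. The function name chooseTileWidth and the policy (largest power-of-two tile width up to 32 that fits both limits) are assumptions for illustration, not the deck's method.

```cpp
#include <cassert>
#include <cstddef>

// Pick the largest power-of-two tile width (at most 32) such that
//   - a TILE x TILE block does not exceed maxThreadsPerBlock, and
//   - two TILE x TILE float tiles fit in sharedMemPerBlock bytes.
// Returns 0 if no candidate fits. (hypothetical helper and policy)
inline int chooseTileWidth(std::size_t sharedMemPerBlock, int maxThreadsPerBlock) {
    assert(maxThreadsPerBlock > 0);
    for (int tile = 32; tile >= 1; tile /= 2) {
        std::size_t threads = static_cast<std::size_t>(tile) * tile;
        std::size_t bytes = 2 * threads * sizeof(float);
        if (threads <= static_cast<std::size_t>(maxThreadsPerBlock) &&
            bytes <= sharedMemPerBlock)
            return tile;
    }
    return 0;
}
```

A tile width chosen at run time no longer matches a kernel whose shared arrays are sized by a compile-time TILE_WIDTH, which is the reason the tiled kernel would need to change: for instance, by making the tile width a template parameter or by sizing shared memory dynamically at launch.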
Device Query
• Number of devices in the system
  int dev_count;
  cudaGetDeviceCount(&dev_count);
• Capability of devices
  cudaDeviceProp dev_prop;
  for (int i = 0; i < dev_count; i++) {
    cudaGetDeviceProperties(&dev_prop, i);
    // decide if device has sufficient resources and capabilities
  }
• cudaDeviceProp is a built-in C structure type
  – dev_prop.maxThreadsPerBlock
  – dev_prop.sharedMemPerBlock
  – …
Summary - Typical Structure of a CUDA Program
• Global variables declaration
  – __host__
  – __device__... __global__, __constant__, __texture__
• Function prototypes
  – __global__ void kernelOne(…)
  – float handyFunction(…)
• Main ()
  – allocate memory space on the device – cudaMalloc(&d_GlblVarPtr, bytes)
  – transfer data from host to device – cudaMemcpy(d_GlblVarPtr, h_Gl…)
  – execution configuration setup
  – kernel call – kernelOne<<<execution configuration>>>( args… );
  – transfer results from device to host – cudaMemcpy(h_GlblVarPtr,…)
  – (repeat as needed)
  – optional: compare against golden (host computed) solution
• Kernel – __global__ void kernelOne(type args,…)
  – variables declaration - auto, __shared__
    • automatic variables transparently assigned to registers
  – __syncthreads()…
• Other functions
  – float handyFunction(int inVar…);
ANY MORE QUESTIONS? READ CHAPTER 5!