Basic CUDA Programming

Shin-Kai Chen
skchen@twins.ee.nctu.edu.tw
VLSI Signal Processing Laboratory
Department of Electronics Engineering
National Chiao Tung University
What will you learn in this lab?
• Concept of multicore accelerators
• Multithreaded/multicore programming
• Memory optimization
Slides
• Mostly from Prof. Wen-Mei Hwu of UIUC
  – http://courses.ece.uiuc.edu/ece498/al/Syllabus.html
CUDA – Hardware? Software?
[Figure: CUDA spans software and hardware. An application writes a thread program, which CUDA instantiates once per thread id (0, 1, 2, 3, …, m) and maps onto the platform. Software view: the host launches Kernel 1 as Grid 1 and Kernel 2 as Grid 2; each grid contains blocks — Block (0, 0), Block (1, 0), Block (0, 1), Block (1, 1) — and each block contains threads indexed Thread (x, y, z). (Figure 3.2, "An Example of CUDA Thread Organization"; courtesy NVIDIA.) Hardware view: a G80-style device — host, input assembler, thread execution manager, streaming processors with parallel data caches and texture units, and load/store paths to global memory.]
Host-Device Architecture

• CPU (host)
• GPU with local DRAM (device)
G80 CUDA mode – A Device Example
[Figure: G80 in CUDA mode — the host feeds an input assembler and thread execution manager, which dispatch work to an array of streaming multiprocessors, each with a parallel data cache and a texture unit; load/store units connect them to global memory.]
Functional Units in G80
• Streaming Multiprocessor (SM)
  – 1 instruction decoder ( 1 instruction / 4 cycles )
  – 8 streaming processors (SPs)
  – Shared memory

[Figure: two SMs (SM 0 and SM 1), each with a multithreaded instruction unit (MT IU), its SPs, and shared memory; thread blocks t0 t1 t2 … tm are assigned to the SMs.]
Setup CUDA for Windows
CUDA Environment Setup
• Get a GPU that supports CUDA
  – http://www.nvidia.com/object/cuda_learn_products.html
• Download CUDA
  – http://www.nvidia.com/object/cuda_get.html
  • CUDA driver
  • CUDA toolkit
  • CUDA SDK (optional)
• Install CUDA
• Test CUDA
  – Device Query
Setup CUDA for Visual Studio
• From scratch
  – http://forums.nvidia.com/index.php?showtopic=30273
• CUDA VS Wizard
  – http://sourceforge.net/projects/cudavswizard/
• Modify an existing project
Lab1: First CUDA Program
CUDA Computing Model
[Figure: CUDA computing model — execution alternates between serial code on the host and parallel code on the device; each kernel launch is preceded and followed by host-device memory transfers (Memory Transfer → Launch Kernel → Memory Transfer, repeated for each parallel section).]
Data Manipulation between Host and Device
• cudaError_t cudaMalloc( void** devPtr, size_t count )
  – Allocates count bytes of linear memory on the device and returns a pointer to the allocated memory in *devPtr
• cudaError_t cudaMemcpy( void* dst, const void* src, size_t count, enum cudaMemcpyKind kind )
  – Copies count bytes from the memory area pointed to by src to the memory area pointed to by dst
  – kind indicates the direction of the transfer:
    • cudaMemcpyHostToHost
    • cudaMemcpyHostToDevice
    • cudaMemcpyDeviceToHost
    • cudaMemcpyDeviceToDevice
• cudaError_t cudaFree( void* devPtr )
  – Frees the memory space pointed to by devPtr
Example
• Functionality:
  – Given an integer array A holding 8192 elements
  – For each element in array A, calculate A[i]256 and leave the result in B[i]
float GPU_kernel(int *B, int *A) {

  // Create pointers for memory space on device
  int *dA, *dB;

  // Allocate memory space on device
  cudaMalloc( (void**) &dA, sizeof(int)*SIZE );
  cudaMalloc( (void**) &dB, sizeof(int)*SIZE );

  // Copy data to be calculated
  cudaMemcpy( dA, A, sizeof(int)*SIZE, cudaMemcpyHostToDevice );
  cudaMemcpy( dB, B, sizeof(int)*SIZE, cudaMemcpyHostToDevice );

  // Launch kernel
  cuda_kernel<<<1,1>>>(dB, dA);

  // Copy output back
  cudaMemcpy( B, dB, sizeof(int)*SIZE, cudaMemcpyDeviceToHost );

  // Free memory spaces on device
  cudaFree( dA );
  cudaFree( dB );
}
Now, go and finish your first CUDA program !!!
• Download http://twins.ee.nctu.edu.tw/~skchen/lab1.zip
• Open the project with Visual C++ 2008 ( lab1/cuda_lab/cuda_lab.vcproj )
  – main.cu
    • Random input generation, output validation, result reporting
  – device.cu
    • Launches the GPU kernel; GPU kernel code
  – parameter.h
• Fill in the appropriate APIs
  – GPU_kernel() in device.cu
Lab2: Make the Parallel Code Faster
Parallel Processing in CUDA
• Parallel code can be partitioned into blocks and threads
  – cuda_kernel<<<nBlk, nTid>>>(…)
• Multiple tasks will be initialized, each with a different block id and thread id
• The tasks are dynamically scheduled
  – Tasks within the same block will be scheduled on the same streaming multiprocessor
• Each task takes care of a single data partition according to its block id and thread id
Locate Data Partition by Built-in Variables
• Built-in variables
  – gridDim: x, y
  – blockIdx: x, y
  – blockDim: x, y, z
  – threadIdx: x, y, z

[Figure: CUDA thread organization (Figure 3.2; courtesy NVIDIA) — the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; each grid contains blocks such as Block (1, 1), whose threads are indexed Thread (x, y, z), e.g. Thread (0, 0, 0) … Thread (3, 1, 0).]
Data Partition for Previous Example
When processing 64 integer data with cuda_kernel<<<2, 2>>>(…), each task handles one contiguous partition of the array, located by its head and length:

  int total_task = gridDim.x * blockDim.x ;
  int task_sn = blockIdx.x * blockDim.x + threadIdx.x ;
  int length = SIZE / total_task ;
  int head = task_sn * length ;

TASK 0: blockIdx.x = 0, threadIdx.x = 0
TASK 1: blockIdx.x = 0, threadIdx.x = 1
TASK 2: blockIdx.x = 1, threadIdx.x = 0
TASK 3: blockIdx.x = 1, threadIdx.x = 1
Processing Single Data Partition
__global__ void cuda_kernel ( int *B, int *A ) {

  int total_task = gridDim.x * blockDim.x;
  int task_sn = blockDim.x * blockIdx.x + threadIdx.x;
  int length = SIZE / total_task;
  int head = task_sn * length;

  for ( int i = head ; i < head + length ; i++ ) {
    B[i] = A[i]256;
  }

  return;
}
Parallelize Your Program !!!
• Partition the kernel into threads
  – Increase nTid from 1 to 512
  – Keep nBlk = 1
• Group threads into blocks
  – Adjust nBlk and see if it helps
• Maintain the total number of threads below 512, e.g. nBlk * nTid < 512
Lab3: Resolve Memory Contention
Parallel Memory Architecture
• Memory is divided into banks to achieve high bandwidth
• Each bank can service one address per cycle
• Successive 32-bit words are assigned to successive banks (BANK 0 … BANK 15, i.e. 16 banks)
Lab2 Review
When processing 64 integer data with cuda_kernel<<<1, 4>>>(…), the blocked partition makes all four threads hit the same bank:

• Iteration 1: THREAD 0, 1, 2, 3 access A[0], A[16], A[32], A[48] — all map to the same bank. CONFLICT!!!
• Iteration 2: THREAD 0, 1, 2, 3 access A[1], A[17], A[33], A[49] — again a single bank. CONFLICT!!!
How about Interleaved Accessing?
With the same cuda_kernel<<<1, 4>>>(…) launch on 64 integer data, interleaved accesses fall in different banks:

• Iteration 1: THREAD 0, 1, 2, 3 access A[0], A[1], A[2], A[3] — four different banks. NO CONFLICT
• Iteration 2: THREAD 0, 1, 2, 3 access A[4], A[5], A[6], A[7] — four different banks. NO CONFLICT
Implementation of Interleaved Accessing
• head = task_sn
• stripe = total_task

With cuda_kernel<<<1, 4>>>(…), each thread starts at its own head and strides through the array in steps of stripe elements.

[Figure: interleaved layout — consecutive elements belong to consecutive tasks; head marks a thread's first element, stripe the distance to its next one.]
Improve Your Program !!!
• Modify the original kernel code in an interleaving manner
  – cuda_kernel() in device.cu
• Adjust nBlk and nTid as in Lab2 and examine the effect
  – Maintain the total number of threads below 512, e.g. nBlk * nTid < 512
Thank You

• http://twins.ee.nctu.edu.tw/~skchen/lab3.zip
• Final project issue
• Group issue