PROGRAMMING GPGPUS USING CUDA — …research.nesc.ac.uk/files/cuda_programming.pdf
TRANSCRIPT
Fan Zhu, 2012-11-20
PROGRAMMING GPGPUS USING CUDA
WHY GPGPUS
• GPGPUs - General Purpose Computing on Graphics Processing Units (GPUs)
From NVIDIA: CUDA C Programming Guide
GPUS VS. CPUS
• NVIDIA claims 10x to 1000x speedups
• Intel counters: about 2.5x speedups
CUDA
• CUDA - Compute Unified Device Architecture
• C for CUDA is the programming language
• Fortran for CUDA also exists
• Version 1.0 released in 2007
• Version 5.0 released in 2012
• Shared Memory Architecture
CUDA CODE PORTABILITY
• Hardware independent
• Change the launch configuration to achieve the best performance on each device
CUDA WORKFLOW
1. A CPU thread copies data from main memory to GPU memory.
2. A CPU thread instructs GPU threads to start processing.
3. GPU threads execute in parallel on different GPU cores.
3*. The CPU thread and any idle GPU threads wait for the running GPU threads to finish. This step happens concurrently with step 3.
4. The CPU thread copies the results from GPU memory to main memory.
5. The CPU thread acts on the results, and may return to step 1 in order to execute another GPU function.
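The five steps above can be sketched in host code roughly as follows. This is a hedged outline, not code from the slides; the kernel name `process`, the element count, and the doubling computation are all placeholders.

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void process(float *d_data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_data[i] *= 2.0f;               // placeholder computation
}

int main(void) {
    const int N = 1 << 20;
    size_t size = N * sizeof(float);
    float *h_data = (float *)malloc(size);      // fill h_data as needed
    float *d_data;
    cudaMalloc(&d_data, size);

    cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);  // step 1
    process<<<(N + 255) / 256, 256>>>(d_data, N);              // steps 2-3
    cudaDeviceSynchronize();                                   // step 3*
    cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);  // step 4
    // step 5: act on h_data; possibly loop back to step 1

    cudaFree(d_data);
    free(h_data);
    return 0;
}
```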
FUNCTION TYPES
• __host__
  • Executed on the host (CPU)
  • Callable from the host only
• __global__
  • Executed on the device (GPU)
  • Callable from the host only
• __device__
  • Executed on the device
  • Callable from the device only
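The three qualifiers side by side; the function names here are invented for illustration.

```cuda
__host__ void prepare(void) { /* runs on the CPU, called from CPU code */ }

__device__ float square(float x) { return x * x; }  // GPU-only helper

__global__ void kernel(float *out, int n) {         // launched from the host
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = square(out[i]);  // __device__ call from device code
}
```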
FUNCTIONS: MEMORY COPY
• Executed on the CPU
• Allocate and free GPU memory
  • cudaMalloc() and cudaFree()
• Copy CPU memory to GPU memory
  • cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
• Copy GPU memory to CPU memory
  • cudaMemcpy(h_B, d_B, size, cudaMemcpyDeviceToHost);
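A minimal allocate/copy/free round trip built from the calls above, assuming host arrays `h_A` and `h_B` of `size` bytes already exist (the `h_`/`d_` prefixes follow the slide's host/device naming convention):

```cuda
float *d_A, *d_B;
cudaMalloc(&d_A, size);
cudaMalloc(&d_B, size);
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);  // CPU -> GPU
// ... launch a kernel that reads d_A and writes d_B ...
cudaMemcpy(h_B, d_B, size, cudaMemcpyDeviceToHost);  // GPU -> CPU
cudaFree(d_A);
cudaFree(d_B);
```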
FUNCTIONS
• __syncthreads()
  • Called from device code, not the host
  • Barrier: waits until all threads in the block reach it
• clock(); clock64();
  • Called from device code
  • Read the GPU cycle counter, e.g. for timing
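A hedged sketch showing both functions inside one kernel; the block-sum pattern and all names are illustrative, not from the slides.

```cuda
__global__ void timedSum(const float *in, float *out, long long *cycles) {
    __shared__ float buf[256];            // one slot per thread in the block
    long long start = clock64();          // device-side cycle counter

    buf[threadIdx.x] = in[blockIdx.x * blockDim.x + threadIdx.x];
    __syncthreads();                      // barrier: all threads in the block

    if (threadIdx.x == 0) {
        float sum = 0.0f;
        for (int i = 0; i < blockDim.x; ++i)
            sum += buf[i];
        out[blockIdx.x] = sum;
        cycles[blockIdx.x] = clock64() - start;  // elapsed GPU cycles
    }
}
```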
CUDA EXAMPLES: VECTOR ADD
[Code slide: the kernel ("On GPU") and the setup/launch code ("On CPU") were shown here.]
• You can request __shared__ memory inside the kernel — limited to 16 KB per block on this generation of hardware
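The slide's exact code is not reproduced in this transcript; a standard vector-add kernel of the kind it presents looks like this (names are assumptions):

```cuda
__global__ void vecAdd(const float *A, const float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                       // guard: grid may overshoot n
        C[i] = A[i] + B[i];
}

// Host-side launch, with d_A, d_B, d_C already allocated and filled:
// vecAdd<<<(n + 255) / 256, 256>>>(d_A, d_B, d_C, n);
```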
GRID AND BLOCK
[Figure: a grid of 2x2 blocks — Block(0,0), Block(0,1), Block(1,0), Block(1,1) — each block containing a 4x4 array of threads indexed (0,0) through (3,3)]
• Grid
  • Share memory
• Block (<= 1024 threads)
  • Share cache
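How a thread locates itself inside this grid/block hierarchy — an illustrative kernel, not from the slides, assuming a 2-D launch over a `width`-column array:

```cuda
__global__ void whereAmI(int width, int *out) {
    // global 2-D coordinates from block index, block size, thread index
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    out[row * width + col] = row * width + col;  // unique global index
}
```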
BLOCKS
[Figure: one block shown as a 4x4 array of threads, indexed (0,0) through (3,3)]
CUDA EXAMPLE: MATRIX ADD
• Block = 1x1 vs. Block = 16x16 — same kernel, different launch configuration
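The comparison can be sketched as one kernel launched two ways; kernel and names are assumptions, not the slide's exact code. With 1x1 blocks each block holds a single thread and the hardware is badly underutilized; 16x16 blocks (256 threads) use it far better.

```cuda
__global__ void matAdd(const float *A, const float *B, float *C, int n) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < n && col < n)
        C[row * n + col] = A[row * n + col] + B[row * n + col];
}

// Block = 1x1 (one thread per block):
//   matAdd<<<dim3(n, n), dim3(1, 1)>>>(d_A, d_B, d_C, n);
// Block = 16x16 (256 threads per block, n divisible by 16):
//   matAdd<<<dim3(n / 16, n / 16), dim3(16, 16)>>>(d_A, d_B, d_C, n);
```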
COMPLETE CODE
All in the same .cu file!
THANK YOU.
• CUDA C Programming Guide: http://docs.nvidia.com/cuda/index.html