INTRODUCTION TO CUDA
Prepared for Geek Camp Singapore 2011, 1st October
Raymond Tay
THE FREE LUNCH IS OVER – HERB SUTTER
WE NEED TO THINK BEYOND MULTI-CORE CPUS … WE NEED TO THINK MANY-CORE GPUS
…
NVIDIA GPUS FLOPS
FLOPS: floating-point operations per second. A measure of how many floating-point operations a GPU can perform each second. More is better.
GPUs beat CPUs
NVIDIA GPUS MEMORY BANDWIDTH
Nvidia’s GPUs are massively parallel processors, and keeping that many cores supplied with data takes high memory bandwidth, which plays a big role in high-performance computing.
GPUs beat CPUs
GPU VS CPU
CPU
• Optimised for low-latency access to cached data sets
• Control logic for out-of-order and speculative execution
GPU
• Optimised for data-parallel, throughput computation
• Architecture tolerant of memory latency
• More transistors dedicated to computation
I DON’T KNOW C/C++, SHOULD I LEAVE?
Relax, there is no need to fret.
Your Brain Asks: Wait a minute, why should I learn the C/C++ SDK?
CUDA Answers: Efficiency!!!
WHAT DO I NEED TO BEGIN WITH CUDA?
An Nvidia CUDA-enabled graphics card, e.g. one based on the Fermi architecture
HOW DOES CUDA WORK
1. Copy input data from CPU memory to GPU memory
2. Load GPU program and execute, caching data on chip for performance
3. Copy results from GPU memory to CPU memory
The copies in steps 1 and 3 travel over the PCI bus.
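The three steps above map onto a handful of CUDA runtime calls. What follows is a minimal sketch rather than the presenter's code: the kernel `square` and the helper `square_on_gpu` are hypothetical names chosen for illustration, error checking is left out for brevity, and it assumes a CUDA-capable GPU and a reasonably recent nvcc.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: each thread squares one element.
__global__ void square(const float *in, float *out, unsigned int n)
{
    unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < n)                       // guard threads past the end of the array
        out[tid] = in[tid] * in[tid];
}

// Hypothetical helper that walks through the three steps for one array.
std::vector<float> square_on_gpu(const std::vector<float> &host_in)
{
    const unsigned int n = static_cast<unsigned int>(host_in.size());
    const size_t bytes = n * sizeof(float);
    std::vector<float> host_out(n);
    if (n == 0)
        return host_out;               // nothing to copy or launch

    // Allocate GPU memory for the input and the output.
    float *dev_in = nullptr;
    float *dev_out = nullptr;
    cudaMalloc(&dev_in, bytes);
    cudaMalloc(&dev_out, bytes);

    // Step 1: copy input data from CPU memory to GPU memory.
    cudaMemcpy(dev_in, host_in.data(), bytes, cudaMemcpyHostToDevice);

    // Step 2: load the GPU program and execute it.
    // (This sketch does no explicit on-chip caching.)
    const unsigned int block_size = 256;   // assumed value, not from the slides
    const unsigned int grid_size = (n + block_size - 1) / block_size;
    square<<<grid_size, block_size>>>(dev_in, dev_out, n);

    // Step 3: copy results from GPU memory to CPU memory.
    // This cudaMemcpy waits for the kernel above to finish first.
    cudaMemcpy(host_out.data(), dev_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dev_in);
    cudaFree(dev_out);
    return host_out;
}

int main()
{
    std::vector<float> input = {1.0f, 2.0f, 3.0f, 4.0f};
    std::vector<float> output = square_on_gpu(input);
    for (float v : output)
        printf("%.1f\n", v);
    return 0;
}
```

Real code would also check the value returned by each `cuda*` call. Assuming the file is saved as `steps.cu` (a made-up name), it can be built with `nvcc steps.cu -o steps`.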
EXAMPLE: BLOCK CYPHER
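CPU Program
// Shift every element by shift_amount, wrapping around past alphabet_max.
void host_shift_cypher(unsigned int *input_array, unsigned int *output_array, unsigned int shift_amount, unsigned int alphabet_max, unsigned int array_length)
{
    for (unsigned int i = 0; i < array_length; i++)
    {
        unsigned int element = input_array[i];
        unsigned int shifted = element + shift_amount;
        if (shifted > alphabet_max)
        {
            shifted = shifted % (alphabet_max + 1);
        }
        output_array[i] = shifted;
    }
}
int main() {
    // The arrays and parameters are assumed to be allocated and initialised already.
    host_shift_cypher(input_array, output_array, shift_amount, alphabet_max, array_length);
}
GPU Program
// Each thread shifts one element: the thread index replaces the loop counter.
__global__ void shift_cypher(unsigned int *input_array, unsigned int *output_array, unsigned int shift_amount, unsigned int alphabet_max, unsigned int array_length)
{
    unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid >= array_length)
        return; // the last block may have more threads than remaining elements
    unsigned int shifted = input_array[tid] + shift_amount;
    if (shifted > alphabet_max)
        shifted = shifted % (alphabet_max + 1);
    output_array[tid] = shifted;
}
int main() {
    // input_array and output_array must point to GPU memory here (step 1 of HOW DOES CUDA WORK).
    // Round the block count up so that a partial final block is still launched.
    dim3 dimGrid((array_length + block_size - 1) / block_size);
    dim3 dimBlock(block_size);
    shift_cypher<<<dimGrid,dimBlock>>>(input_array, output_array, shift_amount, alphabet_max, array_length);
}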
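To actually run the GPU version, the host has to allocate device buffers and copy data in and out around the kernel launch. The following is a minimal sketch of such a driver, under some assumptions: the test data, the block size of 256 and the helper name `gpu_matches_cpu` are made up for illustration, the kernel here carries a bounds check so that the array length need not be a multiple of the block size, and error checking is left out.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// CPU reference version of the shift cypher.
void host_shift_cypher(const unsigned int *input_array, unsigned int *output_array,
                       unsigned int shift_amount, unsigned int alphabet_max,
                       unsigned int array_length)
{
    for (unsigned int i = 0; i < array_length; i++) {
        unsigned int shifted = input_array[i] + shift_amount;
        if (shifted > alphabet_max)
            shifted = shifted % (alphabet_max + 1);
        output_array[i] = shifted;
    }
}

// GPU version: one thread per element, with a guard for the final block.
__global__ void shift_cypher(const unsigned int *input_array, unsigned int *output_array,
                             unsigned int shift_amount, unsigned int alphabet_max,
                             unsigned int array_length)
{
    unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid >= array_length)
        return;
    unsigned int shifted = input_array[tid] + shift_amount;
    if (shifted > alphabet_max)
        shifted = shifted % (alphabet_max + 1);
    output_array[tid] = shifted;
}

// Hypothetical helper: runs both versions on made-up data and compares them.
// Assumes array_length > 0.
bool gpu_matches_cpu(unsigned int array_length, unsigned int shift_amount,
                     unsigned int alphabet_max)
{
    std::vector<unsigned int> input(array_length);
    std::vector<unsigned int> cpu_out(array_length);
    std::vector<unsigned int> gpu_out(array_length);
    for (unsigned int i = 0; i < array_length; i++)
        input[i] = i % (alphabet_max + 1);   // made-up data inside the alphabet

    host_shift_cypher(input.data(), cpu_out.data(), shift_amount, alphabet_max, array_length);

    // Allocate device buffers and copy the input across.
    const size_t bytes = array_length * sizeof(unsigned int);
    unsigned int *dev_in = nullptr;
    unsigned int *dev_out = nullptr;
    cudaMalloc(&dev_in, bytes);
    cudaMalloc(&dev_out, bytes);
    cudaMemcpy(dev_in, input.data(), bytes, cudaMemcpyHostToDevice);

    // Round the block count up so every element gets a thread.
    const unsigned int block_size = 256;     // assumed value
    dim3 dimGrid((array_length + block_size - 1) / block_size);
    dim3 dimBlock(block_size);
    shift_cypher<<<dimGrid, dimBlock>>>(dev_in, dev_out, shift_amount, alphabet_max, array_length);

    // Copy the result back (this waits for the kernel) and clean up.
    cudaMemcpy(gpu_out.data(), dev_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev_in);
    cudaFree(dev_out);

    return gpu_out == cpu_out;
}

int main()
{
    // 1000 is deliberately not a multiple of 256, so the last block is partial.
    bool ok = gpu_matches_cpu(1000, 3, 25);
    printf("%s\n", ok ? "GPU and CPU results match" : "GPU and CPU results differ");
    return ok ? 0 : 1;
}
```

Comparing against the CPU result is a cheap sanity check while porting a loop to a kernel: any indexing or launch-size mistake shows up as a mismatch.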
EXAMPLE: VECTOR ADDITION
// CUDA CODE
__global__ void VecAdd(const float* A, const float* B, float* C, unsigned int N)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}
// C CODE
void VecAdd(const float* A, const float* B, float* C, unsigned int N)
{
    for (int i = 0; i < N; ++i)
        C[i] = A[i] + B[i];
}
DEBUGGER
CUDA-GDB
• Based on GDB
• Linux
• Mac OS X
Parallel Nsight
• Plugin inside Visual Studio
VISUAL PROFILER & MEMCHECK
Profiler
• Microsoft Windows
• Linux
• Mac OS X
• Analyze performance
CUDA-MEMCHECK
• Microsoft Windows
• Linux
• Mac OS X
• Detect memory access errors
WHERE’S CUDA AT IN 2011?
60,000 researchers use it to aid drug discovery
470 universities teach CUDA
WHERE’S CUDA AT IN 2011? (PART 2..)
NVIDIA Show Case (1000+ applications)
ADDITIONAL RESOURCES
CUDA FAQ (http://tegradeveloper.nvidia.com/cuda-faq)
CUDA Tools & Ecosystem (http://tegradeveloper.nvidia.com/cuda-tools-ecosystem)
CUDA Downloads (http://tegradeveloper.nvidia.com/cuda-downloads)
NVIDIA Forums (http://forums.nvidia.com/index.php?showforum=62)
GPGPU (http://gpgpu.org)
CUDA By Example, by Jason Sanders & Edward Kandrot (http://tegradeveloper.nvidia.com/content/cuda-example-introduction-general-purpose-gpu-programming-0)
GPU Computing Gems Emerald Edition, Editor in Chief: Prof Hwu Wen-Mei (http://www.amazon.com/GPU-Computing-Gems-Emerald-Applications/dp/0123849888/)
CUDA LIBRARIES
Visit this site http://developer.nvidia.com/cuda-tools-ecosystem#Libraries
Thrust, CUFFT, CUBLAS, CUSP, NPP, OpenCV, GPU AI-Tree Search, GPU AI-Path Finding
A lot of the libraries are hosted on Google Code, and there are many more gems in there too!
THANK YOU @RaymondTayBL