CUDA - 101
Basics
Overview
• What is CUDA?
• Data Parallelism
• Host-Device model
• Thread execution
• Matrix-multiplication
GPU revisited!
What is CUDA?
• Compute Unified Device Architecture
• Programming interface to the GPU
• Supports C/C++ and Fortran natively
– Third-party wrappers for Python, Java, MATLAB, etc.
• Various libraries available
– cuBLAS, cuFFT and many more…
– https://developer.nvidia.com/gpu-accelerated-libraries
CUDA computing stack
Data Parallel programming
[Diagram: a kernel maps each input element i1, i2, i3, …, iN independently to a corresponding output element o1, o2, o3, …, oN]
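The pattern above can be sketched as a CUDA kernel: every thread runs the same kernel program on one input element and produces one output element. The kernel name and the operation (squaring) are illustrative, not from the slides:

```cuda
// Each thread applies the same kernel to one input element,
// producing one output element: out[i] = f(in[i]).
__global__ void squareKernel(const float *in, float *out, int n) {
    int i = threadIdx.x;          // one thread per element (single block)
    if (i < n)
        out[i] = in[i] * in[i];   // f is squaring here, purely illustrative
}
```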
Data parallel algorithm
• Dot product: C = A · B
[Diagram: the kernel pairs (A1, B1), (A2, B2), …, (AN, BN) and computes each partial product Ci = Ai × Bi in parallel; the partial products C1, C2, C3, …, CN are then summed]
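A minimal sketch of the data-parallel stage of this algorithm, assuming float vectors; the kernel name is illustrative, and the final summation is noted only in a comment:

```cuda
// Stage 1 (data parallel): each thread computes one partial
// product Ci = Ai * Bi of the dot product.
__global__ void multiplyKernel(const float *A, const float *B,
                               float *C, int n) {
    int i = threadIdx.x;     // one thread per element pair
    if (i < n)
        C[i] = A[i] * B[i];
}
// Stage 2 (the "+" row on the slide): the partial products in C are
// summed, e.g. on the host after copying C back, or with a parallel
// reduction on the device.
```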
Host-Device model
CPU (Host) GPU (Device)
Threads
• A thread is an instance of the kernel program
– Independent in a data-parallel model
– Can be executed on a different core
• Host tells the device to run a kernel program
– And how many threads to launch
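In CUDA the host states "how many threads" in the launch configuration between triple angle brackets. The kernel name and arguments below are hypothetical, shown only to illustrate the syntax:

```cuda
// From host code: launch 4 blocks of 256 threads each, i.e. 1024
// independent instances of the kernel program.
// myKernel, d_in, d_out and n are hypothetical names.
myKernel<<<4, 256>>>(d_in, d_out, n);   // <<<blocksPerGrid, threadsPerBlock>>>
```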
Matrix-Multiplication
CPU-only Matrix Multiplication
[Slide annotation: execute this code for all elements of P]
Memory Indexing in C (and CUDA)
M(i, j) = M[i + j * width]
CUDA version - 1
CUDA program flow
• Allocate input and output memory on host
– Do the same for device
• Transfer input data from host -> device
• Launch kernel on device
• Transfer output data from device -> host
Allocating Device memory
• Host tells the device to allocate and free memory on the device
• Functions for the host program
– cudaMalloc(memory reference, size)
– cudaFree(memory reference)
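Concretely, cudaMalloc takes a pointer to the device pointer plus a byte count, and cudaFree takes the device pointer; width is an assumed variable here:

```cuda
float *d_M = NULL;                        // "d_" marks a device pointer
size_t size = width * width * sizeof(float);

cudaMalloc((void **)&d_M, size);          // allocate size bytes on the device
// ... use d_M in kernels ...
cudaFree(d_M);                            // release the device allocation
```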
Transfer Data to/from device
• Again, host tells device when to transfer data
• cudaMemcpy(target, source, size, flag)
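The flag is a cudaMemcpyKind value giving the direction of the copy; d_M, d_P, M, P and size are assumed to come from earlier allocation code:

```cuda
// Copy size bytes from host array M to device array d_M, and later
// copy the result back. The last argument is the direction flag.
cudaMemcpy(d_M, M, size, cudaMemcpyHostToDevice);   // host -> device
// ... kernel runs on the device ...
cudaMemcpy(P, d_P, size, cudaMemcpyDeviceToHost);   // device -> host
```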
CUDA version - 2
[Diagram: host memory on the left, device memory on the right]
• Allocate matrix M on device; transfer M from host -> device
• Allocate matrix N on device; transfer N from host -> device
• Allocate matrix P on device
• Execute kernel on device
• Transfer P from device -> host
• Free device memories for M, N and P
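These steps can be sketched as one host function. MatrixMulKernel is the kernel described on the "Matrix Multiplication Kernel" slide; error checking is omitted, and the single-block launch assumes width × width does not exceed the device's threads-per-block limit:

```cuda
void MatrixMulOnDevice(const float *M, const float *N, float *P, int width) {
    size_t size = width * width * sizeof(float);
    float *d_M, *d_N, *d_P;

    // Allocate M and N on the device and transfer them host -> device
    cudaMalloc((void **)&d_M, size);
    cudaMemcpy(d_M, M, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **)&d_N, size);
    cudaMemcpy(d_N, N, size, cudaMemcpyHostToDevice);

    // Allocate the output matrix P on the device
    cudaMalloc((void **)&d_P, size);

    // Execute the kernel: one thread per element of P
    dim3 threads(width, width);
    MatrixMulKernel<<<1, threads>>>(d_M, d_N, d_P, width);

    // Transfer the result device -> host, then free device memory
    cudaMemcpy(P, d_P, size, cudaMemcpyDeviceToHost);
    cudaFree(d_M);
    cudaFree(d_N);
    cudaFree(d_P);
}
```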
Matrix Multiplication Kernel
• Kernel specifies the function to be executed on device
– Parameters = device memories, width
– Thread = each element of output matrix P
– Dot product of M's row and N's column
– Write dot product at current location
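A sketch of such a kernel, assuming a single block of width × width threads so that thread indices map directly to matrix positions (thread indexing is covered on the following slides):

```cuda
// One thread computes one element of P: the dot product of a row of M
// and a column of N, written at the thread's own (row, col) position.
__global__ void MatrixMulKernel(const float *M, const float *N,
                                float *P, int width) {
    int col = threadIdx.x;            // x indexes the column
    int row = threadIdx.y;            // y indexes the row

    float sum = 0.0f;
    for (int k = 0; k < width; ++k)
        sum += M[row * width + k] * N[k * width + col];

    P[row * width + col] = sum;       // write dot product at current location
}
```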
Extensions : Function qualifiers
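The slide's table did not survive the transcript; CUDA's three function qualifiers are:

```cuda
__global__ void kernelFunc(void) { }     // runs on device, callable from host (a kernel)

__device__ float devFunc(float x) {      // runs on device, callable from device code
    return x * x;
}

__host__ float hostFunc(float x) {       // runs on host (the default for plain functions)
    return x + 1.0f;
}
```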
Extensions : Thread indexing
• All threads execute the same code
– But they need to work on separate data
• threadIdx.x & threadIdx.y
– These variables automatically receive corresponding values for their threads
Thread Grid
• Represents group of all threads to be executed for a particular kernel
• Two-level hierarchy
– Grid is composed of Blocks
– Each Block is composed of threads
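With this two-level hierarchy a thread combines its block index and thread index to get a unique global position. A sketch, with illustrative names and an illustrative write:

```cuda
__global__ void gridKernel(float *out, int width) {
    // blockIdx picks the block within the grid; threadIdx picks the
    // thread within its block; blockDim is the block's size.
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < width && col < width)
        out[row * width + col] = 1.0f;   // illustrative write
}

void launchGrid(float *d_out, int width) {
    // A grid of 16x16-thread blocks, rounded up to cover width x width.
    dim3 block(16, 16);
    dim3 grid((width + 15) / 16, (width + 15) / 16);
    gridKernel<<<grid, block>>>(d_out, width);
}
```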
Thread Grid
[Diagram: a width × width grid of threads indexed by (x, y), from (0, 0), (1, 0), (2, 0), …, (width-1, 0) in the first row down to (0, width-1), …, (width-1, width-1) in the last]
Conclusion
• Sample code and tutorials
• CUDA nodes?
• Programming guide
– http://docs.nvidia.com/cuda/cuda-c-programming-guide/
• SDK
– https://developer.nvidia.com/cuda-downloads
– Available for Windows, Mac and Linux
– Lots of sample programs
QUESTIONS?