GPU Programming with CUDA – CUDA 5 and 6
Paul Richmond
GPUComputing@Sheffield
http://gpucomputing.sites.sheffield.ac.uk/
Overview

• Dynamic Parallelism (CUDA 5+)
• GPU Object Linking (CUDA 5+)
• Unified Memory (CUDA 6+)
• Other Developer Tools
Dynamic Parallelism

• Before CUDA 5, kernels had to be launched from the host
  • Limited ability to perform recursive functions
• Dynamic Parallelism allows kernels to be launched from the device
  • Improved load balancing
  • Deep recursion

[Diagram: with Dynamic Parallelism, the CPU launches work on the GPU, and kernels (A, B, C, D) can then launch further kernels directly from the GPU.]
An Example

```cuda
// Host code: kernels launched from the CPU as usual
...
A<<<...>>>(data);
B<<<...>>>(data);
C<<<...>>>(data);

// Kernel code: with Dynamic Parallelism, a kernel can launch kernels itself
__global__ void vectorAdd(float *data)
{
    do_stuff(data);
    X<<<...>>>(data);
    X<<<...>>>(data);
    X<<<...>>>(data);
    do_more_stuff(data);
}
```
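The slide's fragment leaves the launch configurations and helper functions elided. As a self-contained sketch of the same idea (the kernel names parentKernel and childKernel and the problem size are illustrative, not from the slides), the following compiles with relocatable device code on a compute capability 3.5+ device, e.g. nvcc -arch=sm_35 -rdc=true:

```cuda
// Child kernel: one increment per thread (illustrative work)
__global__ void childKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Parent kernel: a single thread launches the child directly from the GPU
__global__ void parentKernel(float *data, int n)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        childKernel<<<(n + 255) / 256, 256>>>(data, n);
        cudaDeviceSynchronize(); // device-side sync (supported in the CUDA 5/6 era)
    }
}

int main()
{
    const int n = 1024;
    float *data;
    cudaMalloc(&data, n * sizeof(float));
    cudaMemset(data, 0, n * sizeof(float));
    parentKernel<<<1, 1>>>(data, n);
    cudaDeviceSynchronize(); // host waits for parent and child to finish
    cudaFree(data);
    return 0;
}
```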
GPU Object Linking

• CUDA 4 required a single source file for a single kernel
  • No linking of compiled device code
• CUDA 5.0+ allows different object files to be linked
  • Kernels and host code can be built independently

[Diagram: main.cpp, a.cu, b.cu and c.cu are compiled separately into objects a.o, b.o and c.o, then linked together into program.exe.]
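A sketch of this workflow with nvcc, using the file names from the diagram (exact flags can vary between CUDA versions):

```bash
# Compile each .cu file to an object containing relocatable device code (-dc)
nvcc -arch=sm_35 -dc a.cu b.cu c.cu

# Final link: nvcc performs the device-link step and links the host code
nvcc -arch=sm_35 a.o b.o c.o main.cpp -o program.exe
```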
GPU Object Linking

• Objects can also be built into static libraries
  • Shared by different sources
  • Much better code reuse
  • Reduces compilation time
  • Allows closed-source device libraries

[Diagram: a.cu and b.cu are compiled to a.o and b.o and combined into a static library ab.culib; the library is linked with main.cpp (together with further sources such as foo.cu and bar.cu) to build program.exe, and reused with main2.cpp to build program2.exe.]
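The static-library variant might look as follows with nvcc (file and library names are taken from the diagram; a sketch, not a definitive build recipe):

```bash
# Build device objects and archive them into a static library
nvcc -arch=sm_35 -dc a.cu b.cu
nvcc -arch=sm_35 -lib a.o b.o -o ab.culib

# The same library can be reused by different programs
nvcc -arch=sm_35 main.cpp foo.cu bar.cu ab.culib -o program.exe
nvcc -arch=sm_35 main2.cpp ab.culib -o program2.exe
```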
Unified Memory

• The developer's traditional view is that the GPU and CPU have separate memories
  • Memory must be explicitly copied between them
  • Deep copies are required for complex data structures
• Unified Memory changes that view
  • A single pointer to data, accessible from anywhere
  • Simpler code porting

[Diagram: previously, the CPU used system memory and the GPU its own GPU memory; with Unified Memory, the CPU and GPU share a single unified memory space.]
Unified Memory Example

CPU-only code:

```cuda
void sortfile(FILE *fp, int N)
{
    char *data;
    data = (char *)malloc(N);

    fread(data, 1, N, fp);

    qsort(data, N, 1, compare);

    use_data(data);

    free(data);
}
```

CUDA 6 code with Unified Memory:

```cuda
void sortfile(FILE *fp, int N)
{
    char *data;
    cudaMallocManaged(&data, N);         // one managed allocation, visible to CPU and GPU

    fread(data, 1, N, fp);               // the host writes the managed buffer directly

    qsort<<<...>>>(data, N, 1, compare); // the sort runs as a GPU kernel
    cudaDeviceSynchronize();             // wait for the GPU before the CPU touches the data

    use_data(data);

    cudaFree(data);                      // managed memory is freed with cudaFree, not free
}
```
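The fragment above leans on helpers (compare, use_data) that the slides do not define. A minimal complete program showing the same pattern might look like this (the scale kernel and the sizes are illustrative assumptions, not from the slides):

```cuda
#include <cstdio>

// Illustrative kernel: scale every element of the buffer in place
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main()
{
    const int n = 1024;
    float *data;
    cudaMallocManaged(&data, n * sizeof(float)); // single pointer for CPU and GPU

    for (int i = 0; i < n; ++i) data[i] = 1.0f;  // host initialises directly, no cudaMemcpy

    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);
    cudaDeviceSynchronize();                     // wait for the GPU before the CPU reads

    printf("data[0] = %f\n", data[0]);           // prints 2.000000
    cudaFree(data);
    return 0;
}
```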
Other Developer Tools

• XT and drop-in libraries
  • cuFFT and cuBLAS optimised for multiple GPUs (on the same node)
• GPUDirect
  • Direct transfers between GPUs (cutting out the host)
  • Support for direct transfers via InfiniBand (over a network)
• Developer tools
  • Remote development using Nsight Eclipse Edition
  • Enhanced Visual Profiler