CUDA Streams
These notes introduce the use of multiple CUDA streams to overlap memory transfers with kernel computations.
Also introduced is page-locked memory.
Page-locked host memory (also called pinned host memory)
Page-locked memory is never paged out of main memory by the OS; it remains resident.
Allows:
• Concurrent host/device memory transfers with kernel operations (Compute capability 2.x) – see next
• Host memory can be mapped to device address space (Compute capability > 1.0)
• Memory bandwidth is higher
• Uses real (physical) addresses rather than virtual addresses
• Does not need intermediate copy buffering
Note on using page-locked memory
Using page-locked memory reduces the memory available to the OS for paging, so it should be allocated with care.
Allocating page-locked memory
cudaMallocHost(void **ptr, size_t size)
Allocates page-locked host memory that is accessible to the device.
cudaHostAlloc(void **ptr, size_t size, unsigned int flags)
Allocates page-locked host memory that is accessible to the device. The flags argument selects further options, for example cudaHostAllocDefault, cudaHostAllocPortable, cudaHostAllocMapped, or cudaHostAllocWriteCombined.
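As a minimal sketch of the two allocation calls above, the program below allocates one page-locked buffer with each call, uses both from the host, and releases them with cudaFreeHost. The buffer size, the variable names, and the check() helper are assumptions chosen for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
static void check(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

int main() {
    const size_t n = 1 << 20;              // illustrative element count (assumption)
    const size_t bytes = n * sizeof(int);
    int *pinned_a = nullptr;
    int *pinned_b = nullptr;

    // Page-locked allocation with the simpler call.
    check(cudaMallocHost((void **)&pinned_a, bytes), "cudaMallocHost");
    // Same kind of allocation; the flags argument selects options.
    check(cudaHostAlloc((void **)&pinned_b, bytes, cudaHostAllocDefault),
          "cudaHostAlloc");

    // From the host, page-locked memory is used like ordinary memory.
    for (size_t i = 0; i < n; ++i) {
        pinned_a[i] = (int)i;
        pinned_b[i] = 2 * (int)i;
    }
    printf("pinned_a[10] = %d, pinned_b[10] = %d\n", pinned_a[10], pinned_b[10]);

    // Page-locked memory is released with cudaFreeHost, not free().
    check(cudaFreeHost(pinned_a), "cudaFreeHost");
    check(cudaFreeHost(pinned_b), "cudaFreeHost");
    return 0;
}
```

Because each page-locked buffer takes pages away from the OS, freeing buffers as soon as they are no longer needed is usually the safer habit.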
CUDA Streams
A CUDA Stream is a sequence of operations (commands) that are executed in order.
Several CUDA streams can be created and executed together, and their operations may be interleaved, although the "program order" is always maintained within each stream.
Streams provide a mechanism to overlap memory transfers and computation in different streams for increased performance, if sufficient resources are available.
Creating a stream
Done by creating a stream object and associating it with a series of CUDA commands, which then become the stream. CUDA commands take the stream as an argument:

cudaStream_t stream1;
cudaStreamCreate(&stream1);

cudaMemcpyAsync(…, stream1);
MyKernel<<<grid, block, 0, stream1>>>(…);
cudaMemcpyAsync(…, stream1);

In the kernel launch the stream is the fourth configuration parameter; the third is the dynamic shared memory size (0 here).
Regular cudaMemcpy cannot be used with streams; asynchronous commands are needed for concurrent operation (see next).
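The sequence above can be filled out into a small complete program. This is only a sketch: the kernel body (doubling each element), the array size, and the check() helper are assumptions made for illustration. Note that the stream is passed as the fourth kernel launch parameter, after the dynamic shared memory size.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
static void check(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// Hypothetical kernel for illustration: doubles each element in place.
__global__ void MyKernel(int *data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] *= 2;
}

int main() {
    const int n = 1 << 20;                 // illustrative size (assumption)
    const size_t bytes = n * sizeof(int);
    int *host = nullptr;
    int *dev = nullptr;

    // Page-locked host buffer, as required for asynchronous copies.
    check(cudaMallocHost((void **)&host, bytes), "cudaMallocHost");
    check(cudaMalloc((void **)&dev, bytes), "cudaMalloc");
    for (int i = 0; i < n; ++i) host[i] = i;

    cudaStream_t stream1;
    check(cudaStreamCreate(&stream1), "cudaStreamCreate");

    // The three commands are queued in stream1 and execute in this order.
    check(cudaMemcpyAsync(dev, host, bytes, cudaMemcpyHostToDevice, stream1),
          "cudaMemcpyAsync H2D");
    MyKernel<<<(n + 255) / 256, 256, 0, stream1>>>(dev, n);
    check(cudaGetLastError(), "kernel launch");
    check(cudaMemcpyAsync(host, dev, bytes, cudaMemcpyDeviceToHost, stream1),
          "cudaMemcpyAsync D2H");

    // Wait for everything queued in stream1 before reading host[].
    check(cudaStreamSynchronize(stream1), "cudaStreamSynchronize");
    printf("host[5] = %d\n", host[5]);

    check(cudaStreamDestroy(stream1), "cudaStreamDestroy");
    check(cudaFree(dev), "cudaFree");
    check(cudaFreeHost(host), "cudaFreeHost");
    return 0;
}
```

With a single stream nothing overlaps yet; the benefit appears once a second stream is given independent work.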
cudaMemcpyAsync( …, stream)
Asynchronous version of cudaMemcpy that copies data between the host and the device
May return before the copy is complete
A stream argument is specified
Needs page-locked host memory
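To illustrate that the call may return before the copy is complete, the sketch below queues one copy and then polls the stream with cudaStreamQuery, which returns cudaErrorNotReady while work is still pending. The buffer size and the check() helper are assumptions; which of the two messages is printed depends on the hardware and timing, so no particular output is claimed.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
static void check(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

int main() {
    const size_t bytes = 64u << 20;        // 64 MiB, illustrative size (assumption)
    char *host = nullptr;
    char *dev = nullptr;
    cudaStream_t stream;

    check(cudaMallocHost((void **)&host, bytes), "cudaMallocHost");  // page-locked
    check(cudaMalloc((void **)&dev, bytes), "cudaMalloc");
    check(cudaStreamCreate(&stream), "cudaStreamCreate");
    memset(host, 1, bytes);

    // Queue the copy; control returns to the host without waiting for it.
    check(cudaMemcpyAsync(dev, host, bytes, cudaMemcpyHostToDevice, stream),
          "cudaMemcpyAsync");

    // Poll the stream once: the copy may or may not have finished yet.
    cudaError_t status = cudaStreamQuery(stream);
    if (status == cudaSuccess) {
        printf("copy already finished\n");
    } else if (status == cudaErrorNotReady) {
        printf("copy still in progress\n");
    } else {
        check(status, "cudaStreamQuery");
    }

    // Block until the stream is idle; only now is the data known to be on the device.
    check(cudaStreamSynchronize(stream), "cudaStreamSynchronize");
    printf("copy complete\n");

    check(cudaStreamDestroy(stream), "cudaStreamDestroy");
    check(cudaFree(dev), "cudaFree");
    check(cudaFreeHost(host), "cudaFreeHost");
    return 0;
}
```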
Simply issuing all of one stream's statements followed by all of the other stream's statements does not work well, because of the way the GPU schedules work
Page 206, CUDA by Example
Page 207, CUDA by Example
Page 208, CUDA by Example
for (int i = 0; i < SIZE; i += N*2) {   // loop over data in chunks of 2N
    // interleave stream 1 and stream 2; stream 2 takes the second N elements
    cudaMemcpyAsync(dev_a1, a+i,   N*sizeof(int), cudaMemcpyHostToDevice, stream1);
    cudaMemcpyAsync(dev_a2, a+i+N, N*sizeof(int), cudaMemcpyHostToDevice, stream2);
    cudaMemcpyAsync(dev_b1, b+i,   N*sizeof(int), cudaMemcpyHostToDevice, stream1);
    cudaMemcpyAsync(dev_b2, b+i+N, N*sizeof(int), cudaMemcpyHostToDevice, stream2);

    // each stream runs the kernel on its own device buffers
    kernel<<<N/256, 256, 0, stream1>>>(dev_a1, dev_b1, dev_c1);
    kernel<<<N/256, 256, 0, stream2>>>(dev_a2, dev_b2, dev_c2);

    // copy each stream's result back to its part of c
    cudaMemcpyAsync(c+i,   dev_c1, N*sizeof(int), cudaMemcpyDeviceToHost, stream1);
    cudaMemcpyAsync(c+i+N, dev_c2, N*sizeof(int), cudaMemcpyDeviceToHost, stream2);
}
Interleave statements of each stream
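The interleaving pattern can be placed in a complete program, sketched below under stated assumptions: the values of N and SIZE, the kernel body (an element-wise sum), the check() helper, and the final verification are all illustrative and not taken from the slides. Stream 1 handles the first N elements of each 2N chunk and stream 2 the next N (offset i + N), each with its own device buffers.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define N    (1024 * 1024)   // elements per stream per iteration (assumption)
#define SIZE (N * 8)         // total elements, a multiple of 2*N (assumption)

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
static void check(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// Hypothetical kernel: element-wise sum of one N-element chunk.
__global__ void kernel(const int *a, const int *b, int *c) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) c[idx] = a[idx] + b[idx];
}

int main() {
    const size_t chunk = N * sizeof(int);
    const size_t total = SIZE * sizeof(int);
    int *a, *b, *c;                     // page-locked host arrays
    int *dev_a1, *dev_b1, *dev_c1;      // device buffers used by stream1
    int *dev_a2, *dev_b2, *dev_c2;      // device buffers used by stream2
    cudaStream_t stream1, stream2;

    check(cudaStreamCreate(&stream1), "cudaStreamCreate");
    check(cudaStreamCreate(&stream2), "cudaStreamCreate");
    check(cudaMallocHost((void **)&a, total), "cudaMallocHost a");
    check(cudaMallocHost((void **)&b, total), "cudaMallocHost b");
    check(cudaMallocHost((void **)&c, total), "cudaMallocHost c");
    check(cudaMalloc((void **)&dev_a1, chunk), "cudaMalloc");
    check(cudaMalloc((void **)&dev_b1, chunk), "cudaMalloc");
    check(cudaMalloc((void **)&dev_c1, chunk), "cudaMalloc");
    check(cudaMalloc((void **)&dev_a2, chunk), "cudaMalloc");
    check(cudaMalloc((void **)&dev_b2, chunk), "cudaMalloc");
    check(cudaMalloc((void **)&dev_c2, chunk), "cudaMalloc");

    for (int i = 0; i < SIZE; ++i) { a[i] = i; b[i] = 2 * i; }

    for (int i = 0; i < SIZE; i += N * 2) {
        // Alternate between the streams for each kind of operation.
        check(cudaMemcpyAsync(dev_a1, a + i,     chunk, cudaMemcpyHostToDevice, stream1), "copy a1");
        check(cudaMemcpyAsync(dev_a2, a + i + N, chunk, cudaMemcpyHostToDevice, stream2), "copy a2");
        check(cudaMemcpyAsync(dev_b1, b + i,     chunk, cudaMemcpyHostToDevice, stream1), "copy b1");
        check(cudaMemcpyAsync(dev_b2, b + i + N, chunk, cudaMemcpyHostToDevice, stream2), "copy b2");

        kernel<<<N / 256, 256, 0, stream1>>>(dev_a1, dev_b1, dev_c1);
        kernel<<<N / 256, 256, 0, stream2>>>(dev_a2, dev_b2, dev_c2);
        check(cudaGetLastError(), "kernel launch");

        check(cudaMemcpyAsync(c + i,     dev_c1, chunk, cudaMemcpyDeviceToHost, stream1), "copy c1");
        check(cudaMemcpyAsync(c + i + N, dev_c2, chunk, cudaMemcpyDeviceToHost, stream2), "copy c2");
    }

    // Both streams must drain before the host reads c[].
    check(cudaStreamSynchronize(stream1), "cudaStreamSynchronize");
    check(cudaStreamSynchronize(stream2), "cudaStreamSynchronize");

    int mismatches = 0;
    for (int i = 0; i < SIZE; ++i) {
        if (c[i] != a[i] + b[i]) ++mismatches;
    }
    printf("mismatches: %d\n", mismatches);

    cudaFree(dev_a1); cudaFree(dev_b1); cudaFree(dev_c1);
    cudaFree(dev_a2); cudaFree(dev_b2); cudaFree(dev_c2);
    cudaFreeHost(a); cudaFreeHost(b); cudaFreeHost(c);
    cudaStreamDestroy(stream1); cudaStreamDestroy(stream2);
    return 0;
}
```

Within each stream the device buffers are reused on the next iteration; this is safe because operations in the same stream execute in order, so the copy back from dev_c1 finishes before the next copy into dev_a1 begins.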
Page 210, CUDA by Example
Questions