Introduction to CUDA - PRACE Research Infrastructure
TRANSCRIPT
Introduction to CUDA
GPU Performance History
• GPUs are massively multithreaded many-core chips
• Hundreds of cores, thousands of concurrent threads
• Huge economies of scale
• Still on aggressive performance growth
• High memory bandwidth
(Performance chart: G80, GT200)
CUDA: A Parallel Computing Architecture for NVIDIA GPUs
Supports standard languages and APIs:
• C
• OpenCL
• DX Compute
• Fortran (PGI)
Supported on common operating systems:
• Windows
• Mac OS X
• Linux
NVIDIA supports any initiative that unleashes the massive power of the GPU
© NVIDIA Corporation 2008
C for CUDA
CUDA is industry-standard C:
• Write a program for one thread
• Instantiate it on many parallel threads
• Familiar programming model and language
CUDA is a scalable parallel programming model:
• Programs run on any number of processors without recompiling
GPU Sizes Require CUDA Scalability
(GPU diagrams: 32 SP cores, 128 SP cores, 240 SP cores)
Tesla™ - High-Performance Computing
Quadro® - Design & Creation
GeForce® - Entertainment
CUDA runs on NVIDIA GPUs... over 100 million CUDA GPUs deployed
Pervasive CUDA Parallel Computing
CUDA brings data-parallel computing to the masses:
• Over 100 million CUDA-capable GPUs deployed since Nov 2006
• Wide developer acceptance: over 50K CUDA developer downloads (download CUDA from www.nvidia.com/cuda)
• A GPU "developer kit" costs ~$100 for several hundred GFLOPS
Data-parallel supercomputers are everywhere!
• CUDA makes this power readily accessible
• Enables rapid innovations in data-parallel computing
Parallel computing rides the commodity technology wave
CUDA Zone: www.nvidia.com/cuda
Resources, examples, and pointers for CUDA developers
Introducing the Tesla T10P Processor
...NVIDIA's 2nd-generation CUDA processor
• 1.4 billion transistors
• 1 teraflop of processing power
• 240 SP processing cores
• 30 DP processing cores with IEEE-754 double precision
CUDA Computing with Tesla T10
• 240 SP processors at 1.44 GHz: 1 TFLOPS peak
• 30 DP processors at 1.44 GHz: 86 GFLOPS peak
• 128 threads per processor: 30,720 threads total
Double-Precision Floating Point (columns: NVIDIA GPU / SSE2 / Cell SPE)
• Precision: IEEE 754 / IEEE 754 / IEEE 754
• Rounding modes for FADD and FMUL: all 4 IEEE modes (nearest, zero, +inf, -inf) / all 4 IEEE modes / round to zero (truncate) only
• Denormal handling: full speed / supported, costs 1000s of cycles / flush to zero
• NaN support: yes / yes / no
• Overflow and infinity support: yes / yes / no infinity, clamps to max norm
• Flags: no / yes / some
• FMA: yes / no / yes
• Square root: software with low-latency FMA-based convergence / hardware / software only
• Division: software with low-latency FMA-based convergence / hardware / software only
• Reciprocal estimate accuracy: 24 bit / 12 bit / 12 bit
• Reciprocal sqrt estimate accuracy: 23 bit / 12 bit / 12 bit
• log2(x) and 2^x estimate accuracy: 23 bit / no / no
Tesla C1060 Computing Processor
• Processor: 1x Tesla T10P
• Core clock: 1.29 GHz
• Form factor: full ATX, 4.736" (H) x 10.5" (L), dual-slot wide
• On-board memory: 4 GB
• System I/O: PCIe x16 gen2
• Memory I/O: 512-bit, 800 MHz DDR; 102 GB/s peak bandwidth
• Display outputs: none
• Typical power: 160 W
Tesla S1070 1U System
• Processors: 4x Tesla T10P
• Core clock: 1.44 GHz
• Form factor: 1U for an EIA 19" 4-post rack
• Total 1U system memory: 16 GB (4.0 GB per GPU)
• System I/O: 2x PCIe x16
• Memory I/O per processor: 512-bit, 800 MHz GDDR; 102 GB/s peak bandwidth
• Display outputs: none
• Typical power: 700 W
• Chassis dimensions: 1.73" H x 17.5" W x 28.5" D
Applications
Folding@home Performance Comparison
(Bar chart, scale 0-800: nanoseconds of simulation per day for CPU, PS3, Radeon HD3870, Radeon HD4850, Tesla 8-series, and Tesla 10-series)
F@H kernel based on GROMACS code
Lattice Boltzmann
1000 iterations on a 256x128x128 domain:
• Cluster with 8 GPUs: 7.5 s
• Blue Gene/L, 512 nodes: 21 s
10000 iterations on an irregular 1057x692x1446 domain with 4M fluid nodes:
• 1 C870: 760 s (53 MLUPS)
• 2 C1060: 159 s (252 MLUPS)
• 8 C1060: 42 s (955 MLUPS)
Blood flow pattern in a human coronary artery, Bernaschi et al.
Desktop GPU Supercomputer Beats Cluster
FASTRA: 8 GPUs in a desktop vs. CalcUA: 256 nodes (512 cores)
http://fastra.ua.ac.be/en/index.html
CUDA-Accelerated Linpack
Standard HPL code, with a library that intercepts DGEMM and DTRSM calls and executes them simultaneously on the GPUs and CPU cores. The library is implemented with CUBLAS.
Cluster with 8 nodes:
• Each node has 2 Intel Xeon E5462 (2.8 GHz), 16 GB of memory, and 2 Tesla GPUs (1.44 GHz clock)
• The nodes are connected with SDR InfiniBand

T/V               N     NB   P  Q    Time    Gflops
----------------------------------------------------
WR11R2L2     118144    960   4  4   874.26   1.258e+03
----------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N) = 0.0031157 ...... PASSED
Accelerating MATLAB®
Pseudo-spectral simulation of 2D isotropic turbulence:
• 1024x1024 mesh, 400 RK4 steps, Windows XP, Core2 Duo 2.4 GHz vs. GeForce 8800 GTX
• Use MEX files to call CUDA from MATLAB: 17x speed-up
http://developer.nvidia.com/object/matlab_cuda.html
Applications in Several Fields
• 146X: Astrophysics N-body simulation
• 36X: Interactive visualization of volumetric white matter connectivity
• 19X: Ionic placement for molecular dynamics simulation on GPU
• 17X: Transcoding HD video stream to H.264
• 100X: Simulation in Matlab using .mex file CUDA function
• 149X: Financial simulation of LIBOR model with swaptions
• 47X: GLAME@lab, an M-script API for linear algebra operations on GPU
• 20X: Ultrasound medical imaging for cancer diagnostics
• 24X: Highly optimized object-oriented molecular dynamics
• 30X: Cmatch exact string matching to find similar proteins and gene sequences
CUDA Basics
Arrays of Parallel Threads
A CUDA kernel is executed by an array of threads:
• All threads run the same code
• Each thread has an ID that it uses to compute memory addresses and make control decisions

threadID: 0 1 2 3 4 5 6 7

...
float x = input[threadID];
float y = func(x);
output[threadID] = y;
...
Example: Increment Array Elements
Increment N-element vector a by scalar b
Let's assume N=16 and blockDim=4 -> 4 blocks (blockDim.x=4 in every block):

blockIdx.x=0: threadIdx.x=0,1,2,3 -> idx=0,1,2,3
blockIdx.x=1: threadIdx.x=0,1,2,3 -> idx=4,5,6,7
blockIdx.x=2: threadIdx.x=0,1,2,3 -> idx=8,9,10,11
blockIdx.x=3: threadIdx.x=0,1,2,3 -> idx=12,13,14,15

int idx = blockDim.x * blockIdx.x + threadIdx.x;
maps from the local index threadIdx.x to a global index.

NB: blockDim should be >= 32 in real code; this is just an example.
Example: Increment Array Elements
CPU program:

void increment_cpu(float *a, float b, int N)
{
    for (int idx = 0; idx < N; idx++)
        a[idx] = a[idx] + b;
}

void main()
{
    .....
    increment_cpu(a, b, N);
}

CUDA program:

__global__ void increment_gpu(float *a, float b, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        a[idx] = a[idx] + b;
}

void main()
{
    .....
    dim3 dimBlock( blocksize );
    dim3 dimGrid( ceil( N / (float)blocksize ) );
    increment_gpu<<<dimGrid, dimBlock>>>(ad, bd, N);
}
Outline of CUDA Basics
• Basic Memory Management
• Basic Kernels and Execution on GPU
• Coordinating CPU and GPU Execution
• Development Resources
See the Programming Guide for the full API
Basic Memory Management
Memory Spaces
CPU and GPU have separate memory spaces:
• Data is moved across the PCIe bus
• Use functions to allocate/set/copy memory on the GPU
• Very similar to corresponding C functions
Pointers are just addresses:
• Can't tell from the pointer value whether the address is on the CPU or GPU
• Must exercise care when dereferencing: dereferencing a CPU pointer on the GPU will likely crash, and vice versa
GPU Memory Allocation / Release
Host (CPU) manages device (GPU) memory:
• cudaMalloc(void **pointer, size_t nbytes)
• cudaMemset(void *pointer, int value, size_t count)
• cudaFree(void *pointer)

int n = 1024;
int nbytes = 1024*sizeof(int);
int *d_a = 0;
cudaMalloc( (void**)&d_a, nbytes );
cudaMemset( d_a, 0, nbytes );
cudaFree( d_a );
Data Copies
cudaMemcpy(void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction);
• returns after the copy is complete
• blocks the CPU thread until all bytes have been copied
• doesn't start copying until previous CUDA calls complete
enum cudaMemcpyKind:
• cudaMemcpyHostToDevice
• cudaMemcpyDeviceToHost
• cudaMemcpyDeviceToDevice
Non-blocking memcopies are also provided
Code Walkthrough 1
• Allocate CPU memory for n integers
• Allocate GPU memory for n integers
• Initialize GPU memory to 0s
• Copy from GPU to CPU
• Print the values
Code Walkthrough 1

#include <stdio.h>

int main()
{
    int dimx = 16;
    int num_bytes = dimx*sizeof(int);

    int *d_a=0, *h_a=0; // device and host pointers

    h_a = (int*)malloc(num_bytes);
    cudaMalloc( (void**)&d_a, num_bytes );

    if( 0==h_a || 0==d_a ) {
        printf("couldn't allocate memory\n");
        return 1;
    }

    cudaMemset( d_a, 0, num_bytes );
    cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

    for(int i=0; i<dimx; i++)
        printf("%d ", h_a[i] );
    printf("\n");

    free( h_a );
    cudaFree( d_a );

    return 0;
}
Basic Kernels and Execution on GPU
CUDA Programming Model
Parallel code (a kernel) is launched and executed on a device by many threads:
• Threads are grouped into thread blocks
• Parallel code is written for a thread
• Each thread is free to execute a unique code path
• Built-in thread and block ID variables
Thread Hierarchy
Threads launched for a parallel section are partitioned into thread blocks:
• Grid = all blocks for a given launch
A thread block is a group of threads that can:
• Synchronize their execution
• Communicate via shared memory
IDs and Dimensions
Threads: 3D IDs, unique within a block
Blocks: 2D IDs, unique within a grid
Dimensions are set at launch time and can be unique for each grid
Built-in variables: threadIdx, blockIdx, blockDim, gridDim

(Diagram: a device running Grid 1 as a 3x2 array of blocks (0,0)-(2,1); Block (1,1) expanded into a 5x3 array of threads (0,0)-(4,2))
Code executed on GPU
A C function with some restrictions:
• Can only access GPU memory
• No variable number of arguments
• No static variables
• No recursion
Must be declared with a qualifier:
• __global__ : launched by the CPU, cannot be called from the GPU, must return void
• __device__ : called from other GPU functions, cannot be launched by the CPU
• __host__ : can be executed by the CPU
• __host__ and __device__ qualifiers can be combined (sample use: overloading operators)
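As an illustration of the combined qualifiers (a hypothetical example, not from the slides; clampf and clamp_kernel are made-up names), a small helper compiled for both host and device might look like:

```cuda
// Compiled twice by nvcc: once for the CPU, once for the GPU.
__host__ __device__ float clampf(float x, float lo, float hi)
{
    return x < lo ? lo : (x > hi ? hi : x);
}

__global__ void clamp_kernel(float *a, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        a[idx] = clampf(a[idx], 0.0f, 1.0f);  // device-side call
}
```

The same clampf can also be called from ordinary host code, which is what makes the combination useful for things like overloaded operators on small vector types.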
Code Walkthrough 2
• Build on Walkthrough 1
• Write a kernel to initialize integers
• Copy the result back to CPU
• Print the values
Kernel Code (executed on GPU)

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = 7;
}
Launching kernels on GPU
Launch parameters:
• grid dimensions (up to 2D), dim3 type
• thread-block dimensions (up to 3D), dim3 type
• shared memory: number of bytes per block
  • for extern smem variables declared without size
  • optional, 0 by default
• stream ID
  • optional, 0 by default

dim3 grid(16, 16);
dim3 block(16, 16);
kernel<<<grid, block, 0, 0>>>(...);
kernel<<<32, 512>>>(...);
#include <stdio.h>

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = 7;
}

int main()
{
    int dimx = 16;
    int num_bytes = dimx*sizeof(int);

    int *d_a=0, *h_a=0; // device and host pointers

    h_a = (int*)malloc(num_bytes);
    cudaMalloc( (void**)&d_a, num_bytes );

    if( 0==h_a || 0==d_a ) {
        printf("couldn't allocate memory\n");
        return 1;
    }

    cudaMemset( d_a, 0, num_bytes );

    dim3 grid, block;
    block.x = 4;
    grid.x = dimx / block.x;

    kernel<<<grid, block>>>( d_a );

    cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

    for(int i=0; i<dimx; i++)
        printf("%d ", h_a[i] );
    printf("\n");

    free( h_a );
    cudaFree( d_a );

    return 0;
}
Kernel Variations and Output

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = 7;
}
Output: 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = blockIdx.x;
}
Output: 0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = threadIdx.x;
}
Output: 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3
Code Walkthrough 3
• Build on Walkthrough 2
• Write a kernel to increment n×m integers
• Copy the result back to CPU
• Print the values
Kernel with 2D Indexing

__global__ void kernel( int *a, int dimx, int dimy )
{
    int ix = blockIdx.x*blockDim.x + threadIdx.x;
    int iy = blockIdx.y*blockDim.y + threadIdx.y;
    int idx = iy*dimx + ix;

    a[idx] = a[idx]+1;
}
int main()
{
    int dimx = 16;
    int dimy = 16;
    int num_bytes = dimx*dimy*sizeof(int);

    int *d_a=0, *h_a=0; // device and host pointers

    h_a = (int*)malloc(num_bytes);
    cudaMalloc( (void**)&d_a, num_bytes );

    if( 0==h_a || 0==d_a ) {
        printf("couldn't allocate memory\n");
        return 1;
    }

    cudaMemset( d_a, 0, num_bytes );

    dim3 grid, block;
    block.x = 4;
    block.y = 4;
    grid.x = dimx / block.x;
    grid.y = dimy / block.y;

    kernel<<<grid, block>>>( d_a, dimx, dimy );

    cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

    for(int row=0; row<dimy; row++) {
        for(int col=0; col<dimx; col++)
            printf("%d ", h_a[row*dimx+col] );
        printf("\n");
    }

    free( h_a );
    cudaFree( d_a );

    return 0;
}
Blocks must be independent
Any possible interleaving of blocks should be valid:
• presumed to run to completion without pre-emption
• can run in any order
• can run concurrently OR sequentially
Blocks may coordinate but not synchronize:
• shared queue pointer: OK
• shared lock: BAD ... can easily deadlock
The independence requirement gives scalability
Blocks must be independent
• Thread blocks can run in any order
• Concurrently or sequentially
• Facilitates scaling of the same code across many devices
(Diagram: scalability)
Coordinating CPU and GPU Execution
Synchronizing GPU and CPU
All kernel launches are asynchronous:
• control returns to the CPU immediately
• the kernel starts executing once all previous CUDA calls have completed
Memcopies are synchronous:
• control returns to the CPU once the copy is complete
• the copy starts once all previous CUDA calls have completed
cudaThreadSynchronize():
• blocks until all previous CUDA calls complete
Asynchronous CUDA calls provide:
• non-blocking memcopies
• the ability to overlap memcopies and kernel execution
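A minimal sketch of the overlap pattern (hypothetical host code, not from the slides; h_in, d_a, d_b, nbytes, grid, block, and kernel are assumed to be set up already, with h_in allocated as page-locked memory via cudaMallocHost, which cudaMemcpyAsync requires):

```cuda
cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);

// Copy in stream s1 while an independent kernel runs in stream s2.
cudaMemcpyAsync(d_a, h_in, nbytes, cudaMemcpyHostToDevice, s1);
kernel<<<grid, block, 0, s2>>>(d_b);

// Wait for all device work before using the results on the CPU.
cudaThreadSynchronize();

cudaStreamDestroy(s1);
cudaStreamDestroy(s2);
```

Work in different streams may overlap; work within one stream still executes in order.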
CUDA Error Reporting to CPU
All CUDA calls return an error code:
• except kernel launches
• cudaError_t type
cudaError_t cudaGetLastError(void)
• returns the code for the last error ("no error" has a code)
char* cudaGetErrorString(cudaError_t code)
• returns a null-terminated character string describing the error

printf("%s\n", cudaGetErrorString( cudaGetLastError() ) );
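A common convenience built from these two calls (a typical pattern, not part of the slides; CUDA_CHECK is a made-up name) is a macro that wraps every runtime call:

```cuda
#include <stdio.h>
#include <stdlib.h>

// Print the error string and abort if a runtime call fails.
#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t err = (call);                                 \
        if (err != cudaSuccess) {                                 \
            printf("CUDA error at %s:%d: %s\n",                   \
                   __FILE__, __LINE__, cudaGetErrorString(err));  \
            exit(1);                                              \
        }                                                         \
    } while (0)

// Usage:
//   CUDA_CHECK( cudaMalloc((void**)&d_a, nbytes) );
//   kernel<<<grid, block>>>(d_a);
//   CUDA_CHECK( cudaGetLastError() );   // catches launch errors
```

Because kernel launches return no error code themselves, the cudaGetLastError() check after the launch is what surfaces launch failures.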
CUDA Event API
Events are inserted (recorded) into CUDA call streams
Usage scenarios:
• measure elapsed time for CUDA calls (clock-cycle precision)
• query the status of an asynchronous CUDA call
• block the CPU until CUDA calls prior to the event are completed
• see the asyncAPI sample in the CUDA SDK

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
kernel<<<grid, block>>>(...);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
float et;
cudaEventElapsedTime(&et, start, stop);
cudaEventDestroy(start);
cudaEventDestroy(stop);
Device Management
The CPU can query and select GPU devices:
• cudaGetDeviceCount( int *count )
• cudaSetDevice( int device )
• cudaGetDevice( int *current_device )
• cudaGetDeviceProperties( cudaDeviceProp *prop, int device )
• cudaChooseDevice( int *device, cudaDeviceProp *prop )
Multi-GPU setup:
• device 0 is used by default
• one CPU thread can control one GPU
• multiple CPU threads can control the same GPU (calls are serialized by the driver)
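These calls compose into the usual enumeration loop (a sketch using only the API listed above):

```cuda
#include <stdio.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);

    // Print the name and multiprocessor count of each device,
    // then select device 0 explicitly.
    for (int dev = 0; dev < count; dev++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("device %d: %s, %d multiprocessors\n",
               dev, prop.name, prop.multiProcessorCount);
    }
    cudaSetDevice(0);
    return 0;
}
```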
Shared Memory
Shared Memory
On-chip memory:
• 2 orders of magnitude lower latency than global memory
• an order of magnitude higher bandwidth than gmem
• 16 KB per multiprocessor
  • NVIDIA GPUs contain up to 30 multiprocessors
Allocated per thread block:
• accessible by any thread in the thread block
• not accessible to other thread blocks
Several uses:
• sharing data among threads in a thread block
• user-managed cache (reducing gmem accesses)
Using shared memory
Size known at compile time:

__global__ void kernel(...)
{
    ...
    __shared__ float sData[256];
    ...
}

int main(void)
{
    ...
    kernel<<<nBlocks, blockSize>>>(...);
    ...
}

Size known at kernel launch:

__global__ void kernel(...)
{
    ...
    extern __shared__ float sData[];
    ...
}

int main(void)
{
    ...
    smBytes = blockSize*sizeof(float);
    kernel<<<nBlocks, blockSize, smBytes>>>(...);
    ...
}
Example of Using Shared Memory
Applying a 1D stencil:
• 1D data
• for each output element, sum all elements within a radius
For example, radius = 3:
• add 7 input elements (the element itself plus 3 on each side)

(Diagram: radius elements to the left and right of each output element)
Implementation with Shared Memory
• 1D thread blocks (partition the output)
• Each thread block outputs BLOCK_DIMX elements
• Read input from gmem to smem
  • needs BLOCK_DIMX + 2*RADIUS input elements
• Compute
• Write output to gmem

(Diagram: the input elements corresponding to the output, as many as there are threads in a thread block, flanked by a "halo" of RADIUS elements on each side)
Kernel code
__global__ void stencil( int *output, int *input, int dimx, int dimy )
{
    __shared__ int s_a[BLOCK_DIMX+2*RADIUS];

    int global_ix = blockIdx.x*blockDim.x + threadIdx.x;
    int local_ix  = threadIdx.x + RADIUS;

    s_a[local_ix] = input[global_ix];

    // the first RADIUS threads also load the left and right halos
    if ( threadIdx.x < RADIUS ) {
        s_a[local_ix - RADIUS] = input[global_ix - RADIUS];
        s_a[local_ix + BLOCK_DIMX] = input[global_ix + BLOCK_DIMX];
    }
    __syncthreads();

    int value = 0;
    for( int offset = -RADIUS; offset<=RADIUS; offset++ )
        value += s_a[ local_ix + offset ];

    output[global_ix] = value;
}
Thread Synchronization Function
void __syncthreads();
• Synchronizes all threads in a thread block
  • needed since threads are scheduled at run time
• Once all threads have reached this point, execution resumes normally
• Used to avoid RAW / WAR / WAW hazards when accessing shared memory
• Should be used in conditional code only if the conditional is uniform across the entire thread block
Memory Model Review
Local storage:
• each thread has its own local storage
• mostly registers (managed by the compiler)
• data lifetime = thread lifetime
Shared memory:
• each thread block has its own shared memory
• accessible only by threads within that block
• data lifetime = block lifetime
Global (device) memory:
• accessible by all threads as well as the host (CPU)
• data lifetime = from allocation to deallocation
Memory Model Review
(Diagram: each thread has per-thread local storage; each block has per-block shared memory)
Memory Model Review
(Diagram: sequential kernels, Kernel 0 then Kernel 1, both accessing per-device global memory)
Memory Model Review
(Diagram: host memory connected to device 0 memory and device 1 memory via cudaMemcpy())