David Luebke, NVIDIA Research
GPU Architecture & Implications
© NVIDIA Corporation 2007
GPU Architecture
CUDA provides a parallel programming model
The Tesla GPU architecture implements this
This talk will describe the characteristics, goals, and implications of that architecture
G80 GPU Implementation: Tesla C870
681 million transistors, 470 mm2 in 90 nm CMOS
128 thread processors, 518 GFLOPS peak, 1.35 GHz processor clock
1.5 GB DRAM, 76 GB/s peak, 800 MHz GDDR3 clock, 384-pin DRAM interface
ATX form factor card, PCI Express x16, 170 W max with DRAM
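A quick sanity check on those peak numbers (assuming the commonly described dual-issue MAD+MUL design of the G80 thread processors):

518 GFLOPS ≈ 128 processors × 1.35 GHz × 3 flops/clock (2 for the MAD plus 1 for the co-issued MUL)
76.8 GB/s = 800 MHz × 2 (DDR) × 384 bits / 8 bits per byte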
Block Diagram Redux

G80 (launched Nov 2006):
- 128 thread processors execute kernel threads
- Up to 12,288 parallel threads active
- Per-block shared memory (PBSM) accelerates processing
[Block diagram: Host → Input Assembler → Thread Execution Manager → arrays of thread processors, each with per-block shared memory (PBSM), with load/store access to Global Memory]
Streaming Multiprocessor (SM)

- Processing elements: 8 scalar thread processors (SP); 32 GFLOPS peak at 1.35 GHz; 8192 32-bit registers (32 KB), ½ MB total register file space!; usual ops: float, int, branch, …
- Hardware multithreading: up to 8 blocks resident at once; up to 768 active threads in total
- 16 KB on-chip memory: low-latency storage, shared among the threads of a block, supports thread communication
[SM diagram: MT issue unit, SPs, shared memory, threads t0 t1 … tB]
Goal: Scalability
Scalable execution:
- Program must be insensitive to the number of cores
- Write one program for any number of SM cores
- Program runs on any size GPU without recompiling

Hierarchical execution model:
- Decompose problem into sequential steps (kernels)
- Decompose kernel into computing parallel blocks
- Decompose block into computing parallel threads
Hardware distributes independent blocks to SMs as available
Blocks Run on Multiprocessors
Kernel launched by host . . .
[Diagram: the kernel's blocks are distributed across the device processor array of SMs (each with MT issue unit, thread processors, and shared memory), all accessing device memory]
Goal: easy to program
Strategies:
- Familiar programming language mechanics: C/C++ with small extensions
- Simple parallel abstractions: simple barrier synchronization, shared memory semantics, hardware-managed hierarchy of threads
Hardware Multithreading

- Hardware allocates resources to blocks: blocks need thread slots, registers, and shared memory; blocks don't run until resources are available
- Hardware schedules threads: threads have their own registers; any thread not waiting for something can run; context switching is (basically) free, every cycle
- Hardware relies on threads to hide latency, i.e., parallelism is necessary for performance
[SM diagram: MT issue unit, SPs, shared memory]
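To make that resource accounting concrete, here is a small illustrative host-side helper (hypothetical, not a CUDA API; the real hardware also rounds allocations to granularities not modeled here) that estimates resident blocks per G80 SM from the limits listed on the SM slide (8 blocks, 768 threads, 8192 registers, 16 KB shared memory):

#include <stdio.h>

// Estimate how many blocks fit on one G80 SM. Purely illustrative arithmetic.
int blocksPerSM(int threadsPerBlock, int regsPerThread, int smemPerBlock)
{
    int byThreads = 768 / threadsPerBlock;                     // thread-slot limit
    int byRegs    = 8192 / (regsPerThread * threadsPerBlock);  // register file limit
    int bySmem    = 16384 / smemPerBlock;                      // shared memory limit
    int byHW      = 8;                                         // hard cap on resident blocks
    int n = byThreads;
    if (byRegs < n) n = byRegs;
    if (bySmem < n) n = bySmem;
    if (byHW   < n) n = byHW;
    return n;
}

int main(void)
{
    // e.g. 256 threads/block, 10 registers/thread, 4 KB shared memory/block
    printf("%d blocks resident\n", blocksPerSM(256, 10, 4096));  // prints 3
    return 0;
}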
Goal: Performance per millimeter
For GPUs, performance == throughput.

Strategy: hide latency with computation, not cache
- Heavy multithreading (already discussed by Kevin)
- Implication: need many threads to hide latency
- Occupancy: typically need 128 threads/SM minimum
- Multiple thread blocks per SM are good to minimize the effect of barriers

Strategy: Single Instruction Multiple Thread (SIMT)
- Balances performance with ease of programming
SIMT Thread Execution

- Groups of 32 threads are formed into warps: always executing the same instruction; shared instruction fetch/dispatch; some threads become inactive when code paths diverge; hardware automatically handles divergence
- Warps are the primitive unit of scheduling: pick 1 of 24 warps for each instruction slot
- SIMT execution is an implementation choice: sharing control logic leaves more space for ALUs; largely invisible to the programmer; must understand it for performance, not correctness
[SM diagram: MT issue unit, SPs, shared memory]
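A minimal sketch of what divergence means in practice (hypothetical kernels, illustrative only): branching on the thread index within a warp serializes both paths, while branching at warp granularity does not.

// Threads of each warp take different paths: the warp executes BOTH
// branches with threads masked off, costing roughly the sum of the paths.
__global__ void divergent(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)         // alternates within a warp -> divergence
        out[i] = 1.0f;
    else
        out[i] = 2.0f;
}

// Here all 32 threads of a warp agree on the branch (warp = 32 consecutive
// threads), so each warp executes only one path: no divergence penalty.
__global__ void uniform(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x / 32) % 2 == 0)  // constant across each warp
        out[i] = 1.0f;
    else
        out[i] = 2.0f;
}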
SIMT Multithreaded Execution

- Weaving: the original parallel thread technology, about 10,000 years old
- Warp: a set of 32 parallel threads that execute a SIMD instruction
- SM hardware implements zero-overhead warp and thread scheduling: each SM executes up to 768 concurrent threads, as 24 SIMD warps of 32 threads
- Threads can execute independently: a SIMD warp automatically diverges and converges when threads branch; best efficiency and performance when the threads of a warp execute together
- SIMT across threads (not just SIMD across data) gives easy single-thread scalar programming with SIMD efficiency
[Diagram: SM multithreaded instruction scheduler issuing, over time: warp 8 instruction 11; warp 1 instruction 42; warp 3 instruction 95; …; warp 8 instruction 12; warp 3 instruction 96]
Memory Architecture

- Direct load/store access to device memory: treated as the usual linear sequence of bytes (i.e., not pixels)
- Texture & constant caches are read-only access paths
- On-chip shared memory is shared among the threads of a block: important for communication among threads; provides low-latency temporary storage (~100x less than DRAM)
[Diagram: SM (MT issue unit, I-cache, SPs, shared memory) with load/store to device memory; read-only texture cache and constant cache; host memory reached over PCIe]
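To make "the usual linear sequence of bytes" concrete: a thread can compute an arbitrary address and store to it (scatter), which the graphics path discussed later cannot do. A minimal hypothetical sketch:

// Each thread reads in[i] and writes it to a computed location out[perm[i]].
// This arbitrary (scatter) addressing is ordinary pointer arithmetic in CUDA,
// not a texture fetch into a fixed render target.
__global__ void scatter(const float *in, const int *perm, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[perm[i]] = in[i];
}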
Myths of GPU Computing

- Myth: GPUs layer normal programs on top of graphics. NO: CUDA compiles directly to the hardware.
- Myth: GPU architectures are very wide (1000s) SIMD machines… NO: warps are 32-wide.
- Myth: …on which branching is impossible or prohibitive… NOPE.
- Myth: …with 4-wide vector registers. NO: scalar thread processors.
- Myth: GPUs are power-inefficient. NO: 4-10x perf/W advantage, up to 89x reported for some studies.
- Myth: GPUs don't do real floating point. See the floating-point feature table below.
GPU Floating Point Features

Feature                            | G80                          | SSE                                  | IBM Altivec                | Cell SPE
Precision                          | IEEE 754                     | IEEE 754                             | IEEE 754                   | IEEE 754
Rounding modes for FADD and FMUL   | Round to nearest and to zero | All 4 IEEE: nearest, zero, inf, -inf | Round to nearest only      | Round to zero/truncate only
Denormal handling                  | Flush to zero                | Supported, 1000's of cycles          | Supported, 1000's of cycles | Flush to zero
NaN support                        | Yes                          | Yes                                  | Yes                        | No
Overflow and infinity support      | Yes, only clamps to max norm | Yes                                  | Yes                        | No, infinity
Flags                              | No                           | Yes                                  | Yes                        | Some
Square root                        | Software only                | Hardware                             | Software only              | Software only
Division                           | Software only                | Hardware                             | Software only              | Software only
Reciprocal estimate accuracy       | 24 bit                       | 12 bit                               | 12 bit                     | 12 bit
Reciprocal sqrt estimate accuracy  | 23 bit                       | 12 bit                               | 12 bit                     | 12 bit
log2(x) and 2^x estimates accuracy | 23 bit                       | No                                   | 12 bit                     | No
Do GPUs Do Real IEEE FP?

- G8x GPU FP is IEEE 754: comparable to other processors / accelerators; more precise / usable in some ways, less precise in other ways
- GPU FP getting better every generation: double precision support shortly; goal: best of class by 2009
Questions?
David Luebke, [email protected]
Applications & Sweet Spots
GPU Computing Sweet Spots
Applications with:
- High arithmetic intensity: dense linear algebra, PDEs, n-body, finite difference, …
- High bandwidth: sequencing (virus scanning, genomics), sorting, database, …
- Visual computing: graphics, image processing, tomography, machine vision, …
GPU Computing Example Markets

Computational modeling, computational chemistry, computational medicine, computational science, computational biology, computational finance, computational geoscience, image processing
Applications - Condensed

3D image analysis, adaptive radiation therapy, acoustics, astronomy, audio, automobile vision, bioinformatics, biological simulation, broadcast, cellular automata, computational fluid dynamics, computer vision, cryptography, CT reconstruction, data mining, digital cinema/projection, electromagnetic simulation, equity trading, film, financial (lots of areas), languages, GIS, holographic cinema, imaging (lots), mathematics research, military (lots), mine planning, molecular dynamics, MRI reconstruction, multispectral imaging, n-body, network processing, neural networks, oceanographic research, optical inspection, particle physics, protein folding, quantum chemistry, ray tracing, radar, reservoir simulation, robotic vision/AI, robotic surgery, satellite data analysis, seismic imaging, surgery simulation, surveillance, ultrasound, video conferencing, telescope, video, visualization, wireless, X-ray
GPU Computing Sweet Spots

- From cluster to workstation: the "personal supercomputing" phase change
- From lab to clinic; from machine room to engineer and grad-student desks; from batch processing to interactive; from interactive to real-time
- GPU-enabled clusters: a 100x or better speedup changes the science
- Solve at different scales: direct brute-force methods may outperform cleverness; new bottlenecks may emerge; approaches once inconceivable may become practical
New Applications

- Real-time options implied volatility engine
- Swaption volatility cube calculator
- Manifold 8 GIS
- Ultrasound imaging
- HOOMD Molecular Dynamics
- Also: image rotation/classification, graphics processing toolbox, microarray data analysis, data-parallel primitives, astrophysics simulations
- SDK: Mandelbrot, computer vision
- Seismic migration
The Future of GPUs

- GPU computing drives new applications: reducing "time to discovery"; a 100x speedup changes science and research methods
- New applications drive the future of GPUs and GPU computing: driving new GPU capabilities and the hunger for more performance
- Some exciting new domains: vision, acoustic, and embedded applications; large-scale simulation & physics
Accuracy & Performance
CUDA Performance Advantages
Performance:
- BLAS1: 60+ GB/sec
- BLAS3: 127 GFLOPS
- FFT: 52 benchFFT* GFLOPS
- FDTD: 1.2 Gcells/sec
- SSEARCH: 5.2 Gcells/sec
- Black Scholes: 4.7 GOptions/sec
- VMD: 290 GFLOPS

How:
- Leveraging shared memory
- GPU memory bandwidth
- GPU GFLOPS performance
- Custom hardware intrinsics: __sinf(), __cosf(), __expf(), __logf(), …
All benchmarks are compiled code!
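For example, a kernel can trade accuracy for speed by calling those intrinsics directly (a sketch; __sinf() maps to the fast hardware approximation, while sinf() is the slower, more accurate routine):

// Fast but approximate: __sinf() uses the hardware special-function path.
__global__ void fastSines(const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    y[i] = __sinf(x[i]);   // swap in sinf(x[i]) when full accuracy matters
}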
GPGPU vs. GPU Computing
Problem: GPGPU

OLD: GPGPU means tricking the GPU into general-purpose computing by casting the problem as graphics:
- Turn data into images ("texture maps")
- Turn algorithms into image synthesis ("rendering passes")

Promising results, but:
- Tough learning curve, particularly for non-graphics experts
- Potentially high overhead of the graphics API
- Highly constrained memory layout & access model
- Need for many passes drives up bandwidth consumption
Solution: CUDA

NEW: GPU Computing with CUDA
- CUDA = Compute Unified Device Architecture
- Co-designed hardware & software for direct GPU computing

Hardware: fully general data-parallel architecture
- General thread launch
- Global load-store
- Parallel data cache
- Scalar architecture
- Integers, bit operations
- Double precision (soon)
- Scalable data-parallel execution/memory model

Software: program the GPU in C
- C with minimal yet powerful extensions
Graphics Programming Model

[Pipeline: Graphics Application → Vertex Program → Rasterization → Fragment Program → Display]
Streaming GPGPU Programming

An OpenGL program to add A and B:
- Start by creating a quad
- Read textures as input to the OpenGL shader program
- "Programs" created with raster operations
- Write the answer to texture memory as a "color"
- CPU reads texture memory for results
[Pipeline: Vertex Program → Rasterization → Fragment Program]
All this just to do A + B!
What's Wrong With GPGPU?

[Diagram: Application → Vertex Program → Rasterization → Fragment Program → Display; the fragment program reads input registers, constants, and textures, uses temp registers, and writes only output registers]

- APIs are specific to graphics
- Limited texture size and dimension
- Limited shader outputs
- No scatter
- Limited instruction set
- No thread communication
- Limited local storage
Building a Better Pixel

[Diagram: a fragment program with input registers, constants, texture, and registers, writing output registers]
Building a Better Pixel Thread

[Diagram: a thread program keyed by thread number, with constants, texture, and registers, writing output registers]

Features:
- Millions of instructions
- Full integer and bit instructions
- No limits on branching, looping
- 1D, 2D, or 3D thread ID allocation
Global Memory

[Diagram: a thread program keyed by thread number, with constants, texture, and registers, reading and writing global memory]

Features:
- Fully general load/store to GPU memory
- Untyped, not fixed texture types
- Pointer support
Parallel Data Cache

[Diagram: a thread program keyed by thread number, with constants, texture, registers, a parallel data cache, and global memory]

Features:
- Dedicated on-chip memory
- Shared between threads for inter-thread communication
- Explicitly managed
- As fast as registers
Example Algorithm: Fluids

Goal: calculate PRESSURE in a fluid.
Pressure = sum of neighboring pressures: Pn' = P1 + P2 + P3 + P4
Pressure depends on neighbors, so the pressure for each particle is:
Pressure1 = P1 + P2 + P3 + P4
Pressure2 = P3 + P4 + P5 + P6
Pressure3 = P5 + P6 + P7 + P8
Pressure4 = P7 + P8 + P9 + P10
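A sketch of how this maps onto the parallel data cache (hypothetical 1D kernel; names are illustrative, and the slide's stride-2 neighbor indexing is simplified to a stride-1 window): each block stages its pressures into shared memory once, then every thread sums its neighbors from the cache instead of re-fetching them from DRAM.

#define BLOCK 256

// Each thread computes Pn' = sum of 4 neighboring pressures.
// Pressures are staged once into shared memory so neighboring threads
// share them, instead of each re-reading global memory (the GPGPU behavior).
__global__ void pressureSum(const float *P, float *Pout, int n)
{
    __shared__ float cache[BLOCK + 4];           // block's pressures + halo
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Stage this block's pressures (plus a 4-wide halo) into the cache.
    if (i < n) cache[threadIdx.x] = P[i];
    if (threadIdx.x < 4 && i + BLOCK < n)
        cache[BLOCK + threadIdx.x] = P[i + BLOCK];
    __syncthreads();                             // barrier before sharing

    // Sum the four neighbors from low-latency shared memory.
    if (i + 4 <= n)
        Pout[i] = cache[threadIdx.x]     + cache[threadIdx.x + 1]
                + cache[threadIdx.x + 2] + cache[threadIdx.x + 3];
}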
Example Fluid Algorithm

[Diagram comparing three approaches on Pn' = P1+P2+P3+P4:]
- CPU: single thread out of cache. One control unit and ALU with cache and DRAM; pressures P1..P4 stream through the cache one element at a time (program/control separate from data/computation).
- GPGPU: multiple passes through video memory. Many control/ALU pairs, but each one re-fetches P1,P2,P3,P4 from video memory for every Pn' it computes.
- GPU Computing with CUDA: parallel execution through cache. A thread execution manager feeds many ALUs; P1..P5 sit once in the parallel data cache as shared data, and each thread computes its Pn' from there.
Parallel Data Cache

Bring the data closer to the ALU.
Addresses a fundamental problem of stream computing:
- The data are far from the FLOPS; video RAM latency is high
- Threads can only communicate their results through this high-latency RAM
[GPGPU diagram: multiple passes through video memory; each ALU separately fetches P1,P2,P3,P4 and writes its Pn' = P1+P2+P3+P4 back to video memory]
Parallel Data Cache

Parallel execution through cache: bring the data closer to the ALU.
- Stage computation for the parallel data cache
- Minimize trips to external memory
- Share values to minimize overfetch and computation
- Increases arithmetic intensity by keeping data close to the processors
- User-managed generic memory; threads read/write arbitrarily
[CUDA diagram: thread execution manager feeding many ALUs; P1..P5 held once as shared data in the parallel data cache in front of DRAM; each Pn' = P1+P2+P3+P4 computed in parallel]
Streaming vs. GPU Computing

- Streaming (GPGPU): gather in, restricted write; memory is far from the ALU; no inter-element communication
- GPU Computing with CUDA: more general data-parallel model; full scatter/gather; the PDC brings the data closer to the ALU; the app decides how to decompose the problem across threads; threads share and communicate to solve problems efficiently
[Diagram: GPGPU ALUs read/write distant memory directly; CUDA ALUs work out of the on-chip parallel data cache]
GPU Design
CPU/GPU Parallelism

Moore's Law gives you more and more transistors. What do you want to do with them?

CPU strategy: make the workload (one compute thread) run as fast as possible
- Tactics: cache (area limiting), instruction/data prefetch, speculative execution
- Limited by "perimeter": communication bandwidth
- …then add task parallelism: multi-core

GPU strategy: make the workload (as many threads as possible) run as fast as possible
- Tactics: parallelism (1000s of threads), pipelining
- Limited by "area": compute capability
Background: Unified Design
Hardware Implementation: Collection of SIMT Multiprocessors

- Each multiprocessor is a set of SIMT (Single Instruction, Multiple Thread) thread processors
- Each thread processor has: program counter, register file, etc.; a scalar data path; read/write memory access
- Unit of SIMT execution: warp. A warp executes the same instruction each clock; hardware handles thread scheduling and divergence transparently
- Warps enable a friendly data-parallel programming model!
[Diagram: device containing Multiprocessors 1..N, each with Processors 1..M and an instruction unit]
Hardware Implementation: Memory Architecture

- The device has local device memory: can be read and written by the host and by the multiprocessors
- Each multiprocessor has: a set of 32-bit registers per processor; on-chip shared memory; a read-only constant cache; a read-only texture cache
[Diagram: device containing Multiprocessors 1..N; each has Processors 1..M with registers, a shared memory, an instruction unit, a constant cache, and a texture cache, all above device memory]
Hardware Implementation: Memory Model

Each thread can:
- Read/write per-block on-chip shared memory
- Read per-grid cached constant memory
- Read/write non-cached device memory: per-grid global memory, per-thread local memory
- Read cached texture memory
[Diagram: a grid of blocks; each block has its shared memory and threads with their own registers and local memory; constant, texture, and global memory are per-grid]
CUDA Programming
CUDA SDK

[Diagram: integrated CPU and GPU C source code goes through the NVIDIA C compiler, producing NVIDIA assembly for computing (run on the GPU via the CUDA driver, with debugger and profiler) and CPU host code (built with a standard C compiler); libraries (FFT, BLAS, …) and example source code sit on top]
CUDA: Features available to kernels

- Standard mathematical functions: sinf, powf, atanf, ceil, etc.
- Built-in vector types: float4, int4, uint4, etc. for dimensions 1..4
- Texture accesses in kernels:

  texture<float,2> my_texture;  // declare texture reference
  float4 texel = texfetch(my_texture, u, v);
G8x CUDA = C with Extensions

Philosophy: provide the minimal set of extensions necessary to expose power.

Function qualifiers:
__global__ void MyKernel() { }
__device__ float MyDeviceFunc() { }

Variable qualifiers:
__constant__ float MyConstantArray[32];
__shared__   float MySharedArray[32];

Execution configuration:
dim3 dimGrid(100, 50);   // 5000 thread blocks
dim3 dimBlock(4, 8, 8);  // 256 threads per block
MyKernel <<< dimGrid, dimBlock >>> (...);  // launch kernel

Built-in variables and functions valid in device code:
dim3 gridDim;    // grid dimension
dim3 blockDim;   // block dimension
dim3 blockIdx;   // block index
dim3 threadIdx;  // thread index
void __syncthreads();  // thread synchronization
CUDA: Runtime support

- Explicit memory allocation returns pointers to GPU memory: cudaMalloc(), cudaFree()
- Explicit memory copy for host ↔ device, device ↔ device: cudaMemcpy(), cudaMemcpy2D(), …
- Texture management: cudaBindTexture(), cudaBindTextureToArray(), …
- OpenGL & DirectX interoperability: cudaGLMapBufferObject(), cudaD3D9MapVertexBuffer(), …
Example: Adding matrices w/ 2D grids

CPU C program:

void addMatrix(float *a, float *b, float *c, int N)
{
    int i, j, index;
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            index = i + j * N;
            c[index] = a[index] + b[index];
        }
    }
}

void main()
{
    .....
    addMatrix(a, b, c, N);
}

CUDA C program:

__global__ void addMatrix(float *a, float *b, float *c, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int index = i + j * N;
    if (i < N && j < N)
        c[index] = a[index] + b[index];
}

void main()
{
    ..... // allocate & transfer data to GPU
    dim3 dimBlk(blocksize, blocksize);
    dim3 dimGrd(N / dimBlk.x, N / dimBlk.y);
    addMatrix<<<dimGrd, dimBlk>>>(a, b, c, N);
}
Example: Vector Addition Kernel

// Compute vector sum C = A + B
// Each thread performs one pair-wise addition
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    C[i] = A[i] + B[i];
}
Example: Invoking the Kernel

__global__ void vecAdd(float* A, float* B, float* C);

void main()
{
    // Execute on N/256 blocks of 256 threads each
    vecAdd<<< N/256, 256 >>>(d_A, d_B, d_C);
}
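The N/256 launch above assumes N is a multiple of 256. A common variation (a sketch, adding an N parameter the slide's kernel does not have) rounds the block count up and guards the kernel:

// Kernel guarded against out-of-range threads in the final block.
__global__ void vecAdd(float* A, float* B, float* C, int N)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}

// Launch: enough 256-thread blocks to cover any N.
// vecAdd<<< (N + 255) / 256, 256 >>>(d_A, d_B, d_C, N);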
Example: Host code for memory

// allocate host (CPU) memory
float* h_A = (float*) malloc(N * sizeof(float));
float* h_B = (float*) malloc(N * sizeof(float));
… initialize h_A and h_B …

// allocate device (GPU) memory
float *d_A, *d_B, *d_C;
cudaMalloc((void**) &d_A, N * sizeof(float));
cudaMalloc((void**) &d_B, N * sizeof(float));
cudaMalloc((void**) &d_C, N * sizeof(float));

// copy host memory to device
cudaMemcpy(d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice);

// execute the kernel on N/256 blocks of 256 threads each
vecAdd<<< N/256, 256 >>>(d_A, d_B, d_C);
A quick review

- device = GPU = set of multiprocessors
- multiprocessor = set of processors & shared memory
- kernel = GPU program
- grid = array of thread blocks that execute a kernel
- thread block = group of SIMD threads that execute a kernel and can communicate via shared memory

Memory   | Location | Cached         | Access     | Who
Local    | Off-chip | No             | Read/write | One thread
Shared   | On-chip  | N/A (resident) | Read/write | All threads in a block
Global   | Off-chip | No             | Read/write | All threads + host
Constant | Off-chip | Yes            | Read       | All threads + host
Texture  | Off-chip | Yes            | Read       | All threads + host
Data-Parallel Programming
Scan Literature

Pre-Hibernation:
- First proposed in APL by Iverson (1962)
- Used as a data-parallel primitive in the Connection Machine (1990); feature of C* and CM-Lisp
- Guy Blelloch used scan as a primitive for various parallel algorithms; his balanced-tree scan is used in the example here (Blelloch, 1990, "Prefix Sums and Their Applications")

Post-Democratization:
- O(n log n) work GPU implementation by Daniel Horn (GPU Gems 2)
- Applied to summed-area tables by Hensley et al. (EG05)
- O(n) work GPU scan by Sengupta et al. (EDGE06) and Greß et al. (EG06)
- O(n) work & space GPU implementation by Harris et al. (2007): NVIDIA CUDA SDK and GPU Gems 3; applied to radix sort, stream compaction, and summed-area tables
Parallel Reduction Complexity

- log(N) parallel steps; each step S does N/2^S independent ops
- Step complexity is O(log N)
- For N = 2^D, performs ∑_{S∈[1..D]} 2^(D−S) = N − 1 operations
- Work complexity is O(N): it is work-efficient, i.e., it does not perform more operations than a sequential algorithm
- With P threads physically in parallel (P processors), time complexity is O(N/P + log N); compare to O(N) for sequential reduction
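For reference, a minimal tree-reduction kernel along these lines (a sketch in the spirit of the CUDA SDK reduction; bd is the block size and t the thread index, matching the unrolled code on the next slides):

// Each block reduces bd elements in shared memory in log2(bd) steps,
// then writes one partial sum; a second pass (or the host) combines blocks.
// Launch as: reduce<<< numBlocks, bd, bd * sizeof(int) >>>(in, out);
__global__ void reduce(const int *g_idata, int *g_odata)
{
    extern __shared__ int data[];               // bd ints, sized at launch
    unsigned int t  = threadIdx.x;
    unsigned int bd = blockDim.x;

    data[t] = g_idata[blockIdx.x * bd + t];     // load one element per thread
    __syncthreads();

    // Step s halves the number of active threads each iteration: O(log N) steps.
    for (unsigned int s = bd / 2; s > 0; s >>= 1) {
        if (t < s)
            data[t] += data[t + s];
        __syncthreads();
    }
    if (t == 0) g_odata[blockIdx.x] = data[0];  // one partial sum per block
}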
Unrolling Last Steps

Only one warp is active during the last few steps. Unroll them and remove the unneeded __syncthreads():

for (unsigned int s = bd/2; s > 32; s >>= 1) {
    if (t < s) {
        data[t] += data[t + s];
    }
    __syncthreads();
}
if (t < 32) data[t] += data[t + 32];
if (t < 16) data[t] += data[t + 16];
if (t <  8) data[t] += data[t +  8];
if (t <  4) data[t] += data[t +  4];
if (t <  2) data[t] += data[t +  2];
if (t <  1) data[t] += data[t +  1];
Unrolling the Loop Completely

When the block size is known at compile time, we can completely unroll the loop. It often is, since the maximum thread block size of 512 constrains us. Use templates:

#define STEP(d) \
    if (t < (d)) data[t] += data[t + (d)];

#define SYNC __syncthreads();

template <unsigned int bsize>
__global__ void d_reduce(int *g_idata, int *g_odata)
{
    ...
    if (bsize == 512) STEP(512) SYNC
    if (bsize >= 256) STEP(256) SYNC
    if (bsize >= 128) STEP(128) SYNC
    if (bsize >=  64) STEP(64)  SYNC
    if (bsize >=  32) { STEP(32) STEP(16) STEP(8)
                        STEP(4)  STEP(2)  STEP(1) }
}
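Since bsize must be a compile-time constant, the host picks the instantiation at launch time (a sketch of the usual dispatch pattern):

// Dispatch to the right template instantiation for the runtime block size.
void launch_reduce(int threads, int blocks, int smem,
                   int *g_idata, int *g_odata)
{
    switch (threads) {
    case 512: d_reduce<512><<<blocks, threads, smem>>>(g_idata, g_odata); break;
    case 256: d_reduce<256><<<blocks, threads, smem>>>(g_idata, g_odata); break;
    case 128: d_reduce<128><<<blocks, threads, smem>>>(g_idata, g_odata); break;
    case  64: d_reduce< 64><<<blocks, threads, smem>>>(g_idata, g_odata); break;
    case  32: d_reduce< 32><<<blocks, threads, smem>>>(g_idata, g_odata); break;
    }
}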
GPU Computing Motivation
Computing Challenge

[Diagram: task computing vs. data computing]
Extreme Growth in Raw Data

[Charts of exponential data growth:]
- NOAA weather data, in petabytes (source: John Bates, NOAA Nat. Climate Center); NOAA/NASA weather data projected from 2002 out to 2017
- YouTube bandwidth growth, in millions (source: Alexa, YouTube 2006)
- Walmart transaction tracking, in millions (source: Hedburg, CPI, Walmart)
- BP oil and gas active data, in terabytes (source: Jim Farnsworth, BP May 2005)
Computational Horsepower

The GPU is a massively parallel computation engine:
- High memory bandwidth (5-10x CPU)
- High floating-point performance (5-10x CPU)
Benchmarking: CPU vs. GPU Computing

G80 vs. Core2 Duo 2.66 GHz, measured against commercial CPU benchmarks when possible.

"Free" Massively Parallel Processors
It's not science fiction, it's just funded by them.
Success Stories
Success Stories: Data to Design

Acceleware EM field simulation technology for the GPU
- 3D finite-difference (FDTD) and finite-element modeling
- Modeling of: cell phone irradiation; MRI design/modeling; printed circuit boards; radar cross section (military); pacemaker with transmit antenna
[Chart: performance in Mcells/s (0-700): CPU 3.2 GHz = 1X; 1 GPU = 5X; 2 GPUs = 10X; 4 GPUs = 20X]
Evolved Machines
- 130X speedup
- Simulates brain circuitry
- Sensory computing: vision, olfactory
10X with MATLAB CPU+GPU

Pseudo-spectral simulation of 2D isotropic turbulence. MATLAB: the language of science.
http://www.amath.washington.edu/courses/571-winter-2006/matlab/FS_2Dturb.m
http://developer.nvidia.com/object/matlab_cuda.html
MATLAB Example: Advection of an elliptic vortex

256x256 mesh, 512 RK4 steps, Linux, MATLAB file:
http://www.amath.washington.edu/courses/571-winter-2006/matlab/FS_vortex.m
- MATLAB: 168 seconds
- MATLAB with CUDA (single-precision FFTs): 20 seconds
MATLAB Example: Pseudo-spectral simulation of 2D isotropic turbulence

512x512 mesh, 400 RK4 steps, Windows XP, MATLAB file:
http://www.amath.washington.edu/courses/571-winter-2006/matlab/FS_2Dturb.m
- MATLAB: 992 seconds
- MATLAB with CUDA (single-precision FFTs): 93 seconds
NAMD/VMD Molecular Dynamics

240X speedup, computational biology
http://www.ks.uiuc.edu/Research/vmd/projects/ece498/lecture/
Molecular Dynamics Example

Case study: molecular dynamics research at U. Illinois Urbana-Champaign
- (Scientist-sponsored) course project for CS 498AL: Programming Massively Parallel Multiprocessors (Kirk/Hwu)
- Next slides stolen from a nice description of the problem, algorithms, and iterative optimization process, available at:
http://www.ks.uiuc.edu/Research/vmd/projects/ece498/lecture/
Molecular Modeling: Ion Placement

- Biomolecular simulations attempt to replicate in vivo conditions in silico
- Model structures are initially constructed in vacuum
- Solvent (water) and ions are added as necessary for the required biological conditions
- Computational requirements scale with the size of the simulated structure
Evolution of Ion Placement Code

- First implementation was sequential: a virus structure with 10^6 atoms would require 10 CPU days
- Tuned for Intel C/C++ vectorization + SSE: ~20x speedup
- Parallelized with pthreads: high data parallelism = linear speedup
- Parallelized, GPU-accelerated implementation: 3 GeForce 8800 GTX cards outrun ~300 Itanium2 CPUs!
- The virus structure now runs in 25 seconds on 3 GPUs!
- Further speedups should still be possible…
Multi-GPU CUDA Coulombic Potential Map Performance

- Host: Intel Core 2 Quad, 8 GB RAM, ~$3,000
- 3 GPUs: NVIDIA GeForce 8800 GTX, ~$550 each
- 32-bit RHEL4 Linux (want 64-bit CUDA!!)
- 235 GFLOPS per GPU for the current version of the coulombic potential map kernel
- 705 GFLOPS total for the multithreaded multi-GPU version
Three GeForce 8800 GTX GPUs in a single machine, cost ~$4,650
Professor Partnership
NVIDIA Professor Partnership

Support faculty research & teaching efforts:
- Small equipment gifts (1-2 GPUs)
- Significant discounts on GPU purchases, especially Quadro and Tesla equipment; useful for cost matching
- Research contracts: small cash grants (typically ~$25K gifts), medium-scale equipment donations (10-30 GPUs)
- Informal proposals, reviewed quarterly
- Focus areas: GPU computing, especially with an educational mission or component
http://www.nvidia.com/page/professor_partnership.html
Easy & Competitive