David Luebke, NVIDIA Research
GPU Architecture & Implications
© NVIDIA Corporation 2007
GPU Architecture
CUDA provides a parallel programming model
The Tesla GPU architecture implements this
This talk will describe the characteristics, goals, and implications of that architecture
G80 GPU Implementation: Tesla C870
681 million transistors, 470 mm2 in 90 nm CMOS
128 thread processors, 518 GFLOPS peak, 1.35 GHz processor clock
1.5 GB DRAM, 76 GB/s peak, 800 MHz GDDR3 clock, 384-pin DRAM interface
ATX form factor card, PCI Express x16, 170 W max with DRAM
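A quick sanity check on those peak numbers (assuming the commonly described dual-issue MAD+MUL design of the G80 thread processors):

518 GFLOPS ≈ 128 processors × 1.35 GHz × 3 flops/clock (2 for the MAD plus 1 for the co-issued MUL)
76.8 GB/s = 800 MHz × 2 (DDR) × 384 bits / 8 bits per byte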
Block Diagram Redux

G80 (launched Nov 2006):
- 128 thread processors execute kernel threads
- Up to 12,288 parallel threads active
- Per-block shared memory (PBSM) accelerates processing
[Block diagram: Host → Input Assembler → Thread Execution Manager → arrays of thread processors, each with per-block shared memory (PBSM), with load/store access to Global Memory]
Streaming Multiprocessor (SM)

- Processing elements: 8 scalar thread processors (SP); 32 GFLOPS peak at 1.35 GHz; 8192 32-bit registers (32 KB), ½ MB total register file space!; usual ops: float, int, branch, …
- Hardware multithreading: up to 8 blocks resident at once; up to 768 active threads in total
- 16 KB on-chip memory: low-latency storage, shared among the threads of a block, supports thread communication
[SM diagram: MT issue unit, SPs, shared memory, threads t0 t1 … tB]
Goal: Scalability
Scalable execution:
- Program must be insensitive to the number of cores
- Write one program for any number of SM cores
- Program runs on any size GPU without recompiling

Hierarchical execution model:
- Decompose problem into sequential steps (kernels)
- Decompose kernel into computing parallel blocks
- Decompose block into computing parallel threads
Hardware distributes independent blocks to SMs as available
Blocks Run on Multiprocessors
Kernel launched by host . . .
[Diagram: the kernel's blocks are distributed across the device processor array of SMs (each with MT issue unit, thread processors, and shared memory), all accessing device memory]
Goal: easy to program
Strategies:
- Familiar programming language mechanics: C/C++ with small extensions
- Simple parallel abstractions: simple barrier synchronization, shared memory semantics, hardware-managed hierarchy of threads
Hardware Multithreading

- Hardware allocates resources to blocks: blocks need thread slots, registers, and shared memory; blocks don't run until resources are available
- Hardware schedules threads: threads have their own registers; any thread not waiting for something can run; context switching is (basically) free, every cycle
- Hardware relies on threads to hide latency, i.e., parallelism is necessary for performance
[SM diagram: MT issue unit, SPs, shared memory]
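To make that resource accounting concrete, here is a small illustrative host-side helper (hypothetical, not a CUDA API; the real hardware also rounds allocations to granularities not modeled here) that estimates resident blocks per G80 SM from the limits listed on the SM slide (8 blocks, 768 threads, 8192 registers, 16 KB shared memory):

#include <stdio.h>

// Estimate how many blocks fit on one G80 SM. Purely illustrative arithmetic.
int blocksPerSM(int threadsPerBlock, int regsPerThread, int smemPerBlock)
{
    int byThreads = 768 / threadsPerBlock;                     // thread-slot limit
    int byRegs    = 8192 / (regsPerThread * threadsPerBlock);  // register file limit
    int bySmem    = 16384 / smemPerBlock;                      // shared memory limit
    int byHW      = 8;                                         // hard cap on resident blocks
    int n = byThreads;
    if (byRegs < n) n = byRegs;
    if (bySmem < n) n = bySmem;
    if (byHW   < n) n = byHW;
    return n;
}

int main(void)
{
    // e.g. 256 threads/block, 10 registers/thread, 4 KB shared memory/block
    printf("%d blocks resident\n", blocksPerSM(256, 10, 4096));  // prints 3
    return 0;
}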
Goal: Performance per millimeter
For GPUs, performance == throughput.

Strategy: hide latency with computation, not cache
- Heavy multithreading (already discussed by Kevin)
- Implication: need many threads to hide latency
- Occupancy: typically need 128 threads/SM minimum
- Multiple thread blocks per SM are good to minimize the effect of barriers

Strategy: Single Instruction Multiple Thread (SIMT)
- Balances performance with ease of programming
SIMT Thread Execution

- Groups of 32 threads are formed into warps: always executing the same instruction; shared instruction fetch/dispatch; some threads become inactive when code paths diverge; hardware automatically handles divergence
- Warps are the primitive unit of scheduling: pick 1 of 24 warps for each instruction slot
- SIMT execution is an implementation choice: sharing control logic leaves more space for ALUs; largely invisible to the programmer; must understand it for performance, not correctness
[SM diagram: MT issue unit, SPs, shared memory]
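A minimal sketch of what divergence means in practice (hypothetical kernels, illustrative only): branching on the thread index within a warp serializes both paths, while branching at warp granularity does not.

// Threads of each warp take different paths: the warp executes BOTH
// branches with threads masked off, costing roughly the sum of the paths.
__global__ void divergent(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)         // alternates within a warp -> divergence
        out[i] = 1.0f;
    else
        out[i] = 2.0f;
}

// Here all 32 threads of a warp agree on the branch (warp = 32 consecutive
// threads), so each warp executes only one path: no divergence penalty.
__global__ void uniform(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x / 32) % 2 == 0)  // constant across each warp
        out[i] = 1.0f;
    else
        out[i] = 2.0f;
}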
SIMT Multithreaded Execution

- Weaving: the original parallel thread technology, about 10,000 years old
- Warp: a set of 32 parallel threads that execute a SIMD instruction
- SM hardware implements zero-overhead warp and thread scheduling: each SM executes up to 768 concurrent threads, as 24 SIMD warps of 32 threads
- Threads can execute independently: a SIMD warp automatically diverges and converges when threads branch; best efficiency and performance when the threads of a warp execute together
- SIMT across threads (not just SIMD across data) gives easy single-thread scalar programming with SIMD efficiency
[Diagram: SM multithreaded instruction scheduler issuing, over time: warp 8 instruction 11; warp 1 instruction 42; warp 3 instruction 95; …; warp 8 instruction 12; warp 3 instruction 96]
Memory Architecture

- Direct load/store access to device memory: treated as the usual linear sequence of bytes (i.e., not pixels)
- Texture & constant caches are read-only access paths
- On-chip shared memory is shared among the threads of a block: important for communication among threads; provides low-latency temporary storage (~100x less than DRAM)
[Diagram: SM (MT issue unit, I-cache, SPs, shared memory) with load/store to device memory; read-only texture cache and constant cache; host memory reached over PCIe]
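To make "the usual linear sequence of bytes" concrete: a thread can compute an arbitrary address and store to it (scatter), which the graphics path discussed later cannot do. A minimal hypothetical sketch:

// Each thread reads in[i] and writes it to a computed location out[perm[i]].
// This arbitrary (scatter) addressing is ordinary pointer arithmetic in CUDA,
// not a texture fetch into a fixed render target.
__global__ void scatter(const float *in, const int *perm, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[perm[i]] = in[i];
}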
Myths of GPU Computing

- Myth: GPUs layer normal programs on top of graphics. NO: CUDA compiles directly to the hardware.
- Myth: GPU architectures are very wide (1000s) SIMD machines… NO: warps are 32-wide.
- Myth: …on which branching is impossible or prohibitive… NOPE.
- Myth: …with 4-wide vector registers. NO: scalar thread processors.
- Myth: GPUs are power-inefficient. NO: 4-10x perf/W advantage, up to 89x reported for some studies.
- Myth: GPUs don't do real floating point. See the floating-point feature table below.
GPU Floating Point Features

Feature                            | G80                          | SSE                                  | IBM Altivec                | Cell SPE
Precision                          | IEEE 754                     | IEEE 754                             | IEEE 754                   | IEEE 754
Rounding modes for FADD and FMUL   | Round to nearest and to zero | All 4 IEEE: nearest, zero, inf, -inf | Round to nearest only      | Round to zero/truncate only
Denormal handling                  | Flush to zero                | Supported, 1000's of cycles          | Supported, 1000's of cycles | Flush to zero
NaN support                        | Yes                          | Yes                                  | Yes                        | No
Overflow and infinity support      | Yes, only clamps to max norm | Yes                                  | Yes                        | No, infinity
Flags                              | No                           | Yes                                  | Yes                        | Some
Square root                        | Software only                | Hardware                             | Software only              | Software only
Division                           | Software only                | Hardware                             | Software only              | Software only
Reciprocal estimate accuracy       | 24 bit                       | 12 bit                               | 12 bit                     | 12 bit
Reciprocal sqrt estimate accuracy  | 23 bit                       | 12 bit                               | 12 bit                     | 12 bit
log2(x) and 2^x estimates accuracy | 23 bit                       | No                                   | 12 bit                     | No
Do GPUs Do Real IEEE FP?

- G8x GPU FP is IEEE 754: comparable to other processors / accelerators; more precise / usable in some ways, less precise in other ways
- GPU FP getting better every generation: double precision support shortly; goal: best of class by 2009
Questions?
David Luebke, [email protected]
Applications & Sweet Spots
GPU Computing Sweet Spots
Applications with:
- High arithmetic intensity: dense linear algebra, PDEs, n-body, finite difference, …
- High bandwidth: sequencing (virus scanning, genomics), sorting, database, …
- Visual computing: graphics, image processing, tomography, machine vision, …
GPU Computing Example Markets

Computational modeling, computational chemistry, computational medicine, computational science, computational biology, computational finance, computational geoscience, image processing
Applications - Condensed

3D image analysis, adaptive radiation therapy, acoustics, astronomy, audio, automobile vision, bioinformatics, biological simulation, broadcast, cellular automata, computational fluid dynamics, computer vision, cryptography, CT reconstruction, data mining, digital cinema/projection, electromagnetic simulation, equity trading, film, financial (lots of areas), languages, GIS, holographic cinema, imaging (lots), mathematics research, military (lots), mine planning, molecular dynamics, MRI reconstruction, multispectral imaging, n-body, network processing, neural networks, oceanographic research, optical inspection, particle physics, protein folding, quantum chemistry, ray tracing, radar, reservoir simulation, robotic vision/AI, robotic surgery, satellite data analysis, seismic imaging, surgery simulation, surveillance, ultrasound, video conferencing, telescope, video, visualization, wireless, X-ray
GPU Computing Sweet Spots

- From cluster to workstation: the "personal supercomputing" phase change
- From lab to clinic; from machine room to engineer and grad-student desks; from batch processing to interactive; from interactive to real-time
- GPU-enabled clusters: a 100x or better speedup changes the science
- Solve at different scales: direct brute-force methods may outperform cleverness; new bottlenecks may emerge; approaches once inconceivable may become practical
New Applications

- Real-time options implied volatility engine
- Swaption volatility cube calculator
- Manifold 8 GIS
- Ultrasound imaging
- HOOMD Molecular Dynamics
- Also: image rotation/classification, graphics processing toolbox, microarray data analysis, data-parallel primitives, astrophysics simulations
- SDK: Mandelbrot, computer vision
- Seismic migration
The Future of GPUs

- GPU computing drives new applications: reducing "time to discovery"; a 100x speedup changes science and research methods
- New applications drive the future of GPUs and GPU computing: driving new GPU capabilities and the hunger for more performance
- Some exciting new domains: vision, acoustic, and embedded applications; large-scale simulation & physics
Accuracy & Performance
CUDA Performance Advantages
Performance:
- BLAS1: 60+ GB/sec
- BLAS3: 127 GFLOPS
- FFT: 52 benchFFT* GFLOPS
- FDTD: 1.2 Gcells/sec
- SSEARCH: 5.2 Gcells/sec
- Black Scholes: 4.7 GOptions/sec
- VMD: 290 GFLOPS

How:
- Leveraging shared memory
- GPU memory bandwidth
- GPU GFLOPS performance
- Custom hardware intrinsics: __sinf(), __cosf(), __expf(), __logf(), …
All benchmarks are compiled code!
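For example, a kernel can trade accuracy for speed by calling those intrinsics directly (a sketch; __sinf() maps to the fast hardware approximation, while sinf() is the slower, more accurate routine):

// Fast but approximate: __sinf() uses the hardware special-function path.
__global__ void fastSines(const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    y[i] = __sinf(x[i]);   // swap in sinf(x[i]) when full accuracy matters
}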
GPGPU vs. GPU Computing
Problem: GPGPU

OLD: GPGPU means tricking the GPU into general-purpose computing by casting the problem as graphics:
- Turn data into images ("texture maps")
- Turn algorithms into image synthesis ("rendering passes")

Promising results, but:
- Tough learning curve, particularly for non-graphics experts
- Potentially high overhead of the graphics API
- Highly constrained memory layout & access model
- Need for many passes drives up bandwidth consumption
Solution: CUDA

NEW: GPU Computing with CUDA
- CUDA = Compute Unified Device Architecture
- Co-designed hardware & software for direct GPU computing

Hardware: fully general data-parallel architecture
- General thread launch
- Global load-store
- Parallel data cache
- Scalar architecture
- Integers, bit operations
- Double precision (soon)
- Scalable data-parallel execution/memory model

Software: program the GPU in C
- C with minimal yet powerful extensions
Graphics Programming Model

[Pipeline: Graphics Application → Vertex Program → Rasterization → Fragment Program → Display]
Streaming GPGPU Programming

An OpenGL program to add A and B:
- Start by creating a quad
- Read textures as input to the OpenGL shader program
- "Programs" created with raster operations
- Write the answer to texture memory as a "color"
- CPU reads texture memory for results
[Pipeline: Vertex Program → Rasterization → Fragment Program]
All this just to do A + B!
What's Wrong With GPGPU?

[Diagram: Application → Vertex Program → Rasterization → Fragment Program → Display; the fragment program reads input registers, constants, and textures, uses temp registers, and writes only output registers]

- APIs are specific to graphics
- Limited texture size and dimension
- Limited shader outputs
- No scatter
- Limited instruction set
- No thread communication
- Limited local storage
Building a Better Pixel

[Diagram: a fragment program with input registers, constants, texture, and registers, writing output registers]
Building a Better Pixel Thread

[Diagram: a thread program keyed by thread number, with constants, texture, and registers, writing output registers]

Features:
- Millions of instructions
- Full integer and bit instructions
- No limits on branching, looping
- 1D, 2D, or 3D thread ID allocation
Global Memory

[Diagram: a thread program keyed by thread number, with constants, texture, and registers, reading and writing global memory]

Features:
- Fully general load/store to GPU memory
- Untyped, not fixed texture types
- Pointer support
Parallel Data Cache

[Diagram: a thread program keyed by thread number, with constants, texture, registers, a parallel data cache, and global memory]

Features:
- Dedicated on-chip memory
- Shared between threads for inter-thread communication
- Explicitly managed
- As fast as registers
Example Algorithm: Fluids

Goal: calculate PRESSURE in a fluid.
Pressure = sum of neighboring pressures: Pn' = P1 + P2 + P3 + P4
Pressure depends on neighbors, so the pressure for each particle is:
Pressure1 = P1 + P2 + P3 + P4
Pressure2 = P3 + P4 + P5 + P6
Pressure3 = P5 + P6 + P7 + P8
Pressure4 = P7 + P8 + P9 + P10
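A sketch of how this maps onto the parallel data cache (hypothetical 1D kernel; names are illustrative, and the slide's stride-2 neighbor indexing is simplified to a stride-1 window): each block stages its pressures into shared memory once, then every thread sums its neighbors from the cache instead of re-fetching them from DRAM.

#define BLOCK 256

// Each thread computes Pn' = sum of 4 neighboring pressures.
// Pressures are staged once into shared memory so neighboring threads
// share them, instead of each re-reading global memory (the GPGPU behavior).
__global__ void pressureSum(const float *P, float *Pout, int n)
{
    __shared__ float cache[BLOCK + 4];           // block's pressures + halo
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Stage this block's pressures (plus a 4-wide halo) into the cache.
    if (i < n) cache[threadIdx.x] = P[i];
    if (threadIdx.x < 4 && i + BLOCK < n)
        cache[BLOCK + threadIdx.x] = P[i + BLOCK];
    __syncthreads();                             // barrier before sharing

    // Sum the four neighbors from low-latency shared memory.
    if (i + 4 <= n)
        Pout[i] = cache[threadIdx.x]     + cache[threadIdx.x + 1]
                + cache[threadIdx.x + 2] + cache[threadIdx.x + 3];
}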
Example Fluid Algorithm

[Diagram comparing three approaches on Pn' = P1+P2+P3+P4:]
- CPU: single thread out of cache. One control unit and ALU with cache and DRAM; pressures P1..P4 stream through the cache one element at a time (program/control separate from data/computation).
- GPGPU: multiple passes through video memory. Many control/ALU pairs, but each one re-fetches P1,P2,P3,P4 from video memory for every Pn' it computes.
- GPU Computing with CUDA: parallel execution through cache. A thread execution manager feeds many ALUs; P1..P5 sit once in the parallel data cache as shared data, and each thread computes its Pn' from there.
Parallel Data Cache

Bring the data closer to the ALU.
Addresses a fundamental problem of stream computing:
- The data are far from the FLOPS; video RAM latency is high
- Threads can only communicate their results through this high-latency RAM
[GPGPU diagram: multiple passes through video memory; each ALU separately fetches P1,P2,P3,P4 and writes its Pn' = P1+P2+P3+P4 back to video memory]
Parallel Data Cache

Parallel execution through cache: bring the data closer to the ALU.
- Stage computation for the parallel data cache
- Minimize trips to external memory
- Share values to minimize overfetch and computation
- Increases arithmetic intensity by keeping data close to the processors
- User-managed generic memory; threads read/write arbitrarily
[CUDA diagram: thread execution manager feeding many ALUs; P1..P5 held once as shared data in the parallel data cache in front of DRAM; each Pn' = P1+P2+P3+P4 computed in parallel]
Streaming vs. GPU Computing

- Streaming (GPGPU): gather in, restricted write; memory is far from the ALU; no inter-element communication
- GPU Computing with CUDA: more general data-parallel model; full scatter/gather; the PDC brings the data closer to the ALU; the app decides how to decompose the problem across threads; threads share and communicate to solve problems efficiently
[Diagram: GPGPU ALUs read/write distant memory directly; CUDA ALUs work out of the on-chip parallel data cache]
GPU Design
CPU/GPU Parallelism

Moore's Law gives you more and more transistors. What do you want to do with them?

CPU strategy: make the workload (one compute thread) run as fast as possible
- Tactics: cache (area limiting), instruction/data prefetch, speculative execution
- Limited by "perimeter": communication bandwidth
- …then add task parallelism: multi-core

GPU strategy: make the workload (as many threads as possible) run as fast as possible
- Tactics: parallelism (1000s of threads), pipelining
- Limited by "area": compute capability
Background: Unified Design
Hardware Implementation: Collection of SIMT Multiprocessors

- Each multiprocessor is a set of SIMT (Single Instruction, Multiple Thread) thread processors
- Each thread processor has: program counter, register file, etc.; a scalar data path; read/write memory access
- Unit of SIMT execution: warp. A warp executes the same instruction each clock; hardware handles thread scheduling and divergence transparently
- Warps enable a friendly data-parallel programming model!
[Diagram: device containing Multiprocessors 1..N, each with Processors 1..M and an instruction unit]
Hardware Implementation: Memory Architecture

- The device has local device memory: can be read and written by the host and by the multiprocessors
- Each multiprocessor has: a set of 32-bit registers per processor; on-chip shared memory; a read-only constant cache; a read-only texture cache
[Diagram: device containing Multiprocessors 1..N; each has Processors 1..M with registers, a shared memory, an instruction unit, a constant cache, and a texture cache, all above device memory]
Hardware Implementation: Memory Model

Each thread can:
- Read/write per-block on-chip shared memory
- Read per-grid cached constant memory
- Read/write non-cached device memory: per-grid global memory, per-thread local memory
- Read cached texture memory
[Diagram: a grid of blocks; each block has its shared memory and threads with their own registers and local memory; constant, texture, and global memory are per-grid]
CUDA Programming
CUDA SDK

[Diagram: integrated CPU and GPU C source code goes through the NVIDIA C compiler, producing NVIDIA assembly for computing (run on the GPU via the CUDA driver, with debugger and profiler) and CPU host code (built with a standard C compiler); libraries (FFT, BLAS, …) and example source code sit on top]
CUDA: Features available to kernels

- Standard mathematical functions: sinf, powf, atanf, ceil, etc.
- Built-in vector types: float4, int4, uint4, etc. for dimensions 1..4
- Texture accesses in kernels:

  texture<float,2> my_texture;  // declare texture reference
  float4 texel = texfetch(my_texture, u, v);
G8x CUDA = C with Extensions

Philosophy: provide the minimal set of extensions necessary to expose power.

Function qualifiers:
__global__ void MyKernel() { }
__device__ float MyDeviceFunc() { }

Variable qualifiers:
__constant__ float MyConstantArray[32];
__shared__   float MySharedArray[32];

Execution configuration:
dim3 dimGrid(100, 50);   // 5000 thread blocks
dim3 dimBlock(4, 8, 8);  // 256 threads per block
MyKernel <<< dimGrid, dimBlock >>> (...);  // launch kernel

Built-in variables and functions valid in device code:
dim3 gridDim;    // grid dimension
dim3 blockDim;   // block dimension
dim3 blockIdx;   // block index
dim3 threadIdx;  // thread index
void __syncthreads();  // thread synchronization
CUDA: Runtime support

- Explicit memory allocation returns pointers to GPU memory: cudaMalloc(), cudaFree()
- Explicit memory copy for host ↔ device, device ↔ device: cudaMemcpy(), cudaMemcpy2D(), …
- Texture management: cudaBindTexture(), cudaBindTextureToArray(), …
- OpenGL & DirectX interoperability: cudaGLMapBufferObject(), cudaD3D9MapVertexBuffer(), …
Example: Adding matrices w/ 2D grids

CPU C program:

void addMatrix(float *a, float *b, float *c, int N)
{
    int i, j, index;
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            index = i + j * N;
            c[index] = a[index] + b[index];
        }
    }
}

void main()
{
    .....
    addMatrix(a, b, c, N);
}

CUDA C program:

__global__ void addMatrix(float *a, float *b, float *c, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int index = i + j * N;
    if (i < N && j < N)
        c[index] = a[index] + b[index];
}

void main()
{
    ..... // allocate & transfer data to GPU
    dim3 dimBlk(blocksize, blocksize);
    dim3 dimGrd(N / dimBlk.x, N / dimBlk.y);
    addMatrix<<<dimGrd, dimBlk>>>(a, b, c, N);
}
Example: Vector Addition Kernel

// Compute vector sum C = A + B
// Each thread performs one pair-wise addition
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    C[i] = A[i] + B[i];
}
Example: Invoking the Kernel

__global__ void vecAdd(float* A, float* B, float* C);

void main()
{
    // Execute on N/256 blocks of 256 threads each
    vecAdd<<< N/256, 256 >>>(d_A, d_B, d_C);
}
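The N/256 launch above assumes N is a multiple of 256. A common variation (a sketch, adding an N parameter the slide's kernel does not have) rounds the block count up and guards the kernel:

// Kernel guarded against out-of-range threads in the final block.
__global__ void vecAdd(float* A, float* B, float* C, int N)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}

// Launch: enough 256-thread blocks to cover any N.
// vecAdd<<< (N + 255) / 256, 256 >>>(d_A, d_B, d_C, N);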
Example: Host code for memory

// allocate host (CPU) memory
float* h_A = (float*) malloc(N * sizeof(float));
float* h_B = (float*) malloc(N * sizeof(float));
… initialize h_A and h_B …

// allocate device (GPU) memory
float *d_A, *d_B, *d_C;
cudaMalloc((void**) &d_A, N * sizeof(float));
cudaMalloc((void**) &d_B, N * sizeof(float));
cudaMalloc((void**) &d_C, N * sizeof(float));

// copy host memory to device
cudaMemcpy(d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice);

// execute the kernel on N/256 blocks of 256 threads each
vecAdd<<< N/256, 256 >>>(d_A, d_B, d_C);
A quick review

- device = GPU = set of multiprocessors
- multiprocessor = set of processors & shared memory
- kernel = GPU program
- grid = array of thread blocks that execute a kernel
- thread block = group of SIMD threads that execute a kernel and can communicate via shared memory

Memory   | Location | Cached         | Access     | Who
Local    | Off-chip | No             | Read/write | One thread
Shared   | On-chip  | N/A (resident) | Read/write | All threads in a block
Global   | Off-chip | No             | Read/write | All threads + host
Constant | Off-chip | Yes            | Read       | All threads + host
Texture  | Off-chip | Yes            | Read       | All threads + host
Data-Parallel Programming
Scan Literature

Pre-Hibernation:
- First proposed in APL by Iverson (1962)
- Used as a data-parallel primitive in the Connection Machine (1990); feature of C* and CM-Lisp
- Guy Blelloch used scan as a primitive for various parallel algorithms; his balanced-tree scan is used in the example here (Blelloch, 1990, "Prefix Sums and Their Applications")

Post-Democratization:
- O(n log n) work GPU implementation by Daniel Horn (GPU Gems 2)
- Applied to summed-area tables by Hensley et al. (EG05)
- O(n) work GPU scan by Sengupta et al. (EDGE06) and Greß et al. (EG06)
- O(n) work & space GPU implementation by Harris et al. (2007): NVIDIA CUDA SDK and GPU Gems 3; applied to radix sort, stream compaction, and summed-area tables
Parallel Reduction Complexity

- log(N) parallel steps; each step S does N/2^S independent ops
- Step complexity is O(log N)
- For N = 2^D, performs ∑_{S∈[1..D]} 2^(D−S) = N − 1 operations
- Work complexity is O(N): it is work-efficient, i.e., it does not perform more operations than a sequential algorithm
- With P threads physically in parallel (P processors), time complexity is O(N/P + log N); compare to O(N) for sequential reduction
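For reference, a minimal tree-reduction kernel along these lines (a sketch in the spirit of the CUDA SDK reduction; bd is the block size and t the thread index, matching the unrolled code on the next slides):

// Each block reduces bd elements in shared memory in log2(bd) steps,
// then writes one partial sum; a second pass (or the host) combines blocks.
// Launch as: reduce<<< numBlocks, bd, bd * sizeof(int) >>>(in, out);
__global__ void reduce(const int *g_idata, int *g_odata)
{
    extern __shared__ int data[];               // bd ints, sized at launch
    unsigned int t  = threadIdx.x;
    unsigned int bd = blockDim.x;

    data[t] = g_idata[blockIdx.x * bd + t];     // load one element per thread
    __syncthreads();

    // Step s halves the number of active threads each iteration: O(log N) steps.
    for (unsigned int s = bd / 2; s > 0; s >>= 1) {
        if (t < s)
            data[t] += data[t + s];
        __syncthreads();
    }
    if (t == 0) g_odata[blockIdx.x] = data[0];  // one partial sum per block
}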
Unrolling Last Steps

Only one warp is active during the last few steps. Unroll them and remove the unneeded __syncthreads():

for (unsigned int s = bd/2; s > 32; s >>= 1) {
    if (t < s) {
        data[t] += data[t + s];
    }
    __syncthreads();
}
if (t < 32) data[t] += data[t + 32];
if (t < 16) data[t] += data[t + 16];
if (t <  8) data[t] += data[t +  8];
if (t <  4) data[t] += data[t +  4];
if (t <  2) data[t] += data[t +  2];
if (t <  1) data[t] += data[t +  1];
Unrolling the Loop Completely

When the block size is known at compile time, we can completely unroll the loop. It often is, since the maximum thread block size of 512 constrains us. Use templates:

#define STEP(d) \
    if (t < (d)) data[t] += data[t + (d)];

#define SYNC __syncthreads();

template <unsigned int bsize>
__global__ void d_reduce(int *g_idata, int *g_odata)
{
    ...
    if (bsize == 512) STEP(512) SYNC
    if (bsize >= 256) STEP(256) SYNC
    if (bsize >= 128) STEP(128) SYNC
    if (bsize >=  64) STEP(64)  SYNC
    if (bsize >=  32) { STEP(32) STEP(16) STEP(8)
                        STEP(4)  STEP(2)  STEP(1) }
}
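Since bsize must be a compile-time constant, the host picks the instantiation at launch time (a sketch of the usual dispatch pattern):

// Dispatch to the right template instantiation for the runtime block size.
void launch_reduce(int threads, int blocks, int smem,
                   int *g_idata, int *g_odata)
{
    switch (threads) {
    case 512: d_reduce<512><<<blocks, threads, smem>>>(g_idata, g_odata); break;
    case 256: d_reduce<256><<<blocks, threads, smem>>>(g_idata, g_odata); break;
    case 128: d_reduce<128><<<blocks, threads, smem>>>(g_idata, g_odata); break;
    case  64: d_reduce< 64><<<blocks, threads, smem>>>(g_idata, g_odata); break;
    case  32: d_reduce< 32><<<blocks, threads, smem>>>(g_idata, g_odata); break;
    }
}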
GPU Computing Motivation
Computing Challenge

[Diagram: task computing vs. data computing]
Extreme Growth in Raw Data

[Charts of exponential data growth:]
- NOAA weather data, in petabytes (source: John Bates, NOAA Nat. Climate Center); NOAA/NASA weather data projected from 2002 out to 2017
- YouTube bandwidth growth, in millions (source: Alexa, YouTube 2006)
- Walmart transaction tracking, in millions (source: Hedburg, CPI, Walmart)
- BP oil and gas active data, in terabytes (source: Jim Farnsworth, BP May 2005)
Computational Horsepower

The GPU is a massively parallel computation engine:
- High memory bandwidth (5-10x CPU)
- High floating-point performance (5-10x CPU)
Benchmarking: CPU vs. GPU Computing

G80 vs. Core2 Duo 2.66 GHz, measured against commercial CPU benchmarks when possible.

"Free" Massively Parallel Processors
It's not science fiction, it's just funded by them.
Success Stories
Success Stories: Data to Design

Acceleware EM field simulation technology for the GPU
- 3D finite-difference (FDTD) and finite-element modeling
- Modeling of: cell phone irradiation; MRI design/modeling; printed circuit boards; radar cross section (military); pacemaker with transmit antenna
[Chart: performance in Mcells/s (0-700): CPU 3.2 GHz = 1X; 1 GPU = 5X; 2 GPUs = 10X; 4 GPUs = 20X]
Evolved Machines
- 130X speedup
- Simulates brain circuitry
- Sensory computing: vision, olfactory
10X with MATLAB CPU+GPU

Pseudo-spectral simulation of 2D isotropic turbulence. MATLAB: the language of science.
http://www.amath.washington.edu/courses/571-winter-2006/matlab/FS_2Dturb.m
http://developer.nvidia.com/object/matlab_cuda.html
MATLAB Example: Advection of an elliptic vortex

256x256 mesh, 512 RK4 steps, Linux, MATLAB file:
http://www.amath.washington.edu/courses/571-winter-2006/matlab/FS_vortex.m
- MATLAB: 168 seconds
- MATLAB with CUDA (single-precision FFTs): 20 seconds
MATLAB Example: Pseudo-spectral simulation of 2D isotropic turbulence

512x512 mesh, 400 RK4 steps, Windows XP, MATLAB file:
http://www.amath.washington.edu/courses/571-winter-2006/matlab/FS_2Dturb.m
- MATLAB: 992 seconds
- MATLAB with CUDA (single-precision FFTs): 93 seconds
NAMD/VMD Molecular Dynamics

240X speedup, computational biology
http://www.ks.uiuc.edu/Research/vmd/projects/ece498/lecture/
Molecular Dynamics Example

Case study: molecular dynamics research at U. Illinois Urbana-Champaign
- (Scientist-sponsored) course project for CS 498AL: Programming Massively Parallel Multiprocessors (Kirk/Hwu)
- Next slides stolen from a nice description of the problem, algorithms, and iterative optimization process, available at:
http://www.ks.uiuc.edu/Research/vmd/projects/ece498/lecture/
Molecular Modeling: Ion Placement

- Biomolecular simulations attempt to replicate in vivo conditions in silico
- Model structures are initially constructed in vacuum
- Solvent (water) and ions are added as necessary for the required biological conditions
- Computational requirements scale with the size of the simulated structure
Evolution of Ion Placement Code

- First implementation was sequential: a virus structure with 10^6 atoms would require 10 CPU days
- Tuned for Intel C/C++ vectorization + SSE: ~20x speedup
- Parallelized with pthreads: high data parallelism = linear speedup
- Parallelized, GPU-accelerated implementation: 3 GeForce 8800 GTX cards outrun ~300 Itanium2 CPUs!
- The virus structure now runs in 25 seconds on 3 GPUs!
- Further speedups should still be possible…
Multi-GPU CUDA Coulombic Potential Map Performance

- Host: Intel Core 2 Quad, 8 GB RAM, ~$3,000
- 3 GPUs: NVIDIA GeForce 8800 GTX, ~$550 each
- 32-bit RHEL4 Linux (want 64-bit CUDA!!)
- 235 GFLOPS per GPU for the current version of the coulombic potential map kernel
- 705 GFLOPS total for the multithreaded multi-GPU version
Three GeForce 8800 GTX GPUs in a single machine, cost ~$4,650
Professor Partnership
NVIDIA Professor Partnership

Support faculty research & teaching efforts:
- Small equipment gifts (1-2 GPUs)
- Significant discounts on GPU purchases, especially Quadro and Tesla equipment; useful for cost matching
- Research contracts: small cash grants (typically ~$25K gifts), medium-scale equipment donations (10-30 GPUs)
- Informal proposals, reviewed quarterly
- Focus areas: GPU computing, especially with an educational mission or component
http://www.nvidia.com/page/professor_partnership.html
Easy & Competitive