gpu architecture & implications - computer...

94
David Luebke NVIDIA Research GPU Architecture & Implications

Upload: others

Post on 27-Sep-2019

37 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

David LuebkeNVIDIA Research

GPU Architecture & Implications

Page 2: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

GPU Architecture

CUDA provides a parallel programming model

The Tesla GPU architecture implements this

This talk will describe the characteristics, goals, and implications of that architecture

Page 3: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

G80 GPU Implementation: Tesla C870

681 million transistors470 mm2 in 90 nm CMOS

128 thread processors518 GFLOPS peak1.35 GHz processor clock

1.5 GB DRAM76 GB/s peak800 MHz GDDR3 clock384 pin DRAM interface

ATX form factor cardPCI Express x16170 W max with DRAM

Page 4: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

G80 (launched Nov 2006)128 Thread Processors execute kernel threadsUp to 12,288 parallel threads activePer-block shared memory (PBSM) accelerates processing

Block Diagram Redux

Thread Execution Manager

Input Assembler

Host

PBSM

Global Memory

Load/store

PBSM

Thread Processors

PBSM

Thread Processors Thread Processors Thread Processors Thread Processors Thread Processors Thread Processors Thread Processors

PBSM PBSM PBSM PBSM PBSM PBSM PBSM PBSM PBSM PBSM PBSM PBSMPBSM

Page 5: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

Streaming Multiprocessor (SM)

Processing elements8 scalar thread processors (SP)32 GFLOPS peak at 1.35 GHz8192 32-bit registers (32KB)

½ MB total register file space!usual ops: float, int, branch, …

Hardware multithreadingup to 8 blocks resident at onceup to 768 active threads in total

16KB on-chip memorylow latency storageshared amongst threads of a blocksupports thread communication

SP

SharedMemory

MT IU

SM t0 t1 … tB

Page 6: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

Goal: Scalability

Scalable executionProgram must be insensitive to the number of coresWrite one program for any number of SM coresProgram runs on any size GPU without recompiling

Hierarchical execution modelDecompose problem into sequential steps (kernels)Decompose kernel into computing parallel blocksDecompose block into computing parallel threads

Hardware distributes independent blocks to SMs as available

Page 7: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

Blocks Run on Multiprocessors

Kernel launched by host

. . .

SP

SharedMemory

MT IU

SP

SharedMemory

MT IU

SP

SharedMemory

MT IU

SP

SharedMemory

MT IU

SP

SharedMemory

MT IU

SP

SharedMemory

MT IU

SP

SharedMemory

MT IU

SP

SharedMemory

MT IU

. . .

Device processor array

Device Memory

Page 8: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

Goal: easy to program

Strategies:Familiar programming language mechanics

C/C++ with small extensionSimple parallel abstractions

Simple barrier synchronizationShared memory semanticsHardware-managed hierarchy of threads

Page 9: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

Hardware Multithreading

Hardware allocates resources to blocksblocks need: thread slots, registers, shared memoryblocks don’t run until resources are available

Hardware schedules threadsthreads have their own registersany thread not waiting for something can runcontext switching is (basically) free – every cycle

Hardware relies on threads to hide latencyi.e., parallelism is necessary for performance

SP

SharedMemory

MT IU

SM

Page 10: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

Goal: Performance per millimeter

For GPUs, perfomance == throughput

Strategy: hide latency with computation not cacheHeavy multithreading – already discussed by Kevin

Implication: need many threads to hide latencyOccupancy – typically need 128 threads/SM minimumMultiple thread blocks/SM good to minimize effect of barriers

Strategy: Single Instruction Multiple Thread (SIMT)Balances performance with ease of programming

Page 11: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

SIMT Thread Execution

Groups of 32 threads formed into warpsalways executing same instructionshared instruction fetch/dispatchsome become inactive when code path divergeshardware automatically handles divergence

Warps are the primitive unit of schedulingpick 1 of 24 warps for each instruction slot

SIMT execution is an implementation choicesharing control logic leaves more space for ALUslargely invisible to programmermust understand for performance, not correctness

SP

SharedMemory

MT IU

SM

Page 12: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007gh07 Hot3D: Tesla GPU Computing12

SIMT Multithreaded Execution

Weaving: the original parallel thread technology is about 10,000 years oldWarp: a set of 32 parallel threadsthat execute a SIMD instruction

SM hardware implements zero-overhead warp and thread schedulingEach SM executes up to 768 concurrent threads, as 24 SIMD warps of 32 threads

Threads can execute independentlySIMD warp automatically diverges and converges when threads branchBest efficiency and performance when threads of a warp execute togetherSIMT across threads (not just SIMD across data) gives easy single-thread scalar programming with SIMD efficiency

warp 8 instruction 11

SM multithreadedinstruction scheduler

warp 1 instruction 42

warp 3 instruction 95

warp 8 instruction 12

...

time

warp 3 instruction 96

Page 13: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

SP SharedMemory

MT

IU

Device Memory

Texture Cache Constant Cache

I Cac

heMemory Architecture

Direct load/store access to device memorytreated as the usual linear sequence of bytes (i.e., not pixels)

Texture & constant caches are read-only access pathsOn-chip shared memory shared amongst threads of a block

important for communication amongst threadsprovides low-latency temporary storage (~100x less than DRAM)

HostMemory

PCIe

Page 14: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

Myths of GPU Computing

GPUs layer normal programs on top of graphics

GPUs architectures are:Very wide (1000s) SIMD machines……on which branching is impossible or prohibitive……with 4-wide vector registers.

GPUs are power-inefficient

GPUs don’t do real floating point

Page 15: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

Myths of GPU Computing

GPUs layer normal programs on top of graphicsNO: CUDA compiles directly to the hardware

GPUs architectures are:Very wide (1000s) SIMD machines……on which branching is impossible or prohibitive……with 4-wide vector registers.

GPUs are power-inefficient

GPUs don’t do real floating point

Page 16: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

Myths of GPU Computing

GPUs layer normal programs on top of graphics

GPUs architectures are:Very wide (1000s) SIMD machines……on which branching is impossible or prohibitive……with 4-wide vector registers.

GPUs are power-inefficient

GPUs don’t do real floating point

Page 17: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

Myths of GPU Computing

GPUs layer normal programs on top of graphics

GPUs architectures are:Very wide (1000s) SIMD machines… NO: warps are 32-wide…on which branching is impossible or prohibitive……with 4-wide vector registers.

GPUs are power-inefficient

GPUs don’t do real floating point

Page 18: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

Myths of GPU Computing

GPUs layer normal programs on top of graphics

GPUs architectures are:Very wide (1000s) SIMD machines……on which branching is impossible or prohibitive… NOPE…with 4-wide vector registers.

GPUs are power-inefficient

GPUs don’t do real floating point

Page 19: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

Myths of GPU Computing

GPUs layer normal programs on top of graphics

GPUs architectures are:Very wide (1000s) SIMD machines……on which branching is impossible or prohibitive……with 4-wide vector registers.

GPUs are power-inefficient

GPUs don’t do real floating point

Page 20: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

Myths of GPU Computing

GPUs layer normal programs on top of graphics

GPUs architectures are:Very wide (1000s) SIMD machines……on which branching is impossible or prohibitive……with 4-wide vector registers. NO: scalar thread processors

GPUs are power-inefficient

GPUs don’t do real floating point

Page 21: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

Myths of GPU Computing

GPUs layer normal programs on top of graphics

GPUs architectures are:Very wide (1000s) SIMD machines……on which branching is impossible or prohibitive……with 4-wide vector registers.

GPUs are power-inefficient

GPUs don’t do real floating point

Page 22: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

Myths of GPU Computing

GPUs layer normal programs on top of graphics

GPUs architectures are:Very wide (1000s) SIMD machines……on which branching is impossible or prohibitive……with 4-wide vector registers.

GPUs are power-inefficient: No – 4-10x perf/W advantage, up to 89x reported for some studies

GPUs don’t do real floating point

Page 23: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

Myths of GPU Computing

GPUs layer normal programs on top of graphics

GPUs architectures are:Very wide (1000s) SIMD machines……on which branching is impossible or prohibitive……with 4-wide vector registers.

GPUs are power-inefficient:

GPUs don’t do real floating point

Page 24: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

GPU Floating Point FeaturesG80 SSE IBM Altivec Cell SPE

Precision IEEE 754 IEEE 754 IEEE 754 IEEE 754

Rounding modes for FADD and FMUL

Round to nearest and round to zero

All 4 IEEE, round to nearest, zero, inf, -inf

Round to nearest only

Round to zero/truncate only

Denormal handling Flush to zero Supported,1000’s of cycles

Supported,1000’s of cycles Flush to zero

NaN support Yes Yes Yes No

Overflow and Infinity support

Yes, only clamps to max norm Yes Yes No, infinity

Flags No Yes Yes Some

Square root Software only Hardware Software only Software only

Division Software only Hardware Software only Software only

Reciprocal estimate accuracy 24 bit 12 bit 12 bit 12 bit

Reciprocal sqrtestimate accuracy 23 bit 12 bit 12 bit 12 bit

log2(x) and 2^x estimates accuracy 23 bit No 12 bit No

Page 25: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

Do GPUs Do Real IEEE FP?

G8x GPU FP is IEEE 754Comparable to other processors / acceleratorsMore precise / usable in some waysLess precise in other ways

GPU FP getting better every generationDouble precision support shortlyGoal: best of class by 2009

Page 26: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

Questions?

David [email protected]

Page 27: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

Applications &

Sweet Spots

Page 28: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

GPU Computing Sweet Spots

Applications:

High arithmetic intensity:Dense linear algebra, PDEs, n-body, finite difference, …

High bandwidth: Sequencing (virus scanning, genomics), sorting, database…

Visual computing:Graphics, image processing, tomography, machine vision…

Page 29: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

GPU Computing Example Markets

ComputationalModeling

ComputationalChemistry

ComputationalMedicine

ComputationalScience

ComputationalBiology

ComputationalFinance

ComputationalGeoscience

ImageProcessing

Page 30: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

Applications - Condensed

3D image analysisAdaptive radiation therapyAcousticsAstronomyAudioAutomobile visionBioinfomaticsBiological simulationBroadcastCellular automataComputational Fluid DynamicsComputer VisionCryptographyCT reconstructionData MiningDigital cinema/projectionsElectromagnetic simulationEquity training

FilmFinancial - lots of areasLanguagesGISHolographics cinemaImaging (lots)Mathematics researchMilitary (lots)Mine planningMolecular dynamicsMRI reconstructionMultispectral imagingnbodyNetwork processingNeural networkOceanographic researchOptical inspectionParticle physics

Protein foldingQuantum chemistryRay tracingRadarReservoir simulationRobotic vision/AIRobotic surgerySatellite data analysisSeismic imagingSurgery simulationSurveillanceUltrasoundVideo conferencingTelescopeVideoVisualizationWirelessX-ray

Page 31: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

GPU Computing Sweet Spots

From cluster to workstationThe “personal supercomputing” phase change

From lab to clinicFrom machine room to engineer, grad student desksFrom batch processing to interactiveFrom interactive to real-time

GPU-enabled clusters A 100x or better speedup changes the science

Solve at different scalesDirect brute-force methods may outperform clevernessNew bottlenecks may emergeApproaches once inconceivable may become practical

Page 32: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

New ApplicationsReal-time options implied volatility engine

Swaption volatility cube calculator

Manifold 8 GIS

Ultrasound imaging

HOOMD Molecular Dynamics

Also…Image rotation/classificationGraphics processing toolboxMicroarray data analysisData parallel primitivesAstrophysics simulations

SDK: Mandelbrot, computer vision

Seismic migration

Page 33: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

The Future of GPUs

GPU Computing drives new applicationsReducing “Time to Discovery”100x Speedup changes science and research methods

New applications drive the future of GPUs and GPU Computing

Drives new GPU capabilitiesDrives hunger for more performance

Some exciting new domains: Vision, acoustic, and embedded applicationsLarge-scale simulation & physics

Page 34: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

Accuracy &

Performance

Page 35: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

GPU Floating Point FeaturesG80 SSE IBM Altivec Cell SPE

Precision IEEE 754 IEEE 754 IEEE 754 IEEE 754

Rounding modes for FADD and FMUL

Round to nearest and round to zero

All 4 IEEE, round to nearest, zero, inf, -inf

Round to nearest only

Round to zero/truncate only

Denormal handling Flush to zero Supported,1000’s of cycles

Supported,1000’s of cycles Flush to zero

NaN support Yes Yes Yes No

Overflow and Infinity support

Yes, only clamps to max norm Yes Yes No, infinity

Flags No Yes Yes Some

Square root Software only Hardware Software only Software only

Division Software only Hardware Software only Software only

Reciprocal estimate accuracy 24 bit 12 bit 12 bit 12 bit

Reciprocal sqrtestimate accuracy 23 bit 12 bit 12 bit 12 bit

log2(x) and 2^x estimates accuracy 23 bit No 12 bit No

Page 36: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

Do GPUs Do Real IEEE FP?

G8x GPU FP is IEEE 754Comparable to other processors / acceleratorsMore precise / usable in some waysLess precise in other ways

GPU FP getting better every generationDouble precision support shortlyGoal: best of class by 2009

Page 37: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

CUDA Performance Advantages

Performance:BLAS1: 60+ GB/secBLAS3: 127 GFLOPSFFT: 52 benchFFT* GFLOPSFDTD: 1.2 Gcells/secSSEARCH: 5.2 Gcells/secBlack Scholes: 4.7 GOptions/sec

VMD: 290 GFLOPS

How:Leveraging shared memoryGPU memory bandwidthGPU GFLOPS performanceCustom hardware intrinsics

__sinf(), __cosf(), __expf(), __logf(), …

All benchmarks are compiled code!

Page 38: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

GPGPU vs.

GPU Computing

Page 39: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

Problem: GPGPU

OLD: GPGPU – trick the GPU into general-purpose computing by casting problem as graphics

Turn data into images (“texture maps”)Turn algorithms into image synthesis (“rendering passes”)

Promising results, but:Tough learning curve, particularly for non-graphics expertsPotentially high overhead of graphics APIHighly constrained memory layout & access modelNeed for many passes drives up bandwidth consumption

Page 40: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

Solution: CUDA

NEW: GPU Computing with CUDACUDA = Compute Unified Driver ArchitectureCo-designed hardware & software for direct GPU computing

Hardware: fully general data-parallel architecture

Software: program the GPU in C

General thread launchGlobal load-storeParallel data cache

Scalar architectureIntegers, bit operationsDouble precision (soon)

Scalable data-parallel execution/memory model

C with minimal yet powerful extensions

Page 41: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

Graphics Programming Model

Graphics Application

Vertex Program

Rasterization

Fragment Program

Display

Page 42: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

Streaming GPGPU Programming OpenGL Program to

Add A and B

Vertex Program

Rasterization

Fragment Program

CPU Reads Texture Memory for Results

Start by creating a quad

“Programs” created with raster operation

Write answer to texture memory as a “color”

All this just to do A + B

Read textures as input to OpenGL shader program

Page 43: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

What’s Wrong With GPGPU?

Application

Vertex Program

Rasterization

Pixel Program

Display

Input Registers

Pixel Program

Output Registers

Constants

Texture

Temp Registers

Page 44: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

What’s Wrong With GPGPU?

Application

Vertex Program

Rasterization

Fragment Program

Display

Input Registers

Fragment Program

Output Registers

Constants

Texture

Temp Registers

APIs are specific to graphics

Limited texture size and dimension

Limited shader outputs

No scatter

Limited instruction set

No thread communication

Limited local storage

Page 45: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

Building a Better Pixel

Input Registers

Fragment Program

Output Registers

Constants

Texture

Registers

Page 46: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

Building a Better Pixel Thread

Thread Program

Output Registers

Constants

Texture

Registers

Thread Number

Features

Millions of instructions

Full Integer and Bit instructions

No limits on branching, looping

1D, 2D, or 3D thread ID allocation

Page 47: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

Global Memory

Thread Program

Global Memory

Thread Number

Constants

Texture

Registers

Features

Fully general load/store to GPU memory

Untyped, not fixed texture types

Pointer support

Page 48: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

Parallel Data Cache

Thread Program

Global Memory

Thread Number

Constants

Texture

Registers

FeaturesDedicated on-chip memoryShared between threads for inter-thread communicationExplicitly managedAs fast as registers

Parallel Data Cache

Page 49: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

Example Algorithm - Fluids

So the pressure for each particle is…

Pressure1 = P1 + P2 + P3 + P4

Pressure2 = P3 + P4 + P5 + P6

Pressure3 = P5 + P6 + P7 + P8

Pressure4 = P7 + P8 + P9 + P10

Pressure depends on neighbors

Goal: Calculate PRESSURE in a fluid

Pressure = Sum of neighboring pressuresPn’ = P1 + P2 + P3 + P4

Page 50: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

Example Fluid Algorithm

CPU GPGPU

GPU Computingwith CUDA

Multiple passes through video

memory

Parallel execution through cache

Single thread out of cache

Program/Control

Data/Computation

Control

ALU

Cache DRAM

P1P2P3P4

Pn’=P1+P2+P3+P4

ALU

VideoMemory

Control

ALU

Control

ALU

Control

ALU

P1,P2P3,P4

P1,P2P3,P4

P1,P2P3,P4

Pn’=P1+P2+P3+P4

Pn’=P1+P2+P3+P4

Pn’=P1+P2+P3+P4

ParallelData

Cache

ThreadExecutionManager

ALU

Control

ALU

Control

ALU

Control

ALU

DRAM

P1P2P3P4P5

SharedData

Pn’=P1+P2+P3+P4

Pn’=P1+P2+P3+P4

Pn’=P1+P2+P3+P4

Page 51: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

Parallel Data Cache

Bring the data closer to the ALU

• Addresses a fundamental problem of stream computing:

• The data are far from the FLOPS, video RAM latency is high

• Threads can only communicate their results through this high latency RAM

GPGPU

Multiple passes through video

memory

ALU

VideoMemory

Control

ALU

Control

ALU

Control

ALU

P1,P2P3,P4

P1,P2

P3,P4

P1,P2P3,P4

Pn’=P1+P2+P3+P4

Pn’=P1+P2+P3+P4

Pn’=P1+P2+P3+P4

Page 52: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

Parallel Data Cache

Parallel execution through cache

ParallelData

Cache

ThreadExecutionManager

ALU

Control

ALU

Control

ALU

Control

ALU

DRAM

P1P2P3P4P5

SharedData

Pn’=P1+P2+P3+P4

Pn’=P1+P2+P3+P4

Pn’=P1+P2+P3+P4

Bring the data closer to the ALU

• Stage computation for the parallel data cache

• Minimize trips to external memory • Share values to minimize overfetch

and computation• Increases arithmetic intensity by

keeping data close to the processors

• User managed generic memory, threads read/write arbitrarily

Page 53: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

Streaming vs. GPU Computing

GPGPU

CUDA

ALU

ALU

StreamingGather in, Restricted writeMemory is far from ALUNo inter-element communication

GPU Computing with CUDAMore general data parallel modelFull Scatter / GatherPDC brings the data closer to the ALUApp decides how to decompose the problem across threadsShare and communicate between threads to solve problems efficiently

Page 54: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

GPU Design

Page 55: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

CPU/GPU Parallelism

Moore’s Law gives you more and more transistorsWhat do you want to do with them? CPU strategy: make the workload (one compute thread) run as fast as possible

Tactics: – Cache (area limiting)– Instruction/Data prefetch– Speculative execution

limited by “perimeter” – communication bandwidth…then add task parallelism…multi-core

GPU strategy: make the workload (as many threads as possible) run as fast as possible

Tactics:– Parallelism (1000s of threads)– Pipelining

limited by “area” – compute capability

Page 56: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

Background: Unified Design

Page 57: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

Hardware Implementation:Collection of SIMT MultiprocessorsEach multiprocessor is a set of SIMT thread processors

Single Instruction Multiple Thread

Each thread processor has:program counter, register file, etc.scalar data pathread/write memory access

Unit of SIMT execution: warpexecute same instruction/clockHardware handles thread scheduling and divergence transparently

Warps enable a friendly data-parallel programming model!

Device

Multiprocessor N

Multiprocessor 2

Multiprocessor 1

InstructionUnit

Processor 1 …Processor 2 Processor M

Page 58: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

Hardware Implementation:Memory Architecture

The device has local device memory

Can be read and written by the host and by the multiprocessors

Each multiprocessor has:A set of 32-bit registersper processoron-chip shared memoryA read-only constant cacheA read-only texture cache

Device

Multiprocessor N

Multiprocessor 2

Multiprocessor 1

Device memory

Shared Memory

InstructionUnit

Processor 1

Registers

…Processor 2

Registers

Processor M

Registers

ConstantCache

TextureCache

Page 59: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

Hardware Implementation:Memory Model

Each thread can:Read/write per-block on-chip shared memoryRead per-grid cached constant memoryRead/write non-cached device memory:

Per-grid global memoryPer-thread local memory

Read cached texture memory

Grid

ConstantMemory

TextureMemory

GlobalMemory

Block (0, 0)

Shared Memory

LocalMemory

Thread (0, 0)

Registers

LocalMemory

Thread (1, 0)

Registers

Block (1, 0)

Shared Memory

LocalMemory

Thread (0, 0)

Registers

LocalMemory

Thread (1, 0)

Registers

Page 60: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

CUDAProgramming

Page 61: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

CUDA SDK

NVIDIA C Compiler

NVIDIA Assemblyfor Computing CPU Host Code

Integrated CPUand GPU C Source Code

Libraries:FFT, BLAS,…Example Source Code

CUDADriver

DebuggerProfiler Standard C Compiler

GPU CPU

Page 62: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

CUDA: Features available to kernels

Standard mathematical functionssinf, powf, atanf, ceil, etc.

Built-in vector typesfloat4, int4, uint4, etc. for dimensions 1..4

Texture accesses in kernelstexture<float,2> my_texture; // declare texture reference

float4 texel = texfetch(my_texture, u, v);

Page 63: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

G8x CUDA = C with ExtensionsPhilosophy: provide minimal set of extensions necessary to expose power

Function qualifiers:__global__ void MyKernel() { }__device__ float MyDeviceFunc() { }

Variable qualifiers:__constant__ float MyConstantArray[32];__shared__ float MySharedArray[32];

Execution configuration:dim3 dimGrid(100, 50); // 5000 thread blocksdim3 dimBlock(4, 8, 8); // 256 threads per blockMyKernel <<< dimGrid, dimBlock >>> (...); // Launch kernel

Built-in variables and functions valid in device code:dim3 gridDim; // Grid dimensiondim3 blockDim; // Block dimensiondim3 blockIdx; // Block indexdim3 threadIdx; // Thread indexvoid __syncthreads(); // Thread synchronization

Page 64: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

CUDA: Runtime support

Explicit memory allocation returns pointers to GPU memorycudaMalloc(), cudaFree()

Explicit memory copy for host ↔ device, device ↔ devicecudaMemcpy(), cudaMemcpy2D(), ...

Texture managementcudaBindTexture(), cudaBindTextureToArray(), ...

OpenGL & DirectX interoperabilitycudaGLMapBufferObject(), cudaD3D9MapVertexBuffer(), …

Page 65: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

Example: Adding matrices w/ 2D gridsCPU C program CUDA C program

void addMatrix(float *a, float *b, float *c, int N)

{int i, j, index;for (i = 0; i < N; i++) {for (j = 0; j < N; j++) {index = i + j * N;c[index]=a[index] + b[index];

}}

}

void main(){

.....addMatrix(a, b, c, N);

}

__global__ void addMatrix(float *a,float *b,float *c, int N)

{int i=blockIdx.x*blockDim.x+threadIdx.x;int j=blockIdx.y*blockDim.y+threadIdx.y;int index = i + j * N;if ( i < N && j < N)c[index]= a[index] + b[index];

}

void main(){

..... // allocate & transfer data to GPUdim3 dimBlk (blocksize, blocksize);dim3 dimGrd (N/dimBlk.x, N/dimBlk.y); addMatrix<<<dimGrd,dimBlk>>>(a, b, c,N);

}

Page 66: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

Example: Vector Addition Kernel

// Compute vector sum C = A+B// Each thread performs one pair-wise addition

__global__ void vecAdd(float* A, float* B, float* C){

int i = threadIdx.x + blockDim.x * blockIdx.x;C[i] = A[i] + B[i];

}

Page 67: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

Example: Invoking the Kernel

__global__ void vecAdd(float* A, float* B, float* C);

void main(){

// Execute on N/256 blocks of 256 threads eachvecAdd<<< N/256, 256>>>(d_A, d_B, d_C);

}

Page 68: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

Example: Host code for memory

// allocate host (CPU) memoryfloat* h_A = (float*) malloc(N * sizeof(float));float* h_B = (float*) malloc(N * sizeof(float));… initalize h_A and h_B …

// allocate device (GPU) memoryfloat* d_A, d_B, d_C;cudaMalloc( (void**) &d_A, N * sizeof(float));cudaMalloc( (void**) &d_B, N * sizeof(float));cudaMalloc( (void**) &d_C, N * sizeof(float));

// copy host memory to devicecudaMemcpy( d_A, h_A, N * sizeof(float),cudaMemcpyHostToDevice));cudaMemcpy( d_B, h_B, N * sizeof(float),cudaMemcpyHostToDevice));

// execute the kernel on N/256 blocks of 256 threads eachvecAdd<<<N/256, 256>>>(d_A, d_B, d_C);

Page 69: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

A quick review

device = GPU = set of multiprocessors Multiprocessor = set of processors & shared memoryKernel = GPU programGrid = array of thread blocks that execute a kernelThread block = group of SIMD threads that execute a kernel and can communicate via shared memory

Memory Location Cached Access WhoLocal Off-chip No Read/write One threadShared On-chip N/A - resident Read/write All threads in a blockGlobal Off-chip No Read/write All threads + hostConstant Off-chip Yes Read All threads + hostTexture Off-chip Yes Read All threads + host

Page 70: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

Data-ParallelProgramming

Page 71: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

Scan Literature

Pre-HibernationFirst proposed in APL by Iverson (1962)Used as a data parallel primitive in the Connection Machine (1990)

Feature of C* and CM-LispGuy Blelloch used scan as a primitive for various parallel algorithms; his balanced-tree scan is used in the example here

Blelloch, 1990, “Prefix Sums and Their Applications”Post-Democratization

O(n log n) work GPU implementation by Daniel Horn (GPU Gems 2)Applied to Summed Area Tables by Hensley et al. (EG05)

O(n) work GPU scan by Sengupta et al. (EDGE06) and Greß et al. (EG06)O(n) work & space GPU implementation by Harris et al. (2007)

NVIDIA CUDA SDK and GPU Gems 3Applied to radix sort, stream compaction, and summed area tables

Page 72: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

Parallel Reduction Complexity

Log(N) parallel steps, each step S does N/2S

independent opsStep Complexity is O(log N)

For N=2D, performs ∑S∈[1..D]2D-S = N-1 operations Work Complexity is O(N) – It is work-efficienti.e. does not perform more operations than a sequential algorithm

With P threads physically in parallel (P processors), time complexity is O(N/P + log N)

Compare to O(N) for sequential reduction

Page 73: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

Unrolling Last Steps

Only one warp is active during the last few stepsUnroll them and remove unneeded __syncthreads()

for (unsigned int s = bd/2; s > 32; s >>= 1) {

if (t < s) {data[t] += data[t + s];

}__syncthreads();

}if (t < 32) data[t] += data[t + 32];if (t < 16) data[t] += data[t + 16];if (t < 8) data[t] += data[t + 8];if (t < 4) data[t] += data[t + 4];if (t < 2) data[t] += data[t + 2];if (t < 1) data[t] += data[t + 1];

Page 74: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

Unrolling the Loop Completely

When block size is known at compile time, we can completely unroll the loop

It often is, since the maximum thread block size of 512 constrains us

Use templates…

#define STEP(d) \if (t < (d)) data[t] += data[t+(d));

#define SYNC __syncthreads();

template <unsigned int bsize>__global__ void d_reduce(int *g_idata,

int *g_odata){ ...

if (bsize == 512) STEP(512) SYNC if (bsize >= 256) STEP(256) SYNCif (bsize >= 128) STEP(128) SYNCif (bsize >= 64) STEP(64) SYNCif (bsize >= 32) { STEP(32) STEP(16) STEP(8)

STEP(4) STEP(2) STEP(1) }

}

Page 75: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

GPU ComputingMotivation

Page 76: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

Computing Challenge

graphic

Task Computing Data Computing

Page 77: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

Extreme Growth in Raw Data

Source: John Bates, NOAA Nat. Climate Center

NOAA Weather Data

Peta

byte

s

Source: Alexa, YouTube 2006

YouTube Bandwidth Growth

Mill

ions

Source: Hedburg, CPI, Walmart

Walmart Transaction Tracking

Mill

ions

Source: Jim Farnsworth, BP May 2005

BP Oil and Gas Active Data

Tera

byte

s

NOAA NASA Weather Data in Petabytes

0

10

20

30

40

50

60

70

80

90

2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017

Page 78: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

Computational Horsepower

GPU is a massively parallel computation engineHigh memory bandwidth (5-10x CPU)High floating-point performance (5-10x CPU)

Page 79: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

Benchmarking: CPU vs. GPU Computing

G80 vs. Core2 Duo 2.66 GHzMeasured against commercial CPU benchmarks when possible

Page 80: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

“Free” Massively Parallel Processors

It’s not science fiction, it’s just funded by them

Asst Master Chief Harvard

Page 81: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

SuccessStories

Page 82: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

Success Stories: Data to DesignAcceleware EM Field simulation technology for the GPU

3D Finite-Difference and Finite-Element (FDTD)

Modeling of:

Cell phone irradiation

MRI Design / Modeling

Printed Circuit Boards

Radar Cross Section (Military)

Pacemaker with Transmit Antenna10X

1X

4 GPUs2 GPUs1 GPUCPU3.2 GHz

0

100

200

300

400

500

600

700

Performance (Mcells/s)

20X

5X

Page 83: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

EvolvedMachines130X Speed upSimulate brain circuitrySensory computing: vision, olfactory

EvolvedMachines

Page 84: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

10X with MATLAB CPU+GPU

Pseudo-spectral simulation of 2D Isotropic turbulence

Matlab: Language of Science

http://www.amath.washington.edu/courses/571-winter-2006/matlab/FS_2Dturb.m

http://developer.nvidia.com/object/matlab_cuda.html

Page 85: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

MATLAB Example:Advection of an elliptic vortex

256x256 mesh, 512 RK4 steps, Linux, MATLAB filehttp://www.amath.washington.edu/courses/571-winter-2006/matlab/FS_vortex.m

Matlab168 seconds

Matlab with CUDA(single precision FFTs)20 seconds

Page 86: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

MATLAB Example:Pseudo-spectral simulation of 2D Isotropic turbulence

MATLAB 992 seconds

MATLAB with CUDA(single precision FFTs)93 seconds

512x512 mesh, 400 RK4 steps, Windows XP, MATLAB filehttp://www.amath.washington.edu/courses/571-winter-2006/matlab/FS_2Dturb.m

Page 87: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

NAMD/VMD Molecular Dynamics

http://www.ks.uiuc.edu/Research/vmd/projects/ece498/lecture/

240X speedup Computational biology

Page 88: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

Molecular Dynamics Example

Case study: molecular dynamics research at U. Illinois Urbana-Champaign

(Scientist-sponsored) course project for CS 498AL: Programming Massively Parallel Multiprocessors (Kirk/Hwu)Next slides stolen from a nice description of problem, algorithms, and iterative optimization process available at:

http://www.ks.uiuc.edu/Research/vmd/projects/ece498/lecture/

Page 89: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

Page 90: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

Molecular Modeling: Ion Placement

Biomolecular simulations attempt to replicate in vivoconditions in silico.Model structures are initially constructed in vacuumSolvent (water) and ions are added as necessary for the required biological conditionsComputational requirements scale with the size of the simulated structure

Page 91: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

Evolution of Ion Placement CodeFirst implementation was sequentialVirus structure with 10^6 atoms would require 10 CPU daysTuned for Intel C/C++ vectorization+SSE, ~20x speedupParallelized /w pthreads: high data parallelism = linear speedupParallelized GPU accelerated implementation: 3 GeForce 8800GTX cards outrun ~300 Itanium2 CPUs!Virus structure now runs in 25 seconds on 3 GPUs!Further speedups should still be possible…

Page 92: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

Multi-GPU CUDA Coulombic Potential Map Performance

Host: Intel Core 2 Quad, 8GB RAM, ~$3,0003 GPUs: NVIDIA GeForce 8800GTX, ~$550 each32-bit RHEL4 Linux (want 64-bit CUDA!!)235 GFLOPS per GPU for current version of coulombic potential map kernel705 GFLOPS total for multithreaded multi-GPU version Three GeForce 8800GTX GPUs

in a single machine, cost ~$4,650

Page 93: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

ProfessorPartnership

Page 94: GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

© NVIDIA Corporation 2007

NVIDIA Professor Partnership

Support faculty research & teaching effortsSmall equipment gifts (1-2 GPUs) Significant discounts on GPU purchases

Especially Quadro, Tesla equipmentUseful for cost matching

Research contracts Small cash grants (typically ~$25K gifts)Medium-scale equipment donations (10-30 GPUs)

Informal proposals, reviewed quarterlyFocus areas: GPU computing, especially with an educational mission or component

http://www.nvidia.com/page/professor_partnership.html

Easy

Competitive