CENG 545: GPU Computing
Grading
- Midterm Exam: 20%
- Homeworks: 40%
  - Demo/Knowledge: 25%
  - Functionality: 40%
  - Report: 35%
- Project: 40%
  - Design Document: 25%
  - Project Presentation: 25%
  - Demo/Final Report: 50%
Textbooks / References
- D. Kirk and W. Hwu, "Programming Massively Parallel Processors", Morgan Kaufmann, 2010, ISBN 978-0-12-381472-2
- J. Sanders and E. Kandrot, "CUDA by Example: An Introduction to General-Purpose GPU Programming", Pearson, 2010, ISBN 978-0-13-138768-3
- Draft textbook by Prof. Hwu and Prof. Kirk, available at the website
- NVIDIA, NVIDIA CUDA Programming Guide, NVIDIA, 2007 (reference book)
- Videos (Stanford University): http://itunes.apple.com/us/itunes-u/programming-massively-parallel/id384233322
- Lecture notes (University of Illinois): http://courses.engr.illinois.edu/ece498/al/
- Lecture notes will be posted on the class web site
GPU vs CPU
A GPU is tailored for highly parallel operation, while a CPU executes programs serially. For this reason, GPUs have many parallel execution units and higher transistor counts (the GTX 480 has 3,200 million transistors), while CPUs have few execution units and higher clock speeds.
GPUs have much deeper pipelines (several thousand stages vs. 10-20 for CPUs).
GPUs have significantly faster and more advanced memory interfaces, as they need to move far more data than CPUs.
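To make the contrast concrete, here is a minimal sketch (an illustrative example, not from the slides) of the same array operation written for each design: the CPU version walks the elements in a serial loop, while the CUDA version launches one lightweight thread per element.

    // CPU: one core processes the elements one after another.
    void scale_cpu(float *a, float s, int n) {
        for (int i = 0; i < n; ++i)
            a[i] *= s;
    }

    // GPU (CUDA): many threads each handle one element in parallel.
    __global__ void scale_gpu(float *a, float s, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x; // global thread index
        if (i < n)              // guard: the last block may have extra threads
            a[i] *= s;
    }
    // Hypothetical launch: scale_gpu<<<(n + 255) / 256, 256>>>(d_a, s, n);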
Many-core GPUs vs Multicore CPUs
Design philosophies:
- CPU: the design is optimized for sequential code performance; large cache memories are provided to reduce instruction and data access latencies.
Memory bandwidth:
- CPU: must satisfy requests from the OS, applications, and I/O devices.
- GPU: small cache memories are provided to help control bandwidth requirements, so multiple threads that access the same memory data do not all need to go to the DRAM.
Marketplace
CPU vs. GPU - Hardware
More transistors devoted to data processing
Processing Element
Processing element = thread processor = ALU
Memory Architecture
- Constant Memory
- Texture Memory
- Device Memory
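As a rough illustration of how two of these memory spaces look in CUDA C (a sketch; texture memory needs extra binding setup and is omitted here), constant memory is declared with __constant__ and filled from the host, while device (global) memory is allocated with cudaMalloc:

    #include <cuda_runtime.h>

    __constant__ float coeffs[16]; // constant memory: small, cached, read-only in kernels

    __global__ void apply(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = coeffs[0] * in[i];   // all threads read the same cached constant
    }

    void setup(const float *h_coeffs, float **d_in, float **d_out, int n) {
        cudaMemcpyToSymbol(coeffs, h_coeffs, 16 * sizeof(float)); // host -> constant memory
        cudaMalloc(d_in,  n * sizeof(float));                     // device (global) memory
        cudaMalloc(d_out, n * sizeof(float));
    }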
Why Massively Parallel Processor
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign
A quiet revolution and potential build-up:
- Calculation: 367 GFLOPS vs. 32 GFLOPS
- Memory bandwidth: 86.4 GB/s vs. 8.4 GB/s
- Until last year, programmed through the graphics API
- GPU in every PC and workstation: massive volume and potential impact
[Figure: GFLOPS growth of NVIDIA GPUs over time: NV30 (GeForce FX 5800), NV35 (GeForce FX 5950 Ultra), NV40 (GeForce 6800 Ultra), G70 (GeForce 7800 GTX), G71 (GeForce 7900 GTX), G80 (GeForce 8800 GTX)]
GeForce 8800
16 highly threaded SMs (each with 8 SPs), >128 FPUs, 367 GFLOPS, 768 MB DRAM, 86.4 GB/s memory bandwidth, 4 GB/s bandwidth to the CPU.
[Figure: G80 architecture block diagram: Host, Input Assembler, Thread Execution Manager, texture units, parallel data caches, load/store units, and Global Memory]
We have
- GTX 480: 15 highly threaded SMs (each with 32 SPs), 480 FPUs, 1,344 GFLOPS, 1,536 MB DRAM, 177.4 GB/s memory bandwidth
- Tesla C2070: compared to the latest quad-core CPUs, the Tesla C2050 and C2070 Computing Processors deliver equivalent supercomputing performance at 1/10th the cost and 1/20th the power consumption.
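These figures can be checked on whatever card is actually installed; a small sketch using the standard cudaGetDeviceProperties call (the 1,344 GFLOPS above is consistent with 480 FPUs x 1.401 GHz shader clock x 2 flops per fused multiply-add):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, 0);               // query device 0
        printf("%s: %d SMs, %.0f MB DRAM, %.2f GHz\n",
               p.name, p.multiProcessorCount,
               p.totalGlobalMem / (1024.0 * 1024.0),
               p.clockRate / 1.0e6);                  // clockRate is reported in kHz
        return 0;
    }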
Terms
- GPGPU: General-Purpose computing on a Graphics Processing Unit; using graphics hardware for non-graphics computations.
- CUDA: Compute Unified Device Architecture; a software architecture for managing data-parallel programming.
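As a first look at what CUDA code involves (a minimal, self-contained sketch, not taken from the course material): the host allocates device memory, copies input over, launches a data-parallel kernel, and copies the result back.

    #include <cuda_runtime.h>

    __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x; // one thread per element
        if (i < n) c[i] = a[i] + b[i];
    }

    void addOnGpu(const float *a, const float *b, float *c, int n) {
        size_t bytes = n * sizeof(float);
        float *d_a, *d_b, *d_c;
        cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
        cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice);
        vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);  // one thread per element
        cudaMemcpy(c, d_c, bytes, cudaMemcpyDeviceToHost);
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    }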
Parallel Programming
MPI: computing nodes do not share memory; all data sharing and interaction must be done through explicit message passing. CUDA provides shared memory.
OpenMP: it has not been able to scale beyond a couple hundred computing nodes, due to thread management overheads and cache coherence hardware requirements. CUDA achieves simple thread management and requires no cache coherence hardware.
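A sketch of what that shared memory looks like in practice (illustrative only): threads within a block cooperate through fast on-chip __shared__ storage and a barrier, with no message passing and no cache coherence hardware involved.

    // Per-block reduction: each block sums 256 inputs through shared memory.
    // Launch with 256 threads per block, e.g. blockSum<<<numBlocks, 256>>>(...).
    __global__ void blockSum(const float *in, float *out, int n) {
        __shared__ float buf[256];                 // visible to all threads in the block
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;
        __syncthreads();                           // barrier: wait for all loads
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s)
                buf[threadIdx.x] += buf[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            out[blockIdx.x] = buf[0];              // one partial sum per block
    }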
Future Apps Reflect a Concurrent World
Exciting applications in the future mass-computing market have traditionally been considered "supercomputing applications": molecular dynamics simulation, video and audio coding and manipulation, 3D imaging and visualization, consumer game physics, and virtual reality products.
These "super-apps" represent and model the physical, concurrent world.
Various granularities of parallelism exist, but:
- the programming model must not hinder parallel implementation
- data delivery needs careful management
Stretching Traditional Architectures
Traditional parallel architectures cover some super-applications: DSP, GPU, network apps, scientific computing.
The game is to grow mainstream architectures "out" or domain-specific architectures "in". CUDA is the latter.
[Figure: architecture coverage diagram: traditional applications under current architecture coverage, new applications under domain-specific architecture coverage, with obstacles between them]
Previous Projects

Application  Description                                                               Source  Kernel  % time
H.264        SPEC '06 version, change in guess vector                                  34,811  194     35%
LBM          SPEC '06 version, change to single precision and print fewer reports      1,481   285     >99%
RC5-72       Distributed.net RC5-72 challenge client code                              1,979   218     >99%
FEM          Finite element modeling, simulation of 3D graded materials                1,874   146     99%
RPES         Rys Polynomial Equation Solver, quantum chemistry, 2-electron repulsion   1,104   281     99%
PNS          Petri Net simulation of a distributed system                              322     160     >99%
SAXPY        Single-precision SAXPY, used in Linpack's Gaussian elimination routine    952     31      >99%
TPACF        Two Point Angular Correlation Function                                    536     98      96%
FDTD         Finite-Difference Time Domain analysis of 2D EM wave propagation          1,365   93      16%
MRI-Q        Computing a matrix Q, a scanner's configuration, in MRI reconstruction    490     33      >99%

(Source and Kernel are code sizes in lines.)
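The SAXPY entry above is small enough to show in full; a minimal sketch of such a kernel (the standard y = a*x + y operation, not the project's actual code):

    // SAXPY: y[i] = a * x[i] + y[i], one thread per element.
    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }
    // Hypothetical launch: saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);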
Speedup of Applications
[Figure: bar chart of GPU speedup relative to CPU, kernel vs. full application, for H.264, LBM, RC5-72, FEM, RPES, PNS, SAXPY, TPACF, FDTD, MRI-Q, MRI-FHD; the largest kernel speedups reach 210x to 457x]
GeForce 8800 GTX vs. 2.2 GHz Opteron 248: a 10x speedup in a kernel is typical, as long as the kernel can occupy enough parallel threads; 25x to 400x speedups are possible if the function's data requirements and control flow suit the GPU and the application is optimized.