CENG 545: GPU Computing
Grading
- Midterm Exam: 20%
- Homeworks: 40%
  - Demo/Knowledge: 25%
  - Functionality: 40%
  - Report: 35%
- Project: 40%
  - Design Document: 25%
  - Project Presentation: 25%
  - Demo/Final Report: 50%
Textbooks / References
- D. Kirk and W. Hwu, "Programming Massively Parallel Processors", Morgan Kaufmann, 2010, ISBN 978-0-12-381472-2
- J. Sanders and E. Kandrot, "CUDA by Example: An Introduction to General-Purpose GPU Programming", Pearson, 2010, ISBN 978-0-13-138768-3
- Draft textbook by Prof. Hwu and Prof. Kirk, available at the website
- NVIDIA, NVIDIA CUDA Programming Guide, NVIDIA, 2007 (reference book)
- Videos (Stanford University): http://itunes.apple.com/us/itunes-u/programming-massively-parallel/id384233322
- Lecture notes (University of Illinois): http://courses.engr.illinois.edu/ece498/al/
- Lecture notes will be posted on the class web site
GPU vs CPU
A GPU is tailored for highly parallel operation, while a CPU executes programs serially. For this reason, GPUs have many parallel execution units and higher transistor counts (the GTX 480 has 3,200 million transistors), while CPUs have few execution units and higher clock speeds.
GPUs have much deeper pipelines (several thousand stages vs. 10-20 for CPUs).
GPUs have significantly faster and more advanced memory interfaces, as they need to move far more data than CPUs.
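To make the contrast concrete, here is a minimal sketch (an illustrative example, not from the slides) of the same array operation written for each design: the CPU version walks the elements in a serial loop, while the CUDA version launches one lightweight thread per element.

    // CPU: one core processes the elements one after another.
    void scale_cpu(float *a, float s, int n) {
        for (int i = 0; i < n; ++i)
            a[i] *= s;
    }

    // GPU (CUDA): many threads each handle one element in parallel.
    __global__ void scale_gpu(float *a, float s, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x; // global thread index
        if (i < n)              // guard: the last block may have extra threads
            a[i] *= s;
    }
    // Hypothetical launch: scale_gpu<<<(n + 255) / 256, 256>>>(d_a, s, n);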
Many-core GPUs vs Multicore CPUs
Design philosophies:
- CPU: the design is optimized for sequential code performance; large cache memories are provided to reduce instruction and data access latencies.
Memory bandwidth:
- CPU: must satisfy requests from the OS, applications, and I/O devices.
- GPU: small cache memories are provided to help control bandwidth requirements, so multiple threads that access the same memory data do not all need to go to the DRAM.
Marketplace
CPU vs. GPU - Hardware
More transistors devoted to data processing
Processing Element
Processing element = thread processor = ALU
Memory Architecture
- Constant Memory
- Texture Memory
- Device Memory
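As a rough illustration of how two of these memory spaces look in CUDA C (a sketch; texture memory needs extra binding setup and is omitted here), constant memory is declared with __constant__ and filled from the host, while device (global) memory is allocated with cudaMalloc:

    #include <cuda_runtime.h>

    __constant__ float coeffs[16]; // constant memory: small, cached, read-only in kernels

    __global__ void apply(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = coeffs[0] * in[i];   // all threads read the same cached constant
    }

    void setup(const float *h_coeffs, float **d_in, float **d_out, int n) {
        cudaMemcpyToSymbol(coeffs, h_coeffs, 16 * sizeof(float)); // host -> constant memory
        cudaMalloc(d_in,  n * sizeof(float));                     // device (global) memory
        cudaMalloc(d_out, n * sizeof(float));
    }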
Why Massively Parallel Processor
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign
A quiet revolution and potential build-up:
- Calculation: 367 GFLOPS vs. 32 GFLOPS
- Memory bandwidth: 86.4 GB/s vs. 8.4 GB/s
- Until last year, programmed through the graphics API
- GPU in every PC and workstation: massive volume and potential impact
[Figure: GFLOPS growth of NVIDIA GPUs over time: NV30 (GeForce FX 5800), NV35 (GeForce FX 5950 Ultra), NV40 (GeForce 6800 Ultra), G70 (GeForce 7800 GTX), G71 (GeForce 7900 GTX), G80 (GeForce 8800 GTX)]
GeForce 8800
16 highly threaded SMs (each with 8 SPs), >128 FPUs, 367 GFLOPS, 768 MB DRAM, 86.4 GB/s memory bandwidth, 4 GB/s bandwidth to the CPU.
[Figure: G80 architecture block diagram: Host, Input Assembler, Thread Execution Manager, texture units, parallel data caches, load/store units, and Global Memory]
We have
- GTX 480: 15 highly threaded SMs (each with 32 SPs), 480 FPUs, 1,344 GFLOPS, 1,536 MB DRAM, 177.4 GB/s memory bandwidth
- Tesla C2070: compared to the latest quad-core CPUs, the Tesla C2050 and C2070 Computing Processors deliver equivalent supercomputing performance at 1/10th the cost and 1/20th the power consumption.
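These figures can be checked on whatever card is actually installed; a small sketch using the standard cudaGetDeviceProperties call (the 1,344 GFLOPS above is consistent with 480 FPUs x 1.401 GHz shader clock x 2 flops per fused multiply-add):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, 0);               // query device 0
        printf("%s: %d SMs, %.0f MB DRAM, %.2f GHz\n",
               p.name, p.multiProcessorCount,
               p.totalGlobalMem / (1024.0 * 1024.0),
               p.clockRate / 1.0e6);                  // clockRate is reported in kHz
        return 0;
    }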
Terms
- GPGPU: General-Purpose computing on a Graphics Processing Unit; using graphics hardware for non-graphics computations.
- CUDA: Compute Unified Device Architecture; a software architecture for managing data-parallel programming.
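As a first look at what CUDA code involves (a minimal, self-contained sketch, not taken from the course material): the host allocates device memory, copies input over, launches a data-parallel kernel, and copies the result back.

    #include <cuda_runtime.h>

    __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x; // one thread per element
        if (i < n) c[i] = a[i] + b[i];
    }

    void addOnGpu(const float *a, const float *b, float *c, int n) {
        size_t bytes = n * sizeof(float);
        float *d_a, *d_b, *d_c;
        cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
        cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice);
        vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);  // one thread per element
        cudaMemcpy(c, d_c, bytes, cudaMemcpyDeviceToHost);
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    }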
Parallel Programming
MPI: computing nodes do not share memory; all data sharing and interaction must be done through explicit message passing. CUDA provides shared memory.
OpenMP: it has not been able to scale beyond a couple hundred computing nodes, due to thread management overheads and cache coherence hardware requirements. CUDA achieves simple thread management and requires no cache coherence hardware.
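A sketch of what that shared memory looks like in practice (illustrative only): threads within a block cooperate through fast on-chip __shared__ storage and a barrier, with no message passing and no cache coherence hardware involved.

    // Per-block reduction: each block sums 256 inputs through shared memory.
    // Launch with 256 threads per block, e.g. blockSum<<<numBlocks, 256>>>(...).
    __global__ void blockSum(const float *in, float *out, int n) {
        __shared__ float buf[256];                 // visible to all threads in the block
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;
        __syncthreads();                           // barrier: wait for all loads
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s)
                buf[threadIdx.x] += buf[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            out[blockIdx.x] = buf[0];              // one partial sum per block
    }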
Future Apps Reflect a Concurrent World
Exciting applications in the future mass-computing market have traditionally been considered "supercomputing applications": molecular dynamics simulation, video and audio coding and manipulation, 3D imaging and visualization, consumer game physics, and virtual reality products.
These "super-apps" represent and model the physical, concurrent world.
Various granularities of parallelism exist, but:
- the programming model must not hinder parallel implementation
- data delivery needs careful management
Stretching Traditional Architectures
Traditional parallel architectures cover some super-applications: DSP, GPU, network apps, scientific computing.
The game is to grow mainstream architectures "out" or domain-specific architectures "in". CUDA is the latter.
[Figure: architecture coverage diagram: traditional applications under current architecture coverage, new applications under domain-specific architecture coverage, with obstacles between them]
Previous Projects

Application  Description                                                               Source  Kernel  % time
H.264        SPEC '06 version, change in guess vector                                  34,811  194     35%
LBM          SPEC '06 version, change to single precision and print fewer reports      1,481   285     >99%
RC5-72       Distributed.net RC5-72 challenge client code                              1,979   218     >99%
FEM          Finite element modeling, simulation of 3D graded materials                1,874   146     99%
RPES         Rys Polynomial Equation Solver, quantum chemistry, 2-electron repulsion   1,104   281     99%
PNS          Petri Net simulation of a distributed system                              322     160     >99%
SAXPY        Single-precision SAXPY, used in Linpack's Gaussian elimination routine    952     31      >99%
TPACF        Two Point Angular Correlation Function                                    536     98      96%
FDTD         Finite-Difference Time Domain analysis of 2D EM wave propagation          1,365   93      16%
MRI-Q        Computing a matrix Q, a scanner's configuration, in MRI reconstruction    490     33      >99%

(Source and Kernel are code sizes in lines.)
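The SAXPY entry above is small enough to show in full; a minimal sketch of such a kernel (the standard y = a*x + y operation, not the project's actual code):

    // SAXPY: y[i] = a * x[i] + y[i], one thread per element.
    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }
    // Hypothetical launch: saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);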
Speedup of Applications
[Figure: bar chart of GPU speedup relative to CPU, kernel vs. full application, for H.264, LBM, RC5-72, FEM, RPES, PNS, SAXPY, TPACF, FDTD, MRI-Q, MRI-FHD; the largest kernel speedups reach 210x to 457x]
GeForce 8800 GTX vs. 2.2 GHz Opteron 248: a 10x speedup in a kernel is typical, as long as the kernel can occupy enough parallel threads; 25x to 400x speedups are possible if the function's data requirements and control flow suit the GPU and the application is optimized.