Brook for GPUs
Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fatahalian, Mike Houston, Pat Hanrahan
Stanford University
DARPA Site Visit, UNC, May 6th, 2004
May 6, 2004 2
Motivation
• GPUs are faster than CPUs
• GPUs are getting faster, faster
• Why?
  – Massive parallelism (1000s of ALUs)
  – Choreographed communication
  – Efficiently utilize VLSI resources [DIS/PCA mantra]
• Programmable GPUs = stream processors
• Many streaming applications beyond graphics
• Buy a desktop supercomputer for $50!
• Revolutionize computing?
Recent Performance Trends
CPU vs GPU
• Intel 3 GHz Pentium 4
  – 12 GFLOPS peak performance (via SSE2)
  – 5.96 GB/sec peak memory bandwidth
  – 44 GB/sec peak bandwidth from 8 KB L1 data cache
• NVIDIA GeForce 6800
  – 45 GFLOPS peak performance
  – 36 GB/sec peak memory bandwidth
  – Texture cache bandwidth and size undisclosed
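A back-of-envelope way to read these numbers (our calculation, not from the slide; it assumes 4 bytes per fetched float) is to ask how many math operations per float fetched each chip needs before compute, rather than memory bandwidth, becomes the bottleneck:

```python
# Hypothetical back-of-envelope calculation (not from the slides): given peak
# GFLOPS and peak memory bandwidth, how many floating-point operations per
# 4-byte float fetched does each chip need to become compute-bound?

chips = {
    # name: (peak GFLOPS, peak memory bandwidth in GB/sec), from the slide
    "Pentium 4 (SSE2)": (12.0, 5.96),
    "GeForce 6800":     (45.0, 36.0),
}

for name, (gflops, gbps) in chips.items():
    floats_per_sec = gbps / 4                 # 4 bytes per 32-bit float
    ops_per_float = gflops / floats_per_sec
    print(f"{name}: ~{ops_per_float:.1f} ops per float fetched to reach peak")
```

The GPU's much higher memory bandwidth means it needs fewer operations per fetch (about 5) than the Pentium 4 (about 8) to keep its ALUs busy, despite having nearly four times the peak FLOPS.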
Deliverables
• Develop a version of PCA Brook for GPUs
  – Programmer need not know GL
• Versions
  – New ATI (R420) and NVIDIA (NV40) hardware
  – Linux and Windows
  – DX and OpenGL
• Release as open source [V1.0 Dec 2003]
• Support OneSAF LOS, collision-detection, and route-planning algorithms
Research Issues
• Brook semantics
  – E.g., variable-length streams: vout
  – …
• Compilation techniques
  – Virtualization of the GPU
  – Splitting kernels (MRDS)
• Explore the streaming application space
  – Scientific computing: RT, MD, BLAS, FFT, …
  – Machine learning: HMM, linear models, Bayes, …
Brook Update
Ian Buck
Understanding the Efficiency of GPU Algorithms for Matrix-Matrix Multiplication
Kayvon Fatahalian, Jeremy Sugerman, Pat Hanrahan
Dense Matrix-Matrix Multiplication
ATLAS on the Intel P4 wins!
CPU vs GPU
• Intel 3 GHz Pentium 4
  – 12 GFLOPS peak performance (via SSE2)
  – 5.96 GB/sec peak memory bandwidth
  – 44 GB/sec peak bandwidth from 8 KB L1 data cache
• NVIDIA GeForce 6800
  – 43 GFLOPS peak performance
  – 36 GB/sec peak memory bandwidth
  – Texture cache bandwidth and size undisclosed

Why is graphics hardware so slow?
Why is Graphics Hardware so Slow?
| GPU | GFLOPS | Cache BW (GB/sec) | Seq Read BW (GB/sec) |
|---|---|---|---|
| NV35 | 39.99 | 11.08 | 4.40 |
| NV40 | 43.00 | 18.9 | 3.85 |
| ATI 9800XT | 26.14 | 12.20 | 7.33 |
| ATI X800 | 33.4 | 30.7 | 18.4 |
Microbenchmark (MAD)
NVIDIA: 8% compute efficiency, 82% of cache bandwidth.
• Arithmetic intensity: 12 math operations per float fetched from cache

ATI: 18% of peak performance, 99% of peak cache bandwidth.
• Arithmetic intensity: 8-to-1 math-to-cache-fetch ratio
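The quoted intensity ratios can be sanity-checked from the table: the intensity required to sustain peak is roughly peak GFLOPS divided by the cache-fetch rate in floats per second. A short sketch (our reconstruction of the calculation, assuming 4 bytes per float; not from the slide):

```python
# Reconstruction (our assumption, not from the slide) of the arithmetic-
# intensity figures: math ops per 4-byte float fetched from the texture
# cache needed to sustain each chip's peak GFLOPS.

gpus = {
    # name: (peak GFLOPS, cache bandwidth in GB/sec), from the table above
    "NV35":       (39.99, 11.08),
    "NV40":       (43.00, 18.9),
    "ATI 9800XT": (26.14, 12.20),
    "ATI X800":   (33.4, 30.7),
}

for name, (gflops, cache_gbps) in gpus.items():
    ops_per_fetch = gflops / (cache_gbps / 4)   # 4 bytes per float
    print(f"{name}: ~{ops_per_fetch:.1f} math ops per cache fetch")
```

This gives roughly 14:1 for NV35 and about 8.6:1 for the ATI 9800XT, in the same ballpark as the 12:1 and 8:1 ratios quoted above.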
Why is Graphics Hardware so Slow?
• Matrix-matrix multiplication is bandwidth-limited on the GPU.
  – Memory blocking to increase cache utilization does not help
  – Architectural problem, not a programming-model problem
• PCA stream-processing architectures (Imagine) will do much better!
| GPU/CPU | GFLOPS | Bandwidth (GB/sec) |
|---|---|---|
| NV35 | 3.04 | 9.07 |
| NV40 | 7.24 | 14.88 |
| ATI 9800XT | 4.83 | 12.06 |
| ATI X800 | ~12 | ~30 |
| P4 | 7.78 | 27.68 |
Matrix-Matrix Multiplication
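The "memory blocking" mentioned above is the standard CPU tiling trick; the slide's point is that on these GPUs it does not help. A pure-Python sketch of the access pattern (illustrative only, not the benchmarked code):

```python
# Illustration of cache blocking (tiling) for matrix multiply: each bs x bs
# tile of C is accumulated from small tiles of A and B, so the working set
# stays cache-resident. Not the benchmarked code; just the access pattern.

def blocked_matmul(A, B, bs=2):
    n = len(A)                          # assume square n x n, n divisible by bs
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, bs):
        for j0 in range(0, n, bs):
            for k0 in range(0, n, bs):
                # one tile product touches only three bs x bs tiles
                for i in range(i0, i0 + bs):
                    for j in range(j0, j0 + bs):
                        for k in range(k0, k0 + bs):
                            C[i][j] += A[i][k] * B[k][j]
    return C

print(blocked_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# → [[19.0, 22.0], [43.0, 50.0]]
```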
Variable Output Shaders
Daniel Horn, Ian Buck, Pat Hanrahan
Motivation: Enabling Algorithms
• Not all algorithms map to the 1-in/1-out semantics of GPUs
• Other classes of algorithms require data filtering (1-in/0-out) and amplification (1-in/n-out)
• vout is a conditional write on Imagine
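A minimal Python sketch (not Brook syntax; the `vout_map` helper is our hypothetical illustration) of what variable-output semantics buy you: a kernel may emit zero records per input (filtering) or several (amplification), which a fixed 1-in/1-out fragment pipeline cannot express directly.

```python
# Illustrative model of vout-style semantics (hypothetical helper, not the
# Brook API): the kernel returns 0..n output records per input record.

def vout_map(kernel, stream):
    out = []
    for x in stream:
        out.extend(kernel(x))   # zero outputs = filtering, many = amplification
    return out

# Keep even values and duplicate each one: filtering and amplification at once.
result = vout_map(lambda x: [x, x] if x % 2 == 0 else [], range(6))
print(result)   # → [0, 0, 2, 2, 4, 4]
```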
Algorithms
• Ray tracing terrains
• Marching cubes
• Adaptive subdivision surfaces
• Collision detection [OBB]
• Graph traversal
• …
Implementation on GPU
• Push output (sentinel if no push)
• Options to consolidate sentinels:
  – Sort, O(n (log n)^2): sort sentinels to the end, truncate
  – Scan/Search, O(n log n): perform a running sum, then search for gather location
  – Scan/Scatter, O(n log n): perform a running sum, scatter to destination
  – Constant-time hardware implementation
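The scan/scatter option above can be sketched in a few lines of Python (an illustration of the idea, not the GPU implementation): an exclusive prefix sum over the "valid" flags gives each surviving record its destination index, and a scatter then compacts the stream.

```python
# Scan/scatter compaction sketch: outputs carry a sentinel where the kernel
# pushed nothing; an exclusive prefix sum of the valid flags yields each
# record's destination, and a scatter writes the compacted stream.

SENTINEL = None

def compact(stream):
    valid = [x is not SENTINEL for x in stream]
    dest, total = [], 0
    for v in valid:                 # exclusive prefix sum (running sum)
        dest.append(total)
        total += v
    out = [None] * total
    for x, v, d in zip(stream, valid, dest):
        if v:
            out[d] = x              # scatter to destination
    return out

print(compact([7, SENTINEL, 3, SENTINEL, SENTINEL, 9]))   # → [7, 3, 9]
```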
Timing and Bandwidth Numbers
Future Work
• Brook: semantics, compiling, virtualization
  – Support new GPU features (branching, FB ops, …)
  – Predication
• Integration with the graphics pipeline
  – Documented path to texture for rendering
  – Access to other GPU features, e.g. occlusion culling
• Interactive simulation; new algorithms
  – Collision detection and line-of-sight calculations: merge ray tracer with UNC/SAIC algorithm
  – Machine learning: HMM, GLM, K-means, …
  – Protein folding (StreamMD) and docking
  – Virtual surgery
Distributed Brook
• Stream- and thread-level parallelism
• UPC distributed-memory semantics
• PCI Express system for fast readback
GPU Cluster [DOE]
• 16-node cluster (aggregate)
  – 32 2.4 GHz P4 Xeons
  – 16 GB DDR
  – 1.2 TB disk
  – InfiniBand 4X interconnect
• Each node (3U, half depth)
  – Dual 2.4 GHz P4 Xeons
  – Intel E7505 chipset
  – 1 GB DDR
  – ATI Radeon 9800 Pro 256 MB
  – GigE
  – 80 GB IDE disk
Questions?
Fly-fishing fly images from The English Fly Fishing Shop