gpu compute
DESCRIPTION
Introduction to GPU Compute (think CUDA, OpenCL, DirectCompute...) for "non-GPU" programmers. Slight focus on the videogame business.
TRANSCRIPT
![Page 1: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/1.jpg)
1
![Page 2: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/2.jpg)
So I found this Batman image on the net. I had to put it somewhere in the presentation. Here it is. Nvidia has a page on its «CUDA zone» showing applications and speedups; it's a bit silly and not so advertised anymore nowadays, but still useful as an introduction: http://www.nvidia.com/object/cuda-apps-flash-new.html# Intel's Knights Corner news: http://www.intel.com/pressroom/archive/releases/2010/20100531comp.htm http://en.wikipedia.org/wiki/Intel_MIC#Knights_Corner
2
![Page 3: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/3.jpg)
A «wild» and not even a particularly creative example... More computation could enable entirely novel scenarios... The past couple of Unreal Engine demos were architected to show some of what could be possible…
3
![Page 4: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/4.jpg)
I like the idea of motion graphs: http://cg.cis.upenn.edu/hms/research/wcMG/ Also, imperfect shadow maps are cool: http://cg.ibds.kit.edu/english/publications.php The two images top right are for megameshes (Lionhead) and megatextures (id). Fight Night Round 4 and Champion are the first games using EA's Real AI, which is trained from examples.
4
![Page 5: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/5.jpg)
Naughty Dog famously hand-optimizes SPU kernels. How many games pack so much computation, even today, on SPUs? And still, pretty «uniform» workloads, on a processor that can handle non-uniform loads better than compute (lower SIMD width). Heart bokeh image taken from http://www.augustinefou.com/2009/08/bokeh-filter-turns-blurry-lights-into.html; I like how the woman is pushed out of the frame to show the bokeh, as a metaphor of how we did some effects "early" on 360 and PS3.
5
![Page 6: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/6.jpg)
* Data parallelism is "harder" for more high-level tasks, but it's far from impossible. In fact, we still rely on lots of data; we just let this data be combed through with mostly handmade serial logic… Use learning, classification, probabilistic, and optimization methods for games... E.g. in animation, we have data: go from blend trees to animation clips all the way down to working with individual poses; use animation data itself and its metrics to drive animation, not logic! http://graphics.stanford.edu/projects/ccclde/ccclde.pdf http://www.stanford.edu/~svlevine/ Another avenue: simulation, modeling, physics… The last image is from http://www.creativeapplications.net/theory/meibach-and-posavec-data-visualization-poetry-and-sculpture/ Think about what you can do with all the data we already have…
6
![Page 7: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/7.jpg)
7
![Page 8: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/8.jpg)
Usually also superscalar, multiple pipes, SIMD… but for now we don't care.
8
![Page 9: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/9.jpg)
OK, OK, the last Batman…
9
![Page 10: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/10.jpg)
...but often it is not; it's «surprising» (but consider that often the compiler does not know whether a given loop is executed often or not) how much manual unrolling (and prefetching, on platforms without strong cache predictors) in tight loops still pays, allowing the compiler to then compute a good, dependency-free schedule for the generated assembly.
10
![Page 11: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/11.jpg)
11
![Page 12: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/12.jpg)
12
![Page 13: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/13.jpg)
First «insight» we learned.
13
![Page 14: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/14.jpg)
Again, same reasoning
14
![Page 15: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/15.jpg)
15
![Page 16: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/16.jpg)
Fibers and coroutines are e.g. used to hide I/O and network access in web servers (a popular example is node.js).
16
![Page 17: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/17.jpg)
For a sequential CPU loop, all this work is done *in the specified order*. The compiler may reschedule instructions without breaking the logical order of operations, and an out-of-order CPU (e.g. PC, but not PS3 or 360) will re-order instructions (well, micro-instructions) to get rid of dependencies, but the processing order is essentially preserved. By contrast, on the GPU, all you're saying is that you want the whole array processed; you don't care about the order.
17
![Page 18: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/18.jpg)
Of course there is other stuff too, and never mistake a logical model for a physical one, but this is enough for everything we'll need.
18
![Page 19: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/19.jpg)
Vectors are usually even larger than the number of ALUs. Repeat the same instruction a number of times to process an entire vector (this allows the ALUs to be pipelined: e.g. a vector of 16 float4 elements might be processed by four float4 ALUs issuing 4 at a time; each ALU pipeline stage can compute a single float per cycle, and with four stages we achieve the float4 processing with a throughput of one per cycle and a latency of four). We say "threads" here, but they are more similar to lightweight fibers. Also, confusingly, different GPU languages name the same concepts in different ways… The more computational threads we can pack into the registers, the more independent work we'll have to hide latencies! This lightweight threading mechanism is fundamental for performance with manycores; see e.g. http://www.linleygroup.com/newsletters/newsletter_detail.php?num=4747&year=2011&tag=3 which seems a derivative of the Tilera Tile64: http://www.tilera.com/products/processors/TILE64
19
![Page 20: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/20.jpg)
Illustration stolen from Running Code at a Teraflop: Overview of GPU Architecture, Kayvon Fatahalian, Siggraph 2009. Some GPUs do not support variable-length loops; everything gets unrolled and inlined...
20
![Page 21: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/21.jpg)
E.g. Pixar's Reyes rendering algorithm had interpreted, very parallel shaders. The cost of the interpreter is zero when you repeat the same instruction over thousands of elements. Reyes shaders are essentially the same as the hardware shaders we have on modern GPUs. Note that GPU shaders have some explicit SIMD instructions in HLSL and similar languages, supporting vectors of elements up to 4-wide. In practice, many modern GPUs process only scalar code, turning vector instructions into scalars. This is because the SIMD parallelism that GPUs implement is much wider than the 4-wide instructions of the language, coupling many (e.g. 16 or more) execution elements together in a big execution vector. This holds regardless of whether each element is scalar or 4-way; some GPUs even have a 4+1 configuration (one 4-way and one scalar instruction in parallel for each execution element). In general, when we talk about "GPU vectors" we don't mean the actual vector instructions in the shader code, but the aggregation of many ALUs following the same code.
21
![Page 22: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/22.jpg)
22
![Page 23: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/23.jpg)
23
![Page 24: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/24.jpg)
Skinning images from http://blog.wolfire.com/2009/11/volumetric-heat-diffusion-skinning/ If you've done rendering on the GPU, you might notice that the pseudo-code looks nothing like a vertex shader, where you don't fetch data but receive it as the parameters of the «main» function of a kernel. The data is laid out in GPU memory and bound via «streams», «constants» and other means. I didn't include a real vertex shader code example, even if it would not have been harder to read, because it might even be misleading. The way DirectX or OpenGL exposes programmable functionality to the user might be close to the hardware, but most often nowadays it is not. E.g. shaders might indeed compile to include explicit fetches... Pixel shaders can «kill» their output, so to be pedantic they can output one or zero elements. Also, pixel processing might be culled «early»; this is very important. The «independence» that I assert is not entirely true. Pixel shaders need derivative information for all the variables; these are computed using finite differences from the neighboring pixels, which not only constrains them to always be processed in 2x2 pixel units (quads) but also establishes, through the derivatives (ddx, ddy), a means of communication (e.g. see Eric Penner's Shader Amortization using Pixel Quad Message Passing – GPU Pro 2).
24
![Page 25: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/25.jpg)
We'll see some more accurate pseudo-code later… for now let's still pretend to work with a generic "in-parallel".
25
![Page 26: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/26.jpg)
26
![Page 27: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/27.jpg)
* At least in our representation of the GPU... In reality, compilers can decide to spill some of the local state onto the shared memory (we'll see later) and allow for a stack and so on, but that is quite a different thing... E.g. Nvidia Fermi has a limit of 64 temporary registers in kernel code (the number of registers per core is higher, but it gets used for multiple contexts).
27
![Page 28: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/28.jpg)
28
![Page 29: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/29.jpg)
NONE OF THE PSEUDO-CODE DOES BOUNDS CHECKS :)
29
![Page 30: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/30.jpg)
http://en.wikipedia.org/wiki/Sorting_network http://www.cs.brandeis.edu/~hugues/sorting_networks.html http://www.sciencedirect.com/science/article/pii/0020019089901968 http://www.springerlink.com/content/g002019156467227/ Tangentially related, but interesting: http://www.pvk.ca/Blog/2012/08/27/tabasco-sort-super-optimal-merge-sort/
30
![Page 31: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/31.jpg)
31
![Page 32: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/32.jpg)
Rendering is a non-uniform workload; fully automated scheduling has its benefits… early-z and early-stencil rejection perform a hardware, automated «stream compaction»! Fundamental for rendering...
32
![Page 33: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/33.jpg)
At first it's «easy» to complain about bank rules and coalescing issues when trying to access shared and global memory... But we have to consider that the GPU offers the ability to access different memory locations per SIMD element, on many of them, on many cores... This is a rather impressive feat; on standard CPUs, to load data into SIMD vectors you have to do it serially, one element at a time. Memory has addressing limitations; it's a limit of the technology. SIMD communication via voting: all, any, ballot. Atomics: Add, Sub, Inc, Dec, And, Or, Xor, Exch, Min, Max, CompareAndSwap. Append (or) consume structured buffer views (DX11).
33
![Page 34: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/34.jpg)
If we didn't have control over scheduling, we obviously could not access core-local memory, as we would not know which elements are scheduled where.
34
![Page 35: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/35.jpg)
For convenience, in compute, indices can be generated on 1D, 2D or 3D grids. ResetCoreMemory… memory is not –really– reset, but we can't rely on its contents; the initialization is to be done in the kernel, and any output has to go into GLOBAL_MEM, so I "created" this pseudo-function to illustrate this in a C-like way.
35
![Page 36: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/36.jpg)
36
![Page 37: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/37.jpg)
37
![Page 38: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/38.jpg)
In the picture, the first Connection Machine, a 65536-processor supercomputer of the '80s.
38
![Page 39: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/39.jpg)
Shared memory is not exposed to pixel and vertex shaders, as it requires manual scheduling of elements on cores. It's still used internally by the graphics card to store results of rasterization and shared information…
39
![Page 40: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/40.jpg)
40
![Page 41: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/41.jpg)
Note how widening the number of outputs per element looks like «unrolling» the computation... Such scheduling decisions have impacts on latency hiding (switching contexts), the ability to use registers as a cache (e.g. in this computation, as we average columns, going wide on columns will enable reusing fetches from registers), the ability to synchronize different tasks in a single kernel (persistent threads), and so on...
41
![Page 42: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/42.jpg)
Also, many libraries worth checking out: Thrust, Nvidia Performance Primitives, CUDPP, libCL etc… Even more for tasks that are inherently parallel: linear algebra, imaging and so on.
42
![Page 43: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/43.jpg)
43
![Page 44: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/44.jpg)
44
![Page 45: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/45.jpg)
45
![Page 46: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/46.jpg)
All the arrows here represent sums into the existing values; the "self" part of the addition is not shown, to avoid cluttering the diagram with too many arrows. http://en.wikipedia.org/wiki/Prefix_sum
46
![Page 47: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/47.jpg)
All the black arrows here represent sums into the existing values; the "self" part of the addition is not shown, as in the previous slide. The red arrows mark element "moves". http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html www.umiacs.umd.edu/~ramani/cmsc828e_gpusci/ScanTalk.pdf
47
![Page 48: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/48.jpg)
The moderngpu website has lots of code, and it takes a "scan-centric" approach to parallel computation: http://www.moderngpu.com/intro/scan.html http://www.moderngpu.com/scan/segscan.html
48
![Page 49: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/49.jpg)
Blelloch's paper references many application domains: www.cs.cmu.edu/~guyb/papers/Ble93.pdf http://back40computing.googlecode.com/svn-history/r225/wiki/documents/RadixSortTR.pdf
49
![Page 50: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/50.jpg)
50
![Page 51: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/51.jpg)
http://www.cse.chalmers.se/~olaolss/papers/Efficient%20Stream%20Compaction%20on%20Wide%20SIMD%20Many-Core%20Architectures.pdf
51
![Page 52: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/52.jpg)
52
![Page 53: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/53.jpg)
Images from the original paper: On-the-fly Point Clouds through Histogram Pyramids. Gernot Ziegler, Art Tevs, Christian Theobalt, Hans-Peter Seidel http://www.mpi-inf.mpg.de/~gziegler/gpu_pointlist/paper17_gpu_pointclouds.pdf http://www.mpi-inf.mpg.de/%7Egziegler/gpu_pointlist/slides_vmv2006.pdf http://www.astro.lu.se/compugpu2010/resources/bonus_histopyramid.pdf Marching cubes using HP: http://diglib.eg.org/EG/DL/CGF/volume27/issue8/v27i8pp2028-2039.pdf.abstract.pdf
53
![Page 54: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/54.jpg)
Good rendering: https://graphics.stanford.edu/wikis/cs448s-10/FrontPage?action=AttachFile&do=get&target=08-GPUArchII.pdf
54
![Page 55: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/55.jpg)
Some of the compromises; many differences in the implementation… See Debunking the 100x GPU vs CPU Myth for more details: http://www.hwsw.hu/kepek/hirek/2010/06/p451-lee.pdf
55
![Page 56: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/56.jpg)
There are many languages that are trying to target both GPU and CPU; OpenCL has CPU backends. That's without considering CPUs and GPUs on the same chip, like AMD Fusion (see e.g. synergy.cs.vt.edu/pubs/papers/daga-saahpc11-apu-efficacy.pdf), and languages that target the CPU for wide SIMD computing, like Intel's open-source SPMD compiler ISPC: http://ispc.github.com/ … http://tbex.twbbs.org/~tbex/pad/CA_papers/Twin%20Peaks.pdf Throughput computing is everywhere, not only for high-performance calculations but also servers, low power and so on. Amdahl for multicores: research.cs.wisc.edu/multifacet/papers/ieeecomputer08_amdahl_multicore.pdf
56
![Page 57: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/57.jpg)
57
![Page 58: GPU Compute](https://reader033.vdocuments.site/reader033/viewer/2022050816/55325399550346a05d8b456b/html5/thumbnails/58.jpg)
Some “bonus” reads…
BASICS
• Nvidia CUDA programming guide http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf
• N-Body Simulation with CUDA from GPU Gems 3 http://http.developer.nvidia.com/GPUGems3/gpugems3_ch31.html
ADVANCED
• Efficient Primitives and Algorithms for Many-core Architectures http://www.cs.ucdavis.edu/research/tech-reports/2011/CSE-2011-4.pdf
• Persistent threads programming http://developer.download.nvidia.com/GTC/PDF/GTC2012/PresentationPDF/S0157-GTC2012-Persistent-Threads-Computing.pdf
• More on PT and Task Parallelism… http://developer.download.nvidia.com/GTC/PDF/GTC2012/PresentationPDF/S0138-GTC2012-Parallelism-Primitives-Applications.pdf
• Registers or shared memory? http://mc.stanford.edu/cgi-bin/images/6/65/SC08_Volkov_GPU.pdf
• Optimization tips http://www.cs.virginia.edu/~skadron/Papers/cuda_tuning_bof_sc09_final.pdf
DATA STRUCTURES… TONS of spatial ones, very few general purpose… NOTE: attributes of parallel data structures:
58