GPU Data Formatting and Addressing
Aaron Lefohn, University of California, Davis


Page 1

GPU Data Formatting and Addressing

Aaron Lefohn, University of California, Davis

Page 2

Overview

• GPU Memory Model

• GPU-Based Data Structures

• Performance Considerations

Page 3

GPU Memory Model

• GPU Data Storage
  – Vertex data
  – Texture data
  – Frame buffer

[Figure: pipeline diagram showing vertex data feeding the vertex processor, then the rasterizer and fragment processor, which writes the frame buffer(s); texture data is read by the processors, with vertex texture reads labeled "PS3.0 GPUs"]

Page 4

GPU Memory Model

• Read-Only
  – Traditional use of GPU memory
  – CPU writes, GPU reads

• Read/Write
  – Save frame buffer(s) for later use as a texture or vertex array
  – Save up to sixteen 32-bit floating-point values per pixel
    • Multiple Render Targets (MRTs); see the sketch below
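
The per-pixel limit above (four 4-channel float buffers, sixteen floats total) corresponds to a fragment program with multiple color outputs. A minimal Cg sketch, assuming the application has bound four floating-point draw buffers; the names and the example math are illustrative, not from the slides:

    // Write four RGBA float results from one fragment program invocation.
    struct FragOut
    {
        float4 buf0 : COLOR0;
        float4 buf1 : COLOR1;
        float4 buf2 : COLOR2;
        float4 buf3 : COLOR3;
    };

    FragOut writeMRT( float2 texCoord : TEXCOORD0,
                      uniform sampler2D srcTex )
    {
        FragOut OUT;
        float4 v = tex2D( srcTex, texCoord );
        OUT.buf0 = v;                          // e.g. the value itself
        OUT.buf1 = v * v;                      // e.g. a derived quantity
        OUT.buf2 = float4( texCoord, 0, 1 );   // e.g. bookkeeping data
        OUT.buf3 = float4( 0, 0, 0, 1 );       // fourth 4-vector of results
        return OUT;
    }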

Page 5

How to Save Render Results

1. Copy framebuffer result to "other GPU memory"
   – Copy-to-texture
   – Copy-to-vertex-array

2. Write directly to "other GPU memory"
   – Render-to-texture
   – Render-to-vertex-array

Page 6

OpenGL GPU Memory Writes

• Texture
  1. Copy frame buffer to texture
  2. Render-to-texture
     • WGL_ARB_render_texture
     • GL_EXT_render_target
     • Superbuffers

• Vertex Array
  1. Copy frame buffer to vertex array
     • GL_EXT_pixel_buffer_object
     • Superbuffers
  2. Render-to-vertex-array
     • Superbuffers

Page 7

Render-To-Texture: 1

• Copy-To-Texture
  – Good
    • Cross-platform texture writes
    • Flexible output
    • 2D output can be copied to a 1D, 2D, or 3D texture
  – Bad
    • Slow
    • Consumes internal GPU memory bandwidth

Page 8

Render-To-Texture: 2

• WGL_ARB_render_texture
  – Render-to-texture (RTT) using pbuffers
    http://oss.sgi.com/projects/ogl-sample/registry/ARB/wgl_render_texture.txt
  – Good
    • Fast RTT
    • Current state of the art for RTT
  – Bad
    • Only works on Windows
    • Slow OpenGL context switches
    • Many hacks to avoid this bottleneck

Page 9

Render-To-Texture: 3

• GL_EXT_render_target
  – Proposed extension for cross-platform RTT
    http://www.opengl.org/resources/features/GL_EXT_render_target.txt
  – Good
    • Cross-platform, efficient RTT solution
    • Lightweight, simple extension
  – Bad
    • Specification not approved (April 24, 2004)
    • No implementations exist (April 24, 2004)

Page 10

Render-To-Texture: 4

• Superbuffers
  – Proposed new memory model for GPUs
    http://www.ati.com/developer/gdc/SuperBuffers.pdf
  – Good
    • Unified GPU memory model
    • Render to any GPU memory
    • Cross-platform (OpenGL owns the memory, not the OS)
    • Mix-and-match depth/stencil/color buffers
  – Bad
    • Large, complex extension
    • Specification not approved (April 24, 2004)
    • Only driver support is an alpha version (ATI)

Page 11

Render-To-Texture Summary

• OpenGL RTT Currently Only Under Windows
  – Pbuffers
    • Complex and awkward RTT mechanism
    • Current state of the art

• Cross-Platform RTT Coming Soon…

Page 12

Render-To-Vertex-Array: 1

• GL_EXT_pixel_buffer_object
  – Copy framebuffer to vertex buffer object
    http://developer.nvidia.com/object/nvidia_opengl_specs.html
  – Good
    • Uses only GPU/AGP memory bandwidth
    • Works with current drivers (NVIDIA)
  – Bad
    • No direct render-to-vertex-array (slower than true RTVA)
    • No ATI implementation

Page 13

Render-To-Vertex-Array: 2

• Superbuffers
  – Write to "memory object" as a render target
  – Read from "memory object" as a vertex array
  – Good
    • Direct render-to-vertex-array (fast)
  – Bad
    • Can render results always be interpreted as vertex data?
    • Large, complex, unapproved extension, …

Page 14

Render-To-Vertex-Array Summary

• Current OpenGL Support
  – NVIDIA: GL_EXT_pixel_buffer_object
  – ATI: Superbuffers

• Semantics Still Under Development…

Page 15

Fbuffer: Capturing Fragments

• Idea
  – "Rasterization-Order FIFO Buffer"
  – Render results are fragment values instead of pixel values
  – Mark and Proudfoot, Graphics Hardware 2001
    http://graphics.stanford.edu/projects/shading/pubs/hwws2001-fbuffer/

• Uses
  – Designed for multi-pass rendering with transparent geometry
  – New possibilities for GPGPU?
    • Varying number of results per pixel
    • RTT and RTVA with an fbuffer?

Page 16

Fbuffer: Capturing Fragments

• Implementations
  – ATI Radeon 9800 and newer ATI GPUs
  – Not yet exposed to the user (ask for it!)

• Problems
  – Size of the fbuffer is not known before rendering
  – GPUs cannot perform dynamic memory allocation
  – How to handle buffer overflow?

Page 17

Overview

• GPU Memory Model

• GPU-Based Data Structures

• Performance Considerations

Page 18

GPU-Based Data Structures

• Building Blocks
  – GPU memory addresses
    • Address generation
    • Address use
    • Pointers
  – Multi-dimensional arrays
  – Sparse representations

Page 19

GPU Memory Addresses

• Where Are Addresses Generated?
  – CPU: vertex stream or textures
  – Vertex processor: input stream, ALU ops, or textures
  – Rasterizer: interpolation
  – Fragment processor: input stream, ALU ops, or textures

[Figure: pipeline diagram of CPU, vertex processor, rasterizer, and fragment processor]

Page 20

GPU Memory Addresses

• Where Are Addresses Used?
  – Vertex textures (PS3.0 GPUs)
  – Fragment textures

[Figure: pipeline diagram with texture data read by the vertex and fragment processors]

Page 21

GPU Memory Addresses

• Pointers
  – Store addresses in a texture
  – Dependent texture read
  – Example: see Tim Purcell's ray tracing talk

float2 addr = tex2D( addrTex, texCoord );

float2 data = tex2D( dataTex, addr );

[Figure: a four-texel address texture (indices 0-3) whose values point into a four-texel data texture (indices 0-3)]

Page 22

GPU-Based Data Structures

• Building Blocks
  – GPU memory addresses
    • Address generation
    • Address use
    • Pointers
  – Multi-dimensional arrays
  – Sparse representations

Page 23

Multi-Dimensional Arrays

• Build Data Structures in 2D Memory
  – Read/write GPU memory is optimized for 2D (images)

• But Isn't Physical Memory 1D?
  – GPU memory hierarchy is optimized to capture 2D locality
    • Rasterization
    • Texture filtering
    • Igehy, Eldridge, Proudfoot, "Prefetching in a Texture Cache Architecture," Graphics Hardware 1998

• Conclusion: use the illusion of 2D physical memory

Page 24

GPU Arrays

• Large 1D Arrays
  – Current GPUs limit 1D array sizes to 2048 or 4096 elements
  – Pack into 2D memory
  – 1D-to-2D address translation (see the sketch below)
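
A minimal Cg sketch of the 1D-to-2D translation; the function and parameter names are illustrative, not from the slides. The 1D index is split into a row and column of the 2D texture holding the packed array, offset to the texel center, and normalized:

    // Convert a 1D array index into a normalized 2D texture coordinate.
    // texWidth/texHeight describe the 2D texture that stores the packed array.
    float2 addr1Dto2D( float index, float texWidth, float texHeight )
    {
        float row = floor( index / texWidth );
        float col = index - row * texWidth;    // index mod texWidth
        // Offset to the texel center, then normalize to [0,1]
        return float2( (col + 0.5) / texWidth, (row + 0.5) / texHeight );
    }

    // Usage inside a fragment program:
    //   float4 value = tex2D( dataTex, addr1Dto2D( i, 1024.0, 1024.0 ) );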

Page 25

GPU Arrays

• 3D Arrays
  – Problem
    • GPUs do not have 3D frame buffers
    • No RTT to a slice of a 3D texture (except Superbuffers)
  – Solutions
    1. Stack of 2D slices
    2. Multiple slices per 2D buffer (see the addressing sketch below)
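
A minimal Cg sketch of the addressing for solution 2; the names are illustrative, not from the slides. The Z slices of the volume are tiled left to right, top to bottom in one large 2D texture, and a voxel coordinate is translated into a 2D texture coordinate:

    // Address voxel (x,y,z) of an sx*sy*sz volume whose Z slices are tiled
    // into one large 2D texture.
    float2 addr3Dto2D( float3 voxel,        // integer voxel coordinate (x,y,z)
                       float3 volSize,      // (sx, sy, sz)
                       float  tilesPerRow,  // number of slices per row of tiles
                       float2 texSize )     // size of the flattened 2D texture
    {
        float tileRow = floor( voxel.z / tilesPerRow );
        float tileCol = voxel.z - tileRow * tilesPerRow;
        float2 texel  = float2( tileCol, tileRow ) * volSize.xy + voxel.xy + 0.5;
        return texel / texSize;              // normalized 2D texture coordinate
    }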

Page 26

GPU Arrays

• Problems With 3D Arrays for GPGPU
  – Cannot read a stack of 2D slices as a 3D texture
  – Must know which slices are needed in advance
  – Visualization of 3D data is difficult

• Solutions
  – Need render-to-slice-of-3D-texture (Superbuffers)
  – Volume rendering of slice-based 3D data
    • Course 28, "Real-Time Volume Graphics," SIGGRAPH 2004

Page 27

GPU Arrays

• Higher-Dimensional Arrays
  – Pack into 2D buffers
  – N-D-to-2D address translation
  – Same problems as 3D arrays if the data does not fit in a single 2D texture

• Conclusions
  – The fundamental GPU memory primitive is a fixed-size 2D array
  – GPGPU needs a more general memory model

Page 28

GPU-Based Data Structures

• Building Blocks
  – GPU memory addresses
    • Address generation
    • Address use
    • Pointers
  – Multi-dimensional arrays
  – Sparse representations

Page 29

Sparse Data Structures

• Why Sparse Data Structures?
  – Reduce computational workload
  – Reduce memory pressure

• Examples
  – Sparse matrices
    • Krueger et al., SIGGRAPH 2003
    • Bolz et al., SIGGRAPH 2003
  – Implicit surface computations (sparse volumes)
    • Sherbondy et al., IEEE Visualization 2003
    • Lefohn et al., IEEE Visualization 2003
    • Premoze et al., Eurographics 2003

Page 30

Sparse Computation

• Option 1: Store Complete Data Set on GPU
  – Cull unused data
  – Conditional execution tricks (discussed earlier)

• Option 2: Store Only Sparse Data on GPU
  – Saves memory
  – Potentially much faster than culling
  – Much more complicated (especially if time-varying)

Page 31

Sparse Data Structures

• Basic Idea
  – Pack "active" data elements into GPU memory
  – For more information:
    • Linear algebra section in this course: static structures
    • Level-set case study in this course: dynamic structures

Page 32

Sparse Data Structures

• Addressing Sparse Data
  – Neighborhoods are no longer implicitly defined on a grid
  – Use pointer-based data structures to locate neighbors (see the sketch below)
    • Pre-compute neighbor addresses if possible
      – Use the CPU or vertex processor
      – Removes pointer dereference from the fragment program
  – Separate the common addressing case from boundary conditions
    • Common case must be cache coherent
    • See the Harris and Lefohn case studies for the "substream" technique
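
A minimal Cg sketch of the pointer-based neighbor lookup; the texture names are illustrative, not from the slides. A precomputed address texture stores, for each packed element, the 2D address of one of its neighbors, and a single dependent read dereferences it:

    // One level of indirection: fetch the precomputed neighbor address,
    // then read the neighbor's value from the packed data texture.
    float4 sparseNeighbor( float2 texCoord : TEXCOORD0,
                           uniform sampler2D neighborAddrTex,
                           uniform sampler2D dataTex ) : COLOR
    {
        float2 nbrAddr = tex2D( neighborAddrTex, texCoord ).xy;  // pointer
        return tex2D( dataTex, nbrAddr );                        // dereference
    }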

Page 33

Overview

• GPU Memory Model

• GPU-Based Data Structures

• Performance Considerations

Page 34

Memory Performance Issues

• Pbuffer Survival Guide

• Dependent Texture Costs

• Computational Frequency

Page 35

Pbuffer Survival Guide

• Pbuffers Give Us Render-To-Texture
  – Designed to create an environment map or two
  – Never intended to be used for GPGPU (100s of pbuffers)
  – Problem
    • Each pbuffer has its own OpenGL render context
    • Each pbuffer may have a depth and/or stencil buffer
    • Changing OpenGL contexts is slow
  – Solution
    • Many optimizations to avoid this bottleneck…

Page 36

Pbuffer Survival Guide

1. Pack Scalar Data Into RGBA (see the sketch below)
   – >4x memory savings
   – 4x reduction in context switches
   – Be careful of the read-modify-write hazard

[Figure: scalar data in four RGBA pbuffers packed into a single RGBA pbuffer]
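
A minimal Cg sketch of operating on packed data; the update rule is only a stand-in, not from the slides. Because four scalar elements live in the RGBA channels of one texel, each fragment program invocation reads and updates four elements at once:

    // Four scalar elements are packed into one RGBA texel; one fragment
    // updates all four channels in a single pass.
    float4 updatePacked( float2 texCoord : TEXCOORD0,
                         uniform sampler2D packedTex,
                         uniform float scale,
                         uniform float bias ) : COLOR
    {
        float4 fourElements = tex2D( packedTex, texCoord );
        return fourElements * scale + bias;   // applied to all four channels
    }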

Page 37

Pbuffer Survival Guide

2. Use Multi-Surface Pbuffers
   – Each RGBA surface is its own render-texture
     • Front, Back, AuxN (N = 0, 1, 2, …)
   – Greatly reduces context switches
   – Technically illegal, but "blessed" by ATI; works on NVIDIA

[Figure: five pbuffers with one RGBA surface each versus one pbuffer with five RGBA surfaces]

Page 38

Pbuffer Survival Guide

2. Using Multi-Surface Pbuffers
   a) Allocate a double-buffered pbuffer (and/or with AUX buffers)
   b) Set the render target to the back buffer:
      glDrawBuffer(GL_BACK)
   c) Bind the front buffer as a texture:
      wglBindTexImageARB(hpbuffer, WGL_FRONT_ARB)
   d) Render
   e) Switch buffers:
      wglReleaseTexImageARB(hpbuffer, WGL_FRONT_ARB)
      glDrawBuffer(GL_FRONT)
      wglBindTexImageARB(hpbuffer, WGL_BACK_ARB)

Page 39

Pbuffer Survival Guide

3. Pack 2D Domains Into a Large Buffer
   – "Flat 3D textures"
   – Be careful of the read-modify-write hazard

[Figure: a 3D volume and its flattened 2D layout]

Page 40

Dependent Texture Costs

• Cache Coherency
  – Dependent reads are fast if they hit the cache
    • Even chained dependencies can be the same speed as non-dependent reads
  – Very slow if out of cache
    • Example: 3 levels of dependent cache misses can be >10x slower
  – More detail in "GPU Computation Strategies and Tricks"

Page 41

Computational Frequency

• Compute Memory Addresses at Low Frequency
  – Compute memory addresses in the vertex program (see the sketch below)
    • Let rasterizer interpolation create per-fragment addresses
    • Compute neighbor addresses this way
  – Avoid fragment-level address computation whenever possible
    • Consumes fragment instructions
    • Computation is often redundant with neighboring fragments
    • May defeat texture pre-fetch
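
A minimal Cg sketch of moving the address math to the vertex processor; the structure and parameter names are illustrative, not from the slides. The vertex program writes the element's own texture coordinate and its left/right neighbor coordinates to interpolants, so the rasterizer produces per-fragment addresses at no fragment-instruction cost:

    struct VertOut
    {
        float4 hpos     : POSITION;
        float2 center   : TEXCOORD0;   // this element's address
        float2 leftNbr  : TEXCOORD1;   // address of the -x neighbor
        float2 rightNbr : TEXCOORD2;   // address of the +x neighbor
    };

    VertOut addressVP( float4 position : POSITION,
                       float2 texCoord : TEXCOORD0,
                       uniform float4x4 modelViewProj,
                       uniform float2   texelSize )   // 1 / texture dimensions
    {
        VertOut OUT;
        OUT.hpos     = mul( modelViewProj, position );
        OUT.center   = texCoord;
        OUT.leftNbr  = texCoord - float2( texelSize.x, 0.0 );
        OUT.rightNbr = texCoord + float2( texelSize.x, 0.0 );
        return OUT;
    }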

Page 42

Conclusions

• GPU Memory Model Evolving
  – Writable GPU memory forms a loop-back in an otherwise feed-forward streaming pipeline
  – The memory model will continue to evolve as GPUs become more general stream processors

• GPGPU Data Structures
  – The basic memory primitive is a limited-size 2D texture
  – Use address translation to fit all array dimensions into 2D
  – Maintain 2D cache locality

• Render-To-Texture
  – Use pbuffers with care and eagerly adopt their successor

Page 43

Selected References

• J. Bolz, I. Farmer, E. Grinspun, P. Schröder, "Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid," SIGGRAPH 2003

• N. Goodnight, C. Woolley, G. Lewin, D. Luebke, G. Humphreys, “A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware,” Graphics Hardware 2003

• M. Harris, W. Baxter, T. Scheuermann, A. Lastra, "Simulation of Cloud Dynamics on Graphics Hardware," Graphics Hardware 2003

• H. Igehy, M. Eldridge, K. Proudfoot, “Prefetching in a Texture Cache Architecture,” Graphics Hardware 1998

• J. Krueger, R. Westermann, “Linear Algebra Operators for GPU Implementation of Numerical Algorithms,” SIGGRAPH 2003

• A. Lefohn, J. Kniss, C. Hansen, R. Whitaker, "A Streaming Narrow-Band Algorithm: Interactive Computation and Visualization of Level Sets," IEEE Transactions on Visualization and Computer Graphics 2004

Page 44

Selected References

• A. Lefohn, J. Kniss, C. Hansen, R. Whitaker, “Interactive Deformation and Visualization of Level Set Surfaces Using Graphics Hardware,” IEEE Visualization 2003

• W. Mark, K. Proudfoot, “The F-Buffer: A Rasterization-Order FIFO Buffer for Multi-Pass Rendering,” Graphics Hardware 2001

• T. Purcell, C. Donner, M. Cammarano, H. W. Jensen, P. Hanrahan, “Photon Mapping on Programmable Graphics Hardware,” Graphics Hardware 2003

• A. Sherbondy, M. Houston, S. Napel, “Fast Volume Segmentation With Simultaneous Visualization Using Programmable Graphics Hardware,” IEEE Visualization 2003

Page 45

OpenGL References

• GL_EXT_pixel_buffer_object
  http://www.nvidia.com/dev_content/nvopenglspecs/GL_EXT_pixel_buffer_object.txt

• GL_EXT_render_target
  http://www.opengl.org/resources/features/GL_EXT_render_target.txt

• OpenGL Extension Registry
  http://oss.sgi.com/projects/ogl-sample/registry/

• Superbuffers
  http://www.ati.com/developer/gdc/SuperBuffers.pdf

• WGL_ARB_render_texture
  http://oss.sgi.com/projects/ogl-sample/registry/ARB/wgl_render_texture.txt
  http://oss.sgi.com/projects/ogl-sample/registry/ARB/wgl_pbuffer.txt

Page 46

Questions?

• Acknowledgements
  – Cass Everitt, Craig Kolb, Chris Seitz, and Jeff Juliano at NVIDIA
  – Mark Segal, Rob Mace, and Evan Hart at ATI
  – GPGPU SIGGRAPH 2004 course presenters
  – Joe Kniss and Ross Whitaker
  – Brian Budge
  – John Owens
  – National Science Foundation Graduate Fellowship
  – Pixar Animation Studios