TRANSCRIPT
HSA FOR THE COMMON MAN
Vinod Tipparaju, Heterogeneous System Software, AMD
Lee Howes, Heterogeneous System Software, AMD
THE HETEROGENEOUS SYSTEM ARCHITECTURE
Taking the platform to programmers
OPENCL™ AND HSA
HSA is an optimized platform architecture for OpenCL™
– Not an alternative to OpenCL™
OpenCL™ on HSA will benefit from
– Avoidance of wasteful copies
– Low latency dispatch
– Improved memory model
– Pointers shared between CPU and GPU
HSA also exposes a lower-level programming interface, for those who want the ultimate in control and performance
– Optimized libraries may choose the lower level interface
HSA TAKING PLATFORM TO PROGRAMMERS
Balance between CPU and GPU for performance and power efficiency
Make GPUs accessible to wider audience of programmers
– Programming models close to today’s CPU programming models
– Enabling more advanced language features on GPU
– Shared virtual memory enables complex pointer-containing data structures (lists, trees, etc.) and hence more applications on the GPU
– Kernel can enqueue work to any other device in the system (e.g. GPU->GPU, GPU->CPU)
• Enabling task-graph style algorithms, Ray-Tracing, etc
Clearly defined HSA memory model enables effective reasoning for parallel programming
HSA provides a compatible architecture across a wide range of programming models and HW implementations.
HOW DO WE DELIVER THE HSA VALUE PROPOSITION?
Overall Vision:
– Make GPU easily accessible
Support mainstream languages
Expandable to domain-specific languages
– Make compute offload efficient
Direct path to GPU (avoid graphics overhead)
Eliminate memory copy
Low-latency dispatch
– Make it ubiquitous
Drive HSA as a standard through the HSA Foundation
Open source key components
HSA SOFTWARE STACK
Figure: the HSA software stack, top to bottom: Applications; application and system languages, domain-specific languages, etc. (e.g. OpenCL™, C++ AMP, Python, R, JS), compiling through LLVM IR to HSAIL; the HSA Runtime; and the HSA hardware.
HSA EXECUTION MODEL VIA HSA RUNTIME
HSA Runtime user-mode work queues
– Uniform abstraction across devices, simple insertion mechanism
– Multi-level parallelism -- within a queue and across queues
Simple parallelism specifier
– Range/Grid, and group
– HW specifics have a simple abstraction
Analogous to programming based on cache-line size
– Implicit preemption – launch and execute multiple tasks simultaneously
Figure: multiple user-mode queues, each written by the user and read by a device.
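Below is a minimal sketch of what such a user-mode queue could look like: a ring buffer the application writes with release semantics and the device reads with acquire semantics. The DispatchPacket layout and the UserQueue/enqueue names are illustrative assumptions, not the actual HSA runtime interface.

#include <atomic>
#include <cstdint>

struct DispatchPacket {              // assumed packet layout
    uint64_t kernel;                 // kernel code object address
    uint64_t args;                   // pointer to kernel arguments
    uint32_t grid[3];                // global size (Range/Grid)
    uint32_t group[3];               // work-group size
};

struct UserQueue {
    DispatchPacket* ring;            // shared, device-visible storage
    uint32_t size;                   // number of slots
    std::atomic<uint64_t> write;     // bumped by the application
    std::atomic<uint64_t> read;      // bumped by the device
};

// Application side: insert a packet with no kernel-mode transition.
bool enqueue(UserQueue& q, const DispatchPacket& p) {
    uint64_t w = q.write.load(std::memory_order_relaxed);
    if (w - q.read.load(std::memory_order_acquire) >= q.size)
        return false;                          // queue is full
    q.ring[w % q.size] = p;                    // the "user write"
    // Release so the device's acquire of `write` sees the packet body.
    q.write.store(w + 1, std::memory_order_release);
    return true;
}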
HSA MEMORY MODEL VIA HSA RUNTIME
Key concepts
– Simplified view of memory
– Sharing pointers across devices is possible
Makes it possible to run a task on any device
Pointers and data structures that require pointer chasing can be used correctly across device boundaries
– Relaxed consistency memory model
Acquire/release
Barriers
HSA Runtime exposes allocation interfaces with control over memory attributes
– Types of memories can be mixed and matched based on usage needs
Simplified launches – dispatch(task, arg1, arg2, …)
– Run device tasks with stack memory
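A quick sketch of why shared pointers matter: the fragment below dispatches a task that walks a linked list built by the host, with no copy or pointer translation. The dispatch() stub stands in for the simplified launch interface named above; it is not a real API.

struct Node { int value; Node* next; };

// Stand-in for the simplified launch: a real runtime would build a
// packet and insert it on a user-mode queue instead.
template <typename F, typename... Args>
void dispatch(F task, Args... args) { task(args...); }

int sum_list(Node* head) {               // device task body
    int s = 0;
    for (Node* n = head; n; n = n->next) // pointer chasing across the
        s += n->value;                   // CPU/GPU boundary just works
    return s;
}

void example(Node* head) {
    dispatch(sum_list, head);            // same pointer the CPU uses;
}                                        // no copy, pin, or remap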
ARCHITECTED QUEUES AND DEVICE-TO-DEVICE ENQUEUE
Kernel can enqueue work to any other queue in the system (e.g. GPU->GPU, GPU->CPU)
• Enabling task-graph style algorithms, Ray-Tracing, etc
• Queue is an architected feature
• Format of what represents a queue is architected
• Methods to enqueue follow
• Decoupled from HSAIL language
• Unique way of dynamically specifying where enqueues go
• Resolution at the time of execution permits many load-balancing solutions
Figure (animation): queue packets are written by the user or by a kernel on a device, and read by the consuming device.
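Sketch of what this enables, reusing the illustrative UserQueue/DispatchPacket/enqueue types from the execution-model sketch: because the queue format is architected, device code can use the same insertion logic as the host and resolve the target queue at execution time, e.g. for load balancing. pick_least_loaded and kernel_body are hypothetical names.

// Pick the shallowest queue at execution time (one load-balancing
// policy among many that late resolution permits).
UserQueue* pick_least_loaded(UserQueue* queues, int n) {
    UserQueue* best = &queues[0];
    uint64_t bestDepth = ~uint64_t(0);
    for (int i = 0; i < n; ++i) {
        uint64_t depth = queues[i].write.load(std::memory_order_relaxed)
                       - queues[i].read.load(std::memory_order_relaxed);
        if (depth < bestDepth) { bestDepth = depth; best = &queues[i]; }
    }
    return best;
}

// Inside a kernel: forward follow-up work (e.g. the next ray bounce)
// to any queue in the system -- GPU->GPU and GPU->CPU look the same.
void kernel_body(UserQueue* queues, int n, const DispatchPacket& next) {
    enqueue(*pick_least_loaded(queues, n), next);
}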
ABSTRACTING ARCHITECTED FEATURES
Illustrative snippets map to the architected features they abstract (listed below; a concrete sketch of one snippet follows the list):

CreateQueue(ptr, size, …);
for (i = 1..n) queue.dispatch(kernel, args, dep n+i, n-i);
queue.dispatch(1minuteKernel, args);
event.wait(); event.getExceptionDetails();
call CPU_FUNCTION_FROM_GPU
*fptr(…);
queue.dispatch(kernel, iptr); *iptr = 2;
queue.dispatch(kernel_set_i_value_1, iptr); while (i == 1);
HSAAllocate(1); LDS/GDS as virtual memory
Access any address from host/kernel
Do atomics on the queue, in host and in kernel
Channels
User Mode Queuing
Context Switching
Process Reset (to avoid TDRs)
HW Exceptions
Function calls
Virtual functions
Memory Coherence
Unpinned Memory Access (for DMA and Compute Shader)
Flat Address Space
Unaligned Addressing / Memory Access
Platform Atomic Operations
Memory Watchpoints
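For instance, the platform-atomics snippet above (dispatch a kernel that sets a shared flag, then spin on it) boils down to the following self-contained sketch, with a std::thread standing in for the device task and hypothetical names throughout:

#include <atomic>
#include <thread>

std::atomic<int> i{1};

void kernel_set_i_value_2(std::atomic<int>* p) {  // "device" task body
    p->store(2, std::memory_order_release);       // platform atomic
}

void host_side() {
    std::thread device(kernel_set_i_value_2, &i); // stands in for
                                                  // queue.dispatch(...)
    while (i.load(std::memory_order_acquire) == 1)
        ;                // host spins until the device's write lands
    device.join();
}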
ENABLING DIFFERENT KINDS OF PROBLEM DOMAINS
Memory model
HSAIL Language
Execution Model
Architected Features
Utilize combination of characteristics per application requirements
Figure: different higher-level models compose different subsets of the architected features on top of the common memory model, HSAIL language, and execution model. For example, Model1 composes architected features 1, 2, and 3; Model2 composes features 1, 3, and 4; Model3 composes features 1 and 4.
EXPOSING DATAFLOW THROUGH DEVICE-SIDE ENQUEUE
CHANNELS - PERSISTENT CONTROL PROCESSOR THREADING MODEL
Add data-flow support to GPGPU
We are not primarily presenting this as producer/consumer kernel bodies
– That is, we are not promoting a method where one kernel loops producing values and another loops to consume them
– That has the negative effect of encouraging long-running kernels
– We have tried to avoid this elsewhere by basing in-kernel launches around continuations rather than waiting on children
Instead we assume that kernel entities produce/consume, but consumer work-items are launched on demand
This is an alternative to point-to-point dataflow over persistent threads, avoiding the uber-kernel
OPERATIONAL FLOW
Figure (animation sequence): a channel, a command queue, the control processor (CP) scheduler, and a kernel interact as follows:
– Work items write values into the channel
– When the channel's condition is met, the producer's work items complete and the CP scheduler triggers a dispatch onto the command queue
– The dispatch launches a consumer kernel
– The consumer's work items read (consume) from the channel
– The next writes into the channel repeat the cycle
CHANNEL EXAMPLE
std::function<bool (opp::Channel<int>*)> predicate =
    [] (opp::Channel<int>* c) -> bool __device(fql) {
        return c->size() % PACKET_SIZE == 0;
    };

opp::Channel<int> b(N);
b.executeWith(
    predicate,
    opp::Range<1>(CHANNEL_SIZE),
    [&sumB] (opp::Index<1>) __device(opp) { sumB++; });

opp::Channel<int> c(N);
c.executeWith(
    predicate,
    opp::Range<1>(CHANNEL_SIZE),
    [&sumC] (opp::Index<1>, const int v) __device(opp) { sumC += v; });

opp::parallelFor(
    opp::Range<1>(N),
    [a, &b, &c] (opp::Index<1> index) __device(opp) {
        unsigned int n = *(a + index.getX());
        if (n > 5) { b.write(n); } else { c.write(n); }
    });
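Reading the example bottom-up: the parallelFor is the producer; each work item reads one element of a and routes it to channel b (values greater than 5) or channel c (the rest). Neither consumer is a persistent kernel: the runtime launches the lambda registered by executeWith over Range<1>(CHANNEL_SIZE) whenever the shared predicate sees a full PACKET_SIZE packet in the channel, which is exactly the on-demand consumer model described above.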
EXAMPLE PROBLEMS
Rigid body/cloth collision
CLOTH SIMULATION AND COLLISION DETECTION
Physics simulation has a range of properties
Rigid body simulation is often
– Not highly parallel
– Very dynamic
– Not necessarily a good match for wide SIMD architectures
Cloth simulation is
– Highly parallel
– While meshes are complicated, connectivity is largely static
EFFICIENT GPU CLOTH SIMULATION: TWO-LEVEL BATCHING
Offline static batching of the mesh
Create independent subsets of links through graph coloring.
Synchronize between batches
Figure: the cloth mesh colored into 10 independent batches.
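A minimal sketch of this offline step, assuming a link is simply a pair of particle indices: greedy graph coloring gives each link the smallest color unused at either endpoint, and each color becomes one independent batch.

#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

using Link = std::pair<int, int>;   // the two particles a link joins

std::vector<int> color_links(const std::vector<Link>& links, int nParticles) {
    std::vector<int> color(links.size(), -1);
    std::vector<std::vector<int>> used(nParticles); // colors per particle
    for (std::size_t i = 0; i < links.size(); ++i) {
        int pa = links[i].first, pb = links[i].second;
        auto taken = [&](int c) {   // is color c used at either end?
            return std::find(used[pa].begin(), used[pa].end(), c) != used[pa].end()
                || std::find(used[pb].begin(), used[pb].end(), c) != used[pb].end();
        };
        int c = 0;
        while (taken(c)) ++c;       // smallest free color
        color[i] = c;
        used[pa].push_back(c);
        used[pb].push_back(c);
    }
    return color;   // batch k = all links with color[i] == k; links in
}                   // a batch share no particle, so they solve in parallel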
EFFICIENT GPU CLOTH SIMULATION: BATCHING
Chunk mesh into larger groups of links
Batch those chunks
– 4 global dispatches
Iterate within the workgroups
– 8 secondary batches
Figure: 4 global batches, each with 8 secondary batches iterated within a workgroup.
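A sketch of the two-level schedule, assuming one global dispatch per chunk batch and a workgroup-local barrier between secondary batches (Chunk, solve_link, and workgroup_barrier are placeholders):

#include <vector>

struct Chunk {                                // one chunk of the mesh
    std::vector<std::vector<int>> secondary;  // secondary batches of links
};

void solve_link(int /*link*/) { /* relax one distance constraint */ }
void workgroup_barrier() { /* device-side barrier; no-op placeholder */ }

// Device side: one workgroup walks its chunk's secondary batches,
// synchronizing locally instead of requiring a new global dispatch.
void cloth_solver_kernel(const Chunk& chunk) {
    for (const auto& batch : chunk.secondary) {
        for (int link : batch)     // on HW, spread across work-items
            solve_link(link);
        workgroup_barrier();       // sync before the next sub-batch
    }
}

// Host side: 4 global dispatches (one per chunk batch) instead of the
// 32 (4 x 8) that per-secondary-batch dispatches would need.
void solve_step(const std::vector<std::vector<Chunk>>& chunkBatches) {
    for (const auto& batch : chunkBatches) {  // 4 global dispatches
        // One dispatch: each chunk maps to one workgroup in parallel.
        for (const auto& chunk : batch)
            cloth_solver_kernel(chunk);
    }
}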
COLLISION WITH RIGID BODY
Small set of rigid bodies
Rigid bodies best computed on the CPU
Cloth on GPU
OPTIONS
Either
– Small launches of rigid body/cloth collisions against cloth
– Process rigid body/cloth collisions on CPU
On GPU:
– Small launches suffer dispatch overhead
– Must update rigid body data structures from the GPU
On CPU:
– Must continuously move cloth mesh data to and from GPU
Figure: the rigid body (RB) solve runs on the CPU and the cloth solve on the GPU, with the cloth/RB collide stage bridging the two.
WHY HSA?
Colliding rigid bodies are likely to be very sparse in memory
– Do not want to copy the rigid body array to the GPU “just in case”
– Do not even want to incur OS page lock overhead
– Accessing targeted virtual addresses as necessary reduces the overhead
The HSA runtime exposes the architected features and permits using any memory, either as an argument or otherwise, in a task
Operations are in a tight loop
– Overhead from dispatch grows quickly
– User mode queuing reduces this significantly
Architected queues are exposed via simple API
Do not want to transform rigid body code
– It is a common problem to have to apply vast, confusing transformations to host code to enable wide vector processing
Shared pointer model enables access to those structures directly rather than restructuring them
HOW DO YOU TRIGGER CLOTH/RB COLLIDE?
Figure: RB solve on the CPU, cloth solve on the GPU, and the cloth/RB collide stage between them.
Option 1, wait on an event:
CS = dispatch(cloth_solve, x, y, …)
CPU does RB_solve(p, q, …)
wait for CS
dispatch(cloth_RB_collide, x, p, …)
Option 2, express the dependency at dispatch time:
RB = dispatch(RB_solve, p, q, …)
CS = dispatch(cloth_solve, x, y, …)
Collide = dispatch(cloth_RB_collide, x, p, …, RB & CS)
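The second scheme can be emulated on the CPU with events, here via std::async; dispatch_when and Event are illustrative stand-ins, not the HSA runtime interface.

#include <future>
#include <vector>

using Event = std::shared_future<void>;

// Launch `task` once every event in `deps` has completed.
template <typename F, typename... Args>
Event dispatch_when(std::vector<Event> deps, F task, Args... args) {
    return std::async(std::launch::async, [=] {
        for (const Event& d : deps) d.wait();  // honor dependencies
        task(args...);
    }).share();
}

// Placeholder solvers standing in for the real kernels.
void rb_solve() {}
void cloth_solve() {}
void cloth_rb_collide() {}

void frame() {
    Event rb = dispatch_when({}, rb_solve);       // runs on the CPU
    Event cs = dispatch_when({}, cloth_solve);    // runs on the GPU
    // Launch as soon as both inputs are ready ("RB & CS" above).
    dispatch_when({rb, cs}, cloth_rb_collide).wait();
}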
IN PSEUDOCODE
CPU control:
batch rigid bodies and RB/cloth pairs
dispatch cloth solver to GPU
dispatch cloth/rigid body collision solver to GPU, pending event
foreach rigid body batch:
    for rigid body pair in batch:
        compute force
        update position and velocity of rigid body
signal GPU event
foreach rigid body not involved in cloth collision:
    update positions
return to next iteration

Cloth solver:
for iteration towards convergence:
    foreach cloth link batch:
        foreach cloth link sub-batch:
            update positions and velocities

Cloth/RB solver:
foreach batch of RB/cloth pairs:
    read rigid body data directly from the data structures used by the CPU
    test cloth against RB and update cloth
    read/write updates to global data (relies on memory visibility rules guaranteed by HSA)
EXAMPLE PROBLEMS
Tree search
NESTED DATA PARALLELISM AND EFFICIENT EXECUTION OF UNSTRUCTURED DATA
Perfectly balanced trees are easy:
– If the tree is being regularly rebalanced and stored contiguously then data may be moved around where needed
Large, poorly balanced trees are harder:
– Layout is ambiguous so copying data is challenging
– Amount of parallelism is unpredictable
One approach to deal with this, on a single node:
– Fine-grained tasking
– Shared memory infrastructure
– Picture breadth first search through FIFO queues
Example: UTS (Unbalanced Tree Search):
– Count the number of nodes in an implicitly constructed tree
– The tree is parameterized in shape, depth, size, and imbalance
SEARCHING A TREE
As we move through the tree:
– Unpredictable amount of parallelism
– Unpredictable dependence structure
Launch tasks as tree space is available:
– Perform BFS, queuing into a buffer
– When the buffer reaches a certain size, launch the processing code
– Slowly increase the launch batch size to improve efficiency (sketched below)
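A serial sketch of the buffered-launch pattern, with children_of as a problem-specific stub and process_batch standing in for one parallel dispatch:

#include <algorithm>
#include <cstddef>
#include <vector>

struct TreeNode { /* state from which children are derived (as in UTS) */ };

std::vector<TreeNode> children_of(const TreeNode&) { return {}; } // stub

// One "launch": count each node in the batch and queue its children
// back into the BFS buffer.
void process_batch(const std::vector<TreeNode>& batch,
                   std::vector<TreeNode>& buffer, std::size_t& count) {
    for (const TreeNode& n : batch) {
        ++count;
        for (const TreeNode& c : children_of(n))
            buffer.push_back(c);
    }
}

std::size_t search(const TreeNode& root) {
    std::size_t count = 0;
    std::size_t threshold = 64;               // initial launch batch size
    std::vector<TreeNode> buffer{root};
    while (!buffer.empty()) {
        // Launch once the buffer holds a batch; take at most `threshold`.
        std::size_t take = std::min(buffer.size(), threshold);
        std::vector<TreeNode> batch(buffer.end() - take, buffer.end());
        buffer.resize(buffer.size() - take);
        process_batch(batch, buffer, count);
        threshold *= 2;                       // slowly grow the batch size
    }
    return count;
}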
EXTENDING THE TREE
Many tree search algorithms expand the tree as time goes on
– A lot of overhead in the absence of shared memory
– With shared memory we can be searching parts of the tree while adding to others
Example: Multiresolution Analysis (MRA) is a mathematical technique for approximating a continuous function as a hierarchy of coefficients over a set of basis functions.
– MRA is characterized by dynamic adaptivity to guarantee the accuracy of the approximation. Challenges include:
The coefficient trees are unbalanced, on account of the adaptive multiresolution properties that lead to different scales of information granularity
The tree structure may be refined in an uncoordinated fashion: different parts of the tree may be refined independently, and the intervals of such refinement are not preset.
WHY HSA?
Dynamic adaptive nature means an unpredictable amount of parallelism and an unpredictable dependence structure
– Do not want to copy sections of the tree to the GPU “just in case”
– Accessing targeted virtual addresses as necessary reduces the overhead
The HSA runtime exposes the architected features and permits using any memory, either as an argument or otherwise, in a task
Tremendous nesting of parallelism is inherent in the problem
– A single search node can lead to a very large number of additional searches
– The ability to nest efficiently is key, and triggering searches based on grouping is important
HSA allows for device-to-device enqueue that permits nested parallelism
Significant load imbalance is possible
– Need to group searches and trigger them when they reach a certain size
Support for dataflow via channels makes this possible
– Need to balance what is already launched due to the unpredictable amount of parallelism
Queues are in user mode; balancing is enabled by the architected features that allow user-level access to a queue
HOW DO YOU BALANCE?
Several user mode queues
– The number of nodes at the start does not represent the real load (imbalance)
IN PSEUDOCODE
Channel-based variant:
Control process:
– dispatch(unbalanced_tree_kernel, root, …)
GPU/CPU: unbalanced_tree_kernel
– For the next n-1 levels:
    count
– For each child at level n:
    insert child into the parse channel

Nested-dispatch variant:
Control process:
– dispatch(unbalanced_tree_kernel, root, …)
– balance and terminate
GPU/CPU: unbalanced_tree_kernel
– For the next n-1 levels:
    count
– For each child at level n:
    dispatch(unbalanced_tree_kernel, child, …)
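The channel-based variant can be written in the notation of the channel example from earlier; this is illustrative only, and BATCH_SIZE, CAPACITY, children_of, count, and root are assumed to be defined elsewhere.

std::function<bool (opp::Channel<TreeNode>*)> enough =
    [] (opp::Channel<TreeNode>* ch) -> bool __device(fql) {
        return ch->size() >= BATCH_SIZE;    // group work before launching
    };

opp::Channel<TreeNode> frontier(CAPACITY);
frontier.executeWith(
    enough,
    opp::Range<1>(BATCH_SIZE),
    [&frontier, &count] (opp::Index<1>, const TreeNode node) __device(opp) {
        count++;                            // count this node
        // Descend locally for n-1 levels, then feed level-n children
        // back into the channel for a later, batched launch.
        for (const TreeNode& child : children_of(node))
            frontier.write(child);
    });

frontier.write(root);                       // seed the search at the root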
CONCLUSIONS
HSA has several architected features that improve programmability, and an ecosystem that exposes these features to users effectively
The HSA runtime is how these features are exposed to higher-level programming models
– Composability is possible: a new higher-level model can be composed of multiple architected features
Channels are a unique technology made possible by HSA
– Channels enable many applications that need a dataflow model or dataflow features
Cloth simulation and collision detection is an example that shows how several HSA features both simplify the solution and avoid the unnecessary costs typically involved in using GPUs for this problem
Unbalanced tree search is a domain with an unpredictable amount of parallelism, a major load balancing problem, and a need to adjust the granularity of tasks
– HSA features significantly simplify the problem and allow a natural solution
– Channels address adjusting the granularity of launches by allowing a dataflow pattern that launches a task when certain data-dependent criteria are met
DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.
NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in this presentation are for informational purposes only and may be trademarks of their respective owners.
© 2012 Advanced Micro Devices, Inc.