TRANSCRIPT
HSA FOR THE COMMON MAN
Vinod Tipparaju, Heterogeneous System Software, AMD
Lee Howes, Heterogeneous System Software, AMD
THE HETEROGENEOUS SYSTEM ARCHITECTURE
Taking the platform to programmers
OPENCL™ AND HSA
HSA is an optimized platform architecture for OpenCL™
– Not an alternative to OpenCL™
OpenCL™ on HSA will benefit from
– Avoidance of wasteful copies
– Low latency dispatch
– Improved memory model
– Pointers shared between CPU and GPU
HSA also exposes a lower-level programming interface, for those who want the ultimate in control and performance
– Optimized libraries may choose the lower level interface
HSA TAKING PLATFORM TO PROGRAMMERS
Balance between CPU and GPU for performance and power efficiency
Make GPUs accessible to wider audience of programmers
– Programming models close to today’s CPU programming models
– Enabling more advanced language features on GPU
– Shared virtual memory enables complex pointer-containing data structures (lists, trees, etc.) and hence more applications on the GPU
– Kernel can enqueue work to any other device in the system (e.g. GPU->GPU, GPU->CPU)
• Enabling task-graph style algorithms, Ray-Tracing, etc
Clearly defined HSA memory model enables effective reasoning for parallel programming
HSA provides a compatible architecture across a wide range of programming models and HW implementations.
HOW DO WE DELIVER THE HSA VALUE PROPOSITION?
Overall Vision:
– Make GPU easily accessible
Support mainstream languages
Expandable to domain-specific languages
– Make compute offload efficient
Direct path to GPU (avoid graphics overhead)
Eliminate memory copy
Low-latency dispatch
– Make it ubiquitous
Drive HSA as a standard through the HSA Foundation
Open source key components
HSA SOFTWARE STACK
Figure: the HSA software stack, top to bottom: Applications; application and system languages, domain-specific languages, etc. (e.g. OpenCL™, C++ AMP, Python, R, JS), compiling through LLVM IR to HSAIL; the HSA Runtime; and the HSA hardware.
HSA EXECUTION MODEL VIA HSA RUNTIME
HSA Runtime user-mode work queues
– Uniform abstraction across devices, simple insertion mechanism
– Multi-level parallelism -- within a queue and across queues
Simple parallelism specifier
– Range/Grid, and group
– HW specifics have a simple abstraction
Analogous to programming based on cache-line size
– Implicit preemption – launch and execute multiple tasks simultaneously
Figure: multiple user-mode queues, each written by the user and read by a device.
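Below is a minimal sketch of what such a user-mode queue could look like: a ring buffer the application writes with release semantics and the device reads with acquire semantics. The DispatchPacket layout and the UserQueue/enqueue names are illustrative assumptions, not the actual HSA runtime interface.

#include <atomic>
#include <cstdint>

struct DispatchPacket {              // assumed packet layout
    uint64_t kernel;                 // kernel code object address
    uint64_t args;                   // pointer to kernel arguments
    uint32_t grid[3];                // global size (Range/Grid)
    uint32_t group[3];               // work-group size
};

struct UserQueue {
    DispatchPacket* ring;            // shared, device-visible storage
    uint32_t size;                   // number of slots
    std::atomic<uint64_t> write;     // bumped by the application
    std::atomic<uint64_t> read;      // bumped by the device
};

// Application side: insert a packet with no kernel-mode transition.
bool enqueue(UserQueue& q, const DispatchPacket& p) {
    uint64_t w = q.write.load(std::memory_order_relaxed);
    if (w - q.read.load(std::memory_order_acquire) >= q.size)
        return false;                          // queue is full
    q.ring[w % q.size] = p;                    // the "user write"
    // Release so the device's acquire of `write` sees the packet body.
    q.write.store(w + 1, std::memory_order_release);
    return true;
}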
HSA MEMORY MODEL VIA HSA RUNTIME
Key concepts
– Simplified view of memory
– Sharing pointers across devices is possible
Makes it possible to run a task on any device
Pointers and data structures that require pointer chasing can be used correctly across device boundaries
– Relaxed consistency memory model
Acquire/release
Barriers
HSA Runtime exposes allocation interfaces with control over memory attributes
– Types of memories can be mixed and matched based on usage needs
Simplified launches – dispatch(task, arg1, arg2, …)
– Run device tasks with stack memory
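A quick sketch of why shared pointers matter: the fragment below dispatches a task that walks a linked list built by the host, with no copy or pointer translation. The dispatch() stub stands in for the simplified launch interface named above; it is not a real API.

struct Node { int value; Node* next; };

// Stand-in for the simplified launch: a real runtime would build a
// packet and insert it on a user-mode queue instead.
template <typename F, typename... Args>
void dispatch(F task, Args... args) { task(args...); }

int sum_list(Node* head) {               // device task body
    int s = 0;
    for (Node* n = head; n; n = n->next) // pointer chasing across the
        s += n->value;                   // CPU/GPU boundary just works
    return s;
}

void example(Node* head) {
    dispatch(sum_list, head);            // same pointer the CPU uses;
}                                        // no copy, pin, or remap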
ARCHITECTED QUEUES AND DEVICE-TO-DEVICE ENQUEUE
Kernel can enqueue work to any other queue in the system (e.g. GPU->GPU, GPU->CPU)
• Enabling task-graph style algorithms, Ray-Tracing, etc
• Queue is an architected feature
• Format of what represents a queue is architected
• Methods to enqueue follow
• Decoupled from HSAIL language
• Unique way of dynamically specifying where enqueues go
• Resolution at the time of execution permits many load-balancing solutions
Figure (animation): queue packets are written by the user or by a kernel on a device, and read by the consuming device.
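Sketch of what this enables, reusing the illustrative UserQueue/DispatchPacket/enqueue types from the execution-model sketch: because the queue format is architected, device code can use the same insertion logic as the host and resolve the target queue at execution time, e.g. for load balancing. pick_least_loaded and kernel_body are hypothetical names.

// Pick the shallowest queue at execution time (one load-balancing
// policy among many that late resolution permits).
UserQueue* pick_least_loaded(UserQueue* queues, int n) {
    UserQueue* best = &queues[0];
    uint64_t bestDepth = ~uint64_t(0);
    for (int i = 0; i < n; ++i) {
        uint64_t depth = queues[i].write.load(std::memory_order_relaxed)
                       - queues[i].read.load(std::memory_order_relaxed);
        if (depth < bestDepth) { bestDepth = depth; best = &queues[i]; }
    }
    return best;
}

// Inside a kernel: forward follow-up work (e.g. the next ray bounce)
// to any queue in the system -- GPU->GPU and GPU->CPU look the same.
void kernel_body(UserQueue* queues, int n, const DispatchPacket& next) {
    enqueue(*pick_least_loaded(queues, n), next);
}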
ABSTRACTING ARCHITECTED FEATURES
Illustrative snippets map to the architected features they abstract (listed below; a concrete sketch of one snippet follows the list):

CreateQueue(ptr, size, …);
for (i = 1..n) queue.dispatch(kernel, args, dep n+i, n-i);
queue.dispatch(1minuteKernel, args);
event.wait(); event.getExceptionDetails();
call CPU_FUNCTION_FROM_GPU
*fptr(…);
queue.dispatch(kernel, iptr); *iptr = 2;
queue.dispatch(kernel_set_i_value_1, iptr); while (i == 1);
HSAAllocate(1); LDS/GDS as virtual memory
Access any address from host/kernel
Do atomics on the queue, in host and in kernel
Channels
User Mode Queuing
Context Switching
Process Reset (to avoid TDRs)
HW Exceptions
Function calls
Virtual functions
Memory Coherence
Unpinned Memory Access (for DMA and Compute Shader)
Flat Address Space
Unaligned Addressing / Memory Access
Platform Atomic Operations
Memory Watchpoints
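For instance, the platform-atomics snippet above (dispatch a kernel that sets a shared flag, then spin on it) boils down to the following self-contained sketch, with a std::thread standing in for the device task and hypothetical names throughout:

#include <atomic>
#include <thread>

std::atomic<int> i{1};

void kernel_set_i_value_2(std::atomic<int>* p) {  // "device" task body
    p->store(2, std::memory_order_release);       // platform atomic
}

void host_side() {
    std::thread device(kernel_set_i_value_2, &i); // stands in for
                                                  // queue.dispatch(...)
    while (i.load(std::memory_order_acquire) == 1)
        ;                // host spins until the device's write lands
    device.join();
}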
ENABLING DIFFERENT KINDS OF PROBLEM DOMAINS
Memory model
HSAIL Language
Execution Model
Architected Features
Utilize combination of characteristics per application requirements
Figure: different higher-level models compose different subsets of the architected features on top of the common memory model, HSAIL language, and execution model. For example, Model1 composes architected features 1, 2, and 3; Model2 composes features 1, 3, and 4; Model3 composes features 1 and 4.
EXPOSING DATAFLOW THROUGH DEVICE-SIDE ENQUEUE
CHANNELS - PERSISTENT CONTROL PROCESSOR THREADING MODEL
Add data-flow support to GPGPU
We are not primarily presenting this as producer/consumer kernel bodies
– That is, we are not promoting a method where one kernel loops producing values and another loops to consume them
– That has the negative effect of encouraging long-running kernels
– We have tried to avoid this elsewhere by basing in-kernel launches around continuations rather than waiting on children
Instead we assume that kernel entities produce/consume, but consumer work-items are launched on demand
This is an alternative to point-to-point dataflow over persistent threads, avoiding the uber-kernel
OPERATIONAL FLOW
Figure (animation sequence): a channel, a command queue, the control processor (CP) scheduler, and a kernel interact as follows:
– Work items write values into the channel
– When the channel's condition is met, the producer's work items complete and the CP scheduler triggers a dispatch onto the command queue
– The dispatch launches a consumer kernel
– The consumer's work items read (consume) from the channel
– The next writes into the channel repeat the cycle
CHANNEL EXAMPLE
std::function<bool (opp::Channel<int>*)> predicate =
    [] (opp::Channel<int>* c) -> bool __device(fql) {
        return c->size() % PACKET_SIZE == 0;
    };

opp::Channel<int> b(N);
b.executeWith(
    predicate,
    opp::Range<1>(CHANNEL_SIZE),
    [&sumB] (opp::Index<1>) __device(opp) { sumB++; });

opp::Channel<int> c(N);
c.executeWith(
    predicate,
    opp::Range<1>(CHANNEL_SIZE),
    [&sumC] (opp::Index<1>, const int v) __device(opp) { sumC += v; });

opp::parallelFor(
    opp::Range<1>(N),
    [a, &b, &c] (opp::Index<1> index) __device(opp) {
        unsigned int n = *(a + index.getX());
        if (n > 5) { b.write(n); } else { c.write(n); }
    });
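Reading the example bottom-up: the parallelFor is the producer; each work item reads one element of a and routes it to channel b (values greater than 5) or channel c (the rest). Neither consumer is a persistent kernel: the runtime launches the lambda registered by executeWith over Range<1>(CHANNEL_SIZE) whenever the shared predicate sees a full PACKET_SIZE packet in the channel, which is exactly the on-demand consumer model described above.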
EXAMPLE PROBLEMS
Rigid body/cloth collision
CLOTH SIMULATION AND COLLISION DETECTION
Physics simulation has a range of properties
Rigid body simulation is often
– Not highly parallel
– Very dynamic
– Not necessarily a good match for wide SIMD architectures
Cloth simulation is
– Highly parallel
– While meshes are complicated, connectivity is largely static
EFFICIENT GPU CLOTH SIMULATION: TWO-LEVEL BATCHING
Offline static batching of the mesh
Create independent subsets of links through graph coloring.
Synchronize between batches
Figure: the cloth mesh colored into 10 independent batches.
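A minimal sketch of this offline step, assuming a link is simply a pair of particle indices: greedy graph coloring gives each link the smallest color unused at either endpoint, and each color becomes one independent batch.

#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

using Link = std::pair<int, int>;   // the two particles a link joins

std::vector<int> color_links(const std::vector<Link>& links, int nParticles) {
    std::vector<int> color(links.size(), -1);
    std::vector<std::vector<int>> used(nParticles); // colors per particle
    for (std::size_t i = 0; i < links.size(); ++i) {
        int pa = links[i].first, pb = links[i].second;
        auto taken = [&](int c) {   // is color c used at either end?
            return std::find(used[pa].begin(), used[pa].end(), c) != used[pa].end()
                || std::find(used[pb].begin(), used[pb].end(), c) != used[pb].end();
        };
        int c = 0;
        while (taken(c)) ++c;       // smallest free color
        color[i] = c;
        used[pa].push_back(c);
        used[pb].push_back(c);
    }
    return color;   // batch k = all links with color[i] == k; links in
}                   // a batch share no particle, so they solve in parallel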
EFFICIENT GPU CLOTH SIMULATION: BATCHING
Chunk mesh into larger groups of links
Batch those chunks
– 4 global dispatches
Iterate within the workgroups
– 8 secondary batches
Figure: 4 global batches, each with 8 secondary batches iterated within a workgroup.
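A sketch of the two-level schedule, assuming one global dispatch per chunk batch and a workgroup-local barrier between secondary batches (Chunk, solve_link, and workgroup_barrier are placeholders):

#include <vector>

struct Chunk {                                // one chunk of the mesh
    std::vector<std::vector<int>> secondary;  // secondary batches of links
};

void solve_link(int /*link*/) { /* relax one distance constraint */ }
void workgroup_barrier() { /* device-side barrier; no-op placeholder */ }

// Device side: one workgroup walks its chunk's secondary batches,
// synchronizing locally instead of requiring a new global dispatch.
void cloth_solver_kernel(const Chunk& chunk) {
    for (const auto& batch : chunk.secondary) {
        for (int link : batch)     // on HW, spread across work-items
            solve_link(link);
        workgroup_barrier();       // sync before the next sub-batch
    }
}

// Host side: 4 global dispatches (one per chunk batch) instead of the
// 32 (4 x 8) that per-secondary-batch dispatches would need.
void solve_step(const std::vector<std::vector<Chunk>>& chunkBatches) {
    for (const auto& batch : chunkBatches) {  // 4 global dispatches
        // One dispatch: each chunk maps to one workgroup in parallel.
        for (const auto& chunk : batch)
            cloth_solver_kernel(chunk);
    }
}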
COLLISION WITH RIGID BODY
Small set of rigid bodies
Rigid bodies best computed on the CPU
Cloth on GPU
OPTIONS
Either
– Small launches of rigid body/cloth collisions against cloth
– Process rigid body/cloth collisions on CPU
On GPU:
– Small launches suffer dispatch overhead
– Must update rigid body data structures from the GPU
On CPU:
– Must continuously move cloth mesh data to and from GPU
Figure: the rigid body (RB) solve runs on the CPU and the cloth solve on the GPU, with the cloth/RB collide stage bridging the two.
WHY HSA?
Colliding rigid bodies are likely to be very sparse in memory
– Do not want to copy the rigid body array to the GPU “just in case”
– Do not even want to incur OS page lock overhead
– Accessing targeted virtual addresses as necessary reduces the overhead
The HSA runtime exposes the architected features and permits using any memory, either as an argument or otherwise, in a task
Operations are in a tight loop
– Overhead from dispatch grows quickly
– User mode queuing reduces this significantly
Architected queues are exposed via simple API
Do not want to transform rigid body code
– It is a common problem to have to apply vast, confusing transformations to host code to enable wide vector processing
Shared pointer model enables access to those structures directly rather than restructuring them
HOW DO YOU TRIGGER CLOTH/RB COLLIDE?
Figure: RB solve on the CPU, cloth solve on the GPU, and the cloth/RB collide stage between them.
Option 1, wait on an event:
CS = dispatch(cloth_solve, x, y, …)
CPU does RB_solve(p, q, …)
wait for CS
dispatch(cloth_RB_collide, x, p, …)
Option 2, express the dependency at dispatch time:
RB = dispatch(RB_solve, p, q, …)
CS = dispatch(cloth_solve, x, y, …)
Collide = dispatch(cloth_RB_collide, x, p, …, RB & CS)
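The second scheme can be emulated on the CPU with events, here via std::async; dispatch_when and Event are illustrative stand-ins, not the HSA runtime interface.

#include <future>
#include <vector>

using Event = std::shared_future<void>;

// Launch `task` once every event in `deps` has completed.
template <typename F, typename... Args>
Event dispatch_when(std::vector<Event> deps, F task, Args... args) {
    return std::async(std::launch::async, [=] {
        for (const Event& d : deps) d.wait();  // honor dependencies
        task(args...);
    }).share();
}

// Placeholder solvers standing in for the real kernels.
void rb_solve() {}
void cloth_solve() {}
void cloth_rb_collide() {}

void frame() {
    Event rb = dispatch_when({}, rb_solve);       // runs on the CPU
    Event cs = dispatch_when({}, cloth_solve);    // runs on the GPU
    // Launch as soon as both inputs are ready ("RB & CS" above).
    dispatch_when({rb, cs}, cloth_rb_collide).wait();
}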
IN PSEUDOCODE
CPU control:
batch rigid bodies and RB/cloth pairs
dispatch cloth solver to GPU
dispatch cloth/rigid body collision solver to GPU, pending event
foreach rigid body batch:
    for rigid body pair in batch:
        compute force
        update position and velocity of rigid body
signal GPU event
foreach rigid body not involved in cloth collision:
    update positions
return to next iteration

Cloth solver:
for iteration towards convergence:
    foreach cloth link batch:
        foreach cloth link sub-batch:
            update positions and velocities

Cloth/RB solver:
foreach batch of RB/cloth pairs:
    read rigid body data directly from the data structures used by the CPU
    test cloth against RB and update cloth
    read/write updates to global data (relies on memory visibility rules guaranteed by HSA)
EXAMPLE PROBLEMS
Tree search
NESTED DATA PARALLELISM AND EFFICIENT EXECUTION OF UNSTRUCTURED DATA
Perfectly balanced trees are easy:
– If the tree is being regularly rebalanced and stored contiguously then data may be moved around where needed
Large, poorly balanced trees are harder:
– Layout is ambiguous so copying data is challenging
– Amount of parallelism is unpredictable
One approach to deal with this, on a single node:
– Fine-grained tasking
– Shared memory infrastructure
– Picture breadth first search through FIFO queues
Example: UTS (Unbalanced Tree Search):
– Count the number of nodes in an implicitly constructed tree
– The tree is parameterized in shape, depth, size, and imbalance
SEARCHING A TREE
As we move through the tree:
– Unpredictable amount of parallelism
– Unpredictable dependence structure
Launch tasks as tree space is available:
– Perform BFS, queuing into a buffer
– When the buffer reaches a certain size, launch the processing code
– Slowly increase the launch batch size to improve efficiency (sketched below)
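A serial sketch of the buffered-launch pattern, with children_of as a problem-specific stub and process_batch standing in for one parallel dispatch:

#include <algorithm>
#include <cstddef>
#include <vector>

struct TreeNode { /* state from which children are derived (as in UTS) */ };

std::vector<TreeNode> children_of(const TreeNode&) { return {}; } // stub

// One "launch": count each node in the batch and queue its children
// back into the BFS buffer.
void process_batch(const std::vector<TreeNode>& batch,
                   std::vector<TreeNode>& buffer, std::size_t& count) {
    for (const TreeNode& n : batch) {
        ++count;
        for (const TreeNode& c : children_of(n))
            buffer.push_back(c);
    }
}

std::size_t search(const TreeNode& root) {
    std::size_t count = 0;
    std::size_t threshold = 64;               // initial launch batch size
    std::vector<TreeNode> buffer{root};
    while (!buffer.empty()) {
        // Launch once the buffer holds a batch; take at most `threshold`.
        std::size_t take = std::min(buffer.size(), threshold);
        std::vector<TreeNode> batch(buffer.end() - take, buffer.end());
        buffer.resize(buffer.size() - take);
        process_batch(batch, buffer, count);
        threshold *= 2;                       // slowly grow the batch size
    }
    return count;
}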
EXTENDING THE TREE
Many tree search algorithms expand the tree as time goes on
– A lot of overhead in the absence of shared memory
– With shared memory we can be searching parts of the tree while adding to others
Example: Multiresolution Analysis (MRA) is a mathematical technique for approximating a continuous function as a hierarchy of coefficients over a set of basis functions.
– MRA is characterized by dynamic adaptivity to guarantee the accuracy of the approximation. Challenges include:
The coefficient trees are unbalanced, on account of the adaptive multiresolution properties that lead to different scales of information granularity
The tree structure may be refined in an uncoordinated fashion: different parts of the tree may be refined independently, and the intervals of such refinement are not preset.
WHY HSA?
Dynamic adaptive nature means an unpredictable amount of parallelism and an unpredictable dependence structure
– Do not want to copy sections of the tree to the GPU “just in case”
– Accessing targeted virtual addresses as necessary reduces the overhead
The HSA runtime exposes the architected features and permits using any memory, either as an argument or otherwise, in a task
Tremendous nesting of parallelism is inherent in the problem
– A single search node can lead to a very large number of additional searches
– The ability to nest efficiently is key, and triggering searches based on grouping is important
HSA allows for device-to-device enqueue that permits nested parallelism
Significant load imbalance is possible
– Need to group searches and trigger them when they reach a certain size
Support for dataflow via channels makes this possible
– Need to balance what is already launched due to the unpredictable amount of parallelism
Queues are in user mode; balancing is enabled by the architected features that allow user-level access to a queue
HOW DO YOU BALANCE?
Several user mode queues
– The number of nodes at the start does not represent the real load (imbalance)
IN PSEUDOCODE
Channel-based variant:
Control process:
– dispatch(unbalanced_tree_kernel, root, …)
GPU/CPU: unbalanced_tree_kernel
– For the next n-1 levels:
    count
– For each child at level n:
    insert child into the parse channel

Nested-dispatch variant:
Control process:
– dispatch(unbalanced_tree_kernel, root, …)
– balance and terminate
GPU/CPU: unbalanced_tree_kernel
– For the next n-1 levels:
    count
– For each child at level n:
    dispatch(unbalanced_tree_kernel, child, …)
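The channel-based variant can be written in the notation of the channel example from earlier; this is illustrative only, and BATCH_SIZE, CAPACITY, children_of, count, and root are assumed to be defined elsewhere.

std::function<bool (opp::Channel<TreeNode>*)> enough =
    [] (opp::Channel<TreeNode>* ch) -> bool __device(fql) {
        return ch->size() >= BATCH_SIZE;    // group work before launching
    };

opp::Channel<TreeNode> frontier(CAPACITY);
frontier.executeWith(
    enough,
    opp::Range<1>(BATCH_SIZE),
    [&frontier, &count] (opp::Index<1>, const TreeNode node) __device(opp) {
        count++;                            // count this node
        // Descend locally for n-1 levels, then feed level-n children
        // back into the channel for a later, batched launch.
        for (const TreeNode& child : children_of(node))
            frontier.write(child);
    });

frontier.write(root);                       // seed the search at the root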
CONCLUSIONS
HSA has several architected features that improve programmability, and an ecosystem that exposes these features to users effectively
The HSA runtime is how these features are exposed to higher-level programming models
– Composability is possible: a new higher-level model can be composed of multiple architected features
Channels are a unique technology made possible by HSA
– Channels enable many applications that need a dataflow model or dataflow features
Cloth simulation and collision detection is an example that shows how several HSA features both simplify the solution and avoid the unnecessary costs typically involved in using GPUs for this problem
Unbalanced tree search is a domain with an unpredictable amount of parallelism, a major load balancing problem, and a need to adjust the granularity of tasks
– HSA features significantly simplify the problem and allow a natural solution
– Channels address adjusting the granularity of launches by allowing a dataflow pattern that launches a task when certain data-dependent criteria are met
DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.
NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in this presentation are for informational purposes only and may be trademarks of their respective owners.
© 2012 Advanced Micro Devices, Inc.