HSA PLATFORM SYSTEM ARCHITECTURE SPECIFICATION PROVISIONAL 1.0 - RATIFIED APRIL 18, 2014




© 2013-2014 HSA Foundation. All rights reserved.

The contents of this document are provided in connection with the HSA Foundation specifications. This specification is protected by copyright laws and contains material proprietary to the HSA Foundation. It or any components may not be reproduced, republished, distributed, transmitted, displayed, broadcast or otherwise exploited in any manner without the express prior written permission of HSA Foundation. You may use this specification for implementing the functionality therein, without altering or removing any trademark, copyright or other notice from the specification, but the receipt or possession of this specification does not convey any rights to reproduce, disclose, or distribute its contents, or to manufacture, use, or sell anything that it may describe, in whole or in part.

HSA Foundation grants express permission to any current Founder, Promoter, Supporter, Contributor, Academic, or Associate member of HSA Foundation to copy and redistribute UNMODIFIED versions of this specification in any fashion, provided that NO CHARGE is made for the specification and the latest available update of the specification for any version of the API is used whenever possible. Such distributed specification may be re-formatted AS LONG AS the contents of the specification are not changed in any way. The specification may be incorporated into a product that is sold as long as such product includes significant independent work developed by the seller. A link to the current version of this specification on the HSA Foundation web-site should be included whenever possible with specification distributions.

HSA Foundation makes no, and expressly disclaims any, representations or warranties, express or implied, regarding this specification, including, without limitation, any implied warranties of merchantability or fitness for a particular purpose or non-infringement of any intellectual property. HSA Foundation makes no, and expressly disclaims any, warranties, express or implied, regarding the correctness, accuracy, completeness, timeliness, and reliability of the specification. Under no circumstances will the HSA Foundation, or any of its Founder, Promoter, Supporter, Contributor, Academic, or Associate members, or their respective partners, officers, directors, employees, agents or representatives, be liable for any damages, whether direct, indirect, special or consequential, for lost revenues, lost profits, or otherwise, arising from or in connection with these materials.


Provisional 1.0 - Ratified APRIL 18, 2014 HSA Platform System Architecture specification


HSA Foundation Proprietary

Acknowledgements

The HSA Platform System Architecture Specification is the result of the contributions of many people. Here is a partial list of the contributors, including the company that they represented at the time of their contribution:

AMD:

Ben Sander, Brad Beckmann, Mark Fowler, Michael Mantor, Paul Blinzer (Workgroup Chair), Rex McCrary, Tony Tye, Vinod Tipparaju

ARM:

Andrew Rose, Håkan Persson (Spec Editor), Ian Bratt, Ian Devereux, Jason Parker, Richard Grisenthwaite

General Processor Technologies:

John Glossner

Imagination:

Andy Glew, Georg Kolling, James Aldis, Jason Meredith, John Howson, Mark Rankilor

Mediatek:

ChienPing Lu, Fred Liao, Richard Bagley, Roy Ju, Stephen Huang

Qualcomm:

Alex Bourd, Benedict Gaster, Bob Rychlik, Greg Bellows, Jamie Esliger, Lee Howes, Lihan Bin, Michael Weber, PJ Bosley, Robert J. Simpson, Wilson Kwan

Samsung:

Ignacio Llamas, Michael C. Shebanow, Soojung Ryu

Sony:

Jim Rasmusson

ST Microelectronics:

Marcello Coppola


About the HSA Platform System Architecture Specification

This document identifies, from a hardware point of view, system architecture requirements necessary to support the Heterogeneous System Architecture (HSA) programming model and HSA application and system software infrastructure.

It defines a set of functionality and features for HSA hardware product deliverables to meet the minimum specified requirements to qualify for a valid HSA product.

Where necessary, the document illustrates possible design implementations to clarify expected operation. However, unless otherwise specified, these implementations only describe an example of how to implement a feature and do not imply a specific hardware or software design.

Audience

This document is written for system and component architects interested in supporting the HSA infrastructure (hardware and software) within platform designs.

New Terms

This document shows new terms in italics. See Appendix A Glossary - HSA Hardware System Architecture Specification (p. 51) for their definitions.

HSA Information Sources

HSA Programmer's Reference Manual

HSA Runtime Specification

HSA Platform System Architecture Specification

Revision History

Date Description

9 March 2014 Release of Provisional 1.0 HSA Platform System Architecture Specification


Contents

Chapter 1 System Architecture Requirements: Overview .................................. 2

1.1 What is HSA? ............................................................................................................. 2

1.2 Key Words: Must, Must Not, Should, Should Not, May, Shall, Shall Not, Required, Recommended, Not Recommended, Optional ..................................................................... 2

1.3 Distinction Between Minimum and Complete HSA Software Product ..................... 3

1.4 HSA Programming Model .......................................................................................... 3

1.5 List of Requirements .................................................................................................. 3

1.6 Desirable items deferred from this release ............................................................... 4

Chapter 2 System Architecture Requirements: Details ...................................... 5

2.1 Requirement: Shared Virtual Memory ...................................................................... 5

2.2 Requirement: Cache Coherency Domains ................................................................. 7

Read-only image data ........................................................................................ 7

2.3 Requirement: Signaling and Synchronization............................................................ 7

2.4 Requirement: Atomic memory operations ............................................................. 10

2.5 Requirement: HSA system timestamp .................................................................... 10

2.6 Requirement: User Mode Queuing ......................................................................... 11

Queue types .................................................................................................... 12

Queue Features ............................................................................................... 13

Queue mechanics ............................................................................................ 13

Multiple vs Single submitting HSA Agents ....................................................... 15

Queue index access ......................................................................................... 16

Runtime Services Dispatch Queue................................................................... 17

2.7 Requirement: Architected Queuing Language (AQL) .............................................. 18

Packet header .................................................................................................. 19

Packet process flow ......................................................................................... 20

Error handling .................................................................................................. 21

Always reserved packet ................................................................................... 21

Invalid packet ................................................................................................... 21

Dispatch packet ............................................................................................... 21

Agent Dispatch packet ..................................................................................... 22

Barrier packet .................................................................................................. 23

Small machine model ...................................................................................... 24

2.8 Requirement: HSA Agent scheduling ...................................................................... 24

2.9 Requirement: HSA Component Context Switching ................................................. 25

2.10 Requirement: HSA Component IEEE754-2008 Floating Point Exception Reporting 26


2.11 Requirement: HSA Component Hardware Debug Infrastructure ........................... 28

2.12 Requirement: HSA Platform Topology Discovery .................................................... 28

Introduction ..................................................................................................... 28

Topology Table Requirements ......................................................................... 32

HSA Agent Entry .............................................................................................. 33

HSA Memory Entry .......................................................................................... 34

HSA Cache Entry .............................................................................................. 34

Topology Structure Example ........................................................................... 35

2.13 Requirement: Images .............................................................................................. 35

Chapter 3 HSA Memory Consistency Model ..................................................... 38

3.1 What Is a Memory Consistency Model? .................................................................. 38

3.2 What Is a HSA Memory Consistency Model? .......................................................... 38

3.3 How Does Software See Correct vs. Incorrect? ....................................................... 39

3.4 HSA Memory Consistency Model Definitions.......................................................... 39

Operations ....................................................................................................... 40

Segments ......................................................................................................... 41

Scopes .............................................................................................................. 41

Segment/scope instances ................................................................................ 42

3.5 Atomic operations ................................................................................................... 43

3.6 Write Serialization ................................................................................................... 43

3.7 Ordering of synchronizing operations ..................................................................... 44

3.8 Packet Processor Fences ......................................................................................... 44

3.9 Single thread memory operation order .................................................................. 44

3.10 Requirement on race free programs ....................................................................... 45

3.11 Implications ............................................................................................................. 45

Relaxed atomics ............................................................................................... 46

Sequential consistency .................................................................................... 46

Transitivity of ordering between threads ........................................................ 46

Allowed Reorder for Operations to Different Addresses by the Same Thread 46

Examples .......................................................................................................... 48

Appendix A Glossary - HSA Hardware System Architecture Specification ......... 51

Index ................................................................................................................... 54


Figures

Figure 2-1 Example of a Simple HSA Platform .................................................................. 30

Figure 2-2 Example of a HSA platform with extended topology ...................................... 31

Figure 2-3 Topology Table example .................................................................................. 32

Figure 2-4 Structure of a HSA Agent Entry ....................................................................... 33

Figure 2-5 Topology definition structure for previously defined system block diagram . 35


Tables

Table 2-1 User Mode Queue structure ............................................................................... 11

Table 2-2 User Mode Queue types ..................................................................................... 12

Table 2-3 User Mode Queue features ................................................................................ 13

Table 2-4 Architected Queuing Language (AQL) Packet Header Format ........................... 19

Table 2-5 Encoding of acquireFenceScope ......................................................................... 19

Table 2-6 Encoding of releaseFenceScope ......................................................................... 20

Table 2-7 Architected Queuing Language (AQL) Dispatch Packet Format ......................... 21

Table 2-8 Architected Queuing Language (AQL) Agent Dispatch Packet Format ............... 22

Table 2-9 Architected Queuing Language (AQL) Barrier Packet Format ............................ 23

Table 2-10 IEEE754-2008 exceptions................................................................................ 26

Table 3-1 Allowed reorder for operations by the same thread to different addresses that share a segment/scope instance ............................................................... 48


Chapter 1 System Architecture Requirements: Overview

1.1 What is HSA?

The Heterogeneous System Architecture (HSA) is designed to efficiently support a wide assortment of data-parallel and task-parallel programming models. A single HSA system can support multiple instruction sets based on host CPUs and HSA Components.

HSA supports two machine models: large model (64-bit address space) and small model (32-bit address space).

A HSA-compliant system will meet the requirements for a queuing model, a memory model, quality of service, and an instruction set for parallel processing, enabling heterogeneous programming models for computing platforms through standardized interfaces, processes, communication protocols, and memory models.

1.2 Key Words: Must, Must Not, Should, Should Not, May, Shall, Shall Not, Required, Recommended, Not Recommended, Optional

This document specifies HSA system requirements at different levels using the keywords defined below:

“Must”: This word, or the terms “required” or “shall,” mean that the definition is an absolute requirement of the specification.

“Must not”: This phrase, or the phrase “shall not,” mean that the definition is an absolute prohibition of the specification.

“Should”: This word, or the adjective “recommended,” mean that there may exist valid reasons in particular circumstances to ignore a particular item, but the full implications must be understood and carefully weighed before choosing a different course.

“Should not”: This phrase, or the phrase “not recommended,” mean that there may exist valid reasons in particular circumstances when the particular behavior is acceptable or even useful, but the full implications should be understood and the case carefully weighed before implementing any behavior described with this label.

“May”: This word, or the adjective “optional,” mean that an item is truly optional. One vendor may choose to include the item because a particular marketplace requires it or because the vendor feels that it enhances the product while another vendor may omit the same item. An implementation which does not include a particular option MUST be prepared to interoperate with another implementation which does include the option, though perhaps with reduced functionality. In the same vein an implementation which does include a particular option MUST be prepared to interoperate with another implementation which does not include the option (except, of course, for the feature the option provides).


These definitions are exactly as described in the Internet Engineering Task Force RFC 2119, BCP 14, Key words for use in RFCs to Indicate Requirement Levels.


1.3 Distinction Between Minimum and Complete HSA Software Product

This document also provides guidance for the HSA product to be more complete, enhanced, and competitive:

The minimum HSA software product is defined by the mandatory requirements highlighted with the key words “shall,” “must,” and “required.”

The complete HSA software product feature set is defined by the key words “should” and “recommended.”

Unless otherwise stated, functionality referred to by this document that is outside of the HSA software, such as a software component or an API, indicates the then-current version at the time of HSA software delivery.

Note: A higher-level requirement specification shall supersede a lower-level (HSA Component) requirement if there is an implied contradiction.

1.4 HSA Programming Model

The HSA programming model is enabled through the presence of a select number of key hardware and system features for the heterogeneous system components (examples are HSA Components and other HSA Agents, interface connection fabric, memory, and so forth). The presence of these features on a HSA-compatible system reduces the number of permutations that the software stack needs to deal with. Thus, the HSA programming model is much simpler than heterogeneous system programming models based on more traditional system designs.

1.5 List of Requirements

The list below shows the minimum required system architecture features:

Shared virtual memory, including adherence to the HSA memory model. See 2.1 Requirement: Shared Virtual Memory (p. 5).

Cache coherency domains, including host CPUs, HSA Components and other HSA Agents and interconnecting I/O bus fabric. See 2.2 Requirement: Cache Coherency Domains (p. 7).

Memory-based signaling and synchronization primitives between all HSA-enabled system components, including support for platform atomics. See 2.3 Requirement: Signaling and Synchronization (p. 7).

Atomic memory operations, see 2.4 Requirement: Atomic memory operations (p. 10).


HSA system timestamp providing a uniform view of time across the HSA system. See 2.5 Requirement: HSA system timestamp (p. 10).

User mode queues with low-latency application-level dispatch to hardware. See 2.6 Requirement: User Mode Queuing (p. 11).

Use of Architected Queuing Language (AQL), which reduces launch latency by allowing applications to enqueue tasks to HSA Components and other HSA Agents. See 2.7 Requirement: Architected Queuing Language (AQL) (p. 18).

HSA Agent Scheduling, see 2.8 Requirement: HSA Agent scheduling (p. 24).

Preemptive HSA Component context switch with a maximum guaranteed latency. See 2.9 Requirement: HSA Component Context Switching (p. 25).

HSA Component error reporting mechanism that meets a similar level of detail as provided by host CPUs, including adherence to the policies specified in 2.10 Requirement: HSA Component IEEE754-2008 Floating Point Exception Reporting (p. 26).

HSA Component debug infrastructure that meets the specified level of functional support. See 2.11 Requirement: HSA Component Hardware Debug Infrastructure (p. 28).

Architected HSA Component discovery by means of system firmware tables placed in ACPI or equivalent architected firmware infrastructure. This allows system software and, in turn, application software to discover and leverage platform topology independent of the system-specific bus fabric and host CPU infrastructure, as long as they support the other HSA-relevant features. See 2.12 Requirement: HSA Platform Topology Discovery (p. 28).

Optionally a HSA platform supports image operations. See 2.13 Requirement: Images (p. 35).

Note that there are a wide variety of methods for implementing these requirements.

1.6 Desirable items deferred from this release

The working group has identified a number of desirable features that are left out of this release for schedule reasons, with a target to include them for the next release. These are:

Expansion of AQL packet formats to allow vendor specific packets

Expansion of AQL packet formats to allow indirect packets (packets containing a reference to an array of further packets)

Expansion of AQL packet formats to allow for continuation packets processed after completion of the first packet.

Architected mechanism for exposure of profiling and debug data, such as timestamps for job dispatch and completion.


Chapter 2 System Architecture Requirements: Details

2.1 Requirement: Shared Virtual Memory

A compliant HSA system shall allow HSA Agents to access shared system memory through the common HSA unified virtual address space. The minimum virtual address width that must be supported by all HSA Agents is 48 bits for a 64-bit HSA system implementation and 32 bits for a 32-bit HSA system implementation [1]. The full addressable virtual address space shall be available for both instructions and data.
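The minimum-width rule above can be sketched as a small predicate. This is an illustrative helper, not part of any HSA API: it only checks that an address is representable within the minimum width for the given machine model, and a real platform may support wider addresses.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper illustrating the minimum virtual address widths:
 * 48 bits on a 64-bit ("large model") HSA system, 32 bits on a 32-bit
 * ("small model") system.  Returns true if the address fits within the
 * minimum width every HSA Agent is required to support. */
static bool va_within_min_width(uint64_t va, bool large_model)
{
    const unsigned min_bits = large_model ? 48u : 32u;
    return (va >> min_bits) == 0;  /* no bits set above the minimum width */
}
```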

System Software may reserve ranges of the virtual address space for HSA Agent or system internal use, e.g. private and group memory segments. Access to locations within these ranges from a particular HSA Agent follow implementation-specific system software policy and are not subject to the shared virtual memory requirements and access properties further defined in this specification. The reserved ranges must be discoverable or configurable by system software. System software is expected not to allocate pages for HSA application access in these non-shareable ranges.

Each HSA agent shall handle shared memory virtual address translation through system software managed page tables. System software is responsible for setting, maintaining and invalidating these page tables according to its policy. HSA agents must observe the shared memory page properties as established by system software. The observed shared memory page properties shall be consistent across all HSA agents.

The shared memory virtual address translation shall:

Interpret shared memory attributes consistently for all HSA agents and ensure memory protection mechanisms cannot be circumvented. In particular:

o The same page sizes shall be supported for all HSA agents

o Read and write permissions apply to all HSA agents equally.

o Execute restrictions are not required to apply to HSA Components.

o HSA Agents must support shared virtual memory for the lowest privilege level. HSA Agents are not required to support shared virtual memory for higher privilege levels that may be used by host CPU operating system or hypervisor.

o Execute accesses from HSA agents to shared virtual memory must use only the lowest privilege level

o If implemented in the HSA system, page status bit updates (e.g. “access” and “dirty”) must be tracked across all HSA agents to ensure proper paging behavior.

o For the primary memory type, there is a requirement of the same interpretation of cacheability and data coherency properties (excluding those for read-only image data) by all HSA agents (including the host CPUs).

[1] There is no minimum physical memory size for a HSA system implementation.


o For the primary memory type, a HSA agent shall interpret the memory type in a common way and in the same way as host CPUs, with the following caveats:

There is no requirement of the same interpretation of end-point ordering properties by all HSA agents and the host CPU.

There is no requirement of the same interpretation of observation ordering properties by all HSA agents and the host CPU.

There is no requirement of the same interpretation of multi-copy atomicity properties by all HSA agents and the host CPU.

o For any memory type other than the primary memory type, a HSA agent shall either:

A. Generate a memory fault on use of that memory type, or

B. Interpret the memory type in a common way and in the same way as the host CPU, with the following caveats:

There is no requirement of the same interpretation of end-point ordering properties by all HSA agents and the host CPU.

There is no requirement of the same interpretation of observation ordering properties by all HSA agents and the host CPU.

There is no requirement of the same interpretation of multi-copy atomicity properties by all HSA agents and the host CPU.

There is no requirement of the same interpretation of cacheability and data coherency properties by all HSA agents and the host CPU.

o For all memory types, there is a requirement of the same interpretation of speculation permission properties by all HSA agents and the host CPU.

Provide a mechanism to notify system software in case of a translation fault. This notification shall include the virtual address and a device tag to identify the HSA agent that issued the translation request.

Provide a mechanism to handle a recoverable translation fault (e.g. page not present). System software shall be able to initiate a retry of the address translation request which triggered the translation fault.

Provide the concept of a process address space ID (PASID) in the protocol to service separate, per-process virtual address spaces within the system [2]. For systems that support hardware virtualization, the PASID implementation shall allow for a separation of PASID bits for virtualization (e.g. as "PartitionID") and PASID bits for HSA Memory Management Unit (HSA MMU) use. The number of bits used for PASID for HSA MMU functionality shall be at least 8. It is recommended to support a higher minimum number of PASID bits (16 bits) for all HSA Agents if the HSA system is targeted by more advanced system software running many processes concurrently.

[2] The PASID ECNs of PCI Express® 3.0 provide an example of the implementation and use.
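The separation of virtualization bits from HSA MMU bits can be pictured as a simple bit-packing scheme. The layout below is purely an assumption for illustration (field names, widths, and ordering are hypothetical; the spec only requires at least 8 PASID bits and recommends 16):

```c
#include <stdint.h>

/* Hypothetical packing of a translation-request ID: the low bits carry
 * the per-process PASID used by the HSA MMU, and the upper bits carry a
 * virtualization PartitionID.  16 PASID bits are used here, matching
 * the recommended minimum; all of this layout is an assumption. */
#define PASID_BITS 16u
#define PASID_MASK ((1u << PASID_BITS) - 1u)

static inline uint32_t make_request_id(uint32_t partition_id, uint32_t pasid)
{
    return (partition_id << PASID_BITS) | (pasid & PASID_MASK);
}

static inline uint32_t request_pasid(uint32_t id)     { return id & PASID_MASK; }
static inline uint32_t request_partition(uint32_t id) { return id >> PASID_BITS; }
```

The point of the split is that a hypervisor can own the PartitionID bits while the guest OS and HSA MMU consume only the PASID bits, so neither layer needs to know the other's encoding.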


2.2 Requirement: Cache Coherency Domains

Data accesses to global memory from all HSA Agents shall be coherent without the need for explicit cache maintenance. This only applies to global memory locations with the primary memory type of the translation system and does not apply to image accesses. A HSA application may limit the scope of coherency for data items as a performance optimization; see Chapter 3 for details.

The specification does not require that data memory accesses from HSA Agents are coherent for any memory location with any memory type other than the primary memory type supported by the translation system.

The specification does not require that instruction memory accesses to any memory type by HSA Agents are coherent.

The specification does not require that HSA Agents have a coherent view of any memory location where the HSA Agents do not specify the same memory attributes.

Coherency and ordering between HSA shared virtual memory aliases to the same physical address are not guaranteed.
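As a rough host-side analogy (an assumption, not the HSA runtime API), coherent shared memory means one agent's write becomes visible to another through ordinary release/acquire synchronization, with no explicit cache flush or invalidate calls. Here two host threads stand in for two HSA Agents:

```c
#include <pthread.h>
#include <stdatomic.h>

/* Writer publishes data with a release store; reader observes it with
 * an acquire load.  On coherent memory no cache-maintenance operations
 * are needed anywhere in this exchange. */
static int payload;            /* ordinary shared data   */
static atomic_int ready = 0;   /* release/acquire flag   */

static void *writer(void *arg)
{
    (void)arg;
    payload = 42;                                            /* plain store */
    atomic_store_explicit(&ready, 1, memory_order_release);  /* publish     */
    return NULL;
}

static void *reader(void *arg)
{
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;                              /* spin until the flag is published */
    *(int *)arg = payload;             /* guaranteed to observe 42 */
    return NULL;
}

/* Run the exchange once and return what the reader observed. */
static int run_demo(void)
{
    pthread_t w, r;
    int seen = 0;
    pthread_create(&r, NULL, reader, &seen);
    pthread_create(&w, NULL, writer, NULL);
    pthread_join(w, NULL);
    pthread_join(r, NULL);
    return seen;
}
```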

Read-only image data

Read-only image data is required to remain static during the execution of a HSA kernel. Implications of this include (but are not limited to) that the data backing a read-only image view must not be modified through a read/write image view of the same data, nor through direct access to the underlying data array, whether from the same kernel, the host CPU, or another kernel running in parallel.

2.3 Requirement: Signaling and Synchronization

A HSA-compliant platform shall provide for the ability of HSA software to define and use signaling and synchronization primitives accessible from all HSA Agents in a non-discriminatory way. HSA signaling primitives are used to operate on a signal value which is referenced using an opaque 64-bit signal handle. In addition to the signal value, the signal handle may also have associated implementation-defined data. Collectively, the signal value and implementation data are referred to in this document as a HSA signal or simply a signal.

The signaling mechanism has the following properties:

A HSA signal value must only be manipulated by HSA Components using the specific HSAIL mechanisms, and by the host CPU using the HSA runtime mechanisms. Other HSA Agents may use implementation specific mechanisms that are compatible with the HSA runtime and HSAIL mechanisms for manipulating signals. The required mechanisms for HSAIL and the HSA runtime are:

o Allocate a HSA signal

The HSA system shall support allocation of signals through the HSA runtime by means of a dedicated runtime API, and an application may expose this through a runtime agent dispatch.

Allocation of a HSA signal must include initialization of the signal value.

o Destroy a HSA signal

The HSA system shall support deallocation of signals through the HSA runtime by means of a dedicated runtime API, and an application may expose this through a runtime agent dispatch.

o Read the current HSA signal value

The signal read mechanism must support relaxed and acquire memory ordering semantics. Reading the value of a HSA signal returns the momentary value of the HSA signal, as visible to that thread (see 2.13 for information on memory consistency). There is no guarantee that an Agent will see all sequential values of a signal.

o Wait on a HSA signal to meet a specified condition

All signal wait operations specify a condition. The conditions supported are equal, not equal, less than, and greater than or equal.

The signal value is considered a signed integer for assessing the wait condition.

When a signal is modified by a send or atomic read-modify-write operation then a check of the condition is scheduled for all wait operations on that signal. The check of the condition and potential resumption of the wait operation thread may also be blocked by other factors (such as OS process scheduling).

A wait operation does not necessarily see all sequential values of a signal. The application must ensure that signals are used such that wait wakeup conditions are not invalidated before dependent threads have woken up; otherwise the wakeup may not occur.

Signal wait operations may return sporadically. There is no guarantee that the signal value satisfies the wait condition on return. The caller must confirm whether the signal value has satisfied the wait condition.

Wait operations have a maximum duration before returning; the duration can be discovered through the HSA runtime as a count based on the HSA system timestamp frequency. The resumption of the wait operation thread may also be blocked by other factors (such as OS process scheduling).

The signal wait mechanism must support relaxed and acquire memory ordering semantics.

Multiple HSA agents must be allowed to simultaneously wait on the same HSA signal.

o Wait on a HSA signal to meet a specified condition, with a maximum wait duration requested

The requirements on the ordinary wait on signal condition operation apply also to this operation.

The maximum wait duration requested is a hint; the operation may block for a shorter or longer time even if the condition is not met.

The maximum wait duration request is specified in the same unit as the HSA system timestamp.

o Send a HSA signal value

Sending a signal value should wake up any HSA agents waiting on the signal for which the wait condition is satisfied.

It is implementation defined whether the wake up is only issued to HSA agents for which the wait condition is met.

The signal send mechanism must support relaxed and release memory ordering semantics.

o Atomic read-modify-write a HSA signal value

The set of atomic operations defined for general data in HSAIL should be supported for signals in both HSAIL as well as the HSA runtime.

Performing an atomic read-modify-write update of a signal value should wake up any HSA agents waiting on the signal for which the wait condition is satisfied.

It is implementation defined whether the wake up is only issued to HSA agents for which the wait condition is met.

The signal atomic update mechanism must support relaxed, release, acquire, and acquire-release memory ordering semantics.

Atomic read-modify-write operations are not supported on queue doorbell signals that are created by the runtime for User Mode Queues associated with a HSA Component.

HSA signals should only be manipulated by HSA agents within the same user process address space in which they were allocated. The HSA signaling mechanism does not support inter-user process communication.

There is no architectural limitation imposed on the number of HSA signals, unless implicitly limited by other requirements outlined in this document.

The HSA signal handle is always a 64-bit wide quantity with an opaque format.

o A HSA signal handle value of zero must be interpreted as the absence of a signal.

The HSA signal value width depends on the supported machine model.

o A HSA signal must have a 32-bit signal value when used in a small machine model.

o A HSA signal must have a 64-bit signal value when used in a large machine model.

Signals are generally intended for notification between agents. Therefore, signal operations interact with the memory model as if the signal value resides in global segment memory, is naturally aligned and is accessed using atomic memory operations at system scope with the memory ordering specified by the signal operation. See Chapter 3 for further information on memory consistency.

2.4 Requirement: Atomic memory operations

HSA requires the following standard atomic memory operations to be supported by HSA Components (other HSA Agents only need to support the subset of these operations required by their role in the system):

Load from memory

Store to memory

Fetch from memory, apply logic operation with one additional operand, and store back. Required logic operations are bitwise AND, bitwise OR, bitwise XOR.

Fetch from memory, apply integer arithmetic operation with one additional operand and store back. Required integer arithmetic operations are add, subtract, increment, decrement, minimum, maximum.

Exchange memory location with operand.

Compare-and-swap; load memory location, compare with first operand, if equal then store second operand back to memory location.

In the small machine model 32 bit standard atomic memory operations should be supported; the large machine model is required to support 32 bit and 64 bit standard atomic memory operations.

2.5 Requirement: HSA system timestamp

The HSA system provides for a low overhead mechanism of determining the passing of time. A system timestamp is required that can be read from HSAIL or through the HSA runtime. It is also possible to determine the system timestamp frequency through the HSA runtime.

The HSA system requires the following properties of the timestamp:

The timestamp value is a 64 bit number.

The timestamp value should not roll over.

The timestamp value should appear to increase at a constant rate to the HSA applications, regardless of any low power state the system may enter or any change to operating frequency.

The constant observable timestamp value increase rate is in the range 1 MHz to 400 MHz, and can be queried through the HSA runtime.

Provides a uniform view of time across the HSA system, i.e.

o The value of the system timestamp must never appear to decrease as observed by a single reader.

o The HSA system timestamp function must generate a monotonically increasing time stamp value, such that a timestamp A observed to be generated earlier than a timestamp B will have a timestamp value less than or equal to timestamp B as visible from each HSA Agent.

The requirement to not roll over can be resolved in various ways. For example, if the timestamp value is not synchronized with real-world time and is reset on every system boot, then the roll-over period to be designed for can be the maximum expected system uptime rather than the system lifetime. Other implementations may want to reuse an existing real-world timestamp for the HSA system timestamp; in this case the roll-over requirement will imply different hardware requirements.

The system timestamp function can be implemented in several ways. Examples of possible implementations are (many others are of course possible as long as they meet the requirements):

A single timestamp counter peripheral in the system that is read by a HSA Agent when it generates a timestamp.

Distributed local counters that increment at the same rate and have a synchronized startup.

An implementation may vary the increment quanta of the timestamp, for example when entering or exiting a low power state, as long as the requirements above are met.

2.6 Requirement: User Mode Queuing

A HSA-compliant platform shall have support for the allocation of multiple “user-level command queues.”

For the purpose of HSA software, a user-level command queue shall be characterized as runtime-allocated, user-level accessible virtual memory of a certain size, containing packets defined in the Architected Queuing Language ("AQL Packets", reference: 2.7). HSA software should receive memory-based structures to configure the hardware queues, allowing for efficient software management of the hardware queues of the HSA Agents.

This queue memory shall be processed by the HSA Packet Processor as a ring buffer, with separate memory locations defining write and read state information of that queue.

The User Mode Queue shall be defined as follows:

Table 2-1 User Mode Queue structure

Offset 0, uint32_t, queueType

Queue type for future expansion. Intended to be used for dynamic queue protocol determination. Queue types are defined in Table 2-2.

Offset 4, uint32_t, queueFeatures

Bit field to indicate specific features supported by the queue. HSA applications should ignore any unknown set bits. Features differ from types in that a queue of a certain type may have a certain feature (specific restrictions on behavior, additional packet types supported) but still be compatible with the basic queue type. Queue features are defined in Table 2-3.

Offset 8, uint64_t, baseAddress

A 64-bit pointer to the base of the virtual memory which holds the AQL packets for the queue.

Offset 16, uint64_t, doorbellSignal

HSA signaling object handle, which must be signaled with the most recent writeIndex when a new AQL packet has been enqueued. Doorbell signals are allocated as part of queue creation by the HSA runtime and have restrictions on the signal operations possible: atomic read-modify-write operations are not supported.

Offset 24, uint32_t, size

A 32-bit unsigned integer which specifies the maximum number of packets the queue can hold. The size of the queue is a power of two number of packets.

Offset 28, uint32_t, queueId

A 32-bit ID which is unique per process.

Offset 32, uint64_t, serviceQueue

A pointer to another User Mode Queue that can be used by the HSAIL kernel to request system services. The serviceQueue property is provided by the application to the runtime API call when the queue is created, and may be NULL, the system provided serviceQueue, or an application managed queue.

The User Mode Queue structure itself is read-only to HSA Agents; however, HSA Agents can modify the content of the packet ring buffer pointed to by the baseAddress field.

In addition to the data held in the queue structure defined in Table 2-1 the queue defines two additional properties:

writeIndex – A 64-bit unsigned integer which specifies the packetID of the next AQL packet slot to be allocated by the application or user-level runtime. baseAddress + ((writeIndex % size) * AQL packet size) is the virtual address for the next AQL packet allocated. Initialized to 0 at queue creation time.

readIndex – A 64-bit unsigned integer which specifies the packetID of the next AQL packet to be consumed by the compute unit hardware. baseAddress + ((readIndex % size) * AQL packet size) is the virtual address of the oldest AQL packet that is not yet released by the HSA Packet Processor. Initialized to 0 at queue creation time.

The writeIndex and readIndex properties cannot be accessed directly using memory operations from HSA applications or HSAIL kernels, instead they are accessed through the HSA runtime API (HSA applications) or specific HSAIL instructions as described in 2.6.5 Queue index access (p. 16).

Queue types

Table 2-2 User Mode Queue types

ID  Name    Description
0   MULTI   Queue supports multiple producers.
1   SINGLE  Queue only supports a single producer.

Queue Features

Table 2-3 User Mode Queue features

Bit Name Description

0 DISPATCH Queue supports DISPATCH packet.

1 AGENT_DISPATCH Queue supports AGENT_DISPATCH packet.

31:2 RESERVED Must be 0.

Queue mechanics

HSA specifies the following rules for how the queue must be implemented using the information defined above. The rules define a contract between the submitting HSA Agent and the HSA Packet Processor which enables the submitting HSA Agent to enqueue work for the associated HSA Agents in a manner that is portable across HSA implementations.

The main points of the AQL queue mechanism are as follows:

A HSA Agent allocates an AQL packet slot by incrementing the writeIndex. The value of the writeIndex before the increment operation is the packetID of the AQL packet allocated.

A HSA Agent assigns a packet to the HSA Packet Processor by changing the AQL packet format field from INVALID or ALWAYS_RESERVED to the format of the packet being submitted. The HSA Agent is required to ensure that the rest of the AQL packet is globally visible before or at the same time as the format field is written.

o When a HSA Agent assigns an AQL packet to the HSA Packet Processor the ownership of the AQL packet is transferred. The HSA Packet Processor may update the AQL packet at any time after packet submission, and the submitting HSA Agent should not rely on the content of the AQL packet after submission.

A HSA Agent notifies the Packet Processor by signaling the queue doorbell with a packetID equal to the packetID of the packet or to a subsequent packetID.

o Multiple AQL packets may be notified to the Packet Processor by one queue doorbell signal operation. The packetID for the last packet notified should be used for the doorbell value.

o When the queue doorbell is signaled the submitting HSA Agent is required to ensure that any AQL packets referenced by the doorbell value are already globally visible.

o It is not allowed to signal the doorbell with the packetID for an invalid or always reserved packet.

o On a small machine model platform the least significant 32 bits of the packetID is used for the doorbell value.

o For queues with a queueType restricting them to single producer mechanics the value signaled on the doorbell must be monotonically increasing, and all packets up to the notification packetID must be already assigned.

A HSA Agent submits a task to a queue by performing the following steps:

1. Allocate an AQL packet slot.

2. Update the AQL packet with the task particulars (note: the format field must remain unchanged).

3. Assign the packet to the Packet Processor.

4. Notify the Packet Processor of the packet.

In addition, the following detailed constraints apply:

The HSA signaling mechanism is restricted to the process address space for which it is initialized (see 2.3 Requirement: Signaling and Synchronization (p. 7)). The use of this signaling mechanism in the user mode queue implementation implies the same restriction on user mode queues: only HSA Agents within the queue's process address space are allowed to submit AQL packets to a queue.

The size of the queue is a power of two number of AQL packets.

The baseAddress must always be aligned to the AQL packet size.

The writeIndex increases as new AQL packets are added to the queue by the HSA Agent. The writeIndex never wraps around, but the actual address offset used is ((writeIndex % size) * AQL packet size).

The readIndex increases as AQL packets are processed and released by the HSA Packet Processor. The readIndex never wraps around, but the actual address offset used is ((readIndex % size) * AQL packet size).

AQL packets are 64 bytes in size.

When the writeIndex == readIndex, the queue is empty.

The format field of AQL packet slots is initialized to ALWAYS_RESERVED by the HSA system when a queue is created and is set to INVALID when a packet has been processed.

A submitting HSA Agent may only modify an allocated AQL packet when packetID < readIndex + size and the AQL packet format field is set to INVALID or ALWAYS_RESERVED.

HSA Packet Processors must not process an AQL packet with the format field set to INVALID or ALWAYS_RESERVED.

HSA Packet Processors are not required to process an AQL packet until the packet has been notified to the Packet Processor.

HSA Packet Processors may process AQL packets after the packet format field is updated, but before the doorbell is signaled. HSA Agents must not rely on the doorbell as a “gate” to hold off execution of an AQL packet.

AQL packets in a queue are dispatched in-order, but may complete in any order. An INVALID or ALWAYS_RESERVED AQL packet marked in the queue will block subsequent packets from being dispatched.

A HSA Packet Processor may choose to release a packet slot (by incrementing readIndex and setting the format field to INVALID) at any time after a packet has been submitted, regardless of the state of the task specified by the packet. HSA Agents should not rely on the readIndex value to imply any state of a particular task.

The Packet Processor must make sure the packet slot format field is set to INVALID and made globally visible before the readIndex is moved past the packet and it is made available for new task submission.

The User Mode queue base address and queue size of each of the User Mode queues may be confined by system software to system-dependent restrictions (for example, queue size and base address aligned to the system page size) as long as the imposed restrictions do not prevent HSA software from allocating multiple such User Mode queues per application process, and as long as system software provides APIs for imposing such limits on a particular operating system. Queues are allocated by HSA applications through the HSA runtime.

HSA Packet Processors shall have visibility of all their associated User Mode queues of all concurrently running processes and shall provide process-relative memory isolation and support privilege limitations for the instructions in the user-level command queue consistent with the application-level privileges as defined for the host CPU instruction set. The HSA system handles switching among the list of available queues for a HSA Packet Processor, as defined by system software.

Multiple vs Single submitting HSA Agents

A HSA application may incorporate multiple HSA Agents or multiple threads on a HSA Agent that may submit jobs to the same AQL queue. In the common case this will be multiple host CPU threads, or multiple threads running on a HSA Component.

In order to support job submission from multiple HSA Agents, the update of the queue writeIndex must use read-modify-write atomic memory operations. If only a single thread in a single HSA Agent is submitting jobs to a queue then the queue writeIndex update can be done using an atomic store instead of a read-modify-write atomic operation.

There are several options for implementing multiple-thread packet submission, each with different characteristics for avoiding deadlocks if/when a queue is full (i.e. writeIndex >= (readIndex + size)). Two examples are:

Check for queue full before incrementing the writeIndex, and use an atomic CAS operation for the writeIndex increment. This allows aborting the packet slot allocation in all cases if the queue is full, but when multiple writers collide this requires retrying the packet slot allocation.

First atomic add the number of AQL packets to allocate to the writeIndex, then check if the queue is full. This can be more efficient when many threads submit AQL packets, but it may be more difficult to manage queue full conditions gracefully.

Alternatively, a HSA application may limit the job submission to one HSA Agent thread only ("Single submitting HSA Agent"), allowing the HSA Agent and packet processor to leverage this property and streamline the job submission process for such a queue.

The submission process can be simplified for queues that are only submitted to by a single HSA Agent:

The increment of the writeIndex value by the job-submitting HSA Agent can be done using an atomic store, as it is the only HSA Agent permitted to update the value.

The submitting HSA Agent may keep a private copy of the writeIndex value and, since no other agent modifies the writeIndex, avoid reading it back from the queue structure.

Optionally the queue can be initialized to only support single producer queue mechanics. Some HSA implementations may implement optimized Packet Processor behavior for single producer queue mechanics.

Some HSA implementations may implement the same Packet Processor behavior for both the multi-producer queue mechanics and the single producer queue mechanics case. Packet Processor behavior can be optimized as long as it complies with the specification. For example, it is allowed for the Packet Processor in either case to:

Use the packet format field to determine whether a packet has been submitted rather than reading the writeIndex queue field.

Speculatively read multiple packets from the queue.

Not update the readIndex for every packet processed.

Similarly job-submitting HSA Agents do not need to read the readIndex for every packet submitted.

The queue mechanics do not require packets to be submitted in order (except for queues restricted to single producer mechanics), however, the Packet Processor performance and thus dispatch latency can benefit from the doorbell in the common case being signaled in order. For this reason it is recommended that the doorbell is always signaled in order from within a single thread (this is a strict requirement for queues using single producer mechanics), and that excessive contention between different threads on a queue is avoided.

Queue index access

The queue writeIndex and readIndex properties are accessed from HSA applications using HSA runtime API calls and from HSAIL kernels using specific instructions. The operations provided through the HSA runtime API and specific HSAIL instructions for accessing the queue writeIndex and readIndex properties are equivalent. The operations are:

Load the queue writeIndex property. Returns the 64 bit writeIndex value and takes the following parameters:

o Pointer to the queue structure.

Store new value for the writeIndex property. No return value, takes the following parameters:

o Pointer to the queue structure.

o New 64 bit value to store in the writeIndex property.

Compare-and-swap operation on the writeIndex property. The compare-and-swap operation updates the writeIndex property if the original value matches the comparison parameter. Returns the original value of the writeIndex property and takes the following parameters:

o Pointer to the queue structure.

o 64 bit comparison value.

o New 64 bit value to store in the writeIndex property if the comparison value matches the original value.

Add operation on the writeIndex property. Returns the original value of the writeIndex property and takes the following parameters:

o Pointer to the queue structure.

o 64 bit increment value.

Load the readIndex property. Returns the 64 bit readIndex value and takes the following parameters:

o Pointer to the queue structure.

Store new value for the readIndex property. No return value, takes the following parameters:

o Pointer to the queue structure.

o New 64 bit value to store in the readIndex property.

The following memory ordering options are required for the different operations:

Load of readIndex or writeIndex:

o Relaxed atomic load with system scope

o Atomic load acquire with system scope

Store of readIndex or writeIndex:

o Relaxed atomic store with system scope

o Atomic store release with system scope

Add or compare-and-swap operation on writeIndex:

o Relaxed atomic with system scope

o Atomic acquire with system scope

o Atomic release with system scope

o Atomic acquire-release with system scope

All access operations for the queue indices are ordered in the memory model as if the queue indices are stored in the Global segment.

HSA Agents other than host CPUs or HSA Components may access the writeIndex and readIndex properties through implementation specific means.

Runtime Services Dispatch Queue

A HSA system uses a queue-based mechanism to invoke defined system- and HSA runtime services and functions via AGENT_DISPATCH packets issued to an appropriate queue.

This service queue is configured by the HSA runtime and system software when a User Mode Queue is created, is visible to HSA agents through the queue structure serviceQueue field (see Table 2-1) and is serviced by an appropriate HSA agent (usually the host CPU). The service queue configured when a queue is created is provided by the application. The application may select the system provided serviceQueue, a queue managed by the application, or NULL.

In accordance with section 2.7.7 of this specification, the caller passes the arguments and function code in the AGENT_DISPATCH packet; the caller can monitor or wait on a completion signal to identify when the "call" has finished.

The caller must not assume the serviced function to be initiated or completed until completion is indicated through the AGENT_DISPATCH completionSignal (see section 2.7.2).

When the AGENT_DISPATCH is complete, the returnLocation referenced within the AGENT_DISPATCH packet will contain the function's return value.

The dispatch and packet processing flow of the AGENT_DISPATCH must follow all requirements set in the section 2.7 of this specification.

The caller must follow all documented function requirements of the called function, especially related to acquire & release fence scope and referenced memory.

The exact services provided by this queue are not in the scope of this specification and are defined by the HSA Runtime Specification and the HSA Programmers Reference Manual.

2.7 Requirement: Architected Queuing Language (AQL)

A HSA-compliant system shall provide a command interface for the dispatch of HSA Agent commands. This command interface is provided by the Architected Queuing Language (AQL).

AQL allows HSA Agents to build and enqueue their own command packets, enabling fast, low-power dispatch. AQL also provides support for HSA Component queue submissions: the HSA Component kernel can write commands in AQL format.

AQL defines several different packet types:

Always reserved packet

o Packet format is set to always reserved when the queue is initialized. From a functional view always reserved packets are equivalent to invalid packets.

Invalid packet

o Packet format is set to invalid when the readIndex is incremented, making the packet slot available to the HSA Agents.

Dispatch packet

o Dispatch packets contain jobs for the HSA Component and are created by HSA Agents. The queue features field indicates whether a queue supports Dispatch packets.

Agent Dispatch packet

o Agent Dispatch packets contain jobs for the HSA Agent and are created by HSA Agents. The queue features field indicates whether a queue supports Agent Dispatch packets.

Barrier packet

o Barrier packets can be inserted by HSA Agents to delay processing subsequent packets. All queues support Barrier packets.

The packet type is determined by the format field in the AQL packet header.

Packet header

Table 2-4 Architected Queuing Language (AQL) Packet Header Format

The packet header is a uint16_t at byte offset 0, containing the following bit fields:

format:8

AQL_FORMAT: 0=ALWAYS_RESERVED, 1=INVALID, 2=DISPATCH, 3=BARRIER, 4=AGENT_DISPATCH, others reserved.

barrier:1

If set, then processing of the packet will only begin when all preceding packets are complete.

acquireFenceScope:2

Determines the scope and type of the memory fence operation applied before the packet enters the active phase. Must be 0 for Barrier packets.

releaseFenceScope:2

Determines the scope and type of the memory fence operation applied after kernel completion but before the packet is completed.

reserved:3

Must be 0.

Acquire fences

Launch of a Barrier packet does not imply any memory fence for the HSA Agent.

The packet launch memory acquire fence behavior is the same for both Dispatch and Agent Dispatch packets. The packet acquire fence is applied at the end of the launch phase, just before the packet enters the active phase (see section 2.7.2) and has the effect described in section 3.8. The packet launch memory fence applies to the Global segment as well as any image data. The scope of the system that the fence applies to for the Global segment is configurable using the acquireFenceScope header field.

Note that to ensure visibility of a particular memory transaction both a release and a matching acquire fence may need to apply, as defined in the HSA Memory Consistency Model (see 2.13). The required fences can be applied by instructions within HSA kernels or as part of an Agent Dispatch task as well as through the AQL packet processing.

Table 2-5 Encoding of acquireFenceScope

acquireFenceScope  Description
0                  Reserved (this value is used for Barrier packets, as these do not support acquire fences).
1                  The acquire fence is applied with Component scope for the Global segment.
2                  The acquire fence is applied across both Component and System scope for the Global segment.
3                  Reserved.

Release fences

The packet completion memory release fence behavior is the same for Barrier, Dispatch and Agent Dispatch packets. The packet release fence is applied at the start of the completion phase (see section 2.7.2) and has the effect described in section 3.8. The packet completion memory fence applies to the Global segment as well as read-write images modified by the previous completed packet executions. The scope of the system that the fence applies to for the Global segment is configurable using the releaseFenceScope header field.

Table 2-6 Encoding of releaseFenceScope

releaseFenceScope  Description
0                  Reserved.
1                  The release fence is applied with Component scope for the Global segment.
2                  The release fence is applied across both Component and System scope for the Global segment.
3                  Reserved.

Packet process flow

All packet types follow the same process flow, which is separated into launch, active, and completion phases:

Launch phase is initiated when launch conditions are met. Launch conditions are:

All preceding packets in the queue must have completed their launch phase.

If the barrier bit in the packet header is set then all preceding packets in the queue must have completed.

In the launch phase an acquire memory fence is applied before the packet enters the active phase.

After application of the acquire fence the launch phase completes and the packet enters the active phase. In the active phase the behavior of packets differs depending on packet type:

Dispatch packets and Agent Dispatch packets execute on the HSA Component/Agent, and the active phase ends when the task completes.

Barrier packets remain in the active phase until their condition is met. Additionally no further packets are processed by the packet processor until the Barrier packet has completed both its active and completion phase.

The active phase is followed by the completion phase. The first step in the completion phase is the memory release fence.

After the memory release fence completes the signal specified by the completionSignal field in the AQL packet is signaled with a decrementing atomic operation. NULL is a valid value for the completionSignal field, in this case no signal operation is performed.


Error handling

If an error is detected during packet processing or execution the following sequence is followed:

If the error was detected during the packet active execution phase then a release fence is applied to the failing packet.

If the error was detected during the packet active execution phase then the failing packet completionSignal is signaled with an error indication value. The initial error indication value is defined as the two most significant bits set, with all other bits cleared.

o Any negative value on a completionSignal should be considered to indicate an error.

o The failing packet completionSignal may be used as a completion signal in other packets, so it may be incremented/decremented after the error has been signaled. However, it is unlikely that these increment/decrement operations will cause the completionSignal value to become non-negative within a relevant time frame.

If an error occurs that is unrelated to a packet's active execution phase then the queue will be set into an error state and a callback will be issued through the HSA runtime. The HSA runtime API can be used to query the error.

No further packets will be launched on the queue that has encountered an error, however any packet already executing will continue to execute until complete, or aborted through the HSA runtime.

Packets in other queues within the same process (i.e. PASID) will still be processed.

Always reserved packet

An always reserved packet has the format field in the packet header set to ALWAYS_RESERVED (i.e. zero). All other fields of the packet are don't-care.

Invalid packet

An invalid packet has the format field in the packet header set to INVALID (i.e. one). All other fields of the packet are don't-care.

Dispatch packet

Table 2-7 Architected Queuing Language (AQL) Dispatch Packet Format

Start Offset (Bytes)  Format    Field Name               Description
0                     uint16_t  header                   Packet header, see 2.7.1 Packet header (p. 19).
2                     uint16_t  dimensions:2             Number of dimensions specified in gridSize. Valid values are 1, 2, or 3.
                                reserved:14              Must be 0.
4                     uint16_t  workgroupSize.x          x dimension of work-group (measured in work-items).
6                     uint16_t  workgroupSize.y          y dimension of work-group (measured in work-items).
8                     uint16_t  workgroupSize.z          z dimension of work-group (measured in work-items).
10                    uint16_t  reserved2                Must be 0.
12                    uint32_t  gridSize.x               x dimension of grid (measured in work-items).
16                    uint32_t  gridSize.y               y dimension of grid (measured in work-items).
20                    uint32_t  gridSize.z               z dimension of grid (measured in work-items).
24                    uint32_t  privateSegmentSizeBytes  Total size in bytes of private memory allocation request (per work-item).
28                    uint32_t  groupSegmentSizeBytes    Total size in bytes of group memory allocation request (per work-group).
32                    uint64_t  kernelObjectAddress      Address of an object in memory that includes an implementation-defined executable ISA image for the kernel.
40                    uint64_t  kernargAddress           Address of memory containing kernel arguments.
48                    uint64_t  reserved3                Must be 0.
56                    uint64_t  completionSignal         HSA signaling object handle used to indicate completion of the job.

The kernargAddress minimum alignment granule is 16 bytes; HSAIL directives may enforce larger alignment granules for individual kernels. The memory at kernargAddress must remain allocated and unmodified until the kernel completes.

The memory referenced by the kernargAddress must be located in globally shared memory, and must be visible at system scope before the Dispatch packet is assigned.

The kernargAddress memory buffer should be allocated through the runtime allocation API provided for this purpose.

Agent Dispatch packet

Table 2-8 Architected Queuing Language (AQL) Agent Dispatch Packet Format

Start Offset (Bytes)  Format    Field Name        Description
0                     uint16_t  header            Packet header, see 2.7.1 Packet header (p. 19).
2                     uint16_t  type              The function to be performed by the destination HSA Agent. The type value is split into the following ranges: 0x0000:0x3FFF vendor specific; 0x4000:0x7FFF HSA runtime; 0x8000:0xFFFF user registered function.
4                     uint32_t  reserved2         Must be 0.
8                     uint64_t  returnLocation    Pointer to location to store the function return value(s) in.
16                    uint64_t  arg[0]            64-bit direct or indirect arguments.
24                    uint64_t  arg[1]
32                    uint64_t  arg[2]
40                    uint64_t  arg[3]
48                    uint64_t  reserved3         Must be 0.
56                    uint64_t  completionSignal  HSA signaling object handle used to indicate completion of the job.

Barrier packet

Table 2-9 Architected Queuing Language (AQL) Barrier Packet Format

Start Offset (Bytes)  Format    Field Name        Description
0                     uint16_t  header            Packet header, see 2.7.1 Packet header (p. 19).
2                     uint16_t  reserved2         Must be 0.
4                     uint32_t  reserved3         Must be 0.
8                     uint64_t  depSignal0        Handles for dependent signaling objects to be evaluated by the packet processor.
16                    uint64_t  depSignal1
24                    uint64_t  depSignal2
32                    uint64_t  depSignal3
40                    uint64_t  depSignal4
48                    uint64_t  reserved4         Must be 0.
56                    uint64_t  completionSignal  HSA signaling object handle used to indicate completion of the job.

The Barrier packet defines dependencies for the HSA Packet Processor to monitor. The HSA Packet Processor will not launch any further packets until the Barrier packet is complete. Up to five dependent signals can be specified in the Barrier packet, and the execution phase for the Barrier packet completes when either of these two conditions has been met:

Any of the dependent signals have been observed with a negative value after the Barrier packet launched.


o This occurs when an error has occurred in a job on which there is a dependency. The error should, if possible, be propagated to the HSA application, and dependent jobs prevented from executing:

The completionSignal will be signaled with the error value defined in section 2.7.3.

If the queue is not already in an error state (e.g. the job generating the error was processed in a different queue) then the HSA Packet Processor should consider the error code on the dependent signal an error and follow the procedure outlined in section 2.7.3.

All of the dependent signals have been observed with the value 0 after the Barrier packet launched (note that it is not required that all dependent signals are observed to be 0 at the same time).

o The completionSignal will be signaled with an atomic decrementing operation in the completion phase.

The HSA Packet Processor should monitor all the dependent signals continuously, but is not required to see all intermediate values on the signals.

If a dependent signaling pointer has the value NULL it is considered to be constant 0 in the dependency condition evaluation.

Small machine model

The HSA small machine model has 32-bit rather than 64-bit native pointers. This affects any packet field that specifies an address:

Dispatch packet kernelObjectAddress and kernargAddress fields

Agent Dispatch packet returnLocation field

For these fields only the least significant 32 bits are used in the small machine model.

2.8 Requirement: HSA Agent scheduling

The HSA compliant system is required to schedule tasks for each HSA Agent from the HSA queues connected to the HSA Agent as well as from any additional pool of non-HSA tasks. The exact mechanism for scheduling is implementation defined, but should meet the restrictions outlined in this section.

The HSA programming model encourages the use of multiple queues per application to enable complex sequences of tasks to be scheduled, and HSA compliant applications are expected to use multiple queues simultaneously. To ensure consistent performance across HSA compliant systems, implementations are encouraged to schedule efficiently across multiple queues and to minimize the overhead of managing them, as this is a performance critical path.

To ensure this the HSA Agent scheduling must have the following characteristics:

1. As a minimum, HSA Agent scheduling should be triggered by the following events:


o HSA Agent task execution completes and the HSA Agent becomes available for executing another task.

o If the HSA Agent is not executing a task then the HSA Agent scheduling must also be triggered by:

Packet submission to any queue connected to the HSA Agent.

A signal used by an AQL Barrier packet active on a queue connected to the HSA Agent is signaled such that the AQL Barrier packet can complete. The HSA Agent scheduling must minimally be able to monitor the forefront AQL packet active on each queue connected to it; monitoring includes any combination of supported AQL packets and their dependencies. To avoid deadlock, the monitoring of dependencies must not occupy any execution resources required to make forward progress on tasks (e.g. a Barrier packet active on one queue must not block task execution on other queues).

2. The HSA Agent scheduling should respond to the events listed in point 1 in a reasonable time.

3. When the HSA Agent scheduling mechanism is triggered and the HSA Agent has available execution resources the scheduling mechanism must dispatch a task if there is a runnable Dispatch or Agent Dispatch packet on any queue connected to the HSA Agent. Note that Barrier packets are not allowed to occupy HSA Agent/Component execution resources.

2.9 Requirement: HSA Component Context Switching

HSA Component Context Switching is required to provide a mechanism to the system software to support:

Scheduling of multiple HSA contexts to a device with appropriate prioritization, throughput, and latency constraints.

Scheduling of both HSA and non-HSA tasks on devices that can support both task types.

The HSA specification defines the mechanisms that must be present to support scheduling through a so-called context switch. The precise constraints on scheduling are considered platform specific and are not mandated by the HSA specification.

The design of context switching includes stopping any currently running wavefronts, saving the current context state to memory (memory is managed by system software), restoring context state of another context from memory, and starting execution at the state left when it was stopped previously.

System software may request a HSA Component to initiate the HSA Component compute context switch, indicating at minimum the affected PASID and user-level queue that should be switched.

The context state data format is implementation-specific. System software shall assure the residency of the context save memory during the preemption operation.

System software may request three priorities of context switch:

1. Switch. A context switch is required. The context switch is not required to occur immediately. The hardware may continue to complete issued work-items, and context switch at a point where the size of the HSA context can be reduced.


2. Preempt. A context switch is required as soon as possible. It must be detectable whether a platform implements this through optional hard preemption and can hence provide a platform-specific guaranteed context switch latency for a preempt.

3. Terminate and context reset. The HSA context should be terminated as fast as possible. There is no requirement to preserve state. The HSA Component is available for scheduling another context afterwards.

System software must be able to determine if a switch has been completed, and may request a higher priority switch if the switch has not occurred.

For instance, a scheduler may first issue a switch and then, after a grace period, check whether the switch has occurred and, if not, issue a preempt to trigger a higher priority context switch. Eventually system software may choose to terminate a context (with loss of context state) if the HSA Component does not respond within a pre-determined time set by system software policy. The required policies are not mandated by this specification, only the required capabilities.

The HSA PRM (post 1.0 provisional) will define a YIELD mechanism which may be used to indicate a point of reduced context for a context switch.

If preempt is implemented through “hard preemption” it shall fulfil the following requirements:

The mechanism shall provide preemption and forced exit of currently running wavefronts.

On preemption, the HSA Component shall save all state to memory that would be required to restart the preempted context again.

The preemption mechanism must not wait for events that have undetermined latency (including end-of-kernel or end-of-wavefront events).

Platforms implementing the Full Profile of the HSA PRM are required to implement hard preemption.

2.10 Requirement: HSA Component IEEE754-2008 Floating Point Exception Reporting

A HSA Component shall report certain defined exceptions related to the execution of the HSAIL code to the HSA Runtime. At minimum the following exceptions shall be reported for IEEE754-2008 floating point errors:

Table 2-10 IEEE754-2008 exceptions

Operation          Exception Code  Description
INVALID_OPERATION  0               Invalid operation that raises a flag under default exception handling as specified in IEEE754-2008.
DIVIDE_BY_ZERO     1               Divide-by-zero that raises a flag under default exception handling as specified in IEEE754-2008.
OVERFLOW           2               Overflow that raises a flag under default exception handling as specified in IEEE754-2008.
UNDERFLOW          3               Underflow that raises a flag under default exception handling as specified in IEEE754-2008.
INEXACT            4               Inexact that raises a flag under default exception handling as specified in IEEE754-2008.

DETECT:

Support for DETECT is not required in the base HSA profile.

The HSA Component must maintain a status bit for each of the DETECT enabled arithmetic exceptions for each work-group. The status bits are reset to zero at the start of a dispatch. If an exception occurs and the DETECT policy is enabled for that exception, then the HSA Component will set the corresponding status bit. The HSA Component must:

o Support HSAIL instructions that can read or write the exception status field while the kernel is running.

BREAK:

Support for BREAK is not required in the base HSA profile.

When an exception occurs and the BREAK policy is enabled for that exception, the HSA Component must stop the excepting wavefront's execution of the HSAIL instruction stream precisely at the instruction which generated the exception. Memory and register state is as it was just before the excepting instruction executed, except that non-excepting lanes in the wavefront which generated the exception may have already executed and updated register or memory state. In reasonable time, the HSA Component executing the wavefront must stop initiating new wavefronts for all dispatches executing on the same User Mode queue, and must ensure that all wavefronts currently being executed for those dispatches either complete or are halted. Any dispatches that complete will have their completion signal updated as normal; however, any dispatches that do not complete the execution of all their associated wavefronts will not have their completion signal updated.

The HSA Component then signals the runtime, which invokes an exception handler defined in the HSA Runtime. The reported information must include:

o Exception code

o PASID

o queueId

o packetId

o Precise instruction counter that caused the exception.

Some possible uses for the runtime exception handler are:

o Invoking a debugger which allows the developer to determine which function caused the exception and inspect the surrounding memory and register state.

o Logging the exception to a file


o Killing the process.

After signaling the BREAK, it is not required that the HSA Component can restart the instruction which generated the exception.

The enabling of the DETECT and BREAK exception policies is controlled through the HSAIL finalization. Both DETECT and BREAK can be enabled for the same arithmetic exception.

The exception handling policy must be specified at the time the HSAIL code is finalized, since implementations may generate different code when exception handling is requested. Enabling either DETECT or BREAK functionality may reduce the performance of the kernel running on the HSA Component. However, it is recommended that performance with DETECT policy be close to native code performance since this is a commonly-used exception policy.

2.11 Requirement: HSA Component Hardware Debug Infrastructure

The HSA Component shall provide mechanisms to allow system software and some select application software (for example, debuggers and profilers) to set breakpoints and collect throughput information for profiling. At a minimum, instruction breakpoints shall be supported. Memory breakpoints, memory faults, and conditional breakpoints may be supported as an implementation option.

When hitting a breakpoint condition, the hardware shall retire all pending writes to memory, stop the wavefront, and dump a status indicating the breakpoint condition, the affected PASID, user-level queue, instruction(s), and wavefront/group ID.

System software shall provide appropriate resident memory to allow the HSA Component hardware on reaching the breakpoint condition to write out state information that can be parsed by appropriate select application software (for example, debuggers and profilers). The format of the written data shall be system implementation specific.

2.12 Requirement: HSA Platform Topology Discovery

Introduction

The HSA System Architecture requirements cover a wide range of systems with a variety of elements. Architected topology data delivery is therefore important: it allows application and system software to reliably discover and characterize the available HSA agents and their topological relation to each other and to other relevant system resources such as memory, caches, I/O, and host agents (e.g. CPU), and so to leverage the available elements to their fullest.

A HSA compliant system comprises a variety of agents, interfaces, and other resources such as memory and caches. To take advantage of these in application, runtime, and system software, a variety of properties are needed to characterize them.

At minimum, a HSA compliant system must provide a set of parameter characterizations that are surfaced to HSA software. Applications can discover these through various runtime APIs (through the HSA Runtime, with the expectation that other APIs targeting the


platform, e.g. OpenCL™ and DirectCompute, report equivalent information), and software may leverage these parameters for optimizing performance on a specific platform.

Unless otherwise provided to the HSA runtime level by system software, each of these properties shall be contained in appropriate table entries. The multitude of HSA component properties can be enumerated, and their relationship to each other established, by system software parsing the table contents, and then communicated to the HSA runtime and HSA applications.

Depending on the platform, those element descriptions may be structured through meta-entries within a table into higher order hierarchical groups that identify functionally consistent sets of elements (such as Agent, Memory, Compute, Caches, I/O) within the system. Each group is associated with a node mask or ID, forming an element list that references the contained elements as relative table offsets. Node masks or IDs should be 0-based, with ID 0 representing the primary elements of the system (e.g. “boot cores”, memory) if applicable. The enumeration order and, if applicable, the values of the ID should match other information reported through mechanisms outside the scope of the HSA requirements; for example, the ACPI SRAT settings available on some systems should match the memory order and properties reported through HSA. Further detail is out of the scope of the System Architecture and is outlined in the Runtime API specification.

Each of these nodes is interconnected with other nodes in more complex systems to the level necessary to adequately describe the HSA system.

Where applicable, the node grouping follows NUMA principles to leverage memory locality in software when multiple physical memory blocks are available in the system and HSA agents have a different “access cost” (e.g. bandwidth/latency) to that memory.

The design concept allows expressing other group hierarchies as necessary for a specific platform.

Figure 2-2 provides an illustrative example for a grouping characterizing an exemplary multi-socket CPU/APU and multi-board GPU system, whereas Figure 2-1 (p. 30) is an example for a structurally simpler platform with only one “node”, containing HSA agent and memory resources.


Figure 2-1 Example of a Simple HSA Platform


Figure 2-2 Example of a HSA platform with extended topology

Figure 2-2 shows a more advanced system with a variety of host CPUs and HSA Components and five discoverable HSA Memory Nodes, distributed across multiple system components. The appropriate topology structure definition is referenced in a later paragraph.

The platform topology is considered an inherent property of the HSA system and unchanging during the operation of the HSA system for the purpose of this specification.

[Figure 2-2 diagram: a HSA platform with nodes 0 through 4. Nodes 0 and 1 each contain a HSA APU (CPU cores and a GPU with H-CUs sharing a HSA MMU) attached to system memory with cacheable/coherent and non-cacheable/non-coherent regions, joined by a socket interconnect; SBIOS/UEFI firmware resides on node 0. Nodes 2, 3, and 4 are HSA discrete GPUs (H-CUs, VBIOS/UEFI GOP) on optional PCIe add-in boards behind a PCIe bridge, each with coherent and non-coherent device-local memory.]


Topology Table Requirements

Figure 2-3 Topology Table example

Figure 2-3 outlines an example of a topology table containing identifiable elements of a HSA platform. Typically, though not in all cases, system software enumerates all HSA system topology information at or close to system startup and makes it available to the application at application startup. System software and the runtime may provide mechanisms to notify the application about topology changes caused by platform and system events (e.g. docking) that change the previously reported topology.

Each of the table entries characterizes particular HSA component properties of the system. In addition HSA agent entries may reference elements that characterize the resources the HSA components are associated with.

Each of the elements in the table is characterized by a unique element ID, and type specific property data.

The table header should at minimum provide information on the system vendor of the platform, the HSA platform type and version, the number of entries of each element type and a pointer to each element-type table, and the total size of the table, in addition to other ancillary information required for discovery of a specific platform architecture.

The header may be followed by element type specific tables containing, among other things, memory and cache descriptors enumerating the appropriate resources

[Figure 2-3 diagram: a topology table consisting of a Topology Header and Topology Base followed by element entries, for example: Agent entries (AgentID, Node, Type, Properties), Memory entries (MemID, Node, BasePhysical, SizePhysical, Properties), Cache entries (CacheID, Node, Level, Properties), and an IO Link entry (IOLinkID, Node, Properties).]


available on the platform. The order in which the element type data is stored is implementation defined.

The table structure is defined to allow extensions in the future. Software should evaluate the type information and ignore/skip any unknown entries.

The following sections will describe the elements in more detail.

HSA Agent Entry

Figure 2-4 Structure of a HSA Agent Entry

A HSA agent entry characterizes the hardware component participating in the HSA memory model.

Each topology entry for a platform agent carries an ID, derived from its location in the table, that distinguishes it from other agents on the same platform to application and system software. It also contains an Agent type (e.g. throughput/vector, latency/scalar, DSP, ...) characterizing its compute execution properties, work-group and wavefront size, and concurrent number of wavefronts, in addition to a nodeID value.

One or more HSA agents may belong to the same nodeID or some HSA agents may be represented in more than one node, depending on the system configuration.

In addition to HSA agent specific properties, a HSA agent entry may contain element references to system resources, expressed as arrays of 32-bit table-relative offsets to other table entries. These references express the topological relations between the HSA agent and memory, cache, and I/O resources, or to sub-divisible components of the HSA agent itself. For example, a HSA agent (such as a CPU) could be broken down by the HSA runtime or application into individual software-schedulable units, such as “sockets” or “cores”, for finer-grained scheduling, if an implementation provides such capabilities.

At minimum the following HSA agent parameters would be surfaced to system software and made available to applications via Application programming interfaces (e.g. HSA runtime, OpenCL™ or others):

Agent Type

Element ID (this may be derived from its table location or from a hardware or bus property of an implementation to uniquely identify the agent)

Maximum number of Compute Units

Maximum work-item dimensions



Maximum work item sizes

Machine model

Endianness

Optional properties of a HSA agent would also surface through the mechanism, as an example:

Maximum Clock Frequency in MHz (0 if not exposed on a platform)

Friendly name (equivalent to “device name” in OpenCL™)

Floating point capabilities

Other optional parameters may be available depending on platform
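The minimum and optional agent properties listed above can be collected into one record, sketched below. The field names and example values are illustrative assumptions, not defined by this specification.

```python
from dataclasses import dataclass

# Hypothetical record for the HSA agent parameters listed above.
@dataclass
class AgentEntry:
    agent_type: str               # e.g. "throughput" or "latency"
    element_id: int               # from table location or a bus property
    max_compute_units: int
    max_work_item_dimensions: int
    max_work_item_sizes: tuple    # one maximum per dimension
    machine_model: str            # e.g. "large" or "small"
    endianness: str               # "little" or "big"
    max_clock_mhz: int = 0        # optional: 0 if not exposed on the platform
    friendly_name: str = ""       # optional: OpenCL "device name" equivalent

gpu = AgentEntry("throughput", 1, 8, 3, (1024, 1024, 64), "large", "little",
                 max_clock_mhz=900, friendly_name="example-gpu")
```

Note that the optional fields default to the "not exposed" values, matching the convention that a platform reports 0 for an unavailable clock frequency.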

HSA Memory Entry

The HSA Memory Entry describes the physical properties of a block of memory in the system with identifiable characteristics that can be accessed via HSA memory model semantics by the HSA agents.

At minimum the following HSA memory parameters of each physically distinguishable memory block would be surfaced to system software and potentially made available to applications via Application programming interfaces (e.g. HSA runtime, OpenCL™ or others):

Node Id

Element ID (this may be derived from its table location or from a hardware or bus property of an implementation to uniquely identify the physical memory)

Virtual base address (32-bit or 64-bit value)

Physical memory size in bytes

Optional HSA memory parameters would also surface through the mechanism, for example:

Memory interleave (1, 2, 4)

Number of memory channels

Memory width per memory channel

Maximum memory throughput in MByte/s

Minimum memory access latency in picoseconds

Other optional parameters may be available depending on platform
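A minimal sketch of constructing a memory entry with the minimum fields listed above, checking that the virtual base address fits the platform's 32-bit or 64-bit address width. The function and field names are assumptions for the example.

```python
# Illustrative constructor for the minimum HSA memory entry fields.
def make_memory_entry(node_id, element_id, virtual_base, size_bytes,
                      address_bits=64):
    # The virtual base address is a 32-bit or 64-bit value depending on
    # the platform machine model.
    if virtual_base >= (1 << address_bits):
        raise ValueError("virtual base address does not fit the address width")
    return {"node_id": node_id, "element_id": element_id,
            "virtual_base": virtual_base, "size_bytes": size_bytes}

entry = make_memory_entry(node_id=0, element_id=2,
                          virtual_base=0x1_0000_0000, size_bytes=8 << 30)
```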

HSA Cache Entry

The HSA Cache Entry describes the available caches for HSA agents within a system.

At minimum the following HSA cache parameters would be surfaced to system software and potentially made available to applications via Application programming interfaces (e.g. HSA runtime, OpenCL™ or others):

Node ID


Element ID (this may be derived from its table location or from a hardware or bus property of an implementation to uniquely identify the cache)

Cache type, e.g. read-only or read/write cache

Total Size of cache in Kilobytes

Cache line size in bytes

Cache Level (1, 2, 3, …)

Associativity (0 = reserved, 1 = direct mapped, 2-0xFE = n-way set associativity, 0xFF = fully associative)

Exclusive/Inclusive to lower level caches

Agent association if cache is shared between agents (identified via nodemask)

Optional HSA cache parameters would also surface through the mechanism, for example:

Maximum Latency (in picoseconds)

Other optional parameters may be available depending on platform
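Two of the cache-entry fields above have compact encodings: the one-byte associativity value and the nodemask that identifies the agents sharing a cache. The decoding can be sketched as below; the function names are illustrative.

```python
# Decoding the associativity encoding given above:
# 0 = reserved, 1 = direct mapped, 2-0xFE = n-way, 0xFF = fully associative.
def decode_associativity(value):
    if value == 0:
        return "reserved"
    if value == 1:
        return "direct-mapped"
    if value == 0xFF:
        return "fully-associative"
    if 2 <= value <= 0xFE:
        return f"{value}-way"
    raise ValueError("associativity is a single-byte field")

def sharing_agents(nodemask):
    """Return the node indices whose bits are set in the nodemask."""
    return [i for i in range(nodemask.bit_length()) if (nodemask >> i) & 1]
```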

Topology Structure Example

Figure 2-5 outlines an illustrative example of how the information can be used to express specific system topology, using the Figure 2-2 block diagram as the basis.

Figure 2-5 Topology definition structure for previously defined system block diagram

2.13 Requirement: Images

The requirements on images are preliminary.

A HSA-compliant platform shall optionally provide for the ability of HSA software to define and use image objects, which are used to store one-, two-, or three-dimensional images. The elements of an image object are defined from a list of predefined image formats. Image primitives accessible from HSA components operate on an image value, which is referenced using an opaque 64-bit image handle. Image handles are allocated



by a HSA platform and must be bound before being accessed by a HSA component. An image object can be read only (i.e. a HSA component can only read values from the image) or read and write (i.e. a HSA component can write to an image as well as read from it). Images have the following additional requirements:

Images are created by the HSA platform and may be initialized from data in the global segment. However, the actual image data storage is not in the global segment, and any initialization buffer is not connected to the data storage once the image is created.

Once an image is created it must be bound by the HSA platform; the behavior of using an image that is not bound is undefined.

Image data can only be accessed via interfaces exposed by the HSA platform or HSA component primitives.

The layout of the image data storage is implementation defined.

Images are not part of the shared virtual address space; as a consequence, in general, agent access to image data must be performed via interfaces exposed by the HSA platform.

Image operations do not contribute to memory orderings defined by the HSA memory model. The following rules apply for image operation visibility:

o Modifications to image data through HSA runtime API calls are visible to all subsequent packet dispatches.

o Modifications to image data by a HSA component packet dispatch are visible as follows:

All image data modifications made by an AQL packet dispatch are visible to subsequent AQL packet dispatches through fence operations issued by the packet processor.

Read-modify-write atomics are performed as relaxed atomics at both work-group and component scopes.

Stores to, and read-modify-write atomics on, an image by a single work-item are visible to loads by the same work-item after execution of a work-item scope image fence operation.

Stores to an image by a work-item are visible to loads by the same work-item and other work-items in the same work-group after both the writing and reading work-item(s) execute work-group execution uniform image acquire/release fences at work-group scope.

Read-modify-write atomics on an image by a work-item are visible to loads by the same work-item and other work-items in the same work-group after the reading work-item executes a work-group execution uniform image acquire/release fence at work-group scope.

Stores to an image by a work-item are visible to read-modify-write atomics by the same work-item and other work-items in the same work-group after the writing work-item executes a work-group execution uniform image acquire/release fence at work-group scope.


Note that there is no image acquire/release fence at component scope. Therefore, it is not possible to make image stores performed by a work-item visible to image loads performed by another work-item in a different work-group of the same dispatch. However, image read-modify-write atomic operations are visible to work-items in different work-groups of the same component.

Image accesses and image fences by a single work-item cannot be reordered.

o Access to modified image data that is not yet visible is undefined.

A read/write image may alias a read only image that is allocated for the same HSA component.

If the data of a read only image is modified while it is being read by a HSA Component kernel dispatch, then the behavior is undefined.

As described above, images are not part of shared virtual memory and thus are not included in the HSA Memory Model, as defined in Chapter 3.
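The work-group visibility rule above can be modeled as a small predicate. This is an assumption-laden sketch, not part of the specification: the record fields (work_item, work_group) and the boolean fence flags are inventions for the example.

```python
# Model of the image-visibility rule: stores to an image by one work-item
# become visible to another work-item only within the same work-group, and
# only after BOTH the writer and the reader execute work-group-scope image
# acquire/release fences. There is no component-scope image fence.
def image_store_visible(writer, reader, writer_fenced, reader_fenced):
    if writer["work_item"] == reader["work_item"]:
        # Same work-item: visible after a work-item-scope image fence,
        # modeled here by writer_fenced.
        return writer_fenced
    same_group = writer["work_group"] == reader["work_group"]
    return same_group and writer_fenced and reader_fenced
```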


Chapter 3 HSA Memory Consistency Model

This chapter describes the HSA memory consistency model.

The HSA memory consistency model is preliminary.

3.1 What Is a Memory Consistency Model?

A memory consistency model defines how writes by one thread of execution become visible to another thread of execution, whether on the same agent or different agents. The memory consistency model described in this chapter sets a minimum bar. Some HSAIL implementations might provide a stronger guarantee than required.

For many implementations, better performance will result if either the hardware or the finalizer is allowed to relax the perceived order of the execution of memory operations. For example, the finalizer might find it more efficient to move a write later in the program; as long as the program semantics do not change, the finalizer is free to do so. Once a store is deferred, other work-items and agents will not see it until the store actually happens. Hardware might provide a cache that also defers writes.

In the same way, both the hardware and the finalizer are sometimes allowed to move writes to memory earlier in the program. In that case, the other work-items and agents may see the write when it occurs, before it appears to be executed in the source.

A memory consistency model provides answers to two kinds of questions. For developers and tools, the question is: “How do I make it always get the right answer?” For implementations, the question is: “If I do this, is it allowed?”

Programs cannot see any global order directly. Implementations cannot know about future loads and stores. Each implementation creates an order such that any possible visibility order requirement created by future loads and stores will not be violated.

Implementations can constrain events more than the model allows. The model is a minimum required ordering of events. The model goes from a description of operations, to a minimal required ordering of operations due to a single compute unit, to an interleaving of the operations in time from multiple compute units with allowed reordering.

Legal results are determined by the interleavings, allowed reorderings, and visibility rules.

3.2 What Is a HSA Memory Consistency Model?

The HSA memory consistency model describes how multiple work-items interact with themselves and with HSA agents.

The HSA memory consistency model is a relaxed model based around RCsc semantics on a set of synchronizing operations. The standard RCsc model is extended to include fences and relaxed atomic operations.

In addition, HSA includes the concepts of memory segments and scopes.


Each load or store operation occurs to a single memory segment, whereas fences can apply across multiple segments. For each synchronizing memory operation we define a set of scopes that it applies to (namely wavefront, work-group, component, and system).

There is then a single order for all synchronizing memory operations to a specific segment/scope combination.

Under this memory consistency model we define a different single global order for synchronizing operations for each segment/scope instance.

The memory model defines the concept of a data race, and a program is undefined if it is not data-race free.

The precise rules and definitions for the memory consistency model are described in the rest of the HSA Memory Consistency Model section.

There might be devices in a system that do not subscribe to the HSA memory consistency model, but can move data in and out of HSA memory regions. These need coarse-grain synchronization (“start now” or “all done”). These mechanisms are defined outside of the HSA memory consistency model, although HSA memory might be used for that communication or storage.

3.3 How Does Software See Correct vs. Incorrect?

Software can only see the results of load and read-modify-write operations. Each load result creates a more constrained partial visibility order. The contract is violated when a load gets a result that does not meet all the required partial visibility orders in the system.

HSAIL might interact with code that has a stronger memory consistency model requirement but is not visible to HSAIL-compliant memory systems.

3.4 HSA Memory Consistency Model Definitions

The HSA memory consistency model is defined in terms of threads and the ordering of visibility of operations.

A thread is a program-ordered sequence of operations through a processing element. A thread can be any thread of execution on an agent, a work-item, or any method of sending operations through a processing element in a HSA-compatible device.

For primitives of type “store”, visibility to thread A of a store operation X is when the data of store X is available to loads from thread A.

For primitives of type “load”, visibility is when a load gets the data that the thread will put in the register.

Primitives of type “fence” can enforce additional constraints on the visibility of load and store operations. A fence is visible at the point where the constraint implied by the fence is met. For example a release fence is visible when all preceding stores from threads affected by the fence are visible in the required segment scope instances.

For any two memory operations, X >> Y indicates that Y is visible to any thread only after X is visible to that same thread. The set of threads for which the >>


relationship is valid is limited by the segment and scope constraints on operations X and Y.

Definition of single-copy atomic property

If two single-copy atomic stores overlap, then all the writes of one of the stores will be inserted into the modification order of each overlapping byte before any of the writes of the other store are inserted.

If a single-copy atomic load A overlaps with a single-copy atomic store B, and for any of the overlapping bytes load A returns the data written by the write that store B inserted into the modification order of that byte, then load A must return data from a point in the modification order of each of the overlapping bytes no earlier than the write to that byte inserted by store B.

Operations

A “memory operation” is any operation that implies executing one or more of the load, store, or fence primitives. The following basic operations are defined:

o ld, load a value, also referred to as “ordinary load”.

o st, store a value, also referred to as “ordinary store”.

o atomic(op), perform an atomic operation (op) returning a value. This may be a load (ld) or a read-modify-write operation (rmw).

o atomicNoRet(op), perform an atomic operation without returning a value. This may be a store (st) or a read-modify-write operation (rmw).

o fence, perform a memory fence operation.

Memory operation order constraints are attached to memory operations. The following ordering constraints are defined:

o rlx, no ordering constraint attached (i.e. relaxed).

o acq, acquire constraint attached.

o rel, release constraint attached.

o ar, acquire and release constraints attached.

The only ordering constraint allowed for ordinary loads or stores is rlx.

The following combinations of atomic(op) operation and ordering constraint are allowed:

o atomic(ld)_rlx, atomic(ld)_acq,

o atomic(rmw)_rlx, atomic(rmw)_acq, atomic(rmw)_rel, atomic(rmw)_ar

The following combinations of atomicNoRet(op) operation and ordering constraint are allowed:

o atomicNoRet(st)_rlx, atomicNoRet(st)_rel,

o atomicNoRet(rmw)_rlx, atomicNoRet(rmw)_acq, atomicNoRet(rmw)_rel, atomicNoRet(rmw)_ar


The following combinations of fence operation and ordering constraint are allowed: fence_acq, fence_rel, fence_ar.
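The allowed operation and ordering-constraint combinations above can be encoded as a lookup table, sketched below. The string keys are informal names for the example, not HSAIL syntax.

```python
# Allowed ordering constraints per operation, from the lists above.
ALLOWED_ORDERS = {
    "ld": {"rlx"},                 # ordinary load
    "st": {"rlx"},                 # ordinary store
    "atomic(ld)": {"rlx", "acq"},
    "atomic(rmw)": {"rlx", "acq", "rel", "ar"},
    "atomicNoRet(st)": {"rlx", "rel"},
    "atomicNoRet(rmw)": {"rlx", "acq", "rel", "ar"},
    "fence": {"acq", "rel", "ar"},
}

def is_allowed(operation, order):
    return order in ALLOWED_ORDERS.get(operation, set())
```

The asymmetry is visible in the table: an atomic load can carry at most an acquire constraint, an atomic store (no return) at most a release constraint, while read-modify-writes and fences may carry either or both.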

Atomic stores are any atomic operations that imply modification of a memory location.

Atomic loads are any atomic operations that imply reading a memory location.

Atomic read-modify-writes are any operations that imply both reading and modifying a memory location.

An ordinary memory operation is an ordinary load or an ordinary store (ld or st).

Release operations are all memory operations (atomic or fence) with order rel or ar.

Acquire operations are all memory operations (atomic or fence) with order acq or ar.

Relaxed atomics are all atomic operations that are neither an acquire operation nor a release operation.

Synchronizing operations are the union of all release and all acquire operations, i.e. all operations (atomic or fence) with order acq, rel or ar.

The operations available in HSAIL have different names, and may also have restrictions on the combinations of operation, ordering, segment and scope that are allowed. See the HSA Programmer's Reference Manual for a full description of HSAIL memory operations.

Segments

The HSA memory consistency model defines the following different segments:

Global segment, shared between all agents.

Group segment, shared between work-items in the same work-group in a HSAIL kernel dispatch.

Private, private to a single work-item in a HSAIL kernel dispatch.

Kernarg, read-only memory visible to all work-items in a HSAIL kernel dispatch.

Readonly, read-only memory visible to all agents.

Read-only segments are not updated after the HSAIL kernel dispatch; therefore, only ordinary load operations are relevant for these segments. Read-only segments are not discussed further in the HSA memory consistency model section.

The different segments are described in more detail in the HSA Programmer's Reference Manual.

A particular memory location is always accessed using the same segment.

All operations except fences apply only to one segment. Fences may apply to multiple segments.

Scopes

The HSA memory consistency model defines the following different scopes:


Wavefront (wv)

Work-group (wg)

Component (cmp)

System (sys)

Memory operation scope is used to limit the required visibility of a memory operation to the same or smaller than the visibility implied by the segment.

An operation may specify one or more scopes. Typically synchronizing operations are restricted to a single scope of visibility, whereas ordinary memory operations do not specify any scope restriction and are treated as specifying all scopes supported by the segment they access.

Ordinary memory operations are only required to be visible to other threads in relation to synchronizing operations, and are then required visible within the segment/scope instance(s) of the synchronizing operation.

Segment/scope instances

Each segment type has a sharing property for its memory regions, which can be global, component, work-group, or private for the segments that may be modified. The valid scopes vary between segments: only scopes with visibility the same as or smaller than the segment sharing are valid. The following segment/scope combinations are valid:

Global segment may use the following scopes

o Wavefront (wv)

o Work-group (wg)

o Component (cmp)

o System (sys)

Group segment may use the following scopes:

o Wavefront (wv)

o Work-group (wg)

Other segments are either not writeable or only visible to a single thread, and thus scopes are not applicable.

For example, scopes are not valid for private segments, whereas all scopes are allowed for the global segment.
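The valid segment/scope combinations above can be written as a small table; a sketch with informal segment and scope names.

```python
# Valid scopes per segment, from the lists above.
VALID_SCOPES = {
    "global": {"wv", "wg", "cmp", "sys"},
    "group": {"wv", "wg"},
    # private, kernarg, and readonly segments: scopes are not applicable.
    "private": set(),
    "kernarg": set(),
    "readonly": set(),
}

def scope_valid(segment, scope):
    return scope in VALID_SCOPES.get(segment, set())
```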

When a segment/scope combination other than global/system is used then there may be several segment/scope instances. Segment/scope instances of two different memory operations executed by either a single or different threads are shared if:

The segments of the segment/scope instances are the same and

The scopes of the segment/scope instances are the same and

o The scope is system or

o The scope is component, and the threads execute on the same component or


o The scope is work-group and the threads belong to the same work-group or

o The scope is wavefront and the threads belong to the same wavefront

For example, if work-items from different work-groups access the global segment using work-group scope they will not share instances of the global/work-group combination of segment/scope.
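The instance-sharing rules above can be expressed as a predicate. Threads are modeled here as records carrying wavefront, work-group, and component ids; the field names are illustrative, not from the specification.

```python
# Do two memory operations share a segment/scope instance?
def share_instance(segment_a, scope_a, thread_a, segment_b, scope_b, thread_b):
    if segment_a != segment_b or scope_a != scope_b:
        return False
    if scope_a == "sys":
        return True
    if scope_a == "cmp":
        return thread_a["component"] == thread_b["component"]
    if scope_a == "wg":
        return thread_a["work_group"] == thread_b["work_group"]
    if scope_a == "wv":
        return thread_a["wavefront"] == thread_b["wavefront"]
    return False
```

The example from the text falls out directly: two threads in different work-groups accessing the global segment at work-group scope do not share an instance, even though the segment and scope match.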

The component scope is worth considering carefully. Data visible at component scope will persist across HSAIL kernel dispatches and be visible to future HSAIL kernels on the same component.

3.5 Atomic operations

The following properties apply to atomic operations:

Atomic operations must use naturally aligned addresses.

Atomic loads and stores are single-copy atomic for the segment/scope instances they specify. If an atomic store is made visible to a different scope through a later release operation the single-copy atomic property is not necessarily propagated.

Atomic read-modify-write operations are indivisible in the modification order of the bytes accessed for the segment/scope instances they specify.

Relaxed atomic operations are required to become visible in finite time to the segment/scope instance they specify.

3.6 Write Serialization

In each segment/scope instance, there is a single order for each byte of all atomic stores, by any thread.

The write serialization order does not apply to ordinary stores, neither does it apply to atomic stores outside the segment/scope instance they specify.

Note that there may be multiple segment/scope instances for particular combinations of segment/scope (see section 3.4.4), e.g. work-items from two different work-groups that access the global segment using work-group scope will see two different instances of the global/work-group combination of segment/scope, and potentially two different orders of atomic stores to a single byte.

Note that this statement only specifies the allowed re-ordering of stores to a single location. It does not prohibit merging of stores, nor does it specify any order between stores to different locations.

The write serialization order applies also to the value read in atomic read-modify-write operations. See section 3.10 for a discussion on the implications of this for relaxed atomic operations.

Ordinary stores are not required to be visible in finite time, but if they are ordered before something that is required to be visible in finite time or a release fence, then it is a consequence of the ordering behavior that those ordinary stores will be visible in finite time.


3.7 Ordering of synchronizing operations

There is a single order of synchronizing operations for each instance of scope and segment combination. Note that there may be multiple segment/scope instances for particular scopes, e.g. work-items from two different work-groups that access the global segment using work-group scope will see two different instances of the global/work-group combination of segment/scope, and potentially two different orders of synchronizing operations.

The single order of synchronizing operations per segment/scope instance implies that for any two synchronizing operations A and B such that A is ordered before B in segment/scope instance S:

A >> B for the segment scope instance S.

For any operation C such that B >> C for the segment scope instance S then: A >> C for the segment/scope instance S.

For any operation D such that D >> A for the segment scope instance S then: D >> B for the segment/scope instance S.

For any operation C such that B >> C for the segment scope instance S, and any operation D such that D >> A for the segment scope instance S then: D >> C for the segment/scope instance S.

If two synchronizing operations apply to more than one segment/scope instance, then they must have the same order in all of those instances. This implies that the ordering of transactions across multiple segment/scope instances is aligned, regardless of which thread originally produced the data.

Consider a synchronizing operation F that applies to two segment/scope instances S1 and S2, as well as two further synchronizing operations A ordered before F in S1, and B ordered after F in S2. The requirement on ordering for a synchronizing operation across all segment scope instances it applies to implies that:

A >> B for segment/scope instances S1 and S2

3.8 Packet Processor Fences

In addition, the memory model defines packet memory fences that are only used by the packet processor (see section 2.7.1):

A packet memory release fence makes any global segment or image data that was stored by any thread that belonged to a dispatch that has completed the active phase on any queue of the same component visible in all the scopes shared by the packet release fence.

A packet memory acquire fence ensures any subsequent global segment or image loads by any thread that belongs to a dispatch that has not yet entered the active phase on any queue of the same component, see any data previously released at the scopes shared by the packet acquire fence.

3.9 Single thread memory operation order

Single thread memory operation order defines the order of visibility between memory operations executed by a single thread to itself and other threads.


Thread memory location address dependence order is maintained using virtual addresses only. Two accesses to the same physical address using different virtual addresses are not guaranteed to be ordered. Similarly operations with different scopes or segments are not ordered.

Transitivity for a single thread:

If memory operations X, Y and Z are performed by a single thread, then if X>>Y and Y>>Z, then X>>Z for the segment/scope instances shared by X, Y and Z.

If X appears before Y in the program execution order for a single thread, then the following hold:

If X and Y are to the same virtual address then X>>Y for the segment/scope instances shared by X and Y.

If X and Y are to the same virtual address then X>>Y for the thread that executed X and Y.

If X is an acquire operation and Y is any other memory operation to any location, then X>>Y for the segment/scope instances of X that shares segment with Y.

If X is any memory operation to any location and Y is a release operation, then X>>Y for the segment/scope instances of Y that shares segment with X.

If X and Y are synchronizing operations then X>>Y for the segment/scope instances shared by X and Y.

If one of X and Y is a fence operation and the other is an atomic operation then X>>Y for the segment/scope instances shared by X and Y.

Otherwise, X and Y can be completed in any order, as long as the following constraints are also met. These constraints affect operations with multiple threads. For the constraints, see the following:

3.5 Atomic operations

3.6 Write Serialization

3.7 Ordering of synchronizing operations

3.10 Requirement on race free programs

A data race free program is one where every execution order that satisfies all the visibility rules also satisfies the following rules. A program that is not data race free is undefined. For every memory byte location where A and B are any memory operations to the same byte location, one or more of the following must be true:

A>>B in at least one segment/scope instance

A and B are both reads

A and B are both atomic operations that share at least one segment/scope instance
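The race-freedom condition above can be written as a predicate over two operations A and B on the same byte. A sketch: `ordered` stands for "A>>B (or B>>A) in at least one segment/scope instance" and `shares_instance` for "A and B share at least one segment/scope instance".

```python
# Data-race test for two operations on the same byte, per the rules above.
def is_data_race(a, b, ordered, shares_instance):
    if ordered:
        return False                 # ordered in some instance
    if a["is_read"] and b["is_read"]:
        return False                 # both reads
    if a["is_atomic"] and b["is_atomic"] and shares_instance:
        return False                 # both atomics sharing an instance
    return True
```

A program in which any execution exhibits such a race is undefined under this model.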

3.11 Implications

The memory consistency model described above defines some common functionality by implication. This subsection discusses some implied aspects of the memory consistency model and also lists a few examples to clarify the behavior.


Relaxed atomics

Relaxed read-modify-write atomics are not part of the single order of synchronizing operations per segment/scope instance, but are bound by the single thread memory operation order and write serialization constraints. As the read-modify-write operation is indivisible the read of the read-modify-write operation must observe the write serialization order described in section 3.6, and must operate on the last value stored before the read-modify-write operation. As a consequence a number of relaxed atomic read-modify-write operations to the same location will yield a result as if they were executed in a sequence after each other. Similarly, the ordering is defined for an ordinary store followed by an atomic operation to the same address if performed by the same thread. Note that the exact sequence of operations visible may not be predictable without additional acquire or release operations.

Sequential consistency

The single order of synchronizing operations (see section 3.7) implies that synchronizing operations meet the requirements for sequential consistency within each scope/segment instance.

Transitivity of ordering between threads

The thread ordering (section 3.9), and ordering of synchronizing operations (section 3.7) allows us to infer some examples of how visibility order is transitive across threads:

X -> Y denotes that a release operation X, in some thread, synchronizes with an acquire operation Y in another (different) thread. A valid synchronization requires that X and Y have at least one segment/scope instance in common, and that X is ordered before Y in the common segment/scope instances, i.e. X>>Y for the shared segment/scope instances.

If there are two operations X, Z performed by thread A such that Z >> X, and an operation Y is performed by thread B such that X->Y then the thread ordering and ordering of synchronizing operations imply that Z >> Y in all segment/scope instances that are shared by X and Y.

If there are two operations Y, Z performed by thread B such that Y >> Z, and an operation X is performed by thread A such that X->Y then the thread ordering and ordering of synchronizing operations imply that X >> Z in all segment/scope instances that are shared by X and Y.

Allowed Reorder for Operations to Different Addresses by the Same Thread

Table 3-1 shows whether two operations by a single thread may appear in a different order to an observer in the same segment/scope instance. The table enumerates the first operation in the leftmost column, and the second operation in the header row.

Annotations for the table are:

“yes” means reorder is allowed between the first operation and the second operation.

“no” means reorder is not allowed between the first operation and the second operation.

Table 3-1 Allowed reorder for operations by the same thread to different addresses that share a segment/scope instance.

Second operation (columns):
  (1) ld_rlx or st_rlx
  (2) atomic_rlx or atomicNoRet_rlx
  (3) atomic_acq or atomicNoRet_acq
  (4) fence_acq
  (5) atomic_rel or atomicNoRet_rel or fence_rel
  (6) atomic_ar or atomicNoRet_ar or fence_ar

First operation                              (1)  (2)  (3)  (4)  (5)  (6)
ld_rlx or st_rlx                             yes  yes  yes  yes  no   no
atomic_rlx or atomicNoRet_rlx                yes  yes  yes  no   no   no
atomic_acq or atomicNoRet_acq or fence_acq   no   no   no   no   no   no
atomic_rel or atomicNoRet_rel                yes  yes  no   no   no   no
fence_rel                                    yes  no   no   no   no   no
atomic_ar or atomicNoRet_ar or fence_ar      no   no   no   no   no   no

Examples

Successful synchronization between threads

Consider two threads, A and B, in the same work-group, and two memory locations, X and Y:

group int X=0;
group int Y=0;

A:
  st_rlx X,1
  atomicNoRet(st)_rel_wg Y,1   // Force X visible in group segment, work-group scope

B:
  atomic(ld)_acq_wg Y          // Synchronize group segment, work-group scope
  ld_rlx X

If the load of Y by B synchronizes with the store to Y by A then B must see X as 1. This is an example of a successful handover of data from thread A to B.

Separate segment race

Consider two threads, A and B, in the same work-group, and two memory locations, X and Y:

global int X=0;
group int Y=0;

A:
  st_rlx X,1
  atomicNoRet(st)_rel_wg Y,1   // Does not force X visible

B:
  atomic(ld)_acq_wg Y          // Synchronize group segment, work-group scope
  ld_rlx X

As X and Y are in different segments, there is a race between threads A and B for the value of X.

Different scope synchronization race

Consider two threads, A and B, in the same work-group, and two memory locations, X and Y:

global int X=0;
global int Y=0;

A:
  st_rlx X,1
  atomicNoRet(st)_rel_wg Y,1   // Force X visible in global segment, work-group scope

B:
  atomic(ld)_acq_cmp Y         // Synchronize global segment, component scope
  ld_rlx X

As A and B access Y using different scopes there is no synchronization. Even if B sees the value of 1 in Y, the load by B of X is a data race and is undefined.

Correct synchronization, safe transitivity

Consider three threads, A, B and C, in the same work-group, and three memory locations X, Y and Z:

global int X=0;
global int Y=0;
global int Z=0;

A:
  st_rlx X,1
  atomicNoRet(st)_rel_sys Y,1  // Force X visible in global segment, system scope

B:
  atomic(ld)_acq_sys Y         // Synchronize global segment, system scope
  atomicNoRet(st)_rel_sys Z,1

C:
  atomic(ld)_acq_sys Z         // Synchronize global segment, system scope
  ld_rlx X

If the load of Y by B synchronizes with the store to Y by A, and the load of Z by C synchronizes with the store to Z by B, then C must see X as 1. This is an example of a successful handover of data from thread A to C, with transitivity implied through multiple synchronizations.

Scope/segment mismatch transitivity race

Consider three threads, A, B and C, in the same work-group, and three memory locations, X, Y and Z:

global int X=0;
global int Y=0;
global int Z=0;

A:
  st_rlx X,1
  atomicNoRet(st)_rel_sys Y,1  // Force X visible in global segment, system scope

B:
  atomic(ld)_acq_sys Y         // Synchronize global segment, system scope
  atomicNoRet(st)_rel_wg Z,1

C:
  atomic(ld)_acq_wg Z          // Synchronize global segment, work-group scope
  ld_rlx X

Even if thread B synchronizes successfully with thread A, and thread C synchronizes successfully with thread B, there is no transitivity of the visibility of X from A to C, because the synchronization between B and C uses a different segment/scope instance.

Appendix A Glossary - HSA Hardware System Architecture Specification

Acquire synchronizing operation A memory operation marked with acquire.

application global memory Memory that is to be shared between all HSA Components and the host CPUs for processing using the HSA. This corresponds to global memory as defined in the HSA Programmer’s Reference Manual.

Architected Queuing Language (AQL) A command interface for the dispatch of HSA Agent commands.

coherent Memory accesses from a set of observers to a memory location shall be coherent if accesses to that memory location by the members of the set of observers are consistent with there being a single total order of all writes to that memory location by all members of the set of observers. This single total order of all writes to that memory location shall be the coherence order for that location.

Grid A multidimensional, rectangular structure containing work-groups. A Grid is formed when a program launches a kernel.

host CPU A HSA Agent that also supports the native CPU instruction set and runs the host operating system and the HSA runtime. As a HSA Agent, the host CPU can dispatch commands to a HSA Component using memory operations to construct and enqueue AQL packets. In some systems, a host CPU can also act as a HSA Component (with appropriate HSAIL finalizer and AQL mechanisms).

HSA Agent A hardware component that participates in the HSA memory model. A HSA Agent can submit AQL packets for execution. A HSA Agent may also but is not required to be a HSA Component. It is possible for a system to include HSA Agents that are neither a HSA Component nor a host CPU.

HSA application A program written in the host CPU instruction set architecture (ISA).

HSA Component A hardware or software component that can be a target of the AQL queries and which conforms to the memory model of the HSA. By implication a HSA Component is also a HSA Agent.

HSA implementation A combination of (1) hardware components that execute one or more machine instruction set architectures (ISAs), (2) a compiler, linker, and loader, (3) a finalizer that translates HSAIL code into the appropriate native ISA if the hardware components cannot support HSAIL natively, and (4) a runtime system.

HSA Memory Management Unit (HSA MMU) A memory management unit used by HSA Components and other HSA Agents, designed to support page-granular virtual memory access with the same attributes provided by a host CPU MMU.

HSA Memory Node (HMN) A node that delineates a set of system components (host CPUs and HSA Components) with “local” access to a set of memory resources attached to the node's memory controller and appropriate HSA-compliant access attributes.

HSA Packet Processor HSA Packet Processors are tightly bound to one or more HSA Agents, and provide the user mode queue functionality for the HSA Agents. HSA Packet Processors participate in the HSA memory model and are HSA Agents.

HSAIL Heterogeneous System Architecture Intermediate Language. A virtual machine and a language. The instruction set of the HSA virtual machine that preserves virtual machine abstractions and allows for inexpensive translation to machine code.

invalid address An invalid address is a location in application global memory where an access from a HSA Component or other HSA Agent is violating system software policy established by the setup of the system page table attributes. If a HSA Component accesses an invalid address, system software shall be notified.

kernel A section of code executed in a data-parallel way by a HSA Component. Kernels are written in HSAIL and then separately translated by a finalizer to the target instruction set.

memory type A set of attributes defined by the translation tables, covering at least the following properties:

Cacheability

Data Coherency requirements

Requirements for endpoint ordering, being the order of arrival of memory accesses at the endpoint

Requirements for observation ordering, being restrictions on the observable ordering of memory accesses to different locations

Requirements for multi-copy atomicity

Permissions for speculative access

natural alignment Alignment in which a memory operation of size n bytes has an address that is an integer multiple of n. For example, naturally aligned 8-byte stores can only be to addresses 0, 8, 16, 24, 32, 40, and so forth.

packetID Each AQL packet has a 64-bit packetID that is unique within the scope of its queue. The packetID is assigned as the sequential number of the packet slot allocated in the queue.

primary memory type The memory type used by default for user processes. HSA Agents shall have a common interpretation of the data coherency (excluding accesses to read-only image data) and cacheability attributes for this type.

process address space ID (PASID) An ID used to identify the application address space within a host CPU or guest virtual machine. It is used on a device to isolate concurrent contexts residing in shared local memory.

read atomicity A condition of a load such that it must be read in its entirety. A load has read atomicity of size b if it cannot be read as fragments that are smaller than b bits.

release synchronizing operation A memory operation marked with release.

wavefront A group of work-items that share a single program counter.

write atomicity A condition of a store such that it must be written in its entirety. A store has write atomicity of size b if it cannot be broken up into fragments that are smaller than b bits.


Index

A
address, 2, 6, 9, 14, 15, 45, 52, 53
address space, 2, 6, 9, 14, 53
agent, 6, 9, 11, 13, 14, 39, 51, 52
Architected Queuing Language (AQL), 4, 18, 19, 21, 22, 23, 51
atomicity, 6, 52, 53

C
coherency domain, 3

D
dispatch, 4, 18, 27, 51

F
finalizer, 38, 52
firmware, 4

G
global memory, 7, 51, 52
global order, 38
Grid, 51

H
HSA implementation, 13, 51
HSA Memory Node (HMN), 52
HSA programming model, 3
HSAIL, 7, 10, 26, 27, 28, 38, 39, 52

I
implementation, 2, 5, 6, 10, 11, 13, 14, 22, 28, 38, 51
infrastructure, 2, 4

K
kernel, 7, 18, 19, 22, 27, 28, 51, 52

M
memory model, 2, 3, 38, 39, 51, 52
memory operation, 40, 45, 51, 53

P
packet format, 4, 14
packet processor, 23
process address space ID (PASID), 6, 53

S
sequence of operations, 39
signaling, 3, 7, 9, 12, 14, 22, 23, 24, 28
synchronization, 3, 7, 39

T
thread, 15, 39

V
virtual memory, 3, 11, 52
visibility order, 38, 39

W
wavefront, 27, 28, 53
work-item, 22, 38, 39, 45, 53