TRANSCRIPT
GUIDE TO HETEROGENEOUS SYSTEM ARCHITECTURE (HSA)
DIBYENDU DAS, PRAKASH RAGHAVENDRA | DEC 16TH 2013
OUTLINE
Introduction to HSA
Unified Memory Access
Power Management
HSA Programming Languages
Workloads
WHAT IS HSA?
[Figure: an APU (Accelerated Processing Unit) pairing serial workloads (CPU cores), parallel workloads (GPU cores), and hUMA (shared memory)]
An intelligent computing architecture that enables CPU, GPU and other processors to work in harmony on a single piece of silicon by seamlessly moving the right tasks to the best suited processing element
HSA EVOLUTION
Capabilities:
Integrate CPU and GPU in silicon
GPU can access CPU memory
Uniform memory access for CPU and GPU

Benefits:
Improved compute efficiency
Simplified data sharing
Unified power efficiency
STATE-OF-THE-ART HETEROGENEOUS PROCESSOR
Graphics processing unit (GPU):
384 AMD Radeon™ cores
Multi-threaded CPU cores
Shared Northbridge access to overlapping CPU/GPU physical address spaces
Many resources are shared between the CPU and GPU
‒ For example, the memory hierarchy, power, and thermal capacity
Accelerated processing unit (APU)
A NEW ERA OF PROCESSOR PERFORMANCE
[Figure: three charts, one per era, each plotting performance against time with a "we are here" marker]

Single-Core Era (single-thread performance over time)
‒ Enabled by: Moore's Law, voltage scaling
‒ Constrained by: power, complexity
‒ Languages: Assembly, C/C++, Java, ...

Multi-Core Era (throughput performance over time and processor count)
‒ Enabled by: Moore's Law, SMP architecture
‒ Constrained by: power, parallel software, scalability
‒ Languages: pthreads, OpenMP / TBB, ...

Heterogeneous Systems Era (modern application performance over time, through data-parallel exploitation)
‒ Enabled by: abundant data parallelism, power-efficient GPUs
‒ Temporarily constrained by: programming models, communication overhead
‒ Languages: Shader, CUDA, OpenCL, C++ AMP, ...
EVOLUTION OF HETEROGENEOUS COMPUTING

[Figure: architecture maturity and programmer accessibility, rising from poor to excellent across three eras]

Proprietary Drivers Era (2002 - 2008): graphics and proprietary driver-based APIs
‒ "Adventurous" programmers
‒ Exploit early programmable "shader cores" in the GPU
‒ Make your program look like "graphics" to the GPU
‒ CUDA™, Brook+, etc.

Standards Drivers Era (2009 - 2011): OpenCL™ and DirectCompute driver-based APIs
‒ Expert programmers
‒ C and C++ subsets
‒ Compute-centric APIs and data types
‒ Multiple address spaces with explicit data movement
‒ Specialized work-queue-based structures
‒ Kernel-mode dispatch

Architected Era (2012 - 2020): AMD Heterogeneous System Architecture, GPU as a peer processor
‒ Mainstream programmers
‒ Full C++
‒ GPU as a co-processor
‒ Unified coherent address space
‒ Task-parallel runtimes
‒ Nested data-parallel programs
‒ User-mode dispatch
‒ Pre-emption and context switching
HETEROGENEOUS PROCESSORS - EVERYWHERE: SMARTPHONES TO SUPERCOMPUTERS
Phone
Tablet
Notebook
Workstation
Dense Server
Supercomputer
A SINGLE SCALABLE ARCHITECTURE FOR THE WORLD’S PROGRAMMERS IS DEMANDED AT THIS POINT
HOW DOES HSA MAKE THIS ALL WORK?
Enables acceleration of languages like Java, C++ AMP, and Python
All processors use the same addresses, and can share data structures in place
Heterogeneous computing can use all of virtual and physical memory
Extends multi-core coherency to the GPU and other processors
Passes work quickly between the processors
Enables quality of service
HSA FOUNDATION – BUILDING THE ECOSYSTEM
HSA FOUNDATION AT LAUNCH: BORN IN JUNE 2012
Founders
HSA FOUNDATION TODAY (DECEMBER 2013): A GROWING AND POWERFUL FAMILY
Founders
Promoters
ORACLE
Supporters
Contributors
Universities: NTHU Programming Language Lab, NTHU System Software Lab, and computer science departments
Unified Memory Access
UNDERSTANDING UMA
The original meaning of UMA is Uniform Memory Access
• Refers to how processing cores in a system view and access memory
• All processing cores in a true UMA system share a single memory address space
The introduction of GPU compute created Non-Uniform Memory Access (NUMA)
• Requires data to be managed across multiple heaps with different address spaces
• Adds programming complexity due to frequent copies, synchronization, and address translation
HSA restores the GPU to Uniform Memory Access
• Heterogeneous computing replaces GPU computing
INTRODUCING hUMA
[Figure: three memory topologies. CPU (UMA): four CPU cores share one memory. APU (NUMA): CPU cores use CPU memory while GPU cores use a separate GPU memory. APU with HSA (hUMA): CPU and GPU cores share a single unified memory.]
hUMA KEY FEATURES
[Figure: CPU and GPU, each with its own cache, connected by HW coherency to shared virtual and physical memory]
Entire memory space: both the CPU and GPU can access and allocate any location in the system's virtual memory space
Coherent memory: ensures that CPU and GPU caches both see an up-to-date view of data
Pageable memory: the GPU can seamlessly access virtual memory addresses that are not (yet) present in physical memory
WITHOUT POINTERS AND DATA SHARING
[Figure: separate CPU memory and GPU memory, each holding its own copy of the data array]
Without hUMA:
• CPU explicitly copies data to GPU memory
• GPU completes the computation
• CPU explicitly copies the result back to CPU memory
Only the data array can be copied, since the GPU cannot follow embedded data-structure links.
WITH POINTERS AND DATA SHARING
[Figure: a single CPU / GPU uniform memory holding one shared copy of the data structure]
With hUMA, the CPU can pass a pointer to the entire data structure, since the GPU can now follow embedded links.
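To make the contrast concrete, here is a minimal host-side sketch. It assumes an OpenCL 2.0 capable HSA device with fine-grained shared virtual memory, plus an already-created context, queue, and kernel; the names ctx, queue, kernel and the Node layout are illustrative, not taken from the slides. Without shared memory the data must be flattened and copied; with shared memory the CPU builds a pointer-linked structure in place and simply passes the root pointer.

```cpp
// Sketch only: explicit copies vs. shared-virtual-memory pointer passing.
// Assumes an OpenCL 2.0 device with fine-grained SVM and matching 64-bit
// pointers on host and device. Error checking and kernel launch are omitted.
#include <CL/cl.h>
#include <cstddef>

struct Node { int key; Node* next; };   // pointer-linked data structure

void without_huma(cl_context ctx, cl_command_queue queue, cl_kernel kernel,
                  const int* data, std::size_t n) {
    // GPU cannot follow host pointers: flatten the data into an array and copy it.
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, n * sizeof(int), nullptr, nullptr);
    clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, n * sizeof(int), data, 0, nullptr, nullptr);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    // ... enqueue the kernel, then clEnqueueReadBuffer() to copy results back ...
    clReleaseMemObject(buf);
}

void with_huma(cl_context ctx, cl_command_queue queue, cl_kernel kernel, std::size_t n) {
    // One shared allocation, visible at the same virtual addresses to CPU and GPU.
    Node* nodes = static_cast<Node*>(clSVMAlloc(
        ctx, CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER, n * sizeof(Node), 0));
    for (std::size_t i = 0; i < n; ++i) {          // CPU builds the linked structure in place
        nodes[i].key  = static_cast<int>(i);
        nodes[i].next = (i + 1 < n) ? &nodes[i + 1] : nullptr;
    }
    clSetKernelArgSVMPointer(kernel, 0, nodes);    // pass the root pointer, no copies
    // ... enqueue the kernel; the GPU can now follow nodes[i].next links directly ...
    clSVMFree(ctx, nodes);
}
```

With coarse-grained SVM the host writes would additionally need clEnqueueSVMMap / clEnqueueSVMUnmap around them; fine-grained sharing of the kind HSA-class hardware provides removes even that step.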
hUMA FEATURES
Access to Entire Memory Space
Pageable memory
Bi-directional Coherency
Fast GPU access to system memory
Dynamic Memory Allocation
Power Management
KEY OBSERVATIONS
Applications exhibit varying degrees of CPU and GPU frequency sensitivity due to:
‒ Control divergence
‒ Interference at shared resources
‒ Performance coupling between the CPU and GPU
Efficient energy management requires metrics that can predict frequency sensitivity (power) in heterogeneous processors
Sensitivity metrics drive the coordinated setting of CPU and GPU power states
STATE-OF-THE-ART: BI-DIRECTIONAL APPLICATION POWER MANAGEMENT (BAPM)
The chip is divided into BAPM-controlled thermal entities (TEs): CU0 TE, CU1 TE, and GPU TE.
Power management algorithm:
1. Calculate a digital estimate of power consumption
2. Convert power to temperature using an RC network model for heat transfer
3. Assign new power budgets to TEs based on temperature headroom
4. TEs locally control (boost) their own DVFS states to maximize performance
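As an illustration of the four steps above (and only an illustration: the data structures, thermal limit, and proportional-headroom policy here are assumptions, not AMD's actual firmware), one BAPM-style control step could look like this:

```cpp
// Illustrative sketch of a BAPM control step. Structures, constants, and the
// budget policy are invented for illustration and do not reflect AMD firmware.
#include <cstddef>
#include <vector>

struct ThermalEntity {                 // e.g. CU0 TE, CU1 TE, GPU TE
    double estimatedPowerW = 0.0;      // 1. digital power estimate (from activity counters)
    double temperatureC    = 40.0;     // 2. output of an RC heat-transfer model (not shown)
    double budgetW         = 0.0;
    int    dvfsState       = 0;
};

const double kTempLimitC = 95.0;       // illustrative thermal limit

void bapm_step(std::vector<ThermalEntity>& tes, double totalBudgetW) {
    // 3. Assign budgets in proportion to each TE's temperature headroom.
    double totalHeadroom = 0.0;
    for (const auto& te : tes) totalHeadroom += (kTempLimitC - te.temperatureC);
    for (auto& te : tes) {
        double headroom = kTempLimitC - te.temperatureC;
        te.budgetW = (totalHeadroom > 0.0)
                         ? totalBudgetW * headroom / totalHeadroom
                         : totalBudgetW / tes.size();
        // 4. Each TE boosts or throttles its own DVFS state within its budget.
        if (te.estimatedPowerW < te.budgetW)      ++te.dvfsState;   // boost
        else if (te.estimatedPowerW > te.budgetW) --te.dvfsState;   // throttle
    }
}
```

The point of the sketch is the division of labor: a global step converts temperature headroom into per-TE power budgets, and each TE then manages its own DVFS state within that budget.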
DYNACO: RUN-TIME SYSTEM FOR COORDINATED ENERGY MANAGEMENT
GPU Frequency Sensitivity | CPU Frequency Sensitivity | Decision
High | Low  | Shift power to the GPU
High | High | Proportional power allocation
Low  | High | Shift power to the CPU
Low  | Low  | Reduce power of both the CPU and GPU

Run-time flow: Performance Metric Monitor → CPU-GPU Frequency Sensitivity Computation → CPU-GPU Power State Decision
DynaCo is implemented as a run-time software policy overlaid on top of BAPM in real hardware
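The decision table maps directly onto code. In the sketch below, frequency sensitivity is taken to mean the relative performance change per relative frequency change, estimated online by the runtime's metric monitor; the threshold value and names are assumptions for illustration, not DynaCo's actual implementation.

```cpp
// Illustrative sketch of the DynaCo decision table above. The sensitivity
// definition and the HIGH/LOW threshold are assumptions for illustration.
enum class PowerDecision { ShiftToGPU, Proportional, ShiftToCPU, ReduceBoth };

// Frequency sensitivity ~ (relative performance change) / (relative frequency
// change), estimated from the performance counters the runtime monitors.
struct Sensitivity { double cpu; double gpu; };

PowerDecision decide(const Sensitivity& s, double highThreshold = 0.5) {
    const bool cpuHigh = s.cpu >= highThreshold;
    const bool gpuHigh = s.gpu >= highThreshold;
    if (gpuHigh && !cpuHigh) return PowerDecision::ShiftToGPU;    // row 1
    if (gpuHigh &&  cpuHigh) return PowerDecision::Proportional;  // row 2
    if (!gpuHigh && cpuHigh) return PowerDecision::ShiftToCPU;    // row 3
    return PowerDecision::ReduceBoth;                             // row 4
}
```

For example, decide({0.1, 0.8}) (a CPU-insensitive, GPU-sensitive phase) returns ShiftToGPU, matching the first row of the table.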
Programming Languages
PROGRAMMING LANGUAGES PROLIFERATING ON HSA

[Figure: HSA software stack]
• Applications: OpenCL™ app, Java app, Python app, C++ AMP app
• Language runtimes: OpenCL™ runtime, Java JVM (Sumatra), Fabric Engine RT, various other runtimes
• HSA layer: HSA helper libraries, HSA core runtime, HSA finalizer, HSAIL (HSA Intermediate Language)
• Kernel Fusion Driver (KFD)
PROGRAMMING MODELS EMBRACING HSAIL AND HSA: THE RIGHT LEVEL OF ABSTRACTION
UNDER DEVELOPMENT
Java: Project Sumatra, OpenJDK 9
OpenMP from SuSE
C++ AMP, based on CLANG/LLVM
Python and KL from Fabric Engine
NEXT
DSLs: Halide, Julia, Rust
Fortran
JavaScript
Open Shading Language
R
HSAIL
HSAIL (HSA Intermediate Language) as the SW interface
‒ A virtual ISA for parallel programs
‒ Finalized to a native ISA by a finalizer/JIT
‒ Accommodates rapid innovation in native GPU architectures
‒ HSAIL expected to be stable and backward compatible across implementations
‒ Enables multiple hardware vendors to support HSA
Key design points and benefits for HSA compilers
‒ Adopt a thin finalizer approach
‒ Enables fast translation time and robustness in the finalizer
‒ Drive performance optimizations through high-level compilers (HLCs)
‒ Take advantage of the strength and compilation-time budget of HLCs for aggressive optimizations
[Figure: compilation flow]
High-level compiler flow: OpenCL™ kernel → EDG or CLANG → SPIR → LLVM → HSAIL
Finalizer flow (runtime): HSAIL → Finalizer → hardware ISA
(EDG – Edison Design Group; CLANG – LLVM front end; SPIR – Standard Portable Intermediate Representation)
HSA ENABLEMENT OF JAVA
JAVA 7 – OpenCL ENABLED APARAPI
AMD-initiated open-source project providing Java APIs for data-parallel algorithms
‒ GPU-accelerate Java applications
‒ No need to learn OpenCL™
An active community has captured mindshare
‒ ~20 contributors
‒ >7,000 downloads
‒ ~150 visits per day
[Stack: Java application → APARAPI API → JVM → CPU ISA / OpenCL™ compiler & runtime → GPU ISA → CPU and GPU]

JAVA 8 – HSA ENABLED APARAPI
Java 8 brings the Stream + Lambda APIs
‒ A more natural way of expressing data-parallel algorithms
‒ Initially targeted at multi-core
APARAPI will:
‒ Support Java 8 lambdas
‒ Dispatch code to HSA-enabled devices at runtime via HSAIL
[Stack: Java application → APARAPI + Lambda API → JVM → CPU ISA / HSAIL → HSA finalizer & runtime → CPU and GPU]

JAVA 9 – HSA ENABLED JAVA (SUMATRA)
Adds native GPU acceleration to the Java Virtual Machine (JVM)
Developer uses the JDK Lambda and Stream APIs
JVM uses the GRAAL compiler to generate HSAIL
JVM decides at runtime whether to execute on the CPU or the GPU, depending on workload characteristics
[Stack: Java application → Java JDK Stream + Lambda API → JVM with GRAAL JIT backend → CPU ISA / HSAIL → HSA finalizer & runtime → CPU and GPU]
Workloads
OVERVIEW OF B+ TREES
B+ Trees are a special case of B Trees
Fundamental data structure used in several popular database management systems
‒ SQLite
‒ CouchDB
A B+ Tree …
‒ is a dynamic, multi-level index
‒ is efficient for retrieval of data stored in a block-oriented context
The order (b) of a B+ Tree measures the capacity of its nodes
[Figure: example B+ tree with root keys 3 and 5, internal keys 2, 4, 6, 7, and leaves holding keys 1–8 with data pointers d1–d8]
HOW WE ACCELERATE
Utilize coarse-grained parallelism in B+ Tree searches
‒ Perform many queries in parallel
‒ Increase memory bandwidth utilization with parallel reads
‒ Increase throughput (transactions per second for OLTP)
B+ Tree searches on an HSA-enabled APU
‒ Allow much larger B+ Trees to be searched than with traditional GPU compute
‒ Eliminate data copies, since CPU and GPU cores can access the same memory
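A hedged C++ sketch of the idea follows; the node layout, order, and batch loop are invented for illustration and are not the workload's actual OpenCL code. Each query is an independent root-to-leaf walk, which is exactly the coarse-grained parallelism described above, and under hUMA the GPU could traverse the same pointer-linked nodes with one work-item per query and no copies.

```cpp
// Illustrative B+ tree search over a batch of queries. The node layout and
// the parallel-loop shape are assumptions for illustration only. With hUMA,
// the GPU could walk these same pointer-linked nodes, one query per work-item.
#include <cstddef>
#include <vector>

constexpr int kOrder = 4;                        // max keys per node (illustrative)

struct Node {
    int   numKeys = 0;
    int   keys[kOrder];
    bool  isLeaf  = true;
    Node* children[kOrder + 1] = {nullptr};      // inner nodes: child pointers
    long  records[kOrder];                       // leaves: record ids / payload
};

long search_one(const Node* root, int key) {
    const Node* n = root;
    while (!n->isLeaf) {                         // descend toward the covering leaf
        int i = 0;
        while (i < n->numKeys && key >= n->keys[i]) ++i;
        n = n->children[i];
    }
    for (int i = 0; i < n->numKeys; ++i)         // scan the leaf
        if (n->keys[i] == key) return n->records[i];
    return -1;                                   // not found
}

// Coarse-grained parallelism: every query is independent, so the batch maps
// naturally onto GPU work-items (or CPU threads) reading shared memory.
std::vector<long> search_batch(const Node* root, const std::vector<int>& queries) {
    std::vector<long> results(queries.size());
    for (std::size_t q = 0; q < queries.size(); ++q)   // "one work-item per query"
        results[q] = search_one(root, queries[q]);
    return results;
}
```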
RESULTS
1M search queries executed in parallel
Input B+ Tree contains 112 million keys and uses 6 GB of memory
Hardware: AMD "Kaveri" APU with a quad-core CPU and 8 GCN compute units at 35 W TDP
Software: OpenCL™ on HSA
Baseline: 4-core OpenMP + hand-tuned SSE CPU implementation
REVERSE TIME MIGRATION (RTM)
A technique for creating images based on sensor data to improve seismic interpretations done by geophysicists
A memory-intensive and highly parallel algorithm
RTM is run on massive data sets
A natural scale out algorithm
Often run today on 100K node CPU systems
Bringing this to HSA- and APU-based supercomputing will increase performance for current sensor arrays, and allow more sensors and greater accuracy in the future.
[Photos: marine crews and land crews acquiring seismic data]
HOWEVER, SPEED OF PROCESSING AND INTERPRETATION IS A CRITICAL BOTTLENECK IN MAKING FULL USE OF ACQUISITION ASSETS
TEXT ANALYTICS – HADOOP TERASORT AND BIG DATA SEARCH
MINING BIG DATA
Multi-stage pipeline or parallel processing stages
Traditional GPU Compute is challenged by copies
APU with HSA accelerates each stage in place
‒ Sort
‒ Compression
‒ Regular expression parsing
‒ CRC generation
Acceleration of large data search scales out across the cluster of APU nodes
[Figure: Hadoop MapReduce pipeline: input splits from HDFS (Hadoop Distributed File System) feed map tasks; intermediate data is sorted, copied, and merged; reduce tasks write output parts back to HDFS with replication]
DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners.
BACKUP
Programming Tools
AMD CODEXL V1.3
AMD’s comprehensive heterogeneous developer tool suite, including:
‒ CPU and GPU profiling
‒ GPU kernel debugging
‒ GPU kernel analysis
New features in version 1.3:
‒ Supports Java
‒ Integrated static kernel analysis
‒ Remote debugging/profiling
‒ Supports the latest AMD APU and GPU products
OPEN SOURCE LIBRARIES ACCELERATED BY AMD
OpenCV
Most popular computer vision library
Now with many OpenCL™ accelerated functions
Bolt
C++ template library providing GPU offload for common data-parallel algorithms (a minimal usage sketch appears at the end of this section)
Now with cross-OS support and improved performance and functionality
clMath
AMD released APPML as open source to create clMath
Accelerated BLAS and FFT libraries
Accessible from Fortran, C and C++
Aparapi
OpenCL™ accelerated Java 7
Java APIs for data-parallel algorithms (no need to learn OpenCL™)
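As referenced in the Bolt entry above, here is a minimal usage sketch. It follows Bolt's documented C++ API (bolt::cl::sort with STL-style iterators); header paths and behavior may differ across Bolt versions, so treat it as a sketch rather than a verified build.

```cpp
// Minimal Bolt usage sketch: sorting a std::vector with the GPU-offloaded
// bolt::cl::sort. Header and namespace follow Bolt's documented C++ API.
#include <bolt/cl/sort.h>
#include <cstdlib>
#include <vector>

int main() {
    std::vector<int> data(1 << 20);
    for (auto& x : data) x = std::rand();

    // Same call shape as std::sort; Bolt decides how to offload the work.
    bolt::cl::sort(data.begin(), data.end());
    return 0;
}
```

The call shape mirrors std::sort, which is Bolt's design point: existing C++ code keeps its structure while the library handles the data-parallel offload.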