TRANSCRIPT
GUIDE TO HETEROGENEOUS SYSTEM ARCHITECTURE (HSA)
DIBYENDU DAS, PRAKASH RAGHAVENDRA | DEC 16TH 2013
OUTLINE
Introduction to HSA
Unified Memory Access
Power Management
HSA Programming Languages
Workloads
WHAT IS HSA?
[Figure: an APU (Accelerated Processing Unit) pairing serial workloads (CPU cores), parallel workloads (GPU cores), and hUMA (shared memory)]
An intelligent computing architecture that enables CPU, GPU and other processors to work in harmony on a single piece of silicon by seamlessly moving the right tasks to the best suited processing element
HSA EVOLUTION
Capabilities:
Integrate CPU and GPU in silicon
GPU can access CPU memory
Uniform memory access for CPU and GPU

Benefits:
Improved compute efficiency
Simplified data sharing
Unified power efficiency
STATE-OF-THE-ART HETEROGENEOUS PROCESSOR
Graphics processing unit (GPU):
384 AMD Radeon™ cores
Multi-threaded CPU cores
Shared Northbridge access to overlapping CPU/GPU physical address spaces
Many resources are shared between the CPU and GPU
‒ For example, the memory hierarchy, power, and thermal capacity
Accelerated processing unit (APU)
A NEW ERA OF PROCESSOR PERFORMANCE
[Figure: three charts, one per era, each plotting performance against time with a "we are here" marker]

Single-Core Era (single-thread performance over time)
‒ Enabled by: Moore's Law, voltage scaling
‒ Constrained by: power, complexity
‒ Languages: Assembly, C/C++, Java, ...

Multi-Core Era (throughput performance over time and processor count)
‒ Enabled by: Moore's Law, SMP architecture
‒ Constrained by: power, parallel software, scalability
‒ Languages: pthreads, OpenMP / TBB, ...

Heterogeneous Systems Era (modern application performance over time, through data-parallel exploitation)
‒ Enabled by: abundant data parallelism, power-efficient GPUs
‒ Temporarily constrained by: programming models, communication overhead
‒ Languages: Shader, CUDA, OpenCL, C++ AMP, ...
EVOLUTION OF HETEROGENEOUS COMPUTING

[Figure: architecture maturity and programmer accessibility, rising from poor to excellent across three eras]

Proprietary Drivers Era (2002 - 2008): graphics and proprietary driver-based APIs
‒ "Adventurous" programmers
‒ Exploit early programmable "shader cores" in the GPU
‒ Make your program look like "graphics" to the GPU
‒ CUDA™, Brook+, etc.

Standards Drivers Era (2009 - 2011): OpenCL™ and DirectCompute driver-based APIs
‒ Expert programmers
‒ C and C++ subsets
‒ Compute-centric APIs and data types
‒ Multiple address spaces with explicit data movement
‒ Specialized work-queue-based structures
‒ Kernel-mode dispatch

Architected Era (2012 - 2020): AMD Heterogeneous System Architecture, GPU as a peer processor
‒ Mainstream programmers
‒ Full C++
‒ GPU as a co-processor
‒ Unified coherent address space
‒ Task-parallel runtimes
‒ Nested data-parallel programs
‒ User-mode dispatch
‒ Pre-emption and context switching
HETEROGENEOUS PROCESSORS - EVERYWHERE: SMARTPHONES TO SUPERCOMPUTERS
Phone
Tablet
Notebook
Workstation
Dense Server
Supercomputer
A SINGLE SCALABLE ARCHITECTURE FOR THE WORLD’S PROGRAMMERS IS DEMANDED AT THIS POINT
HOW DOES HSA MAKE THIS ALL WORK?
Enables acceleration of languages like Java, C++ AMP, and Python
All processors use the same addresses, and can share data structures in place
Heterogeneous computing can use all of virtual and physical memory
Extends multi-core coherency to the GPU and other processors
Passes work quickly between the processors
Enables quality of service
HSA FOUNDATION – BUILDING THE ECOSYSTEM
HSA FOUNDATION AT LAUNCH: BORN IN JUNE 2012
Founders
HSA FOUNDATION TODAY (DECEMBER 2013): A GROWING AND POWERFUL FAMILY
Founders
Promoters
ORACLE
Supporters
Contributors
Universities: NTHU Programming Language Lab, NTHU System Software Lab, and computer science departments
Unified Memory Access
UNDERSTANDING UMA
The original meaning of UMA is Uniform Memory Access
• Refers to how processing cores in a system view and access memory
• All processing cores in a true UMA system share a single memory address space
The introduction of GPU compute created Non-Uniform Memory Access (NUMA)
• Requires data to be managed across multiple heaps with different address spaces
• Adds programming complexity due to frequent copies, synchronization, and address translation
HSA restores the GPU to Uniform Memory Access
• Heterogeneous computing replaces GPU computing
INTRODUCING hUMA
[Figure: three memory topologies. CPU (UMA): four CPU cores share one memory. APU (NUMA): CPU cores use CPU memory while GPU cores use a separate GPU memory. APU with HSA (hUMA): CPU and GPU cores share a single unified memory.]
hUMA KEY FEATURES
[Figure: CPU and GPU, each with its own cache, connected by HW coherency to shared virtual and physical memory]
Entire memory space: both the CPU and GPU can access and allocate any location in the system's virtual memory space
Coherent memory: ensures that CPU and GPU caches both see an up-to-date view of data
Pageable memory: the GPU can seamlessly access virtual memory addresses that are not (yet) present in physical memory
WITHOUT POINTERS AND DATA SHARING
[Figure: separate CPU memory and GPU memory, each holding its own copy of the data array]
Without hUMA:
• CPU explicitly copies data to GPU memory
• GPU completes the computation
• CPU explicitly copies the result back to CPU memory
Only the data array can be copied, since the GPU cannot follow embedded data-structure links.
WITH POINTERS AND DATA SHARING
[Figure: a single CPU / GPU uniform memory holding one shared copy of the data structure]
With hUMA, the CPU can pass a pointer to the entire data structure, since the GPU can now follow embedded links.
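To make the contrast concrete, here is a minimal host-side sketch. It assumes an OpenCL 2.0 capable HSA device with fine-grained shared virtual memory, plus an already-created context, queue, and kernel; the names ctx, queue, kernel and the Node layout are illustrative, not taken from the slides. Without shared memory the data must be flattened and copied; with shared memory the CPU builds a pointer-linked structure in place and simply passes the root pointer.

```cpp
// Sketch only: explicit copies vs. shared-virtual-memory pointer passing.
// Assumes an OpenCL 2.0 device with fine-grained SVM and matching 64-bit
// pointers on host and device. Error checking and kernel launch are omitted.
#include <CL/cl.h>
#include <cstddef>

struct Node { int key; Node* next; };   // pointer-linked data structure

void without_huma(cl_context ctx, cl_command_queue queue, cl_kernel kernel,
                  const int* data, std::size_t n) {
    // GPU cannot follow host pointers: flatten the data into an array and copy it.
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, n * sizeof(int), nullptr, nullptr);
    clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, n * sizeof(int), data, 0, nullptr, nullptr);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    // ... enqueue the kernel, then clEnqueueReadBuffer() to copy results back ...
    clReleaseMemObject(buf);
}

void with_huma(cl_context ctx, cl_command_queue queue, cl_kernel kernel, std::size_t n) {
    // One shared allocation, visible at the same virtual addresses to CPU and GPU.
    Node* nodes = static_cast<Node*>(clSVMAlloc(
        ctx, CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER, n * sizeof(Node), 0));
    for (std::size_t i = 0; i < n; ++i) {          // CPU builds the linked structure in place
        nodes[i].key  = static_cast<int>(i);
        nodes[i].next = (i + 1 < n) ? &nodes[i + 1] : nullptr;
    }
    clSetKernelArgSVMPointer(kernel, 0, nodes);    // pass the root pointer, no copies
    // ... enqueue the kernel; the GPU can now follow nodes[i].next links directly ...
    clSVMFree(ctx, nodes);
}
```

With coarse-grained SVM the host writes would additionally need clEnqueueSVMMap / clEnqueueSVMUnmap around them; fine-grained sharing of the kind HSA-class hardware provides removes even that step.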
hUMA FEATURES
Access to Entire Memory Space
Pageable memory
Bi-directional Coherency
Fast GPU access to system memory
Dynamic Memory Allocation
Power Management
KEY OBSERVATIONS
Applications exhibit varying degrees of CPU and GPU frequency sensitivity due to:
‒ Control divergence
‒ Interference at shared resources
‒ Performance coupling between the CPU and GPU
Efficient energy management requires metrics that can predict frequency sensitivity (power) in heterogeneous processors
Sensitivity metrics drive the coordinated setting of CPU and GPU power states
STATE-OF-THE-ART: BI-DIRECTIONAL APPLICATION POWER MANAGEMENT (BAPM)
The chip is divided into BAPM-controlled thermal entities (TEs): CU0 TE, CU1 TE, and GPU TE.
Power management algorithm:
1. Calculate a digital estimate of power consumption
2. Convert power to temperature using an RC network model for heat transfer
3. Assign new power budgets to TEs based on temperature headroom
4. TEs locally control (boost) their own DVFS states to maximize performance
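As an illustration of the four steps above (and only an illustration: the data structures, thermal limit, and proportional-headroom policy here are assumptions, not AMD's actual firmware), one BAPM-style control step could look like this:

```cpp
// Illustrative sketch of a BAPM control step. Structures, constants, and the
// budget policy are invented for illustration and do not reflect AMD firmware.
#include <cstddef>
#include <vector>

struct ThermalEntity {                 // e.g. CU0 TE, CU1 TE, GPU TE
    double estimatedPowerW = 0.0;      // 1. digital power estimate (from activity counters)
    double temperatureC    = 40.0;     // 2. output of an RC heat-transfer model (not shown)
    double budgetW         = 0.0;
    int    dvfsState       = 0;
};

const double kTempLimitC = 95.0;       // illustrative thermal limit

void bapm_step(std::vector<ThermalEntity>& tes, double totalBudgetW) {
    // 3. Assign budgets in proportion to each TE's temperature headroom.
    double totalHeadroom = 0.0;
    for (const auto& te : tes) totalHeadroom += (kTempLimitC - te.temperatureC);
    for (auto& te : tes) {
        double headroom = kTempLimitC - te.temperatureC;
        te.budgetW = (totalHeadroom > 0.0)
                         ? totalBudgetW * headroom / totalHeadroom
                         : totalBudgetW / tes.size();
        // 4. Each TE boosts or throttles its own DVFS state within its budget.
        if (te.estimatedPowerW < te.budgetW)      ++te.dvfsState;   // boost
        else if (te.estimatedPowerW > te.budgetW) --te.dvfsState;   // throttle
    }
}
```

The point of the sketch is the division of labor: a global step converts temperature headroom into per-TE power budgets, and each TE then manages its own DVFS state within that budget.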
DYNACO: RUN-TIME SYSTEM FOR COORDINATED ENERGY MANAGEMENT
GPU Frequency Sensitivity | CPU Frequency Sensitivity | Decision
High | Low  | Shift power to the GPU
High | High | Proportional power allocation
Low  | High | Shift power to the CPU
Low  | Low  | Reduce power of both the CPU and GPU

Run-time flow: Performance Metric Monitor → CPU-GPU Frequency Sensitivity Computation → CPU-GPU Power State Decision
DynaCo is implemented as a run-time software policy overlaid on top of BAPM in real hardware
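The decision table maps directly onto code. In the sketch below, frequency sensitivity is taken to mean the relative performance change per relative frequency change, estimated online by the runtime's metric monitor; the threshold value and names are assumptions for illustration, not DynaCo's actual implementation.

```cpp
// Illustrative sketch of the DynaCo decision table above. The sensitivity
// definition and the HIGH/LOW threshold are assumptions for illustration.
enum class PowerDecision { ShiftToGPU, Proportional, ShiftToCPU, ReduceBoth };

// Frequency sensitivity ~ (relative performance change) / (relative frequency
// change), estimated from the performance counters the runtime monitors.
struct Sensitivity { double cpu; double gpu; };

PowerDecision decide(const Sensitivity& s, double highThreshold = 0.5) {
    const bool cpuHigh = s.cpu >= highThreshold;
    const bool gpuHigh = s.gpu >= highThreshold;
    if (gpuHigh && !cpuHigh) return PowerDecision::ShiftToGPU;    // row 1
    if (gpuHigh &&  cpuHigh) return PowerDecision::Proportional;  // row 2
    if (!gpuHigh && cpuHigh) return PowerDecision::ShiftToCPU;    // row 3
    return PowerDecision::ReduceBoth;                             // row 4
}
```

For example, decide({0.1, 0.8}) (a CPU-insensitive, GPU-sensitive phase) returns ShiftToGPU, matching the first row of the table.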
Programming Languages
PROGRAMMING LANGUAGES PROLIFERATING ON HSA

[Figure: HSA software stack]
• Applications: OpenCL™ app, Java app, Python app, C++ AMP app
• Language runtimes: OpenCL™ runtime, Java JVM (Sumatra), Fabric Engine RT, various other runtimes
• HSA layer: HSA helper libraries, HSA core runtime, HSA finalizer, HSAIL (HSA Intermediate Language)
• Kernel Fusion Driver (KFD)
PROGRAMMING MODELS EMBRACING HSAIL AND HSA: THE RIGHT LEVEL OF ABSTRACTION
UNDER DEVELOPMENT
Java: Project Sumatra, OpenJDK 9
OpenMP from SuSE
C++ AMP, based on CLANG/LLVM
Python and KL from Fabric Engine
NEXT
DSLs: Halide, Julia, Rust
Fortran
JavaScript
Open Shading Language
R
HSAIL
HSAIL (HSA Intermediate Language) as the SW interface
‒ A virtual ISA for parallel programs
‒ Finalized to a native ISA by a finalizer/JIT
‒ Accommodates rapid innovation in native GPU architectures
‒ HSAIL expected to be stable and backward compatible across implementations
‒ Enables multiple hardware vendors to support HSA
Key design points and benefits for HSA compilers
‒ Adopt a thin finalizer approach
‒ Enables fast translation time and robustness in the finalizer
‒ Drive performance optimizations through high-level compilers (HLCs)
‒ Take advantage of the strength and compilation-time budget of HLCs for aggressive optimizations
[Figure: compilation flow]
High-level compiler flow: OpenCL™ kernel → EDG or CLANG → SPIR → LLVM → HSAIL
Finalizer flow (runtime): HSAIL → Finalizer → hardware ISA
(EDG – Edison Design Group; CLANG – LLVM front end; SPIR – Standard Portable Intermediate Representation)
HSA ENABLEMENT OF JAVA
JAVA 7 – OpenCL ENABLED APARAPI
AMD-initiated open-source project providing Java APIs for data-parallel algorithms
‒ GPU-accelerate Java applications
‒ No need to learn OpenCL™
An active community has captured mindshare
‒ ~20 contributors
‒ >7,000 downloads
‒ ~150 visits per day
[Stack: Java application → APARAPI API → JVM → CPU ISA / OpenCL™ compiler & runtime → GPU ISA → CPU and GPU]

JAVA 8 – HSA ENABLED APARAPI
Java 8 brings the Stream + Lambda APIs
‒ A more natural way of expressing data-parallel algorithms
‒ Initially targeted at multi-core
APARAPI will:
‒ Support Java 8 lambdas
‒ Dispatch code to HSA-enabled devices at runtime via HSAIL
[Stack: Java application → APARAPI + Lambda API → JVM → CPU ISA / HSAIL → HSA finalizer & runtime → CPU and GPU]

JAVA 9 – HSA ENABLED JAVA (SUMATRA)
Adds native GPU acceleration to the Java Virtual Machine (JVM)
Developer uses the JDK Lambda and Stream APIs
JVM uses the GRAAL compiler to generate HSAIL
JVM decides at runtime whether to execute on the CPU or the GPU, depending on workload characteristics
[Stack: Java application → Java JDK Stream + Lambda API → JVM with GRAAL JIT backend → CPU ISA / HSAIL → HSA finalizer & runtime → CPU and GPU]
Workloads
OVERVIEW OF B+ TREES
B+ Trees are a special case of B Trees
Fundamental data structure used in several popular database management systems
‒ SQLite
‒ CouchDB
A B+ Tree …
‒ is a dynamic, multi-level index
‒ is efficient for retrieval of data stored in a block-oriented context
The order (b) of a B+ Tree measures the capacity of its nodes
[Figure: example B+ tree with root keys 3 and 5, internal keys 2, 4, 6, 7, and leaves holding keys 1–8 with data pointers d1–d8]
HOW WE ACCELERATE
Utilize coarse-grained parallelism in B+ Tree searches
‒ Perform many queries in parallel
‒ Increase memory bandwidth utilization with parallel reads
‒ Increase throughput (transactions per second for OLTP)
B+ Tree searches on an HSA-enabled APU
‒ Allow much larger B+ Trees to be searched than with traditional GPU compute
‒ Eliminate data copies, since CPU and GPU cores can access the same memory
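A hedged C++ sketch of the idea follows; the node layout, order, and batch loop are invented for illustration and are not the workload's actual OpenCL code. Each query is an independent root-to-leaf walk, which is exactly the coarse-grained parallelism described above, and under hUMA the GPU could traverse the same pointer-linked nodes with one work-item per query and no copies.

```cpp
// Illustrative B+ tree search over a batch of queries. The node layout and
// the parallel-loop shape are assumptions for illustration only. With hUMA,
// the GPU could walk these same pointer-linked nodes, one query per work-item.
#include <cstddef>
#include <vector>

constexpr int kOrder = 4;                        // max keys per node (illustrative)

struct Node {
    int   numKeys = 0;
    int   keys[kOrder];
    bool  isLeaf  = true;
    Node* children[kOrder + 1] = {nullptr};      // inner nodes: child pointers
    long  records[kOrder];                       // leaves: record ids / payload
};

long search_one(const Node* root, int key) {
    const Node* n = root;
    while (!n->isLeaf) {                         // descend toward the covering leaf
        int i = 0;
        while (i < n->numKeys && key >= n->keys[i]) ++i;
        n = n->children[i];
    }
    for (int i = 0; i < n->numKeys; ++i)         // scan the leaf
        if (n->keys[i] == key) return n->records[i];
    return -1;                                   // not found
}

// Coarse-grained parallelism: every query is independent, so the batch maps
// naturally onto GPU work-items (or CPU threads) reading shared memory.
std::vector<long> search_batch(const Node* root, const std::vector<int>& queries) {
    std::vector<long> results(queries.size());
    for (std::size_t q = 0; q < queries.size(); ++q)   // "one work-item per query"
        results[q] = search_one(root, queries[q]);
    return results;
}
```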
RESULTS
1M search queries executed in parallel
Input B+ Tree contains 112 million keys and uses 6 GB of memory
Hardware: AMD "Kaveri" APU with a quad-core CPU and 8 GCN compute units at 35 W TDP
Software: OpenCL™ on HSA
Baseline: 4-core OpenMP + hand-tuned SSE CPU implementation
REVERSE TIME MIGRATION (RTM)
A technique for creating images based on sensor data to improve seismic interpretations done by geophysicists
A memory-intensive and highly parallel algorithm
RTM is run on massive data sets
A natural scale out algorithm
Often run today on 100K node CPU systems
Bringing this to HSA- and APU-based supercomputing will increase performance for current sensor arrays, and allow more sensors and greater accuracy in the future.
[Photos: marine crews and land crews acquiring seismic data]
HOWEVER, SPEED OF PROCESSING AND INTERPRETATION IS A CRITICAL BOTTLENECK IN MAKING FULL USE OF ACQUISITION ASSETS
TEXT ANALYTICS – HADOOP TERASORT AND BIG DATA SEARCH
MINING BIG DATA
Multi-stage pipeline or parallel processing stages
Traditional GPU Compute is challenged by copies
APU with HSA accelerates each stage in place
‒ Sort
‒ Compression
‒ Regular expression parsing
‒ CRC generation
Acceleration of large data search scales out across the cluster of APU nodes
[Figure: Hadoop MapReduce pipeline: input splits from HDFS (Hadoop Distributed File System) feed map tasks; intermediate data is sorted, copied, and merged; reduce tasks write output parts back to HDFS with replication]
DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners.
BACKUP
Programming Tools
AMD CODEXL V1.3
AMD’s comprehensive heterogeneous developer tool suite, including:
‒ CPU and GPU profiling
‒ GPU kernel debugging
‒ GPU kernel analysis
New features in version 1.3:
‒ Supports Java
‒ Integrated static kernel analysis
‒ Remote debugging/profiling
‒ Supports the latest AMD APU and GPU products
OPEN SOURCE LIBRARIES ACCELERATED BY AMD
OpenCV
Most popular computer vision library
Now with many OpenCL™ accelerated functions
Bolt
C++ template library providing GPU offload for common data-parallel algorithms (a minimal usage sketch appears at the end of this section)
Now with cross-OS support and improved performance and functionality
clMath
AMD released APPML as open source to create clMath
Accelerated BLAS and FFT libraries
Accessible from Fortran, C and C++
Aparapi
OpenCL™ accelerated Java 7
Java APIs for data-parallel algorithms (no need to learn OpenCL™)
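As referenced in the Bolt entry above, here is a minimal usage sketch. It follows Bolt's documented C++ API (bolt::cl::sort with STL-style iterators); header paths and behavior may differ across Bolt versions, so treat it as a sketch rather than a verified build.

```cpp
// Minimal Bolt usage sketch: sorting a std::vector with the GPU-offloaded
// bolt::cl::sort. Header and namespace follow Bolt's documented C++ API.
#include <bolt/cl/sort.h>
#include <cstdlib>
#include <vector>

int main() {
    std::vector<int> data(1 << 20);
    for (auto& x : data) x = std::rand();

    // Same call shape as std::sort; Bolt decides how to offload the work.
    bolt::cl::sort(data.begin(), data.end());
    return 0;
}
```

The call shape mirrors std::sort, which is Bolt's design point: existing C++ code keeps its structure while the library handles the data-parallel offload.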