uniform abstractions for heterogeneous parallel systems · rakesh komuravelli, sarita adve and sasa...

Uniform Abstractions for Heterogeneous Parallel Systems

Vikram Adve

With:

Maria Kotsifakou, Prakalp Srivastava, Adel Ejjeh, Hashim Sharif, Matt Sinclair, Rakesh Komuravelli, Sarita Adve and Sasa Misailovic

University of Illinois at Urbana-Champaign

Supported by: NSF, SRC, DARPA, Intel

A Modern Mobile SoC

[Figure: block diagram of a mobile SoC. CPUs with L1 caches, an L2 cache and vector units; GPU; DSPs; multimedia and A/V hardware accelerators; modem; GPS; all attached through an interconnect to main memory.]

• Different parallelism models

• Incompatible memory systems

• Different hardware ISAs

And different SoCs have different combinations of such hardware!

Key to Programmability:

Common abstractions for heterogeneous parallel hardware

Interface Levels and Key Benefit

Hardware targets: CPUs + vector SIMD units, GPU, DSP, FPGA, domain-specific accelerators, …

Interface levels, from highest to lowest, with examples and the key benefit of each:

• Domain-specific prog. language (TensorFlow, MXNet, Halide, …): App. productivity

• General-purpose prog. language (CUDA, OpenCL, OpenACC, OpenMP, Python, Julia): App. performance

• Language-level compiler IR (Delite DSL IR, DLVM, TVM, …): Language innovation

• Language-neutral compiler IR (Delite IR, HPVM, MLIR): Compiler investment

• Virtual ISA (IBM AS/400, PTX, SPIR-V, HSAIL, HPVM): Object-code portability

• "Hardware" ISA (GPU ISAs, SIMD ISAs, TPU, domain-specific accelerators, …): Hardware innovation



Which Interface Levels Can Be Uniform?

Hardware targets: CPUs + vector SIMD units, GPU, DSP, FPGA, domain-specific accelerators, …

• Domain-specific prog. languages, general-purpose languages (OpenMP, Python, Julia, …) and language-level compiler IRs (Delite DSL IR, XLA IR, TVM, …): too diverse to define a uniform interface

• Virtual ISA (IBM AS/400, PTX, SPIR, HSAIL, HPVM): much more uniform

• "Hardware" ISA (GPU ISAs, SIMD ISAs, TPU, domain-specific accelerators, …): also too diverse …



What Should the Interface Enable?

• Uniform parallel abstraction for diverse hardware

• Aggressive compiler optimizations

• Vendor-provided back ends

• Use of target-specific low-level libraries: MKL, cuDNN, …

• Partitioning, static scheduling, dynamic scheduling

• Application-guided error vs. energy vs. performance tradeoffs

• H/w-agnostic HLS for FPGAs, ASICs

• Application-driven software + hardware specialization

• Mechanized formal verification of designs

WITHIN REACH: 2-5 Yrs

10 Yr. GOALS

The HPVM Program Representation

• A common parallel abstraction

• Compiler IR + Virtual ISA + Run-time scheduling

Kotsifakou et al., PPoPP 2018

Goal: Programmability for Heterogeneous Parallel Systems

• Mobile phone SoCs

• Supercomputers

• Cloud with accelerators

Key to Programmability:

Common abstractions for heterogeneous parallel hardware

Heterogeneous Parallel Virtual Machine

HPVM: IR and Tools

Use HPVM for:

1. Portable object code

2. Retargetable parallel compiler IR and system

3. Run-time scheduling

[Figure: HPVM tool flow. Front ends (C+HPVM, Keras, TensorFlow, Halide, other DSLs) feed HPVM (virtual ISA, runtime scheduler); translators then target CPUs + vector SIMD units, GPU, DSP, FPGA, domain-specific accelerators, …]

HPVM Abstraction of Parallel Computation

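Hierarchical Dataflow Graph with side effects

[Figure: a hierarchical dataflow graph; a node holds either a nested graph or leaf code, which may use vector instructions such as:]

VA = load <4 x float>* A
VB = load <4 x float>* B
…
VC = fmul <4 x float> VA, VB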

• Graph nodes – coarse-grain or fine-grain computational tasks

• Graph edges – explicit data transfer between nodes

• Loads and stores – implicit communication via shared memory

• Hierarchical – multiple levels of nested parallelism

[Figure: node instantiation. A node marked [N] in the static dataflow graph expands into instances 1, 2, …, N in the dynamic dataflow graph.]

✓ Graph structure – coarse-grain task parallelism, streams, pipelines

✓ Graph hierarchy – nested parallelism

✓ Node instantiation – captures SPMD-style data parallelism

✓ Vector instructions in leaf nodes – fine-grain vector parallelism

✓ Supports high-level optimizations

✓ Captures FPGAs, some semi-custom hardware

N different parallelism models → single unified parallelism model!
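To make the graph abstraction concrete, here is a minimal Python sketch of a hierarchical dataflow graph with node instantiation. It is not HPVM's actual representation or API: the `Node` class, `topo_order`, `run` and `dynamic_leaf_instances` are hypothetical names invented for illustration, and instances run sequentially here where a real system could run them in parallel.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical graph node: a leaf carries code, an internal node carries a child graph."""
    name: str
    instances: int = 1                      # static replication factor, the "[N]" on a node
    body: Optional[Callable[[int, dict], None]] = None  # leaf code: (instance index, memory)
    children: List["Node"] = field(default_factory=list)
    edges: List[Tuple[str, str]] = field(default_factory=list)  # (producer, consumer) names

    @property
    def is_leaf(self) -> bool:
        return self.body is not None


def topo_order(node: Node) -> List[Node]:
    """Order the children of an internal node so producers precede consumers."""
    by_name = {c.name: c for c in node.children}
    indegree = {name: 0 for name in by_name}
    for _, dst in node.edges:
        indegree[dst] += 1
    ready = [name for name, deg in indegree.items() if deg == 0]
    order = []
    while ready:
        cur = ready.pop(0)
        order.append(by_name[cur])
        for src, dst in node.edges:
            if src == cur:
                indegree[dst] -= 1
                if indegree[dst] == 0:
                    ready.append(dst)
    if len(order) != len(by_name):
        raise ValueError("dataflow graph contains a cycle")
    return order


def run(node: Node, memory: dict) -> None:
    """Execute every dynamic instance of a node (sequentially, for simplicity)."""
    for idx in range(node.instances):
        if node.is_leaf:
            node.body(idx, memory)          # loads/stores go through shared memory
        else:
            for child in topo_order(node):
                run(child, memory)


def dynamic_leaf_instances(node: Node, outer: int = 1) -> int:
    """Count leaf instances in the dynamic graph implied by the static one."""
    total = outer * node.instances
    if node.is_leaf:
        return total
    return sum(dynamic_leaf_instances(c, total) for c in node.children)


def multiply(i: int, mem: dict) -> None:
    mem["C"][i] = mem["A"][i] * mem["B"][i]


def reduce_sum(i: int, mem: dict) -> None:
    mem["total"] = sum(mem["C"])


memory = {"A": [1, 2, 3, 4], "B": [5, 6, 7, 8], "C": [0, 0, 0, 0], "total": 0}
root = Node(
    "root",
    children=[Node("reduce", body=reduce_sum), Node("mul", instances=4, body=multiply)],
    edges=[("mul", "reduce")],              # the edge, not list order, fixes execution order
)
run(root, memory)
```

The `mul` node is written once but instantiated four times, which is the SPMD-style data parallelism on the slide; the edge from `mul` to `reduce` expresses the task-level dependence.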

HPVM Abstractions

✓ Pipelined (task) parallelism with streaming input images

✓ Medium-grain data parallelism within pipeline stages

✓ Fine-grain data parallelism in most stages

E.g., Edge Detection in Images

HPVM Compiler Optimizations

Complex optimizations as simple graph transforms

• Graph node “tiling” for memory hierarchy

• Graph node merging

• Graph pipelining

• Graph partitioning and mapping

• (Future) Graph-based loop optimizations
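As an illustration of an optimization expressed as a graph transform, the following Python sketch merges a producer node and its consumer into one fused node. It assumes a toy model (single-input, single-output nodes in a linear chain) and hypothetical names (`merge_nodes`, `run_chain`); it is not the HPVM optimizer's implementation.

```python
from typing import Callable, Dict, List, Tuple

Nodes = Dict[str, Callable[[float], float]]
Edges = List[Tuple[str, str]]


def merge_nodes(nodes: Nodes, edges: Edges, producer: str, consumer: str) -> Tuple[Nodes, Edges]:
    """Return a new graph in which `producer` and `consumer` are fused into one node."""
    if (producer, consumer) not in edges:
        raise ValueError("nodes are not connected by an edge")
    # In this toy model the merge is only legal if the edge is the sole link on both sides.
    if any(src == producer and dst != consumer for src, dst in edges):
        raise ValueError("producer has other consumers")
    if any(dst == consumer and src != producer for src, dst in edges):
        raise ValueError("consumer has other producers")

    f, g = nodes[producer], nodes[consumer]
    fused = f"{producer}+{consumer}"
    new_nodes = {n: fn for n, fn in nodes.items() if n not in (producer, consumer)}
    new_nodes[fused] = lambda x: g(f(x))    # the intermediate value never leaves the node

    def rename(n: str) -> str:
        return fused if n in (producer, consumer) else n

    new_edges = [(rename(s), rename(d)) for s, d in edges if (s, d) != (producer, consumer)]
    return new_nodes, new_edges


def run_chain(nodes: Nodes, edges: Edges, x: float) -> float:
    """Evaluate a linear chain of nodes starting from the node with no incoming edge."""
    succ = dict(edges)
    targets = set(succ.values())
    current = next(n for n in nodes if n not in targets)
    while True:
        x = nodes[current](x)
        if current not in succ:
            return x
        current = succ[current]


nodes = {"scale": lambda x: 2 * x, "bias": lambda x: x + 1, "square": lambda x: x * x}
edges = [("scale", "bias"), ("bias", "square")]
fused_nodes, fused_edges = merge_nodes(nodes, edges, "scale", "bias")
```

The transform only rewrites nodes and edges, and the fused graph computes the same result as the original, which is the sense in which a complex optimization becomes a simple graph rewrite.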

Code Generation Strategy – Overview

[Figure: compilation flow split between developer site and user site. A front end translates the source program to .bc (with HPVM intrinsics); the HPVM graph optimizer and code generation (bottom-up on graph hierarchy) follow, using one of three translators:]

• HPVM-to-PTX: host code (x86 binary) + PTX binary, run via the nVidia OpenCL runtime on an nVidia GeForce GTX 680 GPU

• HPVM-to-SPIR-to-AVX: host code (x86 binary) + SPIR binary, run via the Intel OpenCL runtime on Intel Xeon E5 core i7 + AVX

• HPVM-to-x86: host code (x86 binary), run with P-threads on Intel Xeon E5 core i7

Key:

1. any node → any device

2. reuse vendor back ends
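A bottom-up traversal of the graph hierarchy with a per-node device choice can be sketched as follows. This is an assumption-laden illustration, not the HPVM code generator: `GraphNode`, `BACKENDS`, `codegen_bottom_up` and the node names are hypothetical, and the "generated code" is just a descriptive string.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class GraphNode:
    """Hypothetical node in the graph hierarchy; a node without children is a leaf."""
    name: str
    children: List["GraphNode"] = field(default_factory=list)


# Hypothetical table from device class to output format; in the flow on the
# slide, the final translation step is delegated to vendor back ends.
BACKENDS = {"gpu": "PTX", "vector": "SPIR", "cpu": "x86"}


def codegen_bottom_up(node: GraphNode, mapping: Dict[str, str],
                      out: Optional[List[str]] = None) -> List[str]:
    """Post-order walk: children are translated before the node that contains them."""
    if out is None:
        out = []
    for child in node.children:
        codegen_bottom_up(child, mapping, out)
    target = mapping.get(node.name, "cpu")  # any node may be mapped to any device
    kind = "kernel" if not node.children else "launch code"
    out.append(f"{node.name}: {kind} -> {BACKENDS[target]}")
    return out


root = GraphNode("root", [
    GraphNode("blur", [GraphNode("blur_leaf")]),
    GraphNode("edge_leaf"),
])
plan = codegen_bottom_up(root, {"blur_leaf": "gpu", "edge_leaf": "vector"})
```

Because leaves are handled first, a parent node can be emitted knowing where each child was placed, for example to generate the code that launches a child kernel on its device.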

Evaluation: Summary

Abstraction and object-code portability

➢Single HPVM code is close to (or slightly worse than) separately hand-tuned code on both GPU and AVX

➢HPVM performance limited by vendor-specific back ends, not by HPVM abstractions

Flexible scheduling

➢HPVM enables highly flexible mappings to diverse h/w

Ongoing Research (1)

ApproxHPVM for accuracy-aware optimization

• App developers only express end-to-end accuracy goals

• Domain-specific strategy:

➢Extend HPVM with tensor domain ops

➢Express hardware-independent accuracy metrics in IR

• Algorithmic approximations as well as system-level

• Portable virtual ISA after hardware-agnostic autotuning

• Dynamic optimization to adapt to run-time conditions

Sharif et al., OOPSLA 2019
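The idea of tuning against an end-to-end accuracy goal can be illustrated with a small search over per-operation approximation choices. This is a toy sketch, not ApproxHPVM's tuner: the operation names, knob labels, time and loss numbers are all made up, and treating accuracy loss as additive across operations is a simplifying assumption (a real tuner would measure end-to-end accuracy).

```python
from itertools import product
from typing import Dict, List, Optional, Tuple

# (label, estimated time in microseconds, estimated accuracy loss in basis points).
# All values below are hypothetical and chosen only to make the example concrete.
Knob = Tuple[str, int, int]

KNOBS: Dict[str, List[Knob]] = {
    "conv1": [("fp32", 100, 0), ("fp16", 60, 20), ("approx", 50, 100)],
    "conv2": [("fp32", 200, 0), ("fp16", 120, 30), ("approx", 90, 150)],
}


def tune(knobs: Dict[str, List[Knob]], max_loss: int) -> Optional[Tuple[Dict[str, str], int, int]]:
    """Exhaustively pick one knob per operation, minimizing time within the loss budget."""
    ops = list(knobs)
    best = None
    for choice in product(*(knobs[op] for op in ops)):
        loss = sum(k[2] for k in choice)    # assumed additive; see the caveat above
        if loss > max_loss:
            continue
        time = sum(k[1] for k in choice)
        if best is None or time < best[1]:
            best = ({op: k[0] for op, k in zip(ops, choice)}, time, loss)
    return best


loose = tune(KNOBS, max_loss=130)
tight = tune(KNOBS, max_loss=50)
exact = tune(KNOBS, max_loss=0)
```

The developer supplies only `max_loss`; the choice of which operation to approximate, and how, is left to the search. Exhaustive enumeration is used here for clarity and would not scale to large programs.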


Hardware-agnostic programming of FPGAs

• FPGAs are becoming widely available in data centers

• Application users lack hardware expertise

[Figure: HPVM-OpenCL flow. HPVM virtual object code goes through transformations and code gen to produce a kernel (.cl). The AOC compiler's intermediate compilation yields an optimization report, which an "analyze report" step feeds back to the transformations; full compilation produces the bitstream (.aocx).]

Goal: Use compiler optimizations to achieve high-perf. FPGA designs from hardware-agnostic code


DSSoC: Hardware Design Space Exploration

Integrate ApproxHPVM with Jasmine Toolflow

• Improve hw-agnostic tuning to match hw-specific

• Partition application + iterate through design space

• Explore approximate hw, sw mechanisms

[Figure: DSSoC design exploration tool flow, with compiler, training and design flows. Components: DSSoC Applications (IBM, Columbia, Harvard, Illinois); HPVM; Jasmine: Design Space Exploration (Harvard); ESP: SoC Design Framework (Columbia). Ontology learning: ontology discovery using graph analytics & static analysis over a "training" set of workloads (WL1, WL2, WL3, …, WLn) produces ontologies 1 … n over operations such as Convolution (Conv 1D, Conv 2D), MatMul and ReLU; a "test" set is evaluated on CPU, GPU and accelerator (Acc) combinations. Other labels: workload mix (38%, 41%, 3%), CNN design space, CV, NOC architectures, design constraints, physical interface, dynamic DDG, hierarchical static DDG, hierarchical DFG, performance results, accelerator Pareto curves.]


Domain-specific programming of edge systems

• Xilinx Zynq, NVIDIA Jetson Nano, Intel Movidius …

• Example: ARM (+ GPU) (+ FPGA) (+ DNN)

• Users: Crop scientists, civil engrs, medical researchers…

• Can we enable non-expert users to program complex heterogeneous SoCs?

➢Very high-level DSLs

➢Automatic partitioning, approximation, mapping, code generation

➢Automatic run-time scheduling, performance analysis

Summary

HPVM: portability + performance for heterogeneous systems

ApproxHPVM: easy access to approximation techniques

Long-term goals:

➢Application-driven hardware design needs uniform interfaces!

➢Rich compiler infrastructure for DSLs

➢Easy programming of energy-efficient edge compute systems

Questions?
