intra-core loop-task accelerators for task-based parallel programscbatten/pdfs/batten-lta... ·...

Intra-Core Loop-Task Accelerators forTask-Based Parallel Programs

Christopher Batten

Computer Systems LaboratorySchool of Electrical and Computer Engineering

Cornell University

Spring 2018

• Research Overview • LTA Motivation LTA SW LTA HW LTA Evaluation

Motivating Trends in Computer Architecture

Transistors(Thousands)

MIPSR2K

IntelP4

Data collected by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, C. Batten1975 1980 1985 1990 1995 2000 2005 2010 2015

100

101

102

103

104

105

106

107

DECAlpha 21264

TypicalPower (W)

Frequency(MHz)

SPECintPerformance

~9%/year

~15%/year

Numberof Cores

Intel 48-Core Prototype

AMD 4-Core Opteron Data-Parallelism via GPGPUs & Vector

HardwareSpecialization

Fine-Grain Task- Level Parallelism Instruction Set Specialization Subgraph Specialization Application-Specific Accelerators Domain-Specific Accelerators Coarse-Grain Reconfig Arrays Field-Programmable Gate Arrays

Cornell University Christopher Batten 2 / 29


Performance (Tasks per Second)

Ener

gy E

ffici

ency

(Tas

ks p

er J

oule

)

SimpleProcessor

Design PowerConstraint

High-PerformanceArchitectures

EmbeddedArchitectures

DesignPerformanceConstraint

Flexibil

ity vs.

Specia

lizatio

nCustomASIC

Less FlexibleAccelerator

More FlexibleAccelerator



Vertically Integrated Research Methodology

Our research involves reconsidering all aspects of the computing stackincluding applications, programming frameworks, compiler optimizations,runtime systems, instruction set design, microarchitecture design, VLSI

implementation, and hardware design methodologies

CrossCompiler

FunctionalSimulator

Binary

Applications

Functional-LevelModel

Cycle-LevelSimulator

Cycle-LevelModel

Layout

Register-Transfer-Level Model

RTLSimulator

Gate-Level Model

Gate-LevelSimulator

Switching Activity

PowerAnalysis

SynthesisPlace&Route

Key Metrics: Cycle Count, Cycle Time, Area, Energy

Experimenting with full-chiplayout, FPGA prototypes, andtest chips is a key part of our

research methodology



Projects Within the Batten Research Group

PL

ISA

uArch

RTL

VLSI

Circuits

Tech

Compiler

Apps

Algos

GPGPUArchitecture

[ISCA'13]

(AFOSR)

[MICRO'14a]

IntegratedVoltage

Regulation[MICRO'14b]

[TCAS'18]

AcceleratorCentric

SoC Design[HOTCHIPS'17]

(DARPA,NSF)

[DAC'16] [FPGA'17]

DeepProcessingin Memory

(NSF,SRC)

Accel forLoop-LevelParallelism[MICRO'14c]

(DARPA,AFOSR,

NSF)

Accel forTask-LevelParallelism

[MICRO'17]

(AFOSR,NSF)

[ISCA'16]

PyMTL/PydginFrameworks

[MICRO'14d]

(NSF,DARPA)[DAC'18]

[ISPASS'15,'16]

Accel forTask-LevelParallelism

[MICRO'17]


Research Overview • LTA Motivation • LTA SW LTA HW LTA Evaluation

Cornell University Ji Kim 3/21Cornell University Ji Kim 3/21Cornell University Ji Yun Kim 3

Inter-Core● Task-Based Parallel Programming Frameworks

○ Intel TBB, Cilk





○ Intel TBB, Cilk





○ Intel TBB, Cilk

Intra-Core● Packed-SIMD Vectorization

○ Intel AVX, Arm NEON



Challenges of Combining Tasks and Vectors

Cornell University Ji Kim 9/21Cornell University Ji Kim 9/21Cornell University Ji Yun Kim


9

Challenge #1: Intra-Core Parallel Abstraction Gap

void app_kernel_tbb_avx(int N, float* src, float* dst) {// Pack data into padded aligned chunks// src -> src_chunks[NUM_CHUNKS * SIMD_WIDTH]// dst -> dst_chunks[NUM_CHUNKS * SIMD_WIDTH]...

// Use TBB across coresparallel_for (range(0, NUM_CHUNKS, TASK_SIZE), [&] (range r) {for (int i = r.begin(); i < r.end(); i++) {// Use packed-SIMD within a core#pragma simd vlen(SIMD_WIDTH)for (int j = 0; j < SIMD_WIDTH; j++) {if (src_chunks[i][j] > THRESHOLD)aligned_dst[i] = DoLightCompute(aligned_src[i]);

elsealigned_dst[i] = DoHeavyCompute(aligned_src[i]);

...

...






10





...

...






12





...

...

Challenge #2: Inefficient Execution of Irregular Tasks



Native Performance Results

Cornell University Ji Kim 16/21Cornell University Ji Kim 16/21Cornell University Ji Yun Kim 16Regular Irregular

Native Performance Results


Research Overview LTA Motivation • LTA SW • LTA HW LTA Evaluation

Loop-Task Accelerator (LTA) Vision



20

● Motivation

● Challenge #1: LTA SW

● Challenge #2: LTA HW

● Evaluation

● Conclusion



LTA SW: API and ISA Hint


LTA SW: API and ISA Hintvoid app_kernel_lta(int N, float* src, float* dst) {LTA_PARALLEL_FOR(0, N, (dst, src), ({if (src[i] > THRESHOLD)dst[i] = DoComputeLight(src[i]);

elsedst[i] = DoComputeHeavy(src[i]);

}));}

void loop_task_func(void* a, int start, int end, int step=1);

23

Hint that hardware can potentially accelerate task execution



LTA SW: Task-Based Runtime


LTA SW: Task-Based Runtime

29Cornell University Christopher Batten 11 / 29

Research Overview LTA Motivation LTA SW • LTA HW • LTA Evaluation




30

● Motivation



● Evaluation

● Conclusion



LTA HW: Fully Coupled LTA


LTA HW: Fully-Coupled LTA

35

Coupling better for regular workloads (amortize frontend/memory)



LTA HW: Fully Decoupled LTA


LTA HW: Fully Decoupled LTA

41Decoupling better for irregular workloads (hide latencies)



LTA HW: Task-Coupling Taxonomy



46

+ Higher Perf on Irregular- Higher Area/Energy

Task Group (lock-step execution)

More decoupling (more task groups) in either space or time improves performance on irregular workloads at the cost of area/energy






47

Does it matter whether we decouple in space or in time?



4/1x2/1 4/2x2/1 4/4x2/1

4/1x2/2 4/2x2/2 4/4x2/2M

ore

Tem

pora

l Cou

plin

g

More Spatial Decoupling

4 Lanes x 2 Chimes

More Spatial Coupling

Mor

e Te

mpo

ral D

ecou

plin

g

2/1x4/1 2/2x4/1

2/1x4/2 2/2x4/2

2/1x4/4 2/2x4/4

2 Lanes x 4 Chimes

8/1x4/1 8/2x4/1 8/4x4/1 8/8x4/1

8/1x4/2 8/2x4/2 8/4x4/2 8/8x4/2

8/1x4/4 8/2x4/4 8/4x4/4 8/8x4/4

8 Lanes x 4 Chimes4/1x8/1 4/2x8/1 4/4x8/1

4/1x8/2 4/2x8/2 4/4x8/2

4/1x8/4 4/2x8/4 4/4x8/4

4/1x8/8 4/2x8/8 4/4x8/8

4 Lanes x 8 Chimes



LTA HW: Microarchitectural Template

IMem Xbar

LaneGroup

PIB

DMem Xbar

FPUGroup

LaneGroup

LaneGroup

MDUGroup

FPUXbar

μTaskQueue

MDUXbar

PDB

MemPorts

IMU

TM

UD

MU

L1 Instruction Cache

Task Distributer

L1 DataCache

IU Seq

MDU Interface FPU Interface

IU Seq IU Seq

Writeback Arbiter

WQ

WCU

DU

FU

FPU Xbar

DMU

IMUTMU

FromGPP

LaneGroup

4B

32B

y

SLFU

u

IQ

y y y

v v

z chimes perchime group

y lanes per lane group

gcchimegroups

μRFμRF

μRFμRFμRF

μRFμRFμRF

μRF

MDU Xbar

gc

gcgc gc

gly SLFU y

gc

LSU y

RT PFB

PC


Research Overview LTA Motivation LTA SW LTA HW • LTA Evaluation •




48

● Motivation



● Evaluation

● Conclusion



Evaluation: Methodology


Evaluation: Methodology

• Ported 16 application kernels from PBBS and in-house

benchmark suites with diverse loop-task parallelism

• Scientific computing: N-body simulation, MRI-Q, SGEMM

• Image processing: bilateral filter, RGB-to-CMYK, DCT

• Graph algorithms: breadth-first search, maximal matching

• Search/Sort algorithms: radix sort, substring matching

• gem5 + PyMTL co-simulation for cycle-level performance

• Component/event-based area/energy modeling

• Uses area/energy dictionary backed by VLSI results and McPAT

49



Evaluation: Design-Space Exploration


Evaluation: Design Space Exploration

52

spatial decoupling

temporal decoupling

resource constraints






53

spatial decoupling

temporal decoupling

Prefer spatial decoupling over temporal decoupling







55

spatial decoupling

temporal decoupling







56

spatial decoupling

temporal decoupling

Reduce spatial decoupling to improve energy efficiency




Evaluation: Energy Breakdown

8/1x4/1 8/2x4/1 8/4x4/1 8/8x4/1

mriq (regular) sarray (irregular)

I CachePIBTMUFrontendRegfileRT+ROBSLFULLFULSUD Cache

Conservative comparison since IO/O3 running serial baseline,while LTA is using parallel runtime even on a single core



Evaluation: Energy Efficiency vs. Performance

0 1 2 3 4 5 6 7 8 9 10 11Performance

0.00.10.20.30.40.50.60.70.80.91.01.1

Ener

gy E

ffici

ency

0 1 2 3 4 5 6Performance

0.0

0.5

1.0

1.5

2.0

2.5

3.0

3.5

In-Order vs IO+LTA Out-of-Order vs IO+LTA

In-OrderBaseline

Out-of-OrderBaseline



Evaluation: Multicore LTA Performance


Evaluation: Multicore LTA Performance

61Regular Irregular

4.4x2.9x

10.7x5.2x



Evaluation: Multicore Energy and Area

I Both the baseline CMP and theCMP+LTA designs use the sameapplication code and almost theexact same parallel runtime

I CMP+LTA vs. CMP-IO:improves energy efficiency by1.1⇥ geo mean

I CMP+LTA vs. CMP-O3:improves energy efficiency by3.2⇥ geo mean



Related Work

I Challenge #1: Intra-Core Parallel Abstraction Gap. Persistent threads for GPGPUs (S. Tzeng et al.). OpenCL, OpenMP, C++ AMP. Cilk for packed-SIMD (B. Ren et al.). and more ...

I Challenge #2: Inefficient Execution of Irregular Tasks. Variable warp sizing (T. Rogers et al.). Temporal SIMT (S. Keckler et al.). Vector-lane threading (S. Rivoire et al.). and more ...

I See MICRO’17 paper for detailed references ...


Research Overview LTA Motivation LTA SW LTA HW LTA Evaluation

LTA Take-Away Points

I Intra-core parallel abstraction gapand inefficient execution ofirregular tasks are fundamentalchallenges for CMPs

I LTAs address both challengeswith a lightweight ISA hint and aflexible microarchitecturaltemplate

LTAGPP GPP

Task-Based ParallelProgramming Framework

LTA ISA Hint

Loop-TaskLoop-Task

SIMD SIMD LTA

LTA HW

LTA SW

I Results suggest in a resource-constrained environment, architectsshould favor spatial decoupling over temporal decoupling



Upcoming Computer Laboratory Seminar

“A New Era of Open-Source SoC Design”Wednesday, May 16th @ 4:15pm

Celerity: An Accelerator-CentricSystem-on-Chip

CARRV’17, October 14, 2017, Boston, MA, USA T. Ajayi et al.

AX

I

RO

CCRV64G Core

I-CacheD-Cache

AX

I

RO

CCRV64G Core

I-CacheD-Cache

AX

I

RO

CCRV64G Core

I-CacheD-Cache

AX

I

RO

CCRV64G Core

I-CacheD-Cache

AX

I

RO

CCRV64G Core

I-CacheD-Cache

RV32IMCore

I Mem

XB

AR

NoC

Router

D Mem

BN

N A

ccel

erat

or

Off-Chip I/O

Bus f

rom

BSG

IP L

ibra

ry

Manycore

General-PurposeTier

Massively ParallelTier

SpecializationTier

Figure 1: Celerity SoC Architecture

stack to support new software research ideas. Complete o�-the-shelf RISC-V processor and memory system implementations (e.g.,Rocket chip SoC generator) enable rapidly deploying traditionalprocessors before modifying and/or extending these initial imple-mentations with new hardware research ideas. Similarly, OCN IPthat is designed within the RISC-V ecosystem (e.g., NASTI, TileLink)can reduce system-level integration e�ort. Standard veri�cation testsuites can greatly simplify developing new processor microarchi-tectures, and turn-key FPGA gateware (e.g., framework for runningRocket cores on various Xilinx Zynq FPGA boards) can help re-duce engineering e�ort. The entire RISC-V ecosystem has a strongemphasis on open-source software and hardware which facilitiesmodifying and/or extending just the component of interest in thecontext of a given research idea.

In this paper, we describe our experiences using the RISC-Vecosystem to build Celerity, an accelerator-centric system-on-chip(SoC) which uses a tiered accelerator fabric to improve energy e�-ciency in the context of high-performance embedded systems [1].The general-purpose tier includes a few fully featured RISC-V proces-sors capable of running general-purpose software including an oper-ating system, networking stack, and non-critical control/con�gurationsoftware. This tier is optimized for high �exibility, but of course atthe cost of energy e�ciency. The massively parallel tier includeshundreds of lightweight RISC-V processors, a distributed, non-cache-coherent memory system, and a mesh-based interconnect.This tier is optimized for e�ciently executing applications with�ne-grain data- and/or thread-level parallelism. The specializationtier includes application-speci�c accelerators (possibly generatedusing high-level synthesis). This tier is optimized for extreme en-ergy e�ciency, but of course at the cost of �exibility. We envisiona three-step process for mapping algorithms to such fabrics. Step 1:Implement the algorithm using the general-purpose tier. Step 2:Accelerate the algorithm using either the massively parallel tier ORthe specialization tier. Step 3: Improve performance and e�ciencyby cooperatively using both the specialization AND the massivelyparallel tier. A key feature of tiered accelerator fabrics is the use ofhigh-throughput parallel links to inter-connect all three tiers.

The Celerity SoC is a 5� 5 mm 385M-transistor chip in TSMC16 nm designed and implemented by a team of over 20 students andfaculty from the University of Michigan, Cornell University, and

the Bespoke Silicon Group at the University of Washington andthe University of California, San Diego, as part of the DARPA Cir-cuit Realization At Faster Timescales (CRAFT) program. Figure 1illustrates the SoC architecture. The Celerity SoC includes �veChisel-generated Rocket RV64G cores in the general-purpose tier, a496-core RV32IM tiled manycore processor in the massively paralleltier, and a complex HLS-generated BNN (binarized neural network)accelerator implemented as a Rocket custom co-processor (RoCC)in the specialization tier. Celerity also includes tightly integratedRocket-to-manycore communication channels, manycore-to-BNNhigh-speed links, sleep-mode subsystem with ten RV32IM cores,fully synthesizable phase-locked-loop clocking subsystem, and dig-ital low-dropout voltage regulator. The chip was taped out in May2017, and it will return from the foundry in the fall. The CeleritySoC is an open-source project, and links to all of the source �lesare available online at http://opencelerity.org.

In the rest of the paper, we describe each of the three tiers inmore detail by answering four key questions: What did we build inthat tier? How did we build it? How did we leverage the RISC-Vecosystem to facilitate design, implementation, and veri�cation inthat tier? and What were the challenges in leveraging the RISC-Vecosystem in that tier? Overall, the RISC-V ecosystem played animportant role in enabling a team of junior graduate students todesign and tapeout the highest-performance RISC-V SoC to date injust nine months.

2 GENERAL-PURPOSE TIERWITH RV64G CORES

The general-purpose tier uses fully featured RISC-V processors toexecute general-purpose software. This tier is optimized for high�exibility, at the potential expense of energy e�ciency.

What Did We Build? – The Celerity SoC general-purpose tierincludes �ve RV64G cores. The RV64G instruction set is comprisedof approximately 150 instructions for 64-bit integer arithmetic, sin-gle and double-precision �oating-point arithmetic, memory access,unconditional and conditional control �ow, and atomic memoryoperations. The cores use a relatively simple �ve-stage, single-issue,in-order pipeline. The RV64G core includes a memory managementunit and support for RISC-V machine, supervisor, and user privi-lege levels. It can be used in either a bare-metal mode, with a proxykernel (i.e., system calls are proxied to a separate host machine),or with a RISC-V port of the Linux operating system. The RV64Gcores serve as the interface between the other tiers and the o�-chip“northbridge” which is implemented as gateware in an FPGA. Thenorthbridge includes support for initial boot-up, o�-chip DRAM,and other I/O. Each RV64G core includes a 16 KB four-way set-associative instruction cache and a 16 KB four-way set-associativedata cache. There is no on-chip L2 cache. The �ve RV64G cores arenot cache-coherent, and thus only support running �ve indepen-dent instances of non-SMP Linux. Limited communication betweenthe RV64G cores is possible using software-managed coherenceand special support in the northbridge.

How Did We Build It? – We used the Berkeley Rocket chip SoCgenerator to create the RV64G core [2]. The Rocket chip SoC gen-erator is written in Chisel [4], a hardware construction languageembedded in Scala. Generating an RV64G core simply required

PyMTL: A Python-BasedHardware Modeling Framework



Ji Kim, Shreesha Srinath, Christopher Torng, Berkin Ilbeyi, Moyang WangShunning Jiang, Khalid Al-Hawaj, Tuan Ta, Lin Cheng

and many M.S./B.S. students

Equipment, Tools, and IPIntel, NVIDIA, Synopsys, Cadence, Xilinx, ARM



Batten Research Group

Exploring cross-layer hardware specializationusing a vertically integrated research methodology

Performance (Tasks per Second)

Ener

gy E

ffici

ency

(Tas

ks p

er J

oule

)

SimpleProcessor

Design PowerConstraint

High-PerformanceArchitectures

EmbeddedArchitectures

DesignPerformanceConstraint

Flexibil

ity vs.

Specia

lizatio

nCustomASIC

Less FlexibleAccelerator

More FlexibleAccelerator

LTAGPP GPP

Task-Based ParallelProgramming Framework

LTA ISA Hint

Loop-TaskLoop-Task

SIMD SIMD LTA

LTA HW

LTA SW


intra-core loop-task accelerators for task-based parallel programscbatten/pdfs/batten-lta... ·...

Documents