Intra-Core Loop-Task Accelerators for Task-Based Parallel Programs
Christopher Batten
Computer Systems Laboratory
School of Electrical and Computer Engineering
Cornell University
Spring 2018
• Research Overview • LTA Motivation LTA SW LTA HW LTA Evaluation
Motivating Trends in Computer Architecture
[Figure: processor trends from 1975 to 2015 on a log scale (10^0 to 10^7): transistors (thousands), frequency (MHz), typical power (W), SPECint performance, and number of cores, with growth annotations of ~9%/year and ~15%/year. Labeled processors: MIPS R2K, DEC Alpha 21264, Intel P4, AMD 4-Core Opteron, Intel 48-Core Prototype. Data collected by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, C. Batten.]
● Data-Parallelism via GPGPUs & Vector
● Hardware Specialization
● Fine-Grain Task-Level Parallelism
● Instruction Set Specialization
● Subgraph Specialization
● Application-Specific Accelerators
● Domain-Specific Accelerators
● Coarse-Grain Reconfig Arrays
● Field-Programmable Gate Arrays
Cornell University Christopher Batten 2 / 29
[Figure: energy efficiency (tasks per joule) vs. performance (tasks per second), illustrating flexibility vs. specialization. Labeled points and regions: simple processor, embedded architectures, high-performance architectures, more flexible accelerator, less flexible accelerator, custom ASIC. Labeled limits: design power constraint, design performance constraint.]
Vertically Integrated Research Methodology
Our research involves reconsidering all aspects of the computing stack including applications, programming frameworks, compiler optimizations, runtime systems, instruction set design, microarchitecture design, VLSI implementation, and hardware design methodologies
[Figure: research toolflow. Labels: applications, cross compiler, binary, functional simulator (functional-level model), cycle-level simulator (cycle-level model), RTL simulator (register-transfer-level model), synthesis and place & route, layout, gate-level simulator (gate-level model), switching activity, power analysis.]
Key Metrics: Cycle Count, Cycle Time, Area, Energy
Experimenting with full-chip layout, FPGA prototypes, and test chips is a key part of our research methodology
Projects Within the Batten Research Group
[Figure: projects mapped onto the computing stack (Apps, Algos, PL, Compiler, ISA, uArch, RTL, VLSI, Circuits, Tech). Labels in the order they appear:]
● GPGPU Architecture [ISCA'13] [MICRO'14a] (AFOSR)
● Integrated Voltage Regulation [MICRO'14b] [TCAS'18]
● Accelerator-Centric SoC Design [HOTCHIPS'17] (DARPA, NSF) [DAC'16] [FPGA'17]
● Deep Processing in Memory (NSF, SRC)
● Accel for Loop-Level Parallelism [MICRO'14c] (DARPA, AFOSR, NSF)
● Accel for Task-Level Parallelism [MICRO'17] (AFOSR, NSF) [ISCA'16]
● PyMTL/Pydgin Frameworks [MICRO'14d] (NSF, DARPA) [DAC'18] [ISPASS'15,'16]
Highlighted project for this talk: Accel for Task-Level Parallelism [MICRO'17]
Cornell University Ji Yun Kim
Inter-Core
● Task-Based Parallel Programming Frameworks
○ Intel TBB, Cilk
Intra-Core
● Packed-SIMD Vectorization
○ Intel AVX, Arm NEON
Challenges of Combining Tasks and Vectors
Challenge #1: Intra-Core Parallel Abstraction Gap
void app_kernel_tbb_avx(int N, float* src, float* dst) {
  // Pack data into padded aligned chunks
  //   src -> src_chunks[NUM_CHUNKS * SIMD_WIDTH]
  //   dst -> dst_chunks[NUM_CHUNKS * SIMD_WIDTH]
  ...
  // Use TBB across cores
  parallel_for (range(0, NUM_CHUNKS, TASK_SIZE), [&] (range r) {
    for (int i = r.begin(); i < r.end(); i++) {
      // Use packed-SIMD within a core
      #pragma simd vlen(SIMD_WIDTH)
      for (int j = 0; j < SIMD_WIDTH; j++) {
        if (src_chunks[i][j] > THRESHOLD)
          aligned_dst[i] = DoLightCompute(aligned_src[i]);
        else
          aligned_dst[i] = DoHeavyCompute(aligned_src[i]);
        ...
  ...
Challenge #2: Inefficient Execution of Irregular Tasks
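One way to see why irregular tasks can run inefficiently on packed-SIMD hardware is to count the cycles a lock-step group spends when a branch diverges. The sketch below is a simplified cost model, not a measurement from the talk: it assumes a group must execute every branch path that at least one of its elements takes, and the name `lockstep_cycles` and the per-path costs are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified model of lock-step (packed-SIMD) execution of a two-way branch.
// takes_heavy[i] is true if element i takes the heavy path.
// A group of `width` elements pays for every path that at least one of its
// elements takes, because all lanes in the group advance together.
long lockstep_cycles(const std::vector<bool>& takes_heavy, std::size_t width,
                     long light_cost, long heavy_cost) {
  assert(width > 0);
  long cycles = 0;
  for (std::size_t base = 0; base < takes_heavy.size(); base += width) {
    bool any_light = false;
    bool any_heavy = false;
    for (std::size_t j = base; j < base + width && j < takes_heavy.size(); ++j) {
      if (takes_heavy[j]) any_heavy = true;
      else                any_light = true;
    }
    if (any_light) cycles += light_cost;  // group executes the light path
    if (any_heavy) cycles += heavy_cost;  // group executes the heavy path
  }
  return cycles;
}
```

With a width of 4, a light cost of 1, and a heavy cost of 10, eight elements sorted by branch outcome cost 11 cycles in this model, while the same eight elements with alternating outcomes cost 22, because both groups execute both paths.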
Native Performance Results
[Figure: native performance results, grouped into regular and irregular kernels.]
Loop-Task Accelerator (LTA) Vision
● Motivation
● Challenge #1: LTA SW
● Challenge #2: LTA HW
● Evaluation
● Conclusion
LTA SW: API and ISA Hint
void app_kernel_lta(int N, float* src, float* dst) {
  LTA_PARALLEL_FOR(0, N, (dst, src), ({
    if (src[i] > THRESHOLD)
      dst[i] = DoComputeLight(src[i]);
    else
      dst[i] = DoComputeHeavy(src[i]);
  }));
}

void loop_task_func(void* a, int start, int end, int step=1);

Hint that hardware can potentially accelerate task execution
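The `loop_task_func` signature suggests that the loop body is outlined into a function over the range [start, end). The following is a minimal sketch of how such a loop task could be written and run purely in software. It is an illustration under stated assumptions, not the actual LTA runtime or macro expansion: `Args`, `run_loop_task`, the `THRESHOLD` value, and the two compute functions are hypothetical stand-ins.

```cpp
#include <cassert>
#include <vector>

// Hypothetical argument bundle; the slide passes (dst, src) to the macro.
struct Args {
  float* dst;
  const float* src;
};

const float THRESHOLD = 0.5f;                       // assumed value
float DoComputeLight(float x) { return x + 1.0f; }  // stand-in computation
float DoComputeHeavy(float x) { return x * 2.0f; }  // stand-in computation

// Loop task with the signature shown on the slide: processes [start, end).
void loop_task_func(void* a, int start, int end, int step = 1) {
  Args* args = static_cast<Args*>(a);
  for (int i = start; i < end; i += step) {
    if (args->src[i] > THRESHOLD)
      args->dst[i] = DoComputeLight(args->src[i]);
    else
      args->dst[i] = DoComputeHeavy(args->src[i]);
  }
}

// Software fallback (assumption): split the range recursively until it is
// at most `grain` iterations, then call the loop task serially. A parallel
// runtime could instead hand each half to another core or an accelerator.
using LoopTask = void (*)(void*, int, int, int);

void run_loop_task(LoopTask f, void* a, int start, int end, int grain) {
  assert(grain > 0);
  if (end - start <= grain) {
    f(a, start, end, 1);
    return;
  }
  int mid = start + (end - start) / 2;
  run_loop_task(f, a, start, mid, grain);
  run_loop_task(f, a, mid, end, grain);
}
```

Because the loop task only sees a range and an opaque argument pointer, the same function can be executed by a serial fallback, by work-stealing across cores, or by hardware that recognizes the hint.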
LTA SW: Task-Based Runtime
LTA HW: Fully Coupled LTA
Coupling better for regular workloads (amortize frontend/memory)
LTA HW: Fully Decoupled LTA
Decoupling better for irregular workloads (hide latencies)
LTA HW: Task-Coupling Taxonomy
+ Higher Perf on Irregular
- Higher Area/Energy
Task Group (lock-step execution)
More decoupling (more task groups) in either space or time improves performance on irregular workloads at the cost of area/energy
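The performance side of this trade-off can be illustrated with a toy scheduling model. This is a sketch under simplifying assumptions, not the evaluated hardware: each task group runs its batch of tasks in lock-step and is busy for as long as its slowest task, groups take batches from a shared queue, and area and energy costs are ignored. The function name `makespan` and the task lengths are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model: `lanes` lanes are split into `groups` equal task groups.
// Each group takes the next lanes/groups tasks from a shared queue and runs
// them in lock-step, so it stays busy for as long as its slowest task.
// Returns the time at which the last group finishes.
long makespan(const std::vector<long>& task_cycles, std::size_t lanes,
              std::size_t groups) {
  assert(lanes > 0 && groups > 0 && lanes % groups == 0);
  const std::size_t per_group = lanes / groups;
  std::vector<long> free_at(groups, 0);  // time when each group is next idle
  std::size_t next = 0;
  while (next < task_cycles.size()) {
    // The earliest-idle group picks up the next batch of tasks.
    auto g = std::min_element(free_at.begin(), free_at.end());
    long slowest = 0;
    for (std::size_t k = 0; k < per_group && next < task_cycles.size();
         ++k, ++next) {
      slowest = std::max(slowest, task_cycles[next]);
    }
    *g += slowest;
  }
  return *std::max_element(free_at.begin(), free_at.end());
}
```

For the irregular task lengths {8, 1, 1, 1, 8, 1, 1, 1} on 4 lanes, this model gives 16 cycles with one task group and 9 cycles with two or four task groups. For eight equal tasks of 4 cycles each, it gives 8 cycles regardless of the number of groups, so decoupling only helps when task lengths vary.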
Does it matter whether we decouple in space or in time?
[Figure: task-coupling design space. Within each panel, columns to the right have more spatial decoupling and rows further down have more temporal decoupling.]

4 Lanes x 2 Chimes
4/1x2/1  4/2x2/1  4/4x2/1
4/1x2/2  4/2x2/2  4/4x2/2

2 Lanes x 4 Chimes
2/1x4/1  2/2x4/1
2/1x4/2  2/2x4/2
2/1x4/4  2/2x4/4

8 Lanes x 4 Chimes
8/1x4/1  8/2x4/1  8/4x4/1  8/8x4/1
8/1x4/2  8/2x4/2  8/4x4/2  8/8x4/2
8/1x4/4  8/2x4/4  8/4x4/4  8/8x4/4

4 Lanes x 8 Chimes
4/1x8/1  4/2x8/1  4/4x8/1
4/1x8/2  4/2x8/2  4/4x8/2
4/1x8/4  4/2x8/4  4/4x8/4
4/1x8/8  4/2x8/8  4/4x8/8
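The configuration labels appear to follow the pattern lanes/lane groups x chimes/chime groups, judging from the panel titles and axis labels. Under that assumed reading, a small parser can derive the per-group sizes and the number of task groups. The type `LtaConfig`, the function `parse_config`, and the task-group formula (lane groups times chime groups) are hypothetical and only illustrate the notation.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Assumed reading of a label such as "8/4x4/2":
//   lanes / lane groups  x  chimes / chime groups
struct LtaConfig {
  int lanes, lane_groups, chimes, chime_groups;
  int lanes_per_group() const { return lanes / lane_groups; }
  int chimes_per_group() const { return chimes / chime_groups; }
  // Assumption: each (lane group, chime group) pair is one task group.
  int task_groups() const { return lane_groups * chime_groups; }
};

// Parses a label; returns false if it is malformed, has trailing characters,
// or the group counts do not divide the lane and chime counts evenly.
bool parse_config(const std::string& label, LtaConfig& out) {
  LtaConfig c{};
  char tail = 0;
  int n = std::sscanf(label.c_str(), "%d/%dx%d/%d%c", &c.lanes,
                      &c.lane_groups, &c.chimes, &c.chime_groups, &tail);
  if (n != 4) return false;
  if (c.lanes <= 0 || c.lane_groups <= 0 || c.chimes <= 0 ||
      c.chime_groups <= 0)
    return false;
  if (c.lanes % c.lane_groups != 0 || c.chimes % c.chime_groups != 0)
    return false;
  out = c;
  return true;
}
```

Under this reading, 4/1x2/1 is a single fully coupled task group, while 8/4x4/2 has 2 lanes per lane group, 2 chimes per chime group, and 8 task groups.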
LTA HW: Microarchitectural Template
[Figure: LTA microarchitectural template. Recoverable labels: L1 instruction cache, L1 data cache, Task Distributer, μTask queue, IMem Xbar, DMem Xbar, FPU Xbar, MDU Xbar, lane groups, FPU group, MDU group, mem ports, writeback arbiter, MDU interface, FPU interface, and units labeled IMU, TMU, DMU, PIB, PDB, WQ, WCU, DU, FU, IU, Seq, IQ, SLFU, LSU, RT, PFB, PC, and μRF. Parameters: y lanes per lane group, z chimes per chime group, gc chime groups. Widths shown: 4B and 32B. Tasks arrive from the GPP.]
Evaluation: Methodology
• Ported 16 application kernels from PBBS and in-house benchmark suites with diverse loop-task parallelism
  ○ Scientific computing: N-body simulation, MRI-Q, SGEMM
  ○ Image processing: bilateral filter, RGB-to-CMYK, DCT
  ○ Graph algorithms: breadth-first search, maximal matching
  ○ Search/Sort algorithms: radix sort, substring matching
• gem5 + PyMTL co-simulation for cycle-level performance
• Component/event-based area/energy modeling
  ○ Uses area/energy dictionary backed by VLSI results and McPAT
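An event-based energy model of this kind can be sketched as a lookup table of per-event energies multiplied by event counts from the cycle-level simulation. The code below only illustrates the arithmetic: the event names, the picojoule values in the usage note, and the function `total_energy_pj` are hypothetical and do not come from the actual dictionary.

```cpp
#include <cassert>
#include <map>
#include <string>

using EnergyDict = std::map<std::string, double>;  // event -> energy per event (pJ)
using EventCounts = std::map<std::string, long>;   // event -> number of occurrences

// Total energy = sum over event types of (count) x (energy per event).
// Throws std::out_of_range if a counted event is missing from the dictionary.
double total_energy_pj(const EnergyDict& dict, const EventCounts& counts) {
  double total = 0.0;
  for (const auto& kv : counts) {
    assert(kv.second >= 0);
    total += dict.at(kv.first) * static_cast<double>(kv.second);
  }
  return total;
}
```

For example, with made-up entries of 1 pJ per `regfile_read` and 10 pJ per `dcache_access`, a run with 200 register-file reads and 30 data-cache accesses totals 500 pJ.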
Evaluation: Design-Space Exploration
[Figure: design-space exploration results, annotated with spatial decoupling, temporal decoupling, and resource constraints.]
Prefer spatial decoupling over temporal decoupling
Reduce spatial decoupling to improve energy efficiency
Evaluation: Energy Breakdown
[Figure: energy breakdown for mriq (regular) and sarray (irregular) across configurations 8/1x4/1, 8/2x4/1, 8/4x4/1, and 8/8x4/1. Components: I Cache, PIB, TMU, Frontend, Regfile, RT+ROB, SLFU, LLFU, LSU, D Cache.]
Conservative comparison, since IO/O3 run a serial baseline while LTA uses the parallel runtime even on a single core
Evaluation: Energy Efficiency vs. Performance
[Figure: energy efficiency vs. performance in two panels, "In-Order vs IO+LTA" and "Out-of-Order vs IO+LTA", with the in-order baseline and out-of-order baseline marked.]
Evaluation: Multicore LTA Performance
[Figure: multicore LTA performance for regular and irregular kernels. Annotated speedups: 4.4x, 2.9x, 10.7x, 5.2x.]
Evaluation: Multicore Energy and Area
• Both the baseline CMP and the CMP+LTA designs use the same application code and almost the exact same parallel runtime
• CMP+LTA vs. CMP-IO: improves energy efficiency by 1.1x geo mean
• CMP+LTA vs. CMP-O3: improves energy efficiency by 3.2x geo mean
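The "geo mean" figures summarize per-kernel improvement ratios with a geometric mean, which is the usual choice for averaging ratios. A minimal sketch of that calculation follows; the function name `geo_mean` and the ratios in the usage note are illustrative and are not the talk's per-kernel data.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Geometric mean of positive ratios, computed in log space to avoid
// overflow when many ratios are multiplied together.
double geo_mean(const std::vector<double>& ratios) {
  assert(!ratios.empty());
  double log_sum = 0.0;
  for (double r : ratios) {
    assert(r > 0.0);
    log_sum += std::log(r);
  }
  return std::exp(log_sum / static_cast<double>(ratios.size()));
}
```

For instance, improvement ratios of 2x and 8x on two kernels have a geometric mean of 4x, whereas their arithmetic mean would be 5x.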
Related Work
• Challenge #1: Intra-Core Parallel Abstraction Gap
  ○ Persistent threads for GPGPUs (S. Tzeng et al.)
  ○ OpenCL, OpenMP, C++ AMP
  ○ Cilk for packed-SIMD (B. Ren et al.)
  ○ and more ...
• Challenge #2: Inefficient Execution of Irregular Tasks
  ○ Variable warp sizing (T. Rogers et al.)
  ○ Temporal SIMT (S. Keckler et al.)
  ○ Vector-lane threading (S. Rivoire et al.)
  ○ and more ...
• See MICRO'17 paper for detailed references ...
LTA Take-Away Points
• Intra-core parallel abstraction gap and inefficient execution of irregular tasks are fundamental challenges for CMPs
• LTAs address both challenges with a lightweight ISA hint and a flexible microarchitectural template
• Results suggest that in a resource-constrained environment, architects should favor spatial decoupling over temporal decoupling
[Figure: LTA overview. Labels: task-based parallel programming framework, loop-tasks, LTA ISA hint, GPP, SIMD, LTA, LTA SW, LTA HW.]
Upcoming Computer Laboratory Seminar
“A New Era of Open-Source SoC Design”
Wednesday, May 16th @ 4:15pm
Celerity: An Accelerator-Centric System-on-Chip
[Excerpt from T. Ajayi et al., CARRV'17, October 14, 2017, Boston, MA, USA]
[Figure 1: Celerity SoC Architecture. General-purpose tier: five RV64G cores, each with I-cache, D-cache, AXI, and RoCC interfaces. Massively parallel tier: manycore array of RV32IM cores, each with I Mem, D Mem, XBAR, and NoC router. Specialization tier: BNN accelerator. Also shown: off-chip I/O and a bus from the BSG IP library.]
stack to support new software research ideas. Complete off-the-shelf RISC-V processor and memory system implementations (e.g., Rocket chip SoC generator) enable rapidly deploying traditional processors before modifying and/or extending these initial implementations with new hardware research ideas. Similarly, OCN IP that is designed within the RISC-V ecosystem (e.g., NASTI, TileLink) can reduce system-level integration effort. Standard verification test suites can greatly simplify developing new processor microarchitectures, and turn-key FPGA gateware (e.g., framework for running Rocket cores on various Xilinx Zynq FPGA boards) can help reduce engineering effort. The entire RISC-V ecosystem has a strong emphasis on open-source software and hardware which facilitates modifying and/or extending just the component of interest in the context of a given research idea.
In this paper, we describe our experiences using the RISC-V ecosystem to build Celerity, an accelerator-centric system-on-chip (SoC) which uses a tiered accelerator fabric to improve energy efficiency in the context of high-performance embedded systems [1]. The general-purpose tier includes a few fully featured RISC-V processors capable of running general-purpose software including an operating system, networking stack, and non-critical control/configuration software. This tier is optimized for high flexibility, but of course at the cost of energy efficiency. The massively parallel tier includes hundreds of lightweight RISC-V processors, a distributed, non-cache-coherent memory system, and a mesh-based interconnect. This tier is optimized for efficiently executing applications with fine-grain data- and/or thread-level parallelism. The specialization tier includes application-specific accelerators (possibly generated using high-level synthesis). This tier is optimized for extreme energy efficiency, but of course at the cost of flexibility. We envision a three-step process for mapping algorithms to such fabrics. Step 1: Implement the algorithm using the general-purpose tier. Step 2: Accelerate the algorithm using either the massively parallel tier OR the specialization tier. Step 3: Improve performance and efficiency by cooperatively using both the specialization AND the massively parallel tier. A key feature of tiered accelerator fabrics is the use of high-throughput parallel links to inter-connect all three tiers.
The Celerity SoC is a 5 × 5 mm 385M-transistor chip in TSMC 16 nm designed and implemented by a team of over 20 students and faculty from the University of Michigan, Cornell University, and the Bespoke Silicon Group at the University of Washington and the University of California, San Diego, as part of the DARPA Circuit Realization At Faster Timescales (CRAFT) program. Figure 1 illustrates the SoC architecture. The Celerity SoC includes five Chisel-generated Rocket RV64G cores in the general-purpose tier, a 496-core RV32IM tiled manycore processor in the massively parallel tier, and a complex HLS-generated BNN (binarized neural network) accelerator implemented as a Rocket custom co-processor (RoCC) in the specialization tier. Celerity also includes tightly integrated Rocket-to-manycore communication channels, manycore-to-BNN high-speed links, sleep-mode subsystem with ten RV32IM cores, fully synthesizable phase-locked-loop clocking subsystem, and digital low-dropout voltage regulator. The chip was taped out in May 2017, and it will return from the foundry in the fall. The Celerity SoC is an open-source project, and links to all of the source files are available online at http://opencelerity.org.
In the rest of the paper, we describe each of the three tiers in more detail by answering four key questions: What did we build in that tier? How did we build it? How did we leverage the RISC-V ecosystem to facilitate design, implementation, and verification in that tier? and What were the challenges in leveraging the RISC-V ecosystem in that tier? Overall, the RISC-V ecosystem played an important role in enabling a team of junior graduate students to design and tapeout the highest-performance RISC-V SoC to date in just nine months.
2 GENERAL-PURPOSE TIER WITH RV64G CORES
The general-purpose tier uses fully featured RISC-V processors to execute general-purpose software. This tier is optimized for high flexibility, at the potential expense of energy efficiency.
What Did We Build? – The Celerity SoC general-purpose tier includes five RV64G cores. The RV64G instruction set is comprised of approximately 150 instructions for 64-bit integer arithmetic, single and double-precision floating-point arithmetic, memory access, unconditional and conditional control flow, and atomic memory operations. The cores use a relatively simple five-stage, single-issue, in-order pipeline. The RV64G core includes a memory management unit and support for RISC-V machine, supervisor, and user privilege levels. It can be used in either a bare-metal mode, with a proxy kernel (i.e., system calls are proxied to a separate host machine), or with a RISC-V port of the Linux operating system. The RV64G cores serve as the interface between the other tiers and the off-chip “northbridge” which is implemented as gateware in an FPGA. The northbridge includes support for initial boot-up, off-chip DRAM, and other I/O. Each RV64G core includes a 16 KB four-way set-associative instruction cache and a 16 KB four-way set-associative data cache. There is no on-chip L2 cache. The five RV64G cores are not cache-coherent, and thus only support running five independent instances of non-SMP Linux. Limited communication between the RV64G cores is possible using software-managed coherence and special support in the northbridge.
How Did We Build It? – We used the Berkeley Rocket chip SoC generator to create the RV64G core [2]. The Rocket chip SoC generator is written in Chisel [4], a hardware construction language embedded in Scala. Generating an RV64G core simply required
PyMTL: A Python-Based Hardware Modeling Framework
Ji Kim, Shreesha Srinath, Christopher Torng, Berkin Ilbeyi, Moyang Wang, Shunning Jiang, Khalid Al-Hawaj, Tuan Ta, Lin Cheng
and many M.S./B.S. students
Equipment, Tools, and IP
Intel, NVIDIA, Synopsys, Cadence, Xilinx, ARM
Batten Research Group
Exploring cross-layer hardware specialization using a vertically integrated research methodology