A Dual-Engine Fetch/Compute Overlay Processor for FPGAs
by
Rafat Rashid
A thesis submitted in conformity with the requirements for the degree of Master of Applied Science
Graduate Department of Electrical and Computer Engineering, University of Toronto
© Copyright 2015 by Rafat Rashid
Abstract
A Dual-Engine Fetch/Compute Overlay Processor for FPGAs
Rafat Rashid
Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto
2015
High-Level-Synthesis (HLS) tools translate a software description of an application into custom FPGA
logic, increasing designer productivity vs. Hardware Description Language (HDL) design flows. Overlays
seek to further improve productivity by allowing the designer to target a software-programmable sub-
strate instead of the underlying FPGA, raising abstraction and reducing application compile times. We
propose a highly configurable dual-engine, fetch/compute overlay processor which is designed to achieve
high throughput per unit area on data-parallel and compute intensive floating-point applications. The
fetch component is newly proposed as part of this work. For the compute component, we use an im-
proved version of the TILT overlay processor originally developed by Ovtcharov and Tili. As part of
our evaluation, we rigorously compare the performance, area and productivity enabled by our proposed
architecture to that of Altera’s OpenCL HLS tool.
Acknowledgements
First, I want to thank my supervisors Prof. Greg Steffan and Prof. Vaughn Betz for their advice and
direction during the completion of my degree. I have gained valuable research experience under their
supervision. Thank you to professors Paul Chow and Andreas Moshovos for being part of my defence
committee and for providing feedback on my thesis. Thanks also to Wai Tung for being the committee’s
defence chair. I also want to thank Kalin Ovtcharov, Ilian Tili, Charles LaForest and the students in
Prof. Betz's research group for their feedback and useful suggestions on this work. I also thank David
Lewis and Tim Vanderhoek at Altera for helpful discussions on the Stratix V architecture. Finally, I
thank NSERC and Altera for funding support and SciNet for their compute resources.
Contents
1 Introduction 1
1.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Background 4
2.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.1 Overlay Processors on FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.2 High-Level-Synthesis Tools for FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.3 Memory Architectures on FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 TILT Overlay Processor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.1 TILT Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.2 TILT Customization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 TILT Compiler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.1 DFG Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.2 Memory Allocator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.3 Compute Scheduler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4 TILT Execution Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3 External Memory Fetcher 18
3.1 Possible Design Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2 Chosen Design Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.3 External Memory Access Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.3.1 Memory and Compute GPGMLP Scheduling . . . . . . . . . . . . . . . . . . . . . 24
3.3.2 ALAP-Inputs, ASAP-Outputs Memory Scheduling . . . . . . . . . . . . . . . . . . 24
3.3.3 Slack-Based Memory Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3.4 Fetcher-DDR Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.4 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.5 Comparison of the Memory Scheduling Approaches . . . . . . . . . . . . . . . . . . . . . . 29
3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4 TILT Efficiency Enhancements 35
4.1 Instruction Looping Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.1.1 LoopUnit FU Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.1.2 DFG Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.1.3 Scheduling onto TILT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.1.4 Conditional Jump Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.1.5 LLVM Phi Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.1.6 Unrolling Loops vs. Using the LoopUnit FU . . . . . . . . . . . . . . . . . . . . . 42
4.2 Indirect Addressing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2.1 Shift-Register Addressing Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2.2 Indirect Addressing Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.3 Memory Allocation for DFGs with Aliased Locations . . . . . . . . . . . . . . . . . . . . . 47
4.4 Schedule Folding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5 Predictor Tool 50
5.1 Estimation of TILT-System Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.2 TILT and Fetcher Area Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.2.1 TILT FU Array . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.2.2 TILT and Fetcher Instruction Memories . . . . . . . . . . . . . . . . . . . . . . . . 53
5.2.3 TILT Data Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.2.4 TILT Read and Write Crossbars . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.2.5 Fetcher Data FIFOs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.3 Prediction of the Densest TILT-System Design . . . . . . . . . . . . . . . . . . . . . . . . 56
5.3.1 Constraining the Explored Design Space . . . . . . . . . . . . . . . . . . . . . . . . 57
5.3.2 Search Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.3.3 Combining TILT FUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.3.4 Alleviation of Bottlenecks with Performance Counters . . . . . . . . . . . . . . . . 61
5.3.5 Fetcher Configuration Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.4 Efficacy of the Predictor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.4.1 Design Space Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.4.2 Runtime Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.4.3 Accuracy of Area Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.4.4 Prediction Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
6 TILT-System Evaluation 69
6.1 Benchmark Implementation on TILT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
6.2 Densest TILT-System Configurations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.3 TILT Core and System Customization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.4 TILT-System Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
6.5 Comparison with Altera’s OpenCL HLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
6.5.1 Performance and Area . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
6.5.2 Designer Productivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.5.3 TILT-System and OpenCL HLS Scalability . . . . . . . . . . . . . . . . . . . . . . 77
6.5.4 TILT-System vs. OpenCL HLS Summary . . . . . . . . . . . . . . . . . . . . . . . 79
7 Conclusion 81
7.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Appendix 83
Appendix A: Quartus Settings for FUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Appendix B: Quartus Settings for DDR Controller . . . . . . . . . . . . . . . . . . . . . . . . . 83
Bibliography 84
Chapter 1
Introduction
Modern Field Programmable Gate Arrays (FPGAs) offer abundant on-chip compute parallelism and internal
memory bandwidth with high power efficiency compared to more conventional platforms such as
CPUs or GPUs [1, 2]. At the same time, FPGAs provide a flexibility that CPUs and GPUs do not: any
hardware can be implemented on their reconfigurable fabric. For these reasons, FPGAs have become
compelling design platforms for accelerating a variety of compute intensive applications in both academic
and commercial domains [2–5].
For example, Sano et al. demonstrate the relative low power and high performance of FPGAs by
implementing customized, hand-coded hardware for the Jacobi stencil computation [3]. Their tuned
architecture achieves linear scalability in performance with constant memory bandwidth for up to nine
Altera Stratix III FPGAs. This is in contrast to the poor performance scaling achieved on multi-core
CPUs and GPUs in earlier efforts, owing to insufficient inter-processor memory bandwidth. Similarly,
Cassidy et al. present a custom implementation of the Monte Carlo biophotonic application on Stratix V
FPGAs [4] which achieves 4x higher computational throughput and consumes 67x less power compared
to a tightly optimized multi-threaded CPU implementation.
IBM currently offers several commercial products that use FPGAs including the PureData System [6]
for managing large database workloads. PureData uses FPGAs for high-performance analytics
processing on these databases. Another IBM product is DataPower [7], where FPGAs are used to process
large amounts of mobile and web traffic. More recently, Microsoft has proposed Catapult [5], a system
built on a cluster of high-end FPGAs that uses “overlay” processing engines.
Catapult was used to accelerate Microsoft’s Bing web search ranking algorithms and could be used to
accelerate other large-scale data centre services as well. The authors did not use GPUs because the
power requirements of current high-end GPUs were too high to justify the compute parallelism they
offered. Catapult demonstrated a 95% increase in ranking throughput at comparable latency to
their software-only solution.
However, despite the potential of FPGAs, the difficulty of implementing applications on the platform
remains the biggest barrier to entry into compute-acceleration markets and greatly limits the areas
where FPGAs have been successfully used. First, the compile time of the design tools is typically hours for
FPGAs vs. minutes or seconds for CPUs and GPUs, lengthening design iterations and reducing designer
productivity considerably. Second, most FPGA designs are currently specified in Hardware Description
Languages (HDLs) such as Verilog and the cycle-accurate description required in such languages is
time-consuming to write.
Overlay architectures for FPGAs seek to alleviate these problems by allowing the designer to target
a software-programmable substrate instead of the underlying FPGA to raise design abstraction and to
reduce application compile times. One class of overlays comprises soft-processors that execute applica-
tions on top of configurable execution units on the FPGA. These overlay processors behave similarly to
CPUs, which are both easier to program and more familiar to software designers. Many such overlay
processors have been proposed in academic work [8–10].
Overlay processors increase designer productivity by eliminating the need to implement application-
specific datapaths in HDL while also providing the flexibility to execute multiple similar applications
without requiring recompilation of the overlay. Software compilation of an application onto the overlay is
fast, usually taking a few seconds compared to hours with direct hardware compilation onto the FPGA.
This abstraction is provided at the cost of lower performance and more area. However, some overlays
provide software tools that can analyze an application and suggest a suitably customized architecture
to reduce this overhead [11,12].
For overlay processors to be compelling, they must have high performance, low area overhead and
be capable of operating on large data sets that reside in off-FPGA memory. In this work, we propose
and evaluate a dual-engine, fetch/compute overlay processor that is designed to achieve high throughput
per area on compute intensive and data-parallel floating-point applications. The architecture also offers
a wide range of customization options to closely match the compute and external memory bandwidth
requirements of different applications. The fetch and compute components are responsible for fetching
external data from off-chip DDR memory and executing compute instructions respectively. We allow
the independent operation and optimization of these two components by decoupling them as much as
possible. The fetch or the Memory Fetcher component is newly proposed as part of this work. For the
compute component, we use an improved version of the TILT overlay processor which was developed by
Ovtcharov and Tili in [13–15].
An alternative way to make computational designs easier to create on FPGAs is to use a High-Level-
Synthesis (HLS) tool which converts a software-like input directly into FPGA hardware [16–19]. As part
of our evaluation, we rigorously compare the hardware efficiency and designer productivity enabled by
our Memory Fetcher and enhanced TILT overlay processor architectures with custom designs produced
by Altera’s OpenCL HLS tool [19]. This work was published in FPT 2014 [20].
1.1 Contributions
The main contributions of this work are:
Memory Fetcher We implement a new configurable and scalable External Memory Fetcher unit
that prefetches only the required data from off-chip DDR memory and loads it into the on-chip data
memories of one or more TILT processors through static compiler analysis of the target application. The
Memory Fetcher and the TILT processors execute their own respective instruction schedules to provide
independent and concurrent off-chip communication and on-chip computation.
TILT Enhancements We extend the TILT architecture developed by Ovtcharov and Tili [13, 14]
in several ways to improve its performance and reduce its area requirements. This includes the addition
of application-specific enhancements to efficiently support loops and indirect addressing.
Predictor We develop a software Predictor tool that quickly enumerates a large design space of possible
TILT and Fetcher configurations to predict the best (set of) architecture(s) for a target application
on Stratix V FPGAs without performing full hardware synthesis of each design.
Comparison with OpenCL HLS We quantitatively compare the performance and scalability of
our best TILT and Fetcher designs with OpenCL HLS implementations of five large memory (i.e. off-chip
DDR required) data-parallel applications. We also compare the compile time and development effort
between these two methods.
1.2 Organization
The work presented in this thesis is organized as follows. We begin in Chapter 2 with a discussion of
related commercial and academic work on overlay processors, HLS tools and memory architectures
for FPGAs. Also in Chapter 2, we introduce the TILT architecture developed by Ovtcharov and Tili
in [13,14] and describe its components that we use and extend in our work. In Chapter 3, we propose our
External Memory Fetcher architecture and evaluate several external memory scheduling approaches. In
Chapter 4, we present and evaluate the architectural enhancements added to the TILT overlay of [13,14].
In Chapter 5, we describe our Predictor tool and evaluate its ability to accurately and quickly predict
the best TILT and Fetcher configurations for a target application. Then in Chapter 6, we compare our
best TILT and Fetcher systems with Altera’s OpenCL HLS tool. Lastly in Chapter 7, we summarize
the work presented in this thesis and propose several areas for future investigation.
Chapter 2
Background
In Section 2.1, we present related work in three main areas: overlay processors, High-Level-Synthesis
(HLS) and memory architectures for FPGAs. Then in Sections 2.2 and 2.3, we describe the TILT
overlay processor and its compiler flow which statically schedules applications written in C onto TILT.
The TILT overlay and its compiler are prior work developed by Kalin Ovtcharov [13] and Ilian Tili [14]
respectively. We make use of the TILT architecture and extend it in the work presented in this thesis. The
overarching goal of the TILT overlay is to take an algorithm description within a C kernel and execute
it on an application-tuned, statically scheduled soft processor to obtain high compute throughput while
minimizing the hardware resource consumption on the FPGA. The TILT architecture promises rapid
application compilation and configuration onto the overlay compared to the time and effort required to
produce equivalent custom, hand-coded HDL or HLS implementations.
2.1 Related Work
2.1.1 Overlay Processors on FPGAs
Overlay processors on FPGAs seek to combine the fast application compile times enabled by their
software-programmable execution units with higher performance and lower area overhead than a basic
soft processor such as Nios II or MicroBlaze can achieve. Nios II [21], the commercial 32-bit processor
designed by Altera for their FPGAs, is capable of supporting floating-point operations using an extension
and has up to 6 pipeline stages. MicroBlaze [22], the competing soft processor for Xilinx FPGAs, supports
floating-point as well and consists of 3 to 5 pipeline stages. The TILT architecture was designed to
support multiple and varied pipeline latencies at the same time, currently with a pipeline latency of up
to 41 cycles.
Several academic works use vector processors as their overlay including VESPA [23], VIPERS [24],
VEGAS [8] and VENICE [9]. These processors operate on vectors of multiple data elements instead of
a single, scalar data item. Of these, VENICE currently achieves the highest throughput; it combines a
scalar Nios II/f processor for control with wide vector lanes feeding multi-function ALUs. The number
of ALUs and vector lanes can be customized depending on the application and its throughput and area
requirements. VIPERS also uses a scalar processor for control and a configurable number of vector lanes.
Unlike VENICE, the functional units (FUs) of the TILT overlay processor operate on scalar data
and perform a single, specialized function, allowing them to be much smaller in area. However, multiple
instances of TILT can be connected to operate on their respective data memories in SIMD (single-
instruction-multiple-data) [25] to obtain higher performance while keeping the individual TILT cores
small and area-efficient. In a manner similar to TILT, VENICE scales best as a multiprocessor system
of small VENICE cores and has compiler support to target the architecture to a given application.
VENICE connects to a standard DMA engine that moves data directly between its on-chip scratchpad
memory and off-chip DDR. The scratchpad is double-buffered with dedicated ports to the DMA to
overlap data movement with computation, eliminate data hazards and to mask long memory latencies.
In this work, we add to the TILT architecture of [13–15] the ability to communicate with off-chip memory
with our new Memory Fetcher unit. Our approach does not require TILT’s scratchpad data memory
to be double-buffered and the off-chip memory transfer operations are interleaved into the compute
schedule using data memory ports that are shared with the FUs.
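The interleaving approach can be sketched with a toy model (the operation names and slot format below are invented for illustration and do not reflect TILT's actual instruction encoding):

```python
def interleave(compute_slots, transfer_ops):
    """Place off-chip transfer operations into compute-schedule slots whose
    shared data-memory port is otherwise idle (None), in program order.
    Transfers that do not fit extend the schedule."""
    transfers = iter(transfer_ops)
    schedule = [slot if slot is not None else next(transfers, "nop")
                for slot in compute_slots]
    schedule += list(transfers)  # leftovers lengthen the schedule
    return schedule

# Toy example: three compute ops with two idle memory-port slots.
print(interleave(["fmul", None, "fadd", None, "fmul"],
                 ["ddr_load", "ddr_store"]))
# → ['fmul', 'ddr_load', 'fadd', 'ddr_store', 'fmul']
```

When the compute schedule has enough idle port slots, the transfers hide entirely behind computation without a second memory buffer.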
The VESPA soft processor also operates on vector data and demonstrates improvement in perfor-
mance with vector chaining, where the output of one functional unit is passed directly to the input
of another that is executing a subsequent vector instruction [26]. However, this requires a more complex
register file with a greater number of read and write ports. In TILT, the array of FUs reads and writes
data to a single data memory. This increases the total latency of operations and the length of
the pipelines since the output of an FU must be written to memory prior to another FU reading that
memory location.
The TILT processor executes multiple threads in parallel to fill the long pipelines of its execution
units, improving their utilization and the overall throughput. A single thread is usually not sufficient
due to the dependencies between operations and because the TILT FUs are deeply pipelined and have
varied latencies across different FU types. The execution of multiple threads is also a more area-efficient
solution than replicating a single-threaded TILT core.
Labrecque et al. in [27] demonstrate how the parallel execution of multiple threads can significantly
improve the performance of pipelined overlay processors with minimal area overhead. Compared to
their single-threaded soft processor implementation, multi-threading improved the processor’s compute
throughput by up to 104% for a 7-stage pipeline and improved the area-efficiency of the same design
by 106%. Similarly, since the TILT architecture is deeply pipelined, much higher throughput and area-
efficient designs can be obtained by executing many threads at once.
Moussali et al. in [28] present micro-architectural techniques to enable overlay processors to efficiently
support multi-threading. Their approaches involve implementing multi-threading capability in hardware.
These include thread schedulers that select the operation to issue based on the latencies of different
operation types and how the threads are to be interleaved; their work supports both fine-grained and
blocked multi-threading. The TILT architecture instead relies on its compiler to statically produce its
multi-threaded instruction schedule to minimize hardware complexity and area.
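The effect of fine-grained multi-threading on a deeply pipelined FU can be illustrated with a small simulation (an illustrative Python model, not TILT's actual scheduler; the 41-cycle latency matches the maximum pipeline depth mentioned above):

```python
def schedule_utilization(num_threads, ops_per_thread, latency):
    """Issue dependent operations from independent threads into a single
    FU pipeline of the given latency, one issue slot per cycle, always
    picking the thread whose next operation is ready soonest. Returns the
    fraction of cycles in which an operation could be issued."""
    ready = [0] * num_threads               # earliest issue cycle per thread
    remaining = [ops_per_thread] * num_threads
    cycle = issued = 0
    while any(remaining):
        t = min((i for i in range(num_threads) if remaining[i]),
                key=lambda i: ready[i])
        cycle = max(cycle, ready[t])        # stall until operands emerge
        issued += 1
        remaining[t] -= 1
        ready[t] = cycle + latency          # next dependent op must wait
        cycle += 1                          # one issue slot consumed
    return issued / cycle

# One thread of dependent ops leaves a 41-cycle pipeline almost idle;
# enough independent threads keep it fully busy.
print(round(schedule_utilization(1, 100, 41), 3))   # 0.025
print(schedule_utilization(41, 100, 41))            # 1.0
```

The single-threaded case achieves roughly 1/41 utilization because each operation must wait for its predecessor to leave the pipeline, while with as many threads as pipeline stages an operation can be issued every cycle.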
An earlier soft processor, called CUSTARD, features the automatic generation of custom datapaths
and instructions intended to accelerate frequently performed computations of the target application
alongside standard operations such as add and multiply [11]. The architectural configuration of CUS-
TARD is determined with static compiler analysis of the application. The TILT architecture does not
generate custom FUs to accelerate software code blocks. Instead, TILT uses a weaker form of application
customization by varying the mix of pre-configured, standard FUs provided by Altera and optionally
generating application-dependent custom units to handle predication and the new hardware features we
describe in this work to support loops and indirect addressing.
CUSTARD also supports multi-threading by storing the state of multiple, independent threads and
context switching between them. In contrast, TILT’s implementation of multi-threading is fine-grained,
since individual operations from different threads can be scheduled together. TILT also does not require
any thread states to be stored or maintained during the execution of its instructions, requiring only a
single program counter (PC).
Soft processors such as VLIW-SCORE execute Very-Long-Instruction-Words (VLIW) stored in on-
chip memory [10], like TILT, instead of relying on a scalar processor like Nios. These instructions specify
a different operation per FU or compute unit per cycle. The architecture’s software compiler is used to
statically find and extract parallelism from the application and to generate the VLIW instructions. This
eliminates the additional hardware complexity and cost of performing dynamic hardware scheduling.
The TILT overlay requires similar compiler scheduling but its instructions are instead classified as
Horizontally Microcoded (HM) [29] since they directly control low-level hardware components such as
multiplexers and require minimal decoding. The main difference between VLIW and HM architectures
is the higher level of abstraction provided by the VLIW instructions. An example of a reconfigurable
HM processor is the No Instruction Set Computer (NISC) [30].
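The abstraction difference can be pictured with hypothetical instruction formats (the field names below are invented for illustration and do not reflect TILT's or NISC's actual encodings):

```python
from dataclasses import dataclass

# A horizontally microcoded word exposes raw datapath controls (mux
# selects, write enables) with essentially no decoding, whereas a VLIW
# slot carries a higher-level operation the FU must still decode.

@dataclass
class HMWord:
    read_addr_a: int    # data-memory read-port A address
    read_addr_b: int    # data-memory read-port B address
    fu_input_sel: int   # crossbar mux select feeding the FU inputs
    write_enable: bool  # commit the FU result this cycle
    write_addr: int     # data-memory write-port address

@dataclass
class VLIWSlot:
    opcode: str         # e.g. "fmul": decoded into control signals by hardware
    src_a: int
    src_b: int
    dst: int

# The same multiply expressed both ways:
hm = HMWord(read_addr_a=4, read_addr_b=5, fu_input_sel=0,
            write_enable=True, write_addr=6)
vliw = VLIWSlot(opcode="fmul", src_a=4, src_b=5, dst=6)
```

The HM form trades wider instruction words for simpler decode hardware, which is why it suits overlays that minimize per-FU logic.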
Saghir et al. in [31] explore the performance and area trade-offs of implementing custom datapaths
and instructions on VLIW overlay architectures. They utilize a standard 4-stage pipeline of fetch, decode,
execute and writeback, and their custom compute units have latencies between 1 and 8 cycles. Their
approaches range from augmenting a standard FU with custom logic to implementing fully custom units.
They show that there is a trade-off between performance and area cost, with both increasing with higher
degrees of customization. Another important trade-off of implementing custom units is that the overlay
becomes more application-specific. While this benefits the target computation, it also reduces the range
of different applications the overlay can target.
2.1.2 High-Level-Synthesis Tools for FPGAs
An alternative to overlay architectures is the use of High-Level-Synthesis (HLS) tools, which translate a
software description of an application into custom FPGA logic, providing high-abstraction FPGA programming.
Just as with overlays, this increases designer productivity vs. Hardware Description Language (HDL) de-
sign flows which require low-level, cycle-accurate specification of the application. Several HLS techniques
have been proposed including academic works such as FCUDA [16] and LegUp [17], commercial products
such as Vivado HLS for Xilinx FPGAs [18], as well as compilers targeting OpenCL for Altera [19] and
Xilinx [32] FPGA families.
Overlay architectures provide fast application configuration and the flexibility to execute a range of
different applications without requiring the overlay to be recompiled. In contrast, HLS tools require full
recompilation of the application into hardware after any code change. Small changes in the input code
can also lead to large differences in the system area and performance so many design iterations are often
necessary to fully optimize a system. Taken together, the combination of many design iterations and
long compile times can significantly increase development time compared to using an overlay. However,
by generating custom logic, HLS tools generally offer higher performance than overlay processors.
OpenCL is a popular high-level computing language that enables parallel programming across het-
erogeneous platforms [33]. The OpenCL programming model separates the application into two parts.
The first is the serial host program that executes on a processor and is responsible for managing data
and control flow. The host offloads the second, compute-intensive portion, defined within kernels,
onto accelerator(s) such as CPUs, GPUs and recently FPGAs [19].
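The host/kernel split can be pictured with a minimal sketch (plain Python standing in for the OpenCL model; the kernel and launch loop below are conceptual, not real OpenCL API calls):

```python
def vec_add(gid, a, b, out):
    """Kernel: each work-item, identified by its global id, computes one
    output element; the platform runs many work-items in parallel."""
    out[gid] = a[gid] + b[gid]

# Host: a serial program that prepares buffers, launches the kernel over
# an index space (an "NDRange" in OpenCL terms) and collects the result.
n = 8
a = list(range(n))              # [0, 1, ..., 7]
b = list(range(n, 2 * n))       # [8, 9, ..., 15]
out = [0] * n
for gid in range(n):            # conceptually parallel on the accelerator
    vec_add(gid, a, b, out)
print(out)                      # [8, 10, 12, 14, 16, 18, 20, 22]
```

The key property is that the kernel body is data-parallel across work-items, which is what allows it to be mapped onto wide accelerator hardware.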
LegUp [17] is an academic HLS tool that compiles a standard C program onto a host/accelerator
model similar to OpenCL. However, the partitioning of the host and kernel is done automatically by the
LegUp compiler. The host executes on an FPGA-based 32-bit MIPS soft processor and communicates
with the custom accelerator using a standard on-chip bus interface. This means TILT can be easily inte-
grated into LegUp as an accelerator by connecting TILT’s external data/address bus to the MIPS host,
enabling faster application compilation onto the overlay and requiring fewer accelerator recompilations.
CUDA is a language for expressing parallel applications on Nvidia GPUs [34] that also shares the
host/accelerator model. FCUDA [16] transforms CUDA kernels to parallel C code for AutoPilot HLS [35]
which translates the code into custom FPGA logic. The authors demonstrate their FCUDA flow to have
competitive performance on Virtex 5 FPGAs, outperforming Nvidia’s G80 GPU in some cases. Studies
including [36] demonstrate CUDA’s slightly higher performance when compared with OpenCL. However,
they also show it is relatively easy to translate CUDA programs to OpenCL and that OpenCL is portable,
achieving good performance on other platforms with only minor code modifications. This makes OpenCL
a compelling platform to compare against C-to-FPGA approaches such as overlays.
Stitt and Coole compile OpenCL applications to a spatial pipeline on their pre-compiled intermediate
fabrics (IFs) composed of fixed coarse computational resources and configurable interconnect instead of
directly targeting the underlying FPGA [12]. This approach is similar to TILT since the IF is customized
to the requirements of the kernel and allows rapid kernel compilation and reconfiguration (seconds vs.
hours) while incurring a performance penalty and area overhead compared to direct OpenCL synthesis.
However, as we show in Section 6.5.3, TILT also enables smaller designs than OpenCL HLS when a
lower throughput is adequate, allowing a larger range of throughput and area solutions to be explored.
In Section 6.5, we compare the performance, area, development effort and scalability of our TILT
overlay processor and Memory Fetcher architectures with Altera’s OpenCL HLS tool. As a part of
our evaluation, we also quantitatively explore the strengths and weaknesses of these two different
C-to-FPGA approaches.
2.1.3 Memory Architectures on FPGAs
Conventional microprocessor architectures attempt to reduce the average memory access latency of the
processor at the expense of memory bandwidth by prefetching entire cache lines of spatially or temporally
local data from a larger, off-chip memory. FPGAs provide a large amount of on-chip compute and routing
bandwidth but are bandwidth- and latency-limited at the FPGA-to-external-memory boundary.
Here we present a survey of memory architectures that bring data onto the on-chip memories of the
FPGA from external sources such as DDR.
Several works have explored the merits of building memory architectures on FPGAs using caches
and/or scratchpads [37–40]. Caches are a more complex data storage solution, requiring the additional
storage of tags, which are used to determine what data is present in the cache. Data is fetched proactively
based on the application’s memory accesses, usually with adjacent data items being fetched as well.
Scratchpads are simpler structures with a lower area cost that store data items without tags and require
the data movement to be explicitly managed; this is the method adopted by TILT’s data memory.
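The bookkeeping difference between the two storage styles can be sketched as follows (an illustrative software model with invented sizes, not any of the cited hardware designs):

```python
class DirectMappedCache:
    """Tagged storage: the hardware must record which external address
    each line holds, and compare tags on every access."""
    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.tags = [None] * num_lines   # extra storage a scratchpad avoids
        self.data = [None] * num_lines

    def access(self, addr, memory):
        line, tag = addr % self.num_lines, addr // self.num_lines
        if self.tags[line] != tag:       # miss: fetch implicitly from memory
            self.tags[line], self.data[line] = tag, memory[addr]
            return self.data[line], "miss"
        return self.data[line], "hit"

class Scratchpad:
    """Untagged storage: software (or a static schedule, as in TILT)
    moves data in explicitly, so no tags or hit/miss logic is needed."""
    def __init__(self, num_words):
        self.data = [None] * num_words

    def load(self, slot, addr, memory):  # explicit, compiler-scheduled move
        self.data[slot] = memory[addr]

    def read(self, slot):                # deterministic, single-latency access
        return self.data[slot]
```

Because the scratchpad has no tags or comparators, its area cost is essentially that of the data array alone, at the price of requiring every transfer to be scheduled explicitly.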
Kalokerinos et al. [37] propose a memory system that can be configured to support both implicit
and explicit memory management via caches and scratchpads respectively. Caches are used where it
is not known in advance when or what data will be needed, which means the memory accesses have
non-deterministic timing. The reverse is true for the directly-addressable scratchpad memories, which suit
systems that need direct control and optimization of the data placement and memory transfers.
Panda et al. [38] propose a processor system that combines a small scratchpad memory with an
on-chip data cache which interfaces with an external memory. Application data is partitioned to either
reside in the external memory (such as large arrays of contiguous data) or the on-chip scratchpad (scalars
and constants) with the goal of minimizing the total execution time of the application. The scratchpad
provides a single cycle access latency guarantee while the data cache can take between 1 and 20 cycles
depending on whether the access is a hit or a miss.
Choi et al. [39] and Putnam et al. [40] implement multi-ported and multi-banked cache-based memory
architectures on an FPGA and evaluate their performance, area and power consumption. Choi et al.
found that implementing cache design parameters such as different cache line sizes, associativities and
port counts on an FPGA has a high resource and performance cost, partly because of the multiplexing
networks that must be implemented.
The Reconfigurable Data Cache (RDC) [41] presents a caching mechanism tailored for FPGAs. It
consists of an on-chip data cache and a speculation unit that are automatically configured and generated
by its compiler based on the predominant access patterns of the application and inserted between the
application and the external memory. The speculation unit intelligently fetches only the data items
that are needed by the application into the data cache instead of entire cache lines. This is intended
to minimize the communication with off-chip memory. In our approach, the Memory Fetcher uses
the compiler to determine the application’s external memory accesses and statically schedules these
operations to execute in parallel to the TILT processor(s). We do not need to implement hardware logic
to speculatively prefetch data and we can guarantee only the necessary data is fetched.
RDC’s speculation unit is ineffective for applications with irregular access patterns and for applica-
tions that operate on large chunks of data at a time. A single data cache with a single arbitrated channel
to the application also makes it difficult to exploit any data and compute parallelism that exists in the
application. Stitt et al. in [42] have looked at how cache-based memory architectures can efficiently
support irregular memory accesses such as the traversal of pointer-based data structures. Since the
Memory Fetcher’s operations are statically scheduled, we can support any memory access pattern but
lose the flexibility of any dynamic behaviour. This is acceptable since the TILT processor is statically
scheduled as well. The Memory Fetcher also supports multiple, independent data streams that can be
accessed by several TILT cores in parallel, allowing multiple data items to be accessed by each TILT
core at the same time.
Unlike RDC, FCache [43] implements multiple coherent and spatially distributed data caches con-
nected by a mesh interconnect network. Instead of being centralized, coherency and consistency man-
agement is distributed across all caches. The caches are connected to off-chip memory via an arbitrated,
custom shared memory controller. Having multiple caches that can service an application in parallel is
distinct from the singular cache of RDC. FCache tries to exploit the on-chip bandwidth of the FPGA
but is limited by the memory controller and arbitration to external memory. FCache also has a much
higher area overhead compared to RDC.
The CoRAM memory architecture [44, 45] attempts to exploit the spatially distributed fabric of
the FPGA by aggregating some of its Block RAMs (BRAMs) into small ‘CoRAMs’ that store off-chip
data in a highly distributed fashion. This allows applications to access these memories in parallel and
independently of each other just like BRAMs. CoRAM provides ‘control threads’ that the application
uses to explicitly manage the movement of data to and from the CoRAMs and off-chip memory. These
threads are implemented in C alongside the application. Thus, developers must be intimately familiar
with how their application consumes and produces data and must explicitly describe this data flow for
the lifetime of the application. This burden is placed on the compiler instead in our approach.
The guiding principle of LEAP [46] is the polar opposite of CoRAM's. It aims to provide a complete
abstraction of memory and its management from the application developer. LEAP allows the designer
to plug in memory components such as scratchpads or caches based on the properties that best fit their
application needs while providing the same communication interface as that of a BRAM. The abstraction
provided by LEAP means the LEAP memories cannot guarantee to the application how many cycles
a memory operation will take. LEAP memories fetch data from external memory after the application
issues the memory operation. The application has no way to know if the data is already present in the
on-chip LEAP memories or has to be fetched from external memory. This means applications designed
with LEAP need to be timing insensitive. With RDC or CoRAM, applications can be timing driven.
TILT is timing sensitive, but since off-line compiler analysis and scheduling tell us what data the
processor will need and when, we are able to prefetch the data from off-chip memory ahead of
time to cover the long external memory access latency. This data is stored in buffers inside the Fetcher
which behaves as an intermediary between the TILT cores and off-chip memory. The buffers allow the
Fetcher to provide deterministic memory access latency to the TILT cores similar to accessing an on-chip
BRAM. If, however, the intermediate buffers approach empty or full, the Fetcher preemptively stalls
the TILT cores, preventing them from executing further compute instructions.
2.2 TILT Overlay Processor
Figure 2.1: The TILT architecture, consisting of banked, multi-ported data memory and single-precision floating-point FUs that are connected by crossbar networks [15].
TILT (Thread and Instruction Level parallel Template architecture) is a highly configurable overlay
compute engine for FPGAs with multiple, varied and deeply pipelined 32-bit single-precision, floating-
point functional units (FUs) [13, 15]. TILT supports the parallel execution of multiple independent
threads with each thread capable of issuing multiple operations per cycle to obtain high utilization of
its FUs. The threads perform the same computation and do not communicate data between them. As
shown in Figure 2.1, TILT has read and write crossbar networks that connect the array of FUs to an
explicitly managed banked, multi-ported data memory built using on-chip BRAMs [47].
TILT relies on static compiler instruction scheduling to reduce hardware complexity [14] and does
not require forwarding logic or dynamic data hazard detection. Data is always read from TILT’s data
memory and the pipeline executes instructions without any stalls. It is the responsibility of the compiler
to create an instruction stream that separates data-dependent operations by enough cycles such that
any data required by a later operation is produced and written back to the data memory beforehand.
The TILT instructions are stored within an on-chip memory, also built using BRAMs.
We briefly describe the hardware components of the TILT processor shown in Figure 2.1 and how
they can be configured in Section 2.2.1 below.
2.2.1 TILT Components
Functional Units
The pipelined, single-precision, floating-point FUs that are supported by the TILT architecture of [13]
and their latencies are summarized in Table 2.1. The custom Cmp FU is used to support if-else control
flow instructions in the application code. This is done with predication [48] which we describe in Section
4.1.4. The remaining FU types are standard, pre-configured FUs that are generated using Altera’s
megafunction wizard. The Quartus settings used are provided in Appendix A. All FUs are IEEE-754
compliant and have up to two 32-bit scalar input operands and a single scalar output. TILT’s FU mix
can consist of any combination of these FUs with multiple instances of each type as required by the
target application.
FU Type            AddSub  Mult  Div  Sqrt  Exp  Log  Abs  Cmp
Latency (cycles)        7     5   14    28   17   21    1    3

Table 2.1: Listing of the single-precision, floating-point FUs that are supported by the TILT architecture of [13] and their latencies.
Data Memory
(a) Data memory with 2 banks, each with 2 read and 1 write ports.
(b) A single data memory bank with 3 read and 2 write ports.
Figure 2.2: Illustration of TILT’s data memory organization.
TILT’s data memory holds 32-bit floating-point data and is organized into memory banks as illus-
trated in Figure 2.2. Each memory bank is composed of two or more BRAMs that hold the same data to
support multiple concurrent reads and writes. This is because each BRAM is dual-ported, providing
only two ports, each of which can service either a read or a write. As an example, a memory bank with 2 reads and
1 write requires 2 BRAMs to be connected as shown in Figure 2.2(a). When an FU writes to this bank,
the data item will be committed to both BRAMs. The duplication of data into two BRAMs allows two
independent reads, one from each of the physical BRAMs. For each additional read or write port, the
number of BRAMs per bank increases in the manner illustrated by Figure 2.2(b) – the BRAM count is
the product of the number of read and write ports. The example assumes a single BRAM is big enough
to store the contents of an entire bank. Threads are evenly assigned to the available memory banks with
the data of each thread having its own address space in a single bank.
Read and Write Crossbars
The read and write crossbar networks route input data from the memory banks to the input operands
of FUs and FU outputs to the data memory respectively. These crossbars are pipelined to improve
TILT’s operating frequency (Fmax) and are fully connected, giving FUs read and write access to every
memory bank, but not to all the individual BRAMs that comprise the bank. For this work, the read and
write latencies are 8 and 5 cycles respectively. The connectivity of the crossbars is illustrated in Figure
2.3. Figure 2.3(a) presents a simple TILT configuration with 2 FUs and 2 memory banks, each with 2
read and 1 write ports. With this configuration, both FUs can issue and complete a compute operation
simultaneously as long as they read from and write to different banks at every cycle.
(a) 2 memory banks, each with 2 read and 1 write ports.
(b) 2 memory banks, each with 4 read and 1 write ports.
Figure 2.3: Illustration of TILT’s read and write crossbar connectivity.
Increasing the read or write port counts increases the number of concurrent operations that can
be issued or completed by the FUs from the same memory bank. In Figure 2.3(b), 2 additional read
ports allow both FUs to concurrently read data from the same memory bank. While this configuration
still only supports 1 write per bank and hence can only complete one operation per bank each cycle,
additional read ports can potentially improve TILT’s computational throughput since operations issued
on the same cycle may writeback to memory on different cycles due to the widely varying pipeline depths
of the FUs (between 1 and 28 cycles). However, adding more read and/or write ports also increases the
memory storage requirements (BRAMs required) and the size of the crossbar networks. The appropriate
mix of FUs, thread count and data memory organization to use is largely dependent on the target
application and the designer’s performance and area requirements.
Instruction Memory
As shown in Figure 2.4, TILT’s compute instructions encode the operations that each of the generated
FUs must execute on a given cycle. Each FU operation contains 2 read operand addresses and 1 write
address for reading input data and writing the results of the FUs to TILT’s data memory. The read and
write bank ids identify the memory bank to access. The opcode is used by the AddSub and Cmp FUs
to determine their mode of operation and the thread id is used solely by the Cmp FU. We describe the
operation of the Cmp FU in Section 4.1.4. Finally, the valid bit of each FU’s operation indicates if an
operation has been scheduled for the FU at that cycle. The TILT architecture of [13, 14] required the
encoding of each FU type in the instruction to be the same. We relaxed this requirement when adding
new FU and operation types to TILT.
Each TILT compute instruction (one per cycle) concatenates one operation per FU, with the fields:
valid (1 bit), opcode (5 bits), thread id (log2(threads)), read bank id (log2(banks)), write bank id
(log2(banks)), Operand A read address (log2(bank depth)), Operand B read address (log2(bank depth))
and result write address (log2(bank depth)).
Figure 2.4: The encoding of the TILT compute instruction.
The instruction memory stores the schedule generated by TILT’s compiler in its entirety. Since each
thread executes its own set of compute operations that are scheduled onto TILT, adding more threads
typically increases the length of TILT’s schedule, hence requiring a deeper instruction memory. However,
the parallel execution of more threads also improves TILT's compute throughput. As illustrated
in Figure 2.4, the width of each FU operation depends on TILT’s architectural parameters such as
the number of threads, memory banks and the depth of each bank. Similarly, the width of the TILT
instruction increases with each additional FU. This means larger TILT configurations will also require
a bigger instruction memory.
2.2.2 TILT Customization
The TILT architecture is designed to achieve high computational throughput per unit area on data
parallel applications. We accomplish this by customizing TILT’s architectural parameters such as its
FU mix, organization of banked data memory, number of threads and number of operations that can issue
or complete in parallel, to closely match the compute requirements of the target application. TILT’s
throughput can be increased by adding more TILT resources to allow more operations to execute in
parallel. However, this is obtained at the expense of increasingly higher area cost with diminishing gains
in compute throughput. Conversely, TILT can be configured to be smaller as needed but its throughput
will be reduced as well.
We can further customize TILT by allowing certain FUs to share both their operation field in TILT’s
instruction and their read and write ports with another FU to reduce the size of the instruction memory
and the crossbars. In this case, only one of the two FUs can issue and/or complete an operation per
cycle. TILT’s compiler ensures that there are no scheduling conflicts. A single bit in the operation
field’s opcode (Figure 2.4) is sufficient to indicate to TILT which FU the operation targets. Although
this operation field sharing increases contention between the two FUs for resources, which can degrade
compute throughput, it can result in an overall improvement in throughput per unit area if one or both
of the FUs are underutilized or if the operations of one of the FUs must precede those of the other due
to data dependencies.
A more effective way to increase TILT’s performance beyond that which can be obtained with a single
area-efficient TILT core is by instantiating multiple copies of the core, composed of the data memory,
crossbars and FU array. All TILT cores share a single instance of the instruction memory and execute in
parallel in SIMD [25]. We call this architecture TILT-SIMD. As we will demonstrate in our evaluation
of the TILT architecture in Chapter 6, this approach allows the individual cores to remain computationally
dense (have high throughput per area) while achieving near linear increase in total compute throughput
up to a certain core count.
The wide range of configuration parameters and options constitute a very large design exploration
space. They also provide a large degree of flexibility to tailor the TILT architecture to an application
and the overall system based on the compute requirements and area budget.
2.3 TILT Compiler
The TILT compiler flow is responsible for generating TILT’s static instruction schedule from an input
application kernel. The kernel defines the computation of a single TILT thread. The flow of an example
kernel is shown in Figure 2.5. Given the TILT configuration and the target application, we want to find
a schedule with the highest compute throughput, identified by the shortest schedule, while respecting
TILT’s architectural limitations. The flow consists of four main stages, starting with the generation
of the Data Flow Graph (DFG) which defines the application’s compute operations and their data
dependencies. The second stage allocates TILT’s data memory and in the third stage, the operations
of the DFG are scheduled onto TILT based on the processor’s hardware configuration. Finally, the
application is executed on the target TILT configuration after it is synthesized and loaded onto the
FPGA using Quartus. These stages are described in the sections below.
2.3.1 DFG Generation
The Data Flow Graph Generator (DFGGen) is implemented as an LLVM compiler pass [49] that produces
a custom DFG description of the application kernel from its intermediate representation (IR) [50]. The
IR is generated by the LLVM frontend compiler [51] from the application’s C source code. The DFGGen
tool requires the kernel computation to be inside a single C function, similar to a kernel in OpenCL [33].
Variables must be declared inside the C function or they can be passed into the function as parameters.
The nodes of the DFG are the supported compute operations of Table 2.1 such as add or compare. The
predecessors of a node are operations that must execute before it, usually to produce its input operands.
Similarly, operations that depend on the node are its successors in the DFG. Each TILT thread is a
separate instance of the DFG.
Figure 2.5: The TILT compilation flow for an example application. The FU and crossbar latencies were reduced for illustration.
The input kernel must consist of only the floating-point operations that TILT supports. Loops or
any integer operations are not supported by the TILT architecture of [13, 14]. Any loops that exist in
the kernel must be manually unrolled and array indices must be constants. Indirect addressing, where
data in memory is used to address into another data entry in memory, is also not supported. Conditional
if-else branching is supported with predication [48] which we describe in Section 4.1.4.
2.3.2 Memory Allocator
The TILT compiler developed by Tili in [14] statically allocates data memory for the DFG after it
is generated from the application kernel and before it is scheduled onto the target TILT architecture.
Since TILT threads are separate, persistent instances of the DFG that execute in parallel, the allocated
address space is replicated for each thread, as shown in Figure 2.5. The LLVM IR uses SSA (Static-
Single-Assignment) [52] to name the temporary registers used in the computation. This means the
output of every operation is written to a new register with a unique name.
Instead of allocating a different memory location for each register in the LLVM IR, Tili’s Memory
Allocator reassigns the location to another register after the previous one is no longer referenced by any
compute operations. This reassignment is performed based on the dependencies defined by the DFG
only and without making any assumptions on how the DFG will be scheduled. Therefore the algorithm
cannot optimize memory usage across parallel dependency chains. Memory variables referenced in the
C kernel are allocated their own memory locations and do not get reassigned so they persist in memory
for the lifetime of the schedule. For pointer variables, the algorithm only assigns locations in TILT’s
data memory to regions that are accessed by the computation.
2.3.3 Compute Scheduler
For our work, we use Tili’s Grouped Prioritized Greedy Mix Longest Path (GPGMLP) algorithm [14] to
generate the compute schedule of an application that will be executed on the target TILT architecture.
We also extend this algorithm to construct one of our Fetcher memory scheduling approaches, which
we describe in Section 3.3.1 and evaluate in Section 3.5. The algorithm takes as input the DFG of the
application, the number of threads to schedule and TILT’s hardware configuration which specifies its
mix of FUs and their latencies and the organization of data memory. The objective of GPGMLP is to
produce the shortest (or densest) compute schedule while adhering to the hardware resource constraints
of the processor. The GPGMLP algorithm combines several heuristics to obtain an efficient schedule
which we describe next. The pseudo-code of the algorithm is provided in Algorithms 1 and 2.
Algorithm 1: Tili's Grouped Prioritized Greedy Mix Longest Path (GPGMLP)
Input: DFG, hwTILTConfig, groupSize
Output: TILT compute schedule
1  set1 = longest path operations plus their predecessors;
2  set2 = remaining operations in DFG;
3  call Algorithm 2 on set1;
4  call Algorithm 2 on set2;
Greedy. The algorithm greedily schedules compute operations at the earliest possible cycle, starting
with operations that have no predecessors (lines 7 to 9 in Algorithm 2). These are operations that do
not require any other operation to be scheduled first and can be scheduled immediately. After every
iteration of the scheduling step, the algorithm checks for newly ready operations whose predecessors
have all been scheduled and marks them as ready to be scheduled (lines 21 to 24).
Operation Priority. When there are several operations that are ready to be scheduled, the algo-
rithm grants priority to the operation with the lowest slack time. The slack time of an operation is the
cycle difference between scheduling the operation As-Late-As-Possible (ALAP) and As-Soon-As-Possible
(ASAP). ASAP schedules operations as soon as their predecessors have finished executing while ALAP
schedules operations as late as possible but without increasing the total schedule latency beyond that
of the ASAP. The generation of the two schedules takes into account the dependencies of operations
and the FU latencies, but TILT's hardware resource constraints are ignored to simplify the calculation.
This slack-based ranking means the ready operations most likely to increase the total length of the
schedule if delayed are scheduled first.
Grouped Mix Thread Scheduling. The algorithm schedules operations from multiple threads in a
round-robin fashion instead of scheduling each thread one after the other. This allows the higher priority
operations from across threads to be scheduled before the lower priority operations. This generates a
denser, higher-throughput compute schedule and has the added benefit of requiring a smaller
Algorithm 2: Tili's Grouped Prioritized Greedy Mix
Input: DFG operations to schedule, hwTILTConfig, groupSize
Output: TILT compute schedule
1  create ASAP schedule for DFG operations;
2  create ALAP schedule for DFG operations;
3  calculate slack (ALAP - ASAP) of DFG operations;
4  create readyList[] with an empty readyList entry for each thread;
5  foreach group of threads do
6      for t = first thread of group to last thread of group do
7          foreach operation in DFG do
8              if operation has no predecessors then
9                  insert in readyList[t];
10     cycleCounter = 0;
11     while readyList[] not empty do
12         sort readyList[] entries by priority for each thread of group;
13         for i = 0 to (maximum list size across all readyList[] entries of group) - 1 do
14             for t = first thread of group to last thread of group do
15                 if readyList[t] contains an operation at index i then
16                     if earliest cycle operation can start <= cycleCounter then
17                         schedule operation in earliest free spot;
18                         mark operation for removal from readyList[t];
19         remove marked entries from readyList[] for all threads of group;
20         cycleCounter = cycleCounter + 1;
21         for t = first thread of group to last thread of group do
22             foreach operation in DFG not already scheduled for t do
23                 if all predecessors are scheduled then
24                     insert in readyList[t];
instruction memory to store the shorter schedule. However, instead of scheduling all threads together,
they are scheduled in groups, with each thread group being scheduled completely before scheduling the
next group of threads.
Tili's algorithm tries all possible group sizes between 1 and the total number of threads and selects
the schedule with the shortest length. With a group size of 1, each thread is scheduled in sequence.
At the other extreme with a group size equal to the number of threads, all threads are scheduled in
parallel. For this work, we found trying only the sizes that are multiples of 2 (including 1 and the total
number of threads) is a good compromise that improves the algorithm’s runtime significantly without
noticeably penalizing the achieved throughput.
Longest Path. As the algorithm’s final heuristic, it schedules for all threads the operations that
are part of the longest dependence chain within the DFG before scheduling the remaining operations
(see Algorithm 1). The length of a chain is defined by the sum of the latencies of each operation within
the chain. The algorithm separates the DFG into two sets of operations. The first set comprises the
operations in the longest path in addition to all the operations they depend on and is scheduled first.
The second set consists of the remaining operations and is scheduled last. Since delaying operations
within the longest path is more likely to increase the total length of the schedule, it makes sense to
prioritize the scheduling of this path over other operations.
For our example TILT application in Figure 2.5, the longest path contains the compute operations
{B, D, E}. These operations and operation C (due to being the predecessor of E) are scheduled first.
Operation A of the third and fourth threads is then scheduled on the first and second cycles, ahead of
that of the first two threads, because the ports of the first memory bank (which holds the data of
threads 1 and 2) are in use by the Mult FU on those cycles.
2.4 TILT Execution Model
The TILT architecture targets data-parallel, compute-intensive, floating-point applications. We schedule
multiple threads (or instances of the application kernel) onto a TILT core and multiple TILT cores can
then execute the same schedule in SIMD. Further, we can configure TILT to re-execute its schedule after
reaching the end of the instruction stream. In this scenario, TILT’s application kernel represents the
computation inside an implicit, external loop. To keep the size of the instruction memory small, this is
the preferred use case. For our example kernel in Figure 2.5, each of the four scheduled TILT threads
computes an element of the out vector. With 8 TILT cores, groups of 32 elements can be computed
together. If we assume a vector length of 512, we will need to re-execute the TILT schedule a total of
16 times to compute the entire vector.
TILT maps well to applications where each thread performs coarse-grain computation on a set of inputs
(to mask long access latencies to off-chip memory) and where the same computation is performed on a
large data set. Example applications include image and video processing, such as Mandelbrot [53] and
HDR [54] and simulation of neurons with the Hodgkin-Huxley algorithm [55]. Applications such as the
dot product of a vector can be supported on TILT only if each dot product is computed entirely by a single thread.
This way, the dot product of multiple vectors can be computed in parallel. Since TILT threads and
cores do not communicate, the dot product of a single vector cannot be distributed to multiple threads
and cores due to the reduction step required at the end of the computation.
Chapter 3
External Memory Fetcher
The TILT architecture and compilation flow developed by Ovtcharov and Tili in [13] and [14] do not
include a mechanism to communicate with off-chip DDR memory. The authors assumed all off-chip
input data was either already present or would somehow be brought into TILT's relatively small data
memory before it was needed by the computation. The scheduling of the application onto TILT and
their throughput measurements did not account for the accesses to off-chip memory, their relatively high
latencies or the performance penalties that may be incurred from these accesses. Only the compute
portion of the applications was scheduled and evaluated. Ovtcharov and Tili have focused primarily on
exploring the best way to compute using soft processors on an FPGA by evaluating their architecture
with compute intensive benchmarks. A parallel study conducted by Charles LaForest that led to the
Octavo soft-processor also followed from this research direction [56].
Not implementing the hardware to communicate with off-chip memory and not accounting for when
the external data will arrive yielded a simpler and more area-efficient TILT design. It has also enabled the
statically scheduled processor to be optimized for fast computation on its local data memory and achieve
a high execution rate of compute operations per cycle. However, this is not an accurate representation
of a complete and usable soft-processor system. To evaluate a more compelling system on realistic, large
memory applications, we have designed a separate Memory Fetcher unit to efficiently move data between
multiple TILT cores and off-chip DDR memory.
3.1 Possible Design Solutions
We begin with a discussion of the different approaches we have considered to move data between off-chip
memory and TILT’s data memory. First, we can integrate the function as part of the TILT processor by
adding external load/store memory operations to directly access off-chip memory and implement stall
logic to wait for the data to arrive. TILT’s data memory will then behave as a local scratchpad on which
the processor would compute. However for our statically scheduled TILT processor, we cannot easily
interleave external loads and stores between compute operations without also incurring a significant
performance loss due to the indeterminism of the data arrival time. The latencies of these external
memory operations are also much larger than the latencies of the compute operations that access TILT’s
data memory. This means allowing TILT to switch between moving data and computation will also
significantly degrade the processor’s compute throughput. This is because we will need to wait for the
Chapter 3. External Memory Fetcher 19
long memory transfers to complete before being able to resume computation on the new data.
Alternatively, we can leave the TILT processor as is and implement a separate entity that is re-
sponsible for efficiently moving data between off-chip memory and TILT’s data memory while the TILT
processor performs computation. This can be accomplished with some or all of the read and write ports
of TILT’s data memory being connected to external memory via the new entity. This effectively sepa-
rates the concerns of communicating with off-chip memory and performing computation, allowing each
component to be optimized to perform its respective task. This is the approach we have adopted.
Another approach we considered was to make TILT’s data memory twice as deep and have twice
as many ports so that TILT can communicate with off-chip memory and compute at the same time by
operating on different address spaces. This approach is known as double-buffering and is similar to
the solution adopted by the VENICE soft-processor [9]. It requires TILT’s data memory ports to be
connected to both the external memory interface and TILT’s FU array. At the end of each iteration of
computation and data movement, there will be a synchronized switch. We found the resource cost of
this approach to be high as TILT’s crossbar grew quadratically with every additional data memory port.
These ports are also underutilized since they are either being used for computation or to communicate
with off-chip memory. It also assumes the time to gather and scatter the necessary external data is
comparable to the time it takes to perform computation, which is not usually the case.
Instead, we can either use dedicated data memory ports to communicate with external memory or
share the available ports with the external memory interface via the TILT crossbars as they are currently
shared with the TILT FUs. The read and write ports of the data memory are shared with TILT’s FU
array since not all FUs will execute operations or read and write to memory at every cycle. This is just
as true with reading and writing data to off-chip memory. Hence, connecting some of the data memory
ports directly to TILT’s external memory interface will result in the underutilization of these ports while
also requiring a larger data memory to support more ports.
For these reasons, in our chosen solution, external memory operations access the same data memory
address space as the compute operations while TILT concurrently performs computation. Additionally,
the available data memory ports are shared with TILT’s FU array as well as its external memory interface
via the crossbars. We are able to vary the number of data memory ports and crossbar outputs that
are connected to TILT’s external memory interface (external ports) independently of one another. How
many of each there should be is dependent on the computation and its external data bandwidth.
3.2 Chosen Design Solution
Taking into consideration the above design choices, we have implemented a separate Memory Fetcher
unit, with its fundamental design elements illustrated in Figure 3.1. In the figure, TILT-SIMD comprises
an array of TILT compute units which are connected to the Fetcher to produce the TILT-System. The
purpose of the Fetcher is to efficiently move data between these TILT cores and off-chip memory while
the TILT cores compute on their local data memories. Like TILT-SIMD, the Fetcher is fully-pipelined
and is capable of reading and writing data from/to the TILT cores and off-chip memory at every cycle.
Internally, the Fetcher contains data FIFOs that act as intermediate buffers between the data mem-
ories of the TILT cores and off-chip memory. These buffers are present to mask the long latencies and
indeterminism inherent with external memory accesses and to decouple the communication with off-chip
memory from the computation as much as possible. The long latencies and non-deterministic timings
Figure 3.1: TILT-System - TILT-SIMD connected to off-chip DDR via the Memory Fetcher.
of the external memory accesses are hidden by buffering the off-chip input data needed by future TILT
compute operations into these FIFOs ahead of when these operations will be executed by the TILT
cores. This enables the Fetcher to behave as a loosely-coupled, run-ahead fetch unit.
How much slack there will be depends on several factors including TILT’s compute schedule and
its external data bandwidth, the depth of the data FIFOs and the rate at which off-chip data can be
consumed and produced by both TILT-SIMD and the Fetcher. Ideally, we want the Fetcher to be able
to consume data at the maximum rate at which TILT-SIMD can produce off-chip outputs while also
matching the rate at which TILT-SIMD requires input data. Finding the appropriate balance among
these parameters will maximize compute throughput while minimizing processor stalls and the hardware
area cost of the Fetcher and TILT-SIMD. The data FIFOs also perform clock domain crossing between the
TILT-SIMD and the DDR controller clocks. The relevant Quartus settings used for the DDR controller
are provided in Appendix B.
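As a back-of-envelope illustration of the balance described above, the following Python sketch (our own simplification, not part of the TILT tool flow) estimates a lower bound on incoming-FIFO depth from an assumed DDR round-trip latency and TILT-SIMD's steady-state consumption rate:

```python
from math import ceil

def min_fifo_depth(ddr_latency_cycles, words_per_cycle):
    """Back-of-envelope lower bound on incoming-FIFO depth: while one
    DDR burst is in flight, TILT keeps draining the FIFO, so the buffer
    must hold at least latency * drain-rate words to avoid a stall
    (steady-state approximation; ignores burst granularity)."""
    return ceil(ddr_latency_cycles * words_per_cycle)

# e.g. a (hypothetical) 200-cycle DDR round trip with TILT-SIMD
# consuming one 256-bit word every 4 cycles needs at least 50 entries
print(min_fifo_depth(200, 0.25))  # -> 50
```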
The Fetcher’s data FIFOs provide a deterministic external memory access latency guarantee to the
TILT cores. This works well for the statically scheduled TILT processor as external data can be moved
into and out of the TILT cores parallel to the computation without requiring changes in the TILT overlay.
The Fetcher writing to TILT’s data memory works the same way as an FU writing its computed result
to data memory. Similarly, reading data from TILT’s data memory into the Fetcher’s outgoing data
FIFOs is comparable to reading data into the input operands of an FU.
Conversely, moving data between the data FIFOs and off-chip memory can remain non-deterministic
with respect to timing. Further, several DDR read and write bursts can be enqueued together to fetch
external input data needed by future compute operations into the FIFOs and flush buffered computed
results to external memory from the FIFOs. The depth of the FIFOs and burst sizes can be varied to
improve DDR bandwidth and mask latency, providing optimized communication with off-chip memory
and minimizing processor stalls.
Each TILT core’s external memory interface consists of a configurable number of external read and
write ports, indicated by the nExtRPorts and nExtWPorts respectively in Figure 3.1. These control the
number of incoming and outgoing data FIFOs and the maximum number of data words that can be read
or written in parallel between the data memories of the TILT cores and the FIFOs on a given cycle.
An external read port is an output of the read crossbar, like FU input ports. Similarly, an external
write port is an input of the write crossbar, like FU output ports. We define an external memory read
operation as TILT output data that is read from the data memories of the TILT cores and written to
off-chip memory. Similarly, an external write is defined as TILT input data that is written into the TILT
data memories from off-chip memory.
We add to TILT the ability to halt the processor (temporarily prevent execution of future instructions)
if the Fetcher gets too far behind or to halt the Fetcher if it gets too far ahead. TILT-SIMD is stalled by
the Fetcher when the TILT cores need to read data from a FIFO and it is empty or when they need to
write data to a FIFO and it is full. The Fetcher stalls for similar reasons when communicating with off-
chip memory. This synchronization logic is implemented as part of the Fetcher. Otherwise the Fetcher
and TILT-SIMD execute their own respective instruction schedules, generated statically by the TILT
compiler, to provide independent operation of off-chip communication and computation. The Fetcher
is aware of the computation’s external data movement behaviour through static compiler analysis of its
memory accesses.
Each TILT core computes on 32-bit words. The Fetcher operates on 256-bit words, the same width
as our interface to the DDR controller. This means 8 TILT cores can communicate with the Fetcher in
parallel in the same cycle. The widthconv module shown in Figure 3.1 converts the TILT-SIMD word
to a multiple of 256-bits or vice-versa. So for TILT-SIMD with 12 cores, data from the first 8 will be
sent to the Fetcher on the first cycle and data from the remaining 4 cores will be sent on the next cycle.
The widthconv can become a bottleneck if too many TILT cores are connected to the Fetcher.
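The width conversion described above amounts to simple ceiling arithmetic; a minimal Python sketch of the cycle count (illustrative only, widthconv itself is hardware):

```python
from math import ceil

def widthconv_cycles(n_cores, fetcher_width_bits=256, core_width_bits=32):
    """Cycles needed to forward one 32-bit word from every TILT core
    through the 256-bit Fetcher interface: 8 cores fit per cycle."""
    cores_per_cycle = fetcher_width_bits // core_width_bits  # = 8
    return ceil(n_cores / cores_per_cycle)

print(widthconv_cycles(8))   # -> 1
print(widthconv_cycles(12))  # -> 2 (8 cores first, the remaining 4 next)
```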
TILT executes operations on the Fetcher unit in the same way it executes operations on its FUs.
However, instead of the Fetcher performing a compute operation such as an add or a multiply, it ‘executes’
multiple concurrent external read and write memory operations. Also, like compute operations, these
Fetcher-TILT operations have deterministic latencies and operate on the same address space as the
concurrently executing compute operations.
TILT Compute Insn: [ FU1 | FU2 | ... | FUn | nExtWPorts x (valid, bankId, addr, ...) | nExtRPorts x (valid, bankId, addr, ...) ]
(a) Encoding of TILT’s external memory and compute instructions.
opcode (r/w) | fifo id | DDR addr | burst size
(b) Encoding of the Fetcher-DDR instructions.
Figure 3.2: The encoding of the TILT and Fetcher instructions.
The new statically scheduled instructions that move data between TILT-SIMD and off-chip memory
are decoupled into two separate instruction streams. The encoding of these two streams is provided in
Figure 3.2. The first stream consists of the Fetcher-TILT operations, which are scheduled as part of the
TILT instructions, as shown in Figure 3.2(a). These operations move data between the Fetcher’s data
FIFOs and the data memories of the TILT cores on cycles when there are available memory bank ports
that are not being used by compute operations.
The second stream comprises the Fetcher-DDR instructions that facilitate communication between
off-chip DDR memory and the Fetcher. The two sets of decoupled instructions execute in the same order
so off-chip data do not need to be tagged with their destination when inserted into the FIFOs. Reads
always remove data entries from the FIFO head and writes always insert at the FIFO tail. We describe
how the Fetcher-TILT and Fetcher-DDR instructions are statically scheduled in Section 3.3.
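The ordering invariant that lets the FIFOs remain untagged can be illustrated with a small Python model (the streams and data names here are hypothetical; the real FIFOs are hardware):

```python
from collections import deque

# Illustrative model: because the Fetcher-DDR and Fetcher-TILT streams
# are generated and executed in the same order, an untagged FIFO
# suffices -- the i-th word enqueued from DDR is consumed by the i-th
# external write in TILT's schedule.
incoming = deque()

# hypothetical decoupled streams for two threads reading in0 and in1
ddr_reads   = ["in0[t1]", "in0[t2]", "in1[t1]", "in1[t2]"]
tilt_writes = ["write in0 t1", "write in0 t2", "write in1 t1", "write in1 t2"]

for word in ddr_reads:        # Fetcher-DDR side: insert at the tail
    incoming.append(word)

delivered = [(op, incoming.popleft()) for op in tilt_writes]
for op, word in delivered:    # Fetcher-TILT side: remove from the head
    print(op, "<-", word)
```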
The architecture of the Fetcher was inspired by the Decoupled Access/Execute processor proposed
by James E. Smith [57] and Outrider which splits a thread’s instruction context into memory-accessing
and memory-consuming streams that execute in separate hardware contexts [58]. The memory-accessing
stream of Outrider fetches data non-speculatively substantially ahead of the memory-consuming stream
to tolerate long off-chip memory latencies.
3.3 External Memory Access Scheduling
The external memory and compute schedules of the Fetcher and TILT are statically generated by the
TILT compiler. The application kernel describes within a C function the behaviour of a single TILT
thread with its external inputs and outputs present in its parameter list. An example kernel is provided
in Figure 3.3(a). This is the same kernel as the one we used to illustrate the TILT compiler flow in
Section 2.3. We have updated the DFGGen tool to include the parameter list in the generated DFG file
which contains a textual description of the kernel computation (Figure 3.3(b)). The inputs to the TILT
compiler are the DFG and the target TILT-System hardware configuration (Figure 3.3(c)).
void fn(float *in, float *out) {
    float A = in[0] + 5.0;
    float B = in[0] * 5.0;
    float C = in[1] * 5.0;
    float D = A - B;
    out[0] = C + D;
}
(a) Application kernel.
[DFG nodes: in0 feeds A (+) and B (*); in1 feeds C (*); A and B feed D (-); C and D feed E (+), which produces out0]
(b) DFG.
FU Mix: AddSub Lat. 2, Mult Lat. 3
Xbar: Rd Lat. 2, Wr Lat. 2
Data Mem: 4 threads, 2 banks, 2 rd ports/bank, 1 wr port/bank
Fetcher: 8 TILT cores, 1 ext rd & wr port
(c) Configuration.
Figure 3.3: An example application kernel, its DFG with the inserted TILT memory operations (external writes in green, reads in red) and the TILT-System hardware configuration used to produce the TILT instruction schedules in Figure 3.4.
The compiler begins by parsing the function parameter list and compute operations from the DFG file.
Next, Tili’s Memory Allocator (described earlier in Section 2.3.2) is used to allocate TILT’s data memory.
We have updated the allocator to mark the function parameters that are read by the computation as
external inputs. The data of these parameters must be moved from off-chip memory to the local data
memories of the TILT cores. Similarly, the function parameters that are written to by the computation
are marked as external outputs and must be written to off-chip memory after they are produced. These
external inputs and outputs are assigned permanent locations in TILT’s data memory for the lifetime
of the entire schedule even though they might be needed by only a part of the computation. This is to
provide more flexibility in scheduling the external memory accesses later.
Next, TILT’s compute schedule is generated using Tili’s Grouped Prioritized Greedy Mix Longest
Path First (GPGMLP) scheduling algorithm, presented earlier in Section 2.3.3. This algorithm assumes
the data coming from external memory exists locally within the assigned TILT data memory. This means
external input data must be moved into TILT’s data memory prior to any compute operations reading
those local memory locations. Similarly, external output data must be moved to off-chip memory after
TILT computes and writes these results to its local data memory.
Next, we use the produced compute schedule to generate memory statistics for the external memory
locations allocated within TILT’s data memory. For each external input, the cycle the input is first
and last read by the schedule are recorded. For external outputs, the first and last written cycles are
recorded. We also record the latency of the FU that performs these reads or writes.
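A minimal Python sketch of this statistics pass, assuming a simplified schedule representation of our own invention (the TILT compiler's internal data structures will differ):

```python
def memory_stats(schedule):
    """Record, for each external input location, the first and last
    cycle it is read by the compute schedule (external outputs would
    track first/last written cycles analogously). `schedule` is a
    hypothetical list of (cycle, locations-read, locations-written)."""
    stats = {}
    for cycle, reads, writes in schedule:
        for loc in reads:
            first, last = stats.get(loc, (cycle, cycle))
            stats[loc] = (min(first, cycle), max(last, cycle))
    return stats

sched = [(3, ["in0"], ["A"]), (5, ["in0"], ["B"]), (7, ["in1"], ["C"])]
print(memory_stats(sched))  # {'in0': (3, 5), 'in1': (7, 7)}
```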
Finally, we schedule the external memory accesses with our new memory scheduling algorithms.
These algorithms are responsible for scheduling both the Fetcher-TILT and Fetcher-DDR instructions.
The memory statistics are used by these algorithms to determine the most efficient way to move data
between the off-chip and TILT data memories. Our goal is to reduce any compute stalls incurred by
TILT when waiting for the required external memory transfers to complete. Further, we seek to minimize
the growth in TILT’s schedule length due to the addition of the external memory operations.
[Schedule panels, listing the AddSub, Mult, ExtR and ExtW operations issued each cycle: (a) MC-GPGMLP (29 cycles); (b) LateI-EarlyO (23 cycles); (c) SlackB (23 cycles).]
Figure 3.4: TILT instruction schedules produced from the application kernel and TILT-System configuration provided in Figure 3.3. The schedules show when the operations are issued (thread:operation).
We present three external memory scheduling algorithms: 1) Memory and Compute GPGMLP (MC-
GPGMLP), 2) ALAP-Inputs, ASAP-Outputs (LateI-EarlyO) and 3) Slack-Based (SlackB). The first
phase of these algorithms prioritizes the external memory reads and writes and schedules the Fetcher-
TILT memory operations based on their priority. We describe the three flavours of this phase in Sections
3.3.1 to 3.3.3 respectively. The second phase of the algorithms produces the associated Fetcher-DDR
instructions. Since this phase is the same for all three variations, it is described separately in Section
3.3.4. The TILT instruction schedules produced after scheduling the external memory operations using
the three different scheduling approaches for the example kernel and TILT-System configuration of Figure
3.3 are provided in Figure 3.4.
The LateI-EarlyO and SlackB approaches utilize the TILT data memory ports that are unused by
compute operations to schedule the external memory read and write operations at those cycles,
allowing the compute schedule to remain unchanged. The LateI-EarlyO and SlackB algorithms are only
applicable for cyclic schedules that are iterated on multiple times due to the way the memory accesses
are scheduled by these algorithms.
Conversely, the MC-GPGMLP algorithm schedules external memory operations as part of Tili’s
GPGMLP compute scheduling algorithm instead of scheduling them separately after the generation of
the compute schedule. For this approach, we do not need to generate memory statistics for the external
memory locations. Unlike the other two approaches, this algorithm can be used to generate both cyclic
and acyclic schedules. In the acyclic case, TILT executes its instruction schedule only once.
3.3.1 Memory and Compute GPGMLP Scheduling
Tili’s GPGMLP algorithm schedules the compute operations defined by the DFG of an application. In
order to schedule the external memory accesses as part of the GPGMLP algorithm, they must first be
inserted into the DFG, as shown in Figure 3.3(b) for the application kernel in Figure 3.3(a). This occurs
after allocating TILT’s data memory which is when we also determine the memory locations that are
TILT’s external inputs and outputs.
For each external input, we insert a new external write operation into the DFG (the nodes highlighted
in green in Figure 3.3(b)). This operation does not depend on any previous operation and is marked
as a predecessor of any compute operation that reads the input arriving from off-chip memory. This
means the MC-GPGMLP algorithm must schedule the external write before these compute operations.
As an example, in Figure 3.4(a), compute operation B of the first thread is scheduled two cycles (write
crossbar latency for the example is two cycles) after the memory operation which writes the in0 data to
TILT’s memory is scheduled.
Similarly for each external output, we insert an external read operation into the DFG (the red node
in Figure 3.3(b)). This operation is marked as the successor of the compute operation that produces the
output which means it must be scheduled after that compute operation writes its result to data memory
(Figure 3.4(a)). External read operations have no successors of their own.
The pseudo-code of the GPGMLP algorithm was provided in Algorithm 1 of Section 2.3.3. Just as
with the compute operations, external memory accesses are scheduled at the earliest possible cycle given
all their predecessors (operations they depend on) have been scheduled, there are available data memory
bank ports and data hazard conditions are met. While compute operations such as add or multiply
require 2 read and 1 write port to be available at a given cycle for the data memory bank being accessed,
external reads and writes require 1 read and write port respectively. Additionally, they require either
an external read or write port to be available to access the Fetcher’s data FIFOs.
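These per-cycle port requirements can be summarized in a small Python sketch (the resource names and check are illustrative, not the compiler's actual data structures):

```python
# Hypothetical per-cycle resource check mirroring the rules above:
# compute ops need 2 read + 1 write port on their bank, external
# writes need 1 bank write + 1 external write port, and external
# reads need 1 bank read + 1 external read port.
def can_issue(op, free):
    need = {
        "compute":   {"bank_rd": 2, "bank_wr": 1},
        "ext_write": {"bank_wr": 1, "ext_wr": 1},
        "ext_read":  {"bank_rd": 1, "ext_rd": 1},
    }[op]
    return all(free.get(r, 0) >= n for r, n in need.items())

free = {"bank_rd": 2, "bank_wr": 1, "ext_wr": 1, "ext_rd": 1}
print(can_issue("compute", free))    # True
print(can_issue("ext_write", free))  # True
```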
3.3.2 ALAP-Inputs, ASAP-Outputs Memory Scheduling
In the ALAP-Inputs, ASAP-Outputs (LateI-EarlyO) memory scheduling algorithm, external inputs of
the computation are written into TILT memory from the data FIFOs of the Fetcher As-Late-As-Possible
(ALAP) before they are needed by any compute operations while external outputs are read from TILT
memory into the FIFOs As-Soon-As-Possible (ASAP) after they are produced by the TILT FUs. The
pseudo-code of this approach is provided in Algorithms 3 and 4. With read and write crossbar latencies
of 8 and 5 cycles, read operations must be scheduled at least 3 cycles after the write operation that
produces the value to be read (RAWcycles of 3). Similarly, write operations must be issued at least 6
cycles after the last read that requires the old value at that memory location (WARcycles of 6).
Algorithm 3: LateI-EarlyO - Schedule external memory write operations ALAP
Input: extInputs[], compSched
Output: Fetcher-TILT schedule
 1 sort extInputs[] by extInput.firstRead, from largest to smallest;
 2 foreach extInput in extInputs[] do
 3   foreach TILT thread do
       // must be able to read location at least RAWcycles after it is written
 4     cycleToWrite = extInput.firstRead - RAWcycles;
       // find cycle when it is safe to overwrite location with new value
 5     while cycleToWrite >= 0 and (no free external or bank write port) do
 6       cycleToWrite = cycleToWrite - 1;
 7     if cycleToWrite >= 0 then
 8       schedule extInput at cycleToWrite;
 9       decrement # of free external and bank write ports at cycleToWrite by 1;
10     else
         // must write to location at least WARcycles after it is last read
11       minCycleCanWrite = extInput.lastRead + WARcycles;
12       cycleToWrite = last cycle of schedule;
13       while cycleToWrite >= minCycleCanWrite and (no free ext. or bank write port) do
14         cycleToWrite = cycleToWrite - 1;
15       if cycleToWrite >= minCycleCanWrite then
16         schedule extInput at cycleToWrite;
17         decrement # of free external and bank write ports at cycleToWrite by 1;
18       else
           // handle Read-After-Write (RAW) hazard
19         while cannot read extInput at extInput.firstRead if written at cycle 0 do
20           extend start of schedule by 1 cycle;
21         schedule extInput at cycle 0;
22         decrement # of free external and bank write ports at cycle 0 by 1;
External writes that move input data from the FIFOs to the TILT data memories are scheduled first
(Algorithm 3). The external inputs are ordered by the latest cycle they are first read by the computation.
The external writes for these inputs are prioritized and scheduled into TILT’s instruction schedule in this
order. Each external write is scheduled as close to the compute operation which reads the input first, at
a cycle when an external write port is available and the memory bank to be written has an unused write
port. Starting from just before the input is first read by the computation, the search advances toward
the start of the TILT schedule. If the external write cannot be scheduled before wrapping around to
when the input location is last read by the previous compute iteration, then the start of the schedule is
extended until the write can be safely scheduled at the beginning of TILT’s schedule.
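The core of this backwards search can be sketched in Python as follows; this is a simplification of Algorithm 3 that covers only the first search phase and omits the WAR wrap-around and schedule-extension cases:

```python
def alap_write_cycle(first_read, raw_cycles, port_free):
    """Simplified ALAP search: starting just early enough to satisfy
    the RAW constraint, walk backwards toward cycle 0 until a cycle
    with a free external/bank write port is found. `port_free[c]` is
    True when both ports are free at cycle c (hypothetical input).
    Returns None when the wrap-around/extend fallback would apply."""
    cycle = first_read - raw_cycles
    while cycle >= 0:
        if port_free[cycle]:
            return cycle
        cycle -= 1
    return None  # fall back to the WAR wrap-around / extend path

free = [False, False, True, True, False, False, False, False]
print(alap_write_cycle(first_read=6, raw_cycles=3, port_free=free))  # -> 3
```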
Algorithm 4: LateI-EarlyO - Schedule external memory read operations ASAP
Input: extOutputs[], compSched
Output: Fetcher-TILT schedule
 1 sort extOutputs[] by extOutput.lastWritten, from smallest to largest;
 2 foreach extOutput in extOutputs[] do
 3   foreach TILT thread do
       // must read location at least RAWcycles after it is written
 4     cycleToRead = extOutput.lastWritten + RAWcycles;
       // find cycle when it is safe to read output location
 5     while cycleToRead < endOfSchedule and (no free external or bank read port) do
 6       cycleToRead = cycleToRead + 1;
 7     if cycleToRead < endOfSchedule then
 8       schedule extOutput at cycleToRead;
 9       decrement # of free external and bank read ports at cycleToRead by 1;
10     else
         // must be able to write to location at least WARcycles after it is read
11       maxCycleCanRead = extOutput.firstWritten + WARcycles;
12       cycleToRead = 0;
13       while cycleToRead <= maxCycleCanRead and (no free external or bank read port) do
14         cycleToRead = cycleToRead + 1;
15       if cycleToRead <= maxCycleCanRead then
16         schedule extOutput at cycleToRead;
17         decrement # of free external and bank read ports at cycleToRead by 1;
18       else
           // handle Read-After-Write (RAW) hazard
19         while cannot read extOutput at endOfSched if written at extOutput.lastWritten do
20           extend end of schedule by 1 cycle;
21         schedule extOutput at last cycle of schedule;
22         decrement # of free external and bank read ports at last cycle of schedule by 1;
External reads which commit the computed outputs of TILT-SIMD to the data FIFOs in the Fetcher
are scheduled next (Algorithm 4). These operations are scheduled in the order of the earliest cycle the
outputs are last written by the computation. Each external read is scheduled at the earliest cycle when
the memory bank to be read from has a free read port, starting the search after the output is last written
to the TILT data memory by an FU. This enables the output data to be moved to the FIFOs as soon as
possible. If the external read cannot be scheduled before wrapping around to the cycle when the output
is first written (at which point the computation will write a new result at that location), then the length
of the schedule is increased until the read can be scheduled at the end of the schedule.
By scheduling TILT’s external write operations to move input data from the Fetcher’s FIFOs into
TILT memory as late as possible, we maximize the amount of time the Fetcher has to bring the inputs
into the FIFOs from off-chip memory. We also minimize the periods where TILT stalls on an external
memory write due to the input data being unavailable in the FIFOs while future compute operations
can still be executed.
Conversely, scheduling TILT’s external reads as soon as possible allows the output data to be buffered
into the Fetcher’s data FIFOs sooner, increasing overall DDR bandwidth. We do not need to be concerned
with TILT stalling on an external memory read due to the FIFO getting full so long as we do not saturate
the bandwidth of writing to off-chip memory. Finally, scheduling the external writes before the reads
does not affect the produced schedule as writing and reading to TILT’s data memory occur independently
of each other and utilize separate hardware resources.
For the application and TILT-System configuration of Figure 3.3, the TILT schedule produced with
the LateI-EarlyO algorithm is provided in Figure 3.4(b). Unlike the MC-GPGMLP algorithm, scheduling
the memory operations with LateI-EarlyO does not increase the length of the compute schedule generated
by Tili’s GPGMLP algorithm.
3.3.3 Slack-Based Memory Scheduling
The Slack-Based (SlackB) algorithm schedules external memory operations based on the slack metric of
TILT’s external inputs and outputs. For an external input, we define its slack as the number of cycles
within which we must overwrite its TILT memory location with new data before it is needed by the next
compute iteration and after the old data at that location is no longer needed by the preceding iteration.
For external outputs, the slack is the number of cycles we have to move the computed result to the
data FIFOs in the Fetcher before the memory location is overwritten by the result of the next compute
iteration. The calculation of the slack metric for external inputs and outputs is provided in Equations
3.1 and 3.2 respectively.
Algorithm 5: Slack-Based - Scheduling memory operations based on their scheduling flexibility
Input: extInputs[], extOutputs[], compSched
Output: Fetcher-TILT schedule
1 calc slack of each extInput in extInputs[] using Equation 3.1;
2 sort extInputs[] from smallest slack to largest;
3 calc slack of each extOutput in extOutputs[] using Equation 3.2;
4 sort extOutputs[] from smallest slack to largest;
  // generate Fetcher-TILT external write/read memory operations
5 schedule extInputs[] ALAP (see Algorithm 3);
6 schedule extOutputs[] ASAP (see Algorithm 4);
external input slack (cycles) = (cycle first read) + [(compute schedule length) − (cycle last read)]   (3.1)

external output slack (cycles) = (cycle first written) + [(compute schedule length) − (cycle last written)]   (3.2)
From observing the schedules generated by the LateI-EarlyO algorithm, we find memory operations
with a smaller slack have less scheduling flexibility and are more likely to increase the length of TILT’s
instruction schedule if other memory operations consume the cycles where they can be safely scheduled.
For this reason, external reads and writes with a smaller slack are given a higher priority and scheduled
first. Following the same rationale behind the LateI-EarlyO algorithm, the external writes are scheduled
ALAP and external reads are scheduled ASAP. We find these implementation choices minimize the
compute stalls and growth in the length of the TILT schedule.
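A Python sketch of Equations 3.1 and 3.2 and the smallest-slack-first ordering follows; the cycle numbers are invented for illustration:

```python
def input_slack(first_read, last_read, sched_len):
    # Equation 3.1: cycles before the first read of the next iteration
    # plus the tail of the schedule after the last read of this one
    return first_read + (sched_len - last_read)

def output_slack(first_written, last_written, sched_len):
    # Equation 3.2: the analogous window for moving a result out
    return first_written + (sched_len - last_written)

# smaller slack = less scheduling flexibility, so schedule it first
ops = [("in0", input_slack(4, 20, 23)), ("in1", input_slack(9, 12, 23))]
ops.sort(key=lambda p: p[1])
print(ops)  # [('in0', 7), ('in1', 20)]
```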
The pseudo-code of the SlackB approach is provided in Algorithm 5. For the kernel and TILT-System
configuration of Figure 3.3, the TILT instruction schedule produced using SlackB is provided in Figure
3.4(c). For our example, the external inputs in0 and in1 have the same priority, producing the same
schedule as LateI-EarlyO, with in0 being scheduled first.
3.3.4 Fetcher-DDR Instructions
Here we describe how the TILT compiler statically generates the Fetcher-DDR instructions which are
responsible for moving external data between the Fetcher’s data FIFOs and off-chip DDR memory. The
encoding of these instructions is provided in Figure 3.2(b). Each instruction contains a 2-bit opcode
that specifies whether the instruction performs a read or a write, the id of the data FIFO to use, the
relative address in DDR memory of the first data item to access, and the burst size, which indicates
the number of 256-bit words to move.
The Fetcher-TILT instructions produced during the first phase of scheduling the external memory
operations determine the order in which the external input and output data items needed by the compute
schedule will be moved between the data memories of the TILT cores and off-chip DDR memory. This
ordering, the TILT core count and the placement of input and output data in DDR memory are used
to calculate the maximum amount of contiguous data that we can move to and from DDR with each
Fetcher-DDR instruction. The memory layout is specified with a configuration file which provides the
mapping of the DFG’s external inputs and outputs and their relative addresses in DDR memory.
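The burst-size formula from Figure 3.5 can be checked with a few lines of Python (the parameter names are ours):

```python
from math import ceil

def burst_size(n_threads, n_ios, n_cores, cores_per_word=8):
    """Burst size in 256-bit words for one Fetcher-DDR instruction,
    per the formula in Figure 3.5: threads x (inputs or outputs)
    x ceil(cores / 8), since eight 32-bit TILT words pack into one
    256-bit word."""
    return n_threads * n_ios * ceil(n_cores / cores_per_word)

# configuration of Figure 3.3: 4 threads, 8 cores, 2 inputs, 1 output
print(burst_size(4, 2, 8))  # -> 8 (read from DDR)
print(burst_size(4, 1, 8))  # -> 4 (write to DDR)
```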
read from ddr, incoming fifo, burst size 8
write to ddr, outgoing fifo, burst size 4
(repeat)
Figure 3.5: Fetcher-DDR instructions produced from the kernel and TILT-System configuration provided in Figure 3.3. For this example, the burst size is calculated as follows: (threads) × (inputs or outputs) × ⌈TILT cores / 8⌉.
The enforcement of the DDR memory layout is optional. If the memory layout is not specified then
we assume the input and output data will exist in contiguous but separate regions of DDR memory. This
effectively allows the TILT compiler to enforce its own DDR memory layout that best suits the order in
which the TILT cores access external input and output data, maximizing the DDR burst sizes that can
be used to improve DDR bandwidth and reducing the total number of Fetcher-DDR instructions.
For the above reasons, it is preferable to not enforce a layout prior to scheduling and instead let it
be defined by the TILT compiler. We only allow the specification of the DDR memory layout through
a configuration file because some other part of the application system may enforce where the input and
output data needed by the TILT cores will be stored in DDR memory. For the purposes of evaluating
the TILT and Fetcher architectures, we let our compiler choose the placement of data in DDR memory.
In Figure 3.5, we provide the Fetcher-DDR instructions produced for the kernel and TILT-System
configuration in Figure 3.3.
3.4 Benchmarks
We evaluate our TILT-System architecture and the performance obtained by our memory scheduling
approaches using the five large-memory (i.e. requiring off-chip memory), data-parallel, compute-intensive,
Chapter 3. External Memory Fetcher 29
floating-point benchmarks listed below. These benchmarks span different application domains such as
neuroscience, economics, image processing, mathematics and physics simulations. We have selected these
applications because they use a variety of operation types including transcendental functions (exp, log).
We defer the discussion of the implementation of these benchmarks on our enhanced TILT architec-
ture to Chapter 6 where we present a more thorough evaluation of the achieved performance and area
requirements of our best TILT-System designs.
Black-Scholes Option Pricing (BSc)
The BSc model is based on a partial differential equation used to approximate the value of European
call and put options given inputs such as the stock price, risk-free interest rate and volatility [59]. In
our implementation, each BSc thread computes the call and put option prices for a single set of inputs.
High Dynamic Range (HDR)
This benchmark takes three input images of the same scene captured by standard cameras at three
different exposures (bright, medium and dark) and produces a single output image with a greater range
of luminance than the input images [54]. The algorithm performs the same set of operations on the red,
green and blue components of a pixel. Each HDR thread computes a single pixel component.
Mandelbrot Fractal Rendering (MBrot)
We use Altera's Mandelbrot implementation where each thread computes a single pixel of an 800x640
window frame in which the Mandelbrot image is rendered [53]. The thread computation contains a loop
that iterates up to 1000 times and can break out of the loop at an earlier point depending on the value
of a variable computed inside the loop body.
Hodgkin-Huxley (HH)
This neuron simulation benchmark describes the electrical activity across a patch of a neuron membrane
where the voltage across the membrane varies as ions flow radially through each compartment [55].
The computation inside each thread involves solving four first-order differential equations using Euler’s
method to iteratively compute simple finite differences.
FIR Filter
We use the fully pipelined 64-tap Time-Domain Finite Impulse Response filter benchmark from the
HPEC Challenge Benchmark suite [60]. The algorithm applies 64 sets of filter coefficients to an array
of 4096 inputs. Each TILT thread produces a single output, given a single input, by applying the 64
coefficients on that input. The equivalent OpenCL HLS implementation is provided by Altera [53].
3.5 Comparison of the Memory Scheduling Approaches
In Figure 3.6, we compare the performance of our three memory scheduling algorithms: MC-GPGMLP,
LateI-EarlyO and SlackB. The comparison is performed on the five data-parallel benchmarks described
earlier. We only present the results for the most computationally dense (highest compute throughput
per area) TILT configurations, provided in Table 6.6 in Section 6.2 of our evaluation of the TILT-System.
Table 6.7 lists the external port counts we use for each benchmark. However, the conclusions we draw
from these results can be generalized to any TILT-System design.
We summarize the TILT and Fetcher configuration parameters used in Table 3.1. We vary the thread
count from 1 to 64 to illustrate the improvement in throughput for the different scheduling methods as
the amount of parallel work is increased. Since the number of data memory banks can at most be equal
to the number of total TILT threads (minimum of 1 thread per memory bank) we reduce the number of
memory banks to match the thread count used when necessary.
Benchmark    FU Mix        Threads (T)  Mem Banks  W/R Ports / Bank  Ext W/R Ports
BSc          2-3-1-1-1-1   1 to 64      min(T,4)   2-4               1-1
HDR          2-2-1-1-0-0   1 to 64      min(T,4)   2-4               2-1
Mandelbrot   1-2*          1 to 64      1          3-6               1-1
HH           3-2-2-2-0-0   1 to 64      min(T,4)   2-4               1-1
FIR 64-tap   1-1-0-0-0-0   1 to 64      1          2-4               2-1

Table 3.1: The TILT and Fetcher configurations used to obtain the throughput results of Figure 3.6. FU Mix: AddSub/Mult/Div/Sqrt/ExpCmp/LogAbs; *AddSubLoopUnit/MultCmp.
The throughput reported in Figure 3.6 represents the average number of compute operations executed by each TILT core per cycle (operations-per-cycle or opc) and is calculated by dividing the total compute operations in TILT's instruction schedule by the length of the schedule. This throughput is calculated after both the external memory and compute operations are scheduled onto TILT, since the memory operations may increase the length of the schedule and hence reduce the achieved throughput.
Increasing the number of threads increases the number of independent memory and compute operations
that can be scheduled in parallel, improving the utilization of the ports and FUs and the overall compute
throughput. We provide the number of memory and compute operations scheduled onto TILT per thread
in Table 3.2 for our five benchmarks.
Benchmark    Ext Mem Ops         Comp Ops
             Inputs   Outputs
BSc          5        2          77
HDR          6        1          14
Mandelbrot   5        1          21
HH           5        4          115
FIR 64-tap   1*       1          128

Table 3.2: External memory and compute operations per thread to be scheduled onto TILT. *Filter coefficients are loaded prior to computation.
The LateI-EarlyO and SlackB algorithms can only be used to produce cyclic instruction schedules that
are re-executed by TILT after the processor reaches the end of the schedule. We fold cyclic schedules
to improve compute throughput; this new technique is described in Section 4.4. The TILT schedule
produced using the MC-GPGMLP algorithm is not folded because the algorithm is applicable for both
cyclic and acyclic schedules.
As a reference point, we compare the compute throughput obtained by using the MC-GPGMLP,
LateI-EarlyO and SlackB algorithms with the throughput of Tili’s GPGMLP algorithm where only the
[Figure content: five throughput-vs.-threads plots for (a) Black-Scholes, (b) HDR, (c) Mandelbrot, (d) Hodgkin-Huxley and (e) FIR 64-tap, each sweeping 1 to 64 threads and comparing GPGMLP (compute ops only; acyclic), GPGMLP folded (cyclic), MC-GPGMLP, LateI-EarlyO and SlackB (memory and compute ops).]
Figure 3.6: Performance comparison of the Fetcher's external memory scheduling approaches with Tili's GPGMLP compute scheduler. The throughput is reported in operations-per-cycle (or opc) and represents the average number of compute operations executed by each TILT core per cycle.
compute operations are scheduled. Tili’s scheduler did not take into account communication with off-
chip memory. Thus, the GPGMLP results represent an upper bound on the throughput that we seek to
attain after scheduling the new memory operations. We provide the throughput of both the folded (cyclic
schedules only) and unfolded (intended for acyclic schedules) GPGMLP algorithm. Although folding
cyclic schedules is optional, it is recommended, as evident from an average of 10% higher throughput
achieved by folding vs. not folding the TILT schedules produced by Tili’s algorithm. We compare the
throughput results of the folded GPGMLP algorithm with LateI-EarlyO and SlackB and compare the
throughput of the regular GPGMLP with that of our unfolded MC-GPGMLP algorithm.
The throughput obtained with our MC-GPGMLP algorithm will always be worse than the reference
GPGMLP (by an average of 11% for the results presented in Figure 3.6) because the compute operations
that require external input data must be scheduled after the new external write operations. As illustrated
in Figure 3.4(a) for our example kernel, this pushes most of the compute operations down in the schedule
due to their dependence on external inputs. Similarly, external read operations must be scheduled to
move output data to the Fetcher’s data FIFOs after they are produced by the compute operations. These
new memory operations also increase the contention for the limited number of memory bank ports. In
some cases, compute operations that are ready to be scheduled are pushed down due to a memory
operation with a higher priority utilizing the needed bank port.
For MC-GPGMLP, Mandelbrot has the biggest drop in throughput of up to 33% with 64 threads
relative to GPGMLP. This benchmark contains a loop which is scheduled using our new LoopUnit FU
(refer to Section 4.1). The loop body and the operations outside it are separated into sections in the
TILT schedule that cannot cross the boundaries of the loop. Of the 21 compute operations per thread,
4 must be scheduled before the loop. An additional 6 memory operations, 5 external writes and a single
read, must be scheduled before and after the loop respectively. The writes bring into TILT memory the
input data that are needed by the 4 compute operations and the read sends the single output to the
Fetcher after it is produced inside the loop. Taken together, this produces a much longer TILT schedule
and a considerably larger drop in throughput compared to the other benchmarks.
Relative to our MC-GPGMLP algorithm, we observe an average of 19% and 23% improvement in
compute throughput with the LateI-EarlyO and SlackB algorithms respectively. These two algorithms
attempt to schedule memory operations in the region of TILT’s compute schedule where the external
input or output is not needed by any compute operations. For a long enough schedule, this region is
usually sufficient to schedule most memory operations without increasing the length of the compute
schedule that is produced by Tili’s GPGMLP algorithm.
The SlackB algorithm performs as well as or better than our LateI-EarlyO approach, by an average of 1.6%. The order in which the memory operations are scheduled has a bigger impact on the achieved throughput at larger thread counts. This is because the higher utilization of the data memory
ports results in fewer available spots in the TILT schedule to insert the new memory operations. We
observe no change in throughput with a single thread relative to the LateI-EarlyO approach and an
average of 3.7% higher throughput with 64 threads.
Compared to the folded GPGMLP throughput results, scheduling the new memory operations with
the SlackB algorithm leads to only a 0.57% drop in throughput on average. Tallying the operation
counts provided in Table 3.2, the external memory reads and writes account for 18% more operations on average. Therefore, we conclude that scheduling based on our memory slack metric, which calculates the scheduling flexibility available to each external data item, is the best scheduling
option. For acyclic schedules, we use the MC-GPGMLP algorithm. For all other situations, we use the
SlackB algorithm.
We provide the utilization of TILT’s data memory ports by compute operations in Figure 3.7 for our
benchmarks to explain the 0.57% drop in throughput. As expected, memory port utilization increases
with the number of threads but with diminishing returns. We do not increase the thread count indefinitely
because the larger data and instruction memories will produce less area-efficient TILT cores. The FIR
filter achieves 100% utilization of its FUs and data memory except at the start and end of its schedule,
obtaining up to 98% memory port utilization with 64 threads. The remaining benchmarks achieve 62%
utilization on average at 64 threads and 70% when the FIR filter’s utilization is also included.
[Figure content: data memory port utilization for BSc, HDR, MBrot, HH and FIR, each at 1, 4, 16 and 64 threads, on a 0-100% scale.]
Figure 3.7: The utilization of TILT's data memory ports by the compute schedules generated with the folded GPGMLP algorithm for the TILT-System configurations of Table 3.1.
The LateI-EarlyO and SlackB approaches schedule memory operations when the off-chip data in TILT's memory and the memory ports are not in use by compute operations, instead of scheduling them based only on data dependencies as the MC-GPGMLP algorithm does. This improves the utilization of the
memory ports without increasing the length of TILT’s schedule. For the HDR benchmark, the memory
port usage is low and all memory operations can be scheduled with SlackB without lengthening the
compute schedule produced by the folded GPGMLP algorithm. In contrast, for the HH benchmark, the
throughput drops by up to 2% at 64 threads, which is also when the utilization of the memory ports by
compute operations is the highest, at 80%. For HH, we observe a throughput drop of 0.64% on average
between 1 and 64 threads, the second largest drop observed across all five of our benchmarks.
For SlackB, the largest throughput drop is obtained by Mandelbrot, with an average drop of 1.7%
(and up to 4% at 64 threads) relative to the throughput of the folded GPGMLP. This is still very small,
especially when compared with the MC-GPGMLP results. The memory operations have less scheduling
flexibility compared to the other benchmarks because they cannot be scheduled within the Mandelbrot
loop. The external reads and writes are both scheduled at the start of the TILT schedule, overlapped
with the compute operations scheduled outside the loop. With our SlackB approach, we are able to
schedule memory operations before or after the dependent compute operation(s), thus avoiding the need
to increase the schedule length in most cases.
3.6 Summary
Application designers are usually required to manually construct custom memory architectures to connect
their compute logic to external memory. This is not a trivial exercise [44, 46] and is not suitable for
configurable compute platforms that target a variety of applications such as the TILT overlay. With our
Memory Fetcher, we exploit the on-chip bandwidth and parallelism afforded by the FPGA with multiple
parallel data buffers feeding an array of TILT cores while also diligently managing its limited off-chip
memory bandwidth by fetching only the required data. Our proposed architecture supports automatic
instantiation, has low resource cost, is scalable and can be customized to suit the memory bandwidth
requirements of different applications. The Fetcher is statically scheduled to minimize hardware area
and behaves as a loosely-coupled co-processor to the TILT compute units. Of our three proposed
memory scheduling approaches, SlackB produces the most efficient (shortest) memory schedules. Using
SlackB, we obtain a small drop in compute throughput (an average of 0.57% drop in opc across our
five benchmarks) relative to the same designs where only the compute operations are scheduled and the
latencies incurred due to off-chip data transfers are omitted.
Chapter 4
TILT Efficiency Enhancements
The existing TILT architectural design parameters discussed in Section 2.2 provide a high degree of flex-
ibility to closely match the compute and memory requirements of different applications. In this chapter,
we introduce several enhancements to the TILT architecture of [13, 14], including instruction looping
support and indirect addressing modes to enable more area-efficient TILT-System implementations for
a greater range of application domains. Note that as the TILT overlay is an application customized
“family” of execution units, these enhancements are optional and are generated only for the applications
they benefit.
We also evaluate the performance, resource utilization and overall efficacy of our enhancements relative to the standard TILT architecture. We measure computational throughput in millions of threads
executed by the TILT cores per second (M tps). We report resource utilization as a single value in
equivalent ALMs (eALMs) which accounts for the total layout area of the FPGA resources consumed
by our designs. An M20K BRAM on a Stratix V chip costs 40 eALMs as its layout area is 40x that of
an ALM [61]. Similarly, a DSP block has 30x the layout area of an ALM so it costs 30 eALMs [61]. We
rank different designs based on the largest ratio of compute throughput per area (M tps / 10k eALMs)
which we define as compute density. The FPGA resource utilization is obtained from Quartus 13.1 fitter
compilation reports while cycles of execution are measured using ModelSim 10.1d. Throughput per
second is calculated by applying the Fmax achieved by TILT-SIMD which is obtained from the Quartus
TimeQuest reports.
4.1 Instruction Looping Support
The TILT architecture and compiler flow developed by Ovtcharov and Tili in [14] and [13] respectively do not support application kernels with loops. Loops in the application code needed to be manually
unrolled prior to running it through the TILT compiler flow, which schedules a separate set of operations
for each loop iteration. This effort is tedious and the generated TILT instruction schedules are very long,
growing linearly with the number of loop iterations. This can result in a significant portion of TILT's area being consumed by the instruction memory, which must store the entire TILT schedule. Moreover,
since independent operations across loop iterations can be scheduled together and memory allocation
occurs prior to scheduling, fewer memory locations can be reassigned to store different data, necessitating
a larger data memory as well.
[Figure content: an instruction schedule in which a LoopUnit column holds a loop-start operation, the FU columns hold the loop-body operations, and a loop-end operation carries a repeat arrow back to the start of the loop body.]
Figure 4.1: An example TILT schedule containing a loop. The loop-start operation sets the number of iterations initially. Upon reaching the loop-end operation, that value is decremented by 1. If the value reaches 0, TILT exits the loop; otherwise, TILT re-executes the loop body.
As an optional alternative solution, we have implemented a small, custom LoopUnit FU which enables
TILT to iterate multiple times on a section of its instruction schedule. This is illustrated in Figure 4.1,
allowing the body of a loop to be scheduled only once by the TILT compiler. We describe the architecture
of the LoopUnit FU in Section 4.1.1. The changes to the DFGGen tool and the TILT scheduler required
to support this new FU are presented in Sections 4.1.2 and 4.1.3 respectively. We now also provide the
option to automatically unroll loops with the help of the LLVM compiler so that we do not need to
manually unroll them.
4.1.1 LoopUnit FU Architecture
The design of the LoopUnit and the encoding of the operations it executes are illustrated in Figure 4.2.
The bit width of the iterations and pcJump fields can be different from the values in the figure and
are dependent on the number of loop iterations and the length of the TILT schedule respectively. The
register inside the LoopUnit FU tracks the number of loop iterations remaining. The combinational logic
that determines the input to the register is presented as a 1-hot multiplexer for illustrative purposes.
The loopStart signal is asserted when the TILT processor reaches the beginning of a loop in the
schedule. This is indicated with a loop-start operation (as shown in Figure 4.1) which has a type field of
0. The assertion of the loopStart signal causes the value of the iterations field to be loaded into the
register. This field contains the number of times that TILT must iterate on the section of its schedule
that represents the loop body.
Similarly, the end of the loop body is marked with a loop-end operation which has a type field of
1. This causes the loopEnd signal to be asserted which decrements the value stored in the LoopUnit’s
register by 1. As long as the register’s value is non-zero, the pcLoad output signal is asserted when the
end of the loop is reached. This causes TILT’s PC (Program Counter) to be overwritten with the value
stored in the loop-end operation’s pcJump field which contains the cycle when the first operation of the
loop is scheduled.
When the register’s value reaches 0, pcLoad is not asserted and the PC value is incremented by 1
normally to execute operations following the loop body. This way, we can either jump to the start of the
loop to execute the loop body once more or continue to the next TILT instruction after the loop body.
[Figure content: the LoopUnit datapath. A register is fed by a 1-hot mux that loads the iterations field on loopStart, decrements on loopEnd, holds its value otherwise and clears on reset; a not-0 check on the register drives the pcLoad and pc outputs. The LoopOp fields within the TILT VLIW instruction are valid (1 bit), opcode (5 bits), type (1 bit), iterations (10 bits) and pcJump (9 bits).]
Figure 4.2: The architecture of the LoopUnit FU.
Since each loop in the schedule requires only two operations to mark its start and end, the LoopUnit is
shared with another (usually the least utilized) FU to reduce the width of the TILT instruction memory.
4.1.2 DFG Generation
Recall that the DFGGen tool is used to generate the DFG of an application prior to scheduling it onto
a target TILT-System architecture. We have updated the tool to support application kernels containing
loops. To illustrate the generation of the DFG from these types of applications, we provide an example
C code snippet containing a for-loop in Figure 4.3. The LLVM instructions generated from compiling
this code and the DFG produced by our DFGGen tool from these instructions are provided in Figures
4.4 and 4.5 respectively. Placing a set of operations inside a loop in the C kernel wraps these operations
with the new loop-start and loop-end operations in the DFG (refer to Figure 4.5).
float x = 0.0f, xSqr = 0.0f;
for (int iter = 0; iter < 100; iter++) {
     if (xSqr < 4.0f) {
          xSqr = x*x;
          x = x + 0.1;
     }
     else {
          <some computation>
          break;
     }
}

Figure 4.3: Application kernel code snippet of a for-loop.
As shown in the LLVM instructions in Figure 4.4 of the code snippet, for-loops are defined by four
code blocks that have the prefixes of for.cond, for.body, for.inc and for.end. For the loop-start operation
in the DFG, we need to determine the number of times the loop iterates which can be calculated from
the information provided within the for.cond and for.inc blocks. Since we generate a static schedule,
we only support loops with bounds that can be calculated during the generation of the DFG.
entry:
     store i32 0, i32* %iter, align 4
     br label %for.cond

for.cond:                                         ; preds = %for.inc, %entry
     %7 = load i32* %iter, align 4
     %cmp = icmp slt i32 %7, 100
     br i1 %cmp, label %for.body, label %for.end

for.body:                                         ; preds = %for.cond
     %8 = load float* %xSqr, align 4
     %cmp3 = fcmp olt float %8, 4.000000e+00
     br i1 %cmp3, label %if.then, label %if.else

if.then:                                          ; preds = %for.body
     %10 = load float* %x, align 4
     %11 = load float* %x, align 4
     %mul4 = fmul float %10, %11
     store float %mul4, float* %xSqr, align 4
     %add12 = fadd float %10, 0.100000e+00
     store float %add12, float* %x, align 4
     br label %if.end

if.else:                                          ; preds = %for.body
     <some computation>
     br label %for.end

if.end:                                           ; preds = %if.then
     br label %for.inc

for.inc:                                          ; preds = %if.end
     %22 = load i32* %iter, align 4
     %inc = add nsw i32 %22, 1
     store i32 %inc, i32* %iter, align 4
     br label %for.cond

for.end:                                          ; preds = %if.else, %for.cond

Figure 4.4: Intermediate representation (IR) [50] of the code in Figure 4.3 that is generated by LLVM's frontend compiler.
We support for.cond blocks that consist of an integer compare that evaluates the termination condi-
tion of the loop and a conditional branch that jumps to the start of the loop body or its end depending
on the output of the comparison. From the compare operation in the loop condition, we can determine
the type of comparison, the variable that acts as the loop index, its initial value before entering for.cond
and the value that causes the loop to terminate.
The for.inc block describes how the loop index is updated. Currently, we support loops that update
their index with a single integer compute operation that is either an add, subtract, multiply or divide.
As the loop index calculations are integer operations and the conditional branches introduced with loops
depend on integer comparisons, these operations can be easily distinguished from the floating-point
operations that are inserted into the DFG to be executed by the TILT compute units.
The for.body block and any code blocks called from within for.body define the body of the loop.
The loop-start and loop-end operations are inserted into the DFG before and after the insertion of the
operations within these blocks respectively.
4.1.3 Scheduling onto TILT
The DFGs of applications that contain LoopUnit operations are scheduled in sections instead of all at
once. These sections are separated by the loop boundaries which are indicated by the loop-start and
loop-end operations that appear before and after the operations of the loop body (see Figure 4.5). For an
application with a single loop, we first schedule the compute operations and external memory accesses
that precede the loop. The operations inside the body of the loop are scheduled next and the operations
after the loop are scheduled last. Nested loops are scheduled in a similarly recursive fashion by scheduling
the operations preceding the inner loop first, then the body of the inner loop and then finally scheduling
the remaining operations.
[Figure content: DFG nodes, numbered: (1) loop-start 100; (2) cmp: xSqr < 4 ?; (3) mul: xSqr-mul = x * x; (4) add: x-add = x + 0.1; (5) <some computation>; (6) cmpmux_0: xSqr = xSqr-mul or xSqr; (7) cmpmux_1: x = x-add or x; (8) cmpmux_2: pc = cont. or break; (9) loop-end; (10) <computation after loop>.]
Figure 4.5: The DFG generated by the DFGGen tool from the input LLVM IR of Figure 4.4. The arrows denote the data flow dependencies of the operations.
Algorithms with nested loops currently require multiple LoopUnits, one for each level of nesting.
Since two LoopUnit operations from two nested loops are not executed on the same cycle due to the
dependencies that exist between them and since each loop produces only two LoopUnit operations, as
future work, we would like to use only a single LoopUnit FU to execute all loop operations. This will
require the LoopUnit (shown in Figure 4.2) to have as many registers as there are levels of nesting.
The scheduled instructions of each section must not cross loop boundaries. After the operations of
a section of the DFG are scheduled, the first operation of the next section must be scheduled on a later
cycle. Moreover, we must wait extra cycles for the computed results of the loop body to be committed to
TILT memory before scheduling the loop-end operation. This produces a less dense instruction schedule
compared to scheduling a fully unrolled loop where the TILT pipeline does not need to be drained
between loop iterations and operations between loop boundaries and loop iterations can be overlapped
and scheduled together.
The benefit of our approach is that the body of the loop needs to be scheduled only once, producing
fewer instructions and shorter schedules depending on the size of the loop body and the number of
times the loop is iterated. We obtain the biggest win with the LoopUnit when the loop is iterated many
times and the loop body is large enough to fill the TILT pipeline without requiring loop iterations to
be overlapped. Due to the lower throughput incurred from a less dense schedule (refer to Section 4.1.6),
we can manually unroll certain loops and use the LoopUnit with others depending on the configuration
that will work best. As an example, for a computation with a large outer loop that iterates many times
containing a small inner loop that iterates only a few times, it will be best to use the LoopUnit with the
outer loop and unroll the inner loop.
4.1.4 Conditional Jump Instruction
The TILT architecture of [13, 14] supports if-else instruction branching with predication [48] where
operations from both sides of the branch are scheduled and executed and the results of only the taken
path are committed to the destination variables in memory. This is illustrated in Figure 4.5 for the
example code snippet provided earlier in Figure 4.3. We have added support for more complex loops
with conditional break statements by extending the hardware and compiler flow used to implement
predication at only a small hardware cost. We will first describe how normal predicated execution is
supported on the TILT overlay and then present the changes we have made to handle conditional jump
instructions within loops.
TILT determines the taken path of the if-else branch by evaluating the condition of the if statement.
This is accomplished with the cmp operation which takes as input the two 32-bit floating-point operands
to compare (representing the left and right side of the if condition) and a flag that indicates the type of
comparison to perform. For the example in Figure 4.5, there is a single cmp operation which evaluates
if xSqr is less than 4. After the operations inside the two branches are executed (operations 3 to 5 in
the example), the cmp-mux operations commit the results of the taken path. For each variable that is
modified inside the if-else branch, a cmp-mux operation takes as input the two possible values of that
variable stored in temporary locations from both sides of the branch and commits the value of the taken
path to the variable based on the outcome of the cmp operation.
These cmp and cmp-mux operations are executed by the custom Cmp FU of the TILT processor.
The architecture of this FU is shown in Figure 4.6 with the additional components necessary to support
conditional jump instructions highlighted in red. For cmp operations, the two input values to compare
are read from TILT’s data memory into data-a and data-b and the comparison flag is encoded in the
first three bits of the operation’s opcode. The result of the comparison is written into a small memory
array, addressed by the operation’s thread-id. This enables TILT to support different outcomes for each
thread and commit the results of the correct path taken by each thread independently of each other.
For regular cmp-mux operations (such as operations 6 and 7 in Figure 4.5), the two possible values
from either side of the branch are read from TILT’s data memory into data-a and data-b. The opcode
provided is 0111 which asserts the mux-mode and lowers the pc-mode control signals, causing the Cmp
FU to behave as a mux. The outcome of the if-else branch condition written previously by the cmp
operation is used to select the output value of wdata between data-a and data-b. The signal wdata-valid
is asserted to commit wdata to TILT’s data memory.
To allow breaking out of a loop on a certain condition, we needed to modify both the DFGGen tool
and the TILT instruction scheduler. For the code snippet in Figure 4.3, the break statement inside the
loop has the signature ‘br label %for.end ’ in the LLVM IR (shown in Figure 4.4). During the generation
of the DFG, we look for these unconditional branch instructions inside if-else statements within loops. If
such an instruction is found, we insert a new type of cmp-mux operation into the DFG, such as operation
8 in Figure 4.5. Here, instead of choosing between two data items, we select between two PC values,
placed in the cmp-mux operation’s Operand A and B (Figure 4.6). The two PCs are calculated when
[Figure content: the Cmp FU datapath. The TILT VLIW instruction supplies valid, Operand A/B read addresses, the result write address, read/write bank ids, thread id and opcode. Operands arrive through the read crossbar into data-a and data-b; a comparator whose comparison type is selected by opcode[2:0] writes its flag into a small memory addressed by thread-id; the mux-mode and pc-mode signals (the latter from opcode bit [3]) steer the selected 32-bit output either to wdata/wdata-valid through the write crossbar or to the load-pc/pc outputs.]
Figure 4.6: The architecture of TILT's Cmp FU which handles both cmp and cmp-mux operations.
the DFG is scheduled and they represent the next PC address after the cmp-mux operation (to continue
loop execution) and the first instruction after the loop (to break out of the loop) respectively.
To handle this new scenario, we slightly modified the read crossbar to forward Operand A and B into
data-a and data-b instead of the data read from TILT's data memory. We provide an opcode of
1111 to assert both mux-mode and pc-mode control signals inside the Cmp FU. This asserts the load-pc
signal which allows the PC of the TILT instruction memory to be overwritten with the Cmp FU’s PC
output. The same bit that toggles pc-mode is used as a select for the mux that determines whether the
data-a and data-b inputs to the FU should carry the values of Operand A and B or the data read from
TILT’s data memory.
4.1.5 LLVM Phi Instruction
Phi instructions are found in the LLVM intermediate representation (IR) code generated from C kernels
that have if-else conditions present inside loops that are automatically unrolled by the LLVM compiler.
The phi instruction is used to select between two possible values of a variable depending on the path
taken to reach the code block in which the phi instruction is found. Since we updated the DFGGen tool
to support application kernels with automatically unrolled loops, we now describe how these new phi
instructions are handled by the tool. An example code snippet with the usual LLVM instructions and
the equivalent instructions with phi is provided in Figure 4.7. In Figure 4.7(b) and 4.7(c), the value of
either the temporary mul or div register will be written into the result variable depending on whether
the if.then or if.else paths are taken.
if (cond > 0.0)
    result = x * y;
else
    result = x / y;

(a) Kernel code.

entry:
  %cmp = fcmp ogt float %cond, 0.0
  br i1 %cmp, label %if.then, label %if.else

if.then:
  %mul = fmul float %x, %y
  store float %mul, float* %result, align 4
  br label %if.end

if.else:
  %div = fdiv float %x, %y
  store float %div, float* %result, align 4
  br label %if.end

if.end:

(b) Regular LLVM IR.

entry:
  %cmp = fcmp ogt float %cond, 0.0
  br i1 %cmp, label %if.then, label %if.else

if.then:
  %mul = fmul float %x, %y
  br label %if.end

if.else:
  %div = fdiv float %x, %y
  br label %if.end

if.end:
  %4 = phi float [ %mul, %if.then ], [ %div, %if.else ]
  store float %4, float* %result, align 4

(c) Equivalent LLVM IR using phi instead.
Figure 4.7: Illustrative example of the LLVM phi instruction [50].
Normally, the DFGGen tool tracks variables that exist outside of if-else branches that are written
to by compute operations inside them. In the example in Figure 4.7(b), this is the case for the result
variable. Since TILT supports conditional branches with predication, the DFGGen tool renames the
output locations of these instructions to point to temporary locations instead. This effectively ignores
the LLVM store instructions of Figure 4.7(b) that commit the outputs of the compute operations into
the result variable. However, the variable to which the data is being stored is recorded by DFGGen for
later use. After the compute operations from both sides of the branch and the compare that evaluates
the branch condition are inserted into the DFG, we insert cmp-mux FU operations that commit the
temporary values of the taken path into these variables.
In the case of the phi instruction in Figure 4.7(c), the two possible values are written to temporary
locations only. The phi instruction is translated directly into a cmp-mux FU operation that depends on
the two compute operations that produce the two values and on the compare operation that determines
the taken path. Unlike the example in Figure 4.7(c), the two possible values do not have to be produced
within the two paths; when they are not, the compute operations that produce the temporary values
are marked as successors of (dependent on) the compare operation in the DFG. The resulting DFGs
produced from either set of LLVM instructions (Figure 4.7(b) or 4.7(c)) will be the same.
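The translation just described can be sketched as a small DFG transformation. The class and field names below are hypothetical stand-ins for DFGGen's internal data structures, which the text does not specify.

```python
# Sketch of translating an LLVM phi into a cmp-mux DFG operation.
# DFGNode and its fields are hypothetical stand-ins for DFGGen's
# real (unspecified) node representation.

class DFGNode:
    def __init__(self, op, deps=(), output_var=None):
        self.op = op
        self.deps = list(deps)          # predecessor DFG nodes
        self.output_var = output_var

def translate_phi(cmp_node, producers, result_var):
    """phi [val_t, path_t], [val_f, path_f] -> one cmp-mux operation that
    depends on the compare and on both value producers."""
    for p in producers:
        if cmp_node not in p.deps:      # producer outside the branch paths:
            p.deps.append(cmp_node)     # order it after the compare
    return DFGNode("cmp_mux", deps=[cmp_node, *producers],
                   output_var=result_var)

cmp = DFGNode("fcmp")
mul = DFGNode("fmul", output_var="mul")
div = DFGNode("fdiv", output_var="div")
phi = translate_phi(cmp, [mul, div], "result")
assert phi.op == "cmp_mux" and cmp in phi.deps and cmp in mul.deps
```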
4.1.6 Unrolling Loops vs. Using the LoopUnit FU
In Table 4.1, we demonstrate the efficacy of the LoopUnit FU with the Mandelbrot benchmark which
contains a loop that iterates up to 1000 times (see Section 3.4). We do so by comparing the performance
and area of two similar TILT-Systems. The first utilizes the LoopUnit and schedules the loop body only
once, while the second is of the same configuration but without the LoopUnit, requiring the loop to be
fully unrolled prior to being scheduled onto TILT.
We observe a drop in compute throughput when the LoopUnit is used, as shown in Table 4.1. This
is because in the unrolled case, operations from across loop iterations can be interleaved and scheduled
together as long as the data dependencies between them are respected. This cannot be done with the
LoopUnit approach. The benefit of using the LoopUnit FU lies in the much smaller instruction memory
that results from scheduling the body of the loop only once, producing an overall more area-efficient
TILT-System design.
                           With LoopUnit   Without LoopUnit   Change with LoopUnit
Insn Mem (width x depth)   140x103         151x66,570
Throughput (M tps)         0.76            0.91               -19.6%
Area (eALMs)               3,320           24,233             -7.3x
Compute Density
  (M tps/10k eALMs)        3.2             0.39               +6.1x

Table 4.1: Comparison of the densest TILT-System configuration for the Mandelbrot application (Tables
6.6 and 6.7) with and without the LoopUnit FU. The TILT core count used is 1.
We present the results for the most computationally dense Mandelbrot TILT-System configuration
(refer to Tables 6.6 and 6.7 in Section 6.2). However, other TILT configurations will benefit in a similar
way. The overall improvement in compute density and reduction in area will vary depending on the size
of the loop body and number of times the loop is iterated. In general, a larger loop body will result in a
smaller drop in compute throughput because there will be more operations to fill the TILT pipeline. The
LoopUnit will also be more desirable with higher iteration counts. The area cost of adding the LoopUnit
FU is very small – only 9.8 ALMs.
4.2 Indirect Addressing
TILT executes operations containing static memory bank ids and addresses that must be known at
compile time. On successive iterations of the TILT schedule, these operations are re-executed on different
data loaded from off-chip memory using the Memory Fetcher but because the operations are static, they
access the same locations in TILT's memory. However, certain benchmarks may access different memory
locations for the same computation. With the standard TILT overlay developed by Ovtcharov and Tili,
this requires a separate set of instructions to be scheduled, which increases the length of the schedule
and requires a larger instruction memory. For other benchmarks, memory addresses are calculated at
runtime instead of being known at software compile time. This is not supported by the standard TILT
architecture.
In Section 4.2.1, we present the shift-register addressing mode, which allows the TILT architecture to
support accesses to adjacent memory locations with a single static operation. In Section 4.2.2, we
present a more general indirect addressing solution for the TILT overlay. When this optional feature
is enabled, data stored in TILT memory can also be used as memory addresses.
4.2.1 Shift-Register Addressing Mode
The shift-register mode was motivated by the 64-tap FIR filter described earlier in Section 3.4. This filter
is most efficiently implemented as a spatial design, as shown in Figure 4.8, with input (or output) data
placed inside shift-registers. It is possible to model this behaviour on the standard TILT hardware, but
doing so is extremely inefficient since separate operations must be scheduled to read data from the TILT
data memory and move it to adjacent locations between the computation of each output. These moves
can be trivially accomplished with an auxiliary 0-cycle "FU" that simply forwards its single input to its
output. For our 64-tap FIR design with two threads, an additional 64 reads and writes per thread would
be required, resulting in a 415 cycle TILT schedule. Compiler support does not exist to automatically
detect and insert these extra operations into the TILT schedule, so they must be added manually.
Figure 4.8: Spatial 64-tap FIR implementation with DSPs on an FPGA.
We are able to obviate the need for these shift operations with our new shift-register mode. When
this optional mode is enabled, a small amount of additional hardware is generated which allows FUs to
access adjacent locations in TILT’s data memory without requiring the data to be shifted. As illustrated
in Figure 4.9, when TILT reaches the end of its schedule, the schedEnd control signal is asserted and the
value within the counter register is incremented by 1. This value acts as an offset to the static read and
write addresses provided in the TILT operations. This allows the same operations to access adjacent
locations in TILT’s data memory when they are re-executed by TILT.
Figure 4.9: Shift-Register indirect addressing mode.
The performance improvement obtained by turning on this mode is summarized in Table 4.2 for the
64-tap FIR filter. For the TILT-System configuration, we use the most computationally dense TILT and
Fetcher configurations of the FIR filter provided in Tables 6.6 and 6.7 in Section 6.2. The mode requires
an additional 54 ALMs but shortens the TILT schedule to 144 cycles, improving the overall compute
density of the TILT-System with a single TILT core by 2.9x (see Table 4.2). Figure 4.9 shows all three
operand addresses being shifted. However, the FIR filter is configured to shift only the result.
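The density figures in Table 4.2 can be reproduced from its throughput and area columns, since compute density is simply throughput scaled per 10,000 eALMs:

```python
# Reproducing the compute-density column of Table 4.2:
# density (M tps per 10k eALMs) = throughput (M tps) * 1e4 / area (eALMs).

def density(m_tps, ealms):
    return m_tps * 1e4 / ealms

assert round(density(3.8, 2674)) == 14       # with shift-register mode
assert round(density(1.3, 2728), 1) == 4.8   # without
assert round(3.8 / 1.3, 1) == 2.9            # the +2.9x change
```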
Chapter 4. TILT Efficiency Enhancements 45
                           With SR Mode   Without SR Mode   Change with SR Mode
Insn Mem (width x depth)   76x144         76x415
Throughput (M tps)         3.8            1.3               +2.9x
Area (eALMs)               2,674          2,728             +54 ALMs
Compute Density
  (M tps/10k eALMs)        14             4.8               +2.9x

Table 4.2: Comparison of the densest TILT-System configuration for the FIR application (Tables 6.6
and 6.7) with and without the Shift-Register (SR) addressing mode. The TILT core count used is 1.
4.2.2 Indirect Addressing Mode
When enabled, this option creates additional hardware that allows TILT FUs to use data stored in TILT’s
data memory as addresses rather than requiring all data memory to be addressed via the immediate
addresses in the TILT instruction stream. As shown in Figure 4.10, for every TILT FU that may access
data indirectly this way, we provide two Indirect Read Address (IRA) registers and one Indirect Write
Address (IWA) register. These registers temporarily store the indirect addresses that can be used to
access data in TILT’s memory. The three 2-1 muxes placed in front of the read and write side crossbars
choose between the static addresses provided by the FU operation and the output of the FU’s Indirect
Address (IA) registers. Each of the three mux selects is controlled independently by a single bit in the
operation's opcode. This allows FU operations to use any combination of indirect and static addresses
to access data in TILT's memory.
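The per-port address selection can be sketched as follows. Only the one-bit-per-port scheme is taken from the text; the exact opcode bit positions are assumptions made for illustration.

```python
# Sketch of the per-FU address muxing for indirect mode: one opcode bit
# per address port chooses between the operation's static address and
# the corresponding Indirect Address (IA) register. The bit positions
# (bit 0 = Operand A, bit 1 = Operand B, bit 2 = Result) are assumed.

def select_addresses(opcode, static_addrs, ia_regs):
    """static_addrs / ia_regs: (operand_a, operand_b, result) triples."""
    return tuple(ia if (opcode >> bit) & 1 else static
                 for bit, (static, ia) in enumerate(zip(static_addrs, ia_regs)))

# Operand A indirect (bit 0 set), B and result static:
assert select_addresses(0b001, (10, 11, 12), (40, 41, 42)) == (40, 11, 12)
# All three ports indirect:
assert select_addresses(0b111, (10, 11, 12), (40, 41, 42)) == (40, 41, 42)
```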
Figure 4.10: Enabling the option to use temporary values stored within the Indirect Read/Write Address
registers to access TILT's data memory instead of the static addresses provided by the FU operation.
We also add the Indirect Address Read (IAR) operation into TILT’s instruction set to populate the
IA registers. They are similar to TILT’s external memory read operations, requiring a single read port
to be available for the data memory bank being accessed. However, instead of reading data into the data
FIFOs of the Fetcher, they store the data into the FU’s IA registers. Which of the three registers the IAR
operation writes to is controlled by three enable bits in the operation’s opcode. These IAR operations
share their operation field with each of the FUs that require indirect addressing. The encoding of the
IAR operations is provided in Figure 4.11.
Figure 4.11: Populating the Indirect Read/Write Address registers with data in TILT's memory using
the Indirect Address Read (IAR) operation.
For a given FU, we try to schedule the IAR operation(s) just before the dependent compute opera-
tion(s) so that other operations that indirectly access different memory locations can be safely scheduled
as soon as possible. Strictly speaking, having separate operations to fetch the addresses from TILT data
memory and to perform the computation is not necessary. However, this design choice provides more
scheduling flexibility while minimizing implementation complexity. First, we can populate 1, 2 or all 3 IA registers
depending on the operands that access TILT’s memory indirectly. Second, if several compute operations
indirectly access the same memory locations, we will only need to fetch the addresses once. Providing
the same flexibility with a single operation would necessitate a more complex operation with variable
latencies and extra decode logic. Finally, the IAR operations and their dependent compute operations
may be scheduled with other operations scheduled in between, yielding a denser schedule. This is not
possible with a single operation and such an operation would be more difficult to schedule because it
would need to perform several reads and writes to TILT’s data memory on specific cycles.
Bank Depth (words)               256    1024   4096
Addr Width (bits)                8      10     12

ALMs Used:
  3 IA registers and 2-1 muxes   12.5   15.5   18.5
  altfp_convert                  76.0   81.0   87.5

Table 4.3: Area cost of the indirect addressing mode per IA-capable FU for different memory bank
depths. The altfp_convert module translates a 32-bit floating point to an integer of address width bits.
The computed indirect addresses are produced by the existing single-precision, floating-point FUs
and stored in TILT's data memory as 32-bit floats. As shown in Figure 4.11, we use Altera's altfp_convert
module with a six cycle latency [62] to convert these floats into integers after reading them from data
memory via the read crossbar and prior to writing them into the IA registers. Indirect addresses can
also be moved from off-chip memory to TILT's data memory like any other external input data.
In Table 4.3, we provide the area cost of implementing indirect addressing on TILT for three different
data memory bank depths. The ALM utilization was obtained from the Quartus fitter reports and the
bank depths presented are representative of the range observed for computationally dense TILT-System
designs. The altfp_convert module is much larger than the registers and muxing required. As future
work, we would like to use small, low latency integer FUs to compute
indirect addresses instead. This has several benefits including no longer requiring float-to-int conversion
and likely requiring fewer threads to fill the shallower schedule pipeline.
4.3 Memory Allocation for DFGs with Aliased Locations
The LLVM IR input to the DFGGen tool may contain aliased memory variables: different names that
refer to the same physical memory location. Aliasing can occur when the LLVM compiler loads the same
memory location multiple times. This can result from how the application kernel is written (e.g. when
the same memory location is loaded into two different variables) or from the LLVM compiler itself.
The latter is especially noticeable when loops are unrolled by the LLVM compiler prior to the generation
of the DFG. The Memory Allocator algorithm developed by Tili in [14] treats these aliases as separate
physical memory locations. Even though the correctness of the algorithm is maintained, this increases
the required size of TILT’s data memory.
Algorithm 6: Removal of memory location aliases from the DFG

Input: DFG with aliases
Output: DFG with aliases removed

 1  Create empty aliasListMap;
 2  Create empty newAddrMap;
    // Populate aliasListMap and newAddrMap
 3  foreach node in DFG do
 4      newName = node.outputVar with all instances of "([.][a-z,1-9]+)+" removed;
 5      if newName != node.outputVar then  // found alias
 6          if aliasList of newName exists in aliasListMap then
 7              add node.outputVar to aliasList;
 8          else
 9              add node.outputVar to new aliasList;
10              add aliasList to aliasListMap with key newName;
11          add node.outputAddr to newAddrMap with key newName;
12          node.outputVar = newName;  // rename
13  sort aliasLists in aliasListMap from longest alias to shortest;
14  foreach node in DFG do
15      foreach aliasKey in aliasListMap in descending order do
16          aliasList = aliasListMap.get(aliasKey);
17          newAddr = newAddrMap.get(aliasKey);
18          foreach alias in aliasList do
19              if node.inputVar1 or node.inputVar2 or node.outputVar == alias then
20                  replace aliased variable name with aliasKey;
21                  replace aliased variable addr with newAddr;
The algorithm presented in Algorithm 6 aims to remove all aliased memory variables to improve
the generated TILT schedule’s memory footprint. The first step is to determine which variables are
aliases of another. For the benchmarks we have studied, we find the aliases of <prefix> are of the form
<prefix>.<suffix>. Next, we replace all instances of the aliases in the DFG with <prefix> and assign
them a single memory address. This algorithm is run on the DFG prior to Tili’s Memory Allocator
algorithm, which further reduces the amount of data memory required for the computation via register
reallocation techniques (refer to Section 2.3.2).
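The first pass of Algorithm 6 (alias detection and renaming) can be sketched in Python. The node dictionaries are stand-ins for DFGGen's node objects; the suffix pattern follows the thesis's "([.][a-z,1-9]+)+", broadened here to also accept 0 in numeric suffixes, and the second pass that rewrites input variables is omitted.

```python
# Sketch of the alias-normalization pass of Algorithm 6. Node dicts are
# stand-ins for DFGGen's node objects. The regex broadens the thesis
# pattern "([.][a-z,1-9]+)+" to also accept the digit 0.
import re

_ALIAS_SUFFIX = re.compile(r"(\.[a-z0-9]+)+")

def remove_aliases(nodes):
    """Strip alias suffixes in place; return (aliasListMap, newAddrMap)."""
    alias_lists, new_addrs = {}, {}
    for node in nodes:
        new_name = _ALIAS_SUFFIX.sub("", node["outputVar"])
        if new_name != node["outputVar"]:                 # found an alias
            alias_lists.setdefault(new_name, []).append(node["outputVar"])
            new_addrs[new_name] = node["outputAddr"]      # one shared address
            node["outputVar"] = new_name                  # rename in place
    return alias_lists, new_addrs

nodes = [{"outputVar": "result", "outputAddr": 4},
         {"outputVar": "result.i", "outputAddr": 8},
         {"outputVar": "result.i.1", "outputAddr": 12}]
alias_lists, _ = remove_aliases(nodes)
assert alias_lists == {"result": ["result.i", "result.i.1"]}
assert all(n["outputVar"] == "result" for n in nodes)
```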
Benchmark                       With Aliasing   Without Aliasing
                                    (32-bit words per thread)
BSc                             46              38
HDR                             24              18
Mandelbrot (with LoopUnit FU)   28              25
Mandelbrot (fully unrolled)     903             36
HH                              91              86
FIR 64-tap                      192             192

Table 4.4: TILT data memory requirements with and without the removal of aliased memory locations.
The reduction in TILT’s data memory usage after the removal of the aliased memory locations is
summarized in Table 4.4 for our benchmarks. The fully unrolled implementation of the Mandelbrot
benchmark benefits from this the most, requiring only 36 data words per TILT thread instead of 903.
The aliases of this benchmark are introduced when the LLVM compiler is used to fully unroll the
application’s loop which iterates up to 1000 times.
4.4 Schedule Folding
(a) TILT schedule on the left is folded to produce the schedule on the right.

(b) Same example, but the cycles when the TILT FUs read operands, compute and write to data
memory are also shown.
Figure 4.12: An example TILT schedule being folded. Memory operations are omitted to simplify the
example. Read and write latencies of 2 cycles and Add and Mult latencies of 2 and 3 cycles are used.
When TILT executes a scheduled operation, it takes a few cycles for the task to complete. Thus, the
TILT schedule has extra cycles after the last scheduled operation to drain the pipeline and allow the
results of outstanding operations to be committed to TILT’s data memory, such as operations C, D and
E in Figure 4.12. For cyclic schedules that are re-executed by the TILT cores, we apply a technique we
call ‘Schedule Folding’ where we fold the cycles at the end of the schedule into the start of the schedule.
This is performed after the TILT schedule is generated from the application’s DFG by the compiler. The
key observation is that the memory bank and external ports still available at the start of the schedule
can be utilized to commit the operations scheduled at its end. The schedule is folded at the earliest
possible cycle after the last scheduled operation while respecting TILT’s memory bank and external read
and write limitations.
In Figure 4.12, we illustrate how we fold the TILT schedule with an example. In Figures 4.12(a) and
4.12(b), the TILT schedule on the left is folded to produce the schedule on the right. In Figure 4.12(b),
we also show when the compute operations read input operands from the TILT data memory, compute
and write their results back to memory. As an example, operation E in the normal TILT schedule begins
reading its operands on cycle 8, starts to compute the sum on cycle 10 and commits the sum to data
memory by cycle 13. Once the TILT schedule is folded, operation E begins computation on cycle 2 and
completes by cycle 5 after wrapping around to the next iteration of the shorter schedule. During this
time (cycles 1 to 5), TILT also re-executes operations A, B and C.
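A simplified model of the folding decision is sketched below. It assumes a single write port per cycle and considers only write-back conflicts; the real implementation also respects memory bank and external read/write limitations, so the cycle numbers here are illustrative rather than taken from Figure 4.12.

```python
# Sketch of Schedule Folding: drain cycles after the last issued
# operation wrap around to the start of the next schedule iteration,
# provided the wrapped write-backs do not oversubscribe the write
# ports already used at the start of the schedule. Single-bank,
# one-write-per-cycle port model; cycle values are illustrative.

def fold_schedule(writes, unfolded_len, last_issue, ports_per_cycle=1):
    """writes: 0-based cycles at which results are committed. Returns the
    shortest folded length >= last_issue + 1 with no write conflicts."""
    for length in range(last_issue + 1, unfolded_len + 1):
        used = {}
        for w in writes:
            used[w % length] = used.get(w % length, 0) + 1
        if max(used.values()) <= ports_per_cycle:
            return length
    return unfolded_len   # nothing could be folded

# Writes at cycles 9 and 11 wrap to cycles 1 and 3 of an 8-cycle
# schedule without colliding with the write at cycle 5:
assert fold_schedule([5, 9, 11], unfolded_len=12, last_issue=7) == 8
```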
Schedule Folding reduces the overall length of the TILT schedule, improving TILT’s compute through-
put proportionally to the number of cycles folded and may also require a smaller instruction memory to
store the schedule. It was motivated by the well-known technique of Software Pipelining [63], which
creates code for a loop's prologue, steady state and epilogue and overlaps loop iterations to improve
resource utilization. Since the TILT schedule is cyclic, it is akin to a loop body.
Benchmark    Cycles Saved   Sched Len after Folding   Throughput Improvement
                            (cycles)                  (% of avg opc)
BSc          19             933                       2.1%
HDR          30             304                       9.9%
Mandelbrot   15             103                       14.6%
HH           21             611                       3.5%
FIR 64-tap   19             144                       13.2%

Table 4.5: Improvement in throughput obtained by folding TILT's schedule (opc is operations-per-cycle).
Table 4.5 presents the improvement in compute throughput achieved by folding TILT’s instruction
schedule for the most computationally dense TILT-System designs. The configurations of these designs
are provided in Table 6.6 in Section 6.2 of our evaluation of the TILT-System. TILT’s memory operations
are scheduled using the Fetcher's Slack-Based algorithm, described earlier in Section 3.3.3. The number
of cycles folded for the HDR benchmark is noticeably larger than for the other benchmarks because the
last compute operation of the HDR computation is the Sqrt, which has the largest latency on TILT at
28 cycles. As expected, Schedule Folding has a smaller relative impact on longer schedules.
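The improvement column of Table 4.5 follows directly from the schedule lengths: folding shortens an (L + S)-cycle schedule to L cycles while the operation count is unchanged, so the average operations-per-cycle rises by S/L:

```python
# The throughput gains in Table 4.5 follow from the schedule lengths:
# folding an (L + S)-cycle schedule down to L cycles improves the
# average operations-per-cycle by S/L.

def fold_gain(cycles_saved, folded_len):
    return 100.0 * cycles_saved / folded_len   # percent improvement

assert round(fold_gain(15, 103), 1) == 14.6    # Mandelbrot
assert round(fold_gain(30, 304), 1) == 9.9     # HDR
assert round(fold_gain(19, 144), 1) == 13.2    # FIR 64-tap
```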
Chapter 5
Predictor Tool
The TILT overlay and Memory Fetcher are highly configurable and can be tuned to closely match the
compute and memory requirements of different applications. However, due to the wide range of archi-
tectural parameters that can be varied, it can be difficult to determine the TILT-System configuration
with the highest compute density (throughput per area) for a target application. Calculation of the com-
pute density normally requires synthesis of the TILT-System so that we can obtain its computational
throughput and resource consumption. However, it is impractical to synthesize millions of possible TILT
and Fetcher configurations in Quartus to determine the design with the highest compute density since
each compilation can take hours. To ease development effort and improve designer productivity, we have
implemented a software Predictor tool that is capable of quickly predicting the most area-efficient design
for a given application on Altera’s Stratix V FPGAs.
The Predictor tool uses derived TILT-System performance and area models to quickly estimate the
expected compute throughput and hardware area of promising TILT-System configurations. We devise
a search algorithm that navigates a subset of the large TILT and Fetcher design space using performance
heuristics, attempting only those configurations that are likely to produce successively better designs
with higher compute densities. The Predictor also automatically configures the new TILT enhancements
presented in Chapter 4 for those applications that benefit from them. As we will show in Section 5.4.2,
the Predictor can analyze in minutes thousands of potential TILT-System configurations that would
take weeks to compile in Quartus. Further, the tool is able to predict with high accuracy the densest
TILT-System design without any manual designer effort.
The Predictor tool does not currently predict TILT-System designs with the best core count, targeting
only 8-core TILT-Systems instead. Varying the number of TILT cores has a noticeable impact on compute
throughput, area cost, Fmax and the required configuration of the Memory Fetcher which we explore
in Section 6.5.3. Further work is necessary to amend our tool so that it is able to reliably predict the
densest TILT-System configuration with the core count being a part of its recommendation.
We begin this chapter with a description of how the Predictor estimates the compute throughput
of a TILT-System configuration in Section 5.1. The derivation of the TILT and Fetcher area models
is presented in Section 5.2. We describe how the Predictor tool navigates the large TILT and Fetcher
configuration space to determine the most computationally dense TILT-System design in Section 5.3.
Finally we present runtime and prediction accuracy results in Section 5.4 to demonstrate the efficacy of
our Predictor tool.
5.1 Estimation of TILT-System Performance
The expected compute throughput of a TILT-System configuration is calculated by scheduling the appli-
cation’s DFG onto that architecture. This also includes scheduling the external memory operations that
move data between the TILT cores and the Fetcher. This is necessary to ensure any possible growth in
TILT’s instruction schedule due to the memory operations is accounted for in the throughput estimate.
The expected compute throughput is the total number of compute operations (such as add and multiply)
in TILT’s schedule divided by the length of the schedule in cycles. This average operations-per-cycle
value does not take into account the Fmax of the TILT cores.
From synthesizing thousands of different TILT-System configurations, we observe computationally
dense TILT designs fall in the middle of the possible TILT core sizes and have Fmax values typically
between 220 and 260 MHz on Stratix V FPGAs (with a speed grade of 2). For the very large designs
that are considered by the Predictor, the Fmax can drop to be as low as 170 MHz. By not taking the
Fmax into consideration in the throughput calculation, we essentially assume a constant Fmax for all
TILT-System designs. This means we over-estimate the performance of large TILT-Systems relative
to the medium-sized designs that have a lower variance in Fmax and higher Fmax values. However,
since these larger designs will already have lower compute densities due to their higher area costs and
diminishing gains in compute throughput, they will not be selected by the Predictor as the best design
anyway. Their lower Fmax values will merely reduce the estimated compute densities further.
Since the Predictor selects a Fetcher configuration that is capable of supporting the maximum rate
at which the target TILT-System can move data between its cores and the Fetcher (see Section 5.3.5),
we can safely assume the memory bandwidth requirement of the TILT cores is met. We also assume the
TILT cores do not incur any unexpected stalls due to waiting for data to be read or written to off-chip
memory. This is acceptable as long as the rate at which the Fetcher reads and writes data to off-chip
memory is sufficiently below the maximum rate at which we are able to access DDR memory. For these
reasons, the calculated throughput value is a close enough approximation of the actual throughput.
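The estimate described in this section, together with the density metric the Predictor ultimately optimizes, can be sketched as follows. The numbers in the usage example are illustrative, not taken from the thesis.

```python
# Sketch of the Predictor's scoring: expected throughput is the number
# of compute operations divided by the schedule length in cycles (Fmax
# is ignored), and compute density divides that by area in eALMs.
# The configuration numbers below are illustrative only.

def expected_throughput(num_compute_ops, sched_len_cycles):
    return num_compute_ops / sched_len_cycles   # average operations-per-cycle

def compute_density(num_compute_ops, sched_len_cycles, area_ealms):
    # throughput per 10k eALMs, the metric the Predictor maximizes
    return expected_throughput(num_compute_ops, sched_len_cycles) * 1e4 / area_ealms

# A larger configuration shortens the schedule but costs more area;
# the Predictor keeps whichever configuration scores the higher density.
small = compute_density(200, 400, 3000)
large = compute_density(200, 320, 9000)
assert small > large
```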
5.2 TILT and Fetcher Area Models
In this section, we present the area models of the TILT processor and Memory Fetcher on Stratix V
FPGAs. These models are used by the Predictor to quickly estimate the layout area of TILT-System
configurations without requiring them to be synthesized. The area models were derived by compiling
a representative set of TILT-System configurations in Quartus and obtaining their resource usage from
the fitter reports. The total layout area of the ALMs, M20Ks and DSP Blocks used is reported as a
single value in equivalent ALMs (eALMs) which we defined earlier in Chapter 4. The conversion of these
FPGA resources into a single metric makes it easier for the Predictor to minimize the total resource
consumption. The TILT processor consists of the FU array, instruction and data memories and the read
and write crossbars. The Fetcher is composed of its own instruction memory, control and data FIFOs.
The calculation of the area costs of these components is presented in the sections below.
5.2.1 TILT FU Array
The FU array comprises the single-precision, floating-point FUs generated by Altera’s MegaWizard
Plug-in Manager and the custom Cmp and LoopUnit FUs that are used to support predication and
loops. Each instance of an FU has an associated Decode module that decodes the FU’s operation field
in TILT’s instruction and generates the signals that control the behaviour of the crossbar multiplexors,
reading and writing to data memory and so on.
The FUs and their Decode modules are built out of ALMs on the FPGA. As each instance of an FU
type and its Decode module produces the same hardware, the ALM count remains fairly constant, with
a small amount of variability due to how the components are placed and routed onto the FPGA by
Quartus. Certain FU types also consume a constant number of M20K BRAMs and DSP Blocks. Table
5.1 summarizes the resources consumed by the different TILT FUs and their total average area costs
in eALMs. Table 5.2 provides the areas of the Decode modules for these FUs. The size of the Decode
module can be different for two FU types even if the encoding of their operation is the same. This is
due to the FUs having different pipeline depths – delayed copies of the operation fields feed into control
logic at different pipeline stages. The Predictor estimates the area of TILT’s FU array for a given FU
mix using the area numbers provided in these two tables.
FU Unit    Avg ALMs   M20Ks   DSPs   Avg Area (eALMs)
FAddSub    331.1      0       0      331.1
FMult      88.2       0       1      118.2
FDiv       263.3      1       5      453.7
FSqrt      421.7      0       0      421.7
FExp       371.9      0       9      641.9
FCmp       56.8       1       0      96.8
FLog       1130.9     0       2      1190.9
FAbs       12.6       0       0      12.6
LoopUnit   9.8        0       0      9.8

Table 5.1: Resources consumed by Altera's single-precision, floating-point FUs and custom FUs.
Unit     Avg ALMs   Unit        Avg ALMs   Unit        Avg ALMs
AddSub   100.6      AddSubCmp   120.7      AddSubAbs   101.5
Mult     101.6      MultCmp     121.1      MultAbs     107.3
Div      124.1      DivCmp      144.6      DivAbs      127.7
Sqrt     142.3      SqrtCmp     182.8      SqrtAbs     148.8
Exp      109.5      ExpCmp      151.8      ExpAbs      111.2
Log      120.7      LogCmp      159.9      LogAbs      125.2
Cmp      103.7
Abs      75.6

Table 5.2: Resources consumed by the FU Decode modules.
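Using the figures in Tables 5.1 and 5.2 (note the eALM weights they imply: roughly 30 eALMs per DSP and 40 per M20K, e.g. FMult: 88.2 ALMs + 1 DSP = 118.2 eALMs), the FU-array estimate can be sketched by summing per-instance FU and Decode areas. The sketch pairs each FU with its own Decode module and omits combined-FU sharing.

```python
# Sketch of the Predictor's FU-array area estimate: each FU instance
# contributes its eALM area from Table 5.1 plus its Decode module's
# ALMs from Table 5.2. Combined (port-sharing) FUs are not modelled.

FU_AREA = {"FAddSub": 331.1, "FMult": 118.2, "FDiv": 453.7,
           "FSqrt": 421.7, "FExp": 641.9, "FCmp": 96.8,
           "FLog": 1190.9, "FAbs": 12.6, "LoopUnit": 9.8}   # eALMs
DECODE_AREA = {"FAddSub": 100.6, "FMult": 101.6, "FDiv": 124.1,
               "FSqrt": 142.3, "FExp": 109.5, "FLog": 120.7,
               "FCmp": 103.7, "FAbs": 75.6}                  # ALMs

def fu_array_area(fu_mix):
    """fu_mix: dict of FU type -> instance count; returns eALMs."""
    return sum(count * (FU_AREA[fu] + DECODE_AREA.get(fu, 0.0))
               for fu, count in fu_mix.items())

# e.g. two adders and one multiplier:
area = fu_array_area({"FAddSub": 2, "FMult": 1})
assert abs(area - (2 * 431.7 + 219.8)) < 1e-6
```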
The Predictor may choose to share the operation field and read and write ports of an FU with
another if doing so produces a more computationally dense design with a smaller instruction memory
and crossbars. This TILT customization option was described earlier in Section 2.2.2. Any TILT FU can
share its ports and operation field with any other FU. In practice, however, we only perform this sharing
for FUs that are relatively small in size, such as the Cmp and Abs FUs, or that have low utilization,
such as the LoopUnit FU, pairing them with a standard, larger FU. We present the algorithm used by the
Predictor to determine which FUs to combine in Section 5.3.3.
Since two “combined” FUs share the same operation space in TILT’s instruction, they also share a
single Decode module. This module is slightly larger than the individual Decode modules of the FUs
Chapter 5. Predictor Tool 53
but smaller than their sum, resulting in an overall reduction in FU area. An additional 2:1 mux that
consumes 34.8 ALMs on average selects between the outputs of the two FUs to forward to the write
crossbar. The area costs of the merged Decode modules that are currently supported by the Predictor
are also provided in Table 5.2. The area savings obtained from combining FUs are provided in Table 5.8.
5.2.2 TILT and Fetcher Instruction Memories
The TILT and Fetcher instruction schedules are stored in two logical read-only memories built out of
physical M20K BRAMs. The width and depth of the schedules determine the dimensions of the logical
memories and the number of BRAMs that will be needed. From experiments, we observe that for a
logical memory with a depth of less than 16,384 words, Quartus first chooses the M20K configuration
with the smallest depth that is at least as deep as the logical memory. Then the width of the selected
M20K configuration is used to calculate the number of BRAMs that will be required to fit the width of
the logical memory (Equation 5.1).
BRAMs = ceiling(width of ROM / width of selected M20K config) (5.1)
TILT Insn Mem Area (eALMs) = BRAMs ∗ 40 (5.2)
The M20K BRAMs on Stratix V chips support the memory configurations provided in Table 5.3.
As an example, a logical memory with a width of 295 bits and a depth of 1936 words uses the 10x2048
M20K configuration and requires 30 M20Ks in total. For the benchmarks evaluated, we do not exceed
a depth of 16,384 words for any of the TILT-System memories but in such a scenario, the Predictor
assumes the 1x16,384 configuration is used.
Width (bits)   Depth (words)
40             512
20             1,024
10             2,048
5              4,096
2              8,192
1              16,384
Table 5.3: Supported M20K BRAM configuration sizes on Stratix V FPGAs [64].
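The configuration-selection rule and Equations 5.1 and 5.2 can be sketched in Python as follows (a minimal illustration; the function and variable names are ours, not the Predictor's):

```python
import math

# Stratix V M20K configurations (width in bits, depth in words), per Table 5.3.
M20K_CONFIGS = [(40, 512), (20, 1024), (10, 2048), (5, 4096), (2, 8192), (1, 16384)]

def instruction_memory_brams(rom_width_bits, rom_depth_words):
    """M20K count for a logical ROM, following the observed Quartus
    behaviour and Equation 5.1."""
    # Smallest M20K depth that is at least as deep as the logical memory.
    cfg_width = next(w for (w, d) in M20K_CONFIGS if d >= rom_depth_words)
    return math.ceil(rom_width_bits / cfg_width)  # Equation 5.1

# The worked example above: a 295-bit wide, 1,936-word deep schedule.
brams = instruction_memory_brams(295, 1936)
area_ealms = brams * 40  # Equation 5.2
```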
The widths of the TILT and Fetcher variable-length instructions depend on the TILT-System's architectural parameters (refer to Figures 2.4 and 3.2). The area cost of the TILT instruction memory is the
cost of the BRAMs needed to build the memory in eALMs (Equation 5.2). The Fetcher’s instruction
memory area is calculated similarly. The cost of an M20K BRAM on a Stratix V chip is 40 eALMs [61].
5.2.3 TILT Data Memory
TILT’s data memory is organized into memory banks, with each composed of two or more BRAMs that
hold the same data to support multiple concurrent reads and writes per bank (refer to Figure 2.2). The
width of the data words is always 32 bits. The depth of each memory bank is the product of the number
of threads assigned to each bank and the number of data words per thread. Thread and memory bank
counts are provided as TILT’s configuration parameters and the number of data words per thread is
determined by performing memory allocation on the application’s DFG.
To determine the number of BRAMs needed by TILT’s data memory, we begin by choosing the M20K
configuration that can fit the data words in each memory bank. The width of this configuration is then
used to determine the number of BRAMs required (Equation 5.3). Since each memory bank supports
multiple reads and writes and there are multiple banks, the BRAM count of Equation 5.3 is multiplied
by these parameters to obtain the total number of BRAMs required by the data memory (Equations 5.4
and 5.5). The area cost in eALMs is then calculated using Equation 5.6.
BRAMs = ceiling(32 / width of selected M20K config) (5.3)
BRAMs per Bank = BRAMs ∗ (Read ports per Bank) ∗ (Write ports per Bank) (5.4)
Data Mem BRAMs = (BRAMs per Bank) ∗ (# of Banks) (5.5)
Data Mem Area (eALMs) = (Data Mem BRAMs) ∗ 40 (5.6)
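Equations 5.3 to 5.6 can be combined into one short sketch (Python; the function name and the example parameters are illustrative only):

```python
import math

# Stratix V M20K configurations (width in bits, depth in words), per Table 5.3.
M20K_CONFIGS = [(40, 512), (20, 1024), (10, 2048), (5, 4096), (2, 8192), (1, 16384)]

def data_memory_area(words_per_thread, threads_per_bank, banks,
                     read_ports_per_bank, write_ports_per_bank):
    """TILT data memory area in eALMs, following Equations 5.3-5.6."""
    bank_depth = words_per_thread * threads_per_bank
    cfg_width = next(w for (w, d) in M20K_CONFIGS if d >= bank_depth)
    brams = math.ceil(32 / cfg_width)                              # Eq. 5.3 (32-bit words)
    per_bank = brams * read_ports_per_bank * write_ports_per_bank  # Eq. 5.4
    total_brams = per_bank * banks                                 # Eq. 5.5
    return total_brams * 40                                        # Eq. 5.6

# Hypothetical configuration: 96 words/thread, 8 threads/bank, 4 banks,
# 2 read and 1 write port per bank.
data_memory_area(96, 8, 4, 2, 1)
```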
5.2.4 TILT Read and Write Crossbars
The read crossbar routes data read from TILT’s data memory to the input ports of the FU array and
to the external read ports of the Memory Fetcher. The number of inputs of this crossbar and how they
are connected to the crossbar’s outputs are a function of TILT’s data memory organization: the number
of memory banks and the number of read and write ports per bank. The number of outputs is equal to
the sum of the FU input ports and external read ports. Similarly, the write crossbar routes the outputs
of the FUs and data arriving from the Fetcher to TILT’s data memory. The number of inputs of this
crossbar is equal to the sum of the number of FUs and external write ports. The number of outputs is
the product of the number of total memory banks and write ports per bank.
# of inputs  = fn(# of banks, # of bank read ports, # of bank write ports)
# of outputs = (# of FU input ports) + (# of external read ports)

# of banks            = 1, 2, 4, 8, 16, 32
# of bank read ports  = 2, 4, 6, 8
# of bank write ports = 1, 2, 3
# of outputs          = 2, 4, 6, ..., 30

# of area table entries = (6*4*3)*15 = 1080
Table 5.4: TILT predictor read crossbar configuration mixes.
# of inputs  = (# of FU outputs) + (# of external write ports)
# of outputs = (# of banks) * (# of bank write ports)

# of inputs           = 2, 3, 4, ..., 15
# of banks            = 1, 2, 4, 8, 16, 32
# of bank write ports = 1, 2, 3

# of area table entries = 14*(6*3) = 252
Table 5.5: TILT predictor write crossbar configuration mixes.
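The port-count formulas at the top of Tables 5.4 and 5.5 translate directly into code (a sketch with illustrative names; the read crossbar's input count is omitted because it is an unspecified function of the memory organization):

```python
def write_xbar_dims(n_fus, ext_write_ports, banks, write_ports_per_bank):
    """Write crossbar (inputs, outputs) per the formulas of Table 5.5."""
    return n_fus + ext_write_ports, banks * write_ports_per_bank

def read_xbar_outputs(fu_input_ports, ext_read_ports):
    """Read crossbar output count per the formula of Table 5.4."""
    return fu_input_ports + ext_read_ports

# Hypothetical core: 4 FUs with 8 total input ports, 8 banks with 1 write
# port each, 2 external read and 2 external write ports.
write_xbar_dims(4, 2, 8, 1)
read_xbar_outputs(8, 2)
```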
We produce two tables of the actual read and write crossbar areas which are indexed by their input
and output configuration parameters. The areas are obtained from compiling a representative selection
of different crossbar configurations in Quartus that span the Predictor’s TILT design exploration space.
These configuration mixes are provided in Tables 5.4 and 5.5 for the read and write crossbars respectively.
The Predictor estimates the crossbar areas of a TILT configuration using the entries in these two tables
and interpolates between points using a linear model.
The read and write crossbar area graphs in Figure 5.1 illustrate the linear growth in crossbar area
with respect to the number of FUs (which are outputs for the read crossbar and inputs for the write
crossbar) and the number of memory banks for two sets of bank read and write port configurations. For
a different number of read and write ports, the crossbar area follows the same trend as that presented
in Figure 5.1. Therefore, the linear model we use to interpolate the area cost of intermediate crossbar
configurations is quite accurate.
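The interpolation step can be illustrated with a one-dimensional sketch; the table below holds hypothetical area points for one fixed bank and port configuration, whereas the Predictor's actual tables are indexed by the full configuration.

```python
def interpolate_area(table, n_fus):
    """Linearly interpolate crossbar area between the two nearest tabulated
    FU counts for one fixed bank/port configuration. `table` maps a number
    of FUs to a measured area in ALMs (hypothetical points below)."""
    if n_fus in table:
        return float(table[n_fus])
    lo = max(k for k in table if k < n_fus)
    hi = min(k for k in table if k > n_fus)
    frac = (n_fus - lo) / (hi - lo)
    return table[lo] + frac * (table[hi] - table[lo])

# Hypothetical compiled data points for one configuration:
areas = {2: 4000, 4: 8000, 8: 16000}
interpolate_area(areas, 6)  # midway between the 4- and 8-FU points
```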
[Figure 5.1 plots: crossbar area (in 1k ALMs) versus number of FUs (1 to 14) for 1 to 32 memory banks.
(a) Read crossbar area with 1 read and 2 write ports per memory bank.
(b) Read crossbar area with 2 read and 4 write ports per memory bank.
(c) Write crossbar area with 1 read and 2 write ports per memory bank.]
Figure 5.1: Read and write crossbar areas for different mixes of FUs, data memory banks and memory bank ports obtained from compiling the configurations in Quartus.
5.2.5 Fetcher Data FIFOs
The data FIFOs inside the Fetcher are built using Altera’s dual-clock FIFO (DCFIFO) megafunction [65].
The Fetcher supports data FIFOs with arbitrary depth and a word width of 256 bits. We summarize
the resources consumed by these FIFOs as a function of depth in Table 5.6. Although the DCFIFO
megafunction can support depths below 64 or beyond 32,768 words, we do not present the costs for
these depths because we do not make use of them. The number of M20Ks, LUTs and registers are
obtained directly from the resource usage numbers estimated by Altera’s MegaWizard Plug-in Manager.
We observe that the number of M20Ks used follows the same calculation as the one we have used to
calculate BRAM usage for TILT and Fetcher memories in earlier sections.
Depth (words)   M20Ks   ALUTs   Registers   Avg. ALMs   Total Area (eALMs)
64              7       20      117         49.4        329.4
128             7       22      132         57.6        337.6
256             7       24      147         66.7        346.7
512             7       26      162         73.6        353.6
1,024           13      28      177         81.0        601.0
2,048           26      30      192         89.3        1,129.3
4,096           52      32      209         96.2        2,176.2
8,192           128     34      224         105.1       5,225.1
16,384          256     36      239         112.1       10,352.1
32,768          512     127     257         120.0       20,600.0
Table 5.6: Resources consumed by Altera's DCFIFO megafunction with a word width of 256 bits. Each ALM on a Stratix V FPGA contains 2 ALUTs (Adaptive Look-Up Tables) and 4 registers [66].
The average number of ALMs consumed by the data FIFOs is also provided in Table 5.6. These
numbers are obtained from the fitter compilation reports in Quartus. Finally, the average area cost of
the data FIFOs in Table 5.6 is the sum of the ALMs used and the cost of the M20K BRAMs in eALMs.
The Predictor estimates the total area cost of the data FIFOs for a Fetcher configuration with Equation
5.7 using the area costs provided in Table 5.6 and the number of external read and write ports.
Data FIFO Area (eALMs) = (area of outgoing FIFO) ∗ (# of external read ports)
+ (area of incoming FIFO) ∗ (# of external write ports) (5.7)
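Equation 5.7 amounts to a table lookup per FIFO, scaled by the external port counts (a sketch; the depths and port counts in the example are illustrative):

```python
# Average eALM cost of a 256-bit-wide DCFIFO by depth (from Table 5.6).
FIFO_AREA = {64: 329.4, 128: 337.6, 256: 346.7, 512: 353.6, 1024: 601.0,
             2048: 1129.3, 4096: 2176.2, 8192: 5225.1}

def data_fifo_area(outgoing_depth, incoming_depth,
                   ext_read_ports, ext_write_ports):
    """Equation 5.7: one outgoing FIFO per external read port and one
    incoming FIFO per external write port."""
    return (FIFO_AREA[outgoing_depth] * ext_read_ports
            + FIFO_AREA[incoming_depth] * ext_write_ports)

# Hypothetical Fetcher: 512-deep outgoing FIFOs on 2 external read ports,
# one 1,024-deep incoming FIFO on 1 external write port.
data_fifo_area(512, 1024, 2, 1)
```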
5.3 Prediction of the Densest TILT-System Design
A brute-force exploration of the TILT-System’s large configuration space is not practical, even using
the performance and area models provided in Sections 5.1 and 5.2 instead of Altera’s hardware synthesis
flow. We reduce the number of designs that are considered by the Predictor in two ways to improve the
runtime and efficacy of the tool. The first is by bounding the explored design space by imposing fixed
limitations on the range of values we try for different TILT-System configuration parameters. This is
presented in Section 5.3.1. We further improve the runtime of the Predictor by considering only those
configurations among the bounded set that are likely to produce computationally dense designs. This is
accomplished with heuristics using the search algorithm described in Section 5.3.2.
If the bounded configuration space is made too small, then the tool will arrive at a solution very
quickly but will also be more likely to miss good designs. Conversely, if the Predictor is tuned to explore
a very large number of TILT-System configurations, then it is likely to produce an optimal solution but
at the cost of designer productivity due to the long execution time. We want the tool to explore a large
set of promising candidate designs so that it produces good solutions while maintaining an acceptable
runtime of under a minute, with a few minutes as the worst case. Moreover, we allow the designer to
optionally configure the size of the design space that the Predictor explores. Through parameters that
can be changed, such as the range of thread counts to try, the Predictor can be tuned to explore a larger
or smaller set of configurations to meet the designer’s productivity and performance requirements for a
target application.
5.3.1 Constraining the Explored Design Space
Despite the large number of TILT-System configuration permutations, computationally dense
designs are usually situated within a well-defined, bounded region of the configuration space. As we
demonstrate in Section 6.3, we observe large gains in compute density initially as we add resources such
as FUs and threads to the baseline TILT configuration to enable more work to be performed in parallel.
Adding more resources beyond a certain point causes the compute density to drop due to diminishing
gains in compute throughput at increasingly higher area costs. This means large TILT-Systems are not
very area-efficient and therefore, need not be considered by the Predictor.
TILT parameters
  TILT core count  = 8
  FU array         = maximum of 3 FUs per type
  memory banks     = 1, 2, 4, 6, 8, 12, 16, 24, 32
  threads/bank     = 1, 2, 4, 6, 8, 12, 16 (24 with up to 4 banks, 32 with up to 2)
  read ports/bank  = 2, 4, 6, 8
  write ports/bank = 1, 2, 3
Fetcher parameters
  external read ports  = min{[1, 2, 3], total bank read ports}
  external write ports = min{[1, 2, 3], total bank write ports}
  data FIFO depth      = 64, 128, 256, ..., 8192
Table 5.7: All permutations of TILT and Fetcher configurations that the Predictor is configured to explore in this work.
Table 5.7 summarizes all the possible permutations of TILT and Fetcher configuration parameters
that are available to the Predictor’s search algorithm described in Section 5.3.2. The parameter ranges
provided in this table were determined experimentally, based on the observations made while finding
computationally dense designs in a brute-force manner for our benchmarks. The configuration space is
made sufficiently broad that we are unlikely to miss good design solutions. The designer can choose to
deviate from the ranges presented in the table to increase prediction accuracy or to improve the tool’s
runtime as needed by modifying a configuration file.
5.3.2 Search Algorithm
We now describe the Predictor’s search algorithm and the heuristics that the algorithm uses to narrow
down its search beyond the fixed limitations placed on the TILT-System configuration space earlier in
Section 5.3.1. The pseudo-code of this algorithm is provided in Algorithm 7. The algorithm requires
the DFG of the target application as its input. We also perform memory allocation and the removal of
aliases on the DFG (described in Section 4.3) prior to the execution of the algorithm. As its output, the
algorithm produces the TILT-System configuration with the highest predicted compute density.
Algorithm 7 begins by choosing the baseline TILT-System configuration as its best design (lines 2
through 14). This is an 8-core system with the smallest TILT core and Fetcher design possible; this has
the lowest area cost but also the lowest compute throughput. Each TILT core in this design consists
of 1 of each FU type required by the application with 1 thread, 1 data memory bank and 2 read and 1
write ports per memory bank. Similarly, the Fetcher comprises 1 external read and write port with the
minimal data FIFO depth calculated using Equations 5.9 and 5.10, which we present later in Section 5.3.5.
Algorithm 7: Get TILT-System configuration with highest compute density
Input: DFG of application
Output: TILT-System configuration
 1  FU array = determine the FU types in the DFG to share using Algorithm 8;
 2  trialConfig = base TILT-System config with FU array;
 3  toTryQueue = {∅}; configsTried = {∅};
 4  enqueue {trialConfig, 0, 0} into toTryQueue;
 5  bestConfig = nil; bestCd = 0;
 6  while toTryQueue is not empty do
 7      dequeue {trialConfig, startI, iToSkip} from toTryQueue;
 8      generate memory and compute schedule for trialConfig;
 9      order {trialPerfType, trialPerfCount} from largest trialPerfCount to smallest;
10      throughput = total # of compute operations / length of tilt schedule;
11      trialCd = throughput / TILT-System area estimated by AreaModel;
12      if trialCd > bestCd then
13          bestConfig = trialConfig;
14          bestCd = trialCd;
15          {bestPerfType, bestPerfCount} = {trialPerfType, trialPerfCount};
16          startI = 0;
17      for i = startI to bestPerfCount.length-1 do
18          trialConfig = updateConfig(bestPerfType[i], bestConfig);
19          if trialConfig != nil then
                // startI = i+1; iToSkip = i
20              if trialConfig does not exist in toTryQueue or configsTried then
21                  enqueue {trialConfig, i+1, i} into toTryQueue;
22              break;
23      if trialCd < bestCd then
24          for i = 0 to trialPerfCount.length-1 do
25              if i == iToSkip then
26                  skip loop iteration;
27              trialConfig = updateConfig(trialPerfType[i], trialConfig);
28              if trialConfig != nil then
                    // startI = 0; iToSkip = i
29                  if trialConfig does not exist in toTryQueue or configsTried then
30                      enqueue {trialConfig, 0, i} into toTryQueue;
31                  break;
32      insert trialConfig into configsTried;
33  return bestConfig;
The mix of FU types to be used by the TILT cores and which FU types, if any, to share with another
FU is determined beforehand (line 1) using Algorithm 8 which we describe in Section 5.3.3.
At every iteration of the while loop (line 6), the algorithm attempts to find incrementally larger
TILT-System configurations which are likely to have higher compute densities. This is accomplished by
determining the TILT or Fetcher resource that is the largest bottleneck to the TILT-System’s perfor-
mance and attempting to alleviate it by adding more of that resource. This is done by the updateConfig
procedure (lines 18 and 27) which we will describe later in Section 5.3.4. As an example, if the compute
throughput is limited by the number of available Mult FUs then another Mult FU is added. If TILT’s
existing FU array and/or memory bank ports are being underutilized, more threads are added. The
objective of the algorithm is to reach the peak compute density by selectively increasing the size of the
TILT-System to alleviate performance bottlenecks in the design.
The bottlenecks of the TILT-System are represented by performance counters which are generated
when the application’s DFG is statically scheduled onto the trial TILT-System configuration (line 8).
These counters are ordered from the largest value to the smallest (line 9) and iterated in this sequence
to select the resource bottlenecks to alleviate (lines 17 and 24). If updateConfig is unable to add more of
a resource to alleviate the selected bottleneck due to the fixed constraints imposed in Section 5.3.1 (lines
19 and 28), then the next bottleneck from the ordered array is attempted. These performance counters
are presented in Section 5.3.4.
For any ‘trial’ TILT-System configuration that is considered by the Predictor, there exist two cases:
either the trial configuration has a higher compute density than the current best configuration or it does
not. In the first case (line 12), we update the best configuration to that of the trial (line 13) and the next
configuration to evaluate is the one that alleviates the biggest bottleneck of the new best configuration;
this configuration is inserted into the toTryQueue (lines 17 to 22). In the second case, the bestConfig
parameter is not updated and two new trial configurations are inserted into the toTryQueue. In both
cases, if we already reached the fixed resource limit(s) for a bottleneck, we try instead to alleviate the
next most significant bottleneck. The search ends when the toTryQueue is empty and there are no more
trial configurations to consider.
For the second case, the first of the two configurations that is inserted into toTryQueue is the
configuration that alleviates the next biggest bottleneck of the best configuration (lines 17 to 22). This
is because the previous bottleneck did not yield a design with a higher compute density. The second
is the configuration that alleviates the biggest bottleneck of the attempted trial configuration (lines 24
to 31). We ensure this bottleneck is not the same as the one that was alleviated to obtain the trial
configuration (line 25). This additional condition prevents the algorithm from trying configurations that
have more of the resource that was added to obtain the trial configuration initially. Due to diminishing
gains associated with each incremental increase in the same resource, if the compute density did not
improve with the trial configuration, it will also not increase with a configuration that has more of that
resource until some other resource is increased first.
To summarize, the algorithm greedily records only the TILT-System configuration with the highest
compute density. It also greedily chooses trial configuration(s) by attempting to alleviate the biggest
bottleneck(s) of the current best configuration. Although we keep track of only the best configuration,
we also try to alleviate bottlenecks of trial configurations that have a lower compute density than that of
the best. This way, we expand the search to consider improvements to more TILT-System configurations
than just the current best. This distinguishes our algorithm from a purely greedy solution which can
get stuck in a local maximum that is not close to the globally optimal design solution.
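The search strategy just summarized can be condensed into a small Python skeleton. The three callables stand in for the Predictor's scheduler/density evaluation, performance-counter ordering and updateConfig procedure (all hypothetical signatures), and the one-dimensional toy problem at the end is purely illustrative.

```python
from collections import deque

def predict_best(base, evaluate, bottlenecks, grow):
    """Toy skeleton of Algorithm 7's queue-based search. evaluate(cfg) gives
    a compute density, bottlenecks(cfg) lists resources worst-first, and
    grow(cfg, r) returns a larger config or None when r is at its limit."""
    to_try = deque([(base, 0, None)])  # (config, startI, iToSkip)
    tried = set()
    best, best_cd = None, 0.0
    while to_try:
        cfg, start_i, skip_i = to_try.popleft()
        cd = evaluate(cfg)
        if cd > best_cd:
            best, best_cd, start_i = cfg, cd, 0
        # Try to alleviate the next-worst bottleneck of the best config.
        for i, res in enumerate(bottlenecks(best)):
            if i < start_i:
                continue
            nxt = grow(best, res)
            if nxt is not None and nxt not in tried:
                to_try.append((nxt, i + 1, i))
                break
        # Losing trials are also expanded, so the search is not purely greedy.
        if cd < best_cd:
            for i, res in enumerate(bottlenecks(cfg)):
                if i == skip_i:
                    continue
                nxt = grow(cfg, res)
                if nxt is not None and nxt not in tried:
                    to_try.append((nxt, 0, i))
                    break
        tried.add(cfg)
    return best

# One-dimensional toy: density peaks at 4 units of the single resource "r".
evaluate = lambda c: 10 - (c - 4) ** 2
bottlenecks = lambda c: ["r"]
grow = lambda c, r: c + 1 if c < 10 else None
```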
5.3.3 Combining TILT FUs
In this section, we present Algorithm 8 which is used to determine the FU types that will share their
read and write ports and operation field in TILT’s instruction with another FU. The array of FU types
produced by this algorithm is adopted by the Predictor’s search algorithm (described in Section 5.3.2),
which predicts the TILT-System configuration with the highest compute density.
Algorithm 8: Determine the FU types in the application’s DFG to share.
Input: DFG FU types[], DFG operations
Output: TILT FU array
 1  toShareFUs[] = {∅};
 2  foreach FU in FU types do
 3      if FU is LoopUnit or Cmp or Abs then
 4          insert into toShareFUs;
 5  FU usages[] = # of compute operations of each FU type in DFG;
 6  utilCostRatios[][] = {∅};
 7  foreach toShareFU in toShareFUs do
 8      foreach FU in FU types do
 9          areaSavings = (area[toShareFU] + area[FU]) - sharedArea[toShareFU, FU];
10          util = FU usages[toShareFU] + FU usages[FU];
11          utilCostRatios[toShareFU][FU] = util / areaSavings;
12  FU array = FU types;
13  while toShareFUs is not empty do
14      get <toShareFU, FU> entry with lowest value in utilCostRatios that is > 0;
15      replace FU entry in FU array with <FU + toShareFU>;
16      remove toShareFU from FU array and toShareFUs;
17  return FU array;
Sharing an FU’s read and write ports with another FU can be worthwhile due to the reduction in
TILT’s instruction width and the size of TILT’s crossbars. However, it also introduces contention in
scheduling because only one of the FUs can read data into its input ports or write its result to data
memory on a given cycle. For this reason, it is best to only combine FUs that have low utilization
relative to the other FUs and/or are small in area with comparatively larger Decode modules.
From Tables 5.1 and 5.2, we observe that the Cmp, Abs and LoopUnit are the smallest of the TILT
FUs and the costs of their Decode modules are the largest relative to their sizes. Among them, the
LoopUnit is the smallest and is likely to be the least utilized, since it requires only two loop operations
in the schedule for each loop in the application. Table 5.8 presents the area savings that can be achieved
by combining the Cmp and Abs FUs. The LoopUnit provides the best area savings because it does not
increase the size of the Decode module beyond that of the other FU it is shared with. This makes the
LoopUnit an ideal candidate for sharing with another FU type.
Algorithm 8 first determines the FU types that it will attempt to combine with another (lines 1 to 4).
Currently these FUs are the Cmp, Abs and LoopUnit discussed earlier. Next, the algorithm determines
the number of operations of each type in the DFG (line 5). This is the utilization of each FU type. Based
on the earlier discussion, we want to combine the least utilized FUs. The second metric we calculate
is the area savings that can be achieved by combining two FU types. We want to maximize this value
which can be determined using a table lookup of the area savings provided in Table 5.8.
FU Unit    Area with Cmp FU (eALMs)      Savings (eALMs)
           Separate       Shared
FAddSub    632            549            84
FMult      420            336            84
FDiv       778            695            83
FSqrt      765            701            63
FExp       952            891            61
FLog       1,512          1,448          65

           Area with Abs FU (eALMs)
FAddSub    520            445            75
FMult      308            238            70
FDiv       665            594            71
FSqrt      652            583            69
FExp       840            766            74
FLog       1,400          1,329          71
Table 5.8: Relative sizes of FUs and their Decode modules with Cmp or Abs FUs separate and shared.
The utilization and area savings are used to calculate a ratio between the two metrics for each possible
combination of two FU types (lines 6 to 11). Finally, the algorithm shares those FU combinations that
have the lowest ratio of utilization to area savings (lines 12 to 16). We avoid combinations with a
negative ratio because it would mean the area cost is higher when the two FU types are shared. This
maximizes the area savings benefit while combining the least utilized FU types to minimize resource
contention during scheduling.
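A small numeric sketch of lines 9 to 11 of Algorithm 8, using the Cmp column of Table 5.8 and hypothetical per-DFG operation counts (savings are taken as separate minus shared area, so a positive value means sharing reduces area):

```python
# Separate vs. shared areas from the Cmp rows of Table 5.8 (eALMs);
# the per-DFG usage counts below are hypothetical.
separate = {("FMult", "Cmp"): 420, ("FAddSub", "Cmp"): 632}
shared = {("FMult", "Cmp"): 336, ("FAddSub", "Cmp"): 549}
usage = {"FMult": 12, "FAddSub": 20, "Cmp": 2}

def util_cost_ratio(fu, to_share):
    """util / areaSavings, as in lines 9-11 of Algorithm 8."""
    savings = separate[(fu, to_share)] - shared[(fu, to_share)]
    return (usage[fu] + usage[to_share]) / savings

# Cmp is combined with the FU giving the lowest positive ratio:
best_pair = min([("FMult", "Cmp"), ("FAddSub", "Cmp")],
                key=lambda p: util_cost_ratio(*p))
```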
Using only the operation count of each FU type disregards dependencies between operations, which
is not ideal because it can result in an underprediction of the FU utilization. However, our approach
works well since TILT executes many independent threads to improve the utilization of the FUs which
makes the dependencies between operations within the same thread less important. Considering only a
single set of DFG operations is also sufficient because the ratio of the types of operations will be the
same irrespective of how many instances of the DFG (or threads) are scheduled onto TILT.
The runtime of Algorithm 8 would be much higher if we decided to schedule the application with
varying numbers of threads, FU mixes and TILT hardware configurations to obtain more accurate FU
utilization and area cost numbers. However, this level of complexity is likely unnecessary since our simpler
algorithm presented here produces the same mix of FU types as that of the most computationally dense
TILT-System designs of the five data-parallel benchmarks that we evaluate in Chapter 6. These designs
were determined through a more exhaustive search of the TILT-System parameter space. Determining
the situations in which it may be beneficial to combine the larger standard FUs with each other is left
as future work.
5.3.4 Alleviation of Bottlenecks with Performance Counters
Performance counters are produced during the generation of the instruction schedules for a target TILT-
System and application. These counters record the number of cycles by which the issuing of a memory or
compute operation must be delayed in the instruction schedule due to TILT or Fetcher resources being
unavailable. As an example, if the earliest time an operation could be scheduled is cycle 6 but it could
not be scheduled until cycle 10 due to resource constraints, the counter associated with that resource
would be incremented by 4.
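The counter update in the example above can be sketched as a scheduler hook (Python; the function and resource names are illustrative, not part of the Predictor's implementation):

```python
from collections import defaultdict

# Per-resource stall counters accumulated during schedule generation
# (illustrative scheduler hook; the resource name is hypothetical).
stall_cycles = defaultdict(int)

def record_stall(resource, earliest_cycle, scheduled_cycle):
    """Charge an operation's scheduling delay to the resource that caused it."""
    stall_cycles[resource] += scheduled_cycle - earliest_cycle

# The example from the text: schedulable at cycle 6, issued at cycle 10.
record_stall("FMult", earliest_cycle=6, scheduled_cycle=10)
```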
An operation being postponed results in all dependent operations also being pushed down in the
schedule due to data and control dependencies. This contributes to the growth in TILT’s schedule which
reduces the TILT-System performance. Moreover, larger instruction memories may also be required to
store the longer TILT and Fetcher schedules. The Predictor utilizes the performance counters to assess
resource bottlenecks of the TILT-System. The tool then attempts to alleviate some of these bottlenecks
by adding more resources to obtain higher compute throughput for an acceptable area cost that results
in an overall increase in compute density.
The updateConfig procedure used by Algorithm 7 comprises a series of conditions that attempt to
resolve the provided resource bottleneck of its input TILT-System configuration. The function produces
the updated TILT-System configuration as its output or nil if the bottleneck could not be resolved. This
can happen when the fixed constraint imposed on a TILT core or Fetcher resource in Section 5.3.1 is
reached for the resource that is to be increased. When this happens, the Predictor attempts to alleviate
the next biggest bottleneck.
We describe how the different TILT and Fetcher resources are increased and introduce the performance
counters that are used to make these decisions below. Each resource is increased in increments up to its
maximum value, which is summarized in Table 5.7. The order in which the performance
counters are presented is the order they are considered in the series of conditions mentioned earlier.
The counters are normalized so that they do not artificially have a higher value (and therefore a higher
priority) depending on what they are measuring.
TILT FU Array. For each FU type, we record the number of times that an operation of that type
could not be scheduled due to all FUs of that type being used by other compute operations. This value is
normalized by the number of FUs of that type. The bottleneck is resolved by incrementing the number
of FUs of that type by 1, up to the maximum of 3, as outlined in Table 5.7.
TILT Threads. The number of threads is increased when the TILT FUs are underutilized, measured
by the number of effective empty compute cycles when an FU was available but no operation was
scheduled for it. This value is normalized by the number of FUs of each TILT core. We increase the
number of threads per bank as outlined in Table 5.7 to increase the FU utilization and throughput.
Data Memory Banks. The maximum number of memory banks is limited by the total number
of threads. If the thread count needs to be greater than 32 per memory bank, we try to increase the
number of banks instead. Similarly, if the read and write port counts per memory bank are already at
their maximums of 8 and 3 respectively and either one is the bottleneck, we also increase the number of
memory banks. Any time there is a change in the number of memory banks, we reset the read
and write ports of each memory bank to 2 and 1. The existing number of threads is also redistributed
equally among all memory banks, usually resulting in a lower number of threads per bank.
Memory Bank Read and Write Ports. Two performance counters track the number of times
when a memory or compute operation could not be scheduled due to being unable to read from or write
data to TILT’s data memory. This happens when the target memory bank’s read or write ports are
already in use by other previously scheduled operations. This is resolved by incrementing the number
of read or write ports of each memory bank respectively. These counters are normalized by the sum of
the number of FUs and external read and write ports.
External Read and Write Ports. Two performance counters track the number of times when
TILT memory read or write operations could not be scheduled due to the associated port being in use
by another memory operation. This is resolved by increasing the number of external read or write ports
respectively. The counters are normalized by the number of external read or write ports. The external
read or write port counts are not allowed to be greater than the total number of memory bank read or
write ports (refer to Table 5.7). This is because the maximum number of data words that can be read
or written from TILT’s data memory to the Fetcher’s FIFOs is the minimum of the memory bank and
external port counts.
5.3.5 Fetcher Configuration Selection
The Fetcher can be scaled by varying the depths of the data FIFOs and the number of external read and
write ports. The Predictor chooses the smallest Fetcher configuration that meets the external memory
bandwidth requirement of the target TILT design’s statically generated memory schedule. The Predictor
estimates this bandwidth using several metrics to calculate the recommended depths of the incoming
and outgoing data FIFOs inside the Fetcher, provided in Equations 5.9 and 5.10 respectively.
FIFOstretch = max[1, (tilt schedule depth) / (schedule segment)] ∗ (schedule iterations)
schedule segment = 256 cycles
schedule iterations = 4 (5.8)
Incoming Data FIFO Depth = (external inputs in DFG) ∗ (tilt threads)
/ (external write ports) ∗ FIFOstretch (5.9)
Outgoing Data FIFO Depth = (external outputs in DFG) ∗ (tilt threads)
/ (external read ports) ∗ FIFOstretch (5.10)
As we described in Section 3.3, we can determine the number of external input and output data
words that must be written or read from TILT’s data memory per thread by analyzing the DFG of
the application. Combining this with the number of TILT threads and the depth of TILT’s schedule
provides us with an accurate estimate of the number of data words that must move between TILT and
the Fetcher over a period of time. This is TILT’s external memory bandwidth that the Fetcher must
be able to support. Dividing this quantity by the number of external read and write ports gives us the
average per-FIFO depth in 256-bit words (Equations 5.9 and 5.10).
We want the Fetcher to be able to buffer a reasonable amount of data inside its data FIFOs ahead of
TILT’s execution to minimize the stalls that TILT may incur. This will usually require the Fetcher to
buffer external data of multiple iterations of TILT’s schedule. However for long schedules, it is sufficient
and also preferable to have only a part of a single iteration of the schedule buffered at any point in time
instead to reduce the area cost of the data FIFOs. This is captured by the FIFOstretch parameter in
Equation 5.8 which is used to express how deep the data FIFOs should be in relation to the amount
of data that must move between TILT and off-chip memory. The values of the 'schedule iterations'
and 'schedule segment' parameters in this equation were determined experimentally; we found that 3 to 5
iterations with a 256-cycle schedule segment work best.
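As a concrete sketch, Equations 5.8 to 5.10 can be written in code as follows. This is a minimal illustration: the function and parameter names are our own, and we omit any rounding of the resulting depths to BRAM-supported sizes.

```python
import math

# Experimentally determined constants from Equation 5.8.
SCHEDULE_SEGMENT = 256    # cycles
SCHEDULE_ITERATIONS = 4   # 3 to 5 iterations work best

def fifo_stretch(tilt_schedule_depth):
    # Equation 5.8: how deep the FIFOs should be relative to the amount
    # of data moving between TILT and off-chip memory.
    return max(1, tilt_schedule_depth / SCHEDULE_SEGMENT) * SCHEDULE_ITERATIONS

def incoming_fifo_depth(ext_inputs, tilt_threads, ext_write_ports, schedule_depth):
    # Equation 5.9: average depth (in 256-bit words) of each incoming data FIFO.
    return math.ceil(ext_inputs * tilt_threads / ext_write_ports
                     * fifo_stretch(schedule_depth))

def outgoing_fifo_depth(ext_outputs, tilt_threads, ext_read_ports, schedule_depth):
    # Equation 5.10: average depth of each outgoing data FIFO.
    return math.ceil(ext_outputs * tilt_threads / ext_read_ports
                     * fifo_stretch(schedule_depth))
```

For a short schedule (depth below one 256-cycle segment), the stretch factor simply buffers four iterations' worth of external data; for long schedules, the max[] term reduces the buffered fraction of each iteration.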
The depths of the Fetcher’s data FIFOs are calculated using Equations 5.8 to 5.10 during the exe-
cution of the Predictor’s search algorithm. The number of external read and write ports is predicted by
the search algorithm. These parameters are used to determine the configuration of the Fetcher and its
area cost (line 11 of Algorithm 7).
5.4 Efficacy of the Predictor
We have designed our Predictor tool with three important objectives in mind. First, the Predictor should
be able to reliably recommend the most computationally dense TILT-System configuration for a target
application with high accuracy. Second, the Predictor should be able to produce this design solution
within an acceptable amount of time, preferably within minutes. Finally, the tool should be able to
perform its function without requiring manual designer effort in tuning the tool. Taking these design
goals into consideration, the Predictor exposes the potential of the highly customizable TILT-System
architecture to the designer. The Predictor improves designer productivity by removing the need to
manually arrive at the most computationally dense TILT-System design using trial-and-error methods
that require long Quartus compilations and performance simulations.
In this section, we evaluate the effectiveness of the Predictor in meeting the design goals outlined
above using the five applications of Section 3.4. We begin by highlighting the significant reduction in
the TILT-System’s configuration space which is obtained by the Predictor’s search algorithm in Section
5.4.1. Then in Section 5.4.2, we present the fast runtime of our tool and compare it with the runtime
of a more exhaustive search of the configuration space. We also contrast these execution times with the
time it took to compile the densest TILT-System designs in Quartus. In Section 5.4.3, we evaluate the
accuracy of our area model by comparing its output with the TILT-System area obtained from the fitter
reports after synthesis of the design in Quartus. Finally in Section 5.4.4, we evaluate the Predictor’s
ability to accurately predict the densest design while only exploring a limited number of configurations
to improve its execution time.
5.4.1 Design Space Reduction
We significantly reduce the number of different TILT-System configurations that are explored by the
Predictor first by fixing the design space that is searched and second by intelligently trying only those
configurations in this space that are likely to produce good design solutions. This is illustrated by Table
5.9 where we present the number of designs that are considered by our Predictor and contrast this with
the significantly higher number of design permutations that are possible even after the design constraints
provided in Table 5.7 are placed on the TILT-System configuration space.
Benchmark    # of Designs Avail. to Predictor    # of Designs Tried by Predictor
BSc          153,055,008                         1,264
HDR          17,006,112                          1,009
Mbrot        1,889,568                           932
HH           17,006,112                          1,874
FIR          1,889,568                           324

Table 5.9: Number of TILT-System designs considered by the Predictor vs. designs available to the tool after applying the fixed constraints of Table 5.7 on the design space.
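To give a feel for how quickly the unpruned space multiplies, the sketch below counts the cross-product of a few parameter ranges. The ranges shown are hypothetical stand-ins, not the actual constraints of Table 5.7.

```python
# Toy illustration of how the TILT-System configuration space multiplies.
# These ranges are hypothetical stand-ins, not the actual Table 5.7 constraints.
ranges = {
    "threads":    [2, 4, 8, 16, 32, 64],
    "mem_banks":  [1, 2, 4, 8],
    "bank_depth": [256, 512, 1024],
    "rd_ports":   [2, 4, 6],
    "wr_ports":   [1, 2, 3],
    "ext_ports":  [1, 2],
}

# An exhaustive search visits the full cross-product of the ranges
# (the real space is larger still, multiplied by the possible FU mixes).
exhaustive = 1
for values in ranges.values():
    exhaustive *= len(values)

# The Predictor instead only expands promising neighbours of the current
# best configuration, visiting a small fraction of this space.
print(exhaustive)  # 6 * 4 * 3 * 3 * 3 * 2 = 1296
```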
5.4.2 Runtime Comparison
The execution time of our Predictor is provided in Table 5.10 for each of our applications. By reducing
the number of designs that are considered by the Predictor, we are able to meet our design objective
of obtaining a design solution within a few minutes for all of our benchmarks, with an average runtime
of 42 seconds. The majority of the Predictor’s execution time is spent on generating the memory and
compute instruction schedules for the target TILT-System. The time it takes to determine the FU array
composition, estimate the performance and area costs and decide which TILT-System configurations to
try is negligible in comparison.
Benchmark    Runtime of Predictor (mins)    Runtime of Exhaustive Search    Quartus Runtime to Compile Densest Design (mins)
BSc          2.7                            3+ days                         53
HDR          0.4                            3+ days                         31
Mbrot        0.4                            18 hrs                          16
HH           2.3                            3+ days                         52
FIR          0.2                            15 hrs                          13
Geomean      0.7                                                            28

Table 5.10: Execution time of the Predictor vs. an exhaustive search of the TILT-System design space defined by the configuration ranges provided in Table 5.7. The time it takes to synthesize the densest TILT-System is also provided for comparison.
In Table 5.10, we also contrast the fast execution time of the Predictor with the significantly longer
execution of an exhaustive search of the design space constrained only by the parameter ranges provided
in Table 5.7. The exhaustive search took longer than 3 days for the BSc and HH applications. These
applications have the largest number of DFG operations to schedule and also the largest number of
TILT-System designs to consider. Generating the instruction schedules for the larger TILT-System
configurations that have a large number FUs, memory banks and threads takes much longer than smaller
designs due to the super-linear runtime of TILT’s scheduling algorithms. The Predictor benefits from
not having to consider these large designs. This is because the tool’s search algorithm prunes the
configuration space and does not explore large designs due to the poor compute densities of earlier,
related but smaller configurations.
Finally, Table 5.10 also provides the synthesis runtime of the densest design for each application. These
designs are medium-sized compared to the smallest and largest TILT-System configurations that are
possible for the same application. Quartus compile times also increase super-linearly with the size of
the design. Since synthesis times are orders of magnitude higher than the time we require to predict
the performance of a single TILT system, it is clear that such fast models are necessary to guide our
Predictor; full hardware synthesis of a large number of candidate systems is not practical.
5.4.3 Accuracy of Area Model
For the Predictor to be able to reliably choose the most computationally dense design for a target
application, it must accurately estimate the area of a wide variety of TILT-System configurations. The
accuracy of our area models is therefore an important indicator of the utility of the Predictor. To measure
this, we use Quartus to synthesize the TILT-System designs that are considered by our Predictor during
the execution of its search algorithm and obtain the actual layout areas of these designs from the
generated fitter compilation reports. After the ALMs, M20K BRAMs and DSP resources utilized by the
designs are converted into eALMs, we compare these costs with the estimated layout areas reported by
our models. The percent error between the estimate and the actual area is presented in Figure 5.2 for
our benchmarks.
[Figure 5.2 plots: histograms of the number of TILT-System designs (y-axis) per area estimation error bin (x-axis: <3, <6, <9, <12, <15 and >=15%) for panels (a) BSc, (b) HDR, (c) MBrot, (d) HH and (e) FIR.]
Figure 5.2: Error distribution of the TILT-System layout area estimated by our area model vs. theactual area obtained from compiling the configurations in Quartus for each of our applications. Thedesigns plotted correspond to those configurations that are explored by the Predictor’s search algorithm.
We find good prediction accuracy with less than 6% error for the small to mid-range TILT-System
designs. This can be attributed to having more table entries of actual areas for such designs. These
entries are also closer together compared to the larger configurations, resulting in less variance in area
during the interpolation of intermediate designs. For example, looking at the crossbar mixes in Tables
5.4 and 5.5, we have entries for 1 to 8 memory banks close together while there are only two entries
to interpolate the areas of the crossbars between 16 and 32 banks. Having more entries for the lower
to mid-range designs was an intentional design choice. As we will show in our evaluation of the TILT-
System’s performance and area in Chapter 6, we foresee the TILT-System being most useful in this
range. Computationally dense TILT-Systems are typically found in this range as well.
Our decision to use tables that contain the exact area costs of different configurations instead of
formulae that perform an imperfect fit across all available data points also helped improve the accuracy
of our area models. Higher prediction accuracy can be achieved by increasing the number of entries
in these tables. However, this also increases the storage requirements of the Predictor program and,
more importantly, requires a larger number of Quartus compilations to populate the tables. Some of the
accuracy loss reported in Figure 5.2 can be attributed to the expected variance in ALMs consumed from
one compilation to another, which is a property of hardware synthesis tools such as Quartus.
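The table-plus-interpolation approach can be sketched in a few lines. The entries below are made-up illustrative crossbar areas, not the thesis's calibrated values; the point is the denser coverage at the low end and linear interpolation between measured entries.

```python
import bisect

# Illustrative (area-per-banks) table: entries are denser at the low end,
# as in Tables 5.4 and 5.5. Values are made up for this sketch.
XBAR_AREA = [(1, 300), (2, 620), (4, 1300), (8, 2800), (16, 6100), (32, 13500)]

def interp_area(banks):
    # Look up an exact entry if present, otherwise linearly interpolate
    # between the two nearest measured configurations.
    xs = [b for b, _ in XBAR_AREA]
    ys = [a for _, a in XBAR_AREA]
    if banks <= xs[0]:
        return ys[0]
    if banks >= xs[-1]:
        return ys[-1]
    i = bisect.bisect_left(xs, banks)
    if xs[i] == banks:
        return ys[i]
    x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (banks - x0) / (x1 - x0)
```

With only two entries between 16 and 32 banks, an intermediate design such as 24 banks is estimated with more variance than a design in the densely tabulated 1-to-8-bank range, mirroring the error distribution of Figure 5.2.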
5.4.4 Prediction Accuracy
Here we examine whether our Predictor's search algorithm was able to correctly predict the most com-
putationally dense TILT-System configuration. It is infeasible to perform a complete exploration of the
design space by synthesizing every possible permutation. We instead rely on the densest configuration
that is reported by an exhaustive search of the constrained design space – a software solution which uses
the Predictor’s performance and area models. The number of designs explored is provided in the second
column of Table 5.9. For the BSc, HDR and HH applications, the search was manually stopped after
three days so the larger designs were not considered.
Benchmark      FU Mix         Threads   Mem Banks   Bank Depth   W/R Ports / Bank   W/R Ext Ports
BSc            2-3-1-1-1-1    64        4           1024         2-4                1-1
 (predicted)   match          match     match       match        match              match
HDR            2-2-1-1        64        4           512          2-4                2-1
 (predicted)   match          match     match       match        match              match
MBrot          1-2*           8         1           256          3-6                1-1
 (predicted)   match          match     match       match        2-4                match
HH             3-2-2-2        32        4           1024         2-4                1-1
 (predicted)   match          match     match       match        match              match
FIR            1-1            2         1           512          2-4                2-1
 (predicted)   match          match     match       match        match              match

Table 5.11: Actual (first row) vs. predicted (second row) TILT-System configurations. FU Mix: AddSub/Mult/Div/Sqrt/ExpCmp/LogAbs; *AddSubLoopUnit/MultCmp.
Benchmark    Actual Compute Density    Predicted    % Error
             (M tps/10k eALMs)
BSc          12.6                      match        match
HDR          70.1                      match        match
MBrot        3.2                       3.0          5%
HH           9.4                       match        match
FIR          24.6*                     match        match

Table 5.12: Actual vs. predicted compute densities of the TILT-System (with 8 TILT cores). *FIR compute density is in M inputs/sec per 10k eALMs.
All output designs of the exhaustive search were also present in the set of TILT-System designs that
were explored by the Predictor tool and synthesized in Quartus to obtain the area model accuracy results
presented earlier in Section 5.4.3. These designs are provided in Table 5.11 as the “actual” best designs
and their actual compute densities, calculated from their ModelSim and Quartus reports, are provided in
Table 5.12. The predicted densest designs are provided in these two tables (second row) for comparison.
The Predictor was able to correctly predict the configurations of all applications except for Mandelbrot
where the memory bank port counts are different. For this reason, the compute density differs by 5%.
The actual densest design in this case is the second densest predicted configuration.
5.5 Summary
In this chapter, we propose a software Predictor tool capable of intelligently navigating the large TILT
and Fetcher design space and recommending a suitably tuned TILT-System design for a target application
without requiring manual designer effort. The Predictor’s execution time is fast compared to hardware
synthesis and is 42 seconds on average for our benchmarks. Our tool accurately predicted the densest
TILT-System configuration for 4 of the 5 benchmarks. For the mis-predicted Mandelbrot application,
the read and write ports per memory bank were predicted to be 4 and 2 instead of 6 and 3, resulting
in a 5% drop in compute density. The designer can optionally expand or reduce the design parameter
space that the Predictor may explore to improve prediction accuracy or reduce runtime respectively.
The Predictor’s performance and area models and the heuristics used to prune the search space are
application independent. Therefore, we believe our tool will be broadly applicable to other data-parallel
applications. Moreover, although in this work we focus on maximizing compute density, we can readily
modify the tool to rank designs based on calculated throughput, minimum system area, or some
other metric that is tracked by our tool.
Chapter 6
TILT-System Evaluation
In this chapter, we present the performance of our most computationally dense, application-customized
TILT-Systems and the FPGA layout area consumed by these designs for the benchmarks provided in
Section 3.4. The implementation of these benchmarks on the TILT architecture is described in Section
6.1. In addition, we compare the performance, area, development effort and scalability of our best TILT-
System designs with our OpenCL HLS implementations of the same benchmarks. We summarize our
evaluation platform and the metrics used below.
Platform. The TILT-System and OpenCL HLS designs were generated using Altera’s Quartus 13.1
and OpenCL SDK [19] targeting the Stratix V 5SGSMD5H2F35C2 FPGA with 2 banks of 4 GB DDR3
memory on the Nallatech 385 D5 board. The relevant Quartus settings used to generate the FUs and
the DDR controller for both approaches are provided in Appendices A and B respectively.
Metric                               Unit of Measurement
Performance: Compute Throughput      millions of threads/sec (M tps)
FPGA Area                            ALMs, BRAMs and DSPs ⇒ equivalent ALMs (eALMs)
Ranking Designs: Compute Density     throughput-per-area (M tps / 10k eALMs)
Table 6.1: Evaluation metrics.
Metrics. We use the same metrics to evaluate our designs as those we have defined in Chapter
4. These metrics are summarized in Table 6.1 and the equivalent ALM (eALM) costs of the FPGA
resources used to measure area are provided in Table 6.2. The maximum frequency at which we are able
to clock our designs (Fmax) is reported in MHz. The FPGA resource utilization and Fmax values are
obtained from Quartus fitter and TimeQuest compilation reports respectively.
Resource     eALM Cost
ALM          1
M20K BRAM    40
DSP Block    30
Table 6.2: Equivalent ALM (eALM) costs of resources on a Stratix V FPGA [61].
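The eALM conversion of Table 6.2 and the compute-density metric of Table 6.1 reduce to a couple of lines; below is a small helper of our own naming for illustration.

```python
# eALM cost weights from Table 6.2 (Stratix V).
EALM_COST = {"alm": 1, "m20k": 40, "dsp": 30}

def ealms(alms, m20ks, dsps):
    # Convert an FPGA resource count into equivalent ALMs.
    return (alms * EALM_COST["alm"]
            + m20ks * EALM_COST["m20k"]
            + dsps * EALM_COST["dsp"])

def compute_density(m_tps, area_ealms):
    # Throughput-per-area: M threads/sec per 10k eALMs (Table 6.1).
    return m_tps / area_ealms * 10_000
```

For example, the 8-core BSc design of Table 6.9 yields 121 / 95,893 x 10,000, or roughly 12.6 M tps per 10k eALMs, matching its reported density.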
6.1 Benchmark Implementation on TILT
We present the FU operation usages for each TILT thread for our benchmarks in Table 6.3, providing
an estimate of their compute size. Uniquely, Mandelbrot's FU usages differ depending on which pixel
of the 800x640-pixel image frame is being computed, because a Mandelbrot thread can break out of
its loop without completing the full 1000 iterations. The Mandelbrot FU usages provided in Table 6.3
represent the average number of operations of each type executed by each thread across all
executed iterations of the thread's loop. Since the body of the loop only needs to be scheduled once
because of our new LoopUnit FU (see Section 4.1), the TILT compiler schedules just 7 AddSub, 6 Mult
and 6 Cmp operations. The body of the loop comprises 15 of these operations. The Mandelbrot thread
utilizes the LoopUnit FU to execute the two loop operations that handle its single loop, for
a total of 21 compute operations per thread that need to be scheduled.
Benchmark     AddSub   Mult   Div   Sqrt   Exp   Cmp   Log   Abs   Total
BSc           23       30     9     4      4     4     1     2     77
HDR           6        4      3     1      0     0     0     0     14
Mandelbrot    177      142    0     0      0     210   0     0     529
HH            36       24     28    0      15    12    0     0     115
FIR 64-tap    64       64     0     0      0     0     0     0     128
Table 6.3: TILT FU operation usages per thread.
The TILT instruction schedule is configured to be cyclic for all benchmarks, wrapping around to the
start to execute the same instructions on different data. Table 6.4 provides the memory requirements of
each benchmark per thread on TILT. External inputs and outputs define the number of data words that
must be read from or written to off-chip memory per thread respectively. One notable difference in the
Fetcher implementation for the FIR filter is that TILT-SIMD is halted while loading the filter coefficients
into the data memories of the TILT cores. The TILT-System then resumes normal concurrent operation
of the Fetcher and TILT-SIMD during the computation of the FIR outputs. This step is not necessary
for the other benchmarks because they do not require a large sequence of data to be loaded into TILT
memory prior to or between the computation of a set of results.
Benchmark     Data Words   Ext Inputs   Ext Outputs
BSc           38           5            2
HDR           18           6            1
Mandelbrot    25           5            1
HH            86           5            4
FIR 64-tap    192          1*           1

Table 6.4: TILT data memory requirements per thread (in 32-bit words). *Filter coefficients are loaded initially prior to streaming in the FIR inputs.
Some of the FUs share both their operation field in TILT’s instruction and their read and write
ports with another FU (usually the least utilized) to generate a more computationally dense schedule
and to reduce the size of the crossbars (refer to Section 2.2.2). We use Algorithm 8 of our Predictor
tool (presented in Section 5.3.3) to determine which of the FU types should be shared. The algorithm
makes this decision based on the relative operation counts of each FU type in the application’s DFG
Benchmark     FUs - Operation Field / Port Sharing
BSc           Exp+Cmp and Log+Abs
Mandelbrot    Mult+Cmp
HH            Exp+Cmp
Table 6.5: TILT FUs that share their operation field and ports with another FU.
and the reduction in area that can be obtained from combining any two FUs. The FU types that are
shared are summarized for our benchmarks in Table 6.5. As an example, the Mult and Cmp units for the
Mandelbrot benchmark share the same operation field and ports since they have similar FU latencies
and because all multiplies precede compare operations for a given thread. The latencies of the TILT
FUs vary from 1 cycle for the Abs FU to 28 cycles for the Sqrt FU.
6.2 Densest TILT-System Configurations
For each of our benchmarks, the Predictor tool presented in Chapter 5 is used to predict the TILT-
System configuration with the highest compute density. The Predictor selects the Fetcher configuration
with the smallest area that meets the external input and output bandwidth requirements of the selected
TILT design. TILT’s external memory schedules are generated using the Fetcher’s SlackB algorithm
which we describe in Section 3.3.3. We have shown earlier in Section 3.5 that this algorithm produces
the shortest TILT instruction schedules for our benchmarks. The top 10 predicted TILT-System designs
ranked by compute density are compiled in Quartus to obtain their actual layout areas. Throughput
is obtained through cycle accurate simulation in ModelSim 10.1d assuming all input data is initially
available in off-chip DDR3 memory. The configuration with the highest compute density is then selected
as our best TILT-System design.
Benchmark     FU Mix         Threads   Mem Banks   Bank Depth   W/R Ports / Bank   Insn Mem WidthxDepth
BSc           2-3-1-1-1-1    64        4           1024         2-4                476 x 933
HDR           2-2-1-1-0-0    64        4           512          2-4                318 x 304
Mandelbrot    1-2*           8         1           256          3-6                140 x 103
HH            3-2-2-2-0-0    32        4           1024         2-4                467 x 611
FIR 64-tap    1-1-0-0-0-0    2         1           512          2-4                76 x 144
Table 6.6: Top TILT configurations with highest compute density for each benchmark. FU Mix: AddSub/Mult/Div/Sqrt/ExpCmp/LogAbs *AddSubLoopUnit/MultCmp.
Table 6.6 provides the top (most computationally dense) TILT configurations for the benchmarks
of Table 6.3. The Fetcher configurations used with these TILT designs are provided in Table 6.7. As
shown in Figure 3.1, external write and read ports determine the number of 256-bit words that can
be written to or read from the TILT cores per cycle. Similarly, the depths in Table 6.7 correspond to
the incoming and outgoing data FIFOs respectively. Appropriate FIFO depths are a function of the
bandwidth requirements of TILT-SIMD which can be accurately predicted when the number of TILT
cores, threads, port dimensions and the number of words each thread reads or writes to external memory
(Table 6.4) are known.
The external bandwidth of the FIR benchmark after the filter coefficients are initially loaded is low:
2 input and output 256-bit words every 144 cycles (refer to Tables 6.4 and 6.7). However, the relatively
Benchmark     Ext W/R Ports   W/R FIFO Depths (256-bit words)
BSc           1-1             2048 / 1024
HDR           2-1             1024 / 512
Mandelbrot    1-1             256 / 128
HH            1-1             512 / 512
FIR 64-tap    2-1             128 / 128
Table 6.7: Fetcher designs for the top TILT configurations of Table 6.6.
deep data FIFO depth of 128 words is selected to prefetch the coefficients ahead of time and communicate
with DDR in bursts. Further, since the M20K BRAMs used to build the FIFOs support a minimum
depth of 512, using a smaller depth does not save area.
6.3 TILT Core and System Customization
We seek to maximize the TILT-System’s computational throughput per unit area by customizing the
TILT core and system architecture to the compute and external memory bandwidth requirements of the
application. This is made possible by the wide range of configuration options that are available to the
designer which we demonstrate in this section. Table 6.8 presents the improvement in compute density
achieved with the tuned TILT configurations of Table 6.6 relative to the baseline which is composed
of 1 of each required FU with 1 thread and 1 data memory bank with 2 read and 1 write ports for a
TILT-System with a single TILT core.
TILT-System with 1 TILT core
Benchmark     Throughput (M tps)   Area (eALMs)      Compute Density (M tps/10k eALMs)
              Top / Base           Top / Base        Top / Base
BSc           16.3 / 0.48          13,930 / 5,567    11.7 / 0.87
HDR           46.9 / 1.54          8,231 / 3,128     57 / 4.91
Mandelbrot    0.76 / 0.14          3,320 / 1,987     2.3 / 0.69
HH            11.8 / 0.61          13,718 / 3,590    8.6 / 1.71
FIR 64-tap    3.8 / 2.06*          2,674 / 1,909     14 / 10.8*

Table 6.8: Comparison of the top TILT-System designs with the minimal area baseline. *FIR throughput is in M inputs/sec.
As an example, the minimal BSc TILT-System in Table 6.8 with a single TILT core computes 0.48
M tps at the cost of 5,567 eALMs. An additional 63 threads and 3 data memory banks with 2 extra read
and 1 extra write ports improves throughput by 17x by executing many threads and more operations
in parallel but requires 2x more area. An extra AddSub and 2 more Mult FUs improve throughput
further by 2x, at a cost of 1.1x more area, resulting in an overall 13.5x improvement in compute density.
We can improve the throughput of the TILT core further with diminishing returns by adding more
FUs, threads, data memory banks and/or read and write ports to allow more operations to issue and
complete every cycle but at an increasingly higher area cost, causing the compute density to decrease
beyond this point.
In Section 4, we have added several application-specific customization options to TILT to improve
the compute density of the TILT-System beyond that which can be achieved with the TILT architecture
developed by Ovtcharov and Tili [13, 14]. For example, besides customizing the TILT core’s mix of
standard FUs, the addition of the new LoopUnit FU for the Mandelbrot benchmark significantly reduces
the size of the instruction memory, requiring 4 BRAMs for the top Mandelbrot design in Table 6.8 instead
of the 527 BRAMs that would be needed if the loop was fully unrolled. The required memory bank
depth is also reduced to 256 words from 512. However, the compute schedule becomes 35% less dense
due to being unable to intermix and schedule operations between loop boundaries and loop iterations
together, contributing to a throughput drop of 20%. Overall, the LoopUnit improves compute density
by 6.1x, while consuming only 9.8 ALMs.
Similarly for the top FIR design in Table 6.6, a conventional TILT core without the shift-register
addressing mode would require 64 reads and writes per thread between the computation of each output,
resulting in a 415-cycle instruction schedule. The addition of the mode costs only 54 eALMs but shortens
the schedule to 144 cycles, improving the overall compute density of the TILT-System by 2.9x.
As illustrated by the tuning of the BSc TILT core to maximize compute density and the optionally
generated application-dependent custom units to more efficiently handle loops (Mandelbrot) and indirect
addressing (FIR), the TILT core presents multiple degrees of freedom to the designer to reduce area or
to increase throughput and compute density. The significant performance improvement that can be
achieved demonstrates the value of our Predictor tool. The minimal TILT-Systems in Table 6.8 are
small, consuming between 1.9k and 5.6k eALMs.
Beyond customizing the TILT core, we can improve the throughput and compute density further by
connecting multiple area-efficient TILT cores to be executed in SIMD. This is preferable to increasing
the throughput by making the TILT core larger, which would result in a less computationally dense design.
The improvement can be observed for the TILT-Systems in Table 6.9 where the TILT core of Table
6.8 is replicated 7 more times, allowing us to achieve near-linear growth in throughput. The area cost
of the single TILT instruction memory (provided in Table 6.11) and the Fetcher (Table 6.7) becomes
amortized, causing the overall increase in compute density.
6.4 TILT-System Performance
Table 6.9 presents the compute throughput, FPGA layout area and compute density of the densest
TILT-Systems with 8 identical TILT cores. The TILT and Fetcher configuration parameters used by
these designs are provided in Tables 6.6 and 6.7 respectively.
TILT-System with 8 TILT cores
Benchmark     Fmax (MHz)   Tput (M tps)   Area (eALMs)   Compute Density (M tps/10k eALMs)
BSc           220          121            95,893         12.6
HDR           223          359            51,163         70.1
Mandelbrot    246          6.1            19,051         3.2
HH            215          90             96,073         9.4
FIR 64-tap    270          30*            12,187         24.6*

Table 6.9: TILT-System performance numbers of the densest TILT configurations of Table 6.6. *FIR throughput is in M inputs/sec.
The Fmax reported in Table 6.9 is that of the TILT-SIMD. The Fmax of the FIR filter is uniquely
much higher than that of the other benchmarks due to the small size of each TILT core, requiring only
a single AddSub and Mult FU, with a per-core area cost of only 1,429 eALMs on average (Table 6.11).
The critical path which limits the Fmax of the TILT cores typically lies between the inputs of an FU
and the read crossbar. For the FIR filter, this is the case with the AddSub FU.
The Fmaxes of the Fetcher designs and their area costs are provided in Table 6.10. The Fetcher
consumes a relatively small amount of area compared to the area consumed by TILT-SIMD, ranging
from 934 eALMs for the Mandelbrot benchmark to 1,995 eALMs for BSc, which requires the deepest
data FIFOs; HDR's Fetcher (1,816 eALMs) is also large due to its deeper data FIFOs and additional
external write port (Table 6.7).
Benchmark     Fmax (MHz)   Area (eALMs)
BSc           354          1,995
HDR           343          1,816
Mandelbrot    385          934
HH            386          1,413
FIR 64-tap    357          1,321
Table 6.10: The Fmax and area cost of the Fetcher designs of Table 6.7.
For the computationally dense TILT-Systems of Table 6.9, the area breakdown of the TILT cores is
presented in Table 6.11. The average area of the TILT core varies widely between 1,358 eALMs for FIR
and 11,713 eALMs for HH, showing the different benchmarks prefer quite different TILT cores. This
further highlights the utility of our Predictor as determining the top TILT-System design is non-trivial.
Benchmark     Average Area (eALMs) per Core                      Total     FUs/
              Insn Mem   Mem     FUs     Rd Xbar   Wr Xbar      eALMs     Total
BSc           961        2,560   5,016   3,070     971          12,578    40%
HDR           321        1,280   2,489   1,738     621          6,449     39%
Mandelbrot    161        720     1,049   396       80           2,406     44%
HH            961        2,560   4,777   3,391     985          12,674    38%
FIR 64-tap    81         320     624     296       108          1,429     44%
Table 6.11: Area breakdown of the TILT cores for the TILT-Systems in Table 6.9.
To achieve the most area-efficient design, our objective is to minimize the non-FU area, comprising
the crossbars and instruction and data memories. The purpose of these components is to keep the TILT
FUs busy while ideally consuming minimal area. The average FU area for our benchmarks is 41% of the
total. The percentage of FU area in the area-efficient TILT cores is similar across our benchmarks,
varying only between 38% and 44%. The Fetcher accounts for an average of only 4.5% of the total
area of the 8-core TILT-System designs in Table 6.9.
6.5 Comparison with Altera’s OpenCL HLS
As discussed in Chapters 1 and 2, overlays such as TILT and HLS tools present two different ways to
improve FPGA designer productivity. Here we compare the productivity and performance of our best
TILT-System designs with that obtained using Altera’s OpenCL HLS tool [19]. We have chosen this tool
as our comparison point because recent studies have shown it can generate FPGA hardware with good
performance relative to other platforms. Chen and Singh report their OpenCL FPGA implementation
of a Fractal Video Compression algorithm is 3x faster than a high-end GPU while consuming only 12%
of the GPU’s power [2]. They also demonstrate a huge gain in productivity, with their simplified and
error-prone hand-coded FPGA implementation taking a month to complete relative to the few hours it
took to develop a working OpenCL version.
The TILT application kernel, which defines the computation of a TILT thread inside a C function, is
very similar to our OpenCL kernel implementation, which defines the work to be performed by a group
of work-items. For our purposes, a work-item in OpenCL is analogous to a thread in TILT. Our TILT
threads and OpenCL work-items execute independently and perform the same computation.
Altera’s OpenCL HLS tool takes our OpenCL kernel as input to generate a custom pipelined accel-
erator on the FPGA that can be executed from a host CPU program. The OpenCL host program is
developed using Visual Studio 2013 and compiled into an executable with default Release configuration.
To obtain throughput numbers for each benchmark, the input data for the entire workload is flushed to
DDR3 prior to kernel execution. This ensures a fair comparison
with TILT by excluding the host-accelerator transfer time. Elapsed time is obtained by capturing the
wall clock time before and after kernel execution using Windows high resolution timers with a precision
of <1 µs. For each benchmark, we increase the number of work-items in the OpenCL host program until
throughput saturates and we report this highest throughput number.
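The saturation sweep can be sketched as follows. This is our own illustrative loop, not the actual host program: `launch` is a hypothetical stand-in for enqueueing the OpenCL kernel for n work-items and blocking until it completes.

```python
import time

def measured_tput(launch, n):
    # Wall-clock the kernel with a high-resolution timer and report
    # throughput in millions of work-items (threads) per second.
    t0 = time.perf_counter()
    launch(n)
    return n / (time.perf_counter() - t0) / 1e6

def find_saturated_throughput(tput_of, start=1024, max_n=2**30):
    # Double the work-item count until throughput no longer improves,
    # then report the highest throughput observed.
    best, n = 0.0, start
    while n <= max_n:
        t = tput_of(n)
        if t <= best:
            break
        best, n = t, n * 2
    return best
```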
6.5.1 Performance and Area
We begin by comparing the performance and area of the densest 8-core TILT-System designs in Table
6.9 with our best OpenCL HLS designs presented in Table 6.12. The Fmax reported in Table 6.12 is
that of the OpenCL kernel system. Both the TILT-System and OpenCL HLS designs achieve similar
Fmax values of over 200 MHz. The area of the Direct Memory Access (DMA) unit which interfaces the
OpenCL on-chip memory to DDR memory averages 4,367 eALMs in size, roughly 3x larger than our
relatively small Fetcher designs, which consume between 1k and 2k eALMs (Table 6.10).
OpenCL HLS - Kernel System with DMA

Benchmark     Fmax (MHz)   Tput (M tps)   Area (eALMs)   Compute Density (M tps/10k eALMs)
BSc           221          153            51,982         29.5
HDR           234          231            26,246         88.1
Mandelbrot    268          23             46,204         5.0
HH            236          116            50,571         23.0
FIR 64-tap    274          239*           51,577         46.4*
Table 6.12: OpenCL HLS performance numbers. *FIR throughput is in M inputs/sec.
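As a sanity check of the compute-density column of Table 6.12, density is throughput divided by area, scaled to 10k eALMs (a hypothetical helper; the small differences from the table come from rounding of the reported throughput values):

```c
/* Compute density in M tps per 10k eALMs, as used throughout the chapter. */
double compute_density(double tput_mtps, double area_ealms)
{
    return tput_mtps / area_ealms * 10000.0;
}
```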
Figure 6.1: Compute densities of the TILT-System designs normalized to that of OpenCL HLS.
From the performance results presented in Tables 6.9 and 6.12, we note that the compute density
of our TILT-Systems is 41% (HH) to 80% (HDR) of the density of the OpenCL designs, as shown in
Figure 6.1. The TILT-System of the HDR application is the only TILT design that achieves a higher
compute throughput than OpenCL HLS. It obtains 359 M tps compared to the 231 M tps obtained by
the OpenCL HLS design. However, TILT achieves this high throughput at a higher area cost of 51.2k
eALMs compared to the 26.2k eALMs required by the OpenCL HLS design.
6.5.2 Designer Productivity
The runtimes of the TILT and OpenCL HLS tools for the designs in Tables 6.9 and 6.12 are summarized
in Table 6.13 and compared in Figure 6.2. For the initial setup of the TILT-System, the TILT and
Fetcher instruction schedules are first generated from the C kernel using the TILT compiler. Then the
densest TILT-System configuration recommended by our Predictor tool is synthesized into hardware
using Quartus, with the schedules loaded into the instruction memories during compilation. The initial
setup is dominated by the compilation of the overlay, with an average runtime of 28 mins.
                TILT-System with 8 cores                    OpenCL
                Initial Setup                     Kernel    Kernel
Benchmark       Kernel    Predictor   Overlay     Update    Compile
                (mins)    (mins)      (mins)      (mins)    (mins)
BSc             0.0110    2.7         53          0.65      108
HDR             0.0025    0.4         31          0.63      86
Mandelbrot      0.0016    0.4         16          0.63      111
HH              0.0087    2.5         52          0.65      106
FIR 64-tap      0.0017    0.2         13          0.63      107
Geomean         0.0036    0.7         28          0.64      103
Table 6.13: Runtime of the TILT-System and OpenCL HLS tools.
Figure 6.2: Runtime of the TILT-System and OpenCL HLS tools.
After a kernel code change that does not require the overlay to be recompiled, determined by running
the Predictor on the modified kernel, the instruction memories of the TILT and the Fetcher are updated
with the regenerated schedules, taking only 38 secs on average. By comparison, any change made to an
OpenCL kernel requires full recompilation, taking an average of 103 mins, a 163x increase. Moreover,
the fast runtime of our Predictor enables us to navigate TILT’s large design space and obtain a suitably
customized, high performance design for an application much faster than would be possible through an
exhaustive search using Quartus.
6.5.3 TILT-System and OpenCL HLS Scalability
Many real world applications combine several heterogeneous compute systems with different throughput
requirements and need to fit them all on a chip with a finite area budget. In this section, we scale the
HH and FIR benchmarks to show how efficiently the TILT-System and OpenCL HLS scale up or down
to match compute requirements and area constraints.
(a) TILT-System (left) and OpenCL HLS (right) throughput.
(b) TILT-System (left) and OpenCL HLS (right) area.
(c) TILT-System (left) and OpenCL HLS (right) compute density.
Figure 6.3: Scaling HH on TILT-System and OpenCL HLS.
As we have demonstrated earlier in Section 6.4 with the 8-core TILT-System designs, connecting
multiple area-efficient TILT cores in SIMD allows us to achieve near-linear growth in throughput. In
Figure 6.3, we scale up the TILT-System of the HH benchmark beyond 8 cores up to chip capacity and
draw comparisons with the OpenCL HLS scalability results where the HH computation is replicated to
execute in parallel. The maximum size of our designs is limited by the ALMs available on our FPGA.
Each TILT core requires 19 DSPs for FUs, with each spatially pipelined instance in OpenCL HLS
requiring 143 DSPs.
In Figure 6.3(a), we observe near-linear scaling in the TILT-System throughput, with the small drop
from 16 to 20 cores caused by a drop in Fmax from 201 to 181 MHz. In Figure 6.3(b), the OpenCL HLS
design cannot scale below 51k eALMs while the TILT-System is able to scale down to 14k eALMs. In
Figure 6.3(c), the compute density of the TILT-System remains fairly constant, growing from 8.6 to 9.6
from 1 to 12 cores and then dropping to 8.3 at 20 cores. However, the TILT-System’s compute density
is also consistently lower than the OpenCL HLS for all design solutions.
From Figure 6.3, we observe that the TILT-System offers a greater range of intermediate throughput
and area solutions than OpenCL HLS, allowing any number of cores between 1 and 20 to be instantiated
for the HH benchmark. These solutions range from small designs with a throughput of 12 M tps at
a cost of 14k eALMs to large designs that have a throughput of 190 M tps and an area cost of 230k
eALMs. In comparison, OpenCL HLS provides only 5 design choices, with a narrower throughput range
between 116 and 288 M tps and a medium to large area cost of 51k to 210k eALMs. The wider, more
flexible range of the TILT-System designs is one of the benefits of the TILT architecture.
(a) Fully spatial 64-tap FIR design.
(b) 64-tap FIR with a single DSP.
(c) 64-tap FIR with two DSPs.
Figure 6.4: Implementing 64-tap FIR on OpenCL HLS with different number of DSPs (or stages).
In Figure 6.5, we study the OpenCL HLS compiler’s ability to scale down our spatially pipelined
64-tap FIR design. The 64-tap FIR filter is normally implemented as a fully spatial compute pipeline,
as illustrated in Figure 6.4(a). This implementation with 64 stages achieves very good performance for
relatively low area on OpenCL HLS. However, if such a high throughput is not required, we may wish to
use fewer compute resources (DSPs) to save area. Figure 6.4(b) depicts a single compute unit performing
the same 64-tap FIR computation. The compute unit is shared by all 64 filter coefficients to produce
an output every 64 cycles. The throughput achieved with this design is very low, and we do not save
much area because of the large muxes that are required at the inputs of the compute unit.
The availability of two parallel compute units with two stages, as shown in Figure 6.4(c), allows an
output to be computed every 32 cycles. The first stage takes 32 cycles to apply the first 32 coefficients
before forwarding the sum to the second stage and receiving a new input to compute. Smaller muxes
are required at the inputs of the compute units, but there are two sets of them, and additional muxing
is required to forward the output of one stage to the next. The throughput of this design is slightly better
but it is still very low relative to that of the spatial design.
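The resource-sharing trade-off of Figure 6.4 can be modelled behaviourally as follows (a hypothetical C model, not the generated hardware): with `stages` compute units, each unit applies 64/stages coefficients in sequence, so a new output is produced every 64/stages cycles.

```c
#define TAPS 64

/* stages == 1 models Figure 6.4(b), stages == 2 models Figure 6.4(c), and
 * stages == TAPS the fully spatial design of Figure 6.4(a). */
float fir_shared(const float *x, const float *coef, int stages,
                 int *cycles_per_output)
{
    int taps_per_stage = TAPS / stages;
    float acc = 0.0f;
    for (int s = 0; s < stages; s++)                /* pipeline stages */
        for (int t = 0; t < taps_per_stage; t++) {  /* time-shared multiply-adds */
            int k = s * taps_per_stage + t;
            acc += coef[k] * x[TAPS - 1 - k];       /* newest sample last */
        }
    *cycles_per_output = taps_per_stage;            /* output interval in cycles */
    return acc;
}

/* Demo: an all-ones input through an all-ones filter sums to TAPS. */
float fir_shared_demo(int stages, int *cycles_per_output)
{
    float x[TAPS], c[TAPS];
    for (int i = 0; i < TAPS; i++) { x[i] = 1.0f; c[i] = 1.0f; }
    return fir_shared(x, c, stages, cycles_per_output);
}
```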
Between 4 and 32 stages, the muxing overhead grows with the number of stages, resulting in an
overall growth in area, with the design consuming 86% of the chip’s ALMs at 32 stages. The drop in
Fmax from 221 MHz at 4 stages to the lowest 142 MHz at 32 also results in a sub-linear growth in
throughput at that range. As seen in Figure 6.5(b), the net result is a low compute density for all but
(a) Throughput vs. area cost.
(b) Compute density vs. area cost.
Figure 6.5: Scaling 64-tap FIR on TILT-System and OpenCL HLS.
the fully spatial design where the overheads of forwarding the output of a compute unit back to its input
and selecting between multiple coefficients are eliminated. We conclude that OpenCL HLS has difficulty
producing area-efficient, lower throughput systems, especially for spatially pipelined computations.
In comparison, the small FIR TILT configuration of Table 6.6 provides near-linear scaling in
throughput and area for up to 70% of the chip's ALMs, as also shown in Figure 6.5. The TILT-System was scaled
by connecting multiple TILT-SIMDs to the Fetcher in parallel, each with a maximum of 24 TILT cores,
to eliminate the bandwidth bottleneck at the widthconv module. In Figure 6.5(b), the sharp increase in
the TILT compute density from 1 to 8 cores is due to the amortization of the Fetcher area. With each
TILT-SIMD requiring a maximum of 6 input and generating 6 output words per compute iteration of
144 cycles (Table 6.4), data FIFO depths of 512 words were sufficient for the Fetcher.
The TILT-System architecture enables small, area-efficient design choices with modest throughput,
scaling down to 2.7k eALMs with a single TILT core compared to the 52k eALMs of the spatial OpenCL
design. The OpenCL design is smallest at two stages, consuming 35k eALMs but has a low throughput of
8.3 M inputs/sec. For the same area, a 25 core TILT-System achieves a throughput of 96 M inputs/sec.
Further, we are able to exceed the throughput of 239 M inputs/sec of the spatial OpenCL FIR design
with roughly 70 TILT cores but at 1.9x more area.
6.5.4 TILT-System vs. OpenCL HLS Summary
HLS tools generate application-specific, custom hardware from a software specification. Conversely,
overlays map the same software kernel onto software-programmable hardware. Unlike HLS, overlays
provide versatility and improve designer productivity by supporting multiple software functions for the
same hardware. This means the overlay can be quickly reprogrammed instead of requiring re-synthesis
after a kernel code change. However, this flexibility comes with a performance and area overhead.
The highly configurable TILT overlay is intended to be both application-tunable as well as software-
programmable. We also added several enhancements to TILT to more efficiently target specific applica-
tion types. These optionally enabled enhancements further reduce the performance and area gap between
the TILT overlay and OpenCL HLS approaches. The addition of the LoopUnit FU for the Mandelbrot
and the shift-register mode for the FIR filter improve the compute density of the TILT-Systems by
6.1x and 2.9x respectively relative to the standard TILT core. Overall, our tuned TILT-System designs
achieve 41% to 80% of the compute density of our best OpenCL HLS designs.
The scalability results of the HH and FIR benchmarks demonstrate the TILT-System’s ability to
explore throughput and area regions that OpenCL HLS designs are unable to reach. Further, we can
configure the TILT-System for higher throughput or lower area without requiring the application code to
be changed, with the baseline configuration being very small (Table 6.8). The runtime of our Predictor
tool which is used to select the densest TILT-System design is 42 seconds on average. Regenerating the
TILT and Fetcher schedules after a kernel code change and reprogramming the instruction memories of
an already instantiated TILT-System design takes 38 seconds on average. In contrast, OpenCL kernels
must be recompiled into hardware to observe the changes in performance and area after any application
code change, taking an average of 103 minutes, lengthening design time considerably.
We recommend Altera’s OpenCL HLS for the generation of high throughput systems. OpenCL HLS
maximizes throughput at the cost of more resources by generating a heavily pipelined, spatial design
and by executing many threads in parallel. Deep FIFOs buffer the thread data and stream it into the
compute units. The resulting designs are large, about 3.2x (HDR) to 19x (FIR) bigger than the top
(most area-efficient) single core TILT-System designs. If a kernel requires more modest throughput,
OpenCL HLS has difficulty generating an area-efficient, lower-throughput system. Therefore, when a low
to moderate throughput is sufficient, we instead recommend the TILT architecture as it is capable of
generating smaller but still computationally dense designs. The top TILT-Systems with a single core
require 2.7k to 14k eALMs (with 1.9k to 5.6k for minimal designs) compared to 26k to 52k eALMs for
the smallest, computationally dense OpenCL HLS systems. Hence, we see the TILT overlay paradigm
as a useful complement to OpenCL HLS.
Chapter 7
Conclusion
The TILT overlay is an area-efficient method to implement shared-operator, application-customizable
execution units. We extend the TILT architecture of [13–15] to allow the use of off-chip memory with
our scalable and small Memory Fetcher. We also enable the generation of more area-efficient designs
with new custom units to support loops and indirect addressing. These are optionally generated for
applications that benefit from them, improving compute density by 6.1x for the Mandelbrot and 2.9x for
the FIR applications respectively. Our new Predictor tool, with an average runtime of 42 seconds, also
gives designers the ability to quickly explore a large design space of throughput and area trade-offs
without requiring the overlay to be synthesized.
comprise the TILT-System can be configured for higher throughput or lower area without requiring the
application code to be changed. Configuration of an application onto an existing TILT-System design
is also fast, taking an average of 38 seconds.
7.1 Future Work
TILT as a Custom Accelerator for OpenCL HLS
We would like to integrate the TILT overlay as an optional OpenCL HLS accelerator component and
evaluate the performance of the resulting system. A few small TILT cores can be used to perform
specialized calculations for a larger computation, enabling the OpenCL HLS to take advantage of TILT’s
high FU reuse capability and its ability to scale down to very small implementations.
Compute Acceleration with Application-Specific Custom FUs
TILT currently supports the Sqrt, Exp, Log and Abs compute operations via function calls in the C
kernel. Custom datapaths with arbitrary pipeline latency intended to accelerate segments of the target
application can be supported by defining a new function identifier and latency mapping in the TILT
architecture configuration file. Provided the new custom FU has at most two input and one output data
ports, no changes to TILT's compiler flow or architecture are necessary. The FU may also optionally
accept the opcode bit vector which is presently used to modify the behaviour of an FU. In this case, the
compiler will need to be updated to identify and encode the appropriate FU behaviour.
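As a sketch, registering a custom FU could amount to a single entry in the architecture configuration file. The thesis does not reproduce the file's syntax, so the line below is purely illustrative: the entry name, the `FusedMulAdd` identifier, and the field names are all assumptions.

```
# Hypothetical configuration entry (illustrative syntax only): map a new
# function identifier to a custom FU's pipeline latency and port counts.
CustomFU  FusedMulAdd  latency=5  inputs=2  outputs=1  opcode=optional
```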
Custom FUs can be desirable if more application-specific compute acceleration is required. The
area cost may also be reduced due to a smaller custom FU and/or requiring fewer standard FUs to be
instantiated, which will produce both a smaller FU array and crossbars. We leave the study of how
much compute density can be gained from implementing custom FUs to accelerate frequently performed
computations of a target application as future work.
Tuning the TILT-System to a Group of Applications
The scope of our thesis was limited to tuning the TILT-System architecture to a single application.
However, most applications can be grouped into categories that are representative of their compute and
memory access patterns [67]. This observation suggests we can customize the TILT-System to a class of
applications, with a single configuration that will perform reasonably well across multiple benchmarks.
This is useful because it will reduce the need to recompile the TILT-System if the computation is
modified or replaced with a similar application. Instead, we will only need to regenerate the schedules
and update the instruction memories, which is much faster than recompiling the overlay (Table 6.13).
This investigation and extending our Predictor to a mix of application kernels are left as future work.
Software Pipelining with Partially Unrolled Loops
For kernels containing loops, we have presented two scheduling extremes: either the loop is fully unrolled
and the operations across loop iterations are scheduled together or we only schedule a single iteration
of the loop using the LoopUnit FU. There exists a trade-off between producing a denser schedule by
unrolling the loop and a deeper instruction memory required to store the longer schedule. We can instead
partially unroll the loop using the LLVM compiler prior to generating the DFG and schedule several loop
iterations together to improve the utilization of the FUs and the achieved compute density. Since our
external memory scheduling algorithms take advantage of memory bank ports that are not used by the
compute operations, a denser compute schedule may not always result in the best overall throughput.
Hence, scheduling of the memory operations will partly dictate the best unroll factor.
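A partially unrolled loop body can be sketched as follows (the function and unroll factor of 4 are illustrative, not from the thesis); LLVM could apply this transformation automatically prior to DFG generation, and it is written out by hand here to show the loop body the scheduler would see:

```c
/* 64 iterations unrolled by a factor of 4: 16 trips with 4 multiply-adds
 * per body. Four independent accumulators let the scheduler issue the
 * multiply-adds of one trip to parallel FUs. */
float dot64_unrolled(const float *a, const float *b)
{
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    for (int i = 0; i < 64; i += 4) {
        acc0 += a[i]     * b[i];
        acc1 += a[i + 1] * b[i + 1];
        acc2 += a[i + 2] * b[i + 2];
        acc3 += a[i + 3] * b[i + 3];
    }
    return acc0 + acc1 + acc2 + acc3;
}

/* Demo input: a[i] = 1, b[i] = i, so the dot product is 0 + 1 + ... + 63. */
float dot64_demo(void)
{
    float a[64], b[64];
    for (int i = 0; i < 64; i++) { a[i] = 1.0f; b[i] = (float)i; }
    return dot64_unrolled(a, b);
}
```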
Appendix
Appendix A: Quartus Settings for FUs
We summarize the Quartus settings used for the mega-function generated FUs of the TILT architecture
and those used by our OpenCL HLS (High-Level-Synthesis) designs in Table A. All FUs are IEEE-754
compliant and support single-precision floating-point operations.
Setting                 TILT                      OpenCL HLS
Denormal                no                        no
Optimize                speed                     speed
Reduced Functionality   no                        no
Exception               yes (where applicable)    no
Table A: Quartus FU settings for TILT and OpenCL HLS.
Appendix B: Quartus Settings for DDR Controller
The Quartus settings used by the TILT and Fetcher architecture and our OpenCL HLS designs to
interface with DDR3 on the Nallatech 385 D5 board are summarized in Table B.
Setting                TILT               OpenCL HLS
Speed Grade            2                  2
Memory Clock Freq      400 MHz            400 MHz
Memory Vendor          Micron             Micron
Memory Format          Unbuffered DIMM    Unbuffered DIMM
Interface Width        64                 72
Row Address Width      14                 15
Column Address Width   10                 12
Bank Address Width     3                  3
Avalon Word Width      256                512
Avalon Burst Size      64                 32
Table B: Quartus DDR3 Controller settings for TILT and OpenCL HLS.
Bibliography
[1] J. Richardson, S. Fingulin, D. Raghunathan, C. Massie, A. George, and H. Lam, “Comparative
analysis of HPC and accelerator devices: Computation, memory, I/O, and power,” in HPRCTA.
IEEE, 2010, pp. 1–10.
[2] D. Chen and D. P. Singh, “Fractal video compression in OpenCL: An evaluation of CPUs, GPUs,
and FPGAs as acceleration platforms.” in ASP-DAC, 2013, pp. 297–304.
[3] K. Sano, Y. Hatsuda, and S. Yamamoto, “Multi-FPGA accelerator for scalable stencil computation
with constant memory bandwidth,” Parallel and Distributed Systems, vol. 25, no. 3, pp. 695–705,
2014.
[4] J. Cassidy, L. Lilge, and V. Betz, “Fast, power-efficient biophotonic simulations for cancer treatment
using fpgas,” in FCCM. IEEE, May 2014, pp. 133–140.
[5] A. Putnam, A. Caulfield, E. Chung, and et. al., “A reconfigurable fabric for accelerating large-scale
datacenter services,” in ISCA, June 2014.
[6] IBM. (2014) IBM PureData system. [Online]. Available: http://www-01.ibm.com/software/data/
puredata/analytics/index.html
[7] IBM. (2014) WebSphere DataPower SOA appliances. [Online]. Available: http://www-
03.ibm.com/software/products/en/datapower/
[8] C. H. Chou, A. Severance, A. D. Brant et al., “VEGAS: Soft vector processor with scratchpad
memory,” in FPGA. ACM, 2011, pp. 15–24.
[9] A. Severance and G. Lemieux, “VENICE: A compact vector processor for FPGA applications,” in
FCCM. IEEE, 2012, pp. 245–245.
[10] N. Kapre and A. DeHon, “VLIW-SCORE: Beyond C for sequential control of spice FPGA acceler-
ation,” in FPT. IEEE, 2011, pp. 1–9.
[11] R. Dimond, O. Mencer, and W. Luk, “CUSTARD-a customisable threaded FPGA soft processor
and tools,” in FPL. IEEE, 2005, pp. 1–6.
[12] J. Coole and G. Stitt, “Fast and flexible high-level synthesis from OpenCL using reconfiguration
contexts,” Micro, vol. 34, no. 1, pp. 42–53, 2013.
[13] K. Ovtcharov, “TILT: A horizontally-microcoded, highly-configurable and statically scheduled soft
processor family,” Master’s thesis, University of Toronto, In progress.
[14] I. Tili, “Compiling for a multithreaded horizontally-microcoded soft processor family,” Master’s
thesis, University of Toronto, Nov 2013.
[15] K. Ovtcharov, I. Tili, and J. G. Steffan, “TILT: A multithreaded VLIW soft processor family,” in
FPL, Sept 2013, pp. 1–4.
[16] A. Papakonstantinou, K. Gururaj, J. A. Stratton et al., “FCUDA: Enabling efficient compilation of
CUDA kernels onto FPGAs,” in Application Specific Processors. IEEE, 2009, pp. 35–42.
[17] A. Canis, J. Choi, M. Aldham et al., “LegUp: An open-source high-level synthesis tool for FPGA-
based processor/accelerator systems,” TECS, vol. 13, no. 2, p. 24, 2013.
[18] T. Feist, “Vivado design suite,” White Paper, 2012.
[19] Altera. (2014) Altera SDK for OpenCL. [Online]. Available: http://www.altera.com/literature/lit-
opencl-sdk.jsp
[20] R. Rashid, J. G. Steffan, and V. Betz, “Comparing performance, productivity and scalability of the
TILT overlay processor to OpenCL HLS,” in FPT. IEEE, December 2014.
[21] Altera. (2014) Nios II processor. [Online]. Available: http://www.altera.com/literature/lit-opencl-
sdk.jsp
[22] Xilinx. (2014) MicroBlaze soft processor. [Online]. Available: www.xilinx.com/tools/microblaze.htm
[23] P. Yiannacouras, J. G. Steffan, and J. Rose, “VESPA: portable, scalable, and flexible FPGA-based
vector processors,” in CASES. ACM, 2008, pp. 61–70.
[24] J. Yu, C. Eagleston, C. H.-Y. Chou, M. Perreault, and G. Lemieux, “Vector processing as a soft
processor accelerator,” TRETS, vol. 2, no. 2, p. 12, 2009.
[25] M. Flynn, “Some computer organizations and their effectiveness,” Computers, vol. 100, no. 9, pp.
948–960, 1972.
[26] P. Yiannacouras, J. G. Steffan, and J. Rose, “Fine-grain performance scaling of soft vector proces-
sors,” in CASES. ACM, 2009, pp. 97–106.
[27] M. Labrecque and J. G. Steffan, “Improving pipelined soft processors with multithreading,” in FPL.
IEEE, 2007, pp. 210–215.
[28] R. Moussali, N. Ghanem, and M. A. Saghir, “Microarchitectural enhancements for configurable
multi-threaded soft processors,” in FPL. IEEE, 2007, pp. 782–785.
[29] D. J. Dewitt, “A machine independent approach to the production of optimized horizontal mi-
crocode,” 1976.
[30] D. Gajski, “No-instruction-set-computer processor,” Sep. 17 2004, US Patent App. 10/944,365.
[31] M. A. Saghir, M. El-Majzoub, and P. Akl, “Datapath and ISA customization for soft VLIW pro-
cessors,” in ReConFig. IEEE, 2006, pp. 1–10.
[32] K. Shagrithaya, K. Kepa, and P. Athanas, “Enabling development of OpenCL applications on
FPGA platforms,” in ASAP. IEEE, 2013, pp. 26–30.
[33] Khronos OpenCL Working Group. (2009, October) The OpenCL specification. [Online]. Available:
http://www.khronos.org/registry/cl/specs/opencl-1.0.48.pdf
[34] Nvidia, “Cuda programming guide,” 2014. [Online]. Available: http://docs.nvidia.com/cuda/cuda-
c-programming-guide/
[35] Z. Zhang, Y. Fan et al., “AutoPilot: A platform-based ESL synthesis system,” in High-Level Syn-
thesis. Springer, 2008, pp. 99–112.
[36] J. Fang, A. L. Varbanescu, and H. Sips, “A comprehensive performance comparison of CUDA and
OpenCL,” in ICPP. IEEE, 2011, pp. 216–225.
[37] G. Kalokerinos, V. Papaefstathiou, G. Nikiforos, and et. al., “FPGA implementation of a config-
urable cache/scratchpad memory with virtualized user-level RDMA capability,” in SAMOS. IEEE,
2009, pp. 149–156.
[38] P. R. Panda, N. D. Dutt, and A. Nicolau, “Efficient utilization of scratch-pad memory in embedded
processor applications,” in Design and Test. IEEE Computer Society, 1997, pp. 7–11.
[39] J. Choi, K. Nam, A. Canis, and et. al., “Impact of cache architecture and interface on performance
and area of FPGA-based processor/parallel-accelerator systems,” in FCCM. IEEE, 2012, pp.
17–24.
[40] A. Putnam, S. Eggers, D. Bennett, and et. al., “Performance and power of cache-based reconfig-
urable computing,” in SIGARCH, vol. 37, no. 3. ACM, 2009, pp. 395–405.
[41] P. Nalabalapu and R. Sass, “Bandwidth management with a reconfigurable data cache,” in Parallel
and Distributed Processing Symposium. IEEE, 2005, pp. 159–167.
[42] G. Stitt, G. Chaudhari, and J. Coole, “Traversal caches: A first step towards FPGA acceleration of
pointer-based data structures,” in Hardware/Software codesign and system synthesis. ACM, 2008,
pp. 61–66.
[43] V. Mirian and P. Chow, “FCache: a system for cache coherent processing on FPGAs,” in FPGA.
ACM, 2012, pp. 233–236.
[44] E. S. Chung, J. C. Hoe, and K. Mai, “CoRAM: an in-fabric memory architecture for FPGA-based
computing,” in FPGA. ACM, 2011, pp. 97–106.
[45] E. S. Chung, M. K. Papamichael, G. Weisz, J. C. Hoe, and K. Mai, “Prototype and evaluation
of the CoRAM memory architecture for FPGA-based computing,” in FPGA. ACM, 2012, pp.
139–142.
[46] M. Adler, K. E. Fleming, A. Parashar, M. Pellauer, and J. Emer, “LEAP scratchpads: automatic
memory and cache management for reconfigurable logic,” in FPGA. ACM, 2011, pp. 25–28.
[47] C. E. LaForest and J. G. Steffan, “Efficient multi-ported memories for FPGAs,” in FPGA. ACM,
2010, pp. 41–50.
[48] M. Abdallah and K. Al-Dajani, “Hardware predication for conditional instruction path branching,”
Jun. 22 2004, US Patent 6,754,812. [Online]. Available: http://www.google.com/patents/
US6754812
[49] Writing an LLVM pass. [Online]. Available: http://llvm.org/docs/WritingAnLLVMPass.html
[50] The LLVM instruction reference. [Online]. Available: http://llvm.org/docs/LangRef.html#
instruction-reference
[51] The LLVM compiler infrastructure. Version 3.1, Accessed Nov. 2013. [Online]. Available:
http://llvm.org
[52] R. Cytron, J. Ferrante, B. K. Rosen, M. N. Wegman, and F. K. Zadeck, “Efficiently computing
static single assignment form and the control dependence graph,” TOPLAS, vol. 13, no. 4, pp.
451–490, 1991.
[53] Altera. (2014) OpenCL design examples. [Online]. Available: http://www.altera.com/support/
examples/opencl/opencl.html
[54] S. Mann and R. W. Picard, “On being ‘undigital’ with digital cameras: Extending dynamic range
by combining differently exposed pictures,” in IS&T, 1995, pp. 442–448.
[55] A. L. Hodgkin and A. F. Huxley, “A quantitative description of membrane current and its appli-
cation to conduction and excitation in nerve,” vol. 117, no. 4. Blackwell Publishing, 1952, p.
500.
[56] C. E. LaForest and J. G. Steffan, “OCTAVO: an FPGA-centric processor family,” in FPGA. ACM,
2012, pp. 219–228.
[57] J. E. Smith, “Decoupled access/execute computer architectures,” Computer Architecture News,
vol. 10, no. 3, pp. 112–119, 1982.
[58] N. C. Crago and S. J. Patel, “OUTRIDER: efficient memory latency tolerance with decoupled
strands,” in Computer Architecture News, vol. 39, no. 3. ACM, 2011, pp. 117–128.
[59] F. Black and M. Scholes, “The pricing of options and corporate liabilities,” The journal of political
economy, pp. 637–654, 1973.
[60] A. R. James Lebak and E. Wong. (2006) HPEC challenge benchmark suite. [Online]. Available:
http://www.omgwiki.org/hpec/files/hpec-challenge/
[61] D. Lewis and T. Vanderhoek, “Stratix V block areas,” personal communication, January 2014.
[62] Altera. (2013, November) Floating-point megafunctions user guide. [Online]. Available:
http://www.altera.com/literature/ug/ug_altfp_mfug.pdf
[63] M. Lam, “Software pipelining: An effective scheduling technique for VLIW machines,” in ACM
Sigplan Notices, vol. 23, no. 7. ACM, 1988, pp. 318–328.
[64] Altera. (2014, April) Stratix V device overview. [Online]. Available: http://www.altera.com/
literature/hb/stratix-v/stx5_51001.pdf
[65] Altera. (2013, May) SCFIFO and DCFIFO megafunctions. [Online]. Available: http:
//www.altera.com/literature/ug/ug_fifo.pdf
[66] Altera. (2014, January) Logic array blocks and adaptive logic modules in Stratix V devices.
[Online]. Available: http://www.altera.com/literature/hb/stratix-v/stx5_51002.pdf
[67] K. Asanovic, R. Bodik, B. C. Catanzaro et al., “The landscape of parallel computing research:
A view from Berkeley,” Tech. Rep. UCB/EECS-2006-183, EECS Department, University of
California, Berkeley, 2006.