
A Dual-Engine Fetch/Compute Overlay Processor for FPGAs

by

Rafat Rashid

A thesis submitted in conformity with the requirements
for the degree of Master of Applied Science

Graduate Department of Electrical and Computer Engineering
University of Toronto

© Copyright 2015 by Rafat Rashid


Abstract

A Dual-Engine Fetch/Compute Overlay Processor for FPGAs

Rafat Rashid

Master of Applied Science

Graduate Department of Electrical and Computer Engineering

University of Toronto

2015

High-Level-Synthesis (HLS) tools translate a software description of an application into custom FPGA

logic, increasing designer productivity vs. Hardware Description Language (HDL) design flows. Overlays

seek to further improve productivity by allowing the designer to target a software-programmable sub-

strate instead of the underlying FPGA, raising abstraction and reducing application compile times. We

propose a highly configurable dual-engine, fetch/compute overlay processor which is designed to achieve

high throughput per unit area on data-parallel and compute intensive floating-point applications. The

fetch component is newly proposed as part of this work. For the compute component, we use an im-

proved version of the TILT overlay processor originally developed by Ovtcharov and Tili. As part of

our evaluation, we rigorously compare the performance, area and productivity enabled by our proposed

architecture to that of Altera’s OpenCL HLS tool.


Acknowledgements

First, I want to thank my supervisors Prof. Greg Steffan and Prof. Vaughn Betz for their advice and

direction during the completion of my degree. I have gained valuable research experience under their

supervision. Thank you to professors Paul Chow and Andreas Moshovos for being part of my defence

committee and for providing feedback on my thesis. Thanks also to Wai Tung for being the committee’s

defence chair. I also want to thank Kalin Ovtcharov, Ilian Tili, Charles LaForest and the students in

Prof. Betz's research group for their feedback and useful suggestions on this work. I also thank David

Lewis and Tim Vanderhoek at Altera for helpful discussions on the Stratix V architecture. Finally, I

thank NSERC and Altera for funding support and SciNet for their compute resources.


Contents

1 Introduction 1

1.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.2 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Background 4

2.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2.1.1 Overlay Processors on FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2.1.2 High-Level-Synthesis Tools for FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.1.3 Memory Architectures on FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.2 TILT Overlay Processor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.2.1 TILT Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.2.2 TILT Customization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.3 TILT Compiler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.3.1 DFG Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.3.2 Memory Allocator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.3.3 Compute Scheduler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.4 TILT Execution Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3 External Memory Fetcher 18

3.1 Possible Design Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

3.2 Chosen Design Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

3.3 External Memory Access Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

3.3.1 Memory and Compute GPGMLP Scheduling . . . . . . . . . . . . . . . . . . . . . 24

3.3.2 ALAP-Inputs, ASAP-Outputs Memory Scheduling . . . . . . . . . . . . . . . . . . 24

3.3.3 Slack-Based Memory Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

3.3.4 Fetcher-DDR Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

3.4 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

3.5 Comparison of the Memory Scheduling Approaches . . . . . . . . . . . . . . . . . . . . . . 29

3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

4 TILT Efficiency Enhancements 35

4.1 Instruction Looping Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

4.1.1 LoopUnit FU Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

4.1.2 DFG Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37


4.1.3 Scheduling onto TILT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

4.1.4 Conditional Jump Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

4.1.5 LLVM Phi Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

4.1.6 Unrolling Loops vs. Using the LoopUnit FU . . . . . . . . . . . . . . . . . . . . . 42

4.2 Indirect Addressing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

4.2.1 Shift-Register Addressing Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

4.2.2 Indirect Addressing Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

4.3 Memory Allocation for DFGs with Aliased Locations . . . . . . . . . . . . . . . . . . . . . 47

4.4 Schedule Folding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

5 Predictor Tool 50

5.1 Estimation of TILT-System Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

5.2 TILT and Fetcher Area Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

5.2.1 TILT FU Array . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

5.2.2 TILT and Fetcher Instruction Memories . . . . . . . . . . . . . . . . . . . . . . . . 53

5.2.3 TILT Data Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

5.2.4 TILT Read and Write Crossbars . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

5.2.5 Fetcher Data FIFOs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

5.3 Prediction of the Densest TILT-System Design . . . . . . . . . . . . . . . . . . . . . . . . 56

5.3.1 Constraining the Explored Design Space . . . . . . . . . . . . . . . . . . . . . . . . 57

5.3.2 Search Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

5.3.3 Combining TILT FUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

5.3.4 Alleviation of Bottlenecks with Performance Counters . . . . . . . . . . . . . . . . 61

5.3.5 Fetcher Configuration Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

5.4 Efficacy of the Predictor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

5.4.1 Design Space Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

5.4.2 Runtime Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

5.4.3 Accuracy of Area Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

5.4.4 Prediction Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

6 TILT-System Evaluation 69

6.1 Benchmark Implementation on TILT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

6.2 Densest TILT-System Configurations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

6.3 TILT Core and System Customization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

6.4 TILT-System Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

6.5 Comparison with Altera’s OpenCL HLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

6.5.1 Performance and Area . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

6.5.2 Designer Productivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

6.5.3 TILT-System and OpenCL HLS Scalability . . . . . . . . . . . . . . . . . . . . . . 77

6.5.4 TILT-System vs. OpenCL HLS Summary . . . . . . . . . . . . . . . . . . . . . . . 79

7 Conclusion 81

7.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81


Appendix 83

Appendix A: Quartus Settings for FUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

Appendix B: Quartus Settings for DDR Controller . . . . . . . . . . . . . . . . . . . . . . . . . 83

Bibliography 84


Chapter 1

Introduction

Modern Field Programmable Gate Arrays (FPGAs) offer abundant on-chip compute parallelism and internal

memory bandwidth with high power efficiency compared to other, more conventional platforms such as

CPUs or GPUs [1, 2]. At the same time, FPGAs provide the flexibility to implement any hardware

on their reconfigurable fabric that CPUs and GPUs do not. For these reasons, FPGAs have become

compelling design platforms for accelerating a variety of compute intensive applications in both academic

and commercial domains [2–5].

For example, Sano et al. demonstrate the relatively low power and high performance of FPGAs by

implementing customized, hand-coded hardware for the Jacobi stencil computation [3]. Their tuned

architecture achieves linear scalability in performance with constant memory bandwidth for up to nine

Altera Stratix III FPGAs. This is in contrast to the poor performance scaling achieved on multi-core

CPUs and GPUs in earlier efforts, owing to insufficient inter-processor memory bandwidth. Similarly,

Cassidy et al. present a custom implementation of the Monte Carlo biophotonic application on Stratix V

FPGAs [4] which achieves 4x higher computational throughput and consumes 67x less power compared

to a tightly optimized multi-threaded CPU implementation.

IBM currently offers several commercial products that use FPGAs including the PureData System [6]

for managing large database workloads. PureData uses FPGAs to perform high-performance processing

of analytics on these databases. Another IBM product is Data Power [7] where FPGAs are used to process

large amounts of mobile and web traffic. More recently, Microsoft has proposed Catapult [5] which uses

“overlay” processing engines as a part of its system, which is built on a cluster of high-end FPGAs.

Catapult was used to accelerate Microsoft’s Bing web search ranking algorithms and could be used to

accelerate other large-scale data centre services as well. The authors did not use GPUs because the

power requirements of current high-end GPUs were too high to justify the amount of compute parallelism

they offered. Catapult demonstrated a 95% increase in ranking throughput at comparable latency to

their software-only solution.

However, despite the potential of FPGAs, the difficulty in implementing applications on the platform

remains the biggest barrier to entry into compute-acceleration markets and greatly limits the areas

where FPGAs have been successfully used. First, the compile time of the design tools is typically hours for

FPGAs vs. minutes or seconds for CPUs and GPUs, lengthening design iterations and reducing designer

productivity considerably. Second, most FPGA designs are currently specified in Hardware Description

Languages (HDLs) such as Verilog and the cycle-accurate description required in such languages is


time-consuming to write.

Overlay architectures for FPGAs seek to alleviate these problems by allowing the designer to target

a software-programmable substrate instead of the underlying FPGA to raise design abstraction and to

reduce application compile times. One class of overlays comprises soft-processors that execute applica-

tions on top of configurable execution units on the FPGA. These overlay processors behave similarly to

CPUs, which are both easier to program and more familiar to software designers. Many academic works

on such overlay processors have been proposed [8–10].

Overlay processors increase designer productivity by eliminating the need to implement application-

specific datapaths in HDL while also providing the flexibility to execute multiple similar applications

without requiring recompilation of the overlay. Software compilation of an application onto the overlay is

fast, usually taking a few seconds compared to hours with direct hardware compilation onto the FPGA.

This abstraction is provided at the cost of lower performance and more area. However, some overlays

provide software tools that can analyze an application and suggest a suitably customized architecture

to reduce this overhead [11,12].

For overlay processors to be compelling, they must have high performance, low area overhead and

be capable of operating on large data sets that reside in off-FPGA memory. In this work, we propose

and evaluate a dual-engine, fetch/compute overlay processor that is designed to achieve high throughput

per area on compute intensive and data-parallel floating-point applications. The architecture also offers

a wide range of customization options to closely match the compute and external memory bandwidth

requirements of different applications. The fetch and compute components are responsible for fetching

external data from off-chip DDR memory and executing compute instructions respectively. We allow

the independent operation and optimization of these two components by decoupling them as much as

possible. The fetch component, which we call the Memory Fetcher, is newly proposed as part of this work. For the

compute component, we use an improved version of the TILT overlay processor which was developed by

Ovtcharov and Tili in [13–15].

An alternative way to make computational designs easier to create on FPGAs is to use a High-Level-

Synthesis (HLS) tool which converts a software-like input directly into FPGA hardware [16–19]. As part

of our evaluation, we rigorously compare the hardware efficiency and designer productivity enabled by

our Memory Fetcher and enhanced TILT overlay processor architectures with custom designs produced

by Altera’s OpenCL HLS tool [19]. This work was published in FPT 2014 [20].

1.1 Contributions

The main contributions of this work are:

Memory Fetcher We implement a new configurable and scalable External Memory Fetcher unit

that prefetches only the required data from off-chip DDR memory and loads it into the on-chip data

memories of one or more TILT processors through static compiler analysis of the target application. The

Memory Fetcher and the TILT processors execute their own respective instruction schedules to provide

independent and concurrent off-chip communication and on-chip computation.

TILT Enhancements We extend the TILT architecture developed by Ovtcharov and Tili [13, 14]

in several ways to improve its performance and reduce its area requirements. This includes the addition

of application-specific enhancements to efficiently support loops and indirect addressing.

Predictor We develop a software Predictor tool that quickly enumerates a large design space of possible TILT and Fetcher configurations to predict the best (set of) architecture(s) for a target application

on Stratix V FPGAs without performing full hardware synthesis of each design.

Comparison with OpenCL HLS We quantitatively compare the performance and scalability of

our best TILT and Fetcher designs with OpenCL HLS implementations of five large memory (i.e. off-chip

DDR required) data-parallel applications. We also compare the compile time and development effort

between these two methods.

1.2 Organization

The work presented in this thesis is organized as follows. We begin in Chapter 2 with a discussion of

related commercial and academic work on overlay processors, HLS tools and memory architectures

for FPGAs. Also in Chapter 2, we introduce the TILT architecture developed by Ovtcharov and Tili

in [13,14] and describe its components that we use and extend in our work. In Chapter 3, we propose our

External Memory Fetcher architecture and evaluate several external memory scheduling approaches. In

Chapter 4, we present and evaluate the architectural enhancements added to the TILT overlay of [13,14].

In Chapter 5, we describe our Predictor tool and evaluate its ability to accurately and quickly predict

the best TILT and Fetcher configurations for a target application. Then in Chapter 6, we compare our

best TILT and Fetcher systems with Altera’s OpenCL HLS tool. Lastly in Chapter 7, we summarize

the work presented in this thesis and propose several areas for future investigation.


Chapter 2

Background

In Section 2.1, we present the related work in three main areas: overlay processors, High-Level-Synthesis

(HLS) and memory architectures for FPGAs. Then in Sections 2.2 and 2.3, we describe the TILT

overlay processor and its compiler flow which statically schedules applications written in C onto TILT.

The TILT overlay and its compiler are prior work developed by Kalin Ovtcharov [13] and Ilian Tili [14]

respectively. We make use of the TILT architecture and extend it in the work presented in this thesis. The

overarching goal of the TILT overlay is to take an algorithm description within a C kernel and execute

it on an application-tuned, statically scheduled soft processor to obtain high compute throughput while

minimizing the hardware resource consumption on the FPGA. The TILT architecture promises rapid

application compilation and configuration onto the overlay compared to the time and effort required to

produce equivalent custom, hand-coded HDL or HLS implementations.

2.1 Related Work

2.1.1 Overlay Processors on FPGAs

Overlay processors on FPGAs seek to combine the fast application compile times enabled by their

software-programmable execution units with higher performance and lower area overhead than a basic

soft processor such as Nios II or MicroBlaze can achieve. Nios II [21], the commercial 32-bit processor

designed by Altera for their FPGAs, is capable of supporting floating-point operations using an extension

and has up to 6 pipeline stages. MicroBlaze [22], the competing soft processor for Xilinx FPGAs, supports

floating-point as well and consists of 3 to 5 pipeline stages. The TILT architecture was designed to

support multiple and varied pipeline latencies at the same time, currently with a pipeline latency of up

to 41 cycles.

Several academic works use vector processors as their overlay including VESPA [23], VIPERS [24],

VEGAS [8] and VENICE [9]. These processors operate on vectors of multiple data elements instead of

a single, scalar data item. Of these, VENICE currently achieves the highest throughput; it combines a

scalar Nios II/f processor for control with wide vector lanes feeding multi-function ALUs. The number

of ALUs and vector lanes can be customized depending on the application and its throughput and area

requirements. VIPERS also uses a scalar processor for control and a configurable number of vector lanes.

Unlike VENICE, the functional units (FUs) of the TILT overlay processor operate on scalar data

and perform a single, specialized function, allowing them to be much smaller in area. However, multiple


instances of TILT can be connected to operate on their respective data memories in SIMD (single-

instruction-multiple-data) [25] to obtain higher performance while keeping the individual TILT cores

small and area-efficient. In a manner similar to TILT, VENICE scales best as a multiprocessor system

of small VENICE cores and has compiler support to target the architecture to a given application.

VENICE connects to a standard DMA engine that moves data directly between its on-chip scratchpad

memory and off-chip DDR. The scratchpad is double-buffered with dedicated ports to the DMA to

overlap data movement with computation, eliminate data hazards and to mask long memory latencies.

In this work, we add to the TILT architecture of [13–15] the ability to communicate with off-chip memory

with our new Memory Fetcher unit. Our approach does not require TILT’s scratchpad data memory

to be double-buffered and the off-chip memory transfer operations are interleaved into the compute

schedule using data memory ports that are shared with the FUs.

The VESPA soft processor also operates on vector data and demonstrates improvement in perfor-

mance with vector chaining, where the output of one functional unit is passed directly to the input

of another that is executing a subsequent vector instruction [26]. However, this requires a more com-

plex register file with a greater number of read and write ports. In TILT, the array of FUs reads and

writes data to a single data memory. This increases the total latency of operations and the length of

the pipelines since the output of an FU must be written to memory prior to another FU reading that

memory location.

The TILT processor executes multiple threads in parallel to fill the long pipelines of its execution

units, improving their utilization and the overall throughput. A single thread is usually not sufficient

due to the dependencies between operations and because the TILT FUs are deeply pipelined and have

varied latencies across different FU types. The execution of multiple threads is also a more area-efficient

solution than replicating a single-threaded TILT core.

Labrecque et al. in [27] demonstrate how the parallel execution of multiple threads can significantly

improve the performance of pipelined overlay processors with minimal area overhead. Compared to

their single-threaded soft processor implementation, multi-threading improved the processor’s compute

throughput by up to 104% for a 7-stage pipeline and improved the area-efficiency of the same design

by 106%. Similarly, since the TILT architecture is deeply pipelined, much higher throughput and area-

efficient designs can be obtained by executing many threads at once.

Moussali et al. in [28] present micro-architectural techniques to enable overlay processors to efficiently

support multi-threading. Their approaches involve implementing multi-threading capability in hardware.

These include thread schedulers that select the operation to issue based on the latencies of different

operation types and how the threads are to be interleaved; their work supports both fine-grained and

blocked multi-threading. The TILT architecture instead relies on its compiler to statically produce its

multi-threaded instruction schedule to minimize hardware complexity and area.

An earlier soft processor, called CUSTARD, features the automatic generation of custom datapaths

and instructions intended to accelerate frequently performed computations of the target application

alongside standard operations such as add and multiply [11]. The architectural configuration of CUS-

TARD is determined with static compiler analysis of the application. The TILT architecture does not

generate custom FUs to accelerate software code blocks. Instead, TILT uses a weaker form of application

customization by varying the mix of pre-configured, standard FUs provided by Altera and optionally

generating application-dependent custom units to handle predication and the new hardware features we

describe in this work to support loops and indirect addressing.


CUSTARD also supports multi-threading by storing the state of multiple, independent threads and

context switching between them. In contrast, TILT’s implementation of multi-threading is fine-grained,

since individual operations from different threads can be scheduled together. TILT also does not require

any thread states to be stored or maintained during the execution of its instructions, requiring only a

single program counter (PC).

Soft processors such as VLIW-SCORE execute Very-Long-Instruction-Words (VLIW) stored in on-

chip memory [10], like TILT, instead of relying on a scalar processor like Nios. These instructions specify

a different operation per FU or compute unit per cycle. The architecture’s software compiler is used to

statically find and extract parallelism from the application and to generate the VLIW instructions. This

eliminates the additional hardware complexity and cost of performing dynamic hardware scheduling.

The TILT overlay requires similar compiler scheduling but its instructions are instead classified as

Horizontally Microcoded (HM) [29] since they directly control low-level hardware components such as

multiplexers and require minimal decoding. The main difference between VLIW and HM architectures

is the higher level of abstraction provided by the VLIW instructions. An example of a reconfigurable

HM processor is the No Instruction Set Computer (NISC) [30].

Saghir et al. in [31] explore the performance and area trade-offs of implementing custom datapaths

and instructions on VLIW overlay architectures. They utilize a standard 4 stage pipeline of fetch, decode,

execute and writeback and their custom compute units have latencies of between 1 and 8 cycles. Their

approaches range from augmenting a standard FU with custom logic to implementing fully custom units.

They show that there is a trade-off between performance and area cost, with both increasing with higher

degrees of customization. Another important trade-off of implementing custom units is that the overlay

becomes more application-specific. While this benefits the target computation, it also reduces the range

of different applications the overlay can target.

2.1.2 High-Level-Synthesis Tools for FPGAs

An alternative to overlay architectures are High-Level-Synthesis (HLS) tools that translate a software

description of an application into custom FPGA logic, providing high abstraction FPGA programming.

Just as with overlays, this increases designer productivity vs. Hardware Description Language (HDL) de-

sign flows which require low-level, cycle-accurate specification of the application. Several HLS techniques

have been proposed including academic works such as FCUDA [16] and LegUp [17], commercial products

such as Vivado HLS for Xilinx FPGAs [18], as well as compilers targeting OpenCL for Altera [19] and

Xilinx [32] FPGA families.

Overlay architectures provide fast application configuration and the flexibility to execute a range of

different applications without requiring the overlay to be recompiled. In contrast, HLS tools require full

recompilation of the application into hardware after any code change. Small changes in the input code

can also lead to large differences in the system area and performance so many design iterations are often

necessary to fully optimize a system. Taken together, the combination of many design iterations and

long compile times can significantly increase development time compared to using an overlay. However,

by generating custom logic, HLS tools generally offer higher performance than overlay processors.

OpenCL is a popular high-level computing language that enables parallel programming across het-

erogeneous platforms [33]. The OpenCL programming model separates the application into two parts.

The first is the serial host program that executes on a processor and is responsible for managing data

and control flow. The host offloads the parallel compute intensive second portion defined within kernels


onto accelerator(s) such as CPUs, GPUs and recently FPGAs [19].
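To make the split concrete, the short OpenCL C kernel below is a generic illustration (not one of the benchmarks evaluated in this thesis): each work-item computes a single output element, while the host program is responsible for creating the buffers, copying the data and enqueueing the kernel over the desired index space.

/* Minimal OpenCL C kernel: each work-item handles one output element.
 * Illustrative only; not one of the benchmarks used in this thesis. */
__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *c)
{
    size_t i = get_global_id(0);   /* index of this work-item */
    c[i] = a[i] + b[i];
}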

LegUp [17] is an academic HLS tool that compiles a standard C program onto a host/accelerator

model similar to OpenCL. However, the partitioning of the host and kernel is done automatically by the

LegUp compiler. The host executes on an FPGA-based 32-bit MIPS soft processor and communicates

with the custom accelerator using a standard on-chip bus interface. This means TILT can be easily inte-

grated into LegUp as an accelerator by connecting TILT’s external data/address bus to the MIPS host,

enabling faster application compilation onto the overlay and requiring fewer accelerator recompilations.

CUDA is a language for expressing parallel applications on Nvidia GPUs [34] that also shares the

host/accelerator model. FCUDA [16] transforms CUDA kernels to parallel C code for AutoPilot HLS [35]

which translates the code into custom FPGA logic. The authors demonstrate their FCUDA flow to have

competitive performance on Virtex 5 FPGAs, outperforming Nvidia’s G80 GPU in some cases. Studies

including [36] demonstrate CUDA’s slightly higher performance when compared with OpenCL. However,

they also show it is relatively easy to translate CUDA programs to OpenCL and that OpenCL is portable,

achieving good performance on other platforms with only minor code modifications. This makes OpenCL

a compelling platform to compare against C-to-FPGA approaches such as overlays.

Stitt and Coole compile OpenCL applications to a spatial pipeline on their pre-compiled intermediate

fabrics (IFs) composed of fixed coarse computational resources and configurable interconnect instead of

directly targeting the underlying FPGA [12]. This approach is similar to TILT since the IF is customized

to the requirements of the kernel and allows rapid kernel compilation and reconfiguration (seconds vs.

hours) while incurring a performance penalty and area overhead compared to direct OpenCL synthesis.

However, as we show in Section 6.5.3, TILT also enables smaller designs than OpenCL HLS when a

lower throughput is adequate, enabling a larger range of throughput and area solutions to be explored.

In Section 6.5, we compare the performance, area, development effort and scalability of our TILT

overlay processor and Memory Fetcher architectures with Altera’s OpenCL HLS tool. As a part of

our evaluation, we also quantitatively explore the strengths and disadvantages of these two different

C-to-FPGA approaches.

2.1.3 Memory Architectures on FPGAs

Conventional microprocessor architectures attempt to reduce the average memory access latency of the

processor at the expense of memory bandwidth by prefetching entire cache lines of spatially or temporally

local data from a larger, off-chip memory. FPGAs provide a large amount of on-chip compute and routing

bandwidth but they are bandwidth and latency limited at the FPGA to external memory boundary.

Here we present a survey of memory architectures that bring data onto the on-chip memories of the

FPGA from external sources such as DDR.

Several works have explored the merits of building memory architectures on FPGAs using caches

and/or scratchpads [37–40]. Caches are a more complex data storage solution, requiring the additional

storage of tags which are used to determine what data is present in the cache. Data is fetched proactively

based on the application’s memory accesses, usually with adjacent data items being fetched as well.

Scratchpads are simpler structures with a lower area cost that store data items without tags and require

the data movement to be explicitly managed; this is the method adopted by TILT’s data memory.

Kalokerinos et al. [37] propose a memory system that can be configured to support both implicit

and explicit memory management via caches and scratchpads respectively. Caches are used where it

is not known in advance when or what data will be needed which means the memory accesses have


non-deterministic timing. The reverse is true for directly-addressable scratchpad memories, which suit

systems that need direct control and optimization of data placement and memory transfers.

Panda et al. [38] propose a processor system that combines a small scratchpad memory with an

on-chip data cache which interfaces with an external memory. Application data is partitioned to either

reside in the external memory (such as large arrays of contiguous data) or the on-chip scratchpad (scalars

and constants) with the goal of minimizing the total execution time of the application. The scratchpad

provides a single cycle access latency guarantee while the data cache can take between 1 and 20 cycles

depending on whether the access is a hit or a miss.

Choi et al. [39] and Putnam et al. [40] implement multi-ported and multi-banked cache-based memory

architectures on an FPGA and evaluate their performance, area and power consumption. Choi et al.

found that implementing cache design parameters such as different cache line sizes, associativities and

port counts on an FPGA has a high resource and performance cost, partly because of the multiplexing

networks that must be implemented.

The Reconfigurable Data Cache (RDC) [41] presents a caching mechanism tailored for FPGAs. It

consists of an on-chip data cache and a speculation unit that are automatically configured and generated

by its compiler based on the predominant access patterns of the application and inserted between the

application and the external memory. The speculation unit intelligently fetches only the data items

that are needed by the application into the data cache instead of entire cache lines. This is intended

to minimize the communication with off-chip memory. In our approach, the Memory Fetcher uses

the compiler to determine the application’s external memory accesses and statically schedules these

operations to execute in parallel to the TILT processor(s). We do not need to implement hardware logic

to speculatively prefetch data and we can guarantee only the necessary data is fetched.

RDC’s speculation unit is ineffective for applications with irregular access patterns and for applica-

tions that operate on large chunks of data at a time. A single data cache with a single arbitrated channel

to the application also makes it difficult to exploit any data and compute parallelism that exists in the

application. Stitt et al. in [42] have looked at how cache-based memory architectures can efficiently

support irregular memory accesses such as the traversal of pointer-based data structures. Since the

Memory Fetcher’s operations are statically scheduled, we can support any memory access pattern but

lose the flexibility of any dynamic behaviour. This is acceptable since the TILT processor is statically

scheduled as well. The Memory Fetcher also supports multiple, independent data streams that can be

accessed by several TILT cores in parallel, allowing multiple data items to be accessed by each TILT

core at the same time.

Unlike RDC, FCache [43] implements multiple coherent and spatially distributed data caches con-

nected by a mesh interconnect network. Instead of being centralized, coherency and consistency man-

agement is distributed across all caches. The caches are connected to off-chip memory via an arbitrated,

custom shared memory controller. Having multiple caches that can service an application in parallel is

distinct from the singular cache of RDC. FCache tries to exploit the on-chip bandwidth of the FPGA

but is limited by the memory controller and arbitration to external memory. FCache also has a much

higher area overhead compared to RDC.

The CoRAM memory architecture [44, 45] attempts to exploit the spatially distributed fabric of

the FPGA by aggregating some of its Block RAMs (BRAMs) into small ‘CoRAMs’ that store off-chip

data in a highly distributed fashion. This allows applications to access these memories in parallel and

independently of each other just like BRAMs. CoRAM provides ‘control threads’ that the application


uses to explicitly manage the movement of data to and from the CoRAMs and off-chip memory. These

threads are implemented in C alongside the application. Thus, developers must be intimately familiar

with how their application consumes and produces data and must explicitly describe this data flow for

the lifetime of the application. This burden is placed on the compiler instead in our approach.

The guiding principle of LEAP [46] is the polar opposite of CoRAM. It aims to provide a complete

abstraction of memory and its management from the application developer. LEAP allows the designer

to plug in memory components such as scratchpads or caches based on the properties that best fit their

application needs while providing the same communication interface as that of a BRAM. The abstraction

provided by LEAP means the LEAP memories cannot guarantee to the application how many cycles

a memory operation will take. LEAP memories fetch data from external memory after the application

issues the memory operation. The application has no way to know if the data is already present in the

on-chip LEAP memories or has to be fetched from external memory. This means applications designed

with LEAP need to be timing insensitive. With RDC or CoRAM, applications can be timing driven.

TILT is timing sensitive but since we know what data the processor will need and when through

off-line compiler analysis and scheduling, we are able to prefetch the data from off-chip memory ahead of

time to cover the long external memory access latency. This data is stored in buffers inside the Fetcher

which behaves as an intermediary between the TILT cores and off-chip memory. The buffers allow the

Fetcher to provide deterministic memory access latency to the TILT cores similar to accessing an on-chip

BRAM. If, however, the intermediate buffers are close to becoming empty or full, the Fetcher preemptively stalls

the TILT cores from executing future compute instructions.

2.2 TILT Overlay Processor

Figure 2.1: The TILT architecture, consisting of banked, multi-ported data memory and single-precision floating-point FUs that are connected by crossbar networks [15].

TILT (Thread and Instruction Level parallel Template architecture) is a highly configurable overlay

compute engine for FPGAs with multiple, varied and deeply pipelined 32-bit single-precision, floating-

point functional units (FUs) [13, 15]. TILT supports the parallel execution of multiple independent

threads with each thread capable of issuing multiple operations per cycle to obtain high utilization of

its FUs. The threads perform the same computation and do not communicate data between them. As

shown in Figure 2.1, TILT has read and write crossbar networks that connect the array of FUs to an

explicitly managed banked, multi-ported data memory built using on-chip BRAMs [47].


TILT relies on static compiler instruction scheduling to reduce hardware complexity [14] and does

not require forwarding logic or dynamic data hazard detection. Data is always read from TILT’s data

memory and the pipeline executes instructions without any stalls. It is the responsibility of the compiler

to create an instruction stream that separates data-dependent operations by enough cycles such that

any data required by a later operation is produced and written back to the data memory beforehand.

The TILT instructions are stored within an on-chip memory, also built using BRAMs.
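As a rough illustration of this constraint (our own sketch, not the compiler's actual bookkeeping), the C fragment below assumes that a dependent operation may only issue once its producer's read-crossbar, FU and write-crossbar latencies have all elapsed; the 8- and 5-cycle crossbar latencies and the 7-cycle AddSub latency are taken from Section 2.2.1.

#include <stdio.h>

#define READ_XBAR_LATENCY  8   /* cycles (Section 2.2.1) */
#define WRITE_XBAR_LATENCY 5   /* cycles (Section 2.2.1) */

/* Earliest issue cycle of an operation that consumes the result of a
 * producer issued at 'producer_issue' on an FU with 'fu_latency' cycles,
 * assuming the result must first be written back to data memory. */
static int earliest_dependent_issue(int producer_issue, int fu_latency)
{
    return producer_issue + READ_XBAR_LATENCY + fu_latency + WRITE_XBAR_LATENCY;
}

int main(void)
{
    /* e.g. an AddSub (7 cycles) issued at cycle 0: its consumer can issue
     * no earlier than cycle 0 + 8 + 7 + 5 = 20. */
    printf("%d\n", earliest_dependent_issue(0, 7));
    return 0;
}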

We briefly describe the hardware components of the TILT processor shown in Figure 2.1 and how

they can be configured in Section 2.2.1 below.

2.2.1 TILT Components

Functional Units

The pipelined, single-precision, floating-point FUs that are supported by the TILT architecture of [13]

and their latencies are summarized in Table 2.1. The custom Cmp FU is used to support if-else control

flow instructions in the application code. This is done with predication [48] which we describe in Section

4.1.4. The remaining FU types are standard, pre-configured FUs that are generated using Altera’s

megafunction wizard. The Quartus settings used are provided in Appendix A. All FUs are IEEE-754

compliant and have up to two 32-bit scalar input operands and a single scalar output. TILT’s FU mix

can consist of any combination of these FUs with multiple instances of each type as required by the

target application.

FU Type            AddSub  Mult  Div  Sqrt  Exp  Log  Abs  Cmp
Latency (cycles)        7     5   14    28   17   21    1    3

Table 2.1: Listing of the single-precision, floating-point FUs that are supported by the TILT architecture of [13] and their latencies.

Data Memory

Figure 2.2: Illustration of TILT's data memory organization. (a) Data memory with 2 banks, each with 2 read and 1 write ports. (b) A single data memory bank with 3 read and 2 write ports.

TILT’s data memory holds 32-bit floating-point data and is organized into memory banks as illus-

trated in Figure 2.2. Each memory bank is composed of two or more BRAMs that hold the same data to

support multiple concurrent reads and writes. This is because each BRAM is dual-ported, providing only

2 ports that can each be used to perform either a read or a write. As an example, a memory bank with 2 reads and


1 write requires 2 BRAMs to be connected as shown in Figure 2.2(a). When an FU writes to this bank,

the data item will be committed to both BRAMs. The duplication of data into two BRAMs allows two

independent reads, one from each of the physical BRAMs. For each additional read or write port, the

number of BRAMs per bank increases in the manner illustrated by Figure 2.2(b) – the BRAM count is

the product of the number of read and write ports. The example assumes a single BRAM is big enough

to store the contents of an entire bank. Threads are evenly assigned to the available memory banks with

the data of each thread having its own address space in a single bank.
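The replication rule above translates directly into a BRAM budget. The small C sketch below computes it under the same assumption the example makes, namely that a single BRAM is deep enough to hold the contents of an entire bank (deeper banks would scale the count further).

#include <stdio.h>

/* BRAMs needed by one bank: the product of its read and write port counts. */
static int brams_per_bank(int read_ports, int write_ports)
{
    return read_ports * write_ports;
}

static int total_brams(int banks, int read_ports, int write_ports)
{
    return banks * brams_per_bank(read_ports, write_ports);
}

int main(void)
{
    /* Figure 2.2(a): 2 banks x (2 reads x 1 write)  = 4 BRAMs.
       Figure 2.2(b): 1 bank  x (3 reads x 2 writes) = 6 BRAMs. */
    printf("%d %d\n", total_brams(2, 2, 1), total_brams(1, 3, 2));
    return 0;
}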

Read and Write Crossbars

The read and write crossbar networks route input data from the memory banks to the input operands

of FUs and FU outputs to the data memory respectively. These crossbars are pipelined to improve

TILT’s operating frequency (Fmax) and are fully connected, giving FUs read and write access to every

memory bank, but not to all the individual BRAMs that comprise the bank. For this work, the read and

write latencies are 8 and 5 cycles respectively. The connectivity of the crossbars is illustrated in Figure

2.3. Figure 2.3(a) presents a simple TILT configuration with 2 FUs and 2 memory banks, each with 2

read and 1 write ports. With this configuration, both FUs can issue and complete a compute operation

simultaneously as long as they read from and write to different banks at every cycle.

Figure 2.3: Illustration of TILT's read and write crossbar connectivity. (a) 2 memory banks, each with 2 read and 1 write ports. (b) 2 memory banks, each with 4 read and 1 write ports.

Increasing the read or write port counts increases the number of concurrent operations that can

be issued or completed by the FUs from the same memory bank. In Figure 2.3(b), 2 additional read

ports allow both FUs to concurrently read data from the same memory bank. While this configuration

still only supports 1 write per bank and hence can only complete one operation per bank each cycle,

additional read ports can potentially improve TILT’s computational throughput since operations issued

on the same cycle may writeback to memory on different cycles due to the widely varying pipeline depths

of the FUs (between 1 and 28 cycles). However adding more read and/or write ports also increases the

memory storage requirements (BRAMs required) and the size of the crossbar networks. The appropriate


mix of FUs, thread count and data memory organization to use is largely dependent on the target

application and the designer’s performance and area requirements.

Instruction Memory

As shown in Figure 2.4, TILT’s compute instructions encode the operations that each of the generated

FUs must execute on a given cycle. Each FU operation contains 2 read operand addresses and 1 write

address for reading input data and writing the results of the FUs to TILT’s data memory. The read and

write bank ids identify the memory bank to access. The opcode is used by the AddSub and Cmp FUs

to determine their mode of operation and the thread id is used solely by the Cmp FU. We describe the

operation of the Cmp FU in Section 4.1.4. Finally, the valid bit of each FU’s operation indicates if an

operation has been scheduled for the FU at that cycle. The TILT architecture of [13, 14] required the

encoding of each FU type in the instruction to be the same. We have relaxed this requirement during

the addition of new FU and operation types to TILT.

Figure 2.4: The encoding of the TILT compute instruction. Each cycle of the schedule is one instruction containing an operation field per FU; each FU operation consists of a valid bit (1 bit), an opcode (5 bits), a thread id (log2(threads) bits), read and write bank ids (log2(banks) bits each), two read operand addresses and one result write address (log2(bank depth) bits each).

The instruction memory stores the schedule generated by TILT’s compiler in its entirety. Since each

thread executes its own set of compute operations that are scheduled onto TILT, adding more threads

typically increases the length of TILT’s schedule, hence requiring a deeper instruction memory. However,

the parallel execution of more threads also improves TILT's compute throughput. As illustrated

in Figure 2.4, the width of each FU operation depends on TILT’s architectural parameters such as

the number of threads, memory banks and the depth of each bank. Similarly, the width of the TILT

instruction increases with each additional FU. This means larger TILT configurations will also require

a bigger instruction memory.
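Putting the field widths of Figure 2.4 together, the following C sketch estimates the width of a TILT compute instruction for a given configuration; it assumes every FU has its own operation field (i.e. no operation-field sharing, which Section 2.2.2 describes as an option).

#include <stdio.h>

/* Number of bits needed to encode n distinct values. */
static int bits_for(int n)
{
    int bits = 0;
    while ((1 << bits) < n)
        bits++;
    return bits;
}

/* Width of one FU operation field, following Figure 2.4. */
static int fu_operation_bits(int threads, int banks, int bank_depth)
{
    return 1                         /* valid bit                          */
         + 5                         /* opcode                             */
         + bits_for(threads)         /* thread id                          */
         + 2 * bits_for(banks)       /* read and write bank ids            */
         + 3 * bits_for(bank_depth); /* 2 read addresses + 1 write address */
}

static int instruction_bits(int num_fus, int threads, int banks, int bank_depth)
{
    return num_fus * fu_operation_bits(threads, banks, bank_depth);
}

int main(void)
{
    /* e.g. 4 FUs, 8 threads, 2 banks of 256 words each */
    printf("%d bits per instruction\n", instruction_bits(4, 8, 2, 256));
    return 0;
}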

2.2.2 TILT Customization

The TILT architecture is designed to achieve high computational throughput per unit area on data

parallel applications. We accomplish this by customizing TILT’s architectural parameters such as its

FU mix, organization of banked data memory, number of threads and number of operations that can issue

or complete in parallel, to closely match the compute requirements of the target application. TILT’s

throughput can be increased by adding more TILT resources to allow more operations to execute in

parallel. However, this is obtained at the expense of increasingly higher area cost with diminishing gains

in compute throughput. Conversely, TILT can be configured to be smaller as needed but its throughput

will be reduced as well.

We can further customize TILT by allowing certain FUs to share both their operation field in TILT’s


instruction and their read and write ports with another FU to reduce the size of the instruction memory

and the crossbars. In this case, only one of the two FUs can issue and/or complete an operation per

cycle. TILT’s compiler ensures that there are no scheduling conflicts. A single bit in the operation

field’s opcode (Figure 2.4) is sufficient to indicate to TILT which FU the operation targets. Although

this operation field sharing increases contention between the two FUs for resources which can degrade

compute throughput, it can result in an overall improvement in throughput per unit area if one or both

of the FUs are underutilized or if the operations of one of the FUs must precede those of the other due

to data dependencies.

A more effective way to increase TILT’s performance beyond that which can be obtained with a single

area-efficient TILT core is by instantiating multiple copies of the core, composed of the data memory,

crossbars and FU array. All TILT cores share a single instance of the instruction memory and execute in

parallel in SIMD [25]. We call this architecture TILT-SIMD. As we will demonstrate in Chapter 6 of our

evaluation of the TILT architecture, this approach allows the individual cores to remain computationally

dense (have high throughput per area) while achieving near linear increase in total compute throughput

up to a certain core count.

The wide range of configuration parameters and options constitute a very large design exploration

space. They also provide a large degree of flexibility to tailor the TILT architecture to an application

and the overall system based on the compute requirements and area budget.

2.3 TILT Compiler

The TILT compiler flow is responsible for generating TILT’s static instruction schedule from an input

application kernel. The kernel defines the computation of a single TILT thread. The flow of an example

kernel is shown in Figure 2.5. Given the TILT configuration and the target application, we want to find

a schedule with the highest compute throughput, identified by the shortest schedule, while respecting

TILT’s architectural limitations. The flow consists of four main stages, starting with the generation

of the Data Flow Graph (DFG) which defines the application’s compute operations and their data

dependencies. The second stage allocates TILT’s data memory and in the third stage, the operations

of the DFG are scheduled onto TILT based on the processor’s hardware configuration. Finally, the

application is executed on the target TILT configuration after it is synthesized and loaded onto the

FPGA using Quartus. These stages are described in the sections below.

2.3.1 DFG Generation

The Data Flow Graph Generator (DFGGen) is implemented as an LLVM compiler pass [49] that produces

a custom DFG description of the application kernel from its intermediate representation (IR) [50]. The

IR is generated by the LLVM frontend compiler [51] from the application’s C source code. The DFGGen

tool requires the kernel computation to be inside a single C function, similar to a kernel in OpenCL [33].

Variables must be declared inside the C function or they can be passed into the function as parameters.

The nodes of the DFG are the supported compute operations of Table 2.1 such as add or compare. The

predecessors of a node are operations that must execute before it, usually to produce its input operands.

Similarly, operations that depend on the node are its successors in the DFG. Each TILT thread is a

separate instance of the DFG.


Figure 2.5: The TILT compilation flow for an example application, from the C kernel and TILT configuration through DFGGen, the Memory Allocator and the Scheduler to the data memory allocation, compute schedule and Verilog HDL design files. The FU and crossbar latencies were reduced for illustration.

The input kernel must consist of only the floating-point operations that TILT supports. Loops or

any integer operations are not supported by the TILT architecture of [13, 14]. Any loops that exist in

the kernel must be manually unrolled and array indexes must hold constants. Indirect addressing, where

data in memory is used to address into another data entry in memory, is also not supported. Conditional

if-else branching is supported with predication [48] which we describe in Section 4.1.4.
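
As an illustration of the accepted kernel form, the hypothetical kernel below sums four inputs; the looped version on top would be rejected, while the manually unrolled version underneath (constant indices, floating-point operations only) is the form DFGGen expects. The function and variable names are illustrative only:

    /* Hypothetical kernel with a loop: not accepted by DFGGen (loop, integer
     * induction variable, non-constant array index). */
    void sum4(float *in, float *out) {
        float acc = 0.0f;
        for (int i = 0; i < 4; i++)
            acc = acc + in[i];
        out[0] = acc;
    }

    /* Manually unrolled form in the shape DFGGen expects: a single C function,
     * constant array indices and only supported floating-point operations. */
    void sum4_unrolled(float *in, float *out) {
        float a = in[0] + in[1];
        float b = in[2] + in[3];
        out[0] = a + b;
    }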

2.3.2 Memory Allocator

The TILT compiler developed by Tili in [14] statically allocates data memory for the DFG after it

is generated from the application kernel and before it is scheduled onto the target TILT architecture.

Since TILT threads are separate, persistent instances of the DFG that execute in parallel, the allocated

address space is replicated for each thread, as shown in Figure 2.5. The LLVM IR uses SSA (Static-

Single-Assignment) [52] to name the temporary registers used in the computation. This means the output of every operation is written to a new register with a unique name.

Instead of allocating a different memory location for each register in the LLVM IR, Tili’s Memory

Allocator reassigns the location to another register after the previous one is no longer referenced by any

compute operations. This reassignment is performed based on the dependencies defined by the DFG


only and without making any assumptions on how the DFG will be scheduled. Therefore the algorithm

cannot optimize memory usage across parallel dependency chains. Memory variables referenced in the

C kernel are allocated their own memory locations and do not get reassigned so they persist in memory

for the lifetime of the schedule. For pointer variables, the algorithm only assigns locations in TILT’s

data memory to regions that are accessed by the computation.
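
A rough sketch of this location reuse is shown below; it is not Tili's actual algorithm, and it assumes each SSA register records the DFG position of its defining operation and of its last use (all names and the location count are illustrative):

    #define MAX_LOCS 64

    typedef struct {
        int def_op;     /* index of the DFG operation that defines the register */
        int last_use;   /* index of the last DFG operation that reads it        */
        int location;   /* assigned data memory location                        */
    } Reg;

    /* Assign data memory locations to SSA registers, reusing a location once
     * the register it previously held is no longer referenced. regs[] is
     * assumed ordered by definition point. */
    void allocate(Reg *regs, int num_regs) {
        int free_at[MAX_LOCS];               /* op index after which the      */
        for (int l = 0; l < MAX_LOCS; l++)   /* location may be reassigned    */
            free_at[l] = -1;

        for (int r = 0; r < num_regs; r++) {
            int chosen = -1;
            for (int l = 0; l < MAX_LOCS; l++) {
                /* a location is free if its previous occupant is dead before
                 * this register is defined */
                if (free_at[l] < regs[r].def_op) { chosen = l; break; }
            }
            regs[r].location = chosen;       /* -1 would mean out of memory   */
            if (chosen >= 0)
                free_at[chosen] = regs[r].last_use;
        }
    }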

2.3.3 Compute Scheduler

For our work, we use Tili’s Grouped Prioritized Greedy Mix Longest Path (GPGMLP) algorithm [14] to

generate the compute schedule of an application that will be executed on the target TILT architecture.

We also extend this algorithm to construct one of our Fetcher memory scheduling approaches, which

we describe in Section 3.3.1 and evaluate in Section 3.5. The algorithm takes as input the DFG of the

application, the number of threads to schedule and TILT’s hardware configuration which specifies its

mix of FUs and their latencies and the organization of data memory. The objective of GPGMLP is to

produce the shortest (or densest) compute schedule while adhering to the hardware resource constraints

of the processor. The GPGMLP algorithm combines several heuristics to obtain an efficient schedule

which we describe next. The pseudo-code of the algorithm is provided in Algorithms 1 and 2.

Algorithm 1: Tili's Grouped Prioritized Greedy Mix Longest Path (GPGMLP)
  Input: DFG, hwTILTConfig, groupSize
  Output: TILT compute schedule
  1 set1 = longest path operations plus their predecessors;
  2 set2 = remaining operations in DFG;
  3 call Algorithm 2 on set1;
  4 call Algorithm 2 on set2;

Greedy. The algorithm greedily schedules compute operations at the earliest possible cycle, starting

with operations that have no predecessors (lines 7 to 9 in Algorithm 2). These are operations that do

not require any other operation to be scheduled first and can be scheduled immediately. After every

iteration of the scheduling step, the algorithm checks for newly ready operations whose predecessors

have all been scheduled and marks them as ready to be scheduled (lines 21 to 24).

Operation Priority. When there are several operations that are ready to be scheduled, the algo-

rithm grants priority to the operation with the lowest slack time. The slack time of an operation is the

cycle difference between scheduling the operation As-Late-As-Possible (ALAP) and As-Soon-As-Possible

(ASAP). ASAP schedules operations as soon as their predecessors have finished executing while ALAP

schedules operations as late as possible but without increasing the total schedule latency beyond that

of the ASAP schedule. The generation of the two schedules takes into account the dependencies of operations and the FU latencies, but TILT's hardware resource constraints are ignored to simplify the calculation. This slack-based ranking of ready operations means that operations which are more likely to increase the total length of the schedule if scheduled poorly are scheduled first.
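
A minimal sketch of the slack computation is shown below, assuming the DFG is held as an array of operations in topological order with predecessor lists and FU latencies (the types and names are illustrative, not the TILT compiler's own):

    typedef struct {
        int latency;     /* latency of the FU that executes this operation */
        int num_preds;   /* number of predecessor operations               */
        int *preds;      /* indices of predecessor operations              */
        int asap, alap;  /* computed start cycles                          */
    } Op;

    /* ops[] is assumed to be in topological order (predecessors first). */
    void compute_slack(Op *ops, int num_ops, int *slack) {
        int sched_len = 0;
        /* ASAP: earliest start, constrained only by predecessors finishing. */
        for (int i = 0; i < num_ops; i++) {
            ops[i].asap = 0;
            for (int p = 0; p < ops[i].num_preds; p++) {
                Op *pred = &ops[ops[i].preds[p]];
                int ready = pred->asap + pred->latency;
                if (ready > ops[i].asap) ops[i].asap = ready;
            }
            if (ops[i].asap + ops[i].latency > sched_len)
                sched_len = ops[i].asap + ops[i].latency;
        }
        /* ALAP: latest start that does not stretch the ASAP schedule length;
         * walk in reverse order so successors constrain their predecessors. */
        for (int i = 0; i < num_ops; i++)
            ops[i].alap = sched_len - ops[i].latency;
        for (int i = num_ops - 1; i >= 0; i--)
            for (int p = 0; p < ops[i].num_preds; p++) {
                Op *pred = &ops[ops[i].preds[p]];
                if (ops[i].alap - pred->latency < pred->alap)
                    pred->alap = ops[i].alap - pred->latency;
            }
        for (int i = 0; i < num_ops; i++)
            slack[i] = ops[i].alap - ops[i].asap;  /* lower slack = higher priority */
    }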

Grouped Mix Thread Scheduling. The algorithm schedules operations from multiple threads in a

round-robin fashion instead of scheduling each thread one after the other. This allows the higher priority

operations from across threads to be scheduled before the lower priority operations. This generates a

more dense, higher throughput compute schedule and has the added benefit of requiring a smaller


Algorithm 2: Tili's Grouped Prioritized Greedy Mix
  Input: DFG operations to schedule, hwTILTConfig, groupSize
  Output: TILT compute schedule
  1 create ASAP schedule for DFG operations;
  2 create ALAP schedule for DFG operations;
  3 calculate slack (ALAP - ASAP) of DFG operations;
  4 create readyList[] with an empty readyList entry for each thread;
  5 foreach group of threads do
  6   for t = first thread of group to last thread of group do
  7     foreach operation in DFG do
  8       if operation has no predecessors then
  9         insert in readyList[t];
 10   cycleCounter = 0;
 11   while readyList[] not empty do
 12     sort readyList[] entries by priority for each thread of group;
 13     for i = 0 to (maximum list size across all readyList[] entries of group) - 1 do
 14       for t = first thread of group to last thread of group do
 15         if readyList[t] contains an operation at index i then
 16           if earliest cycle operation can start <= cycleCounter then
 17             schedule operation in earliest free spot;
 18             mark operation for removal from readyList[t];
 19     remove marked entries from readyList[] for all threads of group;
 20     cycleCounter = cycleCounter + 1;
 21     for t = first thread of group to last thread of group do
 22       foreach operation in DFG not already scheduled for t do
 23         if all predecessors are scheduled then
 24           insert in readyList[t];

instruction memory to store the shorter schedule. However, instead of scheduling all threads together,

they are scheduled in groups, with each thread group being scheduled completely before scheduling the

next group of threads.

Tili’s algorithm tries all possible group sizes between 1 and the total number of threads and selects

the best schedule with the shortest length. With a group size of 1, each thread is scheduled in sequence.

At the other extreme with a group size equal to the number of threads, all threads are scheduled in

parallel. For this work, we found trying only the sizes that are multiples of 2 (including 1 and the total

number of threads) is a good compromise that improves the algorithm’s runtime significantly without

noticeably penalizing the achieved throughput.

Longest Path. As the algorithm’s final heuristic, it schedules for all threads the operations that

are part of the longest dependence chain within the DFG before scheduling the remaining operations

(see Algorithm 1). The length of a chain is defined by the sum of the latencies of each operation within

the chain. The algorithm separates the DFG into two sets of operations. The first set comprises the

operations in the longest path in addition to all the operations they depend on and is scheduled first.

The second set consists of the remaining operations and is scheduled last. Since delaying operations

within the longest path is more likely to increase the total length of the schedule, it makes sense to


prioritize the scheduling of this path over other operations.

For our example TILT application in Figure 2.5, the longest path contains the compute operations

{B, D, E}. These operations and operation C (due to being the predecessor of E) are scheduled first.

Operation A of the third and fourth threads is then scheduled in the first and second cycles, rather than operation A of the first two threads, because the ports of the first memory bank (which holds the data of threads 1 and 2) are being used by the Mult FU in those cycles.

2.4 TILT Execution Model

The TILT architecture targets data-parallel, compute-intensive, floating-point applications. We schedule

multiple threads (or instances of the application kernel) onto a TILT core and multiple TILT cores can

then execute the same schedule in SIMD. Further, we can configure TILT to re-execute its schedule after

reaching the end of the instruction stream. In this scenario, TILT’s application kernel represents the

computation inside an implicit, external loop. To keep the size of the instruction memory small, this is

the preferred use case. For our example kernel in Figure 2.5, each of the four scheduled TILT threads

computes an element of the out vector. With 8 TILT cores, groups of 32 elements can be computed

together. If we assume a vector length of 512, we will need to re-execute the TILT schedule a total of

16 times to compute the entire vector.
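
The re-execution count follows directly from these numbers:

    re-executions = 512 elements / (8 cores * 4 threads per core) = 512 / 32 = 16.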

TILT maps well to applications where each thread performs coarse-grain computation on a set of inputs (to mask long access latencies to off-chip memory) and where the same computation is performed on a

large data set. Example applications include image and video processing, such as Mandelbrot [53] and

HDR [54] and simulation of neurons with the Hodgkin-Huxley algorithm [55]. Applications such as the

dot product of a vector can only be supported on TILT if it is entirely computed by a single thread.

This way, the dot product of multiple vectors can be computed in parallel. Since TILT threads and

cores do not communicate, the dot product of a single vector cannot be distributed to multiple threads

and cores due to the reduction step required at the end of the computation.


Chapter 3

External Memory Fetcher

The TILT architecture and compilation flow developed by Ovtcharov and Tili in [13] and [14] do not

include a mechanism to communicate with off-chip DDR memory. The authors assumed all off-chip input data was either already present or would somehow be brought into TILT's relatively small data memory before it was needed by the computation. The scheduling of the application onto TILT and

their throughput measurements did not account for the accesses to off-chip memory, their relatively high

latencies or the performance penalties that may be incurred from these accesses. Only the compute

portion of the applications was scheduled and evaluated. Ovtcharov and Tili have focused primarily on

exploring the best way to compute using soft processors on an FPGA by evaluating their architecture

with compute intensive benchmarks. A parallel study conducted by Charles LaForest that led to the

Octavo soft-processor also followed from this research direction [56].

Not implementing the hardware to communicate with off-chip memory and not accounting for when

the external data will arrive yielded a simpler and more area-efficient TILT design. It has also enabled the

statically scheduled processor to be optimized for fast computation on its local data memory and achieve

a high execution rate of compute operations per cycle. However, this is not an accurate representation

of a complete and usable soft-processor system. To evaluate a more compelling system on realistic, large

memory applications, we have designed a separate Memory Fetcher unit to efficiently move data between

multiple TILT cores and off-chip DDR memory.

3.1 Possible Design Solutions

We begin with a discussion of the different approaches we have considered to move data between off-chip

memory and TILT’s data memory. First, we can integrate the function as part of the TILT processor by

adding external load/store memory operations to directly access off-chip memory and implement stall

logic to wait for the data to arrive. TILT’s data memory will then behave as a local scratchpad on which

the processor would compute. However for our statically scheduled TILT processor, we cannot easily

interleave external loads and stores between compute operations without also incurring a significant

performance loss due to the indeterminism of the data arrival time. The latencies of these external

memory operations are also much larger than the latencies of the compute operations that access TILT’s

data memory. This means allowing TILT to switch between moving data and computation will also

significantly degrade the processor’s compute throughput. This is because we will need to wait for the


long memory transfers to complete before being able to resume computation on the new data.

Alternatively, we can leave the TILT processor as is and implement a separate entity that is re-

sponsible for efficiently moving data between off-chip memory and TILT’s data memory while the TILT

processor performs computation. This can be accomplished with some or all of the read and write ports

of TILT’s data memory being connected to external memory via the new entity. This effectively sepa-

rates the concerns of communicating with off-chip memory and performing computation, allowing each

component to be optimized to perform its respective task. This is the approach we have adopted.

Another approach we considered was to make TILT’s data memory twice as deep and have twice

as many ports so that TILT can communicate with off-chip memory and compute at the same time by

operating on different address spaces. This approach is known as double-buffering and is similar to the solution adopted by the VENICE soft-processor [9]. It requires TILT's data memory ports to be

connected to both the external memory interface and TILT’s FU array. At the end of each iteration of

computation and data movement, there will be a synchronized switch. We found the resource cost of

this approach to be high as TILT’s crossbar grew quadratically with every additional data memory port.

These ports are also underutilized since they are either being used for computation or to communicate

with off-chip memory. It also assumes the time to gather and scatter the necessary external data is

comparable to the time it takes to perform computation which is not usually the case.

Instead, we can either use dedicated data memory ports to communicate with external memory or

share the available ports with the external memory interface via the TILT crossbars as they are currently

shared with the TILT FUs. The read and write ports of the data memory are shared with TILT’s FU

array since not all FUs will execute operations or read and write to memory at every cycle. This is just

as true with reading and writing data to off-chip memory. Hence, connecting some of the data memory

ports directly to TILT’s external memory interface will result in the underutilization of these ports while

also requiring a larger data memory to support more ports.

For these reasons, in our chosen solution, external memory operations access the same data memory

address space as the compute operations while TILT concurrently performs computation. Additionally,

the available data memory ports are shared with TILT’s FU array as well as its external memory interface

via the crossbars. We are able to vary the number of data memory ports and crossbar outputs that

are connected to TILT’s external memory interface (external ports) independently of one another. How

many of each there should be is dependent on the computation and its external data bandwidth.

3.2 Chosen Design Solution

Taking into consideration the above design choices, we have implemented a separate Memory Fetcher

unit, with its fundamental design elements illustrated in Figure 3.1. In the figure, TILT-SIMD comprises

an array of TILT compute units which are connected to the Fetcher to produce the TILT-System. The

purpose of the Fetcher is to efficiently move data between these TILT cores and off-chip memory while

the TILT cores compute on their local data memories. Like TILT-SIMD, the Fetcher is fully-pipelined

and is capable of reading and writing data from/to the TILT cores and off-chip memory at every cycle.

Internally, the Fetcher contains data FIFOs that act as intermediate buffers between the data mem-

ories of the TILT cores and off-chip memory. These buffers are present to mask the long latencies and

indeterminism inherent with external memory accesses and to decouple the communication with off-chip

memory from the computation as much as possible. The long latencies and non-deterministic timings


[Figure 3.1 shows the TILT-System: an array of TILT cores (TILT-SIMD) whose nExtR and nExtW ports connect, through the widthconv module, to the Fetcher's incoming and outgoing data FIFOs; the Fetcher's control logic and instruction memory drive the TILT Rd/Wr and DDR Rd/Wr transfers, with the FIFOs crossing between the TILT and DDR clock domains.]

Figure 3.1: TILT-System - TILT-SIMD connected to off-chip DDR via the Memory Fetcher.

of the external memory accesses are hidden by buffering the off-chip input data needed by future TILT

compute operations into these FIFOs ahead of when these operations will be executed by the TILT

cores. This enables the Fetcher to behave as a loosely-coupled, run-ahead fetch unit.

How much slack there will be depends on several factors including TILT’s compute schedule and

its external data bandwidth, the depth of the data FIFOs and the rate at which off-chip data can be

consumed and produced by both TILT-SIMD and the Fetcher. Ideally, we want the Fetcher to be able

to consume data at the maximum rate at which TILT-SIMD can produce off-chip outputs while also

matching the rate at which TILT-SIMD requires input data. Finding the appropriate balance among

these parameters will maximize compute throughput while minimizing processor stalls and the hardware

area cost of the Fetcher and TILT-SIMD. The data FIFOs also perform clock domain crossing between the

TILT-SIMD and the DDR controller clocks. The relevant Quartus settings used for the DDR controller

are provided in Appendix B.

The Fetcher’s data FIFOs provide deterministic external memory access latency guarantee to the

TILT cores. This works well for the statically scheduled TILT processor as external data can be moved

into and out of the TILT cores parallel to the computation without requiring changes in the TILT overlay.

The Fetcher writing to TILT’s data memory works the same way as an FU writing its computed result

to data memory. Similarly, reading data from TILT’s data memory into the Fetcher’s outgoing data

FIFOs is comparable to reading data into the input operands of an FU.

Conversely, moving data between the data FIFOs and off-chip memory can remain non-deterministic

with respect to timing. Further, several DDR read and write bursts can be enqueued together to fetch

external input data needed by future compute operations into the FIFOs and flush buffered computed

results to external memory from the FIFOs. The depth of the FIFOs and burst sizes can be varied to

improve DDR bandwidth and mask latency, providing optimized communication with off-chip memory

and minimizing processor stalls.

Each TILT core’s external memory interface consists of a configurable number of external read and

write ports, indicated by the nExtRPorts and nExtWPorts respectively in Figure 3.1. These control the

number of incoming and outgoing data FIFOs and the maximum number of data words that can be read

or written in parallel between the data memories of the TILT cores and the FIFOs on a given cycle.

An external read port is an output of the read crossbar, like FU input ports. Similarly, an external


write port is an input of the write crossbar, like FU output ports. We define an external memory read

operation as TILT output data that is read from the data memories of the TILT cores and written to

off-chip memory. Similarly, an external write is defined as TILT input data that is written into the TILT

data memories from off-chip memory.

We add to TILT the ability to halt the processor (temporarily prevent execution of future instructions)

if the Fetcher gets too far behind or to halt the Fetcher if it gets too far ahead. TILT-SIMD is stalled by

the Fetcher when the TILT cores need to read data from a FIFO and it is empty or when they need to

write data to a FIFO and it is full. The Fetcher stalls for similar reasons when communicating with off-

chip memory. This synchronization logic is implemented as part of the Fetcher. Otherwise the Fetcher

and TILT-SIMD execute their own respective instruction schedules, generated statically by the TILT

compiler, to provide independent operation of off-chip communication and computation. The Fetcher

is aware of the computation’s external data movement behaviour through static compiler analysis of its

memory accesses.

Each TILT core computes on 32-bit words. The Fetcher operates on 256-bit words, the same width

as our interface to the DDR controller. This means 8 TILT cores can communicate with the Fetcher in

parallel in the same cycle. The widthconv module shown in Figure 3.1 converts the TILT-SIMD word

to a multiple of 256 bits or vice versa. So for a TILT-SIMD with 12 cores, data from the first 8 will be

sent to the Fetcher on the first cycle and data from the remaining 4 cores will be sent on the next cycle.

The widthconv can become a bottleneck if too many TILT cores are connected to the Fetcher.
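
The number of cycles needed per TILT-SIMD transfer follows from the word widths:

    cycles per transfer = ⌈(TILT cores * 32 bits) / 256 bits⌉ = ⌈TILT cores / 8⌉,

so the 12-core example above requires ⌈12 / 8⌉ = 2 cycles.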

TILT executes operations on the Fetcher unit in the same way it executes operations on its FUs.

However, instead of the Fetcher performing a compute operation such as an add or a multiply, it ‘executes’

multiple concurrent external read and write memory operations. Also, like compute operations, these

Fetcher-TILT operations have deterministic latencies and operate on the same address space as the

concurrently executing compute operations.

(a) Encoding of TILT's external memory and compute instructions: for each of the nExtWPorts and nExtRPorts, a (valid, bankId, addr, ...) field group, followed by the compute instruction fields for FU1 to FUn.

(b) Encoding of the Fetcher-DDR instructions: opcode (r/w), fifo id, DDR addr, burst size.

Figure 3.2: The encoding of the TILT and Fetcher instructions.

The new statically scheduled instructions that move data between TILT-SIMD and off-chip memory

are decoupled into two separate instruction streams. The encoding of these two streams is provided in

Figure 3.2. The first stream consists of the Fetcher-TILT operations which are scheduled as part the

TILT instructions, as shown in Figure 3.2(a). These operations move data between the Fetcher’s data

FIFOs and the data memories of the TILT cores on cycles when there are available memory bank ports

that are not being used by compute operations.

The second stream comprises the Fetcher-DDR instructions that facilitate communication between

off-chip DDR memory and the Fetcher. The two sets of decoupled instructions execute in the same order

so off-chip data do not need to be tagged with their destination when inserted into the FIFOs. Reads

always remove data entries from the FIFO head and writes always insert at the FIFO tail. We describe


how the Fetcher-TILT and Fetcher-DDR instructions are statically scheduled in Section 3.3.

The architecture of the Fetcher was inspired by the Decoupled Access/Execute processor proposed

by James E. Smith [57] and Outrider which splits a thread’s instruction context into memory-accessing

and memory-consuming streams that execute in separate hardware contexts [58]. The memory-accessing

stream of Outrider fetches data non-speculatively substantially ahead of the memory-consuming stream

to tolerate long off-chip memory latencies.

3.3 External Memory Access Scheduling

The external memory and compute schedules of the Fetcher and TILT are statically generated by the

TILT compiler. The application kernel describes within a C function the behaviour of a single TILT

thread with its external inputs and outputs present in its parameter list. An example kernel is provided

in Figure 3.3(a). This is the same kernel as the one we used to illustrate the TILT compiler flow in

Section 2.3. We have updated the DFGGen tool to include the parameter list in the generated DFG file

which contains a textual description of the kernel computation (Figure 3.3(b)). The inputs to the TILT

compiler are the DFG and the target TILT-System hardware configuration (Figure 3.3(c)).

(a) Application kernel:

    void fn(float *in, float *out) {
        float A = in[0] + 5.0;
        float B = in[0] * 5.0;
        float C = in[1] * 5.0;
        float D = A - B;
        out[0] = C + D;
    }

(b) DFG: external write nodes bring in0 and in1 into TILT memory for the compute operations A (+), B (*), C (*), D (-) and E (+); an external read node moves the result out0 produced by E back to off-chip memory.

(c) Configuration: FU Mix: AddSub Lat. 2, Mult Lat. 3; Xbar: Rd Lat. 2, Wr Lat. 2; Data Mem: 4 threads, 2 banks, 2 rd ports/bank, 1 wr port/bank; Fetcher: 8 TILT cores, 1 ext rd & wr port.

Figure 3.3: An example application kernel, its DFG with the inserted TILT memory operations (external writes in green, reads in red) and the TILT-System hardware configuration used to produce the TILT instruction schedules in Figure 3.4.

The compiler begins by parsing the function parameter list and compute operations from the DFG file.

Next, Tili’s Memory Allocator (described earlier in Section 2.3.2) is used to allocate TILT’s data memory.

We have updated the allocator to mark the function parameters that are read by the computation as

external inputs. The data of these parameters must be moved from off-chip memory to the local data

memories of the TILT cores. Similarly, the function parameters that are written to by the computation

are marked as external outputs and must be written to off-chip memory after they are produced. These

external inputs and outputs are assigned permanent locations in TILT’s data memory for the lifetime

of the entire schedule even though they might be needed by only a part of the computation. This is to

provide more flexibility in scheduling the external memory accesses later.

Next, TILT’s compute schedule is generated using Tili’s Grouped Prioritized Greedy Mix Longest


Path First (GPGMLP) scheduling algorithm, presented earlier in Section 2.3.3. This algorithm assumes

the data coming from external memory exists locally within the assigned TILT data memory. This means

external input data must be moved into TILT’s data memory prior to any compute operations reading

those local memory locations. Similarly, external output data must be moved to off-chip memory after

TILT computes and writes these results to its local data memory.

Next, we use the produced compute schedule to generate memory statistics for the external memory

locations allocated within TILT's data memory. For each external input, the cycles at which the input is first and last read by the schedule are recorded. For external outputs, the first and last written cycles are

recorded. We also record the latency of the FU that performs these reads or writes.
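
A minimal sketch of the per-location statistics gathered in this step, assuming one record per external input or output (the type and field names are illustrative, not the TILT compiler's own):

    /* Memory statistics for one external input or output location, gathered
     * from the compute schedule. */
    typedef struct {
        int first_read;     /* cycle the external input is first read          */
        int last_read;      /* cycle the external input is last read           */
        int first_written;  /* cycle the external output is first written      */
        int last_written;   /* cycle the external output is last written       */
        int fu_latency;     /* latency of the FU performing these reads/writes */
    } ExtMemStats;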

Finally, we schedule the external memory accesses with our new memory scheduling algorithms.

These algorithms are responsible for scheduling both the Fetcher-TILT and Fetcher-DDR instructions.

The memory statistics are used by these algorithms to determine the most efficient way to move data

between the off-chip and TILT data memories. Our goal is to reduce any compute stalls incurred by

TILT when waiting for the required external memory transfers to complete. Further, we seek to minimize

the growth in TILT’s schedule length due to the addition of the external memory operations.

[Figure 3.4 panels show the per-cycle issue slots of the AddSub and Mult FUs and of the external memory (ExtR/ExtW) ports for the four threads: (a) MC-GPGMLP, (b) LateI-EarlyO, (c) SlackB.]

Figure 3.4: TILT instruction schedules produced from the application kernel and TILT-System configuration provided in Figure 3.3. The schedules show when the operations are issued (thread:operation).

We present three external memory scheduling algorithms: 1) Memory and Compute GPGMLP (MC-

GPGMLP), 2) ALAP-Inputs, ASAP-Outputs (LateI-EarlyO) and 3) Slack-Based (SlackB). The first

phase of these algorithms prioritizes the external memory reads and writes and schedules the Fetcher-

TILT memory operations based on their priority. We describe the three flavours of this phase in Sections

3.3.1 to 3.3.3 respectively. The second phase of the algorithms produces the associated Fetcher-DDR

instructions. Since this phase is the same for all three variations, it is described separately in Section


3.3.4. The TILT instruction schedules produced after scheduling the external memory operations using

the three different scheduling approaches for the example kernel and TILT-System configuration of Figure

3.3 are provided in Figure 3.4.

The LateI-EarlyO and SlackB approaches utilize the TILT data memory ports that are unused by compute operations to schedule the external memory read and write operations in those cycles,

allowing the compute schedule to remain unchanged. The LateI-EarlyO and SlackB algorithms are only

applicable for cyclic schedules that are iterated on multiple times due to the way the memory accesses

are scheduled by these algorithms.

Conversely, the MC-GPGMLP algorithm schedules external memory operations as part of Tili’s

GPGMLP compute scheduling algorithm instead of scheduling them separately after the generation of

the compute schedule. For this approach, we do not need to generate memory statistics for the external

memory locations. Unlike the other two approaches, this algorithm can be used to generate both cyclic

and acyclic schedules. In the acyclic case, TILT executes its instruction schedule only once.

3.3.1 Memory and Compute GPGMLP Scheduling

Tili’s GPGMLP algorithm schedules the compute operations defined by the DFG of an application. In

order to schedule the external memory accesses as part of the GPGMLP algorithm, they must first be

inserted into the DFG, as shown in Figure 3.3(b) for the application kernel in Figure 3.3(a). This occurs

after allocating TILT’s data memory which is when we also determine the memory locations that are

TILT’s external inputs and outputs.

For each external input, we insert a new external write operation into the DFG (the nodes highlighted

in green in Figure 3.3(b)). This operation does not depend on any previous operation and is marked

as a predecessor of any compute operation that reads the input arriving from off-chip memory. This

means the MC-GPGMLP algorithm must schedule the external write before these compute operations.

As an example, in Figure 3.4(a), compute operation B of the first thread is scheduled two cycles (the write crossbar latency in this example) after the memory operation that writes the in0 data to TILT's memory.

Similarly for each external output, we insert an external read operation into the DFG (the red node

in Figure 3.3(b)). This operation is marked as the successor of the compute operation that produces the

output which means it must be scheduled after that compute operation writes its result to data memory

(Figure 3.4(a)). External read operations have no successors of their own.

The pseudo-code of the GPGMLP algorithm was provided in Algorithm 1 of Section 2.3.3. Just as

with the compute operations, external memory accesses are scheduled at the earliest possible cycle given

all their predecessors (operations they depend on) have been scheduled, there are available data memory

bank ports and data hazard conditions are met. While compute operations such as add or multiply

require 2 read and 1 write port to be available at a given cycle for the data memory bank being accessed,

external reads and writes require 1 read and write port respectively. Additionally, they require either

an external read or write port to be available to access the Fetcher’s data FIFOs.

3.3.2 ALAP-Inputs, ASAP-Outputs Memory Scheduling

In the ALAP-Inputs, ASAP-Outputs (LateI-EarlyO) memory scheduling algorithm, external inputs of

the computation are written into TILT memory from the data FIFOs of the Fetcher As-Late-As-Possible


(ALAP) before they are needed by any compute operations while external outputs are read from TILT

memory into the FIFOs As-Soon-As-Possible (ASAP) after they are produced by the TILT FUs. The

pseudo-code of this approach is provided in Algorithms 3 and 4. With read and write crossbar latencies

of 8 and 5 cycles, read operations must be scheduled at least 3 cycles after the write operation that

produces the value to be read (RAWcycles of 3). Similarly, write operations must be issued at least 6

cycles after the last read that requires the old value at that memory location (WARcycles of 6).

Algorithm 3: LateI-EarlyO - Schedule external memory write operations ALAP
  Input: extInputs[], compSched
  Output: Fetcher-TILT schedule
  1 sort extInputs[] by extInput.firstRead, from largest to smallest;
  2 foreach extInput in extInputs[] do
  3   foreach TILT thread do
        // must be able to read location at least RAWcycles after it is written
  4     cycleToWrite = extInput.firstRead - RAWcycles;
        // find cycle when it is safe to overwrite location with new value
  5     while cycleToWrite >= 0 and (no free external or bank write port) do
  6       cycleToWrite = cycleToWrite - 1;
  7     if cycleToWrite >= 0 then
  8       schedule extInput at cycleToWrite;
  9       decrement # of free external and bank write ports at cycleToWrite by 1;
 10     else
          // must write to location at least WARcycles after it is last read
 11       minCycleCanWrite = extInput.lastRead + WARcycles;
 12       cycleToWrite = last cycle of schedule;
 13       while cycleToWrite >= minCycleCanWrite and (no free ext. or bank write port) do
 14         cycleToWrite = cycleToWrite - 1;
 15       if cycleToWrite >= minCycleCanWrite then
 16         schedule extInput at cycleToWrite;
 17         decrement # of free external and bank write ports at cycleToWrite by 1;
 18       else
            // handle Read-After-Write (RAW) hazard
 19         while cannot read extInput at extInput.firstRead if written at cycle 0 do
 20           extend start of schedule by 1 cycle;
 21         schedule extInput at cycle 0;
 22         decrement # of free external and bank write ports at cycle 0 by 1;

External writes that move input data from the FIFOs to the TILT data memories are scheduled first

(Algorithm 3). The external inputs are ordered by the cycle they are first read by the computation, from latest to earliest. The external writes for these inputs are prioritized and scheduled into TILT's instruction schedule in this order. Each external write is scheduled as close as possible to the compute operation that first reads the input, at

a cycle when an external write port is available and the memory bank to be written has an unused write

port. Starting from just before the input is first read by the computation, the search advances toward

the start of the TILT schedule. If the external write cannot be scheduled before wrapping around to

when the input location is last read by the previous compute iteration, then the start of the schedule is

extended until the write can be safely scheduled at the beginning of TILT’s schedule.


Algorithm 4: LateI-EarlyO - Schedule external memory read operations ASAP
  Input: extOutputs[], compSched
  Output: Fetcher-TILT schedule
  1 sort extOutputs[] by extOutput.lastWritten, from smallest to largest;
  2 foreach extOutput in extOutputs[] do
  3   foreach TILT thread do
        // must read location at least RAWcycles after it is written
  4     cycleToRead = extOutput.lastWritten + RAWcycles;
        // find cycle when it is safe to read output location
  5     while cycleToRead < endOfSchedule and (no free external or bank read port) do
  6       cycleToRead = cycleToRead + 1;
  7     if cycleToRead < endOfSchedule then
  8       schedule extOutput at cycleToRead;
  9       decrement # of free external and bank read ports at cycleToRead by 1;
 10     else
          // must be able to write to location at least WARcycles after it is read
 11       maxCycleCanRead = extOutput.firstWritten + WARcycles;
 12       cycleToRead = 0;
 13       while cycleToRead <= maxCycleCanRead and (no free external or bank read port) do
 14         cycleToRead = cycleToRead + 1;
 15       if cycleToRead <= maxCycleCanRead then
 16         schedule extOutput at cycleToRead;
 17         decrement # of free external and bank read ports at cycleToRead by 1;
 18       else
            // handle Read-After-Write (RAW) hazard
 19         while cannot read extOutput at endOfSched if written at extOutput.lastWritten do
 20           extend end of schedule by 1 cycle;
 21         schedule extOutput at last cycle of schedule;
 22         decrement # of free external and bank read ports at last cycle of schedule by 1;

External reads which commit the computed outputs of TILT-SIMD to the data FIFOs in the Fetcher

are scheduled next (Algorithm 4). These operations are scheduled in order of the cycle the outputs are last written by the computation, from earliest to latest. Each external read is scheduled at the earliest cycle when

the memory bank to be read from has a free read port, starting the search after the output is last written

to the TILT data memory by an FU. This enables the output data to be moved to the FIFOs as soon as

possible. If the external read cannot be scheduled before wrapping around to the cycle when the output

is first written (at which point the computation will write a new result at that location), then the length

of the schedule is increased until the read can be scheduled at the end of the schedule.

By scheduling TILT’s external write operations to move input data from the Fetcher’s FIFOs into

TILT memory as late as possible, we maximize the amount of time the Fetcher has to bring the inputs

into the FIFOs from off-chip memory. We also minimize the periods where TILT stalls on an external

memory write due to the input data being unavailable in the FIFOs while future compute operations

can still be executed.

Conversely, scheduling TILT’s external reads as soon as possible allows the output data to be buffered

into the Fetcher’s data FIFOs sooner, increasing overall DDR bandwidth. We do not need to be concerned


with TILT stalling on an external memory read due to the FIFO getting full so long as we do not saturate

the bandwidth of writing to off-chip memory. Finally, scheduling the external writes before the reads

does not affect the produced schedule, as writing to and reading from TILT's data memory occur independently of each other and utilize separate hardware resources.

For the application and TILT-System configuration of Figure 3.3, the TILT schedule produced with

the LateI-EarlyO algorithm is provided in Figure 3.4(b). Unlike the MC-GPGMLP algorithm, scheduling

the memory operations with LateI-EarlyO does not increase the length of the compute schedule generated

by Tili’s GPGMLP algorithm.

3.3.3 Slack-Based Memory Scheduling

The Slack-Based (SlackB) algorithm schedules external memory operations based on the slack metric of

TILT’s external inputs and outputs. For an external input, we define its slack as the number of cycles

within which we must overwrite its TILT memory location with new data before it is needed by the next

compute iteration and after the old data at that location is no longer needed by the preceding iteration.

For external outputs, the slack is the number of cycles we have to move the computed result to the

data FIFOs in the Fetcher before the memory location is overwritten by the result of the next compute

iteration. The calculation of the slack metric for external inputs and outputs is provided in Equations

3.1 and 3.2 respectively.

Algorithm 5: Slack-Based - Scheduling memory operations based on their scheduling flexibility
  Input: extInputs[], extOutputs[], compSched
  Output: Fetcher-TILT schedule
  1 calc slack of each extInput in extInputs[] using Equation 3.1;
  2 sort extInputs[] from smallest slack to largest;
  3 calc slack of each extOutput in extOutputs[] using Equation 3.2;
  4 sort extOutputs[] from smallest slack to largest;
    // generate Fetcher-TILT external write/read memory operations
  5 schedule extInputs[] ALAP (see Algorithm 3);
  6 schedule extOutputs[] ASAP (see Algorithm 4);

    external input slack (cycles) = (cycle first read) + [(compute schedule length) - (cycle last read)]        (3.1)

    external output slack (cycles) = (cycle first written) + [(compute schedule length) - (cycle last written)]        (3.2)
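
The two equations translate directly into code; a minimal C sketch, with illustrative names, is:

    /* Slack of an external input (Equation 3.1): cycles available to overwrite
     * its TILT memory location between the last read of the previous iteration
     * and the first read of the next one. */
    int ext_input_slack(int first_read, int last_read, int comp_sched_len) {
        return first_read + (comp_sched_len - last_read);
    }

    /* Slack of an external output (Equation 3.2): cycles available to move the
     * result to the Fetcher FIFOs before the next iteration overwrites it. */
    int ext_output_slack(int first_written, int last_written, int comp_sched_len) {
        return first_written + (comp_sched_len - last_written);
    }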

From observing the schedules generated by the LateI-EarlyO algorithm, we find memory operations

with a smaller slack have less scheduling flexibility and are more likely to increase the length of TILT’s

instruction schedule if other memory operations consume the cycles where they can be safely scheduled.

For this reason, external reads and writes with a smaller slack are given a higher priority and scheduled

first. Following the same rationale behind the LateI-EarlyO algorithm, the external writes are scheduled

ALAP and external reads are scheduled ASAP. We find these implementation choices minimize the

compute stalls and growth in the length of the TILT schedule.


The pseudo-code of the SlackB approach is provided in Algorithm 5. For the kernel and TILT-System

configuration of Figure 3.3, the TILT instruction schedule produced using SlackB is provided in Figure

3.4(c). For our example, the external inputs in0 and in1 have the same priority, producing the same

schedule as LateI-EarlyO, with in0 being scheduled first.

3.3.4 Fetcher-DDR Instructions

Here we describe how the TILT compiler statically generates the Fetcher-DDR instructions which are

responsible for moving external data between the Fetcher’s data FIFOs and off-chip DDR memory. The

encoding of these instructions is provided in Figure 3.2(b). Each instruction contains a 2-bit opcode that specifies whether the instruction performs a read or a write, the data FIFO to use, the relative address of the first data item to access in DDR memory, and the burst size, which indicates the number of 256-bit words to move.
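
A minimal C sketch of these fields is shown below; only the 2-bit opcode width is stated in the text, so the remaining field widths are illustrative assumptions:

    /* Sketch of the Fetcher-DDR instruction fields of Figure 3.2(b). Only the
     * 2-bit opcode width is given; other widths are illustrative. */
    typedef struct {
        unsigned opcode     : 2;   /* read-from-DDR or write-to-DDR           */
        unsigned fifo_id    : 4;   /* which incoming/outgoing data FIFO       */
        unsigned ddr_addr   : 26;  /* relative address of first data item     */
        unsigned burst_size : 16;  /* number of 256-bit words to move         */
    } FetcherDDRInsn;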

The Fetcher-TILT instructions produced during the first phase of scheduling the external memory

operations determine the order in which the external input and output data items needed by the compute

schedule will be moved between the data memories of the TILT cores and off-chip DDR memory. This

ordering, the TILT core count and the placement of input and output data in DDR memory are used

to calculate the maximum amount of contiguous data that we can move to and from DDR with each

Fetcher-DDR instruction. The memory layout is specified with a configuration file which provides the

mapping of the DFG’s external inputs and outputs and their relative addresses in DDR memory.

    read from ddr    incoming fifo    burst size 8
    write to ddr     outgoing fifo    burst size 4
    (repeat)

Figure 3.5: Fetcher-DDR instructions produced from the kernel and TILT-System configuration provided in Figure 3.3. For this example, the burst size is calculated as follows: (threads) * (inputs or outputs) * ⌈TILT cores / 8⌉.
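
Plugging in the Figure 3.3 configuration (4 threads per core, 8 TILT cores, 2 external inputs and 1 external output per thread) gives the burst sizes used above:

    read burst = 4 * 2 * ⌈8 / 8⌉ = 8 256-bit words,    write burst = 4 * 1 * ⌈8 / 8⌉ = 4 256-bit words.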

The enforcement of the DDR memory layout is optional. If the memory layout is not specified then

we assume the input and output data will exist in contiguous but separate regions of DDR memory. This

effectively allows the TILT compiler to enforce its own DDR memory layout that best suits the order in

which the TILT cores access external input and output data, maximizing the DDR burst sizes that can

be used to improve DDR bandwidth and reducing the total number of Fetcher-DDR instructions.

For the above reasons, it is preferable to not enforce a layout prior to scheduling and instead let it

be defined by the TILT compiler. We only allow the specification of the DDR memory layout through

a configuration file because some other part of the application system may enforce where the input and

output data needed by the TILT cores will be stored in DDR memory. For the purposes of evaluating

the TILT and Fetcher architectures, we let our compiler choose the placement of data in DDR memory.

In Figure 3.5, we provide the Fetcher-DDR instructions produced for the kernel and TILT-System

configuration in Figure 3.3.

3.4 Benchmarks

We evaluate our TILT-System architecture and the performance obtained by our memory scheduling

approaches using the five large memory (i.e. off-chip memory required) data-parallel, compute intensive,


floating-point benchmarks listed below. These benchmarks span different application domains, including neuroscience, economics, image processing, mathematics and physics simulations. We have selected these

applications because they use a variety of operation types including transcendental functions (exp, log).

We defer the discussion of the implementation of these benchmarks on our enhanced TILT architec-

ture to Chapter 6 where we present a more thorough evaluation of the achieved performance and area

requirements of our best TILT-System designs.

Black-Scholes Option Pricing (BSc)

The BSc model is based on a partial differential equation used to approximate the value of European

call and put options given inputs such as the stock price, risk-free interest rate and volatility [59]. In

our implementation, each BSc thread computes the call and put option prices for a single set of inputs.

High Dynamic Range (HDR)

This benchmark takes three input images of the same scene captured by standard cameras at three

different exposures (bright, medium and dark) and produces a single output image with a greater range

of luminance than the input images [54]. The algorithm performs the same set of operations on the red,

green and blue components of a pixel. Each HDR thread computes a single pixel component.

Mandelbrot Fractal Rendering (MBrot)

We use Altera’s Mandelbrot implementation where each thread computes a single pixel of a 800x640

window frame in which the Mandelbrot image is rendered [53]. The thread computation contains a loop

that iterates up to 1000 times and can break out of the loop at an earlier point depending on the value

of a variable computed inside the loop body.

Hodgkin-Huxley (HH)

This neuron simulation benchmark describes the electrical activity across a patch of a neuron membrane

where the voltage across the membrane varies as ions flow radially through each compartment [55].

The computation inside each thread involves solving four first-order differential equations using Euler’s

method to iteratively compute simple finite differences.

FIR Filter

We use the fully pipelined 64-tap Time-Domain Finite Impulse Response filter benchmark from the

HPEC Challenge Benchmark suite [60]. The algorithm applies 64 sets of filter coefficients to an array

of 4096 inputs. Each TILT thread produces a single output, given a single input, by applying the 64

coefficients on that input. The equivalent OpenCL HLS implementation is provided by Altera [53].

3.5 Comparison of the Memory Scheduling Approaches

In Figure 3.6, we compare the performance of our three memory scheduling algorithms: MC-GPGMLP,

LateI-EarlyO and SlackB. The comparison is performed on the five data-parallel benchmarks described

earlier. We only present the results for the most computationally dense (highest compute throughput


per area) TILT configurations, provided in Table 6.6 in Section 6.2 of our evaluation of the TILT-System.

Table 6.7 lists the external port counts we use for each benchmark. However, the conclusions we draw

from these results can be generalized to any TILT-System design.

We summarize the TILT and Fetcher configuration parameters used in Table 3.1. We vary the thread

count from 1 to 64 to illustrate the improvement in throughput for the different scheduling methods as

the amount of parallel work is increased. Since the number of data memory banks can at most be equal

to the number of total TILT threads (minimum of 1 thread per memory bank) we reduce the number of

memory banks to match the thread count used when necessary.

    Benchmark    FU Mix         Threads (T)   Mem Banks   W/R Ports / Bank   Ext W/R Ports
    BSc          2-3-1-1-1-1    1 to 64       min(T,4)    2-4                1-1
    HDR          2-2-1-1-0-0    1 to 64       min(T,4)    2-4                2-1
    Mandelbrot   1-2*           1 to 64       1           3-6                1-1
    HH           3-2-2-2-0-0    1 to 64       min(T,4)    2-4                1-1
    FIR 64-tap   1-1-0-0-0-0    1 to 64       1           2-4                2-1

Table 3.1: The TILT and Fetcher configurations used to obtain the throughput results of Figure 3.6. FU Mix: AddSub/Mult/Div/Sqrt/ExpCmp/LogAbs; *AddSubLoopUnit/MultCmp.

The throughput reported in Figure 3.6 represents the average number of compute operations that

is executed by each TILT core per cycle (operations-per-cycle or opc) and is calculated by dividing the

total compute operations in TILT’s instruction schedule by the length of the schedule. This throughput

is calculated after the external memory and compute operations are both scheduled onto TILT since the

memory operations may increase the length of the schedule and hence reduce the throughput achieved.
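
As a concrete illustration using the small example of Figure 3.4(b): the 4 threads issue 5 compute operations each over a schedule of roughly 23 cycles, giving a throughput of about

    (4 threads * 5 operations) / 23 cycles ≈ 0.87 opc.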

Increasing the number of threads increases the number of independent memory and compute operations

that can be scheduled in parallel, improving the utilization of the ports and FUs and the overall compute

throughput. We provide the number of memory and compute operations scheduled onto TILT per thread

in Table 3.2 for our five benchmarks.

    Benchmark    Ext Mem Ops (Inputs)   Ext Mem Ops (Outputs)   Comp Ops
    BSc          5                      2                       77
    HDR          6                      1                       14
    Mandelbrot   5                      1                       21
    HH           5                      4                       115
    FIR 64-tap   1*                     1                       128

Table 3.2: External memory and compute operations per thread to be scheduled onto TILT. *Filter coefficients are loaded prior to computation.

The LateI-EarlyO and SlackB algorithms can only be used to produce cyclic instruction schedules that

are re-executed by TILT after the processor reaches the end of the schedule. We fold cyclic schedules

to improve compute throughput; this new technique is described in Section 4.4. The TILT schedule

produced using the MC-GPGMLP algorithm is not folded because the algorithm is applicable for both

cyclic and acyclic schedules.

As a reference point, we compare the compute throughput obtained by using the MC-GPGMLP,

LateI-EarlyO and SlackB algorithms with the throughput of Tili’s GPGMLP algorithm where only the


[Figure 3.6 panels plot throughput (opc) vs. thread count (1 to 64) for each benchmark: (a) Black-Scholes, (b) HDR, (c) Mandelbrot, (d) Hodgkin-Huxley, (e) FIR 64-tap. Each panel compares GPGMLP and GPGMLP (folded), which schedule compute operations only, against MC-GPGMLP (acyclic) and LateI-EarlyO and SlackB (cyclic), which schedule both memory and compute operations.]

Figure 3.6: Performance comparison of the Fetcher's external memory scheduling approaches and Tili's GPGMLP compute scheduler. The throughput is reported in operations-per-cycle (or opc) and represents the average number of compute operations that is executed by each TILT core per cycle.


compute operations are scheduled. Tili’s scheduler did not take into account communication with off-

chip memory. Thus, the GPGMLP results represent an upper bound on the throughput that we seek to

attain after scheduling the new memory operations. We provide the throughput of both the folded (cyclic

schedules only) and unfolded (intended for acyclic schedules) GPGMLP algorithm. Although folding

cyclic schedules is optional, it is recommended, as evident from an average of 10% higher throughput

achieved by folding vs. not folding the TILT schedules produced by Tili’s algorithm. We compare the

throughput results of the folded GPGMLP algorithm with LateI-EarlyO and SlackB and compare the

throughput of the regular GPGMLP with that of our unfolded MC-GPGMLP algorithm.

The throughput obtained with our MC-GPGMLP algorithm will always be worse than the reference

GPGMLP (by an average of 11% for the results presented in Figure 3.6) because the compute operations

that require external input data must be scheduled after the new external write operations. As illustrated

in Figure 3.4(a) for our example kernel, this pushes most of the compute operations down in the schedule

due to their dependence on external inputs. Similarly, external read operations must be scheduled to

move output data to the Fetcher’s data FIFOs after they are produced by the compute operations. These

new memory operations also increase the contention for the limited number of memory bank ports. In

some cases, compute operations that are ready to be scheduled are pushed down due to a memory

operation with a higher priority utilizing the needed bank port.

For MC-GPGMLP, Mandelbrot has the biggest drop in throughput of up to 33% with 64 threads

relative to GPGMLP. This benchmark contains a loop which is scheduled using our new LoopUnit FU

(refer to Section 4.1). The loop body and the operations outside it are separated into sections in the

TILT schedule that cannot cross the boundaries of the loop. Of the 21 compute operations per thread,

4 must be scheduled before the loop. An additional 6 memory operations, 5 external writes and a single

read, must be scheduled before and after the loop respectively. The writes bring into TILT memory the

input data that are needed by the 4 compute operations and the read sends the single output to the

Fetcher after it is produced inside the loop. Taken together, this produces a much longer TILT schedule

and a considerably larger drop in throughput compared to the other benchmarks.

Relative to our MC-GPGMLP algorithm, we observe an average of 19% and 23% improvement in

compute throughput with the LateI-EarlyO and SlackB algorithms respectively. These two algorithms

attempt to schedule memory operations in the region of TILT’s compute schedule where the external

input or output is not needed by any compute operations. For a long enough schedule, this region is

usually sufficient to schedule most memory operations without increasing the length of the compute

schedule that is produced by Tili’s GPGMLP algorithm.

The SlackB algorithm performs as well or better than our LateI-EarlyO approach by an average of

1.6%. The order in which the memory operations are scheduled has a bigger impact on the achieved

throughput with larger number of threads. This is because the higher utilization of the data memory

ports results in fewer available spots in the TILT schedule to insert the new memory operations. We

observe no change in throughput with a single thread relative to the LateI-EarlyO approach and an

average of 3.7% higher throughput with 64 threads.

Compared to the folded GPGMLP throughput results, scheduling the new memory operations with

the SlackB algorithm leads to only a 0.57% drop in throughput on average. Tallying the operation

counts provided in Table 3.2, the external memory reads and writes account for an additional 18% more

operations on average. Therefore we conclude scheduling based on our defined memory slack metric

which calculates the scheduling flexibility available to each external data item is the best scheduling

Page 39: by Rafat Rashid - University of Toronto T-Space · Rafat Rashid Master of Applied Science Graduate Department of Electrical and Computer Engineering University of Toronto 2015 High-Level-Synthesis

Chapter 3. External Memory Fetcher 33

option. For acyclic schedules, we use the MC-GPGMLP algorithm. For all other situations, we use the

SlackB algorithm.

We provide the utilization of TILT’s data memory ports by compute operations in Figure 3.7 for our

benchmarks to justify the 0.57% drop in throughput. As we expect, memory port utilization increases

with the number of threads but with diminishing returns. We do not increase the thread count indefinitely

because the larger data and instruction memories will produce less area-efficient TILT cores. The FIR

filter achieves 100% utilization of its FUs and data memory except for at the start and end of its schedule,

obtaining up to 98% memory port utilization with 64 threads. The remaining benchmarks achieve 62%

utilization on average at 64 threads and 70% when the FIR filter’s utilization is also included.

1 4 16 640%

10%

20%

30%

40%

50%60%70%80%90%

100%

1 4 16 64 1 4 16 64 1 4 16 64 1 4 16 64Threads

Dat

a M

emor

y P

ort U

tiliz

atio

n

BSc HDR MBrot HH FIR

Figure 3.7: The utilization of TILT’s data memory ports by the compute schedules generated with thefolded GPGMLP algorithm for the TILT-System configurations of Table 3.1.

The LateI-EarlyO and SlackB approaches schedules memory operations when the off-chip data in

TILT’s memory and the memory ports are not in-use by compute operations, instead of scheduling them

based on just data dependencies like the MC-GPGMLP algorithm. This improves the utilization of the

memory ports without increasing the length of TILT’s schedule. For the HDR benchmark, the memory

port usage is low and all memory operations can be scheduled with SlackB without lengthening the

compute schedule produced by the folded GPGMLP algorithm. In contrast, for the HH benchmark, the

throughput drops by up to 2% at 64 threads, which is also when the utilization of the memory ports by

compute operations is the highest, at 80%. For HH, we observe a throughput drop of 0.64% on average

between 1 and 64 threads, the second largest drop observed across all five of our benchmarks.

For SlackB, the largest throughput drop is obtained by Mandelbrot, with an average drop of 1.7%

(and up to 4% at 64 threads) relative to the throughput of the folded GPGMLP. This is still very small,

especially when compared with the MC-GPGMLP results. The memory operations have less scheduling

flexibility compared to the other benchmarks because they cannot be scheduled within the Mandelbrot

loop. The external reads and writes are both scheduled at the start of the TILT schedule, overlapped

with the compute operations scheduled outside the loop. With our SlackB approach, we are able to

schedule memory operations before or after the dependent compute operation(s), thus avoiding the need

to increase the schedule length in most cases.

3.6 Summary

Application designers are usually required to manually construct custom memory architectures to connect

their compute logic to external memory. This is not a trivial exercise [44, 46] and is not suitable for

Page 40: by Rafat Rashid - University of Toronto T-Space · Rafat Rashid Master of Applied Science Graduate Department of Electrical and Computer Engineering University of Toronto 2015 High-Level-Synthesis

Chapter 3. External Memory Fetcher 34

configurable compute platforms that target a variety of applications such as the TILT overlay. With our

Memory Fetcher, we exploit the on-chip bandwidth and parallelism afforded by the FPGA with multiple

parallel data buffers feeding an array of TILT cores while also diligently managing its limited off-chip

memory bandwidth by fetching only the required data. Our proposed architecture supports automatic

instantiation, has low resource cost, is scalable and can be customized to suit the memory bandwidth

requirements of different applications. The Fetcher is statically scheduled to minimize hardware area

and behaves as a loosely-coupled co-processor to the TILT compute units. Of our three proposed

memory scheduling approaches, SlackB produces the most efficient (shortest) memory schedules. Using

SlackB, we obtain a small drop in compute throughput (an average of 0.57% drop in opc across our

five benchmarks) relative to the same designs where only the compute operations are scheduled and the

latencies incurred due to off-chip data transfers are omitted.

Page 41: by Rafat Rashid - University of Toronto T-Space · Rafat Rashid Master of Applied Science Graduate Department of Electrical and Computer Engineering University of Toronto 2015 High-Level-Synthesis

Chapter 4

TILT Efficiency Enhancements

The existing TILT architectural design parameters discussed in Section 2.2 provide a high degree of flex-

ibility to closely match the compute and memory requirements of different applications. In this chapter,

we introduce several enhancements to the TILT architecture of [13, 14], including instruction looping

support and indirect addressing modes to enable more area-efficient TILT-System implementations for

a greater range of application domains. Note that as the TILT overlay is an application customized

“family” of execution units, these enhancements are optional and are generated only for the applications

they benefit.

We also evaluate the performance, resource utilization and overall efficacy of our enhancements to

that of the standard TILT architecture. We measure computational throughput in millions of threads

executed by the TILT cores per second (M tps). We report resource utilization as a single value in

equivalent ALMs (eALMs) which accounts for the total layout area of the FPGA resources consumed

by our designs. An M20K BRAM on a Stratix V chip costs 40 eALMs as its layout area is 40x that of

an ALM [61]. Similarly, a DSP block has 30x the layout area of an ALM so it costs 30 eALMs [61]. We

rank different designs based on the largest ratio of compute throughput per area (M tps / 10k eALMs)

which we define as compute density. The FPGA resource utilization is obtained from Quartus 13.1 fitter

compilation reports while cycles of execution are measured using ModelSim 10.1d. Throughput per

second is calculated by applying the Fmax achieved by TILT-SIMD which is obtained from the Quartus

TimeQuest reports.

4.1 Instruction Looping Support

The TILT architecture and compiler flow developed by Ovtcharov and Tili in [14] and [13] respectively

does not support application kernels with loops. Loops in the application code needed to be manually

unrolled prior to running it through the TILT compiler flow, which schedules a separate set of operations

for each loop iteration. This effort is tedious and the generated TILT instruction schedules are very long,

growing linearly with the number of loop iterations. This can result in a significant portion of TILT’s

area to be consumed by the instruction memory which must store the entire TILT schedule. Moreover,

since independent operations across loop iterations can be scheduled together and memory allocation

occurs prior to scheduling, fewer memory locations can be reassigned to store different data, necessitating

a larger data memory as well.

35

Page 42: by Rafat Rashid - University of Toronto T-Space · Rafat Rashid Master of Applied Science Graduate Department of Electrical and Computer Engineering University of Toronto 2015 High-Level-Synthesis

Chapter 4. TILT Efficiency Enhancements 36

FU FU

loopbody

LoopUnit

loop-start

loop-end

repeat

operation

Figure 4.1: An example TILT schedule containing a loop. The loop-start operation sets the number ofiterations initially. Upon reaching the loop-end operation, that value is decremented by 1. If the valuereaches 0, TILT exits the loop, otherwise, TILT re-executes the loop body.

As an optional alternative solution, we have implemented a small, custom LoopUnit FU which enables

TILT to iterate multiple times on a section of its instruction schedule. This is illustrated in Figure 4.1,

allowing the body of a loop to be scheduled only once by the TILT compiler. We describe the architecture

of the LoopUnit FU in Section 4.1.1. The changes to the DFGGen tool and the TILT scheduler required

to support this new FU are presented in Sections 4.1.2 and 4.1.3 respectively. We now also provide the

option to automatically unroll loops with the help of the LLVM compiler so that we do not need to

manually unroll them.

4.1.1 LoopUnit FU Architecture

The design of the LoopUnit and the encoding of the operations it executes are illustrated in Figure 4.2.

The bit width of the iterations and pcJump fields can be different from the values in the figure and

are dependent on the number of loop iterations and the length of the TILT schedule respectively. The

register inside the LoopUnit FU tracks the number of loop iterations remaining. The combinational logic

that determines the input to the register is presented as a 1-hot multiplexer for illustrative purposes.

The loopStart signal is asserted when the TILT processor reaches the beginning of a loop in the

schedule. This is indicated with a loop-start operation (as shown in Figure 4.1) which has a type field of

0. The assertion of the loopStart signal results in the value of the iterations field to be loaded into the

register. This field contains the number of times that TILT must iterate on the section of its schedule

that represents the loop body.

Similarly, the end of the loop body is marked with a loop-end operation which has a type field of

1. This causes the loopEnd signal to be asserted which decrements the value stored in the LoopUnit’s

register by 1. As long as the register’s value is non-zero, the pcLoad output signal is asserted when the

end of the loop is reached. This causes TILT’s PC (Program Counter) to be overwritten with the value

stored in the loop-end operation’s pcJump field which contains the cycle when the first operation of the

loop is scheduled.

When the register’s value reaches 0, pcLoad is not asserted and the PC value is incremented by 1

normally to execute operations following the loop body. This way, we can either jump to the start of the

loop to execute the loop body once more or continue to the next TILT instruction after the loop body.

Page 43: by Rafat Rashid - University of Toronto T-Space · Rafat Rashid Master of Applied Science Graduate Department of Electrical and Computer Engineering University of Toronto 2015 High-Level-Synthesis

Chapter 4. TILT Efficiency Enhancements 37

opcode

FU operation

...TILT VLIW Instruction

valid type pcJumpiterations

1 bit 5 bits 1 bit 10 bits 9 bits

loo

pV

alid

1Reg

1-hotmux

0

loopEnd

loopStart

else

rese

t

10 bits

not 0 pcLo

adpc

LoopUnit

LoopOp

Figure 4.2: The architecture of the LoopUnit FU.

Since each loop in the schedule requires only two operations to mark its start and end, the LoopUnit is

shared with another (usually the least utilized) FU to reduce the width of the TILT instruction memory.

4.1.2 DFG Generation

Recall that the DFGGen tool is used to generate the DFG of an application prior to scheduling it onto

a target TILT-System architecture. We have updated the tool to support application kernels containing

loops. To illustrate the generation of the DFG from these types of applications, we provide an example

C code snippet containing a for-loop in Figure 4.3. The LLVM instructions generated from compiling

this code and the DFG produced by our DFGGen tool from these instructions are provided in Figures

4.4 and 4.5 respectively. Placing a set of operations inside a loop in the C kernel wraps these operations

with the new loop-start and loop-end operations in the DFG (refer to Figure 4.5).

floatuxu=u0.0f,uxSqru=u0.0f;foru(intuiteru=u0;uiteru<u100;uiter++)u{uuuuuifu(xSqru<u4.0f)u{uuuuuuuuuuxSqru=ux*x;uuuuuuuuuuxu=uxu+u0.1;uuuuu}uuuuuelseu{uuuuuuuuuu<someucomputation>uuuuuuuuuubreak;uuuuu}}

Figure 4.3: Application kernel code snippet of a for-loop.

As shown in the LLVM instructions in Figure 4.4 of the code snippet, for-loops are defined by four

code blocks that have the prefixes of for.cond, for.body, for.inc and for.end. For the loop-start operation

in the DFG, we need to determine the number of times the loop iterates which can be calculated from

the information provided within the for.cond and for.inc blocks. Since we generate a static schedule,

we only support loops with bounds that can be calculated during the generation of the DFG.

Page 44: by Rafat Rashid - University of Toronto T-Space · Rafat Rashid Master of Applied Science Graduate Department of Electrical and Computer Engineering University of Toronto 2015 High-Level-Synthesis

Chapter 4. TILT Efficiency Enhancements 38

entry:88888store8i3280,8i32*8hiter,8align8488888br8label8hfor.cond

for.cond:88888888888888888888888888888888888888888;8preds8=8hfor.inc,8hentry88888h78=8load8i32*8hiter,8align8488888hcmp8=8icmp8slt8i328h7,810088888br8i18hcmp,8label8hfor.body,8label8hfor.end

for.body:88888888888888888888888888888888888888888;8preds8=8hfor.cond88888h88=8load8float*8hxSqr,8align8488888hcmp38=8fcmp8olt8float8h8,84.000000e+0088888br8i18hcmp3,8label8hif.then,8label8hif.else

if.then:888888888888888888888888888888888888888888;8preds8=8hfor.body88888h108=8load8float*8hx,8align8488888h118=8load8float*8hx,8align8488888hmul48=8fmul8float8h10,8h1188888store8float8hmul4,8float*8hxSqr,8align8488888hadd128=8fadd8float8h10,80.100000e+0088888store8float8hadd12,8float*8hx,8align8488888br8label8hif.end

if.else:888888888888888888888888888888888888888888;8preds8=8hfor.body88888<some8computation>88888br8label8hfor.end

if.end:8888888888888888888888888888888888888888888;8preds8=8hif.then88888br8label8hfor.inc

for.inc:888888888888888888888888888888888888888888;8preds8=8hif.end88888h228=8load8i32*8hiter,8align8488888hinc8=8add8nsw8i328h22,8188888store8i328hinc,8i32*8hiter,8align8488888br8label8hfor.cond

for.end:888888888888888888888888888888888888888888;8preds8=8hif.else,8hfor.cond

Figure 4.4: Intermediate representation (IR) [50] of the code in Figure 4.3 that is generated by LLVM’sfrontend compiler.

We support for.cond blocks that consist of an integer compare that evaluates the termination condi-

tion of the loop and a conditional branch that jumps to the start of the loop body or its end depending

on the output of the comparison. From the compare operation in the loop condition, we can determine

the type of comparison, the variable that acts as the loop index, its initial value before entering for.cond

and the value that causes the loop to terminate.

The for.inc block describes how the loop index is updated. Currently, we support loops that update

their index with a single integer compute operation that is either an add, subtract, multiply or divide.

As the loop index calculations are integer operations and the conditional branches introduced with loops

depend on integer comparisons, these operations can be easily distinguished from the floating-point

operations that are inserted into the DFG to be executed by the TILT compute units.

The for.body block and any code blocks called from within for.body defines the body of the loop.

The loop-start and loop-end operations are inserted into the DFG before and after the insertion of the

operations within these blocks respectively.

Page 45: by Rafat Rashid - University of Toronto T-Space · Rafat Rashid Master of Applied Science Graduate Department of Electrical and Computer Engineering University of Toronto 2015 High-Level-Synthesis

Chapter 4. TILT Efficiency Enhancements 39

4.1.3 Scheduling onto TILT

The DFGs of applications that contain LoopUnit operations are scheduled in sections instead of all at

once. These sections are separated by the loop boundaries which are indicated by the loop-start and

loop-end operations that appear before and after the operations of the loop body (see Figure 4.5). For an

application with a single loop, we first schedule the compute operations and external memory accesses

that precede the loop. The operations inside the body of the loop are scheduled next and the operations

after the loop are scheduled last. Nested loops are scheduled in a similarly recursive fashion by scheduling

the operations preceding the inner loop first, then the body of the inner loop and then finally scheduling

the remaining operations.

loop-startb100

cmp:bxSqrb<b4b? mul:bxSqr-mulb=bxb*bx add:bx-addb=bxb+b0.1 <somebcomputation>

cmpmux_0:bxSqrb=bxSqr-mulborbxSqr cmpmux_1:bxb=bx-addborbx

loop-end

<computationbafterbloop>

cmpmux_2:bpcb=bcont.borbbreak

1

2 3 4 5

6 7 8

9

10

Figure 4.5: The DFG generated by the DFGGen tool from the input LLVM IR of Figure 4.4. The arrowsdenote the data flow dependencies of the operations.

Algorithms with nested loops currently require multiple LoopUnits, one for each level of nesting.

Since two LoopUnit operations from two nested loops are not executed on the same cycle due to the

dependencies that exist between them and since each loop produces only two LoopUnit operations, as

future work, we would like to use only a single LoopUnit FU to execute all loop operations. This will

require the LoopUnit (shown in Figure 4.2) to have as many registers as there are levels of nesting.

The scheduled instructions of each section must not cross loop boundaries. After the operations of

a section of the DFG are scheduled, the first operation of the next section must be scheduled on a later

cycle. Moreover, we must wait extra cycles for the computed results of the loop body to be committed to

TILT memory before scheduling the loop-end operation. This produces a less dense instruction schedule

compared to scheduling a fully unrolled loop where the TILT pipeline does not need to be drained

between loop iterations and operations between loop boundaries and loop iterations can be overlapped

and scheduled together.

The benefit of our approach is that the body of the loop needs to be scheduled only once, producing

fewer instructions and shorter schedules depending on the size of the loop body and the number of

times the loop is iterated. We obtain the biggest win with the LoopUnit when the loop is iterated many

times and the loop body is large enough to fill the TILT pipeline without requiring loop iterations to

be overlapped. Due to the lower throughput incurred from a less dense schedule (refer to Section 4.1.6),

Page 46: by Rafat Rashid - University of Toronto T-Space · Rafat Rashid Master of Applied Science Graduate Department of Electrical and Computer Engineering University of Toronto 2015 High-Level-Synthesis

Chapter 4. TILT Efficiency Enhancements 40

we can manually unroll certain loops and use the LoopUnit with others depending on the configuration

that will work best. As an example, for a computation with a large outer loop that iterates many times

containing a small inner loop that iterates only a few times, it will be best to use the LoopUnit with the

outer loop and unroll the inner loop.

4.1.4 Conditional Jump Instruction

The TILT architecture of [13, 14] supports if-else instruction branching with predication [48] where

operations from both sides of the branch are scheduled and executed and the results of only the taken

path are committed to the destination variables in memory. This is illustrated in Figure 4.5 for the

example code snippet provided earlier in Figure 4.3. We have added support for more complex loops

with conditional break statements by extending the hardware and compiler flow used to implement

predication at only a small hardware cost. We will first describe how normal predicated execution is

supported on the TILT overlay and then present the changes we have made to handle conditional jump

instructions within loops.

TILT determines the taken path of the if-else branch by evaluating the condition of the if statement.

This is accomplished with the cmp operation which takes as input the two 32-bit floating-point operands

to compare (representing the left and right side of the if condition) and a flag that indicates the type of

comparison to perform. For the example in Figure 4.5, there is a single cmp operation which evaluates

if xSqr is less than 4. After the operations inside the two branches are executed (operations 3 to 5 in

the example), the cmp-mux operations commit the results of the taken path. For each variable that is

modified inside the if-else branch, a cmp-mux operation takes as input the two possible values of that

variable stored in temporary locations from both sides of the branch and commits the value of the taken

path to the variable based on the outcome of the cmp operation.

These cmp and cmp-mux operations are executed by the custom Cmp FU of the TILT processor.

The architecture of this FU is shown in Figure 4.6 with the additional components necessary to support

conditional jump instructions highlighted in red. For cmp operations, the two input values to compare

are read from TILT’s data memory into data-a and data-b and the comparison flag is encoded in the

first three bits of the operation’s opcode. The result of the comparison is written into a small memory

array, addressed by the operation’s thread-id. This enables TILT to support different outcomes for each

thread and commit the results of the correct path taken by each thread independently of each other.

For regular cmp-mux operations (such as operations 6 and 7 in Figure 4.5), the two possible values

from either side of the branch are read from TILT’s data memory into data-a and data-b. The opcode

provided is 0111 which asserts the mux-mode and lowers the pc-mode control signals, causing the Cmp

FU to behave as a mux. The outcome of the if-else branch condition written previously by the cmp

operation is used to select the output value of wdata between data-a and data-b. The signal wdata-valid

is asserted to commit wdata to TILT’s data memory.

To allow breaking out of a loop on a certain condition, we needed to modify both the DFGGen tool

and the TILT instruction scheduler. For the code snippet in Figure 4.3, the break statement inside the

loop has the signature ‘br label %for.end ’ in the LLVM IR (shown in Figure 4.4). During the generation

of the DFG, we look for these unconditional branch instructions inside if-else statements within loops. If

such an instruction is found, we insert a new type of cmp-mux operation into the DFG, such as operation

8 in Figure 4.5. Here, instead of choosing between two data items, we select between two PC values,

placed in the cmp-mux operation’s Operand A and B (Figure 4.6). The two PCs are calculated when

Page 47: by Rafat Rashid - University of Toronto T-Space · Rafat Rashid Master of Applied Science Graduate Department of Electrical and Computer Engineering University of Toronto 2015 High-Level-Synthesis

Chapter 4. TILT Efficiency Enhancements 41

reg

cmpa=ba>ba>ba<ba<ba=b

0

1

MEM

waddrwenwdata

raddrren

rdata

reg reg reg reg

data

-a

data

-b

data

-val

id

thre

ad-id

opco

de[2:0]

32-bits

flag

1

mux-mode

[3] pc-mode

select

32-bits

CMP FU

Readffffbar

FU operation

...

Operand BRead Addr

Operand ARead Addr

ResultWrite Addrvalid Read

bank idWritebank id

threadid

opcode

TILT VLIW Instruction

DatafMem

Writeffffbar

addr,id

rdata

wda

ta

wda

ta

wda

ta-v

alid

load

-pc

pc

addr

,id

Figure 4.6: The architecture of TILT’s Cmp FU which handles both cmp and cmp-mux operations.

the DFG is scheduled and they represent the next PC address after the cmp-mux operation (to continue

loop execution) and the first instruction after the loop (to break out of the loop) respectively.

We needed to slightly modify the read crossbar to forward Operand A and B into data-a and data-b

instead of the data read from TILT’s data memory to handle this new scenario. We provide an opcode of

1111 to assert both mux-mode and pc-mode control signals inside the Cmp FU. This asserts the load-pc

signal which allows the PC of the TILT instruction memory to be overwritten with the Cmp FU’s PC

output. The same bit that toggles pc-mode is used as a select for the mux that determines whether the

data-a and data-b inputs to the FU should carry the values of Operand A and B or the data read from

TILT’s data memory.

4.1.5 LLVM Phi Instruction

Phi instructions are found in the LLVM intermediate representation (IR) code generated from C kernels

that have if-else conditions present inside loops that are automatically unrolled by the LLVM compiler.

The phi instruction is used to select between two possible values of a variable depending on the path

taken to reach the code block in which the phi instruction is found. Since we updated the DFGGen tool

to support application kernels with automatically unrolled loops, we now describe how these new phi

instructions are handled by the tool. An example code snippet with the usual LLVM instructions and

the equivalent instructions with phi is provided in Figure 4.7. In Figure 4.7(b) and 4.7(c), the value of

Page 48: by Rafat Rashid - University of Toronto T-Space · Rafat Rashid Master of Applied Science Graduate Department of Electrical and Computer Engineering University of Toronto 2015 High-Level-Synthesis

Chapter 4. TILT Efficiency Enhancements 42

either the temporary mul or div register will be written into the result variable depending on whether

the if.then or if.else paths are taken.

if (cond > 0.0) result = x * y;else result = x / y;

(a) Kernel code.

entry: %cmp = fcmp ogt float %cond, 0.0 br i1 %cmp, label %if.then, label %if.else

if.then: %mul = fmul float %x, %y store float %mul, float* %result, align 4 br label %if.end

if.else: %div = fdiv float %x, %y store float %div, float* %result, align 4 br label %if.end

if.end:

(b) Regular LLVM IR.

entry:]]]]]%cmp]=]fcmp]ogt]float]%cond,]0.0]]]]]br]i1]%cmp,]label]%if.then,]label]%if.else

if.then:]]]]]%mul]=]fmul]float]%x,]%y]]]]]br]label]%if.end

if.else:]]]]]%div]=]fdiv]float]%x,]%y]]]]]br]label]%if.end

if.end:]]]]]%4]=]phi]float][]%mul,]%if.then]],][]%div,]%if.else]]]]]]]store]float]%4,]float*]%result,]align]4

(c) Equivalent LLVM IR using phi instead.

Figure 4.7: Illustrative example of the LLVM phi instruction [50].

Normally, the DFGGen tool tracks variables that exist outside of if-else branches that are written

to by compute operations inside them. In the example in Figure 4.7(b), this is the case for the result

variable. Since TILT supports conditional branches with predication, the DFGGen tool renames the

output locations of these instructions to point to temporary locations instead. This effectively ignores

the LLVM store instructions of Figure 4.7(b) that commit the outputs of the compute operations into

the result variable. However, the variable to which the data is being stored is recorded by DFGGen for

later use. After the compute operations from both sides of the branch and the compare that evaluates

the branch condition are inserted into the DFG, we insert cmp mux FU operations that commit the

temporary values of the taken path into these variables.

In the case of the phi instruction in Figure 4.7(c), the two possible values are written to temporary

locations only. The phi instruction is translated directly into a cmp mux FU operation that depends on

the two compute operations that produces the two values and the compare operation that determines

the taken path. Unlike the example in Figure 4.7(c), the two possible values do not have to be produced

within the two paths. If this is the case, the compute operations that produce the temporary values

are marked as successors of (dependent on) the compare operation in the DFG. The resulting DFGs

produced from either set of the LLVM instructions (Figure 4.7(b) or 4.7(c)) will be the same.

4.1.6 Unrolling Loops vs. Using the LoopUnit FU

In Table 4.1, we demonstrate the efficacy of the LoopUnit FU with the Mandelbrot benchmark which

contains a loop that iterates up to 1000 times (see Section 3.4). We do so by comparing the performance

and area of two similar TILT-Systems. The first utilizes the LoopUnit and schedules the loop body only

Page 49: by Rafat Rashid - University of Toronto T-Space · Rafat Rashid Master of Applied Science Graduate Department of Electrical and Computer Engineering University of Toronto 2015 High-Level-Synthesis

Chapter 4. TILT Efficiency Enhancements 43

once and the second is of the same configuration but without the LoopUnit, requiring the loop to be

fully unrolled prior to being scheduled onto TILT.

We observe a drop in compute throughput when the LoopUnit is used, as shown in Table 4.1. This

is because in the unrolled case, operations from across loop iterations can be interleaved and scheduled

together as long as the data dependencies between them are respected. This cannot be done with the

LoopUnit approach. The benefit of using the LoopUnit FU lies in the much smaller instruction memory

that results from scheduling the body of the loop only once, producing an overall more area-efficient

TILT-System design.

With Without ChangeLoopUnit LoopUnit with LoopUnit

Insn Mem140x103 151x66,570

(width x depth)Throughput

0.76 0.91 -19.6%M tpsArea

3,320 24,233 -7.3xeALMs

Compute Density3.2 0.39 +6.1x

M tps/10k eALMs

Table 4.1: Comparison of the densest TILT-System configuration for the Mandelbrot application (Tables6.6 and 6.7) with and without the LoopUnit FU. The TILT core count used is 1.

We present the results for the most computationally dense Mandelbrot TILT-System configuration

(refer to Tables 6.6 and 6.7 in Section 6.2). However, other TILT configurations will benefit in a similar

way. The overall improvement in compute density and reduction in area will vary depending on the size

of the loop body and number of times the loop is iterated. In general, a larger loop body will result in a

smaller drop in compute throughput because there will be more operations to fill the TILT pipeline. The

LoopUnit will also be more desirable with higher iteration counts. The area cost of adding the LoopUnit

FU is very small – only 9.8 ALMs.

4.2 Indirect Addressing

TILT executes operations containing static memory bank ids and addresses that must be known at

compile time. On successive iterations of the TILT schedule, these operations are re-executed on different

data loaded from off-chip memory using the Memory Fetcher but because the operations are static, they

access the same locations in TILT’s memory. However, certain benchmarks may access different memory

locations for the same computation. This will require a separate set of instructions to be scheduled with

the standard TILT overlay developed by Ovtcharov and Tili. This increases the length of the schedule,

requiring a larger instruction memory. For other benchmarks, memory addresses are calculated during

runtime instead of being known immediately at software compile. This is not supported by the standard

TILT architecture.

In Section 4.2.1, we present the shift-register addressing mode which allows the TILT architecture to

support accesses to adjacent memory locations with a single static operation. Finally in Section 4.2.2,

we present a more general indirect addressing solution for the TILT overlay. When this optional feature

is enabled, data stored in TILT memory can also be used as memory addresses.

Page 50: by Rafat Rashid - University of Toronto T-Space · Rafat Rashid Master of Applied Science Graduate Department of Electrical and Computer Engineering University of Toronto 2015 High-Level-Synthesis

Chapter 4. TILT Efficiency Enhancements 44

4.2.1 Shift-Register Addressing Mode

The shift-register mode was motivated by the 64-tap FIR filter described earlier in Section 3.4. This filter

is most efficiently implemented as a spatial design, as shown in Figure 4.8, with input (or output) data

placed inside shift-registers. It is possible to model this behaviour on the standard TILT hardware but

it will be extremely inefficient since separate operations must be scheduled to read data from the TILT

data memory and move them to adjacent locations between the computation of each output. This can

be trivially accomplished by using an auxiliary 0-cycle “FU” that simply forwards its single input to its

output. For our 64-tap FIR design with two threads, an additional 64 reads and writes per thread would

be required, resulting in a 415 cycle TILT schedule. Compiler support does not exist to automatically

detect and insert these extra operations into the TILT schedule and must be manually added.

in ...

x+DSPb0

0inn x

+DSPb1

inn-1 x+DSPb2

inn-2 x+DSPb63

inn-63

... outn

Figure 4.8: Spatial 64-tap FIR implementation with DSPs on an FPGA.

We are able to obviate the need for these shift operations with our new shift-register mode. When

this optional mode is enabled, a small amount of additional hardware is generated which allows FUs to

access adjacent locations in TILT’s data memory without requiring the data to be shifted. As illustrated

in Figure 4.9, when TILT reaches the end of its schedule, the schedEnd control signal is asserted and the

value within the counter register is incremented by 1. This value acts as an offset to the static read and

write addresses provided in the TILT operations. This allows the same operations to access adjacent

locations in TILT’s data memory when they are re-executed by TILT.

FU operation

Operand BRead Addr

Operand ARead Addr

ResultWrite Addr

valid Readbank id

...

Writebank id

threadid

opcode

+

Counter

+ +

Write bar Read bar

TILT VLIW Instruction

offset

1+ Reg

schedEnd

Figure 4.9: Shift-Register indirect addressing mode.

The performance improvement obtained by turning on this mode is summarized in Table 4.2 for the

64-tap FIR filter. For the TILT-System configuration, we use the most computationally dense TILT and

Fetcher configurations of the FIR filter provided in Tables 6.6 and 6.7 in Section 6.2. The mode requires

an additional 54 ALMs but shortens the TILT schedule to 144 cycles, improving the overall compute

density of the TILT-System with a single TILT core by 2.9x (see Table 4.2). Figure 4.9 shows all three

operand addresses being shifted. However, the FIR filter is configured to shift only the result.

Page 51: by Rafat Rashid - University of Toronto T-Space · Rafat Rashid Master of Applied Science Graduate Department of Electrical and Computer Engineering University of Toronto 2015 High-Level-Synthesis

Chapter 4. TILT Efficiency Enhancements 45

With Without ChangeSR Mode SR Mode with SR Mode

Insn Mem76x144 76x415

(width x depth)Throughput

3.8 1.3 +2.9xM tpsArea

2,674 2,728 +54 ALMseALMs

Compute Density14 4.8 +2.9x

M tps/10k eALMs

Table 4.2: Comparison of the densest TILT-System configuration for the FIR application (Tables 6.6and 6.7) with and without the Shift-Register (SR) addressing mode. The TILT core count used is 1.

4.2.2 Indirect Addressing Mode

When enabled, this option creates additional hardware that allows TILT FUs to use data stored in TILT’s

data memory as addresses rather than requiring all data memory to be addressed via the immediate

addresses in the TILT instruction stream. As shown in Figure 4.10, for every TILT FU that may access

data indirectly this way, we provide two Indirect Read Address (IRA) registers and one Indirect Write

Address (IWA) register. These registers temporarily store the indirect addresses that can be used to

access data in TILT’s memory. The three 2-1 muxes placed in front of the read and write side crossbars

choose between the static addresses provided by the FU operation and the output of the FU’s Indirect

Address (IA) registers. Each of the three mux selects are controlled independently by a single bit in the

operation’s opcode. This allows FU operations to use any combination of indirect and static addresses

to access data in TILT’s memory.

FU operation

...

Operand BRead Addr

Operand ARead Addr

ResultWrite Addrvalid Read

bank idWritebank id

threadid

opcode

TILT VLIW Instruction

DataMem Read bar

IWA

Write baraddrs,id

rdata

addr,id

wdata

FU Arraywdata rdata

IRA IRA

Figure 4.10: Enabling the option to use temporary values stored within the Indirect Read/Write Addressregisters to access TILT’s data memory instead of the static addresses provided by the FU operation.

We also add the Indirect Address Read (IAR) operation into TILT’s instruction set to populate the

IA registers. They are similar to TILT’s external memory read operations, requiring a single read port

to be available for the data memory bank being accessed. However, instead of reading data into the data

FIFOs of the Fetcher, they store the data into the FU’s IA registers. Which of the three registers the IAR

operation writes to is controlled by three enable bits in the operation’s opcode. These IAR operations

Page 52: by Rafat Rashid - University of Toronto T-Space · Rafat Rashid Master of Applied Science Graduate Department of Electrical and Computer Engineering University of Toronto 2015 High-Level-Synthesis

Chapter 4. TILT Efficiency Enhancements 46

share their operation field with each of the FUs that require indirect addressing. The encoding of the

IAR operations is provided in Figure 4.11.

FU operation

...

Operand BRead Addr

Operand ARead Addr

ResultWrite Addrvalid Read

bank idWritebank id

threadid

opcode

TILT VLIW Instruction

Read bar

float-to-intrdata

DataMem

rdata

en enenaddr,id

rdata

IWA IRA IRA

Figure 4.11: Populating the Indirect Read/Write Address registers with data in TILT’s memory usingthe Indirect Address Read (IAR) operation.

For a given FU, we try to schedule the IAR operation(s) just before the dependent compute opera-

tion(s) so that other operations that indirectly access different memory locations can be safely scheduled

as soon as possible. Thus, having separate operations to fetch the addresses from TILT data memory

and perform the computation is not necessary. However, this design choice provides more scheduling

flexibility while minimizing the implementation complexity. First, we can populate 1, 2 or all IA registers

depending on the operands that access TILT’s memory indirectly. Second, if several compute operations

indirectly access the same memory locations, we will only need to fetch the addresses once. Providing

the same flexibility with a single operation would necessitate a more complex operation with variable

latencies and extra decode logic. Finally, the IAR operations and their dependent compute operations

may be scheduled with other operations scheduled in between, yielding a denser schedule. This is not

possible with a single operation and such an operation would be more difficult to schedule because it

would need to perform several reads and writes to TILT’s data memory on specific cycles.

Bank Depth (words) 256 1024 4096Addr Width (bits) 8 10 12

ALMs Used3 IA registers

12.5 15.5 18.5and 2-1 muxesaltfp convert 76.0 81.0 87.5

Table 4.3: Area cost of the indirect addressing mode per IA capable FU for different memory bankdepths. The altfp convert module translates a 32-bit floating point to an integer of address width bits.

The computed indirect addresses are produced by the existing single-precision, floating-point FUs

and stored in TILT’s data memory as 32-bit floats. As shown in Figure 4.11, we use Altera’s altfp convert

module with a six cycle latency [62] to convert these floats into integers after reading them from data

memory via the read crossbar and prior to writing them into the IA registers. Indirect addresses can be

moved from off-chip memory to TILT’s data memory like any other external input data as well.

In Table 4.3, we provide the area cost to implement indirect addressing on TILT for three different

data memory bank depths. The ALM utilization was obtained from the Quartus fitter reports and the

bank depths presented is representative of the range observed for computationally dense TILT-System

designs. The relative size of the altfp convert module is much larger than the cost of the registers and

muxing required. As future work, we would like to use small, low latency integer FUs to compute

Page 53: by Rafat Rashid - University of Toronto T-Space · Rafat Rashid Master of Applied Science Graduate Department of Electrical and Computer Engineering University of Toronto 2015 High-Level-Synthesis

Chapter 4. TILT Efficiency Enhancements 47

indirect addresses instead. This has several benefits including no longer requiring float-to-int conversion

and likely requiring fewer threads to fill the shallower schedule pipeline.

4.3 Memory Allocation for DFGs with Aliased Locations

The LLVM IR input to the DFGGen tool may contain aliased memory variables with different names

to the same physical memory location. Aliasing can occur when the LLVM compiler loads the same

memory location multiple times. This can result from how the application kernel is written (i.e. when

the same memory location is loaded into two different variables) or due to the LLVM compiler itself.

The latter is especially noticeable when loops are unrolled by the LLVM compiler prior to the generation

of the DFG. The Memory Allocator algorithm developed by Tili in [14] treats these aliases as separate

physical memory locations. Even though the correctness of the algorithm is maintained, this increases

the required size of TILT’s data memory.

Algorithm 6: Removal of memory location aliases from the DFG

Input: DFG with aliasesOutput: DFG with aliases removed

1 Create empty aliasListMap;2 Create empty newAddrMap;

// Populate aliasListMap and addrMap

3 foreach node in DFG do4 newName = node.outputVar with all instances of “([.][a-z,1-9]+)+” removed;

5 if newName != node.outputVar then // found alias6 if aliasList of newName exists in aliasListMap then7 add node.outputVar to aliasList;8 else9 add node.outputVar to new aliasList;

10 add aliasList to aliasListMap with key newName;11 add node.outputAddr to newAddrMap with key newName;

12 node.outputVar = newName // rename

13 sort aliasLists in aliasListMap from longest alias to shortest;14 foreach node in DFG do15 foreach aliasKey in aliasListMap in descending order do16 aliasList = aliasListMap.get(aliasKey);17 newAddr = newAddrMap.get(aliasKey);

18 foreach alias in aliasList do19 if node.inputVar1 or node.inputVar2 or node.outputVar == alias then20 replace aliased variable name with aliasKey;21 replace aliased variable addr with newAddr;

The algorithm presented in Algorithm 6 aims to remove all aliased memory variables to improve

the generated TILT schedule’s memory footprint. The first step is to determine which variables are

aliases of another. For the benchmarks we have studied, we find the aliases of <prefix> are of the form

<prefix>.<suffix>. Next, we replace all instances of the aliases in the DFG with <prefix> and assign

them a single memory address. This algorithm is run on the DFG prior to Tili’s Memory Allocator

Page 54: by Rafat Rashid - University of Toronto T-Space · Rafat Rashid Master of Applied Science Graduate Department of Electrical and Computer Engineering University of Toronto 2015 High-Level-Synthesis

Chapter 4. TILT Efficiency Enhancements 48

algorithm which further reduces the amount of data memory required for the computation with register

reallocation techniques (refer to Section 2.3.2).

BenchmarkWith Without

Aliasing Aliasing(32-bit words per thread)

BSc 46 38HDR 24 18

Mandelbrot28 25

(with LoopUnit FU)Mandelbrot

903 36(fully unrolled)

HH 91 86FIR 64-tap 192 192

Table 4.4: TILT data memory requirements with and without the removal of aliased memory locations.

The reduction in TILT’s data memory usage after the removal of the aliased memory locations is

summarized in Table 4.4 for our benchmarks. The fully unrolled implementation of the Mandelbrot

benchmark benefits from this the most, requiring only 36 data words per TILT thread instead of 903.

The aliases of this benchmark are introduced when the LLVM compiler is used to fully unroll the

application’s loop which iterates up to 1000 times.

4.4 Schedule Folding

FUsAdd Mult

12345678910111213

Cycle

AB

E

DC

FUsAdd Mult

12345678

Cycle

AB

E

DC

Folded

re-execute

(a) TILT schedule on the left is foldedto produce the schedule on the right.

Add Mult

123456789

10111213

Cycle

AB

E

DC

A

E

C

A

E

C

readwoperandscompute

writewresult

B

DB

D

Add Mult

12345678

Cycle

AB

E

DC

AE

C

AE

C

B

DB

D

Folded

(b) Same example but the cycles when the TILT FUs readoperands, compute and write to data memory is also shown.

Figure 4.12: An example TILT schedule being folded. Memory operations are omitted to simplify theexample. Read and write latencies of 2 cycles and Add and Mult latencies of 2 and 3 cycles are used.

When TILT executes a scheduled operation, it takes a few cycles for the task to complete. Thus, the

TILT schedule has extra cycles after the last scheduled operation to drain the pipeline and allow the

results of outstanding operations to be committed to TILT’s data memory, such as operations C, D and

E in Figure 4.12. For cyclic schedules that are re-executed by the TILT cores, we apply a technique we

call ‘Schedule Folding’ where we fold the cycles at the end of the schedule into the start of the schedule.

Page 55: by Rafat Rashid - University of Toronto T-Space · Rafat Rashid Master of Applied Science Graduate Department of Electrical and Computer Engineering University of Toronto 2015 High-Level-Synthesis

Chapter 4. TILT Efficiency Enhancements 49

This is performed after the TILT schedule is generated from the application’s DFG by the compiler. The

key observation is that the memory bank and external ports still available at the start of the schedule

can be utilized to commit the operations scheduled at its end. The schedule is folded at the earliest

possible cycle after the last scheduled operation while respecting TILT’s memory bank and external read

and write limitations.

In Figure 4.12, we illustrate how we fold the TILT schedule with an example. In Figures 4.12(a) and

4.12(b), the TILT schedule on the left is folded to produce the schedule on the right. In Figure 4.12(b),

we also show when the compute operations read input operands from the TILT data memory, compute

and write their results back to memory. As an example, operation E in the normal TILT schedule begins

reading its operands on cycle 8, starts to compute the sum on cycle 10 and commits the sum to data

memory by cycle 13. Once the TILT schedule is folded, operation E begins computation on cycle 2 and

completes by cycle 5 after wrapping around to the next iteration of the shorter schedule. During this

time (cycles 1 to 5), TILT also re-executes operations A, B and C.

Schedule Folding reduces the overall length of the TILT schedule, improving TILT’s compute through-

put proportionally to the number of cycles folded and may also require a smaller instruction memory to

store the schedule. It was motivated by the well known technique called Software Pipelining [63] which

creates code for a loop’s prologue, steady state and epilogue and overlaps loop iterations to improve

resource utilization. Since the TILT schedule is cyclic, it is akin to that of a loop body.

BenchmarkCycles Sched Len ThroughputSaved (after folded) Improvement

(cycles) (% of avg opc)BSc 19 933 2.1%HDR 30 304 9.9%

Mandelbrot 15 103 14.6%HH 21 611 3.5%

FIR 64-tap 19 144 13.2%

Table 4.5: Improvement in throughput obtained by folding TILT’s schedule (opc is operations-per-cycle).

Table 4.5 presents the improvement in compute throughput achieved by folding TILT’s instruction

schedule for the most computationally dense TILT-System designs. The configurations of these designs

are provided in Table 6.6 in Section 6.2 of our evaluation of the TILT-System. TILT’s memory operations

are scheduled using the Fetcher’s Slack-Based algorithm, described earlier in Section 3.3.3. The cycles

folded for the HDR benchmark is noticeably larger than that of the other benchmarks because the last

compute operation of the HDR computation is the Sqrt which has the largest latency of 28 cycles on

TILT. Schedule Folding has an expectedly smaller performance impact on a longer schedule.

Page 56: by Rafat Rashid - University of Toronto T-Space · Rafat Rashid Master of Applied Science Graduate Department of Electrical and Computer Engineering University of Toronto 2015 High-Level-Synthesis

Chapter 5

Predictor Tool

The TILT overlay and Memory Fetcher are highly configurable and can be tuned to closely match the

compute and memory requirements of different applications. However, due to the wide range of archi-

tectural parameters that can be varied, it can be difficult to determine the TILT-System configuration

with the highest compute density (throughput per area) for a target application. Calculation of the com-

pute density normally requires synthesis of the TILT-System so that we can obtain its computational

throughput and resource consumption. However, it is impractical to synthesize millions of possible TILT

and Fetcher configurations in Quartus to determine the design with the highest compute density since

each compilation can take hours. To ease development effort and improve designer productivity, we have

implemented a software Predictor tool that is capable of quickly predicting the most area-efficient design

for a given application on Altera’s Stratix V FPGAs.

The Predictor tool uses derived TILT-System performance and area models to quickly estimate the

expected compute throughput and hardware area of promising TILT-System configurations. We devise

a search algorithm that navigates a subset of the large TILT and Fetcher design space using performance

heuristics, attempting only those configurations that are likely to produce successively better designs

with higher compute densities. The Predictor also automatically configures the new TILT enhancements

presented in Chapter 4 for those applications that benefit from them. As we will show in Section 5.4.2,

the Predictor can analyze in minutes thousands of potential TILT-System configurations that would

take weeks to compile in Quartus. Further, the tool is able to predict with high accuracy the densest

TILT-System design without any manual designer effort.

The Predictor tool does not currently predict TILT-System designs with the best core count, targeting

only 8-core TILT-Systems instead. Varying the number of TILT cores has a noticeable impact on compute

throughput, area cost, Fmax and the required configuration of the Memory Fetcher which we explore

in Section 6.5.3. Further work is necessary to amend our tool so that it is able to reliably predict the

densest TILT-System configuration with the core count being a part of its recommendation.

We begin this chapter with a description of how the Predictor estimates the compute throughput

of a TILT-System configuration in Section 5.1. The derivation of the TILT and Fetcher area models

is presented in Section 5.2. We describe how the Predictor tool navigates the large TILT and Fetcher

configuration space to determine the most computationally dense TILT-System design in Section 5.3.

Finally we present runtime and prediction accuracy results in Section 5.4 to demonstrate the efficacy of

our Predictor tool.

50

Page 57: by Rafat Rashid - University of Toronto T-Space · Rafat Rashid Master of Applied Science Graduate Department of Electrical and Computer Engineering University of Toronto 2015 High-Level-Synthesis

Chapter 5. Predictor Tool 51

5.1 Estimation of TILT-System Performance

The expected compute throughput of a TILT-System configuration is calculated by scheduling the appli-

cation’s DFG onto that architecture. This also includes scheduling the external memory operations that

move data between the TILT cores and the Fetcher. This is necessary to ensure any possible growth in

TILT’s instruction schedule due to the memory operations is accounted for in the throughput estimate.

The expected compute throughput is the total number of compute operations (such as add and multiply)

in TILT’s schedule divided by the length of the schedule in cycles. This average operations-per-cycle

value does not take into account the Fmax of the TILT cores.

From synthesizing thousands of different TILT-System configurations, we observe computationally

dense TILT designs fall in the middle of the possible TILT core sizes and have Fmax values typically

between 220 and 260 MHz on Stratix V FPGAs (with a speed grade of 2). For the very large designs

that are considered by the Predictor, the Fmax can drop to be as low as 170 MHz. By not taking the

Fmax into consideration in the throughput calculation, we essentially assume a constant Fmax for all

TILT-System designs. This means we over-estimate the performance of large TILT-Systems relative

to the medium-sized designs that have a lower variance in Fmax and higher Fmax values. However,

since these larger designs will already have lower compute densities due to their higher area costs and

diminishing gains in compute throughput, they will not be selected by the Predictor as the best design

anyway. Their lower Fmax values will merely reduce the estimated compute densities further.

Since the Predictor selects a Fetcher configuration that is capable of supporting the maximum rate

at which the target TILT-System can move data between its cores and the Fetcher (see Section 5.3.5),

we can safely assume the memory bandwidth requirement of the TILT cores is met. We also assume the

TILT cores do not incur any unexpected stalls due to waiting for data to be read or written to off-chip

memory. This is acceptable as long as the rate at which the Fetcher reads and writes data to off-chip

memory is sufficiently below the maximum rate at which we are able to access DDR memory. For these

reasons, the calculated throughput value is a close enough approximation of the actual throughput.

5.2 TILT and Fetcher Area Models

In this section, we present the area models of the TILT processor and Memory Fetcher on Stratix V

FPGAs. These models are used by the Predictor to quickly estimate the layout area of TILT-System

configurations without requiring them to be synthesized. The area models were derived by compiling

a representative set of TILT-System configurations in Quartus and obtaining their resource usage from

the fitter reports. The total layout area of the ALMs, M20Ks and DSP Blocks used is reported as a

single value in equivalent ALMs (eALMs) which we defined earlier in Chapter 4. The conversion of these

FPGA resources into a single metric makes it easier for the Predictor to minimize the total resource

consumption. The TILT processor consists of the FU array, instruction and data memories and the read

and write crossbars. The Fetcher is composed of its own instruction memory, control and data FIFOs.

The calculation of the area costs of these components is presented in the sections below.

5.2.1 TILT FU Array

The FU array comprises the single-precision, floating-point FUs generated by Altera’s MegaWizard

Plug-in Manager and the custom Cmp and LoopUnit FUs that are used to support predication and


loops. Each instance of an FU has an associated Decode module that decodes the FU’s operation field

in TILT’s instruction and generates the signals that control the behaviour of the crossbar multiplexors,

reading and writing to data memory and so on.

The FUs and their Decode modules are built out of ALMs on the FPGA. As each instance of an FU

type and its Decode module produces the same hardware, the ALM count remains fairly constant, with a small amount of variability due to how the components are placed and routed onto the FPGA by

Quartus. Certain FU types also consume a constant number of M20K BRAMs and DSP Blocks. Table

5.1 summarizes the resources consumed by the different TILT FUs and their total average area costs

in eALMs. Table 5.2 provides the areas of the Decode modules for these FUs. The size of the Decode

module can be different for two FU types even if the encoding of their operation is the same. This is

due to the FUs having different pipeline depths – delayed copies of the operation fields feed into control

logic at different pipeline stages. The Predictor estimates the area of TILT’s FU array for a given FU

mix using the area numbers provided in these two tables.

FU Unit     Avg ALMs   M20Ks   DSPs   Avg Area (eALMs)
FAddSub     331.1      0       0      331.1
FMult       88.2       0       1      118.2
FDiv        263.3      1       5      453.7
FSqrt       421.7      0       0      421.7
FExp        371.9      0       9      641.9
FCmp        56.8       1       0      96.8
FLog        1130.9     0       2      1190.9
FAbs        12.6       0       0      12.6
LoopUnit    9.8        0       0      9.8

Table 5.1: Resources consumed by Altera’s single-precision, floating-point FUs and custom FUs.

Unit      Avg ALMs    Unit        Avg ALMs    Unit        Avg ALMs
AddSub    100.6       AddSubCmp   120.7       AddSubAbs   101.5
Mult      101.6       MultCmp     121.1       MultAbs     107.3
Div       124.1       DivCmp      144.6       DivAbs      127.7
Sqrt      142.3       SqrtCmp     182.8       SqrtAbs     148.8
Exp       109.5       ExpCmp      151.8       ExpAbs      111.2
Log       120.7       LogCmp      159.9       LogAbs      125.2
Cmp       103.7
Abs       75.6

Table 5.2: Resources consumed by the FU Decode modules.
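To make the lookup concrete, the sketch below estimates an FU array's area from a handful of entries in Tables 5.1 and 5.2 (standalone Decode modules only; combined FUs and the extra 2:1 mux discussed below are omitted). The dictionaries and function names are illustrative, not the Predictor's own code.

# Sketch of an FU-array area estimate using a few entries from Tables 5.1 and 5.2.
# Illustrative only: combined FUs and their merged Decode modules are not handled.

FU_AREA_EALMS = {"FAddSub": 331.1, "FMult": 118.2, "FDiv": 453.7, "FCmp": 96.8}       # Table 5.1
DECODE_AREA_EALMS = {"FAddSub": 100.6, "FMult": 101.6, "FDiv": 124.1, "FCmp": 103.7}  # Table 5.2 (ALMs = eALMs; names drop the leading 'F' in the table)

def fu_array_area(fu_mix):
    """fu_mix maps an FU type to its instance count, e.g. {"FMult": 2, "FAddSub": 1}."""
    return sum(count * (FU_AREA_EALMS[fu] + DECODE_AREA_EALMS[fu])
               for fu, count in fu_mix.items())

print(fu_array_area({"FAddSub": 1, "FMult": 2}))  # 431.7 + 2 * 219.8 = approx. 871.3 eALMs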

The Predictor may choose to share the operation field and read and write ports of an FU with

another if doing so produces a more computationally dense design with a smaller instruction memory

and crossbars. This TILT customization option was described earlier in Section 2.2.2. Any TILT FU can

share its ports and operation field with any other FU. In practice, however, we only share FUs that are relatively small in size, such as the Cmp and Abs FUs, or that have low utilization, such as the LoopUnit FU, with another standard, larger FU. We present the algorithm used by the

Predictor to determine which FUs to combine in Section 5.3.3.

Since two “combined” FUs share the same operation space in TILT’s instruction, they also share a

single Decode module. This module is slightly larger than the individual Decode modules of the FUs


but smaller than their sum, resulting in an overall reduction in FU area. An additional 2:1 mux that

consumes 34.8 ALMs on average selects between the outputs of the two FUs to forward to the write

crossbar. The area costs of the merged Decode modules that are currently supported by the Predictor

are also provided in Table 5.2. The area savings obtained from combining FUs is provided in Table 5.8.

5.2.2 TILT and Fetcher Instruction Memories

The TILT and Fetcher instruction schedules are stored in two logical read-only memories built out of

physical M20K BRAMs. The width and depth of the schedules determine the dimensions of the logical

memories and the number of BRAMs that will be needed. From experiments, we observe that for a

logical memory with a depth of less than 16,384 words, Quartus first chooses the M20K configuration

with the smallest depth that is at least as deep as the logical memory. Then the width of the selected

M20K configuration is used to calculate the number of BRAMs that will be required to fit the width of

the logical memory (Equation 5.1).

BRAMs = ceiling(width of ROM / width of selected M20K config) (5.1)

TILT Insn Mem Area (eALMs) = BRAMs ∗ 40 (5.2)

The M20K BRAMs on Stratix V chips support the memory configurations provided in Table 5.3.

As an example, a logical memory with a width of 295 bits and a depth of 1936 words uses the 10x2048

M20K configuration and requires 30 M20Ks in total. For the benchmarks evaluated, we do not exceed

a depth of 16,384 words for any of the TILT-System memories but in such a scenario, the Predictor

assumes the 1x16,384 configuration is used.

Width (bits)   Depth (words)
40             512
20             1,024
10             2,048
5              4,096
2              8,192
1              16,384

Table 5.3: Supported M20K BRAM configuration sizes on Stratix V FPGAs [64].

The width of the TILT and Fetcher variable-length instructions depends on the TILT-System's archi-

tectural parameters (refer to Figures 2.4 and 3.2). The area cost of the TILT instruction memory is the

cost of the BRAMs needed to build the memory in eALMs (Equation 5.2). The Fetcher’s instruction

memory area is calculated similarly. The cost of an M20K BRAM on a Stratix V chip is 40 eALMs [61].
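The sketch below applies Equations 5.1 and 5.2 with the M20K configurations of Table 5.3 and reproduces the 295-bit by 1936-word example from above (a hypothetical helper, with the deeper-than-16,384-word case falling back to the 1x16,384 configuration as described).

import math

# Sketch of Equations 5.1 and 5.2 using the M20K configurations of Table 5.3.
M20K_CONFIGS = [(40, 512), (20, 1024), (10, 2048), (5, 4096), (2, 8192), (1, 16384)]

def insn_mem_area(width_bits, depth_words):
    """Return (M20K count, area in eALMs) for a logical ROM of the given dimensions."""
    # Shallowest M20K configuration at least as deep as the logical memory;
    # anything deeper than 16,384 words falls back to the 1x16,384 configuration.
    m20k_width = next((w for (w, d) in M20K_CONFIGS if d >= depth_words), 1)
    brams = math.ceil(width_bits / m20k_width)   # Equation 5.1
    return brams, brams * 40                     # Equation 5.2: 40 eALMs per M20K

print(insn_mem_area(295, 1936))   # (30, 1200): the 10x2048 example from the text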

5.2.3 TILT Data Memory

TILT’s data memory is organized into memory banks, with each composed of two or more BRAMs that

hold the same data to support multiple concurrent reads and writes per bank (refer to Figure 2.2). The

width of the data words is always 32 bits. The depth of each memory bank is the product of the number

of threads assigned to each bank and the number of data words per thread. Thread and memory bank

counts are provided as TILT’s configuration parameters and the number of data words per thread is

determined by performing memory allocation on the application’s DFG.


To determine the number of BRAMs needed by TILT’s data memory, we begin by choosing the M20K

configuration that can fit the data words in each memory bank. The width of this configuration is then

used to determine the number of BRAMs required (Equation 5.3). Since each memory bank supports

multiple reads and writes and there are multiple banks, the BRAM count of Equation 5.3 is multiplied

by these parameters to obtain the total number of BRAMs required by the data memory (Equations 5.4

and 5.5). The area cost in eALMs is then calculated using Equation 5.6.

BRAMs = ceiling(32 / width of selected M20K config) (5.3)

BRAMs per Bank = BRAMs ∗ (Read ports per Bank) ∗ (Write ports per Bank) (5.4)

Data Mem BRAMs = (BRAMs per Bank) ∗ (# of Banks) (5.5)

Data Mem Area (eALMs) = (Data Mem BRAMs) ∗ 40 (5.6)
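A corresponding sketch of Equations 5.3 to 5.6 is given below; it is an illustrative helper that reuses the Table 5.3 configurations rather than the tool's actual implementation.

import math

# Sketch of Equations 5.3-5.6 for TILT's data memory (32-bit data words).
M20K_CONFIGS = [(40, 512), (20, 1024), (10, 2048), (5, 4096), (2, 8192), (1, 16384)]

def data_mem_area(words_per_bank, banks, read_ports_per_bank, write_ports_per_bank):
    m20k_width = next((w for (w, d) in M20K_CONFIGS if d >= words_per_bank), 1)
    brams = math.ceil(32 / m20k_width)                                   # Equation 5.3
    brams_per_bank = brams * read_ports_per_bank * write_ports_per_bank  # Equation 5.4
    total_brams = brams_per_bank * banks                                 # Equation 5.5
    return total_brams * 40                                              # Equation 5.6 (eALMs)

# Example: 4 banks of 1024 words each, 2 read and 1 write ports per bank -> 640 eALMs.
print(data_mem_area(1024, banks=4, read_ports_per_bank=2, write_ports_per_bank=1))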

5.2.4 TILT Read and Write Crossbars

The read crossbar routes data read from TILT’s data memory to the input ports of the FU array and

to the external read ports of the Memory Fetcher. The number of inputs of this crossbar and how they

are connected to the crossbar’s outputs are a function of TILT’s data memory organization: the number

of memory banks and the number of read and write ports per bank. The number of outputs is equal to

the sum of the FU input ports and external read ports. Similarly, the write crossbar routes the outputs

of the FUs and data arriving from the Fetcher to TILT’s data memory. The number of inputs of this

crossbar is equal to the sum of the number of FUs and external write ports. The number of outputs is

the product of the number of total memory banks and write ports per bank.

# of inputs  = fn(# of banks, # of bank read ports, # of bank write ports)
# of outputs = (# of FU input ports) + (# of external read ports)

# of banks            = 1, 2, 4, 8, 16, 32
# of bank read ports  = 2, 4, 6, 8
# of bank write ports = 1, 2, 3
# of outputs          = 2, 4, 6, ..., 30

# of area table entries = (6*4*3)*15 = 1080

Table 5.4: TILT predictor read crossbar configuration mixes.

# of inputs  = (# of FU outputs) + (# of external write ports)
# of outputs = (# of banks) * (# of bank write ports)

# of inputs           = 2, 3, 4, ..., 15
# of banks            = 1, 2, 4, 8, 16, 32
# of bank write ports = 1, 2, 3

# of area table entries = 14*(6*3) = 252

Table 5.5: TILT predictor write crossbar configuration mixes.

We produce two tables of the actual read and write crossbar areas which are indexed by their input

and output configuration parameters. The areas are obtained from compiling a representative selection

of different crossbar configurations in Quartus that span the Predictor’s TILT design exploration space.


These configuration mixes are provided in Table 5.4 and 5.5 for the read and write crossbars respectively.

The Predictor estimates the crossbar areas of a TILT configuration using the entries in these two tables

and interpolates between points using a linear model.

The read and write crossbar area graphs in Figure 5.1 illustrate the linear growth in crossbar area

with respect to the number of FUs (which are outputs for the read crossbar and inputs for the write

crossbar) and the number of memory banks for two sets of bank read and write port configurations. For

a different number of read and write ports, the crossbar area follows the same trend as that presented

in Figure 5.1. Therefore, the linear model we use to interpolate the area cost of intermediate crossbar

configurations is quite accurate.
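The sketch below illustrates this kind of table lookup with linear interpolation along the FU-count axis; the two table entries shown are hypothetical placeholders rather than the measured areas.

# Sketch of linear interpolation between crossbar area table entries along the
# FU-count axis (placeholder areas; the measured tables are not reproduced here).

def interpolate_area(area_table, banks, rports, wports, n_fus):
    """area_table maps (banks, rports, wports, n_fus) -> compiled area in ALMs."""
    known = sorted(f for (b, r, w, f) in area_table if (b, r, w) == (banks, rports, wports))
    lo = max(f for f in known if f <= n_fus)
    hi = min(f for f in known if f >= n_fus)
    a_lo, a_hi = area_table[(banks, rports, wports, lo)], area_table[(banks, rports, wports, hi)]
    return a_lo if hi == lo else a_lo + (a_hi - a_lo) * (n_fus - lo) / (hi - lo)

# Hypothetical entries for 8 banks with 2 read and 1 write ports per bank:
table = {(8, 2, 1, 2): 1000.0, (8, 2, 1, 6): 2600.0}
print(interpolate_area(table, 8, 2, 1, 4))   # 1800.0, halfway between the two entries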

[Figure 5.1 plots: crossbar area (in thousands of ALMs) vs. # of FUs, with one curve per memory bank count (1, 2, 4, 8, 16, 32).]
(a) Read crossbar area with 1 read and 2 write ports per memory bank.
(b) Read crossbar area with 2 read and 4 write ports per memory bank.
(c) Write crossbar area with 1 read and 2 write ports per memory bank.

Figure 5.1: Read and write crossbar areas for different mixes of FUs, data memory banks and memory bank ports obtained from compiling the configurations in Quartus.

5.2.5 Fetcher Data FIFOs

The data FIFOs inside the Fetcher are built using Altera’s dual-clock FIFO (DCFIFO) megafunction [65].

The Fetcher supports data FIFOs with arbitrary depth and a word width of 256 bits. We summarize

the resources consumed by these FIFOs as a function of depth in Table 5.6. Although the DCFIFO

megafunction can support depths below 64 or beyond 32,768 words, we do not present the costs for

these depths because we do not make use of them. The number of M20Ks, LUTs and registers are

obtained directly from the resource usage numbers estimated by Altera’s MegaWizard Plug-in Manager.

We observe that the number of M20Ks used follows the same calculation as the one we have used to


calculate BRAM usage for TILT and Fetcher memories in earlier sections.

Depth (words)   M20Ks   ALUTs   Registers   Avg. ALMs   Total Area (eALMs)
64              7       20      117         49.4        329.4
128             7       22      132         57.6        337.6
256             7       24      147         66.7        346.7
512             7       26      162         73.6        353.6
1,024           13      28      177         81.0        601.0
2,048           26      30      192         89.3        1,129.3
4,096           52      32      209         96.2        2,176.2
8,192           128     34      224         105.1       5,225.1
16,384          256     36      239         112.1       10,352.1
32,768          512     127     257         120.0       20,600.0

Table 5.6: Resources consumed by Altera's DCFIFO megafunction with a word width of 256 bits. Each ALM on a Stratix V FPGA contains 2 ALUTs (Adaptive Look-Up Tables) and 4 registers [66].

The average number of ALMs consumed by the data FIFOs is also provided in Table 5.6. These

numbers are obtained from the fitter compilation reports in Quartus. Finally, the average area cost of

the data FIFOs in Table 5.6 is the sum of the ALMs used and the cost of the M20K BRAMs in eALMs.

The Predictor estimates the total area cost of the data FIFOs for a Fetcher configuration with Equation

5.7 using the area costs provided in Table 5.6 and the number of external read and write ports.

Data FIFO Area (eALMs) = (area of outgoing FIFO) ∗ (# of external read ports)

+ (area of incoming FIFO) ∗ (# of external write ports) (5.7)
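Using the per-FIFO costs of Table 5.6, Equation 5.7 can be evaluated as in the sketch below (a hypothetical helper, not the Predictor's code).

# Sketch of Equation 5.7 using the total per-FIFO costs of Table 5.6 (256-bit DCFIFOs).

DCFIFO_AREA_EALMS = {64: 329.4, 128: 337.6, 256: 346.7, 512: 353.6, 1024: 601.0,
                     2048: 1129.3, 4096: 2176.2, 8192: 5225.1}

def data_fifo_area(outgoing_depth, incoming_depth, ext_read_ports, ext_write_ports):
    return (DCFIFO_AREA_EALMS[outgoing_depth] * ext_read_ports +
            DCFIFO_AREA_EALMS[incoming_depth] * ext_write_ports)

# One outgoing 512-word FIFO and two incoming 1,024-word FIFOs -> 1,555.6 eALMs.
print(data_fifo_area(512, 1024, ext_read_ports=1, ext_write_ports=2))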

5.3 Prediction of the Densest TILT-System Design

A brute-force exploration of the TILT-System's large configuration space is not practical, even when using

the performance and area models provided in Sections 5.1 and 5.2 instead of Altera’s hardware synthesis

flow. We reduce the number of designs that are considered by the Predictor in two ways to improve the

runtime and efficacy of the tool. The first is by bounding the explored design space by imposing fixed

limitations on the range of values we try for different TILT-System configuration parameters. This is

presented in Section 5.3.1. We further improve the runtime of the Predictor by considering only those

configurations among the bounded set that are likely to produce computationally dense designs. This is

accomplished with heuristics using the search algorithm described in Section 5.3.2.

If the bounded configuration space is made too small, then the tool will arrive at a solution very

quickly but will also be more likely to miss good designs. Conversely, if the Predictor is tuned to explore

a very large number of TILT-System configurations, then it is likely to produce an optimal solution but

at the cost of designer productivity due to the long execution time. We want the tool to explore a large

set of promising candidate designs so that it produces good solutions while also having an acceptable

runtime of under a minute, with a few minutes as the worst case. Moreover, we allow the designer to

optionally configure the size of the design space that the Predictor explores. Through parameters that

can be changed, such as the range of thread counts to try, the Predictor can be tuned to explore a larger

or smaller set of configurations to meet the designer’s productivity and performance requirements for a

target application.


5.3.1 Constraining the Explored Design Space

Despite the large number of possible TILT-System configuration permutations, computationally dense

designs are usually situated within a well-defined, bounded region of the configuration space. As we

demonstrate in Section 6.3, we observe large gains in compute density initially as we add resources such

as FUs and threads to the baseline TILT configuration to enable more work to be performed in parallel.

Adding more resources beyond a certain point causes the compute density to drop due to diminishing

gains in compute throughput at increasingly higher area costs. This means large TILT-Systems are not

very area-efficient and therefore, need not be considered by the Predictor.

TILT parameters
  TILT core count  = 8
  FU array         = maximum of 3 FUs per type
  memory banks     = 1, 2, 4, 6, 8, 12, 16, 24, 32
  threads/bank     = 1, 2, 4, 6, 8, 12, 16 (24 with up to 4 banks, 32 with up to 2)
  read ports/bank  = 2, 4, 6, 8
  write ports/bank = 1, 2, 3

Fetcher parameters
  external read ports  = min{[1, 2, 3], total bank read ports}
  external write ports = min{[1, 2, 3], total bank write ports}
  data FIFO depth      = 64, 128, 256, ..., 8192

Table 5.7: All permutations of TILT and Fetcher configurations that the Predictor is configured to explore in this work.

Table 5.7 summarizes all the possible permutations of TILT and Fetcher configuration parameters

that are available to the Predictor’s search algorithm described in Section 5.3.2. The parameter ranges

provided in this table were determined experimentally based on the observations made in finding com-

putationally dense designs in a brute-force manner for our benchmarks. The configuration space is made

sufficiently broad so that we do not potentially miss good design solutions. The designer can choose to

deviate from the ranges presented in the table to increase prediction accuracy or to improve the tool’s

runtime as needed by modifying a configuration file.
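As a hypothetical illustration of such a configuration (the tool's real file format is not reproduced here), the explored ranges of Table 5.7 could be captured as follows.

# Hypothetical sketch of a Predictor search-space configuration, mirroring Table 5.7.

PREDICTOR_SEARCH_SPACE = {
    "tilt_core_count": 8,
    "max_fus_per_type": 3,
    "memory_banks": [1, 2, 4, 6, 8, 12, 16, 24, 32],
    "threads_per_bank": [1, 2, 4, 6, 8, 12, 16, 24, 32],   # 24 and 32 only with few banks
    "read_ports_per_bank": [2, 4, 6, 8],
    "write_ports_per_bank": [1, 2, 3],
    "external_read_ports": [1, 2, 3],     # capped at the total bank read ports
    "external_write_ports": [1, 2, 3],    # capped at the total bank write ports
    "data_fifo_depths": [64, 128, 256, 512, 1024, 2048, 4096, 8192],
}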

5.3.2 Search Algorithm

We now describe the Predictor’s search algorithm and the heuristics that the algorithm uses to narrow

down its search beyond the fixed limitations placed on the TILT-System configuration space earlier in

Section 5.3.1. The pseudo-code of this algorithm is provided in Algorithm 7. The algorithm requires

the DFG of the target application as its input. We also perform memory allocation and the removal of

aliases on the DFG (described in Section 4.3) prior to the execution of the algorithm. As its output, the

algorithm produces the TILT-System configuration with the highest predicted compute density.

Algorithm 7 begins by choosing the baseline TILT-System configuration as its best design (lines 2

through 14). This is an 8-core system with the smallest TILT core and Fetcher design possible; this has

the lowest area cost but also the lowest compute throughput. Each TILT core in this design consists

of 1 of each FU type required by the application with 1 thread, 1 data memory bank and 2 read and 1

write ports per memory bank. Similarly, the Fetcher comprises 1 external read and write port with the

minimal data FIFO depth calculated using Equations 5.9 and 5.10 which we presented in Section 5.3.5.


Algorithm 7: Get TILT-System configuration with highest compute density

Input: DFG of application
Output: TILT-System configuration

 1  FU array = determine the FU types in the DFG to share using Algorithm 8;
 2  trialConfig = base TILT-System config with FU array;
 3  toTryQueue = {∅}; configsTried = {∅};
 4  enqueue {trialConfig, 0, 0} into toTryQueue;
 5  bestConfig = nil; bestCd = 0;
 6  while toTryQueue is not empty do
 7      dequeue {trialConfig, startI, iToSkip} from toTryQueue;
 8      generate memory and compute schedule for trialConfig;
 9      order {trialPerfType, trialPerfCount} from largest trialPerfCount to smallest;
10      throughput = total # of compute operations / length of tilt schedule;
11      trialCd = throughput / TILT-System area estimated by AreaModel;
12      if trialCd > bestCd then
13          bestConfig = trialConfig;
14          bestCd = trialCd;
15          {bestPerfType, bestPerfCount} = {trialPerfType, trialPerfCount};
16          startI = 0;
17      for i = startI to bestPerfCount.length-1 do
18          trialConfig = updateConfig(bestPerfType[i], bestConfig);
19          if trialConfig != nil then
                // startI = i+1; iToSkip = i
20              if trialConfig does not exist in toTryQueue or configsTried then
21                  enqueue {trialConfig, i+1, i} into toTryQueue;
22              break;
23      if trialCd < bestCd then
24          for i = 0 to trialPerfCount.length-1 do
25              if i == iToSkip then
26                  skip loop iteration;
27              trialConfig = updateConfig(trialPerfType[i], trialConfig);
28              if trialConfig != nil then
                    // startI = 0; iToSkip = i
29                  if trialConfig does not exist in toTryQueue or configsTried then
30                      enqueue {trialConfig, 0, i} into toTryQueue;
31                  break;
32      insert trialConfig into configsTried;
33  return bestConfig;


The mix of FU types to be used by the TILT cores and which FU types, if any, to share with another

FU is determined beforehand (line 1) using Algorithm 8 which we describe in Section 5.3.3.

At every iteration of the while loop (line 6), the algorithm attempts to find incrementally larger

TILT-System configurations which are likely to have higher compute densities. This is accomplished by

determining the TILT or Fetcher resource that is the largest bottleneck to the TILT-System’s perfor-

mance and attempting to alleviate it by adding more of that resource. This is done by the updateConfig

procedure (lines 18 and 27) which we will describe later in Section 5.3.4. As an example, if the compute

throughput is limited by the number of available Mult FUs then another Mult FU is added. If TILT’s

existing FU array and/or memory bank ports are being underutilized, more threads are added. The

objective of the algorithm is to reach the peak compute density by selectively increasing the size of the

TILT-System to alleviate performance bottlenecks in the design.

The bottlenecks of the TILT-System are represented by performance counters which are generated

when the application’s DFG is statically scheduled onto the trial TILT-System configuration (line 8).

These counters are ordered from the largest value to the smallest (line 9) and iterated in this sequence

to select the resource bottlenecks to alleviate (lines 17 and 24). If updateConfig is unable to add more of

a resource to alleviate the selected bottleneck due to the fixed constraints imposed in Section 5.3.1 (lines

19 and 28), then the next bottleneck from the ordered array is attempted. These performance counters

are presented in Section 5.3.4.

For any ‘trial’ TILT-System configuration that is considered by the Predictor, there exist two cases:

either the trial configuration has a higher compute density than the current best configuration or it does

not. In the first case (line 12), we update the best configuration to that of the trial (line 13) and the next

configuration to evaluate is the one that alleviates the biggest bottleneck of the new best configuration;

this configuration is inserted into the toTryQueue (lines 17 to 22). In the second case, the bestConfig

parameter is not updated and two new trial configurations are inserted into the toTryQueue. In both

cases, if we already reached the fixed resource limit(s) for a bottleneck, we try instead to alleviate the

next most significant bottleneck. The search ends when the toTryQueue is empty and there are no more

trial configurations to consider.

For the second case, the first of the two configurations that is inserted into toTryQueue is the

configuration that alleviates the next biggest bottleneck of the best configuration (lines 17 to 22). This

is because the previous bottleneck did not yield a design with a higher compute density. The second

is the configuration that alleviates the biggest bottleneck of the attempted trial configuration (lines 24

to 31). We ensure this bottleneck is not the same as the one that was alleviated to obtain the trial

configuration (line 25). This additional condition prevents the algorithm from trying configurations that

have more of the resource that was added to obtain the trial configuration initially. Due to diminishing

gains associated with each incremental increase in the same resource, if the compute density did not

improve with the trial configuration, it will also not increase with a configuration that has more of that

resource until some other resource is increased first.

To summarize, the algorithm greedily records only the TILT-System configuration with the highest

compute density. It also greedily chooses trial configuration(s) by attempting to alleviate the biggest

bottleneck(s) of the current best configuration. Although we keep track of only the best configuration,

we also try to alleviate bottlenecks of trial configurations that have a lower compute density than that of

the best. This way, we expand the search to consider improvements to more TILT-System configurations

than just the current best. This distinguishes our algorithm from a purely greedy solution which can


get stuck in a local maximum that is not close to the globally optimal design solution.

5.3.3 Combining TILT FUs

In this section, we present Algorithm 8 which is used to determine the FU types that will share their

read and write ports and operation field in TILT’s instruction with another FU. The array of FU types

produced by this algorithm is adopted by the Predictor’s search algorithm described in Section 5.3.2

which is used to predict the TILT-System configuration with the highest compute density.

Algorithm 8: Determine the FU types in the application’s DFG to share.

Input: DFG FU types[], DFG operations
Output: TILT FU array

 1  toShareFUs[] = {∅};
 2  foreach FU in FU types do
 3      if FU is LoopUnit or Cmp or Abs then
 4          insert into toShareFUs;
 5  FU usages[] = # of compute operations of each FU type in DFG;
 6  utilCostRatios[][] = {∅};
 7  foreach toShareFU in toShareFUs do
 8      foreach FU in FU types do
 9          areaSavings = (area[toShareFU] + area[FU]) - sharedArea[toShareFU, FU];
10          util = FU usages[toShareFU] + FU usages[FU];
11          utilCostRatios[toShareFU][FU] = util / areaSavings;
12  FU array = FU types;
13  while toShareFUs is not empty do
14      get <toShareFU, FU> entry with lowest value in utilCostRatios that is > 0;
15      replace FU entry in FU array with <FU + toShareFU>;
16      remove toShareFU from FU array and toShareFUs;
17  return FU array;

Sharing an FU’s read and write ports with another FU can be worthwhile due to the reduction in

TILT’s instruction width and the size of TILT’s crossbars. However it also introduces contention in

scheduling because only one of the FUs can read data into its input ports or write its result to data

memory on a given cycle. For this reason, it is best to only combine FUs that have low utilization

relative to the other FUs and/or are small in area with comparatively larger Decode modules.

From Tables 5.1 and 5.2, we observe that the Cmp, Abs and LoopUnit are the smallest of the TILT

FUs and the cost of their Decode modules are the largest relative to their size. Among them, the

LoopUnit is the smallest and is likely to be the least utilized due to requiring only two loop operations

in the schedule for each loop in the application. Table 5.8 presents the area savings that can be achieved

by combining the Cmp and Abs FUs. The LoopUnit provides the best area savings because it does not

increase the size of the Decode module beyond that of the other FU it is shared with. This makes the

LoopUnit an ideal candidate for sharing with another FU type.

Algorithm 8 first determines the FU types that it will attempt to combine with another (lines 1 to 4).

Currently these FUs are the Cmp, Abs and LoopUnit discussed earlier. Next, the algorithm determines

the number of operations of each type in the DFG (line 5). This is the utilization of each FU type. Based


on the earlier discussion, we want to combine the least utilized FUs. The second metric we calculate

is the area savings that can be achieved by combining two FU types. We want to maximize this value

which can be determined using a table lookup of the area savings provided in Table 5.8.

FU Unit    Area with Cmp FU (eALMs)    Savings (eALMs)
           Separate     Shared
FAddSub    632          549            84
FMult      420          336            84
FDiv       778          695            83
FSqrt      765          701            63
FExp       952          891            61
FLog       1,512        1,448          65

           Area with Abs FU (eALMs)    Savings (eALMs)
           Separate     Shared
FAddSub    520          445            75
FMult      308          238            70
FDiv       665          594            71
FSqrt      652          583            69
FExp       840          766            74
FLog       1,400        1,329          71

Table 5.8: Relative sizes of FUs and their Decode modules with Cmp or Abs FUs separate and shared.

The utilization and area savings are used to calculate a ratio between the two metrics for each possible

combination of two FU types (lines 6 to 11). Finally, the algorithm shares those FU combinations that

have the lowest ratio of utilization and area savings (lines 12 to 16). We avoid combinations with a

negative ratio because it would mean the area cost is higher when the two FU types are shared. This

maximizes the area savings benefit while combining the least utilized FU types to minimize resource

contention during scheduling.
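The selection can be illustrated with the sketch below, which considers only a subset of candidate pairings; the operation counts are hypothetical, while the savings values are taken from Table 5.8.

# Illustrative ratio calculation for candidate FU pairings. The operation counts
# are assumed for this example; the savings values come from Table 5.8.

op_counts = {"FAddSub": 23, "FMult": 30, "Cmp": 4, "Abs": 2}          # assumed DFG op counts
savings = {("Cmp", "FAddSub"): 84, ("Cmp", "FMult"): 84,              # eALMs saved when shared
           ("Abs", "FAddSub"): 75, ("Abs", "FMult"): 70}

ratios = {pair: (op_counts[pair[0]] + op_counts[pair[1]]) / s
          for pair, s in savings.items() if s > 0}                    # skip negative savings
best_pair = min(ratios, key=ratios.get)   # lowest combined utilization per eALM saved
print(best_pair, round(ratios[best_pair], 3))   # ('Cmp', 'FAddSub') 0.321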

Using only the operation count of each FU type disregards dependencies between operations which

is not ideal because it can result in an underprediction of the FU utilization. However, our approach

works well since TILT executes many independent threads to improve the utilization of the FUs which

makes the dependencies between operations within the same thread less important. Considering only a

single set of DFG operations is also sufficient because the ratio of the types of operations will be the

same irrespective of how many instances of the DFG (or threads) are scheduled onto TILT.

The runtime of Algorithm 8 would be much higher if we decided to schedule the application with

varying numbers of threads, FU mixes and TILT hardware configurations to obtain more accurate FU

utilization and area cost numbers. However, this level of complexity is likely unnecessary since our simpler

algorithm presented here produces the same mix of FU types as that of the most computationally dense

TILT-System designs of the five data-parallel benchmarks that we evaluate in Chapter 6. These designs

were determined through a more exhaustive search of the TILT-System parameter space. Determining the situations in which it may be beneficial to combine the larger standard FUs with each other is left as future work.

5.3.4 Alleviation of Bottlenecks with Performance Counters

Performance counters are produced during the generation of the instruction schedules for a target TILT-

System and application. These counters record the number of cycles by which the issuing of a memory or


compute operation must be delayed in the instruction schedule due to TILT or Fetcher resources being

unavailable. As an example, if the earliest time an operation could be scheduled is cycle 6 but it could

not be scheduled until cycle 10 due to resource constraints, the counter associated with that resource

would be incremented by 4.
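The sketch below mirrors this bookkeeping for the cycle-6-to-10 example; it is a hypothetical helper around whatever scheduler emits the operations, not the actual implementation.

# Sketch of the delay bookkeeping behind a performance counter (hypothetical helper).

from collections import defaultdict

perf_counters = defaultdict(int)          # resource name -> accumulated delay in cycles

def record_delay(resource, earliest_cycle, scheduled_cycle):
    perf_counters[resource] += scheduled_cycle - earliest_cycle

# The example from the text: earliest possible cycle 6, actually scheduled at cycle 10.
record_delay("FMult", earliest_cycle=6, scheduled_cycle=10)
print(perf_counters["FMult"])             # 4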

An operation being postponed results in all dependent operations also being pushed down in the

schedule due to data and control dependencies. This contributes to the growth in TILT’s schedule which

reduces the TILT-System performance. Moreover, larger instruction memories may also be required to

store the longer TILT and Fetcher schedules. The Predictor utilizes the performance counters to assess

resource bottlenecks of the TILT-System. The tool then attempts to alleviate some of these bottlenecks

by adding more resources to obtain higher compute throughput for an acceptable area cost that results

in an overall increase in compute density.

The updateConfig procedure used by Algorithm 7 comprises a series of conditions that attempt to

resolve the provided resource bottleneck of its input TILT-System configuration. The function produces

the updated TILT-System configuration as its output or nil if the bottleneck could not be resolved. This

can happen when the fixed constraint imposed on a TILT core or Fetcher resource in Section 5.3.1 is

reached for the resource that is to be increased. When this happens, the Predictor attempts to alleviate

the next biggest bottleneck.
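A much-simplified sketch of an updateConfig-style helper is shown below; it only bumps a single scalar parameter and returns None at its Table 5.7 limit, omitting the special cases described in the rest of this section (for example, resetting bank ports and redistributing threads when the bank count changes). Parameter names and steps are illustrative assumptions.

# Simplified updateConfig-style sketch: grow the bottlenecked resource by one step,
# or return None once its Table 5.7 limit is reached (illustrative parameter names).

LIMITS = {"fus_per_type": 3, "threads_per_bank": 32, "memory_banks": 32,
          "read_ports_per_bank": 8, "write_ports_per_bank": 3,
          "external_read_ports": 3, "external_write_ports": 3}
STEPS = {"read_ports_per_bank": 2}        # read ports grow 2, 4, 6, 8; others grow by 1 here

def update_config(bottleneck, config):
    step = STEPS.get(bottleneck, 1)
    if config[bottleneck] + step > LIMITS[bottleneck]:
        return None                       # caller then tries the next biggest bottleneck
    new_config = dict(config)
    new_config[bottleneck] += step
    return new_config

cfg = {"fus_per_type": 1, "read_ports_per_bank": 8}
print(update_config("fus_per_type", cfg))        # {'fus_per_type': 2, 'read_ports_per_bank': 8}
print(update_config("read_ports_per_bank", cfg)) # None: already at the limit of 8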

We describe how the different TILT and Fetcher resources are increased and introduce the perfor-

mance counters that are used to make these decisions below. Each resource is increased in increments

up to their maximum value which is summarized in Table 5.7. The order in which the performance

counters are presented is the order they are considered in the series of conditions mentioned earlier.

The counters are normalized so that they do not artificially have a higher value (and therefore a higher

priority) depending on what they are measuring.

TILT FU Array. For each FU type, we record the number of times that an operation of that type

could not be scheduled due to all FUs of that type being used by other compute operations. This value is

normalized by the number of FUs of that type. The bottleneck is resolved by incrementing the number

of FUs of that type by 1, up to the maximum of 3, as outlined in Table 5.7.

TILT Threads. The number of threads is increased when the TILT FUs are underutilized, measured

by the number of effective empty compute cycles when an FU was available but no operation was

scheduled for it. This value is normalized by the number of FUs of each TILT core. We increase the

number of threads per bank as outlined in Table 5.7 to increase the FU utilization and throughput.

Data Memory Banks. The maximum number of memory banks is limited by the total number

of threads. If the thread count needs to be greater than 32 per memory bank, we try to increase the

number of banks instead. Moreover, if the read and write port counts per memory bank are 8 and 3

respectively and either the read or write ports are the bottleneck, we increase the number of memory

banks instead as well. Any time there is a change in the number of memory banks, we reset the read

and write ports of each memory bank to 2 and 1. The existing number of threads is also redistributed

equally among all memory banks, usually resulting in a lower number of threads per bank.

Memory Bank Read and Write Ports. Two performance counters track the number of times

when a memory or compute operation could not be scheduled due to being unable to read from or write

data to TILT’s data memory. This happens when the target memory bank’s read or write ports are

already in use by other previously scheduled operations. This is resolved by incrementing the number

of read or write ports of each memory bank respectively. These counters are normalized by the sum of


the number of FUs and external read and write ports.

External Read and Write Ports. Two performance counters track the number of times when

TILT memory read or write operations could not be scheduled due to the associated port being in use

by another memory operation. This is resolved by increasing the number of external read or write ports

respectively. The counters are normalized by the number of external read or write ports. The external

read or write port count is not allowed to be greater than the total number of memory bank read or

write ports (refer to Table 5.7). This is because the maximum number of data words that can be read

or written from TILT’s data memory to the Fetcher’s FIFOs is the minimum of the memory bank and

external port counts.

5.3.5 Fetcher Configuration Selection

The Fetcher can be scaled by varying the depths of the data FIFOs and the number of external read and

write ports. The Predictor chooses the smallest Fetcher configuration that meets the external memory

bandwidth requirement of the target TILT design’s statically generated memory schedule. The Predictor

estimates this bandwidth using several metrics to calculate the recommended depths of the incoming

and outgoing data FIFOs inside the Fetcher, provided in Equations 5.9 and 5.10 respectively.

FIFOstretch = max[1, (tilt schedule depth) / (schedule segment)] ∗ (schedule iterations),
    where schedule segment = 256 cycles and schedule iterations = 4                          (5.8)

Incoming Data FIFO Depth = (external inputs in DFG) ∗ (tilt threads)
                           / (external write ports) ∗ FIFOstretch                            (5.9)

Outgoing Data FIFO Depth = (external outputs in DFG) ∗ (tilt threads)
                           / (external read ports) ∗ FIFOstretch                             (5.10)

As we described in Section 3.3, we can determine the number of external input and output data

words that must be written or read from TILT’s data memory per thread by analyzing the DFG of

the application. Combining this with the number of TILT threads and the depth of TILT’s schedule

provides us with an accurate estimate of the number of data words that must move between TILT and

the Fetcher over a period of time. This is TILT’s external memory bandwidth that the Fetcher must

be able to support. Dividing this quantity by the number of external read and write ports gives us the

average per FIFO depth in 256-bit words (Equations 5.9 and 5.10).

We want the Fetcher to be able to buffer a reasonable amount of data inside its data FIFOs ahead of

TILT’s execution to minimize the stalls that TILT may incur. This will usually require the Fetcher to

buffer external data of multiple iterations of TILT’s schedule. However for long schedules, it is sufficient

and also preferable to have only a part of a single iteration of the schedule buffered at any point in time

instead to reduce the area cost of the data FIFOs. This is captured by the FIFOstretch parameter in

Equation 5.8 which is used to express how deep the data FIFOs should be in relation to the amount

of data that must move between TILT and off-chip memory. The values of the ’schedule iterations’

and ’schedule segment’ parameters in this equation were determined experimentally. We found 3 to 5

iterations in increments of 256 cycles work best.


The depths of the Fetcher’s data FIFOs are calculated using Equations 5.8 to 5.10 during the exe-

cution of the Predictor’s search algorithm. The number of external read and write ports is predicted by

the search algorithm. These parameters are used to determine the configuration of the Fetcher and its

area cost (line 11 of Algorithm 7).

5.4 Efficacy of the Predictor

We have designed our Predictor tool with three important objectives in mind. First, the Predictor should

be able to reliably recommend the most computationally dense TILT-System configuration for a target

application with high accuracy. Second, the Predictor should be able to produce this design solution

within an acceptable amount of time, preferably within minutes. Finally, the tool should be able to

perform its function without requiring manual designer effort in tuning the tool. Taking these design

goals into consideration, the Predictor exposes the potential of the highly customizable TILT-System

architecture to the designer. The Predictor improves designer productivity by removing the need to

manually arrive at the most computationally dense TILT-System design using trial-and-error methods

that require long Quartus compilations and performance simulations.

In this section, we evaluate the effectiveness of the Predictor in meeting the design goals outlined

above using the five applications of Section 3.4. We begin by highlighting the significant reduction in

the TILT-System’s configuration space which is obtained by the Predictor’s search algorithm in Section

5.4.1. Then in Section 5.4.2, we present the fast runtime of our tool and compare it with the runtime

of a more exhaustive search of the configuration space. We also contrast these execution times with the

time it took to compile the densest TILT-System designs in Quartus. In Section 5.4.3, we evaluate the

accuracy of our area model by comparing its output with the TILT-System area obtained from the fitter

reports after synthesis of the design in Quartus. Finally in Section 5.4.4, we evaluate the Predictor’s

ability to accurately predict the densest design while only exploring a limited number of configurations

to improve its execution time.

5.4.1 Design Space Reduction

We significantly reduce the number of different TILT-System configurations that are explored by the

Predictor first by fixing the design space that is searched and second by intelligently trying only those

configurations in this space that are likely to produce good design solutions. This is illustrated by Table

5.9 where we present the number of designs that are considered by our Predictor and contrast this with

the significantly higher number of design permutations that are possible even after the design constraints

provided in Table 5.7 are placed on the TILT-System configuration space.

Benchmark   # of Designs Avail. to Predictor   # of Designs Tried by Predictor
BSc         153,055,008                        1,264
HDR         17,006,112                         1,009
Mbrot       1,889,568                          932
HH          17,006,112                         1,874
FIR         1,889,568                          324

Table 5.9: Number of TILT-System designs considered by the Predictor vs. designs available to the tool after applying the fixed constraints of Table 5.7 on the design space.


5.4.2 Runtime Comparison

The execution time of our Predictor is provided in Table 5.10 for each of our applications. By reducing

the number of designs that are considered by the Predictor, we are able to meet our design objective

of obtaining a design solution within a few minutes for all of our benchmarks, with an average runtime

of 42 seconds. The majority of the Predictor’s execution time is spent on generating the memory and

compute instruction schedules for the target TILT-System. The time it takes to determine the FU array

composition, estimate the performance and area costs and decide which TILT-System configurations to

try is negligible in comparison.

Benchmark   Runtime of          Runtime of           Quartus Runtime to
            Predictor (mins)    Exhaustive Search    Compile Densest Design (mins)
BSc         2.7                 3+ days              53
HDR         0.4                 3+ days              31
Mbrot       0.4                 18 hrs               16
HH          2.3                 3+ days              52
FIR         0.2                 15 hrs               13
Geomean     0.7                                      28

Table 5.10: Execution time of the Predictor vs. an exhaustive search of the TILT-System design space defined by the configuration ranges provided in Table 5.7. The time it takes to synthesize the densest TILT-System is also provided for comparison.

In Table 5.10, we also contrast the fast execution time of the Predictor with the significantly longer

execution of an exhaustive search of the design space constrained only by the parameter ranges provided

in Table 5.7. The exhaustive search took longer than 3 days for the BSc and HH applications. These

applications have the largest number of DFG operations to schedule and also the largest number of

TILT-System designs to consider. Generating the instruction schedules for the larger TILT-System

configurations that have a large number of FUs, memory banks and threads takes much longer than smaller

designs due to the super-linear runtime of TILT’s scheduling algorithms. The Predictor benefits from

not having to consider these large designs. This is because the tool’s search algorithm prunes the

configuration space and does not explore large designs due to the poor compute densities of earlier,

related but smaller configurations.

Finally, Table 5.10 also provides the synthesis runtime of the densest design for each application. These

designs are medium-sized compared to the smallest and largest TILT-System configurations that are

possible for the same application. Quartus compile times also increase super-linearly with the size of

the design. Since synthesis times are orders of magnitude higher than the time we require to predict

the performance of a single TILT system, it is clear that such fast models are necessary to guide our

Predictor; full hardware synthesis of a large number of candidate systems is not practical.

5.4.3 Accuracy of Area Model

For the Predictor to be able to reliably choose the most computationally dense design for a target

application, it must accurately estimate the area of a wide variety of TILT-System configurations. The

accuracy of our area models is therefore an important indicator of the utility of the Predictor. To measure

this, we use Quartus to synthesize the TILT-System designs that are considered by our Predictor during


the execution of its search algorithm and obtain the actual layout areas of these designs from the

generated fitter compilation reports. After the ALMs, M20K BRAMs and DSP resources utilized by the

designs are converted into eALMs, we compare these costs with the estimated layout areas reported by

our models. The percent error between the estimate and the actual area is presented in Figure 5.2 for

our benchmarks.

[Figure 5.2 histograms: # of TILT-System designs per % error bin (<3, <6, <9, <12, <15, >=15) for (a) BSc, (b) HDR, (c) MBrot, (d) HH and (e) FIR.]

Figure 5.2: Error distribution of the TILT-System layout area estimated by our area model vs. the actual area obtained from compiling the configurations in Quartus for each of our applications. The designs plotted correspond to those configurations that are explored by the Predictor's search algorithm.

We find good prediction accuracy with less than 6% error for the small to mid-range TILT-System

designs. This can be attributed to having more table entries of actual areas for such designs. These

entries are also closer together compared to the larger configurations, resulting in less variance in area

during the interpolation of intermediate designs. For example, looking at the crossbar mixes in Tables

5.4 and 5.5, we have entries for 1 to 8 memory banks close together while there are only two entries

to interpolate the areas of the crossbars between 16 and 32 banks. Having more entries for the lower

to mid-range designs was an intentional design choice. As we will show in our evaluation of the TILT-

System’s performance and area in Chapter 6, we foresee the TILT-System being most useful in this

range. Computationally dense TILT-Systems are typically found in this range as well.

Our decision to use tables that contain the exact area costs of different configurations instead of

formulae that perform an imperfect fit across all available data points also helped improve the accuracy

of our area models. Higher prediction accuracy can be achieved by increasing the number of entries

in these tables. However, this also increases the storage requirements of the Predictor program but

more importantly, requires a larger number of Quartus compilations to populate the table. Some of the


accuracy loss reported in Figure 5.2 can be attributed to the expected variance in ALMs consumed from

one compilation to another which is a property of hardware synthesis tools such as Quartus.

5.4.4 Prediction Accuracy

Here we present whether our Predictor’s search algorithm was able to correctly predict the most com-

putationally dense TILT-System configuration. It is infeasible to perform a complete exploration of the

design space by synthesizing every possible permutation. We instead rely on the densest configuration

that is reported by an exhaustive search of the constrained design space – a software solution which uses

the Predictor’s performance and area models. The number of designs explored is provided in the second

column of Table 5.9. For the BSc, HDR and HH applications, the search was manually stopped after

three days so the larger designs were not considered.

Benchmark   FU Mix        Threads   Mem Banks   Bank Depth   W/R Ports / Bank   W/R Ext Ports
BSc         2-3-1-1-1-1   64        4           1024         2-4                1-1
            match         match     match       match        match              match
HDR         2-2-1-1       64        4           512          2-4                2-1
            match         match     match       match        match              match
MBrot       1-2*          8         1           256          3-6                1-1
            match         match     match       match        2-4                match
HH          3-2-2-2       32        4           1024         2-4                1-1
            match         match     match       match        match              match
FIR         1-1           2         1           512          2-4                2-1
            match         match     match       match        match              match

Table 5.11: Actual (first row) vs. predicted (second row) TILT-System configurations. FU Mix: AddSub/Mult/Div/Sqrt/ExpCmp/LogAbs. *AddSubLoopUnit/MultCmp.

Benchmark   Compute Density (M tps / 10k eALMs)   % Error
BSc         12.6
            match                                 match
HDR         70.1
            match                                 match
MBrot       3.2
            3.0                                   5%
HH          9.4
            match                                 match
FIR         24.6*
            match                                 match

Table 5.12: Actual (first row) vs. predicted (second row) TILT-System (with 8 TILT cores) compute densities. *FIR compute density is in M inputs/sec per 10k eALMs.

All output designs of the exhaustive search were also present in the set of TILT-System designs that

were explored by the Predictor tool and synthesized in Quartus to obtain the area model accuracy results

presented earlier in Section 5.4.3. These designs are provided in Table 5.11 as the “actual” best designs

and their actual compute densities, calculated from their ModelSim and Quartus reports, are provided in

Table 5.12. The predicted densest designs are provided in these two tables (second row) for comparison.


The Predictor was able to correctly predict the configurations of all applications except for Mandelbrot

where the memory bank port counts are different. For this reason, the compute density differs by 5%.

The actual densest design in this case is the second densest predicted configuration.

5.5 Summary

In this chapter, we propose a software Predictor tool capable of intelligently navigating the large TILT

and Fetcher design space and recommending a suitably tuned TILT-System design for a target application

without requiring manual designer effort. The Predictor’s execution time is fast compared to hardware

synthesis and is 42 seconds on average for our benchmarks. Our tool accurately predicted the densest

TILT-System configuration for 4 of the 5 benchmarks. For the mis-predicted Mandelbrot application,

the read and write ports per memory bank were predicted to be 4 and 2 instead of 6 and 3, resulting

in a 5% drop in compute density. The designer can optionally expand or reduce the design parameter

space that the Predictor may explore to improve prediction accuracy or reduce runtime respectively.

The Predictor’s performance and area models and the heuristics used to prune the search space are

application independent. Therefore, we believe our tool will be broadly applicable to other data-parallel

applications. Moreover, although in this work we focus on maximizing compute density, we can readily

modify the tool to rank designs based on the throughput calculated, to minimize system area or some

other metric that is tracked by our tool.


Chapter 6

TILT-System Evaluation

In this chapter, we present the performance of our most computationally dense, application-customized

TILT-Systems and the FPGA layout area consumed by these designs for the benchmarks provided in

Section 3.4. The implementation of these benchmarks on the TILT architecture is described in Section

6.1. In addition, we compare the performance, area, development effort and scalability of our best TILT-

System designs with our OpenCL HLS implementations of the same benchmarks. We summarize our

evaluation platform and the metrics used below.

Platform. The TILT-System and OpenCL HLS designs were generated using Altera’s Quartus 13.1

and OpenCL SDK [19] targeting the Stratix V 5SGSMD5H2F35C2 FPGA with 2 banks of 4 GB DDR3

memory on the Nallatech 385 D5 board. The relevant Quartus settings used to generate the FUs and

the DDR controller for both approaches are provided in Appendices A and B respectively.

Metrics                              Unit of Measurement
Performance: Compute Throughput      millions of threads/sec (M tps)
FPGA Area                            ALMs, BRAMs and DSPs ⇒ equivalent ALMs (eALMs)
Ranking Designs: Compute Density     throughput-per-area (M tps / 10k eALMs)

Table 6.1: Evaluation metrics.

Metrics. We use the same metrics to evaluate our designs as those we have defined in Chapter

4. These metrics are summarized in Table 6.1 and the equivalent ALM (eALM) costs of the FPGA

resources used to measure area are provided in Table 6.2. The maximum frequency at which we are able

to clock our designs (Fmax) is reported in MHz. The FPGA resource utilization and Fmax values are

obtained from Quartus fitter and TimeQuest compilation reports respectively.

Resource      eALM Cost
ALM           1
M20K BRAM     40
DSP Block     30

Table 6.2: Equivalent ALM (eALM) costs of resources on a Stratix V FPGA [61].
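For reference, the conversion implied by Table 6.2 is simply a weighted sum, as in the sketch below (a hypothetical helper with an assumed example design).

# Sketch of the eALM conversion implied by Table 6.2.

EALM_COST = {"ALM": 1, "M20K": 40, "DSP": 30}

def to_ealms(alms, m20ks, dsps):
    return alms * EALM_COST["ALM"] + m20ks * EALM_COST["M20K"] + dsps * EALM_COST["DSP"]

# e.g. a design using 10,000 ALMs, 50 M20Ks and 20 DSP blocks:
print(to_ealms(10_000, 50, 20))   # 12,600 eALMs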


6.1 Benchmark Implementation on TILT

We present the FU operation usages for each TILT thread for our benchmarks in Table 6.3, providing

an estimate of their compute size. Uniquely, Mandelbrot’s FU usages will be different depending on

the computed pixel of the 800x640 pixels image frame because the Mandelbrot thread can break out of

its loop without completing the full 1000 iterations. The Mandelbrot FU usages provided in Table 6.3

represent the average number of operations of each type that are executed by each thread across all

executed iterations of the thread’s loop. Since the body of the loop needs to be scheduled once because

of our new LoopUnit FU (see Section 4.1), the TILT compiler only needs to schedule 7 AddSub, 6 Mult

and 6 Cmp operations. The body of the loop comprises 15 of these operations. The Mandelbrot thread

utilizes the LoopUnit FU to execute its two loop operations that handle the single loop it contains, for

a total of 21 compute operations per thread that needs to be scheduled.
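For reference, a minimal C sketch of the per-thread Mandelbrot computation is shown below. It illustrates the single bounded loop with an early exit that the LoopUnit handles; it is not the exact kernel used in this work, and the variable names and loop body are illustrative only.

    /* Minimal sketch of a per-thread Mandelbrot kernel: one bounded loop with an
     * early exit, the pattern handled by the LoopUnit FU. Not the exact kernel
     * used in this thesis; names are illustrative. */
    #define MAX_ITER 1000

    int mandelbrot_thread(float cr, float ci) {
        float zr = 0.0f, zi = 0.0f;
        int iter;
        for (iter = 0; iter < MAX_ITER; iter++) {   /* loop trip count handled by LoopUnit */
            float zr2 = zr * zr;                    /* Mult */
            float zi2 = zi * zi;                    /* Mult */
            if (zr2 + zi2 > 4.0f)                   /* AddSub + Cmp: early exit */
                break;
            zi = 2.0f * zr * zi + ci;               /* Mult, Mult, AddSub */
            zr = zr2 - zi2 + cr;                    /* AddSub, AddSub */
        }
        return iter;                                /* iteration count determines the pixel colour */
    }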

Benchmark     AddSub   Mult   Div   Sqrt   Exp   Cmp   Log   Abs   Total
BSc               23     30     9      4     4     4     1     2      77
HDR                6      4     3      1     0     0     0     0      14
Mandelbrot       177    142     0      0     0   210     0     0     529
HH                36     24    28      0    15    12     0     0     115
FIR 64-tap        64     64     0      0     0     0     0     0     128

Table 6.3: TILT FU operation usages per thread.

The TILT instruction schedule is configured to be cyclic for all benchmarks, wrapping around to the

start to execute the same instructions on different data. Table 6.4 provides the memory requirements of

each benchmark per thread on TILT. External inputs and outputs define the number of data words that

must be read from or written to off-chip memory per thread respectively. One notable difference in the

Fetcher implementation for the FIR filter is that TILT-SIMD is halted while loading the filter coefficients

into the data memories of the TILT cores. The TILT-System then resumes normal concurrent operation

of the Fetcher and TILT-SIMD during the computation of the FIR outputs. This step is not necessary

for the other benchmarks because they do not require a large sequence of data to be loaded into TILT

memory prior to or between the computation of a set of results.

Benchmark     Data Words   Ext Inputs   Ext Outputs
BSc           38           5            2
HDR           18           6            1
Mandelbrot    25           5            1
HH            86           5            4
FIR 64-tap    192          1*           1

Table 6.4: TILT data memory requirements per thread (in 32-bit words). *Filter coefficients are loaded initially prior to streaming in the FIR inputs.

Some of the FUs share both their operation field in TILT’s instruction and their read and write

ports with another FU (usually the least utilized) to generate a more computationally dense schedule

and to reduce the size of the crossbars (refer to Section 2.2.2). We use Algorithm 8 of our Predictor

tool (presented in Section 5.3.3) to determine which of the FU types should be shared. The algorithm

makes this decision based on the relative operation counts of each FU type in the application’s DFG


and the reduction in area that can be obtained from combining any two FUs. The FU types that are shared are summarized for our benchmarks in Table 6.5. As an example, the Mult and Cmp units for the Mandelbrot benchmark share the same operation field and ports since they have similar FU latencies and because all multiplies precede compare operations for a given thread. The latencies of the TILT FUs vary from 1 cycle for the Abs FU to 28 cycles for the Sqrt FU.

Benchmark     FUs with Shared Operation Field / Ports
BSc           Exp+Cmp and Log+Abs
Mandelbrot    Mult+Cmp
HH            Exp+Cmp

Table 6.5: TILT FUs that share their operation field and ports with another FU.
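The sketch below conveys the flavour of this pairing decision; it is not Algorithm 8 itself, and the scoring rule and all names are assumptions made only for illustration. The idea is to favour merging FU types whose crossbar-port savings are large and whose operation counts suggest the shared operation field will rarely be contended.

    /* Illustration only of the kind of pairing heuristic described in the text;
     * this is NOT the Predictor's Algorithm 8. The scoring rule and all names
     * are assumptions made for this sketch. */
    typedef struct {
        const char *name;
        int op_count;            /* operation count of this FU type in the DFG   */
        double port_area_ealms;  /* estimated crossbar-port area saved if merged */
    } fu_type_t;

    /* Higher score = better candidate pair: large area saving, and at least one
     * lightly used FU type so the shared operation field and ports are rarely contended. */
    static double share_score(const fu_type_t *a, const fu_type_t *b) {
        int min_ops = a->op_count < b->op_count ? a->op_count : b->op_count;
        double saved = a->port_area_ealms < b->port_area_ealms
                           ? a->port_area_ealms : b->port_area_ealms;
        return saved / (1.0 + (double)min_ops);
    }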

6.2 Densest TILT-System Configurations

For each of our benchmarks, the Predictor tool presented in Chapter 5 is used to predict the TILT-

System configuration with the highest compute density. The Predictor selects the Fetcher configuration

with the smallest area that meets the external input and output bandwidth requirements of the selected

TILT design. TILT’s external memory schedules are generated using the Fetcher’s SlackB algorithm

which we describe in Section 3.3.3. We have shown earlier in Section 3.5 that this algorithm produces

the shortest TILT instruction schedules for our benchmarks. The top 10 predicted TILT-System designs

ranked by compute density are compiled in Quartus to obtain their actual layout areas. Throughput

is obtained through cycle-accurate simulation in ModelSim 10.1d, assuming all input data is initially

available in off-chip DDR3 memory. The configuration with the highest compute density is then selected

as our best TILT-System design.

Benchmark     FU Mix         Threads   Mem Banks   Depth/Bank   W/R Ports   Insn Mem (Width x Depth)
BSc           2-3-1-1-1-1    64        4           1024         2-4         476 x 933
HDR           2-2-1-1-0-0    64        4           512          2-4         318 x 304
Mandelbrot    1-2*           8         1           256          3-6         140 x 103
HH            3-2-2-2-0-0    32        4           1024         2-4         467 x 611
FIR 64-tap    1-1-0-0-0-0    2         1           512          2-4         76 x 144

Table 6.6: Top TILT configurations with highest compute density for each benchmark. FU Mix: AddSub/Mult/Div/Sqrt/ExpCmp/LogAbs. *Mandelbrot: AddSubLoopUnit/MultCmp.

Table 6.6 provides the top (most computationally dense) TILT configurations for the benchmarks

of Table 6.3. The Fetcher configurations used with these TILT designs are provided in Table 6.7. As

shown in Figure 3.1, external write and read ports determine the number of 256-bit words that can

be written to or read from the TILT cores per cycle. Similarly, the W and R depths in Table 6.7 are those of the incoming and outgoing data FIFOs respectively. Appropriate FIFO depths are a function of the

bandwidth requirements of TILT-SIMD which can be accurately predicted when the number of TILT

cores, threads, port dimensions and the number of words each thread reads or writes to external memory

(Table 6.4) are known.

The external bandwidth of the FIR benchmark after the filter coefficients are initially loaded is low:

2 input and output 256-bit words every 144 cycles (refer to Tables 6.4 and 6.7). However, the relatively


deep data FIFOs of 128 words are selected to prefetch the coefficients ahead of time and to communicate with DDR in bursts. Further, since the M20K BRAMs used to build the FIFOs support a minimum depth of 512, using a smaller depth does not save area.

Benchmark     Ext W/R Ports   W/R FIFO Depths (256-bit words)
BSc           1-1             2048 / 1024
HDR           2-1             1024 / 512
Mandelbrot    1-1             256 / 128
HH            1-1             512 / 512
FIR 64-tap    2-1             128 / 128

Table 6.7: Fetcher designs for the top TILT configurations of Table 6.6.

6.3 TILT Core and System Customization

We seek to maximize the TILT-System’s computational throughput per unit area by customizing the

TILT core and system architecture to the compute and external memory bandwidth requirements of the

application. This is made possible by the wide range of configuration options that are available to the

designer, which we demonstrate in this section. Table 6.8 presents the improvement in compute density achieved with the tuned TILT configurations of Table 6.6 relative to a baseline TILT-System with a single TILT core composed of one of each required FU, one thread, and one data memory bank with 2 read and 1 write ports.

TILT-System with 1 TILT core (Top / Base)
Benchmark     Throughput (M tps)   Area (eALMs)      Compute Density (M tps/10k eALMs)
BSc           16.3 / 0.48          13,930 / 5,567    11.7 / 0.87
HDR           46.9 / 1.54          8,231 / 3,128     57 / 4.91
Mandelbrot    0.76 / 0.14          3,320 / 1,987     2.3 / 0.69
HH            11.8 / 0.61          13,718 / 3,590    8.6 / 1.71
FIR 64-tap    3.8 / 2.06*          2,674 / 1,909     14 / 10.8*

Table 6.8: Comparison of the top TILT-System designs with the minimal area baseline. *FIR throughput is in M inputs/sec.

As an example, the minimal BSc TILT-System in Table 6.8 with a single TILT core computes 0.48

M tps at the cost of 5,567 eALMs. Adding 63 threads and 3 data memory banks with 2 extra read and 1 extra write ports improves throughput by 17x by executing many threads and more operations in parallel, but requires 2x more area. An extra AddSub and 2 more Mult FUs improve throughput by a further 2x at a cost of 1.1x more area, resulting in an overall 13.5x improvement in compute density. We can continue to improve the throughput of the TILT core, with diminishing returns, by adding more FUs, threads, data memory banks and/or read and write ports to allow more operations to issue and complete every cycle, but at an increasingly higher area cost that causes compute density to decrease beyond this point.
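As a worked check using the reported BSc values from Table 6.8:

\[ \text{density}_{\text{top}} = \frac{16.3\ \text{M tps}}{13{,}930 / 10^4\ \text{eALMs}} \approx 11.7\ \text{M tps per 10k eALMs}, \qquad \frac{11.7}{0.87} \approx 13.5\times. \]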

In Chapter 4, we added several application-specific customization options to TILT to improve


the compute density of the TILT-System beyond that which can be achieved with the TILT architecture

developed by Ovtcharov and Tili [13, 14]. For example, besides customizing the TILT core’s mix of

standard FUs, the addition of the new LoopUnit FU for the Mandelbrot benchmark significantly reduces

the size of the instruction memory, requiring 4 BRAMs for the top Mandelbrot design in Table 6.8 instead

of the 527 BRAMs that would be needed if the loop were fully unrolled. The required memory bank depth is also reduced from 512 words to 256. However, the compute schedule becomes 35% less dense because operations can no longer be intermixed across loop boundaries or scheduled together across loop iterations, contributing to a 20% drop in throughput. Overall, the LoopUnit improves compute density

by 6.1x, while consuming only 9.8 ALMs.

Similarly for the top FIR design in Table 6.6, a conventional TILT core without the shift-register

addressing mode would require 64 reads and writes per thread between the computation of each output, resulting in a 415-cycle instruction schedule. Adding the mode costs only 54 eALMs but shortens

the schedule to 144 cycles, improving the overall compute density of the TILT-System by 2.9x.
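Because the added shift-register hardware is tiny, the density gain tracks the schedule-length reduction almost exactly:

\[ \frac{415\ \text{cycles}}{144\ \text{cycles}} \approx 2.9\times. \]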

As illustrated by the tuning of the BSc TILT core to maximize compute density and the optionally

generated application-dependent custom units to more efficiently handle loops (Mandelbrot) and indirect

addressing (FIR), the TILT core presents multiple degrees of freedom to the designer to reduce area or

to increase throughput and compute density. The significant performance improvement that can be

achieved demonstrates the value of our Predictor tool. The minimal TILT-Systems in Table 6.8 are

small, consuming between 1.9k and 5.6k eALMs.

Beyond customizing the TILT core, we can improve the throughput and compute density further by

connecting multiple area-efficient TILT cores to be executed in SIMD. This is preferable to increasing

the throughput by making the TILT core larger, which results in a less computationally dense design.

The improvement can be observed for the TILT-Systems in Table 6.9 where the TILT core of Table

6.8 is replicated 7 more times, allowing us to achieve near-linear growth in throughput. The area cost

of the single TILT instruction memory (provided in Table 6.11) and the Fetcher (Table 6.7) becomes

amortized, causing the overall increase in compute density.

6.4 TILT-System Performance

Table 6.9 presents the compute throughput, FPGA layout area and compute density of the densest

TILT-Systems with 8 identical TILT cores. The TILT and Fetcher configuration parameters used by

these designs are provided in Tables 6.6 and 6.7 respectively.

TILT-System with 8 TILT cores
Benchmark     Fmax (MHz)   Tput (M tps)   Area (eALMs)   Compute Density (M tps/10k eALMs)
BSc           220          121            95,893         12.6
HDR           223          359            51,163         70.1
Mandelbrot    246          6.1            19,051         3.2
HH            215          90             96,073         9.4
FIR 64-tap    270          30*            12,187         24.6*

Table 6.9: TILT-System performance numbers of the densest TILT configurations of Table 6.6. *FIR throughput is in M inputs/sec.

The Fmax reported in Table 6.9 is that of TILT-SIMD. The Fmax of the FIR filter is


much higher than that of the other benchmarks due to the small size of each TILT core, requiring only

a single AddSub and a single Mult FU, with a per-core area cost of only 1,429 eALMs on average (Table 6.11).

The critical path which limits the Fmax of the TILT cores typically lies between the inputs of an FU

and the read crossbar. For the FIR filter, this is the case with the AddSub FU.

The Fmaxes of the Fetcher designs and their area costs are provided in Table 6.10. The Fetcher

consumes a relatively small amount of area compared to TILT-SIMD, ranging from 934 eALMs for the Mandelbrot benchmark to 1,816 eALMs for HDR, which requires deeper data FIFOs and an additional external write port (Table 6.7).

Benchmark     Fmax (MHz)   Area (eALMs)
BSc           354          1,995
HDR           343          1,816
Mandelbrot    385          934
HH            386          1,413
FIR 64-tap    357          1,321

Table 6.10: The Fmax and area cost of the Fetcher designs of Table 6.7.

For the computationally dense TILT-Systems of Table 6.9, the area breakdown of the TILT cores is

presented in Table 6.11. The average area of the TILT core varies widely between 1,358 eALMs for FIR

and 11,713 eALMs for HH, showing that different benchmarks prefer quite different TILT cores. This

further highlights the utility of our Predictor as determining the top TILT-System design is non-trivial.

                         Average Area (eALMs) per Core
Benchmark     Insn Mem   Data Mem   FUs     Rd Xbar   Wr Xbar   Total eALMs   FUs/Total
BSc           961        2,560      5,016   3,070     971       12,578        40%
HDR           321        1,280      2,489   1,738     621       6,449         39%
Mandelbrot    161        720        1,049   396       80        2,406         44%
HH            961        2,560      4,777   3,391     985       12,674        38%
FIR 64-tap    81         320        624     296       108       1,429         44%

Table 6.11: Area breakdown of the TILT cores for the TILT-Systems in Table 6.9.

To achieve the most area-efficient design, our objective is to minimize the non-FU area, comprising

the crossbars and instruction and data memories. The purpose of these components is to keep the TILT

FUs busy while ideally consuming minimal area. The average FU area for our benchmarks is 41% of the

total. The percentage of FU area in the area-efficient TILT cores is similar across our benchmarks, varying only between 38% and 44%. The Fetcher accounts for only an average of 4.5% of the total

area of the 8-core TILT-System designs in Table 6.9.

6.5 Comparison with Altera’s OpenCL HLS

As discussed in Chapters 1 and 2, overlays such as TILT and HLS tools present two different ways to

improve FPGA designer productivity. Here we compare the productivity and performance of our best

TILT-System designs with that obtained using Altera’s OpenCL HLS tool [19]. We have chosen this tool

as our comparison point because recent studies have shown it can generate FPGA hardware with good

performance relative to other platforms. Chen and Singh report their OpenCL FPGA implementation


of a Fractal Video Compression algorithm is 3x faster than a high-end GPU while consuming only 12%

of the GPU's power [2]. They also demonstrate a large gain in productivity: their simplified, error-prone hand-coded FPGA implementation took a month to complete, compared to the few hours it took to develop a working OpenCL version.

The TILT application kernel, which defines the computation of a TILT thread inside a C function, is

very similar to our OpenCL kernel implementation, which defines the work to be performed by a group

of work-items. For our purposes, a work-item in OpenCL is analogous to a thread in TILT. Our TILT

threads and OpenCL work-items execute independently and perform the same computation.
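To make the analogy concrete, the sketch below shows a toy per-thread kernel in the C form TILT consumes; the kernel body, its name and its signature are illustrative, not one of our benchmarks. The corresponding OpenCL version would place the same body inside a __kernel function, with each work-item selecting its data using get_global_id(0).

    /* Illustrative toy kernel only (not one of the benchmarks). In TILT, this C
     * function defines the work of one thread; in OpenCL, the same body sits in a
     * __kernel function and each work-item indexes its data with get_global_id(0). */
    void toy_thread(const float *a, const float *b, float *out) {
        float x = a[0] * b[0] + a[1] * b[1];   /* per-thread computation */
        out[0] = x > 0.0f ? x : 0.0f;
    }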

Altera’s OpenCL HLS tool takes our OpenCL kernel as input to generate a custom pipelined accel-

erator on the FPGA that can be executed from a host CPU program. The OpenCL host program is

developed using Visual Studio 2013 and compiled into an executable with default Release configuration.

To obtain throughput numbers for each benchmark, the input data for the entire workload is flushed to DDR3 prior to kernel execution. This ensures a fair comparison

with TILT by excluding the host-accelerator transfer time. Elapsed time is obtained by capturing the

wall clock time before and after kernel execution using Windows high resolution timers with a precision

of <1 µs. For each benchmark, we increase the number of work-items in the OpenCL host program until

throughput saturates and we report this highest throughput number.
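A hedged sketch of the host-side measurement is shown below. The API calls are standard OpenCL and Win32 functions, but the surrounding setup (context, queue, kernel and global size) is assumed to exist and is elided, and the helper's name is our own.

    /* Sketch of the host-side throughput measurement: wall-clock time around the
     * kernel launch using Windows high-resolution timers. Queue, kernel and
     * global_size are assumed to have been set up earlier (elided here). */
    #include <windows.h>
    #include <CL/cl.h>

    double time_kernel_sec(cl_command_queue queue, cl_kernel kernel, size_t global_size) {
        LARGE_INTEGER freq, start, stop;
        QueryPerformanceFrequency(&freq);

        QueryPerformanceCounter(&start);
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL, 0, NULL, NULL);
        clFinish(queue);                     /* wait for all work-items to complete */
        QueryPerformanceCounter(&stop);

        return (double)(stop.QuadPart - start.QuadPart) / (double)freq.QuadPart;
    }
    /* Throughput in M threads/sec = work-items / elapsed seconds / 1e6. */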

6.5.1 Performance and Area

We begin by comparing the performance and area of the densest 8-core TILT-System designs in Table

6.9 with our best OpenCL HLS designs presented in Table 6.12. The Fmax reported in Table 6.12 is

that of the OpenCL kernel system. Both the TILT-System and OpenCL HLS designs achieve similar

Fmax values of over 200 MHz. The area of the Direct Memory Access (DMA) unit which interfaces the

OpenCL on-chip memory to DDR memory averages 4,367 eALMs in size, roughly 3x larger than our

relatively small Fetcher designs, which consume between 1k and 2k eALMs (Table 6.10).

OpenCL HLS - Kernel System with DMA
Benchmark     Fmax (MHz)   Tput (M tps)   Area (eALMs)   Compute Density (M tps/10k eALMs)
BSc           221          153            51,982         29.5
HDR           234          231            26,246         88.1
Mandelbrot    268          23             46,204         5.0
HH            236          116            50,571         23.0
FIR 64-tap    274          239*           51,577         46.4*

Table 6.12: OpenCL HLS performance numbers. *FIR throughput is in M inputs/sec.

Figure 6.1: Compute densities of the 8-core TILT-System designs normalized to that of OpenCL HLS, for BSc, HDR, Mandelbrot, HH and FIR 64-tap.


From the performance results presented in Tables 6.9 and 6.12, we note that the compute density

of our TILT-Systems is 41% (HH) to 80% (HDR) of the density of the OpenCL designs, as shown in

Figure 6.1. The TILT-System of the HDR application is the only TILT design that achieves a higher

compute throughput than OpenCL HLS. It obtains 359 M tps compared to the 231 M tps obtained by

the OpenCL HLS design. However, TILT achieves this high throughput at a higher area cost of 51.2k

eALMs compared to the 26.2k eALMs required by the OpenCL HLS design.

6.5.2 Designer Productivity

The runtimes of the TILT and OpenCL HLS tools for the designs in Tables 6.9 and 6.12 are summarized

in Table 6.13 and compared in Figure 6.2. For the initial setup of the TILT-System, the TILT and

Fetcher instruction schedules are first generated from the C kernel using the TILT compiler. Then the

densest TILT-System configuration recommended by our Predictor tool is synthesized into hardware

using Quartus, with the schedules loaded into the instruction memories during compilation. The initial

setup is dominated by the compilation of the overlay, with an average runtime of 28 mins.

                    TILT-System with 8 cores                          OpenCL
              Initial Setup (mins)                   Kernel Update    Kernel Compile
Benchmark     Kernel     Predictor    Overlay        (mins)           (mins)
BSc           0.0110     2.7          53             0.65             108
HDR           0.0025     0.4          31             0.63             86
Mandelbrot    0.0016     0.4          16             0.63             111
HH            0.0087     2.5          52             0.65             106
FIR 64-tap    0.0017     0.2          13             0.63             107
Geomean       0.0036     0.7          28             0.64             103

Table 6.13: Runtime of the TILT-System and OpenCL HLS tools (all times in minutes).

Figure 6.2: Runtime (in minutes) of the TILT-System and OpenCL HLS tools for each benchmark and the geomean. TILT bars break down into kernel, Predictor, overlay and kernel-update time; OpenCL bars show kernel compile time.

After a kernel code change that does not require the overlay to be recompiled, determined by running

the Predictor on the modified kernel, the instruction memories of the TILT and the Fetcher are updated

with the regenerated schedules, taking only 38 secs on average. By comparison, any change made to an

OpenCL kernel requires full recompilation, taking an average of 103 mins, a 163x increase. Moreover,

the fast runtime of our Predictor enables us to navigate TILT’s large design space and obtain a suitably


customized, high-performance design for an application much faster than would be possible through an

exhaustive search using Quartus.
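As a check on the reported speedup using the geomean values of Table 6.13:

\[ \frac{103\ \text{min} \times 60\ \text{s/min}}{38\ \text{s}} = \frac{6{,}180}{38} \approx 163\times. \]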

6.5.3 TILT-System and OpenCL HLS Scalability

Many real-world applications combine several heterogeneous compute systems with different throughput

requirements and need to fit them all on a chip with a finite area budget. In this section, we scale the

HH and FIR benchmarks to show how efficiently the TILT-System and OpenCL HLS scale up or down

to match compute requirements and area constraints.

Figure 6.3: Scaling HH on TILT-System and OpenCL HLS. (a) Throughput, (b) area and (c) compute density of the TILT-System (left, 1 to 20 TILT cores) and OpenCL HLS (right, replication factor 1 to 5), plotted up to 100% chip ALM utilization.

As we have demonstrated earlier in Section 6.4 with the 8-core TILT-System designs, connecting

multiple area-efficient TILT cores in SIMD allows us to achieve near-linear growth in throughput. In

Figure 6.3, we scale up the TILT-System of the HH benchmark beyond 8 cores up to chip capacity and

draw comparisons with the OpenCL HLS scalability results where the HH computation is replicated to

execute in parallel. The maximum size of our designs is limited by the ALMs available on our FPGA.

Each TILT core requires 19 DSPs for FUs, with each spatially pipelined instance in OpenCL HLS

requiring 143 DSPs.

In Figure 6.3(a), we observe near-linear scaling in the TILT-System throughput, with the small drop


from 16 to 20 cores caused by a drop in Fmax from 201 to 181 MHz. In Figure 6.3(b), the OpenCL HLS

design cannot scale below 51k eALMs while the TILT-System is able to scale down to 14k eALMs. In

Figure 6.3(c), the compute density of the TILT-System remains fairly constant, growing from 8.6 to 9.6

from 1 to 12 cores and then dropping to 8.3 at 20 cores. However, the TILT-System’s compute density

is also consistently lower than that of OpenCL HLS across all design solutions.

From Figure 6.3, we observe that the TILT-System offers a greater range of intermediate throughput

and area solutions than OpenCL HLS, allowing any number of cores between 1 and 20 to be instantiated

for the HH benchmark. These solutions range from small designs with a throughput of 12 M tps at a cost of 14k eALMs to large designs with a throughput of 190 M tps and an area cost of 230k

eALMs. In comparison, OpenCL HLS provides only 5 design choices, with a narrower throughput range

between 116 and 288 M tps and a medium to large area cost of 51k to 210k eALMs. The wider, more

flexible range of the TILT-System designs is one of the benefits of the TILT architecture.

Figure 6.4: Implementing 64-tap FIR on OpenCL HLS with different numbers of DSPs (or stages): (a) fully spatial 64-tap FIR design with 64 DSPs, (b) 64-tap FIR with a single DSP and 64-to-1 input muxes, and (c) 64-tap FIR with two DSPs and 32-to-1 muxes.

In Figure 6.5, we study the OpenCL HLS compiler’s ability to scale down our spatially pipelined

64-tap FIR design. The 64-tap FIR filter is normally implemented as a fully spatial compute pipeline,

as illustrated in Figure 6.4(a). This implementation with 64 stages achieves very good performance for

relatively low area on OpenCL HLS. However, if such a high throughput is not required, we may wish to

use fewer compute resources (DSPs) to save area. Figure 6.4(b) depicts a single compute unit performing

the same 64-tap FIR computation. The compute unit is shared by all 64 filter coefficients to produce

an output every 64 cycles. The throughput achieved with this design is very low, and little area is saved due to the large muxes required at the inputs of the compute unit.

Using two compute units arranged as two pipeline stages, as shown in Figure 6.4(c), allows an

output to be computed every 32 cycles. The first stage takes 32 cycles to apply the first 32 coefficients

before forwarding the sum to the second stage and receiving a new input to compute. Smaller muxes

are required at the inputs of the compute units, but there are two sets of them and additional muxing is

required to forward the output of one stage to another. The throughput of this design is slightly better

but it is still very low relative to that of the spatial design.
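Ignoring memory stalls and pipeline fill, a design that shares the 64 coefficients across N stages produces one output every 64/N cycles, so its throughput is roughly

\[ \text{throughput} \approx F_{\max} \cdot \frac{N}{64}\ \text{inputs/sec}, \]

which explains why the one- and two-stage designs remain far below the fully spatial (N = 64) pipeline even before their muxing overhead is counted.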

Between 4 and 32 stages, the muxing overhead grows with the number of stages, resulting in an

overall growth in area, with the design consuming 86% of the chip’s ALMs at 32 stages. The drop in

Fmax from 221 MHz at 4 stages to a low of 142 MHz at 32 stages also results in sub-linear growth in throughput over that range. As seen in Figure 6.5(b), the net result is a low compute density for all but


the fully spatial design, where the overheads of forwarding the output of a compute unit back to its input and selecting between multiple coefficients are eliminated. We conclude that OpenCL HLS has difficulty producing area-efficient, lower-throughput systems, especially for spatially pipelined computations.

Figure 6.5: Scaling 64-tap FIR on the TILT-System (1 to 120 TILT cores) and OpenCL HLS (1 to 64 stages): (a) throughput (M inputs/sec) vs. area cost and (b) compute density (M inputs/sec per 10k eALMs) vs. area cost, with system area measured in thousands of eALMs.

In comparison, the small FIR TILT configuration of Table 6.6 provides near-linear scaling in through-

put and area for up to 70% of the chip's ALMs, as also shown in Figure 6.5. The TILT-System was scaled

by connecting multiple TILT-SIMDs to the Fetcher in parallel, each with a maximum of 24 TILT cores,

to eliminate the bandwidth bottleneck at the widthconv module. In Figure 6.5(b), the sharp increase in

the TILT compute density from 1 to 8 cores is due to the amortization of the Fetcher area. With each

TILT-SIMD requiring a maximum of 6 input and generating 6 output words per compute iteration of

144 cycles (Table 6.4), data FIFO depths of 512 words were sufficient for the Fetcher.

The TILT-System architecture enables small, area-efficient design choices with modest throughput,

scaling down to 2.7k eALMs with a single TILT core compared to the 52k eALMs of the spatial OpenCL

design. The OpenCL design is smallest at two stages, consuming 35k eALMs but has a low throughput of

8.3 M inputs/sec. For the same area, a 25 core TILT-System achieves a throughput of 96 M inputs/sec.

Further, we are able to exceed the throughput of 239 M inputs/sec of the spatial OpenCL FIR design

with roughly 70 TILT cores but at 1.9x more area.

6.5.4 TILT-System vs. OpenCL HLS Summary

HLS tools generate application-specific, custom hardware from a software specification. Conversely,

overlays map the same software kernel onto software-programmable hardware. Unlike HLS, overlays

provide versatility and improve designer productivity by supporting multiple software functions for the

same hardware. This means the overlay can be quickly reprogrammed instead of requiring re-synthesis

after a kernel code change. However, this flexibility comes with a performance and area overhead.

The highly configurable TILT overlay is intended to be both application-tunable and software-

programmable. We also added several enhancements to TILT to more efficiently target specific applica-

tion types. These optionally enabled enhancements further reduce the performance and area gap between

the TILT overlay and OpenCL HLS approaches. The addition of the LoopUnit FU for the Mandelbrot

and the shift-register mode for the FIR filter improve the compute density of the TILT-Systems by

6.1x and 2.9x respectively relative to the standard TILT core. Overall, our tuned TILT-System designs

achieve 41% to 80% of the compute density of our best OpenCL HLS designs.

The scalability results of the HH and FIR benchmarks demonstrate the TILT-System’s ability to


explore throughput and area regions that OpenCL HLS designs are unable to reach. Further, we can

configure the TILT-System for higher throughput or lower area without requiring the application code to

be changed, with the baseline configuration being very small (Table 6.8). The runtime of our Predictor

tool, which is used to select the densest TILT-System design, is 42 seconds on average. Regenerating the

TILT and Fetcher schedules after a kernel code change and reprogramming the instruction memories of

an already instantiated TILT-System design takes 38 seconds on average. In contrast, OpenCL kernels

must be recompiled into hardware to observe the changes in performance and area after any application

code change, taking an average of 103 minutes, lengthening design time considerably.

We recommend Altera's OpenCL HLS for the generation of high-throughput systems. OpenCL HLS

maximizes throughput at the cost of more resources by generating a heavily pipelined, spatial design

and by executing many threads in parallel. Deep FIFOs buffer the thread data and stream it into the

compute units. The resulting designs are large, about 3.2x (HDR) to 19x (FIR) bigger than the top

(most area-efficient) single-core TILT-System designs. If a kernel requires more modest throughput, OpenCL HLS has difficulty generating an area-efficient, lower-throughput system. Therefore, when a low

to moderate throughput is sufficient, we instead recommend the TILT architecture as it is capable of

generating smaller but still computationally dense designs. The top TILT-Systems with a single core

require 2.7k to 14k eALMs (with 1.9k to 5.6k for minimal designs) compared to 26k to 52k eALMs for

the smallest, computationally dense OpenCL HLS systems. Hence, we see the TILT overlay paradigm

as a useful complement to OpenCL HLS.


Chapter 7

Conclusion

The TILT overlay is an area-efficient method to implement shared-operator, application-customizable

execution units. We extend the TILT architecture of [13–15] to allow the use of off-chip memory with

our scalable and small Memory Fetcher. We also enable the generation of more area-efficient designs

with new custom units to support loops and indirect addressing. These are optionally generated for

applications that benefit from them, improving compute density by 6.1x for the Mandelbrot and 2.9x for

the FIR applications respectively. Our new Predictor tool, with an average runtime of 42 seconds, also gives designers the ability to quickly explore a large design space of throughput and area trade-offs without synthesizing the overlay. Further, the TILT and Fetcher components that

comprise the TILT-System can be configured for higher throughput or lower area without requiring the

application code to be changed. Configuration of an application onto an existing TILT-System design

is also fast, taking an average of 38 seconds.

7.1 Future Work

TILT as a Custom Accelerator for OpenCL HLS

We would like to integrate the TILT overlay as an optional OpenCL HLS accelerator component and

evaluate the performance of the resulting system. A few small TILT cores can be used to perform

specialized calculations within a larger computation, enabling OpenCL HLS to take advantage of TILT's

high FU reuse capability and its ability to scale down to very small implementations.

Compute Acceleration with Application-Specific Custom FUs

TILT currently supports the Sqrt, Exp, Log and Abs compute operations via function calls in the C

kernel. Custom datapaths with arbitrary pipeline latency intended to accelerate segments of the target

application can be supported by defining a new function identifier and latency mapping in the TILT

architecture configuration file. Provided the new custom FU has at most 2 input and 1 output data ports, no changes to TILT's compiler flow or architecture are necessary. The FU may also optionally accept the opcode bit vector that is presently used to modify the behaviour of an FU. In this case, the compiler will need to be updated to identify and encode the appropriate FU behaviour.
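As a hypothetical example (the function name, the configuration-entry syntax and the latency value below are all assumptions of this sketch, not the actual TILT configuration format), a custom FU could be exposed to the kernel as an ordinary C function call:

    /* Hypothetical example of exposing a custom FU to a TILT C kernel.
     * Hypothetical architecture-configuration entry:
     *   custom_fu  name=SinCos  latency=12  inputs=2  outputs=1
     * On TILT, the compiler would map fu_sincos() to that datapath; the body
     * below is only a host-side reference model so the sketch is runnable. */
    #include <math.h>

    static float fu_sincos(float angle, float select) {
        return select > 0.5f ? cosf(angle) : sinf(angle);
    }

    void kernel_thread(const float *in, float *out) {
        /* The custom FU is invoked like any built-in operation (cf. the Sqrt/Exp calls). */
        out[0] = fu_sincos(in[0], 0.0f) * in[1];
    }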

Custom FUs can be desirable if more application-specific compute acceleration is required. The


area cost may also be reduced due to a smaller custom FU and/or requiring fewer standard FUs to be

instantiated, which produces both a smaller FU array and smaller crossbars. We leave the study of how

much compute density can be gained from implementing custom FUs to accelerate frequently performed

computations of a target application as future work.

Tuning the TILT-System to a Group of Applications

The scope of our thesis was limited to tuning the TILT-System architecture to a single application.

However, most applications can be grouped into categories that are representative of their compute and

memory access patterns [67]. This observation suggests we can customize the TILT-System to a class of

applications, with a single configuration that will perform reasonably well across multiple benchmarks.

This is useful because it will reduce the need to recompile the TILT-System if the computation is

modified or replaced with a similar application. Instead, we will only need to regenerate the schedules

and update the instruction memories, which is much faster than recompiling the overlay (Table 6.13).

This investigation and extending our Predictor to a mix of application kernels are left as future work.

Software Pipelining with Partially Unrolled Loops

For kernels containing loops, we have presented two scheduling extremes: either the loop is fully unrolled

and the operations across loop iterations are scheduled together or we only schedule a single iteration

of the loop using the LoopUnit FU. There exists a trade-off between producing a denser schedule by

unrolling the loop and a deeper instruction memory required to store the longer schedule. We can instead

partially unroll the loop using the LLVM compiler prior to generating the DFG and schedule several loop

iterations together to improve the utilization of the FUs and the achieved compute density. Since our

external memory scheduling algorithms take advantage of memory bank ports that are not used by the

compute operations, a denser compute schedule may not always result in the best overall throughput.

Hence, scheduling of the memory operations will partly dictate the best unroll factor.
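For instance (illustrative only), an unroll factor of 2 could be requested from LLVM with a loop pragma so that two iterations' operations enter the DFG together, while the LoopUnit still manages the remaining trip count. The pragma shown is Clang/LLVM syntax; whether the LLVM version used in this work supports and forwards it is an assumption of this sketch.

    /* Illustrative sketch of partially unrolling the kernel loop before DFG
     * generation. The pragma below is Clang/LLVM loop-hint syntax; its use in
     * the TILT flow is an assumption of this sketch. */
    float accumulate(const float *x, int n) {
        float sum = 0.0f;
        #pragma clang loop unroll_count(2)   /* schedule 2 iterations together */
        for (int i = 0; i < n; i++)
            sum += x[i] * x[i];
        return sum;
    }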


Appendix

Appendix A: Quartus Settings for FUs

We summarize the Quartus settings used for the megafunction-generated FUs of the TILT architecture

and those used by our OpenCL HLS (High-Level-Synthesis) designs in Table A. All FUs are IEEE-754

compliant, supporting single-precision floating-point arithmetic.

Setting                 TILT                      OpenCL HLS
Denormal                no                        no
Optimize                speed                     speed
Reduced Functionality   no                        no
Exception               yes (where applicable)    no

Table A: Quartus FU settings for TILT and OpenCL HLS.

Appendix B: Quartus Settings for DDR Controller

The Quartus settings used by the TILT and Fetcher architecture and our OpenCL HLS designs to

interface with DDR3 on the Nallatech 385 D5 board are summarized in Table B.

Setting                TILT              OpenCL HLS
Speed Grade            2                 2
Memory Clock Freq      400 MHz           400 MHz
Memory Vendor          Micron            Micron
Memory Format          Unbuffered DIMM   Unbuffered DIMM
Interface Width        64                72
Row Address Width      14                15
Column Address Width   10                12
Bank Address Width     3                 3
Avalon Word Width      256               512
Avalon Burst Size      64                32

Table B: Quartus DDR3 Controller settings for TILT and OpenCL HLS.


Bibliography

[1] J. Richardson, S. Fingulin, D. Raghunathan, C. Massie, A. George, and H. Lam, “Comparative analysis of HPC and accelerator devices: Computation, memory, I/O, and power,” in HPRCTA. IEEE, 2010, pp. 1–10.

[2] D. Chen and D. P. Singh, “Fractal video compression in OpenCL: An evaluation of CPUs, GPUs, and FPGAs as acceleration platforms,” in ASP-DAC, 2013, pp. 297–304.

[3] K. Sano, Y. Hatsuda, and S. Yamamoto, “Multi-FPGA accelerator for scalable stencil computation with constant memory bandwidth,” Parallel and Distributed Systems, vol. 25, no. 3, pp. 695–705, 2014.

[4] J. Cassidy, L. Lilge, and V. Betz, “Fast, power-efficient biophotonic simulations for cancer treatment using FPGAs,” in FCCM. IEEE, May 2014, pp. 133–140.

[5] A. Putnam, A. Caulfield, E. Chung, et al., “A reconfigurable fabric for accelerating large-scale datacenter services,” in ISCA, June 2014.

[6] IBM. (2014) IBM PureData system. [Online]. Available: http://www-01.ibm.com/software/data/puredata/analytics/index.html

[7] IBM. (2014) WebSphere DataPower SOA appliances. [Online]. Available: http://www-03.ibm.com/software/products/en/datapower/

[8] C. H. Chou, A. Severance, A. D. Brant et al., “VEGAS: Soft vector processor with scratchpad memory,” in FPGA. ACM, 2011, pp. 15–24.

[9] A. Severance and G. Lemieux, “VENICE: A compact vector processor for FPGA applications,” in FCCM. IEEE, 2012, pp. 245–245.

[10] N. Kapre and A. DeHon, “VLIW-SCORE: Beyond C for sequential control of SPICE FPGA acceleration,” in FPT. IEEE, 2011, pp. 1–9.

[11] R. Dimond, O. Mencer, and W. Luk, “CUSTARD - a customisable threaded FPGA soft processor and tools,” in FPL. IEEE, 2005, pp. 1–6.

[12] J. Coole and G. Stitt, “Fast and flexible high-level synthesis from OpenCL using reconfiguration contexts,” Micro, vol. 34, no. 1, pp. 42–53, 2013.

[13] K. Ovtcharov, “TILT: A horizontally-microcoded, highly-configurable and statically scheduled soft processor family,” Master's thesis, University of Toronto, in progress.


[14] I. Tili, “Compiling for a multithreaded horizontally-microcoded soft processor family,” Master's thesis, University of Toronto, Nov 2013.

[15] K. Ovtcharov, I. Tili, and J. G. Steffan, “TILT: A multithreaded VLIW soft processor family,” in FPL, Sept 2013, pp. 1–4.

[16] A. Papakonstantinou, K. Gururaj, J. A. Stratton et al., “FCUDA: Enabling efficient compilation of CUDA kernels onto FPGAs,” in Application Specific Processors. IEEE, 2009, pp. 35–42.

[17] A. Canis, J. Choi, M. Aldham et al., “LegUp: An open-source high-level synthesis tool for FPGA-based processor/accelerator systems,” TECS, vol. 13, no. 2, p. 24, 2013.

[18] T. Feist, “Vivado design suite,” White Paper, 2012.

[19] Altera. (2014) Altera SDK for OpenCL. [Online]. Available: http://www.altera.com/literature/lit-opencl-sdk.jsp

[20] R. Rashid, J. G. Steffan, and V. Betz, “Comparing performance, productivity and scalability of the TILT overlay processor to OpenCL HLS,” in FPT. IEEE, December 2014.

[21] Altera. (2014) Nios II processor. [Online]. Available: http://www.altera.com/literature/lit-opencl-sdk.jsp

[22] Xilinx. (2014) MicroBlaze soft processor. [Online]. Available: www.xilinx.com/tools/microblaze.htm

[23] P. Yiannacouras, J. G. Steffan, and J. Rose, “VESPA: Portable, scalable, and flexible FPGA-based vector processors,” in CASES. ACM, 2008, pp. 61–70.

[24] J. Yu, C. Eagleston, C. H.-Y. Chou, M. Perreault, and G. Lemieux, “Vector processing as a soft processor accelerator,” TRETS, vol. 2, no. 2, p. 12, 2009.

[25] M. Flynn, “Some computer organizations and their effectiveness,” Computers, vol. 100, no. 9, pp. 948–960, 1972.

[26] P. Yiannacouras, J. G. Steffan, and J. Rose, “Fine-grain performance scaling of soft vector processors,” in CASES. ACM, 2009, pp. 97–106.

[27] M. Labrecque and J. G. Steffan, “Improving pipelined soft processors with multithreading,” in FPL. IEEE, 2007, pp. 210–215.

[28] R. Moussali, N. Ghanem, and M. A. Saghir, “Microarchitectural enhancements for configurable multi-threaded soft processors,” in FPL. IEEE, 2007, pp. 782–785.

[29] D. J. Dewitt, “A machine independent approach to the production of optimized horizontal microcode,” 1976.

[30] D. Gajski, “No-instruction-set-computer processor,” Sep. 17 2004, US Patent App. 10/944,365.

[31] M. A. Saghir, M. El-Majzoub, and P. Akl, “Datapath and ISA customization for soft VLIW processors,” in ReConFig. IEEE, 2006, pp. 1–10.

[32] K. Shagrithaya, K. Kepa, and P. Athanas, “Enabling development of OpenCL applications on FPGA platforms,” in ASAP. IEEE, 2013, pp. 26–30.


[33] Khronos OpenCL Working Group. (2009, October) The OpenCL specification. [Online]. Available: http://www.khronos.org/registry/cl/specs/opencl-1.0.48.pdf

[34] Nvidia, “CUDA programming guide,” 2014. [Online]. Available: http://docs.nvidia.com/cuda/cuda-c-programming-guide/

[35] Z. Zhang, Y. Fan et al., “AutoPilot: A platform-based ESL synthesis system,” in High-Level Synthesis. Springer, 2008, pp. 99–112.

[36] J. Fang, A. L. Varbanescu, and H. Sips, “A comprehensive performance comparison of CUDA and OpenCL,” in ICPP. IEEE, 2011, pp. 216–225.

[37] G. Kalokerinos, V. Papaefstathiou, G. Nikiforos, et al., “FPGA implementation of a configurable cache/scratchpad memory with virtualized user-level RDMA capability,” in SAMOS. IEEE, 2009, pp. 149–156.

[38] P. R. Panda, N. D. Dutt, and A. Nicolau, “Efficient utilization of scratch-pad memory in embedded processor applications,” in Design and Test. IEEE Computer Society, 1997, pp. 7–11.

[39] J. Choi, K. Nam, A. Canis, et al., “Impact of cache architecture and interface on performance and area of FPGA-based processor/parallel-accelerator systems,” in FCCM. IEEE, 2012, pp. 17–24.

[40] A. Putnam, S. Eggers, D. Bennett, et al., “Performance and power of cache-based reconfigurable computing,” in SIGARCH, vol. 37, no. 3. ACM, 2009, pp. 395–405.

[41] P. Nalabalapu and R. Sass, “Bandwidth management with a reconfigurable data cache,” in Parallel and Distributed Processing Symposium. IEEE, 2005, pp. 159–167.

[42] G. Stitt, G. Chaudhari, and J. Coole, “Traversal caches: A first step towards FPGA acceleration of pointer-based data structures,” in Hardware/Software Codesign and System Synthesis. ACM, 2008, pp. 61–66.

[43] V. Mirian and P. Chow, “FCache: A system for cache coherent processing on FPGAs,” in FPGA. ACM, 2012, pp. 233–236.

[44] E. S. Chung, J. C. Hoe, and K. Mai, “CoRAM: An in-fabric memory architecture for FPGA-based computing,” in FPGA. ACM, 2011, pp. 97–106.

[45] E. S. Chung, M. K. Papamichael, G. Weisz, J. C. Hoe, and K. Mai, “Prototype and evaluation of the CoRAM memory architecture for FPGA-based computing,” in FPGA. ACM, 2012, pp. 139–142.

[46] M. Adler, K. E. Fleming, A. Parashar, M. Pellauer, and J. Emer, “LEAP scratchpads: Automatic memory and cache management for reconfigurable logic,” in FPGA. ACM, 2011, pp. 25–28.

[47] C. E. LaForest and J. G. Steffan, “Efficient multi-ported memories for FPGAs,” in FPGA. ACM, 2010, pp. 41–50.


[48] M. Abdallah and K. Al-Dajani, “Hardware predication for conditional instruction path branching,” Jun. 22 2004, US Patent 6,754,812. [Online]. Available: http://www.google.com/patents/US6754812

[49] Writing an LLVM pass. [Online]. Available: http://llvm.org/docs/WritingAnLLVMPass.html

[50] The LLVM instruction reference. [Online]. Available: http://llvm.org/docs/LangRef.html#instruction-reference

[51] The LLVM compiler infrastructure. Version 3.1, accessed Nov. 2013. [Online]. Available: http://llvm.org

[52] R. Cytron, J. Ferrante, B. K. Rosen, M. N. Wegman, and F. K. Zadeck, “Efficiently computing static single assignment form and the control dependence graph,” TOPLAS, vol. 13, no. 4, pp. 451–490, 1991.

[53] Altera. (2014) OpenCL design examples. [Online]. Available: http://www.altera.com/support/examples/opencl/opencl.html

[54] S. Mann and R. W. Picard, “On being ‘undigital’ with digital cameras: Extending dynamic range by combining differently exposed pictures,” in IS&T, 1995, pp. 442–448.

[55] A. L. Hodgkin and A. F. Huxley, “A quantitative description of membrane current and its application to conduction and excitation in nerve,” vol. 117, no. 4. Blackwell Publishing, 1952, p. 500.

[56] C. E. LaForest and J. G. Steffan, “OCTAVO: An FPGA-centric processor family,” in FPGA. ACM, 2012, pp. 219–228.

[57] J. E. Smith, “Decoupled access/execute computer architectures,” Computer Architecture News, vol. 10, no. 3, pp. 112–119, 1982.

[58] N. C. Crago and S. J. Patel, “OUTRIDER: Efficient memory latency tolerance with decoupled strands,” in Computer Architecture News, vol. 39, no. 3. ACM, 2011, pp. 117–128.

[59] F. Black and M. Scholes, “The pricing of options and corporate liabilities,” The Journal of Political Economy, pp. 637–654, 1973.

[60] A. R. James Lebak and E. Wong. (2006) HPEC challenge benchmark suite. [Online]. Available: http://www.omgwiki.org/hpec/files/hpec-challenge/

[61] D. Lewis and T. Vanderhoek, “Stratix V block areas,” personal communication, January 2014.

[62] Altera. (2013, November) Floating-point megafunctions user guide. [Online]. Available: http://www.altera.com/literature/ug/ug_altfp_mfug.pdf

[63] M. Lam, “Software pipelining: An effective scheduling technique for VLIW machines,” in ACM SIGPLAN Notices, vol. 23, no. 7. ACM, 1988, pp. 318–328.

[64] Altera. (2014, April) Stratix V device overview. [Online]. Available: http://www.altera.com/literature/hb/stratix-v/stx5_51001.pdf


[65] Altera. (2013, May) SCFIFO and DCFIFO megafunctions. [Online]. Available: http://www.altera.com/literature/ug/ug_fifo.pdf

[66] Altera. (2014, January) Logic array blocks and adaptive logic modules in Stratix V devices. [Online]. Available: http://www.altera.com/literature/hb/stratix-v/stx5_51002.pdf

[67] K. Asanovic, R. Bodik, B. C. Catanzaro et al., “The landscape of parallel computing research: A view from Berkeley,” Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, 2006.