
MARC: A Many-Core Approach to Reconfigurable Computing

Ilia Lebedev, Shaoyi Cheng, Austin Doupnik, James Martin, Christopher Fletcher, Daniel Burke, Mingjie Lin, and John Wawrzynek

Department of EECS, University of California at Berkeley, CA 94704

Abstract—We present a Many-core Approach to Reconfigurable Computing (MARC), enabling efficient high-performance computing for applications expressed using parallel programming models such as OpenCL. The MARC system exploits abundant special FPGA resources such as distributed block memories and DSP blocks to implement complete single-chip, high-efficiency many-core microarchitectures. The key benefits of MARC are that it (i) allows programmers to easily express parallelism through the API defined in a high-level programming language, (ii) supports coarse-grain multithreading and dataflow-style fine-grain threading while permitting bit-level resource control, and (iii) greatly reduces the effort required to re-purpose the hardware system for different algorithms or different applications. A MARC prototype machine with 48 processing nodes was implemented using a Virtex-5 (XCV5LX155T-2) FPGA for a well-known Bayesian network inference problem. We compare the runtime of the MARC machine against a manually optimized implementation. With fully synthesized application-specific processing cores, our MARC machine comes within a factor of 3 of the performance of a fully optimized FPGA solution, but with a considerable reduction in development effort and a significant increase in retargetability.

Keywords-Reconfigurable Computing; Many-Core; FPGA; Compiler; Performance; Throughput.

I. INTRODUCTION

Reconfigurable devices such as FPGAs exhibit huge potential for exploiting application-specific parallelism and performing power-efficient computation. As a result, the overall performance of FPGA-based solutions is often significantly higher than that of CPU-based ones [1], [2]. Unfortunately, truly unleashing an FPGA's performance potential usually requires cumbersome HDL programming and laborious manual optimization. Specifically, programming FPGAs demands skills and techniques well outside the application-oriented expertise of many developers, thus forcing them to step beyond their traditional programming abstractions and embrace hardware design concepts such as clock management, state machines, pipelining, and device-specific memory management. Furthermore, wide acceptance in the marketplace requires binary compatibility across a range of implementations, yet the current crop of FPGAs requires lengthy reimplementation for each new chip version, even within the same FPGA family.

These observations raise a natural question: for a given class of computation-intensive applications, is it possible to build a reconfigurable computing machine constrained to resemble a many-core computer, program it using a high-level imperative language such as C/C++, and yet still achieve orders-of-magnitude performance gains relative to conventional computing means? If so, such a methodology would 1) allow programmers to easily express parallelism through the API defined in a high-level programming language, 2) support coarse-grain multithreading and dataflow-style fine-grain threading while permitting bit-level resource control, and 3) greatly reduce the effort required to repurpose the implemented hardware platform for different algorithms or different applications.

Intuitively, constraining reconfigurable architectures would likely result in performance degradation compared to fully customized solutions. As depicted in Fig. 1, from the lowest-performing platform to the highest, the amount of effort required for application mapping increases significantly. We contend that there is a sizable design space, the gray area in Fig. 1, between hand-optimized FPGA solutions and general-purpose processors that warrants systematic exploration. To that end, this work attempts to show that, although the MARC approach trails hand-optimized FPGA platforms in performance and efficiency, a disciplined approach with architectural constraints, and without resorting to cumbersome HDL programming and time-consuming manual optimization, may win overwhelmingly in terms of hardware portability and reduced design time and effort. We believe MARC, our proposed Many-core Approach to Reconfigurable Computing, to be a first step toward finding the right computational abstractions to characterize a wide range of reconfigurable devices, expose a uniform view to the programmer, capture computation in a manner that diverse hardware implementations can exploit efficiently, and leverage ongoing work in parallel programming.

[Figure: plot placing GPP, GPU, MARC, FPGA (HDL), and ASIC along two axes, ease-of-design and performance, each running from low to high; MARC occupies the region between general-purpose processors and hand-optimized FPGAs.]
Figure 1. Landscape of modern computing platforms: ease of application design and implementation vs. performance.

We believe that the key to retaining the performance efficiency of FPGAs is to allow the many-core architecture to be customized on a per-application basis. We think of the architecture model as a template with a set of parameters to be chosen based on characteristics of the target application. Our understanding of which aspects of the architecture to parameterize continues to evolve as we investigate different application mappings. However, the obviously important parameters are: the number of processing cores, core arithmetic width, core pipeline depth, richness and topology of the interconnection network, customization of cores (from the addition of specialized instructions to fixed-function datapaths), and details of the cache and local store hierarchies. In this study we explore a part of this parameterization space and compare the performance of a customized many-core FPGA implementation to a hand-optimized version.
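One way to read this template is as a configuration record. The following hypothetical C struct simply names the parameters listed above; the paper does not define such a structure, so field names and types here are assumptions for illustration.

    /* Hypothetical configuration record for the MARC template. */
    enum topology { MESH, HTREE, CROSSBAR, TORUS, RING };

    struct marc_config {
        unsigned n_acores;        /* number of processing cores        */
        unsigned core_width;      /* core arithmetic width, in bits    */
        unsigned pipeline_depth;  /* core pipeline depth               */
        unsigned threads_per_core;/* hardware threads per A-core       */
        enum topology network;    /* interconnect topology             */
        int custom_datapath;      /* nonzero: fully customized A-cores */
    };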

The rest of the paper is organized as follows: Section II introduces the target application, a Bayesian network inference problem; Section III describes the high-level architecture of the MARC system; finally, in Section IV, we illustrate a prototype of the MARC machine using a Virtex-5 FPGA on a BEE3 system and compare the performance of several different MARC variants with that of a manually optimized FPGA solution as well as a general-purpose processor (GPP) implementation.

II. APPLICATION: BAYESIAN NETWORK INFERENCE

We evaluated MARC against a Bayesian network (BN) inference implementation called ParaLearn [3]. BNs are statistical models that capture conditional independences between nodes via the local Markov property: a node is conditionally independent of its non-descendants, given its parents. Bayesian inference is the process of finding a BN's structure from quantitative data ("evidence") taken for the BN. Once a BN's structure (made up of nodes $V_1, \ldots, V_n$) is known, and each node $V_i$ has been conditioned on its parent set $\Pi_i$, the joint distribution over the nodes becomes the product:

$$P(V_1, \ldots, V_n) = \prod_{i=1}^{n} P(V_i \mid \Pi_i)$$
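For instance, for a three-node chain $V_1 \rightarrow V_2 \rightarrow V_3$ (so $\Pi_1 = \emptyset$, $\Pi_2 = \{V_1\}$, and $\Pi_3 = \{V_2\}$), the factorization reads:

$$P(V_1, V_2, V_3) = P(V_1)\,P(V_2 \mid V_1)\,P(V_3 \mid V_2).$$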

We chose to compare MARC and ParaLearn for two primary reasons. First, ParaLearn is a computationally intensive application believed to be particularly well suited for FPGA acceleration, as illustrated by [3]. Second, our group, in a collaboration with Stanford University, has expended significant effort over the previous two years developing several generations of a hand-optimized FPGA implementation tailored for ParaLearn [3]. Therefore, we have not only a concrete reference design but also well-corroborated performance results for fair comparisons with a manually optimized FPGA implementation.

This paper compares ParaLearn's kernel, the "order sampler", against various MARC configurations. The order sampler takes as input prior evidence (D) of the BN's structure and produces a set of "high-scoring" BN orders. The other steps in the BN inference workflow, known as the pre-processing, graph sampling, and post-processing [3] steps, are outside the scope of this work. We chose to implement the order sampler as it has the highest asymptotic complexity of any step in the workflow.

The order sampler uses Markov chain Monte Carlo (MCMC) sampling to perform an iterative random walk in the space of BN orders. Per iteration, a "proposed" order $\prec$ is (1) formed by exchanging two nodes within the current order, (2) scored, and (3) accepted or rejected based on the Metropolis-Hastings rule. The score of a proposed order is given by [4]:

$$\mathrm{Score}(\prec \mid D) = \prod_{i=1}^{n} \sum_{\Pi_i \in \Pi_{\prec}} \mathrm{LocalScore}(V_i, \Pi_i; D, G)$$

for a network of $n$ nodes, where the inner-most loop is an accumulation over each node's local scores (a statistical representation of the evidence) whose corresponding parent sets are compatible with the given order. Generally, steps 1–3 repeat until a domain expert is confident that the high-scoring orders will yield networks that represent D.
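To make the iteration structure concrete, the following is a minimal C sketch of the three-step loop described above. The helper score_order() is hypothetical, and treating scores in the log domain is an assumption for illustration; the paper does not give the sampler's code.

    #include <math.h>
    #include <stdlib.h>

    #define N_NODES 16

    /* Hypothetical scorer: evaluates the (log-domain) score of an order,
     * i.e., the product-of-sums structure in the equation above. */
    extern double score_order(const int order[N_NODES]);

    static double rand01(void) { return rand() / (double)RAND_MAX; }

    void order_sampler(int order[N_NODES], int iterations)
    {
        double cur = score_order(order);
        for (int it = 0; it < iterations; it++) {
            /* (1) propose: exchange two nodes within the current order */
            int a = rand() % N_NODES, b = rand() % N_NODES;
            int t = order[a]; order[a] = order[b]; order[b] = t;

            /* (2) score the proposed order */
            double prop = score_order(order);

            /* (3) Metropolis-Hastings rule for a symmetric proposal:
             * accept with probability min(1, exp(prop - cur)) */
            if (rand01() < fmin(1.0, exp(prop - cur))) {
                cur = prop;                                       /* accept */
            } else {
                t = order[a]; order[a] = order[b]; order[b] = t;  /* reject */
            }
        }
    }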

III. THE MARC ARCHITECTURE

A. Many-Core Template

The overall architecture of a MARC system, as illustrated in Fig. 2, resembles a scalable, many-core-style processor architecture comprising one Control Processor (C-core) and multiple Algorithmic Processing Cores (A-cores). Both the C-core and the A-cores can be implemented as conventional pipelined RISC processors. However, unlike embedded processors commonly found in modern FPGAs, the processing cores in MARC are completely parameterized, with variable bit-width, reconfigurable multithreading, and even aggregate/fused instructions. Furthermore, A-cores can alternatively be synthesized as fully customized datapaths. For example, in order to hide global memory access latency, improve processing node utilization, and increase the overall system throughput, a MARC system can perform fine-grained multithreading through shift register insertion and automatic retiming. Finally, while each processing core possesses a dedicated local memory accessible only to itself, a MARC system has a global memory space, implemented as distributed block memories, accessible to all processing cores through the interconnect network. Communication between a MARC system and its host can be realized by reading and writing global memory.

[Figure: block diagram of a MARC machine: a C-core and multiple A-cores with per-core memories and processing logic, a scheduler, and an interconnect network connecting them to shared memory blocks.]
Figure 2. Diagram of key components in a MARC machine.


B. Execution Model and Software Infrastructure

Our MARC system builds upon both LLVM, a production-grade open-source compiler infrastructure [5], and OpenCL (Open Computing Language) [6], a framework for writing programs that execute across heterogeneous platforms consisting of GPPs, GPUs, and other accelerators. Fig. 3 presents a high-level schematic of a typical MARC machine. A user application runs on a host according to the models native to the host platform, a high-performance PC in our study. Execution of a MARC program occurs in two parts: kernels that run on one or more A-cores of the MARC devices, and a control program that runs on the C-core. The control program defines the context for the kernels and manages their execution. During execution, the MARC application spawns kernel threads to run on the A-cores, which operate either as SIMD units (executing in lockstep with a single stream of instructions) or as SPMD units (each processing core maintains its own program counter).
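As a concrete illustration of this execution model, here is a minimal sketch of a standard OpenCL 1.0 host program of the kind MARC targets. The kernel body is a placeholder rather than ParaLearn's scorer, and the paper does not show MARC's host-side code; error handling is omitted for brevity.

    #include <CL/cl.h>
    #include <stdio.h>

    int main(void)
    {
        cl_platform_id plat;
        cl_device_id dev;
        clGetPlatformIDs(1, &plat, NULL);
        clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, NULL);

        /* The control program defines the context for the kernels
         * and manages their execution. */
        cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
        cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

        /* Placeholder kernel: one thread per element. */
        const char *src =
            "__kernel void scale(__global const float *in,\n"
            "                    __global float *out) {\n"
            "    size_t i = get_global_id(0);\n"
            "    out[i] = 2.0f * in[i];\n"
            "}\n";
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
        cl_kernel k = clCreateKernel(prog, "scale", NULL);

        float in[192], out[192];
        for (int i = 0; i < 192; i++) in[i] = (float)i;
        cl_mem din = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                    sizeof in, in, NULL);
        cl_mem dout = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof out, NULL, NULL);
        clSetKernelArg(k, 0, sizeof din, &din);
        clSetKernelArg(k, 1, sizeof dout, &dout);

        /* Spawn 192 kernel threads, mirroring the 192-thread
         * configuration used later in this paper. */
        size_t global = 192;
        clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(q, dout, CL_TRUE, 0, sizeof out, out, 0, NULL, NULL);
        printf("out[1] = %f\n", out[1]);
        return 0;
    }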

[Figure: schematic showing the Host-MARC interface feeding a kernel scheduler with kernel and results queues; a MIPS-style control core with instruction, data, and boot memories plus a memory map; per-core thread counters, local schedulers, and private and local memories; all attached to global memory.]
Figure 3. Schematic of a MARC machine's implementation.

C. Application-Specific Processing Core

One strength of MARC is its capability to integrate fully customized application-specific processing cores/datapaths so that the kernels in an application can be executed more efficiently. To this end, a high-level synthesis flow, depicted in Fig. 4, was developed to generate customized datapaths for a target application.

[Figure: CAD flow from a kernel written in C, through llvm-gcc, to SSA IR; predication then yields a data flow graph; datapath generation (instruction mapping, datapath pipelining) and scheduling/control generation (loop scheduling, multithreading) follow; the output is HDL code for the customized datapath.]
Figure 4. CAD flow for synthesizing application-specific processing cores.

The original kernel source code in C/C++ is first compiled by llvm-gcc to generate the intermediate representation (IR) in static single assignment (SSA) form, which yields a control flow graph where instructions are grouped into basic blocks. Within each basic block, instruction parallelism can be extracted easily, as all false dependencies have been removed in the SSA representation. Between basic blocks, control dependencies can then be transformed into data dependencies through branch predication. In our implementation, only memory operations are predicated, since they are the only instructions that can generate stalls in the pipeline. By converting the control dependencies to data dependencies, the boundaries between basic blocks can be eliminated. This results in a single data flow graph in which each node corresponds to a single instruction in the IR. Creating hardware from this graph involves a one-to-one mapping between each instruction and various pre-determined hardware primitives. Finally, the customized cores have the original function arguments converted into inputs. In addition, a simple set of control signals is created so that cores can be started and can signal the completion of their execution. For memory accesses within the original code, each non-aliasing memory pointer used by the C function is mapped to a memory interface capable of accommodating variable memory access latency. The integration of the customized cores into a MARC machine involves mapping the inputs of the cores to memory addresses accessible by the control core, as well as the addition of a memory handshake mechanism allowing cores to access global and local memories. For the results reported in this paper, loop pipelining and predication are done manually, but a fully automated flow from C to HDL is currently under development in our group.
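The following hypothetical C fragment sketches what the predication step does to a conditional store; the kernel is illustrative, and the comments describe the IR-level transformation rather than any tool API.

    /* Illustrative input to the flow in Fig. 4 (not ParaLearn's kernel). */
    void kernel_body(const int *in, int *out, int n)
    {
        for (int i = 0; i < n; i++) {
            int v = in[i];
            /* The branch below is a control dependence in the CFG. After
             * predication, the store becomes "store (v*2) to out[i] if p",
             * where p = (v > 0) flows as an ordinary data value. Non-memory
             * ops such as v*2 execute unconditionally, so the basic-block
             * boundary disappears from the single data flow graph. */
            if (v > 0)
                out[i] = v * 2;
        }
    }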

D. Host-MARC Interface

Gigabit Ethernet is used to implement the communication link between the host and the MARC device. We leveraged the GateLib [7] project from Berkeley to implement the host interface, allowing the physical transport to be easily replaced by a faster medium in the future.

E. Memory Organization

In a MARC machine, threads executing a kernel have access to three distinct memory regions: private, local, and global. Global memory permits read and write access to all threads within any executing kernel on any processing core (ideally, reads and writes to global memory may be cached, depending on the capabilities of the device; our current MARC implementation, however, does not support caching). Local memory is a section of the address space shared by the threads within a computing core. This memory region can be used to allocate variables that are shared by all threads spawned from the same computing kernel. Finally, private memory is a memory region dedicated to a single thread. Variables defined in one thread's private memory are not visible to another thread, even when they belong to the same executing kernel.
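These three regions correspond directly to OpenCL's address-space qualifiers. Below is a minimal OpenCL C kernel sketch exercising all three; the reduction pattern is illustrative and is not ParaLearn's actual kernel.

    /* Each thread accumulates privately, threads within a core combine
     * through local memory, and one thread per core publishes the
     * result to global memory. */
    __kernel void partial_sum(__global const float *scores,   /* global  */
                              __global float *per_core_sum,
                              __local  float *shared,         /* local   */
                              const int chunk)
    {
        float acc = 0.0f;                    /* private: this thread only */
        const int t = get_local_id(0);
        const int base = (int)get_global_id(0) * chunk;
        for (int i = 0; i < chunk; i++)
            acc += scores[base + i];

        shared[t] = acc;                     /* visible within the core   */
        barrier(CLK_LOCAL_MEM_FENCE);

        if (t == 0) {                        /* one thread per core       */
            float total = 0.0f;
            for (int i = 0; i < (int)get_local_size(0); i++)
                total += shared[i];
            per_core_sum[get_group_id(0)] = total;  /* visible to all    */
        }
    }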

Physically, the private memory regions in a MARC system are implemented in distributed LUT RAMs, while local memory and part of global memory reside in block RAM (BRAM). To permit a larger memory space, we also allow external memory to be used as part of the global memory region. To increase the number of global memory ports, we use both ports of each BRAM block separately, exposing each BRAM as two smaller single-port memories. Admittedly, the achievable aggregate memory access bandwidth inside an FPGA is often far below its peak value, and the available amount of memory is small in comparison with other platforms, such as a modern GPU. Nevertheless, the flexibility of the FPGA enables the MARC approach to use application-specific access patterns in order to achieve high memory bandwidth.

F. Kernel Scheduler

To achieve high throughput, kernels must be scheduled to avoid memory access conflicts. The MARC system allows for a globally aware kernel scheduler, which can orchestrate the execution of kernels and control access to shared resources. The global scheduler is controlled via a set of memory-mapped registers, which are implementation-specific. This approach allows a range of schedulers, from simple round-robin or priority schedules to complex problem-specific scheduling algorithms.
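The sketch below shows what driving such a scheduler from the C-core might look like. The base address, register layout, and bit fields are hypothetical, since the paper says only that the registers are memory-mapped and implementation-specific.

    #include <stdint.h>

    /* Hypothetical memory-mapped scheduler registers. */
    #define SCHED_BASE   ((volatile uint32_t *)0x40000000u)
    #define SCHED_NTHRD  (SCHED_BASE + 0)   /* threads per gang          */
    #define SCHED_CTRL   (SCHED_BASE + 1)   /* bit 0: start the gang     */
    #define SCHED_STATUS (SCHED_BASE + 2)   /* bit 0: gang has completed */

    /* Gang up thread starts at a coarse grain (as the ParaLearn MARC
     * machine does), then poll for completion. */
    static void start_gang(uint32_t nthreads)
    {
        *SCHED_NTHRD = nthreads;
        *SCHED_CTRL  = 1u;                  /* dispatch the gang */
        while ((*SCHED_STATUS & 1u) == 0u)  /* spin until done   */
            ;
    }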

The MARC machine optimized for ParaLearn uses the global scheduler to dispatch threads at a coarse grain (ganging up thread starts). The use of the global scheduler is therefore rather limited, as the problem does not greatly benefit from a globally aware approach to scheduling.

G. System Interconnect

One of the key advantages of reconfigurable computing is the ability to exploit application-specific communication patterns in the hardware system. MARC allows the network to be selected from a library of various topologies, such as mesh, H-tree, crossbar, or torus. Application-specific communication patterns can thus be exploited by providing low-latency links along common routes.

The MARC machine optimized for ParaLearn explores two topologies: a pipelined crossbar and a ring, as shown in Fig. 5. The pipelined crossbar makes no assumptions about the communication pattern of the target application; it is a non-blocking network that provides uniform latency to all locations in the global memory address space. Due to the large number of endpoints on the network, the crossbar is limited to 120 MHz with 8 cycles of latency.

The ring interconnect implements only nearest-neighbor links, thereby providing very low latency access to some locations in global memory while requiring multiple hops for other accesses. Nearest-neighbor communication is important in the accumulation phase of ParaLearn and helps reduce overall latency. Moreover, this network topology is significantly more compact and can be clocked at a much higher frequency: as high as 300 MHz in our implementations.
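As a rough back-of-the-envelope comparison at the stated clock rates, and assuming a single cycle per ring hop (a figure the paper does not state), the best-case access latencies work out to:

$$t_{\text{crossbar}} = \frac{8\ \text{cycles}}{120\ \text{MHz}} \approx 66.7\ \text{ns}, \qquad t_{\text{ring, 1 hop}} \approx \frac{1\ \text{cycle}}{300\ \text{MHz}} \approx 3.3\ \text{ns},$$

which suggests why nearest-neighbor accumulation benefits strongly from the ring, while multi-hop accesses can erode that advantage.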

[Figure: two system diagrams, each showing a C-core and several A-cores with memory blocks, memory-mapping logic, a scheduler, and a link to the host; nodes are connected (a) in a ring and (b) through a pipelined crossbar.]
Figure 5. System diagram of a MARC system. (a) Ring network. (b) Pipelined crossbar.

IV. MARC IMPLEMENTATION AND PERFORMANCE

A. Hardware Prototyping

The prototype MARC implementation for this study comprises one C-core and 48 A-cores implemented on a single Virtex-5 (XCV5LX155T-2) of a BEEcube BEE3 module. While the C-core is implemented as a conventional 4-stage RISC processor, all A-cores are application-specific/customized with multithreading support. Each A-core normally executes multiple concurrent threads to saturate the long cycles in the application dataflow graph and to maintain high throughput. In this study, we implemented single-threaded, two-way multithreaded, and four-way multithreaded A-cores. When individually instantiated, the cores are clocked at 119 MHz, 180 MHz, and 260 MHz, respectively. However, they achieve only 105 MHz, 160 MHz, and 148 MHz, respectively, in the completely assembled MARC system, due to high FPGA resource utilization.

The placed-and-routed MARC prototype with the RISC A-cores is shown in Fig. 6 (left panel), and the system with application-specific A-cores is shown in Fig. 6 (right panel), with the main components highlighted. Constrained by long CAD tool run-times (about 20 hours), our MARC implementations use 84% and 71% of the hardware, respectively, while the full-custom FPGA implementation of ParaLearn utilizes about 92% of total chip resources.

Figure 6. FPGA layouts after placement and routing.


As in other computing platforms, memory accesses significantly impact the overall performance of a MARC system. In the current MARC implementation, private and local memory accesses take exactly one cycle, while global memory accesses typically involve longer, network-dependent latencies. Such discrepancies between local and global memory access latencies, we believe, provide ample opportunities for memory optimization and performance improvements, especially considering the hardware flexibility of the MARC system manifested in application-specific processing cores and customized interconnect networks. This benefit becomes even more pronounced when local memory accesses constitute the majority of all memory accesses, as in ParaLearn.

B. Mapping ParaLearn onto the MARC Machine

The ParaLearn order sampler comprises a C-core to control the main loop and A-cores to implement the scoring step. Per iteration, the C-core performs the node swap operation, broadcasts the proposed order, and applies the Metropolis-Hastings check. These operations take up a negligible amount of time relative to the scoring process.

Scoring is composed of 1) the parent set compatibility check, and 2) an accumulation across all compatible parent sets. Step 1 must be performed for every parent set; its performance is limited by how many parent sets can be accessed simultaneously. We store parent sets in BRAMs that serve as A-core private memory, and are therefore limited by the number of A-cores and the attainable A-core throughput. Step 2 must first be carried out independently by each A-core thread, then across A-core threads, and finally across the A-cores themselves. We serialize cross-thread and cross-core accumulations. Each accumulation is implemented with a global memory access.

The benchmark we chose consists of 16 nodes, each of which has 2517 parent sets. We divide each node's parent sets into a total of 12 chunks and allocate 12 threads per node. In the implementations surveyed, 48 cores are used to run 192 threads; thus, 3 cores are used per node and 4 threads are used per core.
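The partitioning arithmetic checks out:

$$16\ \text{nodes} \times 12\ \text{threads/node} = 192\ \text{threads}, \qquad \frac{192\ \text{threads}}{4\ \text{threads/core}} = 48\ \text{cores}, \qquad \frac{48\ \text{cores}}{16\ \text{nodes}} = 3\ \text{cores/node},$$

with each thread scanning roughly $2517 / 12 \approx 210$ parent sets per iteration.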

Because step 2 in the scoring process depends on the global memory latency, network customization is key to improving performance. Clustering each node's 3 A-cores is beneficial when local memory is used for storing scores across all 3 A-cores. With this setup, it takes only 12 cycles to initially write the 12 threads' scores. Furthermore, since one core can be given exclusive access to local memory, all further memory accesses are single-cycle.

Clustering also reduces the size of the global network and the number of FIFOs needed to decouple the cores and the network. This optimization greatly reduces area consumption and increases clock frequency, especially in the case of the largest 4-way multithreaded cores.

Table I. Naming convention for MARC machines in this study.

Alias       Description
MARC-Rgen   RISC A-core with generic network
MARC-Ropt   RISC A-core with optimized network
MARC-C1     Customized A-core
MARC-C2     Customized A-core (2-way MT)
MARC-C4     Customized A-core (4-way MT)
MARC-C1c    Clustered customized A-core
MARC-C2c    Clustered customized A-core (2-way MT)
MARC-C4c    Clustered customized A-core (4-way MT)

C. Performance Comparison and Analysis

We compare the performance of MARC machines with and without application-specific customized processing cores. The A-core variations and their associated names are listed in Table I.

We benchmark the order scoring algorithm against the manually optimized FPGA solution, as well as against a conventional microprocessor (GPP) reference implementation.

The runtime of each hardware platform, absolute and relative to the FPGA reference implementation, is shown in Table II. We also list the LUT utilization (the "Device Utilization" column) along with performance normalized by the LUT utilization (the "Relative Area Eff." column). The performance relative to the full-custom FPGA implementation is also shown graphically in Fig. 7.
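Concretely, the "Relative Area Eff." column is relative performance divided by device utilization, renormalized so that the FPGA reference equals 1. For MARC-C4c, for example:

$$\text{Area Eff.} = \frac{0.3492 / 0.57}{1.0000 / 0.92} \approx 0.5636,$$

matching the table.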

It is clear that using RISC A-cores achieves only about 5% of the performance of the FPGA reference implementation, even with optimization of the interconnect topology (a ring network versus a pipelined crossbar). Customizing the A-cores, however, yields a significant leap in performance, moving MARC to within an order of magnitude of the performance of the reference FPGA implementation. Further optimizing the A-cores through clustering and multithreading significantly accelerates the accumulation phase of the order sampling algorithm, allowing MARC to perform within a factor of 3 of the reference. Furthermore, when we normalize for LUT utilization (an approximation of chip area), MARC performs within a factor of 2 of the hand-optimized FPGA reference design.

Although the main objective of this work is to compare various implementations using the many-core approach to reconfigurable computing, we also benchmark an implementation on a general-purpose processor. We do not claim that the GPP reference implementation is fully optimized; rather, it is included to give a rough idea of MARC's performance relative to a non-reconfigurable platform on this algorithm. The GPP implementation used in this study was written in OpenCL and run on a 3.33 GHz Intel Core i7 975 (Nehalem) with 4 cores, using 1 hardware thread per core (with a 32 KB L1 and a 256 KB L2 cache per core).


Figure 7. Performance comparison to the full-custom FPGA implementation.

Table II. Performance comparison between MARC, FPGA, and GPP.

Configuration    Device Util.  Execution Time (µs)  Relative Perf.  Relative Area Eff.
GPP Reference    n/a           350                  0.0055          n/a
MARC-Rgen        0.90          58.48                0.0327          0.0334
MARC-Ropt        0.84          38.00                0.0503          0.0551
MARC-C1          0.55          10.89                0.1754          0.2935
MARC-C2          0.63          7.76                 0.2462          0.3595
MARC-C4          0.71          9.93                 0.1924          0.2493
MARC-C1c         0.46          9.40                 0.2033          0.4066
MARC-C2c         0.53          6.77                 0.2819          0.4894
MARC-C4c         0.57          5.47                 0.3492          0.5636
FPGA Reference   0.92          1.91                 1.0000          1.0000

V. CONCLUSION

MARC offers a new methodology for designing FPGA-based computing systems by combining a many-core architectural template, a high-level imperative programming model [6], and modern compiler technology [5] to efficiently target FPGAs for general-purpose, compute-intensive applications. The primary objective of our work is to evaluate a many-core architecture as an abstraction layer (or execution model) for FPGA-based computation. We are motivated by the recent renewed interest and efforts in parallel programming for emerging many-core platforms, and feel that finding an efficient many-core abstraction for FPGAs would apply the advances in parallel programming to reconfigurable computing. Of course, constraining an FPGA to an execution template reduces the flexibility of implementation, and therefore the potential for performance. However, we hypothesize that much of the potential loss in efficiency is recoverable through per-application customization of the many-core system. This paper outlines our initial efforts to quantify this tradeoff for one real-world application (Bayesian inference).

We have demonstrated that performance competitive with a highly optimized FPGA solution is attainable via a productive abstraction (days versus months of development time). Despite these results, the effectiveness of MARC remains to be investigated further; we are limited by our ability to produce many high-quality custom FPGA solutions in a variety of domains. Nonetheless, we plan to expand this study, surveying more applications and improving the many-core template in a systematic way. We are optimistic that a MARC-like approach will open new frontiers for high-performance reconfigurable computing.

VI. ACKNOWLEDGMENTS

This project was supported by the NIH, grant no. 130826-02, the DoE*, award no. DE-SC0003624, and the Berkeley Wireless Research Center. We would also like to thank the members of the Berkeley Reconfigurable Computing group for contributing ideas and discussions surrounding this work, and the Stanford Nolan Lab for the benchmark data set.

REFERENCES

[1] M. Lin, I. Lebedev, and J. Wawrzynek, "High-throughput Bayesian computing machine with reconfigurable hardware," in FPGA '10: Proceedings of the 18th Annual ACM/SIGDA International Symposium on Field Programmable Gate Arrays. New York, NY, USA: ACM, 2010, pp. 73–82.

[2] M. Lin, I. Lebedev, and J. Wawrzynek, "OpenRCL: From sea-of-gates to sea-of-cores," in Proceedings of the IEEE 20th International Conference on Field Programmable Logic and Applications, 2010.

[3] N. Bani Asadi, C. W. Fletcher, G. Gibeling, E. N. Glass, K. Sachs, D. Burke, Z. Zhou, J. Wawrzynek, W. H. Wong, and G. P. Nolan, "ParaLearn: A massively parallel, scalable system for learning interaction networks on FPGAs," in ICS '10: Proceedings of the 24th ACM International Conference on Supercomputing. New York, NY, USA: ACM, 2010, pp. 83–94.

[4] N. Friedman and D. Koller, "Being Bayesian about network structure," in UAI '00: Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence. San Francisco, CA, USA: Morgan Kaufmann, 2000, pp. 201–210.

[5] C. Lattner and V. Adve, "LLVM: A compilation framework for lifelong program analysis & transformation," in Proceedings of the 2004 International Symposium on Code Generation and Optimization (CGO'04), Palo Alto, California, Mar. 2004.

[6] Khronos OpenCL Working Group, The OpenCL Specification, version 1.0.29, 8 December 2008. [Online]. Available: http://khronos.org/registry/cl/specs/opencl-1.0.29.pdf

[7] G. Gibeling et al., "GateLib: A library for hardware and software research," Tech. Rep., 2010.

*Support from the DoE does not constitute the DoE's endorsement of the views expressed in this paper.
