
Research Article
Multi-Softcore Architecture on FPGA
International Journal of Reconfigurable Computing, Volume 2014, Article ID 979327, 13 pages, http://dx.doi.org/10.1155/2014/979327

Mouna Baklouti and Mohamed Abid

CES Laboratory, ENIS, University of Sfax, 3038 Sfax, Tunisia

Correspondence should be addressed to Mouna Baklouti; [email protected]

Received 28 June 2014; Revised 29 September 2014; Accepted 15 October 2014; Published 27 November 2014

Academic Editor: Gokhan Memik

Copyright © 2014 M. Baklouti and M. Abid. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

To meet the high performance demands of embedded multimedia applications, embedded systems are integrating multiple processing units. However, they are mostly based on custom-logic design methodology. Designing parallel multicore systems from available standard intellectual properties while maintaining high performance is also a challenging issue. Softcore processors and field programmable gate arrays (FPGAs) are a cheap and fast option to develop and test such systems. This paper describes an FPGA-based design methodology to implement a rapid prototype of parametric multicore systems. A study of the viability of building the SoC with the NIOS II soft-processor core from Altera is also presented. The NIOS II features a general-purpose RISC CPU architecture designed to address a wide range of applications. The performance of the implemented architecture is discussed, and several parallel applications are used to test the speedup and efficiency of the system. Experimental results demonstrate the performance of the proposed multicore system, which achieves better speedup than the GPU (29.5% faster for the FIR filter and 23.6% faster for the matrix-matrix multiplication).

1. Introduction

The requirements for higher computational capacity and lower cost of future high-end real-time data-intensive applications like multimedia and image processing (filtering, edge detection, correlation, etc.) are rapidly growing. As a promising solution, parallel programming and parallel systems-on-a-chip (SoC), such as clusters, multiprocessor systems, and grid systems, have been proposed. Such sophisticated embedded systems are required to satisfy high demands regarding energy, power, area, and cost efficiency. Moreover, many embedded systems are required to be flexible enough to be reused for different software versions. Multicore systems working in SIMD/SPMD (Single Instruction Multiple Data/Single Program Multiple Data) fashion have been shown to be powerful executers of data-intensive computations [1, 2] and prove very fruitful in the pixel processing domain [3, 4]. Their parallel architecture strongly reduces the amount of memory accesses and the clock speed, thereby enabling higher performance [5, 6]. Such multicore systems are made up of an array of processing elements (PEs) that synchronously execute the same program on different local data and can communicate through an interconnection network.

These parallel architectures are characterized by their regular design, which enables cost-effective scaling of the platform to different performance targets by simply increasing or reducing the number of processors. Although these architectures have accomplished a great deal of success in solving data-intensive computations, their high price and long design cycle have resulted in a very low acceptance rate [7]. Designing specific solutions is a costly and risky investment due to time-to-market constraints and high design and fabrication costs. Embedded hardware architectures have to be flexible to a great extent to satisfy the various parameters of multimedia applications. To overcome these problems, embedded system designers are increasingly relying on field programmable gate arrays (FPGAs) as target design platforms. Recent FPGAs provide a high level of logic element density [8]. They are replacing ASICs (Application Specific Integrated Circuits) in many products due to their increased capacity, smaller nonrecurring engineering costs, and programmability [9, 10]. It is easy to put many cores and memories on an FPGA device, which today has over 100 million transistors available. Such platforms allow creating increasingly complex multiprocessor system-on-a-chip (MPSoC) architectures. FPGAs also support the reuse of standard or custom IP (intellectual property) blocks, which further decreases system development time and costs.




Figure 1: Multicore architecture. The controller, with its data memory (DATAM) and parallel instruction memory (InstM), drives a grid of processing elements PE1..PEk; each PE has a local memory M and a routing element RE and is attached to the networks through a network interface (NI). Bidirectional neighbouring links and bidirectional global router links connect the PEs, the global router, and the I/O devices.

A good example is the use of softcores, such as the NIOS II [11] and the MicroBlaze [12] for Altera and Xilinx FPGAs, respectively, to implement FPGA-based embedded systems. Thus, the importance of multicore FPGA platforms steadily grows, since they facilitate the design of parallel systems. Based on configurable logic (softcores), such systems provide a good basis for parallel, scalable, and adaptive systems [13].

In a context of high performance, low technology cost, and application code reusability objectives, we target FPGA devices and softcore processors to implement and test the proposed multicore system. The focus of this work is to study the viability of a softcore replication design methodology to implement a multicore SPMD architecture dedicated to computing data-intensive signal processing applications. Three benchmarks representing different application domains were tested: FIR filter, spatial Laplacian filter, and matrix-matrix multiplication, characteristic of 1D digital signal processing, 2D image filtering, and linear algebra computations, respectively. All tested applications involve the computation of a large number of repetitive operations.

The remainder of this paper is organized as follows. The next section briefly introduces the multicore system with a focus on the proposed FPGA-based design methodology. In Section 3, the main characteristics of the NIOS II softcore processor are presented; the methodology followed to adapt the NIOS II IP to the parallel SoC requirements is also described. Section 4 discusses some experimental results showing the performance of the softcore-based parallel system. Section 5 gives an overview of related work.

A performance analysis and comparison of the implemented system with other architectures and a discussion of the architectural aspects are also provided. Finally, Section 6 presents the concluding remarks and further work.

2. Softcore-Based Replication Design Methodology

In this section, we first give an overview of our multicore architecture, including the different components used. Then, we describe the replication design methodology, using the NIOS II softcore as a case study.

2.1. Multicore SoC Overview. The proposed parallel multicore system works in SPMD fashion.

As presented in Figure 1, it consists of a controller that accesses its sequential data memory (DATAM). It synchronously controls the whole system, composed of a 2D grid of processing elements (PEs) interconnected via two types of interconnection networks thanks to network interfaces (NI). Indices j and k denote the number of columns and the total number of PEs (number_columns * number_rows), respectively. Index i is calculated by

i = ((number_rows - 1) * number_columns) + 1. (1)
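As a quick illustration of this indexing (a worked example of our own, not taken from the paper, assuming a 4 x 8 grid), the following C snippet evaluates j, k, and i as defined above:

#include <stdio.h>

int main(void)
{
    const int number_rows    = 4;   /* assumed grid height */
    const int number_columns = 8;   /* assumed grid width  */

    int j = number_columns;                            /* index of the last PE of the first row  */
    int k = number_rows * number_columns;              /* total number of PEs                    */
    int i = ((number_rows - 1) * number_columns) + 1;  /* equation (1): first PE of the last row */

    printf("j = %d, k = %d, i = %d\n", j, k, i);       /* prints: j = 8, k = 32, i = 25 */
    return 0;
}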

All processors execute the same instructions coming from a parallel instruction memory (InstM). The controller serves as a master and can only execute the sequential program, whereas each PE executes the same parallel program while operating on its local data. The execution of a given instruction depends on the activity of each PE, identified by a unique number in the array.



Table 1: Configurable parameters.
Parameter | Configuration time | Possible values | Default value
Number of PEs | Design | 2^N, N > 0 | 2^4
Size of PE local memory | Design | Limited by the FPGA surface | 5 KB
Neighbourhood network topology | Design and runtime | Mesh, Torus, and Xnet | Torus
Global router communication mode | Runtime | PE-PE, PE-Controller, PE-I/O Peripheral, and Controller-I/O Peripheral | PE-Controller

Table 2: Area occupancy of the network interfaces.
Resource | Neighbourhood network NI | Global router NI
ALUTs | 36 (0.0077% of 469,440) | 2 (0.0004% of 469,440)
Registers | 118 (0.0125% of 938,880) | 75 (0.0079% of 938,880)

In fact, each processor in the multicore system is characterized by its identity, which specifies its own job. The multiple processors can communicate via two types of networks: a neighbourhood network and a crossbar network (global router) that assures point-to-point communications between processors and also performs parallel I/O data transfers. The neighbourhood network interconnects the PEs in a NEWS fashion so that every PE can communicate with its nearest neighbour in a defined direction (north, south, east, or west). It mainly consists of a number of routing elements (RE), equal to the number of PEs, which are arranged in a grid fashion and are responsible for routing the data between PEs. The routers may be customized at compile time to implement three different topologies (Mesh, Torus, and Xnet). It is also possible to change dynamically from one topology to another according to the application needs. The global router assures the irregular communications between the different components and can be configured in four possible bidirectional communication modes: PE-PE, PE-Controller, PE-I/O Peripheral, and Controller-I/O Peripheral. These communication patterns are handled by the communication instructions (presented in Section 3.2) as defined by the designer. To achieve the required architecture performance and configurability, a fully parameterizable and structural description of the networks of the proposed architecture was done at the RTL (Register Transfer Level) using VHDL (Very High Speed Integrated Circuits Hardware Description Language).

The proposed multicore SoC is a parameterizable architecture template and thus offers a high degree of flexibility. Some of its parameters can be reconfigured at synthesis time (such as the number of PEs and the memory size), whereas other parameters, like the neighbourhood network topology, can be reconfigured at runtime. Table 1 details these different parameters.

This multicore prototype serves as a base to test different configurations with different processors. The designer just needs to use a given processor IP and adapt the network interface to that processor.

Such characteristics make the system scalable and flexible enough to satisfy a wide range of data-parallel application needs.

The following subsection presents the FPGA-based replication design methodology followed to build the multicore system. A case study based on the NIOS II softcore is then described.

2.2. Design Approach. To design a parallel multicore SoC, we propose in this paper a replication design methodology, which consists of using a single softcore to implement the different processors in the multiprocessor array. In this manner, the design process is easier and faster than with traditional design methods and offers a large gain in development time. An additional advantage of this methodology is the ease of programming, since the same instruction set is used for all processors. The aim is to propose a flexible, programmable, and customizable architecture to accelerate parallel applications.

Figure 1 illustrates the multicore architecture based on the replication design methodology. The designer has to carefully respect the system requirements in order to integrate a given processor IP. In the proposed system, the controller is in charge of controlling the global router and of transferring data to, or reading data from, the PEs. The PE executes parallel instructions and communicates with neighbouring PEs through the neighbourhood network, or with the controller and peripherals through the global router. To accomplish these different functionalities, the controller has to be interfaced with the global router, whereas the PE has to be interfaced with the neighbouring network as well as the global router (as shown in Figure 1). In both cases, the networks only need the data and address signals issued by the different processors. Table 3 details the global router network interface with the processors. It is scalable so that it can be adapted to a parametric number of PEs; this is performed by mapping each PE's data and requests to input ports configured as vectors of length equal to the number of PEs. The neighbourhood network interface only contains data and addresses coming from or going to PEs. It consists of a number of routers (equal to the number of PEs), each one connected to one PE. Connections between the routers allow transferring data through several paths across the architecture. Both implemented on-chip networks are reliable, in the sense that data sent from one core is guaranteed to be delivered to its destination core. The results obtained for the network interface hardware cost, on the Stratix V FPGA [14], are summarized in Table 2.



Table 3: Global router interface.
Direction | Signal | Width | Description
In | pe_d_in[k] | 32 b | PE data (32-bit data vector, length = number of PEs)
In | pe_a_in[k] | 32 b | PE address (32-bit address vector, length = number of PEs)
In | pe_wr[k] | 1 b | PE read/write (1-bit R/W vector, length = number of PEs)
In | Ctrl_d_in | 32 b | Controller data (32 bits)
In | Ctrl_a_in | 32 b | Controller address (32 bits)
In | Ctrl_wr | 1 b | Controller read/write signal
Out | pe_d_out[k] | 32 b | PE data out
Out | pe_a_out[k] | 32 b | PE address out
Out | Ctrl_d_out | 32 b | Controller data out
Out | Ctrl_a_out | 32 b | Controller address out
I/O Peripherals | SRAM_d_out | 32 b | SRAM data out
I/O Peripherals | SRAM_a_out | 32 b | SRAM address out
I/O Peripherals | SDRAM_d_in | 32 b | SDRAM data in
I/O Peripherals | SDRAM_a_in | 32 b | SDRAM address in
I/O Peripherals | Acc_d_in | 32 b | HW accelerator data in
I/O Peripherals | Acc_a_in | 32 b | HW accelerator address in

The synthesis results given in Table 2 show that the implemented network adapters are a cheap solution, requiring only a very small amount of logic on the FPGA. This demonstrates the simplicity of the interface and confirms that changing the processor IP is an easy task; little design time is required to reconfigure the architecture.

The next section discusses more precisely the use of the NIOS II IP to implement a multicore configuration following the replication methodology.

3. NIOS II-Based Multicore Architecture Overview

3.1. Architecture Description. The NIOS II embedded softcore is a reduced instruction set computer optimized for Altera FPGA implementations. A block diagram of the NIOS II core is depicted in Figure 2 [11].

It can be configured at design time (data width, amount of memory, included peripherals, etc.) in order to be optimized for a given application. The most relevant configuration options are the clock frequency, the debug level, the performance level, and the user-defined instructions. The performance level lets the user choose one of three provided processors: the NIOS II/fast, which results in a larger processor that uses more logic elements but is faster and has more features; the NIOS II/standard, which creates a processor with a balanced trade-off between speed and area and some special features; and the NIOS II/economic, which generates a very economical processor in terms of area, but a very simple one in terms of data processing capability.

In this work, we use the NIOS II economic version in order to fit a large number of processors in one FPGA device, since we target a multicore SoC.

NIOS II processors use the Avalon bus for connecting memories and peripherals, so the implemented network interface is based on the Avalon interface. The Avalon interface basically creates a common interface from the different interfaces of all memory and peripheral components of the system. In this work, we use the Avalon Memory-Mapped interface (Avalon-MM) [15], an address-based read/write interface synchronous to the system clock. A wrapper, considered as the network interface, has been implemented in order to integrate the NIOS with the other components of the architecture, mainly the neighbourhood network and the global router. This wrapper is a custom component with two interfaces: one Avalon slave interface connected to the NIOS and one conduit interface (for exporting signals to the top level) connected to the networks. To assure communication, we only need the addresses and data coming from the processors as well as the read/write enable signal indicating whether the operation is a read or a write. The NIOS-based wrapper is responsible for transferring the data and addresses of the attached processor to the appropriate network (depending on the address coding). Moreover, it forwards the data and addresses received from the networks to the processor, depending on its identity number.

To design systems that integrate a NIOS II processor, Quartus II is complemented by a tool called Qsys [16]. Qsys is used to specify the NIOS II processor cores, memory, and the other components the system requires, and it automatically generates the interconnect logic to integrate these components in the hardware system. The NIOS II-based SoC synthesis flow is summarized in Figure 3.

In the implemented architecture, each processor is connected to a timer, which provides a periodic system clock tick, to a system ID peripheral, which uniquely identifies the PE, and to a local data memory.



Figure 2: NIOS II processor core block diagram. Required modules include the general-purpose and control registers, the arithmetic logic unit, the exception and internal interrupt controllers, the program/address generation logic, and the instruction and data buses; optional modules include instruction and data caches, tightly coupled instruction and data memories, shadow register sets, an external interrupt controller interface, a memory management unit with translation lookaside buffer, a memory protection unit, a JTAG debug module (connected to the software debugger through the JTAG interface), and custom instruction logic with custom I/O signals.

In addition, the controller is connected to a JTAG UART, which allows the PC to communicate with the NIOS core through the USB-Blaster download cable and provides debug information. Each processor is also connected to a custom peripheral that integrates the network interface, allowing the processor's requests to be transferred to the other components of the multicore architecture.

An example of a 4-core architecture is shown schematically in Figure 4. In this architecture, the different subsystems are connected by network interfaces (NI) to the networks (neighbourhood network and global router). A network interface is used to send and receive data transmitted over the network. Each subsystem contains the processor and some peripherals or accelerators. The total number of PEs, defined by the number of rows and columns, can be specified before synthesis.

The design steps followed to implement the multicore architecture show the simplicity of applying the replication methodology to any softcore processor and the configurability of the proposed SoC prototype. To build a multicore system, two steps have to be performed: adapting the network interface so it can be connected to the chosen processor, and defining the corresponding communication instructions based on the processor's instructions, since the HW networks are managed by the SW program. The latter step is detailed in the next paragraph.

3.2. SW Programming. The most important instructions that depend on the softcore used are the communication instructions.

Such instructions are defined to assure the communication of data through both networks. In this work, they are based on standard C and specific NIOS instructions. Accessing and communicating with NIOS II peripherals can be accomplished using the HW abstraction layer (HAL) interface of the NIOS II processor. In fact, the HAL provides the C macros IORD (IO Read) and IOWR (IO Write), which expand to the appropriate assembly instructions [17]. These instructions allow accessing and communicating with the different peripherals connected to the NIOS II processor. Each communication instruction consists of writing or reading data at a defined address. In our case, the data is 32 bits long and the address is 12 bits long. The address field differs depending on the communication network: for the global router the most significant bit (the bit at the 12th position) is set to "1," whereas for the neighbourhood network it is set to "0." To assure global router communications, we distinguish the different instructions illustrated below.

(i) The MODE instruction sets the needed communication mode and is executed by the controller: IOWR (NI_BASE, 0x800, data), where data is the value that corresponds to the global router communication mode (PE-PE, PE-Controller, PE-I/O Peripheral, or Controller-I/O Peripheral) as defined in the multicore architecture configuration file.

(ii) The SEND instruction sends data through the global router: IOWR (NI_BASE, address, data).



Figure 3: System design flow. On the hardware side, the design entry tool, the processor core configuration tool, and the additional hardware (networks, wrappers) feed the FPGA synthesis tool (HDL or netlist) and the FPGA place-and-route tool; on the software side, the application program source code is compiled with the C/C++ compiler for the processor (with optional operating system kernels and libraries) into binary program/data files. The FPGA is then programmed and the program memory initialized.

Figure 4: 4-core configuration. The NIOS II/e controller, with its DATAM, InstM, UART, and timer, is attached via the Avalon bus and a network interface to the global router and the I/O devices; the four NIOS II/e PEs (PE1-PE4), each with a timer, a system ID peripheral, a local memory (M1-M4), and a wrapper acting as network interface, are interconnected through the routing elements (RE1-RE4) by bidirectional neighbouring links and bidirectional global router links.




Here, NI_BASE is the address of the controller's network interface custom component when the instruction is executed by the controller, or of the PE's network interface custom component when it is executed by a PE. The address field (11 bits) contains the following:

(1) the identity of the PE receiver (mode PE-PE, mode Controller-PE, and mode I/O Peripheral-PE);

(2) "00000000000" (mode PE-Controller and mode I/O Peripheral-Controller);

(3) the peripheral address (mode PE-I/O Peripheral and mode Controller-I/O Peripheral).

(iii) The RECEIVE instruction receives data through the global router: data = IORD (NI_BASE, address). It takes the same address field as the SEND instruction.

In the case of neighbouring communication, SEND and RECEIVE instructions are also built from the IORD and IOWR NIOS macros. Compared to the global router instructions, they only differ in the address field, which contains the direction to which the data will be transferred (0: north, 1: east, 2: west, 3: south, 4: northeast, 5: northwest, 6: southeast, and 7: southwest).
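To make this address coding concrete, the following C fragment sketches how a PE program might issue these instructions with the HAL IORD/IOWR macros. It is only an illustration on our part: NI_BASE, GR_FLAG, and the helper function names are assumptions consistent with the description above (12th address bit set for the global router, direction codes 0-7 for the neighbourhood network), not code taken from the paper.

#include "io.h"                  /* NIOS II HAL: IORD / IOWR macros */

#define NI_BASE   0x00001000     /* hypothetical base address of the PE's network interface    */
#define GR_FLAG   0x800          /* 12th address bit set -> global router access               */
#define DIR_EAST  0x1            /* neighbourhood directions: 0 N, 1 E, 2 W, 3 S, 4-7 diagonal */
#define DIR_WEST  0x2

void exchange_with_neighbours(int local_value)
{
    /* SEND local_value to the east neighbour over the neighbourhood network */
    IOWR(NI_BASE, DIR_EAST, local_value);

    /* RECEIVE the value arriving from the west neighbour */
    int from_west = IORD(NI_BASE, DIR_WEST);
    (void)from_west;             /* used by the parallel computation */
}

void send_to_pe(int dest_pe_id, int value)
{
    /* SEND through the global router (PE-PE mode): the 11-bit address field
       carries the identity of the receiving PE */
    IOWR(NI_BASE, GR_FLAG | (dest_pe_id & 0x7FF), value);
}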

Altera provides the NIOS II Integrated Development Environment (IDE) to implement and compile the parallel SW program. The following section presents the implementation and experimental results of the proposed multicore design.

4. Experimental Results

4.1. Synthesis Results. The development board used is the Stratix V GX FPGA board, which provides a hardware platform for developing and prototyping high-performance application designs [14].

It is equipped with a Stratix V (5SGXEA7K2F45C2) FPGA, which has 622,000 logic elements (LEs) and 50 Mbit of embedded memory, and it offers many peripherals, such as memories, expansion pins, and communication ports. The SW tools used are Quartus II 11.1, Qsys, and the NIOS II IDE, which allow designers to synthesize, program, and debug their designs and build embedded systems on Altera FPGAs.

Table 4 shows synthesis results for different multicore configurations, varying the number of cores as well as the memory size. These systems contain both the neighbouring and the global router networks. The maximum frequency obtained for each SoC design is also reported.

Table 4: NIOS II/e-based multicore SoC implementation results on FPGA.
Number of PEs | ALUTs | Registers | Logic utilization (%) | Program memory/core | Data memory/core | Total memory bits | Max. frequency (MHz)
2 | 3,905 | 2,086 | <1 | 8 KB | 3 KB | 268,544 (<1%) | 420
4 | 6,650 | 3,353 | 1 | 8 KB | 3 KB | 444,544 (<1%) | 412
8 | 10,180 | 4,931 | 2 | 8 KB | 3 KB | 796,544 (1%) | 382
16 | 18,978 | 9,069 | 4 | 12 KB | 2 KB | 1,884,544 (3%) | 337
32 | 35,700 | 17,009 | 8 | 12 KB | 1 KB | 3,420,544 (6%) | 328

As expected, the consumed FPGA area increases as the number of processors increases. As illustrated in Table 4, the parallel architectures occupy proportionally larger areas as their size grows; a 32-core configuration consumes just 8% of the available logic elements and 6% of the memory blocks. This shows the efficiency of the implemented architecture: it would thus be possible to implement up to about 300 cores in the Stratix V 5SGXEA7K2F45C2 FPGA. Table 4 also reports the resulting clock frequencies, which drop slightly as the system grows because of the interconnection networks.

4.2. Tested Benchmarks. This section gives better insight into the use and performance of the parametric multicore system by executing different benchmarks. We considered three algorithms representing different application domains: FIR filter, spatial Laplacian filter, and matrix-matrix multiplication, characteristic of 1D digital signal processing, 2D image filtering, and linear algebra computations, respectively. The applications are implemented in C, using the NIOS II IDE. The execution time and speedup are measured for different multicore configurations. Based on the results obtained and the system requirements, a suitable architecture can be chosen for the targeted application; in fact, the nature of the application determines the scaling that needs to be applied to the SoC in order to meet the required performance.

(1) FIR Filter. Finite Impulse Response (FIR) filtering is one of the most popular DSP algorithms. Such computations require a large number of multiplications and additions as well as multiple memory read operations. A FIR filter is implemented with the following equation:

Y(n) = ∑_{k=0}^{M-1} b_k X(n - k), (2)

where X is the input signal, b_k are the filter coefficients, and M is the filter order, or number of taps. An M-order FIR filter requires M multiplications and M additions for every output sample, plus 2M memory read operations (for the input signal and the filter coefficients). We implement the FIR filter on different multicore configurations to study the scalability of the system and the influence of the type of interconnection network. The results, shown in Figure 5, measure the execution time of a 64-tap FIR filter with an impulse response length of 128.
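As a point of reference for equation (2), the short C routine below computes one output sample of an M-tap filter. It is a generic sketch of our own (not the parallel NIOS II code of the paper) that makes the M multiply-accumulate operations and 2M memory reads per sample explicit.

/* One output sample of an M-tap FIR filter, as in equation (2).
 * x must hold at least the M most recent input samples (n >= M - 1). */
float fir_sample(const float *x, const float *b, int n, int M)
{
    float y = 0.0f;
    for (int k = 0; k < M; ++k) {
        y += b[k] * x[n - k];    /* one multiply, one add, two memory reads per tap */
    }
    return y;
}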



Figure 5: Execution time (in clock cycles) of the FIR filter on different multicore configurations (2 to 64 cores) and two types of networks (Xnet and crossbar).

Figure 6: Laplacian filter execution results: execution time (µs) and speedup (x1000) for 1 to 32 cores.

These results describe the performance and speedup of the implemented FIR filter on different multicore configurations. Two communication mechanisms are tested in this case study: neighbouring communication and point-to-point communication, through the integration of an Xnet network and a crossbar network, respectively. It is clear from Figure 5 that the designed system is scalable and remains efficient as more cores are integrated on the chip. When the number of cores increases, a larger part of the output signal can be calculated at the same time; communication instructions then decrease, which enhances the speedup of the system. In fact, the 64-core configuration with an Xnet network achieves a speedup 7 times higher than the 2-core configuration.

As expected, the multicore architecture with a neighbourhood interprocessor network is more efficient at computing FIR filtering operations. Thus, thanks to the flexible communication networks, the designer can choose the interconnect best suited to the application and its requirements. Furthermore, efficient communication is critical to achieving high performance, since the interconnect scheme can significantly affect the type of algorithms that run well on the architecture.

(2) Spatial Laplacian Filter. We consider a 2D filter, the Laplacian filter, which smooths a given image (of size 800 x 480 pixels) by convolving it with a Laplacian kernel (3 x 3 kernel [9]). The parallel algorithm implemented in this work was proposed in [18]. This application can be performed in parallel, so that multiple pixels are processed at a time. Each core reads an amount of pixels (depending on the image size) from an external SDRAM memory and stores them in vectors. It then multiplies the vector elements by the filter kernel coefficients; after that, all resulting elements are added and divided for scaling. As a result, n pixels are convolved simultaneously (one per core), where n is the number of cores integrated in the system. In this algorithm, the global router is needed to transfer data from the SDRAM to the data memories of the processors; the controller is thus responsible for controlling and configuring the network. The execution time and speedup values are shown in Figure 6 for a varying number of cores. Figure 6 clearly shows that the speedup increases steadily with the number of cores and indicates good scalability, since doubling the number of cores provides an average speedup of 1.6. We deduce that the multicore system is efficient for image processing applications; for example, we reach a speedup of 5 over a sequential execution when integrating 8 PEs, which is quite appreciable. It is also noted that the execution time decreases only slightly between 16 and 32 cores, due to the time needed to accomplish the communications.
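For illustration, the per-pixel work each core performs could look like the following C fragment. The 3 x 3 kernel values, the image width constant, and the function name are our own assumptions (the paper does not list its kernel), so this is a sketch of the convolution step rather than the authors' code.

#define IMG_WIDTH 800                  /* image width used in the experiment */

static const int kernel[3][3] = {      /* assumed 3 x 3 Laplacian kernel     */
    {  0, -1,  0 },
    { -1,  4, -1 },
    {  0, -1,  0 }
};

/* Convolve one interior pixel (row, col) of a row-major 8-bit image. */
int laplacian_pixel(const unsigned char *img, int row, int col)
{
    int acc = 0;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            acc += kernel[dy + 1][dx + 1] * img[(row + dy) * IMG_WIDTH + (col + dx)];
    return acc;                        /* scaling/clamping is applied afterwards, as described above */
}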

(3) Matrix-Matrix Multiplication. One of the basic computational kernels in many data-parallel codes is the multiplication of two matrices (C = A x B). Matrix multiplication is an important kernel in computational linear algebra. It is a very compute-intensive task, but also rich in computational parallelism, and hence well suited to parallel computation [19]. In this work, the tested multicore configuration integrates 64 PEs, and we study the influence of the interconnection network topology on the application performance. The PEs are arranged in an 8 x 8 grid. To perform the multiplication, all-to-all row and column broadcasts are performed by the PEs. The following code for PE(i, j) is executed by all PEs simultaneously in the 64-core configuration; matrices A and B, of size 128 x 128, are partitioned into submatrices A(i, j) and B(i, j) of size 16 x 16.

/* East and West data transfer */
For k = 1 to 7 do
    Send A(i,j) to PE(i, (j+k) mod 8)

/* North and South data transfer */
For k = 1 to 7 do
    Send B(i,j) to PE((i+k) mod 8, j)

/* Local multiplication of submatrices */
For k = 0 to 7 do
    C(i,j) = C(i,j) + A(i,k) x B(k,j)
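The local step of this pseudocode is an ordinary block multiply-accumulate. A plain C version (our own sketch, with the 16 x 16 block size implied by the 128 x 128 / 8 x 8 partitioning above) would be:

#define BS 16   /* submatrix (block) size: 128 / 8 */

/* C += A x B for one pair of 16 x 16 submatrices held in a PE's local memory. */
void block_multiply_accumulate(int C[BS][BS], int A[BS][BS], int B[BS][BS])
{
    for (int r = 0; r < BS; ++r)
        for (int c = 0; c < BS; ++c)
            for (int k = 0; k < BS; ++k)
                C[r][c] += A[r][k] * B[k][c];
}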



Figure 7: Matrix-matrix multiplication results: execution time (µs) for the Mesh and Torus networks and matrix sizes of 128 x 128 and 256 x 256.


The obtained execution times are depicted in Figure 7 for differently sized matrices. Figure 7 demonstrates that the network used influences the total execution time. In our case, it was easy to test different topologies thanks to the implemented MODE instruction: from the SW level, the programmer can choose the interconnection network topology to use. The results show that the Torus network is the most appropriate neighbourhood network for matrix-matrix multiplication.

To further evaluate the performance of the implemented multicore configurations, we measured the communication-to-computation ratio (CCR) of the three tested benchmarks on a 32-core configuration. Table 5 reports the obtained results. The measured communication-to-computation ratio is small, which indicates that the number of processing cycles exceeds the number of communication cycles; this allows a good speedup.

Table 5: Communication-to-computation ratio for a 32-core configuration.
Benchmark | CCR
FIR filter | 0.51
M-M multiplication | 0.47
Spatial Laplacian filter | 0.32
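We read CCR with its usual definition, which the paper does not state explicitly and is therefore an assumption on our part: CCR = T_comm / T_comp, the ratio of cycles spent in communication to cycles spent in computation. Under that reading, a CCR of 0.51 for the FIR filter means the communication cycles amount to roughly half of the computation cycles.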

Table 6 shows the parallel efficiency of the tested benchmarks as a function of the number of PEs. The parallel efficiency decreases as the number of cores increases; it drops down to 12% when executing the FIR filter with 64 cores, where the additional cores do not deliver additional performance because of the extra communications needed. These results help choose the best number of cores for parallelizing a given application.

Table 6: Parallel efficiency.
Number of PEs | FIR filter (with Xnet) (%) | Laplacian filter (%) | M-M multiplication (matrices of size 128 x 128) (%)
2 | 84 | 95 | 97
4 | 74 | 81 | 88
8 | 59 | 62 | 81
16 | 36 | 37 | 75
32 | 24 | 22 | 66
64 | 18 | — | 59
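The efficiency values in Table 6 are consistent with the standard definition, again an assumption on our part since the paper does not give the formula: E(p) = S(p) / p, with S(p) = T(1) / T(p) the speedup over a single core. For example, the 84% entry for the FIR filter on 2 PEs corresponds to a speedup of about 1.68.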

5. Multicore Architectures and Performance Analysis

5.1. Related Work. Several on-chip multicore platforms have been proposed [20]. Different multiprocessor architectures based on the NIOS II softcore have also been presented in [21].


These platforms are designed to raise efficiency in terms of GOPS per Watt. In this work, particular attention is given to architectures working in SIMD/SPMD fashion and prototyped on FPGA devices.

One of the best-known high-performance architectures is the GPGPU; the Fermi processor, composed of 512 cores, is an example [22]. However, these architectures present some limits for the embedded system field: a memory access bottleneck, since all cores share the same memory, and high power consumption.

In [23], the authors present a parameterizable coarse-grained reconfigurable architecture. Different interconnect topologies can be configured, and static as well as dynamic parameters are described. This work mainly concentrates on providing a dynamically reconfigurable network by developing a generic interconnect wrapper module. The PEs used in the design have limited instruction memory and can only execute the specific instructions needed in digital signal processing.

In [24], the authors present massively parallel programmable accelerators. They propose a resource-aware parallel computing paradigm, called invasive computing, in which an application can dynamically use resources, and show that invasive computing allows reducing energy consumption by dynamically powering off the regions that are idle at a given time.

A many-core computing accelerator for embedded SoCs, called Platform 2012, is presented in [25]. It is based on multiple processor clusters and works in MIMD fashion. One cluster can hold from 1 to 16 PEs (STxP70-V4 processors). P2012 is a Globally Asynchronous Locally Synchronous (GALS) fabric of clusters connected through an asynchronous global NoC.

The Cell Broadband Engine Architecture (CBEA) [18] is a multicore system that integrates a PowerPC PE and 8 Synergistic PEs dedicated to compute-intensive applications. It has been shown to be significantly faster than existing CPUs for many applications.



Table 7: Related work.
Reference | Author | Year | CPU type | Number of cores | Network | Memory architecture | Tested application
[30] | Li et al. | 2003 | FPGA-based massively parallel SIMD processor | 95 simple processors | — | — | RC4 key search engine
[23] | Kissler et al. | 2006 | Parameterizable coarse-grained reconfigurable architecture | Parametric number of weakly programmable PEs | Different interconnect topologies (Torus, 4D hypercube, etc.) | — | —
[18] | Ruchandani and Rawat | 2008 | CBEA | 8 PEs + PowerPC | Element interconnect bus (EIB) | Shared memory | Spatial domain filter
[22] | Wittenbrink et al. | 2011 | Fermi processor (GPGPU) | 512 | — | Shared memory | Gaming and high-performance computing
[25] | Melpignano et al. | 2012 | P2012 (MIMD) | Multiple clusters; one cluster can have from 1 to 16 PEs | Asynchronous global NoC | Shared memory | Extraction and tracking algorithms
[29] | Waidyasooriya et al. | 2012 | Heterogeneous multicore architecture | SIMD-1D 8-PE array and MIMD-2D PE array accelerators | Crossbar | Shared memory | M-M multiplication
[31] | Wjcik and Dlugopolski | 2013 | Application-specific multicore architecture | 4 parallel processors | — | Shared parallel memory | Cipher algorithm
[24] | Lari et al. | 2014 | Massively parallel programmable accelerator | Parametric number of tightly coupled PEs | Circuit-switched mesh-like interconnect | Multiple buffer banks | Loop computations
Proposed work | — | 2014 | FPGA-based multicore SoC | Parametric number of cores | Crossbar | Local private memory | FIR filter, Laplacian filter, M-M multiplication





Many modern commercial microprocessors also have SIMD capability, which has become an integral part of the Instruction Set Architecture (ISA). Design methods include combining a SIMD functional unit with a processor or exploiting the SIMD instruction extensions [26] present in softcores to run data-parallel applications. SIMD extensions are used, for example, in ARM's Media Extensions [27] and PowerPC's AltiVec [28]. SIMD-style media ISA extensions are present in most high-performance general-purpose processors (e.g., Intel's SSE1/SSE2, Sun's VIS, HP's MAX, Compaq's MVI, and MIPS' MDMX). Their drawback is the data access overhead of these instructions, which becomes an obstacle to performance enhancement.

In [29], the authors present heterogeneous multicore architectures with a CPU and SIMD 1-dimensional PE array (SIMD-1D) and MIMD 2-dimensional PE array (MIMD-2D) accelerators. A custom hardware address generation unit (AGU) is used to generate preprogrammed addressing patterns in a smaller number of clock cycles compared to ALU-based address generation. The SIMD-1D accelerator has 8 PEs and a 16-bit x 256, 16-bank shared memory connected through a crossbar interconnection network. Through a matrix-matrix multiplication benchmark, the authors demonstrate that the proposed SIMD-1D architecture is 15 times more power-efficient than a GPU for fixed-point calculations.

Another FPGA-based massively parallel SIMD processor is presented in [30]. The architecture integrates 95 simple processors and memory on a single FPGA chip, and the authors show that it can be used to efficiently implement an RC4 key search engine. An application-specific multicore architecture designed to perform a simple cipher algorithm is presented in [31]. That architecture consists of four main modules: a sequential processor that is responsible for the interpretation and execution of a user-defined program, a queuing module that is responsible for scheduling tasks received from the sequential processor, four parallel processors that are idle until they get tasks from the queuing module, and a parallel memory that holds the data and key for processing and allows concurrent access from all parallel processors.

There is also a monitoring module which interprets user-control signals and sends back the results via a specified output. The architecture could integrate more parallel cores; however, this must be done manually and therefore requires a long design time.

These different platforms are summarized in Table 7.

5.2. Performance Analysis. The multicore architecture discussed in this paper is suitable for data-intensive computations. Compared with embedded general-purpose graphics processing units, both have a large number of processing units, but there are also some differences. GPUs used in multiprocessor systems-on-a-chip have direct access to the main memory, shared with the processor cores, and there is no neighbouring communication between the cores: data exchange goes through the shared memory. In contrast, the PEs in the proposed multicore SoC can communicate with each other, and with the other components, via two networks. GPUs are mainly dedicated to massively regular data-parallel algorithms, whereas the proposed multicore architecture can handle both regular and irregular data-parallel algorithms thanks to its communication features. To validate this reasoning, our multicore SoC and a commercial embedded GPU were evaluated on selected algorithms (FIR filter and matrix-matrix multiplication). A Samsung Exynos 5420 Arndale Octa board was used; it features a dual-core ARM Cortex-A15 running at 1.7 GHz and an ARM Mali T-604 GPU operating at 533 MHz [32]. The GPU has a theoretical peak performance of 68.22 GOPS. For the comparison, a 16-core configuration (with the PEs connected via an Xnet network) was considered; this multicore architecture operates at 337 MHz and has a peak performance of 10.78 GOPS. The performance comparison is given in Table 8, which reports the execution times and the achieved fraction of the peak performance.

Table 8: Performance results of the Mali-T604 GPU and the 16-core architecture. Numbers in brackets denote the difference between the GPU's and our architecture's values.
Application | GPU exec. time | GPU % of peak perf. | 16-core exec. time | 16-core % of peak perf.
FIR filter (32 taps, 131,072 samples) | 655 µs | 18.8% | 461 µs (29% faster) | 84.2% (4.5x better)
Matrix-matrix multiplication (256 x 256 matrices) | 2,634 µs | 18.7% | 2,011 µs (23% faster) | 77.37% (4x better)

Table 8 shows that our proposed multicore system achieves better speedup than the GPU (29% faster for the FIR filter and 23% faster for the matrix-matrix multiplication). Our system is also more efficient, since it achieves a higher fraction of its peak performance than the GPU, which reflects the good resource utilization of the multicore on-chip implementation. These results demonstrate the performance of the proposed architecture. We believe that our system could be coupled with a GPU to form a high-performance multiprocessor system.

Tanabe et al. [33] propose an FPGA-based SIMD processor. Their architecture is based on a SuperH processor modified to include a larger number of computation units that perform the SIMD processing. A JPEG case study is presented, but no programming details are given, and the achieved clock frequency of the SIMD processor is low (55 MHz) while many softcores run faster these days. Table 9 compares our multi-softcore architecture with the architecture described in [33], composed of one control unit and a number of computation units (CUs), in terms of power consumption and maximum frequency. As illustrated, our system achieves good results, presenting a higher frequency and a lower power consumption than the SIMD processor.



Table 9: Comparison between the multi-softcore SoC and [33].
Proposed multi-softcore SoC | Fmax (MHz) | Power (mW) | [33] | Fmax (MHz) | Power (mW)
4 PEs | 412 | 756 | 4 CUs | 56 | 1,149
8 PEs | 382 | 834 | 8 CUs | 55 | 1,233
16 PEs | 337 | 1,151 | 16 CUs | 52 | 1,385

The replication design methodology makes the design of a multi-softcore architecture easier and supports a parametric number of PEs, whereas the dedicated implementation of [33] makes adding more computation units to the architecture a heavy task.

Compared to the multicore architecture described in [31], which is application specific, our proposed architecture is parametric and scalable and can satisfy a wide range of data-intensive processing applications. In addition, our system integrates two types of flexible communication networks that can efficiently handle the communication schemes present in many applications. In terms of performance, our multicore architecture reaches a speedup of 1.9 with two cores and 3.2 with four cores, whereas the results presented in [31] show a speedup of 1.85 with two cores and 2.92 with four cores active.

The preceding performance comparison confirms the promising advantages and performance of the proposed FPGA-based multicore architecture. This paper shows that an FPGA-based multicore design built with softcores is powerful enough for data-intensive processing. The presented architecture is easy to design, configurable, flexible, and scalable; thus, it can be tailored to different data-parallel applications.

6. Conclusion

In this paper, we have presented a softcore-based multicore implementation on an FPGA device. The proposed architecture is parametric and scalable and can be tailored according to application requirements such as performance and area cost. Such a programmable architecture is well suited to data-intensive applications from the areas of signal, image, and video processing. Through experimental results, we have demonstrated the performance gains of our system in comparison to state-of-the-art embedded architectures. The presented architecture provides a great deal of performance as well as flexibility.

Future work will focus on studying the energy efficiency of the presented system, as power consumption is considered the hardest challenge for FPGAs [34]. We want to explore the usage of custom accelerators in the proposed architecture. We also envision studying a possible dynamic exploitation of the available level of parallelism in the multicore system based on the application requirements.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] W. C. Meilander, J. W. Baker, and M. Jin, "Importance of SIMD computation reconsidered," in Proceedings of the International Parallel and Distributed Processing Symposium, IEEE, April 2003.

[2] H. Gao, M. Dimitrov, J. Kong, and H. Zhou, Experiencing Various Massively Parallel Architectures and Programming Models for Data-Intensive Applications, 2014, http://people.engr.ncsu.edu/hzhou/dlp.pdf.

[3] P. Jonker, "Why linear arrays are better image processors," in Proceedings of the Conference on Pattern Recognition, pp. 334–338, 1994.

[4] R. Kleihorst, H. Broers, A. Abbo et al., "An SIMD smart camera architecture for real-time face recognition," in Proceedings of the SAFE & ProRISC/IEEE Workshops on Semiconductors, Circuits and Systems and Signal Processing, Veldhoven, The Netherlands, November 2003.

[5] R. P. Kleihorst, A. A. Abbo, A. van der Avoird et al., "Xetal: a low-power high-performance smart camera processor," in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '01), pp. 215–218, May 2001.

[6] A. Abbo and R. Kleihorst, "A programmable smart-camera architecture," in Proceedings of the Advanced Concepts for Intelligent Vision Systems (ACIVS '02), 2002.

[7] J. Cloutier, E. Cosatto, S. Pigeon, F. R. Boyer, and P. Y. Simard, "VIP: an FPGA-based processor for image processing and neural networks," in Proceedings of the 5th International Conference on Microelectronics for Neural Networks and Fuzzy Systems, pp. 330–336, Lausanne, Switzerland, February 1996.

[8] ITRS System Design Source, The ITRS Website, 2011, http://www.itrs.net/.

[9] P. G. Paulin, "DATE panel chips of the future: soft, crunchy or hard?" in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE '04), vol. 2, pp. 844–849, February 2004.

[10] H. Kalte, D. Langen, E. Vonnahme, A. Brinkmann, and U. Ruckert, "Dynamically reconfigurable system-on-programmable-chip," in Proceedings of the 10th Euromicro Workshop on Parallel, Distributed and Network-Based Processing, pp. 235–242, Canary Islands, Spain, January 2002.

[11] Altera, Nios II Processor Reference Handbook, February 2014, http://www.altera.com/literature/hb/nios2/n2cpu_nii5v1.pdf.

[12] MicroBlaze Processor Reference Guide, Xilinx Inc., http://www.xilinx.com.

[13] Y. Jin, N. Satish, K. Ravindran, and K. Keutzer, "An automated exploration framework for FPGA-based soft multiprocessor systems," in Proceedings of the IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, 2005.

[14] Altera, "Stratix V GX FPGA Development Board Reference Manual," http://www.altera.com/literature/manual/rm_svgx_fpga_dev_board.pdf.



[15] Altera, "Avalon Memory-Mapped Interface Specification," 2006, http://www.cs.columbia.edu/~sedwards/classes/2007/4840/mnl_avalon_spec.pdf.

[16] Altera, Quartus II Handbook Version 11.1, Volume 1: Design and Synthesis, 2011, http://www.altera.com/literature/hb/qts/archives/quartusii_handbook_archive_111.pdf.

[17] J. O. Hamblen, T. S. Hall, and M. D. Furman, Rapid Prototyping of Digital Systems, Springer, New York, NY, USA, 2008.

[18] K. V. Ruchandani and H. Rawat, "Implementation of spatial domain filters for Cell Broadband Engine," in Proceedings of the 1st International Conference on Emerging Trends in Engineering and Technology (ICETET '08), pp. 116–118, July 2008.

[19] P. Bjørstad, F. Manne, T. Sørevik, and M. Vajtersic, "Efficient matrix multiplication on SIMD computers," SIAM Journal on Matrix Analysis & Applications, vol. 13, no. 1, pp. 386–401, 1992.

[20] T. Dorta, J. Jimenez, J. L. Martin, U. Bidarte, and A. Astarloa, "Reconfigurable multiprocessor systems: a review," International Journal of Reconfigurable Computing, vol. 2010, Article ID 570279, 11 pages, 2010.

[21] A. Kulmala, E. Salminen, and T. D. Hamalainen, "Evaluating large system-on-chip on multi-FPGA platform," in Proceedings of the International Workshop on Systems, Architectures, Modeling and Simulation (SAMOS '07), S. Vassiliadis, M. Berekovic, and T. D. Hamalainen, Eds., pp. 179–189, Springer, 2007.

[22] C. M. Wittenbrink, E. Kilgariff, and A. Prabhu, "Fermi GF100 GPU architecture," IEEE Micro, vol. 31, no. 2, pp. 50–59, 2011.

[23] D. Kissler, F. Hannig, A. Kupriyanov, and J. Teich, "A highly parameterizable parallel processor array architecture," in Proceedings of the IEEE International Conference on Field Programmable Technology (FPT '06), pp. 105–112, Bangkok, Thailand, December 2006.

[24] V. Lari, A. Tanase, F. Hannig, and J. Teich, "Massively parallel processor architectures for resource-aware computing," in Proceedings of the 1st Workshop on Resource Awareness and Adaptivity in Multi-Core Computing (Racing '14), Paderborn, Germany, May 2014.

[25] D. Melpignano, L. Benini, E. Flamand et al., "Platform 2012, a many-core computing accelerator for embedded SoCs: performance evaluation of visual analytics applications," in Proceedings of the 49th Annual Design Automation Conference (DAC '12), pp. 1137–1142, June 2012.

[26] D. Etiemble and L. Lacassagne, "Introducing image processing and SIMD computations with FPGA soft-cores and customized instructions," in Proceedings of the International Workshop on Reconfigurable Computing Education, Karlsruhe, Germany, March 2006.

[27] ARM, "SIMD extensions for multimedia," 2014, http://www.arm.com/products/processors/technologies/dsp-simd.php.

[28] J. Tyler, J. Lent, A. Mather, and H. Nauyen, "AltiVec: bringing vector technology to the PowerPC processor family," in Proceedings of the IEEE International Performance, Computing and Communications Conference, pp. 437–444, Scottsdale, Ariz, USA, February 1999.

[29] H. M. Waidyasooriya, Y. Takei, M. Hariyama, and M. Kameyama, "FPGA implementation of heterogeneous multicore platform with SIMD/MIMD custom accelerators," in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '12), pp. 1339–1342, May 2012.

[30] S. Y. C. Li, G. C. K. Cheuk, K. H. Lee, and P. H. W. Leong, "FPGA-based SIMD processor," in Proceedings of the 11th IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM '03), pp. 267–268, IEEE Computer Society, Napa, Calif, USA, April 2003.

[31] W. Wjcik and J. Dlugopolski, "FPGA-based multi-core processor," Computer Science, vol. 14, no. 3, pp. 459–474, 2013.

[32] Arndale Board, http://www.arndaleboard.org/wiki/index.php/Main/Page.

[33] S. Tanabe, T. Nagashima, and Y. Yamaguchi, "A study of an FPGA based flexible SIMD processor," ACM SIGARCH Computer Architecture News, vol. 39, no. 4, pp. 86–89, 2011.

[34] I. Kuon and J. Rose, "Measuring the gap between FPGAs and ASICs," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 26, no. 2, pp. 203–215, 2007.
