
Virtualization of reconfigurable coprocessors in HPRC systems with multicore architecture

Ivan Gonzalez*, Sergio Lopez-Buedo, Gustavo Sutter, Diego Sanchez-Roman, Francisco J. Gomez-Arribas, Javier Aracil

High-Performance Computing and Networking Group, Escuela Politecnica Superior, Universidad Autonoma de Madrid, 28049 Madrid, Spain

Article history: Received 8 February 2011; received in revised form 10 February 2012; accepted 12 March 2012; available online 23 March 2012.

Keywords: High-Performance Reconfigurable Computing; coprocessor virtualization; multicore programming; reconfigurable hardware

Abstract

HPRC (High-Performance Reconfigurable Computing) systems include multicore processors and reconfigurable devices acting as custom coprocessors. Due to economic constraints, the number of reconfigurable devices is usually smaller than the number of processor cores, preventing a 1:1 mapping between cores and coprocessors. This paper presents a solution to this problem, based on the virtualization of reconfigurable coprocessors. A Virtual Coprocessor Monitor (VCM) has been devised for the XtremeData XD2000i In-Socket Accelerator, and a thread-safe API is available for user applications to communicate with the VCM. Two reference applications, an IDEA cipher and an Euler CFD solver, have been implemented in order to validate the proposed architecture and execution model. Results show that the benefits arising from coprocessor virtualization outweigh its overhead, especially when the code has a significant software weight.

© 2012 Elsevier B.V. All rights reserved. doi:10.1016/j.sysarc.2012.03.002

* Corresponding author. Address: Department of Electronic and Communications Technology, Escuela Politecnica Superior, Universidad Autonoma de Madrid, 28049 Madrid, Spain. Tel.: +34 914976212. E-mail addresses: [email protected] (I. Gonzalez), [email protected] (S. Lopez-Buedo), [email protected] (G. Sutter), [email protected] (D. Sanchez-Roman), [email protected] (F.J. Gomez-Arribas), [email protected] (J. Aracil).

1. Introduction

Nowadays, most computing systems have processors capable of executing several tasks in parallel. Even low-cost PCs or smartphones have multicore processors, which include two or more processing units. There are conventional x86 processors with up to twelve cores, such as AMD "Magny-Cours" [1], while non-conventional many-core architectures such as Tilera can have up to 100 cores [2]. There are also specialized architectures with hundreds of cores, such as the Nvidia Fermi or AMD FireStream GPUs (Graphics Processing Units). Typically, each core is capable of executing one task, composed of one or more threads, reaching up to thousands of concurrent threads in common GPU applications [3]. Multicore processing is the only option for performance increases, now that instruction-level parallelism and frequency scaling have reached their limits. However, programming these multicore architectures is not always an easy task. It is necessary to parallelize the code among the different cores, using programming APIs such as OpenMP, MPI or CUDA [4].

HPRC systems [5] are composed of conventional processors together with FPGA coprocessors. In these systems, acceleration is obtained at two levels. Firstly, the application is partitioned into many parallel tasks, each running on one processor core of the HPRC machine. For this step, conventional HPC techniques apply, such as MPI or OpenMP. Secondly, the most computationally intensive kernels of each task are ported to the FPGA, which acts as a custom coprocessor. If every processor core has an FPGA coprocessor available, application partitioning is homogeneous and therefore straightforward. However, a problem arises if a 1:1 mapping between processor cores and FPGA coprocessors is not available. In this case, the partition is no longer homogeneous: chances are that some tasks finish sooner because they can access a coprocessor, while others take longer because all their computations are performed in software. This configuration brings forth severe synchronization and load balancing issues, but it is nonetheless the most typical one, because multicore processors are widespread and the number of FPGAs in an HPRC system is limited due to economic and communication constraints.

This paper proposes a mechanism to transparently share FPGA coprocessing resources among all processor cores, that is, to divide coprocessors into many virtual devices. This concept has been successfully explored in other works, not only in the field of reconfigurable computing but also in other heterogeneous computing technologies. In 1998, Fornaciari et al. [6] proposed some operating-system techniques to virtually enlarge the size of the FPGA from the point of view of the applications, although they did not show any experimental results. There are no further references to FPGA virtualization until 2008, when El-Araby et al. [7,8] studied the viability of implementing virtualization of reconfigurable coprocessors in HPRC systems by means of partial runtime reconfiguration, obtaining promising results. Later, in 2010, Huang et al. [9] used a hardware virtualization mechanism for dynamically linking device nodes to reconfigurable hardware functions, reducing overall system execution time by up to 12.8%. In addition, virtualization of coprocessors is gaining attention with the emergence of GPGPU technology, where the approach of [10] is to share GPUs through the network.

This paper presents a complementary approach to that presented in [7–9]. In those works, the FPGA is partitioned into several areas, each of them dedicated to implementing a virtual coprocessor. Therefore, one single physical FPGA device appears as many coprocessors to the system. Each of these virtual FPGA coprocessors can be separately configured by taking advantage of the partial runtime reconfiguration capabilities of Xilinx devices. While those works used spatial multiplexing to virtualize FPGA coprocessors, the approach proposed here is to use time multiplexing: to allow each parallel task to access the FPGA coprocessor during a fraction of the total time. Due to the high reconfiguration time of commercial FPGA devices, this time multiplexing approach is valid only for SIMD problems, where all parallel tasks execute the same code but operate on different partitions of the data. However, SIMD is probably the most common mode of operation of parallel applications, and the main advantage of this approach is that it can be implemented on all FPGA devices, even those not supporting partial runtime reconfiguration.

In theory, time-sharing a computation resource among elements executing parallel tasks implies synchronizing and serializing access to that resource. Consequently, the time spent by the application in sequential execution is increased and, according to Amdahl's law, the maximum speedup becomes limited by that serial time. Although this situation might seem undesirable, the fact is that the combination of speedups coming from both algorithm parallelization and hardware acceleration can be beneficial for many applications, even taking into consideration the waits to access the shared coprocessor. The challenge is to implement this virtualization model in such a manner that processor cores can use FPGA accelerators in a deterministic way, using an API based on simple primitives that hides the virtualization details from the user. Additionally, it is desirable that the programming model provided to the user be compatible with a multithreading environment commonly used in application parallelization, for example, OpenMP. This has been the goal of this paper: to implement a Virtual Coprocessor Monitor (VCM) that allows programmers to transparently use OpenMP in their HW-accelerated applications. Each of the execution threads created by OpenMP uses a virtual coprocessor, so that the programming model is homogeneous for all threads, regardless of the actual number of coprocessors available in the system.

The paper is organized as follows. The next section details the execution model of HPRC machines and how it enables virtualizing the reconfigurable coprocessors. The architecture of the VCM is detailed in Section 3. Section 4 shows the results of the experiments that have been devised in order to validate the proposed virtualization model. Finally, conclusions and future work are presented in Section 5.

2. Usage methodology and execution model

Fig. 1. Alternatives for application acceleration: original sequential code (a), hardware acceleration (b), code parallelization (c), and code parallelization with acceleration in a virtualized coprocessor (d).

In reconfigurable computing systems, applications can be accelerated by porting a computationally-intensive section of the code to FPGA hardware [11,12]. Fig. 1(a) presents a representative application, consisting of an algorithm that iterates through three tasks, namely preprocessing T1, computation kernel T2 and postprocessing T3. When the computation kernel is ported to hardware, a significant acceleration may be obtained. For example, Fig. 1(b) presents a hardware implementation of the computation kernel that runs 5 times faster than its software counterpart. Data is preprocessed in software, then sent to the coprocessor where it is processed, and then returned to the CPU where it is postprocessed. Since preprocessing T1 and postprocessing T3 are still performed in software, the overall speedup is smaller than the hardware acceleration attained for task T2, according to Amdahl's law. In fact, the figure shows that the global speedup is just slightly bigger than 2, while the hardware acceleration of T2 was 5.
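This global figure is just Amdahl's law at work. As a worked illustration (the exact task weights of Fig. 1 are not given in the text, so a kernel fraction of 70% is assumed here):

% Amdahl's law: overall speedup when a fraction f of the runtime is
% accelerated by a factor A. Values are illustrative: f = 0.7 (assumed),
% A = 5 (the hardware acceleration of T2 from Fig. 1(b)).
\[
  S_{global} = \frac{1}{(1 - f) + \dfrac{f}{A}}
             = \frac{1}{0.3 + \dfrac{0.7}{5}}
             = \frac{1}{0.44} \approx 2.3
\]

Even a 5× kernel thus yields barely more than a 2× application speedup once the software-only tasks are accounted for.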

On the other hand, the method for application acceleration on multicore processors is code parallelization among several execution threads, one per core. This is shown in Fig. 1(c) for a 4-core processor, where it can be seen that the original tasks are replicated among the cores. That is, the original task T1 is replicated into four concurrent tasks T11, T12, T13, and T14. Parallelization is rarely perfect, so some of the parallel threads are faster than others. Additionally, synchronization is typically needed when each iteration ends, so faster threads have to wait for the slowest one. Even taking these drawbacks into account, acceleration is achieved because the computational load is distributed among the 4 concurrently running threads. Considering a simple model where all parallel threads take the same time, and t_1 is the time to execute T1i, t_2 is the time to execute T2i and t_3 is the time to execute T3i, the time needed to execute the application can be expressed as:

\[ t_{app} = I \cdot \frac{t_1 + t_2 + t_3}{A_P} \tag{1} \]

where A_P is the acceleration obtained due to code parallelization, and I is the original number of iterations in the non-parallelized version of the code. Usually parallelization is not perfect, so the acceleration A_P will be smaller than the number of concurrent cores N, although some rare applications may feature superlinear behavior where A_P is bigger than N.

The approach proposed in this paper is to apply both acceleration techniques: hardware acceleration and code parallelization. This is accomplished by time-sharing the FPGA coprocessor among the different threads that are being executed on the processor cores. Time-sharing is possible since typical applications do not store data in the coprocessor between algorithm iterations. That is, the coprocessor is stateless: results only depend on the data received in the present call; there is no dependency on data received in previous calls. The result of applying both acceleration schemes can be seen in Fig. 1(d). As there is only one coprocessor in the system, access to it is serialized, so performance is reduced. However, if the hardware acceleration is big enough, an overall speedup can be obtained. In fact, the execution time can now be written as:

\[ t_{app} = I \cdot \frac{t_1 + N \left( \dfrac{t_2}{A_{HW}} + t_{ov} \right) + t_3}{A_P} \tag{2} \]

where A_HW is the hardware acceleration attained for the computational kernel T2, t_ov is the time overhead due to sharing the coprocessor, and N is the number of concurrent cores. If we compare Eqs. (1) and (2), we can state that coprocessor sharing will be advantageous if

\[ N \left( \frac{t_2}{A_{HW}} + t_{ov} \right) < t_2 \tag{3} \]

That is, the use of a time-shared (virtualized) coprocessor will be worthwhile as long as the hardware acceleration A_HW is high, bigger than the number of concurrent cores N, and the virtualization overhead t_ov is small. Actually, it is not uncommon for FPGA acceleration to reach speedups of two orders of magnitude or more, while current processors have between 4 and 8 cores. Taking these figures into consideration, one can expect the use of time-shared coprocessors to be appealing for many applications. Of course, if condition (3) is not met, the performance of an all-software, multithreaded solution will be better. However, a third case might happen, where both A_HW and t_ov are high, and A_P is low. If this is the case, chances are that the best performance is obtained with a non-parallel, FPGA-accelerated code, which does not incur the costs of coprocessor sharing.
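To make condition (3) concrete, it can be rearranged as a bound on the tolerable overhead. The numbers below are purely illustrative, taken from the orders of magnitude mentioned above (A_HW = 100, N = 8):

\[ N \left( \frac{t_2}{A_{HW}} + t_{ov} \right) < t_2
   \;\Longrightarrow\;
   t_{ov} < t_2 \left( \frac{1}{N} - \frac{1}{A_{HW}} \right)
          = t_2 \left( \frac{1}{8} - \frac{1}{100} \right)
          \approx 0.115\, t_2 \]

In other words, with a 100× kernel shared among 8 cores, virtualization pays off as long as the per-call overhead stays below roughly 11% of the kernel's software execution time.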

The model presented in the previous paragraphs considers the case where synchronization among threads is required after each algorithm iteration. The problem with synchronization is that all threads will require access to the coprocessor at approximately the same time, causing contention. This contention in coprocessor requests is solved by access serialization, see Fig. 1(d). But if the application can be parallelized into independent threads, where synchronization is no longer needed, contention is avoided. After a few algorithm iterations, threads self-organize in such a way that simultaneous coprocessor access is prevented, as shown in Fig. 2. In that case, performance is increased, provided that coprocessor usage does not reach 100%:

\[ t_{app} = I \cdot \frac{t_1 + \left( \dfrac{t_2}{A_{HW}} + t_{ov} \right) + t_3}{A_P} \tag{4} \]

Fig. 2. Acceleration with HW virtualization and independent parallel tasks (no barriers).

Coprocessor virtualization does not significantly modify the usage methodology of HPRC systems: the application is first parallelized, and afterwards the most computationally intensive sections of the code are ported to the FPGA coprocessors. Since applications cannot distinguish between real and virtual coprocessors, the programming methodology remains the same. That is, a shared-memory multiprocessing API such as OpenMP is commonly utilized to ease application parallelization on multicore architectures, and code profiling is used to identify computational bottlenecks suitable for hardware acceleration. The main difference when using virtualized coprocessors concerns performance optimization. In systems without virtualization, performance will always benefit from more parallel tasks and more FPGA acceleration. However, when coprocessor virtualization is used, this affirmation is no longer true: the optimal number of parallel tasks depends on the accelerations attained and the virtualization costs, as stated in the previous equations. Although the simple models presented in this section allow for a rough estimation of the optimal number of threads, it is best to determine that number by experimenting on the real system, since the effort needed to develop an accurate performance estimation is significantly high. One of the key parameters when determining the optimal number of threads is the virtualization cost. In order to estimate this parameter, a straightforward experiment can be devised, as sketched below. In the FPGA coprocessor, a simple loopback design is implemented, which simply returns the data it receives unaltered. Bandwidth to the coprocessor is then measured with and without virtualization. The difference between the bandwidths obtained in the two cases is caused by the virtualization overhead.
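A minimal sketch of such a measurement harness is shown below, in C. Here coprocessor_send_recv() is a hypothetical placeholder for whichever transfer call is being measured (the vendor API directly, or the virtualization API of Section 3); everything else is standard POSIX timing.

/*
 * Loopback bandwidth probe: push a block through a loopback design in
 * the FPGA and time the round trips. coprocessor_send_recv() is a
 * hypothetical stand-in for the transfer call under test.
 * Compile with -lrt on older glibc for clock_gettime().
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

extern int coprocessor_send_recv(const void *in, void *out, size_t bytes);

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    const size_t bytes = 8u << 20;   /* 8 MB block, as in Section 4.2 */
    const int reps = 100;            /* repetitions for averaging     */
    void *in = malloc(bytes), *out = malloc(bytes);

    double t0 = now_sec();
    for (int i = 0; i < reps; i++)
        coprocessor_send_recv(in, out, bytes);
    double dt = now_sec() - t0;

    printf("bandwidth: %.1f MB/s\n", (bytes / 1e6) * reps / dt);
    free(in);
    free(out);
    return 0;
}

Running the probe once against the vendor API and once against the VCM gives the two figures whose difference is the virtualization overhead.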

3. Virtualization architecture

Virtualization implies moving from the original two-layer model to a three-layer model, see Fig. 3. In the original model, user processes connect directly to the coprocessor driver, but access is exclusive. That is, just one process can use the coprocessor at a given moment, so processes running on other cores cannot use the coprocessor. In the new model, the virtual coprocessor monitor (VCM) appears as a new layer in between the user processes and the coprocessor. The VCM connects to the coprocessor driver and creates several virtual coprocessors, which are accessed by user processes via the virtual coprocessor API. Access is still exclusive, but since many virtual coprocessors exist, each process may have its own HW accelerator, so the execution model is homogeneous.

Fig. 3. Software layers in a conventional (left) and in a virtualized (right) HPRC architecture.

Ideally, the VCM should be implemented at the kernel (driver) level in order to optimize its performance. Unfortunately, the coprocessor vendor only provides a driver and a user API; the low-level details of the FPGA operation are typically vendor-proprietary. This lack of information makes the kernel option infeasible, and therefore the VCM must run just above the driver, as a user-level daemon. The solution proposed in this work is presented in Fig. 4. The VCM daemon is composed of three main components: the virtualization manager, the virtual memory spaces and the request queue. The main component is the virtualization manager, which is the element that implements the time-sharing mechanisms and the interface to the actual coprocessor. User applications exchange data with the daemon by means of virtual memory spaces, which are implemented using the conventional POSIX memory-sharing IPC primitives. Operations requested by user applications are stored in the request queue, from where they are extracted in an orderly manner by the virtualization manager. The request queue is also implemented using standard POSIX IPC primitives, more precisely message queues.
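A minimal sketch of this IPC plumbing is given below, using the standard POSIX primitives the text mentions (mq_* message queues and shm_open/mmap shared memory). The queue name, request layout and buffer size are illustrative assumptions, not the authors' actual implementation.

/*
 * Sketch of the VCM daemon's request loop. The POSIX calls are real
 * (link with -lrt); VCM_QUEUE, vcm_request_t and SHM_SIZE are
 * hypothetical names chosen for this example.
 */
#include <fcntl.h>
#include <mqueue.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define VCM_QUEUE "/vcm_requests"
#define SHM_SIZE  (8 * 1024 * 1024)   /* one 8 MB virtual memory space */

typedef struct {
    pid_t  client;          /* requesting process                  */
    char   shm_name[64];    /* its virtual memory space (shm_open) */
    size_t bytes;           /* payload size to process             */
} vcm_request_t;

int main(void)
{
    struct mq_attr attr = { .mq_maxmsg = 8, .mq_msgsize = sizeof(vcm_request_t) };
    mqd_t q = mq_open(VCM_QUEUE, O_CREAT | O_RDONLY, 0600, &attr);
    if (q == (mqd_t)-1) { perror("mq_open"); return 1; }

    for (;;) {
        vcm_request_t req;
        if (mq_receive(q, (char *)&req, sizeof req, NULL) < 0)
            break;

        /* Map the client's virtual memory space. */
        int fd = shm_open(req.shm_name, O_RDWR, 0600);
        void *buf = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);

        /* The virtualization manager would forward buf/req.bytes to
         * the coprocessor through the vendor driver here, wait for
         * the results, write them back into buf, and then send an
         * acknowledgement to req.client. */

        munmap(buf, SHM_SIZE);
        close(fd);
    }
    mq_close(q);
    mq_unlink(VCM_QUEUE);
    return 0;
}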

Fig. 4. Proposed coprocessor virtualization architecture.

On the other side, user applications interact with the VCM daemon by means of the virtualization API, which is implemented as a library that is statically linked with the application. This virtualization API should be as close as possible to the original user API provided by the vendor, in order to make virtualization as transparent as possible to the programmer. However, there is a significant difference: while the original API communicates with the coprocessor device driver, the virtualization API communicates with the VCM daemon using IPC mechanisms. Currently, the virtualization API has four primitives: init_environment(), request_fpga(), send_recv_data(), and release_fpga(). The first one, init_environment(), creates a handler and connects to the VCM message queue. This primitive takes an optional parameter that points to the bitstream file, which tells the VCM which bitstream to use for configuring the coprocessor for the current application. This parameter must be used with care to avoid multiple reconfigurations from all threads in the application: usually only the master thread should use it, and the rest of the threads must set it to NULL. When request_fpga() is called, a message is sent to the VCM to inform it that a user application wants to use a virtual coprocessor. The VCM then creates a virtual memory space, and acknowledges by sending back a message with the details of the shared memory space just created. Actual data transfer is implemented in send_recv_data(). This function first copies the data to be processed into the virtual memory space, and then sends a message to the VCM. The VCM copies the data from the virtual memory space into the coprocessor, and copies the results back from the coprocessor into the virtual memory space. When the coprocessor has finished, the VCM sends an acknowledgement that the send_recv_data() function uses to know that it can copy the results from the virtual memory space into the application buffer. Finally, the release_fpga() primitive sends a message to the VCM that is used to free the virtual memory space. After having freed this space, an acknowledgement message is sent from the VCM in order to inform the function that it can detach from the shared memory space. It is worth noting that the virtualization API has been implemented to be thread-safe, allowing the programmer to use it with thread-based programming paradigms such as OpenMP or pthreads. Obviously, independent processes can also make use of the virtualization API.
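The fragment below sketches how a parallel application might drive these primitives from OpenMP. The four function names come from the text above; their exact signatures are not given in the paper, so the handle type and argument lists used here are assumptions for illustration (n is assumed to be a multiple of chunk).

/*
 * Hypothetical client-side usage of the virtualization API from an
 * OpenMP parallel region. The prototypes are assumed, not the real ones.
 */
#include <omp.h>
#include <stddef.h>

typedef struct vcm_handle vcm_handle;                 /* opaque, assumed   */
vcm_handle *init_environment(const char *bitstream);  /* NULL: no reconfig */
int request_fpga(vcm_handle *h);
int send_recv_data(vcm_handle *h, const void *in, void *out, size_t bytes);
int release_fpga(vcm_handle *h);

void process(const float *in, float *out, size_t n, size_t chunk)
{
    #pragma omp parallel
    {
        /* Only the master thread passes the bitstream, so the
         * coprocessor is reconfigured once, as required above. */
        vcm_handle *h = init_environment(
            omp_get_thread_num() == 0 ? "kernel.bit" : NULL);
        request_fpga(h);            /* get a virtual coprocessor */

        #pragma omp for
        for (size_t i = 0; i < n; i += chunk)
            send_recv_data(h, &in[i], &out[i], chunk * sizeof(float));

        release_fpga(h);            /* free the virtual memory space */
    }
}

Each thread sees its own virtual coprocessor, so the loop body is identical whether one or several physical FPGAs are present.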

The VCM supports an execution model with optimized I/O. When data is sent by user applications, several data requests accumulate in the queue, and the virtualization manager sends the data from all these requests to the coprocessor one after the other, without waiting for the processed data to be received. At the same time the data is sent, it programs the reception of the processed results. When the results have been copied, a callback function is called, so there is no need to poll, and reception can also be processed one result after the other. However, because the VCM is currently a single-threaded application, sending and receiving data cannot be done at the same time. The reason is that both functionalities are associated with callback functions, and only one of them can be executed at a time. This limitation could be overcome by implementing a multithreaded version of the VCM, with one thread associated to each callback function and the main thread in charge of the request queue.

4. Validation results

The proposed architecture has been validated on a workstation equipped with a dual-processor motherboard and 32 GB of RAM. One processor socket is populated with an Intel L5408 quad-core Xeon processor featuring 12 MB of L2 cache and a 1066 MHz front-side bus (FSB) interface to the motherboard. The other socket is populated with an XD2000i in-socket accelerator (ISA) from XtremeData Inc. [13]. The XD2000i in-socket module includes three Altera Stratix-III FPGAs: one EP3SL150 that acts as a bridge to the FSB bus, and two EP3SE260 devices that are available to the user to implement custom applications. The two EP3SE260 application FPGAs each contain 254,400 logic elements (LEs), 768 embedded 18x18-bit multipliers and 14.68 Mb of internal memory. The bridge FPGA implements FSB communication through an encrypted core, and provides two independent 9.6 GB/s bidirectional links (256-bit wide) to both application FPGAs. The application FPGAs are also connected to each other via a third 9.6 GB/s bidirectional link (256-bit wide). The bandwidth between the host and the in-socket accelerator depends on the type of data transaction. It is 2 GB/s for data being sent from the host to the ISA, and 1 GB/s for data being sent from the ISA to the host, significantly less than the 8.5 GB/s theoretical maximum of the FSB, due to limitations in the bridge FPGA. When there is simultaneous data exchange between the host and the ISA, the bandwidth is limited to a maximum of 1 GB/s in both directions. Additionally, each user FPGA has two 8 MB (2M x 36-bit) QDRII+ memories available, running at 350 MHz. These memories have independent read and write ports, so a maximum bandwidth of 2.8 GB/s can be attained, with a write latency of 2.5 clock cycles. The QDRII+ memories are not directly accessible from the bridge FPGA.

4.1. Mitrion-C implementation

Nowadays, the common practice for FPGA design is to use HDLs (hardware description languages). However, the semantic gap between HDLs and the languages traditionally used in HPC applications (Fortran, C++) constitutes a well-known weakness of HPRC technology [14]. Although HDLs provide total control over the generated hardware, design cycles are known to be slow and distressing: the time required to port a complex scientific algorithm to an FPGA usually goes from weeks to months [14]. In addition, HDL design cannot usually be handled by scientists or software engineers; it requires skilled computer engineers. Due to these drawbacks, several alternatives based on high-level languages (HLLs) have appeared in recent years to make FPGA development easier and faster [15]. Typically, HLL compilers use either ANSI C or proprietary C dialects as the entry language, although there are other alternatives such as Java or Matlab. The outcome of compilation is generally HDL code, which is forwarded to the tools provided by the FPGA vendor to create the configuration bitstream. Although the benefits of using HLL compilers depend heavily on the case study, the results reported in the literature are encouraging, at least in terms of development time [15]. This is the reason behind the choice of one of these compilers, Mitrion SDK [16], to create the validation designs used in this work.

Unlike other HLL compilers, Mitrion SDK does not directly generate gates. Instead, it generates a custom processor which optimally executes Mitrion-C code. The compiler analyzes instructions and data dependencies in the original code in order to maximize the number of operations that can be executed concurrently. The result is the so-called MVP (Mitrion Virtual Processor), which can execute several instructions, or even several iterations of a loop, in parallel. The entry to the Mitrion SDK tools is the Mitrion-C language, a dialect of C with some peculiarities. Firstly, its constructs and data types are focused on parallelism driven by data dependencies. There is no predefined execution order; instructions are executed as soon as their data dependencies are resolved. Secondly, Mitrion-C is a single-assignment functional language: variables can only be assigned once in a scope. Mitrion SDK supports a number of FPGA accelerators, the XD2000i among them. The interface between the host and the FPGA accelerator is straightforward. Data exchange is based on streams, using primitives such as mitrion_buffer_alloc, mitrion_stream_post_read or mitrion_stream_post_write. Additionally, there are procedures to connect to the FPGA, download the MVP and start/stop execution.
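For orientation, the host-side flow looks roughly like the sketch below. Only the primitive names come from the text above; the prototypes, the fpga handle and the call-order details are assumptions for illustration, not the actual Mitrion host API (see the Mitrion Users' Guide [16] for the real one).

/*
 * Assumed host-side flow for an MVP kernel. All signatures below are
 * illustrative guesses, not the real Mitrion API.
 */
#include <stddef.h>

typedef struct mitrion_fpga mitrion_fpga;   /* opaque handle (assumed) */

void *mitrion_buffer_alloc(size_t bytes);                            /* assumed */
int mitrion_stream_post_write(mitrion_fpga *f, void *buf, size_t n); /* host->MVP, assumed */
int mitrion_stream_post_read(mitrion_fpga *f, void *buf, size_t n);  /* MVP->host, assumed */

int run_mvp_kernel(mitrion_fpga *f, size_t bytes)
{
    /* Buffers suitable for streaming to/from the accelerator. */
    void *in  = mitrion_buffer_alloc(bytes);
    void *out = mitrion_buffer_alloc(bytes);

    /* ... fill 'in', then post the transfers; the MVP consumes the
     * input stream and produces the output stream concurrently ... */
    mitrion_stream_post_write(f, in, bytes);
    mitrion_stream_post_read(f, out, bytes);
    return 0;
}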

4.2. Bandwidth characterization

As introduced in Section 2, a simple methodology for estimating the virtualization costs is to implement a simple loopback design in the FPGA coprocessor. Using a trivial Mitrion-C code, the 256-bit input stream of one application FPGA in the XD2000i accelerator was directly connected to its 256-bit output. Fig. 5(a) shows the bandwidth obtained when sending data blocks of sizes ranging from 0.5 to 8 MB. Without virtualization, the bandwidth surpasses 500 MB/s. Although the raw bandwidth of the XtremeData ISA is 1 GB/s for full-duplex transfers, the actual number is smaller due to the overhead added by Mitrion-C. When virtualization is added, the bandwidth decreases to 200 MB/s, showing that the virtualization cost is significant in this implementation. After profiling the code, it was found that the overhead was caused by copying the data between the user space and the virtual memory space. First, data is copied from the user application into the virtualization memory space, from where it is sent to the coprocessor. The returned data is stored in the virtualization memory space, from where it is copied into the user application buffer. These two copies take 46% of the time, see Table 1. Another 32% of the time corresponds to communication management, that is, getting the requests from the queue and sending the acknowledgements. Finally, 22% of the time corresponds to the XtremeData API, which is not related to virtualization. Therefore, there is room for improvement in the VCM implementation, as stated in the previous section.

Fig. 5. Results of the bandwidth test.

Table 1. Execution time breakdown of the data interchange function.

    Task                        % Time
    Vendor API functions          22
    Memory operations             46
    Communication management      32

It is perhaps more interesting to see how the bandwidth increases when more parallel tasks are added. Fig. 5(b) shows that bandwidth increases when more processes are added, proving that virtualization actually increases coprocessor utilization. However, this behavior is not observed when the code is parallelized into multiple threads using OpenMP. The reason is that a synchronization barrier is automatically added at the end of the main program loop when the pragma omp for OpenMP directive is used. Therefore, execution time follows the model presented in Eq. (2), since synchronization means that all threads want to access the coprocessor at the same time. In this particular application, the software part of the code is minimal, so the preprocessing t_1 and postprocessing t_3 times are negligible in comparison to the coprocessor access time t_2. Additionally, in such a simple code, partitioning into parallel applications is nearly perfect (there are no dependencies between threads), so A_P ≈ N. Considering these two facts in Eq. (2), it follows that the execution time remains unchanged when more threads are added, because N in the numerator has a similar value to A_P in the denominator. However, a performance increase is observed when passing from 1 to 2 threads. This can be explained by the fact that the VCM has different memory spaces for each thread, so copies to different virtual memory spaces can be done concurrently (see Fig. 4). This concurrency manifests as a reduction in the virtualization overhead t_ov when more than one thread is being used, and this is the reason why the one-thread version of the code has slightly poorer performance than its multithreaded counterparts.

4.3. Ciphering application: IDEA algorithm

The second test consists of ciphering a file using the IDEA (International Data Encryption Algorithm) cipher in the simplest of the encryption modes, the Electronic CodeBook (ECB) mode. Two versions of the IDEA cipher have been developed using Mitrion-C. The first one implements just one ciphering unit, and the second one implements four parallel ciphering units, see Fig. 6. Since IDEA ciphers 64-bit words, and data streams in the XD2000i in-socket accelerator are 256 bits wide, the solution implementing four ciphering units is the one that best uses the bandwidth available to communicate with the coprocessor. Fig. 7 shows the results obtained for ciphering two different files whose sizes are, respectively, 5 MB and 300 MB. Both the SW-only and the HW-accelerated versions of the application have been coded using OpenMP. The performance of the SW-only version scales very well with the number of threads, reaching a 4× speedup when the four cores of the Xeon processor are being used. The HW-accelerated versions show much better performance than their SW-only counterparts; for example, the speedup for the one-thread case with the 300 MB file is 13× for the one-unit IDEA coprocessor and 15× for the four-unit IDEA coprocessor. However, virtualization only improves performance when moving from one thread to two threads. The reason for this behavior is that this application operates in a similar manner to the bandwidth characterization application described in the previous section. In this case, preprocessing corresponds to reading the file to be ciphered, and postprocessing to writing the ciphered data. Although I/O access (reading/writing a file) is usually expensive, Linux usually caches files in main memory, so these preprocessing and postprocessing times are small in comparison to the access to the coprocessor. As a consequence, its behavior is analogous to that of the bandwidth characterization application.

Fig. 6. Implementation of four IDEA ciphering units using Mitrion-C.

Fig. 7. Execution results for the IDEA ciphering application.
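The four-unit layout follows directly from the stream width: a 256-bit stream beat carries exactly four independent 64-bit ECB blocks. The C fragment below sketches the host-side packing under that assumption; the lane ordering is illustrative, not taken from the paper.

/*
 * Pack 64-bit IDEA/ECB blocks into 256-bit stream beats, four blocks
 * per beat (lane layout assumed for illustration).
 */
#include <stdint.h>
#include <stddef.h>

typedef struct { uint64_t lane[4]; } beat256_t;   /* one 256-bit word */

static size_t pack_beats(const uint64_t *blocks, size_t n, beat256_t *beats)
{
    size_t nbeats = (n + 3) / 4;
    for (size_t b = 0; b < nbeats; b++)
        for (int l = 0; l < 4; l++) {
            size_t i = 4 * b + l;
            beats[b].lane[l] = (i < n) ? blocks[i] : 0;  /* zero-pad tail */
        }
    return nbeats;
}

In ECB mode each block is ciphered independently, which is what lets the four units run in parallel with no inter-block state.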

4.4. Execution results for the Euler algorithm

The third test is aimed at evaluating the performance of hardware virtualization when a significant fraction of the application is executed in software. For this experiment, a simple CFD (Computational Fluid Dynamics) algorithm has been selected: a 1D Euler solver using the Roe scheme. This algorithm uses single-precision floating-point arithmetic, and it consists of three loops, see Fig. 8. The first loop, Roe, calculates the new values for each of the variables that define the state of the fluid at each point of the grid. The second loop calculates the time step for the next integration step, going through all points in the grid in order to find the maximum velocity of the fluid. Finally, the third loop applies the time step to each point in the grid. Calculations in loops one and three apply to each point separately, so they can be easily parallelized. The second loop basically consists of finding a maximum, so it can be done sequentially or parallelized using a reduction mechanism. In this case, it has been done sequentially in order to simplify the design of the solver. After profiling the software version, it was found that 66% of the time was spent in the first (Roe) loop, 11% in the second (maximum) loop, and 23% in the third (time advance) loop. Therefore, the first loop is the most suitable candidate to be ported to the hardware accelerator.

Fig. 8. Loops in the Euler 1D algorithm (left: serial execution; right: parallel execution).
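A skeleton of this three-loop structure, in the OpenMP form used for the SW-only version, might look as follows. roe_flux() and max_velocity() are placeholders standing in for the Roe scheme and the velocity search; the data layout and the CFL-style time-step formula are assumptions for illustration, not the authors' code.

/*
 * Three-loop Euler 1D skeleton. Loops 1 and 3 are per-point parallel;
 * loop 2 is kept sequential, as in the paper. The physics routines are
 * extern placeholders.
 */
#include <stddef.h>

typedef struct { float rho, mom, ener; } state_t;   /* conserved variables */

extern state_t roe_flux(state_t left, state_t right);    /* Roe scheme   */
extern float   max_velocity(const state_t *u, size_t n); /* signal speed */

void euler_step(state_t *u, state_t *du, size_t n, float dx, float cfl)
{
    /* Loop 1 (Roe): flux differences per point; the 66% hotspot that
     * was ported to the FPGA coprocessor. */
    #pragma omp parallel for
    for (size_t i = 1; i < n - 1; i++) {
        state_t fl = roe_flux(u[i - 1], u[i]);
        state_t fr = roe_flux(u[i], u[i + 1]);
        du[i].rho  = fr.rho  - fl.rho;
        du[i].mom  = fr.mom  - fl.mom;
        du[i].ener = fr.ener - fl.ener;
    }

    /* Loop 2: global maximum velocity -> time step (sequential). */
    float dt = cfl * dx / max_velocity(u, n);

    /* Loop 3: advance the state in time (per-point parallel). */
    #pragma omp parallel for
    for (size_t i = 1; i < n - 1; i++) {
        u[i].rho  -= (dt / dx) * du[i].rho;
        u[i].mom  -= (dt / dx) * du[i].mom;
        u[i].ener -= (dt / dx) * du[i].ener;
    }
}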

Fig. 9 shows the results for the 1D Euler solver implemented with OpenMP. The three loops were parallelized using the pragma omp for OpenMP directive, and after each loop the threads were synchronized by means of a pragma omp barrier directive. Two versions of the code were created: one totally executed in SW, and another which implements the whole Roe loop in the virtualized coprocessor. Both versions were compiled for one to four threads of execution, and the experiments were performed for three different grid sizes: 10^5, 10^6, and 10^7 points. Remarkably, the results show that adding HW acceleration does not work well for small grids, due to the communication overhead. When bigger grids are used, the communication time is not so relevant, so the HW-accelerated solutions outperform the SW-only solutions. The application scales reasonably well when multiple threads are used. For the 10^7-point case, performance is multiplied by 1.84 when the number of threads is increased from one to four for the SW-only version, and for the HW-accelerated version there is a slightly bigger 2.37× speedup. An interesting difference in this application, in comparison to the IDEA cipher, is that performance always increases when more threads are being used. The explanation is that the fraction of time executed on the processor cores is in this case significantly higher, concretely 34% of the total execution time. Therefore, in Eq. (2) the terms of the numerator not multiplied by N are no longer negligible, so time can be reduced by increasing the parallelization speedup A_P. Actually, if the whole algorithm is completely executed in hardware, the same behavior as in the bandwidth characterization application occurs, see Fig. 10: improvement only occurs when going from one to two threads, and performance does not increase when more than two threads are added because of the contention in the access to the coprocessor.

Fig. 9. Execution results for the Euler 1D CFD application, Roe loop accelerated in HW.

Fig. 10. Execution results for the Euler 1D CFD application, all loops accelerated in HW.

To better understand the performance results obtained, the TAU (Tuning and Analysis Utilities) Performance System has been used. This tool is able to gather performance information through the instrumentation of functions, methods, basic blocks and statements [17]. By default, TAU collects profile information and traces all user functions and OpenMP directives. The information obtained for a four-thread execution of Euler 1D using the Roe coprocessor is presented in Fig. 11. For the purpose of clarity, the OpenMP directives have been removed and only user functions are included. Fig. 11 shows the execution of three iterations of the Euler 1D algorithm. The first task that the threads execute is the preprocessing, t_preP, which is necessary to translate from the grid structure used by the algorithm to a more suitable structure required by the Roe coprocessor. This is a software task, so it can be executed in parallel. After it, they execute the send_recv_data() primitive to process the data using the coprocessor. This primitive copies the data to be processed into the virtual memory space, waits for the data to be processed, and copies the results back from the virtual memory space. To gather timing information about the primitive, TAU's instrumentation API was manually added to trace the primitive and obtain the subtask times. As a result, the first subtask, t_copy, shows the time to copy the data to be processed into the virtual memory space. This is a software task carried out in parallel for all threads, and it can be overlapped with computation. After copying the data, the second subtask, t_FPGA, shows the time for processing the data (computation); it includes the waiting time if the coprocessor is busy, and the data transfer to/from the virtual memory space to the FPGA. The third subtask, the second t_copy, shows the time to copy the results back from the virtual memory space to the application buffer. This task, again, can be overlapped with computation, but not with the sending of data (the first t_copy). Finally, the last task is the post-processing function, t_posP, which translates the coprocessor output back to the grid structure. The time spent in the processing section depends on the scheduling order selected by the VCM. After all threads execute the Roe loop (there is a barrier to synchronize all threads, t_BARRIER1), the master thread (thread 1 in Fig. 11) executes loop two in software while the rest of the threads wait in a barrier, t_BARRIER2, and finally all threads execute loop three in software, t_loop3. Before initiating a new iteration, there is another barrier to synchronize all threads. This barrier is required to guarantee that the grid is completely updated before executing the Roe loop again. However, this barrier forces all threads to start the Roe loop at the same time, and a new competition for the coprocessor starts again. The consequence of the barrier is that each iteration may have a different execution order. For example, in the first iteration the access order to the coprocessor was thread 1, thread 2, thread 4 and thread 3; in the second iteration the order was thread 4, thread 3, thread 1 and thread 2. Therefore, the HW time of the thread that first accesses the coprocessor gives an indication of what the real coprocessor execution time is, and the HW time is bigger in threads that access the coprocessor later, simply because they have to wait until the other threads finish, as presented in Fig. 1(d).

Fig. 11. Four-thread execution diagram of the Euler 1D CFD application with the Roe coprocessor.
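The manual instrumentation mentioned above can be reproduced with TAU's named-timer macros. A minimal sketch follows, assuming hypothetical helper functions for the three subtasks; TAU_START/TAU_STOP belong to TAU's C API, everything else is illustrative.

/*
 * Timing the send_recv_data() subtasks with TAU named timers.
 * copy_to_shared()/process_on_fpga()/copy_from_shared() are
 * hypothetical stand-ins for the primitive's internal steps.
 */
#include <TAU.h>
#include <stddef.h>

typedef struct vcm_handle vcm_handle;
extern void copy_to_shared(vcm_handle *h, const void *in, size_t bytes);
extern void process_on_fpga(vcm_handle *h, size_t bytes);
extern void copy_from_shared(vcm_handle *h, void *out, size_t bytes);

int send_recv_data_instrumented(vcm_handle *h, const void *in,
                                void *out, size_t bytes)
{
    TAU_START("t_copy_in");    /* first t_copy: user buffer -> shared  */
    copy_to_shared(h, in, bytes);
    TAU_STOP("t_copy_in");

    TAU_START("t_FPGA");       /* wait + transfers + FPGA computation  */
    process_on_fpga(h, bytes);
    TAU_STOP("t_FPGA");

    TAU_START("t_copy_out");   /* second t_copy: shared -> user buffer */
    copy_from_shared(h, out, bytes);
    TAU_STOP("t_copy_out");
    return 0;
}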

5. Conclusion and future work

Nowadays, it is increasingly difficult for HPRC systems to meet a 1:1 ratio between processor cores and FPGA accelerators. Multicore processors are improving quickly, but there are serious economic limitations to increasing the number of FPGA devices at the same pace. As a result, programmers are forced to give up using some processor cores or, alternatively, to implement complex load balancing schemes to ensure that threads without access to coprocessors finish at the same time as their HW-accelerated counterparts. In this paper we have shown that it is possible to split an FPGA accelerator into several virtual devices by using time multiplexing. A Virtual Coprocessor Monitor (VCM) has been developed, which provides an API similar to that offered by most coprocessors, but also allows for seamless integration with multithreaded applications written in OpenMP. Therefore, application programming is significantly simplified. Programmers use OpenMP directives to parallelize their code, and have access to the FPGA accelerator with a simple yet powerful set of primitives. When the code is compiled, it is automatically partitioned by the OpenMP tools into multiple threads, and each of these threads has access to a virtual coprocessor, regardless of the number of threads being created or the number of actual FPGA devices available in the HPRC system. The VCM has been evaluated with two different programs: IDEA ciphering and an Euler 1D solver. In both cases, the performance of the versions using virtualized HW is higher than that of the versions implemented only in SW, especially in those applications where the fraction of time executed in SW is significant. However, there is a significant overhead in the VCM, as observed in the assessment of the processor-virtual coprocessor bandwidth. This overhead is mainly due to the user-space implementation of the VCM daemon, which forces data to be copied to an intermediate buffer before being sent to the actual coprocessor. In future work, this limitation will be overcome by using a kernel-level VCM, but this will require knowing the low-level details of the FPGA accelerator, which were not available at the time of writing this paper.

Acknowledgements

This work is supported by the DOVRES project, part of the Airbus Fusim-E initiative.

References

[1] P. Conway, N. Kalyanasundharam, G. Donley, K. Lepak, B. Hughes, Cache hierarchy and memory subsystem of the AMD Opteron processor, IEEE Micro 30 (2) (2010) 16–29.

[2] D. Wentzlaff, P. Griffin, H. Hoffmann, L. Bao, B. Edwards, C. Ramey, M. Mattina, C.-C. Miao, J.F. Brown III, A. Agarwal, On-chip interconnection architecture of the Tile Processor, IEEE Micro 27 (5) (2007) 15–31.

[3] D.B. Kirk, W.W. Hwu, Programming Massively Parallel Processors: A Hands-on Approach, Morgan Kaufmann Publishers Inc., San Francisco, CA, 2010.

[4] P. Alonso, R. Cortina, F. Martínez-Zaldívar, J. Ranilla, Neville elimination on multi- and many-core systems: OpenMP, MPI and CUDA, The Journal of Supercomputing (2009) 1–11.


[5] D. Buell, T. El-Ghazawi, K. Gaj, V. Kindratenko, Guest editors' introduction: high-performance reconfigurable computing, Computer 40 (3) (2007) 23–27.

[6] W. Fornaciari, V. Piuri, General methodologies to virtualize FPGAs in Hw/Sw systems, in: Proceedings of the Midwest Symposium on Circuits and Systems, 1998, pp. 90–93.

[7] E. El-Araby, I. Gonzalez, T. El-Ghazawi, Virtualizing and sharing reconfigurable resources in High-Performance Reconfigurable Computing systems, in: Second International Workshop on High-Performance Reconfigurable Computing Technology and Applications (HPRCTA 2008), Nov. 2008.

[8] E. El-Araby, I. Gonzalez, T. El-Ghazawi, Exploiting partial runtime reconfiguration for High-Performance Reconfigurable Computing, ACM Transactions on Reconfigurable Technology and Systems 1 (4) (2009).

[9] C.-H. Huang, P.-A. Hsiung, J.-S. Shen, Model-based platform-specific co-design methodology for dynamically partially reconfigurable systems with hardware virtualization and preemption, Journal of Systems Architecture 56 (11) (2010) 545–560.

[10] J. Duato, A.J. Pena, et al., Modeling the CUDA remoting virtualization behaviour in high performance networks, in: Workshop on Language, Compiler, and Architecture Support for GPGPU, Bangalore, Jan. 2010.

[11] T. El-Ghazawi, E. El-Araby, M. Huang, K. Gaj, V. Kindratenko, D. Buell, The promise of High-Performance Reconfigurable Computing, IEEE Computer 41 (2) (2008) 69–76.

[12] M. Smith, S. Alam, P. Agarwal, J. Vetter, D. Caliga, A task-based development model for accelerating large-scale scientific applications on FPGA-based reconfigurable computing platforms, in: Proceedings of the Reconfigurable Systems Summer Institute (RSSI'06), Urbana, IL, USA, Jul. 2006.

[13] V. Natoli, J. Allred, J. Coyne, W. Lynch, In-socket FPGA implementation of bioinformatic algorithms using the Intel AAL, in: 2009 Symposium on Application Accelerators in High Performance Computing (SAAHPC'09), 2009.

[14] P. Saha, E. El-Araby, M. Huang, M. Taher, S. Lopez-Buedo, T. El-Ghazawi, C. Shu, K. Gaj, A. Michalski, D. Buell, Portable library development for reconfigurable computing systems: a case study, Parallel Computing 34 (4–5) (2008) 245–260.

[15] E. El-Araby, P. Nosum, T. El-Ghazawi, Productivity of high-level languages on reconfigurable computers: an HPC perspective, in: International Conference on Field-Programmable Technology (ICFPT 2007), Dec. 2007, pp. 257–260.

[16] Mitrion SDK 2.0.2, Mitrion Users' Guide, 2009 (software-to-hardware compiler). Available from: <http://www.mitrionics.com/>.

[17] S. Shende, A.D. Malony, The TAU parallel performance system, International Journal of High Performance Computing Applications 20 (2) (2006) 287–331.

Ivan Gonzalez is Associate Professor in the Department of Electronic and Communications Technology at Universidad Autonoma de Madrid (UAM), Spain. He received his Ph.D. degree in Computer Engineering from UAM in 2006. From November 2006 to January 2008 he was a Postdoctoral Research Scientist at the High Performance Computing Laboratory (HPCL), Electrical & Computer Engineering Department, The George Washington University (Washington, DC, USA). He was a faculty member of the NSF Center for High-Performance Reconfigurable Computing (CHREC) at The George Washington University. His main research interests are heterogeneous computing (with GPUs, FPGAs, etc.), parallel algorithms and performance tuning. Other interests include FPGA-based reconfigurable computing applications, with a special focus on dynamic partial reconfiguration, embedded systems and robotics.

Sergio Lopez-Buedo received his Ph.D. in Computer Engineering from Universidad Autonoma de Madrid (Spain) in 2003, where he currently serves as associate professor in the area of Computer Architecture. He was a visiting researcher at the University of British Columbia (2005) and at The George Washington University (2006, 2007), and he has also collaborated in the doctorate program of Universita degli Studi di Trento (2007–2009). FPGA technology is his main research interest, especially High-Performance Reconfigurable Computing and communication applications. Dr. Lopez-Buedo has more than 50 publications, including journals, conferences and books as editor, and he is also co-founder of Naudit HPCN, a company dedicated to providing high-performance computing and networking solutions.

Gustavo D. Sutter received an MS degree in Computer Science from State University UNCPBA of Tandil (Buenos Aires), Argentina, in 1997, and a Ph.D. degree from the Autonomous University of Madrid, Spain, in 2005. He has been a professor at UNCPBA, Argentina, and is currently a professor at Universidad Autonoma de Madrid, Spain. His research interests include ASIC and FPGA design, digital arithmetic, development of embedded systems and High Performance Computing. He is the author of three books and more than fifty international papers and communications.

Diego Sanchez-Roman received a degree in Computer Science and Mathematics from Universidad Autonoma de Madrid, Spain, in 2009. He is currently pursuing a Master's in Computer Science and Telecommunication Engineering at the same university. His research interests include computer architecture and High-Performance Reconfigurable Computing.

Francisco J. Gomez-Arribas received his Ph.D. from Universidad Autonoma de Madrid (UAM), Spain, in 1996. From October 1996 until November 2000 he was Assistant Professor at the Computer Engineering Department of UAM. He is currently Professor of Computer Architecture and Parallel Computing courses at the same university. His research interests concern reconfigurable computing applications based on FPGA circuits, with a special focus on the design of multiprocessor systems with reconfigurable architecture. Secondary fields of interest include network computing, cryptographic coprocessors, embedded systems-on-a-chip and experimental support of C.S. and E.E. education on the Internet.

Javier Aracil received the M.Sc. and Ph.D. degrees (Honors) from Technical University of Madrid in 1993 and 1995, both in Telecommunications Engineering. In 1995 he was awarded a Fulbright scholarship and was appointed as a Postdoctoral Researcher in the Department of Electrical Engineering and Computer Sciences, University of California, Berkeley. In 1998 he was a research scholar at the Center for Advanced Telecommunications, Systems and Services of The University of Texas at Dallas. He has been an associate professor at the University of Cantabria and the Public University of Navarra, and he is currently a full professor at Universidad Autonoma de Madrid, Madrid, Spain. His research interests are in optical networks and performance evaluation of communication networks. He has authored more than 100 papers in international conferences and journals.