research article high level synthesis fpga implementation...

12
Research Article High Level Synthesis FPGA Implementation of the Jacobi Algorithm to Solve the Eigen Problem Ignacio Bravo, 1 César Vázquez, 1 Alfredo Gardel, 1 José L. Lázaro, 1 and Esther Palomar 2 1 Electronics Department, University of Alcal´ a, Alcal´ a de Henares, 28805 Madrid, Spain 2 School of Computing, Telecommunications and Networks, Birmingham City University, Millennium Point, Birmingham B4 7XG, UK Correspondence should be addressed to Ignacio Bravo; [email protected] Received 2 December 2014; Revised 19 January 2015; Accepted 19 January 2015 Academic Editor: Jos´ e R. C. Piqueira Copyright © 2015 Ignacio Bravo et al. is is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. We present a hardware implementation of the Jacobi algorithm to compute the eigenvalue decomposition (EVD). e computation of eigenvalues and eigenvectors has many applications where real time processing is required, and thus hardware implementations are oſten mandatory. Some of these implementations have been carried out with field programmable gate array (FPGA) devices using low level register transfer level (RTL) languages. In the present study, we used the Xilinx Vivado HLS tool to develop a high level synthesis (HLS) design and evaluated different hardware architectures. Aſter analyzing the design for different input matrix sizes and various hardware configurations, we compared it with the results of other studies reported in the literature, concluding that although resource usage may be higher when HLS tools are used, the design performance is equal to or better than low level hardware designs. 1. Introduction Eigenvalue calculation is a problem that arises in many fields of science and engineering, such as computer vision [1], business and finance [2], power electronics [3], and wireless sensor networks [4]. In these applications, eigenvalues are normally used in a decision process, so real time performance is oſten required. Since eigenvalue computation is usually based on opera- tions such as matrix multiplication and matrix inversion, and a large amount of data must be handled, it is computationally expensive and can represent a potential bottleneck for the rest of the algorithm. e algorithms involved in eigenvalue and eigenvector computations are generally iterative and present strong data dependencies between iterations; however, many similar operations can be carried out in parallel in the same iteration, such as multiply and accumulate in matrix products. Although this is not important when working with general purpose processors or more specialized digital signal processors (DSPs), some devices leverage this feature by allowing the execution of repetitive operations in parallel. For example, graphic processing units (GPUs) are useful when a large number of multiply and accumulate operations are involved, since they are designed to work in graphics processing, which is based on vector and matrix operations. For similar reasons, field programmable gate arrays (FPGAs) display a remarkable capacity to carry out repetitive operations in parallel. ere are many applications where matrices are real and symmetric, with sizes not exceeding 20 × 20 [5, 6]. In this situation, EVD computation is easier, and there are many algorithms that take advantage of this. Some of the applications where the symmetric eigenvalue problem arises include the principal component analysis (PCA) statistical technique used for dimension reduction in data analysis [1, 7] and the multiple signal classification (MUSIC) algorithm [8] which uses a covariance matrix (which is real and symmetric by definition). In these studies, FPGA devices were employed to solve the problem, using a specialized hardware module to compute the EVD. As mentioned previously, most of the methods used to solve the eigen problem are iterative. Depending on how the operations are performed on the initial matrix and Hindawi Publishing Corporation Mathematical Problems in Engineering Volume 2015, Article ID 870569, 11 pages http://dx.doi.org/10.1155/2015/870569

Upload: phungque

Post on 04-Mar-2018

239 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Research Article High Level Synthesis FPGA Implementation ...downloads.hindawi.com/journals/mpe/2015/870569.pdf · High Level Synthesis FPGA Implementation of the Jacobi Algorithm

Research ArticleHigh Level Synthesis FPGA Implementation ofthe Jacobi Algorithm to Solve the Eigen Problem

Ignacio Bravo1 Ceacutesar Vaacutezquez1 Alfredo Gardel1 Joseacute L Laacutezaro1 and Esther Palomar2

1Electronics Department University of Alcala Alcala de Henares 28805 Madrid Spain2School of Computing Telecommunications and Networks Birmingham City University Millennium Point Birmingham B4 7XG UK

Correspondence should be addressed to Ignacio Bravo ibravodepecauahes

Received 2 December 2014 Revised 19 January 2015 Accepted 19 January 2015

Academic Editor Jose R C Piqueira

Copyright copy 2015 Ignacio Bravo et al This is an open access article distributed under the Creative Commons Attribution Licensewhich permits unrestricted use distribution and reproduction in any medium provided the original work is properly cited

We present a hardware implementation of the Jacobi algorithm to compute the eigenvalue decomposition (EVD)The computationof eigenvalues and eigenvectors has many applications where real time processing is required and thus hardware implementationsare often mandatory Some of these implementations have been carried out with field programmable gate array (FPGA) devicesusing low level register transfer level (RTL) languages In the present study we used the Xilinx Vivado HLS tool to develop a highlevel synthesis (HLS) design and evaluated different hardware architectures After analyzing the design for different input matrixsizes and various hardware configurations we compared it with the results of other studies reported in the literature concludingthat although resource usage may be higher when HLS tools are used the design performance is equal to or better than low levelhardware designs

1 Introduction

Eigenvalue calculation is a problem that arises in many fieldsof science and engineering such as computer vision [1]business and finance [2] power electronics [3] and wirelesssensor networks [4] In these applications eigenvalues arenormally used in a decision process so real time performanceis often required

Since eigenvalue computation is usually based on opera-tions such as matrix multiplication andmatrix inversion anda large amount of data must be handled it is computationallyexpensive and can represent a potential bottleneck for the restof the algorithm

The algorithms involved in eigenvalue and eigenvectorcomputations are generally iterative and present strong datadependencies between iterations however many similaroperations can be carried out in parallel in the same iterationsuch as multiply and accumulate in matrix products

Although this is not important when working withgeneral purpose processors or more specialized digital signalprocessors (DSPs) some devices leverage this feature byallowing the execution of repetitive operations in parallel

For example graphic processing units (GPUs) are usefulwhen a large number of multiply and accumulate operationsare involved since they are designed to work in graphicsprocessing which is based on vector and matrix operations

For similar reasons field programmable gate arrays(FPGAs) display a remarkable capacity to carry out repetitiveoperations in parallel

There are many applications where matrices are real andsymmetric with sizes not exceeding 20 times 20 [5 6] In thissituation EVD computation is easier and there are manyalgorithms that take advantage of this

Some of the applications where the symmetric eigenvalueproblem arises include the principal component analysis(PCA) statistical technique used for dimension reductionin data analysis [1 7] and the multiple signal classification(MUSIC) algorithm [8] which uses a covariance matrix(which is real and symmetric by definition) In these studiesFPGA devices were employed to solve the problem using aspecialized hardware module to compute the EVD

As mentioned previously most of the methods used tosolve the eigen problem are iterative Depending on howthe operations are performed on the initial matrix and

Hindawi Publishing CorporationMathematical Problems in EngineeringVolume 2015 Article ID 870569 11 pageshttpdxdoiorg1011552015870569

2 Mathematical Problems in Engineering

the results obtained a distinction can be made betweenpurely iterative methods (Jacobi power iterations) [9] andalgorithms that perform some kind of transformation of theinitial matrix (Lanczos-bisection householder-QR) [10 11]A distinction can also be made between algorithms thatoutput all the eigenvalues (Jacobi QR) and those that onlycompute extremal eigenvalues or the one closest to a givenvalue (power iterations Lanczos)

In the studies cited above the main method employedfor eigenvalue and eigenvector computation was the Jacobialgorithm since its characteristics render it highly suitablefor a parallel implementation The Jacobi algorithm can beimplemented in such a way that there are almost no datadependencies between operations in the same iteration and itis possible to use systolic architectures as has been proposedby Brent et al [12] In addition all the operations involvedcan be carried out by using the CORDIC algorithm [13] ashas been shown by Cavallaro and Luk in [14] and this canbe implemented simply with add and shift operations whichonly use a small amount of hardware resources compared tomultiplication and division operations Another advantage ofthe Jacobi method is that its round-off error is very stable [9]

However there are other methods for computingextremal eigenvalues that can take advantage of thecomputing power provided by FPGAs such as the QRalgorithm or the Lanczos [15] method since some of theiroperations can be carried out in parallel In this study we alsopresent an FPGA floating point implementation of the QRalgorithm In terms of a practical FPGA implementation it isa challenging task to make efficient use of the availableresources while at the same time meeting real timerequirements This is mainly because the analysis ofalgorithm data dependencies and their parallelization isextremely difficult

Furthermore FPGA implementations are usually carriedout in register transfer level (RTL) languages where there is avery low level of abstraction and it is therefore necessary tomanage resources and timing carefully This also implies thatwhen some architectural decisions are made it is difficult togo back without recoding most of the system Neverthelessthis kind of system can be very efficient in terms of the totalamount of resources used

When an algorithm with relevant mathematical work-load such as the ones mentioned before has to be imple-mented in an FPGA a great part of the design processconsist in the partition of the system in small units toachieve the required grade of concurrency Design time canbe shortened by using a high level synthesis (HLS) tool Thistool attempts to raise the abstraction level allowing algorithmspecification by means of a high level programming languagesuch as C or C++ Thanks to this the designer can focuson the algorithm itself while the tool makes the software tohardware conversion guided by some user directives whichrange from resource usage to concurrency grade Some ofthe tools that have become popular in the recent years areImpulse C [16] Catapult C by Calypto [17] and Vivado HLSby Xilinx [18]

Hardware implementations of the Jacobi algorithm arevery popular because its characteristics permit different

architectural designs depending on the required perfor-mance In the literature there are systolic implementationswhere speed is themost important requirement [19 20] serialdesigns where low resource consumption is selected [1 21]and a meet in the middle approach (semiparallel) where bothcriteria are balanced [8] In this study the Jacobi algorithmwas implemented using the Xilinx Vivado HLS tool toexplore different hardware implementations and comparethem to select the most efficient one in terms of executiontime and FPGA resources used

In this study the Jacobi algorithmwas implemented usingthe Xilinx Vivado HLS tool to explore different hardwareimplementations and compare them to select the mostefficient one in terms of execution time and FPGA resourcesused

To gain a better understanding of the results obtained wealso carried out a floating point implementation of the QRalgorithm using Vivado HLS The results obtained were thencompared to similar systems described in the literature

The rest of the paper is organized as follows In Section 2the mathematical background behind the Jacobi and QRalgorithms is presented together with some details relevantto implementation The proposed implementations and itsparameters are discussed in Sections 3 and 4 respectivelyFinally the system performance is compared with otherdesign approaches in Section 5 and we present our conclu-sions in Section 6

2 Mathematical Background

The eigenvalue problem [22] is one of the main questions innumerical algebra and due to themany applications in whichit is useful it has attracted much research attention in recentyears [23ndash25] The formulation of the problem is simple (1)but it can be approached from very different angles

119860V = 120582V (1)where V and 120582 are an eigenvalue and an eigenvector respec-tively of a matrix 119860 isin R119899times119899 Although there is an analyticalsolution it involves the calculation of polynomial roots andtherefore it is only suitable for low order (119899) matrices (119899 lt 4)When the problem has a higher order numerical algorithmsmust be used In this section the mathematical backgroundof Jacobi and QR algorithms is presented when they areused to compute the eigenvalues from real and symmetricmatrices

21 The Jacobi Algorithm The Jacobi algorithm [26] wasfirst proposed in 1846 but did not become popular until itscomputational implementation was explored The singularvalue decomposition (SVD) of a matrix 119860 isin R119899times119899 is givenby

119860 = 119880119863119881119879

(2)where 119880 and 119881 are orthogonal matrices and 119863 is a diagonalmatrix storing the singular values of119860 on its diagonal For thecase where 119860 is real and symmetric we can rewrite (2) as

119860 = 119881119863119881119879

(3)

Mathematical Problems in Engineering 3

The idea behind the Jacobi method is to perform aseries of similarity transformations (ie before and aftermultiplication with an orthogonal matrix) of 119860 to renderit more diagonal in each iteration This can be seen as aniterative expression

119860(119896+1)

= 119881(119896)

119860(119896)

119881(119896)119879

(4)

where 119896 represents the current iteration and 119896+1 the next oneThe sequence of transformation matrices is chosen in such away that the doublemultiplication renders the updated119860(119896+1)

a little more diagonal than its predecessor 119860(119896) A popularchoice for 119881(119896) is the Givens rotation matrix (5) Consider

119866 (119894 119895 120572) =

[[[[[[[[[[[[

[

1 sdot sdot sdot 0 sdot sdot sdot 0 sdot sdot sdot 0

d

0 sdot sdot sdot 119862120572sdot sdot sdot 119878120572sdot sdot sdot 0

d

0 sdot sdot sdot minus119878120572sdot sdot sdot 119862

120572sdot sdot sdot 0

d

0 sdot sdot sdot 0 sdot sdot sdot 0 sdot sdot sdot 1

]]]]]]]]]]]]

]

(5)

where 119862120572and 119878

120572are the sine and cosine of 120572 respectively

Parallelismof the algorithm resides in this choice If the119881119860119881119879product is studied it can easily be seen that only four elementsof 119860(119896) are modified so multiple similarity transformationscan be carried at the same time

Eigenvectors of 119860 are obtained by accumulating therotation matrices used in eigenvalue computation as shownin the following expression

119881 =

prod

119896=1

119881(119896)

where 119881(1) = 119868 (6)

To take full advantage of this Brent andLuk [19] proposeda systolic architecture in which the initial matrix 119860

119899times119899is

mapped onto a set of 2times2matrices such as the one shown inexpression (7) This leaves a subset of 2 times 2 diagonalizationproblems which can easily be solved in hardware makinguse of theCORDIC (COordinate RotationDIgital Computer)algorithm [27] as shown in [14]

119860119897119898= [1198862119897minus12119898minus1

1198862119897minus12119898

11988621198972119898minus1

11988621198972119898

] (7)

Since 1198992 diagonalization problems now have to besolved 1198992 rotation angles (120572

119897) are also required Given that

119860 is symmetric the angle is selected from the submatriceswhere 119897 = 119898 according to expression (8) so 119899 elements(where 119899 is the order of119860) can be eliminated in each iterationAs will now be shown angle calculation can also be carriedout by the CORDIC algorithm making an efficient use ofresources in the implementation

tan (2120572) =1198862119897minus12119898

11988621198972119898

minus 1198862119897minus12119898minus1

(8)

After each update of 119860(119896+1) it is necessary to bring newnonzero elements near the main diagonal so that they can beeliminated in the next iteration repeating the entire processuntil 119860(119896+1) = 119863 To this end several rows and columns areinterchanged as it will be explained in Section 3

As mentioned before the CORDIC algorithm [13 27]can be used to solve the double rotation (4) on everysubmatrix119860

119897119898and to calculate the rotation angles One great

advantage of CORDIC is that it can be expressed by meansof shift and add operations which are easy to implement inreconfigurable hardware It also facilitates working with fixedpoint codification where the word length is not fixed to 32 or64 bits as occurswith the floating point standard thusmakingmore efficient use of FPGA resources For example in thepresent study the entire system was fixed to a word lengthof 18 bits since this is the entry word length of Xilinx FPGAmultipliers

The fundamental idea behind CORDIC is to carry out aseries of rotations (called microrotations) on an input vectorto obtain another vector as outputThis is performed in threedifferent coordinate systems (linear circular and hyperbolic)and it is possible to obtain various basic functions suchas multiplication trigonometric operations logarithms andexponentials [28] in the components of the output vectorgiven some initial input vector conditions [27]Microrotationvalues are selected as shown in the following expression

tan (120572119894) = 120575119894 where 120575

119894= 2minus119894

(9)

The choice of 120575119894as an inverse power of 2 is what will allow us

to implement the algorithm using shifts since multiplicationor division for a power of 2 is equivalent to a shift to the leftor the right respectively in binary codification

In the circular coordinate system it is possible to obtainthe two operations required for the Jacobi algorithm Givensrotation and angle calculation (arctangent) Since CORDICperforms a series of rotations on a vector it can be seen as aniterative expression (11) Consider

V119894+1= [

1 minus1205901198942minus119894

1205901198942minus119894

1] V119894 V = [119909

119910] (10)

119911119894+1= 119911119894minus 120590119894120572119894 (11)

where V is a vector determined by its vertical and horizontalcomponents (119909 and 119910) and 119911 is a variable that keeps trackof the total rotated angle In each iteration the direction ofthe rotation 120590

119894is determined by the operation that is to be

performed In the particular case of the circular coordinatesystem the two algorithm operation modes are as follows

(i) Rotation mode the algorithm takes a vector V0=

(1199090 1199100) and an angle 119911

0and outputs a new vector

V = (119909 119910) which corresponds to V0rotated 119911 radians

To do so 120590119894is selected in such a way that 119911 approaches

zero in each iteration When 119911 cong 0 it can be assumedthat the starting vector has been fully rotated a 119911 angle

(ii) Translation mode algorithm input is again a vectorV0= (1199090 1199100) whose phase angV

0we want to compute

4 Mathematical Problems in Engineering

To do so the vector is rotated until its 119910 componentis zero selecting 120590

119894accordingly Since 119911 accumulates

microrotation angles when 119910 cong 0 the variable 119911will correspond to the initial phase of the vector Inaddition since 119910 = 0 119909 will store the absolute valueof V0

If this is expressed mathematically in rotation mode 120590119894=

sign(119911119894) the relation between the input and the output is as

follows

[119909

119910] = 119870

1[cos (119911

0) sin (119911

0)

minus sin (1199110) cos (119911

0)] [1199090

1199100

] (12)

On the other hand if the algorithm works in vectoringmode 120590

119894must be computed as 120590

119894= minus sign(119910

119894) and the relation

between the input and the output is now as shown in thefollowing expression

119909 = 1198701radic1199092

0+ 1199102

0

119911 = 1199110+ atan

1199100

1199090

(13)

As can be observed in expressions (12) and (13) theresult is scaled by a constant value of 119870

1 This is due to the

nonorthogonality of the microrotations which modifies theabsolute value of the vector To solve this results obtainedcan be scaled by 1119870

1 In addition 119870

1(14) is known in

advance and only varies with the number of microrotationsperformed Since 119870

1is fixed 1119870

1can be precomputed and

stored in a memory element to correct the output data

1198701=

119899minus1

prod

119894=0

radic1 + 1205902

119894sdot 2minus2119894 (14)

22 The QR Algorithm As indicated in Section 1 to showthe effectiveness of the Jacobi algorithm over other meth-ods when it is implemented in reconfigurable logic a QRalgorithm implementation was also carried out using VivadoHLS

In the QR algorithm another condition must be addedto our initial matrix which was real and symmetric now itmust also be a tridiagonal matrix like the one presented inthe following expression

119879 =

[[[

[

119886111988720 0

1198872119886211988730

0 119887311988631198874

0 0 11988741198864

]]]

]

(15)

To obtain the eigenvalues of119879 a reduction strategy is usedwhere the 119887

2 119887

119899entries are zeroed When for example

1198872= 0 the row and the column involved can be removed

where 1198861is an eigenvalue of 119879

To accomplish this in the QR algorithm a sequence ofmatrices similar to119879 (119879(1) 119879(119899)) is produced [29] To startwith the initial matrix 119879(119896) is factored as follows

119879(119896)

= 119876(119896)

119877(119896)

(16)

where 119876(119896) is an orthogonal matrix and 119877(119896) is an uppertriangular matrix In a similar way 119879(119896+1) is defined as

119879(119896+1)

= 119877(119896)

119876(119896)

(17)

As presented thus far the convergence of the method isslow To speed it up a shift can be performed on the originalmatrix for a 120590 quantity close to the eigenvalue we are lookingfor Thus we now have a new sequence of matrices such that

119879(119896)

minus 120590119868 = 119876(119896)

119877(119896)

119879(119896+1)

= 119877(119896)

119876(119896)

+ 120590119868

(18)

After each shift the eigenvalues of 119879 will be shifted by thesame 120590 amount and it is easy to recover the original valuesby accumulating the shifts after each iteration To obtainan approximated value of 120590 for each iteration a popularchoice is to calculate the eigenvalues of the matrix 119864 (19) andtake the one closest to 119886

119899 This is called the Wilkinson shift

[26] Finally 119876(119896) and 119877(119896) are chosen to be Givens rotationmatrices to eliminate the appropriate 119887

119894element

119864 = [119886119899minus1

119887119899

119887119899119886119899

] (19)

3 Implementation Details

As discussed in Section 1 the implementation tool used inthis study was the Xilinx Vivado HLS [18] which allowsthe use of a high level programming language (C C++ orSystem C) instead of a hardware description language such asVHDL or Verilog Although it presents some drawbacks thismeans of implementing the algorithms enabled us to test verydifferent hardware configurations of the systemwith very fewdesign modifications

Due to its iterative nature the Jacobi method can easilybe expressed as a series of loops which is the main elementused in Vivado HLS to implement the algorithms

This can be done via pragma directives or tcl scriptinggiving each part of the design different attributes that deter-mine the degree of concurrency or the desired resource con-straints Some important operations that can be performedon design loops include the following

(i) Pipelinewhichwill try to achieve parallelism betweenthe operations performed in the same iterationPipeline operation can also be specified throughoutthe entire design so that all internal loops are unrolledmaking parallelism between two system iterationspossible

(ii) Unroll which will implement each iteration as aseparate hardware instance

(iii) Dataflow which will try to achieve parallelismbetween different loops that are data dependentimplementing communication channels such as ping-pong memories

As mentioned before we can also apply resource con-straints to the designs limiting the number of instances

Mathematical Problems in Engineering 5

of a particular hardware element The elements that canbe limited range from operators (+lowast) to specific resourcessuch as DSP48E and BRAM memory blocks User definedelements such as a function called multiple times canalso be constrained and this is very important for designoptimization as will be shown later

Figure 1 shows a functional diagram of the Jacobi algo-rithm Since the C programming language was used eachtask was implemented as a for loop and labeled to makeoptimization easier

When the design starts its operation (a) the input matrix119860 isin R119899times119899 is buffered in a memory element 119878 In addition tocompute eigenvectors 119881(1) is loaded with an identity matrix(119868 isin R119899times119899) After the initial operations there are three tasks(a b and c) to be performed ℎ times in the external loopwhere ℎ corresponds to the total number of Jacobi iterations(elimination of the 119899 elements adjacent to the main diagonal)that the algorithmwill execute and is selected in suchway thateigenvector error is minimized

Thefirst task thatmust be done in each Jacobi iteration (b)is to calculate rotation angles (120572(119896)

119895) As previouslymentioned

this was achieved using the CORDIC algorithm in vectoringmode

Thenext tasks to be performed (c1 and c2) are eigenvalueand eigenvector rotations which consist of pre- and post-multiplication with the Givens rotation matrix Since thereare no data dependencies between these two tasks they areplaced in different loops and thus parallelization is possiblePre- or postmultiplication with a Givens rotation matrix canbe expressed as two vector rotations so it can be performedwith two rotation mode CORDIC algorithm executionsThisleaves us with six CORDIC executions four of them whichcorrespond to eigenvalue calculation and the other two toeigenvector computation

Thefinal operation to be performed before the next Jacobiiteration (d) is a series of row and column permutations withthe purpose of introducing new nonzero elements into thediagonal submatrices

This element rearrangement was implemented asdescribed by Brent and Luk [19] but with somemodificationsIn their study they proposed a systolic architecturewhere each submatrix corresponded to an independenthardware unit called a processor and all processors wereinterconnected so that data interchanges were possible

Since our design did not have a defined hardware archi-tecture at this point element interchangewas substitutedwiththe corresponding row and column permutations in order tofulfill the requirement of introducing new nonzero elementsinto the diagonal submatrices

Finally when all the Jacobi rotations are completedeigenvalues and eigenvectors are output as arrays which canbe implemented as BRAM ports or AXI buses so that thedesign can be integrated in different kinds of system

All tasks were implemented in the C programminglanguage using a fixed point data type following the XQNquantification scheme where a number is represented by 119883bits of integer part and 119873 bits of fractional part with a total

number of bits (word length) of WL = 119883 + 119873 + 1 for signedvalues and WL = 119883 + 119873 for unsigned values

Starting from the initial C prototype different hardwarearchitectures can be considered depending on the designrequirements Mainly resource usage (LUT Flip-Flops mul-tipliers memory elements) and timing performance (latency(119879119871) throughput or initiation interval (II) and maximum

clock frequency) are the parameters that we attempt tooptimize

(i) the latency (119879119871) which specifies how much time will

take until the system generate the first valid result(ii) the initiation interval (II) which corresponds to the

throughput or the number of clock cycles betweentwo valid sets of results after the latency period

As a result of the Vivado HLS optimization directivesthree architectures can be achieved depending on whichparameters are considered more important

(i) Serial architecture this hardware scheme pursuesminimal resource usage at the expense of a higherlatency To reach this optimization level a pipelinedirective is specified on the external loop of thedesign which performs the Jacobi iterations

(ii) Parallel architecture in this kind of system the maingoal is to achieve the best performance in termsof execution time (latency and throughput) at theexpense of resource usage To achieve this pipelineis applied to the entire design which means thatthe external loop will be unrolled and parallelismbetween two algorithm executions will be possible

(iii) Semiparallel architecture although timing is stillimportant in this scheme resource usage is alsoconsidered and attempts are made to minimize itwhile affecting execution time as little as possible Inour particular case this was achieved by limiting thenumber of CORDIC instances of the design in theparallel architecture

Although all three alternatives were analyzed the fullparallel architecture was found to use a massive amount ofinternal FPGA resources greatly exceeding the capacity ofthe device This was mainly due to an excessive amountof CORDIC units in the rotation mode which led to 55hardware instances of it in the case of an 8 times 8 inputmatrix Consequently we ruled out the possibility of parallelarchitecture

It is important to clarify the difference between semi-parallel and serial architecture In both cases we alwaysattempted to maximize concurrency between operationssuch as arithmetic operations andmemory accesses howeverwith parallel architecture it was also possible to achieveconcurrence between two full executions of the algorithm

The excessive amount of CORDIC units implemented bythe Vivado HLS synthesizer was taken as a starting pointfor optimization of the design Since Vivado HLS allowedus to limit the number of elements of a particular hardwareresource or entity the number of CORDIC instances in both

6 Mathematical Problems in Engineering

S A

V Intimesnk = 1

120572(k)j = 05tans(k)2jminus12j

s(k)2j2j minus s(k)2jminus12jminus1

Sk+1 Sk+1aux Vk+1 Vk+1aux

(k+1) = V(k)ij middot R(120572j)

i = 1 n2 j = 1 n2 j = i n2i = 1 n2

(k+1) = R(120572i)Tmiddot S(k)ij middot R(120572j)

k = h

(a)

(b)

(c2)

(d)

(c1)

Result rearrange

No

[ ]

Yes

A Rntimesnisin

Vaux 119894119895 Saux 119894119895

larrminus

larrminus

larrminus larrminus

rarr

rarrVlarrminus

1205821 120582n diag(S)

lceilrceil lceilrceil

larrminus

1205821 120582n

j = 1 n2

Figure 1 Jacobi algorithm functional diagram

semiparallel and serial architecture was limited to differentvalues allowing Vivado HLS to redesign the RTL implemen-tation and subsequently analyzing the design performanceobtained

First we analyze the case of the semiparallel architectureSince 8 times 8 is a common size in the literature for input matri-ces the design was optimized for this matrix size extendingthe conclusions reached to other matrix sizes The resultsobtained (resource consumption and timing performance)are shown in Figure 2

An analysis of timing performance indicated thatalthough latency did not vary very much the initiationinterval (II) was considerably reduced when the number ofrotation mode CORDIC instances was increased from oneto four slowly decreasing as more CORDIC modules wereadded

This result was mainly due to data dependencies betweenoperations performed in the Jacobi algorithm As previously

indicated four rotations must be performed for eigenvaluecalculation and the result of the first two is required toperform the last two whereas two independent rotationsmust be performed for eigenvector calculation

From this it can be concluded that four CORDICmodules will greatly increase the performance of the systemWhen timing performance is related to resource usage it isclear that it increased linearly with CORDIC module usageand the slope variation was due to the different control logicimplemented by Vivado with an odd and an even number ofCORDIC modules

The last conclusion that can be reached from Figure 2 isthat there was parallelism between two algorithm iterationssince 119879

119871gt II This means that the first valid result will be

obtained after 119879119871clock cycles but the next results will be

obtained more rapidly after every II clock cyclesThe results obtained for the serial architecture are shown

in Figure 3 This time only the latency is shown since there

Mathematical Problems in Engineering 7

1 2 3 4 5 6 7 8 90

20406080

100

Disc

rete

elem

ents

()

Resource usage

LUTFF

1 2 3 4 5 6 7 8 90

500

1000

1500

2000

Rotation CORDIC modules

Rotation CORDIC modules

Execution time (clock cycles)

IITL

nmiddotT

CLK

Figure 2 Semiparallel architecture implementation results of theJacobi algorithm for 8 times 8matrices

was no parallelism between two algorithm executions andthey only differed in one clock cycle that the system neededto start the next execution (II = 119879

119871+ 1)

As indicated earlier the execution time between two validresult sets is the same as the latency period (119879

119871) and in

the semiparallel design alternative it evolved similarly to theinitiation interval

For similar reasons there was a greater improvement inlatency in the first fourCORDICmodule additions which didnot decrease at all after sixThe increase in resource usage wasalso linear now with a constant slope although it used fewerresources

4 Assessment of Design Parameters

41 Optimal Number of Jacobi Iterations To evaluate thequality of the results obtained using the Jacobi method weconducted an experimental error analysis to determine howmany Jacobi iterations were required to obtain satisfactoryresults

The word length (WL) selected for the fixed point cod-ification scheme presented in Section 2 was 18 bits with afractional part of 15 bits (2Q15) since Xilinx FPGAmultiplierentry is fixed to this value and the purpose of using theCORDIC algorithm was to minimize resource usage Inaddition as will now be shown this codification providedsufficient precision to avoid overflow and growing roundingerrors

Simulating the system for multiple 8 times 8 real and sym-metric random matrices we calculated the eigenvalue and

1 2 3 4 5 6 7 8 90

1020304050

Disc

rete

elem

ents

Resource usage

1 2 3 4 5 6 7 8 91500

2000

2500

3000

Rotation CORDIC modules

Rotation CORDIC modules

Execution time

LUTFF

TL

nmiddotT

CLK

Figure 3 Serial architecture implementation results of the Jacobialgorithm for 8 times 8matrices

21 215 22 225 23 235 24 245 25 255 260

1020304050

Max

imum

erro

r (

)

Iterations

21 215 22 225 23 235 24 245 25 255 260

02040608

1

Qua

drat

ic m

ean

erro

r (

)

Iterations

Eigenvector component error (rarr

rarr

)1205821 120582m

Figure 4 Eigenvectors maximum and quadratic component error

eigenvector maximum (119890MAX) and quadratic (119890SQRT) errorcomparing FPGA Jacobi fixed point results with MATLABfloating point implementation (eig)

When the results were analyzed we found that the eigen-vector error stabilized later than the eigenvalue error (moreiterations were required) Consequently Figure 4 shows

8 Mathematical Problems in Engineering

the eigenvector component error calculated according to thefollowing expressions

119890 =

1003816100381610038161003816119881HLS1003816100381610038161003816 minus1003816100381610038161003816119881MATLAB

10038161003816100381610038161003816100381610038161003816119881MATLAB

1003816100381610038161003816

sdot 100

119890MAX = max (10038161003816100381610038161198901003816100381610038161003816)

119890SQRT =1

1198962

radic

119896minus1

sum

119894=0

1198902

(20)

where 119896 represents the total number of samples The errorobtained was plotted as a function of the number of Jacobiiterations (ie the number of rotations performed for eachsubmatrix) and was used to determine the optimal numberof iterations (ℎ) that the algorithm must perform to obtainsufficiently accurate results

Our analysis of 8 times 8 matrices indicated that ℎ = 26

was the best number of iterations for both eigenvectors andeigenvalues in order to obtain a sufficiently accurate solutionsince the error was constant from this value on

After performing the same tests for other input matrixsizes it was concluded that in general expression (21) canbe used to determine the optimal number of iterations to beperformed for an 119899 times 119899matrix

ℎ = 9 + 119899 log 119899 (21)

42 SerialParallel Jacobi Analysis Since an expression wasavailable for the optimal number of Jacobi iterations (ℎ) wealso conducted a study of the design performance dependingon initial matrix size

With the parallel architecture we found that althoughthe Vivado HLS generated an RTL design for matrices withan order 119899 gt 10 the Vivado suite synthesizer was unableto implement it successfully However the serial architecturedesign performed well for higher order matrices (20 times 20)

Based on the parametric study performed in Section 3we decided that a good trade-off between execution time andresource usagewould be to use fourCORDICmodules whichwould perform WL minus 1 iterations in both parallel and serialarchitectures where WL was the data word length

Implementation results for the serial architecture areshown in Figure 5 for the Xilinx Artix 7 series (XC7A100T-1CSG324C) device

Turning first to resource usage it can be seen that thisdid not not increase as fast as execution time since whenmatrix size is increased with a fixed amount of CORDICunits only storage elements and control logic will be addedConsequently it becomes evident that in terms of resourcesused the design has an 119874(1198992) complexity

Regarding execution time (119879exe) the Vivado HLS enabledus to examine the sequence of operations performed by thedesign in each clock cycle An analysis of this informationindicated that the design latency which corresponds toexecution time in the serial architecture follows the followingexpression

119879119871= 119879load + 119899

2

sdot ℎ (22)

8 9 10 11 12 13 14 15 1605

1

15

2Used resources

FF an

d LU

T (

)

8 9 10 11 12 13 14 15 160

50

100

150

FFLUT

Input matrix size (n)

Input matrix size (n)

Execution time with 100MHz

times104

Tex

e(120583

s)

fCLK =

Figure 5 Serial Jacobi design performance with matrix orderincrease

where 119899 is the size of the processed matrix and ℎ is thenumber of Jacobi iterations performed following expression(21) Meanwhile 119879load (23) corresponds to the time requiredto read sufficient data from an external memory to start thefirst Jacobi iteration and 1198992 is the amount of clock cyclesrequired to perform a Jacobi iteration

119879load =1198992

2+119899

2+ 4 (23)

A similar study was performed for different numbersof rotation CORDIC units which revealed that the time toperform a Jacobi iteration varied However it remained afunction of 1198992 so system latency (119879

119871) was still related to the

size of the inputmatrix as shown in the following expression

119879119871= 119879load + 119879119869 sdot ℎ (24)

where 119879119869= 119891(119899

2

) is the time required to perform a Jacobiiteration

Regarding the maximum size (119899) of the input matrix weassumed that given the serial nature of the system resourceusage would be the limit however we found that for 119899 gt 20the Vivado HLS took a very long time to guess the controllogic and implementation eventually failed Despite thisthe design has been coded in such a way that input matrixsize and the codification scheme can be modified simply bychanging a few constants which is a vast improvement overthe classic RTL designs

Mathematical Problems in Engineering 9

Design comparison

11000 23000 34000 45000LUT

8500

17000

26000

34000

FF

400080001200016000

4000

8000

12000

16000

Semiparallel JacobiSerial JacobiFloating point QRMUSIC (Xu et al 2008)VHDL (Bravo et al 2008)Vivado HLS library

TL(nTCLK)

II(nT

CLK

)

Figure 6 Comparative between different design alternatives

5 Comparison with Other Proposals

The results obtained were also compared with those reportedin other studies of 8 times 8 matrices since these have receivedmuch research attention

Figure 6 shows the implementation results for the twoarchitectures proposed here (serial and parallel) a Jacobiserial architecture carried out in VHDL [1] a Jacobi semi-parallel architecture from the multiple signal classification(MUSIC) algorithm [8] and the Vivado HLS linear algebralibrary svd implementation

The criteria selected to compare the different designswere as follows FPGA resource consumption in terms of thediscrete elements FF and LUT computation time (latency andinitiation interval) and maximum clock frequency (119891CLK)

The VHDL serial Jacobi algorithm features a fully cus-tomized system using two ad hoc CORDIC modules one ofwhich works in rotation and vectoring mode while the otheris optimized to work only in rotation mode

The Jacobi module used in theMUSIC algorithm featuresa semiparallel architecture based on the Brent and Lukproposal [12] and also uses the CORDIC algorithm for thecalculations However it does not implement the full systolicarray only making use of a subset of this to save resources

Turning first to the two Jacobi algorithm implementationsproposed here it is evident that the best performance interms of speed was achieved by the semiparallel architecturewhereas the serial architecture only used half of the resourcesrequired to implement the parallel one

Vivado HLS library

Clock frequency (MHz)

VHDL (Bravo et al

2008)QR

Parallel Jacobi

Serial Jacobi

0 50 100 150 200 250 300

Figure 7 Maximum clock frequency for all the designs when inputmatrix size is fixed to 8 times 8

The other Vivado HLS implementation considered wasthe svd function from the Vivado HLS library which canonly be implemented in single precision floating point for-mat Resource usage is not dissimilar to our Vivado HLSimplementations however it is slower and makes use of 58DSP48E blocks

The difference arises from the use of floating point coreswhich are implemented with DSP units while the fixedpoint CORDIC approach employed in the present study onlyrequired two DSP48E multipliers per unit

As previously mentioned a floating point QR algorithmwas implemented in the same FPGAdevice also usingVivadoHLS The main purpose of this was to determine whetherother algorithms presented greater computational efficiencythan the Jacobi method Although we used different codingsystems in the Jacobi and QR algorithms respectively ananalysis of both implementations revealed that QR was lesssuitable for a hardware implementation since the operationsperformed can vary between different iterations

Nevertheless we found that although the performanceof the algorithm in terms of execution time was poor it didnot use a large amount of FPGA resources and it thereforerepresents an alternative to consider when floating pointcodification is required

Looking next at the eigen solver implemented as part ofthe MUSIC algorithm it can be seen that it did not performvery well On paper it is similar to the semiparallel architec-ture presented here but the control logic was designed byhand This shows that HLS performs better than traditionalhardware design flows when control logic becomes toocomplicated since it is implemented automatically

Our final comparison was with the VHDL serial architec-ture in which only two CORDIC modules were used Sinceeverything in this design was customized for the applicationtiming performance was similar to HLS designs but it usedfar fewer resources

As regards the maximum clock frequency at which thedesigns can work Figure 7 shows a comparison of all thealternatives discussed

On the one hand we have VivadoHLS designs where thedifference between floating point and fixed point is evidencedby the resultsWhile floating point designs can easily go above200MHz fixed point designs stay around 120MHz

10 Mathematical Problems in Engineering

On the other hand we find theVHDLdesign In this casethe performance of the design proposed in [1] was similar tothe floating point designs since it was hand-coded at a verylow optimization level

6 Conclusions

In this paper we have described a high level synthesisimplementation of the Jacobi algorithm for eigenvalue andeigenvector calculations and this has been compared tosimilar or related systems reported in the literature

The main purpose of using a HLS tool was to try toachieve a similar performance to that of a customRTL systemcoded in for example VHDL in a fraction of the design timerequired

In recent years the amount of resources available inFPGA devices has grown exponentially and thus rapidity isbecoming more important than efficiency when implement-ing computational algorithms

Furthermore HLS tools have evolved making it possibleto achieve good systems that are easy to integrate into largerdesigns as hardware coprocessors together with a CPU or asa processing block in regular RTL designs

As final paper conclusion we can say that the use of highlevel implementation tools reduce the design time In mostcases the designs are initially coded simulated and validatedwith high level languages where there exist standard librariesand reference code available Manual RTL coding is notusually an easy step because there is not a straight way totranslate the functional algorithm to RTL description Inaddition the required level of RTL expertise is very high

Another inconvenience for RTL-based designs appearswhen new changes have to be included in the algorithmTheRTL structure is not very flexible and any modification to thealgorithm requires a lot of time (simulation and recoding)Nevertheless RTL designs generate hardware reconfigurableapproaches with less resources and higher clock frequencythan high level hardware designs (see Figures 6 and 7)

The developed algorithm in this work was initially madewith a traditional RTL flow [1] As a result we can comparethe results of a complex algorithm such as Jacobi based onRTL orHLS flows According to Figures 6 and 7 the achievedresults for the HLS methodology compared with RTL-baseddesign are favorably from different points of view

We have shown that the design process varies from thetraditional RTL flow since it is possible to iterate on the samecode to obtain different hardware architectures and performparametric simulations of the hardware in order to obtain thebest results

Therefore the designer should balance what are the priorrequirements in the system clock frequencyresources ordesign time and choose the methodology (RTLHLS) thatbest fit for that particular case

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgment

This work has been supported by the Spanish Ministry ofEconomy and Competitiveness through the ALCOR Project(DPI2013-47347-C2-1-R)

References

[1] I Bravo M Mazo J L Lazaro P Jimenez A Gardel and MMarron ldquoNovel HW architecture based on FPGAs oriented tosolve the eigen problemrdquo IEEE Transactions on Very Large ScaleIntegration (VLSI) Systems vol 16 no 12 pp 1722ndash1725 2008

[2] H Gao and R Ma ldquoOn a system modelling a population withtwo age groupsrdquoAbstract andApplied Analysis vol 2014 ArticleID 920348 5 pages 2014

[3] X Tang W Deng and Z Qi ldquoInvestigation of the dynamicstability of microgridrdquo IEEE Transactions on Power Systems vol29 no 2 pp 698ndash706 2014

[4] S Wu J Zhang Y Hou and X Bai ldquoConvergence of gossipalgorithms for consensus in wireless sensor networks withintermittent links andmobile nodesrdquoMathematical Problems inEngineering vol 2014 Article ID 836584 18 pages 2014

[5] J A Jimenez M Mazo J Urena et al ldquoUsing PCA in time-of-flight vectors for reflector recognition and 3-D localizationrdquoIEEE Transactions on Robotics vol 21 no 5 pp 909ndash924 2005

[6] W Haiqing S Zhihuan and L Ping ldquoImproved PCA withoptimized sensor locations for process monitoring and faultdiagnosisrdquo inProceedings of the 39th IEEEConfernce onDecisionand Control vol 5 pp 4353ndash4358 December 2000

[7] A A S Ali A Amira F Bensaali andM Benammar ldquoPCA IP-core for gas applications on the heterogenous zynq platformrdquoin Proceedings of the 25th International Conference onMicroelec-tronics (ICM rsquo13) pp 1ndash4 December 2013

[8] D Xu Z Liu X Qi Y Xu and Y Zeng ldquoA FPGA-basedimplementation of MUSIC for centrosymmetric circular arrayrdquoin Proceedings of the 9th International Conference on SignalProcessing (ICSP rsquo08) pp 490ndash493 October 2008

[9] J HWilkinsonTheAlgebraic Eigenvalue Problem Monographson Numerical Analysis Clarendon Press New York NY USA1988

[10] S Sundar and B K Bhagavan ldquoGeneralized eigenvalue prob-lems Lanczos algorithm with a recursive partitioning methodrdquoComputers ampMathematics with Applications vol 39 no 7-8 pp211ndash224 2000

[11] D P OrsquoLeary and P Whitman ldquoParallel QR factorization byhouseholder and modified Gram-Schmidt algorithmsrdquo ParallelComputing vol 16 no 1 pp 99ndash112 1990

[12] R P Brent F T Luk and C van Loan ldquoComputation ofthe generalized singular value decomposition using mesh-connected processorsrdquo Journal of VLSI and Computer Systemsvol 1 no 3 pp 242ndash270 1983

[13] J E Volder ldquoThe CORDIC trigonometric computing tech-niquerdquo IRE Transactions on Electronic Computers vol EC-8 no3 pp 330ndash334 1959

[14] J R Cavallaro and F T Luk ldquoCORDIC arithmetic for an SVDprocessorrdquo Journal of Parallel and Distributed Computing vol 5no 3 pp 271ndash290 1988

[15] G Lucius F le Roy D Aulagnier and S Azou ldquoAn algorithmfor extremal eigenvectors computation of hermitian matricesand its FPGA implementationrdquo in Proceedings of the IEEE56th International Midwest Symposium on Circuits and Systems(MWSCAS rsquo13) pp 1407ndash1410 August 2013

Mathematical Problems in Engineering 11

[16] Impulse C httpwwwimpulseacceleratedcomtoolshtml[17] Catapult c httpcalyptocomenproductscatapultoverview[18] Vivadohlsxilinx web page httpwwwxilinxcomproducts

designtoolsvivadointegrationesl-designhtm[19] R P Brent and F T Luk ldquoThe solution of singular-value

and symmetric eigenvalue problems on multiprocessor arraysrdquoSociety for Industrial and Applied Mathematics vol 6 no 1 pp69ndash84 1985

[20] C-C Sun J Gotze andG E Jan ldquoParallel Jacobi EVDmethodson integrated circuitsrdquoVLSIDesign vol 2014 Article ID 5961039 pages 2014

[21] Y Liu C-S Bouganis and P Y K Cheung ldquoHardware architec-tures for eigenvalue computation of real symmetric matricesrdquoIET Computers and Digital Techniques vol 3 no 1 pp 72ndash842009

[22] W Ford Numerical Linear Algebra with Applications UsingMATLAB Academic Press New York NY USA 1st edition2014

[23] S W Kang and S N Atluri ldquoApplication of nondimensionaldynamic influence function method for eigenmode analysisof two-dimensional acoustic cavitiesrdquo Advances in MechanicalEngineering vol 2014 Article ID 363570 9 pages 2014

[24] A Bernal R Miro D Ginestar and G Verdu ldquoResolution ofthe generalized eigenvalue problem in the neutron diffusionequation discretized by the finite volumemethodrdquoAbstract andApplied Analysis vol 2014 Article ID 913043 15 pages 2014

[25] W Gafsi F Najar S Choura and S El-Borgi ldquoConfinement ofvibrations in variable-geometry nonlinear flexible beamrdquo Shockand Vibration vol 2014 Article ID 687340 7 pages 2014

[26] G H Golub and C F Van Loan Matrix Computations JohnsHopkins Studies in the Mathematical Sciences Johns HopkinsUniversity Press Baltimore Md USA 4th edition 2013

[27] J S Walther A Unified Algorithm for Elementary Functions1972

[28] B Lakshmi and A S Dhar ldquoCORDIC architectures a surveyrdquoVLSI Design vol 2010 Article ID 794891 19 pages 2010

[29] J Faires and R L Burden Numerical Methods Brooks ColePublishing Company 2012

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 2: Research Article High Level Synthesis FPGA Implementation ...downloads.hindawi.com/journals/mpe/2015/870569.pdf · High Level Synthesis FPGA Implementation of the Jacobi Algorithm

2 Mathematical Problems in Engineering

the results obtained a distinction can be made betweenpurely iterative methods (Jacobi power iterations) [9] andalgorithms that perform some kind of transformation of theinitial matrix (Lanczos-bisection householder-QR) [10 11]A distinction can also be made between algorithms thatoutput all the eigenvalues (Jacobi QR) and those that onlycompute extremal eigenvalues or the one closest to a givenvalue (power iterations Lanczos)

In the studies cited above the main method employedfor eigenvalue and eigenvector computation was the Jacobialgorithm since its characteristics render it highly suitablefor a parallel implementation The Jacobi algorithm can beimplemented in such a way that there are almost no datadependencies between operations in the same iteration and itis possible to use systolic architectures as has been proposedby Brent et al [12] In addition all the operations involvedcan be carried out by using the CORDIC algorithm [13] ashas been shown by Cavallaro and Luk in [14] and this canbe implemented simply with add and shift operations whichonly use a small amount of hardware resources compared tomultiplication and division operations Another advantage ofthe Jacobi method is that its round-off error is very stable [9]

However there are other methods for computingextremal eigenvalues that can take advantage of thecomputing power provided by FPGAs such as the QRalgorithm or the Lanczos [15] method since some of theiroperations can be carried out in parallel In this study we alsopresent an FPGA floating point implementation of the QRalgorithm In terms of a practical FPGA implementation it isa challenging task to make efficient use of the availableresources while at the same time meeting real timerequirements This is mainly because the analysis ofalgorithm data dependencies and their parallelization isextremely difficult

Furthermore FPGA implementations are usually carriedout in register transfer level (RTL) languages where there is avery low level of abstraction and it is therefore necessary tomanage resources and timing carefully This also implies thatwhen some architectural decisions are made it is difficult togo back without recoding most of the system Neverthelessthis kind of system can be very efficient in terms of the totalamount of resources used

When an algorithm with relevant mathematical work-load such as the ones mentioned before has to be imple-mented in an FPGA a great part of the design processconsist in the partition of the system in small units toachieve the required grade of concurrency Design time canbe shortened by using a high level synthesis (HLS) tool Thistool attempts to raise the abstraction level allowing algorithmspecification by means of a high level programming languagesuch as C or C++ Thanks to this the designer can focuson the algorithm itself while the tool makes the software tohardware conversion guided by some user directives whichrange from resource usage to concurrency grade Some ofthe tools that have become popular in the recent years areImpulse C [16] Catapult C by Calypto [17] and Vivado HLSby Xilinx [18]

Hardware implementations of the Jacobi algorithm arevery popular because its characteristics permit different

architectural designs depending on the required perfor-mance In the literature there are systolic implementationswhere speed is themost important requirement [19 20] serialdesigns where low resource consumption is selected [1 21]and a meet in the middle approach (semiparallel) where bothcriteria are balanced [8] In this study the Jacobi algorithmwas implemented using the Xilinx Vivado HLS tool toexplore different hardware implementations and comparethem to select the most efficient one in terms of executiontime and FPGA resources used

In this study the Jacobi algorithmwas implemented usingthe Xilinx Vivado HLS tool to explore different hardwareimplementations and compare them to select the mostefficient one in terms of execution time and FPGA resourcesused

To gain a better understanding of the results obtained wealso carried out a floating point implementation of the QRalgorithm using Vivado HLS The results obtained were thencompared to similar systems described in the literature

The rest of the paper is organized as follows In Section 2the mathematical background behind the Jacobi and QRalgorithms is presented together with some details relevantto implementation The proposed implementations and itsparameters are discussed in Sections 3 and 4 respectivelyFinally the system performance is compared with otherdesign approaches in Section 5 and we present our conclu-sions in Section 6

2 Mathematical Background

The eigenvalue problem [22] is one of the main questions innumerical algebra and due to themany applications in whichit is useful it has attracted much research attention in recentyears [23ndash25] The formulation of the problem is simple (1)but it can be approached from very different angles

119860V = 120582V (1)where V and 120582 are an eigenvalue and an eigenvector respec-tively of a matrix 119860 isin R119899times119899 Although there is an analyticalsolution it involves the calculation of polynomial roots andtherefore it is only suitable for low order (119899) matrices (119899 lt 4)When the problem has a higher order numerical algorithmsmust be used In this section the mathematical backgroundof Jacobi and QR algorithms is presented when they areused to compute the eigenvalues from real and symmetricmatrices

21 The Jacobi Algorithm The Jacobi algorithm [26] wasfirst proposed in 1846 but did not become popular until itscomputational implementation was explored The singularvalue decomposition (SVD) of a matrix 119860 isin R119899times119899 is givenby

119860 = 119880119863119881119879

(2)where 119880 and 119881 are orthogonal matrices and 119863 is a diagonalmatrix storing the singular values of119860 on its diagonal For thecase where 119860 is real and symmetric we can rewrite (2) as

119860 = 119881119863119881119879

(3)

Mathematical Problems in Engineering 3

The idea behind the Jacobi method is to perform aseries of similarity transformations (ie before and aftermultiplication with an orthogonal matrix) of 119860 to renderit more diagonal in each iteration This can be seen as aniterative expression

119860(119896+1)

= 119881(119896)

119860(119896)

119881(119896)119879

(4)

where 119896 represents the current iteration and 119896+1 the next oneThe sequence of transformation matrices is chosen in such away that the doublemultiplication renders the updated119860(119896+1)

a little more diagonal than its predecessor 119860(119896) A popularchoice for 119881(119896) is the Givens rotation matrix (5) Consider

119866 (119894 119895 120572) =

[[[[[[[[[[[[

[

1 sdot sdot sdot 0 sdot sdot sdot 0 sdot sdot sdot 0

d

0 sdot sdot sdot 119862120572sdot sdot sdot 119878120572sdot sdot sdot 0

d

0 sdot sdot sdot minus119878120572sdot sdot sdot 119862

120572sdot sdot sdot 0

d

0 sdot sdot sdot 0 sdot sdot sdot 0 sdot sdot sdot 1

]]]]]]]]]]]]

]

(5)

where 119862120572and 119878

120572are the sine and cosine of 120572 respectively

Parallelismof the algorithm resides in this choice If the119881119860119881119879product is studied it can easily be seen that only four elementsof 119860(119896) are modified so multiple similarity transformationscan be carried at the same time

Eigenvectors of 119860 are obtained by accumulating therotation matrices used in eigenvalue computation as shownin the following expression

119881 =

prod

119896=1

119881(119896)

where 119881(1) = 119868 (6)

To take full advantage of this Brent andLuk [19] proposeda systolic architecture in which the initial matrix 119860

119899times119899is

mapped onto a set of 2times2matrices such as the one shown inexpression (7) This leaves a subset of 2 times 2 diagonalizationproblems which can easily be solved in hardware makinguse of theCORDIC (COordinate RotationDIgital Computer)algorithm [27] as shown in [14]

119860119897119898= [1198862119897minus12119898minus1

1198862119897minus12119898

11988621198972119898minus1

11988621198972119898

] (7)

Since 1198992 diagonalization problems now have to besolved 1198992 rotation angles (120572

119897) are also required Given that

119860 is symmetric the angle is selected from the submatriceswhere 119897 = 119898 according to expression (8) so 119899 elements(where 119899 is the order of119860) can be eliminated in each iterationAs will now be shown angle calculation can also be carriedout by the CORDIC algorithm making an efficient use ofresources in the implementation

tan (2120572) =1198862119897minus12119898

11988621198972119898

minus 1198862119897minus12119898minus1

(8)

After each update of 119860(119896+1) it is necessary to bring newnonzero elements near the main diagonal so that they can beeliminated in the next iteration repeating the entire processuntil 119860(119896+1) = 119863 To this end several rows and columns areinterchanged as it will be explained in Section 3

As mentioned before the CORDIC algorithm [13 27]can be used to solve the double rotation (4) on everysubmatrix119860

119897119898and to calculate the rotation angles One great

advantage of CORDIC is that it can be expressed by meansof shift and add operations which are easy to implement inreconfigurable hardware It also facilitates working with fixedpoint codification where the word length is not fixed to 32 or64 bits as occurswith the floating point standard thusmakingmore efficient use of FPGA resources For example in thepresent study the entire system was fixed to a word lengthof 18 bits since this is the entry word length of Xilinx FPGAmultipliers

The fundamental idea behind CORDIC is to carry out aseries of rotations (called microrotations) on an input vectorto obtain another vector as outputThis is performed in threedifferent coordinate systems (linear circular and hyperbolic)and it is possible to obtain various basic functions suchas multiplication trigonometric operations logarithms andexponentials [28] in the components of the output vectorgiven some initial input vector conditions [27]Microrotationvalues are selected as shown in the following expression

tan (120572119894) = 120575119894 where 120575

119894= 2minus119894

(9)

The choice of 120575119894as an inverse power of 2 is what will allow us

to implement the algorithm using shifts since multiplicationor division for a power of 2 is equivalent to a shift to the leftor the right respectively in binary codification

In the circular coordinate system it is possible to obtainthe two operations required for the Jacobi algorithm Givensrotation and angle calculation (arctangent) Since CORDICperforms a series of rotations on a vector it can be seen as aniterative expression (11) Consider

V119894+1= [

1 minus1205901198942minus119894

1205901198942minus119894

1] V119894 V = [119909

119910] (10)

119911119894+1= 119911119894minus 120590119894120572119894 (11)

where V is a vector determined by its vertical and horizontalcomponents (119909 and 119910) and 119911 is a variable that keeps trackof the total rotated angle In each iteration the direction ofthe rotation 120590

119894is determined by the operation that is to be

performed In the particular case of the circular coordinatesystem the two algorithm operation modes are as follows

(i) Rotation mode the algorithm takes a vector V0=

(1199090 1199100) and an angle 119911

0and outputs a new vector

V = (119909 119910) which corresponds to V0rotated 119911 radians

To do so 120590119894is selected in such a way that 119911 approaches

zero in each iteration When 119911 cong 0 it can be assumedthat the starting vector has been fully rotated a 119911 angle

(ii) Translation mode algorithm input is again a vectorV0= (1199090 1199100) whose phase angV

0we want to compute

4 Mathematical Problems in Engineering

To do so the vector is rotated until its 119910 componentis zero selecting 120590

119894accordingly Since 119911 accumulates

microrotation angles when 119910 cong 0 the variable 119911will correspond to the initial phase of the vector Inaddition since 119910 = 0 119909 will store the absolute valueof V0

If this is expressed mathematically in rotation mode 120590119894=

sign(119911119894) the relation between the input and the output is as

follows

[119909

119910] = 119870

1[cos (119911

0) sin (119911

0)

minus sin (1199110) cos (119911

0)] [1199090

1199100

] (12)

On the other hand if the algorithm works in vectoringmode 120590

119894must be computed as 120590

119894= minus sign(119910

119894) and the relation

between the input and the output is now as shown in thefollowing expression

119909 = 1198701radic1199092

0+ 1199102

0

119911 = 1199110+ atan

1199100

1199090

(13)

As can be observed in expressions (12) and (13) theresult is scaled by a constant value of 119870

1 This is due to the

nonorthogonality of the microrotations which modifies theabsolute value of the vector To solve this results obtainedcan be scaled by 1119870

1 In addition 119870

1(14) is known in

advance and only varies with the number of microrotationsperformed Since 119870

1is fixed 1119870

1can be precomputed and

stored in a memory element to correct the output data

1198701=

119899minus1

prod

119894=0

radic1 + 1205902

119894sdot 2minus2119894 (14)

22 The QR Algorithm As indicated in Section 1 to showthe effectiveness of the Jacobi algorithm over other meth-ods when it is implemented in reconfigurable logic a QRalgorithm implementation was also carried out using VivadoHLS

In the QR algorithm another condition must be addedto our initial matrix which was real and symmetric now itmust also be a tridiagonal matrix like the one presented inthe following expression

119879 =

[[[

[

119886111988720 0

1198872119886211988730

0 119887311988631198874

0 0 11988741198864

]]]

]

(15)

To obtain the eigenvalues of119879 a reduction strategy is usedwhere the 119887

2 119887

119899entries are zeroed When for example

1198872= 0 the row and the column involved can be removed

where 1198861is an eigenvalue of 119879

To accomplish this in the QR algorithm a sequence ofmatrices similar to119879 (119879(1) 119879(119899)) is produced [29] To startwith the initial matrix 119879(119896) is factored as follows

119879(119896)

= 119876(119896)

119877(119896)

(16)

where 119876(119896) is an orthogonal matrix and 119877(119896) is an uppertriangular matrix In a similar way 119879(119896+1) is defined as

119879(119896+1)

= 119877(119896)

119876(119896)

(17)

As presented thus far the convergence of the method isslow To speed it up a shift can be performed on the originalmatrix for a 120590 quantity close to the eigenvalue we are lookingfor Thus we now have a new sequence of matrices such that

119879(119896)

minus 120590119868 = 119876(119896)

119877(119896)

119879(119896+1)

= 119877(119896)

119876(119896)

+ 120590119868

(18)

After each shift the eigenvalues of 119879 will be shifted by thesame 120590 amount and it is easy to recover the original valuesby accumulating the shifts after each iteration To obtainan approximated value of 120590 for each iteration a popularchoice is to calculate the eigenvalues of the matrix 119864 (19) andtake the one closest to 119886

119899 This is called the Wilkinson shift

[26] Finally 119876(119896) and 119877(119896) are chosen to be Givens rotationmatrices to eliminate the appropriate 119887

119894element

119864 = [119886119899minus1

119887119899

119887119899119886119899

] (19)

3 Implementation Details

As discussed in Section 1 the implementation tool used inthis study was the Xilinx Vivado HLS [18] which allowsthe use of a high level programming language (C C++ orSystem C) instead of a hardware description language such asVHDL or Verilog Although it presents some drawbacks thismeans of implementing the algorithms enabled us to test verydifferent hardware configurations of the systemwith very fewdesign modifications

Due to its iterative nature the Jacobi method can easilybe expressed as a series of loops which is the main elementused in Vivado HLS to implement the algorithms

This can be done via pragma directives or tcl scriptinggiving each part of the design different attributes that deter-mine the degree of concurrency or the desired resource con-straints Some important operations that can be performedon design loops include the following

(i) Pipelinewhichwill try to achieve parallelism betweenthe operations performed in the same iterationPipeline operation can also be specified throughoutthe entire design so that all internal loops are unrolledmaking parallelism between two system iterationspossible

(ii) Unroll which will implement each iteration as aseparate hardware instance

(iii) Dataflow which will try to achieve parallelismbetween different loops that are data dependentimplementing communication channels such as ping-pong memories

As mentioned before we can also apply resource con-straints to the designs limiting the number of instances

Mathematical Problems in Engineering 5

of a particular hardware element The elements that canbe limited range from operators (+lowast) to specific resourcessuch as DSP48E and BRAM memory blocks User definedelements such as a function called multiple times canalso be constrained and this is very important for designoptimization as will be shown later

Figure 1 shows a functional diagram of the Jacobi algo-rithm Since the C programming language was used eachtask was implemented as a for loop and labeled to makeoptimization easier

When the design starts its operation (a) the input matrix119860 isin R119899times119899 is buffered in a memory element 119878 In addition tocompute eigenvectors 119881(1) is loaded with an identity matrix(119868 isin R119899times119899) After the initial operations there are three tasks(a b and c) to be performed ℎ times in the external loopwhere ℎ corresponds to the total number of Jacobi iterations(elimination of the 119899 elements adjacent to the main diagonal)that the algorithmwill execute and is selected in suchway thateigenvector error is minimized

Thefirst task thatmust be done in each Jacobi iteration (b)is to calculate rotation angles (120572(119896)

119895) As previouslymentioned

this was achieved using the CORDIC algorithm in vectoringmode

Thenext tasks to be performed (c1 and c2) are eigenvalueand eigenvector rotations which consist of pre- and post-multiplication with the Givens rotation matrix Since thereare no data dependencies between these two tasks they areplaced in different loops and thus parallelization is possiblePre- or postmultiplication with a Givens rotation matrix canbe expressed as two vector rotations so it can be performedwith two rotation mode CORDIC algorithm executionsThisleaves us with six CORDIC executions four of them whichcorrespond to eigenvalue calculation and the other two toeigenvector computation

Thefinal operation to be performed before the next Jacobiiteration (d) is a series of row and column permutations withthe purpose of introducing new nonzero elements into thediagonal submatrices

This element rearrangement was implemented asdescribed by Brent and Luk [19] but with somemodificationsIn their study they proposed a systolic architecturewhere each submatrix corresponded to an independenthardware unit called a processor and all processors wereinterconnected so that data interchanges were possible

Since our design did not have a defined hardware archi-tecture at this point element interchangewas substitutedwiththe corresponding row and column permutations in order tofulfill the requirement of introducing new nonzero elementsinto the diagonal submatrices

Finally when all the Jacobi rotations are completedeigenvalues and eigenvectors are output as arrays which canbe implemented as BRAM ports or AXI buses so that thedesign can be integrated in different kinds of system

All tasks were implemented in the C programminglanguage using a fixed point data type following the XQNquantification scheme where a number is represented by 119883bits of integer part and 119873 bits of fractional part with a total

number of bits (word length) of WL = 119883 + 119873 + 1 for signedvalues and WL = 119883 + 119873 for unsigned values

Starting from the initial C prototype different hardwarearchitectures can be considered depending on the designrequirements Mainly resource usage (LUT Flip-Flops mul-tipliers memory elements) and timing performance (latency(119879119871) throughput or initiation interval (II) and maximum

clock frequency) are the parameters that we attempt tooptimize

(i) the latency (119879119871) which specifies how much time will

take until the system generate the first valid result(ii) the initiation interval (II) which corresponds to the

throughput or the number of clock cycles betweentwo valid sets of results after the latency period

As a result of the Vivado HLS optimization directivesthree architectures can be achieved depending on whichparameters are considered more important

(i) Serial architecture this hardware scheme pursuesminimal resource usage at the expense of a higherlatency To reach this optimization level a pipelinedirective is specified on the external loop of thedesign which performs the Jacobi iterations

(ii) Parallel architecture in this kind of system the maingoal is to achieve the best performance in termsof execution time (latency and throughput) at theexpense of resource usage To achieve this pipelineis applied to the entire design which means thatthe external loop will be unrolled and parallelismbetween two algorithm executions will be possible

(iii) Semiparallel architecture although timing is stillimportant in this scheme resource usage is alsoconsidered and attempts are made to minimize itwhile affecting execution time as little as possible Inour particular case this was achieved by limiting thenumber of CORDIC instances of the design in theparallel architecture

Although all three alternatives were analyzed the fullparallel architecture was found to use a massive amount ofinternal FPGA resources greatly exceeding the capacity ofthe device This was mainly due to an excessive amountof CORDIC units in the rotation mode which led to 55hardware instances of it in the case of an 8 times 8 inputmatrix Consequently we ruled out the possibility of parallelarchitecture

It is important to clarify the difference between semi-parallel and serial architecture In both cases we alwaysattempted to maximize concurrency between operationssuch as arithmetic operations andmemory accesses howeverwith parallel architecture it was also possible to achieveconcurrence between two full executions of the algorithm

The excessive amount of CORDIC units implemented bythe Vivado HLS synthesizer was taken as a starting pointfor optimization of the design Since Vivado HLS allowedus to limit the number of elements of a particular hardwareresource or entity the number of CORDIC instances in both

6 Mathematical Problems in Engineering

S A

V Intimesnk = 1

120572(k)j = 05tans(k)2jminus12j

s(k)2j2j minus s(k)2jminus12jminus1

Sk+1 Sk+1aux Vk+1 Vk+1aux

(k+1) = V(k)ij middot R(120572j)

i = 1 n2 j = 1 n2 j = i n2i = 1 n2

(k+1) = R(120572i)Tmiddot S(k)ij middot R(120572j)

k = h

(a)

(b)

(c2)

(d)

(c1)

Result rearrange

No

[ ]

Yes

A Rntimesnisin

Vaux 119894119895 Saux 119894119895

larrminus

larrminus

larrminus larrminus

rarr

rarrVlarrminus

1205821 120582n diag(S)

lceilrceil lceilrceil

larrminus

1205821 120582n

j = 1 n2

Figure 1 Jacobi algorithm functional diagram

semiparallel and serial architecture was limited to differentvalues allowing Vivado HLS to redesign the RTL implemen-tation and subsequently analyzing the design performanceobtained

First we analyze the case of the semiparallel architectureSince 8 times 8 is a common size in the literature for input matri-ces the design was optimized for this matrix size extendingthe conclusions reached to other matrix sizes The resultsobtained (resource consumption and timing performance)are shown in Figure 2

An analysis of timing performance indicated thatalthough latency did not vary very much the initiationinterval (II) was considerably reduced when the number ofrotation mode CORDIC instances was increased from oneto four slowly decreasing as more CORDIC modules wereadded

This result was mainly due to data dependencies betweenoperations performed in the Jacobi algorithm As previously

indicated four rotations must be performed for eigenvaluecalculation and the result of the first two is required toperform the last two whereas two independent rotationsmust be performed for eigenvector calculation

From this it can be concluded that four CORDICmodules will greatly increase the performance of the systemWhen timing performance is related to resource usage it isclear that it increased linearly with CORDIC module usageand the slope variation was due to the different control logicimplemented by Vivado with an odd and an even number ofCORDIC modules

The last conclusion that can be reached from Figure 2 isthat there was parallelism between two algorithm iterationssince 119879

119871gt II This means that the first valid result will be

obtained after 119879119871clock cycles but the next results will be

obtained more rapidly after every II clock cyclesThe results obtained for the serial architecture are shown

in Figure 3 This time only the latency is shown since there

Mathematical Problems in Engineering 7

1 2 3 4 5 6 7 8 90

20406080

100

Disc

rete

elem

ents

()

Resource usage

LUTFF

1 2 3 4 5 6 7 8 90

500

1000

1500

2000

Rotation CORDIC modules

Rotation CORDIC modules

Execution time (clock cycles)

IITL

nmiddotT

CLK

Figure 2 Semiparallel architecture implementation results of theJacobi algorithm for 8 times 8matrices

was no parallelism between two algorithm executions andthey only differed in one clock cycle that the system neededto start the next execution (II = 119879

119871+ 1)

As indicated earlier the execution time between two validresult sets is the same as the latency period (119879

119871) and in

the semiparallel design alternative it evolved similarly to theinitiation interval

For similar reasons there was a greater improvement inlatency in the first fourCORDICmodule additions which didnot decrease at all after sixThe increase in resource usage wasalso linear now with a constant slope although it used fewerresources

4 Assessment of Design Parameters

41 Optimal Number of Jacobi Iterations To evaluate thequality of the results obtained using the Jacobi method weconducted an experimental error analysis to determine howmany Jacobi iterations were required to obtain satisfactoryresults

The word length (WL) selected for the fixed point cod-ification scheme presented in Section 2 was 18 bits with afractional part of 15 bits (2Q15) since Xilinx FPGAmultiplierentry is fixed to this value and the purpose of using theCORDIC algorithm was to minimize resource usage Inaddition as will now be shown this codification providedsufficient precision to avoid overflow and growing roundingerrors

Simulating the system for multiple 8 times 8 real and sym-metric random matrices we calculated the eigenvalue and

1 2 3 4 5 6 7 8 90

1020304050

Disc

rete

elem

ents

Resource usage

1 2 3 4 5 6 7 8 91500

2000

2500

3000

Rotation CORDIC modules

Rotation CORDIC modules

Execution time

LUTFF

TL

nmiddotT

CLK

Figure 3 Serial architecture implementation results of the Jacobialgorithm for 8 times 8matrices

21 215 22 225 23 235 24 245 25 255 260

1020304050

Max

imum

erro

r (

)

Iterations

21 215 22 225 23 235 24 245 25 255 260

02040608

1

Qua

drat

ic m

ean

erro

r (

)

Iterations

Eigenvector component error (rarr

rarr

)1205821 120582m

Figure 4 Eigenvectors maximum and quadratic component error

eigenvector maximum (119890MAX) and quadratic (119890SQRT) errorcomparing FPGA Jacobi fixed point results with MATLABfloating point implementation (eig)

When the results were analyzed we found that the eigen-vector error stabilized later than the eigenvalue error (moreiterations were required) Consequently Figure 4 shows

8 Mathematical Problems in Engineering

the eigenvector component error calculated according to thefollowing expressions

119890 =

1003816100381610038161003816119881HLS1003816100381610038161003816 minus1003816100381610038161003816119881MATLAB

10038161003816100381610038161003816100381610038161003816119881MATLAB

1003816100381610038161003816

sdot 100

119890MAX = max (10038161003816100381610038161198901003816100381610038161003816)

119890SQRT =1

1198962

radic

119896minus1

sum

119894=0

1198902

(20)

where 119896 represents the total number of samples The errorobtained was plotted as a function of the number of Jacobiiterations (ie the number of rotations performed for eachsubmatrix) and was used to determine the optimal numberof iterations (ℎ) that the algorithm must perform to obtainsufficiently accurate results

Our analysis of 8 times 8 matrices indicated that ℎ = 26

was the best number of iterations for both eigenvectors andeigenvalues in order to obtain a sufficiently accurate solutionsince the error was constant from this value on

After performing the same tests for other input matrixsizes it was concluded that in general expression (21) canbe used to determine the optimal number of iterations to beperformed for an 119899 times 119899matrix

ℎ = 9 + 119899 log 119899 (21)

42 SerialParallel Jacobi Analysis Since an expression wasavailable for the optimal number of Jacobi iterations (ℎ) wealso conducted a study of the design performance dependingon initial matrix size

With the parallel architecture we found that althoughthe Vivado HLS generated an RTL design for matrices withan order 119899 gt 10 the Vivado suite synthesizer was unableto implement it successfully However the serial architecturedesign performed well for higher order matrices (20 times 20)

Based on the parametric study performed in Section 3we decided that a good trade-off between execution time andresource usagewould be to use fourCORDICmodules whichwould perform WL minus 1 iterations in both parallel and serialarchitectures where WL was the data word length

Implementation results for the serial architecture areshown in Figure 5 for the Xilinx Artix 7 series (XC7A100T-1CSG324C) device

Turning first to resource usage it can be seen that thisdid not not increase as fast as execution time since whenmatrix size is increased with a fixed amount of CORDICunits only storage elements and control logic will be addedConsequently it becomes evident that in terms of resourcesused the design has an 119874(1198992) complexity

Regarding execution time (119879exe) the Vivado HLS enabledus to examine the sequence of operations performed by thedesign in each clock cycle An analysis of this informationindicated that the design latency which corresponds toexecution time in the serial architecture follows the followingexpression

119879119871= 119879load + 119899

2

sdot ℎ (22)

8 9 10 11 12 13 14 15 1605

1

15

2Used resources

FF an

d LU

T (

)

8 9 10 11 12 13 14 15 160

50

100

150

FFLUT

Input matrix size (n)

Input matrix size (n)

Execution time with 100MHz

times104

Tex

e(120583

s)

fCLK =

Figure 5 Serial Jacobi design performance with matrix orderincrease

where 119899 is the size of the processed matrix and ℎ is thenumber of Jacobi iterations performed following expression(21) Meanwhile 119879load (23) corresponds to the time requiredto read sufficient data from an external memory to start thefirst Jacobi iteration and 1198992 is the amount of clock cyclesrequired to perform a Jacobi iteration

119879load =1198992

2+119899

2+ 4 (23)

A similar study was performed for different numbersof rotation CORDIC units which revealed that the time toperform a Jacobi iteration varied However it remained afunction of 1198992 so system latency (119879

119871) was still related to the

size of the inputmatrix as shown in the following expression

119879119871= 119879load + 119879119869 sdot ℎ (24)

where 119879119869= 119891(119899

2

) is the time required to perform a Jacobiiteration

Regarding the maximum size (119899) of the input matrix weassumed that given the serial nature of the system resourceusage would be the limit however we found that for 119899 gt 20the Vivado HLS took a very long time to guess the controllogic and implementation eventually failed Despite thisthe design has been coded in such a way that input matrixsize and the codification scheme can be modified simply bychanging a few constants which is a vast improvement overthe classic RTL designs

Mathematical Problems in Engineering 9

Design comparison

11000 23000 34000 45000LUT

8500

17000

26000

34000

FF

400080001200016000

4000

8000

12000

16000

Semiparallel JacobiSerial JacobiFloating point QRMUSIC (Xu et al 2008)VHDL (Bravo et al 2008)Vivado HLS library

TL(nTCLK)

II(nT

CLK

)

Figure 6 Comparative between different design alternatives

5 Comparison with Other Proposals

The results obtained were also compared with those reportedin other studies of 8 times 8 matrices since these have receivedmuch research attention

Figure 6 shows the implementation results for the twoarchitectures proposed here (serial and parallel) a Jacobiserial architecture carried out in VHDL [1] a Jacobi semi-parallel architecture from the multiple signal classification(MUSIC) algorithm [8] and the Vivado HLS linear algebralibrary svd implementation

The criteria selected to compare the different designswere as follows FPGA resource consumption in terms of thediscrete elements FF and LUT computation time (latency andinitiation interval) and maximum clock frequency (119891CLK)

The VHDL serial Jacobi algorithm features a fully cus-tomized system using two ad hoc CORDIC modules one ofwhich works in rotation and vectoring mode while the otheris optimized to work only in rotation mode

The Jacobi module used in theMUSIC algorithm featuresa semiparallel architecture based on the Brent and Lukproposal [12] and also uses the CORDIC algorithm for thecalculations However it does not implement the full systolicarray only making use of a subset of this to save resources

Turning first to the two Jacobi algorithm implementationsproposed here it is evident that the best performance interms of speed was achieved by the semiparallel architecturewhereas the serial architecture only used half of the resourcesrequired to implement the parallel one

Vivado HLS library

Clock frequency (MHz)

VHDL (Bravo et al

2008)QR

Parallel Jacobi

Serial Jacobi

0 50 100 150 200 250 300

Figure 7 Maximum clock frequency for all the designs when inputmatrix size is fixed to 8 times 8

The other Vivado HLS implementation considered wasthe svd function from the Vivado HLS library which canonly be implemented in single precision floating point for-mat Resource usage is not dissimilar to our Vivado HLSimplementations however it is slower and makes use of 58DSP48E blocks

The difference arises from the use of floating point coreswhich are implemented with DSP units while the fixedpoint CORDIC approach employed in the present study onlyrequired two DSP48E multipliers per unit

As previously mentioned a floating point QR algorithmwas implemented in the same FPGAdevice also usingVivadoHLS The main purpose of this was to determine whetherother algorithms presented greater computational efficiencythan the Jacobi method Although we used different codingsystems in the Jacobi and QR algorithms respectively ananalysis of both implementations revealed that QR was lesssuitable for a hardware implementation since the operationsperformed can vary between different iterations

Nevertheless we found that although the performanceof the algorithm in terms of execution time was poor it didnot use a large amount of FPGA resources and it thereforerepresents an alternative to consider when floating pointcodification is required

Looking next at the eigen solver implemented as part ofthe MUSIC algorithm it can be seen that it did not performvery well On paper it is similar to the semiparallel architec-ture presented here but the control logic was designed byhand This shows that HLS performs better than traditionalhardware design flows when control logic becomes toocomplicated since it is implemented automatically

Our final comparison was with the VHDL serial architec-ture in which only two CORDIC modules were used Sinceeverything in this design was customized for the applicationtiming performance was similar to HLS designs but it usedfar fewer resources

As regards the maximum clock frequency at which thedesigns can work Figure 7 shows a comparison of all thealternatives discussed

On the one hand we have VivadoHLS designs where thedifference between floating point and fixed point is evidencedby the resultsWhile floating point designs can easily go above200MHz fixed point designs stay around 120MHz

10 Mathematical Problems in Engineering

On the other hand we find theVHDLdesign In this casethe performance of the design proposed in [1] was similar tothe floating point designs since it was hand-coded at a verylow optimization level

6 Conclusions

In this paper we have described a high level synthesisimplementation of the Jacobi algorithm for eigenvalue andeigenvector calculations and this has been compared tosimilar or related systems reported in the literature

The main purpose of using a HLS tool was to try toachieve a similar performance to that of a customRTL systemcoded in for example VHDL in a fraction of the design timerequired

In recent years the amount of resources available inFPGA devices has grown exponentially and thus rapidity isbecoming more important than efficiency when implement-ing computational algorithms

Furthermore HLS tools have evolved making it possibleto achieve good systems that are easy to integrate into largerdesigns as hardware coprocessors together with a CPU or asa processing block in regular RTL designs

As final paper conclusion we can say that the use of highlevel implementation tools reduce the design time In mostcases the designs are initially coded simulated and validatedwith high level languages where there exist standard librariesand reference code available Manual RTL coding is notusually an easy step because there is not a straight way totranslate the functional algorithm to RTL description Inaddition the required level of RTL expertise is very high

Another inconvenience for RTL-based designs appearswhen new changes have to be included in the algorithmTheRTL structure is not very flexible and any modification to thealgorithm requires a lot of time (simulation and recoding)Nevertheless RTL designs generate hardware reconfigurableapproaches with less resources and higher clock frequencythan high level hardware designs (see Figures 6 and 7)

The developed algorithm in this work was initially madewith a traditional RTL flow [1] As a result we can comparethe results of a complex algorithm such as Jacobi based onRTL orHLS flows According to Figures 6 and 7 the achievedresults for the HLS methodology compared with RTL-baseddesign are favorably from different points of view

We have shown that the design process varies from thetraditional RTL flow since it is possible to iterate on the samecode to obtain different hardware architectures and performparametric simulations of the hardware in order to obtain thebest results

Therefore the designer should balance what are the priorrequirements in the system clock frequencyresources ordesign time and choose the methodology (RTLHLS) thatbest fit for that particular case

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgment

This work has been supported by the Spanish Ministry ofEconomy and Competitiveness through the ALCOR Project(DPI2013-47347-C2-1-R)

References

[1] I Bravo M Mazo J L Lazaro P Jimenez A Gardel and MMarron ldquoNovel HW architecture based on FPGAs oriented tosolve the eigen problemrdquo IEEE Transactions on Very Large ScaleIntegration (VLSI) Systems vol 16 no 12 pp 1722ndash1725 2008

[2] H Gao and R Ma ldquoOn a system modelling a population withtwo age groupsrdquoAbstract andApplied Analysis vol 2014 ArticleID 920348 5 pages 2014

[3] X Tang W Deng and Z Qi ldquoInvestigation of the dynamicstability of microgridrdquo IEEE Transactions on Power Systems vol29 no 2 pp 698ndash706 2014

[4] S Wu J Zhang Y Hou and X Bai ldquoConvergence of gossipalgorithms for consensus in wireless sensor networks withintermittent links andmobile nodesrdquoMathematical Problems inEngineering vol 2014 Article ID 836584 18 pages 2014

[5] J A Jimenez M Mazo J Urena et al ldquoUsing PCA in time-of-flight vectors for reflector recognition and 3-D localizationrdquoIEEE Transactions on Robotics vol 21 no 5 pp 909ndash924 2005

[6] W Haiqing S Zhihuan and L Ping ldquoImproved PCA withoptimized sensor locations for process monitoring and faultdiagnosisrdquo inProceedings of the 39th IEEEConfernce onDecisionand Control vol 5 pp 4353ndash4358 December 2000

[7] A A S Ali A Amira F Bensaali andM Benammar ldquoPCA IP-core for gas applications on the heterogenous zynq platformrdquoin Proceedings of the 25th International Conference onMicroelec-tronics (ICM rsquo13) pp 1ndash4 December 2013

[8] D Xu Z Liu X Qi Y Xu and Y Zeng ldquoA FPGA-basedimplementation of MUSIC for centrosymmetric circular arrayrdquoin Proceedings of the 9th International Conference on SignalProcessing (ICSP rsquo08) pp 490ndash493 October 2008

[9] J HWilkinsonTheAlgebraic Eigenvalue Problem Monographson Numerical Analysis Clarendon Press New York NY USA1988

[10] S Sundar and B K Bhagavan ldquoGeneralized eigenvalue prob-lems Lanczos algorithm with a recursive partitioning methodrdquoComputers ampMathematics with Applications vol 39 no 7-8 pp211ndash224 2000

[11] D P OrsquoLeary and P Whitman ldquoParallel QR factorization byhouseholder and modified Gram-Schmidt algorithmsrdquo ParallelComputing vol 16 no 1 pp 99ndash112 1990

[12] R P Brent F T Luk and C van Loan ldquoComputation ofthe generalized singular value decomposition using mesh-connected processorsrdquo Journal of VLSI and Computer Systemsvol 1 no 3 pp 242ndash270 1983

[13] J E Volder ldquoThe CORDIC trigonometric computing tech-niquerdquo IRE Transactions on Electronic Computers vol EC-8 no3 pp 330ndash334 1959

[14] J R Cavallaro and F T Luk ldquoCORDIC arithmetic for an SVDprocessorrdquo Journal of Parallel and Distributed Computing vol 5no 3 pp 271ndash290 1988

[15] G Lucius F le Roy D Aulagnier and S Azou ldquoAn algorithmfor extremal eigenvectors computation of hermitian matricesand its FPGA implementationrdquo in Proceedings of the IEEE56th International Midwest Symposium on Circuits and Systems(MWSCAS rsquo13) pp 1407ndash1410 August 2013

Mathematical Problems in Engineering 11

[16] Impulse C httpwwwimpulseacceleratedcomtoolshtml[17] Catapult c httpcalyptocomenproductscatapultoverview[18] Vivadohlsxilinx web page httpwwwxilinxcomproducts

designtoolsvivadointegrationesl-designhtm[19] R P Brent and F T Luk ldquoThe solution of singular-value

and symmetric eigenvalue problems on multiprocessor arraysrdquoSociety for Industrial and Applied Mathematics vol 6 no 1 pp69ndash84 1985

[20] C-C Sun J Gotze andG E Jan ldquoParallel Jacobi EVDmethodson integrated circuitsrdquoVLSIDesign vol 2014 Article ID 5961039 pages 2014

[21] Y Liu C-S Bouganis and P Y K Cheung ldquoHardware architec-tures for eigenvalue computation of real symmetric matricesrdquoIET Computers and Digital Techniques vol 3 no 1 pp 72ndash842009

[22] W Ford Numerical Linear Algebra with Applications UsingMATLAB Academic Press New York NY USA 1st edition2014

[23] S W Kang and S N Atluri ldquoApplication of nondimensionaldynamic influence function method for eigenmode analysisof two-dimensional acoustic cavitiesrdquo Advances in MechanicalEngineering vol 2014 Article ID 363570 9 pages 2014

[24] A Bernal R Miro D Ginestar and G Verdu ldquoResolution ofthe generalized eigenvalue problem in the neutron diffusionequation discretized by the finite volumemethodrdquoAbstract andApplied Analysis vol 2014 Article ID 913043 15 pages 2014

[25] W Gafsi F Najar S Choura and S El-Borgi ldquoConfinement ofvibrations in variable-geometry nonlinear flexible beamrdquo Shockand Vibration vol 2014 Article ID 687340 7 pages 2014

[26] G H Golub and C F Van Loan Matrix Computations JohnsHopkins Studies in the Mathematical Sciences Johns HopkinsUniversity Press Baltimore Md USA 4th edition 2013

[27] J S Walther A Unified Algorithm for Elementary Functions1972

[28] B Lakshmi and A S Dhar ldquoCORDIC architectures a surveyrdquoVLSI Design vol 2010 Article ID 794891 19 pages 2010

[29] J Faires and R L Burden Numerical Methods Brooks ColePublishing Company 2012

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 3: Research Article High Level Synthesis FPGA Implementation ...downloads.hindawi.com/journals/mpe/2015/870569.pdf · High Level Synthesis FPGA Implementation of the Jacobi Algorithm

Mathematical Problems in Engineering 3

The idea behind the Jacobi method is to perform aseries of similarity transformations (ie before and aftermultiplication with an orthogonal matrix) of 119860 to renderit more diagonal in each iteration This can be seen as aniterative expression

119860(119896+1)

= 119881(119896)

119860(119896)

119881(119896)119879

(4)

where 119896 represents the current iteration and 119896+1 the next oneThe sequence of transformation matrices is chosen in such away that the doublemultiplication renders the updated119860(119896+1)

a little more diagonal than its predecessor 119860(119896) A popularchoice for 119881(119896) is the Givens rotation matrix (5) Consider

119866 (119894 119895 120572) =

[[[[[[[[[[[[

[

1 sdot sdot sdot 0 sdot sdot sdot 0 sdot sdot sdot 0

d

0 sdot sdot sdot 119862120572sdot sdot sdot 119878120572sdot sdot sdot 0

d

0 sdot sdot sdot minus119878120572sdot sdot sdot 119862

120572sdot sdot sdot 0

d

0 sdot sdot sdot 0 sdot sdot sdot 0 sdot sdot sdot 1

]]]]]]]]]]]]

]

(5)

where 119862120572and 119878

120572are the sine and cosine of 120572 respectively

Parallelismof the algorithm resides in this choice If the119881119860119881119879product is studied it can easily be seen that only four elementsof 119860(119896) are modified so multiple similarity transformationscan be carried at the same time

Eigenvectors of 119860 are obtained by accumulating therotation matrices used in eigenvalue computation as shownin the following expression

119881 =

prod

119896=1

119881(119896)

where 119881(1) = 119868 (6)

To take full advantage of this Brent andLuk [19] proposeda systolic architecture in which the initial matrix 119860

119899times119899is

mapped onto a set of 2times2matrices such as the one shown inexpression (7) This leaves a subset of 2 times 2 diagonalizationproblems which can easily be solved in hardware makinguse of theCORDIC (COordinate RotationDIgital Computer)algorithm [27] as shown in [14]

119860119897119898= [1198862119897minus12119898minus1

1198862119897minus12119898

11988621198972119898minus1

11988621198972119898

] (7)

Since 1198992 diagonalization problems now have to besolved 1198992 rotation angles (120572

119897) are also required Given that

119860 is symmetric the angle is selected from the submatriceswhere 119897 = 119898 according to expression (8) so 119899 elements(where 119899 is the order of119860) can be eliminated in each iterationAs will now be shown angle calculation can also be carriedout by the CORDIC algorithm making an efficient use ofresources in the implementation

tan (2120572) =1198862119897minus12119898

11988621198972119898

minus 1198862119897minus12119898minus1

(8)

After each update of 119860(119896+1) it is necessary to bring newnonzero elements near the main diagonal so that they can beeliminated in the next iteration repeating the entire processuntil 119860(119896+1) = 119863 To this end several rows and columns areinterchanged as it will be explained in Section 3

As mentioned before the CORDIC algorithm [13 27]can be used to solve the double rotation (4) on everysubmatrix119860

119897119898and to calculate the rotation angles One great

advantage of CORDIC is that it can be expressed by meansof shift and add operations which are easy to implement inreconfigurable hardware It also facilitates working with fixedpoint codification where the word length is not fixed to 32 or64 bits as occurswith the floating point standard thusmakingmore efficient use of FPGA resources For example in thepresent study the entire system was fixed to a word lengthof 18 bits since this is the entry word length of Xilinx FPGAmultipliers

The fundamental idea behind CORDIC is to carry out aseries of rotations (called microrotations) on an input vectorto obtain another vector as outputThis is performed in threedifferent coordinate systems (linear circular and hyperbolic)and it is possible to obtain various basic functions suchas multiplication trigonometric operations logarithms andexponentials [28] in the components of the output vectorgiven some initial input vector conditions [27]Microrotationvalues are selected as shown in the following expression

tan (120572119894) = 120575119894 where 120575

119894= 2minus119894

(9)

The choice of 120575119894as an inverse power of 2 is what will allow us

to implement the algorithm using shifts since multiplicationor division for a power of 2 is equivalent to a shift to the leftor the right respectively in binary codification

In the circular coordinate system it is possible to obtainthe two operations required for the Jacobi algorithm Givensrotation and angle calculation (arctangent) Since CORDICperforms a series of rotations on a vector it can be seen as aniterative expression (11) Consider

V119894+1= [

1 minus1205901198942minus119894

1205901198942minus119894

1] V119894 V = [119909

119910] (10)

119911119894+1= 119911119894minus 120590119894120572119894 (11)

where V is a vector determined by its vertical and horizontalcomponents (119909 and 119910) and 119911 is a variable that keeps trackof the total rotated angle In each iteration the direction ofthe rotation 120590

119894is determined by the operation that is to be

performed In the particular case of the circular coordinatesystem the two algorithm operation modes are as follows

(i) Rotation mode the algorithm takes a vector V0=

(1199090 1199100) and an angle 119911

0and outputs a new vector

V = (119909 119910) which corresponds to V0rotated 119911 radians

To do so 120590119894is selected in such a way that 119911 approaches

zero in each iteration When 119911 cong 0 it can be assumedthat the starting vector has been fully rotated a 119911 angle

(ii) Translation mode algorithm input is again a vectorV0= (1199090 1199100) whose phase angV

0we want to compute

4 Mathematical Problems in Engineering

To do so the vector is rotated until its 119910 componentis zero selecting 120590

119894accordingly Since 119911 accumulates

microrotation angles when 119910 cong 0 the variable 119911will correspond to the initial phase of the vector Inaddition since 119910 = 0 119909 will store the absolute valueof V0

If this is expressed mathematically in rotation mode 120590119894=

sign(119911119894) the relation between the input and the output is as

follows

[119909

119910] = 119870

1[cos (119911

0) sin (119911

0)

minus sin (1199110) cos (119911

0)] [1199090

1199100

] (12)

On the other hand if the algorithm works in vectoringmode 120590

119894must be computed as 120590

119894= minus sign(119910

119894) and the relation

between the input and the output is now as shown in thefollowing expression

119909 = 1198701radic1199092

0+ 1199102

0

119911 = 1199110+ atan

1199100

1199090

(13)

As can be observed in expressions (12) and (13) theresult is scaled by a constant value of 119870

1 This is due to the

nonorthogonality of the microrotations which modifies theabsolute value of the vector To solve this results obtainedcan be scaled by 1119870

1 In addition 119870

1(14) is known in

advance and only varies with the number of microrotationsperformed Since 119870

1is fixed 1119870

1can be precomputed and

stored in a memory element to correct the output data

1198701=

119899minus1

prod

119894=0

radic1 + 1205902

119894sdot 2minus2119894 (14)

22 The QR Algorithm As indicated in Section 1 to showthe effectiveness of the Jacobi algorithm over other meth-ods when it is implemented in reconfigurable logic a QRalgorithm implementation was also carried out using VivadoHLS

In the QR algorithm another condition must be addedto our initial matrix which was real and symmetric now itmust also be a tridiagonal matrix like the one presented inthe following expression

119879 =

[[[

[

119886111988720 0

1198872119886211988730

0 119887311988631198874

0 0 11988741198864

]]]

]

(15)

To obtain the eigenvalues of119879 a reduction strategy is usedwhere the 119887

2 119887

119899entries are zeroed When for example

1198872= 0 the row and the column involved can be removed

where 1198861is an eigenvalue of 119879

To accomplish this in the QR algorithm a sequence ofmatrices similar to119879 (119879(1) 119879(119899)) is produced [29] To startwith the initial matrix 119879(119896) is factored as follows

119879(119896)

= 119876(119896)

119877(119896)

(16)

where 119876(119896) is an orthogonal matrix and 119877(119896) is an uppertriangular matrix In a similar way 119879(119896+1) is defined as

119879(119896+1)

= 119877(119896)

119876(119896)

(17)

As presented thus far the convergence of the method isslow To speed it up a shift can be performed on the originalmatrix for a 120590 quantity close to the eigenvalue we are lookingfor Thus we now have a new sequence of matrices such that

119879(119896)

minus 120590119868 = 119876(119896)

119877(119896)

119879(119896+1)

= 119877(119896)

119876(119896)

+ 120590119868

(18)

After each shift the eigenvalues of 119879 will be shifted by thesame 120590 amount and it is easy to recover the original valuesby accumulating the shifts after each iteration To obtainan approximated value of 120590 for each iteration a popularchoice is to calculate the eigenvalues of the matrix 119864 (19) andtake the one closest to 119886

119899 This is called the Wilkinson shift

[26] Finally 119876(119896) and 119877(119896) are chosen to be Givens rotationmatrices to eliminate the appropriate 119887

119894element

119864 = [119886119899minus1

119887119899

119887119899119886119899

] (19)

3 Implementation Details

As discussed in Section 1 the implementation tool used inthis study was the Xilinx Vivado HLS [18] which allowsthe use of a high level programming language (C C++ orSystem C) instead of a hardware description language such asVHDL or Verilog Although it presents some drawbacks thismeans of implementing the algorithms enabled us to test verydifferent hardware configurations of the systemwith very fewdesign modifications

Due to its iterative nature the Jacobi method can easilybe expressed as a series of loops which is the main elementused in Vivado HLS to implement the algorithms

This can be done via pragma directives or tcl scriptinggiving each part of the design different attributes that deter-mine the degree of concurrency or the desired resource con-straints Some important operations that can be performedon design loops include the following

(i) Pipelinewhichwill try to achieve parallelism betweenthe operations performed in the same iterationPipeline operation can also be specified throughoutthe entire design so that all internal loops are unrolledmaking parallelism between two system iterationspossible

(ii) Unroll which will implement each iteration as aseparate hardware instance

(iii) Dataflow which will try to achieve parallelismbetween different loops that are data dependentimplementing communication channels such as ping-pong memories

As mentioned before we can also apply resource con-straints to the designs limiting the number of instances

Mathematical Problems in Engineering 5

of a particular hardware element The elements that canbe limited range from operators (+lowast) to specific resourcessuch as DSP48E and BRAM memory blocks User definedelements such as a function called multiple times canalso be constrained and this is very important for designoptimization as will be shown later

Figure 1 shows a functional diagram of the Jacobi algo-rithm Since the C programming language was used eachtask was implemented as a for loop and labeled to makeoptimization easier

When the design starts its operation (a) the input matrix119860 isin R119899times119899 is buffered in a memory element 119878 In addition tocompute eigenvectors 119881(1) is loaded with an identity matrix(119868 isin R119899times119899) After the initial operations there are three tasks(a b and c) to be performed ℎ times in the external loopwhere ℎ corresponds to the total number of Jacobi iterations(elimination of the 119899 elements adjacent to the main diagonal)that the algorithmwill execute and is selected in suchway thateigenvector error is minimized

Thefirst task thatmust be done in each Jacobi iteration (b)is to calculate rotation angles (120572(119896)

119895) As previouslymentioned

this was achieved using the CORDIC algorithm in vectoringmode

Thenext tasks to be performed (c1 and c2) are eigenvalueand eigenvector rotations which consist of pre- and post-multiplication with the Givens rotation matrix Since thereare no data dependencies between these two tasks they areplaced in different loops and thus parallelization is possiblePre- or postmultiplication with a Givens rotation matrix canbe expressed as two vector rotations so it can be performedwith two rotation mode CORDIC algorithm executionsThisleaves us with six CORDIC executions four of them whichcorrespond to eigenvalue calculation and the other two toeigenvector computation

Thefinal operation to be performed before the next Jacobiiteration (d) is a series of row and column permutations withthe purpose of introducing new nonzero elements into thediagonal submatrices

This element rearrangement was implemented asdescribed by Brent and Luk [19] but with somemodificationsIn their study they proposed a systolic architecturewhere each submatrix corresponded to an independenthardware unit called a processor and all processors wereinterconnected so that data interchanges were possible

Since our design did not have a defined hardware archi-tecture at this point element interchangewas substitutedwiththe corresponding row and column permutations in order tofulfill the requirement of introducing new nonzero elementsinto the diagonal submatrices

Finally when all the Jacobi rotations are completedeigenvalues and eigenvectors are output as arrays which canbe implemented as BRAM ports or AXI buses so that thedesign can be integrated in different kinds of system

All tasks were implemented in the C programminglanguage using a fixed point data type following the XQNquantification scheme where a number is represented by 119883bits of integer part and 119873 bits of fractional part with a total

number of bits (word length) of WL = 119883 + 119873 + 1 for signedvalues and WL = 119883 + 119873 for unsigned values

Starting from the initial C prototype different hardwarearchitectures can be considered depending on the designrequirements Mainly resource usage (LUT Flip-Flops mul-tipliers memory elements) and timing performance (latency(119879119871) throughput or initiation interval (II) and maximum

clock frequency) are the parameters that we attempt tooptimize

(i) the latency (119879119871) which specifies how much time will

take until the system generate the first valid result(ii) the initiation interval (II) which corresponds to the

throughput or the number of clock cycles betweentwo valid sets of results after the latency period

As a result of the Vivado HLS optimization directivesthree architectures can be achieved depending on whichparameters are considered more important

(i) Serial architecture this hardware scheme pursuesminimal resource usage at the expense of a higherlatency To reach this optimization level a pipelinedirective is specified on the external loop of thedesign which performs the Jacobi iterations

(ii) Parallel architecture in this kind of system the maingoal is to achieve the best performance in termsof execution time (latency and throughput) at theexpense of resource usage To achieve this pipelineis applied to the entire design which means thatthe external loop will be unrolled and parallelismbetween two algorithm executions will be possible

(iii) Semiparallel architecture although timing is stillimportant in this scheme resource usage is alsoconsidered and attempts are made to minimize itwhile affecting execution time as little as possible Inour particular case this was achieved by limiting thenumber of CORDIC instances of the design in theparallel architecture

Although all three alternatives were analyzed the fullparallel architecture was found to use a massive amount ofinternal FPGA resources greatly exceeding the capacity ofthe device This was mainly due to an excessive amountof CORDIC units in the rotation mode which led to 55hardware instances of it in the case of an 8 times 8 inputmatrix Consequently we ruled out the possibility of parallelarchitecture

It is important to clarify the difference between semi-parallel and serial architecture In both cases we alwaysattempted to maximize concurrency between operationssuch as arithmetic operations andmemory accesses howeverwith parallel architecture it was also possible to achieveconcurrence between two full executions of the algorithm

The excessive amount of CORDIC units implemented bythe Vivado HLS synthesizer was taken as a starting pointfor optimization of the design Since Vivado HLS allowedus to limit the number of elements of a particular hardwareresource or entity the number of CORDIC instances in both

6 Mathematical Problems in Engineering

S A

V Intimesnk = 1

120572(k)j = 05tans(k)2jminus12j

s(k)2j2j minus s(k)2jminus12jminus1

Sk+1 Sk+1aux Vk+1 Vk+1aux

(k+1) = V(k)ij middot R(120572j)

i = 1 n2 j = 1 n2 j = i n2i = 1 n2

(k+1) = R(120572i)Tmiddot S(k)ij middot R(120572j)

k = h

(a)

(b)

(c2)

(d)

(c1)

Result rearrange

No

[ ]

Yes

A Rntimesnisin

Vaux 119894119895 Saux 119894119895

larrminus

larrminus

larrminus larrminus

rarr

rarrVlarrminus

1205821 120582n diag(S)

lceilrceil lceilrceil

larrminus

1205821 120582n

j = 1 n2

Figure 1 Jacobi algorithm functional diagram

semiparallel and serial architecture was limited to differentvalues allowing Vivado HLS to redesign the RTL implemen-tation and subsequently analyzing the design performanceobtained

First we analyze the case of the semiparallel architectureSince 8 times 8 is a common size in the literature for input matri-ces the design was optimized for this matrix size extendingthe conclusions reached to other matrix sizes The resultsobtained (resource consumption and timing performance)are shown in Figure 2

An analysis of timing performance indicated thatalthough latency did not vary very much the initiationinterval (II) was considerably reduced when the number ofrotation mode CORDIC instances was increased from oneto four slowly decreasing as more CORDIC modules wereadded

This result was mainly due to data dependencies betweenoperations performed in the Jacobi algorithm As previously

indicated four rotations must be performed for eigenvaluecalculation and the result of the first two is required toperform the last two whereas two independent rotationsmust be performed for eigenvector calculation

From this it can be concluded that four CORDICmodules will greatly increase the performance of the systemWhen timing performance is related to resource usage it isclear that it increased linearly with CORDIC module usageand the slope variation was due to the different control logicimplemented by Vivado with an odd and an even number ofCORDIC modules

The last conclusion that can be reached from Figure 2 isthat there was parallelism between two algorithm iterationssince 119879

119871gt II This means that the first valid result will be

obtained after 119879119871clock cycles but the next results will be

obtained more rapidly after every II clock cyclesThe results obtained for the serial architecture are shown

in Figure 3 This time only the latency is shown since there

Mathematical Problems in Engineering 7

1 2 3 4 5 6 7 8 90

20406080

100

Disc

rete

elem

ents

()

Resource usage

LUTFF

1 2 3 4 5 6 7 8 90

500

1000

1500

2000

Rotation CORDIC modules

Rotation CORDIC modules

Execution time (clock cycles)

IITL

nmiddotT

CLK

Figure 2 Semiparallel architecture implementation results of theJacobi algorithm for 8 times 8matrices

was no parallelism between two algorithm executions andthey only differed in one clock cycle that the system neededto start the next execution (II = 119879

119871+ 1)

As indicated earlier the execution time between two validresult sets is the same as the latency period (119879

119871) and in

the semiparallel design alternative it evolved similarly to theinitiation interval

For similar reasons there was a greater improvement inlatency in the first fourCORDICmodule additions which didnot decrease at all after sixThe increase in resource usage wasalso linear now with a constant slope although it used fewerresources

4 Assessment of Design Parameters

41 Optimal Number of Jacobi Iterations To evaluate thequality of the results obtained using the Jacobi method weconducted an experimental error analysis to determine howmany Jacobi iterations were required to obtain satisfactoryresults

The word length (WL) selected for the fixed point cod-ification scheme presented in Section 2 was 18 bits with afractional part of 15 bits (2Q15) since Xilinx FPGAmultiplierentry is fixed to this value and the purpose of using theCORDIC algorithm was to minimize resource usage Inaddition as will now be shown this codification providedsufficient precision to avoid overflow and growing roundingerrors

Simulating the system for multiple 8 times 8 real and sym-metric random matrices we calculated the eigenvalue and

1 2 3 4 5 6 7 8 90

1020304050

Disc

rete

elem

ents

Resource usage

1 2 3 4 5 6 7 8 91500

2000

2500

3000

Rotation CORDIC modules

Rotation CORDIC modules

Execution time

LUTFF

TL

nmiddotT

CLK

Figure 3 Serial architecture implementation results of the Jacobialgorithm for 8 times 8matrices

21 215 22 225 23 235 24 245 25 255 260

1020304050

Max

imum

erro

r (

)

Iterations

21 215 22 225 23 235 24 245 25 255 260

02040608

1

Qua

drat

ic m

ean

erro

r (

)

Iterations

Eigenvector component error (rarr

rarr

)1205821 120582m

Figure 4 Eigenvectors maximum and quadratic component error

eigenvector maximum (119890MAX) and quadratic (119890SQRT) errorcomparing FPGA Jacobi fixed point results with MATLABfloating point implementation (eig)

When the results were analyzed we found that the eigen-vector error stabilized later than the eigenvalue error (moreiterations were required) Consequently Figure 4 shows

8 Mathematical Problems in Engineering

the eigenvector component error calculated according to thefollowing expressions

119890 =

1003816100381610038161003816119881HLS1003816100381610038161003816 minus1003816100381610038161003816119881MATLAB

10038161003816100381610038161003816100381610038161003816119881MATLAB

1003816100381610038161003816

sdot 100

119890MAX = max (10038161003816100381610038161198901003816100381610038161003816)

119890SQRT =1

1198962

radic

119896minus1

sum

119894=0

1198902

(20)

where 119896 represents the total number of samples The errorobtained was plotted as a function of the number of Jacobiiterations (ie the number of rotations performed for eachsubmatrix) and was used to determine the optimal numberof iterations (ℎ) that the algorithm must perform to obtainsufficiently accurate results

Our analysis of 8 times 8 matrices indicated that ℎ = 26

was the best number of iterations for both eigenvectors andeigenvalues in order to obtain a sufficiently accurate solutionsince the error was constant from this value on

After performing the same tests for other input matrixsizes it was concluded that in general expression (21) canbe used to determine the optimal number of iterations to beperformed for an 119899 times 119899matrix

ℎ = 9 + 119899 log 119899 (21)

42 SerialParallel Jacobi Analysis Since an expression wasavailable for the optimal number of Jacobi iterations (ℎ) wealso conducted a study of the design performance dependingon initial matrix size

With the parallel architecture we found that althoughthe Vivado HLS generated an RTL design for matrices withan order 119899 gt 10 the Vivado suite synthesizer was unableto implement it successfully However the serial architecturedesign performed well for higher order matrices (20 times 20)

Based on the parametric study performed in Section 3we decided that a good trade-off between execution time andresource usagewould be to use fourCORDICmodules whichwould perform WL minus 1 iterations in both parallel and serialarchitectures where WL was the data word length

Implementation results for the serial architecture areshown in Figure 5 for the Xilinx Artix 7 series (XC7A100T-1CSG324C) device

Turning first to resource usage it can be seen that thisdid not not increase as fast as execution time since whenmatrix size is increased with a fixed amount of CORDICunits only storage elements and control logic will be addedConsequently it becomes evident that in terms of resourcesused the design has an 119874(1198992) complexity

Regarding execution time (119879exe) the Vivado HLS enabledus to examine the sequence of operations performed by thedesign in each clock cycle An analysis of this informationindicated that the design latency which corresponds toexecution time in the serial architecture follows the followingexpression

119879119871= 119879load + 119899

2

sdot ℎ (22)

8 9 10 11 12 13 14 15 1605

1

15

2Used resources

FF an

d LU

T (

)

8 9 10 11 12 13 14 15 160

50

100

150

FFLUT

Input matrix size (n)

Input matrix size (n)

Execution time with 100MHz

times104

Tex

e(120583

s)

fCLK =

Figure 5 Serial Jacobi design performance with matrix orderincrease

where 119899 is the size of the processed matrix and ℎ is thenumber of Jacobi iterations performed following expression(21) Meanwhile 119879load (23) corresponds to the time requiredto read sufficient data from an external memory to start thefirst Jacobi iteration and 1198992 is the amount of clock cyclesrequired to perform a Jacobi iteration

119879load =1198992

2+119899

2+ 4 (23)

A similar study was performed for different numbersof rotation CORDIC units which revealed that the time toperform a Jacobi iteration varied However it remained afunction of 1198992 so system latency (119879

119871) was still related to the

size of the inputmatrix as shown in the following expression

119879119871= 119879load + 119879119869 sdot ℎ (24)

where 119879119869= 119891(119899

2

) is the time required to perform a Jacobiiteration

Regarding the maximum size (119899) of the input matrix weassumed that given the serial nature of the system resourceusage would be the limit however we found that for 119899 gt 20the Vivado HLS took a very long time to guess the controllogic and implementation eventually failed Despite thisthe design has been coded in such a way that input matrixsize and the codification scheme can be modified simply bychanging a few constants which is a vast improvement overthe classic RTL designs

Mathematical Problems in Engineering 9

Design comparison

11000 23000 34000 45000LUT

8500

17000

26000

34000

FF

400080001200016000

4000

8000

12000

16000

Semiparallel JacobiSerial JacobiFloating point QRMUSIC (Xu et al 2008)VHDL (Bravo et al 2008)Vivado HLS library

TL(nTCLK)

II(nT

CLK

)

Figure 6 Comparative between different design alternatives

5 Comparison with Other Proposals

The results obtained were also compared with those reportedin other studies of 8 times 8 matrices since these have receivedmuch research attention

Figure 6 shows the implementation results for the twoarchitectures proposed here (serial and parallel) a Jacobiserial architecture carried out in VHDL [1] a Jacobi semi-parallel architecture from the multiple signal classification(MUSIC) algorithm [8] and the Vivado HLS linear algebralibrary svd implementation

The criteria selected to compare the different designswere as follows FPGA resource consumption in terms of thediscrete elements FF and LUT computation time (latency andinitiation interval) and maximum clock frequency (119891CLK)

The VHDL serial Jacobi algorithm features a fully cus-tomized system using two ad hoc CORDIC modules one ofwhich works in rotation and vectoring mode while the otheris optimized to work only in rotation mode

The Jacobi module used in theMUSIC algorithm featuresa semiparallel architecture based on the Brent and Lukproposal [12] and also uses the CORDIC algorithm for thecalculations However it does not implement the full systolicarray only making use of a subset of this to save resources

Turning first to the two Jacobi algorithm implementationsproposed here it is evident that the best performance interms of speed was achieved by the semiparallel architecturewhereas the serial architecture only used half of the resourcesrequired to implement the parallel one

Vivado HLS library

Clock frequency (MHz)

VHDL (Bravo et al

2008)QR

Parallel Jacobi

Serial Jacobi

0 50 100 150 200 250 300

Figure 7 Maximum clock frequency for all the designs when inputmatrix size is fixed to 8 times 8

The other Vivado HLS implementation considered wasthe svd function from the Vivado HLS library which canonly be implemented in single precision floating point for-mat Resource usage is not dissimilar to our Vivado HLSimplementations however it is slower and makes use of 58DSP48E blocks

The difference arises from the use of floating point coreswhich are implemented with DSP units while the fixedpoint CORDIC approach employed in the present study onlyrequired two DSP48E multipliers per unit

As previously mentioned a floating point QR algorithmwas implemented in the same FPGAdevice also usingVivadoHLS The main purpose of this was to determine whetherother algorithms presented greater computational efficiencythan the Jacobi method Although we used different codingsystems in the Jacobi and QR algorithms respectively ananalysis of both implementations revealed that QR was lesssuitable for a hardware implementation since the operationsperformed can vary between different iterations

Nevertheless we found that although the performanceof the algorithm in terms of execution time was poor it didnot use a large amount of FPGA resources and it thereforerepresents an alternative to consider when floating pointcodification is required

Looking next at the eigen solver implemented as part ofthe MUSIC algorithm it can be seen that it did not performvery well On paper it is similar to the semiparallel architec-ture presented here but the control logic was designed byhand This shows that HLS performs better than traditionalhardware design flows when control logic becomes toocomplicated since it is implemented automatically

Our final comparison was with the VHDL serial architec-ture in which only two CORDIC modules were used Sinceeverything in this design was customized for the applicationtiming performance was similar to HLS designs but it usedfar fewer resources

As regards the maximum clock frequency at which thedesigns can work Figure 7 shows a comparison of all thealternatives discussed

On the one hand we have VivadoHLS designs where thedifference between floating point and fixed point is evidencedby the resultsWhile floating point designs can easily go above200MHz fixed point designs stay around 120MHz

10 Mathematical Problems in Engineering

On the other hand we find theVHDLdesign In this casethe performance of the design proposed in [1] was similar tothe floating point designs since it was hand-coded at a verylow optimization level

6 Conclusions

In this paper we have described a high level synthesisimplementation of the Jacobi algorithm for eigenvalue andeigenvector calculations and this has been compared tosimilar or related systems reported in the literature

The main purpose of using a HLS tool was to try toachieve a similar performance to that of a customRTL systemcoded in for example VHDL in a fraction of the design timerequired

In recent years the amount of resources available inFPGA devices has grown exponentially and thus rapidity isbecoming more important than efficiency when implement-ing computational algorithms

Furthermore HLS tools have evolved making it possibleto achieve good systems that are easy to integrate into largerdesigns as hardware coprocessors together with a CPU or asa processing block in regular RTL designs

As final paper conclusion we can say that the use of highlevel implementation tools reduce the design time In mostcases the designs are initially coded simulated and validatedwith high level languages where there exist standard librariesand reference code available Manual RTL coding is notusually an easy step because there is not a straight way totranslate the functional algorithm to RTL description Inaddition the required level of RTL expertise is very high

Another inconvenience for RTL-based designs appearswhen new changes have to be included in the algorithmTheRTL structure is not very flexible and any modification to thealgorithm requires a lot of time (simulation and recoding)Nevertheless RTL designs generate hardware reconfigurableapproaches with less resources and higher clock frequencythan high level hardware designs (see Figures 6 and 7)

The developed algorithm in this work was initially madewith a traditional RTL flow [1] As a result we can comparethe results of a complex algorithm such as Jacobi based onRTL orHLS flows According to Figures 6 and 7 the achievedresults for the HLS methodology compared with RTL-baseddesign are favorably from different points of view

We have shown that the design process varies from thetraditional RTL flow since it is possible to iterate on the samecode to obtain different hardware architectures and performparametric simulations of the hardware in order to obtain thebest results

Therefore the designer should balance what are the priorrequirements in the system clock frequencyresources ordesign time and choose the methodology (RTLHLS) thatbest fit for that particular case

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgment

This work has been supported by the Spanish Ministry ofEconomy and Competitiveness through the ALCOR Project(DPI2013-47347-C2-1-R)

References

[1] I Bravo M Mazo J L Lazaro P Jimenez A Gardel and MMarron ldquoNovel HW architecture based on FPGAs oriented tosolve the eigen problemrdquo IEEE Transactions on Very Large ScaleIntegration (VLSI) Systems vol 16 no 12 pp 1722ndash1725 2008

[2] H Gao and R Ma ldquoOn a system modelling a population withtwo age groupsrdquoAbstract andApplied Analysis vol 2014 ArticleID 920348 5 pages 2014

[3] X Tang W Deng and Z Qi ldquoInvestigation of the dynamicstability of microgridrdquo IEEE Transactions on Power Systems vol29 no 2 pp 698ndash706 2014

[4] S Wu J Zhang Y Hou and X Bai ldquoConvergence of gossipalgorithms for consensus in wireless sensor networks withintermittent links andmobile nodesrdquoMathematical Problems inEngineering vol 2014 Article ID 836584 18 pages 2014

[5] J A Jimenez M Mazo J Urena et al ldquoUsing PCA in time-of-flight vectors for reflector recognition and 3-D localizationrdquoIEEE Transactions on Robotics vol 21 no 5 pp 909ndash924 2005

[6] W Haiqing S Zhihuan and L Ping ldquoImproved PCA withoptimized sensor locations for process monitoring and faultdiagnosisrdquo inProceedings of the 39th IEEEConfernce onDecisionand Control vol 5 pp 4353ndash4358 December 2000

[7] A A S Ali A Amira F Bensaali andM Benammar ldquoPCA IP-core for gas applications on the heterogenous zynq platformrdquoin Proceedings of the 25th International Conference onMicroelec-tronics (ICM rsquo13) pp 1ndash4 December 2013

[8] D Xu Z Liu X Qi Y Xu and Y Zeng ldquoA FPGA-basedimplementation of MUSIC for centrosymmetric circular arrayrdquoin Proceedings of the 9th International Conference on SignalProcessing (ICSP rsquo08) pp 490ndash493 October 2008

[9] J HWilkinsonTheAlgebraic Eigenvalue Problem Monographson Numerical Analysis Clarendon Press New York NY USA1988

[10] S Sundar and B K Bhagavan ldquoGeneralized eigenvalue prob-lems Lanczos algorithm with a recursive partitioning methodrdquoComputers ampMathematics with Applications vol 39 no 7-8 pp211ndash224 2000

[11] D P OrsquoLeary and P Whitman ldquoParallel QR factorization byhouseholder and modified Gram-Schmidt algorithmsrdquo ParallelComputing vol 16 no 1 pp 99ndash112 1990

[12] R P Brent F T Luk and C van Loan ldquoComputation ofthe generalized singular value decomposition using mesh-connected processorsrdquo Journal of VLSI and Computer Systemsvol 1 no 3 pp 242ndash270 1983

[13] J E Volder ldquoThe CORDIC trigonometric computing tech-niquerdquo IRE Transactions on Electronic Computers vol EC-8 no3 pp 330ndash334 1959

[14] J R Cavallaro and F T Luk ldquoCORDIC arithmetic for an SVDprocessorrdquo Journal of Parallel and Distributed Computing vol 5no 3 pp 271ndash290 1988

[15] G Lucius F le Roy D Aulagnier and S Azou ldquoAn algorithmfor extremal eigenvectors computation of hermitian matricesand its FPGA implementationrdquo in Proceedings of the IEEE56th International Midwest Symposium on Circuits and Systems(MWSCAS rsquo13) pp 1407ndash1410 August 2013

Mathematical Problems in Engineering 11

[16] Impulse C httpwwwimpulseacceleratedcomtoolshtml[17] Catapult c httpcalyptocomenproductscatapultoverview[18] Vivadohlsxilinx web page httpwwwxilinxcomproducts

designtoolsvivadointegrationesl-designhtm[19] R P Brent and F T Luk ldquoThe solution of singular-value

and symmetric eigenvalue problems on multiprocessor arraysrdquoSociety for Industrial and Applied Mathematics vol 6 no 1 pp69ndash84 1985

[20] C-C Sun J Gotze andG E Jan ldquoParallel Jacobi EVDmethodson integrated circuitsrdquoVLSIDesign vol 2014 Article ID 5961039 pages 2014

[21] Y Liu C-S Bouganis and P Y K Cheung ldquoHardware architec-tures for eigenvalue computation of real symmetric matricesrdquoIET Computers and Digital Techniques vol 3 no 1 pp 72ndash842009

[22] W Ford Numerical Linear Algebra with Applications UsingMATLAB Academic Press New York NY USA 1st edition2014

[23] S W Kang and S N Atluri ldquoApplication of nondimensionaldynamic influence function method for eigenmode analysisof two-dimensional acoustic cavitiesrdquo Advances in MechanicalEngineering vol 2014 Article ID 363570 9 pages 2014

[24] A Bernal R Miro D Ginestar and G Verdu ldquoResolution ofthe generalized eigenvalue problem in the neutron diffusionequation discretized by the finite volumemethodrdquoAbstract andApplied Analysis vol 2014 Article ID 913043 15 pages 2014

[25] W Gafsi F Najar S Choura and S El-Borgi ldquoConfinement ofvibrations in variable-geometry nonlinear flexible beamrdquo Shockand Vibration vol 2014 Article ID 687340 7 pages 2014

[26] G H Golub and C F Van Loan Matrix Computations JohnsHopkins Studies in the Mathematical Sciences Johns HopkinsUniversity Press Baltimore Md USA 4th edition 2013

[27] J S Walther A Unified Algorithm for Elementary Functions1972

[28] B Lakshmi and A S Dhar ldquoCORDIC architectures a surveyrdquoVLSI Design vol 2010 Article ID 794891 19 pages 2010

[29] J Faires and R L Burden Numerical Methods Brooks ColePublishing Company 2012

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 4: Research Article High Level Synthesis FPGA Implementation ...downloads.hindawi.com/journals/mpe/2015/870569.pdf · High Level Synthesis FPGA Implementation of the Jacobi Algorithm

4 Mathematical Problems in Engineering

To do so the vector is rotated until its 119910 componentis zero selecting 120590

119894accordingly Since 119911 accumulates

microrotation angles when 119910 cong 0 the variable 119911will correspond to the initial phase of the vector Inaddition since 119910 = 0 119909 will store the absolute valueof V0

If this is expressed mathematically in rotation mode 120590119894=

sign(119911119894) the relation between the input and the output is as

follows

[119909

119910] = 119870

1[cos (119911

0) sin (119911

0)

minus sin (1199110) cos (119911

0)] [1199090

1199100

] (12)

On the other hand if the algorithm works in vectoringmode 120590

119894must be computed as 120590

119894= minus sign(119910

119894) and the relation

between the input and the output is now as shown in thefollowing expression

119909 = 1198701radic1199092

0+ 1199102

0

119911 = 1199110+ atan

1199100

1199090

(13)

As can be observed in expressions (12) and (13) theresult is scaled by a constant value of 119870

1 This is due to the

nonorthogonality of the microrotations which modifies theabsolute value of the vector To solve this results obtainedcan be scaled by 1119870

1 In addition 119870

1(14) is known in

advance and only varies with the number of microrotationsperformed Since 119870

1is fixed 1119870

1can be precomputed and

stored in a memory element to correct the output data

1198701=

119899minus1

prod

119894=0

radic1 + 1205902

119894sdot 2minus2119894 (14)

22 The QR Algorithm As indicated in Section 1 to showthe effectiveness of the Jacobi algorithm over other meth-ods when it is implemented in reconfigurable logic a QRalgorithm implementation was also carried out using VivadoHLS

In the QR algorithm another condition must be addedto our initial matrix which was real and symmetric now itmust also be a tridiagonal matrix like the one presented inthe following expression

119879 =

[[[

[

119886111988720 0

1198872119886211988730

0 119887311988631198874

0 0 11988741198864

]]]

]

(15)

To obtain the eigenvalues of119879 a reduction strategy is usedwhere the 119887

2 119887

119899entries are zeroed When for example

1198872= 0 the row and the column involved can be removed

where 1198861is an eigenvalue of 119879

To accomplish this in the QR algorithm a sequence ofmatrices similar to119879 (119879(1) 119879(119899)) is produced [29] To startwith the initial matrix 119879(119896) is factored as follows

119879(119896)

= 119876(119896)

119877(119896)

(16)

where 119876(119896) is an orthogonal matrix and 119877(119896) is an uppertriangular matrix In a similar way 119879(119896+1) is defined as

119879(119896+1)

= 119877(119896)

119876(119896)

(17)

As presented thus far the convergence of the method isslow To speed it up a shift can be performed on the originalmatrix for a 120590 quantity close to the eigenvalue we are lookingfor Thus we now have a new sequence of matrices such that

119879(119896)

minus 120590119868 = 119876(119896)

119877(119896)

119879(119896+1)

= 119877(119896)

119876(119896)

+ 120590119868

(18)

After each shift the eigenvalues of 119879 will be shifted by thesame 120590 amount and it is easy to recover the original valuesby accumulating the shifts after each iteration To obtainan approximated value of 120590 for each iteration a popularchoice is to calculate the eigenvalues of the matrix 119864 (19) andtake the one closest to 119886

119899 This is called the Wilkinson shift

[26] Finally 119876(119896) and 119877(119896) are chosen to be Givens rotationmatrices to eliminate the appropriate 119887

119894element

119864 = [119886119899minus1

119887119899

119887119899119886119899

] (19)

3 Implementation Details

As discussed in Section 1 the implementation tool used inthis study was the Xilinx Vivado HLS [18] which allowsthe use of a high level programming language (C C++ orSystem C) instead of a hardware description language such asVHDL or Verilog Although it presents some drawbacks thismeans of implementing the algorithms enabled us to test verydifferent hardware configurations of the systemwith very fewdesign modifications

Due to its iterative nature the Jacobi method can easilybe expressed as a series of loops which is the main elementused in Vivado HLS to implement the algorithms

This can be done via pragma directives or tcl scriptinggiving each part of the design different attributes that deter-mine the degree of concurrency or the desired resource con-straints Some important operations that can be performedon design loops include the following

(i) Pipelinewhichwill try to achieve parallelism betweenthe operations performed in the same iterationPipeline operation can also be specified throughoutthe entire design so that all internal loops are unrolledmaking parallelism between two system iterationspossible

(ii) Unroll which will implement each iteration as aseparate hardware instance

(iii) Dataflow which will try to achieve parallelismbetween different loops that are data dependentimplementing communication channels such as ping-pong memories

As mentioned before we can also apply resource con-straints to the designs limiting the number of instances

Mathematical Problems in Engineering 5

of a particular hardware element The elements that canbe limited range from operators (+lowast) to specific resourcessuch as DSP48E and BRAM memory blocks User definedelements such as a function called multiple times canalso be constrained and this is very important for designoptimization as will be shown later

Figure 1 shows a functional diagram of the Jacobi algo-rithm Since the C programming language was used eachtask was implemented as a for loop and labeled to makeoptimization easier

When the design starts its operation (a) the input matrix119860 isin R119899times119899 is buffered in a memory element 119878 In addition tocompute eigenvectors 119881(1) is loaded with an identity matrix(119868 isin R119899times119899) After the initial operations there are three tasks(a b and c) to be performed ℎ times in the external loopwhere ℎ corresponds to the total number of Jacobi iterations(elimination of the 119899 elements adjacent to the main diagonal)that the algorithmwill execute and is selected in suchway thateigenvector error is minimized

Thefirst task thatmust be done in each Jacobi iteration (b)is to calculate rotation angles (120572(119896)

119895) As previouslymentioned

this was achieved using the CORDIC algorithm in vectoringmode

Thenext tasks to be performed (c1 and c2) are eigenvalueand eigenvector rotations which consist of pre- and post-multiplication with the Givens rotation matrix Since thereare no data dependencies between these two tasks they areplaced in different loops and thus parallelization is possiblePre- or postmultiplication with a Givens rotation matrix canbe expressed as two vector rotations so it can be performedwith two rotation mode CORDIC algorithm executionsThisleaves us with six CORDIC executions four of them whichcorrespond to eigenvalue calculation and the other two toeigenvector computation

Thefinal operation to be performed before the next Jacobiiteration (d) is a series of row and column permutations withthe purpose of introducing new nonzero elements into thediagonal submatrices

This element rearrangement was implemented asdescribed by Brent and Luk [19] but with somemodificationsIn their study they proposed a systolic architecturewhere each submatrix corresponded to an independenthardware unit called a processor and all processors wereinterconnected so that data interchanges were possible

Since our design did not have a defined hardware archi-tecture at this point element interchangewas substitutedwiththe corresponding row and column permutations in order tofulfill the requirement of introducing new nonzero elementsinto the diagonal submatrices

Finally when all the Jacobi rotations are completedeigenvalues and eigenvectors are output as arrays which canbe implemented as BRAM ports or AXI buses so that thedesign can be integrated in different kinds of system

All tasks were implemented in the C programminglanguage using a fixed point data type following the XQNquantification scheme where a number is represented by 119883bits of integer part and 119873 bits of fractional part with a total

number of bits (word length) of WL = 119883 + 119873 + 1 for signedvalues and WL = 119883 + 119873 for unsigned values

Starting from the initial C prototype different hardwarearchitectures can be considered depending on the designrequirements Mainly resource usage (LUT Flip-Flops mul-tipliers memory elements) and timing performance (latency(119879119871) throughput or initiation interval (II) and maximum

clock frequency) are the parameters that we attempt tooptimize

(i) the latency (119879119871) which specifies how much time will

take until the system generate the first valid result(ii) the initiation interval (II) which corresponds to the

throughput or the number of clock cycles betweentwo valid sets of results after the latency period

As a result of the Vivado HLS optimization directivesthree architectures can be achieved depending on whichparameters are considered more important

(i) Serial architecture this hardware scheme pursuesminimal resource usage at the expense of a higherlatency To reach this optimization level a pipelinedirective is specified on the external loop of thedesign which performs the Jacobi iterations

(ii) Parallel architecture in this kind of system the maingoal is to achieve the best performance in termsof execution time (latency and throughput) at theexpense of resource usage To achieve this pipelineis applied to the entire design which means thatthe external loop will be unrolled and parallelismbetween two algorithm executions will be possible

(iii) Semiparallel architecture although timing is stillimportant in this scheme resource usage is alsoconsidered and attempts are made to minimize itwhile affecting execution time as little as possible Inour particular case this was achieved by limiting thenumber of CORDIC instances of the design in theparallel architecture

Although all three alternatives were analyzed the fullparallel architecture was found to use a massive amount ofinternal FPGA resources greatly exceeding the capacity ofthe device This was mainly due to an excessive amountof CORDIC units in the rotation mode which led to 55hardware instances of it in the case of an 8 times 8 inputmatrix Consequently we ruled out the possibility of parallelarchitecture

It is important to clarify the difference between semi-parallel and serial architecture In both cases we alwaysattempted to maximize concurrency between operationssuch as arithmetic operations andmemory accesses howeverwith parallel architecture it was also possible to achieveconcurrence between two full executions of the algorithm

The excessive amount of CORDIC units implemented bythe Vivado HLS synthesizer was taken as a starting pointfor optimization of the design Since Vivado HLS allowedus to limit the number of elements of a particular hardwareresource or entity the number of CORDIC instances in both

6 Mathematical Problems in Engineering

S A

V Intimesnk = 1

120572(k)j = 05tans(k)2jminus12j

s(k)2j2j minus s(k)2jminus12jminus1

Sk+1 Sk+1aux Vk+1 Vk+1aux

(k+1) = V(k)ij middot R(120572j)

i = 1 n2 j = 1 n2 j = i n2i = 1 n2

(k+1) = R(120572i)Tmiddot S(k)ij middot R(120572j)

k = h

(a)

(b)

(c2)

(d)

(c1)

Result rearrange

No

[ ]

Yes

A Rntimesnisin

Vaux 119894119895 Saux 119894119895

larrminus

larrminus

larrminus larrminus

rarr

rarrVlarrminus

1205821 120582n diag(S)

lceilrceil lceilrceil

larrminus

1205821 120582n

j = 1 n2

Figure 1 Jacobi algorithm functional diagram

semiparallel and serial architecture was limited to differentvalues allowing Vivado HLS to redesign the RTL implemen-tation and subsequently analyzing the design performanceobtained

First we analyze the case of the semiparallel architectureSince 8 times 8 is a common size in the literature for input matri-ces the design was optimized for this matrix size extendingthe conclusions reached to other matrix sizes The resultsobtained (resource consumption and timing performance)are shown in Figure 2

An analysis of timing performance indicated thatalthough latency did not vary very much the initiationinterval (II) was considerably reduced when the number ofrotation mode CORDIC instances was increased from oneto four slowly decreasing as more CORDIC modules wereadded

This result was mainly due to data dependencies betweenoperations performed in the Jacobi algorithm As previously

indicated four rotations must be performed for eigenvaluecalculation and the result of the first two is required toperform the last two whereas two independent rotationsmust be performed for eigenvector calculation

From this it can be concluded that four CORDICmodules will greatly increase the performance of the systemWhen timing performance is related to resource usage it isclear that it increased linearly with CORDIC module usageand the slope variation was due to the different control logicimplemented by Vivado with an odd and an even number ofCORDIC modules

The last conclusion that can be reached from Figure 2 isthat there was parallelism between two algorithm iterationssince 119879

119871gt II This means that the first valid result will be

obtained after 119879119871clock cycles but the next results will be

obtained more rapidly after every II clock cyclesThe results obtained for the serial architecture are shown

in Figure 3 This time only the latency is shown since there

Mathematical Problems in Engineering 7

1 2 3 4 5 6 7 8 90

20406080

100

Disc

rete

elem

ents

()

Resource usage

LUTFF

1 2 3 4 5 6 7 8 90

500

1000

1500

2000

Rotation CORDIC modules

Rotation CORDIC modules

Execution time (clock cycles)

IITL

nmiddotT

CLK

Figure 2 Semiparallel architecture implementation results of theJacobi algorithm for 8 times 8matrices

was no parallelism between two algorithm executions andthey only differed in one clock cycle that the system neededto start the next execution (II = 119879

119871+ 1)

As indicated earlier the execution time between two validresult sets is the same as the latency period (119879

119871) and in

the semiparallel design alternative it evolved similarly to theinitiation interval

For similar reasons there was a greater improvement inlatency in the first fourCORDICmodule additions which didnot decrease at all after sixThe increase in resource usage wasalso linear now with a constant slope although it used fewerresources

4 Assessment of Design Parameters

41 Optimal Number of Jacobi Iterations To evaluate thequality of the results obtained using the Jacobi method weconducted an experimental error analysis to determine howmany Jacobi iterations were required to obtain satisfactoryresults

The word length (WL) selected for the fixed point cod-ification scheme presented in Section 2 was 18 bits with afractional part of 15 bits (2Q15) since Xilinx FPGAmultiplierentry is fixed to this value and the purpose of using theCORDIC algorithm was to minimize resource usage Inaddition as will now be shown this codification providedsufficient precision to avoid overflow and growing roundingerrors

Simulating the system for multiple 8 times 8 real and sym-metric random matrices we calculated the eigenvalue and

1 2 3 4 5 6 7 8 90

1020304050

Disc

rete

elem

ents

Resource usage

1 2 3 4 5 6 7 8 91500

2000

2500

3000

Rotation CORDIC modules

Rotation CORDIC modules

Execution time

LUTFF

TL

nmiddotT

CLK

Figure 3 Serial architecture implementation results of the Jacobialgorithm for 8 times 8matrices

21 215 22 225 23 235 24 245 25 255 260

1020304050

Max

imum

erro

r (

)

Iterations

21 215 22 225 23 235 24 245 25 255 260

02040608

1

Qua

drat

ic m

ean

erro

r (

)

Iterations

Eigenvector component error (rarr

rarr

)1205821 120582m

Figure 4 Eigenvectors maximum and quadratic component error

eigenvector maximum (119890MAX) and quadratic (119890SQRT) errorcomparing FPGA Jacobi fixed point results with MATLABfloating point implementation (eig)

When the results were analyzed we found that the eigen-vector error stabilized later than the eigenvalue error (moreiterations were required) Consequently Figure 4 shows

8 Mathematical Problems in Engineering

the eigenvector component error calculated according to thefollowing expressions

119890 =

1003816100381610038161003816119881HLS1003816100381610038161003816 minus1003816100381610038161003816119881MATLAB

10038161003816100381610038161003816100381610038161003816119881MATLAB

1003816100381610038161003816

sdot 100

119890MAX = max (10038161003816100381610038161198901003816100381610038161003816)

119890SQRT =1

1198962

radic

119896minus1

sum

119894=0

1198902

(20)

where 119896 represents the total number of samples The errorobtained was plotted as a function of the number of Jacobiiterations (ie the number of rotations performed for eachsubmatrix) and was used to determine the optimal numberof iterations (ℎ) that the algorithm must perform to obtainsufficiently accurate results

Our analysis of 8 times 8 matrices indicated that ℎ = 26

was the best number of iterations for both eigenvectors andeigenvalues in order to obtain a sufficiently accurate solutionsince the error was constant from this value on

After performing the same tests for other input matrixsizes it was concluded that in general expression (21) canbe used to determine the optimal number of iterations to beperformed for an 119899 times 119899matrix

ℎ = 9 + 119899 log 119899 (21)

42 SerialParallel Jacobi Analysis Since an expression wasavailable for the optimal number of Jacobi iterations (ℎ) wealso conducted a study of the design performance dependingon initial matrix size

With the parallel architecture we found that althoughthe Vivado HLS generated an RTL design for matrices withan order 119899 gt 10 the Vivado suite synthesizer was unableto implement it successfully However the serial architecturedesign performed well for higher order matrices (20 times 20)

Based on the parametric study performed in Section 3we decided that a good trade-off between execution time andresource usagewould be to use fourCORDICmodules whichwould perform WL minus 1 iterations in both parallel and serialarchitectures where WL was the data word length

Implementation results for the serial architecture areshown in Figure 5 for the Xilinx Artix 7 series (XC7A100T-1CSG324C) device

Turning first to resource usage it can be seen that thisdid not not increase as fast as execution time since whenmatrix size is increased with a fixed amount of CORDICunits only storage elements and control logic will be addedConsequently it becomes evident that in terms of resourcesused the design has an 119874(1198992) complexity

Regarding execution time (119879exe) the Vivado HLS enabledus to examine the sequence of operations performed by thedesign in each clock cycle An analysis of this informationindicated that the design latency which corresponds toexecution time in the serial architecture follows the followingexpression

119879119871= 119879load + 119899

2

sdot ℎ (22)

8 9 10 11 12 13 14 15 1605

1

15

2Used resources

FF an

d LU

T (

)

8 9 10 11 12 13 14 15 160

50

100

150

FFLUT

Input matrix size (n)

Input matrix size (n)

Execution time with 100MHz

times104

Tex

e(120583

s)

fCLK =

Figure 5 Serial Jacobi design performance with matrix orderincrease

where 119899 is the size of the processed matrix and ℎ is thenumber of Jacobi iterations performed following expression(21) Meanwhile 119879load (23) corresponds to the time requiredto read sufficient data from an external memory to start thefirst Jacobi iteration and 1198992 is the amount of clock cyclesrequired to perform a Jacobi iteration

119879load =1198992

2+119899

2+ 4 (23)

A similar study was performed for different numbersof rotation CORDIC units which revealed that the time toperform a Jacobi iteration varied However it remained afunction of 1198992 so system latency (119879

119871) was still related to the

size of the inputmatrix as shown in the following expression

119879119871= 119879load + 119879119869 sdot ℎ (24)

where 119879119869= 119891(119899

2

) is the time required to perform a Jacobiiteration

Regarding the maximum size (119899) of the input matrix weassumed that given the serial nature of the system resourceusage would be the limit however we found that for 119899 gt 20the Vivado HLS took a very long time to guess the controllogic and implementation eventually failed Despite thisthe design has been coded in such a way that input matrixsize and the codification scheme can be modified simply bychanging a few constants which is a vast improvement overthe classic RTL designs

Mathematical Problems in Engineering 9

Design comparison

11000 23000 34000 45000LUT

8500

17000

26000

34000

FF

400080001200016000

4000

8000

12000

16000

Semiparallel JacobiSerial JacobiFloating point QRMUSIC (Xu et al 2008)VHDL (Bravo et al 2008)Vivado HLS library

TL(nTCLK)

II(nT

CLK

)

Figure 6 Comparative between different design alternatives

5 Comparison with Other Proposals

The results obtained were also compared with those reportedin other studies of 8 times 8 matrices since these have receivedmuch research attention

Figure 6 shows the implementation results for the twoarchitectures proposed here (serial and parallel) a Jacobiserial architecture carried out in VHDL [1] a Jacobi semi-parallel architecture from the multiple signal classification(MUSIC) algorithm [8] and the Vivado HLS linear algebralibrary svd implementation

The criteria selected to compare the different designswere as follows FPGA resource consumption in terms of thediscrete elements FF and LUT computation time (latency andinitiation interval) and maximum clock frequency (119891CLK)

The VHDL serial Jacobi algorithm features a fully cus-tomized system using two ad hoc CORDIC modules one ofwhich works in rotation and vectoring mode while the otheris optimized to work only in rotation mode

The Jacobi module used in theMUSIC algorithm featuresa semiparallel architecture based on the Brent and Lukproposal [12] and also uses the CORDIC algorithm for thecalculations However it does not implement the full systolicarray only making use of a subset of this to save resources

Turning first to the two Jacobi algorithm implementationsproposed here it is evident that the best performance interms of speed was achieved by the semiparallel architecturewhereas the serial architecture only used half of the resourcesrequired to implement the parallel one

Vivado HLS library

Clock frequency (MHz)

VHDL (Bravo et al

2008)QR

Parallel Jacobi

Serial Jacobi

0 50 100 150 200 250 300

Figure 7 Maximum clock frequency for all the designs when inputmatrix size is fixed to 8 times 8

The other Vivado HLS implementation considered wasthe svd function from the Vivado HLS library which canonly be implemented in single precision floating point for-mat Resource usage is not dissimilar to our Vivado HLSimplementations however it is slower and makes use of 58DSP48E blocks

The difference arises from the use of floating point coreswhich are implemented with DSP units while the fixedpoint CORDIC approach employed in the present study onlyrequired two DSP48E multipliers per unit

As previously mentioned a floating point QR algorithmwas implemented in the same FPGAdevice also usingVivadoHLS The main purpose of this was to determine whetherother algorithms presented greater computational efficiencythan the Jacobi method Although we used different codingsystems in the Jacobi and QR algorithms respectively ananalysis of both implementations revealed that QR was lesssuitable for a hardware implementation since the operationsperformed can vary between different iterations

Nevertheless we found that although the performanceof the algorithm in terms of execution time was poor it didnot use a large amount of FPGA resources and it thereforerepresents an alternative to consider when floating pointcodification is required

Looking next at the eigen solver implemented as part ofthe MUSIC algorithm it can be seen that it did not performvery well On paper it is similar to the semiparallel architec-ture presented here but the control logic was designed byhand This shows that HLS performs better than traditionalhardware design flows when control logic becomes toocomplicated since it is implemented automatically

Our final comparison was with the VHDL serial architec-ture in which only two CORDIC modules were used Sinceeverything in this design was customized for the applicationtiming performance was similar to HLS designs but it usedfar fewer resources

As regards the maximum clock frequency at which thedesigns can work Figure 7 shows a comparison of all thealternatives discussed

On the one hand we have VivadoHLS designs where thedifference between floating point and fixed point is evidencedby the resultsWhile floating point designs can easily go above200MHz fixed point designs stay around 120MHz

10 Mathematical Problems in Engineering

On the other hand we find theVHDLdesign In this casethe performance of the design proposed in [1] was similar tothe floating point designs since it was hand-coded at a verylow optimization level

6 Conclusions

In this paper we have described a high level synthesisimplementation of the Jacobi algorithm for eigenvalue andeigenvector calculations and this has been compared tosimilar or related systems reported in the literature

The main purpose of using a HLS tool was to try toachieve a similar performance to that of a customRTL systemcoded in for example VHDL in a fraction of the design timerequired

In recent years the amount of resources available inFPGA devices has grown exponentially and thus rapidity isbecoming more important than efficiency when implement-ing computational algorithms

Furthermore HLS tools have evolved making it possibleto achieve good systems that are easy to integrate into largerdesigns as hardware coprocessors together with a CPU or asa processing block in regular RTL designs

As final paper conclusion we can say that the use of highlevel implementation tools reduce the design time In mostcases the designs are initially coded simulated and validatedwith high level languages where there exist standard librariesand reference code available Manual RTL coding is notusually an easy step because there is not a straight way totranslate the functional algorithm to RTL description Inaddition the required level of RTL expertise is very high

Another inconvenience for RTL-based designs appearswhen new changes have to be included in the algorithmTheRTL structure is not very flexible and any modification to thealgorithm requires a lot of time (simulation and recoding)Nevertheless RTL designs generate hardware reconfigurableapproaches with less resources and higher clock frequencythan high level hardware designs (see Figures 6 and 7)

The developed algorithm in this work was initially madewith a traditional RTL flow [1] As a result we can comparethe results of a complex algorithm such as Jacobi based onRTL orHLS flows According to Figures 6 and 7 the achievedresults for the HLS methodology compared with RTL-baseddesign are favorably from different points of view

We have shown that the design process varies from thetraditional RTL flow since it is possible to iterate on the samecode to obtain different hardware architectures and performparametric simulations of the hardware in order to obtain thebest results

Therefore the designer should balance what are the priorrequirements in the system clock frequencyresources ordesign time and choose the methodology (RTLHLS) thatbest fit for that particular case

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgment

This work has been supported by the Spanish Ministry ofEconomy and Competitiveness through the ALCOR Project(DPI2013-47347-C2-1-R)

References

[1] I Bravo M Mazo J L Lazaro P Jimenez A Gardel and MMarron ldquoNovel HW architecture based on FPGAs oriented tosolve the eigen problemrdquo IEEE Transactions on Very Large ScaleIntegration (VLSI) Systems vol 16 no 12 pp 1722ndash1725 2008

[2] H Gao and R Ma ldquoOn a system modelling a population withtwo age groupsrdquoAbstract andApplied Analysis vol 2014 ArticleID 920348 5 pages 2014

[3] X Tang W Deng and Z Qi ldquoInvestigation of the dynamicstability of microgridrdquo IEEE Transactions on Power Systems vol29 no 2 pp 698ndash706 2014

[4] S Wu J Zhang Y Hou and X Bai ldquoConvergence of gossipalgorithms for consensus in wireless sensor networks withintermittent links andmobile nodesrdquoMathematical Problems inEngineering vol 2014 Article ID 836584 18 pages 2014

[5] J A Jimenez M Mazo J Urena et al ldquoUsing PCA in time-of-flight vectors for reflector recognition and 3-D localizationrdquoIEEE Transactions on Robotics vol 21 no 5 pp 909ndash924 2005

[6] W Haiqing S Zhihuan and L Ping ldquoImproved PCA withoptimized sensor locations for process monitoring and faultdiagnosisrdquo inProceedings of the 39th IEEEConfernce onDecisionand Control vol 5 pp 4353ndash4358 December 2000

[7] A A S Ali A Amira F Bensaali andM Benammar ldquoPCA IP-core for gas applications on the heterogenous zynq platformrdquoin Proceedings of the 25th International Conference onMicroelec-tronics (ICM rsquo13) pp 1ndash4 December 2013

[8] D Xu Z Liu X Qi Y Xu and Y Zeng ldquoA FPGA-basedimplementation of MUSIC for centrosymmetric circular arrayrdquoin Proceedings of the 9th International Conference on SignalProcessing (ICSP rsquo08) pp 490ndash493 October 2008

[9] J HWilkinsonTheAlgebraic Eigenvalue Problem Monographson Numerical Analysis Clarendon Press New York NY USA1988

[10] S Sundar and B K Bhagavan ldquoGeneralized eigenvalue prob-lems Lanczos algorithm with a recursive partitioning methodrdquoComputers ampMathematics with Applications vol 39 no 7-8 pp211ndash224 2000

[11] D P OrsquoLeary and P Whitman ldquoParallel QR factorization byhouseholder and modified Gram-Schmidt algorithmsrdquo ParallelComputing vol 16 no 1 pp 99ndash112 1990

[12] R P Brent F T Luk and C van Loan ldquoComputation ofthe generalized singular value decomposition using mesh-connected processorsrdquo Journal of VLSI and Computer Systemsvol 1 no 3 pp 242ndash270 1983

[13] J E Volder ldquoThe CORDIC trigonometric computing tech-niquerdquo IRE Transactions on Electronic Computers vol EC-8 no3 pp 330ndash334 1959

[14] J R Cavallaro and F T Luk ldquoCORDIC arithmetic for an SVDprocessorrdquo Journal of Parallel and Distributed Computing vol 5no 3 pp 271ndash290 1988

[15] G Lucius F le Roy D Aulagnier and S Azou ldquoAn algorithmfor extremal eigenvectors computation of hermitian matricesand its FPGA implementationrdquo in Proceedings of the IEEE56th International Midwest Symposium on Circuits and Systems(MWSCAS rsquo13) pp 1407ndash1410 August 2013

Mathematical Problems in Engineering 11

[16] Impulse C httpwwwimpulseacceleratedcomtoolshtml[17] Catapult c httpcalyptocomenproductscatapultoverview[18] Vivadohlsxilinx web page httpwwwxilinxcomproducts

designtoolsvivadointegrationesl-designhtm[19] R P Brent and F T Luk ldquoThe solution of singular-value

and symmetric eigenvalue problems on multiprocessor arraysrdquoSociety for Industrial and Applied Mathematics vol 6 no 1 pp69ndash84 1985

[20] C-C Sun J Gotze andG E Jan ldquoParallel Jacobi EVDmethodson integrated circuitsrdquoVLSIDesign vol 2014 Article ID 5961039 pages 2014

[21] Y Liu C-S Bouganis and P Y K Cheung ldquoHardware architec-tures for eigenvalue computation of real symmetric matricesrdquoIET Computers and Digital Techniques vol 3 no 1 pp 72ndash842009

[22] W Ford Numerical Linear Algebra with Applications UsingMATLAB Academic Press New York NY USA 1st edition2014

[23] S W Kang and S N Atluri ldquoApplication of nondimensionaldynamic influence function method for eigenmode analysisof two-dimensional acoustic cavitiesrdquo Advances in MechanicalEngineering vol 2014 Article ID 363570 9 pages 2014

[24] A Bernal R Miro D Ginestar and G Verdu ldquoResolution ofthe generalized eigenvalue problem in the neutron diffusionequation discretized by the finite volumemethodrdquoAbstract andApplied Analysis vol 2014 Article ID 913043 15 pages 2014

[25] W Gafsi F Najar S Choura and S El-Borgi ldquoConfinement ofvibrations in variable-geometry nonlinear flexible beamrdquo Shockand Vibration vol 2014 Article ID 687340 7 pages 2014

[26] G H Golub and C F Van Loan Matrix Computations JohnsHopkins Studies in the Mathematical Sciences Johns HopkinsUniversity Press Baltimore Md USA 4th edition 2013

[27] J S Walther A Unified Algorithm for Elementary Functions1972

[28] B Lakshmi and A S Dhar ldquoCORDIC architectures a surveyrdquoVLSI Design vol 2010 Article ID 794891 19 pages 2010

[29] J Faires and R L Burden Numerical Methods Brooks ColePublishing Company 2012

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 5: Research Article High Level Synthesis FPGA Implementation ...downloads.hindawi.com/journals/mpe/2015/870569.pdf · High Level Synthesis FPGA Implementation of the Jacobi Algorithm

Mathematical Problems in Engineering 5

of a particular hardware element The elements that canbe limited range from operators (+lowast) to specific resourcessuch as DSP48E and BRAM memory blocks User definedelements such as a function called multiple times canalso be constrained and this is very important for designoptimization as will be shown later

Figure 1 shows a functional diagram of the Jacobi algo-rithm Since the C programming language was used eachtask was implemented as a for loop and labeled to makeoptimization easier

When the design starts its operation (a) the input matrix119860 isin R119899times119899 is buffered in a memory element 119878 In addition tocompute eigenvectors 119881(1) is loaded with an identity matrix(119868 isin R119899times119899) After the initial operations there are three tasks(a b and c) to be performed ℎ times in the external loopwhere ℎ corresponds to the total number of Jacobi iterations(elimination of the 119899 elements adjacent to the main diagonal)that the algorithmwill execute and is selected in suchway thateigenvector error is minimized

Thefirst task thatmust be done in each Jacobi iteration (b)is to calculate rotation angles (120572(119896)

119895) As previouslymentioned

this was achieved using the CORDIC algorithm in vectoringmode

Thenext tasks to be performed (c1 and c2) are eigenvalueand eigenvector rotations which consist of pre- and post-multiplication with the Givens rotation matrix Since thereare no data dependencies between these two tasks they areplaced in different loops and thus parallelization is possiblePre- or postmultiplication with a Givens rotation matrix canbe expressed as two vector rotations so it can be performedwith two rotation mode CORDIC algorithm executionsThisleaves us with six CORDIC executions four of them whichcorrespond to eigenvalue calculation and the other two toeigenvector computation

Thefinal operation to be performed before the next Jacobiiteration (d) is a series of row and column permutations withthe purpose of introducing new nonzero elements into thediagonal submatrices

This element rearrangement was implemented asdescribed by Brent and Luk [19] but with somemodificationsIn their study they proposed a systolic architecturewhere each submatrix corresponded to an independenthardware unit called a processor and all processors wereinterconnected so that data interchanges were possible

Since our design did not have a defined hardware archi-tecture at this point element interchangewas substitutedwiththe corresponding row and column permutations in order tofulfill the requirement of introducing new nonzero elementsinto the diagonal submatrices

Finally when all the Jacobi rotations are completedeigenvalues and eigenvectors are output as arrays which canbe implemented as BRAM ports or AXI buses so that thedesign can be integrated in different kinds of system

All tasks were implemented in the C programminglanguage using a fixed point data type following the XQNquantification scheme where a number is represented by 119883bits of integer part and 119873 bits of fractional part with a total

number of bits (word length) of WL = 119883 + 119873 + 1 for signedvalues and WL = 119883 + 119873 for unsigned values

Starting from the initial C prototype different hardwarearchitectures can be considered depending on the designrequirements Mainly resource usage (LUT Flip-Flops mul-tipliers memory elements) and timing performance (latency(119879119871) throughput or initiation interval (II) and maximum

clock frequency) are the parameters that we attempt tooptimize

(i) the latency (119879119871) which specifies how much time will

take until the system generate the first valid result(ii) the initiation interval (II) which corresponds to the

throughput or the number of clock cycles betweentwo valid sets of results after the latency period

As a result of the Vivado HLS optimization directivesthree architectures can be achieved depending on whichparameters are considered more important

(i) Serial architecture this hardware scheme pursuesminimal resource usage at the expense of a higherlatency To reach this optimization level a pipelinedirective is specified on the external loop of thedesign which performs the Jacobi iterations

(ii) Parallel architecture in this kind of system the maingoal is to achieve the best performance in termsof execution time (latency and throughput) at theexpense of resource usage To achieve this pipelineis applied to the entire design which means thatthe external loop will be unrolled and parallelismbetween two algorithm executions will be possible

(iii) Semiparallel architecture although timing is stillimportant in this scheme resource usage is alsoconsidered and attempts are made to minimize itwhile affecting execution time as little as possible Inour particular case this was achieved by limiting thenumber of CORDIC instances of the design in theparallel architecture

Although all three alternatives were analyzed the fullparallel architecture was found to use a massive amount ofinternal FPGA resources greatly exceeding the capacity ofthe device This was mainly due to an excessive amountof CORDIC units in the rotation mode which led to 55hardware instances of it in the case of an 8 times 8 inputmatrix Consequently we ruled out the possibility of parallelarchitecture

It is important to clarify the difference between semi-parallel and serial architecture In both cases we alwaysattempted to maximize concurrency between operationssuch as arithmetic operations andmemory accesses howeverwith parallel architecture it was also possible to achieveconcurrence between two full executions of the algorithm

The excessive amount of CORDIC units implemented bythe Vivado HLS synthesizer was taken as a starting pointfor optimization of the design Since Vivado HLS allowedus to limit the number of elements of a particular hardwareresource or entity the number of CORDIC instances in both

6 Mathematical Problems in Engineering

S A

V Intimesnk = 1

120572(k)j = 05tans(k)2jminus12j

s(k)2j2j minus s(k)2jminus12jminus1

Sk+1 Sk+1aux Vk+1 Vk+1aux

(k+1) = V(k)ij middot R(120572j)

i = 1 n2 j = 1 n2 j = i n2i = 1 n2

(k+1) = R(120572i)Tmiddot S(k)ij middot R(120572j)

k = h

(a)

(b)

(c2)

(d)

(c1)

Result rearrange

No

[ ]

Yes

A Rntimesnisin

Vaux 119894119895 Saux 119894119895

larrminus

larrminus

larrminus larrminus

rarr

rarrVlarrminus

1205821 120582n diag(S)

lceilrceil lceilrceil

larrminus

1205821 120582n

j = 1 n2

Figure 1 Jacobi algorithm functional diagram

semiparallel and serial architecture was limited to differentvalues allowing Vivado HLS to redesign the RTL implemen-tation and subsequently analyzing the design performanceobtained

First we analyze the case of the semiparallel architectureSince 8 times 8 is a common size in the literature for input matri-ces the design was optimized for this matrix size extendingthe conclusions reached to other matrix sizes The resultsobtained (resource consumption and timing performance)are shown in Figure 2

An analysis of timing performance indicated thatalthough latency did not vary very much the initiationinterval (II) was considerably reduced when the number ofrotation mode CORDIC instances was increased from oneto four slowly decreasing as more CORDIC modules wereadded

This result was mainly due to data dependencies betweenoperations performed in the Jacobi algorithm As previously

indicated four rotations must be performed for eigenvaluecalculation and the result of the first two is required toperform the last two whereas two independent rotationsmust be performed for eigenvector calculation

From this it can be concluded that four CORDICmodules will greatly increase the performance of the systemWhen timing performance is related to resource usage it isclear that it increased linearly with CORDIC module usageand the slope variation was due to the different control logicimplemented by Vivado with an odd and an even number ofCORDIC modules

The last conclusion that can be reached from Figure 2 isthat there was parallelism between two algorithm iterationssince 119879

119871gt II This means that the first valid result will be

obtained after 119879119871clock cycles but the next results will be

obtained more rapidly after every II clock cyclesThe results obtained for the serial architecture are shown

in Figure 3 This time only the latency is shown since there

Mathematical Problems in Engineering 7

1 2 3 4 5 6 7 8 90

20406080

100

Disc

rete

elem

ents

()

Resource usage

LUTFF

1 2 3 4 5 6 7 8 90

500

1000

1500

2000

Rotation CORDIC modules

Rotation CORDIC modules

Execution time (clock cycles)

IITL

nmiddotT

CLK

Figure 2 Semiparallel architecture implementation results of theJacobi algorithm for 8 times 8matrices

was no parallelism between two algorithm executions andthey only differed in one clock cycle that the system neededto start the next execution (II = 119879

119871+ 1)

As indicated earlier the execution time between two validresult sets is the same as the latency period (119879

119871) and in

the semiparallel design alternative it evolved similarly to theinitiation interval

For similar reasons there was a greater improvement inlatency in the first fourCORDICmodule additions which didnot decrease at all after sixThe increase in resource usage wasalso linear now with a constant slope although it used fewerresources

4 Assessment of Design Parameters

41 Optimal Number of Jacobi Iterations To evaluate thequality of the results obtained using the Jacobi method weconducted an experimental error analysis to determine howmany Jacobi iterations were required to obtain satisfactoryresults

The word length (WL) selected for the fixed point cod-ification scheme presented in Section 2 was 18 bits with afractional part of 15 bits (2Q15) since Xilinx FPGAmultiplierentry is fixed to this value and the purpose of using theCORDIC algorithm was to minimize resource usage Inaddition as will now be shown this codification providedsufficient precision to avoid overflow and growing roundingerrors

Simulating the system for multiple 8 times 8 real and sym-metric random matrices we calculated the eigenvalue and

1 2 3 4 5 6 7 8 90

1020304050

Disc

rete

elem

ents

Resource usage

1 2 3 4 5 6 7 8 91500

2000

2500

3000

Rotation CORDIC modules

Rotation CORDIC modules

Execution time

LUTFF

TL

nmiddotT

CLK

Figure 3 Serial architecture implementation results of the Jacobialgorithm for 8 times 8matrices

21 215 22 225 23 235 24 245 25 255 260

1020304050

Max

imum

erro

r (

)

Iterations

21 215 22 225 23 235 24 245 25 255 260

02040608

1

Qua

drat

ic m

ean

erro

r (

)

Iterations

Eigenvector component error (rarr

rarr

)1205821 120582m

Figure 4 Eigenvectors maximum and quadratic component error

eigenvector maximum (119890MAX) and quadratic (119890SQRT) errorcomparing FPGA Jacobi fixed point results with MATLABfloating point implementation (eig)

When the results were analyzed we found that the eigen-vector error stabilized later than the eigenvalue error (moreiterations were required) Consequently Figure 4 shows

8 Mathematical Problems in Engineering

the eigenvector component error calculated according to thefollowing expressions

119890 =

1003816100381610038161003816119881HLS1003816100381610038161003816 minus1003816100381610038161003816119881MATLAB

10038161003816100381610038161003816100381610038161003816119881MATLAB

1003816100381610038161003816

sdot 100

119890MAX = max (10038161003816100381610038161198901003816100381610038161003816)

119890SQRT =1

1198962

radic

119896minus1

sum

119894=0

1198902

(20)

where 119896 represents the total number of samples The errorobtained was plotted as a function of the number of Jacobiiterations (ie the number of rotations performed for eachsubmatrix) and was used to determine the optimal numberof iterations (ℎ) that the algorithm must perform to obtainsufficiently accurate results

Our analysis of 8 times 8 matrices indicated that ℎ = 26

was the best number of iterations for both eigenvectors andeigenvalues in order to obtain a sufficiently accurate solutionsince the error was constant from this value on

After performing the same tests for other input matrixsizes it was concluded that in general expression (21) canbe used to determine the optimal number of iterations to beperformed for an 119899 times 119899matrix

ℎ = 9 + 119899 log 119899 (21)

42 SerialParallel Jacobi Analysis Since an expression wasavailable for the optimal number of Jacobi iterations (ℎ) wealso conducted a study of the design performance dependingon initial matrix size

With the parallel architecture we found that althoughthe Vivado HLS generated an RTL design for matrices withan order 119899 gt 10 the Vivado suite synthesizer was unableto implement it successfully However the serial architecturedesign performed well for higher order matrices (20 times 20)

Based on the parametric study performed in Section 3we decided that a good trade-off between execution time andresource usagewould be to use fourCORDICmodules whichwould perform WL minus 1 iterations in both parallel and serialarchitectures where WL was the data word length

Implementation results for the serial architecture areshown in Figure 5 for the Xilinx Artix 7 series (XC7A100T-1CSG324C) device

Turning first to resource usage it can be seen that thisdid not not increase as fast as execution time since whenmatrix size is increased with a fixed amount of CORDICunits only storage elements and control logic will be addedConsequently it becomes evident that in terms of resourcesused the design has an 119874(1198992) complexity

Regarding execution time (119879exe) the Vivado HLS enabledus to examine the sequence of operations performed by thedesign in each clock cycle An analysis of this informationindicated that the design latency which corresponds toexecution time in the serial architecture follows the followingexpression

119879119871= 119879load + 119899

2

sdot ℎ (22)

8 9 10 11 12 13 14 15 1605

1

15

2Used resources

FF an

d LU

T (

)

8 9 10 11 12 13 14 15 160

50

100

150

FFLUT

Input matrix size (n)

Input matrix size (n)

Execution time with 100MHz

times104

Tex

e(120583

s)

fCLK =

Figure 5 Serial Jacobi design performance with matrix orderincrease

where 119899 is the size of the processed matrix and ℎ is thenumber of Jacobi iterations performed following expression(21) Meanwhile 119879load (23) corresponds to the time requiredto read sufficient data from an external memory to start thefirst Jacobi iteration and 1198992 is the amount of clock cyclesrequired to perform a Jacobi iteration

119879load =1198992

2+119899

2+ 4 (23)

A similar study was performed for different numbersof rotation CORDIC units which revealed that the time toperform a Jacobi iteration varied However it remained afunction of 1198992 so system latency (119879

119871) was still related to the

size of the inputmatrix as shown in the following expression

119879119871= 119879load + 119879119869 sdot ℎ (24)

where 119879119869= 119891(119899

2

) is the time required to perform a Jacobiiteration

Regarding the maximum size (119899) of the input matrix weassumed that given the serial nature of the system resourceusage would be the limit however we found that for 119899 gt 20the Vivado HLS took a very long time to guess the controllogic and implementation eventually failed Despite thisthe design has been coded in such a way that input matrixsize and the codification scheme can be modified simply bychanging a few constants which is a vast improvement overthe classic RTL designs

Mathematical Problems in Engineering 9

Design comparison

11000 23000 34000 45000LUT

8500

17000

26000

34000

FF

400080001200016000

4000

8000

12000

16000

Semiparallel JacobiSerial JacobiFloating point QRMUSIC (Xu et al 2008)VHDL (Bravo et al 2008)Vivado HLS library

TL(nTCLK)

II(nT

CLK

)

Figure 6 Comparative between different design alternatives

5 Comparison with Other Proposals

The results obtained were also compared with those reportedin other studies of 8 times 8 matrices since these have receivedmuch research attention

Figure 6 shows the implementation results for the twoarchitectures proposed here (serial and parallel) a Jacobiserial architecture carried out in VHDL [1] a Jacobi semi-parallel architecture from the multiple signal classification(MUSIC) algorithm [8] and the Vivado HLS linear algebralibrary svd implementation

The criteria selected to compare the different designswere as follows FPGA resource consumption in terms of thediscrete elements FF and LUT computation time (latency andinitiation interval) and maximum clock frequency (119891CLK)

The VHDL serial Jacobi algorithm features a fully cus-tomized system using two ad hoc CORDIC modules one ofwhich works in rotation and vectoring mode while the otheris optimized to work only in rotation mode

The Jacobi module used in theMUSIC algorithm featuresa semiparallel architecture based on the Brent and Lukproposal [12] and also uses the CORDIC algorithm for thecalculations However it does not implement the full systolicarray only making use of a subset of this to save resources

Turning first to the two Jacobi algorithm implementationsproposed here it is evident that the best performance interms of speed was achieved by the semiparallel architecturewhereas the serial architecture only used half of the resourcesrequired to implement the parallel one

Vivado HLS library

Clock frequency (MHz)

VHDL (Bravo et al

2008)QR

Parallel Jacobi

Serial Jacobi

0 50 100 150 200 250 300

Figure 7 Maximum clock frequency for all the designs when inputmatrix size is fixed to 8 times 8

The other Vivado HLS implementation considered wasthe svd function from the Vivado HLS library which canonly be implemented in single precision floating point for-mat Resource usage is not dissimilar to our Vivado HLSimplementations however it is slower and makes use of 58DSP48E blocks

The difference arises from the use of floating point coreswhich are implemented with DSP units while the fixedpoint CORDIC approach employed in the present study onlyrequired two DSP48E multipliers per unit

As previously mentioned a floating point QR algorithmwas implemented in the same FPGAdevice also usingVivadoHLS The main purpose of this was to determine whetherother algorithms presented greater computational efficiencythan the Jacobi method Although we used different codingsystems in the Jacobi and QR algorithms respectively ananalysis of both implementations revealed that QR was lesssuitable for a hardware implementation since the operationsperformed can vary between different iterations

Nevertheless we found that although the performanceof the algorithm in terms of execution time was poor it didnot use a large amount of FPGA resources and it thereforerepresents an alternative to consider when floating pointcodification is required

Looking next at the eigen solver implemented as part ofthe MUSIC algorithm it can be seen that it did not performvery well On paper it is similar to the semiparallel architec-ture presented here but the control logic was designed byhand This shows that HLS performs better than traditionalhardware design flows when control logic becomes toocomplicated since it is implemented automatically

Our final comparison was with the VHDL serial architec-ture in which only two CORDIC modules were used Sinceeverything in this design was customized for the applicationtiming performance was similar to HLS designs but it usedfar fewer resources

As regards the maximum clock frequency at which thedesigns can work Figure 7 shows a comparison of all thealternatives discussed

On the one hand we have VivadoHLS designs where thedifference between floating point and fixed point is evidencedby the resultsWhile floating point designs can easily go above200MHz fixed point designs stay around 120MHz

10 Mathematical Problems in Engineering

On the other hand we find theVHDLdesign In this casethe performance of the design proposed in [1] was similar tothe floating point designs since it was hand-coded at a verylow optimization level

6 Conclusions

In this paper we have described a high level synthesisimplementation of the Jacobi algorithm for eigenvalue andeigenvector calculations and this has been compared tosimilar or related systems reported in the literature

The main purpose of using a HLS tool was to try toachieve a similar performance to that of a customRTL systemcoded in for example VHDL in a fraction of the design timerequired

In recent years the amount of resources available inFPGA devices has grown exponentially and thus rapidity isbecoming more important than efficiency when implement-ing computational algorithms

Furthermore HLS tools have evolved making it possibleto achieve good systems that are easy to integrate into largerdesigns as hardware coprocessors together with a CPU or asa processing block in regular RTL designs

As final paper conclusion we can say that the use of highlevel implementation tools reduce the design time In mostcases the designs are initially coded simulated and validatedwith high level languages where there exist standard librariesand reference code available Manual RTL coding is notusually an easy step because there is not a straight way totranslate the functional algorithm to RTL description Inaddition the required level of RTL expertise is very high

Another inconvenience for RTL-based designs appearswhen new changes have to be included in the algorithmTheRTL structure is not very flexible and any modification to thealgorithm requires a lot of time (simulation and recoding)Nevertheless RTL designs generate hardware reconfigurableapproaches with less resources and higher clock frequencythan high level hardware designs (see Figures 6 and 7)

The developed algorithm in this work was initially madewith a traditional RTL flow [1] As a result we can comparethe results of a complex algorithm such as Jacobi based onRTL orHLS flows According to Figures 6 and 7 the achievedresults for the HLS methodology compared with RTL-baseddesign are favorably from different points of view

We have shown that the design process varies from thetraditional RTL flow since it is possible to iterate on the samecode to obtain different hardware architectures and performparametric simulations of the hardware in order to obtain thebest results

Therefore the designer should balance what are the priorrequirements in the system clock frequencyresources ordesign time and choose the methodology (RTLHLS) thatbest fit for that particular case

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgment

This work has been supported by the Spanish Ministry ofEconomy and Competitiveness through the ALCOR Project(DPI2013-47347-C2-1-R)

References

[1] I Bravo M Mazo J L Lazaro P Jimenez A Gardel and MMarron ldquoNovel HW architecture based on FPGAs oriented tosolve the eigen problemrdquo IEEE Transactions on Very Large ScaleIntegration (VLSI) Systems vol 16 no 12 pp 1722ndash1725 2008

[2] H Gao and R Ma ldquoOn a system modelling a population withtwo age groupsrdquoAbstract andApplied Analysis vol 2014 ArticleID 920348 5 pages 2014

[3] X Tang W Deng and Z Qi ldquoInvestigation of the dynamicstability of microgridrdquo IEEE Transactions on Power Systems vol29 no 2 pp 698ndash706 2014

[4] S Wu J Zhang Y Hou and X Bai ldquoConvergence of gossipalgorithms for consensus in wireless sensor networks withintermittent links andmobile nodesrdquoMathematical Problems inEngineering vol 2014 Article ID 836584 18 pages 2014

[5] J A Jimenez M Mazo J Urena et al ldquoUsing PCA in time-of-flight vectors for reflector recognition and 3-D localizationrdquoIEEE Transactions on Robotics vol 21 no 5 pp 909ndash924 2005

[6] W Haiqing S Zhihuan and L Ping ldquoImproved PCA withoptimized sensor locations for process monitoring and faultdiagnosisrdquo inProceedings of the 39th IEEEConfernce onDecisionand Control vol 5 pp 4353ndash4358 December 2000

[7] A A S Ali A Amira F Bensaali andM Benammar ldquoPCA IP-core for gas applications on the heterogenous zynq platformrdquoin Proceedings of the 25th International Conference onMicroelec-tronics (ICM rsquo13) pp 1ndash4 December 2013

[8] D Xu Z Liu X Qi Y Xu and Y Zeng ldquoA FPGA-basedimplementation of MUSIC for centrosymmetric circular arrayrdquoin Proceedings of the 9th International Conference on SignalProcessing (ICSP rsquo08) pp 490ndash493 October 2008

[9] J HWilkinsonTheAlgebraic Eigenvalue Problem Monographson Numerical Analysis Clarendon Press New York NY USA1988

[10] S Sundar and B K Bhagavan ldquoGeneralized eigenvalue prob-lems Lanczos algorithm with a recursive partitioning methodrdquoComputers ampMathematics with Applications vol 39 no 7-8 pp211ndash224 2000

[11] D P OrsquoLeary and P Whitman ldquoParallel QR factorization byhouseholder and modified Gram-Schmidt algorithmsrdquo ParallelComputing vol 16 no 1 pp 99ndash112 1990

[12] R P Brent F T Luk and C van Loan ldquoComputation ofthe generalized singular value decomposition using mesh-connected processorsrdquo Journal of VLSI and Computer Systemsvol 1 no 3 pp 242ndash270 1983

[13] J E Volder ldquoThe CORDIC trigonometric computing tech-niquerdquo IRE Transactions on Electronic Computers vol EC-8 no3 pp 330ndash334 1959

[14] J R Cavallaro and F T Luk ldquoCORDIC arithmetic for an SVDprocessorrdquo Journal of Parallel and Distributed Computing vol 5no 3 pp 271ndash290 1988

[15] G Lucius F le Roy D Aulagnier and S Azou ldquoAn algorithmfor extremal eigenvectors computation of hermitian matricesand its FPGA implementationrdquo in Proceedings of the IEEE56th International Midwest Symposium on Circuits and Systems(MWSCAS rsquo13) pp 1407ndash1410 August 2013

Mathematical Problems in Engineering 11

[16] Impulse C httpwwwimpulseacceleratedcomtoolshtml[17] Catapult c httpcalyptocomenproductscatapultoverview[18] Vivadohlsxilinx web page httpwwwxilinxcomproducts

designtoolsvivadointegrationesl-designhtm[19] R P Brent and F T Luk ldquoThe solution of singular-value

and symmetric eigenvalue problems on multiprocessor arraysrdquoSociety for Industrial and Applied Mathematics vol 6 no 1 pp69ndash84 1985

[20] C-C Sun J Gotze andG E Jan ldquoParallel Jacobi EVDmethodson integrated circuitsrdquoVLSIDesign vol 2014 Article ID 5961039 pages 2014

[21] Y Liu C-S Bouganis and P Y K Cheung ldquoHardware architec-tures for eigenvalue computation of real symmetric matricesrdquoIET Computers and Digital Techniques vol 3 no 1 pp 72ndash842009

[22] W Ford Numerical Linear Algebra with Applications UsingMATLAB Academic Press New York NY USA 1st edition2014

[23] S W Kang and S N Atluri ldquoApplication of nondimensionaldynamic influence function method for eigenmode analysisof two-dimensional acoustic cavitiesrdquo Advances in MechanicalEngineering vol 2014 Article ID 363570 9 pages 2014

[24] A Bernal R Miro D Ginestar and G Verdu ldquoResolution ofthe generalized eigenvalue problem in the neutron diffusionequation discretized by the finite volumemethodrdquoAbstract andApplied Analysis vol 2014 Article ID 913043 15 pages 2014

[25] W Gafsi F Najar S Choura and S El-Borgi ldquoConfinement ofvibrations in variable-geometry nonlinear flexible beamrdquo Shockand Vibration vol 2014 Article ID 687340 7 pages 2014

[26] G H Golub and C F Van Loan Matrix Computations JohnsHopkins Studies in the Mathematical Sciences Johns HopkinsUniversity Press Baltimore Md USA 4th edition 2013

[27] J S Walther A Unified Algorithm for Elementary Functions1972

[28] B Lakshmi and A S Dhar ldquoCORDIC architectures a surveyrdquoVLSI Design vol 2010 Article ID 794891 19 pages 2010

[29] J Faires and R L Burden Numerical Methods Brooks ColePublishing Company 2012

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 6: Research Article High Level Synthesis FPGA Implementation ...downloads.hindawi.com/journals/mpe/2015/870569.pdf · High Level Synthesis FPGA Implementation of the Jacobi Algorithm

6 Mathematical Problems in Engineering

S A

V Intimesnk = 1

120572(k)j = 05tans(k)2jminus12j

s(k)2j2j minus s(k)2jminus12jminus1

Sk+1 Sk+1aux Vk+1 Vk+1aux

(k+1) = V(k)ij middot R(120572j)

i = 1 n2 j = 1 n2 j = i n2i = 1 n2

(k+1) = R(120572i)Tmiddot S(k)ij middot R(120572j)

k = h

(a)

(b)

(c2)

(d)

(c1)

Result rearrange

No

[ ]

Yes

A Rntimesnisin

Vaux 119894119895 Saux 119894119895

larrminus

larrminus

larrminus larrminus

rarr

rarrVlarrminus

1205821 120582n diag(S)

lceilrceil lceilrceil

larrminus

1205821 120582n

j = 1 n2

Figure 1 Jacobi algorithm functional diagram

semiparallel and serial architecture was limited to differentvalues allowing Vivado HLS to redesign the RTL implemen-tation and subsequently analyzing the design performanceobtained

First we analyze the case of the semiparallel architectureSince 8 times 8 is a common size in the literature for input matri-ces the design was optimized for this matrix size extendingthe conclusions reached to other matrix sizes The resultsobtained (resource consumption and timing performance)are shown in Figure 2

An analysis of timing performance indicated thatalthough latency did not vary very much the initiationinterval (II) was considerably reduced when the number ofrotation mode CORDIC instances was increased from oneto four slowly decreasing as more CORDIC modules wereadded

This result was mainly due to data dependencies betweenoperations performed in the Jacobi algorithm As previously

indicated four rotations must be performed for eigenvaluecalculation and the result of the first two is required toperform the last two whereas two independent rotationsmust be performed for eigenvector calculation

From this it can be concluded that four CORDICmodules will greatly increase the performance of the systemWhen timing performance is related to resource usage it isclear that it increased linearly with CORDIC module usageand the slope variation was due to the different control logicimplemented by Vivado with an odd and an even number ofCORDIC modules

The last conclusion that can be reached from Figure 2 isthat there was parallelism between two algorithm iterationssince 119879

119871gt II This means that the first valid result will be

obtained after 119879119871clock cycles but the next results will be

obtained more rapidly after every II clock cyclesThe results obtained for the serial architecture are shown

in Figure 3 This time only the latency is shown since there

Mathematical Problems in Engineering 7

1 2 3 4 5 6 7 8 90

20406080

100

Disc

rete

elem

ents

()

Resource usage

LUTFF

1 2 3 4 5 6 7 8 90

500

1000

1500

2000

Rotation CORDIC modules

Rotation CORDIC modules

Execution time (clock cycles)

IITL

nmiddotT

CLK

Figure 2 Semiparallel architecture implementation results of theJacobi algorithm for 8 times 8matrices

was no parallelism between two algorithm executions andthey only differed in one clock cycle that the system neededto start the next execution (II = 119879

119871+ 1)

As indicated earlier the execution time between two validresult sets is the same as the latency period (119879

119871) and in

the semiparallel design alternative it evolved similarly to theinitiation interval

For similar reasons there was a greater improvement inlatency in the first fourCORDICmodule additions which didnot decrease at all after sixThe increase in resource usage wasalso linear now with a constant slope although it used fewerresources

4 Assessment of Design Parameters

41 Optimal Number of Jacobi Iterations To evaluate thequality of the results obtained using the Jacobi method weconducted an experimental error analysis to determine howmany Jacobi iterations were required to obtain satisfactoryresults

The word length (WL) selected for the fixed point cod-ification scheme presented in Section 2 was 18 bits with afractional part of 15 bits (2Q15) since Xilinx FPGAmultiplierentry is fixed to this value and the purpose of using theCORDIC algorithm was to minimize resource usage Inaddition as will now be shown this codification providedsufficient precision to avoid overflow and growing roundingerrors

Simulating the system for multiple 8 times 8 real and sym-metric random matrices we calculated the eigenvalue and

1 2 3 4 5 6 7 8 90

1020304050

Disc

rete

elem

ents

Resource usage

1 2 3 4 5 6 7 8 91500

2000

2500

3000

Rotation CORDIC modules

Rotation CORDIC modules

Execution time

LUTFF

TL

nmiddotT

CLK

Figure 3 Serial architecture implementation results of the Jacobialgorithm for 8 times 8matrices

21 215 22 225 23 235 24 245 25 255 260

1020304050

Max

imum

erro

r (

)

Iterations

21 215 22 225 23 235 24 245 25 255 260

02040608

1

Qua

drat

ic m

ean

erro

r (

)

Iterations

Eigenvector component error (rarr

rarr

)1205821 120582m

Figure 4 Eigenvectors maximum and quadratic component error

eigenvector maximum (119890MAX) and quadratic (119890SQRT) errorcomparing FPGA Jacobi fixed point results with MATLABfloating point implementation (eig)

When the results were analyzed we found that the eigen-vector error stabilized later than the eigenvalue error (moreiterations were required) Consequently Figure 4 shows

8 Mathematical Problems in Engineering

the eigenvector component error calculated according to thefollowing expressions

119890 =

1003816100381610038161003816119881HLS1003816100381610038161003816 minus1003816100381610038161003816119881MATLAB

10038161003816100381610038161003816100381610038161003816119881MATLAB

1003816100381610038161003816

sdot 100

119890MAX = max (10038161003816100381610038161198901003816100381610038161003816)

119890SQRT =1

1198962

radic

119896minus1

sum

119894=0

1198902

(20)

where 119896 represents the total number of samples The errorobtained was plotted as a function of the number of Jacobiiterations (ie the number of rotations performed for eachsubmatrix) and was used to determine the optimal numberof iterations (ℎ) that the algorithm must perform to obtainsufficiently accurate results

Our analysis of 8 times 8 matrices indicated that ℎ = 26

was the best number of iterations for both eigenvectors andeigenvalues in order to obtain a sufficiently accurate solutionsince the error was constant from this value on

After performing the same tests for other input matrixsizes it was concluded that in general expression (21) canbe used to determine the optimal number of iterations to beperformed for an 119899 times 119899matrix

ℎ = 9 + 119899 log 119899 (21)

42 SerialParallel Jacobi Analysis Since an expression wasavailable for the optimal number of Jacobi iterations (ℎ) wealso conducted a study of the design performance dependingon initial matrix size

With the parallel architecture we found that althoughthe Vivado HLS generated an RTL design for matrices withan order 119899 gt 10 the Vivado suite synthesizer was unableto implement it successfully However the serial architecturedesign performed well for higher order matrices (20 times 20)

Based on the parametric study performed in Section 3we decided that a good trade-off between execution time andresource usagewould be to use fourCORDICmodules whichwould perform WL minus 1 iterations in both parallel and serialarchitectures where WL was the data word length

Implementation results for the serial architecture areshown in Figure 5 for the Xilinx Artix 7 series (XC7A100T-1CSG324C) device

Turning first to resource usage it can be seen that thisdid not not increase as fast as execution time since whenmatrix size is increased with a fixed amount of CORDICunits only storage elements and control logic will be addedConsequently it becomes evident that in terms of resourcesused the design has an 119874(1198992) complexity

Regarding execution time (119879exe) the Vivado HLS enabledus to examine the sequence of operations performed by thedesign in each clock cycle An analysis of this informationindicated that the design latency which corresponds toexecution time in the serial architecture follows the followingexpression

119879119871= 119879load + 119899

2

sdot ℎ (22)

8 9 10 11 12 13 14 15 1605

1

15

2Used resources

FF an

d LU

T (

)

8 9 10 11 12 13 14 15 160

50

100

150

FFLUT

Input matrix size (n)

Input matrix size (n)

Execution time with 100MHz

times104

Tex

e(120583

s)

fCLK =

Figure 5 Serial Jacobi design performance with matrix orderincrease

where 119899 is the size of the processed matrix and ℎ is thenumber of Jacobi iterations performed following expression(21) Meanwhile 119879load (23) corresponds to the time requiredto read sufficient data from an external memory to start thefirst Jacobi iteration and 1198992 is the amount of clock cyclesrequired to perform a Jacobi iteration

119879load =1198992

2+119899

2+ 4 (23)

A similar study was performed for different numbersof rotation CORDIC units which revealed that the time toperform a Jacobi iteration varied However it remained afunction of 1198992 so system latency (119879

119871) was still related to the

size of the inputmatrix as shown in the following expression

119879119871= 119879load + 119879119869 sdot ℎ (24)

where 119879119869= 119891(119899

2

) is the time required to perform a Jacobiiteration

Regarding the maximum size (119899) of the input matrix weassumed that given the serial nature of the system resourceusage would be the limit however we found that for 119899 gt 20the Vivado HLS took a very long time to guess the controllogic and implementation eventually failed Despite thisthe design has been coded in such a way that input matrixsize and the codification scheme can be modified simply bychanging a few constants which is a vast improvement overthe classic RTL designs

Mathematical Problems in Engineering 9

Design comparison

11000 23000 34000 45000LUT

8500

17000

26000

34000

FF

400080001200016000

4000

8000

12000

16000

Semiparallel JacobiSerial JacobiFloating point QRMUSIC (Xu et al 2008)VHDL (Bravo et al 2008)Vivado HLS library

TL(nTCLK)

II(nT

CLK

)

Figure 6 Comparative between different design alternatives

5 Comparison with Other Proposals

The results obtained were also compared with those reportedin other studies of 8 times 8 matrices since these have receivedmuch research attention

Figure 6 shows the implementation results for the twoarchitectures proposed here (serial and parallel) a Jacobiserial architecture carried out in VHDL [1] a Jacobi semi-parallel architecture from the multiple signal classification(MUSIC) algorithm [8] and the Vivado HLS linear algebralibrary svd implementation

The criteria selected to compare the different designswere as follows FPGA resource consumption in terms of thediscrete elements FF and LUT computation time (latency andinitiation interval) and maximum clock frequency (119891CLK)

The VHDL serial Jacobi algorithm features a fully cus-tomized system using two ad hoc CORDIC modules one ofwhich works in rotation and vectoring mode while the otheris optimized to work only in rotation mode

The Jacobi module used in theMUSIC algorithm featuresa semiparallel architecture based on the Brent and Lukproposal [12] and also uses the CORDIC algorithm for thecalculations However it does not implement the full systolicarray only making use of a subset of this to save resources

Turning first to the two Jacobi algorithm implementationsproposed here it is evident that the best performance interms of speed was achieved by the semiparallel architecturewhereas the serial architecture only used half of the resourcesrequired to implement the parallel one

Vivado HLS library

Clock frequency (MHz)

VHDL (Bravo et al

2008)QR

Parallel Jacobi

Serial Jacobi

0 50 100 150 200 250 300

Figure 7 Maximum clock frequency for all the designs when inputmatrix size is fixed to 8 times 8

The other Vivado HLS implementation considered wasthe svd function from the Vivado HLS library which canonly be implemented in single precision floating point for-mat Resource usage is not dissimilar to our Vivado HLSimplementations however it is slower and makes use of 58DSP48E blocks

The difference arises from the use of floating point coreswhich are implemented with DSP units while the fixedpoint CORDIC approach employed in the present study onlyrequired two DSP48E multipliers per unit

As previously mentioned a floating point QR algorithmwas implemented in the same FPGAdevice also usingVivadoHLS The main purpose of this was to determine whetherother algorithms presented greater computational efficiencythan the Jacobi method Although we used different codingsystems in the Jacobi and QR algorithms respectively ananalysis of both implementations revealed that QR was lesssuitable for a hardware implementation since the operationsperformed can vary between different iterations

Nevertheless we found that although the performanceof the algorithm in terms of execution time was poor it didnot use a large amount of FPGA resources and it thereforerepresents an alternative to consider when floating pointcodification is required

Looking next at the eigen solver implemented as part ofthe MUSIC algorithm it can be seen that it did not performvery well On paper it is similar to the semiparallel architec-ture presented here but the control logic was designed byhand This shows that HLS performs better than traditionalhardware design flows when control logic becomes toocomplicated since it is implemented automatically

Our final comparison was with the VHDL serial architec-ture in which only two CORDIC modules were used Sinceeverything in this design was customized for the applicationtiming performance was similar to HLS designs but it usedfar fewer resources

As regards the maximum clock frequency at which thedesigns can work Figure 7 shows a comparison of all thealternatives discussed

On the one hand we have VivadoHLS designs where thedifference between floating point and fixed point is evidencedby the resultsWhile floating point designs can easily go above200MHz fixed point designs stay around 120MHz

10 Mathematical Problems in Engineering

On the other hand we find theVHDLdesign In this casethe performance of the design proposed in [1] was similar tothe floating point designs since it was hand-coded at a verylow optimization level

6 Conclusions

In this paper we have described a high level synthesisimplementation of the Jacobi algorithm for eigenvalue andeigenvector calculations and this has been compared tosimilar or related systems reported in the literature

The main purpose of using a HLS tool was to try toachieve a similar performance to that of a customRTL systemcoded in for example VHDL in a fraction of the design timerequired

In recent years the amount of resources available inFPGA devices has grown exponentially and thus rapidity isbecoming more important than efficiency when implement-ing computational algorithms

Furthermore HLS tools have evolved making it possibleto achieve good systems that are easy to integrate into largerdesigns as hardware coprocessors together with a CPU or asa processing block in regular RTL designs

As final paper conclusion we can say that the use of highlevel implementation tools reduce the design time In mostcases the designs are initially coded simulated and validatedwith high level languages where there exist standard librariesand reference code available Manual RTL coding is notusually an easy step because there is not a straight way totranslate the functional algorithm to RTL description Inaddition the required level of RTL expertise is very high

Another inconvenience for RTL-based designs appearswhen new changes have to be included in the algorithmTheRTL structure is not very flexible and any modification to thealgorithm requires a lot of time (simulation and recoding)Nevertheless RTL designs generate hardware reconfigurableapproaches with less resources and higher clock frequencythan high level hardware designs (see Figures 6 and 7)

The developed algorithm in this work was initially madewith a traditional RTL flow [1] As a result we can comparethe results of a complex algorithm such as Jacobi based onRTL orHLS flows According to Figures 6 and 7 the achievedresults for the HLS methodology compared with RTL-baseddesign are favorably from different points of view

We have shown that the design process varies from thetraditional RTL flow since it is possible to iterate on the samecode to obtain different hardware architectures and performparametric simulations of the hardware in order to obtain thebest results

Therefore the designer should balance what are the priorrequirements in the system clock frequencyresources ordesign time and choose the methodology (RTLHLS) thatbest fit for that particular case

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgment

This work has been supported by the Spanish Ministry ofEconomy and Competitiveness through the ALCOR Project(DPI2013-47347-C2-1-R)

References

[1] I Bravo M Mazo J L Lazaro P Jimenez A Gardel and MMarron ldquoNovel HW architecture based on FPGAs oriented tosolve the eigen problemrdquo IEEE Transactions on Very Large ScaleIntegration (VLSI) Systems vol 16 no 12 pp 1722ndash1725 2008

[2] H Gao and R Ma ldquoOn a system modelling a population withtwo age groupsrdquoAbstract andApplied Analysis vol 2014 ArticleID 920348 5 pages 2014

[3] X Tang W Deng and Z Qi ldquoInvestigation of the dynamicstability of microgridrdquo IEEE Transactions on Power Systems vol29 no 2 pp 698ndash706 2014

[4] S Wu J Zhang Y Hou and X Bai ldquoConvergence of gossipalgorithms for consensus in wireless sensor networks withintermittent links andmobile nodesrdquoMathematical Problems inEngineering vol 2014 Article ID 836584 18 pages 2014

[5] J A Jimenez M Mazo J Urena et al ldquoUsing PCA in time-of-flight vectors for reflector recognition and 3-D localizationrdquoIEEE Transactions on Robotics vol 21 no 5 pp 909ndash924 2005

[6] W Haiqing S Zhihuan and L Ping ldquoImproved PCA withoptimized sensor locations for process monitoring and faultdiagnosisrdquo inProceedings of the 39th IEEEConfernce onDecisionand Control vol 5 pp 4353ndash4358 December 2000

[7] A A S Ali A Amira F Bensaali andM Benammar ldquoPCA IP-core for gas applications on the heterogenous zynq platformrdquoin Proceedings of the 25th International Conference onMicroelec-tronics (ICM rsquo13) pp 1ndash4 December 2013

[8] D Xu Z Liu X Qi Y Xu and Y Zeng ldquoA FPGA-basedimplementation of MUSIC for centrosymmetric circular arrayrdquoin Proceedings of the 9th International Conference on SignalProcessing (ICSP rsquo08) pp 490ndash493 October 2008

[9] J HWilkinsonTheAlgebraic Eigenvalue Problem Monographson Numerical Analysis Clarendon Press New York NY USA1988

[10] S Sundar and B K Bhagavan ldquoGeneralized eigenvalue prob-lems Lanczos algorithm with a recursive partitioning methodrdquoComputers ampMathematics with Applications vol 39 no 7-8 pp211ndash224 2000

[11] D P OrsquoLeary and P Whitman ldquoParallel QR factorization byhouseholder and modified Gram-Schmidt algorithmsrdquo ParallelComputing vol 16 no 1 pp 99ndash112 1990

[12] R P Brent F T Luk and C van Loan ldquoComputation ofthe generalized singular value decomposition using mesh-connected processorsrdquo Journal of VLSI and Computer Systemsvol 1 no 3 pp 242ndash270 1983

[13] J E Volder ldquoThe CORDIC trigonometric computing tech-niquerdquo IRE Transactions on Electronic Computers vol EC-8 no3 pp 330ndash334 1959

[14] J R Cavallaro and F T Luk ldquoCORDIC arithmetic for an SVDprocessorrdquo Journal of Parallel and Distributed Computing vol 5no 3 pp 271ndash290 1988

[15] G Lucius F le Roy D Aulagnier and S Azou ldquoAn algorithmfor extremal eigenvectors computation of hermitian matricesand its FPGA implementationrdquo in Proceedings of the IEEE56th International Midwest Symposium on Circuits and Systems(MWSCAS rsquo13) pp 1407ndash1410 August 2013

Mathematical Problems in Engineering 11

[16] Impulse C httpwwwimpulseacceleratedcomtoolshtml[17] Catapult c httpcalyptocomenproductscatapultoverview[18] Vivadohlsxilinx web page httpwwwxilinxcomproducts

designtoolsvivadointegrationesl-designhtm[19] R P Brent and F T Luk ldquoThe solution of singular-value

and symmetric eigenvalue problems on multiprocessor arraysrdquoSociety for Industrial and Applied Mathematics vol 6 no 1 pp69ndash84 1985

[20] C-C Sun J Gotze andG E Jan ldquoParallel Jacobi EVDmethodson integrated circuitsrdquoVLSIDesign vol 2014 Article ID 5961039 pages 2014

[21] Y Liu C-S Bouganis and P Y K Cheung ldquoHardware architec-tures for eigenvalue computation of real symmetric matricesrdquoIET Computers and Digital Techniques vol 3 no 1 pp 72ndash842009

[22] W Ford Numerical Linear Algebra with Applications UsingMATLAB Academic Press New York NY USA 1st edition2014

[23] S W Kang and S N Atluri ldquoApplication of nondimensionaldynamic influence function method for eigenmode analysisof two-dimensional acoustic cavitiesrdquo Advances in MechanicalEngineering vol 2014 Article ID 363570 9 pages 2014

[24] A Bernal R Miro D Ginestar and G Verdu ldquoResolution ofthe generalized eigenvalue problem in the neutron diffusionequation discretized by the finite volumemethodrdquoAbstract andApplied Analysis vol 2014 Article ID 913043 15 pages 2014

[25] W Gafsi F Najar S Choura and S El-Borgi ldquoConfinement ofvibrations in variable-geometry nonlinear flexible beamrdquo Shockand Vibration vol 2014 Article ID 687340 7 pages 2014

[26] G H Golub and C F Van Loan Matrix Computations JohnsHopkins Studies in the Mathematical Sciences Johns HopkinsUniversity Press Baltimore Md USA 4th edition 2013

[27] J S Walther A Unified Algorithm for Elementary Functions1972

[28] B Lakshmi and A S Dhar ldquoCORDIC architectures a surveyrdquoVLSI Design vol 2010 Article ID 794891 19 pages 2010

[29] J Faires and R L Burden Numerical Methods Brooks ColePublishing Company 2012

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 7: Research Article High Level Synthesis FPGA Implementation ...downloads.hindawi.com/journals/mpe/2015/870569.pdf · High Level Synthesis FPGA Implementation of the Jacobi Algorithm

Mathematical Problems in Engineering 7

1 2 3 4 5 6 7 8 90

20406080

100

Disc

rete

elem

ents

()

Resource usage

LUTFF

1 2 3 4 5 6 7 8 90

500

1000

1500

2000

Rotation CORDIC modules

Rotation CORDIC modules

Execution time (clock cycles)

IITL

nmiddotT

CLK

Figure 2 Semiparallel architecture implementation results of theJacobi algorithm for 8 times 8matrices

was no parallelism between two algorithm executions andthey only differed in one clock cycle that the system neededto start the next execution (II = 119879

119871+ 1)

As indicated earlier the execution time between two validresult sets is the same as the latency period (119879

119871) and in

the semiparallel design alternative it evolved similarly to theinitiation interval

For similar reasons there was a greater improvement inlatency in the first fourCORDICmodule additions which didnot decrease at all after sixThe increase in resource usage wasalso linear now with a constant slope although it used fewerresources

4 Assessment of Design Parameters

41 Optimal Number of Jacobi Iterations To evaluate thequality of the results obtained using the Jacobi method weconducted an experimental error analysis to determine howmany Jacobi iterations were required to obtain satisfactoryresults

The word length (WL) selected for the fixed point cod-ification scheme presented in Section 2 was 18 bits with afractional part of 15 bits (2Q15) since Xilinx FPGAmultiplierentry is fixed to this value and the purpose of using theCORDIC algorithm was to minimize resource usage Inaddition as will now be shown this codification providedsufficient precision to avoid overflow and growing roundingerrors

Simulating the system for multiple 8 times 8 real and sym-metric random matrices we calculated the eigenvalue and

1 2 3 4 5 6 7 8 90

1020304050

Disc

rete

elem

ents

Resource usage

1 2 3 4 5 6 7 8 91500

2000

2500

3000

Rotation CORDIC modules

Rotation CORDIC modules

Execution time

LUTFF

TL

nmiddotT

CLK

Figure 3 Serial architecture implementation results of the Jacobialgorithm for 8 times 8matrices

21 215 22 225 23 235 24 245 25 255 260

1020304050

Max

imum

erro

r (

)

Iterations

21 215 22 225 23 235 24 245 25 255 260

02040608

1

Qua

drat

ic m

ean

erro

r (

)

Iterations

Eigenvector component error (rarr

rarr

)1205821 120582m

Figure 4 Eigenvectors maximum and quadratic component error

eigenvector maximum (119890MAX) and quadratic (119890SQRT) errorcomparing FPGA Jacobi fixed point results with MATLABfloating point implementation (eig)

When the results were analyzed we found that the eigen-vector error stabilized later than the eigenvalue error (moreiterations were required) Consequently Figure 4 shows

8 Mathematical Problems in Engineering

the eigenvector component error calculated according to thefollowing expressions

119890 =

1003816100381610038161003816119881HLS1003816100381610038161003816 minus1003816100381610038161003816119881MATLAB

10038161003816100381610038161003816100381610038161003816119881MATLAB

1003816100381610038161003816

sdot 100

119890MAX = max (10038161003816100381610038161198901003816100381610038161003816)

119890SQRT =1

1198962

radic

119896minus1

sum

119894=0

1198902

(20)

where 119896 represents the total number of samples The errorobtained was plotted as a function of the number of Jacobiiterations (ie the number of rotations performed for eachsubmatrix) and was used to determine the optimal numberof iterations (ℎ) that the algorithm must perform to obtainsufficiently accurate results

Our analysis of 8 times 8 matrices indicated that ℎ = 26

was the best number of iterations for both eigenvectors andeigenvalues in order to obtain a sufficiently accurate solutionsince the error was constant from this value on

After performing the same tests for other input matrixsizes it was concluded that in general expression (21) canbe used to determine the optimal number of iterations to beperformed for an 119899 times 119899matrix

ℎ = 9 + 119899 log 119899 (21)

42 SerialParallel Jacobi Analysis Since an expression wasavailable for the optimal number of Jacobi iterations (ℎ) wealso conducted a study of the design performance dependingon initial matrix size

With the parallel architecture we found that althoughthe Vivado HLS generated an RTL design for matrices withan order 119899 gt 10 the Vivado suite synthesizer was unableto implement it successfully However the serial architecturedesign performed well for higher order matrices (20 times 20)

Based on the parametric study performed in Section 3we decided that a good trade-off between execution time andresource usagewould be to use fourCORDICmodules whichwould perform WL minus 1 iterations in both parallel and serialarchitectures where WL was the data word length

Implementation results for the serial architecture areshown in Figure 5 for the Xilinx Artix 7 series (XC7A100T-1CSG324C) device

Turning first to resource usage it can be seen that thisdid not not increase as fast as execution time since whenmatrix size is increased with a fixed amount of CORDICunits only storage elements and control logic will be addedConsequently it becomes evident that in terms of resourcesused the design has an 119874(1198992) complexity

Regarding execution time (119879exe) the Vivado HLS enabledus to examine the sequence of operations performed by thedesign in each clock cycle An analysis of this informationindicated that the design latency which corresponds toexecution time in the serial architecture follows the followingexpression

119879119871= 119879load + 119899

2

sdot ℎ (22)

8 9 10 11 12 13 14 15 1605

1

15

2Used resources

FF an

d LU

T (

)

8 9 10 11 12 13 14 15 160

50

100

150

FFLUT

Input matrix size (n)

Input matrix size (n)

Execution time with 100MHz

times104

Tex

e(120583

s)

fCLK =

Figure 5 Serial Jacobi design performance with matrix orderincrease

where 119899 is the size of the processed matrix and ℎ is thenumber of Jacobi iterations performed following expression(21) Meanwhile 119879load (23) corresponds to the time requiredto read sufficient data from an external memory to start thefirst Jacobi iteration and 1198992 is the amount of clock cyclesrequired to perform a Jacobi iteration

119879load =1198992

2+119899

2+ 4 (23)

A similar study was performed for different numbersof rotation CORDIC units which revealed that the time toperform a Jacobi iteration varied However it remained afunction of 1198992 so system latency (119879

119871) was still related to the

size of the inputmatrix as shown in the following expression

119879119871= 119879load + 119879119869 sdot ℎ (24)

where 119879119869= 119891(119899

2

) is the time required to perform a Jacobiiteration

Regarding the maximum size (119899) of the input matrix weassumed that given the serial nature of the system resourceusage would be the limit however we found that for 119899 gt 20the Vivado HLS took a very long time to guess the controllogic and implementation eventually failed Despite thisthe design has been coded in such a way that input matrixsize and the codification scheme can be modified simply bychanging a few constants which is a vast improvement overthe classic RTL designs

Mathematical Problems in Engineering 9

Design comparison

11000 23000 34000 45000LUT

8500

17000

26000

34000

FF

400080001200016000

4000

8000

12000

16000

Semiparallel JacobiSerial JacobiFloating point QRMUSIC (Xu et al 2008)VHDL (Bravo et al 2008)Vivado HLS library

TL(nTCLK)

II(nT

CLK

)

Figure 6 Comparative between different design alternatives

5 Comparison with Other Proposals

The results obtained were also compared with those reportedin other studies of 8 times 8 matrices since these have receivedmuch research attention

Figure 6 shows the implementation results for the twoarchitectures proposed here (serial and parallel) a Jacobiserial architecture carried out in VHDL [1] a Jacobi semi-parallel architecture from the multiple signal classification(MUSIC) algorithm [8] and the Vivado HLS linear algebralibrary svd implementation

The criteria selected to compare the different designswere as follows FPGA resource consumption in terms of thediscrete elements FF and LUT computation time (latency andinitiation interval) and maximum clock frequency (119891CLK)

The VHDL serial Jacobi algorithm features a fully cus-tomized system using two ad hoc CORDIC modules one ofwhich works in rotation and vectoring mode while the otheris optimized to work only in rotation mode

The Jacobi module used in theMUSIC algorithm featuresa semiparallel architecture based on the Brent and Lukproposal [12] and also uses the CORDIC algorithm for thecalculations However it does not implement the full systolicarray only making use of a subset of this to save resources

Turning first to the two Jacobi algorithm implementationsproposed here it is evident that the best performance interms of speed was achieved by the semiparallel architecturewhereas the serial architecture only used half of the resourcesrequired to implement the parallel one

Vivado HLS library

Clock frequency (MHz)

VHDL (Bravo et al

2008)QR

Parallel Jacobi

Serial Jacobi

0 50 100 150 200 250 300

Figure 7 Maximum clock frequency for all the designs when inputmatrix size is fixed to 8 times 8

The other Vivado HLS implementation considered wasthe svd function from the Vivado HLS library which canonly be implemented in single precision floating point for-mat Resource usage is not dissimilar to our Vivado HLSimplementations however it is slower and makes use of 58DSP48E blocks

The difference arises from the use of floating point coreswhich are implemented with DSP units while the fixedpoint CORDIC approach employed in the present study onlyrequired two DSP48E multipliers per unit

As previously mentioned a floating point QR algorithmwas implemented in the same FPGAdevice also usingVivadoHLS The main purpose of this was to determine whetherother algorithms presented greater computational efficiencythan the Jacobi method Although we used different codingsystems in the Jacobi and QR algorithms respectively ananalysis of both implementations revealed that QR was lesssuitable for a hardware implementation since the operationsperformed can vary between different iterations

Nevertheless we found that although the performanceof the algorithm in terms of execution time was poor it didnot use a large amount of FPGA resources and it thereforerepresents an alternative to consider when floating pointcodification is required

Looking next at the eigen solver implemented as part ofthe MUSIC algorithm it can be seen that it did not performvery well On paper it is similar to the semiparallel architec-ture presented here but the control logic was designed byhand This shows that HLS performs better than traditionalhardware design flows when control logic becomes toocomplicated since it is implemented automatically

Our final comparison was with the VHDL serial architec-ture in which only two CORDIC modules were used Sinceeverything in this design was customized for the applicationtiming performance was similar to HLS designs but it usedfar fewer resources

As regards the maximum clock frequency at which thedesigns can work Figure 7 shows a comparison of all thealternatives discussed

On the one hand we have VivadoHLS designs where thedifference between floating point and fixed point is evidencedby the resultsWhile floating point designs can easily go above200MHz fixed point designs stay around 120MHz

10 Mathematical Problems in Engineering

On the other hand we find theVHDLdesign In this casethe performance of the design proposed in [1] was similar tothe floating point designs since it was hand-coded at a verylow optimization level

6 Conclusions

In this paper we have described a high level synthesisimplementation of the Jacobi algorithm for eigenvalue andeigenvector calculations and this has been compared tosimilar or related systems reported in the literature

The main purpose of using a HLS tool was to try toachieve a similar performance to that of a customRTL systemcoded in for example VHDL in a fraction of the design timerequired

In recent years the amount of resources available inFPGA devices has grown exponentially and thus rapidity isbecoming more important than efficiency when implement-ing computational algorithms

Furthermore HLS tools have evolved making it possibleto achieve good systems that are easy to integrate into largerdesigns as hardware coprocessors together with a CPU or asa processing block in regular RTL designs

As final paper conclusion we can say that the use of highlevel implementation tools reduce the design time In mostcases the designs are initially coded simulated and validatedwith high level languages where there exist standard librariesand reference code available Manual RTL coding is notusually an easy step because there is not a straight way totranslate the functional algorithm to RTL description Inaddition the required level of RTL expertise is very high

Another inconvenience for RTL-based designs appearswhen new changes have to be included in the algorithmTheRTL structure is not very flexible and any modification to thealgorithm requires a lot of time (simulation and recoding)Nevertheless RTL designs generate hardware reconfigurableapproaches with less resources and higher clock frequencythan high level hardware designs (see Figures 6 and 7)

The developed algorithm in this work was initially madewith a traditional RTL flow [1] As a result we can comparethe results of a complex algorithm such as Jacobi based onRTL orHLS flows According to Figures 6 and 7 the achievedresults for the HLS methodology compared with RTL-baseddesign are favorably from different points of view

We have shown that the design process varies from thetraditional RTL flow since it is possible to iterate on the samecode to obtain different hardware architectures and performparametric simulations of the hardware in order to obtain thebest results

Therefore the designer should balance what are the priorrequirements in the system clock frequencyresources ordesign time and choose the methodology (RTLHLS) thatbest fit for that particular case

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgment

This work has been supported by the Spanish Ministry ofEconomy and Competitiveness through the ALCOR Project(DPI2013-47347-C2-1-R)

References

[1] I Bravo M Mazo J L Lazaro P Jimenez A Gardel and MMarron ldquoNovel HW architecture based on FPGAs oriented tosolve the eigen problemrdquo IEEE Transactions on Very Large ScaleIntegration (VLSI) Systems vol 16 no 12 pp 1722ndash1725 2008

[2] H Gao and R Ma ldquoOn a system modelling a population withtwo age groupsrdquoAbstract andApplied Analysis vol 2014 ArticleID 920348 5 pages 2014

[3] X Tang W Deng and Z Qi ldquoInvestigation of the dynamicstability of microgridrdquo IEEE Transactions on Power Systems vol29 no 2 pp 698ndash706 2014

[4] S Wu J Zhang Y Hou and X Bai ldquoConvergence of gossipalgorithms for consensus in wireless sensor networks withintermittent links andmobile nodesrdquoMathematical Problems inEngineering vol 2014 Article ID 836584 18 pages 2014

[5] J A Jimenez M Mazo J Urena et al ldquoUsing PCA in time-of-flight vectors for reflector recognition and 3-D localizationrdquoIEEE Transactions on Robotics vol 21 no 5 pp 909ndash924 2005

[6] W Haiqing S Zhihuan and L Ping ldquoImproved PCA withoptimized sensor locations for process monitoring and faultdiagnosisrdquo inProceedings of the 39th IEEEConfernce onDecisionand Control vol 5 pp 4353ndash4358 December 2000

[7] A A S Ali A Amira F Bensaali andM Benammar ldquoPCA IP-core for gas applications on the heterogenous zynq platformrdquoin Proceedings of the 25th International Conference onMicroelec-tronics (ICM rsquo13) pp 1ndash4 December 2013

[8] D Xu Z Liu X Qi Y Xu and Y Zeng ldquoA FPGA-basedimplementation of MUSIC for centrosymmetric circular arrayrdquoin Proceedings of the 9th International Conference on SignalProcessing (ICSP rsquo08) pp 490ndash493 October 2008

[9] J HWilkinsonTheAlgebraic Eigenvalue Problem Monographson Numerical Analysis Clarendon Press New York NY USA1988

[10] S Sundar and B K Bhagavan ldquoGeneralized eigenvalue prob-lems Lanczos algorithm with a recursive partitioning methodrdquoComputers ampMathematics with Applications vol 39 no 7-8 pp211ndash224 2000

[11] D P OrsquoLeary and P Whitman ldquoParallel QR factorization byhouseholder and modified Gram-Schmidt algorithmsrdquo ParallelComputing vol 16 no 1 pp 99ndash112 1990

[12] R P Brent F T Luk and C van Loan ldquoComputation ofthe generalized singular value decomposition using mesh-connected processorsrdquo Journal of VLSI and Computer Systemsvol 1 no 3 pp 242ndash270 1983

[13] J E Volder ldquoThe CORDIC trigonometric computing tech-niquerdquo IRE Transactions on Electronic Computers vol EC-8 no3 pp 330ndash334 1959

[14] J R Cavallaro and F T Luk ldquoCORDIC arithmetic for an SVDprocessorrdquo Journal of Parallel and Distributed Computing vol 5no 3 pp 271ndash290 1988

[15] G Lucius F le Roy D Aulagnier and S Azou ldquoAn algorithmfor extremal eigenvectors computation of hermitian matricesand its FPGA implementationrdquo in Proceedings of the IEEE56th International Midwest Symposium on Circuits and Systems(MWSCAS rsquo13) pp 1407ndash1410 August 2013

Mathematical Problems in Engineering 11

[16] Impulse C httpwwwimpulseacceleratedcomtoolshtml[17] Catapult c httpcalyptocomenproductscatapultoverview[18] Vivadohlsxilinx web page httpwwwxilinxcomproducts

designtoolsvivadointegrationesl-designhtm[19] R P Brent and F T Luk ldquoThe solution of singular-value

and symmetric eigenvalue problems on multiprocessor arraysrdquoSociety for Industrial and Applied Mathematics vol 6 no 1 pp69ndash84 1985

[20] C-C Sun J Gotze andG E Jan ldquoParallel Jacobi EVDmethodson integrated circuitsrdquoVLSIDesign vol 2014 Article ID 5961039 pages 2014

[21] Y Liu C-S Bouganis and P Y K Cheung ldquoHardware architec-tures for eigenvalue computation of real symmetric matricesrdquoIET Computers and Digital Techniques vol 3 no 1 pp 72ndash842009

[22] W Ford Numerical Linear Algebra with Applications UsingMATLAB Academic Press New York NY USA 1st edition2014

[23] S W Kang and S N Atluri ldquoApplication of nondimensionaldynamic influence function method for eigenmode analysisof two-dimensional acoustic cavitiesrdquo Advances in MechanicalEngineering vol 2014 Article ID 363570 9 pages 2014

[24] A Bernal R Miro D Ginestar and G Verdu ldquoResolution ofthe generalized eigenvalue problem in the neutron diffusionequation discretized by the finite volumemethodrdquoAbstract andApplied Analysis vol 2014 Article ID 913043 15 pages 2014

[25] W Gafsi F Najar S Choura and S El-Borgi ldquoConfinement ofvibrations in variable-geometry nonlinear flexible beamrdquo Shockand Vibration vol 2014 Article ID 687340 7 pages 2014

[26] G H Golub and C F Van Loan Matrix Computations JohnsHopkins Studies in the Mathematical Sciences Johns HopkinsUniversity Press Baltimore Md USA 4th edition 2013

[27] J S Walther A Unified Algorithm for Elementary Functions1972

[28] B Lakshmi and A S Dhar ldquoCORDIC architectures a surveyrdquoVLSI Design vol 2010 Article ID 794891 19 pages 2010

[29] J Faires and R L Burden Numerical Methods Brooks ColePublishing Company 2012

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 8: Research Article High Level Synthesis FPGA Implementation ...downloads.hindawi.com/journals/mpe/2015/870569.pdf · High Level Synthesis FPGA Implementation of the Jacobi Algorithm

8 Mathematical Problems in Engineering

the eigenvector component error calculated according to thefollowing expressions

119890 =

1003816100381610038161003816119881HLS1003816100381610038161003816 minus1003816100381610038161003816119881MATLAB

10038161003816100381610038161003816100381610038161003816119881MATLAB

1003816100381610038161003816

sdot 100

119890MAX = max (10038161003816100381610038161198901003816100381610038161003816)

119890SQRT =1

1198962

radic

119896minus1

sum

119894=0

1198902

(20)

where 119896 represents the total number of samples The errorobtained was plotted as a function of the number of Jacobiiterations (ie the number of rotations performed for eachsubmatrix) and was used to determine the optimal numberof iterations (ℎ) that the algorithm must perform to obtainsufficiently accurate results

Our analysis of 8 times 8 matrices indicated that ℎ = 26

was the best number of iterations for both eigenvectors andeigenvalues in order to obtain a sufficiently accurate solutionsince the error was constant from this value on

After performing the same tests for other input matrixsizes it was concluded that in general expression (21) canbe used to determine the optimal number of iterations to beperformed for an 119899 times 119899matrix

ℎ = 9 + 119899 log 119899 (21)

42 SerialParallel Jacobi Analysis Since an expression wasavailable for the optimal number of Jacobi iterations (ℎ) wealso conducted a study of the design performance dependingon initial matrix size

With the parallel architecture we found that althoughthe Vivado HLS generated an RTL design for matrices withan order 119899 gt 10 the Vivado suite synthesizer was unableto implement it successfully However the serial architecturedesign performed well for higher order matrices (20 times 20)

Based on the parametric study performed in Section 3we decided that a good trade-off between execution time andresource usagewould be to use fourCORDICmodules whichwould perform WL minus 1 iterations in both parallel and serialarchitectures where WL was the data word length

Implementation results for the serial architecture areshown in Figure 5 for the Xilinx Artix 7 series (XC7A100T-1CSG324C) device

Turning first to resource usage it can be seen that thisdid not not increase as fast as execution time since whenmatrix size is increased with a fixed amount of CORDICunits only storage elements and control logic will be addedConsequently it becomes evident that in terms of resourcesused the design has an 119874(1198992) complexity

Regarding execution time (119879exe) the Vivado HLS enabledus to examine the sequence of operations performed by thedesign in each clock cycle An analysis of this informationindicated that the design latency which corresponds toexecution time in the serial architecture follows the followingexpression

119879119871= 119879load + 119899

2

sdot ℎ (22)

8 9 10 11 12 13 14 15 1605

1

15

2Used resources

FF an

d LU

T (

)

8 9 10 11 12 13 14 15 160

50

100

150

FFLUT

Input matrix size (n)

Input matrix size (n)

Execution time with 100MHz

times104

Tex

e(120583

s)

fCLK =

Figure 5 Serial Jacobi design performance with matrix orderincrease

where 119899 is the size of the processed matrix and ℎ is thenumber of Jacobi iterations performed following expression(21) Meanwhile 119879load (23) corresponds to the time requiredto read sufficient data from an external memory to start thefirst Jacobi iteration and 1198992 is the amount of clock cyclesrequired to perform a Jacobi iteration

119879load =1198992

2+119899

2+ 4 (23)

A similar study was performed for different numbersof rotation CORDIC units which revealed that the time toperform a Jacobi iteration varied However it remained afunction of 1198992 so system latency (119879

119871) was still related to the

size of the inputmatrix as shown in the following expression

119879119871= 119879load + 119879119869 sdot ℎ (24)

where 119879119869= 119891(119899

2

) is the time required to perform a Jacobiiteration

Regarding the maximum size (119899) of the input matrix weassumed that given the serial nature of the system resourceusage would be the limit however we found that for 119899 gt 20the Vivado HLS took a very long time to guess the controllogic and implementation eventually failed Despite thisthe design has been coded in such a way that input matrixsize and the codification scheme can be modified simply bychanging a few constants which is a vast improvement overthe classic RTL designs

Mathematical Problems in Engineering 9

Design comparison

11000 23000 34000 45000LUT

8500

17000

26000

34000

FF

400080001200016000

4000

8000

12000

16000

Semiparallel JacobiSerial JacobiFloating point QRMUSIC (Xu et al 2008)VHDL (Bravo et al 2008)Vivado HLS library

TL(nTCLK)

II(nT

CLK

)

Figure 6 Comparative between different design alternatives

5 Comparison with Other Proposals

The results obtained were also compared with those reportedin other studies of 8 times 8 matrices since these have receivedmuch research attention

Figure 6 shows the implementation results for the twoarchitectures proposed here (serial and parallel) a Jacobiserial architecture carried out in VHDL [1] a Jacobi semi-parallel architecture from the multiple signal classification(MUSIC) algorithm [8] and the Vivado HLS linear algebralibrary svd implementation

The criteria selected to compare the different designswere as follows FPGA resource consumption in terms of thediscrete elements FF and LUT computation time (latency andinitiation interval) and maximum clock frequency (119891CLK)

The VHDL serial Jacobi algorithm features a fully cus-tomized system using two ad hoc CORDIC modules one ofwhich works in rotation and vectoring mode while the otheris optimized to work only in rotation mode

The Jacobi module used in theMUSIC algorithm featuresa semiparallel architecture based on the Brent and Lukproposal [12] and also uses the CORDIC algorithm for thecalculations However it does not implement the full systolicarray only making use of a subset of this to save resources

Turning first to the two Jacobi algorithm implementationsproposed here it is evident that the best performance interms of speed was achieved by the semiparallel architecturewhereas the serial architecture only used half of the resourcesrequired to implement the parallel one

Vivado HLS library

Clock frequency (MHz)

VHDL (Bravo et al

2008)QR

Parallel Jacobi

Serial Jacobi

0 50 100 150 200 250 300

Figure 7 Maximum clock frequency for all the designs when inputmatrix size is fixed to 8 times 8

The other Vivado HLS implementation considered wasthe svd function from the Vivado HLS library which canonly be implemented in single precision floating point for-mat Resource usage is not dissimilar to our Vivado HLSimplementations however it is slower and makes use of 58DSP48E blocks

The difference arises from the use of floating point coreswhich are implemented with DSP units while the fixedpoint CORDIC approach employed in the present study onlyrequired two DSP48E multipliers per unit

As previously mentioned a floating point QR algorithmwas implemented in the same FPGAdevice also usingVivadoHLS The main purpose of this was to determine whetherother algorithms presented greater computational efficiencythan the Jacobi method Although we used different codingsystems in the Jacobi and QR algorithms respectively ananalysis of both implementations revealed that QR was lesssuitable for a hardware implementation since the operationsperformed can vary between different iterations

Nevertheless we found that although the performanceof the algorithm in terms of execution time was poor it didnot use a large amount of FPGA resources and it thereforerepresents an alternative to consider when floating pointcodification is required

Looking next at the eigen solver implemented as part ofthe MUSIC algorithm it can be seen that it did not performvery well On paper it is similar to the semiparallel architec-ture presented here but the control logic was designed byhand This shows that HLS performs better than traditionalhardware design flows when control logic becomes toocomplicated since it is implemented automatically

Our final comparison was with the VHDL serial architec-ture in which only two CORDIC modules were used Sinceeverything in this design was customized for the applicationtiming performance was similar to HLS designs but it usedfar fewer resources

As regards the maximum clock frequency at which thedesigns can work Figure 7 shows a comparison of all thealternatives discussed

On the one hand we have VivadoHLS designs where thedifference between floating point and fixed point is evidencedby the resultsWhile floating point designs can easily go above200MHz fixed point designs stay around 120MHz

10 Mathematical Problems in Engineering

On the other hand we find theVHDLdesign In this casethe performance of the design proposed in [1] was similar tothe floating point designs since it was hand-coded at a verylow optimization level

6 Conclusions

In this paper we have described a high level synthesisimplementation of the Jacobi algorithm for eigenvalue andeigenvector calculations and this has been compared tosimilar or related systems reported in the literature

The main purpose of using a HLS tool was to try toachieve a similar performance to that of a customRTL systemcoded in for example VHDL in a fraction of the design timerequired

In recent years the amount of resources available inFPGA devices has grown exponentially and thus rapidity isbecoming more important than efficiency when implement-ing computational algorithms

Furthermore HLS tools have evolved making it possibleto achieve good systems that are easy to integrate into largerdesigns as hardware coprocessors together with a CPU or asa processing block in regular RTL designs

As final paper conclusion we can say that the use of highlevel implementation tools reduce the design time In mostcases the designs are initially coded simulated and validatedwith high level languages where there exist standard librariesand reference code available Manual RTL coding is notusually an easy step because there is not a straight way totranslate the functional algorithm to RTL description Inaddition the required level of RTL expertise is very high

Another inconvenience for RTL-based designs appearswhen new changes have to be included in the algorithmTheRTL structure is not very flexible and any modification to thealgorithm requires a lot of time (simulation and recoding)Nevertheless RTL designs generate hardware reconfigurableapproaches with less resources and higher clock frequencythan high level hardware designs (see Figures 6 and 7)

The developed algorithm in this work was initially madewith a traditional RTL flow [1] As a result we can comparethe results of a complex algorithm such as Jacobi based onRTL orHLS flows According to Figures 6 and 7 the achievedresults for the HLS methodology compared with RTL-baseddesign are favorably from different points of view

We have shown that the design process varies from thetraditional RTL flow since it is possible to iterate on the samecode to obtain different hardware architectures and performparametric simulations of the hardware in order to obtain thebest results

Therefore the designer should balance what are the priorrequirements in the system clock frequencyresources ordesign time and choose the methodology (RTLHLS) thatbest fit for that particular case

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgment

This work has been supported by the Spanish Ministry ofEconomy and Competitiveness through the ALCOR Project(DPI2013-47347-C2-1-R)

References

[1] I Bravo M Mazo J L Lazaro P Jimenez A Gardel and MMarron ldquoNovel HW architecture based on FPGAs oriented tosolve the eigen problemrdquo IEEE Transactions on Very Large ScaleIntegration (VLSI) Systems vol 16 no 12 pp 1722ndash1725 2008

[2] H Gao and R Ma ldquoOn a system modelling a population withtwo age groupsrdquoAbstract andApplied Analysis vol 2014 ArticleID 920348 5 pages 2014

[3] X Tang W Deng and Z Qi ldquoInvestigation of the dynamicstability of microgridrdquo IEEE Transactions on Power Systems vol29 no 2 pp 698ndash706 2014

[4] S Wu J Zhang Y Hou and X Bai ldquoConvergence of gossipalgorithms for consensus in wireless sensor networks withintermittent links andmobile nodesrdquoMathematical Problems inEngineering vol 2014 Article ID 836584 18 pages 2014

[5] J A Jimenez M Mazo J Urena et al ldquoUsing PCA in time-of-flight vectors for reflector recognition and 3-D localizationrdquoIEEE Transactions on Robotics vol 21 no 5 pp 909ndash924 2005

[6] W Haiqing S Zhihuan and L Ping ldquoImproved PCA withoptimized sensor locations for process monitoring and faultdiagnosisrdquo inProceedings of the 39th IEEEConfernce onDecisionand Control vol 5 pp 4353ndash4358 December 2000

[7] A A S Ali A Amira F Bensaali andM Benammar ldquoPCA IP-core for gas applications on the heterogenous zynq platformrdquoin Proceedings of the 25th International Conference onMicroelec-tronics (ICM rsquo13) pp 1ndash4 December 2013

[8] D Xu Z Liu X Qi Y Xu and Y Zeng ldquoA FPGA-basedimplementation of MUSIC for centrosymmetric circular arrayrdquoin Proceedings of the 9th International Conference on SignalProcessing (ICSP rsquo08) pp 490ndash493 October 2008

[9] J HWilkinsonTheAlgebraic Eigenvalue Problem Monographson Numerical Analysis Clarendon Press New York NY USA1988

[10] S Sundar and B K Bhagavan ldquoGeneralized eigenvalue prob-lems Lanczos algorithm with a recursive partitioning methodrdquoComputers ampMathematics with Applications vol 39 no 7-8 pp211ndash224 2000

[11] D P OrsquoLeary and P Whitman ldquoParallel QR factorization byhouseholder and modified Gram-Schmidt algorithmsrdquo ParallelComputing vol 16 no 1 pp 99ndash112 1990

[12] R P Brent F T Luk and C van Loan ldquoComputation ofthe generalized singular value decomposition using mesh-connected processorsrdquo Journal of VLSI and Computer Systemsvol 1 no 3 pp 242ndash270 1983

[13] J E Volder ldquoThe CORDIC trigonometric computing tech-niquerdquo IRE Transactions on Electronic Computers vol EC-8 no3 pp 330ndash334 1959

[14] J R Cavallaro and F T Luk ldquoCORDIC arithmetic for an SVDprocessorrdquo Journal of Parallel and Distributed Computing vol 5no 3 pp 271ndash290 1988

[15] G Lucius F le Roy D Aulagnier and S Azou ldquoAn algorithmfor extremal eigenvectors computation of hermitian matricesand its FPGA implementationrdquo in Proceedings of the IEEE56th International Midwest Symposium on Circuits and Systems(MWSCAS rsquo13) pp 1407ndash1410 August 2013

Mathematical Problems in Engineering 11

[16] Impulse C httpwwwimpulseacceleratedcomtoolshtml[17] Catapult c httpcalyptocomenproductscatapultoverview[18] Vivadohlsxilinx web page httpwwwxilinxcomproducts

designtoolsvivadointegrationesl-designhtm[19] R P Brent and F T Luk ldquoThe solution of singular-value

and symmetric eigenvalue problems on multiprocessor arraysrdquoSociety for Industrial and Applied Mathematics vol 6 no 1 pp69ndash84 1985

[20] C-C Sun J Gotze andG E Jan ldquoParallel Jacobi EVDmethodson integrated circuitsrdquoVLSIDesign vol 2014 Article ID 5961039 pages 2014

[21] Y Liu C-S Bouganis and P Y K Cheung ldquoHardware architec-tures for eigenvalue computation of real symmetric matricesrdquoIET Computers and Digital Techniques vol 3 no 1 pp 72ndash842009

[22] W Ford Numerical Linear Algebra with Applications UsingMATLAB Academic Press New York NY USA 1st edition2014

[23] S W Kang and S N Atluri ldquoApplication of nondimensionaldynamic influence function method for eigenmode analysisof two-dimensional acoustic cavitiesrdquo Advances in MechanicalEngineering vol 2014 Article ID 363570 9 pages 2014

[24] A Bernal R Miro D Ginestar and G Verdu ldquoResolution ofthe generalized eigenvalue problem in the neutron diffusionequation discretized by the finite volumemethodrdquoAbstract andApplied Analysis vol 2014 Article ID 913043 15 pages 2014

[25] W Gafsi F Najar S Choura and S El-Borgi ldquoConfinement ofvibrations in variable-geometry nonlinear flexible beamrdquo Shockand Vibration vol 2014 Article ID 687340 7 pages 2014

[26] G H Golub and C F Van Loan Matrix Computations JohnsHopkins Studies in the Mathematical Sciences Johns HopkinsUniversity Press Baltimore Md USA 4th edition 2013

[27] J S Walther A Unified Algorithm for Elementary Functions1972

[28] B Lakshmi and A S Dhar ldquoCORDIC architectures a surveyrdquoVLSI Design vol 2010 Article ID 794891 19 pages 2010

[29] J Faires and R L Burden Numerical Methods Brooks ColePublishing Company 2012

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 9: Research Article High Level Synthesis FPGA Implementation ...downloads.hindawi.com/journals/mpe/2015/870569.pdf · High Level Synthesis FPGA Implementation of the Jacobi Algorithm

Mathematical Problems in Engineering 9

Design comparison

11000 23000 34000 45000LUT

8500

17000

26000

34000

FF

400080001200016000

4000

8000

12000

16000

Semiparallel JacobiSerial JacobiFloating point QRMUSIC (Xu et al 2008)VHDL (Bravo et al 2008)Vivado HLS library

TL(nTCLK)

II(nT

CLK

)

Figure 6 Comparative between different design alternatives

5 Comparison with Other Proposals

The results obtained were also compared with those reportedin other studies of 8 times 8 matrices since these have receivedmuch research attention

Figure 6 shows the implementation results for the twoarchitectures proposed here (serial and parallel) a Jacobiserial architecture carried out in VHDL [1] a Jacobi semi-parallel architecture from the multiple signal classification(MUSIC) algorithm [8] and the Vivado HLS linear algebralibrary svd implementation

The criteria selected to compare the different designswere as follows FPGA resource consumption in terms of thediscrete elements FF and LUT computation time (latency andinitiation interval) and maximum clock frequency (119891CLK)

The VHDL serial Jacobi algorithm features a fully cus-tomized system using two ad hoc CORDIC modules one ofwhich works in rotation and vectoring mode while the otheris optimized to work only in rotation mode

The Jacobi module used in theMUSIC algorithm featuresa semiparallel architecture based on the Brent and Lukproposal [12] and also uses the CORDIC algorithm for thecalculations However it does not implement the full systolicarray only making use of a subset of this to save resources

Turning first to the two Jacobi algorithm implementationsproposed here it is evident that the best performance interms of speed was achieved by the semiparallel architecturewhereas the serial architecture only used half of the resourcesrequired to implement the parallel one

Vivado HLS library

Clock frequency (MHz)

VHDL (Bravo et al

2008)QR

Parallel Jacobi

Serial Jacobi

0 50 100 150 200 250 300

Figure 7 Maximum clock frequency for all the designs when inputmatrix size is fixed to 8 times 8

The other Vivado HLS implementation considered wasthe svd function from the Vivado HLS library which canonly be implemented in single precision floating point for-mat Resource usage is not dissimilar to our Vivado HLSimplementations however it is slower and makes use of 58DSP48E blocks

The difference arises from the use of floating point coreswhich are implemented with DSP units while the fixedpoint CORDIC approach employed in the present study onlyrequired two DSP48E multipliers per unit

As previously mentioned a floating point QR algorithmwas implemented in the same FPGAdevice also usingVivadoHLS The main purpose of this was to determine whetherother algorithms presented greater computational efficiencythan the Jacobi method Although we used different codingsystems in the Jacobi and QR algorithms respectively ananalysis of both implementations revealed that QR was lesssuitable for a hardware implementation since the operationsperformed can vary between different iterations

Nevertheless we found that although the performanceof the algorithm in terms of execution time was poor it didnot use a large amount of FPGA resources and it thereforerepresents an alternative to consider when floating pointcodification is required

Looking next at the eigen solver implemented as part ofthe MUSIC algorithm it can be seen that it did not performvery well On paper it is similar to the semiparallel architec-ture presented here but the control logic was designed byhand This shows that HLS performs better than traditionalhardware design flows when control logic becomes toocomplicated since it is implemented automatically

Our final comparison was with the VHDL serial architec-ture in which only two CORDIC modules were used Sinceeverything in this design was customized for the applicationtiming performance was similar to HLS designs but it usedfar fewer resources

As regards the maximum clock frequency at which thedesigns can work Figure 7 shows a comparison of all thealternatives discussed

On the one hand we have VivadoHLS designs where thedifference between floating point and fixed point is evidencedby the resultsWhile floating point designs can easily go above200MHz fixed point designs stay around 120MHz

10 Mathematical Problems in Engineering

On the other hand we find theVHDLdesign In this casethe performance of the design proposed in [1] was similar tothe floating point designs since it was hand-coded at a verylow optimization level

6 Conclusions

In this paper we have described a high level synthesisimplementation of the Jacobi algorithm for eigenvalue andeigenvector calculations and this has been compared tosimilar or related systems reported in the literature

The main purpose of using a HLS tool was to try toachieve a similar performance to that of a customRTL systemcoded in for example VHDL in a fraction of the design timerequired

In recent years the amount of resources available inFPGA devices has grown exponentially and thus rapidity isbecoming more important than efficiency when implement-ing computational algorithms

Furthermore HLS tools have evolved making it possibleto achieve good systems that are easy to integrate into largerdesigns as hardware coprocessors together with a CPU or asa processing block in regular RTL designs

As final paper conclusion we can say that the use of highlevel implementation tools reduce the design time In mostcases the designs are initially coded simulated and validatedwith high level languages where there exist standard librariesand reference code available Manual RTL coding is notusually an easy step because there is not a straight way totranslate the functional algorithm to RTL description Inaddition the required level of RTL expertise is very high

Another inconvenience for RTL-based designs appearswhen new changes have to be included in the algorithmTheRTL structure is not very flexible and any modification to thealgorithm requires a lot of time (simulation and recoding)Nevertheless RTL designs generate hardware reconfigurableapproaches with less resources and higher clock frequencythan high level hardware designs (see Figures 6 and 7)

The developed algorithm in this work was initially madewith a traditional RTL flow [1] As a result we can comparethe results of a complex algorithm such as Jacobi based onRTL orHLS flows According to Figures 6 and 7 the achievedresults for the HLS methodology compared with RTL-baseddesign are favorably from different points of view

We have shown that the design process varies from thetraditional RTL flow since it is possible to iterate on the samecode to obtain different hardware architectures and performparametric simulations of the hardware in order to obtain thebest results

Therefore the designer should balance what are the priorrequirements in the system clock frequencyresources ordesign time and choose the methodology (RTLHLS) thatbest fit for that particular case

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgment

This work has been supported by the Spanish Ministry ofEconomy and Competitiveness through the ALCOR Project(DPI2013-47347-C2-1-R)

References

[1] I Bravo M Mazo J L Lazaro P Jimenez A Gardel and MMarron ldquoNovel HW architecture based on FPGAs oriented tosolve the eigen problemrdquo IEEE Transactions on Very Large ScaleIntegration (VLSI) Systems vol 16 no 12 pp 1722ndash1725 2008

[2] H Gao and R Ma ldquoOn a system modelling a population withtwo age groupsrdquoAbstract andApplied Analysis vol 2014 ArticleID 920348 5 pages 2014

[3] X Tang W Deng and Z Qi ldquoInvestigation of the dynamicstability of microgridrdquo IEEE Transactions on Power Systems vol29 no 2 pp 698ndash706 2014

[4] S Wu J Zhang Y Hou and X Bai ldquoConvergence of gossipalgorithms for consensus in wireless sensor networks withintermittent links andmobile nodesrdquoMathematical Problems inEngineering vol 2014 Article ID 836584 18 pages 2014

[5] J A Jimenez M Mazo J Urena et al ldquoUsing PCA in time-of-flight vectors for reflector recognition and 3-D localizationrdquoIEEE Transactions on Robotics vol 21 no 5 pp 909ndash924 2005

[6] W Haiqing S Zhihuan and L Ping ldquoImproved PCA withoptimized sensor locations for process monitoring and faultdiagnosisrdquo inProceedings of the 39th IEEEConfernce onDecisionand Control vol 5 pp 4353ndash4358 December 2000

[7] A A S Ali A Amira F Bensaali andM Benammar ldquoPCA IP-core for gas applications on the heterogenous zynq platformrdquoin Proceedings of the 25th International Conference onMicroelec-tronics (ICM rsquo13) pp 1ndash4 December 2013

[8] D Xu Z Liu X Qi Y Xu and Y Zeng ldquoA FPGA-basedimplementation of MUSIC for centrosymmetric circular arrayrdquoin Proceedings of the 9th International Conference on SignalProcessing (ICSP rsquo08) pp 490ndash493 October 2008

[9] J HWilkinsonTheAlgebraic Eigenvalue Problem Monographson Numerical Analysis Clarendon Press New York NY USA1988

[10] S Sundar and B K Bhagavan ldquoGeneralized eigenvalue prob-lems Lanczos algorithm with a recursive partitioning methodrdquoComputers ampMathematics with Applications vol 39 no 7-8 pp211ndash224 2000

[11] D P OrsquoLeary and P Whitman ldquoParallel QR factorization byhouseholder and modified Gram-Schmidt algorithmsrdquo ParallelComputing vol 16 no 1 pp 99ndash112 1990

[12] R P Brent F T Luk and C van Loan ldquoComputation ofthe generalized singular value decomposition using mesh-connected processorsrdquo Journal of VLSI and Computer Systemsvol 1 no 3 pp 242ndash270 1983

[13] J E Volder ldquoThe CORDIC trigonometric computing tech-niquerdquo IRE Transactions on Electronic Computers vol EC-8 no3 pp 330ndash334 1959

[14] J R Cavallaro and F T Luk ldquoCORDIC arithmetic for an SVDprocessorrdquo Journal of Parallel and Distributed Computing vol 5no 3 pp 271ndash290 1988

[15] G Lucius F le Roy D Aulagnier and S Azou ldquoAn algorithmfor extremal eigenvectors computation of hermitian matricesand its FPGA implementationrdquo in Proceedings of the IEEE56th International Midwest Symposium on Circuits and Systems(MWSCAS rsquo13) pp 1407ndash1410 August 2013

Mathematical Problems in Engineering 11

[16] Impulse C httpwwwimpulseacceleratedcomtoolshtml[17] Catapult c httpcalyptocomenproductscatapultoverview[18] Vivadohlsxilinx web page httpwwwxilinxcomproducts

designtoolsvivadointegrationesl-designhtm[19] R P Brent and F T Luk ldquoThe solution of singular-value

and symmetric eigenvalue problems on multiprocessor arraysrdquoSociety for Industrial and Applied Mathematics vol 6 no 1 pp69ndash84 1985

[20] C-C Sun J Gotze andG E Jan ldquoParallel Jacobi EVDmethodson integrated circuitsrdquoVLSIDesign vol 2014 Article ID 5961039 pages 2014

[21] Y Liu C-S Bouganis and P Y K Cheung ldquoHardware architec-tures for eigenvalue computation of real symmetric matricesrdquoIET Computers and Digital Techniques vol 3 no 1 pp 72ndash842009

[22] W Ford Numerical Linear Algebra with Applications UsingMATLAB Academic Press New York NY USA 1st edition2014

[23] S W Kang and S N Atluri ldquoApplication of nondimensionaldynamic influence function method for eigenmode analysisof two-dimensional acoustic cavitiesrdquo Advances in MechanicalEngineering vol 2014 Article ID 363570 9 pages 2014

[24] A Bernal R Miro D Ginestar and G Verdu ldquoResolution ofthe generalized eigenvalue problem in the neutron diffusionequation discretized by the finite volumemethodrdquoAbstract andApplied Analysis vol 2014 Article ID 913043 15 pages 2014

[25] W Gafsi F Najar S Choura and S El-Borgi ldquoConfinement ofvibrations in variable-geometry nonlinear flexible beamrdquo Shockand Vibration vol 2014 Article ID 687340 7 pages 2014

[26] G H Golub and C F Van Loan Matrix Computations JohnsHopkins Studies in the Mathematical Sciences Johns HopkinsUniversity Press Baltimore Md USA 4th edition 2013

[27] J S Walther A Unified Algorithm for Elementary Functions1972

[28] B Lakshmi and A S Dhar ldquoCORDIC architectures a surveyrdquoVLSI Design vol 2010 Article ID 794891 19 pages 2010

[29] J Faires and R L Burden Numerical Methods Brooks ColePublishing Company 2012

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 10: Research Article High Level Synthesis FPGA Implementation ...downloads.hindawi.com/journals/mpe/2015/870569.pdf · High Level Synthesis FPGA Implementation of the Jacobi Algorithm

10 Mathematical Problems in Engineering

On the other hand we find theVHDLdesign In this casethe performance of the design proposed in [1] was similar tothe floating point designs since it was hand-coded at a verylow optimization level

6 Conclusions

In this paper we have described a high level synthesisimplementation of the Jacobi algorithm for eigenvalue andeigenvector calculations and this has been compared tosimilar or related systems reported in the literature

The main purpose of using a HLS tool was to try toachieve a similar performance to that of a customRTL systemcoded in for example VHDL in a fraction of the design timerequired

In recent years the amount of resources available inFPGA devices has grown exponentially and thus rapidity isbecoming more important than efficiency when implement-ing computational algorithms

Furthermore HLS tools have evolved making it possibleto achieve good systems that are easy to integrate into largerdesigns as hardware coprocessors together with a CPU or asa processing block in regular RTL designs

As final paper conclusion we can say that the use of highlevel implementation tools reduce the design time In mostcases the designs are initially coded simulated and validatedwith high level languages where there exist standard librariesand reference code available Manual RTL coding is notusually an easy step because there is not a straight way totranslate the functional algorithm to RTL description Inaddition the required level of RTL expertise is very high

Another inconvenience for RTL-based designs appearswhen new changes have to be included in the algorithmTheRTL structure is not very flexible and any modification to thealgorithm requires a lot of time (simulation and recoding)Nevertheless RTL designs generate hardware reconfigurableapproaches with less resources and higher clock frequencythan high level hardware designs (see Figures 6 and 7)

The developed algorithm in this work was initially madewith a traditional RTL flow [1] As a result we can comparethe results of a complex algorithm such as Jacobi based onRTL orHLS flows According to Figures 6 and 7 the achievedresults for the HLS methodology compared with RTL-baseddesign are favorably from different points of view

We have shown that the design process varies from thetraditional RTL flow since it is possible to iterate on the samecode to obtain different hardware architectures and performparametric simulations of the hardware in order to obtain thebest results

Therefore the designer should balance what are the priorrequirements in the system clock frequencyresources ordesign time and choose the methodology (RTLHLS) thatbest fit for that particular case

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgment

This work has been supported by the Spanish Ministry ofEconomy and Competitiveness through the ALCOR Project(DPI2013-47347-C2-1-R)

References

[1] I Bravo M Mazo J L Lazaro P Jimenez A Gardel and MMarron ldquoNovel HW architecture based on FPGAs oriented tosolve the eigen problemrdquo IEEE Transactions on Very Large ScaleIntegration (VLSI) Systems vol 16 no 12 pp 1722ndash1725 2008

[2] H Gao and R Ma ldquoOn a system modelling a population withtwo age groupsrdquoAbstract andApplied Analysis vol 2014 ArticleID 920348 5 pages 2014

[3] X Tang W Deng and Z Qi ldquoInvestigation of the dynamicstability of microgridrdquo IEEE Transactions on Power Systems vol29 no 2 pp 698ndash706 2014

[4] S Wu J Zhang Y Hou and X Bai ldquoConvergence of gossipalgorithms for consensus in wireless sensor networks withintermittent links andmobile nodesrdquoMathematical Problems inEngineering vol 2014 Article ID 836584 18 pages 2014

[5] J A Jimenez M Mazo J Urena et al ldquoUsing PCA in time-of-flight vectors for reflector recognition and 3-D localizationrdquoIEEE Transactions on Robotics vol 21 no 5 pp 909ndash924 2005

[6] W Haiqing S Zhihuan and L Ping ldquoImproved PCA withoptimized sensor locations for process monitoring and faultdiagnosisrdquo inProceedings of the 39th IEEEConfernce onDecisionand Control vol 5 pp 4353ndash4358 December 2000

[7] A A S Ali A Amira F Bensaali andM Benammar ldquoPCA IP-core for gas applications on the heterogenous zynq platformrdquoin Proceedings of the 25th International Conference onMicroelec-tronics (ICM rsquo13) pp 1ndash4 December 2013

[8] D Xu Z Liu X Qi Y Xu and Y Zeng ldquoA FPGA-basedimplementation of MUSIC for centrosymmetric circular arrayrdquoin Proceedings of the 9th International Conference on SignalProcessing (ICSP rsquo08) pp 490ndash493 October 2008

[9] J HWilkinsonTheAlgebraic Eigenvalue Problem Monographson Numerical Analysis Clarendon Press New York NY USA1988

[10] S Sundar and B K Bhagavan ldquoGeneralized eigenvalue prob-lems Lanczos algorithm with a recursive partitioning methodrdquoComputers ampMathematics with Applications vol 39 no 7-8 pp211ndash224 2000

[11] D P OrsquoLeary and P Whitman ldquoParallel QR factorization byhouseholder and modified Gram-Schmidt algorithmsrdquo ParallelComputing vol 16 no 1 pp 99ndash112 1990

[12] R P Brent F T Luk and C van Loan ldquoComputation ofthe generalized singular value decomposition using mesh-connected processorsrdquo Journal of VLSI and Computer Systemsvol 1 no 3 pp 242ndash270 1983

[13] J E Volder ldquoThe CORDIC trigonometric computing tech-niquerdquo IRE Transactions on Electronic Computers vol EC-8 no3 pp 330ndash334 1959

[14] J R Cavallaro and F T Luk ldquoCORDIC arithmetic for an SVDprocessorrdquo Journal of Parallel and Distributed Computing vol 5no 3 pp 271ndash290 1988

[15] G Lucius F le Roy D Aulagnier and S Azou ldquoAn algorithmfor extremal eigenvectors computation of hermitian matricesand its FPGA implementationrdquo in Proceedings of the IEEE56th International Midwest Symposium on Circuits and Systems(MWSCAS rsquo13) pp 1407ndash1410 August 2013

Mathematical Problems in Engineering 11

[16] Impulse C httpwwwimpulseacceleratedcomtoolshtml[17] Catapult c httpcalyptocomenproductscatapultoverview[18] Vivadohlsxilinx web page httpwwwxilinxcomproducts

designtoolsvivadointegrationesl-designhtm[19] R P Brent and F T Luk ldquoThe solution of singular-value

and symmetric eigenvalue problems on multiprocessor arraysrdquoSociety for Industrial and Applied Mathematics vol 6 no 1 pp69ndash84 1985

[20] C-C Sun J Gotze andG E Jan ldquoParallel Jacobi EVDmethodson integrated circuitsrdquoVLSIDesign vol 2014 Article ID 5961039 pages 2014

[21] Y Liu C-S Bouganis and P Y K Cheung ldquoHardware architec-tures for eigenvalue computation of real symmetric matricesrdquoIET Computers and Digital Techniques vol 3 no 1 pp 72ndash842009

[22] W Ford Numerical Linear Algebra with Applications UsingMATLAB Academic Press New York NY USA 1st edition2014

[23] S W Kang and S N Atluri ldquoApplication of nondimensionaldynamic influence function method for eigenmode analysisof two-dimensional acoustic cavitiesrdquo Advances in MechanicalEngineering vol 2014 Article ID 363570 9 pages 2014

[24] A Bernal R Miro D Ginestar and G Verdu ldquoResolution ofthe generalized eigenvalue problem in the neutron diffusionequation discretized by the finite volumemethodrdquoAbstract andApplied Analysis vol 2014 Article ID 913043 15 pages 2014

[25] W Gafsi F Najar S Choura and S El-Borgi ldquoConfinement ofvibrations in variable-geometry nonlinear flexible beamrdquo Shockand Vibration vol 2014 Article ID 687340 7 pages 2014

[26] G H Golub and C F Van Loan Matrix Computations JohnsHopkins Studies in the Mathematical Sciences Johns HopkinsUniversity Press Baltimore Md USA 4th edition 2013

[27] J S Walther A Unified Algorithm for Elementary Functions1972

[28] B Lakshmi and A S Dhar ldquoCORDIC architectures a surveyrdquoVLSI Design vol 2010 Article ID 794891 19 pages 2010

[29] J Faires and R L Burden Numerical Methods Brooks ColePublishing Company 2012

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 11: Research Article High Level Synthesis FPGA Implementation ...downloads.hindawi.com/journals/mpe/2015/870569.pdf · High Level Synthesis FPGA Implementation of the Jacobi Algorithm

Mathematical Problems in Engineering 11

[16] Impulse C httpwwwimpulseacceleratedcomtoolshtml[17] Catapult c httpcalyptocomenproductscatapultoverview[18] Vivadohlsxilinx web page httpwwwxilinxcomproducts

designtoolsvivadointegrationesl-designhtm[19] R P Brent and F T Luk ldquoThe solution of singular-value

and symmetric eigenvalue problems on multiprocessor arraysrdquoSociety for Industrial and Applied Mathematics vol 6 no 1 pp69ndash84 1985

[20] C-C Sun J Gotze andG E Jan ldquoParallel Jacobi EVDmethodson integrated circuitsrdquoVLSIDesign vol 2014 Article ID 5961039 pages 2014

[21] Y Liu C-S Bouganis and P Y K Cheung ldquoHardware architec-tures for eigenvalue computation of real symmetric matricesrdquoIET Computers and Digital Techniques vol 3 no 1 pp 72ndash842009

[22] W Ford Numerical Linear Algebra with Applications UsingMATLAB Academic Press New York NY USA 1st edition2014

[23] S W Kang and S N Atluri ldquoApplication of nondimensionaldynamic influence function method for eigenmode analysisof two-dimensional acoustic cavitiesrdquo Advances in MechanicalEngineering vol 2014 Article ID 363570 9 pages 2014

[24] A Bernal R Miro D Ginestar and G Verdu ldquoResolution ofthe generalized eigenvalue problem in the neutron diffusionequation discretized by the finite volumemethodrdquoAbstract andApplied Analysis vol 2014 Article ID 913043 15 pages 2014

[25] W Gafsi F Najar S Choura and S El-Borgi ldquoConfinement ofvibrations in variable-geometry nonlinear flexible beamrdquo Shockand Vibration vol 2014 Article ID 687340 7 pages 2014

[26] G H Golub and C F Van Loan Matrix Computations JohnsHopkins Studies in the Mathematical Sciences Johns HopkinsUniversity Press Baltimore Md USA 4th edition 2013

[27] J S Walther A Unified Algorithm for Elementary Functions1972

[28] B Lakshmi and A S Dhar ldquoCORDIC architectures a surveyrdquoVLSI Design vol 2010 Article ID 794891 19 pages 2010

[29] J Faires and R L Burden Numerical Methods Brooks ColePublishing Company 2012

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 12: Research Article High Level Synthesis FPGA Implementation ...downloads.hindawi.com/journals/mpe/2015/870569.pdf · High Level Synthesis FPGA Implementation of the Jacobi Algorithm

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of