
1156 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 61, NO. 4, APRIL 2014

Multifunction Residue Architectures for Cryptography

Dimitrios Schinianakis, Member, IEEE, and Thanos Stouraitis, Fellow, IEEE

Abstract—A design methodology for incorporating the Residue Number System (RNS) and the Polynomial Residue Number System (PRNS) in Montgomery modular multiplication in $GF(p)$ or $GF(2^n)$, respectively, as well as a VLSI architecture of a dual-field residue arithmetic Montgomery multiplier, are presented in this paper. An analysis of input/output conversions to/from residue representation, along with the proposed residue Montgomery multiplication algorithm, reveals common multiply-accumulate data paths both between the converters and between the two residue representations. A versatile architecture is derived that supports all operations of Montgomery multiplication in $GF(p)$ and $GF(2^n)$, input/output conversions, Mixed Radix Conversion (MRC) for integers and polynomials, and dual-field modular exponentiation and inversion in the same hardware. Detailed comparisons with state-of-the-art implementations demonstrate the potential of residue arithmetic exploitation in dual-field modular multiplication.

Index Terms—Computations in finite fields, computer arithmetic, Montgomery multiplication, parallel arithmetic and logic structures.

I. INTRODUCTION

A significant number of applications, including cryptography, error correction coding, computer algebra, DSP, etc., rely on the efficient realization of arithmetic over finite fields of the form $GF(2^n)$, where $n \in \mathbb{Z}$ and $n \geq 1$, or of the form $GF(p)$, where $p$ is a prime. Cryptographic applications form a special case, since, for security reasons, they require large integer operands [1]–[5].

Efficient field multiplication with large operands is crucial for achieving satisfactory cryptosystem performance, since multiplication is the most time- and area-consuming operation. Therefore, there is a need for increasing the speed of cryptosystems employing modular arithmetic with the least possible area penalty. An obvious approach to achieve this would be through parallelization of their operations.

In recent years, RNS and PRNS have enjoyed renewed scientific interest due to their ability to perform fast and parallel modular arithmetic [6]–[13]. Using RNS/PRNS, a given path

Manuscript received November 24, 2012; revised April 15, 2013 and June 09, 2013; accepted July 03, 2013. Date of publication January 02, 2014; date of current version March 25, 2014. This work was supported by the European Union (European Social Fund-ESF) and Greek national funds through the Operational Program “Education and Lifelong Learning” of the National Strategic Reference Framework (NSRF)—Research Funding Program: Heracleitus II, Investing in Knowledge Society through the European Social Fund. This paper was recommended by Associate Editor M. Alioto.

The authors are with the Electrical and Computer Engineering Department, University of Patras, 265 00 Patras, Greece (e-mail: [email protected]; [email protected]).

Digital Object Identifier 10.1109/TCSI.2013.2283674

serving a large data range is replaced by parallel paths of smaller dynamic ranges, with no need for exchanging information between paths. As a result, the use of residue systems can offer reduced complexity and power consumption of arithmetic units with large word lengths [14]. On the other hand, RNS/PRNS implementations bear the extra cost of input converters to translate numbers from a standard binary format into residues and output converters to translate from RNS/PRNS to binary representations [14].

A new methodology for embedding residue arithmetic in a dual-field Montgomery modular multiplication algorithm for integers in $GF(p)$ and for polynomials in $GF(2^n)$ is presented in this paper. The mathematical conditions that need to be satisfied for a valid RNS/PRNS incorporation are examined. The derived architecture is highly parallelizable and versatile, as it supports binary-to-RNS/PRNS and RNS/PRNS-to-binary conversions, Mixed Radix Conversion (MRC) for integers and polynomials, dual-field Montgomery multiplication, and dual-field modular exponentiation and inversion in the same hardware.

The rest of the paper is organized as follows. A brief overview of related previous work is offered in Section II, while the main concepts of RNS and PRNS are summarized in Section III. In Section IV, basic finite-field arithmetic concepts are provided and the operation of field multiplication is defined; the $GF(p)$ and $GF(2^n)$ Montgomery multiplication algorithms are then presented. In Section V, the proposed RNS/PRNS Montgomery multiplication algorithm is analyzed. The mathematical conditions that allow a valid incorporation of residue arithmetic in the Montgomery algorithm are also presented. In Section VI, a detailed overview of the proposed dual-field Montgomery multiplier architecture is provided. Section VII provides time and area complexity measurements, as well as comparisons with other state-of-the-art implementations. Conclusions are offered in Section VIII.

II. PREVIOUS WORK

Important progress has been reported lately regarding $GF(2^n)$ implementations. The Massey-Omura algorithm [15], the introduction of optimal normal bases [16] and their software and hardware implementations [17], [18], the Montgomery algorithm for multiplication in $GF(2^n)$ [19], as well as PRNS application in $GF(2^n)$ multiplication, are, among others, important advances [9], [10], [12], [20], [21]. PRNS incorporation in field multiplication, as proposed in [9], is based on a straightforward implementation of the Chinese Remainder Theorem (CRT) for polynomials, which requires large storage resources and many pre-computations. The multipliers proposed in [10], [20] perform multiplication in PRNS,

1549-8328 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


but the result is converted back to polynomial representation. This limitation makes them inappropriate for cryptographic algorithms which require consecutive multiplications. Finally, an algorithm which employs trinomials for the modulus set and performs PRNS Montgomery multiplication has been proposed [12]. However, there is no reference to conversion methods and the use of trinomials may impose limitations on the PRNS data range.

$GF(p)$ implementations have also been analyzed extensively, with the Montgomery algorithm being used in the majority of them. Montgomery multiplication designs fall into two categories. The first includes fixed-precision input operand implementations, in which the multiplicand and modulus are processed in full word length, while the multiplier is handled bit-by-bit [22]–[25]. These designs are optimized for certain word lengths and do not scale efficiently for departures from these word lengths. Their performance has been improved by high-radix algorithms and architectures [26]–[28].

The second category includes scalable architectures for variable word-length operands, based on algorithms in which the multiplicand and modulus are processed word by word, while the multiplier is consumed bit by bit [26], [29]–[31].

Montgomery’s algorithm has also become the basis for dual-field implementations [31]–[37]. The Montgomery architectures perform well for RSA key word lengths, by processing word-size data, since RSA key sizes (512, 1024, 2048, etc.) are always multiples of the word size. However, in ECC, key sizes are not integer multiples of the word size, meaning that, if these architectures were to be used in ECC, they would require more clock cycles for their execution and thus more power consumption. An architecture configured at bit level overcomes this problem [23].

Finally, methods for embedding RNS in Montgomery multiplication have also been proposed [6], [7], [13], [38]–[41]. Detailed analysis and comparisons with such works are offered in Section VII.

III. RESIDUE ARITHMETIC

A. Residue Number System

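RNS consists of a set of $L$ pair-wise relatively prime integers $\mathcal{A} = (p_1, p_2, \ldots, p_L)$ (called the base), and the range of the RNS is computed as $A = \prod_{i=1}^{L} p_i$. Any integer $z \in [0, A-1]$ has a unique RNS representation $z_{\mathcal{A}}$ given by $z_{\mathcal{A}} = (z_1, z_2, \ldots, z_L) = (\langle z \rangle_{p_1}, \langle z \rangle_{p_2}, \ldots, \langle z \rangle_{p_L})$, where $\langle z \rangle_{p_i}$ denotes the operation $z \bmod p_i$. Assuming two integers $a$, $b$ in RNS format, i.e., $a_{\mathcal{A}} = (a_1, a_2, \ldots, a_L)$ and $b_{\mathcal{A}} = (b_1, b_2, \ldots, b_L)$, then one can perform the operations $\otimes \in (+, -, *)$ in parallel by

$$a_{\mathcal{A}} \otimes b_{\mathcal{A}} = \left( \langle a_1 \otimes b_1 \rangle_{p_1}, \langle a_2 \otimes b_2 \rangle_{p_2}, \ldots, \langle a_L \otimes b_L \rangle_{p_L} \right). \tag{1}$$

To reconstruct the integer from its residues, two methods may be employed [14]. The first is through the CRT according to

$$z = \left\langle \sum_{i=1}^{L} A_i \left\langle A_i^{-1} z_i \right\rangle_{p_i} \right\rangle_{A} = \sum_{i=1}^{L} A_i \left\langle A_i^{-1} z_i \right\rangle_{p_i} - \gamma A, \tag{2}$$

where $A_i = A/p_i$, $A_i^{-1}$ is the inverse of $A_i$ modulo $p_i$, and $\gamma$ is an integer correction factor.

The second method is through the MRC. The MRC of an integer $z$ with an RNS representation $z_{\mathcal{A}} = (z_1, z_2, \ldots, z_L)$ is given by

$$z = U_1 + W_2 U_2 + \cdots + W_L U_L, \tag{3}$$

where $W_i = \prod_{j=1}^{i-1} p_j$, $\forall i \in [2, L]$, and the $U_i$s are computed according to

$$\begin{aligned}
U_1 &= z_1 \\
U_2 &= \langle z_2 - z_1 \rangle_{p_2} \\
U_3 &= \langle z_3 - z_1 - W_2 U_2 \rangle_{p_3} \\
&\;\;\vdots \\
U_L &= \langle z_L - z_1 - W_2 U_2 - \cdots - W_{L-1} U_{L-1} \rangle_{p_L}
\end{aligned} \tag{4}$$

provided that the predetermined factors $V_1 \equiv 1$ and $V_i \equiv \left\langle \left( \prod_{j=1}^{i-1} p_j \right)^{-1} \right\rangle_{p_i}$, $\forall i \in [2, L]$, are all unity [42]. Of the two methods, the proposed architecture uses the MRC, as it avoids the problem of evaluating the correction factor $\gamma$ of (2).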
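As a minimal sketch of the RNS definitions in this subsection, the Python fragment below forms residues, multiplies channel-wise as in (1), and reconstructs the integer by MRC. It uses the textbook MRC with explicit modular inverses, not the simplified unity-factor variant referred to in the text; the base `[13, 15, 16, 17]` and all function names are illustrative assumptions and do not come from the paper.

```python
def to_rns(x, base):
    """Residues of x with respect to each modulus of the base."""
    return [x % p for p in base]

def rns_mul(a, b, base):
    """Channel-wise multiplication; no information crosses between channels."""
    return [(ai * bi) % p for ai, bi, p in zip(a, b, base)]

def mrc(residues, base):
    """Textbook Mixed Radix Conversion (explicit inverses), returns the integer."""
    work = list(residues)
    for i, p_i in enumerate(base):
        # work[i] is now the mixed-radix digit U_(i+1); remove it from the
        # remaining channels and divide them by p_i.
        for j in range(i + 1, len(base)):
            inv = pow(p_i, -1, base[j])  # modular inverse, Python 3.8+
            work[j] = ((work[j] - work[i]) * inv) % base[j]
    x, weight = 0, 1
    for u, p in zip(work, base):
        x += u * weight   # x = U_1 + U_2*p_1 + U_3*p_1*p_2 + ...
        weight *= p
    return x

base = [13, 15, 16, 17]          # hypothetical pairwise coprime moduli
a, b = 123, 45                   # a*b must stay below the range 13*15*16*17
product = rns_mul(to_rns(a, base), to_rns(b, base), base)
result = mrc(product, base)
```

Because each channel only ever touches its own modulus, the list comprehension in `rns_mul` models work that dedicated hardware units could perform concurrently.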

B. Polynomial Residue Number System

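Similar to RNS, a PRNS is defined through a set of $L$ pair-wise relatively prime polynomials $\mathcal{A} = (p_1(x), p_2(x), \ldots, p_L(x))$. We denote by $A(x) = \prod_{i=1}^{L} p_i(x)$ the dynamic range of the PRNS. In PRNS, every polynomial $z(x) \in GF(2^n)$, with $\deg\{z(x)\} < \deg\{A(x)\}$, has a unique PRNS representation:

$$z_{\mathcal{A}} = (z_1, z_2, \ldots, z_L) \tag{5}$$

such that $z_i = z(x) \bmod p_i(x)$, denoted as $\langle z \rangle_{p_i}$. In the rest of the paper, the notation “$(x)$” to denote polynomials shall be omitted, for simplicity. The notation $z$ will be used interchangeably to denote either an integer $z$ or a polynomial $z(x)$, according to context.

Assuming the PRNS representations $a_{\mathcal{A}} = (a_1, a_2, \ldots, a_L)$ and $b_{\mathcal{A}} = (b_1, b_2, \ldots, b_L)$ of two polynomials $a, b \in GF(2^n)$, then all operations $\otimes \in (+, -, *)$ can be performed in parallel, as

$$a_{\mathcal{A}} \otimes b_{\mathcal{A}} = \left( \langle a_1 \otimes b_1 \rangle_{p_1}, \langle a_2 \otimes b_2 \rangle_{p_2}, \ldots, \langle a_L \otimes b_L \rangle_{p_L} \right). \tag{6}$$

Conversion from PRNS to weighted polynomial representation is identical to the MRC for integers. The only difference is that the subtractions in (4) are substituted by polynomial additions.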
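The polynomial case can be sketched in the same way. In the hedged example below, a polynomial over $GF(2)$ is encoded as a Python integer whose bit $i$ is the coefficient of $x^i$; the three moduli and every function name are assumptions chosen for illustration only.

```python
def poly_mul(a, b):
    """Carry-less product of a(x) and b(x) over GF(2)."""
    r = 0
    while b:
        if b & 1:
            r ^= a          # addition of coefficients modulo 2 is XOR
        a <<= 1
        b >>= 1
    return r

def poly_mod(a, m):
    """Remainder of a(x) divided by m(x) over GF(2)."""
    dm = m.bit_length() - 1
    while a.bit_length() - 1 >= dm:
        a ^= m << (a.bit_length() - 1 - dm)
    return a

def to_prns(a, base):
    """Residues of a(x) modulo each polynomial of the base."""
    return [poly_mod(a, m) for m in base]

def prns_mul(a, b, base):
    """Channel-wise polynomial multiplication."""
    return [poly_mod(poly_mul(x, y), m) for x, y, m in zip(a, b, base)]

# Hypothetical base: x^3+x+1, x^3+x^2+1, x^2+x+1 (distinct irreducibles).
base = [0b1011, 0b1101, 0b111]
a, b = 0b1001, 0b110             # x^3 + 1 and x^2 + x
channels = prns_mul(to_prns(a, base), to_prns(b, base), base)
```

The channel results agree with the residues of the full product because reduction modulo each $p_i(x)$ commutes with multiplication.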

IV. MONTGOMERY MULTIPLICATION

A. $GF(p)$ Arithmetic

Field elements in $GF(p)$ are integers in $[0, p-1]$ and arithmetic is performed modulo $p$. Montgomery’s algorithm for modular multiplication without division [43] is presented below, as Algorithm 1, in five steps, where $R$ is the Montgomery radix, $\gcd(R, p) = 1$, and $p < R$. $R$ must be chosen so that steps 2 and 5 are efficiently computed. It is usually chosen to be a power of 2, when radix-2 representation is employed.

Since Montgomery’s method was originally devised to avoid divisions, it is well-suited to RNS implementations, considering that RNS division is inefficient to perform.
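The five steps of Montgomery multiplication can be sketched as follows. This is the standard textbook formulation written under the conditions stated above ($\gcd(R, p) = 1$, $p < R$); the intermediate names `s`, `t`, `u`, `v` and the function name are assumptions made for this illustration.

```python
def montgomery_multiply(a, b, p, R):
    """Return c with c = a*b*R^(-1) (mod p), without dividing by p."""
    neg_p_inv = (-pow(p, -1, R)) % R   # precomputable constant -p^(-1) mod R
    s = a * b                          # step 1
    t = (s * neg_p_inv) % R            # step 2: cheap when R is a power of 2
    u = t * p                          # step 3
    v = s + u                          # step 4: v is now a multiple of R
    return v // R                      # step 5: exact division (a shift)

p, R = 97, 128
a, b = 45, 76
c = montgomery_multiply(a, b, p, R)
```

For inputs below $p$ the result is below $2p$, so at most one final subtraction is needed to obtain a fully reduced value.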

B. $GF(2^n)$ Arithmetic

Field elements in $GF(2^n)$ are polynomials represented as binary vectors of dimension $n$, relative to a given polynomial basis $(1, \alpha, \alpha^2, \ldots, \alpha^{n-1})$, where $\alpha$ is a root of an irreducible polynomial $p$ of degree $n$ over $GF(2)$. The field is then realized as $GF(2)[x]/(p)$ and the arithmetic is that of polynomials of degree at most $n-1$, modulo $p$ [1].

The addition of two polynomials $a$ and $b$ in $GF(2^n)$ is performed by adding the polynomials, with their coefficients added in $GF(2)$, i.e., modulo 2. This is equivalent to a bit-wise XOR operation on the vectors $a$ and $b$. The product of two elements $a$ and $b$ in $GF(2^n)$ is obtained by computing

$$c = a \cdot b \bmod p \tag{7}$$

where $c$ is a polynomial of degree at most $n-1$ and $c \in GF(2^n)$. A Montgomery multiplication algorithm suitable for polynomials in $GF(2^n)$ has been proposed [19]. Instead of computing the product $a \cdot b$, the algorithm computes $a \cdot b \cdot R^{-1} \bmod p$, where $\gcd(R, p) = 1$ and $R$ is a special fixed element in $GF(2^n)$. The selection of $R(x) = x^n$ is the most appropriate, since modular reduction and division by $x^n$ are simple shifts [3]. The algorithm is identical to Algorithm 1, except for the constant $-p^{-1}$ in step 2, which is $p^{-1}$ in $GF(2^n)$.
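A hedged sketch of the polynomial variant, assuming the Montgomery radix is $x^n$ and that polynomials are stored as Python integers (bit $i$ holds the coefficient of $x^i$). The helper names and the example field defined by $x^4 + x + 1$ are illustrative assumptions.

```python
def clmul(a, b):
    """Carry-less product in GF(2)[x]."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def gf2_mod(a, f):
    """Remainder of a(x) modulo f(x) over GF(2)."""
    df = f.bit_length() - 1
    while a.bit_length() - 1 >= df:
        a ^= f << (a.bit_length() - 1 - df)
    return a

def inv_mod_xn(f, n):
    """Inverse of f(x) modulo x^n, one coefficient at a time (needs f(0) = 1)."""
    inv = 1
    for i in range(1, n):
        if (clmul(f, inv) >> i) & 1:
            inv |= 1 << i
    return inv

def gf2n_montgomery(a, b, f, n):
    """Return a(x)*b(x)*x^(-n) mod f(x), where deg f = n."""
    mask = (1 << n) - 1
    s = clmul(a, b)                                  # step 1
    t = clmul(s & mask, inv_mod_xn(f, n)) & mask     # step 2: mod x^n is a mask
    u = clmul(t, f)                                  # step 3
    v = s ^ u                                        # step 4: addition is XOR
    return v >> n                                    # step 5: divide by x^n

f, n = 0b10011, 4                # x^4 + x + 1
a, b = 0b1011, 0b0110
c = gf2n_montgomery(a, b, f, n)
```

Since additive inverses coincide with the elements themselves in characteristic 2, step 2 uses $f^{-1}$ directly, and the result already has degree below $n$.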

V. A NEW METHODOLOGY FOR EMBEDDING RESIDUE ARITHMETIC IN MONTGOMERY MULTIPLICATION

A. Embedding RNS in Montgomery Multiplication

An MRC-based algorithm [44] that avoids the evaluation of the factor $\gamma$ of (2) forms the basis of the proposed RNS-based Montgomery multiplication algorithm. The derived algorithm is briefly presented here as Algorithm 2, and extended for the case of $GF(2^n)$.

Two RNS bases are introduced, namely $\mathcal{A} = (p_1, p_2, \ldots, p_L)$ and $\mathcal{B} = (q_1, q_2, \ldots, q_L)$, such that $\gcd(p_i, q_j) = 1$, $\forall i, j \in [1, L]$. The 5 steps of the Montgomery algorithm are translated to RNS computations in both bases, denoted from now on as $\mathcal{T} = \mathcal{A} \cup \mathcal{B}$.

Initially, the inputs $a$, $b$ are expressed in RNS representation in both bases as $a_{\mathcal{T}}$ and $b_{\mathcal{T}}$. Steps 1, 3, and 4 of Algorithm 1 involve addition and multiplication operations, thus their transformation to RNS is straightforward. For steps 2 and 5, the Montgomery radix $R$ is replaced by $A = \prod_{i=1}^{L} p_i$, which is the range of $\mathcal{A}$. We also denote as $B = \prod_{i=1}^{L} q_i$ the range of base $\mathcal{B}$. Then, in the second step, $t$ in RNS format is computed in base $\mathcal{A}$ by $t_{\mathcal{A}} = s_{\mathcal{A}} \cdot \langle -p^{-1} \rangle_{\mathcal{A}}$. Nevertheless, the computations in base $\mathcal{A}$ cannot be continued for steps 3, 4, and 5 of Algorithm 1, since in step 5 we would need to compute a quantity of the form $\langle A^{-1} \rangle_{p_i}$, which does not exist since the $p_i$s are factors of $A$. Thus, a base conversion (BC) step, from base $\mathcal{A}$ to base $\mathcal{B}$, is inserted, to compute $t_{\mathcal{B}}$. $t_{\mathcal{B}}$ is then used to execute the old steps 3, 4, and 5 in base $\mathcal{B}$.

The result at the end of this algorithm is a quantity in RNS format that equals , since BC is error-free. In Algorithm 2, inputs and output are all less than , so that they are compatible with each other. This allows the construction of a modular exponentiation algorithm by repetition of the RNS Montgomery multiplication. Base conversion in step 7 is utilized for the same reason. Algorithm 3 depicts the proposed base conversion process that converts an integer expressed in RNS base as to the RNS representation of another base .

In contrast to other RNS Montgomery multiplication algorithms which also employ MRC [39], [40], [44], the proposed one implements a simplified version of the MRC in (3) and (4) which, as will be shown in the next sections, not only reduces the total complexity of the algorithm but also offers better opportunities for parallelization of operations. Dual-field circuitry is also not offered by the works in [39], [40], [44]. Algorithm 3 implements (4) in steps 1–8 to obtain the mixed-radix digits of . In steps 9–15, (3) is realized, while the whole summation is computed modulo each modulus of the new base .

The two base conversions in the RMM algorithm are error-free, in contrast to other algorithms that employ CRT and utilize approximation methods to compute the correction factor of (2) [6], [7], [38]. Conditions and are sufficient for the existence of and , respectively. As it holds that

(8)

it follows that is sufficient for to hold when . Finally, (8) also shows that is sufficient for and . Since is the maximum intermediate value, all values are less than [6], [44].
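The flow described in this subsection (multiply in both bases, compute the quotient in the first base, convert it exactly to the second base, finish there, convert back) can be sketched as below. This is a functional model under stated assumptions, not the paper's Algorithms 2 and 3: it uses the textbook MRC with explicit inverses for the base conversion, and the bases `A`, `B`, the modulus `p = 97`, and all function names are hypothetical. The bases are chosen generously so that every intermediate value fits its range.

```python
from math import prod

def to_rns(x, base):
    return [x % m for m in base]

def mrc_digits(res, base):
    """Mixed-radix digits of the integer represented by res (textbook MRC)."""
    work = list(res)
    for i, p in enumerate(base):
        for j in range(i + 1, len(base)):
            work[j] = (work[j] - work[i]) * pow(p, -1, base[j]) % base[j]
    return work

def from_rns(res, base):
    x, weight = 0, 1
    for u, m in zip(mrc_digits(res, base), base):
        x += u * weight
        weight *= m
    return x

def base_convert(res, src, dst):
    """Exact conversion; the weighted sum is accumulated modulo each new modulus."""
    digits = mrc_digits(res, src)
    out = []
    for q in dst:
        acc, weight = 0, 1
        for u, m in zip(digits, src):
            acc = (acc + u * weight) % q
            weight = weight * m % q
        out.append(acc)
    return out

def rns_montgomery(aA, aB, bA, bB, p, A, B):
    """Residues (in both bases) of c = a*b*prod(A)^(-1) mod p, with c < 2p here."""
    range_a = prod(A)
    neg_p_inv_A = [(-pow(p, -1, m)) % m for m in A]      # precomputed constants
    A_inv_B = [pow(range_a, -1, q) for q in B]
    sA = [x * y % m for x, y, m in zip(aA, bA, A)]
    sB = [x * y % q for x, y, q in zip(aB, bB, B)]
    tA = [s * k % m for s, k, m in zip(sA, neg_p_inv_A, A)]
    tB = base_convert(tA, A, B)                          # first base conversion
    cB = [(s + t * p) * inv % q for s, t, inv, q in zip(sB, tB, A_inv_B, B)]
    cA = base_convert(cB, B, A)                          # second base conversion
    return cA, cB

p = 97
A, B = [13, 15, 16, 17], [19, 23, 29]   # hypothetical, pairwise coprime
a, b = 150, 120                         # both below 2p
cA, cB = rns_montgomery(to_rns(a, A), to_rns(a, B), to_rns(b, A), to_rns(b, B), p, A, B)
c = from_rns(cB, B)
```

Returning the result in both bases is what lets the output feed the next multiplication directly, which is the property the text relies on for exponentiation.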

B. Embedding PRNS in Montgomery Multiplication

A modification of the Montgomery algorithm for multiplication in $GF(2^n)$ that encompasses PRNS is proposed next. The proposed algorithm employs general polynomials of any degree, and is an extension of an algorithm [12], which employs trinomials for the PRNS modulus set. Additionally, the proposed algorithm addresses the problem of converting data to/from PRNS representation. In contrast to a similar algorithm in [45], which employed CRT for polynomials for the BC algorithm, the proposed architecture employs MRC. This allows for dual-field RNS/PRNS implementation, which is not supported in [45], and a new methodology to implement RNS-to-binary conversion, as will be shown in Section V-D.

The proposed algorithm for PRNS Montgomery multiplication (PRMM) is presented below as Algorithm 4. The corresponding algorithm for base conversion in $GF(2^n)$ is identical to Algorithm 3, thus it is omitted for simplicity. The only difference is that integer additions/subtractions and multiplications are replaced by polynomial ones. Again, the degrees of the input and output polynomials are both less than , which allows the construction of a modular exponentiation algorithm by repetition of the PRMM. Base conversion in step 7 is employed for the same reason.

1) Proof of PRMM Algorithm’s Validity:

Theorem 1: If , (2) , (3)

, and (4) , then Algorithm 4 outputs , for which and .

Proof: Since and , is relatively prime to and is relatively prime to . Thus, the quantities and exist, and therefore and exist.

Assume that the polynomial is a multiple of , i.e., . Then, , which means that . This corresponds to step 2 of the PRMM algorithm, which means that step 6 is error-free since base conversion in step 3 is error-free; therefore, PRMM holds.

Furthermore, it must be proved that the resulting polynomial of Algorithm 4 is a polynomial of degree less than . It holds that

(9)

Since is the maximum intermediate value of Algorithm 4,its degree must be less than the degree of the polynomial .Under this assumption, we get

(10)

From (9) and (10) we have

(11)

Finally, note that (10) is independent of , thus selecting is sufficient.

C. The Proposed Dual-Field Montgomery Multiplication Algorithm

A careful examination of the RMM and PRMM algorithms reveals potential for unification into a common dual-field residue arithmetic Montgomery multiplication (DRAMM) algorithm and a common dual-field base conversion (DBC) algorithm. The unified algorithms are depicted below as Algorithms 5 and 6, where represents a dual-field addition/subtraction and represents a dual-field multiplication. An important aspect is that all operations within the DRAMM and the DBC algorithms are now decomposed into simple multiply-accumulate (MAC) operations of word length equal to the modulus word length . This allows for a fully parallel hardware implementation, employing parallel MAC units, each dedicated to a modulus of the RNS/PRNS base.


Finally, the conditions from Sections V-A and V-B for a valid RNS/PRNS transformation of the Montgomery algorithm yield

(12)

which means that one should select RNS/PRNS ranges of word length

(13)

for the bases and , respectively. Algorithms 5 and 6 along with conditions (12) and (13) form the complete framework for a dual-field residue arithmetic Montgomery multiplication.

As described before, the structure of the proposed algorithm allows it to be reused in the context of any exponentiation algorithm. A possible implementation is depicted in Algorithm 7, requiring in total DRAMM multiplications [4], [46].
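How a Montgomery multiplier can be reused for exponentiation may be sketched as follows. The example uses the plain integer Montgomery product and the left-to-right binary method as stand-ins; it is not a transcription of Algorithm 7, and the function names, the radix condition $R > 4p$, and the numeric values are assumptions of this sketch.

```python
def mont_mul(a, b, p, R):
    """Montgomery product a*b*R^(-1) mod p (result below 2p when R > 4p, a, b < 2p)."""
    s = a * b
    t = (s * (-pow(p, -1, R) % R)) % R
    return (s + t * p) // R

def mont_exp(z, e, p, R):
    """z**e mod p using only Montgomery multiplications (binary method)."""
    r2 = (R * R) % p
    x = mont_mul(z, r2, p, R)        # z mapped into the Montgomery domain
    acc = mont_mul(1, r2, p, R)      # 1 mapped into the Montgomery domain
    for bit in bin(e)[2:]:
        acc = mont_mul(acc, acc, p, R)       # square for every exponent bit
        if bit == "1":
            acc = mont_mul(acc, x, p, R)     # multiply when the bit is set
    return mont_mul(acc, 1, p, R) % p        # map back and fully reduce

p, R = 1009, 4096                    # hypothetical modulus and radix, R > 4p
result = mont_exp(123, 65537, p, R)
```

Intermediate values are never reduced below $p$ inside the loop; keeping inputs and outputs within the same bound is what makes the repetition valid.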

D. Conversions

In the following discussion, base shall be used as an example to analyze the conversions to/from residue representations, without loss of generality.

1) Binary-to-Residue Conversion: A radix- representationof an integer as an -tuple satisfies

... (14)

where, . A method to compute must be devised that matches the proposed DRAMM’s multiply-accumulate structure. By applying the modulo operation in (14) we obtain

(15)

If constants are precomputed, this computation is well-suited to the proposed MAC structure and can be computed in steps, when executed by units in parallel.

Similar to the integer case, a polynomial can be written as

... (16)

Applying the modulo operation in (16) yields

(17)

which is a similar operation to (15), if are precomputed.

From (15) and (17), it is deduced that conversions in both fields can be unified into a common conversion method, if dual-field circuitry is employed, as already mentioned for the case of the DRAMM and DBC. In the rest of the paper, the radix vectors and of both fields shall be denoted as a common radix vector , without loss of generality.

2) Residue-to-Binary Conversion: As all operands in (4) are of word length , they can be efficiently handled by an -bit MAC unit. However, (3) employs multiplications with large values, namely the s.


Fig. 1. Dual-field full-adder cell (DFA).

To overcome this problem, (3) can be rewritten in matrix notation, as in (18), shown at the bottom of the page, which implies a fully parallel implementation of the conversion process. The inner products of a row are calculated in parallel in each MAC unit. Each MAC then propagates its result to subsequent MACs, so that at the end the last MAC( ) outputs the radix- digit of the result. In parallel with this summation, inner products of the next row can be formulated, since the adder and multiplier of the proposed MAC architecture may operate in parallel.
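The two conversions of this subsection can be modeled functionally as below. The sketch assumes a radix of $2^r$ with $r$-bit words and precomputed powers of the radix reduced modulo each modulus (the ROM constants mentioned in the text); it evaluates the weighted mixed-radix sum directly on big integers instead of word by word as the matrix form of (18) would. All names and numeric values are illustrative assumptions.

```python
def split_words(x, r, count):
    """The count r-bit words of x, least significant first."""
    mask = (1 << r) - 1
    return [(x >> (r * i)) & mask for i in range(count)]

def binary_to_residue(x, r, count, modulus):
    """Residue of x via one multiply-accumulate step per r-bit word."""
    radix_powers = [pow(2, r * i, modulus) for i in range(count)]  # precomputed
    acc = 0
    for word, weight in zip(split_words(x, r, count), radix_powers):
        acc = (acc + word * weight) % modulus
    return acc

def residue_to_binary(digits, base):
    """Weighted sum of mixed-radix digits: U_1 + U_2*p_1 + U_3*p_1*p_2 + ..."""
    x, weight = 0, 1
    for u, p in zip(digits, base):
        x += u * weight      # no modulo here: the output is a plain integer
        weight *= p
    return x

residue = binary_to_residue(0xDEADBEEF, 8, 4, 251)
value = residue_to_binary([1, 2, 3, 4], [13, 15, 16, 17])
```

Both loops have the same multiply-then-accumulate shape, which is the structural observation that allows a single MAC data path to serve the converters as well as the multiplication itself.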

VI. HARDWARE IMPLEMENTATION

A. Dual-Field Addition/Subtraction

A dual-field full-adder (DFA) cell (Fig. 1) is basically a full-adder (FA) cell, equipped with a field-select signal that controls the operation mode [33]. In the proposed implementation, 3-level, carry-lookahead adders (CLA) with 4-bit carry-lookahead generator groups (CLG) are employed [47]. An example of a 4-bit dual-field CLA adder is shown in Fig. 2. The GAP modules generate the signals , , , and AND gates along with a signal control whether to eliminate carries or not. The carry-lookahead generator is an network based on (19) [47].

(19)

Fig. 2. Dual-field CLA.

1) Dual-Field Modular/Normal Addition/Subtraction: With trivial modifications of algorithms for modular addition/subtraction in [3], [4], a dual-field modular adder/subtracter (DMAS) shown in Fig. 3 can be mechanized using CLA adders. When , the circuit is in mode and the output is derived directly from the top adder which performs a addition. When , the circuit may operate either as a normal -bit adder/subtracter or as a modular adder/subtracter .

In the first case, the output is the concatenation of the outputs of the two adders. This is required during residue-to-binary conversion, since (18) dictates that , -bit quantities need to be added recursively via a normal adder.
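A bit-level model of the dual-field full-adder idea is sketched below. It assumes that the field-select signal, here called `fsel`, enables carries when it is 1 (integer mode) and suppresses them when it is 0 (polynomial mode); that polarity, the ripple-carry structure used in place of the carry-lookahead network, and the function names are assumptions of this sketch.

```python
def dfa_cell(a, b, cin, fsel):
    """One dual-field full-adder cell: the carry path is gated by fsel."""
    s = a ^ b ^ (cin & fsel)
    cout = fsel & ((a & b) | (cin & (a ^ b)))
    return s, cout

def dual_field_add(a, b, fsel, width):
    """Ripple connection of DFA cells; returns (sum bits, carry out)."""
    result, carry = 0, 0
    for i in range(width):
        s, carry = dfa_cell((a >> i) & 1, (b >> i) & 1, carry, fsel)
        result |= s << i
    return result, carry

int_sum = dual_field_add(0b1011, 0b0110, 1, 4)    # integer mode: 11 + 6
poly_sum = dual_field_add(0b1011, 0b0110, 0, 4)   # polynomial mode: XOR
```

With carries gated off, every cell degenerates to a two-input XOR, so the same cells serve both fields at the cost of one extra AND per carry path.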

B. Dual-Field Multiplication

A parallel tree multiplier, which is suitable for high-speed arithmetic and requires little modification to support both fields, is considered in the proposed architecture. Regarding input operands, either integers or polynomials, partial product generation is common for both fields, i.e., an AND operation among all operand bits. Consequently, the addition tree that sums the partial products must support both formats. In

(18)


Fig. 3. Dual-field modular/normal adder/subtracter (DMAS).

Fig. 4. Dual-field multiplier (DM).

$GF(2^n)$ mode, if DFA cells are used, all carries are eliminated and only XOR operations are performed among partial products. In $GF(p)$ mode, the multiplier acts as a conventional tree multiplier. A 4 × 4-bit example of the proposed dual-field multiplier (DM) with output in carry-save format is depicted in Fig. 4.
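The shared partial-product generation can be illustrated with a short functional model. It assumes a field-select flag `fsel` (1 for integer mode, 0 for polynomial mode, an assumed polarity) and sums the rows sequentially, whereas a tree multiplier would sum them in a tree and deliver a carry-save output; the function name is hypothetical.

```python
def dual_field_multiply(a, b, fsel, width):
    """Product of two width-bit operands in integer (fsel=1) or GF(2)[x] (fsel=0) mode."""
    acc = 0
    for i in range(width):
        if (b >> i) & 1:
            partial = a << i          # row i: bits of a AND-ed with bit i of b
            if fsel:
                acc = acc + partial   # carries propagate
            else:
                acc = acc ^ partial   # carries eliminated
    return acc

int_product = dual_field_multiply(0b1011, 0b0110, 1, 4)
poly_product = dual_field_multiply(0b1011, 0b0110, 0, 4)
```

The rows are identical in both modes; only the way they are summed differs, which is why the partial-product array can be shared unchanged between the two fields.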

C. Dual-Field Modular Reduction

A final modular reduction by each RNS/PRNS modulus is required, for each multiplication outcome, within each MAC unit. From several modular reduction strategies [3], a method based on careful modulus selection is utilized, since it not only offers efficient implementations but also provides the best unification potential at a low area penalty.

Assume a -bit product that needs to be reduced modulo an integer modulus . By selecting of the form , where the -bit , the modular reduction process can

Fig. 5. Dual-field modular reduction unit (DMR).

be simplified as

(20)

From (20), it is apparent that

(21)

The same decomposition can be applied to polynomials and consequently, if dual-field adders and dual-field multipliers are employed, a dual-field modular reduction (DMR) unit can be mechanized as shown in Fig. 5. The word length of can be limited to a maximum of 10 bits for a base with 66 elements [6].
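Reduction by a carefully selected modulus can be sketched for the integer case as follows. The sketch assumes a modulus of the form $2^r - \mu$ with a small $\mu$, a common choice for this kind of reduction; since the exact form is not legible in the text above, both that form and the names `reduce_special`, `r`, and `mu` should be read as assumptions.

```python
def reduce_special(c, r, mu):
    """Reduce c modulo p = 2**r - mu by folding, using 2**r = mu (mod p)."""
    p = (1 << r) - mu
    mask = (1 << r) - 1
    while c >> r:
        # Split c = hi * 2**r + lo and replace 2**r by mu.
        c = (c >> r) * mu + (c & mask)
    while c >= p:
        c -= p               # final correction
    return c

r, mu = 16, 15               # p = 65521, a hypothetical 16-bit modulus
c = 65520 * 65519            # a 2r-bit product of two residues
reduced = reduce_special(c, r, mu)
```

Each fold costs one multiplication by the short constant `mu` plus one addition, so a narrow `mu` keeps the reduction inexpensive relative to the main multiplier.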

D. MAC Unit

The circuit organization of the proposed MAC unit is shown in Fig. 7. Its operation is analyzed below in three steps, corresponding to the three phases of the calculations it handles, i.e., binary-to-residue conversion, RNS/PRNS Montgomery multiplication, and residue-to-binary conversion.

1) Binary-to-Residue Conversion: Initially, -bit words of the input operands, as implied by (15), are cascaded to each MAC unit and stored in RAM1 at the top of Fig. 7. These words serve as the first input to the multiplier, along with the quantities


Fig. 6. Task distribution in the proposed DRAMM.

TABLE I
NORMALIZED AREA AND DELAY OF THE PROPOSED DRAMM ARCHITECTURE

-bit DF-CLA. -bit. -bit. 3-input -bit DF-CLA.3-input -bit DF-CLA. 2-input -bit DF-CLA. -bit adders. -bit.

or , which are stored in a ROM. Their multiplication produces the inner products of (15) or (17), which are added recursively in the DMAS unit. The result is stored via the bus in RAM1. The process is repeated for the second operand and the result is stored in RAM2, so that when the conversion is finished, each MAC unit holds the residue digits of the two operands in the two RAMs. The conversion requires steps to be executed.

2) Montgomery Multiplication: The first step of the proposed DRAMM is a modular multiplication of the residue digits of the operands. Since these digits are immediately available from the two RAMs, a modular multiplication is executed and the result in is stored in RAM1 for base and RAM2 for base . Step 2 of DRAMM is a multiplication of the previous result with a constant provided by the ROM. The results correspond to inputs of the DBC algorithm and are stored again in RAM1. All MAC units are updated through the bus with the corresponding RNS digits of all other MACs and a DBC process is initiated.

To illustrate the DBC process, a task distribution graph is presented in Fig. 6 for a DRAMM requiring moduli. Two cases are represented; the first corresponds to a fully parallel architecture with units and the second shows how the tasks can be overlapped when only MAC units are available. Each MAC unit has been assigned a different color; thus, in the overlapped case, the color codes signify when a MAC unit performs operations for other units. In the example of Fig. 6, MAC(1) handles MAC(4) and MAC(2) handles MAC(3).

In each cycle, modular additions and multiplications are performed in parallel in each MAC. To depict this, each cycle is split in two parts: the operations on the left correspond to modular additions and those on the right to modular multiplications. The results obtained by each operation are depicted in each cycle (they correspond to Alg. 6), while idle states are denoted by dashed lines. An analysis of the number of clock cycles required, and of how MAC units can be efficiently paired, is presented in the next section.


Fig. 7. The proposed MAC unit.

Fig. 8. The proposed DRAMM architecture.

TABLE II
NORMALIZED AREA AND DELAY OF STANDARD CELLS

The remaining multiplications, additions, and the final DBC operation required by the DRAMM algorithm are computed in the same multiply-accumulate manner and the final residue Montgomery product can be either driven to the I/O interface, or it can be reused by the MAC units to convert the result to binary format.

3) Residue-to-Binary Conversion: Residue-to-binary conversion is essentially a repetition of the DBC algorithm, except for steps 9–14, which are no longer modulo operations. To illustrate the conversion process, assume the generation of the inner products in row 1 of (18). Each product is calculated in parallel in each MAC unit and a “carry-propagation” from MAC(1) to is performed to add all inner products. When summation finishes, the first digit of the result is produced in . In parallel with this “carry-propagation”, the inner products of line 2 are calculated. As soon as a MAC unit completes an addition of carry-propagated inner products for line 1, a new addition for line 2 is performed. The process continues for all lines of (18) and the result is available after steps. The complete DRAMM architecture is depicted in Fig. 8.

VII. PERFORMANCE AND COMPARISONS

A. Area and Delay Estimations

Table I summarizes the area and delay complexity of the proposed architecture, in terms of the parameters . In general, delay and area of a cell shall be denoted with and respectively.

Regarding the area of a -bit, dual-field CLA adder, with 4-bit carry-lookahead generator (CLG) modules and three CLG levels it holds that [47]

(22)

Based on (22) it is easy to show that a bit CLA is approximately 3 times larger than an -bit DFA, which explains the factor 3 regarding the area of the dual-field CLA in Table I [47].

1) Number of Clock Cycles: Assume that parallel MAC units are utilized in a real implementation. To simplify the discussion it is assumed that is a multiple of , i.e., . This means that each one of the first , will provide results for more channels. By construction of the MRC, the DBC process in Fig. 6 requires clock cycles in the fully parallel case. Each channel requires cycles for multiplications and cycles for additions, thus each channel has free slots for multiplications and free slots for additions (idle states in Fig. 6).

Let us assume for simplicity that . The free slots in each will accommodate operations from one , and one , . Since each requires multiplications and additions and each requires cycles for multiplications and cycles for additions, then the cycles required to accommodate the results are . The free slots in each are thus the extra cycles to produce all results in each are . The problem reduces to finding the best combinations for so that the quantity is minimized. The problem can be described in terms of the following pseudo-code:

1: for all do

2: for do

3:

4: end for

5: end for

6: find common

7: for all do

8: match with and such that

9: end for

The pseudo-code calculates all possible values of extra cycles for all combinations of and in step 6 we select as the maximum common value for all . For every distinct combination that satisfies we match the corresponding units until all distinct pairs of units in positions are assigned to a distinct unit in position . The remaining 6 steps of the DRAMM require cycles in total.

2) Memory Requirements: Table III summarizes the ROM requirements of the proposed DRAMM architecture. As far as RAM is concerned, the worst case occurs during DBC and input/


Fig. 9. Normalized complexities for (a) time; (b) area.

Fig. 10. Area time product function .

output conversions. This amounts to a -bit RAM1/2 per MAC unit, thus in total a -bit RAM is required by the proposed DRAMM architecture.

B. Comparisons With RNS Implementations

The proposed architecture introduces the concept of dual-field RNS Montgomery multiplication, which is not supported by existing RNS solutions [6], [7], [11], [13], [38]–[41]. The architecture in [38] requires additional bits in the mantissa of the arithmetic units employed, in order to accurately compute the final modular product. Although the authors in [38] do not provide accurate metrics of the hardware complexity, the bit precision of the proposed MAC unit is equal to the modulus word length, implying a simpler VLSI structure and less hardware. In [7], no hardware details are discussed, while the algorithm proposed there is impractical in real cryptosystem implementations. Data sent from one party to another are in RNS format which, as a concept, poses limitations on the hardware architecture of the communicating parties.

Compared to the most efficient and practical RNS implementations in [6], [11], [13], [41], the proposed architecture further reduces the number of modular multiplications for the base conversion and the RNS-to-binary conversion, as depicted in Table IV. Other algorithms that employ MRC also perform worse. For example, the work in [40] requires modular multiplications while the work in [39] is a predecessor of [7], which also performs worse as shown in Table IV. This is also due to the simplified version of MRC employed in this work that requires

TABLE III
PARAMETERS OF THE PROPOSED DRAMM STORED IN ROM

TABLE IV
NUMBER OF MODULAR MULTIPLICATIONS IN THE DRAMM ALGORITHM

multiplication to implement (4), while [39] requires multiplications for the same conversion [39], [40].

This improvement is due to the choice of MRC instead of the CRT employed in [6], [11], [13], [41], and due to the proposed matrix approach for the output conversion. Finally, compared to [6], [7], [13], [41], base conversion is error-free and does not require extra operations to produce the correct result.

C. Complexity Comparisons With Non-RNS Implementations

None of the RNS implementations presented previously provided comparisons with non-RNS solutions. In the following, a systematic, technology-independent complexity comparison between RNS and non-RNS architectures is attempted for the first time, and a function-based approach for the calculation of the optimal values for the parameters is developed. As common constants for all comparisons, the TSMC 0.13 µm library of standard cells in Table II is employed [24].

The references in Tables V and VI refer to both scalable [28]–[31], [33], [48] and non-scalable architectures [22]–[24]. Regarding the scalable implementations which


TABLE VNORMALIZED TIME AND AREA COMPLEXITY COMPARISONS IN

design with 5-to-2 CSA. design with 4-to-2 CSA. . . . , each PE is bit-sliced.otherwise. . fastest case for time complexity, area calculated by the authors.calculated by the authors in [29]. , . .

encompass multiple PEs, if -bit input operands are employed,then or (if carry-save format is usedor not accordingly), is the number of PEs, and .Table V offers a detailed, technology-independent compar-

ison with the most up-to-date non-RNS solutions. The timecomplexity of the proposed architecture is a function ,calculated according to Table I, while it is a constant for eachof the other works. Selecting and , aplot for the time complexity can be sketched, as shownin Fig. 9(a).The normalized area of the proposed architecture is provided

as a 2-variable function , by selecting a typical valuefor the word length of . The other works provide func-

tions of , , and . A plot of is depicted in Fig. 9(b).The area of the proposed architecture, although larger, is ofcomparable magnitude as the other works. Unfortunately, com-plexity comparisons are not offered for the RNS implementation

TABLE VINORMALIZED AREA-TIME COMPLEXITY COMPARISONS FOR A 1024-BIT

MONTGOMERY MULTIPLICATION (CPA DELAYS INCLUDED)

Only the fastest versions of other works are considered.

in [13]. Instead, only metrics in terms of the number of Adap-tive Logic Modules (ALM) of the Stratix II FPGA are given.


Altera does not provide the number of gates per ALM, thus a direct comparison is infeasible.

TABLE VII. AREA-TIME COMPARISONS FOR 1024-BIT MODULAR EXPONENTIATION

Notes to Table VII: Calculated by the authors. Original value is 12,537 slices. Each slice contains 2 LUTs in the XCV6000 device [49]. Result in CSA format, CPA delay should be included. Equivalent area only for the multiplier. Optimized for area and power. Original value is 4.2 ms using a 4-bit window method exponentiation algorithm [11]. Design does not fit in this device; results are offered though for comparisons with other Xilinx Virtex II designs.

To depict the trade-offs between RNS and non-RNS implementations, the area-time product is given in Fig. 10 as a function for a 1024-bit implementation.

Table VI was created by assigning values to the parameters in Table V used in other works, to obtain numerical values for the number of cycles, normalized delay and area. The normalized areas are calculated for the complete designs and the total execution time is calculated as the product cycles × critical path, all under the common TSMC 0.13 μm technology.

Based on the derived functions , one can parameterize the proposed RNS architecture for minimum delay, area, or area-time product, according to the requirements.
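The parameter selection described above can be sketched as a small exhaustive search. The sketch assumes that the delay and area functions are available as Python callables of the number of moduli channels and the modulus word length. The names `select_parameters`, `delay_fn`, `area_fn` and `feasible`, as well as the toy cost models `toy_delay` and `toy_area`, are hypothetical placeholders and do not reproduce the functions derived in this paper.

```python
from itertools import product


def select_parameters(delay_fn, area_fn, channels, word_lengths,
                      objective="area_time", feasible=lambda c, w: True):
    """Return the (channels, word_length) pair that minimizes the objective.

    objective is one of "delay", "area" or "area_time".
    feasible(c, w) lets the caller reject pairs that violate a design
    constraint (for example, an insufficient dynamic range).
    """
    costs = {
        "delay": lambda c, w: delay_fn(c, w),
        "area": lambda c, w: area_fn(c, w),
        "area_time": lambda c, w: delay_fn(c, w) * area_fn(c, w),
    }
    cost = costs[objective]
    candidates = [(c, w) for c, w in product(channels, word_lengths)
                  if feasible(c, w)]
    if not candidates:
        raise ValueError("no feasible configuration")
    return min(candidates, key=lambda cw: cost(*cw))


# TOY cost models, for illustration only (NOT the paper's functions).
def toy_delay(c, w):
    return 100 / c + w


def toy_area(c, w):
    return c * w * w
```

With these toy models and the candidate sets [2, 4] and [8, 16], the search returns (4, 8) for minimum delay and (2, 8) for both minimum area and minimum area-time product. An exhaustive search is adequate here because the candidate sets are small.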

D. Area-Time-Power Comparisons

To verify the complexity comparisons presented in the previous paragraphs, a prototype of the proposed DRAMM architecture was synthesized for a Xilinx Virtex 4 (XC4VFX60) FPGA device using VHDL. Post-layout results were obtained from the Xilinx 14.4 ISE tool [50] and are presented in Table VII for three combinations of . Note that the last configuration for , would not fit in the FPGA device, so it was optimized for area and power minimization, while the other two were optimized for timing performance using the corresponding strategies of the Xilinx 14.4 ISE tool.

In cases where other works reported area in terms of CLB slices, the equivalent area in terms of LUT was calculated according to the Xilinx FPGA datasheets [49]. Exponentiation execution times were calculated assuming that a single exponentiation requires Montgomery multiplications. Works in [13], [37], [40] provide results for precisions up to 512-bit, thus direct comparisons are infeasible. The works in [39], [51] were also not included since no experimental results are presented. Finally, the work in [41] was omitted since it offers speed and area results for the fully-parallel case of , which cannot be fairly compared with the proposed configuration that employs time-sharing among MAC units. Additionally, its area results are offered in terms of μm², which is incompatible with the values offered by other works.

Finally, power consumption measurements were performed using the XPower Analyzer functionality of the Xilinx 14.4 ISE tool [50]. For the three configurations of the proposed DRAMM in Table VII the corresponding consumptions are 1,656/588 mW ( , ), 1,265/598 mW ( , ) and 1,076/595 mW ( , ) respectively, where the first value is the total consumption and the second is the leakage power. From the considered works only [31] reports that the complete unit draws 69/23 mW. In terms of power/throughput efficiency (throughput is in terms of exponentiations/sec) the corresponding values are 14.9 ( , ), 7.61 ( , ) and 4.75 ( , ) for the proposed architecture and 1.14 for [31].

Table VII is in accordance with the complexity comparisons in Table V. The results show that the proposed architecture outperforms existing implementations in terms of total execution time for a modular exponentiation, with an overhead in area. However, when considering RNS implementations, other useful RNS properties like the proposed dual-field design, fault-detection [14] and, more recently, immunity against hardware-fault attacks [52] should be taken into account.

Both Table VI and Table VII are sorted from the fastest to the slowest design for easy comparisons. The results of the complexity analysis are in accordance with the results obtained from synthesis, since five out of nine references appear in the same ranking ([11] can be ignored since no complexity results are provided). Deviations exist mainly due to the fan-out factor, which is not included in the complexity analysis, and due to the completely different characteristics of each implementation platform, even between FPGA packages of the same FPGA family [49].
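Since the exponentiation times in Table VII are obtained from a count of Montgomery multiplications per exponentiation, the following Python sketch shows how such a count arises for the generic left-to-right binary method over plain integers. It is an illustration only: it does not use RNS, it is not the exponentiation schedule of the proposed DRAMM, and the names `mont_mul` and `mont_exp` are assumptions. The count it returns depends on the exponent and on the method (windowed methods, for instance, use fewer multiplications), so it should not be read as the figure assumed in this paper.

```python
def mont_mul(a, b, n, r_bits, n_prime):
    """Montgomery product a*b*R^-1 mod n, with R = 2**r_bits and a, b < n."""
    mask = (1 << r_bits) - 1
    t = a * b
    m = ((t & mask) * n_prime) & mask   # m = t * (-n^-1) mod R
    u = (t + m * n) >> r_bits           # t + m*n is divisible by R
    return u - n if u >= n else u


def mont_exp(base, exponent, n):
    """Return (base**exponent mod n, number of Montgomery multiplications)."""
    if n < 3 or n % 2 == 0:
        raise ValueError("modulus must be odd and at least 3")
    if exponent < 0:
        raise ValueError("exponent must be non-negative")
    r_bits = n.bit_length()
    r = 1 << r_bits
    n_prime = (-pow(n, -1, r)) % r      # -n^-1 mod R (Python 3.8+)
    count = 0

    # Enter the Montgomery domain: x = base * R mod n.
    x = mont_mul(base % n, (r * r) % n, n, r_bits, n_prime)
    count += 1
    acc = r % n                         # Montgomery form of 1

    for bit in bin(exponent)[2:]:
        acc = mont_mul(acc, acc, n, r_bits, n_prime)    # square
        count += 1
        if bit == "1":
            acc = mont_mul(acc, x, n, r_bits, n_prime)  # multiply
            count += 1

    # Leave the Montgomery domain.
    result = mont_mul(acc, 1, n, r_bits, n_prime)
    count += 1
    return result, count
```

For example, `mont_exp(7, 10, 13)` returns `(4, 8)`: four squarings, two multiplications for the set bits of the exponent, and two domain conversions. The total execution time then follows as this count multiplied by the time of a single Montgomery multiplication.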

VIII. CONCLUSION

The mathematical framework and a flexible, dual-field, residue arithmetic architecture for Montgomery multiplication in GF(p) and GF(2^n) are developed, and the necessary conditions for the system parameters (number of moduli channels, modulus word length) are derived. The proposed DRAMM architecture supports all operations of Montgomery multiplication in GF(p) and GF(2^n), residue-to-binary and binary-to-residue conversions, MRC for integers and polynomials, and dual-field modular exponentiation and inversion, in the same hardware. Generic complexity and real performance comparisons with state-of-the-art works demonstrate the potential of residue arithmetic exploitation in Montgomery multiplication.

REFERENCES

[1] I. Blake, G. Seroussi, and N. Smart, Elliptic Curves in Cryptography. Cambridge, U.K.: Cambridge Univ. Press, 2002.

[2] D. Hankerson, A. Menezes, and S. Vanstone, Guide to Elliptic Curves Cryptography. New York, NY, USA: Springer-Verlag & Hall/CRC, 2004.

[3] J.-P. Deschamps, Hardware Implementation of Finite-Field Arithmetic. New York, NY, USA: McGraw-Hill, 2009.

[4] R. Lidl and H. Niederreiter, Introduction to Finite Fields and Their Applications. New York, NY, USA: Cambridge Univ. Press, 1986.

[5] R. Rivest, A. Shamir, and L. Adleman, “A method for obtaining digital signatures and public-key cryptosystems,” Commun. ACM, vol. 21, pp. 120–126, Feb. 1978.

[6] S. Kawamura, M. Koike, F. Sano, and A. Shimbo, “Cox-Rower architecture for fast parallel Montgomery multiplication,” in EUROCRYPT’00: Proc. 19th Int. Conf. Theory and Application of Cryptographic Techniques, 2000, pp. 523–538.

[7] J.-C. Bajard and L. Imbert, “Brief contributions: A full RNS implementation of RSA,” IEEE Trans. Comput., vol. 53, no. 6, pp. 769–774, Jun. 2004.

[8] D. Schinianakis, A. Fournaris, H. Michail, A. Kakarountas, and T. Stouraitis, “An RNS implementation of an elliptic curve point multiplier,” IEEE Trans. Circuits Syst. I, vol. 56, no. 6, pp. 1202–1213, Jun. 2009.

[9] A. Halbutoğullari and Ç. K. Koç, “Parallel multiplication in using polynomial residue arithmetic,” Designs, Codes and Cryptography, vol. 20, no. 2, pp. 155–173, Jun. 2000.

[10] M. G. Parker and M. Benaissa, “ multiplication using polynomial residue number systems,” IEEE Trans. Circuits Syst. II, vol. 42, no. 11, pp. 718–721, Nov. 1995.

[11] H. Nozaki, M. Motoyama, A. Shimbo, and S.-I. Kawamura, “Implementation of RSA algorithm based on RNS Montgomery multiplication,” in Proc. 3rd Int. Workshop on Cryptographic Hardware and Embedded Systems (CHES ’01), 2001.

[12] J.-C. Bajard, L. Imbert, and G. A. Jullien, “Parallel Montgomery multiplication in using Trinomial Residue Arithmetic,” in IEEE Symp. Computer Arithmetic, 2005, vol. 0, pp. 164–171.

[13] N. Guillermin, “A high speed coprocessor for elliptic curve scalar multiplications over ,” in Cryptographic Hardware and Embedded Systems, CHES 2010, 2010, pp. 48–64, Lecture Notes in Computer Science 6225.

[14] F. J. Taylor, “Residue arithmetic: A tutorial with examples,” IEEE Computer, vol. 17, pp. 50–62, May 1988.

[15] J. Omura and J. Massey, “Computational method and apparatus for finite field arithmetic,” U.S. Patent 4,587,627, May 6, 1986.

[16] R. Mullin, I. Onyszchuk, S. Vanstone, and R. Wilson, “Optimal normal bases in ,” Discrete Applied Mathematics, vol. 22, pp. 149–161, 1988.

[17] G. Agnew, R. Mullin, and S. Vanstone, “An implementation of elliptic curve cryptosystems over ,” IEEE J. Select. Areas Commun., vol. 11, no. 5, pp. 804–813, Jun. 1993.

[18] R. Schroeppel, H. Orman, S. W. O’Malley, and O. Spatscheck, “Fast key exchange with elliptic curve systems,” in CRYPTO ’95: Proc. 15th Annu. Int. Cryptology Conf. Advances in Cryptology, 1995, pp. 43–56.

[19] Ç. K. Koç and T. Acar, “Montgomery multiplication in ,” Designs, Codes and Cryptography, vol. 14, no. 1, pp. 57–69, Apr. 1998.

[20] D. Schinianakis, A. Kakarountas, T. Stouraitis, and A. Skavantzos, “Elliptic curve point multiplication in using Polynomial Residue Arithmetic,” in Proc. IEEE Int. Conf. Electronics, Circuits, and Systems (ICECS 2009), 2009, pp. 980–983.

[21] A. Skavantzos and F. J. Taylor, “On the polynomial residue number system,” IEEE Trans. Signal Process., vol. 39, no. 2, pp. 376–382, Feb. 1991.

[22] C. McIvor, M. McLoone, and J. McCanny, “Modified Montgomery modular multiplication and RSA exponentiation techniques,” IEE Proc. Computers and Digital Techniques, vol. 151, no. 6, pp. 402–408, Nov. 2004.

[23] M. Sudhakar, R. Kamala, and M. Srinivas, “A bit-sliced, scalable and unified Montgomery multiplier architecture for RSA and ECC,” in IFIP Int. Conf. Very Large Scale Integration, 2007, VLSI-SoC 2007, Oct. 2007, pp. 252–257.

[24] M.-D. Shieh, J.-H. Chen, H.-H. Wu, and W.-C. Lin, “A new modular exponentiation architecture for efficient design of RSA cryptosystem,” IEEE Trans. VLSI Syst., vol. 16, pp. 1151–1161, Sep. 2008.

[25] M. Huang, K. Gaj, S. Kwon, and T. El-Ghazawi, “An optimized hardware architecture for the Montgomery multiplication algorithm,” in Proc. Practice and Theory in Public Key Cryptography, PKC’08, 2008, pp. 214–228.

[26] W. Freking and K. Parhi, “Performance-scalable array architectures for modular multiplication,” in Proc. IEEE Int. Conf. Application-Specific Systems, Architectures, and Processors, 2000, pp. 149–160.

[27] P. Amberg, N. Pinckney, and D. Harris, “Parallel high-radix Montgomery multipliers,” in Proc. 42nd Asilomar Conf. Signals, Systems and Computers, Oct. 2008, pp. 772–776.

[28] N. R. Pinckney and D. M. Harris, “Parallelized radix-4 scalable Montgomery multipliers,” Integr. Circuits Syst., vol. 3, no. 1, pp. 39–45, Nov. 2008.

[29] M.-D. Shieh and W.-C. Lin, “Word-based Montgomery modular multiplication algorithm for low-latency scalable architectures,” IEEE Trans. Comput., vol. 59, no. 8, pp. 1145–1151, Aug. 2010.

[30] A. Tenca and Ç. K. Koç, “A scalable architecture for modular multiplication based on Montgomery’s algorithm,” IEEE Trans. Comput., vol. 52, no. 9, pp. 1215–1221, Sep. 2003.

[31] D. Harris, R. Krishnamurthy, M. Anders, S. Mathew, and S. Hsu, “An improved unified scalable radix-2 Montgomery multiplier,” in Proc. IEEE Symp. Computer Arithmetic, 2005, vol. 0, pp. 172–178.

[32] A. Satoh and K. Takano, “A scalable dual-field elliptic curve cryptographic processor,” IEEE Trans. Comput., vol. 52, pp. 449–460, Apr. 2003.


[33] E. Savaş, A. Tenca, and Ç. Koç, “A scalable and unified multiplier architecture for finite fields and ,” in Proc. Cryptographic Hardware and Embedded Systems CHES 2000, 2000, pp. 277–292, Lecture Notes in Computer Science 1965.

[34] E. Savaş, A. Tenca, M. Ciftcibasi, and Ç. Koç, “Multiplier architectures for and ,” IEE Proc. Computers and Digital Techniques, vol. 151, no. 2, pp. 147–160, Mar. 2004.

[35] R. Satzoda and C.-H. Chang, “A fast kernel for unifying and Montgomery multiplications in a scalable pipelined architecture,” in Proc. IEEE Int. Symp. Circuits and Systems, 2006, 4 pp.

[36] J.-H. Hong and C.-W. Wu, “Radix-4 modular multiplication and exponentiation algorithms for the RSA public-key cryptosystem,” in Proc. 2000 Asia and South Pacific Design Automation Conf., 2000, pp. 565–570.

[37] J.-Y. Lai and C.-T. Huang, “Elixir: High-throughput cost-effective dual-field processors and the design framework for elliptic curve cryptography,” IEEE Trans. VLSI Syst., vol. 16, no. 11, pp. 1567–1580, Nov. 2008.

[38] K. Posch and R. Posch, “Modulo reduction in residue number systems,” IEEE Trans. Parallel Distrib. Syst., vol. 6, no. 5, pp. 449–454, May 1995.

[39] J. Bajard, L.-S. Didier, and P. Kornerup, “An RNS Montgomery modular multiplication algorithm,” IEEE Trans. Comput., vol. 47, no. 7, pp. 766–776, 1998.

[40] Y. Tong-jie, D. Zi-bin, Y. Xiao-Hui, and Z. Qian-jin, “An improved RNS Montgomery modular multiplier,” in Proc. 2010 Int. Conf. Computer Application and System Modeling (ICCASM), 2010, vol. 10, pp. V10-144–V10-147.

[41] F. Gandino, F. Lamberti, G. Paravati, J. Bajard, and P. Montuschi, “An algorithmic and architectural study on Montgomery exponentiation in RNS,” IEEE Trans. Comput., vol. 61, no. 8, pp. 1071–1083, 2012.

[42] H. M. Yassine and W. Moore, “Improved mixed-radix conversion for residue number system architectures,” IEE Proc. G (Circuits, Devices and Systems), vol. 138, no. 1, pp. 120–124, Feb. 1991.

[43] P. Montgomery, “Modular multiplication without trial division,” Mathematics of Computation, vol. 44, no. 170, pp. 149–161, 1985.

[44] D. Schinianakis and T. Stouraitis, “A RNS Montgomery multiplication architecture,” in Proc. IEEE Int. Symp. Circuits and Systems, 2011, pp. 1167–1170.

[45] D. Schinianakis, A. Skavantzos, and T. Stouraitis, “ Montgomery multiplication using polynomial residue arithmetic,” in Proc. IEEE Int. Symp. Circuits and Systems, 2012, pp. 3033–3036.

[46] R. J. McEliece, Finite Field for Scientists and Engineers. Norwell, MA, USA: Kluwer Academic, 1987.

[47] M. Ercegovac and T. Lang, Digital Arithmetic. San Francisco, CA, USA: Morgan Kaufmann, 2004.

[48] M. Huang, K. Gaj, and T. El-Ghazawi, “New hardware architectures for Montgomery modular multiplication algorithm,” IEEE Trans. Comput., vol. 60, no. 7, pp. 923–936, Jul. 2011.

[49] Xilinx Inc., Xilinx Data Sheets, 2012 [Online]. Available: http://www.xilinx.com/support/documentation/index.htm

[50] Xilinx Downloads, Xilinx Inc., 2012 [Online]. Available: http://www.xilinx.com/support/download/index.htm

[51] J. Großschädl, “A bit-serial unified multiplier architecture for finite fields and ,” in Proc. 3rd Int. Workshop on Cryptographic Hardware and Embedded Systems, CHES ’01, 2001, pp. 202–219.

[52] D. Schinianakis and T. Stouraitis, “Hardware-fault attack handling in RNS-based Montgomery multipliers,” in Proc. IEEE Int. Symp. Circuits and Systems, 2013.

Dimitrios Schinianakis (M’12) received the diploma in electrical and computer engineering from the Electrical and Computer Engineering Department, University of Patras, Greece, in 2005 and his Ph.D. from the same department in 2013.

His research interests include computer arithmetic, cryptography, cryptanalysis and VLSI hardware design.

Dr. Schinianakis was the recipient of a student paper award from IEEE in 2006, and an award for a tutorial presented in ICECS 2009. He has authored more than 15 conference and journal papers. He regularly reviews for conferences like IEEE ISCAS and ICECS, among others, and for the Elsevier Journal of Systems and Software. He is a Member of the IEEE and a member of the Technical Chamber of Greece.

Thanos Stouraitis (F’07) received the Ph.D. in electrical engineering from the University of Florida, Gainesville, FL, USA, the M.Sc. degree from the University of Cincinnati, Cincinnati, OH, USA, the M.S. degree from the University of Athens, Athens, Greece, and the B.S. degree in physics from the University of Athens.

He is a Professor of the Electrical and Computer Engineering Department, University of Patras, Patras, Greece, where he directs the Signal and Image Processing Laboratory. He has served as a member of the Administrative Committee of the University of Sterea Hellas, Greece. He has served on the faculty of The Ohio State University and has visited the University of Florida and Polytechnic University in New York. He has authored more than 150 technical papers, several books and book chapters. He holds one patent on DSP processor design. He has delivered several keynote addresses to conferences, and has led various DSP processor design projects funded by the European Union, American organizations, and the Greek government and industry. His current research interests include signal and image processing systems, application-specific processor technology and design, computer arithmetic, and design and architecture of optimal digital systems.

Dr. Stouraitis is an IEEE Fellow for his contributions in digital signal processing architectures and computer arithmetic. He has served as Regional Editor for Europe for the Journal of Circuits, Systems, and Computers, as Associate Editor for several IEEE Transactions, and as a consultant for industry. He regularly reviews for the IEEE SP, C&S, C, and Education Transactions, for IEE Proceedings E, F, G, and for conferences like IEEE ISCAS, ICASSP, VLSI, SiPS, Computer Arithmetic, Euro DAC, etc. He also reviews proposals for NSF, the European Commission, and other agencies. He is President-elect, IEEE Circuits and Systems Society. He served as general chair of IEEE International Conference on Electronics, Circuits, and Systems (ICECS 2010), CAS-FEST 2010, IEEE ISCAS 2006, the IEEE International Conference on Electronics, Circuits, and Systems (ICECS ’96), the IEEE Workshop on Signal Processing Systems (SiPS 2005), and the IEEE International Symposium on Wireless Pervasive Computing (ISWPC 2008). He has served as the Chair of the VLSI Systems and Applications (VSA) Technical Committee and as a member of the DSP and the Multimedia Technical Committees of the IEEE Circuits and Systems Society. He is a founder and Chair of the IEEE Signal Processing chapter in Greece. He was the Technical Program Chair of Eusipco ’98, ICECS ’99, and VLSI-SoC 2008. He has served as Chair or as a member of the Technical Program Committees of a multitude of IEEE Conferences, including ISCAS (Program Committee Track Chair). He received the 2000 IEEE Circuits and Systems Society Guillemin-Cauer (best paper) Award for the paper “Multi-Function Architectures for RNS Processors” and a number of conference best paper awards.
