[ieee 2009 12th euromicro conference on digital system design, architectures, methods and tools...

One dimensional Systolic Inversion Architecture Based on Modified GF(2^k) Extended Euclidean Algorithm

A. P. Fournaris and O. Koufopavlou

Electrical and Computer Engineering Department, University of Patras, Patras, GREECE

[email protected]

Abstract— Most handheld devices require a small chip area without sacrificing computational power, which poses an interesting problem for expensive, computationally intensive operations such as GF(2^k) inversion, which is widely used in cryptography. This paper addresses this problem by proposing a systolic inversion architecture for GF(2^k) fields. The architecture is based on an extended analysis of an optimized version of the Modified Extended Euclidean Algorithm (OMEEA) that uses signal reusability and simplification of the control signals, with regard to hardware design, to make the inversion process less complex. The proposed one dimensional systolic inversion architecture based on OMEEA was measured in terms of number of hardware components, latency and critical path delay, and it compares favorably with other well known designs, thus demonstrating the efficiency of the analysis of the OMEEA algorithm.

Keywords-Computations in Finite Fields, Finite Field inversion, VLSI design, Systolic array design, Cryptography.

I. INTRODUCTION

GF(2^k) fields provide a powerful mathematical framework for many computer applications including cryptography [1], error coding and signal processing [2]. Many cryptographic algorithms, like Elliptic Curve Cryptography or AES, use inversion extensively for data encryption-decryption. However, inversion is the most complex of all GF(2^k) field operations since it requires many intermediate calculations, including multiplication and addition-subtraction, and needs many computational steps to give a result [3].

On the other hand, the increasing tendency for embedding security hardware modules into mobile, handheld devices creates new challenges. Many such applications demand strong security – cryptographic features and therefore require a considerable amount of hardware resources to cover that demand. However, in a handheld environment the restrictions posed on hardware resources are very strict and have to do with chip covered area and power dissipation.

As a result of the above observations, optimizing the GF(2^k) inversion operation, used in many security hardware solutions, would improve the overall performance of such a solution and make it compliant with possible hardware restrictions.

Many algorithms have been proposed for GF(2^k) field inversion [3]. The Extended Euclidean Algorithm (EEA) [1] is a well known, widely accepted algorithm of this kind. However, its inability to give results in a constant number of iterations has led to the development of Modified versions of EEA (MEEA) [4], [5], [6], [7], [8], [9] that solve that problem. Not all of those algorithms give efficient hardware designs in terms of space and time complexity since, in order to solve the constant iteration number problem, they employ complex control logic and a high number of control signals. This creates problems in the hardware architectural design, such as increased critical path delay, high latency and a high gate count.

One modified algorithmic approach of the EEA (OMEEA) seems to offer interesting potential in terms of gate count and critical path delay. This approach, as proposed in [10], was only introduced briefly, however, leaving many uncertainties concerning its appropriate use. The OMEEA, by reusing some signals of an already existing EEA and simplifying the control logic with special concern for hardware design, manages to reduce the control logic of inversion. Based on the OMEEA, a two dimensional architecture was proposed in [10] for achieving high speed and throughput. However, in applications with a restricted chip area, a two dimensional architectural approach offers few benefits. Due to the high number of gates in the two dimensional architecture of [10], the tradeoff between area and speed (high area versus high speed) becomes unacceptable. As a result, a different approach to the algorithm of [10] is needed if it is to be used in applications with restricted hardware resources.

In this paper, the work of [10] is extended and the transition from the OMEEA algorithmic form to a real hardware realization is analyzed in depth. Through the use of Finite State Machine diagrams, further optimizations of OMEEA are found that lead to a one dimensional systolic architecture with a small number of gates and no increase in critical path delay. When compared with similar inversion designs, the proposed one dimensional architecture proved to be very competitive.

2009 12th Euromicro Conference on Digital System Design / Architectures, Methods and Tools

978-0-7695-3782-5/09 $25.00 © 2009 IEEE

DOI 10.1109/DSD.2009.161

The paper is organized as follows. In Section II, GF(2^k) inversion is described and the EEA and MEEA algorithms are analyzed. In Section III, the optimized MEEA (OMEEA) is described and in Section IV it is analyzed. In Section V, the proposed systolic inversion architecture is presented. Performance analysis, in terms of speed and covered area, and comparisons are presented in Section VI, and Section VII concludes the paper.

II. INVERSION IN GF(2^k) FIELDS

The GF(2^k) field is isomorphic to GF(2)[x]/(F(x)), where F(x) is a monic irreducible polynomial of degree k of the form

$$F(x) = x^k + \sum_{i=0}^{k-1} f_i x^i$$

with coefficients f_i ∈ {0,1}.

According to the polynomial basis representation, an element a(x) of a GF(2^k) field is a polynomial of degree less than or equal to k-1, defined over the basis {x^(k-1), …, x^3, x^2, x, 1} with coefficients a_i ∈ {0,1}, where x is a root of the irreducible polynomial F(x). This can be written as

$$a(x) = \sum_{i=0}^{k-1} a_i x^i = a_{k-1}x^{k-1} + a_{k-2}x^{k-2} + \dots + a_1 x + a_0.$$

In hardware, an element a(x) can be represented as a binary vector of its coefficients (a_(k-1), a_(k-2), …, a_1, a_0). Addition becomes a modulo 2 addition of the coefficients of the field elements; it is identical to subtraction and can be implemented by an XOR operation.
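As a minimal sketch of this representation (the helper names `to_bits` and `gf2_add` are illustrative and do not come from the paper), a field element can be held as a Python integer whose bit i is the coefficient of x^i, so that addition is a bitwise XOR:

```python
def to_bits(coeffs):
    """Pack coefficients [a_0, a_1, ..., a_{k-1}] into an integer (bit i = a_i)."""
    value = 0
    for i, c in enumerate(coeffs):
        value |= (c & 1) << i
    return value


def gf2_add(a, b):
    """Add (or, equivalently, subtract) two field elements: coefficient-wise XOR."""
    return a ^ b


a = to_bits([1, 1, 0, 1])   # x^3 + x + 1
b = to_bits([0, 1, 1, 0])   # x^2 + x
s = gf2_add(a, b)           # x^3 + x^2 + 1
```

Adding `b` to `s` again returns `a`, which illustrates why addition and subtraction coincide in these fields.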

If $a(x)\,a^{-1}(x) \equiv 1 \pmod{f(x)}$, then $a^{-1}(x)$ is called the multiplicative inverse of $a(x)$.
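As a small worked example (chosen here for illustration, not taken from the paper), let k = 2, f(x) = x^2 + x + 1 and a(x) = x. Since x^2 ≡ x + 1 modulo f(x),

$$x\,(x+1) = x^2 + x \equiv (x+1) + x = 1 \pmod{x^2+x+1},$$

so the multiplicative inverse of x in this field is x + 1.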

A. Inversion in GF(2^k) fields and Extended Euclidean Algorithm

The Extended Euclidean Algorithm (EEA) [1], [7], which calculates the GCD of two numbers, can be used, through proper initialization, for the calculation of $a^{-1}(x)$.

Extended Euclidean Algorithm for Inversion (EEA)

Initialization: S(-1) = f(x), R(-1) = a(x), U(-1) = 1, V(-1) = 0, i = 0

while R(i-1) ≠ 0 repeat

1. Q = S(i-1) div R(i-1)
2. R(i) = S(i-1) - Q⋅R(i-1)
3. U(i) = V(i-1) - Q⋅U(i-1)
4. S(i) = R(i-1)
5. V(i) = U(i-1)
6. i = i + 1

Output: V(i-1) = $a^{-1}(x)$
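The EEA listing above can be sketched in software as follows. This is a minimal illustration, assuming polynomials over GF(2) are stored as Python integers (bit i is the coefficient of x^i); the helper names `deg`, `poly_divmod`, `poly_mul` and `eea_inverse` are hypothetical and are not part of the paper.

```python
def deg(p):
    """Degree of a GF(2) polynomial stored as an int (deg(0) = -1)."""
    return p.bit_length() - 1


def poly_divmod(s, r):
    """Quotient and remainder of s / r over GF(2); r must be nonzero."""
    q = 0
    while deg(s) >= deg(r):
        shift = deg(s) - deg(r)
        q ^= 1 << shift
        s ^= r << shift
    return q, s


def poly_mul(a, b):
    """Carry-less product of two GF(2) polynomials (no modular reduction)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def eea_inverse(a, f):
    """Inverse of a(x) modulo an irreducible f(x), following steps 1 to 6."""
    s, r, u, v = f, a, 1, 0
    while r != 0:
        q, rem = poly_divmod(s, r)      # steps 1 and 2
        u, v = v ^ poly_mul(q, u), u    # steps 3 and 5 (subtraction is XOR)
        s, r = r, rem                   # step 4
    return v


inv = eea_inverse(0b10, 0b111)          # inverse of x modulo x^2 + x + 1
```

The number of loop iterations depends on the input, which is the property that makes this form awkward for a systolic design.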

In the EEA algorithm, four types of operations are performed in every round:

• Division of the S and R variables. Its remainder is S-Q⋅R and is used as the R value of the following step and its quotient Q is needed for the calculation of V-Q⋅U

• Multiplication for the calculation of V-Q⋅U

• Subtraction operation which is identical to addition in GF(2k) fields

• Swap operation where the values of R and U are exchanged with S and V respectively

While the swap and subtraction operations have trivial computational complexity, multiplication and especially division are complex, time consuming operations. Additionally, the number of repetitions until R = 0 (when a result is reached) is not constant, and probing the progress of R at every round is difficult to implement. Therefore, EEA in this form is unsuitable for hardware architectures where a small critical path delay and high throughput are needed. Techniques like pipelining cannot be employed efficiently, and systolic design, which could dramatically decrease the critical path delay and increase throughput, is impossible because of the non-constant loop number.

To avoid the above problems, modified versions of the EEA have been proposed [5], [6], [7], [8]. Those algorithms use a different division process that finds the “remainders” S-Q·R and V-Q·U without calculating the quotient Q. The basic characteristics of all modified versions of EEA (MEEA) are the use of values with k+1 bit length and the use of a special integer d as an estimator of the algorithm’s calculation progress, especially of the calculation’s proximity to a zero remainder. In all the MEEA, the non-constant loop number problem is solved (there are 2k repetitions independent of the input) and therefore systolic architectures can be easily designed.

However, while the time and space complexity of the division in MEEA is reduced and multiplication is essentially avoided, the high complexity of the control logic remains unsolved. The d variable that estimates the progress of the division, marking the number of steps needed for one division, is represented in hardware as an Up/Down Counter or an adder-subtracter [7], [8]. The number of nested if statements in MEEA, each with different control signals, is high, thus increasing the control complexity. All of the above contribute to the overall space and time complexity of a resulting architecture through an increase in Multiplexers (MUX), extra signals (resulting in more latches or flip-flops) and extra gates. The number of shifted values is high and the shifting is dependent on the control signals, meaning that a value is not always shifted. All those factors make the control design of an inversion architecture complex, leading to a considerable increase in gate count and critical path delay.


III. THE OMEEA INVERSION ALGORITHM

Among the solutions offered for the MEEA control logic complexity is the approach proposed in [10]. The goals set for the optimization offered in OMEEA were the reduction of the shifted signals, shifting at every round independently of any control signal, and the reduction and simplification of the control signals, especially of the d signal.

More specifically, inspection of MEEA reveals some basic principles that are in effect in many steps of the algorithm. Instead of shifting two different values depending on the most significant bit of the same polynomial, the shifting is applied to only one polynomial, t(x), through appropriate algorithmic redesign. That way, in every round of the algorithm there is one 1-bit left shift of t(x), regardless of the value stored in it, and the number of shifted signals is reduced by one.

An approach was also proposed in [10] concerning the estimation of the division’s progress. Instead of increasing and reducing d, as done in all MEEA, d is increased at every round (d = d + 1). When t_k = 1, instead of reducing d, the sign of d is inverted so that d < 0. A new 1-bit control signal (sign) was introduced for that operation. When t_k = 1 the new “sign” is simply the inverse of the previous one. That way, d can be represented as a 1-bit left/right shift register dependent on the signal “sign”. The OMEEA, as proposed in [10], is presented below:

Optimized Modified Extended Euclidean Algorithm (OMEEA(a(x), f(x)) Function)

Input: a(x), f(x)
Output: a^(-1)(x) mod f(x)
Initialization: r(x) = f(x), t(x) = x·a(x), u(x) = x, v(x) = 0, d = 1, sign = 1 (meaning d > 0)

For i = 1 to 2k - 1 do
  1. If d = 0 then
       sign = 1
       1.1 If t_k = 1 then
             1.1.1 u(x) = v(x) – u(x)
           else
             1.1.2 u(x) ↔ v(x)
           end if
     else
       1.2 If t_k = 1 then
             1.2.1 v(x) = v(x) – u(x)
           end if
     end if
  2. If (t_k = 1 and sign = 1) then
       2.1 t(x) ↔ r(x)
       2.2 sign = not sign
     end if
  3. If t_k = 1 then
       3.1 t(x) = t(x) – r(x)
     end if
  4. t(x) = x · t(x)
  5. If sign = 1 then
       5.1 u(x) = x · u(x)
     else
       5.2 u(x) = u(x)/x
     end if
  6. d = d + 1

Output: u(x) = a^(-1)(x) mod f(x)

It must be noted, that the signal “sign” characterizes also the state of the t(x) shifting. It can also be used to determine whether or not to swap t(x) and r(x) when tk=1 and to shift left or right u(x).

The control logic of OMEEA is much simpler than that of MEEA. The number of nested if statements is reduced and the control signals interfere less with the computational flow of the algorithm than in MEEA. The number of shifted values is reduced, and the left shifting of t(x) and the increase of d are performed at every round regardless of the control signals.

One of the problems of MEEA is the signal d, since an Up/Down Counter or an adder-subtracter is needed in order to calculate the d value at every round. In OMEEA a different approach is proposed, using positive and negative integers. In OMEEA, d is only checked for its sign and for whether it is zero or not. This check can be implemented by shifting d one bit left when sign = 1 (d > 0) and one bit right when sign = 0 (d < 0) in each round, and by checking the least significant bit of d in each round, using the coding of Table I. Note that the vacant position created by a shifting operation is filled with either 0 or 1 depending on the value of the sign (sign = 0 or sign = 1 respectively).

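TABLE I. TYPICAL VALUES FOR D SIGNAL

|       | d = {d_k, d_(k-1), ..., d_1, d_0} | sign | shifting    | d_0 |
|-------|-----------------------------------|------|-------------|-----|
| d = 0 | {000000…000……000000}              | 1    | 1 bit left  | 0   |
| d > 0 | {000000…000……011111}              | 1    | 1 bit left  | 1   |
| d < 0 | {000000…000……001111}              | 0    | 1 bit right | 1   |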
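The coding of Table I can be sketched as a small shift-register model. This is one reading of the table, under the assumption that the register holds the magnitude of d as a run of ones starting at the least significant bit; the function names `shift_d` and `d_is_zero` and the 8-bit width are hypothetical.

```python
def shift_d(reg, sign, width):
    """One round of the d register.

    sign = 1: shift left and fill the vacant LSB with 1.
    sign = 0: shift right and fill the vacant MSB with 0.
    """
    mask = (1 << width) - 1
    if sign == 1:
        return ((reg << 1) | 1) & mask
    return reg >> 1


def d_is_zero(reg):
    """d = 0 exactly when the least significant bit d_0 is 0."""
    return (reg & 1) == 0


reg = 0b1                      # d = 1 after initialization
reg = shift_d(reg, 1, 8)       # sign = 1: register becomes 0b11
reg = shift_d(reg, 0, 8)       # sign = 0: register becomes 0b1
reg = shift_d(reg, 0, 8)       # sign = 0: register becomes 0, so d = 0
```

With this coding the zero test reads a single bit, so no counter, adder or comparator is involved.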

IV. PROPOSED OMEEA ANALYSIS

The basic actions performed during one round of the OMEEA algorithm are the exchange between two polynomials (t(x) and r(x), or u(x) and v(x)), the 1-bit left or right shifting of u(x), and polynomial GF(2^k) subtraction.

The most demanding arithmetic operation is polynomial GF(2^k) subtraction between either v(x) and u(x) or t(x) and r(x). However, this operation is computationally easy since, by the definition of GF(2^k), it is a modulo 2 polynomial addition and can be implemented as a bit vector XOR operation. As a result, all subtractions in OMEEA can be replaced by additions.
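The per-round actions described above map onto very simple bit-vector operations. The following is a minimal sketch, assuming (k+1)-bit registers stored as Python integers; the degree `K = 4`, the helper names and the example values are hypothetical.

```python
K = 4                          # hypothetical field degree: registers hold K + 1 bits
MASK = (1 << (K + 1)) - 1


def msb(p):
    """Coefficient of x^K (the bit called t_k in the text)."""
    return (p >> K) & 1


def shift_left(p):
    """Multiply by x; bits above x^K are dropped (an assumption of this sketch)."""
    return (p << 1) & MASK


def shift_right(p):
    """Divide by x, assuming the constant term is 0."""
    return p >> 1


t, r = 0b10011, 0b11001
if msb(t):
    t ^= r                     # subtraction over GF(2) is a bitwise XOR
t = shift_left(t)              # t(x) = x * t(x)
```

None of these operations propagates a carry, which is why each bit position can be handled by its own processing element.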


Inspecting the algorithm reveals that the exchange between t(x) and r(x) happens only when t_k = 1. So, after the exchange, the polynomial r(x) will always have degree(r(x)) = k. Considering that the initial value of r(x) is f(x) (degree(f(x)) = k) and that r(x) is never shifted or otherwise manipulated in any algorithmic process, it can be concluded that r(x) is always a polynomial of degree k. As a result, when the if statement in step 2 of the algorithm is true, the if statement of step 3 will always be true, since after the exchange between t(x) and r(x) the new t(x), equal to the old r(x), will have degree k. Delving further into this issue reveals that there is only a small difference between the actions taken for t_k = 1 when sign = 0 and when sign = 1. In both cases the resulting t(x) is equal to t(x) + r(x), since $t(x) - r(x) = r(x) - t(x) = t(x) + r(x)$. The only difference between sign = 0 and sign = 1 is the value of the resulting r(x), which is equal to r(x) or t(x) respectively. Note also that after the execution of step 2.2 the sign is always 0, since the if statement is true only when sign = 1. As a result, after the execution of any step where t_k = 1 the resulting sign is always equal to zero.
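The observation above can be checked with a short sketch that applies steps 2 and 3 literally and compares the result with a single merged decision. It assumes polynomials are stored as integers and that r(x) always has degree k, as argued in the text; both function names are hypothetical.

```python
def steps_2_3_as_listed(t, r, sign, k):
    """Steps 2 and 3 of OMEEA applied one after the other."""
    if (t >> k) & 1 and sign == 1:      # step 2: exchange and invert sign
        t, r = r, t
        sign = 0
    if (t >> k) & 1:                    # step 3: t_k is read after the exchange
        t ^= r
    return t, r, sign


def steps_2_3_merged(t, r, sign, k):
    """Same effect in one decision, using t - r = r - t = t + r."""
    if (t >> k) & 1:
        return t ^ r, (t if sign == 1 else r), 0
    return t, r, sign


k = 3
same = all(
    steps_2_3_as_listed(t, r, s, k) == steps_2_3_merged(t, r, s, k)
    for t in range(1 << (k + 1))
    for r in range(1 << k, 1 << (k + 1))   # r always has degree k
    for s in (0, 1)
)
```

The merged form needs the sign only to select the new r(x), which matches the difference between the two cases noted above.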

The above remarks can be observed comprehensively if we introduce a more detailed notation and structure of the algorithm as a Finite State Machine (FSM) like diagram. Let’s assume that the algorithm is currently on round i and that the round number is indicated as a superscript on each polynomial. Then, all the polynomial values deriving from the previous algorithmic round will have a superscript value of (i-1) while the current round resulting values will have a superscript value of (i). Following the above notation and defining each possible state of one algorithmic round separately, the FSM diagram of Figure 1 can be created. Note that the 1 bit left–right shifting of u(x) has been blended with the state’s other calculations.

Figure 1. Finite State Machine of OMEEA.

In Figure 1, the transition from one state to the next is well defined and the actions performed in every state can be viewed in detail. A no-operation state (state “next round”) has been introduced in order to describe the transition to the next algorithmic round. Note that for d ≠ 0 and t_k = 1, regardless of the sign’s value, the actions performed in those states are the same apart from r(x). Therefore, an optimized FSM taking this into account is presented in Figure 2.

Figure 2. Optimized FSM for OMEEA.

V. PROPOSED SYSTOLIC ARCHITECTURE

Using the FSM analysis of OMEEA along with the coding of d proposed in [10], a hardware architecture can be proposed. Since our basic goal is to provide a low area design, the proposed architecture is chosen to be of one dimensional systolic form.

From Figures 1 and 2, three control values can be identified: the sign, the least significant bit of d and the most significant bit of t(x). Using those values as input, a control unit can generate a series of control signals in order to manage the data flow in the proposed architecture. More specifically, from Figure 2, three control signals can be derived:

$$
\begin{aligned}
a &= \mathrm{sign}^{(i-1)} \wedge t_k^{(i-1)} \\
b &= d_0^{(i-1)} \vee t_k^{(i-1)} \\
c &= d_0^{(i-1)} \oplus t_k^{(i-1)}
\end{aligned}
$$

where ∧, ∨ and ⊕ are the logical AND, OR and XOR operations.
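The three control signals amount to three two-input gates. A minimal sketch follows, with a hypothetical function name `control_signals` and one-bit integer inputs taken from the previous round:

```python
def control_signals(sign, d0, tk):
    """Control signals of the Type II processing element (one gate each)."""
    a = sign & tk      # AND
    b = d0 | tk        # OR
    c = d0 ^ tk        # XOR
    return a, b, c


# Truth table over all input combinations, as (sign, d0, tk) -> (a, b, c).
table = {
    (sign, d0, tk): control_signals(sign, d0, tk)
    for sign in (0, 1)
    for d0 in (0, 1)
    for tk in (0, 1)
}
```

Since the inputs are the sign, the least significant bit of d and the most significant bit of t(x), the signals are computed once per round and shared by all processing elements.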

In the proposed architecture, two types of Processing Elements (PE) can be identified. Type I PEs are responsible for the inversion calculations, and Type II PEs are responsible for both the generation of the a, b, c control signals and the inversion calculations. The two types of PEs are shown in Figure 3. The additional cost of a Type II PE is three gates.

The proposed one dimensional systolic inversion architecture consists of k+1 PEs, one for the calculation of each bit of the inversion result as indicated by OMEEA, where each value is k+1 bits long. Since there are 2k-1 repetitions of the main loop of OMEEA, 2k-1 clock cycles are required for the architecture to produce a correct output. The control signals need to be computed only once per round of OMEEA; therefore, there are k PEs of Type I and only one PE of Type II.

The proposed one dimensional systolic inversion architecture is presented in Figure 4. The control logic of the proposed architecture is very simple: it consists of only three gates in the Type II PE and k+1 two-to-one multiplexers (MUX) for the calculation of d. No counters or adders are needed, unlike in the MEEA. The control signals are broadcast to the whole systolic array.

Figure 4. The proposed one dimensional systolic GF(2^k) inversion architecture

Figure 3. The PEs of the proposed systolic inversion architecture (Type I Processing Element and Type II Processing Element)


VI. PERFORMANCE AND COMPARISONS

The critical path delay of the proposed systolic inversion architecture can be extracted from Fig. 1. If T_AND is the delay of an AND gate, T_XOR the delay of an XOR gate and T_MUX the delay of a MUX, the critical path is T_AND + 2T_XOR, assuming that T_MUX < T_XOR. The proposed systolic inversion architecture requires 2k-1 clock cycles to complete one inversion. These measurements, along with details about the space and time complexity of the proposed systolic inversion architecture, are shown in Table II, where comparisons are made with other well known inversion architectures.

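TABLE II. AREA AND TIME COMPLEXITY COMPARISONS OF ONE DIMENSIONAL SYSTOLIC ARCHITECTURES.

| Architecture  | Guo [9]                  | Daneshbeh [11]        | Proposed       |
|---------------|--------------------------|-----------------------|----------------|
| OR            | 0                        | 0                     | 1              |
| NOT           | 0                        | 0                     | 0              |
| AND           | 26k                      | 14k                   | 5k+6           |
| XOR           | 11k                      | 6k                    | 3k+4           |
| MUX           | 35k+2                    | 20k                   | 7k+7           |
| Flip Flops    | 46k + 4k⌈log(k+1)⌉       | 36k                   | 5k+5           |
| Other         | adder, zero check        | -                     | -              |
| Lat. (div)    | 8k-1                     | 5k-4                  | 2k-1           |
| Crit. Path    | 2T_XOR + 2T_AND + 2T_MUX | T_XOR + 2T_AND + T_MUX | T_XOR + 2T_AND |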

Our proposed architecture has the smallest critical path delay compared to [9] and [11]. The gate count of the proposed design is also considerably smaller than that of the other designs. The numbers of MUXes and flip-flops are smaller than in the other designs, and by summing all the hardware components in each compared architecture it can be noted that our proposed architecture has the smallest number of hardware components. Also, the latency of the proposed systolic inversion architecture is much smaller than in [9] and [11].
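To make the summation concrete, the component counts of Table II for the proposed design and for [11] can be evaluated at a given field size. This is only an illustrative sketch: the field size k = 163 is an assumed example, the function names are hypothetical, and all component types are weighted equally, which ignores their differing silicon cost.

```python
def proposed_counts(k):
    """Component counts of the proposed design, as listed in Table II."""
    return {"OR": 1, "NOT": 0, "AND": 5 * k + 6, "XOR": 3 * k + 4,
            "MUX": 7 * k + 7, "FF": 5 * k + 5}


def daneshbeh_counts(k):
    """Component counts of the design of [11], as listed in Table II."""
    return {"OR": 0, "NOT": 0, "AND": 14 * k, "XOR": 6 * k,
            "MUX": 20 * k, "FF": 36 * k}


def total_components(counts):
    """Sum of all listed components, each counted as one unit."""
    return sum(counts.values())


k = 163                                          # assumed example field size
total_proposed = total_components(proposed_counts(k))    # 20k + 23
total_ref11 = total_components(daneshbeh_counts(k))      # 76k
latency_proposed = 2 * k - 1
latency_ref11 = 5 * k - 4
```

For this value of k the totals are 3283 components against 12388, and 325 clock cycles against 811.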

VII. CONCLUSIONS

In this paper, a systolic inversion architecture was proposed. This architecture is based on an extended analysis of an optimized version of the MEEA algorithm that, using signal reusability and simplification of the control signals with regard to hardware design, manages to make the inversion process less complex. The proposed systolic inversion architecture was measured in terms of number of hardware components, latency and critical path delay, and it compares favorably with other well known designs, thus demonstrating the efficiency of the analysis of the OMEEA algorithm.

REFERENCES

[1] A. Menezes, P. C. van Oorschot and S. A. Vanstone, “Handbook of Applied Cryptography”, CRC Press, 1997.

[2] E. D. Mastrovito, “VLSI architectures for computations in Galois fields”, PhD Thesis, Linköping Univ., Sweden, 1991.

[3] M. Olofsson, “VLSI Aspects on Inversion in Finite Fields”, PhD Thesis, Linköping Univ., Sweden, 2002.

[4] J.-H. Guo and C.-L. Wang, “Hardware Efficient Systolic Architecture for Inversion and Division in GF(2^k)”, IEE Proc. on Computer and Digit. Techniques, pp. 272-278, 1998.

[5] C.-H. Wu, C.-M. Wu, M.-D. Shieh and Y.-T. Hwang, “High-speed, low complexity Systolic designs of novel iterative division algorithms in GF(2^m)”, IEEE Transactions on Computers, vol. 53, no. 3, pp. 375-380, March 2004.

[6] Z. Yan, D. V. Sarwate and Z. Liu, “High-speed Systolic Architectures for Finite Field Inversion”, Integration, the VLSI Journal, vol. 38, issue 3, pp. 382-398, January 2005.

[7] H. Brunner, A. Curiger and M. Hofstetter, “On Computing Multiplicative Inverses in GF(2^k)”, IEEE Transactions on Computers, vol. 42, no. 8, pp. 1010-1015, August 1993.

[8] J.-H. Guo and C.-L. Wang, “Systolic Array Implementation of Euclid’s Algorithm for Inversion and Division in GF(2^m)”, IEEE Transactions on Computers, vol. 47, no. 10, pp. 1161-1167, October 1998.

[9] C.-H. Wu, C.-M. Wu, M.-D. Shieh and Y.-T. Hwang, “Systolic VLSI Realization of a Novel Iterative Division Algorithm Over GF(2^k): A High-speed, Low-Complexity Design”, in proc. of Int’l Sympos. on Circuits and Systems (ISCAS’01), pp. 33-36, May 2001.

[10] A. P. Fournaris and O. Koufopavlou, “A Systolic Inversion Architecture Based on Modified Extended Euclidean Algorithm for GF(2^k) Fields”, in proc. 13th IEEE International Conference on Electronics, Circuits and Systems (ICECS 2006), Nice, France, December 10-13, 2006.

[11] K. Daneshbeh and M. A. Hasan, “A Class of Unidirectional bit serial systolic architectures for Multiplicative Inversion and Division over GF(2^m)”, IEEE Transactions on Computers, vol. 54, no. 3, pp. 370-380, March 2005.
