High-Speed RSA Implementations


  • 8/3/2019 High Speed RSA Implementations

    1/86

    Master's Thesis

    High-Speed RSA Implementation
    for FPGA Platforms

    Thomas Wockinger

    Institute for Applied Information Processing and Communications
    Graz University of Technology

    under the guidance of
    Ao.Univ.-Prof. Dipl.-Ing. Dr.techn. Karl-Christian Posch
    and
    Univ.-Ass. Dipl.-Ing. Dr.techn. Johannes Wolkerstorfer
    as the contributing advisor

    Graz, January 2005


    Abstract

    Providing cryptographic infrastructure for thousands of users on a server requires enormous computing power. It therefore makes sense to use hardware accelerator cards. In this context, the thesis treats the implementation of RSA on FPGA devices. The entire design, from the software front-end to the hardware back-end, is shown. By implementing standard software interfaces such as JCE and JCA, existing programs do not need to be modified and can nevertheless profit from the improved performance. The basic operation for computing the RSA algorithm is modular exponentiation, which can be computed by repeated modular multiplication. There exist many hardware-based implementations of RSA on FPGA platforms. These are restricted to architectures not operating on the full word size, because the resulting hardware complexity for word sizes greater than 1024 bits is too large. Since the development of FPGA devices allows implementing ever more complex circuits, it is possible today to implement architectures with a higher radix as well. The presented multiplier architecture operates at full word size and additionally employs higher-radix multiplication to speed up operation. Since the FPGA resources needed for a high-radix implementation are enormous, suitable measures must be applied to achieve high clock rates. It is therefore necessary to tailor the architecture to FPGA structures.


    Kurzfassung

    (Translated from German.) To provide a centralized cryptographic infrastructure for thousands of users on a server, enormous computing power is required. It is therefore sensible to relieve these servers with hardware accelerator cards. In this context, this thesis treats the implementation of RSA on FPGAs. The entire design, from the software front-end to the hardware back-end, is shown. Through the implementation of software interfaces such as JCE and JCA, existing programs do not need to be changed and can nevertheless profit from the performance. The basic mathematical operation in RSA is modular exponentiation, which in turn can be computed by repeated modular multiplication. There already exists a multitude of hardware-based RSA implementations for FPGAs; however, these do not operate on the full word size. The reason is the enormous area requirement for word sizes beyond 1024 bits. Since the development of FPGAs advances steadily, it is possible today to implement architectures with a higher radix as well. The presented architecture is parameterizable, so that different word sizes and radices can be synthesized without changing the hardware description. Since the area requirement for a high-radix implementation is enormous, suitable measures must be taken to maximize the clock rate. For this, it is necessary to use optimized hardware structures of an FPGA.


    Acknowledgments

    I would like to thank Hannes Wolkerstorfer, who is an outstanding advisor. Besides his work he always took the time to discuss different design approaches, to answer my emails, and to let me profit from his experience.

    Special thanks to my parents for supporting me during the years of study.

    As well I would like to thank my girlfriend Susi for her understanding during the work on this thesis (ILU).

    I would also like to thank Klaus Schwarzenberger for the time spent correcting this thesis several times.

    Last but not least, I want to thank all my friends in Graz, especially from the great Eheim, for sharing their time with me.


    Contents

    List of Figures
    List of Tables
    List of Algorithms
    List of Abbreviations

    1 Introduction
      1.1 Motivation
      1.2 Objectives
      1.3 Structure of this Thesis

    2 Problem Analysis and Theoretical Background
      2.1 Public-Key Cryptography
        2.1.1 Advantages of Public-Key Cryptography
        2.1.2 Security of the Private Key
      2.2 RSA
        2.2.1 Generating Keys
        2.2.2 Encryption
        2.2.3 Decryption
      2.3 Security of RSA
        2.3.1 The Selection of p and q
        2.3.2 The Selection of e
        2.3.3 The Selection of d
      2.4 Signatures with RSA
      2.5 Fast Modular Exponentiation
      2.6 Modular Multiplication
        2.6.1 Montgomery Multiplication
        2.6.2 Orup's Optimization
      2.7 Multiplication
      2.8 High-Radix Multiplication
        2.8.1 Redundant Number Representation
        2.8.2 Modified Booth's Recoding
        2.8.3 Carry-Save Adders

    3 Design Methodology
      3.1 Design Flow
        3.1.1 State of the Art Implementations
        3.1.2 High-Level Model
        3.1.3 Hardware Description Language
        3.1.4 Simulation
        3.1.5 Synthesis
      3.2 Software
        3.2.1 The Layer Pattern
        3.2.2 Software Architecture
      3.3 Hardware
        3.3.1 Control Path and Data Path
        3.3.2 Synchronizing Clock Domains
        3.3.3 Architecture of FPGAs
        3.3.4 Xilinx FPGAs
        3.3.5 Designing for FPGAs
        3.3.6 High-Performance Optimizations

    4 Implementation
      4.1 Software
        4.1.1 JCE and JCA
        4.1.2 JNI Interface
        4.1.3 Device Driver
      4.2 Hardware
        4.2.1 RSA Chip
        4.2.2 Control Unit
        4.2.3 Montgomery Multiplier
        4.2.4 Xilinx Single-Port RAM
        4.2.5 Parallel-Loadable Shift Register
        4.2.6 PCI Interface
        4.2.7 High-Radix Multiplier
        4.2.8 Wallace Tree
        4.2.9 Modified Wallace Tree
        4.2.10 Ripple-Carry Adder

    5 Results
      5.1 Word Size and Radix
      5.2 Performance and Throughput
      5.3 Comparison with other Implementations

    6 Conclusion
      6.1 Summary
      6.2 Future Work

    Bibliography


    List of Figures

    2.1  Multiplication of two binary numbers in dot notation
    2.2  Radix-4 multiplication in dot notation
    2.3  Operand selection for radix-4 multiplication
    2.4  Radix-4 partial product generation with Booth recoding
    2.5  Conversion of a ripple-carry adder into a carry-save adder
    2.6  6-to-2 CSA reduction tree
    2.7  Radix-4 multiplication with carry-save adder
    3.1  FPGA design flow
    3.2  Comparison between VHDL and Verilog
    3.3  Hardware synthesis
    3.4  Examples for the layer pattern
    3.5  Software architecture
    3.6  Separating hardware into control path and data path
    3.7  Clock-to-output delay of flip-flops
    3.8  Meta-stable states of flip-flops
    3.9  Signal synchronization with double synchronizer
    3.10 Architecture of the Xilinx Spartan II
    3.11 Xilinx Spartan II CLB implementation
    3.12 Xilinx Spartan II slice implementation
    4.1  Architecture of the RSA chip
    4.2  Architecture of the control unit
    4.3  State diagram of the control unit
    4.4  IRQ handshake circuit
    4.5  Architecture of the Montgomery multiplier
    4.6  Timing diagram of the PCI write transaction
    4.7  Timing diagram of the PCI read transaction
    4.8  Schematic of the interface
    4.9  Architecture of the high-radix multiplier


    List of Tables

    2.1  Radix-2 Booth's recoding scheme
    2.2  Radix-4 Booth's recoding scheme
    2.3  Radix-8 Booth's recoding scheme
    2.4  Basic Boolean operations
    4.1  State encoding formats
    4.2  Operation modes of the Montgomery multiplier
    5.1  Xilinx Xc2s200 synthesis results
    5.2  Xilinx Xc2v3000 synthesis results
    5.3  Xilinx Xc2v4000 synthesis results
    5.4  Xilinx Xc2v6000 synthesis results
    5.5  Xilinx Xc2v8000 synthesis results
    5.6  Xilinx Xc2s200 throughput results
    5.7  Xilinx Xc2v3000 throughput results
    5.8  Xilinx Xc2v4000 throughput results
    5.9  Xilinx Xc2v6000 throughput results
    5.10 Xilinx Xc2v8000 throughput results
    5.11 Throughput results with average exponent
    5.12 Throughput results with F4 exponent (2^16 + 1)


    List of Algorithms

    1  Square and multiply, right to left
    2  Square and multiply, left to right
    3  Square and multiply, optimized version
    4  Montgomery multiplication algorithm, version 1
    5  Montgomery multiplication algorithm, version 2
    6  Montgomery multiplication algorithm, version 3
    7  Montgomery multiplication algorithm, version 4
    8  Binary right-shift multiplication algorithm
    9  Binary left-shift multiplication algorithm
    10 Radix-r right-shift multiplication algorithm
    11 Radix-r left-shift multiplication algorithm


    List of Abbreviations

    AMD    Advanced Micro Devices
    API    Application Programming Interface
    CLB    Configurable Logic Block
    CMOS   Complementary Metal Oxide Semiconductor
    CPU    Central Processing Unit
    CSA    Carry-Save Adder
    DLL    Dynamic Link Library
    DLLs   Delay-Locked Loops
    DPRAM  Dual-Port Random-Access Memory
    FA     Full Adder
    FIFO   First In, First Out
    FPGA   Field-Programmable Gate Array
    GCHQ   Government Communications Headquarters (British intelligence service)
    HDL    Hardware Description Language
    ISR    Interrupt-Service Routine
    IOB    Input/Output Block
    JCA    Java Cryptography Architecture
    JCE    Java Cryptography Extension
    JSP    Java Server Pages
    LSB    Least Significant Bit
    LUT    Lookup Table
    MFC    Microsoft Foundation Classes
    MIT    Massachusetts Institute of Technology
    MSB    Most Significant Bit
    PCI    Peripheral Component Interconnect
    PKCS   Public-Key Cryptography Standards
    RAM    Random-Access Memory
    RSA    Asymmetric cryptosystem by Ronald L. Rivest, Adi Shamir, and Leonard Adleman
    SPRAM  Single-Port Random-Access Memory
    SSL    Secure Socket Layer
    VHDL   VHSIC Hardware Description Language
    VLSI   Very Large Scale Integration


    Chapter 1

    Introduction

    1.1 Motivation

    Public-key cryptography is used in our daily life. It is particularly important for setting up secure connections over insecure networks like the Internet. It allows the exchange of authenticated data and the generation of digital signatures. The applied mathematical operations are very time-consuming. If the RSA algorithm is used as cryptographic back-end, the basic mathematical operation is modular exponentiation. To achieve appropriate security, the used word size should not be less than 1024 bits. Modular exponentiation is achieved by repeated modular multiplication. Implementing the RSA algorithm in hardware is typically done by implementing a fast modular multiplier. Designing such a multiplier is always a trade-off between achieved speed and needed area. For example, architectures featuring multipliers with small or moderate word sizes do not need much area, but their computation time is longer because they need more clock cycles to obtain the result. On the other hand, parallel multipliers (such as array multipliers) can obtain the result in one clock cycle, but they are very big in terms of required area. Therefore, it is a good idea to use parameterizable designs. To build parameterizable hardware, the use of a hardware description language (HDL) is indispensable. In addition, the used algorithm must offer the possibility to change the word size and the radix.

    There are many server-side applications which need high-performance encryption and decryption devices, for example a web server providing asymmetric authentication to several thousand users. It is therefore highly recommendable to use special hardware to accelerate the cryptographic computation substantially. Today's FPGAs are already manufactured in 90 nm CMOS processes and offer a large number of flip-flops and logic functions. Therefore, it is possible to implement the RSA algorithm with current word sizes from 1024 bits up to 2048 bits and achieve the throughput rates required for server operation.

    There exist many software front-ends which already provide optimized encryption and decryption of data as well as generation and verification of digital signatures. Well-known Java front-ends are JCE and JCA [LFB+00]. In this work, these interfaces


    were implemented to use the FPGA device as an encryption and decryption chip. This offers the advantage that already existing programs do not need to be modified; nevertheless, they can profit from the improved performance.

    1.2 Objectives

    The goal of this work is a high-speed hardware implementation of the RSA algorithm on an FPGA platform, offering an easy way to change the used word size and radix of the implementation. This leads to parameterizable hardware; therefore, VHDL is used for implementing such a parameterizable design. Thus it is possible to pre-compute as many FPGA configurations as needed, which is a precondition for adapting the RSA implementation to particular needs. This makes an FPGA nearly as flexible as software. To exchange data with the RSA device, an interface (kernel driver) to the operating system is also required. For this application, the use of the layer pattern is clearly recommendable. It offers the advantage that each layer is implemented independently from the others. The whole design flow, from the JCE interface to the hardware description, is shown in this thesis. In order to prove the functionality of the implementation, a proof-of-concept implementation is shown on a Xilinx Spartan II (Xc2s200) device. The objective for this relatively small FPGA is to reach the maximally possible clock frequency of 66 MHz with the greatest possible word size and multiplier radix. In order to be able to compare the performance with other implementations, the circuit is also synthesized on larger Xilinx devices like the Virtex II line. With such high-end FPGAs it should be possible to provide up to a thousand 1024-bit RSA signatures per second.

    1.3 Structure of this Thesis

    The thesis is written in a top down manner, which means that general problems willbe described first and details are discussed later on.

    The next chapter deals with the theoretical background of public-key cryptography. It covers the RSA algorithm, digital signatures, Montgomery multiplication, high-radix multiplication, and much more. It also explains why the chosen algorithm can be implemented very well in hardware.

    The third chapter describes the used design methodology in more detail. Keywords such as synthesis, simulation, and hardware description languages are described in detail. The software design process is treated as well as the hardware design process. In the hardware section, the implementation of complex circuits on an FPGA basis is discussed in particular. The focus lies on high-speed optimization in order to achieve high clock rates and thus a high throughput with these devices.

    The fourth chapter describes the implementation of the RSA hardware accelerator itself. Again, it separately treats software and hardware. Important basic


    elements of the implementation are explained and reasons for design decisions are described explicitly.

    The fifth chapter shows the obtained results and compares them with other implementations. It also tries to point out the pros and cons of the selected implementation.

    The last chapter concludes the work and gives an outlook on possible futureimprovements.


    Chapter 2

    Problem Analysis and Theoretical

    Background

    This chapter is organized in a top-down manner. It starts with public-key cryptography. Then RSA, modular exponentiation, modular multiplication, and fast multiplication are treated. Finally, the usage of redundant number systems with Booth's recoding and carry-save addition is discussed.

    2.1 Public-Key Cryptography

    Before going into the details of public-key cryptography, the difference between asymmetric and symmetric cryptosystems should be explained. If Alice wants to send messages to Bob which are encrypted with a cryptosystem, Alice needs a key e for encryption and Bob the associated key d for decryption. If in the cryptosystem the encryption key e is equal to the appropriate decryption key d, or d can easily be computed from e, we speak of a symmetric cryptosystem or a secret-key system. If such a system is employed, Alice and Bob must exchange the key e at the beginning of the communication over a safe channel and then keep it secret.

    In asymmetric cryptosystems, d and e are different and d cannot be computed in reasonable time from e. If Bob wants to receive encrypted messages, he publishes the encryption key e and keeps the decryption key d secret. Everyone can now use e in order to send an encrypted message to Bob. Therefore, e is also called the public key and d is called the private key. A cryptosystem where the private key cannot be derived from the publicly available data (the public key) is known as a public-key system. Such cryptosystems are also called asymmetric cryptosystems.

    Public-key systems often use two different key spaces, since encoding and decoding keys have different representations. RSA (see section 2.2) uses a public key consisting of a pair (n, e) of natural numbers, while the private key is one natural number d.


    Clifford Cocks developed the first asymmetric algorithm at the GCHQ in the year 1970. The algorithm was reinvented by Rivest, Shamir, and Adleman in the year 1976 at the MIT [RSA79].

    2.1.1 Advantages of Public-Key Cryptography

    The disadvantage when using symmetric coding algorithms is the distribution and administration of secret keys. Before each secret communication, Alice and Bob must exchange a secret key; therefore, a safe channel must be available. This problem becomes increasingly difficult as more participants in a network want to communicate with one another. When using public-key algorithms, the key management becomes substantially simpler, because there is no need to exchange secret keys beforehand.

    A problem of public-key cryptography is the mapping of public keys to a particular person. If someone wants to send an encrypted message to Alice, he has to ensure that the public key he uses really was published by Alice. If an attacker succeeds in replacing the public key of Alice by his own, the attacker can decrypt all messages which were intended for Alice.

    In practice, asymmetric and symmetric cryptography are often combined, because asymmetric cryptography is too slow for bulk encryption. Therefore, asymmetric cryptography is used for the generation and verification of digital signatures and for key establishment, while symmetric cryptography is used for bulk encryption. Examples of such hybrid systems are SSL, OpenSSL, and TLS.

    2.1.2 Security of the Private Key

    A public-key system can only be safe if it is impossible to compute the private keys from the public keys in reasonable time. Today this is ensured by the complexity of the computations needed to solve the corresponding problems of number theory. There is, however, no certain knowledge about the persistence of this situation in the future. It is well known, for example, that quantum computers could make all common public-key systems unsafe. However, this does not imply that such computers can really be built. Therefore, it is absolutely necessary to design security infrastructures in a way that enables the easy replacement of the used cryptographic techniques.

    There is a huge amount of literature about public-key cryptography and its security. A selection can be found in [Buc03, DH76, Sch96, Sti95].

    2.2 RSA

    The RSA cryptosystem is named after its inventors Ron Rivest, Adi Shamir, and Len Adleman. It was the first published public-key system and is still one of the most important ones. Its security is closely associated with the difficulty


    of factorizing large numbers. The following section describes how to use the RSA cryptosystem. Especially key generation, encryption, and decryption are treated in detail.

    2.2.1 Generating Keys

    Alice selects two random prime numbers p and q and computes the product

    n = p · q. (2.1)

    Additionally, Alice selects a natural number e with

    1 < e < φ(n) = (p − 1) · (q − 1) and gcd(e, (p − 1) · (q − 1)) = 1 (2.2)

    and computes a natural number d with

    1 < d < (p − 1) · (q − 1) and d · e ≡ 1 mod (p − 1) · (q − 1). (2.3)

    Since gcd(e, (p − 1) · (q − 1)) = 1, such a number d actually exists. It can be computed with the extended Euclidean algorithm. We also note that e is always odd. The public key consists of the pair (n, e); the private key is d. The number n is called the RSA modulus, e is called the encryption exponent, and d is called the decryption exponent. Nowadays, the word size of the used modulus is 1024 bits up to 2048 bits, so the chosen primes p and q are between 512 bits and 1024 bits. For signatures, 2^16 + 1 is often used as encryption exponent e (see section 2.3.2).
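    The key-generation steps (2.1)–(2.3) can be sketched in a few lines of Python. This is a toy sketch with tiny demo primes and an illustrative function name, not the thesis's implementation; pow(e, -1, phi) (Python 3.8+) computes the modular inverse that the extended Euclidean algorithm provides.

    ```python
    from math import gcd

    def generate_keys(p, q, e=2**16 + 1):
        """Return public key (n, e) and private exponent d for toy primes p, q."""
        n = p * q                    # n = p * q                       (2.1)
        phi = (p - 1) * (q - 1)      # phi(n) = (p - 1)(q - 1)
        assert gcd(e, phi) == 1      # condition (2.2)
        d = pow(e, -1, phi)          # d * e = 1 mod phi(n), i.e. extended Euclid (2.3)
        return (n, e), d

    (n, e), d = generate_keys(61, 53)  # demo primes; real ones are 512-1024 bits
    ```

    With the demo primes above, d is the modular inverse of 65537 modulo 60 · 52 = 3120, and any plaintext round-trips through the key pair.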

    2.2.2 Encryption

    The finite set of plaintexts consists of all natural numbers m with

    2 ≤ m < n. (2.4)

    Zero and one should not be used, because in that case the resulting ciphertext would equal the plaintext. A plaintext m is encrypted with

    c = m^e mod n, (2.5)

    where c is the ciphertext.

    2.2.3 Decryption

    The decryption of RSA is based on the following theorem. If (n, e) is a public key and d is the according private key in the RSA system, then we have

    (m^e)^d mod n = m (2.6)

    for each natural number m with 0 ≤ m < n.

    Proof: Since e · d ≡ 1 mod (p − 1) · (q − 1), there is an integer l such that

    e · d = 1 + l · (p − 1) · (q − 1). (2.7)

    Therefore

    (m^e)^d = m^(e·d) = m^(1 + l·(p−1)·(q−1)) = m · (m^((p−1)·(q−1)))^l. (2.8)

    This equation shows that

    (m^e)^d ≡ m · (m^(p−1))^((q−1)·l) ≡ m mod p (2.9)

    applies. If p is not a divisor of m, this congruence results from Fermat's little theorem. Otherwise the statement is trivial, since both sides of the congruence are 0 mod p. Exactly the same applies to

    (m^e)^d ≡ m mod q. (2.10)

    Because p and q are different prime numbers, we obtain

    (m^e)^d ≡ m mod n. (2.11)

    If c was computed as in equation 2.5, we can reconstruct m by means of

    m = c^d mod n. (2.12)
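    Equations (2.5) and (2.12) can be exercised as a round trip with a tiny hardcoded key pair (the values are the classic textbook example with p = 61, q = 53, not from the thesis); Python's built-in pow performs the modular exponentiation that the thesis implements in hardware.

    ```python
    # Toy key pair: n = 61 * 53 = 3233, e = 17, d = 17^-1 mod 3120 = 2753
    n, e, d = 3233, 17, 2753

    m = 65                       # plaintext, 2 <= m < n              (2.4)
    c = pow(m, e, n)             # encryption:  c = m^e mod n         (2.5)
    assert c == 2790
    assert pow(c, d, n) == m     # decryption:  m = c^d mod n         (2.12)
    ```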

    2.3 Security of RSA

    Finding the secret RSA key is as difficult as factorizing the RSA modulus; the proof can be found in [Buc03, Sch96, Mur94, Sti95]. However, an attacker may also intend to recover the plaintext from the ciphertext, and it is not known whether finding the secret RSA key is necessary for that. But even if it could be proved that breaking RSA is as difficult as factorizing the RSA modulus, this would not automatically mean that RSA is totally secure, since there is no knowledge about the difficulty of factorizing natural numbers.

    2.3.1 The Selection of p and q

    To make the factorization of the RSA modulus n as difficult as possible, both prime factors p and q should be chosen of the same size. Sometimes it is additionally required that p and q are chosen with respect to known factorization algorithms and their computation time, so that a result cannot be obtained in reasonable time. Hence, p and q should be chosen randomly and uniformly distributed. For long-term security, the length of the RSA modulus should be at least 1024 bits. More about key sizes can be found in [LV99].


    2.3.2 The Selection of e

    The exponent e should be chosen in such a manner that the encryption is efficient without decreasing the security. The choice e = 2 is always excluded, because φ(n) = (p − 1) · (q − 1) is even and gcd(e, (p − 1) · (q − 1)) = 1 must hold. Therefore, the smallest possible exponent is three. As shown in [Buc03], with e = 3 a low-exponent attack is going to be successful. A common choice of the encryption exponent is e = 2^16 + 1. This exponent withstands known low-exponent attacks, whilst speeding up encryption significantly.
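    The speed-up comes from the sparse binary representation of 2^16 + 1: square-and-multiply needs one squaring per exponent bit but one multiplication only per set bit, and 65537 = 1 0000 0000 0000 0001 in binary has just two set bits. A quick check of that operation count (the counting below is illustrative, not thesis code):

    ```python
    e = 2**16 + 1
    bits = bin(e)[2:]                  # '1' followed by fifteen '0's and a final '1'

    squarings = len(bits) - 1          # one squaring per bit after the leading one
    multiplies = bits.count('1') - 1   # one multiply per set bit after the leading one

    assert (squarings, multiplies) == (16, 1)
    ```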

    2.3.3 The Selection of d

    The decryption exponent d has to be greater than n^0.292, otherwise the RSA cryptosystem can be broken. The according proof can be found in [BD00].

    2.4 Signatures with RSA

    RSA can also be used to generate digital signatures. The idea is quite simple. Alice signs the document m by applying her decryption function to the document: she computes s = m^d mod n, in which n is the RSA modulus and d the decryption exponent. Bob verifies the signature by applying Alice's encryption function to the signature: he computes s^e mod n, where e is the public key of Alice. If the result is the document m, the signature is valid.
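    This sign/verify flow can be sketched with the same toy key pair as before (illustrative values; as the following paragraphs note, real systems hash and pad the document first):

    ```python
    # Toy key pair: n = 61 * 53 = 3233, e = 17, d = 17^-1 mod 3120 = 2753
    n, e, d = 3233, 17, 2753

    m = 1234                      # the "document" (in practice: its padded hash)
    s = pow(m, d, n)              # Alice signs:   s = m^d mod n
    valid = pow(s, e, n) == m     # Bob verifies:  s^e mod n == m ?
    assert valid
    ```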

    The key generation works the same way as with the RSA cryptosystem.

    There are also some known vulnerabilities when signatures are produced like this. An attacker can apply the no-message attack or can use the multiplicative property of RSA. More about attacks on RSA signatures can be found in [Buc03]. Therefore, a hash function is also applied to the document, or the document is padded with random numbers.

    There are many possibilities to improve the security of signatures. Therefore, cryptographic standards like PKCS #1 were introduced for signatures and other cryptographic applications. They can be found at [rsa04].

2.5 Fast Modular Exponentiation

For fast RSA encryption and decryption, efficient modular exponentiation is crucial. This can be achieved by the square-and-multiply algorithm. The idea behind square and multiply is based on the binary representation of the exponent. The binary representation can be obtained with

e = Σ_{i=0}^{l−1} e_i · 2^i    (2.13)


where l is the word size of e. The modular exponentiation can be expressed as

p = a^e mod m ≡ a^(Σ_{i=0}^{l−1} e_i · 2^i) ≡ Π_{i=0}^{l−1} (a^(2^i))^(e_i) mod m.    (2.14)

So we need to square in each iteration and to multiply only if the i-th bit of the exponent is one. This leads to algorithm 1 and algorithm 2.

Algorithm 1 Square and multiply, right to left
Input: a, e, m
Output: p = a^e mod m
 1: p = 1
 2: y = a
 3: for i = 0 to l−1 do
 4:   if e_i == 1 then
 5:     p = p · y mod m
 6:   end if
 7:   y = y · y mod m
 8: end for
 9: return p

Algorithm 2 Square and multiply, left to right
Input: a, e, m
Output: p = a^e mod m
 1: p = 1
 2: for i = l−1 downto 0 do
 3:   p = p · p mod m
 4:   if e_i == 1 then
 5:     p = p · a mod m
 6:   end if
 7: end for
 8: return p

The advantage of algorithm 2 compared to algorithm 1 is that no additional variables are required for intermediate results. This is very important for the implementation in hardware, because no additional registers are needed. The algorithm can be optimized even further, leading to algorithm 3.

With algorithm 3, the number of multiplications can be reduced if the position of the MSB is lower than the word size of the exponent. The hardware implementation of this algorithm is very simple: the exponent only needs to be shifted until the first 1 occurs, thereafter the square-and-multiply procedure can start. The detailed implementation is shown in chapter 4.2.2.


Algorithm 3 Square and multiply, optimized version
Input: a, e, m
Output: p = a^e mod m
Require: e > 1
 1: p = a
 2: i = l−1
 3: while e_i != 1 do
 4:   i = i−1
 5: end while
 6: for j = i−1 downto 0 do
 7:   p = p · p mod m
 8:   if e_j == 1 then
 9:     p = p · a mod m
10:   end if
11: end for
12: return p
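Algorithm 3 maps directly onto BigInteger, as in the thesis's Java high-level model. This sketch (class and method names are ours) skips the leading zeros of the exponent via bitLength and checks itself against BigInteger.modPow:

```java
import java.math.BigInteger;

/** Left-to-right square-and-multiply (algorithm 3): scan the exponent from the MSB. */
public class SquareAndMultiply {

    /** Computes a^e mod m for e > 1, mirroring the optimized algorithm. */
    static BigInteger modPow(BigInteger a, BigInteger e, BigInteger m) {
        int i = e.bitLength() - 1;      // position of the leading 1 (skips leading 0s)
        BigInteger p = a.mod(m);        // p = a: the first 1 of the exponent is consumed
        for (int j = i - 1; j >= 0; j--) {
            p = p.multiply(p).mod(m);             // square in every iteration
            if (e.testBit(j)) {
                p = p.multiply(a).mod(m);         // multiply only when e_j == 1
            }
        }
        return p;
    }

    public static void main(String[] args) {
        BigInteger a = BigInteger.valueOf(7), e = BigInteger.valueOf(560), m = BigInteger.valueOf(561);
        System.out.println(modPow(a, e, m).equals(a.modPow(e, m))); // true
    }
}
```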

Since we need up to two modular multiplications in each iteration, these operations should be performed as fast as possible. A very fast modular multiplication, without trial division, is achieved by the Montgomery multiplication algorithm, which is discussed in the next section.

    2.6 Modular Multiplication

A trivial version of the modular multiplication p = a · b mod m would compute a · b first and then apply a modular reduction step. Therefore, the following mathematical operation needs to be calculated:

p = a · b − ⌊(a · b) / m⌋ · m    (2.15)

This approach has two essential drawbacks:

- The word size is doubled for the intermediate result if the word sizes of a and b are equal, because the word size of the product is calculated by log2 a + log2 b. When using up to 2048 bits for each operand, with this increase of complexity it is nearly impossible to build fast and small hardware multipliers.

- After each multiplication a modular reduction step is needed. This is usually done by an integer division, which is a complex operation.

    These concerns can be handled by the Montgomery multiplication with someadditional costs for pre- and post-processing.


    2.6.1 Montgomery Multiplication

The Montgomery algorithm [Mon85] calculates MonMul(a, b) = a · b · r^(−1) mod m from the two numbers a and b. In comparison to conventional modular multiplication, an additional factor r^(−1) is introduced. How to compute r is shown later in this section. To compute a · b mod m, some transformations are needed; the reason is shown in equation 2.18. However, these transformations can also be done with the Montgomery algorithm.

    Transformation to the Montgomery Domain

To transform an integer a into the Montgomery domain, it needs to be multiplied by the constant r^2. The result of this multiplication is the transformed number ā:

ā = MonMul(a, r^2) = a · r^2 · r^(−1) mod m = a · r mod m    (2.16)

    Transformation to the Z Domain

The inverse transformation can also be computed by a Montgomery multiplication. In this case, an element of the Montgomery domain needs to be multiplied by the constant 1. The result of this multiplication is the original number a:

a = MonMul(ā, 1) = ā · 1 · r^(−1) mod m = a · r · 1 · r^(−1) mod m = a mod m    (2.17)

    Modular Multiplication using Montgomery Multiplication

With the help of these two transformations, the modular multiplication of two integers a and b can be performed. First the two numbers are transformed into the Montgomery domain, then the multiplication is applied, and afterwards the result is transformed back into the Z domain:

ā = MonMul(a, r^2) = a · r^2 · r^(−1) mod m = a · r mod m
b̄ = MonMul(b, r^2) = b · r^2 · r^(−1) mod m = b · r mod m
c̄ = MonMul(ā, b̄) = ā · b̄ · r^(−1) mod m = a · r · b · r · r^(−1) mod m = a · b · r mod m
c = MonMul(c̄, 1) = c̄ · 1 · r^(−1) mod m = a · b · r · 1 · r^(−1) mod m = a · b mod m    (2.18)
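The transformation chain of equation 2.18 can be replayed with BigInteger, treating MonMul as a black box computing x · y · r^(−1) mod m. The class name and the toy constants m = 497 and r = 2^9 are ours:

```java
import java.math.BigInteger;

/** Montgomery-domain round trip, with MonMul(x, y) = x*y*r^-1 mod m as a black box. */
public class MontgomeryDomain {
    static final BigInteger m = BigInteger.valueOf(497);   // odd modulus (toy size)
    static final BigInteger r = BigInteger.valueOf(512);   // r = 2^9 > m, gcd(r, m) = 1
    static final BigInteger rInv = r.modInverse(m);        // r^-1 mod m
    static final BigInteger r2 = r.multiply(r).mod(m);     // pre-computed constant r^2 mod m

    /** Reference Montgomery product: x*y*r^-1 mod m. */
    static BigInteger monMul(BigInteger x, BigInteger y) {
        return x.multiply(y).multiply(rInv).mod(m);
    }

    public static void main(String[] args) {
        BigInteger a = BigInteger.valueOf(123), b = BigInteger.valueOf(456);
        BigInteger aBar = monMul(a, r2);              // a*r mod m (into the Montgomery domain)
        BigInteger bBar = monMul(b, r2);              // b*r mod m
        BigInteger cBar = monMul(aBar, bBar);         // a*b*r mod m
        BigInteger c = monMul(cBar, BigInteger.ONE);  // back to the Z domain: a*b mod m
        System.out.println(c.equals(a.multiply(b).mod(m))); // true
    }
}
```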


    Montgomery Multiplication Algorithm

Algorithm 4 shows the original Montgomery algorithm, where m is the modulus of the modular multiplication, a and b are the operands, p is the resulting product, k is the number of bits processed in one computation cycle (so this is a radix-2^k version), n is the number of computation cycles needed to complete the modular multiplication, and r and m' are constants needed during the computation. There are two different definitions of the so-called radix found in literature. Some define the radix as 2^k and some as k, where k is the word size. For example: if k = 4, in the first case the radix is 16, and in the second case it is 4. In this thesis the radix is always defined as 2^k.

Algorithm 4 Montgomery multiplication algorithm, version 1
Input: a, b, k, m
Output: p = a · b · r^(−1) mod m and 0 ≤ p < 2·m
Require: m > 2 with gcd(m, 2) = 1
Require: k, n such that 4·m < 2^(k·n)
Require: r, m' with 2^(k·n) · r^(−1) mod m = 1 and −m · m' mod 2^k = 1
Require: a with 0 ≤ a < 2·m
Require: b with b = Σ_{i=0}^{l−1} (2^k)^i · b_i and 0 ≤ b < 2·m
 1: p = 0
 2: for i = 0 to n−1 do
 3:   q = (((p + b_i · a) mod 2^k) · m') mod 2^k
 4:   p = (p + q·m + b_i·a) / 2^k
 5: end for
 6: return p

For calculating all required constants, the word size of the modulus m, the word size k of the radix, and the modulus m itself are needed as input:

4·m < 2^(k·n)
log2(4·m) < k·n
log2 4 + log2 m < k·n
2 + wordsize(m) < k·n
n > (2 + wordsize(m)) / k
n = ⌈(2 + wordsize(m)) / k⌉    (2.19)

r is calculated from 2^(k·n) · r^(−1) mod m = 1 with r = 2^(k·n). m' is the negative inverse of m to the base 2^k and can easily be computed with the extended Euclidean algorithm. The correctness of these algorithms is shown in [Mon85] and


[Oru95]. The advantage of the Montgomery algorithm is the easy computation of q and p, which need to be calculated in each iteration. For q, two multiplications, one addition and two modular reductions are needed. For p, two k-bit multiplications, two additions and one division are needed. The division can be performed as a right-shift operation, because the divisor is a power of two. The modular reduction is very simple too, because the modulus of the reduction is a power of two: to perform a reduction modulo 2^k, only the least significant k bits of the operand need to be taken. To reduce the number of multiplications and modular reductions further, some optimizations are required, which were introduced by Holger Orup [Oru95].
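A word-serial Java sketch of algorithm 4, in the spirit of the thesis's high-level model (class name and the toy parameters m = 239, k = 4, n = 3 are ours). The quotient digit is determined with the negative inverse m', and the result is checked against a · b · r^(−1) mod m computed directly with BigInteger:

```java
import java.math.BigInteger;

/** Word-serial Montgomery multiplication (algorithm 4), radix 2^k. */
public class MontgomeryV1 {

    /** Returns p with p congruent to a*b*2^(-k*n) mod m and 0 <= p < 2*m. */
    static BigInteger monMul(BigInteger a, BigInteger b, BigInteger m, int k, int n) {
        BigInteger base = BigInteger.ONE.shiftLeft(k);            // 2^k
        BigInteger mask = base.subtract(BigInteger.ONE);
        // m' with m*m' = -1 (mod 2^k), obtained from the modular inverse of m.
        BigInteger mPrime = base.subtract(m.modInverse(base));
        BigInteger p = BigInteger.ZERO;
        for (int i = 0; i < n; i++) {
            BigInteger bi = b.shiftRight(i * k).and(mask);        // i-th radix-2^k digit of b
            BigInteger q = p.add(bi.multiply(a)).and(mask).multiply(mPrime).and(mask);
            p = p.add(q.multiply(m)).add(bi.multiply(a)).shiftRight(k);
        }
        return p;   // still in [0, 2m): one conditional subtraction short of a*b*r^-1 mod m
    }

    public static void main(String[] args) {
        BigInteger m = BigInteger.valueOf(239);     // odd toy modulus
        int k = 4, n = 3;                           // 4*m = 956 < 2^(k*n) = 4096
        BigInteger r = BigInteger.ONE.shiftLeft(k * n).mod(m);
        BigInteger a = BigInteger.valueOf(100), b = BigInteger.valueOf(200);
        BigInteger p = monMul(a, b, m, k, n).mod(m);              // final reduction for the check
        System.out.println(p.equals(a.multiply(b).multiply(r.modInverse(m)).mod(m))); // true
    }
}
```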

2.6.2 Orup's Optimization

When using a higher radix in algorithm 4, the result can be computed in fewer cycles. But the computation time for one iteration increases, because the quotient determination requires an addition of l + k bits and two multiplications of k × (l + k) bits. Therefore, it is possible that the total computation time will increase. The major advantage of Orup's optimization is the possibility to pipeline the computation and thus to reduce the logical depth and increase the clock rate. The idea is to reduce the expensive arithmetic operations in each iteration by using additional pre-computed constants during the calculation. Algorithm 5 avoids the multiplication in the quotient determination. For that, the condition m̃ ≡ −1 (mod 2^k) must be fulfilled for all values of the modulus. This is achieved by transforming the constant m to m̃. The new constant m̃ can be calculated by m̃ = (m' mod 2^k) · m.

Algorithm 5 Montgomery multiplication algorithm, version 2
Input: a, b, k, m
Output: p = a · b · r^(−1) mod m and 0 ≤ p < 2·m̃
Require: m > 2 with gcd(m, 2) = 1
Require: k, n such that 4·m̃ < 2^(k·n) where m̃ = (m' mod 2^k) · m
Require: r, m' with 2^(k·n) · r^(−1) mod m = 1 and −m · m' mod 2^k = 1
Require: a with 0 ≤ a < 2·m̃
Require: b with b = Σ_{i=0}^{l−1} (2^k)^i · b_i and 0 ≤ b < 2·m̃
 1: p = 0
 2: for i = 0 to n−1 do
 3:   q = (p + b_i · a) mod 2^k
 4:   p = (p + q·m̃ + b_i·a) / 2^k
 5: end for
 6: return p

In comparison to algorithm 4, the bound on n changed from 4·m < 2^(k·n) to 4·m̃ < 2^(k·n). So it is necessary to recalculate n. m̃ is given by

m̃ = (m' mod 2^k) · m    (2.20)


so the word size of m̃ is calculated by

wordsize(m̃) = wordsize(m' mod 2^k) + wordsize(m) = k + wordsize(m).    (2.21)

Therefore, n is found by

4·m̃ < 2^(k·n)
log2(4·m̃) < k·n
log2 4 + log2 m̃ < k·n
2 + wordsize(m̃) < k·n
2 + wordsize(m) + k < k·n
n > (2 + wordsize(m) + k) / k
n > (2 + wordsize(m)) / k + 1
n = ⌈(2 + wordsize(m)) / k⌉ + 1    (2.22)

So an additional cycle is needed for the computation, but one multiplication is saved in each iteration when computing the quotient q. The next version (algorithm 6) shows how to avoid the addition in the quotient determination. This is done by replacing a with 2^k · a. To compensate for the extra factor 2^k, an additional iteration is needed for the multiplication.

Algorithm 6 Montgomery multiplication algorithm, version 3
Input: a, b, k, m
Output: p = a · b · r^(−1) mod m and 0 ≤ p < 2·m̃
Require: m > 2 with gcd(m, 2) = 1
Require: k, n such that 4·m̃ < 2^(k·n) where m̃ = (m' mod 2^k) · m
Require: r, m' with 2^(k·n) · r^(−1) mod m = 1 and −m · m' mod 2^k = 1
Require: a with 0 ≤ a < 2·m̃
Require: b with b = Σ_{i=0}^{l−1} (2^k)^i · b_i and 0 ≤ b < 2·m̃
 1: p = 0
 2: for i = 0 to n do
 3:   q = p mod 2^k
 4:   p = (p + q·m̃) / 2^k + b_i · a
 5: end for
 6: return p

In this version q is computed from the remainder of the intermediate sum modulo the base 2^k. This equation can easily be implemented in hardware, because


only the lowest k bits need to be taken from the intermediate sum. The addition p + q·m̃ is a very expensive arithmetic operation, because of the possible word size of the two operands. So very fast adders would be necessary to achieve a high-speed multiplication with this algorithm. Therefore, it is a good idea to recompose the statement for p to avoid such an expensive addition. The following equations show the rewritten statement, using q = p mod 2^k:

(p + q·m̃) / 2^k + b_i·a
= ⌊p / 2^k⌋ + (q·m̃ + (p mod 2^k)) / 2^k + b_i·a
= ⌊p / 2^k⌋ + q·(m̃ + 1) / 2^k + b_i·a    (2.23)

    This transformation leads to the final algorithm 7.

Algorithm 7 Montgomery multiplication algorithm, version 4
Input: a, b, k, m
Output: p = a · b · r^(−1) mod m and 0 ≤ p < 2·m̃
Require: m > 2 with gcd(m, 2) = 1
Require: k, n such that 4·m̃ < 2^(k·n) where m̃ = (m' mod 2^k) · m
Require: r, m' with 2^(k·n) · r^(−1) mod m = 1 and −m · m' mod 2^k = 1
Require: a with 0 ≤ a < 2·m̃
Require: b with b = Σ_{i=0}^{l−1} (2^k)^i · b_i and 0 ≤ b < 2·m̃
 1: m̄ = (m̃ + 1) / 2^k
 2: p = 0
 3: for i = 0 to n do
 4:   q = p mod 2^k
 5:   p = ⌊p / 2^k⌋ + q·m̄ + b_i · a
 6: end for
 7: return p
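The final algorithm can likewise be mirrored in Java (class name and toy parameters are ours). Note how the quotient determination shrinks to taking the low k bits, while the constant (m̃+1)/2^k is computed once per modulus; after the n+1 iterations the result is congruent to a · b · 2^(−k·n) mod m, which the check below verifies against BigInteger:

```java
import java.math.BigInteger;

/** Orup's optimized Montgomery multiplication (algorithm 7). */
public class MontgomeryOrup {

    /** Returns p congruent to a*b*2^(-k*n) mod m; quotient determination is "take the low k bits". */
    static BigInteger monMul(BigInteger a, BigInteger b, BigInteger m, int k, int n) {
        BigInteger base = BigInteger.ONE.shiftLeft(k);
        BigInteger mask = base.subtract(BigInteger.ONE);
        BigInteger mPrime = base.subtract(m.modInverse(base));        // m' = -m^-1 mod 2^k
        BigInteger mTilde = mPrime.multiply(m);                       // m~ = (m' mod 2^k)*m, m~ = -1 (mod 2^k)
        BigInteger mBar = mTilde.add(BigInteger.ONE).shiftRight(k);   // (m~+1)/2^k, computed once per modulus
        BigInteger p = BigInteger.ZERO;
        for (int i = 0; i <= n; i++) {                                // n+1 iterations
            BigInteger q = p.and(mask);                               // q = p mod 2^k: no mult, no addition
            BigInteger bi = b.shiftRight(i * k).and(mask);            // i-th radix-2^k digit of b
            p = p.shiftRight(k).add(q.multiply(mBar)).add(bi.multiply(a));
        }
        return p;
    }

    public static void main(String[] args) {
        BigInteger m = BigInteger.valueOf(239);
        int k = 4, n = 4;                                 // 4*m~ < 2^(k*n) must hold
        BigInteger r = BigInteger.ONE.shiftLeft(k * n);
        BigInteger a = BigInteger.valueOf(100), b = BigInteger.valueOf(200);
        BigInteger expected = a.multiply(b).multiply(r.modInverse(m)).mod(m);
        System.out.println(monMul(a, b, m, k, n).mod(m).equals(expected)); // true
    }
}
```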

The advantage of algorithm 7 compared to algorithm 6 is the dataflow independence in each iteration: each term can be computed without affecting the other terms. The constant m̄ must be calculated only once for each modulus m. As mentioned before, the computation of q can easily be done by using only the lowest k bits of the previous intermediate sum. The term ⌊p/2^k⌋ can also be easily computed by a right-shift operation by k bits. The bottlenecks for a high-speed design are the two remaining multiplications and the addition of the three terms to obtain the intermediate sum. Therefore, the next two sections treat the theoretical background of high-speed multiplication and high-radix multiplication. The design and implementation of these essential parts are discussed in chapter 3.


    2.7 Multiplication

The following notation is used in the considerations below.

a  multiplicand      a_{l−1} a_{l−2} ... a_1 a_0
b  multiplier        b_{l−1} b_{l−2} ... b_1 b_0
p  product (a · b)   p_{2l−1} p_{2l−2} ... p_1 p_0

The multiplication of two binary numbers is realized by shifting and adding. Figure 2.1 shows the multiplication of a six-bit by a four-bit binary number in dot notation (see [Par00]). The partial products b_i · a can easily be formed by a logical AND operation. This scheme also applies to other, non-binary multiplications, but then computing the terms b_i · a becomes more difficult and each product will be one digit wider than a. The remaining multi-operand addition stays the same.

    Figure 2.1: Multiplication of two binary numbers in dot notation

Instead of shifting the terms b_i · a, it is easier to shift the cumulative partial product. Therefore, there exist two versions of one-bit-at-a-time multiplication, depending on the order of adding the terms b_i · a. Processing figure 2.1 from top to bottom leads to the right-shift algorithm 8, from bottom to top to the left-shift algorithm 9. In the shown algorithms, l is the word size of the multiplier b.

Algorithm 8 Binary right-shift multiplication algorithm
Input: a, b
Output: p = a · b
 1: p = 0
 2: for i = 0 to l−1 do
 3:   p = (p + b_i · a · 2^l) · 2^(−1)
 4: end for
 5: return p


Algorithm 9 Binary left-shift multiplication algorithm
Input: a, b
Output: p = a · b
 1: p = 0
 2: for i = 0 to l−1 do
 3:   p = 2·p + b_{l−i−1} · a
 4: end for
 5: return p
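Algorithm 9 can be sketched in a few lines of Java (class and method names are ours):

```java
import java.math.BigInteger;

/** Bit-serial left-shift multiplication (algorithm 9): p = 2*p + b_i*a, MSB first. */
public class ShiftAddMultiplier {

    static BigInteger multiply(BigInteger a, BigInteger b) {
        int l = b.bitLength();                  // word size of the multiplier b
        BigInteger p = BigInteger.ZERO;
        for (int i = l - 1; i >= 0; i--) {
            p = p.shiftLeft(1);                 // shift the cumulative partial product
            if (b.testBit(i)) {
                p = p.add(a);                   // add the partial product b_i * a
            }
        }
        return p;
    }

    public static void main(String[] args) {
        BigInteger a = BigInteger.valueOf(45), b = BigInteger.valueOf(13);
        System.out.println(multiply(a, b)); // 585
    }
}
```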

    2.8 High-Radix Multiplication

As mentioned before, there is no consistent definition of the multiplication radix in literature. In this thesis the radix is defined as 2^k, where k is the number of bits which are processed in one iteration during the multiplication. In other words, the base of the number representation is changed to the radix 2^k.

The modified right-shift and left-shift versions of the radix-r multiplication are shown in algorithm 10 and algorithm 11.

Algorithm 10 Radix-r right-shift multiplication algorithm
Input: a, b
Output: p = a · b
 1: p = 0
 2: for i = 0 to l−1 do
 3:   p = (p + b_i · a · r^l) · r^(−1)
 4: end for
 5: return p

Algorithm 11 Radix-r left-shift multiplication algorithm
Input: a, b
Output: p = a · b
 1: p = 0
 2: for i = 0 to l−1 do
 3:   p = r·p + b_{l−i−1} · a
 4: end for
 5: return p

Figure 2.2 (see [Par00]) shows an example of radix-4 multiplication in dot notation. So for a radix-4 multiplication, two bits at a time are needed to form the partial product.

In each step the partial product (b_{i+1} b_i) · a needs to be formed and added to the intermediate partial sum. When using radix-2, processing one bit in each cycle,


    Figure 2.2: Radix-4 multiplication in dot notation

the partial product can easily be formed, because only 0·a and 1·a are needed. This can be done by masking. With higher radices this computation is not that simple anymore. For example, with radix-4 the multiples 0·a, 1·a, 2·a and 3·a are needed. 0·a and 1·a can be computed as before, and 2·a requires a shift operation. In this case the problem is how to compute 3·a. A solution would be to pre-compute 3·a and store it in a register for future use. Figure 2.3 (see [Par00]) shows a hardware example of the multiple selection for a radix-4 multiplier.

Figure 2.3: Operand selection for radix-4 multiplication

The advantage is that this approach can easily be applied even with higher radices. The disadvantage is the pre-computation of all operands which are not a multiple of 2^i. With radix-4 only one pre-computation is needed; for radix-8, three pre-computations are needed (3·a, 5·a and 7·a). In general, each prime multiple in the used radix set has to be pre-computed. For very high radices this slows down the whole computation, since the pre-computation must be performed before each multiplication. Another approach to high-radix multiplication is the use of a redundant number representation.


    2.8.1 Redundant Number Representation

In a non-redundant number system there exists only one representation of any number, for example in the binary, octal, decimal, or hexadecimal number systems. When using redundant number systems, there is more than one representation for a certain number, depending on the radix. For example, a radix-2 digit set {0, 1, 2} can be used to represent the number 6 as (110) or as (102). When using a radix-r digit set [−α, β], the following cases can be distinguished:

Signed: α > 0

Unsigned: α = 0

Non-redundant: α + β = r − 1

Redundant: α + β > r − 1

Redundant number systems can be used to recode a binary number so that only simple operations like masking and shifting are required to compute the necessary multiples of the multiplicand. A very popular recoding scheme is modified Booth's recoding.

2.8.2 Modified Booth's Recoding

The idea behind Booth's recoding is the detection of strings of 1s. The recoding is performed by the scheme shown in table 2.1. The binary number is represented by x_i with i = 0 to i = l−1, where l is the word size. Per definition, x_{−1} = 0. The recoded number is represented by y_i with i = 0 to i = l.

x_i  x_{i−1}   y_i   Explanation
0    0          0    No string of 1s
0    1          1    End of a string of 1s
1    0         −1    Beginning of a string of 1s
1    1          0    Continuation of a string of 1s

Table 2.1: Radix-2 Booth's recoding scheme

If a string of 1s starts, the multiplicand is subtracted from the cumulative partial product. If a string of 1s ends, the multiplicand is added to the cumulative partial product. If there is a continuing string of 1s or 0s, nothing is added to the cumulative partial product. In other words, a continuing string of 1s is represented as

2^j + 2^(j−1) + ... + 2^(i+1) + 2^i = 2^(j+1) − 2^i.    (2.24)

The longer the sequence of 1s, the larger the savings that can be achieved, for example:


    1 0 0 1 1 1 1 1 1 1 1 0 1 1 0 1 binary number1 -1 0 1 0 0 0 0 0 0 0 -1 1 0 -1 1 -1 recoded number

    But if there are only isolated 1s in the binary number, no savings can be achieved,for example:

    1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 binary number1 -1 1 -1 1 -1 1 -1 1 -1 1 -1 1 -1 1 -1 recoded number
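The radix-2 recoding rule of table 2.1 boils down to y_i = x_{i−1} − x_i; a small Java sketch (class and method names are ours) recodes a bit string and confirms that the value is preserved:

```java
/** Radix-2 Booth recoding: digits y_i in {-1, 0, 1} with y_i = x_{i-1} - x_i. */
public class BoothRadix2 {

    /** Recodes bits x (LSB first, x_{-1} = 0); returns l+1 digits, LSB first. */
    static int[] recode(int[] x) {
        int l = x.length;
        int[] y = new int[l + 1];
        for (int i = 0; i <= l; i++) {
            int xi = (i < l) ? x[i] : 0;          // x_l = 0 (leading zero of the unsigned input)
            int xim1 = (i > 0) ? x[i - 1] : 0;    // x_{-1} = 0 by definition
            y[i] = xim1 - xi;                     // table 2.1: 10 -> -1, 01 -> +1, 00/11 -> 0
        }
        return y;
    }

    /** Evaluates a digit string (LSB first) back to an integer. */
    static int value(int[] digits) {
        int v = 0;
        for (int i = digits.length - 1; i >= 0; i--) v = 2 * v + digits[i];
        return v;
    }

    public static void main(String[] args) {
        int[] x = {0, 1, 1, 1, 1, 0};              // 30 = (011110)_2, LSB first
        int[] y = recode(x);                       // a string of 1s becomes 2^5 - 2^1
        System.out.println(value(y) == value(x));  // true: recoding preserves the value
    }
}
```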

With radix-2, this recoding does not pay off in designs where the data path goes through the adder anyway. But it can be used in micro programs, because shifting alone is faster than addition followed by shifting. With radix-4, it also makes sense for designs where the data path goes through the adder. The disadvantage of radix-4 multiplication was the need for an extra addition to form the 3·a multiple. With Booth's recoding it is possible to convert the radix-4 digit set [0, 3] to [−2, 2]. Accordingly, there is no need to compute the partial product 3·a anymore. The recoding scheme is shown in table 2.2. The binary number x is converted to z; y is the radix-2 recoding of x.

x_{i+1} x_i x_{i−1}   y_{i+1} y_i   z_{i/2}   Explanation
0 0 0      0  0      0   No string of 1s
0 0 1      0  1      1   End of a string of 1s
0 1 0      1 −1      1   Isolated 1
0 1 1      1  0      2   End of a string of 1s
1 0 0     −1  0     −2   Beginning of a string of 1s
1 0 1     −1  1     −1   End one string, begin new string
1 1 0      0 −1     −1   Beginning of a string of 1s
1 1 1      0  0      0   Continuation of a string of 1s

Table 2.2: Radix-4 Booth's recoding scheme

Thus, if radix-4 recoding is performed, only the multiples ±a and ±2·a of the multiplicand are required. All of them can easily be obtained by shifting and other low-level logical functions like AND and XOR. Figure 2.4 (see [Par00]) shows an example hardware design of a radix-4 Booth-recoded multiplier.
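Table 2.2 can be condensed to the formula z_j = x_{2j−1} + x_{2j} − 2·x_{2j+1}; the following Java sketch (class and method names are ours) recodes a value into radix-4 digits from [−2, 2] and evaluates them back:

```java
/** Radix-4 (modified) Booth recoding: digits in {-2,...,2}, so only ±a and ±2a are needed. */
public class BoothRadix4 {

    /** Recodes an unsigned value; digit j = x_{2j-1} + x_{2j} - 2*x_{2j+1} (table 2.2). */
    static int[] recode(long x, int digits) {
        int[] z = new int[digits];
        long prev = 0;                              // x_{-1} = 0
        for (int j = 0; j < digits; j++) {
            long lo = (x >> (2 * j)) & 1;           // x_{2j}
            long hi = (x >> (2 * j + 1)) & 1;       // x_{2j+1}
            z[j] = (int) (prev + lo - 2 * hi);
            prev = hi;                              // x_{2j+1} is the next overlap bit
        }
        return z;
    }

    /** Re-evaluates the radix-4 digit string (LSB first). */
    static long value(int[] z) {
        long v = 0;
        for (int j = z.length - 1; j >= 0; j--) v = 4 * v + z[j];
        return v;
    }

    public static void main(String[] args) {
        long x = 0b101101101;                        // 365
        int[] z = recode(x, 6);                      // 6 digits cover the 9 bits plus headroom
        System.out.println(value(z) == x);           // true: only digits -2..2 occur
    }
}
```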

The disadvantage of Booth's recoding is the need for signed numbers. So an additional sign bit and extra hardware are required to form the two's complement of a binary number. When using the radix-8 Booth recoding scheme (see table 2.3), the multiple 3·a appears again. This fact does not change for radices higher than eight.

    Therefore, pre-computation is needed with radices greater than four. To avoidthis pre-computation, carry-save adders can be used.


    Figure 2.4: Radix-4 partial product generation with Booth recoding

x_{i+2} x_{i+1} x_i x_{i−1}   y_{i+2} y_{i+1} y_i   z_{i/3}   Explanation
0 0 0 0      0  0  0      0   No string of 1s
0 0 0 1      0  0  1      1   End of a string of 1s
0 0 1 0      0  1 −1      1   Isolated 1
0 0 1 1      0  1  0      2   End of a string of 1s
0 1 0 0      1 −1  0      2   Isolated 1
0 1 0 1      1 −1  1      3   Isolated 1
0 1 1 0      1  0 −1      3   Isolated 1s
0 1 1 1      1  0  0      4   End of a string of 1s
1 0 0 0     −1  0  0     −4   Beginning of a string of 1s
1 0 0 1     −1  0  1     −3   End one string, begin new string
1 0 1 0     −1  1 −1     −3   End one string, begin new string
1 0 1 1     −1  1  0     −2   End one string, begin new string
1 1 0 0      0 −1  0     −2   Beginning of a string of 1s
1 1 0 1      0 −1  1     −1   End one string, begin new string
1 1 1 0      0  0 −1     −1   Beginning of a string of 1s
1 1 1 1      0  0  0      0   Continuation of a string of 1s

Table 2.3: Radix-8 Booth's recoding scheme

    2.8.3 Carry-Save Adders

The basic operation of multiplication is addition. Therefore, fast adders are needed to build fast multipliers. That is particularly important when operating with word sizes beyond 1024 bits (for example in RSA). There are many different fast adder designs. The most important ones are the carry-propagate adder, the carry-look-ahead adder, the carry-skip adder, and the carry-save adder. An excellent description of their detailed functionality can be found in [Par00]. However, the architecture of fast


adders always results in higher hardware complexity. In this section the carry-save adder is explained in detail, because it is often used within the RSA design.

Carry-save adders (CSA) can be obtained from full-adder (FA) cells. Figure 2.5 shows the conversion of a ripple-carry adder into a carry-save adder.

    Figure 2.5: Conversion of a ripple-carry adder into a carry-save adder

The disadvantage of a ripple-carry adder is the long carry chain when operating on big word sizes. The last full adder always has to wait for the carry from the previous full-adder cells. In other words, ripple-carry adders have a long critical path. This increases the computation time significantly. The timing behavior can be analyzed with the O-notation, which describes the asymptotic computation time as the input size grows. A ripple-carry adder has a computation time of O(n), because n full-adder cells form the critical path. With the O-notation it is also possible to compare different adder architectures. But it is necessary to keep in mind that in the real world the inputs are of finite size.

So CSAs save the carries instead of propagating them. Figure 2.5 also shows that a CSA is made up of ordinary full adders. The Boolean equations for a full-adder cell are:

sum = a ⊕ b ⊕ c
carry = (a · b) + (a ⊕ b) · c = (a · b) + (a · c) + (b · c)    (2.25)

where ⊕ is the Boolean XOR function, · is the Boolean AND function, and + is the Boolean OR function. For the sake of completeness, the truth tables of these operators are shown in table 2.4.
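Equation 2.25 applied bitwise over whole words gives the 3-to-2 compression; a minimal Java sketch (class and method names are ours) checks the defining invariant a + b + c = sum + 2·carry:

```java
/** Bitwise carry-save addition: three operands reduced to a sum word and a carry word. */
public class CarrySaveAdder {

    /** Returns {sum, carry} with a + b + c == sum + 2*carry (equation 2.25 per bit). */
    static long[] compress(long a, long b, long c) {
        long sum   = a ^ b ^ c;                       // XOR of the three inputs
        long carry = (a & b) | (a & c) | (b & c);     // majority function, has weight 2
        return new long[]{sum, carry};
    }

    public static void main(String[] args) {
        long[] sc = compress(0b1011, 0b0110, 0b1101);
        System.out.println(sc[0] + 2 * sc[1] == 0b1011 + 0b0110 + 0b1101); // true
    }
}
```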

In other words, a CSA is a 3-to-2 reduction circuit. Therefore, only multi-operand addition makes sense with carry-save adders. CSAs can be combined into a tree. A CSA tree reduces n binary numbers to two binary numbers, which are


AND (·)
a b   a · b
0 0   0
0 1   0
1 0   0
1 1   1

OR (+)
a b   a + b
0 0   0
0 1   1
1 0   1
1 1   1

XOR (⊕)
a b   a ⊕ b
0 0   0
0 1   1
1 0   1
1 1   0

Table 2.4: Basic Boolean operations

representing the sum and the carry. There are several ways to convert this redundant representation of the result into the binary representation; they are explained later in this section. Figure 2.6 shows a 6-to-2 compression tree, also called a Wallace tree.

    Figure 2.6: 6-to-2 CSA reduction tree

Using CSA trees within multiplication architectures reduces the critical path significantly. The height of the Wallace tree can be estimated with h(n) ≈ log_{1.5}(n/2). The logical depth is equivalent to the height of the tree.

Compared to a bit-serial multiplier (single-bit-at-a-time multiplier), designs as shown in figure 2.7 reduce the number of computation cycles for one multiplication.

The disadvantages are the need for additional area (compared to a two-operand addition) and the final addition, which has to be done by a ripple-carry adder, carry-look-ahead adder, carry-skip adder, or any other carry-propagate adder. Especially when using large word sizes like 1024 bits or more, the final addition is a very time-consuming operation. The multiplier architecture has to pay special attention to the final adder, because it could contribute significantly to the critical path of the circuit.

The final addition can also be performed by the CSA tree itself. This is realized by using the result of the adder tree as input and setting all other operands to zero.


So it is necessary to know how many times this operation has to be performed. The worst case of this transformation depends on the used word size: in each computation cycle the carry is shifted one bit to the left, therefore the transformation is finished after at most word-size cycles.

Parhami shows in [Par00] that the probability of carry generation is 1/4, the probability of carry annihilation is 1/4, and the probability of carry propagation is 1/2. Therefore, the average length of a carry chain can be estimated with log2 of the used word size. When using a word size of 512 bits the estimated length is nine, with 1024 it is ten, and with 2048 it is eleven. This is very low compared to the number of computation cycles needed for a multiplication, and it has to be done only once at the end of the multiplication. The disadvantage is that a comparator is required to make sure that the carry is zero.
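The conversion described above can be simulated in Java (class and method names are ours): the carry word is shifted one bit to the left and compressed with the sum word until it dies out, which on average takes about log2 of the word size cycles:

```java
/** Converts a carry-save pair back to binary by feeding it through the CSA again. */
public class CarryResolve {

    /** One compression step with the third CSA operand set to zero. */
    static long[] step(long sum, long carry) {
        long s = sum ^ (carry << 1);                  // carry is shifted one bit left each cycle
        long c = sum & (carry << 1);
        return new long[]{s, c};
    }

    /** Iterates until the carry word is zero; returns {result, cycles used}. */
    static long[] resolve(long sum, long carry) {
        int cycles = 0;
        while (carry != 0) {                          // the comparator the text mentions
            long[] sc = step(sum, carry);
            sum = sc[0];
            carry = sc[1];
            cycles++;
        }
        return new long[]{sum, cycles};
    }

    public static void main(String[] args) {
        long[] r = resolve(0b0000, 0b1111);           // a redundant form of 30
        System.out.println(r[0]); // 30, once the carry chain has died out
    }
}
```

Each step preserves sum + 2·carry, so the loop terminates with the binary value of the redundant pair.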

CSA trees can be used to implement fast multipliers with or without Booth's recoding. Figure 2.7 shows a radix-4 multiplication without Booth's recoding.

    Figure 2.7: Radix-4 multiplication with carry-save adder

It is also possible to keep the cumulative partial product in stored-carry form (see [Par00] for more details), but an additional register for the carry is needed, which increases the required area. With such adder trees it is possible to implement high-radix multipliers with a very short logical depth. This is an important requirement to achieve high clock rates. Therefore, the multipliers used in the RSA implementation (section 4.2.7) were also built with Wallace trees. In conclusion, it is recommended to design high-radix multipliers with CSA trees.


    Chapter 3

    Design Methodology

This chapter discusses the design methodology, from the software front-end to the hardware back-end. The design tools used for creating software and hardware are treated, as well as how they are combined to achieve good design results. This also includes optimization on different abstraction layers like the software layer, the algorithmic layer, the register-transfer level, place and route, and much more.

    3.1 Design Flow

For designing complex hardware, which RSA without doubt is, development, simulation and verification tools are necessary. The general design flow of this RSA implementation consists of the following major points:

- Get information about state-of-the-art implementations.

- Implement a high-level model with an object-oriented programming language (Java).

- Compare different algorithms, and estimate the costs of their hardware implementation.

- Choose an appropriate architecture.

- Implement the hardware with a hardware description language (VHDL, Verilog, ...).

- Verify the hardware by simulation and compare the results with the high-level model.

- Synthesize the hardware.

- Simulate the hardware at gate level and compare the results with the high-level model.


- Generate an FPGA configuration file and compare the results from the FPGA with the high-level model.

- Optimize the implementation for the target platform.

The design flow for FPGAs is shown in figure 3.1. The following sections describe the mentioned points in more detail. The principles behind the used tools, for example synthesis, are also explained.

    Figure 3.1: FPGA design flow

    3.1.1 State of the Art Implementations

To keep the time to market short, it is necessary to study state-of-the-art implementations. In this way it is possible to get an overview of successfully implemented designs. With this knowledge, possible failures should be minimized during the design phase.


It is highly recommended to review the design process periodically in order to find possible design failures as soon as possible, since the earlier they are discovered, the easier it is to correct them. Therefore, the costs for correcting failures can be minimized.

There is a huge number of publications about the implementation of high-speed RSA, for example [YCC98, SV93, EW93, Wal91, OK91, Oru95, LW99, LW00, BP99]. Most of them are based on the Montgomery multiplication algorithm. Holger Orup has introduced the first hardware design operating on the full word size. Orup's implementation was about five times faster than the fastest implementations of its time. At the time this thesis was written, there was no known implementation of Orup's algorithm on FPGA platforms. It is not possible to achieve better results with an implementation on an FPGA compared to Orup's CMOS implementation, because on CMOS implementations the routing delay will always be less than on FPGA platforms.

    3.1.2 High-Level Model

It is very useful to implement a high-level model in an object-oriented language. Here, the implementation was done in Java. The advantage of Java is its support for big numbers, which are encapsulated in the BigInteger class. With a high-level model it is easier to understand the background of the used algorithms.

The first high-level model implements RSA with the Montgomery algorithm. Based on this version, Booth's recoding and Orup's optimizations were added. At this point an incompatibility of Booth's recoding and Orup's optimizations was determined when using radices higher than two. Therefore, the final algorithm, which is implemented on the FPGA, is the Montgomery algorithm with Orup's optimization but without Booth's recoding. Another advantage of this decision is that without Booth's recoding there is no need for a signed number representation, which makes things easier to implement in hardware.

    Starting the implementation with ordinary operators like +, -, *, / or mod, theycan be replaced successively by more hardware-oriented operations. The result is analgorithm which can directly be implemented by a hardware description language.For example, the implementation of a carry-save addition starts with the + operatorand is replaced by

sum = a ⊕ b ⊕ c

carry = (a ∧ b) ∨ (a ∧ c) ∨ (b ∧ c) (3.1)
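In the high-level model this replacement can be sketched as follows (a hedged Java illustration; the class and method names are not from the thesis):

```java
// Sketch of one carry-save addition step as used in the high-level model:
// three operands are reduced to a sum word and a carry word without
// propagating carries. Names are illustrative only.
public class CarrySaveAdder {
    static long[] csa(long a, long b, long c) {
        long sum   = a ^ b ^ c;                          // bitwise XOR
        long carry = ((a & b) | (a & c) | (b & c)) << 1; // majority, shifted left
        return new long[] { sum, carry };
    }

    public static void main(String[] args) {
        long[] r = csa(5, 3, 6);
        // sum + carry equals the ordinary sum 5 + 3 + 6 = 14
        System.out.println(r[0] + r[1]); // prints 14
    }
}
```

The point of the transformation is that sum and carry are computed bit-parallel with constant logic depth; only the final conversion back to binary needs a carry-propagating addition.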

Another advantage of a high-level model is the ability to verify one's own implementation against other, known-correct implementations. Especially for RSA there are many implementations. When the implementation of the algorithm is completed and working correctly, the intermediate results of the algorithm should be saved in order


to compare them with the simulation results of the register-transfer-level model. This method offers a fast way to find as many design errors as possible in a very short time span. Therefore, the design costs stay low. Another reason is that the implementation of test benches is easier in an object-oriented language, so there are fewer potential sources of faults compared to implementations in hardware. For such testing purposes, an interface to the hardware is needed. How to implement such an interface with Java and C++ is shown in section 3.2.
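Such a comparison against a known-correct implementation can be sketched like this, with a hypothetical myModPow standing in for the model under test and BigInteger.modPow serving as the reference:

```java
import java.math.BigInteger;
import java.util.Random;

// Sketch of verifying one's own modular exponentiation against Java's
// built-in BigInteger.modPow. myModPow is a placeholder for the high-level
// model under test; here it is plain square-and-multiply so the example
// is self-contained.
public class RsaCheck {
    static BigInteger myModPow(BigInteger b, BigInteger e, BigInteger m) {
        BigInteger r = BigInteger.ONE;
        for (int i = e.bitLength() - 1; i >= 0; i--) {
            r = r.multiply(r).mod(m);                   // square
            if (e.testBit(i)) r = r.multiply(b).mod(m); // conditional multiply
        }
        return r;
    }

    public static void main(String[] args) {
        Random rnd = new Random(1);                     // fixed seed: repeatable
        for (int i = 0; i < 100; i++) {
            BigInteger m = BigInteger.probablePrime(64, rnd);
            BigInteger b = new BigInteger(64, rnd).mod(m);
            BigInteger e = new BigInteger(64, rnd);
            if (!myModPow(b, e, m).equals(b.modPow(e, m)))
                throw new AssertionError("mismatch at test " + i);
        }
        System.out.println("all tests passed");
    }
}
```

The same structure carries over to the RTL comparison: instead of the final result, each intermediate value of the loop would be logged and compared against the simulator output.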

    3.1.3 Hardware Description Language

Hardware description languages (HDLs) are needed to implement parameterizable hardware. They are used to describe hardware at different abstraction levels. Design descriptions in an HDL are divided into four main sections:

Definitions: Defining variables, signals, ports, and structures of hardware.

Behavioral description: Describing the dependence between inputs and outputs with operators over time.

Structural description: Combining functionality in blocks and connecting their inputs and outputs using wires.

Constraints: Ensuring correct implementation of the synthesized hardware under certain conditions such as area limitations or clock frequency requirements.

HDLs normally follow the concepts of general programming languages, plus additional data types and operators which fulfil the need for concurrent execution.

    For the control structures, there are three main concepts:

Local approach: Describing when a local command should be executed.

Extended structural programming: The common language concepts for programming (if-then-else, for-do, while-do, ...) are extended with additional constructs able to handle concurrency.

Process communication: The whole system is a set of processes which communicate in a defined manner.

Common programming languages have the ability to describe behavior, but they do not offer concepts for describing hardware-specific characteristics. Therefore, an HDL should be able to describe interfaces by defining input and output signals and ports, which offers the ability to communicate with other modules. Structural description is used to keep an overview of complex designs; this is a kind of hierarchical abstraction. In addition to the existing operators, operators specific to the register-transfer level and the logic level are needed. At the register-transfer level


asynchronous behavior is also needed to describe signals like interrupt, reset, and set. For implementing the hardware, VHDL was chosen as the hardware description language. Figure 3.2 shows a comparison between VHDL and Verilog; both languages have nearly the same capability to describe hardware. The VHDL code was written with the Emacs editor because of its great VHDL mode, which was one of the reasons why VHDL was chosen as the hardware description language.

    Figure 3.2: Comparison between VHDL and Verilog

It is also necessary to know how to write well-synthesizable code; some code examples are described in section 3.3.5 and in chapter 4. Some hints for writing good synthesizable code for Xilinx devices can be found in [xil04].

    3.1.4 Simulation

Simulation is used to verify the functionality of the described hardware. Simulation can be done on different abstraction layers: the simulation of the behavioral models and the simulation of the gate-level model.

    Simulation of Behavioral Models

Behavioral models are written in an editor or another front-end tool. With behavioral models it is possible to verify the design at an early stage of the project; they are also used as input for the synthesis tools. To verify the behavioral model, a test script is used. The test script provides stimuli for the input signals and compares the output signals with the expected output values. The test script must be able to read from and write to the interface of the behavioral model. Before the verification of the functional correctness, the syntactical correctness has to be verified. The simulation is fast because it is time-discrete and level-discrete. There is no computation of transient signal transitions; only three logic levels are used during the simulation process: high, low, and tristate.

    Simulation of Gate-Level Models

The gate-level model is generated with synthesis tools making use of target-specific logic models. The simulation is used to compare the behavioral model with the gate-level model. The timing constraints of the gate-level model can also be checked. It takes much more time to simulate the gate-level model than the behavioral model.

    Simulation of the RSA HDL Model

The simulation of the RSA chip was performed with ModelSim from Mentor Graphics. The test script was implemented in TCL. The test bench compares the intermediate results from the high-level model with the results of the behavioral model. Therefore, methods were needed which are able to compare big numbers in TCL, because the range of an unsigned integer is limited to 4294967296 (32 bit). The simulation of one modular exponentiation takes about five minutes on an Intel Pentium IV machine running at 1.6 GHz. For detailed analysis it is also possible to use a waveform viewer.
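The idea of comparing big numbers without a native big-number type can be sketched as follows; this is a Java illustration of what the TCL helper methods have to do, and the names are made up for the example:

```java
// Sketch of comparing big numbers given as hex strings without any
// big-number type, as a test bench in a 32-bit-limited scripting language
// (such as the TCL test script mentioned above) would have to do.
public class HexCompare {
    static boolean equalHex(String a, String b) {
        // normalize: lower-case, then strip leading zeros (keep last digit)
        a = a.toLowerCase().replaceFirst("^0+(?=.)", "");
        b = b.toLowerCase().replaceFirst("^0+(?=.)", "");
        return a.equals(b);
    }

    public static void main(String[] args) {
        System.out.println(equalHex("00ABCDEF1234567890", "abcdef1234567890"));
        // prints true
    }
}
```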

    3.1.5 Synthesis

The aim of circuit synthesis and optimization is the generation of a technology-dependent netlist on the gate level that is optimized from various points of view, for example minimal area demand or maximum achievable clock frequency. Besides conventional netlists, HDL behavior descriptions can also be used as input. The steps of synthesis and optimization are displayed in figure 3.3.

The first step during synthesis is the transformation of the input, which can be a behavioral description of an HDL model or a netlist, into Boolean equations.

If the input is a netlist on gate level, the extraction of the Boolean equations requires the functionality of the associated technology library. Afterwards, the Boolean equations are available in a structured form: an equation is generated for each output and a variable for each signal.

If the input is an HDL behavior description, the synthesis is done (after the syntax check) by linking known function blocks on the register-transfer level. Since any digital hardware can be described by state machines, it is necessary to find out how many states are needed and which encoding of them is best. For example, one-hot encoding needs more hardware resources but is faster than binary encoding. An optimization is achieved by a favorable selection of the function blocks.

Figure 3.3: Hardware synthesis

For the following optimization steps, a block-oriented description is available in the form of Boolean equations.

On the basis of these Boolean equations, the technology-independent logic optimization can be performed. The aim of this optimization is an improved structure of the equations, so that redundant logic can be removed in order to reduce the needed area and time delay of the circuit. Boolean algebra is used as the basic reduction tool for these optimizations. Other optimization functions are logic flattening and logic structuring. The resulting equations can also be understood as a technology-independent gate-level netlist.

In the last step, a technology-dependent netlist is generated on the basis of a corresponding gate library; this procedure is called mapping. Then the place-and-route procedure is accomplished. Depending on the hardware design it is possible that the routing requires additional cells of an FPGA. The goal pursued with this process is to meet the constraints set by the developer and the design rules defined in the technology library. The observance of the design rules has higher priority than the optimization goals. It is important to distinguish between place and route on FPGA platforms and other production processes like CMOS, because there is no need for a geometrical rule set on FPGA platforms.

For the synthesis and the place-and-route procedure, the ISE development studio was used. It is only available for Xilinx devices, but it is possible to use third-party tools like Synplify for the synthesis. The whole procedure can also be executed from the command line, which offers an easy way to run nightly builds. Nightly builds are necessary because generating a bit stream can take up to four days on an AMD Athlon XP 2000+ or a 1.6 GHz Intel Pentium IV CPU. The generation time can be reduced significantly by using more RAM, because the place-and-route procedure is very memory-demanding. On a 2.8 GHz Intel Pentium IV system with two gigabytes of RAM it takes about four hours to generate a bit-stream file.

    3.2 Software

Since programming languages are not hardware-independent, it is necessary to write software in such a manner that the hardware-dependent parts can be exchanged easily. During the implementation of the software, several programming languages needed to be used. So, it is necessary to use well-defined interfaces on a dedicated abstraction layer to keep the software as platform-independent as possible. Choosing a programming language is not that difficult. It depends on the particular requirements of the application, the surrounding environment (interfaces), and personal preferences. For this application Java was chosen because

a big number representation is already included (the BigInteger class), which also provides basic arithmetic functions like addition, subtraction, multiplication, division, modular reduction, and modular exponentiation,

a cryptographic interface is already implemented (JCE),

it is easy to use with JSP,

it is possible to access low-level C or C++ functions via the Java Native Interface (JNI),

it is available on almost all platforms.
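As a brief illustration of the first point, a textbook-RSA round trip needs nothing beyond BigInteger (no padding; purely illustrative and not for production use):

```java
import java.math.BigInteger;
import java.security.SecureRandom;

// Textbook-RSA round trip using only BigInteger, illustrating the
// built-in big-number support listed above.
public class TextbookRsa {
    public static void main(String[] args) {
        SecureRandom rnd = new SecureRandom();
        BigInteger e = BigInteger.valueOf(65537);    // common public exponent
        BigInteger p, q, phi;
        do {
            // regenerate in the (very unlikely) case e is not coprime to phi
            p = BigInteger.probablePrime(512, rnd);
            q = BigInteger.probablePrime(512, rnd);
            phi = p.subtract(BigInteger.ONE).multiply(q.subtract(BigInteger.ONE));
        } while (!phi.gcd(e).equals(BigInteger.ONE));
        BigInteger n = p.multiply(q);                // 1024-bit modulus
        BigInteger d = e.modInverse(phi);            // private exponent

        BigInteger msg = new BigInteger("123456789");
        BigInteger cipher = msg.modPow(e, n);        // encrypt
        BigInteger plain  = cipher.modPow(d, n);     // decrypt
        System.out.println(plain.equals(msg));       // prints true
    }
}
```

The two modPow calls are exactly the operation that the FPGA back-end accelerates.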

When writing software it is recommended to use a design pattern which satisfies all requirements of the application. Many software applications have a common functional behavior, so it is not necessary to do all the development work again. In this case the Layer Pattern is used. More information about patterns can be found in [Bus96].


    3.2.1 The Layer Pattern

The layer pattern is a so-called architectural pattern. There are many applications using the layer pattern, for example the ISO/OSI network model, the TCP/IP protocol, and nearly any Application Programming Interface (API). Figure 3.4 shows some examples of the layer pattern.

    Figure 3.4: Examples for the layer pattern

The intention is to decompose the application into layers of abstraction. The solution is to group similar functions in one layer. Between two different layers there is always a standardized interface. This makes it possible to replace the implementation of a layer without changing the other layers. Applications on a particular layer may only call functions of their subordinate layer(s).

This may give the impression of slowing down the whole application, but in this case all function calls work in a call-back manner, so the only overhead is putting the parameters on the stack.
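The downward-only call discipline can be sketched with Java interfaces; all names here are made up for illustration and do not come from the actual software stack:

```java
// Illustrative sketch of the layer pattern: each layer only talks to the
// interface of the layer directly below it. Names are invented for this
// example, not taken from the thesis' software architecture.
interface TransportLayer {                 // lowest layer
    String send(String payload);
}

interface CryptoLayer {                    // middle layer
    String encryptAndSend(String msg);
}

class LoopbackTransport implements TransportLayer {
    public String send(String payload) { return payload; }
}

class SimpleCrypto implements CryptoLayer {
    private final TransportLayer lower;    // only reference: the layer below
    SimpleCrypto(TransportLayer lower) { this.lower = lower; }
    public String encryptAndSend(String msg) {
        return lower.send("enc(" + msg + ")"); // delegate downwards
    }
}

public class LayerDemo {
    public static void main(String[] args) {
        // Swapping LoopbackTransport for another transport would not
        // require any change in SimpleCrypto or the layers above it.
        CryptoLayer top = new SimpleCrypto(new LoopbackTransport());
        System.out.println(top.encryptAndSend("hello")); // prints enc(hello)
    }
}
```

Because each layer holds only the interface of its subordinate, replacing one layer's implementation leaves the rest of the stack untouched, which is exactly the porting argument made in the next section.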

    3.2.2 Software Architecture

Figure 3.5 shows the design of the architecture and the programming languages used for each layer.

Figure 3.5: Software architecture

If the application is to be ported to another system like Linux, it is only necessary to replace the layers between the JNI interface and the hardware. The JNI interface also implements the interrupt service routine. Interrupts have an asynchronous behavior and therefore must be synchronized with the function calls from the upper layer. The implementation is shown in section 4.1.2.

    3.3 Hardware

When implementing hardware, it is necessary to keep in mind which target environment the hardware is designed for. For example, a full-custom design is completely different from a design for FPGA devices. In full-custom designs the usage of registers is very expensive because of the needed area. When designing hardware for an FPGA platform, registers are relatively cheap because each slice contains up to two flip-flops (see 3.3.3). But there are also some common design rules, like separating the hardware into a control path and a data path.


3.3.1 Control Path and Data Path

    Most hardware applications can be separated into a control path and a data path.

The data path typically has some pipeline stages (register banks) and a huge block of combinational logic which shows regular structures. The control path, on the other hand, is mostly implemented through state encoding and therefore is relatively small compared to the data path. The control path and the data path are connected through control signals and status signals; therefore, the control unit also affects the delay in the data path. How to separate control and data path to minimize these effects is shown in section 3.3.6. The hardware implementation of the RSA algorithm was also separated into a control path and a data path. It is shown in figure 3.6.

Figure 3.6: Separating hardware into control path and data path

The detailed architecture is described in section 4.2.1.

    3.3.2 Synchronizing Clock Domains

To communicate with other hardware, an interface to the surrounding environment is needed. In many cases a bus is used to connect the hardware. Typically, the bus frequency differs from the clock frequency of the device. Therefore, synchronization between the two clock domains is needed. First of all, each flip-flop has a typical setup and hold time; during this time span the input signal has to be stable, otherwise the flip-flop may drop into a meta-stable state. In a meta-stable state the output has no definite logic level. Figure 3.7 shows how the input of a flip-flop is moved to the output.

    Figure 3.7: Clock to output delay of flip-flops

Figure 3.8 shows the meta-stable state of a flip-flop when the input is changed during the setup time tsu or hold time th.

    Figure 3.8: Meta-stable states of flip-flops

To synchronize signals or data from one clock domain to another, it is necessary to ensure that the input does not change during the setup and hold time. There are three basic approaches to avoid meta-stable states:


Using a double synchronizer. It consists of a 2-bit shift register which is clocked by the destination clock. This kind of synchronization should only be used for a single signal. When using this circuit for synchronizing a data bus, it is possible that one bit is captured a clock cycle before another bit, so the transferred data is not consistent anymore. Figure 3.9 shows a signal synchronization circuit with two flip-flops connected as a shift register. Additional flip-flops can be added to the shift register to increase the probability of avoiding meta-stable states. With each additional flip-flop, the probability of failure is reduced to roughly 1/1000; therefore, the probability of a meta-stable state after two flip-flops is approximately 1/1000000.

Using a handshake circuit. This circuit consists of a FIFO, additional handshake logic, and signals for request and acknowledge. It is very difficult to design such handshake circuits without decreasing the throughput. Therefore, if it is possible to ensure stable data during the setup and hold time of the flip-flop, this kind of circuit is avoided.

    Figure 3.9: Signal synchronization with double synchronizer

When using the PCI bus, additional logic is needed to prevent reading and writing at the same time, because this could destroy the I/O buffers of the PCI bus. The implementation and synchronization of the presented RSA circuit is shown in section 4.2.6.

    3.3.3 Architecture of FPGAs

An FPGA basically is reconfigurable hardware. The actual functionality always depends on the loaded configuration file. When the power supply is switched off, the configuration is lost. Therefore, an FPGA always has a configuration interface, for example a serial or a parallel interface. This offers the possibility of fast hardware development.


Before writing hardware for FPGA devices, the architecture of the FPGA used should be known. There are many possibilities to optimize a circuit for FPGA devices. In this project, Xilinx FPGAs are used, but the basic FPGA architecture is the same on other FPGA devices.

The typical layout of an FPGA consists of logic blocks surrounded by input/output blocks. There are many names for a logic block, for example Configurable Logic Block (CLB), Logic Module, Logic Cell, Logic Element (LE), Programmable Function Unit (PFU), Core Cell, or Tile. Other names for an input/output block are I/O Block (IOB), I/O Module, I/O Cell, I/O Element (IOE), and Programmable Input/Output Cell (PIC). The names of the logic blocks differ as much as their implementations. An important concept is the granularity. High granularity means less combinational logic in one logic block. FPGAs with a high granularity make it easier for synthesis tools to map the logic equations to the hardware, because they do not have to optimize for large logic blocks. The disadvantage of high granularity is that more signals are needed for routing, and therefore the achievable clock rates decrease (see [BFRV92]). There exists no perfect granularity; which granularity is best depends on the application.

    There are three major strategies to implement a logic block:

Lookup Tables (LUT). LUTs are a kind of SRAM table where the address signals are the logical inputs and the addressed value is the logical output. The values of the SRAM table are set during the configuration of the FPGA.

    Multiplexers. The logical function is realized by multiplexers.

Gates only. The gates are arranged in an array. It is possible to map equations with up to eight inputs.

Because of the different implementations of FPGAs it is difficult to compare them. In the beginning, the comparison was done by the number of gates, where a gate is defined as a NAND module with two inputs. Nowadays this comparison does not make sense anymore, because today's FPGAs are already equipped with built-in memory and small CPUs. Once again, the choice of the most suitable FPGA depends on the application to be implemented on it.

    3.3.4 Xilinx FPGAs

The pictures and detailed information in this section are taken from the Xilinx Spartan-II data sheet, which can be found under [xil05]. Figure 3.10 shows the architecture of a Xilinx Spartan-II device. The huge array of CLBs is surrounded by IO buffers and DLLs.

    Figure 3.11 shows the combination of two slices in one CLB.


    Figure 3.10: Architecture of Xilinx Spartan II

Xilinx uses LUTs for implementing a logic block. It depends on the product family which and how many additional elements like multiplexers and registers are used. Each Xilinx FPGA has at least two LUTs (with four inputs each) per slice and two slices per CLB. Figure 3.12 shows the schematic of one slice within a Xilinx Spartan-II FPGA.

The major advantage of the LUT architecture is that the four-input LUT can also be used as single-port RAM (SPRAM) or as dual-port RAM (DPRAM). Therefore, it is possible to address 16 bits instead of saving just one bit with a single flip-flop. Each slice also has a fast carry line, which can be used to implement fast counters, adders, or subtractors. It is also possible to implement fast comparators. There are two ways to use such built-in functions: the first is to instantiate hard macros from the tool library, the second is to write the VHDL code in such a manner that the synthesizer can map it to a built-in function. The Virtex II model line has some additional interconnection circuits in each CLB, allowing short net delays to the surrounding CLBs. The Virtex IV family already has small built-in CPUs and Rocket IO buffers for transfer rates up to ten gigabits per second.
