
Kris Gaj, Electrical and Computer Engineering, George Mason University

Towards secure cryptographic transformations efficient in both software and hardware:

A case for synergy among math, computing, and engineering

http://ece.gmu.edu/crypto-text.htm

Motivation

Criteria used to evaluate cryptographic transformations

Security

Software Efficiency

Hardware Efficiency

Flexibility

Flexibility

• Additional key-sizes and block-sizes

• Ability to function efficiently and securely in a wide variety of platforms and applications:
  - low-end smartcards, wireless: small memory requirements
  - IPSec, ATM: small key setup time in hardware
  - B-ISDN, satellite communication: high encryption speed

Advanced Encryption Standard (AES) Contest, 1997-2001

June 1998: 15 candidates from USA, Canada, Belgium, France, Germany, Norway, UK, Israel, Korea, Japan, Australia, Costa Rica

Round 1 (criteria: security, software efficiency, flexibility)

August 1999: 5 final candidates: Mars, RC6, Rijndael, Serpent, Twofish

Round 2 (criteria: security, hardware efficiency)

October 2000: 1 winner: Rijndael (Belgium)

NESSIE Project (Europe), 2000-2002: New European Schemes for Signatures, Integrity, and Encryption

CRYPTREC Project (Japan), 2000-2002

NESSIE, CRYPTREC

Multiple types of transformations:

• Symmetric-key block ciphers
• Stream ciphers
• Hash functions
• MACs
• Asymmetric encryption schemes
• Asymmetric digital signature schemes
• Asymmetric identification schemes

Development of a methodology for the fair evaluation and comparison of algorithms belonging to the same class, including software and hardware efficiency

Speed of the final AES candidates in hardware

[Figure: bar chart of speed in Mbit/s (axis 0 to 500) for Serpent, Rijndael, Twofish, RC6, Mars]

K. Gaj, P. Chodowiec, AES3, April 2000

Survey filled by 167 participants of the Third AES Conference, April 2000

[Figure: bar chart of the number of votes (axis 0 to 100) for Serpent, Rijndael, Twofish, RC6, Mars]

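Results of the NSA group: Hardware

[Figure: bar chart of speed in Mbit/s (axis 0 to 700) for Serpent, Rijndael, Twofish, RC6, Mars, comparing NSA ASIC and GMU FPGA implementations; bar labels as extracted: 606, 414, 202, 105, 103, 57, 431, 177, 143, 61]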

Efficiency in software: NIST-specified platform

200 MHz Pentium Pro, Borland C++

[Figure: bar chart of speed in Mbit/s (axis 0 to 30) for Serpent, Rijndael, Twofish, RC6, Mars with 128-bit, 192-bit, and 256-bit keys]

NIST Report: Security

[Figure: security margin (adequate to high) plotted against complexity (simple to complex) for Rijndael, MARS, Serpent, Twofish, RC6]

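Security: Theoretical attacks better than exhaustive key search

[Figure: bar chart, # of rounds in the attack / total # of rounds (axis 0 to 35)]

• Twofish: 6 / 16 (10 rounds remaining)
• Serpent: 9 / 32 (23 rounds remaining)
• Rijndael: 7 / 10 (3 rounds remaining)
• RC6: 15 / 20 (5 rounds remaining)
• Mars without 16 mixing rounds: 11 / 16 (5 rounds remaining)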

Security: Theoretical attacks better than exhaustive key search

[Figure: bar chart, # of rounds in the attack / total # of rounds · 100% (axis 0 to 100) for Twofish, Serpent, Rijndael, RC6, Mars; bar label pairs as extracted: 28% / 72%, 38% / 62%, 69% / 31%, 70% / 30%, 75% / 25%]

Security and hardware speed for hash functions

GMU team, May 2002

[Figure: bar chart of speed in hardware in Mbit/s (axis 0 to 700)]

• SHA-1: 359 Mbit/s; complexity of the best attack $2^{80}$, the same as Skipjack
• SHA-512: 610 Mbit/s; complexity of the best attack $2^{256}$, the same as AES-256

What’s more important: software or hardware?

Historical view

[Figure: timeline from 1970 to 2000 for secret-key ciphers and hash functions]

Secret-key ciphers:

• DES: optimized for hardware
• Fast Software Encryption: ciphers optimized for software, e.g., RC5, Blowfish, RC4
• AES: optimized for software and hardware

Hash functions:

• DES-based hash functions: optimized for hardware
• MD4-family: optimized primarily for software

Software or hardware?

[Figure: SOFTWARE and HARDWARE compared on the following properties]

• security of data during transmission
• flexibility (new cryptoalgorithms, protection against new attacks)
• speed
• random key generation
• access control to keys
• tamper resistance (viruses, internal attacks)
• low cost

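Efficiency indicators

Primary efficiency indicators:

• Software: speed, memory
• Hardware: speed, area

Other indicators: memory, power consumption

Efficiency parameters

Latency: time to encrypt/decrypt a single block of data

Throughput (= speed): number of bits encrypted/decrypted in a unit of time

[Figure: a single block Mi entering the encryption/decryption unit and leaving as Ci (latency); blocks Mi, Mi+1, Mi+2 processed simultaneously and leaving as Ci, Ci+1, Ci+2 (throughput)]

$$\text{Throughput} = \frac{\text{Block\_size} \cdot \text{Number\_of\_blocks\_processed\_simultaneously}}{\text{Latency}}$$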
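The relation between latency and throughput can be sketched in C as follows. This is only an illustration: the function name, the units (nanoseconds in, Mbit/s out), and the figures in the usage note are assumptions, not values taken from the slides.

```c
#include <stdint.h>

/* Throughput = block_size * blocks_processed_simultaneously / latency.
 * latency_ns is the time to process one block, in nanoseconds.
 * Bits per nanosecond equal Gbit/s, so the result is scaled by 1000
 * to express it in Mbit/s. */
double throughput_mbit_s(unsigned block_size_bits,
                         unsigned blocks_in_flight,
                         double latency_ns)
{
    return 1000.0 * block_size_bits * blocks_in_flight / latency_ns;
}
```

For a hypothetical 128-bit cipher with a latency of 320 ns, processing one block at a time gives 400 Mbit/s, while keeping 16 blocks in flight gives 6400 Mbit/s with the same latency.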

What’s more important: speed or area?

Non-feedback cipher modes: ECB, counter

Comparison for non-feedback cipher modes, e.g., Counter Mode (CTR)

[Figure: counter values IV, IV+1, IV+2, ..., IV+N-1, IV+N are each encrypted by E and combined with the message blocks M0, M1, M2, ..., MN-1, MN to produce the ciphertext blocks]

$C_i = M_i \oplus E(IV + i)$ for $i = 0..N$
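A minimal sketch of counter mode in C shows why this mode has no feedback: every loop iteration depends only on the counter value, not on earlier ciphertext. The function `toy_E` is a hypothetical stand-in for the block cipher E and is not secure, and the 64-bit block is an assumption made to keep the example short.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical placeholder for the block cipher E. NOT a real cipher:
 * it only mixes the input so that the sketch compiles and runs. */
static uint64_t toy_E(uint64_t key, uint64_t x)
{
    x ^= key;
    x *= 0x9E3779B97F4A7C15ULL;
    x ^= x >> 29;
    return x;
}

/* Counter mode: out[i] = in[i] XOR E(iv + i).
 * Iterations are independent of each other, so hardware can process
 * many blocks in parallel or in a pipeline. The same routine both
 * encrypts and decrypts. */
void ctr_crypt(uint64_t key, uint64_t iv,
               const uint64_t *in, uint64_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] ^ toy_E(key, iv + i);
}
```

Applying `ctr_crypt` twice with the same key and IV returns the original blocks.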

Increasing speed by parallel processing

[Figure: several identical units operating in parallel]

Increasing speed using pipelining

[Figure: Cipher 1 with rounds 1 to 10 and Cipher 2 with rounds 1 to 16, each round forming one pipeline stage with a target clock period of, e.g., 20 ns]

$$\text{Speed} = \frac{\text{block size}}{\text{target\_clock\_period}}$$
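As a worked instance of this relation, assuming a 128-bit block (the AES block size) and the 20 ns target clock period given as an example, a pipelined unit that accepts one block per clock cycle would reach:

$$\text{Speed} = \frac{128\ \text{bit}}{20\ \text{ns}} = 6.4\ \text{Gbit/s}$$

A 20 ns clock period corresponds to a 50 MHz clock.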

Pipelined operation of the encryption unit

[Figure: occupancy of the pipeline stages by data blocks B1 to B16 over clock cycles 1 to 16]

Encryption in non-feedback modes (ECB, counter), decryption in all modes

[Figure: speed in Mbit/s (axis 0 to 7000) plotted against area in CLB slices (axis 0 to 60000) for Serpent, Twofish, RC6, Rijndael, Mars; the marked speed limit is 6.4 Gbit/s]

Assuming clock frequency = 50 MHz

Our Results: Full mixed pipelining

[Figure: bar chart of throughput in Gbit/s (axis 0 to 18), Virtex FPGA, for Serpent, Rijndael, Twofish, RC6; bar labels as extracted: 16.8, 15.2, 13.1, 12.2]

[Figure: bar chart of area in CLB slices (axis 0 to 50000) for Serpent, Rijndael, Twofish, RC6; bar labels as extracted: 19,700, 21,000, 46,900, 12,600 plus 80 RAMs (dedicated memory blocks)]

NIST Report + GMU Report: Hardware Efficiency

Non-feedback cipher modes: ECB, CTR

[Figure: speed (low to high) plotted against area (small, medium, large) for Rijndael, Serpent, Twofish, RC6, Mars]

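Feedback cipher modes: CBC, CFB, OFB

Feedback cipher modes - CBC

[Figure: message blocks M1, M2, M3, ..., MN-1, MN are each combined with the previous ciphertext block (IV for the first block) and encrypted by E to produce C1, C2, C3, ..., CN-1, CN]

$C_1 = E(M_1 \oplus IV)$

$C_i = E(M_i \oplus C_{i-1})$ for $i = 2..N$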
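The feedback in CBC can be sketched in C as below. Each iteration needs the ciphertext of the previous one, so blocks of a single message cannot be encrypted in parallel or pushed through a pipeline. The function `toy_E` is a hypothetical, insecure stand-in for the block cipher E, and the 64-bit block is an assumption for brevity.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical placeholder for the block cipher E. NOT a real cipher. */
static uint64_t toy_E(uint64_t key, uint64_t x)
{
    x ^= key;
    x *= 0x9E3779B97F4A7C15ULL;
    x ^= x >> 29;
    return x;
}

/* CBC encryption: c[0] = E(m[0] XOR iv), c[i] = E(m[i] XOR c[i-1]).
 * The variable prev carries the feedback from one block to the next. */
void cbc_encrypt(uint64_t key, uint64_t iv,
                 const uint64_t *m, uint64_t *c, size_t n)
{
    uint64_t prev = iv;
    for (size_t i = 0; i < n; i++) {
        prev = toy_E(key, m[i] ^ prev);
        c[i] = prev;
    }
}
```

Because of this data dependency, the speed in feedback modes is bounded by the latency of a single block rather than by the number of blocks in flight.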

Typical Flow Diagram of a Secret-Key Block Cipher

[Figure: flowchart. Initial transformation with Round Key[0]; i := 1; Cipher Round with Round Key[i], repeated #rounds times with i := i+1 while i < #rounds; Final transformation with Round Key[#rounds+1]]

Basic iterative architecture

[Figure: register, multiplexer, and combinational logic implementing one round]

Increasing speed in cipher feedback modes

[Figure: speed plotted against area for the basic architecture and for loop unrolling with k = 2, 3, 4, 5]

GMU Results: Encryption in cipher feedback modes (CBC, CFB, OFB) - Virtex FPGA

[Figure: throughput in Mbit/s (axis 0 to 500) plotted against area in CLB slices (axis 0 to 5000) for Rijndael, Serpent I8, Mars, RC6, Twofish, Serpent I1]

NSA Results: Encryption in cipher feedback modes (CBC, CFB, OFB) - ASIC, 0.5 μm CMOS

[Figure: throughput in Mbit/s (axis 0 to 700) plotted against area (axis 0 to 40) for Serpent I1, RC6, Twofish, Mars, Rijndael]

Decreasing area by resource sharing

[Figure: before, two copies of F transform D0 and D1 into D0’ and D1’; after, a single F is shared through a multiplexer and registers]

Resource sharing: Speed vs. Area

[Figure: throughput plotted against area for the basic architecture and for resource sharing]

NIST Report + GMU Report: Hardware Efficiency

Feedback cipher modes: CBC, CFB

[Figure: speed (low to high) plotted against area (small, medium, large) for Rijndael, MARS, Serpent, Twofish, RC6]

Aren’t software and hardware optimizations equivalent?

Efficiency in software: NIST-specified platform

200 MHz Pentium Pro, Borland C++

[Figure: bar chart of speed in Mbit/s (axis 0 to 30) for Serpent, Rijndael, Twofish, RC6, Mars with 128-bit, 192-bit, and 256-bit keys]

Our Results: Basic architecture - Speed

[Figure: bar chart of throughput in Mbit/s (axis 0 to 500) for Serpent, Rijndael, Twofish, RC6, Mars]

Basic atomic operations of secret-key ciphers and hash functions

Atomic operations used in 41 most popular secret-key ciphers (1)

B. Chetwynd, MS Thesis, WPI

Considered ciphers:

Blowfish, CAST, CAST-128, CAST-256, CRYPTON, CS-Cipher, DEAL, DES, DFC, E2, FEAL, FROG, GOST, Hasty Pudding, ICE, IDEA, Khafre, Khufu, LOKI91, LOKI97, Lucifer, MacGuffin, MAGENTA, MARS, MISTY1, MISTY2, MMB, RC2, RC5, RC6, Rijndael, SAFER K, SAFER+, Serpent, SQUARE, SHARK, Skipjack, TEA, Twofish, WAKE, WiderWake

Major atomic operations used in 41 most popular secret-key ciphers (2)

[Figure: bar chart of the number of ciphers (axis 0 to 40) using S-box, variable rotation, modular multiplication, GF(2^n) multiplication, modular inversion; bar labels as extracted: 30, 10, 7, 7, 1]

Auxiliary atomic operations used in 41 most popular secret-key ciphers (3)

[Figure: bar chart of the number of ciphers (axis 0 to 40) using Boolean (XOR, AND, OR, etc.), fixed rotation, modular addition & subtraction, permutation; bar labels as extracted: 40, 25, 20, ?]

Major cipher operations (1) - S-box

S-box n x m: n input bits x1, x2, ..., xn; m output bits y1, y2, ..., ym

Software:

C: WORD S[1<<n]={ 0x23, 0x34, 0x56 . . . . . . . . . . . . . .}

ASM: S DW 23H, 34H, 56H …..

Hardware:

ROM with 2^n words (n-bit address, m-bit output), i.e., 2^n · m bits, or direct logic

S-box: Memory in hardware

[Figure: 32 S-boxes 4x4 operating in parallel, 32 x 4 = 128 bits]

Memory = $32 \cdot 2^4 \cdot 4$ bits = 2 kbit

[Figure: 16 S-boxes 8x8 operating in parallel, 16 x 8 = 128 bits]

Memory = $16 \cdot 2^8 \cdot 8$ bits = 32 kbit = $16 \cdot 2$ kbit

S-box: Memory in software

[Figure: 32 S-boxes 4x4, 32 x 4 = 128 bits]

Memory = $2^4 \cdot 4$ bits = 64 bit

[Figure: 16 S-boxes 8x8, 16 x 8 = 128 bits]

Memory = $2^8 \cdot 8$ bits = 2 kbit = $32 \cdot 64$ bits
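The memory figures on the two S-box slides follow from one formula, sketched below in C. The function name and the `copies` parameter are illustrative assumptions: hardware that substitutes all 128 bits at once needs one table per S-box instance, while software can reuse a single table.

```c
#include <stdint.h>

/* Memory of an n x m S-box stored as a lookup table, in bits:
 * 2^n entries of m bits each, times the number of table copies. */
uint64_t sbox_memory_bits(unsigned n, unsigned m, unsigned copies)
{
    return (uint64_t)copies * ((uint64_t)1 << n) * m;
}
```

With this helper, 32 copies of a 4x4 S-box need 2048 bits and 16 copies of an 8x8 S-box need 32768 bits, while a single 4x4 table needs 64 bits and a single 8x8 table needs 2048 bits.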

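Major cipher operations (2) – Variable Rotation

A <<< B

Software:

C: C = (A << B) | (A >> ((32-B) & 31)); for a 32-bit A and B = 0..31 (the mask avoids a shift by 32 when B = 0)

ASM: ROL A, B

Hardware:

variable rotation ROL32 of a 32-bit word A, implemented either as

• a mux-based shifter: multiplexer stages controlled by B[4], B[3], B[2], B[1], B[0], where the first stage selects between A<<<0 and A<<<16, or
• a circuit driven by a high-speed clock CLK’: min (B, 32-B) CLK’ cycles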
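The two views of variable rotation can be compared in a short C sketch. The function names are illustrative assumptions. `rol32` is the software form; `rol32_mux` models the mux-based shifter as five stages, where stage k either passes the word through or rotates it by 2^k, depending on bit B[k].

```c
#include <stdint.h>

/* Software: rotate left by a data-dependent amount.
 * Masking keeps both shift counts in the range 0..31. */
uint32_t rol32(uint32_t a, unsigned b)
{
    b &= 31;
    return (a << b) | (a >> ((32 - b) & 31));
}

/* Model of a mux-based shifter: five 2-to-1 multiplexer stages.
 * Stage k selects between the unrotated word and the word rotated
 * left by 2^k bits, under control of bit k of b. */
uint32_t rol32_mux(uint32_t a, unsigned b)
{
    for (unsigned k = 0; k < 5; k++) {
        unsigned amount = 1u << k;               /* 1, 2, 4, 8, 16 */
        uint32_t rotated = (a << amount) | (a >> (32 - amount));
        a = ((b >> k) & 1u) ? rotated : a;
    }
    return a;
}
```

In hardware, each fixed rotation inside a stage is only a reordering of wires, so the cost of the shifter is the five layers of multiplexers.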

Major cipher operations (3) – Modular Multiplication

$C = A \cdot B \bmod 2^n$, n = 32, 16

Software:

C: unsigned long A, B, C; C = A*B;

ASM: MUL

Hardware:

Half-multiplier (MUL) with two n-bit inputs A and B and an n-bit output C

Major cipher operations (4) – Multiplication in the Galois Field GF(2^m)

Software:

C: <<, ^, |, & or alog[(log[X]+log[C]) % 255]

ASM: ROL, XOR, OR, AND or ALOG DW 3H, 5H, … LOG DW 7H, 9H, …

Hardware:

MUL GF(2^8) with an 8-bit input X, an 8-bit output Y, and C = const; each output bit y0, ..., y7 is an XOR of selected input bits (e.g., x0, x3, x7)
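Both software options for GF(2^8) multiplication can be sketched in C. The reduction polynomial 0x11B (the one used by Rijndael) and the generator 3 are assumptions chosen for the example; other ciphers use other polynomials. The table and function names are illustrative.

```c
#include <stdint.h>

/* Shift-and-XOR multiplication in GF(2^8), reduction polynomial
 * x^8 + x^4 + x^3 + x + 1 (0x11B, assumed here as an example). */
uint8_t gf256_mul(uint8_t x, uint8_t c)
{
    uint8_t y = 0;
    while (c) {
        if (c & 1)
            y ^= x;                                   /* add x */
        x = (uint8_t)((x << 1) ^ ((x & 0x80) ? 0x1B : 0x00));
        c >>= 1;
    }
    return y;
}

static uint8_t log_tab[256], alog_tab[256];

/* Fill the log/antilog tables using 3 as the generator. */
void gf256_init_tables(void)
{
    uint8_t v = 1;
    for (int i = 0; i < 255; i++) {
        alog_tab[i] = v;
        log_tab[v] = (uint8_t)i;
        v = gf256_mul(v, 3);
    }
}

/* Table-based multiplication: alog[(log[x] + log[c]) % 255].
 * Zero has no logarithm and is handled separately. */
uint8_t gf256_mul_tab(uint8_t x, uint8_t c)
{
    if (x == 0 || c == 0)
        return 0;
    return alog_tab[(log_tab[x] + log_tab[c]) % 255];
}
```

`gf256_init_tables` must be called once before `gf256_mul_tab`. When the multiplier C is a constant, as on the hardware side of the slide, each output bit reduces to a fixed XOR of input bits.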

Auxiliary cipher operations (1) - Permutation

Permutation P of n bits: inputs x1, x2, x3, ..., xn-1, xn; outputs y1, y2, y3, ..., yn-1, yn

Software:

C: complex sequence of instructions <<, |, &

ASM: complex sequence of instructions ROL, OR, AND

Hardware:

order of wires

Auxiliary cipher operations (2) - Fixed rotation

A <<< n

Software:

C: C = (A << n) | (A >> (32-n));

ASM: ROL A, n

Hardware:

fixed rotation ROL32 of a 32-bit word: order of wires (inputs x1, x2, x3, ..., xn-1, xn; outputs y1, y2, y3, ..., yn-1, yn)

Auxiliary cipher operations (3) - Boolean operations

Software:

C: A ^ B, A & B, A | B

ASM: XOR A, B; AND A, B; OR A, B

Hardware:

XOR, AND, OR on two n-bit inputs A and B with an n-bit output Y: one gate per bit position, combining a0 and b0 into y0, ..., an-1 and bn-1 into yn-1

Auxiliary cipher operations (4) - Addition/subtraction

$C = A + B \bmod 2^n$, n = 32, 16

Software:

C: unsigned long A, B, C; C = A+B;

ASM: ADD

Hardware:

Adder/subtractor (ADD) with two n-bit inputs A and B and an n-bit output C

Multiple designs for hardware adders

[Figure: delay plotted against area for the following adder designs]

• Ripple carry adder (RC)
• Carry-Skip adder (CS)
• Carry-LookAhead adder (CLA)
• Carry-Select adder
• Parallel-Prefix Network adder (Kogge-Stone, Brent-Kung)

Delay and area in HARDWARE: Basic operations

[Figure: delay plotted against area for Boolean, permutation, fixed rotation, variable rotation, addition (CLA), addition (RC), modular multiplication, GF(2^n) multiplication, S-box 4x4, S-box 8x8, S-box 9x32, modular inverse]

Delay and area in SOFTWARE: Basic operations

[Figure: delay plotted against memory for Boolean, permutation, fixed rotation, variable rotation, addition, multiplication, GF(2^n) multiplication, S-box 4x4, S-box 8x8, S-box 9x32, modular inverse]

Major operations of AES finalists

[Table: use of S-boxes, integer multiplication, variable rotation, and multiplication in GF(2^m) by Mars, Twofish, Serpent, RC6, Rijndael]

Auxiliary operations of AES finalists

[Table: use of Boolean operations, addition/subtraction, permutation, and fixed rotation by Mars, Twofish, Serpent, RC6, Rijndael]

Delay and area in HARDWARE: MARS – IBM team

[Figure: the delay vs. area plot of basic operations (Boolean, permutation, fixed rotation, multiplication, addition (CLA), addition (RC), S-box 4x4, S-box 8x8, S-box 9x32, modular inverse), highlighting the operations used by MARS]

Delay and area in HARDWARE: Serpent – R. Anderson, E. Biham, L. Knudsen

[Figure: the same delay vs. area plot, highlighting the operations used by Serpent]

Delay and area in HARDWARE: Rijndael – V. Rijmen, J. Daemen

[Figure: the same delay vs. area plot, highlighting the operations used by Rijndael]

Delay and area in SOFTWARE: MARS – IBM team

[Figure: the delay vs. memory plot of basic operations (Boolean, permutation, fixed rotation, variable rotation, addition, multiplication, GF(2^n) multiplication, S-box 4x4, S-box 8x8, S-box 9x32, modular inverse), highlighting the operations used by MARS]

Operations efficient in both software and hardware: Summary

[Figure: operations placed on a grid of software efficiency (fast & compact vs. slow & big) against hardware efficiency (fast & compact vs. slow or big): permutation, addition, GF(2^n) multiply, multiplication, S-box, Boolean, fixed rotation, variable rotation, modular inverse]

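Types of ciphers

• Feistel Networks
• Modified Feistel Network
• Substitution-Linear Transformation Networks
• Others

AES: Types of candidate algorithms

• Feistel Networks: Twofish, E2, DFC, Deal, LOKI97, Magenta
• Modified Feistel Network: RC6, MARS, CAST-256
• Substitution-Linear Transformation Networks: Rijndael, Serpent, Safer+, Crypton
• Others: Frog, HPC

Feistel Network: Single Round of Twofish

[Figure: words D[3], D[2], D[1], D[0] transformed into D’[3], D’[2], D’[1], D’[0] using the F-function, the subkeys K2r+8 and K2r+9, and the rotations <<< 1 and >>> 1; the legend marks units shared between encryption and decryption]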

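Modified Feistel Network: Single Round of MARS

[Figure: words D[3], D[2], D[1], D[0] transformed into D’[3], D’[2], D’[1], D’[0] using the rotation <<< 13 and the E function with input in, keys k and k’, and outputs out1, out2, out3]

k = K[4+2i], k’ = K[5+2i], i - round no.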


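Substitution-Linear Transformation Network: Single Round of Serpent

[Figure: a 128-bit input combined with the round key K[i], followed by S-boxes and a Linear Transformation, producing a 128-bit output]

Substitution-Linear Transformation Network: Serpent in Hardware

[Figure: initial permutation, a separate encryption block and decryption block on 128-bit data paths, and final permutation; round keys K0, ... , K7, K32 for encryption and K32, ... , K7, K0 for decryption]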

Substitution-Linear Transformation Network: Rijndael in Hardware

[Figure: encryption and decryption data paths. Both use inversion in GF(2^8); encryption applies the affine transformation, ShiftRow, MixColumn, and the subkey; decryption applies the inverse affine transformation, InvShiftRow, the subkey, and InvMixColumn]


Number and complexity of rounds

Number vs. complexity of a round

[Figure: number of rounds plotted against complexity of a round for Triple DES, DES, Serpent, Rijndael, Mars, RC6, Twofish]

Complexity of the cipher round in hardware

[Figure: bar chart of the time in hardware (axis 0 to 100 ns) of a regular round for Serpent, Rijndael, Twofish, RC6, Mars, broken down into critical-path components such as S-boxes 4x4 and 8x8, XOR, ADD32, ROT32, SQR32, MUL32, and MUX2]

K. Gaj, P. Chodowiec, April 2000

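Security margin: Theoretical attacks better than exhaustive key search

[Figure: bar chart, # of rounds in the attack / total # of rounds (axis 0 to 35)]

• Twofish: 6 / 16 (10 rounds remaining)
• Serpent: 9 / 32 (23 rounds remaining)
• Rijndael: 7 / 10 (3 rounds remaining)
• RC6: 15 / 20 (5 rounds remaining)
• Mars without 16 mixing rounds: 11 / 16 (5 rounds remaining)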

Making all rounds identical

Serpent: Hardware Architecture I8

[Figure: a 128-bit register feeding eight cascaded regular rounds, round 0 (K0, 32 x S-box 0, linear transformation) through round 7 (K7, 32 x S-box 7, linear transformation), with K32 applied before the 128-bit output]

One implementation round of Serpent = 8 regular cipher rounds

Serpent – Hardware Architecture I1

[Figure: a 128-bit register feeding one regular Serpent round: key Ki, eight S-box layers in parallel (32 x S-box 0, 32 x S-box 1, ..., 32 x S-box 7) selected by an 8-to-1 128-bit multiplexer, and the linear transformation, with K32 applied before the 128-bit output]

GMU Results: Encryption in cipher feedback modes (CBC, CFB, OFB) - Virtex FPGA

[Figure: throughput in Mbit/s (axis 0 to 500) plotted against area in CLB slices (axis 0 to 5000) for Rijndael, Serpent I8, Mars, RC6, Twofish, Serpent I1, with an added point for Serpent with all S-boxes identical]

Parallelism

Parallelism in SHA-1

[Figure: one SHA-1 step. The 32-bit words A, B, C, D, E are updated using ROTL5 of A, the function ft, ROTL30 of B, the constant Kt, the message word Wt, and four additions]

Operations from two different steps that can be performed in parallel

[Figure: two consecutive SHA-1 steps drawn side by side, with the operations that do not depend on each other marked]

Executing SHA-1 on a 7-way superscalar processor

[Figure: schedule of the ROL1, ROL5, and ROL30 operations of steps n, n+1, n+2, n+3, n+4]

A. Bosselaers, R. Govaerts, J. Vandewalle, 1997
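A single SHA-1 step can be sketched in C to make the available parallelism visible. The sketch assumes a step from the first 20 steps, where ft is the "choose" function; the function and variable names are illustrative.

```c
#include <stdint.h>

static uint32_t rotl(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32 - n));   /* n is 5 or 30 here */
}

/* One SHA-1 step on the state s = {A, B, C, D, E}, for steps 0..19
 * where f_t(B, C, D) = (B AND C) OR (NOT B AND D). */
void sha1_step(uint32_t s[5], uint32_t k_t, uint32_t w_t)
{
    uint32_t a = s[0], b = s[1], c = s[2], d = s[3], e = s[4];

    /* The next three values do not depend on each other, so hardware
     * or a superscalar processor can compute them at the same time. */
    uint32_t f     = (b & c) | (~b & d);
    uint32_t rot_a = rotl(a, 5);
    uint32_t new_c = rotl(b, 30);

    /* The chain of four additions is the serial part of the step. */
    uint32_t t = rot_a + f + e + k_t + w_t;

    s[4] = d;
    s[3] = c;
    s[2] = new_c;
    s[1] = a;
    s[0] = t;
}
```

Only the new value of A depends on the additions; the other four words are copies or a rotation, which is why operations from neighbouring steps can overlap.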

Number of operations that can be executed in parallel for various hash functions

[Figure: bar chart (axis 0 to 8) for SHA-1, RIPEMD160, RIPEMD128, RIPEMD, MD5, MD4]

A. Bosselaers, R. Govaerts, J. Vandewalle, 1997

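Optimization tricks

Rijndael round: Table-lookup implementation

[Figure: the 4 x 4 state of bytes a0,0 to a3,3 is mapped to the output columns b0, b1, b2, b3; each output column, e.g., b2, is obtained by looking up four bytes x0,2, x1,2, x2,2, x3,2 in the tables T0, T1, T2, T3 and combining the results with the round key word k2]

Speed-up in software: ~ 100 times

Speed-up in hardware: ~ 20%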
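The table-lookup idea can be sketched in C under a few assumptions: the tables are derived from the Rijndael S-box and MixColumn coefficients as in common software implementations, T1 to T3 are obtained by rotating T0 instead of being stored, and the input bytes are taken after the row shift. All names are illustrative.

```c
#include <stdint.h>

/* Multiplication in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1. */
static uint8_t gmul(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    while (b) {
        if (b & 1) r ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1B : 0x00));
        b >>= 1;
    }
    return r;
}

static uint8_t rotl8(uint8_t x, unsigned n)
{
    return (uint8_t)((x << n) | (x >> (8 - n)));
}

static uint32_t rotr32(uint32_t x, unsigned n)
{
    return (x >> n) | (x << (32 - n));
}

uint8_t  sbox[256];
uint32_t T0[256];

/* Build the S-box (inversion followed by the affine transformation)
 * and the table T0[a] = (2*S[a], S[a], S[a], 3*S[a]). */
void build_tables(void)
{
    for (int a = 0; a < 256; a++) {
        uint8_t inv = 0;                 /* 0 is mapped to 0 */
        for (int b = 1; b < 256 && a != 0; b++)
            if (gmul((uint8_t)a, (uint8_t)b) == 1) { inv = (uint8_t)b; break; }
        uint8_t s = (uint8_t)(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2)
                              ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63);
        sbox[a] = s;
        T0[a] = ((uint32_t)gmul(s, 2) << 24) | ((uint32_t)s << 16)
              | ((uint32_t)s << 8) | gmul(s, 3);
    }
}

/* One output column of a round: four lookups and four XORs replace
 * the byte substitution, column mixing, and key addition. */
uint32_t round_column(uint8_t x0, uint8_t x1, uint8_t x2, uint8_t x3,
                      uint32_t k)
{
    return T0[x0] ^ rotr32(T0[x1], 8) ^ rotr32(T0[x2], 16)
         ^ rotr32(T0[x3], 24) ^ k;
}
```

`build_tables` must run once before `round_column`. Real implementations usually store all four tables to avoid the rotations, trading memory for speed.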

Serpent: Bit-slice implementation

[Figure: 32 copies of the S-box S, numbered k = 0, 1, 2, 3, ..., 31; copy k takes the input bits $x_1^{(k)}, x_2^{(k)}, x_3^{(k)}, x_4^{(k)}$ and produces, among others, the output bit $y_1^{(k)}$; 32 x 4 = 128 bits]

$y_1^{(k)} = f(x_1^{(k)}, x_2^{(k)}, x_3^{(k)}, x_4^{(k)})$, e.g., $y_1^{(k)} = x_1^{(k)} x_2^{(k)} \oplus (x_3^{(k)} \lor x_4^{(k)})$

[Figure: the same function computed on 32-bit words. The word $x_1^{(31)} \ldots x_1^{(0)}$ AND the word $x_2^{(31)} \ldots x_2^{(0)}$ gives $u_1^{(31)} \ldots u_1^{(0)}$; the word $x_3^{(31)} \ldots x_3^{(0)}$ OR the word $x_4^{(31)} \ldots x_4^{(0)}$ gives $v_1^{(31)} \ldots v_1^{(0)}$; $u_1$ XOR $v_1$ gives $y_1^{(31)} \ldots y_1^{(0)}$]
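The bit-slice technique can be sketched in C using the example function from the slide, y1 = (x1 AND x2) XOR (x3 OR x4). This function is only the slide's illustration, not an actual Serpent S-box output, and the function names are assumptions.

```c
#include <stdint.h>

/* Bit-slice evaluation: bit k of every word belongs to S-box copy k,
 * so three word-wide instructions evaluate 32 S-box copies at once. */
uint32_t bitslice_y1(uint32_t x1, uint32_t x2, uint32_t x3, uint32_t x4)
{
    uint32_t u1 = x1 & x2;   /* AND of all 32 bit pairs */
    uint32_t v1 = x3 | x4;   /* OR of all 32 bit pairs  */
    return u1 ^ v1;
}

/* Reference: the same Boolean function for one S-box copy,
 * operating on single bits. */
unsigned single_y1(unsigned x1, unsigned x2, unsigned x3, unsigned x4)
{
    return ((x1 & x2) ^ (x3 | x4)) & 1u;
}
```

The word-wide version replaces 32 separate table lookups per output bit with a handful of Boolean instructions, which is what makes this approach attractive on 32-bit processors.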

The proposed approach

Cipher design methodology (1)

1. Choose one or at most two major operations efficient in both software and hardware
   best choice: S-box 4x4, GF(2^n) multiplication
2. Choose one or at most two auxiliary operations efficient in both software and hardware
   best choice: Boolean, fixed rotation
3. Choose a cipher type that enables maximum sharing between encryption and decryption
   best choice: Feistel network, modified Feistel network

Cipher design methodology (2)

4. Design a round taking into account the trade-off between
   • round complexity
   • the number of rounds necessary to guarantee a sufficient security margin
5. Make each round [possibly] identical
   negative examples: Serpent, Mars
6. Look for parallelism within a round and among consecutive rounds
   positive example: SHA-1
7. Look for optimization tricks
   positive examples: table-look-up in Rijndael, bit-slice implementation in Serpent

[Figure: Mathematicians, Computer scientists, and Computer Engineers jointly addressing Security, Software efficiency, Hardware efficiency, and Flexibility]

$A100 Challenges

For mathematicians:

Prove or disprove that Serpent with
• all S-boxes identical
• 16 rounds
is at least as secure as Rijndael

For computer scientists:

Is there a way of using instruction-level parallelism to speed up a software implementation of [modified] Serpent to make it as fast as Rijndael?

$A50 Challenge

For computer scientists:

What is the level of parallelism present in SHA-256, SHA-384, SHA-512?

For mathematicians:

Is there a way of changing Serpent into a modified Feistel network cipher without losing its security properties?
