kris gaj electrical and computer engineering george mason university towards secure cryptographic...
TRANSCRIPT
Kris GajElectrical and Computer EngineeringGeorge Mason University
Towards secure cryptographic transformations efficient in both software and hardware:
A case for synergy among math, computing, and engineering
http://ece.gmu.edu/crypto-text.htm
Motivation
Criteria used to evaluate cryptographictransformations
Security
SoftwareEfficiency
HardwareEfficiency
Flexibility
Flexibility
• Additional key-sizes and block-sizes
• Ability to function efficiently and securely in a wide variety of platforms and applications low-end smartcards, wireless: small memory requirements IPSec, ATM – small key setup time in hardware B-ISDN, satellite communication – large encryption speed
Advanced Encryption Standard (AES) Contest1997-2001
15 Candidates from USA, Canada, Belgium,
France, Germany, Norway, UK, Israel,Korea, Japan, Australia, Costa Rica
June 1998
August 1999
October 2000
1 winner: RijndaelBelgium
5 final candidates
Mars, RC6, Rijndael, Serpent, Twofish
Round 1
Round 2
SecuritySoftware efficiency
Flexibility
SecurityHardware efficiency
NESSIE ProjectNew European Schemes for Signatures,
Integrity, and Encryption2000-2002
CRYPTREC Project2000-2002
Europe
Japan
Multiple types of transformations:
Development of methodology of a fair evaluation and comparison of algorithms belonging to the same class, including
software and hardware efficiency
NESSIE, CRYPTREC
• Symmetric-key block ciphers• Stream ciphers• Hash functions• MACs• Asymmetric encryption schemes• Asymmetric digital signature schemes• Asymmetric identification schemes
0
50
100
150
200
250
300
350
400
450
500
Serpent Rijndael Twofish RC6 Mars
Speed of the final AES candidates in hardware
Speed [Mbit/s] K.Gaj, P. Chodowiec, AES3, April, 2000
0102030405060708090100
SerpentRijndael Twofish RC6 Mars
Survey filled by 167 participants of the Third AES Conference, April 2000
# votes
SerpentRijndael Twofish RC6 Mars
Results of the NSA groupHardwareSpeed [Mbit/s]
606
414
0
100
200
300
400
500
600
700
202
105 10357
431
177143
61
NSAASIC
GMUFPGA
0
5
10
15
20
25
30
SerpentRijndael TwofishRC6 Mars
Efficiency in software: NIST-specified platform
128-bit key192-bit key256-bit key
200 MHz Pentium Pro, Borland C++Speed [Mbits/s]
Security Margin
Complexity
High
Adequate
Simple Complex
NIST Report: Security
Rijndael
MARSSerpentTwofish
RC6
Security: Theoretical attacks better than exhaustive key search
0 5 10 15 20 25 30 35
Twofish
Serpent
Rijndael
RC6
Mars without 16 mixing rounds
# of rounds in the attack/total # of rounds
6 16
329
7 10
15 20
1611
23
10
5
3
5
0 10 20 30 40 50 60 70 80 90 100
Twofish
Serpent
Rijndael
RC6
Mars
Security: Theoretical attacks better than exhaustive key search
# of rounds in the attack/total # of rounds 100%
28% 72%
38% 62%
69% 31%
70% 30%
75% 25%
0
100
200
300
400
500
600
700
359
610
Speed in hardware [Mbit/s]
SHA-1 SHA-512
Security and hardware speed for hash functions
Complexityof the best attack 280 2256
GMU team, May 2002
Skipjack AES-256the same as
What’s more important:software or hardware?
Historical view
Secret-key ciphers Hash functions
time
1970
1980
1990
2000
DES – optimized for hardware
Fast Software Encryption:ciphers optimized for software:e.g., RC5, Blowfish, RC4
AES – optimized for software and hardware
MD4-familyoptimized primarilyfor software
DES-based hash functions– optimized for hardware
Software or hardware?
SOFTWARE HARDWARE
security of dataduring transmission
flexibility(new cryptoalgorithms,
protection against new attacks)
speed
random keygeneration
access controlto keys
tamper resistance(viruses, internal attacks)
low cost
Efficiency indicators
Memory
Power consumption
Primary efficiency indicators
Software Hardware
Speed Memory Speed Area
Efficiency parameters
Latency Throughput = Speed
Encryption/decryption
Time to encrypt/decrypt
a single block of data
Mi
Ci
Number of bits encrypted/decrypted
in a unit of time
Encryption/decryption
Mi
Mi+1
Mi+2
Ci
Ci+1
Ci+2
Throughput =Block_size · Number_of_blocks_processed_simultaneously
Latency
What’s more important:Speed or area?
Non-Feedback Cipher ModesECB, counter
Comparison for non-feedback cipher modes, e.g.Counter Mode - CTR
M0 M1 M2
E
Ci = Mi E(IV+i) for i=0..N
MN-1 MN
. . .
E E E E. . .
C1 C2 C3 CN-1 CN
IV IV+1 IV+2 IV+N-1 IV+N
Increasing speed by parallel processing
Encryption/decryption
unit
Encryption/decryption
unit
Encryption/decryption
unit
Encryption/decryption
unit
Encryption/decryption
unit
Encryption/decryption
unit
Increasing speed using pipelining
Cipher 1 Cipher 2
round 1round 1
round 2
round 10
. . .
round 16
. . .
Speed =target_clock_period
block size
targetclock period,e.g., 20 ns
Pipelined operation of the encryption unit
B1
clockcycle 1
B2
2
B1
B3
3
B2B1
B4
4
B3B2B1
B5
5
B4B3B2
B6
6
B5B4B3
B7
7
B6B5B4
B8B7B6B5
8
B13B4B3B10
B14B5B4B11
B15B6B5
B12
B16B7B6
B13
B9B8B7B6
B10B9B8B7
B11B10B9B8
B12B3B2B9
clockcycle 9 10 11 12 13 14 15 16
0
1000
2000
3000
4000
5000
6000
7000
0 10000 20000 30000 40000 50000 60000Area [CLB slices]
Speed [Mbit/s]
Encryption in non-feedback modes (ECB, counter)decryption in all modes
Assuming clock period = 50 MHz
6.4 Gbit/s
SerpentTwofish
RC6
Rijndael
Mars
0
2
4
6
8
10
12
14
16
18
Our Results: Full mixed pipelining
Throughput [Gbit/s] Virtex FPGA
Serpent RijndaelTwofish RC6
16.815.2
13.1 12.2
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
Serpent RijndaelTwofish RC6
Area [CLB slices]
19,700 21,000
46,900
12,600
80 RAMs
dedicated memory blocks, RAMs
Our Results: Full mixed pipelining
NIST Report + GMU Report: Hardware Efficiency
Non-feedback cipher modes: ECB, CTR
Speed
Area
High
Low
Small
RijndaelSerpentTwofish
RC6Mars
Medium
Medium Large
Feedback cipher modesCBC, CFB, OFB
Feedback cipher modes - CBCM1 M2 M3
E
IV
C1 = E(Mi IV)
Ci = E(Mi Ci-1) for i=2..N
MN-1 MN
. . .
E E E E. . .
C1 C2 C3CN-1
CN
Initial transformation
Final transformation
#rounds times
Round Key[i]
i:=i+1
Round Key[0]
i:=1
i<#rounds?
Cipher Round
Round Key[#rounds+1]
Typical Flow Diagram of a Secret-Key Block Cipher
register
combinationallogic
one round
multiplexer
Basic iterative architecture
speed
area
k=2 k=3 k=4 k=5
loop-unrolling
basic architecture
Increasing speed in cipher feedback modes
GMU Results: Encryption in cipher feedback modes (CBC, CFB, OFB) - Virtex FPGA
Throughput [Mbit/s]
Area [CLB slices]
0
100
200
300
400
500
0 1000 2000 3000 4000 5000
Rijndael Serpent I8
Mars
RC6
TwofishSerpent I1
NSA Results: Encryption in cipher feedback modes (CBC, CFB, OFB) - ASIC, 0.5 m CMOS
Throughput [Mbit/s]
Area [CLB slices]
0
100
200
300
400
500
600
700
0 5 10 15 20 25 30 35 40
Serpent I1
RC6 TwofishMars
Rijndael
Decreasing area by resource sharing
F F
D0 D1
D0’ D1’
F
D0 D1
D0’ D1’
multiplexer
Before After
register register
Throughput
Area
basic architecture
Resource sharing: Speed vs. Area
- basic architecture
- resource sharing
resource sharing
NIST Report + GMU Report: Hardware Efficiency
Feedback cipher modes: CBC, CFB
Speed
Area
High
Low
Small
Rijndael
MARS
Serpent
Twofish
RC6Medium
Medium Large
Aren’t software and hardwareoptimizations equivalent?
0
5
10
15
20
25
30
SerpentRijndael TwofishRC6 Mars
Efficiency in software: NIST-specified platform
128-bit key192-bit key256-bit key
200 MHz Pentium Pro, Borland C++Speed [Mbits/s]
0
50
100
150
200
250
300
350
400
450
500
Serpent Rijndael Twofish RC6 Mars
Our Results: Basic architecture - SpeedThroughput [Mbit/s]
Basic atomic operationsof secret-key ciphers and hash functions
Atomic operations used in 41 most popular secret-key ciphers (1)
B. Chetwynd, MS Thesis, WPI
Considered ciphers:
Blowfish, CAST, CAST-128, CAST-256, CRYPTON, CS-Cipher, DEAL, DES, DFC, E2, FEAL, FROG, GOST, Hasty Pudding, ICE, IDEA, Khafre, Khufu, LOKI91, LOKI97, Lucifer, MacGuffin, MAGENTA, MARS, MISTY1, MISTY2, MMB, RC2, RC5, RC6, Rijndael, SAFER K, SAFER+, Serpent, SQUARE, SHARK, Skipjack, TEA, Twofish, WAKE, WiderWake
Major atomic operations used in 41 most popular secret-key ciphers (2)
B. Chetwynd, MS Thesis, WPI
0
5
10
15
20
25
30
35
40
30
107 7
1
S-box Variablerotation
Modularmulti-
plication
GF(2n)multi-
plication
Modularinversion
Auxiliary atomic operations used in 41 most popular secret-key ciphers (3)
B. Chetwynd, MS Thesis, WPI
Boolean(XOR, AND, OR,
etc.)
Fixedrotation
Modularaddition
& subtraction
Permutation0
5
10
15
20
25
30
35
4040
25
20
?
Major cipher operations (1) - S-box
S-box n x mROM
Software Hardware
C
ASM
WORD S[1<<n]={ 0x23, 0x34, 0x56 . . . . . . . . . . . . . .}
S DW 23H, 34H, 56H …..
direct logic
n
m
2n words
n-bit address
m-bit output
...
x1x2
xn
...
y1y2
ym
S
2n m bits
S-box: Memory in hardware32 x 4 = 128 bits
S
4
4
S
4
4
S
4
4
S
4
4
S
4
4
S
4
4
S
4
4
. . .
Memory = 32 24 4 bits = 2 kbit
S
8
8
S
8
8
S
8
8
S
8
8
. . .
16 x 8 = 128 bits
Memory = 16 28 8 bits = 32 kbit = 16 2 kbit
S-box: Memory in software32 x 4 = 128 bits
S
4
4
S
4
4
S
4
4
S
4
4
S
4
4
S
4
4
S
4
4
. . .
Memory = 24 4 bits = 64 bit
S
8
8
S
8
8
S
8
8
S
8
8
. . .
16 x 8 = 128 bits
Memory = 28 8 bits = 2 kbit = 32 64 bits
variable rotation ROL32
Mux-based shifter
High-speed clock
C
ASM
Major cipher operations (2) – Variable Rotation
A <<< B
ROL A, B
C = (A << B) | (A >> (32-B));
min (B, 32-B) CLK’ cycles
HardwareSoftware
fast clock CLK’
A
A<<<B
A<<<0 A<<<16
32
B[4]B[3]
B[2]B[1]
B[0]
C=A·B mod 2n
Half-Multiplier
ASM
C
Major cipher operations (3) – Modular Multiplication
HardwareSoftware
C = A*B;
MUL
n n
MUL
n
n n
n
unsigned long A, B, C;
A B
C
n=32, 16
ASM
C
Major cipher operations (4)Multiplication in the Galois Field GF(2m)
HardwareSoftware
ROL, XOR, OR, ANDorALOG DW 3H, 5H, …LOG DW 7H, 9H, …
8 8
MUL GF(28)
<<, ^, |, &oralog[log[X]+log[C]%255]
X
Y
C = constx0 x3 x7
y0
. . .
x0 x3 x7
y7
x4
Permutation
C
order of wires
Auxiliary cipher operations (1) - Permutation
P
HardwareSoftware
ASM
complexsequence ofinstructions<<, |, &
complexsequence ofinstructionsROL, OR, AND
n
n
x1 x2 x3 xnxn-1
. . .
y1 y2 y3 ynyn-1
. . .
C
order of wires
Auxiliary cipher operations (2) - Fixed rotation
HardwareSoftware
ASM
ROL A, n
x1 x2 x3 xnxn-1
. . .
y1 y2 y3 ynyn-1
. . .
C = (A << n) | (A >> (32-n));
fixed rotationROL32
A <<< n32
ASM
C
Auxiliary cipher operations (3)Boolean operations
HardwareSoftware
XOR A, B AND A, BOR A, B
n n
XOR, AND, OR
A ^ BA & BA | B
A
Y
Ba0 b0
y0
. . .
an-1 bn-1
yn-1
a0 b0
y0
. . .
an-1 bn-1
yn-1
C=A+B mod 2n
Adder/subtractor
ASM
C
Auxiliary cipher operations (4)Addition/subtraction
HardwareSoftware
C = A+B;
ADD
n n
ADD
n
n n
n
unsigned long A, B, C;
A B
C
n=32, 16
Delay
Area
Multiple designs for hardware adders
Ripple carry adder (RC)
Carry-Skip adder (CS)
Carry-LookAhead adder (CLA)Carry-Select adder
Parallel-Prefix Network adder(Kogge-Stone, Brent-Kung)
Delay
Area
modularmultiplication
Boolean
permutation
variablerotationGF(2n)
multiplication
fixed rotation
Delay and area in HARDWAREBasic operations
addition (CLA)
addition (RC)
S-box4x4
S-box8x8
S-box9x32
modularinverse
additionmultiplication
Boolean
permutation
fixed rotation
GF(2n)multiplication
variable rotation
Delay and area in SOFTWAREBasic operations
Delay
Memory
S-box4x4
S-box8x8
S-box9x32
modular inverse
MarsTwofishSerpent RC6Rijndael
Major operations of AES finalists
S-boxes
Integer multiplication
Variable rotation
Multiplication in GF(2m)
MarsTwofishSerpent RC6Rijndael
Auxiliary operations of AES finalists
Boolean
Addition/subtraction
Permutation
Fixed rotation
Delay
Area
modularmultiplication
Boolean
permutation
variablerotationGF(2n)
multiplication
fixed rotation
Delay and area in HARDWAREMARS – IBM team
addition (CLA)
addition (RC)
S-box4x4
S-box8x8
S-box9x32
modularinverse
Delay
Area
modularmultiplication
Boolean
permutation
variablerotationGF(2n)
multiplication
fixed rotation
Delay and area in HARDWARESerpent – R. Anderson, E. Biham, L. Knudsen
addition (CLA)
addition (RC)
S-box4x4
S-box8x8
S-box9x32
modularinverse
Delay
Area
modularmultiplication
Boolean
permutation
variablerotationGF(2n)
multiplication
fixed rotation
Delay and area in HARDWARERijndael – V. Rijmen, J. Daemen
addition (CLA)
addition (RC)
S-box4x4
S-box8x8
S-box9x32
modularinverse
additionmultiplication
Boolean
permutation
fixed rotation
GF(2n)multiplication
variable rotation
Delay and area in SOFTWAREMARS – IBM team
Delay
Memory
S-box4x4
S-box8x8
S-box9x32
modular inverse
Fast & compact Slow & big
Software
Fast &compact
Slow &big
permutation
addition
GF(2n) multiply
multiplication
S-box
Booleanfixed rotation
variable rotation
Operations efficient in both software and hardwareSummary
Slow orbig
Slow or big Hardware
modular inverse
Types of ciphers
Feistel Networks Modified FeistelNetwork
Substitution-Linear TransformationNetworks
Others
AES: Types of candidate algorithms
TwofishE2DFC
DealLOKI97Magenta
RC6MARSCAST-256
RijndaelSerpent
Safer+Crypton
FrogHPC
<<< 1
>>> 1
F - function
Feistel Network: Single Round of Twofish
D[3] D[2] D[1] D[0]
D’[3] D’[2] D’[1] D’[0]
K2r+8 K2r+9
- units shared between encryption and decryption
Modified Feistel Network: Single Round of MARS
D[3] D[2] D[1] D[0]
E
<<<13
D’[3] D’[2] D’[1] D’[0]
k k’
out1
out2
out3
in
k=K[4+2i],k’ = K[5+2i],i - round no.
- units shared between encryption and decryption
Substitution-Linear Transformation Network:Single Round of Serpent
S-boxes
Linear Transformation
128
128
K[i]
- units shared between encryption and decryption
128
initial permutation
encryptionblock
decryptionblock
final permutation
128
128128128
128128
128
128
K0, ... , K7, K32 K32, ... , K7, K0
Substitution-Linear Transformation Network: Serpent in Hardware
Inversion in GF(28)
affinetransformation
inversed affinetransformation
ShiftRow
MixColumn
subkey
InvShiftRow
subkey
InvMixColumn
encryption decryption
Substitution-Linear Transformation Network: Rijndael in Hardware
- units shared between encryption and decryption
Number and complexity of rounds
Number of rounds
Complexity of a round
Triple DES
DES
Serpent
Rijndael
Mars
RC6
Twofish
Number vs. complexity of a round
10
20
30
40
50
Complexity of the cipher round in hardware
Serpent
Rijndael
Twofish
RC6
Mars
S-box 4x4 XOR7
S-box 8x8 XOR6 XOR5 XOR4
6 S-boxes 4x42 ADD32 XOR5 XOR49 XOR2
SQR32 2 ADD32 ROT32
MUL32 4 MUX2
4 MUX2
2 MUX2
MUX2
2 MUX2
regular round
0 20 40 60 80 100Time in hardware [ns]
ADD32 ROT32 ADD32 2XOR2
K. Gaj, P. ChodowiecApril 2000
Security margin: Theoretical attacks better than exhaustive key search
0 5 10 15 20 25 30 35
Twofish
Serpent
Rijndael
RC6
Mars without 16 mixing rounds
# of rounds in the attack/total # of rounds
6 16
329
7 10
15 20
1611
23
10
5
3
5
Making all rounds identical
128-bit register
32 x S-box 0
linear transformation
K0 round 0
32 x S-box 7
linear transformation
K7 round 7
K32
output
128
128
128
Serpent: Hardware Architecture I8
one implementation round of Serpent
=8 regular cipher
rounds
128-bit register
32 x S-box 0
Ki regular Serpent round
32 x S-box 7
linear transformationK32
output
128
128
128
Serpent – Hardware Architecture I1
32 x S-box 1
8-to-1 128-bit multiplexer
128 128 128
128 128 128
GMU Results: Encryption in cipher feedback modes (CBC, CFB, OFB) - Virtex FPGA
Throughput [Mbit/s]
Area [CLB slices]
0
100
200
300
400
500
0 1000 2000 3000 4000 5000
Rijndael Serpent I8
Mars
RC6
TwofishSerpent I1
Serpent with all S-boxesidentical
Parallelism
Parallelism in SHA-1
A
B
D
C
E
ROTL5
ft
ROTL 30
+ + ++
Kt Wt
A
B
D
C
E
32
32
32
32
32
A
B
C
ROTL5
f t
ROTL30
+ + ++
Kt Wt
A
B
D
C
E
32
32
32
32
32
Operations from two different steps that can be performedin parallel
ROL5
ROL1
ROL30
ROL1
ROL5 ROL30
ROL1
ROL5 ROL30
ROL1
ROL5ROL30
ROL30
ROL1
ROL1
ROL30
step n
step n+1
step n+2
step n+3
step n+4
Executing SHA-1 on a 7-way superscalar processorA. Bosselaers, R. Govaerts, J. Vandewalle, 1997
Number of operations that can be executed in parallel
for various hash functions
0
1
2
3
4
5
6
7
8
SHA-1 RIPEMD160
RIPEMD128
RIPEMD MD5 MD4
A. Bosselaers, R. Govaerts, J. Vandewalle, 1997
Optimization tricks
Rijndael round: Table-lookup implementation
a0,0 a0,1 a0,2 a0,3
a1,0 a1,1 a1,2 a1,3
a2,0 a2,1 a2,2 a2,3
a3,0 a3,1 a3,2 a3,3
b0 b1 b2 b3
T0
T1
T2
T3
= k2 x3,2 x2,2 x1,2 x0,2 b2
Speed-up in software: ~ 100 timesSpeed-up in hardware: ~ 20%
Serpent: Bit-slice implementation
S
x1
(0)x2
(0) x3
(0)x4
(0)
y1(0)
S
x1
(1)x2
(1) x3
(1)x4
(1)
y1(1)
S
x1
(2)x2
(2) x3
(2)x4
(2)
y1(2)
S
x1
(3)x2
(3) x3
(3)x4
(3)
y1(3)
S
x1
(31)x2
(31)x3
(31)x4
(31)
y1(31)
y1 = f (x1, x2, x3, x4 ) = x1 x2 (x3 x4 ) (k) (k) (k) (k) (k)
e.g. (k) (k) (k) (k )
=
ANDx1
(31)x1(30) x1
(1) x1(0)x1
(3) x1(2). . .
x2(31)x2
(30) x2(1) x2
(0)x2(3) x2
(2). . .
u1(31)u1
(30) u1(1) u1
(0)u1(3) u1
(2). . .=
ORx3
(31)x3(30) x3
(1) x3(0)x3
(3) x3(2). . .
x4(31)x4
(30) x4(1) x4
(0)x4(3) x4
(2). . .
v1(31)v1
(30) v1(1) v1
(0)v1(3) v1
(2). . .XOR
y1(31)y1
(30) y1(1) y1
(0)y1(3) y1
(2)
32 x 4 = 128 bits
The proposed approach
Cipher design methodology (1)
1. Choose one or maximum two major operations efficient in both software and hardware
best choice: S-box 4x4, GF(2n) multiplication2. Choose one or maximum two auxiliary operations efficient in both software and hardware
best choice: Boolean, fixed rotation3. Choose cipher type that enables maximum sharing among encryption and decryption
best choice: Feistel network, modified Feistel network
Cipher design methodology (2)
4. Design a round taking into account a trade-off among • round complexity• number of rounds necessary to guarantee sufficient security margin
5. Make each round [possibly] identicalnegative examples: Serpent, Mars
6. Look for parallelism within a round and among consecutive rounds
positive example: SHA-1
7. Look for optimization trickspositive examples:
table-look-up in Rijndaelbit-slice implementation in Serpent
Mathematicians
Computerscientists
ComputerEngineers
Security
Softwareefficiency
Hardwareefficiency
Flexibility
$A100 Challenges
For mathematicians:
Prove or disprove that Serpent with • all S-boxes identical• 16 rounds
is at least as secure as Rijndael
For computer scientists:
Is there a way of using instruction level parallelismto speed-up software implementation of [modified] Serpent to make it as fast as Rijndael?
$A50 Challenge
For computer scientists:
What is a level of parallelism present in SHA-256, SHA-384, SHA-512?
For mathematicians:
Is there a way of changing Serpent into a modified Feistel network cipherwithout loosing its security properties?