aes and other secret key implementations...ingrid verbauwhede, k.u.leuven - cosic 1 kul - cosic...
TRANSCRIPT
Ingrid Verbauwhede, K.U.Leuven - COSIC 1
KUL - COSIC ECRYPT Summer School - 1 Albena, May 2011
AES and other secret key
implementations
Ingrid Verbauwhede
ingrid.verbauwhede-at-esat.kuleuven.be
K.U.Leuven, ESAT- SCD - COSIC
Computer Security and Industrial Cryptography
Acknowledgements:
Current and former Ph.D. students
at UCLA and K.U.Leuven
KUL - COSIC ECRYPT Summer School - 2 Albena, May 2011
Outline & Goal
• Crypto engineering for secret key algorithms
– Design parameters
– DES
– Modes of operation
– AES
– Light weight crypto
Ingrid Verbauwhede, K.U.Leuven - COSIC 2
KUL - COSIC ECRYPT Summer School - 3 Albena, May 2011
Design Parameters
Embedded security:
Area, delay, power, energy
KUL - COSIC ECRYPT Summer School - 4 Albena, May 2011
Crypto engineering everywhere
• Continuum between softwareand hardware– ASIC (microcode) – FPGA – fully
programmable processor
Everything is alwaysconnected everywhere
Ingrid Verbauwhede, K.U.Leuven - COSIC 3
KUL - COSIC ECRYPT Summer School - 5 Albena, May 2011
Embedded Security
NEED BOTH
• Efficient, light-weight Implementation– Within power, area, timing budgets
– Public key: 1024 bits RSA on 8 bit μC and 100 μW
– Public key on a passive RFID tag
• Trustworthy implementation– Resistant to attacks
– Active attacks: probing, power glitches, JTAG scan chain
– Passive attacks: side channel attacks, including power, timingand electromagnetic leaks
KUL - COSIC ECRYPT Summer School - 6 Albena, May 2011
Cost definition
• Area
• Time
• Power, Energy
• Physical Security
• NRE (Non Recurring Engineering) cost
Ingrid Verbauwhede, K.U.Leuven - COSIC 4
KUL - COSIC ECRYPT Summer School - 7 Albena, May 2011
Design parameters
• Speed or throughput:– HW: Gbits/sec or Mbits/sec/slice
– SW: Cycles/byte, independent of clock frequency
• Area:– HW: mm2 (gate or transistor count)
– SW: memory footprint
• Power or energy consumption:– Power (Watts) for cooling or transmission (RFID)
– Energy (Joule): battery operated devices
• Security: difficult to measure, but we want it– Entropy, leakage functions?
– Measurements until disclosure?
KUL - COSIC ECRYPT Summer School - 8 Albena, May 2011
Throughput: Real-time
• Extremely high throughput (Radar or fiber optics)
• One operator (= hardware unit, e.g. adder, shifter, register)
• for each operation (= algorithmic, e.g. addition, multiplication, delay)
clock frequency = sample frequency
• Most designs: time multiplexing
clock frequency = sample frequency
clock frequency
sample frequency = number of clock cycles available for the job
Ingrid Verbauwhede, K.U.Leuven - COSIC 5
KUL - COSIC ECRYPT Summer School - 9 Albena, May 2011
SW: cycles per byte
• “independent” of
clock frequency
or machine
• Size of packet
matters
• “match” of
algorithm to
architecture
[Source: http://bench.cr.yp.to/results-sha3.html]
Size (bytes)8 64 4096
40 cyc
les/
byte
Cycles/byte
KUL - COSIC ECRYPT Summer School - 10 Albena, May 2011
Power density problem
• Intel S. Borkar power density problem
[Author: S. Borkar, Intel]
Cooling!!
Ingrid Verbauwhede, K.U.Leuven - COSIC 6
KUL - COSIC ECRYPT Summer School - 11 Albena, May 2011
Low Energy: battery capacity
• Rabaey slide battery capacity
One AAA battery: 1300 to 5000 Joule
KUL - COSIC ECRYPT Summer School - 12 Albena, May 2011
Power and Energy are not the same!
• Power = P = I x V (current x voltage) (= Watt)
– instantaneous
– Typically checked for cooling or for peak
performance
• Energy = Power x execution time (= Joule)
– Battery content is expressed in Joules
– Gives idea of how much Joules to get the job done
Low power processor low energy solution !
Ingrid Verbauwhede, K.U.Leuven - COSIC 7
KUL - COSIC ECRYPT Summer School - 13 Albena, May 2011
Heat and parallelism
memory processor
M P
C Pmono = CV2f (Watt)
Power
(Heat)
C/4 C/4 C/4 C/4
M/4 P/4 M/4 P/4 M/4 P/4 M/4 P/44 (C/4)V2(f/4) = Pmono/4
but since f ~ V
can be even Pmono/43
Reduce power = reduce WASTE !!
TREND: MULTI-CORE!!
KUL - COSIC ECRYPT Summer School - 14 Albena, May 2011
Intermezzo: standard cell based design
Ingrid Verbauwhede, K.U.Leuven - COSIC 8
KUL - COSIC ECRYPT Summer School - 15 Albena, May 2011
Logic Design Activities
• Logic and FSM synthesis– State minim., coding
– Multilevel Logic Optimisation
• Technology Mapping– Functions to library cells
– Minimal Area for given delay
• Timing Verification– Estimate wiring load C
– Critical logic path
• Layout– P&R C extraction from wiring ...
Delay
Area
! !aoi ff
Extraction-> Timing
Timing
Closure
2 6... Logic
Depth
#literals
VHDL
Logic
Synthesis
(Synopsys)
KUL - COSIC ECRYPT Summer School - 16 Albena, May 2011
Standard Cell Layout
Std. Cell
Std. Cell Place & Route (RT-Module)
Routing Channel
Cell Row
(Courtesy : Tanner Tools)
Ingrid Verbauwhede, K.U.Leuven - COSIC 9
KUL - COSIC ECRYPT Summer School - 17 Albena, May 2011
Standard Cell Zoom In
layout
vdd
vss
KUL - COSIC ECRYPT Summer School - 18 Albena, May 2011
Module Generation
For data-path operators: structure is in bit-slices
Computer generated layout as function of wordlength
Compact, predictable IP
Power
Instruction, Clock
Data
Ingrid Verbauwhede, K.U.Leuven - COSIC 10
KUL - COSIC ECRYPT Summer School - 19 Albena, May 2011
Standard Cell and Module
Courtesy: J. Van Campenhout RUG
DatapathStandard Cell
Random Logic
KUL - COSIC ECRYPT Summer School - 20 Albena, May 2011
Start with easy one:
Block cipher - DES
Ingrid Verbauwhede, K.U.Leuven - COSIC 11
KUL - COSIC ECRYPT Summer School - 21 Albena, May 2011
Symmetric key: DES
• DES = Data Encryption Standard
• FIPS Standard 46 effective in July 1977: US government standardfor sensitive but unclassified data
• Re-affirmed in 1983, 1988, 1993, 1999 (FIPS 46-3)
• July 26, 2004: FIPS 46-3 is withdrawn: use TDEA or AES
• TDEA = Triple DES encryption algorithm – NIST 800-67
DESPlaintext (Pi) Ciphertext (Ci)
Key = 56 bits + 8 parity bits
64 64
64
KUL - COSIC ECRYPT Summer School - 22 Albena, May 2011
TDEA
• Triple DES Encryption Algorithm, NIST Spec. Pub. 800-67 (May 2004)
• Three Key options:– K1, K2, K3 different
– K1=K3, K2 different
– K1=K2=K3, backward compatible with single DES
• two-key triple DES: until 2009
• three-key triple DES: until 2030
DES
Plaintext (Pi)
Ciphertext (Ci)
64 64
64
DES-1 DES
K164
K264
K3
Ingrid Verbauwhede, K.U.Leuven - COSIC 12
KUL - COSIC ECRYPT Summer School - 23 Albena, May 2011
DES = Feistel cipher
+ f
Li-1 Ri-1
Li Ri
Encryption round i
+ f
Li-1 Ri-1
Li Ri
Decryption round i
Ki Ki
• DES has 16 rounds + initial and final permutation
• Basic cipher structure is Feistel cipher– other examples of Feistel: IDEA, FEAL, Kasumi
• Hardware: encryption = decryption!
(different for AES)
KUL - COSIC ECRYPT Summer School - 24 Albena, May 2011
DES- f function
32b-to-48b permutation
(wiring & bit duplication)
input of S-boxes: 8x6b
Si = 6b-to-4b non linear
substitution (ROM or logic
based Look up table)
output of S-boxes: 8x4b
Expansion E
+
32 Ri-1 Ki48
48
S1 S2 S3 S4 S5 S6 S7 S8
Permutation P
32
32
f(Ri-1, Ki)
• Because of Feistel: no need for f -1 (different for AES)
32b-to32b permutation
(wiring)
Ingrid Verbauwhede, K.U.Leuven - COSIC 13
KUL - COSIC ECRYPT Summer School - 25 Albena, May 2011
DES Key schedule
PC1
C D
56
64
PC2
48
56
PC1: permute and drop 8 bits
C&D: rotate left 1 or 2
bits each round
DECRYPTION: rotate right
PC2: permute and select 48
output bits
Initial key K
Round Key Ki
C&D left/right shift registers: encryption & decryption HW
KUL - COSIC ECRYPT Summer School - 26 Albena, May 2011
Key Schedule
Two options:
• On the “fly” = just in time processing
• Pre-compute and store
MemoryBC
Key
Schedule
Key
Schedule
BCTypical for Hardware
Typical for Software
Ingrid Verbauwhede, K.U.Leuven - COSIC 14
KUL - COSIC ECRYPT Summer School - 27 Albena, May 2011
Key schedule on the fly
• The cost of fast key
context switching:
• Example for IPSEC
router
– one 128 bit key = 1408
bits round keys (10 rounds
+ initial key)
– half of internet packets are
only 64 bytes in length
(512 bits)10 102 103 104 105
0
2
4
6
8
10
Record Size (bytes)
ARC4AES3DES
Co
nte
xt
ba
nd
wid
th (
Gb
ps
) Data at 1Gbps
[source: J. Goodman]
BANDWIDTH PROBLEM !
KUL - COSIC ECRYPT Summer School - 28 Albena, May 2011
Modes of operation
Ingrid Verbauwhede, K.U.Leuven - COSIC 15
KUL - COSIC ECRYPT Summer School - 29 Albena, May 2011
Design method
• Advice: include modes of operation into
hardware IP module or co-processor:
- increases the complexity somewhat: more
control or instructions are needed
+ CLEAN security partitioning
+ reduces communication overhead and traffic
KUL - COSIC ECRYPT Summer School - 30 Albena, May 2011
Modes of operation: ECB
• ECB = Electronic code book
• cipher blocks are independent, thus insertion ordeletion of blocks can go undetected
• block cipher does not hide data patterns
• BC = block cipher (e.g. 3DES or AES)
BC BC-1
Message M Ciphertext C Plaintext M
Key K K
Ingrid Verbauwhede, K.U.Leuven - COSIC 16
KUL - COSIC ECRYPT Summer School - 31 Albena, May 2011
Modes of operation: CBC
• CBC = Cipher block chaining
• error in Ci: propagation over 2 blocks (Ri and Ri+1)
• loss of block synchronization = fatal
• error in Pi: propagation for the remaining blocks
• mostly used with encryption-only for MAC generation
BC+Pi
Ci-1 CiBC-1 +
Ci-1
Ri
KUL - COSIC ECRYPT Summer School - 32 Albena, May 2011
Modes of operation: CBC-MAC
• Cipher block chaining – Message Authentication Code
• Initialization Vector: IV = C-1
• Feedback inhibits pipelining
BC+
Pi
Ci-1
BC+
Ci-1
Pi
Ingrid Verbauwhede, K.U.Leuven - COSIC 17
KUL - COSIC ECRYPT Summer School - 33 Albena, May 2011
Pipelining?
• Due to feedback pipeline remains empty
• Worse for triple DES
DES
+
Ci-1
Pi
16 xPi
DES
+
Ci-1
Pi
DES
+
Ci-1
Pi
48 xPi
48 rounds:
16 DES(K1)
16 DES-1(K2)
16 DES (K3)
KUL - COSIC ECRYPT Summer School - 34 Albena, May 2011
Modes of operation: counter
• Converts block cipher into stream cipher
• no feedback: pipelining is possible
• crucial to choose non-repeating counter functions, e.g. LFSR
• crucial to choose counter IV’s that are UNIQUE
BC
Pi
yi
Ci
yi
BC
Pi
cntri cntri
Ingrid Verbauwhede, K.U.Leuven - COSIC 18
KUL - COSIC ECRYPT Summer School - 35 Albena, May 2011
Modes of operation: OCB
• Offset code book – proprietary
• (used to be) popular because option in 802.11 WLAN
• can be pipelined
• need encryption & decryption logic (problem for AES)
BC+Pi
ZiCi
+
Zi
BC-1+Pi
Zi
+
Zi
+ extraschecksum + extras
checksum
KUL - COSIC ECRYPT Summer School - 36 Albena, May 2011
CCM (Counter + CBC MAC) mode
• Encryption & MAC creation (802.11 WLAN)
MIC = Message Integrity Check is same as MAC
Clear text frame
FC Dur A1 A2 A3 A4SC QC PC Data MIC
AES_E(K) AES_E(K) AES_E(K)AES_E(K)AES_E(K)
CBC-MAC
AES_E(K)AES_E(K)
FC Dur A1 A2 A3 A4SC QC PC Data MIC
Pl(2)Pl(1)
Counter preload
Transmitted
encrypted frame
IV
AES_E(K)
FCS
0 padded0 padded
Flag Nonce Dlen
Flag Nonce Cnt
Hlen
AES_E(K)
Pl(C)
AES_E(K)
Pl(0)
Ingrid Verbauwhede, K.U.Leuven - COSIC 19
KUL - COSIC ECRYPT Summer School - 37 Albena, May 2011
Block cipher modes of operation
• Conclusion: most practical applications ONLY use encryption. Important for:
– AES because encryption is more efficient than decryption (non-Feistel)
– area constraint applications (e.g. IEEE 802.11)
Privacy & MAC
Privacy & MAC
Yes
Only CTR
Dec
2Enc
Enc
2Enc
OCB
CCM
YesEncEncCTR
NoEncEncOFB
NoEncEncCFB
Message
authentication
NoEncEncCBC-MAC
NoDecEncCBC
not usedYesDecEncECB
NotesPipeliningReceiverSender
KUL - COSIC ECRYPT Summer School - 38 Albena, May 2011
Block cipher AES
Hardware
Ingrid Verbauwhede, K.U.Leuven - COSIC 20
KUL - COSIC ECRYPT Summer School - 39 Albena, May 2011
AES: Byte substitution
• Byte substitution: each byte individual
• 16 identical Sboxes
• Area - time trade-off: HW multiplexing
• 32 for Rijndael
a2 a6 a10 a14
a0 a4 a8 a12
a1 a5 a9 a13
a3 a7 a11 a15
b0 b4 b8 b12
b1 b5 b9 b13
b2 b6 b10 b14
b3 b7 b11 b15
GF(28)-1
Permute
ai bi
KUL - COSIC ECRYPT Summer School - 40 Albena, May 2011
AES: Shiftrow
• Shiftrow: circularly rotate each row of state array
• Easy wiring
a2 a6 a10 a14
a0 a4 a8 a12
a1 a5 a9 a13
a3 a7 a11 a15
a10 a14 a2 a6
a0 a4 a8 a12
a5 a9 a13 a1
a15 a3 a7 a11
Shiftrow
Ingrid Verbauwhede, K.U.Leuven - COSIC 21
KUL - COSIC ECRYPT Summer School - 41 Albena, May 2011
a6 a5 a4 a3 a2 a1 a0 0
0 0 0 a7 a7 0 a7 a7
b7 b6 b5 b4 b3 b2 b1 b0
AES: mix column
• matrix multiplication of state array columns
– multiply with constant entries
a2 a6 a10 a14
a0 a4 a8 a12
a1 a5 a9 a13
a3 a7 a11 a15
b0 b4 b8 b12
b1 b5 b9 b13
b2 b6 b10 b14
b3 b7 b11 b15
bi
bi+1
bi+2
bi+3
ai
ai+1
ai+2
ai+3
2 3 1 1
1 2 3 1
1 1 2 3
3 1 1 2
=
2 x
+
a7 a6 a5 a4 a3 a2 a1 a0a6 a5 a4 a3 a2 a1 a0 0
0 0 0 a7 a7 0 a7 a7
b7 b6 b5 b4 b3 b2 b1 b0
+
3 x
KUL - COSIC ECRYPT Summer School - 42 Albena, May 2011
Mix column - encryption
GF(B x 2)
GF(B x 3)
+
G(x)
00011011
<< 1carry
01
GF(B x 1)
+
GF(B)
8
a
b
c
d
02 03 01 01
01 02 03 01
01 01 02 03
03 01 01 02
Mix Column Operation is
GF(28) Linear Transform
+
<< 10
+++
Ingrid Verbauwhede, K.U.Leuven - COSIC 22
KUL - COSIC ECRYPT Summer School - 43 Albena, May 2011
Key schedule
• Unit is 32 bit words, W[i] = 32 bit = 1 column
• 4 different operations
• One round key is four W[i]’s
Input key
W[i-Nk] ^ W[i-1] W[i-Nk] ^ ByteSub(RotByte(W[i-1]))^ Rcon[i/Nk]
128b
W[i-Nk] ^ ByteSub(W[i-1])
0 1 2 3 4 5 6 7 8 9 10
0 1 2 3 4 5 6 7 8 9 10 11 12
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
192b
256b
KUL - COSIC ECRYPT Summer School - 44 Albena, May 2011
Key schedule
• Encryption key
– HW: on the fly: round key[i] = function (round key[i-1])
– SW: pre-compute and store in context (176, 208, 240Bytes)
• Decryption key
– encryption key in reverse order
– BUT need final round words to start
…
W[15] = D0 = C1^D1
W[14] = C0 = B1^C1
W[13] = B0 = A1^B1
W[12] = A0 = f(C1^D1)^A1
…
…
W[16] = A1 = f(D0)^A0
W[17] = B1 = f(D0)^A0^B0
W[18] = C1 = f(D0)^A0^B0^C0
W[18] = D1 = f(D0)^A0^B0^C0^D0
…
128 bit encrypt 128 bit decrypt
Initial Round Words
Initial Round Words
Final Round Words
Final Round Words
Ingrid Verbauwhede, K.U.Leuven - COSIC 23
KUL - COSIC ECRYPT Summer School - 45 Albena, May 2011
Combined AES architecture
• AES is not
Feistel
• Every operation
has its inverseMixColumn
ShiftRow
ByteSub
KeyAdd
KeyAdd
ShiftRow
ByteSub
KeyAdd
Ki
K0
KNr
Input Data
Output Data
Encryption
MixColumn-1
ShiftRow-1
ByteSub-1
KeyAdd-1
KeyAdd-1
ShiftRow-1
ByteSub-1
KeyAdd-1
Ki
K0
KNr
Output Data
Input Data
Decryption
KUL - COSIC ECRYPT Summer School - 46 Albena, May 2011
Decryption datapath
MixColumn-1
ShiftRow-1
ByteSub-1
KeyAdd-1
KeyAdd-1
ShiftRow-1
ByteSub-1
KeyAdd-1
Ki
K0
KNr
Output Data
Input Data
Decryption
MixColumn-1
ByteSub-1
ShiftRow-1
KeyAdd
KeyAdd
ByteSub-1
ShiftRow-1
KeyAdd
Ki
K0
KNr
Output Data
Input Data
Decryption
KeyAdd
ShiftRow-1
ByteSub-1
KeyAdd
MixColumn-1
ShiftRow-1
ByteSub-1
KeyAdd
Ki
KNr
K0
Output Data
Input Data
Decryption
MixColumn-1
ByteSub-1
ShiftRow-1
KeyAdd
KeyAdd
ByteSub-1
ShiftRow-1
KeyAdd
Ki
K0
KNr
Output Data
Input Data
Decryption
Reorganize
Key addition
is its own
inverse
switch
shiftrow and
bytesub
Flip
upside
down
Ingrid Verbauwhede, K.U.Leuven - COSIC 24
KUL - COSIC ECRYPT Summer School - 47 Albena, May 2011
Combined architecture
MixColumn
ShiftRow
ByteSub
KeyAdd
KeyAdd
ShiftRow
ByteSub
KeyAdd
Ki
K0
KNr
Input Data
Output Data
Encryption
KeyAdd
ShiftRow-1
ByteSub-1
KeyAdd
MixColumn-1
ShiftRow-1
ByteSub-1
KeyAdd
Ki
KNr
K0
Output Data
Input Data
Decryption
• Does not followcompletely Rijndaelproposal (which suggeststo switch KeyAdd andMixColumn) because itrequires a InvMixColumnto be applied to theroundkey.
KUL - COSIC ECRYPT Summer School - 48 Albena, May 2011
Combined architecture
• Provides encryption &
decryption
• Key addition is its
own inverse
ByteSub
ShiftRow
enc 10
MixColumn
10
01
KeyAdd
01
enc
enc
enc·first
first
regSel
Output Data
Input Data
Combined
enc
Ingrid Verbauwhede, K.U.Leuven - COSIC 25
KUL - COSIC ECRYPT Summer School - 49 Albena, May 2011
Sub modules
GF(28)
-1
permperm
-1
1
0 0
1
enc
In
Out
ByteSub
0
1
0
1
Out[3:0]
Out[7:4]
Out[11:8]
Out[15:12]
In[3:0]
In[4,7,6,5]
In[6,5,4,7]
In[11:8]
In[12,15,14,13]
In[14,13,12,15]
enc
ShiftRow
2 3 1 1
1 2 3 1
1 1 2 3
3 1 1 2
·Incol+enc·
c 8 c 8
8 c 8 c
c 8 c 8
8 c 8 c
·IncolOutcol=
MixColumn
Reason that
decryption is slower
KUL - COSIC ECRYPT Summer School - 50 Albena, May 2011
Sbox optimization
• GF(28)-1 requires large Look up tables
• Map to isomorphic fields, GF((24)2) or GF(((22)2)2) andinvert there
• smaller but slower!
()2 ()-1p0
A
A-1
al
ah
GF(24)
GF(28)
GF(28)
Ingrid Verbauwhede, K.U.Leuven - COSIC 26
KUL - COSIC ECRYPT Summer School - 51 Albena, May 2011
Sbox experiment• 0.18 μm CMOS, Synopsys experiment
• size of 1 Sbox, push for area or for speed
0
200
400
600
800
1000
1200
1400
1600
1800
2000
0 2 4 6 8
Latency (ns)
Are
a (
ga
tes
)
Direct Implementation Wolkerstorfer
GF(28)
GF((24)2)
KUL - COSIC ECRYPT Summer School - 52 Albena, May 2011
Compact SBOX
• GF(((22)2)2) instead of GF(28)
• Reduces the gate count to only 280 gates!
[size depends on the choice of ]
^2
^-1
4
4 4
4
88
GF(28)
GF(((22)2)2)
GF(((22)2)2)
GF(28)
[Mentens et al., RSA 2005]
Ingrid Verbauwhede, K.U.Leuven - COSIC 27
KUL - COSIC ECRYPT Summer School - 53 Albena, May 2011
Ballpark numbers
• 1 gate = 2input NAND gate = 4 transistors
• Sbox size:– GF(28)-1 = 650 to 700 gates
– GF((24)2)-1 = 400 gates ([Wol02])
– GF(((22)2)2)-1 = 280 gates ( [Sat01][Men05])
– but 50 to 100% slower
• AES core encryption only: 20K to 25Kgates– 128bit data, 128 bit key
– key schedule on the fly
– 1 clock cycle per round
• AES core for encryption and decryption: 40Kgates– 128 bit data, 128 bit key
– precompute and store round keys: 128x11bits SRAM
– 1 clock cycle per round
– savings in combining logic, losses in multiplexers!
KUL - COSIC ECRYPT Summer School - 54 Albena, May 2011
Extensions to Rijndael• Original Rijndael submission
– 128, 192 or 256 bits data (limited to 128bit in AES standard)
– 128, 192 or 256 bits key (unchanged in AES standard)
• Tricky part in key schedule: 2 loops in parallel
ciphertext
N-1 rounds
plaintext
m
m
m128
192
256{ k
128
192
256{
keyaddsubstitution
shiftrowmixcolumn
substitutionshiftrowkeyadd
keyschedule
key
k
L rounds
k
m
final round
roundkey
Ingrid Verbauwhede, K.U.Leuven - COSIC 28
KUL - COSIC ECRYPT Summer School - 55 Albena, May 2011
Key schedule for Rijndael
Cycle 1
Roundkey
Iterated Key
keysched1 keysched1
Cycle 2 Cycle 3
Roundkey
Iterated Key
keysched1 keysched2 keysched1
Cycle 1 Cycle 2
data=128
key=192
data=192
key=128
KUL - COSIC ECRYPT Summer School - 56 Albena, May 2011
Key schedule for Rijndael
• Key schedule on the
fly: one clockcycle
per round
seed
keysched1 keysched2
P C N
roundkey
data encrypt
Key Schedulecontroller
iteratedkey
roundkey assembly
m
k
m,k
m
Ingrid Verbauwhede, K.U.Leuven - COSIC 29
KUL - COSIC ECRYPT Summer School - 57 Albena, May 2011
• Rijndael
• Enc + Dec
• 0.18um CMOS
• Standard cells
KUL - COSIC ECRYPT Summer School - 58 Albena, May 2011
• AES, 2nd generation
• Regular & WDDL based implementation
• Standard cells
• 0.18 um CMOS
Ingrid Verbauwhede, K.U.Leuven - COSIC 30
KUL - COSIC ECRYPT Summer School - 59 Albena, May 2011
SW implementations
Illustrate with DES
• Option 1: operation by operation
• Option 2: table look-up
• Option 3: bit-slices
KUL - COSIC ECRYPT Summer School - 60 Albena, May 2011
SW: DES- f function
32b-to-48b perm/expansion
Combined with permutation P:
one Look-Up-Table
Wordwise EXOR: efficient
Sbox: Look-up Table
Expansion E
+
32 Ri-1 Ki48
48
S1 S2 S3 S4 S5 S6 S7 S8
Permutation P
32
32
f(Ri-1, Ki)
• DES: made for HW, inefficient in SW
Ingrid Verbauwhede, K.U.Leuven - COSIC 31
KUL - COSIC ECRYPT Summer School - 61 Albena, May 2011
Software SBOX
KUL - COSIC ECRYPT Summer School - 62 Albena, May 2011
SBOX in SW
Christof Paar, MEAD Course „Cryptographic Engineering“, Sept. 2009
Ingrid Verbauwhede, K.U.Leuven - COSIC 32
KUL - COSIC ECRYPT Summer School - 63 Albena, May 2011
Christof Paar, MEAD Course „Cryptographic Engineering“, Sept. 2009
Bit Slicing: Alternative Data RepresentationBit Slicing: Alternative Data Representation
• Introduced by Biham, 1997
• each register contains 1 bit of, e.g.,
32 blocks
• pipelining (=parallelization) of n
encryptions
bit 1, block n bit 1, block 1
bit 2, block 1 bit 2, block n
Block 1
Block n
Block 2
…
64
bit 64, block 1 bit 64 , block n
…
register (n=16, 32, 64)• CPU can be viewed as
16/32/64 one-bit parallel
processors
• CPU acts as SIMD
(Single-instruction
multiple-data) processor
Encryption with Bit Slicing (1): PermutationsEncryption with Bit Slicing (1): Permutations
bit 1, block n bit 1, block 1
bit 2, block 1 bit 2, block n
bit 64, block 1 bit 64 , block n
…
bit 1 , block n
bit 2, block n bit 2, block 1
bit 64, block 1 bit 64, block n
bit 1, block 1
…
Bit permuation is realized
by re-ordering of registers
(in practice: re-ordering of
pointers)
Ex: 64-64 bit permutation:
Christof Paar, MEAD Course „Cryptographic Engineering“, Sept. 2009
Ingrid Verbauwhede, K.U.Leuven - COSIC 33
KUL - COSIC ECRYPT Summer School - 65 Albena, May 2011
Christof Paar, MEAD Course „Cryptographic Engineering“, Sept. 2009
Encryption with Bit Slicing (2): S-BoxEncryption with Bit Slicing (2): S-Box
S-box
a b c d e f
O1 O2 O3 O4
Each output can be expressed as Boolean function (i.e., function
on bits)
O1 = f1(a, b, c, d, e, f)
…
O4 = f4(a, b, c, d, e, f)
On average, each S-box requires about 100 gates.
KUL - COSIC ECRYPT Summer School - 66 Albena, May 2011
Christof Paar, MEAD Course „Cryptographic Engineering“, Sept. 2009
Encryption with Bit Slicing (2): S-BoxEncryption with Bit Slicing (2): S-Box
bit 48 , block n
bit 1, block n bit 1, block 1
bit 2, block 1 bit 2, block n
bit 48, block 1
…
Ex:
Reg 8 AND Reg 17
OR Reg 44
NOR …
S-box
a b c d e f
O1 O2 O3 O4
Idea: Compute S-box Si for many blocks in parallel!
i.e., realize functions
O1 = f1(a, b, c, d, e, f)
…
O4 = f4(a, b, c, d, e, f)
for 64 (or 32 or 16) blocks of data in parallel!
Ingrid Verbauwhede, K.U.Leuven - COSIC 34
KUL - COSIC ECRYPT Summer School - 67 Albena, May 2011
Christof Paar, MEAD Course „Cryptographic Engineering“, Sept. 2009
DES Bit Slicing: Performance on 64 bit CPUDES Bit Slicing: Performance on 64 bit CPU
8 S-Boxes, 64 blocks parallel: 100 x 8 = 800 instructions
total (incl. load/store & conversion of I/O data)
19000 instr. / 64 blocks
300 instr. / block
4-5 instr. / bit
Throughput on 300 MHz Alpha
bit sliced: 137 Mbit/sec
optimized non-bit sliced: 46 Mbit/sec
ONLY COMPATIBLE WITH PIPELINE TYPE MODES OF OPERATION!!
Further reading: [Biham 97]
KUL - COSIC ECRYPT Summer School - 68 Albena, May 2011
[1] Amphion CS5230 on Virtex2 + Xilinx Virtex2 Power Estimator
[2] Dag Arne Osvik: 544 cycles AES – ECB on StrongArm SA-1110
[3] Helger Lipmaa PIII assembly handcoded + Intel Pentium III (1.13 GHz) Datasheet
[4] gcc, 1 mW/MHz @ 120 Mhz Sparc – assumes 0.25 u CMOS
[5] Java on KVM (Sun J2ME, non-JIT) on 1 mW/MHz @ 120 MHz Sparc – assumes 0.25 u CMOS
648 Mbits/secAsm
Pentium III [3] 41.4 W 0.015 (1/800)
Java [5] Emb.
Sparc 450 bits/sec 120 mW 0.0000037
(1/3.000.000)
C Emb. Sparc [4]133 Kbits/sec 0.0011 (1/10.000)
350 mW
Power
1.32 Gbit/secFPGA [1]
11 (1/1)3.84 Gbits/sec0.18μm CMOS
Figure of Merit
(Gb/s/W)
ThroughputAES 128bit key
128bit data
490 mW 2.7 (1/4)
120 mW
Throughput – Energy numbers
ASM StrongARM
[2] 240 mW 0.13 (1/85)31 Mbit/sec
Ingrid Verbauwhede, K.U.Leuven - COSIC 35
KUL - COSIC ECRYPT Summer School - 69 Albena, May 2011
Conclusions
• Energy is not the same as power
• Feistel = hardware friendly!
• Modes of operation inhibit pipelining,parallelism
– in SW bitslicing = pipelining
• Need for light weight (ultra small)
• Need for high throughput (ultra fast)
• Need for low power/low energy