symmetric cryptography in hardware · 2013-06-17 · tim güneysu hardware security group horst...
Tim Güneysu
Hardware Security Group
Horst Görtz Institute for IT-Security 02/07/2013
Symmetric Cryptography in Hardware
Summer School on Design and Security of Cryptographic Functions, Algorithms and Devices
16.06.2013 Arbeitsgruppe Sichere Hardware, Ruhr-Universität Bochum
Agenda
• Introduction
• Objectives and Design Principles
• Identifying Cryptographic Building Blocks
• Case Study: Advanced Encryption Standard
• Lessons Learned
2
Why Cryptography in Hardware?
• Throughput/Performance
• Energy-Efficiency
• Cost
• Physical Security
3
Cryptographic Hardware Examples
• SmartCards (e.g., PayTV)
• RFID/NFC tags (identification/authentication)
• Accelerators (AES NI, HDD encryption)
• Hardware Security and Trusted Platform Modules (HSM/TPM)
4
Hardware Implementation
• Application Specific Integrated Circuits (ASIC)
– Static circuit with fixed routes and gates
– Dedicated application-specific layout
– Expensive development but cheap per unit
• Field Programmable Gate Arrays (FPGA)
– Fabric of programmable logic and routing
– Different fabric sizes for applications
– Simple development but higher costs per unit
5
Agenda
• Introduction
• Objectives and Design Principles
• Identifying Cryptographic Building Blocks
• Case Study: Advanced Encryption Standard
• Lessons Learned
6
Dimensions in Hardware Design
Gajski diagram [1983]
What about security?
Objectives for Hardware Implementation
• Hardware circuits are designed toward specific goals:
• AREA, DELAY, POWER, THROUGHPUT, ENERGY
8
Hardware Design for Minimum Area
• Targets: RFIDs, crypto cores that run only once in a while
• Strategies:
– Serialize algorithm
– Minimize storage elements
– Reuse resources
[Figure: registers and logic blocks over time t; axis: area]
9
General Design Strategies for Power
• Targets: RFIDs, passively powered devices
• Strategies:
– Serialize algorithm
– Reduce data path
[Figure: active logic and input over time t for a 48-bit versus a 16-bit data path]
10
General Design Strategies for Throughput
• Targets: Accelerators, bulk data applications
• Strategies:
– Parallelize subfunctions/precomputation tables
– Minimize critical path/ maximize frequency
– Unroll iterated structures
– Use pipelining
[Figure: active logic over time t; critical path tmax; 3x replicated data path; lookup table]
11
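A rough way to reason about the throughput goal: for a block cipher core, throughput is block size times clock frequency divided by the clock cycles spent per block. The sketch below only illustrates that relation. The block size, clock frequency and cycle counts are hypothetical numbers, not results from this talk, and the function name is made up for the example.

```python
def throughput_bps(block_bits: int, f_clk_hz: float, cycles_per_block: int) -> float:
    """Throughput in bit/s of a core that finishes one block every cycles_per_block cycles."""
    return block_bits * f_clk_hz / cycles_per_block

# Hypothetical design points (assumed values, same clock for both):
iterated = throughput_bps(128, 100e6, 10)   # one round per cycle, 10 cycles per block
pipelined = throughput_bps(128, 100e6, 1)   # fully pipelined, one block per cycle

print(f"iterated:  {iterated / 1e9:.2f} Gbit/s")
print(f"pipelined: {pipelined / 1e9:.2f} Gbit/s")
```

In a real design the clock frequency changes with the critical path, so unrolling without pipelining does not scale throughput this cleanly.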
General Design Strategies for Energy
• Target: Battery-powered devices, mobile sensors
• Strategies:
– Minimize clock cycles
– Unroll iterated structures
[Figure: 3x unrolled data path; 4 cycles versus 2 cycles]
12
Impact of Hardware Objectives on Block Cipher Implementations
• Example: low-power ASIC in 65 nm CMOS
• Metrics: area, delay, power, energy, throughput
• Unrolling parameter: Nr rounds per cycle
13
Impact on Area
• Naive expectation: 2x rounds → 2x area
• In fact: state registers determine circuit growth
14
Impact on Circuit Delay
• Naive expectation: 2x rounds → 2x delay
• In fact: critical path is not always in the round function
15
Impact on Energy
• Observation: energy is rather stable
16
Agenda
• Introduction
• Objectives and Design Principles
• Identifying Cryptographic Building Blocks
• Case Study: Advanced Encryption Standard
• Lessons Learned
17
Data Encryption Standard (DES)
DES is known as a hardware-optimal block cipher.
DES is still used today with triple encryption:
TDES(x) = enc_k1(dec_k2(enc_k3(x)))
18
Data Encryption Standard – Round Function
[Figure: f-function and round function]
19
Inside the Round Function: Data Expansion
In HW: WIRES
20
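To see why the expansion costs only wires, it helps to write it as pure bit selection. The sketch below generates the DES expansion table E from its regular structure (each 4-bit block is extended by its two neighbouring bits) and applies it. Bit positions are 1-based as in the DES standard. The function names are chosen for this example.

```python
def expansion_table() -> list[int]:
    """Build the 48-entry DES expansion table E (1-based input bit positions)."""
    table = []
    for block in range(8):               # eight 4-bit blocks of the 32-bit half R
        start = 4 * block                # 0-based index of the block's first bit
        for offset in range(-1, 5):      # left neighbour, 4 own bits, right neighbour
            table.append((start + offset) % 32 + 1)
    return table

E = expansion_table()

def expand(r_bits: list) -> list:
    """Apply E to 32 bits: every output bit is a copy of one input bit (a wire)."""
    return [r_bits[pos - 1] for pos in E]

print(E[:6])    # first row of E
print(E[-6:])   # last row of E
```

No gate is involved: 16 of the 32 input bits are simply routed to two outputs each.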
Inside the Round Function: S-Box
• Eight substitution tables.
• 6 bits of input, 4 bits of output.
• Non-linear and resistant to differential cryptanalysis.
In HW: 6-to-4 bit ROM
In HW: XOR gate
Inside the Round Function: Permutation
– Bitwise permutation to introduce diffusion.
– Output bits of one S-Box affect several S-Boxes in the next round
In HW: WIRES
22
The Key Schedule of DES
• Split key into 28-bit halves C0 and D0.
• In rounds i = 1, 2, 9, 16, the two halves are each rotated left by one bit.
• In all other rounds, the two halves are each rotated left by two bits.
• In each round i, permuted choice PC-2 selects a permuted subset of 48 bits of Ci and Di as round key ki, i.e., each ki is a permuted selection of bits of k!
In HW: WIRES
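The rotation schedule above can be sketched in a few lines. The example only models the rotations of the two 28-bit halves. PC-1 and PC-2 are left out because their tables are not given on the slides. The names SHIFTS, rotl28 and rotated_halves are chosen for this example.

```python
# Rotation amount per round: one bit in rounds 1, 2, 9, 16, two bits otherwise.
SHIFTS = [1 if i in (1, 2, 9, 16) else 2 for i in range(1, 17)]

def rotl28(value: int, n: int) -> int:
    """Rotate a 28-bit value left by n bits."""
    mask = (1 << 28) - 1
    return ((value << n) | (value >> (28 - n))) & mask

def rotated_halves(c0: int, d0: int) -> list:
    """Return [(C1, D1), ..., (C16, D16)] for the given 28-bit halves C0 and D0."""
    halves = []
    c, d = c0, d0
    for shift in SHIFTS:
        c, d = rotl28(c, shift), rotl28(d, shift)
        halves.append((c, d))
    return halves

print(sum(SHIFTS))   # total rotation over 16 rounds
```

The rotations add up to 28 bits, so C16 = C0 and D16 = D0. Since each rotation amount is fixed per round, a fully unrolled design needs no shifter logic, only wiring.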
Summary: DES Implementation in Hardware
• Implementation of a DES round consists of
– Wires for all permutations, expansion and selection
– 2-input XOR gate per bit for key addition
– 6-to-4 (i.e. 64x4=256 bit) ROM for one S-box
– 2-input XOR gate per bit to mix left and right half
[Figure: round function slice for 6 input bits: 6x 2-input XOR, 6-to-4 S-Box, 4x 2-input XOR, 4x flip-flop; signals Key_i, R_i, L_i, R_{i+1}]
Summary: Hardware-Friendly Cryptographic Building Blocks
• If you design a hardware-friendly cipher, choose:
– Static permutations → wires
– Static rotations and shifts → wires
– S-Box → Read-Only Memory (ROM)
– Static key → wires tied to GND or VCC
– Dynamic key store (1 bit) → flip-flop/SRAM (≥6 transistors)
– Key addition (1 bit) → XOR gate (≥6 transistors)
– Boolean operations (1 bit) → (N)AND/(N)OR gate (≥4 transistors)
25
Agenda
• Introduction
• Objectives and Design Principles
• Identifying Cryptographic Building Blocks
• Case Study: Advanced Encryption Standard
• Lessons Learned
26
Recall: Advanced Encryption Standard
• AES was designed to be efficient in hardware and software
• Each AES round consists of several layers, each processing all 128 input bits
• The layers in each round are:
– Byte Substitution
– Diffusion Layer: ShiftRows and MixColumns
– Key Addition
27
AES State Representation
• AES is a byte-oriented cipher (8-bit)
• AES round operates on a state matrix of 16 bytes A0-A15
• Output state E0-E15 is input to next round
Input state (filled column by column):
A0 A4 A8 A12
A1 A5 A9 A13
A2 A6 A10 A14
A3 A7 A11 A15
[Figure: input state A0-A15 → ByteSub → ShiftRow → MixColumn → KeyAddition → output state E0-E15]
28
Detailed Round Structure of AES
• Round function operates on 16 state bytes A0-A15
• Note: In the last round, the MixColumn transformation is omitted
29
Byte Substitution Layer (S-Box)
• The S-Box is commonly realized as a lookup table
• The Byte Substitution layer consists of S-Boxes with the following properties:
• 16 identical, bijective S-Boxes
• the only nonlinear elements of AES, i.e., ByteSub(Ai) + ByteSub(Aj) ≠ ByteSub(Ai + Aj)
• Construction of S-Box in GF(2^8):
S[x] = AffineMap(x^-1)
30
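The construction S[x] = AffineMap(x^-1) can be checked with a short script. This is a minimal sketch, not an efficient implementation: the inverse is computed as x^254 (so 0 maps to 0), and the affine map and the constant 0x63 are taken from the AES standard, since the slide does not spell them out.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result

def gf_inv(a: int) -> int:
    """Field inverse as a^254; the input 0 is mapped to 0."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result

def affine_map(b: int) -> int:
    """AES affine map: b XOR its left rotations by 1..4 bits XOR 0x63."""
    out = b
    for shift in range(1, 5):
        out ^= ((b << shift) | (b >> (8 - shift))) & 0xFF
    return out ^ 0x63

SBOX = [affine_map(gf_inv(x)) for x in range(256)]

print(hex(SBOX[0x00]), hex(SBOX[0x01]), hex(SBOX[0x53]))
```

The resulting table has 256 entries of 8 bits, which is what a ROM-based S-Box stores.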
Diffusion Layer (D-Box)
• Mixes the state so that each byte influences as many other state bytes as possible
• Consists of two sublayers:
– ShiftRows Sublayer: Permutation of the data on a byte level
– MixColumn Sublayer: Matrix operation which combines blocks of four bytes (based on MDS code)
31
ShiftRows Sublayer
Rows of the state matrix are shifted cyclically:
Input matrix:
B0 B4 B8 B12
B1 B5 B9 B13
B2 B6 B10 B14
B3 B7 B11 B15
Output matrix:
B0 B4 B8 B12 (no shift)
B5 B9 B13 B1 ← one position left shift
B10 B14 B2 B6 ← two positions left shift
B15 B3 B7 B11 ← three positions left shift
32
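ShiftRows is a fixed byte permutation, so in hardware it is again only wiring. A minimal sketch, assuming the column-major byte numbering used above (byte i sits in row i mod 4, column i div 4); the function name is chosen for this example:

```python
def shift_rows(state: list) -> list:
    """Rotate row r of the 4x4 state left by r positions (column-major byte order)."""
    out = [None] * 16
    for col in range(4):
        for row in range(4):
            src_col = (col + row) % 4
            out[4 * col + row] = state[4 * src_col + row]
    return out

labels = [f"B{i}" for i in range(16)]
shifted = shift_rows(labels)

for row in range(4):
    print(" ".join(shifted[row::4]))   # print the output matrix row by row
```

The first output column is B0, B5, B10, B15, which is the vector that enters MixColumn.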
MixColumn Sublayer
• Linear transformation which mixes groups of 4 state bytes (32 bit)
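• Each 4-byte column is considered a vector and multiplied by a fixed 4x4 matrix, e.g.,
\[
\begin{pmatrix} C_0 \\ C_1 \\ C_2 \\ C_3 \end{pmatrix}
=
\begin{pmatrix} 02 & 03 & 01 & 01 \\ 01 & 02 & 03 & 01 \\ 01 & 01 & 02 & 03 \\ 03 & 01 & 01 & 02 \end{pmatrix}
\begin{pmatrix} B_0 \\ B_5 \\ B_{10} \\ B_{15} \end{pmatrix}
\]
where 01, 02 and 03 are given in hexadecimal notation
• All arithmetic is done in the Galois field GF(2^8)
33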
Key Addition Layer
• Input
– 16-byte state matrix C
– 16-byte subkey ki
• Output: C ⊕ ki (bit-wise XOR)
• A number of subkeys are generated in the key schedule (depending on number of rounds)
• Subkeys are added at the beginning and end of the cipher operations
34
3-D Depiction of AES Round
[Figure: 3-D depiction of one AES round: ByteSub, ShiftRow, MixColumn, KeyAddition]
35
Key Schedule of AES-128 (44 subkey words W[i])
Round constants:
RC[1] = x^0 = (00000001)_2
RC[2] = x^1 = (00000010)_2
RC[3] = x^2 = (00000100)_2
...
RC[10] = x^9 = (00110110)_2
36
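The round constants are successive powers of x in GF(2^8), so each one is the previous constant multiplied by x and reduced by the AES polynomial x^8 + x^4 + x^3 + x + 1. A minimal sketch (function names chosen for this example):

```python
def xtime(a: int) -> int:
    """Multiply by x in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    a <<= 1
    if a & 0x100:
        a ^= 0x11B
    return a

def round_constants(count: int = 10) -> list:
    """Return [RC[1], ..., RC[count]] with RC[i] = x^(i-1)."""
    rc = [0x01]
    for _ in range(count - 1):
        rc.append(xtime(rc[-1]))
    return rc

for i, value in enumerate(round_constants(), start=1):
    print(f"RC[{i}] = {value:08b}")
```

The reduction first matters at RC[9], where x^8 wraps around to x^4 + x^3 + x + 1.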
Efficient Implementation of AES in Hardware
Goal: Implement AES on FPGAs optimized for THROUGHPUT (1) and AREA (2)
Complexity of AES operations in hardware
1. Key addition (bit-wise XOR)
Straightforward: XOR gates in HW or an XOR instruction in SW
2. ByteSub (S-Box)
Realized as a memory table with 2^8 = 256 entries of 8 bits (2 Kbit)
3. ShiftRows
Mere re-ordering of bytes (static permutation)
Remaining operation: MixColumn?
37
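Efficient Implementation of AES: MixColumns
Recall: MixColumn = vector-matrix multiplication on bytes
\[
\begin{pmatrix} c_0 \\ c_1 \\ c_2 \\ c_3 \end{pmatrix}
=
\begin{pmatrix} 02 & 03 & 01 & 01 \\ 01 & 02 & 03 & 01 \\ 01 & 01 & 02 & 03 \\ 03 & 01 & 01 & 02 \end{pmatrix}
\times
\begin{pmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{pmatrix}
\]
Q: How to efficiently realize the constant multiplications 02 · b_i and 03 · b_i?
38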
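Efficient Implementation of AES: MixColumns
Remark: 02 = x and 03 = (x+1) in GF(2^8); arithmetic in GF(2^8) uses the irreducible polynomial m(x) = x^8 + x^4 + x^3 + x + 1
For the term 02 · b_0 of c_0:
\[
\begin{aligned}
02 \cdot b_0 &= x \cdot b_0 \\
&= x \cdot (b_{0,7} x^7 + b_{0,6} x^6 + \dots + b_{0,1} x + b_{0,0}) \\
&= b_{0,7} x^8 + b_{0,6} x^7 + \dots + b_{0,1} x^2 + b_{0,0} x \\
&= b_{0,7} x^8 + b_{0,6} x^7 + \dots + b_{0,1} x^2 + b_{0,0} x + b_{0,7}\, m(x)
\end{aligned}
\]
where the added term is valid because m(x) = 0 in GF(2^8)
→ Shift b_0 to the left by one bit and add m(x) if MSB = 1
For the term 03 · b_0:
\[
03 \cdot b_0 = (x+1) \cdot b_0 = x \cdot b_0 + b_0
\]
→ Compute 02 · b_0 and add b_0 (XOR)
39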
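The two rules above are enough to compute a whole MixColumn column with shifts and XORs only. A minimal sketch: 0x1B is m(x) without the x^8 term, which has already been shifted out of the byte. The function names are chosen for this example, and the test column is the widely used example vector db 13 53 45.

```python
def mul02(b: int) -> int:
    """02 * b: shift left by one bit, add m(x) (as 0x1B) if the MSB was set."""
    shifted = (b << 1) & 0xFF
    return shifted ^ 0x1B if b & 0x80 else shifted

def mul03(b: int) -> int:
    """03 * b = 02 * b XOR b."""
    return mul02(b) ^ b

def mix_column(col: list) -> list:
    """Multiply one 4-byte column by the fixed MixColumn matrix."""
    b0, b1, b2, b3 = col
    return [
        mul02(b0) ^ mul03(b1) ^ b2 ^ b3,
        b0 ^ mul02(b1) ^ mul03(b2) ^ b3,
        b0 ^ b1 ^ mul02(b2) ^ mul03(b3),
        mul03(b0) ^ b1 ^ b2 ^ mul02(b3),
    ]

print([hex(c) for c in mix_column([0xDB, 0x13, 0x53, 0x45])])
```

In hardware, mul02 is a fixed rewiring of the 8 bits plus a few XOR gates on the bit positions where 0x1B has ones.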
Optimizing AES for Throughput (1): T-Tables
Precompute tables including all three layers
[Figure: input state A0-A15 → ByteSub → ShiftRow → MixColumn → KeyAddition → output state E0-E15]
40
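Optimizing AES for Throughput (1): T-Tables
¼ round in matrix notation:
\[
\begin{pmatrix} E_0 \\ E_1 \\ E_2 \\ E_3 \end{pmatrix}
=
\begin{pmatrix} 02 & 03 & 01 & 01 \\ 01 & 02 & 03 & 01 \\ 01 & 01 & 02 & 03 \\ 03 & 01 & 01 & 02 \end{pmatrix}
\times
\begin{pmatrix} S(A_0) \\ S(A_5) \\ S(A_{10}) \\ S(A_{15}) \end{pmatrix}
\oplus
\begin{pmatrix} k_{j,0} \\ k_{j,1} \\ k_{j,2} \\ k_{j,3} \end{pmatrix}
\]
41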
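Optimizing AES for Throughput (1): T-Tables
Equation for a quarter round (32 bits; first column as example):
\[
\begin{pmatrix} E_0 \\ E_1 \\ E_2 \\ E_3 \end{pmatrix}
=
\begin{pmatrix} 02 & 03 & 01 & 01 \\ 01 & 02 & 03 & 01 \\ 01 & 01 & 02 & 03 \\ 03 & 01 & 01 & 02 \end{pmatrix}
\times
\begin{pmatrix} S(A_0) \\ S(A_5) \\ S(A_{10}) \\ S(A_{15}) \end{pmatrix}
\oplus
\begin{pmatrix} k_{j,0} \\ k_{j,1} \\ k_{j,2} \\ k_{j,3} \end{pmatrix}
\]
Decomposition of matrix multiplication:
\[
\begin{pmatrix} E_0 \\ E_1 \\ E_2 \\ E_3 \end{pmatrix}
=
\begin{pmatrix} 02 \\ 01 \\ 01 \\ 03 \end{pmatrix} S(A_0)
\oplus
\begin{pmatrix} 03 \\ 02 \\ 01 \\ 01 \end{pmatrix} S(A_5)
\oplus
\begin{pmatrix} 01 \\ 03 \\ 02 \\ 01 \end{pmatrix} S(A_{10})
\oplus
\begin{pmatrix} 01 \\ 01 \\ 03 \\ 02 \end{pmatrix} S(A_{15})
\oplus
\begin{pmatrix} k_{j,0} \\ k_{j,1} \\ k_{j,2} \\ k_{j,3} \end{pmatrix}
\]
42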
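Optimizing AES for Throughput (1): T-Tables
Definition of the T-Boxes:
\[
T_0[a] = \begin{pmatrix} 02 \cdot S[a] \\ S[a] \\ S[a] \\ 03 \cdot S[a] \end{pmatrix},\quad
T_1[a] = \begin{pmatrix} 03 \cdot S[a] \\ 02 \cdot S[a] \\ S[a] \\ S[a] \end{pmatrix},\quad
T_2[a] = \begin{pmatrix} S[a] \\ 03 \cdot S[a] \\ 02 \cdot S[a] \\ S[a] \end{pmatrix},\quad
T_3[a] = \begin{pmatrix} S[a] \\ S[a] \\ 03 \cdot S[a] \\ 02 \cdot S[a] \end{pmatrix}
\]
each T-Box: 256 x 32 bit
New equation for quarter round:
\[
\begin{pmatrix} E_0 \\ E_1 \\ E_2 \\ E_3 \end{pmatrix}
= T_0(A_0) \oplus T_1(A_5) \oplus T_2(A_{10}) \oplus T_3(A_{15})
\oplus
\begin{pmatrix} k_{j,0} \\ k_{j,1} \\ k_{j,2} \\ k_{j,3} \end{pmatrix}
\]
43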
Optimizing AES for Throughput (1): T-Tables
Full encryption function for ¼ round (32 bits) using T-Tables:
\[
\begin{pmatrix} E_0 \\ E_1 \\ E_2 \\ E_3 \end{pmatrix}
= T_0(A_0) \oplus T_1(A_5) \oplus T_2(A_{10}) \oplus T_3(A_{15})
\oplus
\begin{pmatrix} k_{j,0} \\ k_{j,1} \\ k_{j,2} \\ k_{j,3} \end{pmatrix}
\]
¼ round: 4 table lookups (TLU) + key array + 4 XOR (32 bit wide)
1 round: 4 x 4 = 16 TLU
AES: 160 TLU / 1 block encryption
Memory: 4 T-Boxes, 1 kB each (8192 bits)
44
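The T-Table equation can be checked against the layer-by-layer computation. The sketch below builds the four T-Boxes from a generated S-Box and compares one quarter round computed both ways. It is an illustration in software, not the FPGA design from the talk. Table entries are kept as 4-byte tuples instead of packed 32-bit words, and all function names are chosen for this example.

```python
def mul02(b: int) -> int:
    """Multiply by 02 in GF(2^8): shift left, reduce with 0x1B if the MSB was set."""
    shifted = (b << 1) & 0xFF
    return shifted ^ 0x1B if b & 0x80 else shifted

def mul03(b: int) -> int:
    """Multiply by 03 in GF(2^8): 02 * b XOR b."""
    return mul02(b) ^ b

def gf_mul(a: int, b: int) -> int:
    """General GF(2^8) product by shift-and-add, built on mul02."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = mul02(a)
        b >>= 1
    return result

def sbox(x: int) -> int:
    """AES S-Box: affine map (AES standard) of the inverse x^254; 0 maps to 0x63."""
    inv = 1
    for _ in range(254):
        inv = gf_mul(inv, x)
    out = inv
    for shift in range(1, 5):
        out ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
    return out ^ 0x63

S = [sbox(x) for x in range(256)]

# One T-Box per matrix column; each of the 256 entries holds 4 bytes (32 bits).
T0 = [(mul02(s), s, s, mul03(s)) for s in S]
T1 = [(mul03(s), mul02(s), s, s) for s in S]
T2 = [(s, mul03(s), mul02(s), s) for s in S]
T3 = [(s, s, mul03(s), mul02(s)) for s in S]

def xor_words(*words) -> tuple:
    """Byte-wise XOR of any number of 4-byte words."""
    out = [0, 0, 0, 0]
    for word in words:
        out = [o ^ b for o, b in zip(out, word)]
    return tuple(out)

def quarter_round_ttable(a0, a5, a10, a15, key) -> tuple:
    """E0..E3 via four table lookups and XORs."""
    return xor_words(T0[a0], T1[a5], T2[a10], T3[a15], key)

def quarter_round_reference(a0, a5, a10, a15, key) -> tuple:
    """E0..E3 via ByteSub, MixColumn and KeyAddition computed separately."""
    b0, b1, b2, b3 = S[a0], S[a5], S[a10], S[a15]
    column = (
        mul02(b0) ^ mul03(b1) ^ b2 ^ b3,
        b0 ^ mul02(b1) ^ mul03(b2) ^ b3,
        b0 ^ b1 ^ mul02(b2) ^ mul03(b3),
        mul03(b0) ^ b1 ^ b2 ^ mul02(b3),
    )
    return xor_words(column, key)

key = (0x0F, 0x1E, 0x2D, 0x3C)  # arbitrary example round-key word
print(quarter_round_ttable(0x00, 0x11, 0x22, 0x33, key)
      == quarter_round_reference(0x00, 0x11, 0x22, 0x33, key))
```

ShiftRows does not appear as a computation in either function: it is absorbed into the choice of input bytes A0, A5, A10, A15.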
Optimizing AES for Throughput (1): T-Tables
Two T-Tables are stored in one 18 kBit dual-ported BRAM
Outputs of the BRAMs are XORed with each other and with the key
Fixed permutation (wires!) used to select the correct input byte
Four instances with 4 XOR/2 BRAMs to compute columns 0-3 of the AES state
[Figure: 128-bit input IN 0 ... IN 3 routed through fixed permutations π to T0-T3 in BRAM (8-bit inputs, 32-bit outputs); outputs XORed with key words k_i ... k_{i+3} to form Column 0 ... Column 3; 128-bit output]
45
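Optimizing AES for Area (2): Tower-Field Arithmetic for S-Boxes in GF((2^4)^2)
An 8-to-8 bit S-Box table (2 Kbit) is huge in area
[Figure: S-Box datapath: transformation T to GF((2^4)^2), inversion, reverse transformation T^-1 back to GF(2^8)]
Inversion of S = s_h x + s_l in GF((2^4)^2):
\[
S^{-1} = (s_h x + s_l)^{-1} = s_h \Delta x + (s_l + s_h) \Delta
\]
with
\[
\Delta = (s_h^2 \lambda + s_h s_l + s_l^2)^{-1} = (s_h^2 \lambda + s_l (s_h + s_l))^{-1}
\]
obtained from the extended Euclidean algorithm (EEA; see the AES book by Daemen/Rijmen)
46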
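The inversion formula can be verified by brute force over all 256 elements. The sketch below makes three assumptions that the slide does not state: GF(2^4) is built with the polynomial x^4 + x + 1, the extension uses the reduction x^2 = x + λ (the formula above is consistent with this choice), and λ is simply the first value for which x^2 + x + λ has no root in GF(2^4). The transformations T and T^-1 to and from the AES field are not modelled.

```python
def gf16_mul(a: int, b: int) -> int:
    """Multiply in GF(2^4) modulo x^4 + x + 1 (assumed field polynomial, 0x13)."""
    result = 0
    for _ in range(4):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= 0x13
    return result

def gf16_inv(a: int) -> int:
    """Inverse in GF(2^4) as a^14; the input 0 is mapped to 0."""
    result = 1
    for _ in range(14):
        result = gf16_mul(result, a)
    return result

# Assumed choice: first lambda for which x^2 + x + lambda is irreducible over GF(2^4).
LAMBDA = next(lam for lam in range(1, 16)
              if all(gf16_mul(y, y) ^ y ^ lam for y in range(16)))

def tower_mul(u: tuple, v: tuple) -> tuple:
    """Multiply (uh*x + ul) * (vh*x + vl) using x^2 = x + LAMBDA."""
    (uh, ul), (vh, vl) = u, v
    hh = gf16_mul(uh, vh)
    high = hh ^ gf16_mul(uh, vl) ^ gf16_mul(ul, vh)
    low = gf16_mul(hh, LAMBDA) ^ gf16_mul(ul, vl)
    return (high, low)

def tower_inv(s: tuple) -> tuple:
    """Inverse of sh*x + sl: (sh*Delta, (sl + sh)*Delta)."""
    sh, sl = s
    d = gf16_mul(gf16_mul(sh, sh), LAMBDA) ^ gf16_mul(sl, sh ^ sl)
    delta = gf16_inv(d)
    return (gf16_mul(sh, delta), gf16_mul(sh ^ sl, delta))

ok = all(tower_mul((sh, sl), tower_inv((sh, sl))) == (0, 1)
         for sh in range(16) for sl in range(16) if (sh, sl) != (0, 0))
print(ok)
```

The point for area is that the 8-bit inversion reduces to a handful of 4-bit multiplications, one squaring, one constant multiplication and one 4-bit inversion.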
Area Efficient Multipliers in GF(2^4) on FPGAs with 4-input LUTs
[Figure: multiplier in GF(2^4); multiplier in GF(2); multiplication with constant in GF(2)]
47
Area Efficient Multipliers in GF(2^4) on FPGAs with 4-input LUTs
[Figure: multiplier in GF(2^4) mapped onto 16 LUTs in 3 levels of logic; blocks: multiplier in GF(2), multiplication with Phi in GF(2), XOR, XOR, MUL]
48
Mapping GF(2^4) Arithmetic on 4-input LUTs
[Figure: square unit in GF(2^4) and multiplication with lambda in GF(2^4), 8 LUTs in total]
Single level of LUT-4 logic
49
Implementing the S-Box in GF((2^4)^2)
Combining tower field transformation with affine map into a single matrix operation
[Figure: S-Box datapath between transformations T and T^-1]
50
Optimizing AES for Area (2): Tower-Field Arithmetic for S-Boxes in GF((2^4)^2)
[Figure: datapath with seven register stages for the round function and the final round; transformation blocks T, A^-1 T, T^-1, T^-1 A]
51
AES Results on Reconfigurable Hardware
• AES Performance on a large range of hardware (FPGA) devices
[Figure: results for three designs: Basic Round, Unrolled, T-Table]
52
53
Agenda
• Introduction
• Objectives and Design Principles
• Identifying Cryptographic Building Blocks
• Case Study: Advanced Encryption Standard
• Lessons Learned
Lessons Learned
• Objectives for Hardware Implementation: Area, Power, Throughput, Energy and Security
• Objectives can be partially combined, often with non-linear scaling effects
• Ideally, ciphers use hardware-friendly building blocks such as static permutations and rotations
• (Large) S-Boxes are usually the most complex component for implementation in hardware
54