KULeuven - COSIC Tenerife, Hash3 – 1 Nov 2009
Hardware benchmarking for HASH3
(for non Hardware designers)
Ingrid Verbauwhede
ingrid.verbauwhede-at-esat.kuleuven.be
K.U.Leuven, COSIC: Computer Security and Industrial Cryptography
www.esat.kuleuven.be/cosic
with input from: Junfeng Fan, Miroslav Knezevic, Patrick Schaumont
Slides from: own Course notes, Rabaey’s Digital Integrated Circuits
Outline
• Goal of hardware design
• What is hardware design?
• What are the different options?
• What are the different contexts?
• How to compare hardware design: benchmark
• Where are we now?
When Hardware design?
• Fast
• Small
• Low power
• Security
• (Analog, RF)
[Figure: the HW-SW continuum]
HW-SW continuum
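Platforms, from HW to SW: ASIC, FPGA, domain specific, DSP, VLIW, general purpose
Performance/Energy unit: High (HW side) to Low (SW side)
Programmability: Low (HW side) to High (SW side)
Area efficiency
HW | HW-SW | SW
HW-SW example: Intel AES-NI (Westmere)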
Design parameters
• Speed or throughput:
– Gbits/sec or Mbits/sec/slice
– Cycles/byte (see D. Bernstein)
• Area:
– mm2 (gate or transistor count)
– Memory
• Power or energy consumption:
– Power (Watts) for cooling or transmission (RFID)
– Energy: battery operated devices
• Security:
– Side channel resistance: special circuit styles
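The two speed metrics listed above are tied together by the clock frequency. The sketch below shows the conversion; the function names and the 200 MHz / 10 cycles-per-byte figures are hypothetical illustration values, not measurements from the talk.

```python
def throughput_bits_per_sec(clock_hz: float, cycles_per_byte: float) -> float:
    """Throughput in bits/s of a core that needs `cycles_per_byte` at `clock_hz`."""
    return 8.0 * clock_hz / cycles_per_byte


def cycles_per_byte(clock_hz: float, throughput_bps: float) -> float:
    """Inverse conversion: cycles spent per processed byte."""
    return 8.0 * clock_hz / throughput_bps


def throughput_per_slice(throughput_bps: float, slices: int) -> float:
    """FPGA-style area-normalized throughput (bits/s per slice)."""
    return throughput_bps / slices


# Hypothetical core: 200 MHz clock, 10 cycles per byte, 400 slices.
tp = throughput_bits_per_sec(200e6, 10.0)
print(f"{tp / 1e6:.0f} Mbits/sec")
print(f"{throughput_per_slice(tp, 400) / 1e6:.2f} Mbits/sec/slice")
```

Cycles/byte is independent of the clock, which is why it is convenient for software comparisons, whereas Gbits/sec depends on the technology and the achieved clock frequency.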
Power density problem
• Intel S. Borkar power density problem
[Author: S. Borkar, Intel]
Power, energy
• Include picture of Intel cooling issues
Source: ORNL Oak Ridge National Lab, US Dept. of Energy
• “Immediate need to add 8 MW to prepare for 2007 installs of new systems”
• “Need total of 40-50 MW for projected systems by 2011.”
• “Numbers just for computers, add 75% for cooling.”
• “Cooling will require 12,000-15,000 tons of chiller capacity.”
Heat and parallelism
Single core: memory (M) and processor (P), capacitance C
$P_{mono} = C V^2 f$ (Watt)
Power (Heat)
Four cores: each with M/4 and P/4, capacitance C/4, clock f/4
$4 \cdot (C/4) \cdot V^2 \cdot (f/4) = P_{mono}/4$
but since f ~ V
can be even $P_{mono}/4^3$
Reduce power = reduce WASTE !!
TREND: MULTI-CORE!!
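The scaling argument on this slide can be checked numerically. The sketch below uses normalized units (C = V = f = 1) and assumes, as the slide does, that the supply voltage can be lowered in proportion to the clock frequency; the function names are illustrative only.

```python
def dynamic_power(c: float, v: float, f: float) -> float:
    """Dynamic power P = C * V^2 * f."""
    return c * v * v * f


def parallel_power(c: float, v: float, f: float, n: int,
                   scale_voltage: bool = False) -> float:
    """Total power of n cores, each with capacitance C/n running at f/n.

    With scale_voltage=True the supply is also reduced to V/n,
    which is the idealized 'f ~ V' assumption.
    """
    v_core = v / n if scale_voltage else v
    return n * dynamic_power(c / n, v_core, f / n)


p_mono = dynamic_power(1.0, 1.0, 1.0)
p_quad = parallel_power(1.0, 1.0, 1.0, 4)
p_quad_scaled = parallel_power(1.0, 1.0, 1.0, 4, scale_voltage=True)

print(p_mono / p_quad)         # reduction at constant voltage
print(p_mono / p_quad_scaled)  # reduction with voltage scaling
```

At constant voltage the four slower cores use a quarter of the power; with voltage scaling the factor becomes 4^3, because V^2 contributes another factor of 16.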
Low Energy: battery capacity
• Rabaey slide battery capacity
Skiing down a mountain
Specification: HASHX
Targets: ASIC, FPGA, retargetable coprocessor, DSP processor, DSP-RISC, GPU
Steps on the way down:
• Algorithm Transformations
• Memory Transformations and Optimizations
• Multi-precision arithmetic
Annotations: C, C++, block diagram; pipelining, unrolling; loop merging, compaction; 40 bit accumulator
Translation from spec into RTL (Register Transfer Level, e.g. VHDL, Verilog)
From RTL to tape-out or FPGA
• “Back-end”: VHDL, Verilog, synthesis, FPGA
Platforms, from Hardware to Software: ASIC, FPGA, retargetable coprocessor, DSP, DSP extensions to RISC, RISC, VLIW, GPU, CPU
System-on-a-chip, system in package
• C-compilation
• Assembly optimization
• Verilog-VHDL
• Synopsys synthesis
• Cadence place&route
• FPGA download
Semicustom Design Flow
Flow: HDL (Design Capture), Logic Synthesis, Floorplanning, Placement, Routing, Tape-out
Design views: Behavioral, Structural, Physical
Verification: Pre-Layout Simulation, Circuit Extraction, Post-Layout Simulation
Design Iteration (two feedback loops in the figure)
Timing closure!
Technology/library/manufacturer input
Cell-based Design (or standard cells)
Routing channel requirements are reduced by presence of more interconnect layers
Functional module (RAM, multiplier, …)
Routing channel
Logic cell
Feedthrough cell
Standard Cell – The New Generation
Cell-structure hidden under interconnect layers
The “Design Closure” Problem
Courtesy Synopsys
Iterative Removal of Timing Violations (white lines)
Synthesis together with Physical Design
Inputs: RTL, (Timing) Constraints, Macromodules (fixed netlists), Technology/library/manufacturer input
Physical Synthesis
Netlist with Place-and-Route Info
Place-and-Route Optimization
Artwork
Benchmark on gate count??
• Gate count (GE) depends on library and tools!
• Definition of one GATE?
• Example:
– PRESENT[20] contains 1,000 GE in 0.35 µm technology – 53,974 µm2.
– PRESENT[20] contains 1,169 GE in 0.25 µm technology – 32,987 µm2.
– PRESENT[20] contains 1,075 GE in 0.18 µm technology – 10,403 µm2.
• Comparison is fair ONLY if the SAME library, SAME tools, and SAME settings are used.
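One way to see why gate counts from different libraries do not compare directly is to compute the silicon area that one GE stands for in each of the three PRESENT results. The sketch assumes the usual convention that one GE is the area of a two-input NAND gate in the library used, and that the slide's units are µm and µm2; both are assumptions, since the slide itself leaves the definition of a gate as an open question.

```python
# (technology node in um, gate equivalents, area in um^2), as quoted on the slide
present_results = [
    (0.35, 1000, 53974),
    (0.25, 1169, 32987),
    (0.18, 1075, 10403),
]


def area_per_ge(area_um2: float, gate_equivalents: int) -> float:
    """Area represented by one gate equivalent in a given library."""
    return area_um2 / gate_equivalents


for tech, ge, area in present_results:
    print(f"{tech} um: {ge} GE, {area_per_ge(area, ge):.2f} um^2 per GE")
```

The same design varies by roughly 17% in GE across the three libraries, while the area behind one GE changes by more than a factor of five.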
Benchmark on synthesis settings??
• Same VHDL design synthesized with different constraints will result in different performance.
• Benchmark on “area-time” product ??
[source: M. Knezevic]
Note: 2.7GHz is ‘synthesis’ report: NOT FEASIBLE in practice!
Late-Binding Implementation
Array-based:
• Pre-diffused (Gate Arrays)
• Pre-wired (FPGAs)
Look-up Table Based Logic Cell
Inputs: In1, In2; output: Out
In   Out
00   00
01   1
10   1
11   0
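A look-up table implements a logic function by storing its truth table and using the inputs as the address. The sketch below models that idea; the `LUT` class is hypothetical, and the XOR contents are an assumed reading of the table on the slide.

```python
class LUT:
    """n-input look-up table: the inputs form the address into a stored truth table."""

    def __init__(self, n_inputs: int, table):
        if len(table) != 2 ** n_inputs:
            raise ValueError("table must have 2**n_inputs entries")
        self.n_inputs = n_inputs
        self.table = list(table)

    def __call__(self, *bits: int) -> int:
        if len(bits) != self.n_inputs:
            raise ValueError("wrong number of inputs")
        index = 0
        for b in bits:  # first argument is the most significant address bit
            index = (index << 1) | (b & 1)
        return self.table[index]


# Same cell, different configuration bits, different logic function.
xor_lut = LUT(2, [0, 1, 1, 0])
and_lut = LUT(2, [0, 0, 0, 1])

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_lut(a, b), and_lut(a, b))
```

Reconfiguring an FPGA amounts to rewriting such tables (and the routing multiplexers), which is why any function of the inputs costs the same cell.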
LUT-Based Logic Cell
Courtesy Xilinx
[Figure: Xilinx 4000 Series logic cell. Inputs C1...C4, D1 to D4 and F1 to F4 feed logic function blocks; multiplexers are controlled by the configuration program (control bits)]
Xilinx 4000 Series
Not most up to date
Xilinx Virtex-II Pro FPGA
• IBM PowerPC® RISC CPU
• Synchronous Dual-Port RAM
• SelectIO-Ultra™
• SystemIO™ & XCITE™
• Conexant 3.125Gb Serial
• XtremeDSP™
Multi-Pass Place-and-Route Analysis
GMU SHA-512, Xilinx Virtex 5
100 runs for different placement starting points
[Figure: minimum clock per run, the smaller the better; about 20% between best and worst]
[courtesy: Kris Gaj]
Dependence of Results on Requested Clock freq.
[courtesy: Kris Gaj]
Saar Drimer, Figure 5.2, Ph.D. thesis
Distribution of the maximum achievable clock frequency for Place&Route with 100 different PAR seeds.
1 & 2: for 1 or 4 AES instances
3 & 4: same on different platform
5: different speed grade
FPGA benchmarks??
• Easier than ASIC
– Tools are (almost) free (at least at universities)
– Options: similar to software
• Trend getting worse: FPGA becomes heterogeneous machine
– Report with/without “block-RAMs”
– Report with/without DSP multipliers
– Report with/without high speed IO
FPGA benchmarks??
• Area numbers:
– Slices, LUT’s, CLB’s, …
– Xilinx application engineer: “The number of CLB’s inside LUT’s changes from generation to generation.” (or was it LUT’s inside CLB’s?)
• Speed: accurately reported by tools
• Power:
– Poorly reported by tools
– Hard to measure on board
Context 3: HW-SW interface
Dan would call this the API?
Intro: SHA3-ZOO
• 3 types of “Hardware” reporting, but no interface!
– Fully autonomous
– Fully autonomous with external memory (Mem)
– Core functionality
Integration of Hash module (SHA3)??
Integration of the Hash module:
options for HW/SW co-design
• Option 1: instruction set extension
– Tightly coupled
– Reuse of busses
– Reuse of registers
– Define instruction
– Usually: C-intrinsic or pragma
Example: AES-NI from Intel (see Shay’s presentation!)
Example: Build your own extension to an embedded processor, see e.g. Xtensa or Target Compiler Technologies
[Figure: SHA3 as an instruction set extension]
• Option 2: Memory mapped
– Memory-mapped coprocessor
– Loosely coupled
– Typical for DSP and other embedded processors
– No need to change compiler
– Check latency of co-processor & memory consistency!
[Figure: main processor, local memory and SHA3 co-processor]
• Option 3: novel forms of co-operation
– Custom HW or Network on Chip (NOC)
– Loosely coupled
– Flexible interconnect
– Popular for large multi-core designs (80 or 100 cores)
[Figure: routers connecting cores; SHA3 is one of many other cores]
Can have different forms in one System-on-Chip (SoC)
[Figure: SoC block diagram. Blocks: CPU (I$, D$, custom dp), Memory, Memory Controller, DMA, Bus Master, Bridge, UART, Timer, Parallel I/O, Custom HW (twice). Buses: High-speed Bus, Peripheral Bus, Local Bus. Also labelled: external memory, direct I/O]
AES acceleration for SH3-DSP
• AES Co-processor
– For 128bit key
– Using GEZEL
– Communicate with the SH3-DSP ISS via the memory mapped interface
[Figure: co-processor in GEZEL Simulation Kernel. Blocks: aes_top (ports: load, reset, key (128), text_in (128), done, text_out (128)) and aes_encoder. Memory-mapped interface signals: ins (8), dout (32), din (32). Software side: KVM on SH3-DSP ISS]
address 0x2f000 0x2f004 0x2f008
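{ /* co-processor registers, memory-mapped into the SH3-DSP address space */
volatile char *ins = (volatile char *) 0x2f000;
volatile int *dout = (volatile int *) 0x2f004;
volatile int *din = (volatile int *) 0x2f008; }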
GEZEL-SH Co-Simulator
[Ref: Y. Matsuoka et al, CASES04]
AES Optimization results
• Number of clock cycles per AES encryption (Key scheduling + Block encryption)
– Starting from Java function call in user application
– KNI overhead limits the overall performance gain
Interface layers in the figure: Java API I/F, KNI I/F, Acceleration I/F, Mem-Mapped I/F

                      (a) Java   (b) Java+C      (c) Java+C+GEZEL
Cycle counts shown    198741     10245, 18763    260, 18938
Total Cycles          198741     29008 (6.8x)    19198 (10.4x)
[Ref: Y. Matsuoka et al, CASES04]
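The totals and speed-ups on this slide can be re-derived from the cycle counts. The sketch below only uses the numbers quoted on the slide; the dictionary layout is illustrative, and no claim is made about which interface each partial count belongs to.

```python
# Total cycles per AES encryption for the three implementations on the slide.
totals = {"Java": 198741, "Java+C": 29008, "Java+C+GEZEL": 19198}

# Partial cycle counts shown in the figure for the two accelerated versions.
parts = {"Java+C": (10245, 18763), "Java+C+GEZEL": (260, 18938)}

# The partial counts add up to the reported totals.
for name, counts in parts.items():
    assert sum(counts) == totals[name]

# Speed-up relative to the pure Java version.
speedup = {name: totals["Java"] / cycles for name, cycles in totals.items()}
for name, s in speedup.items():
    print(f"{name}: {s:.2f}x")
```

In both accelerated versions one component stays near 18,800 cycles, which matches the remark that the KNI overhead limits the overall gain: moving AES from C into the co-processor shrinks the other component from 10245 to 260 cycles but improves the total only from about 6.8x to 10.4x.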
Adapt HW platform to application
Simple example: Key Schedule for secret key
Two options:
• On the “fly” = just in time processing: Key Schedule feeds the block cipher (BC) directly. Typical for Hardware
• Pre-compute and store in memory: Key Schedule, then Memory, then BC. Typical for Software
Key schedule on the fly
• The cost of fast key context switching in SW
• Example for IPSEC router
– one 128 bit key = 1408 bits round keys (10 rounds + initial key)
– half of internet packets are only 64 bytes in length (512 bits)
[Figure: context bandwidth (Gbps, 0 to 10) versus record size (bytes, 10 to 10^5) for ARC4, AES and 3DES, with data at 1 Gbps]
[source: J. Goodman]
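The numbers in the IPSEC example follow from simple arithmetic, sketched below under the assumption of AES-128 with one 128-bit round key per round plus the initial key; the function name is illustrative.

```python
def round_key_bits(block_bits: int = 128, rounds: int = 10) -> int:
    """Size of the expanded key: one round key per round plus the initial key."""
    return (rounds + 1) * block_bits


packet_bits = 64 * 8                      # a 64-byte packet
expanded = round_key_bits()               # expanded key for one 128-bit key
context_per_data_bit = expanded / packet_bits

print(expanded, packet_bits, context_per_data_bit)
```

For a short packet, a software implementation that fetches pre-computed round keys moves 2.75 bits of key context for every bit of data, which is why computing the key schedule on the fly is attractive when keys change per packet.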
Benchmark??
Cost of HW module (minimum minimorum):
• Key storage
– assume sub-keys on the fly
• State storage:
– Does all state need to be alive all the time?
– Wide pipe - narrow pipe
– Windowing?
– Think context switching
• Input block / output block
– Can I process input already before the complete input block and/or padding is present?
– Same for output: can I send output, or do I have to wait for the complete output block?
Match between algorithm & architecture
Close the gap:
• Dedicated HW: ASIC
• Programmable HW: FPGA
• Custom instructions, hand-coded assembly
• Compiled code
• JAVA on virtual machine, compiled on a real machine
[Figure: the gap between Application and Platform, ranging from General Purpose to Fixed (ASIC), with Power and Cost marked “???”]
Throughput – Energy numbers
AES 128bit key, 128bit data

Platform               Throughput       Power    Figure of Merit (Gb/s/W = Gb/J)
0.18 μm CMOS           3.84 Gbits/sec   350 mW   11 (1/1)
FPGA [1]               1.32 Gbit/sec    490 mW   2.7 (1/4)
ASM StrongARM [2]      31 Mbit/sec      240 mW   0.13 (1/85)
Asm Pentium III [3]    648 Mbits/sec    41.4 W   0.015 (1/800)
C Emb. Sparc [4]       133 Kbits/sec    120 mW   0.0011 (1/10,000)
Java [5] Emb. Sparc    450 bits/sec     120 mW   0.0000037 (1/3,000,000)

[1] Amphion CS5230 on Virtex2 + Xilinx Virtex2 Power Estimator
[2] Dag Arne Osvik: 544 cycles AES – ECB on StrongArm SA-1110
[3] Helger Lipmaa PIII assembly handcoded + Intel Pentium III (1.13 GHz) Datasheet
[4] gcc, 1 mW/MHz @ 120 MHz Sparc – assumes 0.25 µm CMOS
[5] Java on KVM (Sun J2ME, non-JIT) on 1 mW/MHz @ 120 MHz Sparc – assumes 0.25 µm CMOS
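The figure of merit is throughput divided by power, which has the unit Gb/s/W = Gb/J. The sketch below recomputes it from the throughput and power values on this slide; the pairing of rows with power values is a reading of the flattened table, chosen because it reproduces the quoted figures of merit.

```python
# (platform, throughput in bits/s, power in W), as read from the slide
rows = [
    ("0.18um CMOS",         3.84e9, 0.350),
    ("FPGA [1]",            1.32e9, 0.490),
    ("ASM StrongARM [2]",   31e6,   0.240),
    ("Asm Pentium III [3]", 648e6,  41.4),
    ("C Emb. Sparc [4]",    133e3,  0.120),
    ("Java [5] Emb. Sparc", 450.0,  0.120),
]


def figure_of_merit(bits_per_sec: float, watts: float) -> float:
    """Energy efficiency in Gb/s/W, which equals Gb/J."""
    return bits_per_sec / 1e9 / watts


best = figure_of_merit(rows[0][1], rows[0][2])
for name, bps, watts in rows:
    fom = figure_of_merit(bps, watts)
    print(f"{name:22s} {fom:.3g} Gb/J  (1/{best / fom:.3g} of the best)")
```

The six platforms span more than six orders of magnitude in energy efficiency for the same algorithm.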
Data Flow Graph representation
• Illustrate with RIPEMD
• Indicate loops, operations, and delays
[Figure: data flow graph of a RIPEMD step, with registers A, B, C, D, E (delay elements TD), function F, adders, rol(s) and rol(10), and inputs Ki and Xi]
Iteration Bound
t_l – loop calculation time
w_l – number of algorithmic delays (marked with TD) in the l-th loop
[Figure: RIPEMD data flow graph]
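The transcript keeps the definitions of t_l and w_l but not the formula itself. The usual definition from the DSP architecture literature is the maximum over all loops of t_l / w_l, and the sketch below assumes that definition. The loop values in the example are hypothetical and are not taken from the RIPEMD graph.

```python
from fractions import Fraction


def iteration_bound(loops):
    """Iteration bound: max over loops of (loop calculation time / number of delays).

    `loops` is an iterable of (t_l, w_l) pairs with integer entries and w_l > 0.
    """
    return max(Fraction(t_l, w_l) for t_l, w_l in loops)


# Hypothetical loops: (calculation time in gate delays, number of delays TD)
example_loops = [(4, 2), (3, 1), (5, 2)]
print(iteration_bound(example_loops))
```

The bound is a property of the algorithm's loops, so no amount of pipelining inside a loop can push the iteration period of a single data stream below it.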
Iteration Bound
[Figure: RIPEMD data flow graph]
Critical path
[Figure: RIPEMD data flow graph]
• The longest path between any two storage elements.
• Determines the clock frequency!
• Problem: Critical Path > Iteration Bound!
Retiming transformation
[Figure: retimed RIPEMD data flow graph, with A1 and inputs Ki+1 and Xi+1]
• Transformation technique that changes the locations of unit-delay elements in a circuit without affecting the input/output characteristic.
• After retiming: Critical Path = Iteration Bound!
Hardware tricks
For speed:
• Parallelism
• Pipelining
• Loop unrolling
• FPGA: Block RAM instead of Logic
• …
For area:
• Multiplexing
• Composite field instead of Sbox
For power/energy:
• Parallelism
• Pipelining
Algorithm properties
As they affect HW realization
• Internal state
• Block size
• Initialization cost
• Iterative, sequential, …
• Parallelism
Benchmark efforts
Benchmarks on FPGA, ASIC
API efforts
Open questions
Stefan Tillich
See his presentation for the “context”
Brian Baldwin
• FPGA: CubeHash, Grostl, Shabal, SIMD, JH, Hamsi and Fugue
• Core functionality & compression function
• See his presentation for ‘context’
Christian Wenzel-Benner
• eXternal Benchmarking eXtension
Miroslav Knezevic
• Illustration of transformations: applied to Luffa and others
• More observations
Patrick Schaumont: API for HW
• INIT & GETCONFIG: initialization, type of I/O, etc.
• IDATA & ODATA: parameter
• 16, 32 bit: low end processor
• 64, 128 (256): high end processors
Kris Gaj: ATHENa
[Figure: numbered ATHENa flow (steps 0 to 8) between Designer, User and ATHENa Server. Labels: Interfaces + Testbenches; HDL + scripts + configuration files; FPGA Synthesis and Implementation; Result Summary + Database Entries; Download scripts and configuration files; HDL + FPGA Tools; Database query; Ranking of designs]
ATHENa Major Features
• synthesis, implementation, and timing analysis in the batch mode
• support for devices and tools of multiple FPGA vendors
• generation of results for multiple families of FPGAs of a given vendor
• automated choice of a best-matching device within a given family
Open questions
• Area comparisons
• Throughput comparisons
• Power/Energy comparisons
• Sets of environments