KULeuven - COSIC Tenerife, Hash3 – 1 Nov 2009

Hardware benchmarking for HASH3

(for non Hardware designers)

Ingrid Verbauwhede

ingrid.verbauwhede-at-esat.kuleuven.be

K.U.Leuven, COSIC Computer Security and Industrial Cryptography

www.esat.kuleuven.be/cosic

with input from: Junfeng Fan, Miroslav Knezevic, Patrick Schaumont

Slides from: own course notes, Rabaey’s Digital Integrated Circuits

Outline

• Goal of hardware design

• What is hardware design?

• What are the different options?

• What are the different contexts?

• How to compare hardware design: benchmark

• Where are we now?

HW - SW continuum

When Hardware design?

• Fast

• Small

• Low power

• Security

• (Analog, RF)

[Figure: HW - SW continuum]

HW-SW continuum

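[Figure: platforms along the HW-SW continuum: ASIC, FPGA, domain specific, DSP, VLIW, general purpose]

• Performance per energy unit: high (HW) to low (SW)

• Programmability: low (HW) to high (SW)

• Area efficiency

• HW-SW example: Intel AES-NI (Westmere)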

Design parameters

• Speed or throughput:

– Gbits/sec or Mbits/sec/slice

– Cycles/byte (see D. Bernstein)

• Area:

– mm² (gate or transistor count)

– Memory

• Power or energy consumption:

– Power (Watts) for cooling or transmission (RFID)

– Energy: battery operated devices

• Security:

– Side channel resistance: special circuit styles
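
The two speed metrics listed above are related through the clock frequency: throughput in bits/sec is 8 × clock frequency / (cycles/byte). A minimal sketch of that conversion follows; the clock frequency, cycle count and slice count are hypothetical values chosen only for illustration, and the function names are not taken from any tool.

```python
def throughput_bits_per_s(clock_hz: float, cycles_per_byte: float) -> float:
    """Bits processed per second by a core that needs `cycles_per_byte` cycles per byte."""
    return 8.0 * clock_hz / cycles_per_byte


def throughput_per_slice(throughput_bps: float, slices: int) -> float:
    """Area-normalised throughput (bits/sec/slice) for an FPGA design."""
    return throughput_bps / slices


# Hypothetical design point, not taken from the slides.
clock_hz = 200e6          # 200 MHz
cycles_per_byte = 10.0
slices = 1000

tp = throughput_bits_per_s(clock_hz, cycles_per_byte)
print(f"{tp / 1e6:.1f} Mbits/sec")
print(f"{throughput_per_slice(tp, slices) / 1e6:.3f} Mbits/sec/slice")
```

Cycles/byte is independent of the clock, which is why it is convenient for software; a hardware figure in Gbits/sec always implies a particular technology and clock.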

Power density problem

[Figure: the power density problem. Author: S. Borkar, Intel]

Power, energy

• Include picture of Intel cooling issues

Source: ORNL Oak Ridge National Lab, US Dept. of Energy

• “Immediate need to add 8 MW to prepare for 2007 installs of new systems.”

• “Need total of 40-50 MW for projected systems by 2011.”

• “Numbers just for computers, add 75% for cooling.”

• “Cooling will require 12,000-15,000 tons of chiller capacity.”

Heat and parallelism

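[Figure: one memory M and one processor P with total capacitance C, next to four parallel units M/4, P/4 with capacitance C/4 each]

Single unit: $P_{mono} = C V^2 f$ (Watt), dissipated as power (heat)

Four parallel units, each running at $f/4$: $4 \cdot (C/4) \cdot V^2 \cdot (f/4) = P_{mono}/4$

but since $f \sim V$, the supply voltage can be lowered together with the frequency, so it can be even $P_{mono}/4^3$

Reduce power = reduce WASTE !!

TREND: MULTI-CORE!!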
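
The arithmetic on this slide can be checked with a short sketch. It uses the slide's model of dynamic power, C·V²·f, and the idealised assumption that the supply voltage can be scaled in proportion to the frequency; the capacitance, voltage and frequency values are hypothetical, and real circuits do not scale this cleanly.

```python
def dynamic_power(c: float, v: float, f: float) -> float:
    """Dynamic power of one unit: P = C * V^2 * f."""
    return c * v * v * f


def parallel_power(c: float, v: float, f: float, n: int, scale_voltage: bool = False) -> float:
    """Total power of n parallel units, each with C/n and clocked at f/n.

    With scale_voltage=True the supply is also divided by n (idealised f ~ V).
    """
    v_unit = v / n if scale_voltage else v
    return n * dynamic_power(c / n, v_unit, f / n)


# Hypothetical numbers, used only to show the ratios.
c, v, f = 1e-9, 1.2, 1e9

p_mono = dynamic_power(c, v, f)
p4 = parallel_power(c, v, f, 4)
p4v = parallel_power(c, v, f, 4, scale_voltage=True)

print(f"P_mono / P_4way            = {p_mono / p4:.1f}")
print(f"P_mono / P_4way (V scaled) = {p_mono / p4v:.1f}")
```

The two ratios correspond to the factors 4 and 4³ on the slide.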

Low Energy: battery capacity

[Figure: battery capacity, from Rabaey’s slides]

What is hardware design?

Skiing down a mountain

Specification: HASHX (C, C++, block diagram)

• Algorithm transformations: pipelining, unrolling

• Memory transformations and optimizations: loop merging, compaction

• Multi-precision arithmetic: 40 bit accumulator

Translation from spec into RTL (Register Transfer Level, e.g. VHDL, Verilog)

Targets: ASIC, FPGA, retargetable coprocessor, DSP processor, DSP-RISC, GPU

From RTL to tape-out or FPGA

• “Back-end”: VHDL, Verilog, synthesis, FPGA

Targets, from hardware to software: ASIC, FPGA, retargetable coprocessor, DSP, DSP extensions to RISC, RISC, VLIW, GPU, CPU

System-on-a-chip, system in package

• C-compilation

• Assembly optimization

• Verilog-VHDL

• Synopsys synthesis

• Cadence place & route

• FPGA download

Context 1: ASIC design

Standard cell based design

Semicustom Design Flow

[Figure: semicustom design flow]

• Steps: HDL, logic synthesis, floorplanning, placement, routing, tape-out

• Abstraction levels: behavioral (design capture), structural, physical

• Verification: pre-layout simulation, circuit extraction, post-layout simulation

• Design iteration

• Timing closure!

• Technology/library/manufacturer input

Cell-based Design (or standard cells)

Routing channel requirements are reduced by presence of more interconnect layers

[Figure labels: functional module (RAM, multiplier, …), routing channel, logic cell, feedthrough cell]

Standard Cell — Example

[Brodersen92]

Standard Cell – The New Generation

Cell-structure hidden under interconnect layers

The “Design Closure” Problem

Courtesy Synopsys

Iterative Removal of Timing Violations (white lines)

Synthesis together with Physical Design

Physical Synthesis

[Figure: inputs are RTL, (timing) constraints, macromodules (fixed netlists) and technology/library/manufacturer input; physical synthesis produces a netlist with place-and-route info; place-and-route optimization produces the artwork]

Benchmark on gate count??

• Gate count (GE) depends on library and tools!

• Definition of one GATE?

• Example:

– PRESENT [20] contains 1,000 GE in 0.35 μm technology – 53,974 μm².

– PRESENT [20] contains 1,169 GE in 0.25 μm technology – 32,987 μm².

– PRESENT [20] contains 1,075 GE in 0.18 μm technology – 10,403 μm².

• Comparison is fair ONLY if the SAME library, SAME tools, and SAME settings are used.
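
The PRESENT numbers above can be turned around to show how much the "gate" itself changes between libraries. The sketch below assumes the common convention that one gate equivalent (GE) is the area of a two-input NAND cell in the library used; the slide does not state which convention its numbers follow, so the per-GE areas are only what the three data points imply.

```python
# (technology, gate equivalents, area in um^2), copied from the PRESENT example.
present_results = [
    ("0.35 um", 1000, 53974),
    ("0.25 um", 1169, 32987),
    ("0.18 um", 1075, 10403),
]


def implied_ge_area(area_um2: float, gate_equivalents: float) -> float:
    """Area of one gate equivalent implied by a reported (area, GE) pair."""
    return area_um2 / gate_equivalents


for tech, ge, area in present_results:
    print(f"{tech}: {ge} GE, {implied_ge_area(area, ge):.2f} um^2 per GE")
```

The same cipher gives 1,000, 1,169 or 1,075 GE depending on the library, a spread of about 17%, even though the silicon area shrinks by more than a factor of five.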

Benchmark on synthesis settings??

• Same VHDL design synthesized with different constraints will result in different performance.

• Benchmark on “area-time” product ??

[source: M. Knezevic]

Note: 2.7GHz is ‘synthesis’ report: NOT FEASIBLE in practice!
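
To make the area-time question concrete, the sketch below computes an area-time product for one design synthesized under two timing constraints. All numbers are hypothetical and are not from M. Knezevic's data; the point is only that the product depends on the constraint and therefore does not by itself settle the comparison.

```python
from typing import NamedTuple


class SynthesisRun(NamedTuple):
    target_mhz: float       # requested clock frequency
    area_ge: float          # reported area in gate equivalents
    cycles_per_block: int   # cycles needed to process one message block


def time_per_block_ns(run: SynthesisRun) -> float:
    """Latency of one block in nanoseconds at the requested clock."""
    return run.cycles_per_block * 1000.0 / run.target_mhz


def area_time_product(run: SynthesisRun) -> float:
    """Area-time product in GE * ns (smaller is better)."""
    return run.area_ge * time_per_block_ns(run)


# Hypothetical: the same RTL synthesized with a relaxed and a tight constraint.
relaxed = SynthesisRun(target_mhz=100.0, area_ge=10000.0, cycles_per_block=64)
tight = SynthesisRun(target_mhz=400.0, area_ge=14000.0, cycles_per_block=64)

for name, run in (("relaxed", relaxed), ("tight", tight)):
    print(f"{name}: {time_per_block_ns(run):.0f} ns/block, AT = {area_time_product(run):.3g} GE*ns")
```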

Context 2: FPGA design

Late-Binding Implementation

Array-based:

• Pre-diffused (gate arrays)

• Pre-wired (FPGAs)

Look-up Table Based Logic Cell

[Figure: look-up table cell with inputs In1, In2 and output Out]

In1 In2   Out
0   0     0
0   1     1
1   0     1
1   1     0
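
A look-up table cell is simply a small memory addressed by the inputs. The sketch below models that behaviour in software; `make_lut` is a hypothetical helper written for this illustration, and the first configuration is the XOR table shown on the slide.

```python
def make_lut(table):
    """Return a logic function backed by a truth table of 2**n entries.

    The first input is the most significant address bit.
    """
    def lut(*inputs):
        index = 0
        for bit in inputs:
            index = (index << 1) | (bit & 1)
        return table[index]
    return lut


# Configuration bits for the table on the slide (Out = In1 XOR In2).
xor_lut = make_lut([0, 1, 1, 0])

# Reprogramming the same cell with other bits gives a different gate, e.g. AND.
and_lut = make_lut([0, 0, 0, 1])

for in1 in (0, 1):
    for in2 in (0, 1):
        print(in1, in2, "->", xor_lut(in1, in2))
```

Changing the stored bits changes the function without touching the wiring, which is what makes the FPGA a late-binding platform.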

LUT-Based Logic Cell

Courtesy Xilinx

[Figure: Xilinx 4000 series logic cell with inputs C1...C4, D1 to D4 and F1 to F4, three logic function blocks, control bits, and multiplexers controlled by the configuration program]

Xilinx 4000 Series

Not most up to date

RAM-based FPGA

Xilinx XC4000ex

Courtesy Xilinx

Xilinx Virtex-II Pro FPGA

• IBM PowerPC® RISC CPU

• Synchronous dual-port RAM

• SelectIO-Ultra™

• SystemIO™ & XCITE™

• Conexant 3.125 Gb serial

• XtremeDSP™

Multi-Pass Place-and-Route Analysis

GMU SHA-512, Xilinx Virtex 5

100 runs for different placement starting points

[Figure: minimum clock per run, the smaller the better; about 20% between best and worst]

[courtesy: Kris Gaj]

Dependence of Results on Requested Clock freq.

[courtesy: Kris Gaj]

Saar Drimer, Figure 5.2, Ph.D. thesis

Distribution of the maximum achievable clock frequency for place & route with 100 different PAR seeds.

1 & 2: for 1 or 4 AES instances

3 & 4: same on different platform

5: different speed grade

FPGA benchmarks??

• Easier than ASIC

– Tools are (almost) free (at least at universities)

– Options: similar to software

• Trend getting worse: FPGA becomes heterogeneous machine

– Report with/without “block-RAMs”

– Report with/without DSP multipliers

– Report with/without high speed IO

FPGA benchmarks??

• Area numbers:

– Slices, LUT’s, CLB’s, …

– Xilinx application engineer: “The number of CLB’s inside LUT’s changes from generation to generation.” (or was it LUT’s inside CLB’s?)

• Speed: accurately reported by tools

• Power:

– Poorly reported by tools

– Hard to measure on board

Context 3: HW-SW interface

Dan would call this the API?

Intro: SHA3-ZOO

• 3 types of “Hardware” reporting, but no interface!

– Fully autonomous

– Fully autonomous with external memory (Mem)

– Core functionality

Integration of Hash module??

Integration of the Hash module: options for HW/SW co-design

• Option 1: instruction set extension

– Tightly coupled

– Reuse of busses

– Reuse of registers

– Define instruction

– Usually: C-intrinsic or pragma

Example: AES-NI of Intel (see Shay’s presentation!)

Example: Build your own extension to an embedded processor, see e.g. Xtensa or Target Compiler Technologies

• Option 2: Memory mapped

– Memory-mapped coprocessor

– Loosely coupled

– Typical for DSP and other embedded processors

– No need to change compiler

– Check latency of co-processor & memory consistency!

[Figure: main processor, local memory and SHA3 coprocessor]

• Option 3: novel forms of co-operation

– Custom HW or Network on Chip (NOC)

– Loosely coupled

– Flexible interconnect

– Popular for large multi-core designs (80 or 100 cores)

[Figure: routers connecting cores; SHA3 is one of many other cores]

Can have different forms in one System-on-chip (SOC)

[Figure: SoC with CPU (I$, D$, custom dp), memory, memory controller, DMA, bus master, bridge, UART, timer, parallel I/O and custom HW blocks, connected by a high-speed bus, a peripheral bus and a local bus; external memory; direct I/O]

AES acceleration for SH3-DSP

• AES co-processor

– For 128 bit key

– Using GEZEL

– Communicates with the SH3-DSP ISS via the memory-mapped interface

[Figure: co-processor in the GEZEL simulation kernel: aes_top with ports load, reset, key (128), text_in (128), done and text_out (128); aes_encoder with memory-mapped interface ins (8), dout (32) and din (32); KVM on the SH3-DSP ISS]

Addresses: 0x2f000, 0x2f004, 0x2f008

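{ volatile char *ins = (volatile char *)0x2f000; /* 8 bit ins register */

volatile int *dout = (volatile int *)0x2f004; /* 32 bit dout register */

volatile int *din = (volatile int *)0x2f008; } /* 32 bit din register */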

GEZEL-SH Co-Simulator

[Ref: Y. Matsuoka et al, CASES04]

AES Optimization results

• Number of clock cycles per AES encryption (key scheduling + block encryption)

– Starting from Java function call in user application

– KNI overhead limits the overall performance gain

[Figure: software stack with Java API I/F, KNI I/F, acceleration I/F and mem-mapped I/F]

Total cycles:

(a) Java: 198741

(b) Java+C: 10245 + 18763 = 29008 (6.8x)

(c) Java+C+GEZEL: 260 + 18938 = 19198 (10.4x)

[Ref: Y. Matsuoka et al, CASES04]
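
The speedups follow directly from the cycle counts. The sketch below recomputes them and shows how much of each accelerated total sits in the larger of the two reported components; reading that component as the KNI interface cost is an assumption based on the remark about KNI overhead, since the transcript does not label the individual numbers.

```python
# Cycle counts as reported on the slide.
total_cycles = {"Java": 198741, "Java+C": 29008, "Java+C+GEZEL": 19198}
components = {"Java+C": (10245, 18763), "Java+C+GEZEL": (260, 18938)}


def speedup(baseline_cycles: int, cycles: int) -> float:
    """How many times faster a configuration is than the baseline."""
    return baseline_cycles / cycles


def largest_share(parts) -> float:
    """Fraction of the total taken by the largest component."""
    return max(parts) / sum(parts)


baseline = total_cycles["Java"]
for name, parts in components.items():
    assert sum(parts) == total_cycles[name]  # the components add up to the total
    print(f"{name}: speedup {speedup(baseline, total_cycles[name]):.2f}x, "
          f"largest component {100 * largest_share(parts):.0f}% of cycles")
```

The first ratio is about 6.85, so the slide's 6.8x appears to be truncated rather than rounded. Going from C to the coprocessor cuts the smaller component from 10245 to 260 cycles, yet the total improves only from 29008 to 19198 because the larger component is unchanged.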

Context 4: Bandwidth

Adapt HW platform to application

Simple example: Key Schedule for secret key

Two options:

• On the “fly” = just in time processing: the key schedule feeds the block cipher (BC) directly. Typical for hardware.

• Pre-compute and store in memory: the key schedule fills a memory that the BC reads. Typical for software.

Key schedule on the fly

• The cost of fast key context switching in SW

• Example for IPSEC router

– one 128 bit key = 1408 bits round keys (10 rounds + initial key)

– half of internet packets are only 64 bytes in length (512 bits)

[Figure: context bandwidth (Gbps) versus record size (bytes, 10 to 10^5) for ARC4, AES and 3DES, with data at 1 Gbps]

[source: J. Goodman]
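
The two numbers in the IPSEC example already show why pre-computed round keys are expensive when the key changes often. The sketch below assumes the worst case that a fresh expanded key has to be loaded for every record; that assumption and the function names are illustrative and are not taken from J. Goodman's data.

```python
def aes128_round_key_bits(rounds: int = 10, block_bits: int = 128) -> int:
    """Size of the expanded AES-128 key: one round key per round plus the initial key."""
    return (rounds + 1) * block_bits


def context_bandwidth_gbps(data_gbps: float, context_bits: int, record_bytes: int) -> float:
    """Bandwidth needed to load one key context per record at a given data rate."""
    return data_gbps * context_bits / (8 * record_bytes)


context_bits = aes128_round_key_bits()
print(context_bits, "bits of round keys per 128 bit key")

for record_bytes in (64, 1024, 65536):
    bw = context_bandwidth_gbps(1.0, context_bits, record_bytes)
    print(f"record of {record_bytes} bytes: {bw:.4f} Gbps of context traffic at 1 Gbps data")
```

For 64 byte packets the stored context is 2.75 times larger than the data it protects, whereas an on-the-fly key schedule only has to load the 128 bit key.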

Benchmark??

Cost of HW module (minimum minimorum):

• Key storage

– assume sub-keys on the fly

• State storage:

– Does all state need to be alive all the time?

– Wide pipe - narrow pipe

– Windowing?

– Think context switching

• Input block / output block

– Can I process input already before the complete input block and/or padding is present?

– Same for output: can I send output, or do I have to wait for the complete output block?

Context 5: gap between application and architecture

Match between algorithm & architecture

Close the gap:

• Dedicated HW: ASIC

• Programmable HW: FPGA

• Custom instructions, hand-coded assembly

• Compiled code

• JAVA on virtual machine, compiled on a real machine

[Figure labels: Power, Cost, ???, General Purpose, Fixed, Platform, Application, ASIC]

Throughput – Energy numbers (AES, 128 bit key, 128 bit data)

Platform | Throughput | Power | Figure of Merit (Gb/s/W = Gb/J)

0.18 μm CMOS | 3.84 Gbits/sec | 350 mW | 11 (1/1)

FPGA [1] | 1.32 Gbit/sec | 490 mW | 2.7 (1/4)

ASM StrongARM [2] | 31 Mbit/sec | 240 mW | 0.13 (1/85)

Asm Pentium III [3] | 648 Mbits/sec | 41.4 W | 0.015 (1/800)

C Emb. Sparc [4] | 133 Kbits/sec | 120 mW | 0.0011 (1/10.000)

Java [5] Emb. Sparc | 450 bits/sec | 120 mW | 0.0000037 (1/3.000.000)

[1] Amphion CS5230 on Virtex2 + Xilinx Virtex2 Power Estimator

[2] Dag Arne Osvik: 544 cycles AES – ECB on StrongArm SA-1110

[3] Helger Lipmaa PIII assembly handcoded + Intel Pentium III (1.13 GHz) Datasheet

[4] gcc, 1 mW/MHz @ 120 MHz Sparc – assumes 0.25 μm CMOS

[5] Java on KVM (Sun J2ME, non-JIT) on 1 mW/MHz @ 120 MHz Sparc – assumes 0.25 μm CMOS
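
The figure of merit in this table is throughput divided by power, which is the number of gigabits processed per joule. The sketch below recomputes it from the throughput and power values listed on the slide; the relative factors it prints are computed from those values and can differ slightly from the rounded factors on the slide.

```python
# (platform, throughput in bits/sec, power in watts), as listed on the slide.
platforms = [
    ("0.18 um CMOS",       3.84e9, 0.350),
    ("FPGA",               1.32e9, 0.490),
    ("ASM StrongARM",      31e6,   0.240),
    ("Asm Pentium III",    648e6,  41.4),
    ("C on emb. Sparc",    133e3,  0.120),
    ("Java on emb. Sparc", 450.0,  0.120),
]


def gbit_per_joule(bits_per_s: float, watts: float) -> float:
    """Figure of merit: Gb/s per W, which equals Gb per J."""
    return bits_per_s / watts / 1e9


best = max(gbit_per_joule(bps, w) for _, bps, w in platforms)
for name, bps, w in platforms:
    fom = gbit_per_joule(bps, w)
    print(f"{name:20s} {fom:.3g} Gb/J  (1/{best / fom:.3g} of the best)")
```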

Context 6: transformations

Data Flow Graph representation

• Illustrate with RIPEMD

• Indicate loops, operations, and delays

[Figure: RIPEMD data flow graph with registers A, B, C, D, E (delays TD), function F, adders, rotations rol(s) and rol(10), and inputs Ki, Xi]

Iteration Bound

Iteration bound: $T_\infty = \max_{l} \left( t_l / w_l \right)$, taken over all loops $l$

$t_l$ – loop calculation time

$w_l$ – number of algorithmic delays (marked with TD) in the l-th loop

[Figure: RIPEMD data flow graph]
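
The iteration bound is commonly defined as the maximum over all loops of the loop calculation time divided by the number of delays in that loop. The sketch below applies that standard definition to a hypothetical set of loops; the loop timings are invented for illustration and are not measurements of RIPEMD.

```python
def iteration_bound(loops):
    """Maximum of t_l / w_l over all loops.

    `loops` is an iterable of (t_l, w_l): loop calculation time and number of
    algorithmic delays in that loop. Every loop must contain at least one delay.
    """
    loops = list(loops)
    if any(w <= 0 for _, w in loops):
        raise ValueError("every loop needs at least one delay element")
    return max(t / w for t, w in loops)


# Hypothetical loops: (calculation time in adder delays, number of delays TD).
loops = [(3.0, 1), (4.0, 2), (1.0, 1)]
bound = iteration_bound(loops)
print("iteration bound:", bound)

# Hypothetical critical path (longest register-to-register path) before retiming.
critical_path = 5.0
print("critical path exceeds bound:", critical_path > bound)
```

No schedule of the same graph can run faster than the iteration bound, so a critical path above it indicates that the registers are badly placed rather than that the algorithm is slow.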

Critical path

[Figure: RIPEMD data flow graph]

• The longest path between any two storage elements.

• Determines the clock frequency!

• Problem: Critical Path > Iteration Bound!

Retiming transformation

[Figure: RIPEMD data flow graph after retiming, with register A1 and inputs Ki+1, Xi+1]

• Transformation technique that changes the locations of unit-delay elements in a circuit without affecting the input/output characteristic.

• After retiming: Critical Path = Iteration Bound!

Hardware tricks

For speed:

• Parallelism

• Pipelining

• Loop unrolling

• FPGA: Block RAM instead of Logic

• …

For area:

• Multiplexing

• Composite field instead of Sbox

For power/energy:

• Parallelism

• Pipelining

Algorithm properties

As they affect HW realization

• Internal state

• Block size

• Initialization cost

• Iterative, sequential, …

• Parallelism

Benchmark efforts

Benchmarks on FPGA, ASIC

API efforts

Open questions

Stefan Tillich

See his presentation for the “context”

Brian Baldwin

• FPGA: CubeHash, Grostl, Shabal, SIMD, JH, Hamsi and Fugue

• Core functionality & compression function

• See his presentation for ‘context’

Christian Wenzel-Benner

• eXternal Benchmarking eXtension

Miroslav Knezevic

• Illustration of transformations: applied to Luffa and others

• More observations

Patrick Schaumont: API for HW

• INIT & GETCONFIG: initialization, type of I/O, etc.

• IDATA & ODATA: parameter

• 16, 32 bit: low end processor

• 64, 128 (256): high end processors

Kris Gaj: ATHENa

[Figure: ATHENa flow between designer, ATHENa server and user. Labels: interfaces + testbenches; HDL + scripts + configuration files; FPGA synthesis and implementation; result summary + database entries; download scripts and configuration files; designer with HDL + FPGA tools; database query; ranking of designs]

ATHENa Major Features

• synthesis, implementation, and timing analysis in the batch mode

• support for devices and tools of multiple FPGA vendors:

• generation of results for multiple families of FPGAs of a given vendor

• automated choice of a best-matching device within a given family

Open questions

• Area comparisons

• Throughput comparisons

• Power/Energy comparisons

• Sets of environments

Conclusions

• Results depend on:

– ASIC set-up

– FPGA set-up

– Hardware API

– Bandwidth

– Transformations

• Need:

– Set of ‘contexts’

– area and speed, but also POWER and ENERGY!