Implementing and Benchmarking Seven Round 2 Lattice-Based Key Encapsulation Mechanisms Using a Software/Hardware Codesign Approach Farnoud Farahmand, Viet B. Dang, Michał Andrzejczak, Kris Gaj George Mason University



Co-Authors

GMU PhD Students: Farnoud Farahmand, Viet Ba Dang

Visiting Scholar: Michał Andrzejczak, Military University of Technology in Warsaw, Poland

Hardware Benchmarking


Round 2 Candidates in Hardware

Contest                   AES     SHA-3   CAESAR   PQC
# Round 2 candidates      5       14      29       26
Implemented in hardware   5       14      28       ?
Percentage                100%    100%    97%      ?

Software/Hardware Codesign

[Diagram labels: Software; Hardware; Most time-critical operation]

SW/HW Codesign: Motivational Example 1

Software:
91% major operation(s)
9% other operations

Software/Hardware:
~1% major operation(s) in HW (speed-up ≥ 100)
9% other operations in SW

Time saved: 90%

Total Speed-Up ≥ 10

SW/HW Codesign: Motivational Example 2

Software:
99% major operation(s)
1% other operations

Software/Hardware:
~1% major operation(s) in HW (speed-up ≥ 100)
1% other operations in SW

Time saved: 98%

Total Speed-Up ≥ 50
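
The arithmetic behind the two motivational examples is Amdahl's law. The following is a minimal sketch, not taken from the slides: the function name `total_speedup` and its parameter names are illustrative, and the accelerator speed-up of exactly 100 is the lower bound quoted in the examples.

```python
def total_speedup(accelerated_fraction: float, accelerator_speedup: float) -> float:
    """Amdahl's law: overall speed-up when only part of the run time is accelerated."""
    # Time left = untouched software part + accelerated part shrunk by the speed-up
    remaining = (1.0 - accelerated_fraction) + accelerated_fraction / accelerator_speedup
    return 1.0 / remaining


# Example 1: major operations take 91% of the software time, accelerated 100x
example_1 = total_speedup(0.91, 100.0)
# Example 2: major operations take 99% of the software time, accelerated 100x
example_2 = total_speedup(0.99, 100.0)

print(f"Example 1 total speed-up: {example_1:.2f}")
print(f"Example 2 total speed-up: {example_2:.2f}")
```

Both results sit just above the bounds of 10 and 50 stated in the examples. The part left in software (9% or 1%) caps the total gain, no matter how fast the accelerator is.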


SW/HW Codesign: Advantages

❖ Focus on a few major operations, known to be easily parallelizable
▪ much shorter development time (at least by a factor of 10)
▪ guaranteed substantial speed-up
▪ high flexibility to changes in other operations (such as candidate tweaks)

❖ Insight regarding performance of future instruction set extensions of modern microprocessors

❖ Possibility of implementing multiple candidates by the same research group, eliminating the influence of different
▪ design skills
▪ operation subset (e.g., including or excluding key generation)
▪ interface & protocol
▪ optimization target
▪ platform

SW/HW Codesign: Potential Pitfalls


❖ Performance & ranking may strongly depend on

A. features of a particular platform

o Software/hardware interface

o Support for cache coherency

o Differences in max. clock frequency

B. selected hardware/software partitioning

C. optimization of an underlying software implementation

❖ Limited insight on ranking of purely hardware implementations

First step, not the ultimate solution!


Two Major Types of Platforms

FPGA Fabric & Hard-core Processors

Examples:
• Xilinx Zynq 7000 System on Chip (SoC)
• Xilinx Zynq UltraScale+ MPSoC
• Intel Arria 10 SoC FPGAs
• Intel Stratix 10 SoC FPGAs

FPGA Fabric, including Soft-core Processors

Examples: Xilinx Virtex UltraScale+ FPGAs, Intel Stratix 10 FPGAs, including
• Xilinx MicroBlaze
• Intel Nios II
• RISC-V, originally UC Berkeley

[Diagram labels: Processor w/ Memory & I/O alongside FPGA Fabric; Soft-core Processor within FPGA Fabric]

Two Major Types of Platform

Feature                                           FPGA Fabric and Hard-core Processor   FPGA Fabric with Soft-core Processor
Processor                                         ARM                                   MicroBlaze, NIOS II, RISC-V, etc.
Clock frequency                                   >1 GHz                                max. 200-450 MHz
Portability                                       similar FPGA SoCs                     various FPGAs, FPGA SoCs, and ASICs
Hardware accelerators                             Yes                                   Yes
Instruction set extensions                        No                                    Yes
Ease of design (methodology, tools, OS support)   Easy                                  Dependent on a particular soft-core processor and tool chain

Xilinx Zynq UltraScale+ MPSoC

1.2 GHz ARM Cortex-A53 + UltraScale+ FPGA logic


Choice of a Platform for Benchmarking

                                 Embedded Processor   FPGA Architecture
In NIST presentations to date:   ARM Cortex-M4        Artix-7
Our recommendation:              ARM Cortex-A53       UltraScale+

• No FPGA SoC with ARM Cortex-M4 and Artix-7 on a single chip
• Cortex-M4 and Artix-7 more suitable for lightweight designs, Cortex-A53 and UltraScale+ for high performance
• Zynq UltraScale+:
  • capability to compare SW/HW implementations with fully-SW and fully-HW implementations realized using the same chip
  • likely in use in the first years of the new standard deployments

Experimental Setup

[Block diagram. Blocks: Zynq Processing System, AXI DMA, Input FIFO, Hardware Accelerator, Output FIFO, AXI Timer, Clocking wizard. Interface labels: AXI Lite Interface, AXI Full Interface, AXI Stream Interface, FIFO Interface, IRQ. Clock labels: Main Clock, UUT_clk, clk, wr_clk, rd_clk.]

All elements located on a single chip


Code Release

• Full Code & Configuration of the Experimental Setup
• Software/Hardware Codesign of Round 1 NTRUEncrypt

to be made available at https://cryptography.gmu.edu/athena under PQC by August 31, 2019

Our Case Study

SW/HW Codesign: Case Study

7 IND-CCA*-secure Lattice-Based Key Encapsulation Mechanisms (KEMs) representing 5 NIST PQC Round 2 Submissions

LWE (Learning with Errors)-based: FrodoKEM

RLWR (Ring Learning with Rounding)-based: Round5

Module-LWR-based: Saber

NTRU-based:
NTRU
• NTRU-HPS
• NTRU-HRSS
NTRU Prime
• Streamlined NTRU Prime
• NTRU LPRime

* IND-CCA = with Indistinguishability under Chosen Ciphertext Attack

SW/HW Partitioning

Top candidates for offloading to hardware

From profiling:

❖ Large percentage of the execution time

❖ Small number of function calls

From manual analysis of the code:

❖ Small size of inputs and outputs

❖ Potential for combining with neighboring functions

From knowledge of operations and concurrent computing:

❖ High potential for parallelization
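
The profiling-based criteria above (large share of execution time, small number of calls) can be sketched as a simple filter. This is only an illustration, not the authors' selection procedure: the time shares are the LightSaber decapsulation figures shown later in this talk, while the call counts and both thresholds (`min_share`, `max_calls`) are made-up placeholders.

```python
from dataclasses import dataclass


@dataclass
class ProfiledFunction:
    name: str
    time_share: float  # percent of total software execution time
    calls: int         # calls per operation (placeholder values, not measured)


def offload_candidates(profile, min_share=5.0, max_calls=10):
    """Keep functions with a large time share and few calls, largest share first."""
    picked = [f for f in profile if f.time_share >= min_share and f.calls <= max_calls]
    return sorted(picked, key=lambda f: f.time_share, reverse=True)


# Time shares from the LightSaber decapsulation profile; call counts are hypothetical
profile = [
    ProfiledFunction("MatrixVectorMul", 43.44, 1),
    ProfiledFunction("InnerProduct", 43.52, 1),
    ProfiledFunction("GenMatrix", 5.03, 1),
    ProfiledFunction("GenSecret", 2.30, 1),
    ProfiledFunction("Hash", 3.30, 4),
]

names = [f.name for f in offload_candidates(profile)]
print(names)
```

A filter like this covers only the profiling criteria. Input/output size, the potential for merging neighboring functions, and parallelizability still require manual analysis, which is why a function below the threshold may still be worth offloading.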


Operations Offloaded to Hardware

• Major arithmetic operations

• Polynomial multiplications

• Matrix-by-vector multiplications

• Vector-by-vector multiplications

• All hash-based operations

• (c)SHAKE128, (c)SHAKE256

• SHA3-256, SHA3-512
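
As an illustration of the kind of arithmetic that gets offloaded, here is a minimal software sketch of schoolbook polynomial multiplication in Z_q[x]/(x^n + 1). The choice of ring is an assumption made for the example (it is the ring used by Saber, with n = 256 and q = 2^13 per the Saber specification, not per these slides); other candidates use different rings, and the hardware accelerators themselves are not shown here. The function name `polymul_negacyclic` is illustrative.

```python
def polymul_negacyclic(a, b, q):
    """Schoolbook product of a and b in Z_q[x]/(x^n + 1), where n = len(a) = len(b)."""
    n = len(a)
    assert len(b) == n, "both polynomials need n coefficients"
    result = [0] * n
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = i + j
            if k < n:
                result[k] = (result[k] + ai * bj) % q
            else:
                # x^n = -1 in this ring, so wrapped-around terms are subtracted
                result[k - n] = (result[k - n] - ai * bj) % q
    return result


# Toy example with n = 4 (real parameter sets use much larger n)
a = [1, 2, 3, 4]  # 1 + 2x + 3x^2 + 4x^3
b = [5, 6, 7, 8]  # 5 + 6x + 7x^2 + 8x^3
product = polymul_negacyclic(a, b, q=8192)
print(product)
```

The n^2 independent coefficient products in the double loop are what makes this operation a natural target for parallel hardware.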


Example: LightSaber Decapsulation

Share of software execution time:

MatrixVectorMul   43.44%
InnerProduct      43.52%
GenMatrix          5.03%
GenSecret          2.30%
Hash               3.30%
Other              2.40%

LightSaber Decapsulation

Software only:
Execution time of functions to be moved to hardware: 97.60% (MatrixVectorMul 43.44%, InnerProduct 43.52%, GenMatrix 5.03%, GenSecret 2.30%, Hash 3.30%)
Execution time of functions remaining in software: 2.40% (Other)

Software/Hardware:
Hardware Accelerator: 8.77%
Other: 2.40%
Execution time remaining: 11.17%
Execution time saved: 88.83%

Accelerator Speed-Up = 97.60/8.77 = 11.1
Total Speed-Up = 100/11.17 = 9.0
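
The two speed-up figures for LightSaber decapsulation can be recomputed from the percentages on this slide. A minimal sketch; the variable names are illustrative, and all numbers are the slide's own.

```python
# Percentages of the original software-only execution time (from the slide)
offloaded_sw = 97.60    # functions to be moved to hardware, measured in software
other_sw = 2.40         # functions remaining in software
accelerator_hw = 8.77   # time taken by the hardware accelerator after offloading

# How much faster the offloaded part alone became
accelerator_speedup = offloaded_sw / accelerator_hw

# What is left of the original execution time, and the resulting overall gain
remaining = accelerator_hw + other_sw
total_speedup = 100.0 / remaining
time_saved = 100.0 - remaining

print(f"Accelerator speed-up:     {accelerator_speedup:.1f}")
print(f"Execution time remaining: {remaining:.2f}%")
print(f"Execution time saved:     {time_saved:.2f}%")
print(f"Total speed-up:           {total_speedup:.1f}")
```

The total speed-up is lower than the accelerator speed-up because the 2.40% left in software is not accelerated at all.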


Tentative Results

Software Implementations Used

FrodoKEM, NTRU-HPS, NTRU-HRSS, Saber:

Round 2 submission packages – Optimized_Implementation

Round5:

https://github.com/r5embed/r5embed (2019-07-28)

Streamlined NTRU Prime, NTRU LPRime:

supercop-20190811 : factored

Changes made after the submission of the paper!

Results substantially different!

New version of the paper available on ePrint soon!


Total Execution Time in Software [μs]: Encapsulation

[Bar chart; y-axis 0 to 4,500 μs; series: Level 1 to Level 5. Algorithms in plotted order: Round5, Saber, Str NTRU Prime, NTRU LPRime, NTRU-HRSS, NTRU-HPS, FrodoKEM. Off-scale bar labels: 16,192; 62,076; 34,609.]

Total Execution Time in Software/Hardware [μs]: Encapsulation

[Bar chart; y-axis 0 to 600 μs; series: Level 1 to Level 5. Algorithms in plotted order, with rank annotations (rank in software ⇒ rank in software/hardware): Round5 (1), Saber (2), NTRU-HRSS (5⇒3), Str NTRU Prime (3⇒4), NTRU-HPS (6⇒5), NTRU LPRime (4⇒6), FrodoKEM (7). Off-scale bar labels: 1,223; 2,186; 1,642.]

Total Speed-ups: Encapsulation

[Bar chart; y-axis 0.0 to 30.0; series: Level 1 to Level 5. Algorithm labels (as extracted): NTRU-HRSS, FrodoKEM, Round5, NTRU-HPS, Saber, NTRU LPRime, Str NTRU Prime. Bar value labels (as extracted): 17.8, 13.2, 8.3, 7.9, 7.1, 3.2, 2.4, 21.1, 10.2, 9.3, 12.7, 3.4, 2.5, 3.6, 2.6, 28.4, 10.6, 18.1.]

Accelerator Speed-ups: Encapsulation

[Bar chart; y-axis 0.0 to 200.0; series: Level 1 to Level 5. Algorithm labels (as extracted): NTRU-HRSS, NTRU-HPS, FrodoKEM, NTRU LPRime, Str NTRU Prime, Round5, Saber. Bar value labels (as extracted): 146.4, 140.4, 43.5, 8.7, 8.5, 20.4, 12.3, 192.2, 44.3, 27.5, 15.3, 10.6, 14.5, 32.5, 17.7, 46.1, 11.1, 21.3.]

SW Part Sped up by HW [%]: Encapsulation

[Bar chart; y-axis 50.00 to 100.00; series: Level 1 to Level 5. Algorithm labels (as extracted): Round5, Saber, NTRU-HRSS, FrodoKEM, NTRU-HPS, NTRU LPRime, Str NTRU Prime. Bar value labels (as extracted): 99.36, 98.49, 95.03, 94.62, 88.03, 70.35, 62.53, 99.59, 98.93, 97.45, 89.71, 71.44, 63.43, 73.26, 65.00, 99.55, 99.14, 98.62.]

Total Execution Time in Software [μs]: Decapsulation

[Bar chart; y-axis 0 to 12,000 μs; series: Level 1 to Level 5. Algorithms in plotted order: Round5, Saber, NTRU LPRime, Str NTRU Prime, NTRU-HPS, NTRU-HRSS, FrodoKEM. Off-scale bar labels: 16,192; 62,377; 34,649. Annotation: the order of positions 3 and 4, and of positions 5 and 6, is reversed compared to encapsulation.]

Total Execution Time in Software/Hardware [μs]: Decapsulation

[Bar chart; y-axis 0 to 700 μs; series: Level 1 to Level 5. Algorithms in plotted order, with rank annotations (rank in software ⇒ rank in software/hardware): Round5 (1), Saber (2), NTRU-HPS (5⇒3), Str NTRU Prime (4), NTRU-HRSS (6⇒5), NTRU LPRime (3⇒6), FrodoKEM (7). Off-scale bar labels: 1,319; 3,120; 1,866.]

Total Speed-ups: Decapsulation

[Bar chart; y-axis 0.0 to 130.0; series: Level 1 to Level 5. Algorithm labels (as extracted): NTRU-HPS, NTRU-HRSS, Str NTRU Prime, FrodoKEM, Saber, Round5, NTRU LPRime. Bar value labels (as extracted): 77.4, 74.1, 12.3, 9.0, 8.1, 38.3, 3.9, 119.3, 45.5, 18.6, 13.4, 9.4, 4.1, 54.8, 4.5, 20.0, 17.9, 9.8.]

Accelerator Speed-ups: Decapsulation

[Bar chart; y-axis 0.0 to 250.0; series: Level 1 to Level 5. Algorithm labels (as extracted): NTRU-HRSS, NTRU-HPS, Str NTRU Prime, FrodoKEM, NTRU LPRime, Saber, Round5. Bar value labels (as extracted): 188.1, 182.8, 44.4, 11.1, 8.1, 86.2, 27.5, 235.7, 112.0, 46.0, 32.8, 17.7, 9.4, 132.7, 37.7, 46.0, 24.6, 9.8.]

SW Part Sped up by HW [%]: Decapsulation

[Bar chart; y-axis 50.00 to 100.00; series: Level 1 to Level 5. Algorithm labels (as extracted): Round5, NTRU-HPS, NTRU-HRSS, Str NTRU Prime, Saber, FrodoKEM, NTRU LPRime. Bar value labels (as extracted): 100.00, 99.25, 99.18, 97.60, 93.96, 98.53, 76.47, 100.00, 99.58, 98.69, 98.10, 96.78, 77.46, 98.92, 79.22, 100.00, 98.41, 97.11.]

Conclusions

❖ Total speed-ups

▪ for encapsulation from 2.4 (Str NTRU Prime) to 28.4 (FrodoKEM)
▪ for decapsulation from 3.9 (NTRU LPRime) to 119.3 (NTRU-HPS)

❖ Total speed-up dependent on the percentage of the software execution time taken by functions offloaded to hardware and the amount of acceleration itself

❖ Hardware accelerators thoroughly optimized using Register-Transfer Level design methodology

❖ Determining optimal software/hardware partitioning requires more work

❖ Ranking of the investigated candidates affected, but not dramatically changed, by hardware acceleration

❖ It is possible to complete similar designs for all Round 2 candidates within the evaluation period (12-18 months)

❖ Additional benefit: Comprehensive library of major operations in hardware


Future Work

Breadth:
• Current work
• 4 Remaining Lattice-based KEMs
• 3 Lattice-based Digital Signatures
• 7 Code-based KEMs
• 7 Other Candidates

Depth:
• More operations moved to hardware / C code optimized for ARM Cortex-A53*
• Algorithmic optimizations of software and hardware*
• Full hardware implementations

Hardware libraries:
• Hardware library of basic operations of lattice-based candidates
• Hardware library of basic operations of code-based candidates
• Hardware library for PQC

*collaboration with submission teams and other groups very welcome

Q&A


Suggestions?

CERG: http://cryptography.gmu.edu

ATHENa: http://cryptography.gmu.edu/athena

Questions? Comments?

Thank You!


Backup


Clock Frequency & Resource Utilization

Level: Algorithm                  Clock Freq [MHz]   #LUTs     #Slices   #FFs      #36kb BRAMs   #DSPs
1: FrodoKEM                       402                7,213     1,186     6,647     13.5          32
1: Round5                         260                55,442    10,381    82,341    0             0
1: Saber                          322                12,343    1,989     11,288    3.5           256
1: NTRU-HPS                       200                24,328    4,972     19,244    2.5           677
1: NTRU-HRSS                      200                27,218    5,770     21,410    2.5           701
2: Str NTRU Prime                 244                55,843    8,134     28,143    3.0           0
2: NTRU LPRime                    244                50,911    7,874     34,050    2.0           0
Device                                               274,080   34,260    548,160   912           2,520
Share of total device resources                      < 21%     < 31%     < 15%     < 2%          < 28%
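
The "share of total device resources" bounds can be checked against the table by dividing the largest value in each column by the device total. A minimal sketch using the table's numbers; the dictionary layout and the function name `max_utilization` are illustrative.

```python
# Resources available on the device (last row of the table)
device = {"LUTs": 274_080, "Slices": 34_260, "FFs": 548_160, "BRAMs": 912, "DSPs": 2_520}

# Resources used by each design (rows of the table)
designs = {
    "FrodoKEM":       {"LUTs": 7_213,  "Slices": 1_186,  "FFs": 6_647,  "BRAMs": 13.5, "DSPs": 32},
    "Round5":         {"LUTs": 55_442, "Slices": 10_381, "FFs": 82_341, "BRAMs": 0,    "DSPs": 0},
    "Saber":          {"LUTs": 12_343, "Slices": 1_989,  "FFs": 11_288, "BRAMs": 3.5,  "DSPs": 256},
    "NTRU-HPS":       {"LUTs": 24_328, "Slices": 4_972,  "FFs": 19_244, "BRAMs": 2.5,  "DSPs": 677},
    "NTRU-HRSS":      {"LUTs": 27_218, "Slices": 5_770,  "FFs": 21_410, "BRAMs": 2.5,  "DSPs": 701},
    "Str NTRU Prime": {"LUTs": 55_843, "Slices": 8_134,  "FFs": 28_143, "BRAMs": 3.0,  "DSPs": 0},
    "NTRU LPRime":    {"LUTs": 50_911, "Slices": 7_874,  "FFs": 34_050, "BRAMs": 2.0,  "DSPs": 0},
}


def max_utilization(designs, device):
    """Largest percentage of each device resource used by any single design."""
    return {
        resource: 100.0 * max(d[resource] for d in designs.values()) / total
        for resource, total in device.items()
    }


utilization = max_utilization(designs, device)
for resource, percent in utilization.items():
    print(f"{resource:6s} {percent:5.2f}%")
```

The computed maxima agree with the bounds in the table, with the flip-flop share coming out at almost exactly 15%.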


Minor Modifications to C Code: Bare Metal vs. Linux

• No functions of OpenSSL – standalone implementations of:

o AES: Optimized ANSI C code for the Rijndael cipher (T-box-based) by Vincent Rijmen, Antoon Bosselaers, and Paulo Barreto, https://fastcrypto.org/front/misc/rijndael-alg-fst.c

o SHA-3:

▪ fips202.c from SUPERCOP by Ronny Van Keer, Gilles Van Assche, Daniel J. Bernstein, and Peter Schwabe (for all candidates other than Round5)

▪ r5_xof_shake.c by Markku-Juhani O. Saarinen and keccak1600.c from SUPERCOP, by the same authors as fips202.c (for Round5)

o randombytes(): based on SHAKE rather than AES in NTRU-HPS, NTRU-HRSS, and Streamlined NTRU Prime

• No support for SUPERCOP scripts


randombytes()

❖ Function used for generating pseudorandom byte sequences

❖ The implementation varies among benchmarking studies, depending on the mode of operation (Bare Metal vs. Operating System) and the availability of libraries, such as OpenSSL

❖ Used to a different extent by implementations of various candidates

Algorithm        #Calls   #Bytes (security category 1)
FrodoKEM         1        16
Round5           1        16
Saber            1        32
NTRU-HPS         1        3211
NTRU-HRSS        1        1400
Str NTRU Prime   1        2612
NTRU LPRime      1        32

❖ For 3 algorithms, randombytes() was sped up more than 3 times by using SHAKE128
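
To make the idea of a SHAKE-based randombytes() concrete, here is a minimal sketch using Python's standard `hashlib.shake_128`. It is not the authors' C implementation: the class name `ShakeRandomBytes`, the seed value, and the stream-offset bookkeeping are assumptions made for illustration, and a fixed seed like this is suitable only for reproducible benchmarking, not for generating real keys.

```python
import hashlib


class ShakeRandomBytes:
    """Deterministic byte generator built on SHAKE128 (illustrative sketch only)."""

    def __init__(self, seed: bytes):
        self._seed = seed
        self._offset = 0  # number of stream bytes already handed out

    def randombytes(self, n: int) -> bytes:
        # SHAKE128 is an extendable-output function: a longer digest extends
        # a shorter one, so re-deriving the stream and slicing continues it.
        stream = hashlib.shake_128(self._seed).digest(self._offset + n)
        chunk = stream[self._offset:]
        self._offset += n
        return chunk


rng = ShakeRandomBytes(b"example seed")  # placeholder seed, not a secure one
first = rng.randombytes(16)   # e.g. the 16 bytes requested by FrodoKEM or Round5
second = rng.randombytes(32)  # e.g. the 32 bytes requested by Saber
print(first.hex())
print(second.hex())
```

Re-hashing from the start on every call keeps the sketch short. An implementation tuned for the multi-kilobyte requests listed in the table would keep the sponge state and squeeze incrementally instead.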