AES encryption on modern consumer architectures

AES encryption on modern consumer architectures. Author: Ing. Grigore Lupescu [email protected]. Scientific Advisor: Sl. Dr. Ing. Laura Gheorghe. Politehnica University of Bucharest, Automatic Control and Computers Faculty, Computer Science Department. Presentation Session - July 2014

Upload: glupescu

Post on 21-Jun-2015

DESCRIPTION

Specialized cryptographic processors target professional applications and offer both low latency and high throughput at the expense of cost. At the consumer level, a modern SoC embodies several accelerators and vector extensions (e.g. SSE, AES-NI) and is highly programmable through multiple APIs (OpenMP, OpenCL, etc.). This work explains how a modern x86 system that encompasses several compute architectures (MIMD/SIMD) might perform well compared to a specialized cryptographic unit at a fraction of the cost. The analyzed algorithm is AES (AES-128, AES-256) and the mode of operation is ECB. The initial test system is built around the AMD A6-5400K SoC (CPU + integrated GPU), coupled with a discrete GPU, the AMD R7 250. Benchmark results compare CPU OpenSSL execution (no AES-NI), CPU AES-NI acceleration, the integrated GPU, the discrete GPU and heterogeneous combinations of these processing units. Multiple test results are presented and inconsistencies are explained. Finally, based on the initial results, a system composed only of low-end, low-power consumer components is designed, built and tested.

TRANSCRIPT

Page 1: AES encryption on modern consumer architectures

AES encryption on modern consumer architectures

Politehnica University of Bucharest

Automatic Control and Computers Faculty

Computer Science Department

Author: Ing. Grigore Lupescu [email protected]

Scientific Advisor: Sl. Dr. Ing. Laura Gheorghe

Presentation Session - July 2014

Page 2: AES encryption on modern consumer architectures

AES Encryption

02.07.2014 Masters Presentation Session – July 2014 2

• Symmetric block cipher that can encrypt and decrypt information (adopted by NIST in 2001 as a standard for encryption of electronic data)

• AES algorithm:

1. KeyExpansion: round keys are derived from the cipher key.

2. InitialRound: AddRoundKey.

3. Rounds:

I. SubBytes: substitution step; each byte is replaced according to the SBOX.

II. ShiftRows: transposition step; the rows of the state are shifted cyclically.

III. MixColumns: mixing operation on the columns of the state; addition and multiplication are performed in the Galois field GF(2^8).

IV. AddRoundKey: bitwise XOR of the state with the round key.

4. FinalRound: SubBytes, ShiftRows, AddRoundKey (no MixColumns).
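The MixColumns step can be illustrated with a minimal C++ sketch. It is not taken from the presented source code; the names `xtime` and `mix_column` are chosen here for illustration. It shows how "multiplication" in GF(2^8) reduces to shifts and XORs, applied to one 4-byte column of the state.

```cpp
#include <array>
#include <cstdint>

// Multiply by 2 in GF(2^8), reducing modulo the AES polynomial
// x^8 + x^4 + x^3 + x + 1 (0x11b).
inline std::uint8_t xtime(std::uint8_t a) {
    return static_cast<std::uint8_t>((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
}

// MixColumns on a single column: multiply by the fixed matrix
//   [2 3 1 1]
//   [1 2 3 1]
//   [1 1 2 3]
//   [3 1 1 2]
// where addition is XOR and 3*v is computed as (2*v) XOR v.
inline std::array<std::uint8_t, 4> mix_column(const std::array<std::uint8_t, 4>& c) {
    auto mul3 = [](std::uint8_t v) { return static_cast<std::uint8_t>(xtime(v) ^ v); };
    return {
        static_cast<std::uint8_t>(xtime(c[0]) ^ mul3(c[1]) ^ c[2] ^ c[3]),
        static_cast<std::uint8_t>(c[0] ^ xtime(c[1]) ^ mul3(c[2]) ^ c[3]),
        static_cast<std::uint8_t>(c[0] ^ c[1] ^ xtime(c[2]) ^ mul3(c[3])),
        static_cast<std::uint8_t>(mul3(c[0]) ^ c[1] ^ c[2] ^ xtime(c[3])),
    };
}
```

For example, the column `db 13 53 45` maps to `8e 4d a1 bc`.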

Page 3: AES encryption on modern consumer architectures

Cipher Modes

• A cipher mode is an algorithm that repeatedly applies a block cipher (e.g. AES) to the input plaintext

– Most operation modes require an initialization vector

• Most used cipher modes: Cipher-block chaining (CBC), Counter (CTR)

• Other cipher modes: Electronic codebook (ECB), Output feedback (OFB)

• Why use ECB?

– Simple, fast, highly parallelizable, maximum throughput

– Provides a good estimate of how CTR would perform

[Figure: ECB mode. Each plaintext block is passed, together with the key, through the block cipher encryption to produce the corresponding ciphertext block.]
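The block independence that makes ECB easy to parallelize can be sketched in C++ as follows. This is an illustrative sketch only: `toy_block_encrypt` is a hypothetical stand-in for the block cipher (it is NOT AES and not secure), and `ecb_encrypt` is a name chosen here, not one from the presented code.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

using Block = std::array<std::uint8_t, 16>;

// Hypothetical placeholder for a block cipher (NOT AES, not secure):
// rotates the bytes by one position and XORs them with the key.
Block toy_block_encrypt(const Block& in, const Block& key) {
    Block out{};
    for (std::size_t i = 0; i < 16; ++i)
        out[i] = static_cast<std::uint8_t>(in[(i + 1) % 16] ^ key[i]);
    return out;
}

// ECB: every block is encrypted on its own with the same key. No iteration
// depends on another, so the loop can be distributed over CPU threads or
// GPU work-items without synchronization.
std::vector<Block> ecb_encrypt(const std::vector<Block>& plaintext, const Block& key) {
    std::vector<Block> ciphertext(plaintext.size());
    for (std::size_t i = 0; i < plaintext.size(); ++i)
        ciphertext[i] = toy_block_encrypt(plaintext[i], key);
    return ciphertext;
}
```

A consequence visible in the sketch is that identical plaintext blocks always produce identical ciphertext blocks, since nothing but the block and the key enters the computation.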

Page 4: AES encryption on modern consumer architectures

Motivation

• Explore how modern low-end commodity hardware handles AES encryption

• Literature strongly favors the GPU over the CPU

– Transfer to/from the GPU is ignored; only the actual execution is considered

– CPU code is not optimized/parallelized

– This creates a false impression that the GPU clearly outperforms the CPU

• How different scenarios impact performance on CPU and GPU

– Performance on CPU, iGPU, dGPU and combinations of these

– Influence of the compiler/SDK/optimization techniques

– Study of the memory hierarchy and its impact on performance

Page 5: AES encryption on modern consumer architectures

Software architecture (1)

• Written in C++, portable and modular, built with CMake (Linux/Windows). Makes use of OpenMP, OpenCL and AES-NI.

• Source code is divided into three categories: I/O, MAIN and AES.

• AES code may be further divided based on the target processing units:

– CPU => AES_HWNI

– GPU => AES_GPU

– CPU+GPU => AES_HYBRID

Page 6: AES encryption on modern consumer architectures

Software architecture (2)

When processing, (Ct + Gt) threads are spawned:

• Ct threads give work to the CPU cores (Ct = number of CPU cores)

• Gt threads give work to the GPU devices (Gt = number of GPU devices)

• Example shown in the figure: Ct=2 (dual-core CPU), Gt=2 (iGPU, dGPU)

• Each CPU core or GPU device receives a percentage of the work according to its capabilities (the work split is set statically by the user)
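A static, percentage-based work split of this kind could look like the C++ sketch below. The `Range` struct and `split_work` helper are hypothetical names introduced for illustration; the sketch assumes the shares sum to 100 and that the input is a whole number of 16-byte AES blocks, and it gives any rounding remainder to the last unit.

```cpp
#include <cstddef>
#include <vector>

// Byte range assigned to one processing unit (CPU core or GPU device).
struct Range {
    std::size_t offset;
    std::size_t length;
};

// Split total_bytes among the units according to user-given percentages.
// Assumptions (not from the original code): percentages sum to 100 and
// total_bytes is a multiple of the 16-byte AES block size.
std::vector<Range> split_work(std::size_t total_bytes,
                              const std::vector<unsigned>& percent) {
    const std::size_t block = 16;
    const std::size_t total_blocks = total_bytes / block;
    std::vector<Range> ranges;
    std::size_t next_block = 0;
    for (std::size_t i = 0; i < percent.size(); ++i) {
        // Whole blocks only; the last unit takes whatever rounding left over.
        const std::size_t n = (i + 1 == percent.size())
                                  ? total_blocks - next_block
                                  : total_blocks * percent[i] / 100;
        ranges.push_back({next_block * block, n * block});
        next_block += n;
    }
    return ranges;
}
```

With Ct=2 and Gt=2, a call such as `split_work(size, {40, 40, 5, 15})` would yield four contiguous ranges, one per worker thread; the percentages in this call are placeholders, not measured values.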

Page 7: AES encryption on modern consumer architectures

AES Performance Testing

• Initial test system – AMD A6-5400K (dual core x86 + iGPU HD7540, AES-NI support), dGPU R7 250, 6GB DDR3@1600, Ubuntu 12.04 LTS x64

• GPGPU implementation - OpenCL

– SubBytes – precomputed SBOX stored in constant memory

– MixColumns – precomputed Galois Field matrices to avoid (+,*) operations

– ShiftRows and AddRoundKey - simple operations

• CPU implementation – OpenMP

– AES encryption using AES-NI instructions and OpenMP for parallelism on multicore CPUs

– Comparison with OpenSSL library
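For the CPU path, a minimal sketch of AES-128 ECB with AES-NI intrinsics and an OpenMP loop is given below. It is not the presentation's code: the function names and the `AESNI_TARGET` macro are assumptions made for illustration, and it targets x86 CPUs with AES-NI using GCC or Clang. The key schedule follows the usual `AESKEYGENASSIST` pattern.

```cpp
#include <cstdint>
#include <immintrin.h>  // SSE2 and AES-NI intrinsics

// Lets GCC/Clang compile these functions even without a global -maes flag.
#if defined(__GNUC__) || defined(__clang__)
#define AESNI_TARGET __attribute__((target("aes,sse2")))
#else
#define AESNI_TARGET
#endif

// One AES-128 key-schedule step: fold the previous round key and XOR in
// the word produced by AESKEYGENASSIST.
AESNI_TARGET static inline __m128i expand_step(__m128i key, __m128i assist) {
    assist = _mm_shuffle_epi32(assist, 0xff);
    key = _mm_xor_si128(key, _mm_slli_si128(key, 4));
    key = _mm_xor_si128(key, _mm_slli_si128(key, 4));
    key = _mm_xor_si128(key, _mm_slli_si128(key, 4));
    return _mm_xor_si128(key, assist);
}
// The round constant must be a compile-time immediate, hence a macro.
#define EXPAND(prev, rcon) expand_step((prev), _mm_aeskeygenassist_si128((prev), (rcon)))

AESNI_TARGET void aes128_expand_key(const std::uint8_t key[16], __m128i rk[11]) {
    rk[0] = _mm_loadu_si128(reinterpret_cast<const __m128i*>(key));
    rk[1] = EXPAND(rk[0], 0x01);
    rk[2] = EXPAND(rk[1], 0x02);
    rk[3] = EXPAND(rk[2], 0x04);
    rk[4] = EXPAND(rk[3], 0x08);
    rk[5] = EXPAND(rk[4], 0x10);
    rk[6] = EXPAND(rk[5], 0x20);
    rk[7] = EXPAND(rk[6], 0x40);
    rk[8] = EXPAND(rk[7], 0x80);
    rk[9] = EXPAND(rk[8], 0x1b);
    rk[10] = EXPAND(rk[9], 0x36);
}

// Encrypt nblocks independent 16-byte blocks (ECB). Each AESENC performs
// SubBytes, ShiftRows, MixColumns and AddRoundKey for one round.
AESNI_TARGET void aes128_ecb_encrypt(const __m128i* rk, const std::uint8_t* in,
                                     std::uint8_t* out, long nblocks) {
#pragma omp parallel for
    for (long i = 0; i < nblocks; ++i) {
        __m128i s = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in + 16 * i));
        s = _mm_xor_si128(s, rk[0]);                 // initial AddRoundKey
        for (int r = 1; r < 10; ++r)
            s = _mm_aesenc_si128(s, rk[r]);          // rounds 1..9
        s = _mm_aesenclast_si128(s, rk[10]);         // final round, no MixColumns
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out + 16 * i), s);
    }
}
```

Building with `-fopenmp` enables the `parallel for`; without it the pragma is ignored and the loop runs on a single thread. The unaligned load/store intrinsics are used here only so that the sketch accepts arbitrary buffers.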

Page 8: AES encryption on modern consumer architectures

Results Single Unit Processing

OX – chunk size in MB

OY – throughput AES-128 ECB in MB/sec

• The A6-5400K CPU with AES-NI has the best performance among the tested compute units.

• The iGPU with 3 CUs yields a modest ~150 MB/sec, while the dGPU with 6 CUs yields a better ~400 MB/sec, which correlates with its higher processing power and memory bandwidth compared to the iGPU.

Page 9: AES encryption on modern consumer architectures

Results Multi Unit Processing

• Various AES hybrid processing configurations (CPU+iGPU, CPU+dGPU, …)

• Work-split determined by experiments and statically given at runtime

• Performance results are below expectations

Page 10: AES encryption on modern consumer architectures

Influence Multi Unit Processing

CPU A6-5400K (blue): performance degrades with each new processing unit added to the hybrid configuration.

A trace on the CPU with AMD CodeXL identifies the higher cache miss rate as the main cause.

• OX – device configuration

• OY – throughput in MB/sec

iGPU HD7540 (yellow): only slightly impacted in multi-device processing; it is the most stable processing device.

Its performance is low, but its variations are also low (~10%).

Page 11: AES encryption on modern consumer architectures

Compiler influence

• CPU assembly code generated from C++ code for AES-NI processing

• Differences come from how the compiler optimizes aligned memory accesses, i.e. movdqa vs movdqu

1. Performance with g++ 4.6 -O1 => 800 MB/sec AES-128

2. Performance with g++ 4.7 -O1 => 1100 MB/sec AES-128

3. Performance with g++ 4.8 -O1 => 1400 MB/sec AES-128
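The aligned/unaligned distinction can be made explicit at the source level with SSE2 intrinsics, as in the C++ sketch below. It is an illustration rather than the benchmarked code, and the function names are hypothetical. `_mm_load_si128`/`_mm_store_si128` require 16-byte aligned addresses and typically compile to `movdqa`, while `_mm_loadu_si128`/`_mm_storeu_si128` accept any address and typically compile to `movdqu`. The exact instructions emitted still depend on the compiler version and flags.

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// XOR one 16-byte block with a key using aligned accesses.
// All three pointers must be 16-byte aligned.
inline void xor_block_aligned(const std::uint8_t* in, const std::uint8_t* key,
                              std::uint8_t* out) {
    const __m128i a = _mm_load_si128(reinterpret_cast<const __m128i*>(in));
    const __m128i k = _mm_load_si128(reinterpret_cast<const __m128i*>(key));
    _mm_store_si128(reinterpret_cast<__m128i*>(out), _mm_xor_si128(a, k));
}

// Same operation with unaligned accesses; works for any pointer alignment.
inline void xor_block_unaligned(const std::uint8_t* in, const std::uint8_t* key,
                                std::uint8_t* out) {
    const __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in));
    const __m128i k = _mm_loadu_si128(reinterpret_cast<const __m128i*>(key));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_xor_si128(a, k));
}
```

Declaring buffers with `alignas(16)` is one way to guarantee that the aligned variant is safe to call.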

Page 12: AES encryption on modern consumer architectures

GPU Optimizations

Prefer cache over constant memory for the SBOX on AMD GCN and VLIW4 architectures

Where possible, compare precomputed tables against on-the-fly computation: MixColumns is better computed than stored in constant memory

PCIe bus limitations must be addressed (considerable difference between PCIe x4 and PCIe x16)

Overlapping execution with I/O could improve iGPU performance by 10-20% (figure above)
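The trade-off between precomputed tables and on-the-fly computation for MixColumns can be sketched in host-side C++ (not OpenCL). The names `GfTables` and `build_tables` are hypothetical and used only for illustration. The table variant costs one memory lookup per product, while the computed variant costs a shift, a conditional XOR and, for the factor 3, one more XOR.

```cpp
#include <array>
#include <cstdint>

// On-the-fly: multiply by 2 in GF(2^8) modulo the AES polynomial (0x11b).
inline std::uint8_t mul2_computed(std::uint8_t a) {
    return static_cast<std::uint8_t>((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
}

// On-the-fly: 3*a = (2*a) XOR a.
inline std::uint8_t mul3_computed(std::uint8_t a) {
    return static_cast<std::uint8_t>(mul2_computed(a) ^ a);
}

// Precomputed: two 256-entry tables, the kind of data a kernel would keep
// in constant memory and index by the state byte.
struct GfTables {
    std::array<std::uint8_t, 256> mul2;
    std::array<std::uint8_t, 256> mul3;
};

inline GfTables build_tables() {
    GfTables t{};
    for (int v = 0; v < 256; ++v) {
        t.mul2[v] = mul2_computed(static_cast<std::uint8_t>(v));
        t.mul3[v] = mul3_computed(static_cast<std::uint8_t>(v));
    }
    return t;
}
```

Both variants give identical results, so the choice is purely a performance question that depends on how the target device handles table lookups versus arithmetic.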

Page 13: AES encryption on modern consumer architectures

Proposed AES encryption system

Build a low-end consumer system for AES encryption of large data

The proposed configuration is a sub-100 Euro x86 system that can encrypt large data chunks (>1MB) at rates of 0.5 - 1 GB/sec with AES-128/AES-256 ECB

AMD Sempron 3850 (quad [email protected], AES-NI, iGPU with OpenCL), 2GB DDR3 @1600, mini-ITX motherboard and case, USB pen drive boot.

Page 14: AES encryption on modern consumer architectures

Results AES encryption system

Ubuntu 14.04 LTS x64 operating system with g++ 4.8 and Catalyst GPU driver 13.35. Tests consisted of encrypting a 500MB file (generated from /dev/urandom) with AES-128.

CPU processing with AES-NI reaches a disproportionately higher throughput than the iGPU; the extent to which the iGPU can speed up computation is very limited.
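Throughput figures of this kind can be obtained with a simple timing harness, sketched below in C++. The helper names are hypothetical, and the sketch assumes 1 MB = 2^20 bytes, which the slides do not state.

```cpp
#include <chrono>
#include <cstddef>

// Throughput in MB/sec, assuming 1 MB = 2^20 bytes (an assumption here).
inline double throughput_mb_per_sec(std::size_t bytes, double seconds) {
    return (static_cast<double>(bytes) / (1024.0 * 1024.0)) / seconds;
}

// Run fn once and return the elapsed wall-clock time in seconds,
// measured with a monotonic clock.
template <typename Fn>
double time_seconds(Fn&& fn) {
    const auto start = std::chrono::steady_clock::now();
    fn();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

A measurement would wrap the encryption call in `time_seconds` and pass the input size and the returned time to `throughput_mb_per_sec`. Whether file I/O and host-device transfers fall inside the timed region changes the result considerably, so that choice should be reported with the numbers.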

Page 15: AES encryption on modern consumer architectures

Conclusions

• AES-NI instructions provide a simple way to improve AES performance by a large margin (the compiler may influence performance greatly)

• Using a GPU for AES acceleration makes sense when the CPU does not support AES-NI; otherwise the performance gain compared to cost is debatable.

• Fast data encryption using AES is possible on consumer systems that have CPU AES-NI extensions or at least GPU accelerators

• A sub-100 Euro, low-power (<30 W) x86 system with a processing throughput of over 1 GB/sec for AES-128 ECB

• Future focus on heterogeneous unified memory architectures (hUMA), where the communication between the CPU and GPU will be further simplified.