AES encryption on modern consumer architectures

Politehnica University of Bucharest

Automatic Control and Computers Faculty

Computer Science Department

Author(s): Ing. Grigore [email protected]

Scientific Advisor: Sl. Dr. Ing. Laura Gheorghe

Presentation Session - July 2014

AES Encryption

02.07.2014 Masters Presentation Session - July 2014

• Symmetric block cipher that can encrypt and decrypt information (adopted by NIST in 2001 as a standard for encryption of electronic data)

• AES algorithm:

1. KeyExpansion: round keys are derived from the cipher key.

2. InitialRound: AddRoundKey.

3. Rounds:

I. SubBytes: substitution step, each byte is replaced according to the SBOX.

II. ShiftRows: transposition step, rows of the state are shifted cyclically.

III. MixColumns: mixing operation on the columns of the state. Addition and multiplication (+, *) are defined over the Galois field GF(2^8).

IV. AddRoundKey: bitwise XOR of the state with the round key.

4. FinalRound: SubBytes, ShiftRows, AddRoundKey (no MixColumns).
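
The MixColumns step relies on arithmetic in GF(2^8), where addition is XOR and multiplication is reduced modulo the AES polynomial x^8 + x^4 + x^3 + x + 1. The sketch below is a minimal illustration of that arithmetic and of mixing a single state column. The names `xtime`, `gf_mul`, `mix_column` and `Column` are chosen here for illustration and are not taken from the presented code base.

```cpp
#include <array>
#include <cstdint>

using Column = std::array<std::uint8_t, 4>;

// Multiply by {02} in GF(2^8): shift left, then reduce with 0x1b if the
// high bit was set (AES polynomial x^8 + x^4 + x^3 + x + 1).
inline std::uint8_t xtime(std::uint8_t a) {
    return static_cast<std::uint8_t>((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
}

// General GF(2^8) multiplication by shift-and-add; "add" is XOR.
inline std::uint8_t gf_mul(std::uint8_t a, std::uint8_t b) {
    std::uint8_t result = 0;
    while (b != 0) {
        if (b & 1) result ^= a;  // add the current multiple of a
        a = xtime(a);            // move on to the next power of x
        b >>= 1;
    }
    return result;
}

// MixColumns applied to one column: multiply by the fixed circulant
// matrix with first row (02 03 01 01).
inline Column mix_column(const Column& c) {
    Column out{};
    out[0] = static_cast<std::uint8_t>(gf_mul(2, c[0]) ^ gf_mul(3, c[1]) ^ c[2] ^ c[3]);
    out[1] = static_cast<std::uint8_t>(c[0] ^ gf_mul(2, c[1]) ^ gf_mul(3, c[2]) ^ c[3]);
    out[2] = static_cast<std::uint8_t>(c[0] ^ c[1] ^ gf_mul(2, c[2]) ^ gf_mul(3, c[3]));
    out[3] = static_cast<std::uint8_t>(gf_mul(3, c[0]) ^ c[1] ^ c[2] ^ gf_mul(2, c[3]));
    return out;
}
```

Only multiplications by {02} and {03} are needed for encryption, which is why computing them on the fly can be cheap compared to a table lookup.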

Cipher Modes


• Algorithm to repeatedly apply a block cipher (e.g. AES) to the input plaintext

– Most operation modes require an initialization vector

• Most used cipher modes: Cipher-block chaining (CBC), Counter (CTR)

• Other cipher modes: Electronic codebook (ECB), Output feedback (OFB)

• Why use ECB?

– Simple, fast, highly parallelizable, maximum throughput

– Provides a good estimate of how CTR would perform
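
The parallelism argument for ECB can be illustrated with a short sketch: every 16-byte block is processed independently with the same key, so the loop has no cross-iteration dependencies. The block function `toy_encrypt_block` below is a hypothetical stand-in (a plain XOR with the key, not AES and not secure) used only to keep the example self-contained. `ecb_encrypt` is likewise an illustrative name, and the input is assumed to be a multiple of the block size.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kBlockSize = 16;  // AES block size in bytes
using Key = std::array<std::uint8_t, kBlockSize>;

// Hypothetical stand-in for the AES block function: XOR with the key.
// NOT a real cipher; a real implementation would run the AES rounds here.
inline void toy_encrypt_block(const std::uint8_t* in, const std::uint8_t* key,
                              std::uint8_t* out) {
    for (std::size_t i = 0; i < kBlockSize; ++i) {
        out[i] = static_cast<std::uint8_t>(in[i] ^ key[i]);
    }
}

// ECB mode: each block depends only on its own plaintext and the key,
// so the iterations can be distributed across threads.
// Assumes plaintext.size() is a multiple of kBlockSize (no padding).
inline std::vector<std::uint8_t> ecb_encrypt(const std::vector<std::uint8_t>& plaintext,
                                             const Key& key) {
    std::vector<std::uint8_t> ciphertext(plaintext.size());
    const long long blocks = static_cast<long long>(plaintext.size() / kBlockSize);
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (long long b = 0; b < blocks; ++b) {
        const std::size_t off = static_cast<std::size_t>(b) * kBlockSize;
        toy_encrypt_block(&plaintext[off], key.data(), &ciphertext[off]);
    }
    return ciphertext;
}
```

When compiled without OpenMP support the pragma is skipped and the loop runs sequentially with the same result.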

[Figure: block cipher encryption diagram. Three plaintext blocks are each encrypted with the key by the block cipher, producing three ciphertext blocks.]

Motivation

• Explore how modern low-end commodity hardware handles AES encryption

• Literature strongly favors the GPU over the CPU

– Transfer to/from the GPU is ignored; only the actual execution is considered

– CPU code is not optimized/parallelized

– This creates a false impression that the GPU clearly outperforms the CPU

• How different scenarios impact performance on CPU and GPU

– Performance on CPU, iGPU, dGPU and combinations of these

– Influence of the compiler/SDK/optimization techniques

– Study of the memory hierarchy and its impact on performance


Software architecture (1)


• Written in C++, portable and modular, built with CMake (Linux/Windows). Makes use of OpenMP, OpenCL and AES-NI.

• Source code is divided into three categories: I/O, MAIN and AES.

• AES code may be further divided based on target processing units

– CPU => AES_HWNI

– GPU => AES_GPU

– CPU+GPU => AES_HYBRID

Software architecture (2)


When processing, (Ct + Gt) threads are spawned:

• Ct threads give work to CPU cores (Ct = number of CPU cores)

• Gt threads give work to GPU devices (Gt = number of GPU devices)

• On the left: the case Ct=2 (dual core CPU), Gt=2 (iGPU, dGPU)

• Each CPU core or GPU device receives a percentage of the work according to its capabilities (the work split is done statically by the user)
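
A static work split of this kind could be computed as in the sketch below. It is a minimal illustration, not the presented implementation: `Range` and `split_work` are hypothetical names, the percentages are assumed to be supplied by the user and to sum to 100, and each share is rounded down to whole 16-byte AES blocks with the last worker taking the remainder.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Byte range of the input buffer assigned to one worker (CPU core or GPU device).
struct Range {
    std::size_t offset;
    std::size_t length;
};

// Split total_bytes among workers according to user-supplied percentages.
// Assumptions: total_bytes is a multiple of 16 and the percentages sum to 100.
inline std::vector<Range> split_work(std::size_t total_bytes,
                                     const std::vector<unsigned>& percent) {
    assert(!percent.empty());
    assert(total_bytes % 16 == 0);
    unsigned sum = 0;
    for (unsigned p : percent) sum += p;
    assert(sum == 100);

    const std::size_t total_blocks = total_bytes / 16;
    std::vector<Range> ranges;
    std::size_t offset = 0;
    for (std::size_t i = 0; i < percent.size(); ++i) {
        const bool last = (i + 1 == percent.size());
        // Round each share down to whole blocks; the last worker absorbs the rest.
        const std::size_t blocks =
            last ? total_blocks - offset / 16 : total_blocks * percent[i] / 100;
        ranges.push_back({offset, blocks * 16});
        offset += blocks * 16;
    }
    return ranges;
}
```

For the Ct=2, Gt=2 case, a call such as `split_work(size, {40, 40, 5, 15})` would yield four ranges, one per spawned thread; these percentages are made up for the example and are not measured values.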

AES Performance Testing


• Initial test system – AMD A6-5400K (dual core x86 + iGPU HD7540, AES-NI support), dGPU R7 250, 6GB DDR3@1600, Ubuntu 12.04 LTS x64

• GPGPU implementation - OpenCL

– SubBytes – precomputed SBOX stored in constant memory

– MixColumns – precomputed Galois Field matrices to avoid (+,*) operations

– ShiftRows and AddRoundKey - simple operations

• CPU implementation – OpenMP

– AES encryption using AES-NI instructions and OpenMP for parallelism on multicore CPUs

– Comparison with OpenSSL library

Results Single Unit Processing


OX – chunk size in MB

OY – throughput AES-128 ECB in MB/sec

• The A6-5400K CPU with AES-NI has the best performance among the tested compute units.

• The iGPU with 3 CU yields a modest ~150MB/sec, while the dGPU with 6 CU yields a better result, ~400MB/sec, which correlates with the increase in processing power and memory bandwidth over the iGPU.
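
Throughput figures like these can be obtained by timing the processing of a chunk of known size. The helper below is a minimal sketch of such a measurement; the names are illustrative, and 1 MB is assumed to mean 2^20 bytes, which the slides do not specify.

```cpp
#include <chrono>
#include <cstddef>

// Convert a byte count and an elapsed time into MB/sec.
// Assumption: 1 MB = 1024 * 1024 bytes.
inline double throughput_mb_per_sec(std::size_t bytes, double seconds) {
    return (static_cast<double>(bytes) / (1024.0 * 1024.0)) / seconds;
}

// Time a callable that processes `bytes` bytes and return its throughput.
// steady_clock is used because it is monotonic.
template <typename Fn>
double measure_throughput(std::size_t bytes, Fn&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double> elapsed = stop - start;
    return throughput_mb_per_sec(bytes, elapsed.count());
}
```

For GPU devices the timed callable should include the transfers to and from the device, since the motivation for this work is that ignoring them overstates GPU performance.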

Results Multi Unit Processing


• Various AES hybrid processing configurations (CPU+iGPU, CPU+dGPU, …)

• Work-split determined by experiments and statically given at runtime

• Performance results are below expectations

Influence Multi Unit Processing


CPU A6-5400K (blue): performance degrades with each new processing unit added to the hybrid configuration.

A trace of the CPU with AMD CodeXL points to a higher cache miss rate as the main cause.

• OX – device configuration

• OY – throughput in MB/sec

iGPU HD7540 (yellow): only slightly impacted in multi-device processing, making it the most stable processing device.

Its performance is low, but variations are also low (~10%).

Compiler influence


• CPU assembly code generated from C++ code for AES-NI processing

• Differences come from how the compiler optimizes aligned memory accesses, i.e. movdqa vs movdqu

1. Performance with g++ 4.6 -O1 => 800MB/sec AES-128


=> 1100MB/sec AES-128


=> 1400MB/sec AES-128
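
At the source level, the aligned versus unaligned distinction shows up in the SSE2 intrinsics `_mm_load_si128` (corresponds to movdqa, requires a 16-byte aligned address) and `_mm_loadu_si128` (corresponds to movdqu, works for any address). The sketch below illustrates the two paths on the AddRoundKey XOR. The function names are illustrative, the final instruction choice remains up to the compiler, and a plain byte loop is used as a fallback when SSE2 is not available.

```cpp
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// True if the pointer can be used with aligned 128-bit loads.
inline bool is_aligned_16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// XOR one 16-byte block with a round key (the AddRoundKey operation).
inline void add_round_key(const std::uint8_t* in, const std::uint8_t* key,
                          std::uint8_t* out) {
#if defined(__SSE2__)
    const __m128i k = _mm_loadu_si128(reinterpret_cast<const __m128i*>(key));
    // Aligned load only when the address allows it; unaligned load otherwise.
    const __m128i s = is_aligned_16(in)
        ? _mm_load_si128(reinterpret_cast<const __m128i*>(in))
        : _mm_loadu_si128(reinterpret_cast<const __m128i*>(in));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_xor_si128(s, k));
#else
    // Portable fallback for targets without SSE2.
    for (int i = 0; i < 16; ++i) {
        out[i] = static_cast<std::uint8_t>(in[i] ^ key[i]);
    }
#endif
}
```

Allocating the input buffers with `alignas(16)` or an aligned allocator lets the aligned path be taken for every block.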

GPU Optimizations


Prefer cache over constant memory for the SBOX on AMD GCN and VLIW4 architectures.

Where possible, compare precomputed tables against computation on the fly: MixColumns is better computed than stored in constant memory.

PCIe bus limitations must be addressed (considerable difference between PCIe x4 and PCIe x16).

Overlapping execution with I/O could improve iGPU performance by 10-20% (figure above).

Proposed AES encryption system


Build a low-end consumer system for large data AES encryption

Proposed configuration is a sub 100 Euro x86 system which can encrypt large AES data chunks (>1MB) at rates between 0.5 - 1GB/sec AES-128/AES-256 ECB

AMD Sempron 3850 (quad [email protected], AES-NI, iGPU with OpenCL), 2GB DDR3 @1600, mini-ITX motherboard and case, USB pendrive boot.

Results AES encryption system


Ubuntu 14.04 x64 LTS operating system with g++ 4.8 and GPU Catalyst driver 13.35. Tests consisted of encrypting a 500MB file (generated from /dev/urandom) with AES-128.

CPU processing with AES-NI far exceeds the iGPU in throughput, so the extent to which the iGPU may speed up computation is very limited.

Conclusions


• AES-NI instructions provide a simple way to improve AES performance by a large margin (the compiler may influence performance greatly).

• Using the GPU for AES acceleration makes sense when the CPU does not support AES-NI; otherwise the performance gain compared to cost is debatable.

• Fast data encryption using AES is possible on consumer systems that have CPU AES-NI extensions or at least GPU accelerators.

• A sub 100 Euro x86 low-power system (<30 W) can reach a processing throughput of over 1GB/sec AES-128 ECB.

• Future focus: heterogeneous unified memory architectures (hUMA), where communication between the CPU and GPU will be further simplified.
