AES encryption on modern consumer architectures
DESCRIPTION
Specialized cryptographic processors target professional applications and offer both low latency and high throughput, at the expense of cost. At the consumer level, a modern SoC integrates several accelerators and vector extensions (e.g. SSE, AES-NI) and offers a high degree of programmability through multiple APIs (OpenMP, OpenCL, etc.). This work explains how a modern x86 system that encompasses several compute architectures (MIMD/SIMD) might perform well compared to a specialized cryptographic unit at a fraction of the cost. The analyzed algorithm is AES (AES-128, AES-256) and the mode of operation is ECB. The initial test system is built around an AMD A6 5400K SoC (CPU + integrated GPU), coupled with a discrete GPU (AMD R7 250). Benchmark results compare CPU OpenSSL execution (no AES-NI), CPU AES-NI acceleration, integrated GPU, discrete GPU and heterogeneous combinations of these processing units. Multiple test results are presented and inconsistencies are explained. Finally, based on the initial results, a system composed only of low-end, low-power consumer components is designed, built and tested.
TRANSCRIPT
Politehnica University of Bucharest
Automatic Control and Computers Faculty
Computer Science Department
AES encryption on modern consumer architectures
Author(s): Ing. Grigore [email protected]
Scientific Advisor: Sl. Dr. Ing. Laura Gheorghe
Presentation Session - July 2014
AES Encryption
02.07.2014 Masters Presentation Session – July 2014 2
• Symmetric block cipher that can encrypt and decrypt information (adopted by NIST in 2001 as a standard for encryption of electronic data)
• AES algorithm:
1. KeyExpansion: round keys are derived from the cipher key.
2. InitialRound: AddRoundKey.
3. Rounds:
I. SubBytes: substitution step; each byte is replaced according to the S-box (SBOX).
II. ShiftRows: transposition step; the rows of the state are shifted cyclically.
III. MixColumns: mixing operation on the columns of the state; addition and multiplication (+, *) are carried out in the Galois field GF(2^8).
IV. AddRoundKey: bitwise XOR of the state with the round key.
4. FinalRound: SubBytes, ShiftRows, AddRoundKey (no MixColumns).
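The MixColumns step is the only one that needs real Galois-field arithmetic, so a small sketch may help. The C++ below computes one state column on the fly using the fixed AES matrix from FIPS-197 (rows 2 3 1 1, 1 2 3 1, 1 1 2 3, 3 1 1 2). The helper names xtime, mul3 and mix_column are illustrative and are not taken from the presented project.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Multiply by 2 (i.e. by x) in GF(2^8), reducing modulo the AES
// polynomial x^8 + x^4 + x^3 + x + 1 (0x11B) when the top bit overflows.
constexpr std::uint8_t xtime(std::uint8_t b) {
    return static_cast<std::uint8_t>((b << 1) ^ ((b & 0x80) ? 0x1B : 0x00));
}

// Multiply by 3 = 2 + 1; addition in GF(2^8) is XOR.
constexpr std::uint8_t mul3(std::uint8_t b) {
    return static_cast<std::uint8_t>(xtime(b) ^ b);
}

using Column = std::array<std::uint8_t, 4>;

// MixColumns applied to a single 4-byte column of the AES state.
Column mix_column(const Column& c) {
    return Column{
        static_cast<std::uint8_t>(xtime(c[0]) ^ mul3(c[1]) ^ c[2] ^ c[3]),
        static_cast<std::uint8_t>(c[0] ^ xtime(c[1]) ^ mul3(c[2]) ^ c[3]),
        static_cast<std::uint8_t>(c[0] ^ c[1] ^ xtime(c[2]) ^ mul3(c[3])),
        static_cast<std::uint8_t>(mul3(c[0]) ^ c[1] ^ c[2] ^ xtime(c[3])),
    };
}
```

A table-based variant would replace xtime and mul3 with two 256-entry lookup tables, trading arithmetic for memory accesses.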
Cipher Modes
• Algorithm to repeatedly apply a block cipher (e.g. AES) to the input plaintext
– Most operation modes require an initialization vector
• Most used cipher modes: Cipher-block chaining (CBC), Counter (CTR)
• Other cipher modes: Electronic codebook (ECB), Output feedback (OFB)
• Why use ECB?
– Simple, fast, highly parallelizable, maximum throughput
– Provides a good estimate of how CTR would perform
[Figure: ECB mode. Each plaintext block is fed, together with the key, into the block cipher encryption and produces one ciphertext block.]
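ECB applies the block cipher to every 16-byte block independently, which is what makes it easy to parallelize. The sketch below shows that loop structure in C++. The function toy_encrypt_block is a hypothetical stand-in (a plain XOR with the key, not AES and not secure) used only to keep the example self-contained; a real implementation would call an AES block routine there.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kBlockSize = 16;  // AES block size in bytes
using Block = std::array<std::uint8_t, kBlockSize>;

// Hypothetical placeholder for a real block cipher: XOR with the key.
// This is NOT AES; it only stands in for "encrypt one block".
Block toy_encrypt_block(const Block& in, const Block& key) {
    Block out{};
    for (std::size_t i = 0; i < kBlockSize; ++i) {
        out[i] = static_cast<std::uint8_t>(in[i] ^ key[i]);
    }
    return out;
}

// ECB: each block is encrypted on its own with the same key.
std::vector<std::uint8_t> ecb_encrypt(const std::vector<std::uint8_t>& plain,
                                      const Block& key) {
    assert(plain.size() % kBlockSize == 0);  // padding is out of scope here
    std::vector<std::uint8_t> cipher(plain.size());
    const std::size_t blocks = plain.size() / kBlockSize;
    // No iteration depends on another one, so this loop could be divided
    // among CPU threads or GPU work-items without synchronization.
    for (std::size_t b = 0; b < blocks; ++b) {
        const auto offset = static_cast<std::ptrdiff_t>(b * kBlockSize);
        Block in{};
        std::copy_n(plain.begin() + offset, kBlockSize, in.begin());
        const Block out = toy_encrypt_block(in, key);
        std::copy_n(out.begin(), kBlockSize, cipher.begin() + offset);
    }
    return cipher;
}
```

Because the blocks are independent, identical plaintext blocks always yield identical ciphertext blocks under the same key.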
Motivation
• Explore how modern low-end commodity hardware handles AES encryption
• Literature strongly favors the GPU over the CPU
– Transfer to/from the GPU is ignored; only the actual execution is considered
– CPU code is not optimized/parallelized
– This creates a false impression that the GPU clearly outperforms the CPU
• How different scenarios impact performance on CPU and GPU
– Performance on CPU, iGPU, dGPU and combinations of these
– Influence of the compiler/SDK/optimization techniques
– Study of the memory hierarchy and its impact on performance
01.07.2014 Bachelor Presentation Session - July 2010 4
Software architecture (1)
• Written in C++; portable and modular, built with CMake (Linux/Windows). Makes use of OpenMP, OpenCL and AES-NI.
• Source code is divided into three categories: I/O, MAIN and AES.
• AES code may be further divided based on target processing units
– CPU => AES_HWNI
– GPU => AES_GPU
– CPU+GPU => AES_HYBRID
Software architecture (2)
When processing, (Ct + Gt) threads are spawned:
• Ct threads give work to the CPU cores (Ct = number of CPU cores)
• Gt threads give work to the GPU devices (Gt = number of GPU devices)
• On the left: the case Ct=2 (dual-core CPU), Gt=2 (iGPU, dGPU)
• Each CPU core or GPU device receives a percentage of the work according to its capabilities (the work split is set statically by the user)
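A static split of this kind can be sketched as follows. This is a minimal illustration, not the project's code: the names Slice and split_work are hypothetical, and the sketch assumes the user supplies one integer percentage per worker (CPU core or GPU device) summing to 100. Slices are kept on 16-byte boundaries so that no AES block is cut in two.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Byte range of the input handed to one worker (hypothetical type).
struct Slice {
    std::size_t offset;
    std::size_t length;
};

// Divide total_bytes among workers according to user-given percentages.
// Assumption: percent has one entry per worker and sums to 100.
std::vector<Slice> split_work(std::size_t total_bytes,
                              const std::vector<unsigned>& percent) {
    constexpr std::size_t kBlock = 16;  // AES block size in bytes
    assert(total_bytes % kBlock == 0);
    assert(!percent.empty());
    unsigned sum = 0;
    for (unsigned p : percent) sum += p;
    assert(sum == 100);

    const std::size_t total_blocks = total_bytes / kBlock;
    std::vector<Slice> slices;
    std::size_t offset = 0;
    for (std::size_t i = 0; i < percent.size(); ++i) {
        const bool last = (i + 1 == percent.size());
        // Shares are rounded down; the last worker absorbs the remainder.
        const std::size_t blocks = last ? total_blocks - offset / kBlock
                                        : total_blocks * percent[i] / 100;
        slices.push_back(Slice{offset, blocks * kBlock});
        offset += blocks * kBlock;
    }
    return slices;
}
```

For the Ct=2, Gt=2 case, a call such as split_work(size, {40, 40, 10, 10}) would return four contiguous slices; those percentages are placeholders, not measured values.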
AES Performance Testing
• Initial test system – AMD A6-5400K (dual core x86 + iGPU HD7540, AES-NI support), dGPU R7 250, 6GB DDR3@1600, Ubuntu 12.04 LTS x64
• GPGPU implementation - OpenCL
– SubBytes – precomputed SBOX stored in constant memory
– MixColumns – precomputed Galois Field matrices to avoid (+,*) operations
– ShiftRows and AddRoundKey - simple operations
• CPU implementation – OpenMP
– AES encryption using AES-NI instructions and OpenMP for parallelism on multicore CPUs
– Comparison with OpenSSL library
Results Single Unit Processing
OX – chunk size in MB
OY – throughput AES-128 ECB in MB/sec
• The CPU (5400K) with AES-NI has the best performance among the tested compute units.
• The iGPU with 3 CUs yields a modest ~150MB/sec, while the dGPU with 6 CUs yields a better ~400MB/sec, which correlates with its higher processing power and memory bandwidth compared to the iGPU.
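Throughput figures like these come from timing the encryption of a chunk and dividing its size by the elapsed time. The C++ sketch below shows one way to do that with std::chrono. The helper names are hypothetical, and it assumes 1 MB = 2^20 bytes, which the slides do not specify.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Throughput in MB/sec. Assumption: 1 MB = 2^20 bytes.
double throughput_mb_per_sec(std::size_t bytes, double seconds) {
    assert(seconds > 0.0);
    return (static_cast<double>(bytes) / (1024.0 * 1024.0)) / seconds;
}

// Wall-clock time, in seconds, taken by an arbitrary callable.
template <typename F>
double time_seconds(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

For GPU measurements, the timed callable should include the host-to-device and device-to-host transfers, since leaving them out is the distortion the Motivation slide describes.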
Results Multi Unit Processing
• Various AES hybrid processing configurations (CPU+iGPU, CPU+dGPU, …)
• Work-split determined by experiments and statically given at runtime
• Performance results are below expectations
Influence Multi Unit Processing
CPU A6 5400K (blue): performance degrades with each new processing unit added to the hybrid configuration
A trace on the CPU with AMD CodeXL points to a higher cache miss rate as the main cause
• OX – device configuration
• OY – throughput in MB/sec
iGPU HD7540 (yellow): only slightly impacted in multi-device processing; the most stable processing device
Its performance is low, but its variations are also low (~10%)
Compiler influence
• CPU assembly code generated from C++ code for AES-NI processing
• Differences come from how the compiler optimizes aligned memory accesses, i.e. movdqa vs movdqu
1. Performance with g++ 4.6 -O1 => 800MB/sec AES-128
2. Performance with g++ 4.7 -O1 => 1100MB/sec AES-128
3. Performance with g++ 4.8 -O1 => 1400MB/sec AES-128
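The movdqa instruction requires its memory operand to be 16-byte aligned, whereas movdqu accepts any address. The C++ sketch below shows how a program can guarantee and check that alignment for AES blocks. The names is_aligned16 and AlignedBlock are hypothetical, and this is not claimed to be how the presented code manages its buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True when p sits on a 16-byte boundary, i.e. when an aligned
// 128-bit load/store (movdqa) on that address would be valid.
inline bool is_aligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// One AES block whose storage must start on a 16-byte boundary.
struct alignas(16) AlignedBlock {
    std::uint8_t bytes[16];
};

static_assert(alignof(AlignedBlock) == 16, "block must be 16-byte aligned");
static_assert(sizeof(AlignedBlock) == 16, "block must not be padded");
```

An array of AlignedBlock keeps every element aligned, so each block can be accessed with aligned loads and stores.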
GPU Optimizations
Prefer cache over constant memory for the SBOX on AMD GCN and VLIW4 architectures
Where possible, compare precomputed tables against computation on the fly: MixColumns is better computed than stored in constant memory
PCIe bus limitations must be addressed (considerable difference between PCIe x4 and PCIe x16)
Overlapping execution with I/O could improve iGPU performance by 10-20% (figure above)
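Why overlapping helps can be seen from a simple two-stage pipeline model. The C++ sketch below rests on assumptions: it treats every chunk as having a fixed transfer time and a fixed compute time, and the type ChunkTiming and both functions are hypothetical. It shows the shape of the argument and does not reproduce the measured 10-20% gain.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical per-chunk timing for one GPU device, in seconds.
struct ChunkTiming {
    double transfer_s;  // host <-> device copies for one chunk
    double compute_s;   // kernel execution for one chunk
};

// No overlap: every chunk is copied, then processed, one after another.
double serial_time(const ChunkTiming& t, unsigned chunks) {
    return chunks * (t.transfer_s + t.compute_s);
}

// Double buffering: the transfer of chunk i+1 runs while chunk i is being
// processed, so after the first chunk the slower stage sets the pace.
double overlapped_time(const ChunkTiming& t, unsigned chunks) {
    assert(chunks > 0);
    return t.transfer_s + t.compute_s +
           (chunks - 1) * std::max(t.transfer_s, t.compute_s);
}
```

In this model the gain grows with the number of chunks and is largest when transfer and compute times are similar. When one stage dominates, as with a slow PCIe link, overlap recovers only the smaller of the two.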
Proposed AES encryption system
Build a low-end consumer system for large data AES encryption
The proposed configuration is a sub-100 Euro x86 system that can encrypt large data chunks (>1MB) at rates between 0.5 and 1GB/sec with AES-128/AES-256 ECB
AMD Sempron 3850 (quad [email protected], AES-NI, iGPU with OpenCL), 2GB DDR3 @1600, mini-ITX motherboard and case, booting from a USB pen drive.
Results AES encryption system
Ubuntu 14.04 x64 LTS operating system with g++ 4.8 and the Catalyst GPU driver 13.35. Tests consisted of encrypting a 500MB file (generated from /dev/urandom) with AES-128
CPU processing with AES-NI far exceeds the iGPU in throughput, so the extent to which the iGPU may speed up the computation is very limited.
Conclusions
• AES-NI instructions provide a simple way to improve AES performance by a large margin (the compiler may influence performance greatly)
• Using a GPU for AES acceleration makes sense when the CPU does not support AES-NI; otherwise the performance gain compared to the cost is debatable.
• Fast data encryption using AES is possible on consumer systems that have CPU AES-NI extensions or at least GPU accelerators
• Sub-100 Euro x86 low-power system (<30 W) with a processing throughput of over 1GB/sec AES-128 ECB
• Future focus on heterogeneous unified memory architectures (hUMA), where the communication between the CPU and GPU will be further simplified.