gpgpu introduction and network applications - haifux · gpgpu introduction ... almost all c code...
TRANSCRIPT
![Page 1: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/1.jpg)
PacketShaders, SSLShader
GPGPU introduction and network applications
![Page 2: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/2.jpg)
© 2014 Mellanox Technologies 2
Agenda
GPGPU Introduction
• Computer graphics background
• GPGPUs – past, present and future
PacketShader – A GPU-Accelerated Software Router
SSLShader – A GPU-Accelerated SSL encryption/decryption proxy
![Page 3: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/3.jpg)
© 2014 Mellanox Technologies 3
GPGPU Intro
![Page 4: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/4.jpg)
© 2014 Mellanox Technologies 4
GPU = Graphics Processing Unit
The heart of graphics cards
Mainly used for real-time 3D game rendering
• Massively-parallel processing capacity
(Ubisoft’s AVARTAR, from http://ubi.com)
![Page 5: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/5.jpg)
© 2014 Mellanox Technologies 5
GPU Fundamentals: The Graphics Pipeline
A simplified graphics pipeline
• Note that pipe widths vary
• Many caches, FIFOs, and so on not shown
GPU
CPU
Application Transform Rasterizer Shade Video
Memory
(Textures) Vertices
(3D)
Xformed,
Lit
Vertices
(2D)
Fragments
(pre-pixels)
Final
pixels
(Color, Depth)
Graphics State
Render-to-texture
![Page 6: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/6.jpg)
© 2014 Mellanox Technologies 6
GPU Pipeline: Transform
Vertex Processor (multiple operate in parallel)
• Transform from “world space” to “image space”
• Compute per-vertex lighting
![Page 7: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/7.jpg)
© 2014 Mellanox Technologies 7
GPU Pipeline: Rasterizer
Rasterizer
• Convert geometric rep. (vertex) to image rep. (fragment)
- Fragment = image fragment
Pixel + associated data: color, depth, stencil, etc.
• Interpolate per-vertex quantities across pixels
![Page 8: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/8.jpg)
© 2014 Mellanox Technologies 8
GPU Pipeline: Shade
Fragment Processors (multiple in parallel)
• Compute a color for each pixel
• Optionally read colors from textures (images)
![Page 9: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/9.jpg)
© 2014 Mellanox Technologies 9
GPU Fundamentals: The Modern Graphics Pipeline
Programmable vertex processor! Programmable pixel processor!
GPU
CPU
Application Vertex
Processor Rasterizer
Pixel
Processor Video
Memory
(Textures) Vertices
(3D) Xformed,
Lit
Vertices
(2D)
Fragments
(pre-pixels)
Final
pixels
(Color, Depth)
Graphics State
Render-to-texture
Vertex
Processor Fragment
Processor
![Page 10: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/10.jpg)
© 2014 Mellanox Technologies 10
nVidia G80 GPU Architecture Overview
•16 Multiprocessors Blocks
•Each MP Block Has:
•8 Streaming Processors (IEEE 754 spfp compliant)
•16K Shared Memory
•64K Constant Cache
•8K Texture Cache
•Each processor can access all of the memory at 86Gb/s, but with different latencies:
•Shared – 2 cycle latency
•Device – 300 cycle latency
![Page 11: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/11.jpg)
© 2014 Mellanox Technologies 11
Queueing
FIFO buffering (first-in, first-out) is provided between task stages
• Accommodates variation in execution time
• Provides elasticity to allow unified load balancing to work
FIFOs can also be unified
• Share a single large memory with multiple head-tail pairs
• Allocate as required
Vertex assembly
Primitive assembly
Vertex operations
Application
FIFO
FIFO
FIFO
![Page 12: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/12.jpg)
© 2014 Mellanox Technologies 12
SIMT - Memory Access Latency Hiding
GPU can effectively hide memory latency
GPU core
Cache
miss
Cache
miss
Switch to Thread 2
Switch to Thread 3
![Page 13: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/13.jpg)
© 2014 Mellanox Technologies 13
Implementation vs. architecture model
L2
FB
SP SP
L1
TF
Th
rea
d P
roc
es
so
r
Vtx Thread Issue
Setup / Rstr / ZCull
Prim Thread Issue Frag Thread Issue
Data Assembler
Application
SP SP
L1
TF
SP SP
L1
TF
SP SP
L1
TF
SP SP
L1
TF
SP SP
L1
TF
SP SP
L1
TF
SP SP
L1
TF
L2
FB
L2
FB
L2
FB
L2
FB
L2
FB
Vertex assembly
Primitive assembly
Rasterization
Fragment operations
Vertex operations
Application
Primitive operations
NVIDIA GeForce 8800 OpenGL Pipeline
Framebuffer
Source : NVIDIA
![Page 14: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/14.jpg)
© 2014 Mellanox Technologies 14
Correspondence (by color)
L2
FB
SP SP
L1
TF
Th
rea
d P
roc
es
so
r
Vtx Thread Issue
Setup / Rstr / ZCull
Prim Thread Issue Frag Thread Issue
Data Assembler
Application
SP SP
L1
TF
SP SP
L1
TF
SP SP
L1
TF
SP SP
L1
TF
SP SP
L1
TF
SP SP
L1
TF
SP SP
L1
TF
L2
FB
L2
FB
L2
FB
L2
FB
L2
FB
Vertex assembly
Primitive assembly
Rasterization (fragment assembly)
Fragment operations
Vertex operations
Application
Primitive operations
NVIDIA GeForce 8800 OpenGL Pipeline
Framebuffer
this was missing
Application-
programmable
parallel processor
Fixed-function assembly
processors
Fixed-function
framebuffer operations
![Page 15: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/15.jpg)
© 2014 Mellanox Technologies 17
The nVidia G80 GPU
128 streaming floating point processors @1.5Ghz
1.5 Gb Shared RAM with 86Gb/s bandwidth
500 Gflop on one chip (single precision)
![Page 16: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/16.jpg)
© 2014 Mellanox Technologies 18
Entertainment Industry has driven the economy of these chips?
• Males age 15-35 buy $10B in video games / year
Moore’s Law ++
Simplified design (stream processing)
• Huge parallelism – maps well to hardware
• Latency hiding using the parallelism
Single-chip designs.
Why are GPU’s so fast?
![Page 17: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/17.jpg)
© 2014 Mellanox Technologies 19
“Silicon Budget” in CPU and GPU
19
Xeon X5550:
4 cores
731M transistors
GTX480:
480 cores
3,200M transistors
ALU
![Page 18: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/18.jpg)
© 2014 Mellanox Technologies 21
Floorplans comparison
CPU - Core i7 GPU – nVidia Kepler
![Page 19: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/19.jpg)
© 2014 Mellanox Technologies 22
GPU - A Specialized Processor
Very Efficient For • Fast Parallel Floating Point Processing
• Single Instruction Multiple Data Operations
• High Computation per Memory Access
Not As Efficient For • Double Precision – situation is improving
• Logical Operations on Integer Data
• Branching-Intensive Operations
• Random Access, Memory-Intensive Operations
![Page 20: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/20.jpg)
© 2014 Mellanox Technologies 23
GPGPU!
Programable stream processor
• Huge number of ALUs
• Huge memory bandwidth
Programming was painful
• OpenGL-SL – Shader Language
• Requires deep understanding of computers graphics
• Huge applications speedup when done correctly
CUDA/OpenCL
• C-like code
• Massively multi-threaded
• Simple to port existing code (but not to get good performance)
![Page 21: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/21.jpg)
© 2014 Mellanox Technologies 24
CUDA – Single Instruction Multiple Threads
![Page 22: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/22.jpg)
© 2014 Mellanox Technologies 25
Achieving Performance in CUDA
Almost all C code will compile to be CUDA code
• But will run slower
• Single threaded operation - ~50x slower than CPU code
Must expose parallelism
Careful with memory accesses
• Thread scheduling helps hide memory access latency
• But even this runs out
Moving target
• Performance optimizations are strongly HW and SW platform dependent
Can make huge difference
• 100x and even more
![Page 23: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/23.jpg)
© 2014 Mellanox Technologies 26
A GPU-Accelerated Software Router
PacketShader
![Page 24: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/24.jpg)
© 2014 Mellanox Technologies 27
High Performance Software Router
Work by Sangjin Han, Keon Jang, KyoungSoo Park and Sue Moon
• Advanced Networking Lab, CS, KAIST
• Networked and Distributed Computing Systems Lab, EE, KAIST
40 Gbps packet forwarding in a single box
• IPv4, 64B packets
• Bigger packet sizes – bounded by PCI-e bandwidth
20 Gbps IPsec tunneling
• For 1024B packets
• 10 Gbps for 64B packets
![Page 25: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/25.jpg)
© 2014 Mellanox Technologies 28
Software Router
Despite its name, not limited to IP routing • You can implement whatever you want on it.
Driven by software • Flexible
• Friendly development environments
Based on commodity hardware • Cheap
• Fast evolution
![Page 26: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/26.jpg)
© 2014 Mellanox Technologies 29
Now 10 Gigabit NIC is a commodity
From $200 – $300 per port • Great opportunity for software routers
![Page 27: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/27.jpg)
© 2014 Mellanox Technologies 30
Achilles’ Heel of Software Routers
Low performance • Due to CPU bottleneck
Year Ref. H/W IPv4 Throughput
2008 Egi et al. Two quad-core CPUs 3.5 Gbps
2008 “Enhanced SR”
Bolla et al. Two quad-core CPUs 4.2 Gbps
2009 “RouteBricks”
Dobrescu et al.
Two quad-core CPUs
(2.8 GHz) 8.7 Gbps
Not capable of supporting even a single 10G port
![Page 28: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/28.jpg)
© 2014 Mellanox Technologies 31
Per-Packet CPU Cycles for 10G
1,200 600
1,200 1,600 Cycles needed
Packet I/O IPv4 lookup
= 1,800 cycles
= 2,800
Your budget
1,400 cycles
10G, min-sized packets, dual quad-core 2.66GHz CPUs
5,400 1,200 … = 6,600
Packet I/O IPv6 lookup
Packet I/O Encryption and hashing
IPv4
IPv6
IPsec
+
+
+
(in x86, cycle numbers are from RouteBricks [Dobrescu09] and PacketShader)
![Page 29: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/29.jpg)
© 2014 Mellanox Technologies 32
PacketShader Approach 1: I/O Optimization
Packet I/O
Packet I/O
Packet I/O
Packet I/O
1,200 reduced to 200 cycles per packet
Main ideas
• Huge packet buffer
• Batch processing
Allocating SKBs – 50% of CPU time
600
1,600
IPv4 lookup
= 1,800 cycles
= 2,800
5,400 … = 6,600
IPv6 lookup
Encryption and hashing
+
+
+
1,200
1,200
1,200
![Page 30: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/30.jpg)
© 2014 Mellanox Technologies 33
PacketShader Approach 2: GPU Offloading
Packet I/O
Packet I/O
Packet I/O
GPU Offloading for
• Memory-intensive or
• Compute-intensive operations
Main topic of this talk
600
1,600
IPv4 lookup
5,400 …
IPv6 lookup
Encryption and hashing
+
+
+
![Page 31: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/31.jpg)
© 2014 Mellanox Technologies 34
GPU FOR PACKET PROCESSING
![Page 32: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/32.jpg)
© 2014 Mellanox Technologies 35
Advantages of GPU for Packet Processing
1. Raw computation power
2. Memory access latency
3. Memory bandwidth
Comparison between • Intel X5550 CPU
• NVIDIA GTX480 GPU
![Page 33: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/33.jpg)
© 2014 Mellanox Technologies 36
(1/3) Raw Computation Power
Compute-intensive operations in software routers • Hashing, encryption, pattern matching, network coding, compression, etc.
• GPU can help!
CPU: 43×109 = 2.66 (GHz) ×
4 (# of cores) ×
4 (4-way superscalar)
GPU: 672×109 = 1.4 (GHz) ×
480 (# of cores)
Instructions/sec
<
![Page 34: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/34.jpg)
© 2014 Mellanox Technologies 37
(2/3) Memory Access Latency
Software router lots of cache misses • GPU can effectively hide memory latency
GPU core
Cache
miss
Cache
miss
Switch to Thread 2
Switch to Thread 3
![Page 35: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/35.jpg)
© 2014 Mellanox Technologies 38
(3/3) Memory Bandwidth
CPU’s memory bandwidth (theoretical): 32 GB/s
![Page 36: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/36.jpg)
© 2014 Mellanox Technologies 39
(3/3) Memory Bandwidth
CPU’s memory bandwidth (empirical) < 25 GB/s
4. TX: RAM NIC
3. TX: CPU RAM 2. RX:
RAM CPU
1. RX: NIC RAM
![Page 37: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/37.jpg)
© 2014 Mellanox Technologies 40
(3/3) Memory Bandwidth
Your budget for packet processing can be less 10 GB/s
![Page 38: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/38.jpg)
© 2014 Mellanox Technologies 41
(3/3) Memory Bandwidth
Your budget for packet processing can be less 10 GB/s
GPU’s memory bandwidth: 174GB/s
![Page 39: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/39.jpg)
© 2014 Mellanox Technologies 42
HOW TO USE GPU
![Page 40: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/40.jpg)
© 2014 Mellanox Technologies 43
Basic Idea
Offload core operations to GPU (e.g., forwarding table lookup)
![Page 41: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/41.jpg)
© 2014 Mellanox Technologies 44
Recap
GTX480: 480 cores
For GPU, more parallelism, more throughput
![Page 42: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/42.jpg)
© 2014 Mellanox Technologies 45
Parallelism in Packet Processing
The key insight • Stateless packet processing = parallelizable
RX queue
1. Batching
2. Parallel Processing in GPU
![Page 43: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/43.jpg)
© 2014 Mellanox Technologies 46
Batching Long Latency?
Fast link = enough # of packets in a small time window
10 GbE link • up to 1,000 packets only in 67μs
Much less time with 40 or 100 GbE
![Page 44: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/44.jpg)
© 2014 Mellanox Technologies 47
PACKETSHADER DESIGN
![Page 45: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/45.jpg)
© 2014 Mellanox Technologies 48
Basic Design
Three stages in a streamline
Pre-shader
Shader Post-shader
![Page 46: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/46.jpg)
© 2014 Mellanox Technologies 49
Packet’s Journey (1/3)
IPv4 forwarding example
Pre-shader
Shader Post-shader
• Checksum, TTL
• Format check
• … Collected dst. IP addrs
Some packets go to slow-path
![Page 47: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/47.jpg)
© 2014 Mellanox Technologies 50
Packet’s Journey (2/3)
IPv4 forwarding example
Pre-shader
Shader Post-shader
1. IP addresses
2. Forwarding table lookup
3. Next hops
![Page 48: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/48.jpg)
© 2014 Mellanox Technologies 51
Packet’s Journey (3/3)
IPv4 forwarding example
Pre-shader
Shader Post-shader
Update packets
and transmit
![Page 49: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/49.jpg)
© 2014 Mellanox Technologies 52
Interfacing with NICs
Pre-shader
Shader Post-shader
Device driver
Packet RX
Device driver
Packet TX
![Page 50: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/50.jpg)
© 2014 Mellanox Technologies 53
Device driver
Pre-shader
Shader
Post-shader
Device driver
Scaling with a Multi-Core CPU
Master core
Worker cores
![Page 51: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/51.jpg)
© 2014 Mellanox Technologies 54
Device driver
Pre-shader
Shader
Post-shader
Device driver
Shader
Scaling with Multiple Multi-Core CPUs
![Page 52: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/52.jpg)
© 2014 Mellanox Technologies 55
EVALUATION
![Page 53: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/53.jpg)
© 2014 Mellanox Technologies 56
Hardware Setup
CPU:
Quad-core, 2.66 GHz
GPU:
NIC: Total 80 Gbps
Dual-port 10 GbE
Total 8 CPU cores
480 cores, 1.4 GHz
Total 960 cores
![Page 54: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/54.jpg)
© 2014 Mellanox Technologies 57
Experimental Setup
…
8 × 10 GbE links Packet generator PacketShader
(Up to 80 Gbps)
Input traffic
Processed packets
![Page 55: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/55.jpg)
© 2014 Mellanox Technologies 58
Results (w/ 64B packets)
28.2
8
15.6
3
39.2 38.2
32
10.2
0
5
10
15
20
25
30
35
40
IPv4 IPv6 OpenFlow IPsec
Th
rou
gh
pu
t (G
bp
s)
CPU-only CPU+GPU
1.4x 4.8x 2.1x 3.5x GPU speedup
![Page 56: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/56.jpg)
© 2014 Mellanox Technologies 59
Example 1: IPv6 forwarding
Longest prefix matching on 128-bit IPv6 addresses
Algorithm: binary search on hash tables [Waldvogel97] • 7 hashings + 7 memory accesses
… … … …
Prefix length 1 64 128 96 80
![Page 57: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/57.jpg)
© 2014 Mellanox Technologies 60
Example 1: IPv6 forwarding
(Routing table was randomly generated with 200K entries)
0
5
10
15
20
25
30
35
40
45
64 128 256 512 1024 1514
Th
rou
gh
pu
t (G
bp
s)
Packet size (bytes)
CPU-only CPU+GPU
Bounded by motherboard IO capacity
![Page 58: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/58.jpg)
© 2014 Mellanox Technologies 61
Example 2: IPsec tunneling
IP header IP payload
IP header IP payload ESP
trailer
ESP (Encapsulating Security Payload) Tunnel mode • with AES-CTR (encryption) and SHA1 (authentication)
+
Original IP packet
IP header IP payload ESP
trailer ESP
header +
IP header IP payload ESP
trailer ESP
header ESP
Auth. New IP header
+ IPsec Packet
1. AES
2. SHA1
![Page 59: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/59.jpg)
© 2014 Mellanox Technologies 62
Example 2: IPsec tunneling
3.5x speedup
1
1.5
2
2.5
3
3.5
4
0
4
8
12
16
20
24
64 128 256 512 1024 1514
Sp
eed
up
Th
rou
gh
pu
t (G
bp
s)
Packet size (bytes)
CPU-only CPU+GPU Speedup
![Page 60: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/60.jpg)
© 2014 Mellanox Technologies 63
Year Ref. H/W IPv4
Throughput
2008 Egi et al. Two quad-core CPUs 3.5 Gbps
2008 “Enhanced SR”
Bolla et al.
Two quad-core CPUs 4.2 Gbps
2009 “RouteBricks”
Dobrescu et al.
Two quad-core CPUs
(2.8 GHz)
8.7 Gbps
2010 PacketShader
(CPU-only)
Two quad-core CPUs
(2.66 GHz)
28.2 Gbps
2010 PacketShader
(CPU+GPU)
Two quad-core CPUs
+ two GPUs
39.2 Gbps
Kernel
User
![Page 61: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/61.jpg)
© 2014 Mellanox Technologies 64
Conclusions
GPU • a great opportunity for fast packet processing
PacketShader • Optimized packet I/O + GPU acceleration
• scalable with
- # of multi-core CPUs, GPUs, and high-speed NICs
Current Prototype • Supports IPv4, IPv6, OpenFlow, and IPsec
• 40 Gbps performance on a single PC
![Page 62: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/62.jpg)
© 2014 Mellanox Technologies 65
Future Work
Control plane integration • Dynamic routing protocols with Quagga or Xorp
Multi-functional, modular programming environment • Integration with Click? [Kohler99]
Opportunistic offloading • CPU at low load
• GPU at high load
Stateful packet processing
![Page 63: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/63.jpg)
© 2014 Mellanox Technologies 66
A GPU-Accelerated Software Router
SSLShader
![Page 64: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/64.jpg)
© 2014 Mellanox Technologies 67
![Page 65: GPGPU introduction and network applications - Haifux · GPGPU Introduction ... Almost all C code will compile to be CUDA code ... Main topic of this talk 600 1,600+ IPv4 lookup](https://reader033.vdocuments.site/reader033/viewer/2022052918/5aeaa14d7f8b9a3b2e8cf127/html5/thumbnails/65.jpg)
Thank You