os architecture for omni-programmable systems
TRANSCRIPT
![Page 1: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/1.jpg)
Omnix: an accelerator-centric OS architecture for
omni-programmable systems
Rethinking the role of CPUs in modern computers
Mark Silberstein
Technion
May 2021ICL
![Page 3: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/3.jpg)
May 2021
Beyond CMOS: total disruption!
From «IEEE rebooting computing»
OS
![Page 4: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/4.jpg)
May 2021
Stagnation of the current processing technology
Performance
Today Time
![Page 5: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/5.jpg)
May 2021
Next generation is coming soon...
Performance
Today
Birth of new technology
New technology matured
Time
![Page 6: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/6.jpg)
May 2021
What to do until the next revolution?
Performance
Today
Birth of new technology
New technology matured
?????????
Time
![Page 7: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/7.jpg)
May 2021
What to do until the next revolution?
Performance
Today
Birth of new technology
New technology matured
Hardware specialization and
near-data accelerators
Time
![Page 8: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/8.jpg)
May 2021
Computer hardware: circa ~2021
Network I/O accelerator
Storage I/O accelerator
GPU parallelaccelerator
![Page 9: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/9.jpg)
May 2021
Central Processing Units (CPUs) are no longer Central
Network I/O accelerator
Storage I/O accelerator
GPU parallelaccelerator
Programmability
![Page 10: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/10.jpg)
May 2021
Omni-programmable systemX- Processing Units: xPUs
Network I/O accelerator
Storage I/O accelerator
GPU parallelaccelerator
Near-Data Processing
Near-Data Processing
AcceleratedProcessing
Programmability
![Page 11: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/11.jpg)
May 2021
But XPUs create...
Programmability
wall
![Page 12: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/12.jpg)
May 2021
Number of XPUs
Programmingcomplexity
CPU CPU+GPU
CPU+GPU+Smart NIC
Hard to maintainwhole-application efficiency
multi-coreCPU
Number of skillfuldevelopers
![Page 13: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/13.jpg)
May 2021
Number of XPUs
Programmingcomplexity
CPU CPU+GPU
CPU+GPU+Smart NIC
Hard to maintainwhole-application efficiency
multi-coreCPU
Number of skillfuldevelopers
Underutilized hardwarePoor application performance
Low efficiencyHigh costs
![Page 14: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/14.jpg)
May 2021
Agenda
● The root cause of the programmability wall
● OmniX: accelerator-centric OS design
– Principles
– Examples
● Future-proof: OmniX and disaggregated systems
![Page 15: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/15.jpg)
May 2021
Example: image server
1. put: parse → contrast-enhance → store 2. get: parse → resize → store → marshal
putget
Similar architecture used in Flickr
![Page 16: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/16.jpg)
May 2021
Example: image server
1. put: parse → contrast-enhance → store 2. get: parse → resize → store → marshal
NIC
SSD
GPU
CPU
![Page 17: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/17.jpg)
May 2021
Accelerating with XPUs
1. put: parse → contrast-enhance → store 2. get: parse → resize → store → marshal
NIC
SSD
GPU
CPU
![Page 18: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/18.jpg)
May 2021
Accelerating with XPUs
1. put: parse → contrast-enhance → store 2. get: parse → resize → store → marshal
NIC SSD GPU CPU
![Page 19: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/19.jpg)
May 2021
Closer look at get
parse req
resize imgstore img
marshal resp
SSDNIC
parse → resize → store → marshal
![Page 20: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/20.jpg)
May 2021
send(resp)
marshal resp
OS services run on CPUs
SSDNIC
get: parse → resize → store → marshal
CPU
recv(req)
read(file,img)parse req
resize img
write(file,img)
![Page 21: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/21.jpg)
May 2021
send(resp)
marshal resp
Result: offloading overheads dominate
SSDNIC
get: parse → resize → store → marshal
CPU
recv(req)
read(file,img)parse req
resize img
write(file,img)
![Page 22: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/22.jpg)
May 2021
send(resp)
marshal resp
Result: offloading overheads dominate
SSDNIC
get: parse → resize → store → marshal
CPU
recv(req)
read(file,img)parse req
resize img
write(file,img)
No sockets, isolation, transport layer …
No files, protection...
![Page 23: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/23.jpg)
May 2021
THETHE problem:OS architecture is CPU - centric
GPUStorage
XPU
CPU
NetworkXPU
![Page 24: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/24.jpg)
May 2021
THETHE problem is general:OS architecture is CPU - centric
GPU StorageXPU
CPU
NetworkXPU
InferenceXPU
….XPU
TrainingXPU
![Page 25: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/25.jpg)
May 2021
OmniX: accelerator-centric OS architecture
CPU
OS
S
ervices
Operating system
OS Services
OS
S
erv
ices
GPU StorageXPU
NetworkXPU
![Page 26: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/26.jpg)
May 2021
marshal resp
nic_send(resp)
Execution in OmniX
NIC
get: parse → resize → store → marshal
sdd_read(img)parse req
resize img
sdd_write (img)
nic_recv(req)
SSD
![Page 27: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/27.jpg)
May 2021
Accelerator-centric OS architecture
![Page 28: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/28.jpg)
May 2021
Types of OS abstractions for accelerators
Accelerator-centric: on-accelerator services● Networking: GPUnet, GPUrdma, Centaur, LYNX
● Files: GPUfs, ActivePointers
Accelerator-friendly: accelerator-aware host OS changes
● SPIN, GAIA – host-accelerator file sharing
Data-centric: CPU-less inline near-data processing● NICA – Server acceleration on FPGA-based SmartNICs
![Page 29: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/29.jpg)
May 2021
Types of OS abstractions for accelerators
Accelerator-centric: on-accelerator services● Networking: GPUnet, GPUrdma, Centaur, LYNX
● Files: GPUfs, ActivePointers
Accelerator-friendly: accelerator-aware host OS changes
● SPIN, GAIA – host-accelerator file sharing
Data-centric: CPU-less inline near-data processing● NICA – Server acceleration on FPGA-based SmartNICs
ASPLOS13,TOCS14,OSDI14,TOCS15,ISCA16,SYSTOR16, ROSS16, ATC17,HotOS17,ATC19, ATC19-2, TOCS19, PACT19, ASPLOS20
![Page 30: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/30.jpg)
May 2021
Storage
Network
On-GPU I/O services
CPU
OS
S
ervices
Operating system
OS Services
OS
S
ervi
ces
GPU
![Page 31: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/31.jpg)
May 2021
GPUfs: File system library for GPUs
open(“shared_file”)m
map
()
open(“shared_file”)w
rite(
)Host File System
GPUfs
System-wideshared namespace
CPUs GPU1 GPU2 GPU3
ASPLOS13: S., Keidar, Ford, Witchel
![Page 32: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/32.jpg)
May 2021
GPUnet: Network library for GPUs
socket(AF_INET,SOCK_STREAM);connect(“node0:2340”);
socket(AF_INET,SOCK_STREAM);connect(“node0:2340”)
GPUnet
GPU nativenative client
socket(AF_INET,SOCK_STREAM);listen(:2340)
GPU nativenative server
node0.technion.ac.il
GPUnet
CPU client
Network
OSDI14, S, Kim, Witchel
![Page 33: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/33.jpg)
May 2021
Accelerator in full control over its I/O
● I/O without «leaving» the GPU kernel● Data-driven access to huge DBs● Full-blown multi-tier GPU network servers● Multi-GPU Map/Reduce (no user CPU code)
● POSIX-like APIs with slightly modified semantics
● Transparency for the rest of the system
● Reduced code complexity
● Unleashed GPU performance potential
![Page 34: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/34.jpg)
May 2021
Example: face verification server
=?
memcached(unmodified)
GPU server(GPUnet)
CPU client(unmodified)
![Page 35: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/35.jpg)
May 2021
Face verification: Different implementations
1 GPU(no GPUnet)
1 GPUGPUnet
CPU6 cores
500
1000
1500
2000
2500
Late
ncy
(μ
sec)
34 5423
Throughput (KReq/sec)
1.9x throughput1/3x latency(500usec)
½ code size
![Page 36: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/36.jpg)
May 2021
Main design principles● Micro-kernel design
– RPC to File/Network services on the CPU
– User-land abstraction implementation (libOSes)
![Page 37: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/37.jpg)
May 2021
Main design principles● Micro-kernel design
– RPC to File/Network services on the CPU
– User-land abstraction implementation (libOSes)
● Single name space with the CPU OS
– Same socket space, same file name space
![Page 38: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/38.jpg)
May 2021
Main design principles● Micro-kernel design
– RPC to File/Network services on the CPU
– User-land abstraction implementation (libOSes)
● Single name space with the CPU OS
– Same socket space, same file name space
● Extensive SW layer on the GPU
– Handles massive API parallelism
– Implements consistency model (FS)
– Implements flow control (sockets)
![Page 39: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/39.jpg)
May 2021
Main design principles● Micro-kernel design
– RPC to File/Network services on the CPU
– User-land abstraction implementation (libOSes)
● Single name space with the CPU OS
– Same socket space, same file name space
● Extensive SW layer on the GPU
– Handles massive API parallelism
– Implements consistency model (FS)
– Implements flow control (sockets)
● Seamless data path optimization
– Eliminates CPU from data path
– Exploits data locality
![Page 40: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/40.jpg)
May 2021
Main design principles● Micro-kernel design
– RPC to File/Network services on the CPU
– User-land abstraction implementation (libOSes)
● Single name space with the CPU OS
– Same socket space, same file name space
● Extensive SW layer on the GPU
– Handles massive API parallelism
– Implements consistency model (FS)
– Implements flow control (sockets)
● Seamless data path optimization
– Eliminates CPU from data path
– Exploits data locality
![Page 41: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/41.jpg)
May 2021
Optimized I/O: no CPU in data path
● SSD/NIC may perform DMA directly into/from GPU memory without the CPU (P2P DMA)
● Why?
– Lower latency
– Less buffering/complexity for thpt
– No CPU involvementGPU
NIC Memory
PCIe bus
![Page 42: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/42.jpg)
May 2021
Optimized I/O: no CPU in data path
● SSD/NIC may perform DMA directly into/from GPU memory without the CPU (P2P DMA)
Challenge: the OS is on the CPU!I/O device sharing, multiplexing,
transport layer
Examples: ● GPU and CPU both need to access the network● TCP on GPU?
![Page 43: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/43.jpg)
May 2021
GPUnet: offloading transport layer to the NIC (via RDMA)
CPU GPU
NIC
Messagebuffers
Messagebuffers
Reliable RDMA
Streaminglogic
![Page 44: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/44.jpg)
May 2021
Summary so far...
● Accelerator-centric OS services
– Simplify code development
– Enable transparent performance optimization
● But what if we cannot add code to an accelerator?
– Accelerators are inefficient when running OS logic
– Some systems use close-source accelerated libs
![Page 45: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/45.jpg)
May 2021
Summary so far...
● Accelerator-centric OS services
– Simplify code development
– Enable transparent performance optimization
● But what if we cannot add code to an accelerator?
– Accelerators are inefficient when running OS logic
– Some systems use close-source accelerated libs
Make host OS accelerator-aware
![Page 46: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/46.jpg)
May 2021
Storage: OS integration of P2P DMAbetween SSD and GPUs
● Accelerator-aware modification to host FS API
● Allows using GPU memory buffers in read/write
– Transparently selects page cache or P2P DMA
– Maintains POSIX FS consistency
– Integrates with OS prefetcher
– Compatible with OS block layer (i.e., software RAID)
● Results:
– 5.2GB/s from SSDs to GPU
– 2-3x speedup in applications
SPIN: USENIX ATC17, partially adopted by NVIDIA
![Page 47: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/47.jpg)
May 2021
Storage: Extending CPU page cache into GPU memory
● Accelerator-aware modification to host page cache to use GPU page faults
● Enables mmap for GPU
● Enables CPU-GPU file sharing
● May cache/prefetch file data in GPU memory
● Insights:
– Slim GPU driver API for enabling host page cache integration
– Page cache release consistency model for high performance
– OS page cache and Linux kernel modifications for consistency support
GAIA: USENIX ATC19
![Page 48: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/48.jpg)
March 2021
Question: can we use strong consistency in the page cache?
But GPU page is 64KB!False sharing inevitable(also in real applications)
● Current practice in NVIDIA Unified Virtual Memory
● Single owner semantics: the page migrates to the requesting processor
![Page 49: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/49.jpg)
March 2021
Extreme false sharing is devastating 28x slowdown!
![Page 50: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/50.jpg)
May 2021
Lazy Release Consistency to rescue
● Transparent for legacy CPU processes
● Transparent for legacy GPU kernels
int fd=open(«shared_file»);void* ptr=mmap(…,ON_GPU,fd);macquire(ptr);gpu_kernel<<<>>>(ptr);mrelease(ptr);
GPU management code on the CPU
40% app improvementover strong consistency
![Page 51: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/51.jpg)
May 2021
Summary so far...
● Accelerator-centric OS services
– Simplify code development for accelerators
– Enable transparent performance optimization
● Accelerator-aware host OS services
– Optimize I/O for unmodified accelerators
– Coordinate sharing with the host OS
● But can we remove host CPUs altogether?
![Page 52: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/52.jpg)
May 2021
Operating system
CPU
CPU-less design: no CPU in control and data path
OS Services
OS
S
ervi
ces
GPU StorageNXU
NetworkXPU
OS
S
ervices
Lower latency (no CPU roundtrip)Better scalability (no CPU load)
Lower costs (wimpy CPUs)
![Page 53: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/53.jpg)
May 2021
CPU's role
![Page 54: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/54.jpg)
May 2021
CPU-less systems● GPUrdma [ROSS'16]
– RDMA VERBs from GPUs
– Achieves 2-3 usec latency and high throughput
● Centaur [PACT'19]
– Multi-GPU UNIX sockets and data flow runtime
– Multi-GPU scaling with zero CPU utilization
● NICA [ATC'19]
– Inline server acceleration on FPGA-based SmartNICs
● LYNX [ASPLOS'20]
– Accelerator-centric server architecture on SmartNICs
![Page 55: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/55.jpg)
March 2021
The case for CPU-less multi-GPUserver design
Image Similarity Search
S3
S2
S1
Brute force
search
Choosebest
ClusterKNN
Brute force
search
Brute force
search
Clientrequest
Clientresponse
![Page 56: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/56.jpg)
March 2021
Traditional design: CPU controls GPU invocation and
data movements
S3
S2
S1
Brute force
search
Choosebest
ClusterKNN
Brute force
search
Brute force
search
Clientrequest
Clientresponse
Offload to multiple GPUs
![Page 57: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/57.jpg)
March 2021
Traditional design: CPU controls GPU invocation and
data movements
![Page 58: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/58.jpg)
March 2021
Lets add more CPU cores
![Page 59: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/59.jpg)
March 2021
12 CPU cores needed!
![Page 60: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/60.jpg)
March 2021
12 CPUs are not enough to scale!
![Page 61: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/61.jpg)
March 2021
12 CPUs are not enough to scale!
The problem is inherent in the CPU-driven design[PACT19]
![Page 62: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/62.jpg)
March 2021
The case for CPU-less multi-GPU server design
PACT 19
S3
S2
S1
Brute force
search
Choosebest
ClusterKNN
Brute force
search
Brute force
search
GP
Une
t
Clientrequest
GP
Une
t
Clientresponse
GPU-side inter-GPUpipes
GPU-side requestscheduler
![Page 63: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/63.jpg)
March 2021
CPU-less Multi-GPU network server
StandardCPU-driven
design
CPU-lessdesign
CPU-less design: better scaling
![Page 64: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/64.jpg)
May 2021
CPU-less systems● GPUrdma [ROSS'16]
– RDMA VERBs from GPUs
– Achieves 2-3 usec latency and high throughput
● Centaur [PACT'19]
– Multi-GPU UNIX sockets and data flow runtime
– Multi-GPU scaling with zero CPU utilization
● NICA [ATC'19]
– Inline server acceleration on FPGA-based SmartNICs
● LYNX [ASPLOS'20]
– Accelerator-centric server architecture on SmartNICs
![Page 65: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/65.jpg)
May 2021
![Page 66: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/66.jpg)
May 2021
Thin on-accelerator abstractions for serving network requests
Transport processing offloaded to the SmartNIC
I/O API
![Page 67: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/67.jpg)
May 2021
![Page 68: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/68.jpg)
May 2021
1 SmartNIC can support up to 100 acceleratorsperforming neural net inference
Productized in Toga Networks [Huawei] as we speak
![Page 69: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/69.jpg)
May 2021
Summary so far..
● Accelerator-centric OS architecture is feasible today
● Advantageous for high performance, resource efficiency, code simplicity
● On-accelerator libraryOS approach with the CPU used for privileged operations
● But will it apply to future disaggregated systems?
![Page 70: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/70.jpg)
May 2021
Data Center Trends
● Hardware: Resource disaggregation
● High benefits in TCO and utilization
CPU
CPU CPU
CPU GPU
GPU GPU
GPU Accel
Accel Accel
Accel SSD
SSD SSD
SSD Mem
Mem Mem
Mem
NIC NIC NIC NIC NIC
...
Network
![Page 71: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/71.jpg)
May 2021
Data Center Trends
● Hardware: Resource disaggregation
● High benefits in TCO and utilization
CPU
CPU CPU
CPU GPU
GPU GPU
GPU Accel
Accel Accel
Accel SSD
SSD SSD
SSD Mem
Mem Mem
Mem
NIC NIC NIC NIC NIC
...
Network
But what about performance?
![Page 72: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/72.jpg)
May 2021
Not with the traditional server-centric design
CPU
GPU
SSDCPUOS
GPU
SSD
Typical Server
![Page 73: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/73.jpg)
May 2021
Not with the traditional server-centric design
CPU
GPU
SSDCPUOS
GPU
SSD
CPUOS
GPUSSD
Network
Typical Server
CPU Rack
Storage Rack GPU Rack
Fundamentally inefficient!Fundamentally inefficient!
![Page 74: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/74.jpg)
May 2021
What's wrong with the server-centric design ?
● A centralized OS is a control/data bottleneck
● I/O devices and accelerators are slaves
● Application control and data planes are centralized
![Page 75: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/75.jpg)
May 2021
What's wrong with the server-centric design ?
● A centralized OS is a control/data bottleneck
● I/O devices and accelerators are slaves
● Application control and data planes are centralized Needed Needed
disaggregation-native OS!disaggregation-native OS!
![Page 76: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/76.jpg)
May 2021
FractOS: decentralized disaggregation-native OS
CPUApp
GPUSSD Network
CPU Rack
Storage Rack GPU Rack
Sm
artNIC
FractO
SuK
erne lS
mar
tNIC
Fra
ctO
SuK
erne
l
SmartNIC
FractOSuKernel
Joint work with L Vilanova (Imperial), H. Haertig and his team (TU Dresden & Bakhausen)
![Page 77: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/77.jpg)
May 2021
FractOS: decentralized disaggregation-native OS
CPUApp
GPUSSD Network
CPU Rack
Storage Rack GPU Rack
Sm
artNIC
FractO
SuK
erne lS
mar
tNIC
Fra
ctO
SuK
erne
l
SmartNIC
FractOSuKernel
Joint work with L Vilanova (Imperial), H. Haertig and his team (TU Dresden & Bakhausen)
![Page 78: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/78.jpg)
May 2021
FractOS: decentralized disaggregation-native OS
CPUApp
GPUSSD Network
CPU Rack
Storage Rack GPU Rack
Sm
artNIC
FractO
SuK
erne lS
mar
tNIC
Fra
ctO
SuK
erne
l
SmartNIC
FractOSuKernel
Joint work with L Vilanova (Imperial), H. Haertig and his team (TU Dresden & Bakhausen)
![Page 79: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/79.jpg)
May 2021
FractOS vs. OmniX
● Avoid CPU in data/control path
● Devices as first-class citizens
● Direct interaction among devices
● Transparent data-path optimizations
● Decentralized capability management
● Decentralized task graph execution
● Unified software/hardware interfaces
OmniX
![Page 80: OS architecture for omni-programmable systems](https://reader031.vdocuments.site/reader031/viewer/2022012415/616f86c793ac1c53f1505470/html5/thumbnails/80.jpg)
May 2021
Summary
Same principles are useful for SGX [Eurosys17,USENIX ATC19]
and disaggregated data centers [FractOS]
● Future omni-programmable systems face programmability wall
● Accelerator-centric OS architecture simplifies programming and improves performance
● It exposes OS abstractions on accelerators
● Tightly integrates new abstractions with the host OS
Code available @ https://github.com/acsl-technion