A Survey on in-a-box parallel computing and its implications on system software research
TRANSCRIPT
![Page 1: A Survey on in-a-box parallel computing and its implications on system software research](https://reader033.vdocuments.site/reader033/viewer/2022052822/554be7a4b4c9055a368b4b00/html5/thumbnails/1.jpg)
A Survey on in-a-box parallel computing
and its implications on system software
research
Changwoo Min ([email protected])
![Page 2: A Survey on in-a-box parallel computing and its implications on system software research](https://reader033.vdocuments.site/reader033/viewer/2022052822/554be7a4b4c9055a368b4b00/html5/thumbnails/2.jpg)
Motivation
"Technology ratios matter." (Jim Gray)
"In the face of such '10X' forces, you can lose control of your destiny." (Andrew S. Grove)
What are the implications of multicore evolution for system software researchers?
![Page 3: A Survey on in-a-box parallel computing and its implications on system software research](https://reader033.vdocuments.site/reader033/viewer/2022052822/554be7a4b4c9055a368b4b00/html5/thumbnails/3.jpg)
Survey Scope and Strategy
[Figure: software and hardware stack in scope. Labels: Parallel Application, Parallel Middleware, Parallel Programming Model, System Library, Operating System, Virtual Machine Monitor, Multicore CPU, GPGPU]
![Page 4: A Survey on in-a-box parallel computing and its implications on system software research](https://reader033.vdocuments.site/reader033/viewer/2022052822/554be7a4b4c9055a368b4b00/html5/thumbnails/4.jpg)
Contents
Background
Parallel Programming Model and Productivity Tools
Optimization of System Software
Supporting GPU in a Virtualized Environment
Utilizing GPU in Middleware
Conclusion
![Page 5: A Survey on in-a-box parallel computing and its implications on system software research](https://reader033.vdocuments.site/reader033/viewer/2022052822/554be7a4b4c9055a368b4b00/html5/thumbnails/5.jpg)
Background
![Page 6: A Survey on in-a-box parallel computing and its implications on system software research](https://reader033.vdocuments.site/reader033/viewer/2022052822/554be7a4b4c9055a368b4b00/html5/thumbnails/6.jpg)
Why multicore?
Multicore CPU
Power wall
ILP (instruction-level parallelism) wall
Memory wall
Wire delay
GPGPU (General-Purpose computing on Graphics Processing Units)
A GPU traditionally handles computation only for computer graphics.
GPGPU adds the following to the rendering pipeline:
programmable stages
higher-precision arithmetic
This enables stream processing on non-graphics data.
![Page 7: A Survey on in-a-box parallel computing and its implications on system software research](https://reader033.vdocuments.site/reader033/viewer/2022052822/554be7a4b4c9055a368b4b00/html5/thumbnails/7.jpg)
Architecture of GPGPU core
![Page 8: A Survey on in-a-box parallel computing and its implications on system software research](https://reader033.vdocuments.site/reader033/viewer/2022052822/554be7a4b4c9055a368b4b00/html5/thumbnails/8.jpg)
Parallel Programming Model and
Productivity Tools
![Page 9: A Survey on in-a-box parallel computing and its implications on system software research](https://reader033.vdocuments.site/reader033/viewer/2022052822/554be7a4b4c9055a368b4b00/html5/thumbnails/9.jpg)
OpenMP
Parallel programming API for shared-memory
multiprocessing in C, C++, and Fortran
Uses a language extension: “#pragma omp”
Requires compiler support
![Page 10: A Survey on in-a-box parallel computing and its implications on system software research](https://reader033.vdocuments.site/reader033/viewer/2022052822/554be7a4b4c9055a368b4b00/html5/thumbnails/10.jpg)
OpenMP (cont’d)
Fork-and-join model
Bounded parallel loop, reduction
Task-creation-and-join model
Unbounded loop, recursive algorithm, producer/consumer
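The fork-and-join pattern with a reduction can be sketched outside OpenMP as well. The following is a minimal Python analogue, not OpenMP itself: the function name `parallel_sum_of_squares` and the chunking scheme are assumptions made for illustration. In standard CPython the threads do not speed up CPU-bound work, so the sketch shows only the structure (split a bounded loop, run the chunks, join, reduce).

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_sum_of_squares(n, workers=4):
    """Fork-and-join over a bounded loop, followed by a reduction."""
    # Split the iteration space [0, n) into contiguous chunks, one per worker.
    step = max(1, (n + workers - 1) // workers)
    chunks = [range(lo, min(lo + step, n)) for lo in range(0, n, step)]

    def partial(chunk):
        # Each task accumulates into its own private partial result.
        return sum(i * i for i in chunk)

    # Fork: one task per chunk. Join: consuming map() waits for all tasks.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial, chunks))

    # Reduction: combine the private partial results into one value.
    return sum(partials)
```

Keeping a private partial result per task and combining them only at the end is what avoids contention on a single shared accumulator.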
![Page 11: A Survey on in-a-box parallel computing and its implications on system software research](https://reader033.vdocuments.site/reader033/viewer/2022052822/554be7a4b4c9055a368b4b00/html5/thumbnails/11.jpg)
Intel TBB (Threading Building Blocks)
Similar to OpenMP
API for shared memory multiprocessing
Fork-and-join
parallel-for, parallel-reduce
Task-creation-and-join
Task scheduler
Different from OpenMP
C++ template library
Concurrent container classes
Hash map, vector, queue
Various synchronization mechanisms
mutex, spin lock, …
Atomic types, atomic operations
Scalable memory allocator
![Page 12: A Survey on in-a-box parallel computing and its implications on system software research](https://reader033.vdocuments.site/reader033/viewer/2022052822/554be7a4b4c9055a368b4b00/html5/thumbnails/12.jpg)
Nvidia CUDA (Compute Unified Device Architecture)
CUDA
Computing engine in Nvidia GPU
Programming framework for Nvidia GPU
Uses CUDA-extended C
declspecs, keywords, intrinsics, runtime API, function launch, …
[Figures: CUDA-extended C; compiling CUDA code; processing flow on CUDA]
![Page 13: A Survey on in-a-box parallel computing and its implications on system software research](https://reader033.vdocuments.site/reader033/viewer/2022052822/554be7a4b4c9055a368b4b00/html5/thumbnails/13.jpg)
Nvidia CUDA (cont’d)
Execution Model Kernel Memory Access
![Page 14: A Survey on in-a-box parallel computing and its implications on system software research](https://reader033.vdocuments.site/reader033/viewer/2022052822/554be7a4b4c9055a368b4b00/html5/thumbnails/14.jpg)
OpenCL (Open Computing Language)
CPU/GPU heterogeneous computing framework
Standardized by the Khronos Group
[Figures: OpenCL memory model; CUDA and OpenCL example]
![Page 15: A Survey on in-a-box parallel computing and its implications on system software research](https://reader033.vdocuments.site/reader033/viewer/2022052822/554be7a4b4c9055a368b4b00/html5/thumbnails/15.jpg)
Lithe: Enabling Efficient Composition of
Parallel Libraries
Who?
ParLab, UC Berkeley, HotPar’09
Problem
Composing parallel libraries causes performance anomalies
![Page 16: A Survey on in-a-box parallel computing and its implications on system software research](https://reader033.vdocuments.site/reader033/viewer/2022052822/554be7a4b4c9055a368b4b00/html5/thumbnails/16.jpg)
Lithe: Enabling Efficient Composition of
Parallel Libraries (cont’d)
Solution
Virtualized threads are bad for parallel libraries.
Harts
Unvirtualized hardware thread context
Sharing harts
Lithe
Cooperative hierarchical scheduler framework for harts
![Page 17: A Survey on in-a-box parallel computing and its implications on system software research](https://reader033.vdocuments.site/reader033/viewer/2022052822/554be7a4b4c9055a368b4b00/html5/thumbnails/17.jpg)
Concurrency bug detection: DataCollider
Who?
Microsoft Research, OSDI’10
Problem
Detecting data race bugs is difficult.
For a large system such as the Windows kernel, runtime overhead is critical.
Solution
Sample memory accesses using code breakpoints
When a code breakpoint is hit,
Set a data breakpoint on its operand
Sleep for a while
If the data changes in the meantime, it could be a data race.
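The sampling steps above can be modelled in a few lines. This is only a sketch of the value-comparison idea: real breakpoints are replaced by a plain read before and after a pause, and the names `sample_access`, `pause`, and `sample_rate` are hypothetical, not taken from the tool.

```python
import random


def sample_access(memory, addr, pause, sample_rate=1.0, rng=random):
    """Return True if memory[addr] changed while this access was paused."""
    # Sampling: check only a fraction of accesses to bound the overhead.
    if rng.random() >= sample_rate:
        return False
    before = memory[addr]  # stand-in for arming a data breakpoint on the operand
    pause()                # stand-in for "sleep for a while"
    after = memory[addr]
    # A different value means something else wrote the location meanwhile.
    return before != after
```

In the sketch, `pause` is a callback so that a conflicting write can be injected deterministically; in a real system the pause would simply give other threads time to run.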
![Page 18: A Survey on in-a-box parallel computing and its implications on system software research](https://reader033.vdocuments.site/reader033/viewer/2022052822/554be7a4b4c9055a368b4b00/html5/thumbnails/18.jpg)
Concurrency bug detection: SyncFinder
Who?
UC San Diego, OSDI ’10
Problem
How to find ad-hoc synchronization
Solution
Formalize patterns of ad-hoc synchronization
Detect such patterns using LLVM
![Page 19: A Survey on in-a-box parallel computing and its implications on system software research](https://reader033.vdocuments.site/reader033/viewer/2022052822/554be7a4b4c9055a368b4b00/html5/thumbnails/19.jpg)
Optimization of System Software
![Page 20: A Survey on in-a-box parallel computing and its implications on system software research](https://reader033.vdocuments.site/reader033/viewer/2022052822/554be7a4b4c9055a368b4b00/html5/thumbnails/20.jpg)
Memory Allocation: Hoard
Who?
UT, ASPLOS’00
Problem
The memory allocator is a performance bottleneck in multiprocessor
environments.
Lock contention, false sharing, blowup
Allocator-induced false sharing
![Page 21: A Survey on in-a-box parallel computing and its implications on system software research](https://reader033.vdocuments.site/reader033/viewer/2022052822/554be7a4b4c9055a368b4b00/html5/thumbnails/21.jpg)
Memory Allocation: Hoard (cont’d)
Solution
Per-processor heaps to reduce lock contention and false sharing
Global heap
Borrow memory from the global heap to grow a per-processor heap
Return memory to the global heap if there is too much free memory in a per-processor heap
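The borrow and return policy can be sketched with simple block counts. This is a toy model under stated assumptions: the `batch` and `max_free` parameters are invented for illustration, blocks are only counted rather than handed out, and the per-heap locks a real allocator would need are omitted.

```python
class GlobalHeap:
    """Shared pool of free blocks that per-processor heaps draw from."""

    def __init__(self, free_blocks):
        self.free_blocks = free_blocks


class PerProcessorHeap:
    """Heap owned by one processor; most operations touch only local state."""

    def __init__(self, global_heap, batch=4, max_free=8):
        self.global_heap = global_heap
        self.batch = batch        # blocks moved per borrow/return (assumed)
        self.max_free = max_free  # local free-block limit (assumed)
        self.free_blocks = 0

    def malloc(self):
        if self.free_blocks == 0:
            # Slow path: borrow a batch of blocks from the global heap.
            take = min(self.batch, self.global_heap.free_blocks)
            if take == 0:
                raise MemoryError("global heap exhausted")
            self.global_heap.free_blocks -= take
            self.free_blocks += take
        self.free_blocks -= 1     # fast path: local only

    def free(self):
        self.free_blocks += 1
        if self.free_blocks > self.max_free:
            # Too much local free memory: return a batch to the global heap.
            give = min(self.batch, self.free_blocks)
            self.free_blocks -= give
            self.global_heap.free_blocks += give
```

Returning surplus memory is what keeps one processor from hoarding free blocks that another processor could use, which bounds the blowup named in the problem statement.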
![Page 22: A Survey on in-a-box parallel computing and its implications on system software research](https://reader033.vdocuments.site/reader033/viewer/2022052822/554be7a4b4c9055a368b4b00/html5/thumbnails/22.jpg)
Memory Allocation: Xmalloc
Who?
UIUC, ICCIT’10
Problem
A scalable malloc is needed for CUDA, where hundreds of threads run
concurrently.
Solution
Memory allocation coalescing
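One way to picture allocation coalescing is to serve a group of requests with a single update of the shared heap state. The sketch below assumes the requests have already been gathered into a list (how threads are grouped is not stated on the slide), ignores freeing, and uses the hypothetical names `BumpAllocator` and `coalesced_alloc`.

```python
class BumpAllocator:
    """Toy heap: one shared pointer that every allocation must update."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.top = 0
        self.updates = 0  # how many times the shared pointer was touched

    def alloc(self, size):
        if self.top + size > self.capacity:
            raise MemoryError("heap exhausted")
        base = self.top
        self.top += size
        self.updates += 1
        return base


def coalesced_alloc(heap, sizes):
    """Serve a group of requests with a single allocation from the heap."""
    base = heap.alloc(sum(sizes))  # one contended update for the whole group
    offsets = []
    for size in sizes:
        offsets.append(base)       # carve the big block into per-request pieces
        base += size
    return offsets
```

With three requests the shared pointer is updated once instead of three times, which is the contention saving the technique aims for.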
![Page 23: A Survey on in-a-box parallel computing and its implications on system software research](https://reader033.vdocuments.site/reader033/viewer/2022052822/554be7a4b4c9055a368b4b00/html5/thumbnails/23.jpg)
System Call: FlexSC
Who?
University of Toronto, OSDI’10
Problem
System calls have a large negative performance impact.
Direct cost + indirect cost
Solution
Batched, asynchronous system calls
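The interface shape of batched, asynchronous calls can be sketched as a queue of requests that is executed later in one pass. This is an illustration of the programming model only: the class `SyscallBatch` and its methods are hypothetical, and in Python each underlying call still enters the kernel individually when it runs.

```python
import os


class SyscallBatch:
    """Queue system-call requests, then execute the pending ones together."""

    def __init__(self):
        self.entries = []  # each entry: [function, args, result]
        self.done = 0      # entries before this index have been executed

    def submit(self, func, *args):
        # Post a request without executing it; return a handle for the result.
        self.entries.append([func, args, None])
        return len(self.entries) - 1

    def flush(self):
        # One pass executes every pending request (stand-in for the kernel side).
        while self.done < len(self.entries):
            entry = self.entries[self.done]
            entry[2] = entry[0](*entry[1])
            self.done += 1

    def result(self, handle):
        return self.entries[handle][2]
```

A caller would submit several requests, keep doing useful work, and read the results after a flush, for example `h = batch.submit(os.getpid)` followed by `batch.flush()` and `batch.result(h)`.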
![Page 24: A Survey on in-a-box parallel computing and its implications on system software research](https://reader033.vdocuments.site/reader033/viewer/2022052822/554be7a4b4c9055a368b4b00/html5/thumbnails/24.jpg)
Revisiting OS Architecture
![Page 25: A Survey on in-a-box parallel computing and its implications on system software research](https://reader033.vdocuments.site/reader033/viewer/2022052822/554be7a4b4c9055a368b4b00/html5/thumbnails/25.jpg)
Multikernel
Who?
ETH Zurich, Microsoft Research Cambridge, SOSP’09
Problem
System diversity
It is no longer acceptable (or useful) to tune a general-purpose OS
design for a particular hardware model.
![Page 26: A Survey on in-a-box parallel computing and its implications on system software research](https://reader033.vdocuments.site/reader033/viewer/2022052822/554be7a4b4c9055a368b4b00/html5/thumbnails/26.jpg)
Multikernel (cont’d)
Problem (cont’d)
The interconnect matters
Core diversity
Programmable NICs
GPU
FPGA in CPU sockets
[Figures: 8-socket Nehalem; on-chip interconnects]
[Figure: SHM vs. Message Passing. Axis label: SHM: stalled cycle (no locking!)]
![Page 27: A Survey on in-a-box parallel computing and its implications on system software research](https://reader033.vdocuments.site/reader033/viewer/2022052822/554be7a4b4c9055a368b4b00/html5/thumbnails/27.jpg)
Multikernel (cont’d)
Solution
Today’s computer is already a distributed system. Why isn’t
your OS?
Barrelfish
Implementation of the multikernel approach
Message passing, shared nothing, replica maintenance
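The combination of message passing, shared nothing, and replica maintenance can be sketched as cores that each keep a private copy of some state and learn about changes only through messages. This is a toy model with hypothetical names (`Core`, `update`); it does not represent Barrelfish's actual protocols, and ordering or agreement between concurrent updates is not modelled.

```python
from collections import deque


class Core:
    """One core with private state and an inbox; nothing is shared."""

    def __init__(self, core_id):
        self.core_id = core_id
        self.replica = {}     # this core's private copy of the OS state
        self.inbox = deque()  # messages sent by other cores

    def handle_messages(self):
        # Apply pending updates to the local replica, in arrival order.
        while self.inbox:
            key, value = self.inbox.popleft()
            self.replica[key] = value


def update(cores, origin, key, value):
    """Apply an update locally, then send it to every other replica."""
    origin.replica[key] = value
    for core in cores:
        if core is not origin:
            core.inbox.append((key, value))
```

Until a core processes its inbox its replica is stale, which is the price paid for never touching another core's memory directly.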
![Page 28: A Survey on in-a-box parallel computing and its implications on system software research](https://reader033.vdocuments.site/reader033/viewer/2022052822/554be7a4b4c9055a368b4b00/html5/thumbnails/28.jpg)
An Analysis of Linux Scalability to Many
Cores
Who?
MIT CSAIL, OSDI’10
Problem
Is Linux scalable enough on many-core machines?
Solution
Test Linux scalability on a 48-core machine (AMD Opteron) with 7 applications
No kernel problems up to 48 cores
3002 LOC patches
Sloppy counter: replicated reference counter
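A replicated reference counter can be sketched as one central count plus per-core pools of spare references, so that most operations touch only core-local state. The version below is a simplified single-threaded model under stated assumptions: the `threshold` value and the `central_updates` statistic are invented for illustration, and the per-core data is a plain list rather than truly core-local memory.

```python
class SloppyCounter:
    """Central count plus per-core spare references (simplified model)."""

    def __init__(self, ncores, threshold=4):
        self.central = 0           # shared counter: the contended cache line
        self.spare = [0] * ncores  # per-core spare references
        self.threshold = threshold # assumed limit on spares held by one core
        self.central_updates = 0   # how often the shared counter was touched

    def acquire(self, core):
        if self.spare[core] > 0:
            self.spare[core] -= 1  # fast path: reuse a local spare reference
        else:
            self.central += 1      # slow path: take one from the central count
            self.central_updates += 1

    def release(self, core):
        self.spare[core] += 1      # keep the reference locally as a spare
        if self.spare[core] > self.threshold:
            # Too many spares: hand them back to the central count.
            self.central -= self.spare[core]
            self.spare[core] = 0
            self.central_updates += 1

    def in_use(self):
        # References actually held = central count minus all local spares.
        return self.central - sum(self.spare)
```

The counter is "sloppy" because the central value alone overstates the number of references in use; the exact value requires summing the per-core spares.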
![Page 29: A Survey on in-a-box parallel computing and its implications on system software research](https://reader033.vdocuments.site/reader033/viewer/2022052822/554be7a4b4c9055a368b4b00/html5/thumbnails/29.jpg)
Supporting GPU in a virtualized
environment
![Page 30: A Survey on in-a-box parallel computing and its implications on system software research](https://reader033.vdocuments.site/reader033/viewer/2022052822/554be7a4b4c9055a368b4b00/html5/thumbnails/30.jpg)
HyVM (Hybrid Virtual Machines)
Who?
Georgia Tech
Problem
Asymmetries in performance, memory and cache
Functional differences
Multiple accelerators
Vector processor
Floating point
Additional instructions for accelerations
Solution
heterogeneity- and asymmetry-aware hypervisors
![Page 31: A Survey on in-a-box parallel computing and its implications on system software research](https://reader033.vdocuments.site/reader033/viewer/2022052822/554be7a4b4c9055a368b4b00/html5/thumbnails/31.jpg)
HyVM (cont’d)
Solution (cont’d)
[Figures: HyVM architecture; GViM: GPU virtualization architecture; memory management in GViM; Harmony CPU/GPU co-scheduling]
![Page 32: A Survey on in-a-box parallel computing and its implications on system software research](https://reader033.vdocuments.site/reader033/viewer/2022052822/554be7a4b4c9055a368b4b00/html5/thumbnails/32.jpg)
VMGL (Virtualizing OpenGL)
Who?
University of Toronto, VEE’07
Problem
How to support OpenGL in a virtual machine environment
Solution
Forward OpenGL commands to the driver domain
![Page 33: A Survey on in-a-box parallel computing and its implications on system software research](https://reader033.vdocuments.site/reader033/viewer/2022052822/554be7a4b4c9055a368b4b00/html5/thumbnails/33.jpg)
Utilizing GPU in Middleware
![Page 34: A Survey on in-a-box parallel computing and its implications on system software research](https://reader033.vdocuments.site/reader033/viewer/2022052822/554be7a4b4c9055a368b4b00/html5/thumbnails/34.jpg)
StoreGPU
Who?
University of British Columbia, HPDC’10
Problem
In CAS (content-addressable storage),
how to minimize the hash calculation cost
Solution
Offload hash computation to the GPU
StoreGPU Architecture
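To see where the hashing cost arises, here is a minimal content-addressable store. It is a CPU-only sketch: SHA-1 from Python's `hashlib` is an assumed choice of hash function, and the class name `ContentAddressableStore` is hypothetical. The `put` path hashes every byte of every block, which is the work an offloading design would move to the GPU.

```python
import hashlib


class ContentAddressableStore:
    """Blocks are stored and looked up under the hash of their content."""

    def __init__(self):
        self.blocks = {}

    def put(self, data):
        # Hashing each block is the per-write cost that offloading targets.
        key = hashlib.sha1(data).hexdigest()
        # Identical content maps to the same key and is stored only once.
        self.blocks.setdefault(key, data)
        return key

    def get(self, key):
        return self.blocks[key]
```

Because the key is derived from the content, writing the same block twice costs a second hash computation but no additional storage.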
![Page 35: A Survey on in-a-box parallel computing and its implications on system software research](https://reader033.vdocuments.site/reader033/viewer/2022052822/554be7a4b4c9055a368b4b00/html5/thumbnails/35.jpg)
PacketShader
Who?
KAIST, SIGCOMM’10, NSDI’11
Problem
How to boost the performance of software routers
Solution
Offload stateless (parallelizable) packet processing to GPU
[Figures: PacketShader architecture; basic workflow of PacketShader]
![Page 36: A Survey on in-a-box parallel computing and its implications on system software research](https://reader033.vdocuments.site/reader033/viewer/2022052822/554be7a4b4c9055a368b4b00/html5/thumbnails/36.jpg)
Conclusion
![Page 37: A Survey on in-a-box parallel computing and its implications on system software research](https://reader033.vdocuments.site/reader033/viewer/2022052822/554be7a4b4c9055a368b4b00/html5/thumbnails/37.jpg)
S