supporting multi-processors bernard wong february 17, 2003
![Page 1: Supporting Multi-Processors Bernard Wong February 17, 2003](https://reader034.vdocuments.site/reader034/viewer/2022052702/56649f165503460f94c2c69f/html5/thumbnails/1.jpg)
Supporting Multi-Processors
Bernard Wong
February 17, 2003
![Page 2: Supporting Multi-Processors Bernard Wong February 17, 2003](https://reader034.vdocuments.site/reader034/viewer/2022052702/56649f165503460f94c2c69f/html5/thumbnails/2.jpg)
Uni-processor systems
- Began with uni-processor systems
- Simple to implement a uni-processor OS; allows for many assumptions
  - UMA, efficient locks (small impact on throughput), straightforward cache coherency
- Hard to make faster
![Page 3: Supporting Multi-Processors Bernard Wong February 17, 2003](https://reader034.vdocuments.site/reader034/viewer/2022052702/56649f165503460f94c2c69f/html5/thumbnails/3.jpg)
Small SMP systems
- Multiple symmetric processors
- Requires some modifications to the OS
- Still allows for UMA
- System/memory bus becomes a contended resource
- Locks have a larger impact on throughput
  - e.g. a lock held by one process can block another process (running on another processor) from making progress
  - Must introduce finer-grained locks to improve scalability
- System bus limits system size
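
The move from one coarse lock to finer-grained locks can be sketched as follows. This is a minimal illustration, not code from any system discussed in the talk; `StripedCounterTable`, `run_workers`, and the stripe count are hypothetical names chosen for the example. Each key hashes to one of several locks, so threads touching unrelated keys usually do not block each other.

```python
import threading


class StripedCounterTable:
    """Counters protected by one lock per stripe instead of one global lock."""

    def __init__(self, num_stripes=16):
        self._locks = [threading.Lock() for _ in range(num_stripes)]
        self._buckets = [dict() for _ in range(num_stripes)]

    def _stripe(self, key):
        # Keys that hash to different stripes never contend for the same lock.
        return hash(key) % len(self._locks)

    def increment(self, key, amount=1):
        i = self._stripe(key)
        with self._locks[i]:
            self._buckets[i][key] = self._buckets[i].get(key, 0) + amount

    def get(self, key):
        i = self._stripe(key)
        with self._locks[i]:
            return self._buckets[i].get(key, 0)


def run_workers(table, num_threads=4, increments=1000):
    """Each worker updates a private key and one key shared by all workers."""

    def work(tid):
        for _ in range(increments):
            table.increment(("worker", tid))
            table.increment("shared")

    threads = [threading.Thread(target=work, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

With a single stripe this degenerates to one global lock, which is the coarse-grained case the slide warns about.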
![Page 4: Supporting Multi-Processors Bernard Wong February 17, 2003](https://reader034.vdocuments.site/reader034/viewer/2022052702/56649f165503460f94c2c69f/html5/thumbnails/4.jpg)
Large Shared Memory Multi-processor
- Consists of many nodes, each of which may be a uni-processor or an SMP
- Access to memory is often NUMA; some systems do not even provide cache coherency
- Performance is very poor if used with an off-the-shelf SMP OS
- Requirements for good performance:
  - Locality of service to request
  - Independence between services
![Page 5: Supporting Multi-Processors Bernard Wong February 17, 2003](https://reader034.vdocuments.site/reader034/viewer/2022052702/56649f165503460f94c2c69f/html5/thumbnails/5.jpg)
DISCO
- Uses Virtual Machine Monitors to run multiple commodity OSes on a scalable multi-processor
- Virtual Machine Monitor
  - Additional layer between OS and hardware
  - Virtualizes processor, memory, I/O
  - OS unaware of virtualization (ideally)
  - Exports a simple general interface to the commodity OS
![Page 6: Supporting Multi-Processors Bernard Wong February 17, 2003](https://reader034.vdocuments.site/reader034/viewer/2022052702/56649f165503460f94c2c69f/html5/thumbnails/6.jpg)
DISCO Architecture
[Figure: guest operating systems (OS, SMP-OS, OS, OS, Thin OS) run on top of DISCO, which runs on the processing elements (PE) of a ccNUMA multiprocessor linked by an interconnect.]
![Page 7: Supporting Multi-Processors Bernard Wong February 17, 2003](https://reader034.vdocuments.site/reader034/viewer/2022052702/56649f165503460f94c2c69f/html5/thumbnails/7.jpg)
Implementation Details
- Virtual CPUs
  - Uses direct execution on the real CPU
    - Fast; most instructions run at native speed
  - Must detect and emulate operations that cannot be safely exported to the VM
    - Primarily privileged instructions: TLB modification, direct physical memory or I/O operations
  - Must also keep a data structure to save registers and other state
    - For when the virtual CPU is not scheduled on a real CPU
  - Virtual CPUs use affinity scheduling to maintain cache locality
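
A toy model of these three ideas (direct execution, trap-and-emulate for privileged operations, and saved state with affinity scheduling) might look like the sketch below. Everything here is an assumption made for illustration: the instruction names, the `VirtualCPU` and `Monitor` classes, and the two-register state are hypothetical and far simpler than a real monitor.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical instruction names that must not run directly on the hardware.
PRIVILEGED = {"tlb_write", "io_write"}


@dataclass
class VirtualCPU:
    # Saved register state; it lives here while the vCPU is descheduled.
    regs: dict = field(default_factory=lambda: {"pc": 0, "acc": 0})
    emulated: list = field(default_factory=list)  # log of trapped operations
    last_real_cpu: Optional[int] = None


class Monitor:
    def __init__(self, num_real_cpus):
        self.num_real_cpus = num_real_cpus
        self.busy = set()

    def pick_cpu(self, vcpu):
        """Affinity scheduling: prefer the real CPU this vCPU last ran on."""
        if vcpu.last_real_cpu is not None and vcpu.last_real_cpu not in self.busy:
            return vcpu.last_real_cpu
        for cpu in range(self.num_real_cpus):
            if cpu not in self.busy:
                return cpu
        raise RuntimeError("no free real CPU")

    def run(self, vcpu, program, max_steps):
        """Run up to max_steps instructions, then deschedule the vCPU."""
        cpu = self.pick_cpu(vcpu)
        self.busy.add(cpu)
        try:
            for _ in range(max_steps):
                pc = vcpu.regs["pc"]
                if pc >= len(program):
                    break
                op, arg = program[pc]
                if op in PRIVILEGED:
                    # Trap: the monitor emulates the operation for the VM.
                    vcpu.emulated.append((op, arg))
                elif op == "add":
                    # Stands in for an instruction executed directly.
                    vcpu.regs["acc"] += arg
                else:
                    raise ValueError(f"unknown instruction {op!r}")
                vcpu.regs["pc"] = pc + 1
        finally:
            self.busy.discard(cpu)
            vcpu.last_real_cpu = cpu
        return cpu
```

Because the register state is kept in the `VirtualCPU` object, a later call to `run` resumes where the previous time slice stopped.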
![Page 8: Supporting Multi-Processors Bernard Wong February 17, 2003](https://reader034.vdocuments.site/reader034/viewer/2022052702/56649f165503460f94c2c69f/html5/thumbnails/8.jpg)
Implementation Details
- Virtual Physical Memory
  - Adds a level of address translation
  - Maintains physical-to-machine address mappings
    - Because VMs use physical addresses that start from 0 and continue for the size of the VM’s memory
  - Performed by emulating TLB instructions
    - When the OS tries to insert an entry into the TLB, DISCO intercepts it and inserts the translated version
  - TLB flushed on virtual CPU switches
    - TLB miss handling is also more expensive due to the required trap
    - Second-level software TLB added to improve performance
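
The extra translation level and the second-level software TLB can be sketched as below. This is a simplified model under stated assumptions: `VirtualMemoryMonitor`, its dictionaries, and the `guest_misses` counter are hypothetical, page numbers stand in for addresses, and invalidation of stale second-level entries is omitted.

```python
class VirtualMemoryMonitor:
    """Toy model of physical-to-machine translation for a single VM."""

    def __init__(self, pmap):
        self.pmap = dict(pmap)  # guest physical page -> machine page
        self.hw_tlb = {}        # virtual page -> machine page, flushed on switch
        self.l2_tlb = {}        # second-level software TLB, survives flushes
        self.guest_misses = 0   # misses that had to be forwarded to the guest OS

    def guest_tlb_insert(self, vpage, ppage):
        """The guest's TLB write traps here; insert the translated entry instead."""
        mpage = self.pmap[ppage]
        self.hw_tlb[vpage] = mpage
        self.l2_tlb[vpage] = mpage
        return mpage

    def flush_on_vcpu_switch(self):
        self.hw_tlb.clear()

    def translate(self, vpage, guest_page_table):
        if vpage in self.hw_tlb:
            return self.hw_tlb[vpage]
        if vpage in self.l2_tlb:
            # Refill from the software TLB without involving the guest OS.
            self.hw_tlb[vpage] = self.l2_tlb[vpage]
            return self.hw_tlb[vpage]
        # Miss in both levels: the guest handles it and issues a TLB insert.
        self.guest_misses += 1
        return self.guest_tlb_insert(vpage, guest_page_table[vpage])
```

After a flush, a page that was translated earlier is refilled from `l2_tlb`, so `guest_misses` does not grow; that is the saving the second-level TLB is meant to provide.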
![Page 9: Supporting Multi-Processors Bernard Wong February 17, 2003](https://reader034.vdocuments.site/reader034/viewer/2022052702/56649f165503460f94c2c69f/html5/thumbnails/9.jpg)
Implementation Details
- Virtual I/O
  - Intercepts all device accesses from the VM through special OS device drivers
  - Virtualizes both disk and network I/O
  - DISCO allows persistent disks and non-persistent disks
    - Persistent disks cannot be shared
    - Non-persistent disks implemented via copy-on-write
![Page 10: Supporting Multi-Processors Bernard Wong February 17, 2003](https://reader034.vdocuments.site/reader034/viewer/2022052702/56649f165503460f94c2c69f/html5/thumbnails/10.jpg)
Why use a VMM?
- DISCO is aware of NUMA-ness
  - Hides NUMA-ness from the commodity OS
  - Requires less work than engineering a NUMA-aware OS
  - Performs better than a NUMA-unaware OS
  - Good middle ground
- How?
  - Dynamic page migration and page replication
    - Maintains locality between a virtual CPU’s cache misses and the memory pages on which those misses occur
![Page 11: Supporting Multi-Processors Bernard Wong February 17, 2003](https://reader034.vdocuments.site/reader034/viewer/2022052702/56649f165503460f94c2c69f/html5/thumbnails/11.jpg)
Memory Management
- Pages heavily accessed by only one node are migrated to that node
  - Change the physical-to-machine address mapping
  - Invalidate TLB entries that point to the old location
  - Copy the page to local machine memory
- Pages that are heavily read-shared are replicated to the nodes most heavily accessing them
  - Downgrade TLB entries pointing to the page to read-only
  - Copy the page
  - Update relevant TLB entries to the local machine version and remove read-only
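
The choice between migrating and replicating a page can be sketched as a small policy over per-node miss counts. The class name `PagePlacement`, the `hot_threshold` parameter, and the exact decision rule are assumptions for illustration; the slides give no thresholds, and the TLB updates listed above are reduced here to a `read_only` flag.

```python
from collections import Counter


class PagePlacement:
    """Toy placement policy for one page, driven by per-node cache-miss counts."""

    def __init__(self, home_node, hot_threshold=4):
        self.copies = {home_node}  # nodes that currently hold a copy
        self.read_only = False     # True while the page is replicated
        self.hot_threshold = hot_threshold
        self.misses = Counter()    # node -> misses since the last rebalance
        self.writes = 0

    def record_miss(self, node, is_write=False):
        self.misses[node] += 1
        if is_write:
            self.writes += 1

    def rebalance(self):
        hot = {n for n, c in self.misses.items() if c >= self.hot_threshold}
        action = "none"
        if len(hot) == 1 and self.copies != hot:
            # Heavily used by a single node: move the page there.
            self.copies = set(hot)
            self.read_only = False
            action = "migrate"
        elif len(hot) > 1 and self.writes == 0:
            # Heavily read-shared: give each hot node a read-only copy.
            self.copies |= hot
            self.read_only = True
            action = "replicate"
        self.misses.clear()
        self.writes = 0
        return action
```

A page that is write-shared by several nodes is left where it is, since neither migration nor replication helps in that case.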
![Page 12: Supporting Multi-Processors Bernard Wong February 17, 2003](https://reader034.vdocuments.site/reader034/viewer/2022052702/56649f165503460f94c2c69f/html5/thumbnails/12.jpg)
Page Replication
![Page 13: Supporting Multi-Processors Bernard Wong February 17, 2003](https://reader034.vdocuments.site/reader034/viewer/2022052702/56649f165503460f94c2c69f/html5/thumbnails/13.jpg)
Aren’t VMs memory inefficient?
- Traditionally, VMs tend to replicate the memory used for each system image
- Additionally, structures such as the disk cache are not shared
- DISCO uses the notion of a global buffer cache to reduce memory footprint
![Page 14: Supporting Multi-Processors Bernard Wong February 17, 2003](https://reader034.vdocuments.site/reader034/viewer/2022052702/56649f165503460f94c2c69f/html5/thumbnails/14.jpg)
Page sharing
- DISCO keeps a data structure that maps disk sectors to memory addresses
- If two VMs request the same disk sector, both are assigned the same read-only buffer page
- Modifications to pages are performed via copy-on-write
- Only works for non-persistent copy-on-write disks
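
A minimal sketch of this sector-to-page map with copy-on-write follows. The names (`GlobalBufferCache`, `pages_in_use`, the VM identifiers) are hypothetical, whole-page writes are assumed, and page freeing is left out to keep the example short.

```python
class GlobalBufferCache:
    """Toy global buffer cache: VMs reading the same sector share one page."""

    def __init__(self):
        self.sector_to_page = {}  # disk sector -> shared, read-only page id
        self.pages = {}           # page id -> contents
        self.private = set()      # page ids created by copy-on-write
        self.vm_maps = {}         # (vm, sector) -> page id seen by that VM
        self._next_page = 0

    def _alloc(self, data):
        page = self._next_page
        self._next_page += 1
        self.pages[page] = data
        return page

    def read(self, vm, sector, disk):
        key = (vm, sector)
        if key not in self.vm_maps:
            page = self.sector_to_page.get(sector)
            if page is None:
                # First request for this sector: read it from disk once.
                page = self._alloc(disk[sector])
                self.sector_to_page[sector] = page
            self.vm_maps[key] = page
        return self.pages[self.vm_maps[key]]

    def write(self, vm, sector, data):
        key = (vm, sector)
        page = self.vm_maps.get(key)
        if page is not None and page in self.private:
            self.pages[page] = data  # already private: update in place
        else:
            # Copy-on-write: the writer gets a private page; the shared
            # page and the underlying disk are left untouched.
            page = self._alloc(data)
            self.private.add(page)
            self.vm_maps[key] = page

    def pages_in_use(self):
        return len(self.pages)
```

Two VMs that read the same sector consume one page between them; a second page is allocated only when one of them writes.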
![Page 15: Supporting Multi-Processors Bernard Wong February 17, 2003](https://reader034.vdocuments.site/reader034/viewer/2022052702/56649f165503460f94c2c69f/html5/thumbnails/15.jpg)
Page sharing
![Page 16: Supporting Multi-Processors Bernard Wong February 17, 2003](https://reader034.vdocuments.site/reader034/viewer/2022052702/56649f165503460f94c2c69f/html5/thumbnails/16.jpg)
Page sharing
- Sharing is effective even via network packets when sharing data over NFS
![Page 17: Supporting Multi-Processors Bernard Wong February 17, 2003](https://reader034.vdocuments.site/reader034/viewer/2022052702/56649f165503460f94c2c69f/html5/thumbnails/17.jpg)
Virtualization overhead
![Page 18: Supporting Multi-Processors Bernard Wong February 17, 2003](https://reader034.vdocuments.site/reader034/viewer/2022052702/56649f165503460f94c2c69f/html5/thumbnails/18.jpg)
Data sharing
![Page 19: Supporting Multi-Processors Bernard Wong February 17, 2003](https://reader034.vdocuments.site/reader034/viewer/2022052702/56649f165503460f94c2c69f/html5/thumbnails/19.jpg)
Workload scalability
![Page 20: Supporting Multi-Processors Bernard Wong February 17, 2003](https://reader034.vdocuments.site/reader034/viewer/2022052702/56649f165503460f94c2c69f/html5/thumbnails/20.jpg)
Performance Benefits of Page Migration/Replication
![Page 21: Supporting Multi-Processors Bernard Wong February 17, 2003](https://reader034.vdocuments.site/reader034/viewer/2022052702/56649f165503460f94c2c69f/html5/thumbnails/21.jpg)
Tornado
- OS designed to take advantage of shared-memory multi-processors
- Object-oriented structure
  - Every virtual and physical resource is represented by an independent object
  - Ensures natural locality and independence
    - Resource lock and data structure stored on the same node as the resource
    - Resources managed independently and at a fine grain
    - No global source of contention
![Page 22: Supporting Multi-Processors Bernard Wong February 17, 2003](https://reader034.vdocuments.site/reader034/viewer/2022052702/56649f165503460f94c2c69f/html5/thumbnails/22.jpg)
OO structure
- Example: page fault
  - Separate File Cache Manager (FCM) object for different regions of memory
  - COR -> Cached Object Representative
  - All objects are specific to either the faulting process or the file(s) backing the process
  - Problem: hard to make global policies
![Page 23: Supporting Multi-Processors Bernard Wong February 17, 2003](https://reader034.vdocuments.site/reader034/viewer/2022052702/56649f165503460f94c2c69f/html5/thumbnails/23.jpg)
Clustered objects
- Even with OO, widely shared objects can be expensive due to contention
- Need replication, distribution, and partitioning to reduce contention
- Clustered objects are a systematic way to do this
- Gives the illusion of a single object, but is actually composed of multiple component (rep) objects
- Each component handles a subset of processors
- Must handle consistency across reps
![Page 24: Supporting Multi-Processors Bernard Wong February 17, 2003](https://reader034.vdocuments.site/reader034/viewer/2022052702/56649f165503460f94c2c69f/html5/thumbnails/24.jpg)
Clustered objects
![Page 25: Supporting Multi-Processors Bernard Wong February 17, 2003](https://reader034.vdocuments.site/reader034/viewer/2022052702/56649f165503460f94c2c69f/html5/thumbnails/25.jpg)
Clustered object implementation
- Per-processor translation table
  - Contains a pointer to the local rep of each clustered object
  - Created on demand via a combination of a global miss-handling object and a clustered-object-specific miss-handling object
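
The per-processor translation table with on-demand rep creation can be sketched as follows. This is an illustrative model only: `ClusteredCounter`, `TranslationTable`, and the `registry` dictionary are hypothetical, one rep per processor is assumed, and consistency across reps is reduced to summing their counts.

```python
class ClusteredCounter:
    """One logical counter made of per-processor reps, created on first use."""

    def __init__(self):
        self.reps = {}  # cpu -> rep

    def create_rep(self, cpu):
        # Plays the role of the object-specific miss handler.
        rep = {"count": 0}
        self.reps[cpu] = rep
        return rep

    def total(self):
        # Consistency across reps: combine the local counts on demand.
        return sum(rep["count"] for rep in self.reps.values())


class TranslationTable:
    """Per-processor table mapping clustered-object ids to the local rep."""

    def __init__(self, cpu):
        self.cpu = cpu
        self.local_reps = {}

    def lookup(self, obj_id, registry):
        rep = self.local_reps.get(obj_id)
        if rep is None:
            # Plays the role of the global miss handler: locate the object,
            # then let the object build a rep for this processor.
            rep = registry[obj_id].create_rep(self.cpu)
            self.local_reps[obj_id] = rep
        return rep
```

Callers on every processor use the same object id, which preserves the illusion of a single object, while each update touches only the local rep.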
![Page 26: Supporting Multi-Processors Bernard Wong February 17, 2003](https://reader034.vdocuments.site/reader034/viewer/2022052702/56649f165503460f94c2c69f/html5/thumbnails/26.jpg)
Memory Allocation
- Need an efficient, highly concurrent allocator that maximizes locality
- Use local pools of memory
  - However, for small block allocation, there is still the problem of false sharing
  - An additional small pool of strictly local memory is used
![Page 27: Supporting Multi-Processors Bernard Wong February 17, 2003](https://reader034.vdocuments.site/reader034/viewer/2022052702/56649f165503460f94c2c69f/html5/thumbnails/27.jpg)
Synchronization
- Use of objects, and additionally clustered objects, reduces the scope of each lock and limits lock contention to that of a rep
- Existence guarantees are hard
  - A thread must determine whether an object is currently being de-allocated by another thread
  - Often requires a lock hierarchy whose root is a global lock
- Tornado uses a semi-automatic garbage collector
  - A thread never needs to test for existence; no locking required
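
The slides do not say how the collector works. One simplified scheme that gives the same property, deferring reclamation until no operation that might still hold a reference is in flight, is sketched below. It is an assumption made for illustration rather than Tornado's actual algorithm; `DeferredReclaimer` is a hypothetical name, and the model is single-threaded for brevity.

```python
class DeferredReclaimer:
    """Toy deferred reclamation: retired objects are freed only once idle."""

    def __init__(self):
        self.in_flight = 0  # operations that may still hold references
        self.retired = []   # unlinked objects waiting to be freed
        self.freed = []     # objects that have actually been reclaimed

    def enter(self):
        self.in_flight += 1

    def leave(self):
        self.in_flight -= 1
        if self.in_flight == 0:
            # No operation can still reference the retired objects.
            self.freed.extend(self.retired)
            self.retired.clear()

    def retire(self, obj):
        if self.in_flight == 0:
            self.freed.append(obj)
        else:
            self.retired.append(obj)
```

An operation that brackets its work with `enter` and `leave` can use any object it finds without testing whether the object still exists. A known limitation of this simplification is that a steady stream of new operations can postpone reclamation indefinitely.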
![Page 28: Supporting Multi-Processors Bernard Wong February 17, 2003](https://reader034.vdocuments.site/reader034/viewer/2022052702/56649f165503460f94c2c69f/html5/thumbnails/28.jpg)
Protected Procedure Calls
- Since Tornado is a microkernel, IPC traffic is significant
- Need a fast IPC mechanism that maintains locality
- Protected Procedure Calls (PPC) maintain locality by:
  - Spawning a new server thread on the same processor as the client to service the client request
  - Keeping all client-specific data in data structures stored on the client
![Page 29: Supporting Multi-Processors Bernard Wong February 17, 2003](https://reader034.vdocuments.site/reader034/viewer/2022052702/56649f165503460f94c2c69f/html5/thumbnails/29.jpg)
Protected Procedure Calls
![Page 30: Supporting Multi-Processors Bernard Wong February 17, 2003](https://reader034.vdocuments.site/reader034/viewer/2022052702/56649f165503460f94c2c69f/html5/thumbnails/30.jpg)
Performance
- Comparison to other large shared-memory multi-processors
![Page 31: Supporting Multi-Processors Bernard Wong February 17, 2003](https://reader034.vdocuments.site/reader034/viewer/2022052702/56649f165503460f94c2c69f/html5/thumbnails/31.jpg)
Performance (n threads in 1 process)
![Page 32: Supporting Multi-Processors Bernard Wong February 17, 2003](https://reader034.vdocuments.site/reader034/viewer/2022052702/56649f165503460f94c2c69f/html5/thumbnails/32.jpg)
Performance (n threads in n processes)
![Page 33: Supporting Multi-Processors Bernard Wong February 17, 2003](https://reader034.vdocuments.site/reader034/viewer/2022052702/56649f165503460f94c2c69f/html5/thumbnails/33.jpg)
Conclusion
- Illustrated two different approaches to making efficient use of shared-memory multi-processors
- DISCO adds an extra layer between hardware and OS
  - Less engineering effort, more overhead
- Tornado redesigns an OS to take advantage of locality and independence
  - More engineering effort, less overhead, but local and independent algorithms may work poorly with real-world loads