Filesystem Optimizations for Static Content Multimedia Servers
Review of academic papers for TDC573
Jeff Absher
Papers Reviewed
Implementation and Evaluation of EXT3NS Multimedia File System. Baik-Song Ahn, Sung-Hoon Sohn, Chei-Yol Kim, Gyu-Il Cha, Yun-Cheol Baek, Sung-In Jung, Myung-Joon Kim. Presented at the 12th Annual ACM International Conference on Multimedia, October 10–16, 2004, New York, New York, USA.
The Tiger Shark File System. Roger L. Haskin, Frank B. Schmuck. IBM Journal of Research and Development, 1998.
What is the problem?
MM server with relatively static content: prerecorded movies, audio, lectures, commercials.
End user can start/stop/pause/seek
Many different simultaneous users.
Massive transfer of data from disks to NIC.
Can safely avoid focusing on real-time writes. It is a server; assume data is collected in non-real time.
Note: both systems could easily be extended within their scope to handle RT writes.
Should be backward compatible for legacy requests.
Scope Limitations and Design Goals
Limitations
Single server, or cluster with a single shared set of disks. No distributed nodes. There is research in the slightly different areas of distributed filesystems, P2P filesystems, and others.
Single local filesystem, which may consist of an array of multiple disks.
Design Goals in order of importance
Pump as much data as you can from the disks to the NICs. This can be done by avoiding kernel memcopys.
Seeking.
Quick recoverability for very large filesystems (journaling).
Legacy compatibility.
Problems with “old” filesystem block transfer to the NIC in the network-server context (simplified)
Multiple memcpy() calls across user/kernel mode.
Disk blocks optimized for small files.
Many context switches.
The kernel must be involved in both reading from the disk and writing to the NIC.
Bus contention with other I/O.
The block cache is in main memory, which may not be fast enough from a hardware perspective.
The data may be slow to “bubble down” the networking layers due to redirectors, routing, etc.
Checksum calculations and the like for networking happen in software.
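To make the memcpy problem concrete, here is a minimal sketch of the two data paths on a stock OS. Neither reviewed paper uses this code: `copy_path` and `sendfile_path` are hypothetical names, and `os.sendfile` is a generic kernel facility swapped in purely to illustrate skipping the user-space buffer. It is not the EXT3NS or Tiger Shark mechanism.

```python
import os
import socket
import tempfile


def copy_path(src_fd: int, sock: socket.socket, chunk: int = 64 * 1024) -> int:
    """Legacy-style path: every chunk crosses the user/kernel boundary twice."""
    sent = 0
    while True:
        buf = os.read(src_fd, chunk)   # kernel -> user copy
        if not buf:
            return sent
        sock.sendall(buf)              # user -> kernel copy
        sent += len(buf)


def sendfile_path(src_fd: int, sock: socket.socket, size: int) -> int:
    """Kernel-side path: the data never lands in a user-space buffer."""
    sent = 0
    while sent < size:
        n = os.sendfile(sock.fileno(), src_fd, sent, size - sent)
        if n == 0:                     # end of file reached early
            break
        sent += n
    return sent
```

Both functions deliver the same bytes. The difference is only in how many copies and mode switches the kernel performs per chunk, which is the cost the slides above are complaining about.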
The newer MM Filesystems: Classes of requests
Both of the studied filesystems assign some type of class to FS requests. The minimum needed is 2 classes.
Legacy Requests: read/write data for small files, not needed quickly at the NIC.
High-Performance Requests: read data for large, likely contiguous files that needs to be quickly dumped to the NIC.
This is similar to our newer networking paradigm “not all traffic is equal”.
Unaddressed question that I had: Can we take the concept of discardability and apply it to filesystems?
Classes of requests
EXT3NS: 2 classes, which are determined by an argument to the system call in a user buffer address.
Fastpath Class dumps data onto the NIC.
Legacy Class handles legacy filesystem requests.
The data itself does not have an inherent class; the client process explicitly defines its class.
Tiger Shark: Real-time Class and Legacy Class.
Real-time Class is fine-grained into subclasses, because Tiger Shark has resource reservation and admission control. If the controllers and disks cannot handle the predicted load, then the request is denied.
Legacy Class: also has a legacy interface for old filesystem access interfaces.
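The admission-control idea (deny a request if the predicted load does not fit) can be sketched in a few lines. This is only an illustration under assumptions: the class name, the single `capacity_mbps` number, and the per-stream rates are hypothetical, and the paper's actual resource model (controllers, disks, buffers) is richer than one bandwidth budget.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class AdmissionController:
    # Hypothetical sustainable bandwidth, e.g. measured by benchmarking.
    capacity_mbps: float
    reserved_mbps: float = 0.0
    streams: Dict[str, float] = field(default_factory=dict)

    def admit(self, stream_id: str, rate_mbps: float) -> bool:
        """Reserve bandwidth for a real-time stream, or deny the request."""
        if self.reserved_mbps + rate_mbps > self.capacity_mbps:
            return False               # predicted load exceeds capacity
        self.streams[stream_id] = rate_mbps
        self.reserved_mbps += rate_mbps
        return True

    def release(self, stream_id: str) -> None:
        """Give the reservation back when the stream ends."""
        self.reserved_mbps -= self.streams.pop(stream_id, 0.0)
```

With a 10 Mbps budget, a 6 Mbps stream is admitted, a following 5 Mbps stream is denied, and the second stream fits once the first is released.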
EXT3NS Caching, Quantization, and Scheduling optimizations
The hardware is designed to have a minimum block size of 256 KB up to a maximum of 2 MB; normal Linux block devices have a maximum block size of 4 KB.
Some compromises were made in the disk metadata block design for SDA (what is SDA? The substitute for RAID) so that it was compatible with EXT3FS.
The large block sizes lead to a large maximum addressable file size: 275 GB with first-level indirection and ~2^53 B with maximal indirection.
The memory contained on the NS card is actually a buffer in the current version of EXT3NS; the authors plan to add caching capability to it. (If you don't know the difference between a buffer and a cache, look it up!)
Asynchronous I/O is not currently supported, but plans are in place.
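The 275 GB first-level figure can be sanity-checked with simple arithmetic. The sketch below assumes a 1 MB block and 4-byte block pointers; that combination is my assumption (it is the one that reproduces the quoted number), not something stated on the slide, and the function name is hypothetical.

```python
def max_single_indirect_bytes(block_size: int, pointer_size: int = 4) -> int:
    """Bytes addressable through one indirect block.

    One indirect block holds block_size / pointer_size pointers,
    and each pointer addresses one data block of block_size bytes.
    """
    pointers_per_block = block_size // pointer_size
    return pointers_per_block * block_size


MIB = 1024 * 1024
size = max_single_indirect_bytes(1 * MIB)   # assumed 1 MB block, 4-byte pointers
print(size)          # 274877906944 bytes, i.e. 2**38
print(size / 1e9)    # roughly 275 (decimal GB)
```

For comparison, the same formula with a 4 KB block gives only 4 MB through one indirect block, which is why large blocks matter so much for movie-sized files.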
Tiger Shark Caching, Quantization, and Scheduling optimizations
"Deadline Scheduling" instead of elevator algorithms.
This is an interesting aspect of Tiger Shark: it benchmarks the hardware against a "synthetic workload" to determine the best order in which to schedule the disk requests and the best thresholds at which to start denying requests.
Block size is 256 KB (default); normal AIX uses a 4 KB size.
Tiger Shark will "chunk" contiguous block reads better than the default filesystems to work with its large block size.
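To see why deadline scheduling differs from an elevator, here is a minimal sketch that orders the same requests both ways. It is not Tiger Shark's scheduler: the `Request` fields, the millisecond deadlines, and both function names are hypothetical, and the elevator is a simple one-sweep version.

```python
import heapq
from typing import List, NamedTuple


class Request(NamedTuple):
    deadline_ms: int   # when the stream needs the data (hypothetical unit)
    block: int         # disk block number
    stream: str


def edf_order(requests: List[Request]) -> List[Request]:
    """Earliest deadline first: ignore head position, serve the most urgent."""
    heap = list(requests)
    heapq.heapify(heap)                # tuples compare by deadline_ms first
    return [heapq.heappop(heap) for _ in range(len(heap))]


def elevator_order(requests: List[Request], head: int = 0) -> List[Request]:
    """One elevator sweep: blocks ahead of the head going up, then the rest going down."""
    ahead = sorted((r for r in requests if r.block >= head), key=lambda r: r.block)
    behind = sorted((r for r in requests if r.block < head),
                    key=lambda r: r.block, reverse=True)
    return ahead + behind


reqs = [Request(30, 100, "a"), Request(10, 900, "b"), Request(20, 500, "c")]
print([r.stream for r in edf_order(reqs)])        # ['b', 'c', 'a']
print([r.stream for r in elevator_order(reqs)])   # ['a', 'c', 'b']
```

The elevator minimizes head movement but serves stream "b", the most urgent one, last. For a multimedia server a missed deadline is a visible glitch, so trading some seek efficiency for deadline order is the point.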
EXT3NS Streamlining of operations to get the data from the platter to the NIC
EXT3NS has special hardware that avoids memcopy and most kernel calculations.
This hardware takes the data output from the disk hardware buffer directly onto a custom PCI bus, then copies it through buffers and directly to the NIC on the SAME CARD.
The hardware avoids using the system's PCI bus when the fastpath option is used.
Joint network interface and disk controller.
Hardware speedups also calculate IP/TCP/UDP headers and checksums to speed up processing.
Tiger Shark Streamlining of operations to get the data from the platter to the NIC
A running daemon that pre-allocates OS resources such as buffer space, disk bandwidth, and controller time.
Not a hardware-dependent solution.
Even though it does not have shared-memory hardware, Tiger Shark copies data from the disks into a shared memory area. Essentially this is a very large extension of the kernel's disk block cache.
The VFS layer for Tiger Shark intercepts repeated calls and uses the shared memory area, thereby saving kernel memcopys on subsequent requests.
Platter Layout and Scaling optimizations for Contiguous Streaming
EXT3NS: Hardware uses a RAID-3-like cross-platter optimization called "SDA", which distributes the blocks across multiple disk platters (simple striping, not interleaving). Maximum of 4 platters as implemented.
Tiger Shark: Striping across a maximum of 65000 platters. Striping method unspecified; it looks flexible and can be extended to include redundancy if desired. Keeps all members of a block group contiguous (per journaling FS concepts) and attempts to keep the block groups contiguous.
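As a rough picture of what simple striping means, the sketch below maps a logical block number onto a (disk, block-on-disk) pair in round-robin order. This is an assumption-laden illustration: block-level round-robin is my simplification, `locate` is a hypothetical name, and neither SDA's real layout nor Tiger Shark's unspecified striping method is claimed to work exactly this way.

```python
from typing import Tuple


def locate(logical_block: int, num_disks: int) -> Tuple[int, int]:
    """Round-robin striping: consecutive logical blocks land on consecutive disks.

    Returns (disk index, block index on that disk).
    """
    return logical_block % num_disks, logical_block // num_disks


# With 4 disks (the EXT3NS maximum noted above), a sequential read of
# blocks 0..5 touches every disk before reusing one.
print([locate(b, 4) for b in range(6)])
# [(0, 0), (1, 0), (2, 0), (3, 0), (0, 1), (1, 1)]
```

Because a contiguous stream spreads across all disks, the aggregate read bandwidth of the array is available to a single large file.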
Seeking Optimizations
EXT3NS: None noted beyond large block size.
Tiger Shark: Byte-range locking. Allows multiple clients to access different areas of a file with real-time guarantees if they don't step on each other.
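The "don't step on each other" rule comes down to an interval-overlap test. The sketch below is a minimal in-memory version with exclusive locks only; the class and method names are hypothetical, and Tiger Shark's real lock manager (lock modes, real-time guarantees) is not described in enough detail on these slides to reproduce.

```python
from typing import List, Tuple


class ByteRangeLocks:
    """Exclusive byte-range locks on one file; ranges are [start, end)."""

    def __init__(self) -> None:
        self._held: List[Tuple[int, int, str]] = []   # (start, end, owner)

    def try_lock(self, owner: str, start: int, length: int) -> bool:
        end = start + length
        for s, e, o in self._held:
            # Two half-open ranges overlap when each starts before the other ends.
            if o != owner and start < e and s < end:
                return False
        self._held.append((start, end, owner))
        return True

    def unlock(self, owner: str) -> None:
        """Drop every range held by this owner."""
        self._held = [h for h in self._held if h[2] != owner]
```

Two clients locking bytes 0..99 and 100..149 both succeed, while a third asking for 90..109 is refused because it overlaps both.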
Legacy nods
EXT3NS: If the legacy option (slow class) is used, the disk contents are copied into the system's page cache through the system's bus as if EXT3FS were being used. The paper does not go into it, but my guess is that this is a rather wasteful operation given the large block size of SDA. Other legacy tools such as fsck and mkfs are also available for EXT3NS.
Tiger Shark: Compatible with JFS. VFS/JFS calls go through the kernel interface with some block translation.
EXT3NS and Tiger Shark: Fully compatible with VFS (Virtual Filesystem) for their respective platforms.
Current Research and Future Directions and Jeff’s questions
Tiger Shark gives us filesystem QoS. But can we do better by integrating VBR/ABR into the system?
What about peeling in a VBR system to save resources?
Replication and redundancy are always an issue, but are not addressed in this scope.
If it is a software-based system such as Tiger Shark, where in the OS should we put these optimizations? (Kernel, tack-on daemon, middleware)
Legacy disk accesses have a huge cost in both of these systems; how can we minimize it?
EXT3NS Final Thoughts
Valid, but not a novel approach. Custom hardware does not represent an incremental step forward in universal knowledge.
EXT3NS is built for exactly one thing: network streaming of data.
An engineering change was made to the hardware design of a computer system, and some optimizations were made to the software to take advantage of it. The authors are not advocating a radical design change to all computers.
Violates a few “design principles”, therefore it must be relegated to a customized, specific-purpose system.
Empirical data confirm that the EXT3NS design is able to squeeze more concurrent sessions out of a multimedia server than would have been available previously.
There is still a saturation point where the memory of the NS card or the capabilities of the card's internal bus break down and the system cannot scale beyond that point.
Better than Best Effort.
Tiger Shark Final Thoughts
Valid, somewhat novel approach.
It adds QoS guarantees to current disk interface architectures.
Built to be extensible to more than just MM disk access, but definitely optimized for it.
Empirical data confirm that the Tiger Shark design is able to serve more concurrent sessions out of a multimedia server than would have been available previously, BUT there is still a kernel bottleneck for the initial block load.
Better suited to multiple concurrent access than EXT3NS.
Currently appears scalable beyond any reasonable (modern) demands. As usual in computer science, though, future demands may find a point at which the system's scaling breaks down.
Guaranteed QoS.
Many other later QoS filesystems extend this concept and tweak some aspects of it, such as scheduling.
The fundamental academic question at the end of the day
The 2 major competing solution paradigms:
Fundamentally alter the hardware datapath in a computer and present a customized hardware solution with relevant changes in the OS. Scaling = not addressed.
Retrofit current operating systems with some tacked-on, task-specific optimizations and tweaking of settings. The system and the hardware are kept generic. Scaling = buy more hardware.
Or can we find an alternate third paradigm?