
Filesystem Optimizations for Static Content Multimedia Servers: Review of academic papers for TDC573

Jeff Absher

Papers Reviewed

Implementation and Evaluation of EXT3NS Multimedia File System. Baik-Song Ahn, Sung-Hoon Sohn, Chei-Yol Kim, Gyu-Il Cha, Yun-Cheol Baek, Sung-In Jung, Myung-Joon Kim. Presented at the 12th Annual ACM International Conference on Multimedia, October 10–16, 2004, New York, New York, USA.

The Tiger Shark File System. Roger L. Haskin, Frank B. Schmuck. IBM Journal of Research and Development, 1998.

What is the problem?

MM server with relatively static content: prerecorded movies, audio, lectures, commercials.

End user can start/stop/pause/seek

Many different simultaneous users.

Massive transfer of data from disks to NIC.

Can safely avoid focusing on real-time writes. It is a server, so assume data is collected non-real-time.

Note: both systems could easily be extended within their scope to handle RT writes.

Should be backward compatible for legacy requests.

Scope Limitations and Design Goals

Limitations

Single server, or cluster with a single shared set of disks. No distributed nodes. There is research in the slightly different areas of distributed filesystems, P2P filesystems, and others.

Single local filesystem, which may consist of an array of multiple disks.

Design Goals, in order of importance

Pump as much data as you can from the disks to the NICs. This can be done by avoiding kernel memcopys.

Seeking.

Quick recoverability for very large filesystems: journaling.

Legacy compatibility.

Problems with “old” filesystem block transfer to NIC in the network-server context (simplified)

Multiple memcpy() calls across user/kernel mode.

Disk blocks optimized for small files.

Many context switches.

The kernel must be involved in both reading from the disk and writing to the NIC.

Bus contention with other I/O.

Block cache is in main memory, which may not be fast enough from a hardware perspective.

The data may be slow to “bubble down” the networking layer due to redirectors, routing, etc.

Checksum calculations and such for networking happen in software.
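To make the memcpy() problem concrete, here is a minimal sketch contrasting the classic read-then-send loop with the operating system's sendfile path. This is not how either reviewed filesystem works (EXT3NS uses custom hardware, Tiger Shark uses a shared memory area); it is only a stock illustration of how many times the same bytes get copied. The function names, the chunk size, and the socketpair test harness are my own assumptions.

```python
import os
import socket
import tempfile

CHUNK = 1024  # arbitrary read size for the copy-based path


def serve_with_copies(path, sock):
    """Classic path: read() copies data into a user buffer, send() copies it back to the kernel."""
    sent = 0
    with open(path, "rb") as f:
        while True:
            buf = f.read(CHUNK)
            if not buf:
                break
            sock.sendall(buf)
            sent += len(buf)
    return sent


def serve_zero_copy(path, sock):
    """sendfile path: the kernel moves file data to the socket without a user-space buffer.

    socket.sendfile() uses os.sendfile() where the platform supports it and
    silently falls back to plain send() otherwise.
    """
    with open(path, "rb") as f:
        return sock.sendfile(f)


def demo():
    """Push the same small file through both paths and check what arrives."""
    payload = os.urandom(4096)
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
        path = tmp.name
    results = {}
    try:
        for name, serve in (("copies", serve_with_copies), ("sendfile", serve_zero_copy)):
            tx, rx = socket.socketpair()
            with tx, rx:
                sent = serve(path, tx)
                tx.shutdown(socket.SHUT_WR)  # signal end of stream to the reader
                received = b""
                while True:
                    chunk = rx.recv(65536)
                    if not chunk:
                        break
                    received += chunk
            results[name] = (sent, received == payload)
    finally:
        os.unlink(path)
    return results


if __name__ == "__main__":
    print(demo())
```

Both paths deliver identical bytes; the difference is only in how many times the kernel and the process touch them, which is exactly the cost the two papers try to remove.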

The newer MM Filesystems: Classes of requests

Both of the studied filesystems assign some type of class to FS requests. The minimum needed is 2 classes.

Legacy requests: read/write data for small files, not needed quickly at the NIC.

High-performance requests: read data for large, likely-contiguous files that needs to be quickly dumped to the NIC.

This is similar to our newer networking paradigm “not all traffic is equal”

Unaddressed question that I had: Can we take the concept of discardability and apply it to filesystems?

Classes of requests

EXT3NS: 2 classes, which are determined by the user buffer address passed as an argument to the system call.

Fastpath class dumps data onto the NIC.

Legacy class handles legacy filesystem requests.

The data itself does not have an inherent class and the client process explicitly defines its class.

Tiger Shark: Real-time class

The real-time class is fine-grained into subclasses, because Tiger Shark has resource reservation and admission control.

If the controllers and disks cannot handle the predicted load, then the request is denied.

Legacy class: also has a legacy interface for old filesystem access interfaces.
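The admission-control idea (deny a request when the predicted load exceeds what the disks and controllers can handle) can be sketched in a few lines. This is a toy model under my own assumptions, not Tiger Shark's actual algorithm: load is reduced to a single bandwidth number, and the class name, field names, and units are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AdmissionController:
    """Toy resource reservation: admit a stream only if its rate still fits."""

    capacity_mbps: float                 # hypothetical sustainable disk/controller throughput
    reserved_mbps: float = 0.0           # bandwidth already promised to admitted streams
    streams: dict = field(default_factory=dict)

    def admit(self, stream_id, rate_mbps):
        """Reserve rate_mbps for stream_id, or refuse if the total would exceed capacity."""
        if self.reserved_mbps + rate_mbps > self.capacity_mbps:
            return False                 # request denied: predicted load too high
        self.streams[stream_id] = rate_mbps
        self.reserved_mbps += rate_mbps
        return True

    def release(self, stream_id):
        """Give back the reservation when a stream stops."""
        self.reserved_mbps -= self.streams.pop(stream_id, 0.0)


if __name__ == "__main__":
    ac = AdmissionController(capacity_mbps=10.0)
    print(ac.admit("movie-1", 6.0))   # fits
    print(ac.admit("movie-2", 6.0))   # would oversubscribe, so it is refused
    ac.release("movie-1")
    print(ac.admit("movie-2", 6.0))   # fits now that the first stream is gone
```

The point of the design is that a refused stream is better than an admitted stream that later stutters; a legacy (best-effort) request would simply bypass this check.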

EXT3NS Caching, Quantization, and Scheduling optimizations

The hardware is designed to have a minimum block size of 256 KB up to a maximum of 2 MB; normal Linux block devices have a maximum block size of 4 KB.

Some compromises were made in the disk metadata block design for SDA (what is SDA? The substitute for RAID) so that it is compatible with EXT3FS.

The large block sizes lead to a large maximum addressable file size: about 275 GB for first-level indirection, and ~2^53 B for maximal indirection.

The memory contained on the NS card is actually a buffer in the current version of EXT3NS; the authors plan to add caching capability to it. (If you don't know the difference between a buffer and a cache, look it up!)

Asynchronous I/O is not currently supported, but plans are in place.
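The file-size figures can be sanity-checked with a little arithmetic. The sketch below is one way numbers of this size could arise, not the paper's stated derivation: I am assuming an ext3-style inode with 12 direct pointers and 4-byte block numbers, and I am assuming a 1 MB block for the first-level figure (the slide does not say which block size it used).

```python
DIRECT_POINTERS = 12   # assumption: ext3-style inode with 12 direct block pointers
POINTER_BYTES = 4      # assumption: 32-bit block numbers


def first_level_limit(block_size):
    """Bytes reachable through the direct pointers plus one single-indirect block."""
    pointers_per_block = block_size // POINTER_BYTES
    return (DIRECT_POINTERS + pointers_per_block) * block_size


def block_number_limit(block_size):
    """Bytes addressable in total when block numbers are 32 bits wide."""
    return (2 ** (8 * POINTER_BYTES)) * block_size


for size_kb in (4, 256, 1024, 2048):
    bs = size_kb * 1024
    exponent = block_number_limit(bs).bit_length() - 1
    print(f"{size_kb:>5} KB blocks: first level {first_level_limit(bs) / 1e9:10.3f} GB, "
          f"32-bit block-number limit 2^{exponent} bytes")
```

Under these assumptions a 1 MB block gives roughly 275 GB at the first level of indirection, and 32-bit block numbers with 2 MB blocks give 2^53 bytes overall. Whatever the exact derivation, the takeaway is the same: with blocks this large, indirection depth stops being a practical limit on movie-sized files.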

Tiger Shark Caching, Quantization, and Scheduling optimizations

"Deadline scheduling" instead of elevator algorithms.

This is an interesting aspect of Tiger Shark: it benchmarks the hardware against a "synthetic workload" to determine the best order in which to schedule the disk requests and the best thresholds at which to start denying requests.

Block size is 256 KB (default); normal AIX uses a 4 KB block size.

Tiger Shark will "chunk" contiguous block reads better than the default filesystems to work with its large block size.
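The difference between deadline scheduling and an elevator algorithm is easiest to see on a tiny request queue. The sketch below is a generic earliest-deadline-first ordering next to a one-pass SCAN ordering; it is not Tiger Shark's scheduler, and the request fields (`id`, `block`, `deadline_ms`) are hypothetical.

```python
import heapq


def elevator_order(requests, head=0):
    """SCAN-style: sweep upward from the head position, then come back down."""
    ahead = sorted((r for r in requests if r["block"] >= head), key=lambda r: r["block"])
    behind = sorted((r for r in requests if r["block"] < head),
                    key=lambda r: r["block"], reverse=True)
    return [r["id"] for r in ahead + behind]


def deadline_order(requests):
    """Earliest deadline first; block position only breaks ties."""
    heap = [(r["deadline_ms"], r["block"], r["id"]) for r in requests]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


queue = [
    {"id": "A", "block": 900, "deadline_ms": 10},   # far away on disk, but due first
    {"id": "B", "block": 100, "deadline_ms": 50},
    {"id": "C", "block": 500, "deadline_ms": 30},
]

if __name__ == "__main__":
    print("elevator:", elevator_order(queue))
    print("deadline:", deadline_order(queue))
```

The elevator minimizes head movement and would serve A last, which is exactly the request a video player needs first. Deadline ordering trades some seek efficiency for meeting playback deadlines, which is why it pairs naturally with admission control.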

EXT3NS Streamlining of operations to get the data from the platter to the NIC

EXT3NS has special hardware that avoids memcopy and most kernel calculations.

This hardware takes the data output from the disk hardware buffer directly onto a custom PCI bus, then copies it through buffers and directly to the NIC on the SAME CARD.

Hardware avoids using the system's PCI bus when the fastpath option is used.

Joint network interface and disk controller.

Hardware speedups also calculate IP/TCP/UDP headers and checksums to speed up processing.

Tiger Shark Streamlining of operations to get the data from the platter to the NIC

A running daemon pre-allocates OS resources such as buffer space, disk bandwidth, and controller time.

Not a hardware-dependent solution.

Even though it does not have shared memory hardware, Tiger Shark copies data from the disks into a shared memory area. Essentially this is a very large extension of the kernel's disk block cache.

The VFS layer for Tiger Shark intercepts repeated calls and uses the shared memory area, therefore saving kernel memcopys on subsequent requests.

Platter Layout and Scaling optimizations for Contiguous Streaming

EXT3NS: Hardware uses a RAID-3-like cross-platter optimization called "SDA", which distributes the blocks across multiple disk platters (simple striping, not interleaving). Maximum of 4 platters as implemented.

Tiger Shark: Striping across a maximum of 65000 platters.

Striping method unspecified; it looks flexible and can be extended to include redundancy if desired.

Keeps all members of a block group contiguous (per journaling FS concepts) and attempts to keep the block groups contiguous.
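Since neither striping layout is spelled out in detail here, the following is only a generic round-robin striping sketch, showing why a contiguous read ends up spread over all the disks. The function names are hypothetical, and the real SDA and Tiger Shark layouts may differ (parity, block groups, and so on are ignored).

```python
def locate_block(logical_block, num_disks):
    """Round-robin striping: return (disk index, block index on that disk)."""
    if num_disks < 1:
        raise ValueError("need at least one disk")
    return logical_block % num_disks, logical_block // num_disks


def disks_touched(first_block, count, num_disks):
    """Which disks a contiguous read of `count` logical blocks lands on."""
    return sorted({locate_block(b, num_disks)[0]
                   for b in range(first_block, first_block + count)})


if __name__ == "__main__":
    # With 4 disks (the EXT3NS maximum as implemented), consecutive blocks rotate across disks.
    print([locate_block(b, 4) for b in range(6)])
    # A 4-block sequential read keeps every disk busy at once.
    print(disks_touched(2, 4, 4))
```

For streaming workloads this is the whole appeal: one large sequential read becomes several smaller reads that proceed in parallel, so aggregate bandwidth grows with the number of disks.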

Seeking Optimizations

EXT3NS: None noted beyond large block size.

Tiger Shark: Byte-range locking. Allows multiple clients to access different areas of a file with real-time guarantees if they don't step on each other.
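Byte-range locking boils down to an interval-overlap test. Here is a minimal in-memory sketch, assuming exclusive locks only and half-open ranges; the class and method names are hypothetical, and Tiger Shark's real lock manager (and how it ties into the real-time guarantees) is not described at this level in the slides.

```python
class ByteRangeLocks:
    """Toy lock table: grants a range only if no other owner holds an overlapping one."""

    def __init__(self):
        self._held = []  # list of (start, end, owner), with end exclusive

    def try_lock(self, owner, start, length):
        """Return True and record the lock if [start, start+length) is free of other owners."""
        end = start + length
        for held_start, held_end, held_owner in self._held:
            overlaps = start < held_end and held_start < end
            if overlaps and held_owner != owner:
                return False
        self._held.append((start, end, owner))
        return True

    def unlock(self, owner):
        """Drop every range held by this owner."""
        self._held = [h for h in self._held if h[2] != owner]


if __name__ == "__main__":
    locks = ByteRangeLocks()
    print(locks.try_lock("client-1", 0, 100))    # granted
    print(locks.try_lock("client-2", 100, 50))   # adjacent, no overlap, granted
    print(locks.try_lock("client-3", 90, 20))    # overlaps both, refused
```

Two viewers seeking around in different parts of the same movie never contend, which is the common case for a static-content server.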

Legacy nods

EXT3NS: If the legacy option (slow class) is used, the disk contents are copied into the system's page cache through the system's bus, as if EXT3FS were being used. The paper does not go into it, but my guess is that this is a rather wasteful operation given the large block size of SDA. Other legacy tools such as fsck and mkfs are also available for EXT3NS.

Tiger Shark: Compatible with JFS. VFS/JFS calls go through the kernel interface with some block translation.

EXT3NS and Tiger Shark: Fully compatible with VFS (Virtual Filesystem) for their respective platforms.

Current Research and Future Directions and Jeff’s questions

Tiger Shark gives us filesystem QoS. But can we do better by integrating VBR/ABR into the system? What about peeling in a VBR system to save resources?

Replication and redundancy are always an issue, but are not addressed in this scope.

If it is a software-based system such as Tiger Shark, where in the OS should we put these optimizations? (Kernel, tack-on daemon, middleware.)

Legacy disk accesses have a huge cost in both of these systems; how can we minimize it?

EXT3NS Final Thoughts

Valid, but not a novel approach. Custom hardware does not represent an incremental step forward in universal knowledge.

EXT3NS is built for exactly one thing: network streaming of data.

An engineering change was made to the hardware design of a computer system, and some optimizations were made to the software to take advantage of it. The authors are not advocating a radical design change to all computers.

Violates a few “design principles”, therefore it must be relegated to a customized, specific-purpose system.

Empirical data confirm that the EXT3NS design is able to squeeze more concurrent sessions out of a multimedia server than would have been available previously. There is still a saturation point where the memory of the NS card or the capabilities of the card's internal bus break down, and the system cannot scale beyond that point.

Better than Best Effort.

Tiger Shark Final Thoughts

Valid, somewhat novel approach. It adds QoS guarantees to current disk interface architectures.

Built to be extensible to more than just MM disk access, but definitely optimized for it.

Empirical data confirm that the Tiger Shark design is able to serve more concurrent sessions out of a multimedia server than would have been available previously, BUT there is still a kernel bottleneck for the initial block load.

Better suited to multiple concurrent access than EXT3NS.

Currently appears scalable beyond any reasonable (modern) demands. As usual in computer science, though, future demands may find a point where the system's scaling breaks down.

Guaranteed QoS.

Many other later QoS filesystems extend this concept and tweak some aspects of it, such as scheduling.

The fundamental academic question at the end of the day

The 2 major competing solution paradigms:

Fundamentally alter the hardware datapath in a computer and present a customized hardware solution with relevant changes in the OS. Scaling = not addressed.

Retrofit current operating systems with some tacked-on task-specific optimizations and tweaking of settings. The system and the hardware are kept generic. Scaling = buy more hardware.

Or can we find an alternate third paradigm?
