conquest: preparing for life after disks october 2, 2003 an-i andy wang
Conquest: Preparing for Life After Disks
October 2, 2003
An-I Andy Wang
Conquest Overview
  File systems are optimized for disks
    Performance problem
    Complexity
  Now we have tons of inexpensive RAM
    What can we do with that RAM?
Conquest Approach
  Combine disk and persistent RAM (e.g., battery-backed RAM) in a novel way
  Simplification
    At least 20% smaller code base than ext2, reiserfs, and SGI XFS
  Performance (under popular benchmarks)
    24% to 1900% faster than LRU disk caching
    Best performance boost since Berkeley FFS
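Performance Problem of Disks
  [Figure: accesses per second (log scale, 1 KHz to 1 GHz) versus year, 1990 to 2000. CPU and memory improve at 50% per year; disk improves at 15% per year. The gap is annotated 10^5 and 10^6, with the analogies (1 sec : 6 days) and (1 sec : 3 months).]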
Genesis • Conquest Design • Performance Evaluation • Conclusion
Inside Pandora’s Box
  [Figure: disk arm and disk platters]
  Access time = seek time (disk arm) + rotational delay (disk platter) + transfer time
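As a rough illustration of the access-time sum above, the following Python sketch plugs in hypothetical drive parameters. The seek time, spindle speed, transfer rate, and request size are assumptions chosen for illustration, not figures from the talk.

```python
def access_time_ms(seek_ms, rpm, transfer_mb_per_s, request_kb):
    """Access time = seek time + rotational delay + transfer time (all in ms)."""
    # Average rotational delay is half of one full revolution.
    rotational_ms = 0.5 * 60_000.0 / rpm
    # Time to move the requested bytes once the head is in position.
    transfer_ms = (request_kb / 1024.0) / transfer_mb_per_s * 1000.0
    return seek_ms + rotational_ms + transfer_ms


# Hypothetical drive: 8 ms average seek, 7200 RPM, 40 MB/s, 4 KB request.
small_request_ms = access_time_ms(8.0, 7200, 40.0, 4)
```

With these assumed numbers, the two mechanical terms (seek and rotation) account for nearly all of the time of a small request, which is the cost that the optimizations on the following slides try to hide.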
Disk Optimization Methods
  Disk arm scheduling
  Group information on disk
  Disk readahead
  Buffered writes
  Disk caching
  Data mirroring
  Hardware parallelism
Complexity Bytes
  [Figure: sources of disk-related complexity: synchronization, predictive readahead, cache replacement, elevator algorithm, data clustering, data consistency, asynchronous write]
Storage Media Alternatives
  [Figure: storage media plotted by $/MB (log scale) against accesses/sec (log scale): tape, disk, flash memory, battery-backed DRAM, and magnetic RAM (?). The labels "persistent RAM" and "(write once)" annotate the plot.]
  [Caceres et al., 1993; Hillyer et al., 1996; Qualstar 1998; Tanisys 1999; Quantum 2000; Micron Semiconductor Products 2002]
The Genesis of Conquest
  Idea: persistent-RAM-only file system
    Improved performance
    Remove disk-related complexity
The Genesis of Conquest (2)
  Problem: wrong growth curves
    Disk prices dropping faster than RAM prices
    Disks will stay around
  [Figure: $/MB (log scale, 10^-2 to 10^2) versus year, 1995 to 2005, for 3.5" HDD, 2.5" HDD, 1" HDD, and persistent RAM; annotation: booming of digital photography]
  [Grochowski 2002]
The Genesis of Conquest (3)
  New idea: hybrid system for transition
    Takes advantage of RAM speed
    Still simplifies code
  [Figure: $/MB (log scale, 10^-2 to 10^2) versus year, 1995 to 2005, for paper/film, 3.5" HDD, 2.5" HDD, 1" HDD, and persistent RAM; annotations: booming of digital photography; 4 to 10 GB of persistent RAM]
  [Grochowski 2002]
Conquest Design Questions
  How to make effective use of RAM?
    Common usage patterns
    Physical characteristics of RAM storage
  Where and how to reduce complexity?
    Data paths
    Data structures and associated management
    Shutdown/boot sequence
  How to assure the integrity of file system components that reside in battery-backed DRAM (BB-DRAM)?
User Access Patterns
  Small files
    Take little space (10%)
    Represent most accesses (90%)
  Large files
    Take most space
    Mostly sequential accesses
  Not characteristic of database applications
  [Ousterhout 1985; Baker et al., 1991; Iram 1993; Douceur and Bolosky 1999; Roselli et al., 2000; Evans and Kuenning 2002]
Characteristics of Storage Media
  RAM
    Fast random accesses
    Cost-effective in performance
  Disk
    Fast sequential accesses
    Cost-effective in storage
The Design of Conquest
  Deliver all file system services from memory, with the exception of high-capacity storage
  Persistent RAM
    Data content of small files (smaller than 1 MB)
    Metadata (file descriptions for large and small files, directories, and data structures)
  Disk
    Data content of large files
  Two separate data paths to memory and disk
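The placement rule on this slide can be sketched in a few lines of Python. The function names are hypothetical, and how a file of exactly 1 MB is treated is an assumption here; the slide only says that small files are "smaller than 1 MB".

```python
SMALL_FILE_LIMIT = 1 << 20  # 1 MB threshold stated on the slide


def data_medium(file_size_bytes):
    """Return the medium that would hold a file's data content."""
    # Assumption: a file of exactly 1 MB counts as large in this sketch.
    if file_size_bytes < SMALL_FILE_LIMIT:
        return "persistent RAM"
    return "disk"


def metadata_medium(file_size_bytes):
    """Metadata stays in persistent RAM for both small and large files."""
    return "persistent RAM"
```

The point of the split is that only one decision, the size test, separates the two data paths; metadata never takes the disk path.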
Conquest Alternatives
  Disk caching
    Assumption of scarce memory
    Use disk as the final storage destination
    Complex mechanisms to maintain consistency
  RAM drives and RAM file systems
    Not meant to be persistent
    Use disk-related mechanisms
    Limitations on storage capacity
  [McKusick et al., 1990; Ganger et al., 2000; Roselli et al., 2000; Seltzer et al., 2000]
Simplification of Data Paths
Content of Persistent RAM
  Data content of small files (< 1 MB)
    No seek time or rotational delays
    Fast byte-level accesses
    Virtual contiguous allocation
  Metadata (e.g., directories, file system states)
    Fast synchronous update
    No dual representations
    For both large and small files
Memory Data Path of Conquest
  [Figure: two data paths side by side. Conventional File Systems: storage requests, I/O buffer management, I/O buffer, persistence support, disk management, disk. Conquest Memory Data Path: storage requests, persistence support, battery-backed RAM (small file and metadata storage).]
Large-File-Only Disk Storage
  Only store the data content of large files
  Allocate in big chunks
    Lower access overhead
  Reduced management overhead
    No fragmentation management
    No tricks for small files
      Storing data in metadata
    No elaborate data structures
      Wrapping a balanced tree onto disk cylinders
  [Namesys 2002]
Sequential-Access Large Files
  Sequential disk accesses
    Near-raw bandwidth
  Well-defined readahead semantics
  Read-mostly
    Little synchronization overhead (between memory and disk)
Disk Data Path of Conquest
  [Figure: two data paths side by side. Conventional File Systems: storage requests, I/O buffer management, I/O buffer, persistence support, disk management, disk. Conquest Disk Data Path: storage requests, I/O buffer management, I/O buffer, disk management, disk (large-file-only file system), alongside battery-backed RAM (small file and metadata storage).]
Random-Access Large Files
  Random access?
    Common definition: nonsequential access
    A typical movie has 150 scene changes
    MP3 stores the title at the end of the files
  Near sequential access?
    Simplifies large-file metadata representation significantly
  [Baker et al., 1991; Vogels 1999; Roselli et al., 2000]
Simplification of Data Structures
Logical File Representation
  [Figure: a file consists of name(s), an i-node holding file attributes, and data]
Physical File Representation
  [Figure: a file consists of name(s), an i-node holding file attributes and data locations, and data blocks]
Ext2 Data Representation
  [Figure: an i-node (stored on disk) holds data block locations (the figure marks 10 of them) and index block locations; index blocks in turn hold data block locations or further index block locations]
Disadvantages with Ext2 Design
  Optimization for small files makes things complex
  Designed for disk storage
  Random-access data structure for large files that are accessed mostly sequentially
  Data access time dependent on the byte position in a file
  Maximum file size is limited
Conquest Representation
  Persistent RAM
    Single-level dynamically allocated index
    Fast data access for files stored in RAM
  [Figure: an i-node (stored in RAM) holds an index array location; the index array holds data block locations]
Conquest Representation (2)
  Disk
    Worst case: sequential memory search for random disk locations
    Maximum file size limited by physical storage
  [Figure: an i-node (stored in RAM) holds a segment list location; each segment list entry holds a begin block location and an end block location for data stored on disk]
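The segment list for on-disk files can be sketched as a list of (begin block, end block) pairs that is walked in order. This is a minimal illustration of the idea in the figure, not the Conquest code; the function name, the inclusive end block, and the example layout are assumptions.

```python
def locate_block(segments, logical_block):
    """Map a file-relative block number to a disk block number.

    segments: list of (begin_block, end_block) pairs, inclusive, in file order.
    """
    remaining = logical_block
    # Sequential search through the in-memory list; this is the worst case
    # the slide mentions for random disk locations.
    for begin, end in segments:
        length = end - begin + 1
        if remaining < length:
            return begin + remaining
        remaining -= length
    raise IndexError("block is past the end of the file")


# Hypothetical layout: 100 contiguous blocks, then 50 more elsewhere on disk.
segments = [(100, 199), (500, 549)]
```

Because the list lives in RAM, even the worst-case walk costs memory accesses only; the disk is touched once, at the block that is finally located.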
Conquest Directories
  Per-directory hash tables stored in memory
  Collisions resolved by rehashing
  Hard links: multiple names point to same data
  Problem: dynamic resizing of directories
    Need to handle the current file position
    Important for rm -fr
The Difficulty With Shrinking
  rm -fr
  [Figure, five animation frames: a directory i-node (stored in RAM) holds a hash table location. Hash table slots are marked <empty> or <deleted> (both with NULL pointers) or hold an entry of the form "hash code | name" with a file i-node location. The first frame holds 0110 | file1, 1001 | file2, and 1000 | dir; later frames show dir and then file2 removed as the table shrinks.]
  Quick fixes
    Never shrink hash tables (for rm -fr)
    No promises for ls while adding files
Extensible Hash Tables
  Use top, not bottom, bits of hash code
  [Figure: an i-node (stored in RAM) holds a hash table location; the hash table holds entries 0110 | file1 and 1001 | file2, each with a file i-node location]
  [Fagin et al., 1979]
Extensible Hash Tables
  Preserve ordering of entries when resizing
  [Figure: the resized hash table, with additional <empty> slots, still holding entries 0110 | file1 and 1001 | file2]
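Why top bits preserve ordering can be seen with the two hash codes from the figure. The sketch below is an illustration under assumptions, not the Conquest implementation: it assumes 4-bit hash codes, a table of 2^bits slots, and a directory scan that visits slots in index order.

```python
HASH_BITS = 4  # width of the hash codes shown in the figure (0110, 1001)

entries = {"file1": 0b0110, "file2": 0b1001}


def top_slot(hash_code, bits):
    """Slot index taken from the top `bits` bits of the hash code."""
    return hash_code >> (HASH_BITS - bits)


def bottom_slot(hash_code, bits):
    """Slot index taken from the bottom `bits` bits of the hash code."""
    return hash_code & ((1 << bits) - 1)


def scan_order(slot_fn, bits):
    """Names in the order a slot-by-slot scan of the table would meet them."""
    return sorted(entries, key=lambda name: slot_fn(entries[name], bits))
```

With top bits, the slot index never decreases as the hash code grows, so `scan_order(top_slot, 1)` and `scan_order(top_slot, 2)` both give file1 before file2. With bottom bits, growing the table from 2 to 4 slots moves file1 to slot 2 and file2 to slot 1, so the scan order flips, which is what makes a saved directory position unreliable across a resize.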
Additional Engineering Details
  Dynamic file positioning
  Need to handle collisions
  Memory overhead and complexity tradeoffs
Simplification of Metadata Management
Metadata Allocation Requirements
  Keep track of usage status of metadata entries
  Avoid duplicate allocation with unique IDs
  Fast retrieval of metadata with a given ID
  [Figure: metadata entries with usage status: ID 30, free; ID 81, in use; ID 58, free; ID 16, free; ID 89, in use; ID 88, free]
Existing Memory Allocation Services
  Keep track of unallocated memory
  No duplicate allocation of physical addresses
  Hmm…
  [Figure: memory allocations with usage status: ADDR 0xe000000, free; ADDR 0xe000038, in use; ADDR 0xe000070, free; ADDR 0xe0000A8, free; ADDR 0xe0000E0, free; ADDR 0xe000118, in use]
Conquest Metadata Management
  Metadata = memory allocated by memory manager
  Metadata ID = physical address of metadata
  [Figure: the metadata entries (ID 30, 81, 58, 16, 89, 88) are replaced by memory allocations (ADDR 0xe000000 through 0xe000118). The allocator's free/in-use state supplies the usage status; the physical address supplies unique IDs and fast retrieval.]
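The idea that a metadata ID can simply be the address of the allocation can be sketched with a toy fixed-size allocator. The base address and the 0x38-byte spacing are taken from the addresses in the figure; the class and method names are hypothetical, and a real kernel allocator is far more involved.

```python
BASE = 0xE000000  # first address shown in the figure
SLOT = 0x38       # spacing between consecutive addresses in the figure


class MetadataArena:
    """Toy fixed-size allocator in which an entry's ID is its address."""

    def __init__(self, slots):
        self.in_use = [False] * slots   # usage status, kept by the allocator
        self.payload = [None] * slots

    def allocate(self, value):
        index = self.in_use.index(False)  # raises ValueError when full
        self.in_use[index] = True
        self.payload[index] = value
        return BASE + index * SLOT        # the address doubles as a unique ID

    def lookup(self, address):
        index = (address - BASE) // SLOT  # constant-time retrieval by ID
        if not self.in_use[index]:
            raise KeyError(hex(address))
        return self.payload[index]

    def free(self, address):
        self.in_use[(address - BASE) // SLOT] = False
```

All three requirements from the earlier slide fall out of the allocator: it already tracks usage status, two live allocations can never share an address, and following an address needs no lookup table.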
Simplification of Shutdown/Boot Sequence
Persistence Support
  Restore file system states after a reboot
    Data
    Metadata
  Memory manager
    Keep track of metadata allocation
    Reinitialized at boot time
    No knowledge of persistently allocated data
Linux Memory Manager
  Page allocator maintains individual pages
  [Figure: page allocator]
Linux Memory Manager (2)
  Zone allocator allocates memory in power-of-two sizes
  [Figure: page allocator and zone allocator]
Linux Memory Manager (3)
  Slab allocator groups allocations by sizes to reduce internal memory fragmentation
  [Figure: page allocator, zone allocator, and slab allocator]
Memory Allocation Example
  Allocate a 455-byte data structure
    Slab allocator: one page of data structures
    Zone allocator: one page from DMA zone
    Page allocator: page address 0x0000d000
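A quick calculation shows why grouping allocations by exact size helps for the 455-byte structure in this example. The sketch assumes a 4096-byte page and ignores any per-slab bookkeeping, so the counts are upper bounds rather than what a real kernel would report.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def objects_per_page(object_size):
    """How many objects of this size fit in one page, ignoring bookkeeping."""
    return PAGE_SIZE // object_size


def next_power_of_two(n):
    """Smallest power of two that is greater than or equal to n (n >= 1)."""
    return 1 << (n - 1).bit_length()


# Packing 455-byte objects back to back in one page.
exact_count = objects_per_page(455)
exact_waste = PAGE_SIZE - exact_count * 455

# Rounding every object up to a power-of-two size first.
rounded_size = next_power_of_two(455)
rounded_count = objects_per_page(rounded_size)
rounded_waste = rounded_count * (rounded_size - 455)
```

Under these assumptions, size-specific packing fits nine structures in a page with one byte left over, while power-of-two rounding fits eight and leaves 456 bytes unused inside the objects.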
Linux Memory Manager (4)
  Difficult to restore the persistent states
    Three layers of pointer-rich mappings
    Mixing of persistent and temporary allocations
  [Figure: page allocator, slab allocator, and zone allocator]
Conquest Persistence
  Create memory zones with own instantiations of memory managers
  [Figure: page allocator, slab allocator, and zone allocator]
Conquest Persistence
  Reuse existing memory manager code
  Encapsulate all pointers within each zone
    Pointers can survive reboots
    No serialization and deserialization
  Swapping and paging
    Disabled for Conquest memory zones
    Enabled for non-Conquest zones
Integrity of Content in RAM
  User-level program crashes
    Same file system interface as others
    Access control
    Memory protection
  Operating system crashes
    1.5% of crashes lead to memory corruption
    Lose about one data block a decade
  [Ng et al., 1996]
Other Reliability Mechanisms
  Instantaneous metadata commit
  Daily backups
  Pointer-switch commit semantics
  [Figure: two pointers illustrating a pointer switch]
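The slide does not spell out what pointer-switch commit means in Conquest. One common reading, sketched here purely as an assumption, is that an update is built off to the side and becomes visible through a single pointer assignment, so a reader sees either the old version or the new one and never a partial state. The class and method names are hypothetical.

```python
class VersionedTable:
    """Holds one reference to the current, fully built version of its entries."""

    def __init__(self, entries):
        self.current = dict(entries)

    def commit_update(self, changes):
        # Build the new version separately; readers still see the old one.
        staged = dict(self.current)
        staged.update(changes)
        # The commit itself is this single reference assignment.
        self.current = staged


table = VersionedTable({"file1": 0b0110})
before = table.current
table.commit_update({"file2": 0b1001})
after = table.current
```

In this sketch the old version stays intact until nothing refers to it, so a crash before the assignment leaves the previous state untouched.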
Implementation Status
  Kernel module under Linux 2.4.2
  Operational and POSIX compliant
  Modified memory manager to support Conquest persistence
  Need to overcome BIOS limitations for distribution
Performance Evaluation
  Architectural simplification
    Feature count
  Performance improvement
    Memory-only workloads
    Memory-and-disk workloads
Conventional Data Path
  Buffer allocation management
  Buffer garbage collection
  Data caching
  Metadata caching
  Predictive readahead
  Write behind
  Cache replacement
  Metadata allocation
  Metadata placement
  Metadata translation
  Disk layout
  Fragmentation management
  [Figure: Conventional File Systems data path: storage requests, I/O buffer management, I/O buffer, persistence support, disk management, disk]
Memory Path of Conquest
  Buffer allocation management
  Buffer garbage collection
  Data caching
  Metadata caching
  Predictive readahead
  Write behind
  Cache replacement
  Metadata allocation
  Metadata placement
  Metadata translation
  Disk layout
  Fragmentation management
  [Figure: Conquest Memory Data Path: storage requests, persistence support, battery-backed RAM (small file and metadata storage); annotation: memory manager encapsulation]
Disk Path of Conquest
  Buffer allocation management
  Buffer garbage collection
  Data caching
  Metadata caching
  Predictive readahead
  Write behind
  Cache replacement
  Metadata allocation
  Metadata placement
  Metadata translation
  Disk layout
  Fragmentation management
  [Figure: Conquest Disk Data Path: storage requests, I/O buffer management, I/O buffer, disk management, disk (large-file-only file system), alongside battery-backed RAM (small file and metadata storage)]
PostMark Benchmark (1)
  ISP workload (emails, web-based transactions)
  Conquest is comparable to ramfs
  At least 24% faster than the LRU disk cache
  40 to 250 MB working set with 2 GB physical RAM
  [Figure: trans / sec (0 to 9000) versus number of files (5000 to 30000) for SGI XFS, reiserfs, ext2fs, ramfs, and Conquest]
  [Card et al., 1994; Sweeney et al., 1996; Katcher 1997; Namesys 2002]
PostMark Benchmark (2)
  When both memory and disk components are exercised, Conquest can be several times faster than ext2fs, reiserfs, and SGI XFS
  10,000 files, 80 MB to 3.5 GB working set with 2 GB physical RAM
  [Figure: trans / sec (0 to 5000) versus percentage of large files (0.0 to 10.0) for SGI XFS, reiserfs, ext2fs, and Conquest; regions marked "<= RAM" and "> RAM"]
PostMark Benchmark (3)
  When working set > RAM, Conquest is 1.4 to 2 times faster than ext2fs, reiserfs, and SGI XFS
  10,000 files, 80 MB to 3.5 GB working set with 2 GB physical RAM
  [Figure: trans / sec (0 to 120) versus percentage of large files (6.0 to 10.0) for SGI XFS, reiserfs, ext2fs, and Conquest]
Sprite LFS Microbenchmarks
  Small-file benchmark
    Operates on 10,000 1-KB files in three phases
  [Figure: op / sec (0 to 180000) for the create, read, and delete phases for SGI XFS, reiserfs, ext2fs, ramfs, and Conquest]
  [Rosenblum and Ousterhout 1991]
Sprite LFS Microbenchmarks (2)
  Modified large-file microbenchmark: ten 1-MB files (Conquest in-core files)
  [Figure: MB / sec (0 to 700) for the seq write, seq read, rand write, rand read, and seq read phases for SGI XFS, reiserfs, ext2fs, ramfs, and Conquest]
Sprite LFS Microbenchmarks (3)
  Modified large-file microbenchmark: ten 1.01-MB files (Conquest on-disk files)
  [Figure: MB / sec (0 to 700) for the seq write, seq read, rand write, rand read, and seq read phases for SGI XFS, reiserfs, ext2fs, ramfs, and Conquest]
Sprite LFS Microbenchmarks (4)
  Large-file microbenchmark: forty 100-MB files (Conquest on-disk files)
  [Figure: MB / sec (0 to 30) for the seq write, seq read, rand write, rand read, and seq read phases for SGI XFS, reiserfs, ext2fs, and Conquest]
History’s Mystery
  Puzzling Microbenchmark Numbers…
  Geoff Kuenning: “If Conquest is slower than ext2fs, I will toss you off of the balcony…”
With me hanging off a balcony…
  Original large-file microbenchmark: one 1-MB file (Conquest in-core file)
  [Figure: MB / sec (0 to 700) for the seq write, seq read, rand write, rand read, and seq read phases for SGI XFS, reiserfs, ext2fs, ramfs, and Conquest]
Odd Microbenchmark Numbers
  Why are random reads slower than sequential reads?
  [Figure: MB / sec (0 to 700) for the seq write, seq read, rand write, rand read, and seq read phases for SGI XFS, reiserfs, ext2fs, ramfs, and Conquest]
Odd Microbenchmark Numbers
  Why are RAM-based file systems slower than disk-based file systems?
  [Figure: MB / sec (0 to 700) for the seq write, seq read, rand write, rand read, and seq read phases for SGI XFS, reiserfs, ext2fs, ramfs, and Conquest]
A Series of Hypotheses
  Warm-up effect?
    Maybe
    Why do RAM-based systems warm up slower?
  Bad initial states?
    No
  Pentium III streaming I/O option?
    No
  [Keshava and Penkovski 1999; Torvalds 2001; Abraham 2002]
Effects of L2 Cache Footprints
  [Figure: two panels, "Large L2 cache footprint" and "Small L2 cache footprint". Each panel shows two steps, "write a file sequentially" and "read the same file sequentially", marking the cache footprint relative to the file and the file end, with "flush" and "read" annotations.]
LFS Sprite Microbenchmarks
  Modified large-file microbenchmark: ten 1-MB files (Conquest in-core files)
  [Figure: MB / sec (0 to 700) for the seq write, seq read, rand write, rand read, and seq read phases for SGI XFS, reiserfs, ext2fs, ramfs, and Conquest]
Related Work
  Main-memory databases
    Memory-based data structures and query mechanisms
  File-system applications of persistent RAM
    Write buffers
    Flash-memory-based file systems
    Disk emulators
    Rio file cache
    MRAM-enabled storage
  [Baker et al., 1992; Garcia-Molina and Salem 1992; Wu and Zwaenepoel 1994; Chen et al., 1996; Riedel 1998; Quantum 2000; Miller et al., 2001]
Related Work (2)
  PDA operating systems
    Designed with severe memory constraints
  Slice
    Distributed storage system
    Dedicated servers for metadata, small files, and large files
  [Anderson et al., 2000; Palm 2000; IBM 2002; Microsoft 2002]
Lessons Learned
  Faster than LRU caching, unexpected
    Heavyweight disk handling
    Severe penalty for accessing memory content
  Matching user access patterns to storage media offers considerable simplification and better performance
    Not an automatic result
    Need careful design
More Lessons Learned
  Effects of L2 caching become highly visible in memory workloads (modern workloads)
  Cannot blindly apply existing disk-based microbenchmarks to measure memory performance of file systems
  Need to consider states of L2 cache and memory behaviors at each stage of microbenchmarking
Additional Lessons Learned
  Don’t discuss your performance numbers next to a balcony… unless…
Going Beyond Conquest
  Matching usage patterns with heterogeneous machines in the distributed domain
    Specialized tasks for machines within a cluster
    Preferably self-organizing and self-evolving
  State-rich computing
    Caching of runtime data structures
    Similar to specialized temporary file system
Going Beyond Conquest (2)
  Separate storage of metadata from data
    Opportunity for hierarchical replication across devices with different calibers
  Benchmarking memory performance of file systems
    Developing new memory benchmarks
  Why are modern operating systems so complicated?
    More places to expand Conquest approach
Contributions
  Demonstrated the feasibility of disk-memory hybrid file systems
  Showed performance does not preclude simplicity
  Pinpointed cache-related problems with modern benchmarks
  Opened doors to many exciting areas of research
Conclusion
  Conquest demonstrates how rethinking changes in underlying assumptions can lead to significant architectural and performance improvements
  Radical changes in hardware, applications, and user expectations in the past decade should lead us to rethink other aspects of OS as well.
Questions . . .
  Conquest: http://www.cs.fsu.edu/~awang/conquest
  Andy Wang: awang@cs.fsu.edu