Improving Performance of a Distributed File System Using OSDs and Cooperative Cache
PROJECT REPORT SUBMITTED IN PARTIAL FULFILLMENT FOR THE
DEGREE OF
B.Sc(H) Computer Science
Hans Raj College
University of Delhi
Delhi – 110 007
India
Submitted by:
Parvez Gupta (Roll No. 6010027)
Varenya Agrawal (Roll No. 6010044)
Certificate
This is to certify that the project work entitled “Improving Performance of a Distributed File System Using OSDs and Cooperative Cache”, being submitted by Parvez Gupta and Varenya Agrawal in partial fulfillment of the requirement for the award of the degree of B.Sc. (Hons) Computer Science, University of Delhi, is a record of work carried out under the supervision of Ms. Baljeet Kaur at Hans Raj College, University of Delhi, Delhi.
It is further certified that we have not submitted this report to any other organization for any other degree.
Parvez Gupta
Roll No: - 6010027
Varenya Agrawal
Roll No: - 6010044
Project Supervisor: Ms. Baljeet Kaur, Dept. of Computer Science, Hans Raj College, University of Delhi
Principal: Dr. S.R. Arora, Hans Raj College, University of Delhi
Acknowledgment
We would sincerely like to thank Ms. Baljeet Kaur for her invaluable support and guidance in carrying out this project to successful completion. We would also like to thank the Head of the Computer Science Department, Ms. Harmeet Kaur, who was always there with her invaluable knowledge and experience that helped us greatly during the research work. We would also like to extend our gratitude and special thanks to Mr. I.P.S. Negi, Mr. Sanjay Narang and Ms. Anita Mittal for their help in the computer laboratory.
Lastly, we would like to thank all our friends and well-wishers who directly or indirectly contributed to the successful completion of the project.
Table of Contents
List of Figures 3
Chapter 1 4
Introduction 4
1.1 Background 5
1.2 About the Work 6
Chapter 2 8
z-Series File System 8
2.1 Prominent Features 9
2.2 Architecture 10
2.2.1 Object Store 10
2.2.2 Front End 11
2.2.3 Lease Manager 11
2.2.4 File Manager 12
2.2.5 Cooperative Cache 13
2.2.6 Transaction Server 13
Chapter 3 16
Cooperative Cache 16
3.1 Working of Cooperative Cache 17
3.2 Cooperative Cache Algorithm 18
3.2.1 Node Failure 20
3.2.2 Network Delays 21
3.3 Choosing the Proper Third Party Node 25
3.4 Pre-fetching Data in zFS 26
Chapter 4 28
Testing 28
4.1 Test Environment 29
4.2 Comparing zFS and NFS 32
Conclusion 35
Bibliography 36
List of Figures
Figure 1: zFS Architecture 15
Figure 2: Delayed Move Notification Messages 24
Figure 3: System configuration for testing zFS performance 31
Figure 4: System configuration for testing NFS performance 31
Figure 5: Performance results for large server cache 32
Figure 6: Performance results for small server cache 33
Chapter 1
Introduction
1.1 Background
As computer networks started to evolve in the 1980s it became evident that the
old file systems had many limitations that made them unsuitable for multiuser
environments.
In the beginning, many users started to use FTP to share files. Although this
method avoided the time consuming physical movement of removable media,
files still needed to be copied twice: once from the source computer onto a
server, and a second time from the server onto the destination computer.
Additionally, users had to know the physical addresses of every computer involved in
the file sharing process.
As computer companies tried to solve the shortcomings above, distributed file
systems were developed and new features such as file locking were added to
existing file systems. The new systems were not replacements for the old file
systems, but an additional layer between the disk file system and the user
processes.
In a Distributed File System (DFS) a single file system can be distributed across
several physical computer nodes. Separate nodes have direct access to only a
part of the entire file system. With DFS, system administrators can make files
distributed across multiple servers appear to users as if they reside in one place
on the network.
zFS (z-Series File System), a distributed file system developed by IBM, is used in the z/OS operating system. zFS evolved from the DSF (Data Sharing Facility) project, which aimed at building a server-less file system that distributes all aspects of file and storage management over cooperating machines interconnected by a fast switched network. zFS was designed to be a scalable file system that operates equally well on a few machines or on thousands of machines, and in which the addition of new machines leads to a linear increase in performance.
1.2 About the Work
This work describes a cooperative cache algorithm used in zFS, which can withstand network delays and node failures. The work explores the effectiveness of this algorithm and of zFS as a file system. This is done by comparing the system’s performance to NFS using the IOZONE benchmark. The researchers have also investigated whether using a cooperative cache results in better performance, despite the fact that the object store devices have their own caches. Their results show that zFS performs better than NFS when the cooperative cache is activated, and that zFS provides better performance even though the OSDs have their own caches. They have also demonstrated that using pre-fetching in zFS increases performance significantly. Thus, zFS performance scales well as the number of participating clients increases.
There are several other related works that have researched cooperative caching in network file systems. Another file system, xFS, uses a central server to coordinate between the various clients, and the load on the server increases as the number of clients increases. Thus, the scalability of xFS is limited by the strength of the server. However, xFS is more scalable than AFS and NFS due to four different caching techniques it uses, which contribute significantly to reducing the server load.
There are three major differences between the zFS architecture and the xFS architecture:
zFS does not have a central server, and the management of files is distributed among several file managers. There is no hierarchy of cluster servers; if two clients work on the same file they interact with the same file manager.
In zFS, caching is done on a per-page basis rather than on whole files. This increases sharing, since different clients can work on different parts of the same file.
In zFS, no caching is done on the local disk.
Thus, zFS is more scalable because it has no central server, and file managers can dynamically be added or removed to respond to load changes in the cluster. Moreover, performance is better due to zFS’s stronger sharing capability. zFS does not have a central server that can become a bottleneck. All control information is exchanged between clients and file managers. The set of file managers dynamically adapts itself to the load on the cluster. Clients in zFS pass only data among themselves (in cooperative cache mode).
Chapter 2
z-Series File System
2.1 Prominent Features
zFS is a scalable file system which uses Object Store Devices (OSD) and a set of
cooperative machines for distributed file management. These are its two most
prominent features.
zFS integrates the memory of all participating machines into one coherent cache. Thus, instead of going to the disk for a block of data that is already in one of the machines’ memories, zFS retrieves the data block from the remote machine. To maintain file system consistency, zFS uses distributed transactions and leases to implement metadata operations and to coordinate shared access to data. zFS achieves its high performance and scalability by avoiding group-communication mechanisms and clustering software, using distributed transactions and leases instead.
The design and implementation of zFS is aimed at achieving a scalable file system beyond those that exist today. More specifically, the objectives of zFS are:
A file system that operates equally well on a few or on thousands of machines
Built from off-the-shelf components with object disks (ObSs)
Makes use of the memory of all participating machines as a global cache to increase performance
The addition of machines leads to an almost linear increase in performance
zFS achieves scalability by separating storage management from file management and by dynamically distributing file management. Having ObSs handle storage management implies that functions usually handled by file systems are done in the ObS itself, and are transparent to other components of zFS. The Object Store recognizes only objects, which are sparse streams of bytes. Thus, it does not distinguish between files and directories; it is the responsibility of the file system management to handle them correctly.
2.2 Architecture
zFS has six components: a Front End (FE), a Cooperative Cache (Cache), a File
Manager (FMGR), a Lease Manager (LMGR), a Transaction Server (TSVR), and
an Object Store (ObS). These components work together to provide applications
or users with a distributed file system. Now we describe the functionality of
each component and how it interacts with the other components.
2.2.1 Object Store
The object disk (ObS) is the storage device on which files and directories are
created, and from where they are retrieved. The ObS API enables creation and
deletion of objects (files), and writing and reading byte-ranges from the object.
Object disks provide file abstractions, security, safe writes and other capabilities.
Using object disks allows zFS to focus on management and scalability issues,
while letting the ObS handle the physical disk chores of block allocation and
mapping.
2.2.2 Front End
The zFS front-end (FE) runs on every workstation on which a client wants to use
zFS. It presents the client with the standard POSIX file system API and provides
access to zFS files and directories.
2.2.3 Lease Manager
The need for a Lease Manager (LMGR) stems from the following facts:
File systems use one form or another of locking mechanism to control access
to the disks in order to maintain data integrity when several users work on the
same files.
To work in SAN file systems where clients can write directly to object disks, the
ObSs themselves have to support some form of locking. Otherwise, two clients
could damage each other’s data.
In distributed environments, where network connections and even machines themselves can fail, it is preferable to use leases rather than locks. Leases are locks with an expiration period that is set up in advance. Thus, when a machine holding a lease on a resource fails, we are able to acquire a new lease after the lease of the failed machine expires. Obviously, the use of leases incurs the overhead of lease renewal on the client that acquired the lease and still needs the resource.
To reduce the overhead of the ObS, the following mechanism is used:
Each ObS maintains one major lease for the whole disk. Each ObS also has one
lease manager (LMGR) which acquires and renews the major lease. Leases for
specific objects (files or directories) on the ObS are managed by the ObS’s
LMGR. Thus, the majority of lease management overhead is offloaded from the
ObS, while still maintaining the ability to protect data. The ObS stores in memory
the network address of the current holder of the major lease. To find out which machine is currently managing a particular ObS O, a client simply asks O for the network address of its current LMGR.
The lease manager, after acquiring the major lease, grants exclusive leases on objects residing on the ObS. It also maintains in memory the current network address of each object-lease owner, which allows looking up file managers. Any machine that needs to access an object obj on ObS O first figures out who its LMGR is. If one exists, the object lease for obj is requested from the LMGR. If one does not exist, the requesting machine creates a local instance of an LMGR to manage O for it.
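The lease scheme above can be sketched in a few lines. This is an illustrative model, not the zFS implementation: the class names, the single-process design, and the duration values are our own. The key property is that a lease held by a failed machine simply expires, after which another machine can acquire it.

```python
import time

class Lease:
    """A lock with an expiration period, set up in advance when granted."""
    def __init__(self, holder, duration):
        self.holder = holder                  # network address of the holder
        self.expires_at = time.time() + duration

    def valid(self):
        return time.time() < self.expires_at

    def renew(self, duration):
        # The holder pays this renewal overhead while it still needs the resource.
        self.expires_at = time.time() + duration

class LeaseManager:
    """Grants exclusive per-object leases after acquiring the ObS major lease."""
    def __init__(self, duration=5.0):
        self.duration = duration
        self.leases = {}                      # object id -> Lease

    def acquire(self, obj, holder):
        lease = self.leases.get(obj)
        if lease is not None and lease.valid() and lease.holder != holder:
            return None                       # held by another, unexpired machine
        self.leases[obj] = Lease(holder, self.duration)
        return self.leases[obj]

    def current_holder(self, obj):
        lease = self.leases.get(obj)
        return lease.holder if lease is not None and lease.valid() else None
```

If node A crashes while holding a lease, no recovery protocol is needed: once the expiration period passes, `acquire` succeeds for the next requester.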
2.2.4 File Manager
Each opened file in zFS is managed by a single file manager assigned to the file
when the file is opened. The set of all currently active file managers manage all
opened zFS files. Initially, no file has an associated file-manager(FMGR). The first
machine to open a file will create an instance of a file manager for the file.
Henceforth, and until that file manager is shut-down, each lease request for any
part of the file will be mediated by that FMGR. For better performance, the first
machine to open a file, will create a local instance of the file manager for that
file.
The FMGR keeps track of each accomplished open() and read() request, and maintains information on where each file’s blocks reside in internal data structures. When an open() request arrives at the file manager, it checks whether the file has already been opened by another client (on another machine). If not, the FMGR acquires the proper exclusive lease from the lease manager and directs the request to the object disk. If the requested data resides in the cache of another machine, the FMGR directs the cache on that machine to forward the data to the requesting cache.
The file manager interacts with the lease manager of the ObS where the file resides to obtain an exclusive lease on the file. It also creates and keeps track of all range-leases it distributes. These leases are kept in internal FMGR tables, and are used to control and provide proper access to files by various clients.
2.2.5 Cooperative Cache
The cooperative cache (Cache) of zFS is a key component in achieving high
scalability. Due to the fast increase in network speed nowadays, it takes less
time to retrieve data from another machine’s memory than from a local disk.
This is where a cooperative cache is useful. When a client on machine A re-
quests a block of data via FEa and the file manager (FMGRB on machine B) real-
izes that the requested block resides in the Cache of machine M , Cachem, it
sends a message to Cachem to send the block to Cachea and updates the in-
formation on the location of that block in FMGRB .
The Cache on A then receives the block, updates its internal tables (for future
accesses to the block) and passes the data to the FEa , which passes it to the
client.
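The read path just described can be sketched as follows. This is a simplified, single-process model (in zFS these interactions are network messages); `FileManager`, `handle_read`, and the returned action tuples are hypothetical names of ours.

```python
class FileManager:
    """Tracks which nodes cache each block and redirects read requests
    (illustrative sketch of the zFS read path, not the real protocol)."""
    def __init__(self):
        self.location = {}                     # block id -> set of caching nodes

    def handle_read(self, block, requester):
        holders = self.location.get(block, set())
        if holders:
            # Some node already caches the block: tell that cache to
            # forward the block to the requester's cache.
            action = ("forward", next(iter(holders)))
        else:
            # Nobody caches it: the requester reads from the object store.
            action = ("read_osd", None)
        # Either way, record that the requester will now hold the block.
        self.location.setdefault(block, set()).add(requester)
        return action
```

The first reader goes to the OSD; every later reader is served from another machine's memory, which is what relieves the OSD of repeated reads of hot data.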
2.2.6 Transaction Server
In zFS, directory operations are implemented as distributed transactions. For example, a create-file operation includes, at the very least, (a) creating a new entry in the parent directory, and (b) creating a new file object. Each of these operations can fail independently, and the initiating host can fail as well. Such occurrences can corrupt the file system. Hence, each directory operation should be protected inside a transaction, such that in the event of failure, the consistency of the file system can be restored. This means either rolling the transaction forward or backward.
The most complicated directory operation is rename(). This requires, at the very
least, (a) locking the source directory, target directory, and file (to be moved),
(b) creating a new directory entry at the target, (c) erasing the old entry, and (d)
releasing the locks.
Since such transactions are complex, zFS uses a special component to manage
them: a transaction server (TSVR). The TSVR works on a per operation basis. It
acquires all required leases and performs the transaction. The TSVR attempts to
hold onto acquired leases for as long as possible and releases them only for the
benefit of other hosts.
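A create-file transaction of this kind can be sketched as follows. This is an illustrative rollback model, not the actual TSVR protocol: the classes stand in for the ObS and the parent directory, and all names are ours.

```python
class Store:
    """Stand-in for an object store device."""
    def __init__(self):
        self.objs = set()
        self.next_id = 0

    def create_object(self):
        self.next_id += 1
        self.objs.add(self.next_id)
        return self.next_id

    def delete_object(self, obj):
        self.objs.discard(obj)

class Directory:
    """Stand-in for a parent directory object."""
    def __init__(self, fail=False):
        self.entries = {}
        self.fail = fail                       # simulate a failing step (a)

    def add_entry(self, name, obj):
        if self.fail:
            raise RuntimeError("directory update failed")
        self.entries[name] = obj

def create_file(parent, name, store, log):
    """Create-file as a transaction: if step (a) fails after step (b)
    succeeded, the new object is deleted (the transaction rolls backward)."""
    obj = store.create_object()                # step (b): create the file object
    log.append(("created_object", obj))
    try:
        parent.add_entry(name, obj)            # step (a): entry in the parent dir
    except Exception:
        store.delete_object(obj)               # undo step (b): restore consistency
        log.append(("rolled_back", obj))
        raise
    return obj
```

Either both steps take effect or neither does, so a failure between them cannot leave an orphan object or a dangling directory entry.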
Figure 1: zFS Architecture
Each participating node runs the front end and the cooperative cache. Each OSD has only one lease manager associated with it. Several file managers and transaction managers run on various nodes in the cluster.
Chapter 3
Cooperative Cache
3.1 Working of Cooperative Cache
In zFS, the cooperative cache is integrated with the Linux kernel page cache for two main reasons. First, the operating system does not have to run two separate caches with different cache policies, which may interfere with each other. Second, it provides comparable local performance between zFS and other local file systems supported by Linux. All the supported file systems use the kernel page cache.
As a result, the researchers achieved the following:
The kernel invokes page eviction according to its internal algorithm, when free available memory is low. There is no need for a special zFS mechanism to detect this.
Caching is done on a per-page basis, not on whole files.
The pages of zFS and other file systems are treated equally by the kernel algorithm, regardless of the file system type, leading to fairness between the file systems.
When a file is closed, its pages remain in the cache until memory pressure causes the kernel to discard them.
When eviction is invoked and a zFS page is the candidate for eviction, the decision is passed to a specific zFS routine, which decides whether to forward the page to the cache of another node or to discard it.
The implementation of the zFS page cache supports the following optimizations:
When an application writes a whole page of a zFS file, no read is done from the OSD; only the write lease is acquired.
If one application or user on a machine has a write lease, all other applications/users on that machine can read and write the page using the same lease, without requesting another lease from the file manager. The kernel checks the permission to read/write based on the permissions specified in the mode (read, write, or both) parameter when the file is opened. If the mode bits allow the operation, zFS allows it.
When a client has a write lease and another client requests a read lease for the same page, the page is written to the object store device if it has been modified, and the first client’s lease is downgraded from a write to a read lease without discarding the page. This increases the probability of a cache hit for a client requesting the same page, thus increasing performance.
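The third optimization can be sketched as follows. This is a simplified model in our own naming (`Page`, `OSD`, `grant_read`): the dirty page is flushed once and the write lease is downgraded rather than revoked, so the old holder's copy stays cached.

```python
class Page:
    def __init__(self, data, dirty=False):
        self.data = data
        self.dirty = dirty

class OSD:
    """Stand-in for the object store device."""
    def __init__(self):
        self.writes = 0

    def write(self, page):
        self.writes += 1

def grant_read(page, leases, requester, osd):
    """Grant a read lease on a page that another client holds a write lease on.
    leases maps client -> 'read' or 'write'."""
    for holder, kind in list(leases.items()):
        if kind == "write":
            if page.dirty:
                osd.write(page)          # flush the modified page first
                page.dirty = False
            leases[holder] = "read"      # downgrade; the page is NOT discarded
    leases[requester] = "read"
    return leases
```

Keeping the downgraded holder's page in memory is what raises the cache-hit probability the text mentions: a later read on that node is served locally.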
3.2 Cooperative Cache Algorithm
Here a data block is considered to be a page. Each page that exists in the cooperative cache is said to be either singlet or replicated. A singlet page is present in the memory of only one node in the connected network, while a replicated page is present in the memory of several nodes.
When a client A wants to open a file for reading, its local cache is checked for the page. In case of a cache miss, zFS requests the page and its read lease from the zFS file manager. The file manager checks whether a range of pages starting with the requested page has already been read into the memory of another machine in the network. If not, it grants the leases to client A, which enable the client to read the range of pages directly from the OSD. Client A then reads the range of pages from the OSD, marking each page as a singlet (as A is the only node holding this range of pages in its cache). If the file manager finds that the requested range of pages resides in the memory of some other node, say B, it sends a message to B requesting that B send the range of pages and leases to A. In this case, the file manager records internally that A also holds this particular range, and both A and B mark the pages as replicated. Node B is called a third-party node, since A gets the requested data not from the OSD but from a third party.
When memory becomes scarce on a client, the Linux kernel invokes the kswapd() daemon, which scans and discards inactive pages from the client's memory. In the modified kernel, if the candidate page is a replicated zFS page, a message is sent to the zFS file manager indicating that machine A no longer holds the page, and the page is discarded.
If the zFS page is a singlet, the page is forwarded to another node using the following steps:
1. A message is sent to the zFS file manager indicating that the page is being sent to another machine B, the node with the largest free memory known to A.
2. The page is forwarded to B.
3. The page is discarded from the page cache of A.
zFS uses a recirculation counter: if a singlet page has not been accessed after two hops, it is discarded. Once the page has been accessed, the recirculation counter is reset. When a file manager is notified about a discarded page, it updates the lease and page location and checks whether the page has become a singlet. If only one other node N holds the page, the file manager sends a singlet message to N to that effect.
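The eviction decision and the three forwarding steps above can be sketched as follows (a simplified model; all names are ours). Note that the file manager is notified in step 1 before the page is forwarded in step 2; the node-failure discussion relies on exactly this ordering.

```python
from dataclasses import dataclass, field

@dataclass
class Page:
    replicated: bool = False
    hops: int = 0        # recirculation counter; reset whenever the page is accessed

@dataclass
class Node:
    name: str
    free_memory: int
    pages: list = field(default_factory=list)

    def receive(self, page):
        self.pages.append(page)

def evict(page, notify, nodes):
    """Decide the fate of a candidate zFS page on eviction.
    notify(kind, target) stands in for a message to the file manager."""
    if page.replicated or page.hops >= 2:
        # Others hold a copy, or the singlet recirculated twice unaccessed.
        notify("discard", None)
        return "discarded"
    target = max(nodes, key=lambda n: n.free_memory)
    notify("move", target.name)       # step 1: tell the file manager first
    target.receive(page)              # step 2: forward the page to B
    page.hops += 1
    return "forwarded"                # step 3: the evicting node drops its copy
```

Choosing the node with the largest known free memory spreads forwarded singlets toward nodes least likely to be under memory pressure themselves.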
The effects of node failure and network delays are also considered in this algorithm.
3.2.1 Node Failure
To handle node failure, the researchers take the approach that it is acceptable for the file manager to assume the existence of pages on nodes even if this is not true, but it is unacceptable to have pages on nodes the file manager is unaware of. If the file manager is wrong in its assumption that a page exists on a node, its request will be rejected and it will eventually update its records. However, pages residing on nodes the file manager does not know about may cause data corruption, and thus are not allowed.
Because of this, the order of the steps for forwarding a singlet page to another node is important, and must be followed as described above.
1. Node fails before Step 1: The file manager will eventually detect this and update its data to reflect that the respective node does not hold pages and leases. If the node fails before executing Step 1, it neither notifies the file manager nor forwards the page. Thus, we end up with a situation where the file manager assumes the page exists on node A, although it does not. This is acceptable, since it can be corrected without data corruption.
2. Node fails after Step 1: In this case, the file manager is informed that the page is on B, but node A may have crashed before it was able to forward the page to B. Again, we have a situation where the file manager assumes the page is on B, although in reality that is not true.
3. Failure after Step 2 does not pose any problem, since the page has already reached B.
3.2.2 Network Delays
The following cases of network delays are considered:
1. The first case is where a replicated page residing on two nodes M and N is discarded from the memory of M. When the zFS file manager sees that the page has become a singlet and now resides only in the memory of N, it sends a message to N with this information. However, due to network delays, this message may arrive after memory pressure has developed on N. On node N the page is still marked as replicated, while in reality it is a singlet and should have been forwarded to another node.
This situation is handled as follows: if a singlet message arrives at N and the page is not in the cache of N, the cooperative cache algorithm on N ignores the singlet message. Because the file manager still believes that the page resides on N, it may ask N to forward the page to a requesting client B. In this case, N sends a reject message back to the file manager. Upon receiving a reject message, the file manager updates its internal tables and retries to respond to the request from B, either by finding another client that has read the page from the OSD in the meantime, or by telling B to read the page from the OSD. In such cases, network delays cause performance degradation, but not inconsistency.
2. Another possible scenario is that no memory pressure occurred on N, the forwarded page has not arrived yet, and a singlet message arrived and was ignored. The file manager then asked N to forward the page to B, and N sent a reject message back to the file manager. If the page never arrives at N, due to sender failure or network failure, there is no problem.
However, if the page arrives after the reject message was sent, a consistency problem may occur if a write lease exists. Because the file manager is not aware of the page on N, another node may get the write lease and the page from the OSD. This leaves two clients on two different nodes holding the same page with write leases.
To avoid this situation, a reject list is kept on node N, recording the pages (and their corresponding leases) that were requested but rejected. When a forwarded page arrives at N and the page is on the reject list, the page and its entry on the reject list are discarded, thus keeping the information in the file manager accurate. The reject list is scanned periodically (by the FE), and each entry whose time on the list exceeds T is deleted. T is the maximum time it can take a page to reach its destination node, and is determined experimentally depending on the network topology.
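The reject list can be sketched as follows (illustrative; `RejectList` and its method names are ours, with T the experimentally determined transit-time bound from the text):

```python
import time

class RejectList:
    """Per-node record of pages that were requested but rejected, so a
    late-arriving forwarded page can be recognized and dropped."""
    def __init__(self, max_transit_time):
        self.T = max_transit_time        # upper bound on page transit time
        self.entries = {}                # page id -> time of rejection

    def record_reject(self, page_id):
        self.entries[page_id] = time.time()

    def on_page_arrival(self, page_id):
        """Return True if the arriving page must be discarded."""
        if page_id in self.entries:
            del self.entries[page_id]    # drop the page and its list entry
            return True
        return False

    def expire(self):
        """Periodic scan (done by the FE): delete entries older than T."""
        now = time.time()
        self.entries = {p: t for p, t in self.entries.items() if now - t <= self.T}
```

Discarding a forwarded page that was already rejected keeps the file manager's tables accurate, so no second write lease can coexist with an unknown cached copy.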
An alternative method for handling these network delay issues is to use a complicated synchronization mechanism to keep track of the state of each page in the cluster. This was deemed unacceptable for two reasons. First, it incurs overhead from extra messages, and second, the synchronization delays the kernel when it needs to evict pages quickly.
3. Another problem caused by network delays arises when node N notifies the zFS file manager upon forwarding a page to M, and M does the same upon forwarding the page to O, but the link from N to the file manager is slow compared to the other links. The file manager may then receive the message that the page was moved from M to O before receiving the message that the singlet page was moved from N to M. Moreover, the file manager does not have in its records that this specific page and lease reside on M. The problem is further complicated by the fact that M may decide to discard the page, and this notification may arrive at the file manager before the move notification.
To solve this problem, the following data structures are used:
Each lease on a node has a hop_count, which counts the number of times the lease and its corresponding page were moved from one node to another. Initially, when the page is read from the OSD, the hop_count in the corresponding lease is set to zero; it is incremented whenever the lease and page are transferred to another node.
When a node initiates a move, the move notification passed to the file manager includes the hop_count and the target_node.
Two fields are reserved in each lease record in the file manager's internal tables for handling move notification messages: last_hop_count, initially set to -1, and target_node, initially set to NULL.
Figure 2: Delayed Move Notification Messages
If message (3) arrives first, its hop count and target node are saved in the lease record, since node M is not registered as holding the lease and page. When message (1) arrives, N is the registered node; therefore, the lease is “moved” to the target node stored in the target_node field, by updating the information stored in the internal tables of the file manager. If message (5) arrives after message (3), the information from message (5), which has the larger hop count, is stored and used when message (1) arrives. If message (3) arrives after message (5), it is ignored due to its smaller hop count. In other words, using the hop count enables late messages that have become irrelevant to be ignored.
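The hop-count bookkeeping can be sketched as follows. This is a simplified model of the lease record described above; the method name `on_move` and the replay logic's exact shape are ours.

```python
class LeaseRecord:
    """File manager's per-lease record, ordering move notifications
    that may arrive out of order over slow links."""
    def __init__(self, holder):
        self.holder = holder            # node registered as holding the lease
        self.last_hop_count = -1        # highest hop count seen early
        self.target_node = None         # target of that early move

    def on_move(self, source, hop_count, target):
        if source == self.holder:
            # In-order notification: move the lease to the target, then
            # replay any later move whose notification arrived early.
            self.holder = target
            if self.last_hop_count > hop_count and self.target_node is not None:
                self.holder = self.target_node
                self.last_hop_count, self.target_node = -1, None
        elif hop_count > self.last_hop_count:
            # Early notification from an unregistered node: save it.
            self.last_hop_count, self.target_node = hop_count, target
        # else: a late message with a smaller hop count is ignored.
```

With the saved hop count, an early "M moved the page to O" is parked until the delayed "N moved the page to M" arrives, at which point the record jumps straight to the latest known holder.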
4. Suppose the page was moved from N to M and then to O, where it was discarded because memory pressure developed on O and its recirculation count exceeded its limit. O then sends a release_lease message, which arrives at the file manager before the move notifications.
This case is resolved as follows: since O is not registered as holding the page and lease, the release_lease message is placed on a pending queue and a flag is raised in the lease record. When the move operation is resolved and this flag is set, the release_lease message is moved to the input queue and executed.
3.3 Choosing the Proper Third Party Node
The zFS file manager uses an enhanced round robin method to choose the
third-party node, which holds a range of pages starting with the requested page.
For each range granted to a node N, the file manager records the time it was
granted, t(N). When a request arrives, the file manager scans the list of all nodes
holding a potential range, N0…Nk. For each node Ni, the file manager checks
whether currentTime - t(Ni) > C, i.e., whether enough time has passed for the
range of pages granted to Ni to actually reach that node. If so, Ni is marked as a
potential provider for the requested range; in either case, the scan proceeds to
the next node, Ni+1. Once all nodes are checked, the marked node with the
largest range, Nmax, is chosen. The next time the file manager is asked for a
page and lease, it starts the scan from node Nmax+1. This algorithm achieves
two goals. First, no single node is overloaded with requests and becomes a
bottleneck. Second, the pages reside at the chosen node, thus reducing the
probability of reject messages.
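The selection can be sketched as below. The data structure, the constant C, and the timestamps are illustrative assumptions; the grant-age check and the rotating scan start are the mechanism described above:

```python
import time

class RangeGrant:
    def __init__(self, node, range_len, granted_at):
        self.node = node
        self.range_len = range_len    # pages held starting at the requested page
        self.granted_at = granted_at  # t(N): time the range was granted

def choose_third_party(grants, scan_start, settle_time_c, now=None):
    """Pick a provider among the nodes holding a potential range.

    grants: list of RangeGrant, in round-robin order.
    scan_start: index at which to start scanning (rotates between calls).
    settle_time_c: C, the minimum grant age for pages to have arrived.
    Returns (chosen_node, next_scan_start), or (None, scan_start) if no
    node qualifies.
    """
    now = time.time() if now is None else now
    best, best_idx = None, None
    n = len(grants)
    for step in range(n):
        i = (scan_start + step) % n
        g = grants[i]
        # Only mark nodes whose grant is old enough to have reached them.
        if now - g.granted_at > settle_time_c:
            if best is None or g.range_len > best.range_len:
                best, best_idx = g, i
    if best is None:
        return None, scan_start
    # The next request starts scanning just past the chosen node (Nmax + 1).
    return best.node, (best_idx + 1) % n

grants = [RangeGrant("A", 4, granted_at=100.0),
          RangeGrant("B", 8, granted_at=100.0),
          RangeGrant("C", 16, granted_at=119.5)]  # granted too recently
node, nxt = choose_third_party(grants, scan_start=0, settle_time_c=1.0, now=120.0)
```

Here C's large range is skipped because its grant is too recent, so B, the marked node with the largest range, is chosen, and the next scan will start just past it.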
3.4 Pre-fetching Data in zFS
The Linux kernel uses a read ahead mechanism to improve file reading perform-
ance. Based on the read pattern of each file, the kernel dynamically calculates
how many pages to read ahead, n, and invokes the readpage() routine n times.
This method of operating is not efficient when the pages are transmitted over
the network. The overhead for transmitting a data block is composed of two
parts: the network setup overhead and the transmission time of the data block
itself. For comparatively small blocks, the setup overhead is a significant part of
the total overhead.
Intuitively, it seems more efficient to transmit k pages in one message rather
than in k separate messages, since the setup overhead is amortized over the k
pages.
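The amortization argument can be made concrete with a toy cost model. The setup and per-page times below are made-up numbers for illustration, not measurements from the experiment:

```python
import math

def transfer_time(total_pages, pages_per_msg, setup, per_page):
    """Total time to move a file of total_pages pages, k pages per message."""
    messages = math.ceil(total_pages / pages_per_msg)
    return messages * setup + total_pages * per_page

# Hypothetical costs: 200 us setup per message, 30 us to transmit one page.
N, setup, per_page = 1024, 200e-6, 30e-6
t1 = transfer_time(N, 1, setup, per_page)  # one page per message
t8 = transfer_time(N, 8, setup, per_page)  # eight pages per message
# With k=8 the setup cost is paid 128 times instead of 1024 times.
```

Note this linear model would predict that ever-larger k is always better; the measured optimum was k=4 or k=8 because, as described next, blocks larger than the L2 cache slow TCP down, which the model does not capture.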
To confirm this, the researchers wrote client and server programs that test the
time it takes to read a file residing entirely in memory from one node to another.
Using a file size of N pages, they tested reading it in chunks of 1…k pages in
each TCP message. That is, reading the file in N…N/k messages. They found
that the best results are achieved for k=4 and k=8. When k is smaller, the setup
time is significant, and when k is larger (16 and above) the size of the L2 cache
starts to affect the performance. TCP performance decreases when the trans-
mitted block size exceeds the size of the L2 cache.
Similar performance gains were achieved by the zFS pre-fetching mechanism.
When the file manager is instantiated, it is passed a pre-fetching parameter, R,
indicating the maximum range of pages to grant. When a client A requests a
page (and lease), the file manager searches for a client B having the largest
contiguous range of pages, r, starting with the requested page p, with r <= R. If such
a client B is found, the file manager sends B a message to send r pages (and
their leases) to A. The selected range r can be smaller than R if the file manager
finds a page with a conflicting lease (for a read request, a write lease; for a
write request, a read lease) before reaching the full range R. If no range
is found in any client, the file manager grants R leases to client A and instructs
A to read R pages from the OSD. The requested page may reside on client A,
while the next one resides on client B, and the next on client C. In this case, the
granted range will be only the requested page from client A. The next request
initiated by the kernel read-ahead mechanism will be granted from client B and
the next from client C. Thus, there is no interference with the kernel read ahead
mechanism. However, if the file manager finds that client A has a range of k
pages, it will ignore the subsequent requests that are initiated by the kernel
read-ahead mechanism and covered by the granted range.
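The range-granting decision can be sketched as follows. The per-client cache maps and the boolean conflict flag are illustrative simplifications of the lease tables; the contiguous-run scan bounded by R is the behavior described above:

```python
def grant_range(clients, page, R):
    """Decide where a requesting client gets pages [page, page + r) from.

    clients: dict mapping client name -> dict of page number -> has_conflict.
    Returns (source, range_len); source is a client name, or "OSD" when no
    client caches the requested page and R pages must be read from disk.
    """
    best_client, best_len = None, 0
    for name, cache in clients.items():
        # Length of the contiguous conflict-free run starting at `page`.
        r = 0
        while r < R and (page + r) in cache and not cache[page + r]:
            r += 1
        if r > best_len:
            best_client, best_len = name, r
    if best_client is None:
        return "OSD", R  # nobody holds the page: read R pages from the OSD
    return best_client, best_len

clients = {
    "B": {10: False, 11: False, 12: False, 13: True},  # page 13: conflicting lease
    "C": {10: False, 11: False},
}
```

For a request for page 10 with R=8, client B is chosen with a range of 3 (the conflicting lease on page 13 cuts the range short of R); a request for an uncached page falls through to the OSD with the full range R.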
Chapter 4
Testing
4.1 Test Environment
The zFS performance test environment consisted of a cluster of client PCs and
one server PC. Each of the PCs in the cluster had an 800 MHz Pentium III proc-
essor with 256 MB memory, 256 KB L2 cache and 15 GB IDE (Integrated Drive
Electronics) disks. All of the PCs in the cluster ran the Linux operating system.
The kernel was a modified 2.4.19 kernel with VFS (Virtual File System) imple-
menting zFS and some patches to enable the integration of zFS with the kernel's
page cache. The server PC had a 2 GHz Pentium 4 processor with 512 MB
memory and a 30 GB IDE disk, running a vanilla Linux kernel. The server PC ran a
simulator of the Antara OSD when the researchers tested zFS performance and
ran an NFS (Network File System) server when they compared the results to
NFS. The PCs in the cluster and the server PC were connected via 1 Gbit LAN.
The zFS front end (client) was implemented as a kernel loadable module, while
all other components were implemented as user mode processes. The
file manager and lease manager were fully implemented. The transaction man-
ager implemented all operations in memory, without writing the log to the OSD.
However, this fact does not influence the results because only the results of
read operations using cooperative cache are recorded and not the meta data
operations.
To begin testing zFS, the researchers configured the system much like a SAN
(Storage Area Network) file system. The server PC ran an OSD simulator, a
separate PC ran the lease manager, file manager and transaction manager proc-
esses (thus acting as a meta data server) and four PCs ran the zFS front-end.
When testing NFS performance, they configured the system differently. The
server PC ran an NFS server with eight NFS daemons (nfsd) and the four PCs
ran NFS clients. The final results are an average over several runs, where the
caches of the machines were cleared before each run.
To evaluate zFS performance relative to an existing file system, the researchers
compared it to the widely-used NFS system, using the IOZONE benchmark.
IOZONE is a filesystem benchmark tool: it generates and measures a variety of
file operations and is useful for performing a broad filesystem analysis of a
computer platform, testing I/O performance for operations such as read and
write.
The comparison to NFS was difficult because NFS does not carry out
pre-fetching. To compensate for this, IOZONE was configured to read the
NFS-mounted file using record sizes of n=1, 4, 8, 16 pages, and its performance
was compared with reading zFS-mounted files using a record size of one page
but with pre-fetching parameter R=1, 4, 8, 16 pages.
4.2 Comparing zFS and NFS
The primary aim of this research was to test whether and how much perform-
ance is gained when the total amount of free memory in the cluster exceeds the
server’s cache size. To this end, two scenarios were investigated. In the first
one, the file size was smaller than the cache of the server and all the data re-
sided in the server’s cache. In the second, the file size was much larger than the
size of the server’s cache. The results appear in Figure 5 and Figure 6 below,
respectively.
Figure 5: Performance results for large server cache (256 MB file, 512 MB server memory)
Figure 6: Performance results for small server cache (1 GB file, 512 MB server memory)
In both the cases, it was observed that the performance of NFS was almost the
same for different block sizes. However, the absolute performance is greatly
influenced by the data size relative to the available memory. When the file fits
entirely into memory, the performance of NFS is almost four times better than
when the file is larger than the available memory.
When the file fits entirely into memory (Figure 5), the performance of zFS with
cooperative cache is much better than that of NFS. But when cooperative cache
was deactivated, different behavior was observed for a range of one page
compared to larger ranges. This is because of the extra messages passed
between the clients and the file manager; hence, the performance of zFS for
R=1 is lower than that of NFS. However, for larger ranges, there are fewer
messages to the file manager (due to pre-fetching in zFS) and the performance of
zFS was slightly better than that of NFS.
The researchers also observed that when cooperative cache was used, the
performance for a range of 16 was lower than for ranges of 4 and 8. Because
IOZONE starts the requests of each client with a fixed time delay relative to the
other clients, each new request was for a different 256 KB of data: four clients
with 16 pages (of 4 KB) each give 256 KB, the size of the L2 cache. Since
almost the entire file is in memory, the L2 cache is flushed and reloaded for
each newly granted request, resulting in reduced performance.
When the server’s cache was smaller than the requested data, it was expected
that memory pressure would occur in the server (NFS and OSD) and that the
server’s local disk would be used. In such a case, the anticipation that the
cooperative cache would exhibit improved performance proved to be correct.
The results are shown in Figure 6.
We can see that zFS performance, when cooperative cache is deactivated, is
lower than that of NFS but it gets better for larger ranges. When the coopera-
tive cache is active, zFS performance is significantly better than NFS and in-
creases with increasing range.
The performance with cooperative cache enabled is lower in this case than
when the file fits into memory. Because the file was larger than the available
memory, the clients suffered memory pressure, discarded pages, and responded
to the file manager with reject messages. Thus, sending data blocks to clients
was interleaved with reject messages to the file manager, and the probability
that the requested data was in memory was smaller than when the file was
almost entirely in memory.
Conclusion
The results show that using the caches of all the clients as one cooperative
cache gives better performance than NFS, as well as than the case where the
cooperative cache is not used; this is evident even when pre-fetching with a
range of one page. The results also show that pre-fetching with ranges of four
and eight pages yields much better performance. In zFS, the selection of the
target node for forwarding pages during page eviction is done by the file
manager, which chooses the node with the largest amount of free memory as
the target node. However, the file manager chooses target nodes only from
among those interacting with it; there may be an idle machine with a large
amount of free memory that is not connected to this file manager and thus
will not be used.
Bibliography
A. Teperman and A. Weit. "Improving Performance of a Distributed File System Using OSDs and Cooperative Cache." IBM Haifa Labs, Haifa University Campus, Mount Carmel, Haifa 31905, Israel.
O. Rodeh and A. Teperman. "zFS - A Scalable Distributed File System Using Object Disks." In Proceedings of the IEEE Mass Storage Systems and Technologies Conference, pages 207-218, San Diego, CA, USA, 2003.
T. Cortes, S. Girona and J. Labarta. "Avoiding the Cache Coherence Problem in a Parallel/Distributed File System." Departament d'Arquitectura de Computadors, Universitat Politecnica de Catalunya, Barcelona.
M. D. Dahlin, R. Y. Wang, T. E. Anderson and D. A. Patterson. "Cooperative Caching: Using Remote Client Memory to Improve File System Performance." In Proceedings of the First Symposium on Operating Systems Design and Implementation, 1994.
V. Drezin, N. Rinetzky, A. Tavory and E. Yerushalmi. "The Antara Object-disk Design." Technical report, IBM Research Labs, Haifa University Campus, Mount Carmel, Haifa, Israel, 2001.
Z. Dubitzky, I. Gold, E. Henis, J. Satran and D. Sheinwald. "DSF - Data Sharing Facility." Technical report, IBM Research Labs, Haifa University Campus, Mount Carmel, Haifa, Israel, 2000. http://www.haifa.il.ibm.com/projects/systems/dsf.html
IOzone filesystem benchmark. http://iozone.org/
Lustre whitepaper. http://www.lustre.org/docs/whitepaper.pdf
P. Sarkar and J. Hartman. "Efficient Cooperative Caching Using Hints." Department of Computer Science, University of Arizona, Tucson.