Improving Performance of a Distributed File System Using OSDs and Cooperative Cache
PROJECT REPORT SUBMITTED IN PARTIAL FULFILLMENT FOR THE
DEGREE OF
B.Sc(H) Computer Science
Hans Raj College
University of Delhi
Delhi – 110 007
India
Submitted by:
Parvez Gupta (Roll No. 6010027)
Varenya Agrawal (Roll No. 6010044)
Certificate
This is to certify that the project work entitled “Improving Performance of a Distributed File System Using OSDs and Cooperative Cache”, being submitted by Parvez Gupta and Varenya Agrawal in partial fulfillment of the requirement for the award of the degree of B.Sc. (Hons) Computer Science, University of Delhi, is a record of work carried out under the supervision of Ms. Baljeet Kaur at Hans Raj College, University of Delhi, Delhi.
It is further certified that we have not submitted this report to any other organization for any other degree.
Parvez Gupta
Roll No: - 6010027
Varenya Agrawal
Roll No: - 6010044
Project Supervisor: Ms. Baljeet Kaur, Dept. of Computer Science, Hans Raj College, University of Delhi
Principal: Dr. S.R. Arora, Hans Raj College, University of Delhi
Acknowledgment
We would sincerely like to thank Ms. Baljeet Kaur for her invaluable support and guidance in carrying out this project to successful completion. We would also like to thank the Head of the Computer Science Department, Ms. Harmeet Kaur, who was always there with her invaluable knowledge and experience that helped us greatly during the research work. We would also like to extend our gratitude and special thanks to Mr. I.P.S. Negi, Mr. Sanjay Narang and Ms. Anita Mittal for their help in the computer laboratory.
Lastly, we would like to thank all our friends and well-wishers who directly or indirectly contributed to the successful completion of the project.
Table of Contents
List of Figures 3
Chapter 1 4
Introduction 4
1.1 Background 5
1.2 About the Work 6
Chapter 2 8
z-Series File System 8
2.1 Prominent Features 9
2.2 Architecture 10
2.2.1 Object Store 10
2.2.2 Front End 11
2.2.3 Lease Manager 11
2.2.4 File Manager 12
2.2.5 Cooperative Cache 13
2.2.6 Transaction Server 13
Chapter 3 16
Cooperative Cache 16
3.1 Working of Cooperative Cache 17
3.2 Cooperative Cache Algorithm 18
3.2.1 Node Failure 20
3.2.2 Network Delays 21
3.3 Choosing the Proper Third Party Node 25
3.4 Pre-fetching Data in zFS 26
Chapter 4 28
Testing 28
4.1 Test Environment 29
4.2 Comparing zFS and NFS 32
Conclusion 35
Bibliography 36
List of Figures
Figure 1: zFS Architecture 15
Figure 2: Delayed Move Notification Messages 24
Figure 3: System configuration for testing zFS performance 31
Figure 4: System configuration for testing NFS performance 31
Figure 5: Performance results for large server cache 32
Figure 6: Performance results for small server cache 33
Chapter 1
Introduction
1.1 Background
As computer networks started to evolve in the 1980s it became evident that the
old file systems had many limitations that made them unsuitable for multiuser
environments.
In the beginning, many users started to use FTP to share files. Although this
method avoided the time consuming physical movement of removable media,
files still needed to be copied twice: once from the source computer onto a
server, and a second time from the server onto the destination computer.
Additionally, users had to know the physical addresses of every computer involved in
the file sharing process.
As computer companies tried to solve the shortcomings above, distributed file
systems were developed and new features such as file locking were added to
existing file systems. The new systems were not replacements for the old file
systems, but an additional layer between the disk file system and the user
processes.
In a Distributed File System (DFS) a single file system can be distributed across
several physical computer nodes. Separate nodes have direct access to only a
part of the entire file system. With DFS, system administrators can make files
distributed across multiple servers appear to users as if they reside in one place
on the network.
zFS (z-Series File System), a distributed file system developed by IBM, is used in the z/OS operating system. zFS evolved from the DSF (Data Sharing Facility) project, which aimed at building a server-less file system that distributes all aspects of file and storage management over cooperating machines interconnected by a fast switched network. zFS was designed to be a scalable file system that operates equally well on a few machines or on thousands of machines, and in which the addition of new machines leads to a linear increase in performance.
1.2 About the Work
This work describes a cooperative cache algorithm used in zFS, which can withstand network delays and node failures. The work explores the effectiveness of this algorithm and of zFS as a file system. This is done by comparing the system’s performance to NFS using the IOZONE benchmark. The researchers have also investigated whether using a cooperative cache results in better performance, despite the fact that the object store devices have their own caches. Their results show that zFS performs better than NFS when the cooperative cache is activated, and that zFS provides better performance even though the OSDs have their own caches. They have also demonstrated that using pre-fetching in zFS increases performance significantly. Thus, zFS performance scales well as the number of participating clients increases.
There are several other related works that have researched cooperative caching in network file systems. Another file system, xFS, uses a central server to coordinate between the various clients, and the load on the server increases as the number of clients increases. Thus, the scalability of xFS is limited by the strength of the server. However, xFS is more scalable than AFS and NFS due to four different caching techniques it uses, which contribute significantly to reducing the server load.
There are three major differences between the zFS architecture and the xFS architecture:
zFS does not have a central server, and the management of files is distributed among several file managers. There is no hierarchy of cluster servers; if two clients work on the same file they interact with the same file manager.
In zFS, caching is done on a per-page basis rather than on whole files. This increases sharing, since different clients can work on different parts of the same file.
In zFS, no caching is done on the local disk.
Thus, zFS is more scalable because it has no central server, and file managers can dynamically be added or removed to respond to load changes in the cluster. Moreover, performance is better due to zFS’s stronger sharing capability. zFS does not have a central server that can become a bottleneck. All control information is exchanged between clients and file managers. The set of file managers dynamically adapts itself to the load on the cluster. Clients in zFS pass only data among themselves (in cooperative cache mode).
Chapter 2
z-Series File System
2.1 Prominent Features
zFS is a scalable file system which uses Object Store Devices (OSD) and a set of
cooperative machines for distributed file management. These are its two most
prominent features.
zFS integrates the memory of all participating machines into one coherent cache. Thus, instead of going to the disk for a block of data that is already in one of the machines’ memories, zFS retrieves the data block from the remote machine. To maintain file system consistency, zFS uses distributed transactions and leases to implement metadata operations and to coordinate shared access to data. zFS achieves its high performance and scalability by avoiding group-communication mechanisms and clustering software, using distributed transactions and leases instead.
The design and implementation of zFS is aimed at achieving a scalable file system beyond those that exist today. More specifically, the objectives of zFS are:
A file system that operates equally well on a few or on thousands of machines
Built from off-the-shelf components with object disks (ObSs)
Makes use of the memory of all participating machines as a global cache to increase performance
The addition of machines leads to an almost linear increase in performance
zFS achieves scalability by separating storage management from file management and by dynamically distributing file management. Having ObSs handle storage management implies that functions usually handled by file systems are done in the ObS itself, and are transparent to other components of zFS. The Object Store recognizes only objects, which are sparse streams of bytes. Thus, it does not distinguish between files and directories; it is the responsibility of the file system management to handle them correctly.
2.2 Architecture
zFS has six components: a Front End (FE), a Cooperative Cache (Cache), a File
Manager (FMGR), a Lease Manager (LMGR), a Transaction Server (TSVR), and
an Object Store (ObS). These components work together to provide applications
or users with a distributed file system. Now we describe the functionality of
each component and how it interacts with the other components.
2.2.1 Object Store
The object disk (ObS) is the storage device on which files and directories are
created, and from where they are retrieved. The ObS API enables creation and
deletion of objects (files), and writing and reading byte-ranges from the object.
Object disks provide file abstractions, security, safe writes and other capabilities.
Using object disks allows zFS to focus on management and scalability issues,
while letting the ObS handle the physical disk chores of block allocation and
mapping.
2.2.2 Front End
The zFS front-end (FE) runs on every workstation on which a client wants to use
zFS. It presents the client with the standard POSIX file system API and provides
access to zFS files and directories.
2.2.3 Lease Manager
The need for a Lease Manager (LMGR) stems from the following facts:
File systems use one form or another of locking mechanism to control access
to the disks in order to maintain data integrity when several users work on the
same files.
To work in SAN file systems where clients can write directly to object disks, the
ObSs themselves have to support some form of locking. Otherwise, two clients
could damage each other’s data.
In distributed environments, where network connections and even machines themselves can fail, it is preferable to use leases rather than locks. Leases are locks with an expiration period that is set up in advance. Thus, when a machine holding a lease on a resource fails, we are able to acquire a new lease after the lease of the failed machine expires. Obviously, the use of leases incurs the overhead of lease renewal on the client that acquired the lease and still needs the resource.
To reduce the overhead of the ObS, the following mechanism is used:
Each ObS maintains one major lease for the whole disk. Each ObS also has one
lease manager (LMGR) which acquires and renews the major lease. Leases for
specific objects (files or directories) on the ObS are managed by the ObS’s
LMGR. Thus, the majority of lease management overhead is offloaded from the
ObS, while still maintaining the ability to protect data. The ObS stores in memory
the network address of the current holder of the major lease. To find out which machine is currently managing a particular ObS O, a client simply asks O for the network address of its current LMGR.
The lease manager, after acquiring the major lease, grants exclusive leases on objects residing on the ObS. It also maintains in memory the current network address of each object-lease owner, which allows looking up file managers. Any machine that needs to access an object obj on ObS O first figures out who its LMGR is. If one exists, the object lease for obj is requested from the LMGR. If one does not exist, the requesting machine creates a local instance of an LMGR to manage O for it.
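The lease scheme above can be sketched in a few lines. This is an illustrative model, not the zFS implementation: the class names, the single-process design, and the duration values are our own. The key property is that a lease held by a failed machine simply expires, after which another machine can acquire it.

```python
import time

class Lease:
    """A lock with an expiration period, set up in advance when granted."""
    def __init__(self, holder, duration):
        self.holder = holder                  # network address of the holder
        self.expires_at = time.time() + duration

    def valid(self):
        return time.time() < self.expires_at

    def renew(self, duration):
        # The holder pays this renewal overhead while it still needs the resource.
        self.expires_at = time.time() + duration

class LeaseManager:
    """Grants exclusive per-object leases after acquiring the ObS major lease."""
    def __init__(self, duration=5.0):
        self.duration = duration
        self.leases = {}                      # object id -> Lease

    def acquire(self, obj, holder):
        lease = self.leases.get(obj)
        if lease is not None and lease.valid() and lease.holder != holder:
            return None                       # held by another, unexpired machine
        self.leases[obj] = Lease(holder, self.duration)
        return self.leases[obj]

    def current_holder(self, obj):
        lease = self.leases.get(obj)
        return lease.holder if lease is not None and lease.valid() else None
```

If node A crashes while holding a lease, no recovery protocol is needed: once the expiration period passes, `acquire` succeeds for the next requester.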
2.2.4 File Manager
Each opened file in zFS is managed by a single file manager assigned to the file
when the file is opened. The set of all currently active file managers manage all
opened zFS files. Initially, no file has an associated file-manager(FMGR). The first
machine to open a file will create an instance of a file manager for the file.
Henceforth, and until that file manager is shut-down, each lease request for any
part of the file will be mediated by that FMGR. For better performance, the first
machine to open a file, will create a local instance of the file manager for that
file.
The FMGR keeps track of each accomplished open() and read() request, and maintains information on where each file’s blocks reside in internal data structures. When an open() request arrives at the file manager, it checks whether the file has already been opened by another client (on another machine). If not, the FMGR acquires the proper exclusive lease from the lease manager and directs the request to the object disk. If the requested data resides in the cache of another machine, the FMGR directs the cache on that machine to forward the data to the requesting cache.
The file manager interacts with the lease manager of the ObS where the file resides to obtain an exclusive lease on the file. It also creates and keeps track of all range-leases it distributes. These leases are kept in internal FMGR tables, and are used to control and provide proper access to files by various clients.
2.2.5 Cooperative Cache
The cooperative cache (Cache) of zFS is a key component in achieving high
scalability. Due to the fast increase in network speed nowadays, it takes less
time to retrieve data from another machine’s memory than from a local disk.
This is where a cooperative cache is useful. When a client on machine A re-
quests a block of data via FEa and the file manager (FMGRB on machine B) real-
izes that the requested block resides in the Cache of machine M , Cachem, it
sends a message to Cachem to send the block to Cachea and updates the in-
formation on the location of that block in FMGRB .
The Cache on A then receives the block, updates its internal tables (for future
accesses to the block) and passes the data to the FEa , which passes it to the
client.
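The read path just described can be sketched as follows. This is a simplified, single-process model (in zFS these interactions are network messages); `FileManager`, `handle_read`, and the returned action tuples are hypothetical names of ours.

```python
class FileManager:
    """Tracks which nodes cache each block and redirects read requests
    (illustrative sketch of the zFS read path, not the real protocol)."""
    def __init__(self):
        self.location = {}                     # block id -> set of caching nodes

    def handle_read(self, block, requester):
        holders = self.location.get(block, set())
        if holders:
            # Some node already caches the block: tell that cache to
            # forward the block to the requester's cache.
            action = ("forward", next(iter(holders)))
        else:
            # Nobody caches it: the requester reads from the object store.
            action = ("read_osd", None)
        # Either way, record that the requester will now hold the block.
        self.location.setdefault(block, set()).add(requester)
        return action
```

The first reader goes to the OSD; every later reader is served from another machine's memory, which is what relieves the OSD of repeated reads of hot data.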
2.2.6 Transaction Server
In zFS, directory operations are implemented as distributed transactions. For example, a create-file operation includes, at the very least, (a) creating a new entry in the parent directory, and (b) creating a new file object. Each of these operations can fail independently, and the initiating host can fail as well. Such occurrences can corrupt the file system. Hence, each directory operation should be protected inside a transaction, such that in the event of failure, the consistency of the file system can be restored. This means either rolling the transaction forward or backward.
The most complicated directory operation is rename(). This requires, at the very
least, (a) locking the source directory, target directory, and file (to be moved),
(b) creating a new directory entry at the target, (c) erasing the old entry, and (d)
releasing the locks.
Since such transactions are complex, zFS uses a special component to manage
them: a transaction server (TSVR). The TSVR works on a per operation basis. It
acquires all required leases and performs the transaction. The TSVR attempts to
hold onto acquired leases for as long as possible and releases them only for the
benefit of other hosts.
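A create-file transaction of this kind can be sketched as follows. This is an illustrative rollback model, not the actual TSVR protocol: the classes stand in for the ObS and the parent directory, and all names are ours.

```python
class Store:
    """Stand-in for an object store device."""
    def __init__(self):
        self.objs = set()
        self.next_id = 0

    def create_object(self):
        self.next_id += 1
        self.objs.add(self.next_id)
        return self.next_id

    def delete_object(self, obj):
        self.objs.discard(obj)

class Directory:
    """Stand-in for a parent directory object."""
    def __init__(self, fail=False):
        self.entries = {}
        self.fail = fail                       # simulate a failing step (a)

    def add_entry(self, name, obj):
        if self.fail:
            raise RuntimeError("directory update failed")
        self.entries[name] = obj

def create_file(parent, name, store, log):
    """Create-file as a transaction: if step (a) fails after step (b)
    succeeded, the new object is deleted (the transaction rolls backward)."""
    obj = store.create_object()                # step (b): create the file object
    log.append(("created_object", obj))
    try:
        parent.add_entry(name, obj)            # step (a): entry in the parent dir
    except Exception:
        store.delete_object(obj)               # undo step (b): restore consistency
        log.append(("rolled_back", obj))
        raise
    return obj
```

Either both steps take effect or neither does, so a failure between them cannot leave an orphan object or a dangling directory entry.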
Figure 1: zFS Architecture
Each participating node runs the front end and the cooperative cache. Each OSD has only one lease manager associated with it. Several file managers and transaction managers run on various nodes in the cluster.
Chapter 3
Cooperative Cache
3.1 Working of Cooperative Cache
In zFS, the cooperative cache is integrated with the Linux kernel page cache for two main reasons. First, the operating system does not have to run two separate caches with different cache policies, which may interfere with each other. Second, it provides comparable local performance between zFS and other local file systems supported by Linux. All the supported file systems use the kernel page cache.
As a result, the researchers achieved the following:
The kernel invokes page eviction according to its internal algorithm, when free available memory is low. There is no need for a special zFS mechanism to detect this.
Caching is done on a per-page basis, not on whole files.
The pages of zFS and other file systems are treated equally by the kernel algorithm, regardless of the file system type, leading to fairness between the file systems.
When a file is closed, its pages remain in the cache until memory pressure causes the kernel to discard them.
When eviction is invoked and a zFS page is the candidate for eviction, the decision is passed to a specific zFS routine, which decides whether to forward the page to the cache of another node or to discard it.
The implementation of the zFS page cache supports the following optimizations:
When an application writes a whole page of a zFS file, no read is done from the OSD; only the write lease is acquired.
If one application or user on a machine has a write lease, all other applications/users on that machine can read and write the page using the same lease, without requesting another lease from the file manager. The kernel checks the permission to read/write based on the permissions specified in the mode (read, write, or both) parameter when the file is opened. If the mode bits allow the operation, zFS allows it.
When a client has a write lease and another client requests a read lease for the same page, the page is written to the object store device if it has been modified, and the first client’s lease is downgraded from a write to a read lease without discarding the page. This increases the probability of a cache hit for a client requesting the same page, thus increasing performance.
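The third optimization can be sketched as follows. This is a simplified model in our own naming (`Page`, `OSD`, `grant_read`): the dirty page is flushed once and the write lease is downgraded rather than revoked, so the old holder's copy stays cached.

```python
class Page:
    def __init__(self, data, dirty=False):
        self.data = data
        self.dirty = dirty

class OSD:
    """Stand-in for the object store device."""
    def __init__(self):
        self.writes = 0

    def write(self, page):
        self.writes += 1

def grant_read(page, leases, requester, osd):
    """Grant a read lease on a page that another client holds a write lease on.
    leases maps client -> 'read' or 'write'."""
    for holder, kind in list(leases.items()):
        if kind == "write":
            if page.dirty:
                osd.write(page)          # flush the modified page first
                page.dirty = False
            leases[holder] = "read"      # downgrade; the page is NOT discarded
    leases[requester] = "read"
    return leases
```

Keeping the downgraded holder's page in memory is what raises the cache-hit probability the text mentions: a later read on that node is served locally.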
3.2 Cooperative Cache Algorithm
Here a data block is considered to be a page. Each page that exists in the cooperative cache is said to be either singlet or replicated. A singlet page is present in the memory of only one node in the connected network, while a replicated page is present in the memory of several nodes.
When a client A wants to open a file for reading, its local cache is checked for the page. In case of a cache miss, zFS requests the page and its read lease from the zFS file manager. The file manager checks whether a range of pages starting with the requested page has already been read into the memory of another machine in the network. If not, it grants the leases to client A, which enable the client to read the range of pages directly from the OSD. Client A then reads the range of pages from the OSD, marking each page as a singlet (as A is the only node holding this range of pages in its cache). If the file manager finds that the requested range of pages resides in the memory of some other node, say B, it sends a message to B requesting that B send the range of pages and leases to A. In this case, the file manager records internally that A also holds this particular range, and both A and B mark the pages as replicated. Node B is called a third-party node, since A gets the requested data not from the OSD but from a third party.
When memory becomes scarce on a client, the Linux kernel invokes the kswapd() daemon, which scans and discards inactive pages from the client's memory. In the modified kernel, if the candidate page is a replicated zFS page, a message is sent to the zFS file manager indicating that machine A no longer holds the page, and the page is discarded.
If the zFS page is a singlet, the page is forwarded to another node using the following steps:
1. A message is sent to the zFS file manager indicating that the page is being sent to another machine B, the node with the largest free memory known to A.
2. The page is forwarded to B.
3. The page is discarded from the page cache of A.
zFS uses a recirculation counter: if a singlet page has not been accessed after two hops, it is discarded. Once the page has been accessed, the recirculation counter is reset. When a file manager is notified about a discarded page, it updates the lease and page location and checks whether the page has become a singlet. If only one other node N holds the page, the file manager sends a singlet message to N to that effect.
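The eviction decision and the three forwarding steps above can be sketched as follows (a simplified model; all names are ours). Note that the file manager is notified in step 1 before the page is forwarded in step 2; the node-failure discussion relies on exactly this ordering.

```python
from dataclasses import dataclass, field

@dataclass
class Page:
    replicated: bool = False
    hops: int = 0        # recirculation counter; reset whenever the page is accessed

@dataclass
class Node:
    name: str
    free_memory: int
    pages: list = field(default_factory=list)

    def receive(self, page):
        self.pages.append(page)

def evict(page, notify, nodes):
    """Decide the fate of a candidate zFS page on eviction.
    notify(kind, target) stands in for a message to the file manager."""
    if page.replicated or page.hops >= 2:
        # Others hold a copy, or the singlet recirculated twice unaccessed.
        notify("discard", None)
        return "discarded"
    target = max(nodes, key=lambda n: n.free_memory)
    notify("move", target.name)       # step 1: tell the file manager first
    target.receive(page)              # step 2: forward the page to B
    page.hops += 1
    return "forwarded"                # step 3: the evicting node drops its copy
```

Choosing the node with the largest known free memory spreads forwarded singlets toward nodes least likely to be under memory pressure themselves.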
The effects of node failure and network delays are also considered in this algorithm.
3.2.1 Node Failure
To handle node failure, the researchers take the approach that it is acceptable for the file manager to assume the existence of pages on nodes even if this is not true, but it is unacceptable to have pages on nodes the file manager is unaware of. If the file manager is wrong in its assumption that a page exists on a node, its request will be rejected and it will eventually update its records. However, pages residing on nodes the file manager does not know about may cause data corruption, and thus are not allowed.
Because of this, the order of the steps for forwarding a singlet page to another node is important, and must be followed as described above.
1. Node fails before Step 1: The file manager will eventually detect this and update its data to reflect that the respective node does not hold pages and leases. If the node fails before executing Step 1, it neither notifies the file manager nor forwards the page. Thus, we end up with a situation where the file manager assumes the page exists on node A, although it does not. This is acceptable, since it can be corrected without data corruption.
2. Node fails after Step 1: In this case, the file manager is informed that the page is on B, but node A may have crashed before it was able to forward the page to B. Again, we have a situation where the file manager assumes the page is on B, although in reality that is not true.
3. Failure after Step 2 does not pose any problem, since the page has already reached B.
3.2.2 Network Delays
The following cases of network delays are considered:
1. The first case is where a replicated page residing on two nodes M and N is discarded from the memory of M. When the zFS file manager sees that the page has become a singlet and now resides only in the memory of N, it sends a message to N with this information. However, due to network delays, this message may arrive after memory pressure has developed on N. On node N the page is still marked as replicated, while in reality it is a singlet and should have been forwarded to another node.
This situation is handled as follows: if a singlet message arrives at N and the page is not in the cache of N, the cooperative cache algorithm on N ignores the singlet message. Because the file manager still believes that the page resides on N, it may ask N to forward the page to a requesting client B. In this case, N sends a reject message back to the file manager. Upon receiving a reject message, the file manager updates its internal tables and retries to respond to the request from B, either by finding another client that has read the page from the OSD in the meantime, or by telling B to read the page from the OSD. In such cases, network delays cause performance degradation, but not inconsistency.
2. Another possible scenario is that no memory pressure occurred on N, the forwarded page has not arrived yet, and a singlet message arrived and was ignored. The file manager then asked N to forward the page to B, and N sent a reject message back to the file manager. If the page never arrives at N, due to sender failure or network failure, there is no problem.
However, if the page arrives after the reject message was sent, a consistency problem may occur if a write lease exists. Because the file manager is not aware of the page on N, another node may get the write lease and the page from the OSD. This leaves two clients on two different nodes holding the same page with write leases.
To avoid this situation, a reject list is kept on node N, recording the pages (and their corresponding leases) that were requested but rejected. When a forwarded page arrives at N and the page is on the reject list, the page and its entry on the reject list are discarded, thus keeping the information in the file manager accurate. The reject list is scanned periodically (by the FE), and each entry whose time on the list exceeds T is deleted. T is the maximum time it can take a page to reach its destination node, and is determined experimentally depending on the network topology.
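The reject list can be sketched as follows (illustrative; `RejectList` and its method names are ours, with T the experimentally determined transit-time bound from the text):

```python
import time

class RejectList:
    """Per-node record of pages that were requested but rejected, so a
    late-arriving forwarded page can be recognized and dropped."""
    def __init__(self, max_transit_time):
        self.T = max_transit_time        # upper bound on page transit time
        self.entries = {}                # page id -> time of rejection

    def record_reject(self, page_id):
        self.entries[page_id] = time.time()

    def on_page_arrival(self, page_id):
        """Return True if the arriving page must be discarded."""
        if page_id in self.entries:
            del self.entries[page_id]    # drop the page and its list entry
            return True
        return False

    def expire(self):
        """Periodic scan (done by the FE): delete entries older than T."""
        now = time.time()
        self.entries = {p: t for p, t in self.entries.items() if now - t <= self.T}
```

Discarding a forwarded page that was already rejected keeps the file manager's tables accurate, so no second write lease can coexist with an unknown cached copy.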
An alternative method for handling these network delay issues is to use a complicated synchronization mechanism to keep track of the state of each page in the cluster. This was deemed unacceptable for two reasons. First, it incurs overhead from extra messages, and second, the synchronization delays the kernel when it needs to evict pages quickly.
3. Another problem caused by network delays arises when node N notifies the zFS file manager upon forwarding a page to M, and M does the same upon forwarding the page to O, but the link from N to the file manager is slow compared to the other links. The file manager may then receive the message that the page was moved from M to O before receiving the message that the singlet page was moved from N to M. Moreover, the file manager does not have in its records that this specific page and lease reside on M. The problem is further complicated by the fact that M may decide to discard the page, and this notification may arrive at the file manager before the move notification.
To solve this problem, the following data structures are used:
Each lease on a node has a hop_count, which counts the number of times the lease and its corresponding page were moved from one node to another. Initially, when the page is read from the OSD, the hop_count in the corresponding lease is set to zero; it is incremented whenever the lease and page are transferred to another node.
When a node initiates a move, the move notification passed to the file manager includes the hop_count and the target_node.
Two fields are reserved in each lease record in the file manager's internal tables for handling move notification messages: last_hop_count, initially set to -1, and target_node, initially set to NULL.
Figure 2: Delayed Move Notification Messages
If message (3) arrives first, its hop count and target node are saved in the lease record, since node M is not registered as holding the lease and page. When message (1) arrives, N is the registered node; therefore, the lease is “moved” to the target node stored in the target_node field, by updating the information stored in the internal tables of the file manager. If message (5) arrives after message (3), the information from message (5), which has the larger hop count, is stored and used when message (1) arrives. If message (3) arrives after message (5), it is ignored due to its smaller hop count. In other words, using the hop count enables late messages that have become irrelevant to be ignored.
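The hop-count bookkeeping can be sketched as follows. This is a simplified model of the lease record described above; the method name `on_move` and the replay logic's exact shape are ours.

```python
class LeaseRecord:
    """File manager's per-lease record, ordering move notifications
    that may arrive out of order over slow links."""
    def __init__(self, holder):
        self.holder = holder            # node registered as holding the lease
        self.last_hop_count = -1        # highest hop count seen early
        self.target_node = None         # target of that early move

    def on_move(self, source, hop_count, target):
        if source == self.holder:
            # In-order notification: move the lease to the target, then
            # replay any later move whose notification arrived early.
            self.holder = target
            if self.last_hop_count > hop_count and self.target_node is not None:
                self.holder = self.target_node
                self.last_hop_count, self.target_node = -1, None
        elif hop_count > self.last_hop_count:
            # Early notification from an unregistered node: save it.
            self.last_hop_count, self.target_node = hop_count, target
        # else: a late message with a smaller hop count is ignored.
```

With the saved hop count, an early "M moved the page to O" is parked until the delayed "N moved the page to M" arrives, at which point the record jumps straight to the latest known holder.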
4. Suppose the page was moved from N to M and then to O, where it was discarded because memory pressure developed on O and its recirculation count exceeded its limit. O then sends a release_lease message, which arrives at the file manager before the move notifications.
This case is resolved as follows: since O is not registered as holding the page and lease, the release_lease message is placed on a pending queue and a flag is raised in the lease record. When the move operation is resolved and this flag is set, the release_lease message is moved to the input queue and executed.
3.3 Choosing the Proper Third Party Node
The zFS file manager uses an enhanced round robin method to choose the
third-party node, which holds a range of pages starting with the requested page.
For each range granted to a node N, the file manager records the time it was
granted, t(N). When a request arrives, the file manager scans the list of all nodes
holding a potential range, N0…Nk. For each node Ni, the file manager checks
whether currentTime - t(Ni) > C, i.e., whether enough time has passed for the
range of pages granted to Ni to actually reach that node. If so, Ni is marked as a
potential provider for the requested range; in either case, the scan proceeds to
the next node, Ni+1. Once all nodes are checked, the marked node with the
largest range, Nmax, is chosen. The next time the file manager is asked for a
page and lease, it starts the scan from node Nmax+1. This algorithm achieves
two goals. First, no single node is overloaded with requests and becomes a
bottleneck. Second, the pages reside at the chosen node, thus reducing the
probability of reject messages.
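The selection can be sketched as below. The data structure, the constant C, and the timestamps are illustrative assumptions; the grant-age check and the rotating scan start are the mechanism described above:

```python
import time

class RangeGrant:
    def __init__(self, node, range_len, granted_at):
        self.node = node
        self.range_len = range_len    # pages held starting at the requested page
        self.granted_at = granted_at  # t(N): time the range was granted

def choose_third_party(grants, scan_start, settle_time_c, now=None):
    """Pick a provider among the nodes holding a potential range.

    grants: list of RangeGrant, in round-robin order.
    scan_start: index at which to start scanning (rotates between calls).
    settle_time_c: C, the minimum grant age for pages to have arrived.
    Returns (chosen_node, next_scan_start), or (None, scan_start) if no
    node qualifies.
    """
    now = time.time() if now is None else now
    best, best_idx = None, None
    n = len(grants)
    for step in range(n):
        i = (scan_start + step) % n
        g = grants[i]
        # Only mark nodes whose grant is old enough to have reached them.
        if now - g.granted_at > settle_time_c:
            if best is None or g.range_len > best.range_len:
                best, best_idx = g, i
    if best is None:
        return None, scan_start
    # The next request starts scanning just past the chosen node (Nmax + 1).
    return best.node, (best_idx + 1) % n

grants = [RangeGrant("A", 4, granted_at=100.0),
          RangeGrant("B", 8, granted_at=100.0),
          RangeGrant("C", 16, granted_at=119.5)]  # granted too recently
node, nxt = choose_third_party(grants, scan_start=0, settle_time_c=1.0, now=120.0)
```

Here C's large range is skipped because its grant is too recent, so B, the marked node with the largest range, is chosen, and the next scan will start just past it.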
3.4 Pre-fetching Data in zFS
The Linux kernel uses a read ahead mechanism to improve file reading perform-
ance. Based on the read pattern of each file, the kernel dynamically calculates
how many pages to read ahead, n, and invokes the readpage() routine n times.
This method of operating is not efficient when the pages are transmitted over
the network. The overhead for transmitting a data block is composed of two
parts: the network setup overhead and the transmission time of the data block
itself. For comparatively small blocks, the setup overhead is a significant part of
the total overhead.
Intuitively, it seems more efficient to transmit k pages in one message rather
than in k separate messages, since the setup overhead is amortized over the k
pages.
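The amortization argument can be made concrete with a toy cost model. The setup and per-page times below are made-up numbers for illustration, not measurements from the experiment:

```python
import math

def transfer_time(total_pages, pages_per_msg, setup, per_page):
    """Total time to move a file of total_pages pages, k pages per message."""
    messages = math.ceil(total_pages / pages_per_msg)
    return messages * setup + total_pages * per_page

# Hypothetical costs: 200 us setup per message, 30 us to transmit one page.
N, setup, per_page = 1024, 200e-6, 30e-6
t1 = transfer_time(N, 1, setup, per_page)  # one page per message
t8 = transfer_time(N, 8, setup, per_page)  # eight pages per message
# With k=8 the setup cost is paid 128 times instead of 1024 times.
```

Note this linear model would predict that ever-larger k is always better; the measured optimum was k=4 or k=8 because, as described next, blocks larger than the L2 cache slow TCP down, which the model does not capture.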
To confirm this, the researchers wrote client and server programs that test the
time it takes to read a file residing entirely in memory from one node to another.
Using a file size of N pages, they tested reading it in chunks of 1…k pages in
each TCP message. That is, reading the file in N…N/k messages. They found
that the best results are achieved for k=4 and k=8. When k is smaller, the setup
time is significant, and when k is larger (16 and above) the size of the L2 cache
starts to affect the performance. TCP performance decreases when the trans-
mitted block size exceeds the size of the L2 cache.
Similar performance gains were achieved by the zFS pre-fetching mechanism.
When the file manager is instantiated, it is passed a pre-fetching parameter, R,
indicating the maximum range of pages to grant. When a client A requests a
page (and lease), the file manager searches for a client B having the largest
contiguous range of pages, r, starting with the requested page p, with r <= R. If such
a client B is found, the file manager sends B a message to send r pages (and
their leases) to A. The selected range r can be smaller than R if the file manager
finds a page with a conflicting lease (for a read request, a write lease; for a
write request, a read lease) before reaching the full range R. If no range
is found in any client, the file manager grants R leases to client A and instructs
A to read R pages from the OSD. The requested page may reside on client A,
while the next one resides on client B, and the next on client C. In this case, the
granted range will be only the requested page from client A. The next request
initiated by the kernel read-ahead mechanism will be granted from client B and
the next from client C. Thus, there is no interference with the kernel read ahead
mechanism. However, if the file manager finds that client A has a range of k
pages, it will ignore the subsequent requests that are initiated by the kernel
read-ahead mechanism and covered by the granted range.
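The range-granting decision can be sketched as follows. The per-client cache maps and the boolean conflict flag are illustrative simplifications of the lease tables; the contiguous-run scan bounded by R is the behavior described above:

```python
def grant_range(clients, page, R):
    """Decide where a requesting client gets pages [page, page + r) from.

    clients: dict mapping client name -> dict of page number -> has_conflict.
    Returns (source, range_len); source is a client name, or "OSD" when no
    client caches the requested page and R pages must be read from disk.
    """
    best_client, best_len = None, 0
    for name, cache in clients.items():
        # Length of the contiguous conflict-free run starting at `page`.
        r = 0
        while r < R and (page + r) in cache and not cache[page + r]:
            r += 1
        if r > best_len:
            best_client, best_len = name, r
    if best_client is None:
        return "OSD", R  # nobody holds the page: read R pages from the OSD
    return best_client, best_len

clients = {
    "B": {10: False, 11: False, 12: False, 13: True},  # page 13: conflicting lease
    "C": {10: False, 11: False},
}
```

For a request for page 10 with R=8, client B is chosen with a range of 3 (the conflicting lease on page 13 cuts the range short of R); a request for an uncached page falls through to the OSD with the full range R.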
Chapter 4
Testing
4.1 Test Environment
The zFS performance test environment consisted of a cluster of client PCs and
one server PC. Each of the PCs in the cluster had an 800 MHz Pentium III proc-
essor with 256 MB memory, 256 KB L2 cache and 15 GB IDE (Integrated Drive
Electronics) disks. All of the PCs in the cluster ran the Linux operating system.
The kernel was a modified 2.4.19 kernel with VFS (Virtual File System) imple-
menting zFS and some patches to enable the integration of zFS with the kernel's
page cache. The server PC had a 2 GHz Pentium 4 processor with 512 MB
memory and a 30 GB IDE disk, running a vanilla Linux kernel. The server PC ran a
simulator of the Antara OSD when the researchers tested zFS performance and
ran an NFS (Network File System) server when they compared the results to
NFS. The PCs in the cluster and the server PC were connected via 1 Gbit LAN.
The zFS front end (client) was implemented as a kernel loadable module, while
all other components were implemented as user mode processes. The
file manager and lease manager were fully implemented. The transaction man-
ager implemented all operations in memory, without writing the log to the OSD.
However, this fact does not influence the results because only the results of
read operations using cooperative cache are recorded and not the meta data
operations.
To begin testing zFS, the researchers configured the system much like a SAN
(Storage Area Network) file system. The server PC ran an OSD simulator, a
separate PC ran the lease manager, file manager and transaction manager proc-
esses (thus acting as a meta data server) and four PCs ran the zFS front-end.
When testing NFS performance, they configured the system differently. The
server PC ran an NFS server with eight NFS daemons (nfsd) and the four PCs
ran NFS clients. The final results are an average over several runs, where the
caches of the machines were cleared before each run.
To evaluate zFS performance relative to an existing file system, the researchers
compared it to the widely-used NFS system, using the IOZONE benchmark.
IOZONE is a filesystem benchmark tool: it generates and measures a variety of
file operations and is useful for performing a broad filesystem analysis of a
computer platform, testing I/O performance for operations such as read and
write.
The comparison to NFS was difficult because NFS does not carry out
pre-fetching. To compensate for this, IOZONE was configured to read the
NFS-mounted file using record sizes of n=1, 4, 8, 16 pages, and its performance
was compared with reading zFS-mounted files using a record size of one page
but with pre-fetching parameter R=1, 4, 8, 16 pages.
4.2 Comparing zFS and NFS
The primary aim of this research was to test whether and how much perform-
ance is gained when the total amount of free memory in the cluster exceeds the
server’s cache size. To this end, two scenarios were investigated. In the first
one, the file size was smaller than the cache of the server and all the data re-
sided in the server’s cache. In the second, the file size was much larger than the
size of the server’s cache. The results appear in Figure 5 and Figure 6 below,
respectively.
Figure 5: Performance results for large server cache (256 MB file, 512 MB server memory)
Figure 6: Performance results for small server cache (1 GB file, 512 MB server memory)
In both the cases, it was observed that the performance of NFS was almost the
same for different block sizes. However, the absolute performance is greatly
influenced by the data size relative to the available memory. When the file fits
entirely into memory, the performance of NFS is almost four times better than
when the file is larger than the available memory.
When the file fits entirely into memory (Figure 5), the performance of zFS with
cooperative cache is much better than that of NFS. But when cooperative cache
was deactivated, different behavior was observed for a range of one page
compared to larger ranges. This is because of the extra messages passed
between the clients and the file manager; hence, the performance of zFS for
R=1 is lower than that of NFS. However, for larger ranges, there are fewer
messages to the file manager (due to pre-fetching in zFS) and the performance of
zFS was slightly better than that of NFS.
The researchers also observed that when cooperative cache was used, the
performance for a range of 16 was lower than for ranges of 4 and 8. Because
IOZONE starts the requests of each client with a fixed time delay relative to the
other clients, each new request was for a different 256 KB of data: four clients
with 16 pages (of 4 KB) each give 256 KB, the size of the L2 cache. Since
almost the entire file is in memory, the L2 cache is flushed and reloaded for
each newly granted request, resulting in reduced performance.
When the server’s cache was smaller than the requested data, it was expected
that memory pressure would occur in the server (NFS and OSD) and that the
server’s local disk would be used. In such a case, the anticipation that the
cooperative cache would exhibit improved performance proved to be correct.
The results are shown in Figure 6.
We can see that zFS performance, when cooperative cache is deactivated, is
lower than that of NFS but it gets better for larger ranges. When the coopera-
tive cache is active, zFS performance is significantly better than NFS and in-
creases with increasing range.
The performance with cooperative cache enabled is lower in this case than
when the file fits into memory. Because the file was larger than the available
memory, the clients suffered memory pressure, discarded pages, and responded
to the file manager with reject messages. Thus, sending data blocks to clients
was interleaved with reject messages to the file manager, and the probability
that the requested data was in memory was smaller than when the file was
almost entirely in memory.
Conclusion
The results show that using the caches of all the clients as one cooperative
cache gives better performance than NFS, as well as than the case where the
cooperative cache is not used; this is evident even when pre-fetching with a
range of one page. The results also show that pre-fetching with ranges of four
and eight pages yields much better performance. In zFS, the selection of the
target node for forwarding pages during page eviction is done by the file
manager, which chooses the node with the largest amount of free memory as
the target node. However, the file manager chooses target nodes only from
among those interacting with it; there may be an idle machine with a large
amount of free memory that is not connected to this file manager and thus
will not be used.
Bibliography
A. Teperman and A. Weit. "Improving Performance of a Distributed File System Using OSDs and Cooperative Cache." IBM Haifa Labs, Haifa University Campus, Mount Carmel, Haifa 31905, Israel.
O. Rodeh and A. Teperman. "zFS - A Scalable Distributed File System Using Object Disks." In Proceedings of the IEEE Mass Storage Systems and Technologies Conference, pages 207-218, San Diego, CA, USA, 2003.
T. Cortes, S. Girona and J. Labarta. "Avoiding the Cache Coherence Problem in a Parallel/Distributed File System." Departament d'Arquitectura de Computadors, Universitat Politecnica de Catalunya, Barcelona.
M. D. Dahlin, R. Y. Wang, T. E. Anderson and D. A. Patterson. "Cooperative Caching: Using Remote Client Memory to Improve File System Performance." In Proceedings of the First Symposium on Operating Systems Design and Implementation, 1994.
V. Drezin, N. Rinetzky, A. Tavory and E. Yerushalmi. "The Antara Object-disk Design." Technical report, IBM Research Labs, Haifa University Campus, Mount Carmel, Haifa, Israel, 2001.
Z. Dubitzky, I. Gold, E. Henis, J. Satran and D. Sheinwald. "DSF - Data Sharing Facility." Technical report, IBM Research Labs, Haifa University Campus, Mount Carmel, Haifa, Israel, 2000. http://www.haifa.il.ibm.com/projects/systems/dsf.html
IOzone filesystem benchmark. http://iozone.org/
Lustre whitepaper. http://www.lustre.org/docs/whitepaper.pdf
P. Sarkar and J. Hartman. "Efficient Cooperative Caching Using Hints." Department of Computer Science, University of Arizona, Tucson.