

Multiple Page Size Support in the Linux Kernel

Simon Winwood‡§

§School of Computer Science and Engineering

University of New South Wales

Sydney 2052, Australia

[email protected]

Yefim Shuf‡¶

¶Computer Science Department

Princeton University

Princeton, NJ 08544, USA

[email protected]

Hubertus Franke‡

‡IBM T.J. Watson Research Center

P.O. Box 218

Yorktown Heights, NY 10598, USA

{swinwoo, yefim, frankeh}@us.ibm.com

Abstract

The Linux kernel currently supports a single user space page size, usually the minimum dictated by the architecture. This paper describes the ongoing modifications to the Linux kernel to allow applications to vary the size of pages used to map their address spaces and to reap the performance benefits associated with the use of large pages.

The results from our implementation of multiple page size support in the Linux kernel are very encouraging. Namely, we find that the performance improvements of applications written in various modern programming languages range from 10% to over 35%. The observed performance improvements are consistent with those reported by other researchers. Considering that memory latencies continue to grow and represent a barrier for achieving scalable performance on faster processors, we argue that multiple page size support is a necessary and important addition to the OS kernel and the Linux kernel in particular.

1 Introduction

To achieve high performance, many processors supporting virtual memory implement a Translation Lookaside Buffer (TLB) [8]. A TLB is a small hardware cache for maintaining virtual to physical translation information for recently referenced pages. During the execution of any instruction, a translation from virtual to physical addresses needs to be performed at least once. A TLB thus effectively reduces the cost of obtaining translation information from page tables stored in memory.

Programs with good spatial and temporal locality of reference achieve high TLB hit rates, which contribute to higher application performance. Because of long memory latencies, programs with poor locality can incur a noticeable performance hit due to low TLB utilization. Large working sets of many modern applications and commercial middleware [12, 13] make achieving high TLB hit rates a challenging and important task.

Ottawa Linux Symposium 2002 574

Adding more entries to a TLB to increase its coverage and increasing the associativity of a TLB to reach higher TLB hit rates is not always feasible, as large and complex TLBs make it difficult to attain short processor cycle times. A short TLB latency is a critical requirement for many modern processors with fast physically tagged caches, in which translation information (i.e., a physical page associated with a TLB entry) needs to be available to perform cache tag checking [8]. Therefore, many processors achieve wider TLB coverage by supporting large pages. Traditionally, operating systems did not expose large pages to application software, limiting this support to the kernel. Growing working sets of applications make it appealing to support large pages for applications, as well as for the kernel itself.

A key challenge for this work was to provide efficient support for multiple page sizes with only minor changes to the kernel. This paper discusses ongoing research to support multiple page sizes in the context of the Linux operating system, and makes the following contributions:

• it describes the changes necessary to support multiple page sizes in the Linux kernel;

• it presents validation data demonstrating the accuracy of our implementation and its ability to meet our design goals; and

• it illustrates non-trivial performance benefits of large pages (reaching more than 35%) for Java applications and (reaching over 15%) for C and C++ applications from well-known benchmark suites.

We have an implementation of multiple page size support for the IA-32 architecture and are currently working on an implementation for the PowerPC1 architecture.

1This is for the PPC405gp and PPC440 processors, both of which support multiple page sizes.

The rest of the paper is organized as follows. In Section 2, we present an overview of the Linux virtual memory subsystem. We describe the design and implementation of multiple page size support in the Linux kernel in Section 3 and Section 4, respectively. Experimental results obtained from the implementation are presented and analyzed in Section 5. Related work is discussed in Section 6. Finally, we summarize the results of our work and present some ideas for future work in Section 7.

2 The Virtual Memory Subsystem in Linux

In this section, we give a brief overview of the Linux Virtual Memory (VM) subsystem2. Unless otherwise noted, this section refers to the 2.4 series of kernels after version 2.4.18.

2.1 Address space data structures

Each address space is defined by an mm_struct data structure3. The mm_struct contains information about the address space, including a list of Virtual Memory Areas (VMAs), a pointer to the page directory, and various locks and resource counters.

A VMA contains information about a single region of the address space. This includes:

• the address range the VMA is responsible for;

• the access rights — read, write, and execute — for that region;

2This section is meant to be neither exhaustive nor complete.

3Note that multiple tasks can share the same address space.


• the file, if any, which backs the region; and

• any performance hints supplied by an application, such as memory access behaviour.

A VMA is also responsible for populating the region at page fault time via its nopage method. A VMA generally maps a virtual address range onto a region of a file, or zero-filled (anonymous) memory.
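As a concrete illustration, the structures above can be sketched as follows; the field and type names here are simplified stand-ins for the kernel's mm_struct and vm_area_struct, not the real definitions:

```c
#include <stddef.h>

/* Simplified stand-ins for mm_struct and vm_area_struct; the real
 * kernel structures also carry locks, resource counters, backing-file
 * pointers, and the nopage operation. */
struct vm_area {
    unsigned long start, end;   /* address range the VMA is responsible for */
    unsigned int prot;          /* read/write/execute permission bits */
    struct vm_area *next;       /* next VMA in the address space's list */
};

struct mm {
    struct vm_area *mmap;       /* list of VMAs */
    void *pgd;                  /* pointer to the page directory */
};

/* Walk the VMA list to find the region covering addr, as the page
 * fault handler must do before it can service a fault. */
static struct vm_area *find_vma(struct mm *mm, unsigned long addr)
{
    struct vm_area *v;

    for (v = mm->mmap; v != NULL; v = v->next)
        if (addr >= v->start && addr < v->end)
            return v;
    return NULL;                /* fault address is not mapped */
}
```

The fault handler performs exactly this kind of lookup to locate the VMA covering a fault address before checking permissions.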

A VMA exists for each segment in a process's executable (e.g., its text and data segments), its stack, any dynamically linked libraries, and any other files the process may have mapped into its address space. All VMAs, except for those created when a process is initially loaded, are created with the mmap system call4. The mmap system call essentially checks that the process is allowed the desired access to the requested file5 and sets up the VMA.

The page directory contains mappings from virtual addresses to physical addresses. Linux uses a three-level hierarchical page table (PT), although in most cases the middle level is optimised out. Each leaf node entry in the PT, called a page table entry (PTE), contains the address of the corresponding physical page, the current protection attributes for that page6, and other page attributes such as whether the mapping is dirty, referenced, or valid.

Figure 1 shows the relationship between the Virtual Address Space, the mm_struct, the VMAs, the Physical Address Space, and the page directory.
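For a table with the middle level optimised out, the lookup reduces to extracting two indices and an offset from the virtual address. A sketch of that split, using IA-32-style constants (4 KB base pages, 10-bit indices); the helper names are illustrative:

```c
/* Hypothetical index split of a 32-bit virtual address with 4 KB base
 * pages and the middle page-table level optimised out (IA-32-style:
 * 10-bit directory index, 10-bit table index, 12-bit page offset). */
#define PAGE_SHIFT   12
#define PGD_SHIFT    22
#define PTRS_PER_PTE 1024UL

static unsigned long pgd_index(unsigned long addr)
{
    return addr >> PGD_SHIFT;                         /* top 10 bits */
}

static unsigned long pte_index(unsigned long addr)
{
    return (addr >> PAGE_SHIFT) & (PTRS_PER_PTE - 1); /* middle 10 bits */
}

static unsigned long page_offset(unsigned long addr)
{
    return addr & ((1UL << PAGE_SHIFT) - 1);          /* low 12 bits */
}
```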

4This is not strictly true: the shmat system call is also used to create VMAs. It is, however, essentially a wrapper for the mmap system call.

5This check is trivial if the mapping is anonymous.

6The page's protection attributes may change over the life of the mapping due to copy-on-write and reference counting.

2.2 The page data structure and allocator

The page data structure represents a page of physical memory, and contains the following properties:

• its usage count, which denotes whether it is in the page cache, whether it has buffers associated with it, and how many processes are using it;

• its associated mapping, which indicates how a file is mapped onto its data, and its offset;

• its wait queue, which contains processes waiting on the page; and

• its various flags, most importantly:

locked This flag is used to lock a page. When a page is locked, I/O is pending on the page, or the page is being examined by the swap subsystem.

error This flag is used to communicate to the VM subsystem that an error occurred during I/O to the page.

referenced This flag is used by the swapping algorithm. It is set when a page is referenced, for example, when the page is accessed by the read system call, and when a PTE is found to be referenced during the page table scan performed by the swapper.

uptodate This flag is used by the page cache to determine whether the page's contents are valid. It is set after the page is read in.

dirty This flag is used to determine whether the page's contents have been modified. It is set when the page is written to, either by an explicit write system call, or through a store instruction.


Figure 1: Virtual Address Space data structures (figure labels: Virtual Address Space; mm_struct with mmap and pgd fields; Virtual Memory Area; Page Directory; Page Table; Physical Address Space)

lru This flag is used to indicate that the page is in the LRU list.

active This flag is used to indicate that the page is in the active list.

launder This flag is used to determine whether the page is currently undergoing swap activity. It is set when the page is selected to be swapped out.
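These flags are held as individual bits of a single flags word in the page structure. A minimal sketch of that scheme, with illustrative names standing in for the kernel's PG_* constants and atomic bit operations:

```c
/* Illustrative page flags kept as bits of one word, in the spirit of
 * the kernel's PG_* constants (names and bit order are ours). */
enum {
    PG_locked, PG_error, PG_referenced, PG_uptodate,
    PG_dirty, PG_lru, PG_active, PG_launder
};

struct page_like {
    unsigned long flags;
};

/* The kernel uses atomic bit operations here; plain ones suffice
 * for illustration. */
static void set_page_flag(struct page_like *p, int f)
{
    p->flags |= 1UL << f;
}

static void clear_page_flag(struct page_like *p, int f)
{
    p->flags &= ~(1UL << f);
}

static int test_page_flag(const struct page_like *p, int f)
{
    return (p->flags >> f) & 1;
}
```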

Pages are organised into zones; memory is requested in terms of the target zone. Each zone has certain properties: the DMA zone consists of pages whose physical address is below the 16MB limit required by some older devices, the normal zone contains pages that can be used for any purpose (aside from that fulfilled by the DMA zone), and the highmem zone contains all pages that do not fit into the kernel's virtual memory: when the kernel needs to access this memory, it needs to be mapped into a region of the kernel's address space. Note that the DMA zone is only required to support legacy devices, and the highmem zone is only required on machines with 32-bit (or smaller) address spaces.

Each zone uses a buddy allocator to allocate pages, so pages of different orders can be allocated. Although a client of the allocator requests pages from a specific list of zones and a specific page order, the pages that are returned can come from anywhere within the zone. This means that a page for the slab allocator7 can be allocated between pages that are allocated for the page cache. The same can be said for other non-swappable pages such as task control blocks and page table nodes.
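A buddy allocator hands out blocks of 2^k base pages, so a request for n pages is rounded up to the smallest sufficient order. A small sketch of that rounding (the helper name is ours; kernel callers pass the order directly):

```c
/* Smallest order k with 2^k >= n: a buddy allocator can only hand
 * out power-of-two blocks, so a request for n base pages is rounded
 * up to the next order. */
static unsigned int order_for_pages(unsigned long n)
{
    unsigned int k = 0;

    while ((1UL << k) < n)
        k++;
    return k;
}
```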

2.3 The page cache

The page cache implements a general cache for file data. Most filesystems use the page cache to avoid re-implementing the page cache's functionality. A filesystem takes advantage of the page cache by setting a file's mmap operation to generic_file_mmap. When the file is mmaped, the VMA is set up such that its nopage function invokes filemap_nopage. The file's read and write operations will also go through the page cache.

7The slab allocator [5] provides efficient allocation of objects, such as inodes. Pages allocated to the slab allocator cannot be paged out.


The page cache uses the page's mapping and offset fields to uniquely identify the file data that the page contains — when an access occurs, the page cache uses this data to look up the page in a hash table.
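A toy version of this lookup, hashing the (mapping, offset) pair into a chained table; the hash function, bucket count, and names are illustrative, not the kernel's:

```c
#include <stddef.h>

/* Toy (mapping, offset) -> page hash table; sizes and the hash
 * function are illustrative. */
#define NBUCKETS 64

struct cached_page {
    const void *mapping;           /* identifies the backing file */
    unsigned long offset;          /* page-aligned offset within it */
    struct cached_page *next_hash; /* hash chain link */
};

static struct cached_page *buckets[NBUCKETS];

static unsigned int page_hash(const void *mapping, unsigned long offset)
{
    return (unsigned int)((((unsigned long)mapping) >> 4) ^ offset) % NBUCKETS;
}

static void cache_add(struct cached_page *p)
{
    unsigned int h = page_hash(p->mapping, p->offset);

    p->next_hash = buckets[h];
    buckets[h] = p;
}

static struct cached_page *cache_lookup(const void *mapping,
                                        unsigned long offset)
{
    struct cached_page *p;

    for (p = buckets[page_hash(mapping, offset)]; p; p = p->next_hash)
        if (p->mapping == mapping && p->offset == offset)
            return p;
    return NULL;                   /* miss: caller allocates and reads in */
}
```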

2.4 The swap subsystem

Linux attempts to fully utilise memory. At any one time, the amount of available memory may be less than that required by an application. To satisfy a request for memory, the kernel may need to free a page that is currently being used. Selecting and freeing pages is the job of the swap subsystem.

The swap subsystem uses two lists to record the activity of pages: a list of pages which have not been accessed during a certain time period, called the inactive list, and a list of pages which have been recently accessed, called the active list. The active list is maintained pseudo-LRU, while the inactive list is used by the one-handed clock replacement algorithm currently implemented in the kernel. Whenever a page on the inactive list is referenced, it is moved to the active list.

The kernel uses a swapper thread to periodically balance the number of pages in the active and inactive lists: if a page in the active list has been referenced, it is moved to the end of the active list, otherwise it is moved to the end of the inactive list.

Periodically, the swapper thread sweeps through the inactive list looking for pages that can be freed. If the swapper thread is unable to free enough pages, it starts scanning page tables: for each PTE examined, the kernel checks to see whether the page has been referenced (i.e., whether the referenced bit is set in the PTE). If so, the page is moved to the active list, if it is not already a member. Otherwise, the page is considered a candidate for swapping.

In this manner, reference statistics are gathered from the page tables in the system, and used to select pages to be swapped out and freed.

The swapper thread may be woken up when the amount of memory becomes too low. The swapper functions may also be called directly when the amount of free memory becomes critical: when memory allocation fails, a task may attempt to swap out pages directly.

2.5 Anatomy of a page fault

When a virtual memory address is accessed, but a corresponding mapping is not in the TLB, a TLB miss occurs. When this happens, the fault address is looked up in the page table, either by the hardware in systems with a hardware loaded TLB, or by the kernel in systems with a software loaded TLB (note that this implies an interrupt occurs).

If a mapping exists in the page table, is valid, and matches permissions with the type of the access, the entry is inserted into the TLB, the page table is updated to reflect the access by setting the referenced bit8, and the faulting instruction is restarted.

If a valid mapping does not exist, the kernel's page fault handler is invoked. The handler searches the current address space's VMA set for the VMA which corresponds to the fault address, and checks whether the access requested is allowed by the permissions specified in the VMA.

The kernel then looks up the PTE corresponding to the fault address and allocates a page table node if necessary. If the fault is a write to a PTE marked read-only, the address space requires a private copy of the page. A page is allocated, the old page is copied, and the dirty bit is set in the PTE. If the PTE exists but isn't valid, the page needs to be swapped in; otherwise the page needs to be allocated and filled.

8Note that architectures with a hardware loaded TLB whose page table doesn't map directly onto Linux's need to simulate this bit.

If the VMA does not define a nopage method, the memory is defined to be anonymous, i.e., zero-filled memory that is not associated with any device or file. In this case, the kernel allocates a page, zeroes it, and inserts the appropriate entry into the page table. If a valid nopage method exists, it is invoked and the resulting page is inserted into the PTE.

In the majority of filesystems, the nopage method goes to the page cache. The mapping and offset for the fault address are calculated — the information required for this calculation is stored in the VMA — and the page cache hash table is searched for the file data corresponding to the mapping and offset.

If an up-to-date page exists, then no further action is required. If the page exists but is not up-to-date, it is read in. Otherwise, a new page is allocated, inserted into the page cache, and read in. In all cases, the page's reference count is incremented, and the page is returned.
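The decision sequence of the preceding paragraphs can be condensed into a classifier; the types and action names below are illustrative, and the real handler interleaves locking and allocation with these checks:

```c
#include <stddef.h>

/* Condensed fault classification; a NULL pte stands for "no PTE
 * exists yet", and the enum names are ours. */
enum fault_action { DO_NOTHING, DO_COW, DO_SWAP_IN, DO_ANON_ZERO, DO_NOPAGE };

struct pte_like { int present, writable; };
struct vma_like { int has_nopage; };

static enum fault_action classify_fault(const struct pte_like *pte,
                                        int write_access,
                                        const struct vma_like *vma)
{
    if (pte && pte->present) {
        /* write to a read-only PTE: make a private copy (copy-on-write) */
        if (write_access && !pte->writable)
            return DO_COW;
        return DO_NOTHING;    /* e.g. another task already fixed it up */
    }
    if (pte)
        return DO_SWAP_IN;    /* PTE exists but isn't valid */
    /* no PTE: allocate and fill, via nopage or with zeroes */
    return vma->has_nopage ? DO_NOPAGE : DO_ANON_ZERO;
}
```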

3 Design

This section discusses the approaches we considered and justifies our final design. This section is organised as follows: Section 3.1 discusses the goals that guided the design and the terminology used throughout this and future sections. Section 3.2 discusses the semantics of large pages: what aspects of the support for large pages the kernel exports to user space, the granularity at which page size decisions are made, and the high-level abstractions the kernel exports to the user.

It should be noted that this is an ongoing project, so the approaches described here may have been improved upon by the time of publication.

3.1 Goals

This section discusses the design goals and guidelines which we attempt to adhere to in the design of our solution. We consider a good design to have the following properties:

Low overhead We do not wish to penalise applications that will not benefit from large pages, so we aim to minimise the performance impact of our modifications for these applications.

Generic The Linux kernel runs on numerous different architectures9 and is usually ported quickly to new architectures. Any kernel enhancements such as ours should be easily adaptable to support existing and future systems, especially considering that many modern architectures feature MMUs which support multiple page sizes.

Flexible While a generic solution allows for easy portability, it does not indicate how well such a solution takes advantage of an architecture's support for multiple page sizes. The design should be flexible enough to encompass any such support.

Simple The more complex a solution is, the more likely it is to have subtle bugs, and the harder it is to understand. While we can foresee a point at which a more complex solution may be necessary, the initial design should be as simple as possible.

Minimal The Linux kernel is a large and complex system, so a minimalist approach is required: subsystem modifications that are not absolutely required may result in a solution that is overly complex and unwieldy. Therefore, we try to limit our changes to the VM subsystem only.

9A count of the architectures in the mainline kernel reveals 15 implementations that are more or less complete.

3.2 Semantics

This section discusses the semantics associated with supporting multiple page sizes: how the page size for a range of virtual addresses is chosen and whether the kernel considers this page size mandatory or advisory.

The following terms are used throughout this and later sections:

Base page A base page is the smallest page supported by the kernel, usually the minimum dictated by the hardware.

Superpage A superpage is a contiguous sequence of 2^n base pages.

Order A superpage’sorder refers to its size.A superpage of ordern contains2n basepages.

Sub-superpage A sub-superpage is a superpage of order m, contained in a superpage of order n, such that n ≥ m. Note that a base page is a sub-superpage with order m = 0.

Figure 2: A superpage and sub-superpage (figure labels: a superpage of order 4 containing a sub-superpage of order 2; a base page has order 0)

These concepts are illustrated in Figure 2, which shows a superpage of order 4 containing a sub-superpage of order 2.
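In these terms, sizes follow directly from the order: a superpage of order n contains 2^n base pages and hence spans 2^n times the base page size. A minimal sketch, assuming 4 KB base pages (the shift value is an assumption, not mandated by the design):

```c
/* Order arithmetic for superpages, assuming 4 KB base pages. */
#define BASE_SHIFT 12   /* log2 of the assumed 4 KB base page size */

static unsigned long base_pages_in(unsigned int order)
{
    return 1UL << order;                 /* order n => 2^n base pages */
}

static unsigned long superpage_bytes(unsigned int order)
{
    return 1UL << (BASE_SHIFT + order);  /* span of the superpage in bytes */
}
```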

3.2.1 Visibility

There are two basic approaches to supporting multiple page sizes: restrict knowledge of superpages to the kernel, or export page size decisions to user space.

In the former approach, the kernel can create superpage mappings based on some heuristic, for example, a dynamic heuristic based on TLB miss information, or a static heuristic based on the type of mapping, such as whether the mapping is for code or data, or whether it is anonymous. This approach is transparent to applications, and should result in all applications benefiting. It is, however, more complex, and would rely on effective heuristics to map a virtual address range with large pages.

In the latter approach, an application explicitly requests that a section of its address space be mapped with superpages. This request could come in the form of programmer hints, or via instrumentation inserted by a compiler. While this approach requires applications to have specific knowledge of the operating system's support for large pages, it is much simpler from the kernel's perspective. The major problem with this approach is that it requires the application programmer to have a good understanding of the application's memory behaviour.

We have decided on the latter approach, due to its simplicity: the former approach would necessitate developing heuristics that require fine-tuning and rewriting.

3.2.2 Granularity

This section discusses the granularity of control that the application has over page sizes. The approaches considered were:

per address space While making page sizes per address space would simplify some aspects of the implementation, it is too restrictive. We expect applications to have regions of their address space where the use of large pages would be a waste of memory;

per address space region type10

This approach also has its drawbacks: there is no clear set of types, although the region's attributes (e.g., executable, anonymous) could be used, so again this approach is limited without any clear gains;

per address space region This approach is more flexible than either of the above approaches; however, it does not allow for hotspot mapping within a region; or

over an arbitrary address space range This is the most flexible approach; however, there are implementation issues: the kernel would need to keep track of the application's desired page sizes for the entire address space.

To allow maximum flexibility while minimising implementation overhead, we have decided upon a combination of the last two options: an application can dictate the page size for an arbitrary address range only if that range belongs to an address space region. This means that an application can map a region hotspot with large pages, but leave the rest of the region at the system's default page size.

3.2.3 Interface

This section discusses the guarantees given about the actual page size used to map an address space range.

10A region is a defined part of the address space that is created by the mmap system call, for example.

The kernel can take a best-effort approach to mapping a virtual address with the application's indicated page size, falling back to a smaller page size if the larger page is not immediately available. Alternatively, the kernel can block the application until the desired page size becomes available, copying any existing pages to the newly allocated superpage.

Rather than mandating either behaviour, we have elected to allow the application to choose between the two alternatives. In situations where selecting a larger page size is merely an opportunistic optimisation for a relatively short running application, the first behaviour is desirable. In cases where the application is expected to execute for an extended period of time, however, the expected performance improvement may be greater than the expected wait time, and so waiting for a superpage to become available is justified. If an application is expected to re-use a large mapping over a number of invocations (a text page or a data file, for example), the application will benefit by waiting for the large page to be constructed.

4 Implementation

This section discusses the implementation of the design in Section 3.

4.1 Interface

An application requires some mechanism to communicate a desired page size to the kernel. A system call is the conventional mechanism for communicating with the kernel. In this section, we discuss our implementation of a system call interface for setting the page size for a region of the address space.

We considered three options: add a parameter to the mmap system call specifying the page size for the new mapping; implement a new system call, setpagesize; and add another operation to the madvise system call.

Using the mmap system call would appear to be an obvious solution. It has, however, several negative aspects: firstly, the mmap system call is complex and is frequently used. Modifying mmap's argument types would break existing code, as would adding extra parameters. Secondly, the application would be restricted to the one page size for that mapping, for the life of the mapping.

Using a new system call would be the cleanest alternative; however, this requires significant modifications to all architectures, and is generally frowned upon where an alternative exists.

Using the madvise system call would allow an application to modify the page size at any point during its execution and would not affect existing applications, as any modification would be orthogonal to current operations.

We therefore added a setregionorder(n) operation to the madvise system call, where n is the new page order. We implemented this using the advise parameter of the madvise system call. The upper half of the parameter word contains the desired page order, while the lower half indicates that a setregionorder operation is to be performed.

Within the kernel, the madvise system call verifies that the requested page order is actually supported by the processor, and sets the VMA's order attribute accordingly.
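The advice-word encoding described above can be sketched as follows; only the upper-half/lower-half split is taken from the text, while the operation code value and helper names are hypothetical:

```c
#include <limits.h>

/* Hypothetical encoding of the madvise advice word: the desired page
 * order in the upper half, the operation code in the lower half. */
#define HALF_BITS           ((int)(sizeof(int) * CHAR_BIT / 2))
#define MADV_SETREGIONORDER 0x100   /* hypothetical operation code */

static int encode_advice(unsigned int order)
{
    return (int)(order << HALF_BITS) | MADV_SETREGIONORDER;
}

static unsigned int advice_order(int advice)
{
    return (unsigned int)advice >> HALF_BITS;   /* upper half */
}

static int advice_op(int advice)
{
    return advice & ((1 << HALF_BITS) - 1);     /* lower half */
}
```

On the kernel side, advice_op would select the setregionorder case and advice_order would supply the order to validate against the processor's supported page sizes.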

4.2 Address space data structures

This section discusses the modifications made to the kernel's representation of a virtual address space. The application can modify the page size used by a VMA at runtime, either by an explicit madvise system call or by instructing the kernel to fall back to a smaller page size if a larger one is not available. Consequently, the kernel needs to keep track of the following: firstly, the page size indicated by the application, which is associated with the VMA; secondly, the actual page size used to map a virtual address.

To communicate the requested page order to the VMA's nopage function, another parameter was added. This parameter indicates the desired page order at invocation, and contains the actual page order upon return. We rely upon the fact that subsystems which have not been modified will only return base pages.

Figure 3: The modified page table structure (figure labels: Page Directory, Page Table, Physical memory, with 16K and 4M mappings)

To store the superpage size that actually maps the virtual address range, the PTE includes the order of the mapping. To achieve this, we associated unused bits within the PTE with different page sizes, although the actual bits and sizes may be dictated by hardware.

The page table structure was also modified: superpages which span a virtual address range greater than or equal to that of a non-leaf page directory entry are collapsed until they fit into a single page table node (see Figure 3). This means that we can now have valid page table elements at each level of the address translation hierarchy. This affects kernel routines which scan the page table, for example, the swap routine.

Although the main reason behind this was to conform to the page table structure defined by the x86 family, it also has other advantages: the kernel can use positional information to determine the page size, rather than relying solely on the information stored in the PTE. This means that the number of page sizes supported by the kernel is not restricted by the number of unused bits in the PTE (which can be quite few). There may also be some performance advantage, as the TLB refill handler needs to traverse fewer page table levels.
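Positional page size determination can be sketched as below: a valid entry at the directory level maps a directory-sized region, while a leaf PTE's order bits give the size of smaller superpages. The constants assume IA-32-style 4 KB base pages and 10-bit indices, matching the 16K and 4M labels of Figure 3:

```c
/* Page size implied by where in the hierarchy a valid entry sits,
 * assuming 4 KB base pages and 10 bits of index per level, so a
 * directory-level mapping spans order-10 (4 MB) regions. */
#define PAGE_SHIFT 12
#define PMD_ORDER  10   /* log2 of base pages spanned by one directory entry */

static unsigned long mapping_bytes(int at_directory_level,
                                   unsigned int pte_order)
{
    /* The level supplies the order for collapsed superpages; the
     * PTE's order bits cover everything smaller. */
    unsigned int order = at_directory_level ? PMD_ORDER : pte_order;

    return 1UL << (PAGE_SHIFT + order);
}
```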

4.3 Representing superpages in physical memory

This section discusses the representation of superpages in the page data structure. The kernel needs to keep track of various properties of the superpage, such as whether it is freeable, whether it needs to be written back, etc. The superpage can include sub-superpages which are in use: any superpage operation that affects the sub-superpage also affects the superpage, and this needs to be taken into consideration.

We considered the following representations of superpages: firstly, an explicit hierarchy of page data structures, with one level for each possible order. A superpage would then be operated on using the page data structure at the appropriate level. This implies that each operation would only have to look at a single instance of the page data structure.

This approach is the cleanest in terms of semantics. Unfortunately, the kernel makes certain assumptions about the one-to-one relationship between the page data structure and the actual physical page. Implementing this design would violate those assumptions and also involve significant modifications to the lower levels of the kernel.

The alternative involves a modification to the existing page data structure, such that each page contains the order of the superpage it belongs to. A superpage of order n would then be operated on by iterating over all 2^n base pages. This approach conforms to the kernel's existing semantics. It is, however, subject to various race conditions, and is inelegant.

We implemented a combination of the two approaches presented: while we do not have an explicit hierarchy, there is an implicit hierarchy created by storing the superpage's order in each component base page. We logically partition the properties of a page into those associated with superpages, or with base pages.

This partitioning was guided by the usage of these properties: if the property was used in the VM subsystem only, it was usually put in the superpage partition. If the property was used for I/O, it was put into the base page partition. The properties were then partitioned as follows:

• the page's usage count is per superpage. As all allocations are done in terms of superpages, it follows that a superpage is only freeable if no sub-superpage is being used. This means that whenever a sub-superpage's usage count is modified, the actual modification is applied to the superpage;

• the mapping and offset properties are per base page, as they are only used to perform I/O on the page;

• the wait queue is per base page, as it is used to signal when I/O has completed;

• the flags are partitioned as follows:

locked is per base page, as it is used primarily to indicate that a page is undergoing I/O;

error is per base page, as it is used to indicate an I/O error in the page;


referenced is per superpage, as it is used by the VM subsystem only;

uptodate is per base page, as it is set when I/O successfully completes on a page;

dirty is per superpage, as it is primarily used in the VM subsystem;

lru is per superpage, as it indicates whether a page is in the LRU list, and the LRU list is now defined to contain superpages;

active is per superpage, as it indicates whether a page is in the active list, and the active list is now defined to contain superpages;

launder is per superpage, as it is only used in the swap subsystem, and the swap subsystem has to deal with superpages.

All other flags are per base page, as they reflect static properties of the page (for example, whether the page is in the highmem zone).
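As a rough illustration of this partitioning, the following user-space sketch shows how a superpage property such as dirty is routed to the head page of the superpage, while a base-page property such as locked stays with the individual page. All names here (spage_head, spage_set_dirty, the PG_ constants) are ours for illustration, not kernel API:

```c
#include <assert.h>
#include <stddef.h>

#define PG_locked  (1u << 0)   /* per base page */
#define PG_dirty   (1u << 1)   /* per superpage */

struct page {
    unsigned order;      /* order of the superpage this base page belongs to */
    unsigned flags;
};

/* Find the head (first base page) of the superpage containing page i,
 * using the implicit hierarchy: mask off the low order bits of the index. */
static struct page *spage_head(struct page *mem, size_t i)
{
    size_t head = i & ~((1ul << mem[i].order) - 1);
    return &mem[head];
}

/* Superpage property: applied to the head page on behalf of any sub-page. */
static void spage_set_dirty(struct page *mem, size_t i)
{
    spage_head(mem, i)->flags |= PG_dirty;
}

/* Base-page property: applied to the page itself (e.g. during I/O). */
static void page_set_locked(struct page *p)
{
    p->flags |= PG_locked;
}
```

Marking any sub-page of an order-2 superpage dirty thus sets the flag on base page 0, while locking remains visible only on the page actually undergoing I/O.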

Operations that iterate over each base page in a superpage are required to operate in ascending order to avoid deadlock or other inconsistencies.

4.4 Page allocation

The current page allocator supports multiple page sizes; however, it has two major problems: firstly, non-swappable pages can be spread throughout each zone, causing memory fragmentation; secondly, if a large page is required, but a user (i.e., swappable) page is in the way, there is no efficient way to find all users of that page.

While the latter problem can be solved by Rik van Riel's reverse mapping patch [18], the former is still an issue. For this implementation, we have created another largepage zone, which is used exclusively for large pages. While this is not a permanent solution, it does aid in debugging, and solves the immediate problem for specialised users. The size of the largepage zone is fixed at boot time.

For maximum flexibility, the current allocator should be modified so that pages which are not pageable are allocated such that they do not cause fragmentation. Also, pages which are allocated together will probably be freed together, so clustering pages at allocation time may also reduce fragmentation.

4.5 The Page Cache

To support mapping files with superpages, the page cache needs to be modified. The bulk of these modifications are in the nopage and affiliated functions, which attempt to allocate and read in a superpage of the requested order. To avoid any problems due to overlapping superpages, we require that a superpage of order n also have file order n — that is, the alignment of the superpage in the virtual, physical, and file space is the same. For example, a 64K mapping of a file should be at a file offset that is a multiple of 64K, a virtual offset that is a multiple of 64K, and a physical offset that is a multiple of 64K11.

The changes to the nopage function are essentially straightforward. If an application requests a superpage which is contained in the page cache, it gets back a sub-superpage whose order is the minimum of the requested order and the superpage's order. If a superpage does not exist, a page of the requested order is allocated, each base page is read in, and the superpage is added to the LRU and active queues.
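The order-selection and alignment rules above can be sketched as follows; the helper names are hypothetical, and a 4K (order-0) base page is assumed:

```c
#include <assert.h>

/* Hypothetical helper: the order of the sub-superpage handed back by a
 * nopage-style lookup is the minimum of what was requested and what is
 * cached, per the rule described in the text. */
static unsigned nopage_result_order(unsigned requested, unsigned cached)
{
    return requested < cached ? requested : cached;
}

/* Alignment rule: an order-n superpage must sit at the same order-n
 * alignment in file space and virtual space (physical alignment is
 * enforced by the allocator). base_shift is 12 for a 4K base page. */
static int superpage_aligned(unsigned long file_off, unsigned long vaddr,
                             unsigned order, unsigned base_shift)
{
    unsigned long mask = (1ul << (order + base_shift)) - 1;
    return (file_off & mask) == 0 && (vaddr & mask) == 0;
}
```

With a 4K base page, a 64K superpage is order 4, so both the file offset and the virtual address must be 64K-aligned for the check to pass.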

Because reading in a large page can cause significant I/O activity (the amount of time required to read in 4MB of data from a disk can be significant), we may need to read in base pages in a more intelligent fashion. One solution is to read in the sub-superpage which contains the address of interest first, and schedule the remainder of the superpage to be read in after the first sub-superpage has completed. When the rest of the superpage has completed I/O, the address space can be mapped with the superpage. Note that this is similar to the early restart method used in some modern processors to fetch a cache line.

11 The virtual and physical alignment constraints are common to most architectures.

4.6 The swap subsystem

In our current implementation, a region mapped with superpages will not be swapped out. Swapping a superpage would negate any performance gained by its use due to the high cost of disk I/O. The superpage may need to be written back, however, and this is handled in an essentially iterative manner — when the superpage is not being used by any applications, and it is chosen by the swap subsystem to be swapped out (i.e., when it appears as a victim on the LRU list), each base page is flushed to disk, and the superpage is freed.

In the future, a number of approaches present themselves. The kernel may, for example, split up a superpage into smaller superpages over a series of swap events, until a threshold superpage order is met, and then swap that out. Alternatively, the kernel may just swap out the entire page.

4.7 Architecture specifics

This section discusses the architecture specific aspects of our implementation. Although our implementation attempts to be generic, the kernel requires knowledge of the architecture's support for multiple page sizes and the additional page table requirements.

The architecture specific layer in our implementation consists mainly of page table operations, i.e., creating and accessing a PTE. To construct a PTE, the kernel now uses mk_pte_order, which is identical to mk_pte12 except for an additional order parameter. This function creates a PTE which maps a page of order order. To allow the kernel to inspect a PTE, a pte_order function is required. This function returns the order of a PTE.

On architectures which use an additional page table (usually because it is required by the hardware), the update_mmu_cache function needs to be modified to take superpages into consideration. The kernel also requires a mechanism to verify that a page size is supported. This is achieved by implementing the pgorder_supported function.

4.8 Anatomy of a large page fault

In systems with a hardware loaded TLB, a TLB miss is transparent to the kernel, and so is no different in the case of a large page. In architectures with a software TLB refill handler, the new page table structure needs to be taken into consideration: the handler needs to check whether each level in the page table hierarchy is a valid PTE. The refill handler also needs to extract the page size from the entry and insert the correct (VA, PA, size) entry into the TLB.
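A toy user-space model of such a refill walk, assuming a two-level table with 10 index bits per level and a 4K base page (the structures and widths here are illustrative, not the actual handler):

```c
#include <assert.h>

#define LEVEL_BITS 10
#define BASE_SHIFT 12

/* A valid leaf entry may appear at either level; its order gives the
 * page size to insert into the TLB. */
struct pte {
    int valid;
    int leaf;
    unsigned order;            /* page order if leaf */
    unsigned long pfn_or_next; /* frame number, or pointer to next level */
};

struct tlb_entry { unsigned long vpn, pfn; unsigned order; };

/* Returns 0 and fills *out on success, -1 if no valid translation. */
static int refill(struct pte *root, unsigned long vaddr, struct tlb_entry *out)
{
    unsigned idx1 = (vaddr >> (BASE_SHIFT + LEVEL_BITS)) & ((1u << LEVEL_BITS) - 1);
    struct pte *e = &root[idx1];
    if (!e->valid)
        return -1;
    if (!e->leaf) {                       /* not a leaf: descend one level */
        struct pte *l2 = (struct pte *)e->pfn_or_next;
        unsigned idx2 = (vaddr >> BASE_SHIFT) & ((1u << LEVEL_BITS) - 1);
        e = &l2[idx2];
        if (!e->valid || !e->leaf)
            return -1;
    }
    out->vpn = vaddr >> (BASE_SHIFT + e->order);
    out->pfn = e->pfn_or_next;
    out->order = e->order;
    return 0;
}
```

The check at the upper level is the new part: a collapsed 4M superpage (order 10 here) terminates the walk one level early, and its order is what the handler would program into the TLB entry.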

If there is no valid mapping in the page table, a page fault occurs. As with the standard kernel, the VMA is found and the access is validated. The PTE is then located, although a missing page table node is not created at this point — it is allocated later in the page fault process. This postponement in allocating page table nodes is required because the kernel does not yet know what size the allocated page will be: this is determined when the page is allocated.

12 For backwards compatibility, mk_pte calls mk_pte_order with order 0.

On a write access to a page marked read-only in the PTE, a private copy is created and replaces the read-only mapping. This involves copying the entire superpage, so it is a relatively expensive operation — as with all superpage operations, there is only overhead if the operations would not have been done on each base page anyway. For example, writing a single character to a 4MB mapping will result in the whole 4MB being copied, which would not have occurred if the region was mapped with 4K pages. Conversely, if most or all of the base pages are to be written to, copying them in one operation may reduce the total overhead due to caching effects and the reduced number of page faults.

If no mapping exists, the VMA's order field is consulted to determine the application's desired page size. If there are pages mapped into the region defined by this order and the fault address, and the application has elected to opportunistically allocate superpages, the kernel selects the largest supported order that contains the fault address, contains no mapped pages, and is less than or equal to the desired order. Otherwise, the application's desired page order is selected.
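A sketch of this selection policy, with stubbed-out queries standing in for the real architecture and page table checks (the stub values, 4K and 4M supported and existing mappings below 4M, are purely for illustration):

```c
#include <assert.h>

/* Stubbed queries: the hardware supports 4K (order 0) and 4M (order 10)
 * pages, and addresses below DEMO_MAPPED_END already have pages mapped.
 * In the kernel these would consult the architecture layer and the page
 * table, respectively. */
#define DEMO_MAPPED_END 0x400000ul

static int pgorder_supported(unsigned order)
{
    return order == 0 || order == 10;
}

static int region_has_mappings(unsigned long start, unsigned long len)
{
    (void)len;
    return start < DEMO_MAPPED_END;
}

/* Pick the largest supported order <= desired whose naturally aligned
 * region around the fault address contains no already-mapped page. */
static unsigned pick_fault_order(unsigned long fault_addr, unsigned desired)
{
    for (unsigned order = desired; order > 0; order--) {
        unsigned long len = 1ul << (order + 12);   /* 4K base pages */
        unsigned long start = fault_addr & ~(len - 1);
        if (pgorder_supported(order) && !region_has_mappings(start, len))
            return order;
    }
    return 0;   /* fall back to a base page */
}
```

A fault above the stubbed mapped region gets the full 4M order, while a fault whose 4M-aligned region overlaps existing mappings degrades to a base page.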

After the kernel has determined the correct page order, it examines the VMA's nopage method. If the nopage method is not defined, a zeroed superpage is allocated and inserted into the page table. Otherwise, the nopage method is called with the calculated page order, and the result is inserted into the page table.

If the file that backs the VMA is using the page cache to handle page faults, the kernel searches the page cache for the file data associated with the fault address. If a superpage is found, the minimum of the superpage's order and the requested order is used to determine the sub-superpage to be validated. The sub-superpage is then checked to ensure its contents are valid, and if so, it is returned. If the sub-superpage's contents are not valid, each base page is read in, and the sub-superpage is returned.

    I-TLB, 4K pages      128 entries, 4-way SA
    I-TLB, 4M pages      fragmented into 4K I-TLB
    I-L1 cache           12K micro-ops
    D-TLB, 4K pages      64 entries, FA
    D-TLB, 4M pages      shared with 4K D-TLB
    D-L1 cache           8K, 64-byte CL, 4-way SA
    unified L2 cache     256K, 64-byte CLS, 8-way SA

Table 1: Pentium 4 processor's memory system characteristics (Notation: CL - cache lines; CLS - cache lines, sectored; SA - set associative; FA - fully associative).

5 Experimental Results

In this section, we present and analyze the experimental data from our implementation of multiple page size support in the Linux kernel.

All results in this section were generated on a 1.8GHz Pentium 4 system with 512M of RAM. The Pentium 4 processor has separate instruction and data TLBs and supports two different page sizes: 4K and 4M13. Table 1 shows the parameters of the memory system of Pentium 4.

5.1 Validating the Implementation with a Micro-benchmark

This section presents and discusses the data validating the accuracy of our implementation and demonstrating the benefits of multiple page size support for a simple microbenchmark. The use of a simple benchmark makes it possible to reason in detail about its memory

13 Note that with large physical memory support (>4GB), the large page size on Pentium 4 processors is 2M.


[Figure 4 plots, for the DTLB micro-benchmark (L1-DCache, 1000 iterations, 128K increments), execution time in milliseconds against test size in megabytes for the 4K and 4M page sizes (left), and the 4k:4M ratio of execution times against test size (right).]

Figure 4: The execution times of the microbenchmark with small 4K pages and large 4M pages (left) and the ratios of execution times (right).

behavior and its interactions with the memory system.

The benchmark allocates a heap and initializes it with data. We vary the heap size from 128K to 32M in 128K increments in order to adjust the working set of the benchmark. The benchmark performs 1000 iterations, during each of which it strides through the heap in the following manner: for each 4K page, it accesses one word of data. Assuming that caches and TLBs do not contain any information, each data access brings one cache line of PTEs and one cache line of data into the data L1 cache. To ensure that consecutive accesses do not compete for cache lines in the same cache set, we increment the offset at which we access data within a page by the size of a cache line. We also access only every sixteenth page to ensure that we use only one PTE per L1 cache line14.
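The access pattern can be sketched as follows (our reconstruction of the described stride loop, not the authors' benchmark code):

```c
#include <assert.h>
#include <stdlib.h>

#define PAGE_SIZE   4096
#define CACHE_LINE  64
#define PAGE_STRIDE (16 * PAGE_SIZE)  /* every 16th page: one PTE line per touch */

static volatile long sink;            /* keeps the loads from being optimized out */

/* One iteration: touch one word per visited 4K page, advancing the
 * in-page offset by a cache line each time so consecutive accesses do
 * not compete for the same cache set. */
static void stride_iteration(char *heap, size_t heap_size)
{
    size_t off = 0;
    for (size_t p = 0; p + PAGE_STRIDE <= heap_size; p += PAGE_STRIDE) {
        sink += *(long *)(heap + p + off);
        off = (off + CACHE_LINE) % PAGE_SIZE;
    }
}
```

The full benchmark would run this loop 1000 times for each heap size from 128K to 32M and time the result under the 4K and 4M page mappings.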

We performed two sets of experiments. In the first set, the heap was mapped with 4K pages. In the second set, the heap was mapped with 4M pages. Both the 4K and the 4M cases have several inflection points. The first two inflection points for the 4K case are at 4M and 6M, and the first two inflection points for the 4M case are at 8M and 10M. The first inflection point indicates that the important working set (consisting of data and PTEs) can no longer fit in the fast L1 cache. Up to this point, the benchmark achieves full L1 cache reuse (both data and PTEs fit in the L1 cache)15. Between the first and the second inflection points, the benchmark achieves partial cache reuse (some of the data and PTEs remain in L1 across iterations). After the second inflection point, there is no L1 cache reuse (neither data nor PTEs remain in the L1 cache across iterations). The working set, however, still fits in the larger L2 cache. The performance of the 4K case degrades sooner than that of the 4M case due to the space overhead of PTEs16. The 4M case does not suffer from this behavior as it uses few PTEs and, hence, significantly less space in the L1 data cache; each cache line can accommodate 16 PTEs, mapping a total of 64M of contiguous address space.

14 On our Pentium 4 machine, one 64-byte cache line accommodates sixteen 4-byte PTE entries.

By extending the portion of the graph where the benchmark achieves full L1 cache reuse (i.e., past the first inflection point to the right), one can estimate the performance of the benchmark on a system with an increasingly larger L1 cache. Similarly, by extending the portion of the graph where the benchmark experiences no L1 cache reuse, one can estimate the performance of the benchmark on a system with a slower L1 data cache (whose access time is equal to the access time of the L2 cache of our configuration). The next inflection point (not shown on the graph) will occur when the L2 cache starts to saturate.

15 Coincidentally, because we access one cache line of data per 4K page and access every sixteenth page, the 64-entry D-TLB begins thrashing at 4M, too.

16 Namely, the PTEs occupy the same number of cache lines as the data. Consequently, the number of L1 misses begins to grow once the number of distinct pages we touch exceeds one half the number of cache lines in the L1 data cache.

5.2 Assessing Performance for Traditional Workloads

This section discusses the performance of multiple page size support in the context of the SPEC CPU2000 benchmark suite [16], specifically CINT2000, the integer component of SPEC CPU2000.

The CINT2000 benchmark suite was designed to measure the performance of a CPU and its memory subsystem. There are 12 integer benchmarks in the suite. These are the gzip data compression utility, the vpr circuit placement and routing utility, the gcc compiler, the mcf minimum cost network flow solver, the crafty chess program, the parser natural language processor, the eon ray tracer, the perlbmk17 perl utility, the gap computational group theory package, the vortex object oriented database, the bzip2 data compression utility, and the twolf place and route simulation benchmark. All applications, except for eon, are written in C. The eon benchmark is written in C++.

We noted that the applications in the CINT2000 suite use the malloc family of functions to allocate the majority of their memory. To provide the application with memory backed by large pages via the malloc function, we modified the sbrk function. The memory allocator uses sbrk to allocate memory at page granularity; it then hands portions of this memory to the application upon request. The sbrk function ensures that the pages it gives to the memory allocator are valid; i.e., it grows the process's heap using the brk system call when required.

17 Due to compilation difficulties, this benchmark was excluded from our results.

We modified the sbrk function so that it returns memory backed by large pages. At the first request, sbrk maps a large region of memory, and uses the madvise system call to map that region with large pages. Whenever the memory allocator requests memory, sbrk returns the next free page in this region.

If a memory request is greater than some threshold (128K), the standard memory allocator will allocate pages using the mmap system call. To ensure that the memory allocator returned memory backed by large pages, we disabled this feature so that the allocator always uses our sbrk.

To allow the applications to use our modified memory allocator and sbrk functions, we placed these functions in a shared library and used the dynamic linker's preload functionality. We set the LD_PRELOAD environment variable to our library, so the dynamic linker will resolve any malloc function calls in the application to our implementation. In this way, no recompilation is necessary for the applications to use large pages.

Table 2 shows the performance results we obtained using large pages. Overall, the results obtained are encouraging, with many applications showing approximately 15% improvement in run time.


    Benchmark      Improvement (%)
    164.gzip       12.31
    175.vpr        16.72
    176.gcc         9.29
    181.mcf         9.43
    186.crafty     15.22
    197.parser     16.30
    252.eon        12.07
    254.gap         5.91
    255.vortex     22.27
    256.bzip2      14.37
    300.twolf      12.47

Table 2: Performance improvements for the SPEC CPU2000 integer benchmark suite using large pages.

5.3 Assessing Performance with Emerging Workloads

This section discusses the impact of large pages on the performance of Java workloads. Java applications, and SPECjvm98 [15] applications in particular, are known to have poor cache and page locality of data references [11, 14]. To demonstrate the advantages of large pages for Java programs, we conducted a set of experiments with the fast configuration of the Jikes Research Virtual Machine (Jikes RVM) [1, 2], configured with the mark-and-sweep memory manager (consisting of an allocator and a garbage collector) [3, 10].

To get the baseline numbers, i.e., where the heap is mapped with 4K pages, we ran the SPECjvm98 applications with the largest available data size on an unmodified Jikes RVM. The virtual address space in Jikes RVM consists of three regions: the bootimage region, the small heap (the heap region intended for small objects), and the large heap (for objects whose size exceeds 2K). We modified the bootimage runner of Jikes RVM18 to ensure that the small heap is aligned to a 4M boundary and is mapped by 4M pages.

18 The bootimage runner is a program responsible for mapping memory for Jikes RVM and the heap, loading the core of the RVM into memory, and then passing control to the RVM.

The decision to map only the small heap to large pages was based on the observation that, with a few exceptions, most objects created by SPECjvm98 are small. We then repeated the experiments by mapping all three heap regions to large pages. We also varied the size of the small heap from 16M to 128M and computed the performance improvements with 4M pages over a configuration that uses only 4K pages.

For each application, Figure 5 shows the minimum, the average, and the maximum performance improvements when the small heap is mapped to large pages (left) and when all three heap regions are mapped to large pages (right). It can be seen that for several applications the performance improvements are consistent and range from 15% to 30% even if only the small heap is mapped to large pages. The compress benchmark is the only one in the suite that creates a significant number of large objects and only a few small objects, and so does not benefit from large pages in this case.

When all three heap regions are mapped to large pages, we observe an additional 5% to 10% performance improvement. For many applications, the performance improvement ranges from 20% to 40% over the base case. It can also be seen that the compress benchmark enjoys a significant performance boost.

5.4 Discussion

The observed benefits of large page support can vary and depend on a number of factors, such as the characteristics of applications and architecture. In this section, we discuss some of these factors.



[Figure 5 plots, for each SPECjvm98 benchmark (_201_compress, _202_jess, _205_raytrace, _209_db, _213_javac, _222_mpegaudio, _227_mtrt, _228_jack), the performance improvement in percent with the small heap mapped to large pages (left) and with all three heap regions mapped to large pages (right).]

Figure 5: Summary of results for SPECjvm98.

A small number of TLB entries covering large pages19 may not be sufficient for a realistic application to take full advantage of large page support. If the working set of an application is scattered over a wide range of address space, the application is likely to experience thrashing of the relatively small 4M page TLB, in some cases to a much larger extent than with the 64-entry 4K page data TLB. This is a problem on processors like the Pentium II and Pentium III.

Applications executing on processors with software loaded TLBs are expected to benefit from large pages. The TLB miss overhead of an application executing on a processor that handles TLB misses in hardware (such as x86 processors) can also be significant unless most page tables of an application can fit in the L2 cache and co-reside with the rest of the working set. This is highly unlikely for applications of interest: assuming that each L2 cache line is 32 bytes and each PTE is 4 bytes, one L2 cache line can cover eight 4K pages (a total of 32K). Hence, a 512K L2 cache can accommodate PTEs to cover only 512M of address space (and this does not leave any space for data in the L2 cache). Consequently, for applications with relatively large working sets, it is highly likely that a significant fraction of PTEs would not be found in the L2 cache on a TLB miss. Although hardware makes reloading a TLB from an L2 cache relatively inexpensive, many TLB misses may need to be satisfied directly from memory.

19 In Pentium II and Pentium III microprocessors, there are eight 4M page data TLB entries, some of which are used by the kernel.
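The coverage arithmetic above can be spelled out as a small helper:

```c
#include <assert.h>

/* Address space covered if a cache of l2_size bytes were filled entirely
 * with page table entries: (lines) * (PTEs per line) * (page size).
 * With 32-byte lines, 4-byte PTEs, and 4K pages, one line covers
 * 8 * 4K = 32K, and a 512K L2 covers only 512M. */
static unsigned long l2_pte_coverage_bytes(unsigned long l2_size,
                                           unsigned line_size,
                                           unsigned pte_size,
                                           unsigned long page_size)
{
    unsigned long lines = l2_size / line_size;
    unsigned long ptes_per_line = line_size / pte_size;
    return lines * ptes_per_line * page_size;
}
```

Note that the result depends only on the cache size, PTE size, and page size; the Pentium 4's 64-byte lines give the same 512M figure for a 512K L2.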

The Java platform [7] presents another set of challenges. For performance reasons, state-of-the-art JVMs compile Java bytecode into executable machine code [1, 2, 9]. In some virtual machines, such as Jikes RVM [1, 2], generated machine code is placed into the same heap as application data and is managed by the same memory manager. It has been observed that the code locality of Java programs tends to be better than their data locality [14]. This suggests that application code should reside in small pages while application data should reside in large pages. In Jikes RVM, generated code and data objects are indistinguishable from the memory manager's point of view and are intermixed in the heap. Because a memory region can only be mapped to either small or large pages, a tradeoff must be made. Mapping the entire heap region to large pages may not be effective since application code may not need to use large pages. What is worse is that some processors have a very small number of 4M page instruction TLB entries20, which can lead to thrashing of the instruction TLB. Consequently, for best performance results, a JVM should be made aware of the constraints imposed by the underlying OS and hardware, and segregate application code and data into separate, well-defined regions.

For Java programs, some performance gains are expected to come from better garbage collection (GC) performance. Much work during garbage collection is spent on chasing pointers in objects to find all reachable objects in the region of the heap that is being collected [4]. Many reachable objects can be scattered throughout the heap. As a result, the locality of GCs is often worse than that of applications [11]. This behavior is representative of systems employing non-moving GCs, which have to be used when some objects cannot be relocated (e.g., when not all pointers can be identified reliably by a runtime). Consequently, large pages can improve TLB miss rates during GC (and overall GC performance). Applications that perform GC frequently, have a lot of live data at GC times, or whose live data are spread around the heap can benefit from large page support and achieve short GC pauses. Short pauses are critical for software systems that are expected to have relatively predictable response times.

20 There are only two 4M page instruction TLB entries in Pentium II and Pentium III processors.

The availability of large pages can also be beneficial for programs that use data prefetching instructions. Modern processors squash a data prefetching request if the appropriate translation information is not available in the data TLB. Consequently, high TLB miss rates can lead to many prefetching requests being squashed, thereby leading to ineffective utilization of memory bandwidth and reduced application performance [14]. The use of large pages can help reduce TLB misses and take full advantage of prefetching hardware. Further, hardware performing automatic sequential data and code prefetching stops when a page boundary is crossed and has to be restarted at the beginning of the next page21. Large pages make it possible for such hardware to run for a longer period of time and to perform more useful work with fewer interruptions.

6 Related work

Ganapathy and Schimmel [6] discussed a design of general purpose operating system support for large pages. They implemented their design in the IRIX operating system for the SGI ORIGIN 2000 system, which employs the MIPS R10000 processor (which handles TLB misses in software).

An important aspect of their approach is that it preserves the format of the pfdat and PTE data structures of the IRIX OS. The pfdat structures represent pages of a base size and contain no page size information (just as in the original system). Large pages are simply treated as a collection of base pages. Consequently, only a few parts of the OS kernel need to be aware of large pages and need to be modified.

21 This is due to the fact that such automatic prefetching hardware uses physical addresses for prefetching.

The PTEs contain the page size information, but the page table layout is unchanged. They use one PTE for each base page of a large page and create a set of PTEs that correspond to all addresses falling within a large page. As expected, for the large page PTEs, the page frame numbers are contiguous.

To support multiple page sizes, the TLB miss handler needs to set a page mask register in the processor on each TLB miss. To ensure that programs that do not use large pages do not incur unnecessary runtime overhead, a TLB handler is configured per process. The allocation policy is specified on a command line (on a per segment basis) before starting an application. Hence, applications do not need to be modified to take advantage of large pages, and applications that do not use large pages are not put at a disadvantage.

The advantage of this design is that it allows different processes to map the same large page with different page sizes. The disadvantages are that (i) this approach does not reduce the size of page tables for applications that use large pages, and (ii) the information stored in PTEs that cover a large page needs to be kept consistent.

They demonstrated that applications from SPEC95 and the NAS parallel suite do benefit from large pages. For these applications, they registered an 80% to 99% reduction in TLB misses and a 10% to 20% performance improvement. A business application like the TPC-C benchmark (which is known to have poor locality and a large working set) was also shown to benefit from large pages. The authors report a 70% to 90% reduction in TLB misses and a 6% to 9% performance improvement for this application.

Subramanian et al. [17] describe their implementation of multiple page size support in the HP-UX operating system for the HP-9000 Series 800 system, which uses the PA-8000 microprocessor.

In their design, the VM data structures, such as the page table entry and the virtual and physical page frame descriptors, are based on the smallest page size supported by the processor. A large page is defined as a set of contiguous small base size pages. Hence, this design is conceptually similar to that of Ganapathy and Schimmel [6].

The authors note that an important advantage of this approach is that it does not require changes to many parts of the OS. However, neither does it reduce the sizes of data structures for applications that use large pages. In addition, locking, access, and updates of data structures for large pages are somewhat inefficient. In spite of the benefits of space efficiency and the efficiency of updates, they chose not to use variable page size based data structures because, as the authors indicate, such an approach would lead to more changes in the OS and would have negative performance implications (e.g., a high page-fault latency in certain cases).

In their scheme, applications do not need to be recompiled to take advantage of large pages. The hints specifying large page sizes are region-based and are used at page fault time. In some cases, such as for performance reasons, the OS can ignore these page size hints and fall back to mapping small pages.

They implemented their design in the HP-UX operating system and studied the impact of large pages on several VM benchmarks, SPEC95 applications, and one commercial application. The reported performance improvements range from 15% to 55%.

Ottawa Linux Symposium 2002 592

7 Conclusions and Further Work

Many modern processors support pages of various sizes, ranging from a few kilobytes to several megabytes. The Linux OS uses large pages internally for its kernel (to reduce the overhead of TLB misses) but does not expose large pages to applications. Growing memory latencies and the large working sets of applications make it important to provide support for large pages to user-level code as well.

In this paper, we discussed the design and implementation of multiple page size support in the Linux kernel. We validated our implementation on a simple microbenchmark. We also demonstrated that realistic applications can take advantage of large pages to achieve significant performance improvements.

This work opens up a number of interesting directions. In the future, we plan to modify the kernel's memory allocator to further support large pages. We would also like to evaluate the impact of large pages on database and web workloads. These types of workloads are known to have large working sets and poor locality. Achieving high performance on commercial workloads is crucial for the continuing success of Linux.

The latency of fetching a large 4M page from disk (as a result of a page fault) can be significant. We are considering implementing an "early restart" feature that would fetch and map the critical chunk of data first and complete fetching the remaining data chunks later, thereby reducing the pauses experienced by applications.
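The "early restart" ordering can be sketched as follows (an illustrative helper, not the paper's implementation): chunks of a large page are fetched starting at the faulting chunk and wrapping around, so the data the application is actually waiting on arrives first.

```c
#include <assert.h>
#include <stddef.h>

/* Compute the fetch order for the nchunks base-sized chunks of a
 * large page: start at the faulting chunk and wrap around, so the
 * critical chunk is read (and can be mapped) first. */
void fetch_order(size_t nchunks, size_t faulting, size_t *out)
{
    for (size_t i = 0; i < nchunks; i++)
        out[i] = (faulting + i) % nchunks;
}
```

This mirrors the "critical word first" technique used by hardware caches, applied at page-fetch granularity.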

Some architectures support a number of different page sizes (e.g., 16K, 256K, 4M, and 64M). We would be interested in evaluating the performance of applications on systems that have this architectural support.

Acknowledgements

We would like to thank Pratap Pattnaik and Manish Gupta for supporting this work. We would also like to thank Chris Howson for all of his helpful advice.

References

[1] B. Alpern, C. R. Attanasio, J. J. Barton, M. G. Burke, P. Cheng, J.-D. Choi, A. Cocchi, S. J. Fink, D. Grove, M. Hind, S. F. Hummel, D. Lieber, V. Litvinov, M. F. Mergen, T. Ngo, J. R. Russell, V. Sarkar, M. J. Serrano, J. C. Shepherd, S. E. Smith, V. C. Sreedhar, H. Srinivasan, and J. Whaley. The Jalapeño Virtual Machine. IBM Systems Journal, 39(1), 2000.

[2] B. Alpern, A. Cocchi, D. Lieber, M. Mergen, and V. Sarkar. Jalapeño - a compiler-supported Java virtual machine for servers. In Workshop on Compiler Support for System Software (WCSSS 1999), Atlanta, GA, May 1999.

[3] C. Attanasio, D. Bacon, A. Cocchi, and S. Smith. A comparative evaluation of parallel garbage collectors. In Proc. of the Fourteenth Annual Workshop on Languages and Compilers for Parallel Computing (LCPC), Cumberland Falls, Kentucky, Aug. 2001.

[4] H.-J. Boehm. Reducing garbage collector cache misses. In Proc. of ISMM 2000, Oct. 2000.

[5] J. Bonwick. The slab allocator: An object-caching kernel memory allocator. In Summer 1994 USENIX Conference, pages 87–98, 1994.

[6] N. Ganapathy and C. Schimmel. General purpose operating system support for multiple page sizes. In Proc. of the 1998 USENIX Technical Conference, New Orleans, USA, June 1998.

[7] J. Gosling, B. Joy, and G. Steele. The Java(TM) Language Specification. Addison-Wesley, 1996.

[8] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, 1995.

[9] The Java Hotspot Performance Engine Architecture. http://java.sun.com/products/hotspot/whitepaper.html.

[10] R. Jones and R. Lins. Garbage Collection: Algorithms for Automatic Dynamic Memory Management. John Wiley and Sons, 1996.

[11] J.-S. Kim and Y. Hsu. Memory system behavior of Java programs: Methodology and analysis. In Proc. of SIGMETRICS 2000, June 2000.

[12] C. Navarro, A. Ramirez, J.-L. Larriba-Pey, and M. Valero. Fetch engines and databases. In Proc. of the Third Workshop on Computer Architecture Evaluation Using Commercial Workloads, Toulouse, France, 2000.

[13] V. Oleson, K. Schwan, G. Eisenhaur, B. Plale, C. Pu, and D. Aminv. Operational information systems - an example from the airline industry. In First USENIX Workshop on Industrial Experiences with Systems Software (WIESS), San Diego, California, October 2000.

[14] Y. Shuf, M. J. Serrano, M. Gupta, and J. P. Singh. Characterizing the memory behavior of Java workloads: A structured view and opportunities for optimizations. In Proc. of SIGMETRICS 2001, June 2001.

[15] Standard Performance Evaluation Corporation. SPEC JVM98 Benchmarks, 1998. http://www.spec.org/osg/jvm98/.

[16] Standard Performance Evaluation Corporation. SPEC CPU2000 Benchmarks, 2000. http://www.spec.org/osg/cpu2000/.

[17] I. Subramanian, C. Mather, K. Peterson, and B. Raghunath. Implementation of multiple page size support in HP-UX. In Proc. of the 1998 USENIX Technical Conference, New Orleans, USA, June 1998.

[18] R. van Riel. Rik van Riel's Linux kernel patches. http://www.surriel.com/patches/.

Proceedings of the Ottawa Linux Symposium

June 26th–29th, 2002
Ottawa, Ontario, Canada

Conference Organizers

Andrew J. Hutton, Steamballoon, Inc.
Stephanie Donovan, Linux Symposium
C. Craig Ross, Linux Symposium

Proceedings Formatting Team

John W. Lockhart, Wild Open Source, Inc.

Authors retain copyright to all submitted papers, but have granted unlimited redistribution rights to all as a condition of submission.