Composing High-Performance Memory Allocators emery/pubs/berger-pldi2001.pdf · Composing High-Performance…

Download Composing High-Performance Memory Allocators emery/pubs/berger-pldi2001.pdf · Composing High-Performance…

Post on 18-Aug-2018

212 views

Category:

Documents

0 download

Embed Size (px)

TRANSCRIPT

<ul><li><p>Composing High-Performance Memory Allocators</p><p>Emery D. BergerDept. of Computer Sciences</p><p>The University of Texas at AustinAustin, TX 78712</p><p>emery@cs.utexas.edu</p><p>Benjamin G. ZornMicrosoft ResearchOne Microsoft Way</p><p>Redmond, WA 98052</p><p>zorn@microsoft.com</p><p>Kathryn S. McKinleyDept. of Computer ScienceUniversity of Massachusetts</p><p>Amherst, MA 01003</p><p>mckinley@cs.umass.edu</p><p>ABSTRACTCurrent general-purpose memory allocators do not provide suffi-cient speed or flexibility for modern high-performance applications.Highly-tuned general purpose allocators have per-operation costsaround one hundred cycles, while the cost of an operation in a cus-tom memory allocator can be just a handful of cycles. To achievehigh performance, programmers often write custom memory allo-cators from scratch a difficult and error-prone process.</p><p>In this paper, we present a flexible and efficient infrastructurefor building memory allocators that is based on C++ templates andinheritance. This novel approach allows programmers to build cus-tom and general-purpose allocators as heap layers that can becomposed without incurring any additional runtime overhead or ad-ditional programming cost. We show that this infrastructure simpli-fies allocator construction and results in allocators that either matchor improve the performance of heavily-tuned allocators written inC, including the Kingsley allocator and the GNU obstack library.We further show this infrastructure can be used to rapidly build ageneral-purpose allocator that has performance comparable to theLea allocator, one of the best uniprocessor allocators available. Wethus demonstrate a clean, easy-to-use allocator interface that seam-lessly combines the power and efficiency of any number of generaland custom allocators within a single application.</p><p>1. IntroductionMany general-purpose memory allocators implemented for C andC++ provide good runtime and low fragmentation for a wide rangeof applications [15, 17]. However, using specialized memory al-locators that take advantage of application-specific behavior candramatically improve application performance [4, 13, 26]. Hand-coded custom allocators are widely used: three of the twelve inte-ger benchmarks included in SPEC2000 (parser, gcc, and vpr[22]) and several server applications, including the Apache web</p><p>This work is supported by NSF grant EIA-9726401, NSF Infrastructuregrant CDA-9502639, NSF ITR grant CCR-0085792, and DARPA grant 5-21425. Part of this work was done while the first author was at MicrosoftResearch. Emery Berger was also supported by a grant from MicrosoftCorporation. Any opinions, findings and conclusions or recommendationsexpressed in this material are the authors and do not necessarily reflect thoseof the sponsors.</p><p>Permission to make digital or hard copies of all or part of this work forpersonal or classroom use is granted without fee provided that copies arenot made or distributed for profit or commercial advantage and that copiesbear this notice and the full citation on the first page. To copy otherwise, torepublish, to post on servers or to redistribute to lists, requires prior specificpermission and/or a fee.PLDI 2001Copyright 2001 ACM 0-89791-88-6/97/05 ..$5.00</p><p>server [1] and Microsofts SQL Server [10], use one or more cus-tom allocators.</p><p>Custom allocators can take advantage of certain allocation pat-terns with extremely low-cost operations. For example, a program-mer can use a region allocator to allocate a number of small objectswith a known lifetime and then free them all at once [11, 19, 23].This customized allocator returns individual objects from a rangeof memory (i.e., a region), and then deallocates the entire region.The per-operation cost for a region-based allocator is only a hand-ful of cycles (advancing a pointer and bounds-checking), whereasfor highly-tuned, general-purpose allocators, the per-operation costis around one hundred cycles [11]. Other custom allocators canyield similar advantages over general-purpose allocators. Figure 1shows the estimated impact of the custom allocator used in theSPEC benchmark 197.parser, running on its test input. Re-placing its custom allocator by the system allocator increases itsruntime by over 60%.1</p><p>197.parser runtime</p><p>0</p><p>5</p><p>10</p><p>15</p><p>20</p><p>25</p><p>custom allocator system allocator (estimated)</p><p>Allocator</p><p>Ru</p><p>nti</p><p>me </p><p>(sec</p><p>s)</p><p>memory operationscomputation</p><p>Figure 1: The impact of custom allocation on performance for197.parser.</p><p>To attain high performance, programmers often write their ownad hoc custom allocators as macros or monolithic functions in or-der to avoid function call overhead. This approach to improvingapplication performance is enshrined among the best practices of</p><p>1We estimated this cost by measuring the time spent in allocationusing 197.parsers custom allocator and computing a conser-vative estimate of allocation time with the system allocator (whichcannot directly be substituted because of the semantics of the cus-tom allocator). This and the other programs in this paper were com-piled with Visual C++ 6.0 and run under Windows 2000 on a 366MHz Pentium II system.</p></li><li><p>skilled computer programmers [8, 18]. Unfortunately, this kind ofcode is brittle and hard to maintain or reuse, and as the applicationevolves, it can be difficult to adapt the memory allocator as needschange (e.g., to a multithreaded environment). In addition, writ-ing these allocators is both error-prone and difficult. Good memoryallocators are complicated pieces of software that require a sub-stantial engineering effort. Because of this complexity, combininga custom and general-purpose allocator to allow them to share freememory, for example, is currently almost impossible.</p><p>In this paper, we present a flexible and efficient infrastructure forbuilding custom and general-purpose allocators called heap layers.This infrastructure is based on a combination of C++ templates andinheritance called mixins [7]. Mixins are classes whose superclassmay be changed. Using mixins allows the programmer to code al-locators as composable layers that a compiler can implement withefficient code. Unlike previous approaches, we show that this tech-nique allows programmers to write highly modular and reusablecode with no abstraction penalty. We describe a number of high-performance custom allocators that we built by mixing and match-ing heap layers. We show that these allocators match or improveperformance when compared with their hand-tuned, monolithic Ccounterparts on a selection of C and C++ programs.</p><p>We further demonstrate that this infrastructure can be used effec-tively to build high-performance, general-purpose allocators. Weevaluate two general-purpose allocators we developed using heaplayers over a period of three weeks, and compare their performancewith the Kingsley allocator, one of the fastest general-purpose al-locators, and the Lea allocator, an allocator that is both fast andmemory-efficient. While the current heap layers allocator does notachieve the fragmentation and performance of the Lea allocator,the Lea allocator is highly tuned and has undergone many revisionsover a period of more than seven years [17].</p><p>The remainder of this paper is organized as follows. We discussrelated work in Section 2. In Section 3, we describe how we usemixins to build heap layers and demonstrate how we can mix andmatch a few simple heap layers to build and combine allocators.We briefly discuss our experimental methodology in Section 4. InSection 5, we show how we implement some real-world customallocators using heap layers and present performance results. Sec-tion 6 then describes two general-purpose allocators built with heaplayers and compares their runtime and memory consumption to theKingsley and Lea allocators. We describe some of the softwareengineering benefits of heap layers in Section 7, and in Section 8,we show how heap layers provide a convenient infrastructure formemory allocation experiments. We conclude in Section 9 with adiscussion of future directions.</p><p>2. Related WorkIn this section, we describe related work in memory allocation andmemory management infrastructures. We discuss two representa-tive general-purpose memory allocators and describe related workin custom memory allocation. We then compare heap layers to pre-vious infrastructures for building memory managers.</p><p>2.1 General-Purpose Allocation</p><p>The literature on general-purpose memory allocators is extensive[26]. Here we describe two memory allocators, the Kingsley allo-cator used in BSD 4.2 [26] and the Lea allocator [17]. In Section 6,we describe the implementation of these two allocators in heap lay-ers. We chose to implement these allocators because they are inwidespread use and are on opposite ends of the spectrum betweenmaximizing speed and minimizing memory consumption.</p><p>The Kingsley allocator is a power-of-two segregated fits alloca-tor: all allocation requests are rounded up to the next power of two.</p><p>This rounding can lead to severe internal fragmentation (wastedspace inside allocated objects), because in the worst case, it allo-cates twice as much memory as requested. Once an object is allo-cated for a given size, it can never be reused for another size: theallocator performs no splitting (breaking large objects into smallerones) or coalescing (combining adjacent free objects). This algo-rithm is well known to be among the fastest memory allocatorsalthough it is among the worst in terms of fragmentation [15].</p><p>The Lea allocator is an approximate best-fit allocator that pro-vides both high speed and low memory consumption. It forms thebasis of the memory allocator included in the GNU C library [12].The current version (2.7.0) is a hybrid allocator with different be-havior based on object size. Small objects (less than 64 bytes) areallocated using exact-size quicklists (one linked list of freed objectsfor each multiple of 8 bytes). Requests for a medium-sized object(less than 128K) and certain other events trigger the Lea allocatorto coalesce all of the objects in these quicklists in the hope thatthis reclaimed space can be reused for the medium-sized object.For medium-sized objects, the Lea allocator performs immediatecoalescing and spliting and approximates best-fit. Large objectsare allocated and freed using mmap. The Lea allocator is the bestoverall allocator (in terms of the combination of speed and memoryusage) of which we are aware [15].</p><p>2.2 Special-Purpose Allocation</p><p>Most research on special-purpose (custom) allocation has focusedon profile-based optimization of general-purpose allocation. Grun-wald and Zorns CustoMalloc builds memory allocators from allo-cation traces, optimizing the allocator based on the range of objectsizes and their frequency of usage [13]. Other profile-based allo-cators use lifetime information to improve performance and refer-ence information to improve locality [4, 20]. Regions, described inSection 1, have received recent attention as a method for improvinglocality [23]. Aiken and Gay have developed safe regions which de-lay region deletion when live objects are present [11]. Techniquesfor building other application-specific custom allocators have re-ceived extensive attention in the popular press [8, 18].</p><p>2.3 Memory Management Infrastructures</p><p>We know of only two previous infrastructures for building memorymanagers: vmalloc, by Vo, and CMM, by Attardi, Flagella, andIglio. We describe the key differences between their systems andours, focusing on the performance and flexibility advantages thatheap layers provide.</p><p>The most successful customizable memory manager of whichwe are aware is the vmalloc allocator [25]. Vmalloc lets the pro-grammer define multiple regions (distinct heaps) with different dis-ciplines for each. The programmer performs customization by sup-plying user-defined functions and structs that manage memory.By chaining these together, vmalloc does provide the possibility ofcomposing heaps. Unlike heap layers, each abstraction layer paysthe penalty of a function call. This approach often prevents manyuseful optimizations, in particular method inlining. The vmallocinfrastructure limits the programmer to a small set of functions formemory allocation and deallocation; a programmer cannot add newfunctionality or new methods as we describe in Section 6.1. Unlikeheap layers, vmalloc does not provide a way to delete heaps andreclaim all of their memory in one step. These limitations dramati-cally reduce vmallocs usefulness as an extensible infrastructure.</p><p>Attardi, Flagella, and Iglio created an extensive C++-based sys-tem called the Customizable Memory Management (CMM) frame-work [2, 3]. Unlike our work, the primary focus of the CMMframework is garbage collection. The only non-garbage collectedheaps provided by the framework are a single traditional man-</p></li><li><p>ual allocation discipline heap (whose policy the authors do notspecify) called UncollectedHeap and a zone allocator called Tem-pHeap. A programmer can create separate regions by subclassingthe abstract class CmmHeap, which uses virtual methods to obtainand reclaim memory. For every memory allocation, deallocation,and crossing of an abstraction boundary, the programmer must thuspay the cost of one virtual method call, while in heap layers, there isno such penalty. As in vmalloc, this approach often prevents com-piler optimizations across method boundaries. The virtual methodapproach also limits flexibility. In CMM, subclasses cannot imple-ment functions not already provided by virtual methods in the baseheap. Also, since class hierarchies are fixed, it is not possible tohave one class (such as FreelistHeap, described in Section 3.1) withtwo different parent heaps in different contexts. In contrast, themixin-based approach taken by heap layers allows both inheritanceand reparenting, making heap layers more flexible and reusable.</p><p>The goal of heap layers is to provide the performance of existingcustom and general-purpose allocators in a flexible, reusable frame-work that provides a foundation for programmers to build new allo-cators. We implement customized and general-purpose allocatorsusing heap layers, demonstrating their flexibility and competitiveperformance.</p><p>3. Heap LayersWhile programmers often write memory allocators as monolithicpieces of code, they tend to think of them as consisting of separatepieces. Most general-purpose allocators treat objects of differentsizes differently. The Lea allocator uses one algorithm for smallobjects, another for medium-sized objects, and yet another for largeobjects. Conceptually at least, these heaps consist of a number ofseparate heaps that are combined in a hierarchy to form one bigheap.</p><p>The standard way to build components like these in C++ usesvirtual method calls at each abstraction boundary. The overheadcaused by virtual method dispatch is significant when comparedwith the cost of memory allocation. This implementation style alsogreatly limits the opportunities for optimization since the compileroften cannot optimize across method boundaries. Building a...</p></li></ul>

Recommended

View more >