

Parallel Memory Allocation

Steven Saunders


Introduction

Fallacy: all dynamic memory allocators are either scalable, effective, or efficient...

Truth: no allocator exists that handles all situations, allocation patterns, and memory hierarchies best (serial or parallel)

Research results are difficult to compare
– no standard benchmarks; simulations vs. real tests

Distributed memory, garbage collection, and locality are not considered in this talk


Definitions

Heap
– pool of memory available for allocation or deallocation of arbitrarily-sized blocks, in arbitrary order, that will live an arbitrary amount of time

Dynamic Memory Allocator
– used to request or return memory blocks to the heap
– aware only of the size of a memory block, not its type and value
– tracks which parts of the heap are in use and which parts are available for allocation


Design of an Allocator

Strategy - consider regularities in program behavior and memory requests to determine a set of acceptable policies

Policy - decide where to allocate a memory block within the heap

Mechanism - implement the policy using a set of data structures and algorithms

Emphasis has been on policies and mechanisms!


Strategy

Ideal Serial Strategy
– “put memory blocks where they won’t cause fragmentation later”
– serial program behavior: ramps, peaks, plateaus

Parallel Strategies
– “minimize unrelated objects on the same page”
– “bound memory blowup and minimize false sharing”
– parallel program behavior: SPMD, producer-consumer


Policy

Common Serial Policies
– best fit, first fit, worst fit, etc.

Common Techniques
– splitting - break large blocks into smaller pieces to satisfy smaller requests
– coalescing - merge adjacent free blocks to satisfy bigger requests
  immediate - coalesce upon deallocation
  deferred - wait until a request cannot be satisfied
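The splitting technique above can be sketched in a few lines of C. This is an illustrative model, not a real allocator: `block_t`, `split_block`, and the sizes are all assumptions, and a real implementation would also write a header for the remainder block.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of a memory block: just a size and an in-use flag. */
typedef struct { size_t size; int in_use; } block_t;

/* Split a free block to satisfy `request` bytes; returns the leftover
 * free size, which would go back on the free list in a real allocator. */
size_t split_block(block_t *blk, size_t request) {
    size_t leftover = blk->size - request;  /* remainder stays free */
    blk->size = request;
    blk->in_use = 1;
    return leftover;
}
```

Coalescing is the inverse operation: when the block at `request` bytes is freed, an allocator checks whether its neighbors are also free and merges them back into one larger block, either immediately or deferred as described above.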


Mechanism

Each block contains header information
– size, in-use flag, pointer to next free block, etc.

Free List - list of memory blocks available for allocation
– Singly/Doubly Linked List - each free block points to the next free block
– Boundary Tag - size info at both ends of the block
  positive indicates free, negative indicates in use
– Quick Lists - multiple linked lists, where each list contains blocks of equal size
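The boundary-tag mechanism can be sketched as follows; the layout (a signed tag at each end of the payload) is illustrative, not a specific allocator's format. The point of the tail tag is that `free()` can inspect the *preceding* block's status in O(1) when deciding whether to coalesce.

```c
#include <assert.h>
#include <string.h>

/* A signed size stored at both ends of each block:
 * positive means free, negative means in use. */
typedef long tag_t;

/* Write matching head and tail tags around a `size`-byte payload. */
void set_tags(char *block, long size, int in_use) {
    tag_t t = in_use ? -size : size;
    memcpy(block, &t, sizeof t);                        /* head tag */
    memcpy(block + sizeof(tag_t) + size, &t, sizeof t); /* tail tag */
}

/* Read the tag that ends just before `block` - i.e. the previous
 * block's tail tag; its sign says whether that neighbor is free. */
tag_t prev_tag(char *block) {
    tag_t t;
    memcpy(&t, block - sizeof(tag_t), sizeof t);
    return t;
}
```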


Performance

Speed - comparable to serial allocators

Scalability - scale linearly with processors

False Sharing - avoid actively causing it

Fragmentation - keep it to a minimum


Performance (False Sharing)

Multiple processors inadvertently share data that happens to reside in the same cache line
– padding solves the problem, but greatly increases fragmentation

[Figure: processors 0 and 1 each cache the same line from memory; with padding (X), each block gets its own cache line at the cost of wasted space.]
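The padding fix can be made concrete with a per-processor counter; the 64-byte line size and the struct names are assumptions for illustration. Each counter is forced onto its own cache line, so two processors incrementing their own counters never invalidate each other's line - but most of each line is now wasted padding, which is exactly the fragmentation cost noted above.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64  /* assumed cache line size */

/* One counter per processor, padded out to a full cache line so
 * neighboring counters cannot share a line (no false sharing). */
struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];  /* pure overhead */
};

struct padded_counter counters[2];  /* e.g. processors 0 and 1 */
```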


Performance (Fragmentation)

Inability to use available memory

External - available blocks cannot satisfy requests (e.g., too small)

Internal - block used to satisfy a request is larger than the requested size


Performance (Blowup)

Out of control fragmentation
– unique to parallel allocators
– the allocator correctly reclaims freed memory but fails to reuse it for future requests
  available memory is not seen by the allocating processor
– p threads that serialize one after another require O(s) memory
– p threads that execute in parallel require O(ps) memory
– p threads that execute interleaved on 1 processor require O(ps) memory, where each thread runs:

  ...
  x = malloc(s);
  free(x);
  ...

Example: blowup from exchanging memory (producer-consumer)
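The O(ps) bound can be demonstrated with a toy model (all names and the fixed thread limit are illustrative): assume a pure private-heaps allocator, where memory freed by a thread stays cached in that thread's heap, invisible to the others. Then p threads interleaving the same `malloc(s); free(x);` pair each grow the heap once, even though only s bytes are ever live at an instant.

```c
#include <assert.h>
#include <stddef.h>

/* Simulate p threads (p <= 64 in this toy), each running
 * x = malloc(s); free(x); against private per-thread heaps.
 * Returns the total memory obtained from the OS. */
size_t blowup_private_heaps(size_t p, size_t s) {
    size_t total = 0;
    size_t cached[64] = {0};  /* free bytes held in each private heap */
    for (size_t t = 0; t < p; t++) {
        if (cached[t] < s)
            total += s;       /* heap t can't see other heaps' memory */
        else
            cached[t] -= s;
        cached[t] += s;       /* free(x): block stays in heap t */
    }
    return total;
}
```

A serial allocator would satisfy every request after the first from the single freed block, using only O(s) total.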


Parallel Allocator Taxonomy

Serial Single Heap
– a global lock protects the heap

Concurrent Single Heap
– multiple free lists or a concurrent free list

Multiple Heaps
– processors allocate from any of several heaps

Private Heaps
– processors allocate exclusively from a local heap

Global & Local Heaps
– processors allocate from a local and a global heap


Serial Single Heap

Make an existing serial allocator thread-safe
– utilize a single global lock for every request

Performance
– high speed, assuming a fast lock
– scalability limited
– false sharing not considered
– fragmentation bounded by the serial policy

Typically production allocators
– IRIX, Solaris, Windows
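A minimal sketch of this approach, with the system `malloc` standing in for the underlying serial allocator (the wrapper names are made up): one global mutex serializes every request. Each call is fast, but every thread contends on the same lock, which is exactly the scalability limit noted above.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* One lock guards the entire heap. */
static pthread_mutex_t heap_lock = PTHREAD_MUTEX_INITIALIZER;

void *locked_malloc(size_t size) {
    pthread_mutex_lock(&heap_lock);
    void *p = malloc(size);  /* serial allocator, now thread-safe */
    pthread_mutex_unlock(&heap_lock);
    return p;
}

void locked_free(void *p) {
    pthread_mutex_lock(&heap_lock);
    free(p);
    pthread_mutex_unlock(&heap_lock);
}
```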


Concurrent Single Heap

Apply concurrency to existing serial allocators
– use the quick list mechanism, with a lock per list

Performance
– moderate speed, could require many locks
– scalability limited by the number of requested sizes
– false sharing not considered
– fragmentation bounded by the serial policy

Typically research allocators
– Buddy System, MFLF (multiple free list fit), NUMAmalloc


Concurrent Single (Buddy System)

Policy/Mechanism
– one free list per memory block size: 1, 2, 4, …, 2^i
– blocks on list i are recursively split into 2 buddies for list i-1 in order to satisfy smaller requests
– only buddies are coalesced to satisfy larger requests
– each free list can be individually locked
– trades speed for reduced fragmentation
  if a free list is empty, a thread’s malloc enters a wait queue
  the malloc can be satisfied by another thread freeing memory or by splitting a higher list’s block into buddies (whichever finishes first)
  reducing the number of splits reduces fragmentation by leaving more large blocks for future requests
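The buddy relationship that makes this split/coalesce policy cheap can be shown in one line: with power-of-two block sizes and offsets measured from the heap base, the buddy of a level-i block is found by flipping bit i of its offset. (The function name is illustrative.) Only such pairs may be coalesced.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of the buddy of the level-`level` block at `offset`
 * (relative to the heap base); level-i blocks have size 2^i. */
size_t buddy_of(size_t offset, unsigned level) {
    return offset ^ ((size_t)1 << level);
}
```

The relation is symmetric: each block of a pair names the other as its buddy, which is what lets a free block check in O(1) whether it can coalesce upward.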


Concurrent Single (Buddy System)

Performance
– moderate speed: the locking/queueing is complicated, although the buddy split/coalesce code is fast
– scalability limited by the number of requested sizes
– false sharing very likely
– high internal fragmentation

[Example: free lists for sizes 1, 2, 4, 8, 16; thread 1 runs x = malloc(8) then y = malloc(8), while thread 2 runs x = malloc(5) then free(x); the 5-byte request is rounded up to an 8-byte buddy block, illustrating the internal fragmentation.]


Concurrent Single (MFLF)

Policy/Mechanism
– a set of quick lists satisfies small requests exactly
  malloc takes the first block in the appropriate list
  free returns the block to the head of the appropriate list
– a set of misc lists satisfies large requests quickly
  each list is labeled with a range of block sizes, low…high
  malloc takes the first block in a list where request < low
  – trades a linear search for internal fragmentation
  free returns blocks to the list where low < request < high
– each list can be individually locked and searched
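The list-selection rule above can be sketched as a lookup function. The quick-list cutoff and the misc-list bounds here are assumptions for illustration, not values from the MFLF paper; the point is that choosing a list where request < low guarantees any block on it fits, avoiding a linear search at the cost of internal fragmentation.

```c
#include <assert.h>
#include <stddef.h>

#define QUICK_MAX 64  /* assumed: sizes 1..64 have exact quick lists */

/* Assumed lower bounds of the misc lists. */
static const size_t misc_low[] = {65, 129, 257, 513};
#define NMISC (sizeof misc_low / sizeof *misc_low)

/* Returns a quick-list index (0..QUICK_MAX-1) for small requests,
 * or QUICK_MAX + i for the first misc list with low > request. */
size_t mflf_list_for(size_t request) {
    if (request <= QUICK_MAX)
        return request - 1;           /* exact-size quick list */
    for (size_t i = 0; i < NMISC; i++)
        if (misc_low[i] > request)
            return QUICK_MAX + i;     /* every block there fits */
    return QUICK_MAX + NMISC;         /* fall back to the largest list */
}
```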


Concurrent Single (MFLF)

Performance– high speed– scalability limited by number of requested sizes– false sharing very likely– high internal fragmentation

Approach similar to current state-of-the-art serial allocators


Concurrent Single (NUMAmalloc)

Strategy - minimize co-location of unrelated objects on the same page
– avoids page-level false sharing (DSM/software DSM)

Policy - place same-sized requests in the same page (heuristic hypothesis)
– basically MFLF at the page level

Performance
– high speed
– scalability limited by the number of requested sizes
– false sharing: helps at the page level but not the cache level
– high internal fragmentation


Multiple Heaps

List of multiple heaps
– individually growable, shrinkable, and lockable
– threads scan the list looking for the first available heap (trylock)
– threads may cache the result to speed up the next lookup

Performance
– moderate speed, limited by list scans and locking
– scalability limited by the number of heaps and traffic
– false sharing unintentionally reduced
– blowup increased (up to O(p))

Typically production allocators
– ptmalloc (Linux), HP-UX


Private Heaps

Processors exclusively utilize a local private heap for all allocation and deallocation
– eliminates the need for locks

Performance
– extremely high speed
– scalability unbounded
– reduced false sharing
  unless threads pass memory to one another
– blowup unbounded

Both research and production allocators
– CILK, STL


Global & Local Heaps

Processors generally utilize a local heap
– removes most lock contention
– private memory is acquired from, and returned to, the global heap (which is always locked) as necessary

Performance
– high speed, less lock contention
– scalability limited by the number of locks
– low false sharing
– blowup bounded (O(1))

Typically research allocators
– VH, Hoard


Global & Local Heaps (VH)

Strategy - exchange memory overhead for improved scalability

Policy
– memory is broken into stacks of size m/2
– the global heap maintains a LIFO of stack pointers that local heaps can use
  a global push releases a local heap’s stack
  a global pop acquires a stack for a local heap
– local heaps maintain an Active and a Backup stack
  local operations are private, i.e. they don’t require locks

Mechanism - serial free list within stacks


Global & Local Heaps (VH)

Memory Usage = M + m*p
– M = amount of memory in use by the program
– p = number of processors
– m = private memory per processor (two stacks of size m/2)
  a higher m reduces the number of global heap lock operations
  a lower m reduces the memory usage overhead

[Figure: each of the p private heaps holds an Active (A) and a Backup (B) stack; the global heap holds the remaining stacks.]


Global & Local Heaps (Hoard)

Strategy - bound blowup and minimize false sharing

Policy
– memory is broken into superblocks of size S
  all blocks within a superblock are of equal size
– the global heap maintains a set of available superblocks
– local heaps maintain local superblocks
  malloc is satisfied by a local superblock
  free returns memory to the original allocating superblock (requires a lock!)
  superblocks are acquired as necessary from the global heap
  if local usage drops below a threshold, superblocks are returned to the global heap

Mechanism - private superblock quick lists
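The release rule above can be sketched as a threshold test; the struct layout and the percentage threshold are assumptions for illustration, not Hoard's exact internals. When too few of a superblock's blocks are in use, the whole superblock goes back to the global heap so other processors can reuse it, which is what bounds blowup.

```c
#include <assert.h>
#include <stddef.h>

/* Toy view of a superblock: capacity and current usage. */
typedef struct {
    size_t blocks_total;   /* blocks the superblock can hold */
    size_t blocks_in_use;  /* blocks currently allocated */
} superblock_t;

/* Return 1 if the superblock should be released to the global heap:
 * its usage has dropped below the emptiness threshold (a percent). */
int should_release(const superblock_t *sb, unsigned threshold_pct) {
    return sb->blocks_in_use * 100 < sb->blocks_total * threshold_pct;
}
```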


Global & Local Heaps (Hoard)

Memory Usage = O(M + p)
– M = amount of memory in use by the program
– p = number of processors

False Sharing
– since malloc is satisfied by a local superblock, and free returns memory to the original superblock, false sharing is greatly reduced
– worst case: a non-empty superblock is released to the global heap and another thread acquires it
  set the superblock size and emptiness threshold to minimize this


Global & Local Heaps (Compare)

VH proves a tighter memory bound than Hoard and only requires global locks

Hoard has a more flexible local mechanism and considers false sharing

They’ve never been compared!– Hoard is production quality and would likely win


Summary

Memory allocation is still an open problem
– strategies addressing program behavior are still uncommon

Performance Tradeoffs
– speed, scalability, false sharing, fragmentation

Current Taxonomy
– serial single heap, concurrent single heap, multiple heaps, private heaps, global & local heaps


References

Serial Allocation

1) Paul Wilson, Mark Johnstone, Michael Neely, David Boles. Dynamic Storage Allocation: A Survey and Critical Review. 1995 International Workshop on Memory Management.

Shared Memory Multiprocessor Allocation

Concurrent Single Heap

2) Arun Iyengar. Scalability of Dynamic Storage Allocation Algorithms. Sixth Symposium on the Frontiers of Massively Parallel Computing. October 1996.

3) Theodore Johnson, Tim Davis. Space Efficient Parallel Buddy Memory Management. The Fourth International Conference on Computing and Information (ICCI'92). May 1992.

4) Jong Woo Lee, Yookun Cho. An Effective Shared Memory Allocator for Reducing False Sharing in NUMA Multiprocessors. IEEE Second International Conference on Algorithms & Architectures for Parallel Processing (ICAPP'96).

Multiple Heaps

5) Wolfram Gloger. Dynamic Memory Allocator Implementations in Linux System Libraries.

Global & Local Heaps

6) Emery Berger, Kathryn McKinley, Robert Blumofe, Paul Wilson. Hoard: A Scalable Memory Allocator for Multithreaded Applications. The Ninth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-IX). November 2000.

7) Voon-Yee Vee, Wen-Jing Hsu. A Scalable and Efficient Storage Allocator on Shared-Memory Multiprocessors. The International Symposium on Parallel Architectures, Algorithms, and Networks (I-SPAN'99). June 1999.