
A Portable Runtime Interface For Multi-Level Memory Hierarchies

Mike Houston, Ji-Young Park, Manman Ren, Timothy Knight, Kayvon Fatahalian,

Alex Aiken, William Dally, Pat Hanrahan

Stanford University


The Problem

Lots of different architectures
– Shared memory
– Distributed memory
– Exposed communication
– Disk systems

Each architecture has its own programming system

Composed machines difficult to program and manage
– Different mechanisms for each architecture

Previous runtimes and languages limited
– Designed for a single architecture
– Struggle with memory hierarchies


Shared Memory Machines

AMD Barcelona SGI Altix


Distributed Memory Machines

MareNostrum: 2,282 nodes

ASC Blue Gene/L: 65,536 nodes


Exposed Communication Architectures

STI Cell processor: 90 nm | ~220 mm² | ~100 W | ~200 GFLOPS


Complex machines

Cluster of SMPs?
– MPI?

Cluster of Cell processors?
– MPI + ALF/Cell SDK

Cluster of clusters of SMPs with Cell accelerators?

What about disk systems?
– Disk I/O often second class


Previous Work

APIs
– MPI/MPI-2
– Pthreads
– OpenMP (Dagum et al. 1998)
– GASNet (Bonachea et al. 2002)
– Charm (Kale et al. 1993)
– SVM (Labonte et al. 2004)
– Cell SDK (IBM 2006)
– CellSs (Bellens et al. 2006)
– HTA (Bikshandi et al. 2006)
– …

Languages
– Co-Array Fortran (Numrich et al. 1994)
– Titanium (Yelick et al. 1998)
– UPC (Carlson et al. 1999)
– ZPL (Deitz et al. 2004)
– Chapel (Callahan et al. 2004)
– X10 (Charles et al. 2005)
– Brook (Buck et al. 2004)
– Cilk (Blumofe et al. 1995)
– …


Contributions

Uniform scheme for explicitly describing memory hierarchies
– Capture common traits important for performance
– Allow composition of memory hierarchies

Simple, portable API for many parallel machines
– Mechanism independence for communication and management of parallel resources
– Few entry points
– Efficient execution

Efficient compiler target
– Compiler can concentrate on global optimizations
– Runtime manages mechanisms


The Sequoia System

Programming language for memory hierarchies
– Fatahalian et al. (Supercomputing 2006)
– Adapts the Parallel Memory Hierarchies programming abstraction

Compiler for exposed communication architectures
– Knight et al. (PPoPP 2007)
– Takes in a Sequoia program, machine file, and mapping file
– Generates task calls and data transfers
– Performs large-granularity optimizations

Portable runtime system
– Houston et al. (PPoPP 2008)
– Target for the Sequoia compiler
– Serves as an abstract machine

http://sequoia.stanford.edu

Abstract Machine Model


Parallel Memory Hierarchy Model (PMH)

Abstract machines as trees of memories, where each memory is an address space (Alpern et al. 1995)

[Figure: a tree of memories, each memory node paired with a CPU/control thread.]
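A minimal sketch of how such a memory tree might be represented in code; the node type and its fields are illustrative assumptions, not part of the Sequoia system:

#include <cstddef>
#include <vector>

// One node per memory (address space) in the PMH tree, with a control thread
// implied at each node and child memories below it.
struct MemoryNode {
    const char*              name;      // e.g. "aggregate cluster memory", "SPE LS 3"
    std::size_t              capacity;  // bytes addressable at this level
    std::vector<MemoryNode*> children;  // child memories; empty at the leaves
};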


Parallel Memory Hierarchy Model

[Figure: a multi-level tree of memories with the CPUs at the leaves.]


Example Mappings

[Figure: example mappings of the PMH model onto a Cell Processor (main memory over eight SPE local stores), a disk system (node memory over disk), a Cluster (an aggregate node memory as a virtual level over per-node memories), and a Shared Memory Multi-processor (main memory over per-CPU L2 caches).]


Multi-Level Configurations

[Figure: multi-level configurations. Disk + Playstation 3: disk over main memory over SPE local stores. Cluster of Playstation 3s: aggregate cluster memory over per-PS3 main memories over SPE local stores. Cluster of SMPs: aggregate cluster memory over node memories over per-CPU L2 caches.]


Abstraction Rules

Tree of nodes
– 1 control thread per node
– 1 memory per node

Threads can:
– Transfer in bulk from/to parent memory asynchronously
– Wait for transfers from/to parent to complete
– Allocate data
– Only access their memory directly
– Transfer control to child node(s)
– Non-leaf threads only operate to move data and control
– Synchronize with siblings


Similar to Space Limited Procedures (Alpern et al. 1995)

Portable Runtime Interface


Requirements

Resource allocation
– Data allocation and naming
– Set up parallel resources

Explicit bulk asynchronous communication
– Transfer lists
– Transfer commands

Parallel execution
– Launch tasks on children
– Asynchronous

Synchronization
– Make sure tasks/transfers complete before continuing

Runtime isolation
– No direct knowledge of other runtimes


Graphical Runtime Representation

[Figure: a runtime sits between the memory and CPU at level i+1 and the N child memories and CPUs at level i.]


Top Interface

// create and free runtime
Runtime(TaskTable table, int numChildren);
virtual ~Runtime();

// allocate and deallocate arrays
virtual Array_t* AllocArray(Size_t elmtSize, int dimensions, Size_t* dim_sizes,
                            ArrayDesc_t descriptor, int alignment) = 0;
virtual void FreeArray(Array_t* array) = 0;

// array naming
virtual void AddArray(Array_t array);
virtual Array_t GetArray(ArrayDesc_t descriptor);
virtual void RemoveArray(ArrayDesc_t descriptor);

// launch and synchronize on tasks
virtual TaskHandle_t CallChildTask(TaskID_t taskid, ChildID_t start, ChildID_t end) = 0;
virtual void WaitTask(TaskHandle_t handle) = 0;
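A minimal parent-side usage sketch of the entry points above. The ClusterRuntime subclass, the task table, descriptor value 42, the integral ArrayDesc_t, and the inclusive child range are illustrative assumptions, not part of the published API:

// Hedged sketch: allocate and name an array, run task 0 on all children, clean up.
void run_on_children(TaskTable table) {
    Runtime* rt = new ClusterRuntime(table, /*numChildren=*/16);   // hypothetical subclass

    Size_t dims[1] = { 1 << 20 };
    Array_t* A = rt->AllocArray(sizeof(float), /*dimensions=*/1, dims,
                                /*descriptor=*/42, /*alignment=*/128);
    rt->AddArray(*A);                        // make the array visible to children by name

    TaskHandle_t h = rt->CallChildTask(/*taskid=*/0, /*start=*/0, /*end=*/15);
    rt->WaitTask(h);                         // wait for all children to finish task 0

    rt->RemoveArray(42);
    rt->FreeArray(A);
    delete rt;
}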


Bottom Interface

// array naming
virtual Array_t* GetArray(ArrayDesc_t descriptor);

// create, free, invoke, and synchronize on transfer lists
virtual XferList* CreateXferList(Array_t* dst, Array_t* src, Size_t* dst_idx,
                                 Size_t* src_idx, Size_t* lengths, int count) = 0;
virtual void FreeXferList(XferList* list) = 0;
virtual XferHandle_t Xfer(XferList* list) = 0;
virtual void WaitXfer(XferHandle_t handle) = 0;

// get number of children in bottom level, get local processor id, and barrier
int GetSiblingCount();
int GetID();
virtual void Barrier(ChildID_t start, ChildID_t stop) = 0;
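A matching child-side sketch: a child task pulls its slice of the named parent array into a locally allocated array, then synchronizes with its siblings. Descriptor 42, the block size, and the inclusive Barrier range are assumptions for illustration:

// Hedged sketch: bulk transfer from the parent memory using the bottom interface.
void fetch_my_slice(Runtime* rt, Array_t* local, Size_t block) {
    Array_t* parent = rt->GetArray(/*descriptor=*/42);   // parent array looked up by name

    Size_t dst_idx = 0;                                  // where the block lands locally
    Size_t src_idx = (Size_t)rt->GetID() * block;        // this child's offset in the parent
    Size_t len     = block;

    XferList* xl = rt->CreateXferList(local, parent, &dst_idx, &src_idx, &len, 1);
    XferHandle_t xh = rt->Xfer(xl);                      // asynchronous bulk transfer
    rt->WaitXfer(xh);                                    // block until the data has arrived

    rt->Barrier(0, rt->GetSiblingCount() - 1);           // synchronize with all siblings
    rt->FreeXferList(xl);
}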


Compiler/Runtime Interaction

Compiler initializes runtime for each pair of memories in the hierarchy

Initialize runtime for root memory
– Machine description specifies the runtime to initialize (SMP, Cluster, Disk, Cell, etc.)

If more levels in hierarchy
– Initialize runtimes for child levels

Runtime cleanup is the inverse
– Call exit on children, wait, clean up local resources, return to parent

Control of hierarchy via task calls
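A minimal sketch of that recursive setup, assuming a MachineLevel description type and a CreateRuntimeFor factory that stand in for the compiler-generated code; neither name is part of the published interface:

#include <vector>

// Illustrative only: one runtime per parent/child memory pair, chosen from the
// machine description, created top-down; cleanup runs in the reverse order.
struct LevelRuntimes {
    Runtime*                   rt;        // runtime between this memory and its children
    std::vector<LevelRuntimes> children;  // runtimes for deeper levels, if any
};

LevelRuntimes init_level(const MachineLevel& level, TaskTable table) {
    LevelRuntimes lr;
    // The machine description names the runtime kind (SMP, Cluster, Disk, Cell, ...).
    lr.rt = CreateRuntimeFor(level.kind, table, level.numChildren);
    // If the hierarchy continues below, initialize a runtime for each child level.
    for (int c = 0; c < level.numChildren; ++c)
        if (level.child(c).numChildren > 0)
            lr.children.push_back(init_level(level.child(c), table));
    return lr;
}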

Runtime Implementations


[Figure: the four runtime implementations and the machines they manage. SMP Runtime: CPUs sharing main memory. Disk Runtime: CPU and node memory over a disk. Cell Runtime: PowerPC and main memory over SPE local stores. Cluster Runtime: an aggregate cluster memory over per-node memories and CPUs.]


SMP Implementation

Pthreads based
– Launch a thread per child
– CallChildTask enqueues work on a task queue per child (sketched below)

Data transfers
– Memory copy from source to destination
– Optimizations
  • Pass reference to parent array
  • Feedback to compiler to remove transfers
– Machine file information

No processor at parent level
– Processor 0 represents the parent node and a child node
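A minimal sketch of the per-child task-queue pattern described above, assuming a run_task helper that looks up and runs the leaf task; the queue layout is illustrative, not the actual Sequoia implementation:

#include <pthread.h>
#include <deque>

// One queue per child; lock and ready are assumed to be initialized elsewhere.
struct TaskQueue {
    pthread_mutex_t      lock;
    pthread_cond_t       ready;
    std::deque<TaskID_t> tasks;
};

// Called from CallChildTask: hand a task id to one child.
void enqueue_task(TaskQueue* q, TaskID_t id) {
    pthread_mutex_lock(&q->lock);
    q->tasks.push_back(id);
    pthread_cond_signal(&q->ready);
    pthread_mutex_unlock(&q->lock);
}

// Body of the pthread launched per child: pop task ids and run them.
void* child_thread(void* arg) {
    TaskQueue* q = static_cast<TaskQueue*>(arg);
    for (;;) {
        pthread_mutex_lock(&q->lock);
        while (q->tasks.empty())
            pthread_cond_wait(&q->ready, &q->lock);
        TaskID_t id = q->tasks.front();
        q->tasks.pop_front();
        pthread_mutex_unlock(&q->lock);
        run_task(id);   // assumed helper: look up the leaf task in the task table and run it
    }
    return 0;
}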


Disk Implementation

No processor at top level
– Host CPU represents the parent node and child node

Implementation
– Allocation
  • Open file on disk
– Data transfers
  • Use Async I/O API to read/write data to disk (sketched below)
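A hedged sketch of what a single disk transfer could look like with the POSIX asynchronous I/O API; the file descriptor and file offset are assumed to be tracked by the disk runtime's allocation step:

#include <aio.h>          // POSIX asynchronous I/O
#include <sys/types.h>
#include <cstring>

// Start an asynchronous read of one block of an array file into main memory.
// The aiocb must stay live until the transfer completes (it plays the role of
// the runtime's transfer handle).
void start_read(aiocb* cb, int fd, void* dst, size_t bytes, off_t file_offset) {
    std::memset(cb, 0, sizeof(*cb));
    cb->aio_fildes = fd;            // file opened when the array was allocated
    cb->aio_buf    = dst;           // destination buffer in the child memory
    cb->aio_nbytes = bytes;
    cb->aio_offset = file_offset;   // block position within the array's file
    aio_read(cb);                   // returns immediately; I/O proceeds in the background
}

// Wait for the transfer to finish (the WaitXfer side of the sketch).
void wait_read(aiocb* cb) {
    const aiocb* list[1] = { cb };
    aio_suspend(list, 1, 0);        // block until the read completes
}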


Cell Implementation

Overlay handling
– At runtime creation, load overlay loader into SPEs
– On task call
  • PowerPC notifies SPE of function to load
  • SPE loads overlay and executes

Data alignment
– All data allocated on 128-byte boundaries
– Multi-dimensional arrays padded to maintain alignment in every dimension (see the padding sketch below)

Heavy use of mailboxes
– SPE synchronization
– PPE<->SPE communication

Data transfers
– DMA lists
– DMA commands
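The padding rule above amounts to rounding each row up to the next 128-byte boundary; a small helper showing just that arithmetic (illustrative only, assuming Size_t is an unsigned integer type):

// Round a row of `elems` elements of `elem_size` bytes up to a 128-byte pitch,
// so every row of a multi-dimensional array stays aligned for SPE DMA.
static inline Size_t padded_pitch(Size_t elems, Size_t elem_size) {
    Size_t bytes = elems * elem_size;
    return (bytes + 127) & ~(Size_t)127;   // next multiple of 128
}
// Example: a row of 100 floats (400 bytes) gets a 512-byte pitch.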


Cluster Implementation

No processor at top level
– Node 0 represents the parent node and a child node

Virtual level
– Distributed Shared Memory (DSM)
  • Many software and hardware solutions
  • IVY (Li et al. 1988), Mether (Minnich et al. 1991), …
  • Alewife (Agarwal et al. 1991), FLASH (Kuskin et al. 1994), …
  • Complex coherence and consistency mechanisms
– Overkill for our needs
  • No sharing between sibling processors = no coherence requirements
  • Sequoia enforces consistency by disallowing aliasing on writes to parent memory


Cluster Implementation, cont.

Virtual level implementation
– Interval trees
  • Represent arrays as covering a range of 0→N elements
  • Each node can hold any sub-portion (interval) of the range
  • Tree structure allows fast lookup of all nodes that cover the interval of interest
  • Allows complex data distributions
  • See CLRS Section 14.3 for more details
– Array allocation (see the sketch below)
  • Define distribution as an interval tree and broadcast it
  • Allocate data for intervals and register data pointers with MPI-2 (MPI_Win_create)
  • Align data to page boundaries for fast network transfers
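A hedged sketch of the per-node allocation step: allocate only the local interval, page-aligned, and expose it through an MPI-2 window. The interval bounds (lo, hi) are assumed outputs of the broadcast interval tree:

#include <mpi.h>
#include <cstdlib>     // posix_memalign on POSIX systems

// Allocate this node's [lo, hi) interval of a distributed array and register it
// with MPI-2 one-sided communication so siblings can MPI_Get/MPI_Put into it.
MPI_Win expose_interval(size_t lo, size_t hi, size_t elem_size, void** base_out) {
    size_t bytes = (hi - lo) * elem_size;
    void* base = 0;
    posix_memalign(&base, 4096, bytes);            // page-align for fast network transfers
    MPI_Win win;
    MPI_Win_create(base, (MPI_Aint)bytes, (int)elem_size,
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    *base_out = base;
    return win;
}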


Cluster Implementation, cont.

Virtual level implementation
– Data transfer (see the sketch below)
  • Compare the requested data range against the interval tree
  • Read from parent: issue MPI_Get to any nodes with matching intervals (MPI_LOCK_SHARED)
  • Write to parent: issue MPI_Put to all nodes with matching intervals (MPI_LOCK_EXCLUSIVE)
– Optimizations
  • If the requested range is local, return a reference to parent memory
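A hedged sketch of the read path just described; the Overlap record stands in for the result of the interval-tree query and is an assumption, not one of the runtime's types:

#include <mpi.h>
#include <cstddef>
#include <vector>

// One overlap between the requested range and a sibling node's interval,
// as a hypothetical result of the interval-tree lookup.
struct Overlap {
    int      node;         // rank owning the interval
    MPI_Aint target_disp;  // displacement into that node's window (in elements)
    int      count;        // number of elements to move
    size_t   local_off;    // where they land in the local buffer
};

// Read from parent: pull each overlapping piece with a one-sided MPI_Get under
// a shared lock. A write to parent is the same loop with MPI_Put and
// MPI_LOCK_EXCLUSIVE.
void read_from_parent(float* dst, const std::vector<Overlap>& overlaps, MPI_Win win) {
    for (size_t i = 0; i < overlaps.size(); ++i) {
        const Overlap& o = overlaps[i];
        MPI_Win_lock(MPI_LOCK_SHARED, o.node, 0, win);
        MPI_Get(dst + o.local_off, o.count, MPI_FLOAT,
                o.node, o.target_disp, o.count, MPI_FLOAT, win);
        MPI_Win_unlock(o.node, win);   // transfer is complete after the unlock
    }
}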


Runtime Composition

[Figure: composed runtimes for the three multi-runtime configurations. Disk + Playstation 3: a Disk Runtime over a Cell Runtime. Cluster of Playstation 3s: a Cluster Runtime over Cell Runtimes. Cluster of SMPs: a Cluster Runtime over SMP Runtimes.]

Evaluation


Evaluation Metrics

Can we run unmodified Sequoia code on all these systems?

How much overhead do we have in our abstraction?

How well can we utilize machine resources?
– Computation
– Bandwidth

Is our application performance competitive with best available implementations?

Can we effectively compose runtimes for more complex systems?


Sequoia Benchmarks

Linear Algebra: BLAS Level 1 SAXPY, Level 2 SGEMV, and Level 3 SGEMM benchmarks

Conv2D: 2D single-precision convolution with 9x9 support (non-periodic boundary constraints)

FFT3D: complex single-precision FFT

Gravity: 100 time steps of an N-body (N²) stellar dynamics simulation, single precision

HMMER: fuzzy protein string matching using HMM evaluation (Horn et al. SC2005)

Best available implementations used as leaf tasks


Single Runtime System Configurations

Scalar
– 2.4 GHz Intel Pentium 4 Xeon, 1GB

8-way SMP
– 4 dual-core 2.66 GHz Intel P4 Xeons, 8GB

Disk
– 2.4 GHz Intel P4, 160GB disk, ~50MB/s from disk

Cluster
– 16 Intel 2.4 GHz P4 Xeons, 1GB/node, InfiniBand interconnect (780MB/s)

Cell
– 3.2 GHz IBM Cell blade (1 Cell, 8 SPEs), 1GB

PS3
– 3.2 GHz Cell in Sony Playstation 3 (6 SPEs), 256MB (160MB usable)


System Utilization

[Chart: SAXPY, SGEMV, SGEMM, CONV2D, FFT3D, GRAVITY, and HMMER on the SMP, Disk, Cluster, Cell, and PS3 configurations; y-axis: percentage of runtime (0-100).]


Resource Utilization – IBM Cell

[Chart: compute utilization and bandwidth utilization on the IBM Cell for each benchmark; y-axis: resource utilization (%), 0-100.]


Single Runtime Configurations – GFlop/s

           Scalar    SMP    Disk   Cluster   Cell    PS3
SAXPY        0.3      0.7   0.007     4.9     3.5    3.1
SGEMV        1.1      1.7   0.04     12      12     10
SGEMM        6.9     45     5.5      91     119     94
CONV2D       1.9      7.8   0.6      24      85     62
FFT3D        0.7      3.9   0.05      5.5    54     31
GRAVITY      4.8     40     3.7      68      97     71
HMMER        0.9     11     0.9      12      12      7.1


SGEMM Performance

Cluster
– Intel Cluster MKL: 101 GFlop/s
– Sequoia: 91 GFlop/s

SMP
– Intel MKL: 44 GFlop/s
– Sequoia: 45 GFlop/s


FFT3D Performance

Cell
– Mercury: 58 GFlop/s
– FFTW 3.2 alpha 2: 35 GFlop/s
– Sequoia: 54 GFlop/s

Cluster
– FFTW 3.2 alpha 2: 5.3 GFlop/s
– Sequoia: 5.5 GFlop/s

SMP
– FFTW 3.2 alpha 2: 4.2 GFlop/s
– Sequoia: 3.9 GFlop/s


Best Known Implementations

HMMER
– ATI X1900XT: 9.4 GFlop/s (Horn et al. 2005)
– Sequoia Cell: 12 GFlop/s
– Sequoia SMP: 11 GFlop/s

Gravity
– GRAPE-6A: 2 billion interactions/s (Fukushige et al. 2005)
– Sequoia Cell: 4 billion interactions/s
– Sequoia PS3: 3 billion interactions/s


Multi-Runtime System Configurations

Cluster of SMPs
– Four 2-way 3.16 GHz Intel Pentium 4 Xeons connected via GigE (80MB/s peak)

Disk + PS3
– Sony Playstation 3 bringing data from disk (~30MB/s)

Cluster of PS3s
– Two Sony Playstation 3s connected via GigE (60MB/s peak)


Multi-Runtime Utilization

[Chart: SAXPY, SGEMV, SGEMM, CONV2D, FFT3D, GRAVITY, and HMMER on the Cluster of SMPs, Disk + PS3, and Cluster of PS3s configurations; y-axis: percentage of runtime (0-100).]


Cluster of PS3 Issues

[Chart: SAXPY, SGEMV, SGEMM, CONV2D, FFT3D, GRAVITY, and HMMER on the Cluster of SMPs, Disk + PS3, and Cluster of PS3s configurations; y-axis: percentage of runtime (0-100).]


Cluster of PS3 Issues

[Chart: SAXPY and SGEMV on the Cluster of PS3s compared with a single PS3; y-axis: percentage of runtime (0-100).]


Multi-Runtime Configurations - GFlop/s

           Cluster-SMP   Disk+PS3   PS3 Cluster
SAXPY          1.9         0.004        5.3
SGEMV          4.4         0.014       15
SGEMM         48           3.7         30
CONV2D         4.8         0.48        19
FFT3D          1.1         0.05         0.36
GRAVITY       50          66          119
HMMER         14           8.3         13

Conclusion


Summary

Uniform runtime interface for multi-level memory hierarchies
– Horizontal portability
  • SMP, cluster, disk, Cell, PS3
– Complex machines supported through composition
  • Cluster of SMPs, disk + PS3, cluster of PS3s
– Provides mechanism independence for communication and thread management

Efficient abstraction for multiple machines
– Maximize machine resource utilization
– Low overhead
– Competitive performance

Simple interface
– <20 entry points

Code portability
– Unmodified code running on 9 system configurations

Demonstrates viability of the Parallel Memory Hierarchies model


Future Work

Higher level functions in the runtime?

Load balancing?

Running on more complex machines?

Combine with transactional memory?

What machines can be mapped as a tree?


Questions?

Acknowledgements: Intel Graduate Fellowship Program, DOE ASC, IBM, LANL
