A Portable Runtime Interface For Multi-Level Memory Hierarchies

Mike Houston, Ji-Young Park, Manman Ren, Timothy Knight, Kayvon Fatahalian,

Alex Aiken, William Dally, Pat Hanrahan

Stanford University

Mike Houston - Stanford University Pervasive Parallelism Lab – PPoPP 2008

The Problem

Lots of different architectures
– Shared memory
– Distributed memory
– Exposed communication
– Disk systems

Each architecture has its own programming system

Composed machines difficult to program and manage
– Different mechanisms for each architecture

Previous runtimes and languages limited
– Designed for a single architecture
– Struggle with memory hierarchies

Shared Memory Machines

AMD Barcelona SGI Altix

Distributed Memory Machines

Marenostrum: 2,282 Nodes

ASC Blue Gene/L: 65,536 Nodes

Exposed Communication Architectures

STI CELL processor

90nm | ~220 mm² | ~100 W | ~200 GFLOPS

Complex machines

Cluster of SMPs?
– MPI?

Cluster of Cell processors?
– MPI + ALF/Cell SDK

Cluster of clusters of SMPs with Cell accelerators?

What about disk systems?
– Disk I/O often second class

Previous Work

APIs
– MPI/MPI-2
– Pthreads
– OpenMP (Dagum et al. 1998)
– GASNet (Bonachea et al. 2002)
– Charm (Kale et al. 1993)
– SVM (Labonte et al. 2004)
– Cell SDK (IBM 2006)
– CellSs (Bellens et al. 2006)
– HTA (Bikshandi et al. 2006)
– …

Languages
– Co-Array Fortran (Numrich et al. 1994)
– Titanium (Yelick et al. 1998)
– UPC (Carlson et al. 1999)
– ZPL (Deitz et al. 2004)
– Chapel (Callahan et al. 2004)
– X10 (Charles et al. 2005)
– Brook (Buck et al. 2004)
– Cilk (Blumofe et al. 1995)
– …

Contributions

Uniform scheme for explicitly describing memory hierarchies
– Capture common traits important for performance
– Allow composition of memory hierarchies

Simple, portable API interface for many parallel machines
– Mechanism independence for communication and management of parallel resources
– Few entry points
– Efficient execution

Efficient compiler target
– Compiler can concentrate on global optimizations
– Runtime manages mechanisms

The Sequoia System

Programming language for memory hierarchies
– Fatahalian et al. (Supercomputing 2006)
– Adapts Parallel Memory Hierarchies programming abstraction

Compiler for exposed communication architectures
– Knight et al. (PPoPP 2007)
– Takes in Sequoia program, machine file, and mapping file
– Generates task calls and data transfers
– Performs large granularity optimizations

Portable runtime system
– Houston et al. (PPoPP 2008)
– Target for Sequoia compiler
– Serves as abstract machine

http://sequoia.stanford.edu

Abstract Machine Model

Parallel Memory Hierarchy Model (PMH)

Abstract machines as trees of memories (each memory is an address space) (Alpern et al. 1995)

[Figure: a tree of Memory nodes, each with an attached CPU]

Parallel Memory Hierarchy Model

[Figure: a multi-level tree of Memory nodes with eight CPUs at the leaves]

Example Mappings

[Figure: four example mappings, with the labels shown in each diagram]
– Cell Processor: Main Memory; eight SPEs, each with an LS
– Cluster: Aggregate Node Memory (Virtual Level); Node Memory and CPU per node
– Disk: Disk; Node Memory; CPU
– Shared Memory Multi-processor: Main Memory; L2 and CPU per processor

Multi-Level Configurations

[Figure: three multi-level configurations, with the labels shown in each diagram]
– Disk + Playstation 3: Disk; Main Memory; six LS, each with an SPE
– Cluster of Playstation 3s: Aggregate Cluster Memory; Main Memory per node; six LS per node, each with an SPE
– Cluster of SMPs: Aggregate Cluster Memory; Node Memory per node; L2 and CPU per processor

Abstraction Rules

Tree of nodes

1 control thread per node

1 memory per node

Threads can:
– Transfer in bulk from/to parent memory asynchronously
– Wait for transfers from/to parent to complete
– Allocate data
– Only access their memory directly
– Transfer control to child node(s)
– Non-leaf threads only operate to move data and control
– Synchronize with siblings

Similar to Space Limited Procedures (Alpern et al. 1995)

Portable Runtime Interface

Requirements

Resource allocation
– Data allocation and naming
– Setup parallel resources

Explicit bulk asynchronous communication
– Transfer lists
– Transfer commands

Parallel execution
– Launch tasks on children
– Asynchronous

Synchronization
– Make sure tasks/transfers complete before continuing

Runtime isolation
– No direct knowledge of other runtimes

Graphical Runtime Representation

[Figure: a Runtime connecting the Memory and CPU at level i+1 to the Memory and CPU of each child 1..N at level i]

Top Interface

// create and free runtime
Runtime(TaskTable table, int numChildren);
virtual ~Runtime();

// allocate and deallocate arrays
virtual Array_t* AllocArray(Size_t elmtSize, int dimensions, Size_t* dim_sizes, ArrayDesc_t descriptor, int alignment) = 0;
virtual void FreeArray(Array_t* array) = 0;

// array naming
virtual void AddArray(Array_t array);
virtual Array_t GetArray(ArrayDesc_t descriptor);
virtual void RemoveArray(ArrayDesc_t descriptor);

// launch and synchronize on tasks
virtual TaskHandle_t CallChildTask(TaskID_t taskid, ChildID_t start, ChildID_t end) = 0;
virtual void WaitTask(TaskHandle_t handle) = 0;
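
A minimal sketch of how a caller might drive the task entry points listed above. `ToyRuntime`, the `TaskFn` signature, the `std::map`-based task table, and the inclusive `start..end` child range are all assumptions made for illustration, not the Sequoia implementation. This toy runs each child task immediately and in order, so `WaitTask` has nothing left to wait for; a real runtime would launch the tasks asynchronously.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the types named on the slide.
using TaskID_t = int;
using ChildID_t = int;
using TaskHandle_t = int;
using TaskFn = std::function<void(ChildID_t)>;  // assumed task signature
using TaskTable = std::map<TaskID_t, TaskFn>;   // assumed table layout

class ToyRuntime {
public:
    ToyRuntime(TaskTable table, int numChildren)
        : table_(std::move(table)), numChildren_(numChildren) {}

    // Run task `taskid` on children start..end (inclusive, an assumption).
    // This toy executes in place instead of launching asynchronously.
    TaskHandle_t CallChildTask(TaskID_t taskid, ChildID_t start, ChildID_t end) {
        assert(start >= 0 && start <= end && end < numChildren_);
        const TaskFn& fn = table_.at(taskid);
        for (ChildID_t c = start; c <= end; ++c) fn(c);
        return nextHandle_++;
    }

    // Nothing to wait for: every task finished inside CallChildTask.
    void WaitTask(TaskHandle_t) {}

private:
    TaskTable table_;
    int numChildren_;
    TaskHandle_t nextHandle_ = 0;
};

// Usage: each of four children doubles its own slot of a shared vector.
inline std::vector<int> run_demo() {
    std::vector<int> data = {1, 2, 3, 4};
    TaskTable table;
    table[0] = [&data](ChildID_t c) { data[static_cast<std::size_t>(c)] *= 2; };
    ToyRuntime rt(table, 4);
    TaskHandle_t h = rt.CallChildTask(0, 0, 3);
    rt.WaitTask(h);
    return data;
}
```

The handle returned by `CallChildTask` is what lets a parent overlap other work with child execution before it calls `WaitTask`.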

Bottom Interface

// array naming
virtual Array_t* GetArray(ArrayDesc_t descriptor);

// create, free, invoke, and synchronize on transfer lists
virtual XferList* CreateXferList(Array_t* dst, Array_t* src, Size_t* dst_idx, Size_t* src_idx, Size_t* lengths, int count) = 0;
virtual void FreeXferList(XferList* list) = 0;
virtual XferHandle_t Xfer(XferList* list) = 0;
virtual void WaitXfer(XferHandle_t handle) = 0;

// get number of children in bottom level, get local processor id, and barrier
int GetSiblingCount();
int GetID();
virtual void Barrier(ChildID_t start, ChildID_t stop) = 0;
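
One way to read the transfer-list entry points is sketched here. It assumes that entry i copies `lengths[i]` elements from `src` starting at `src_idx[i]` to `dst` starting at `dst_idx[i]`; the slides do not spell this out. The 1-D `Array_t`, the `XferEntry` struct, and the combined `XferAndWait` helper are hypothetical simplifications, and the copy is a plain synchronous `memcpy` instead of an asynchronous transfer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical, simplified stand-ins: a 1-D float array and a list of copies.
struct Array_t { std::vector<float> data; };
struct XferEntry { std::size_t dst_idx, src_idx, length; };
struct XferList {
    Array_t* dst;
    const Array_t* src;
    std::vector<XferEntry> entries;
};

// Assumed reading of CreateXferList: one (dst_idx, src_idx, length) per entry.
inline XferList CreateXferList(Array_t* dst, const Array_t* src,
                               const std::size_t* dst_idx,
                               const std::size_t* src_idx,
                               const std::size_t* lengths, int count) {
    XferList list{dst, src, {}};
    for (int i = 0; i < count; ++i)
        list.entries.push_back({dst_idx[i], src_idx[i], lengths[i]});
    return list;
}

// Synchronous stand-in for Xfer followed by WaitXfer: one memcpy per entry.
inline void XferAndWait(const XferList& list) {
    for (const XferEntry& e : list.entries) {
        assert(e.src_idx + e.length <= list.src->data.size());
        assert(e.dst_idx + e.length <= list.dst->data.size());
        std::memcpy(list.dst->data.data() + e.dst_idx,
                    list.src->data.data() + e.src_idx,
                    e.length * sizeof(float));
    }
}

// Usage: gather two separate pieces of a parent array into a child buffer.
inline std::vector<float> gather_demo() {
    Array_t parent{{0.f, 1.f, 2.f, 3.f, 4.f, 5.f, 6.f, 7.f}};
    Array_t child{std::vector<float>(4, -1.0f)};
    std::size_t dst_idx[] = {0, 2};
    std::size_t src_idx[] = {1, 5};
    std::size_t lengths[] = {2, 2};
    XferList list = CreateXferList(&child, &parent, dst_idx, src_idx, lengths, 2);
    XferAndWait(list);
    return child.data;
}
```

Building the list once and invoking it repeatedly is what would let a backend map it onto its own bulk mechanism, such as the DMA lists mentioned for Cell.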

Compiler/Runtime Interaction

Compiler initializes runtime for each pair of memories in the hierarchy

Initialize runtime for root memory
– Machine description specifies runtime to initialize (SMP, Cluster, Disk, Cell, etc.)

If more levels in hierarchy
– Initialize runtimes for child levels

Runtime cleanup is inverse
– Call exit on children, wait, cleanup local resources, return to parent

Control of hierarchy via task calls
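
The initialization and cleanup order described above can be sketched as follows. `LevelRuntime` and its logging are hypothetical, and the sketch assumes a single child per level to keep the ordering visible; a real hierarchy would fan out to many children.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical per-level runtime that logs setup and teardown.
class LevelRuntime {
public:
    // `kinds` names the runtime for each pair of adjacent memories, root
    // first (for example {"Cluster", "SMP"}); `log` records the events.
    LevelRuntime(const std::vector<std::string>& kinds, std::size_t level,
                 std::vector<std::string>& log)
        : kind_(kinds[level]), log_(log) {
        log_.push_back("init " + kind_);
        if (level + 1 < kinds.size())  // more levels: initialize the child
            child_ = std::make_unique<LevelRuntime>(kinds, level + 1, log_);
    }

    // Cleanup is the inverse: exit the child first, then release this level.
    ~LevelRuntime() {
        child_.reset();
        log_.push_back("exit " + kind_);
    }

private:
    std::string kind_;
    std::vector<std::string>& log_;
    std::unique_ptr<LevelRuntime> child_;
};

// Usage: build a two-level hierarchy, tear it down, and return the log.
inline std::vector<std::string> lifecycle_demo() {
    std::vector<std::string> log;
    {
        LevelRuntime root({"Cluster", "SMP"}, 0, log);
    }  // root goes out of scope here, triggering cleanup
    return log;
}
```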

Runtime Implementations

[Figure: the four runtimes, with the labels shown in each diagram]
– SMP Runtime: Main Memory; CPUs
– Disk Runtime: Disk; Node Memory; CPU
– Cell Runtime: Main Memory; PowerPC; six SPEs, each with an LS
– Cluster Runtime: Aggregate Cluster Memory; Node Memory and CPU per node

SMP Implementation

Pthreads based
– Launch a thread per child
– CallChildTask enqueues work on task queue per child

Data transfers
– Memory copy from source to destination
– Optimizations
  • Pass reference to parent array
  • Feedback to compiler to remove transfers
– Machine file information

No processor at parent level
– Processor 0 represents the parent node and a child node

Disk Implementation

No processor at top level
– Host CPU represents the parent node and child node

Implementation
– Allocation
  • Open file on disk
– Data transfers
  • Use Async I/O API to read/write data to disk

Cell Implementation

Overlay handling
– At runtime creation, load overlay loader into SPEs
– On task call
  • PowerPC notifies SPE of function to load
  • SPE loads overlay and executes

Data alignment
– All data allocated to 128 byte boundaries
– Multi-dimensional arrays padded to maintain alignment for dimensions

Heavy use of mailboxes
– SPE synchronization
– PPE<->SPE communication

Data transfers
– DMA lists
– DMA commands
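
The alignment rule above comes down to rounding sizes up to a multiple of 128 bytes. The sketch below shows one plausible reading, in which each row of a 2-D array is padded so the next row also starts on a 128-byte boundary. The helper names are hypothetical and the slides do not say exactly which dimensions are padded.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kCellAlign = 128;  // bytes, from the slide

// Round `bytes` up to the next multiple of `align` (align must be nonzero).
constexpr std::size_t round_up(std::size_t bytes, std::size_t align) {
    return (bytes + align - 1) / align * align;
}

// Padded size in bytes of one row of `cols` elements of `elem_size` bytes.
constexpr std::size_t padded_row_bytes(std::size_t cols, std::size_t elem_size) {
    return round_up(cols * elem_size, kCellAlign);
}

// Total bytes for a rows x cols array in which every row starts aligned.
constexpr std::size_t padded_array_bytes(std::size_t rows, std::size_t cols,
                                         std::size_t elem_size) {
    return rows * padded_row_bytes(cols, elem_size);
}

static_assert(padded_row_bytes(100, 4) == 512, "400 bytes pad up to 512");
static_assert(padded_row_bytes(32, 4) == 128, "128 bytes is already aligned");

// Usage: three rows of 100 floats occupy 3 * 512 bytes after padding.
inline void padding_demo() {
    assert(padded_array_bytes(3, 100, 4) == 1536);
}
```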

Cluster Implementation

No processor at top level
– Node 0 represents the parent node and a child node

Virtual level
– Distributed Shared Memory (DSM)
  • Many software and hardware solutions
  • IVY (Li et al. 1988), Mether (Minnich et al. 1991), …
  • Alewife (Agarwal et al. 1991), FLASH (Kuskin et al. 1994), …
  • Complex coherence and consistency mechanisms
– Overkill for our needs
  • No sharing between sibling processors = no coherence requirements
  • Sequoia enforces consistency by disallowing aliasing on writes to parent memory

Cluster Implementation, cont.

Virtual level implementation
– Interval trees
  • Represent arrays as covering a range of 0->N elements
  • Each node can have any sub-portion (interval) of range
  • Tree structure allows fast lookup of all nodes that cover the interval of interest
  • Allows complex data distributions
  • See CLRS Section 14.3 for more detailed information
– Array allocation
  • Define distribution as interval tree and broadcast
  • Allocate data for intervals and register data pointers with MPI-2 (MPI_Win_create)
  • Align data to page alignment for fast network transfers
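
The lookup described above, finding every node whose interval overlaps a requested range, can be sketched with a small augmented binary search tree in the style of CLRS Section 14.3. This is an illustrative sketch, not the Sequoia code: the `Interval` layout, the half-open `[lo, hi)` convention, and the `owner` field are assumptions, and the tree is left unbalanced for brevity, so the logarithmic bound holds only if insertions happen to keep it shallow.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Half-open element range [lo, hi) held by cluster node `owner`.
struct Interval { std::size_t lo, hi; int owner; };

class IntervalTree {
public:
    void insert(const Interval& iv) { insert_at(root_, iv); }

    // Owners of every stored interval overlapping [lo, hi), ordered by lo.
    std::vector<int> owners(std::size_t lo, std::size_t hi) const {
        std::vector<int> out;
        query_at(root_.get(), lo, hi, out);
        return out;
    }

private:
    struct Node {
        Interval iv;
        std::size_t max_hi;  // largest hi in this subtree (the augmentation)
        std::unique_ptr<Node> left, right;
    };

    // Plain BST insert keyed on lo, updating max_hi on the way down.
    static void insert_at(std::unique_ptr<Node>& n, const Interval& iv) {
        if (!n) {
            n.reset(new Node{iv, iv.hi, nullptr, nullptr});
            return;
        }
        n->max_hi = std::max(n->max_hi, iv.hi);
        insert_at(iv.lo < n->iv.lo ? n->left : n->right, iv);
    }

    static void query_at(const Node* n, std::size_t lo, std::size_t hi,
                         std::vector<int>& out) {
        if (!n || n->max_hi <= lo) return;  // nothing below reaches the query
        query_at(n->left.get(), lo, hi, out);
        if (n->iv.lo < hi && lo < n->iv.hi) out.push_back(n->iv.owner);
        // Keys in the right subtree are >= n->iv.lo, so skip it when even
        // this node starts at or past the end of the query.
        if (n->iv.lo < hi) query_at(n->right.get(), lo, hi, out);
    }

    std::unique_ptr<Node> root_;
};

// Usage: 100 elements split evenly over nodes 0..3; ask who holds [20, 60).
inline std::vector<int> owners_demo() {
    IntervalTree t;
    t.insert({25, 50, 1});
    t.insert({0, 25, 0});
    t.insert({50, 75, 2});
    t.insert({75, 100, 3});
    return t.owners(20, 60);
}
```

Because each stored interval carries its own owner, the same structure can describe uneven or overlapping distributions, which is what the slide means by complex data distributions.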

Cluster Implementation, cont.

Virtual level implementation
– Data transfer
  • Compare requested data range against interval tree
  • Read from parent: issue MPI_Get to any nodes with matching intervals (MPI_LOCK_SHARED)
  • Write to parent: issue MPI_Put to all nodes with matching intervals (MPI_LOCK_EXCLUSIVE)
– Optimizations
  • If requested range is local, return reference to parent memory

Runtime Composition

[Figure: three composed systems, each built by stacking runtimes]
– Disk + Playstation 3: Disk Runtime between Disk and Main Memory, Cell Runtime between Main Memory and the SPE local stores (LS)
– Cluster of Playstation 3s: Cluster Runtime between Aggregate Cluster Memory and each Main Memory, Cell Runtime within each Playstation 3
– Cluster of SMPs: Cluster Runtime between Aggregate Cluster Memory and each Node Memory, SMP Runtime within each node

Evaluation

Evaluation Metrics

Can we run unmodified Sequoia code on all these systems?

How much overhead do we have in our abstraction?

How well can we utilize machine resources?
– Computation
– Bandwidth

Is our application performance competitive with best available implementations?

Can we effectively compose runtimes for more complex systems?

Sequoia Benchmarks

Linear Algebra: BLAS Level 1 SAXPY, Level 2 SGEMV, and Level 3 SGEMM benchmarks

Conv2D: 2D single precision convolution with 9x9 support (non-periodic boundary constraints)

FFT3D: Complex single precision FFT

Gravity: 100 time steps of N-body stellar dynamics simulation (N²), single precision

HMMER: Fuzzy protein string matching using HMM evaluation (Horn et al. SC2005 paper)

Best available implementations used as leaf task

Single Runtime System Configurations

Scalar
– 2.4 GHz Intel Pentium4 Xeon, 1GB

8-way SMP
– 4 dual-core 2.66GHz Intel P4 Xeons, 8GB

Disk
– 2.4 GHz Intel P4, 160GB disk, ~50MB/s from disk

Cluster
– 16, Intel 2.4GHz P4 Xeons, 1GB/node, Infiniband interconnect (780MB/s)

Cell
– 3.2 GHz IBM Cell blade (1 Cell – 8 SPE), 1GB

PS3
– 3.2 GHz Cell in Sony Playstation 3 (6 SPE), 256MB (160MB usable)

System Utilization

[Figure: Percentage of Runtime (0 to 100) for SAXPY, SGEMV, SGEMM, CONV2D, FFT3D, GRAVITY, and HMMER on SMP, Disk, Cluster, Cell, and PS3]

Resource Utilization – IBM Cell

[Figure: Resource Utilization (%) from 0 to 100, showing bandwidth utilization and compute utilization]

Single Runtime Configurations – GFlop/s

Benchmark  Scalar  SMP   Disk   Cluster  Cell  PS3
SAXPY      0.3     0.7   0.007  4.9      3.5   3.1
SGEMV      1.1     1.7   0.04   12       12    10
SGEMM      6.9     45    5.5    91       119   94
CONV2D     1.9     7.8   0.6    24       85    62
FFT3D      0.7     3.9   0.05   5.5      54    31
GRAVITY    4.8     40    3.7    68       97    71
HMMER      0.9     11    0.9    12       12    7.1

SGEMM Performance

Cluster
– Intel Cluster MKL: 101 GFlop/s
– Sequoia: 91 GFlop/s

SMP
– Intel MKL: 44 GFlop/s
– Sequoia: 45 GFlop/s
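
As a quick reading aid, the figures above can be turned into ratios of Sequoia throughput to the vendor library, using only the numbers on this slide. Sequoia reaches about 90% of Cluster MKL on the cluster and is about 2% ahead of MKL on the SMP:

```latex
\[
\frac{91\ \text{GFlop/s}}{101\ \text{GFlop/s}} \approx 0.90,
\qquad
\frac{45\ \text{GFlop/s}}{44\ \text{GFlop/s}} \approx 1.02
\]
```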

FFT3D Performance

Cell
– Mercury: 58 GFlop/s
– FFTW 3.2 alpha 2: 35 GFlop/s
– Sequoia: 54 GFlop/s

Cluster
– FFTW 3.2 alpha 2: 5.3 GFlop/s
– Sequoia: 5.5 GFlop/s

SMP
– FFTW 3.2 alpha 2: 4.2 GFlop/s
– Sequoia: 3.9 GFlop/s

Best Known Implementations

HMMer
– ATI X1900XT: 9.4 GFlop/s (Horn et al. 2005)
– Sequoia Cell: 12 GFlop/s
– Sequoia SMP: 11 GFlop/s

Gravity
– Grape-6A: 2 billion interactions/s (Fukushige et al. 2005)
– Sequoia Cell: 4 billion interactions/s
– Sequoia PS3: 3 billion interactions/s

Multi-Runtime System Configurations

Cluster of SMPs
– Four 2-way, 3.16GHz Intel Pentium 4 Xeons connected via GigE (80MB/s peak)

Disk + PS3
– Sony Playstation 3 bringing data from disk (~30MB/s)

Cluster of PS3s
– Two Sony Playstation 3’s connected via GigE (60MB/s peak)

Multi-Runtime Utilization

[Figure: Percentage of Runtime (0 to 100) for SAXPY, SGEMV, SGEMM, CONV2D, FFT3D, GRAVITY, and HMMER on Cluster of SMPs, Disk + PS3, and Cluster of PS3s]

Cluster of PS3 Issues

[Figure: Percentage of Runtime (0 to 100) for SAXPY, SGEMV, SGEMM, CONV2D, FFT3D, GRAVITY, and HMMER on Cluster of SMPs, Disk + PS3, and Cluster of PS3s]

Cluster of PS3 Issues

[Figure: Percentage of Runtime (0 to 100) for SAXPY and SGEMV on Cluster of PS3s versus a single PS3]

Multi-Runtime Configurations - GFlop/s

Benchmark  Cluster-SMP  Disk+PS3  PS3 Cluster
SAXPY      1.9          0.004     5.3
SGEMV      4.4          0.014     15
SGEMM      48           3.7       30
CONV2D     4.8          0.48      19
FFT3D      1.1          0.05      0.36
GRAVITY    50           66        119
HMMER      14           8.3       13

Conclusion

Summary

Uniform runtime interface for multi-level memory hierarchies
– Horizontal portability
  • SMP, cluster, disk, Cell, PS3
– Complex machines supported through composition
  • Cluster of SMPs, disk + PS3, cluster of PS3s
– Provides mechanism independence for communication and thread management

Efficient abstraction for multiple machines
– Maximize machine resource utilization
– Low overhead
– Competitive performance

Simple interface
– <20 entry points

Code portability
– Unmodified code running on 9 system configurations

Demonstrates viability of the Parallel Memory Hierarchies model

Future Work

Higher level functions in the runtime?

Load balancing?

Running on more complex machines?

Combine with transactional memory?

What machines can be mapped as a tree?

Questions?

Acknowledgements: Intel Graduate Fellowship Program, DOE ASC, IBM, LANL