
DESCRIPTION

Basic Resource Aggregation Interface Layer. ACM ROSS 2011 Workshop presentation.

TRANSCRIPT

Page 1: Brasil Ross 2011

© 2011 IBM Corporation, ROSS Workshop 05/31/2011

BRASIL: Basic Resource Aggregation Interface Layer

Eric Van Hensbergen ([email protected]), IBM Research Austin

Pravin Shinde (now at ETH Zurich), Noah Evans (now at Bell-Labs)

Page 2: Brasil Ross 2011


Motivation

[Figure: the data dependency graph for a portion of the Hartree-Fock procedure using a traditional formulation.]

64,000 Node Torus
Data Flow Oriented HPC Problems

2

Page 3: Brasil Ross 2011


PUSH Pipelines

3

For more detail: refer to PODC09 Short Paper on PUSH Dataflow Shell

UNIX Model: a | b | c

PUSH Model: a |< b >| c
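As a rough sketch of what the PUSH operators buy (the log file, the pattern, and the filter commands here are illustrative; record separation is handled by the shell, as with the ORS setting in the command-line example later in the deck):

# UNIX: a single counter processes the whole stream
grep ERROR big.log | wc -l

# PUSH: the stream is fanned out to parallel wc instances, whose partial
# counts are fanned back in and summed
grep ERROR big.log |< wc -l >| awk '{ n += $1 } END { print n }'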

Page 4: Brasil Ross 2011


PUSH Implementation

4

[Figure: PUSH implementation. Shell command stages are connected by pipes through multiplexor and demultiplexor components: records from one stage fan out over per-instance pipes to many parallel shell command instances and are then collected back over pipes into a single downstream shell command.]

Page 5: Brasil Ross 2011


Back to Motivation: 64,000 Nodes

5

[Figure: example node hierarchy: terminal t, login node L, I/O nodes I1 and I2, compute nodes c1, c2, c3, c4.]

Page 6: Brasil Ross 2011


Related Work/Background Work

•MPI
•“A Compositional Environment for MPI Programs”

•Hadoop & Other Map/Reduce Solutions

•cpu (from Plan 9) - http://plan9.bell-labs.com/magic/man2html/1/cpu

•xcpu (from LANL) - http://www.xcpu.org/

•xcpu2 (from LANL)

6

Page 7: Brasil Ross 2011


File System Interfaces

7

[Figure: session file system layout. The local resource directory offers clone and status files plus numbered session directories; each session and sub-session directory contains ctl, env, ns, args, wait, status, stdin, stdout, and stdio files.]

Control File Syntax
• reserve [n] [os] [arch]
• dir [wdir]
• exec [command] [args]
• kill
• killonclose
• nice [n]
• splice [path]

Environment Syntax
• [key] = [value]

Namespace Syntax
• mount [-abcC] [server] [mnt]
• bind [-abcC] [new] [old]
• import [-abc] [host] [path]
• cd [dir]
• unmount [new] [old]
• clear
• . [path]
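A rough sketch of how the three syntaxes combine to drive a single session; the mount point and session numbers follow the command-line example later in the deck, and the specific values (and the assumption that environment and namespace commands go to the session's env and ns files) are illustrative:

echo 'reserve 2' > ./mpoint/csrv/local/0/ctl              # control file: ask for two nodes
echo 'TMPDIR = /tmp' > ./mpoint/csrv/local/0/env          # environment: [key] = [value]
echo 'bind -b /n/data /data' > ./mpoint/csrv/local/0/ns   # namespace: set up before exec
echo 'exec date' > ./mpoint/csrv/local/0/0/ctl            # run in the first sub-session
cat ./mpoint/csrv/local/0/0/stdio                         # collect its output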

Page 8: Brasil Ross 2011


Distribution and Aggregation

8


Page 9: Brasil Ross 2011


Command Line Interface(s)

•Direct File Interaction

•Python Command Line

•PUSH

9

ORS=blm find . -type f |< xargs chasen | sort | uniq -c >| sort -rn

echo "res 2" > ./mpoint/csrv/local/0/ctl
echo "exec date" > ./mpoint/csrv/local/0/0/ctl
echo "exec wc" > ./mpoint/csrv/local/0/1/ctl
echo "xsplice 0 1" > ./mpoint/csrv/0/local/0/ctl
cat ./mpoint/csrv/0/local/0/stdio

runJob.py -n 4 /bin/date

Page 10: Brasil Ross 2011


Evaluation Setup

10

Page 11: Brasil Ross 2011

Simple Micro-benchmark: Job start times

[Figure: "Simple job deployment without input (date)" on 512 nodes: stacked bars of time in seconds (0 to 14) broken down by phase (Reservation, Execution, Aggregation, Termination, Housekeeping) versus the number of executions requested (1 to 2048).]

11

Page 12: Brasil Ross 2011

Job Completion Times with I/O

[Figure: "Job deployment with input (wc)" on 512 nodes: stacked bars of time in seconds (0 to 20) broken down by phase (Reservation, Input, Execution, Aggregation, Termination, Housekeeping) versus the number of executions requested (1 to 2048).]

12

Page 13: Brasil Ross 2011


Discussion

•Non-buffered communication channels are great for synchronization, but bad for performance: aggregation points become serialization points.
•Pushing multiplexing into PUSH was a mistake
•it adds additional copies of data and additional context switches for I/O to do record separation
•by pushing record separation into the infrastructure we can go from 6 copies and context switches down to 2 for each pipeline stage
•“beltway buffers” and other techniques may reduce this further or eliminate copies altogether
•Transitive mounts of namespace are an elegant way to bypass network segmentation, but they also incur a lot of overhead.
•Allowing splice inside a reservation has dubious usefulness. It seems like it might be more useful to allow individual elements to be spawned outside a reservation.
•Fan-in and fan-out are too limiting a use case. What about deterministic delivery? What about many-to-many pipeline components? What about collectives and barriers?
•Fault tolerance and fault-debugging properties were poor: there is no channel for out-of-band communication of error or logging information. When transitive mounts went down, the system wedged, with no integrated way of figuring out where it went down.
•File systems and communication are outside the scope of this work, but they are still a sore point for performance with the Plan 9 kernel on Blue Gene. Both issues are actively being worked on, with a release planned for later this summer.

13

Page 14: Brasil Ross 2011


Future Work: Cutting Out The Middle Man

14

[Figure: Kernel/App/Brasil stacks on the initiating terminal or node, the parent or gateway node(s), and the compute node, shown both with the intermediate Brasil daemons in place and with them cut out so the application talks through the kernels directly.]

Page 15: Brasil Ross 2011


Future Work: Splice Optimizations via Direct Communication

•Splicing through the name space is elegant, but transitive mounting of name space means that data has to flow through the hierarchy in order to be spliced from one element to another.
•A direct communication path would be much more desirable, particularly on BG/P, where the hierarchy is constructed on the tree network but the torus is the preferred node-to-node data transport.
•Solution: add a layer of indirection to splice communication
•add a file to the session directories which contains a list of “locations” for this node, essentially a recipe for connecting directly to the channel, listed in order of performance. If the source node cannot use any of these recipes, it defaults to the name space path.
•example:
•in the simplest embodiment we mount the namespace of the node in order to access its interfaces, but we also want to play with direct connections for data and/or potentially hide everything inside the csrv interface so that the mpipe splice code (and end users) can be more or less ignorant of how the resources are accessed

15

% cat ./mpoint/csrv/parent/c4/local/0/location
torus!1.3.4!564
tree!0.23!564
tcp!10.1.3.4!564
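A sketch of how a source node might walk such a list, assuming the recipes are ordered by preference; try_dial is a hypothetical helper standing in for a torus/tree/tcp connection attempt and is not part of Brasil:

loc=./mpoint/csrv/parent/c4/local/0/location
while read recipe; do
        try_dial "$recipe" && exit 0        # e.g. torus!1.3.4!564
done < "$loc"
# no direct recipe worked: fall back to splicing through the mounted name space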

Page 16: Brasil Ross 2011


Future Work:

•Turn fan-out/fan-in/many-to-many pipeline operations into system primitives with syntax similar to the existing pipe primitives but with the semantics of multipipe
•respect record boundaries
•allow broadcast
•allow enumerated specification of destination
•support splice operations
•Build out the rest of the brasil infrastructure based on this new primitive
•All stdio channels are multipipes
•allow record buffers for decoupled performance
•All ctl channels implemented on top of multipipes
•etc.
•The same model could potentially be used to implement collective operations and barriers within a distributed file system namespace

16

Multipipe packet header fields: TYPE, SIZE, DESTINATION, PARAMETERS

pwrite(pipefd, buf, sz, ~(0));

Page 17: Brasil Ross 2011


This project is supported in part by the U.S. Department of Energy under Award Number DE-FG02-08ER25851.

http://www.research.ibm.com/austin

http://goo.gl/5eFB

Brasil Code Available: http://www.bitbucket.org/ericvh/hare/usr/brasil

New Version In Progress: http://www.bitbucket.org/ericvh/hare/sys/src/cmd/uem

17

Page 18: Brasil Ross 2011


Core Concept: BRASIL (Basic Resource Aggregate System Inferno Layer)

•Stripped-down Inferno: no GUI or anything else we can live without, minimal footprint
•Runs as a daemon (no console); all interaction is via 9P mounts of its namespace
•Different modes (a mount sketch follows below)
•default (exports /srv/brasil or on tcp!127.0.0.1!5670)
•gateway (exports over standard I/O, to be used by ssh initialization)
•terminal (initiates an ssh connection and starts a gateway)
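For the default mode, attaching from a Linux workstation might look roughly like this; the v9fs mount options are standard Linux 9p, while the mount point and anything browsed beneath it are illustrative:

mkdir -p /mnt/brasil
mount -t 9p -o trans=tcp,port=5670 127.0.0.1 /mnt/brasil
ls /mnt/brasil        # browse the exported namespace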

•Runs EVERYWHERE
•User’s workstation
•Surveyor login nodes
•I/O nodes
•Compute nodes

18

Page 19: Brasil Ross 2011


Preferred Embodiment: BRASIL Desktop Extension Model

19

[Figure: workstation, login node, I/O node, and CPU nodes; the workstation and login node are joined by an ssh duct.]

•Setup
•User starts brasild on the workstation
•brasild ssh's to the login node and starts another brasil, hooking the two together with 27b-6 and mounting its resources in /csrv
•User mounts brasild on the workstation into the namespace using 9pfuse or v9fs (or can mount from a Plan 9 peer node, 9vx, p9p, or ACME-sac)
•Boot
•User runs the anl/run script on the workstation
•the script interacts with taskfs on the login node to start cobalt qsub
•when the I/O nodes boot they connect their csrv to the login node's csrv
•when the CPU nodes boot they connect to the csrv on their I/O node
•Task Execution
•User runs the anl/exec script on the workstation to run the app
•the script reserves x nodes for the app using taskfs
•taskfs on the workstation aggregates execution by using the taskfs instances running on the I/O nodes

Page 20: Brasil Ross 2011


Core Concept: Central Services

•Establish a hierarchical namespace of cluster services
•Automount remote servers based on reference (i.e. cd /csrv/criswell); see the sketch after the listings below
•Export local services for use elsewhere within the network

20

/csrv/local/L

/local/l1

/local/c1

/local/c2

/local/l2

/local/c3

/local/c4

/local

/csrv/local/l2

/local/c4

/local/L

/local/t

/local/l1

/local/c1

/local/c2

/local

[Figure: the example node hierarchy again (terminal t, login node L, I/O nodes I1 and I2, compute nodes c1 through c4); the listings above show the /csrv namespace as seen from different nodes, e.g. from compute node c3.]
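A minimal sketch of the automount behavior; criswell is the host name from the bullet above, and anything listed beneath it is illustrative:

cd /csrv/criswell     # the reference itself triggers the automount
ls                    # criswell's exported services now appear under this directory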

Page 21: Brasil Ross 2011


Our Approach: Workload Optimized Distribution

21


Page 26: Brasil Ross 2011


Our Approach: Workload Optimized Distribution

21

Desktop Extension


PUSH Pipeline Model

[Figure legend: local service, proxy service, aggregate service, remote services]

Aggregation via Dynamic Namespace and Distributed Service Model

Scaling, Reliability

Page 27: Brasil Ross 2011


“Old” Performance Graph (from USENIX)

22

[Embedded figure from the XCPU3 USENIX presentation: deployment and aggregation time evaluations.]

Page 28: Brasil Ross 2011


Problem: Limitations of Traditional Pipes

[Figure: two writers produce the records AAA and BBB; a traditional pipe may deliver ABABAB, ABA, or BAB, so the records interleave and their boundaries are lost.]

23

Page 29: Brasil Ross 2011


Long Packet Pipes

[Figure: the same two writers produce AAA and BBB; with long packet pipes each record arrives intact (AAA, BBB), preserving record boundaries.]

Page 30: Brasil Ross 2011


Enumerated Pipes

[Figure: records written with destination tags 1:A and 2:B are routed to enumerated readers: reader 1 receives A and reader 2 receives B.]

Page 31: Brasil Ross 2011


Collective Pipes

[Figure: Broadcast delivers a single A to every reader; Reduce(+) combines inputs B and C into (B+C); Allreduce(+) combines A, B, and C and delivers (A+B+C) to every participant.]

26

Page 32: Brasil Ross 2011


Splicing Pipes

[Figure: spliceto(b) applied to pipe a pushes a's data into b; splicefrom(b) applied to pipe a pulls b's data into a.]

27

Page 33: Brasil Ross 2011


Example Simple Invocation

•mpipefs
•mount /srv/mpipe /n/testpipe
•ls -l /n/testpipe
•echo hello > /n/testpipe/data &
•cat /n/testpipe/data

28

--rw-rw-rw- M 24 ericvh ericvh 0 Oct 10 18:10 /n/testpipe/data

hello

Page 34: Brasil Ross 2011


Passing Arguments via aname

•mount /srv/mpipe /n/test othername
•ls -l /n/test

•mount /srv/mpipe /n/test2 -b bcastpipe

•mount /srv/mpipe /n/test3 -e 5 enumpipe

•....you get the idea, read the man page for more details

29

--rw-rw-rw- M 26 ericvh ericvh 0 Oct 10 18:12 /n/test/otherpipe
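One possible use of the broadcast mount above, assuming (as the multipipe slides earlier in the transcript suggest) that a broadcast pipe delivers a copy of every record to each reader; the pipe file name bcastpipe and the reader count are illustrative:

cat /n/test2/bcastpipe &            # reader 1
cat /n/test2/bcastpipe &            # reader 2
echo hello > /n/test2/bcastpipe     # each reader should see its own copy of hello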

Page 35: Brasil Ross 2011


Example for writing control blocks

int
pipewrite(int fd, char *data, ulong size, ulong which)
{
	int n;
	char hdr[255];
	ulong tag = ~0;
	char pkttype = 'p';

	/* header byte is at offset ~0 */
	n = snprint(hdr, 31, "%c\n%lud\n%lud\n\n", pkttype, size, which);
	n = pwrite(fd, hdr, n+1, tag);
	if(n <= 0)
		return n;

	return write(fd, data, size);
}

30

Page 36: Brasil Ross 2011


Larger Example (execfs)

31

/proc/clone
/proc/###: stdin stdout stderr args ctl fd fpregs kregs mem note noteid notepg ns proc profile regs segment status text wait

mount -a /srv/mpipe /proc/### stdin
mount -a /srv/mpipe /proc/### stdout
mount -a /srv/mpipe /proc/### stderr
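A rough sketch of the effect, assuming that once the multipipes are mounted over the standard I/O names the session can be driven through ordinary file operations (### stands for the clone-assigned session number, as above):

echo some input for the job > /proc/###/stdin
cat /proc/###/stdout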

Page 37: Brasil Ross 2011


Really Large Example (gangfs)

32

/proc/gclone
/proc/status
/proc/g###: stdin stdout stderr ctl ns status wait

mount -a /srv/mpipe /proc/### -b stdin
mount -a /srv/mpipe /proc/### stdout
mount -a /srv/mpipe /proc/### stderr

...and then, post exec from the gangfs clone, the execfs stdins are splicedfrom g#/stdin

...and then the execfs stdouts and stderrs are splicedto g#/stdout and g#/stderr

...and you can do -e # with stdin to get enumerated instead of broadcast pipes